Introduction
Transcribing audio files can be a tedious and time-consuming task, especially when dealing with lengthy recordings or multiple files. Manual transcription is prone to errors, and the effort required can be overwhelming. Fortunately, automation using Python and OpenAI's Whisper model offers a solution that is both efficient and accurate.
Why even do all this when there's Zoom AI and transcription? While those are great options, where I work hasn't enabled those features and isn't planning to anytime soon. Additionally, this project gives me a customizable solution with complete control over the transcription process, and it integrates seamlessly into existing workflows.
In this guide, I'll walk you through building a Python application that automates the transcription of .m4a audio files. We'll cover everything from splitting audio into manageable chunks to using Whisper for transcription and creating a user-friendly macOS application with Automator.
Overview of the Solution
Why Automate Transcription?
Automation:
Saves time, especially for long audio files.
Increases transcription accuracy using AI.
Scales effortlessly for batch processing.
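To give a flavor of the batch case, here is a minimal sketch of how a work queue might be gathered before feeding files through the pipeline (batch_transcribe_queue is a hypothetical helper, not part of the scripts in this guide):

```python
import tempfile
from pathlib import Path

def batch_transcribe_queue(folder: str) -> list[str]:
    """Gather every .m4a file in a folder, sorted, ready for the pipeline."""
    # In the real app each path would go through splitting and Whisper;
    # here we only collect the work list.
    return sorted(str(p) for p in Path(folder).glob("*.m4a"))

# Demo: a scratch folder with two recordings and one unrelated file.
demo_dir = tempfile.mkdtemp()
(Path(demo_dir) / "meeting_01.m4a").touch()
(Path(demo_dir) / "meeting_02.m4a").touch()
(Path(demo_dir) / "notes.txt").touch()
work_queue = batch_transcribe_queue(demo_dir)
# work_queue holds the two .m4a paths, in name order
```

Each path in the queue would then be handed to the splitting and transcription steps described below.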
Tools and Technologies
Python: The backbone of the project for scripting and logic.
OpenAI Whisper: A state-of-the-art AI model for speech-to-text transcription.
pydub: For audio processing and splitting.
Tkinter: For creating a graphical user interface (GUI).
Automator: To make launching the app seamless.
Step-by-Step Implementation
Step 1: Environment Setup
Install Prerequisites
The setup_transcriber.py script takes care of creating all the necessary files, installing dependencies, and ensuring the setup is complete. Here's how to get started:
Clone or download the project files to your local system.
Run the setup_transcriber.py script:
python setup_transcriber.py
This script will:
Install the required Python packages (pydub, openai, audioop-lts).
Create the core files (split_m4a.py and app.py) dynamically.
Ensure the environment is ready for execution.
FFmpeg Installation
Ensure FFmpeg is installed for audio processing:
brew install ffmpeg
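You can verify the installation with a quick check (a sketch; it prints the version line if FFmpeg is on your PATH, or a hint if it is not):

```shell
# Print the FFmpeg version if it is on PATH, or a hint if it is not.
if command -v ffmpeg >/dev/null 2>&1; then
    ffmpeg_status=$(ffmpeg -version | head -n 1)
else
    ffmpeg_status="ffmpeg not found on PATH - check your Homebrew installation"
fi
echo "$ffmpeg_status"
```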
Step 2: Using setup_transcriber.py
The setup_transcriber.py script simplifies the setup process by dynamically generating all necessary project files, installing dependencies, and setting up the environment. Here's how it works:
setup_transcriber.py
This script automates the creation of required files and the installation of dependencies.
Key Features:
Installs all necessary packages, including pydub, openai, and audioop-lts.
Creates split_m4a.py and app.py dynamically.
Ensures your environment is correctly configured.
import os
import subprocess

# Define the content of the split_m4a.py script
split_m4a_content = """
import os
import math
import subprocess
import logging
from pydub import AudioSegment
import argparse

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

MAX_SIZE_MB = 25

def calculate_segment_duration_and_num_segments(duration_seconds, overlap_seconds, max_size, bitrate_kbps):
    \"""Calculate the duration and number of segments for an audio file.\"""
    seconds_for_max_size = (max_size * 8 * 1024) / bitrate_kbps
    num_segments = max(2, int(duration_seconds / seconds_for_max_size) + 1)
    total_overlap = (num_segments - 1) * overlap_seconds
    actual_playable_duration = (duration_seconds - total_overlap) / num_segments
    return num_segments, actual_playable_duration + overlap_seconds

def construct_file_names(output_directory, base_name, num_segments, extension):
    \"""Construct new file names for the segments of an audio file.\"""
    padding = max(1, int(math.ceil(math.log10(num_segments))))
    new_names = [os.path.join(output_directory, f"{base_name}_{str(i).zfill(padding)}.{extension}") for i in range(1, num_segments + 1)]
    return new_names

def extract_audio_from_m4a(path_to_m4a, output_audio_path):
    \"""Extract audio from an M4A file using ffmpeg.\"""
    command = ['ffmpeg', '-i', path_to_m4a, '-q:a', '0', '-map', 'a', output_audio_path]
    subprocess.run(command, check=True)

def split_audio(path_to_audio, output_directory, overlap_seconds, max_size=MAX_SIZE_MB):
    \"""Split an audio file into segments.\"""
    audio = AudioSegment.from_file(path_to_audio)
    duration_seconds = len(audio) / 1000.0  # convert to seconds
    bitrate_kbps = audio.frame_rate * audio.frame_width * 8 // 1000  # approximate bitrate in kbps
    file_size_MB = os.path.getsize(path_to_audio) / (1024 * 1024)
    if file_size_MB < max_size:
        logging.info("File is less than maximum size, no action taken.")
        return [path_to_audio]
    base_name = os.path.splitext(os.path.basename(path_to_audio))[0]
    num_segments, segment_duration = calculate_segment_duration_and_num_segments(duration_seconds, overlap_seconds, max_size, bitrate_kbps)
    new_file_names = construct_file_names(output_directory, base_name, num_segments, 'mp3')
    start = 0
    for i in range(num_segments):
        if i == num_segments - 1:
            segment = audio[int(start):]
        else:
            end = start + segment_duration * 1000
            segment = audio[int(start):int(end)]
        segment.export(new_file_names[i], format="mp3")
        logging.info(f"Segment {i + 1}: {new_file_names[i]} (Duration: {len(segment) / 1000} seconds)")
        start += (segment_duration - overlap_seconds) * 1000
    logging.info(f"Split into {num_segments} sub-files.")
    return new_file_names

def split_m4a(path_to_m4a, output_directory, overlap_seconds, max_size=MAX_SIZE_MB):
    \"""Extract audio from an M4A file and split it into segments.\"""
    if not os.path.exists(path_to_m4a):
        raise ValueError(f"File {path_to_m4a} does not exist.")
    base_name = os.path.splitext(os.path.basename(path_to_m4a))[0]
    os.makedirs(output_directory, exist_ok=True)
    # Extract audio from M4A
    path_to_audio = os.path.join(output_directory, base_name + ".mp3")
    extract_audio_from_m4a(path_to_m4a, path_to_audio)
    # Split extracted audio
    return split_audio(path_to_audio, output_directory, overlap_seconds, max_size)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Split M4A audio into segments.")
    parser.add_argument("path_to_m4a", type=str, help="Path to the M4A file.")
    parser.add_argument("output_directory", type=str, help="Output directory for the segments and transcript.")
    parser.add_argument("--overlap_seconds", type=int, default=10, help="Overlap duration in seconds.")
    parser.add_argument("--max_size", type=int, default=25, help="Maximum segment size in MB.")
    args = parser.parse_args()
    new_files = split_m4a(args.path_to_m4a, args.output_directory, args.overlap_seconds, args.max_size)
    print("Generated segments:")
    for file in new_files:
        print(file)
"""
# Define the content of the app.py script
app_content = """
import tkinter as tk
from tkinter import filedialog, messagebox, ttk
import os
import threading
from split_m4a import split_m4a
import subprocess
import openai

# Initialize OpenAI client
openai.api_key = os.getenv("OPENAI_API_KEY")

def transcribe_file(file_path):
    # Let curl generate its own multipart Content-Type header (with boundary);
    # response_format=text returns the plain transcript instead of JSON.
    command = [
        'curl',
        '--request', 'POST',
        '--url', 'https://api.openai.com/v1/audio/transcriptions',
        '--header', f'Authorization: Bearer {openai.api_key}',
        '--form', f'file=@{file_path}',
        '--form', 'model=whisper-1',
        '--form', 'language=en',
        '--form', 'response_format=text'
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout

def update_status(progress_bar, status_label, current, total, message):
    progress_bar['value'] = (current / total) * 100
    status_label.config(text=f"{message} ({current}/{total})")

def process_file(progress_bar, status_label):
    file_path = filedialog.askopenfilename(filetypes=[("M4A files", "*.m4a")])
    if not file_path:
        return

    def process():
        try:
            base_name = os.path.splitext(os.path.basename(file_path))[0]
            output_directory = os.path.join(os.path.dirname(file_path), base_name)
            os.makedirs(output_directory, exist_ok=True)
            update_status(progress_bar, status_label, 0, 1, "Splitting audio")
            segments = split_m4a(file_path, output_directory, overlap_seconds=10)
            total_segments = len(segments)
            transcript = ""
            for i, segment in enumerate(segments, start=1):
                update_status(progress_bar, status_label, i, total_segments, "Transcribing segment")
                transcript += transcribe_file(segment)
            transcript_path = os.path.join(output_directory, base_name + ".txt")
            with open(transcript_path, "w") as f:
                f.write(transcript)
            update_status(progress_bar, status_label, total_segments, total_segments, "Completed")
            messagebox.showinfo("Success", f"Transcription saved to {transcript_path}")
        except Exception as e:
            messagebox.showerror("Error", str(e))

    threading.Thread(target=process).start()

def create_gui():
    root = tk.Tk()
    root.title(".m4a Transcriber")
    frame = tk.Frame(root, padx=20, pady=20)
    frame.pack(padx=10, pady=10)
    label = tk.Label(frame, text="Select an .m4a file to transcribe:")
    label.pack(pady=5)
    button = tk.Button(frame, text="Select File", command=lambda: process_file(progress_bar, status_label))
    button.pack(pady=5)
    progress_bar = ttk.Progressbar(frame, orient="horizontal", length=300, mode="determinate")
    progress_bar.pack(pady=10)
    status_label = tk.Label(frame, text="Status: Waiting for file selection")
    status_label.pack(pady=5)
    root.mainloop()

if __name__ == "__main__":
    create_gui()
"""
# Create the Python scripts
with open('split_m4a.py', 'w') as f:
    f.write(split_m4a_content)
with open('app.py', 'w') as f:
    f.write(app_content)

# Install required packages (tkinter and argparse ship with Python itself)
subprocess.run(['pip', 'install', 'pydub', 'openai', 'audioop-lts', 'py2app'])

# Optionally bundle a macOS app with py2app (requires a setup.py;
# the Automator approach in Step 3 is a simpler way to launch the app)
subprocess.run(['python', 'setup.py', 'py2app'])
Deep Dive: split_m4a.py
The split_m4a.py file is the backbone of the audio processing pipeline. It handles the extraction, splitting, and preparation of audio segments from a given .m4a file to make transcription more manageable. Below is a walkthrough of its main components:
Key Features
Audio Extraction:
- Converts .m4a files to .mp3 format using FFmpeg to ensure compatibility with the splitting process.
Audio Splitting:
- Divides audio files into smaller segments based on a specified maximum size or duration while ensuring an overlap between segments to prevent information loss.
Dynamic File Naming:
- Automatically generates unique file names for each segment to avoid conflicts and ensure clear organization.
Core Functions
1. calculate_segment_duration_and_num_segments
This function calculates the number of segments and their respective durations based on the input audio file's size, bitrate, and a defined overlap.
def calculate_segment_duration_and_num_segments(duration_seconds, overlap_seconds, max_size, bitrate_kbps):
    """Calculate the duration and number of segments for an audio file."""
    seconds_for_max_size = (max_size * 8 * 1024) / bitrate_kbps  # Max duration per segment
    num_segments = max(2, int(duration_seconds / seconds_for_max_size) + 1)  # Ensure at least two segments
    total_overlap = (num_segments - 1) * overlap_seconds
    actual_playable_duration = (duration_seconds - total_overlap) / num_segments
    return num_segments, actual_playable_duration + overlap_seconds
Inputs:
- duration_seconds: Total duration of the audio file (in seconds).
- overlap_seconds: Overlap duration between segments to prevent loss of data during splitting.
- max_size: Maximum allowed size for each segment (in MB).
- bitrate_kbps: Bitrate of the audio file (in kilobits per second).
Outputs:
- Number of segments and the duration of each segment.
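To make the arithmetic concrete, here is the same logic with a worked example, assuming a one-hour file at roughly 128 kbps with the default 10-second overlap and 25 MB cap (illustrative numbers, not measured from a real file):

```python
def calculate_segment_duration_and_num_segments(duration_seconds, overlap_seconds, max_size, bitrate_kbps):
    """Same logic as in split_m4a.py, repeated here for a standalone example."""
    seconds_for_max_size = (max_size * 8 * 1024) / bitrate_kbps
    num_segments = max(2, int(duration_seconds / seconds_for_max_size) + 1)
    total_overlap = (num_segments - 1) * overlap_seconds
    actual_playable_duration = (duration_seconds - total_overlap) / num_segments
    return num_segments, actual_playable_duration + overlap_seconds

# One hour of audio at ~128 kbps: 25 MB holds 25 * 8 * 1024 / 128 = 1600 s,
# so a 3600 s file is cut into 3 segments of about 20 minutes each.
n, seg = calculate_segment_duration_and_num_segments(3600, 10, 25, 128)
print(n, round(seg, 1))  # 3 segments, ~1203.3 s each
```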
2. extract_audio_from_m4a
This function uses FFmpeg to extract the audio from the .m4a file and convert it into .mp3 format.
def extract_audio_from_m4a(path_to_m4a, output_audio_path):
    """Extract audio from an M4A file using ffmpeg."""
    command = ['ffmpeg', '-i', path_to_m4a, '-q:a', '0', '-map', 'a', output_audio_path]
    subprocess.run(command, check=True)
Inputs:
- path_to_m4a: Path to the input .m4a file.
- output_audio_path: Path to save the converted .mp3 file.
Outputs:
- Extracted and converted .mp3 file ready for processing.
3. split_audio
This function splits the converted .mp3 file into smaller segments based on the calculated duration and overlap.
def split_audio(path_to_audio, output_directory, overlap_seconds, max_size=MAX_SIZE_MB):
    """Split an audio file into segments."""
    audio = AudioSegment.from_file(path_to_audio)
    duration_seconds = len(audio) / 1000.0  # Convert duration to seconds
    bitrate_kbps = audio.frame_rate * audio.frame_width * 8 // 1000
    file_size_MB = os.path.getsize(path_to_audio) / (1024 * 1024)
    if file_size_MB < max_size:
        logging.info("File size is below the maximum limit. No splitting needed.")
        return [path_to_audio]
    base_name = os.path.splitext(os.path.basename(path_to_audio))[0]
    num_segments, segment_duration = calculate_segment_duration_and_num_segments(
        duration_seconds, overlap_seconds, max_size, bitrate_kbps
    )
    new_file_names = [
        os.path.join(output_directory, f"{base_name}_part{i + 1}.mp3") for i in range(num_segments)
    ]
    start = 0
    for i in range(num_segments):
        end = start + segment_duration * 1000
        segment = audio[int(start):int(end)] if i < num_segments - 1 else audio[int(start):]
        segment.export(new_file_names[i], format="mp3")
        logging.info(f"Segment {i + 1}: {new_file_names[i]}")
        start += (segment_duration - overlap_seconds) * 1000
    return new_file_names
Inputs:
- path_to_audio: Path to the input audio file.
- output_directory: Directory where the segments will be saved.
- overlap_seconds: Duration of overlap between consecutive segments.
- max_size: Maximum allowed size for each segment (default: 25 MB).
Outputs:
- List of file paths to the generated audio segments.
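To see how the overlap plays out, note that the loop advances the start position by segment_duration - overlap_seconds each pass, so consecutive segments share overlap_seconds of audio. A small standalone sketch (segment_starts is a hypothetical helper, and the numbers are illustrative):

```python
def segment_starts(num_segments, segment_duration, overlap_seconds):
    """Start time (in seconds) of each segment, mirroring the loop in split_audio."""
    step = segment_duration - overlap_seconds
    return [i * step for i in range(num_segments)]

# Three ~1203 s segments with a 10 s overlap: each segment begins 10 s
# before the previous one ends, so words at a cut point are never lost.
starts = segment_starts(3, 1203.3, 10)
```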
4. split_m4a
This is the main function that ties everything together. It extracts the audio from the .m4a file, splits it into segments, and returns the paths to the generated files.
def split_m4a(path_to_m4a, output_directory, overlap_seconds, max_size=MAX_SIZE_MB):
    """Extract audio from an M4A file and split it into segments."""
    if not os.path.exists(path_to_m4a):
        raise ValueError(f"File {path_to_m4a} does not exist.")
    base_name = os.path.splitext(os.path.basename(path_to_m4a))[0]
    os.makedirs(output_directory, exist_ok=True)
    # Extract audio from M4A
    path_to_audio = os.path.join(output_directory, base_name + ".mp3")
    extract_audio_from_m4a(path_to_m4a, path_to_audio)
    # Split extracted audio
    return split_audio(path_to_audio, output_directory, overlap_seconds, max_size)
Usage Example
To use this script directly, run it from the command line:
python split_m4a.py /path/to/input.m4a /path/to/output --overlap_seconds 10 --max_size 25
This will:
Extract audio from /path/to/input.m4a.
Split it into overlapping segments.
Save the segments in /path/to/output.
Logging
Throughout the process, the script uses Python's logging module to provide real-time feedback on:
The number of segments created.
File sizes and durations.
Exported file paths.
This ensures transparency and easy debugging in case of errors.
By understanding the functionality of split_m4a.py, you can customize it to suit specific requirements, such as changing the output format, adjusting segment overlap, or modifying file naming conventions.
Deep Dive: app.py
The GUI for the application allows users to select an .m4a file, split it, and transcribe it using Whisper.
import tkinter as tk
from tkinter import filedialog, messagebox, ttk
import os
import threading
from split_m4a import split_m4a
import subprocess
import openai

# Initialize OpenAI client
openai.api_key = os.getenv("OPENAI_API_KEY")

def transcribe_file(file_path):
    # Let curl generate its own multipart Content-Type header (with boundary);
    # response_format=text returns the plain transcript instead of JSON.
    command = [
        'curl',
        '--request', 'POST',
        '--url', 'https://api.openai.com/v1/audio/transcriptions',
        '--header', f'Authorization: Bearer {openai.api_key}',
        '--form', f'file=@{file_path}',
        '--form', 'model=whisper-1',
        '--form', 'language=en',
        '--form', 'response_format=text'
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout

def update_status(progress_bar, status_label, current, total, message):
    progress_bar['value'] = (current / total) * 100
    status_label.config(text=f"{message} ({current}/{total})")

def process_file(progress_bar, status_label):
    file_path = filedialog.askopenfilename(filetypes=[("M4A files", "*.m4a")])
    if not file_path:
        return

    def process():
        try:
            base_name = os.path.splitext(os.path.basename(file_path))[0]
            output_directory = os.path.join(os.path.dirname(file_path), base_name)
            os.makedirs(output_directory, exist_ok=True)
            update_status(progress_bar, status_label, 0, 1, "Splitting audio")
            segments = split_m4a(file_path, output_directory, overlap_seconds=10)
            total_segments = len(segments)
            transcript = ""
            for i, segment in enumerate(segments, start=1):
                update_status(progress_bar, status_label, i, total_segments, "Transcribing segment")
                transcript += transcribe_file(segment)
            transcript_path = os.path.join(output_directory, base_name + ".txt")
            with open(transcript_path, "w") as f:
                f.write(transcript)
            update_status(progress_bar, status_label, total_segments, total_segments, "Completed")
            messagebox.showinfo("Success", f"Transcription saved to {transcript_path}")
        except Exception as e:
            messagebox.showerror("Error", str(e))

    threading.Thread(target=process).start()

def create_gui():
    root = tk.Tk()
    root.title(".m4a Transcriber")
    frame = tk.Frame(root, padx=20, pady=20)
    frame.pack(padx=10, pady=10)
    label = tk.Label(frame, text="Select an .m4a file to transcribe:")
    label.pack(pady=5)
    button = tk.Button(frame, text="Select File", command=lambda: process_file(progress_bar, status_label))
    button.pack(pady=5)
    progress_bar = ttk.Progressbar(frame, orient="horizontal", length=300, mode="determinate")
    progress_bar.pack(pady=10)
    status_label = tk.Label(frame, text="Status: Waiting for file selection")
    status_label.pack(pady=5)
    root.mainloop()

if __name__ == "__main__":
    create_gui()
The GUI elements are built using the Tkinter library, which provides a simple and user-friendly interface for the application. Here is how the GUI is structured and how transcription is triggered:
Main Window:
- The main application window (root) is created using tk.Tk(). It has a title and contains all the GUI elements.
Label:
- A tk.Label displays a prompt asking the user to select an .m4a file for transcription. This label is placed at the top of the application window.
Progress Bar:
- A ttk.Progressbar indicates the progress of the transcription process. It updates dynamically as the file is split and transcribed.
File Selection Button:
- A tk.Button labeled "Select File" allows the user to browse and select an .m4a file from their system. When clicked, it triggers the process_file function.
Process File Function:
The process_file function handles the entire workflow. Here is what happens when a file is selected:
- File Dialog: Opens a file dialog (filedialog.askopenfilename) for the user to select a file.
- Splitting Audio: The selected file is split into smaller segments using the split_m4a function from split_m4a.py.
- Transcription: Each segment is transcribed using the transcribe_file function, which sends the segment to the OpenAI Whisper API.
- Progress Updates: The progress bar updates after each segment is processed.
- Save Transcript: The final transcript is saved as a text file in the same directory as the audio segments.
- Completion Message: A success message is displayed using messagebox.showinfo once transcription is complete.
Multithreading:
- To keep the GUI responsive during long-running operations like splitting and transcribing, the process_file function runs its work in a separate thread using the threading module.
These elements work together to provide a seamless and interactive experience, allowing users to transcribe .m4a files without needing to interact with the command line.
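One caveat: Tkinter widgets are not strictly thread-safe, so updating the progress bar directly from the worker thread (as update_status does) can misbehave on some platforms. A common remedy is to have the worker push progress events onto a queue.Queue and let the main thread drain it, typically via root.after. A widget-free sketch of the pattern, with hypothetical names:

```python
import queue
import threading

progress_events = queue.Queue()

def worker(total):
    """Simulate the transcription loop, reporting progress via the queue."""
    for i in range(1, total + 1):
        progress_events.put((i, total, "Transcribing segment"))

def drain(collected):
    """Main-thread side: pull every pending event (would run via root.after)."""
    while True:
        try:
            collected.append(progress_events.get_nowait())
        except queue.Empty:
            return collected

t = threading.Thread(target=worker, args=(3,))
t.start()
t.join()
events = drain([])
# events now holds (current, total, message) tuples in order
```

In the real app, drain would update the progress bar and label for each event, then reschedule itself with root.after(100, ...) so the GUI thread stays in charge of all widget updates.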
Step 3: Running the Application with Automator
To run the app seamlessly on macOS, we will use Automator to execute the app.py script directly within your virtual environment.
Automator Integration
Open Automator and create a new application.
Add the "Run Shell Script" action.
Use the following run_app.sh script:
#!/bin/bash

# Redirect all output (stdout and stderr) to a log file
exec > /Users/<your_user_path>/m4a_transcriber/script_log.txt 2>&1

# Full path to your virtual environment
VENV_PATH="/Users/<your_user_path>/m4a_transcriber/m4atranscriberenv"

# Check if the virtual environment exists
if [ -d "$VENV_PATH" ]; then
    source "$VENV_PATH/bin/activate"
else
    echo "Virtual environment not found at $VENV_PATH"
    exit 1
fi

# Add Homebrew's binary directory to PATH
export PATH="/opt/homebrew/bin:$PATH"

# Set the OpenAI API key
export OPENAI_API_KEY='<your_api_key>'

# Explicitly use the Python executable from the virtual environment
"$VENV_PATH/bin/python" /Users/<your_user_path>/m4a_transcriber/app.py
Save the Automator workflow as an application. You can now double-click it to launch the transcription tool.
Visual Walkthrough
GUI
Automator Setup
App Usage
Conclusion
By following this guide, you can automate the transcription of .m4a files, saving time and reducing the effort required for manual transcription. This solution is ideal for anyone who regularly deals with audio-to-text workflows, providing a customizable and efficient alternative to transcription features you may not have access to. With the flexibility to adapt the process for various file formats and use cases, this project demonstrates how powerful Python and OpenAI's Whisper model can be when combined. Start building your transcription tool today and take control of your audio processing needs.