Automate M4A File Transcriptions with Python and OpenAI Whisper


Introduction

Transcribing audio files can be a tedious and time-consuming task, especially when dealing with lengthy recordings or multiple files. Manual transcription is prone to errors, and the effort required can be overwhelming. Fortunately, automation using Python and OpenAI's Whisper model offers a solution that is both efficient and accurate.

Why build this at all when Zoom offers AI-powered summaries and transcription? Those are great options, but where I work hasn't enabled them and isn't planning to anytime soon. Beyond that, this project gives me a customizable solution with complete control over the transcription process, and it integrates cleanly into existing workflows.

In this guide, I'll walk you through how to build a Python application that automates the transcription of .m4a audio files. We'll cover everything from splitting audio into manageable chunks to using Whisper for transcription and creating a user-friendly macOS application with Automator.


Overview of the Solution

Why Automate Transcription?

Automation:

  • Saves time, especially for long audio files.

  • Increases transcription accuracy using AI.

  • Scales effortlessly for batch processing.

Tools and Technologies

  1. Python: The backbone of the project for scripting and logic.

  2. OpenAI Whisper: A state-of-the-art AI model for speech-to-text transcription.

  3. pydub: For audio processing and splitting.

  4. Tkinter: For creating a graphical user interface (GUI).

  5. Automator: To make launching the app seamless.


Step-by-Step Implementation

Step 1: Environment Setup

Install Prerequisites

To set up your environment effectively, the setup_transcriber.py script will take care of creating all the necessary files, installing dependencies, and ensuring the setup is complete. Here's how to get started:

  1. Clone or download the project files to your local system.

  2. Run the setup_transcriber.py script:

python setup_transcriber.py

This script will:

  • Install required Python packages (pydub, openai, audioop-lts).

  • Create the core files (split_m4a.py and app.py) dynamically.

  • Ensure the environment is ready for execution.

FFmpeg Installation

Ensure FFmpeg is installed for audio processing:

brew install ffmpeg

Step 2: Using setup_transcriber.py

The setup_transcriber.py script simplifies the setup process by dynamically generating all necessary project files, installing dependencies, and setting up the environment. Here’s how it works:

setup_transcriber.py

This script automates the creation of required files and the installation of dependencies.

Key Features:

  • Installs all necessary packages, including pydub, openai, audioop-lts, and others.

  • Creates split_m4a.py and app.py dynamically.

  • Ensures your environment is correctly configured.

import os
import subprocess

# Define the content of the split_m4a.py script
split_m4a_content = """
import os
import math
import subprocess
import logging
from pydub import AudioSegment
import argparse

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

MAX_SIZE_MB = 25

def calculate_segment_duration_and_num_segments(duration_seconds, overlap_seconds, max_size, bitrate_kbps):
    \"""Calculate the duration and number of segments for an audio file.\"""
    seconds_for_max_size = (max_size * 8 * 1024) / bitrate_kbps
    num_segments = max(2, int(duration_seconds / seconds_for_max_size) + 1)
    total_overlap = (num_segments - 1) * overlap_seconds
    actual_playable_duration = (duration_seconds - total_overlap) / num_segments
    return num_segments, actual_playable_duration + overlap_seconds

def construct_file_names(output_directory, base_name, num_segments, extension):
    \"""Construct new file names for the segments of an audio file.\"""
    padding = max(1, int(math.ceil(math.log10(num_segments))))
    new_names = [os.path.join(output_directory, f"{base_name}_{str(i).zfill(padding)}.{extension}") for i in range(1, num_segments + 1)]
    return new_names

def extract_audio_from_m4a(path_to_m4a, output_audio_path):
    \"""Extract audio from an M4A file using ffmpeg.\"""
    command = ['ffmpeg', '-y', '-i', path_to_m4a, '-q:a', '0', '-map', 'a', output_audio_path]  # -y: overwrite without prompting
    subprocess.run(command, check=True)

def split_audio(path_to_audio, output_directory, overlap_seconds, max_size=MAX_SIZE_MB):
    \"""Split an audio file into segments.\"""
    audio = AudioSegment.from_file(path_to_audio)
    duration_seconds = len(audio) / 1000.0  # convert to seconds
    bitrate_kbps = audio.frame_rate * audio.frame_width * 8 // 1000  # approximate bitrate in kbps
    file_size_MB = os.path.getsize(path_to_audio) / (1024 * 1024)

    if file_size_MB < max_size:
        logging.info("File is less than maximum size, no action taken.")
        return [path_to_audio]

    base_name = os.path.splitext(os.path.basename(path_to_audio))[0]
    num_segments, segment_duration = calculate_segment_duration_and_num_segments(duration_seconds, overlap_seconds, max_size, bitrate_kbps)
    new_file_names = construct_file_names(output_directory, base_name, num_segments, 'mp3')

    start = 0
    for i in range(num_segments):
        if i == num_segments - 1:
            segment = audio[int(start):]
        else:
            end = int(start + segment_duration * 1000)
            segment = audio[int(start):end]

        segment.export(new_file_names[i], format="mp3")
        logging.info(f"Segment {i + 1}: {new_file_names[i]} (Duration: {len(segment) / 1000} seconds)")

        start += (segment_duration - overlap_seconds) * 1000

    logging.info(f"Split into {num_segments} sub-files.")
    return new_file_names

def split_m4a(path_to_m4a, output_directory, overlap_seconds, max_size=MAX_SIZE_MB):
    \"""Extract audio from an M4A file and split it into segments.\"""
    if not os.path.exists(path_to_m4a):
        raise ValueError(f"File {path_to_m4a} does not exist.")

    base_name = os.path.splitext(os.path.basename(path_to_m4a))[0]
    os.makedirs(output_directory, exist_ok=True)

    # Extract audio from M4A
    path_to_audio = os.path.join(output_directory, base_name + ".mp3")
    extract_audio_from_m4a(path_to_m4a, path_to_audio)

    # Split extracted audio
    return split_audio(path_to_audio, output_directory, overlap_seconds, max_size)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Split M4A audio into segments.")
    parser.add_argument("path_to_m4a", type=str, help="Path to the M4A file.")
    parser.add_argument("output_directory", type=str, help="Output directory for the segments and transcript.")
    parser.add_argument("--overlap_seconds", type=int, default=10, help="Overlap duration in seconds.")
    parser.add_argument("--max_size", type=int, default=25, help="Maximum segment size in MB.")

    args = parser.parse_args()
    new_files = split_m4a(args.path_to_m4a, args.output_directory, args.overlap_seconds, args.max_size)

    print("Generated segments:")
    for file in new_files:
        print(file)
"""

# Define the content of the app.py script
app_content = """
import tkinter as tk
from tkinter import filedialog, messagebox, ttk
import os
import threading
from split_m4a import split_m4a
import subprocess
import openai

# Initialize OpenAI client
openai.api_key = os.getenv("OPENAI_API_KEY")

def transcribe_file(file_path):
    command = [
        'curl',
        '--request', 'POST',
        '--url', 'https://api.openai.com/v1/audio/transcriptions',
        '--header', f'Authorization: Bearer {openai.api_key}',
        '--header', 'Content-Type: multipart/form-data',
        '--form', f'file=@{file_path}',
        '--form', 'model=whisper-1',
        '--form', 'language=en',
        # Return plain text instead of JSON so segment transcripts concatenate cleanly
        '--form', 'response_format=text'
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout

def update_status(progress_bar, status_label, current, total, message):
    progress_bar['value'] = (current / total) * 100
    status_label.config(text=f"{message} ({current}/{total})")

def process_file(progress_bar, status_label):
    file_path = filedialog.askopenfilename(filetypes=[("M4A files", "*.m4a")])
    if not file_path:
        return

    def process():
        try:
            base_name = os.path.splitext(os.path.basename(file_path))[0]
            output_directory = os.path.join(os.path.dirname(file_path), base_name)
            os.makedirs(output_directory, exist_ok=True)

            update_status(progress_bar, status_label, 0, 1, "Splitting audio")
            segments = split_m4a(file_path, output_directory, overlap_seconds=10)

            total_segments = len(segments)
            transcript = ""
            for i, segment in enumerate(segments, start=1):
                update_status(progress_bar, status_label, i, total_segments, "Transcribing segment")
                transcript += transcribe_file(segment)

            transcript_path = os.path.join(output_directory, base_name + ".txt")
            with open(transcript_path, "w") as f:
                f.write(transcript)

            update_status(progress_bar, status_label, total_segments, total_segments, "Completed")
            messagebox.showinfo("Success", f"Transcription saved to {transcript_path}")
        except Exception as e:
            messagebox.showerror("Error", str(e))

    threading.Thread(target=process).start()

def create_gui():
    root = tk.Tk()
    root.title(".m4a Transcriber")

    frame = tk.Frame(root, padx=20, pady=20)
    frame.pack(padx=10, pady=10)

    label = tk.Label(frame, text="Select an .m4a file to transcribe:")
    label.pack(pady=5)

    button = tk.Button(frame, text="Select File", command=lambda: process_file(progress_bar, status_label))
    button.pack(pady=5)

    progress_bar = ttk.Progressbar(frame, orient="horizontal", length=300, mode="determinate")
    progress_bar.pack(pady=10)

    status_label = tk.Label(frame, text="Status: Waiting for file selection")
    status_label.pack(pady=5)

    root.mainloop()

if __name__ == "__main__":
    create_gui()
"""

# Create the Python scripts
with open('split_m4a.py', 'w') as f:
    f.write(split_m4a_content)

with open('app.py', 'w') as f:
    f.write(app_content)

# Install required packages (tkinter and argparse ship with Python itself,
# so only the third-party dependencies need to be installed)
subprocess.run(['pip', 'install', 'pydub', 'openai', 'audioop-lts'])

Deep Dive: split_m4a.py

The split_m4a.py file is the backbone of the audio processing pipeline. It handles the extraction, splitting, and preparation of audio segments from a given .m4a file to make transcription more manageable. Below is a walkthrough of its main components:


Key Features

  1. Audio Extraction:

    • Converts .m4a files to .mp3 format using FFmpeg to ensure compatibility with the splitting process.
  2. Audio Splitting:

    • Divides audio files into smaller segments based on a specified maximum size or duration while ensuring an overlap between segments to prevent information loss.
  3. Dynamic File Naming:

    • Automatically generates unique file names for each segment to avoid conflicts and ensure clear organization.

Core Functions

1. calculate_segment_duration_and_num_segments

This function calculates the number of segments and their respective durations based on the input audio file's size, bitrate, and a defined overlap.

def calculate_segment_duration_and_num_segments(duration_seconds, overlap_seconds, max_size, bitrate_kbps):
    """Calculate the duration and number of segments for an audio file."""
    seconds_for_max_size = (max_size * 8 * 1024) / bitrate_kbps  # Calculate max duration per segment
    num_segments = max(2, int(duration_seconds / seconds_for_max_size) + 1)  # Ensure at least two segments
    total_overlap = (num_segments - 1) * overlap_seconds
    actual_playable_duration = (duration_seconds - total_overlap) / num_segments
    return num_segments, actual_playable_duration + overlap_seconds
  • Inputs:

    • duration_seconds: Total duration of the audio file.

    • overlap_seconds: Overlap duration between segments to prevent loss of data during splitting.

    • max_size: Maximum allowed size for each segment (in MB).

    • bitrate_kbps: Bitrate of the audio file (in kilobits per second).

  • Outputs:

    • Number of segments and their durations.
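To make the arithmetic concrete, here is a standalone sketch with illustrative numbers (a 60-minute file at roughly 128 kbps; these values are made up for the example, not taken from a real recording):

```python
def calculate_segment_duration_and_num_segments(duration_seconds, overlap_seconds, max_size, bitrate_kbps):
    """Same arithmetic as in split_m4a.py, reproduced so it runs standalone."""
    seconds_for_max_size = (max_size * 8 * 1024) / bitrate_kbps
    num_segments = max(2, int(duration_seconds / seconds_for_max_size) + 1)
    total_overlap = (num_segments - 1) * overlap_seconds
    actual_playable_duration = (duration_seconds - total_overlap) / num_segments
    return num_segments, actual_playable_duration + overlap_seconds

# Illustrative: 3600 s of audio, 10 s overlap, 25 MB cap, ~128 kbps
n, dur = calculate_segment_duration_and_num_segments(3600, 10, 25, 128)
print(n, round(dur, 2))
```

At 128 kbps, 25 MB holds 204800 / 128 = 1600 seconds, so the hour-long file needs three segments; each plays (3600 - 20) / 3 ≈ 1193.33 seconds plus the 10-second overlap, about 1203.33 seconds per segment.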

2. extract_audio_from_m4a

This function uses FFmpeg to extract the audio from the .m4a file and convert it into .mp3 format.

def extract_audio_from_m4a(path_to_m4a, output_audio_path):
    """Extract audio from an M4A file using ffmpeg."""
    command = ['ffmpeg', '-y', '-i', path_to_m4a, '-q:a', '0', '-map', 'a', output_audio_path]  # -y: overwrite without prompting
    subprocess.run(command, check=True)
  • Inputs:

    • path_to_m4a: Path to the input .m4a file.

    • output_audio_path: Path to save the converted .mp3 file.

  • Outputs:

    • Extracted and converted .mp3 file ready for processing.

3. split_audio

This function splits the converted .mp3 file into smaller segments based on the calculated duration and overlap.

def split_audio(path_to_audio, output_directory, overlap_seconds, max_size=MAX_SIZE_MB):
    """Split an audio file into segments."""
    audio = AudioSegment.from_file(path_to_audio)
    duration_seconds = len(audio) / 1000.0  # Convert duration to seconds
    bitrate_kbps = audio.frame_rate * audio.frame_width * 8 // 1000
    file_size_MB = os.path.getsize(path_to_audio) / (1024 * 1024)

    if file_size_MB < max_size:
        logging.info("File size is below the maximum limit. No splitting needed.")
        return [path_to_audio]

    base_name = os.path.splitext(os.path.basename(path_to_audio))[0]
    num_segments, segment_duration = calculate_segment_duration_and_num_segments(
        duration_seconds, overlap_seconds, max_size, bitrate_kbps
    )

    new_file_names = [
        os.path.join(output_directory, f"{base_name}_part{i + 1}.mp3") for i in range(num_segments)
    ]

    start = 0
    for i in range(num_segments):
        end = int(start + segment_duration * 1000)
        segment = audio[int(start):end] if i < num_segments - 1 else audio[int(start):]
        segment.export(new_file_names[i], format="mp3")
        logging.info(f"Segment {i + 1}: {new_file_names[i]}")
        start += (segment_duration - overlap_seconds) * 1000

    return new_file_names
  • Inputs:

    • path_to_audio: Path to the input audio file.

    • output_directory: Directory where the segments will be saved.

    • overlap_seconds: Duration of overlap between consecutive segments.

    • max_size: Maximum allowed size for each segment (default: 25 MB).

  • Outputs:

    • List of file paths to the generated audio segments.
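The start/end bookkeeping in the loop is easier to see in isolation. The helper below (segment_boundaries is a hypothetical name for this sketch, not part of the script) mirrors the loop but only records offsets in seconds:

```python
def segment_boundaries(duration_s, segment_duration_s, overlap_s, num_segments):
    """Mirror split_audio's loop, recording (start, end) offsets in seconds."""
    bounds = []
    start = 0.0
    for i in range(num_segments):
        # Final segment runs to the end of the file, as in split_audio
        end = duration_s if i == num_segments - 1 else start + segment_duration_s
        bounds.append((round(start, 2), round(end, 2)))
        start += segment_duration_s - overlap_s  # next segment starts overlap_s early
    return bounds

# Three ~1203 s segments over a 3600 s file with a 10 s overlap
print(segment_boundaries(3600, 1203.33, 10, 3))
```

Each segment begins overlap_s before the previous one ends, so a sentence cut at a boundary is fully captured in at least one segment.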

4. split_m4a

This is the main function that ties everything together. It extracts the audio from the .m4a file, splits it into segments, and returns the paths to the generated files.

def split_m4a(path_to_m4a, output_directory, overlap_seconds, max_size=MAX_SIZE_MB):
    """Extract audio from an M4A file and split it into segments."""
    if not os.path.exists(path_to_m4a):
        raise ValueError(f"File {path_to_m4a} does not exist.")

    base_name = os.path.splitext(os.path.basename(path_to_m4a))[0]
    os.makedirs(output_directory, exist_ok=True)

    # Extract audio from M4A
    path_to_audio = os.path.join(output_directory, base_name + ".mp3")
    extract_audio_from_m4a(path_to_m4a, path_to_audio)

    # Split extracted audio
    return split_audio(path_to_audio, output_directory, overlap_seconds, max_size)

Usage Example

To use this script directly, run it from the command line:

python split_m4a.py /path/to/input.m4a /path/to/output --overlap_seconds 10 --max_size 25

This will:

  1. Extract audio from /path/to/input.m4a.

  2. Split it into overlapping segments.

  3. Save the segments in /path/to/output.


Logging

Throughout the process, the script uses Python's logging module to provide real-time feedback on:

  • The number of segments created.

  • File sizes and durations.

  • Exported file paths.

This ensures transparency and easy debugging in case of errors.


By understanding the functionality of split_m4a.py, you can customize it to suit specific requirements, such as changing the output format, adjusting segment overlap, or modifying file naming conventions.
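As one example of such a customization, the zero-padded naming comes from construct_file_names in the generated script; it is reproduced here standalone with a made-up directory and base name so you can see the scheme before adjusting it:

```python
import math
import os

def construct_file_names(output_directory, base_name, num_segments, extension):
    # Zero-pad the segment index so directory listings sort in playback order
    padding = max(1, int(math.ceil(math.log10(num_segments))))
    return [
        os.path.join(output_directory, f"{base_name}_{str(i).zfill(padding)}.{extension}")
        for i in range(1, num_segments + 1)
    ]

print(construct_file_names("out", "meeting", 12, "mp3")[:2])
```

One caveat worth knowing: for exactly 10 or 100 segments, ceil(log10(n)) yields one digit too few for the last index, so names like meeting_10 sort before meeting_2; using len(str(num_segments)) as the padding width is a simple alternative.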


Deep Dive: app.py

The GUI for the application allows users to select an .m4a file, split it, and transcribe it using Whisper.

import tkinter as tk
from tkinter import filedialog, messagebox, ttk
import os
import threading
from split_m4a import split_m4a
import subprocess
import openai

# Initialize OpenAI client
openai.api_key = os.getenv("OPENAI_API_KEY")

def transcribe_file(file_path):
    command = [
        'curl',
        '--request', 'POST',
        '--url', 'https://api.openai.com/v1/audio/transcriptions',
        '--header', f'Authorization: Bearer {openai.api_key}',
        '--header', 'Content-Type: multipart/form-data',
        '--form', f'file=@{file_path}',
        '--form', 'model=whisper-1',
        '--form', 'language=en',
        # Return plain text instead of JSON so segment transcripts concatenate cleanly
        '--form', 'response_format=text'
    ]
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout

def update_status(progress_bar, status_label, current, total, message):
    progress_bar['value'] = (current / total) * 100
    status_label.config(text=f"{message} ({current}/{total})")

def process_file(progress_bar, status_label):
    file_path = filedialog.askopenfilename(filetypes=[("M4A files", "*.m4a")])
    if not file_path:
        return

    def process():
        try:
            base_name = os.path.splitext(os.path.basename(file_path))[0]
            output_directory = os.path.join(os.path.dirname(file_path), base_name)
            os.makedirs(output_directory, exist_ok=True)

            update_status(progress_bar, status_label, 0, 1, "Splitting audio")
            segments = split_m4a(file_path, output_directory, overlap_seconds=10)

            total_segments = len(segments)
            transcript = ""
            for i, segment in enumerate(segments, start=1):
                update_status(progress_bar, status_label, i, total_segments, "Transcribing segment")
                transcript += transcribe_file(segment)

            transcript_path = os.path.join(output_directory, base_name + ".txt")
            with open(transcript_path, "w") as f:
                f.write(transcript)

            update_status(progress_bar, status_label, total_segments, total_segments, "Completed")
            messagebox.showinfo("Success", f"Transcription saved to {transcript_path}")
        except Exception as e:
            messagebox.showerror("Error", str(e))

    threading.Thread(target=process).start()

def create_gui():
    root = tk.Tk()
    root.title(".m4a Transcriber")

    frame = tk.Frame(root, padx=20, pady=20)
    frame.pack(padx=10, pady=10)

    label = tk.Label(frame, text="Select an .m4a file to transcribe:")
    label.pack(pady=5)

    button = tk.Button(frame, text="Select File", command=lambda: process_file(progress_bar, status_label))
    button.pack(pady=5)

    progress_bar = ttk.Progressbar(frame, orient="horizontal", length=300, mode="determinate")
    progress_bar.pack(pady=10)

    status_label = tk.Label(frame, text="Status: Waiting for file selection")
    status_label.pack(pady=5)

    root.mainloop()

if __name__ == "__main__":
    create_gui()

The GUI elements are built using the Tkinter library, which provides a simple and user-friendly interface for the application. Here is how the GUI is structured and how transcription is triggered:

  1. Main Window:

    • The main application window (root) is created using tk.Tk(). It has a title and contains all the GUI elements.
  2. Label:

    • A tk.Label is used to display a prompt asking the user to select an .m4a file for transcription. This label is placed at the top of the application window.
  3. Progress Bar:

    • A ttk.Progressbar is included to indicate the progress of the transcription process. It updates dynamically as the file is split and transcribed.
  4. File Selection Button:

    • A tk.Button labeled "Select File" allows the user to browse and select an .m4a file from their system. When clicked, it triggers the process_file function.
  5. Process File Function:

    • The process_file function is responsible for handling the entire workflow. Here is what happens when a file is selected:

      • File Dialog: Opens a file dialog (filedialog.askopenfilename) for the user to select a file.

      • Splitting Audio: The selected file is converted and split into smaller segments by the split_m4a function from split_m4a.py.

      • Transcription: Each segment is transcribed using the transcribe_file function, which sends the segment to the OpenAI Whisper API.

      • Progress Updates: The progress bar updates after each segment is processed.

      • Save Transcript: The final transcript is saved as a text file in the same directory as the audio segments.

      • Completion Message: A success message is displayed using messagebox.showinfo once transcription is complete.

  6. Multithreading:

    • To keep the GUI responsive during long-running operations like splitting and transcribing, the work inside process_file runs in a separate thread using the threading module.

These elements work together to provide a seamless and interactive experience for users, allowing them to easily transcribe .m4a files without needing to interact with the command line.
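One caveat: Tkinter widgets are generally not safe to update from a worker thread. The app updates the progress bar directly from the background thread, which often works in practice but is fragile. A common alternative, sketched below with hypothetical report_progress/drain_ui_queue names (not code from this project), is to push updates onto a queue from the worker and apply them on the main thread:

```python
import queue

ui_queue = queue.Queue()

def report_progress(current, total, message):
    # Worker thread: never touch Tk widgets here; just enqueue the update
    ui_queue.put((current, total, message))

def drain_ui_queue(apply_update):
    # Main thread: apply any pending updates to the widgets. In the app this
    # would be rescheduled with root.after(100, ...) to poll continuously.
    applied = 0
    while True:
        try:
            item = ui_queue.get_nowait()
        except queue.Empty:
            return applied
        apply_update(*item)
        applied += 1
```

Here apply_update would be a small function that sets progress_bar['value'] and the status label text, so all widget access stays on the thread running mainloop.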


Step 3: Running the Application with Automator

To run the app seamlessly on macOS, we will use Automator to execute the app.py script directly within your virtual environment.

Automator Integration

  1. Open Automator and create a new application.

  2. Add the "Run Shell Script" action.

  3. Use the following updated run_app.sh script:

#!/bin/bash

# Redirect all output (stdout and stderr) to a log file
exec > /Users/<your_user_path>/m4a_transcriber/script_log.txt 2>&1

# Full path to your virtual environment
VENV_PATH="/Users/<your_user_path>/m4a_transcriber/m4atranscriberenv"

# Check if the virtual environment exists
if [ -d "$VENV_PATH" ]; then
    source "$VENV_PATH/bin/activate"
else
    echo "Virtual environment not found at $VENV_PATH"
    exit 1
fi

# Add Homebrew's binary directory to PATH
export PATH="/opt/homebrew/bin:$PATH"

# Set the OpenAI API key
export OPENAI_API_KEY='<your_api_key>'

# Explicitly use the Python executable from the virtual environment
"$VENV_PATH/bin/python" /Users/<your_user_path>/m4a_transcriber/app.py
  4. Save the Automator workflow as an application. You can now double-click it to launch the transcription tool.

Visual Walkthrough

GUI

Automator Setup

App Usage


Conclusion

By following this guide, you can automate the transcription of .m4a files, saving time and reducing the effort of manual transcription. This solution suits anyone who regularly works with audio-to-text workflows, offering a customizable and efficient alternative to built-in transcription tools, and it can be adapted to other file formats and use cases. The project demonstrates how powerful Python and OpenAI's Whisper model can be when combined. Start building your own transcription tool today and take control of your audio processing needs.