With the rapid growth of artificial intelligence technology, converting spoken language into text has become an incredibly useful skill. OpenAI’s Whisper API is a powerful tool for doing just this—it can accurately turn your spoken words into written text.
Practical Use Case
Financial analysts and investment firms often rely on earnings calls to gather insights about a company’s performance, future outlook, and management commentary. These calls can contain crucial information affecting investment decisions. However, listening to each earnings call and noting down important details can be time-consuming. To streamline this process, firms can utilize OpenAI’s Whisper API to transcribe these audio files, allowing for easier analysis and information retrieval.
In this tutorial, I’ll show you how to build a simple Python application that records audio from a microphone, saves it as an MP3 file, and then uses the Whisper API to transcribe the speech into text. Let’s dive in!
What is Whisper API?
![](https://i0.wp.com/www.marketcalls.in/wp-content/uploads/2024/05/image-13.png?resize=1024%2C418&ssl=1)
OpenAI’s Whisper API is a tool that allows developers to convert spoken language into written text. It’s built on the Whisper model, which is a type of deep learning model specifically designed for automatic speech recognition (ASR). The Whisper model is known for its robust performance across a wide variety of languages and accents, and it’s capable of handling different audio conditions and contexts.
To generate real-time speech-to-text, an OpenAI API token is necessary. Follow the straightforward steps outlined below to create your API_TOKEN with OpenAI.
- Go to https://platform.openai.com/apps and sign up with your email address or connect your Google account.
- Go to View API Keys on the left side of your Personal Account Settings
- Select Create New Secret key
![](https://i0.wp.com/www.marketcalls.in/wp-content/uploads/2024/01/image-20-1024x589.png?resize=1024%2C589&ssl=1)
API access to OpenAI is a paid service, so you have to set up billing. Read the pricing information before experimenting.
In order to store the OpenAI API key securely, we use a .env file and load the key as an environment variable.
Prerequisites
Before you begin, ensure you have the following installed:
- Python 3.8 or later
- sounddevice: For recording audio from the microphone
- numpy: For handling the audio data
- pydub: For processing audio files (MP3 export requires FFmpeg to be installed on your system)
- python-dotenv: For loading environment variables
- The OpenAI Python library: For accessing the Whisper API
You can install the necessary libraries using pip:
pip install sounddevice numpy pydub python-dotenv openai
Step 1: Setting Up Environment Variables
To use the OpenAI API, you need to secure your API key. Store your API key in a .env file in your project’s root directory:
OPENAI_API_KEY='Your-OpenAI-API-Key-Here'
Load this API key in your script with python-dotenv:
from dotenv import load_dotenv
import os
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
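A common stumbling block is a missing or misnamed .env file, which silently leaves the key set to None. A small guard (a sketch; `require_api_key` is a hypothetical helper, not part of the OpenAI library) fails fast with a clear message:

```python
import os

def require_api_key(env_var='OPENAI_API_KEY'):
    """Return the key from the environment, failing loudly if it's absent."""
    key = os.getenv(env_var)
    if not key:
        raise RuntimeError(f"{env_var} not found - check your .env file")
    return key
```

Call `require_api_key()` right after `load_dotenv()` so a misconfigured environment is caught before any audio is recorded.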
Step 2: Recording Audio
Use the sounddevice library to capture audio from the system’s default microphone in 5-second chunks. Here’s a simple function to record audio for a specified duration:
import sounddevice as sd
def record_audio(duration=5, sample_rate=44100):
    print("Recording...")
    recording = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=2, dtype='int16')
    sd.wait()  # Block until the recording is finished
    return recording
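`sd.rec` returns a NumPy array of shape `(frames, channels)`; for five seconds of stereo at 44.1 kHz that is `(220500, 2)` of int16 samples. A quick sanity check, using a synthetic silent array in place of a real recording so it runs without a microphone:

```python
import numpy as np

sample_rate = 44100
duration = 5

# Synthetic stand-in for a real recording: silence with the same shape/dtype
recording = np.zeros((int(duration * sample_rate), 2), dtype='int16')

assert recording.shape == (220500, 2)  # frames x channels
assert recording.dtype == np.int16     # matches sample_width=2 used later
```

The int16 dtype matters: the MP3 export in the next step assumes 2 bytes per sample.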
Step 3: Saving the Audio
After recording the audio, save it as an MP3 file using pydub:
from pydub import AudioSegment
import numpy as np
import os
def save_as_mp3(audio_data, sample_rate=44100, file_name='output.mp3', folder='audio'):
    if not os.path.exists(folder):
        os.makedirs(folder)
    full_path = os.path.join(folder, file_name)
    audio_segment = AudioSegment(
        data=np.array(audio_data).tobytes(),
        sample_width=2,  # 2 bytes (16 bits) per sample
        frame_rate=sample_rate,
        channels=2
    )
    audio_segment.export(full_path, format='mp3')
    return full_path
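Whisper does not need stereo audio, so downmixing to mono before export roughly halves the upload size. A sketch in pure NumPy (independent of the recording hardware; if you feed mono data to `save_as_mp3`, set `channels=1` in the `AudioSegment` constructor):

```python
import numpy as np

def to_mono(audio_data):
    """Average the left/right channels of an int16 stereo array."""
    stereo = np.asarray(audio_data, dtype=np.int16)
    # Mean across the channel axis, cast back to the int16 sample range
    return stereo.mean(axis=1).astype(np.int16)

stereo = np.array([[100, 200], [-50, 50]], dtype=np.int16)
mono = to_mono(stereo)
# mean of [100, 200] -> 150, mean of [-50, 50] -> 0
```

For short 5-second chunks the saving is modest, but for hour-long earnings calls it adds up.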
Step 4: Transcribing Audio with Whisper API
Now, use the OpenAI library to transcribe the saved audio file:
from openai import OpenAI
def transcribe_audio(file_path):
    client = OpenAI(api_key=OPENAI_API_KEY)
    with open(file_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            language='en'
        )
    print(f'Transcription: {transcription.text}')
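Printing each chunk works for a demo, but for the earnings-call use case you will likely want to accumulate the text somewhere. A small helper that appends each transcribed chunk to a running transcript file (a sketch; `transcript_log.txt` is a hypothetical filename you can change):

```python
import datetime

def append_transcript(text, log_path='transcript_log.txt'):
    """Append a timestamped transcription chunk to a log file."""
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    with open(log_path, 'a', encoding='utf-8') as f:
        f.write(f'[{timestamp}] {text}\n')

append_transcript('Revenue grew 12% year over year.', 'demo_log.txt')
```

You could call this instead of `print` inside `transcribe_audio`, giving you a single searchable transcript per call.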
Step 5: Putting It All Together
Combine the above functions into a script that continuously records and transcribes audio until the ESC key is pressed:
import sounddevice as sd
import numpy as np
import os
import keyboard  # To detect shortcut key press
from pydub import AudioSegment
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

def record_audio(duration=5, sample_rate=44100):
    """Record audio from the microphone."""
    print("Recording...")
    recording = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=2, dtype='int16')
    sd.wait()  # Wait until recording is finished
    return recording

def save_as_mp3(audio_data, sample_rate=44100, file_name='output.mp3', folder='audio'):
    """Save recorded audio as MP3 in a specified folder."""
    if not os.path.exists(folder):
        os.makedirs(folder)
    full_path = os.path.join(folder, file_name)
    audio_segment = AudioSegment(
        data=np.array(audio_data).tobytes(),
        sample_width=2,  # 2 bytes (16 bits) per sample
        frame_rate=sample_rate,
        channels=2
    )
    audio_segment.export(full_path, format='mp3')
    return full_path

def transcribe_audio(file_path):
    """Transcribe the audio file using OpenAI's API."""
    client = OpenAI(api_key=OPENAI_API_KEY)
    with open(file_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            language='en'
        )
    print(f'Transcription: {transcription.text}')

if __name__ == "__main__":
    sample_rate = 44100  # Sample rate in Hz
    duration = 5  # Duration of each recording chunk in seconds
    try:
        while True:
            if keyboard.is_pressed('esc'):  # Exit when the ESC key is pressed
                print("Exiting...")
                break
            audio_data = record_audio(duration, sample_rate)
            file_path = save_as_mp3(audio_data, sample_rate)
            transcribe_audio(file_path)
    except KeyboardInterrupt:
        print("Program terminated.")
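Note that the loop above reuses output.mp3 for every chunk, overwriting the previous recording. If you want to keep each chunk on disk (for example, to re-transcribe a noisy segment later), a unique filename per recording helps; a minimal sketch:

```python
import time

def chunk_filename(prefix='chunk', ext='mp3'):
    """Return a filename tagged with the current epoch time in milliseconds."""
    return f"{prefix}_{int(time.time() * 1000)}.{ext}"

name = chunk_filename()
# e.g. 'chunk_1716890000123.mp3'
```

You would then pass `file_name=chunk_filename()` into `save_as_mp3` inside the loop.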
This simple application showcases the power of OpenAI’s Whisper API in creating accessible tools for speech-to-text conversion. By integrating such technologies, developers can build more inclusive and efficient communication tools that bridge the gap between spoken and written language.