Building a Voice-Assistant with AssemblyAI, Groq, and ElevenLabs
Introduction
I wanted to build a voice-assistant for my daughter so that, when she’s older, she can ask questions while she’s playing and get answers from an AI. A voice-assistant that operates in real time has to combine several technologies. In this post, I’ll walk you through how I built a fast voice-assistant using Python libraries for real-time transcription, AI response generation, and live audio streaming. By the end of this tutorial, you’ll understand how to perform real-time speech-to-text transcription with AssemblyAI, generate responses with Groq, and stream audio responses using ElevenLabs.
Step 1: Install Python Libraries
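First, we need to install the required Python libraries and two system dependencies. Together they allow us to transcribe speech in real time, generate AI responses, and stream audio. The brew commands assume macOS with Homebrew; on other systems, install portaudio and mpv with your own package manager.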
pip install groq
pip install "assemblyai[extras]"
pip install elevenlabs
brew install portaudio
brew install mpv
These commands install the following:
- assemblyai: Python SDK for real-time speech-to-text transcription (the extras add microphone support).
- groq: Python client for generating AI responses with Llama 3.
- elevenlabs: Python client for streaming audio responses.
- portaudio: System library required for microphone audio capture.
- mpv: Media player required for audio playback.
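Before moving on, it can help to confirm that everything installed correctly. The following is a minimal sketch using only the standard library; the check_setup helper is a hypothetical name introduced here, not part of any of the SDKs. It does not check portaudio, which is a shared library rather than an importable module or an executable.

```python
import importlib.util
import shutil

# Import names and executables taken from the install commands above.
PYTHON_MODULES = ["assemblyai", "groq", "elevenlabs"]
EXECUTABLES = ["mpv"]


def check_setup(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return a dict mapping each dependency name to True if it was found."""
    found = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        found[name] = importlib.util.find_spec(name) is not None
    for name in executables:
        # shutil.which returns None when the executable is not on PATH
        found[name] = shutil.which(name) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If any entry prints MISSING, rerun the matching install command before continuing.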
Step 2: Real-Time Transcription with AssemblyAI
To transcribe speech in real-time, we use AssemblyAI’s real-time transcription service. Here’s how we set it up:
import assemblyai as aai


class AI_Assistant:
    def __init__(self):
        aai.settings.api_key = "your_assemblyai_api_key"
        self.transcriber = None

    def start_transcription(self):
        # Open a real-time session and register the event callbacks
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16_000,
            on_data=self.on_data,
            on_error=self.on_error,
            on_open=self.on_open,
            on_close=self.on_close,
        )
        self.transcriber.connect()
        # Feed microphone audio to the transcriber at the same sample rate
        microphone_stream = aai.extras.MicrophoneStream(sample_rate=16_000)
        self.transcriber.stream(microphone_stream)

    def stop_transcription(self):
        if self.transcriber:
            self.transcriber.close()
            self.transcriber = None

    def on_open(self, session_opened: aai.RealtimeSessionOpened):
        return

    def on_data(self, transcript: aai.RealtimeTranscript):
        # Ignore empty transcripts
        if not transcript.text:
            return
        # Only respond to final transcripts, not partial ones
        # (generate_ai_response is defined in Step 3)
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            self.generate_ai_response(transcript)

    def on_error(self, error: aai.RealtimeError):
        return

    def on_close(self):
        return
In this code, we set up a real-time transcriber using AssemblyAI. When the transcriber receives new data, it calls the on_data method, which ignores empty and partial transcripts and passes each final transcript on to generate_ai_response.
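The filtering in on_data can be tried out without a microphone or an API key. The sketch below uses hypothetical stand-in classes (Transcript, PartialTranscript, FinalTranscript) in place of the SDK's transcript types, and a select_final helper that is my own illustration of the same two checks.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical stand-in for a real-time transcript event."""
    text: str


class PartialTranscript(Transcript):
    """Interim result that may still change as more audio arrives."""


class FinalTranscript(Transcript):
    """Finished utterance that is safe to act on."""


def select_final(events):
    """Return the text of every non-empty final transcript, in order."""
    finals = []
    for event in events:
        if not event.text:
            continue  # empty transcripts carry nothing to respond to
        if isinstance(event, FinalTranscript):
            finals.append(event.text)
    return finals


events = [
    PartialTranscript(""),
    PartialTranscript("why is"),
    PartialTranscript("why is the sky"),
    FinalTranscript("Why is the sky blue?"),
]
print(select_final(events))  # ['Why is the sky blue?']
```

Acting only on final transcripts means the assistant answers once per utterance instead of once per interim guess.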
Step 3: Pass Real-Time Transcript to Llama 3
Next, we need to pass each final transcript to Llama 3, served by Groq, to generate an AI response. Here’s how we do it:
import assemblyai as aai
from groq import Groq


class AI_Assistant:
    def __init__(self):
        aai.settings.api_key = "your_assemblyai_api_key"
        self.groq_client = Groq(api_key="your_groq_api_key")
        # Start with a predefined message in the transcript
        self.full_transcript = [{"role": "system", "content": "You are a friendly AI assistant."}]

    def generate_ai_response(self, transcript):
        # Stop transcription to process the current data
        self.stop_transcription()
        # Append the user's last spoken text to the transcript
        self.full_transcript.append({"role": "user", "content": transcript.text})
        # Create a streaming session with Groq using the updated transcript
        groq_stream = self.groq_client.chat.completions.create(
            model="llama3-8b-8192", messages=self.full_transcript, stream=True
        )
        text_buffer = ""
        full_text = ""
        # Process each chunk received from the Groq stream
        for chunk in groq_stream:
            if chunk.choices[0].delta.content is not None:
                text_buffer += chunk.choices[0].delta.content
                # If a sentence ends, stream it as audio and reset the buffer
                if text_buffer.endswith("."):
                    self.stream_audio_response(text_buffer)
                    full_text += text_buffer
                    text_buffer = ""
        # If there's remaining text that didn't end with a period, stream it
        if text_buffer:
            self.stream_audio_response(text_buffer)
            full_text += text_buffer
        # Append the full AI-generated response to the transcript
        self.full_transcript.append({"role": "assistant", "content": full_text})
        # Restart transcription after processing
        self.start_transcription()
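This method stops the transcription, appends the user’s final transcript to the conversation, and requests a streamed response from Llama 3. As chunks arrive, it buffers them until the text ends with a period, then sends that sentence to be spoken, so audio can start before the full response has been generated. Finally, it saves the complete response to the conversation and restarts transcription.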
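The sentence buffering can be isolated into a small generator and tested without calling Groq. This is a sketch, not the code used above: sentences_from_chunks is a hypothetical helper, and treating "!" and "?" as sentence endings is a variation I added (the method above checks only for a period).

```python
def sentences_from_chunks(chunks, terminators=(".", "!", "?")):
    """Yield sentence-sized pieces from an iterable of streamed text chunks."""
    buffer = ""
    for chunk in chunks:
        if chunk is None:
            continue  # streamed deltas can carry no content
        buffer += chunk
        # str.endswith accepts a tuple and matches any of its entries
        if buffer.endswith(terminators):
            yield buffer
            buffer = ""
    # Flush whatever is left when the stream ends mid-sentence
    if buffer:
        yield buffer


chunks = ["The sky", " is blue", ".", None, " Want", " to sing", "?", " Yay"]
pieces = list(sentences_from_chunks(chunks))
print(pieces)  # ['The sky is blue.', ' Want to sing?', ' Yay']
```

Each yielded piece would be passed to the text-to-speech step in turn. One limitation of this approach is that a chunk ending in the middle of a number such as "3." is treated as a sentence ending.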
Step 4: Live Audio Stream from ElevenLabs
Finally, we use ElevenLabs to stream the AI response as live audio:
import assemblyai as aai
from elevenlabs import stream
from elevenlabs.client import ElevenLabs
from groq import Groq


class AI_Assistant:
    def __init__(self):
        aai.settings.api_key = "your_assemblyai_api_key"
        self.groq_client = Groq(api_key="your_groq_api_key")
        self.client = ElevenLabs(api_key="your_elevenlabs_api_key")
        self.transcriber = None

    def stream_audio_response(self, text):
        # Request the audio as a stream so playback can begin right away
        audio_stream = self.client.generate(
            text=text, model="eleven_turbo_v2", stream=True
        )
        # Play the stream as it arrives (uses mpv)
        stream(audio_stream)
This method uses ElevenLabs to generate and play the audio for each piece of text produced by Llama 3.
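The constructor above takes all three API keys as string literals. A common alternative is to read them from environment variables so they stay out of source control. The sketch below shows one way to do that; load_api_keys and the variable names are assumptions made for this example, not names required by any of the SDKs.

```python
import os


def load_api_keys(env=None):
    """Return the three API keys, raising if any variable is unset or empty."""
    env = os.environ if env is None else env
    # Variable names are an assumption made for this sketch
    names = {
        "assemblyai": "ASSEMBLYAI_API_KEY",
        "groq": "GROQ_API_KEY",
        "elevenlabs": "ELEVENLABS_API_KEY",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {service: env[var] for service, var in names.items()}
```

The returned values can then replace the literals, for example aai.settings.api_key = keys["assemblyai"] after keys = load_api_keys().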
Full Code Example
Here is the complete code for the AI assistant:
import assemblyai as aai
from elevenlabs import stream
from elevenlabs.client import ElevenLabs
from groq import Groq


class AI_Assistant:
    def __init__(self):
        aai.settings.api_key = "your_assemblyai_api_key"
        self.groq_client = Groq(api_key="your_groq_api_key")
        self.client = ElevenLabs(api_key="your_elevenlabs_api_key")
        self.transcriber = None
        self.full_transcript = [
            {
                "role": "system",
                "content": """You are a friendly and engaging AI voice assistant designed to interact with toddlers aged 3-5 years old. Your primary goal is to provide fun, educational, and age-appropriate interactions. Here are your guidelines:
1. **Language**: Use simple and clear words. Speak slowly and clearly.
2. **Engagement**: Incorporate interactive elements like songs, rhymes, and simple questions to keep the child engaged.
3. **Tone**: Maintain a friendly and gentle tone throughout the conversation.
4. **Encouragement**: Provide positive reinforcement and encourage curiosity and learning.
5. **Brevity**: Keep responses under 50 words.
Example interactions:
- "Hi there! I'm your friend. Do you want to sing a song with me? Let's sing 'Twinkle, Twinkle, Little Star'!"
- "Can you show me your favorite toy? Wow, that's amazing! What color is it?"
- "Do you want to play a game? Let's find something red in the room!"
Always be cheerful, supportive, and ready to engage the toddler in fun and educational activities.
""",
            },
        ]

    def start_transcription(self):
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16_000,
            on_data=self.on_data,
            on_error=self.on_error,
            on_open=self.on_open,
            on_close=self.on_close,
        )
        self.transcriber.connect()
        microphone_stream = aai.extras.MicrophoneStream(sample_rate=16_000)
        self.transcriber.stream(microphone_stream)

    def stop_transcription(self):
        if self.transcriber:
            self.transcriber.close()
            self.transcriber = None

    def on_open(self, session_opened: aai.RealtimeSessionOpened):
        return

    def on_data(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            self.generate_ai_response(transcript)

    def on_error(self, error: aai.RealtimeError):
        return

    def on_close(self):
        return

    def generate_ai_response(self, transcript):
        self.stop_transcription()
        self.full_transcript.append({"role": "user", "content": transcript.text})
        groq_stream = self.groq_client.chat.completions.create(
            model="llama3-8b-8192", messages=self.full_transcript, stream=True
        )
        text_buffer = ""
        full_text = ""
        for chunk in groq_stream:
            if chunk.choices[0].delta.content is not None:
                text_buffer += chunk.choices[0].delta.content
                if text_buffer.endswith("."):
                    self.stream_audio_response(text_buffer)
                    full_text += text_buffer
                    text_buffer = ""
        if text_buffer:
            self.stream_audio_response(text_buffer)
            full_text += text_buffer
        self.full_transcript.append({"role": "assistant", "content": full_text})
        self.start_transcription()

    def stream_audio_response(self, text):
        audio_stream = self.client.generate(
            text=text, model="eleven_turbo_v2", stream=True
        )
        stream(audio_stream)


ai_assistant = AI_Assistant()
ai_assistant.start_transcription()
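One thing to watch in the full code is that full_transcript grows with every turn and the whole list is sent to the model each time. The model name llama3-8b-8192 suggests a context window of 8,192 tokens, so a long play session could eventually exceed it. The sketch below shows one possible mitigation, assuming it is acceptable to forget older turns; trim_history and its default limit are hypothetical and not part of the code above.

```python
def trim_history(messages, max_turn_messages=10):
    """Keep all system messages plus the most recent user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Guard against a zero or negative limit, since turns[-0:] is the whole list
    recent = turns[-max_turn_messages:] if max_turn_messages > 0 else []
    return system + recent


history = [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What is a cat?"},
    {"role": "assistant", "content": "A cat is a furry animal."},
]
trimmed = trim_history(history, max_turn_messages=2)
print([m["content"] for m in trimmed])
```

Passing trim_history(self.full_transcript) as the messages argument would cap the request size while leaving the stored transcript untouched.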
Conclusion
Building a real-time voice-assistant requires integrating multiple technologies to handle speech-to-text transcription, AI response generation, and live audio streaming. By following this guide, you can create your own voice-assistant using Python. This setup is perfect for creating interactive and engaging experiences, such as an AI voice assistant for toddlers. Happy coding!