Add Speech to Your AI Agent in 5 Minutes

#ai #python #tutorial #mcp
Fabio Augusto Suizu

Your agent reads text. It writes text. It reasons about text. But it can't hear a student say "hello" and tell them their /h/ is perfect while their /l/ needs work. Here's how to fix that.


The Problem

AI agents are text-native. They consume text, produce text, and reason over text. But a growing class of applications requires agents to interact with the physical world through sound:

  • Language tutoring agents that listen to a student and give phoneme-level feedback
  • Customer service agents that transcribe calls and respond with natural speech
  • Accessibility agents that read content aloud and accept voice commands
  • Interview practice agents that evaluate spoken responses

You could try to hack this with LLM inference. Ask Opus to "evaluate this pronunciation" and you'll get a confident, plausible, and completely fabricated phoneme analysis. LLMs don't have acoustic models. They can't compute phoneme-level pronunciation scores because they never see the audio signal.

What you need are specialized speech tools.


The Solution: Speech AI Tools

We ship three speech APIs as both a REST interface and an MCP server with 8 tools:

| Capability | Tool | What It Does | Latency |
| --- | --- | --- | --- |
| Pronunciation | assess_pronunciation | Scores pronunciation at phoneme/word/sentence level (0-100) | 257ms p50 |
| Pronunciation | check_pronunciation_service | Health check | <100ms |
| Pronunciation | get_phoneme_inventory | Lists all 39 English phonemes with IPA/ARPAbet | <100ms |
| STT | transcribe_audio | Audio to text with word-level timestamps and confidence | ~430ms |
| STT | check_stt_service | Health check | <100ms |
| TTS | synthesize_speech | Text to natural speech, 12 voices, speed control | ~1.5s |
| TTS | list_tts_voices | Lists available voices with metadata | <100ms |
| TTS | check_tts_service | Health check | <100ms |

The pronunciation scorer uses a 17MB proprietary model that exceeds human expert inter-annotator agreement at the phone level (PCC 0.590 vs 0.555) and at the sentence level (0.711 vs 0.675). The TTS is ranked #1 on TTS Arena and ships 12 English voices. The STT runs on the same 17MB model and returns word-level timestamps with confidence scores.
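PCC here is the Pearson correlation coefficient between two sets of scores, e.g. model phone-level scores versus human annotator scores. As a quick illustration of what a PCC around 0.59 is measuring (the toy numbers below are made up, not benchmark data):

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: model phone scores that track human scores closely
# produce a PCC near 1; uncorrelated scores drift toward 0.
model = [45, 60, 72, 88, 91]
human = [50, 58, 70, 85, 95]
print(pearson(model, human))  # a value close to 1
```

A PCC of 0.590 against human labels, when humans only agree with each other at 0.555, is the sense in which the model "exceeds" expert agreement.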


Quick Start: MCP (Claude Desktop, Cursor, Windsurf, etc.)

The fastest path. Your agent gets 8 speech tools through one configuration block.

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "speech-ai": {
      "command": "npx",
      "args": [
        "-y",
        "@smithery/cli@latest",
        "run",
        "fabiosuizu/pronunciation-assessment",
        "--config",
        "{\"apimSubscriptionKey\":\"YOUR_KEY\"}"
      ]
    }
  }
}

Cursor / Windsurf

Add to your MCP settings (.cursor/mcp.json or equivalent):

{
  "mcpServers": {
    "speech-ai": {
      "command": "npx",
      "args": [
        "-y",
        "@smithery/cli@latest",
        "run",
        "fabiosuizu/pronunciation-assessment",
        "--config",
        "{\"apimSubscriptionKey\":\"YOUR_KEY\"}"
      ]
    }
  }
}

Direct Streamable HTTP (any MCP client)

If your client supports remote MCP servers:

URL: https://pronunciation-mcp.thankfulfield-a7857897.eastus.azurecontainerapps.io/mcp
Transport: Streamable HTTP

After configuration, your agent has access to all 8 tools. Ask it: "Assess the pronunciation of this audio recording against the text 'Hello world'" and it will call assess_pronunciation automatically.
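A malformed config file is a common failure mode: if the JSON doesn't parse, the client typically won't load the server at all. A minimal sanity check (the helper name is mine, not part of any client) that the file is valid JSON and actually declares the server:

```python
import json

def check_mcp_config(text: str, server_name: str = "speech-ai") -> bool:
    """Return True if the config text is valid JSON and declares server_name."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError:
        return False
    return server_name in cfg.get("mcpServers", {})

# Example against a config shaped like the snippets above:
sample = '{"mcpServers": {"speech-ai": {"command": "npx", "args": []}}}'
print(check_mcp_config(sample))  # True
```

Run it against the contents of `claude_desktop_config.json` (or `.cursor/mcp.json`) after editing, before restarting the client.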


Quick Start: Python (REST API)

For direct API integration without MCP.

Install

pip install httpx

Pronunciation Assessment

import base64
import httpx

API_BASE = "https://apim-ai-apis.azure-api.net"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "YOUR_KEY",
    "Content-Type": "application/json",
}

# Read and encode audio
with open("recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# Assess pronunciation
response = httpx.post(
    f"{API_BASE}/pronunciation/assess/base64",
    json={"audio": audio_b64, "text": "The quick brown fox", "format": "wav"},
    headers=HEADERS,
    timeout=30.0,
)
result = response.json()

print(f"Overall: {result['overallScore']}/100")
print(f"Confidence: {result['confidence']:.2f}")
for word in result["words"]:
    phonemes = ", ".join(f"{p['phoneme']}={p['score']}" for p in word["phonemes"])
    print(f"  {word['word']}: {word['score']}/100 [{phonemes}]")

Speech-to-Text

response = httpx.post(
    f"{API_BASE}/stt/transcribe/base64",
    json={"audio": audio_b64, "include_timestamps": True},
    headers=HEADERS,
    timeout=30.0,
)
result = response.json()

print(f"Transcript: {result['text']}")
for word in result.get("words", []):
    print(f"  [{word['start']:.2f}s - {word['end']:.2f}s] {word['word']} (conf: {word['confidence']:.2f})")

Text-to-Speech

response = httpx.post(
    f"{API_BASE}/tts/synthesize",
    json={"text": "Hello, how are you today?", "voice": "af_heart", "speed": 1.0, "format": "wav"},
    headers=HEADERS,
    timeout=60.0,
)

# Response is binary WAV
with open("output.wav", "wb") as f:
    f.write(response.content)
print(f"Saved {len(response.content)} bytes to output.wav")

Quick Start: LangChain

Wrap the APIs as LangChain tools for use in any agent framework.

import base64
import httpx
from langchain_core.tools import tool

API_BASE = "https://apim-ai-apis.azure-api.net"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "YOUR_KEY",
    "Content-Type": "application/json",
}

@tool
def assess_pronunciation(audio_path: str, text: str) -> dict:
    """Score English pronunciation at phoneme, word, and sentence level.

    Args:
        audio_path: Path to the audio file (WAV, MP3, OGG, WebM).
        text: The reference English text the speaker was reading.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    resp = httpx.post(
        f"{API_BASE}/pronunciation/assess/base64",
        json={"audio": audio_b64, "text": text, "format": audio_path.split(".")[-1]},
        headers=HEADERS,
        timeout=30.0,
    )
    return resp.json()

@tool
def transcribe_audio(audio_path: str) -> dict:
    """Transcribe audio to text with word-level timestamps.

    Args:
        audio_path: Path to the audio file.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    resp = httpx.post(
        f"{API_BASE}/stt/transcribe/base64",
        json={"audio": audio_b64, "include_timestamps": True},
        headers=HEADERS,
        timeout=30.0,
    )
    return resp.json()

@tool
def synthesize_speech(text: str, voice: str = "af_heart") -> str:
    """Generate natural English speech from text. Returns path to WAV file.

    Args:
        text: English text to speak (max 5000 chars).
        voice: Voice ID. Options: af_heart, af_bella, am_adam, bf_emma, bm_george, etc.
    """
    resp = httpx.post(
        f"{API_BASE}/tts/synthesize",
        json={"text": text, "voice": voice, "speed": 1.0, "format": "wav"},
        headers=HEADERS,
        timeout=60.0,
    )
    path = "/tmp/tts_output.wav"
    with open(path, "wb") as f:
        f.write(resp.content)
    return path

# Use in any LangChain agent
tools = [assess_pronunciation, transcribe_audio, synthesize_speech]

Example: Build a Pronunciation Tutor Agent

A complete pronunciation tutor in about 30 lines. The agent generates a target sentence, speaks it, listens to the student, and gives phoneme-level feedback.

import base64
import httpx

API = "https://apim-ai-apis.azure-api.net"
KEY = {"Ocp-Apim-Subscription-Key": "YOUR_KEY", "Content-Type": "application/json"}

def tutor_session(sentence: str, student_audio_path: str):
    """Run one pronunciation tutoring cycle."""

    # 1. Generate target speech so the student hears correct pronunciation
    tts = httpx.post(f"{API}/tts/synthesize",
        json={"text": sentence, "voice": "af_heart", "speed": 0.9, "format": "wav"},
        headers=KEY, timeout=60)
    with open("target.wav", "wb") as f:
        f.write(tts.content)
    print(f"Listen to target.wav, then record yourself saying: '{sentence}'")

    # 2. Transcribe what the student actually said
    with open(student_audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    stt = httpx.post(f"{API}/stt/transcribe/base64",
        json={"audio": audio_b64}, headers=KEY, timeout=30).json()
    print(f"You said: {stt['text']}")

    # 3. Score pronunciation against the target
    result = httpx.post(f"{API}/pronunciation/assess/base64",
        json={"audio": audio_b64, "text": sentence, "format": "wav"},
        headers=KEY, timeout=30).json()

    # 4. Give feedback
    print(f"\nScore: {result['overallScore']}/100 (confidence: {result['confidence']:.2f})")
    for w in result["words"]:
        weak = [p for p in w["phonemes"] if p["score"] < 60]
        status = "needs work" if weak else "good"
        print(f"  {w['word']}: {w['score']}/100 ({status})")
        for p in weak:
            print(f"    -> /{p['phoneme']}/ scored {p['score']} — practice this sound")

# Run it
tutor_session("The quick brown fox jumps over the lazy dog", "student_recording.wav")

Output:

Listen to target.wav, then record yourself saying: 'The quick brown fox jumps over the lazy dog'
You said: the quick brown fox jumps over the lazy dog

Score: 78/100 (confidence: 0.92)
  the: 85/100 (good)
  quick: 72/100 (needs work)
    -> /K/ scored 45 — practice this sound
  brown: 88/100 (good)
  fox: 90/100 (good)
  jumps: 65/100 (needs work)
    -> /JH/ scored 38 — practice this sound
  over: 82/100 (good)
  the: 80/100 (good)
  lazy: 76/100 (good)
  dog: 91/100 (good)
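The feedback step (step 4 above) is pure data processing on the assessment response, so it can be factored into a testable function. A sketch against a mock response shaped like the example output (field names follow the earlier code; the helper name is mine):

```python
def weak_phonemes(result: dict, threshold: int = 60) -> list[tuple[str, str, int]]:
    """Collect (word, phoneme, score) triples scoring below the threshold."""
    return [
        (w["word"], p["phoneme"], p["score"])
        for w in result["words"]
        for p in w["phonemes"]
        if p["score"] < threshold
    ]

# Mock assessment response mirroring the example output above
mock = {
    "overallScore": 78,
    "words": [
        {"word": "quick", "score": 72,
         "phonemes": [{"phoneme": "K", "score": 45}, {"phoneme": "W", "score": 80}]},
        {"word": "jumps", "score": 65,
         "phonemes": [{"phoneme": "JH", "score": 38}, {"phoneme": "M", "score": 75}]},
    ],
}
print(weak_phonemes(mock))  # [('quick', 'K', 45), ('jumps', 'JH', 38)]
```

Keeping the feedback logic separate from the HTTP calls makes it easy to unit-test drill selection without hitting the API.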

API Reference

Endpoints

| API | Method | URL | Auth |
| --- | --- | --- | --- |
| Pronunciation | POST | https://apim-ai-apis.azure-api.net/pronunciation/assess/base64 | Ocp-Apim-Subscription-Key |
| STT | POST | https://apim-ai-apis.azure-api.net/stt/transcribe/base64 | Ocp-Apim-Subscription-Key |
| TTS | POST | https://apim-ai-apis.azure-api.net/tts/synthesize | Ocp-Apim-Subscription-Key |

Audio Formats

WAV (recommended), MP3, OGG, WebM, and FLAC (STT only). Audio must be base64-encoded for the JSON endpoints.
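The base64 step is identical for all three JSON endpoints, so it's worth extracting into a helper that also derives the `format` field from the file extension (helper name and format set are my own, based on the list above):

```python
import base64
from pathlib import Path

SUPPORTED_FORMATS = {"wav", "mp3", "ogg", "webm", "flac"}

def encode_audio(path: str) -> tuple[str, str]:
    """Return (base64_audio, format) for a JSON request body."""
    fmt = Path(path).suffix.lstrip(".").lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {fmt}")
    data = Path(path).read_bytes()
    return base64.b64encode(data).decode(), fmt

# Usage: audio_b64, fmt = encode_audio("recording.wav")
# then   json={"audio": audio_b64, "text": "...", "format": fmt}
```

Remember that FLAC is only accepted by the STT endpoint, so check the format before reusing the same encoded payload across endpoints.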

TTS Voices

12 voices: af_heart (default), af_bella, af_nicole, af_sarah, af_sky, am_adam, am_michael, bf_emma, bf_isabella, bm_george, bm_lewis, bm_daniel. Speed: 0.5x to 2.0x.
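Invalid voice IDs or out-of-range speeds can be caught client-side before spending a request. A sketch of a request builder using the voice list, speed range, and the 5000-character limit mentioned earlier (the validation logic is mine, not enforced client-side by the API):

```python
VOICES = {
    "af_heart", "af_bella", "af_nicole", "af_sarah", "af_sky",
    "am_adam", "am_michael",
    "bf_emma", "bf_isabella",
    "bm_george", "bm_lewis", "bm_daniel",
}

def tts_request(text: str, voice: str = "af_heart", speed: float = 1.0) -> dict:
    """Build a /tts/synthesize JSON body, validating voice and speed."""
    if voice not in VOICES:
        raise ValueError(f"Unknown voice: {voice}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"Speed must be between 0.5 and 2.0, got {speed}")
    if len(text) > 5000:
        raise ValueError("Text exceeds the 5000 character limit")
    return {"text": text, "voice": voice, "speed": speed, "format": "wav"}
```

Failing fast on bad parameters keeps the error local instead of surfacing as an opaque HTTP 4xx from the gateway.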


Where to Get It

| Channel | URL | Pricing |
| --- | --- | --- |
| Smithery | https://smithery.ai/server/fabiosuizu/pronunciation-assessment | Free (MCP discovery) |
| MCPize | https://mcpize.com/mcp/speech-ai | $9.99/mo |
| Apify | https://apify.com/vivid_astronaut/pronunciation-assessment-mcp | $0.02/call |
| Azure Marketplace | Coming soon | Free / Basic / Enterprise tiers |
| REST API | apim-ai-apis.azure-api.net | Contact for key |

Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-assessment


What's Next

  • Multilingual support (Mandarin tones, Spanish phonetics)
  • Premium tier for maximum accuracy
  • Streaming STT for real-time transcription
  • Browser SDK for direct client-side integration

Contact: fabio@suizu.com | Brainiall