Add Speech to Your AI Agent in 5 Minutes

#ai #python #tutorial #mcp
Fabio Augusto Suizu

Your agent reads text. It writes text. It reasons about text. But it can't hear a student say "hello" and tell them their /h/ is perfect while their /l/ needs work. Here's how to fix that.


The Problem

AI agents are text-native. They consume text, produce text, and reason over text. But a growing class of applications requires agents to interact with the physical world through sound:

  • Language tutoring agents that listen to a student and give phoneme-level feedback
  • Customer service agents that transcribe calls and respond with natural speech
  • Accessibility agents that read content aloud and accept voice commands
  • Interview practice agents that evaluate spoken responses

You could try to hack this with LLM inference. Ask Opus to "evaluate this pronunciation" and you'll get a confident, plausible, and completely fabricated phoneme analysis. LLMs don't have acoustic models. They can't compute phoneme-level pronunciation scores because they never see the audio signal.

What you need are specialized speech tools.


The Solution: Speech AI Tools

We ship three speech APIs as both a REST interface and an MCP server with 8 tools:

| Capability | Tool | What It Does | Latency |
| --- | --- | --- | --- |
| Pronunciation | assess_pronunciation | Scores pronunciation at phoneme/word/sentence level (0-100) | 257ms p50 |
| Pronunciation | check_pronunciation_service | Health check | <100ms |
| Pronunciation | get_phoneme_inventory | Lists all 39 English phonemes with IPA/ARPAbet | <100ms |
| STT | transcribe_audio | Audio to text with word-level timestamps and confidence | ~430ms |
| STT | check_stt_service | Health check | <100ms |
| TTS | synthesize_speech | Text to natural speech, 12 voices, speed control | ~1.5s |
| TTS | list_tts_voices | Lists available voices with metadata | <100ms |
| TTS | check_tts_service | Health check | <100ms |

The pronunciation scorer uses a 17MB proprietary model that exceeds human expert inter-annotator agreement at the phone level (PCC 0.590 vs 0.555) and at the sentence level (0.711 vs 0.675). The TTS is ranked #1 on TTS Arena and ships 12 English voices. The STT runs on the same 17MB model and returns word-level timestamps with confidence scores.
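PCC here is the Pearson correlation coefficient between two sets of scores, e.g. model phone-level scores versus human annotator scores. As a quick illustration of what a PCC around 0.59 is measuring (the toy numbers below are made up, not benchmark data):

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: model phone scores that track human scores closely
# produce a PCC near 1; uncorrelated scores drift toward 0.
model = [45, 60, 72, 88, 91]
human = [50, 58, 70, 85, 95]
print(pearson(model, human))  # a value close to 1
```

A PCC of 0.590 against human labels, when humans only agree with each other at 0.555, is the sense in which the model "exceeds" expert agreement.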


Quick Start: MCP (Claude Desktop, Cursor, Windsurf, etc.)

The fastest path. Your agent gets 8 speech tools through one configuration block.

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "speech-ai": {
      "command": "npx",
      "args": [
        "-y",
        "@smithery/cli@latest",
        "run",
        "fabiosuizu/pronunciation-assessment",
        "--config",
        "{\"apimSubscriptionKey\":\"YOUR_KEY\"}"
      ]
    }
  }
}

Cursor / Windsurf

Add to your MCP settings (.cursor/mcp.json or equivalent):

{
  "mcpServers": {
    "speech-ai": {
      "command": "npx",
      "args": [
        "-y",
        "@smithery/cli@latest",
        "run",
        "fabiosuizu/pronunciation-assessment",
        "--config",
        "{\"apimSubscriptionKey\":\"YOUR_KEY\"}"
      ]
    }
  }
}

Direct Streamable HTTP (any MCP client)

If your client supports remote MCP servers:

URL: https://pronunciation-mcp.thankfulfield-a7857897.eastus.azurecontainerapps.io/mcp
Transport: Streamable HTTP

After configuration, your agent has access to all 8 tools. Ask it: "Assess the pronunciation of this audio recording against the text 'Hello world'" and it will call assess_pronunciation automatically.
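A malformed config file is a common failure mode: if the JSON doesn't parse, the client typically won't load the server at all. A minimal sanity check (the helper name is mine, not part of any client) that the file is valid JSON and actually declares the server:

```python
import json

def check_mcp_config(text: str, server_name: str = "speech-ai") -> bool:
    """Return True if the config text is valid JSON and declares server_name."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError:
        return False
    return server_name in cfg.get("mcpServers", {})

# Example against a config shaped like the snippets above:
sample = '{"mcpServers": {"speech-ai": {"command": "npx", "args": []}}}'
print(check_mcp_config(sample))  # True
```

Run it against the contents of `claude_desktop_config.json` (or `.cursor/mcp.json`) after editing, before restarting the client.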


Quick Start: Python (REST API)

For direct API integration without MCP.

Install

pip install httpx

Pronunciation Assessment

import base64
import httpx

API_BASE = "https://apim-ai-apis.azure-api.net"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "YOUR_KEY",
    "Content-Type": "application/json",
}

# Read and encode audio
with open("recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# Assess pronunciation
response = httpx.post(
    f"{API_BASE}/pronunciation/assess/base64",
    json={"audio": audio_b64, "text": "The quick brown fox", "format": "wav"},
    headers=HEADERS,
    timeout=30.0,
)
result = response.json()

print(f"Overall: {result['overallScore']}/100")
print(f"Confidence: {result['confidence']:.2f}")
for word in result["words"]:
    phonemes = ", ".join(f"{p['phoneme']}={p['score']}" for p in word["phonemes"])
    print(f"  {word['word']}: {word['score']}/100 [{phonemes}]")

Speech-to-Text

response = httpx.post(
    f"{API_BASE}/stt/transcribe/base64",
    json={"audio": audio_b64, "include_timestamps": True},
    headers=HEADERS,
    timeout=30.0,
)
result = response.json()

print(f"Transcript: {result['text']}")
for word in result.get("words", []):
    print(f"  [{word['start']:.2f}s - {word['end']:.2f}s] {word['word']} (conf: {word['confidence']:.2f})")

Text-to-Speech

response = httpx.post(
    f"{API_BASE}/tts/synthesize",
    json={"text": "Hello, how are you today?", "voice": "af_heart", "speed": 1.0, "format": "wav"},
    headers=HEADERS,
    timeout=60.0,
)

# Response is binary WAV
with open("output.wav", "wb") as f:
    f.write(response.content)
print(f"Saved {len(response.content)} bytes to output.wav")

Quick Start: LangChain

Wrap the APIs as LangChain tools for use in any agent framework.

import base64
import httpx
from langchain_core.tools import tool

API_BASE = "https://apim-ai-apis.azure-api.net"
HEADERS = {
    "Ocp-Apim-Subscription-Key": "YOUR_KEY",
    "Content-Type": "application/json",
}

@tool
def assess_pronunciation(audio_path: str, text: str) -> dict:
    """Score English pronunciation at phoneme, word, and sentence level.

    Args:
        audio_path: Path to the audio file (WAV, MP3, OGG, WebM).
        text: The reference English text the speaker was reading.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    resp = httpx.post(
        f"{API_BASE}/pronunciation/assess/base64",
        json={"audio": audio_b64, "text": text, "format": audio_path.split(".")[-1]},
        headers=HEADERS,
        timeout=30.0,
    )
    return resp.json()

@tool
def transcribe_audio(audio_path: str) -> dict:
    """Transcribe audio to text with word-level timestamps.

    Args:
        audio_path: Path to the audio file.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    resp = httpx.post(
        f"{API_BASE}/stt/transcribe/base64",
        json={"audio": audio_b64, "include_timestamps": True},
        headers=HEADERS,
        timeout=30.0,
    )
    return resp.json()

@tool
def synthesize_speech(text: str, voice: str = "af_heart") -> str:
    """Generate natural English speech from text. Returns path to WAV file.

    Args:
        text: English text to speak (max 5000 chars).
        voice: Voice ID. Options: af_heart, af_bella, am_adam, bf_emma, bm_george, etc.
    """
    resp = httpx.post(
        f"{API_BASE}/tts/synthesize",
        json={"text": text, "voice": voice, "speed": 1.0, "format": "wav"},
        headers=HEADERS,
        timeout=60.0,
    )
    path = "/tmp/tts_output.wav"
    with open(path, "wb") as f:
        f.write(resp.content)
    return path

# Use in any LangChain agent
tools = [assess_pronunciation, transcribe_audio, synthesize_speech]

Example: Build a Pronunciation Tutor Agent

A complete pronunciation tutor in about 30 lines. The agent generates a target sentence, speaks it, listens to the student, and gives phoneme-level feedback.

import base64
import httpx

API = "https://apim-ai-apis.azure-api.net"
KEY = {"Ocp-Apim-Subscription-Key": "YOUR_KEY", "Content-Type": "application/json"}

def tutor_session(sentence: str, student_audio_path: str):
    """Run one pronunciation tutoring cycle."""

    # 1. Generate target speech so the student hears correct pronunciation
    tts = httpx.post(f"{API}/tts/synthesize",
        json={"text": sentence, "voice": "af_heart", "speed": 0.9, "format": "wav"},
        headers=KEY, timeout=60)
    with open("target.wav", "wb") as f:
        f.write(tts.content)
    print(f"Listen to target.wav, then record yourself saying: '{sentence}'")

    # 2. Transcribe what the student actually said
    with open(student_audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    stt = httpx.post(f"{API}/stt/transcribe/base64",
        json={"audio": audio_b64}, headers=KEY, timeout=30).json()
    print(f"You said: {stt['text']}")

    # 3. Score pronunciation against the target
    result = httpx.post(f"{API}/pronunciation/assess/base64",
        json={"audio": audio_b64, "text": sentence, "format": "wav"},
        headers=KEY, timeout=30).json()

    # 4. Give feedback
    print(f"\nScore: {result['overallScore']}/100 (confidence: {result['confidence']:.2f})")
    for w in result["words"]:
        weak = [p for p in w["phonemes"] if p["score"] < 60]
        status = "needs work" if weak else "good"
        print(f"  {w['word']}: {w['score']}/100 ({status})")
        for p in weak:
            print(f"    -> /{p['phoneme']}/ scored {p['score']} — practice this sound")

# Run it
tutor_session("The quick brown fox jumps over the lazy dog", "student_recording.wav")

Output:

Listen to target.wav, then record yourself saying: 'The quick brown fox jumps over the lazy dog'
You said: the quick brown fox jumps over the lazy dog

Score: 78/100 (confidence: 0.92)
  the: 85/100 (good)
  quick: 72/100 (needs work)
    -> /K/ scored 45 — practice this sound
  brown: 88/100 (good)
  fox: 90/100 (good)
  jumps: 65/100 (needs work)
    -> /JH/ scored 38 — practice this sound
  over: 82/100 (good)
  the: 80/100 (good)
  lazy: 76/100 (good)
  dog: 91/100 (good)
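The feedback step (step 4 above) is pure data processing on the assessment response, so it can be factored into a testable function. A sketch against a mock response shaped like the example output (field names follow the earlier code; the helper name is mine):

```python
def weak_phonemes(result: dict, threshold: int = 60) -> list[tuple[str, str, int]]:
    """Collect (word, phoneme, score) triples scoring below the threshold."""
    return [
        (w["word"], p["phoneme"], p["score"])
        for w in result["words"]
        for p in w["phonemes"]
        if p["score"] < threshold
    ]

# Mock assessment response mirroring the example output above
mock = {
    "overallScore": 78,
    "words": [
        {"word": "quick", "score": 72,
         "phonemes": [{"phoneme": "K", "score": 45}, {"phoneme": "W", "score": 80}]},
        {"word": "jumps", "score": 65,
         "phonemes": [{"phoneme": "JH", "score": 38}, {"phoneme": "M", "score": 75}]},
    ],
}
print(weak_phonemes(mock))  # [('quick', 'K', 45), ('jumps', 'JH', 38)]
```

Keeping the feedback logic separate from the HTTP calls makes it easy to unit-test drill selection without hitting the API.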

API Reference

Endpoints

| API | Method | URL | Auth |
| --- | --- | --- | --- |
| Pronunciation | POST | https://apim-ai-apis.azure-api.net/pronunciation/assess/base64 | Ocp-Apim-Subscription-Key |
| STT | POST | https://apim-ai-apis.azure-api.net/stt/transcribe/base64 | Ocp-Apim-Subscription-Key |
| TTS | POST | https://apim-ai-apis.azure-api.net/tts/synthesize | Ocp-Apim-Subscription-Key |

Audio Formats

WAV (recommended), MP3, OGG, WebM, and FLAC (STT only). Audio must be base64-encoded for the JSON endpoints.
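The base64 step is identical for all three JSON endpoints, so it's worth extracting into a helper that also derives the `format` field from the file extension (helper name and format set are my own, based on the list above):

```python
import base64
from pathlib import Path

SUPPORTED_FORMATS = {"wav", "mp3", "ogg", "webm", "flac"}

def encode_audio(path: str) -> tuple[str, str]:
    """Return (base64_audio, format) for a JSON request body."""
    fmt = Path(path).suffix.lstrip(".").lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {fmt}")
    data = Path(path).read_bytes()
    return base64.b64encode(data).decode(), fmt

# Usage: audio_b64, fmt = encode_audio("recording.wav")
# then   json={"audio": audio_b64, "text": "...", "format": fmt}
```

Remember that FLAC is only accepted by the STT endpoint, so check the format before reusing the same encoded payload across endpoints.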

TTS Voices

12 voices: af_heart (default), af_bella, af_nicole, af_sarah, af_sky, am_adam, am_michael, bf_emma, bf_isabella, bm_george, bm_lewis, bm_daniel. Speed: 0.5x to 2.0x.
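Invalid voice IDs or out-of-range speeds can be caught client-side before spending a request. A sketch of a request builder using the voice list, speed range, and the 5000-character limit mentioned earlier (the validation logic is mine, not enforced client-side by the API):

```python
VOICES = {
    "af_heart", "af_bella", "af_nicole", "af_sarah", "af_sky",
    "am_adam", "am_michael",
    "bf_emma", "bf_isabella",
    "bm_george", "bm_lewis", "bm_daniel",
}

def tts_request(text: str, voice: str = "af_heart", speed: float = 1.0) -> dict:
    """Build a /tts/synthesize JSON body, validating voice and speed."""
    if voice not in VOICES:
        raise ValueError(f"Unknown voice: {voice}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"Speed must be between 0.5 and 2.0, got {speed}")
    if len(text) > 5000:
        raise ValueError("Text exceeds the 5000 character limit")
    return {"text": text, "voice": voice, "speed": speed, "format": "wav"}
```

Failing fast on bad parameters keeps the error local instead of surfacing as an opaque HTTP 4xx from the gateway.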


Where to Get It

| Channel | URL | Pricing |
| --- | --- | --- |
| Smithery | https://smithery.ai/server/fabiosuizu/pronunciation-assessment | Free (MCP discovery) |
| MCPize | https://mcpize.com/mcp/speech-ai | $9.99/mo |
| Apify | https://apify.com/vivid_astronaut/pronunciation-assessment-mcp | $0.02/call |
| Azure Marketplace | Coming soon | Free / Basic / Enterprise tiers |
| REST API | apim-ai-apis.azure-api.net | Contact for key |

Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-assessment


What's Next

  • Multilingual support (Mandarin tones, Spanish phonetics)
  • Premium tier for maximum accuracy
  • Streaming STT for real-time transcription
  • Browser SDK for direct client-side integration

Contact: fabio@suizu.com | Brainiall