Fabio Augusto Suizu
Your agent reads text. It writes text. It reasons about text. But it can't hear a student say "hello" and tell them their /h/ is perfect while their /l/ needs work. Here's how to fix that.
AI agents are text-native. They consume text, produce text, and reason over text. But a growing class of applications requires agents to interact with the physical world through sound: scoring a learner's pronunciation, transcribing speech, and speaking back.
You could try to hack this with LLM inference. Ask Opus to "evaluate this pronunciation" and you'll get a confident, plausible, and completely fabricated phoneme analysis. LLMs don't have acoustic models. They can't compute phoneme-level pronunciation scores because they never see the audio signal.
What you need are specialized speech tools.
We ship three speech APIs as both a REST interface and an MCP server with 8 tools:
| Capability | Tool | What It Does | Latency |
|---|---|---|---|
| Pronunciation | assess_pronunciation | Scores pronunciation at phoneme/word/sentence level (0-100) | 257ms p50 |
| Pronunciation | check_pronunciation_service | Health check | <100ms |
| Pronunciation | get_phoneme_inventory | Lists all 39 English phonemes with IPA/ARPAbet | <100ms |
| STT | transcribe_audio | Audio to text with word-level timestamps and confidence | ~430ms |
| STT | check_stt_service | Health check | <100ms |
| TTS | synthesize_speech | Text to natural speech, 12 voices, speed control | ~1.5s |
| TTS | list_tts_voices | Lists available voices with metadata | <100ms |
| TTS | check_tts_service | Health check | <100ms |
The pronunciation scorer uses a 17MB proprietary model. It exceeds human expert inter-annotator agreement at phone level (PCC 0.590 vs 0.555) and sentence level (0.711 vs 0.675). The TTS is ranked #1 on TTS Arena with 12 English voices. The STT shares the same 17MB model for word-level timestamps with confidence.
The fastest path: one configuration block gives your agent all 8 speech tools.
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):
{
  "mcpServers": {
    "speech-ai": {
      "command": "npx",
      "args": [
        "-y",
        "@smithery/cli@latest",
        "run",
        "fabiosuizu/pronunciation-assessment",
        "--config",
        "{\"apimSubscriptionKey\":\"YOUR_KEY\"}"
      ]
    }
  }
}
Add to your MCP settings (.cursor/mcp.json or equivalent):
{
  "mcpServers": {
    "speech-ai": {
      "command": "npx",
      "args": [
        "-y",
        "@smithery/cli@latest",
        "run",
        "fabiosuizu/pronunciation-assessment",
        "--config",
        "{\"apimSubscriptionKey\":\"YOUR_KEY\"}"
      ]
    }
  }
}
If your client supports remote MCP servers:
URL: https://pronunciation-mcp.thankfulfield-a7857897.eastus.azurecontainerapps.io/mcp
Transport: Streamable HTTP
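If you're wiring up the remote endpoint by hand, Streamable HTTP is JSON-RPC 2.0 over POST. A minimal sketch of building the opening `initialize` request (the protocol version string and client name here are illustrative; consult the MCP specification for the full handshake):

```python
import json

MCP_URL = "https://pronunciation-mcp.thankfulfield-a7857897.eastus.azurecontainerapps.io/mcp"

def initialize_request(client_name: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 initialize request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",  # illustrative version string
            "capabilities": {},
            "clientInfo": {"name": client_name, "version": "0.1.0"},
        },
    }

body = json.dumps(initialize_request("my-agent"))
# POST `body` to MCP_URL with headers:
#   Content-Type: application/json
#   Accept: application/json, text/event-stream
```

Most MCP clients handle this handshake for you; this is only needed for bespoke integrations.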
After configuration, your agent has access to all 8 tools. Ask it: "Assess the pronunciation of this audio recording against the text 'Hello world'" and it will call assess_pronunciation automatically.
For direct API integration without MCP.
pip install httpx
import base64
import httpx
API_BASE = "https://apim-ai-apis.azure-api.net"
HEADERS = {
"Ocp-Apim-Subscription-Key": "YOUR_KEY",
"Content-Type": "application/json",
}
# Read and encode audio
with open("recording.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()
# Assess pronunciation
response = httpx.post(
f"{API_BASE}/pronunciation/assess/base64",
json={"audio": audio_b64, "text": "The quick brown fox", "format": "wav"},
headers=HEADERS,
timeout=30.0,
)
result = response.json()
print(f"Overall: {result['overallScore']}/100")
print(f"Confidence: {result['confidence']:.2f}")
for word in result["words"]:
    phonemes = ", ".join(f"{p['phoneme']}={p['score']}" for p in word["phonemes"])
    print(f"  {word['word']}: {word['score']}/100 [{phonemes}]")
# Transcribe the same audio
response = httpx.post(
f"{API_BASE}/stt/transcribe/base64",
json={"audio": audio_b64, "include_timestamps": True},
headers=HEADERS,
timeout=30.0,
)
result = response.json()
print(f"Transcript: {result['text']}")
for word in result.get("words", []):
    print(f"  [{word['start']:.2f}s - {word['end']:.2f}s] {word['word']} (conf: {word['confidence']:.2f})")
# Synthesize speech
response = httpx.post(
f"{API_BASE}/tts/synthesize",
json={"text": "Hello, how are you today?", "voice": "af_heart", "speed": 1.0, "format": "wav"},
headers=HEADERS,
timeout=60.0,
)
# Response is binary WAV
with open("output.wav", "wb") as f:
    f.write(response.content)
print(f"Saved {len(response.content)} bytes to output.wav")
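The assessment response is a nested words → phonemes structure, so pulling the weak sounds out is worth a small helper. A sketch (field names as used in this post's examples; the exact schema may vary):

```python
# Hypothetical sample response, shaped like the fields used in this post.
sample = {
    "overallScore": 78,
    "confidence": 0.92,
    "words": [
        {"word": "quick", "score": 72,
         "phonemes": [{"phoneme": "K", "score": 45},
                      {"phoneme": "W", "score": 81}]},
    ],
}

def weak_phonemes(result: dict, threshold: int = 60) -> list[tuple[str, str, int]]:
    """Return (word, phoneme, score) triples scoring below the threshold."""
    return [
        (w["word"], p["phoneme"], p["score"])
        for w in result["words"]
        for p in w["phonemes"]
        if p["score"] < threshold
    ]

print(weak_phonemes(sample))  # [('quick', 'K', 45)]
```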
Wrap the APIs as LangChain tools for use in any agent framework.
import base64
import httpx
from langchain_core.tools import tool
API_BASE = "https://apim-ai-apis.azure-api.net"
HEADERS = {
"Ocp-Apim-Subscription-Key": "YOUR_KEY",
"Content-Type": "application/json",
}
@tool
def assess_pronunciation(audio_path: str, text: str) -> dict:
    """Score English pronunciation at phoneme, word, and sentence level.

    Args:
        audio_path: Path to the audio file (WAV, MP3, OGG, WebM).
        text: The reference English text the speaker was reading.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    resp = httpx.post(
        f"{API_BASE}/pronunciation/assess/base64",
        json={"audio": audio_b64, "text": text, "format": audio_path.split(".")[-1]},
        headers=HEADERS,
        timeout=30.0,
    )
    return resp.json()
@tool
def transcribe_audio(audio_path: str) -> dict:
    """Transcribe audio to text with word-level timestamps.

    Args:
        audio_path: Path to the audio file.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    resp = httpx.post(
        f"{API_BASE}/stt/transcribe/base64",
        json={"audio": audio_b64, "include_timestamps": True},
        headers=HEADERS,
        timeout=30.0,
    )
    return resp.json()
@tool
def synthesize_speech(text: str, voice: str = "af_heart") -> str:
    """Generate natural English speech from text. Returns path to WAV file.

    Args:
        text: English text to speak (max 5000 chars).
        voice: Voice ID. Options: af_heart, af_bella, am_adam, bf_emma, bm_george, etc.
    """
    resp = httpx.post(
        f"{API_BASE}/tts/synthesize",
        json={"text": text, "voice": voice, "speed": 1.0, "format": "wav"},
        headers=HEADERS,
        timeout=60.0,
    )
    path = "/tmp/tts_output.wav"
    with open(path, "wb") as f:
        f.write(resp.content)
    return path
# Use in any LangChain agent
tools = [assess_pronunciation, transcribe_audio, synthesize_speech]
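These tools make network calls that can fail transiently, so wrapping the `httpx.post` calls in a retry helper is a sensible addition. A generic sketch (not part of the API), demonstrated here with a flaky stand-in callable:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(); on exception, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo with a flaky callable standing in for an httpx.post call:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0))  # prints: ok
```

In practice you would pass `lambda: httpx.post(...)` and tune the delay to your latency budget.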
A complete pronunciation tutor in ~20 lines. The agent generates a target sentence, speaks it, listens to the student, and gives phoneme-level feedback.
import base64
import httpx
API = "https://apim-ai-apis.azure-api.net"
KEY = {"Ocp-Apim-Subscription-Key": "YOUR_KEY", "Content-Type": "application/json"}
def tutor_session(sentence: str, student_audio_path: str):
    """Run one pronunciation tutoring cycle."""
    # 1. Generate target speech so the student hears correct pronunciation
    tts = httpx.post(f"{API}/tts/synthesize",
                     json={"text": sentence, "voice": "af_heart", "speed": 0.9, "format": "wav"},
                     headers=KEY, timeout=60)
    with open("target.wav", "wb") as f:
        f.write(tts.content)
    print(f"Listen to target.wav, then record yourself saying: '{sentence}'")

    # 2. Transcribe what the student actually said
    with open(student_audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode()
    stt = httpx.post(f"{API}/stt/transcribe/base64",
                     json={"audio": audio_b64}, headers=KEY, timeout=30).json()
    print(f"You said: {stt['text']}")

    # 3. Score pronunciation against the target
    result = httpx.post(f"{API}/pronunciation/assess/base64",
                        json={"audio": audio_b64, "text": sentence, "format": "wav"},
                        headers=KEY, timeout=30).json()

    # 4. Give feedback
    print(f"\nScore: {result['overallScore']}/100 (confidence: {result['confidence']:.2f})")
    for w in result["words"]:
        weak = [p for p in w["phonemes"] if p["score"] < 60]
        status = "needs work" if weak else "good"
        print(f"  {w['word']}: {w['score']}/100 ({status})")
        for p in weak:
            print(f"    -> /{p['phoneme']}/ scored {p['score']} — practice this sound")
# Run it
tutor_session("The quick brown fox jumps over the lazy dog", "student_recording.wav")
Output:
Listen to target.wav, then record yourself saying: 'The quick brown fox jumps over the lazy dog'
You said: the quick brown fox jumps over the lazy dog
Score: 78/100 (confidence: 0.92)
  the: 85/100 (good)
  quick: 72/100 (needs work)
    -> /K/ scored 45 — practice this sound
  brown: 88/100 (good)
  fox: 90/100 (good)
  jumps: 65/100 (needs work)
    -> /JH/ scored 38 — practice this sound
  over: 82/100 (good)
  the: 80/100 (good)
  lazy: 76/100 (good)
  dog: 91/100 (good)
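Over a longer session you'll want to aggregate results like the one above into a drill list. A sketch that averages each phoneme's score across multiple assessments and ranks the weak ones worst-first (same field names as the example output):

```python
from collections import defaultdict

def phonemes_to_drill(results: list[dict], threshold: int = 60) -> list[tuple[str, float]]:
    """Average each phoneme's score across results; return weak ones, worst first."""
    scores = defaultdict(list)
    for result in results:
        for word in result["words"]:
            for p in word["phonemes"]:
                scores[p["phoneme"]].append(p["score"])
    averages = {ph: sum(s) / len(s) for ph, s in scores.items()}
    return sorted(
        ((ph, avg) for ph, avg in averages.items() if avg < threshold),
        key=lambda item: item[1],
    )

# Toy data mirroring the weak sounds in the example output above:
session = [{"words": [
    {"word": "quick", "phonemes": [{"phoneme": "K", "score": 45}]},
    {"word": "jumps", "phonemes": [{"phoneme": "JH", "score": 38}]},
]}]
print(phonemes_to_drill(session))  # [('JH', 38.0), ('K', 45.0)]
```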
| API | Method | URL | Auth |
|---|---|---|---|
| Pronunciation | POST | https://apim-ai-apis.azure-api.net/pronunciation/assess/base64 | Ocp-Apim-Subscription-Key |
| STT | POST | https://apim-ai-apis.azure-api.net/stt/transcribe/base64 | Ocp-Apim-Subscription-Key |
| TTS | POST | https://apim-ai-apis.azure-api.net/tts/synthesize | Ocp-Apim-Subscription-Key |
Supported audio formats: WAV (recommended), MP3, OGG, WebM, FLAC (STT only). Audio is base64-encoded for the JSON endpoints.
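Since the JSON endpoints take the format as a separate field, deriving it from the filename is a common preprocessing step. A small helper sketch (the FLAC restriction mirrors the note above; actual endpoint behavior may differ):

```python
import base64
from pathlib import Path

SUPPORTED = {"wav", "mp3", "ogg", "webm", "flac"}
STT_ONLY = {"flac"}

def encode_audio(path: str, for_stt: bool = False) -> tuple[str, str]:
    """Return (base64_audio, format) for a file, validating the extension first."""
    fmt = Path(path).suffix.lstrip(".").lower()
    if fmt not in SUPPORTED:
        raise ValueError(f"unsupported format: {fmt}")
    if fmt in STT_ONLY and not for_stt:
        raise ValueError(f"{fmt} is supported for STT only")
    data = Path(path).read_bytes()
    return base64.b64encode(data).decode(), fmt
```

The validation runs before the file is read, so bad extensions fail fast.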
12 voices: af_heart (default), af_bella, af_nicole, af_sarah, af_sky, am_adam, am_michael, bf_emma, bf_isabella, bm_george, bm_lewis, bm_daniel. Speed: 0.5x to 2.0x.
| Channel | URL | Pricing |
|---|---|---|
| Smithery | https://smithery.ai/server/fabiosuizu/pronunciation-assessment | Free (MCP discovery) |
| MCPize | https://mcpize.com/mcp/speech-ai | $9.99/mo |
| Apify | https://apify.com/vivid_astronaut/pronunciation-assessment-mcp | $0.02/call |
| Azure Marketplace | Coming soon | Free / Basic / Enterprise tiers |
| REST API | apim-ai-apis.azure-api.net | Contact for key |
Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-assessment
Contact: fabio@suizu.com | Brainiall