Building a Voice-Powered Loan Assistant with the Gemini Live API and Google Cloud
Created for the #GeminiLiveAgentChallenge hackathon
In India, over 800 million people access the internet primarily through their phones — and a significant number are not comfortable typing in English. Traditional loan application forms create a massive friction point: small text inputs, confusing field formats (PAN numbers, date formats), and English-only interfaces exclude a huge portion of the population.
What if users could just talk to fill out their loan application — in their own language?
InstaMoney is a real-time, multilingual voice assistant that fills out loan application forms through natural speech. Users click a microphone button and simply talk. The AI listens, understands, extracts form field values, and auto-fills the form — all in real time with zero typing.
As the user speaks, Gemini calls a tool (`fill_form_field`) to extract structured data with confidence scores. The entire experience is powered by Gemini 2.5 Flash Native Audio through the Google GenAI SDK, using the `bidiGenerateContent` Live API for true bidirectional audio streaming.
```python
import google.genai as genai

client = genai.Client(api_key=GEMINI_API_KEY)

# live.connect is an async context manager; the session is only valid inside it.
async with client.aio.live.connect(
    model="gemini-2.5-flash-native-audio-latest",
    config={
        "response_modalities": ["AUDIO"],
        "tools": [fill_form_field_tool],
        "input_audio_transcription": {},
        "output_audio_transcription": {},
    },
) as session:
    ...
```
Key architectural decisions:
Raw PCM over WebSocket — Audio flows as binary PCM (16-bit, 24kHz, mono) directly from the browser microphone to Gemini and back. No base64 encoding overhead.
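The browser's Web Audio API produces Float32 samples, so before streaming they have to be converted to raw signed 16-bit little-endian PCM. A minimal sketch of that conversion (shown in Python for illustration; in InstaMoney the equivalent runs in the browser):

```python
import struct

def float32_to_pcm16(samples):
    """Convert Web Audio float samples in [-1.0, 1.0] to raw 16-bit LE PCM bytes."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))                # clamp to the valid range
        out += struct.pack("<h", int(s * 32767))  # signed 16-bit, little-endian
    return bytes(out)
```

The resulting bytes can be sent over the WebSocket as a binary frame with no base64 step.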
Tool calling for structured output — Instead of parsing free-text, Gemini's native function calling ensures reliable field extraction:
fill_form_field(field_name="fullName", value="Vinayak Maskar", confidence="high")
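For this to work, the tool has to be declared in the Live API config. The article doesn't show the declaration, so the shape below is an assumed sketch; the field names and descriptions are illustrative:

```python
# Assumed declaration of the fill_form_field tool passed in the Live API
# config's "tools" list; the parameter set here is illustrative.
fill_form_field_tool = {
    "function_declarations": [
        {
            "name": "fill_form_field",
            "description": "Write one extracted value into the loan application form.",
            "parameters": {
                "type": "OBJECT",
                "properties": {
                    "field_name": {
                        "type": "STRING",
                        "description": "Form field to fill, e.g. fullName or panNumber.",
                    },
                    "value": {
                        "type": "STRING",
                        "description": "Extracted value in Latin script.",
                    },
                    "confidence": {
                        "type": "STRING",
                        "enum": ["high", "medium", "low"],
                    },
                },
                "required": ["field_name", "value", "confidence"],
            },
        }
    ]
}
```

Constraining `confidence` to an enum keeps downstream handling simple: the frontend can highlight low-confidence fields for user review instead of silently accepting them.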
Async receive loop — A background task started with `asyncio.create_task` reads from the Gemini session continuously, enabling true real-time streaming without blocking the WebSocket consumer.

The backend runs on Cloud Run with Daphne (ASGI server) for WebSocket support. We automated the entire deployment with a single script.
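The receive-loop pattern can be sketched as follows. An `asyncio.Queue` stands in for the Gemini session here so the sketch is self-contained; in the real backend the loop iterates over session responses instead:

```python
import asyncio

async def receive_loop(session_queue, on_audio):
    # Drain messages from the (stand-in) session continuously so playback
    # audio is forwarded the moment it arrives; None marks session close.
    while True:
        msg = await session_queue.get()
        if msg is None:
            break
        on_audio(msg)

async def demo():
    q = asyncio.Queue()
    chunks = []
    # Launch the reader as a background task, as the real consumer does with
    # the Gemini session, so the caller is never blocked waiting on audio.
    task = asyncio.create_task(receive_loop(q, chunks.append))
    for item in (b"audio-1", b"audio-2", None):
        await q.put(item)
    await task
    return chunks
```

Because the reader runs as its own task, the WebSocket handler stays free to forward microphone audio upstream at the same time.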
The frontend is built with React Native (Expo) for cross-platform support and features a glassmorphism UI.
| Layer | Technology |
|---|---|
| AI Model | Gemini 2.5 Flash Native Audio |
| AI SDK | Google GenAI SDK (google-genai) |
| Backend | Django Channels + Daphne |
| Hosting | Google Cloud Run |
| Secrets | Google Secret Manager |
| Build | Google Cloud Build |
| Frontend | React Native (Expo) |
| Audio | Web Audio API (PCM 24kHz) |
PKCE and JWT auth over WebSocket — The browser WebSocket API doesn't allow custom HTTP headers, so we pass the JWT as a WebSocket subprotocol during the handshake.
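On the server, the offered subprotocols are available during the handshake (in Django Channels, via `scope["subprotocols"]`). A minimal sketch of the extraction, assuming a client convention of offering `["jwt", "<token>"]` (the exact convention is not stated in the article):

```python
def token_from_subprotocols(subprotocols):
    """Pull the JWT out of the Sec-WebSocket-Protocol offer list.

    Assumed convention: the client offers ["jwt", "<token>"]; the server
    accepts the "jwt" subprotocol and treats the second entry as the token.
    """
    if len(subprotocols) == 2 and subprotocols[0] == "jwt":
        return subprotocols[1]
    return None
```

The server must then accept the connection echoing the `jwt` subprotocol, and reject the handshake outright if token verification fails.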
Multilingual tool calling — Gemini needed to extract Latin-script values from Hindi/Marathi speech. We solved this with detailed system prompt engineering and validation layers.
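The validation layer can be sketched as per-field format checks applied to every tool-call value before it reaches the form. The field names below are illustrative; the PAN pattern is the standard Indian layout of five letters, four digits, and one letter:

```python
import re

# Standard Indian PAN layout: five letters, four digits, one trailing letter.
PAN_RE = re.compile(r"^[A-Z]{5}[0-9]{4}[A-Z]$")

VALIDATORS = {
    # Illustrative field names; each validator gates one tool-call value.
    "panNumber": lambda v: bool(PAN_RE.match(v.strip().upper())),
    "fullName": lambda v: bool(v.strip()),
}

def validate_field(field_name, value):
    """Reject extracted values that don't fit the field's expected format."""
    check = VALIDATORS.get(field_name)
    return check(value) if check else True
```

Checks like these catch cases where the model transliterates a spoken value incorrectly, so a malformed PAN never silently lands in the form.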
Barge-in handling — When the user interrupts, we need to stop audio playback immediately on the frontend while the backend handles the interruption event from the Live API.
Auto-reconnection — The Live API session can disconnect. We built a 3-retry auto-reconnect system with exponential backoff that's transparent to the user.
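The delay schedule for the three retries can be sketched as below, assuming a base delay of one second that doubles per attempt (the article doesn't state the exact parameters):

```python
def backoff_delays(retries=3, base=1.0, cap=30.0):
    """Exponential backoff schedule for the auto-reconnect loop: 1s, 2s, 4s.

    Each delay doubles the previous one, capped so a long outage never
    produces an unreasonable wait between attempts.
    """
    return [min(cap, base * 2 ** i) for i in range(retries)]
```

The reconnect loop sleeps for each delay in turn before re-opening the Live API session, and only surfaces an error to the user after the final attempt fails.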
This blog post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
Built by Vinayak Maskar