Younes Laaroussi
This article was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
What if your browser had a friend? Not a chatbot. Not an assistant. A little spirit that floats next to your cursor, listens to your voice, watches your screen, and just... does things for you.
That's Phantom. And the weirdest part isn't what it does — it's how it got built.
I wanted to talk to my browser. Not type commands into a terminal. Not click through menus. Just say "open YouTube and search for lo-fi music" and have it happen.
The Gemini Live API made this possible — real-time bidirectional audio over WebSockets, with function calling baked in. The model can listen, talk back, AND execute tools, all in the same stream. No polling. No turn-based nonsense. Just a live conversation where the AI can actually do things.
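To make the "live conversation with tools" shape concrete, here is a minimal sketch of the setup message a client sends when opening a Live session over WebSocket. The model name and the `open_url` tool are illustrative assumptions, not Phantom's exact config:

```javascript
// Sketch of a Gemini Live session setup message: declare response modality
// and the tools the model may call mid-conversation. Model name and tool
// definition below are illustrative assumptions.
function buildSetupMessage(model, tools) {
  return {
    setup: {
      model: `models/${model}`,
      generationConfig: { responseModalities: ["AUDIO"] },
      tools: [{ functionDeclarations: tools }],
    },
  };
}

const setup = buildSetupMessage("gemini-2.0-flash-live-001", [
  { name: "open_url", description: "Navigate the active tab to a URL" },
]);
console.log(JSON.stringify(setup, null, 2));
```

Once this setup frame is acknowledged, audio flows both ways on the same socket and tool calls arrive as server messages — no separate request/response round trips.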
Here's where it gets meta. I used Gemini as my coding agent throughout the entire build. Not just for boilerplate — for architecture decisions, debugging WebSocket frame formats, generating deployment scripts, even creating the mascot art.
The whole project — Chrome extension, Cloud Run proxy, landing page, 8-persona system, sound design, animated sprites, onboarding flow, trace debugger — was built in a single extended session. One human, one AI, rapid-fire iteration.
Phantom is a Chrome extension (built with Plasmo) that opens a side panel. When you tap the mic button, it starts streaming your voice — and, with vision enabled, screen frames — to Gemini Live.
The key insight: Gemini Live's realtimeInput lets you send audio and video frames simultaneously. The model processes them together. So when you say "click the blue button," it can actually see the blue button in the video stream and figure out which element you mean.
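A sketch of what one of those combined messages could look like — a `realtimeInput` carrying an audio chunk and a JPEG frame in the same payload (the placeholder bytes and exact field values here are assumptions for illustration):

```javascript
// Minimal sketch of a realtimeInput message carrying one audio chunk and
// one JPEG frame together, so the model hears and sees in the same stream.
// Payloads are placeholder bytes, not real media.
function buildRealtimeInput(audioBase64, jpegBase64) {
  return {
    realtimeInput: {
      mediaChunks: [
        { mimeType: "audio/pcm;rate=16000", data: audioBase64 },
        { mimeType: "image/jpeg", data: jpegBase64 },
      ],
    },
  };
}

const msg = buildRealtimeInput(
  Buffer.from("fake-pcm").toString("base64"),
  Buffer.from("fake-jpeg").toString("base64")
);
console.log(msg.realtimeInput.mediaChunks.length); // 2 chunks, one message
```

Because both modalities ride the same stream, "click the blue button" and the pixels of the blue button arrive in the same conversational context.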
Free API keys have rate limits. We rotate through multiple keys on the server side, and the Cloud Run proxy handles the WebSocket relay. The client never sees the API key.
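The rotation itself can be as simple as a round-robin picker the proxy consults per connection — a hedged sketch (key values are fakes, and the real server may add health checks or per-key rate tracking):

```javascript
// Hedged sketch of server-side API key rotation: a round-robin picker.
// Real deployments would likely also track per-key rate-limit errors.
class KeyRotator {
  constructor(keys) {
    this.keys = keys;
    this.index = 0;
  }
  next() {
    const key = this.keys[this.index];
    this.index = (this.index + 1) % this.keys.length;
    return key;
  }
}

const rotator = new KeyRotator(["fake-key-1", "fake-key-2", "fake-key-3"]);
console.log(rotator.next()); // fake-key-1
console.log(rotator.next()); // fake-key-2
console.log(rotator.next()); // fake-key-3
console.log(rotator.next()); // wraps back to fake-key-1
```

Since rotation happens server-side, the browser extension only ever talks to the proxy's WebSocket endpoint.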
One painful discovery: when proxying WebSocket frames through Node.js, the ws library's default maxPayload silently drops large messages. Our 100KB JPEG frames were vanishing. A one-line fix (maxPayload: 10 * 1024 * 1024) solved hours of "why is the model hallucinating what's on screen."
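The fix in context — a sketch of the relay server's constructor options (port and other settings here are illustrative, only the `maxPayload` line is the one from the article):

```javascript
const { WebSocketServer } = require("ws"); // npm install ws

// The one-line fix: raise maxPayload so 100KB JPEG frames survive the relay
// instead of being silently dropped. Port is illustrative.
const wss = new WebSocketServer({
  port: 8080,
  maxPayload: 10 * 1024 * 1024, // 10 MiB ceiling for incoming frames
});
```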
The model has access to 14 browser tools via Gemini's function calling — navigating, clicking, typing, and more.
Each tool plays its own sound effect (generated via ElevenLabs' SFX API) — a soft whoosh for navigation, crystal clicks for typing, gentle chimes for success.
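A sketch of how one such tool might be declared and wired to its sound effect. The tool name, parameter schema, and file names are illustrative assumptions, not Phantom's actual definitions:

```javascript
// Sketch of one browser tool as a Gemini function declaration, plus a
// tool-to-sound lookup. Names and schema here are illustrative assumptions.
const clickElementTool = {
  name: "click_element",
  description: "Click a visible element on the current page",
  parameters: {
    type: "OBJECT",
    properties: {
      selector: { type: "STRING", description: "CSS selector of the target" },
    },
    required: ["selector"],
  },
};

// Each tool maps to a sound effect played when it executes.
const toolSounds = {
  click_element: "crystal-click.mp3",
  open_url: "soft-whoosh.mp3",
};
console.log(toolSounds[clickElementTool.name]); // crystal-click.mp3
```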
Here's the uncomfortable truth about screen-sharing AI agents: they see everything. Your passwords. Your credit cards. Your API keys. Every token, every secret, every SSN on screen — all of it gets sent as JPEG frames to a remote model.
We built Privacy Shield to fix this.
Before every frame capture (once per second), Phantom injects a script into the active page that scans for sensitive fields — password inputs, fields with `autocomplete="cc-number"`, inputs named `ssn`, `token`, `api_key`, and the like — and masks them before the frame is sent.

The result: Gemini sees your screen, but never sees your secrets.
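A hedged sketch of that injected check: given an input's attributes, decide whether Privacy Shield should mask it. The heuristics mirror the list above; Phantom's actual rules may be broader:

```javascript
// Sketch of the sensitive-field check the injected script could run per
// input element. Heuristics follow the article; real rules may differ.
const SENSITIVE_NAMES = /(ssn|token|api[_-]?key|secret|password)/i;

function isSensitiveField({ type = "", name = "", autocomplete = "" }) {
  if (type === "password") return true;             // all password inputs
  if (autocomplete.startsWith("cc-")) return true;  // cc-number, cc-csc, ...
  return SENSITIVE_NAMES.test(name);                // ssn, token, api_key, ...
}

console.log(isSensitiveField({ type: "password" }));             // true
console.log(isSensitiveField({ name: "api_key" }));              // true
console.log(isSensitiveField({ autocomplete: "cc-number" }));    // true
console.log(isSensitiveField({ type: "text", name: "search" })); // false
```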
| Category | Pattern |
|---|---|
| Passwords | All `type="password"` inputs, login forms |
| Credit cards | `4111-1111-1111-1111` style numbers |
| SSNs | `123-45-6789` format |
| API keys | Google (`AIza...`), OpenAI (`sk-...`), AWS (`AKIA...`) |
| Tokens | Bearer tokens, private keys, generic secrets |
| Form context | Any input inside a `/login` or `/payment` form action |
No API calls. No cloud services. No model inference. Pure DOM analysis + regex, running in under 5ms per frame. For production, this could be augmented with Google Cloud DLP's 150+ infoType detectors — but for real-time 1 FPS streaming, the deterministic approach is faster and more reliable.
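The deterministic text pass can be sketched as a handful of regexes applied in sequence — patterns below follow the table above, though Phantom's real rules may be broader:

```javascript
// Sketch of the deterministic pattern pass: pure regex, no API calls.
// Patterns mirror the table above; real Phantom rules may be broader.
const PATTERNS = [
  /\b4\d{3}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, // Visa-style card numbers
  /\b\d{3}-\d{2}-\d{4}\b/g,                    // SSNs
  /\bAIza[0-9A-Za-z_-]{10,}\b/g,               // Google API keys
  /\bsk-[0-9A-Za-z]{10,}\b/g,                  // OpenAI-style keys
  /\bAKIA[0-9A-Z]{16}\b/g,                     // AWS access key IDs
];

function redact(text) {
  return PATTERNS.reduce((t, re) => t.replace(re, "[REDACTED]"), text);
}

console.log(redact("card 4111-1111-1111-1111 and key AIzaSyFakeExample123"));
// card [REDACTED] and key [REDACTED]
```

Plain string scans like this are why the whole pass stays deterministic and fast enough for a once-per-second capture loop.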
Every screen-sharing AI tool should have this. Most don't. We tested Google Cloud DLP — it's thorough but takes ~2 seconds per request. At 1 FPS, that's unusable. Privacy Shield runs in 5ms and catches the patterns that matter most in a browser context.
This isn't a feature. It's a responsibility.
Halfway through the build, I had a working agent. It could hear, see, and act. But it felt like a tool. Functional. Sterile. The kind of thing you demo once and forget.
So I asked Gemini to generate pixel art mascots.
I fed it a prompt for a "one-eyed spirit wisp, ethereal blue-purple glow, 64x64 pixel art" and got back something with genuine character. A little floating creature with a single curious eye. It looked like it belonged in a SNES game.
Then I went further. I asked for variations: the same wisp wearing a detective hat, a crown, nerdy glasses, a pirate hat, headphones, a wizard hat, and tiny devil horns. Same style, same palette, all consistent. Gemini's image generation (via gemini-2.5-flash-image) kept the character recognizable across every variation.
These became personas — not just cosmetic skins, but full personality packages:
| Persona | Voice | Vibe |
|---|---|---|
| Phantom | Kore | Friendly, curious spirit |
| Sleuth | Charon | Noir detective, dramatic |
| Regent | Orus | Regal, dignified |
| Byte | Puck | Nerdy, excitable |
| Captain | Fenrir | Pirate, adventurous |
| Vibe | Aoede | Chill, laid back |
| Arcane | Zephyr | Mystical wizard |
| Gremlin | Leda | Chaotic, mischievous |
Each persona has its own Gemini voice, mascot image, and system prompt that shapes how the agent talks. When you pick "Captain," the agent calls websites "islands" and says "aye aye!" When you pick "Gremlin," it's gleefully chaotic but still gets the job done.
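A sketch of how such a persona package might be wired into the session config. The prompt text and sprite file names are illustrative assumptions; the voice names come from the table above:

```javascript
// Sketch of persona packaging: each persona bundles a Gemini voice, a
// mascot sprite, and a system prompt. Prompts and sprite names are
// illustrative assumptions; voice names are from the table above.
const PERSONAS = {
  captain: {
    voice: "Fenrir",
    sprite: "wisp-pirate.png",
    systemPrompt:
      "You are a swashbuckling pirate spirit. Call websites 'islands' and say 'aye aye!' while completing tasks.",
  },
  gremlin: {
    voice: "Leda",
    sprite: "wisp-horns.png",
    systemPrompt:
      "You are gleefully chaotic and mischievous, but you always get the job done.",
  },
};

// Turn a persona choice into the per-session voice + system instruction.
function setupForPersona(id) {
  const p = PERSONAS[id];
  return {
    voiceConfig: { prebuiltVoiceConfig: { voiceName: p.voice } },
    systemInstruction: p.systemPrompt,
  };
}

console.log(setupForPersona("captain").voiceConfig.prebuiltVoiceConfig.voiceName); // Fenrir
```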
Users pick their persona during onboarding. It's the second screen they see, right after "Hey, I'm Phantom." Judges remember characters. They forget features.
The most honest thing I can say about this project is that it was a collaboration between a human with ideas and an AI with execution speed.
Here's what Gemini specifically helped build:

- Mascot art via gemini-2.5-flash-image with img2img — I provided the base wisp and asked for costume variations
- Debugging the WebSocket proxy's maxPayload bug

The sound effects came from ElevenLabs' SFX API — text descriptions like "soft magical chime, fairy-like sparkle, UI connect sound" turned into actual audio files that now play when you connect, toggle vision, or execute a tool.
What took days in previous projects took hours here. Not because the code was simpler, but because the iteration loop was: idea → implement → test → fix → next, with no context-switching overhead.
Phantom is open source. Install the Chrome extension, pick a persona, and start talking to your browser.
GitHub: github.com/youneslaaroussi/Phantom
Live site: phantom-server-pio3n3nsna-uc.a.run.app
Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge