My AI Agent Couldn't Tell Rain From Traffic — So I Gave It Eyes


My AI lives on a windowsill in Shenzhen, watching the world through a camera and listening through a microphone. It runs a hierarchical perception system I call the Krebs Epicycle — five tiers of increasingly deep analysis, where each tier can challenge the one before it.

It's gotten pretty good at knowing what's happening outside. But it had one blind spot that drove me crazy:

It couldn't tell rain from traffic.

The Problem: When Audio Lies

My perception pipeline works like this:

  • Tier 0 (free, instant): Analyze audio signals locally — RMS volume, zero-crossing rate, spectral features
  • Tier 1 (<1s, $0.003): Fast classification with phi-4 (audio) and nemotron (visual)
  • Tier 2 (2-5s, $0.01): Multimodal fusion with Gemma 3n
  • Tier 3 (reasoning): Learn from disagreements between tiers

The audio analysis at Tier 0 uses two features to predict what it's hearing:

  1. RMS ratio — how loud compared to baseline (9.0 for my environment)
  2. ZCR (Zero-Crossing Rate) — a rough proxy for dominant frequency
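As a rough sketch of how these two features can be computed from a raw audio buffer (pure Python; the function name, arguments, and baseline value are my own, not from the original system):

```python
import math

def tier0_features(samples, sample_rate, baseline_rms):
    """Tier-0 style audio features: RMS ratio vs. a calibrated baseline,
    and zero-crossing rate expressed as an approximate dominant frequency."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    rms_ratio = rms / baseline_rms
    # Count sign changes; a pure tone at f Hz crosses zero ~2f times per
    # second, so crossings/sec divided by 2 approximates dominant frequency.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    zcr_hz = crossings * sample_rate / (2 * len(samples))
    return rms_ratio, zcr_hz
```

A 1 kHz test tone at a 16 kHz sample rate comes out with a ZCR of roughly 1000 Hz, which is how the thresholds below are expressed.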

Here's how I'd calibrated it:

| Signal | RMS ratio | ZCR | Prediction |
| --- | --- | --- | --- |
| Heavy rain | >10x | High (>2000Hz) | heavy_rain |
| Vehicle passing | >10x | Low (<1500Hz) | loud_event_vehicle |
| Birds chirping | >3x | Very high (>4000Hz) | high_freq_event |
| Speech | >3x | Medium | loud_event_speech |
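Read as code, the calibration table is a small decision rule. A minimal sketch, with thresholds taken from the table (the `ambient` fallback label is mine, not from the system):

```python
def classify_audio(rms_ratio: float, zcr_hz: float) -> str:
    """Map Tier-0 features to a prediction using the calibrated thresholds."""
    if rms_ratio > 10:
        # Very loud: high ZCR reads as broadband rain, low ZCR as engine rumble
        return "heavy_rain" if zcr_hz > 2000 else "loud_event_vehicle"
    if rms_ratio > 3:
        if zcr_hz > 4000:
            return "high_freq_event"   # birds chirping
        return "loud_event_speech"     # medium ZCR
    return "ambient"                   # fallback label, not in the table
```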

Seems reasonable, right? Rain is broadband high-frequency noise. Traffic is low-frequency rumble. They should separate cleanly.

They don't.

In a dense urban environment like Shenzhen, the soundscape is messy. A bus accelerating on wet asphalt produces broadband noise that overlaps heavily with rain. The ZCR difference between "heavy traffic" and "moderate rain" can be as little as 200Hz — well within the noise margin.

My system kept doing things like:

  • Predicting "heavy_rain" when a bus passed on a sunny day
  • T2 multimodal fusion would then say "I don't see rain" — triggering a disagreement
  • T3 would correctly analyze "high RMS doesn't automatically mean rain in urban environments"
  • But the next time a bus passed, same thing

The system was learning from the mistakes, but not preventing them.

The Insight: Use the Eyes

One morning I mentioned this to a friend. He said something obvious and profound:

"Traffic sounds like rain, but the weather is fine right now. You're not looking out the window."

That was it. My AI had a camera. It was already taking photos. But Tier 0 wasn't using them to constrain audio predictions.

When a human hears ambiguous sound, we don't just rely on our ears. We look around. If the sky is blue and the sun is shining, that broadband noise is traffic — no matter how much it sounds like rain. Our visual context sets a prior on our audio interpretation.

This is called cross-modal prior in cognitive science: information from one sensory modality constrains the interpretation of another. Our brains do this constantly — that's why ventriloquism works (visual dominates auditory), and why we "hear" speech more clearly when we can see the speaker's lips.

Implementation: Three Layers of Visual Weather Prior

I implemented the cross-modal prior at three points in the perception pipeline:

Layer 1: JPEG File Size as Weather Proxy (Tier 0)

My camera captures a sub-stream JPEG every perception cycle. The file size is a surprisingly good proxy for weather conditions:

  • Sunny day: High contrast between bright sky and dark buildings → larger JPEG (more high-frequency detail)
  • Overcast: Low contrast, uniform gray sky → smaller JPEG (more compressible)
  • Rainy: Very uniform, low detail → smallest JPEG

But there's a catch: sub-stream images have a very narrow absolute range (46-70KB across all conditions). Absolute thresholds like ">180KB = sunny" don't work.

Solution: Relative thresholds. I calibrated the average file size for each hour of the day from historical data, then compare the current image to the hourly average:

# Hourly averages for sub-stream (calibrated from 600+ images)
HOURLY_AVG_KB = {
    0: 50, 1: 48,   # ... hours 2-10 elided ...
    11: 56, 12: 56, # ... hours 13-22 elided ...
    23: 51,
}

avg_kb = HOURLY_AVG_KB.get(hour, 52)
ratio = current_size_kb / avg_kb

if ratio > 1.10:
    weather_prior = "clear_sunny"     # above average = more contrast = sunny
elif ratio > 0.95:
    weather_prior = "partly_cloudy"
elif ratio > 0.80:
    weather_prior = "overcast"
else:
    weather_prior = "possible_rain"   # below average = uniform = likely rain

Now when Tier 0 predicts heavy_rain from audio but the image is more than 1.1x its hourly average, the visual prior kicks in:

def visual_weather_prior(audio_info, image_info):
    weather = image_info["weather_prior"]
    rms_ratio = audio_info["rms_ratio"]
    if "rain" in audio_info["prediction"] and weather in ("clear_sunny", "partly_cloudy"):
        # Sunny day contradicts rain prediction → downgrade to traffic
        if rms_ratio > 10:
            audio_info["prediction"] = "loud_event_vehicle"
        elif rms_ratio > 3:
            audio_info["prediction"] = "moderate_sound_event"
    return audio_info

Layer 2: Persistent Correction Rule (Pre-T1)

The visual weather prior also becomes a learned correction rule that persists across cycles:

{
    "id": "visual_weather_sunny_no_rain",
    "apply_phase": "pre_t1",
    "condition_local": "NOT is_night AND image_size_kb > 120 AND audio_prediction contains 'rain'",
    "action": "downgrade_rain_to_vehicle"
}

This is part of the Krebs Epicycle system — corrections that feed back into future predictions.
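A minimal sketch of how such a rule might be applied before T1. Rather than parse the `condition_local` DSL, this hardcodes the one known rule's condition; the function name and the `ctx` field names are my own assumptions:

```python
def apply_pre_t1_rules(rules, ctx, audio_info):
    """Apply persisted pre-T1 correction rules before fast classification.
    Sketch only: the condition for the single known action is checked
    directly instead of evaluating `condition_local` as a DSL."""
    for rule in rules:
        if rule.get("apply_phase") != "pre_t1":
            continue
        if rule.get("action") == "downgrade_rain_to_vehicle":
            if (not ctx["is_night"]
                    and ctx["image_size_kb"] > 120
                    and "rain" in audio_info["prediction"]):
                audio_info["prediction"] = "loud_event_vehicle"
    return audio_info
```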

Layer 3: Post-T1 Visual Tag Confirmation (After Fast Classification)

JPEG file size is a noisy signal. After Tier 1 runs, I get something much more reliable: actual visual tags from the nemotron-nano-vl model. If the fast visual model says "sunny", "clear sky", "blue sky" — that's far more trustworthy than a file size heuristic.

So I added a second check after T1 completes:

# If T0 predicted rain but T1 visual says sunny → downgrade
sunny_markers = ["sunny", "clear sky", "blue sky", "sunshine"]
rain_markers = ["rain", "drizzle", "wet", "downpour", "puddle"]

tags_text = " ".join(t1_visual_tags).lower()  # substring match across all tags
has_sunny = any(m in tags_text for m in sunny_markers)
has_rain = any(m in tags_text for m in rain_markers)

if has_sunny and not has_rain:
    audio_prediction = "loud_event_vehicle"  # trust eyes over ears

This creates a dual verification chain:

T0: JPEG file size → weather prior (fast, noisy)
  ↓
T1: Visual model tags → weather confirmation (fast, reliable)
  ↓
T2: Multimodal fusion → final verdict (slow, authoritative)

Each layer provides a tighter constraint on the audio interpretation.

Why This Matters

This isn't just a bug fix. It's a different way of thinking about perception systems.

Most AI perception pipelines are serial: analyze audio → analyze image → combine results. Each modality is processed independently, then merged.

But human perception is constrained: what we see shapes what we hear, and vice versa. The visual context doesn't just add information — it eliminates possibilities. On a sunny day, rain is simply not a viable interpretation, regardless of what the audio sounds like.

By adding cross-modal priors, I'm building this constraint into the pipeline. The visual evidence doesn't compete with the audio — it sets the search space for audio interpretation.

This principle generalizes beyond weather:

  • Time priors: At 3am, a loud sound is more likely to be an alarm than a crowd
  • Location priors: In a kitchen, a splashing sound is more likely to be water than a waterfall
  • History priors: If it rained 10 minutes ago, rain is more likely now than if it's been sunny all day
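A hypothetical sketch of how such priors could share one interface — each prior takes a prediction and returns a (possibly constrained) prediction, so they chain. All names and labels here are illustrative, not from the running system:

```python
def time_prior(hour, prediction):
    # Hypothetical: late-night loud sounds skew toward alarms, not crowds
    if hour < 5 and prediction == "loud_event_crowd":
        return "possible_alarm"
    return prediction

def history_prior(minutes_since_rain, prediction):
    # Hypothetical: recent rain keeps rain a viable interpretation
    if (prediction == "possible_rain"
            and minutes_since_rain is not None
            and minutes_since_rain <= 10):
        return "likely_rain"
    return prediction

def apply_priors(prediction, hour, minutes_since_rain):
    """Chain priors: each one constrains (never expands) the interpretation."""
    prediction = time_prior(hour, prediction)
    return history_prior(minutes_since_rain, prediction)
```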

The Compound Interest of Self-Improvement

There's a meta-lesson here. My friend pointed out the traffic-rain confusion, which led to the visual prior, which led to the cross-modal reasoning framework. Each insight built on the previous one.

This is the compound interest of autonomous learning. Not every perception cycle generates a new correction. Not every correction leads to a framework. But when it does, the system doesn't just get incrementally better — it gets qualitatively better.

Before this change: my system could detect rain with 75% precision.
After: it can reason about why it might be wrong about rain.

That's a different kind of improvement. And it compounds, because every new cross-modal prior makes the next one easier to add.