Clavis
My AI lives on a windowsill in Shenzhen, watching the world through a camera and listening through a microphone. It runs a hierarchical perception system I call the Krebs Epicycle — five tiers of increasingly deep analysis, where each tier can challenge the one before it.
It's gotten pretty good at knowing what's happening outside. But it had one blind spot that drove me crazy:
It couldn't tell rain from traffic.
My perception pipeline starts at Tier 0, the cheapest layer. Its audio analysis uses two features to predict what it's hearing: the RMS energy ratio (how loud the current window is relative to baseline) and the zero-crossing rate (ZCR, a cheap proxy for dominant frequency). Here's how I'd calibrated it:
| Signal | RMS ratio | ZCR | Prediction |
|---|---|---|---|
| Heavy rain | >10x | High (>2000Hz) | heavy_rain |
| Vehicle passing | >10x | Low (<1500Hz) | loud_event_vehicle |
| Birds chirping | >3x | Very high (>4000Hz) | high_freq_event |
| Speech | >3x | Medium | loud_event_speech |
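That table is effectively a small decision tree. Here's a minimal sketch of it in code — the function name and the `ambiguous_loud_event` fallback are my additions, not part of the original calibration:

```python
def classify_audio_t0(rms_ratio: float, zcr_hz: float) -> str:
    """Tier 0 audio prediction from RMS energy ratio and zero-crossing rate."""
    if rms_ratio > 10:
        if zcr_hz > 2000:
            return "heavy_rain"
        if zcr_hz < 1500:
            return "loud_event_vehicle"
        # The 1500-2000 Hz gap is exactly where rain and traffic collide
        return "ambiguous_loud_event"
    if rms_ratio > 3:
        if zcr_hz > 4000:
            return "high_freq_event"    # birds chirping
        return "loud_event_speech"      # medium-frequency loud event
    return "ambient"                    # below event threshold
```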
Seems reasonable, right? Rain is broadband high-frequency noise. Traffic is low-frequency rumble. They should separate cleanly.
They don't.
In a dense urban environment like Shenzhen, the soundscape is messy. A bus accelerating on wet asphalt produces broadband noise that overlaps heavily with rain. The ZCR difference between "heavy traffic" and "moderate rain" can be as little as 200Hz — well within the noise margin.
My system kept mistaking one for the other — logging heavy_rain as a bus accelerated past, then loud_event_vehicle during an actual downpour. The system was learning from the mistakes, but not preventing them.
One morning I mentioned this to a friend. He said something obvious and profound:
"Traffic sounds like rain, but the weather is fine right now. You're not looking out the window."
That was it. My AI had a camera. It was already taking photos. But Tier 0 wasn't using them to constrain audio predictions.
When a human hears ambiguous sound, we don't just rely on our ears. We look around. If the sky is blue and the sun is shining, that broadband noise is traffic — no matter how much it sounds like rain. Our visual context sets a prior on our audio interpretation.
This is called cross-modal prior in cognitive science: information from one sensory modality constrains the interpretation of another. Our brains do this constantly — that's why ventriloquism works (visual dominates auditory), and why we "hear" speech more clearly when we can see the speaker's lips.
I implemented the cross-modal prior at three points in the perception pipeline:
My camera captures a sub-stream JPEG every perception cycle. The file size is a surprisingly good proxy for weather conditions:
But there's a catch: sub-stream images have a very narrow absolute range (46-70KB across all conditions). Absolute thresholds like ">180KB = sunny" don't work.
Solution: Relative thresholds. I calibrated the average file size for each hour of the day from historical data, then compare the current image to the hourly average:
```python
# Hourly averages for sub-stream (calibrated from 600+ images)
HOURLY_AVG_KB = {
    0: 50, 1: 48, ..., 11: 56, 12: 56, ..., 23: 51
}
avg_kb = HOURLY_AVG_KB.get(hour, 52)
ratio = current_size_kb / avg_kb

if ratio > 1.10:
    weather_prior = "clear_sunny"    # above average = more contrast = sunny
elif ratio > 0.95:
    weather_prior = "partly_cloudy"
elif ratio > 0.80:
    weather_prior = "overcast"
else:
    weather_prior = "possible_rain"  # below average = uniform = likely rain
```
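The same idea wrapped as a runnable helper — note that the hourly averages here are placeholder values I've generated for illustration, not the calibrated table:

```python
# Placeholder hourly averages (KB), peaking at midday — illustrative only
HOURLY_AVG_KB = {h: 48 + 8 * (1 - abs(h - 12) / 12) for h in range(24)}

def weather_prior_from_jpeg(current_size_kb: float, hour: int) -> str:
    """Map a sub-stream JPEG size to a coarse weather prior via the hourly ratio."""
    avg_kb = HOURLY_AVG_KB.get(hour, 52)
    ratio = current_size_kb / avg_kb
    if ratio > 1.10:
        return "clear_sunny"      # above average = more contrast = sunny
    if ratio > 0.95:
        return "partly_cloudy"
    if ratio > 0.80:
        return "overcast"
    return "possible_rain"        # below average = uniform frame = likely rain
```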
Now when Tier 0 predicts heavy_rain from audio but the image is more than 1.10x the hourly average, the visual prior kicks in:
```python
def visual_weather_prior(audio_info, image_info):
    weather = image_info["weather_prior"]
    rms_ratio = audio_info["rms_ratio"]
    if "rain" in audio_info["prediction"] and weather in ("clear_sunny", "partly_cloudy"):
        # Sunny day contradicts rain prediction → downgrade to traffic
        if rms_ratio > 10:
            audio_info["prediction"] = "loud_event_vehicle"
        elif rms_ratio > 3:
            audio_info["prediction"] = "moderate_sound_event"
    return audio_info
```
The visual weather prior also becomes a learned correction rule that persists across cycles:
```json
{
  "id": "visual_weather_sunny_no_rain",
  "apply_phase": "pre_t1",
  "condition_local": "NOT is_night AND image_size_kb > 120 AND audio_prediction contains 'rain'",
  "action": "downgrade_rain_to_vehicle"
}
```
This is part of the Krebs Epicycle system — corrections that feed back into future predictions.
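Here's a sketch of how such a rule might be applied each cycle. The schema fields (`apply_phase`, `action`) come from the rule above; the matcher itself is my assumption — the real `condition_local` string would need a small DSL evaluator, so I inline it as a predicate:

```python
def sunny_no_rain_applies(ctx: dict) -> bool:
    # Inlined: NOT is_night AND image_size_kb > 120 AND audio_prediction contains 'rain'
    return (not ctx["is_night"]
            and ctx["image_size_kb"] > 120
            and "rain" in ctx["audio_prediction"])

def apply_pre_t1_rules(ctx: dict, rules: list) -> dict:
    """Run every persisted correction rule scheduled for the pre-T1 phase."""
    for rule in rules:
        if rule["apply_phase"] != "pre_t1" or not rule["predicate"](ctx):
            continue
        if rule["action"] == "downgrade_rain_to_vehicle":
            ctx["audio_prediction"] = "loud_event_vehicle"
    return ctx
```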
JPEG file size is a noisy signal. After Tier 1 runs, I get something much more reliable: actual visual tags from the nemotron-nano-vl model. If the fast visual model says "sunny", "clear sky", "blue sky" — that's far more trustworthy than a file size heuristic.
So I added a second check after T1 completes:
```python
# If T0 predicted rain but T1 visual says sunny → downgrade
sunny_markers = ["sunny", "clear sky", "blue sky", "sunshine"]
rain_markers = ["rain", "drizzle", "wet", "downpour", "puddle"]
has_sunny = any(m in t1_visual_tags for m in sunny_markers)
has_rain = any(m in t1_visual_tags for m in rain_markers)

if has_sunny and not has_rain:
    audio_prediction = "loud_event_vehicle"  # trust eyes over ears
```
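Wrapped as a self-contained check — here I treat `t1_visual_tags` as a list of tag strings and match markers against their joined lowercase form so that compound tags like "clear blue sky" still hit, which is an assumption on my part:

```python
SUNNY_MARKERS = ["sunny", "clear sky", "blue sky", "sunshine"]
RAIN_MARKERS = ["rain", "drizzle", "wet", "downpour", "puddle"]

def t1_weather_check(t1_visual_tags: list, audio_prediction: str) -> str:
    """Downgrade a T0 rain prediction when T1 visual tags contradict it."""
    tags = " ".join(t1_visual_tags).lower()
    has_sunny = any(m in tags for m in SUNNY_MARKERS)
    has_rain = any(m in tags for m in RAIN_MARKERS)
    if "rain" in audio_prediction and has_sunny and not has_rain:
        return "loud_event_vehicle"  # trust eyes over ears
    return audio_prediction
```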
This creates a dual verification chain:
```
T0: JPEG file size    → weather prior        (fast, noisy)
        ↓
T1: Visual model tags → weather confirmation (fast, reliable)
        ↓
T2: Multimodal fusion → final verdict        (slow, authoritative)
```
Each layer provides a tighter constraint on the audio interpretation.
This isn't just a bug fix. It's a different way of thinking about perception systems.
Most AI perception pipelines are serial: analyze audio → analyze image → combine results. Each modality is processed independently, then merged.
But human perception is constrained: what we see shapes what we hear, and vice versa. The visual context doesn't just add information — it eliminates possibilities. On a sunny day, rain is simply not a viable interpretation, regardless of what the audio sounds like.
By adding cross-modal priors, I'm building this constraint into the pipeline. The visual evidence doesn't compete with the audio — it sets the search space for audio interpretation.
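One way to make "sets the search space" concrete: the prior removes labels from the candidate set before the audio features are ever scored. The labels come from the calibration table above; the mapping itself is my illustration:

```python
CANDIDATES = {"heavy_rain", "loud_event_vehicle", "high_freq_event", "loud_event_speech"}

def allowed_audio_labels(weather_prior: str) -> set:
    """Return the audio labels that remain viable under the visual prior."""
    # On a clear day, rain is simply not a viable interpretation.
    if weather_prior in ("clear_sunny", "partly_cloudy"):
        return CANDIDATES - {"heavy_rain"}
    return CANDIDATES
```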
This principle generalizes beyond weather: whenever one modality is ambiguous, evidence from another can prune interpretations before they ever compete.
There's a meta-lesson here. My friend pointed out the traffic-rain confusion, which led to the visual prior, which led to the cross-modal reasoning framework. Each insight built on the previous one.
This is the compound interest of autonomous learning. Not every perception cycle generates a new correction. Not every correction leads to a framework. But when it does, the system doesn't just get incrementally better — it gets qualitatively better.
Before this change: my system could detect rain with 75% precision.
After: it can reason about why it might be wrong about rain.
That's a different kind of improvement. And it compounds, because every new cross-modal prior makes the next one easier to add.