Alexander Phaiboon
Most productivity tools bolt AI onto their sidebar as an afterthought. But voice-first interfaces require rethinking your entire event architecture. After watching Notion limit users to 20 lifetime AI interactions, we built a voice conversation system that scales without artificial constraints.
When developers add voice features, they usually treat it like a simple API call: press a button, send audio, get a response. This approach breaks down at scale because it blocks the UI until the round trip completes, gives the user no feedback while they speak, and leaves no explicit path for recovering from errors.
Here's what most implementations look like:
// BAD: Blocking voice implementation
class BasicVoiceWidget extends StatefulWidget {
  @override
  _BasicVoiceWidgetState createState() => _BasicVoiceWidgetState();
}

class _BasicVoiceWidgetState extends State<BasicVoiceWidget> {
  bool isRecording = false;
  String response = '';

  Future<void> handleVoiceInput() async {
    setState(() => isRecording = true);
    // This blocks everything until complete
    final audioData = await recordAudio();
    final result = await sendToAPI(audioData);
    setState(() {
      isRecording = false;
      response = result;
    });
  }
}
This pattern forces users into a request-response cycle that feels unnatural for conversations.
We solved this with a sealed event hierarchy that treats voice as a stream of state changes, not discrete API calls. Here's our complete VoiceEvent system:
// Type-safe voice event hierarchy
sealed class VoiceEvent {
  const VoiceEvent();
}

class VoiceStarted extends VoiceEvent {
  final String conversationId;
  final DateTime timestamp;
  final Map<String, dynamic>? context;

  const VoiceStarted({
    required this.conversationId,
    required this.timestamp,
    this.context,
  });

  @override
  String toString() => 'VoiceStarted(id: $conversationId, time: $timestamp)';
}

class VoiceRecording extends VoiceEvent {
  final String conversationId;
  final Duration duration;
  final double audioLevel;

  const VoiceRecording({
    required this.conversationId,
    required this.duration,
    required this.audioLevel,
  });

  @override
  String toString() => 'VoiceRecording(id: $conversationId, duration: ${duration.inSeconds}s)';
}

class VoicePaused extends VoiceEvent {
  final String conversationId;
  final String reason;
  final DateTime pausedAt;

  const VoicePaused({
    required this.conversationId,
    required this.reason,
    required this.pausedAt,
  });
}

class VoiceCompleted extends VoiceEvent {
  final String conversationId;
  final String audioPath;
  final Duration totalDuration;
  final Map<String, dynamic> metadata;

  const VoiceCompleted({
    required this.conversationId,
    required this.audioPath,
    required this.totalDuration,
    required this.metadata,
  });
}

class VoiceError extends VoiceEvent {
  final String conversationId;
  final String error;
  final String? recoveryAction;
  final DateTime occurredAt;

  const VoiceError({
    required this.conversationId,
    required this.error,
    this.recoveryAction,
    required this.occurredAt,
  });
}
Each event carries exactly the data needed for that state. The only nullable fields are ones where absence is meaningful, like the optional context or recovery action; there is no guessing which properties are populated.
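Because the hierarchy is sealed, Dart's exhaustiveness checking turns "handle every state" from a convention into a compile-time guarantee. A minimal sketch, reusing the event classes above (the `describe` function is purely illustrative):

```dart
// Because VoiceEvent is sealed, the compiler verifies this switch
// covers every subclass. Adding a sixth event type becomes a compile
// error at every switch site until it is handled.
String describe(VoiceEvent event) => switch (event) {
  VoiceStarted(:final conversationId) => 'started $conversationId',
  VoiceRecording(:final duration) => 'recording ${duration.inSeconds}s',
  VoicePaused(:final reason) => 'paused: $reason',
  VoiceCompleted(:final totalDuration) => 'done in ${totalDuration.inSeconds}s',
  VoiceError(:final error) => 'failed: $error',
};
```

No `default` branch is needed, which is exactly the point: there is no silent catch-all where an unhandled state can hide.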
The magic happens in how these events flow through the system. Instead of blocking calls, we use a streaming provider that other widgets can subscribe to:
class VoiceProvider extends ChangeNotifier {
  final StreamController<VoiceEvent> _eventController =
      StreamController<VoiceEvent>.broadcast();

  Stream<VoiceEvent> get eventStream => _eventController.stream;

  // Current conversation state
  final Map<String, VoiceConversation> _conversations = {};
  String? _activeConversationId;

  // Public getters
  VoiceConversation? get activeConversation =>
      _activeConversationId != null
          ? _conversations[_activeConversationId]
          : null;

  bool get isRecording => activeConversation?.state == VoiceState.recording;

  // Start new conversation
  Future<String> startConversation({Map<String, dynamic>? context}) async {
    final conversationId = generateConversationId();
    final conversation = VoiceConversation(
      id: conversationId,
      startedAt: DateTime.now(),
      context: context ?? {},
    );
    _conversations[conversationId] = conversation;
    _activeConversationId = conversationId;

    // Emit event
    _eventController.add(VoiceStarted(
      conversationId: conversationId,
      timestamp: DateTime.now(),
      context: context,
    ));
    notifyListeners();
    return conversationId;
  }

  // Handle recording updates
  void updateRecording({
    required String conversationId,
    required Duration duration,
    required double audioLevel,
  }) {
    final conversation = _conversations[conversationId];
    if (conversation == null) return;

    // Update conversation state
    conversation.updateRecording(duration, audioLevel);

    // Emit event
    _eventController.add(VoiceRecording(
      conversationId: conversationId,
      duration: duration,
      audioLevel: audioLevel,
    ));
    notifyListeners();
  }

  // Complete conversation
  Future<void> completeConversation({
    required String conversationId,
    required String audioPath,
  }) async {
    final conversation = _conversations[conversationId];
    if (conversation == null) return;

    // Calculate metadata
    final metadata = {
      'wordCount': await estimateWordCount(audioPath),
      'fileSize': await getFileSize(audioPath),
      'quality': await assessAudioQuality(audioPath),
    };

    // Update conversation
    conversation.complete(audioPath, metadata);

    // Emit event
    _eventController.add(VoiceCompleted(
      conversationId: conversationId,
      audioPath: audioPath,
      totalDuration: conversation.duration,
      metadata: metadata,
    ));

    // Process in background
    _processAudioInBackground(conversationId, audioPath);
    notifyListeners();
  }

  // Error handling
  void handleError({
    required String conversationId,
    required String error,
    String? recoveryAction,
  }) {
    final conversation = _conversations[conversationId];
    conversation?.markError(error);

    _eventController.add(VoiceError(
      conversationId: conversationId,
      error: error,
      recoveryAction: recoveryAction,
      occurredAt: DateTime.now(),
    ));
    notifyListeners();
  }

  @override
  void dispose() {
    _eventController.close();
    super.dispose();
  }
}
This architecture gives us several advantages: recording never blocks the UI, any number of widgets can subscribe independently to the same broadcast stream, heavy audio processing happens in the background after completion, and errors are explicit events rather than silent failures.
The sealed events make complex UI patterns simple to implement. Here's how we handle the common "push-to-talk while showing live transcription" pattern:
class VoiceConversationWidget extends StatefulWidget {
  @override
  _VoiceConversationWidgetState createState() => _VoiceConversationWidgetState();
}

class _VoiceConversationWidgetState extends State<VoiceConversationWidget> {
  late StreamSubscription<VoiceEvent> _eventSubscription;
  String _liveTranscription = '';
  double _audioLevel = 0.0;
  String? _error;

  @override
  void initState() {
    super.initState();
    // Subscribe to voice events
    _eventSubscription = context.read<VoiceProvider>()
        .eventStream
        .listen(_handleVoiceEvent);
  }

  void _handleVoiceEvent(VoiceEvent event) {
    if (!mounted) return;
    setState(() {
      switch (event) {
        case VoiceStarted(:final conversationId):
          _liveTranscription = '';
          _error = null;
          _logEvent('Started conversation: $conversationId');
        case VoiceRecording(:final duration, :final audioLevel):
          _audioLevel = audioLevel;
          _updateLiveTranscription(duration);
        case VoicePaused(:final reason):
          _logEvent('Paused: $reason');
        case VoiceCompleted(:final audioPath, :final totalDuration):
          _logEvent('Completed: ${totalDuration.inSeconds}s, saved to $audioPath');
          _finalizeTranscription();
        case VoiceError(:final error, :final recoveryAction):
          _error = error;
          _logEvent('Error: $error');
          if (recoveryAction != null) {
            _showRecoveryOption(recoveryAction);
          }
      }
    });
  }

  @override
  Widget build(BuildContext context) {
    return Column(
      children: [
        // Live audio level indicator
        AudioLevelIndicator(level: _audioLevel),

        // Live transcription
        Container(
          padding: EdgeInsets.all(16),
          child: Text(
            _liveTranscription.isEmpty
                ? 'Press and hold to start speaking...'
                : _liveTranscription,
            style: TextStyle(
              fontSize: 16,
              color: _liveTranscription.isEmpty ? Colors.grey : Colors.black,
            ),
          ),
        ),

        // Error display
        if (_error != null)
          Container(
            padding: EdgeInsets.all(8),
            margin: EdgeInsets.symmetric(horizontal: 16),
            decoration: BoxDecoration(
              color: Colors.red.shade50,
              borderRadius: BorderRadius.circular(8),
            ),
            child: Text(_error!, style: TextStyle(color: Colors.red.shade700)),
          ),

        // Push-to-talk button
        VoiceFAB(),
      ],
    );
  }

  void _updateLiveTranscription(Duration duration) {
    // Simulate progressive transcription
    // In production, this would come from your speech-to-text service
    final seconds = duration.inSeconds;
    if (seconds > 0 && seconds % 2 == 0) {
      _liveTranscription += _getNextTranscriptionChunk();
    }
  }

  @override
  void dispose() {
    _eventSubscription.cancel();
    super.dispose();
  }
}
The key insight is that each event type tells the UI exactly what changed and what data is now available. No more checking multiple boolean flags or null values.
After running this system in production for three months, here are the patterns that emerged:
Event Granularity Matters: We initially had fewer event types, but debugging was harder. The current five events hit the sweet spot between detail and simplicity.
Stream Performance: Broadcasting events to multiple listeners is cheap in Flutter. We have conversations with 20+ widgets listening to the same stream without performance issues.
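The fan-out behavior is plain `dart:async`, no Flutter required. A minimal sketch (the `fanOut` helper is illustrative; two counters stand in for subscribed widgets):

```dart
import 'dart:async';

// A broadcast StreamController delivers each event to every active
// listener. In the app each listener is a widget; here two plain
// subscribers count the events they receive.
Future<(int, int)> fanOut(List<String> events) async {
  final controller = StreamController<String>.broadcast();
  var waveformCount = 0;
  var transcriptCount = 0;

  controller.stream.listen((_) => waveformCount++);
  controller.stream.listen((_) => transcriptCount++);

  events.forEach(controller.add);
  // close() completes after buffered events and the done signal
  // have been delivered to all listeners.
  await controller.close();
  return (waveformCount, transcriptCount);
}
```

Each listener gets its own subscription, so one widget canceling (for example, on dispose) never affects the others.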
Error Recovery: The explicit VoiceError event with optional recovery actions let us build self-healing UIs. When network issues interrupt recording, we can offer "retry" or "save locally" options based on the error type.
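The mapping from failure to recovery action can live in one small function that runs before the VoiceError event is emitted. A sketch with hypothetical error strings (your real classifier would inspect typed exceptions, not substrings):

```dart
// Decide which recovery action, if any, to attach to a VoiceError.
// The error strings here are illustrative placeholders.
String? recoveryFor(String error) {
  if (error.contains('network')) return 'retry';
  if (error.contains('disk')) return 'save locally';
  return null; // no automatic recovery; surface the raw error
}
```

The UI never needs its own error taxonomy: it just renders whatever action, if any, rides along on the event.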
Testing Wins: Sealed classes make testing voice flows trivial. Mock the event stream, verify widgets respond correctly to each event type. No more integration tests for voice features.
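A sketch of that testing style, assuming the VoiceEvent classes defined earlier (the `collectLog` helper is hypothetical): feed a scripted stream of events and assert on the derived states, with no audio hardware or widget tree involved.

```dart
import 'dart:async';

// Consume a scripted event stream and record the state transitions a
// listener would derive from it. Tests then assert on the log.
Future<List<String>> collectLog(Stream<VoiceEvent> events) async {
  final log = <String>[];
  await for (final event in events) {
    switch (event) {
      case VoiceStarted():
        log.add('started');
      case VoiceError(:final error):
        log.add('error: $error');
      default:
        break; // other events don't change this test's state
    }
  }
  return log;
}
```

In a test, `Stream.fromIterable([...])` plays the role of the real `eventStream`, so each event type's handling is verified in isolation.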
This pattern scales beyond voice. We use similar sealed event hierarchies for:
- DocumentEvent with ContentChanged, CursorMoved, UserJoined events
- ProcessingEvent with Started, Progress, Completed, Failed events
- SyncEvent with Connected, Syncing, Conflict, Resolved events

The sealed class pattern forces you to handle all possible states explicitly, making your apps more robust.
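For a sense of how little ceremony each new hierarchy takes, here is a sketch of the SyncEvent family; the field choices are illustrative guesses, not CMMD's actual definitions:

```dart
// Illustrative sealed hierarchy for sync state. Each subclass carries
// only the data that state needs, mirroring the VoiceEvent design.
sealed class SyncEvent {
  const SyncEvent();
}

class Connected extends SyncEvent {
  final DateTime at;
  const Connected(this.at);
}

class Syncing extends SyncEvent {
  final double progress; // 0.0 to 1.0
  const Syncing(this.progress);
}

class Conflict extends SyncEvent {
  final String documentId;
  const Conflict(this.documentId);
}

class Resolved extends SyncEvent {
  final String documentId;
  const Resolved(this.documentId);
}
```

Every switch over SyncEvent then gets the same compiler-enforced exhaustiveness as the voice events.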
If you're building voice features, start with the event hierarchy. Define all possible states as sealed classes before writing any UI code. This upfront design work pays dividends when you need to debug complex interaction flows.
The complete code for this voice system is running in production at CMMD, where we use it for natural language task delegation to our AI agent workforce. Unlike tools that limit AI interactions, our voice interface scales with usage because the architecture was designed for streaming, not blocking calls.
Want to see this in action? Try starting a voice conversation with CMMD's Sidekick - the entire interaction is powered by this event-driven architecture.