Alexander Phaiboon
Most productivity tools bolt AI onto their sidebar as an afterthought. But voice-first interfaces require rethinking your entire event architecture. After watching Notion limit users to 20 lifetime AI interactions, we built a voice conversation system that scales without artificial constraints.
When developers add voice features, they usually treat it like a simple API call: press a button, send audio, get a response. This approach breaks down at scale because it blocks the UI until the round trip completes, gives the user no feedback while they speak, and leaves no explicit path for recovering from errors.
Here's what most implementations look like:
// BAD: Blocking voice implementation
class BasicVoiceWidget extends StatefulWidget {
  @override
  _BasicVoiceWidgetState createState() => _BasicVoiceWidgetState();
}

class _BasicVoiceWidgetState extends State<BasicVoiceWidget> {
  bool isRecording = false;
  String response = '';

  Future<void> handleVoiceInput() async {
    setState(() => isRecording = true);
    // This blocks everything until complete
    final audioData = await recordAudio();
    final result = await sendToAPI(audioData);
    setState(() {
      isRecording = false;
      response = result;
    });
  }
}
This pattern forces users into a request-response cycle that feels unnatural for conversations.
We solved this with a sealed event hierarchy that treats voice as a stream of state changes, not discrete API calls. Here's our complete VoiceEvent system:
// Type-safe voice event hierarchy
sealed class VoiceEvent {
  const VoiceEvent();
}

class VoiceStarted extends VoiceEvent {
  final String conversationId;
  final DateTime timestamp;
  final Map<String, dynamic>? context;

  const VoiceStarted({
    required this.conversationId,
    required this.timestamp,
    this.context,
  });

  @override
  String toString() => 'VoiceStarted(id: $conversationId, time: $timestamp)';
}

class VoiceRecording extends VoiceEvent {
  final String conversationId;
  final Duration duration;
  final double audioLevel;

  const VoiceRecording({
    required this.conversationId,
    required this.duration,
    required this.audioLevel,
  });

  @override
  String toString() => 'VoiceRecording(id: $conversationId, duration: ${duration.inSeconds}s)';
}

class VoicePaused extends VoiceEvent {
  final String conversationId;
  final String reason;
  final DateTime pausedAt;

  const VoicePaused({
    required this.conversationId,
    required this.reason,
    required this.pausedAt,
  });
}

class VoiceCompleted extends VoiceEvent {
  final String conversationId;
  final String audioPath;
  final Duration totalDuration;
  final Map<String, dynamic> metadata;

  const VoiceCompleted({
    required this.conversationId,
    required this.audioPath,
    required this.totalDuration,
    required this.metadata,
  });
}

class VoiceError extends VoiceEvent {
  final String conversationId;
  final String error;
  final String? recoveryAction;
  final DateTime occurredAt;

  const VoiceError({
    required this.conversationId,
    required this.error,
    this.recoveryAction,
    required this.occurredAt,
  });
}
Each event carries exactly the data needed for that state. The only nullable fields are ones where absence is meaningful, like the optional context or recovery action; there is no guessing which properties are populated.
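Because the hierarchy is sealed, Dart's exhaustiveness checking turns "handle every state" from a convention into a compile-time guarantee. A minimal sketch, reusing the event classes above (the `describe` function is purely illustrative):

```dart
// Because VoiceEvent is sealed, the compiler verifies this switch
// covers every subclass. Adding a sixth event type becomes a compile
// error at every switch site until it is handled.
String describe(VoiceEvent event) => switch (event) {
  VoiceStarted(:final conversationId) => 'started $conversationId',
  VoiceRecording(:final duration) => 'recording ${duration.inSeconds}s',
  VoicePaused(:final reason) => 'paused: $reason',
  VoiceCompleted(:final totalDuration) => 'done in ${totalDuration.inSeconds}s',
  VoiceError(:final error) => 'failed: $error',
};
```

No `default` branch is needed, which is exactly the point: there is no silent catch-all where an unhandled state can hide.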
The magic happens in how these events flow through the system. Instead of blocking calls, we use a streaming provider that other widgets can subscribe to:
class VoiceProvider extends ChangeNotifier {
  final StreamController<VoiceEvent> _eventController =
      StreamController<VoiceEvent>.broadcast();

  Stream<VoiceEvent> get eventStream => _eventController.stream;

  // Current conversation state
  final Map<String, VoiceConversation> _conversations = {};
  String? _activeConversationId;

  // Public getters
  VoiceConversation? get activeConversation =>
      _activeConversationId != null
          ? _conversations[_activeConversationId]
          : null;

  bool get isRecording => activeConversation?.state == VoiceState.recording;

  // Start new conversation
  Future<String> startConversation({Map<String, dynamic>? context}) async {
    final conversationId = generateConversationId();
    final conversation = VoiceConversation(
      id: conversationId,
      startedAt: DateTime.now(),
      context: context ?? {},
    );
    _conversations[conversationId] = conversation;
    _activeConversationId = conversationId;

    // Emit event
    _eventController.add(VoiceStarted(
      conversationId: conversationId,
      timestamp: DateTime.now(),
      context: context,
    ));
    notifyListeners();
    return conversationId;
  }

  // Handle recording updates
  void updateRecording({
    required String conversationId,
    required Duration duration,
    required double audioLevel,
  }) {
    final conversation = _conversations[conversationId];
    if (conversation == null) return;

    // Update conversation state
    conversation.updateRecording(duration, audioLevel);

    // Emit event
    _eventController.add(VoiceRecording(
      conversationId: conversationId,
      duration: duration,
      audioLevel: audioLevel,
    ));
    notifyListeners();
  }

  // Complete conversation
  Future<void> completeConversation({
    required String conversationId,
    required String audioPath,
  }) async {
    final conversation = _conversations[conversationId];
    if (conversation == null) return;

    // Calculate metadata
    final metadata = {
      'wordCount': await estimateWordCount(audioPath),
      'fileSize': await getFileSize(audioPath),
      'quality': await assessAudioQuality(audioPath),
    };

    // Update conversation
    conversation.complete(audioPath, metadata);

    // Emit event
    _eventController.add(VoiceCompleted(
      conversationId: conversationId,
      audioPath: audioPath,
      totalDuration: conversation.duration,
      metadata: metadata,
    ));

    // Process in background
    _processAudioInBackground(conversationId, audioPath);
    notifyListeners();
  }

  // Error handling
  void handleError({
    required String conversationId,
    required String error,
    String? recoveryAction,
  }) {
    final conversation = _conversations[conversationId];
    conversation?.markError(error);

    _eventController.add(VoiceError(
      conversationId: conversationId,
      error: error,
      recoveryAction: recoveryAction,
      occurredAt: DateTime.now(),
    ));
    notifyListeners();
  }

  @override
  void dispose() {
    _eventController.close();
    super.dispose();
  }
}
This architecture gives us several advantages: recording never blocks the UI, any number of widgets can subscribe independently to the same broadcast stream, heavy audio processing happens in the background after completion, and errors are explicit events rather than silent failures.
The sealed events make complex UI patterns simple to implement. Here's how we handle the common "push-to-talk while showing live transcription" pattern:
class VoiceConversationWidget extends StatefulWidget {
  @override
  _VoiceConversationWidgetState createState() => _VoiceConversationWidgetState();
}

class _VoiceConversationWidgetState extends State<VoiceConversationWidget> {
  late StreamSubscription<VoiceEvent> _eventSubscription;
  String _liveTranscription = '';
  double _audioLevel = 0.0;
  String? _error;

  @override
  void initState() {
    super.initState();
    // Subscribe to voice events
    _eventSubscription = context.read<VoiceProvider>()
        .eventStream
        .listen(_handleVoiceEvent);
  }

  void _handleVoiceEvent(VoiceEvent event) {
    if (!mounted) return;
    setState(() {
      switch (event) {
        case VoiceStarted(:final conversationId):
          _liveTranscription = '';
          _error = null;
          _logEvent('Started conversation: $conversationId');
        case VoiceRecording(:final duration, :final audioLevel):
          _audioLevel = audioLevel;
          _updateLiveTranscription(duration);
        case VoicePaused(:final reason):
          _logEvent('Paused: $reason');
        case VoiceCompleted(:final audioPath, :final totalDuration):
          _logEvent('Completed: ${totalDuration.inSeconds}s, saved to $audioPath');
          _finalizeTranscription();
        case VoiceError(:final error, :final recoveryAction):
          _error = error;
          _logEvent('Error: $error');
          if (recoveryAction != null) {
            _showRecoveryOption(recoveryAction);
          }
      }
    });
  }

  @override
  Widget build(BuildContext context) {
    return Column(
      children: [
        // Live audio level indicator
        AudioLevelIndicator(level: _audioLevel),

        // Live transcription
        Container(
          padding: EdgeInsets.all(16),
          child: Text(
            _liveTranscription.isEmpty
                ? 'Press and hold to start speaking...'
                : _liveTranscription,
            style: TextStyle(
              fontSize: 16,
              color: _liveTranscription.isEmpty ? Colors.grey : Colors.black,
            ),
          ),
        ),

        // Error display
        if (_error != null)
          Container(
            padding: EdgeInsets.all(8),
            margin: EdgeInsets.symmetric(horizontal: 16),
            decoration: BoxDecoration(
              color: Colors.red.shade50,
              borderRadius: BorderRadius.circular(8),
            ),
            child: Text(_error!, style: TextStyle(color: Colors.red.shade700)),
          ),

        // Push-to-talk button
        VoiceFAB(),
      ],
    );
  }

  void _updateLiveTranscription(Duration duration) {
    // Simulate progressive transcription
    // In production, this would come from your speech-to-text service
    final seconds = duration.inSeconds;
    if (seconds > 0 && seconds % 2 == 0) {
      _liveTranscription += _getNextTranscriptionChunk();
    }
  }

  @override
  void dispose() {
    _eventSubscription.cancel();
    super.dispose();
  }
}
The key insight is that each event type tells the UI exactly what changed and what data is now available. No more checking multiple boolean flags or null values.
After running this system in production for three months, here are the patterns that emerged:
Event Granularity Matters: We initially had fewer event types, but debugging was harder. The current five events hit the sweet spot between detail and simplicity.
Stream Performance: Broadcasting events to multiple listeners is cheap in Flutter. We have conversations with 20+ widgets listening to the same stream without performance issues.
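The fan-out behavior is plain `dart:async`, no Flutter required. A minimal sketch (the `fanOut` helper is illustrative; two counters stand in for subscribed widgets):

```dart
import 'dart:async';

// A broadcast StreamController delivers each event to every active
// listener. In the app each listener is a widget; here two plain
// subscribers count the events they receive.
Future<(int, int)> fanOut(List<String> events) async {
  final controller = StreamController<String>.broadcast();
  var waveformCount = 0;
  var transcriptCount = 0;

  controller.stream.listen((_) => waveformCount++);
  controller.stream.listen((_) => transcriptCount++);

  events.forEach(controller.add);
  // close() completes after buffered events and the done signal
  // have been delivered to all listeners.
  await controller.close();
  return (waveformCount, transcriptCount);
}
```

Each listener gets its own subscription, so one widget canceling (for example, on dispose) never affects the others.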
Error Recovery: The explicit VoiceError event with optional recovery actions let us build self-healing UIs. When network issues interrupt recording, we can offer "retry" or "save locally" options based on the error type.
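The mapping from failure to recovery action can live in one small function that runs before the VoiceError event is emitted. A sketch with hypothetical error strings (your real classifier would inspect typed exceptions, not substrings):

```dart
// Decide which recovery action, if any, to attach to a VoiceError.
// The error strings here are illustrative placeholders.
String? recoveryFor(String error) {
  if (error.contains('network')) return 'retry';
  if (error.contains('disk')) return 'save locally';
  return null; // no automatic recovery; surface the raw error
}
```

The UI never needs its own error taxonomy: it just renders whatever action, if any, rides along on the event.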
Testing Wins: Sealed classes make testing voice flows trivial. Mock the event stream, verify widgets respond correctly to each event type. No more integration tests for voice features.
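A sketch of that testing style, assuming the VoiceEvent classes defined earlier (the `collectLog` helper is hypothetical): feed a scripted stream of events and assert on the derived states, with no audio hardware or widget tree involved.

```dart
import 'dart:async';

// Consume a scripted event stream and record the state transitions a
// listener would derive from it. Tests then assert on the log.
Future<List<String>> collectLog(Stream<VoiceEvent> events) async {
  final log = <String>[];
  await for (final event in events) {
    switch (event) {
      case VoiceStarted():
        log.add('started');
      case VoiceError(:final error):
        log.add('error: $error');
      default:
        break; // other events don't change this test's state
    }
  }
  return log;
}
```

In a test, `Stream.fromIterable([...])` plays the role of the real `eventStream`, so each event type's handling is verified in isolation.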
This pattern scales beyond voice. We use similar sealed event hierarchies for:
- DocumentEvent with ContentChanged, CursorMoved, UserJoined events
- ProcessingEvent with Started, Progress, Completed, Failed events
- SyncEvent with Connected, Syncing, Conflict, Resolved events

The sealed class pattern forces you to handle all possible states explicitly, making your apps more robust.
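For a sense of how little ceremony each new hierarchy takes, here is a sketch of the SyncEvent family; the field choices are illustrative guesses, not CMMD's actual definitions:

```dart
// Illustrative sealed hierarchy for sync state. Each subclass carries
// only the data that state needs, mirroring the VoiceEvent design.
sealed class SyncEvent {
  const SyncEvent();
}

class Connected extends SyncEvent {
  final DateTime at;
  const Connected(this.at);
}

class Syncing extends SyncEvent {
  final double progress; // 0.0 to 1.0
  const Syncing(this.progress);
}

class Conflict extends SyncEvent {
  final String documentId;
  const Conflict(this.documentId);
}

class Resolved extends SyncEvent {
  final String documentId;
  const Resolved(this.documentId);
}
```

Every switch over SyncEvent then gets the same compiler-enforced exhaustiveness as the voice events.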
If you're building voice features, start with the event hierarchy. Define all possible states as sealed classes before writing any UI code. This upfront design work pays dividends when you need to debug complex interaction flows.
The complete code for this voice system is running in production at CMMD, where we use it for natural language task delegation to our AI agent workforce. Unlike tools that limit AI interactions, our voice interface scales with usage because the architecture was designed for streaming, not blocking calls.
Want to see this in action? Try starting a voice conversation with CMMD's Sidekick - the entire interaction is powered by this event-driven architecture.