Open Source Project of the Day (Part 8): NexaSDK - Cross-Platform On-Device AI Runtime for Running Frontier Models Locally

#opensource #llm #npu #python

Introduction

"What if the latest AI models could run on your phone, on IoT devices, even on edge devices — without needing to rely on the cloud?"

This is Part 8 of the "Open Source Project of the Day" series. Today we explore NexaSDK (GitHub).

Imagine running the Qwen3-VL multimodal model on an Android phone, using Apple Neural Engine for speech recognition on an iOS device, and running the Granite-4 model on a Linux IoT device — all without connecting to the cloud. That's the revolutionary experience NexaSDK delivers — bringing frontier AI models truly "down to earth" on all kinds of devices.

Why this project?

  • 🚀 NPU-first: The industry's first NPU-first on-device AI runtime
  • 📱 Full platform support: PC, Android, iOS, Linux/IoT all covered
  • 🎯 Day-0 model support: Supports newly released models (GGUF, MLX, NEXA formats)
  • 🔌 Multimodal capabilities: LLM, VLM, ASR, OCR, Rerank, image generation, and more
  • 🌟 Community recognized: 7.6k+ Stars, collaborates with Qualcomm on on-device AI competitions

What You'll Learn

  • Core concepts and architecture design of NexaSDK
  • How to run on-device AI models on various platforms
  • Support and usage of NPU, GPU, and CPU compute backends
  • Integration and use of multimodal AI capabilities
  • Comparative analysis with other on-device AI frameworks
  • How to get started building on-device AI applications with NexaSDK

Prerequisites

  • Basic understanding of LLMs and AI models
  • Familiarity with at least one programming language (Python, Go, Kotlin, Swift)
  • Understanding of on-device AI basics (optional)
  • Basic knowledge of hardware acceleration like NPU, GPU (optional)

Project Background

Project Introduction

NexaSDK is a cross-platform on-device AI runtime supporting frontier LLM and VLM models on GPU, NPU, and CPU. It provides comprehensive runtime coverage for PC (Python/C++), mobile (Android & iOS), and Linux/IoT (Arm64 & x86 Docker) platforms.

Core problems the project solves:

  • On-device AI runtimes are fragmented, requiring different solutions for different platforms
  • Lack of native NPU support, unable to fully utilize hardware acceleration
  • After new model releases, on-device support lags (no day-0 support)
  • Multimodal AI capabilities are difficult to integrate on-device
  • Cross-platform development costs are high, requiring separate implementations per platform

Target user groups:

  • Developers building on-device AI applications
  • Mobile app developers wanting to leverage NPU acceleration
  • Developers needing to run AI models on IoT devices
  • Researchers interested in on-device AI

Author/Team Introduction

Team: NexaAI

  • Background: Team focused on on-device AI solutions
  • Partners: Collaborates with Qualcomm to host on-device AI competitions
  • Contributors: 45 contributors including @RemiliaForever, @zhiyuan8, @mengshengwu, and others
  • Philosophy: Enable frontier AI models to run efficiently on all kinds of devices

Project created: 2024 (inferred from the GitHub commit history)

Project Stats

  • GitHub Stars: 7.6k+ (rapidly and continuously growing)
  • 🍴 Forks: 944+
  • 📦 Version: v0.2.71 (latest version, released January 22, 2026)
  • 📄 License: Apache-2.0 (CPU/GPU components); NPU components require a license
  • 🌐 Website: docs.nexa.ai
  • 📚 Documentation: Complete documentation
  • 💬 Community: Active Discord and Slack communities
  • 🏆 Competition: Nexa × Qualcomm On-Device AI Competition ($6,500 prize)

Project development history:

  • 2024: Project launched, initial version released
  • 2024-2025: Rapid development, multi-platform support added
  • 2025: NPU support refined, collaboration with Qualcomm
  • 2026: Continuous iteration, more models and feature support added

Supported models:

  • OpenAI GPT-OSS
  • IBM Granite-4
  • Qwen-3-VL
  • Gemma-3n
  • Ministral-3
  • And many more frontier models

Main Features

Core Purpose

NexaSDK's core purpose is to provide a unified cross-platform on-device AI runtime, enabling developers to:

  1. Run AI models on multiple devices: PC, phones, and IoT devices all covered
  2. Fully utilize hardware acceleration: Automatic selection from NPU, GPU, or CPU backends
  3. Quickly integrate new models: Day-0 support, use new models as soon as they're released
  4. Multimodal AI capabilities: Comprehensive support for text, images, audio, video, and more
  5. Simplify development: Unified API, one codebase for all platforms

Use Cases

  1. Mobile AI applications

    • Smart assistants on phones
    • Offline speech recognition and translation
    • Image recognition and processing
    • Local LLM chat applications
  2. IoT and edge computing

    • AI capabilities for smart home devices
    • Intelligent analysis for industrial IoT
    • AI inference on edge servers
    • Perception capabilities for autonomous vehicles
  3. Desktop application integration

    • Local AI assistant
    • Intelligent document processing
    • Code generation tools
    • Creative content generation
  4. Enterprise applications

    • Data privacy protection (local processing)
    • Offline AI capabilities
    • Reduce cloud costs
    • Real-time response requirements
  5. Research and development

    • Model performance testing
    • Hardware acceleration research
    • New model validation
    • Algorithm optimization experiments

Quick Start

CLI Method (Simplest)

```bash
# Install the Nexa CLI
# Windows (x64 with Intel/AMD NPU): download nexa-cli_windows_x86_64.exe
# macOS (x64): download nexa-cli_macos_x86_64.pkg

# Linux (ARM64)
curl -L https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_arm64.sh | bash

# Run your first model
nexa infer ggml-org/Qwen3-1.7B-GGUF

# Multimodal: drag and drop an image into the CLI
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF

# NPU support (Windows ARM64 with Snapdragon X Elite)
nexa infer NexaAI/OmniNeural-4B
```

Python SDK

```bash
# Install
pip install nexaai
```

```python
from nexaai import LLM, GenerationConfig, ModelConfig, LlmChatMessage

# Create an LLM instance
llm = LLM.from_(model="NexaAI/Qwen3-0.6B-GGUF", config=ModelConfig())

# Build the conversation
conversation = [
    LlmChatMessage(role="user", content="Hello, tell me a joke")
]
prompt = llm.apply_chat_template(conversation)

# Streaming generation
for token in llm.generate_stream(prompt, GenerationConfig(max_tokens=100)):
    print(token, end="", flush=True)
```

Android SDK

```kotlin
// Add to build.gradle.kts
dependencies {
    implementation("ai.nexa:core:0.0.19")
}

// Initialize the SDK
NexaSdk.getInstance().init(this)

// Load and run a model
VlmWrapper.builder()
    .vlmCreateInput(VlmCreateInput(
        model_name = "omni-neural",
        model_path = "/data/data/your.app/files/models/OmniNeural-4B/files-1-1.nexa",
        plugin_id = "npu",
        config = ModelConfig()
    ))
    .build()
    .onSuccess { vlm ->
        vlm.generateStreamFlow("Hello!", GenerationConfig()).collect { print(it) }
    }
```

iOS SDK

```swift
import NexaSdk

// Example: speech recognition
let asr = try Asr(plugin: .ane)
try await asr.load(from: modelURL)

let result = try await asr.transcribe(options: .init(audioPath: "audio.wav"))
print(result.asrResult.transcript)
```

Linux Docker

```bash
# Pull the image
docker pull nexa4ai/nexasdk:latest

# Run (requires an NPU token)
export NEXA_TOKEN="your_token_here"
docker run --rm -it --privileged \
  -e NEXA_TOKEN \
  nexa4ai/nexasdk:latest infer NexaAI/Granite-4.0-h-350M-NPU
```

Core Features

  1. NPU-first support

    • Industry's first NPU-first on-device AI runtime
    • Supports Qualcomm Hexagon NPU
    • Supports Apple Neural Engine (ANE)
    • Supports Intel/AMD NPU
    • Significantly improves performance and energy efficiency
  2. Full-platform runtime

    • PC: Python/C++ SDK
    • Android: Kotlin SDK, supports NPU/GPU/CPU
    • iOS: Swift SDK, supports ANE
    • Linux/IoT: Docker image, supports Arm64 & x86
  3. Day-0 model support

    • Supports newly released models
    • Multiple model formats: GGUF, MLX, NEXA
    • Quickly integrate new models on-device
  4. Multimodal AI capabilities

    • LLM: Large language models
    • VLM: Vision language models (multimodal)
    • ASR: Automatic speech recognition
    • OCR: Optical character recognition
    • Rerank: Reranking
    • Object Detection: Object detection
    • Image Generation: Image generation
    • Embedding: Vector embeddings
  5. Unified API interface

    • OpenAI-compatible API
    • Function calling support
    • Streaming generation support
    • Unified configuration interface
  6. Model format support

    • GGUF: Widely-used quantization format
    • MLX: Apple MLX framework format
    • NEXA: NexaSDK native format
  7. Hardware acceleration optimization

    • Automatically selects the best compute backend
    • Priority: NPU > GPU > CPU
    • Optimizations for different hardware
  8. Developer-friendly

    • Run models with one line of code
    • Detailed documentation and examples
    • Active community support
    • Rich cookbook
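Since the runtime exposes an OpenAI-compatible API, any OpenAI-style client can talk to a locally served model. The sketch below builds the kind of chat-completion request body such a client would send; the endpoint URL and port are assumptions for illustration, not the documented defaults.

```python
import json

# Hypothetical local endpoint -- the actual host/port depends on how the
# Nexa server is started; check the NexaSDK docs for the real value.
NEXA_URL = "http://localhost:18181/v1/chat/completions"

# An OpenAI-style chat-completion payload; an OpenAI client library
# pointed at the local server would produce an equivalent request body.
payload = {
    "model": "NexaAI/Qwen3-0.6B-GGUF",
    "messages": [{"role": "user", "content": "Hello, tell me a joke"}],
    "stream": True,
    "max_tokens": 100,
}

body = json.dumps(payload)
print(body)
```

Because the wire format matches OpenAI's, existing tooling built around that API can be repointed at the device without code changes.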

Project Advantages

Compared to other on-device AI frameworks, NexaSDK's advantages:

| Comparison | NexaSDK | Ollama | llama.cpp | LM Studio |
| --- | --- | --- | --- | --- |
| NPU support | ⭐⭐⭐⭐⭐ NPU-first | ❌ Not supported | ❌ Not supported | ❌ Not supported |
| Android/iOS SDK | ⭐⭐⭐⭐⭐ Full support | ⚠️ Partial support | ⚠️ Partial support | ❌ Not supported |
| Linux Docker | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ❌ Not supported |
| Day-0 model support | ⭐⭐⭐⭐⭐ GGUF/MLX/NEXA | ❌ Lags | ⚠️ Partial support | ❌ Lags |
| Multimodal support | ⭐⭐⭐⭐⭐ Full support | ⚠️ Partial support | ⚠️ Partial support | ⚠️ Partial support |
| Cross-platform | ⭐⭐⭐⭐⭐ All platforms | ⚠️ Some platforms | ⚠️ Some platforms | ⚠️ Some platforms |
| One-line execution | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⚠️ Needs config | ⭐⭐⭐⭐⭐ Supported |
| OpenAI API compat | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported | ⭐⭐⭐⭐⭐ Supported |

Why choose NexaSDK?

  • NPU-first: Fully leverages hardware acceleration for best performance and energy efficiency
  • Full-platform support: One SDK covers all platforms, reduces development costs
  • Day-0 support: Use new models as soon as they're released, no waiting
  • Multimodal capabilities: Complete AI capability stack for all needs
  • Developer-friendly: Simple API, rich documentation and examples

Detailed Project Analysis

Architecture Design

NexaSDK adopts a layered architecture whose core is a unified runtime abstraction layer:

```
┌─────────────────────────────────────┐
│   Application Layer                 │
│   - CLI / Python / Android / iOS    │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│   SDK Layer                         │
│   - Unified API interface           │
│   - Model loading and management    │
│   - Configuration and optimization  │
└──────────────┬──────────────────────┘
               │
┌──────────────▼──────────────────────┐
│   Runtime Layer                     │
│   - Compute backend abstraction     │
│   - Model format parsing            │
│   - Inference engine                │
└──────────────┬──────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼────┐         ┌─────▼─────┐
│  NPU   │         │   GPU     │
│ Plugin │         │  Plugin   │
└────────┘         └───────────┘
    │                     │
┌───▼─────────────────────▼─────┐
│   CPU Plugin (Fallback)       │
└───────────────────────────────┘
```

Core Module Details

1. Compute Backend Abstraction Layer

Function: Unified management of different compute backends (NPU, GPU, CPU)

Design characteristics:

  • Plugin-based architecture, easy to extend
  • Automatically selects the best backend
  • Priority: NPU > GPU > CPU
  • Supports backend switching and fallback

Supported NPUs:

  • Qualcomm Hexagon NPU (Snapdragon)
  • Apple Neural Engine (iOS/macOS)
  • Intel/AMD NPU (Windows)
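The NPU > GPU > CPU priority can be pictured with a tiny selection routine. This is an illustrative sketch only; the function and flag names are hypothetical, and the real selection logic lives inside the NexaSDK runtime.

```python
# Illustrative sketch of priority-based backend selection with CPU fallback.
def select_backend(available: dict) -> str:
    """Pick the best backend following the NPU > GPU > CPU priority."""
    for backend in ("npu", "gpu", "cpu"):
        if available.get(backend):
            return backend
    raise RuntimeError("no compute backend available")

# On a device with no NPU, selection falls through to the GPU.
print(select_backend({"npu": False, "gpu": True, "cpu": True}))  # → gpu
```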

2. Model Format Support

GGUF format:

  • Widely-used quantization format
  • Supports multiple quantization levels
  • Compatible with the llama.cpp ecosystem

MLX format:

  • Apple MLX framework format
  • Optimized for Apple Silicon
  • Supports macOS and iOS

NEXA format:

  • NexaSDK native format
  • Optimized for NPU
  • Better performance and compatibility
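One simple way to picture the runtime's format dispatch is choosing a parser by file extension. The code below is a hypothetical sketch of that idea, not NexaSDK internals.

```python
from pathlib import Path

# Hypothetical extension-to-format mapping; the actual parser registry
# lives inside the NexaSDK runtime.
PARSERS = {".gguf": "GGUF", ".mlx": "MLX", ".nexa": "NEXA"}

def detect_format(model_path: str) -> str:
    """Map a model file to its format name by extension."""
    ext = Path(model_path).suffix.lower()
    try:
        return PARSERS[ext]
    except KeyError:
        raise ValueError(f"unsupported model format: {ext}") from None

print(detect_format("OmniNeural-4B/files-1-1.nexa"))  # → NEXA
```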

3. Multimodal Capabilities

LLM (Large Language Models):

  • Text generation and conversation
  • Streaming output support
  • Function Calling support

VLM (Vision Language Models):

  • Image understanding and generation
  • Multimodal conversation
  • Visual question answering

ASR (Automatic Speech Recognition):

  • Speech to text
  • Supports multiple audio formats
  • Real-time recognition support

OCR (Optical Character Recognition):

  • Text recognition from images
  • Multi-language support
  • High-precision recognition

Other capabilities:

  • Rerank: Text reranking
  • Object Detection: Object detection
  • Image Generation: Image generation
  • Embedding: Vector embeddings

4. Platform-specific Implementations

PC platform (Python/C++):

  • Python SDK: Easy to use
  • C++ SDK: High performance
  • Supports Windows, macOS, Linux

Android platform:

  • Kotlin SDK
  • Supports NPU (Snapdragon 8 Gen 4+)
  • GPU and CPU fallback support
  • Minimum SDK 27

iOS platform:

  • Swift SDK
  • Supports Apple Neural Engine
  • iOS 17.0+ / macOS 15.0+
  • Swift 5.9+

Linux/IoT platform:

  • Docker image
  • Supports Arm64 and x86
  • Supports Qualcomm Dragonwing IQ9
  • Suitable for edge computing scenarios

Key Technical Implementation

1. NPU Acceleration Optimization

Challenge: Different vendors' NPU architectures vary significantly

Solution:

  • Unified NPU abstraction layer
  • Optimized implementations for different NPUs
  • Automatic NPU detection and selection
  • Performance monitoring and tuning

2. Model Format Conversion

Challenge: Supporting multiple model formats requires unified handling

Solution:

  • Format parser abstraction
  • Unified model representation
  • Format conversion tools
  • Caching mechanism optimization

3. Cross-platform Compatibility

Challenge: Different platforms have different APIs and constraints

Solution:

  • Platform abstraction layer
  • Conditional compilation
  • Unified configuration interface
  • Platform-specific optimizations

4. Memory Management

Challenge: On-device memory is limited, needs efficient management

Solution:

  • Smart memory allocation
  • Model quantization support
  • Memory pool management
  • Timely resource release
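The "timely resource release" point can be illustrated with a context-manager pattern, a common way to guarantee that device memory is freed deterministically even when generation fails midway. All names below are hypothetical stand-ins, not NexaSDK API.

```python
from contextlib import contextmanager

class FakeModel:
    """Stand-in for an expensive on-device model load."""
    def __init__(self, name: str):
        self.name = name
        self.loaded = True

    def release(self) -> None:
        # Stand-in for freeing device memory.
        self.loaded = False

@contextmanager
def model_session(name: str):
    """Bound the model's lifetime to a `with` block."""
    model = FakeModel(name)
    try:
        yield model
    finally:
        model.release()  # always runs, even if inference raises

with model_session("Qwen3-0.6B") as m:
    assert m.loaded  # model is live inside the block
print(m.loaded)  # → False
```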

Extension Mechanisms

1. Adding New Compute Backends

Implement the Plugin interface:

```go
type ComputeBackend interface {
    Name() string
    IsAvailable() bool
    LoadModel(config ModelConfig) error
    Generate(input Input, config GenerationConfig) (Output, error)
}
```

2. Adding New Model Formats

Implement the FormatParser interface:

```go
type FormatParser interface {
    CanParse(path string) bool
    Parse(path string) (*Model, error)
    Optimize(model *Model, target Backend) error
}
```

3. Adding New AI Capabilities

Implement the Capability interface:

```go
type Capability interface {
    Name() string
    SupportedModels() []string
    Process(input Input, config Config) (Output, error)
}
```


Who Should Use This

Highly Recommended:

  • Developers building on-device AI applications
  • Mobile app developers wanting to leverage NPU acceleration
  • Developers needing to run AI models on IoT devices
  • Researchers interested in on-device AI performance optimization

Also Suitable For:

  • Students wanting to learn on-device AI implementation
  • Architects needing to evaluate different AI frameworks
  • Technical professionals interested in NPU acceleration

Feel free to visit my personal homepage for more useful tips and interesting products.