TL;DR
Nvidia released Nemotron 3 Nano Omni, an open-weight multimodal model that unifies vision, audio, and language in a single architecture with 30B parameters but only 3B active per inference. Nvidia claims 9x throughput over comparable open models and top scores on six benchmarks. Available under Nvidia’s Open Model Agreement for commercial use, it targets edge AI agent deployment on single GPUs, making Nvidia a competitor not just in AI infrastructure but in the models that run on it.
Nvidia released Nemotron 3 Nano Omni on Tuesday, an open-weight multimodal AI model that unifies vision, audio, and language understanding in a single architecture designed to power autonomous AI agents on edge devices. The model has 30 billion parameters but activates only three billion per forward pass through a mixture-of-experts design, a ratio that allows it to run on a single GPU while matching or exceeding the multimodal capabilities of models several times its size. Nvidia claims nine times higher throughput than comparable open multimodal models at equivalent interactivity, 2.9 times faster single-stream reasoning on multimodal tasks, and roughly nine times greater effective system capacity for video reasoning. The model tops six benchmarks across document intelligence, video understanding, and audio comprehension. It processes text, images, audio, video, documents, charts, and graphical interfaces as inputs and produces text as output, meaning a single model can replace the patchwork of specialised vision, speech, and document-processing models that most enterprise AI deployments currently stitch together. The release, available on Hugging Face under Nvidia’s Open Model Agreement with full commercial use rights, is Nvidia’s most aggressive move yet from selling the infrastructure that AI runs on to selling the AI itself.
The architecture
Nemotron 3 Nano Omni uses a hybrid Mamba-Transformer architecture with 23 Mamba-2 selective state-space layers, 23 mixture-of-experts layers, each with 128 experts of which six are routed per token alongside a shared expert, and six grouped-query attention layers. The vision encoder, C-RADIOv4-H, handles variable-resolution images with 16-by-16 patches, scaling from 1,024 to 13,312 visual patches per image. The audio encoder, Parakeet-TDT-0.6B-v2, processes speech and environmental audio. Video processing uses three-dimensional convolutions to capture motion between frames rather than treating video as a sequence of still images. The base text model was pretrained on 25 trillion tokens and supports a 256,000-token context window. The architectural choices reflect a specific design philosophy: maximise capability per active parameter rather than total parameters, because edge deployment is constrained not by model size at rest but by compute per inference step. The three billion active parameters per forward pass mean the model can run on hardware announced at Nvidia’s GTC 2026 developer conference, including the DGX Spark and DGX Station workstations, without requiring the multi-GPU clusters that power larger models in data centres.
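A rough sketch of those figures in code, with the layer counts and patch bounds taken from the numbers above; the class name, the clamping behaviour, and the example resolutions are illustrative assumptions, not Nvidia’s published implementation:

```python
from dataclasses import dataclass

@dataclass
class NanoOmniArchSketch:
    # Layer counts as reported; the interleaving order is not stated.
    mamba2_layers: int = 23        # Mamba-2 selective state-space layers
    moe_layers: int = 23           # 128 experts, 6 routed + 1 shared per token
    gqa_layers: int = 6            # grouped-query attention layers
    context_window: int = 256_000  # tokens
    patch_size: int = 16           # C-RADIOv4-H uses 16-by-16 patches

    def visual_patches(self, height: int, width: int) -> int:
        """Patch count for a variable-resolution image, clamped to the
        1,024-13,312 range stated for the encoder (assumed behaviour)."""
        n = (height // self.patch_size) * (width // self.patch_size)
        return max(1_024, min(13_312, n))

cfg = NanoOmniArchSketch()
print(cfg.visual_patches(512, 512))    # 1,024  -> the stated lower bound
print(cfg.visual_patches(1664, 2048))  # 13,312 -> the stated upper bound
```

The patch arithmetic checks out: a 512-by-512 image lands exactly on the 1,024-patch floor, and an image around 1,664 by 2,048 pixels on the 13,312-patch ceiling.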
The mixture-of-experts approach is not new, but its application to a multimodal model at this scale is. Most open multimodal models either use a single dense architecture, which requires all parameters to be active on every inference step, or use separate specialist models stitched together in a pipeline, which introduces latency at each handoff. Nemotron 3 Nano Omni does neither. It routes each token to six of 128 experts within a unified model, meaning vision tokens, audio tokens, and text tokens all flow through the same architecture but activate different expertise depending on the modality. The result is a model that can process a video feed, a spoken instruction, and a document simultaneously without the inter-model latency that makes pipeline architectures unsuitable for real-time agent applications. For enterprise deployments, this collapses the operational complexity of maintaining separate vision, speech, and language models with separate inference endpoints, monitoring, and versioning into a single model serving a single endpoint.
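A minimal sketch of that routing pattern in PyTorch, assuming softmax gating over the selected experts and a simple MLP per expert; all hidden sizes and gating details below are illustrative choices, not Nemotron’s actual design:

```python
# Top-k mixture-of-experts routing: each token activates 6 of 128 experts
# plus one always-on shared expert. Dimensions are assumptions for the demo.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=2048, d_ff=8192, n_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(expert() for _ in range(n_experts))
        self.shared = expert()  # shared expert: runs on every token

    def forward(self, x):                        # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalise over chosen experts
        out = self.shared(x)                     # shared path, always active
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e         # tokens routed to expert e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

A production kernel would replace the per-expert loop with batched scatter/gather operations, but the routing logic is the same: only six expert MLPs run per token, which is how three billion of thirty billion parameters end up active per step.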
The strategy
Nvidia has spent the AI boom selling infrastructure: GPUs, networking, and the CUDA software ecosystem that locks developers into its hardware. The Nemotron model family, which has been downloaded more than 50 million times in the past year, represents a parallel strategy in which Nvidia also provides the models that run on that infrastructure. The logic is circular but powerful: Nvidia’s models are optimised for Nvidia’s hardware, and Nvidia’s hardware is optimised for Nvidia’s models, creating a full-stack ecosystem that competes with the model-plus-cloud offerings from Google, Amazon, and Microsoft. The case for small, domain-specific language models has been made across education, healthcare, and enterprise, and Nemotron 3 Nano Omni extends that argument to multimodal applications: rather than calling a massive cloud model for every vision or audio task, enterprises can run a compact model locally that handles the full perceptual stack.
Early enterprise adoption includes Foxconn, Palantir, Aible, ASI, Eka Care, and H Company, with Dell, DocuSign, Infosys, Oracle, and Zefr evaluating the model for production deployment. The use cases (factory-floor visual inspection, document processing, voice agent applications, and screen understanding for computer-use agents) reflect the market Nvidia is targeting: not consumer AI assistants but industrial AI agents that need to see, hear, and read in real time on local hardware. The model is available as an Nvidia NIM microservice, through Amazon SageMaker JumpStart, and on OpenRouter, with deployment options including vLLM, SGLang, Ollama, llama.cpp, and TensorRT-LLM. The breadth of deployment options is itself a competitive statement: Nvidia is making the model runnable everywhere, on every framework, to maximise adoption and deepen the dependency on Nvidia’s broader ecosystem.
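As a concrete sketch of that deployment story, here is one of the listed paths: serving the model locally with vLLM and querying it through the OpenAI-compatible API that vLLM exposes. The Hugging Face repository name is a placeholder, since the exact model ID should be checked against the model card:

```python
# Launch an OpenAI-compatible server first (placeholder repo name):
#   vllm serve nvidia/<nemotron-3-nano-omni> --max-model-len 32768
from openai import OpenAI

# vLLM serves its OpenAI-compatible API on localhost:8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="nvidia/<nemotron-3-nano-omni>",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": "Summarise this maintenance log in two sentences.",
    }],
)
print(resp.choices[0].message.content)
```

Because every listed framework speaks the same OpenAI-style protocol, swapping vLLM for SGLang, Ollama, or TensorRT-LLM changes the launch command but not the client code.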
The competition
Open-source AI models designed for agentic reasoning are arriving from multiple directions simultaneously. DeepSeek’s V4-Pro and V4-Flash, released last week, use a hybrid attention architecture optimised for long-horizon agentic tasks. Meta’s Llama models dominate the open-weight text space. Google’s Gemini models handle multimodal tasks at cloud scale. OpenAI’s GPT models remain the commercial benchmark. What distinguishes Nemotron 3 Nano Omni is not any single capability but the combination: multimodal perception across vision, audio, and text in a single model, with mixture-of-experts efficiency that enables edge deployment, released as open weights with commercial licensing. No other model currently offers all four properties together. The closest comparators, Google’s Gemini Nano for on-device deployment and Meta’s Llama for open weights, each lack at least one element: Gemini Nano is not open-weight, and Llama’s multimodal capabilities do not include audio processing in a unified architecture.
The competitive implications extend beyond the model itself. If Nvidia’s open models become the default for edge AI agent deployment, the company captures value at every layer of the stack: the GPU that runs inference, the software framework that optimises it, and now the model itself. Competitors who build on Nvidia’s models deepen their dependency on Nvidia’s hardware. Competitors who build their own models still need Nvidia’s GPUs to train them. The agentic AI era is accelerating across the industry, and Nvidia’s strategy is to be indispensable at every layer rather than dominant at one. Nemotron 3 Nano Omni is not Nvidia’s answer to GPT-4o. It is Nvidia’s argument that the future of AI agents will be built on small, efficient, open models running on Nvidia hardware at the edge, rather than large, proprietary models running on someone else’s cloud. Whether that argument holds depends on whether the enterprises building the next generation of autonomous systems prefer local control over cloud convenience, and whether a model with three billion active parameters can do the work that currently requires models with hundreds of billions. The benchmarks say it can. The market will decide whether the benchmarks are right.