System Overview and Architecture
QNAP has released the QAI-h1290FX, a dedicated edge computing storage platform engineered for large language models, retrieval-augmented generation, and generative AI applications. The chassis merges a mature processor architecture with modern workstation graphics hardware to facilitate on-premises AI deployment without cloud dependency. The system relies on the AMD EPYC 7302P processor, which delivers 16 cores and 32 threads built on the Zen 2 microarchitecture. This CPU is positioned to manage AI inference workloads and heavy parallel processing tasks.
Two GPU options are offered: the NVIDIA RTX PRO 4500 Blackwell with 32GB of VRAM, targeted at models of approximately 30 billion parameters, and the flagship RTX PRO 6000 Blackwell with 96GB of VRAM, intended for models of 70 billion parameters and larger. Both accelerators provide CUDA cores, Tensor cores, and Transformer Engine acceleration for deep learning and image generation workloads.
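The pairing of VRAM capacity with parameter count can be sanity-checked with simple arithmetic. The sketch below is an illustration, not QNAP's sizing method: it assumes roughly 4.5 bits per weight for a 4-bit quantized model and a 20 percent allowance for KV cache and runtime overhead. Both figures, and the function names, are assumptions chosen for the example.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Check whether weights plus an ASSUMED overhead allowance fit in VRAM."""
    needed_gb = estimate_weight_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed_gb <= vram_gb


if __name__ == "__main__":
    ASSUMED_BITS = 4.5  # assumption: rough average for 4-bit quantization
    for gpu, vram_gb, params_b in [
        ("RTX PRO 4500 Blackwell", 32, 30),
        ("RTX PRO 6000 Blackwell", 96, 70),
    ]:
        weights = estimate_weight_gb(params_b, ASSUMED_BITS)
        ok = fits_in_vram(params_b, ASSUMED_BITS, vram_gb)
        print(f"{gpu}: {params_b}B model needs about {weights:.1f}GB of weights, fits={ok}")
```

Under these assumptions a 70-billion parameter model needs about 39GB for its weights, while the same model at 16 bits per weight would need 140GB and would not fit on either card.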
Storage, Networking, and Management
Data handling is facilitated by an all-flash architecture featuring twelve U.2 bays compatible with both NVMe and SATA SSDs, enabling rapid input/output speeds for continuous model execution and data streaming. Network connectivity is provided through dual 25-gigabit and dual 2.5-gigabit Ethernet ports, with additional PCIe slots available for third-party 100-gigabit adapter cards. The unit also integrates with QNAP’s JBOD expansion enclosures to support large-scale data storage requirements.
Software management utilizes a containerized environment supporting Docker and LXD, complete with a graphical AI app center. This interface allows administrators to allocate GPU resources and deploy AI applications without manual command-line configuration. The platform is designed to operate entirely locally, ensuring sensitive organizational data remains on-premises while accelerating AI workflows.
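For readers curious what GPU allocation looks like outside the graphical interface, the sketch below builds the generic Docker command for starting an Ollama container with GPU access. It is not a description of what QNAP's app center does internally, which the source does not specify. The function name and its parameters are hypothetical; the flags, image name, and port follow the standard Docker and Ollama conventions.

```python
import shlex


def build_ollama_run_command(gpu_device="all", host_port=11434, volume="ollama"):
    """Return the argument list for a `docker run` that exposes a GPU to Ollama.

    gpu_device: "all" for every GPU, or a device index such as 0.
    """
    gpus = "all" if gpu_device == "all" else f"device={gpu_device}"
    return [
        "docker", "run", "-d",
        "--gpus", gpus,                   # GPU allocation for the container
        "-v", f"{volume}:/root/.ollama",  # persist downloaded models
        "-p", f"{host_port}:11434",       # Ollama's default API port
        "--name", "ollama",
        "ollama/ollama",
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(shlex.join(build_ollama_run_command(gpu_device=0)))
```

The function only assembles the command; passing the list to `subprocess.run` would execute it on a host with Docker and the NVIDIA container runtime installed.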
Benchmark Performance Data
Testing of the 96GB RTX PRO 6000 Blackwell configuration with native Ollama inference yields the following results across various large language models:
- gpt-oss:120b (MXFP4): 90 tokens per second, approximately 63GB of VRAM
- deepseek-r1:70b (q4_K_M): 24 tokens per second, roughly 41GB of VRAM
- qwen3:32b (q4_K_M): 46 tokens per second, 21GB of VRAM
- gemma3:27b (q4_K_M): 54 tokens per second, 19GB of VRAM
- deepseek-r1:8b (q4_K_M): 140 tokens per second, approximately 7GB of VRAM
- qwen3:8b (q4_K_M): 172 tokens per second, approximately 7GB of VRAM
The smaller architectures demonstrate the highest throughput, with both 8B models consuming approximately 7GB of VRAM.
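The source does not say how these tokens-per-second figures were collected. One common approach with Ollama is to derive the rate from the `eval_count` and `eval_duration` fields that its generate API returns, where the duration is in nanoseconds. The sketch below assumes that method; the sample response is made up for illustration and is not a measurement from this system.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation throughput from an Ollama generate-API response.

    eval_count is the number of generated tokens;
    eval_duration is the generation time in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    # Illustrative values only: 450 tokens generated in 5 seconds.
    sample = {"eval_count": 450, "eval_duration": 5_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens per second")
```

Because the calculation uses only the generation phase, it excludes model load time and prompt processing, which Ollama reports in separate fields.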
Concurrent vLLM inference tests show how throughput scales with the number of threads. For the deepseek-ai/DeepSeek-R1-Distill-Qwen-7B model:
- 1 thread: 79 total tokens per second (79 per thread)
- 2 threads: 166 total tokens per second (83 per thread)
- 5 threads: 410 total tokens per second (82 per thread)
- 10 threads: 688 total tokens per second (68.8 per thread)
- 20 threads: 810 total tokens per second (40.5 per thread)
- 50 threads: 850 total tokens per second (17 per thread)
Total throughput keeps rising up to fifty threads, but per-thread speed falls off sharply beyond ten threads.
For the openai/gpt-oss-20b model:
- 1 thread: 218 total tokens per second (218 per thread)
- 2 threads: 340 total tokens per second (170 per thread)
- 5 threads: 1045 total tokens per second (209 per thread)
- 10 threads: 880 total tokens per second (88 per thread)
- 20 threads: 600 total tokens per second (30 per thread)
Here total throughput peaks at five threads and declines at higher concurrency.
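The relationship between total and per-thread throughput is a simple division, and a few lines of code make the saturation point easy to spot. The sketch below uses the openai/gpt-oss-20b totals reported above; the function names are illustrative.

```python
# Total tokens per second by thread count, as reported for openai/gpt-oss-20b.
GPT_OSS_20B_TOTALS = {1: 218, 2: 340, 5: 1045, 10: 880, 20: 600}


def per_thread(total_by_threads: dict) -> dict:
    """Divide each total throughput by its thread count."""
    return {n: total / n for n, total in total_by_threads.items()}


def peak_concurrency(total_by_threads: dict) -> int:
    """Return the thread count with the highest total throughput."""
    return max(total_by_threads, key=total_by_threads.get)


if __name__ == "__main__":
    for threads, rate in per_thread(GPT_OSS_20B_TOTALS).items():
        print(f"{threads:>2} threads: {rate:.1f} tokens per second per thread")
    print("Peak total throughput at", peak_concurrency(GPT_OSS_20B_TOTALS), "threads")
```

For this data set the peak falls at five threads, which suggests that adding more concurrent requests beyond that point reduces both total throughput and per-user responsiveness.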
Pricing and Configuration Options
Memory modules are sold separately, with DDR4-3200 configurations available from 8GB up to 64GB. Additional network and storage expansion cards are offered as independent accessories. The system includes a five-year warranty. Pricing is set at $8,999 for the 64GB variant, $13,499 for the 128GB configuration, and $15,999 for the top-tier 256GB model.
