GPU Accelerated Fabric Data Warehouse: CoddSpeed Complete Guide
Microsoft announced the first fully managed GPU-accelerated cloud data warehouse at Build 2026 on June 2, 2026. The engine is called CoddSpeed. It started as a Microsoft Research prototype running SQL on PyTorch tensors, evolved into a production system, and just won the SIGMOD 2026 Best Industry Paper award. No SQL rewrites. No config changes. Up to 7× faster at 64-user concurrency. Early access opens July 2026.
The GPU accelerated Fabric Data Warehouse uses an engine called CoddSpeed to offload supported SQL query operations (large aggregations, complex joins, massive dataset scans) from CPUs to NVIDIA GPUs. No SQL rewrites. Your data stays in place. The engine identifies GPU-eligible operations automatically and routes them transparently. Internal benchmarks (May 2026) show up to 7× faster performance at 64-user concurrency vs three unnamed comparable cloud warehouses. Early access preview opens July 2026. (per Microsoft Fabric Community Blog, June 2, 2026)
Why GPU Acceleration Changes the Economics of Data Warehousing
Data warehouses have always run on CPUs. That was fine when data volumes were smaller, concurrency was lower, and the primary consumer of warehouse queries was a scheduled report running overnight. Three things changed that picture.
First, CPU performance gains slowed. Moore’s Law wound down. x86 software optimizations saturated. As data sizes kept growing, cost-per-query started bending the wrong way: more data, same hardware, higher bills. Per the CoddSpeed paper authors at Microsoft: “CPU gains were slowing: Moore’s law was winding down, x86 software optimizations were saturating, and as data sizes kept growing, cost-per-query was bending the wrong way.”
Second, AI flooded data centers with GPU compute. NVIDIA accelerated computing, custom ASICs, NVLink, InfiniBand, and CXL are orders of magnitude faster than traditional servers at compute, memory bandwidth, and networking. Data centers that had invested in GPU infrastructure for AI inference had this hardware sitting underutilized between inference jobs.
Third, agents changed the query pattern. A human opens a dashboard and looks at it once. An AI agent issues the same complex analytical query dozens of times per minute, with each query in the critical path of a real-time application response. Traditional CPU-based warehouse performance that was acceptable for scheduled reporting becomes a production bottleneck for agentic workloads.
The Core Problem CoddSpeed Solves
Agents, applications, and AI systems are now querying data warehouses continuously, not just at scheduled report times. Every query sits in the critical path of a user experience or agent response. CPU-based warehouses were not designed for this pattern. GPU parallelism handles mixed workloads (many concurrent analytical queries) more efficiently than CPU thread pools, which is why the performance gap grows with concurrency.
CoddSpeed is Microsoft’s answer to all three: take the GPU hardware already in Azure data centers for AI workloads, build a thin abstraction layer that lets the SQL query optimizer route eligible operations to GPUs instead of CPUs, and make the entire thing transparent to the SQL developer writing the queries.
CoddSpeed Architecture: Two Thin Abstraction Layers
CoddSpeed’s architecture is intentionally minimal. The design philosophy, per the paper’s lead author Matteo Interlandi (Principal Scientist Manager, Azure Data GSL): add the minimum number of abstraction layers needed to get query fragments onto GPU hardware without rebuilding the query optimizer or the storage engine.
The result is two layers: CAL and DAL. Both are minimalist by design; each does exactly one job and delegates everything else to the existing Fabric infrastructure.
Coprocessor Abstraction Layer (CAL)
A hardware-agnostic API for offloading query fragments (sub-plans). The Fabric Data Warehouse optimizer serializes eligible query fragments as Substrait plans, hands them to a coprocessor runtime, feeds data through it (Parquet or SQL Server columnar, zero-copy where possible), and collects results in columnar format.
Coprocessors expose capabilities so the optimizer knows what each one can run. Fragments too large for GPU High Bandwidth Memory use a partitionable execution model with per-partition CPU fallback. CAL does not optimize plans; that stays in Fabric DW’s Cascades optimizer.
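The routing behavior described above can be sketched in a few lines. This is a minimal illustration, not CoddSpeed’s actual API: the `Coprocessor` and `Fragment` types, the `route_fragment` function, the operator names, and the 80 GiB memory figure are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coprocessor:
    """Hypothetical description of a device and the capabilities it advertises."""
    name: str
    supported_ops: frozenset   # operators the device says it can run
    memory_bytes: int          # device memory budget (for a GPU, its HBM)


@dataclass(frozen=True)
class Fragment:
    """Hypothetical query fragment (sub-plan) split into partitions."""
    ops: frozenset             # operators used by the fragment
    partition_bytes: tuple     # input size of each partition


def route_fragment(fragment, device):
    """Return one execution target per partition: the device name or "cpu"."""
    # If the fragment needs an operator the device does not advertise,
    # the whole fragment stays on CPU.
    if not fragment.ops <= device.supported_ops:
        return ["cpu"] * len(fragment.partition_bytes)
    # Otherwise decide partition by partition: anything that does not fit
    # in device memory falls back to CPU, the rest is offloaded.
    return [
        device.name if size <= device.memory_bytes else "cpu"
        for size in fragment.partition_bytes
    ]


GIB = 2**30
gpu = Coprocessor("gpu0", frozenset({"scan", "join", "aggregate"}), 80 * GIB)
frag = Fragment(frozenset({"scan", "aggregate"}), (10 * GIB, 120 * GIB, 40 * GIB))
print(route_fragment(frag, gpu))  # ['gpu0', 'cpu', 'gpu0']
```

In this sketch the oversized middle partition runs on CPU while its neighbors are offloaded, which is the "per-partition fallback" idea: one large partition does not force the whole fragment off the accelerator.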
Data Abstraction Layer (DAL)
A unified caching and shuffle service that hides the transport layer behind a single key/value API: NVLink, Infinity Fabric, InfiniBand, PCIe, and Ethernet are all abstracted away. Applications call one API regardless of what hardware transport is moving data between GPU and CPU.
DAL does not decide what to cache long-term; that stays with the host scheduler. It simply ensures data movement between processing units is as fast as possible regardless of the specific hardware configuration in the Azure data center.
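To make the "one key/value API over many transports" idea concrete, here is a minimal sketch. Every name in it (`Transport`, `InMemoryTransport`, `DataLayer`, `put`, `get`) is hypothetical and invented for illustration; the real DAL interface is not described in the announcement, and the in-memory class only stands in for an actual hardware link.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical interface that each physical link would implement."""

    @abstractmethod
    def send(self, key: str, payload: bytes) -> None: ...

    @abstractmethod
    def receive(self, key: str) -> bytes: ...


class InMemoryTransport(Transport):
    """Stand-in for a real link; the label only records which link it imitates."""

    def __init__(self, label: str):
        self.label = label
        self._store = {}

    def send(self, key, payload):
        self._store[key] = payload

    def receive(self, key):
        return self._store[key]


class DataLayer:
    """Callers see only put/get; the transport behind it can be swapped freely."""

    def __init__(self, transport: Transport):
        self._transport = transport

    def put(self, key: str, value: bytes) -> None:
        self._transport.send(key, value)

    def get(self, key: str) -> bytes:
        return self._transport.receive(key)


# The calling code is identical whichever link sits underneath.
for label in ("nvlink", "ethernet"):
    layer = DataLayer(InMemoryTransport(label))
    layer.put("shuffle/partition-0", b"columnar batch")
    print(label, layer.get("shuffle/partition-0"))
```

The design point is that swapping the transport changes nothing for the caller, which is what lets the layers above stay untouched when the underlying interconnect changes.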
Why Minimalism Is the Right Design Choice
GPU hardware generations change every 18–24 months. An abstraction layer that exposes too much hardware-specific detail would require rewriting application code with each new NVIDIA architecture. CAL and DAL abstract away the hardware details so CoddSpeed can adopt newer GPUs, FPGAs, or ASICs without changing the SQL engine or the application layer. Per the paper: this engine is “designed to outlive any single chip generation.”
What Operations Get GPU-Offloaded
Not every SQL operation benefits from GPU acceleration. CoddSpeed focuses on the operations that are both compute-intensive (worth the overhead of routing to GPU) and parallelizable (benefit from GPU’s massively parallel architecture):
- Large aggregations: SUM, COUNT, AVG, MIN, MAX across hundreds of millions to billions of rows, where GPU parallelism dramatically reduces scan time
- Complex joins: Multi-table analytical joins where the join cardinality is high and the operation is CPU-bound in traditional execution
- Massive dataset scans: Full or near-full table scans on large fact tables, where GPU memory bandwidth advantages apply
- Reporting and application workloads: The benchmark covered “common reporting, application, and AI-driven analytics scenarios” per the official announcement
Operations that don’t fit GPU High Bandwidth Memory use per-partition CPU fallback automatically. The query still executes correctly, just on CPU for those fragments.
Research Origins: From TQP to CoddSpeed
CoddSpeed did not emerge fully formed at Build 2026. It is the production version of a multi-year research project that started with a question that sounded ridiculous at the time: “Could we even run SQL on AI compute runtimes?”
Tensor Query Processor (Research Prototype)
The original research prototype. It expressed relational operators (SELECT, JOIN, GROUP BY, ORDER BY) as PyTorch tensor operations. The insight: PyTorch already has highly optimized, GPU-accelerated implementations of the mathematical operations underlying relational algebra. Why not use them directly for SQL?
TQP proved the concept worked but wasn’t production-ready: it was tightly coupled to PyTorch and specific GPU architectures, making it fragile for a production data warehouse that needed to run on diverse hardware across Azure’s global infrastructure.
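A small example shows what "relational operators as tensor operations" means in practice. The sketch below uses plain Python lists so it runs without dependencies; it is not TQP code, and the column data, `filter_mask`, and `grouped_sum` are hypothetical. In a tensor library the same two steps would be a single elementwise comparison and a scatter-add (in PyTorch, roughly a boolean mask followed by `index_add_`), each running as one parallel kernel over the whole column.

```python
# Query being illustrated:
#   SELECT region, SUM(amount) FROM sales WHERE amount > 100 GROUP BY region

# Columns stored as flat arrays; regions are dictionary-encoded as integers.
region_ids = [0, 1, 0, 2, 1, 2]
amounts = [50, 200, 150, 300, 80, 120]


def filter_mask(column, threshold):
    """WHERE amount > threshold, expressed as a 0/1 mask over the whole column."""
    return [1 if value > threshold else 0 for value in column]


def grouped_sum(group_ids, values, mask, num_groups):
    """GROUP BY + SUM as a scatter-add: each row adds value * mask to its group slot."""
    totals = [0] * num_groups
    for group, value, keep in zip(group_ids, values, mask):
        totals[group] += value * keep
    return totals


mask = filter_mask(amounts, 100)
totals = grouped_sum(region_ids, amounts, mask, num_groups=3)
print(mask)    # [0, 1, 1, 1, 0, 1]
print(totals)  # [150, 200, 420]
```

Note that filtered-out rows are not removed; they are multiplied by zero. Keeping every row in place with a mask is what makes the work uniform across rows and therefore easy to spread over thousands of GPU threads.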
Production Engine (CoddSpeed)
The hardened, optimized version of TQP. Named after Edgar F. Codd, the computer scientist who invented the relational model in 1970. It replaced the PyTorch coupling with the CAL/DAL abstraction layers, making it hardware-agnostic and able to run on NVIDIA GPUs, future FPGAs, and custom ASICs without application-layer changes.
Won the SIGMOD 2026 Best Industry Paper award. SIGMOD (Special Interest Group on Management of Data) is the flagship peer-reviewed venue for database research, which makes the award one of the highest academic recognitions for production database systems work.
The name “CoddSpeed” is a deliberate nod to Edgar F. Codd, the IBM researcher who published “A Relational Model of Data for Large Shared Data Banks” in 1970, founding the relational database field. The choice of name signals that Microsoft sees this as a generational shift in query processing: the first new execution paradigm since the relational model itself became the foundation for analytical databases.
Benchmark Numbers: What the Data Actually Says
Microsoft published internal benchmark figures in the Build 2026 announcement. Before reading them, understand what they are and what they are not.
What the Benchmarks Cover
Internal testing conducted in May 2026, covering “common reporting, application, and AI-driven analytics scenarios.” The test measured query performance at different concurrency levels against three unnamed comparable cloud data warehouses. The specific benchmark suite (whether TPC-H, TPC-DS, or a proprietary Microsoft benchmark) is not disclosed in the public announcement. These are vendor-published figures with standard vendor benchmark caveats.
| Concurrency Level | Reported Speedup vs Comparison Warehouses | What This Means |
|---|---|---|
| Single user (1 concurrent) | ~3× faster | Raw throughput advantage at low concurrency. Useful for individual developer or analyst workloads. |
| 16 concurrent users | ~6× faster | Mid-scale dashboard load. Teams of analysts hitting a Power BI report simultaneously. |
| 64 concurrent users | ~7× faster | Enterprise-scale concurrency. Multiple departments, scheduled reports, and interactive queries simultaneously. The gap grows because GPU parallelism handles mixed workload pressure better than CPU thread pools. |
Verified Customer Result
UNC Health (a US healthcare organization) is cited as an early customer reporting up to 5× improvement in query speeds on their existing workloads. This is a real production result on existing data and queries, not a synthetic benchmark. It is the most credible data point in the announcement for evaluating whether the benchmark numbers reflect real-world outcomes.
Read Vendor Benchmarks Carefully
The three comparison providers are not named. The benchmark suite is not disclosed. “Common reporting, application, and AI-driven analytics scenarios” is broad enough to include workload selection that favors the test subject. The concurrency scaling pattern (3× at 1 user, 7× at 64 users) is plausible, since GPU parallelism genuinely scales better under concurrent mixed workloads, but independent validation is not yet available. Run your own workloads in the July 2026 early access preview before making architecture decisions based on these numbers.
Why the Concurrency Gap Makes Sense
The fact that the performance gap grows with concurrency (3× at 1 user, 7× at 64 users) is the most technically credible signal in the benchmark. CPU thread pools compete for shared cache and memory bandwidth under concurrent mixed workloads: as more users hit the warehouse simultaneously, CPU architectures experience contention. GPU architectures with High Bandwidth Memory and NVLink interconnects handle parallelism differently. The same hardware that runs one query efficiently can also run 64 queries efficiently because the parallelism model scales horizontally rather than vertically.
How It Works in Practice: What Changes for You
The short answer: nothing changes for you. That is the design intention and the most important practical fact about CoddSpeed for Fabric users.
- Your SQL stays unchanged. No new syntax. No query hints. No GPU-specific functions. The T-SQL you write today in Fabric Data Warehouse runs unchanged with GPU acceleration.
- Your data stays in place. CoddSpeed reads data from the same Delta-Parquet files in OneLake that the CPU engine reads. No data migration. No separate GPU data store.
- The optimizer decides what gets GPU-offloaded. The Fabric Data Warehouse Cascades optimizer identifies which query fragments are eligible for GPU execution based on the coprocessor capabilities exposed via CAL. You do not control this routing manually.
- Non-eligible operations stay on CPU. Operations that don’t benefit from GPU acceleration, or fragments too large for GPU High Bandwidth Memory, execute on CPU with per-partition fallback. The query still completes correctly.
- Enable via workspace toggle. In the early access preview (July 2026), GPU acceleration will be enabled through a workspace-level setting, with no infrastructure provisioning required.
What This Means for Existing Fabric Warehouse Investments
If you have already built T-SQL queries, star schemas, stored procedures, and Power BI reports on Fabric Data Warehouse, you get GPU acceleration without touching any of them. The investment you have already made in Fabric Warehouse query optimization, partition strategy, and Direct Lake semantic models carries forward completely. CoddSpeed accelerates what you already have.
Hardware Support: Designed for Multiple Accelerator Types
CoddSpeed’s CAL/DAL abstraction was built with future hardware in mind. The architecture supports: NVIDIA GPUs (the initial implementation), FPGAs, custom ASICs, NVLink interconnects, InfiniBand, CXL, and PCIe. Microsoft’s design intent is that as new accelerator hardware enters Azure data centers (whether from NVIDIA, AMD, or custom silicon), CoddSpeed can adopt it without application-layer changes. The GPU implementation is the first, not the only, coprocessor target.
Who Benefits Most from GPU Accelerated Fabric Data Warehouse
| Workload Type | Expected Benefit | Why |
|---|---|---|
| High-concurrency dashboards (16+ simultaneous users) | High (up to 6–7×) | Concurrency is where GPU architecture advantage compounds. Enterprise Power BI deployments with many simultaneous report viewers are the primary beneficiary. |
| Agentic AI queries (continuous analytical queries from agents) | High | Agents don’t sleep. Continuous multi-step analytical queries from AI agents that were previously bottlenecked by CPU concurrency limits run faster at scale. |
| Large aggregations on fact tables (hundreds of millions+ rows) | High | GPU memory bandwidth and parallel compute cores handle large scan+aggregate operations significantly faster than CPU. |
| Complex multi-table joins in analytical queries | High | GPU parallelism reduces join execution time when cardinalities are large. |
| Scheduled batch reports (single-user, off-peak) | Moderate (~3×) | Single-user workloads still benefit but the advantage is smaller. If reports already run in acceptable time windows, the improvement may be less critical. |
| Small lookup queries (point queries, filtered by primary key) | Low | Small queries don’t generate enough compute work to justify GPU routing overhead. These stay on CPU. |
| DML operations (INSERT, UPDATE, DELETE) | Minimal | Write operations are I/O-bound and coordination-dependent, not the compute-intensive pattern that GPU acceleration targets. |
The clearest use case is any organization where Power BI reports serve large numbers of simultaneous viewers (finance teams, operations centers, executive dashboards) and where query latency has been a pain point during peak usage hours. GPU acceleration directly addresses this pattern.
The second clear use case is Fabric deployments being extended to serve AI agent workloads, where agents make continuous analytical calls to the warehouse as part of multi-step reasoning chains. This is the pattern Microsoft specifically highlighted in the Build 2026 announcements around agentic data apps.
GPU Accelerated Fabric Warehouse vs Snowflake vs Databricks: The Real Competitive Picture
CoddSpeed positions Microsoft Fabric Data Warehouse as the first fully managed cloud data warehouse with native GPU query acceleration. That is a meaningful claim, but it needs context.
| Platform | Query Acceleration Approach | GPU-Native? | Requires Config? |
|---|---|---|---|
| Microsoft Fabric DW (CoddSpeed) | GPU offloading via CAL/DAL, transparent to SQL developer | Yes (NVIDIA GPU-native) | Workspace toggle only |
| Snowflake | CPU-based vectorized execution with query optimization | No GPU-native query execution | N/A |
| Databricks (Photon) | CPU-vectorized engine with highly optimized native code for SQL and Spark | CPU-vectorized, not GPU | Enabled by default on supported clusters |
| Google BigQuery | Dremel distributed engine with columnar optimization | No GPU-native SQL execution | N/A |
| Amazon Redshift | CPU-based with AQUA (Advanced Query Accelerator) for some operations | Partial (AQUA uses FPGAs, not GPU) | AQUA is managed, not user-controlled |
Databricks Photon: The Closest Competitor
Databricks Photon is the most relevant comparison. It is a highly effective CPU-vectorized execution engine that runs SQL and Spark operations in native code, claiming 2–8× speedup over standard execution. Photon is excellent for mixed AI/SQL workloads, particularly on Delta Lake, which aligns well with Databricks’ use cases.
The key difference: Photon is CPU-vectorized. CoddSpeed is GPU-accelerated. At high concurrency (64+ simultaneous analytical users or continuous agent queries), GPU architecture’s parallelism model scales differently than CPU vectorization. Whether this advantage holds in independent benchmarks comparing CoddSpeed vs Photon at enterprise concurrency is the question to watch when third-party testing emerges.
Snowflake: The Incumbent Gap
Snowflake has no equivalent GPU-native query execution capability as of June 2026. Snowflake’s architecture uses CPU-based virtual warehouses with its own columnar storage format. For teams currently evaluating Fabric vs Snowflake, GPU acceleration is a meaningful new factor, particularly for high-concurrency analytical workloads where the benchmark advantage compounds most.
The competitive moat from CoddSpeed is real but time-limited. Snowflake and Databricks have the engineering capacity to build GPU query execution. The question is how long it takes: GPU query execution is architecturally complex, and TQP/CoddSpeed represents years of research investment. Microsoft’s head start is measured in years, not months. For current Fabric customers, the practical question is: does GPU acceleration justify staying on Fabric vs migrating to a competitor? For most, the answer is yes, especially given that it requires no code changes.
How to Get Early Access: July 2026 Preview
- Watch for the early access sign-up. Early access preview opens July 2026. Microsoft will publish sign-up instructions on the Fabric Updates Blog and the Microsoft Fabric What’s New page. Subscribe to both to catch the announcement the day it goes live.
- Register your Fabric workspace for early access. Per the announcement, GPU acceleration will be enabled through a workspace-level toggle, with no infrastructure provisioning and no separate resource allocation. You enable it in workspace settings and existing queries start routing eligible operations to GPU automatically.
- Benchmark your own workloads, not Microsoft’s. Microsoft’s internal benchmarks covered “common reporting, application, and AI-driven analytics scenarios” against unnamed providers. Your workload is specific. Run your actual high-concurrency queries (the ones that are currently slow or expensive) and measure before and after. High-aggregation, high-concurrency patterns will show the largest improvements. Small lookup queries will show minimal change.
- Focus testing on concurrency scenarios. The benchmark advantage compounds with concurrency. Single-user testing (3× improvement) won’t capture the full impact. Test with realistic concurrent user counts: simulate your peak dashboard load of 10, 20, 50 simultaneous users to see where GPU acceleration has the most business impact.
- Check capacity requirements. GPU-accelerated compute is premium infrastructure. Pricing for the GPU-accelerated tier has not been announced as of June 2026. Evaluate cost vs performance improvement when pricing is published: a 7× faster query that costs 4× more per query-second may or may not improve your total cost of ownership depending on workload patterns.
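The concurrency testing advice above can be turned into a small harness. The following is a minimal sketch using only the Python standard library. `run_query` is a hypothetical placeholder that merely sleeps; replace it with a call that executes one of your real warehouse queries through whatever client you already use. The user counts and `queries_per_user` value are arbitrary examples.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def run_query():
    """Hypothetical stand-in: replace the sleep with one real query execution."""
    time.sleep(0.01)


def timed_call(query_fn):
    """Return the wall-clock seconds one call to query_fn takes."""
    start = time.perf_counter()
    query_fn()
    return time.perf_counter() - start


def measure(query_fn, concurrency, queries_per_user=5):
    """Simulate `concurrency` users, each issuing queries back to back.

    Returns a flat list with the latency of every query that was run.
    """
    def user_session(_user_index):
        return [timed_call(query_fn) for _ in range(queries_per_user)]

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        sessions = list(pool.map(user_session, range(concurrency)))
    return [latency for session in sessions for latency in session]


for users in (1, 16, 64):
    latencies = measure(run_query, users)
    print(f"{users:>3} users: {len(latencies)} queries, "
          f"median {median(latencies) * 1000:.1f} ms")
```

Run the same script once with acceleration disabled and once with it enabled, and compare medians (and tail latencies, if you add them) at each user count rather than relying on a single-user timing.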
Pricing: Not Yet Announced
As of June 12, 2026, Microsoft has not published pricing for GPU-accelerated Fabric Data Warehouse. It will likely be a premium tier above standard F-SKU capacity pricing. Total cost of ownership comparisons against Snowflake and Databricks are not possible until pricing is published. Factor this into your architecture evaluation timeline, because performance comparisons without cost comparisons are incomplete.
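Once pricing is published, the cost side of the comparison is simple arithmetic. The sketch below assumes a simplified model that is not Microsoft’s billing scheme: cost per query equals price per query-second times runtime, and a speedup of s divides runtime by s. The function name and the 4× price multiplier are hypothetical; the multiplier is taken from the illustrative "7× faster, 4× more per query-second" scenario in the early access checklist, not from any announced price.

```python
def relative_cost_per_query(speedup, price_multiplier):
    """Cost of one query on the accelerated tier relative to baseline.

    1.0 means the same cost per query; below 1.0 is cheaper, above is dearer.
    Assumes cost = price per query-second * runtime, with runtime / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return price_multiplier / speedup


# High-concurrency case: 7x faster at a hypothetical 4x price per query-second.
print(round(relative_cost_per_query(7, 4), 2))  # 0.57
# Single-user case: 3x faster at the same hypothetical 4x price.
print(round(relative_cost_per_query(3, 4), 2))  # 1.33
```

Under these assumptions the same price multiplier lowers cost per query for the high-concurrency workload but raises it for the single-user one, which is why the workload mix matters as much as the headline speedup.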
⚠️ Accuracy Disclaimer
All benchmark figures (3×, 6×, 7× performance improvements, UNC Health 5× result) are sourced from Microsoft’s internal benchmarks as published in the official Microsoft Fabric Community Blog announcement (June 3, 2026) and the CoddSpeed research post (June 2, 2026). The three comparison providers are not named by Microsoft. No independent third-party benchmark validation is available as of June 2026. GPU acceleration enters early access preview July 2026; pricing has not been announced. Verify current availability and pricing at the Microsoft Fabric What’s New page before architecture decisions. UIG Data Lab is an independent publication, not affiliated with or endorsed by Microsoft Corporation or NVIDIA.