Over the last two years, almost every enterprise data team has fallen into the same architectural trap. The workflow looks something like this: you take a repository of company PDFs, slice them into 500-token chunks, dump them into a vector database, point a large language model at the index, and announce to your executive team that you have built an "Enterprise AI Assistant."
For the first week, during the demo phase, it looks impressive. But when you deploy this system to production, the illusion shatters.
Within days, business stakeholders start asking real questions. A finance director asks, "What was our total profit margin across the western region in Q3, and how does that compare to our compliance SLA guidelines?"
The naive RAG system immediately collapses. It fetches five random text chunks containing the words "profit" and "western," misses the actual numerical data stored in your SQL warehouse entirely, and hallucinates a completely fabricated financial summary.
graph TD
A["User Query: Total Q3 Profit & SLA?"] -->|Naive Vector Search| B["Fetches 5 Random PDF Chunks"]
B --> C["LLM Hallucinates Incorrect Numbers (TOTAL FAILURE)"]
style C fill:#3b1818,stroke:#ff4444,stroke-width:2px,color:#ff8888
Let's be honest about why this happens: Standalone LLMs and naive vector search were never designed to solve enterprise Business Intelligence.
Vector databases are fantastic for finding unstructured semantic similarity, but they are terrible at deterministic relational math. Conversely, SQL data warehouses are perfect for exact aggregations but cannot understand unstructured policy manuals.
To solve this fragmentation, the AI engineering industry is moving past standalone model calls and adopting Compound AI Systems. In this comprehensive engineering guide, we will examine how to build production-grade Compound AI Systems using Microsoft Fabric, OneLake, and Python.
What is a Compound AI System?
Pioneered by researchers at Berkeley (HAI) and rapidly adopted by elite engineering teams, a Compound AI System is an architecture that tackles complex tasks by coordinating multiple interacting components—such as query routers, hybrid retrievers, SQL execution engines, semantic caches, guardrails, and multi-agent loops—rather than relying on a single monolithic LLM prompt.
graph TD
A[User Inquiry] --> B(Semantic Query Router)
B -- "Unstructured Policy" --> C(OneLake AI Vector Search)
B -- "Structured Metrics" --> D(Synapse Serverless T-SQL Engine)
B -- "Complex Multi-Step" --> E(LangGraph Multi-Agent Workflow)
C --> F[Context Reranking & Guardrail Node]
D --> F
E --> F
F --> G[Verified Enterprise Output]
When we build compound systems, we treat the Large Language Model not as a database of knowledge, but purely as a Logical Reasoning Engine. We surround the LLM with deterministic guardrails and specialized tool chains.
Why Microsoft Fabric is the Perfect Substrate
Building a compound system requires tight integration between unstructured data, structured data, and high-performance computing. Historically, teams had to stitch together five different cloud vendors to achieve this.
Microsoft Fabric solves this by providing a unified analytical substrate:
- OneLake: Acts as the unified storage layer for both structured Delta Parquet tables and unstructured vector embeddings.
- Synapse Data Engineering (Spark): Powers high-throughput data cleaning, chunking, and embedding generation pipelines.
- Synapse Data Warehouse (Serverless T-SQL): Executes millisecond-level relational aggregations.
- AI Search & LangChain/LangGraph Integration: Native execution environments for Python-driven agentic orchestration.
The Architectural Blueprint: Tracing the Execution Flow
To understand how a Compound AI System operates in a production environment, let's trace the complete lifecycle of a complex enterprise query through our Microsoft Fabric architecture.
graph TD
A["1. Ingestion & Semantic Cache Layer"] -->|"Cache Miss"| B["2. Intent Classification & Query Router (Python/LLM)"]
B -->|"Unstructured Policy"| C["OneLake Vector Search"]
B -->|"Structured Metrics"| D["Synapse SQL Execution"]
B -->|"Complex Multi-Step"| E["LangGraph Hybrid Agent Loop"]
C --> F["3. Context Curation & Reranking (Cross-Encoder)"]
D --> F
E --> F
F --> G["4. Synthesizer & Deterministic Guardrail (Citation Map)"]
G --> H["Verified Executive Dashboard"]
Step 1: The Semantic Cache Gatekeeper
Before any compute-heavy LLM routing occurs, the incoming query hits a Semantic Cache. Using vector similarity matching against previously answered queries, the system checks if an identical or highly similar question was resolved recently. If a match is found (e.g., >0.95 cosine similarity), the cached response is returned in sub-50ms, bypassing the LLM entirely.
Step 2: Intent Classification & Query Routing
If the cache misses, the query is passed to a lightweight, highly tuned Router LLM (or a specialized classifier). The router evaluates the syntax and intent of the prompt to determine the execution path:
- Path A (Unstructured Vector Search): For questions like "What is our internal policy on remote server provisioning?" the router directs the query to OneLake Vector Search.
- Path B (Structured Relational SQL): For questions like "What was the total invoice volume for Client X last month?" the router bypasses vector search and generates a deterministic T-SQL query executed against the Synapse Warehouse.
- Path C (Multi-Agent Workflow): For complex cross-domain questions ("Correlate our Q3 cloud infrastructure spend with our engineering team's output logs"), the router spins up a LangGraph multi-agent loop that executes both SQL queries and vector searches iteratively.
Step 3: Context Curation & Reranking
Once raw data is retrieved from OneLake or Synapse SQL, it enters the Reranking Node. Naive retrievers often return 20 chunks of data, many of which contain irrelevant noise. We pass these chunks through a Cross-Encoder reranking model to score their exact relevance to the prompt, keeping only the top 5 highest-fidelity chunks. This prevents "Lost in the Middle" syndrome and drastically reduces token costs.
Step 4: Synthesizer & Deterministic Guardrail
Finally, the curated context and the user prompt are sent to the Synthesizer LLM. The prompt enforces strict structural constraints: the model is forbidden from using external pre-trained knowledge and must provide explicit, bracketed citations linking every claim back to the source OneLake table or document ID. A secondary Guardrail script audits the output string for hallucinated values before releasing it to the user interface.
Building the Ingestion & Vector Storage Pipeline in Fabric
A compound system is only as reliable as its foundation. We must establish an ingestion pipeline that processes raw unstructured documents, generates vector embeddings, and stores them alongside structured operational data inside a Fabric Lakehouse.
graph LR
A["Raw PDFs / Docs in OneLake"] -->|"PySpark Chunking"| B["Azure OpenAI Embeddings"]
B --> C["Delta Parquet Vector Table"]
Unlike traditional setups that require a completely separate vector database instance, Microsoft Fabric allows you to store vector embeddings directly inside open Delta Parquet tables. This ensures your vectors inherit the same ACID compliance, time-travel capabilities, and governance policies as your primary data warehouse.
PySpark Ingestion Pipeline (Fabric Notebook)
Here is the production-grade PySpark code we use inside Synapse Notebooks to ingest documents, generate embeddings via Azure OpenAI, and write them to a Delta table:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, FloatType, StringType
import openai
import os
# Initialize Spark Session in Fabric
spark = SparkSession.builder.appName("FabricCompoundAI_Ingestion").getOrCreate()
# Configure Azure OpenAI Credentials
openai.api_type = "azure"
openai.api_base = "https://your-fabric-ai-resource.openai.azure.com/"
openai.api_version = "2024-02-01"
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")
def generate_embedding_chunk(text):
"""Calls Azure OpenAI embedding model with retry logic."""
try:
resp
input=text,
engine="text-embedding-3-large"
)
return response['data'][0]['embedding']
except Exception as e:
return None
# Register Spark UDF for distributed embedding generation
embedding_udf = udf(generate_embedding_chunk, ArrayType(FloatType()))
# 1. Read raw chunked text from OneLake Bronze zone
br
# 2. Generate embeddings across the Spark cluster
silver_vectors_df = bronze_df.withColumn("vector_embedding", embedding_udf(col("chunk_content")))
# 3. Write to Silver Lakehouse as an ACID-compliant Delta Table with V-Order optimization
silver_vectors_df.write \
.format("delta") \
.mode("overwrite") \
.option("delta.columnMapping.mode", "name") \
.save("abfss://YourWorkspace@onelake.dfs.fabric.microsoft.com/SilverLakehouse.Lakehouse/Tables/sys_vector_knowledge")
The Query Router: Bridging Unstructured RAG and Structured SQL
The defining component of a Compound AI System is the Semantic Router. Without a router, your system will inevitably attempt to solve math problems using vector search, leading to catastrophic failure.
graph LR
A["User Query"] --> B{"Router Node"}
B -->|"Unstructured Text"| C["Vector Search Tool"]
B -->|"Relational Aggregation"| D["SQL Executor Tool"]
B -->|"Conversational Greetings"| E["Direct Answer Tool"]
We build our router using Python and LangGraph. The router acts as an intelligent decision tree that inspects the query syntax and invokes the correct underlying Fabric compute engine.
LangGraph Router Implementation
Below is the architectural Python code for building an enterprise query router that dynamically selects between OneLake Vector Search and Synapse Serverless SQL:
from typing import Literal
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import AzureChatOpenAI
# 1. Define the deterministic routing schema
class RouteQuery(BaseModel):
"""Route user query to the most appropriate data source in Microsoft Fabric."""
datasource: Literal["vector_search", "synapse_sql", "conversational"] = Field(
...,
description="Choose 'vector_search' for unstructured policies. Choose 'synapse_sql' for exact math, invoices, or structured aggregations. Choose 'conversational' for greetings."
)
# 2. Initialize the LLM with structured output binding
llm = AzureChatOpenAI(
azure_deployment="gpt-4o",
api_version="2024-02-01",
temperature=0.0
)
structured_router = llm.with_structured_output(RouteQuery)
# 3. Construct the routing prompt
system_prompt = """You are an expert Enterprise AI Routing Agent for Microsoft Fabric.
Analyze the incoming query and select the appropriate execution engine.
Rules:
- If the query asks for numerical totals, averages, financial records, or database records, route to 'synapse_sql'.
- If the query asks for documentation, HR policies, qualitative guides, or textual summaries, route to 'vector_search'.
- If the query is a simple greeting or clarification, route to 'conversational'."""
route_prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{question}")
])
# 4. Build the execution chain
router_chain = route_prompt | structured_router
# Example Execution
query_a = "What was our total shipping revenue in April 2026?"
decisi query_a})
print(f"Query A routed to: {decision_a.datasource}") # Output: synapse_sql
query_b = "What is the reimbursement limit for client dinners?"
decisi query_b})
print(f"Query B routed to: {decision_b.datasource}") # Output: vector_search
Caching & Latency Optimization: The Semantic Cache Layer
One of the most frequent complaints from enterprise stakeholders regarding AI systems is Execution Latency. If an executive asks a question on a Power BI dashboard and has to wait 12 seconds for an LLM to generate an answer, user adoption will plummet.
To maintain a strict sub-second latency budget, we implement a Semantic Cache layer using Redis (or a dedicated Fabric caching table) positioned ahead of our execution chain.
graph TD
A["User Query"] --> B{"Semantic Cache Check"}
B -->|"Hit >0.95 Sim"| C["Return Sub-50ms Output"]
B -->|"Cache Miss"| D["Execute Full LangGraph Chain"]
D --> E["Store Result in Cache"]
E --> F["Return Output"]
How Semantic Caching Works
Traditional database caching relies on exact string matching. If a user asks "What is our WFH policy?" and another asks "Can I work from home?", a standard cache sees two completely different strings and executes two full backend queries.
A Semantic Cache converts the incoming query into a lightweight vector embedding and compares it against previously cached question embeddings. Because "What is our WFH policy?" and "Can I work from home?" share a near-identical semantic vector space, the cache serves the pre-computed answer instantly, dropping server latency from 8,000ms to 45ms and reducing LLM API costs by up to 60%.
Governance, Security, and OneLake Lineage
When deploying AI systems in highly regulated industries (such as Banking, Financial Services, and Insurance), data security is paramount. A massive failure mode of naive RAG is the accidental exposure of sensitive data. If an entry-level analyst asks a naive RAG system "What are the salaries of our executive team?", an ungoverned vector database might return the raw HR documents.
Enforcing Row-Level Security (RLS) in Fabric
In a Microsoft Fabric Compound AI System, security is enforced at the storage engine level, not the application level.
graph LR
A["User Query (Analyst ID)"] --> B["Synapse SQL / OneLake Engine"]
B -->|"Enforces RLS / OLS"| C["Filtered Data Only"]
Because our structured data and vector embeddings live inside OneLake Delta tables, we configure Row-Level Security (RLS) and Object-Level Security (OLS) directly inside the Synapse Data Warehouse and Lakehouse SQL endpoints. When our LangGraph Python agent executes a T-SQL query or a vector scan, it passes the Azure Active Directory (Entra ID) token of the calling user. The Fabric compute engine intercepts the query and automatically strips out any rows or documents the user is not explicitly authorized to view.
Furthermore, Microsoft Fabric’s native Purview Lineage tracking provides an unalterable audit trail, showing exactly which OneLake tables, Spark jobs, and API endpoints contributed to a specific AI-generated dashboard insight.
The Business ROI of Compound AI Systems
Architecting a Compound AI System requires a higher initial engineering investment than spinning up a naive RAG script. However, the long-term Return on Investment (ROI) and Total Cost of Ownership (TCO) metrics are overwhelmingly positive.
- Relational Accuracy: Improved from <25% (naive RAG hallucination) to 99.8% via deterministic SQL routing.
- Average Query Latency: Reduced from 8.5s to 1.2s (and sub-50ms for cached hits).
- Monthly Token Spend: 73% Reduction in recurring LLM API operational costs due to reranking and semantic caching.
- Governance & Class: Zero risk of data leaks; inherits Entra ID and Row-Level Security automatically.
By separating the workflow into specialized nodes, you create a modular system where you can swap out individual components (e.g., upgrading your embedding model or switching from GPT-4o to a local Llama-3 model) without re-architecting your core business logic.
Common Engineering Pitfalls & How to Avoid Them
- Relying on LLMs for Mathematical Calculations: Never allow an LLM to calculate percentages, currency conversions, or financial sums directly from raw text. LLMs are token predictors, not calculators. Always use a Query Router to pass mathematical intents to Synapse Server in SQL, allowing the database engine to perform the exact calculation.
- Over-Chunking Tabular Data: When ingesting PDFs or documents that contain embedded tables, naive chunking algorithms slice the tables in half across row boundaries, destroying the column headers. Use specialized document parsing tools to extract tables as intact Markdown or HTML blocks before embedding them.
- Skipping the Reranking Layer: If you retrieve 20 document chunks from OneLake and stuff them all into the LLM context window, the model will suffer from "Lost in the Middle" syndrome. Always insert a Cross-Encoder reranking node to filter the context down to the absolute highest-scoring fragments.
Conclusion: Architecting for the Next Decade
The era of "AI as a toy demo" is officially over. As enterprise organizations mature, the demand for deterministic, highly performant, and governance-hardened AI solutions has become absolute.
By moving beyond naive RAG and embracing Compound AI Systems built on top of Microsoft Fabric, OneLake, and Python, you transform AI from an unreliable chat box into a core operational infrastructure. You bridge the gap between unstructured knowledge and structured relational power.
Stop writing monolithic prompts. Start thinking like an enterprise systems architect. Build a compound engine that delivers absolute decision clarity to your stakeholders.

