This article walks through the design of an end-to-end “chat with your docs” system. The design supports document uploads, asynchronous ingestion, hybrid retrieval, access control, streaming chat responses, and multi-tenant isolation. Each section pairs the architectural choice with the failure mode it is trying to prevent.
01 Product requirements
The system has two primary user-facing capabilities.
First, users can add documents to a knowledge base. Documents may come from sources such as PDFs, Google Docs, Notion pages, or files stored in object storage. The baseline design assumes documents up to 10 MB, with room to extend to larger files later.
Second, users can ask open-ended questions over those documents. The answer should be grounded in retrieved knowledge-base chunks and include citations linking back to the original source document.
The most important security requirement is that users only see documents they are authorized to access. The system is also multi-tenant: multiple companies can use the same deployment, but data must not leak across tenants.
Non-functional goals
- Availability is more important than strict consistency. A newly uploaded document may take a short time to appear in chat results.
- Document freshness target: newly added documents should be reflected in retrieval results within roughly one minute at P95.
- Chat serving target: time to first token should be around a few hundred milliseconds where possible.
- The system should scale to high query throughput and absorb spikes in document ingestion.
02 High-level architecture
The system has two major paths: the ingestion path and the query-serving path.
flowchart LR
Client["RAG Client
Web / Mobile"] --> APIGW["API Gateway
Auth
Rate limiting
Routing"]
APIGW --> DocSvc["Document Service"]
APIGW --> ChatSvc["Chat Service"]
DocSvc --> DB[(PostgreSQL
Document metadata
Permissions
Status)]
DocSvc --> ObjStore[(Object Storage
Raw files)]
DocSvc --> SQS["Ingestion Queue
SQS / durable queue"]
SQS --> Scheduler["Workflow Scheduler"]
Scheduler --> Temporal["Temporal
Workflow Orchestration"]
Temporal --> Extract["Extract Worker
Fetch and parse source"]
Extract --> Chunk["Chunk Worker
Split raw text"]
Chunk --> Embed["Embedding Worker
Generate vectors"]
Embed --> Index["Index Worker
Write chunks"]
Index --> ES[(Search Index
BM25 + Vector Index)]
Index --> DB
ChatSvc --> DB
ChatSvc --> ES
ChatSvc --> Cache[(Cache
Hot chunks
Query embeddings
Responses)]
ChatSvc --> LLM["LLM Service"]
LLM --> ChatSvc
ChatSvc --> APIGW
APIGW --> Client
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
classDef svc fill:#fff8e1,stroke:#f57f17,stroke-width:2px,color:#000
classDef storage fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
classDef worker fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
classDef ai fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000
class Client input
class APIGW,DocSvc,ChatSvc svc
class DB,ObjStore,SQS,Cache,ES storage
class Scheduler,Temporal,Extract,Chunk,Embed,Index worker
class LLM ai
The API Gateway authenticates users, applies rate limits, and routes requests to either the Document Service or Chat Service.
The Document Service owns document metadata, upload lifecycle, permissions, and ingestion status. It writes raw files to object storage and metadata to PostgreSQL, then emits ingestion work to a durable queue.
The ingestion pipeline is asynchronous. A workflow scheduler consumes queue messages and starts durable workflows in Temporal. Workers extract content, chunk it, embed it, and index it into Elasticsearch or a similar search system that supports both lexical and vector retrieval.
The Chat Service handles online user queries. It retrieves permitted chunks using hybrid search, builds an LLM prompt, streams the response, and returns citations.
03 Data model
A minimal production data model needs tenants, users, groups, documents, and chunks.
erDiagram
TENANT ||--o{ USER : contains
TENANT ||--o{ USER_GROUP : contains
USER_GROUP ||--o{ USER_GROUP_MEMBER : has
USER ||--o{ USER_GROUP_MEMBER : belongs_to
TENANT ||--o{ DOCUMENT : owns
DOCUMENT ||--o{ DOCUMENT_CHUNK : split_into
DOCUMENT {
string id
string tenant_id
string name
string source_type
string source_link
string object_key
string status
string hash
datetime created_at
datetime updated_at
}
DOCUMENT_CHUNK {
string id
string tenant_id
string document_id
int chunk_index
string content
vector embedding
string allowed_users
string allowed_groups
string source_url
}
USER {
string id
string tenant_id
string email
string name
}
USER_GROUP {
string id
string tenant_id
string name
}
USER_GROUP_MEMBER {
string user_id
string group_id
}
The key design choice is to copy access-control metadata into the search document for each chunk. Every indexed DocumentChunk contains tenant_id, allowed_users, and allowed_groups. That allows the serving path to enforce permissions during retrieval instead of retrieving first and filtering afterward.
Filtering after retrieval is dangerous because the top-k search results might be dominated by unauthorized chunks. The user would receive poor results, and the system might accidentally expose metadata or citations from documents the user should not see. Authorization must be part of the search query itself.
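The difference between the two orderings can be illustrated with a small in-memory sketch. The `Chunk` class, the scores, and the user names are hypothetical and only stand in for what a real search engine would return; the point is the order of filtering and top-k selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    score: float              # hypothetical relevance score
    allowed_users: frozenset  # mirrors the allowed_users field on each chunk


def is_allowed(chunk, user_id):
    return user_id in chunk.allowed_users


def post_filter_top_k(chunks, user_id, k):
    """Take the top-k first, then drop unauthorized chunks (the risky order)."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:k]
    return [c for c in top if is_allowed(c, user_id)]


def pre_filter_top_k(chunks, user_id, k):
    """Apply the permission filter as part of retrieval, then take the top-k."""
    permitted = [c for c in chunks if is_allowed(c, user_id)]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:k]


CORPUS = [
    Chunk("c1", 0.99, frozenset({"bob"})),
    Chunk("c2", 0.97, frozenset({"bob"})),
    Chunk("c3", 0.95, frozenset({"bob"})),
    Chunk("c4", 0.80, frozenset({"alice"})),
    Chunk("c5", 0.75, frozenset({"alice", "bob"})),
]

post = post_filter_top_k(CORPUS, "alice", k=3)
pre = pre_filter_top_k(CORPUS, "alice", k=3)
print([c.chunk_id for c in post])  # []
print([c.chunk_id for c in pre])   # ['c4', 'c5']
```

With post-filtering, the three highest-scoring chunks all belong to another user, so the authorized user ends up with no context at all even though two relevant chunks exist.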
04 Document upload API
The document API separates metadata creation from binary upload. This lets the backend validate permissions and generate a pre-signed upload URL without proxying large files through the application server.
POST /v1/documents
Example request:
{
  "name": "Q3 Financial Projections",
  "permission": {
    "allowed_users": ["user_123"],
    "allowed_user_groups": ["finance_team"]
  },
  "source_type": "S3_PDF",
  "source_link": "s3://company-docs/q3-financial-projections.pdf"
}
Example response:
{
  "id": "doc_88",
  "upload_presigned_url": "https://object-store/upload/...",
  "status": "CREATED"
}
For bulk onboarding, the system also exposes:
POST /v1/documents/batch
{
  "documents": [
    {
      "name": "Q3 Financial Projections",
      "permission": {
        "allowed_users": [],
        "allowed_user_groups": ["finance_team"]
      },
      "source_type": "GOOGLE_DOC",
      "source_link": "https://docs.example.com/doc/abc"
    }
  ]
}
To check whether a document is ready for chat:
GET /v1/documents/{document_id}
Typical statuses include:
stateDiagram-v2
[*] --> CREATED
CREATED --> UPLOADED: file uploaded / source registered
UPLOADED --> PENDING_INGESTION: scheduler claims work
PENDING_INGESTION --> EXTRACTING
EXTRACTING --> CHUNKING
CHUNKING --> EMBEDDING
EMBEDDING --> INDEXING
INDEXING --> AVAILABLE
EXTRACTING --> FAILED
CHUNKING --> FAILED
EMBEDDING --> FAILED
INDEXING --> FAILED
FAILED --> PENDING_INGESTION: retry / reprocess
AVAILABLE --> PENDING_INGESTION: source changed
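One way to keep services from writing an invalid status is to encode the transitions above as a lookup table and check every update against it. This is a minimal sketch; the helper names `can_transition` and `transition` are assumptions for illustration, not part of the API described here.

```python
# Allowed status transitions, copied from the state diagram above.
ALLOWED_TRANSITIONS = {
    "CREATED": {"UPLOADED"},
    "UPLOADED": {"PENDING_INGESTION"},
    "PENDING_INGESTION": {"EXTRACTING"},
    "EXTRACTING": {"CHUNKING", "FAILED"},
    "CHUNKING": {"EMBEDDING", "FAILED"},
    "EMBEDDING": {"INDEXING", "FAILED"},
    "INDEXING": {"AVAILABLE", "FAILED"},
    "FAILED": {"PENDING_INGESTION"},     # retry / reprocess
    "AVAILABLE": {"PENDING_INGESTION"},  # source changed
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not in the diagram."""
    if not can_transition(current, target):
        raise ValueError(f"illegal status transition: {current} -> {target}")
    return target


print(can_transition("INDEXING", "AVAILABLE"))  # True
print(can_transition("CREATED", "AVAILABLE"))   # False
```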
05 Ingestion pipeline
The ingestion flow is intentionally asynchronous. Uploading a document should not block on extraction, chunking, embedding, and indexing.
sequenceDiagram
participant Client
participant APIGW as API Gateway
participant DocSvc as Document Service
participant DB as PostgreSQL
participant Store as Object Storage
participant Q as Durable Queue
participant S as Workflow Scheduler
participant T as Temporal
participant W as Workers
participant ES as Search Index
Client->>APIGW: POST /v1/documents
APIGW->>DocSvc: Authenticated request
DocSvc->>DB: Create Document(status=CREATED)
DocSvc-->>Client: document_id + upload URL
Client->>Store: Upload file
Store-->>DocSvc: Upload event / callback
DocSvc->>DB: status=UPLOADED, hash=...
DocSvc->>Q: Enqueue document_id
S->>Q: Consume message
S->>DB: Claim document using optimistic update
S->>T: Start ingestion workflow
T->>W: Extract raw text
T->>W: Chunk document
T->>W: Generate embeddings
T->>W: Index chunks
W->>ES: Upsert chunks with vectors + ACLs
W->>DB: status=AVAILABLE
The pipeline has four core stages.
Extract. Fetch the source content based on source_type. For a PDF, the worker may read from object storage. For Google Docs or Notion, it may call the source API. The output is normalized raw text plus source-location metadata.
Chunk. Split raw text into smaller sections that fit embedding-model and LLM-context constraints. Chunks should preserve enough context to be meaningful. Common strategies include heading-aware chunking, overlapping windows, page-aware PDF chunking, and section-based chunking for structured documents.
Embed. Convert each chunk into a dense vector. Store the embedding alongside the text and metadata.
Index. Write the chunk into the search index. Each indexed record includes the chunk text, embedding, document metadata, tenant ID, permissions, and citation metadata.
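As an illustration of the chunk stage, here is a minimal sketch of the overlapping-window strategy. It splits on whitespace and counts words rather than model tokens, which is a simplifying assumption; the `chunk_size` and `overlap` defaults are placeholders, and a real pipeline would likely use the embedding model's tokenizer plus heading or page boundaries.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

The overlap repeats the last word of one window at the start of the next, so a sentence that straddles a boundary is still visible in at least one chunk.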
06 Reliability and idempotency
Ingestion has multiple failure points. Source APIs may time out, PDF parsing may fail, embedding calls may be rate limited, or indexing may partially succeed. A durable workflow engine such as Temporal is useful because it persists workflow state, retries failed activities, and gives operators visibility into where a document failed.
flowchart TD
Start["Start ingestion workflow"] --> HashCheck{"Hash unchanged?"}
HashCheck -->|Yes| Skip["Skip ingestion
Mark as AVAILABLE"]
HashCheck -->|No| Extract["Extract text"]
Extract -->|success| Chunk["Chunk text"]
Extract -->|transient failure| RetryExtract["Retry with backoff"]
RetryExtract --> Extract
Extract -->|permanent failure| Failed["Mark FAILED"]
Chunk --> Embed["Generate embeddings"]
Embed --> Index["Upsert chunks into search index"]
Index --> Available["Mark AVAILABLE"]
Index -->|partial failure| RetryIndex["Retry idempotent upsert"]
RetryIndex --> Index
classDef plan fill:#fff8e1,stroke:#f57f17,stroke-width:2px,color:#000
classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000
classDef fail fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
class Start,HashCheck plan
class Extract,Chunk,Embed,Index,RetryExtract,RetryIndex synth
class Skip,Available output
class Failed fail
Idempotency matters because the same document may be submitted more than once, queue messages may be delivered more than once, and multiple schedulers may race to process the same work.
The design uses a document hash. If the source content hash matches the previously indexed hash, the ingestion workflow can skip reprocessing. If the hash changed, the system proceeds with extraction, chunking, embedding, and reindexing.
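The hash check can be sketched as follows. SHA-256 over the raw bytes is an assumption, since the design does not name a hash function, and `needs_reingestion` is a hypothetical helper name.

```python
import hashlib
from typing import Optional


def content_hash(data: bytes) -> str:
    """Hex digest of the source content (SHA-256 is assumed here)."""
    return hashlib.sha256(data).hexdigest()


def needs_reingestion(new_content: bytes, indexed_hash: Optional[str]) -> bool:
    """True if the document was never indexed or its content has changed."""
    return indexed_hash is None or content_hash(new_content) != indexed_hash


stored = content_hash(b"Q3 revenue projection: draft 1")
print(needs_reingestion(b"Q3 revenue projection: draft 1", stored))  # False
print(needs_reingestion(b"Q3 revenue projection: draft 2", stored))  # True
```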
The scheduler should also claim work using optimistic concurrency control:
UPDATE document SET status = 'PENDING_INGESTION' WHERE id = :document_id AND status = 'UPLOADED';
If zero rows are affected, another scheduler has already claimed the document or the document is no longer eligible for ingestion. The current scheduler should skip it.
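The claim step can be exercised locally with an in-memory SQLite database standing in for PostgreSQL. The two-column table and the `try_claim` helper are assumptions for the demo; the `UPDATE` statement is the one shown above.

```python
import sqlite3

CLAIM_SQL = (
    "UPDATE document SET status = 'PENDING_INGESTION' "
    "WHERE id = :document_id AND status = 'UPLOADED'"
)


def try_claim(conn: sqlite3.Connection, document_id: str) -> bool:
    """Return True only for the caller whose UPDATE actually changed the row."""
    cur = conn.execute(CLAIM_SQL, {"document_id": document_id})
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO document VALUES ('doc_88', 'UPLOADED')")
conn.commit()

print(try_claim(conn, "doc_88"))  # True: first scheduler wins the claim
print(try_claim(conn, "doc_88"))  # False: status is no longer UPLOADED
```

Because the status check and the status change happen in one statement, a duplicate queue delivery or a second scheduler sees zero affected rows and skips the document.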
Index writes should be idempotent too. Use deterministic chunk IDs such as:
chunk_id = hash(document_id + document_version + chunk_index)
Then indexing can use upsert semantics. Re-running the same workflow replaces the same chunk records instead of creating duplicates.
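A minimal sketch of deterministic IDs with upsert semantics is shown below. The search index is modeled as a plain dictionary, and the `:` delimiter between fields is an added assumption: it avoids two different (document, version) pairs concatenating to the same string, provided the IDs themselves do not contain `:`.

```python
import hashlib
from typing import Dict, List


def make_chunk_id(document_id: str, document_version: str, chunk_index: int) -> str:
    """Same inputs always yield the same ID, so retries overwrite in place."""
    key = f"{document_id}:{document_version}:{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def upsert_chunks(index: Dict[str, dict], document_id: str,
                  document_version: str, chunks: List[str]) -> None:
    """Write chunks keyed by deterministic ID (a dict stands in for the index)."""
    for i, content in enumerate(chunks):
        index[make_chunk_id(document_id, document_version, i)] = {
            "document_id": document_id,
            "chunk_index": i,
            "content": content,
        }


index: Dict[str, dict] = {}
upsert_chunks(index, "doc_88", "v1", ["alpha", "beta"])
upsert_chunks(index, "doc_88", "v1", ["alpha", "beta"])  # simulated retry
print(len(index))  # 2
```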
07 Handling ingestion spikes
A common mistake is to call the workflow engine directly from the Document Service for every uploaded document. During a spike, the service may create too many workflows, overload workers, or keep too much pending work in memory.
A durable queue smooths the load.
flowchart LR
DocSvc["Document Service"] --> Q["Durable Queue
SQS / Kafka / PubSub"]
Q --> Scheduler1["Workflow Scheduler 1"]
Q --> Scheduler2["Workflow Scheduler 2"]
Q --> Scheduler3["Workflow Scheduler N"]
Scheduler1 --> DB[(PostgreSQL
claim status)]
Scheduler2 --> DB
Scheduler3 --> DB
Scheduler1 --> Temporal["Temporal"]
Scheduler2 --> Temporal
Scheduler3 --> Temporal
Temporal --> Workers["Autoscaled Worker Pool"]
Workers --> ES[(Search Index)]
classDef svc fill:#fff8e1,stroke:#f57f17,stroke-width:2px,color:#000
classDef storage fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
classDef worker fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
class DocSvc,Scheduler1,Scheduler2,Scheduler3,Temporal svc
class Q,DB,ES storage
class Workers worker
The API Gateway should enforce per-user and per-tenant ingestion rate limits so one customer cannot flood the system. Worker pools can scale based on queue depth, CPU, and external-service throttling. There should also be a maximum backlog alert: autoscaling helps, but it is not a substitute for admission control.
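Per-tenant admission control is often implemented with a token bucket. The sketch below is one possible shape, not the gateway's actual mechanism: the class names, the rate, and the capacity are all assumptions, and a production limiter would keep its state in a shared store rather than in process memory.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at rate_per_sec up to capacity; each request spends one token."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated_at) * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Keeps one bucket per tenant so one tenant cannot starve the others."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[tenant_id] = bucket
        return bucket.allow()


limiter = TenantRateLimiter(rate_per_sec=1.0, capacity=2.0)
print([limiter.allow("tenant_a") for _ in range(3)])  # [True, True, False]
print(limiter.allow("tenant_b"))                      # True
```

The burst from `tenant_a` exhausts only its own bucket; `tenant_b` is still admitted.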
For documents larger than the default limit, add a pre-chunking phase. The system can split a very large file into processing units, run extraction and chunking in parallel, and merge the output into one logical document version.
08 Search and retrieval
The query-serving path must balance latency, quality, and access control.
Dense vector search is good for semantic similarity. It can match “revenue forecast” with “financial projection.” But it may miss exact terms such as product codes, invoice IDs, customer names, or policy numbers.
BM25 or sparse-vector retrieval is good for exact and lexical matches. But it may miss semantically equivalent phrases.
A production RAG system should usually use hybrid retrieval.
flowchart TD
Query["User query"] --> Normalize["Normalize query"]
Normalize --> EmbedQ["Generate query embedding"]
Normalize --> BM25["BM25 / lexical search"]
EmbedQ --> ANN["ANN vector search"]
BM25 --> Merge["Merge candidates"]
ANN --> Merge
Merge --> ACL["Re-verify tenant + permission filters"]
ACL --> Rank["Rank / rerank"]
Rank --> Diversity["Diversity selection
avoid redundant chunks"]
Diversity --> Context["Build LLM context"]
Context --> LLM["Generate grounded answer"]
LLM --> Stream["Stream answer + citations"]
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
classDef retrieval fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000
class Query input
class Normalize,EmbedQ,BM25,ANN,Merge,ACL retrieval
class Rank,Diversity,Context,LLM synth
class Stream output
The retrieval query should include:
tenant_id = current_user.tenant_id AND ( allowed_users contains current_user.id OR allowed_groups intersects current_user.group_ids )
For Elasticsearch, tenant-aware routing can reduce latency and improve isolation. A common design is to route documents by tenant_id, so a tenant query hits only the shard or shard subset containing that tenant’s data. The application should also validate that retrieved chunks all belong to the current tenant before sending them to the LLM.
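For Elasticsearch, the filter expression above maps naturally onto a `bool` query. The sketch below only builds the query body as a dictionary and does not call a client; it assumes `tenant_id`, `allowed_users`, and `allowed_groups` are indexed as keyword fields, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def build_acl_filter(tenant_id: str, user_id: str,
                     group_ids: List[str]) -> Dict[str, Any]:
    """Build a bool query: tenant must match AND (user OR any group) must match."""
    return {
        "bool": {
            # Hard requirement: the chunk belongs to the caller's tenant.
            "filter": [{"term": {"tenant_id": tenant_id}}],
            # At least one of the ACL clauses must match.
            "should": [
                {"term": {"allowed_users": user_id}},
                {"terms": {"allowed_groups": group_ids}},
            ],
            "minimum_should_match": 1,
        }
    }


acl = build_acl_filter("tenant_1", "user_123", ["finance_team"])
print(acl["bool"]["filter"])  # [{'term': {'tenant_id': 'tenant_1'}}]
```

The same filter object would need to be attached to both the lexical query and the vector search so that neither candidate list can contain unauthorized chunks.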
Hybrid retrieval can be implemented in several ways:
flowchart LR
Q["Query"] --> V["Vector Search
top_k=100"]
Q --> L["Lexical Search
top_k=100"]
V --> C["Candidate Pool"]
L --> C
C --> RRF["Reciprocal Rank Fusion
or weighted scoring"]
RRF --> Rerank["Optional reranker"]
Rerank --> Top["Top context chunks"]
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
classDef retrieval fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000
class Q input
class V,L,C retrieval
class RRF,Rerank synth
class Top output
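Reciprocal Rank Fusion combines the two candidate lists using only their ranks, so the vector and lexical scores never need to be put on a common scale. In this minimal sketch each chunk scores the sum of 1 / (k + rank) over the lists it appears in; `k = 60` is the commonly used default, and the chunk IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists; a higher fused score means a better final rank."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c1", "c2", "c3"]
lexical_hits = ["c3", "c1", "c4"]
print(reciprocal_rank_fusion([vector_hits, lexical_hits]))
# ['c1', 'c3', 'c2', 'c4']
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever.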
To improve retrieval and answer quality, tune:
- Chunk size and overlap.
- Number of vector candidates.
- Number of lexical candidates.
- Hybrid weighting.
- Reranker threshold.
- Diversity rules, so the final context is not filled with near-duplicate chunks.
- Citation granularity, such as page number, section heading, or source URL.
09 Low-latency serving
The design target is aggressive: the user should see the first token quickly, even though the system may need to embed the query, retrieve chunks, call an LLM, and stream the answer.
The key is to cache across the serving path.
flowchart TD
Query["Incoming query"] --> ExactCache{"Exact query cache hit?"}
ExactCache -->|Yes| ReturnCached["Return cached answer"]
ExactCache -->|No| Embed["Generate query embedding"]
Embed --> SemanticCache{"Similar query cache hit?"}
SemanticCache -->|Yes| ValidatePerms["Validate tenant + permissions"]
ValidatePerms --> ReturnCached
SemanticCache -->|No| Retrieve["Hybrid retrieval"]
Retrieve --> HotChunkCache["Fetch hot chunks from cache"]
HotChunkCache --> LLM["Call LLM"]
LLM --> StoreCache["Store answer + sources"]
StoreCache --> Stream["Stream response"]
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
classDef gate fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
classDef retrieval fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000
class Query input
class ExactCache,SemanticCache gate
class Embed,Retrieve,HotChunkCache,ValidatePerms retrieval
class LLM,StoreCache synth
class ReturnCached,Stream output
Useful caches include:
Query-to-embedding cache. If users repeatedly ask the same question, avoid recomputing embeddings.
Semantic query cache. Store query embeddings and final responses. If a new query is extremely similar to a previous query, reuse the response only after checking tenant, user, permissions, document versions, and freshness.
Hot chunk cache. Frequently retrieved chunks can be cached outside the search index.
Document metadata cache. Document titles, source URLs, and citation metadata are good cache candidates.
Caching RAG responses must be permission-aware. Otherwise, a user might receive an answer generated from documents they no longer have access to.
The cache key should include tenant, user or permission scope, document-index version, and possibly group membership version.
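A permission-aware cache key might be derived as in the sketch below. The field names, the `rag:resp:` prefix, and the query normalization (lowercasing and collapsing whitespace) are assumptions; the essential property is that any change in tenant, user, group membership, or index version produces a different key.

```python
import hashlib
import json
from typing import Iterable


def response_cache_key(tenant_id: str, user_id: str, group_ids: Iterable[str],
                       index_version: str, query: str) -> str:
    """Derive a cache key that changes whenever the permission scope changes."""
    payload = {
        "tenant_id": tenant_id,
        "user_id": user_id,
        "group_ids": sorted(set(group_ids)),       # order-independent
        "index_version": index_version,
        "query": " ".join(query.lower().split()),  # light normalization
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "rag:resp:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = response_cache_key("t1", "u1", ["g2", "g1"], "v7", "What is Q3 revenue?")
b = response_cache_key("t1", "u1", ["g1", "g2"], "v7", "  what is  q3 revenue? ")
c = response_cache_key("t2", "u1", ["g1", "g2"], "v7", "What is Q3 revenue?")
print(a == b)  # True: same scope, equivalent query
print(a == c)  # False: different tenant
```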
10 Streaming chat API
For chat, WebSocket or server-sent events are a good fit because the answer is generated incrementally.
sequenceDiagram
participant Client
participant Chat as Chat Service
participant Search as Search Index
participant LLM
Client->>Chat: USER_QUERY(query)
Chat->>Search: Hybrid search with ACL filters
Search-->>Chat: Permitted chunks + citations
Chat-->>Client: CHAT_START(response_id, sources)
Chat->>LLM: Prompt with retrieved context
loop token stream
LLM-->>Chat: token delta
Chat-->>Client: CHAT_CHUNK(seq, delta)
end
Chat-->>Client: CHAT_DONE(full_text, citations)
Example client message:
{
  "op": "USER_QUERY",
  "query": "What is the projected revenue for Q3?"
}
Example start event:
{
  "op": "CHAT_START",
  "response_id": "req_abc123",
  "sources": [
    {
      "document_id": "doc_88",
      "title": "Q3 Financial Projections",
      "url": "https://intranet/docs/q3-fin"
    }
  ]
}
Example chunk event:
{
  "op": "CHAT_CHUNK",
  "response_id": "req_abc123",
  "seq": 1,
  "delta": "The "
}
Example completion event:
{
  "op": "CHAT_DONE",
  "response_id": "req_abc123",
  "full_text": "The projected revenue is expected to grow...",
  "citations": [
    {
      "document_id": "doc_88",
      "chunk_id": "chunk_001",
      "url": "https://intranet/docs/q3-fin",
      "title": "Q3 Financial Projections"
    }
  ]
}
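The event sequence above can be produced by a small generator on the server side. This sketch is transport-agnostic and assumes the token deltas arrive as an iterable of strings; the function name `stream_events` is hypothetical, and each yielded string would be sent as one WebSocket message or one server-sent event.

```python
import json
from typing import Iterable, Iterator, List


def stream_events(response_id: str, sources: List[dict],
                  deltas: Iterable[str], citations: List[dict]) -> Iterator[str]:
    """Yield CHAT_START, one CHAT_CHUNK per delta, then CHAT_DONE as JSON."""
    yield json.dumps({"op": "CHAT_START", "response_id": response_id,
                      "sources": sources})
    parts = []
    for seq, delta in enumerate(deltas, start=1):
        parts.append(delta)
        yield json.dumps({"op": "CHAT_CHUNK", "response_id": response_id,
                          "seq": seq, "delta": delta})
    # The final event repeats the full text so clients can reconcile gaps.
    yield json.dumps({"op": "CHAT_DONE", "response_id": response_id,
                      "full_text": "".join(parts), "citations": citations})


for event in stream_events("req_abc123", [], ["The ", "projected ", "revenue"], []):
    print(event)
```

The monotonically increasing `seq` lets a client detect dropped or reordered chunks before the `CHAT_DONE` event arrives.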
11 Multi-tenancy, permissions, and privacy
Multi-tenancy must be enforced in both ingestion and query paths.
During ingestion:
flowchart TD
Upload["Document upload"] --> Auth["Authenticate user"]
Auth --> Tenant["Attach tenant_id"]
Tenant --> Perms["Validate allowed users/groups"]
Perms --> Metadata["Store metadata in DB"]
Metadata --> Index["Index chunks with tenant_id + ACLs"]
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000
class Upload input
class Auth,Tenant,Perms,Metadata synth
class Index output
During query serving:
flowchart TD
Query["User query"] --> Identity["Resolve user_id, tenant_id, groups"]
Identity --> SearchFilter["Build search filter"]
SearchFilter --> TenantFilter["tenant_id == current tenant"]
TenantFilter --> ACLFilter["allowed_users contains user
OR allowed_groups intersects groups"]
ACLFilter --> Retrieve["Retrieve chunks"]
Retrieve --> DoubleCheck["Application-level tenant + ACL check"]
DoubleCheck --> LLM["Send only authorized chunks to LLM"]
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
classDef retrieval fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000
class Query input
class Identity,SearchFilter,TenantFilter,ACLFilter retrieval
class Retrieve,DoubleCheck synth
class LLM output
There should be multiple layers of defense:
- API Gateway authenticates requests.
- Application services attach tenant_id from trusted identity claims, not from user-provided input.
- Search queries always include tenant and ACL filters.
- Indexed chunks store tenant and permission metadata.
- The Chat Service validates returned chunks before building the LLM prompt.
- Logs and traces avoid storing sensitive document text unless explicitly allowed.
- Cache keys include tenant and permission scope.
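The application-level check in the Chat Service could look like the following sketch. Chunks are modeled as dictionaries carrying the indexed `tenant_id`, `allowed_users`, and `allowed_groups` fields; the function name and return shape are assumptions.

```python
from typing import List, Set, Tuple


def split_authorized(chunks: List[dict], tenant_id: str, user_id: str,
                     group_ids: Set[str]) -> Tuple[List[dict], List[dict]]:
    """Re-check tenant and ACL on every chunk returned by the search index."""
    authorized, rejected = [], []
    for chunk in chunks:
        same_tenant = chunk.get("tenant_id") == tenant_id
        user_ok = user_id in chunk.get("allowed_users", [])
        group_ok = bool(group_ids & set(chunk.get("allowed_groups", [])))
        if same_tenant and (user_ok or group_ok):
            authorized.append(chunk)
        else:
            rejected.append(chunk)  # should never happen if filters are correct
    return authorized, rejected


hits = [
    {"id": "c1", "tenant_id": "t1", "allowed_users": ["u1"], "allowed_groups": []},
    {"id": "c2", "tenant_id": "t2", "allowed_users": ["u1"], "allowed_groups": []},
    {"id": "c3", "tenant_id": "t1", "allowed_users": [], "allowed_groups": ["finance_team"]},
]
ok, bad = split_authorized(hits, "t1", "u1", {"finance_team"})
print([c["id"] for c in ok])   # ['c1', 'c3']
print([c["id"] for c in bad])  # ['c2']
```

Only the authorized list goes into the prompt. A non-empty rejected list means the search-side filter failed and is worth treating as a security signal rather than silently dropping.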
12 Observability and operations
Both the ingestion pipeline and the serving path need metrics, broken down per stage:
flowchart LR
Metrics["Metrics"] --> Ingestion["Ingestion latency
per stage"]
Metrics --> Errors["Error rate
extract/chunk/embed/index"]
Metrics --> Queue["Queue depth
oldest message age"]
Metrics --> Freshness["Upload-to-available latency"]
Metrics --> Retrieval["Search latency
BM25/vector/rerank"]
Metrics --> Quality["Answer quality
citation correctness"]
classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
class Metrics input
class Ingestion,Errors,Queue,Freshness,Retrieval,Quality synth
Important alerts include:
- Queue age exceeds freshness SLO.
- Worker error rate spikes.
- Embedding provider latency or failure rate increases.
- Indexing failures exceed threshold.
- Search latency exceeds serving SLO.
- LLM timeout rate increases.
- Permission-filter validation catches unauthorized chunks.
For quality evaluation, maintain a benchmark set of questions and expected answers. Track retrieval precision, citation correctness, hallucination rate, and answer completeness. Run this evaluation before changing chunking, embeddings, retrieval weights, reranking, or prompt templates.
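Two of these metrics are simple enough to sketch directly. The definitions below are common simplified forms, stated as assumptions: precision at k is the share of the top-k retrieved chunk IDs that a labeler marked relevant, and citation correctness is the share of cited documents that appear in the expected set for the benchmark question.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def citation_correctness(cited_doc_ids: List[str],
                         expected_doc_ids: Set[str]) -> float:
    """Fraction of cited documents that the benchmark expects to be cited."""
    if not cited_doc_ids:
        return 0.0
    correct = sum(1 for doc_id in cited_doc_ids if doc_id in expected_doc_ids)
    return correct / len(cited_doc_ids)


print(precision_at_k(["c1", "c2", "c3", "c4"], {"c1", "c3"}, k=2))  # 0.5
print(citation_correctness(["doc_88", "doc_12"], {"doc_88"}))       # 0.5
```

Tracking these numbers per benchmark question before and after a change makes regressions in chunking or retrieval weights visible.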
13 Final architecture summary
A production RAG application is not just an LLM connected to a vector database. It is a distributed system with document lifecycle management, durable ingestion, indexing, retrieval, authorization, streaming, caching, and evaluation.
The core design principles are:
- Keep document ingestion asynchronous.
- Use a durable queue to absorb spikes.
- Use workflow orchestration for retryable multi-step ingestion.
- Make ingestion idempotent with document hashes and deterministic chunk IDs.
- Store tenant and permission metadata on every indexed chunk.
- Use hybrid retrieval instead of vector-only retrieval.
- Cache aggressively, but make every cache permission-aware.
- Stream responses so users see progress quickly.
- Always return citations tied to source documents.
- Measure freshness, retrieval quality, latency, and authorization correctness.
With these pieces in place, the system can ingest documents reliably, keep the index fresh, retrieve relevant context at low latency, and generate grounded answers without leaking data across users or tenants.