
Designing an End-to-End RAG Application with Document Ingestion

Retrieval-Augmented Generation becomes much harder once it moves beyond a demo. A production system has to ingest documents from many sources, keep the index fresh, enforce permissions, isolate tenants, serve low-latency answers, and still cite the original documents. Here is a design that holds up under all of those constraints.

This article walks through the design of an end-to-end “chat with your docs” system. The design supports document uploads, asynchronous ingestion, hybrid retrieval, access control, streaming chat responses, and multi-tenant isolation. Each section pairs the architectural choice with the failure mode it is trying to prevent.

01 Product requirements

The system has two primary user-facing capabilities.

First, users can add documents to a knowledge base. Documents may come from sources such as PDFs, Google Docs, Notion pages, or files stored in object storage. The baseline design assumes documents up to 10 MB, with room to extend to larger files later.

Second, users can ask open-ended questions over those documents. The answer should be grounded in retrieved knowledge-base chunks and include citations linking back to the original source document.

The most important security requirement is that users only see documents they are authorized to access. The system is also multi-tenant: multiple companies can use the same deployment, but data must not leak across tenants.

Non-functional goals

  • Availability is more important than strict consistency. A newly uploaded document may take a short time to appear in chat results.
  • Document freshness target: newly added documents should be reflected in retrieval results within roughly one minute at P95.
  • Chat serving target: time to first token should be low, around a few hundred milliseconds where possible.
  • The system should scale to high query throughput and absorb spikes in document ingestion.

02 High-level architecture

The system has two major paths: the ingestion path and the query-serving path.

flowchart LR
    Client["RAG Client<br/>Web / Mobile"] --> APIGW["API Gateway<br/>Auth<br/>Rate limiting<br/>Routing"]
    APIGW --> DocSvc["Document Service"]
    APIGW --> ChatSvc["Chat Service"]

    DocSvc --> DB[(PostgreSQL<br/>Document metadata<br/>Permissions<br/>Status)]
    DocSvc --> ObjStore[(Object Storage<br/>Raw files)]
    DocSvc --> SQS["Ingestion Queue<br/>SQS / durable queue"]

    SQS --> Scheduler["Workflow Scheduler"]
    Scheduler --> Temporal["Temporal<br/>Workflow Orchestration"]
    Temporal --> Extract["Extract Worker<br/>Fetch and parse source"]
    Extract --> Chunk["Chunk Worker<br/>Split raw text"]
    Chunk --> Embed["Embedding Worker<br/>Generate vectors"]
    Embed --> Index["Index Worker<br/>Write chunks"]
    Index --> ES[(Search Index<br/>BM25 + Vector Index)]
    Index --> DB

    ChatSvc --> DB
    ChatSvc --> ES
    ChatSvc --> Cache[(Cache<br/>Hot chunks<br/>Query embeddings<br/>Responses)]
    ChatSvc --> LLM["LLM Service"]
    LLM --> ChatSvc
    ChatSvc --> APIGW
    APIGW --> Client

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef svc fill:#fff8e1,stroke:#f57f17,stroke-width:2px,color:#000
    classDef storage fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
    classDef worker fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef ai fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000

    class Client input
    class APIGW,DocSvc,ChatSvc svc
    class DB,ObjStore,SQS,Cache,ES storage
    class Scheduler,Temporal,Extract,Chunk,Embed,Index worker
    class LLM ai

The API Gateway authenticates users, applies rate limits, and routes requests to either the Document Service or Chat Service.

The Document Service owns document metadata, upload lifecycle, permissions, and ingestion status. It writes raw files to object storage and metadata to PostgreSQL, then emits ingestion work to a durable queue.

The ingestion pipeline is asynchronous. A workflow scheduler consumes queue messages and starts durable workflows in Temporal. Workers extract content, chunk it, embed it, and index it into Elasticsearch or a similar search system that supports both lexical and vector retrieval.

The Chat Service handles online user queries. It retrieves permitted chunks using hybrid search, builds an LLM prompt, streams the response, and returns citations.

03 Data model

A minimal production data model needs users, groups, documents, and chunks.

erDiagram
    TENANT ||--o{ USER : contains
    TENANT ||--o{ USER_GROUP : contains
    USER_GROUP ||--o{ USER_GROUP_MEMBER : has
    USER ||--o{ USER_GROUP_MEMBER : belongs_to

    TENANT ||--o{ DOCUMENT : owns
    DOCUMENT ||--o{ DOCUMENT_CHUNK : split_into

    DOCUMENT {
        string id
        string tenant_id
        string name
        string source_type
        string source_link
        string object_key
        string status
        string hash
        datetime created_at
        datetime updated_at
    }

    DOCUMENT_CHUNK {
        string id
        string tenant_id
        string document_id
        int chunk_index
        string content
        vector embedding
        string allowed_users
        string allowed_groups
        string source_url
    }

    USER {
        string id
        string tenant_id
        string email
        string name
    }

    USER_GROUP {
        string id
        string tenant_id
        string name
    }

    USER_GROUP_MEMBER {
        string user_id
        string group_id
    }

The key design choice is to copy access-control metadata into the search document for each chunk. Every indexed chunk contains tenant_id, allowed_users, and allowed_groups. That allows the serving path to enforce permissions during retrieval instead of retrieving first and filtering afterward. The trade-off is that a permission change on a document must be propagated to every one of its indexed chunks.

Filtering after retrieval is dangerous because the top-k search results might be dominated by unauthorized chunks. The user would receive poor results, and the system might accidentally expose metadata or citations from documents the user should not see. Authorization must be part of the search query itself.

04 Document upload API

The document API separates metadata creation from binary upload. This lets the backend validate permissions and generate a pre-signed upload URL without proxying large files through the application server.

POST /v1/documents

Example request:

{
  "name": "Q3 Financial Projections",
  "permission": {
    "allowed_users": ["user_123"],
    "allowed_user_groups": ["finance_team"]
  },
  "source_type": "S3_PDF",
  "source_link": "s3://company-docs/q3-financial-projections.pdf"
}

Example response:

{
  "id": "doc_88",
  "upload_presigned_url": "https://object-store/upload/...",
  "status": "CREATED"
}

For bulk onboarding, the system also exposes:

POST /v1/documents/batch
{
  "documents": [
    {
      "name": "Q3 Financial Projections",
      "permission": {
        "allowed_users": [],
        "allowed_user_groups": ["finance_team"]
      },
      "source_type": "GOOGLE_DOC",
      "source_link": "https://docs.example.com/doc/abc"
    }
  ]
}

To check whether a document is ready for chat:

GET /v1/documents/{document_id}

Typical statuses include:

stateDiagram-v2
    [*] --> CREATED
    CREATED --> UPLOADED: file uploaded / source registered
    UPLOADED --> PENDING_INGESTION: scheduler claims work
    PENDING_INGESTION --> EXTRACTING
    EXTRACTING --> CHUNKING
    CHUNKING --> EMBEDDING
    EMBEDDING --> INDEXING
    INDEXING --> AVAILABLE
    EXTRACTING --> FAILED
    CHUNKING --> FAILED
    EMBEDDING --> FAILED
    INDEXING --> FAILED
    FAILED --> PENDING_INGESTION: retry / reprocess
    AVAILABLE --> PENDING_INGESTION: source changed
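
The state diagram above can be enforced in code as a transition table, so that a worker cannot move a document into a status the diagram does not allow. The following is a minimal sketch; the names ALLOWED_TRANSITIONS, can_transition, and is_ready_for_chat are illustrative assumptions, not part of the API described here.

```python
# Allowed status transitions, copied from the state diagram above.
ALLOWED_TRANSITIONS = {
    "CREATED": {"UPLOADED"},
    "UPLOADED": {"PENDING_INGESTION"},
    "PENDING_INGESTION": {"EXTRACTING"},
    "EXTRACTING": {"CHUNKING", "FAILED"},
    "CHUNKING": {"EMBEDDING", "FAILED"},
    "EMBEDDING": {"INDEXING", "FAILED"},
    "INDEXING": {"AVAILABLE", "FAILED"},
    "FAILED": {"PENDING_INGESTION"},      # retry / reprocess
    "AVAILABLE": {"PENDING_INGESTION"},   # source changed
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram permits moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def is_ready_for_chat(status: str) -> bool:
    """A client polling the status endpoint waits for AVAILABLE."""
    return status == "AVAILABLE"
```

A service would call can_transition before every status write and reject or log anything that returns False.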

05 Ingestion pipeline

The ingestion flow is intentionally asynchronous. Uploading a document should not block on extraction, chunking, embedding, and indexing.

sequenceDiagram
    participant Client
    participant APIGW as API Gateway
    participant DocSvc as Document Service
    participant DB as PostgreSQL
    participant Store as Object Storage
    participant Q as Durable Queue
    participant S as Workflow Scheduler
    participant T as Temporal
    participant W as Workers
    participant ES as Search Index

    Client->>APIGW: POST /v1/documents
    APIGW->>DocSvc: Authenticated request
    DocSvc->>DB: Create Document(status=CREATED)
    DocSvc-->>Client: document_id + upload URL

    Client->>Store: Upload file
    Store-->>DocSvc: Upload event / callback
    DocSvc->>DB: status=UPLOADED, hash=...
    DocSvc->>Q: Enqueue document_id

    S->>Q: Consume message
    S->>DB: Claim document using optimistic update
    S->>T: Start ingestion workflow

    T->>W: Extract raw text
    T->>W: Chunk document
    T->>W: Generate embeddings
    T->>W: Index chunks

    W->>ES: Upsert chunks with vectors + ACLs
    W->>DB: status=AVAILABLE

The pipeline has four core stages.

Extract. Fetch the source content based on source_type. For a PDF, the worker may read from object storage. For Google Docs or Notion, it may call the source API. The output is normalized raw text plus source-location metadata.

Chunk. Split raw text into smaller sections that fit embedding-model and LLM-context constraints. Chunks should preserve enough context to be meaningful. Common strategies include heading-aware chunking, overlapping windows, page-aware PDF chunking, and section-based chunking for structured documents.

Embed. Convert each chunk into a dense vector. Store the embedding alongside the text and metadata.

Index. Write the chunk into the search index. Each indexed record includes the chunk text, embedding, document metadata, tenant ID, permissions, and citation metadata.
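
As an illustration of the chunk stage, here is a minimal sketch of the overlapping-window strategy. It splits on whitespace and counts words, which is an assumption made for brevity; a real pipeline would more likely count model tokens and respect headings or pages. The function name and default sizes are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some words twice.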

06 Reliability and idempotency

Ingestion has multiple failure points. Source APIs may time out, PDF parsing may fail, embedding calls may be rate limited, or indexing may partially succeed. A durable workflow engine such as Temporal is useful because it persists workflow state, retries failed activities, and gives operators visibility into where a document failed.

flowchart TD
    Start["Start ingestion workflow"] --> HashCheck{"Hash unchanged?"}
    HashCheck -->|Yes| Skip["Skip ingestion<br/>Mark as AVAILABLE"]
    HashCheck -->|No| Extract["Extract text"]

    Extract -->|success| Chunk["Chunk text"]
    Extract -->|transient failure| RetryExtract["Retry with backoff"]
    RetryExtract --> Extract
    Extract -->|permanent failure| Failed["Mark FAILED"]

    Chunk --> Embed["Generate embeddings"]
    Embed --> Index["Upsert chunks into search index"]
    Index --> Available["Mark AVAILABLE"]
    Index -->|partial failure| RetryIndex["Retry idempotent upsert"]
    RetryIndex --> Index

    classDef plan fill:#fff8e1,stroke:#f57f17,stroke-width:2px,color:#000
    classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000
    classDef fail fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000

    class Start,HashCheck plan
    class Extract,Chunk,Embed,Index,RetryExtract,RetryIndex synth
    class Skip,Available output
    class Failed fail

Idempotency matters because the same document may be submitted more than once, queue messages may be delivered more than once, and multiple schedulers may race to process the same work.

The design uses a document hash. If the source content hash matches the previously indexed hash, the ingestion workflow can skip reprocessing. If the hash changed, the system proceeds with extraction, chunking, embedding, and reindexing.
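
The hash check might look like the sketch below. It assumes the raw source bytes are available to the workflow and that SHA-256 is the chosen hash; the article does not name an algorithm, and the function names are hypothetical.

```python
import hashlib
from typing import Optional


def content_hash(raw: bytes) -> str:
    """SHA-256 hex digest of the source bytes."""
    return hashlib.sha256(raw).hexdigest()


def should_skip_ingestion(raw: bytes, indexed_hash: Optional[str]) -> bool:
    """Skip only when a previously indexed hash exists and matches."""
    return indexed_hash is not None and content_hash(raw) == indexed_hash
```

A document that has never been indexed has no stored hash, so it always proceeds to extraction.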

The scheduler should also claim work using optimistic concurrency control:

UPDATE document
SET status = 'PENDING_INGESTION'
WHERE id = :document_id
  AND status = 'UPLOADED';

If zero rows are affected, another scheduler has already claimed the document or the document is no longer eligible for ingestion. The current scheduler should skip it.
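
The claim can be sketched end to end as follows. An in-memory SQLite database stands in for PostgreSQL so the example is self-contained; the table has only the two columns the claim needs, and claim_document is a hypothetical helper name.

```python
import sqlite3


def claim_document(conn: sqlite3.Connection, document_id: str) -> bool:
    """Try to move one document from UPLOADED to PENDING_INGESTION.

    Returns True only for the caller whose UPDATE actually changed a row.
    """
    cur = conn.execute(
        "UPDATE document SET status = 'PENDING_INGESTION' "
        "WHERE id = ? AND status = 'UPLOADED'",
        (document_id,),
    )
    conn.commit()
    return cur.rowcount == 1


# In-memory SQLite stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO document VALUES ('doc_88', 'UPLOADED')")

first = claim_document(conn, "doc_88")   # first scheduler wins the claim
second = claim_document(conn, "doc_88")  # duplicate delivery finds nothing to claim
```

Because the status check and the write happen in one statement, two schedulers that receive the same queue message cannot both see a changed row.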

Index writes should be idempotent too. Use deterministic chunk IDs such as:

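chunk_id = hash(document_id + ":" + document_version + ":" + chunk_index)

The separator keeps different inputs from concatenating to the same string. Then indexing can use upsert semantics. Re-running the same workflow replaces the same chunk records instead of creating duplicates. Because a new document version produces new chunk IDs, reindexing must also delete the chunks of the previous version, or stale text will remain searchable.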
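
A deterministic chunk ID can be sketched with the standard library. SHA-256 and the ":" separator are assumptions made for this example, and the sketch assumes document IDs do not themselves contain the separator.

```python
import hashlib


def chunk_id(document_id: str, document_version: int, chunk_index: int) -> str:
    """Deterministic ID: identical inputs always yield the identical ID."""
    # The separator keeps ("doc_1", 12, 3) distinct from ("doc_11", 2, 3),
    # which would otherwise both concatenate to "doc_1123".
    key = f"{document_id}:{document_version}:{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Using this value as the search-index document ID is what turns a repeated index write into an overwrite.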

07 Handling ingestion spikes

A common mistake is to call the workflow engine directly from the Document Service for every uploaded document. During a spike, the service may create too many workflows, overload workers, or keep too much pending work in memory.

A durable queue smooths the load.

flowchart LR
    DocSvc["Document Service"] --> Q["Durable Queue<br/>SQS / Kafka / PubSub"]
    Q --> Scheduler1["Workflow Scheduler 1"]
    Q --> Scheduler2["Workflow Scheduler 2"]
    Q --> Scheduler3["Workflow Scheduler N"]

    Scheduler1 --> DB[(PostgreSQL<br/>claim status)]
    Scheduler2 --> DB
    Scheduler3 --> DB

    Scheduler1 --> Temporal["Temporal"]
    Scheduler2 --> Temporal
    Scheduler3 --> Temporal

    Temporal --> Workers["Autoscaled Worker Pool"]
    Workers --> ES[(Search Index)]

    classDef svc fill:#fff8e1,stroke:#f57f17,stroke-width:2px,color:#000
    classDef storage fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
    classDef worker fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000

    class DocSvc,Scheduler1,Scheduler2,Scheduler3,Temporal svc
    class Q,DB,ES storage
    class Workers worker

The API Gateway should enforce per-user and per-tenant ingestion rate limits so one customer cannot flood the system. Worker pools can scale based on queue depth, CPU, and external-service throttling. There should also be a maximum backlog alert: autoscaling helps, but it is not a substitute for admission control.
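
One common way to implement a per-tenant limit is a token bucket. The sketch below keeps state in process memory, which is an assumption for illustration; a gateway running on several nodes would need shared state. The class name, helper name, and the rate and capacity values are hypothetical.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: dict[str, TokenBucket] = {}


def allow_upload(tenant_id: str, rate: float = 5.0, capacity: float = 20.0) -> bool:
    """One bucket per tenant, created on first use."""
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

A rejected request would typically be answered with HTTP 429 so that well-behaved clients back off instead of retrying immediately.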

For documents larger than the default limit, add a pre-chunking phase. The system can split a very large file into processing units, run extraction and chunking in parallel, and merge the output into one logical document version.

08 Search and retrieval

The query-serving path must balance latency, quality, and access control.

Dense vector search is good for semantic similarity. It can match “revenue forecast” with “financial projection.” But it may miss exact terms such as product codes, invoice IDs, customer names, or policy numbers.

BM25 or sparse-vector retrieval is good for exact and lexical matches. But it may miss semantically equivalent phrases.

A production RAG system should usually use hybrid retrieval.

flowchart TD
    Query["User query"] --> Normalize["Normalize query"]
    Normalize --> ACL["Attach tenant + permission filters<br/>to both searches"]
    ACL --> EmbedQ["Generate query embedding"]

    ACL --> BM25["BM25 / lexical search"]
    EmbedQ --> ANN["ANN vector search"]

    BM25 --> Merge["Merge candidates"]
    ANN --> Merge

    Merge --> Rank["Rank / rerank"]
    Rank --> Diversity["Diversity selection<br/>avoid redundant chunks"]
    Diversity --> Context["Build LLM context"]
    Context --> LLM["Generate grounded answer"]
    LLM --> Stream["Stream answer + citations"]

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef retrieval fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
    classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000

    class Query input
    class Normalize,EmbedQ,BM25,ANN,Merge,ACL retrieval
    class Rank,Diversity,Context,LLM synth
    class Stream output

The retrieval query should include:

tenant_id = current_user.tenant_id
AND (
  allowed_users contains current_user.id
  OR allowed_groups intersects current_user.group_ids
)

For Elasticsearch, tenant-aware routing can reduce latency and improve isolation. A common design is to route documents by tenant_id, so a tenant query hits only the shard or shard subset containing that tenant’s data. The application should also validate that retrieved chunks all belong to the current tenant before sending them to the LLM.
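
In Elasticsearch Query DSL, the tenant and permission condition can be expressed as a bool query with filter clauses. The sketch below only builds the filter as a Python dictionary; it assumes tenant_id, allowed_users, and allowed_groups are mapped as keyword fields, and the function name is hypothetical. The same filter would be attached to both the lexical and the vector search.

```python
def build_acl_filter(tenant_id: str, user_id: str, group_ids: list[str]) -> dict:
    """Build a bool filter: tenant match AND (user allowed OR group allowed)."""
    return {
        "bool": {
            "filter": [
                # tenant_id must come from trusted identity claims.
                {"term": {"tenant_id": tenant_id}},
                {
                    "bool": {
                        "should": [
                            {"term": {"allowed_users": user_id}},
                            {"terms": {"allowed_groups": group_ids}},
                        ],
                        "minimum_should_match": 1,
                    }
                },
            ]
        }
    }
```

Filter clauses do not contribute to relevance scoring, so adding them restricts the candidate set without changing how the permitted chunks are ranked.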

Hybrid retrieval can be implemented in several ways:

flowchart LR
    Q["Query"] --> V["Vector Search<br/>top_k=100"]
    Q --> L["Lexical Search<br/>top_k=100"]
    V --> C["Candidate Pool"]
    L --> C
    C --> RRF["Reciprocal Rank Fusion<br/>or weighted scoring"]
    RRF --> Rerank["Optional reranker"]
    Rerank --> Top["Top context chunks"]

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef retrieval fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
    classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000

    class Q input
    class V,L,C retrieval
    class RRF,Rerank synth
    class Top output
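
Reciprocal Rank Fusion combines the two candidate lists using only rank positions, so the BM25 and vector scores never need to be put on a common scale. Each list contributes 1 / (k + rank) for every chunk it contains. The sketch below uses k = 60, a commonly used default; that value and the function name are assumptions rather than something this design prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists; each list adds 1 / (k + rank) per chunk."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on chunk ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["a", "b", "c"]
lexical_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
```

Chunks that appear in both lists ("a" and "b" here) rise above chunks found by only one retriever, which is the behavior hybrid retrieval is after.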

For high quality, tune:

  • Chunk size and overlap.
  • Number of vector candidates.
  • Number of lexical candidates.
  • Hybrid weighting.
  • Reranker threshold.
  • Diversity rules, so the final context is not filled with near-duplicate chunks.
  • Citation granularity, such as page number, section heading, or source URL.
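
As one possible form of the diversity rule in the list above, the sketch below walks the ranked chunks and drops any chunk whose word overlap with an already selected chunk is too high. Jaccard similarity over words and the 0.8 threshold are assumptions chosen for simplicity; embedding similarity or maximal marginal relevance are common alternatives.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two chunks, from 0.0 to 1.0."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def select_diverse(ranked_chunks: list[str], limit: int,
                   max_similarity: float = 0.8) -> list[str]:
    """Keep chunks in rank order, skipping near-duplicates of kept chunks."""
    kept: list[str] = []
    for chunk in ranked_chunks:
        if len(kept) >= limit:
            break
        if all(jaccard(chunk, other) <= max_similarity for other in kept):
            kept.append(chunk)
    return kept
```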

09 Low-latency serving

The design target is aggressive: the user should see the first token quickly, even though the system may need to embed the query, retrieve chunks, call an LLM, and stream the answer.

The key is to cache across the serving path.

flowchart TD
    Query["Incoming query"] --> ExactCache{"Exact query cache hit?"}
    ExactCache -->|Yes| ReturnCached["Return cached answer"]
    ExactCache -->|No| Embed["Generate query embedding"]

    Embed --> SemanticCache{"Similar query cache hit?"}
    SemanticCache -->|Yes| ValidatePerms["Validate tenant + permissions"]
    ValidatePerms --> ReturnCached

    SemanticCache -->|No| Retrieve["Hybrid retrieval"]
    Retrieve --> HotChunkCache["Fetch hot chunks from cache"]
    HotChunkCache --> LLM["Call LLM"]
    LLM --> StoreCache["Store answer + sources"]
    StoreCache --> Stream["Stream response"]

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef gate fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#000
    classDef retrieval fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
    classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000

    class Query input
    class ExactCache,SemanticCache gate
    class Embed,Retrieve,HotChunkCache,ValidatePerms retrieval
    class LLM,StoreCache synth
    class ReturnCached,Stream output

Useful caches include:

Query-to-embedding cache. If users repeatedly ask the same question, avoid recomputing embeddings.

Semantic query cache. Store query embeddings and final responses. If a new query is extremely similar to a previous query, reuse the response only after checking tenant, user, permissions, document versions, and freshness.

Hot chunk cache. Frequently retrieved chunks can be cached outside the search index.

Document metadata cache. Document titles, source URLs, and citation metadata are good cache candidates.

Caching RAG responses must be permission-aware. Otherwise, a user might receive an answer generated from documents they no longer have access to.

The cache key should include tenant, user or permission scope, document-index version, and possibly group membership version.
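
A cache key along those lines can be sketched as a hash over the scoping fields. The field names, the "rag:answer:" prefix, and the light query normalization are assumptions for illustration; here the permission scope is the user ID plus the sorted group IDs.

```python
import hashlib
import json


def response_cache_key(tenant_id: str, user_id: str, group_ids: list[str],
                       index_version: int, query: str) -> str:
    """Derive a key that changes whenever scope or index version changes."""
    payload = {
        "tenant": tenant_id,
        "user": user_id,
        "groups": sorted(group_ids),  # group order must not affect the key
        "index_version": index_version,
        "query": " ".join(query.lower().split()),  # collapse case and spacing
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return "rag:answer:" + hashlib.sha256(encoded).hexdigest()
```

Because the index version is part of the key, reindexing a tenant's documents makes older cached answers unreachable without an explicit purge; they still need an expiry time so they do not accumulate.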

10 Streaming chat API

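For chat, WebSockets or server-sent events (SSE) are a good fit because the answer is generated incrementally. SSE carries data from server to client only, so with SSE the query would be sent as an ordinary HTTP request; the op-based messages below assume a bidirectional channel such as a WebSocket.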

sequenceDiagram
    participant Client
    participant Chat as Chat Service
    participant Search as Search Index
    participant LLM

    Client->>Chat: USER_QUERY(query)
    Chat->>Search: Hybrid search with ACL filters
    Search-->>Chat: Permitted chunks + citations
    Chat-->>Client: CHAT_START(response_id, sources)
    Chat->>LLM: Prompt with retrieved context
    loop token stream
        LLM-->>Chat: token delta
        Chat-->>Client: CHAT_CHUNK(seq, delta)
    end
    Chat-->>Client: CHAT_DONE(full_text, citations)

Example client message:

{
  "op": "USER_QUERY",
  "query": "What is the projected revenue for Q3?"
}

Example start event:

{
  "op": "CHAT_START",
  "response_id": "req_abc123",
  "sources": [
    {
      "document_id": "doc_88",
      "title": "Q3 Financial Projections",
      "url": "https://intranet/docs/q3-fin"
    }
  ]
}

Example chunk event:

{
  "op": "CHAT_CHUNK",
  "response_id": "req_abc123",
  "seq": 1,
  "delta": "The "
}

Example completion event:

{
  "op": "CHAT_DONE",
  "response_id": "req_abc123",
  "full_text": "The projected revenue is expected to grow...",
  "citations": [
    {
      "document_id": "doc_88",
      "chunk_id": "chunk_001",
      "url": "https://intranet/docs/q3-fin",
      "title": "Q3 Financial Projections"
    }
  ]
}
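
The event sequence can be produced by a small generator on the server side. The sketch below is transport-agnostic: it yields dictionaries shaped like the examples above and leaves serialization and sending to the caller. The function name and parameters are hypothetical, and token_deltas stands in for whatever stream the LLM client returns.

```python
from typing import Iterable, Iterator


def chat_events(response_id: str, sources: list[dict], citations: list[dict],
                token_deltas: Iterable[str]) -> Iterator[dict]:
    """Yield CHAT_START, one CHAT_CHUNK per delta, then CHAT_DONE."""
    yield {"op": "CHAT_START", "response_id": response_id, "sources": sources}
    parts = []
    for seq, delta in enumerate(token_deltas, start=1):
        parts.append(delta)
        yield {"op": "CHAT_CHUNK", "response_id": response_id,
               "seq": seq, "delta": delta}
    yield {
        "op": "CHAT_DONE",
        "response_id": response_id,
        "full_text": "".join(parts),  # lets clients verify what they assembled
        "citations": citations,
    }
```

The seq field lets a client detect dropped or reordered chunks, and full_text in the final event gives it a way to repair its local copy.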

11 Multi-tenancy, permissions, and privacy

Multi-tenancy must be enforced in both ingestion and query paths.

During ingestion:

flowchart TD
    Upload["Document upload"] --> Auth["Authenticate user"]
    Auth --> Tenant["Attach tenant_id"]
    Tenant --> Perms["Validate allowed users/groups"]
    Perms --> Metadata["Store metadata in DB"]
    Metadata --> Index["Index chunks with tenant_id + ACLs"]

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000

    class Upload input
    class Auth,Tenant,Perms,Metadata synth
    class Index output

During query serving:

flowchart TD
    Query["User query"] --> Identity["Resolve user_id, tenant_id, groups"]
    Identity --> SearchFilter["Build search filter"]
    SearchFilter --> TenantFilter["tenant_id == current tenant"]
    TenantFilter --> ACLFilter["allowed_users contains user<br/>OR allowed_groups intersects groups"]
    ACLFilter --> Retrieve["Retrieve chunks"]
    Retrieve --> DoubleCheck["Application-level tenant + ACL check"]
    DoubleCheck --> LLM["Send only authorized chunks to LLM"]

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef retrieval fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px,color:#000
    classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
    classDef output fill:#fce4ec,stroke:#ad1457,stroke-width:2px,color:#000

    class Query input
    class Identity,SearchFilter,TenantFilter,ACLFilter retrieval
    class Retrieve,DoubleCheck synth
    class LLM output

There should be multiple layers of defense:

  1. API Gateway authenticates requests.
  2. Application services attach tenant_id from trusted identity claims, not from user-provided input.
  3. Search queries always include tenant and ACL filters.
  4. Indexed chunks store tenant and permission metadata.
  5. The Chat Service validates returned chunks before building the LLM prompt.
  6. Logs and traces avoid storing sensitive document text unless explicitly allowed.
  7. Cache keys include tenant and permission scope.
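
The application-level check in step 5 can be sketched as a partition over the returned chunks. The RetrievedChunk type and both function names are hypothetical; the sketch assumes each search hit carries the tenant and ACL fields that were indexed with it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    tenant_id: str
    allowed_users: frozenset
    allowed_groups: frozenset


def is_authorized(chunk: RetrievedChunk, tenant_id: str, user_id: str,
                  group_ids: frozenset) -> bool:
    """Tenant must match, and the user or one of their groups must be listed."""
    if chunk.tenant_id != tenant_id:
        return False
    return user_id in chunk.allowed_users or bool(chunk.allowed_groups & group_ids)


def partition_chunks(chunks, tenant_id: str, user_id: str, group_ids: frozenset):
    """Return (authorized, rejected); a non-empty rejected list should alert."""
    authorized, rejected = [], []
    for chunk in chunks:
        ok = is_authorized(chunk, tenant_id, user_id, group_ids)
        (authorized if ok else rejected).append(chunk)
    return authorized, rejected
```

If the search filter is correct, the rejected list is always empty, so any entry in it signals a bug in the filter or the index and is worth an alert rather than a silent drop.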

12 Observability and operations

The system needs metrics for each ingestion stage, plus retrieval latency and answer quality:

flowchart LR
    Metrics["Metrics"] --> Ingestion["Ingestion latency<br/>per stage"]
    Metrics --> Errors["Error rate<br/>extract/chunk/embed/index"]
    Metrics --> Queue["Queue depth<br/>oldest message age"]
    Metrics --> Freshness["Upload-to-available latency"]
    Metrics --> Retrieval["Search latency<br/>BM25/vector/rerank"]
    Metrics --> Quality["Answer quality<br/>citation correctness"]

    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef synth fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000

    class Metrics input
    class Ingestion,Errors,Queue,Freshness,Retrieval,Quality synth

Important alerts include:

  • Queue age exceeds freshness SLO.
  • Worker error rate spikes.
  • Embedding provider latency or failure rate increases.
  • Indexing failures exceed threshold.
  • Search latency exceeds serving SLO.
  • LLM timeout rate increases.
  • Permission-filter validation catches unauthorized chunks.

For quality evaluation, maintain a benchmark set of questions and expected answers. Track retrieval precision, citation correctness, hallucination rate, and answer completeness. Run this evaluation before changing chunking, embeddings, retrieval weights, reranking, or prompt templates.
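
For the retrieval part of that benchmark, two simple metrics are precision and recall over the top k retrieved chunks. The sketch below assumes each benchmark question comes with a hand-labeled set of relevant chunk IDs; the function names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in relevant) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks found in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

Averaging both numbers over the benchmark before and after a change gives a quick signal on whether a new chunk size or hybrid weight helped or hurt retrieval.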

13 Final architecture summary

A production RAG application is not just an LLM connected to a vector database. It is a distributed system with document lifecycle management, durable ingestion, indexing, retrieval, authorization, streaming, caching, and evaluation.

The core design principles are:

  • Keep document ingestion asynchronous.
  • Use a durable queue to absorb spikes.
  • Use workflow orchestration for retryable multi-step ingestion.
  • Make ingestion idempotent with document hashes and deterministic chunk IDs.
  • Store tenant and permission metadata on every indexed chunk.
  • Use hybrid retrieval instead of vector-only retrieval.
  • Cache aggressively, but make every cache permission-aware.
  • Stream responses so users see progress quickly.
  • Always return citations tied to source documents.
  • Measure freshness, retrieval quality, latency, and authorization correctness.


With these pieces in place, the system can ingest documents reliably, keep the index fresh, retrieve relevant context at low latency, and generate grounded answers without leaking data across users or tenants.

Strivo: How we build it

RAG is a systems problem, not a prompt problem.

At Strivo, we design and ship production RAG systems with durable ingestion pipelines, hybrid retrieval, tenant-aware access control, and streaming chat surfaces. If you are building a “chat with your docs” product or upgrading an internal knowledge tool, we can help.