What Is Retrieval-Augmented Generation (RAG)?

A RAG system improves an LLM's responses by retrieving relevant context from your own proprietary data before generating an answer. Instead of relying solely on the LLM's training data, your chatbot can answer questions based on your company's documentation, product catalog, legal contracts, or support history, and it is far less prone to the hallucinations that come from asking an LLM to "remember" your data.

The architecture has three main components:

Ingestion pipeline: Convert your documents to vector embeddings and store them
Retrieval layer: When a user asks a question, find the most relevant document chunks
Generation layer: Pass the retrieved context + user question to the LLM to generate a grounded response
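
To make the three components concrete, here is a minimal, dependency-free sketch. It is not the stack used in this guide: the embed function is a toy word-count stand-in for a real embedding model, the in-memory index stands in for a vector database, and the sample chunks are invented for illustration. The generation step stops at building the prompt that would be sent to the LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. Ingestion: embed each chunk and store the vector next to its text.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our support team is available on weekdays.",
    "Shipping to Europe takes five business days.",
]
index = [(embed(c), c) for c in chunks]


# 2. Retrieval: rank stored chunks by similarity to the question.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


# 3. Generation: combine retrieved context and the question into one prompt.
def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


top = retrieve("How long do refunds take?", k=1)
```

In the real system, embed is replaced by an embedding model call, index by Pinecone, and the prompt is passed to the LLM.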

The Tech Stack

LangChain (Python): Orchestration framework for chaining LLM operations
OpenAI text-embedding-3-small: Convert text to vector embeddings
Pinecone: Vector database for semantic similarity search
OpenAI GPT-4o: LLM for final response generation
FastAPI: API layer for your chatbot endpoint
Next.js: Frontend chat interface

Step 1: Document Ingestion Pipeline

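# Requires OPENAI_API_KEY and PINECONE_API_KEY in the environment, plus an
# existing Pinecone index whose dimension matches the embedding model.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore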

# Load documents
loader = DirectoryLoader('./docs', glob='**/*.pdf', loader_cls=PyPDFLoader)
documents = loader.load()

# Split into chunks; sizes are measured in characters because length_function=len
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = splitter.split_documents(documents)

# Embed and store
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vectorstore = PineconeVectorStore.from_documents(
    chunks,
    embeddings,
    index_name='your-index-name'
)

Step 2: The RAG Query Chain

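from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
# RetrievalQA is a legacy chain that is deprecated in recent LangChain releases;
# pin your langchain version or expect to migrate this import.
from langchain.chains import RetrievalQA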

PROMPT = PromptTemplate(
    template="""You are a helpful assistant. Use only the following context to answer the question.
If you don't know the answer from the context, say "I don't have that information."

Context: {context}

Question: {question}
Answer:""",
    input_variables=['context', 'question']
)

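llm = ChatOpenAI(model='gpt-4o', temperature=0)  # temperature=0 for more repeatable answers
# vectorstore is the PineconeVectorStore built in Step 1; k=5 returns the five closest chunks
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})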

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=retriever,
    chain_type_kwargs={'prompt': PROMPT},
    return_source_documents=True
)
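
The 'stuff' chain type simply places all retrieved chunks into a single prompt. The sketch below approximates that step in plain Python so you can see what the LLM receives. The stuff_prompt helper and the sample chunk texts are hypothetical, and joining chunks with a blank line is an assumption about the default separator rather than a guarantee about LangChain's internals.

```python
TEMPLATE = (
    "You are a helpful assistant. Use only the following context to answer the question.\n"
    "If you don't know the answer from the context, say \"I don't have that information.\"\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def stuff_prompt(chunk_texts: list[str], question: str) -> str:
    """Join every retrieved chunk into one context string and fill the template."""
    context = "\n\n".join(chunk_texts)
    return TEMPLATE.format(context=context, question=question)


prompt = stuff_prompt(
    ["Chunk A text.", "Chunk B text."],  # hypothetical retrieved chunks
    "What does chunk A say?",
)
```

Because every chunk goes into one prompt, k multiplied by the chunk size has to fit comfortably inside the model's context window.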

Step 3: FastAPI Endpoint

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    session_id: str

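@app.post('/chat')
def chat(request: QueryRequest):
    # `chain` is the RetrievalQA chain from Step 2. chain.invoke() is a blocking
    # call, so this is a plain `def`: FastAPI runs it in a worker thread instead
    # of stalling the event loop. session_id is accepted but not used yet.
    result = chain.invoke({'query': request.question})
    return {
        'answer': result['result'],
        'sources': [doc.metadata for doc in result['source_documents']]
    }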

Critical Production Considerations

Chunking strategy matters more than the LLM. Poor chunking (chunks that are too small, have no overlap, or break mid-sentence) produces poor retrieval results regardless of which LLM you use. Test your chunk sizes against real user queries before going to production.
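
To see how chunk size and overlap interact, here is a simplified sliding-window chunker. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers paragraph and sentence boundaries); the sliding_chunks helper is a hypothetical illustration that works on raw characters only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


pieces = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.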

Always return source citations. For business use cases, returning the source documents alongside the LLM's answer builds user trust and allows verification. The code above returns source_documents for this reason.
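
Raw metadata dictionaries are not very readable, and several retrieved chunks often come from the same page. A small helper like the following could turn them into de-duplicated citation labels. The format_citations name is hypothetical, and the 'source' and zero-based 'page' keys are assumptions about what your PDF loader writes into metadata, so check a sample document first.

```python
def format_citations(metadatas: list[dict]) -> list[str]:
    """Build unique, human-readable citation labels, preserving retrieval order."""
    seen = set()
    citations = []
    for meta in metadatas:
        source = meta.get("source", "unknown source")
        page = meta.get("page")  # assumed zero-based when present
        label = source if page is None else f"{source} (page {page + 1})"
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations


labels = format_citations([
    {"source": "docs/a.pdf", "page": 0},
    {"source": "docs/a.pdf", "page": 0},  # duplicate chunk from the same page
    {"source": "docs/b.pdf"},
])
```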

Implement retrieval quality evaluation. Before launching, manually test 20–30 representative user queries and score the retrieval quality. If the right document chunks aren't being retrieved, the LLM cannot give a correct answer regardless of its quality.
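
One simple way to score those test queries is a hit rate: the share of queries for which the chunk you expected shows up in the top k results. The sketch below assumes you have labeled each query with one expected chunk ID by hand; the hit_rate_at_k helper, the IDs, and the sample queries are all made up for illustration.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k retrieved IDs."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for query, chunk_id in expected.items()
        if chunk_id in results.get(query, [])[:k]
    )
    return hits / len(expected)


# Hand-labeled expectations (hypothetical IDs).
expected = {
    "What is the refund policy?": "policy.pdf#3",
    "How long does shipping take?": "shipping.pdf#1",
}
# IDs the retriever returned for each query, best match first (hypothetical).
results = {
    "What is the refund policy?": ["faq.pdf#2", "policy.pdf#3"],
    "How long does shipping take?": ["faq.pdf#7", "faq.pdf#2"],
}
score = hit_rate_at_k(results, expected, k=5)
```

A low score points at chunking or embedding problems, not at the LLM, which is why it is worth measuring retrieval on its own.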

Cost estimation: At current OpenAI pricing, a 1,000-query/day RAG chatbot using GPT-4o and text-embedding-3-small costs approximately $15–40/day depending on document chunk sizes and response lengths.
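
To check an estimate like this against your own traffic, the arithmetic can be sketched as below. The token counts and per-million-token prices in the example are placeholder assumptions, not quoted OpenAI rates, so substitute the figures from the current pricing page. The sketch covers only the generation calls and leaves out the comparatively small cost of embedding each query.

```python
def daily_llm_cost(
    queries_per_day: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated daily spend in dollars on LLM generation calls."""
    input_cost = queries_per_day * input_tokens_per_query * input_price_per_million / 1_000_000
    output_cost = queries_per_day * output_tokens_per_query * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Placeholder values for illustration only.
estimate = daily_llm_cost(
    queries_per_day=1000,
    input_tokens_per_query=2000,    # prompt template + retrieved chunks + question
    output_tokens_per_query=500,    # generated answer
    input_price_per_million=5.0,    # hypothetical dollars per 1M input tokens
    output_price_per_million=15.0,  # hypothetical dollars per 1M output tokens
)
```

Input tokens usually dominate in RAG because every query carries k retrieved chunks, so lowering k or the chunk size is the most direct cost lever.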

Need help building a production-grade RAG system for your business? Hire DelhiStack's AI developers — we've built RAG systems processing 1,000+ documents daily with 99.1% accuracy.

How to Build a RAG Chatbot with LangChain: A Step-by-Step Guide (2026)
