
A retrieval-augmented generation (RAG) system enhances an LLM's responses by retrieving relevant context from your own proprietary data before generating a response. Instead of relying solely on the LLM's training data, your chatbot can answer questions based on your company's documentation, product catalog, legal contracts, or support history, with far fewer of the hallucinations that come from asking an LLM to "remember" your data.
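To make the retrieve-then-generate idea concrete, here is a toy sketch in plain Python. It ranks passages by simple word overlap instead of vector embeddings, and it only assembles the prompt rather than calling an LLM. The names score and build_prompt and the sample passages are made up for illustration.

```python
import re


def score(question: str, passage: str) -> int:
    """Count distinct words shared by the question and the passage."""
    q_words = set(re.findall(r"\w+", question.lower()))
    p_words = set(re.findall(r"\w+", passage.lower()))
    return len(q_words & p_words)


def build_prompt(question: str, passages: list[str], k: int = 1) -> str:
    """Pick the k best-matching passages and place them ahead of the question."""
    top = sorted(passages, key=lambda p: score(question, p), reverse=True)[:k]
    context = "\n".join(top)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical knowledge base; a real system would store embedded chunks instead.
passages = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is open Monday to Friday.",
    "Shipping takes 3 to 5 business days.",
]
prompt = build_prompt("How many days do refunds take?", passages, k=1)
print(prompt)
```

A production system swaps the word-overlap score for embedding similarity, but the flow is the same: retrieve first, then hand the retrieved text to the model along with the question.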
The architecture has three main components:
text-embedding-3-small: Convert text to vector embeddings
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
# Load documents
loader = DirectoryLoader('./docs', glob='**/*.pdf', loader_cls=PyPDFLoader)
documents = loader.load()
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = splitter.split_documents(documents)
# Embed and store
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vectorstore = PineconeVectorStore.from_documents(
    chunks,
    embeddings,
    index_name='your-index-name'
)
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
PROMPT = PromptTemplate(
    template="""You are a helpful assistant. Use only the following context to answer the question.
If you don't know the answer from the context, say "I don't have that information."
Context: {context}
Question: {question}
Answer:""",
    input_variables=['context', 'question']
)
llm = ChatOpenAI(model='gpt-4o', temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=retriever,
    chain_type_kwargs={'prompt': PROMPT},
    return_source_documents=True
)
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class QueryRequest(BaseModel):
    question: str
    session_id: str  # accepted but not used by this handler

@app.post('/chat')
async def chat(request: QueryRequest):
    # Use the async entry point so the call does not block the event loop
    result = await chain.ainvoke({'query': request.question})
    return {
        'answer': result['result'],
        'sources': [doc.metadata for doc in result['source_documents']]
    }
Chunking strategy matters more than the LLM. Poor chunking (too small, no overlap, breaking mid-sentence) will produce terrible retrieval results regardless of which LLM you use. Test your chunk sizes with real user queries before going to production.
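To see what chunk size and overlap actually do, here is a minimal pure-Python sketch of a fixed-size sliding-window splitter. It is an illustration only and not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences). The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(repr(piece))
```

Each chunk repeats the last 4 characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk. Running your real queries against a few size and overlap settings is the quickest way to find values that suit your documents.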
Always return source citations. For business use cases, returning the source documents alongside the LLM's answer builds user trust and allows verification. The code above returns source_documents for this reason.
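Raw metadata dictionaries are not very readable for end users, so you may want to turn them into short citations first. The sketch below assumes each metadata dict has a 'source' key and an optional zero-based integer 'page' key, which is typical for PDF loaders but worth checking against your own data. The helper name format_citations is hypothetical.

```python
def format_citations(metadatas: list[dict]) -> list[str]:
    """Turn chunk metadata into de-duplicated, human-readable citations."""
    seen = set()
    citations = []
    for meta in metadatas:
        source = meta.get("source", "unknown source")
        page = meta.get("page")  # assumed zero-based when present
        label = source if page is None else f"{source} (page {page + 1})"
        if label not in seen:  # several chunks often come from the same page
            seen.add(label)
            citations.append(label)
    return citations


# Hypothetical metadata, shaped like the 'sources' list in the API response.
example = [
    {"source": "docs/handbook.pdf", "page": 0},
    {"source": "docs/handbook.pdf", "page": 0},
    {"source": "docs/pricing.pdf", "page": 4},
    {"source": "notes.txt"},
]
citations = format_citations(example)
print(citations)
```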
Implement retrieval quality evaluation. Before launching, manually test 20–30 representative user queries and score the retrieval quality. If the right document chunks aren't being retrieved, the LLM cannot give a correct answer regardless of its quality.
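One simple way to score those test queries is a hit rate: the share of questions for which the expected source shows up in the top k results. The sketch below is one possible metric, not a standard prescribed by any library. The function name hit_rate_at_k, the stub retriever, and the sample questions are all assumptions made for illustration.

```python
from typing import Callable


def hit_rate_at_k(
    test_cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected source is among the top-k results."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_source in test_cases:
        retrieved_sources = retrieve(question)[:k]
        if expected_source in retrieved_sources:
            hits += 1
    return hits / len(test_cases)


# Stub retriever standing in for a real vector store lookup.
fake_index = {
    "What is the refund window?": ["refunds.pdf", "faq.pdf"],
    "How do I reset my password?": ["faq.pdf", "onboarding.pdf"],
    "What are the support hours?": ["pricing.pdf"],
}


def stub_retrieve(question: str) -> list[str]:
    return fake_index.get(question, [])


cases = [
    ("What is the refund window?", "refunds.pdf"),
    ("How do I reset my password?", "faq.pdf"),
    ("What are the support hours?", "support.pdf"),
]
rate = hit_rate_at_k(cases, stub_retrieve, k=5)
print(f"hit rate: {rate:.2f}")
```

In practice, retrieve could wrap the retriever built earlier and return the source of each retrieved chunk. That wiring is left out here so the sketch runs without API keys.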
Cost estimation: At OpenAI's pricing at the time of writing, a 1,000-query/day RAG chatbot using GPT-4o and text-embedding-3-small costs approximately $15–40/day, depending on document chunk sizes and response lengths.
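Because prices and token counts change, it helps to redo this arithmetic with your own numbers. The sketch below covers only the chat model's input and output tokens; it ignores the much smaller cost of embedding each query. The token counts and per-million-token prices in the example are placeholders, not actual OpenAI prices, and the function name estimate_daily_cost is hypothetical.

```python
def estimate_daily_cost(
    queries_per_day: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate daily LLM spend in dollars from token counts and prices."""
    input_cost = queries_per_day * input_tokens_per_query * input_price_per_million / 1_000_000
    output_cost = queries_per_day * output_tokens_per_query * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Placeholder figures: substitute your measured token counts and current prices.
daily = estimate_daily_cost(
    queries_per_day=1000,
    input_tokens_per_query=2000,   # prompt + retrieved chunks (assumed)
    output_tokens_per_query=500,   # answer length (assumed)
    input_price_per_million=5.0,   # placeholder price, USD
    output_price_per_million=15.0, # placeholder price, USD
)
print(f"${daily:.2f} per day")
```

Input tokens scale with k and chunk size, since every retrieved chunk is sent to the model, so retrieving fewer or smaller chunks is usually the first lever for reducing cost.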
Need help building a production-grade RAG system for your business? Hire DelhiStack's AI developers — we've built RAG systems processing 1,000+ documents daily with 99.1% accuracy.