Building a Local Retrieval Augmented Generation (RAG) Question and Answer Platform
Image generated by Adobe Firefly.


Overview of Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) AI systems are an increasingly common and important component of a generative AI pipeline. RAG systems combine two powerful AI techniques: retrieving specific relevant information from your own documents, and generating human-like responses. While large language models (LLMs) are great for most general questions, they do not have the specific local knowledge found in your proprietary documents.

Some might assume that to add this domain-specific knowledge, you need to retrain the entire LLM. However, that is neither practical nor cost-effective. Instead, you can pre-process your documents by converting them into numerical representations (called embeddings) and store them in a special database known as a vector database. Then, whenever a question is asked, a search is first run on the vector database to pull out relevant local knowledge, before passing it to the LLM to generate a suitable response.

RAG Q&A Pipelines

We can visualise this process as two pipelines:

1. Knowledge Ingestion Pipeline

The ingestion pipeline reads the relevant documents, breaks each document into text chunks, converts each chunk into a vector (you can imagine it as a long list of decimal numbers), and stores the vectors together with their original text chunks in a vector database (see the sketch after the list below).

  • Document Processing: Upload your documents (text, Markdown, or PDFs).
  • Chunking: Break the documents into smaller sections or “chunks.”
  • Embeddings Creation: Convert each chunk into a list of numbers (embeddings) that capture its meaning.
  • Storage: Save these embeddings along with their corresponding text in a vector database.
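
As a rough illustration, the sketch below walks through these four steps with LangChain. The file name, chunk sizes, and model are illustrative, and an OpenAI API key is assumed to be set in the environment.

# Illustrative sketch of the ingestion steps; file name and chunk sizes are examples.
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("zero_trust.md").read()                 # 1. document processing

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(text)                  # 2. chunking

embedder = OpenAIEmbeddings(model="text-embedding-3-small")
vectors = embedder.embed_documents(chunks)          # 3. embeddings creation

# 4. storage: each chunk is now a long list of decimal numbers that can be
# written to a vector database together with its original text.
print(len(chunks), len(vectors[0]))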

2. Question and Answer Pipeline

Whenever a question is asked, the question is itself converted into embeddings and a query is run on the vector database to pull out the most similar embeddings. Their corresponding original text chunks are then added as context to a prompt, which is sent to an LLM to generate the reply (see the sketch after the list below).

  • Question Processing: When a user asks a question, convert it into embeddings using the same process.
  • Similarity Search: Run a search in the vector database to find text chunks that closely match the question's embeddings. This is done using a “similarity search” (typically based on a distance or similarity metric, such as cosine similarity) that identifies the most relevant content.
  • Generating the Answer: Combine the retrieved context with the question as a prompt and let the LLM generate a detailed answer. The LLM uses both its broad training and the specific information (context) you provided to generate a more accurate response.
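
Leaving aside the LangChain chain helpers used later, the flow can be sketched roughly as below; vector_store is assumed to be the store populated by the ingestion pipeline and llm a chat model such as ChatOpenAI.

# Conceptual sketch of the Q&A flow (vector_store and llm are assumed to exist).
question = "What are the success factors for implementing zero trust?"

# Similarity search: embed the question and retrieve the closest chunks.
docs = vector_store.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in docs)

# Generating the answer: combine the retrieved context with the question.
prompt = f"Use the following context to answer the question.\n\nContext: {context}\n\nQuestion: {question}"
answer = llm.invoke(prompt).content
print(answer)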

Building the RAG

To build this system, I used OpenAI's o1-preview to generate a backend microservice that handles document uploads, converts them into embeddings, and stores them in a vector database.

For the frontend, I built a user-friendly interface with Streamlit. The UI allows the user to easily upload text, Markdown, or PDF files and then ask questions.

Frontend UI

First, the user is prompted to upload text, Markdown or PDF files. In this example, I uploaded a PDF file that discussed Zero Trust strategies.

Uploading a file in the Streamlit UI.

After the files are uploaded and processed into embeddings, questions can be asked and replies generated. For the following example, I asked about success factors for implementing zero trust, and the LLM gave a response.

Asking a question based on uploaded content and getting a response.

Because the RAG system is not tied to any particular domain, I can simply rerun the UI, upload a different set of documents from another domain, and ask questions about them.
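
A bare-bones version of such a UI might look like the sketch below; the backend URL and JSON field names are illustrative and assume the /upload and /query endpoints described in the next section.

# Minimal Streamlit sketch; backend URL and field names are illustrative.
import requests
import streamlit as st

st.title("Local RAG Q&A")

uploaded = st.file_uploader("Upload a document", type=["txt", "md", "pdf"])
if uploaded is not None:
    requests.post("http://localhost:8000/upload",
                  files={"file": (uploaded.name, uploaded.getvalue())})
    st.success("Document uploaded and processed.")

question = st.text_input("Ask a question about the uploaded documents")
if question:
    resp = requests.post("http://localhost:8000/query", json={"question": question})
    st.write(resp.json().get("answer", ""))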

Backend Microservice

(Note: This part gets a little technical!)

The backend has 2 main endpoints: /upload and /query.

/upload allows a file to be uploaded and split into chunks using LangChain's RecursiveCharacterTextSplitter. The chunks are then sent to a backend Weaviate vector database, which converts each chunk into embeddings using OpenAI's text-embedding-3-small embedding model. (Update 18-Feb-2025: I have added support for Redis as well.)
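
A minimal sketch of what the /upload endpoint might look like is shown below, assuming a FastAPI backend (the framework, and the app, client, and embeddings objects, are assumptions here); the handler only covers plain-text or Markdown uploads, and a PDF loader would be needed for PDFs.

# Hypothetical sketch of the /upload endpoint; `app`, `client` (Weaviate) and
# `embeddings` (OpenAIEmbeddings) are assumed to be initialised elsewhere.
from fastapi import UploadFile
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_weaviate.vectorstores import WeaviateVectorStore

@app.post("/upload")
async def upload(file: UploadFile):
    text = (await file.read()).decode("utf-8")

    # Split the document into overlapping chunks.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.create_documents([text])

    # Embed the chunks and store them in the Weaviate vector database.
    WeaviateVectorStore.from_documents(chunks, embeddings, client=client)
    return {"status": "ok", "chunks": len(chunks)}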

A retriever is then created that will allow us to interact with the Weaviate database.

# Relevant LangChain v0.3 imports.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_weaviate.vectorstores import WeaviateVectorStore

# Define the prompt template
qa_prompt_template = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible.

Summary: {context}
Question: {input}
Answer:
"""

qa_chain_prompt = ChatPromptTemplate.from_template(qa_prompt_template)


# Store embeddings into the vector database.
global chain
weaviate_store = WeaviateVectorStore.from_documents(
  chunks,
  embeddings,
  client=client
)

# Expose the vector store as a retriever for similarity search.
retriever = weaviate_store.as_retriever()

# Chain that "stuffs" the retrieved chunks into the prompt for the LLM.
combine_docs_chain = create_stuff_documents_chain(
  llm,
  qa_chain_prompt
)

# Link the retriever to the Q&A chain so retrieval happens on every query.
chain = create_retrieval_chain(
  retriever,
  combine_docs_chain
)

I updated the generated code to use LangChain v0.3, as it simplifies the task of building a RAG. First, LangChain's create_stuff_documents_chain is used to create a new Q&A chain with OpenAI's GPT. A retrieval chain is then created using create_retrieval_chain to link the vector database to the LLM.

Once the chains are created, the /query endpoint will invoke the chain with the question and return the response.

  response = chain.invoke({"input": request.question})
  answer = response['answer'].strip()        
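
Put together, a minimal /query endpoint might look like the sketch below, again assuming a FastAPI backend; the request model name is illustrative, and chain is the retrieval chain created above.

# Hypothetical sketch of the /query endpoint; assumes a FastAPI `app` and the
# `chain` created above.
from pydantic import BaseModel

class QueryRequest(BaseModel):
    question: str

@app.post("/query")
def query(request: QueryRequest):
    # Invoke the retrieval chain: retrieve relevant chunks, then generate an answer.
    response = chain.invoke({"input": request.question})
    return {"answer": response["answer"].strip()}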

The GitHub repo can be found here.

Concluding Thoughts

While Retrieval Augmented Generation systems offer powerful capabilities, they are not immune to generating inaccuracies or “hallucinations.” To deploy these systems in production, additional safeguards will need to be implemented. Although this is still an area of ongoing research, techniques such as Noun-Phrase Dominance and Collision elimination, improved prompt engineering, and fact-checking mechanisms (for example, validating the Q&A against a known set of questions and answers) can be considered for your use case.

The use of LLMs (both for creating embeddings and for generating the final response) also requires data to be sent across the Internet if these models are hosted externally. This will likely remain a security concern for some organisations ingesting proprietary information, unless the embedding and generative models are hosted locally or the external LLMs are maintained by a trusted third party or vendor.

The process of ingestion is likely a challenge too, given the amount of data involved. As a result, instead of building their own RAGs, some organisations have used embedded AIs like Microsoft Copilot, which is deeply integrated into enterprise products like Office, Outlook, and Teams. It leverages generative AI, context-aware data retrieval, and natural language processing to automatically ingest documents within the corporate environment and use them to enhance existing workflows. It also has a chat assistant that allows users to query corporate data, similar to the RAG Q&A described here.

