A naive introduction to Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an advanced technique in natural language processing (NLP) that combines retrieval-based methods with generative language models. The simple goal of RAG is to enhance the performance of language models by integrating external knowledge during the generation process. This allows the model to provide more accurate and up-to-date responses by retrieving relevant information from a vast collection of documents.

Motivation: Why RAG?

Large language models, like GPT-3, are trained on extensive datasets but are limited to the knowledge available up to their training cut-off. This means they may lack information about recent events or emerging topics. Updating these models via fine-tuning is costly and time-consuming (even more so if the information source is rapidly changing and you need to fine-tune on a regular basis).

RAG addresses this limitation by enabling real-time access to external data sources, ensuring responses are both relevant and current. This technique is especially valuable for applications requiring up-to-date information, such as answering questions about recent events, trends, or anything else the model has not seen during its training process.

Retrieval Augmented Generation pipeline (image borrowed from Dataiku)


Components of RAG

Let's talk a bit about the various components of RAG and how it operates as a streamlined pipeline:

1. Chunking and Vectorizing

Given that you have already acquired the new information source (through web scraping or any other method), the first step in RAG generally involves breaking down (chunking) documents into smaller, manageable pieces of text, such as sentences or paragraphs. Each chunk is then converted into a numerical representation known as an embedding. These embeddings capture the semantic meaning of the text, allowing for efficient comparison and retrieval. There are several embedding techniques, but Sentence-BERT, a specialized model, is often used to generate these embeddings, ensuring they are contextually rich and meaningful.
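
To make this concrete, here is a minimal sketch of chunking and embedding with the sentence-transformers library (a common Sentence-BERT implementation). The model name, chunk size, and file name are illustrative choices only, not the ones used in the demo later in this article.

# Minimal chunking + embedding sketch using sentence-transformers (Sentence-BERT).
# Model name, chunk size and file path are illustrative only.
from sentence_transformers import SentenceTransformer

def chunk_document(text, max_chars=1000):
    # Naively split a document into paragraph-sized chunks of at most max_chars characters.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

model = SentenceTransformer("all-MiniLM-L6-v2")   # a compact Sentence-BERT-style model
chunks = chunk_document(open("document.txt").read())
embeddings = model.encode(chunks)                 # one dense vector per chunk
print(embeddings.shape)                           # (number_of_chunks, 384) for this model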

Chunking & vectorizing: converting the text corpus into embeddings (image from Dataiku)


2. Retrieval

Once the documents are vectorized, they are stored in a vector database. When a query is posed, it is also converted into an embedding. It is important to note that the query embedding and the document embeddings in the vector store must come from the same model, so that they live in the same embedding space.

The vector database then searches for the closest matches to the query embedding using simple cosine similarity. This retrieval step ensures that the most relevant pieces of information are identified based on their semantic similarity to the query.
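
In its simplest in-memory form, this retrieval step is just a cosine-similarity search over the stored vectors; a dedicated vector database does the same comparison at scale. A minimal sketch, assuming the embeddings are held in a NumPy array:

import numpy as np

def retrieve(query_embedding, doc_embeddings, k=4):
    # Cosine similarity between the query vector and every document vector.
    query = query_embedding / np.linalg.norm(query_embedding)
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = docs @ query
    # Indices of the k most similar documents, highest score first.
    return np.argsort(scores)[::-1][:k]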

Retrieval component (image from Dataiku)


3. Generation by LLM with Added Context

In the final step, the retrieved chunks of text (context) are combined with the original query to form a comprehensive prompt. This enriched prompt is fed into a generative language model, such as GPT-3.5, which uses the additional context to produce a well-informed and accurate response. By augmenting the generative process with relevant information, RAG significantly improves the quality and relevance of the generated text.
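
At its core, this step is plain string assembly before the LLM call. A minimal sketch (the prompt wording here is illustrative, not the exact prompt used in the demo below):

def build_prompt(question, retrieved_chunks):
    # Concatenate the retrieved chunks and prepend them to the user's question.
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )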


Why are we calling it Naive?

We refer to the described RAG pipeline as "naive" because we are using straightforward, basic methods for each component. While these methods are effective, there are several ways to optimize and enhance each part of the pipeline:

  • Chunking and Vectorizing: Advanced techniques like dynamic chunking, which adapts chunk sizes based on content complexity, and more sophisticated embedding models can improve accuracy.
  • Retrieval: Implementing more efficient and scalable retrieval algorithms, such as approximate nearest neighbors (ANN) or Hierarchical Navigable Small World (HNSW, super useful for quick search over a massive vector store), and leveraging larger, more diverse vector databases can enhance performance (see the HNSW sketch after this list).
  • Generation: Fine-tuning the language model on domain-specific data or integrating more advanced prompt engineering techniques can result in more precise and contextually accurate responses.
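
As an illustration of the retrieval point above, here is a minimal HNSW index built with the hnswlib library. This is not part of the demo that follows; the dimensions, parameters, and random vectors are placeholders.

import numpy as np
import hnswlib

dim = 1536                                              # e.g. OpenAI text-embedding-3-small
doc_embeddings = np.random.rand(10_000, dim).astype(np.float32)   # placeholder vectors

# Build an HNSW index over the document embeddings using cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(doc_embeddings), ef_construction=200, M=16)
index.add_items(doc_embeddings, np.arange(len(doc_embeddings)))
index.set_ef(50)                                        # recall vs. query-speed trade-off

query_embedding = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query_embedding, k=4)   # ids of the 4 nearest documents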


Building a RAG pipeline as an example

Building on this gentle introduction, we will try to build a simple RAG pipeline that will power a query assistant for the website of UC Davis Medical Center, which is part of UC Davis Health and is a major academic health center located in Sacramento, California. I had the pleasure of working on a practicum project with UC Davis Health during my postgraduate studies at UC Davis (2024), and I think it is a good place to start building a demo.

The idea is to use the knowledge from the medical center's public website as context for an LLM that resolves queries for patients visiting the site. The same approach can be replicated for any other website or knowledge source.

We will approach the pipeline component-wise, as introduced in the first part, with the relevant code snippets. There is also a zeroth component, which in our case is data scraping, i.e., collecting all the openly available textual data from the pages of the website. We collected textual data from about 2,300 different webpages (which we will call documents) under UC Davis Health, each stored as a row.

1. Chunking and Vectorizing

While a better chunking approach would split each document into smaller chunks, we take the naive approach and create a single embedding for each document. The data is cleaned and preprocessed before being fed into the tokenizer.

To tokenize, we are using the tiktoken library, a tokenizer used with various language models, particularly those developed by OpenAI. The p50k_base encoding is one of the specific tokenization schemes used by these models; it is a sub-word tokenization method.

import tiktoken
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

tokenizer = tiktoken.get_encoding('p50k_base')

def get_embedding(text, model='text-embedding-3-small', max_tokens=7000):
    # Truncate overly long documents so they fit within the embedding model's limit.
    tokens = tokenizer.encode(text)
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
        text = tokenizer.decode(tokens)

    return client.embeddings.create(input=[text], model=model).data[0].embedding

For generating embeddings, we are using OpenAI's "text-embedding-3-small" which has a default dimension size of 1536 and costs 1 dollar for 62,500 documents. Here we are choosing the maximum length of each document to be 7000 tokens.

The generated embeddings are now stored in a dataframe for the next step in our pipeline.
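
For reference, building that dataframe amounts to applying get_embedding to every scraped document. The file name and column names below ('text' for page content) are assumptions for illustration; adapt them to your own schema.

import pandas as pd

# One row per scraped webpage; the 'text' column holds the page's cleaned textual content.
data = pd.read_csv("ucdavis_health_pages.csv")          # illustrative file name

# One embedding per document (the naive, no-chunking approach described above).
data['embedding'] = data['text'].apply(get_embedding)
data.to_pickle("embeddings.pkl")                        # persist so embeddings aren't re-computed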

Embedding dataframe (screenshot)

2. Retrieval

For the retrieval process, we first pass the question into the same get_embedding function to get its vector from the same embedding space, then calculate the cosine similarity of the question vector with all the existing document vectors.

Next, we select only the 4 nearest documents, which act as the context from which the LLM generates the final answer.

import numpy as np

def query(question):
    # Embed the question with the same model so it lives in the same vector space.
    question_embedding = get_embedding(question)

    def fn(page_embedding):
        # OpenAI embeddings are unit-normalised, so the dot product equals cosine similarity.
        return np.dot(page_embedding, question_embedding)

    distance_series = data['embedding'].apply(fn)

    # Indices of the 4 most similar documents; their text is concatenated into the context below.
    top_four = distance_series.sort_values(ascending=False).index[0:4]

Computing cosine similarity against every document can be time-consuming when the corpus runs into millions of documents, and methods like HNSW become more efficient. But since we are dealing with ~2,300 documents, speed is not a problem.

3. Generation by LLM with Added Context

As a final step, the retrieved chunks of text, which become the {context}, are combined with the original query to form a comprehensive prompt.

    # Continuing inside query(): pass the retrieved context to the chat model.
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant tasked to respond to users of UC Davis Health who are seeking information about their services"},
            {"role": "user", "content": question},
            {"role": "assistant", "content": f"Use this information from the UC Davis Health website and answer the user's question: {context}. Please stick to this context while answering the question. Include all important information relevant to what the user is seeking, also tell them things they should be mindful of while following instructions. Don't miss any details about timings or weekdays."}
        ],
        model="gpt-3.5-turbo"
    )

    return chat_completion.choices[0].message.content, links, similarity_scores.tolist(), link_list
        

Here we are using GPT-3.5-turbo (fast & cheap) which uses this additional context to produce a well-informed and accurate response. Note that the prompts can be further modified to include more contact details or any other specific information.
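
Putting it together, the assistant is invoked with a single call, unpacking the values returned by the query function above:

answer, links, similarity_scores, link_list = query("What are the parking facilities like?")
print(answer)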

RAG in Action

Let's look at an example of how the query assistant performs when asked questions specific to the medical center.

Question: What are the parking facilities like?

The first box in the screenshot shows the response returned by the RAG model; the second box lists the 4 most semantically similar links to the question asked by the user.

To get a better idea of how similar these documents are to the question asked, we also printed out the cosine similarities for the top-k documents.


As we can see, the LLM returns a very specific answer with key information, which can be a faster way for any website to resolve queries that would usually require customers to navigate through multiple web pages.

For a working demo, please visit the HuggingFace space here. You can also find the GitHub repository here.



