A naive introduction to Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an advanced technique in natural language processing (NLP) that combines retrieval-based methods with generative language models. The simple goal of RAG is to enhance the performance of language models by integrating external knowledge during the generation process. This allows the model to provide more accurate and up-to-date responses by retrieving relevant information from a vast collection of documents.
Motivation: Why RAG?
Large language models, like GPT-3, are trained on extensive datasets but are limited to the knowledge available up to their training cut-off. This means they may lack information about recent events or emerging topics. Updating these models via fine-tuning is costly and time-consuming (even more so if the information source is rapidly changing and you need to fine-tune on a regular basis).
RAG addresses this limitation by enabling real-time access to external data sources, ensuring responses are both relevant and current. This technique is especially valuable for applications requiring up-to-date information, such as answering questions about recent events or trends, or about anything else the model did not see during training.
Components of RAG
Let's talk a bit about the various components of RAG and how it operates through a streamlined pipeline.
1. Chunking and Vectorizing
Given that you have already acquired the new information source (through web scraping or any other method), the first step in RAG generally involves breaking down (chunking) documents into smaller, manageable pieces of text, such as sentences or paragraphs. Each chunk is then converted into a numerical representation known as an embedding. These embeddings capture the semantic meaning of the text, allowing for efficient comparison and retrieval. There are several embedding techniques, but Sentence-BERT, a specialized model, is often used to generate these embeddings, ensuring they are contextually rich and meaningful.
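As an illustration, here is a minimal sketch of this step using the sentence-transformers library; the model name and the paragraph-based chunking rule are just example choices, not the only options.

from sentence_transformers import SentenceTransformer

# Naive chunking: split each document into paragraphs
documents = ["...full text of document 1...", "...full text of document 2..."]
chunks = [p for doc in documents for p in doc.split("\n\n") if p.strip()]

# Embed every chunk with a Sentence-BERT style model
model = SentenceTransformer('all-MiniLM-L6-v2')
chunk_embeddings = model.encode(chunks)  # one vector per chunk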
2. Retrieval
Once the documents are vectorized, they are stored in a vector database. When a query is posed, it is also converted into an embedding. It is important to note here that the query embedding and the stored document embeddings must come from the same model, so that they live in the same embedding space.
The vector database then searches for the closest matches to the query embedding using simple cosine similarity. This retrieval step ensures that the most relevant pieces of information are identified based on their semantic similarity to the query.
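Continuing the sketch above, a brute-force retrieval step could look like this; in practice the embeddings would live in a dedicated vector database rather than an in-memory array, and the example query is only illustrative.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embed the query with the same model used for the chunks
query_text = "What services does the clinic offer?"
query_embedding = model.encode([query_text])[0]

# Rank chunks by similarity and keep the indices of the best matches
scores = np.array([cosine_similarity(query_embedding, e) for e in chunk_embeddings])
top_indices = np.argsort(scores)[::-1][:4]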
3. Generation by LLM with Added Context
In the final step, the retrieved chunks of text (context) are combined with the original query to form a comprehensive prompt. This enriched prompt is fed into a generative language model, such as GPT-3.5, which uses the additional context to produce a well-informed and accurate response. By augmenting the generative process with relevant information, RAG significantly improves the quality and relevance of the generated text.
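To round off the sketch, the retrieved chunks can be stitched into a single prompt along with the query; the template wording here is only illustrative.

# Combine the top chunks into a context block and build the final prompt
context = "\n\n".join(chunks[i] for i in top_indices)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query_text}"
)
# This prompt is then passed to a generative model such as GPT-3.5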
Why are we calling it Naive?
We refer to the described RAG pipeline as "naive" because we are using straightforward, basic methods for each component. While these methods are effective, each part of the pipeline can be optimized and enhanced further, for example with smarter chunking, approximate nearest-neighbour search instead of a brute-force similarity scan, and more carefully engineered prompts.
Building a RAG pipeline as an example
Building on this gentle introduction, we will try to build a simple RAG pipeline that will power a query assistant for the website of UC Davis Medical Center, which is part of UC Davis Health and a major academic health center located in Sacramento, California. I had the pleasure of working on a practicum project with UC Davis Health during my graduate studies at UC Davis (2024), and I think it is a good place to start building a demo.
The idea is to use the knowledge from the medical center's open website as context for an LLM that serves patients visiting the website by resolving their queries. This idea can be replicated for any other website or knowledge source.
We will approach the pipeline component-wise, as introduced in the first part, with the relevant code snippets. There is also a zeroth component, which in our case is data scraping, i.e., collecting all the openly available textual data from the pages of the website. We collected textual data from about 2,300 different webpages (which we will call documents) under UC Davis Health, each stored as a row.
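As a rough sketch of that zeroth component, each page could be fetched and stripped down to its visible text roughly as follows; page_urls is a hypothetical list of UC Davis Health page URLs gathered beforehand.

import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    # Fetch the page and keep only its visible text
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text(separator=' ', strip=True)

# page_urls: hypothetical list of ~2,300 page URLs collected beforehand
data = pd.DataFrame(
    [{'url': url, 'text': scrape_page(url)} for url in page_urls]
)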
1. Chunking and Vectorizing
While a better chunking approach would split each document into smaller chunks, we will take the naive approach and create a single embedding for each document. The data is cleaned and preprocessed before being fed into the tokenizer.
To tokenize, we are using the tiktoken library, a tokenizer used by various language models, particularly those developed by OpenAI. The p50k_base encoding is one of the specific tokenization schemes used by these models; it is a sub-word tokenization method.
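Before wiring it into the embedding helper below, here is a quick illustration of what the tokenizer does (the sample sentence is arbitrary):

import tiktoken

tokenizer = tiktoken.get_encoding('p50k_base')
tokens = tokenizer.encode("UC Davis Medical Center is located in Sacramento.")
print(len(tokens))               # number of sub-word tokens in the sentence
print(tokenizer.decode(tokens))  # decoding recovers the original string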
from openai import OpenAI
import tiktoken

client = OpenAI()
tokenizer = tiktoken.get_encoding('p50k_base')

def get_embedding(text, model='text-embedding-3-small', max_tokens=7000):
    # Truncate overly long documents to max_tokens before embedding
    tokens = tokenizer.encode(text)
    if len(tokens) > max_tokens:
        tokens = tokens[:max_tokens]
        text = tokenizer.decode(tokens)
    return client.embeddings.create(input=[text], model=model).data[0].embedding
For generating embeddings, we are using OpenAI's text-embedding-3-small, which has a default dimension of 1536 and costs roughly 1 dollar per 62,500 documents. Here we cap the maximum length of each document at 7,000 tokens.
The generated embeddings are now stored in a dataframe for the next step in our pipeline.
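For instance, populating the embedding column is a one-liner; the 'text' column name is an assumption carried over from the scraping sketch above.

# Embed every document and store the vector alongside its text
data['embedding'] = data['text'].apply(get_embedding)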
2. Retrieval
For the retrieval process, we first pass the question into the same get_embedding function to get its vector from the same embedding space, and then calculate the cosine similarity of the question vector with all the existing document vectors.
Next, we select only the 4 nearest documents, which act as the context from which the LLM generates the final answer.
import numpy as np

def query(question):
    # Embed the question with the same model used for the documents
    question_embedding = get_embedding(question)
    # OpenAI embeddings are unit-normalized, so a dot product equals cosine similarity
    def fn(page_embedding):
        return np.dot(page_embedding, question_embedding)
    distance_series = data['embedding'].apply(fn)
    # Keep the indices of the four most similar documents
    top_four = distance_series.sort_values(ascending=False).index[0:4]
Computing cosine similarity against every document can be time-consuming when the corpus runs into millions of documents, and approximate methods like HNSW become more efficient at that scale. But since we are only dealing with ~2,300 documents, speed won't be a problem.
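At larger scale, an approximate-nearest-neighbour index could replace the brute-force scan; here is a minimal sketch using the hnswlib library (the index parameters are typical defaults, not tuned values).

import hnswlib
import numpy as np

# Build an HNSW index over the document embeddings (cosine space)
dim = 1536  # dimensionality of text-embedding-3-small vectors
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=len(data), ef_construction=200, M=16)
index.add_items(np.vstack(data['embedding'].values), ids=np.arange(len(data)))
index.set_ef(50)  # query-time recall/speed trade-off

# Approximate top-4 neighbours for a question embedding
question_embedding = np.array(get_embedding('What are the parking facilities like?'))
labels, distances = index.knn_query(question_embedding, k=4)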
3. Generation by LLM with Added Context
As a final step, the retrieved chunks of text, which become the {context}, are combined with the original query to form a comprehensive prompt.
    # Build the context from the four most similar documents
    # (assuming the dataframe has 'text' and 'url' columns from the scraping step)
    context = "\n\n".join(data.loc[top_four, 'text'])
    links = data.loc[top_four, 'url'].tolist()
    similarity_scores = distance_series.loc[top_four]

    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant tasked to respond to users of UC Davis Health who are seeking information about their services"},
            {"role": "user", "content": question},
            {"role": "assistant", "content": f"Use this information from the UC Davis Health website and answer the user's question: {context}. Please stick to this context while answering the question. Include all important information relevant to what the user is seeking, also tell them things they should be mindful of while following instructions. Don't miss any details about timings or weekdays."}
        ],
        model="gpt-3.5-turbo"
    )
    return chat_completion.choices[0].message.content, links, similarity_scores.tolist()
Here we are using GPT-3.5-turbo (fast & cheap) which uses this additional context to produce a well-informed and accurate response. Note that the prompts can be further modified to include more contact details or any other specific information.
RAG in Action
Let's look at an example of how the query assistant performs when asked questions specific to the medical center.
Question - What are the parking facilities like?
To get a better idea of how similar these documents are to the question asked, we also printed out the cosine-similarities for the top-k.
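A call to the query function sketched above might look like this (the output formatting is illustrative):

answer, links, scores = query('What are the parking facilities like?')
print(answer)
for link, score in zip(links, scores):
    print(round(score, 3), link)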
As we can see, the LLM returns a very specific answer with the key information, which can be a faster way for any website to resolve queries that would usually require customers to navigate through multiple web pages.
For a working demo, please visit the HuggingFace space here. You can also find the GitHub repository here.