Revolutionizing Searching with Semantic Search and LLM
Search engines have been an integral part of our lives for several decades. The simplest application applies various filters on input fields and returns the matching results. This is nothing but "keyword search", which looks for the presence of the input keyword(s) in the given text.
A very simple and common example is a user login page, which validates the user's input: if it finds an exact match, it allows the user to log in; otherwise it shows an error message.
Why Semantic Search?
However, what happens in the following example if we apply keyword search?
Renuka worked in three organizations over the last 10 years: the first 3 years in TCS, the next 4 years in CTS, and the rest in LTIMindtree. Now suppose we ask, "Please provide the job experience details of Renuka." Keyword search cannot answer this, because the query and the text share no matching words. Hence, we need a better search mechanism to resolve this problem.
Semantic search is the saviour! We will see below how semantic search solves the above problem.
A Few Important Concepts in Searching:
Query: What color is the sky?
Responses and the number of words in common with the query:
a. Today is Saturday - 1
b. The sky is blue - 3
c. The capital of India is Delhi - 2
d. Soil is brown - 1
e. Cow is domestic animal - 1
Here “The sky is blue” is chosen as it has the highest number of matching keywords.
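To make the counting concrete, here is a minimal sketch of this word-overlap scoring (illustrative only - this is not BM25):

# Minimal sketch of keyword-overlap scoring (illustrative, not BM25)
query = "What color is the sky?"
responses = ["Today is Saturday", "The sky is blue",
             "The capital of India is Delhi", "Soil is brown",
             "Cow is domestic animal"]

def overlap(query, text):
    # Count the distinct words shared by the query and a response (case-insensitive)
    query_words = set(query.lower().rstrip("?").split())
    return len(query_words & set(text.lower().split()))

for response in responses:
    print(overlap(query, response), "-", response)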
Several algorithms implement this keyword-match mechanism and produce a relevance score based on it. BM25 is a popular one, which we have applied in the following use case to get the desired output.
Word embeddings represent each word as a vector of numbers, so that words that are close in meaning are grouped near one another in vector space. For example, in a typical visualization of word embeddings, "man" and "woman" sit near each other, as do "king" and "queen", based on the values in their vectors.
The same idea extends to sentences and articles, giving sentence and article embeddings respectively. With sentence embeddings, similar sentences are grouped together: in the previous example, the question "What color is the sky?" and the answer "The sky is blue" will be near each other in the vector space. Hence, search finds the result because the question and its answer are placed closest together.
Embeddings are the root of semantic search.
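As a quick sketch of this idea (using the same model as the experiment below), we can compare the cosine similarity between the question and candidate answers:

# Sketch: sentence embeddings place a question and its answer close together
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
embeddings = model.encode(["What color is the sky?",
                           "The sky is blue",
                           "Today is Saturday"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))  # question vs. correct answer: high
print(util.cos_sim(embeddings[0], embeddings[2]))  # question vs. unrelated text: lower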
Rerank Methodology: After an initial retrieval step returns a list of candidate passages, a re-ranker (typically a cross-encoder) scores each query-passage pair jointly and reorders the candidates by relevance. This filters out noise that the faster retrieval step may bring in.
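A minimal sketch of re-ranking, assuming the same cross-encoder model used in the experiment below:

# Sketch: a cross-encoder scores each (query, passage) pair jointly
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([
    ("What color is the sky?", "The sky is blue"),
    ("What color is the sky?", "Today is Saturday"),
])
print(scores)  # higher score = more relevant; sort candidates by this score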
Experimenting with Semantic Search and LLM:
In the case study below, we will see how semantic search improves search performance over keyword search, and how the result is further refined with an LLM. This is essentially question answering over a given set of input paragraphs: we provide a query, and the system searches the paragraphs and returns the best response.
!pip install -U sentence-transformers rank_bm25
!pip install stop_words
!nvidia-smi
# Importing required libraries
import json
import time
import gzip
import os
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder, util

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")
# For semantic search, we use SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') to encode all passages.
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256
top_k = 100  # number of candidate passages to retrieve with the bi-encoder
Then we use a cross-encoder to re-rank the result list and improve its quality. We use a powerful CrossEncoder (cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')) that scores the query against each retrieved passage for relevance. The cross-encoder is necessary to filter out noise that the semantic search step might retrieve.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
Input passages for the experiment: We enter the following 6 passages about sustainability into the system. The input query will be searched against these passages to provide the relevant answer.
passages = ["Environmental sustainability is the ability to maintain an ecological balance in our planet's natural environment and conserve natural resources to support the wellbeing of current and future generations.",
"Climate action: Acting now to stop global warming. Life below water: Avoiding the use of plastic bags to keep the oceans clean. Life on land: Planting trees to help protect the environment. Responsible consumption and production: Recycling items such as paper, plastic, glass and aluminum.",
"Characteristics of sustainability or sustainable development are: Reduce emission of greenhouse gases, which will reduce global warming and help in preserving the environment. Use of natural and biodegradable materials for reducing the impact on the environment.",
"However, it refers to four distinct areas: human, social, economic and environmental – known as the four pillars of sustainability.",
"Environmental sustainability is important because of how much energy, food, and human-made resources we use every day. Rapid population growth has resulted in increased farming and manufacturing, leading to more greenhouse gas emissions, unsustainable energy use, and deforestation.",
"Sustainability maintains the health and biocapacity of the environment. Sustainability supports the well-being of individuals and communities. Sustainability promotes a better economy where there is little waste and pollution, fewer emissions, more jobs, and a better distribution of wealth."]
print("Passages:", len(passages))
Next, we encode all passages into our vector space.
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)
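As a quick sanity check, each passage becomes one dense vector (the 384-dimensional size below is an assumption based on the MiniLM-L6 architecture):

print(corpus_embeddings.shape)  # expected: torch.Size([6, 384])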
As discussed earlier, we first apply the BM25 algorithm, which performs the keyword search. The following code does this.
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from tqdm.autonotebook import tqdm
import numpy as np

# We lower-case our text and remove stop words before indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)
        if len(token) > 0 and token not in ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc
tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))
print("tokenized_corpus...", tokenized_corpus)
# First thing to do is create an instance of the BM25 class, which reads in a corpus of text and does some indexing on it.
bm25 = BM25Okapi(tokenized_corpus)
The tokenized_corpus is the list of words present in the above paragraphs after removing the stop words. The output looks like - [['environmental', 'sustainability', 'ability', 'maintain', 'ecological', 'balance', "planet's", 'natural', 'environment', 'conserve', 'natural', 'resources', 'support', 'wellbeing', 'current', 'future', 'generations'], ['climate', 'action............................................]]
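Before wiring BM25 into the full search function, we can score an illustrative query directly (the query text here is hypothetical):

# One BM25 score per passage; higher = better keyword match
example_scores = bm25.get_scores(bm25_tokenizer("areas of sustainability"))
print(example_scores)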
Now we apply both techniques - keyword search through the BM25 algorithm, and semantic search through the SentenceTransformer bi-encoder plus CrossEncoder re-ranking - to search the input text.
# query is the input given by the user, which is searched across the above
# 6 passages through both keyword and semantic search.
def search(query):
    print("Input question:", query)

    ####### Keyword Search through BM25 algorithm ########
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print("Top-5 lexical search (BM25) hits")
    for hit in bm25_hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    if torch.cuda.is_available():
        question_embedding = question_embedding.cuda()
    # Both question and corpus are encoded
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]

    # First, we use a Bi-Encoder to retrieve a list of result candidates,
    # then we use a Cross-Encoder on this list of candidates to pick out (or re-rank) the most relevant results.
    # This way, we benefit from the efficient retrieval of Bi-Encoders
    # and the high accuracy of the Cross-Encoder, so this works on large-scale datasets!

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder scores to the hits
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    #### Output of top-5 cases ######
    print("Top-5 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    print("Top-5 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))
Now we will ask a few questions through both keyword search and semantic search to understand how the answers differ and which technique is more effective.
Input Query 1
search(query = "What are the major areas of sustainability?")
Let us see the answer our search techniques provide:
Here both keyword and semantic search return the same first result, which is indeed the correct answer. They agree because the keyword "area" is present in both the query and the input text corpus.
Input Query 2: How Semantic Search Outperforms Keyword Search
However, what happens if the same keyword is not present in both the query and the text, but a synonym is? We will see this in the following query and its result.
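The exact query wording below is an assumption; per the discussion that follows, it replaces "area" with the synonym "domain":

# Hypothetical query using a synonym that does not appear in the passages
search(query="What are the major domains of sustainability?")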
The advantage of semantic search is clearly visible for this question and response. The keyword "domain" is not present in the input passages but appears in the question; only the similar word "area" is present in the given text. Keyword search's first response is inaccurate, whereas semantic search finds the correct one.
Input Query 3: The Need for LLM
Now we will look at another query and understand the effectiveness of an LLM here!
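This is the same question we later pass to the LLM QA chain:

search(query="What is the second domain of sustainability?")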
In the above query, we want to know the second domain out of the given four. Neither of the above techniques answers it directly. However, semantic search does return the correct paragraph containing the answer.
Passing the Result of Semantic Search to the LLM
We can pass the result of the semantic search to the LLM and see what it returns if we ask the same question. Hence, we store the paragraph "However, it refers to four distinct areas......." in a PDF and send it to the LLM, using the LangChain framework.
!pip install langchain
!pip install pypdf
!pip install openai
!pip install chromadb
!pip install tiktoken
!pip install docx2txt
import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
os.environ["OPENAI_API_KEY"] = "<Your API Key>"
### Putting the result of semantic search in the below pdf
pdf_loader = PyPDFLoader('/content/GFG.pdf')
documents = pdf_loader.load()
documents
This is the document the LLM refers to:
[Document(page_content='However it refers to four distinct areas human, social, economic and environmental known as the four pillars of sustainability', metadata={'source': '/content/GFG.pdf', 'page': 0})]
# We split the data into chunks of 1,000 characters, with an overlap
# of 200 characters between the chunks, which helps give better results
# and retain the context of the information between chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(documents)

# We create our vector DB, using the OpenAIEmbeddings transformer to create
# embeddings from our text chunks. We store all the DB information
# inside the ./data directory, so it doesn't clutter up our source files
vectordb = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
    persist_directory='./data'
)
vectordb.persist()
documents
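As an optional sanity check (the query string here is illustrative), we can query the vector store directly before wiring up the QA chain:

# Retrieve the most similar chunk straight from the vector store
docs = vectordb.similarity_search("four pillars of sustainability", k=1)
print(docs[0].page_content)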
Now we ask the same question from Input Query 3. We will see whether it answers!
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectordb.as_retriever(search_kwargs={'k': 7}),
    return_source_documents=True
)
# we can now execute queries against our Q&A chain
result = qa_chain({'query': 'What is the second domain of sustainability?'})
print(result['result'])
WOW!!!!! Please see the answer below!
WARNING:chromadb.segment.impl.vector.local_persistent_hnsw:Number of requested results 7 is greater than number of elements in index 1, updating n_results = 1
Social.
It is evident that the chain returns the exact answer, "social" - the second domain mentioned in the above passage.
So, we have seen how an LLM can refine the result of semantic search in practice!
Note: A few functions are adapted from the notebook "Retrieve & Re-Rank Demo over Simple Wikipedia".