Building a Text Summarization App with AI
I am excited to bring you an explainer article on building an AI-based text summarization app. In this edition, we'll explore the business and personal benefits of customized summarization apps, how an LLM can be used to build them, and how the LangChain and Streamlit Python frameworks can be used to develop and deploy the app.
Personal and Business Uses of Customized Text Summarization Apps
Customized text summarization, tailored to specific business and personal needs, offers a range of benefits.

In business settings, these include content curation for marketing and efficient report generation. In content curation, customized summarization gives businesses a powerful tool for staying informed, making better decisions, and managing the ever-growing volume of information in today's digital landscape. In industries that require extensive reports, it streamlines the process, allowing executives and decision-makers to quickly grasp essential details without reading through lengthy documents.

Customized text summarization apps also offer several personal benefits, such as enhanced learning and knowledge management. We can leverage custom summarization to quickly comprehend textual documents, focusing on key concepts and insights. These tools also help us manage and summarize personal notes, research findings, and educational materials, improving our overall knowledge organization.

In both business and personal contexts, custom text summarization contributes to increased efficiency, better decision-making, and improved information management. The ability to tailor the summarization process to specific needs enhances the overall utility and effectiveness of the app.
An LLM (large language model) is a powerful tool for building a summarization app. LLMs excel at a variety of natural language processing tasks, including summarization, translation, text generation, question answering, and conversation. These models operate on the context-window principle: when processing each word or token in a sequence, they take the surrounding words or tokens into account.
Techniques such as chunking are crucial for effectively utilizing LLMs for summarization, regardless of the text's size. Chunking enables efficient handling of lengthy texts and ensures adherence to the maximum token limits imposed by language models. By breaking the input into smaller segments, chunking allows the model to process each segment effectively, improving overall performance.
LLM frameworks such as LangChain are well suited to building a summarization app. If the token count of the text's data chunks fits within the context window of the LLM, a summary can be generated simply by feeding the data chunks as input to the LLM. In LangChain, this can be achieved using the load_summarize_chain function with the chain_type parameter set to "stuff".
The Python code for summarizing PDF text with the "stuff" chain type is shown below, with explanations.
=> Libraries to install
The langchain and openai libraries provide the functionality to use the LLMs offered by OpenAI.
pypdf provides functionality for working with PDF files.
tiktoken is an open-source tokenizer by OpenAI.
!pip -q install langchain
!pip -q install openai
!pip -q install pypdf
!pip -q install tiktoken
The RecursiveCharacterTextSplitter divides a long text into chunks based on a specified size and overlap, using a set of separator characters; the default set is ["\n\n", "\n", " ", ""].
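As a quick illustration (a minimal sketch with a made-up sample string, not part of the app itself), the splitter can be tried directly on raw text:
from langchain.text_splitter import RecursiveCharacterTextSplitter
sample_text = "First paragraph.\n\nSecond, longer paragraph with more detail.\n\nThird paragraph."
splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=10)
for chunk in splitter.split_text(sample_text):
    print(repr(chunk))  # each chunk is at most ~40 characters, split on paragraph breaks first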
The OPENAI_API_KEY is required when using OpenAI's models, such as GPT-3.5, for authentication and authorization purposes.
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
os.environ['OPENAI_API_KEY']='your_open_ai_key'
=> Loading the PDF file to summarize
file_path = '/content/drive/MyDrive/Colab Notebooks/doc1.pdf'
loaded_file = PyPDFLoader(file_path=file_path)
=> chunk_size and chunk_overlap
The chunk_size parameter sets a limit of 1,000 characters per chunk, which determines the number of chunks generated; experimenting with different values is advised to find the best fit for the use case. The chunk_overlap parameter specifies a maximum of 100 overlapping characters between consecutive chunks; this overlap preserves the underlying context between chunks.
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=100,
)
data_chunks = loaded_file.load_and_split(text_splitter=text_splitter)
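It can be helpful to inspect the result before summarizing (an optional check; the variable names follow the code above):
print(len(data_chunks))                   # number of chunks produced
print(data_chunks[0].page_content[:200])  # preview of the first chunk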
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
=> Defining the Large Language Model
The default model for the ChatOpenAI class is gpt-3.5-turbo.
llm = ChatOpenAI()
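Equivalently, the model and sampling temperature can be set explicitly (a hedged example, not required by the tutorial; temperature=0 simply makes the summaries more deterministic):
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)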
The chain_type will be 'stuff' for this method of summarizing.
This function will successfully summarize the loaded and chunked text if the total token count of data_chunks is less than or equal to the model's maximum context length, i.e., 4,097 tokens for gpt-3.5-turbo. A quick way to estimate this with tiktoken is shown after the code below.
chain = load_summarize_chain(
llm=llm,
chain_type='stuff'
)
chain.run(data_chunks)
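Before running the 'stuff' chain, we can estimate the total token count with tiktoken (a minimal sketch; it assumes the gpt-3.5-turbo encoding and the data_chunks variable defined above):
import tiktoken
encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
total_tokens = sum(len(encoding.encode(chunk.page_content)) for chunk in data_chunks)
print(total_tokens)  # 'stuff' only works when this stays within the 4,097-token limit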
On extensive text data, however, the token count of the text's data chunks usually does not fit within the context window of the LLM, so a summary cannot be generated by feeding the data chunks directly to the LLM.
If data_chunks contains more tokens than the model's maximum context length, i.e., 4,097, this function will fail with an error similar to the one shown below.
chain = load_summarize_chain(
llm=llm,
chain_type='stuff'
)
chain.run(data_chunks)
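Running the chain then fails with an OpenAI error similar to the following (the exact wording and token counts vary by library version and input):
InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in ... tokens. Please reduce the length of the messages.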
Map-Reduce
In the map-reduce method, the data is divided into several segments (chunks), each chunk is summarized individually (the "map" step), and the individual summaries are then consolidated in a final "combine" (reduce) step.
Refining
In this method, the summary starts with the initial chunk and undergoes gradual refinement with each subsequent chunk. The summary is continuously improved through incremental adjustments until all data chunks have been processed, yielding the final refined summary.
Python code for “Map-Reduce” and “Refining” Techniques
=> The same packages, libraries, and classes used in the stuff method are used to implement the map-reduce and refine methods of summarization; only the chain_type argument of the load_summarize_chain function changes, as shown below. Regardless of the token count of the data chunks, the map-reduce and refine methods will successfully generate a summary.
=> Map-Reduce
chain = load_summarize_chain(
llm=llm,
chain_type='map_reduce'
)
chain.run(data_chunks)
=> Refining
chain = load_summarize_chain(
llm=llm,
chain_type='refine'
)
chain.run(data_chunks)
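Tip: passing verbose=True to load_summarize_chain (e.g., load_summarize_chain(llm=llm, chain_type='refine', verbose=True)) prints the intermediate prompts and partial summaries, which is helpful for seeing how the map-reduce and refine steps progress.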
Deploying the Application
Streamlit is a popular Python library for creating web applications with minimal effort, and Streamlit Cloud provides a platform for deploying and sharing Python applications easily. If our code is in a GitHub repository, deployment requires only a few steps: a small app script and a requirements file, as sketched below.
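As an illustration, a minimal Streamlit front end for the summarizer might look like the following sketch (the file name app.py, the upload-to-temporary-file handling, and the choice of the map_reduce chain are assumptions, not a prescribed implementation):
# app.py - a minimal sketch of a Streamlit front end for the summarizer
import tempfile
import streamlit as st
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

st.title('Text Summarization App')
uploaded_file = st.file_uploader('Upload a PDF to summarize', type='pdf')

if uploaded_file is not None and st.button('Summarize'):
    # PyPDFLoader expects a file path, so write the upload to a temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
        tmp.write(uploaded_file.read())
        tmp_path = tmp.name

    loader = PyPDFLoader(file_path=tmp_path)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    data_chunks = loader.load_and_split(text_splitter=text_splitter)

    # map_reduce handles documents larger than the model's context window
    chain = load_summarize_chain(llm=ChatOpenAI(), chain_type='map_reduce')
    st.write(chain.run(data_chunks))
To deploy on Streamlit Cloud, the GitHub repository would also need a requirements.txt listing the dependencies (langchain, openai, pypdf, tiktoken, streamlit), with the OPENAI_API_KEY supplied through the app's secrets settings rather than hard-coded.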