Building a Text Summarization App with AI


I am excited to bring you an explainer on building an AI-based text summarization app. In this edition, we'll explore the business and personal benefits of customized summarization apps, how an LLM can be used to build them, and how the LangChain and Streamlit Python frameworks can be used to develop and deploy the app.

Personal and Business Uses of Customized Text Summarization App

Customized text summarization, tailored to specific business and personal needs, offers a range of benefits.

In business settings, these include content curation for marketing and efficient report generation. For content curation, customized summarization gives businesses a powerful tool for staying informed, making better decisions, and managing the ever-growing volume of information in today's digital landscape. It also streamlines reporting in industries that produce extensive documents, allowing executives and decision-makers to quickly grasp essential details without going through lengthy reports.

On the personal side, customized summarization apps enhance learning and knowledge management. We can use them to quickly comprehend textual documents and focus on key concepts and insights, and to manage and summarize personal notes, research findings, or educational materials, improving our overall knowledge organization.

In both contexts, custom text summarization contributes to increased efficiency, better decision-making, and improved information management. The ability to tailor the summarization process to specific needs enhances the app's overall utility and effectiveness.

An LLM (large language model) is a powerful tool for building a summarization app. LLMs excel at various natural language processing tasks, including summarization, translation, text generation, question answering, and conversation. These models operate within a fixed context window: the maximum number of surrounding tokens the model can consider at once when processing a sequence.

Natural language processing tasks such as summarization rely on chunking, a crucial technique for effectively utilizing LLMs regardless of the text's size. Chunking enables the efficient handling of lengthy texts and ensures adherence to the maximum token limits imposed by language models. By breaking the input into smaller segments, chunking allows the model to process each segment effectively, improving overall performance.

LLM frameworks such as LangChain are well suited to building a summarization app. If the token count of the text's data chunks fits within the context window of the LLM, a summary can be generated simply by feeding the data chunks as input to the LLM. In LangChain, this can be achieved using the load_summarize_chain function with the chain_type parameter set to "stuff".

Figure 1 => Summarizing when text data size does not exceed the context length of the LLM (“Stuff” chain type).



The Python code for summarizing PDF texts is shown below with explanations ("Stuff" chain type).

=> Libraries to install

The langchain and openai libraries provide the functionality to use LLMs from OpenAI.

The pypdf library provides functionality for working with PDF files.

The tiktoken library is an open-source tokenizer from OpenAI.


!pip -q install langchain 
!pip -q install openai
!pip -q install pypdf
!pip -q install tiktoken        

The RecursiveCharacterTextSplitter divides a long text into chunks based on a specified size and overlap value, using a set of separator characters; the default set is ["\n\n", "\n", " ", ""].
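As a quick standalone illustration of how the splitter behaves (the sample string and sizes here are arbitrary, chosen only for this demo):

from langchain.text_splitter import RecursiveCharacterTextSplitter

demo_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=10)
pieces = demo_splitter.split_text(
    'LLMs have a fixed context window.\n\nChunking lets us summarize long documents anyway.'
)
print(pieces)  # two strings: the splitter breaks on the blank line ("\n\n") first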

The OPENAI_API_KEY is required when using OpenAI models such as GPT-3.5; it serves authentication and authorization purposes.

from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os


os.environ['OPENAI_API_KEY']='your_open_ai_key'        
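To avoid hardcoding the key in a notebook, a common alternative (a small sketch, not part of the original code) is to prompt for it at runtime:

import getpass

# Prompt for the key instead of embedding it in the source.
os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')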

=> Loading the text file to summarize

file_path = '/content/drive/MyDrive/Colab Notebooks/doc1.pdf'
loaded_file = PyPDFLoader(file_path=file_path)        
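PyPDFLoader's load() method returns one Document per PDF page, which we can inspect before splitting; a quick illustrative check (not part of the original code):

pages = loaded_file.load()
print(len(pages))                   # number of pages in the PDF
print(pages[0].page_content[:200])  # first 200 characters of page 1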

=> chunk_size and chunk_overlap

The chunk_size parameter sets a limit of 1000 characters per chunk, determining the number of chunks generated. Experimenting with different values is advised to find the best fit for the use case. The chunk_overlap parameter specifies a maximum of 100 overlapping characters between consecutive chunks (matching the value in the code below); this overlap preserves the underlying context across chunk boundaries.

text_splitter = RecursiveCharacterTextSplitter(
   chunk_size=1000,
   chunk_overlap=100,
)

data_chunks = loaded_file.load_and_split(text_splitter=text_splitter)
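Before summarizing, it is worth verifying how the document was split; a quick illustrative check (not part of the original code) that also shows the overlap at work:

print(len(data_chunks))                   # number of chunks produced
print(data_chunks[0].page_content[-80:])  # tail of the first chunk...
print(data_chunks[1].page_content[:80])   # ...which should share text with the next chunk's head when both come from the same page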

from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI        

=> Defining the large language model

The default model for the ChatOpenAI class is gpt-3.5-turbo.

llm = ChatOpenAI()

The chain_type will be 'stuff' for this method of summarizing.

This chain will successfully summarize the loaded and chunked text if the total token count of data_chunks is at or below the model's maximum context length, i.e., 4097 tokens for gpt-3.5-turbo.

chain = load_summarize_chain(
   llm=llm,
   chain_type='stuff'
)
chain.run(data_chunks)
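To check in advance whether the chunks actually fit this limit, tiktoken can count the tokens; a minimal sketch (the comparison is approximate, since the chain's prompt template also consumes some tokens):

import tiktoken

encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
total_tokens = sum(len(encoding.encode(chunk.page_content)) for chunk in data_chunks)
print(total_tokens)  # the 'stuff' chain only succeeds while this stays under the 4097-token limit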
        

On extensive text data, however, the token count of the text's data chunks usually does not fit within the context window of the LLM. Hence, a summary cannot be generated simply by feeding the data chunks directly to the LLM.

Figure 2 => With the "Stuff" chain type, summarization fails when the text exceeds the LLM's context length.

If the total token count of data_chunks exceeds the model's maximum context length, i.e., 4097, this chain will raise an error similar to the one shown below.

chain = load_summarize_chain(
   llm=llm,
   chain_type='stuff'
)
chain.run(data_chunks)
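With the classic openai Python library assumed throughout this article, the raised error reads roughly as follows (the token count shown is illustrative and will vary with the document):

openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 15772 tokens. Please reduce the length of the messages.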
        
LangChain provides techniques for working with extensive text data that address the issue mentioned above, namely the map-reduce and refine techniques.

Map-Reduce

In the map-reduce method, the data is divided into several segments (chunks), each chunk is summarized individually, and the individual summaries are then consolidated in a final "combine" step.

Figure 3 =>  Map-reduce technique  

Refining 

In this method, the summary starts with the initial chunk and undergoes gradual refinement with each subsequent chunk. This strategy involves continuously enhancing the summary's quality by making incremental adjustments until all data chunks are processed to get the final refined summary. 

Figure 4 =>  Refining technique  

Python code for “Map-Reduce” and “Refining” Techniques  

=> The same packages, libraries, and classes used in the Stuff method are used to implement the map-reduce and refine methods of summarization; only the chain_type argument to the load_summarize_chain function differs, as shown below. Regardless of the token size of the data chunks, the map-reduce and refine methods will successfully generate a summary.

=> Map-Reduce  

chain = load_summarize_chain(
   llm=llm,
   chain_type='map_reduce'
)
chain.run(data_chunks)
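The map and combine steps can also be given custom prompts through the map_prompt and combine_prompt parameters; a hedged sketch, with illustrative prompt wording:

from langchain.prompts import PromptTemplate

map_prompt = PromptTemplate(
    input_variables=['text'],
    template='Write a concise summary of the following:\n\n{text}'
)
combine_prompt = PromptTemplate(
    input_variables=['text'],
    template='Combine these partial summaries into one final summary:\n\n{text}'
)

chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    map_prompt=map_prompt,
    combine_prompt=combine_prompt
)
chain.run(data_chunks)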
        

=> Refining

chain = load_summarize_chain(
   llm=llm,
   chain_type='refine'
)
chain.run(data_chunks)
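The refine chain can also return the intermediate summaries produced after each chunk, which is useful for seeing how the summary evolves; a sketch assuming the return_intermediate_steps flag:

chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    return_intermediate_steps=True
)
result = chain({'input_documents': data_chunks}, return_only_outputs=True)
print(result['output_text'])         # the final refined summary
print(result['intermediate_steps'])  # the summary after each refinement pass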
        

Deploying The Application

Streamlit is a popular Python library for creating web applications with minimal effort, and Streamlit Cloud provides a platform for easily deploying and sharing Python applications. If our code is in a GitHub repository, the steps and requirements below cover the deployment.

  1. Prepare our Python code:

  • Ensure our Python code is structured properly and includes the necessary dependencies.
  • Have the main application in app.py, because Streamlit requires an app.py as the entry point for the application, following a convention-over-configuration approach. While Streamlit expects the main application file to be named app.py by default, we can customize the filename if needed: when deploying to Streamlit Cloud or using Streamlit Sharing, we can specify the filename and directory where our Streamlit application code is located within our Git repository or project directory.
  • Create a requirements.txt file listing all the Python packages our application requires (a sample appears after this list).

  2. Sign up/log in to Streamlit Cloud.

  3. Create a new app:

  • In the Streamlit Cloud dashboard, click on "New app".
  • Choose the "GitHub" option to connect our app to a Git repository.
  • Select the corresponding Git repository from the list of available repositories.

  4. Configure our Streamlit app:

  • Specify the branch and directory where our app.py file is located within the Git repository.
  • Configure other settings such as the name, description, and, more importantly, the OPENAI_API_KEY in the app settings' "Secrets" section.

  5. Deploy our Streamlit app:

  • Once configured, click on the "Deploy" button to start deploying the Streamlit app to Streamlit Cloud.
  • Streamlit Cloud will automatically build and deploy the app based on the configuration settings provided.

  6. Access and share our Streamlit app:

  • Once the deployment completes, Streamlit Cloud provides a public URL we can use to access and share the app.
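For reference, here is a minimal sketch of the two deployment files. The requirements.txt simply lists the packages installed earlier (plus streamlit), while the app.py below is illustrative rather than the article's original code: the widget labels and the temporary-file handling for the uploaded PDF are assumptions, and it reuses the map-reduce chain from above so documents of any length can be summarized.

=> requirements.txt

langchain
openai
pypdf
tiktoken
streamlit

=> app.py

import os
import tempfile

import streamlit as st
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Read the API key from the "Secrets" section configured in step 4.
os.environ['OPENAI_API_KEY'] = st.secrets['OPENAI_API_KEY']

st.title('Text Summarization App')  # hypothetical title
uploaded = st.file_uploader('Upload a PDF to summarize', type='pdf')

if uploaded is not None and st.button('Summarize'):
    # PyPDFLoader expects a file path, so persist the upload to a temporary file.
    with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
        tmp.write(uploaded.read())
        tmp_path = tmp.name

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    data_chunks = PyPDFLoader(file_path=tmp_path).load_and_split(text_splitter=text_splitter)

    llm = ChatOpenAI()
    chain = load_summarize_chain(llm=llm, chain_type='map_reduce')
    st.write(chain.run(data_chunks))

    os.remove(tmp_path)  # clean up the temporary file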
