A Pipeline and a Prompt: Automating Document Processing with LLMs and Python
I’m writing this article to expand on my recent post about a Python script I wrote. It starts with why I built the thing in the first place, then gives a technical overview of the script, walking through its key functions and the reasoning behind its structure.
Why I Made the Thing
I hear vendors advertising their products, and a lot of the time the product comes down to a pipeline and a prompt. Most vendors are not offering either of the real sources of value involved in using LLMs: the model itself and the data you need.
The first source is the model itself. If a vendor is not providing a custom model that is either fine-tuned for a specific task or fully trained in a unique and useful way, then they are not providing this value. This is where organizations like OpenAI, Anthropic, Google, Meta, Harvey, and so on provide value. It costs an incredible amount of money and time to train a model from scratch, so if the vendor you are evaluating is not offering this as their value, then they need to be providing source number two: data.
Data is the new oil. There are really two sources of valuable data: public and private. If the data is public, the value a vendor provides is that they have accumulated it and can give you access to it in an easy and efficient way. Private data is data that is internal to firms and organizations. This is the value that organizations like vLex, Thomson Reuters, and LexisNexis provide, along with the data already inside your own firm or organization.
Unfortunately, most vendors don’t offer either of these sources of value. Most products popping up in the LLM space are simply a pipeline that takes the data you already have, sends it along with a customized prompt to an organization that owns an LLM, and returns the response to you. These models are generally available directly to consumers, so why not build that same pipeline internally and keep the flexibility to switch between models and change your prompts whenever you feel the need?
That’s the purpose of this script: to show how relatively easy it is to build something like this. Scripts of this nature can also be folded into larger programs as those programs grow. Something like this could easily be turned into a tool that an LLM calls at its discretion whenever it is asked to do a task that needs it!
I’m hoping that I’ll be able to put more time into this project and build a more robust version of the script.
Script Overview
The script is designed to ingest PDFs, extract text, classify documents, extract structured information, clean and format data, and rename files based on the extracted information. It uses the Ollama API to call “llama3.1:8b” for text classification and information extraction.
Key Components and Functions
First up are the libraries used, along with the folder path and model variables:
import os
import fitz # PyMuPDF
import ollama
import json
import pandas as pd
import re
FOLDER_PATH = "/Users/jp/Documents/Code/classify_rename/data"
client = ollama.Client()
MODEL = "llama3.1"
1. Document Ingestion
This function walks through a specified folder, identifying PDF files and returning their paths and names. It's the entry point for processing, setting up the pipeline for subsequent operations.
def document_ingestion(folder_path):
    """Ingest PDF documents from the specified folder."""
    documents = []
    file_name = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith(".pdf"):
                documents.append(os.path.join(root, file))
                file_name.append(file)
    return documents, file_name
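For context, here is how this might be called with the FOLDER_PATH defined above. This is a minimal usage sketch rather than part of the original script:

# Usage sketch: list every PDF found under FOLDER_PATH
documents, file_names = document_ingestion(FOLDER_PATH)
print(f"Found {len(documents)} PDF(s)")
for path, name in zip(documents, file_names):
    print(f"{name} -> {path}")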
2. Text Extraction
Using the PyMuPDF library (imported as fitz), this function extracts text from each page of a PDF. It also cleans the extracted text by replacing newlines with spaces and stripping leading and trailing whitespace, preparing it for analysis.
def extract_text_from_pdf(pdf_path):
    """Extract text from PDF files."""
    text = ""
    document = fitz.open(pdf_path)
    for page in document:
        text += page.get_text()
    # Clean text
    text = re.sub(r"\n", " ", text)  # Normalize whitespace
    text = text.strip()
    return text
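And a quick way to sanity-check the extraction on a single file (again, just a sketch; the preview length is arbitrary):

# Usage sketch: preview the cleaned text of the first PDF found
documents, _ = document_ingestion(FOLDER_PATH)
if documents:
    sample_text = extract_text_from_pdf(documents[0])
    print(sample_text[:500])  # first 500 characters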
3. Ollama API Interaction
This function sends a chat request to the Ollama API. It's designed to be flexible, accepting different system prompts for different tasks (classification, information extraction). The function uses specific parameters to control the model's output: temperature controls the "creativity" of the output, while a fixed seed keeps the results consistent and reproducible across runs.
def ollama_request(text, MODEL, system_prompt):
    """Send a chat request to the Ollama API."""
    response = client.chat(
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": (
                    f"Here is the document to analyze:\n\n"
                    f"<document>\n{text}\n</document>\n\n"
                    f"Begin processing the document text and output the resulting JSON object."
                ),
            },
            {
                # Prefill the assistant's turn so the model continues from the start
                # of a JSON code block; clean_responses() strips the fence back out
                "role": "assistant",
                "content": "```json\n",
            },
        ],
        model=MODEL,
        options={
            # Note: the Ollama Python client expects "format" as a top-level argument
            # to chat(), so inside "options" it is likely ignored; the prompt and the
            # prefill above are what actually steer the model toward JSON output
            "format": "json",
            "temperature": 0.2,
            "seed": 1357924680,
        },
    )
    return response
Along with this, responses are passed to a function that cleans them up and makes sure only a JSON object is returned:
def clean_responses(response):
    """Clean the responses from the Ollama API so only a JSON object remains."""
    content = response["message"]["content"]
    # Drop anything before the first curly brace (e.g. a leading ```json fence)
    first_curly_brace_index = content.find("{")
    if first_curly_brace_index != -1:
        content = content[first_curly_brace_index:]
    # Collapse extra whitespace; str.replace() doesn't understand regex patterns,
    # so use re.sub for the whitespace normalization
    content = re.sub(r"\n\n", "\n", content)
    content = re.sub(r"\s+", " ", content)
    content = content.replace("```", "")
    response["message"]["content"] = content
    return response
Prompt Information
4. Document Classification
The script uses a classification prompt to categorize documents into "Medical Record", "Medical Bill", "Legal Document", or "Other". The prompt is structured to guide the AI in making accurate classifications based on document content.
Here is the classification prompt:
You are an AI assistant specializing in classifying documents into categories. Provide only the category name and no extra text. Your response should be one of the following categories:
1. "Medical Record"
2. "Medical Bill"
3. "Legal Document"
4. "Other"
If the document does not fit any of the above categories, please select "Other".
Medical Records are documents with detailed patient health information.
Medical Bills are documents that contain billing information for healthcare services.
Legal Documents are documents related to legal matters, such as contracts or court orders.
Other documents are those that do not fit any of the above categories.
Please select the most appropriate category based on the content of the document.
Format the response as a JSON object with the key "category" and the value as the category name:
{"category": ""}.
5. Information Extraction
For medical bills, a structured prompt is used to extract specific information into a JSON format. This prompt is carefully designed to capture provider information, patient details, itemized charges, and subtotals.
You are an AI assistant specializing in structuring medical documents into a JSON object. Your task is to analyze the following medical document text and extract relevant information to create a structured JSON object.
Follow these instructions to create the JSON object:
1. Extract information from the medical document text and structure it into a JSON object with the following keys:
- provider_information
- patient_details
- items
- subtotals
2. The JSON object should have the following structure:
{
"provider_information": {
"facility": "",
"account_number": ""
},
"patient_details": {
"name": "",
"birthday": "",
"visit_number": "",
"visit_description": "",
"date_of_service": ""
},
"items": [
{
"date": "",
"description": "",
"amount": "",
"category": ""
}
],
"subtotals": {
"Charges": "",
"Insurance Payments & Adjustments": "",
"Patient Payments & Adjustments": ""
}
}
3. For the "items" array, create an object for each line item in the document, including the date, description, amount, and category.
4. Categorize amounts as follows:
- If an amount is positive, its category is "Charges"
- If an amount is negative, categorize it as either "Insurance Payments & Adjustments" or "Patient Payments & Adjustments" based on the description.
5. Format all numbers as floats (e.g., 100.00 instead of 100).
6. Convert all dates to the format "YYYY.MM.DD", even if they are originally in a different format.
7. Provide only the JSON object as your output, with no additional text or explanation.
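Putting the two prompts together, each document ends up as one record combining the classification and the extracted fields, and those records are loaded into a pandas DataFrame for the cleaning step described next. The sketch below shows one way to wire that up; EXTRACTION_PROMPT and CLASSIFICATION_PROMPT are my placeholder names, and the exact record layout is an assumption based on the columns clean_and_format_df expects:

# Sketch: run classification + extraction over every PDF and collect the results
records = []
documents, _ = document_ingestion(FOLDER_PATH)
for pdf_path in documents:
    text = extract_text_from_pdf(pdf_path)
    classification = json.loads(
        clean_responses(ollama_request(text, MODEL, CLASSIFICATION_PROMPT))["message"]["content"]
    )
    extraction = json.loads(
        clean_responses(ollama_request(text, MODEL, EXTRACTION_PROMPT))["message"]["content"]
    )
    extraction["classification"] = classification  # kept as a dict; clean_and_format_df unwraps it
    records.append(extraction)

df = pd.DataFrame(records)  # columns: provider_information, patient_details, items, subtotals, classification
df = clean_and_format_df(df)  # defined in the next section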
6. Data Cleaning and Formatting
This function is long, but its job is simple: it takes the extracted information and cleans it, ensuring consistent data types, formatting dates, and normalizing text fields. This is important for maintaining data quality and consistency across processed documents.
def clean_and_format_df(df):
    """Clean and format the structured billing information DataFrame."""
    # Define column mappings for different sections
    column_mappings = {
        "provider_information": {
            "healthcare_provider": "facility",
            "account_number": "account_number",
        },
        "patient_details": {
            "patient_name": "name",
            "patient_birthday": "birthday",
            "visit_number": "visit_number",
            "visit_description": "visit_description",
            "date_of_service": "date_of_service",
        },
        "subtotals": {
            "charges": "Charges",
            "insurance_payments_adjustments": "Insurance Payments & Adjustments",
            "patient_payments_adjustments": "Patient Payments & Adjustments",
        },
    }
    # Extract information using column mappings
    for source_col, mappings in column_mappings.items():
        if source_col in df.columns:
            for new_col, old_col in mappings.items():
                df[new_col] = df[source_col].apply(
                    lambda x: x.get(old_col, None) if isinstance(x, dict) else None
                )
            df = df.drop(source_col, axis=1)
        else:
            print(f"Warning: '{source_col}' column not found in the DataFrame.")
    # Convert numeric columns
    numeric_columns = [
        "charges",
        "insurance_payments_adjustments",
        "patient_payments_adjustments",
    ]
    for col in numeric_columns:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(float)
    # Process items column
    if "items" in df.columns:

        def process_item(item):
            if isinstance(item, dict):
                item["amount"] = float(item.get("amount", 0))
                item["date"] = convert_date(item.get("date", ""))
            return item

        df["items"] = df["items"].apply(
            lambda x: [process_item(item) for item in x] if isinstance(x, list) else x
        )
    # Convert date columns
    date_columns = ["date_of_service", "patient_birthday"]
    for col in date_columns:
        if col in df.columns:
            df[col] = df[col].apply(lambda x: convert_date(x) if x else None)
    # Process classification
    if "classification" in df.columns:
        df["classification"] = df["classification"].apply(
            lambda x: x.get("category", x) if isinstance(x, dict) else x
        )
    # Clean and normalize text columns
    text_columns = ["patient_name", "visit_description", "healthcare_provider"]
    for col in text_columns:
        if col in df.columns:
            df[col] = df[col].str.title()
    # Remove leading/trailing whitespace from all string columns
    df = df.map(lambda x: x.strip() if isinstance(x, str) else x)
    return df
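One dependency worth calling out: clean_and_format_df relies on a convert_date helper that isn't shown in this walkthrough. A minimal sketch of what such a helper could look like, assuming its only job is to normalize dates into the YYYY.MM.DD format the extraction prompt asks for:

from datetime import datetime

def convert_date(date_str):
    """Normalize a date string to YYYY.MM.DD; return None if it can't be parsed."""
    if not date_str:
        return None
    candidate_formats = ["%Y.%m.%d", "%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y", "%B %d, %Y"]
    for fmt in candidate_formats:
        try:
            return datetime.strptime(str(date_str).strip(), fmt).strftime("%Y.%m.%d")
        except ValueError:
            continue
    return None  # leave unparseable dates empty rather than guessing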
7. File Renaming
The final step in the process: this function loads the classification and extraction results saved earlier to a JSONL log, renames each processed file based on the extracted information (date of service, patient name, healthcare provider), and moves it into the appropriate category folder. There are a few fail-safes to make sure relevant information ends up in the file name, such as falling back to the date of the first line item when no date of service was extracted.
def rename_files(FOLDER_PATH):
    """Rename files based on the extracted billing information."""
    documents, file_name = document_ingestion(FOLDER_PATH)
    # Load the classified documents from the JSONL log file
    with open(
        "logs/classified_documents_llama.jsonl", mode="r", encoding="utf-8"
    ) as file:
        classified_documents = [json.loads(line) for line in file]
    # For each document, rename it based on the extracted billing information
    for document, classified_document in zip(documents, classified_documents):
        # Extract the classification from the classified document
        classification = classified_document["classification"]
        # Extract the patient last name from the classified document
        patient_name = classified_document["patient_name"].split()[-1]
        # Extract the date of service. If date_of_service is blank, fall back to
        # the 'date' from the first item in the 'items' list
        date_of_service = classified_document["date_of_service"]
        if date_of_service == "":
            date_of_service = classified_document["items"][0]["date"]
        # Extract the healthcare provider from the classified document
        healthcare_provider = classified_document["healthcare_provider"]
        # Construct the new file name
        new_file_name = f"{date_of_service}_{patient_name}_{healthcare_provider}.pdf"
        # Rename the file and save it in the folder matching the classification
        new_file_path = os.path.join(FOLDER_PATH, classification, new_file_name)
        os.rename(document, new_file_path)
    print("All files have been renamed.")
Conclusion
This article is intended to demonstrate the value of an organization building its own solution instead of relying solely on vendor-provided black-box systems. This particular project shows the power of combining open-source LLMs with bespoke pipelines to create a flexible document processing solution.
The approach outlined here offers adaptability to various document types and information needs, allowing businesses to maintain control over their data processing while harnessing cutting-edge AI technology. As LLMs continue to advance, so too can this pipeline, ensuring it remains a valuable tool for automating document processing tasks.
By sharing this work, I hope others will explore custom AI solutions of their own and unlock the full potential of their document repositories, turning raw data into structured, actionable information.
If you have any questions about this project or want to know more about the kinds of solutions that can be created, feel free to contact me! Hope you enjoyed!