A Pipeline and a Prompt: Automating Document Processing with LLMs and Python

I’m writing this article to expand on my recent post about a Python script I wrote. It starts with an explanation of why I made the thing in the first place, then provides a technical overview of the script, walking through its key functions and the reasoning behind its structure.

Why I Made the Thing

I hear vendors advertising their products, and much of the time the product comes down to a pipeline and a prompt. Most vendors are not offering either of the real sources of value involved in using LLMs: the model itself and the data you need.

The first source is the model itself. If a vendor is not providing a custom model that is either fine-tuned for a specific task or fully trained in a unique and useful way, then they are not providing this value. This is where organizations like OpenAI, Anthropic, Google, Meta, Harvey, and so on provide value. It costs an incredible amount of money and time to train a model from scratch, so if the vendor you are evaluating is not offering this as their value, then they need to be providing source number two: data.

Data is the new oil. There are really two sources of valuable data: public and private. If the data is public, the value a vendor provides is that they have accumulated that public data and can give you access to it in an easy and efficient way. Private data is data that is internal to firms and organizations. Organizations like vLex, Thomson Reuters, and LexisNexis provide this kind of information, as does the data already within your own firm or organization.

Unfortunately, most vendors don’t offer either of these sources of value. Most products popping up in the LLM space are simply a pipeline that takes the data you already have, sends it along with a customized prompt to an organization that owns an LLM, and then sends the response back to you. But these models are generally all available directly to consumers, so why not just build that same pipeline internally and keep the flexibility to switch between models and change up your prompts whenever you feel the need?

That’s the purpose of this script: to show how relatively easy it is to create something like this. Scripts of this nature can also be folded into future programs and grow with them. Something like this could easily be turned into a tool for an LLM to call at its discretion whenever it is asked to do a task that requires it!

I’m hoping that I’ll be able to put more time into this project to make a more robust version of the script. I would like to:

  • implement other document types to go along with PDFs,
  • switch it to an asynchronous client so it can accept LLM response requests in parallel,
  • expand on the prompts to reduce token use,
  • switch the prompts up to have one structure for medical records and one structure for medical bills, and
  • alter the models so a large LLM first generates an initial “template” prompt, then switch to a smaller model to extract the information from the remaining documents based on that template. This should provide the smaller models with known information from the first couple of documents to improve their results on future extractions.

Script Overview

The script is designed to ingest PDFs, extract text, classify documents, extract structured information, clean and format data, and rename files based on the extracted information. It uses the Ollama API to call “llama3.1:8b” for text classification and information extraction.

Key Components and Functions

First up are the libraries used, along with the folder path and model variables:

import os
import fitz  # PyMuPDF
import ollama
import json
import pandas as pd
import re

FOLDER_PATH = "/Users/jp/Documents/Code/classify_rename/data"
client = ollama.Client()

MODEL = "llama3.1"  # the default tag resolves to the 8B model

1. Document Ingestion

This function walks through a specified folder, identifying PDF files and returning their paths and names. It's the entry point for processing, setting up the pipeline for subsequent operations.

def document_ingestion(folder_path):
    """Ingest PDF documents from the specified folder."""
    documents = []
    file_name = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith(".pdf"):
                documents.append(os.path.join(root, file))
                file_name.append(file)
    return documents, file_name         
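
As a quick usage sketch, running it over the folder configured above:

documents, file_names = document_ingestion(FOLDER_PATH)
print(f"Found {len(documents)} PDF(s) to process")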

2. Text Extraction

Using the PyMuPDF library (imported as fitz), this function extracts text from each page of a PDF. It also cleans the extracted text by normalizing whitespace and stripping extra characters, preparing it for analysis.

def extract_text_from_pdf(pdf_path):
    """Extract text from PDF files."""
    text = ""
    document = fitz.open(pdf_path)
    for page in document:
        text += page.get_text()
    document.close()
    # Clean text: replace newlines with spaces and trim surrounding whitespace
    text = re.sub(r"\n", " ", text)
    text = text.strip()
    return text

3. Ollama API Interaction

This function sends a chat request to the Ollama API. It's designed to be flexible, accepting different system prompts for different tasks (classification, information extraction). The function uses specific parameters to control the model's output, such as temperature and seed: temperature controls the "creativity" of the output, while a fixed seed makes the result consistent and reproducible.

def ollama_request(text, MODEL, system_prompt):
    """Send a chat request to the Ollama API."""
    response = client.chat(
        model=MODEL,
        format="json",  # ask Ollama to constrain the output to valid JSON
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": (
                    f"Here is the document to analyze:\n\n"
                    f"<document>\n{text}\n</document>\n\n"
                    f"Begin processing the document text and output the resulting JSON object."
                ),
            },
            {
                # Prefill the assistant turn so the model continues inside a JSON code block
                "role": "assistant",
                "content": "```json\n",
            },
        ],
        options={
            "temperature": 0.2,
            "seed": 1357924680,
        },
    )

    return response

Along with this, the responses are sent to a function that cleans them up and ensures only a JSON object comes back:

def clean_responses(response):
    """Clean the responses from the Ollama API."""
    content = response["message"]["content"]
    # Drop anything before the first curly brace so only the JSON object remains
    first_curly_brace_index = content.find("{")
    if first_curly_brace_index != -1:
        content = content[first_curly_brace_index:]
    # Collapse repeated whitespace and strip any stray code fences from the result
    content = re.sub(r"\s+", " ", content)
    content = content.replace("```", "")
    response["message"]["content"] = content
    return response
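
To see how these two functions fit together, here is a small, hypothetical usage that classifies one extracted document. It assumes the classification prompt shown in the next section is stored in a string constant named CLASSIFICATION_PROMPT (my name, not necessarily the script's):

text = extract_text_from_pdf(documents[0])
response = ollama_request(text, MODEL, CLASSIFICATION_PROMPT)
response = clean_responses(response)
category = json.loads(response["message"]["content"])["category"]
print(category)  # e.g. "Medical Record"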

Prompt Information

4. Document Classification

The script uses a classification prompt to categorize documents into "Medical Record", "Medical Bill", "Legal Document", or "Other". The prompt is structured to guide the AI in making accurate classifications based on document content.

Here is the classification prompt:

You are an AI assistant specializing in classifying documents into categories. Provide only the category name and no extra text. Your response should be one of the following categories:

1. "Medical Record"
2. "Medical Bill"
3. "Legal Document"
4. "Other"

If the document does not fit any of the above categories, please select "Other".
Medical Records are documents with detailed patient health information.
Medical Bills are documents that contain billing information for healthcare services.
Legal Documents are documents related to legal matters, such as contracts or court orders.
Other documents are those that do not fit any of the above categories.
Please select the most appropriate category based on the content of the document.

Format the response as a JSON object with the key "category" and the value as the category name:

{"category": ""}.        

5. Information Extraction

For medical bills, a structured prompt is used to extract specific information into a JSON format. This prompt is carefully designed to capture provider information, patient details, itemized charges, and subtotals.

You are an AI assistant specializing in structuring medical documents into a JSON object. Your task is to analyze the following medical document text and extract relevant information to create a structured JSON object.

Follow these instructions to create the JSON object:

1. Extract information from the medical document text and structure it into a JSON object with the following keys:
- provider_information
- patient_details
- items
- subtotals

2. The JSON object should have the following structure:

{
  "provider_information": {
    "facility": "",
    "account_number": ""
  },
  "patient_details": {
    "name": "",
    "birthday": "",
    "visit_number": "",
    "visit_description": "",
    "date_of_service": ""
  },
  "items": [
    {
      "date": "",
      "description": "",
      "amount": "",
      "category": ""
    }
  ],
  "subtotals": {
    "Charges": "",
    "Insurance Payments & Adjustments": "",
    "Patient Payments & Adjustments": ""
  }
}

3. For the "items" array, create an object for each line item in the document, including the date, description, amount, and category.

4. Categorize amounts as follows:
- If an amount is positive, its category is "Charges"
- If an amount is negative, categorize it as either "Insurance Payments & Adjustments" or "Patient Payments & Adjustments" based on the description.

5. Format all numbers as floats (e.g., 100.00 instead of 100).

6. Convert all dates to the format "YYYY.MM.DD", even if they are originally in a different format.

7. Provide only the JSON object as your output, with no additional text or explanation.        

6. Data Cleaning and Formatting

This function is long, but its job is straightforward: it takes the extracted information and cleans it, ensuring consistent data types, formatting dates, and normalizing text fields. This is important for maintaining data quality and consistency across processed documents.

def clean_and_format_df(df):
    """Clean and format the structured billing information DataFrame."""

    # Define column mappings for different sections
    column_mappings = {
        "provider_information": {
            "healthcare_provider": "facility",
            "account_number": "account_number",
        },
        "patient_details": {
            "patient_name": "name",
            "patient_birthday": "birthday",
            "visit_number": "visit_number",
            "visit_description": "visit_description",
            "date_of_service": "date_of_service",
        },
        "subtotals": {
            "charges": "Charges",
            "insurance_payments_adjustments": "Insurance Payments & Adjustments",
            "patient_payments_adjustments": "Patient Payments & Adjustments",
        },
    }

    # Extract information using column mappings
    for source_col, mappings in column_mappings.items():
        if source_col in df.columns:
            for new_col, old_col in mappings.items():
                df[new_col] = df[source_col].apply(
                    lambda x: x.get(old_col, None) if isinstance(x, dict) else None
                )
            df = df.drop(source_col, axis=1)
        else:
            print(f"Warning: '{source_col}' column not found in the DataFrame.")

    # Convert numeric columns
    numeric_columns = [
        "charges",
        "insurance_payments_adjustments",
        "patient_payments_adjustments",
    ]
    for col in numeric_columns:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(float)

    # Process items column
    if "items" in df.columns:
        def process_item(item):
            if isinstance(item, dict):
                item["amount"] = float(item.get("amount", 0))
                item["date"] = convert_date(item.get("date", ""))
            return item

        df["items"] = df["items"].apply(
            lambda x: [process_item(item) for item in x] if isinstance(x, list) else x
        )

    # Convert date columns
    date_columns = ["date_of_service", "patient_birthday"]
    for col in date_columns:
        if col in df.columns:
            df[col] = df[col].apply(lambda x: convert_date(x) if x else None)

    # Process classification
    if "classification" in df.columns:
        df["classification"] = df["classification"].apply(
            lambda x: x.get("category", x) if isinstance(x, dict) else x
        )

    # Clean and normalize text columns
    text_columns = ["patient_name", "visit_description", "healthcare_provider"]
    for col in text_columns:
        if col in df.columns:
            df[col] = df[col].str.title()

    # Remove leading/trailing whitespace from all string columns
    df = df.map(lambda x: x.strip() if isinstance(x, str) else x)

    return df        
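
One note: clean_and_format_df calls a convert_date helper that isn't shown in the article. A minimal sketch of what it could look like, assuming the goal is the "YYYY.MM.DD" format the extraction prompt asks for (the body here is my assumption, not the original implementation):

from datetime import datetime

def convert_date(date_str):
    """Normalize a date string to YYYY.MM.DD; return it unchanged if no known format matches."""
    if not isinstance(date_str, str) or not date_str.strip():
        return date_str
    date_str = date_str.strip()
    # Try a handful of date formats commonly seen in medical bills
    for fmt in ("%Y.%m.%d", "%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%B %d, %Y"):
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y.%m.%d")
        except ValueError:
            continue
    return date_str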

7. File Renaming

The final step in the process, this function renames each processed file based on extracted information (date of service, patient name, healthcare provider) and moves it into the appropriate category folder. There are a few fail-safes to make sure relevant information ends up in the file name, such as falling back to the date of the first line item if no "date of service" was extracted during the initial run through.

def rename_files(FOLDER_PATH):
    """Rename files based on the extracted billing information."""
    documents, file_name = document_ingestion(FOLDER_PATH)

    # Load the classified documents from the JSONL log file
    with open(
        "logs/classified_documents_llama.jsonl", mode="r", encoding="utf-8"
    ) as file:
        classified_documents = [json.loads(line) for line in file]

    # For each document, rename it based on the extracted billing information
    for document, classified_document in zip(documents, classified_documents):
        # Extract the classification from the classified document
        classification = classified_document["classification"]

        # Extract the patient last name from the classified document
        patient_name = classified_document["patient_name"].split()[-1]

        # Extract the date of service. If it is blank, fall back to the 'date'
        # of the first entry in the 'items' list
        date_of_service = classified_document["date_of_service"]
        if date_of_service == "":
            date_of_service = classified_document["items"][0]["date"]

        # Extract the healthcare provider from the classified document
        healthcare_provider = classified_document["healthcare_provider"]

        # Construct the new file name
        new_file_name = f"{date_of_service}_{patient_name}_{healthcare_provider}.pdf"

        # Rename the file and save it in the folder for its classification,
        # creating the category folder if it does not already exist
        category_folder = os.path.join(FOLDER_PATH, classification)
        os.makedirs(category_folder, exist_ok=True)
        new_file_path = os.path.join(category_folder, new_file_name)
        os.rename(document, new_file_path)

    print("All files have been renamed.")

Conclusion

This article is intended to show the value of an organization building its own solutions instead of relying solely on vendor-provided black-box systems. This particular project demonstrates the power of combining open-source LLMs with bespoke pipelines to create a flexible document-processing solution.

The approach outlined here offers adaptability to various document types and information needs, allowing businesses to maintain control over their data processing while harnessing cutting-edge AI technology. As LLMs continue to advance, so too can this pipeline, ensuring it remains a valuable tool for automating document processing tasks.

By sharing this work, I hope that others will explore custom AI solutions and unlock the full potential of their document repositories, turning raw data into structured, actionable information.

If you have any questions about this project or want to know more about the kinds of solutions that can be created, feel free to contact me! Hope you enjoyed!

