Leveraging LLMs for Efficient and Accurate Data Management

How do you clean and manage a database without the help of an annotator or colleagues? What if you could use an LLM or NLP model to assist you?

I've created an Inventors database that packs all the possible tags into a single string column. Using this example, I will demonstrate how to organize the data effectively.

[Image: Inventors database before and after cleansing]

0. Creating the example dataset

The Inventors dataset can be created by wrapping the raw rows into a Pandas DataFrame for further processing.

import pandas as pd

inventors = [
    "Ada-Lovelace-Computer-Programming-1843-No-UK",
    "Thomas-Edison-USA-1879-No-Light-Bulb",
    "Telephone-Alexander-Bell-1876-Scotland-Telephone-No",
    "Marie-Curie-Yes-Poland-Radioactivity-1898-Yes",
    "Guglielmo-Marconi-Radio-Italy-Yes-1895",
    "James-Watt-1769-No-Scotland-Steam-Engine",
    "Italy-Various-1490-No-Leonardo-da-Vinci",
    "Germany-Johannes-Gutenberg-Printing-Press-1440-No",
    "Tim-Berners-Lee-UK-World-Wide-Web-1989-No",
    "Computer-Science-Alan-Turing-UK-1936-No",
    "1976-No-Steve-Jobs-USA-Personal-Computer",
    "Lamarr-Hedy-No-Austria-Wireless-Communication-1942",
    "Sweden-Dynamite-Alfred-Nobel-1867-No",
    "No-Wright-Brothers-USA-Airplane-1903",
]

df = pd.DataFrame(inventors, columns=["Inventor"])        

1. Cleaning data the classic way

The last two columns, 'Year of Invention' and 'Nobel Prize', can be extracted without using a language processing model.

The code snippet below demonstrates that:

  • The year value can be identified with a regular expression (the re library).
  • The Nobel Prize values are binary ("Yes" or "No"), so they are easy to match directly.

import re

def preprocess_string(string: str):
    """I found that splitting into small tokens is more efficient than one long input."""
    parts = string.replace('-', ' ').replace('_', ' ').split()
    return parts

def detect_objects_from_parts(parts: list):
    objects = {'NobelPrize': None}  # None until a "Yes"/"No" token is found
    for part in parts:
        if part == 'Yes':
            objects['NobelPrize'] = 'Yes'
        elif part == 'No':
            objects['NobelPrize'] = 'No'

    return objects

def process_nobel_prize(nobel_prize: str):
    parts = preprocess_string(nobel_prize)
    return pd.Series(detect_objects_from_parts(parts))

df[['NobelPrize']] = df["Inventor"].apply(lambda x: process_nobel_prize(str(x)))

def extract_year(string: str):
    """Extracts a 4-digit year from the string"""
    parts = preprocess_string(string)
    for part in parts:
        if re.match(r"^\d{4}$", part):
            return int(part)
    return None


df["Year"] = df["Inventor"].apply(extract_year)        

TIP: Improving Detection Accuracy
One useful technique I found is that removing detected parts from the original string makes subsequent extractions more precise.

Example:

Given the original string: "Ada-Lovelace-Computer-Programming-1843-No-UK"

From previous steps, it's known that:

  • "No" represents the Nobel Prize status.
  • "1843" is the year of invention.

Removing these detected parts to create the temporary string "Ada-Lovelace-Computer-Programming-UK" reduces the input size, minimizing the chance of misdetection and improving extraction accuracy.

The cleaning logic is demonstrated in the piece of code below:

def clean_inventor_column(df, column_name):
    """Removes the already-detected value in `column_name` from the Inventor string."""
    df["Inventor"] = df.apply(
        lambda row: "-".join(
            part for part in preprocess_string(row["Inventor"])
            if part not in preprocess_string(str(row[column_name]))),
        axis=1)
    return df        
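
For example, after the year and the Nobel Prize status have been extracted, the already-detected values can be stripped from the source column. A minimal usage sketch (the column names match the ones created above):

# Strip the values detected so far from the Inventor string,
# so later steps work on a shorter, less ambiguous input.
df = clean_inventor_column(df, "Year")
df = clean_inventor_column(df, "NobelPrize")
# "Ada-Lovelace-Computer-Programming-1843-No-UK" now becomes
# "Ada-Lovelace-Computer-Programming-UK"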

2. Name extraction using NER

The most effective way to identify names in a given string is by using Named Entity Recognition (NER) models.

I have good experience with the 'en_core_web_trf' model, which can be loaded with spaCy.

If you are going to use this library, I recommend installing spaCy version 3.7.2 and downloading the model with:

python -m spacy download en_core_web_trf

The code snippet below demonstrates how to apply an NER model to the Inventors database to extract the inventor's name and surname:

import spacy

trf = spacy.load('en_core_web_trf')

def extract_human_names(string: str):
    """Runs the transformer NER model on each token and collects PERSON entities."""
    parts = preprocess_string(string)

    human_names = []
    for part in parts:
        doc = trf(part)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                human_names.append(ent.text)

    return ', '.join(human_names)


df["inventor_names"] = df["Inventor"].apply(lambda x: extract_human_names(str(x)))

# Pad with empty strings so rows with fewer than two detected names still fit.
df[["Name", "Surname"]] = df["inventor_names"].apply(
    lambda x: pd.Series((x.split(", ") + ["", ""])[:2]) if x else pd.Series(["", ""]))
df.drop(columns=["inventor_names"], inplace=True)         

3. Enhancing Detection with an LLM

The last step is to extract the remaining details, 'Country' and 'Field', using a general-purpose language model such as en_core_web_sm.

The big challenge is that the model (like AI in general) does not fully understand context: in the Inventors database, 'Field' means 'field of research', not a piece of land.

There are two possible solutions:

  1. Rename "Field" to a less ambiguous term to improve model recognition.
  2. Reuse the earlier workaround: clean the string after extracting 'Country' and use the remaining part as the 'Field' column value (as done below).

from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
pattern_country = [{"POS": "PROPN"}]  # any proper noun is a country candidate
matcher.add("COUNTRY", [pattern_country])

def extract_country(inventor_string):
    """Extracts a country candidate (the first proper noun) from the Inventor string."""
    parts = inventor_string.split("-")
    doc = nlp(" ".join(parts))

    matches = matcher(doc)
    for match_id, start, end in matches:
        return doc[start:end].text  # return the first match only

    return ""


def extract_field(inventor_string):
    """Extracts the field of invention from the cleaned Inventor column
    by dropping the first remaining token and keeping the rest."""
    parts = preprocess_string(inventor_string)
    return " ".join(parts[1:])


df["Country"] = df["Inventor"].apply(extract_country)
clean_inventor_column(df, "Country")

df["Field"] = df["Inventor"].apply(extract_field)
print(df)        

The output of all the previous steps looks about 50/50: roughly half of the values are detected correctly.

[Image: Detection results]

Obviously, some validation should be done.

4. Analyzing misdetections and validating detections

In the given output, two main things should be validated:

  • analyzing empty cells;
  • validating detection results to confirm they are correct.

How to analyze empty cells?

My usual approach to analyzing empty cells is to compare the length of the original input (the one detection was performed on) with the sum of the lengths of the detected parts.

In the example below, only the first row meets the validation requirement. The rest do not, for various reasons (a field was not detected, a surname is uncommon, etc.).

This check can be a good way to flag outliers, but it is not enough on its own.

[Image: Subset of detection output with adjusted first line]
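
A minimal sketch of this length check, assuming the raw strings were preserved in a hypothetical 'Original' column before any cleaning (the comparison ignores dashes and spaces, and the threshold logic is illustrative):

# Hypothetical: keep a copy of the raw input before the cleaning steps.
# df["Original"] = df["Inventor"]

def length_check(row):
    """True if the combined length of the detected parts accounts
    for the whole original input (dashes and spaces ignored)."""
    original_len = len(str(row["Original"]).replace("-", ""))
    detected = [row["Name"], row["Surname"], row["Year"],
                row["NobelPrize"], row["Country"], row["Field"]]
    detected_len = sum(len(str(v).replace(" ", ""))
                       for v in detected if pd.notna(v) and v != "")
    return detected_len >= original_len

df["length_ok"] = df.apply(length_check, axis=1)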

How to validate detection results?

The previous validation produces two groups of rows: those that passed and those that failed. Both should be included in the second validation.

However, while the first row successfully passed the length validation, its content is entirely incorrect (or perhaps a new country, 'Computer Programming', has been added to the map?). By applying a country-database library to this column, outliers, i.e. non-existent countries, can be detected.
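
A minimal sketch of this validation with the pycountry package (one possible country-database library, installed with pip install pycountry; informal names such as 'UK' or non-sovereign entries such as 'Scotland' may need a manual alias map on top of it):

import pycountry

def is_real_country(name: str) -> bool:
    """Returns True if `name` resolves to an ISO 3166 country."""
    if not name:
        return False
    try:
        pycountry.countries.lookup(name)
        return True
    except LookupError:
        return False

# Flag rows whose detected 'Country' does not exist on the map,
# e.g. 'Computer Programming'.
df["country_ok"] = df["Country"].apply(is_real_country)
print(df.loc[~df["country_ok"], ["Inventor", "Country"]])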

Though it might seem simpler to deal with the failed results, the only real solution is to find a better model, improve the detection logic, or fine-tune the current one, all of which require additional effort.

Final Considerations

Data management, particularly cleansing and refactoring databases, is often a dull and lengthy process, but it is crucial. With the adoption of LLMs and strong validation methods, much of it can be automated, making the process more enjoyable and the resulting data well-organized and accurate.

Ready-to-go code on GitHub

