Detective Work with Data Science
The JFK Assassination Records (AI-Generated)

Releasing the JFK Assassination Dataset

It is impossible to read about the JFK Assassination without developing a lifelong obsession—especially if you see data science as a form of detective work. We are releasing the text extracted from some 50,000 assassination records on Hugging Face so other data scientists can put their investigative instincts to the test.

The Assassination Records

US President John F. Kennedy (JFK) was assassinated in 1963. Over the years, multiple investigations of the case have produced millions of files known as the JFK Assassination Records. These records contain pieces of evidence, intelligence reports, and reports from analyses of the evidence.

The US government started releasing these records digitally through archives.org. Since 2017, more than fifty thousand files have been uploaded to the website as scanned PDFs.

While efforts like History Matters have cataloged and indexed the records for manual research, our goal is to enable large-scale computational analysis, which may offer new clues and help substantiate or refute some of the theories.

Download and Processing

As of April 2025, there are 50,000 unique files published on archives.org, totaling around 50 GB. The collection will keep growing, as the publication of new records is ongoing.

We downloaded the current snapshot of the PDF files and version-controlled them. We then processed each file, page by page, using the Google Gemini LLM API to extract the text as Markdown. The resulting text files were version-controlled again.

The code can be found in the JFK-Tell GitHub repository. Text from all pages of a single PDF file was concatenated into a single .md file.
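
The per-file flow described above can be sketched roughly as follows. This is a minimal sketch and not the code from the JFK-Tell repository: `extract_page` is a hypothetical callable standing in for the LLM API call, and the page separator and the output naming (PDF stem plus `.md`) are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable


def pdf_to_markdown(
    pages: Iterable[bytes],
    extract_page: Callable[[bytes], str],
    separator: str = "\n\n",
) -> str:
    """Extract text page by page and join it into one Markdown string.

    `extract_page` is a hypothetical stand-in for the LLM call: it takes one
    rendered page and returns that page's Markdown.
    """
    parts = []
    for page in pages:
        text = extract_page(page).strip()
        if text:  # skip pages where nothing was extracted
            parts.append(text)
    return separator.join(parts)


def write_markdown(pdf_name: str, markdown: str, out_dir: Path) -> Path:
    """Write the text to <out_dir>/<pdf stem>.md (an assumed layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (Path(pdf_name).stem + ".md")
    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```

Passing the extractor in as a parameter keeps the concatenation logic testable without network access, and makes it easy to swap in a different model later.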

Release

The original records downloaded as PDF files were released on Hugging Face as the JFK Archives dataset. Because Hugging Face limits both the size of individual files and the number of files, the dataset was published as five Parquet files.
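
One way to stay within per-file and file-count limits is to group many small PDFs into a handful of size-balanced shards before writing each shard to Parquet. The sketch below shows only the grouping step, using a greedy largest-first assignment. The default of five shards matches the release, but the balancing strategy and the `(name, size)` input format are assumptions, not a description of how the published files were actually built.

```python
import heapq
from typing import Iterable


def assign_shards(
    files: Iterable[tuple[str, int]], n_shards: int = 5
) -> list[list[str]]:
    """Greedily place each file (largest first) into the lightest shard."""
    shards: list[list[str]] = [[] for _ in range(n_shards)]
    # Heap of (total_bytes, shard_index), so the lightest shard pops first.
    heap = [(0, i) for i in range(n_shards)]
    heapq.heapify(heap)
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return shards
```

Each resulting list of file names could then be written out as one Parquet file with a library of choice.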

The extracted text was published as the JFK Tell dataset on Hugging Face. Here is a list of all published resources:

Using the Datasets

The datasets can be used to benchmark the fidelity of PDF-to-text conversion with LLMs. We fed the Gemini 2.0 Flash model snapshots of PDF pages along with simple prompts. We checked a few dozen files and the outputs looked good. However, it would be worthwhile to evaluate accuracy and structure preservation on a larger sample.
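
One simple way to score fidelity, assuming a small set of hand-transcribed reference pages is available, is a normalized similarity between the reference and the extracted text. The sketch below uses Python's standard `difflib`. The whitespace and case normalization and the choice of metric are assumptions, and it does not measure structure preservation (tables, headings), which would need a separate check.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout noise does not dominate."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(reference: str, extracted: str) -> float:
    """Return a 0..1 similarity ratio between reference and extracted text."""
    ref, ext = normalize(reference), normalize(extracted)
    if not ref and not ext:
        return 1.0  # two empty pages count as a perfect match
    matcher = difflib.SequenceMatcher(None, ref, ext, autojunk=False)
    return matcher.ratio()
```

Averaging this score over a random sample of pages gives a rough accuracy estimate; pages with low scores are good candidates for manual review.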

The idea behind PDF-to-text conversion is to index the text with a semantic search engine, which will enable asking questions with RAG, generating summaries, and organizing the records more meaningfully. The index should link back to the original PDFs to make it easier to cross-check details and cite the original.
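
The link-back requirement can be illustrated with a toy index in which every chunk of text carries the name of its source PDF. This is a sketch under stated assumptions: bag-of-words cosine similarity stands in for a real embedding model, and the `Chunk` type and `search` function are hypothetical names, not part of any released code.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source_pdf: str  # file name of the original record, kept for citation
    text: str


def _vector(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    chunks: list[Chunk], query: str, top_k: int = 3
) -> list[tuple[float, Chunk]]:
    """Rank chunks by similarity to the query; each hit keeps its source PDF."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(c.text)), c) for c in chunks]
    scored = [s for s in scored if s[0] > 0]  # drop chunks with no overlap
    return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]
```

Because every hit carries `source_pdf`, a RAG answer or summary built on top of the results can cite the original scan for cross-checking.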

Finally, the semantic search backend can be paired with a frontend to build an application that non-technical people can use to search and analyze the records. This seems to be the most compelling use, as it could set a precedent for analyzing legal and other accuracy-critical documents at scale with AI tools.

Disclaimer

The download and conversion of the records is a lossy process. The data is made available to foster open-source research, and no warranties of any kind regarding its accuracy are implied. We are aware of the following issues, and other unknown issues may be present too.

  • Some of the PDF URLs from archives.org fail to download.
  • Many files were listed multiple times on different archives.org pages, and we downloaded only one copy for each such file name. No content-based de-duplication was applied; we relied on archives.org file names alone.
  • A few thousand files failed to yield any text. Several of these are reports and similar documents that contain little beyond formatting, seals, and stamps.
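
The difference between file-name de-duplication and content-based de-duplication can be sketched as follows. `dedupe_by_filename` mirrors the approach described in the list above. `group_by_content` shows a check that was not applied to the release and is included only as an illustration. The function names are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def dedupe_by_filename(urls: list[str]) -> dict[str, str]:
    """Keep the first URL seen for each file name."""
    kept: dict[str, str] = {}
    for url in urls:
        name = PurePosixPath(urlparse(url).path).name
        kept.setdefault(name, url)
    return kept


def content_key(data: bytes) -> str:
    """SHA-256 of the file bytes; equal keys mean byte-identical files."""
    return hashlib.sha256(data).hexdigest()


def group_by_content(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by content hash to reveal same-content duplicates."""
    groups: dict[str, list[str]] = {}
    for name, data in files.items():
        groups.setdefault(content_key(data), []).append(name)
    return groups
```

Any hash group with more than one file name points to byte-identical records published under different names. Scans of the same document that differ in even one byte would still be missed by this check.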

More articles by Farhan Ahmad