Detective Work with Data Science
Releasing the JFK Assassination Dataset
It is impossible to read about the JFK Assassination without developing a lifelong obsession—especially if you see data science as a form of detective work. We are releasing the text extracted from some 50,000 assassination records on Hugging Face so other data scientists can put their investigative instincts to the test.
The Assassination Records
US President John F. Kennedy (JFK) was assassinated in 1963. Over the years, multiple investigations of the case have produced millions of files known as the JFK Assassination Records. These records contain pieces of evidence, intelligence reports, and analyses of that evidence.
The US government started releasing these records digitally through archives.gov. Since 2017, more than fifty thousand files have been uploaded to the website as scanned PDFs.
While efforts like History Matters have cataloged and indexed the records for manual research, our goal is to enable large-scale computational analysis, which may offer new clues and help substantiate or refute some of the theories.
Download and Processing
As of April 2025, there are about 50,000 unique files published on archives.gov, totaling around 50 GB, and the collection will keep growing as the publication of new records is ongoing.
We downloaded the current snapshot of the PDF files and version-controlled them. We then processed each file, page by page, using the Google Gemini LLM API to extract the text as Markdown. The resulting text files were version-controlled again.
The code can be found in the JFK-Tell GitHub repository. Text from all pages of a single PDF file was concatenated into a single .md file.
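For illustration, here is a minimal sketch of the page-by-page conversion, assuming PyMuPDF for rasterizing pages and the google-generativeai SDK; the prompt, file names, and error handling are simplified stand-ins for the actual code in the JFK-Tell repository.

```python
# Minimal sketch of page-by-page PDF-to-Markdown extraction (illustrative only;
# the actual pipeline lives in the JFK-Tell repository). Assumes PyMuPDF (fitz),
# Pillow, and google-generativeai are installed and GOOGLE_API_KEY is set.
import io
import os

import fitz  # PyMuPDF
from PIL import Image
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

PROMPT = "Transcribe the text on this scanned page as Markdown."  # illustrative prompt

def pdf_to_markdown(pdf_path: str) -> str:
    """Render each page as an image, ask Gemini for Markdown, and concatenate."""
    doc = fitz.open(pdf_path)
    pages_md = []
    for page in doc:
        pix = page.get_pixmap(dpi=200)                      # rasterize the page
        image = Image.open(io.BytesIO(pix.tobytes("png")))  # hand Gemini a PIL image
        response = model.generate_content([PROMPT, image])
        pages_md.append(response.text)
    return "\n\n".join(pages_md)

with open("record.md", "w", encoding="utf-8") as f:
    f.write(pdf_to_markdown("record.pdf"))
```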
Release
The original records downloaded as PDF files were released on Hugging Face as the JFK Archives dataset. Since HF has limits on the size of individual files and the number of files, the dataset was published as five Parquet files.
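To give a sense of how scanned PDFs can be packed into Parquet shards, here is a small sketch using pyarrow; the column names and shard naming are an assumed layout, so refer to the dataset card on Hugging Face for the real schema.

```python
# Minimal sketch of packing PDF bytes into a Parquet shard (assumed layout,
# not the actual schema of the published dataset).
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

pdf_paths = sorted(Path("pdfs").glob("*.pdf"))
shard = pa.table({
    "file_name": [p.name for p in pdf_paths],          # original file name
    "pdf_bytes": [p.read_bytes() for p in pdf_paths],  # raw PDF content
})
pq.write_table(shard, "jfk_archives-00000-of-00005.parquet")
```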
The extracted text was published as the JFK Tell dataset on HF. Here is a list of all published resources:
Using the Datasets
The datasets can be used to benchmark the fidelity of PDF-to-text conversion with LLMs. The Gemini 2.0 Flash model we used was fed snapshots of PDF pages along with very simple prompts. We checked a few dozen files and the outputs looked good. However, it would be worthwhile to evaluate accuracy and structure preservation on a larger sample.
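As a starting point, the sketch below loads the extracted text with the Hugging Face datasets library and compares one record against a manually transcribed reference page using a simple character-level similarity score; the repository ID, column name, and reference file are placeholders.

```python
# Minimal sketch of a fidelity spot-check (placeholder repo ID, column name,
# and reference transcription).
import difflib

from datasets import load_dataset

# Load the extracted Markdown; substitute the actual JFK Tell repository ID.
tell = load_dataset("your-org/jfk-tell", split="train")

record = tell[0]["text"]  # assumed column name for the extracted Markdown
reference = open("manual_transcription.md", encoding="utf-8").read()  # hand-checked page

# Character-level similarity in [0, 1]; 1.0 means an exact match.
similarity = difflib.SequenceMatcher(None, reference, record).ratio()
print(f"similarity vs. manual transcription: {similarity:.3f}")
```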
The idea behind PDF-to-text conversion is to index the text with a semantic search engine, which will enable asking questions with RAG, generating summaries, and organizing the records more meaningfully. The index should link back to the original PDFs to make it easier to cross-check details and cite the original.
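A minimal version of such an index, assuming the sentence-transformers library and an in-memory cosine-similarity search over per-record embeddings, might look like the sketch below; a production setup would chunk long documents and use a proper vector database, and the repository ID and column name are again placeholders.

```python
# Minimal sketch of a semantic index over the extracted records.
# Assumes sentence-transformers and numpy; repo ID and column name are placeholders.
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

tell = load_dataset("your-org/jfk-tell", split="train")
texts = [row["text"] for row in tell.select(range(1000))]    # small subset for the demo

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True)  # unit-length vectors

def search(query: str, k: int = 5):
    """Return the indices of the k records most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q                                  # cosine similarity
    return np.argsort(-scores)[:k]

for idx in search("autopsy report findings"):
    print(idx, texts[idx][:200])
```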
Finally, the semantic search backend can be paired with a frontend to build an application that non-technical users can use to search and analyze the information. This seems to be the most compelling use case, as it sets a precedent for analyzing legal and other accuracy-critical documents at scale with AI tools.
Disclaimer
The download and conversion of the records is a lossy process. The purpose of making the data available is to foster open-source research, and no warranties of any kind regarding the accuracy of the data are implied. We are aware of the following issues, and many other unknown issues may be present as well.