How to Extract a Table From a PDF and Convert it to a CSV DataFrame with Tabula in Python

Christopher Cala

Helping ISVs Innovate, Build & Co-Sell Cloud SaaS solutions for Public Sector | Partner Account Manager, Amazon Web Services (AWS)

Published Dec 15, 2019

Tabular data is often locked inside PDF files, and I'd like to be able to work on it directly in a Python pandas DataFrame.

With a couple lines of Python code and the Tabula package, you can do just that.
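As a rough illustration of what those couple of lines can look like, here is a minimal sketch. It assumes the `tabula-py` package (imported as `tabula`) and a Java runtime are installed. The helper names `csv_path_for`, `extract_tables`, and `pdf_tables_to_csv` and the output naming scheme are hypothetical, chosen for this example rather than taken from the notebook.

```python
from pathlib import Path


def csv_path_for(pdf_path, table_index):
    """Build an output path such as report_table0.csv next to the PDF.

    The naming scheme is an assumption made for this sketch.
    """
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_table{table_index}.csv")


def extract_tables(pdf_path, pages="all"):
    """Return a list of pandas DataFrames, one per table Tabula detects."""
    # Imported here so the path helper above works without tabula-py installed.
    import tabula  # provided by the tabula-py package; needs a Java runtime

    # With multiple_tables=True, read_pdf returns a list of DataFrames.
    return tabula.read_pdf(str(pdf_path), pages=pages, multiple_tables=True)


def pdf_tables_to_csv(pdf_path, pages="all"):
    """Write each detected table to its own CSV file and return the paths."""
    written = []
    for index, frame in enumerate(extract_tables(pdf_path, pages)):
        out_path = csv_path_for(pdf_path, index)
        frame.to_csv(out_path, index=False)  # drop the pandas row index
        written.append(out_path)
    return written
```

Calling `pdf_tables_to_csv("example.pdf")` (a placeholder file name) would write one CSV per detected table next to the PDF. If only the CSV is needed and not the DataFrame, `tabula.convert_into("example.pdf", "example.csv", output_format="csv", pages="all")` is a shorter route. Detection quality depends on the PDF layout, so it is worth inspecting each DataFrame before relying on it.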

Here's the link to the Google Colaboratory notebook to play around with the code.

