Pandas
pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal.
It is built on NumPy, which makes it efficient for handling large datasets.
Main Features
Installation and Importing Pandas
pip install pandas
import pandas as pd
Data Structures
pandas has two main data structures: Series and DataFrame.
Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
The axis labels are collectively referred to as the index.
The basic method to create a Series is to call:
s = pd.Series(data, index=index)
Here, data can be many different things:
import pandas as pd
s = pd.Series([1, 2, 3]) # For example, a list; a default integer index (0, 1, 2) is created
The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:
import pandas as pd
data = [1,2,3,4,5]
index = ['a','b','c','d','e']
ser = pd.Series(data, index=index) # Values come first, then their labels
print(ser)
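As a minimal sketch of how the kind of data changes the result, the example below builds a Series from a dict, a NumPy array, and a scalar. The variable names and values are illustrative assumptions, not taken from the article.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array: the index must have the same length as the data.
from_array = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# From a scalar: the value is repeated once for every index label.
from_scalar = pd.Series(5, index=["a", "b", "c"])

# Values are looked up by label rather than by position.
print(from_dict["b"])
print(from_array["z"])
print(from_scalar.tolist())
```

In each case the index is what lets you look a value up by label, which is the main difference from a plain list or array.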
DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
In practice, a pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file.
A DataFrame can also be created from in-memory Python objects such as a list, a dictionary, or a list of dictionaries.
import pandas as pd
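letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g'] # Avoid naming a variable "list"; that shadows the built-in
df = pd.DataFrame(letters)
print(df) # In a Jupyter notebook, display(df) also works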
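The other in-memory sources mentioned above can be sketched in the same way. The column names and values here are made up for illustration.

```python
import pandas as pd

# Dict of lists: each key becomes a column.
df_from_dict = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],  # illustrative values
    "age": [31, 25, 42],
})

# List of dicts: each dict becomes a row; keys missing from a row become NaN.
df_from_records = pd.DataFrame([
    {"name": "Asha", "age": 31},
    {"name": "Ben"},
])

print(df_from_dict.shape)
print(df_from_records["age"].isna().sum())
```

A dict of lists is convenient when you already have the data column by column, while a list of dicts fits record-style data such as parsed JSON.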
Loading Data
import pandas as pd
df = pd.read_csv("data.csv") # Load a CSV file
print(df.head()) # Display first 5 rows
df = pd.read_excel("data.xlsx", sheet_name="Sheet1") # Load an Excel file
df = pd.read_json("data.json", orient="records") # Load a JSON file containing a list of records
import sqlite3
conn = sqlite3.connect("database.db") # Connect to SQLite
df = pd.read_sql("SELECT * FROM table_name", conn) # Load data from SQL
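The loaders above all depend on files that may not exist on your machine. As a self-contained sketch, the example below uses an in-memory SQLite database instead. The employees table and its rows are hypothetical, and the `?` placeholder is the parameter style used by the sqlite3 driver.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database so the example needs no files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Asha", 72000.0), ("Ben", 48000.0), ("Chen", 65000.0)],
)
conn.commit()

# Pass values through params instead of formatting them into the SQL string.
query = "SELECT name, salary FROM employees WHERE salary > ?"
df_sql = pd.read_sql(query, conn, params=(60000,))
conn.close()

print(df_sql)
```

Filtering in the query means only the matching rows are loaded into memory, which matters once the table is large.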
Data Cleaning
Handling missing values
import pandas as pd
df = pd.read_csv("data.csv")
print(df.isnull().sum()) # Count missing values in each column
# The calls below are alternatives; pick the one that fits your data
df.dropna(inplace=True) # Remove rows with missing values
df.dropna(axis=1, inplace=True) # Remove columns with missing values
df.fillna(value="Unknown", inplace=True) # Replace NaNs with a default value
df["age"] = df["age"].fillna(df["age"].mean()) # Fill NaNs with column mean (assign back; inplace on a selected column is unreliable)
Handling Duplicates
print(df.duplicated().sum()) # Count duplicate rows
df.drop_duplicates(inplace=True) # Remove duplicate rows
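By default, a row counts as a duplicate only if every column matches an earlier row. The sketch below, on made-up data, also shows the `subset` and `keep` parameters for deduplicating on selected columns.

```python
import pandas as pd

df_dupes = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com", "a@example.com"],
    "plan": ["free", "pro", "free", "pro"],
})

# Rows that repeat an earlier row in every column.
full_dupes = df_dupes.duplicated().sum()

# Treat rows with the same email as duplicates and keep the last one seen.
by_email = df_dupes.drop_duplicates(subset=["email"], keep="last")

print(full_dupes)
print(by_email)
```

Using `subset` is useful when one column identifies the record and the other columns may legitimately differ between versions of it.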
Handling Incorrect Data
print(df[df["age"] < 0]) # Find negative age values (invalid data)
df.loc[df["age"] < 0, "age"] = df["age"].median() # Replace negatives with median
Data Manipulation & Transformation
Data Manipulation and Transformation are essential steps in the data analysis pipeline, allowing us to clean, reshape, and enhance raw data to make it more useful for analysis and visualization.
The Pandas library in Python provides powerful tools to perform these operations efficiently.
Data Manipulation
Data Manipulation refers to modifying, organizing, and filtering data to extract meaningful insights. It includes:
Example for Selecting and Filtering Data:
import pandas as pd
df = pd.read_csv("employees.csv")
# Select specific columns
df_filtered = df[["Name", "Salary"]]
# Filter employees earning more than 60,000
high_salary = df[df["Salary"] > 60000]
Data Transformation
Data Transformation involves converting data into a structured format suitable for analysis. It includes:
Example for transforming data:
# Adding a new column and grouping
# Convert monthly salary to annual salary
df["Annual_Salary"] = df["Salary"] * 12
# Group by Department and calculate the average salary
dept_salary = df.groupby("Department")["Salary"].mean()
Performance Optimization
When working with large datasets in Pandas, performance can become a bottleneck if operations are not optimized properly.
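Two common optimizations are sketched below on synthetic data: replacing row-by-row code with vectorized column arithmetic, and storing repetitive text in the category dtype. The column names and sizes are assumptions, and the actual savings depend on your data and pandas version.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)  # fixed seed so the data is reproducible
df_big = pd.DataFrame({
    "city": rng.choice(["Delhi", "Mumbai", "Pune"], size=n),
    "price": rng.uniform(10, 100, size=n),
    "qty": rng.integers(1, 10, size=n),
})

# Vectorized arithmetic works on whole columns at once,
# instead of calling a Python function per row with apply(axis=1).
df_big["total"] = df_big["price"] * df_big["qty"]

# A column with few distinct values is stored more compactly as a category.
before = df_big["city"].memory_usage(deep=True)
df_big["city"] = df_big["city"].astype("category")
after = df_big["city"].memory_usage(deep=True)

print(f"city column: {before} bytes before, {after} bytes after")
```

When reading files, options such as `usecols` and `dtype` in `pd.read_csv` apply the same ideas at load time by skipping unneeded columns and choosing compact types up front.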
Some Real-World Applications
Conclusion
The Pandas library is a powerful tool for handling, analyzing, and transforming data efficiently. Whether you're working with financial reports, healthcare records, e-commerce transactions, or machine learning datasets, Pandas provides a fast, flexible, and easy-to-use framework for data manipulation.
By mastering Pandas, you can:
With continuous advancements in data science and AI, Pandas remains an essential skill for anyone looking to excel in data analytics, engineering, or machine learning.