Pandas

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.

It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal.

It is built on top of NumPy, making it efficient for handling large datasets.


Main Features

  • Easy handling of missing data (represented as NaN, NA, or NaT) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Easy conversion of ragged, differently indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging
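A quick sketch of the automatic data alignment feature from the list above (sample data made up for illustration):

```python
import pandas as pd

# Two Series with the same labels in different orders
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "a"])

# pandas aligns on labels, not positions: "a" pairs with "a", and so on
total = s1 + s2
print(total)
# a    31
# b    22
# c    13
# dtype: int64
```

No manual reordering is needed; labels that appear in only one Series would simply produce NaN in the result.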


Installation and Importing Pandas

  • Installation

pip install pandas        

  • Importing pandas

import pandas as pd        


Data Structures

There are mainly two types of data structures in pandas:

  • Series
  • DataFrame

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

The axis labels are collectively referred to as the index.

The basic method to create a Series is to call:

s = pd.Series(data, index=index)        

Here, data can be many different things:

  • a Python dict
  • an ndarray
  • a scalar value (like 5)

import pandas as pd
pd.Series({'a': 1, 'b': 2})     # from a dict: keys become the index
pd.Series(5, index=['x', 'y'])  # from a scalar: broadcast across the index        

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

import pandas as pd
data = [1,2,3,4,5]
index = ['a','b','c','d','e']
ser = pd.Series(data, index=index)
print(ser)        


DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

In practice, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file.

A DataFrame can also be built in memory from lists, dictionaries, a list of dictionaries, and so on.

import pandas as pd

letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(letters)  # one unnamed column with a default integer index
print(df)
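A DataFrame can likewise be built from a dict of lists or a list of dicts; a minimal sketch with made-up sample data:

```python
import pandas as pd

# From a dict of lists: keys become column names
df_dict = pd.DataFrame({"Name": ["Asha", "Ravi"], "Age": [28, 34]})

# From a list of dicts: each dict becomes one row
df_rows = pd.DataFrame([
    {"Name": "Asha", "Age": 28},
    {"Name": "Ravi", "Age": 34},
])

print(df_dict)
```

Both constructions produce the same two-column, two-row DataFrame.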

Loading Data

  • Reading a CSV file

import pandas as pd

df = pd.read_csv("data.csv")  # Load a CSV file
print(df.head())  # Display first 5 rows        

  • Reading an Excel file

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")  # Load an Excel file        

  • Reading a JSON file

df = pd.read_json("data.json", orient="records")        

  • Reading from a SQL database

import sqlite3

conn = sqlite3.connect("database.db")  # Connect to SQLite
df = pd.read_sql("SELECT * FROM table_name", conn)  # Load data from SQL        

Data Cleaning

Handling missing values

import pandas as pd

df = pd.read_csv("data.csv")
print(df.isnull().sum())  # Count missing values in each column

df.dropna(inplace=True)          # Option 1: remove rows with missing values
df.dropna(axis=1, inplace=True)  # Option 2: remove columns with missing values

df.fillna(value="Unknown", inplace=True)        # Option 3: replace NaNs with a default value
df["age"] = df["age"].fillna(df["age"].mean())  # Option 4: fill NaNs with the column mean
        


Handling Duplicates

print(df.duplicated().sum())  # Count duplicate rows
df.drop_duplicates(inplace=True)  # Remove duplicate rows        


Handling Incorrect Data

print(df[df["age"] < 0])  # Find negative age values (invalid data)
valid_median = df.loc[df["age"] >= 0, "age"].median()  # Median of valid ages only
df.loc[df["age"] < 0, "age"] = valid_median  # Replace negatives with that median


Data Manipulation & Transformation

Data Manipulation and Transformation are essential steps in the data analysis pipeline, allowing us to clean, reshape, and enhance raw data to make it more useful for analysis and visualization.

The Pandas library in Python provides powerful tools to perform these operations efficiently.


Data Manipulation

Data Manipulation refers to modifying, organizing, and filtering data to extract meaningful insights. It includes:

  • Selecting specific columns or rows
  • Filtering data based on conditions
  • Sorting data
  • Adding or modifying columns
  • Handling missing values

Example for Selecting and Filtering Data:

import pandas as pd

df = pd.read_csv("employees.csv")

# Select specific columns
df_filtered = df[["Name", "Salary"]]

# Filter employees earning more than 60,000
high_salary = df[df["Salary"] > 60000]
        
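Sorting and adding columns follow the same pattern; a small sketch using an inline stand-in for employees.csv (column names assumed from the example above, bonus rate hypothetical):

```python
import pandas as pd

# Inline stand-in for employees.csv
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena"],
    "Salary": [55000, 72000, 64000],
})

# Sort by Salary, highest first
df_sorted = df.sort_values("Salary", ascending=False)

# Add a derived column (hypothetical 10% bonus)
df["Bonus"] = df["Salary"] * 0.10
print(df_sorted)
```

sort_values returns a new DataFrame by default, while the column assignment modifies df in place.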

Data Transformation

Data Transformation involves converting data into a structured format suitable for analysis. It includes:

  • Changing data types (e.g., string to datetime)
  • Aggregating and grouping data
  • Merging or concatenating datasets
  • Pivoting and reshaping data

Example for transforming data:

#Adding a New Column & Grouping

# Convert monthly salary to annual salary
df["Annual_Salary"] = df["Salary"] * 12

# Group by Department and calculate the average salary
dept_salary = df.groupby("Department")["Salary"].mean()        
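Merging and pivoting can be sketched the same way; the table contents and key column below are assumptions for illustration:

```python
import pandas as pd

employees = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena"],
    "DeptID": [1, 2, 1],
    "Salary": [55000, 72000, 64000],
})
departments = pd.DataFrame({
    "DeptID": [1, 2],
    "Department": ["Engineering", "Sales"],
})

# Merge on the shared DeptID key (like a SQL join)
merged = employees.merge(departments, on="DeptID")

# Pivot: average salary per department
pivot = merged.pivot_table(values="Salary", index="Department", aggfunc="mean")
print(pivot)
```

The merge attaches the Department name to every employee row, and the pivot table then aggregates salaries by that label.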

Performance Optimization

When working with large datasets in Pandas, performance can become a bottleneck if operations are not optimized properly.

  • Use efficient data types (category, int32, float32)
  • Prefer vectorized operations over apply() where possible
  • Use .loc[] and .iloc[] for fast label- and position-based selection
  • Avoid iterrows(); when iteration is unavoidable, itertuples() is much faster
  • Use faster file formats (Parquet over CSV)
  • Leverage swifter for parallel processing of heavy apply() workloads
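Two of these tips in a runnable sketch (random sample data; the exact memory savings will vary):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "city": np.random.choice(["Delhi", "Mumbai", "Chennai"], size=n),
    "sales": np.random.rand(n),
})

# 1. Category dtype: far less memory for low-cardinality string columns
before = df["city"].memory_usage(deep=True)
df["city"] = df["city"].astype("category")
after = df["city"].memory_usage(deep=True)
print(f"memory: {before:,} -> {after:,} bytes")

# 2. Vectorized arithmetic instead of a row-wise apply()
df["sales_pct"] = df["sales"] * 100  # runs in compiled code, not a Python loop
```

Category codes are stored as small integers with one copy of each label, which is why the second measurement is much smaller.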


Some Real-World applications

  • Financial Data Analysis
  • Healthcare & Medical Research
  • E-commerce & Customer Insights
  • Social Media & Sentiment Analysis
  • Supply Chain & Logistics
  • Sports Analytics
  • Machine Learning & AI


Conclusion

The Pandas library is a powerful tool for handling, analyzing, and transforming data efficiently. Whether you're working with financial reports, healthcare records, e-commerce transactions, or machine learning datasets, Pandas provides a fast, flexible, and easy-to-use framework for data manipulation.

By mastering Pandas, you can:

  • Clean and preprocess raw data for better insights
  • Manipulate & transform large datasets efficiently
  • Optimize performance for handling big data
  • Apply real-world use cases across industries like finance, healthcare, sports, and AI

With continuous advancements in data science and AI, Pandas remains an essential skill for anyone looking to excel in data analytics, engineering, or machine learning.



