Pandas Basics

Introduction

Pandas is a popular Python library used for data manipulation and analysis. It provides easy-to-use data structures and functions to work with structured data efficiently.

This tutorial covers the basics of Pandas, including its core data structures, how to create and manipulate data, and perform simple data analysis tasks.

Data is the new oil, and Pandas is the refinery.

What is Pandas?

Pandas is an open-source Python library designed to make data analysis and manipulation fast and easy. It builds on top of NumPy and provides two primary data structures: Series and DataFrame.

It is widely used in data science, machine learning, and any field that requires data cleaning, transformation, and analysis.

Built on top of NumPy for numerical operations
Offers powerful data structures: Series and DataFrame
Supports handling of missing data
Provides tools for reading/writing data from various formats

Core Data Structures

Pandas has two main data structures: Series and DataFrame. Understanding these is key to using Pandas effectively.

Comparison of Pandas Data Structures
Data Structure	Description	Use Case
Series	One-dimensional labeled array	Storing a single column or list of data with labels
DataFrame	Two-dimensional labeled data structure	Storing tabular data with rows and columns

Series

A Series is like a column in a spreadsheet or a SQL table. It is a one-dimensional array with labels called the index.

You can create a Series from a list, dictionary, or NumPy array.

Holds data of any type (integer, string, float, etc.)
Has an index to label each element
Supports vectorized operations
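The properties listed above can be seen in a minimal sketch. The variable names and values here are illustrative assumptions, not part of any dataset used in this tutorial.

```python
import pandas as pd

# A small Series whose index holds string labels (illustrative values).
prices = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Look up a single element by its index label.
price_b = prices["b"]

# Vectorized arithmetic: the multiplication applies to every element at once,
# and the result keeps the same index labels.
doubled = prices * 2

print(price_b)           # 20
print(doubled.tolist())  # [20, 40, 60]
```

Because the index travels with the data, the result of a vectorized operation can still be queried by label, for example doubled["c"].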

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
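As a rough illustration of "columns of potentially different types", the sketch below builds a small DataFrame with a text column, an integer column, and a float column. The column names and values are made up for the example.

```python
import pandas as pd

# Each column can hold a different type of data (illustrative values).
people = pd.DataFrame({
    "name": ["Alice", "Bob"],      # text
    "age": [25, 30],               # integers
    "height_m": [1.65, 1.80],      # floats
})

# dtypes reports the type Pandas inferred for each column.
print(people.dtypes)

# shape is (number of rows, number of columns).
print(people.shape)

# Selecting a single column returns a Series.
age_column = people["age"]
print(type(age_column).__name__)
```

Each column of a DataFrame is itself a Series, and all columns share the same row index.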

Creating Pandas Objects

Let's look at how to create Series and DataFrames using Pandas.

Creating a Series

You can create a Series by passing a list or dictionary to the pandas.Series() constructor.
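A short sketch of two other input types, a dictionary and a NumPy array, might look like the following. The names from_dict and from_array are chosen only for this example.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"x": 1.5, "y": 2.5})

# From a NumPy array: with no index given, Pandas assigns 0, 1, 2, ...
from_array = pd.Series(np.array([1, 2, 3]))

print(list(from_dict.index))   # ['x', 'y']
print(list(from_array.index))  # [0, 1, 2]
```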

Creating a DataFrame

DataFrames can be created from dictionaries of lists, lists of dictionaries, or by reading data from files.
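The list-of-dictionaries form can be sketched as below, where each dictionary is treated as one row. The records are illustrative.

```python
import pandas as pd

# Each dictionary describes one row; the keys become column names.
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_records = pd.DataFrame(records)

print(df_records)
print(list(df_records.columns))  # ['Name', 'Age']
```

This form is convenient when rows arrive one at a time, for example from parsed JSON objects, whereas a dictionary of lists is convenient when the data is already organized by column.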

Basic Operations with Pandas

Pandas provides many functions to explore and manipulate data easily.

Viewing data with head(), tail(), and info()
Selecting columns and rows using labels and positions
Filtering data based on conditions
Adding or removing columns
Handling missing data
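Several of the operations listed above can be sketched on one small DataFrame. The data and the AgeNextYear column are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# Viewing data: first rows, last rows, and a printed summary.
first_two = df.head(2)
last_one = df.tail(1)
df.info()

# Selecting by label with loc and by integer position with iloc.
by_label = df.loc[0, "Name"]   # row labeled 0, column 'Name'
by_position = df.iloc[2, 1]    # third row, second column

# Adding a column computed from an existing one.
df["AgeNextYear"] = df["Age"] + 1

# Removing a column; drop returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(columns=["AgeNextYear"])

print(by_label)               # Alice
print(by_position)            # 35
print(list(trimmed.columns))  # ['Name', 'Age']
```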

Examples

Creating a Pandas Series

import pandas as pd

# Create a Series from a list
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

This example creates a Series with custom index labels and prints it.

Creating a Pandas DataFrame

import pandas as pd

# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)

This example creates a DataFrame with two columns, 'Name' and 'Age', and prints it.

Selecting Data from a DataFrame

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# Select the 'Name' column
names = df['Name']
print(names)

# Select rows where Age > 28
older_than_28 = df[df['Age'] > 28]
print(older_than_28)

This example shows how to select a column and filter rows based on a condition.

Best Practices

Always import pandas as pd for consistency.
Use descriptive column names for clarity.
Check data with head() and info() before processing.
Handle missing data explicitly using fillna() or dropna().
Use vectorized operations instead of loops for better performance.
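To make the last point concrete, here is a hedged sketch that computes the same column twice, once with a row-by-row loop and once with a vectorized expression. The price and quantity columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 3],
})

# Loop version: works, but processes one row at a time in Python.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Vectorized version: one expression over whole columns.
df["total"] = df["price"] * df["quantity"]

print(df["total"].tolist())  # [10.0, 40.0, 90.0]
```

Both approaches give the same numbers here; the vectorized form is shorter and typically much faster on large DataFrames.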

Common Mistakes

Confusing DataFrame and Series objects.
Forgetting to reset the index after filtering or sorting, which leaves gaps or out-of-order labels.
Using loops instead of vectorized operations.
Ignoring missing data, which can cause errors or skew results.
Modifying a filtered or sliced DataFrame without first creating an explicit copy with copy().
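Two of these mistakes, skipping the copy and leaving a stale index after filtering, can be avoided as in the sketch below. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# Take an explicit, independent copy of the filtered rows before modifying them.
over_28 = df[df["Age"] > 28].copy()
over_28["Age"] = over_28["Age"] + 1

# The original DataFrame is untouched.
print(df["Age"].tolist())        # [25, 30, 35]
print(over_28["Age"].tolist())   # [31, 36]

# Filtering keeps the original row labels (1 and 2 here);
# reset_index(drop=True) renumbers them from 0.
renumbered = over_28.reset_index(drop=True)
print(list(over_28.index))       # [1, 2]
print(list(renumbered.index))    # [0, 1]
```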

Hands-on Exercise

Create and Inspect a DataFrame

Create a DataFrame with columns 'City' and 'Population' using data of your choice. Display the first few rows and the summary info.

Expected output: Printed output showing the first rows and summary information of the DataFrame.

Hint: Use pd.DataFrame() and methods head() and info().

Filter Data in a DataFrame

Using the DataFrame created, filter and display rows where the population is greater than a specified value.

Expected output: Subset of the DataFrame with rows matching the condition.

Hint: Use boolean indexing with a condition on the 'Population' column.

Interview Questions

What are the main data structures in Pandas?

The main data structures in Pandas are Series, which is a one-dimensional labeled array, and DataFrame, which is a two-dimensional labeled data structure with columns of potentially different types.

How do you select a column from a DataFrame?

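You can select a column from a DataFrame using bracket notation like df['column_name'], or dot notation like df.column_name if the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. Bracket notation always works, so it is the safer default.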
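A small sketch of where the two notations agree and where only brackets work. The column names "first name" and "count" are chosen deliberately to show the limits of dot notation.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30],
    "first name": ["Alice", "Bob"],  # contains a space
    "count": [1, 2],                 # same name as the DataFrame.count method
})

# Both notations return the same Series for a simple column name.
ages_bracket = df["Age"]
ages_dot = df.Age
print(ages_bracket.equals(ages_dot))  # True

# A name with a space is not a valid identifier, so only brackets work.
first_names = df["first name"]

# df.count refers to the method, not the column, so use brackets here too.
counts = df["count"]
print(callable(df.count))  # True
print(counts.tolist())     # [1, 2]
```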

Summary

Pandas is a powerful Python library for data analysis, providing easy-to-use data structures like Series and DataFrame.

Understanding how to create and manipulate these structures is essential for effective data handling.

Basic operations such as selecting, filtering, and inspecting data enable efficient data analysis workflows.

FAQ

What is the difference between a Series and a DataFrame?

A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled data structure with rows and columns.

How do I handle missing data in Pandas?

You can handle missing data using methods like fillna() to fill missing values or dropna() to remove rows or columns with missing data.
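As a minimal sketch, assuming a small DataFrame with one missing population value (the city labels and numbers are placeholders, not real figures):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["A", "B", "C"],
    "Population": [100.0, np.nan, 300.0],
})

# Count missing values per column before deciding what to do.
missing_counts = df.isna().sum()
print(missing_counts["Population"])   # 1

# Option 1: fill missing values in a chosen column with a default.
filled = df.fillna({"Population": 0})
print(filled["Population"].tolist())  # [100.0, 0.0, 300.0]

# Option 2: drop any row that contains a missing value.
dropped = df.dropna()
print(dropped["City"].tolist())       # ['A', 'C']
```

Both methods return a new DataFrame and leave the original unchanged. Which option is appropriate depends on what a missing value means in your data.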

Can Pandas read data from files?

Yes, Pandas can read data from various sources including CSV, Excel, and JSON files and SQL databases, using functions like read_csv(), read_excel(), read_json(), and read_sql().
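The sketch below reads CSV data with read_csv(). To keep it self-contained, it reads from an in-memory string through io.StringIO instead of a file on disk. The CSV content is a made-up placeholder, and in practice you would pass a file path instead.

```python
import io

import pandas as pd

# Placeholder CSV content; with a real file you would pass its path to read_csv().
csv_text = "City,Population\nAlpha,100\nBeta,250\n"

df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))        # ['City', 'Population']
print(df["Population"].sum())  # 350
```

The first line of the CSV is used as the column names by default, and numeric columns are parsed as numbers automatically.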