Pandas Basics
Introduction
Pandas is a popular Python library used for data manipulation and analysis. It provides easy-to-use data structures and functions to work with structured data efficiently.
This tutorial covers the basics of Pandas, including its core data structures, how to create and manipulate data, and perform simple data analysis tasks.
Data is the new oil, and Pandas is the refinery.
What is Pandas?
Pandas is an open-source Python library designed to make data analysis and manipulation fast and easy. It builds on top of NumPy and provides two primary data structures: Series and DataFrame.
It is widely used in data science, machine learning, and any field that requires data cleaning, transformation, and analysis.
- Built on top of NumPy for numerical operations
- Offers powerful data structures: Series and DataFrame
- Supports handling of missing data
- Provides tools for reading/writing data from various formats
Core Data Structures
Pandas has two main data structures: Series and DataFrame. Understanding these is key to using Pandas effectively.
| Data Structure | Description | Use Case |
|---|---|---|
| Series | One-dimensional labeled array | Storing a single column or list of data with labels |
| DataFrame | Two-dimensional labeled data structure | Storing tabular data with rows and columns |
Series
A Series is like a column in a spreadsheet or a SQL table. It is a one-dimensional array with labels called the index.
You can create a Series from a list, dictionary, or NumPy array.
- Holds data of any type (integer, string, float, etc.)
- Has an index to label each element
- Supports vectorized operations
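As a minimal sketch of the properties listed above, the snippet below builds a small Series and uses its index and vectorized operations. The product labels and prices are made-up values chosen only for illustration.

```python
import pandas as pd

# Hypothetical prices, labeled by product name (illustrative values)
prices = pd.Series([10.0, 20.0, 30.0], index=["apple", "banana", "cherry"])

# The index lets you look up an element by its label
banana_price = prices["banana"]

# Vectorized arithmetic applies to every element at once, with no explicit loop
discounted = prices * 0.5

# Vectorized comparison returns a boolean Series that keeps the same index
is_expensive = prices > 15

print(banana_price)
print(discounted)
print(is_expensive)
```

Note that the results of both operations keep the original labels, which is what distinguishes a Series from a plain list.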
DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
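To make "columns of potentially different types" concrete, here is a small sketch with one text column, one integer column, and one float column. The column names and values are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data: each column holds a different kind of value
people = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "height_m": [1.65, 1.80],
})

# Each column has its own dtype
print(people.dtypes)

# shape is (number of rows, number of columns)
print(people.shape)

# Selecting a single column gives back a Series
age_column = people["age"]
print(type(age_column))
```

One way to think about it: a DataFrame behaves like a collection of Series that share the same row index.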
Creating Pandas Objects
Let's look at how to create Series and DataFrames using Pandas.
Creating a Series
You can create a Series by passing a list or dictionary to the pandas.Series() constructor.
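The sketch below shows how the index is chosen for different inputs to the constructor. The variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array: a default integer index (0, 1, 2, ...) is assigned
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a list, with an optional name for the Series
from_list = pd.Series([10, 20, 30], name="scores")

print(from_dict)
print(from_array)
print(from_list)
```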
Creating a DataFrame
DataFrames can be created from dictionaries of lists, lists of dictionaries, or by reading data from files.
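As one illustration of the list-of-dictionaries form, the sketch below treats each dictionary as a row. The records are made up, and the third one deliberately omits a key to show what happens to the missing value.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
    {"Name": "Charlie"},  # no "Age" key: the value is stored as NaN
]

df = pd.DataFrame(records)
print(df)
```

Because of the missing value, the "Age" column is stored as floating point rather than integer.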
Basic Operations with Pandas
Pandas provides many functions to explore and manipulate data easily.
- Viewing data with head(), tail(), and info()
- Selecting columns and rows using labels and positions
- Filtering data based on conditions
- Adding or removing columns
- Handling missing data
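Several of the operations listed above can be sketched on one small DataFrame. The data and the "AgeNextYear" column are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 35, 40],
})

# Viewing data
first_two = df.head(2)   # first two rows
last_one = df.tail(1)    # last row
df.info()                # prints column names, dtypes, and non-null counts

# Selecting by label (loc) and by position (iloc)
age_by_label = df.loc[1, "Age"]    # row labeled 1, column "Age"
age_by_position = df.iloc[1, 1]    # second row, second column

# Adding a column computed from an existing one
df["AgeNextYear"] = df["Age"] + 1

# Removing a column; drop() returns a new DataFrame and leaves df unchanged
df_without_age = df.drop(columns=["Age"])
print(df_without_age)
```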
Examples
import pandas as pd
# Create a Series from a list
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

This example creates a Series with custom index labels and prints it.
import pandas as pd
# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)

This example creates a DataFrame with two columns, 'Name' and 'Age', and prints it.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Select the 'Name' column
names = df['Name']
print(names)
# Select rows where Age > 28
older_than_28 = df[df['Age'] > 28]
print(older_than_28)

This example shows how to select a column and filter rows based on a condition.
Best Practices
- Always import pandas as pd for consistency.
- Use descriptive column names for clarity.
- Check data with head() and info() before processing.
- Handle missing data explicitly using fillna() or dropna().
- Use vectorized operations instead of loops for better performance.
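To illustrate the last point, the sketch below computes the same result twice: once with a row-by-row loop and once with a vectorized expression. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Loop version: works, but handles one row at a time in Python
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Vectorized version: one expression over whole columns
df["total"] = df["price"] * df["quantity"]

print(totals_loop)
print(df["total"].tolist())
```

Both approaches give the same numbers; the vectorized form is shorter and typically much faster on large data.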
Common Mistakes
- Confusing DataFrame and Series objects.
- Not setting or resetting the index properly.
- Using loops instead of vectorized operations.
- Ignoring missing data, which can cause errors or misleading results.
- Modifying a DataFrame without creating a copy when needed.
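The copy-related mistake can be sketched as follows. This is a minimal illustration with made-up data, showing that plain assignment creates a second name for the same object while copy() creates an independent one.

```python
import pandas as pd

original = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Plain assignment does not copy: both names refer to the same DataFrame
alias = original
alias["Age"] = alias["Age"] + 1   # this also changes `original`

# An explicit copy is independent of the original
independent = original.copy()
independent["Age"] = 0            # `original` is not affected

print(original)
print(independent)
```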
Hands-on Exercise
Create and Inspect a DataFrame
Create a DataFrame with columns 'City' and 'Population' using data of your choice. Display the first few rows and the summary info.
Expected output: Printed output showing the first rows and summary information of the DataFrame.
Hint: Use pd.DataFrame() and methods head() and info().
Filter Data in a DataFrame
Using the DataFrame created, filter and display rows where the population is greater than a specified value.
Expected output: Subset of the DataFrame with rows matching the condition.
Hint: Use boolean indexing with a condition on the 'Population' column.
Interview Questions
What are the main data structures in Pandas?
The main data structures in Pandas are Series, which is a one-dimensional labeled array, and DataFrame, which is a two-dimensional labeled data structure with columns of potentially different types.
How do you select a column from a DataFrame?
You can select a column from a DataFrame using bracket notation like df['column_name'], or dot notation like df.column_name if the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.
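A short sketch of the two notations and where dot notation falls short. The column names "unit price" and "count" are chosen deliberately for this example to show the edge cases.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "unit price": [1.5, 2.5],   # contains a space
    "count": [3, 4],            # same name as the DataFrame.count method
})

# Both notations return the same Series for a simple column name
names_bracket = df["Name"]
names_dot = df.Name

# A name with a space can only be reached with brackets
unit_price = df["unit price"]

# df.count refers to the DataFrame method, not the column
count_column = df["count"]
count_method = df.count

# Double brackets return a one-column DataFrame instead of a Series
as_frame = df[["Name"]]

print(type(names_bracket), type(as_frame))
```

For this reason, bracket notation is the safer default.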
Summary
Pandas is a powerful Python library for data analysis, providing easy-to-use data structures like Series and DataFrame.
Understanding how to create and manipulate these structures is essential for effective data handling.
Basic operations such as selecting, filtering, and inspecting data enable efficient data analysis workflows.
FAQ
What is the difference between a Series and a DataFrame?
A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled data structure with rows and columns.
How do I handle missing data in Pandas?
You can handle missing data using methods like fillna() to fill missing values or dropna() to remove rows or columns with missing data.
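As a minimal sketch, the example below counts missing values, fills them with the column mean, and drops incomplete rows. The city names and temperatures are made-up values, and filling with the mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing temperature
df = pd.DataFrame({
    "City": ["Oslo", "Lima", "Cairo"],
    "Temp": [4.0, np.nan, 30.0],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# Fill missing temperatures with the mean of the known values
filled = df.fillna({"Temp": df["Temp"].mean()})

# Or drop every row that contains a missing value
dropped = df.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Both fillna() and dropna() return new DataFrames here, so df itself still contains the missing value.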
Can Pandas read data from files?
Yes, Pandas can read data from various sources, including CSV, Excel, and JSON files and SQL databases, using functions like read_csv(), read_excel(), read_json(), and read_sql().
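The sketch below reads CSV data with read_csv(). To keep it self-contained, it reads from an in-memory string through io.StringIO instead of a file on disk; with a real file you would pass the file path instead. The CSV content is made up for illustration.

```python
import io

import pandas as pd

# CSV content held in memory, standing in for a file on disk
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv with no path returns the CSV as a string
round_trip = df.to_csv(index=False)
print(round_trip)
```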
