Python’s Pandas library is a cornerstone for data analysis and manipulation. Understanding its core data structures is essential for anyone looking to harness its power. This article covers the primary data structures in Pandas, with short examples on small sample datasets for practical understanding.
Core Data Structures in Pandas
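Current versions of Pandas offer two primary data structures: Series and DataFrame. A third, Panel, was deprecated in version 0.20.0 and removed in version 0.25.0; it is covered here because it still appears in older code. Each is designed for specific data manipulation tasks.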
1. Pandas Series
A Series is a one-dimensional array-like structure designed to store a single array of data along with an associated index.
Characteristics:
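- Homogeneous data: every element of a Series shares one dtype
- Size-immutable but values are mutable
- Supports many data types (integers, floats, strings, arbitrary Python objects), one per Series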
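The characteristics above can be checked with a few lines. This is a minimal sketch; the variable names (`ints`, `mixed_numeric`, `mixed`) are illustrative and not part of the article's own example.

```python
import pandas as pd

# One dtype per Series: all-integer input gives an integer dtype.
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# Mixing an int with a float upcasts every value to a float dtype.
mixed_numeric = pd.Series([1, 2.5, 3])
print(mixed_numeric.dtype)

# Mixing numbers and strings falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)

# Values are mutable in place; the length stays the same.
ints.iloc[0] = 10
print(ints.tolist())  # [10, 2, 3]
```

Using `iloc` for the assignment keeps the update explicitly position-based, which avoids any ambiguity between integer labels and integer positions.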
import pandas as pd
series_data = pd.Series([100, 200, 300, 400, 500],
                        index=['apple', 'banana', 'grape', 'orange', 'pear'])
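As a short sketch of how such a Series is typically used, the block below rebuilds the same `series_data` and shows label-based access, position-based access, filtering, and an in-place value update. The names `by_label`, `by_position`, `above_250`, and `doubled` are introduced here for illustration only.

```python
import pandas as pd

series_data = pd.Series([100, 200, 300, 400, 500],
                        index=['apple', 'banana', 'grape', 'orange', 'pear'])

by_label = series_data['banana']            # label-based lookup: 200
by_position = series_data.iloc[0]           # position-based lookup: 100
above_250 = series_data[series_data > 250]  # boolean mask keeps grape, orange, pear

# Arithmetic is vectorized and returns a new Series; series_data is unchanged.
doubled = series_data * 2

# Values can be updated in place through the label.
series_data.loc['apple'] = 150
print(series_data['apple'])  # 150
```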
2. Pandas DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes.
Characteristics:
- Different column types
- Size-mutable
- Labeled axes (rows and columns)
data = {
    'Name': ['Sachin', 'Manju', 'Ram', 'Raju', 'David', 'Wilson'],
    'Age': [30, 25, 40, 35, 55, 50],
    'City': ['Mumbai', 'Bangalore', 'Chennai', 'Delhi', 'New York', 'San Francisco']
}
data_frame = pd.DataFrame(data)
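The three DataFrame characteristics listed above can be illustrated on the same data. This is a sketch rather than a complete tour of the API; the `AgeNextYear` column and the `over_35` variable are made up for the example.

```python
import pandas as pd

data_frame = pd.DataFrame({
    'Name': ['Sachin', 'Manju', 'Ram', 'Raju', 'David', 'Wilson'],
    'Age': [30, 25, 40, 35, 55, 50],
    'City': ['Mumbai', 'Bangalore', 'Chennai', 'Delhi', 'New York', 'San Francisco'],
})

# Different column types: each column keeps its own dtype.
print(data_frame.dtypes)

# Size-mutable: assigning a new column changes the shape of the same object.
# 'AgeNextYear' is a hypothetical column used only for illustration.
data_frame['AgeNextYear'] = data_frame['Age'] + 1
print(data_frame.shape)  # (6, 4)

# Labeled axes: filter rows with a condition and pick columns by name.
over_35 = data_frame.loc[data_frame['Age'] > 35, ['Name', 'City']]
print(over_35)
```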
3. Pandas Panel
Panel was a less commonly used three-dimensional data structure in Pandas that acted as a container for DataFrame objects. It is available only in older releases (before 0.25.0), so the description below applies to legacy code.
Characteristics:
- 3D container for data
- Items (DataFrame objects), major axis (rows), and minor axis (columns)
- Suitable for some complex data manipulation tasks
Example:
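# Legacy example: runs only on Pandas versions before 0.25.0, where pd.Panel still exists
import numpy as np

data = np.random.rand(2, 4, 5)  # Random 3D data: 2 items x 4 rows x 5 columns
panel_example = pd.Panel(data, items=['Item1', 'Item2'],
                         major_axis=pd.date_range('20200101', periods=4),
                         minor_axis=['A', 'B', 'C', 'D', 'E'])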
Although Panels were useful for certain kinds of three-dimensional data, they added complexity and saw limited use, and Pandas has since dropped them in favor of better alternatives: a DataFrame with a MultiIndex, which represents the extra dimension as an additional index level, or xarray, a library better suited for working with multi-dimensional data.
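The multi-indexing alternative mentioned above can be sketched as follows. The block stores data of the same shape as the Panel example (2 items, 4 dates, 5 columns) in an ordinary DataFrame whose rows carry a two-level index. The level names `item` and `date` and the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

items = ['Item1', 'Item2']
dates = pd.date_range('20200101', periods=4)
columns = ['A', 'B', 'C', 'D', 'E']

values = np.random.rand(2, 4, 5)  # items x dates x columns

# Fold the first two dimensions (item, date) into a two-level row index.
row_index = pd.MultiIndex.from_product([items, dates], names=['item', 'date'])
stacked = pd.DataFrame(values.reshape(len(items) * len(dates), len(columns)),
                       index=row_index, columns=columns)

# Selecting one item gives back an ordinary 4 x 5 DataFrame.
item1 = stacked.loc['Item1']
print(item1.shape)  # (4, 5)

# xs() selects across the other level: one date for every item.
first_day = stacked.xs(dates[0], level='date')
print(first_day.shape)  # (2, 5)
```

The reshape works because `MultiIndex.from_product` enumerates (item, date) pairs in the same row-major order that NumPy uses when flattening the first two axes.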