.groupby('...')['...']

The groupby method in pandas splits a DataFrame into groups based on one or more keys, applies an operation to each group independently, and combines the results.

Aggregating with Custom Functions
We can apply our own functions to groups. This is particularly useful for more complex aggregations beyond the standard sum, mean, etc.

grouped = df.groupby('column_name')
grouped.agg(lambda x: (x.max() - x.min())) # Custom range function
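As a minimal sketch of the custom range aggregation above, the following uses a small made-up DataFrame. The data and the "value" column are assumptions for illustration only, and a named function stands in for the lambda so that the output is easier to read.

```python
import pandas as pd

# Hypothetical sample data; the values and the "value" column are illustrative.
df = pd.DataFrame({
    "column_name": ["a", "a", "b", "b", "b"],
    "value": [1, 4, 10, 7, 12],
})

def value_range(x):
    """Return the spread between the largest and smallest value in a group."""
    return x.max() - x.min()

# Select the numeric column first so the function is not applied to text columns.
ranges = df.groupby("column_name")["value"].agg(value_range)
print(ranges)
```

Selecting the column before calling agg keeps the custom function away from non-numeric columns, where subtraction would fail.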

Transforming Data
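The transform method returns a result with the same index as the input, so it can be assigned straight back to the original DataFrame. It is useful for normalization or for filling NA values within groups.

normalize = lambda x: (x - x.mean()) / x.std()
# Select a single column so that the result is a Series that fits one new column
df['normalized'] = df.groupby('group')['data'].transform(normalize)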
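The sketch below illustrates both uses mentioned above on a small made-up DataFrame: normalizing within groups and filling missing values with the group mean. The column names "data" and "raw" and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; "raw" contains missing values on purpose.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "data": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "raw": [1.0, None, 3.0, 10.0, 20.0, None],
})

# Z-score each value relative to its own group (sample standard deviation).
normalize = lambda x: (x - x.mean()) / x.std()
df["normalized"] = df.groupby("group")["data"].transform(normalize)

# Fill each missing value with the mean of the group it belongs to.
group_means = df.groupby("group")["raw"].transform("mean")
df["filled"] = df["raw"].fillna(group_means)

print(df)
```

Because transform keeps the original index, both results line up row by row with df and can be assigned as new columns.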

For instance, groupby is also commonly used to aggregate rows that share duplicate values in a column.

Filtering Groups
Using filter, we can drop entire groups that do not meet a given criterion. For example, we can keep only the groups that contain more than one row.

df_filtered = df.groupby('group').filter(lambda x: len(x) > 1)
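A minimal sketch of this filter on made-up data follows. The values are assumptions for illustration; group "b" has a single row and is therefore dropped.

```python
import pandas as pd

# Hypothetical sample data: group "b" has only one row.
df = pd.DataFrame({
    "group": ["a", "a", "b", "c", "c", "c"],
    "data": [1, 2, 3, 4, 5, 6],
})

# The function receives each group as a DataFrame and must return True or False.
df_filtered = df.groupby("group").filter(lambda x: len(x) > 1)
print(df_filtered)
```

Note that filter returns the surviving rows of the original DataFrame, with their original index, rather than one row per group.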

Applying Multiple Aggregation Functions
We can perform multiple aggregation operations in a single step using agg() with a list of operations or a dictionary mapping columns to operations.

df.groupby('group').agg({'data1': ['mean', 'min'], 'data2': 'sum'})
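The sketch below runs the dictionary form on a small made-up DataFrame to show the shape of the result. The values are assumptions for illustration. Because one column receives a list of functions, the resulting columns form a MultiIndex; the last step shows one possible way to flatten it.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "data1": [1.0, 3.0, 5.0],
    "data2": [10, 20, 30],
})

result = df.groupby("group").agg({"data1": ["mean", "min"], "data2": "sum"})
print(result.columns.tolist())
# Column labels are tuples such as ("data1", "mean").

# One way to flatten the two-level column labels into plain strings.
flat = result.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat)
```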

Grouping by Index Levels
If the DataFrame has a MultiIndex, we can group by one of the levels.

df.groupby(level=0).sum()
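As a small sketch, the following builds a two-level index and groups by each level in turn. The level names "outer" and "inner" and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-level index.
index = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1), ("y", 2)],
    names=["outer", "inner"],
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

# Group by position of the level ...
by_outer = df.groupby(level=0).sum()
# ... or by the name of the level.
by_inner = df.groupby(level="inner").sum()

print(by_outer)
print(by_inner)
```

Referring to a level by name tends to be more robust than by position when the index structure may change.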

Named Aggregation
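To avoid a MultiIndex on the columns after aggregation, we can use named aggregation, which produces flat columns with names of our choosing. Each keyword is the output column name, and its value is a (column, function) pair.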

df.groupby('group').agg(
    mean_data1=('data1', 'mean'),
    sum_data2=('data2', 'sum')
)
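The following sketch applies named aggregation to made-up data. The values and the extra output column "range_data1" are assumptions for illustration; the latter shows that a custom function can be used in the same way as a string name.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "data1": [1.0, 3.0, 5.0],
    "data2": [10, 20, 30],
})

summary = df.groupby("group").agg(
    mean_data1=("data1", "mean"),
    sum_data2=("data2", "sum"),
    range_data1=("data1", lambda s: s.max() - s.min()),  # custom function
)
print(summary)
```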

Grouping by Functions or Mappings
Grouping is not limited to columns. We can also group by a function that is applied to each index label, or by a mapping (such as a dictionary) from index labels to group labels.

df.groupby(lambda x: 'odd' if x % 2 else 'even')
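The sketch below shows both variants on a made-up DataFrame with an integer index. The mapping and its labels "first_half" and "second_half" are hypothetical and serve only to illustrate the dictionary form.

```python
import pandas as pd

# Hypothetical sample data with a default-style integer index.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[0, 1, 2, 3])

# The function is called once per index label.
by_parity = df.groupby(lambda x: "odd" if x % 2 else "even").sum()

# The dictionary maps index labels to group labels.
mapping = {0: "first_half", 1: "first_half", 2: "second_half", 3: "second_half"}
by_mapping = df.groupby(mapping).sum()

print(by_parity)
print(by_mapping)
```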

Grouping by Time Periods
When working with time series data, we can group by time periods (such as month or quarter) using pd.Grouper. The column named in key must have a datetime dtype.

df.groupby(pd.Grouper(key='date', freq='M')).sum() # Group by month end; pandas 2.2+ prefers freq='ME'
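A minimal sketch of monthly grouping on made-up data follows. The "sales" column and the dates are assumptions for illustration. The month-start alias 'MS' is used here instead of 'M' only because it behaves the same across recent pandas versions; it labels each group with the first day of the month.

```python
import pandas as pd

# Hypothetical sample data; "date" must be a datetime column.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10", "2024-03-15"]),
    "sales": [100, 150, 200, 50],
})

# Group rows into calendar months and sum the sales in each month.
monthly = df.groupby(pd.Grouper(key="date", freq="MS"))["sales"].sum()
print(monthly)
```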

Categorical Data Optimization
Grouping by categorical columns is more efficient than grouping by object columns, so consider converting strings to categories if grouping by string fields.

df['column'] = df['column'].astype('category')
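As a small sketch on made-up data, the conversion and a subsequent grouping might look like this. The values are assumptions for illustration. The observed=True argument restricts the result to categories that actually occur in the data, which matters when a categorical column defines categories that are not present.

```python
import pandas as pd

# Hypothetical sample data with a repeated string column.
df = pd.DataFrame({
    "column": ["red", "blue", "red", "green", "blue"],
    "value": [1, 2, 3, 4, 5],
})

# Store the strings as a categorical (integer codes plus a lookup table).
df["column"] = df["column"].astype("category")

# observed=True keeps only categories that appear in the data.
totals = df.groupby("column", observed=True)["value"].sum()
print(totals)
```

The gain depends on the data: it tends to be largest when a column has many rows but few distinct values.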

SQL-like Window Functions
Pandas supports window functions, allowing for operations like running totals or moving averages within groups.

df.groupby('group')['data'].expanding().mean()
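The sketch below shows three window-style operations within groups on made-up data: an expanding mean, a running total, and a two-row moving average. The column names "running_total" and "moving_avg" and the window size are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "data": [1.0, 2.0, 3.0, 10.0, 20.0],
})

# Expanding mean per group; the result is indexed by (group, original index).
running_mean = df.groupby("group")["data"].expanding().mean()

# Running total per group, aligned with the original rows.
df["running_total"] = df.groupby("group")["data"].cumsum()

# Two-row moving average per group; min_periods=1 avoids NaN on the first row.
df["moving_avg"] = df.groupby("group")["data"].transform(
    lambda s: s.rolling(2, min_periods=1).mean()
)

print(running_mean)
print(df)
```

The expanding result carries a two-level index, whereas cumsum and transform keep the original index and can be assigned back as columns directly.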

Group the rows in which the columns 'Name', 'Z', and 'Grades' are empty (code):

===========================================

.groupby('...')['...']. Code:

.groupby('...')['...']

===========================================

Group the DataFrame by columns "X", "Y", and "Z" with a special rule for column "X" (grouping based on the portion of the strings before the dots "."). Code:

In this script, we add a step to preprocess the "X" column by extracting the portion before the dot and storing it in a new column. Then, we use this new column along with "Y" and "Z" for the grouping.
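The preprocessing step described above can be sketched as follows. This is an illustration under stated assumptions, not the original script: the sample data, the helper column name "X_prefix", the "value" column, and the choice of sum as the aggregation are all hypothetical.

```python
import pandas as pd

# Hypothetical sample data; "X" holds strings with a dot-separated suffix.
df = pd.DataFrame({
    "X": ["a.1", "a.2", "b.1", "b.2"],
    "Y": ["p", "p", "q", "q"],
    "Z": [1, 1, 2, 2],
    "value": [10, 20, 30, 40],
})

# Keep only the portion of "X" before the first dot in a new helper column.
df["X_prefix"] = df["X"].str.split(".", n=1).str[0]

# Group by the helper column together with "Y" and "Z".
result = df.groupby(["X_prefix", "Y", "Z"], as_index=False)["value"].sum()
print(result)
```

Storing the prefix in a separate column leaves the original "X" values untouched, so they remain available after the grouping.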

Python Automation and Machine Learning for EM and ICs

An Online Book, Second Edition by Dr. Yougui Liao (2024)
