| The groupby method in pandas is a powerful tool for data analysis: it splits data into groups and applies operations to each group independently.
- Aggregating with Custom Functions
We can apply our own functions to groups. This is particularly useful for more complex aggregations beyond the standard sum, mean, etc.
grouped = df.groupby('column_name')
grouped.agg(lambda x: x.max() - x.min())  # Custom range (max minus min) per group; the aggregated columns must be numeric
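As a minimal runnable sketch of the same idea, the range can also be written as a named function, which keeps the intent readable. The sample data and the names `value` and `value_range` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; only 'column_name' comes from the text above.
df = pd.DataFrame({
    "column_name": ["a", "a", "b", "b"],
    "value": [1, 5, 10, 4],
})


def value_range(x):
    """Spread between the largest and smallest value in one group."""
    return x.max() - x.min()


# Selecting the numeric column first keeps non-numeric columns out of the aggregation.
ranges = df.groupby("column_name")["value"].agg(value_range)
```

Here `ranges` is a Series indexed by the group labels, with one range per group.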
-
Transforming Data
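The transform method returns an object indexed like the input, with one result per original row, so it can be assigned straight back to the DataFrame. It's useful for normalization or filling NA values within groups.
normalize = lambda x: (x - x.mean()) / x.std()
df['normalized'] = df.groupby('group')['data'].transform(normalize)  # Select a single column so the result is a Series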
For instance, transform('size') labels every row with the number of rows in its group, which flags duplicated keys without collapsing the rows.
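The following sketch shows the within-group normalization end to end, assuming a small hypothetical frame with a `group` key and a numeric `data` column. It also broadcasts a group mean back onto each row, which is the other common use of transform.

```python
import pandas as pd

# Hypothetical sample data; 'group' and 'data' are illustrative column names.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "data": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})


def normalize(x):
    """Z-score a Series using its own mean and sample standard deviation."""
    return (x - x.mean()) / x.std()


# Selecting one column makes transform return a Series aligned with df.index.
df["normalized"] = df.groupby("group")["data"].transform(normalize)

# A built-in aggregation name broadcasts the group statistic to every row.
df["group_mean"] = df.groupby("group")["data"].transform("mean")
```

The frame keeps its six rows; each group is scaled against its own mean and standard deviation rather than the global ones.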
-
Filtering Groups
Using filter, we can drop entire groups that do not meet a criterion. For example, keeping only the groups that contain more than one row:
df_filtered = df.groupby('group').filter(lambda x: len(x) > 1)
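A small self-contained sketch of the filter above, using hypothetical data in which group "b" has a single row and is therefore removed:

```python
import pandas as pd

# Hypothetical sample data; group "b" has only one row.
df = pd.DataFrame({
    "group": ["a", "a", "b", "c", "c", "c"],
    "data": [1, 2, 3, 4, 5, 6],
})

# The callable receives each group's rows and must return a single boolean.
df_filtered = df.groupby("group").filter(lambda g: len(g) > 1)
```

The surviving rows keep their original index labels and order, so the result is a subset of `df` rather than an aggregated table.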
-
Applying Multiple Aggregation Functions
We can perform multiple aggregation operations in a single step using agg() with a list of operations or a dictionary mapping columns to operations.
df.groupby('group').agg({'data1': ['mean', 'min'], 'data2': 'sum'})
-
Grouping by Index Levels
If the DataFrame has a MultiIndex, we can group by one of the levels.
df.groupby(level=0).sum()
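As an illustration, the sketch below builds a two-level index and groups on each level in turn. The level names `outer` and `inner` are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical two-level index; the level names are illustrative.
index = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1), ("y", 2)],
    names=["outer", "inner"],
)
df = pd.DataFrame({"data": [1, 2, 3, 4]}, index=index)

# A level can be given by position...
by_position = df.groupby(level=0).sum()

# ...or by name, which tends to be easier to read.
by_name = df.groupby(level="inner").sum()
```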
-
Named Aggregation
To avoid a MultiIndex on the columns after aggregation, we can use named aggregation, which produces flat column names.
df.groupby('group').agg(
    mean_data1=('data1', 'mean'),
    sum_data2=('data2', 'sum')
)
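To make the difference concrete, this sketch runs the dictionary form and the named form side by side on a small hypothetical frame and leaves both results available for comparison.

```python
import pandas as pd

# Hypothetical sample data reusing the column names from the text.
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "data1": [1.0, 3.0, 5.0],
    "data2": [10, 20, 30],
})

# Dictionary of lists: the result columns form a two-level MultiIndex.
nested = df.groupby("group").agg({"data1": ["mean", "min"], "data2": ["sum"]})

# Named aggregation: each keyword becomes one flat output column.
flat = df.groupby("group").agg(
    mean_data1=("data1", "mean"),
    sum_data2=("data2", "sum"),
)
```

`nested.columns` holds tuples such as `('data1', 'mean')`, while `flat.columns` holds the plain strings given as keywords.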
-
Grouping by Functions or Mappings
Grouping is not limited to columns. We can also group by a function that is called on each index label, or by a mapping (such as a dictionary) from index labels to group labels.
df.groupby(lambda x: 'odd' if x % 2 else 'even')
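The line above covers the function case; the sketch below adds the mapping case next to it. The index labels and the `kind` dictionary are hypothetical.

```python
import pandas as pd

# Hypothetical data indexed by string labels.
df = pd.DataFrame(
    {"data": [10, 20, 30, 40]},
    index=["apple", "beet", "cherry", "daikon"],
)

# Mapping from index label to group label.
kind = {
    "apple": "fruit",
    "cherry": "fruit",
    "beet": "vegetable",
    "daikon": "vegetable",
}
by_mapping = df.groupby(kind)["data"].sum()

# A function is called once per index label; here rows are grouped by label length.
by_function = df.groupby(len)["data"].sum()
```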
-
Grouping by Time Periods
When working with time series data, we can group by time periods (such as month or quarter) using pd.Grouper.
df.groupby(pd.Grouper(key='date', freq='M')).sum()  # Group by month end; pandas 2.2+ deprecates 'M' in favor of 'ME'
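A runnable sketch of monthly grouping follows. It assumes a hypothetical `sales` column and uses the month-start alias `'MS'` instead of a month-end alias, a choice made here only because the month-end spelling differs between pandas versions.

```python
import pandas as pd

# Hypothetical sample data; 'sales' is an illustrative column name.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-03-15"]
    ),
    "sales": [100, 150, 200, 50],
})

# 'MS' labels each monthly bin with the first day of that month.
monthly = df.groupby(pd.Grouper(key="date", freq="MS"))["sales"].sum()
```

The result is indexed by one timestamp per month, so it can be plotted or resampled further like any other time series.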
-
Categorical Data Optimization
Grouping by categorical columns is usually faster and uses less memory than grouping by object (string) columns, especially when the column has few distinct values, so consider converting strings to categories before grouping by string fields.
df['column'] = df['column'].astype('category')
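One practical consequence of the conversion is that the grouper knows the full set of categories, including ones that never occur in the data. The sketch below, built on hypothetical color data, shows how the `observed` argument of groupby controls whether such categories appear in the result.

```python
import pandas as pd

# Hypothetical sample data; "green" is a declared category that never occurs.
df = pd.DataFrame({
    "column": ["red", "blue", "red", "blue"],
    "value": [1, 2, 3, 4],
})
colors = pd.CategoricalDtype(categories=["red", "blue", "green"])
df["column"] = df["column"].astype(colors)

# observed=True keeps only the categories that actually appear in the data.
seen = df.groupby("column", observed=True)["value"].sum()

# observed=False adds a row for every declared category, even an empty one.
all_categories = df.groupby("column", observed=False)["value"].sum()
```

Passing `observed` explicitly also keeps the behavior independent of the default, which has been changing across pandas releases.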
-
SQL-like Window Functions
pandas supports SQL-style window functions within groups, such as running totals (cumsum), running averages (expanding), and moving averages (rolling).
df.groupby('group')['data'].expanding().mean()
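The sketch below puts three window operations side by side on hypothetical data. Note that expanding and rolling on a groupby return a result indexed by (group, original index), so the group level is dropped before assigning back to the frame.

```python
import pandas as pd

# Hypothetical sample data using the column names from the text.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "data": [1.0, 2.0, 3.0, 10.0, 20.0],
})

grouped = df.groupby("group")["data"]

# cumsum already returns a Series aligned with df.index.
df["running_total"] = grouped.cumsum()

# Drop the group level so the result realigns with the original rows.
df["running_mean"] = grouped.expanding().mean().reset_index(level=0, drop=True)
df["moving_avg_2"] = grouped.rolling(window=2).mean().reset_index(level=0, drop=True)
```

The first row of each group has no two-row window yet, so `moving_avg_2` is NaN there.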
- Group the rows in which the columns 'Name', 'Z', and 'Grades' are empty (code):
===========================================
.groupby('...')['...']. Code:
Input:

Output:

===========================================
Group the DataFrame by columns "X", "Y", and "Z", with a special rule for column "X": group on the portion of each string before the dot ("."). Code:
Input:
Output:
In this script, we add a step to preprocess the "X" column by extracting the portion before the dot and storing it in a new column. Then, we use this new column along with "Y" and "Z" for the grouping.
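The original script and its input and output samples are not reproduced here, so the following is only a sketch of the approach just described. The sample rows, the helper column name `X_prefix`, and the aggregated `value` column are all assumptions.

```python
import pandas as pd

# Hypothetical input; the original Input/Output samples are not available.
df = pd.DataFrame({
    "X": ["a.1", "a.2", "b.1", "b", "a.3"],
    "Y": [1, 1, 2, 2, 1],
    "Z": ["p", "p", "q", "q", "r"],
    "value": [10, 20, 30, 40, 50],
})

# Preprocess: keep the portion of "X" before the first dot in a new column.
# Strings without a dot are kept whole.
df["X_prefix"] = df["X"].str.split(".", n=1, regex=False).str[0]

# Group on the derived prefix together with "Y" and "Z".
result = df.groupby(["X_prefix", "Y", "Z"], as_index=False)["value"].sum()
```

Keeping the prefix in its own column leaves the original "X" values intact, which makes it easy to check which rows ended up in each group.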
|