Pandas Grouping & Sorting Helpers
Jimmy Rousseau
Author: Jimmy Rousseau | Published: 8/19/2023

Grouping and Sorting in Pandas

Grouping and sorting are two of the most important operations in data analysis. In Pandas, we can use the groupby() and sort_values() methods to perform these operations.

Parts of this article was put together with references from the Grouping & Sorting section of Kaggle Learning

Grouping

The groupby() method is used to group data points based on the values of one or more columns. This can be useful for performing aggregations, such as finding the mean, sum, or standard deviation of each group.

For example, let's say we have a DataFrame of student grades, and we want to group the students by their grade level. We can do this with the following code:

The grouped_students variable now contains a DataFrame with one row for each group of students. The grade column of the grouped_students DataFrame will contain the name of the group, and the other columns will contain the aggregated values for that group.

Sorting

The sort_values() method is used to sort a DataFrame by the values of one or more columns. This can be useful for getting a better understanding of the data, or for presenting the data in a more readable way.

For example, let's say we want to sort the grouped_students DataFrame by the average grade. We can do this with the following code:

The grouped_students DataFrame will now be sorted by the average grade, with the students with the highest average grade at the top.

To sort by index values, use the companion method sort_index(). This method has the same arguments and default order:

Finally, know that you can sort by more than one column at a time:

You can think of each group we generate as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to us directly using the apply() method, and we can then manipulate the data in any way we see fit. For example, here's one way of selecting the name of the first wine reviewed from each winery in the dataset:

For even more fine-grained control, you can also group by more than one column. For an example, here's how we would pick out the best wine by country and province:

Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

Multi Index

In all of the examples we've seen thus far we've been working with DataFrame or Series objects with a single-label index. groupby() is slightly different in the fact that, depending on the operation we run, it will sometimes result in what is called a multi-index.

A multi-index differs from a regular index in that it has multiple levels. For example:

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas

The use cases for a multi-index are detailed alongside instructions on using them in the MultiIndex / Advanced Selection section of the pandas documentation.

However, in general the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:

Grouping and Sorting Together

We can also use the groupby() and sort_values() methods together. This allows us to group the data and then sort the groups by a specific column.

For example, let's say we want to sort the grouped_students DataFrame by the average grade, but only within each grade level. We can do this with the following code:

The grouped_students DataFrame will now be sorted by grade level, with the students with the highest average grade within each grade level at the top.

Examples

Here are some more examples of how to use grouping and sorting in Pandas:

  • Find the number of students in each grade level:
  • Find the average grade for each subject:
  • Find the maximum grade for each grade level:
  • Sort the students by name within each grade level:

These are just a few examples of how to use grouping and sorting in Pandas. For more information, please refer to the Pandas documentation: https://pandas.pydata.org/docs/user_guide/groupby.html