Pandas GroupBy Function
Grouping data is one of the most important skills that you would require as a data analyst. Luckily, Pandas has a great function called GroupBy which is extremely flexible and allows you to answer many questions with just one line of code.
In this tutorial, we're going to understand the GroupBy function and subsequently answer some business questions.
So here are the questions that I'm going to answer:
- Which customers placed the most orders and the fewest orders?
- Which countries placed the fewest orders?
- Which customers spent the most money?
- Which customer spent the most money on one item?
- Which month had the most orders?
The first thing we have to do is load the libraries - Pandas, NumPy and Matplotlib - and the dataset. For this tutorial I’ll be using an online retail data set which you can download to follow along.
Notice that after loading in the data set, I’ve made sure that we are only looking at a non-zero quantity.
Now before going further, let’s have a look at the dataset by checking the head of the data, which is the first five rows.
As you can see, we have the quantity and the unit price for each invoice.
To start answering our questions, we’ll have to add a new column to our dataframe, showing the total price.
We can do this by multiplying the two columns Quantity and UnitPrice.
We’re now ready to work on our questions!
Pandas GroupBy Function
The GroupBy operation involves splitting the object into required buckets, applying a function and then combining the results. For our questions we’ll group our data by customers, countries, months, etc.
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)
Which customers placed the most and the fewest orders?
For this, I'm going to group by the ‘Customer ID’.
Once I have that, I’ll count the invoices and sort the values.
We’ll print the head for the top customers and tail for the worst. Let’s look at the results-
How about we plot this result!
Just plot a bar graph using the matplotlib that we’ve already imported.
Which countries placed the fewest orders?
For this, just group the data by ‘Country’ instead of ‘Customer ID’.
You may want to find out which countries your top customers belong to.
For this, let’s group the data first by customer ID and then by Country. We can do this easily by adding a list in our GroupBy function.
Let’s make it a bit clearer - we’ll give the “as_index” attribute the value “false” and then sort by the invoice number.
Which customers spent the most money?
Now that we have the Customer and their Country, we can easily swap out our invoice number for the total that we created earlier, and change the count to a sum.
So we sum the total amount spent and Voila! we can see which customers spent the most money and which countries they hail from.
Which customer spent the most money on one item?
To know which customers spent the most money on a singular item, we can group by “customer ID’ and the “Description”, get the unit price and then aggregate it to get the max unit price.
Which month had the most orders?
For this, we will have to create a month column.
However, for this particular database, the invoice date is not in a date time format.
So first, let’s change that and then create the new month column.
And now we can just quickly do a GroupBy Month for Total. We can plot it as well.
We can see that November had the highest amount of total orders by money.