Tableau Prep – The Cleaning Step, Group and Replace
Data is like blood flowing through an organization and, just like blood quality is directly linked with the health of the body, organizations only function effectively when the quality of their data meets the minimum quality standards.
In real life, data that is used for visualization is typically sourced from a variety of different sources. Because data is coming from different sources it could be in different formats, may have different meanings for the same data values, may have different purpose for the same data field, and may have different levels of data quality. To ensure that this data can be aggregated in a meaningful and consistent manner for visualization and analytical purposes, data must be cleaned as part of the data preparation step.
This article is part of a series of 6 articles on Tableau Prep and focused on Cleaning, Grouping and Replace. It assumes that the reader has fundamental knowledge about data visualization and has some basic level of exposure two one or more tools for data visualization and analytics. Having read the previous articles is recommended but not a pre-requisite.
Having read the previous articles is recommended but not a pre-requisite. As a case study, I will use an Excel file that contains foreign exchange rates against the Canadian dollar. this file was downloaded from the Bank of Canada website. Additionally, I will also use a data file that was downloaded from the Statistics Canada Website.
In this section, we will use the cleaning step available as part of the Tableau Prep data flow. But before we get into the details of Tableau Prep, first let’s have a quick look at the sample data that we are going to use. As I mentioned, this is an Excel file that contains foreign exchange data.
Using Data Interpreter to Remove Extra Rows in Source Data
Look at the screenshot below showing the Excel file. You will notice that the first 39 date rows in the Excel file do not contain any data. These rows contain information about the data itself. While it is useful to be aware of this information it is unlikely that this information will be used directly as part of the visualization process by Tableau or any other visualization tool. Hence, these rows have to be removed from the data.
In order to do that, let’s open Tableau Prep and connect the Excel file. Once you have it connected the file, select cleaned with data interpreter.
You will notice that the Tableau Prep has suggested 2 tables to you after an automated analysis of the data in the Excel file. Select the second table on the bottom
Remove and Rename
Now let’s remove a few fields that we don’t want and rename a few that we want to keep in our data. For the purpose of this article, let’s assume we are only interested in looking at the date field and the conversion rates for AUD and CAD. Add a clean step after input. Remove all the other fields by clicking on the 3 dots on the field cards and selecting remove. Rename the field FXAUDCAD to Australian dollar by selecting the rename option.
Split and Change Data Type
To demonstrate the split function, let’s assume we want to split the date fields into the year, month and date. In order to do this, lets first convert the type of the field to a string field. Next, split the field by using the automatic split option. You will notice that Tableau Prep has automatically created 3 new fields.
For this step, we will use another data file from https://www150.statcan.gc.ca/n1/tbl/csv/35100001-eng.zip. Download the Excel file and use the data interpreter in the input step to select the correct data table. Once loaded, add a Clean step after Input. From the clean options, convert all the member name to Upper Case. Also, remove all the punctuations from the data by selecting the remove punctuations options.
Using the Member Name field card select manual filtering. Select Extortion and notice that all values are now filtered out except this one.
Group and Replace
Select the Member Name field, group and replace by Pronunciation. Notice that identifies Fraud and Identity Theft are grouped together now as an example. Please also refer to the section “Grouping” in this article for additional information.
Now let’s go back to our original data set of foreign exchange data from Bank of Canada. Go to the Australian dollar card and select the calculated field option.
You will notice a library of available functions to use. Select the Round() function to round off the exchange rate to decimal places.
Go to the Member Name Card. Select the Extortion value, select the edit values and change it to Cyber Extortion.
Annotations and Changes
When you perform changes to the field cards, annotations are automatically added to them to show something has changed. This is shown in the screenshot.
Changes, on the other hand, is a concept which is similar to the system audit log in the sense that they are generated automatically by Tableau to capture what changes you applied manually to the data as part of the flow. It also allows you to maintain a log of all changes in a meaningful format, modify each step subsequently, change the sequence of the steps or remove steps altogether. Please refer to the screenshot.
Group and Replace
Significance of group and replace
A lot of times when we use data, we find variations and inconsistencies within the same field. Especially, when data is being aggregated from various sources. For e.g. Muhammad may be spelled as Mohammad or Mohammed and USA may be spelled as America, United States etc. For effective and accurate analytics, it is important to be able to group them together.
Tableau Prep provides different options to achieve grouping. Some of the key approaches are outlined below:
When manually selecting the values, the user can select which values to group together using the user interface. You can do this by going to the field card options, then selecting manual selection under group and replace the menu.
Out of domain values
In addition to selecting and grouping from available data values, you can also manually insert values that may not exist in the current data set but may appear in the future. These are called out of domain values and are indicated through a coloured dot against the respective value. You can add an out of domain value by pressing the + icon in the group and replace menu.
Fuzzy matching algorithms – Pronunciation and Common Characters
Tableau Prep has built-in algorithms to perform automated matching and grouping. Two such frequently used algorithms are matching by pronunciation and characters. Pronunciation groups together values for which the pronunciation is similar for e.g. John and Jon. Common Characters on the other works by comparing values for common characters in between them and grouping the ones where there is a match so for e.g. Waqar, Gaelim and Gaelim, Waqar would match since they have the same set of characters.
Thank you for reading this article. We hope that you have developed a foundational understanding about the Cleaning and Grouping features in Tableau Prep.