How to Address Common Data Quality Issues
Updated: Sep 7, 2020
Use Tableau to Solve Inconsistent Values
What are the common data quality issues?
Before jumping right into how to solve data quality issues we need to know what are the common issues and how to spot them.
Missing value: this is the easiest one to identify, simply look at if there are any null values, maybe use a filter to make them more obvious.
Duplicate value: when several rows of data appear to be the same then most likely they have been mistakenly recorded multiple times.
Inconsistent value: the string values of the same attributes do not follow the same naming convention, e.g. both LA and "Los Angeles" are present in the City data field.
There are also many data quality issues requires certain domain knowledge to identify, hence it is important to do some research on the industry before investigating, e.g. above 100 degree water temperature record is most likely to be wrong.
This article will mainly focus on how to address inconsistent value. There will be more articles coming up to discuss more data quality issues.
Usually inconsistent values are mitigated using REGEX, however, Tableau Prep Builder provides a more intuitive way to solve the issue without coding or programming knowledge. The functionality of Tableau Prep Builder is to carry out data processing. It follows the basic system of input -> process -> output. The most basic data preprocessing flow involves three phases:
1. connect to data source
2. cleaning step
3. output the prepared data
Data quality issues are mainly addressed in the Cleaning Step. After selecting one data field, we can see that there are several options in the drop-down menu that can be applied to string type attribute. Clean and Group Values are commonly applied to address inconsistent data. Note that, in the older version of Tableau Prep Builder, it was "Group and Replace" rather than "Group Values".
Then what’s the difference between Clean and Group Values and when should we use which? Clean can be applied prior to Group Values. This is because after having a glimpse of the values, we would have an idea of whether they appear to be messy. This could be having random numbers concatenated to the string values or inconsistent use of upper/lower cases. In this case, use Group Values on each of them will become very tedious. Therefore, Clean function will be more handy to tackle these inconsistencies all together.
These methods will transform all the values of that data field at the same time. For example, if perform Make Lowercase, all the values will be transformed into lowercase.
On the other hand, Group Values is applied after we have taken a closer look at individual string values. It can be very useful to solve specific types of inconsistent values. It might be the result of typos or missing character etc. There are four main methods and each deal with different scenarios.
manual selection method: usually applied when grouping syntactically irrelevant values together, because this can be easily identified without specific domain knowledge, e.g. USA and United State
pronunciation method: group values with similar pronunciation but different text form, e.g. South Africa and south Africa
common characters method: usually applied when there are typos e.g. smith and simth