Types of Dirty Data

There are four types of “dirty” data: missing, irregular, unnecessary, and inconsistent. Being able to tell them apart helps you categorize the problems in a dataset and choose the cleaning technique that fits each one. Each type is covered below.

Missing Data

Identifying missing data can be as straightforward as noticing empty columns or cells. If a cell, row, or column is blank where a value was expected, it can be classified as missing data.

A common example is a survey with optional questions. If a respondent skips a question, the result may be an empty cell in your dataset.

Analysts have several ways to handle missing data. The usual options are to delete the entire entry, to fill the gap with a substitute value such as N/A or NaN, or to impute a value based on what is known about the dataset, for example the column mean or a value predicted by a regression on the data you do have.

You can also go back to the original data store to check whether the value exists there and was lost in transit.
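As a rough illustration of the deletion and mean-imputation options described above, here is a minimal sketch in plain Python. The survey records, the age column, and the helper names drop_missing and impute_mean are all hypothetical and exist only for this example.

```python
from statistics import mean

# Hypothetical survey responses; None marks an optional question left unanswered.
responses = [
    {"respondent": "A", "age": 34},
    {"respondent": "B", "age": None},
    {"respondent": "C", "age": 28},
    {"respondent": "D", "age": 40},
]


def drop_missing(rows, column):
    """Deletion: keep only the rows that have a value in `column`."""
    return [row for row in rows if row[column] is not None]


def impute_mean(rows, column):
    """Imputation: replace missing values with the mean of the observed ones."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


dropped = drop_missing(responses, "age")
imputed = impute_mean(responses, "age")

print(len(dropped))       # 3 rows remain after deletion
print(imputed[1]["age"])  # respondent B now carries the mean of 34, 28, and 40
```

Deletion is the simpler option but shrinks the dataset, while imputation keeps every row at the cost of introducing values that were never actually observed.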

Example

The table below has a missing value in each of the last_name, email, employer, phone_number, and favorite_hobby columns.

| first_name | last_name | email | employer | phone_number | favorite_hobby |
| --- | --- | --- | --- | --- | --- |
| Amy | Williams | [email protected] | Emerson Electric | 379-012-0298 | hiking |
| Robert | Marshal | | Edward Jones | 288-085-0092 | reading |
| Kimberly | Stephen | [email protected] | | 126-015-0765 | yoga |
| Rachel | Vincent | [email protected] | Neuroflow | | painting |
| Brian | Salinas | [email protected] | Edward Jones | 914-555-4392 | |
| Wesley | | [email protected] | Emerson Electric | 504-326-2719 | gaming |

In cases like these, first work out whether the data is missing because of user error or because of the way the data was collected. The answer will help you decide which strategy to use.
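One way to start that investigation is to count the missing values in each column. The sketch below mirrors the example table. The EMAIL stand-in value and the missing_counts helper are assumptions made for illustration, and an empty string is assumed to represent a blank cell.

```python
COLUMNS = ["first_name", "last_name", "email", "employer",
           "phone_number", "favorite_hobby"]

# Stand-in for the addresses that are present in the table (illustrative only).
EMAIL = "user@example.com"

rows = [
    ["Amy", "Williams", EMAIL, "Emerson Electric", "379-012-0298", "hiking"],
    ["Robert", "Marshal", "", "Edward Jones", "288-085-0092", "reading"],
    ["Kimberly", "Stephen", EMAIL, "", "126-015-0765", "yoga"],
    ["Rachel", "Vincent", EMAIL, "Neuroflow", "", "painting"],
    ["Brian", "Salinas", EMAIL, "Edward Jones", "914-555-4392", ""],
    ["Wesley", "", EMAIL, "Emerson Electric", "504-326-2719", "gaming"],
]


def missing_counts(rows, columns):
    """Count blank cells per column; a blank is None or an empty/whitespace string."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column, value in zip(columns, row):
            if value is None or value.strip() == "":
                counts[column] += 1
    return counts


counts = missing_counts(rows, COLUMNS)
for column, count in counts.items():
    print(f"{column}: {count}")
```

Here every column except first_name has exactly one blank. In a real dataset, a column that is blank in most rows may point to a collection problem, while scattered blanks are more likely to come from individual users.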

Irregular Data

Irregular data usually means outliers: values that do not fit the expected format or range. Outliers must be detected before you can decide how to handle them. An outlier can be something like an email address that does not contain an @ symbol, or a number that falls outside the range expected for that field.

Example

In the table below, the email column contains addresses without an @ symbol, and the phone_number column contains numbers with 9 digits instead of the expected 10.

| first_name | last_name | email | employer | phone_number | favorite_hobby |
| --- | --- | --- | --- | --- | --- |
| Amy | Williams | [email protected] | Emerson Electric | 37-012-0298 | hiking |
| Robert | Marshal | christopher84example.org | Edward Jones | 288-05-0092 | reading |
| Kimberly | Stevens | [email protected] | Centene | 126-015-0765 | yoga |
| Rachel | Vincent | [email protected] | Neuroflow | 857-203-034 | painting |
| Brian | Salinas | [email protected] | Edward Jones | 914-555-4392 | cooking |
| Wesley | Boone | stephen94example.net | Emerson Electric | 504-326-2719 | gaming |
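Irregular values like these can be flagged with simple format checks. The sketch below assumes phone numbers should follow the NNN-NNN-NNNN pattern used elsewhere in the table and that a usable email address needs text on both sides of a single @ symbol. The function names and the EMAIL stand-in for the well-formed addresses are hypothetical.

```python
import re

# Assumed phone format: three digits, three digits, four digits, hyphen-separated.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")

# Stand-in for the well-formed addresses in the table (illustrative only).
EMAIL = "user@example.com"


def is_valid_email(value):
    """Very loose check: non-empty text on both sides of exactly one @."""
    local, sep, domain = value.partition("@")
    return bool(local) and bool(sep) and bool(domain) and "@" not in domain


def is_valid_phone(value):
    """True only when the whole value matches the assumed phone format."""
    return PHONE_PATTERN.fullmatch(value) is not None


def find_irregular(records):
    """Map each first name to the list of columns that failed a check."""
    flagged = {}
    for name, email, phone in records:
        problems = []
        if not is_valid_email(email):
            problems.append("email")
        if not is_valid_phone(phone):
            problems.append("phone_number")
        if problems:
            flagged[name] = problems
    return flagged


records = [
    ("Amy", EMAIL, "37-012-0298"),
    ("Robert", "christopher84example.org", "288-05-0092"),
    ("Kimberly", EMAIL, "126-015-0765"),
    ("Rachel", EMAIL, "857-203-034"),
    ("Brian", EMAIL, "914-555-4392"),
    ("Wesley", "stephen94example.net", "504-326-2719"),
]

flagged = find_irregular(records)
print(flagged)
```

The email check is deliberately loose. It only catches the kind of problem shown in the table and is not a full address validator.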

Unnecessary Data

Unnecessary data includes duplicate, irrelevant, or otherwise uninformative data. These are values that are either not useful or will cause issues during analysis. Removing unnecessary and redundant data simplifies the dataset and improves query and compute performance.

Common examples of unnecessary data are duplicated rows and columns, and old or outdated data that is no longer accurate. Another example is data that was entered for testing purposes and never removed from the dataset.

Example

The table below contains unnecessary data. The primary purpose of this dataset is to store contact information for different users, which makes the favorite_hobby column unnecessary.

| first_name | last_name | email | employer | phone_number | favorite_hobby |
| --- | --- | --- | --- | --- | --- |
| Amy | Williams | [email protected] | Emerson Electric | 379-012-0298 | hiking |
| Robert | Marshal | [email protected] | Edward Jones | 288-085-0092 | reading |
| Kimberly | Stephen | [email protected] | Centene | 126-015-0765 | yoga |
| Rachel | Vincent | [email protected] | Neuroflow | 857-203-0334 | painting |
| Brian | Salinas | [email protected] | Edward Jones | 914-555-4392 | cooking |
| Wesley | Boone | [email protected] | Emerson Electric | 504-326-2719 | gaming |
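Dropping an irrelevant column and removing duplicated rows can be sketched as follows. The sample uses a subset of the table's columns, and the repeated Amy Williams row is a hypothetical duplicate added for illustration since the table itself has none. The helper names are also assumptions.

```python
def drop_columns(rows, unwanted):
    """Return copies of the rows without the columns named in `unwanted`."""
    return [
        {key: value for key, value in row.items() if key not in unwanted}
        for row in rows
    ]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


contacts = [
    {"first_name": "Amy", "last_name": "Williams",
     "phone_number": "379-012-0298", "favorite_hobby": "hiking"},
    {"first_name": "Robert", "last_name": "Marshal",
     "phone_number": "288-085-0092", "favorite_hobby": "reading"},
    # Hypothetical duplicate of the first row.
    {"first_name": "Amy", "last_name": "Williams",
     "phone_number": "379-012-0298", "favorite_hobby": "hiking"},
]

trimmed = drop_columns(contacts, {"favorite_hobby"})
cleaned = drop_duplicates(trimmed)

print(len(cleaned))        # 2 unique contacts remain
print(sorted(cleaned[0]))  # ['first_name', 'last_name', 'phone_number']
```

Both helpers return new lists and leave the original records untouched, which makes it easy to compare the dataset before and after cleaning.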

Inconsistent Data

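Inconsistent data is data whose values do not agree in format, spelling, or type for the same kind of entry, which can interfere with your model. It is usually caused by inconsistent formatting and can be addressed by reformatting all values in a column or row to a single convention.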

Example

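The table below introduces inconsistencies into the same data. The employer column contains different spellings of the same name (for example, “Emerson Electic” and “Emerson Electtic”, or “Edward Jones” and “Edwards Jones”), and the favorite_hobby column mixes data types, with numbers in some rows and text in others.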

| first_name | last_name | email | employer | phone_number | favorite_hobby |
| --- | --- | --- | --- | --- | --- |
| Amy | Williams | [email protected] | Emerson Electic | 379-012-0298 | 2 |
| Robert | Marshal | [email protected] | Edward Jones | 288-085-0092 | reading |
| Kimberly | Stephen | [email protected] | Centene | 126-015-0765 | 3 |
| Rachel | Vincent | [email protected] | Neuroflow | 857-203-0334 | painting |
| Brian | Salinas | [email protected] | Edwards Jones | 914-555-4392 | cooking |
| Wesley | Boone | [email protected] | Emerson Electtic | 504-326-2719 | gaming |
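One possible way to clean up inconsistencies like these is to map each employer to the closest entry in a list of known-good spellings and to flag hobby values that are not text. The sketch below uses difflib from the Python standard library. The canonical employer list, the 0.8 similarity cutoff, and the helper name normalize_employer are assumptions chosen for this example.

```python
from difflib import get_close_matches

# Assumed list of correct employer spellings (not part of the original data).
CANONICAL_EMPLOYERS = ["Emerson Electric", "Edward Jones", "Centene", "Neuroflow"]


def normalize_employer(name, canonical=CANONICAL_EMPLOYERS, cutoff=0.8):
    """Return the closest canonical spelling, or the name unchanged if none is close."""
    matches = get_close_matches(name, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else name


# (first_name, employer, favorite_hobby) taken from the table above.
rows = [
    ("Amy", "Emerson Electic", 2),
    ("Robert", "Edward Jones", "reading"),
    ("Kimberly", "Centene", 3),
    ("Rachel", "Neuroflow", "painting"),
    ("Brian", "Edwards Jones", "cooking"),
    ("Wesley", "Emerson Electtic", "gaming"),
]

normalized = [normalize_employer(employer) for _, employer, _ in rows]

# Rows whose hobby is not stored as text need to be reviewed or converted.
mistyped = [name for name, _, hobby in rows if not isinstance(hobby, str)]

print(normalized)
print(mistyped)
```

Fuzzy matching can also merge names that are genuinely different, so the cutoff and the canonical list should be reviewed against the real data before changes are applied in bulk.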