Types of Dirty Data

There are four different types of “dirty” data. Being able to tell them apart will help you categorize the problems in your data and decide which technique or practice works best for cleaning a dataset. Below we will cover missing data, irregular data, unnecessary data, and inconsistent data.

Missing Data

Identifying missing data can be as straightforward as noticing empty columns or cells. If a row or column is blank or has no data entry, it can be classified as missing data.

A common example is a survey with optional questions. If a respondent decides not to answer a question, the result may be an empty cell in your dataset.

Analysts have several ways to handle missing data. The usual options are deleting the entire entry or imputing a value, either from existing knowledge of the dataset or with a substitute marker such as N/A or NaN. Another approach is to fill the gap with the column mean or with a value predicted by regression on the data that you have.

You can also revisit the original data storage to see whether the value exists there and was lost in transit.
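The deletion, substitution, and mean-imputation strategies described above can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: the rows are hypothetical, and the numeric age column is an assumption added only so that there is something to average.

```python
from statistics import mean

# Hypothetical survey rows; None marks a missing answer.
# The "age" column is an assumption added only to show numeric imputation.
rows = [
    {"first_name": "Amy", "employer": "Emerson Electric", "age": 34},
    {"first_name": "Kimberly", "employer": None, "age": None},
    {"first_name": "Rachel", "employer": "Neuroflow", "age": 28},
    {"first_name": None, "employer": "Centene", "age": 41},
]

# Strategy 1: delete entries that lack a required field (copying the rest).
kept = [dict(r) for r in rows if r["first_name"] is not None]

# Strategy 2: substitute a placeholder for a missing categorical value.
for r in kept:
    if r["employer"] is None:
        r["employer"] = "N/A"

# Strategy 3: impute a missing number with the mean of the observed values.
observed_ages = [r["age"] for r in kept if r["age"] is not None]
age_mean = mean(observed_ages)
for r in kept:
    if r["age"] is None:
        r["age"] = age_mean

for r in kept:
    print(r)
```

Which strategy fits depends on the column: deleting rows loses information, while imputed values can bias later statistics, so it is worth recording which cells were filled in.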

Example

The table below has examples of missing data in the last_name, email, employer, phone_number, and favorite_hobby columns.

first_name	last_name	email	employer	phone_number	favorite_hobby
Amy	Williams	[email protected]	Emerson Electric	379-012-0298	hiking
Robert	Marshal		Edward Jones	288-085-0092	reading
Kimberly	Stephen	[email protected]		126-015-0765	yoga
Rachel	Vincent	[email protected]	Neuroflow		painting
Brian	Salinas	[email protected]	Edward Jones	914-555-4392
Wesley		[email protected]	Emerson Electric	504-326-2719	gaming

With the above examples, prioritize understanding whether the missing data was caused by user error or by the way the data was collected. This will help you decide which strategy to use to address the problem.
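One way to locate the gaps in a table like the one above is to count the empty cells per column. The sketch below uses Python's csv module on a tab-separated sample modeled on that table; the email values are placeholders (assumptions), not data from the original.

```python
import csv
import io

# Tab-separated sample modeled on the table above; the email values are
# placeholders (assumptions), not data from the original table.
raw = (
    "first_name\tlast_name\temail\temployer\tphone_number\tfavorite_hobby\n"
    "Amy\tWilliams\tamy@example.com\tEmerson Electric\t379-012-0298\thiking\n"
    "Robert\tMarshal\t\tEdward Jones\t288-085-0092\treading\n"
    "Kimberly\tStephen\tkim@example.com\t\t126-015-0765\tyoga\n"
    "Brian\tSalinas\tbrian@example.com\tEdward Jones\t914-555-4392\n"
)

reader = csv.DictReader(io.StringIO(raw), delimiter="\t")

missing_counts = {}
for row in reader:
    for column, value in row.items():
        # A row that is too short yields None; an empty cell yields "".
        if value is None or value.strip() == "":
            missing_counts[column] = missing_counts.get(column, 0) + 1

print(missing_counts)
```

A column with many gaps may point to a collection problem, while a scattering of single gaps is more likely to come from respondents skipping a question.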

Irregular Data

Irregular data usually means outliers within the dataset. Outliers must be detected before you can decide on a strategy to handle them. An outlier can be something like an email that does not contain an @ symbol, or a number that falls outside the range expected for that field.
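The range check mentioned above can be sketched in a few lines of Python. The ages and the 0 to 120 bounds are assumptions chosen for illustration; they do not come from the example tables.

```python
# Hypothetical ages from a survey; the 0-120 bounds are an assumed valid range.
ages = [34, 28, 41, -3, 230, 56]
LOW, HIGH = 0, 120

# Split the values into those inside and outside the expected range.
outliers = [a for a in ages if not LOW <= a <= HIGH]
in_range = [a for a in ages if LOW <= a <= HIGH]

print("outliers:", outliers)
print("in range:", in_range)
```

Detection is kept separate from handling here: once the outliers are listed, you can decide whether to correct, remove, or keep them.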

Example

In the table below, the email column contains emails without an @ symbol, and the phone_number column contains numbers with fewer than the expected 10 digits.

first_name	last_name	email	employer	phone_number	favorite_hobby
Amy	Williams	[email protected]	Emerson Electric	37-012-0298	hiking
Robert	Marshal	christopher84example.org	Edward Jones	288-05-0092	reading
Kimberly	Stevens	[email protected]	Centene	126-015-0765	yoga
Rachel	Vincent	[email protected]	Neuroflow	857-203-034	painting
Brian	Salinas	[email protected]	Edward Jones	914-555-4392	cooking
Wesley	Boone	stephen94example.net	Emerson Electric	504-326-2719	gaming
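Irregular values like these can be flagged with simple pattern checks. The sketch below is illustrative rather than a complete validator: the two regular expressions are assumptions about what a well-formed value looks like, and the well-formed email addresses are placeholders.

```python
import re

# Simple illustrative patterns, not full validators: an email needs text on
# both sides of a single "@", and a phone number must look like NNN-NNN-NNNN.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")

# (email, phone_number) pairs; the well-formed emails are placeholders.
records = [
    ("amy@example.com", "37-012-0298"),
    ("christopher84example.org", "288-05-0092"),
    ("brian@example.com", "914-555-4392"),
    ("stephen94example.net", "504-326-2719"),
]

# Map each row index to the list of columns that failed validation.
irregular = {}
for index, (email, phone) in enumerate(records):
    problems = []
    if not EMAIL_RE.fullmatch(email):
        problems.append("email")
    if not PHONE_RE.fullmatch(phone):
        problems.append("phone_number")
    if problems:
        irregular[index] = problems

print(irregular)
```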

Unnecessary Data

Unnecessary data can be duplicate, irrelevant, or otherwise uninformative data. This typically involves a dataset with values that are either not useful or will cause issues during analysis. Removing unnecessary and redundant data simplifies the dataset and can improve query and compute speeds.

Common examples of unnecessary data are duplicated rows and columns, or outdated data that is no longer accurate. Another example is data that was entered for testing purposes and never removed from the dataset.

Example

The table below shows an example of unnecessary data. The primary purpose of this dataset is to store contact information for different users, which makes the favorite_hobby column unnecessary.

first_name	last_name	email	employer	phone_number	favorite_hobby
Amy	Williams	[email protected]	Emerson Electric	379-012-0298	hiking
Robert	Marshal	[email protected]	Edward Jones	288-085-0092	reading
Kimberly	Stephen	[email protected]	Centene	126-015-0765	yoga
Rachel	Vincent	[email protected]	Neuroflow	857-203-0334	painting
Brian	Salinas	[email protected]	Edward Jones	914-555-4392	cooking
Wesley	Boone	[email protected]	Emerson Electric	504-326-2719	gaming
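Removing an unneeded column and dropping duplicated rows can be sketched as follows. The rows are modeled on the table above with the email column left out for brevity, and the repeated row for Amy is an assumption added to illustrate a duplicate.

```python
# Contact rows modeled on the table above (emails omitted for brevity);
# the repeated "Amy" row is an assumption added to illustrate duplicates.
rows = [
    {"first_name": "Amy", "last_name": "Williams",
     "phone_number": "379-012-0298", "favorite_hobby": "hiking"},
    {"first_name": "Robert", "last_name": "Marshal",
     "phone_number": "288-085-0092", "favorite_hobby": "reading"},
    {"first_name": "Amy", "last_name": "Williams",
     "phone_number": "379-012-0298", "favorite_hobby": "hiking"},
]

# Columns that do not serve the dataset's purpose (contact information).
UNNEEDED = {"favorite_hobby"}

seen = set()
cleaned = []
for row in rows:
    # Drop the unneeded columns first, then skip rows already seen.
    trimmed = {k: v for k, v in row.items() if k not in UNNEEDED}
    key = tuple(sorted(trimmed.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(trimmed)

for row in cleaned:
    print(row)
```

Dropping the column before checking for duplicates means rows that differ only in an irrelevant field are also treated as duplicates, which may or may not be what you want.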

Inconsistent Data

Inconsistent data is data that represents the same thing in different ways, which can skew an analysis or break a model. It is often caused by inconsistent formatting and can be addressed by re-formatting all values in a column or row to a single convention.

Example

Here is a table containing inconsistencies. The employer column contains different spellings of the same name, and the favorite_hobby column mixes data types (numbers and text).

first_name	last_name	email	employer	phone_number	favorite_hobby
Amy	Williams	[email protected]	Emerson Electic	379-012-0298	2
Robert	Marshal	[email protected]	Edward Jones	288-085-0092	reading
Kimberly	Stephen	[email protected]	Centene	126-015-0765	3
Rachel	Vincent	[email protected]	Neuroflow	857-203-0334	painting
Brian	Salinas	[email protected]	Edwards Jones	914-555-4392	cooking
Wesley	Boone	[email protected]	Emerson Electtic	504-326-2719	gaming
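One possible way to clean up inconsistencies like these is to map each employer to the closest name in a list of known spellings and to flag hobby values that are numbers rather than text. The sketch below uses difflib from the Python standard library; the canonical list and the 0.8 similarity cutoff are assumptions, and fuzzy matching can produce wrong matches on real data, so the results should be reviewed.

```python
from difflib import get_close_matches

# Assumed list of correct employer names; in practice this would come from
# a trusted reference source.
CANONICAL = ["Emerson Electric", "Edward Jones", "Centene", "Neuroflow"]

# Values taken from the employer and favorite_hobby columns above.
employers = ["Emerson Electic", "Edward Jones", "Centene",
             "Neuroflow", "Edwards Jones", "Emerson Electtic"]
hobbies = ["2", "reading", "3", "painting", "cooking", "gaming"]


def normalize(name):
    """Return the closest canonical name, or the input if nothing is close."""
    match = get_close_matches(name, CANONICAL, n=1, cutoff=0.8)
    return match[0] if match else name


normalized = [normalize(name) for name in employers]

# Flag hobby cells that hold a number where text is expected.
non_text = [h for h in hobbies if h.isdigit()]

print(normalized)
print(non_text)
```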