Cleaning Data
Intro to Data Cleaning
Cleaning data is one of the first and most important steps when working with any kind of dataset. It helps ensure that you start from an accurate set of data by removing redundancies and inaccuracies and by dealing with outliers and missing values. If you do not take the proper steps to clean your dataset, its quality and integrity suffer, because it may still contain errors when you begin in-depth analysis.
The process of cleaning data should help you get comfortable with:
- Identifying missing, irregular, unnecessary, and inaccurate data
- Fixing formatting and type errors
- Adding filters
- Applying functions
- Handling outliers
The tasks and practices listed above are some of the more common techniques used while cleaning data. This is not an exhaustive list, as there are many more techniques in use! However, it serves as a good starting point for thinking about strategies you can use as you begin your journey.
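To make a few of the listed tasks concrete, here is a minimal Python sketch that identifies missing or irregular values, fixes a type error, and flags an outlier. The sample rows, the `parse_age` helper, and the "more than three times the median" outlier rule are all assumptions made for illustration; they are not a standard method, and a real dataset may need a different rule.

```python
from statistics import median

# Hypothetical sample rows; names and values are invented for illustration.
rows = [
    {"name": "Ada", "age": "36"},
    {"name": "Grace", "age": ""},        # missing value
    {"name": "Alan", "age": "forty"},    # irregular value (text instead of a number)
    {"name": "Edsger", "age": "41"},
    {"name": "Linus", "age": "39"},
    {"name": "Barbara", "age": "410"},   # probably a typo; stands out as an outlier
]

def parse_age(raw):
    """Convert a text cell to an int, or return None if it is blank or not a number."""
    try:
        return int(raw.strip())
    except ValueError:
        return None

# Fix the type: every age becomes an int or None.
ages = [parse_age(row["age"]) for row in rows]

# Identify rows whose age is missing or irregular.
needs_review = [row["name"] for row, age in zip(rows, ages) if age is None]

# Flag outliers with a simple, assumed rule: more than 3x the median valid age.
valid_ages = [age for age in ages if age is not None]
threshold = 3 * median(valid_ages)
outliers = [
    row["name"]
    for row, age in zip(rows, ages)
    if age is not None and age > threshold
]

print("Needs review:", needs_review)
print("Possible outliers:", outliers)
```

Flagging rows for review, rather than deleting them outright, keeps the decision about each value with the person who knows the data.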
We will cover some common data cleaning techniques and their use cases, such as filtering, sorting, and removing trailing whitespace. Other strategies, such as consistent formatting, keep your data uniform across the entire set. Removing redundant and duplicated data is another crucial part of the cleaning process, because duplicates can skew your data and make it inaccurate.
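The techniques named above can be sketched in a few lines of Python. This is only an illustration of the ideas, not how any particular spreadsheet tool implements them; the sample rows, the column names, and the cutoff of 50 are hypothetical. Note that `str.strip()` removes leading as well as trailing whitespace.

```python
# Hypothetical sample rows; cities and sales figures are invented for illustration.
rows = [
    {"city": "Austin ", "sales": 120},   # trailing space
    {"city": "Boston", "sales": 95},
    {"city": "Austin", "sales": 120},    # duplicate once whitespace is trimmed
    {"city": " Denver", "sales": 40},    # leading space
]

# 1. Trim whitespace so that "Austin " and "Austin" are treated as the same value.
trimmed = [{**row, "city": row["city"].strip()} for row in rows]

# 2. Remove duplicates while keeping the first occurrence of each row.
seen = set()
unique = []
for row in trimmed:
    key = (row["city"], row["sales"])
    if key not in seen:
        seen.add(key)
        unique.append(row)

# 3. Filter: keep only rows with sales of at least 50 (an arbitrary cutoff).
filtered = [row for row in unique if row["sales"] >= 50]

# 4. Sort by sales, highest first.
cleaned = sorted(filtered, key=lambda row: row["sales"], reverse=True)

for row in cleaned:
    print(row["city"], row["sales"])
```

The order matters here: trimming before removing duplicates is what lets the two Austin rows be recognized as the same record.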
We will also cover four categories of “dirty” data and what each is associated with.
Google Sheets has many built-in features that let you clean data efficiently and avoid many of the mistakes that manual input can introduce. These tools help you apply the techniques mentioned above, such as filtering, sorting, and adding data validation to values like emails or currency.
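To show what validating values like emails or currency means in practice, here is a rough Python sketch of the idea. It does not reproduce how Google Sheets validates data; the two patterns are deliberately simple assumptions (a real email check is far more involved, and the currency pattern assumes an optional dollar sign and exactly two decimal places), and the function names are hypothetical.

```python
import re

# Assumed, deliberately loose patterns for illustration only.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
CURRENCY_PATTERN = re.compile(r"\$?\d+(\.\d{2})?")

def is_valid_email(value):
    """Return True if the whole value looks like name@domain.tld."""
    return EMAIL_PATTERN.fullmatch(value) is not None

def is_valid_currency(value):
    """Return True for values like 12, 12.50, or $12.50."""
    return CURRENCY_PATTERN.fullmatch(value) is not None

# Hypothetical entries as someone might type them by hand.
emails = ["ada@example.com", "ada@example", "not an email"]
amounts = ["$12.50", "12", "12.5", "twelve"]

rejected_emails = [value for value in emails if not is_valid_email(value)]
rejected_amounts = [value for value in amounts if not is_valid_currency(value)]

print("Rejected emails:", rejected_emails)
print("Rejected amounts:", rejected_amounts)
```

Checking each value against a rule at entry time, instead of fixing it later, is what prevents many manual-input mistakes from reaching the dataset at all.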