16.1. Cleaning Data

16.1.1. Readings

Read the following articles and follow along where instructed:

  • Key takeaway: More in-depth techniques on what to clean and how.
  • Key takeaway: A brief introduction to why and how to clean your data.
  • Key takeaway: Theory behind cleaning.

16.1.2. Check Your Understanding

Question

Name the four categories of “dirty” data.

Question

Name the three possible solutions to any “dirty” data problem.

Question

Your data set “local_plants_df” has the following column names: ['flora_sci_name', 'tall', 'growing_zone', 'avg_rainfall']. We want to rename the 'tall' column to 'avg_height'. What syntax would we use?

Question

You have been tasked with helping the local parks department assess visitor usage of a local park over 8 weeks. As you look through your data, you notice a duplicated row. Why would it be beneficial to this project to delete the duplicated row?

[Image: DataFrame showing the park name, location, week of, and number of guests. Some rows are duplicated.]
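
If you want to check for repeated rows yourself before answering, one possible approach is pandas' duplicated() method. The sketch below is only an illustration: the visits_df name, its column names, and every value in it are made up and are not the data shown in the image.

```python
import pandas as pd

# Hypothetical visitor log; the park, weeks, and counts are invented for illustration.
visits_df = pd.DataFrame({
    "park_name": ["Oak Park", "Oak Park", "Oak Park", "Oak Park"],
    "location": ["North Lot", "North Lot", "North Lot", "North Lot"],
    "week_of": ["2023-06-05", "2023-06-12", "2023-06-12", "2023-06-19"],
    "num_guests": [120, 95, 95, 143],
})

# duplicated() flags each row that exactly repeats an earlier row;
# the first occurrence of a repeated row is left as False.
duplicate_mask = visits_df.duplicated()
num_duplicates = int(duplicate_mask.sum())

# Show only the flagged rows, then how many were found.
print(visits_df[duplicate_mask])
print("Duplicated rows found:", num_duplicates)
```

Here duplicated() compares every column, so a row is flagged only when all of its values match an earlier row. Whether a match across every column really means the same week was recorded twice is something to confirm against the source of the data.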

Question

Define “data cleaning”.

Question

The 5 characteristics of quality data include: