Handling Inconsistent Data
Our final type of dirty data we want to clean is inconsistent data. Inconsistent data is data that is not properly formatted for the analysis. This could be data that is strings but should be numbers ('0'
instead of 0
or 'one hundred'
instead of 100
). Let’s study etsy_sellers
for the final time to see how we can detect and handle inconsistent data.
Seller_Id Seller Sales Total_Rating Current_Items Star_Seller
0 8967 Orchid Jewels 17,896 4.5 22 0
1 908764 Ducky Ducks 5,478 3.8 10 True
2 7463529 Candy Yarns 89,974 4.8 18 True
3 161729 Parks Pins 6,897 4.9 87 True
4 4217 Sierra's Stationary 112,988 4.3 347 0
5 21378 Star Stitchery 53,483 4.2 52 0
Star sellers meet Etsy’s highest standard of customer service so it makes sense that we would have a column in our dataset to designate whether or not someone is a star seller. If we take a look at the new Star_Seller
column, we can see that all the star sellers are labeled with 'True'
and those who aren’t star sellers have '0'
. Now we need to resolve this inconsistency in order for us to do an effective analysis with the Star_Seller
column. We can either convert everything to booleans or to numbers.
The Numbers Era of Star Sellers
We are going to start by converting everything to numbers. Once everything in the column is converted to numbers, it will be easier for us to convert the column to booleans. The sellers that are not star sellers are designated with a '0'
so converting that string to a number is going to be a little more straightforward than converting the string 'True'
to 1
.
First, let’s focus in on turning
'True'
to'1'
.etsy_sellers = etsy_sellers.loc[etsy_sellers['Star_Seller'] == 'True'] = '1'
This code will replace all the values in the
Star_Seller
column with'1'
only if that value is currently equal to'True'
.We can now convert the whole
Star_Seller
column to integers.etsy_sellers.Star_Seller = etsy_sellers.Star_Seller.astype('int64')
astype()
allows us to convert a dataframe or column of a dataframe to a specific type, in this case,int64
.
After all of this, our dataframe will look a lot more like this:
Seller_Id Seller Sales Total_Rating Current_Items Star_Seller
0 8967 Orchid Jewels 17,896 4.5 22 0
1 908764 Ducky Ducks 5,478 3.8 10 1
2 7463529 Candy Yarns 89,974 4.8 18 1
3 161729 Parks Pins 6,897 4.9 87 1
4 4217 Sierra's Stationary 112,988 4.3 347 0
5 21378 Star Stitchery 53,483 4.2 52 0
The Booleans Era of Star Sellers
With the whole Star_Seller
column converted to integers, we just have to do one more step to convert the whole column to booleans.
etsy_sellers.Star_Seller = etsy_sellers.Star_Seller.astype('bool')
With this step, etsy_sellers
is going to become:
Seller_Id Seller Sales Total_Rating Current_Items Star_Seller
0 8967 Orchid Jewels 17,896 4.5 22 False
1 908764 Ducky Ducks 5,478 3.8 10 True
2 7463529 Candy Yarns 89,974 4.8 18 True
3 161729 Parks Pins 6,897 4.9 87 True
4 4217 Sierra's Stationary 112,988 4.3 347 False
5 21378 Star Stitchery 53,483 4.2 52 False
Whether you convert the column to booleans or stay with integers depends entirely on what you need from your analysis and what you find easier to work with later on. This is the case with a lot of cleaning data. The approach you take to cleaning data is heavily dependent on you and what you are hoping to achieve with your analysis. The key for now is to practice and not be afraid to try something new.