Handling Unnecessary Data

Unnecessary data is data that is not vital to your analysis. For example, if you are analyzing tax data, a taxpayer’s name or street address might not be needed. The taxpayer’s name may help distinguish one taxpayer from another, but do you need to distinguish taxpayers on an individual basis? Likely not. This means the dataset contains some unnecessary data that we need to clean.

Note

You may also remove unnecessary data for cybersecurity reasons. Taxpayers’ personally identifiable information (PII) is not something that you want to keep in your dataset in case your company gets hacked.

We have made some updates to etsy_sellers so that we can see how to use pandas to remove unnecessary data.

   Seller_Id  Seller                Owner_Name     Sales     Total_Rating     Current_Items
0  8967       Orchid Jewels         Orchid Smith   17,896    4.5              22
1  908764     Ducky Ducks           Nala Blake     5,478     3.8              10
2  7463529    Candy Yarns           Candy Elsbeth  89,974    4.8              18
3  161729     Parks Pins            Jade Slate     6,897     4.9              87
4  4217       Sierra's Stationary   Sierra Tomlin  112,988   4.3              347     
5  21378      Star Stitchery        Sara George    53,483    4.2              52 

The Seller_Id column is a unique identifier that Etsy’s system assigns to each seller, and the dataset now includes Owner_Name as well. Because the seller’s id can tie a row back to a specific shop, we should ask ourselves whether we need both the shop name and the owner name. Our goal is to understand these sellers’ sales numbers, so we don’t really need the owner name.

etsy_sellers.drop(columns=['Owner_Name'])

drop() defaults to dropping rows, so to drop a column we have to pass the column name to the columns parameter. Alternatively, we can use the axis parameter to drop this column:

etsy_sellers.drop(['Owner_Name'], axis=1)

Both code samples require the name of the column that we want to drop, and which one to use is a matter of preference. Keep an eye out for both ways of dropping a column when reviewing others’ code!
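To see that the two approaches give the same result, here is a minimal sketch that rebuilds the table above by hand. The column types are an assumption (Sales is kept as text because of the commas), and the variable names by_columns and by_axis are only illustrative.

```python
import pandas as pd

# Rebuild the example table by hand; the dtypes are an assumption.
etsy_sellers = pd.DataFrame({
    "Seller_Id": [8967, 908764, 7463529, 161729, 4217, 21378],
    "Seller": ["Orchid Jewels", "Ducky Ducks", "Candy Yarns",
               "Parks Pins", "Sierra's Stationary", "Star Stitchery"],
    "Owner_Name": ["Orchid Smith", "Nala Blake", "Candy Elsbeth",
                   "Jade Slate", "Sierra Tomlin", "Sara George"],
    "Sales": ["17,896", "5,478", "89,974", "6,897", "112,988", "53,483"],
    "Total_Rating": [4.5, 3.8, 4.8, 4.9, 4.3, 4.2],
    "Current_Items": [22, 10, 18, 87, 347, 52],
})

# Option 1: name the column with the columns parameter.
by_columns = etsy_sellers.drop(columns=["Owner_Name"])

# Option 2: pass the label and tell drop() to look along the column axis.
by_axis = etsy_sellers.drop(["Owner_Name"], axis=1)

# Both calls produce the same DataFrame.
print(by_columns.equals(by_axis))  # True

# drop() returns a new DataFrame; the original still has Owner_Name.
print("Owner_Name" in etsy_sellers.columns)  # True
```

Because drop() returns a new DataFrame rather than changing etsy_sellers, assign the result to a variable (for example, etsy_sellers = etsy_sellers.drop(columns=['Owner_Name'])) if you want to keep the change.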

If our analysis only focuses on sellers that do not have any items in the fiber arts space, then we might also want to remove Candy Yarns and Star Stitchery.

etsy_sellers.drop([2, 5])

Since we know the index labels of Candy Yarns and Star Stitchery (2 and 5), we can drop those rows by passing the labels to drop() with the above syntax.
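If we did not already know the index labels, one option is to look them up by seller name first. The sketch below uses a trimmed-down copy of the table with only two columns, and the names fiber_arts, by_label, and by_name are hypothetical helpers for illustration.

```python
import pandas as pd

# A trimmed-down copy of the example table (only two columns are kept here).
etsy_sellers = pd.DataFrame({
    "Seller": ["Orchid Jewels", "Ducky Ducks", "Candy Yarns",
               "Parks Pins", "Sierra's Stationary", "Star Stitchery"],
    "Current_Items": [22, 10, 18, 87, 347, 52],
})

# Drop rows by their index labels, as in the lesson.
by_label = etsy_sellers.drop([2, 5])

# Alternative: find the labels by seller name, then drop them.
fiber_arts = ["Candy Yarns", "Star Stitchery"]
is_fiber_arts = etsy_sellers["Seller"].isin(fiber_arts)
by_name = etsy_sellers.drop(etsy_sellers[is_fiber_arts].index)

# The remaining rows keep their original labels.
print(by_label.index.tolist())   # [0, 1, 3, 4]
print(by_label.equals(by_name))  # True
```

Note that drop() matches index labels, not row positions. With the default index the two are the same, but after rows have been dropped the labels are no longer consecutive.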

Check Your Understanding

Question

If you are performing an analysis on inflation and grocery prices, do you need the zip code of a grocery store?