15.1. EDA With Python Part 2
Read the following articles and follow along where instructed:
- Key takeaway: Reference list of common Pandas methods for EDA (some are review from last week).
- Key takeaway: Defines global, contextual, and collective outliers. Article and 5 min video.
- Note: You do not need to know the details of the techniques described here for handling missing data. Read it as an introduction to some advanced methods for systematically handling missing values.
- Read the above article and work along using the notebook and dataset found in this GitHub repository.
- This walkthrough will have you write code and answer questions.
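As a quick companion to the readings, here is a minimal sketch of a few common Pandas EDA calls, a missing-value indicator column, and a simple outlier check. The toy data and column names (`duration_ms`, `energy`, `energy_missing`) are hypothetical, and the 1.5 × IQR rule used below is one common heuristic for spotting global outliers, not necessarily the method the articles describe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "duration_ms": [210000, 185000, np.nan, 240000, 1200000],
    "energy": [0.62, 0.80, 0.45, np.nan, 0.71],
})

# First look at the data: leading rows, dtypes and non-null counts, summary stats.
print(df.head())
df.info()
print(df.describe())

# Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Binary indicator column: 1 where "energy" is missing, 0 otherwise.
df["energy_missing"] = df["energy"].isna().astype(int)

# Flag candidate global outliers in "duration_ms" with the 1.5 * IQR rule.
# Comparisons against NaN evaluate to False, so missing rows are not flagged.
q1 = df["duration_ms"].quantile(0.25)
q3 = df["duration_ms"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["duration_ms"] < q1 - 1.5 * iqr) | (df["duration_ms"] > q3 + 1.5 * iqr)
print(df[is_outlier])
```

The indicator column keeps the information that a value was missing even if the gap is later filled or the row is dropped. The IQR rule only catches values that are extreme relative to the whole column, so contextual and collective outliers generally need other checks.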
15.1.2. Check Your Understanding
The following plot shows a user’s Spotify recommended song lengths (in milliseconds) against each song’s energy score (a perceptual measure of intensity and activity between 0 and 1). Which outliers, if any, are present?
The National Park Service records the number of visitors to the Gateway Arch in St. Louis every day. Occasionally, concerts are held on the park grounds and the number of visitors soars. On concert days, what type of figures is the NPS seeing?
Is it true that missing values in a dataset have no analytical use?
How can data analysts leverage the presence of null values in a dataset?
- We can create an additional binary column indicating whether a value is missing in the column in question.
- We can clean the data by removing any missing entries.
- We can clean the data by removing any columns with missing entries.
- We can’t. Null values serve no purpose in analysis.
Datasets with missing values have been improperly assembled and are rarely encountered in professional data analysis.
There is no best method for addressing missing data.