14.1. Exploratory Data Analysis

14.1.1. Readings

Read the following articles, follow along where instructed:

Tip

For Medium articles: if you run out of free articles, open the page in an incognito window.

  • Key Takeaways: An outline of the common steps used in EDA. Note that there is no single process for performing EDA; it depends on the dataset and your questions.
  • Just read and follow the steps for comprehension; you do not need to complete the tutorial.
  • Stop at #8: “Detecting Outliers”
  • Key Takeaways: Questions to better understand why the data was collected.
  • Suggested Reading: Data Types in Statistics.
    • Key Takeaways: Definitions for discrete, continuous and categorical data.
  • You do not need to install pandas; it is included with the Anaconda distribution.
  • Try coding along with the article.
  • Stop at the “Handling Duplicates” header.
  • Key Takeaways: Using the pandas DataFrame, with examples.

14.1.2. What is a Dataframe?

A Pandas dataframe is similar to a Python dictionary: the column names act like keys, and the value for each key is the data in that column. The diagram below illustrates the different components of a dataframe.

Diagram of a Pandas Dataframe.

For credit for the above diagram, and for more information about Pandas dataframes, visit here.
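
To make the dictionary comparison concrete, here is a minimal sketch that builds a dataframe from a plain Python dictionary. The column names and values (`city`, `high_temp`, `low_temp`) are made up for illustration and do not come from the readings.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column name,
# and each list becomes the values stored in that column.
data = {
    "city": ["Austin", "Boston", "Chicago"],
    "high_temp": [95, 82, 85],
    "low_temp": [74, 66, 68],
}

df = pd.DataFrame(data)

# The column names come from the dictionary keys.
print(df.columns.tolist())

# Looking up a column by name works much like looking up a dictionary key.
print(df["high_temp"].tolist())
```

One difference from a plain dictionary is that every column in a dataframe must have the same number of rows, and the rows share a common index.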

The values in each column are stored in a Pandas series. Here is how Pandas series are used to build a dataframe.

Diagram of how Pandas series build a dataframe.

For credit for the above diagram, and for more information about Pandas series, visit here.
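
The relationship between series and dataframes can be sketched in a few lines. This is one possible way to combine series, using `pd.concat`; the series names, values, and index labels are hypothetical.

```python
import pandas as pd

# Two hypothetical series that share the same index labels.
index_labels = ["mon", "tue", "wed"]
coffee = pd.Series([2, 1, 3], index=index_labels, name="coffee")
tea = pd.Series([0, 2, 1], index=index_labels, name="tea")

# Placing the series side by side (axis=1) makes each one a column.
# The series names become the column names.
drinks = pd.concat([coffee, tea], axis=1)
print(drinks)

# Selecting a single column gives back a series.
print(type(drinks["coffee"]))
```

Because both series use the same index labels, their values line up row by row in the resulting dataframe.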

14.1.3. Check Your Understanding

Question

What is the pandas attribute used to return the number of rows and columns in a dataframe?

Question

True or False: Column names cannot be changed in a dataframe.

  1. True
  2. False

Question

What can knowing the data types present in a dataset tell us about the data?

Question

What is the Pandas function for reading a CSV file?

Question

Visualized below is the “purchases” dataframe. What is the pandas syntax to select Robert’s data?

Dataframe showing name of person and if they purchased apples and/or oranges.

Question

How do we view only the first 13 rows of a dataframe?

Question

True or False: A dataframe column is a series.

  1. True
  2. False

Question

Which pandas method returns the number of records, the three quartiles, the mean, the standard deviation, and the minimum and maximum values of a dataframe?

  1. .describe()
  2. .index()
  3. .statistics()
  4. .head()