Reshaping Tables

As you continue to answer questions about your data during the exploration and cleaning stages, you may want to reformat or reshape your tabular data. Below we will cover some common functions used when reshaping your dataset so that you can not only visualize it differently but also gather data related to a specific question you have.

.concat() function

The .concat() function has many use cases, one of which is to join multiple DataFrames together, combining their datasets into one. It has the capability to join along the rows or columns of a DataFrame.

Syntax

dataset_one = [1, 2, 3]
dataset_two = ["four", "five", "six"]

dataframe_one = pd.DataFrame(dataset_one)
dataframe_two = pd.DataFrame(dataset_two)

combined_dataframe = pd.concat([dataframe_one, dataframe_two])
Example
movie_dictionary = {
    'Name': ["Interstellar", "Pride and Prejudice", "Inception", "Barbie"], 
    'Release': [2014, 2005, 2010, 2003], 
}

genre_rating_dictionary = {
    'Genre': ["Science Fiction", "Novel", "Science Fiction", "Comedy"], 
    'Rotten Tomatoes Score': [73, 87, 87, 88]
}

movie_dataframe = pd.DataFrame(movie_dictionary)
genre_rating_dataframe = pd.DataFrame(genre_rating_dictionary)

movie_genre_dataframe = pd.concat([movie_dataframe, genre_rating_dataframe])

Output

concat-dataframe-output concat-dataframe-output

Note in the output image above the inclusion of the axis parameter when printing the dataframe a second time. The axis parameter specifies that the two DataFrames should be joined along the columns instead of rows, providing a cleaner dataset.

In the lesson on exploring data with python we covered how to create a DataFrame using the .concat() method by providing two Series as parameters. The .concat function can alse be used to add a Series within an existing DataFrame!

Example
import pandas as pd

data = ["movies", "genres", "release_year"]
series_data = ["interstellar", "pride and prejudice", "barbie"]

example_dataframe = pd.DataFrame(data)
example_series = pd.Series(series_data)

concat_series_dataframe = pd.concat([example_dataframe, example_series], axis=1)

Sorting Values

The .sort_values() function allows you to reshape data to your specific use case. The below parameters are some of the more common:

  1. by: Name of column(s) or list of column(s) to sort by
  2. axis: Which axis to perform the operations on (0 or 1)
  3. ascending: Sort the data in ascending or descending order
Note

The above parameters are not the only ones available when using the .sort_values() function. If you would like a more extensive list you can view them here: pandas docs

Syntax

data.sort_values(by="column_name", axis=1, ascending=True)
Example
import pandas as pd

data = pd.read_csv("titanic.csv")

data.sort_values(by="embark_town").head()

Sort the titanic dataset by the embark_town column Sort the titanic dataset by the embark_town column

Pivot Tables

Pivot tables are often used to provide summary statistics through aggregate functions while reorganizing your data.

Below are some of the commonly used parameters:

  1. values: column(s) to aggregate
  2. aggfun: aggregate function to apply
  3. index: column to index by

Syntax

data.pivot_table(values="column-name", aggfunc="", index="")
Example
import pandas as pd

pivot_table = data.pivot_table(values='age', aggfunc='mean', index='alive')
print(pivot_table)

Output

pivot table created from the “age” and “alive” columns of the titanic dataset pivot table created from the “age” and “alive” columns of the titanic dataset

The above code accomplishes the following:

  1. Imports pandas
  2. Creates a pivot table using the age column with the alive column as the index, applying the mean function to agggregate the data.
  3. The output displays the average age of people alive, and not alive.

Wide vs Long Format

Pandas has the ability to reshape tabular data from wide to long and vice versa. Before we dive into how to reformat the data, let’s talk about the difference between wide format, and long format, as it relates to tabular data.

Wide format can be thought of as a dataset that includes many columns of data as it relates to similar items. For example, a dataset could hold a column for each type of food (chips, fruit, snack bar) or beverage (coffee, tea, water, soda) an office provides to its employees. This would be considered wide format and would be displayed as follows:

chipsfruitsnack barscoffeewatersoda
red hot ripletsbananaskind barkaldisfiltered watercoke
lays potato chipsappleskelloggs protein barstarbucksbottled waterpepsi

The same data in long format:

Office Item CategoryOffice Item Name
chipsred hot riplets
chipslays potato chips
fruitbananas
fruitapples
snack barskind bar
snack barskelloggs protein bar
coffeekaldis
coffeestarbucks
waterfiltered water
waterbottled water
sodacoke
sodapepsi

But why would you want to use long format over wide format?

  1. Readability: Long format is easier to read and visualize than wide format.
  2. Manipulating the data: Using the above data as an example, it is much easier to manipulate the data using only two columns instead of six.
  3. Compute: Less columns means less compute and memory needed from the host machine to perform the tasks necessary for your analysis.

This does not mean that wide format isn’t useful. If you have a smaller dataset without many columns, or if it makes it easier to simply keep the columns of data separate, wide format is perfectly fine!

Melt

Now that we have covered the differences between the two, let’s take a look at how the .melt() function can be used to reshape tables from wide format to long format. The .melt() funtion takes an existing DataFrame, creating two columns from the data within.

Syntax

1
2
3
4
5
data = []

example_dataframe = pd.DataFrame(data)

long_format_dataset = pd.melt(example_dataset, var_name='', value_name='')
Example
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
import pandas as pd

data = {'chips': ["red hot riplets", "lays potato chips", "fritos flavor twists"],
        'fruit': ["bananas", "apples", "avocado"],
        'snack bars': ['kind bar', "kelloggs protein bar", "power bar"],
        'coffee': ["kaldis", "starbucks", "blueprint"],
        'water': ["filtered", "bottled", "tap"],
        'soda': ["coke", "pepsi", "ginger ale"]}

example_dataframe = pd.DataFrame(data)

long_format_dataframe = pd.melt(example_dataframe, var_name="Office Item Category", value_name="Office Item Name")

Output

melt function melt function

The above code accomplished the following:

  1. Imported pandas
  2. Created a DataFrame called example_dateframe from the variable data holding a dictionary
  3. Create a new DataFrame object called long_format_dataframe using the .melt() function
    • the var_name parameter specifies the column name that will hold column labels from the example_dataframe
    • the value_name parameter specifies the column name that will hold column values from the example_dataframe
  4. Printed the new DataFrame object to the console