Checkpoint 1: Choose Your Data

For this assignment, you must search for a dataset you want to get to know very well. So far in this class, your data has been assigned to you. Here, you have the opportunity to discover all of the free and readily available data out in the world for public consumption.

You have creative liberty with two decisions here. One is the data you want to use; the other is the business issue you want to explore. Perhaps you have a burning question about climate change, local home prices, or average daily screen usage. The scale of your question can be as macro or as micro as you wish.

For inspiration, think about what brought you to this class in the first place. Is it to get a new job? Maybe you’ll look for more Bureau of Labor Statistics datasets or ask a different question of the Assignment 3 (9 to 5) data. Maybe you’ll want to dive into some data for a field you’re interested in joining. Healthcare, finance, advertising, manufacturing, logistics, and mapping are just a few fields with a growing need for analysts. You could also have a hobby that you want to learn more about, such as knitting or running. Choose what fuels your interest. Your choice can be an expansion or alternative analysis of a dataset we’ve used in class so far.

For some, inspiration may come instead from doing a little data surfing. We found most of the exercise and studio datasets on Kaggle. Tableau lists some solid choices as well. You are not limited to these resources for sourcing your data. If you are interested in regional data, check out resources like the STL Regional Data Exchange, OpenDataPhilly, and OpenDataKC.

Here’s what you should consider when sourcing your data:

  • Is it in a format you know how to work with?
  • Is it licensed for your educational use?
  • Does it require a request process, and if so, will that prevent you from turning in your checkpoints on time?
  • If you do find a cool set of data that requires a lengthy request process, talk to your mentors. They may know about alternatives you can use.

Examples

Checkpoint 1 examples can be found here.

Kaggle Tips

Usability

Kaggle shows a usability score for each dataset. Keep this figure in mind when searching for data. The usability score takes into account factors such as how dirty the data is, how much context is given, and what information is present in the columns. Hover over a usability score to see the criteria. Do not select a dataset with a usability score under 7.

Size

Keep in mind that large datasets may not load in Tableau. Check the size of the dataset you are interested in and make sure it is well below 50 MB. If you are struggling to find a dataset that fits the size limit, check the sizes of the individual files within the dataset to see if you could use one or two CSVs instead of the whole thing.
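Once a dataset is downloaded and unzipped, the size check above can be scripted. The following is a minimal sketch, not part of the assignment: the function names are hypothetical, and it assumes the CSV files sit directly inside one folder. The 50 MB figure comes from the guidance above.

```python
from pathlib import Path

SIZE_LIMIT_MB = 50  # limit taken from the size guidance above


def file_sizes_mb(folder):
    """Return {file name: size in MB} for every CSV directly inside folder."""
    sizes = {}
    for path in sorted(Path(folder).glob("*.csv")):
        sizes[path.name] = path.stat().st_size / (1024 * 1024)
    return sizes


def files_under_limit(folder, limit_mb=SIZE_LIMIT_MB):
    """Return the names of CSV files whose size is below limit_mb."""
    return [name for name, size in file_sizes_mb(folder).items() if size < limit_mb]
```

Calling `files_under_limit("my_dataset")` (a hypothetical folder name) lists the CSVs you could use on their own if the full dataset is too large.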

Licensing

You can find the licensing information for each dataset on the main dataset page. Click on the license type to read more about what you can and cannot do with the dataset. Some authors choose a license type that requires you to credit the source of the dataset in your work. The license type you should be looking for is CC0: Public Domain. This means that the author has waived copyright on the dataset and you may do with it what you will. However, you may not in any way imply that your work is endorsed by the author. For example, if your dataset was put together by the Bureau of Labor Statistics, you may not claim that the Bureau of Labor Statistics believes your work to be the most accurate representation of the workforce.

Tags

If you are interested in a specific topic or type of analysis (e.g., regression), you can search by tag. This can help narrow down the options. For example, if you want to find data about education and the UN, you might use “United Nations” as a search term and specify that datasets should be tagged with the “education” tag.

Additional Notes

Keep an eye out for non-Latin characters, such as Cyrillic text in Russian. You may remember from the visualization lesson that the Goodreads dataset had some errors that you were able to work around. Those errors were due to pandas not being able to process the non-Latin characters properly. If you encounter this, you may have to remove the rows causing trouble, which can introduce inaccuracies into your analysis.
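To find out ahead of time how many rows might be affected, you can scan a CSV for non-Latin letters. This is a rough sketch using only the Python standard library. The function names and sample rows are hypothetical, and treating a character as "Latin" when its Unicode name contains the word LATIN is a simplifying assumption.

```python
import csv
import io
import unicodedata


def has_non_latin(text):
    """True if text contains a letter whose Unicode name is not a Latin letter."""
    return any(
        ch.isalpha() and "LATIN" not in unicodedata.name(ch, "")
        for ch in text
    )


def rows_with_non_latin(csv_file):
    """Return (row_number, row) pairs for rows with non-Latin letters in any field.

    Row numbers count from 1 and include the header row.
    """
    flagged = []
    for number, row in enumerate(csv.reader(csv_file), start=1):
        if any(has_non_latin(field) for field in row):
            flagged.append((number, row))
    return flagged


# Hypothetical sample data standing in for a downloaded CSV.
sample = io.StringIO(
    "title,author\n"
    "Dune,Frank Herbert\n"
    "Война и мир,Лев Толстой\n"
    "Amélie,Jean-Pierre Jeunet\n"
)
flagged = rows_with_non_latin(sample)
print(flagged)
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the sample, assuming the file is UTF-8 encoded. Accented Latin letters such as "é" are not flagged, so only rows in other scripts are reported.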

Submitting Your Work

Before you submit your work, download the dataset to ensure that it will meet your needs. If your computer cannot download the dataset due to its size, try a similar dataset that is smaller or download the individual files. Open the file(s) using the default application on your computer to confirm that they actually contain numeric data and to see how many rows are in each file, so that you don’t run into an issue with Tableau later. Tableau Public works best when the dataset is below 10 million rows.
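If a file is too large to open comfortably in a spreadsheet application, the same check can be done in code. The sketch below is an illustration rather than a required step: the function names and sample data are hypothetical, a column is treated as numeric only if every non-empty value parses as a number, and the row limit is the 10 million figure quoted above.

```python
import csv
import io

ROW_LIMIT = 10_000_000  # row guidance quoted above for Tableau Public


def is_number(value):
    """True if value can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def summarize_csv(csv_file):
    """Count data rows and list columns whose non-empty values are all numeric."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:  # empty file
        return 0, []
    row_count = 0
    numeric = [True] * len(header)
    seen = [False] * len(header)  # has the column held any non-empty value?
    for row in reader:
        row_count += 1
        for i, value in enumerate(row[:len(header)]):
            if value.strip() == "":
                continue
            seen[i] = True
            if not is_number(value):
                numeric[i] = False
    numeric_columns = [
        name for name, ok, used in zip(header, numeric, seen) if ok and used
    ]
    return row_count, numeric_columns


# Hypothetical sample data standing in for a downloaded CSV.
sample = io.StringIO(
    "city,price,bedrooms\n"
    "St. Louis,250000,3\n"
    "Kansas City,,2\n"
    "Philadelphia,310000.5,4\n"
)
row_count, numeric_columns = summarize_csv(sample)
print(row_count, numeric_columns, row_count < ROW_LIMIT)
```

Values formatted with thousands separators or currency symbols (for example "1,000" or "$5") will not parse as numbers here, so a column that looks numeric in a spreadsheet may still need cleaning.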

When you are confident in your choice, create a new document on your computer using your word-processing program. Put your name in the right-hand corner, type up your business issue, and provide the link to your chosen dataset. Submit your document on the Canvas submission page for Graded Assignment #4: Checkpoint 1.
