Data Analysis Methods and Techniques

As you continue to learn about data analysis, you will certainly come across different types of methods and technologies. Each scenario where data is analyzed might have a completely different use case than the next, so developing an understanding of which method or technique is appropriate is a critical skill for beginners!

Below are some of the more common methods and techniques used today.

Methods and Techniques

  1. Cleaning data: After data has been collected, it will most often need to be cleaned. Collected data is often improperly formatted or duplicated, contains errors, typos, or missing values, or is simply incorrect, so correcting these mistakes, or “cleaning” the data, is a very logical step. One of the most important reasons to clean data is so that you do not reach false or inaccurate conclusions when presenting it.
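
A minimal sketch of these cleaning steps using only the Python standard library; the sample records and field names are hypothetical:

```python
# Hypothetical raw records with common problems: stray whitespace,
# a missing value, a duplicate, and ages stored as strings.
records = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": ""},        # missing value
    {"name": " Alice ", "age": "34"},  # duplicate
    {"name": "Cara", "age": "29"},
]

cleaned = []
seen = set()
for rec in records:
    name = rec["name"].strip()          # fix stray whitespace
    age = rec["age"].strip()
    if not name or not age:             # drop rows with missing values
        continue
    key = (name, age)
    if key in seen:                     # drop duplicates
        continue
    seen.add(key)
    cleaned.append({"name": name, "age": int(age)})  # correct the type

print(cleaned)  # [{'name': 'Alice', 'age': 34}, {'name': 'Cara', 'age': 29}]
```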

  2. Data Visualization: Adding a visual to your data is helpful when presenting and allows for a better understanding. Using tools and software like Google Sheets and Microsoft Excel, among others, to create basic charts, plots, or histograms is very common practice.
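
To illustrate the idea behind a histogram without any spreadsheet or plotting software, here is a minimal text-based sketch using only the standard library; the survey responses are hypothetical:

```python
from collections import Counter

# Hypothetical survey responses: favorite car color
responses = ["red", "blue", "red", "green", "blue", "red"]
counts = Counter(responses)

# Draw one '#' per occurrence, most frequent first
for color, count in counts.most_common():
    print(f"{color:<6} {'#' * count}")
# red    ###
# blue   ##
# green  #
```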

  3. Formatting data: Providing a specific format for certain columns, rows, values, data types, and more will help mitigate errors in your data. If you are working with data that has a specific length or is of a specific type you can add that parameter to prevent user error when adding and collecting data.
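
A minimal sketch of enforcing a format on incoming values; the specific rules (a five-digit ZIP code and a bounded integer age) are hypothetical examples of length and type parameters:

```python
import re

def validate_zip(value: str) -> bool:
    """ZIP codes must be exactly five digits (a length constraint)."""
    return re.fullmatch(r"\d{5}", value) is not None

def validate_age(value: str) -> bool:
    """Ages must be integers in a sensible range (a type constraint)."""
    return value.isdigit() and 0 < int(value) < 120

print(validate_zip("90210"))   # True
print(validate_zip("9021"))    # False -- wrong length
print(validate_age("34"))      # True
print(validate_age("thirty"))  # False -- wrong type
```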

  4. Cluster analysis: This type of analysis is a technique used to group together data points that have similarities. This allows you to organize data based on the features present within those groups.

    Note

    There are multiple strategies used to “cluster” data together. Below are some examples:

    • K-means: partitions the data into a chosen number (k) of clusters.
    • Hierarchical clustering: involves organizing the clusters into a tree diagram.
    • Density-based clustering: creates clusters by identifying areas of high density within the data, separating them from low density areas.
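
A minimal sketch of the k-means idea with only the standard library; in practice you would use a library such as scikit-learn. The 2-D points and starting centers are hypothetical:

```python
import math

# Two visually obvious groups of 2-D points
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),   # one tight group
          (8.0, 8.0), (8.5, 7.5), (7.8, 8.2)]   # another tight group

# k = 2: start with one rough center per region
centers = [(0.0, 0.0), (10.0, 10.0)]

for _ in range(10):  # a few assignment/update rounds
    # Assignment step: each point joins its nearest center
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    # Update step: move each center to the mean of its cluster
    centers = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]

print(clusters[0])  # the low-valued group
print(clusters[1])  # the high-valued group
```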
  5. Factor analysis: The practice of finding hidden patterns or commonalities that drive relationships between data, combining these underlying factors into a single variable.

  6. Time series analysis: Like the name suggests, time series analysis is an examination of patterns or trends in data over time. Time series analysis will help predict or forecast future outcomes based on previous occurrences. Time series data is also measured or modeled with time steps.

    Example

    If you recorded what you ate for breakfast over the course of an entire week, each recording would be a single “time step,” and all of the steps in that period of time would form a series.
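
A minimal sketch of a time series with one value per time step, using a simple moving average as a naive forecast for the next step; the daily step counts are hypothetical:

```python
# Hypothetical time series: one recording (time step) per day for a week
daily_steps = [6000, 7200, 6800, 7500, 7100, 6900, 7300]

# Naive forecast: average the three most recent time steps
window = 3
moving_avg = sum(daily_steps[-window:]) / window
print(moving_avg)  # 7100.0
```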

  7. Sentiment analysis: This type of analysis involves reading through data to filter out words expressing feeling. This might be related to survey responses, social media posts or comments, or reviews of your favorite restaurant. Most often the goal of collecting this type of data is to get a grasp on whether or not the overall sentiment is positive or negative.
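
A minimal sketch of lexicon-based sentiment scoring; the word lists and example reviews are hypothetical, and real tools use far richer models:

```python
# Tiny hypothetical sentiment lexicons
POSITIVE = {"great", "delicious", "friendly", "love"}
NEGATIVE = {"slow", "cold", "rude", "awful"}

def sentiment(text: str) -> str:
    """Score text by counting positive vs. negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The food was delicious and the staff were friendly"))  # positive
print(sentiment("Service was slow and the soup was cold"))              # negative
```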

  8. Cohort analysis: Type of analysis that observes a specified grouping of data. A collection of individuals that start a new educational course together would be considered a cohort. Examining that group of learners or students and their outcomes while comparing them to a future or past cohort would be an example of cohort analysis. Another example includes grouping together users that only purchase items from your website during the month of December.
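
A minimal sketch of grouping users into cohorts by signup month and comparing the groups; the user records are hypothetical:

```python
from collections import defaultdict

# Hypothetical users with a signup date and a purchase count
users = [
    {"name": "Ana", "signup": "2023-12-03", "purchases": 4},
    {"name": "Ben", "signup": "2023-12-19", "purchases": 1},
    {"name": "Cho", "signup": "2024-01-07", "purchases": 6},
]

# Group users into cohorts by the month they signed up
cohorts = defaultdict(list)
for user in users:
    month = user["signup"][:7]  # e.g. "2023-12"
    cohorts[month].append(user)

# Compare average purchases across cohorts
for month, members in sorted(cohorts.items()):
    avg = sum(u["purchases"] for u in members) / len(members)
    print(month, avg)
# 2023-12 2.5
# 2024-01 6.0
```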

  9. Exploratory analysis: Exploring or investigating data is an important step in understanding it. Exploratory analysis is often referred to as exploratory data analysis (EDA), which will be covered more in depth later in the course. Finding connections and similarities within your data often comes from exploration, which will help you with other methods that include cleaning, organization, and visualization.

  10. Regression analysis: Type of analysis that examines the relationship between independent and dependent variables with the goal of predicting future outcomes for the dependent variable.
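
A minimal sketch of simple linear regression (ordinary least squares) computed by hand; the x and y values are hypothetical data that roughly follow y = 2x:

```python
# Hypothetical data: independent variable xs, dependent variable ys
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# Predict the dependent variable for a new independent value
prediction = slope * 6 + intercept
print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```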

  11. Monte Carlo simulation: The practice of simulating scenarios with probable outcomes and viewing the results. If you flip a two-sided coin 100 times, you would expect it to land on each side close to 50 times; however, this may not be the case. In a real use case of a Monte Carlo simulation, you might flip the coin 10,000 or 20,000 times to reduce the element of chance.
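
A minimal sketch of the coin-flip simulation described above; the seed is fixed only so the run is repeatable:

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable

flips = 20_000
heads = sum(random.random() < 0.5 for _ in range(flips))

# With this many trials, the observed fraction settles close to 0.5
print(heads / flips)
```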

  12. Descriptive statistics: The use of count, frequency, spreads, centers, mean, median, mode, and other methods to summarize your data. For example, you may view the number of cars in a parking lot (count) and the colors of each car (frequency) to obtain a spread, mean, median, and mode.
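
A minimal sketch of these summaries using the standard library's statistics module; the hourly car counts are hypothetical:

```python
import statistics
from collections import Counter

# Hypothetical data: cars entering a parking lot each hour
cars_per_hour = [12, 15, 12, 18, 22, 12, 15]

count = len(cars_per_hour)                        # how many observations
mean = statistics.mean(cars_per_hour)             # center: average
median = statistics.median(cars_per_hour)         # center: middle value
mode = statistics.mode(cars_per_hour)             # most frequent value
spread = max(cars_per_hour) - min(cars_per_hour)  # spread: range
frequency = Counter(cars_per_hour)                # frequency of each value

print(count, round(mean, 2), median, mode, spread)
print(frequency)
```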

  13. Classification: All data is commonly classified by something. You might classify data by time, where the data came from, how it is used, what type of data it is, whether the data is public or private, and more. Classifying data will help with the management and organization of the data, data retrieval, and often security.

  14. Correlation: A measure of how strong a relationship is between two variables. As an analyst you might find that when one variable changes, another changes in relation. For example, a car that weighs more tends to use more fuel, resulting in a lower miles-per-gallon average. Keep in mind that correlation alone does not establish causation.
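
A minimal sketch of the Pearson correlation coefficient computed by hand; the car weight and fuel-use figures are hypothetical:

```python
import math

# Hypothetical data: heavier cars tend to use more fuel
weights = [1200, 1500, 1800, 2100, 2400]  # car weight in kg
fuel_used = [5.0, 6.1, 7.2, 8.0, 9.3]     # litres per 100 km

n = len(weights)
mean_w = sum(weights) / n
mean_f = sum(fuel_used) / n

# Pearson r: covariance divided by the product of the deviations' magnitudes
cov = sum((w - mean_w) * (f - mean_f) for w, f in zip(weights, fuel_used))
std_w = math.sqrt(sum((w - mean_w) ** 2 for w in weights))
std_f = math.sqrt(sum((f - mean_f) ** 2 for f in fuel_used))

r = cov / (std_w * std_f)
print(round(r, 3))  # close to 1: a strong positive relationship
```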

There are many more techniques and methods than those listed above. The examples provided are meant to help you understand that there are many different ways to approach a problem, and the right choice will vary from project to project. Regardless of what methods or techniques are used, the data should always be able to tell a story that stakeholders can relate to. This will allow for a better understanding of how the data has been shaped over time and what it can be used for when making important decisions moving forward.