Data Exploration Notes

According to Stephen Few, data analysis is modeling data so you can make sense of it. Source: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

There are two main techniques for retrieving relevant data from large, unorganized pools: manual and automatic.

The manual method is another name for data exploration, while the automatic method is also known as data mining.

Some people believe these terms are synonymous, while others see a technical difference between them. Data mining generally refers to gathering relevant data from large databases. Data exploration, on the other hand, generally refers to a data user being able to find his or her way through large amounts of data in order to gather necessary information. - https://www.techopedia.com/definition/28789/data-exploration

Getting started

Steps for data exploration

Prepare data for exploration (ideally in a long format)
Identify predictor and target variables
Document assumptions you have about the data
Document methods of collecting and cleaning the data
Review data looking for trends, patterns, and exceptions to these trends and patterns
Use univariate and bi-variate analysis techniques to explore the data
Document interesting information you want to explore in detail
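
The first step above mentions long format. As a rough illustration only, the sketch below reshapes a small wide table (one column per month) into long format (one row per observation) using plain Python. The data, the column names, and the helper name to_long are all hypothetical and not taken from any of the sources cited here.

```python
# Hypothetical wide-format rows: one row per region, one column per month.
wide_rows = [
    {"region": "CA", "jan": 120, "feb": 135},
    {"region": "NY", "jan": 98, "feb": 110},
]


def to_long(rows, id_key, value_keys, var_name="variable", value_name="value"):
    """Reshape wide rows into long format: one output row per (id, variable) pair."""
    long_rows = []
    for row in rows:
        for key in value_keys:
            long_rows.append(
                {id_key: row[id_key], var_name: key, value_name: row[key]}
            )
    return long_rows


long_rows = to_long(
    wide_rows, "region", ["jan", "feb"], var_name="month", value_name="sessions"
)
for row in long_rows:
    print(row)
```

Long format tends to be easier to segment and chart because every row is a single observation with its own labels.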

Data exploration happens after you collect and clean your data. It is best to clean the data as much as possible before you jump into exploration.

Data exploration, cleaning and preparation can take up to 70% of your total project time.

Cleaning your data

You can use several tools to clean your data, including Excel, Data Wrangler, Trifacta, and OpenRefine. Some common outputs of data needed for exploration:

Variable Identification

First, identify the predictor (input) and target (output) variables. Next, identify the data type and category of each variable. Example: audience dimensions as predictors of phone calls

Type of Variables	Variable	Data Type
Target	Unique Phone Call Event	Nominal (dichotomous yes/no)
Predictor	Gender	Nominal
Predictor	New/return user type	Nominal
Predictor	Region (U.S. State)	Nominal
Predictor	Device Type (mobile, etc)	Nominal
Predictor	Number of sessions	Ratio
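
A first pass at variable identification can be partly automated. The sketch below is only a starting point: it guesses whether each column is numeric or categorical and counts unique and missing values. The records, column names, and the helper describe_variable are hypothetical. The measurement level (nominal, ordinal, ratio) still has to be assigned by a person who knows what the values mean.

```python
def describe_variable(values):
    """Guess a coarse kind for a column and count unique and missing values."""
    non_missing = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numeric = bool(non_missing) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in non_missing
    )
    return {
        "kind": "numeric" if is_numeric else "categorical",
        "n_unique": len(set(non_missing)),
        "n_missing": len(values) - len(non_missing),
    }


# Hypothetical records: "called" is the target, the rest are predictors.
records = [
    {"called": "yes", "device": "mobile", "sessions": 3},
    {"called": "no", "device": "desktop", "sessions": 1},
    {"called": "no", "device": "mobile", "sessions": None},
]

# Turn the list of rows into one list of values per column.
columns = {name: [row[name] for row in records] for name in records[0]}
summary = {name: describe_variable(vals) for name, vals in columns.items()}

for name, info in summary.items():
    print(name, info)
```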

Univariate Analysis

Here we examine the variables one at a time. The method depends on whether the variable is categorical or continuous. Univariate analysis can also help identify missing and outlier values.

Continuous variables: focus on the central tendency and spread of the data.
Central tendencies to look at: mean, median, mode
Measures of dispersion and shape: min, max, range, quartiles, IQR, variance, standard deviation, skewness, and kurtosis
Visualization methods: histogram, box plot

Categorical variables: the primary way to understand this data is a frequency table showing the distribution of each category. Look at both the count and the percentage of the total for each category.
Central tendencies to look at: mode
Visualization methods: bar chart, stem and leaf, frequency table
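
As a minimal sketch of the univariate summaries described above, the example below uses only Python's standard library. The sample values are hypothetical. The outlier check uses the common 1.5 x IQR rule, which is an assumption on my part and not something the notes prescribe. Skewness and kurtosis are left out because the standard library has no function for them.

```python
import statistics
from collections import Counter

# Hypothetical continuous variable: number of sessions per user.
sessions = [1, 2, 2, 3, 3, 3, 4, 5, 8, 21]

mean = statistics.mean(sessions)
median = statistics.median(sessions)
mode = statistics.mode(sessions)
value_range = max(sessions) - min(sessions)
std_dev = statistics.stdev(sessions)  # sample standard deviation

# Quartiles and interquartile range (default "exclusive" method).
q1, q2, q3 = statistics.quantiles(sessions, n=4)
iqr = q3 - q1

# Flag values outside the 1.5 * IQR fences as potential outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in sessions if x < low_fence or x > high_fence]

print("mean", mean, "median", median, "mode", mode)
print("range", value_range, "std dev", round(std_dev, 2), "IQR", iqr)
print("potential outliers", outliers)

# Hypothetical categorical variable: device type.
devices = ["mobile", "desktop", "mobile", "tablet", "mobile", "desktop"]
freq = Counter(devices)
total = sum(freq.values())

# Frequency table with count and percentage of total for each category.
for category, count in freq.most_common():
    print(category, count, f"{100 * count / total:.1f}%")
```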

Bi-Variate Analysis

Bi-variate analysis is about finding a relationship between two variables. How you analyze the data depends on the two data types.

Continuous and Continuous: The best visualization is a scatter plot, which is a great way to see negative, positive, and curvilinear relationships (as well as no relationship). Correlation: calculate the correlation coefficient (r) to get an idea of the strength and direction of a linear relationship.
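
To make the correlation step concrete, here is a small sketch that computes Pearson's r by hand. The function name pearson_r and the sample data are hypothetical. Pearson's r only measures linear association, so a curvilinear relationship can still produce an r near zero, which is one reason to look at the scatter plot first.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical paired observations: sessions and pages viewed per user.
sessions = [1, 2, 3, 4, 5, 6]
pages_viewed = [2, 3, 7, 8, 9, 14]
print(round(pearson_r(sessions, pages_viewed), 3))
```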

Categorical and Categorical: A two-way table lets you analyze the data and is the input to a chi-square test. Visualization methods: a stacked column chart is like a visual version of a two-way table. You can use either count-based stacked columns or rate-based columns where every column sums to 100%.
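
The sketch below builds a two-way table from paired categories and computes the chi-square statistic and its degrees of freedom by hand. The data and the function name chi_square_statistic are hypothetical. It stops at the statistic: turning it into a p-value requires the chi-square distribution, which a statistics package would normally supply.

```python
from collections import Counter


def chi_square_statistic(pairs):
    """Return (chi-square statistic, degrees of freedom) for (row, col) pairs."""
    counts = Counter(pairs)  # the two-way table: (row, col) -> observed count
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    n = sum(counts.values())
    row_totals = {r: sum(counts[(r, c)] for c in cols) for r in rows}
    col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}

    chi2 = 0.0
    for r in rows:
        for c in cols:
            # Expected count if the two variables were independent.
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (counts[(r, c)] - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    return chi2, dof


# Hypothetical observations: device type vs. whether the user called.
pairs = (
    [("mobile", "yes")] * 30
    + [("mobile", "no")] * 20
    + [("desktop", "yes")] * 10
    + [("desktop", "no")] * 40
)

chi2, dof = chi_square_statistic(pairs)
print(round(chi2, 2), dof)
```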

Categorical and Continuous: To test for statistical significance, we can perform a Z-test, t-test, or ANOVA. Visualization methods: a box plot for each category.
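
As one possible illustration of the categorical and continuous case, the sketch below compares a continuous variable across two categories with Welch's t statistic, a t-test variant that does not assume equal variances. Choosing Welch's version is my assumption, and the data and the function name welch_t are hypothetical. Only the statistic is computed; a p-value needs the t distribution, which a statistics package would normally supply.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least 2 observations")
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / standard_error


# Hypothetical sessions per user, split by device type.
mobile_sessions = [2, 3, 3, 4, 5]
desktop_sessions = [5, 6, 7, 7, 8]

t_stat = welch_t(mobile_sessions, desktop_sessions)
print("mobile mean", statistics.mean(mobile_sessions))
print("desktop mean", statistics.mean(desktop_sessions))
print("Welch t", round(t_stat, 2))
```

A large absolute t value suggests the group means differ by more than sampling noise would explain. With more than two categories, ANOVA is the usual extension.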

Tools

iNZight lite

Lets you view univariate and bivariate data easily, and it can segment data into side-by-side graphs (small multiples) so you can see each segment individually. Great for exploring data.

This small-multiples time-series example shows how life expectancy is dropping in most N. American nations, and it can be used to further explore the nations with the biggest declines, like Jamaica and Nicaragua.

Other options: https://www.analyticsvidhya.com/blog/2016/09/18-free-exploratory-data-analysis-tools-for-people-who-dont-code-so-well/
Another source to explore: https://www.grad.ubc.ca/sites/default/files/doc/page/ct2016-10-18_handout.pdf
A third: https://learnche.org/pid/univariate-review/index