According to Stephen Few, data analysis is modeling data so you can make sense of it. Source: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
There are two main methodologies or techniques used to retrieve relevant data from large, unorganized pools. They are the manual and automatic methods.
The manual method is another name for data exploration, while the automatic method is also known as data mining.
Some people believe these terms are synonymous, while others see a technical difference between them. Data mining generally refers to gathering relevant data from large databases. Data exploration, on the other hand, generally refers to a data user being able to find his or her way through large amounts of data in order to gather necessary information. - https://www.techopedia.com/definition/28789/data-exploration
Data exploration happens after you have collected and cleaned your data. It is best to clean the data as much as possible before you jump into exploration.
Data exploration, cleaning and preparation can take up to 70% of your total project time.
You can use several tools to clean your data, including Excel, Data Wrangler, Trifacta, and OpenRefine. Some common outputs of data needed for exploration:
First, identify the predictor (input) and target (output) variables. Next, identify the data type and category of each variable. Example: audience dimensions as predictors of phone calls
| Type of Variables | Variable | Data Type |
|---|---|---|
| Target | Unique Phone Call Event | Nominal (dichotomous yes/no) |
| Predictor | Gender | Nominal |
| Predictor | New/return user type | Nominal |
| Predictor | Region (U.S. State) | Nominal |
| Predictor | Device Type (mobile, etc) | Nominal |
| Predictor | Number of sessions | Ratio |
Univariate analysis examines the variables one at a time. The method depends on whether the variable is categorical or continuous. Univariate analysis can also help identify missing and outlier values.

Continuous variables: focus on the central tendency and spread of the data.
- Central tendency: mean, median, mode
- Dispersion: min, max, range, quartiles, IQR, variance, standard deviation, skewness, and kurtosis
- Visualization methods: histogram, box plot

Categorical variables: the primary way to understand this data is a frequency table that shows the distribution of each category. Look at both the count and the count % for each category.
- Central tendency: mode
- Visualization methods: bar chart, stem and leaf, frequency table
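The univariate measures listed above can be sketched with Python's standard library. This is a minimal illustration, not a full profiling tool: the `sessions` and `devices` samples are made up, the function names are hypothetical, and the 1.5 × IQR outlier rule is one common convention rather than the only option. Skewness and kurtosis are left out because the standard library does not provide them.

```python
import statistics
from collections import Counter

# Hypothetical samples: sessions per user (continuous) and device type (categorical).
sessions = [1, 2, 2, 3, 3, 3, 4, 5, 8, 21]
devices = ["mobile", "desktop", "mobile", "tablet", "mobile", "desktop"]


def summarize_continuous(values):
    """Return central-tendency and dispersion measures for a numeric list."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
    }


def iqr_outliers(values):
    """Flag values more than 1.5 * IQR below Q1 or above Q3 (a rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


def frequency_table(values):
    """Return {category: (count, percent)} ordered by count, descending."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.most_common()}


if __name__ == "__main__":
    print(summarize_continuous(sessions))
    print(iqr_outliers(sessions))      # [21]
    print(frequency_table(devices))
```

In this sample the mean (5.2) sits well above the median (3) because of the single large value, which is the kind of skew a histogram or box plot would also reveal.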
Bivariate analysis is about finding a relationship between two variables. How you analyze the data depends on the data types of the two variables.
Continuous and Continuous: The best visualization is a scatter plot, which is a great way to see negative, positive, and curvilinear relationships (as well as no relationship). Correlation: calculate the correlation coefficient (r) to get an idea of the strength and direction of a linear relationship.
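As a rough sketch of the correlation step, the Pearson coefficient can be computed by hand as below. The function name `pearson_r` and the sample values are illustrative assumptions, not part of the original material.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cross / math.sqrt(ss_x * ss_y)


if __name__ == "__main__":
    # Perfectly linear, increasing relationship.
    print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
    # Clear U-shaped (curvilinear) relationship, yet r is zero.
    print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The second call is the reason to look at the scatter plot as well: r only measures linear association, so a strong curvilinear pattern can still produce a coefficient near zero.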
Categorical and Categorical: A two-way table lets you analyze the data and provides the input for a chi-square test. Visualization methods: a stacked column chart is like a visual version of a two-way table. You can use either count-based stacked columns or 100% stacked columns, where every column shows rates that sum to 100%.
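A two-way table and the chi-square statistic derived from it might look like the sketch below. The device and phone-call data are invented for illustration, the function names are hypothetical, and no continuity correction is applied. Turning the statistic into a p-value requires the chi-square distribution, which this sketch does not include.

```python
from collections import Counter


def two_way_table(rows, cols):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(zip(rows, cols))


def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals, col_totals = Counter(), Counter()
    for (r, c), n in table.items():
        row_totals[r] += n
        col_totals[c] += n
    total = sum(table.values())

    chi2 = 0.0
    for r in row_totals:
        for c in col_totals:
            # Expected count if the two variables were independent.
            expected = row_totals[r] * col_totals[c] / total
            observed = table.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


if __name__ == "__main__":
    # Hypothetical data: device type vs. whether the user made a phone call.
    device = ["mobile"] * 50 + ["desktop"] * 50
    called = ["yes"] * 30 + ["no"] * 20 + ["yes"] * 10 + ["no"] * 40
    table = two_way_table(device, called)
    print(dict(table))
    print(chi_square_statistic(table))
```

With these made-up counts, every expected cell differs from the observed cell by 10, so the statistic is clearly above zero, which suggests device type and phone calls are not independent in this sample.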
Categorical and Continuous: To check whether the difference between categories is statistically significant, you can perform a Z-test, t-test, or ANOVA. Visualization methods: a box plot for each category.
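For the categorical-versus-continuous case, one possible sketch is to group the continuous values by category and compute a two-sample t statistic. The version below uses Welch's form, which does not assume equal variances; that choice, the function names, and the sample data are assumptions made for illustration. It returns only the statistic, not a p-value.

```python
import math
import statistics
from collections import defaultdict


def group_by(categories, values):
    """Collect the continuous values observed for each category."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    return dict(groups)


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least 2 observations")
    # Standard error of the difference, using each group's sample variance.
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


if __name__ == "__main__":
    # Hypothetical data: number of sessions by device type.
    device = ["mobile", "desktop", "mobile", "desktop", "mobile", "desktop"]
    sessions = [2, 1, 4, 2, 6, 3]
    groups = group_by(device, sessions)
    print(groups)
    print(welch_t(groups["mobile"], groups["desktop"]))
```

The same `group_by` output is also what you would feed into a box plot per category. With more than two categories, ANOVA is the usual next step instead of pairwise t-tests.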
Allows you to view univariate and bivariate data easily, and it can segment data into side-by-side graphs (small multiples) so you can see each segment individually. Great for exploring data.
This small-multiples time-series example shows how life expectancy is dropping in most North American nations, and it can be used to further explore the nations with the biggest declines, such as Jamaica and Nicaragua.
Other options: https://www.analyticsvidhya.com/blog/2016/09/18-free-exploratory-data-analysis-tools-for-people-who-dont-code-so-well/
Another source to explore: https://www.grad.ubc.ca/sites/default/files/doc/page/ct2016-10-18_handout.pdf
A third: https://learnche.org/pid/univariate-review/index