Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
- maximize insight into a data set;
- uncover underlying structure;
- extract important variables;
- detect outliers and anomalies;
- test underlying assumptions;
- develop parsimonious models; and
- determine optimal factor settings.
- EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and each focusing on one aspect of data characterization.
- EDA has a broader scope: it is an approach to data analysis that postpones the usual assumptions about what kind of model the data follows, in favor of the more direct approach of letting the data itself reveal its underlying structure and model.
- EDA is not a mere collection of techniques; it is a philosophy of how we dissect a data set: what we look for, how we look, and how we interpret. It is true that EDA makes heavy use of the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics.
- Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that the main role of EDA is, by its very nature, to explore open-mindedly. Graphics gives analysts unparalleled power to do so, enticing the data to reveal its structural secrets and often yielding new, unsuspected insight into the data.
- Many data scientists will agree that it is very easy to get lost in data: the more you collect, study, and analyze, the more you want to explore. Rabbit holes of data are familiar and friendly places for data analysts and data scientists to dive into, spending hours extracting, modeling, and analyzing large datasets.
- EDA techniques are either graphical or quantitative (non-graphical). Graphical methods summarize the data in a diagrammatic or visual way, while quantitative methods involve the calculation of summary statistics. Both types are further divided into univariate and multivariate methods.
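As an illustration of the quantitative, univariate side, the sketch below computes a few common summary statistics. The `ages` sample and the `summarize` helper are hypothetical names introduced only for this example, and it uses Python's standard `statistics` module so that it runs without extra dependencies; in practice a data-frame library is commonly used for the same purpose.

```python
import statistics

# Hypothetical sample: ages of eight survey respondents.
ages = [23, 25, 31, 35, 35, 41, 47, 62]


def summarize(values):
    """Return common univariate summary statistics for a numeric sample."""
    # quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3).
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "q3": q3,
        "max": max(values),
    }


summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Comparing the mean with the median, and the quartiles with the extremes, already hints at skew and possible outliers before any plot is drawn.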
EDA Steps:-
- Data Sourcing
- Data Cleaning
- Univariate analysis
- Bivariate analysis
- Multivariate analysis
- Handling missing values
- Removing duplicates
- Outlier treatment
- Normalizing and scaling (numerical variables)
- Encoding categorical variables (dummy variables)
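Three of the preparation steps listed above (removing duplicates, scaling a numerical variable, and dummy-encoding a categorical variable) can be sketched as follows. The `rows` data and every variable name are made up for illustration, and min-max scaling is only one of several possible normalization choices. The sketch uses plain Python so the mechanics are visible.

```python
# Hypothetical records: (city, income). The third row duplicates the first.
rows = [
    ("Paris", 40_000.0),
    ("Tokyo", 55_000.0),
    ("Paris", 40_000.0),
    ("Lima", 25_000.0),
]

# Removing duplicates: keep the first occurrence and preserve row order.
unique_rows = list(dict.fromkeys(rows))

# Normalizing: min-max scaling maps the numerical column onto [0, 1].
# (Assumes the column is not constant; otherwise hi - lo would be zero.)
incomes = [income for _, income in unique_rows]
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]

# Encoding: one dummy (0/1) column per category, in sorted category order.
categories = sorted({city for city, _ in unique_rows})
dummies = [[int(city == c) for c in categories] for city, _ in unique_rows]

print(unique_rows)
print(scaled)
print(categories)
print(dummies)
```

Each row of `dummies` has exactly one 1, in the column of that row's category.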
Types of Graphical Analysis:-
- Numerical vs. Numerical
  1. Scatterplot
  2. Line plot
  3. Heatmap for correlation
  4. Joint plot
- Categorical vs. Numerical
  1. Bar chart
  2. Categorical box plot
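The numbers behind two of these plots can be computed directly: a correlation heatmap colors cells by a correlation coefficient, and a bar chart of a numerical variable by category typically shows one aggregate (here, the mean) per category. The following is a minimal sketch under those assumptions; the `hours`, `score`, and `group` data and the `pearson` helper are hypothetical.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: study hours, exam score, and a class label per student.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]
group = ["a", "a", "b", "b", "b"]

# Numerical vs. numerical: the value a correlation heatmap cell would show.
r = pearson(hours, score)

# Categorical vs. numerical: the bar heights of a mean-per-category bar chart.
by_group = defaultdict(list)
for g, s in zip(group, score):
    by_group[g].append(s)
group_means = {g: sum(v) / len(v) for g, v in by_group.items()}

print(f"r = {r:.3f}")
print(group_means)
```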
Handling Missing Values:-
- Deleting rows with missing values
- Imputing missing data based on mean/median/mode
- Estimating missing data with ML models such as k-nearest neighbors (KNN)
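The three strategies above can be sketched in a few lines. Everything here is illustrative: the `heights` and `people` data, the `impute` and `knn_impute` helpers, and the choice of `k=2` are assumptions made for the example, and the KNN variant is a deliberately simplified one that measures distance on a single other feature.

```python
import statistics

# Hypothetical column with missing entries marked as None.
heights = [170.0, None, 165.0, 180.0, None, 165.0]
observed = [h for h in heights if h is not None]

# Strategy 1: delete the rows with missing values.
dropped = observed


# Strategy 2: fill every gap with one summary value of the observed data.
def impute(values, fill):
    return [fill if v is None else v for v in values]


mean_filled = impute(heights, statistics.mean(observed))
median_filled = impute(heights, statistics.median(observed))
mode_filled = impute(heights, statistics.mode(observed))

# Strategy 3: simplified KNN. Rows are (weight, height); a missing height is
# estimated as the mean height of the k rows with the closest weight.
people = [(60.0, 165.0), (62.0, 167.0), (80.0, 182.0), (82.0, 184.0), (61.0, None)]


def knn_impute(rows, k=2):
    known = [(w, h) for w, h in rows if h is not None]
    filled = []
    for w, h in rows:
        if h is None:
            nearest = sorted(known, key=lambda row: abs(row[0] - w))[:k]
            h = statistics.mean([height for _, height in nearest])
        filled.append((w, h))
    return filled


print(mean_filled)
print(median_filled)
print(mode_filled)
print(knn_impute(people))
```

Deleting rows is the simplest option but discards information, while a single fill value shrinks the variance of the column. The KNN estimate uses related rows, at the cost of choosing a distance measure and `k`.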
Outlier Detection:-
- Based on standard deviations away from the mean (continuous variables)
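- Based on the interquartile range (IQR), which applies to numerical variables and is less sensitive to skew and extreme values than the standard-deviation rule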
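Both detection rules can be sketched with the standard library. The thresholds used here (3 standard deviations, 1.5 times the IQR) are common conventions rather than values given in this article, and the `values` sample and function names are hypothetical.

```python
import statistics

# Hypothetical measurements with one suspicious value at the end.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 50]


def zscore_outliers(data, threshold=3.0):
    """Flag points more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > threshold * sd]


def iqr_outliers(data, k=1.5):
    """Flag points below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


print(zscore_outliers(values))
print(iqr_outliers(values))
```

Note that an extreme value inflates the mean and standard deviation it is compared against, so in small samples the standard-deviation rule can miss outliers that the IQR rule flags.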