Data understanding is one of the key processes of the CRISP-DM framework.
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. Data
understanding helps us decide whether the data acquired during data collection
satisfies the business requirement and is useful for further analysis. The data
understanding process can be subdivided into the following steps:
1. Data description - Recording where the data comes from, how many
observations and variables it has, and the type of each variable.
2. Exploring and verifying the data - The variables in the data set can be
further analyzed by creating univariate and bivariate plots. These plots help
us identify the key variables and the target variable. Exploratory data
analysis also surfaces data quality issues such as missing values, vague
column/variable names, incorrect data and outliers.
Let us take a data set and understand the data in hand. We will use Cars93, an
inbuilt data set in R from the MASS package (Modern Applied Statistics with S),
and try to understand it step by step.
1. Data description
The data set, called Cars93, comes with the MASS package in R. It describes 93
cars on sale in the USA in 1993 and has 93 observations and 27 variables. To
load the data set, run the following commands:
library(MASS)    # provides the Cars93 data set
cars <- Cars93   # working copy; base R's own `cars` is a different data set
The variables of the data set can be viewed by running the str() function.
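A minimal sketch of this step, assuming the working copy is named cars as in the loading commands. The dim() call is added here only as a quick check of the 93 x 27 shape quoted earlier.

```r
library(MASS)    # Cars93 ships with MASS
cars <- Cars93   # working copy

str(cars)        # one line per variable: name, type and first few values
dim(cars)        # number of observations and number of variables
```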
The variables in a data set can be of two types:
1. Categorical variables - Also known as factor variables. These variables take
one of a fixed set of levels/categories. They are further divided into nominal
and ordinal variables.
a) Nominal variables: categorical variables that do not have any natural order,
like gender (Male, Female)
b) Ordinal variables: categorical variables that can be ordered, like
temperature (high, medium, low) or grades (A+, A, B+, B)
2. Continuous/Quantitative variables - Numeric variables measured on a scale,
which can take a very large (in principle infinite) number of possible values,
like income
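In R the distinction shows up in the class of each column. The sketch below is one way to check it. The temperature vector is a made-up illustration and is not part of Cars93.

```r
library(MASS)
cars <- Cars93

# Class of every column: "factor" for categorical variables,
# "integer" or "numeric" for quantitative ones
col_class <- sapply(cars, function(x) class(x)[1])
table(col_class)

# Nominal variable: the levels have no inherent order
levels(cars$DriveTrain)

# Ordinal variable (hypothetical example, not a Cars93 column)
temperature <- factor(c("low", "high", "medium", "low"),
                      levels = c("low", "medium", "high"), ordered = TRUE)
temperature < "high"   # comparisons are only meaningful for ordered factors
```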
A data description report can be prepared as below. This is not the only format
for such a report; companies often automate these reports in different formats
to save time.
Variable Name | Description | Data type
Manufacturer | Manufacturer of the car | Categorical
Model | Model of the car | Categorical
Type | Type of the car, such as Small, Midsize, Sporty | Categorical
Min.Price | Price for the basic version in thousands of dollars | Quantitative
Price | Average of min and max price in thousands of dollars | Quantitative
Max.Price | Price for a premium version in thousands of dollars | Quantitative
MPG.city | City miles per US gallon | Quantitative
MPG.highway | Highway miles per US gallon | Quantitative
AirBags | Airbags standard, such as none, driver only | Categorical
DriveTrain | Drive train type, such as Front, Rear | Categorical
Cylinders | Number of cylinders | Categorical
EngineSize | Engine size in litres | Quantitative
Horsepower | Maximum horsepower | Quantitative
RPM | Revolutions per minute at maximum horsepower | Quantitative
Rev.per.mile | Engine revolutions per mile in highest gear | Quantitative
Man.trans.avail | Is manual transmission available | Categorical
Fuel.tank.capacity | Fuel tank capacity in US gallons | Quantitative
Passengers | Passenger capacity | Quantitative
Length | Length in inches | Quantitative
Wheelbase | Wheelbase in inches | Quantitative
Width | Width in inches | Quantitative
Turn.circle | U-turn space in feet | Quantitative
Rear.seat.room | Rear seat room in inches | Quantitative
Luggage.room | Luggage capacity in cubic feet | Quantitative
Weight | Weight in pounds | Quantitative
Origin | US or non-US company origin | Categorical
Make | Combination of manufacturer and model | Categorical
2. Exploring and verifying the data
a) Checking for NA values
- NA values are values that are not defined. They may be missing because they
were not recorded due to human error, or because the value actually does not
exist.
- The is.na() function helps to find the NA values in the data set. Combined
with sapply(), it can count the NA values in every column.
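One way to write this check is sketched below. The name na_counts is chosen here for illustration only.

```r
library(MASS)
cars <- Cars93

# Count the NA values in each column
na_counts <- sapply(cars, function(x) sum(is.na(x)))

# Show only the columns that actually contain NA values
na_counts[na_counts > 0]
```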
Two columns/variables have missing values: Rear.seat.room and Luggage.room.
We can decide to treat these NA values or leave them as they are. They can also
be treated as a separate category. There are various methods to treat them, but
first we have to identify whether the missing values are valid or not.
In this data set, all cars of type Van have no luggage room value, and two of
the Sporty cars have none either. Some Sporty cars have no rear seat room
value.
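These observations can be reproduced by looking at which rows carry the NA values. This is a sketch of one possible check, not necessarily the one used for the original output.

```r
library(MASS)
cars <- Cars93

# Car types of the rows where Luggage.room is missing
table(droplevels(cars$Type[is.na(cars$Luggage.room)]))

# Make and type of the rows where Rear.seat.room is missing
cars[is.na(cars$Rear.seat.room), c("Make", "Type")]
```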
b) Checking for outliers
Outliers are extreme values in the data set that lie far outside the bulk of
the distribution.
Outliers can be detected using either a histogram or a boxplot.
A boxplot draws a box from the 25th to the 75th percentile. The upper limit is
marked 1.5 times the IQR (Inter Quartile Range = 75th percentile - 25th
percentile) above the 75th percentile, and the lower limit 1.5 times the IQR
below the 25th percentile. All values beyond these limits are considered
outliers.
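The rule above can be written out as a small function. This is a sketch. The helper name iqr_outliers is hypothetical, and it uses quantile() with R's default method, whereas boxplot() computes its box from hinges, so the two can differ slightly on small samples.

```r
# Return the values of x that fall outside the 1.5 * IQR limits
iqr_outliers <- function(x) {
  x <- x[!is.na(x)]                                   # ignore missing values
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]                                  # inter quartile range
  lower <- q[1] - 1.5 * iqr                           # lower limit
  upper <- q[2] + 1.5 * iqr                           # upper limit
  x[x < lower | x > upper]
}

library(MASS)
iqr_outliers(Cars93$Min.Price)
```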
Outliers can be checked using a boxplot. The same process can be repeated for
all the numerical columns/variables.
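A sketch of both steps in base R follows. The plot title and axis label are illustrative choices. boxplot.stats() is used to repeat the check across columns without drawing each plot.

```r
library(MASS)
cars <- Cars93

# Boxplot of one numerical column; points beyond the whiskers are drawn individually
boxplot(cars$Min.Price, main = "Min.Price", ylab = "Price in thousands of dollars")

# Repeat the check for every numerical column:
# boxplot.stats()$out returns the values that lie beyond the whiskers
num_cols <- sapply(cars, is.numeric)
outlier_counts <- sapply(cars[num_cols], function(x) length(boxplot.stats(x)$out))
outlier_counts
```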
On checking the quantiles of Min.Price, there is a sudden jump from 34.480
onwards.
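A quantile check of this kind can be sketched as follows. The 1% step over the upper tail is an assumption made here, since the probabilities behind the figure quoted above are not stated.

```r
library(MASS)
cars <- Cars93

# Upper-tail quantiles of Min.Price in 1% steps (step size is an assumption)
upper_q <- quantile(cars$Min.Price, probs = seq(0.90, 1, by = 0.01))
upper_q

# Differences between consecutive quantiles make a sudden jump easy to spot
diff(upper_q)
```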
c) Exploring the data - Sample data exploration
Data exploration is the process of visualizing the data. It brings forward
important features of the data that can be useful in further analysis.
Univariate analysis means that we analyze only one variable/feature.
Multivariate analysis means that we analyze more than one variable/feature
together to gain more insight. Bivariate analysis is a subset of multivariate
analysis where the number of variables/features selected for analysis is two.
- Car Type vs Number and Average Price of cars
- Airbags vs Number and Average Price of cars
- Price vs RPM for different car types
- Number of Cylinders vs Average Price for different origins
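The explorations listed above can be sketched in base R as below. The plot types, titles and colours are illustrative choices and not necessarily those of the original charts. The Airbags view follows the same pattern as the car type view, with AirBags in place of Type.

```r
library(MASS)
cars <- Cars93

# Car type vs number and average price of cars
n_by_type <- table(cars$Type)
avg_price_by_type <- tapply(cars$Price, cars$Type, mean)
barplot(n_by_type, main = "Number of cars by type")
barplot(avg_price_by_type, main = "Average price by type (thousands of dollars)")

# Price vs RPM, coloured by car type
plot(cars$RPM, cars$Price, col = as.integer(cars$Type), pch = 19,
     xlab = "RPM", ylab = "Price")
legend("topright", legend = levels(cars$Type),
       col = seq_along(levels(cars$Type)), pch = 19)

# Number of cylinders vs average price for each origin
# (NA marks cylinder/origin combinations with no cars)
avg_price_cyl_origin <- tapply(cars$Price, list(cars$Cylinders, cars$Origin), mean)
round(avg_price_cyl_origin, 1)
```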








