Chapter III - Data Understanding

Data understanding is one of the key phases of the CRISP-DM framework. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. Data understanding helps us decide whether the data acquired during data collection satisfies the business requirement and is useful for further analysis. The data understanding process can be subdivided into the following steps:

1. Describing the data - In this step, we get a feel for the data by preparing a data description report. The report describes the variables in the data, their data types, and so on.

2. Exploring and verifying the data - The variables in the data set can be analyzed further by creating univariate and bivariate plots. These plots help us identify the key variables and the target variable. Exploratory data analysis also surfaces data quality issues such as missing values, vague column/variable names, incorrect data, and outliers.

Let us take a data set and understand the data in hand:

We will take a built-in data set called Cars93 from the R package MASS (Modern Applied Statistics with S) and try to understand it step by step.

1. Data description -

The data set, called Cars93, ships with the MASS package in R. It describes 93 cars on sale in the USA in 1993 and has 93 observations and 27 variables. To load the data set, run the following commands:

library(MASS)

cars <- Cars93

The variables of the data set can be viewed by running the str() function.
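A minimal sketch of this inspection, assuming the MASS package is installed and the working copy of the data is named cars:

```r
library(MASS)   # provides the Cars93 data set

cars <- Cars93  # working copy used throughout this chapter

dim(cars)       # number of observations and number of variables
str(cars)       # name, type and first few values of every variable
```

str() prints one line per variable, which makes it a convenient starting point for the data description report.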

The variables in a dataset can be of 2 types –

1. Categorical variables - Also known as factor variables. These variables take a limited set of levels/categories. They are further divided into nominal and ordinal variables.

a) Nominal variables: These are categorical variables that do not have any natural order, like gender (Male, Female).

b) Ordinal variables: These are categorical variables that can be ordered, like temperature (high, medium, low) or grades (A+, A, B+, B).

2. Continuous/Quantitative variables - These are numeric variables, like income. A continuous variable can take infinitely many possible values.
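The three kinds of variables can be sketched in R as follows. The vectors gender, temperature and income are made-up illustrations and are not part of Cars93:

```r
# Nominal variable: categories with no natural order
gender <- factor(c("Male", "Female", "Female", "Male"))

# Ordinal variable: levels fixes the order, ordered = TRUE makes the factor ordinal
temperature <- factor(c("high", "low", "medium", "low"),
                      levels = c("low", "medium", "high"),
                      ordered = TRUE)

# Quantitative variable: a plain numeric vector (hypothetical income values)
income <- c(42000.50, 58250.00, 36100.75)

is.ordered(gender)        # FALSE
is.ordered(temperature)   # TRUE
temperature < "high"      # order comparisons work only for ordered factors
class(income)             # "numeric"
```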

The data description report can be prepared as below. This is not the only format for the report. Companies also automate such reports in different formats to save time.

*Variable Name*	*Description*	*Data type*
Manufacturer	Manufacturer of the car	Categorical
Model	Model of the car	Categorical
Type	Type of the cars like Small, Midsize, Sporty	Categorical
Min.Price	Price for the basic version, in thousands of dollars	Quantitative
Price	Average of Min.Price and Max.Price, in thousands of dollars	Quantitative
Max.Price	Price for a premium version, in thousands of dollars	Quantitative
MPG.city	City Miles per US Gallon	Quantitative
MPG.highway	Highway Miles per US Gallon	Quantitative
AirBags	Airbags standard such as none, driver only	Categorical
DriveTrain	Drive train type such as Front, Rear	Categorical
Cylinders	Number of Cylinders	Categorical
EngineSize	Engine size in litres	Quantitative
Horsepower	Maximum horsepower	Quantitative
RPM	Revolutions per minute at maximum horsepower	Quantitative
Rev.per.mile	Engine revolutions per mile in highest gear	Quantitative
Man.trans.avail	Is manual transmission available	Categorical
Fuel.tank.capacity	Fuel tank capacity in US Gallons	Quantitative
Passengers	Passenger capacity	Quantitative
Length	Length of cars in inches	Quantitative
Wheelbase	Wheelbase in inches	Quantitative
Width	Width in inches	Quantitative
Turn.circle	U-turn space in feet	Quantitative
Rear.seat.room	Rear seat room in inches	Quantitative
Luggage.room	Luggage capacity in cubic feet	Quantitative
Weight	Weight of car in pounds	Quantitative
Origin	US or non-US company origin	Categorical
Make	Combination of manufacturer and model	Categorical
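The first and third columns of such a report can be generated automatically. The sketch below is one possible approach, not a standard format: the object name report and its column names are assumptions, and the descriptions still have to be written by hand.

```r
library(MASS)
cars <- Cars93

# Label each column by its R class: factors are categorical, numbers are quantitative
data_type <- sapply(cars, function(x) {
  if (is.factor(x) || is.character(x)) "Categorical" else "Quantitative"
})

# One row per variable; the Description column is left empty for manual entry
report <- data.frame(Variable.Name = names(cars),
                     Description   = "",
                     Data.type     = unname(data_type),
                     stringsAsFactors = FALSE)

head(report)
table(report$Data.type)   # how many variables of each kind
```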

2. Exploring and verifying the data

a) Checking for NA values

- NA values are values that are not available. They may not have been recorded because of human error, or the value may genuinely not exist.

- The is.na() function finds the NA values in the data set. Below we have used the sapply() function to count the NA values in all the columns.
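A minimal sketch of this check, assuming the data is loaded as cars (the name na_count is illustrative):

```r
library(MASS)
cars <- Cars93

# is.na() flags each missing cell; sum() counts the flagged cells per column
na_count <- sapply(cars, function(x) sum(is.na(x)))

# Show only the columns that actually contain missing values
na_count[na_count > 0]
```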

Two columns/variables, Luggage.room and Rear.seat.room, have missing values, as shown in the screenshot below.

We can decide to treat these NA values or leave them as they are. They can also be treated as a separate category. There are various methods to treat them. We first have to identify whether the missing values are valid or not.

None of the cars of type Van has a luggage room value. Two of the Sporty cars have no luggage room value either.

Some Sporty cars do not have a rear seat room value.
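One way to check where the missing values fall is to cross-tabulate the car type against a missing-value flag. This is a sketch; the object names luggage_na and rear_na are illustrative:

```r
library(MASS)
cars <- Cars93

# Cross-tabulate car type against whether the value is missing
luggage_na <- table(cars$Type, is.na(cars$Luggage.room),
                    dnn = c("Type", "Luggage.room missing"))
rear_na <- table(cars$Type, is.na(cars$Rear.seat.room),
                 dnn = c("Type", "Rear.seat.room missing"))
luggage_na
rear_na

# List the affected cars to judge whether the missing values are plausible
cars[is.na(cars$Rear.seat.room), c("Make", "Type", "Passengers")]
```

If the missing values cluster in particular car types, they are more likely to be structural (the feature does not apply) than recording errors.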

b) Checking for outliers

Outliers are extreme values that lie outside the overall distribution of the data.

Outliers can be detected using either a histogram or a boxplot.

A boxplot draws a box from the 25th to the 75th percentile. The upper limit is marked at 1.5 times the IQR (interquartile range = 75th percentile - 25th percentile) above the 75th percentile, and the lower limit at 1.5 times the IQR below the 25th percentile. All values beyond these limits are considered outliers.

Outliers can be checked using a boxplot as below. The same process can be repeated for all the numerical columns/variables.

On checking the quantiles of Min.Price, there is a sudden jump from 34.480 onwards, as shown below.
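The sketch below applies the 1.5 times IQR rule to Min.Price, which is chosen here only as an example column. Note that boxplot() computes its box from hinges, so the points it flags can differ slightly from fences computed with quantile():

```r
library(MASS)
cars <- Cars93

x <- cars$Min.Price

# Fences from the rule described above: 1.5 * IQR beyond the quartiles
q <- quantile(x, c(0.25, 0.75))
lower <- q[[1]] - 1.5 * IQR(x)
upper <- q[[2]] + 1.5 * IQR(x)
outliers <- x[x < lower | x > upper]
outliers

# Visual check, plus the points that boxplot() itself treats as outliers
boxplot(x, main = "Min.Price", ylab = "Price (thousands of dollars)")
boxplot.stats(x)$out
```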
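A sketch of such a quantile check. The probability grid is an assumption, since the text does not say which percentiles were inspected:

```r
library(MASS)
cars <- Cars93

# Percentiles of Min.Price in steps of 5%
pct <- quantile(cars$Min.Price, probs = seq(0, 1, by = 0.05))
pct

# Gap between consecutive percentiles; a large gap signals a sudden jump
diff(pct)
```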

c) Exploring the data – Sample data exploration

Data exploration is the process of visualizing the data. It brings out important features of the data that can be useful in further analysis. Univariate analysis means that we analyze only one variable/feature. Multivariate analysis means that we analyze more than one variable/feature together to gain more insight. Bivariate analysis is a subset of multivariate analysis in which exactly two variables/features are analyzed.
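As a small illustration of the difference, the sketch below draws one univariate and one bivariate plot. Price and Horsepower are chosen only as example variables:

```r
library(MASS)
cars <- Cars93

# Univariate: distribution of a single variable
hist(cars$Price, main = "Distribution of Price",
     xlab = "Price (thousands of dollars)")

# Bivariate: relationship between two variables
plot(cars$Horsepower, cars$Price,
     xlab = "Horsepower", ylab = "Price (thousands of dollars)")

# A single number summarizing the linear relationship
price_hp_cor <- cor(cars$Horsepower, cars$Price)
price_hp_cor
```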

- Car Type Vs Number and Average Price of cars
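A minimal sketch of this comparison, assuming Price is used as the price variable (the name type_summary is illustrative):

```r
library(MASS)
cars <- Cars93

# Number of cars and average price per car type (both ordered by factor level)
n_cars <- table(cars$Type)
avg_price <- tapply(cars$Price, cars$Type, mean)

type_summary <- data.frame(Type      = names(n_cars),
                           Number    = as.vector(n_cars),
                           Avg.Price = round(as.vector(avg_price), 2))
type_summary

# Bar chart of the average price by car type
barplot(avg_price, main = "Average price by car type",
        ylab = "Price (thousands of dollars)")
```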