
Chapter III - Data Understanding

Data understanding is one of the key phases of the CRISP-DM framework. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. Data understanding helps us decide whether the data acquired during data collection satisfies the business requirement and is useful for further analysis. The data understanding process can be subdivided into the following steps:
      1. Describing the data - In this step, we get a feel for the data by preparing a data description report. This report consists of a description of the variables in the data, their data types and so on.
      2. Exploring and verifying the data - The variables in the data set can be analyzed further by creating univariate and bivariate plots. These plots help us identify the key variables and the target variable. Exploratory data analysis also gives insight into data quality issues such as missing values, vague column/variable names, incorrect data, outliers and so on.

Let us take a data set and understand the data in hand:
We will take a built-in data set called Cars93 from the MASS package in R (the package that accompanies Modern Applied Statistics with S) and try to understand it step by step.

1.       Data description -
This data, which we will call cars, comes from the built-in Cars93 data set in the MASS package. It describes 93 cars on sale in the USA in 1993. The data set has 93 observations and 27 variables. To load the data set, run the following commands:
library(MASS)   # provides the Cars93 data set
cars <- Cars93  # working copy; note that plain `cars` is a different built-in data set
The variables of the data set can be viewed by running the str() function.
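As a minimal sketch of that inspection step, the snippet below works on Cars93 directly so that it runs on its own; it only uses base R functions.

```r
library(MASS)   # provides the Cars93 data set

dim(Cars93)     # number of observations (rows) and variables (columns)
str(Cars93)     # name, storage type and first few values of every variable
names(Cars93)   # just the variable names
```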
The variables in a data set can be of two types:

    1. Categorical variables - Also known as factor variables. Their values fall into levels/categories. They are further divided into nominal and ordinal variables.
 a) Nominal variables: These are categorical variables that do not have any natural order, like gender (Male, Female).
 b) Ordinal variables: These are categorical variables that can be ordered, like temperature (high, medium, low) or grades (A+, A, B+, B).
    2. Continuous/Quantitative variables - These are numeric variables that can take any value within a range, like Income.
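The distinction above can be illustrated in R roughly as follows. This is only a sketch: the temperature vector is made-up illustrative data, not part of Cars93.

```r
library(MASS)

# Nominal: a factor whose levels have no natural order
origin <- Cars93$Origin
is.factor(origin)       # TRUE
is.ordered(origin)      # FALSE

# Ordinal: an ordered factor (illustrative values, not from Cars93)
temperature <- factor(c("low", "high", "medium", "low"),
                      levels = c("low", "medium", "high"),
                      ordered = TRUE)
temperature < "high"    # comparisons are meaningful for ordered factors

# Continuous/quantitative: stored as a numeric vector
is.numeric(Cars93$Price)   # TRUE

# How many variables of each storage class does the data set contain?
table(sapply(Cars93, function(x) class(x)[1]))
```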

The data description report can be prepared as below. This is not the only possible format for the report; companies also automate such reports in different formats to save time.
               
Variable Name | Description | Data type
Manufacturer | Manufacturer of the car | Categorical
Model | Model of the car | Categorical
Type | Type of the car, such as Small, Midsize, Sporty | Categorical
Min.Price | Price of the basic version, in thousands of dollars | Quantitative
Price | Average of min and max price, in thousands of dollars | Quantitative
Max.Price | Price of the premium version, in thousands of dollars | Quantitative
MPG.city | City miles per US gallon | Quantitative
MPG.highway | Highway miles per US gallon | Quantitative
AirBags | Airbags fitted as standard, such as none, driver only | Categorical
DriveTrain | Drive train type, such as Front, Rear | Categorical
Cylinders | Number of cylinders | Categorical
EngineSize | Engine size in litres | Quantitative
Horsepower | Maximum horsepower | Quantitative
RPM | Revolutions per minute at maximum horsepower | Quantitative
Rev.per.mile | Engine revolutions per mile in highest gear | Quantitative
Man.trans.avail | Whether a manual transmission is available | Categorical
Fuel.tank.capacity | Fuel tank capacity in US gallons | Quantitative
Passengers | Passenger capacity | Quantitative
Length | Length of the car in inches | Quantitative
Wheelbase | Wheelbase in inches | Quantitative
Width | Width in inches | Quantitative
Turn.circle | U-turn space in feet | Quantitative
Rear.seat.room | Rear seat room in inches | Quantitative
Luggage.room | Luggage capacity in cubic feet | Quantitative
Weight | Weight of the car in pounds | Quantitative
Origin | US or non-US company origin | Categorical
Make | Combination of manufacturer and model | Categorical
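One possible way to automate part of such a report is sketched below. The helper describe_data() is a hypothetical name introduced here for illustration, and the rule "factor means Categorical, everything else Quantitative" is an assumption that happens to fit Cars93; the free-text descriptions still have to be written by hand.

```r
library(MASS)

# Hypothetical helper: one row per variable with its class, a coarse
# data type, the number of missing values and the number of distinct values.
describe_data <- function(df) {
  data.frame(
    variable  = names(df),
    class     = sapply(df, function(x) class(x)[1]),
    data_type = ifelse(sapply(df, is.factor), "Categorical", "Quantitative"),
    n_missing = sapply(df, function(x) sum(is.na(x))),
    n_unique  = sapply(df, function(x) length(unique(x))),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

report <- describe_data(Cars93)
head(report)
```

The resulting data frame can then be written out with write.csv() or merged with a hand-written description column.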

2.       Exploring and verifying the data -
a)      Checking for NA values
-          NA values are values that are not defined. They may not have been recorded due to human error, or the values may not actually exist.
-          The is.na() function helps find the NA values in the data set. We use the sapply() function to count the NA values in every column.
Two columns/variables have missing values: Rear.seat.room and Luggage.room.
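A minimal sketch of the NA check described above, using is.na() inside sapply(); the variable name na_counts is just an illustrative choice.

```r
library(MASS)

# Count the NA values in every column
na_counts <- sapply(Cars93, function(x) sum(is.na(x)))

# Keep only the columns that actually contain NA values
na_counts[na_counts > 0]
```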
We can decide to treat these NA values or leave them as they are. They can also be treated as another category. There are various methods to treat them, but first we have to identify whether the missing values are valid or not.

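In this data set, Luggage.room is missing for every car of type Van and for two cars of type Sporty.
Rear.seat.room is missing for some Sporty cars. This suggests that the missing values reflect how the cars are built rather than recording errors.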
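One way to verify where the missing values occur is to cross-tabulate the car type against a missing-value flag. This is a sketch using base R only; the dnn labels are illustrative.

```r
library(MASS)

# Car type versus "is the value missing?"
table(Cars93$Type, is.na(Cars93$Luggage.room),
      dnn = c("Type", "Luggage.room is NA"))
table(Cars93$Type, is.na(Cars93$Rear.seat.room),
      dnn = c("Type", "Rear.seat.room is NA"))

# List the cars with no rear seat room value
Cars93[is.na(Cars93$Rear.seat.room), c("Make", "Type", "Passengers")]
```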

b)      Checking for outliers
Outliers are extreme values that lie far outside the bulk of the distribution of the data.
Outliers can be detected using either a histogram or a boxplot.
A boxplot draws a box from the 25th to the 75th percentile. Its whiskers extend to the most extreme values that lie within 1.5 times the IQR (interquartile range = 75th percentile - 25th percentile) above the 75th percentile and within 1.5 times the IQR below the 25th percentile. All values beyond these limits are considered outliers.
Outliers can be checked with a boxplot, and the same process can be repeated for all the numerical columns/variables.
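The 1.5 times IQR rule can be sketched by hand as follows, using Min.Price as the example column. Note one assumption: the limits here are computed with quantile(), while boxplot() and boxplot.stats() use hinges, so the two can differ slightly on some data.

```r
library(MASS)

x <- Cars93$Min.Price

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1                    # same value as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values beyond the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
sort(outliers)

# The same information as a picture
boxplot(x, horizontal = TRUE, main = "Min.Price (thousands of dollars)")

# Repeat for every numeric column: number of points boxplot() would flag
numeric_cols <- Cars93[sapply(Cars93, is.numeric)]
sapply(numeric_cols, function(v) length(boxplot.stats(v)$out))
```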
On checking the quantiles of Min.Price, there is a sudden jump from 34.480 onwards.
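A quantile check of this kind might look like the sketch below. The probability steps are an assumption, since the original output is not reproduced here, so the exact percentile at which the jump appears depends on the steps chosen.

```r
library(MASS)

# Percentiles of Min.Price in steps of 5%
pct <- quantile(Cars93$Min.Price, probs = seq(0, 1, by = 0.05))
pct

# Zoom in on the upper tail in steps of 1%
quantile(Cars93$Min.Price, probs = seq(0.90, 1, by = 0.01))

# Differences between consecutive percentiles make a sudden jump easy to spot
diff(pct)
```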
c)      Exploring the data – Sample data exploration

Data exploration is the process of visualizing the data. It brings forward important features of the data that can be useful in further analysis. Univariate analysis means that we analyze only one variable/feature. Multivariate analysis means that we analyze more than one variable/feature together to gain more insight. Bivariate analysis is a special case of multivariate analysis in which the number of variables/features selected is two.
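As a small illustration of the difference, the sketch below looks at one variable at a time and then at a pair of variables. The choice of Price, Type and Horsepower is only an example.

```r
library(MASS)

# Univariate: one variable at a time
summary(Cars93$Price)
hist(Cars93$Price, main = "Distribution of Price",
     xlab = "Price (thousands of dollars)")
type_counts <- table(Cars93$Type)
type_counts

# Bivariate: two variables together
plot(Cars93$Horsepower, Cars93$Price,
     xlab = "Horsepower", ylab = "Price (thousands of dollars)")
cor(Cars93$Horsepower, Cars93$Price)
```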
-          Car type vs number and average price of cars
-          Airbags vs number and average price of cars
-          Price vs RPM for different car types
-          Number of cylinders vs average price for different origins
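The four views listed above could be computed along the following lines. This is a sketch with base R only; count_and_mean_price() is a hypothetical helper introduced here, and the original plots may have been drawn differently.

```r
library(MASS)

# Hypothetical helper: number of cars and average Price per level of a factor
count_and_mean_price <- function(group) {
  data.frame(
    n          = as.vector(table(group)),
    mean_price = as.vector(tapply(Cars93$Price, group, mean)),
    row.names  = levels(group)
  )
}

# Car type vs number and average price of cars
by_type <- count_and_mean_price(Cars93$Type)
by_type

# Airbags vs number and average price of cars
by_airbags <- count_and_mean_price(Cars93$AirBags)
by_airbags

# Price vs RPM, one color per car type
plot(Cars93$RPM, Cars93$Price, col = as.integer(Cars93$Type), pch = 19,
     xlab = "RPM", ylab = "Price (thousands of dollars)")
legend("topright", legend = levels(Cars93$Type),
       col = seq_along(levels(Cars93$Type)), pch = 19)

# Number of cylinders vs average price, split by origin
# (NA marks combinations with no cars)
cyl_origin <- tapply(Cars93$Price,
                     list(Cars93$Cylinders, Cars93$Origin), mean)
cyl_origin
```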



