
Chapter II - Data Gathering

The foremost step in a data science journey is to decide the domain of analysis. Based on the domain you choose, you must gather relevant data from different sources. Data gathering (or data collection) is the process of collecting information from various sources so that significant insights can be pulled out of it.

Here, we will discuss a few techniques of data collection.

1. Collecting survey data -
  • Such data can be collected by circulating a questionnaire to the audience.
  • This comes in handy if you want to limit the scope of analysis. For example, suppose you want to find the age distribution and the number of dependents in each household of your society. You can circulate a questionnaire asking the residents about their age and number of dependents, and then draw an analysis from the data collected.
  • The drawbacks of this method are that it is a tedious process, the audience may not be interested in the exercise, and it limits the amount of data collected.
  • Other examples of this method are:
    • Customers are requested to leave comments on the food, service and ambiance of a restaurant.
    • Instructors request learners to share feedback in the form of ratings after an online course.
2. Open source data -
  • Many data sets are published openly by governments, research institutions and platforms such as Kaggle and the UCI Machine Learning Repository.
  • Such public data sets can be downloaded freely and are a convenient starting point when you do not want to collect the data yourself.
3. Extracting data using web crawlers -
  • Many a time, data scientists may want to extract specific data from websites.
  • In such cases, they design their own web crawlers to pull out the required data.
  • One example is a crawler that parses Quora (a leading question-and-answer site) and extracts data from publicly available profiles.
4. Internal data -
  • Tons of data are generated and collected within organisations. These data can be used for analysis to derive insights.
  • A limitation of such data is that it is confidential, and access is restricted to only a few people.
  • Such data sets are called private data sets.
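Once survey responses like the household example above are collected, they can be tabulated and summarised. The sketch below uses pandas with made-up responses (the household IDs, ages and dependent counts are purely illustrative) to compute an age distribution and the average number of dependents.

```python
import pandas as pd

# Hypothetical survey responses: age of the household head and
# number of dependents, one row per household.
responses = pd.DataFrame({
    "household": ["H1", "H2", "H3", "H4", "H5"],
    "age_of_head": [34, 52, 41, 29, 63],
    "dependents": [2, 3, 1, 0, 4],
})

# Bucket ages into ranges to see the age distribution of respondents.
age_bins = pd.cut(responses["age_of_head"],
                  bins=[20, 35, 50, 65],
                  labels=["21-35", "36-50", "51-65"])
print(age_bins.value_counts().sort_index())

# Average number of dependents across the surveyed households.
print(responses["dependents"].mean())  # → 2.0
```

Even a small summary like this already answers the two survey questions posed above: how the ages are spread, and how many dependents a typical household has.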
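A web crawler of the kind described in technique 3 can be sketched with just the Python standard library. The example below is a minimal illustration, not the Quora crawler mentioned above: it collects the text of every `<h2>` heading on a page, and the URL in the usage comment is a placeholder. Always check a site's terms of service and robots.txt before crawling it.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadingParser(HTMLParser):
    """Collect the text inside every <h2> tag on a page."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            # Accumulate text, since one heading may arrive in pieces.
            self.headings[-1] += data


def extract_headings(html):
    """Return the stripped text of every <h2> heading in the HTML."""
    parser = HeadingParser()
    parser.feed(html)
    return [h.strip() for h in parser.headings]


# Usage against a live page (hypothetical URL):
# html = urlopen("https://example.com/profiles").read().decode("utf-8")
# print(extract_headings(html))
```

For real projects, libraries such as Scrapy or BeautifulSoup handle malformed HTML and link-following far more robustly; the stdlib version above just shows the core idea of fetch, parse, and extract.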
P.S. - Watch this space for a case study on data understanding, coming up next.
