
Chapter II - Data Gathering

The foremost step in a data science journey is to decide the domain of analysis. Based on the domain you choose, you must gather relevant data from different sources. Data gathering (or data collection) is the process of collecting information from various sources so that significant insights can be pulled out of it.

Here, we will discuss a few techniques of data collection.

1. Collecting survey data -
  • Such data can be collected by circulating a questionnaire to the audience.
  • This comes in handy if you want to limit the scope of analysis. For example, suppose you want to find the age distribution and the number of dependents in each household of your society. You can circulate a questionnaire asking the residents about their age and number of dependents, and then draw an analysis from the data collected.
  • The drawbacks of this method are that it is a tedious process, the audience may not be interested in the exercise, and it limits the amount of data collected.
  • Other examples of this method are:
    • Customers are requested to leave comments on the food, service and ambiance of a restaurant.
    • Instructors request learners to share feedback in the form of ratings after an online course.
2. Open source data -
  • Many data sets are published openly by governments, research institutions and platforms such as Kaggle and the UCI Machine Learning Repository.
  • Such public data sets can be downloaded freely and are a convenient starting point when you do not want to collect the data yourself.
3. Extracting data using web crawlers -
  • Many a time, data scientists may want to extract specific data from websites.
  • In such cases, they design their own web crawlers to pull out the required data.
  • One example is a crawler that parses Quora (a leading question-and-answer site) and extracts data from publicly available profiles.
4. Internal data -
  • Tons of data are generated and collected within organisations. These data can be used for analysis to derive insights.
  • A limitation of such data is that it is confidential, and access is restricted to only a few people.
  • Such data sets are called private data sets.
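Once survey responses like the household example above are collected, they can be tabulated and summarised. The sketch below uses pandas with made-up responses (the household IDs, ages and dependent counts are purely illustrative) to compute an age distribution and the average number of dependents.

```python
import pandas as pd

# Hypothetical survey responses: age of the household head and
# number of dependents, one row per household.
responses = pd.DataFrame({
    "household": ["H1", "H2", "H3", "H4", "H5"],
    "age_of_head": [34, 52, 41, 29, 63],
    "dependents": [2, 3, 1, 0, 4],
})

# Bucket ages into ranges to see the age distribution of respondents.
age_bins = pd.cut(responses["age_of_head"],
                  bins=[20, 35, 50, 65],
                  labels=["21-35", "36-50", "51-65"])
print(age_bins.value_counts().sort_index())

# Average number of dependents across the surveyed households.
print(responses["dependents"].mean())  # → 2.0
```

Even a small summary like this already answers the two survey questions posed above: how the ages are spread, and how many dependents a typical household has.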
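A web crawler of the kind described in technique 3 can be sketched with just the Python standard library. The example below is a minimal illustration, not the Quora crawler mentioned above: it collects the text of every `<h2>` heading on a page, and the URL in the usage comment is a placeholder. Always check a site's terms of service and robots.txt before crawling it.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadingParser(HTMLParser):
    """Collect the text inside every <h2> tag on a page."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            # Accumulate text, since one heading may arrive in pieces.
            self.headings[-1] += data


def extract_headings(html):
    """Return the stripped text of every <h2> heading in the HTML."""
    parser = HeadingParser()
    parser.feed(html)
    return [h.strip() for h in parser.headings]


# Usage against a live page (hypothetical URL):
# html = urlopen("https://example.com/profiles").read().decode("utf-8")
# print(extract_headings(html))
```

For real projects, libraries such as Scrapy or BeautifulSoup handle malformed HTML and link-following far more robustly; the stdlib version above just shows the core idea of fetch, parse, and extract.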
P.S. - Watch this space for a case study on data understanding, coming up next.
