Big Data Collection in Big Data Age

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Acknowledgements

  1. Dr. Jeyakesavan Veerasamy

  2. Expert helpers:

    1. Mehul Govindbhai Patel

    2. Kanak Kanti Roy

    3. Nick Haoran Weng

  1. Data Theory

  2. Data Methods

    1. Data Production (Small data)

    2. Data Collection (Big data)

  3. Workshop: Big Data Collection

    1. API method

    2. Non-API method

What is data?

  1. Kinds of Data

    1. Quantitative vs. Qualitative

    2. Structured vs. Semi/unstructured

    3. Measurement

      • Nominal/ordinal/interval/ratio

What is data?

  1. Data generation

    1. Made data vs. Found data

    2. Structured vs. Semi/unstructured

    3. Primary vs. secondary data

    4. Derived data

      1. metadata, paradata

Ackoff, R.L., 1989. From data to wisdom. Journal of applied systems analysis, 16(1), pp.3-9.

  • 1700s Agricultural Revolution 

  • 1780 Industrial Revolution

  • 1940 Information Revolution

  • 1950s Digital Revolution

  • Knowledge Revolution

  • Data Revolution

"Ipsa scientia potestas est."

"Knowledge itself is power."

- Sir Francis Bacon

Power = f(Knowledge_Size, Knowledge_Veracity, Knowledge_Speed)

Power = f(Data_Size, Data_Speed, Data_Veracity)
  • Prediction-explanation gap

  • Induction-deduction gap

  • Bigness-representativeness gap

  • Data access gap

Three challenges facing Data Science

 

  1. Generalization from samples to population

  2. Generalization from the control group to the treatment group

  3. Generalization from observed measurements to the underlying constructs of interest.

- Andrew Gelman

Data methods

Experimental design

Measurements

"...what you see is framed by what you are able to see or indeed want to see..."

- Farida Vis, 2013

The story of Google Flu Trends

By using big data on search queries, Google Flu Trends (GFT) predicted the rate of flu-like illness in a population.

The findings were published in the top journal Nature in 2008. However, GFT soon failed, overestimating the peak of the 2013 flu season by 140 percent.

The story of Google Flu Trends

Lazer, Kennedy, King and Vespignani (2014)

Traditional “small data” often offer information that is not contained (or containable) in big data, and "by combining GFT and lagged [traditional] CDC data, as well as dynamically recalibrating GFT... can substantially improve on the performance of GFT or the CDC alone. " (Lazer et al. 2014 Science)
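The combination Lazer et al. describe can be sketched in a few lines: blend a big-data signal with a lagged "small data" series, refitting the blending weights on a rolling window (dynamic recalibration). Everything below is synthetic and illustrative; it is not the authors' actual model.

```python
# Sketch of combining a GFT-style signal with lagged CDC-style data,
# recalibrating weights on a rolling window. All numbers are made up.
import random

random.seed(42)

# Synthetic weekly flu rates: the "truth", a big-data proxy that
# systematically overshoots, and an accurate but 2-week-lagged measure.
truth = [2 + 0.5 * t + random.gauss(0, 0.3) for t in range(30)]
gft = [y * random.uniform(1.1, 1.6) for y in truth]
cdc_lagged = [truth[max(t - 2, 0)] for t in range(30)]

def fit_weights(ys, x1s, x2s):
    """Least-squares weights for y ~ w1*x1 + w2*x2 (no intercept),
    solved from the 2x2 normal equations via Cramer's rule."""
    a = sum(x * x for x in x1s)
    b = sum(x1 * x2 for x1, x2 in zip(x1s, x2s))
    d = sum(x * x for x in x2s)
    e = sum(x * y for x, y in zip(x1s, ys))
    f = sum(x * y for x, y in zip(x2s, ys))
    det = a * d - b * b
    return ((e * d - b * f) / det, (a * f - e * b) / det)

window = 10
combined_errors, gft_errors = [], []
for t in range(window, 30):
    # Recalibrate on the most recent window, then predict this week
    w1, w2 = fit_weights(truth[t - window:t],
                         gft[t - window:t], cdc_lagged[t - window:t])
    pred = w1 * gft[t] + w2 * cdc_lagged[t]
    combined_errors.append(abs(pred - truth[t]))
    gft_errors.append(abs(gft[t] - truth[t]))

print(round(sum(gft_errors) / len(gft_errors), 2),
      round(sum(combined_errors) / len(combined_errors), 2))
```

On this toy series the recalibrated combination tracks the truth far better than the raw big-data signal alone, which is the qualitative point of the paper.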


Google should have the highest power in data access.

Why would it fail?

Why did it not fail sooner?

Power = f(Data_Size, Data_Veracity, Data_Speed)

Power = f(Data_Veracity, Data_Speed, Data_Size)

Size still matters, but not first.

A Theory of Data: Understanding Data Generation

Data Generation

Data Methods

  1. Survey

  2. Experiments

  3. Qualitative Data

  4. Text Data

  5. Web Data

  6. Machine Data

  7. Complex Data

    1. Network Data

    2. Multiple-source linked Data

[Figure: brace diagram grouping the data methods above into "Made Data" and "Found Data"]

Data Methods

  1. Small data or Made data emphasize design

  2. Big data or Found data focus on algorithm

Data Methods

How can the two learn from each other?

Statistical Modeling:
The Two Cultures 

Leo Breiman 2001: Statistical Science 

One assumes that the data are generated by a given stochastic data model.
The other uses algorithmic models and treats the data mechanism as unknown.
  • Data Model → small data
  • Algorithmic Model → complex, big data

Theory:
Data Generation Process

Data are generated in many fashions. Picture this: the independent variable x goes into one side of a box (call it "nature" for now) and the dependent variable y comes out the other side.

Theory:
Data Generation Process

Data Model

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:

Response variable = f(predictor variables, random noise, parameters)

Read: the response variable is a function of a set of predictor (independent) variables, plus random noise (normally distributed errors) and parameters.

Theory:
Data Generation Process

Data Model

The values of the parameters are estimated from the data and the model then used for information and/or prediction.
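The data-model workflow can be shown end to end in a toy example: assume y = b0 + b1*x + noise, generate data from that stochastic model, then estimate the parameters back from the data. All names and numbers here are illustrative.

```python
# The "data model" culture in miniature: posit a stochastic model,
# simulate data from it, then recover the parameters by least squares.
import random

random.seed(1)
b0_true, b1_true = 2.0, 3.0
n = 1000

x = [random.uniform(0, 10) for _ in range(n)]
# Response variable = f(predictor variable, random noise, parameters)
y = [b0_true + b1_true * xi + random.gauss(0, 1) for xi in x]

# Ordinary least squares for one predictor, computed by hand:
# slope = cov(x, y) / var(x); intercept from the means.
mx = sum(x) / n
my = sum(y) / n
cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - mx) ** 2 for xi in x) / (n - 1)

b1_hat = cov_xy / var_x
b0_hat = my - b1_hat * mx
print(round(b0_hat, 2), round(b1_hat, 2))
```

With 1,000 draws the estimates land very close to the true parameters (2.0 and 3.0), which is exactly the sense in which "the values of the parameters are estimated from the data."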

Theory:
Data Generation Process

 Algorithmic Modeling

The analysis in this approach treats the inside of the box as complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the response y.

The goal is to find an algorithm that accurately predicts y.
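For contrast, here is the algorithmic culture in miniature: no assumption about the mechanism inside the box, just an algorithm (1-nearest-neighbor here, chosen only for brevity) judged purely by how well it predicts y on held-out data. The data are synthetic.

```python
# Algorithmic modeling sketch: the mechanism is treated as unknown;
# the algorithm is evaluated only by predictive accuracy.
import random

random.seed(7)

def nature(x):
    """The unknown mechanism inside the box (the algorithm never
    sees this functional form)."""
    return x * x + random.gauss(0, 0.5)

train = [(x, nature(x)) for x in [random.uniform(0, 5) for _ in range(500)]]

def predict(x_new, data):
    """1-nearest-neighbor: return the y of the closest observed x."""
    nearest = min(data, key=lambda pair: abs(pair[0] - x_new))
    return nearest[1]

# The algorithmic culture's criterion: out-of-sample prediction error.
test_points = [(x, nature(x)) for x in [random.uniform(0, 5) for _ in range(100)]]
mae = sum(abs(predict(x, train) - y) for x, y in test_points) / len(test_points)
print(round(mae, 2))
```

Note that the algorithm never writes down a formula for nature's f(x); it only approximates the input-output mapping well enough to predict.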

Theory:
Data Generation Process

 Algorithmic Modeling

Supervised Learning vs. Unsupervised Learning

Source: https://www.mathworks.com


Data Science Roadmap

  1. Introduction - Data theory

  2. Data methods

  3. Statistics

  4. Programming

  5. Data Visualization

  6. Information Management

  7. Data Curation

  8. Spatial Models and Methods

  9. Machine Learning

  10. NLP/Text mining

Data Science Roadmap

  1. Introduction - Data theory

    1. Fundamentals

      1. Data concepts

      2. Data Generation Process (DGP)

    2. Algorithm-based vs. Data-based approaches

    3. Taxonomy

Data Science Roadmap

  1. Data methods

    1. Passive data

    2. Data at will

    3. Qualitative data

    4. Complex data

    5. Text data

Data generation process

  1. How data are generated
  2. Distribution
  3. Missing values
  4. Sampling and Population

 

Statistical understanding

  1. Size does (not) matter
  2. Representativeness does
  3. Forecast/prediction minded
  4. Explanation
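The slogan "size does (not) matter, representativeness does" can be demonstrated in a few lines: a small random sample estimates a population mean better than a far larger but non-representative one. All data below are simulated for illustration.

```python
# Small random sample vs. huge biased "found data" sample.
import random

random.seed(3)

# Hypothetical skewed population (e.g., incomes)
population = [random.lognormvariate(10, 0.8) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Small but random (representative) sample: 500 observations
small_random = random.sample(population, 500)

# Huge but biased sample: only the top half is observed (50,000 obs)
threshold = sorted(population)[len(population) // 2]
big_biased = [x for x in population if x > threshold]

err_small = abs(sum(small_random) / len(small_random) - true_mean)
err_big = abs(sum(big_biased) / len(big_biased) - true_mean)
print(err_small < err_big)
```

The 500-observation random sample lands near the true mean while the 50,000-observation biased sample misses it badly: bigness does not cure a non-representative data generation process.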

Darkest Hour: Churchill and the typist

Web data

How do we take advantage of the web data?

  1. Purpose of web data

  2. Generation process of web data

  3. What is data about data (metadata)?

  4. Why do data scientists need to collect web data?

Web data: Technical side

Web scraping

- obtaining information directly from web pages

APIs (Application Programming Interfaces)

- web services that allow interaction with, and retrieval of, structured data
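The scraping side can be sketched with only the standard library: parse an HTML page and pull out the text of every headline. The HTML string below is hypothetical content standing in for a page that would normally be fetched with urllib.request.

```python
# Minimal web-scraping sketch: extract headline text from HTML.
from html.parser import HTMLParser

html_page = """
<html><body>
  <h2 class="headline">Trade talks resume</h2>
  <p>Some article text...</p>
  <h2 class="headline">Markets react to tariffs</h2>
</body></html>
"""

class HeadlineScraper(HTMLParser):
    """Collects the text inside every <h2 class="headline"> element."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())

scraper = HeadlineScraper()
scraper.feed(html_page)
print(scraper.headlines)
```

In practice scrapers must cope with messy, semi-structured markup, which is exactly why structured API responses (next) are preferred when they exist.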

Web data technologies

Web data: API's (data source)

  1. Social Media                              

    1. Facebook

    2. Twitter

    3. Instagram

    4. YouTube

  2. News websites

  3. Government websites

  4. NGOs
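With the API method, the source returns structured JSON rather than raw HTML, so collection is mostly building the request and parsing the response. The endpoint and response below are hypothetical, Twitter-like stand-ins; a real call would fetch the payload with urllib.request.urlopen(url).

```python
# API-method sketch: build a query URL, then parse a JSON response.
import json
from urllib.parse import urlencode

# Building a request URL from query parameters (endpoint is made up)
params = {"query": "trade war", "max_results": 2}
url = "https://api.example.com/v1/search?" + urlencode(params)

# Canned response, standing in for the bytes a real API would return
raw_response = """
{"data": [
  {"id": "1", "text": "Tariffs announced today", "retweets": 120},
  {"id": "2", "text": "Markets respond to trade war news", "retweets": 85}
]}
"""

payload = json.loads(raw_response)
texts = [tweet["text"] for tweet in payload["data"]]
total_retweets = sum(tweet["retweets"] for tweet in payload["data"])
print(texts)
print(total_retweets)
```

Because the fields arrive already structured, analyses like the sentiment and tweet-pattern illustrations that follow can start immediately, without the parsing fragility of scraping.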

Illustration: Public Sentiments on Trade War

Illustration: Public Sentiments on Trade War

Illustration: Public Sentiments on Mexico wall

Illustration: Public Sentiments on Mexico wall

Illustration: Donald Trump's Tweet Pattern

Illustration: Donald Trump's Tweet Pattern (per week)

Illustration: Donald Trump's Diction use

Illustration: Donald Trump's Twitter Popularity

Illustration: Donald Trump's Twitter Popularity

Illustration: Donald Trump's Direct Impact on Public Opinion 

Workshop:
API and non-API methods

Thank you!

Questions and Comments
are welcome!

Statistics can help data scientists in three ways:

 

  1. Design and data collection
  2. Data analysis
  3. Decision making

 

Publication is not canonization. Journals are not gospels. They are the vehicles we use to tell each other what we saw (hence “Letters” & “proceedings”). The bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech.

- Daniel Gilbert, psychology researcher