Big Data Collection in Big Data Age

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Acknowledgements

  1. Dr. Jeyakesavan Veerasamy

  2. Expert helpers:

    1. Mehul Govindbhai Patel

    2. Kanak Kanti Roy

    3. Nick Haoran Weng

  1. Data Theory

  2. Data Methods

    1. Data Production (Small data)

    2. Data Collection (Big data)

  3. Workshop: Big Data Collection

    1. API method

    2. Non-API method

What is data?

  1. Kinds of Data

    1. Quantitative vs. Qualitative

    2. Structured vs. Semi/unstructured

    3. Measurement

      • Nominal/ordinal/interval/ratio

What is data?

  1. Data generation

    1. Made data vs. Found data

    2. Structured vs. Semi/unstructured

    3. Primary vs. secondary data

    4. Derived data

      1. metadata, paradata

Ackoff, R.L., 1989. From data to wisdom. Journal of applied systems analysis, 16(1), pp.3-9.

  • 1700s Agricultural Revolution 

  • 1780 Industrial Revolution

  • 1940 Information Revolution

  • 1950s Digital Revolution

  • Knowledge Revolution

  • Data Revolution

"Ipsa scientia potestas est."

"Knowledge itself is power."

- Sir Francis Bacon

Power = f(Knowledge_Size, Knowledge_Veracity, Knowledge_Speed)

Power = f(Data_Size, Data_Speed, Data_Veracity)
  • Prediction-explanation gap

  • Induction-deduction gap

  • Bigness-representativeness gap

  • Data access gap

Three challenges facing Data Science

 

  1. Generalization from samples to population

  2. Generalization from the control group to the treatment group

  3. Generalization from observed measurements to the underlying constructs of interest.

- Andrew Gelman

Data methods

Experimental design

Measurements

"...what you see is framed by what you are able to see or indeed want to see..."

- Farida Vis, 2013

The story of Google Flu Trends

By using big data on search queries, Google Flu Trends (GFT) predicted the rate of flu-like illness in a population.

The findings were published in the top journal Nature in 2008. However, GFT soon failed, overestimating the peak of the 2013 flu season by 140 percent.

The story of Google Flu Trends

Lazer, Kennedy, King and Vespignani (2014)

Traditional “small data” often offer information that is not contained (or containable) in big data, and "by combining GFT and lagged [traditional] CDC data, as well as dynamically recalibrating GFT... can substantially improve on the performance of GFT or the CDC alone. " (Lazer et al. 2014 Science)
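The combination Lazer et al. describe can be sketched in a few lines: blend a big-data signal with a lagged "small data" series, refitting the blending weights on a rolling window (dynamic recalibration). Everything below is synthetic and illustrative; it is not the authors' actual model.

```python
# Sketch of combining a GFT-style signal with lagged CDC-style data,
# recalibrating weights on a rolling window. All numbers are made up.
import random

random.seed(42)

# Synthetic weekly flu rates: the "truth", a big-data proxy that
# systematically overshoots, and an accurate but 2-week-lagged measure.
truth = [2 + 0.5 * t + random.gauss(0, 0.3) for t in range(30)]
gft = [y * random.uniform(1.1, 1.6) for y in truth]
cdc_lagged = [truth[max(t - 2, 0)] for t in range(30)]

def fit_weights(ys, x1s, x2s):
    """Least-squares weights for y ~ w1*x1 + w2*x2 (no intercept),
    solved from the 2x2 normal equations via Cramer's rule."""
    a = sum(x * x for x in x1s)
    b = sum(x1 * x2 for x1, x2 in zip(x1s, x2s))
    d = sum(x * x for x in x2s)
    e = sum(x * y for x, y in zip(x1s, ys))
    f = sum(x * y for x, y in zip(x2s, ys))
    det = a * d - b * b
    return ((e * d - b * f) / det, (a * f - e * b) / det)

window = 10
combined_errors, gft_errors = [], []
for t in range(window, 30):
    # Recalibrate on the most recent window, then predict this week
    w1, w2 = fit_weights(truth[t - window:t],
                         gft[t - window:t], cdc_lagged[t - window:t])
    pred = w1 * gft[t] + w2 * cdc_lagged[t]
    combined_errors.append(abs(pred - truth[t]))
    gft_errors.append(abs(gft[t] - truth[t]))

print(round(sum(gft_errors) / len(gft_errors), 2),
      round(sum(combined_errors) / len(combined_errors), 2))
```

On this toy series the recalibrated combination tracks the truth far better than the raw big-data signal alone, which is the qualitative point of the paper.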


Google should have the highest power in data access.

Why would it fail?

Why did it not fail sooner?

Power = f(Data_Size, Data_Veracity, Data_Speed)

Power = f(Data_Veracity, Data_Speed, Data_Size)

Size still matters, but not first.

A Theory of Data: Understanding Data Generation

Data Generation

Data Methods

  1. Survey

  2. Experiments

  3. Qualitative Data

  4. Text Data

  5. Web Data

  6. Machine Data

  7. Complex Data

    1. Network Data

    2. Multiple-source linked Data

[Figure: brace diagram grouping the data methods above into "Made Data" and "Found Data"]

Data Methods

  1. Small data or Made data emphasize design

  2. Big data or Found data focus on algorithm

Data Methods

How can the two learn from each other?

Statistical Modeling:
The Two Cultures 

Leo Breiman 2001: Statistical Science 

One assumes that the data are generated by a given stochastic data model.
The other uses algorithmic models and treats the data mechanism as unknown.
  • Data Model → small data
  • Algorithmic Model → complex, big data

Theory:
Data Generation Process

Data are generated in many fashions. Picture this: the independent variable x goes into one side of a box (call it "nature" for now) and the dependent variable y comes out the other side.

Theory:
Data Generation Process

Data Model

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:

Response variable = f(predictor variables, random noise, parameters)

Read: the response variable is a function of a set of predictor (independent) variables, plus random noise (normally distributed errors) and parameters.

Theory:
Data Generation Process

Data Model

The values of the parameters are estimated from the data and the model then used for information and/or prediction.
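The data-model workflow can be shown end to end in a toy example: assume y = b0 + b1*x + noise, generate data from that stochastic model, then estimate the parameters back from the data. All names and numbers here are illustrative.

```python
# The "data model" culture in miniature: posit a stochastic model,
# simulate data from it, then recover the parameters by least squares.
import random

random.seed(1)
b0_true, b1_true = 2.0, 3.0
n = 1000

x = [random.uniform(0, 10) for _ in range(n)]
# Response variable = f(predictor variable, random noise, parameters)
y = [b0_true + b1_true * xi + random.gauss(0, 1) for xi in x]

# Ordinary least squares for one predictor, computed by hand:
# slope = cov(x, y) / var(x); intercept from the means.
mx = sum(x) / n
my = sum(y) / n
cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - mx) ** 2 for xi in x) / (n - 1)

b1_hat = cov_xy / var_x
b0_hat = my - b1_hat * mx
print(round(b0_hat, 2), round(b1_hat, 2))
```

With 1,000 draws the estimates land very close to the true parameters (2.0 and 3.0), which is exactly the sense in which "the values of the parameters are estimated from the data."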

Theory:
Data Generation Process

 Algorithmic Modeling

The analysis in this approach treats the inside of the box as complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the response y.

The goal is to find an algorithm that accurately predicts y.
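For contrast, here is the algorithmic culture in miniature: no assumption about the mechanism inside the box, just an algorithm (1-nearest-neighbor here, chosen only for brevity) judged purely by how well it predicts y on held-out data. The data are synthetic.

```python
# Algorithmic modeling sketch: the mechanism is treated as unknown;
# the algorithm is evaluated only by predictive accuracy.
import random

random.seed(7)

def nature(x):
    """The unknown mechanism inside the box (the algorithm never
    sees this functional form)."""
    return x * x + random.gauss(0, 0.5)

train = [(x, nature(x)) for x in [random.uniform(0, 5) for _ in range(500)]]

def predict(x_new, data):
    """1-nearest-neighbor: return the y of the closest observed x."""
    nearest = min(data, key=lambda pair: abs(pair[0] - x_new))
    return nearest[1]

# The algorithmic culture's criterion: out-of-sample prediction error.
test_points = [(x, nature(x)) for x in [random.uniform(0, 5) for _ in range(100)]]
mae = sum(abs(predict(x, train) - y) for x, y in test_points) / len(test_points)
print(round(mae, 2))
```

Note that the algorithm never writes down a formula for nature's f(x); it only approximates the input-output mapping well enough to predict.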

Theory:
Data Generation Process

 Algorithmic Modeling

Supervised Learning vs. Unsupervised Learning

Source: https://www.mathworks.com


Data Science Roadmap

  1. Introduction - Data theory

  2. Data methods

  3. Statistics

  4. Programming

  5. Data Visualization

  6. Information Management

  7. Data Curation

  8. Spatial Models and Methods

  9. Machine Learning

  10. NLP/Text mining

Data Science Roadmap

  1. Introduction - Data theory

    1. Fundamentals

      1. Data concepts

      2. Data Generation Process (DGP)

    2. Algorithm-based vs. Data-based approaches

    3. Taxonomy

Data Science Roadmap

  1. Data methods

    1. Passive data

    2. Data at will

    3. Qualitative data

    4. Complex data

    5. Text data

Data generation process

  1. How data are generated
  2. Distribution
  3. Missing values
  4. Sampling and Population

 

Statistical understanding

  1. Size does (not) matter
  2. Representativeness does
  3. Forecast/prediction minded
  4. Explanation
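The slogan "size does (not) matter, representativeness does" can be demonstrated in a few lines: a small random sample estimates a population mean better than a far larger but non-representative one. All data below are simulated for illustration.

```python
# Small random sample vs. huge biased "found data" sample.
import random

random.seed(3)

# Hypothetical skewed population (e.g., incomes)
population = [random.lognormvariate(10, 0.8) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Small but random (representative) sample: 500 observations
small_random = random.sample(population, 500)

# Huge but biased sample: only the top half is observed (50,000 obs)
threshold = sorted(population)[len(population) // 2]
big_biased = [x for x in population if x > threshold]

err_small = abs(sum(small_random) / len(small_random) - true_mean)
err_big = abs(sum(big_biased) / len(big_biased) - true_mean)
print(err_small < err_big)
```

The 500-observation random sample lands near the true mean while the 50,000-observation biased sample misses it badly: bigness does not cure a non-representative data generation process.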

Darkest Hour: Churchill and the typist

Web data

How do we take advantage of the web data?

  1. Purpose of web data

  2. Generation process of web data

  3. What is data about data (metadata)?

  4. Why do data scientists need to collect web data?

Web data: Technical side

Web scraping

- obtaining information directly from web pages

APIs (Application Programming Interfaces)

- web services that allow interaction with, and retrieval of, structured data
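The scraping side can be sketched with only the standard library: parse an HTML page and pull out the text of every headline. The HTML string below is hypothetical content standing in for a page that would normally be fetched with urllib.request.

```python
# Minimal web-scraping sketch: extract headline text from HTML.
from html.parser import HTMLParser

html_page = """
<html><body>
  <h2 class="headline">Trade talks resume</h2>
  <p>Some article text...</p>
  <h2 class="headline">Markets react to tariffs</h2>
</body></html>
"""

class HeadlineScraper(HTMLParser):
    """Collects the text inside every <h2 class="headline"> element."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())

scraper = HeadlineScraper()
scraper.feed(html_page)
print(scraper.headlines)
```

In practice scrapers must cope with messy, semi-structured markup, which is exactly why structured API responses (next) are preferred when they exist.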

Web data technologies

Web data: API's (data source)

  1. Social Media                              

    1. Facebook

    2. Twitter

    3. Instagram

    4. YouTube

  2. News websites

  3. Government websites

  4. NGOs
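With the API method, the source returns structured JSON rather than raw HTML, so collection is mostly building the request and parsing the response. The endpoint and response below are hypothetical, Twitter-like stand-ins; a real call would fetch the payload with urllib.request.urlopen(url).

```python
# API-method sketch: build a query URL, then parse a JSON response.
import json
from urllib.parse import urlencode

# Building a request URL from query parameters (endpoint is made up)
params = {"query": "trade war", "max_results": 2}
url = "https://api.example.com/v1/search?" + urlencode(params)

# Canned response, standing in for the bytes a real API would return
raw_response = """
{"data": [
  {"id": "1", "text": "Tariffs announced today", "retweets": 120},
  {"id": "2", "text": "Markets respond to trade war news", "retweets": 85}
]}
"""

payload = json.loads(raw_response)
texts = [tweet["text"] for tweet in payload["data"]]
total_retweets = sum(tweet["retweets"] for tweet in payload["data"])
print(texts)
print(total_retweets)
```

Because the fields arrive already structured, analyses like the sentiment and tweet-pattern illustrations that follow can start immediately, without the parsing fragility of scraping.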

Illustration: Public Sentiments on Trade War

Illustration: Public Sentiments on Trade War

Illustration: Public Sentiments on Mexico wall

Illustration: Public Sentiments on Mexico wall

Illustration: Donald Trump's Tweet Pattern

Illustration: Donald Trump's Tweet Pattern (per week)

Illustration: Donald Trump's Diction use

Illustration: Donald Trump's Twitter Popularity

Illustration: Donald Trump's Twitter Popularity

Illustration: Donald Trump's Direct Impact on Public Opinion 

Workshop:
API and non-API methods

Thank you!

Questions and Comments
are welcome!

Statistics can help data scientists in three ways:

 

  1. Design and data collection
  2. Data analysis
  3. Decision making

 

Publication is not canonization. Journals are not gospels. They are the vehicles we use to tell each other what we saw (hence “Letters” & “proceedings”). The bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech.

- Daniel Gilbert, psychology researcher