Social and Political Data Science: Introduction

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Social Data Analytics and Political Research: New Public Opinion and Data Tools

Presentation prepared for Soochow University, Taipei, Taiwan, July 6, 2018

Social data analytics research and education programs have provided new directions and technologies to collect and analyze public opinion via social listening, web scraping and text mining. 

How do these new trends open avenues for political researchers to study public sentiment on policies, congressional performance and electoral choices?

This presentation offers new ideas for political researchers to factor into their research agendas and curricular design in order to train the next generation of researchers. Discussion points will be outlined to facilitate the exchange of innovative ideas across subfields and disciplines pertaining to the newly emerged discipline of social data science.

What is Data Analytics?

Data Analytics vs.

Data Analysis

Data analytics refers to generation, acquisition, management, modeling and visualization of data.

Thomas Davenport and his colleagues (2007) emphasize the ability to "collect, analyze and act on data".

Davenport, Thomas H., and Jeanne G. Harris. 2007. Competing on analytics: The new science of winning. Harvard Business Press.

Data Analytics vs.

Data Analysis

Data analytics goes beyond providing analysis of data: it focuses on the actions and decisions informed by data.

Social Data Analytics: A journey just set afoot

Social data initially referred to data generated from social media. It is no longer confined to that mode of generation and now refers more generally to data generated by people or users.

Social data analytics encompasses the generation, management, modeling and visualization of social data.

.... social science is beginning to shape the world of big data. 

Much of big data is social data.... It is the responsibility of social scientists to assume their central place in the world of big data, to shape the questions we ask of big data, and to characterize what does and does not make for a convincing answer. 

- Monroe, Pan, Roberts, Sen and Sinclair 2015

Lazer et al. 2009 Life in the network

This figure summarizes the link structure within a community of political blogs (from 2004), where red nodes indicate conservative blogs and blue nodes liberal ones. Orange links go from liberal to conservative blogs, and purple links from conservative to liberal. The size of each node reflects the number of other blogs that link to it.

Cumulated/Repeated Data

A Theory of Data: Understanding Data Generation Process

Data Generation Process

Made Data

  • Small sample
  • Known Sample/Population
  • Data collection: controlled, pre-planned
  • Data model

Found Data

  • Big Sample
  • Unknown sample/unknown population
  • Data collection: mining, scraping, semi-planned extraction
  • Algorithmic model

Data Generation Process

A Taxonomy of Data

  1. Numbers

  2. Text

  3. Images

  4. Audio

  5. Video

  6. Signals

  7. Data of data: Metadata and Paradata


Data Methods

  1. Survey

  2. Experiments

  3. Qualitative Data

  4. Text Data

  5. Web Data

  6. Machine Data

  7. Complex Data

    1. Network Data

    2. Multiple-source linked Data







Statistical Modeling: The Two Cultures

Leo Breiman 2001: Statistical Science

  • Data Model (small data): assumes that the data are generated by a given stochastic data model.
  • Algorithmic Model (complex, big data): uses algorithmic models and treats the data mechanism as unknown.

Data Generation Process

Data are generated in many fashions. Picture this: an independent variable x goes in one side of a box (we call it nature for now) and a dependent variable y comes out of the other side.

Data Generation Process

Data Model

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:

Response variable = f(predictor variables, random noise, parameters)

That is, the response variable is a function of a series of predictor (independent) variables, random noise (for example, normally distributed errors) and parameters.

Data Generation Process

Data Model

The values of the parameters are estimated from the data, and the model is then used for information and/or prediction.
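
As a rough illustration of the data-model culture, the sketch below simulates data from an assumed linear model, estimates its parameters by ordinary least squares, and then uses the fitted model for prediction. The linear form, the parameter values and the function names (simulate, fit_ols) are assumptions made for this example, not part of the original presentation.

```python
import random


def simulate(n, intercept, slope, noise_sd, seed=0):
    """Draw n observations from y = intercept + slope * x + normal noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_ols(xs, ys):
    """Estimate the intercept and slope by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Assumed "true" parameters, chosen only for illustration.
xs, ys = simulate(n=500, intercept=2.0, slope=0.5, noise_sd=1.0)
b0, b1 = fit_ols(xs, ys)


def predict(x):
    """Use the estimated parameters for prediction."""
    return b0 + b1 * x


print(f"estimated intercept = {b0:.2f}, estimated slope = {b1:.2f}")
print(f"predicted y at x = 4: {predict(4):.2f}")
```

The estimates are interpretable (information) and also usable for forecasting (prediction), which is the dual role the slide describes.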

Data Generation Process

 Algorithmic Modeling

The analysis in this approach considers the inside of the box complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the responses y.

The goal is to find an algorithm that accurately predicts y.
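
A minimal sketch of this idea, using k-nearest-neighbour averaging as one possible algorithm (chosen here only as an example): no functional form is assumed for the box, and the method is judged solely by how well it predicts held-out y. The sine-shaped "nature", the sample sizes and k = 5 are all assumptions for illustration.

```python
import math
import random


def knn_predict(train_x, train_y, x_new, k=5):
    """Predict y at x_new as the mean response of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))[:k]
    return sum(train_y[i] for i in nearest) / k


rng = random.Random(1)
# "Nature" is a nonlinear function that the analyst is assumed not to know.
xs = [rng.uniform(0, 2 * math.pi) for _ in range(400)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]

# Hold out the last 100 points and judge the algorithm by predictive accuracy.
train_x, train_y = xs[:300], ys[:300]
test_x, test_y = xs[300:], ys[300:]
preds = [knn_predict(train_x, train_y, x) for x in test_x]
mse = sum((p - y) ** 2 for p, y in zip(preds, test_y)) / len(test_y)
print(f"held-out mean squared error: {mse:.3f}")
```

Nothing in the fitted predictor says why y behaves as it does; it only predicts, which is the trade-off between the two cultures.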

Data Generation Process

 Algorithmic Modeling

Supervised Learning vs. Unsupervised Learning


Algorithm and Inference

Very broadly speaking, algorithms are what statisticians do while inference says why they do them.


- Efron and Hastie 2017

Machine Learning

A Random Forest is an ensemble learning method that grows multiple decision trees on different training sets.

To classify a new sample, its input features are passed to each tree in the forest, and the predicted class is the majority vote over all the trees.

The Random Forest is a typical ensemble model whose goal is to reduce variance.
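
The mechanics described above (different training sets, one vote per tree, majority rule) can be sketched in a few lines. This is a deliberately simplified stand-in, not a full Random Forest: each "tree" is a one-split stump on a randomly chosen feature, fitted to a bootstrap sample. The toy two-cluster data and all function names are assumptions for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, rng):
    """Fit a one-split tree (stump) on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0][0]))
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        label = Counter(y for _, y in rows).most_common(1)[0][0]
        return feature, float("inf"), label, label
    return (feature,) + best[1:]


def fit_forest(rows, n_trees=25, seed=0):
    """Grow each stump on its own bootstrap sample of the training rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        forest.append(fit_stump(sample, rng))
    return forest


def predict(forest, x):
    """Classify x by majority vote over all trees in the forest."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy data: class 0 clustered near (0, 0), class 1 clustered near (3, 3).
rng = random.Random(42)
rows = [((rng.gauss(0, 0.5), rng.gauss(0, 0.5)), 0) for _ in range(50)]
rows += [((rng.gauss(3, 0.5), rng.gauss(3, 0.5)), 1) for _ in range(50)]
forest = fit_forest(rows)
print(predict(forest, (0.2, -0.1)), predict(forest, (2.9, 3.3)))
```

Averaging the votes of many trees grown on resampled data is what dampens the instability of any single tree, which is the variance reduction the slide refers to.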

Illustration: HKES

Research Question:

What concerns Hong Kong people most?

  • Choice-based conjoint analysis

  • Survey respondents were asked to choose between two social/political reform proposals

  • Each proposal consists of a set of reform items, representing three big concerns: (1) procedural democracy; (2) welfare benefits; (3) integration with mainland China

Illustration: HKES

  • Value of each reform item is randomly drawn

  • Advantages of the conjoint design:

    • Respondents need not report preferences for individual items, thereby lowering the risk of preference falsification

    • It pits all the big theories against one another in a single decision, so that we can rank-order respondents’ major concerns
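
The randomization step of a choice-based conjoint design can be sketched as follows. The three reform items come from the slide above, but the levels attached to each item are hypothetical placeholders, not the actual HKES wording, and the function names are likewise assumptions.

```python
import random

# Hypothetical levels for illustration only; these are NOT the HKES items' real values.
ATTRIBUTES = {
    "procedural democracy": ["no change", "moderate reform", "major reform"],
    "welfare benefits": ["no change", "moderate increase", "major increase"],
    "integration with mainland China": ["less", "no change", "more"],
}


def draw_proposal(rng):
    """Randomly draw one level for every reform item."""
    return {item: rng.choice(levels) for item, levels in ATTRIBUTES.items()}


def draw_choice_task(rng):
    """Build one choice task: two independently randomized proposals."""
    return draw_proposal(rng), draw_proposal(rng)


rng = random.Random(2018)
proposal_a, proposal_b = draw_choice_task(rng)
for item in ATTRIBUTES:
    print(f"{item}: A = {proposal_a[item]} | B = {proposal_b[item]}")
```

Because every level is drawn at random, the respondent's choice between A and B can later be related to each item separately, even though the respondent only ever states one overall preference.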

Machine Learning

"when you present two systems to a company, a simple one with explanations that does ok, and a more complicated system that works better, every single time they will take the second. Every single time."


- Yann LeCun, Director of AI Research at Facebook, on deep learning

Does the job, with no explanation or theory?

Social (Data) Scientist's mission

Two major areas to which social scientists can contribute, based on decades of experience and work with end users, are:

  1. Inference                                   

  2. Data quality. 

- Foster et al. 2016

Social (Data) Scientist's mission

Compared with computer scientists and business analytics researchers, we are distinct not only in our familiarity with data, statistical models and inference, but also in our purpose.

Social scientists pursue a good cause, and that is something we can contribute: to make a difference, to bring public good and to shape a better society.

Social Data Analytics and Research (SDAR) 

SDAR Roadmap

  1. Introduction - Data theory

  2. Data methods

  3. Statistics

  4. Programming

  5. Data Visualization

  6. Information Management

  7. Data Curation

  8. Spatial Models and Methods

  9. Machine Learning

  10. NLP/Text mining

SDAR Roadmap

  1. Introduction - Data theory

    1. Fundamentals

      1. Data concepts

      2. Data Generation Process (DGP)

    2. Algorithm-based vs. Data-based approaches

    3. Taxonomy

SDAR Roadmap

  2. Data methods

    1. Passive data

    2. Data at will

    3. Qualitative data

    4. Complex data

    5. Text data

SDAR Roadmap

  3. Statistics

    1. Sample and Population

    2. Inference

    3. Size and power

    4. Representation

SDAR Roadmap

  4. Programming

    1. R

    2. Python

    3. HTML

    4. JavaScript

SDAR Roadmap

  5. Data Visualization

    1. Tableau

    2. ggplot2

    3. Shiny

    4. D3.js

    5. Animation

SDAR Roadmap

  6. Information Management

    1. MapReduce

    2. Hadoop

    3. Cassandra

    4. MongoDB

    5. NoSQL

SDAR Roadmap

  7. Data Curation

    1. OpenRefine (formerly Google Refine)

    2. Sampling

    3. Missing value concepts and management

SDAR Roadmap

  8. Spatial Models and Methods

    1. GIS

    2. R/Leaflet

    3. Python Map

    4. Remote Sensing

SDAR Roadmap

  9. Machine Learning

    1. Supervised

    2. Unsupervised

    3. Regression methods

    4. Neural Networks

Data Science Roadmap

  10. NLP/Text Mining

    1. Corpus

    2. Text Analysis

    3. Sentiment Analysis

    4. Natural Language Processing


The story of Google Flu Trends

Using big data from search queries, Google Flu Trends (GFT) predicted the rate of influenza-like illness in a population.

However, Nature, the journal in which the GFT findings were first published, later reported that GFT overestimated flu prevalence by as much as a factor of two relative to the actual data. Two political scientists helped diagnose and address the problem.

Lesson we learn:
Political Science can save the world!

The story of Google Flu Trends

Lazer, Kennedy, King and Vespignani (2014)

Traditional “small data” often offer information that is not contained (or containable) in big data, and the very factors that have enabled big data are enabling more traditional data collection (watch TED talk by Dr. Joel Selanikio). The Internet has opened the way for improving standard surveys, experiments, and health reporting.  (Lazer et al. 2014 Science)

Public Policy and Big Data

JavaScript: D3 Library

Sentiment Analysis

"1","RT @RealJack: *Last year*


*Trump meets with Kim*

"2","Trump Kim summit: US wants 'major N Korea disarmament' by 2020"
"3","RT @thehill: JUST IN: Norwegian lawmakers nominate Trump for Nobel Peace Prize after summit with Kim Jong Un https:…"
"4","RT @JRubinBlogger: Pompeo is acting exactly like Kerry -- indignant, caught up in process. Convinced concessions aren'[t concessions. Pathe…"
"5","RT @SykesCharlie: On Wednesday morning, Chosun Ilbo, South Korea’s paper of record, published a bleak editorial: “Kim Jong-un Got Everythin…"
"6","RT @chuckwoolery: Yesterday Shepard Smith, gave a scathing report on Trump/Kim Singapore summit. Following the Lefts lead."
"7","RT @WhiteHouse: Leaders the world over spoke of the powerful significance of President Trump’s summit with Kim Jong Un this week.

Read mor…"
"8","RT @PalmerReport: Fuck Donald Trump

Fuck Kim Jong Un

Fuck their fake summit

Fuck Vladimir Putin

Fuck Dennis Rodman

Fuck the media for…"
"9","Dennis Rodman has been the link between Kim Jong-Un and Donald Trump. Very Scary times we are in @StephMillerShow…"
"10","RT @TomSteyer: .@realDonaldTrump repeatedly said, Kim Jong Un ""loves his people."" This is what love looks like to Trump:

Over 100,000 poli…"
"11","RT @TheUSASingers: I’m gonna lay it on the line.

- Obama isn’t a Muslim
- Hillary doesn’t eat babies
- Socialists aren’t Nazis
- Nazis are…"
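
One simple way to score tweets like those above is a lexicon-based count of positive and negative words. The sketch below is only an assumption about how such scoring could be done, not the method used in the presentation; the tiny word lists are hypothetical, and real analyses rely on curated sentiment lexicons.

```python
import re

# Tiny hypothetical lexicon for illustration; not a real sentiment dictionary.
POSITIVE = {"peace", "prize", "powerful", "significance", "love", "loves"}
NEGATIVE = {"bleak", "scathing", "scary", "fake", "indignant"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# Short excerpts adapted from the sample tweets above.
tweets = [
    "Norwegian lawmakers nominate Trump for Nobel Peace Prize after summit with Kim Jong Un",
    "Chosun Ilbo, South Korea's paper of record, published a bleak editorial",
    "Trump Kim summit: US wants 'major N Korea disarmament' by 2020",
]
scores = [sentiment_score(tweet) for tweet in tweets]
for score, tweet in zip(scores, tweets):
    print(score, tweet)
```

Word counting of this kind ignores sarcasm, negation and context, so the scores are at best a rough first pass over the text.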
Some thoughts on Social Data Science

  • Data Thinking

  • Multi-disciplinary Thinking

  • Machine Thinking

Thank you!

Questions and Comments