An introduction to a survey bot detection algorithm

Part 1: Bot Detection Preliminaries

Carl F. Falk & Michael J. Ilagan

Workshop Outline

  • Part 1: Bots and detection preliminaries
    • Lecture (\(\approx\) 30 min)
    • Shiny Apps (\(\approx\) 10 min)
  • Part 2: Bot detection algorithm
    • Lecture (\(\approx\) 30 min)
    • R (\(\approx\) 15 min)

Workshop Materials

R packages to install

  • We assume you are familiar with R
  • detranli = DEtection of RANdom LIkert-type responses
  • If asked whether to install from source, say no for now
# Used for visualizations and for installing the detranli package
install.packages(c("ggplot2","GGally"), type="binary")
install.packages("devtools", type="binary")

# stable version (recommended)
devtools::install_github("michaeljohnilagan/detranli")

# experimental version
#devtools::install_github("falkcarl/detranli", type="binary")

# some example datasets
install.packages(c("psychTools","qgraph"), type="binary")

Outline for Part 1

  • Bot contamination
  • Non-statistical interventions
  • Binary classification concepts
  • Statistical analysis of Likert-type responses
  • From suspicions to decisions

Bot contamination

Online data collection is common

Online data collection is fast, cheap, and facilitates access to hard-to-reach populations

Bot contamination is a problem

Since participants are compensated 💰, there is incentive to complete many surveys in a short time

Non-statistical interventions

Suppose you were doing a study…

  • Likert-type inventory measuring one or more constructs, data collected online

“I feel that I have a number of good qualities”

(Strongly Disagree) 1 2 3 4 5 (Strongly Agree)

Row  Item 1  Item 2  Item 3  Item 4
  1       4       2       3       4
  2       1       1       1       1
  3       5       4       2       4
  4       5       2       5       5
  5       1       2       2       3
  6       5       2       3       3

Suppose you were doing a study…

  • Before analysis of Likert-type responses, what is there to do?
    • Hurdles to gain access to survey
    • Auxiliary Likert-type or binary items
    • Open-ended items
    • Paradata

Hurdles to gain access to survey

  • Advice
    Answer a CAPTCHA (system filters automatically) and/or provide a valid email address or phone number (researcher filters manually)
  • Benefit
    Deters suspicious responders from being part of the data in the first place
  • Limitations
    Need to be mindful of ethics and privacy laws; hurdles de-motivate humans too

Auxiliary Likert-type or binary-choice items

  • Advice
    Insert items into the survey…
    • “I am paid biweekly by leprechauns.”
    • “I can walk on water.”
    • “To ensure data quality, please choose strongly disagree for this item.”
    • “In your honest opinion, should we use your data in our analyses in this study?”
  • Benefit
    Can flag respondents with “wrong” answers
  • Limitations
    Humans may answer humorously, find such items annoying, or take the survey less seriously; items may be interpreted in unexpected ways; and how many failed items count as failing overall?

Open-ended items

  • Advice
    Include open-ended items
  • Benefit
    Can flag respondents who wrote obvious nonsense, implausible claims, or identical responses
    • “Curriculum resources creatively mining”
    • “Wqqwr”
    • Reported age of child does not match grade level
  • Limitations
    Can be subjective or ambiguous for some responses; some useful facts may be unknown to the researcher; time-consuming

Paradata

  • Advice
    Collect geolocation data, mouse movement data, submission time, time spent on page, etc.
  • Benefit
    Can flag suspicious (clusters of) activity, without altering the respondent’s experience
  • Limitations
    Depends on the survey platform; be mindful of ethics or privacy laws; VPNs exist; other explanations have to be accounted for (e.g. I work at night so I was awake at 4 AM)

More interventions

  • For more detailed advice, see the papers referenced in the workshop materials

Our contribution

  • It’s still a good idea to do some of the previously mentioned interventions
  • However, statistical analysis of Likert-type responses is always available and can be done post hoc
  • We provide a method that has sensitivity calibration (will be explained), in the familiar paradigm of null hypothesis significance testing
  • Our R package makes it easy to apply our method

Binary classification concepts

The binary classification problem

  • True class:
    Each respondent is either a human 👶 or a bot 🤖
  • Decision/prediction:
    For each respondent, flag 🚩 it for (possible) deletion or spare 👍 it
    (Alternatively, predict it to be 👶 or 🤖)
          🚩 Flag             👍 Spare
🤖 Bot    True positive ✔️    False negative ❌
👶 Human  False positive ❌   True negative ✔️

How good is a decision rule?

  • From machine learning and diagnostic testing: sensitivity, specificity, accuracy, etc.

Running example

  • \(N=10\)
    • 6 humans (👶 👶 👶 👶 👶 👶)
    • 4 bots (🤖 🤖 🤖 🤖)
ID Truth
1 👶
2 👶
3 👶
4 👶
5 👶
6 👶
7 🤖
8 🤖
9 🤖
10 🤖

Contamination rate

  • Contamination rate: Proportion of 🤖 in sample
  • Example: 4/10 bots in sample \(= 40\%\) contamination rate
  • In a real setting, not known
ID Truth
1 👶
2 👶
3 👶
4 👶
5 👶
6 👶
7 🤖
8 🤖
9 🤖
10 🤖

Specificity

  • Specificity: Proportion of 👶 that were correctly 👍
  • Example: 4/6 humans spared, \(\approx 67\%\) specificity
  • How good is our decision? Higher is better
  • In a real setting, not known
ID Truth Decision
1 👶 👍
2 👶 🚩
3 👶 👍
4 👶 👍
5 👶 🚩
6 👶 👍

Sensitivity

  • Sensitivity: Proportion of 🤖 that were correctly 🚩
  • Example: 3/4 bots flagged \(= 75\%\) sensitivity
  • How good is our decision? Higher is better
  • In a real setting, not known
ID Truth Decision
7 🤖 🚩
8 🤖 🚩
9 🤖 🚩
10 🤖 👍

Trade-off between sensitivity and specificity

  • For both, higher is better; but increasing one tends to decrease the other
    • 🚩 everyone, 100% sensitivity but 0% specificity
    • 👍 everyone, 100% specificity but 0% sensitivity
  • Analogous to the trade-off, in null hypothesis significance testing (NHST), between Type I error rate and power

Classification Accuracy

  • Proportion of all cases correctly classified
    • 🤖 is 🚩, 👶 is 👍
  • Often what we care about the most
  • Example: 4/6 humans spared and 3/4 bots flagged
    • \((4+3)/10 = 0.7\) or \(70\%\) accuracy
  • In a real setting, not known
  • Weighted average of specificity (\(67\%\)) and sensitivity (\(75\%\)), weighted by class proportions: \(0.6 \times 67\% + 0.4 \times 75\% = 70\%\)
ID Truth Decision
1 👶 👍
2 👶 🚩
3 👶 👍
4 👶 👍
5 👶 🚩
6 👶 👍
7 🤖 🚩
8 🤖 🚩
9 🤖 🚩
10 🤖 👍
  • Cross-tabs or confusion matrix:

                Decision
                Spare  Flag
    Truth Human     4     2
          Bot       1     3
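
These quantities are easy to reproduce in base R. A minimal sketch for the running example, with truth and decision encoding the tables above:

# Running example: 6 humans then 4 bots, with the decisions shown above
truth    <- factor(c(rep("Human", 6), rep("Bot", 4)), levels=c("Human", "Bot"))
decision <- factor(c("Spare", "Flag", "Spare", "Spare", "Flag", "Spare",
                     "Flag", "Flag", "Flag", "Spare"), levels=c("Spare", "Flag"))
table(Truth=truth, Decision=decision)  # cross-tabs / confusion matrix

spec <- mean(decision[truth == "Human"] == "Spare")    # 4/6, about 0.67
sens <- mean(decision[truth == "Bot"] == "Flag")       # 3/4 = 0.75
acc  <- mean((decision == "Flag") == (truth == "Bot")) # (4+3)/10 = 0.70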

Statistical analysis of Likert-type responses

“Thankfully, because their answers clearly aren’t human, there are several methods for detecting bots in your data (see Dupuis et al., 2018)” - Blog for Prolific Academic (2022)

Statistically, what’s the difference between bots and humans?

Humans answer based on item content, which affects item means and the correlations among items

Bots do not follow the same structure; their item responses are independent

Nonresponsivity indices (NRIs)

  • NRIs: Statistics that quantify how suspicious each row is
    • Intra-individual response variability
    • Maximum longstring
    • Person-total correlation
    • Mahalanobis distance
  • Statistics here do not require knowledge about reverse-coding or factor structure

Running example

Row  Item 1  Item 2  Item 3  Item 4
  1       4       2       3       4
  2       1       1       1       1
  3       5       4       2       4
  4       5       2       5       5
  5       1       2       2       3
  6       5       2       3       3

Intra-individual response variability (IRV)

  • Standard deviation of the row
  • “small” values are suspicious?
Row  Item 1  Item 2  Item 3  Item 4   IRV
  1       4       2       3       4   .96
  2       1       1       1       1   .00
  3       5       4       2       4  1.26
  4       5       2       5       5  1.50
  5       1       2       2       3   .82
  6       5       2       3       3  1.26
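
The IRV column can be reproduced with a short base-R sketch (x holds the running-example responses):

# IRV: standard deviation of each row
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
irv <- apply(x, 1, sd)
round(irv, 2)  # 0.96 0.00 1.26 1.50 0.82 1.26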

Maximum longstring

  • Length of longest sequence of items with the same response
  • “large” values are suspicious
Row  Item 1  Item 2  Item 3  Item 4  longstring
  1       4       2       3       4           1
  2       1       1       1       1           4
  3       5       4       2       4           1
  4       5       2       5       5           2
  5       1       2       2       3           2
  6       5       2       3       3           2
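
A base-R sketch of maximum longstring, using run-length encoding (x as in the IRV sketch):

# Maximum longstring: longest run of identical responses in each row
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
longstring <- apply(x, 1, function(row) max(rle(row)$lengths))
longstring  # 1 4 1 2 2 2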

Person-total correlation (PTC)

  • Pearson correlation between a row and the item means (column means) of the reference sample
  • Low/negative values are suspicious; least suspicious, ideal point: +1
Row  Item 1  Item 2  Item 3  Item 4    PTC
  1       4       2       3       4    .99
  2       1       1       1       1  -1.00
  3       5       4       2       4    .47
  4       5       2       5       5    .81
  5       1       2       2       3   -.11
  6       5       2       3       3    .82

        Item 1  Item 2  Item 3  Item 4
Means      3.5    2.17    2.67    3.33
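
A base-R sketch of PTC; note that cor() returns NA (with a warning) for a zero-variance row such as row 2, so how constant rows are scored is implementation-specific:

# PTC: correlate each row with the item means of the reference sample
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
item_means <- colMeans(x)              # 3.50 2.17 2.67 3.33
ptc <- apply(x, 1, cor, y=item_means)  # row 2 gives NA (zero variance)
round(ptc, 2)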

Mahalanobis distance (MD)

Multivariate version of the “z-score” standardization

Under a univariate normal distribution, z-scores far from zero are less likely

Under a multivariate normal distribution, locations with large Mahalanobis distance are less likely

Mahalanobis distance (MD)

        Coordinates   Euclidean distance from center   Mahalanobis distance from center
Center  \((0, 0)\)    0                                0
Blue    \((+2, +2)\)  2.83                             4.22
Red     \((-2, +2)\)  2.83                             28.22

Mahalanobis distance (MD)

  • Statistical distance from the mean of reference sample
  • “large” values are suspicious; least suspicious, ideal point: 0
  • Requires reference sample means and covariances

Column means and covariances

item1 item2 item3 item4 
3.500 2.167 2.667 3.333 
      item1 item2 item3 item4
item1   3.9 1.100 1.800 2.000
item2   1.1 0.967 0.067 0.733
item3   1.8 0.067 1.867 1.533
item4   2.0 0.733 1.533 1.867

Result

  item1 item2 item3 item4 mahal
1     4     2     3     4 2.041
2     1     1     1     1 1.739
3     5     4     2     4 1.970
4     5     2     5     5 1.877
5     1     2     2     3 1.739
6     5     2     3     3 1.543
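
These values can be reproduced in base R; mahal is the square root of the squared distance returned by stats::mahalanobis():

# MD from the reference sample mean vector and covariance matrix
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
mu    <- colMeans(x)
Sigma <- cov(x)
mahal <- sqrt(mahalanobis(x, center=mu, cov=Sigma))
round(mahal, 3)  # 2.041 1.739 1.970 1.877 1.739 1.543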

From suspicions to decisions

“Thankfully, because their answers clearly aren’t human, there are several methods for detecting bots in your data (see Dupuis et al., 2018)” - Blog for Prolific Academic (2022)

Dupuis et al. did not show researchers how to use NRIs to flag bots

Sorting and binary decisions

  • NRIs attempt to sort respondents: e.g., 👶 on left; 🤖 on right
  • In real life… we see only a mixture
  • To make a decision—for each row, 🚩 or 👍—we must apply a threshold
    • But where?

Strategies for threshold selection

  • Visual inspection: Where are the bumps in the distribution?
  • Fixed value for PTC: \(<0\) \(\to\) 🚩 or \(<.5\) \(\to\) 🚩
  • Chi-square critical value for MD (or \(MD^2\))
    • Multivariate normal, known human means/covariances

Do any of these strategies achieve high classification accuracy? (\(\approx\) 10 min)

https://falkcarl.shinyapps.io/BotApp1/

NRIs and Shiny App recap

  • How to choose thresholds:
    • Fixed cut-off values do not consistently do well
    • Visual inspection sometimes works well, occasionally not
  • Other conclusions:
    • Longstring and intra-individual response variability are not great
    • The optimal cut-off does not always generalize from sample to sample

Take-home Activity

See how well these strategies fare in various samples

More than one NRI?

  • What if we used more than one NRI at a time, with a threshold for each?
    • Multiple hurdles: 🚩 a respondent who fails any NRI, or who fails more than one (sketch below)
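
A minimal sketch of such a rule, combining two NRIs from earlier; the thresholds (0.5 and 3) are purely illustrative, not recommendations:

# Multiple hurdles on the running example
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
fail_irv  <- apply(x, 1, sd) < 0.5                                 # hurdle 1
fail_long <- apply(x, 1, function(row) max(rle(row)$lengths)) >= 3 # hurdle 2
flag_any  <- fail_irv | fail_long  # 🚩 if failing any NRI
flag_both <- fail_irv & fail_long  # 🚩 only if failing both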

https://falkcarl.shinyapps.io/BotApp2/

Extra slides

Other nonresponsivity indices

Require more knowledge about the inventory, such as reverse-coding of items and/or underlying factor structure

  • Psychometric (or semantic) synonyms/antonyms
  • Even-odd consistency
  • Guttman errors
  • Response coherence (functional method theory)
  • etc.

Person-total cosine similarity

  • Almost like person-total correlation, but without mean-centering of rows
  • For Likert-type vectors, ranges from \(0\) to \(+1\); values “close” to \(0\) are suspicious
  • Requires reference sample mean vector
  • Defined even when one or both vectors have zero variation
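
A base-R sketch (x and item_means as in the PTC example); unlike PTC, it is well-defined for the constant row 2:

# Person-total cosine similarity: like PTC, but without mean-centering
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
item_means <- colMeans(x)
cossim <- apply(x, 1, function(row) {
  sum(row*item_means) / sqrt(sum(row^2) * sum(item_means^2))
})
round(cossim, 2)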

Chi-square critical value, for MD

  • Flag rows whose squared MD exceeds the Chi-square critical value with upper-tail probability \(\alpha\) (e.g. 5% or 10%) and degrees of freedom (df) equal to the number of items
  • Theoretical specificity of \(1-\alpha\)
    • If items are multivariate normal for humans
    • And if the mean vector and covariance matrix were known

Number of items   Critical value for MD, \(\alpha=0.10\)   Critical value for MD, \(\alpha=0.05\)
15                4.72                                      5.00
20                5.33                                      5.60
25                5.86                                      6.14
30                6.34                                      6.62
35                6.79                                      7.06
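
The table can be reproduced in base R; the tabulated values are critical values for MD itself, i.e., the square root of the Chi-square quantile:

# MD critical values for several inventory lengths
nitems <- c(15, 20, 25, 30, 35)
round(sqrt(qchisq(1 - 0.10, df=nitems)), 2)  # 4.72 5.33 5.86 6.34 6.79
round(sqrt(qchisq(1 - 0.05, df=nitems)), 2)  # 5.00 5.60 6.14 6.62 7.06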

$confusion
   yhat
y   flag spare
  0    1    74
  1   12    13

$outcomemeasures
      acc      spec      sens  flagrate 
0.8600000 0.9866667 0.4800000 0.1300000 

$confusion
   yhat
y   flag spare
  0    0    75
  1    8    17

$outcomemeasures
     acc     spec     sens flagrate 
    0.83     1.00     0.32     0.08 

Fixed value, for PTC

  • Flag rows whose PTC falls below some “acceptable” PTC, e.g. 0
  • No known theoretical properties
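
As a one-line sketch, with ptc computed as in the earlier PTC example:

flag <- ptc < 0  # 🚩 rows with negative PTC; NA rows (zero variance) need a separate decision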

$confusion
   yhat
y   flag spare
  0    0    75
  1   11    14

$outcomemeasures
     acc     spec     sens flagrate 
    0.86     1.00     0.44     0.11 

Visual inspection

  • Subjective pinpointing of where two classes separate

$confusion
   yhat
y   flag spare
  0   16    59
  1   25     0

$outcomemeasures
      acc      spec      sens  flagrate 
0.8400000 0.7866667 1.0000000 0.4100000 

Specificity and sensitivity trade-off

  • Analogous to the trade-off between Type I error rate and power