An introduction to a survey bot detection algorithm

Part 1: Bot Detection Preliminaries

Carl F. Falk & Michael J. Ilagan

Workshop Outline

  • Part 1: Bots and detection preliminaries
    • Lecture (\(\approx\) 30 min)
    • Shiny Apps (\(\approx\) 10 min)
  • Part 2: Bot detection algorithm
    • Lecture (\(\approx\) 30 min)
    • R (\(\approx\) 15 min)

Workshop Materials

R packages to install

  • We assume you are familiar with R
  • detranli = DEtection of RANdom LIkert-type responses
  • If asked whether to install from source, say no for now
# Used for visualizations and for installing the detranli package
install.packages(c("ggplot2","GGally"), type="binary")
install.packages("devtools", type="binary")

# stable version (recommended)
devtools::install_github("michaeljohnilagan/detranli")

# experimental version
#devtools::install_github("falkcarl/detranli", type="binary")

# some example datasets
install.packages(c("psychTools","qgraph"), type="binary")

Outline for Part 1

  • Bot contamination
  • Non-statistical interventions
  • Binary classification concepts
  • Statistical analysis of Likert-type responses
  • From suspicions to decisions

Bot contamination

Online data collection is common

Online data collection is fast, cheap, and facilitates access to hard-to-reach populations

Bot contamination is a problem

Since participants are compensated 💰, there is incentive to complete many surveys in a short time

Non-statistical interventions

Suppose you were doing a study…

  • Likert-type inventory measuring one or more constructs, data collected online

“I feel that I have a number of good qualities”

(Strongly Disagree) 1 2 3 4 5 (Strongly Agree)

Row  Item 1  Item 2  Item 3  Item 4
  1       4       2       3       4
  2       1       1       1       1
  3       5       4       2       4
  4       5       2       5       5
  5       1       2       2       3
  6       5       2       3       3

Suppose you were doing a study…

  • Before analysis of Likert-type responses, what is there to do?
    • Hurdles to gain access to survey
    • Auxiliary Likert-type or binary items
    • Open-ended items
    • Paradata

Hurdles to gain access to survey

  • Advice
    Answer a CAPTCHA (system filters automatically) and/or provide a valid email address or phone number (researcher filters manually)
  • Benefit
    Deters suspicious responders from being part of the data in the first place
  • Limitations
    Need to be mindful of ethics and privacy laws; hurdles de-motivate humans too

Auxiliary Likert-type or binary-choice items

  • Advice
    Insert items into the survey…
    • “I am paid biweekly by leprechauns.”
    • “I can walk on water.”
    • “To ensure data quality, please choose strongly disagree for this item.”
    • “In your honest opinion, should we use your data in our analyses in this study?”
  • Benefit
    Can flag respondents with “wrong” answers
  • Limitations
    Humans may answer humorously, find such items annoying, or take the survey less seriously; items may be interpreted in unexpected ways; and how many failed items count as failing overall?

Open-ended items

  • Advice
    Include open-ended items
  • Benefit
    Can flag respondents who wrote obvious nonsense, implausible claims, or identical responses
    • “Curriculum resources creatively mining”
    • “Wqqwr”
    • Reported age of child does not match grade level
  • Limitations
    Can be subjective or ambiguous for some responses; some useful facts may be unknown to the researcher; time-consuming

Paradata

  • Advice
    Collect geolocation data, mouse movement data, submission time, time spent on page, etc.
  • Benefit
    Can flag suspicious (clusters of) activity, without altering the respondent’s experience
  • Limitations
    Depends on the survey platform; be mindful of ethics or privacy laws; VPNs exist; other explanations have to be accounted for (e.g. I work at night so I was awake at 4 AM)

More interventions

  • For more detailed advice, see the papers referenced in the workshop materials

Our contribution

  • It’s still a good idea to do some of the previously mentioned interventions
  • However, statistical analysis of Likert-type responses is always available and can be done post hoc
  • We provide a method that has sensitivity calibration (will be explained), in the familiar paradigm of null hypothesis significance testing
  • Our R package makes it easy to apply our method

Binary classification concepts

The binary classification problem

  • True class:
    Each respondent is either a human 👶 or a bot 🤖
  • Decision/prediction:
    For each respondent, flag 🚩 it for (possible) deletion or spare 👍 it
    (Alternatively, predict it to be 👶 or 🤖)
          🚩 Flag             👍 Spare
🤖 Bot    True positive ✔️    False negative ❌
👶 Human  False positive ❌   True negative ✔️

How good is a decision rule?

  • From machine learning and diagnostic testing: sensitivity, specificity, accuracy, etc.

Running example

  • \(N=10\)
    • 6 humans (👶 👶 👶 👶 👶 👶)
    • 4 bots (🤖 🤖 🤖 🤖)
ID Truth
1 👶
2 👶
3 👶
4 👶
5 👶
6 👶
7 🤖
8 🤖
9 🤖
10 🤖

Contamination rate

  • Contamination rate: Proportion of 🤖 in sample
  • Example: 4/10 bots in sample \(= 40\%\) contamination rate
  • In a real setting, not known
ID Truth
1 👶
2 👶
3 👶
4 👶
5 👶
6 👶
7 🤖
8 🤖
9 🤖
10 🤖

Specificity

  • Specificity: Proportion of 👶 that were correctly 👍
  • Example: 4/6 humans spared, \(\approx 67\%\) specificity
  • How good is our decision? Higher is better
  • In a real setting, not known
ID Truth Decision
1 👶 👍
2 👶 🚩
3 👶 👍
4 👶 👍
5 👶 🚩
6 👶 👍

Sensitivity

  • Sensitivity: Proportion of 🤖 that were correctly 🚩
  • Example: 3/4 bots flagged \(= 75\%\) sensitivity
  • How good is our decision? Higher is better
  • In a real setting, not known
ID Truth Decision
7 🤖 🚩
8 🤖 🚩
9 🤖 🚩
10 🤖 👍

Trade-off between sensitivity and specificity

  • For both, higher is better; but increasing one tends to decrease the other
    • 🚩 everyone, 100% sensitivity but 0% specificity
    • 👍 everyone, 100% specificity but 0% sensitivity
  • Analogous to the trade-off, in null hypothesis significance testing (NHST), between Type I error rate and power

Classification Accuracy

  • Proportion of all cases correctly classified
    • 🤖 is 🚩, 👶 is 👍
  • Often what we care about the most
  • Example: 4/6 humans spared and 3/4 bots flagged
    • \((4+3)/10 = 0.7\) or \(70\%\) accuracy
  • In a real setting, not known
  • Weighted average of specificity (\(67\%\)) and sensitivity (\(75\%\)), weighted by class proportions: \(0.6 \times 67\% + 0.4 \times 75\% = 70\%\)
ID Truth Decision
1 👶 👍
2 👶 🚩
3 👶 👍
4 👶 👍
5 👶 🚩
6 👶 👍
7 🤖 🚩
8 🤖 🚩
9 🤖 🚩
10 🤖 👍
  • Cross-tabs or confusion matrix:

                Decision
                Spare  Flag
    Truth Human     4     2
          Bot       1     3
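
These quantities are easy to reproduce in base R. A minimal sketch for the running example, with truth and decision encoding the tables above:

# Running example: 6 humans then 4 bots, with the decisions shown above
truth    <- factor(c(rep("Human", 6), rep("Bot", 4)), levels=c("Human", "Bot"))
decision <- factor(c("Spare", "Flag", "Spare", "Spare", "Flag", "Spare",
                     "Flag", "Flag", "Flag", "Spare"), levels=c("Spare", "Flag"))
table(Truth=truth, Decision=decision)  # cross-tabs / confusion matrix

spec <- mean(decision[truth == "Human"] == "Spare")    # 4/6, about 0.67
sens <- mean(decision[truth == "Bot"] == "Flag")       # 3/4 = 0.75
acc  <- mean((decision == "Flag") == (truth == "Bot")) # (4+3)/10 = 0.70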

Statistical analysis of Likert-type responses

“Thankfully, because their answers clearly aren’t human, there are several methods for detecting bots in your data (see Dupuis et al., 2018)” - Blog for Prolific Academic (2022)

Statistically, what’s the difference between bots and humans?

Humans answer based on item content, which affects item means and the correlations among items

Bots do not follow the same structure; their item responses are independent

Nonresponsivity indices (NRIs)

  • NRIs: Statistics that quantify how suspicious each row is
    • Intra-individual response variability
    • Maximum longstring
    • Person-total correlation
    • Mahalanobis distance
  • Statistics here do not require knowledge about reverse-coding or factor structure

Running example

Row  Item 1  Item 2  Item 3  Item 4
  1       4       2       3       4
  2       1       1       1       1
  3       5       4       2       4
  4       5       2       5       5
  5       1       2       2       3
  6       5       2       3       3

Intra-individual response variability (IRV)

  • Standard deviation of the row
  • “small” values are suspicious?
Row  Item 1  Item 2  Item 3  Item 4   IRV
  1       4       2       3       4   .96
  2       1       1       1       1   .00
  3       5       4       2       4  1.26
  4       5       2       5       5  1.50
  5       1       2       2       3   .82
  6       5       2       3       3  1.26
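
The IRV column can be reproduced with a short base-R sketch (x holds the running-example responses):

# IRV: standard deviation of each row
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
irv <- apply(x, 1, sd)
round(irv, 2)  # 0.96 0.00 1.26 1.50 0.82 1.26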

Maximum longstring

  • Length of longest sequence of items with the same response
  • “large” values are suspicious
Row  Item 1  Item 2  Item 3  Item 4  longstring
  1       4       2       3       4           1
  2       1       1       1       1           4
  3       5       4       2       4           1
  4       5       2       5       5           2
  5       1       2       2       3           2
  6       5       2       3       3           2
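
A base-R sketch of maximum longstring, using run-length encoding (x as in the IRV sketch):

# Maximum longstring: longest run of identical responses in each row
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
longstring <- apply(x, 1, function(row) max(rle(row)$lengths))
longstring  # 1 4 1 2 2 2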

Person-total correlation (PTC)

  • Pearson correlation between a row and the item means (column means) of the reference sample
  • Low/negative values are suspicious; least suspicious, ideal point: +1
Row  Item 1  Item 2  Item 3  Item 4    PTC
  1       4       2       3       4    .99
  2       1       1       1       1  -1.00
  3       5       4       2       4    .47
  4       5       2       5       5    .81
  5       1       2       2       3   -.11
  6       5       2       3       3    .82

        Item 1  Item 2  Item 3  Item 4
Means      3.5    2.17    2.67    3.33
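
A base-R sketch of PTC; note that cor() returns NA (with a warning) for a zero-variance row such as row 2, so how constant rows are scored is implementation-specific:

# PTC: correlate each row with the item means of the reference sample
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
item_means <- colMeans(x)              # 3.50 2.17 2.67 3.33
ptc <- apply(x, 1, cor, y=item_means)  # row 2 gives NA (zero variance)
round(ptc, 2)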

Mahalanobis distance (MD)

Multivariate version of the “z-score” standardization

Under a univariate normal distribution, z-scores far from zero are less likely

Under a multivariate normal distribution, locations with large Mahalanobis distance are less likely

Mahalanobis distance (MD)

        Coordinates   Euclidean distance from center   Mahalanobis distance from center
Center  \((0, 0)\)    0                                0
Blue    \((+2, +2)\)  2.83                             4.22
Red     \((-2, +2)\)  2.83                             28.22

Mahalanobis distance (MD)

  • Statistical distance from the mean of reference sample
  • “large” values are suspicious; least suspicious, ideal point: 0
  • Requires reference sample means and covariances

Column means and covariances

item1 item2 item3 item4 
3.500 2.167 2.667 3.333 
      item1 item2 item3 item4
item1   3.9 1.100 1.800 2.000
item2   1.1 0.967 0.067 0.733
item3   1.8 0.067 1.867 1.533
item4   2.0 0.733 1.533 1.867

Result

  item1 item2 item3 item4 mahal
1     4     2     3     4 2.041
2     1     1     1     1 1.739
3     5     4     2     4 1.970
4     5     2     5     5 1.877
5     1     2     2     3 1.739
6     5     2     3     3 1.543
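
These values can be reproduced in base R; mahal is the square root of the squared distance returned by stats::mahalanobis():

# MD from the reference sample mean vector and covariance matrix
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
mu    <- colMeans(x)
Sigma <- cov(x)
mahal <- sqrt(mahalanobis(x, center=mu, cov=Sigma))
round(mahal, 3)  # 2.041 1.739 1.970 1.877 1.739 1.543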

From suspicions to decisions

“Thankfully, because their answers clearly aren’t human, there are several methods for detecting bots in your data (see Dupuis et al., 2018)” - Blog for Prolific Academic (2022)

Dupuis et al. did not show researchers how to use NRIs to flag bots

Sorting and binary decisions

  • NRIs attempt to sort respondents: e.g., 👶 on left; 🤖 on right
  • In real life… we see only a mixture
  • To make a decision—for each row, 🚩 or 👍—we must apply a threshold
    • But where?

Strategies for threshold selection

  • Visual inspection: Where are the bumps in the distribution?
  • Fixed value for PTC: \(<0\) \(\to\) 🚩 or \(<.5\) \(\to\) 🚩
  • Chi-square critical value for MD (or \(MD^2\))
    • Multivariate normal, known human means/covariances

Do any of these strategies achieve high classification accuracy? (\(\approx\) 10 min)

https://falkcarl.shinyapps.io/BotApp1/

NRIs and Shiny App recap

  • How to choose thresholds:
    • Fixed cut-off values do not consistently do well
    • Visual inspection sometimes works well, occasionally not
  • Other conclusions:
    • Longstring and intra-individual response variability are not great
    • The optimal cut-off does not always generalize from sample to sample

Take-home Activity

See how well these strategies fare in various samples

More than one NRI?

  • What if we used more than one NRI at a time, with a threshold for each?
    • Multiple hurdles: 🚩 a respondent who fails any NRI, or who fails more than one (sketch below)
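
A minimal sketch of such a rule, combining two NRIs from earlier; the thresholds (0.5 and 3) are purely illustrative, not recommendations:

# Multiple hurdles on the running example
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
fail_irv  <- apply(x, 1, sd) < 0.5                                 # hurdle 1
fail_long <- apply(x, 1, function(row) max(rle(row)$lengths)) >= 3 # hurdle 2
flag_any  <- fail_irv | fail_long  # 🚩 if failing any NRI
flag_both <- fail_irv & fail_long  # 🚩 only if failing both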

https://falkcarl.shinyapps.io/BotApp2/

Extra slides

Other nonresponsivity indices

Require more knowledge about the inventory, such as reverse-coding of items and/or underlying factor structure

  • Psychometric (or semantic) synonyms/antonyms
  • Even-odd consistency
  • Guttman errors
  • Response coherence (functional method theory)
  • etc.

Person-total cosine similarity

  • Almost like person-total correlation, but without mean-centering of rows
  • For Likert-type vectors, ranges from \(0\) to \(+1\); values “close” to \(0\) are suspicious
  • Requires reference sample mean vector
  • Defined even when one or both vectors have zero variation
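
A base-R sketch (x and item_means as in the PTC example); unlike PTC, it is well-defined for the constant row 2:

# Person-total cosine similarity: like PTC, but without mean-centering
x <- rbind(c(4,2,3,4), c(1,1,1,1), c(5,4,2,4),
           c(5,2,5,5), c(1,2,2,3), c(5,2,3,3))
item_means <- colMeans(x)
cossim <- apply(x, 1, function(row) {
  sum(row*item_means) / sqrt(sum(row^2) * sum(item_means^2))
})
round(cossim, 2)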

Chi-square critical value, for MD

  • Flag rows whose squared MD exceeds the Chi-square critical value with upper-tail probability \(\alpha\) (e.g. 5% or 10%) and degrees of freedom (df) equal to the number of items
  • Theoretical specificity of \(1-\alpha\)
    • If items are multivariate normal for humans
    • And if the mean vector and covariance matrix were known

Number of items   Critical value for MD, \(\alpha=0.10\)   Critical value for MD, \(\alpha=0.05\)
15                4.72                                      5.00
20                5.33                                      5.60
25                5.86                                      6.14
30                6.34                                      6.62
35                6.79                                      7.06
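
The table can be reproduced in base R; the tabulated values are critical values for MD itself, i.e., the square root of the Chi-square quantile:

# MD critical values for several inventory lengths
nitems <- c(15, 20, 25, 30, 35)
round(sqrt(qchisq(1 - 0.10, df=nitems)), 2)  # 4.72 5.33 5.86 6.34 6.79
round(sqrt(qchisq(1 - 0.05, df=nitems)), 2)  # 5.00 5.60 6.14 6.62 7.06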

$confusion
   yhat
y   flag spare
  0    1    74
  1   12    13

$outcomemeasures
      acc      spec      sens  flagrate 
0.8600000 0.9866667 0.4800000 0.1300000 

$confusion
   yhat
y   flag spare
  0    0    75
  1    8    17

$outcomemeasures
     acc     spec     sens flagrate 
    0.83     1.00     0.32     0.08 

Fixed value, for PTC

  • Flag rows whose PTC falls below some “acceptable” PTC, e.g. 0
  • No known theoretical properties
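
As a one-line sketch, with ptc computed as in the earlier PTC example:

flag <- ptc < 0  # 🚩 rows with negative PTC; NA rows (zero variance) need a separate decision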

$confusion
   yhat
y   flag spare
  0    0    75
  1   11    14

$outcomemeasures
     acc     spec     sens flagrate 
    0.86     1.00     0.44     0.11 

Visual inspection

  • Subjective pinpointing of where two classes separate

$confusion
   yhat
y   flag spare
  0   16    59
  1   25     0

$outcomemeasures
      acc      spec      sens  flagrate 
0.8400000 0.7866667 1.0000000 0.4100000 

Specificity and sensitivity trade-off

  • Analogous to the trade-off between Type I error rate and power