For the ninth year in a row, the Early Warning Project (EWP) ran a comparison survey in December to solicit wisdom-of-the-crowd opinions on countries' relative risks for new mass killing.
The comparison survey is meant to serve the same function as our annual Statistical Risk Assessment (SRA): ranking all countries in the world by their risk of new mass killing. The SRA does this using publicly available data and a machine learning model, with the goal of helping focus additional resources where preventive action is most needed.
Unlike our SRA, which has been tested extensively, we do not know how accurate results from the comparison survey are. The survey is part of our effort to experiment with wisdom-of-the-crowd methods for atrocity risk forecasting.
This year’s comparison survey received substantially less participation—that is, fewer votes—than it did in previous years. As a result, we take the opportunity to reflect on two aspects of the survey methodology.
How many votes and voters are enough?
The premise of “crowd forecasting” approaches is that aggregating the subjective assessments of multiple people will yield an increase in accuracy. An important question, therefore, is how large the “crowd” must be to reap significant benefits from aggregating across individual responses.
For a standard survey where every person answers the same question (e.g., How likely is it that country x will see a new episode of mass killing?), research suggests that most gains in forecasting accuracy come from combining a small number of independent estimates—often as few as five (e.g., see Gaba et al. 2017).
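The intuition behind this finding can be illustrated with a toy simulation (the numbers here are hypothetical, not EWP data): when each forecaster's estimate is the true probability plus independent noise, averaging even a handful of estimates shrinks the typical error compared with relying on any single forecaster.

```python
import random

random.seed(42)
TRUE_PROB = 0.30  # hypothetical "true" risk the forecasters try to estimate


def forecaster():
    # One forecaster's estimate: truth plus independent noise, clipped to [0, 1].
    return min(max(TRUE_PROB + random.gauss(0, 0.15), 0.0), 1.0)


def avg_error(group_size, trials=10_000):
    # Mean absolute error of the group-average forecast across many trials.
    total = 0.0
    for _ in range(trials):
        estimates = [forecaster() for _ in range(group_size)]
        total += abs(sum(estimates) / group_size - TRUE_PROB)
    return total / trials


for k in (1, 5, 25):
    print(k, round(avg_error(k), 3))
```

Under these assumptions the error drops steeply between one forecaster and five, with diminishing returns thereafter, which is the pattern the research cited above describes.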
However, our annual comparison survey uses a much less commonly used methodology. Recall that the pairwise comparison survey involves a single question—Which country is more likely to see a new episode of mass killing in 2022?—with many possible answers on which participants vote, one pair at a time. Because this approach has not been thoroughly tested in forecasting applications, we have little sense of how the number of votes and voters relates to the accuracy of the aggregated ranking.
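To make the aggregation step concrete, here is one simple way pairwise votes could be turned into a ranking (a hypothetical sketch, not the EWP's actual algorithm, using made-up country names): score each country by the share of its comparisons that it “won.”

```python
from collections import defaultdict

# Hypothetical votes: each tuple records (winner, loser) of one pairwise comparison.
votes = [
    ("Country A", "Country B"),
    ("Country A", "Country C"),
    ("Country B", "Country C"),
    ("Country C", "Country A"),
]

wins = defaultdict(int)
appearances = defaultdict(int)
for winner, loser in votes:
    wins[winner] += 1
    appearances[winner] += 1
    appearances[loser] += 1

# Rank countries by the fraction of their comparisons they won.
ranking = sorted(appearances, key=lambda c: wins[c] / appearances[c], reverse=True)
print(ranking)  # ['Country A', 'Country B', 'Country C']
```

A sketch like this also makes the sample-size concern visible: when a country appears in only a handful of comparisons, a single flipped vote can move its win rate, and hence its rank, substantially.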
The survey we ran in December 2021 received 1,148 votes in 41 unique user sessions (meaning the number of “voters” is less than or equal to 41). This is the lowest total of any year in which we have run the comparison survey. (By way of comparison, our 2018 survey received 10,825 total votes.)
Because each “vote” registers an opinion about a randomly chosen pair from more than 160 countries, the exact same comparison may have been presented only once, and any given country appeared in relatively few votes. On average, this year, each country was rated 18 times. (In our 2018 survey, countries averaged more than 150 votes each.)
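A quick back-of-the-envelope count shows why repeat comparisons are rare. Assuming roughly 160 countries, the number of distinct pairs far exceeds this year's vote total:

```python
from math import comb

n_countries = 160   # approximate number of countries in the survey
total_votes = 1148  # votes received in the December 2021 survey

unique_pairs = comb(n_countries, 2)
print(unique_pairs)                         # 12720 distinct country pairs
print(round(total_votes / unique_pairs, 2)) # 0.09: under a tenth could appear even once
```

So even if every vote had drawn a different pair, more than 90 percent of possible comparisons would never have been shown at all.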
With the total number of votes so low, changes to a very small number of votes would have had large effects on the resulting rankings. For example, only seven votes separate the highest-ranked country this year, Iraq, from the 30th-ranked country, Sierra Leone.
Distinguishing “new” vs. “ongoing” mass killing
Another general challenge for surveys is to phrase questions so that respondents’ answers correspond to the concepts of interest. In the case of our comparison survey, we ask people to compare two countries at a time on their risk of a “new episode of mass killing.”
The adjective “new” plays an important role in this question. In keeping with our focus on early warning, we’re specifically interested in people’s assessment of the risk of a mass killing beginning, not the risk that one will continue or escalate. A “new” mass killing can mean the start of an episode in a country that had not been experiencing one, or a mass killing by a different perpetrator or against a different civilian group in a country that is already experiencing a mass killing.
Did respondents absorb and apply this distinction? Or might they have rated countries based on perceived risk of any large-scale killing of civilians, whether “new” or continuing?
One way to explore these questions is to look at how respondents rated countries with ongoing mass killings, especially those where two episodes are already underway or where the ongoing episode is so extensive that a new mass killing beginning is implausible. If respondents focused sharply on the risk of “new” mass killings, few of these countries should appear near the top of the rankings.
Overall, 11 of the top 30 ranked countries from this survey were already experiencing mass killings. Of the four countries in the top 10 from this year’s survey that had ongoing mass killings, three (Iraq, Syria, and Nigeria) were already experiencing a state-led and a non-state-led mass killing.
This suggests that respondents probably did not clearly distinguish the risk of a new episode from continuation of ongoing violence.
It is worth noting that distinguishing new from ongoing mass killing seems to be a challenge for statistical forecasting as well as crowd-based methods. For example, 13 of the top 30 ranked countries in this year’s SRA were already experiencing mass killings.
The Early Warning Project seeks to contribute to policy conversations about current mass atrocity risks and how to address them, while also advancing the science of early warning for mass atrocities.
This comparison survey should be seen mainly as an attempt to learn more about a novel crowd forecasting method. We have reason to be concerned about the small number of votes and the possibility that respondents failed to distinguish between risks of a mass killing beginning and one continuing. Future analysis should help shed further light on how to use pairwise comparison surveys for early warning. For transparency, we have posted the list of the top 30 countries from this year’s survey here.
Readers should recall that our annual SRA offers a similar output, a global assessment of the risk of new mass killings, and stands on much firmer scientific ground. As such, it should remain the principal resource for people seeking an accurate assessment of mass killing risks across the globe.