EVERY MORNING for the past year, a group of British civil servants, diplomats, police officers and spies have woken up, logged onto a slick website and offered their best guess as to whether China will invade Taiwan by a particular date. Or whether Arctic sea ice will retrench by a certain amount. Or how far covid-19 infection rates will fall. These imponderables are part of Cosmic Bazaar, a forecasting tournament created by the British government to improve its intelligence analysis.
Since the website was launched in April 2020, more than 10,000 forecasts have been made by 1,300 forecasters, from 41 government departments and several allied countries. The site has around 200 regular forecasters, who must use only publicly available information to tackle the 30-40 questions that are live at any time. Cosmic Bazaar represents the gamification of intelligence. Users are ranked by a single, brutally simple measure: the accuracy of their predictions.
Forecasting tournaments like Cosmic Bazaar draw on a handful of basic ideas. One of them, as seen in this case, is the “wisdom of crowds”, a concept first illustrated by Francis Galton, a statistician, in 1907. Galton observed that in a contest to estimate the weight of an ox at a county fair, the median guess of nearly 800 people was accurate within 1% of the true figure.
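Galton's observation is easy to reproduce in miniature. The sketch below uses invented guesses and an invented true weight (they are not Galton's data) to show why the median of a crowd holds up even when one guess is wildly off.

```python
from statistics import mean, median

# Hypothetical guesses in pounds; the last one is a deliberate outlier.
guesses = [1100, 1150, 1190, 1200, 1210, 1250, 3000]
true_weight = 1200  # hypothetical true figure, for illustration only

crowd_median = median(guesses)
crowd_mean = mean(guesses)

# The median ignores the size of the outlier; the mean is dragged upwards by it.
median_error = abs(crowd_median - true_weight) / true_weight
mean_error = abs(crowd_mean - true_weight) / true_weight

print(f"median guess: {crowd_median}, relative error: {median_error:.1%}")
print(f"mean guess: {crowd_mean:.0f}, relative error: {mean_error:.1%}")
```

With these made-up numbers the single outlier pulls the mean well away from the true figure, while the median is untouched.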
Crowdsourcing, as this idea is now called, has been augmented by more recent research into whether and how people make good judgments. Experiments by Philip Tetlock of the University of Pennsylvania, and others, show that experts’ predictions are often no better than chance. Yet some people, dubbed “superforecasters”, often do make accurate predictions, largely because of the way they form judgments—such as having a commitment to revising predictions in light of new data, and being aware of typical human biases. Dr Tetlock’s ideas received publicity last year when Dominic Cummings, then an adviser to Boris Johnson, Britain’s prime minister, endorsed his book and hired a controversial superforecaster to work at Mr Johnson’s office in Downing Street.
America’s sprawling intelligence establishment was the first to apply these principles. Over the past decade, it has carried out more than a dozen forecasting projects, including prediction markets, in which people can bet money or points on the outcome, and prediction polls, like Cosmic Bazaar. The most prominent tournament was the Aggregative Contingent Estimation (ACE) programme, run from 2010 to 2015 by the Intelligence Advanced Research Projects Activity (IARPA), a blue-sky research body for American spooks. A curated team of superforecasters from the Good Judgment Project, a scheme led by Dr Tetlock, were found to be at least one-third more accurate than other research teams.
ACE and similar programmes inspired Britain to create Cosmic Bazaar. One of its purposes is to identify a group of persistently successful forecasters who could help answer difficult questions in a crisis. The top 20 or so competitors are “incredibly accurate”, says Charlie Edwards, who trains British intelligence analysts. They are obsessed with their Brier scores, a measure of accuracy over time, and, in common with findings from the Good Judgment Project, share sources of data and news enthusiastically. The only rewards are virtual badges and branded notebooks. But for analysts accustomed to working with secret intelligence, where success remains in the shadows, a high score here—and the merchandise to prove it—is a “badge of honour”, says Mr Edwards.
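For readers curious about the arithmetic, a Brier score in its common binary form is the average squared gap between the probability a forecaster gave and what actually happened (1 if the event occurred, 0 if not). Lower is better. The sketch below is a minimal illustration with invented forecasts; how Cosmic Bazaar computes and aggregates its scores is not described here, so the function name and data are assumptions.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    squared_gaps = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)

# Hypothetical results for four resolved questions (1 = event happened).
outcomes = [1, 0, 0, 1]

# A forecaster who leans confidently, and correctly, on each question.
confident = [0.9, 0.1, 0.2, 0.8]

# A forecaster who answers 50% to everything.
hedger = [0.5, 0.5, 0.5, 0.5]

confident_score = brier_score(confident, outcomes)
hedger_score = brier_score(hedger, outcomes)

print(f"confident forecaster: {confident_score:.3f}")
print(f"permanent hedger: {hedger_score:.3f}")
```

A perfect forecaster scores 0. Someone who always says 50% scores 0.25 on this binary form, which is why beating that figure consistently is a sign of real skill.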
The game’s afoot
Yet the point is not just to pick star performers. It is also to encourage “cognitive diversity” by ensuring that intelligence draws on talent beyond Britain’s smallish pool of full-time analysts. Cosmic Bazaar’s anonymity produces an egalitarian backdrop: a junior data scientist can contest the predictions of a veteran ambassador, and the reasoning behind them, without the shadow of rank. The site encourages debate and discussion. Users can “upvote” perceptive comments by others, and questions are supplemented with seminars by experts. Moreover, since the system is unclassified (unlike most of its American-government counterparts), officials can log in from home, or abroad.
The programme is also intended to identify blind spots in analysis. Officials say that so much government attention is spent on covid-19 that slower-burning or more distant matters tend to be missed. In October, for instance, Cosmic Bazaar asked users a question on Mozambique, responses to which suggested that the risk of jihadist activity was greater than thought (as would later prove true), prompting others to look more closely at the matter.
At the moment, Cosmic Bazaar is the largest forecasting tournament in Europe. But others are getting interested. Britain hopes to draw European allies into the contest. Adam Siegel, a co-founder of Cultivate Labs, the firm which wrote the software for Cosmic Bazaar, says that the Czech Republic is using his company’s platform for public tournaments involving several government agencies, and that another European government has run a classified version. Regina Joseph of Sibylink, a consultancy, has run tournaments for the Dutch government and the Organisation for Security and Co-operation in Europe.
Yet America’s experience with forecasting is a cautionary tale. Despite the attention attracted by ACE, American tournaments and prediction markets have struggled for money and mainstream acceptance. There are no active forecasting tournaments in American intelligence agencies today, though some remain in the Pentagon and elsewhere.
One reason for this, suggests “Keeping Score: A New Approach to Geopolitical Forecasting”, a recent paper by Perry World House, a research group at the University of Pennsylvania, is that such platforms threaten to expose poor analysts and up-end existing hierarchies. “Established employees”, the paper’s authors write, “may view the potential disruption wrought by a mechanism that outperforms many traditional analysts with a sense of impending doom, as a factory worker might view a new assembly robot.”
However, the larger issue may simply be that the feature which makes precise forecasting possible also limits its appeal. A basic requirement is that questions be falsifiable, so that it is unequivocal, after the fact, who got it right and who wrong. This means there is no room for what psychologists have called “clairvoyance”, or the post hoc claim that a vague prediction came true. Yet policymakers are often drawn to bigger and vaguer questions that resist such score-keeping, such as: “what does Russia want?” or “will China become more aggressive?” Dr Tetlock calls this the “rigour-relevance trade-off”.
One way to approach this problem, says Steven Rieber, who oversees forecasting at IARPA, is to draw on an advanced statistical technique known as Bayesian networking, which uses conditional probabilities. Forecasters can be asked to judge, for example, the probability that China would seize an island in the South China Sea by a particular date if it were becoming more aggressive—and also the probability of it doing so even if it were not. A big and elusive question can thus be broken down into several smaller and more tractable ones, known as “Bayesian question clusters”. Foretell, a project run by the Centre for Security and Emerging Technology (CSET) at Georgetown University, which also uses the Cultivate platform, employs this methodology to predict the course of technological competition between America and China. It is not yet clear whether that approach will be successful.
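The logic of conditional probabilities can be sketched with Bayes's rule. Suppose forecasters supply two numbers for a concrete event: its probability if the big hypothesis is true and its probability if it is not. Once the event does or does not happen, belief in the hypothesis can be updated. The code below is an illustration under that assumption only; the function name and all the probabilities are invented, and it is not a description of how IARPA or Foretell actually combine their question clusters.

```python
def update_belief(prior, p_event_if_true, p_event_if_false, event_occurred):
    """Return P(hypothesis | observation) using Bayes's rule."""
    # Likelihood of what was observed under each version of the world.
    if event_occurred:
        like_true, like_false = p_event_if_true, p_event_if_false
    else:
        like_true, like_false = 1 - p_event_if_true, 1 - p_event_if_false

    # Total probability of the observation across both versions.
    evidence = like_true * prior + like_false * (1 - prior)
    if evidence == 0:
        raise ValueError("observation has zero probability under both hypotheses")
    return like_true * prior / evidence

# Hypothetical inputs: a 50% prior that China is becoming more aggressive,
# a 30% chance of an island seizure by the deadline if so, and 10% if not.
prior = 0.5
if_seized = update_belief(prior, 0.3, 0.1, event_occurred=True)
if_not_seized = update_belief(prior, 0.3, 0.1, event_occurred=False)

print(f"belief after a seizure: {if_seized:.2f}")
print(f"belief after no seizure: {if_not_seized:.2f}")
```

The further apart the two conditional probabilities are, the more a resolved question shifts belief in the larger one. A question whose answer is equally likely either way tells the analyst nothing.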
For now, forecasters are enjoying a moment in the sun. In Britain, Cosmic Bazaar’s insights are trickling into policy teams that work on covid-19 and counter-terrorism. In America, President Joe Biden, one day after his inauguration, announced his intention to establish a National Centre for Epidemic Forecasting and Outbreak Analytics. In March the administration hired Jason Matheny, a former chief of IARPA and the founder of CSET, as an adviser on technology and national security.
The long-term viability of forecasting will depend, though, not just on accuracy but also on explainability. “It’s not enough to learn that there’s a 70% chance of war breaking out between these two countries in the next year, and not the 30% you thought,” says Dr Rieber. “You need to understand what leads to that higher probability judgment.” An assessment paired with a colourful psychological profile of Xi Jinping is more likely to resonate with a prime minister or president than a percentage figure. “You have to build up a trust relationship with these decision-makers,” says Mr Siegel. “You need to put a story together alongside the numbers.” ■
A version of this article was published online on April 14th, 2021.
This article appeared in the Science & technology section of the print edition under the headline “Welcome to the Cosmic Bazaar”