aggregation expert judgment forecasting

Statistical models vs. human judgments: The Nate Silver controversy seen through a GJP lens

Do statistical models always outperform human forecasters?

The storm of controversy over Nate Silver’s new 538 website has transformed what had been a question of largely academic interest into a staple of water-cooler conversation. Critics such as Paul Krugman complain that Silver improperly dismisses the role of human expertise in forecasting, which Krugman views as essential to developing meaningful statistical models.

The Good Judgment Project’s research team has given the human vs. machine debate a lot of thought. GJP’s chief statistician Lyle Ungar shared his views in a recent interview. According to Ungar,

The bottom line is that if you have lots of data and the world isn’t changing too much, you can use statistical methods. For questions with more uncertainty, human experts become more important.

Ungar sees the geopolitical questions posed in the ACE tournament as well suited to human forecasting:

Some problems, like the geo-political forecasting we are doing, require lots of collection of information and human thought. Prediction markets and team-based forecasts both work well for sifting through the conflicting information about international events. Computer models mostly don’t work as well here – there isn’t a long enough track record of, say, elections or coups in Mali to fit a good statistical model, and it isn’t obvious what other countries are ‘similar.’

GJP does see a role for statistical models in geopolitical forecasting. For example, we use both prediction markets and statistical aggregation techniques to combine the judgments of our forecasters and generate probability estimates that typically are more accurate than simple unweighted averages of human predictions.

Moreover, for many geopolitical forecasting questions, we see promise in a human-machine hybrid approach that combines the best strengths of human judgments and statistical models. (Think “Kasparov plus Deep Blue” in the chess context.) We hope to take initial steps to test this approach in Season 4 of the IARPA ACE forecasting tournament. Stay tuned for further developments!


forecasting IARPA ACE Tournament

Reflections on Season 3 Thus Far (part 1)

As 2013 draws to a close, so does the first half of Season 3 of the IARPA forecasting tournament. This seems as good a time as any to review some of the highs and lows of this tournament season. We begin our review with a look at IFP #1318, which closed on Christmas Day and—in our view—represents one of the biggest surprises thus far in Season 3.

A surprising visit to Yasukuni

In mid-November, we launched a question asking whether Japan’s Prime Minister would visit the controversial Yasukuni Shrine before the year’s end. Just a few days before the question was scheduled to expire, Abe did in fact visit the shrine—something that few forecasters anticipated.

A post-visit comment by one team forecaster reflected the general amazement: “Jaw-drop! I have to wonder if this was telegraphed and we missed it, or it really was a spontaneous decision.”

Forecasters who posted good scores on this question seem to have taken Prime Minister Abe at his word. As the Japan News reported, the Prime Minister had pledged during his most recent election campaign to visit the shrine while in office. Nonetheless, over the past year, Abe had avoided commitment to any specific date for a visit. A few weeks before the question launched, there was a brief flurry of news stories in which an Abe aide indicated his expectation that the visit would occur before the end of the year, though the report was downplayed by another Japanese official. In retrospect, the aide seems to have been providing accurate intelligence: Abe’s visit occurred on the first anniversary of his current administration and before the end of the year.

Those who failed to anticipate Abe’s shrine visit can take comfort from a poll published in a major Japanese newspaper on December 24th, reflecting a question posed to over 1,000 Japanese households on December 21st and 22nd:

Of the respondents, 48 percent appreciated Prime Minister Abe’s decision to refrain from visiting Tokyo’s controversial Yasukuni Shrine where Class A war criminals are enshrined along with the war dead since he took office, as compared with 37 percent who do not appreciate his decision.

The poll itself reflects that there was great uncertainty about what Abe would do only a few days before his visit.

But forecasters who steadily decreased their probability estimates as the end-date for this question approached may want to think twice before applying this strategy to future questions that have reasonable potential for surprise endings. At least one superforecaster had noted the risk that the Yasukuni Shrine question could be this season’s “Monti,” a reference to a question that caught forecasters off-guard almost exactly a year earlier, when Italy’s then-Prime-Minister followed through on his stated intention to resign, rather than stay in office as leader of a minority government, after Berlusconi withdrew his support from the Monti-led coalition. And, of course, there is the now-infamous question “1007” from Season 1, which asked whether a “lethal confrontation” would occur in the South China Sea. That question resolved as “Yes” near the scheduled closing date when a Chinese fishing boat captain fatally stabbed a South Korean coast-guard official who had boarded the fishing boat.

In these cases, “surprise” outcomes occurred late in a question’s lifespan, when many forecasters had been tapering their predictions toward a 0% likelihood that the event would occur. And, in all of these cases, the “Yes” outcome resulted from the action of a single individual, taken with little or no news coverage in the preceding days that would have alerted those following the question closely that the event was about to occur.

If there is a lesson to be learned here, it seems to be: Take a moment to think about the ways that an event could occur. If the event of interest does not require elaborate preparations beforehand and can be accomplished by one or two people, with little fanfare, it may hold more potential for a surprise outcome than our first impression would lead us to assume. This is particularly true when there is evidence early in the lifespan of a question suggesting that the event might occur, followed by no news, as opposed to contradictory news, later in the lifespan of a question.

forecasting GJP in the Media IARPA ACE Tournament Tetlock

GJP in the News, Again (and Again)

The Economist’s The World in 2014 issue just hit the newsstands, focusing international attention on questions that Good Judgment Project forecasters consider on a daily basis: What geopolitical outcomes can we expect over the next 12-14 months?

One outcome that our forecasters may not have anticipated, though, is that the Good Judgment Project itself would be featured in this annual compendium of forecasts. An article co-authored by GJP’s Phil Tetlock and journalist Dan Gardner poses the question “Who’s good at forecasts?” and offers insight into “How to sort the best from the rest.” Participants in the ongoing ACE forecasting tournament sponsored by the Intelligence Advanced Research Projects Activity (IARPA) will not be surprised to learn that Tetlock and Gardner believe that such forecasting tournaments are the best way to compare forecasting ability.

Tetlock and Gardner’s brief article does not address a second benefit of such tournaments: participants can improve their forecasting skills through a combination of training and practice, with frequent feedback on their accuracy. Combining training and practice with what GJP’s research suggests is a stable trait of forecasting skill seems to produce the phenomenon that GJP calls “superforecasters.”

GJP’s top forecasters have been so accurate that, according to a recent report by Washington Post columnist David Ignatius, they even outperformed the forecasts of intelligence analysts who have access to classified information.

In a brief video clip on The Economist’s web site, Phil Tetlock notes that “People are not as good at anticipating the future as they think they are.” If you would like to test your own forecasting skills against GJP’s best forecasters, we invite you to register to become a Good Judgment Project forecaster. Because of the strong demand to participate, we expect to open new slots soon as part of the ongoing Season 3 tournament. Who knows? Maybe you have the skills of which superforecasters are made!

forecasting GJP research team

Meet GJP’s Political Science Experts

The Good Judgment Project has roots in Phil Tetlock’s study of Expert Political Judgment, which may be best known for its conclusion that the “expert” forecasters he studied were often hard-pressed to do better than the proverbial dart-throwing chimp. This conclusion sometimes has been misinterpreted as a nihilistic rejection of all forms of subject matter expertise. But, that’s not so. Indeed, our research team for the IARPA tournament relies on a wide range of subject matter experts, including political scientists who devote much of their time and research effort to improving geopolitical forecasts.

They are no ordinary pundits, though. Consider, for example, Jay Ulfelder, former research director of the Political Instability Task Force (PITF) who now blogs as the Dart-Throwing Chimp. Jay’s forecasting prowess was evident during Season 2 of the Good Judgment Project—he participated as a “regular forecaster” and would have qualified for “superforecaster” status in Season 3 had he not joined our research team. Jay’s performance far exceeded that achievable through random guessing.

Now, Jay faces the far more challenging task of helping to frame the forecasting questions for Season 3. With his assistance, we’ve made progress in developing questions that are both rigorous (with all terms defined specifically enough to allow unambiguous scoring of forecasts as right or wrong) and relevant (in that knowing the correct outcome would meaningfully inform policymakers’ decisions). But, much work remains to be done before we become as skilled at generating questions as the GJP forecasters have become at producing accurate answers to those questions.

Also new to our Season 3 research team is Mike Ward, whose research lab at Duke University “creates conflict predictions using Bayesian modeling and network analysis.” In 2012, Mike and Jay engaged in a spirited debate in Foreign Policy about whether there could be a “Nate Silver” in world politics (that is, a forecaster of geopolitical events with a track record comparable to Silver’s notable successes in forecasting US elections). Mike doesn’t claim to have achieved Silver status yet, but we expect that his sophisticated modeling techniques will add to the GJP’s statistical arsenal.

We’re also fortunate to have Phil Schrodt join our Season 3 research team. Like Jay, both Phil and Mike have been involved in prior government projects to develop accurate statistical forecasts of geopolitical events, in their cases through both the PITF and DARPA’s Integrated Crisis Early Warning System (ICEWS). Phil’s expertise with “event data” (e.g., with GDELT, Global Data on Events, Location and Tone) complements GJP’s strength in crowd-sourced forecasts. We’re hoping that the whole will be greater than the sum of the parts. And, readers of this blog should hope that Phil and Jay, both bloggers extraordinaire, will find time to write one or two guest posts for us. Until then, you can sample Phil’s thoughts on political forecasting courtesy of slides for his recent lectures at the University of Konstanz (Germany) and at the European Network for Conflict Research (ENCoRe), University of Essex (United Kingdom).

Jay, Mike and Phil round out a GJP political-scientist roster that already included Penn’s own Mike Horowitz, an expert on international conflict and security issues, and Rick Herrmann, Chair of Ohio State’s Political Science Department and former Director of the Mershon Center for International Security Studies. The breadth and depth of their experience has been invaluable in guiding GJP’s efforts to improve geopolitical forecasting and shed light on important debates in international relations.

IARPA ACE Tournament

FORECAST: Another great season for the Good Judgment Project

Season 3 of the IARPA forecasting tournament is only a little over a month old (official scoring of individual forecasts began on July 31st). But, it already promises to be our most exciting season yet.

Nearly 3,000 forecasters have joined the Good Judgment Team’s roster for Season 3—many of whom are new to the Team this year. Our “newbies” include transfers from other research teams that had participated in Seasons 1 and 2, as well as more than 1,000 forecasters who are participating in the tournament for the first time ever. Welcome to all!

As before, the Good Judgment Project is using the IARPA tournament as a vehicle for social-science research to determine the most effective means of eliciting and aggregating geopolitical forecasts from a widely dispersed forecaster pool. To reach scientifically valid conclusions, we have randomly assigned participants to one of several online forecasting platforms, divided almost evenly between a survey format and two different prediction-market formats. Some survey forecasters participate as individuals; others are randomly assigned to small teams (starting size: 15 forecasters) that share information and strategies and are scored on their collective performance. We are also testing the efficacy of training materials that are tailored to the various elicitation platforms.

Our “superforecasters” deserve a tremendous share of the credit for the Good Judgment Team’s tournament-winning performance in Season 2. Most of our 60 Season 2 superforecasters have returned for Season 3, and we’ve augmented their ranks with a new crop of superforecasters chosen from the top performers in all experimental conditions in Season 2. The 120 superforecasters are organized as eight teams of 15 forecasters each.

Thus far, only four carryover questions from Season 2 have resolved and been officially scored, and that scoring includes only the final month during which these questions were open for forecasting. (The prior months were scored as part of Season 2.) Therefore, these four questions cannot tell us how well our forecasters will perform over the 150+ questions expected to be released between now and May 2014.

With that caveat firmly in mind, we close with what we hope is a precursor of the outstanding performance to be expected over the next several months. One method of aggregating forecasts is to treat each forecaster’s probability forecast as a “vote” (where any probability greater than 50% is a vote that the outcome will occur and any probability less than 50% is a vote that it will not occur), and then predict the outcome based on those votes. The “majority rules” vote of our Season 3 superforecasters achieved a perfect accuracy score for all four questions that have just closed! That’s a pretty impressive achievement when you realize that each day’s prediction for each forecasting question is separately scored. To get a perfect score, the majority of superforecasters must have made the right call on each one of the four questions every day for the month of August. Well done, everyone!
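
As a rough illustration of the “majority rules” method described above, the sketch below treats each probability as a vote. The function name `majority_vote` and the sample probabilities are hypothetical, and the handling of forecasts at exactly 50% (ignored here) is an assumption, since the description above does not say how such forecasts are counted.

```python
from typing import Optional, Sequence


def majority_vote(probabilities: Sequence[float]) -> Optional[bool]:
    """Return True if most forecasters lean "yes", False if most lean "no".

    Forecasts of exactly 0.5 are treated as abstentions (an assumption).
    Returns None when the "yes" and "no" votes are tied.
    """
    yes_votes = sum(1 for p in probabilities if p > 0.5)
    no_votes = sum(1 for p in probabilities if p < 0.5)
    if yes_votes == no_votes:
        return None
    return yes_votes > no_votes


# Made-up forecasts for one question on one day.
example_day = [0.80, 0.65, 0.40, 0.55, 0.30]
print(majority_vote(example_day))  # True: three "yes" votes against two "no" votes
```

Because each day is scored separately, a perfect score under this rule would require the call to match the eventual outcome on every day the question was open.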

IARPA ACE Tournament

Season 3 of the Good Judgment Project Starts TODAY!

If you signed up to be a Good Judgment Project forecaster for Season 3, you should have received an e-mail invitation via Qualtrics. Clicking on your personal link will take you through the informed-consent process (necessary administrative detail about expectations, payments, etc.) and hand you off to our background data survey. (If you signed up for the project but haven’t received your invitation yet, please contact

Next steps will be initial training (another Qualtrics invitation, coming soon), followed by access to your forecasting website for some preseason practice. Current plans call for all forecasters who have completed the preliminaries to be admitted to their assigned forecasting websites by mid-July. Official scoring will begin on August 1st and run through early May 2014.

We expect an exciting forecasting season. All forecasters in the IARPA tournament are now under one “roof” (the Good Judgment Project or GJP), which means that we’ll be able to give more complete feedback about relative performance AND that the competition has become even more intense. It will be interesting to see whether forecasters transferring to GJP from our prior competitors have developed distinctive forecasting styles associated with their prior teams.

Another new feature for Season 3 is question clusters. Most of the Season 3 forecasting questions will fall into one of about 15 clusters or categories that are thematically grouped. Forecasting on closely related questions should allow forecasters to delve more deeply into global issues of interest to policymakers. And, we hope that having forecasts on several related questions will shed greater light on key policy debates than is possible with any single question.

Watch this blog for further Season 3 news.

GJP in the Media IARPA ACE Tournament Tetlock

Catching up on news about the Good Judgment Project

Season 2 of the IARPA tournament has sped by so rapidly that we’ve been remiss about keeping readers abreast of news about the Good Judgment Project. Here are some highlights of the past several months.

Project co-leader Phil Tetlock is well known as the author of Expert Political Judgment, the popular 2005 book demonstrating that political experts often perform no better than chance when making long-term forecasts. Those who know Tetlock best in this context may be surprised by his assessment of the IARPA tournament in a December 2012 interview, in which he observed:

Is world politics like a poker game? This is what, in a sense, we are exploring in the IARPA forecasting tournament. You can make a good case that history is different and it poses unique challenges. This is an empirical question of whether people can learn to become better at these types of tasks. We now have a significant amount of evidence on this, and the evidence is that people can learn to become better. It’s a slow process. It requires a lot of hard work, but some of our forecasters have really risen to the challenge in a remarkable way and are generating forecasts that are far more accurate than I would have ever supposed possible from past research in this area.

Since that interview, the Good Judgment Team’s collective forecasts in the IARPA tournament have maintained a high standard of accuracy across topics ranging from “Who will be the next Pope?” to “Will Iran and the U.S. commence official nuclear program talks before 1 April 2013?” to “Will 1 Euro buy less than $1.20 US dollars at any point before 1 January 2013?” Impressively, the Team’s forecasters are not, for the most part, subject-matter experts on these topics, but rather intelligent volunteers who research candidates for the papacy or Middle East politics in their “spare” time.

Our collective forecasts combine the insights of hundreds of forecasters using statistical algorithms that, ideally, help to extract the most accurate signal from the noise of conflicting predictions. Analyses of data from the first tournament season suggest that we can boost prediction accuracy by “transforming” or “extremizing” the group forecast. We described the process in an AAAI paper presented in Fall 2012:

Take the (possibly weighted) average of all the predictions to get a single probability estimate, and then transform this aggregate forecast away from 0.5….

Transformation increases the accuracy of aggregate forecasts in many, but not all cases. The trick is to find the transformation that will most improve accuracy in a particular situation. Preliminary results suggest that little or no transformation should be applied to the predictions of the most expert forecasters and to forecasts on questions with a high degree of inherent unpredictability. Stay tuned for further refinements as we analyze new data in Season 3.
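
To make the idea concrete, here is a minimal sketch of one transformation that pushes an averaged forecast away from 0.5. The functional form p^a / (p^a + (1 - p)^a) and the exponent value are assumptions chosen for illustration, not necessarily the settings used in the tournament; `extremize`, `aggregate`, and the sample forecasts are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def extremize(p: float, a: float) -> float:
    """Push probability p away from 0.5 when a > 1; a == 1 leaves p unchanged."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if a <= 0:
        raise ValueError("a must be positive")
    numerator = p ** a
    return numerator / (numerator + (1.0 - p) ** a)


def aggregate(forecasts: Sequence[float], a: float = 1.0) -> float:
    """Take the unweighted average of the forecasts, then extremize it."""
    return extremize(fmean(forecasts), a)


forecasts = [0.6, 0.7, 0.8]          # made-up forecasts with a mean of 0.7
print(aggregate(forecasts, a=1.0))   # no transformation: stays at the mean
print(aggregate(forecasts, a=2.0))   # moved away from 0.5, toward 1
```

In this sketch, setting `a` to 1 corresponds to the “little or no transformation” case mentioned above for the most expert forecasters and for highly unpredictable questions.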

This year, we’ve continued research comparing the accuracy of prediction markets to other methods of eliciting and aggregating forecasts, working with our new partner Lumenogic. Our Season 2 prediction-market forecasters use a Continuous Double Auction (“CDA”) trading platform similar to the now-defunct Intrade platform, but wagering only virtual dollars.

Forecasters have mixed feelings about trading as a method of prediction: some enjoy the challenge of predicting the ups-and-downs of forecaster sentiment; others prefer to focus on predicting the event in question and would rather not be bothered with understanding the views of their fellow traders. But, it’s hard to argue with the overall performance of the two markets over the first several months of forecasting. The prediction markets have outperformed the simple average of predictions from forecasters in each of our survey-based experimental conditions (individuals working without access to crowd-belief data and individuals working in small teams who have access to data about their teammates’ forecasts). Only our “super-forecasters” (drawn from the top 2% of all participants in Season 1) have proven to be consistently more accurate than the prediction markets.

During Season 3, which begins in June 2013, we’ll be expanding our research team and our forecaster pool in hopes of coming ever closer to the theoretical limits of forecasting accuracy. We invite our current forecasters to continue with the Good Judgment Team for another season and encourage new participants to join us for this exciting challenge.

IARPA ACE Tournament Mellers wisdom of the crowd

Season 2, are we ready?

As a new member of the Good Judgment Project, I joined this team less than a month ago. Working with Barb Mellers on a daily basis, I decided to explore the project’s past successes and future objectives while getting to know Barb a bit better. I quizzed her about the project, her thoughts, funny memories and aspirations for Season 2.

First and foremost, it was exciting to learn how international this project is. Our forecasters span the whole globe. Whatever conference Barb attends, she meets at least one of our many forecasters. When an audience member at a conference at the University of Maryland (not too far from home) approached Barb, introducing himself as a member of the Good Judgment Team, it was pleasant, but not particularly startling. But when a conference organizer in Allahabad, India, introduced himself to Barb as one of the forecasters on the project, it was a rather amusing surprise. Her experience shows that people everywhere like to participate in fun and exciting projects such as this one. It makes one wonder where Barb’s next forecaster sighting will take place.

New Good Judgment Project members quickly learn that our group had great success in the first round of the IARPA forecasting tournament. So the obvious question arises: Why was our team so successful last season compared to the others? According to Barb, not only did we have more forecasters than other teams, but our forecasters worked harder, thought harder, and updated their forecasts more often than the typical forecaster on other teams, which truly benefited our results and our data. Another contributing factor to our success came from social interaction among our team forecasters (forecasting in teams is a unique attribute of the Good Judgment Project). Being able to discuss each IFP (Individual Forecasting Problem) allowed the forecasters to bounce ideas and thoughts off each other and to collaborate in competing with other small teams within the Good Judgment Project. A friendly competition is always healthy for the success of any project.

The first season of the IARPA forecasting tournament officially ended on April 13, 2012. Now that we have had a chance to analyze data for all of Season 1, Barb offered some perspectives on the lessons the Good Judgment Project has learned thus far. Despite our overall success, our forecasters tended to over-predict change from the status quo and had difficulty judging their own expertise relative to others. But simple probabilistic-reasoning training modules improved forecasting for the most part, and collaborative teamwork improved forecasting versus forecasters who worked independently.

For those of you keeping track, June 18th was the first day of official forecasting for Season 2. So far, the Good Judgment Team is off to a great start. But, our goals of repeating and improving on last season’s success will not be easily attained. The forecasting questions this season will be harder. Some questions will ask participants to estimate the likelihood of a particular event depending on whether a second event occurs. (To illustrate, a “conditional” question might ask “Will Spain default on its 10-year government bonds if the market yield on those bonds rises above 7.5%?”) There will also be questions where the answer options are divided into different time ranges, rather than simply asking whether an event will occur.

To meet these challenges, Barb and the other members of our research team must keep improving on the techniques we are using to produce collective forecasts. For example, we’ve completely revised our prediction-market platform. The new platform requires more effort on the part of our forecasters, but gives forecasters better incentives to update their forecasts and to make bets that reveal their true beliefs. We are also experimenting with smaller teams and scoring rules that may improve incentives in the survey conditions.

The year ahead will pose a lot of challenges and, we hope, an equal amount of fun. There is a reason that many of our forecasters from Season 1 have recruited family and friends to join them on the Good Judgment Team for Season 2. One of the top forecasters from Season 1 put it this way: “It’s my new hobby!”

Overall, I believe that the success of the Good Judgment Project is attributable to everyone involved – forecasters and researchers alike. So let us continue on the path of success for Season 2, and may it lead us to fruitful results.

Recruitment wisdom of the crowd

We’re looking for…

aggregation Mellers Tetlock wisdom of the crowd

The Power of Aggregation

A key concept underlying the Good Judgment Project is “the power of aggregation,” otherwise known as the “wisdom of the crowd.”

Francis Galton (cousin of Charles Darwin) identified the power of aggregation. At a county fair in England, Galton observed a mixture of “experts” (farmers and butchers) and ordinary people (doctors, shopkeepers, and others) making guesses about the weight of an ox that was to be slaughtered and “dressed” (butchered). “Forecasters” bought a sixpenny ticket to record their guesses, and prizes were awarded to those whose guesses were most accurate.

After the guesses were reviewed and the winners announced, Galton reviewed the 787 legible tickets and found that the median (the guess of the “middle” participant, when all guesses were ordered from highest to lowest estimated weight) was only 9 pounds (0.8%) off the actual 1,198-pound weight of the ox. In a March 1907 article in Nature, he called the median the “vox populi” estimate, and recommended that a similar approach be used for the decisions of juries. (Galton subsequently calculated the simple average or “mean” and discovered that it was within 1 pound of the true weight of the ox; nevertheless, he continued to advocate the median as the best estimator.)
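
The two crowd estimates Galton compared are simple to compute. The sketch below uses a handful of invented guesses (not Galton’s data) purely to show the calculation; only the 1,198-pound true weight comes from the account above.

```python
from statistics import mean, median

TRUE_WEIGHT_LB = 1198  # weight of the ox, as reported above

# Invented guesses for illustration only; Galton had 787 legible tickets.
guesses_lb = [1050, 1120, 1180, 1200, 1215, 1260, 1390]

crowd_median = median(guesses_lb)
crowd_mean = mean(guesses_lb)

# Report each crowd estimate and its signed and percentage error.
for label, estimate in (("median", crowd_median), ("mean", crowd_mean)):
    error = estimate - TRUE_WEIGHT_LB
    pct = 100 * abs(error) / TRUE_WEIGHT_LB
    print(f"{label}: {estimate:.0f} lb, off by {error:+.0f} lb ({pct:.1f}%)")
```

In general, the median is less sensitive than the mean to a few wild guesses, such as the 1,390-pound guess in this invented sample.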

Barbara Mellers, University of Pennsylvania professor and co-leader of the Good Judgment Project, discussed Galton’s work on the power of aggregation in a recent presentation to Penn alumni about the Good Judgment Project. You can view a video of her presentation (and that of Philip Tetlock, another co-leader of the Project) on YouTube to learn more about the power of aggregation.

Their presentations explain how the Good Judgment Project – and the IARPA forecasting tournament of which our Project is one part – hope to harness the wisdom of the crowd to help people make better-informed decisions.