A Little Summer Reading

Summer means vacation time for many Good Judgment Project forecasters (we’re currently on hiatus between forecasting seasons), but our research team is busily working on plans to make this the best and most useful forecasting season yet!

While you await news about Season 4, which will begin in August, we wanted to bring your attention to two recent articles by GJP investigator Michael C. Horowitz, an associate professor of political science at UPenn, that discuss the Good Judgment Project.

In the first, an article for The National Interest, Professor Horowitz and co-author Dafna Rand of CNAS lay out a case for what they call “The Crisis-Prevention Directorate.” Since trouble around the world is often hard to predict, Horowitz and Rand argue that the National Security Council Staff should create a new crisis-prevention directorate that not only draws on trained personnel from throughout the national-security community, but explicitly draws on Good Judgment Project methodologies to help the President anticipate and head off crises before they happen.

The second article, by Professor Horowitz in Politico, looks at recent analogies that President Obama has made between baseball and American foreign policy. Horowitz uses the lens of Moneyball to assess Obama’s statement that United States foreign policy should be focused on hitting singles rather than home runs. He also describes the Good Judgment Project as essentially the “moneyball” of national security decision-making. And just as one of the challenges described in Moneyball was the integration of sabermetrics into mainstream baseball decision-making, the next challenge for the Good Judgment Project and our sponsors is ensuring that the lessons our forecasters have taught us over the last three years make their way into how the US government approaches national-security decision-making.

To what extent will the national-security community incorporate lessons from GJP and other quantitative forecasting initiatives over the next 3-5 years? Suggestion: Record your own prediction today and then check back in five years to see how accurate you were. Keeping score is probably the number one “secret of the superforecasters”!


Jay Ulfelder: Crowds Aren’t Magic

One of my cousins, Steve Ulfelder, writes good mystery novels. He left a salaried writer’s job 13 years ago to freelance and make time to pen those books. In March, he posted this on Facebook:

CONTEST! When I began freelancing, I decided to track the movies I saw to remind myself that this was a nice bennie you can’t have when you’re an employee (I like to see early-afternoon matinees in near-empty theaters). I don’t review them or anything; I simply keep a Word file with dates and titles.

Here’s the contest: How many movies have I watched in the theater since January 1, 2001? Type your answer as a comment. Entries close at 8pm tonight, east coast time. Closest guess gets a WOLVERINE BROS. T-shirt and a signed copy of the Conway Sax novel of your choice. The eminently trustworthy Martha Ruch Ulfelder is holding a slip of paper with the correct answer.

I read that post and thought: Now, that’s my bag. I haven’t seen Steve in a while and didn’t have a clear idea of how many movies he’s seen in the past 13 years, but I do know about Francis Galton and that ox at the county fair. Instead of just hazarding a guess of my own, I would give myself a serious shot at outwitting Steve’s Facebook crowd by averaging their guesses.

After a handful of Steve’s friends had submitted answers, I posted the average of them as a comment of my own, then updated it periodically as more guesses came in. I had to leave the house not long before the contest was set to close, so I couldn’t include the last few entrants in my final answer. Still, I had about 40 guesses in my tally at the time and was feeling pretty good about my chances of winning that t-shirt and book.

In the end, 45 entries got posted before Steve’s 8 PM deadline, and my unweighted average wasn’t even close. The histogram below shows the distribution of the crowd’s guesses and the actual answer. Most people guessed fewer than 300 movies, but a couple of extreme values on the high side pulled the average up to 346.  Meanwhile, the correct answer was 607, nearly one standard deviation (286) above that mean. I hadn’t necessarily expected to win, but I was surprised to see that 12 of the 45 guesses—including the winner at 600—landed closer to the truth than the average did.

I read the results of my impromptu experiment as a reminder that crowds are often smart, but they aren’t magic. Retellings of Galton’s experiment sometimes make it seem like even pools of poorly informed guessers will automatically produce an accurate estimate, but, apparently, that’s not true.

As I thought about how I might have done better, I got to wondering if there was something about Galton’s crowd that made it particularly powerful for his task. Maybe we should expect a bunch of county fair–goers in early twentieth-century England to be good at guessing the weight of farm animals. Still, the replication of Galton’s experiment under various conditions suggests that domain knowledge helps, but it isn’t essential. So maybe this was just an unusually hard problem. Steve has seen an average of nearly one movie in theaters each week for the past 13 years. In my experience, that’s pretty extreme, so even with the hint he dropped in his post about being a frequent moviegoer, it’s easy to see why the crowd would err on the low side. Or maybe this result was just a fluke, and if we could rerun the process with different or larger pools, the average would usually do much better.

Whatever the reason for this particular failure, though, the results of my experiment also got me thinking again about ways we might improve on the unweighted average as a method of gleaning intelligence from crowds. Unweighted averages are a reasonable strategy when we don’t have reliable information about variation in the quality of the individual guesses (see here), but that’s not always the case. For example, if Steve’s wife or kids had posted answers in this contest, it probably would have been wise to give their guesses more weight on the assumption that they knew better than acquaintances or distant relatives like me.
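To make that concrete, here is a minimal Python sketch of the difference between an unweighted and a weighted average. The guesses and weights below are invented for illustration; they are not numbers from the actual contest.

```python
# Minimal sketch: unweighted vs. weighted averaging of crowd guesses.
# All numbers are hypothetical, not data from Steve's contest.

guesses = [120, 180, 240, 300, 580, 610]   # six invented guesses
weights = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0]   # e.g., weight presumed insiders 3x

unweighted = sum(guesses) / len(guesses)
weighted = sum(w * g for w, g in zip(weights, guesses)) / sum(weights)

print(f"Unweighted average: {unweighted:.0f}")   # ~338
print(f"Weighted average:   {weighted:.0f}")     # 441, pulled toward the heavier guesses
```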

Figuring out smarter ways to aggregate forecasts is also an area of active experimentation for the Good Judgment Project (GJP), and the results so far are encouraging. The project’s core strategy involves discovering who the most accurate forecasters are and leaning more heavily on them. I couldn’t do this in Steve’s single-shot contest, but GJP gets to see forecasters’ track records on large numbers of questions and has been using them to great effect. In the recently ended Season 3, GJP’s “super forecasters” were grouped into teams and encouraged to collaborate, and this approach has proved very effective. In a paper published this spring, GJP has also shown that it can do well with nonlinear aggregations derived from a simple statistical model that adjusts for systematic bias in forecasters’ judgments. Team GJP’s bias-correction model beats not only the unweighted average but also a number of widely used and more complex nonlinear algorithms.

Those are just a couple of the possibilities that are already being explored, and I’m sure people will keep coming up with new and occasionally better ones. After all, there’s a lot of money to be made and bragging rights to be claimed in those margins. In the meantime, we can use Steve’s movie-counting contest to remind ourselves that crowds aren’t automatically as clairvoyant as we might hope, so we should keep thinking about ways to do better.


Jay Ulfelder on the Rigor-Relevance Tradeoff

I came to the Good Judgment Project (GJP) two years ago, in Season 2, as a forecaster, excited about contributing to an important research project and curious to learn more about my skill at prediction. I did pretty well at the latter, and GJP did very well at the former. I’m also a political scientist who happened to have more time on my hands than many of my colleagues, because I work as an independent consultant and didn’t have a full plate at that point. So, in Season 3, the project hired me to work as one of its lead question writers.

Going into that role, I had anticipated that one of the main challenges would be negotiating what Phil Tetlock calls the “rigor-relevance trade-off”—finding questions that are relevant to the project’s U.S. government sponsors and can be answered as unambiguously as possible. That forecast was correct, but even armed with that information, I failed to anticipate just how hard it often is to strike this balance.

The rigor-relevance trade-off exists because most of the big questions about global politics concern latent variables. Sometimes we care about specific political events because of their direct consequences, but more often we care about those events because of what they reveal to us about deeper forces shaping the world. For example, we can’t just ask if China will become more cooperative or more belligerent, because cooperation and belligerence are abstractions that we can’t directly observe. Instead, we have to find events or processes that (a) we can see and (b) that are diagnostic of that latent quality. For example, we can tell when China issues another statement reiterating its claim to the Senkaku Islands, but that happens a lot, so it doesn’t give us much new information about China’s posture. If China were to fire on Japanese aircraft or vessels in the vicinity of the islands—or, for that matter, to renounce its claim to them—now that would be interesting.

It’s tempting to forgo some rigor and ask directly about the latent stuff, but doing so is problematic. For the forecast’s consumers, we need to be able to explain clearly what a forecast does and does not cover, so they can use the information appropriately. As forecasters, we need to understand what we’re being asked to anticipate so we can think clearly about the forces and pathways that might or might not produce the relevant outcome. And then there’s the matter of scoring the results. If we can’t agree on what eventually happened, we won’t agree on the accuracy of the predictions. Then the consumers don’t know how reliable those forecasts are, the producers don’t get the feedback they need, and everyone gets frustrated and demotivated.

Formulating rigorous questions is harder than many people realize until they try it themselves, even for events that seem like they should be easy to spot. Take coups. It’s not surprising that the U.S. government might be keen on anticipating coups in various countries for various reasons. It is, however, surprisingly hard to define a “coup” in such a way that virtually everyone would agree on whether or not one had occurred.

In the past few years, Egypt has served up a couple of relevant examples. Was the departure of Hosni Mubarak in 2011 a coup? On that question, two prominent scholarly projects that use similar definitions to track coups and coup attempts couldn’t agree. Where one source saw an “overt attempt by the military or other elites within the state apparatus to unseat the sitting head of state using unconstitutional means,” the other saw the voluntary resignation of a chief executive due to a loss of his authority and a prompt return to civilian-led government. And what about the ouster of Mohammed Morsi in July 2013? On that, those academic sources could readily agree, but many Egyptians who applauded Morsi’s removal—and, notably, the U.S. government—could not.

We see something similar on Russian military intervention in Ukraine. Not long after Russia annexed Crimea, GJP posted a question asking whether or not Russian armed forces would invade the eastern Ukrainian cities of Kharkiv or Donetsk before 1 May 2014. The arrival of Russian forces in Ukrainian cities would obviously be relevant to U.S. policy audiences, and with Ukraine under such close international scrutiny, it seemed like that turn of events would be relatively easy to observe as well.

Unfortunately, that hasn’t been the case. As Mark Galeotti explained in a mid-April blog post,

When the so-called “little green men” deployed in Crimea, they were very obviously Russian forces, simply without their insignia. They wore Russian uniforms, followed Russian tactics and carried the latest, standard Russian weapons.

However, the situation in eastern Ukraine is much less clear. U.S. Secretary of State John Kerry has asserted that it was “clear that Russian special forces and agents have been the catalyst behind the chaos of the last 24 hours.” However, it is hard to find categorical evidence of this.

Even evidence that seemed incontrovertible when it emerged, like video of a self-proclaimed Russian lieutenant colonel in the Ukrainian city of Horlivka, has often been debunked.

This doesn’t mean we were wrong to ask about Russian intervention in eastern Ukraine. If anything, the intensity of the debate over whether or not that’s happened simply confirms how relevant this topic was. Instead, it implies that we chose the wrong markers for it. We correctly anticipated that further Russian intervention was possible if not probable, but we—like many others—failed to anticipate the unconventional forms that intervention would take.

Both of these examples show how hard it can be to formulate rigorous questions for forecasting tournaments, even on topics that are of keen interest to everyone involved and seem like naturals for the task. In an ideal world, we could focus exclusively on relevance and ask directly about all the deeper forces we want to understand and anticipate. As usual, though, that ideal world isn’t the one we inhabit. Instead, we struggle to find events and processes whose outcomes we can discern that will also reveal something telling about those deeper forces at play.

 


Statistical models vs. human judgments: The Nate Silver controversy seen through a GJP lens

Do statistical models always outperform human forecasters?

The storm of controversy over Nate Silver’s new 538 website has transformed what had been a question of largely academic interest into a staple of water-cooler conversation. Critics such as Paul Krugman complain that Silver improperly dismisses the role of human expertise in forecasting, which Krugman views as essential to developing meaningful statistical models.

The Good Judgment Project’s research team has given the human vs. machine debate a lot of thought. GJP’s chief data scientist Lyle Ungar shared his views in a recent interview. According to Ungar,

The bottom line is that if you have lots of data and the world isn’t changing too much, you can use statistical methods. For questions with more uncertainty, human experts become more important.

Ungar sees the geopolitical questions posed in the ACE tournament as well suited to human forecasting:

Some problems, like the geo-political forecasting we are doing, require lots of collection of information and human thought. Prediction markets and team-based forecasts both work well for sifting through the conflicting information about international events. Computer models mostly don’t work as well here – there isn’t a long enough track record of, say, elections or coups in Mali to fit a good statistical model, and it isn’t obvious what other countries are ‘similar.’

GJP does see a role for statistical models in geopolitical forecasting. For example, we use both prediction markets and statistical aggregation techniques to combine the judgments of our forecasters and generate probability estimates that typically are more accurate than simple unweighted averages of human predictions.

Moreover, for many geopolitical forecasting questions, we see promise in a human-machine hybrid approach that combines the best strengths of human judgments and statistical models. (Think “Kasparov plus Deep Blue” in the chess context.) We hope to take initial steps to test this approach in Season 4 of the IARPA ACE forecasting tournament. Stay tuned for further developments!

 


Reflections on Season 3 Thus Far (part 1)

As 2013 draws to a close, so does the first half of Season 3 of the IARPA forecasting tournament. This seems as good a time as any to review some of the highs and lows of this tournament season. We begin our review with a look at IFP #1318, which closed on Christmas Day and—in our view—represents one of the biggest surprises thus far in Season 3.

A surprising visit to Yasukuni

In mid-November, we launched a question asking whether Japan’s Prime Minister Shinzo Abe would visit the controversial Yasukuni Shrine before the year’s end. Just a few days before the question was scheduled to expire, Abe did in fact visit the shrine—something that few forecasters anticipated.

A post-visit comment by one team forecaster reflected the general amazement: “Jaw-drop! I have to wonder if this was telegraphed and we missed it, or it really was a spontaneous decision.”

Forecasters who posted good scores on this question seem to have taken Prime Minister Abe at his word. As the Japan News reported, the Prime Minister had pledged during his most recent election campaign to visit the shrine while in office. Nonetheless, over the past year, Abe had avoided commitment to any specific date for a visit. A few weeks before the question launched, there was a brief flurry of news stories in which an Abe aide indicated his expectation that the visit would occur before the end of the year, though the report was downplayed by another Japanese official. In retrospect, the aide seems to have been providing accurate intelligence: Abe’s visit occurred on the first anniversary of his current administration and before the end of the year.

Those who failed to anticipate Abe’s shrine visit can take comfort from a poll published in a major Japanese newspaper on December 24th, reflecting a question posed to over 1,000 Japanese households on December 21st and 22nd:

Of the respondents, 48 percent appreciated Prime Minister Abe’s decision to refrain from visiting Tokyo’s controversial Yasukuni Shrine where Class A war criminals are enshrined along with the war dead since he took office, as compared with 37 percent who do not appreciate his decision.

The poll itself suggests that there was still great uncertainty about what Abe would do only a few days before his visit.

But, forecasters who steadily decreased their probability estimates as the end-date for this question approached may want to think twice before applying that strategy to future questions with reasonable potential for surprise endings. At least one superforecaster had noted the risk that the Yasukuni Shrine question could be this season’s “Monti”—a reference to a question that caught forecasters off-guard almost exactly a year earlier, when Italy’s then-Prime Minister Mario Monti followed through on his stated intention to resign, rather than stay in office as leader of a minority government, after Berlusconi withdrew his support from the Monti-led coalition. And, of course, there is the now-infamous question “1007” from Season 1, which asked whether a “lethal confrontation” would occur in the South China Sea. That question resolved as “Yes” near the scheduled closing date when a Chinese fishing-boat captain fatally stabbed a South Korean coast-guard official who had boarded the fishing boat.

In these cases, “surprise” outcomes occurred late in a question’s lifespan, when many forecasters had been tapering their predictions toward a 0% likelihood that the event would occur. And, in all of these cases, the outcome turned on the actions of a single individual, with little or no news coverage in the days beforehand that would have allowed those following the question closely to see that the event was about to occur.

If there is a lesson to be learned here, it seems to be: Take a moment to think about the ways that an event could occur. If the event of interest does not require elaborate preparations beforehand and can be accomplished by one or two people, with little fanfare, it may hold more potential for a surprise outcome than our first impression would lead us to assume. This is particularly true when there is evidence early in the lifespan of a question suggesting that the event might occur, followed by no news, as opposed to contradictory news, later in the lifespan of a question.


GJP in the News, Again (and Again)

The Economist’s The World in 2014 issue just hit the newsstands, focusing international attention on questions that Good Judgment Project forecasters consider on a daily basis: What geopolitical outcomes can we expect over the next 12-14 months?

One outcome that our forecasters may not have anticipated, though, is that the Good Judgment Project itself would be featured in this annual compendium of forecasts. An article co-authored by GJP’s Phil Tetlock and journalist Dan Gardner poses the question “Who’s good at forecasts?” and offers insight into “How to sort the best from the rest.” Participants in the ongoing ACE forecasting tournament sponsored by the Intelligence Advanced Research Projects Activity (IARPA) will not be surprised to learn that Tetlock and Gardner believe that such forecasting tournaments are the best way to compare forecasting ability.

Tetlock and Gardner’s brief article does not address a second benefit of such tournaments: participants can improve their forecasting skills through a combination of training and practice, with frequent feedback on their accuracy. Combining training and practice with what GJP’s research suggests is a stable trait of forecasting skill seems to produce the phenomenon that GJP calls “superforecasters.”

GJP’s top forecasters have been so accurate that, according to a recent report by Washington Post columnist David Ignatius, they even outperformed the forecasts of intelligence analysts who have access to classified information.

In a brief video clip on The Economist’s web site, Phil Tetlock notes that “People are not as good at anticipating the future as they think they are.” If you would like to test your own forecasting skills against GJP’s best forecasters, we invite you to register to become a Good Judgment Project forecaster. Because of the strong demand to participate, we expect to open new slots soon as part of the ongoing Season 3 tournament. Who knows? Maybe you have the skills of which superforecasters are made!


Meet GJP’s Political Science Experts

The Good Judgment Project has roots in Phil Tetlock’s study of Expert Political Judgment, which may be best known for its conclusion that the “expert” forecasters he studied were often hard-pressed to do better than the proverbial dart-throwing chimp. This conclusion has sometimes been misinterpreted as a nihilistic rejection of all forms of subject-matter expertise. But, that’s not so. Indeed, our research team for the IARPA tournament relies on a wide range of subject-matter experts, including political scientists who devote much of their time and research effort to improving geopolitical forecasts.

They are no ordinary pundits, though. Consider, for example, Jay Ulfelder, former research director of the Political Instability Task Force (PITF) who now blogs as the Dart-Throwing Chimp. Jay’s forecasting prowess was evident during Season 2 of the Good Judgment Project—he participated as a “regular forecaster” and would have qualified for “superforecaster” status in Season 3 had he not joined our research team. Jay’s performance far exceeded that achievable through random guessing.

Now, Jay faces the far more challenging task of helping to frame the forecasting questions for Season 3. With his assistance, we’ve made progress in developing questions that are both rigorous (with all terms defined specifically enough to allow unambiguous scoring of forecasts as right or wrong) and relevant (in that knowing the correct outcome would meaningfully inform policymakers’ decisions). But, much work remains to be done before we become as skilled at generating questions as the GJP forecasters have become at producing accurate answers to those questions.

Also new to our Season 3 research team is Mike Ward, whose research lab at Duke University “creates conflict predictions using Bayesian modeling and network analysis.” In 2012, Mike and Jay engaged in a spirited debate in Foreign Policy about whether there could be a “Nate Silver” in world politics (that is, a forecaster of geopolitical events with a track record comparable to Silver’s notable successes in forecasting US elections). Mike doesn’t claim to have achieved Silver status yet, but we expect that his sophisticated modeling techniques will add to the GJP’s statistical arsenal.

We’re also fortunate to have Phil Schrodt join our Season 3 research team. Like Jay, both Phil and Mike have been involved in prior government projects to develop accurate statistical forecasts of geopolitical events, in their cases through both the PITF and DARPA’s Integrated Crisis Early Warning System (ICEWS). Phil’s expertise with “event data” (e.g., with GDELT, the Global Database of Events, Language, and Tone) complements GJP’s strength in crowd-sourced forecasts. We’re hoping that the whole will be greater than the sum of the parts. And, readers of this blog should hope that Phil and Jay, both bloggers extraordinaire, will find time to write one or two guest posts for us. Until then, you can sample Phil’s thoughts on political forecasting courtesy of slides for his recent lectures at the University of Konstanz (Germany) and at the European Network for Conflict Research (ENCoRe), University of Essex (United Kingdom).

Jay, Mike and Phil round out a GJP political-scientist roster that already included Penn’s own Mike Horowitz, an expert on international conflict and security issues, and Rick Herrmann, Chair of Ohio State’s Political Science Department and former Director of the Mershon Center for International Security Studies. The breadth and depth of their experience has been invaluable in guiding GJP’s efforts to improve geopolitical forecasting and shed light on important debates in international relations.


FORECAST: Another great season for the Good Judgment Project

Season 3 of the IARPA forecasting tournament is only a little over a month old (official scoring of individual forecasts began on July 31st). But, it already promises to be our most exciting season yet.

Nearly 3,000 forecasters have joined the Good Judgment Team’s roster for Season 3—many of whom are new to the Team this year. Our “newbies” include transfers from other research teams that had participated in Seasons 1 and 2, as well as more than 1,000 forecasters who are participating in the tournament for the first time ever. Welcome to all!

As before, the Good Judgment Project is using the IARPA tournament as a vehicle for social-science research to determine the most effective means of eliciting and aggregating geopolitical forecasts from a widely dispersed forecaster pool. To reach scientifically valid conclusions, we have randomly assigned participants to one of several online forecasting platforms, divided almost evenly between a survey format and two different prediction-market formats. Some survey forecasters participate as individuals; others are randomly assigned to small teams (starting size: 15 forecasters) that share information and strategies and are scored on their collective performance. We are also testing the efficacy of training materials that are tailored to the various elicitation platforms.

Our “superforecasters” deserve a tremendous share of the credit for the Good Judgment Team’s tournament-winning performance in Season 2. Most of our 60 Season 2 superforecasters have returned for Season 3, and we’ve augmented their ranks with a new crop of superforecasters chosen from the top performers in all experimental conditions in Season 2. The 120 superforecasters are organized as eight teams of 15 forecasters each.

Thus far, only four carryover questions from Season 2 have resolved and been officially scored, and that scoring includes only the final month during which these questions were open for forecasting. (The prior months were scored as part of Season 2.) Therefore, these four questions cannot tell us how well our forecasters will perform over the 150+ questions expected to be released between now and May 2014.

With that caveat firmly in mind, we close with what we hope is a precursor of the outstanding performance to be expected over the next several months. One method of aggregating forecasts is to treat each forecaster’s probability forecast as a “vote” (where any probability greater than 50% is a vote that the outcome will occur and any probability less than 50% is a vote that it will not occur), and then predict the outcome based on those votes. The “majority rules” vote of our Season 3 superforecasters achieved a perfect accuracy score for all four questions that have just closed! That’s a pretty impressive achievement when you realize that each day’s prediction for each forecasting question is separately scored. To get a perfect score, the majority of superforecasters must have made the right call on each one of the four questions every day for the month of August. Well done, everyone!
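For readers curious how that voting rule works in practice, here is a small Python sketch of the “majority rules” aggregation described above. The probabilities shown are invented for illustration, not actual GJP forecasts.

```python
# Minimal sketch of "majority rules" aggregation for a binary question.
# A probability above 0.5 counts as a "yes" vote, below 0.5 as a "no" vote;
# exactly 0.5 is treated as abstaining (an assumption made for this sketch).

def majority_vote(probabilities):
    """Return 'yes' if more forecasters assign > 50% than < 50%, else 'no'."""
    yes_votes = sum(1 for p in probabilities if p > 0.5)
    no_votes = sum(1 for p in probabilities if p < 0.5)
    return "yes" if yes_votes > no_votes else "no"

# One day's hypothetical probabilities from a 15-person superforecaster team:
todays_forecasts = [0.82, 0.75, 0.60, 0.90, 0.55, 0.30, 0.65, 0.70,
                    0.85, 0.45, 0.60, 0.80, 0.72, 0.66, 0.58]
print(majority_vote(todays_forecasts))  # -> yes
```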


Season 3 of the Good Judgment Project Starts TODAY!

If you signed up to be a Good Judgment Project forecaster for Season 3, you should have received an e-mail invitation via Qualtrics. Clicking on your personal link will take you through the informed-consent process (necessary administrative detail about expectations, payments, etc.) and hand you off to our background data survey. (If you signed up for the project but haven’t received your invitation yet, please contact admin@goodjudgmentproject.com.)

Next steps will be initial training (another Qualtrics invitation, coming soon), followed by access to your forecasting website for some preseason practice. Current plans call for all forecasters who have completed the preliminaries to be admitted to their assigned forecasting websites by mid-July. Official scoring will begin on August 1st and run through early May 2014.

We expect an exciting forecasting season. All forecasters in the IARPA tournament are now under one “roof” (the Good Judgment Project or GJP), which means that we’ll be able to give more complete feedback about relative performance AND that the competition has become even more intense. It will be interesting to see whether forecasters transferring to GJP from our prior competitors have developed distinctive forecasting styles associated with their prior teams.

Another new feature for Season 3 is question clusters. Most of the Season 3 forecasting questions will fall into one of about 15 thematically grouped clusters. Forecasting on closely related questions should allow forecasters to delve more deeply into global issues of interest to policymakers. And, we hope that having forecasts on several related questions will shed greater light on key policy debates than is possible with any single question.

Watch this blog for further Season 3 news.


Catching up on news about the Good Judgment Project

Season 2 of the IARPA tournament has sped by so rapidly that we’ve been remiss about keeping readers abreast of news about the Good Judgment Project. Here are some highlights of the past several months.

Project co-leader Phil Tetlock is well known as the author of Expert Political Judgment, the popular 2005 book demonstrating that political experts often perform no better than chance when making long-term forecasts. Those who know Tetlock best in this context may be surprised by his assessment of the IARPA tournament in a December 2012 interview published at edge.org. There, he observed:

Is world politics like a poker game? This is what, in a sense, we are exploring in the IARPA forecasting tournament. You can make a good case that history is different and it poses unique challenges. This is an empirical question of whether people can learn to become better at these types of tasks. We now have a significant amount of evidence on this, and the evidence is that people can learn to become better. It’s a slow process. It requires a lot of hard work, but some of our forecasters have really risen to the challenge in a remarkable way and are generating forecasts that are far more accurate than I would have ever supposed possible from past research in this area.

Since that interview, the Good Judgment Team’s collective forecasts in the IARPA tournament have maintained a high standard of accuracy across topics ranging from “Who will be the next Pope?” to “Will Iran and the U.S. commence official nuclear program talks before 1 April 2013?” to “Will 1 Euro buy less than $1.20 US dollars at any point before 1 January 2013?”. Impressively, the Team’s forecasters are not, for the most part, subject-matter experts on these topics, but rather intelligent volunteers who research candidates for the papacy or Middle East politics in their “spare” time.

Our collective forecasts combine the insights of hundreds of forecasters using statistical algorithms that, ideally, help to extract the most accurate signal from the noise of conflicting predictions. Analyses of data from the first tournament season suggest that we can boost prediction accuracy by “transforming” or “extremizing” the group forecast. We described the process in an AAAI paper presented in Fall 2012:

Take the (possibly weighted) average of all the predictions to get a single probability estimate, and then transform this aggregate forecast away from 0.5….

Transformation increases the accuracy of aggregate forecasts in many, but not all cases. The trick is to find the transformation that will most improve accuracy in a particular situation. Preliminary results suggest that little or no transformation should be applied to the predictions of the most expert forecasters and to forecasts on questions with a high degree of inherent unpredictability. Stay tuned for further refinements as we analyze new data in Season 3.
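The passage above does not spell out a single recipe, but one common way to push an aggregate probability away from 0.5 is to exponentiate its odds. The sketch below is an illustrative assumption of that general idea, not the exact transformation or parameters used in our AAAI paper.

```python
# Illustrative sketch of "extremizing" an aggregate forecast.
# Raising the odds to a power a > 1 pushes the probability away from 0.5;
# a = 1 leaves it unchanged. The functional form and example values are
# assumptions for illustration, not parameters taken from the GJP paper.

def extremize(p, a=2.0):
    """Transform probability p away from 0.5 by exponentiating its odds."""
    odds = p / (1.0 - p)
    transformed = odds ** a
    return transformed / (1.0 + transformed)

avg_forecast = 0.65                       # (possibly weighted) average of predictions
print(round(extremize(avg_forecast), 3))  # -> 0.775, further from 0.5 than 0.65
```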

This year, we’ve continued research comparing the accuracy of prediction markets to other methods of eliciting and aggregating forecasts, working with our new partner Lumenogic. Our Season 2 prediction-market forecasters use a Continuous Double Auction (“CDA”) trading platform similar to the now-defunct Intrade platform, but wagering only virtual dollars.

Forecasters have mixed feelings about trading as a method of prediction: some enjoy the challenge of predicting the ups and downs of forecaster sentiment; others prefer to focus on predicting the event in question and would rather not be bothered with understanding the views of their fellow traders. But, it’s hard to argue with the overall performance of the two markets over the first several months of forecasting. The prediction markets have outperformed the simple average of predictions from forecasters in each of our survey-based experimental conditions (individuals working without access to crowd-belief data and individuals working in small teams who have access to data about their teammates’ forecasts). Only our “super-forecasters” (drawn from the top 2% of all participants in Season 1) have proven to be consistently more accurate than the prediction markets.

During Season 3, which begins in June 2013, we’ll be expanding our research team and our forecaster pool in hopes of coming ever closer to the theoretical limits of forecasting accuracy. We invite our current forecasters to continue with the Good Judgment Team for another season and encourage new participants to join us for this exciting challenge.
