Karen Ruth Adams: “Reflections of a (Female Subject-Matter Expert) Superforecaster”

In August, I attended the Good Judgment Project’s 2014 Superforecaster Conference, where the top 2% of last year’s 7,200 forecasters and top forecasters from previous seasons met with the principal investigators.  We discussed the project’s findings to date, changes for Season 4, and plans for the future.

At the conference, I was struck by both the diversity and lack of diversity among the forecasters.  There was a lot of occupational diversity.  I met people who work in finance, IT, materials science, law, and other commercial sectors.  Yet, considering that the aim of the study is to improve national security forecasting by US intelligence agencies, there was a notable lack of subject-matter experts.  I met just a handful of security scholars from academia and think tanks, and only a few policymakers and practitioners from government and non-profits.

I was also surprised to see few women.  It wasn’t that I expected a 50/50 ratio.  I thought women would be 25-33% of forecasters, mirroring the percentage of women among American faculty in political science, security scholars at the International Studies Association, policy analysts and leadership staff at Washington think tanks, and senior US national security and foreign policy officials.  Instead I learned from GJP researcher Pavel Atanasov that at the beginning of Season 3 (Fall 2013), women were just 17% of GJP forecasters.  By the end of the season (Spring 2014), women had dropped out at higher rates (35%) than men (29%).  Among this year’s superforecasters, just 7% are women.

As a woman who has spent decades developing expertise on international relations and human, national, and international security, and as a citizen who would like US security forecasting and policy to improve, this concerns me.  It also concerns GJP’s principal investigators, who have asked forecasters to offer suggestions for improving the mix.  This post is a contribution to that conversation.  I explain why I joined the project and what I’ve done and learned so far.  I also offer some thoughts about what remains to be discovered and improved about gender and expertise among forecasters.

Why I Joined the Project

In March 2011, I received an intriguing email via a listserv of strategic studies scholars.  Bob Jervis, a noted expert on national and international security, was looking for

knowledgeable people to participate in a quite unprecedented study of forecasting sponsored by Intelligence Advanced Research Projects Activity (“IARPA”) and focused on a wide range of political, economic and military trends around the globe.  The goal of this unclassified project is to explore the effectiveness of techniques such as prediction markets, probability elicitation, training, incentives and aggregation that the research literature suggests offer some hope of helping forecasters see at least a little further and more reliably into the future.

Bob was recruiting for the GJP team on behalf of principal investigators Barbara Mellers, Don Moore, and Phil Tetlock.  According to Bob, the “minimum time commitment would be several hours in passing training exercises, grappling with forecasting problems, and updating your forecasting response to new evidence throughout the year.”  The rewards would be “Socratic self-knowledge,” the opportunity to learn and be assessed on “state-of-the-art techniques (training and incentive systems) designed to augment accuracy,” a $150 honorarium, and the opportunity to compete anonymously with the freedom to go public later.  In addition, Bob said he thought it would be fun.

I immediately said yes, for two reasons.  First, I remembered Phil Tetlock from my time as a political science graduate student at UC Berkeley, and I trusted him to run an interesting and high-quality study in which my anonymity would be protected.  That was important to me because I wanted to take the risk of forecasting without worrying about the effects on my scholarly reputation.  After all, in Expert Political Judgment, Phil had shown that experts (highly educated professionals in academia, think tanks, government, and international organizations) weren’t much better at forecasting than “dilettantes” (undergraduates).  Moreover, they had trouble outperforming “dart-throwing chimps” (chance) and generally underperformed computers (extrapolation algorithms).  As an expert in security studies, I saw this as my chance to try to prove him wrong.

Second, as a recently-tenured political science professor, I had begun to expand my research program, focusing less on individual publications and more on my career contribution.  For several years, I had been developing a new framework for studying, teaching, and improving human, national, and international security.  One element of the project is helping students evaluate and explain the historical and current security levels of various actors (individuals, social groups, and states) and predict future security levels.  Thus the opportunity to find out more about forecasting and try my hand as a participant-observer was too good to ignore.

What I’ve Done So Far

Since Fall 2011, I’ve participated on the GJP team in all three years of IARPA’s ACE tournament.  In Season 1, I was in an individual prediction polling group, making individual forecasts on a survey platform with no interaction with other participants.  In Season 2, I was on an interactive individual survey platform, where participants could explain their forecasts and see their own and others’ accuracy ratings (Brier scores).  In Season 3, I was on the Inkling platform, one of two large prediction markets, where participants were given 50,000 “inkles” to buy and sell “stocks” in answers to questions, with probabilities expressed as prices.  We could also make comments and see one another’s scores (earnings).

Over time, my accuracy has improved.  I moved from the top 18% in Season 1 (Brier score of .42) to the top 8% in Season 2 (Brier score of .34), and top 1% in Season 3 (no Brier score because I was in the prediction market, where I more than tripled my “money,” finishing 6th of 693 forecasters).  My best categories have consistently been those in which I have the most expertise:  international relations, military conflict, and diplomacy.
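
For readers curious how a Brier score is computed: it is the sum of the squared differences between the probabilities assigned to a question’s possible answers and what actually happened, so it runs from 0 for a perfect forecast to 2 for a maximally wrong one, and it is then averaged across the days a question is open and across questions.  The sketch below is a generic illustration with made-up numbers, not GJP’s scoring code.

```python
def brier_score(forecast, outcome):
    """Brier score for a single question on a single day.

    forecast: dict mapping each answer option to a probability (summing to 1).
    outcome:  the option that actually occurred.
    Scores run from 0 (perfect) to 2 (maximally wrong).
    """
    return sum((p - (1.0 if option == outcome else 0.0)) ** 2
               for option, p in forecast.items())

# A 70% "yes" forecast on a question that resolves "yes"...
print(brier_score({"yes": 0.7, "no": 0.3}, "yes"))  # 0.18
# ...and the same forecast if the question had resolved "no".
print(brier_score({"yes": 0.7, "no": 0.3}, "no"))   # 0.98
```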

How GJP Has Enhanced My Skills and Confidence

Participating in the study has done what I hoped it would.  It has improved my forecasting skill.  By compelling me to express forecasts in stark probabilistic terms and by using clear and generally fair rules to score them, GJP has given me a laboratory in which to learn how to balance the forecasting skills of “calibration” (understanding base rates) and “discrimination” (identifying exceptions).
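
Those two skills correspond roughly to the calibration and resolution terms in the standard (Murphy) decomposition of the Brier score.  The sketch below shows one way to compute both for a batch of yes/no forecasts; the ten-bin grouping and the made-up data are illustrative choices, not anything GJP prescribes.

```python
import numpy as np

def calibration_and_discrimination(probs, outcomes, n_bins=10):
    """Split accuracy on a set of yes/no forecasts into two pieces:
    calibration   - do events forecast at p% happen roughly p% of the time?
                    (lower is better)
    discrimination (resolution) - do the forecasts separate events that
                    happen from events that don't?  (higher is better)
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)

    calibration = 0.0
    discrimination = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        weight = mask.mean()                  # share of forecasts in this bin
        avg_forecast = probs[mask].mean()
        observed_freq = outcomes[mask].mean()
        calibration += weight * (avg_forecast - observed_freq) ** 2
        discrimination += weight * (observed_freq - base_rate) ** 2
    return calibration, discrimination

# Made-up forecasts (probability of "yes") and outcomes (1 = it happened).
cal, disc = calibration_and_discrimination(
    [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1],
    [1,   1,   1,   0,   1,   0,   0,   0])
print(round(cal, 3), round(disc, 3))
```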

Now, when a colleague or reporter asks what I think will happen in an international conflict or international organization, I think more clearly about the theories and facts I’m using to arrive at my answer, the probabilities I assign to various outcomes, and the confidence I wish to express.  In my security class, I model this process for my students and ask them to make their own forecasts.

Participating in the GJP has also increased my confidence.  Like most experts, I used to be reluctant to make point predictions that could run afoul of complexity or chance and be taken out of context.  Moreover, like many professional women, I suffered from both “imposter syndrome” — the feeling that I don’t know enough and will be found out — and the knowledge that women’s qualifications and contributions are systematically discounted.  Thus it never helped to be told that I should bluff, like men.

Thanks to the GJP, I’ve learned I don’t have to pretend to be something I’m not.  I have a good sense of where my expertise lies, where it makes the most difference in improving group accuracy, and when and on what terms I wish to go public with predictions.  I also know it’s not a weakness but a strength to approach forecasting with humility.

As Bob Jervis predicted, participating in the GJP has also been fun.  I don’t worry about being right all of the time.  In Season 3, I was one of the most frequent commenters, revealing my forecasts and logic, and asking for feedback.  When I’m wrong, I have other forecasters to laugh with and learn from.

In August, before the superforecaster conference, I revealed my identity.  That was a surprise to some of my fellow forecasters, who had assumed Fork Aster was a man.

What Remains to be Discovered about Gender and Expertise

Before the superforecaster conference, I wondered if being out as a female subject-matter expert would change the dynamics of my participation.  Would I speak up less often for fear of being dubbed a pointy-headed intellectual?  Would GJP turn into a forum in which women’s comments were discounted or ignored, or in which successful women were deemed unlikable?

My concerns about gender were allayed at the conference.  Although women are just 7% of superforecasters, they were not segregated by choice or default.  Women sat and stood and worked in groups with men.  To me, this shows the value of initial anonymity.  Superforecasters were known to be good forecasters, whether or not they were known to be women.  The women who were there had made the cut based on merit, so they were accepted and confident.

But they weren’t overconfident.  After all, the GJP’s major findings are that forecasters perform best when they understand probability, are open-minded, and are scored for accuracy.  Together, this means superforecasters of all stripes know that perfection is unattainable, there’s a good chance they’re wrong and should listen to other views, and the best way to improve is to put themselves out there to be scored.

My concerns about expertise were allayed in the first month of Season 4.  In the superforecaster market, I’ve been speaking up about as much as I did last year.  Moreover, I’ve found that instead of saying less for fear of being wrong, I’ve been tempted to say more than I can confidently support simply to burnish my credentials.  Thus this year is shaping up to be a test of whether I can remain open-minded despite having my reputation at stake.  With the recent brouhaha about faux experts and experts-for-hire, this will be very interesting indeed.

Like all forecasters, I care a great deal about how “right” and “wrong” are scored.  As someone who is trying to build her confidence, I also care about ratings and rankings.  But I’m not motivated by fake money, and I doubt most subject-matter experts are either.  So this year, I’m looking forward to a new metric, “market contribution,” which will summarize each market forecaster’s contribution to the market’s accuracy (Brier score).  To understand what motivates forecasters of various types, I hope GJP will track via surveys and team and market behavior the extent to which individuals seem to be motivated by problem-solving, competition, social interaction, accuracy, and other goals.

Why and How Female and Expert Participation Should Be Improved

In one field after another, studies have found that groups with more diverse participants and in particular more women make more accurate decisions.  That is reason enough for GJP to redouble its efforts to recruit and retain women.  It also speaks to the importance of preserving occupational diversity.  Yet GJP should also make an effort to recruit more subject-matter experts.  Otherwise, it will be hard to evaluate whether Tetlock’s earlier findings about the overall unreliability of expert political judgment are valid.

Although I’m not an expert on the effects of gender and expertise on participation in scientific studies, I have some thoughts about how GJP could recruit and retain more women and subject-matter experts.

First, it’s important to think about how the recruiting pitch sounds.  The one I got was perfect for me.  It appealed to my expertise and love of learning, my desire to improve US security policy, and my sense of fun.  It also seemed reasonable.  A few hours, a few updates…  no big commitment.  In fact, the requirement has been higher.  In the first three years, it took me about 5 hours per week on top of my regular current events reading to research, answer, and discuss the required 25 questions.  Since women spend more time than men on the “second shift” of family and household work, the time requirement probably depresses female participation and retention rates.  If security scholars and practitioners have heavier work obligations than individuals in the private sector, high time commitments could depress their participation as well.  Since the whole point of expertise is to be good at something in particular instead of everything in general, perhaps GJP should set different participation expectations for subject-matter experts.  It would be more fair to all forecasters, though, either to reduce the time requirements overall or to provide more financial or reputational compensation.

Second, where does the recruiting pitch go?  According to Project Director Terry Murray, the GJP has not made a systematic effort to recruit a diverse pool of forecasters.  Instead, the project has relied on word of mouth by the principal investigators and advisors (most of whom are men) and serendipitous media coverage.  To include more women across the professions, GJP should reach out to interest groups such as the World Affairs Council and skill-building networks such as Lean In.  (For a big bang, GJP could collaborate with Lean In to produce a video about how to improve forecasting skills).  To reach more subject-matter experts, GJP should recruit through professional organizations such as Women in International Security (which has both male and female members), and the international security studies divisions of the American and international political science associations.

Third, what are GJP training materials, questions, and discussions like?  Since women dropped out of Season 3 at higher rates than men, there may be something about the experience itself that’s a turn-off.  As a middle-aged, female security expert, I was used to a lot of the bravado I saw on the GJP boards, and I was willing to live with it because I wanted to learn something.  I also thought my anonymous participation might improve things.  Other women may drop out because they find the exchanges unpleasant or irrelevant, or because they lack the confidence to weigh in.

Still other women may be turned off by some of the recommended reading.  I devoured Kathryn Schulz’s Being Wrong.  But it was a shock to open Daniel Kahneman’s Thinking, Fast and Slow and discover that the first chapter features the cognitive challenge posed by a photograph of an angry woman.  Later, it emerges that one of the most debated problems in cognitive science relates to experiments in which people erroneously assume that a woman who lives in Berkeley is more likely to be a feminist and a bank teller than simply a bank teller.  To increase female participation, researchers and forecasters need to think carefully about their language and examples so they don’t evoke what Kahneman refers to as “System 1 errors” in a whole subset of participants.  Although researchers don’t intend for their work to have these effects, if they’re not attentive, it can.

A Continuing Conversation

At the superforecaster conference, many of the male participants asked me how I thought female participation could be improved.  When they found out I’m a political science Ph.D. and professor, they asked me the same thing about including more security scholars.  They were not just being polite.  Over the past three years, as we’ve contributed to GJP reportedly outperforming intelligence analysts, we’ve all learned the value of open-minded thinking, and we know it’s more likely in groups with diverse participants whose individual contributions are heard and valued.

For now, my recommendations for GJP are to review the participation requirements, reach out to organizations and networks populated by women and subject-matter experts, and survey current and past participants about their impressions of the work load and content and their reasons for staying in or leaving the project.

My recommendation to women and subject-matter experts is to give forecasting a shot.  Decide what you want to get out of the project and what kind of participant you want to be.  Then do your best.  See what you learn and what others learn from you.  Forecast anonymously at first, then come out if you like.  I predict a lot of wonky fun – intellectual puzzles, interesting exchanges of ideas, head-to-head competition, some memorable “Aha!” moments, and the pride of knowing that you have contributed in some small way to improving security forecasting.

And you?  What do you suggest?  Let’s continue the conversation.


Karen Ruth Adams (aka Fork Aster) is an associate professor of political science at the University of Montana.

Posted in IARPA ACE Tournament, Recruitment, superforecasters | Comments Off

A Little Summer Reading

Summer means vacation time for many Good Judgment Project forecasters (we’re currently on hiatus between forecasting seasons), but our research team is busily working on plans to make this the best and most useful forecasting season yet!

While you await news about Season 4, which will begin in August, we wanted to bring your attention to two recent articles by GJP investigator Michael C. Horowitz, an associate professor of political science at UPenn, that discuss the Good Judgment Project.

In the first, an article for The National Interest, Professor Horowitz and co-author Dafna Rand of CNAS lay out a case for what they call “The Crisis-Prevention Directorate.” Since trouble around the world is often hard to predict, Horowitz and Rand argue that the National Security Council Staff should create a new crisis-prevention directorate that not only draws on trained personnel from throughout the national-security community, but explicitly draws on Good Judgment Project methodologies to help the President anticipate and head off crises before they happen.

The second article, by Professor Horowitz in Politico, looks at recent analogies that President Obama has made between baseball and American foreign policy. Horowitz uses the lens of Moneyball to assess Obama’s statement that United States foreign policy should be focused on hitting singles rather than home runs. He also describes the Good Judgment Project as essentially the “moneyball” of national security decision-making. And just as one of the challenges described in Moneyball was the integration of sabermetrics into mainstream baseball decision-making, the next challenge for the Good Judgment Project and our sponsors is ensuring the lessons that our forecasters have taught us over the last three years make their way into how the US government thinks about national security decision-making moving forward.

To what extent will the national-security community incorporate learnings from GJP and other quantitative forecasting initiatives over the next 3-5 years? Suggestion: Record your own prediction today and then check five years hence to see how accurate you were. Keeping score is probably the number one “secret of the superforecasters”!

Posted in GJP in the Media, GJP research team | Comments Off

Jay Ulfelder: Crowds Aren’t Magic

One of my cousins, Steve Ulfelder, writes good mystery novels. He left a salaried writer’s job 13 years ago to freelance and make time to pen those books. In March, he posted this on Facebook:

CONTEST! When I began freelancing, I decided to track the movies I saw to remind myself that this was a nice bennie you can’t have when you’re an employee (I like to see early-afternoon matinees in near-empty theaters). I don’t review them or anything; I simply keep a Word file with dates and titles.

Here’s the contest: How many movies have I watched in the theater since January 1, 2001? Type your answer as a comment. Entries close at 8pm tonight, east coast time. Closest guess gets a WOLVERINE BROS. T-shirt and a signed copy of the Conway Sax novel of your choice. The eminently trustworthy Martha Ruch Ulfelder is holding a slip of paper with the correct answer.

I read that post and thought: Now, that’s my bag. I haven’t seen Steve in a while and didn’t have a clear idea of how many movies he’s seen in the past 13 years, but I do know about Francis Galton and that ox at the county fair. Instead of just hazarding a guess of my own, I would give myself a serious shot at outwitting Steve’s Facebook crowd by averaging their guesses.

After a handful of Steve’s friends had submitted answers, I posted the average of them as a comment of my own, then updated it periodically as more guesses came in. I had to leave the house not long before the contest was set to close, so I couldn’t include the last few entrants in my final answer. Still, I had about 40 guesses in my tally at the time and was feeling pretty good about my chances of winning that t-shirt and book.

In the end, 45 entries got posted before Steve’s 8 PM deadline, and my unweighted average wasn’t even close. The histogram below shows the distribution of the crowd’s guesses and the actual answer. Most people guessed fewer than 300 movies, but a couple of extreme values on the high side pulled the average up to 346.  Meanwhile, the correct answer was 607, nearly one standard deviation (286) above that mean. I hadn’t necessarily expected to win, but I was surprised to see that 12 of the 45 guesses—including the winner at 600—landed closer to the truth than the average did.
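
For anyone who wants to replay the arithmetic, the whole exercise fits in a few lines of code.  The guesses below are invented stand-ins (the real 45 entries mostly ran low, with a couple of very high outliers pulling the mean up), so only the structure, not the numbers, matches the actual contest.

```python
import statistics

TRUE_ANSWER = 607

# Stand-ins for the 45 Facebook guesses (most low, a couple of big outliers).
guesses = [150, 200, 220, 250, 260, 275, 300, 310, 350, 400, 450, 600, 900, 1200]

crowd_mean = statistics.mean(guesses)
crowd_sd = statistics.stdev(guesses)
beat_the_mean = sum(abs(g - TRUE_ANSWER) < abs(crowd_mean - TRUE_ANSWER)
                    for g in guesses)

print(f"crowd mean = {crowd_mean:.0f} (true answer: {TRUE_ANSWER})")
print(f"standard deviation = {crowd_sd:.0f}")
print(f"{beat_the_mean} of {len(guesses)} individual guesses beat the crowd mean")
```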

I read the results of my impromptu experiment as a reminder that crowds are often smart, but they aren’t magic. Retellings of Galton’s experiment sometimes make it seem like even pools of poorly informed guessers will automatically produce an accurate estimate, but, apparently, that’s not true.

As I thought about how I might have done better, I got to wondering if there was something about Galton’s crowd that made it particularly powerful for his task. Maybe we should expect a bunch of county fair–goers in nineteenth century England to be good at guessing the weight of farm animals. Still, the replication of Galton’s experiment under various conditions suggests that domain knowledge helps, but it isn’t essential. So maybe this was just an unusually hard problem. Steve has seen an average of nearly one movie in theaters each week for the past 13 years. In my experience, that’s pretty extreme, so even with the hint he dropped in his post about being a frequent moviegoer, it’s easy to see why the crowd would err on the low side. Or maybe this result was just a fluke, and if we could rerun the process with different or larger pools, the average would usually do much better.

Whatever the reason for this particular failure, though, the results of my experiment also got me thinking again about ways we might improve on the unweighted average as a method of gleaning intelligence from crowds. Unweighted averages are a reasonable strategy when we don’t have reliable information about variation in the quality of the individual guesses (see here), but that’s not always the case. For example, if Steve’s wife or kids had posted answers in this contest, it probably would have been wise to give their guesses more weight on the assumption that they knew better than acquaintances or distant relatives like me.

Figuring out smarter ways to aggregate forecasts is also an area of active experimentation for the Good Judgment Project (GJP), and the results so far are encouraging. The project’s core strategy involves discovering who the most accurate forecasters are and leaning more heavily on them. I couldn’t do this in Steve’s single-shot contest, but GJP gets to see forecasters’ track records on large numbers of questions and has been using them to great effect. In the recently ended Season 3, GJP’s “super forecasters” were grouped into teams and encouraged to collaborate, and this approach has proved very effective. In a paper published this spring, GJP has also shown that they can do well with nonlinear aggregations derived from a simple statistical model that adjusts for systematic bias in forecasters’ judgments. Team GJP’s bias-correction model beats not only the unweighted average but also a number of widely used and more complex nonlinear algorithms.
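
To make those ingredients concrete, here is a toy sketch of a performance-weighted, “extremized” aggregate, one common member of the family of nonlinear, bias-correcting aggregations described above. The weights, the transformation, and the exponent are illustrative assumptions rather than the model from the paper.

```python
def weighted_extremized_aggregate(probs, weights, a=2.0):
    """Combine probability forecasts for a yes/no question.

    1. Weighted average: lean more heavily on forecasters with better
       track records (the weights here are invented for illustration).
    2. "Extremize": push the average away from 0.5 to offset the
       well-known tendency of averaged forecasts to be too timid.
       The exponent `a` is an illustrative choice, not GJP's parameter.
    """
    avg = sum(p * w for p, w in zip(probs, weights)) / sum(weights)
    return avg ** a / (avg ** a + (1 - avg) ** a)

# Three forecasters; the third has the best track record and gets most weight.
print(round(weighted_extremized_aggregate([0.60, 0.70, 0.85], [1, 1, 3]), 3))  # 0.918
```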

Those are just a couple of the possibilities that are already being explored, and I’m sure people will keep coming up with new and occasionally better ones. After all, there’s a lot of money to be made and bragging rights to be claimed in those margins. In the meantime, we can use Steve’s movie-counting contest to remind ourselves that crowds aren’t automatically as clairvoyant as we might hope, so we should keep thinking about ways to do better.

Posted in aggregation, forecasting, Galton, wisdom of the crowd | Comments Off

Jay Ulfelder on the Rigor-Relevance Tradeoff

I came to the Good Judgment Project (GJP) two years ago, in Season 2, as a forecaster, excited about contributing to an important research project and curious to learn more about my skill at prediction. I did pretty well at the latter, and GJP did very well at the former. I’m also a political scientist who happened to have more time on my hands than many of my colleagues, because I work as an independent consultant and didn’t have a full plate at that point. So, in Season 3, the project hired me to work as one of its lead question writers.

Going into that role, I had anticipated that one of the main challenges would be negotiating what Phil Tetlock calls the “rigor-relevance trade-off”—finding questions that are relevant to the project’s U.S. government sponsors and can be answered as unambiguously as possible. That forecast was correct, but even armed with that information, I failed to anticipate just how hard it often is to strike this balance.

The rigor-relevance trade-off exists because most of the big questions about global politics concern latent variables. Sometimes we care about specific political events because of their direct consequences, but more often we care about those events because of what they reveal to us about deeper forces shaping the world. For example, we can’t just ask if China will become more cooperative or more belligerent, because cooperation and belligerence are abstractions that we can’t directly observe. Instead, we have to find events or processes that (a) we can see and (b) that are diagnostic of that latent quality. For example, we can tell when China issues another statement reiterating its claim to the Senkaku Islands, but that happens a lot, so it doesn’t give us much new information about China’s posture. If China were to fire on Japanese aircraft or vessels in the vicinity of the islands—or, for that matter, to renounce its claim to them—now that would be interesting.

It’s tempting to forgo some rigor to ask directly about the latent stuff, but it’s also problematic. For the forecast’s consumers, we need to be able to explain clearly what a forecast does and does not cover, so they can use the information appropriately. As forecasters, we need to understand what we’re being asked to anticipate so we can think clearly about the forces and pathways that might or might not produce the relevant outcome. And then there’s the matter of scoring the results. If we can’t agree on what eventually happened, we won’t agree on the accuracy of the predictions. Then the consumers don’t know how reliable those forecasts are, the producers don’t get the feedback they need, and everyone gets frustrated and demotivated.

It’s harder to formulate rigorous questions than many people realize until they try to do it, even on things that seem like they should be easy to spot. Take coups. It’s not surprising that the U.S. government might be keen on anticipating coups in various countries for various reasons. It is, however, surprisingly hard to define a “coup” in such a way that virtually everyone would agree on whether or not one had occurred.

In the past few years, Egypt has served up a couple of relevant examples. Was the departure of Hosni Mubarak in 2011 a coup? On that question, two prominent scholarly projects that use similar definitions to track coups and coup attempts couldn’t agree. Where one source saw an “overt attempt by the military or other elites within the state apparatus to unseat the sitting head of state using unconstitutional means,” the other saw the voluntary resignation of a chief executive due to a loss of his authority and a prompt return to civilian-led government. And what about the ouster of Mohammed Morsi in July 2013? On that, those academic sources could readily agree, but many Egyptians who applauded Morsi’s removal—and, notably, the U.S. government—could not.

We see something similar on Russian military intervention in Ukraine. Not long after Russia annexed Crimea, GJP posted a question asking whether or not Russian armed forces would invade the eastern Ukrainian cities of Kharkiv or Donetsk before 1 May 2014. The arrival of Russian forces in Ukrainian cities would obviously be relevant to U.S. policy audiences, and with Ukraine under such close international scrutiny, it seemed like that turn of events would be relatively easy to observe as well.

Unfortunately, that hasn’t been the case. As Mark Galeotti explained in a mid-April blog post,

When the so-called “little green men” deployed in Crimea, they were very obviously Russian forces, simply without their insignia. They wore Russian uniforms, followed Russian tactics and carried the latest, standard Russian weapons.

However, the situation in eastern Ukraine is much less clear. U.S. Secretary of State John Kerry has asserted that it was “clear that Russian special forces and agents have been the catalyst behind the chaos of the last 24 hours.” However, it is hard to find categorical evidence of this.

Even evidence that seemed incontrovertible when it emerged, like video of a self-proclaimed Russian lieutenant colonel in the Ukrainian city of Horlivka, has often been debunked.

This doesn’t mean we were wrong to ask about Russian intervention in eastern Ukraine. If anything, the intensity of the debate over whether or not that’s happened simply confirms how relevant this topic was. Instead, it implies that we chose the wrong markers for it. We correctly anticipated that further Russian intervention was possible if not probable, but we—like many others—failed to anticipate the unconventional forms that intervention would take.

Both of these examples show how hard it can be to formulate rigorous questions for forecasting tournaments, even on topics that are of keen interest to everyone involved and seem like naturals for the task. In an ideal world, we could focus exclusively on relevance and ask directly about all the deeper forces we want to understand and anticipate. As usual, though, that ideal world isn’t the one we inhabit. Instead, we struggle to find events and processes whose outcomes we can discern that will also reveal something telling about those deeper forces at play.

 

Posted in forecasting questions | Comments Off

Statistical models vs. human judgments: The Nate Silver controversy seen through a GJP lens

Do statistical models always outperform human forecasters?

The storm of controversy over Nate Silver’s new 538 website has transformed what had been a question of largely academic interest into a staple of water-cooler conversation. Critics such as Paul Krugman complain that Silver improperly dismisses the role of human expertise in forecasting, which Krugman views as essential to developing meaningful statistical models.

The Good Judgment Project’s research team has given the human vs. machine debate a lot of thought. GJP’s chief data scientist Lyle Ungar shared his views in a recent interview. According to Ungar,

The bottom line is that if you have lots of data and the world isn’t changing too much, you can use statistical methods. For questions with more uncertainty, human experts become more important.

Ungar sees the geopolitical questions posed in the ACE tournament as well suited to human forecasting:

Some problems, like the geo-political forecasting we are doing, require lots of collection of information and human thought. Prediction markets and team-based forecasts both work well for sifting through the conflicting information about international events. Computer models mostly don’t work as well here – there isn’t a long enough track record of, say, elections or coups in Mali to fit a good statistical model, and it isn’t obvious what other countries are ‘similar.’

GJP does see a role for statistical models in geopolitical forecasting. For example, we use both prediction markets and statistical aggregation techniques to combine the judgments of our forecasters and generate probability estimates that typically are more accurate than simple unweighted averages of human predictions.

Moreover, for many geopolitical forecasting questions, we see promise in a human-machine hybrid approach that combines the best strengths of human judgments and statistical models. (Think “Kasparov plus Deep Blue” in the chess context.) We hope to take initial steps to test this approach in Season 4 of the IARPA ACE forecasting tournament. Stay tuned for further developments!

 

Posted in aggregation, expert judgment, forecasting | Comments Off

Reflections on Season 3 Thus Far (part 1)

As 2013 draws to a close, so does the first half of Season 3 of the IARPA forecasting tournament. This seems as good a time as any to review some of the highs and lows of this tournament season. We begin our review with a look at IFP #1318, which closed on Christmas Day and—in our view—represents one of the biggest surprises thus far in Season 3.

A surprising visit to Yasukuni

In mid-November, we launched a question asking whether Japan’s Prime Minister, Shinzo Abe, would visit the controversial Yasukuni Shrine before the year’s end. Just a few days before the question was scheduled to expire, Abe did in fact visit the shrine—something that few forecasters anticipated.

A post-visit comment by one team forecaster reflected the general amazement: “Jaw-drop! I have to wonder if this was telegraphed and we missed it, or it really was a spontaneous decision.”

Forecasters who posted good scores on this question seem to have taken Prime Minister Abe at his word. As the Japan News reported, the Prime Minister had pledged during his most recent election campaign to visit the shrine while in office. Nonetheless, over the past year, Abe had avoided commitment to any specific date for a visit. A few weeks before the question launched, there was a brief flurry of news stories in which an Abe aide indicated his expectation that the visit would occur before the end of the year, though the report was downplayed by another Japanese official. In retrospect, the aide seems to have been providing accurate intelligence: Abe’s visit occurred on the first anniversary of his current administration and before the end of the year.

Those who failed to anticipate Abe’s shrine visit can take comfort from a poll published in a major Japanese newspaper on December 24th, reflecting a question posed to over 1,000 Japanese households on December 21st and 22nd:

Of the respondents, 48 percent appreciated Prime Minister Abe’s decision to refrain from visiting Tokyo’s controversial Yasukuni Shrine where Class A war criminals are enshrined along with the war dead since he took office, as compared with 37 percent who do not appreciate his decision.

The poll itself reflects that there was great uncertainty about what Abe would do only a few days before his visit.

But, forecasters who steadily decreased their probability estimates as the end-date for this question approached may want to think twice before applying this strategy to future questions that have reasonable potential for surprise endings. At least one superforecaster had noted the risk that the Yasukuni Shrine question could be this season’s “Monti”—referencing a question that caught forecasters off-guard almost exactly a year before, when Italy’s then-Prime Minister Mario Monti followed through on his stated intention to resign, rather than stay in office as leader of a minority government, after Berlusconi withdrew his support from the Monti-led coalition. And, of course, there is the now-infamous question “1007” from Season 1, which asked whether a “lethal confrontation” would occur in the South China Sea. That question resolved as “Yes” near the scheduled closing date when a Chinese fishing boat captain fatally stabbed a South Korean coast-guard official who had boarded the fishing boat.

In these cases, “surprise” outcomes occurred late in a question’s lifespan, when many forecasters had been tapering their predictions toward a 0% likelihood that the event would occur. And, in all of these cases, the outcome reflected the actions of a single individual who was able to take the action that resulted in a “Yes” outcome with little or no news coverage in the days leading up to the action that would have allowed those following the question closely to know that the event was about to occur.

If there is a lesson to be learned here, it seems to be: Take a moment to think about the ways that an event could occur. If the event of interest does not require elaborate preparations beforehand and can be accomplished by one or two people, with little fanfare, it may hold more potential for a surprise outcome than our first impression would lead us to assume. This is particularly true when there is evidence early in the lifespan of a question suggesting that the event might occur, followed by no news, as opposed to contradictory news, later in the lifespan of a question.

Posted in forecasting, IARPA ACE Tournament | Comments Off

GJP in the News, Again (and Again)

The Economist’s The World in 2014 issue just hit the newsstands, focusing international attention on questions that Good Judgment Project forecasters consider on a daily basis: What geopolitical outcomes can we expect over the next 12-14 months?

One outcome that our forecasters may not have anticipated, though, is that the Good Judgment Project itself would be featured in this annual compendium of forecasts. An article co-authored by GJP’s Phil Tetlock and journalist Dan Gardner poses the question “Who’s good at forecasts?” and offers insight into “How to sort the best from the rest.” Participants in the ongoing ACE forecasting tournament sponsored by the Intelligence Advanced Research Projects Activity (IARPA) will not be surprised to learn that Tetlock and Gardner believe that such forecasting tournaments are the best way to compare forecasting ability.

Tetlock and Gardner’s brief article does not address a second benefit of such tournaments: participants can improve their forecasting skills through a combination of training and practice, with frequent feedback on their accuracy. Combining training and practice with what GJP’s research suggests is a stable trait of forecasting skill seems to produce the phenomenon that GJP calls “superforecasters.”

GJP’s top forecasters have been so accurate that, according to a recent report by Washington Post columnist David Ignatius, they even outperformed the forecasts of intelligence analysts who have access to classified information.

In a brief video clip on The Economist’s web site, Phil Tetlock notes that “People are not as good at anticipating the future as they think they are.” If you would like to test your own forecasting skills against GJP’s best forecasters, we invite you to register to become a Good Judgment Project forecaster. Because of the strong demand to participate, we expect to open new slots soon as part of the ongoing Season 3 tournament. Who knows? Maybe you have the skills of which superforecasters are made!

Posted in forecasting, GJP in the Media, IARPA ACE Tournament, Tetlock | Comments Off

Meet GJP’s Political Science Experts

The Good Judgment Project has roots in Phil Tetlock’s study of Expert Political Judgment, which may be best known for its conclusion that the “expert” forecasters he studied were often hard-pressed to do better than the proverbial dart-throwing chimp. This conclusion has sometimes been misinterpreted as a nihilistic rejection of all forms of subject matter expertise. But, that’s not so. Indeed, our research team for the IARPA tournament relies on a wide range of subject matter experts, including political scientists who devote much of their time and research effort to improving geopolitical forecasts.

They are no ordinary pundits, though. Consider, for example, Jay Ulfelder, former research director of the Political Instability Task Force (PITF) who now blogs as the Dart-Throwing Chimp. Jay’s forecasting prowess was evident during Season 2 of the Good Judgment Project—he participated as a “regular forecaster” and would have qualified for “superforecaster” status in Season 3 had he not joined our research team. Jay’s performance far exceeded that achievable through random guessing.

Now, Jay faces the far more challenging task of helping to frame the forecasting questions for Season 3. With his assistance, we’ve made progress in developing questions that are both rigorous (with all terms defined specifically enough to allow unambiguous scoring of forecasts as right or wrong) and relevant (in that knowing the correct outcome would meaningfully inform policymakers’ decisions). But, much work remains to be done before we become as skilled at generating questions as the GJP forecasters have become at producing accurate answers to those questions.

Also new to our Season 3 research team is Mike Ward, whose research lab at Duke University “creates conflict predictions using Bayesian modeling and network analysis.” In 2012, Mike and Jay engaged in a spirited debate in Foreign Policy about whether there could be a “Nate Silver” in world politics (that is, a forecaster of geopolitical events with a track record comparable to Silver’s notable successes in forecasting US elections). Mike doesn’t claim to have achieved Silver status yet, but we expect that his sophisticated modeling techniques will add to the GJP’s statistical arsenal.

We’re also fortunate to have Phil Schrodt join our Season 3 research team. Like Jay, both Phil and Mike have been involved in prior government projects to develop accurate statistical forecasts of geopolitical events, in their cases through both the PITF and DARPA’s Integrated Crisis Early Warning System (ICEWS). Phil’s expertise with “event data” (e.g., with GDELT, the Global Database of Events, Language, and Tone) complements GJP’s strength in crowd-sourced forecasts. We’re hoping that the whole will be greater than the sum of the parts. And, readers of this blog should hope that Phil and Jay, both bloggers extraordinaire, will find time to write one or two guest posts for us. Until then, you can sample Phil’s thoughts on political forecasting courtesy of slides for his recent lectures at the University of Konstanz (Germany) and at the European Network for Conflict Research (ENCoRe), University of Essex (United Kingdom).

Jay, Mike and Phil round out a GJP political-scientist roster that already included Penn’s own Mike Horowitz, an expert on international conflict and security issues, and Rick Herrmann, Chair of Ohio State’s Political Science Department and former Director of the Mershon Center for International Security Studies. The breadth and depth of their experience has been invaluable in guiding GJP’s efforts to improve geopolitical forecasting and shed light on important debates in international relations.

Posted in forecasting, GJP research team | Comments Off

FORECAST: Another great season for the Good Judgment Project

Season 3 of the IARPA forecasting tournament is only a little over a month old (official scoring of individual forecasts began on July 31st). But, it already promises to be our most exciting season yet.

Nearly 3,000 forecasters have joined the Good Judgment Team’s roster for Season 3—many of whom are new to the Team this year. Our “newbies” include transfers from other research teams that had participated in Seasons 1 and 2, as well as more than 1,000 forecasters who are participating in the tournament for the first time ever. Welcome to all!

As before, the Good Judgment Project is using the IARPA tournament as a vehicle for social-science research to determine the most effective means of eliciting and aggregating geopolitical forecasts from a widely dispersed forecaster pool. To reach scientifically valid conclusions, we have randomly assigned participants to one of several online forecasting platforms, divided almost evenly between a survey format and two different prediction-market formats. Some survey forecasters participate as individuals; others are randomly assigned to small teams (starting size: 15 forecasters) that share information and strategies and are scored on their collective performance. We are also testing the efficacy of training materials that are tailored to the various elicitation platforms.

Our “superforecasters” deserve a tremendous share of the credit for the Good Judgment Team’s tournament-winning performance in Season 2. Most of our 60 Season 2 superforecasters have returned for Season 3, and we’ve augmented their ranks with a new crop of superforecasters chosen from the top performers in all experimental conditions in Season 2. The 120 superforecasters are organized as eight teams of 15 forecasters each.

Thus far, only four carryover questions from Season 2 have resolved and been officially scored, and that scoring includes only the final month during which these questions were open for forecasting. (The prior months were scored as part of Season 2.) Therefore, these four questions cannot tell us how well our forecasters will perform over the 150+ questions expected to be released between now and May 2014.

With that caveat firmly in mind, we close with what we hope is a precursor of the outstanding performance to be expected over the next several months. One method of aggregating forecasts is to treat each forecaster’s probability forecast as a “vote” (where any probability greater than 50% is a vote that the outcome will occur and any probability less than 50% is a vote that it will not occur), and then predict the outcome based on those votes. The “majority rules” vote of our Season 3 superforecasters achieved a perfect accuracy score for all four questions that have just closed! That’s a pretty impressive achievement when you realize that each day’s prediction for each forecasting question is separately scored. To get a perfect score, the majority of superforecasters must have made the right call on each one of the four questions every day for the month of August. Well done, everyone!
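
For the technically inclined, the “majority rules” aggregation described above amounts to just a few lines of code; the probabilities in this sketch are invented for illustration.

```python
def majority_vote(probs):
    """Treat each forecaster's probability as a vote: above 50% counts as a
    vote that the event will occur, below 50% as a vote that it won't
    (exactly 50% abstains in this sketch)."""
    yes_votes = sum(p > 0.5 for p in probs)
    no_votes = sum(p < 0.5 for p in probs)
    return "yes" if yes_votes > no_votes else "no"

# Invented probabilities from five forecasters on one question, one day.
print(majority_vote([0.80, 0.65, 0.55, 0.30, 0.90]))  # "yes"
```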

Posted in IARPA ACE Tournament | Comments Off

Season 3 of the Good Judgment Project Starts TODAY!

If you signed up to be a Good Judgment Project forecaster for Season 3, you should have received an e-mail invitation via Qualtrics. Clicking on your personal link will take you through the informed-consent process (necessary administrative detail about expectations, payments, etc.) and hand you off to our background data survey. (If you signed up for the project but haven’t received your invitation yet, please contact admin@goodjudgmentproject.com.)

Next steps will be initial training (another Qualtrics invitation, coming soon), followed by access to your forecasting website for some preseason practice. Current plans call for all forecasters who have completed the preliminaries to be admitted to their assigned forecasting websites by mid-July. Official scoring will begin on August 1st and run through early May 2014.

We expect an exciting forecasting season. All forecasters in the IARPA tournament are now under one “roof” (the Good Judgment Project or GJP), which means that we’ll be able to give more complete feedback about relative performance AND that the competition has become even more intense. It will be interesting to see whether forecasters transferring to GJP from our prior competitors have developed distinctive forecasting styles associated with their prior teams.

Another new feature for Season 3 is question clusters. Most of the Season 3 forecasting questions will fall into one of about 15 clusters or categories that are thematically grouped. Forecasting on closely related questions should allow forecasters to delve more deeply into global issues of interest to policymakers. And, we hope that having forecasts on several related questions will shed greater light on key policy debates than is possible with any single question.

Watch this blog for further Season 3 news.

Posted in IARPA ACE Tournament | Comments Off