Expert judgment, like the internet, runs from sublime to sordid. Since its halting entry into the Halls of Science with the Delphi studies of the 1960s, expert judgment has remained something of an embarrassment: scientists use it all the time but would rather not talk about it. Now it is poised to become respectable. Here are some leading indicators:
Bamber and Aspinall’s paper “An expert judgment assessment of future sea level rise from the ice sheets” (see the RFF 2013 webcast Ice Sheets on the Move) has been selected as one of ten articles highlighting research in Nature Climate Change over the last five years. Healthy suspicion within the science community was allayed by the “classical model for structured expert judgment,” whose hallmark is empirical validation with performance-based weighted combinations of experts’ judgments.
Other recent events also signal expert judgment’s ascendancy. Climate gadfly Judith Curry penned an excellent blog on expert judgment and rational consensus, emphasizing the risks of confusing consensus with certainty. Australian biologist and biosecurity expert Mark Burgman’s book “Trusting Judgment” hit the bookshelves in January and exhaustively reviews the sordid side of expert judgment. This follows Sutherland and Burgman’s recent piece in Nature on using experts wisely and Aspinall’s appeal for a “route to more tractable expert advice.” High-visibility applications have recently appeared in top-tier scientific journals. The World Health Organization completed a structured expert judgment study of foodborne diseases on a massive scale: 74 experts, distributed over 134 panels averaging 10 experts each, quantified uncertainty in transmission rates of pathogens through food pathways for different regions of the world. The empirical validation received careful attention. Recent applications in the public domain include the Asian carp invasion of Lake Erie, with out-of-sample validation, and nitrogen run-off in the Chesapeake Bay, also with out-of-sample validation.
The world of expert judgment divides into two hemispheres. The science/engineering hemisphere usually works with small numbers (order 10) of carefully selected experts, asks them about uncertain quantities with a continuous range, and propagates the results through numerical models. The psychology hemisphere estimates probabilities of future newsworthy events. Philip Tetlock’s Good Judgment Project was proclaimed the winner of a five-year forecasting tournament organized by IARPA, which used the Brier score to evaluate forecasters (a score disparaged in the classical model for confounding statistical accuracy and informativeness). Drawing from an expert pool of more than 3,000 and skimming off the top 2 percent, Tetlock’s group distilled a small group of “superforecasters.” With a small fraction of Tetlock’s resources, Burgman’s “Australian Delphi” method, based on the classical model with Delphi-like add-ons, is said to have made a strong showing (personal communication), though data and analysis from the tournament have not been released. Both hemispheres agree that measuring expert performance and using performance-based combinations pays off.
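For readers unfamiliar with it, the Brier score is simply the mean squared difference between forecast probabilities and binary outcomes. A minimal sketch (with invented forecasts and outcomes) shows why a single number can blend calibration and informativeness:

```python
def brier(forecasts, outcomes):
    """Brier score: mean squared error of probability forecasts
    against 0/1 outcomes. Lower is better."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical outcomes of six events (base rate one half):
events = [1, 0, 1, 1, 0, 0]

# A hedging forecaster who always says 0.5 is perfectly calibrated
# here but wholly uninformative:
hedger = brier([0.5] * 6, events)       # 0.25

# A bolder, informative forecaster scores better:
bold = brier([0.9, 0.1, 0.8, 0.7, 0.2, 0.3], events)
print(hedger, bold)
```

The hedger is perfectly calibrated when the base rate is one half, yet says nothing; the single score cannot separate the calibration of a forecaster from the informativeness of her forecasts, which is the classical model’s complaint.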
In applications of the classical model, experts (order 10) are typically asked to assess 5th, 50th, and 95th percentiles for continuous quantities of interest AND for calibration variables (order 10) from their field whose true values are known post hoc. Experts are scored on statistical accuracy and informativeness. If only 2 of 10 calibration-variable values fall within an expert’s 90 percent central confidence band, that produces a low statistical accuracy score. Informativeness is measured as the degree to which an expert’s percentiles are close together. (Proper definitions and data are freely available.) The two scores are negatively correlated, though the WHO data in the following figure show that the correlation attenuates as we down-select to statistically more accurate experts.
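The statistical accuracy side of this scoring can be sketched in miniature (counts invented; the informativeness score, and many of the classical model’s exact conventions, are omitted). Three assessed percentiles cut the range into four inter-quantile bins with expected probabilities 5, 45, 45, and 5 percent; statistical accuracy is then the p-value of a chi-square statistic on the observed bin frequencies:

```python
import math

# Expected inter-quantile probabilities for 5th/50th/95th percentile
# assessments under perfect calibration:
P_EXPECTED = [0.05, 0.45, 0.45, 0.05]

def chi2_sf_3df(x):
    """Survival function of chi-square with 3 degrees of freedom,
    in closed form (avoids a scipy dependency)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def calibration_score(bin_counts):
    """Statistical accuracy in the spirit of the classical model:
    the p-value of 2*N*I(s|p), where I is the KL divergence of the
    empirical bin frequencies s from the expected probabilities p."""
    n = sum(bin_counts)
    kl = 0.0
    for count, p in zip(bin_counts, P_EXPECTED):
        s = count / n
        if s > 0:
            kl += s * math.log(s / p)
    return chi2_sf_3df(2 * n * kl)

# An expert with only 2 of 10 realizations inside the 90% band,
# say 4 below the 5th percentile and 4 above the 95th:
poor = calibration_score([4, 1, 1, 4])
# A well-calibrated expert, with counts close to expectation:
good = calibration_score([1, 4, 4, 1])
print(poor, good)  # poor is orders of magnitude below good
```

The “2 of 10” expert above scores on the order of 10⁻⁵, which is what drives such an expert’s weight toward zero in a performance-weighted combination.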
Unlike the current-events setting, we do not have thousands of experts and years of data per expert panel. Rather, “in-sample” validation looks at performance on the calibration variables, while “cross validation” initializes the weighting model on a subset of calibration variables and gauges performance on the complementary set. Lt. Col. Justin Eggstaff and colleagues developed cross validation for the extensive database of applications with the classical model. The performance ratios of performance-weighted to equal-weighted combinations shown below (from Nature Climate Change) speak for themselves.
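The cross-validation loop itself is simple to sketch (all data invented; a weighted average of quantiles and a raw hit rate stand in for the classical model’s actual combination of distributions and its proper scoring):

```python
from itertools import combinations

# Toy data: six calibration variables with known realizations, and two
# experts' hypothetical 5th/50th/95th percentile assessments.
realizations = [10, 20, 30, 40, 50, 60]
experts = {
    "A": [(x - 5, x, x + 5) for x in realizations],         # well calibrated
    "B": [(x + 10, x + 20, x + 30) for x in realizations],  # biased high
}

def hit_rate(quantiles, idx):
    # Fraction of realizations inside the 90% central band -- a crude
    # surrogate for the classical model's statistical accuracy score.
    return sum(quantiles[i][0] <= realizations[i] <= quantiles[i][2]
               for i in idx) / len(idx)

def combine(weights, idx):
    # Weighted average of quantiles -- a simplification; the classical
    # model combines full distributions, not quantile points.
    total = sum(weights.values())
    return {i: tuple(sum(weights[e] * experts[e][i][k] for e in experts) / total
                     for k in range(3)) for i in idx}

pw_scores, ew_scores = [], []
for train in combinations(range(6), 3):
    test = [i for i in range(6) if i not in train]
    # Fit performance weights on the training half...
    w = {e: hit_rate(q, train) for e, q in experts.items()}
    if sum(w.values()) == 0:               # fall back to equal weights
        w = {e: 1.0 for e in experts}
    # ...then score both combinations on the held-out half.
    pw_scores.append(hit_rate(combine(w, test), test))
    ew_scores.append(hit_rate(combine({e: 1.0 for e in experts}, test), test))

pw_avg = sum(pw_scores) / len(pw_scores)
ew_avg = sum(ew_scores) / len(ew_scores)
print(pw_avg, ew_avg)
```

In this toy setup the performance weights zero out the biased expert on every split, so the performance-weighted combination wins by construction; the point of the Eggstaff et al. exercise is that the same advantage shows up, out of sample, across the real application database.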