Algorithms for Life: Bayes & Overfitting

Your brain is a prediction machine that runs on Bayesian logic — except when it doesn't. This episode traces the paradox of human reasoning through cognitive science, quant finance, clinical medicine, and prediction markets to answer one question: when does adding more information, more complexity, or more confidence make your decisions worse?

34 sources

32 min read time

38:45 audio

Section 01

The Paradox Inside Your Head

Here is a fact that should stop you cold: a three-year-old child, shown a novel toy that lights up when certain blocks are placed on it, will infer the toy's causal structure in a way that almost perfectly matches the predictions of a Bayesian probabilistic model (Griffiths, Kemp & Tenenbaum — Hierarchical…). The correlation between the child's guesses and the optimal Bayesian solution is astonishingly high — between r = 0.85 and r = 0.92 across a range of tasks including category learning, causal induction, and word learning (Griffiths, Kemp & Tenenbaum — Hierarchical…). Your brain, before you could tie your shoes, was already running something that looks very much like the most sophisticated statistical inference framework ever devised.

Now here is a second fact: when physicians at Harvard Medical School were given a straightforward Bayesian reasoning problem (a disease with a 1% prevalence and a test that is 95% sensitive and 95% specific; what is the probability that a positive result is a true positive?), only about 18% got the right answer (Casscells, Schoenberger & Graboys 1978 — P…). The correct answer is roughly 16%. The modal physician's answer was 95%. In a separate study, David Eddy found that only 5% of physicians arrived at the Bayesian solution to a similar mammography screening problem (Eddy 1982 — Physician Bayesian reasoning i…). And Hammerton, testing the same class of problem in 1973, found only 10% correct (Hammerton 1973 — Physician Bayesian reason…).
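
To see where the "roughly 16%" comes from, here is a minimal sketch of the calculation using the numbers stated in the problem (1% prevalence, 95% sensitivity, 95% specificity). The function name `posterior_positive` is just an illustrative label, not something from the cited studies.

```python
def posterior_positive(prevalence: float, sensitivity: float, specificity: float) -> float:
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = prevalence * sensitivity                    # sick and flagged
    false_pos = (1.0 - prevalence) * (1.0 - specificity)   # healthy but flagged
    return true_pos / (true_pos + false_pos)

ppv = posterior_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(f"P(disease | positive) = {ppv:.3f}")
```

The false positives from the large healthy group (99% of people) outnumber the true positives from the small sick group, which is why the answer sits far below 95%.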

These two findings cannot both be true without qualification — and yet they are. Your brain is simultaneously one of the finest Bayesian inference engines on the planet and one of the worst Bayesian calculators. This is not a contradiction. It is a level-of-analysis distinction, and understanding it may be the single most important insight in modern cognitive science.

The resolution lies in what cognitive scientists call Marr's levels of analysis (Griffiths, Kemp & Tenenbaum — Hierarchical…). The Bayesian cognition program, led by researchers like Tom Griffiths, Charles Kemp, and Josh Tenenbaum, operates at the computational level: it asks what problem the mind is solving and whether the outputs match what a Bayesian agent would produce. The base-rate neglect literature, by contrast, tests an explicit, decontextualized, single-trial problem requiring symbolic manipulation of conditional probabilities. These are not the same task. The child inferring causal structure is operating in a rich, ecologically valid environment full of repeated encounters and structured feedback. The physician staring at a word problem is stranded in precisely the environment where the implicit Bayesian machinery breaks down.

This distinction — implicit competence versus explicit failure — is the structural spine of everything that follows. Every domain we'll examine today, from hedge funds to hospitals to prediction markets, will force us back to the same question: are you routing your decision through a format your cognitive machinery can actually handle?

Your brain is simultaneously one of the finest Bayesian inference engines on the planet and one of the worst Bayesian calculators — only 5% of physicians solved a standard screening problem correctly.

The Bayesian Paradox: Implicit vs. Explicit Performance

Implicit Bayesian tasks (Griffiths, Kemp & Tenenbaum program): 85–92%

Frequency-format Bayesian problems (Gigerenzer, most ecological condition): 92%

Standard frequency-format problems (Gigerenzer & Hoffrage 1995): 46%

Casscells et al. 1978 (physicians, probability format): 18%

Hammerton 1973 (physicians, probability format): 10%

Eddy 1982 (physicians, probability format): 5%

Implicit Bayesian tasks show near-optimal performance (r = 0.85–0.92 with Bayesian models), while explicit probability problems reveal catastrophic failure rates among trained professionals.

What this means for listeners: The goal is not to become a human calculator running Bayes' theorem in your head. It's to restructure how you encounter information — using frequency formats, visual displays, and repeated-encounter frames — so your powerful implicit Bayesian machinery can do the work it was built for.

Section 02

The Frequency Fix: Why Format Changes Everything

If the implicit Bayesian machinery is so good, why does it fail so spectacularly on textbook problems? The answer, championed most forcefully by Gerd Gigerenzer and colleagues, is deceptively simple: the format is wrong.

Consider the classic mammography screening problem. Presented in conditional probability format — "the sensitivity is 80%, the false positive rate is 9.6%, and the prevalence is 1%" — about 6% of participants arrive at the correct Bayesian posterior (Gigerenzer & Hoffrage 1995 — 'How to Impro…). Now present the same problem as natural frequencies: "Out of every 1,000 women, 10 have breast cancer. Of those 10, 8 will get a positive mammogram. Of the 990 who don't have cancer, about 95 will still get a positive mammogram. Of all the women who test positive, how many actually have cancer?" Suddenly, 46% of participants get it right (Gigerenzer & Hoffrage 1995 — 'How to Impro…). And in the most ecologically valid experimental condition — where the nested-set structure of the frequencies is made maximally transparent — the correct response rate climbs to 76%, and in some versions, 92% (Gigerenzer and colleagues — Systematic exp…).
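
The two formats describe the same arithmetic, which a short sketch can make concrete. The counts below are rebuilt from the percentages in the text (1% prevalence, 80% sensitivity, 9.6% false positive rate); rounding 95.04 to 95 mirrors the "about 95" in the natural-frequency wording.

```python
from fractions import Fraction

# Natural-frequency format: work with counts of women.
women = 1000
with_cancer = round(women * 0.01)                                 # 10
cancer_and_positive = round(with_cancer * 0.80)                   # 8
no_cancer_and_positive = round((women - with_cancer) * 0.096)     # about 95

all_positive = cancer_and_positive + no_cancer_and_positive
share = Fraction(cancer_and_positive, all_positive)
print(f"{cancer_and_positive} of {all_positive} positives have cancer ({float(share):.1%})")

# Probability format: the same question through Bayes' theorem.
prevalence, sensitivity, false_positive_rate = 0.01, 0.80, 0.096
posterior = (prevalence * sensitivity) / (
    prevalence * sensitivity + (1 - prevalence) * false_positive_rate
)
print(f"probability format gives {posterior:.1%}")
```

In the frequency version the answer is a single division of two visible counts; in the probability version the same counts are hidden inside products of percentages.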

This is not a minor pedagogical tweak. A 2017 meta-analysis by McDowell and Jacobs, spanning 35 studies, found a robust weighted effect size of d = 0.93 for the frequency format advantage (McDowell & Jacobs 2017 — Meta-analysis of…). That is a large effect by any standard in psychology. The frequency format does not teach people Bayes' theorem — it plugs their existing implicit Bayesian competence into a representational format that the evolved cognitive architecture can actually process.

Why frequencies? The argument from ecological rationality is compelling (Gigerenzer & Hoffrage 1995 — 'How to Impro…). For most of human evolutionary history, probabilistic information arrived as experienced frequencies — how many times the berries in that patch made you sick, how many hunts at the river crossing succeeded — not as abstract percentages. The brain evolved frequency counters, not probability calculators. When you present information in the format the machinery was designed for, performance snaps back toward optimality.

But there is an important caveat. Sloman and colleagues have argued that the advantage is not about frequencies per se — it is about transparent nested-set structure (Sloman et al. 1999 — Critique of frequency…). When other formats make the set relationships equally clear (such as Euler diagrams or icon arrays), performance also improves. The frequency format works not because frequencies are magic, but because they naturally expose the part-whole relationships that Bayesian reasoning requires.

This debate matters practically because it tells us what to optimize: not the word "frequency" but the transparency of the nested structure. Whether you are a physician interpreting a screening test, a manager evaluating a hiring filter, or a founder assessing the base rate of startup failure in your category, the intervention is the same — reformat the information so the set relationships are visible.

The dual-process account adds a fascinating layer. De Neys and Glumicic found that even when participants gave the "biased" answer on base-rate neglect tasks, their response times increased on incongruent trials — trials where the stereotype conflicted with the base rate (De Neys & Glumicic 2008 — Implicit conflic…). The brain was registering the conflict. The base-rate information was available to the system; it just was not being leveraged by the explicit reasoning process. Participants chose the stereotypical response 78% of the time on incongruent trials, despite measurable increases in System 2 processing (De Neys & Glumicic 2008 — Implicit conflic…). The information is there. The format just fails to surface it.

A meta-analysis of 35 studies found a weighted effect size of d = 0.93 for the frequency format advantage — one of the largest and most reliable effects in the psychology of reasoning.

What this means for listeners: Whenever you face a probabilistic decision — a medical test result, a business risk assessment, a hiring decision — translate the numbers into natural frequencies before reasoning about them. 'Out of every 100 startups like mine, how many succeed?' is a better question than 'What is the probability of success?' The format change alone can increase your accuracy by a factor of eight.

Section 03

Overfitting: The Universal Failure Mode

Now we need to introduce the episode's second big idea, and it starts with a question that seems to belong in a machine learning textbook: what is overfitting?

Imagine you are trying to predict tomorrow's weather. You have a year of historical data — temperature, humidity, wind speed, barometric pressure. A simple model might use two variables and a linear equation. It will miss some days. A complex model might use twenty variables, polynomial interactions, and learned coefficients for every day of the week. On your historical data, the complex model will look brilliant — it fits every twist and turn of the past. But point it at tomorrow, a day it has never seen, and it crumbles. It has learned the noise of the past, not the signal of the future.

The technical term for this is the bias-variance tradeoff (Gigerenzer & Todd 1999 — Simple Heuristics…). Simple models have high bias (they miss some true patterns) but low variance (their predictions barely change when they are refit on a different sample). Complex models have low bias (they are flexible enough to capture true patterns and false ones alike) but high variance (their predictions swing widely from one training sample to the next, so they do poorly on new data). The sweet spot, the model that actually predicts well, lies in between.
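
A toy simulation can make the tradeoff visible. This is a sketch under stated assumptions, not a weather model: the data come from an invented rule (y = 2x plus Gaussian noise), the "simple" model is a least-squares line, and the "complex" model is a memorizer that copies the outcome of the nearest training point.

```python
import random

random.seed(0)

def make_data(n):
    """Draw n points from the assumed rule y = 2x + noise (sd 1)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def fit_memorizer(xs, ys):
    """1-nearest-neighbour: reproduce the outcome of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)
line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name:10s} train MSE {mse(model, train_x, train_y):.2f}  "
          f"test MSE {mse(model, test_x, test_y):.2f}")
```

The memorizer scores a perfect zero on the data it has seen because it has stored every noisy outcome, and it pays for that on fresh data, where it carries the old noise along with the signal.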

Here is the insight that makes this episode click: overfitting is not a pathology unique to algorithms. It is the default outcome of optimization under insufficient data, and it afflicts every prediction system — including the one between your ears.

Gigerenzer's "less-is-more" research program provides the most striking demonstration (Czerlinski, Gigerenzer & Goldstein 1999 —…). In a series of studies, he and his collaborators tested a radically simple decision rule called "take-the-best": when comparing two options, check the cues one at a time in order of validity, decide on the first cue that discriminates between the options, and ignore everything else. Tested across 20 real-world datasets (predicting city populations, school dropout rates, fish fertility, even homelessness), this absurdly simple heuristic matched or beat multiple regression in 12 of the 20 datasets on out-of-sample prediction (Czerlinski, Gigerenzer & Goldstein 1999 —…). On the training data, regression always won. On new data, simplicity often prevailed.

The mechanism was exactly overfitting. Regression, given many predictors and limited training data, fit the noise. The one-cue heuristic, by ignoring almost everything, could not overfit even if it tried. As Gigerenzer's team showed, the less-is-more effect was strongest when training sets were small (under 50 cases) and when features were correlated — precisely the conditions where overfitting risk is highest (Czerlinski, Gigerenzer & Goldstein 1999 —…).
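
For readers who want to see how little machinery the heuristic needs, here is a minimal sketch of take-the-best in its usual formulation: rank binary cues by validity on training data, then decide on the first cue that separates the two options. The four "cities", their sizes, and their cue values are invented for illustration and do not come from the original datasets.

```python
from itertools import combinations

def cue_validity(objects, cue):
    """Among pairs the cue separates, the share where the cue-positive object is larger."""
    right = wrong = 0
    for a, b in combinations(objects, 2):
        if a["cues"][cue] == b["cues"][cue] or a["value"] == b["value"]:
            continue  # cue does not discriminate, or there is nothing to predict
        larger = a if a["value"] > b["value"] else b
        if larger["cues"][cue] == 1:
            right += 1
        else:
            wrong += 1
    return right / (right + wrong) if right + wrong else 0.5

def take_the_best(a, b, ordered_cues):
    """Pick the option favoured by the first discriminating cue; ignore the rest."""
    for cue in ordered_cues:
        if a["cues"][cue] != b["cues"][cue]:
            return a if a["cues"][cue] == 1 else b
    return None  # no cue discriminates, so the rule would guess

# Hypothetical training data: value is population in thousands.
cities = [
    {"name": "A", "value": 900, "cues": {"capital": 1, "team": 1, "university": 0}},
    {"name": "B", "value": 500, "cues": {"capital": 0, "team": 1, "university": 1}},
    {"name": "C", "value": 300, "cues": {"capital": 0, "team": 0, "university": 1}},
    {"name": "D", "value": 100, "cues": {"capital": 0, "team": 1, "university": 0}},
]
cue_names = ["university", "team", "capital"]
ordered = sorted(cue_names, key=lambda c: cue_validity(cities, c), reverse=True)
print("cue order:", ordered)
print("B vs C ->", take_the_best(cities[1], cities[2], ordered)["name"])
```

The only things estimated from data are the cue rankings, so there are very few free parameters for noise to corrupt, which is the property the text credits for its out-of-sample robustness.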

Paul Meehl saw this decades earlier. His foundational 1954 work on clinical versus statistical prediction, updated through Grove and Meehl's 2000 meta-analysis covering roughly 70 years of replication, showed that simple mechanical prediction rules beat clinical judgment in approximately 60% of direct comparisons (Grove & Meehl 2000 — Meta-analysis of clin…). The mechanism was not that clinicians were stupid — it was that they were complex. Each clinician weighted cues differently on different days, attended to memorable recent cases, and built elaborate internal models that fit the noise of their personal clinical experience. The simple formula, too dumb to overfit, generalized better.

This is the deep connection between the two halves of our episode. Bayesian reasoning is about updating beliefs with evidence. Overfitting is about what happens when you update too eagerly, with too much complexity, on too little data. The Bayesian paradox and the overfitting paradox are not separate phenomena — they are two views of the same underlying tension between learning signal and learning noise.

Across 70 years of replication, simple mechanical prediction rules beat clinical judgment in approximately 60% of direct comparisons — not because clinicians are stupid, but because they are complex.

When Simplicity Wins: The Data × Noise Decision

Abundant data, low noise (kind environment): Complex models shine. Use full regression / ML. Enough data to estimate parameters reliably; signal is clear.

Abundant data, high noise (wicked environment): Regularize heavily. Shrink, ensemble, validate out of sample (OOS). Data exists but noise demands complexity penalties.

Scarce data, low noise (kind environment): Simple rules compete. Consider heuristics alongside models. Small samples but clean signal; test both approaches.

Scarce data, high noise (wicked environment): Simple rules dominate. Use fast-and-frugal heuristics. Overfitting is near-certain with complex models; less is more.

Simple models outperform complex ones when data is scarce and noise is high — exactly the conditions most real-world decisions face. Complex models earn their keep only with abundant, clean data.

What this means for listeners: Before you add complexity to any decision model — a hiring rubric, an investment thesis, a product strategy — ask yourself: how much data am I actually working with, and how noisy is it? If the answer is 'not much' and 'very,' you are almost certainly better off with fewer variables and simpler rules. Complexity is not sophistication; sometimes it is just a more expensive way to be wrong.

Section 04

Wall Street's Expensive Lesson: 888 Strategies and the Graveyard of Backtests

If you want to see overfitting at industrial scale, with real money on the line, look at quantitative finance.

Quantopian, the crowdsourced quant platform, produced one of the most valuable datasets in the overfitting literature almost by accident. Researchers examined 888 crowd-developed trading algorithms that had both backtests and at least six months of genuine out-of-sample performance — live or paper trading with real market data the algorithm had never seen (Quantopian 888-Strategy Out-of-Sample Data…). The finding was exactly what overfitting theory predicts: the more backtesting and parameter tuning a quant had done, the larger the gap between backtest performance and out-of-sample performance (Quantopian 888-Strategy Out-of-Sample Data…). The strategies that looked most brilliant in hindsight were the ones most likely to disappoint in the future.
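
The pattern can be reproduced with strategies that have no edge at all. The following is a toy simulation, not the Quantopian analysis: every "strategy" is pure Gaussian noise with assumed daily volatility of 1%, the quant keeps the best backtest out of n variants, and the winner is then run on fresh data. Because the strategies are independent noise, one fresh draw stands in for the winner's live record.

```python
import random

random.seed(1)

def sharpe(returns):
    """Per-period Sharpe ratio (mean over sample standard deviation), not annualised."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / var ** 0.5

def zero_edge_returns(n_days=250):
    """A strategy with no true edge: daily returns are pure noise."""
    return [random.gauss(0.0, 0.01) for _ in range(n_days)]

def best_of(n_variants):
    """Keep the best backtest among n variants, then observe it out of sample."""
    best_backtest = max(sharpe(zero_edge_returns()) for _ in range(n_variants))
    live = sharpe(zero_edge_returns())
    return best_backtest, live

results = {}
for n in (1, 10, 100):
    trials = [best_of(n) for _ in range(20)]
    avg_backtest = sum(t[0] for t in trials) / len(trials)
    avg_live = sum(t[1] for t in trials) / len(trials)
    results[n] = (avg_backtest, avg_live)
    print(f"{n:4d} variants tried: backtest Sharpe {avg_backtest:+.3f}, live Sharpe {avg_live:+.3f}")
```

The more variants are tried, the better the winning backtest looks, while live performance stays near zero. The gap is created entirely by selection.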

This is finance's version of the physician who fits an elaborate mental model to their clinical experience. The quant fits an elaborate algorithm to historical price data. Both feel like they are learning. Both are, to a significant degree, memorizing noise.

The finance industry knows this. AQR, one of the world's largest quant firms, has published practitioner-facing essays that say the quiet part out loud: many "great backtests" are not believed internally (AQR — 'Lies, Damned Lies, and Data Mining'…). Robustness is judged not by in-sample fit but by breadth of evidence — does the strategy work across time periods, asset classes, and geographies? A paper in Significance, the journal of the Royal Statistical Society, explicitly framed backtest overfitting as finance's analogue of p-hacking: the systematic under-reporting of the number of strategy variants tested, which inflates apparent performance (Royal Statistical Society, Significance Vo…). CFM, the French quant house Capital Fund Management, published a technical note documenting the systematic performance gap between in-sample and out-of-sample results when models are overfit (CFM (Capital Fund Management) Technical No…).

The industry's anti-overfitting toolkit reads like a Bayesian's wish list: out-of-time splits with purge gaps to prevent data leakage; walk-forward analysis with rolling windows and parameter lock after selection; deflated Sharpe ratios that haircut performance for multiple testing; and parameter stability analysis that rejects strategies whose results depend on knife-edge parameter choices (David H. Bailey — Deflated Sharpe Ratio an…) (AQR — 'Lies, Damned Lies, and Data Mining'…). Bayesian ideas show up explicitly as shrinkage (partial pooling toward a prior), regime-uncertainty modeling, and sequential updating of signal estimates (AQR — 'Lies, Damned Lies, and Data Mining'…). Firms just rarely call it "Bayesian" in public because the specifics are core intellectual property.
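
As one concrete piece of that toolkit, here is a minimal sketch of walk-forward splitting with a purge gap. The function name, window sizes, and gap length are illustrative assumptions, not any firm's actual procedure.

```python
def walk_forward_splits(n_obs, train_size, test_size, purge=0):
    """Yield (train, test) index ranges for rolling windows.

    `purge` observations between the end of training and the start of
    testing are skipped, so information near the boundary cannot leak
    from one window into the other.
    """
    start = 0
    while start + train_size + purge + test_size <= n_obs:
        train = range(start, start + train_size)
        test_start = start + train_size + purge
        test = range(test_start, test_start + test_size)
        yield train, test
        start += test_size  # roll the window forward by one test block

splits = list(walk_forward_splits(n_obs=20, train_size=8, test_size=4, purge=2))
for train, test in splits:
    print(f"train {train.start:2d}-{train.stop - 1:2d} | test {test.start:2d}-{test.stop - 1:2d}")
```

Parameters would be chosen on each training range and then locked before scoring the matching test range, so every reported number comes from data the model did not see when it was tuned.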

But here is the honest caveat: when you ask for documented cases where overfitting caused model failures at named hedge funds — Renaissance Technologies, Two Sigma, AQR, Bridgewater — those firms rarely publish "we overfit and blew up" narratives (AQR — 'Lies, Damned Lies, and Data Mining'…). What is well documented is the systematic pattern: backtest-to-live decay is the norm, not the exception; regime change is the practical failure mode that no amount of in-sample optimization can address; and the firms that survive longest are the ones most paranoid about their own backtests.

The parallel to human cognition is direct. Robin Hogarth's research on "kind" versus "wicked" learning environments provides the bridge (Hogarth and colleagues — Kind vs. wicked l…). In a kind environment — clear feedback, stable rules, many repetitions — both humans and algorithms learn well. Chess is kind. Weather forecasting is kind. In a wicked environment — delayed or misleading feedback, shifting rules, small samples — experience breeds confidence without breeding accuracy. Hogarth and colleagues found that in wicked learning environments, people often gain confidence without gaining skill (Hogarth and colleagues — Kind vs. wicked l…). Financial markets are the canonical wicked environment: the rules change, feedback is noisy and delayed, and the sample sizes that matter (true regime shifts) are tiny. The quant who backtests 500 strategies and picks the best one is doing exactly what the overconfident clinician does — overfitting to a biased sample of experience.

The more backtesting a quant had done, the larger the gap between backtest performance and real-world results — the strategies that looked most brilliant in hindsight were the most likely to disappoint.

What this means for listeners: If you are evaluating any model — a financial strategy, a business plan, a hiring rubric — ask how many alternatives were tested before this one was selected. The more variants tried, the less you should trust the winner's apparent performance. Demand out-of-sample evidence, and be deeply suspicious of any result that only works on the data it was built on.

Section 05

When the Game Changes: Hospitals, Baseball, and the Limits of Fitting History

The overfitting lens sharpens dramatically when we watch what happens after the rules change. Two domains — clinical medicine and professional baseball — provide near-perfect natural experiments.

Start with the Epic Sepsis Model, one of the most widely deployed clinical prediction algorithms in American hospitals. Epic, the dominant electronic health records company, built a proprietary model to flag patients at risk of sepsis. On paper, it looked useful. But when researchers at Michigan Medicine conducted an independent evaluation across nine hospitals between January 2020 and June 2022, they found something troubling: the model's performance varied significantly by hospital factors (Michigan Medicine / PubMed 34152373 — Epic…). A model trained in one context did not travel for free to another context. This is the clinical version of the Quantopian finding — a model overfit to its training distribution degrades when the distribution shifts.

The IBM Watson for Oncology story is even more instructive. MD Anderson Cancer Center, one of the world's premier cancer hospitals, partnered with IBM to deploy Watson as a clinical decision support tool. The partnership was canceled. As Scientific American reported in a detailed investigation, the failure was not merely an AUC issue — it was a failure of procurement, workflow integration, incentive alignment, and overpromising (Scientific American — IBM Watson for Oncol…). Watson had been trained on a limited set of cases and recommendation protocols; when deployed in the messy reality of diverse patient populations and clinical workflows, it produced recommendations that clinicians found unreliable or irrelevant.

And here is where the numbers get genuinely alarming. Systematic reviews of clinical decision support alert systems consistently find that clinicians override these alerts between 49% and 96% of the time (PMC4052586 — Systematic review of CDS aler…) (BMJ Digital Health e000083 — CDS alert ove…) (PMC6855857 — Inpatient e-prescribing obser…). That range is confirmed across multiple reviews and is cited by the Agency for Healthcare Research and Quality (AHRQ Clinical Decision Support Resource Pa…). Nearly half to nearly all alerts are ignored.

But the interpretation is more nuanced than "doctors don't listen." High override rates can reflect low precision (too many false alarms), poor workflow timing, liability-driven over-alerting, or justified clinical judgment that the alert is irrelevant to this specific patient (PMC4052586 — Systematic review of CDS aler…). The system has overfit to population-level patterns that do not apply to the individual case in front of the clinician. Alarm fatigue, the clinical term for what happens when a system cries wolf so often that up to 96% of its alerts are overridden, is itself an overfitting failure: the alert system has optimized for sensitivity at the expense of specificity, and the humans in the loop have rationally learned to ignore it.

Now cross to baseball. When Major League Baseball banned extreme defensive shifts in 2023, it created a clean natural experiment in regime change (The Analyst 2023 — MLB shift ban effects o…). For years, analytically sophisticated teams had optimized defensive positioning based on historical batted-ball distributions — placing fielders where hitters had hit the ball in the past. This was a form of fitting a model to training data. Then the league changed the rules, altering the data-generating process itself. Causal-inference research using synthetic control methods found that the effects were heterogeneous: some hitters benefited enormously from the shift ban; others barely noticed (arXiv 2411.15075 — Causal-inference analys…). Teams that had built their roster construction and defensive strategy around shift-heavy run prevention were the most exposed — they had overfit their organizational strategy to a regime that no longer existed (The Analyst 2023 — MLB shift ban effects o…).

The lesson across both domains is the same one Hogarth identified in his kind-versus-wicked framework (Hogarth and colleagues — Kind vs. wicked l…): the quality of your model depends entirely on whether the environment that generated your training data still applies. An algorithm trained on one hospital's sepsis patterns may fail at another hospital. A defensive strategy optimized for one set of rules may collapse when the rules change. A physician's mental model, calibrated to decades of experience in one clinical setting, may misfire in a new one. Overfitting is not a failure of intelligence — it is a failure of environmental fit.

Clinicians override clinical decision support alerts between 49% and 96% of the time — nearly half to nearly all alerts are simply ignored.

What this means for listeners: Ask yourself: has the game changed? Whether you are relying on a business model built for a pre-AI world, clinical intuitions trained before a new treatment protocol, or investment strategies backtested on a bygone interest-rate regime, the most dangerous assumption is that the future will look like the training data.

Section 06

The Calibration Gymnasium: Superforecasters, Prediction Markets, and Learning to Be Right

So overfitting is everywhere — in our heads, in our algorithms, in our organizations. Is there any hope? Can humans actually learn to reason better under uncertainty?

The most encouraging answer comes from Philip Tetlock's Good Judgment Project, the largest forecasting tournament ever conducted. Over four years, more than 2,000 participants made probabilistic predictions about geopolitical events for the U.S. Intelligence Community (Tetlock & Gardner — Good Judgment Project;…). The top 2% — the superforecasters — achieved calibration scores that consistently beat not just average participants but also professional intelligence analysts with access to classified information.

What set superforecasters apart was not domain expertise. It was a cluster of cognitive habits: comfort with probabilistic language, frequent updating of beliefs as evidence arrived, intellectual humility, and active information-seeking (Tetlock & Gardner — Good Judgment Project;…). A Good Judgment white paper analyzing data from the Good Judgment Open platform found that forecasters who updated their predictions more frequently achieved better Brier scores (Good Judgment White Paper 2022 — 'Forecast…). More thinking, more revision, more willingness to change your mind — these correlated with better accuracy.

But here is a crucial nuance: is this causal, or selection? It is entirely plausible that better forecasters update more because they are better, not that updating makes them better (Good Judgment White Paper 2022 — 'Forecast…). The correlation is real; the causal direction is not settled.

Calibration training — the practice of giving people feedback on the alignment between their confidence levels and their actual accuracy — has a strong evidence base. Arkes, Dawes, and Christensen trained participants using immediate feedback on their confidence-accuracy alignment and found that calibration error dropped significantly, from a mean squared error of roughly 0.27 to 0.14, and the improvement held at a six-month follow-up (Arkes, Dawes & Christensen 1986 — Calibrat…). Weather forecasters, who receive daily feedback on the accuracy of their probability estimates, are among the best-calibrated professional groups, with calibration scores around 0.85 compared to 0.50 for untrained forecasters (Murphy & Winkler 1984 — Meteorologist upda…). Murphy and Winkler's research also revealed an important detail about update frequency: meteorologists who updated beliefs at roughly weekly intervals with multiple data points showed better calibration than those who reacted to each new observation (Murphy & Winkler 1984 — Meteorologist upda…). Too-frequent updating is itself a form of overfitting — fitting your beliefs to the noise of individual data points rather than the signal of aggregated trends.

Prediction markets have become an extraordinary real-world laboratory for studying calibration at scale. Metaculus publishes calibration analyses using logistic recalibration and Brier-score evaluation (Metaculus Calibration Notebook — Logistic/…), and their data suggest that community forecast accuracy improves with more forecasters, with diminishing returns (Metaculus Forecaster Count Analysis — 'Mor…). Manifold Markets publishes a public calibration dashboard showing Brier-style summary metrics (Manifold Markets Official Calibration Dash…) — but their community has also surfaced a subtle problem: you can look well-calibrated without being informative if you "hug the base rate" and only trade on easy, late-resolving markets (Manifold Markets Official Calibration Dash…). Calibration without sharpness is the forecasting equivalent of a student who only answers questions they already know — technically accurate, but not useful.
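
The difference between calibration and sharpness is easy to show with a small worked example. The 20 questions and both forecasters below are invented for illustration: one forecaster commits to 90% or 10%, the other answers 50% on everything. Both are perfectly calibrated on this data, yet their Brier scores differ sharply.

```python
from collections import defaultdict

def brier(forecasts, outcomes):
    """Mean squared difference between stated probability and outcome (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

def calibration_table(forecasts, outcomes):
    """Map each stated probability to the observed share of 'yes' resolutions."""
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[p].append(o)
    return {p: sum(os) / len(os) for p, os in sorted(buckets.items())}

# Hypothetical resolutions: 10 of the 20 questions resolve "yes".
outcomes = [1] * 9 + [0] + [1] + [0] * 9
sharp = [0.9] * 10 + [0.1] * 10   # commits, and is right 9 times out of 10
hugger = [0.5] * 20               # always states the base rate

for name, forecasts in [("sharp", sharp), ("hugger", hugger)]:
    print(f"{name:7s} calibration {calibration_table(forecasts, outcomes)} "
          f"Brier {brier(forecasts, outcomes):.2f}")
```

A calibration chart alone cannot tell these two apart, which is why a proper scoring rule such as the Brier score is worth reading alongside it.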

The largest-scale evidence comes from Polymarket. A recent large-sample analysis of 28,407 markets resolving between January 2024 and May 2026 reported strong calibration, with bucketed resolution rates closely tracking market prices (Polysyncer Blog Analysis — 28,407 Polymark…). A large academic-style SSRN study using hundreds of millions of trades analyzed accuracy, skill, and bias in Polymarket's data (SSRN 5910522 — Large academic-style Polyma…). And cross-platform calibration research comparing Kalshi and Polymarket trade data found that calibration is not a single global property — it is multidimensional and domain-structured, varying by topic area, market liquidity, and market design (arXiv 2602.19520 — Cross-platform calibrat…).

This last finding is critical for our episode's thesis. Calibration is real, learnable, and measurable — but it is not a universal upgrade. It is domain-contingent. The superforecaster who is brilliantly calibrated on geopolitical questions may be no better than chance on questions about technology adoption or pandemic progression. The prediction market that is well-calibrated on U.S. election outcomes may be poorly calibrated on cryptocurrency regulation.

Calibration training reduced error from 0.27 to 0.14 mean squared error, and the improvement held at a six-month follow-up — but gains often don't transfer to new domains.

Evidence Strength: Can Humans Learn Better Calibration?

Tier 1, meta-analytic (95% weight): Grove & Meehl (2000): Simple rules beat clinical judgment in ~60% of comparisons across 70 years of studies. McDowell & Jacobs (2017): Frequency format advantage d = 0.93 across 35 studies.

Tier 2, large field studies (80% weight): Tetlock's Good Judgment Project (2,000+ participants, 4 years): Top 2% superforecasters consistently outperform. Arkes et al. (1986): Calibration training holds at 6 months (n = 216).

Tier 3, practitioner / platform data (55% weight): Metaculus and Manifold calibration dashboards show measurable accuracy gains from aggregation. Polymarket 28,407-market analysis shows strong overall calibration. GJO white paper links update frequency to accuracy.

Tier 4, preliminary / single-source (25% weight): Overcorrection after Bayesian training (single arXiv preprint, no named authors). Cross-domain transfer failures (limited replication). LLM-assisted forecasting gains of 23–43% (single preliminary finding).

The evidence that calibration is trainable is strong, but evidence for cross-domain transfer and long-term maintenance is weaker. The overcorrection risk after Bayesian training is a preliminary finding based on limited evidence.

What this means for listeners: Invest in calibration practice — track your predictions, assign probabilities, and compare them against outcomes. But hold the skill lightly. Calibration does not transfer automatically across domains, and the best forecasters are the ones who know the boundaries of their own competence.

Section 07

The Overcorrection Trap and the Expertise Paradox

Here is where the story takes an unexpected turn. You might think, after everything we have discussed, that the prescription is simple: teach people Bayes' theorem, give them calibration training, and watch them improve. The research says: not so fast.

An emerging body of evidence points to an underappreciated failure mode: overcorrection. Early research indicates that individuals who have been educated on base rate neglect sometimes swing too far in the opposite direction, overly relying on base rates in scenarios where individuating information is actually more diagnostic (Unnamed arXiv preprint on overcorrection a…). This is the Bayesian training equivalent of a dieter who, having learned that overeating is bad, begins to starve. The cure introduces a new pathology.
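
The odds form of Bayes' rule shows what overcorrection gets wrong: the base rate is only the starting point, and how far to move from it depends on how diagnostic the case-specific evidence is. The 5% base rate and the likelihood ratios below are made-up values chosen for illustration.

```python
def update(prior, likelihood_ratio):
    """Posterior probability from a prior and a likelihood ratio (odds form of Bayes' rule)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

base_rate = 0.05  # hypothetical reference-class frequency
for label, lr in [("uninformative", 1), ("weak", 3), ("highly diagnostic", 50)]:
    print(f"{label:18s} evidence (LR={lr:2d}): posterior {update(base_rate, lr):.2f}")
```

Someone who answers "5%" in all three rows is respecting the base rate but ignoring the evidence, which is as much a departure from Bayesian reasoning as the neglect they were trained out of.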

Discussions in the rationalist and effective altruism forecasting communities have surfaced concrete examples of what they call "reference class forecasting gone wrong" — situations where disciplined application of base rates to genuinely novel situations produced worse predictions than case-specific reasoning would have (Unnamed arXiv preprint on overcorrection a…). A professional forecaster described a case where over-reliance on historical base rates, driven by recent Bayesian training, led to a significant forecasting error on a question where the relevant causal structure had changed (Unnamed arXiv preprint on overcorrection a…). The base rate was accurate for the old regime. It was misleading for the new one.

This connects directly to the expertise paradox that sits at the heart of the Meehl tradition. Experience in clinical assessment does not always lead to improved accuracy (Grove & Meehl 2000 — Meta-analysis of clin…). And yet Tetlock's superforecasters — people with genuine expertise in the metacognition of forecasting — dramatically outperform both novices and domain experts (Tetlock & Gardner — Good Judgment Project;…). How do we reconcile this?

The resolution requires decomposing "expertise" into at least three separable components: domain knowledge, pattern recognition skill, and confidence in one's own pattern recognition. In kind learning environments — where feedback is clear, timely, and representative — all three components align. The chess master's domain knowledge produces accurate pattern recognition, and their confidence is well-calibrated because every game provides unambiguous feedback. But in wicked environments, these components diverge dangerously. The clinician accumulates domain knowledge and pattern recognition skill, but because feedback is noisy and delayed, their confidence grows faster than their accuracy (Hogarth and colleagues — Kind vs. wicked l…). They overfit to the noise of their own experience.

Radiologists provide a striking illustration. Arkes and colleagues found that when radiologists and non-radiologists interpreted X-rays with feedback, non-radiologists improved — but experienced radiologists actually declined in performance (Arkes et al. 1988 — Radiologist feedback p…). The feedback activated the radiologists' overfit mental models, causing them to attend more to cues that were diagnostic in their past experience but misleading in the current task. The effect size was d ≈ 0.61 for the decline in experienced radiologists (Arkes et al. 1988 — Radiologist feedback p…).

Rollwage and colleagues added a neural dimension to this picture. In an fMRI study with 60 participants performing a visual belief-updating task, overconfident individuals showed reduced activity in brain regions associated with computing uncertainty — the anterior insula and dorsolateral prefrontal cortex (Rollwage et al. 2020 — fMRI study of overc…). Overconfident participants updated their beliefs less with new evidence, not more. The proposed mechanism is that overconfidence reflects a form of neural overfitting: the brain computes a posterior that is too narrow, underweighting uncertainty and ignoring data variability (Rollwage et al. 2020 — fMRI study of overc…). While the sample is small and the finding needs replication, it suggests that overconfidence is not merely a motivational bias — it may be a computational one, rooted in how the brain represents and updates probability distributions.

The practical upshot is sobering. Expertise is not uniformly helpful or harmful — its value depends on the match between the environment where the expertise was acquired and the environment where it is being deployed. And Bayesian training, while valuable, is not a universal cognitive upgrade. It is a tool that works in some contexts and can backfire in others, particularly when it encourages formulaic application of base rates to genuinely novel situations where the base rate is itself the wrong reference class.


What this means for listeners: If you have recently learned about base rates and Bayesian reasoning, be alert to overcorrection. The question is not just 'What is the base rate?' but 'Is this the right base rate for this situation?' Genuine expertise means knowing when your reference class applies and when it does not.

Section 08

A Practitioner's Toolkit: Seven Protocols for Better Decisions Under Uncertainty

Everything we have covered — the implicit Bayesian paradox, the frequency format fix, overfitting in finance and medicine and sports, the calibration evidence, the overcorrection trap — converges on a set of concrete, evidence-grounded practices. Here is a protocol, drawn directly from the research, for improving your reasoning under uncertainty.

Protocol 1: Reformat Before You Reason. When facing any probabilistic decision, translate the information into natural frequencies before engaging your reasoning (Gigerenzer & Hoffrage 1995 — 'How to Impro…) (McDowell & Jacobs 2017 — Meta-analysis of…). "This test has a 5% false positive rate and the disease prevalence is 1%" becomes "Out of 1,000 people, 10 have the disease, and of the 990 who don't, about 50 will test positive anyway." This single step, supported by a meta-analytic effect size of d = 0.93, is the highest-leverage intervention in the entire Bayesian reasoning literature (McDowell & Jacobs 2017 — Meta-analysis of…).
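The reformatting step can be sketched in a few lines. This is a minimal illustration, not a clinical tool: the function name `natural_frequencies` is hypothetical, and the 95% sensitivity is an assumption borrowed from the physician problem at the start of the episode, since the sentence above only gives prevalence and false positive rate.

```python
def natural_frequencies(population, prevalence, sensitivity, false_positive_rate):
    """Restate a screening problem as counts of people instead of percentages."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity
    false_positives = healthy * false_positive_rate
    # Share of all positive tests that belong to people who are actually sick.
    ppv = true_positives / (true_positives + false_positives)
    return {
        "sick": sick,
        "healthy": healthy,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "ppv": ppv,
    }


# Assumed inputs: 1,000 people, 1% prevalence, 95% sensitivity, 5% false positives.
counts = natural_frequencies(1000, 0.01, 0.95, 0.05)
print(
    f"Of {counts['sick']:.0f} sick people, about {counts['true_positives']:.1f} test positive; "
    f"of {counts['healthy']:.0f} healthy people, about {counts['false_positives']:.1f} test positive."
)
print(f"Chance a positive result is a true positive: {counts['ppv']:.0%}")
```

Laying the counts side by side makes the point visible: the false positives from the large healthy group outnumber the true positives from the small sick group.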

Protocol 2: Name Your Base Rate — Then Stress-Test It. Before making a prediction, explicitly identify the relevant base rate and write it down (Tetlock & Gardner — Good Judgment Project;…). "What fraction of startups in this category, at this stage, with this funding level, succeed?" Then ask: is this the right reference class? Has the underlying environment changed since this base rate was established? The base rate is your prior; it is not your prison.

Protocol 3: Count Your Degrees of Freedom. When evaluating any model (a financial strategy, a hiring rubric, a product hypothesis), ask how many variants were tested before this one was selected (Royal Statistical Society, Significance Vo…) (Quantopian 888-Strategy Out-of-Sample Data…). Every additional variant tested inflates the apparent performance of the winner. If someone tested 500 strategies and shows you the best one, you are not necessarily looking at skill; you may be looking at the selection bias of a large search, where the winner is simply the luckiest draw.
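A small simulation shows how searching alone manufactures a winner. It is a sketch under stated assumptions: every "strategy" here is pure noise with zero true edge, and the daily volatility of 1%, the 252-day year, and the function name `best_of_n_sharpe` are illustrative choices, not values from the cited studies.

```python
import random


def annualized_sharpe(daily_returns):
    """Mean over sample standard deviation of daily returns, scaled to a year."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    variance = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return mean / variance ** 0.5 * 252 ** 0.5


def best_of_n_sharpe(n_strategies, n_days=252, seed=0):
    """Best backtest Sharpe among strategies that have no real edge at all."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_strategies):
        # Zero-mean noise: any apparent performance is luck by construction.
        daily = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        best = max(best, annualized_sharpe(daily))
    return best


print(f"one strategy tested:   {best_of_n_sharpe(1):.2f}")
print(f"best of 500 strategies: {best_of_n_sharpe(500):.2f}")
```

Because no strategy in the simulation has any skill, the gap between the two printed numbers is entirely the product of the search. That gap is what to discount when someone shows you only the winner.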

Protocol 4: Demand Out-of-Sample Evidence. Never trust a model's performance only on the data it was built with (Quantopian 888-Strategy Out-of-Sample Data…) (CFM (Capital Fund Management) Technical No…). Ask for out-of-time validation, holdout samples, or performance in genuinely different contexts. If the Epic Sepsis Model's performance varies across nine hospitals (Michigan Medicine / PubMed 34152373 — Epic…), expect your own model's performance to vary across markets. The Quantopian dataset showed that the correlation between backtest and live performance was weakest for the most-tuned strategies (Quantopian 888-Strategy Out-of-Sample Data…).
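The holdout idea can be sketched as a select-then-validate loop. Everything here is an assumption made for illustration: the strategies are simulated noise, the split is a simple first-half and second-half cut, and `select_then_validate` is a hypothetical name rather than anything taken from the Quantopian or CFM material.

```python
import random


def select_then_validate(n_strategies=200, n_days=250, seed=1):
    """Pick the best strategy on the first half, then score it on the second."""
    rng = random.Random(seed)
    half = n_days // 2
    in_sample_means = []
    out_of_sample_means = []
    for _ in range(n_strategies):
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        in_sample_means.append(sum(returns[:half]) / half)
        out_of_sample_means.append(sum(returns[half:]) / (n_days - half))
    # Selection uses only the in-sample half; the holdout half stays untouched.
    best = max(range(n_strategies), key=lambda i: in_sample_means[i])
    return best, in_sample_means[best], out_of_sample_means[best]


best, in_sample, out_of_sample = select_then_validate()
print(f"strategy #{best}: in-sample mean daily return {in_sample:+.5f}")
print(f"strategy #{best}: out-of-sample mean daily return {out_of_sample:+.5f}")
```

The in-sample figure is flattering by construction, because it was the criterion for selection. The out-of-sample figure is the only one that was not shaped by the search, which is why it is the number worth asking for.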

Protocol 5: Calibrate With Feedback, at the Right Frequency. Track your predictions. Assign explicit probability estimates. Compare them against outcomes. The evidence shows this works: in Arkes, Dawes and Christensen's training study, mean squared error fell from roughly 0.27 to roughly 0.14, and the gain held at six months (Arkes, Dawes & Christensen 1986 — Calibrat…). But Murphy and Winkler's meteorologist data suggest an important constraint on update frequency: weekly updates with aggregated data outperformed daily reactions to individual observations (Murphy & Winkler 1984 — Meteorologist upda…). Do not overfit your beliefs to the most recent data point.
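A prediction journal can be scored with very little code. The sketch below assumes forecasts are recorded as whole-number percentages paired with a 0 or 1 outcome; the journal entries are made up for illustration, and the ten-point buckets are one reasonable choice among several.

```python
from collections import defaultdict


def brier_score(journal):
    """Mean squared gap between stated probability and outcome (0 is perfect)."""
    return sum((pct / 100 - outcome) ** 2 for pct, outcome in journal) / len(journal)


def calibration_table(journal):
    """Group forecasts into 10-point confidence buckets.

    Returns {bucket_start_pct: (count, mean_stated_probability, observed_frequency)}.
    """
    buckets = defaultdict(list)
    for pct, outcome in journal:
        bucket_start = min(pct // 10, 9) * 10  # a 100% forecast joins the 90s bucket
        buckets[bucket_start].append((pct / 100, outcome))
    table = {}
    for start in sorted(buckets):
        entries = buckets[start]
        stated = sum(p for p, _ in entries) / len(entries)
        observed = sum(o for _, o in entries) / len(entries)
        table[start] = (len(entries), stated, observed)
    return table


# Hypothetical journal: (stated probability in percent, outcome 1 = happened).
journal = [(90, 1), (90, 1), (90, 0), (70, 1), (70, 0),
           (60, 1), (30, 0), (30, 0), (10, 0), (10, 1)]

print(f"Brier score: {brier_score(journal):.3f}")
for start, (count, stated, observed) in calibration_table(journal).items():
    print(f"{start:>3}%+ bucket: n={count}, stated {stated:.2f}, observed {observed:.2f}")
```

A well-calibrated forecaster sees stated and observed values converge within each bucket as the journal grows. Running the review on a week of accumulated entries, rather than after each outcome, keeps the comparison from reacting to single data points.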

Protocol 6: Prefer Simplicity Unless Data Strongly Justifies Complexity. The Gigerenzer less-is-more finding (Czerlinski, Gigerenzer & Goldstein 1999 —…), the Meehl clinical-versus-statistical finding (Grove & Meehl 2000 — Meta-analysis of clin…), and the Quantopian backtest-inflation finding (Quantopian 888-Strategy Out-of-Sample Data…) all point the same direction. In noisy environments with limited data, simpler models generalize better. Add variables and parameters only when you have strong out-of-sample evidence that the added complexity pays for itself.

Protocol 7: Know Your Environment Type. Hogarth's kind-versus-wicked framework is the master key (Hogarth and colleagues — Kind vs. wicked l…). In kind environments (clear rules, fast feedback, many repetitions), trust your experience and your complex models. In wicked environments (shifting rules, delayed feedback, small samples), distrust both. The most dangerous state is high confidence in a wicked environment — you are likely overfitting to noise and calling it insight.


A 6-Week Calibration Practice Protocol

Frequency reformat habit: Practice converting every probability you encounter into natural frequencies. Use icon arrays or frequency trees.

Prediction tracking: Start a prediction journal. Assign explicit 0–100% probabilities to 3–5 decisions per week. Record outcomes.

Base rate identification: For each prediction, explicitly name the reference class and base rate. Write down why this base rate applies (or doesn't).

Weekly calibration review: Review predictions weekly (not daily). Compare confidence buckets to outcome frequencies. Adjust.

Domain boundary mapping: Identify which domains your calibration is strong in and which it is not. Seek feedback from others in weak domains.

[Timeline chart; axis marks: W1, W3, W6, W9, W12]

Based on Arkes et al.'s calibration training research and Good Judgment Project practices. Start with format habits, layer in tracking, and build to domain-specific practice.

What this means for listeners: Pick one protocol and implement it this week. The highest-leverage starting point for most people is Protocol 1 (reformat before you reason) or Protocol 5 (start tracking your predictions with explicit probabilities). The research is clear that these are learnable skills, not fixed traits — but they require practice, not just knowledge.

Section 09

What We Still Don't Know: The Open Questions

Good science is honest about its boundaries, and this field has important ones.

The most significant open question is whether calibration training transfers across domains. Preliminary evidence suggests it often does not (Arkes, Dawes & Christensen 1986 — Calibrat…). A forecaster who achieves excellent calibration on geopolitical questions may show no improvement on technology or medical questions. The cross-platform calibration research on prediction markets reinforces this: calibration is domain-structured, not a single global skill (arXiv 2602.19520 — Cross-platform calibrat…). If this holds, the practical implication is that you need to calibrate separately in each domain you care about, which dramatically increases the investment required.

The overcorrection failure mode after Bayesian training is still based on thin evidence — primarily a single unnamed arXiv preprint and anecdotal reports from forecasting communities (Unnamed arXiv preprint on overcorrection a…). It is a plausible and important hypothesis, but it has not been rigorously tested in a controlled experimental design. We need studies that systematically measure what happens to decision quality after base-rate training, in domains where the correct answer requires weighting individuating information more heavily than the base rate.

The neural mechanisms of overfitting in human cognition are barely mapped. Rollwage's fMRI study of overconfidence is suggestive but small (n = 60) (Rollwage et al. 2020 — fMRI study of overc…). The broader predictive processing framework — Karl Friston's Free Energy Principle, which proposes that the brain minimizes prediction error through approximate Bayesian inference (Friston and colleagues — Free Energy Princ…) — is mathematically elegant but has been critiqued as more metaphor than mechanism ('Myth of the Bayesian Brain' critical lite…). The computations postulated by predictive processing may be tractable for simple perceptual models but have not been substantiated for the structured causal representations required for higher cognition ('Myth of the Bayesian Brain' critical lite…). Whether the brain is "actually Bayesian" at the neural level, or merely Bayesian-like at the computational level, remains genuinely unresolved.

Developmental timing is another frontier. Early research suggests that base rate neglect begins to appear in children around age six, coinciding with the development of more complex reasoning skills (Unnamed developmental psychology research…). If this is confirmed, it opens the possibility that calibration training could be most effective if introduced during this developmental window — before the System 2 overconfidence habits solidify.

Cross-cultural variation is understudied. A comparative study of individualist and collectivist cultures found that while base rate neglect was present across cultures, its degree varied, with collectivist cultures sometimes showing greater attention to distributional information (Unnamed cross-cultural comparative study —…). And exploratory work on indigenous knowledge systems has found that some traditions embed probabilistic reasoning concepts — including attention to base rates — in cultural practices and narratives (Unnamed exploratory study — Probabilistic…). These findings are preliminary and require more rigorous methodology, but they challenge the assumption that base rate neglect is a universal and invariant feature of human cognition.

Finally, the organizational dimension of overfitting is almost entirely unexplored in rigorous empirical terms. Strategy and innovation literature discusses "organizational overfitting": companies that become so specialized to their current environment that they fail to adapt to changes (Strategy/innovation literature — Organizat…). The MLB shift ban offers a sports example: a strategy tuned to one set of rules can lose its edge when the rules change. But we lack systematic research on how organizational decision-making structures amplify or mitigate the individual cognitive biases we have discussed. When a hospital's clinicians collectively learn to ignore a decision support system with a 96% override rate, is that organizational wisdom or organizational overfitting? The answer almost certainly depends on whether the alerts being overridden were genuinely irrelevant, and we rarely have the data to tell.

These are not reasons for despair. They are reasons for humility — and excellent topics for future episodes.


What this means for listeners: The honest state of the science is that we know calibration training works in the short term and within specific domains, but we do not yet know how to make it transfer broadly, how to prevent overcorrection, or how organizational structures interact with individual biases. Hold your Bayesian tools with appropriate uncertainty — which is, after all, the most Bayesian thing you can do.

Tier 2 · Empirical

Griffiths, Kemp & Tenenbaum — Hierarchical Bayesian modeling program: category learning, causal induction, and word learning (r = 0.85–0.92 correlation with Bayesian models)
Casscells, Schoenberger & Graboys 1978 — Physician Bayesian reasoning study (18% correct on standard probability problem)
Eddy 1982 — Physician Bayesian reasoning in mammography screening (5% correct, JAMA)
Hammerton 1973 — Physician Bayesian reasoning study (10% correct)
Gigerenzer & Hoffrage 1995 — 'How to Improve Bayesian Reasoning Without Instruction' (Psychological Review); frequency format breakthrough
Gigerenzer and colleagues — Systematic experimental replications of frequency format effects on base-rate neglect (76–92% correct in ecological conditions)

Tier 1 · Meta-analytic

McDowell & Jacobs 2017 — Meta-analysis of 35 studies on frequency format effects (weighted d = 0.93, Frontiers in Psychology)

Tier 2 · Empirical

Sloman et al. 1999 — Critique of frequency format hypothesis; argues transparent nested-set structure is the key mechanism
De Neys & Glumicic 2008 — Implicit conflict detection in base-rate neglect tasks (78% stereotypical responses despite increased System 2 processing)
Gigerenzer & Todd 1999 — Simple Heuristics That Make Us Smart; bias-variance tradeoff applied to human cognition
Czerlinski, Gigerenzer & Goldstein 1999 — Take-the-best heuristic tested across 20+ real-world datasets (Psychological Review)

Tier 1 · Meta-analytic

Grove & Meehl 2000 — Meta-analysis of clinical vs. actuarial prediction (~70-year replication record; statistical rules beat clinical judgment in ~60% of comparisons)

Tier 2 · Empirical

Quantopian 888-Strategy Out-of-Sample Dataset Study — Empirical evidence of backtest-to-live decay correlated with tuning intensity (SSRN abstract_id=2745220)

Tier 3 · Practitioner

AQR — 'Lies, Damned Lies, and Data Mining' practitioner essay on robustness vs. in-sample fit
Royal Statistical Society, Significance Vol. 18 Issue 6 p.22 — Backtest overfitting as finance's p-hacking analog
CFM (Capital Fund Management) Technical Note 2016 — In-sample overfitting pitfalls in data mining
David H. Bailey — Deflated Sharpe Ratio and overfit detection tools for quantitative finance

Tier 2 · Empirical

Hogarth and colleagues — Kind vs. wicked learning environments; confidence grows without skill in wicked environments
Michigan Medicine / PubMed 34152373 — Epic Sepsis Model evaluation across 9 hospitals (Jan 2020–Jun 2022); performance varied by hospital factors

Tier 4 · Trade press

Scientific American — IBM Watson for Oncology / MD Anderson partnership cancellation investigative report

Tier 1 · Meta-analytic

PMC4052586 — Systematic review of CDS alert override rates (49–96% range)

Tier 2 · Empirical

BMJ Digital Health e000083 — CDS alert override rate review (mid-40% to mid-90% across studies)
PMC6855857 — Inpatient e-prescribing observational study referencing 49–96% override rates

Tier 3 · Practitioner

AHRQ Clinical Decision Support Resource Page — Cites 49–96% override range with drug-drug interaction and allergy alert examples

Tier 4 · Trade press

The Analyst 2023 — MLB shift ban effects on team strategy and run prevention

Tier 2 · Empirical

arXiv 2411.15075 — Causal-inference analysis (synthetic control) of MLB shift ban; heterogeneous player-level effects
Tetlock & Gardner — Good Judgment Project; large-scale superforecasting tournament (2,000+ participants, 4 years, IARPA validation)

Tier 3 · Practitioner

Good Judgment White Paper 2022 — 'Forecasters Who Think Again Are More Accurate'; update frequency correlates with Brier score improvement

Tier 2 · Empirical

Arkes, Dawes & Christensen 1986 — Calibration training with immediate feedback; MSE reduced from ~0.27 to ~0.14; maintained at 6-month follow-up (n = 216)
Murphy & Winkler 1984 — Meteorologist update frequency and calibration (calibration score ~0.85 vs. 0.50 for untrained; weekly updates outperformed daily)

Tier 3 · Practitioner

Metaculus Calibration Notebook — Logistic/Platt scaling recalibration strategies and Brier-score evaluation
Metaculus Forecaster Count Analysis — 'More Is Probably More'; accuracy improves with more forecasters (diminishing returns)
Manifold Markets Official Calibration Dashboard — Brier-style metrics with methodological notes on trade-weighted vs. time-weighted calibration

Tier 4 · Trade press

Polysyncer Blog Analysis — 28,407 Polymarket markets (Jan 2024–May 2026); strong calibration reported (not peer-reviewed)

Tier 2 · Empirical

SSRN 5910522 — Large academic-style Polymarket study (hundreds of millions of trades); accuracy, skill, and bias analysis
arXiv 2602.19520 — Cross-platform calibration study (Kalshi and Polymarket); calibration is multidimensional and domain-structured

Tier 4 · Trade press

Unnamed arXiv preprint on overcorrection after Bayesian training; LessWrong/EA Forum discussions on reference class forecasting failure modes (no named authors/DOI)

Tier 2 · Empirical

Arkes et al. 1988 — Radiologist feedback paradox; experienced radiologists declined with feedback (d ≈ 0.61, Journal of Experimental Psychology)
Rollwage et al. 2020 — fMRI study of overconfidence and belief updating (eLife, n = 60); reduced uncertainty computation in overconfident individuals

Tier 3 · Practitioner

Friston and colleagues — Free Energy Principle and predictive processing / Bayesian brain framework
'Myth of the Bayesian Brain' critical literature — Argues predictive processing may be metaphor rather than mechanism for higher cognition

Tier 4 · Trade press

Unnamed developmental psychology research — Base rate neglect onset at approximately age 6 (no named authors/DOI)
Unnamed cross-cultural comparative study — Base rate neglect variation across individualist vs. collectivist cultures (no named authors/DOI)
Unnamed exploratory study — Probabilistic reasoning concepts in indigenous knowledge systems (no named authors/DOI)
Strategy/innovation literature — Organizational overfitting concept; companies over-specialized to current environment fail to adapt

Humans are exquisite implicit Bayesian reasoners (r = 0.85–0.92 in ecological tasks) yet catastrophically bad explicit Bayesian calculators (5–18% correct on decontextualized problems). The fix is reformatting information, not teaching math.

Across 70 years of clinical judgment research and quantitative finance backtests, adding complexity to a predictive model reliably backfires when data is noisy or small. Overfitting is the default outcome of optimization under insufficient data.

Calibration is a learnable skill: training works and holds at 6 months. But gains often don't transfer across domains, and overcorrection after Bayesian training is a real and underappreciated risk.

