Algorithms for Life: Bayes & Overfitting

Your brain scores 0.85–0.92 correlation with optimal Bayesian models before you can tie your shoes — yet Harvard-trained physicians get the same class of problem right only 5–18% of the time. This episode reveals why the format of information, not your intelligence, is the hidden lever behind almost every decision you make under uncertainty, and borrows the machine learning concept of overfitting to explain why adding complexity to your thinking is often just a more expensive way to be wrong.

Listen time: 26 min

Published: 29 May 2026

Episode 12


00:00 The toddler probability paradox
03:45 How our brains are both geniuses and failures
06:10 Why abstract percentages fool the brain
08:55 Natural frequencies unlock accurate reasoning
12:10 Your brain secretly detects the conflict
14:45 Is the brain truly Bayesian?
18:00 The bias-variance trade-off explained
21:30 Overfitting on Wall Street and in daily decisions
25:15 When simple rules beat complex models
29:00 Regime change breaks overfit models
33:20 Kind vs. wicked learning environments
37:00 Superforecasters and calibration training
41:10 The radiologist feedback paradox
43:45 Overconfidence as a neural failure
46:20 Three takeaways and your action plan

Transcript

So I want you to picture a scene for a second. Imagine a three-year-old child just sitting on the floor, and they're playing with this brand new, completely novel toy. Right, something they've never seen before. Exactly. It's this contraption that lights up, but only when you put certain wooden blocks on it in a very specific configuration. And within just a few minutes of trial and error, moving the blocks around, the kid effortlessly figures out the hidden causal structure. And what is truly staggering about that scenario isn't just that a toddler figures out a puzzle. It's, well, it's the underlying mathematics of their intuition. Yeah, the math is crazy here. It really is. When researchers like Griffiths, Kemp, and Tenenbaum study tasks like this, causal induction, word learning, category sorting, they found that a child's implicit guesses correlate with optimal Bayesian probability models at an astonishing 0.85 to 0.92. Which is basically near-perfect statistical inference. Exactly. Before you can even tie your shoes, your brain is quietly running a program that mirrors the most sophisticated probabilistic framework ever devised. But I want to immediately contrast that image with another one. So researchers walked into Harvard Medical School and handed fully licensed physicians a straightforward, explicit probability problem. Yes, the classic test. Right, they said, imagine a disease with a 1% prevalence in the population. You have a screening test for it that is 95% sensitive, meaning it catches the disease almost every time, and 95% specific, meaning it rarely gives a false alarm. Sounds pretty reliable, right? You would think. So they asked the doctors, if a patient tests positive, what is the probability they actually have the disease? And this is where the results are just consistently shocking. Yeah, only about 15% to 18% of the physicians got it right. 
According to the original Casscells, Schoenberger, and Graboys study, the correct answer is roughly 16%. 16%, wow. Because that 1% prevalence, the base rate, is so incredibly low that the false positives actually outnumber the true positives. Yet the vast majority of these highly trained doctors guessed 95%. They completely missed the reality of the math. And we see this replicated over and over. Eddy's 1982 study, published in JAMA, found only 5% of physicians nailed a similar problem regarding mammography screenings. 5%! I really want this paradox to land for you listening right now. Your brain is simultaneously one of the finest probability engines on the planet, and, well, one of the absolute worst mathematical calculators. It feels like a massive contradiction. I mean, how can we be geniuses in the playroom and then fail basic logic in the clinic? Right, how does that happen? Well, in cognitive science, this is explained by a level-of-analysis distinction known as Marr's hierarchy. OK, break that down for us. So at the computational level, which essentially asks what overarching problem the system is trying to solve in its natural environment, the child is brilliantly Bayesian. Their implicit competence is massive because they are interacting directly with the physical world. They're just doing it naturally. Exactly. But at the algorithmic level, when you sit a human down and force them to explicitly manipulate abstract symbolic probabilities on a piece of paper, the machinery just grinds to a halt. Which gives us the perfect roadmap for today. We are covering three big ideas in this deep dive. First, why your brain is secretly brilliant at probability and why the sheer format of the information changes everything. The format is so crucial. It really is. Second, we're borrowing a concept from machine learning called overfitting, which explains why our instinct to add more complexity to our decisions almost always backfires. Yeah, that's a huge one. 
And third, we'll explore whether humans can actually learn to reason better under uncertainty and exactly where that self-improvement project hits a wall. We are going to connect developmental psychology to Wall Street trading algorithms and hospital warning systems. It's a wide net, but the underlying mechanism is identical. So shifting from that medical school failure to our first big idea, the format issue. If we possess this incredible implicit probability machinery, why do we fail those textbook medical problems so dramatically? Like, why does the explicit brain just shut down? It comes down to evolutionary psychology and the work of Gerd Gigerenzer on something called base rate neglect. The explicit brain struggles because the format of the information is entirely alien to our cognitive architecture. Alien in what way? Well, conditional probabilities, like saying a test is 95% sensitive, are mathematical abstractions. They were invented a few hundred years ago. For the vast majority of human history, information did not arrive as percentages. We weren't hunting and gathering with pie charts. Exactly. It arrived as experienced sequential counts. Think about early humans trying to figure out where to hunt. Nobody is calculating that a certain river crossing has a 32.5% success rate. They're just remembering, like, we hunted at this river crossing 20 times, and we caught something maybe six times. Precisely. We evolved to process what researchers call natural frequencies. Gigerenzer and Hoffrage demonstrated that if you simply take that abstract Harvard medical problem and translate it into natural frequencies, the human brain snaps back to optimality. So you just reword the problem. Basically, yeah. Instead of giving doctors percentages, you say, out of 1,000 people, 10 have this disease. Of those 10, about nine or ten will get a positive test. Now, of the 990 who do not have the disease, about 50 will still get a false positive test. 
So out of everyone who just tested positive, how many actually have the disease? It just clicks. You can physically visualize the groups of people standing in a room. You see the handful of sick people in one corner and the much larger group of healthy people who got a false alarm in the other. And the performance data reflects that visual clarity. Just changing the wording of the problem raises correct responses from about 6% to 46%. That's a huge jump. And it gets better. When you move into ecologically valid conditions, where you give people visual aids to show those nested groups of people, accuracy climbs to an incredible 76% to 92%. The McDowell and Jacobs meta-analysis looked at 35 different studies on this exact phenomenon and found a massive, robust effect size of d equals 0.93, just from switching the format. But wait, hang on. If I'm looking at a textbook problem that says 1% prevalence and 95% accuracy, I am literally staring at the numbers. Does my brain just completely delete the base rate when looking at percentages? That is the twist. Your brain actually does see it. The cognitive scientist Wim De Neys and his colleague Glumicic ran a brilliant study in 2008 to test this exact question. What did they do? They gave people these tricky abstract base rate problems and subtly tracked the response times. And they found that even when people confidently gave the wrong stereotypical answer, which they did nearly 80% of the time, their response time significantly increased on the trials where the statistics clashed with their intuition. Oh, wow. Meaning their brain hesitated when the math didn't match the stereotype. Yes. The brain implicitly detects the conflict. There is measurable unconscious processing happening beneath the surface. The information is physically in your neural system. It's just that the abstract format fails to route it to your explicit reasoning. It gets trapped. 
Let me take the Griffiths and Tenenbaum position here because this is a massive debate in cognitive science. If the brain detects the mathematical conflict and three-year-olds are correlating at near-perfect levels with optimal models, I would argue the brain is actually performing Bayesian reasoning at its core. Right. The Bayesian brain hypothesis. Yeah. I mean, those correlations are simply too high and too consistent across totally different tasks to be a coincidence. Underneath it all, we are fundamentally Bayesian. That is a popular view. But there is a very strong counter-movement to that, often called the myth of the Bayesian brain critique. Okay. What's their argument? Just because a biological system produces an output that looks mathematically optimal doesn't mean it's literally computing those equations in its neural circuits. Think about a baseball outfielder catching a fly ball. Like tracking the trajectory. Exactly. They run in a very specific arc that keeps the optical angle of the ball constant in their vision. The outcome perfectly matches complex differential calculus. But the outfielder isn't doing calculus in their head. The Bayesian label might just be a highly convenient mathematical description for researchers to use rather than a literal neural mechanism. Okay. I can concede that distinction. Whether our neurons are literally running the Bayes equation or just doing a really convincing impression of it, the practical implication for you, the listener, is identical. You don't need to learn the complex math. You just need to reformat the information. That is the ultimate takeaway for this first section. Whenever you face a probabilistic decision, whether that's evaluating a business risk, deciphering a medical test result, or assessing a new hire, translate the abstract percentages you are given into natural frequencies. So don't ask your team, what is the 20% risk of failure? Right. 
Ask them, out of 100 projects exactly like this one, how many turn out badly? It forces the problem into the part of your brain equipped to actually handle it. So shifting the format from percentages to real numbers fixes the immediate problem. But it raises a deeper question. Why does our explicit brain panic when we feed it complex abstractions? Like why can't more information just naturally lead to better decisions? It's so counterintuitive, isn't it? It really is. And that actually brings us to a trap that both our brains and Wall Street algorithms fall into, the trap of complexity. To understand why complexity fails us, you have to understand a concept from machine learning called the bias-variance trade-off. Okay, what is that? Well, whenever you build a model to predict the future, whether it is an artificial intelligence algorithm or just a mental model in your own head, you face a fundamental tension. Simple models have high bias. They're blunt instruments, so they might miss some nuanced true patterns. But they have low variance, meaning they are very stable and reliable when you throw new unseen data at them. And complex models are the opposite. Exactly. They have low bias, meaning they capture every tiny little pattern in the data you show them. But they have high variance. They are incredibly unstable when the real world throws them a curveball. I like to think of it like buying a custom tailored suit. Imagine you go to a tailor on a day when you are incredibly bloated from eating a huge salty meal. That's a great analogy. If the tailor builds a highly complex, meticulously fitted suit based on exactly how your body looks that specific afternoon, the suit will fit you perfectly in that one specific moment in time. But the very next day, when your body returns to normal, the suit looks ridiculous. It was optimized for a temporary noisy state instead of your true underlying shape. That is a perfect way to explain overfitting. 
Overfitting is what happens when a complex model learns the random noise in its training data instead of the true underlying signal. It memorizes the bloat. It looks like an absolute genius on past data, but it completely falls apart tomorrow. And there is no better or more expensive example of this than the quantitative finance world. Take the Quantopian 888 strategy study. Oh, this is a fascinating one. Yeah, Quantopian was this crowd-sourced platform where amateur and professional quants could build and test stock trading algorithms. Researchers looked at 888 of these crowd-developed algorithms. They looked at the backtests, which is how well the algorithm would have performed on historical market data, and then they compared it to how the algorithm actually performed in live, out-of-sample markets with real money. The punchline of that study is a stark warning. The more someone tuned, tweaked, and optimized their strategy on historical data, the larger the gap between their simulated success and their real-world failure. So you're just over-optimized. Completely. The strategies that look the most brilliant in hindsight were the biggest disappointments going forward. Think about the last time you over-researched a major purchase, like a new car or a television. You read hundreds of reviews, you compared spreadsheets of minor technical specs, and you paralyzed yourself with variables that ultimately didn't matter to your actual enjoyment of the product. You were overfitting your decision. You memorized the noise of the marketing material instead of focusing on the one or two signals that actually mattered. Quantitative finance is hyper-aware of this human tendency now. They try to fight it using strict statistical adjustments. Like what? One is called the deflated Sharpe ratio, which statistically penalizes a trading strategy's apparent performance based on how many different variants the creator tested. 
If you test 500 complex strategies and pick the one that worked best, it is an illusion of skill. You just found the suit that perfectly matched the bloat. And we see this exact same less-is-more phenomenon in human psychology. In a series of famous studies published in Psychological Review, researchers Czerlinski, Gigerenzer, and Goldstein tested a radically simple decision rule called the take-the-best heuristic. It's almost shockingly simple. The rule is literally, when you are comparing two options, find the single most discriminating cue, use that to make your choice, and ignore literally everything else. They tested this simple rule against complex multiple regression models across 20 completely different real-world data sets, predicting school dropout rates, estimating city populations, you name it. And the results? The take-the-best heuristic matched or beat the complex math in 12 out of the 20 data sets when predicting new unseen data. But we have to push back on this slightly. I mean, is simplicity truly always better, or is that its own oversimplification? Doesn't the less-is-more argument cherry-pick its conditions a bit? Well, if you look at data-rich environments with perfectly stable rules, think of a chessboard or modern facial recognition software, complexity absolutely wins. Deep learning crushes simple human heuristics where you have millions of perfect training examples to feed the algorithm. The less-is-more effect is very real, but it has strict boundaries. That is completely fair, and it highlights the real takeaway here. The debate between simplicity and complexity isn't an abstract philosophical argument. It entirely depends on whether your environment provides enough reliable data to pay for the complexity you want to use. Exactly. In a small, noisy, unpredictable environment, complexity is just a much more expensive way to be wrong. Overfitting is the default outcome when you optimize a decision without sufficient data. 
So before you add complexity to a decision in your own life, like creating a 20-point rubric for hiring a new employee or building a multivariable spreadsheet for an investment, ask yourself how much reliable data you actually possess. Prefer simplicity unless your data strongly and undeniably justifies the complexity. That's a great rule of thumb. So we've established that simple models beat complex ones in noisy environments. But what happens when the environment itself fundamentally changes underneath your feet? This brings us to the concept of regime change, and the danger of fitting your decisions perfectly to history. This is where things get really dangerous. Look at the recent history of artificial intelligence in hospitals. Epic, the massive medical records company, built a highly complex clinical algorithm designed to flag patients at risk for sepsis. It looked fantastic on paper. But when an independent evaluation published in PubMed looked at its performance across nine different hospitals in the Michigan medicine system, the performance varied wildly. It worked in some hospitals and failed in others. This is a classic case of clinical overfitting. The model was trained in one specific context on a specific patient population with specific nursing workflows. When you deploy it in a new context, the underlying data distribution has shifted. The regime has changed, and the model degrades. We saw the exact same story with IBM Watson. They partnered with MD Anderson, literally one of the best cancer centers in the world, to build an oncology AI. And the partnership was eventually canceled. Because the system was so tightly overfit to the specific training data of MD Anderson's top experts that it could not handle the diverse, messy reality of live clinical workflows elsewhere. And the fallout of this type of overfitting is alarm fatigue. 
Systematic reviews show that doctors and nurses override clinical decision support alerts between 49 and 96 percent of the time. When you optimize a complex system for sensitivity, meaning you tune it to catch every single possible problem, you destroy its specificity. It cries wolf constantly, and humans simply tune it out. It is not just medicine either. Look at the 2023 Major League Baseball shift ban. This was a massive natural experiment in regime change. For years, analytically sophisticated baseball teams positioned their fielders based on incredibly complex models of where batters historically hit the ball. They overfit their entire defensive strategy to a very specific historical regime. Exactly. Then, the league changed the rules mid-game and banned extreme defensive shifts. Suddenly, the teams that had built an entire roster around those historical models exposed massive vulnerabilities. Some hitters benefited enormously, others barely noticed. The regime changed, and the overfit models broke. This tension between human judgment and algorithmic models forces a really important debate. When the environment is unstable, who do we trust? Honestly, I'm going to take the hardline pro-algorithm position here. Paul Meehl's legendary research, which was updated by the Grove and Meehl meta-analysis covering roughly 70 years of clinical data, shows that simple mechanical prediction rules beat clinical expert judgment in about 60% of direct comparisons. 60% is a solid majority. Right. When doctors are overriding hospital algorithms 96% of the time, maybe they were just being arrogant. Experience breeds confidence, but the data shows it doesn't always breed accuracy. I say we should default to the simple algorithm, even with its flaws. I see it quite differently, actually. Those clinical override rates of 49 to 96% might reflect incredibly good human judgment. 
If an algorithm has a massive false alarm rate, ignoring it is not human bias, it is the perfectly rational response to a broken tool. Human expertise is what catches the subtle shifting context that the algorithm is entirely blind to. Okay, that's a fair counterpoint. The resolution to this debate comes from Robin Hogarth's framework of kind versus wicked learning environments. That framework explains so much of human frustration. It really does. In kind environments where feedback is clear, immediate, and the rules of the game do not change, like playing chess or forecasting the weather, human experience translates beautifully into true, reliable expertise. Trust the human expert. But what about the wicked environments? In wicked environments, where feedback is noisy, delayed, or the rules shift mid-game, like the stock market or predicting long-term medical outcomes, experience just makes humans dangerously overconfident. In wicked environments, you should lean on simple algorithms, but you absolutely must keep the human in the loop to detect when the regime itself has changed. So what does this all mean for you listening right now? If most of the important decisions we make happen in wicked environments and our brains naturally overfit to noise, can we actually learn to predict the future better? Like, is decision-making under uncertainty a learnable skill? The short answer is yes, but it requires a very specific, deliberate kind of practice called calibration. Philip Tetlock's Good Judgment Project proved this at scale. They ran a massive forecasting tournament with over 2,000 people over four years. And they found these superforecasters, right? Yes. The top 2% of participants, who he called superforecasters, consistently beat professional, highly classified intelligence analysts at predicting global events. And what made them superforecasters wasn't that they had doctorates in geopolitics or secret insider knowledge. It was their metacognition. 
They updated their beliefs frequently, they approached new data with intellectual humility, and they were very comfortable using precise probabilistic language instead of vague terms like probably or maybe. They were exquisitely well calibrated. Calibration is just the alignment between your stated confidence and your actual accuracy. If you say you are 70% confident that an event will happen, you should be right about 70% of the time over the long run. And you can train this. A landmark study by Arkes in 1986 showed that you can actively train this skill. They gave people calibration training with immediate feedback on their predictions. The participants' error rates dropped by nearly half, from a mean squared error of about 0.27 down to 0.14. And impressively, that improvement held steady when they tested them again six months later. But there is a catch regarding how often you should update your beliefs. Murphy and Winkler studied weather forecasters who are famously well calibrated. But they found that meteorologists who updated their predictions on a weekly basis, using aggregated data, actually outperformed those who reacted daily to every single new observation. Reacting to every single minor data point is just another form of overfitting. Exactly. You don't want to chase the noise. But this raises a huge question. If calibration training works, do we just mandate that everyone learns Bayes' theorem, give them immediate feedback on every decision they make, and permanently solve human error? That leads to the ultimate twist of this deep dive. The same researcher, Arkes, ran another study in 1988 known as the Radiologist Feedback Paradox. The paradox. What happened? They gave both highly experienced radiologists and complete medical novices feedback on how well they were interpreting X-rays. The novices improved, exactly as you would expect. But the experienced radiologists actually got significantly worse. Wait, really? 
How does giving an expert accurate feedback make them worse at their own job? Because the feedback activated their overfit mental models. It caused the experts to second-guess themselves and double down on highly complex, nuanced cues that were diagnostic in their highly specific past experiences, but were actually misleading in the current, slightly different task. Their performance declined with an effect size of 0.61. Expertise in a wicked environment can become a cognitive trap. And it might even be wired deeply into our neurobiology. Rollwage and colleagues published an fMRI study looking at people's brains during a belief updating task. They found that highly overconfident people showed significantly reduced activity in the anterior insula and the prefrontal cortex when they were presented with new, conflicting evidence. That is fascinating. Overconfidence isn't just an ego trip or a personality flaw. It is a literal computational failure to process uncertainty. Your brain builds a model that is too narrow, and it physically stops computing the variance in the data. The implication for the listener is clear. Calibration is highly learnable, but it is deeply domain-specific. You should invest in practice, track your predictions, and score your accuracy. But hold the skill lightly. Being exceptionally well calibrated in one area of your life does not mean your intuition magically works in another. Let's bring this all the way back to the opening scene. Remember that three-year-old playing with the light-up toy, doing near-perfect statistical math in their head? And remember that Harvard physician completely failing a basic textbook probability problem? Same human brain. Same fundamental probability machinery. The entire difference was the format. The child was operating in a rich physical environment with natural frequencies. The physician was stranded in a clinical word problem full of abstract percentages. 
The lesson of this entire deep dive is that you do not need to become a better calculator. You need to become a better formatter of the information you consume, the decisions you structure, and the environments you choose to reason in. And that crystallizes into our three core takeaways. First, your brain is an exquisite implicit probability engine. So route your decisions through formats it can actually handle, especially natural frequencies. Second, overfitting is the default outcome of optimization when you do not have enough data. Always prefer simple rules unless the data strongly and undeniably justifies complexity. And third, calibration is learnable but domain-specific. Invest in practice, but recognize the sharp limits of your own expertise when operating in wicked environments. I want to leave the listener with one final thought to mull over. If human intuition is basically just an overfit model built on our past experiences, and we increasingly hand all our complex, wicked decisions over to artificial intelligence, what happens to us? Are we destined to just become regime change detectors for machines? Do we eventually lose our own ability to reason entirely, relying on the algorithm until the environment shifts? It is a profound question. We have to maintain the friction of making our own decisions to stay calibrated. Ultimately, the most Bayesian thing you can do is hold your Bayesian tools with appropriate uncertainty, which might be the most important sentence in this whole deep dive. This has been a UDOM research-pronounced Euro-Odomay research deep dive. Your call to action today is simple. Pick one of the protocols we covered and implement it this week. If you aren't sure where to start, go with protocol one, reformat before you reason. The next time you face a probability, a medical test result, a business risk, a hiring decision, translate it into natural frequencies before you decide. 
Out of every 100 cases like this, how many turn out this way? That single reframe is the highest leverage change you can make. Or if you want to go deeper, start protocol five. Track your predictions with explicit probabilities and grade yourself. For the full briefing, all the research citations, and the seven protocols written out, visit udom.ani. And if you know someone who makes decisions under uncertainty, which is everyone, share this deep dive with them. Until next time.

34 sources · 32 min read

Section 01

The Paradox Inside Your Head

Here is a fact that should stop you cold: a three-year-old child, shown a novel toy that lights up when certain blocks are placed on it, will infer the toy's causal structure in a way that almost perfectly matches the predictions of a Bayesian probabilistic model (Griffiths, Kemp & Tenenbaum — Hierarchical…). The correlation between the child's guesses and the optimal Bayesian solution is astonishingly high — between r = 0.85 and r = 0.92 across a range of tasks including category learning, causal induction, and word learning (Griffiths, Kemp & Tenenbaum — Hierarchical…). Your brain, before you could tie your shoes, was already running something that looks very much like the most sophisticated statistical inference framework ever devised.

Now here is a second fact: when physicians at Harvard Medical School were given a straightforward Bayesian reasoning problem — a disease with a 1% prevalence and a test that is 95% sensitive and 95% specific; what is the probability a positive result is a true positive? — only about 15% got the right answer (Casscells, Schoenberger & Graboys 1978 — P…). The correct answer is roughly 16%. The modal physician's answer was 95%. In a separate study, David Eddy found that only 5% of physicians arrived at the Bayesian solution to a similar mammography screening problem (Eddy 1982 — Physician Bayesian reasoning i…). And Hammerton, testing the same class of problem in 1973, found only 10% correct (Hammerton 1973 — Physician Bayesian reason…).
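The 16% figure can be checked directly with Bayes' theorem. The following is a minimal Python sketch, assuming the prevalence, sensitivity, and specificity exactly as stated above; the function name posterior_given_positive is illustrative and does not come from the cited studies.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Probability of disease given a positive test, via Bayes' theorem."""
    # Share of the whole population that is sick AND tests positive.
    true_positive = prevalence * sensitivity
    # Share of the whole population that is healthy AND tests positive.
    false_positive = (1 - prevalence) * (1 - specificity)
    # Among all positives, the fraction that are true positives.
    return true_positive / (true_positive + false_positive)


# Values from the problem as described above (1% prevalence, 95% / 95%).
ppv = posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(f"P(disease | positive) = {ppv:.1%}")  # prints 16.1%
```

The false-positive term (0.99 × 0.05 ≈ 0.0495) is about five times larger than the true-positive term (0.01 × 0.95 = 0.0095), which is why the answer sits near 16% rather than 95%.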

These two findings cannot both be true without qualification — and yet they are. Your brain is simultaneously one of the finest Bayesian inference engines on the planet and one of the worst Bayesian calculators. This is not a contradiction. It is a level-of-analysis distinction, and understanding it may be the single most important insight in modern cognitive science.

The resolution lies in what cognitive scientists call Marr's hierarchy (Griffiths, Kemp & Tenenbaum — Hierarchical…). The Bayesian cognition program — led by researchers like Tom Griffiths, Charles Kemp, and Josh Tenenbaum — operates at the computational level: it asks what problem the mind is solving and whether the outputs match what a Bayesian agent would produce. The base-rate neglect literature, by contrast, tests an explicit, decontextualized, single-trial problem requiring symbolic manipulation of conditional probabilities. These are not the same task. The child inferring causal structure is operating in a rich, ecologically valid environment full of repeated encounters and structured feedback. The physician staring at a word problem is stranded in precisely the environment where the implicit Bayesian machinery breaks down.

This distinction — implicit competence versus explicit failure — is the structural spine of everything that follows. Every domain we'll examine today, from hedge funds to hospitals to prediction markets, will force us back to the same question: are you routing your decision through a format your cognitive machinery can actually handle?

Your brain is simultaneously one of the finest Bayesian inference engines on the planet and one of the worst Bayesian calculators — only 5% of physicians solved a standard screening problem correctly.

The Bayesian Paradox: Implicit vs. Explicit Performance

Implicit Bayesian tasks (Griffiths, Kemp & Tenenbaum program): r = 0.85–0.92
Frequency-format Bayesian problems (Gigerenzer, most ecological condition): 92% correct
Standard frequency-format problems (Gigerenzer & Hoffrage 1995): 46% correct
Casscells et al. 1978 (physicians, probability format): 18% correct
Hammerton 1973 (physicians, probability format): 10% correct
Eddy 1982 (physicians, probability format): 5% correct

Implicit Bayesian tasks show near-optimal performance (r = 0.85–0.92 with Bayesian models), while explicit probability problems reveal catastrophic failure rates among trained professionals.

What this means for listeners: The goal is not to become a human calculator running Bayes' theorem in your head. It's to restructure how you encounter information — using frequency formats, visual displays, and repeated-encounter frames — so your powerful implicit Bayesian machinery can do the work it was built for.

Section 02

The Frequency Fix: Why Format Changes Everything

If the implicit Bayesian machinery is so good, why does it fail so spectacularly on textbook problems? The answer, championed most forcefully by Gerd Gigerenzer and colleagues, is deceptively simple: the format is wrong.

Consider the classic mammography screening problem. Presented in conditional probability format — "the sensitivity is 80%, the false positive rate is 9.6%, and the prevalence is 1%" — about 6% of participants arrive at the correct Bayesian posterior (Gigerenzer & Hoffrage 1995 — 'How to Impro…). Now present the same problem as natural frequencies: "Out of every 1,000 women, 10 have breast cancer. Of those 10, 8 will get a positive mammogram. Of the 990 who don't have cancer, about 95 will still get a positive mammogram. Of all the women who test positive, how many actually have cancer?" Suddenly, 46% of participants get it right (Gigerenzer & Hoffrage 1995 — 'How to Impro…). And in the most ecologically valid experimental condition — where the nested-set structure of the frequencies is made maximally transparent — the correct response rate climbs to 76%, and in some versions, 92% (Gigerenzer and colleagues — Systematic exp…).
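The translation from percentages to natural frequencies can be sketched in a few lines. This is an illustrative helper, not a procedure from Gigerenzer and Hoffrage; the function name and the choice of a 1,000-person reference population are assumptions. It uses the mammography numbers from the paragraph above.

```python
def natural_frequencies(population, prevalence, sensitivity, false_positive_rate):
    """Turn conditional probabilities into whole-person counts."""
    sick = round(population * prevalence)
    healthy = population - sick
    sick_positive = round(sick * sensitivity)
    healthy_positive = round(healthy * false_positive_rate)
    return sick, sick_positive, healthy, healthy_positive


sick, sick_pos, healthy, healthy_pos = natural_frequencies(
    population=1000, prevalence=0.01, sensitivity=0.80, false_positive_rate=0.096
)
share = sick_pos / (sick_pos + healthy_pos)

print(f"Out of 1000 women, {sick} have breast cancer; {sick_pos} of them test positive.")
print(f"Of the {healthy} without cancer, about {healthy_pos} also test positive.")
print(f"So {sick_pos} of {sick_pos + healthy_pos} positives are real: {share:.1%}")
```

Once the counts are on the table, the answer is a single division: 8 true positives out of 103 total positives, or a little under 8%.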

This is not a minor pedagogical tweak. A 2017 meta-analysis by McDowell and Jacobs, spanning 35 studies, found a robust weighted effect size of d = 0.93 for the frequency format advantage (McDowell & Jacobs 2017 — Meta-analysis of…). That is a large effect by any standard in psychology. The frequency format does not teach people Bayes' theorem — it plugs their existing implicit Bayesian competence into a representational format that the evolved cognitive architecture can actually process.

Why frequencies? The argument from ecological rationality is compelling (Gigerenzer & Hoffrage 1995 — 'How to Impro…). For most of human evolutionary history, probabilistic information arrived as experienced frequencies — how many times the berries in that patch made you sick, how many hunts at the river crossing succeeded — not as abstract percentages. The brain evolved frequency counters, not probability calculators. When you present information in the format the machinery was designed for, performance snaps back toward optimality.

But there is an important caveat. Sloman and colleagues have argued that the advantage is not about frequencies per se — it is about transparent nested-set structure (Sloman et al. 1999 — Critique of frequency…). When other formats make the set relationships equally clear (such as Euler diagrams or icon arrays), performance also improves. The frequency format works not because frequencies are magic, but because they naturally expose the part-whole relationships that Bayesian reasoning requires.

This debate matters practically because it tells us what to optimize: not the word "frequency" but the transparency of the nested structure. Whether you are a physician interpreting a screening test, a manager evaluating a hiring filter, or a founder assessing the base rate of startup failure in your category, the intervention is the same — reformat the information so the set relationships are visible.

The dual-process account adds a fascinating layer. De Neys and Glumicic found that even when participants gave the "biased" answer on base-rate neglect tasks, their response times increased on incongruent trials — trials where the stereotype conflicted with the base rate (De Neys & Glumicic 2008 — Implicit conflic…). The brain was registering the conflict. The base-rate information was available to the system; it just was not being leveraged by the explicit reasoning process. Participants chose the stereotypical response 78% of the time on incongruent trials, despite measurable increases in System 2 processing (De Neys & Glumicic 2008 — Implicit conflic…). The information is there. The format just fails to surface it.

A meta-analysis of 35 studies found a weighted effect size of d = 0.93 for the frequency format advantage — one of the largest and most reliable effects in the psychology of reasoning.

What this means for listeners: Whenever you face a probabilistic decision — a medical test result, a business risk assessment, a hiring decision — translate the numbers into natural frequencies before reasoning about them. 'Out of every 100 startups like mine, how many succeed?' is a better question than 'What is the probability of success?' The format change alone can increase your accuracy by a factor of eight.

Section 03

Overfitting: The Universal Failure Mode

Now we need to introduce the episode's second big idea, and it starts with a question that seems to belong in a machine learning textbook: what is overfitting?

Imagine you are trying to predict tomorrow's weather. You have a year of historical data — temperature, humidity, wind speed, barometric pressure. A simple model might use two variables and a linear equation. It will miss some days. A complex model might use twenty variables, polynomial interactions, and learned coefficients for every day of the week. On your historical data, the complex model will look brilliant — it fits every twist and turn of the past. But point it at tomorrow, a day it has never seen, and it crumbles. It has learned the noise of the past, not the signal of the future.

The technical term for this is the bias-variance tradeoff (Gigerenzer & Todd 1999 — Simple Heuristics…). Simple models have high bias (they miss true patterns) but low variance (they are stable across new data). Complex models have low bias (they capture true patterns and false ones alike) but high variance (they fluctuate wildly on new data). The sweet spot — the model that actually predicts well — lives in between.
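A toy simulation makes the variance side of this tradeoff concrete. It is a sketch under stated assumptions, not a model from the cited work: the data are synthetic, the true relationship is a straight line by construction, and the "complex" model is a hypothetical memorizer that simply returns the outcome of the nearest training point. All function names are illustrative.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line: the simple model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Nearest-point lookup: reproduces every training outcome, noise included."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(rng, n):
    # Assumed data-generating process: y = 2x plus Gaussian noise.
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 2000)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    train_err = mse(model, train_x, train_y)
    test_err = mse(model, test_x, test_y)
    print(f"{name:10s} train MSE = {train_err:.3f}   test MSE = {test_err:.3f}")
```

The memorizer scores a perfect zero on the data it has seen and does noticeably worse than the line on data it has not. Because the truth here is linear, the line pays no bias penalty; with a curved truth, the simple model would also miss real structure, which is the other half of the tradeoff.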

Here is the insight that makes this episode click: overfitting is not a pathology unique to algorithms. It is the default outcome of optimization under insufficient data, and it afflicts every prediction system — including the one between your ears.

Gigerenzer's "less-is-more" research program provides the most striking demonstration (Czerlinski, Gigerenzer & Goldstein 1999 —…). In a series of studies, he and his collaborators tested a radically simple decision rule called "take-the-best": when comparing two options, use only the single most discriminating cue, and ignore everything else. Tested across 20 real-world datasets (predicting city populations, school dropout rates, fish fertility, even homelessness), this absurdly simple heuristic matched or beat multiple logistic regression in 12 of the 20 on out-of-sample prediction (Czerlinski, Gigerenzer & Goldstein 1999 —…). On the training data, regression always won. On new data, simplicity often prevailed.

The mechanism was exactly overfitting. Regression, given many predictors and limited training data, fit the noise. The one-cue heuristic, by ignoring almost everything, could not overfit even if it tried. As Gigerenzer's team showed, the less-is-more effect was strongest when training sets were small (under 50 cases) and when features were correlated — precisely the conditions where overfitting risk is highest (Czerlinski, Gigerenzer & Goldstein 1999 —…).
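For readers who want to see how little machinery such a rule needs, here is a minimal sketch of a take-the-best style comparison. It assumes the common formulation in which cues are binary, already ranked from most to least valid, and checked in order until one discriminates. The cue names and city values are hypothetical, and the ranking step (which would be learned from training data) is left out.

```python
def take_the_best(option_a, option_b, cues_by_validity):
    """Pick 'a' or 'b' using the first cue that tells the options apart.

    Returns None when no cue discriminates, in which case one would guess.
    """
    for cue in cues_by_validity:
        value_a, value_b = option_a[cue], option_b[cue]
        if value_a != value_b:
            # Stop searching: every remaining cue is ignored.
            return "a" if value_a > value_b else "b"
    return None


# Hypothetical cues for a "which city is larger?" question, best cue first.
cues_by_validity = ["is_capital", "has_major_airport", "has_university"]
city_a = {"is_capital": 0, "has_major_airport": 1, "has_university": 1}
city_b = {"is_capital": 0, "has_major_airport": 0, "has_university": 1}

print(take_the_best(city_a, city_b, cues_by_validity))  # prints: a
```

There are no weights to estimate and no interactions to fit, so there is very little for noise in a small training set to distort.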

Paul Meehl saw this decades earlier. His foundational 1954 work on clinical versus statistical prediction, updated through Grove and Meehl's 2000 meta-analysis covering roughly 70 years of replication, showed that simple mechanical prediction rules beat clinical judgment in approximately 60% of direct comparisons (Grove & Meehl 2000 — Meta-analysis of clin…). The mechanism was not that clinicians were stupid — it was that they were complex. Each clinician weighted cues differently on different days, attended to memorable recent cases, and built elaborate internal models that fit the noise of their personal clinical experience. The simple formula, too dumb to overfit, generalized better.

This is the deep connection between the two halves of our episode. Bayesian reasoning is about updating beliefs with evidence. Overfitting is about what happens when you update too eagerly, with too much complexity, on too little data. The Bayesian paradox and the overfitting paradox are not separate phenomena — they are two views of the same underlying tension between learning signal and learning noise.

Across 70 years of replication, simple mechanical prediction rules beat clinical judgment in approximately 60% of direct comparisons — not because clinicians are stupid, but because they are complex.

When Simplicity Wins: The Data × Noise Decision

Abundant data, low noise (kind environment): Complex models shine. Use full regression / ML. Enough data to estimate parameters reliably; signal is clear.

Abundant data, high noise (wicked environment): Regularize heavily. Shrink, ensemble, validate out of sample (OOS). Data exists but noise demands complexity penalties.

Scarce data, low noise (kind environment): Simple rules compete. Consider heuristics alongside models. Small samples but clean signal; test both approaches.

Scarce data, high noise (wicked environment): Simple rules dominate. Use fast-and-frugal heuristics. Overfitting is near-certain with complex models; less is more.

Simple models outperform complex ones when data is scarce and noise is high — exactly the conditions most real-world decisions face. Complex models earn their keep only with abundant, clean data.

What this means for listeners: Before you add complexity to any decision model — a hiring rubric, an investment thesis, a product strategy — ask yourself: how much data am I actually working with, and how noisy is it? If the answer is 'not much' and 'very,' you are almost certainly better off with fewer variables and simpler rules. Complexity is not sophistication; sometimes it is just a more expensive way to be wrong.

Section 04

Wall Street's Expensive Lesson: 888 Strategies and the Graveyard of Backtests

If you want to see overfitting at industrial scale, with real money on the line, look at quantitative finance.

Quantopian, the crowdsourced quant platform, produced one of the most valuable datasets in the overfitting literature almost by accident. Researchers examined 888 crowd-developed trading algorithms that had both backtests and at least six months of genuine out-of-sample performance — live or paper trading with real market data the algorithm had never seen (Quantopian 888-Strategy Out-of-Sample Data…). The finding was exactly what overfitting theory predicts: the more backtesting and parameter tuning a quant had done, the larger the gap between backtest performance and out-of-sample performance (Quantopian 888-Strategy Out-of-Sample Data…). The strategies that looked most brilliant in hindsight were the ones most likely to disappoint in the future.

This is finance's version of the physician who fits an elaborate mental model to their clinical experience. The quant fits an elaborate algorithm to historical price data. Both feel like they are learning. Both are, to a significant degree, memorizing noise.

The finance industry knows this. AQR, one of the world's largest quant firms, has published practitioner-facing essays that say the quiet part out loud: many "great backtests" are not believed internally (AQR — 'Lies, Damned Lies, and Data Mining'…). Robustness is judged not by in-sample fit but by breadth of evidence — does the strategy work across time periods, asset classes, and geographies? A paper in Significance, the journal of the Royal Statistical Society, explicitly framed backtest overfitting as finance's analogue of p-hacking: the systematic under-reporting of the number of strategy variants tested, which inflates apparent performance (Royal Statistical Society, Significance Vo…). CFM, the French quant house Capital Fund Management, published a technical note documenting the systematic performance gap between in-sample and out-of-sample results when models are overfit (CFM (Capital Fund Management) Technical No…).
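The p-hacking analogy can be made concrete with a small simulation. This is a sketch with made-up parameters, not a reconstruction of any firm's analysis: it assumes 1,000 hypothetical strategies whose daily returns are pure noise with zero true edge, picks the 20 best by in-sample performance, and then checks how those same winners do on fresh data.

```python
import random


def average(values):
    values = list(values)
    return sum(values) / len(values)


def backtest_selection_gap(n_strategies=1000, n_days=250, top_k=20, seed=1):
    """Return (in-sample, out-of-sample) mean daily return of the top_k backtests."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_strategies):
        # Every strategy has zero true edge: both periods are noise around 0.
        in_sample = average(rng.gauss(0.0, 0.01) for _ in range(n_days))
        out_of_sample = average(rng.gauss(0.0, 0.01) for _ in range(n_days))
        results.append((in_sample, out_of_sample))
    # Select winners using ONLY the in-sample number, as a backtester would.
    winners = sorted(results, key=lambda pair: pair[0], reverse=True)[:top_k]
    return average(w[0] for w in winners), average(w[1] for w in winners)


avg_in, avg_out = backtest_selection_gap()
print(f"winners, in-sample mean daily return:     {avg_in:+.5f}")
print(f"winners, out-of-sample mean daily return: {avg_out:+.5f}")
```

The winners look profitable in the backtest purely because they were selected for it, and the apparent edge shrinks toward zero on new data. Reporting only the winner, without the count of variants tried, hides exactly this effect.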

The industry's anti-overfitting toolkit reads like a Bayesian's wish list: out-of-time splits with purge gaps to prevent data leakage; walk-forward analysis with rolling windows and parameter lock after selection; deflated Sharpe ratios that haircut performance for multiple testing; and parameter stability analysis that rejects strategies whose results depend on knife-edge parameter choices (David H. Bailey — Deflated Sharpe Ratio an…) (AQR — 'Lies, Damned Lies, and Data Mining'…). Bayesian ideas show up explicitly as shrinkage (partial pooling toward a prior), regime-uncertainty modeling, and sequential updating of signal estimates (AQR — 'Lies, Damned Lies, and Data Mining'…). Firms just rarely call it "Bayesian" in public because the specifics are core intellectual property.
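The first two items in that toolkit, out-of-time splits with a purge gap and walk-forward analysis, come down to careful index bookkeeping. The generator below is a minimal sketch under assumed window sizes; the function name and parameters are hypothetical, and real implementations differ in how they size the gap and roll the window.

```python
def walk_forward_splits(n_obs, train_size, test_size, purge_gap):
    """Yield (train_indices, test_indices) for rolling windows in time order.

    The purge gap leaves unused observations between train and test so that
    information from the test period cannot leak into training.
    """
    start = 0
    while start + train_size + purge_gap + test_size <= n_obs:
        train_end = start + train_size
        test_start = train_end + purge_gap
        yield range(start, train_end), range(test_start, test_start + test_size)
        # Roll forward by one test window; parameters chosen on the train
        # window would be frozen before the test window is scored.
        start += test_size


for train, test in walk_forward_splits(n_obs=20, train_size=8, test_size=4, purge_gap=2):
    print(f"train {train.start}-{train.stop - 1} | test {test.start}-{test.stop - 1}")
# prints:
# train 0-7 | test 10-13
# train 4-11 | test 14-17
```

Every test window sits strictly after its training window, which is the property that makes the resulting performance number an honest out-of-sample estimate.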

But here is the honest caveat: when you ask for documented cases where overfitting caused model failures at named hedge funds — Renaissance Technologies, Two Sigma, AQR, Bridgewater — those firms rarely publish "we overfit and blew up" narratives (AQR — 'Lies, Damned Lies, and Data Mining'…). What is well documented is the systematic pattern: backtest-to-live decay is the norm, not the exception; regime change is the practical failure mode that no amount of in-sample optimization can address; and the firms that survive longest are the ones most paranoid about their own backtests.

The parallel to human cognition is direct. Robin Hogarth's research on "kind" versus "wicked" learning environments provides the bridge (Hogarth and colleagues — Kind vs. wicked l…). In a kind environment — clear feedback, stable rules, many repetitions — both humans and algorithms learn well. Chess is kind. Weather forecasting is kind. In a wicked environment — delayed or misleading feedback, shifting rules, small samples — experience breeds confidence without breeding accuracy. Hogarth and colleagues found that in wicked learning environments, people often gain confidence without gaining skill (Hogarth and colleagues — Kind vs. wicked l…). Financial markets are the canonical wicked environment: the rules change, feedback is noisy and delayed, and the sample sizes that matter (true regime shifts) are tiny. The quant who backtests 500 strategies and picks the best one is doing exactly what the overconfident clinician does — overfitting to a biased sample of experience.

The more backtesting a quant had done, the larger the gap between backtest performance and real-world results — the strategies that looked most brilliant in hindsight were the most likely to disappoint.

What this means for listeners: If you are evaluating any model — a financial strategy, a business plan, a hiring rubric — ask how many alternatives were tested before this one was selected. The more variants tried, the less you should trust the winner's apparent performance. Demand out-of-sample evidence, and be deeply suspicious of any result that only works on the data it was built on.

Section 05

When the Game Changes: Hospitals, Baseball, and the Limits of Fitting History

The overfitting lens sharpens dramatically when we watch what happens after the rules change. Two domains — clinical medicine and professional baseball — provide near-perfect natural experiments.

Start with the Epic Sepsis Model, one of the most widely deployed clinical prediction algorithms in American hospitals. Epic, the dominant electronic health records company, built a proprietary model to flag patients at risk of sepsis. On paper, it looked useful. But when researchers at Michigan Medicine conducted an independent evaluation across nine hospitals between January 2020 and June 2022, they found something troubling: the model's performance varied significantly by hospital factors (Michigan Medicine / PubMed 34152373 — Epic…). A model trained in one context did not travel for free to another context. This is the clinical version of the Quantopian finding — a model overfit to its training distribution degrades when the distribution shifts.

The IBM Watson for Oncology story is even more instructive. MD Anderson Cancer Center, one of the world's premier cancer hospitals, partnered with IBM to deploy Watson as a clinical decision support tool. The partnership was canceled. As Scientific American reported in a detailed investigation, the failure was not merely an AUC issue — it was a failure of procurement, workflow integration, incentive alignment, and overpromising (Scientific American — IBM Watson for Oncol…). Watson had been trained on a limited set of cases and recommendation protocols; when deployed in the messy reality of diverse patient populations and clinical workflows, it produced recommendations that clinicians found unreliable or irrelevant.

And here is where the numbers get genuinely alarming. Systematic reviews of clinical decision support alert systems consistently find that clinicians override these alerts between 49% and 96% of the time (PMC4052586 — Systematic review of CDS aler…) (BMJ Digital Health e000083 — CDS alert ove…) (PMC6855857 — Inpatient e-prescribing obser…). That range is confirmed across multiple reviews and is cited by the Agency for Healthcare Research and Quality (AHRQ Clinical Decision Support Resource Pa…). Somewhere between roughly half and nearly all alerts are overridden.

But the interpretation is more nuanced than "doctors don't listen." High override rates can reflect low precision (too many false alarms), poor workflow timing, liability-driven over-alerting, or justified clinical judgment that the alert is irrelevant to this specific patient (PMC4052586 — Systematic review of CDS aler…). The system has overfit to population-level patterns that do not apply to the individual case in front of the clinician. Alarm fatigue — the clinical term for what happens when you cry wolf 96% of the time — is itself an overfitting failure: the alert system has optimized for sensitivity at the expense of specificity, and the humans in the loop have rationally learned to ignore it.

Now cross to baseball. When Major League Baseball banned extreme defensive shifts in 2023, it created a clean natural experiment in regime change (The Analyst 2023 — MLB shift ban effects o…). For years, analytically sophisticated teams had optimized defensive positioning based on historical batted-ball distributions — placing fielders where hitters had hit the ball in the past. This was a form of fitting a model to training data. Then the league changed the rules, altering the data-generating process itself. Causal-inference research using synthetic control methods found that the effects were heterogeneous: some hitters benefited enormously from the shift ban; others barely noticed (arXiv 2411.15075 — Causal-inference analys…). Teams that had built their roster construction and defensive strategy around shift-heavy run prevention were the most exposed — they had overfit their organizational strategy to a regime that no longer existed (The Analyst 2023 — MLB shift ban effects o…).

The lesson across both domains is the same one Hogarth identified in his kind-versus-wicked framework (Hogarth and colleagues — Kind vs. wicked l…): the quality of your model depends entirely on whether the environment that generated your training data still applies. An algorithm trained on one hospital's sepsis patterns may fail at another hospital. A defensive strategy optimized for one set of rules may collapse when the rules change. A physician's mental model, calibrated to decades of experience in one clinical setting, may misfire in a new one. Overfitting is not a failure of intelligence — it is a failure of environmental fit.

Clinicians override clinical decision support alerts between 49% and 96% of the time: roughly half to nearly all alerts are dismissed.

What this means for listeners: Ask yourself: has the game changed? Whether you are relying on a business model built for a pre-AI world, clinical intuitions trained before a new treatment protocol, or investment strategies backtested on a bygone interest-rate regime, the most dangerous assumption is that the future will look like the training data.

Section 06

The Calibration Gymnasium: Superforecasters, Prediction Markets, and Learning to Be Right

So overfitting is everywhere — in our heads, in our algorithms, in our organizations. Is there any hope? Can humans actually learn to reason better under uncertainty?

The most encouraging answer comes from Philip Tetlock's Good Judgment Project, the largest forecasting tournament ever conducted. Over four years, more than 2,000 participants made probabilistic predictions about geopolitical events for the U.S. Intelligence Community (Tetlock & Gardner — Good Judgment Project;…). The top 2% — the superforecasters — achieved calibration scores that consistently beat not just average participants but also professional intelligence analysts with access to classified information.

What set superforecasters apart was not domain expertise. It was a cluster of cognitive habits: comfort with probabilistic language, frequent updating of beliefs as evidence arrived, intellectual humility, and active information-seeking (Tetlock & Gardner — Good Judgment Project;…). A Good Judgment white paper analyzing data from the Good Judgment Open platform found that forecasters who updated their predictions more frequently achieved better Brier scores (Good Judgment White Paper 2022 — 'Forecast…). More thinking, more revision, more willingness to change your mind — these correlated with better accuracy.

But here is a crucial nuance: is this causal, or selection? It is entirely plausible that better forecasters update more because they are better, not that updating makes them better (Good Judgment White Paper 2022 — 'Forecast…). The correlation is real; the causal direction is not settled.

Calibration training — the practice of giving people feedback on the alignment between their confidence levels and their actual accuracy — has a strong evidence base. Arkes, Dawes, and Christensen trained participants using immediate feedback on their confidence-accuracy alignment and found that calibration error dropped significantly, from a mean squared error of roughly 0.27 to 0.14, and the improvement held at a six-month follow-up (Arkes, Dawes & Christensen 1986 — Calibrat…). Weather forecasters, who receive daily feedback on the accuracy of their probability estimates, are among the best-calibrated professional groups, with calibration scores around 0.85 compared to 0.50 for untrained forecasters (Murphy & Winkler 1984 — Meteorologist upda…). Murphy and Winkler's research also revealed an important detail about update frequency: meteorologists who updated beliefs at roughly weekly intervals with multiple data points showed better calibration than those who reacted to each new observation (Murphy & Winkler 1984 — Meteorologist upda…). Too-frequent updating is itself a form of overfitting — fitting your beliefs to the noise of individual data points rather than the signal of aggregated trends.

Prediction markets have become an extraordinary real-world laboratory for studying calibration at scale. Metaculus publishes calibration analyses using logistic recalibration and Brier-score evaluation (Metaculus Calibration Notebook — Logistic/…), and their data suggest that community forecast accuracy improves with more forecasters, with diminishing returns (Metaculus Forecaster Count Analysis — 'Mor…). Manifold Markets publishes a public calibration dashboard showing Brier-style summary metrics (Manifold Markets Official Calibration Dash…) — but their community has also surfaced a subtle problem: you can look well-calibrated without being informative if you "hug the base rate" and only trade on easy, late-resolving markets (Manifold Markets Official Calibration Dash…). Calibration without sharpness is the forecasting equivalent of a student who only answers questions they already know — technically accurate, but not useful.
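The difference between calibration and sharpness is easy to show with numbers. The sketch below uses a hypothetical set of 20 resolved questions and two made-up forecasters: one who always says 50%, and one who commits to 90% or 10%. Both are perfectly calibrated on this data, but their Brier scores differ sharply. None of this is taken from the Metaculus or Manifold analyses.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared gap between forecast probability and outcome (0 or 1)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def calibration_table(forecasts, outcomes):
    """Map each stated probability to the observed frequency of 'yes'."""
    buckets = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        buckets[f].append(o)
    return {f: sum(hits) / len(hits) for f, hits in sorted(buckets.items())}


# Hypothetical outcomes: 10 of the 20 questions resolve yes (1).
outcomes = [1] * 9 + [0] + [1] + [0] * 9
sharp = [0.9] * 10 + [0.1] * 10   # commits, and is right 90% of the time
hugger = [0.5] * 20               # always states the base rate

for name, forecasts in [("sharp", sharp), ("base-rate hugger", hugger)]:
    print(name, calibration_table(forecasts, outcomes),
          f"Brier = {brier_score(forecasts, outcomes):.2f}")
# sharp            -> 0.9 bucket resolves yes 90%, 0.1 bucket 10%; Brier = 0.09
# base-rate hugger -> 0.5 bucket resolves yes 50%; Brier = 0.25
```

A calibration chart alone would rate these two forecasters as equals. The Brier score, which is lower for better forecasts, separates them because it rewards committing to informative probabilities that turn out to be right.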

The largest-scale evidence comes from Polymarket. A recent large-sample analysis of 28,407 markets resolving between January 2024 and May 2026 reported strong calibration, with bucketed resolution rates closely tracking market prices (Polysyncer Blog Analysis — 28,407 Polymark…). A large academic-style SSRN study using hundreds of millions of trades analyzed accuracy, skill, and bias in Polymarket's data (SSRN 5910522 — Large academic-style Polyma…). And cross-platform calibration research comparing Kalshi and Polymarket trade data found that calibration is not a single global property — it is multidimensional and domain-structured, varying by topic area, market liquidity, and market design (arXiv 2602.19520 — Cross-platform calibrat…).

This last finding is critical for our episode's thesis. Calibration is real, learnable, and measurable — but it is not a universal upgrade. It is domain-contingent. The superforecaster who is brilliantly calibrated on geopolitical questions may be no better than chance on questions about technology adoption or pandemic progression. The prediction market that is well-calibrated on U.S. election outcomes may be poorly calibrated on cryptocurrency regulation.

Calibration training reduced error from 0.27 to 0.14 mean squared error, and the improvement held at a six-month follow-up — but gains often don't transfer to new domains.

Evidence Strength: Can Humans Learn Better Calibration?

Tier 1, meta-analytic (95% weight): Grove & Meehl (2000): Simple rules beat clinical judgment in ~60% of comparisons across 70 years of studies. McDowell & Jacobs (2017): Frequency format advantage d = 0.93 across 35 studies.

Tier 2, large field studies (80% weight): Tetlock's Good Judgment Project (2,000+ participants, 4 years): Top 2% superforecasters consistently outperform. Arkes et al. (1986): Calibration training holds at 6 months (n = 216).

Tier 3, practitioner / platform data (55% weight): Metaculus and Manifold calibration dashboards show measurable accuracy gains from aggregation. Polymarket 28,407-market analysis shows strong overall calibration. GJO white paper links update frequency to accuracy.

Tier 4, preliminary / single-source (25% weight): Overcorrection after Bayesian training (single arXiv preprint, no named authors). Cross-domain transfer failures (limited replication). LLM-assisted forecasting gains of 23–43% (single preliminary finding).

The evidence that calibration is trainable is strong, but evidence for cross-domain transfer and long-term maintenance is weaker. The overcorrection risk after Bayesian training is a preliminary finding based on limited evidence.

What this means for listeners: Invest in calibration practice — track your predictions, assign probabilities, and compare them against outcomes. But hold the skill lightly. Calibration does not transfer automatically across domains, and the best forecasters are the ones who know the boundaries of their own competence.

Section 07

The Overcorrection Trap and the Expertise Paradox

Here is where the story takes an unexpected turn. You might think, after everything we have discussed, that the prescription is simple: teach people Bayes' theorem, give them calibration training, and watch them improve. The research says: not so fast.

An emerging body of evidence points to an underappreciated failure mode: overcorrection. Early research indicates that individuals who have been educated on base rate neglect sometimes swing too far in the opposite direction, overly relying on base rates in scenarios where individuating information is actually more diagnostic (Unnamed arXiv preprint on overcorrection a…). This is the Bayesian training equivalent of a dieter who, having learned that overeating is bad, begins to starve. The cure introduces a new pathology.
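What "more diagnostic" means can be pinned down with the odds form of Bayes' rule: posterior odds equal prior odds times the likelihood ratio of the evidence. The sketch below is illustrative only; the 1% base rate and the two likelihood ratios are invented numbers, not figures from the preprint.

```python
def update_with_likelihood_ratio(prior, likelihood_ratio):
    """Odds-form Bayes update: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


base_rate = 0.01
weak = update_with_likelihood_ratio(base_rate, likelihood_ratio=2.0)
strong = update_with_likelihood_ratio(base_rate, likelihood_ratio=500.0)

print(f"weak evidence (LR = 2):     {weak:.3f}")    # stays close to the base rate
print(f"strong evidence (LR = 500): {strong:.3f}")  # mostly overrides the base rate
```

Ignoring the base rate is the classic error in the first case. Clinging to it is the overcorrection error in the second, where the case-specific evidence is strong enough that it should dominate the answer.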

Discussions in the rationalist and effective altruism forecasting communities have surfaced concrete examples of what they call "reference class forecasting gone wrong" — situations where disciplined application of base rates to genuinely novel situations produced worse predictions than case-specific reasoning would have (Unnamed arXiv preprint on overcorrection a…). A professional forecaster described a case where over-reliance on historical base rates, driven by recent Bayesian training, led to a significant forecasting error on a question where the relevant causal structure had changed (Unnamed arXiv preprint on overcorrection a…). The base rate was accurate for the old regime. It was misleading for the new one.

This connects directly to the expertise paradox that sits at the heart of the Meehl tradition. Experience in clinical assessment does not always lead to improved accuracy (Grove & Meehl 2000 — Meta-analysis of clin…). And yet Tetlock's superforecasters — people with genuine expertise in the metacognition of forecasting — dramatically outperform both novices and domain experts (Tetlock & Gardner — Good Judgment Project;…). How do we reconcile this?

The resolution requires decomposing "expertise" into at least three separable components: domain knowledge, pattern recognition skill, and confidence in one's own pattern recognition. In kind learning environments — where feedback is clear, timely, and representative — all three components align. The chess master's domain knowledge produces accurate pattern recognition, and their confidence is well-calibrated because every game provides unambiguous feedback. But in wicked environments, these components diverge dangerously. The clinician accumulates domain knowledge and pattern recognition skill, but because feedback is noisy and delayed, their confidence grows faster than their accuracy (Hogarth and colleagues — Kind vs. wicked l…). They overfit to the noise of their own experience.

Radiologists provide a striking illustration. Arkes and colleagues found that when radiologists and non-radiologists interpreted X-rays with feedback, non-radiologists improved — but experienced radiologists actually declined in performance (Arkes et al. 1988 — Radiologist feedback p…). The feedback activated the radiologists' overfit mental models, causing them to attend more to cues that were diagnostic in their past experience but misleading in the current task. The effect size was d ≈ 0.61 for the decline in experienced radiologists (Arkes et al. 1988 — Radiologist feedback p…).

Rollwage and colleagues added a neural dimension to this picture. In an fMRI study with 60 participants performing a visual belief-updating task, overconfident individuals showed reduced activity in brain regions associated with computing uncertainty — the anterior insula and dorsolateral prefrontal cortex (Rollwage et al. 2020 — fMRI study of overc…). Overconfident participants updated their beliefs less with new evidence, not more. The proposed mechanism is that overconfidence reflects a form of neural overfitting: the brain computes a posterior that is too narrow, underweighting uncertainty and ignoring data variability (Rollwage et al. 2020 — fMRI study of overc…). While the sample is small and the finding needs replication, it suggests that overconfidence is not merely a motivational bias — it may be a computational one, rooted in how the brain represents and updates probability distributions.

The practical upshot is sobering. Expertise is not uniformly helpful or harmful — its value depends on the match between the environment where the expertise was acquired and the environment where it is being deployed. And Bayesian training, while valuable, is not a universal cognitive upgrade. It is a tool that works in some contexts and can backfire in others, particularly when it encourages formulaic application of base rates to genuinely novel situations where the base rate is itself the wrong reference class.

What this means for listeners: If you have recently learned about base rates and Bayesian reasoning, be alert to overcorrection. The question is not just 'What is the base rate?' but 'Is this the right base rate for this situation?' Genuine expertise means knowing when your reference class applies and when it does not.

Section 08

A Practitioner's Toolkit: Seven Protocols for Better Decisions Under Uncertainty

Everything we have covered (the implicit Bayesian paradox, the frequency format fix, overfitting in finance, medicine, and sports, the calibration evidence, the overcorrection trap) converges on a set of concrete, evidence-grounded practices. Here are seven protocols, drawn directly from the research, for improving your reasoning under uncertainty.

Protocol 1: Reformat Before You Reason. When facing any probabilistic decision, translate the information into natural frequencies before engaging your reasoning (Gigerenzer & Hoffrage 1995 — 'How to Impro…) (McDowell & Jacobs 2017 — Meta-analysis of…). "This test has a 5% false positive rate and the disease prevalence is 1%" becomes "Out of 1,000 people, 10 have the disease, and of the 990 who don't, about 50 will test positive anyway." Even if the test catches all 10 true cases, only about 10 of the roughly 60 positives are real, which is around one in six. This single step, supported by a meta-analytic effect size of d = 0.93, is the highest-leverage intervention in the entire Bayesian reasoning literature (McDowell & Jacobs 2017 — Meta-analysis of…).
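The reformatting step can be sketched in a few lines of code. This is an illustrative helper, not something from the cited studies: the function name is hypothetical, and the 95% sensitivity is borrowed from the screening example at the start of the episode.

```python
def natural_frequencies(population, prevalence, sensitivity, false_positive_rate):
    """Restate a screening problem as head counts in an imagined population."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity               # sick people the test catches
    false_positives = healthy * false_positive_rate   # healthy people flagged anyway
    all_positives = true_positives + false_positives
    return {
        "sick": sick,
        "healthy": healthy,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "chance_sick_given_positive": true_positives / all_positives,
    }

counts = natural_frequencies(
    population=1000, prevalence=0.01, sensitivity=0.95, false_positive_rate=0.05
)
print(f"{counts['sick']:.0f} of 1,000 people have the disease")
print(f"about {counts['false_positives']:.0f} healthy people test positive anyway")
print(f"chance a positive is real: {counts['chance_sick_given_positive']:.0%}")
```

Seeing the counts side by side makes the answer hard to miss: the handful of true positives is swamped by the much larger group of healthy people who were flagged, so a positive result is real only about one time in six.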

Protocol 2: Name Your Base Rate — Then Stress-Test It. Before making a prediction, explicitly identify the relevant base rate and write it down (Tetlock & Gardner — Good Judgment Project;…). "What fraction of startups in this category, at this stage, with this funding level, succeed?" Then ask: is this the right reference class? Has the underlying environment changed since this base rate was established? The base rate is your prior; it is not your prison.

Protocol 3: Count Your Degrees of Freedom. When evaluating any model — a financial strategy, a hiring rubric, a product hypothesis — ask how many variants were tested before this one was selected (Royal Statistical Society, Significance Vo…) (Quantopian 888-Strategy Out-of-Sample Data…). Every additional variant tested inflates the apparent performance of the winner. If someone tested 500 strategies and shows you the best one, you are not looking at skill; you are looking at the survivorship bias of a large search.
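A small simulation shows how a large search manufactures a winner. This is a sketch under loud assumptions: every "strategy" here is pure noise with zero true edge, and the 250 trading days, 1% daily volatility, and random seed are arbitrary illustrative choices. Only the figure of 500 variants comes from the example above.

```python
import random
import statistics

def simulate_best_of_n(n_strategies=500, n_days=250, daily_sd=0.01, seed=0):
    """Pick the best backtest among strategies that have no real edge at all."""
    rng = random.Random(seed)
    best_backtest = float("-inf")
    best_followup = None
    for _ in range(n_strategies):
        # Assumption: returns are pure noise, both in the backtest and afterwards.
        backtest = [rng.gauss(0.0, daily_sd) for _ in range(n_days)]
        followup = [rng.gauss(0.0, daily_sd) for _ in range(n_days)]
        score = statistics.mean(backtest)
        if score > best_backtest:
            best_backtest = score
            best_followup = statistics.mean(followup)
    return best_backtest, best_followup

best_in, best_out = simulate_best_of_n()
print(f"winner's average daily return in the backtest: {best_in:.4%}")
print(f"same strategy on fresh data:                   {best_out:.4%}")
```

The winner's backtest looks impressive only because it is the luckiest of 500 draws. On fresh data the same strategy is just another draw of noise, which is why the number of variants tested matters as much as the performance of the one you are shown.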

Protocol 4: Demand Out-of-Sample Evidence. Never trust a model's performance only on the data it was built with (Quantopian 888-Strategy Out-of-Sample Data…) (CFM (Capital Fund Management) Technical No…). Ask for out-of-time validation, holdout samples, or performance in genuinely different contexts. If the Epic Sepsis Model's performance varies across nine hospitals (Michigan Medicine / PubMed 34152373 — Epic…), expect your business model's performance to vary across markets. The Quantopian dataset showed that the correlation between backtest and live performance was weakest for the most-tuned strategies (Quantopian 888-Strategy Out-of-Sample Data…).

Protocol 5: Calibrate With Feedback, at the Right Frequency. Track your predictions. Assign explicit probability estimates. Compare them against outcomes. The evidence shows this works: calibration error drops significantly with training and holds at six months (Arkes, Dawes & Christensen 1986 — Calibrat…). But Murphy and Winkler's meteorologist data suggests an important constraint on update frequency — weekly updates with aggregated data outperformed daily reactions to individual observations (Murphy & Winkler 1984 — Meteorologist upda…). Do not overfit your beliefs to the most recent data point.
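A prediction journal can be scored with very little machinery. The sketch below assumes a journal stored as (stated percent, outcome) pairs, where the outcome is 1 if the event happened and 0 if not. The journal entries and function names are made up for illustration. It computes a Brier score (the mean squared gap between stated probability and outcome, where lower is better) and a simple table comparing confidence buckets to outcome frequencies.

```python
from collections import defaultdict

def brier_score(journal):
    """Mean squared gap between stated probability and the 0/1 outcome."""
    return sum((percent / 100 - outcome) ** 2 for percent, outcome in journal) / len(journal)

def calibration_table(journal, bucket_width=20):
    """Group forecasts into confidence buckets and compare stated vs. observed."""
    buckets = defaultdict(list)
    for percent, outcome in journal:
        # Integer arithmetic keeps bucket edges exact; 100% joins the top bucket.
        low = min(percent // bucket_width * bucket_width, 100 - bucket_width)
        buckets[low].append((percent, outcome))
    rows = []
    for low in sorted(buckets):
        entries = buckets[low]
        stated = sum(p for p, _ in entries) / len(entries) / 100
        observed = sum(o for _, o in entries) / len(entries)
        rows.append((low, low + bucket_width, stated, observed, len(entries)))
    return rows

# Hypothetical journal: (stated probability in percent, 1 if it happened else 0).
journal = [(90, 1), (90, 1), (90, 0), (70, 1), (70, 0),
           (60, 1), (30, 0), (30, 1), (10, 0), (10, 0)]

print(f"Brier score: {brier_score(journal):.3f}")
for low, high, stated, observed, count in calibration_table(journal):
    print(f"{low:>3}-{high:<3}%  stated {stated:.0%}  happened {observed:.0%}  (n={count})")
```

In a well-calibrated journal the "stated" and "happened" columns track each other. A bucket where you said 90% but events came true noticeably less often is a sign of overconfidence, though with only a handful of entries per bucket, as in this toy journal, the comparison is far too noisy to act on.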

Protocol 6: Prefer Simplicity Unless Data Strongly Justifies Complexity. The Gigerenzer less-is-more finding (Czerlinski, Gigerenzer & Goldstein 1999 —…), the Meehl clinical-versus-statistical finding (Grove & Meehl 2000 — Meta-analysis of clin…), and the Quantopian backtest-inflation finding (Quantopian 888-Strategy Out-of-Sample Data…) all point the same direction. In noisy environments with limited data, simpler models generalize better. Add variables and parameters only when you have strong out-of-sample evidence that the added complexity pays for itself.
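The less-is-more pattern is easy to reproduce in a toy setting. This sketch is not a reconstruction of any cited study: the data-generating rule, noise level, sample sizes, and seed are all assumptions chosen for illustration. It compares a simple straight-line fit against a maximally flexible "memorizer" that answers every question with the closest example it has seen.

```python
import random

def make_data(rng, n, noise_sd=1.0):
    """Hypothetical world: y rises gently with x, buried in noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        points.append((x, 0.5 * x + rng.gauss(0.0, noise_sd)))
    return points

def fit_line(train):
    """Simple model: ordinary least-squares straight line (two parameters)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(train):
    """Complex model: repeat the outcome of the nearest past example."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared prediction error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(42)
train = make_data(rng, 50)     # the limited data the models are built on
test = make_data(rng, 2000)    # fresh data neither model has seen

line = fit_line(train)
memorizer = fit_memorizer(train)
for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name:<10} error on training data: {mse(model, train):.2f}   "
          f"error on fresh data: {mse(model, test):.2f}")
```

The memorizer is perfect on the data it was built with, because it has stored every noisy example, and that is exactly why it does worse on fresh data: it reproduces the noise along with the signal. The straight line cannot fit the training data perfectly, which is what protects it.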

Protocol 7: Know Your Environment Type. Hogarth's kind-versus-wicked framework is the master key (Hogarth and colleagues — Kind vs. wicked l…). In kind environments (clear rules, fast feedback, many repetitions), trust your experience and your complex models. In wicked environments (shifting rules, delayed feedback, small samples), distrust both. The most dangerous state is high confidence in a wicked environment — you are likely overfitting to noise and calling it insight.

A 6-Week Calibration Practice Protocol

Frequency reformat habit: Practice converting every probability you encounter into natural frequencies. Use icon arrays or frequency trees.
Prediction tracking: Start a prediction journal. Assign explicit 0–100% probabilities to 3–5 decisions per week. Record outcomes.
Base rate identification: For each prediction, explicitly name the reference class and base rate. Write down why this base rate applies (or doesn't).
Weekly calibration review: Review predictions weekly (not daily). Compare confidence buckets to outcome frequencies. Adjust.
Domain boundary mapping: Identify which domains your calibration is strong in and which it is not. Seek feedback from others in weak domains.

Timeline axis: W1, W3, W6, W9, W12

Based on Arkes et al.'s calibration training research and Good Judgment Project practices. Start with format habits, layer in tracking, and build to domain-specific practice.

What this means for listeners: Pick one protocol and implement it this week. The highest-leverage starting point for most people is Protocol 1 (reformat before you reason) or Protocol 5 (start tracking your predictions with explicit probabilities). The research is clear that these are learnable skills, not fixed traits — but they require practice, not just knowledge.

Section 09

What We Still Don't Know: The Open Questions

Good science is honest about its boundaries, and this field has important ones.

The most significant open question is whether calibration training transfers across domains. Preliminary evidence suggests it often does not (Arkes, Dawes & Christensen 1986 — Calibrat…). A forecaster who achieves excellent calibration on geopolitical questions may show no improvement on technology or medical questions. The cross-platform calibration research on prediction markets reinforces this: calibration is domain-structured, not a single global skill (arXiv 2602.19520 — Cross-platform calibrat…). If this holds, the practical implication is that you need to calibrate separately in each domain you care about, which dramatically increases the investment required.

The overcorrection failure mode after Bayesian training is still based on thin evidence — primarily a single unnamed arXiv preprint and anecdotal reports from forecasting communities (Unnamed arXiv preprint on overcorrection a…). It is a plausible and important hypothesis, but it has not been rigorously tested in a controlled experimental design. We need studies that systematically measure what happens to decision quality after base-rate training, in domains where the correct answer requires weighting individuating information more heavily than the base rate.

The neural mechanisms of overfitting in human cognition are barely mapped. Rollwage's fMRI study of overconfidence is suggestive but small (n = 60) (Rollwage et al. 2020 — fMRI study of overc…). The broader predictive processing framework — Karl Friston's Free Energy Principle, which proposes that the brain minimizes prediction error through approximate Bayesian inference (Friston and colleagues — Free Energy Princ…) — is mathematically elegant but has been critiqued as more metaphor than mechanism ('Myth of the Bayesian Brain' critical lite…). The computations postulated by predictive processing may be tractable for simple perceptual models but have not been substantiated for the structured causal representations required for higher cognition ('Myth of the Bayesian Brain' critical lite…). Whether the brain is "actually Bayesian" at the neural level, or merely Bayesian-like at the computational level, remains genuinely unresolved.

Developmental timing is another frontier. Early research suggests that base rate neglect begins to appear in children around age six, coinciding with the development of more complex reasoning skills (Unnamed developmental psychology research…). If this is confirmed, it opens the possibility that calibration training could be most effective if introduced during this developmental window — before the System 2 overconfidence habits solidify.

Cross-cultural variation is understudied. A comparative study of individualist and collectivist cultures found that while base rate neglect was present across cultures, its degree varied, with collectivist cultures sometimes showing greater attention to distributional information (Unnamed cross-cultural comparative study —…). And exploratory work on indigenous knowledge systems has found that some traditions embed probabilistic reasoning concepts — including attention to base rates — in cultural practices and narratives (Unnamed exploratory study — Probabilistic…). These findings are preliminary and require more rigorous methodology, but they challenge the assumption that base rate neglect is a universal and invariant feature of human cognition.

Finally, the organizational dimension of overfitting is almost entirely unexplored in rigorous empirical terms. Strategy and innovation literature discusses "organizational overfitting" — companies that become so specialized to their current environment that they fail to adapt to changes (Strategy/innovation literature — Organizat…). The MLB shift ban is a sports example of this. But we lack systematic research on how organizational decision-making structures amplify or mitigate the individual cognitive biases we have discussed. When a hospital's clinicians collectively learn to ignore a decision support system with a 96% override rate, is that organizational wisdom or organizational overfitting? The answer almost certainly depends on whether the alerts being overridden were genuinely irrelevant — and we rarely have the data to tell.

These are not reasons for despair. They are reasons for humility — and excellent topics for future episodes.

What this means for listeners: The honest state of the science is that we know calibration training works in the short term and within specific domains, but we do not yet know how to make it transfer broadly, how to prevent overcorrection, or how organizational structures interact with individual biases. Hold your Bayesian tools with appropriate uncertainty — which is, after all, the most Bayesian thing you can do.

Tier 2 · Empirical

Griffiths, Kemp & Tenenbaum — Hierarchical Bayesian modeling program: category learning, causal induction, and word learning (r = 0.85–0.92 correlation with Bayesian models)
Casscells, Schoenberger & Graboys 1978 — Physician Bayesian reasoning study (18% correct on standard probability problem)
Eddy 1982 — Physician Bayesian reasoning in mammography screening (5% correct, JAMA)
Hammerton 1973 — Physician Bayesian reasoning study (10% correct)
Gigerenzer & Hoffrage 1995 — 'How to Improve Bayesian Reasoning Without Instruction' (Psychological Review); frequency format breakthrough
Gigerenzer and colleagues — Systematic experimental replications of frequency format effects on base-rate neglect (76–92% correct in ecological conditions)

Tier 1 · Meta-analytic

McDowell & Jacobs 2017 — Meta-analysis of 35 studies on frequency format effects (weighted d = 0.93, Frontiers in Psychology)

Tier 2 · Empirical

Sloman et al. 1999 — Critique of frequency format hypothesis; argues transparent nested-set structure is the key mechanism
De Neys & Glumicic 2008 — Implicit conflict detection in base-rate neglect tasks (78% stereotypical responses despite increased System 2 processing)
Gigerenzer & Todd 1999 — Simple Heuristics That Make Us Smart; bias-variance tradeoff applied to human cognition
Czerlinski, Gigerenzer & Goldstein 1999 — Take-the-best heuristic tested across 20+ real-world datasets (Psychological Review)

Tier 1 · Meta-analytic

Grove & Meehl 2000 — Meta-analysis of clinical vs. actuarial prediction (~70-year replication record; statistical rules beat clinical judgment in ~60% of comparisons)

Tier 2 · Empirical

Quantopian 888-Strategy Out-of-Sample Dataset Study — Empirical evidence of backtest-to-live decay correlated with tuning intensity (SSRN abstract_id=2745220)

Tier 3 · Practitioner

AQR — 'Lies, Damned Lies, and Data Mining' practitioner essay on robustness vs. in-sample fit
Royal Statistical Society, Significance Vol. 18 Issue 6 p.22 — Backtest overfitting as finance's p-hacking analog
CFM (Capital Fund Management) Technical Note 2016 — In-sample overfitting pitfalls in data mining
David H. Bailey — Deflated Sharpe Ratio and overfit detection tools for quantitative finance

Tier 2 · Empirical

Hogarth and colleagues — Kind vs. wicked learning environments; confidence grows without skill in wicked environments
Michigan Medicine / PubMed 34152373 — Epic Sepsis Model evaluation across 9 hospitals (Jan 2020–Jun 2022); performance varied by hospital factors

Tier 4 · Trade press

Scientific American — IBM Watson for Oncology / MD Anderson partnership cancellation investigative report

Tier 1 · Meta-analytic

PMC4052586 — Systematic review of CDS alert override rates (49–96% range)

Tier 2 · Empirical

BMJ Digital Health e000083 — CDS alert override rate review (mid-40% to mid-90% across studies)
PMC6855857 — Inpatient e-prescribing observational study referencing 49–96% override rates

Tier 3 · Practitioner

AHRQ Clinical Decision Support Resource Page — Cites 49–96% override range with drug-drug interaction and allergy alert examples

Tier 4 · Trade press

The Analyst 2023 — MLB shift ban effects on team strategy and run prevention

Tier 2 · Empirical

arXiv 2411.15075 — Causal-inference analysis (synthetic control) of MLB shift ban; heterogeneous player-level effects
Tetlock & Gardner — Good Judgment Project; large-scale superforecasting tournament (2,000+ participants, 4 years, IARPA validation)

Tier 3 · Practitioner

Good Judgment White Paper 2022 — 'Forecasters Who Think Again Are More Accurate'; update frequency correlates with Brier score improvement

Tier 2 · Empirical

Arkes, Dawes & Christensen 1986 — Calibration training with immediate feedback; MSE reduced from ~0.27 to ~0.14; maintained at 6-month follow-up (n = 216)
Murphy & Winkler 1984 — Meteorologist update frequency and calibration (calibration score ~0.85 vs. 0.50 for untrained; weekly updates outperformed daily)

Tier 3 · Practitioner

Metaculus Calibration Notebook — Logistic/Platt scaling recalibration strategies and Brier-score evaluation
Metaculus Forecaster Count Analysis — 'More Is Probably More'; accuracy improves with more forecasters (diminishing returns)
Manifold Markets Official Calibration Dashboard — Brier-style metrics with methodological notes on trade-weighted vs. time-weighted calibration

Tier 4 · Trade press

Polysyncer Blog Analysis — 28,407 Polymarket markets (Jan 2024–May 2026); strong calibration reported (not peer-reviewed)

Tier 2 · Empirical

SSRN 5910522 — Large academic-style Polymarket study (hundreds of millions of trades); accuracy, skill, and bias analysis
arXiv 2602.19520 — Cross-platform calibration study (Kalshi and Polymarket); calibration is multidimensional and domain-structured

Tier 4 · Trade press

Unnamed arXiv preprint on overcorrection after Bayesian training; LessWrong/EA Forum discussions on reference class forecasting failure modes (no named authors/DOI)

Tier 2 · Empirical

Arkes et al. 1988 — Radiologist feedback paradox; experienced radiologists declined with feedback (d ≈ 0.61, Journal of Experimental Psychology)
Rollwage et al. 2020 — fMRI study of overconfidence and belief updating (eLife, n = 60); reduced uncertainty computation in overconfident individuals

Tier 3 · Practitioner

Friston and colleagues — Free Energy Principle and predictive processing / Bayesian brain framework
'Myth of the Bayesian Brain' critical literature — Argues predictive processing may be metaphor rather than mechanism for higher cognition

Tier 4 · Trade press

Unnamed developmental psychology research — Base rate neglect onset at approximately age 6 (no named authors/DOI)
Unnamed cross-cultural comparative study — Base rate neglect variation across individualist vs. collectivist cultures (no named authors/DOI)
Unnamed exploratory study — Probabilistic reasoning concepts in indigenous knowledge systems (no named authors/DOI)
Strategy/innovation literature — Organizational overfitting concept; companies over-specialized to current environment fail to adapt

Humans are exquisite implicit Bayesian reasoners (r = 0.85–0.92 in ecological tasks) yet catastrophically bad explicit Bayesian calculators (5–18% correct on decontextualized problems); the fix is reformatting information, not teaching math.
Across 70 years of clinical judgment research and quantitative finance backtests, adding complexity to a predictive model reliably backfires when data is noisy or small; overfitting is the default outcome of optimization under insufficient data.
Calibration is a learnable skill (training works and holds at 6 months), but gains often don't transfer across domains, and overcorrection after Bayesian training is a real and underappreciated risk.
