So I want you to picture a scene for a second. Imagine a three-year-old child just sitting on the floor, and they're playing with this brand new, completely novel toy. Right, something they've never seen before. Exactly. It's this contraption that lights up, but only when you put certain wooden blocks on it in a very specific configuration. And within just a few minutes of trial and error, moving the blocks around, the kid effortlessly figures out the hidden causal structure. And what is truly staggering about that scenario isn't just that a toddler figures out a puzzle. It's, well, it's the underlying mathematics of their intuition. Yeah, the math is crazy here. It really is. When researchers like Griffiths, Kemp, and Tenenbaum study tasks like this, causal induction, word learning, category sorting, they find that a child's implicit guesses correlate with optimal Bayesian probability models at an astonishing 0.85 to 0.92. Which is basically near-perfect statistical inference. Exactly. Before you can even tie your shoes, your brain is quietly running a program that mirrors the most sophisticated probabilistic framework ever devised. But I want to immediately contrast that image with another one. So researchers walked into Harvard Medical School and handed fully licensed physicians a straightforward, explicit probability problem. Yes, the classic test. Right, they said, imagine a disease with a 1% prevalence in the population. You have a screening test for it that is 95% sensitive, meaning it catches the disease almost every time, and 95% specific, meaning it rarely gives a false alarm. Sounds pretty reliable, right? You would think. So they asked the doctors, if a patient tests positive, what is the probability they actually have the disease? And this is where the results are just consistently shocking. Yeah, only about 15% to 18% of the physicians got it right.
According to the original Casscells, Schoenberger, and Graboys study, the correct answer is roughly 16%. 16%, wow. Because that 1% prevalence, the base rate, is so incredibly low that the false positives actually outnumber the true positives. Yet the vast majority of these highly trained doctors guessed 95%. They completely missed the reality of the math. And we see this replicated over and over. Eddy's 1982 study, published in JAMA, found only 5% of physicians nailed a similar problem regarding mammography screenings. 5%! I really want this paradox to land for you listening right now. Your brain is simultaneously one of the finest probability engines on the planet, and, well, one of the absolute worst mathematical calculators. It feels like a massive contradiction. I mean, how can we be geniuses in the playroom and then fail basic logic in the clinic? Right, how does that happen? Well, in cognitive science, this is explained by a levels-of-analysis distinction known as Marr's hierarchy. OK, break that down for us. So at the computational level, which essentially asks what overarching problem the system is trying to solve in its natural environment, the child is brilliantly Bayesian. Their implicit competence is massive because they are interacting directly with the physical world. They're just doing it naturally. Exactly. But at the algorithmic level, when you sit a human down and force them to explicitly manipulate abstract symbolic probabilities on a piece of paper, the machinery just grinds to a halt. Which gives us the perfect roadmap for today. We are covering three big ideas in this deep dive. First, why your brain is secretly brilliant at probability and why the sheer format of the information changes everything. The format is so crucial. It really is. Second, we're borrowing a concept from machine learning called overfitting, which explains why our instinct to add more complexity to our decisions almost always backfires. Yeah, that's a huge one.
And third, we'll explore whether humans can actually learn to reason better under uncertainty and exactly where that self-improvement project hits a wall. We are going to connect developmental psychology to Wall Street trading algorithms and hospital warning systems. It's a wide net, but the underlying mechanism is identical. So shifting from that medical school failure to our first big idea, the format issue. If we possess this incredible implicit probability machinery, why do we fail those textbook medical problems so dramatically? Like, why does the explicit brain just shut down? It comes down to evolutionary psychology and the work of Gerd Gigerenzer on something called base rate neglect. The explicit brain struggles because the format of the information is entirely alien to our cognitive architecture. Alien in what way? Well, conditional probabilities, like saying a test is 95% sensitive, are mathematical abstractions. They were invented a few hundred years ago. For the vast majority of human history, information did not arrive as percentages. We weren't hunting and gathering with pie charts. Exactly. It arrived as experienced sequential counts. Think about early humans trying to figure out where to hunt. Nobody is calculating that a certain river crossing has a 32.5% success rate. They're just remembering, like, we hunted at this river crossing 20 times, and we caught something maybe six times. Precisely. We evolved to process what researchers call natural frequencies. Gigerenzer and Hoffrage demonstrated that if you simply take that abstract Harvard medical problem and translate it into natural frequencies, the human brain snaps back to optimality. So you just reword the problem. Basically, yeah. Instead of giving doctors percentages, you say, out of 1,000 people, 10 have this disease. Of those 10, about nine or ten will get a positive test. Now, of the 990 who do not have the disease, about 50 will still get a false positive test.
So out of everyone who just tested positive, how many actually have the disease? It just clicks. You can physically visualize the groups of people standing in a room. You see the handful of sick people in one corner and the much larger group of healthy people who got a false alarm in the other. And the performance data reflects that visual clarity. Just changing the wording of the problem raises correct responses from about 6% to 46%. That's a huge jump. And it gets better. When you move into ecologically valid conditions, where you give people visual aids to show those nested groups of people, accuracy climbs to an incredible 76% to 92%. The McDowell and Jacobs meta-analysis looked at 35 different studies on this exact phenomenon and found a massive, robust effect size of d equals 0.93, just from switching the format. But wait, hang on. If I'm looking at a textbook problem that says 1% prevalence and 95% accuracy, I am literally staring at the numbers. Does my brain just completely delete the base rate when looking at percentages? That is the twist. Your brain actually does see it. The cognitive scientist Wim De Neys and his colleague Glumicic ran a brilliant study in 2008 to test this exact question. What did they do? They gave people these tricky abstract base rate problems and subtly tracked the response times. And they found that even when people confidently gave the wrong stereotypical answer, which they did nearly 80% of the time, their response time significantly increased on the trials where the statistics clashed with their intuition. Oh, wow. Meaning their brain hesitated when the math didn't match the stereotype. Yes. The brain implicitly detects the conflict. There is measurable unconscious processing happening beneath the surface. The information is physically in your neural system. It's just that the abstract format fails to route it to your explicit reasoning. It gets trapped.
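The screening problem discussed above can be worked two ways: with Bayes' rule on the percentages, and as head counts out of 1,000 people. The sketch below is only an illustration; the function names are made up for this example, and the inputs are the figures quoted in the conversation (1% prevalence, 95% sensitivity, 95% specificity).

```python
def posterior_from_percentages(prevalence, sensitivity, specificity):
    """Bayes' rule: probability of disease given a positive test."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


def posterior_from_counts(population, prevalence, sensitivity, specificity):
    """The same calculation phrased as groups of people (natural frequencies)."""
    sick = population * prevalence
    healthy = population - sick
    sick_and_positive = sick * sensitivity
    healthy_and_positive = healthy * (1 - specificity)
    share_sick = sick_and_positive / (sick_and_positive + healthy_and_positive)
    return sick_and_positive, healthy_and_positive, share_sick


p = posterior_from_percentages(0.01, 0.95, 0.95)
sick_pos, healthy_pos, p_counts = posterior_from_counts(1000, 0.01, 0.95, 0.95)

print(f"Bayes' rule on percentages: {p:.1%}")
print(f"Head counts: {sick_pos:.1f} sick and positive, "
      f"{healthy_pos:.1f} healthy and positive -> {p_counts:.1%}")
```

Both routes land on roughly 16%. The count version makes the reason visible: about 9.5 true positives sit next to about 49.5 false alarms.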
Let me take the Griffiths and Tenenbaum position here because this is a massive debate in cognitive science. If the brain detects the mathematical conflict and three-year-olds are correlating at near-perfect levels with optimal models, I would argue the brain is actually performing Bayesian reasoning at its core. Right. The Bayesian brain hypothesis. Yeah. I mean, those correlations are simply too high and too consistent across totally different tasks to be a coincidence. Underneath it all, we are fundamentally Bayesian. That is a popular view. But there is a very strong counter-movement to that, often called the myth of the Bayesian brain critique. Okay. What's their argument? Just because a biological system produces an output that looks mathematically optimal doesn't mean it's literally computing those equations in its neural circuits. Think about a baseball outfielder catching a fly ball. Like tracking the trajectory. Exactly. They run in a very specific arc that keeps the optical angle of the ball constant in their vision. The outcome perfectly matches complex differential calculus. But the outfielder isn't doing calculus in their head. The Bayesian label might just be a highly convenient mathematical description for researchers to use rather than a literal neural mechanism. Okay. I can concede that distinction. Whether our neurons are literally running the Bayes equation or just doing a really convincing impression of it, the practical implication for you, the listener, is identical. You don't need to learn the complex math. You just need to reformat the information. That is the ultimate takeaway for this first section. Whenever you face a probabilistic decision, whether that's evaluating a business risk, deciphering a medical test result, or assessing a new hire, translate the abstract percentages you are given into natural frequencies. So don't ask your team, what is the 20% risk of failure? Right.
Ask them, out of 100 projects exactly like this one, how many turn out badly? It forces the problem into the part of your brain equipped to actually handle it. So shifting the format from percentages to real numbers fixes the immediate problem. But it raises a deeper question. Why does our explicit brain panic when we feed it complex abstractions? Like why can't more information just naturally lead to better decisions? It's so counterintuitive, isn't it? It really is. And that actually brings us to a trap that both our brains and Wall Street algorithms fall into, the trap of complexity. To understand why complexity fails us, you have to understand a concept from machine learning called the bias-variance trade-off. Okay, what is that? Well, whenever you build a model to predict the future, whether it is an artificial intelligence algorithm or just a mental model in your own head, you face a fundamental tension. Simple models have high bias. They're blunt instruments, so they might miss some nuanced true patterns. But they have low variance, meaning they are very stable and reliable when you throw new unseen data at them. And complex models are the opposite. Exactly. They have low bias, meaning they capture every tiny little pattern in the data you show them. But they have high variance. They are incredibly unstable when the real world throws them a curveball. I like to think of it like buying a custom tailored suit. Imagine you go to a tailor on a day when you are incredibly bloated from eating a huge salty meal. That's a great analogy. If the tailor builds a highly complex, meticulously fitted suit based on exactly how your body looks that specific afternoon, the suit will fit you perfectly in that one specific moment in time. But the very next day, when your body returns to normal, the suit looks ridiculous. It was optimized for a temporary noisy state instead of your true underlying shape. That is a perfect way to explain overfitting. 
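The bias-variance trade-off described above can be seen in a small simulation. This is a minimal sketch under stated assumptions, not anything taken from the studies mentioned: the true signal is assumed to be a straight line with Gaussian noise, the "simple" model is a least-squares line, and the "complex" model is a nearest-neighbour lookup that memorizes every training point. All names and numbers are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def noisy_samples(xs):
    """Assumed ground truth: a straight line plus Gaussian noise."""
    return [0.5 * x + rng.gauss(0, 3) for x in xs]


def fit_line(xs, ys):
    """Simple model (high bias, low variance): ordinary least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Complex model (low bias, high variance): return the y of the nearest training x."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x = [float(i) for i in range(200)]
test_x = [i + 0.5 for i in range(200)]  # new, unseen inputs
train_y, test_y = noisy_samples(train_x), noisy_samples(test_x)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name:10s} train MSE {mse(model, train_x, train_y):6.2f}   "
          f"test MSE {mse(model, test_x, test_y):6.2f}")
```

The memorizer scores a perfect zero on the data it has seen, because it has stored the noise along with the signal. On fresh data it carries that stored noise into every prediction, which is the suit tailored to the bloat.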
Overfitting is what happens when a complex model learns the random noise in its training data instead of the true underlying signal. It memorizes the bloat. It looks like an absolute genius on past data, but it completely falls apart tomorrow. And there is no better or more expensive example of this than the quantitative finance world. Take the Quantopian 888 strategy study. Oh, this is a fascinating one. Yeah, Quantopian was this crowd-sourced platform where amateur and professional quants could build and test stock trading algorithms. Researchers looked at 888 of these crowd-developed algorithms. They looked at the backtests, which is how well the algorithm would have performed on historical market data, and then they compared it to how the algorithm actually performed in live out-of-sample markets with real money. The punchline of that study is a stark warning. The more someone tuned, tweaked, and optimized their strategy on historical data, the larger the gap between their simulated success and their real-world failure. So you're just over-optimized. Completely. The strategies that looked the most brilliant in hindsight were the biggest disappointments going forward. Think about the last time you over-researched a major purchase, like a new car or a television. You read hundreds of reviews, you compared spreadsheets of minor technical specs, and you paralyzed yourself with variables that ultimately didn't matter to your actual enjoyment of the product. You were overfitting your decision. You memorized the noise of the marketing material instead of focusing on the one or two signals that actually mattered. Quantitative finance is hyper-aware of this human tendency now. They try to fight it using strict statistical adjustments. Like what? One is called the deflated Sharpe ratio, which statistically penalizes a trading strategy's apparent performance based on how many different variants the creator tested.
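Why the number of tested variants matters can be shown with a toy simulation. This is not the deflated Sharpe ratio formula itself, only a sketch of the selection effect it is meant to correct. Every strategy here is assumed to have zero true edge (pure Gaussian noise), and the counts of strategies and trading days are illustrative choices.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

N_STRATEGIES = 500      # variants tried on historical data
N_BACKTEST_DAYS = 250   # roughly one trading year of history
N_LIVE_DAYS = 1000      # roughly four trading years of live results


def random_daily_returns(n_days):
    """A strategy with no real edge: mean 0, 1% daily volatility."""
    return [rng.gauss(0.0, 0.01) for _ in range(n_days)]


def mean(values):
    return sum(values) / len(values)


# Backtest every variant and keep the one that looked best in hindsight.
backtests = [random_daily_returns(N_BACKTEST_DAYS) for _ in range(N_STRATEGIES)]
best_index = max(range(N_STRATEGIES), key=lambda i: mean(backtests[i]))
best_backtest_mean = mean(backtests[best_index])

# Run the "winner" live. Its true edge is still zero, so this is fresh noise.
live_mean = mean(random_daily_returns(N_LIVE_DAYS))

print(f"best backtest mean daily return: {best_backtest_mean:+.4%}")
print(f"same strategy, live:             {live_mean:+.4%}")
```

The winning backtest looks profitable purely because it was the maximum of many noisy draws. Nothing about the strategy changed between the two lines of output; only the selection step was removed.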
If you test 500 complex strategies and pick the one that worked best, it is an illusion of skill. You just found the suit that perfectly matched the bloat. And we see this exact same less-is-more phenomenon in human psychology. In a series of famous studies published in Psychological Review, researchers Czerlinski, Gigerenzer, and Goldstein tested a radically simple decision rule called the take-the-best heuristic. It's almost shockingly simple. The rule is literally, when you are comparing two options, find the single most discriminating cue, use that to make your choice, and ignore literally everything else. They tested this simple rule against complex multiple regression models across 20 completely different real-world data sets, predicting school dropout rates, estimating city populations, you name it. And the results? The take-the-best heuristic matched or beat the complex math in 12 out of the 20 data sets when predicting new unseen data. But we have to push back on this slightly. I mean, is simplicity truly always better, or is that its own oversimplification? Doesn't the less-is-more argument cherry-pick its conditions a bit? Well, if you look at data-rich environments with perfectly stable rules, think of a chessboard or modern facial recognition software, complexity absolutely wins. Deep learning crushes simple human heuristics when you have millions of perfect training examples to feed the algorithm. The less-is-more effect is very real, but it has strict boundaries. That is completely fair, and it highlights the real takeaway here. The debate between simplicity and complexity isn't an abstract philosophical argument. It entirely depends on whether your environment provides enough reliable data to pay for the complexity you want to use. Exactly. In a small, noisy, unpredictable environment, complexity is just a much more expensive way to be wrong. Overfitting is the default outcome when you optimize a decision without sufficient data.
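The take-the-best rule described above can be sketched in a few lines. This is a minimal illustration, assuming binary cues that have already been ranked from most to least valid; the cities, cue names, and cue values are hypothetical and not taken from the studies mentioned.

```python
def take_the_best(option_a, option_b, cues_by_validity):
    """Decide on the first cue that tells the two options apart; ignore the rest."""
    name_a, cues_a = option_a
    name_b, cues_b = option_b
    for cue in cues_by_validity:
        if cues_a[cue] != cues_b[cue]:
            # This cue discriminates, so stop searching and choose on it alone.
            return name_a if cues_a[cue] > cues_b[cue] else name_b
    return None  # no cue discriminates; the caller has to guess


# Hypothetical question: which of two cities has the larger population?
cue_order = ["is_capital", "has_major_airport", "has_university"]
city_a = ("City A", {"is_capital": 0, "has_major_airport": 1, "has_university": 1})
city_b = ("City B", {"is_capital": 0, "has_major_airport": 0, "has_university": 0})

choice = take_the_best(city_a, city_b, cue_order)
print(choice)  # prints "City A": the airport cue is the first one that differs
```

Neither city is a capital, so the first cue is skipped. The second cue differs, the decision is made there, and the university cue is never consulted.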
So before you add complexity to a decision in your own life, like creating a 20-point rubric for hiring a new employee or building a multivariable spreadsheet for an investment, ask yourself how much reliable data you actually possess. Prefer simplicity unless your data strongly and undeniably justifies the complexity. That's a great rule of thumb. So we've established that simple models beat complex ones in noisy environments. But what happens when the environment itself fundamentally changes underneath your feet? This brings us to the concept of regime change, and the danger of fitting your decisions perfectly to history. This is where things get really dangerous. Look at the recent history of artificial intelligence in hospitals. Epic, the massive medical records company, built a highly complex clinical algorithm designed to flag patients at risk for sepsis. It looked fantastic on paper. But when an independent evaluation published in PubMed looked at its performance across nine different hospitals in the Michigan Medicine system, the performance varied wildly. It worked in some hospitals and failed in others. This is a classic case of clinical overfitting. The model was trained in one specific context on a specific patient population with specific nursing workflows. When you deploy it in a new context, the underlying data distribution has shifted. The regime has changed, and the model degrades. We saw the exact same story with IBM Watson. They partnered with MD Anderson, literally one of the best cancer centers in the world, to build an oncology AI. And the partnership was eventually canceled. Because the system was so tightly overfit to the specific training data of MD Anderson's top experts that it could not handle the diverse, messy reality of live clinical workflows elsewhere. And the fallout of this type of overfitting is alarm fatigue.
Systematic reviews show that doctors and nurses override clinical decision support alerts between 49 and 96 percent of the time. When you optimize a complex system for sensitivity, meaning you tune it to catch every single possible problem, you destroy its specificity. It cries wolf constantly, and humans simply tune it out. It is not just medicine either. Look at the 2023 Major League Baseball shift ban. This was a massive natural experiment in regime change. For years, analytically sophisticated baseball teams positioned their fielders based on incredibly complex models of where batters historically hit the ball. They overfit their entire defensive strategy to a very specific historical regime. Exactly. Then, the league changed the rules mid-game and banned extreme defensive shifts. Suddenly, the teams that had built an entire roster around those historical models exposed massive vulnerabilities. Some hitters benefited enormously, others barely noticed. The regime changed, and the overfit models broke. This tension between human judgment and algorithmic models forces a really important debate. When the environment is unstable, who do we trust? Honestly, I'm going to take the hardline pro-algorithm position here. Paul Meehl's legendary research, which was updated by the Grove and Meehl meta-analysis covering roughly 70 years of clinical data, shows that simple mechanical prediction rules beat clinical expert judgment in about 60% of direct comparisons. 60% is a solid majority. Right. When doctors are overriding hospital algorithms 96% of the time, maybe they are just being arrogant. Experience breeds confidence, but the data shows it doesn't always breed accuracy. I say we should default to the simple algorithm, even with its flaws. I see it quite differently, actually. Those clinical override rates of 49 to 96% might reflect incredibly good human judgment.
If an algorithm has a massive false alarm rate, ignoring it is not human bias, it is the perfectly rational response to a broken tool. Human expertise is what catches the subtle shifting context that the algorithm is entirely blind to. Okay, that's a fair counterpoint. The resolution to this debate comes from Robin Hogarth's framework of kind versus wicked learning environments. That framework explains so much of human frustration. It really does. In kind environments, where feedback is clear and immediate and the rules of the game do not change, like playing chess or forecasting the weather, human experience translates beautifully into true, reliable expertise. Trust the human expert. But what about the wicked environments? In wicked environments, where feedback is noisy, delayed, or the rules shift mid-game, like the stock market or predicting long-term medical outcomes, experience just makes humans dangerously overconfident. In wicked environments, you should lean on simple algorithms, but you absolutely must keep the human in the loop to detect when the regime itself has changed. So what does this all mean for you listening right now? If most of the important decisions we make happen in wicked environments and our brains naturally overfit to noise, can we actually learn to predict the future better? Like, is decision-making under uncertainty a learnable skill? The short answer is yes, but it requires a very specific, deliberate kind of practice called calibration. Philip Tetlock's Good Judgment Project proved this at scale. They ran a massive forecasting tournament with over 2,000 people over four years. And they found these superforecasters, right? Yes. The top 2% of participants, who he called superforecasters, consistently beat professional intelligence analysts with access to classified information at predicting global events. And what made them superforecasters wasn't that they had doctorates in geopolitics or secret insider knowledge. It was their metacognition.
They updated their beliefs frequently, they approached new data with intellectual humility, and they were very comfortable using precise probabilistic language instead of vague terms like probably or maybe. They were exquisitely well calibrated. Calibration is just the alignment between your stated confidence and your actual accuracy. If you say you are 70% confident that an event will happen, you should be right about 70% of the time over the long run. And you can train this. A landmark study by Arkes in 1986 showed that you can actively train this skill. They gave people calibration training with immediate feedback on their predictions. The participants' error rates dropped by nearly half, from a mean squared error of about 0.27 down to 0.14. And impressively, that improvement held steady when they tested them again six months later. But there is a catch regarding how often you should update your beliefs. Murphy and Winkler studied weather forecasters, who are famously well calibrated. But they found that meteorologists who updated their predictions on a weekly basis, using aggregated data, actually outperformed those who reacted daily to every single new observation. Reacting to every single minor data point is just another form of overfitting. Exactly. You don't want to chase the noise. But this raises a huge question. If calibration training works, do we just mandate that everyone learns Bayes' theorem, give them immediate feedback on every decision they make, and permanently solve human error? That leads to the ultimate twist of this deep dive. The same researcher, Arkes, ran another study in 1988 known as the Radiologist Feedback Paradox. The paradox. What happened? They gave both highly experienced radiologists and complete medical novices feedback on how well they were interpreting x-rays. The novices improved, exactly as you would expect. But the experienced radiologists actually got significantly worse. Wait, really?
How does giving an expert accurate feedback make them worse at their own job? Because the feedback activated their overfit mental models. It caused the experts to second-guess themselves and double down on highly complex, nuanced cues that were diagnostic in their highly specific past experiences, but were actually misleading in the current, slightly different task. Their performance declined with an effect size of 0.61. Expertise in a wicked environment can become a cognitive trap. And it might even be wired deeply into our neurobiology. Rollwage and colleagues published an fMRI study looking at people's brains during a belief-updating task. They found that highly overconfident people showed significantly reduced activity in the anterior insula and the prefrontal cortex when they were presented with new, conflicting evidence. That is fascinating. Overconfidence isn't just an ego trip or a personality flaw. It is a literal computational failure to process uncertainty. Your brain builds a model that is too narrow, and it physically stops computing the variance in the data. The implication for the listener is clear. Calibration is highly learnable, but it is deeply domain-specific. You should invest in practice, track your predictions, and score your accuracy. But hold the skill lightly. Being exceptionally well calibrated in one area of your life does not mean your intuition magically works in another. Let's bring this all the way back to the opening scene. Remember that three-year-old playing with the light-up toy, doing near-perfect statistical math in their head? And remember that Harvard physician completely failing a basic textbook probability problem? Same human brain. Same fundamental probability machinery. The entire difference was the format. The child was operating in a rich physical environment with natural frequencies. The physician was stranded in a clinical word problem full of abstract percentages.
The lesson of this entire deep dive is that you do not need to become a better calculator. You need to become a better formatter of the information you consume, the decisions you structure, and the environments you choose to reason in. And that crystallizes into our three core takeaways. First, your brain is an exquisite implicit probability engine. So route your decisions through formats it can actually handle, especially natural frequencies. Second, overfitting is the default outcome of optimization when you do not have enough data. Always prefer simple rules unless the data strongly and undeniably justifies complexity. And third, calibration is learnable but domain-specific. Invest in practice, but recognize the sharp limits of your own expertise when operating in wicked environments. I want to leave the listener with one final thought to mull over. If human intuition is basically just an overfit model built on our past experiences, and we increasingly hand all our complex, wicked decisions over to artificial intelligence, what happens to us? Are we destined to just become regime change detectors for machines? Do we eventually lose our own ability to reason entirely, relying on the algorithm until the environment shifts? It is a profound question. We have to maintain the friction of making our own decisions to stay calibrated. Ultimately, the most Bayesian thing you can do is hold your Bayesian tools with appropriate uncertainty, which might be the most important sentence in this whole deep dive. This has been a UDOM research-pronounced Euro-Odomay research deep dive. Your call to action today is simple. Pick one of the protocols we covered and implement it this week. If you aren't sure where to start, go with protocol one, reformat before you reason. The next time you face a probability, a medical test result, a business risk, a hiring decision, translate it into natural frequencies before you decide.
Out of every 100 cases like this, how many turn out this way? That single reframe is the highest leverage change you can make. Or if you want to go deeper, start protocol five. Track your predictions with explicit probabilities and grade yourself. For the full briefing, all the research citations, and the seven protocols written out, visit udom.ani. And if you know someone who makes decisions under uncertainty, which is everyone, share this deep dive with them. Until next time.
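For the prediction-tracking protocol mentioned above, one possible way to grade yourself is the Brier score, which is the mean squared error between the probability you stated and what actually happened (0 is perfect; always saying 50% scores 0.25). The sketch below is only an illustration: the forecast log is invented, the function names are hypothetical, and it is not claimed to be the scoring method used in any of the studies discussed.

```python
from collections import defaultdict

# Hypothetical log: (stated probability, did the event happen?)
forecast_log = [
    (0.9, True), (0.9, True), (0.9, True), (0.9, False),
    (0.7, True), (0.7, True), (0.7, False),
    (0.3, False), (0.3, True), (0.3, False),
]


def brier_score(forecasts):
    """Mean squared gap between stated probability and outcome (1 if it happened, else 0)."""
    return sum((p - (1.0 if happened else 0.0)) ** 2
               for p, happened in forecasts) / len(forecasts)


def calibration_table(forecasts):
    """For each stated probability, the fraction of those events that actually happened."""
    buckets = defaultdict(list)
    for p, happened in forecasts:
        buckets[p].append(happened)
    return {p: sum(outcomes) / len(outcomes)
            for p, outcomes in sorted(buckets.items())}


print(f"Brier score: {brier_score(forecast_log):.3f}")
for stated, observed in calibration_table(forecast_log).items():
    print(f"said {stated:.0%} -> happened {observed:.0%} of the time")
```

A well-calibrated log shows the two columns of the table tracking each other: events given 70% come true about 70% of the time. With only a handful of entries per bucket the table is very noisy, so it is worth collecting many predictions before reading much into it.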