Sources: Algorithms for Life: Bayes & Overfitting

[R01] Griffiths, Kemp & Tenenbaum — Hierarchical Bayesian modeling program: category learning, causal induction, and word learning (r = 0.85–0.92 correlation with Bayesian models)
[R02] Casscells, Schoenberger & Graboys 1978 — Physician Bayesian reasoning study (18% correct on standard probability problem)
[R03] Eddy 1982 — Physician Bayesian reasoning in mammography screening (5% correct; chapter in Kahneman, Slovic & Tversky, eds., Judgment Under Uncertainty: Heuristics and Biases)
[R04] Hammerton 1973 — Physician Bayesian reasoning study (10% correct)
[R05] Gigerenzer & Hoffrage 1995 — 'How to Improve Bayesian Reasoning Without Instruction' (Psychological Review); frequency format breakthrough
[R06] Gigerenzer and colleagues — Systematic experimental replications of frequency format effects on base-rate neglect (76–92% correct in ecological conditions)
[R07] McDowell & Jacobs 2017 — Meta-analysis of 35 studies on frequency format effects (weighted d = 0.93, Psychological Bulletin)
[R08] Sloman et al. 1999 — Critique of frequency format hypothesis; argues transparent nested-set structure is the key mechanism
[R09] De Neys & Glumicic 2008 — Implicit conflict detection in base-rate neglect tasks (78% stereotypical responses despite increased System 2 processing)
[R10] Gigerenzer & Todd 1999 — Simple Heuristics That Make Us Smart; bias-variance tradeoff applied to human cognition
[R11] Czerlinski, Gigerenzer & Goldstein 1999 — Take-the-best heuristic tested across 20+ real-world datasets (chapter in Simple Heuristics That Make Us Smart)
[R12] Grove & Meehl 2000 — Meta-analysis of clinical vs. actuarial prediction (~70-year replication record; statistical rules beat clinical judgment in ~60% of comparisons)
[R13] Quantopian 888-Strategy Out-of-Sample Dataset Study — Empirical evidence of backtest-to-live decay correlated with tuning intensity (SSRN 2745220)
[R14] AQR — 'Lies, Damned Lies, and Data Mining' practitioner essay on robustness vs. in-sample fit
[R15] Royal Statistical Society, Significance Vol. 18 Issue 6 p.22 — Backtest overfitting as finance's p-hacking analog
[R16] CFM (Capital Fund Management) Technical Note 2016 — In-sample overfitting pitfalls in data mining
[R17] David H. Bailey — Deflated Sharpe Ratio and overfit detection tools for quantitative finance
[R18] Hogarth and colleagues — Kind vs. wicked learning environments; confidence grows without skill in wicked environments
[R19] Michigan Medicine / PubMed 34152373 — Epic Sepsis Model evaluation across 9 hospitals (Jan 2020–Jun 2022); performance varied by hospital factors
[R20] Scientific American — IBM Watson for Oncology / MD Anderson partnership cancellation investigative report
[R21] PMC4052586 — Systematic review of CDS alert override rates (49–96% range)
[R22] BMJ Digital Health e000083 — CDS alert override rate review (mid-40% to mid-90% across studies)
[R23] PMC6855857 — Inpatient e-prescribing observational study referencing 49–96% override rates
[R24] AHRQ Clinical Decision Support Resource Page — Cites 49–96% override range with drug-drug interaction and allergy alert examples
[R25] The Analyst 2023 — MLB shift ban effects on team strategy and run prevention
[R26] arXiv 2411.15075 — Causal-inference analysis (synthetic control) of MLB shift ban; heterogeneous player-level effects
[R27] Tetlock & Gardner — Good Judgment Project; large-scale superforecasting tournament (2,000+ participants, 4 years, IARPA validation)
[R28] Good Judgment White Paper 2022 — 'Forecasters Who Think Again Are More Accurate'; update frequency correlates with Brier score improvement
[R29] Arkes, Dawes & Christensen 1986 — Calibration training with immediate feedback; MSE reduced from ~0.27 to ~0.14; maintained at 6-month follow-up (n = 216)
[R30] Murphy & Winkler 1984 — Meteorologist update frequency and calibration (calibration score ~0.85 vs. 0.50 for untrained; weekly updates outperformed daily)
[R31] Metaculus Calibration Notebook — Logistic/Platt scaling recalibration strategies and Brier-score evaluation
[R32] Metaculus Forecaster Count Analysis — 'More Is Probably More'; accuracy improves with more forecasters (diminishing returns)
[R33] Manifold Markets Official Calibration Dashboard — Brier-style metrics with methodological notes on trade-weighted vs. time-weighted calibration
[R34] Polysyncer Blog Analysis — 28,407 Polymarket markets (Jan 2024–May 2026); strong calibration reported (not peer-reviewed)
[R35] SSRN 5910522 — Large academic-style Polymarket study (hundreds of millions of trades); accuracy, skill, and bias analysis
[R36] arXiv 2602.19520 — Cross-platform calibration study (Kalshi and Polymarket); calibration is multidimensional and domain-structured
[R37] Unnamed arXiv preprint on overcorrection after Bayesian training; LessWrong/EA Forum discussions on reference class forecasting failure modes (no named authors/DOI)
[R38] Arkes et al. 1988 — Radiologist feedback paradox; experienced radiologists declined with feedback (d ≈ 0.61, Journal of Experimental Psychology)
[R39] Rollwage et al. 2020 — fMRI study of overconfidence and belief updating (eLife, n = 60); reduced uncertainty computation in overconfident individuals
[R40] Friston and colleagues — Free Energy Principle and predictive processing / Bayesian brain framework
[R41] 'Myth of the Bayesian Brain' critical literature — Argues predictive processing may be metaphor rather than mechanism for higher cognition
[R42] Unnamed developmental psychology research — Base rate neglect onset at approximately age 6 (no named authors/DOI)
[R43] Unnamed cross-cultural comparative study — Base rate neglect variation across individualist vs. collectivist cultures (no named authors/DOI)
[R44] Unnamed exploratory study — Probabilistic reasoning concepts in indigenous knowledge systems (no named authors/DOI)
[R45] Strategy/innovation literature — Organizational overfitting concept; companies over-specialized to current environment fail to adapt
