
If you have enough data and enough compute, general learning machines that “discover” patterns tend to outperform systems where humans try to hard-code expertise.

Don’t try to out-scale the scalers; out-govern, out-validate, and out-implement them.

“Stop Hand-Crafting the Future”: The Bitter Lesson Comes for Surgical AI

by ChatGPT-5.2

The paper The Bitter Lesson and its Implications for Surgical Artificial Intelligence puts forward a blunt claim about how AI progress really happens: systems that learn general patterns from enormous amounts of data, using massive compute, tend to beat systems built around human “expert rules” and carefully designed features. In it, Balch, Shickel, and Loftus argue that surgery is now running into that same reality, and that a lot of well-intended academic surgical AI may be structurally doomed to be outpaced by frontier models trained at industrial scale.

What makes the piece interesting is not that it worships Big Tech; it doesn’t. It’s that it reassigns the “high ground” for surgeons and academic centers: if they can’t win the arms race of building the biggest models, they can still win the war over what counts as good evidence, what outcomes matter, what failure modes are intolerable, and how these systems get governed and deployed in real clinical workflows.

1) The core idea

The “Bitter Lesson” in one sentence

If you have enough data and enough compute, general learning machines that “discover” patterns tend to outperform systems where humans try to hard-code expertise (rules, hand-picked variables, curated feature sets, narrow targets).

Why this matters in surgery

A lot of surgical AI has been built like this:

  • one hospital (or a few hospitals)

  • a limited dataset

  • experts choose variables/features the model is “allowed” to look at

  • the model predicts a narrow outcome

  • the model works… in that setting

The authors’ warning is: even if that looks good today, it often won’t hold up as more general, larger models show up that have learned from far broader data and can adapt to more tasks.

Their proposed “pivot” for surgeon-scientists

They argue surgeons should focus less on “building bespoke models” and more on what industry is comparatively bad at:

  • how to judge datasets and benchmarks (what’s valid, what’s biased, what’s missing)

  • how to evaluate model limitations (where it breaks, when it lies, what uncertainty looks like)

  • workflow integration (what actually helps in clinics and ORs rather than demos)

  • safety, privacy, fairness, interpretability, and cost-effectiveness

  • patient experience and the clinician–patient relationship

  • monitoring, updating, and auditing models over time

In other words: don’t try to out-scale the scalers; out-govern, out-validate, and out-implement them.

2) How they apply the Bitter Lesson across surgical AI

A) Medical knowledge retrieval: “Stop trying to beat the frontier models”

They use USMLE performance as an illustrative benchmark and argue that early “medical NLP” trained narrowly on textbooks/question banks underperformed, while general internet-trained transformer models did far better—then improved again as scale increased.

They also make a pointed claim about many “medical LLMs” (examples named include OpenEvidence, UpToDate Expert AI, and DoxGPT): that they are often fine-tuned from large foundation models and frequently don’t beat the best frontier models for long. Their punchline: any advantage from proprietary high-quality content access may be “transient rather than structural.”

Their recommendation: academic surgical groups should build reliable validation datasets, benchmarks, and governance frameworks, rather than pouring effort into building a “surgical LLM” unless there’s a truly fundamental architectural change.

B) Clinical reasoning and prediction: “Today’s risk models don’t land in the real world”

They draw a distinction between:

  • prediction models / risk calculators (often regression/ML trained on fixed datasets), versus

  • sequential reasoning (how clinicians update hypotheses as new info arrives)

They argue many prediction models are limited because their data is static and narrow (single center, time-bound snapshots), and because the success metrics used in papers (like AUROC) do not necessarily translate to confidence for a specific individual patient—especially without uncertainty information.
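
The AUROC point can be made concrete. Below is a minimal sketch (synthetic data and scikit-learn; illustrative, not the authors’ code) in which a model posts a decent aggregate AUROC, while a small bootstrap ensemble shows how wide the uncertainty around a single patient’s risk estimate can be; the aggregate score alone never surfaces this.

```python
# Sketch: aggregate AUROC vs. per-patient uncertainty (synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Bootstrap ensemble: refit on resampled data and look at the spread of the
# predicted risk for one patient, rather than a single point estimate.
rng = np.random.default_rng(0)
preds = []
for _ in range(50):
    idx = rng.integers(0, len(X_tr), len(X_tr))
    m = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    preds.append(m.predict_proba(X_te[:1])[:, 1][0])
lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"patient 0 risk: {np.mean(preds):.2f} (95% interval {lo:.2f}-{hi:.2f})")
```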

They highlight a more ambitious direction:

  • continuously evolving, multimodal data at health-system scale

  • “digital twins” (dynamic patient avatars matched against similar trajectories across systems)

  • “agentic” frameworks: multiple coordinated models that break problems into parts, use tools, iterate, and synthesize decisions (a toy loop is sketched below)

But they repeatedly underline the catch: this requires huge data infrastructure, sharing frameworks, and sustained funding—i.e., a scale most academic surgical labs can’t match.
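
To make the “agentic” bullet concrete, here is a deliberately toy loop; call_model is a hypothetical stand-in for any model or tool endpoint (no real API is assumed). A planner decomposes the question, workers handle sub-tasks, and a synthesizer combines the results. The orchestration itself is trivial; the hard part is exactly the data infrastructure the authors keep pointing to.

```python
# Toy "agentic" decomposition loop; call_model is a hypothetical placeholder.
from dataclasses import dataclass

@dataclass
class Step:
    task: str
    result: str

def call_model(role: str, prompt: str) -> str:
    # Hypothetical stand-in: in practice this would query a model or a tool.
    return f"[{role}] answer to: {prompt}"

def agentic_decision(question: str) -> str:
    # 1. A "planner" breaks the clinical question into sub-tasks.
    subtasks = [
        "retrieve relevant history and labs",
        "estimate operative risk",
        "check guideline recommendations",
    ]
    # 2. "Worker" models handle each sub-task (possibly calling tools).
    steps = [Step(t, call_model("worker", f"{question} -> {t}")) for t in subtasks]
    # 3. A "synthesizer" combines intermediate results into one decision.
    summary = "; ".join(s.result for s in steps)
    return call_model("synthesizer", f"decide '{question}' given: {summary}")

print(agentic_decision("Should this patient proceed to elective colectomy?"))
```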

C) Operating room AI: “The bottleneck isn’t clever features—it’s data and governance”

They argue OR computer vision followed the same pattern seen in broader AI: early systems used human-designed features (edges/textures; then in surgery: kinematics, task time, motion smoothness, structured rating scales). Then performance improved when models learned directly from raw video and extracted spatiotemporal features automatically.
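
The shift is easy to illustrate. Below is a hedged toy sketch (PyTorch, not the models the paper reviews): a hand-crafted kinematics metric such as motion smoothness, next to a tiny 3D-convolutional network that learns its own spatiotemporal features directly from raw video tensors.

```python
# Hand-crafted kinematics feature vs. learned spatiotemporal features (toy).
import torch
import torch.nn as nn

def motion_smoothness(trajectory: torch.Tensor) -> torch.Tensor:
    # Hand-crafted: mean squared jerk of instrument-tip positions, shape (T, 3).
    vel = torch.diff(trajectory, dim=0)   # velocity
    acc = torch.diff(vel, dim=0)          # acceleration
    jerk = torch.diff(acc, dim=0)         # jerk
    return (jerk ** 2).mean()

class TinyVideoNet(nn.Module):
    # Learned: a 3D convolution over (batch, channels, time, height, width)
    # discovers its own spatiotemporal features from raw frames.
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(video).flatten(1))

traj = torch.cumsum(torch.randn(100, 3), dim=0)  # fake instrument path
print("smoothness score:", motion_smoothness(traj).item())
clip = torch.randn(1, 3, 16, 64, 64)             # fake 16-frame clip
print("logits shape:", TinyVideoNet()(clip).shape)
```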

Their key claim: in surgical computer vision, the main problems are not architecture; they’re:

  • aggregating datasets across institutions

  • interpretability

  • standard benchmark datasets that are clinically meaningful

  • privacy-preserving sharing frameworks

And their “Bitter Lesson” twist: surgeons should spend less energy telling models how to see and more energy deciding what outcomes matter—and ensuring those outcomes are measured and governed responsibly.

3) What’s most surprising, controversial, and valuable

Most surprising

  1. “Human expertise can degrade performance.”
    Not “human expertise is useless,” but that when scale is available, hand-crafted heuristics and representations contribute less than data/compute/general learning.

  2. The authors treat “build a surgical LLM” as largely a dead end (absent fundamental architectural change). That’s a stronger statement than most academic AI papers will make openly.

  3. They frame proprietary medical content access as a temporary advantage—meaning paywalled literature may improve frontier models, and the gap closes.

Most controversial

  1. Academic surgical AI can’t realistically compete in frontier model development.
    That’s a hard institutional truth: it threatens grant narratives, lab identities, and traditional academic incentives.

  2. “Benchmarks and validation cohorts should be kept separate from foundation model training data.”
    This implies many current evaluation practices are inadequate because models may have “seen” the test distribution or close analogs (a toy overlap check follows this list).

  3. Digital twins and agentic systems are presented as likely directions—yet they intensify governance dilemmas (who owns data, who is custodian, who is liable when agents act).
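
On point 2, a toy illustration of why the separation matters: even a crude word n-gram overlap check (an assumption for illustration, not the paper’s protocol) can flag benchmark items a model has likely already “seen” in training text.

```python
# Toy contamination check: n-gram overlap between a benchmark item and corpus.
def ngrams(text: str, n: int) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(item: str, corpus: str, n: int = 5) -> float:
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus, n)) / len(item_grams)

corpus = ("the patient presented with acute cholecystitis and underwent "
          "laparoscopic cholecystectomy without complication")
question = ("a patient presented with acute cholecystitis and underwent "
            "laparoscopic cholecystectomy without complication - what next?")
print(f"overlap: {contamination_score(question, corpus):.0%}")  # high = suspect
```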

Most valuable

  1. A clean division of labor: industry scales models; academia safeguards reality.
    The paper’s most constructive move is not the critique—it’s proposing where surgeons can lead: endpoints, evaluation cohorts, failure modes, workflow integration, auditing.

  2. A strong critique of “paper performance” culture.
    They emphasize that many models have not been rigorously tested for whether they change patient-centered outcomes in real practice.

  3. A governance agenda that becomes more urgent as AI becomes more agentic.
    They explicitly connect future agentic AI to policy reform needs around automated access to patient data and clarity on ownership/custodianship.

4) All plausible consequences

Consequences for academic surgery and surgical research

  • Research prestige shifts from “novel model architecture” papers to dataset stewardship, benchmark creation, and real-world evaluation.

  • Grant strategies change: more funding arguments around multi-institutional data infrastructure, privacy-preserving collaboration, and impact studies—not just model building.

  • New academic power centers emerge: groups that control high-quality cohorts, registries, gold-standard labels, and longitudinal multimodal datasets.

  • Career incentives may clash with what the paper recommends (because academia often rewards novelty over governance and implementation).

Consequences for hospitals and health systems

  • Competitive advantage moves to data maturity: institutions able to aggregate longitudinal multimodal data (EHR + imaging + labs + genomics + outcomes) become the testing ground for next-generation systems.

  • Operational burden increases: monitoring, auditing, updating, and documenting model behavior becomes part of clinical operations.

  • Procurement changes: buying “AI tools” becomes less about demos and more about validation evidence, failure mode documentation, uncertainty reporting, audit logs, and update policies.

Consequences for clinicians and workflow

  • More friction before trust: clinicians may demand uncertainty metrics, provenance, and “why” explanations—especially when models outperform them on average but fail catastrophically in edge cases.

  • Training shifts: clinicians will need competency not just in “using AI,” but in detecting model failure, escalation decisions, and safe override behavior.

  • Relationship risk: if AI inserts itself poorly into consultations, it can degrade the clinician–patient relationship—so workflow design becomes a clinical safety issue.

Consequences for patient safety, ethics, and fairness

  • Bias becomes a moving target: continuously learning or frequently updated models can drift; fairness requires ongoing measurement, not one-time approval.

  • Auditability becomes mandatory: “what did the system recommend, based on what inputs, with what uncertainty, under what version?” becomes a safety baseline (a minimal record is sketched after this list).

  • New privacy pressures: large-scale data aggregation and agentic access expand the attack surface and intensify consent/custodianship debates.
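
What the audit bullet implies in practice can be sketched as a minimal record (illustrative field names, not any standard): the recommendation, hashed inputs, an uncertainty interval, and the model version, captured at decision time.

```python
# Minimal audit-record sketch: what, from which inputs, how uncertain, which version.
import hashlib, json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    model_version: str
    input_hash: str              # hash inputs rather than logging raw PHI
    recommendation: str
    risk_estimate: float
    uncertainty_interval: tuple  # e.g. a 95% interval around the estimate
    timestamp: str

def log_recommendation(version: str, inputs: dict, rec: str, risk: float, interval: tuple):
    record = AuditRecord(
        model_version=version,
        input_hash=hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
        recommendation=rec,
        risk_estimate=risk,
        uncertainty_interval=interval,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    print(json.dumps(asdict(record)))  # in practice: an append-only store
    return record

log_recommendation("risk-model-2.3.1", {"age": 71, "asa": 3}, "delay surgery", 0.34, (0.21, 0.47))
```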

Consequences for industry / frontier model builders

  • More pressure to accept external evaluation: if academia owns gold-standard cohorts, frontier models must face independent stress tests.

  • Greater liability exposure if models move from advice to agentic action; the stronger the automation, the more accountability questions sharpen.

  • Demand for regulated interfaces: safe integration into EHR/OR systems will require better access controls, logging, and update governance.

Consequences for publishers and knowledge providers (especially in medicine)

  • Content access becomes strategically important—but not sufficient. The paper implies proprietary literature access can improve models, yet may only provide temporary differentiation.

  • Validation and benchmarking may become a new “value layer.” Publishers or clinical evidence organizations could play a role in creating trusted evaluation sets, standards, and governance frameworks.

  • RAG/grounding and provenance become central: if frontier models improve with access to paywalled content, stakeholders will fight over terms of access, attribution, and permitted uses, as well as over who gets to define “trusted evidence” (a toy grounding sketch follows this list).
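
What grounding with provenance means mechanically can be shown in a toy sketch (purely illustrative; real systems use embeddings, licensed corpora, and negotiated access terms): retrieve a source, answer only from it, and attach the citation.

```python
# Toy grounded answer with provenance: retrieve, answer from source, cite it.
SOURCES = {
    "guideline-2024-chole": "Early laparoscopic cholecystectomy is recommended "
                            "within 72 hours for acute cholecystitis.",
    "review-2023-risk": "ASA class and age are major drivers of perioperative risk.",
}

def retrieve(query: str):
    # Crude keyword-overlap retrieval; real systems use dense embeddings.
    q = set(query.lower().split())
    return max(SOURCES.items(), key=lambda kv: len(q & set(kv[1].lower().split())))

def grounded_answer(query: str) -> str:
    doc_id, passage = retrieve(query)
    # Provenance travels with the answer, so the claim stays attributable.
    return f"{passage} [source: {doc_id}]"

print(grounded_answer("timing of cholecystectomy for acute cholecystitis"))
```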

Policy and regulatory consequences

  • Clearer rules on healthcare data ownership and custodianship become unavoidable, especially as “agents” request automated data access.

  • Regulation will likely shift from model descriptions to lifecycle obligations: monitoring, drift detection (a minimal check is sketched after this list), audits, update governance, and evidence of patient outcome impact.

  • Institutional accountability: regulators may increasingly treat deployment decisions as clinical governance choices, not “IT decisions.”
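
One lifecycle obligation is easy to make concrete: input drift detection. Below is a hedged sketch of one common check, a two-sample Kolmogorov-Smirnov test on a single input feature via scipy; real monitoring would track many such signals, plus calibration and downstream outcomes.

```python
# Drift check sketch: compare a live input distribution to the validation-era one.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=60, scale=10, size=5000)  # e.g. ages at validation
live = rng.normal(loc=66, scale=12, size=5000)       # e.g. ages in production

stat, p_value = ks_2samp(reference, live)
print(f"KS statistic={stat:.3f}, p={p_value:.2e}")
if p_value < 0.01:
    print("input drift detected: trigger review before trusting model outputs")
```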

5) The paper’s underlying provocation

The deepest provocation is this: surgical AI is not primarily a model-building problem; it’s a systems problem. The winners won’t be the teams with the cleverest hand-crafted features. They’ll be the teams that can:

  • assemble real-world data at scale (securely and lawfully)

  • define what outcomes truly matter

  • measure uncertainty and failure honestly

  • prove benefit in real workflows

  • govern the system over time

That is both pessimistic (about bespoke models) and empowering (about what surgeons can uniquely do). It’s also a direct challenge to how much of medical AI research currently signals progress.