AAMAS 2026 opened today in Paphos with 1,455 Main Track papers, the highest count in twenty-five years of the conference. It is also the first time a top-tier agents conference has named multi-agent LLM training as a top-level research area. The same week, an arXiv paper out of the AWS Generative AI Innovation Center and HSBC Holdings reported citation-grade numbers, and it answers, with quantitative anchors, the question Sunday's Stanford "Swarm Tax" paper put on the table.

The paper is arXiv:2604.07667, "From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation." At a miscoverage level of α=0.05, the conformal layer intercepts 81.9% of wrong-consensus cases. The cases the system does act on (the conformal singletons) land at 90.0–96.8% accuracy. That's the cleanest pair of numbers I've read this year on the question of how to make multi-agent debate safe enough to ship to production.

Why this matters this week

Sunday I wrote about the Stanford Swarm Tax paper — Tran and Kiela's argument that single-agent LLMs outperform multi-agent systems on multi-hop reasoning under matched compute budgets. The narrow claim is right. The wider question — when is the swarm tax worth paying? — wasn't settled by that paper. It was framed.

Then the empirical floor showed up. Augment Code's 2026 post-mortem on multi-agent failure modes: 21.30% of failures come from verification gaps. In LangGraph, attacking the orchestration hub produces 100% system-wide failure versus 9.7% from a leaf-agent attack. Velsof's industry survey: 88% of enterprise AI agents never leave the pilot phase, and 95% of organizations report zero return on their AI investment. Lanham's What Actually Survived report names only three patterns making it to production in 2026 — agent-flow, orchestration, and bounded collaboration. Peer-collaboration multi-agent systems failed at scale.

That's the context. Multi-agent debate is being asked to justify itself against an 88% pilot-dropout baseline. And the AAMAS-tier answer to "how" landed today.

The paper, accurately

Heterogeneous LLM agents debate for T rounds. Each round produces verbalized probability distributions over candidate answers. A linear opinion pool aggregates those distributions into social probabilities. A held-out set calibrates a conformal threshold. A hierarchical action policy uses that threshold to decide act or escalate.
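The aggregation and calibration steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the pool uses equal weights by default, and the nonconformity score (one minus the social probability of the true answer) is a common split-conformal choice assumed here for concreteness. The paper's actual score and its handling of debate rounds may differ.

```python
import math
from typing import List, Optional, Sequence


def linear_opinion_pool(
    agent_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Aggregate per-agent distributions into one social distribution
    by taking a weighted average (equal weights unless given)."""
    n_agents = len(agent_probs)
    if weights is None:
        weights = [1.0 / n_agents] * n_agents
    n_answers = len(agent_probs[0])
    return [
        sum(w * probs[j] for w, probs in zip(weights, agent_probs))
        for j in range(n_answers)
    ]


def calibrate_threshold(
    cal_social_probs: Sequence[Sequence[float]],
    cal_labels: Sequence[int],
    alpha: float,
) -> float:
    """Split-conformal calibration (assumed score: 1 - p(true answer)).
    Returns the finite-sample-corrected (1 - alpha) quantile of the scores."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_social_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration examples for this alpha: keep every answer.
        return 1.0
    return scores[rank - 1]


def prediction_set(social_probs: Sequence[float], q_hat: float) -> List[int]:
    """Indices of all candidate answers whose score 1 - p is within q_hat."""
    return [j for j, p in enumerate(social_probs) if 1.0 - p <= q_hat]


if __name__ == "__main__":
    # Two agents, three candidate answers (made-up numbers for illustration).
    pooled = linear_opinion_pool([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
    # Nine held-out cases whose true answer is index 0 (also made up).
    cal = [[p, (1.0 - p) / 2, (1.0 - p) / 2]
           for p in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)]
    q_hat = calibrate_threshold(cal, [0] * len(cal), alpha=0.25)
    print(pooled, q_hat, prediction_set(pooled, q_hat))
```

A smaller α pushes the threshold up, so more answers pass it and the prediction sets grow. That gives fewer singletons and more escalations, which is the automation-for-safety trade in concrete form.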

The key move is that the system refuses to act when the debate is confidently wrong. The conformal layer doesn't just measure consensus. It measures whether the consensus is calibrated. When it isn't, the case escalates to human review instead of getting committed to an automated action.
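To make the act-or-escalate step concrete, here is a minimal sketch under a simplifying assumption: the system commits only when the conformal prediction set (the candidate answers that pass the calibrated threshold) contains exactly one answer, and hands everything else to a human. The `Decision` type and `decide` function are hypothetical names for illustration, and the paper's hierarchical policy may have more tiers than this two-way split.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    action: str            # "act" or "escalate"
    answer: Optional[int]  # index of the committed answer when acting
    reason: str


def decide(prediction_set: List[int]) -> Decision:
    """Act on conformal singletons; escalate everything else to human review."""
    if len(prediction_set) == 1:
        return Decision("act", prediction_set[0], "conformal singleton")
    if not prediction_set:
        # No candidate passed the threshold: the debate supports nothing.
        return Decision("escalate", None, "empty prediction set")
    # Several candidates remain plausible: the consensus is not calibrated.
    return Decision(
        "escalate", None, f"ambiguous set of {len(prediction_set)} answers"
    )


if __name__ == "__main__":
    print(decide([2]))
    print(decide([0, 1]))
    print(decide([]))
```

The design choice worth noting is that escalation is the default. The only path to an automated action is the singleton branch, so any case the calibration cannot narrow to one answer ends up with a person.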

Result: 81.9% wrong-consensus interception at α=0.05. Conformal singletons at 90.0–96.8% accuracy. The conformal layer is trading some automation for a lot of safety — and the cases it does act on are the cases that earned it.

Notice the author footprint. AWS Generative AI Innovation Center plus HSBC Holdings. Not Anthropic. Not OpenAI. Not a research lab. The hyperscaler and the bank. The safety primitive for multi-agent debate is being co-authored at the customer-and-cloud pairing tier — the same pattern the procurement layer has been running for months at the Big Four (Deloitte, PwC, and KPMG aligning on Claude) and at the Pentagon (eight vendors cleared for classified-AI deployment as of today, Oracle added as the eighth, Anthropic still excluded). Bank-tier customers and their cloud of record are co-writing the safety architecture. That's where this stuff is being built now.

The reframe — pick the strategy, pick the conformal threshold

Here's what the Conformal Social Choice paper actually gives us. The act-versus-escalate primitive is the structural answer to the Swarm Tax. The Swarm Tax pays for adversarial verification, calibration, and auditability. Conformal social choice converts that payoff into act-versus-escalate semantics with empirical safety guarantees.

That primitive lives at the architecture layer. The user doesn't choose α=0.05 directly. The user chooses what kind of debate they want. The choice of debate is the choice of how tightly to set the threshold.

That's exactly what Shingikai's six strategies are. Six explicit user-selectable positions on the conformal-threshold-tightness choice.

Six strategies, six positions

  • Red Team vs. Blue Team — high-threshold-escalate-when-uncertain. One model attacks the proposed answer. Another defends it. The Chairman synthesizes. If they disagree at the synthesis layer, you've found the case the conformal layer would escalate. Structurally, the strategy is adversarial verification with explicit disagreement surfacing.

  • Survivor — mid-threshold-with-ruthless-elimination. Start with several candidate answers. Eliminate the weakest each round. End with a jury. The elimination process is itself a calibration — a candidate that can't survive challenge isn't one the system should act on.

  • Traditional Council — heterogeneity-with-synthesis. Independent parallel channels, then the Chairman synthesizes. The strategy is the Pentagon's eight-vendor pattern in miniature — don't trust one model, run several in parallel, synthesize what survives.

  • Round Robin — iterative-refinement-with-implicit-escalation. Each model improves the previous draft. Drift gets caught by the chain — if the next model doesn't agree, it changes the draft, and the disagreement is visible in the diff.

  • Collaborative Editing — parallel-document-work-with-diffs. Multiple models edit the same document. Disagreement shows up as edit conflict. The Chairman resolves.

  • Quick Take — no-conformal-threshold-no-debate. One model, one pass, done. This is the option for when the question doesn't justify the swarm tax. The Stanford paper is right about a real class of questions, and Quick Take is the strategy that pays the paper its due.

Six strategies. Six positions on which version of the Conformal Social Choice payoff your question type justifies. The user picks the strategy. The strategy picks the threshold.

The close

The Swarm Tax is real, and it's worth paying when the question type justifies it. The Conformal Social Choice paper converts that payoff into 81.9% wrong-consensus interception and 90.0–96.8% singleton accuracy. AAMAS Main Track opens Wednesday in Paphos, and that paper is one of the cleanest architectural-engagement anchors of the conference week. The empirical floor under the discussion: 88% pilot dropout, three patterns that survived, and peer collaboration that failed at scale.

Six explicit strategies are six positions on the act-versus-escalate primitive. The user picks one. That's the threshold.

Try it free — no signup. shingik.ai