This report distills what we learned watching 100+ Shopify operators take agents from dry-run to live across apparel, home goods, beauty, outdoor, pet, and wellness. It is deliberately light on hype and heavy on the boring decisions that actually predicted whether autonomy stuck.
01 · The state of autonomous commerce in 2026
Autonomy crossed a line this year: the question stopped being 'can an agent do this?' and became 'can I prove what it did?'. The operators who adopted fastest weren't the most technical — they were the ones whose tooling made every action legible and reversible. Capability was table stakes; accountability was the unlock.
02 · Dry-run to Phase 2 — the 14-day pattern
The single strongest predictor of a calm go-live was time spent in dry-run. Teams that ran ~14 days of read-only proof flipped to live with almost no surprises. Teams that rushed it found their surprises in production instead.
- Days 1–2: connect a read-only token and run the first dry-run cycle within the hour.
- Days 3–7: read the proposed rows daily, tune policies where the agent surfaces a too-strict rule.
- Days 8–14: rows stop surprising you. That boredom is the signal you're ready.
03 · Reading the decision_log: queries that matter
Five queries showed up in nearly every deployment's first week: what did the agent do today, what did it want to do but couldn't (rejected rows), what did we reverse, which gate blocks the most, and which evidence rows recur. CFOs asked for the second and third before anything else.
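The five questions above can be sketched as SQL run from Python's built-in `sqlite3` module. This is a minimal illustration, not the real schema: the table layout, the column names (`day`, `action`, `status`, `gate`, `evidence`), the status values, and the sample rows are all assumptions made up for this example.

```python
import sqlite3

# Hypothetical decision_log schema; every column name and value is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE decision_log (
        id INTEGER PRIMARY KEY,
        day TEXT,       -- ISO date of the decision
        action TEXT,    -- what the agent proposed
        status TEXT,    -- 'executed', 'rejected', or 'reversed'
        gate TEXT,      -- gate that blocked the row, NULL if none
        evidence TEXT   -- evidence row the decision cited
    )
""")
conn.executemany(
    "INSERT INTO decision_log (day, action, status, gate, evidence) VALUES (?, ?, ?, ?, ?)",
    [
        ("2026-03-02", "pause_ad_set", "executed", None, "roas_7d"),
        ("2026-03-02", "raise_price", "rejected", "cost_confidence", "margin_est"),
        ("2026-03-02", "send_cs_reply", "rejected", "voice_floor", "reply_score"),
        ("2026-03-02", "add_negative_kw", "reversed", None, "roas_7d"),
        ("2026-03-01", "raise_price", "rejected", "cost_confidence", "margin_est"),
    ],
)

def q(sql, *params):
    """Run one parameterized query and return all rows."""
    return conn.execute(sql, params).fetchall()

today = "2026-03-02"

# 1. What did the agent do today?
done_today = q(
    "SELECT action FROM decision_log WHERE day = ? AND status = 'executed' ORDER BY id", today)
# 2. What did it want to do but couldn't (rejected rows)?
rejected_today = q(
    "SELECT action, gate FROM decision_log WHERE day = ? AND status = 'rejected' ORDER BY id", today)
# 3. What did we reverse?
reversed_rows = q(
    "SELECT day, action FROM decision_log WHERE status = 'reversed' ORDER BY id")
# 4. Which gate blocks the most?
top_gates = q(
    "SELECT gate, COUNT(*) AS n FROM decision_log WHERE gate IS NOT NULL "
    "GROUP BY gate ORDER BY n DESC, gate")
# 5. Which evidence rows recur?
recurring_evidence = q(
    "SELECT evidence, COUNT(*) AS n FROM decision_log "
    "GROUP BY evidence HAVING COUNT(*) > 1 ORDER BY n DESC, evidence")
```

Keeping each question as a single parameterized query makes it straightforward to hand the second and third to a finance reviewer unchanged.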
04 · Kill switch drills: monthly cadence, quarterly audit
Teams that drilled the kill switch monthly trusted autonomy more, not less. The drill is short: press the switch, watch executors stand down, verify the chain signature, and flip it back on. It moved boards from wary to comfortable faster than any demo.
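The drill steps can be rehearsed in a few lines of Python. This sketch rests on a loud assumption: it treats the "chain signature" as a SHA-256 hash chain over audit entries, which is a stand-in and not a cryptographic signature or the actual product mechanism. The class names, the `drill` helper, and the executor names are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain

def entry_hash(prev_hash, payload):
    """Hash an entry together with the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

class AuditChain:
    """Append-only log where each entry commits to the one before it."""
    def __init__(self):
        self.entries = []  # list of (payload, digest) pairs

    def append(self, payload):
        prev = self.entries[-1][1] if self.entries else GENESIS
        self.entries.append((payload, entry_hash(prev, payload)))

    def verify(self):
        prev = GENESIS
        for payload, digest in self.entries:
            if entry_hash(prev, payload) != digest:
                return False
            prev = digest
        return True

class KillSwitch:
    def __init__(self, chain):
        self.engaged = False
        self.chain = chain

    def set(self, engaged, actor):
        # Every flip of the switch is itself recorded on the chain.
        self.engaged = engaged
        self.chain.append({"event": "kill_switch", "engaged": engaged, "actor": actor})

def run_executor(name, switch):
    """Executors check the switch before acting and stand down when it is engaged."""
    return "stood_down" if switch.engaged else "executed"

def drill(switch, executors, actor):
    """Press, watch executors stand down, verify the chain, flip back on."""
    switch.set(True, actor)
    states = {name: run_executor(name, switch) for name in executors}
    chain_ok = switch.chain.verify()
    switch.set(False, actor)
    return {
        "all_stood_down": all(s == "stood_down" for s in states.values()),
        "chain_ok": chain_ok,
        "live_again": not switch.engaged,
    }

chain = AuditChain()
switch = KillSwitch(chain)
result = drill(switch, ["catalog", "campaign", "cs"], actor="ops")
```

A drill passes only when all three flags in `result` are true; a single false flag is the finding worth bringing to the next audit.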
05 · Self-calibrating thresholds vs. industry rules of thumb
Fixed ROAS floors and generic negative-keyword lists consistently underperformed thresholds anchored to each store's own distribution. The 'best practice' number from a podcast was, in almost every case, the wrong number for the specific business applying it.
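One way to anchor a floor to a store's own distribution is to take a low percentile of its trailing ROAS history. The sketch below assumes that approach; the 25th percentile, the function name `calibrated_floor`, the generic floor of 3.0, and both sample histories are illustrative choices, not figures from the deployments.

```python
from statistics import quantiles

def calibrated_floor(history, percentile=25):
    """Return the value at the given percentile of the store's own history."""
    if len(history) < 2:
        raise ValueError("need at least two observations to calibrate")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # n=100 yields 99 cut points; 'inclusive' keeps results within the observed range.
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[percentile - 1]

GENERIC_FLOOR = 3.0  # hypothetical rule-of-thumb number

# Made-up trailing ROAS per campaign for two very different stores.
store_a = [1.8, 2.1, 2.4, 2.0, 2.6, 1.9, 2.2, 2.3]
store_b = [4.5, 5.2, 6.1, 4.8, 5.5, 5.0, 5.9, 4.9]

floor_a = calibrated_floor(store_a)
floor_b = calibrated_floor(store_b)

# Count how many campaigns each kind of floor would flag.
flagged_a_generic = sum(r < GENERIC_FLOOR for r in store_a)
flagged_b_generic = sum(r < GENERIC_FLOOR for r in store_b)
flagged_a_own = sum(r < floor_a for r in store_a)
flagged_b_own = sum(r < floor_b for r in store_b)
```

With these sample numbers the generic floor flags every campaign in the first store and none in the second, while each store's own floor singles out only its weakest campaigns.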
06 · The CS reply judge — where to set the floor
Most teams started with a brand-voice fidelity floor of 0.78 and tightened it during launches. The metric worth watching was the held-reply rate, not the auto-send rate: when held replies were mostly false positives, the anchor had matured enough for the floor to be loosened.
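The held-reply rate and the share of held replies that turn out to be false positives can be computed as below. This is a sketch under assumptions: a reply is held when its judge score falls below the floor, and a held reply counts as a false positive when a human reviewer approves it unchanged. The `Reply` type, its field names, and the sample scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reply:
    fidelity: float                            # judge's brand-voice score in [0, 1]
    approved_unchanged: Optional[bool] = None  # reviewer verdict, None if not reviewed

def held_stats(replies, floor=0.78):
    """Summarize how often replies are held and how often holds were unnecessary."""
    held = [r for r in replies if r.fidelity < floor]
    reviewed = [r for r in held if r.approved_unchanged is not None]
    false_positives = [r for r in reviewed if r.approved_unchanged]
    return {
        "held_rate": len(held) / len(replies) if replies else 0.0,
        # None means no held reply has been reviewed yet, so there is no signal.
        "false_positive_share": (
            len(false_positives) / len(reviewed) if reviewed else None
        ),
    }

sample = [
    Reply(0.90), Reply(0.85), Reply(0.80),     # auto-sent
    Reply(0.70, approved_unchanged=True),      # held, reviewer sent as-is
    Reply(0.60, approved_unchanged=True),      # held, reviewer sent as-is
    Reply(0.50, approved_unchanged=False),     # held, reviewer rewrote it
]
stats = held_stats(sample)
```

A high `false_positive_share` sustained over several weeks is the pattern described above as a reason to loosen the floor; what counts as "mostly" is left to the team.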
07 · Cost-confidence tiers and what they protect
Tiering actions by cost confidence let agents stay useful under imperfect information: act where being wrong is cheap and reversible, wait where being wrong is expensive and sticky. Price moves stayed gated to Tier A; soft, reversible actions ran at Tier B.
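A tier gate of this kind can be sketched as a lookup from action to the minimum cost-confidence tier it requires. The meaning assigned to each tier, the action names, and the default for unknown actions are assumptions for illustration; only the rule that price moves need Tier A while soft, reversible actions run at Tier B comes from the text.

```python
from enum import IntEnum

class Tier(IntEnum):
    B = 1  # assumed: cost data is estimated or partial
    A = 2  # assumed: cost data is verified

# Minimum tier each action requires (action names are hypothetical).
REQUIRED_TIER = {
    "price_change": Tier.A,          # expensive and sticky if wrong
    "pause_ad_set": Tier.B,          # cheap and reversible
    "add_negative_keyword": Tier.B,  # cheap and reversible
}

def decide(action, sku_tier):
    """Return 'act' when the SKU's cost confidence meets the action's requirement."""
    # Unknown actions default to the strictest tier, so new action types wait by default.
    required = REQUIRED_TIER.get(action, Tier.A)
    return "act" if sku_tier >= required else "wait"

decisions = {
    ("price_change", "B"): decide("price_change", Tier.B),
    ("price_change", "A"): decide("price_change", Tier.A),
    ("pause_ad_set", "B"): decide("pause_ad_set", Tier.B),
}
```

Defaulting unlisted actions to the strictest tier is a design choice in this sketch: it keeps a newly added action from running under weak cost data until someone classifies it.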
08 · 2027 outlook
The next frontier isn't more autonomy — it's cross-agent coordination under one audit plane: catalog, campaign, and CS reasoning about the same store state, with every decision still a reversible row. Legibility, not capability, will keep being the constraint that matters.
The teams that won with autonomy weren't the boldest. They were the ones who made every move a row they could read and reverse.
— The 2026 Autonomous Commerce Report
