Product · 2026-05-04 · 6 min read

How Reply Judge catches brand-voice drift.

A second-pass judge that scores every draft on policy grounding, factual grounding, and brand voice. Fail any dimension, and the reply doesn't go out — it goes to your inbox.

Magistry Team

Product

The fastest way to lose a customer's trust isn't a wrong answer — it's a right answer in the wrong voice. A support reply that's technically correct but reads like a different company is a small betrayal, and at scale, small betrayals compound. Reply Judge exists to catch that betrayal before it ships.

Two models, two jobs

The responder drafts. The judge decides whether the draft sounds like you. We keep them separate on purpose: the model that's good at being helpful is not automatically good at being you, and conflating the two means you can't gate one without gating the other.

Reply Judge doesn't produce one blended score. It grades each draft on three separate dimensions: policy grounding (is every policy claim backed by the merchant's actual policy text?), factual grounding (is every fact traceable to the order data?), and brand voice (cosine similarity against an anchor built from your sent, human-approved replies). Each dimension has its own floor: 0.7 for policy and factual grounding, 0.6 for brand-voice cosine. Fail any one, and the reply doesn't send; it lands in your inbox with the failing dimension and the specific fix named.
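To make the brand-voice dimension concrete, here is a minimal sketch of scoring a draft by cosine similarity against an anchor. It rests on assumptions that are not in the product description: replies are represented as embedding vectors, the anchor is the mean of the approved replies' vectors, and the helper names `cosine` and `anchor_centroid` are hypothetical. The vectors are toy values, not real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as no similarity
    return dot / (norm_a * norm_b)


def anchor_centroid(anchor_vectors):
    """Mean of the anchor vectors (assumption: the anchor is a simple average)."""
    n = len(anchor_vectors)
    return [sum(column) / n for column in zip(*anchor_vectors)]


# Toy 3-dimensional vectors standing in for embeddings of approved replies.
anchor = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
centroid = anchor_centroid(anchor)

on_brand = cosine([0.9, 0.3, 0.0], centroid)   # points the same way as the anchor
off_brand = cosine([0.0, 0.0, 1.0], centroid)  # orthogonal to the anchor
```

In this sketch a draft that points the same way as the anchor scores near 1, and an orthogonal one scores 0, which would fall under the 0.6 floor.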

Reply Judge: a reply held for drift (JSON)

{
  "thread": "cs_thread#41207",
  "draft_intent": "refund_partial",
  "scores": {
    "policy_grounded": 0.84,
    "factual_grounded": 0.91,
    "brand_voice_cosine": 0.52
  },
  "gates": { "policy_grounded": 0.7, "factual_grounded": 0.7, "brand_voice_cosine": 0.6 },
  "decision": "held_for_review",
  "fix": "brand_voice_cosine (0.52 < 0.6): match the brand's tone from the sample replies"
}

That draft was policy-correct and factually correct — and completely off-brand for a store whose anchor replies are warm and plain-spoken. A single model optimizing for helpfulness would have sent it. The judge caught the voice dimension at 0.52 and routed it to a human, with the failing gate named.
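The hold decision in that record can be reproduced with a few lines. This is a sketch, not Reply Judge's implementation: the `judge` function, the `"send"` label for a passing draft, and the choice to treat a missing score as a failure are all assumptions. The floors, scores, and the `held_for_review` label are taken from the record above, and a score equal to its floor is assumed to pass, since the fix message uses a strict `<`.

```python
GATES = {"policy_grounded": 0.7, "factual_grounded": 0.7, "brand_voice_cosine": 0.6}


def judge(scores, gates=GATES):
    """Hold the draft if any dimension is below its floor."""
    # Assumption: a dimension with no score counts as failing (defaults to 0.0).
    failing = [dim for dim, floor in gates.items() if scores.get(dim, 0.0) < floor]
    decision = "held_for_review" if failing else "send"  # "send" is a hypothetical label
    return {"decision": decision, "failing": failing}


# Scores from the held reply shown above.
scores = {"policy_grounded": 0.84, "factual_grounded": 0.91, "brand_voice_cosine": 0.52}
result = judge(scores)
print(result)
```

Two of the three dimensions clear their floors comfortably, yet the draft is still held, because the only failing dimension is enough on its own.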

Why per-dimension gates, not one dial

A single blended fidelity score lets a draft trade dimensions against each other: perfectly on-brand phrasing can smuggle through a policy claim the merchant never made. Separate gates can't be traded: a draft has to clear policy grounding, factual grounding, and brand voice independently, and the weakest dimension decides. The cost of a bad reply is asymmetric (a single bad reply that ships costs more than ten held ones), so any failing dimension holds the draft. Every held reply is also training: approve or edit it, and it becomes part of the anchor.
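A small comparison shows the trade-off. The scores below are invented for illustration, and the blended rule (an unweighted mean with a 0.7 threshold) is a hypothetical stand-in for "one dial", not something Reply Judge uses. The per-dimension floors are the ones stated earlier.

```python
GATES = {"policy_grounded": 0.7, "factual_grounded": 0.7, "brand_voice_cosine": 0.6}
BLENDED_THRESHOLD = 0.7  # hypothetical single-dial threshold


def blended_score(scores):
    """Unweighted mean across dimensions (the 'one dial' being argued against)."""
    return sum(scores.values()) / len(scores)


def clears_all_gates(scores, gates=GATES):
    """Every dimension must clear its own floor independently."""
    return all(scores[dim] >= floor for dim, floor in gates.items())


# Invented example: strong voice and facts, but a weakly grounded policy claim.
scores = {"policy_grounded": 0.45, "factual_grounded": 0.95, "brand_voice_cosine": 0.98}

passes_blended = blended_score(scores) >= BLENDED_THRESHOLD  # mean is about 0.79
passes_gates = clears_all_gates(scores)
weakest = min(scores, key=scores.get)
```

Under the blended rule this draft would go out. Under separate gates it is held, and `weakest` names the policy dimension as the reason.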

Automation that can't tell when it sounds wrong shouldn't be allowed to speak for you. The judge is permission to let it speak at all.
— Magistry

The point isn't that the machine writes every reply. The point is that nothing goes out under your name that doesn't sound like your name. Reply Judge is the line between 'we automated support' and 'we let a stranger answer our customers.'
