I sat through a presentation last quarter where a director of digital transformation walked us through an AI deployment at a Tier 1 plant. Forty-seven slides. Engagement metrics. Query volumes. Sentiment heatmaps. Somewhere around slide thirty-eight I asked the question that should have been on slide one: what moved on the floor? Scrap rate? Right-first-time? Audit findings? Silence. He scrolled back to a dashboard showing "user satisfaction" at 84%. That number is the whole problem. Boards and investors have started grading AI deployments on operational results, not engagement scores — and most companies deploying AI in operations cannot point to a single quality metric that moved. They have dashboards. What they do not have is a closed A3.

Every AI deployment is a process change — and process changes get baselined

Here is an uncomfortable observation. In quality management, every intervention follows the same arc: baseline the current state, form a hypothesis, deploy a countermeasure, measure the outcome, then either close the loop with verified effectiveness or kill the countermeasure. PFMEA updates trigger capability studies. QRQC sessions open with containment and close with verified effectiveness. A3 reports end with a measured delta against a pre-intervention baseline. This is not bureaucracy. It is the only honest way to know whether something worked.

AI deployments somehow got exempted from this discipline. A vendor arrives with a proof of concept. The proof of concept produces a demo. The demo produces excitement. The excitement produces a purchase order. And eighteen months later, nobody can tell you whether defect cost per unit moved by a single eurocent. The technology was evaluated. The intervention was not.

If you cannot close an AI deployment the way you close an A3, you have not deployed technology — you have purchased a narrative.

This is not a minor procedural complaint. It is the reason AI ROI conversations now happen in boardrooms with auditors present, not at innovation conferences with keynote speakers. The discipline of baselining was never optional. It was ignored because the technology felt new enough to excuse it. Not anymore.

The three numbers that actually matter

Strip away the dashboards and the engagement scores and the "AI maturity assessments," and you are left with a question that any plant quality manager can answer in seconds: did the cost of poor quality go down? In practice, that resolves into three measurable outputs.

  • Defect cost per unit. At WITTE Automotive, we ran QRQC and A3 loops on chronic failure modes that were bleeding margin. Not conceptually — in euros, tracked weekly, closed against a baseline. The defect-cost reduction was substantial. If an AI tool touches your quality process, it must show up here. If it does not, the tool is decorative.
  • Internal lead time. At Airbus, Routing Verification KPIs cut internal lead time by 97% in the scope where they were applied. That is a measured outcome against a baseline, not a projection, and the loop was closed. Any AI system claiming to improve operational flow should be held to the same standard: show me the lead-time delta against the pre-deployment state, verified by MES data you were already collecting.
  • Audit findings. I have spent a career reducing external audit findings, including a 50% reduction in a single EASA cycle. Audit findings are the most unforgiving metric in manufacturing because the auditor does not care about your technology stack. They care about evidence, traceability, and closed corrective actions. If your AI deployment reduces findings, you have a case. If it generates new ones because process documentation lagged behind the tool, you have a liability.

A proper AI ROI closing report looks exactly like a QRQC closing report. Pre-deployment baseline. Post-deployment measurement. Containment of regressions. Verified effectiveness. Signed off. The format already exists. What is missing is the willingness to apply it to AI.
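To make that format concrete, here is a minimal sketch of a closing check in Python. It is an illustration under stated assumptions, not a standard QRQC template: the names MetricResult and closing_status, the status labels, and the 10% required reduction are all hypothetical, and the sample numbers are placeholders rather than results from any plant. It assumes each metric is "lower is better", which holds for defect cost per unit, internal lead time, and audit findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """One metric measured before and after deployment (lower is better)."""
    name: str
    baseline: float  # pre-deployment value, measured before go-live
    post: float      # post-deployment value, same measurement method

    def __post_init__(self) -> None:
        # A relative delta is meaningless without a positive baseline.
        if self.baseline <= 0:
            raise ValueError(f"{self.name}: baseline must be positive")

    @property
    def delta_pct(self) -> float:
        """Relative change against baseline in percent (negative = reduction)."""
        return (self.post - self.baseline) / self.baseline * 100.0


def closing_status(result: MetricResult, required_reduction_pct: float) -> str:
    """Classify a metric the way a closing report would (hypothetical labels)."""
    if result.delta_pct <= -required_reduction_pct:
        return "verified effective"
    if result.delta_pct > 0:
        return "regression: contain"
    return "not effective: keep open or kill"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    report = [
        MetricResult("defect_cost_per_unit_eur", baseline=2.0, post=1.5),
        MetricResult("internal_lead_time_days", baseline=10.0, post=12.0),
        MetricResult("audit_findings", baseline=100.0, post=98.0),
    ]
    for metric in report:
        status = closing_status(metric, required_reduction_pct=10.0)
        print(f"{metric.name}: {metric.delta_pct:+.1f}% vs baseline -> {status}")
```

The point of the sketch is the shape, not the thresholds: no baseline, no delta, and therefore no status other than "open".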

Why I built MultiPS instead of buying a tool

When I designed MultiPS, the multi-model orchestration platform now running 63+ models in parallel with consensus synthesis, the architecture was driven by a quality question that no single model can settle on its own: is this output reliable enough to act on?

In manufacturing quality, we do not accept a single inspection point on a critical characteristic. We cross-verify. We use redundant measurement systems. We require agreement before we release product. A single model generating an answer is a single inspection point — and anyone who has run a PFMEA knows that single-point failures are precisely what you engineer out of a process. MultiPS was specified by quality requirements, not the other way around. The consensus architecture exists because I would not accept an AI deployment in my own operations that could not pass the same reliability test I apply to any measurement system on the shop floor.
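The cross-verification idea can be sketched as a simple release gate. This is not the MultiPS implementation, which is not described here; it is a minimal illustration that assumes each model returns a short categorical answer and that agreement is a plain majority vote over normalized strings. The function name consensus_gate and the 0.8 agreement threshold are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def consensus_gate(
    answers: List[str], min_agreement: float = 0.8
) -> Tuple[Optional[str], float]:
    """Release an answer only if enough independent models agree on it.

    Returns (answer, agreement). The answer is None when agreement falls
    below min_agreement, i.e. the output is held instead of acted on.
    """
    if not answers:
        raise ValueError("at least one model answer is required")
    # Normalize so trivial formatting differences do not count as disagreement.
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    released = top_answer if agreement >= min_agreement else None
    return released, agreement


if __name__ == "__main__":
    # Hypothetical outputs from five models judging the same inspection record.
    answer, agreement = consensus_gate(["OK", "ok ", "ok", "NOK", "ok"])
    print(answer, agreement)
```

The design choice mirrors a redundant measurement system: a disagreement does not produce a weaker answer, it produces no release at all until someone looks at it.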

Key takeaways

  • Treat every AI deployment as an A3. Baseline before you deploy. Measure after. No baseline, no claim of success — only activity.
  • Grade on three numbers only. Defect cost per unit, internal lead time, audit findings. If your AI investment cannot move one of these against baseline, you have bought a dashboard, not a result.
  • Specify technology from quality requirements, not the reverse. Start with: what does "reliable enough to act on" mean in this process? Let the answer determine your architecture.
  • Close the loop or kill it. QRQC discipline applies to AI the same way it applies to any process change. Verified effectiveness or containment. No open loops.

AI in operations will eventually be graded the way every quality intervention is graded: did it reduce the cost of poor quality? The companies that win will be the ones that asked that question before they deployed, not after, and had the discipline to close the A3 before they claimed the result.