I built a jury of 63 AI models. It taught me more about quality than any audit

Here is an uncomfortable observation from someone who has signed off thousands of inspection results: a single verdict tells you almost nothing. If one inspector approves a part, you know the part passed one inspector. Whether it passed because it is good, or because the gauge drifted, or because it was Friday afternoon – that, you do not know. This is why measurement system analysis (MSA) exists. We do not trust a gauge until the gauge itself has been qualified.

Two years ago I applied the same logic to artificial intelligence, and it ended with me building a platform called MultiPS – one question, sixty-three AI models answering in parallel, and a synthesis layer that weighs their answers into a single verdict. I built it because the way most organisations consume AI today would fail any incoming-inspection standard I have ever worked to.

The single-gauge problem, again

Ask one AI model a question and you get an answer delivered with total confidence. Ask sixty-three and you get something far more valuable: a distribution. On routine questions, the jury converges – forty models phrase the same substance differently, and the synthesis is boringly solid. On hard questions, the jury splits. A third goes one way, a third goes another, a third invents details that do not exist.

That split is the product. In metrology, a gauge R&R study does the same job: if repeated measurements of the same part scatter, you do not trust any single reading; you fix the measurement system. Model disagreement works identically. It is not noise to be hidden by picking the most fluent answer. It is a direct reading of how far the question can be entrusted to automation at all.
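To make "reading the split" concrete, here is a minimal Python sketch. It is an illustration only, not a description of how MultiPS works internally. It assumes each model's answer has already been reduced to a short label (a hypothetical normalisation step), and the function name `jury_agreement` is invented for this example.

```python
from collections import Counter
from math import log2


def jury_agreement(verdicts):
    """Summarise how strongly a set of model verdicts agrees.

    `verdicts` holds one normalised answer label per model (assumed to
    exist already; how answers get normalised is out of scope here).
    Returns the majority label, its share of the votes, and the Shannon
    entropy normalised by the number of distinct labels
    (0.0 = unanimous, 1.0 = evenly split across the labels seen).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    total = len(verdicts)
    label, top = counts.most_common(1)[0]
    shares = [c / total for c in counts.values()]
    entropy = sum(p * log2(1 / p) for p in shares)
    # With a single distinct label the entropy is 0; avoid dividing by log2(1) = 0.
    max_entropy = log2(len(counts)) if len(counts) > 1 else 1.0
    return {
        "majority": label,
        "share": top / total,
        "entropy": entropy / max_entropy,
    }


# A converging jury: 60 of 63 models agree.
routine = jury_agreement(["pass"] * 60 + ["fail"] * 3)

# A three-way split: no label carries more than a third of the votes.
hard = jury_agreement(["A"] * 21 + ["B"] * 21 + ["C"] * 21)
```

A high majority share with low entropy corresponds to the "boringly solid" case. A share near one third with entropy near 1.0 is the scatter that, in gauge terms, says the measurement system is not yet capable for this question.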

At Airbus I reduced internal lead time by 97% with Routing Verification KPIs – not by working faster, but by instrumenting a process so that deviation became visible immediately. MultiPS is the same instrument pointed at machine judgement. The consensus layer is a KPI on truth.

Qualify AI like a supplier, not like software

Every serious manufacturer runs supplier qualification. Nobody puts a safety-critical component into an aircraft because the supplier's brochure was persuasive. Yet I watch companies wire a single AI model into customer-facing decisions with less scrutiny than we apply to a bolt supplier. No capability study. No requalification when the model version silently changes. No agreed reaction plan for the day it starts drifting.

The discipline transfers one-to-one. Incoming inspection becomes benchmark suites. Capability studies become accuracy distributions per task type. PPAP, the Production Part Approval Process, becomes a documented baseline before a model touches production. And dual sourcing – the oldest risk tool in procurement – becomes exactly what MultiPS does: never letting one vendor's model be a single point of failure in a decision chain.
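The "capability study" and "documented baseline" ideas can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the task names, the baseline figures and the 0.02 tolerance are all hypothetical, and a real study would also report sample sizes and uncertainty.

```python
from collections import defaultdict


def accuracy_by_task(results):
    """Compute accuracy per task type.

    `results` is an iterable of (task_type, is_correct) pairs, for
    example the graded output of a benchmark run on your own data.
    """
    tally = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    for task, ok in results:
        tally[task][0] += int(bool(ok))
        tally[task][1] += 1
    return {task: correct / total for task, (correct, total) in tally.items()}


def below_baseline(accuracies, baseline, tolerance=0.02):
    """Return task types that fell more than `tolerance` below the baseline.

    A task type missing from `accuracies` counts as 0.0, so an untested
    task is flagged rather than silently passed.
    """
    return sorted(
        task
        for task, base in baseline.items()
        if accuracies.get(task, 0.0) < base - tolerance
    )


# Hypothetical graded run after a model version change.
run = (
    [("extraction", True)] * 9 + [("extraction", False)]
    + [("summary", True)] * 6 + [("summary", False)] * 4
)
current = accuracy_by_task(run)

# Hypothetical baseline documented before the model went to production.
documented = {"extraction": 0.9, "summary": 0.7}
needs_requalification = below_baseline(current, documented)
```

Here `needs_requalification` contains only "summary": extraction still meets its documented baseline, while summary accuracy has dropped from 0.7 to 0.6. Splitting by task type matters for the same reason a capability study is run per characteristic: an acceptable overall average can hide one process that is out of control.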

My security work sharpened the same instinct from the other side. As a certified ethical hacker, I was taught to ask one question of every system: where does it break? I spent years finding vulnerabilities others had missed – including one that put me on T-Mobile's public bug-bounty hall of fame. The attacker's mindset and the quality manager's mindset are the same mindset. Both refuse to accept a system's self-description. Both go looking for the failure mode before it goes looking for you.

What this means if you run a company

You will be sold AI this year. Probably this quarter. The pitch will feature one impressive model doing one impressive demo. Before you sign, ask the questions you would ask of any supplier taking over a critical process. What is the defect rate, measured on your data rather than the vendor's benchmark? What happens when the model is updated – who requalifies it? And what is your second source when it fails at 2 a.m. on a Sunday?
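For the defect-rate question, a point estimate from a small sample can mislead, so it helps to report an interval alongside it. The sketch below uses the Wilson score interval, a standard method for binomial proportions. It is one reasonable choice rather than a prescribed one, and the sample figures are invented for illustration.

```python
from math import sqrt


def defect_rate_interval(defects, samples, z=1.96):
    """Observed defect rate plus a Wilson score interval.

    `defects` is the number of wrong answers found when grading
    `samples` model outputs on your own data. z=1.96 corresponds to
    roughly 95% confidence. Returns (rate, lower, upper).
    """
    if samples <= 0 or not 0 <= defects <= samples:
        raise ValueError("need samples > 0 and 0 <= defects <= samples")
    p = defects / samples
    denom = 1 + z * z / samples
    centre = (p + z * z / (2 * samples)) / denom
    half = z * sqrt(p * (1 - p) / samples + z * z / (4 * samples * samples)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit: 3 wrong answers in 200 graded outputs.
rate, lower, upper = defect_rate_interval(3, 200)
```

With 3 defects in 200 samples the observed rate is 1.5%, but the interval runs from roughly 0.5% to 4.3%. The practical point is the one a quality engineer already knows: a zero-defect demo on a handful of examples says very little about the rate you will see in production.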

None of these questions require a computer science degree. They require the discipline your quality organisation already practises every day. The companies that will extract real value from AI are not the ones with the most licences. They are the ones that treat machine judgement with the same structured distrust they learned to apply to gauges, suppliers and their own optimism. The tooling is new. The discipline is decades old, and it works.

Key takeaways

Never single-source a judgement. One model's confident answer is one inspector's signature. Run material decisions through diverse models and treat disagreement as a first-class signal.
Qualify AI like a supplier. Capability study on your data, documented baseline before production, requalification on every version change – PPAP thinking, applied to models.
Reuse your quality organisation. The discipline for governing AI already exists in your MSA, supplier-qualification and audit practice. Point it at the new system instead of inventing a parallel bureaucracy.
