ACV's computer vision model could inspect a vehicle faster and more cheaply than a human inspector, and its accuracy was proven. The UX problem was trust, and our first attempt at solving it made things worse. This is the story of version 1, what we learned, and how version 2 shipped.
ACV's legacy inspection process required sending a human inspector to every vehicle before it listed. That inspector would photograph, grade, and report on condition: a process that cost real money per vehicle and created scheduling complexity that limited how quickly sellers could list.
The ML team had spent 18 months building a computer vision model that could analyze photos submitted by sellers and generate condition reports at a fraction of the cost. On held-out test data, the model was more consistent than human inspectors on 14 of 22 condition categories. It was ready to ship.
The problem was that "ready" for the model team meant something very different from "ready" for the buyers who would rely on those reports to spend $20,000.
When we started this project, the framing from product leadership was: "How do we communicate to buyers that the AI is trustworthy?" That framing assumes trust is something you communicate — a message to get across.
After 8 dealer interviews and 3 usability sessions on v1, I came to a different conclusion: trust isn't communicated. It's earned through experience, and eroded by surprises. Dealers who'd been burned by bad condition data in the past — from any source — weren't going to trust a new data source based on our claim that it was good. They needed to encounter it, use it, and have it be right. Repeatedly.
V1 tried to solve this through transparency: we showed dealers AI confidence scores on every condition field, labeled the source as "Auto-assessed," and included a modal explaining the model's accuracy metrics. It felt thorough. It was a disaster.
Confidence scores on every field created decision paralysis, not confidence. Dealers spent more time trying to interpret the scores than the conditions themselves. Trust ratings for AI-generated reports dropped 0.4 points vs. human-generated reports in the same time period. We shipped something that actively made trust worse.
V2 started with a different question: not "how do we explain AI?" but "what does a dealer actually need to feel confident placing a bid?"
Buyers who used AI-assessed reports and had good outcomes increased their bid rate on subsequent AI reports by 23%. Trust is earned by being right, not by explaining your methodology.
The human escalation layer adds per-vehicle cost that the business case hadn't fully accounted for. We had to re-model the unit economics before getting exec sign-off on v2.
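Re-modeling the unit economics comes down to simple expected-cost arithmetic. The sketch below shows the shape of that calculation; all dollar figures and the escalation rate are illustrative assumptions, not ACV's actual numbers.

```python
def blended_cost_per_vehicle(ai_cost: float, human_cost: float,
                             escalation_rate: float) -> float:
    """Expected inspection cost per vehicle when some fraction of
    AI-assessed vehicles also require a human pass.
    All inputs here are hypothetical, for illustration only."""
    return ai_cost + escalation_rate * human_cost

# Illustrative comparison against an all-human baseline.
ai_only = blended_cost_per_vehicle(ai_cost=3.00, human_cost=45.00, escalation_rate=0.0)
with_escalation = blended_cost_per_vehicle(ai_cost=3.00, human_cost=45.00, escalation_rate=0.25)
all_human = 45.00

print(ai_only)          # 3.0
print(with_escalation)  # 14.25 -- escalation erodes but doesn't erase the savings
print(all_human)        # 45.0
```

The point of the exercise: even a modest escalation rate materially changes the per-vehicle cost, which is why the business case needed a second pass before exec sign-off.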
The "Auto-assessed" label still reduces bid confidence compared to "Inspected by ACV" on matched vehicle pairs. We accept this as the cost of transparency. The question is how much gap is acceptable — and we don't have a clean answer yet.
AI reports now cover 40% of listings with no human inspection required. That inventory would not have been listable under the old model. New supply = more buyer choice = more competitive marketplace.
Transparency and trust are not the same thing. V1 was maximally transparent and minimally trusted. Trust comes from being right repeatedly, not from explaining how you might be wrong.
ML accuracy metrics don't translate to UX design decisions. "87% confidence" means something meaningful to a data scientist. To a dealer buying a car, it means "13% chance I'm about to get burned." Design for the mental model, not the model output.
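One concrete way to design for the mental model is to never surface a raw probability at all, and instead bucket model confidence into dealer-facing language. A minimal sketch, with hypothetical tier names and thresholds:

```python
def display_tier(confidence: float) -> str:
    """Map raw model confidence to dealer-facing copy.
    Tier labels and thresholds are illustrative assumptions;
    the principle is that buyers never see '87%' directly."""
    if confidence >= 0.95:
        return "Verified"
    if confidence >= 0.80:
        return "Assessed"
    # Below the floor, don't show a shaky number: route to a human instead.
    return "Needs human review"

print(display_tier(0.97))  # Verified
print(display_tier(0.87))  # Assessed
print(display_tier(0.60))  # Needs human review
```

The design choice encoded here: low-confidence fields become an escalation path rather than a visible caveat, so the dealer never has to do probability math to decide whether to trust a field.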
AI UX requires a theory of trust accumulation, not just disclosure. How does a user go from "skeptical of AI reports" to "routinely confident using AI reports"? We didn't have an answer to that in v1. V2 built an explicit trust accumulation model: good outcomes build credit, which raises willingness to rely on AI data over time.
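The trust accumulation model described above can be sketched as a simple per-dealer ledger. Everything here is a toy illustration under stated assumptions: the asymmetric weights and the reliance threshold are hypothetical, chosen only to encode "surprises erode trust faster than accuracy builds it."

```python
class TrustLedger:
    """Toy model of trust accumulation for one dealer.
    Good outcomes add credit; bad outcomes subtract more than good
    outcomes add. Constants are illustrative, not measured values."""
    GOOD = 1.0
    BAD = -3.0  # asymmetric: one surprise undoes several wins

    def __init__(self) -> None:
        self.credit = 0.0

    def record(self, outcome_good: bool) -> None:
        self.credit += self.GOOD if outcome_good else self.BAD

    def willing_to_rely(self, threshold: float = 5.0) -> bool:
        """Proxy for 'routinely confident using AI reports'."""
        return self.credit >= threshold

ledger = TrustLedger()
for _ in range(6):
    ledger.record(True)
print(ledger.willing_to_rely())  # True: six good outcomes clear the bar

ledger.record(False)
print(ledger.willing_to_rely())  # False: one bad outcome (6 - 3 = 3) drops below it
```

A model like this also suggests what to instrument: per-dealer outcome history and reliance behavior over time, which is exactly the kind of metric v1 lacked.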
Failing fast is only useful if you learn fast. V1 was a "fail fast" experiment that we ran for 6 weeks before pulling it. But the post-mortem took 3 more weeks because we hadn't instrumented the right metrics upfront. The learning lag was avoidable.