Case Study 03

AI Condition Reports:
We got it wrong
the first time

ACV's computer vision model could inspect a vehicle faster and more cheaply than a human inspector, and its accuracy was proven. The UX problem was trust, and our first attempt at solving it made things worse. This is the story of version 1, what we learned, and how version 2 shipped.

Role Principal Designer — 0-to-1 AI UX design
Timeline 9 months total (v1: 4mo, v2: 5mo)
Collaborators ML Engineering, Data Science, Legal, Customer Success
Stakes Core to ACV's cost structure and inspection capacity at scale
Status Shipped — v2 in production across 40% of listings

The business case for AI inspection

ACV's existing inspection process required sending a human inspector to every vehicle before it could be listed. That inspector photographed, graded, and reported on condition, a process that cost real money per vehicle and created scheduling complexity that limited how quickly sellers could list.

The ML team had spent 18 months building a computer vision model that could analyze photos submitted by sellers and generate condition reports at a fraction of the cost. On held-out test data, the model was more consistent than human inspectors on 14 of 22 condition categories. It was ready to ship.

The problem was that "ready" for the model team meant something very different from "ready" for the buyers who would rely on those reports to decide whether to spend $20,000.

Trust isn't a feature. It's an outcome.

When we started this project, the framing from product leadership was: "How do we communicate to buyers that the AI is trustworthy?" That framing assumes trust is something you communicate — a message to get across.

After 8 dealer interviews and 3 usability sessions on v1, I came to a different conclusion: trust isn't communicated. It's earned through experience, and eroded by surprises. Dealers who'd been burned by bad condition data in the past — from any source — weren't going to trust a new data source based on our claim that it was good. They needed to encounter it, use it, and have it be right. Repeatedly.

V1 tried to solve this through transparency: we showed dealers AI confidence scores on every condition field, labeled the source as "Auto-assessed," and included a modal explaining the model's accuracy metrics. It felt thorough. It was a disaster.

What v1 actually did

Confidence scores on every field created decision paralysis, not confidence. Dealers spent more time trying to interpret the scores than the conditions themselves. Trust ratings for AI-generated reports dropped 0.4 points vs. human-generated reports in the same time period. We shipped something that actively made trust worse.

Rebuilding the trust model from scratch

V2 started with a different question: not "how do we explain AI?" but "what does a dealer actually need to feel confident placing a bid?" Answering it meant making three decisions: how to present model confidence, how to label the source, and what to do when the model wasn't sure.

Option: Show field-level confidence scores (v1 approach). Maximum transparency. Let buyers calibrate trust themselves.
Verdict: Proven not to work

Option: Show report-level confidence only. Aggregate the field scores into one overall signal. Simpler, but still quantified.
Verdict: Considered

Option: No confidence scores visible by default. Replace with a binary "Inspected / Auto-assessed" signal, visible only on fields where the distinction matters for the buying decision. Let outcome quality, not methodology disclosure, build trust over time.
Verdict: Chosen
Why we chose this: V1 research revealed that confidence scores behaved like uncertainty amplifiers, not trust signals. A field showing "87% confident" didn't read as "reliable" — it read as "13% chance this is wrong." For high-value decisions, 13% is terrifying, not reassuring. The fix wasn't to explain the score better. It was to stop showing it to an audience that couldn't usefully act on it. Confidence scores are useful for internal model monitoring, not for buyer-facing UX. We moved them out of the buyer view entirely and kept them in an internal dashboard for ML and ops.
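The split described above can be sketched as a simple rule. This is a hypothetical illustration only; the field names (`confidence`, `human_verified`) and payload shapes are assumptions for the sketch, not ACV's actual implementation:

```python
# Hypothetical sketch of the v2 display rule: numeric confidence scores stay
# in the internal ML/ops view, while buyers see only a binary source label.
def build_views(field: dict) -> tuple[dict, dict]:
    """Split one condition field into buyer-facing and internal views."""
    buyer_view = {
        "name": field["name"],
        "value": field["value"],
        # Binary source signal replaces the numeric confidence score.
        "source": "Inspected" if field["human_verified"] else "Auto-assessed",
    }
    internal_view = {
        "name": field["name"],
        "value": field["value"],
        "confidence": field["confidence"],  # kept for model monitoring only
    }
    return buyer_view, internal_view

buyer, internal = build_views(
    {"name": "paint", "value": "good", "confidence": 0.87, "human_verified": False}
)
# The buyer view carries no confidence score; the internal view retains it.
```

The point of the sketch is the asymmetry: the same underlying data feeds two audiences, and only the audience that can act on a probability ever sees one.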
"AI-generated" label. Accurate. Transparent. Honest about the source.
Rejected
No disclosure. Treat AI and human reports identically. Legally risky. Ethically problematic.
Non-starter
"Auto-assessed" with contextual explanation available on tap. Describes the method (automated analysis of seller-submitted photos) without using "AI" as the primary label. Disclosure is complete; framing is neutral.
Chosen
Why we chose this: In a naming study with 24 dealers, "AI-generated" scored 2.8/5 for trustworthiness. "Auto-assessed" scored 3.9/5. Same underlying fact, very different connotation. "Auto-assessed" reads as a process description, not a risk flag. We were deliberate that this wasn't deceptive — the contextual explanation, available on tap, said explicitly: "This report was generated using computer vision analysis of photos submitted by the seller." Complete disclosure, better framing. Legal reviewed and approved.
Show "insufficient data" and let buyer decide. Honest. Puts the risk on the buyer.
High arbitration risk
Escalate low-confidence high-severity fields to human review queue. AI handles what it can handle confidently. For fields where confidence is below threshold AND severity is high (frame damage, structural issues), route to a human inspector. Buyer sees "Verified by ACV Inspector."
Chosen
Why we chose this: This decision required close collaboration with the ML team to define confidence thresholds and with ops to build the escalation routing. It added cost — every escalated field needed a human touchpoint. But the arbitration data was clear: disputes clustered around exactly the high-severity fields where the model was least confident. Spending $X per vehicle on human escalation was cheaper than arbitration costs and, more importantly, cheaper than buyer churn. This is the kind of decision that only makes sense when design is working directly with the business model, not just the user experience.
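The escalation rule above can be sketched as a small routing function. The threshold value and the severity set here are illustrative assumptions, not the actual numbers agreed with the ML team:

```python
# Hypothetical sketch of the v2 escalation rule: a field goes to human review
# only when it is BOTH low-confidence and high-severity. Everything else ships
# as an auto-assessed value.
HIGH_SEVERITY = {"frame_damage", "structural"}  # assumed category names
CONFIDENCE_THRESHOLD = 0.90  # assumed; in practice set jointly with ML

def route_field(name: str, confidence: float) -> str:
    """Decide who produces the final value for one condition field."""
    if name in HIGH_SEVERITY and confidence < CONFIDENCE_THRESHOLD:
        return "human_review"   # buyer sees "Verified by ACV Inspector"
    return "auto_assessed"      # model output ships as-is

assert route_field("frame_damage", 0.62) == "human_review"
assert route_field("frame_damage", 0.95) == "auto_assessed"
assert route_field("tire_wear", 0.62) == "auto_assessed"  # low severity stays automated
```

The AND condition is the cost-control lever: loosening the threshold or widening the severity set buys more human verification at a higher per-vehicle cost, which is exactly the trade-off the unit-economics re-model had to price.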

The tradeoffs of the v2 approach

What we gained

Trust that compounded over time

Buyers who used AI-assessed reports and had good outcomes increased their bid rate on subsequent AI reports by 23%. Trust is earned by being right, not by explaining your methodology.

What we gave up

Full automation economics

The human escalation layer adds per-vehicle cost that the business case hadn't fully accounted for. We had to re-model the unit economics before getting exec sign-off on v2.

Ongoing tension

Disclosure vs. conversion

The "Auto-assessed" label still reduces bid confidence compared to "Inspected by ACV" on matched vehicle pairs. We accept this as the cost of transparency. The question is how much gap is acceptable — and we don't have a clean answer yet.

What we gained

Scalable inspection capacity

AI reports now cover 40% of listings with no human inspection required. That inventory would not have been listable under the old model. New supply = more buyer choice = more competitive marketplace.

V2 results vs. V1 baseline

4.1→4.7
Report trust score on AI listings (1–5 survey)
−34%
Post-sale disputes on AI-assessed listings
+23%
Repeat bid rate on AI listings from buyers with prior AI experience

What v1 taught me about AI UX

Transparency and trust are not the same thing. V1 was maximally transparent and minimally trusted. Trust comes from being right repeatedly, not from explaining how you might be wrong.

ML accuracy metrics don't translate to UX design decisions. "87% confidence" means something meaningful to a data scientist. To a dealer buying a car, it means "13% chance I'm about to get burned." Design for the mental model, not the model output.

AI UX requires a theory of trust accumulation, not just disclosure. How does a user go from "skeptical of AI reports" to "routinely confident using AI reports"? We didn't have an answer to that in v1. V2 built an explicit trust accumulation model: good outcomes build credit, which raises willingness to rely on AI data over time.

Failing fast is only useful if you learn fast. V1 was a "fail fast" experiment that we ran for 6 weeks before pulling it. But the post-mortem took 3 more weeks because we hadn't instrumented the right metrics upfront. The learning lag was avoidable.