Computer Vision in Production: Beyond the Demo

The demo always works. That's the problem. The gap between a notebook that hits high accuracy on a curated test set and a vision system that survives a warehouse, a clinic, or a retail floor at 3 a.m. is where most CV projects quietly die.

Every computer vision project starts with a magic moment: the model draws a perfect box around the defect, the face, the license plate. Stakeholders cheer. Then it ships, and the accuracy you promised evaporates the first time a camera fogs up or a forklift parks under a different light. At DIIGOO we've shipped enough vision systems to know the demo is the easy 20 percent. Here's what the other 80 percent actually involves.

The demo lies about your data

A demo runs on data you hand-picked. Production runs on data the world hands you: motion blur, glare, occlusion, a lens someone smudged, a camera angle a technician 'fixed' on Tuesday. The single largest source of production failure isn't model architecture; it's distribution shift between the clean frames you trained on and the messy frames you run inference on.

The fix is unglamorous and non-negotiable. Before you obsess over which backbone to use, build a representative dataset that includes the bad frames on purpose. Capture from the actual cameras, in the actual lighting, at the actual mounting height. A model trained on stock images of pristine products will not recognize your product covered in dust under a sodium-vapor lamp.

  • Collect failure conditions deliberately: night, rain, dirty optics, partial occlusion
  • Label edge cases more heavily than the easy center of the distribution
  • Version your datasets as rigorously as your code — a model is only as reproducible as the data behind it
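One way to make the dataset-versioning point concrete is to fingerprint the data itself. The sketch below is a minimal illustration, not a replacement for a dedicated data-versioning tool: the function name dataset_fingerprint and the manifest layout are assumptions made for this example. It hashes every file under a directory and derives a single version ID that you can record next to the model that was trained on it.

```python
import hashlib
import json
from pathlib import Path


def dataset_fingerprint(root: str) -> dict:
    """Hash every file under `root` and derive one version ID for the whole set.

    Hypothetical helper for illustration; reads each file fully into memory,
    so very large files would need chunked hashing instead.
    """
    root_path = Path(root)
    files = {}
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        files[path.relative_to(root_path).as_posix()] = digest
    # The version changes if any file is added, removed, renamed, or edited.
    canonical = json.dumps(files, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(canonical).hexdigest()[:12]
    return {"version": version, "files": files}
```

Storing the returned version string in the training run's metadata lets you answer "which frames produced this model" months later, and the per-file hashes show exactly what changed between two dataset versions.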

Latency and cost are features, not afterthoughts

A model that needs a high-end datacenter GPU per camera is a science project, not a product. Real deployments have constraints: an edge box on a factory floor, a battery-powered camera, a budget that can't absorb cloud inference on millions of frames a day. The accuracy-per-dollar and accuracy-per-watt curves matter far more than the leaderboard.

This is where engineering judgment beats model worship. Quantization, pruning, and distillation routinely cut model size and latency by large margins for a small accuracy give-back that nobody downstream notices. Run inference where the data is born — on the edge — and send only events upstream, not raw video. The right question is rarely 'what's the most accurate model' but 'what's the smallest model that clears the business threshold.'
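The "smallest model that clears the business threshold" question can be written down as a simple selection rule. The sketch below assumes you have already benchmarked each candidate on representative frames and on the target hardware; the Candidate fields, the function name smallest_passing, and the choice of recall and latency as the two constraints are all illustrative assumptions, not a prescribed method.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Measured results for one model variant (hypothetical schema)."""
    name: str
    size_mb: float
    latency_ms: float  # measured on the target edge device
    recall: float      # measured on a representative validation set


def smallest_passing(
    candidates: Sequence[Candidate],
    min_recall: float,
    max_latency_ms: float,
) -> Optional[Candidate]:
    """Return the smallest candidate that meets both constraints, or None."""
    passing = [
        c for c in candidates
        if c.recall >= min_recall and c.latency_ms <= max_latency_ms
    ]
    return min(passing, key=lambda c: c.size_mb, default=None)
```

The useful part is the framing rather than the code: accuracy becomes a constraint to satisfy, and size (or cost, or power draw) becomes the quantity you minimize.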

You are not detecting objects, you are making decisions

Bounding boxes don't create value. Decisions do. A defect detector that fires on every borderline scratch generates so many false alarms that operators turn it off within a week: an otherwise accurate model rendered worthless by a bad threshold. The decision layer on top of the model is where most of the real product lives.

Think in terms of the cost of each error type in the actual workflow. In quality inspection a missed defect that reaches a customer is catastrophic, so you tune for recall and accept more human review. In a retail people-counter a few misses are irrelevant. The same model serves both — what changes is the thresholding, temporal smoothing across frames, and the confirmation logic that turns noisy per-frame predictions into a stable, trustworthy signal.
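As a rough illustration of that confirmation logic, the sketch below combines a score threshold with a "k of the last n frames" rule. The class name ConfirmedAlarm and its parameters are assumptions for this example, and the right values depend on frame rate and on how costly each error type is in your workflow.

```python
from collections import deque


class ConfirmedAlarm:
    """Raise an alarm only when enough recent frames agree.

    Hypothetical decision layer: `threshold` is the per-frame score cutoff,
    `window` is how many recent frames are remembered, and `min_hits` is how
    many of those must exceed the cutoff before the alarm is considered real.
    """

    def __init__(self, threshold: float, window: int, min_hits: int):
        self.threshold = threshold
        self.min_hits = min_hits
        self.recent = deque(maxlen=window)  # oldest frame drops off automatically

    def update(self, score: float) -> bool:
        """Feed one per-frame score; return the smoothed alarm state."""
        self.recent.append(score >= self.threshold)
        return sum(self.recent) >= self.min_hits
```

A quality-inspection line might run this with a low threshold and a small min_hits to favor recall, while a people-counter could use stricter settings on the same underlying model.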

The monitoring you skip is the outage you'll have

CV models fail silently. There's no stack trace when accuracy drifts — the system keeps returning confident predictions that are quietly wrong. A camera gets bumped, a new product SKU appears, a seasonal lighting change creeps in, and your metrics degrade for weeks before anyone notices a business number moving the wrong way.

Production CV needs observability built for models, not just servers. Log confidence distributions and watch them shift. Sample real predictions for periodic human audit. Alert on input-side signals like sudden changes in brightness, blur, or detection rate per camera, because input drift precedes output failure. Treat the model as a component that decays, and build the feedback loop to retrain on the new reality before it becomes a crisis.
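One way to "watch confidence distributions shift" is to compare a recent window of scores against a baseline window with a drift statistic. The sketch below uses the population stability index (PSI), which is a swapped-in choice for illustration rather than anything the approach above requires. It assumes scores lie in [0, 1]; the bin count and any alert threshold you put on the result are hypothetical and need tuning per camera.

```python
import math
from typing import List, Sequence


def histogram(scores: Sequence[float], bins: int = 10) -> List[float]:
    """Fraction of scores falling in each equal-width bin over [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between two sets of confidence scores.

    Zero means identical histograms; larger values mean more drift.
    `eps` avoids log(0) when a bin is empty in one of the windows.
    """
    p = histogram(baseline, bins)
    q = histogram(current, bins)
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps))
               for pi, qi in zip(p, q))
```

The same comparison works for input-side signals such as per-frame mean brightness or a blur score, once they are normalized into a fixed range, which lets you alert before the predictions themselves go wrong.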

Privacy and governance are part of the architecture

The moment your cameras see people, you've inherited obligations: consent, retention limits, access control, and a clear answer to 'where does this footage go.' Bolting governance on after launch is expensive and sometimes legally impossible. Design for it from frame one — process on the edge so raw video never leaves the premises, store events not images where you can, and blur or drop identifiable data you don't strictly need.
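To make "store events not images" concrete, here is a minimal sketch of an event record that deliberately has no field for pixels, plus a retention check. The DetectionEvent schema, the helper names, and the idea of expressing retention in days are assumptions for illustration; real retention rules come from your legal and compliance requirements.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DetectionEvent:
    """What leaves the edge device: metadata only, no image data (hypothetical schema)."""
    camera_id: str
    label: str
    confidence: float
    timestamp: str  # ISO 8601 with UTC offset, e.g. "2024-01-01T00:00:00+00:00"


def to_upstream_payload(event: DetectionEvent) -> str:
    """Serialize the event for upstream storage; raw frames are never included."""
    return json.dumps(asdict(event), sort_keys=True)


def is_expired(event: DetectionEvent, now: datetime, retention_days: int) -> bool:
    """True once the event is older than the retention limit.

    `now` must be timezone-aware so it can be compared with the event timestamp.
    """
    age = now - datetime.fromisoformat(event.timestamp)
    return age > timedelta(days=retention_days)
```

Because the record type cannot carry footage, minimization is enforced by the data model rather than by policy alone, and a scheduled job can use is_expired to purge old events.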

This isn't only compliance theater. A system that demonstrably minimizes what it captures is easier to get approved, easier to defend, and easier to trust. Good privacy engineering and good product engineering point in the same direction: collect less, decide more.

The bottom line

Treat the demo as a hypothesis, not a deliverable. The real work is the data pipeline, the edge constraints, the decision layer, the monitoring, and the governance — and that work is exactly what separates a vision system that runs for years from one that gets switched off in month two. If you want CV that survives contact with the real world, budget for the 80 percent that never makes it into the demo.
