The Model Isn’t the Problem
Healthcare AI pilots stall before reaching production. The model is rarely the issue. The gap between training data and production data is what breaks deployment.
Kunal Sharma
Vice President, Data Management

This is the third post in a five-part series based on our ebook, From Chaos to Clarity: The Strategic Guide to Healthcare Data Catalogs. Each post addresses one of the root causes organizations encounter when data becomes a blocker instead of an asset. Download the full ebook here.

Hospital readmissions are expensive, clinically damaging, and increasingly penalized by CMS. When a health system decides to address this, it builds a model to flag high-risk patients before discharge, giving care teams time to intervene.

Data science teams approach this work carefully, using historical EHR data, feature engineering, and algorithm tuning. By the time the model is ready, it performs well in validation, and leadership gives approval.

However, challenges emerge when teams attempt deployment.

The social determinants data that improved the model’s predictions are not accessible in real time. Lab results refresh on a schedule that the model cannot use. Three of the 40 input features behave differently in live systems than in historical data. The project stalls, and teams must explain why a model that worked in testing fails in production.

The model is rarely questioned, and it should not be. The model is not the problem.

The assumption that breaks deployment

AI teams build on retrospective data. Most healthcare organizations discover that the data used for training and the data available in production are not the same.

In development, data comes from a warehouse or curated dataset, already processed and reasonably clean. In production, data must arrive from live systems on schedules the model depends on, at the quality levels it requires. When these conditions do not hold, the model fails.

Not because it was built incorrectly, but because it was built on assumptions about data that were never verified.

What you need to know before building

Production AI has requirements beyond the algorithm. Live data must match the training data. Each prediction must be traceable to its inputs. Drift must be detectable before it affects results.

Before development begins, map every input feature against what production systems can deliver: whether it exists in real time, how complete it is, how often it refreshes, and whether it can be queried fast enough for the model to use. A 40-feature model requires 40 clear answers. Teams that skip this step discover the gaps during deployment, after months of work.
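That mapping can be made concrete as a simple checklist script. This is a minimal sketch: the feature names, completeness figures, and refresh requirements below are illustrative assumptions, not the actual model's inputs.

```python
# Pre-build readiness check: for each input feature, record whether a live
# source exists, how complete it is, and how often it refreshes, then sort
# features into ready / degraded / blocked buckets before any model code
# is written. All feature specs here are hypothetical examples.
from dataclasses import dataclass

@dataclass
class FeatureSpec:
    name: str
    in_production: bool            # does a live source exist at all?
    completeness: float            # fraction of records populated (0.0-1.0)
    refresh_hours: float           # how often the live source updates
    required_refresh_hours: float  # the freshness the model needs

def readiness_report(specs, min_completeness=0.9):
    """Split features into ready / degraded / blocked buckets."""
    ready, degraded, blocked = [], [], []
    for s in specs:
        if not s.in_production or s.refresh_hours > s.required_refresh_hours:
            blocked.append(s.name)    # cannot be served in time
        elif s.completeness < min_completeness:
            degraded.append(s.name)   # usable, but document the gap
        else:
            ready.append(s.name)
    return ready, degraded, blocked

specs = [
    FeatureSpec("last_hemoglobin", True, 0.96, 4, 24),
    FeatureSpec("housing_status", True, 0.65, 24, 24),  # SDOH gap
    FeatureSpec("note_sentiment", False, 0.0, 0, 24),   # needs NLP parsing
]
ready, degraded, blocked = readiness_report(specs)
print(ready, degraded, blocked)
# → ['last_hemoglobin'] ['housing_status'] ['note_sentiment']
```

The point is not the code itself but the discipline: every feature gets an explicit answer before development, and the "degraded" and "blocked" buckets become documented decisions rather than deployment surprises.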

What running the check looks like

Before writing any code, we evaluated every data element required by the model against production capabilities.

Thirty-five features were available in structured data with over 90% completeness, which is strong. Three features were available, but only 60–70% complete, primarily social determinants data. Two required unstructured clinical note parsing that was not available in real time.

This is not a failed project. It is a decision point.

We deployed the 35-feature model, and it performed well. We documented the limitations and planned NLP integration for a later phase. The model reached production because we understood the data it relied on.

Without this assessment, the gaps in social determinants data would have quietly degraded accuracy after months of work, and the unstructured features would have caused a failure at deployment. The model would have been blamed for a data problem.

Before you build the next model

Most organizations lack a clear, documented view of their production data, including quality metrics, refresh frequency, completeness, and lineage, until problems arise. A data catalog provides this visibility before development begins.

Do your data scientists understand the quality of the production data they use, not just the training data? When a model underperforms, can your team determine whether the issue lies in the algorithm or the data? Can you detect drift before it becomes a performance issue?

If these questions do not have clear answers, the issue is not the model. A data catalog does not build the model, but it makes the model deployable.
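One common way to answer the drift question is the Population Stability Index (PSI), which compares the training-time distribution of a feature against its live distribution. The sketch below is a generic illustration, not any specific vendor's tooling; the bin edges, sample values, and the 0.2 alert threshold are rule-of-thumb assumptions.

```python
# Population Stability Index: PSI = sum over bins of
# (actual_frac - expected_frac) * ln(actual_frac / expected_frac).
# Higher values mean the live distribution has shifted from training.
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e = max(e, eps)  # floor empty bins to avoid log(0) / divide-by-zero
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

def bin_fractions(values, edges):
    """Fraction of values falling into each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    n = max(len(values), 1)
    return [c / n for c in counts]

# Training-time distribution of a feature vs. last week's live values
# (toy numbers chosen to show an obvious shift).
edges = [0, 10, 20, 30, 40]
train = bin_fractions([5, 12, 18, 25, 33, 8, 15], edges)
live = bin_fractions([31, 35, 38, 33, 36, 22, 39], edges)

score = psi(train, live)
if score > 0.2:  # common rule-of-thumb threshold for "significant drift"
    print(f"drift alert: PSI={score:.2f}")
```

Run on a schedule against each production feature, a check like this turns "can you detect drift?" from a rhetorical question into a monitored metric.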

The full ebook includes a 15-minute Data Catalog Readiness Assessment and a Data Chaos Cost Calculator for leadership discussions. Download From Chaos to Clarity here, or reach out for a diagnostic conversation.

Next in the Clarity Series: What it looks like when the data foundation is right from the start, and what it enables.
