When Data Drift Kills Alerts

A vendor update silenced a healthcare allergy alert engine for seven days. The paired metrics, Time to Adapt and Break Rate, that expose brittle data systems.

Tagged in:

Data Governance

Steve

Novak

Vice President

View bio

A vendor update broke a healthcare alert engine for seven days. Inside the paired metrics that measure how brittle a data system really is.

On a Tuesday in March, a pharmacy benefit vendor pushes a routine update to how it encodes drug class identifiers in its reference formulary. The change is small, technically clean, and properly notified. The team that built the update tested every relationship. The downstream client systems acknowledged the schedule change. Nobody missed a step.

By Friday afternoon, a woman in her sixties stands at a pharmacy counter inside a Meridian Health Partners* outpatient clinic, holding a prescription she should not be holding. She has a documented penicillin allergy that landed her in an emergency room nine years earlier. The prescription in her hand is for a second-generation cephalosporin, a class that cross-reacts with penicillin in roughly ten percent of patients. That is exactly the kind of risk an allergy alert engine is supposed to surface before a physician ever signs the script.

She catches it because she reads the label. She has learned the hard way to read every label.

The pharmacist confirms the cross-reactivity, calls the prescribing clinic, and the prescription is withdrawn. No one is harmed. No incident report will be filed because no harm occurred. The encounter would not have surfaced anywhere in Meridian's data systems if Maya Chen, a senior analyst on Priya Venkatesan's data team, had not heard about it offhand at a Tuesday morning huddle the following week.

Maya pulls the prescribing data for the seventy-two hours following the vendor update. She is not expecting much. What she finds stops her morning.

Three hundred and forty prescriptions had been written in that window that should have generated an allergy alert, but did not. Most of them were for non-cross-reactive classes, situations where the missing alert was a near-miss, not a clinical risk. Eleven of them were genuine cross-reactivity cases. None of those eleven had been caught by the system. Two had been caught by pharmacists. One had been caught by the patient. Eight had been dispensed.

Of those eight, none had produced a documented adverse reaction. Yet.

Eight cross-reactive prescriptions dispensed in a seventy-two-hour window are not a data quality issue. It is a regulatory exposure event: notification obligations, discovery risk, clinical remediation, and the reputational tail that follows. The cost of operating below a known recovery threshold is calculable before an incident occurs, not only after one does.

Nothing in this story is a system failure in the way data teams normally use the term.

Nobody in this story failed. The vendor pushed correctly. The clinical team documented. The pharmacist caught the one she could see. The patient caught the one nobody else did. The pipeline ran. The alert engine's job logs were green. The dashboard Priya's team monitors (alert volume, response time, override rates) looks indistinguishable from the previous week. The reason the alerts are not firing is that the engine no longer recognizes the cross-reactive relationships embedded in the new drug class encoding. The values are still in the system. They just mean something slightly different than they did on Monday. And no part of Meridian's monitoring infrastructure is watching for that.

This is the failure mode that the Organizational Malleability Score was built to surface. Not the failures that announce themselves. The failures that do not.

I closed the last article with a promise: the next would begin where the score begins, with the signal organizations measure least accurately and the answer that almost always surprises them. This is that signal. The OMS measures it through two paired components.

The first is Time to Adapt, the elapsed time from when an upstream change occurs to when all affected downstream consumers are restored to a trusted, operational state. Not the time to fix the pipeline. The time to restore confidence that the data flowing through it can be relied upon. For Meridian, the clock starts on Tuesday at 6:14 a.m. when the vendor pushes the update. It ends the following Tuesday afternoon when Maya's remediated mappings are validated, and the alert engine is confirmed to be catching the cross-reactivity cases that had been silently slipping through. Total elapsed: seven business days.

The second is Break Rate. Of the downstream systems consuming the changed data, how many lose the trust of the people relying on them?

Maya identifies three downstream consumers of the affected reference data. The allergy alert engine, which has stopped surfacing the cross-reactivity cases, is the article's opening. The discharge medication reconciliation report, which uses the same alert metadata to flag patients at elevated readmission risk, has been quietly under-flagging for the same reason the alerts have gone silent. And the pharmacy verification system runs its own independent cross-check against a different reference source. The pharmacy verification system holds up. That is the redundancy the patient's pharmacist was working from when she made the call to the prescribing clinic.

Two of three downstream systems break. One holds. Break Rate: 67%, conservative.

Conservative because Maya's dependency map is incomplete. Three is the count of consumers she can confidently identify. Whether there is a fourth (a quality reporting feed, an analytics extract, a vendor data share) is a question the documentation cannot answer. In Break Rate measurement, an incomplete dependency map produces a floor, not a ceiling. The number can only adjust in one direction as the map fills in, and the direction is up.

Together, these two numbers describe the geometry of every upstream change. Time to Adapt is the speed. Break Rate is the blast radius. An organization that recovers fast but breaks a lot lives in constant chaos. An organization that breaks rarely but takes weeks to recover is one bad change away from catastrophic exposure. Both numbers must be low for a system to be malleable.

Here is what surprises Priya when Maya walks her through the timeline.

Of the seven business days the clock ran, the actual technical fix took six hours. The remediation, once Maya understood what had broken, was straightforward. Update the cross-reactivity mappings to reflect the new encoding, validate against a sample of historical cases, and push the change. Six hours, including testing.

The other six and a half days were spent on detection. Realizing something had broken at all. Tracing the cause to the vendor update. Identifying which downstream systems consumed the affected reference data. Determining the population of affected prescriptions. Confirming which alert types were and were not impacted. Locating the documentation that would have shown the dependency in the first place. And when that documentation could not be located, it was reconstructed from interviews with the original implementers, two of whom had left the organization.

That was the surprise. Not that seven days had passed. That six and a half of them had been spent on the one part nobody had been measuring.

In Meridian's case, detection took ninety-four percent of the seven-day clock. The fix was six percent. This is not a Meridian-specific ratio. In assessments of brittle organizations across industries, detection consistently accounts for the large majority of total Time to Adapt, often more than ninety percent of it.

The team had been telling itself, for months, that it needed to get faster at fixing things. The data showed it needed to get faster at noticing them.

This time, no one was hurt because a patient who had learned to read every label happened to read this one. The next change will not necessarily arrive with someone standing at a pharmacy counter.

"In brittle organizations, fixing is the small part. Finding out is what kills you."

The pattern is not specific to Meridian, and it is not specific to allergy alerts. I have watched the same shape surface in financial services, manufacturing, and retail, anywhere reference data quietly feeds the guardrails an organization runs on top of it. Fraud rules, compliance flags, threshold alerts, exception triggers. When the reference data shifts, the guardrails go silent. Silence does not produce a ticket. Silence does not page anyone. The system continues to operate. It just stops protecting what it was built to protect, and no one notices until something the silence allows through happens.

This failure mode is not limited to rule-based systems. AI agents consuming the same reference data fail more quietly, at greater volume, and with less visible evidence of the break. It is not a separate AI problem. It is a malleability problem wearing a different label.

In every sector where guardrails run on reference data, the cost of silence is not zero. It is deferred.

The pipeline was green. The dashboard was green. The break was invisible.

What Time to Adapt and Break Rate do, and why they belong together as a paired measurement, forces the organization to count what it has been treating as too rare or too uncertain to count. Every upstream change becomes a tracked event. The clock starts at the change, not at the discovery. The blast radius is calculated against a known dependency map, not against the consumers who happen to complain. The detection sub-component of TtA (the time between a change happening and the organization realizing it has) is exposed as the part of the metric that almost always dominates the total, and that almost no one measures.

That gap (between what an organization thinks its problem is and what its problem actually is) is the first reason the OMS exists. Not to issue a grade. To make a structural pattern visible that the organization could not previously see.

Priya's working hypothesis after that meeting was that Meridian had a documentation problem. She was right, but the shape of the problem was layered. Some dependencies that would have shortened the seven days had never been written down. Others had been written, but the documentation was older than the systems it described, and the people who needed it could not tell which parts of it were still true.

That layered problem, and what Priya's team did about it, is where the next two signals begin.

The next article picks up there: not whether documentation exists, but whether you can find it, and whether what you find is still telling the truth.

*Meridian Health Partners is a fictional organization. The events and patterns described are drawn from documented observations across multiple client engagements.