Reassessing Quality Assessment — The Flawed System for Fixing a Flawed System
NEJM
Lisa Rosenbaum, M.D.
April 13, 2022
Editors: Debra Malina, Ph.D., Editor
Executive Summary
by Joaquim Cardoso MSc.
The Health Revolution Institute
The Quality Revolution Unit
April 14, 2022
What is the context?
- Three decades ago, there was not much quality measurement in U.S. health care. Some 30 years later, however, the fix is itself a massive system.
- As reimbursement models shift toward value-based payment, QI is no longer just about being better, but about documenting improvement to maximize payment.
- An entire industry has arisen to support the optimization and demonstration of performance.
- Financial costs aside, if good care is the goal, the greatest cost of all this activity may be wasted time.
What is the main question?
- That these hours could be spent in countless other ways — especially caring for patients — raises an obvious question: Is the system we created to fix the system even working?
Is Quality Improving?
It’s hard to know.
- Some early efforts … — succeeded.
- But recently, there has been growing recognition of the QI movement’s shortcomings.
One unforeseen challenge is that a measure, no matter how medically sound or well intentioned, is never just a measure.
- Not only does it become difficult to modify measures that aren’t clearly working, but a tremendous amount of resources are directed toward the appearance of quality rather than its substance.
QI has become more a box-checking exercise for billing purposes than a meaningful act to improve care.
- We never really developed an epidemiology of quality.
- Unfortunately, the science of quality measurement has become untethered from these philosophical underpinnings.
Are We Even Measuring Quality?
Our whole system of quality measurement has been based on a shaky foundation
Interrogation of P4P’s efficacy brings us back to the question of whether our measures accurately capture quality in the first place
- Even when an outcome is obviously good, measuring it reliably isn’t always simple.
- If our measures don’t necessarily capture quality, how can we know if we’re improving?
When it comes to preference-sensitive decisions — like taking statins for primary prevention or undergoing mammography — rate-based metrics distract from what really matters: the quality of decision making and whether the patient was engaged and informed. (Friedberg)
… most quality measurement relies on claims data (which may lack clinical granularity and are vulnerable to manipulation such as “up-coding”), even our most valid measures may not capture what they’re meant to capture.
- One limitation is faulty risk adjustment.
- Claims data don’t reliably capture many of these factors, a problem compounded by variability in coding habits among physicians and institutions.
Inequity among Patients, Demoralization among Physicians
Because better-resourced hospitals can afford administrative support to optimize billing, value-based payment initiatives can also worsen inequities
- Punishing the hospitals and patients most in need of support makes “zero sense.” (Wadhera)
- “Community doctors want to practice medicine,”… “They don’t want to practice quality measurement.”
Some early QI leaders recognized the risk of demoralizing the workforce.
- … we are measuring and measuring, using words like ‘accountability,’ ‘incentives,’ ‘rewards and punishments,’ but this has little to do with how things get better.
- With this approach now entrenched in P4P initiatives, Berwick observed, “there is magical thinking, that if we just measure enough, and attach those measures to incentives, a miracle happens.”
- … as we shift toward value-based payment, invest heavily in an improvement infrastructure that isn’t clearly working, and navigate an epidemic of burnout and workforce demoralization, we seem to inch farther away from the ideals Berwick has long encouraged.
Why has it proven so difficult to heed his advice?
ORIGINAL PUBLICATION (full article)
Reassessing Quality Assessment — The Flawed System for Fixing a Flawed System
Mid-1980s
In the mid-1980s, Donald Berwick, who would soon become a leader of the U.S. health care quality movement, was overseeing quality-improvement (QI) efforts at a large HMO. One of his goals was to reduce intractably long radiology wait times.
One month, the data showed that the radiology department at his own practice site had reduced waits to 2 minutes.
When the reduction persisted the next month, Berwick asked the lead administrator, Ms. J., for the secret to her success. “It was easy, Don,” she replied. “I lied.”
Heeding a common impulse among people faced with performance metrics beyond their control,1,2 Ms. J. had devised a workaround.
The punch clocks used to track waits disappeared, and at the end of each day Ms. J. and her staff punched the clock multiple times, 2 minutes apart.
The ostensible improvement process was demeaning and futile, she explained: Berwick’s monthly reports were sent up Ms. J.’s chain of command, where each manager circled her dismal wait times, and then all the admonishing circles landed back on her desk.
“As if I were not already doing everything I possibly could,” Ms. J. said to Berwick. “You are not here. I am. You are wasting my time.”
The interaction transformed Berwick’s vision of QI.
He realized the focus on wait times ignored systemic issues, such as supply–demand mismatch or overreliance on x-rays.
Acknowledging that most health care workers want to do right by patients, Berwick recognized that blaming workers for factors beyond their control quashes goodwill and encourages cheating.
This insight accords with a foundational principle of the QI movement: most quality lapses reflect a faulty system rather than faulty people. To improve quality, we must fix the system.
30 years later
Some 30 years later, however, the fix is itself a massive system.
As reimbursement models shift toward value-based payment, QI is no longer just about being better, but about documenting improvement to maximize payment. An entire industry has arisen to support the optimization and demonstration of performance.
- Though I could find no authoritative estimate of U.S. investments in QI infrastructure, the Centers for Medicare and Medicaid Services (CMS) spent about $1.3 billion on measure development and maintenance between 2008 and 2018.3
- Hospitals’ QI investments vary with their size, but data from the National Academy of Medicine suggest that health systems each employ 50 to 100 people for $3.5 million to $12 million per year to support measurement efforts. 4
- Small practices bear the greatest relative costs. One 2016 study found that practices spend about $40,000 per physician per year to meet quality-documentation requirements, for an estimated total of $15.4 billion per year.5
Financial costs aside, if good care is the goal, the greatest cost of all this activity may be wasted time.5–7
- The 2016 study found that the average physician spent 2.6 hours per week on QI documentation5;
- another recent study that examined CMS’s Merit-Based Incentive Payment System (MIPS) for ambulatory care settings found that clinicians and administrators invested about 200 hours per year to meet each physician’s MIPS requirements.6
That these hours could be spent in countless other ways — especially caring for patients — raises an obvious question: Is the system we created to fix the system even working?
Is Quality Improving?
It’s hard to know.
- Some early efforts — such as those focused on reducing nosocomial infections,8,9 improving surgical outcomes,10 and improving processes of care for patients with pneumonia, heart failure, or myocardial infarction11 — succeeded.
- But recently, there has been growing recognition of the QI movement’s shortcomings.12–14
One study, for instance, showed that only 37% of MIPS measures for ambulatory internal medicine were valid,12 and even CMS and the Government Accountability Office have acknowledged the need to improve the quality of measuring quality.13,14
One unforeseen challenge is that a measure, no matter how medically sound or well intentioned, is never just a measure.
For instance, few physicians would object to the need to check glycated hemoglobin levels in patients with diabetes.
But once a measure is implemented and tied to a financial incentive, an entire industry arises to boost organizations’ scores on that measure.
Consultants get hired. Electronic health records (EHRs) are changed. And the measures become a source of intense organizational focus.
Not only does it become difficult to modify measures that aren’t clearly working, but a tremendous amount of resources are directed toward the appearance of quality rather than its substance.
That QI has become a costly distraction was, ironically, best crystallized by CMS early in the Covid-19 pandemic, when it announced it was suspending or delaying quality-reporting requirements so that, as CMS Administrator Seema Verma said, “the healthcare delivery system can direct its time and resources toward caring for patients.”15
Many physicians’ response was essentially “Seriously? Why isn’t the essence of quality devoting our time and resources to caring for patients all the time?”
It should be. But QI’s guiding principle that “if you can’t measure it, you can’t manage it” retains pragmatic importance because the counterfactual seems untenable: How can you improve without data indicating how you’re doing?
“A world in which we don’t measure quality, and don’t try to improve it, is not an acceptable one,” says Karen Joynt Maddox, a cardiologist and quality expert at Washington University.
Noting, nevertheless, that QI has become more a box-checking exercise for billing purposes than a meaningful act to improve care, Joynt Maddox emphasized the need to separate the moral imperative to make care better from our current method for doing so.
The methodologic challenge applies to determining how best to improve quality and how best to evaluate our success.
“If you asked me whether quality has improved in the last 50 years relative to what it ought to be, we don’t know the answer,” says Robert Brook, a quality-measurement expert at RAND and the University of California, Los Angeles.
Noting that we don’t really know how many preventable deaths are attributable to poor medical care, nor whether the use of unnecessary medical services is decreasing, Brook stressed that he and others simply assume QI efforts are inherently good.
“I believe all this activity does good,” he told me, “but it’s a belief, like whether I believe in God.” Yet the challenge of QI is an empirical one. “Nobody’s really asking these questions to put it all together,” Brook said. “We never really developed an epidemiology of quality.”
Few people understood this challenge better than Avedis Donabedian, who pioneered the study of health care quality.
Donabedian’s life work was inspired by the question “How can you tell if you have good-quality health care?”16
In 1990, he noted that epidemiology became exquisitely complex when applied to quality.17
“As I approached it,” he wrote, “it slipped into ambiguity and confusion. What was quality assessment? What was monitoring? For that matter, what was quality?”
Unfortunately, the science of quality measurement has become untethered from these philosophical underpinnings.
Despite innumerable metrics and vast research assessing their worth, it’s still not clear that we’re measuring what matters nor whether we have the methods to figure it out.
Compounding this complexity, the incentives now tied to QI may variably affect outcomes, thus warranting their own examination but also further distancing us from epistemological questions about quality’s meaning.
Any consideration of whether the movement’s costs are justified by its benefits, then, must escape the tautological trap that assumes quality is improving if we’re scoring better on what we measure.
Is Paying for Performance Bad for Quality?
Craig Thornton is an internist in McMinnville, Oregon, a small town 50 miles outside Portland. A self-identified “dinosaur,” he’s curmudgeonly about bureaucratic intrusions into medicine, from billing codes to an unnavigable EHR.
When he started practicing in 1992, he rounded on inpatients in the morning, saw about 20 patients in clinic, then returned to the hospital to round again.
The beginning of the end for Thornton was the introduction of evaluation-and-management coding for billing, which was always annoying but became life-changing with EHR implementation.
He stopped seeing inpatients. His productivity fell. And because he has no Internet signal at home, he typically stays in the office 12 to 13 hours a day to keep up with documentation.
He still loves medicine. “The patients are wonderful,” he told me. “You close the door, and it’s the greatest thing to shut off your own life and focus on someone else.”
Thornton initially tried to cut back, but “the ridiculous documentation requirements” became too much, and he will retire this year.
Yet Thornton sees value in most QI efforts; controlling diabetes and hypertension, for instance, “can have a dramatic impact on the quality of care of my patients.”
Referring to therapeutic inertia in patients who might benefit from more aggressive chronic disease management but already take several medications, Thornton says he likes knowing how he performs relative to his peers: “Having some standards is only good for my patients.”
Thornton’s not-uncommon experience with QI highlights three key points.
The first may be obvious but is easily lost in the grumbling: doctors want the best care for their patients.
Critics of QI initiatives aren’t arguing that managing hypertension isn’t important; they object to the way these goals are operationalized, particularly as they are tied to financial incentives and therefore receive disproportionate focus.
Second, as quality is increasingly linked to reimbursement, the documentation burden imposed by billing requirements will become inextricable from the demands of demonstrating quality.
Finally, using internal performance standards to motivate better care — which many physicians embrace — differs starkly from using external financial incentives to improve quality.
A pressing question, then, is whether value-based payment designs improve quality or reduce costs.
My overwhelming sense is that, on balance, they don’t.18–23
It’s difficult to know for sure because payment models’ incentive structures vary, as do practice settings and the outcomes observed.
But growing evidence provides a rough sketch.
- One analysis assessing CMS’s Hospital Value-Based Purchasing initiative found only one quality benefit: reduced pneumonia-specific mortality.18
- A study examining CMS’s Hospital Readmissions Reduction Program (HRRP) actually found an increase in 30-day mortality among patients hospitalized for heart failure or pneumonia, driven by patients who were not readmitted (which raises concerns that sicker patients were being turned away).19
- Although there’s debate about whether the HRRP causes harm,24,25 there’s little evidence to suggest consistent benefit.26,27
What about accountable care organizations (ACOs), which aim to simultaneously improve quality and cut costs?
ACOs are risk contracts between payers and provider organizations that couple a population-based payment (or “global budget”) with pay-for-performance (P4P) incentives.
Research led by Harvard economist J. Michael McWilliams showed that the Medicare Shared Savings Program (the largest ACO program)
- has achieved small spending reductions 28,29 but
- yielded minimal improvement on quality measures, including little to no movement in key areas targeted by the P4P incentives, such as readmissions, admissions for “ambulatory care sensitive conditions,”30 and medication use among patients with cardiovascular disease or diabetes.31
In parsing the evidence, McWilliams emphasized that reducing unnecessary care without apparent deterioration in quality represents meaningful progress and that the design of ACO models can be improved to generate greater savings.
Nevertheless, the lackluster results from the P4P component appear to be more of the same from a strategy that has “not yielded many, if any, wins and has certainly been very costly.”
Instead, McWilliams noted that we might expect more quality gains from the population-based payment component, which gives organizations more flexibility to meet their patients’ needs.
Are We Even Measuring Quality?
Interrogation of P4P’s efficacy brings us back to the question of whether our measures accurately capture quality in the first place.
If a patient who cannot take Lasix at work because his boss won’t tolerate frequent bathroom breaks is repeatedly hospitalized for heart failure exacerbations, do his readmissions accurately reflect the quality of his physician’s care?
Conversely, if a metric superstar tells her 67-year-old patient with well-controlled diabetes that his unusual back pain is a normal part of aging and then 6 months later he presents with cord compression due to widely metastatic cancer, does she deserve his thanks for her “excellent care”?
If our measures don’t necessarily capture quality, how can we know if we’re improving?
Take mammography screening, a key primary care metric.
- In one of the studies on early ACO performance,28 mammography rates among women 65 to 69 years of age improved the first year at participating ACOs and didn’t change significantly in the second.28
- But are these rates what we should be tracking? According to Mark Friedberg, a primary care physician who directs performance improvement for Blue Cross Blue Shield of Massachusetts, not so much.
When it comes to preference-sensitive decisions like taking statins for primary prevention or undergoing mammography, Friedberg suggests, rate-based metrics distract from what really matters: the quality of decision making and whether the patient was engaged and informed.
Even when an outcome is obviously good, measuring it reliably isn’t always simple.
Friedberg recalled a meeting about a decade ago in which a physician shared an anecdote about strong-arming a deeply reluctant woman into getting a mammogram, allowing him to get a perfect quality score in his practice.
As the audience applauded the excellent performance, Friedberg was horrified that even though the patient’s agency had been completely disregarded, most people in the room seemed to have blindly accepted that the metrics represented good care.
Though Friedberg believes in using measurement to improve quality, it drives him crazy when the saying “Don’t let perfect be the enemy of good” is invoked to justify QI measures that omit consideration of ethics. “Have we even decided this measure is good?” he frequently wonders.
For example, most insurers track 30-day mortality after acute myocardial infarction,16 and survival is a laudable goal.
But as Harvard cardiologist and health policy expert Rishi Wadhera explained, because most quality measurement relies on claims data (which may lack clinical granularity and are vulnerable to manipulation such as “up-coding”), even our most valid measures may not capture what they’re meant to capture.
“Our whole system of quality measurement has been based on a shaky foundation,” Wadhera told me.
One limitation, he noted, is faulty risk adjustment.
For example, an elderly patient with hypotension, tachycardia, and elevated lactate levels after an anterior-wall myocardial infarction typically has a higher risk of death than an otherwise healthy middle-aged man with an inferior-wall myocardial infarction, regardless of care quality.
But claims data don’t reliably capture many of these factors, a problem compounded by variability in coding habits among physicians and institutions.
Given the financial stakes associated with performance scores, particularly under value-based purchasing programs, well-resourced hospitals hire administrators to optimize coding and resultant scores.
Consequently, rather than discovering whether the same patient is more likely to survive a myocardial infarction at hospital X versus hospital Y, we learn which hospital codes better for coexisting conditions.
Inequity among Patients, Demoralization among Physicians
Because better-resourced hospitals can afford administrative support to optimize billing, value-based payment initiatives can also worsen inequities, even though addressing them ought to be at the forefront of our QI efforts.
Indeed, after implementation of CMS’s value-based purchasing programs, safety-net hospitals disproportionately bore the brunt of financial penalties.32,33
In 2019, hospitals caring for a high percentage of Black patients were disproportionately likely to incur financial penalties.34 And hospitals serving the highest-risk patients incurred the largest penalties under the HRRP, independent of quality of care.35
Billions of dollars are thus being transferred from poorly resourced hospitals or those serving the sickest patients to well-resourced hospitals, worsening the disparities we claim to be trying to fix.
Punishing the hospitals and patients most in need of support makes “zero sense,” says Wadhera.
He notes the broader irony of attempting to reduce spending with programs that create untold administrative costs and possibly greater net costs to the system long term.
For instance, smaller practices that are unable to afford these administrative costs are increasingly being bought by larger health systems that sometimes charge higher prices.36
And though such consequences create hard-to-quantify indirect costs, truly unquantifiable is the loss to society as consummate community physicians become a dying breed.
“Community doctors want to practice medicine,” said Wadhera. “They don’t want to practice quality measurement.”
Some early QI leaders recognized the risk of demoralizing the workforce.
Berwick recalls an early project that involved conducting patient-satisfaction surveys for a medical practice.
When he distributed physicians’ individual reports during a department meeting, one excellent internist crushed hers into a ball and threw it in his face. “Oh my goodness,” he thought, “we are measuring and measuring, using words like ‘accountability,’ ‘incentives,’ ‘rewards and punishments,’ but this has little to do with how things get better.”
With this approach now entrenched in P4P initiatives, Berwick observed, “there is magical thinking, that if we just measure enough, and attach those measures to incentives, a miracle happens.”
Though Berwick has been sounding this alarm for decades,1 as we shift toward value-based payment, invest heavily in an improvement infrastructure that isn’t clearly working, and navigate an epidemic of burnout and workforce demoralization, we seem to inch farther away from the ideals he’s long encouraged.
Why has it proven so difficult to heed his advice?
Author Affiliations
Dr. Rosenbaum is a national correspondent for the Journal.
Originally published at https://www.nejm.org.