AI Shows Promise in Detecting Breast Cancer in Women: A Retrospective Norwegian Study of More Than 120,000 Mammograms Acquired Over Nearly 10 Years


Key messages

Edited by Joaquim Cardoso MSc.
Digital Health and AI Institute

March 29, 2022


What is the context?

  • “Mammograms acquired through population-based breast cancer screening programs produce a significant workload for radiologists.

  • AI has been proposed as an automated second reader for mammograms that could help reduce this workload,” noted a statement on the findings.

What is the opportunity with AI?

  • Artificial intelligence (AI) has shown promising results for cancer detection with mammographic screening. 

  • However, evidence related to the use of AI in real screening settings remains sparse.

About the study

  • The purpose of the study was to compare the performance of a commercially available AI system with routine, independent double reading with consensus as performed in a population-based screening program. Furthermore, the histopathologic characteristics of tumors with different AI scores were explored.

  • “To our knowledge, this is the largest AI evaluation study to date, including more than 120 000 examinations from a real screening setting,” the authors wrote.

  • They evaluated a commercially available AI system against independent double reading of mammography results by 2 radiologists.

  • Their retrospective analysis utilized 122,969 screening mammograms performed at 4 BreastScreen Norway screening units, from October 2009 through December 2018, with the initial patient cohort including 47,877 women (mean [SD] age, 60 [6] years).

What was the conclusion of the study?

  • In conclusion, the proportion of screen-detected cancers not selected by the artificial intelligence (AI) system at the three evaluated thresholds was less than 20%, and several of these would probably also be detected at an early stage in the next screening round.

  • The overall performance of the AI system was promising with respect to cancer detection.

  • “However, more research is needed to find the optimal combination of radiologists and AI systems.”

According to the authors, areas to evaluate further are

  • optimal settings for the timing and format of AI scores,
  • how the relatively large number of negative examinations with a high AI score can influence the rates of recall and false-positive results,
  • mammographic features identified by AI,
  • multiple AI algorithms in a comparative manner,
  • use of AI in more diverse populations, and
  • the cost-effectiveness of AI.



ORIGINAL PUBLICATION

Artificial Intelligence Evaluation of 122 969 Mammography Examinations from a Population-based Screening Program


RSNA

Marthe Larsen, Camilla F. Aglen, Christoph I. Lee, Solveig R. Hoff, Håkon Lund-Hanssen, Kristina Lång, Jan F. Nygård, Giske Ursin, Solveig Hofvind

Mar 29 2022


This is an excerpt of the long version of the paper.


ABSTRACT

Background


Artificial intelligence (AI) has shown promising results for cancer detection with mammographic screening. However, evidence related to the use of AI in real screening settings remains sparse.

Purpose


To compare the performance of a commercially available AI system with routine, independent double reading with consensus as performed in a population-based screening program. Furthermore, the histopathologic characteristics of tumors with different AI scores were explored.

Materials and Methods

  • In this retrospective study, 122 969 screening examinations from 47 877 women performed at four screening units in BreastScreen Norway from October 2009 to December 2018 were included. 
  • The data set included 752 screen-detected cancers (6.1 per 1000 examinations) and 205 interval cancers (1.7 per 1000 examinations). Each examination had an AI score between 1 and 10, where 1 indicated low risk of breast cancer and 10 indicated high risk. 
  • Threshold 1, threshold 2, and threshold 3 were used to assess the performance of the AI system as a binary decision tool (selected vs not selected).
  • Threshold 1 was set at an AI score of 10, threshold 2 was set to yield a selection rate similar to the consensus rate (8.8%), and threshold 3 was set to yield a selection rate similar to an average individual radiologist (5.8%). 
  • Descriptive statistics were used to summarize screening outcomes.
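
The idea behind thresholds 2 and 3, choosing a cut-off so that the AI system selects about the same share of examinations as the consensus process (8.8%) or an average individual radiologist (5.8%), can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not describe how the cut-offs were derived, so the sketch assumes a continuous raw score underlies the 1 to 10 AI score, the scores are synthetic random numbers rather than study data, and the function names are hypothetical.

```python
import random


def threshold_for_selection_rate(raw_scores, target_rate):
    """Return a cut-off so that about `target_rate` of the scores are >= it."""
    ranked = sorted(raw_scores, reverse=True)
    # Number of examinations to select, at least one.
    n_selected = max(1, round(target_rate * len(ranked)))
    return ranked[n_selected - 1]


def selection_rate(raw_scores, threshold):
    """Share of examinations whose score is at or above the cut-off."""
    return sum(score >= threshold for score in raw_scores) / len(raw_scores)


# Synthetic stand-in for a continuous AI score (assumption, NOT study data).
random.seed(0)
raw_scores = [random.random() for _ in range(122_969)]

# Target selection rates quoted in the abstract.
for name, target in [("threshold 2", 0.088), ("threshold 3", 0.058)]:
    cut = threshold_for_selection_rate(raw_scores, target)
    rate = selection_rate(raw_scores, cut)
    print(f"{name}: cut-off {cut:.4f}, selection rate {rate:.3%}")
```

Examinations at or above the cut-off count as "selected" and the rest as "not selected", which is the binary decision the abstract refers to.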

Results 

  • A total of 653 of 752 screen-detected cancers (86.8%) and 92 of 205 interval cancers (44.9%) were given a score of 10 by the AI system (threshold 1). 
  • Using threshold 3, 80.1% of the screen-detected cancers (602 of 752) and 30.7% of the interval cancers (63 of 205) were selected. 
  • Screen-detected cancers not selected at these thresholds had favorable histopathologic characteristics compared with those selected; the opposite was observed for interval cancers.
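
The percentages above follow directly from the reported counts. A minimal arithmetic check, using only counts stated in the abstract, is sketched below; figures for threshold 2 are not given in this excerpt and are therefore left out, and the helper name `pct` is illustrative.

```python
# Counts reported in the abstract.
EXAMINATIONS = 122_969
SCREEN_DETECTED = 752
INTERVAL = 205

# Cancers selected by the AI system at each threshold reported in this excerpt.
selected = {
    "threshold 1": {"screen_detected": 653, "interval": 92},
    "threshold 3": {"screen_detected": 602, "interval": 63},
}


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


for name, counts in selected.items():
    sd = counts["screen_detected"]
    ic = counts["interval"]
    print(
        f"{name}: screen-detected selected {pct(sd, SCREEN_DETECTED)}%, "
        f"not selected {pct(SCREEN_DETECTED - sd, SCREEN_DETECTED)}%, "
        f"interval selected {pct(ic, INTERVAL)}%"
    )

# Cancer rates per 1000 examinations.
print(round(1000 * SCREEN_DETECTED / EXAMINATIONS, 1))
print(round(1000 * INTERVAL / EXAMINATIONS, 1))
```

At threshold 3, the stricter of the two thresholds shown here, 150 of 752 screen-detected cancers are not selected, which is consistent with the statement that fewer than 20% were missed at the evaluated thresholds.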

Conclusion 

  • The proportion of screen-detected cancers not selected by the artificial intelligence (AI) system at the three evaluated thresholds was less than 20%. 
  • The overall performance of the AI system was promising with respect to cancer detection.

© RSNA, 2022


Summary


The performance of the artificial intelligence system was promising for breast cancer detection in a large population-based mammography screening program.

Key Results 

  • In this retrospective study of 122 969 examinations, mammograms were evaluated with an artificial intelligence (AI) system that predicts the risk of cancer on a scale from 1 (lowest risk) to 10 (highest risk).
  • A total of 86.8% of screen-detected cancers (653 of 752) and 44.9% of interval cancers (92 of 205) had the highest AI score of 10; 0.7% of screen-detected cancers (five of 752) had the lowest AI score of 1.
  • Interval cancers with high AI scores had favorable histopathologic tumor characteristics compared to those with low AI scores; the opposite was observed for screen-detected cancers.

Introduction


Worldwide, more than half a million women die of breast cancer every year ( 1). To reduce this burden, mammographic screening has been implemented in many countries over the past decades. These screening programs, along with improved treatment options, have resulted in a reduction of at least 30% in breast cancer mortality among participants ( 2).


Use of double reading is recommended and standard in most European screening programs ( 3, 4). Double-reading interpretation is usually followed by consensus or arbitration, where the decision to recall the women for further assessment is made. In BreastScreen Norway, breast cancer is diagnosed in more than 25% of recalled women and about 0.6% of all screening examinations ( 5). Conversely, 99.4% of screening examinations are eventually determined to have a negative outcome.


Informed reviews of prior screening and diagnostic mammograms obtained by groups of radiologists have classified about 25% of screen-detected and interval cancers as missed ( 6, 7). Also, it has been reported that 20% of screen-detected cancers were recommended for recall by one of two radiologists in independent double reading ( 8). More accurate and effective interpretive procedures may improve population-level outcomes of mammographic screening.


Artificial intelligence (AI) has shown promising results for cancer detection in mammographic examinations ( 9– 13).

However, reported results are mainly from small studies with enriched data sets, and evidence gaps related to the use of AI in real screening settings remain ( 14). 

Retrospective studies on clinical data sets using consecutive examinations provide an opportunity to independently validate AI systems before evaluation in prospective studies. 

Furthermore, the histopathologic characteristics of cancers identified by AI should be investigated to ensure detection of clinically significant breast cancers that would lead to a reduction in breast cancer mortality.



In this study, we compared the performance of a commercially available AI system with independent double reading as performed by radiologists in BreastScreen Norway. 

Furthermore, we explored the histopathologic characteristics of tumors with different AI scores.


Materials and Methods


See the original publication

Results

See the original publication

Patient Overview

See the original publication


Discussion


The purpose of this study was to evaluate an artificial intelligence (AI) system for breast cancer detection on mammograms. 

The performance of the AI system was compared with that of radiologists in an independent double-reading setting with consensus. 

A total of 77.9% of all breast cancers (86.8% of screen-detected and 44.9% of interval cancers) had the highest AI score of 10. With a threshold that mirrors the average individual radiologist rate of positive interpretation (threshold 3), 80.1% of screen-detected and 30.7% of interval cancers were selected by the AI system.


To our knowledge, this is the largest AI evaluation study to date, including more than 120 000 examinations (752 screen-detected and 205 interval cancers) from a real screening setting. 

There are several publications describing the performance of the AI system in other, smaller screening cohorts ( 11, 13, 24, 25). 

Use of this same system in a population from Malmö, Sweden, found that none of the 68 screen-detected cancers had an AI score below 3 ( 11). 

Similar results were obtained in a study from Spain ( 24): none of the 76 screen-detected cancers had an AI score below 3. In our larger sample, five of the 752 screen-detected cancers had a score below 4 (five had AI score 1 and none had AI score 2 or 3).

Differences in cancer detection across these studies may be related to our use of an updated version of the AI system or differences in characteristics of the screening populations and interpreting radiologists ( 11, 25).


The high percentage of true-negative examinations classified with a low AI score may indicate that the AI system could safely select examinations not to be interpreted by radiologists. 

In such an approach, the interpretive volume would be substantially reduced, while a small proportion of cancers not selected by the AI system would remain undetected.

If AI is used as one of the two readers in a double-reading setting, then the radiologist may still identify the small number of missed cancers. 

Furthermore, 23% of screen-detected cancers in the study had a positive assessment by only one radiologist, and, thus, it may be acceptable that some cancers have a low AI score.


Similar to the challenge in defining the ideal combination of two radiologists in double reading, more research is needed to find the optimal combination of radiologists and AI systems.

For instance, when using AI as a standalone system to identify true-negative cases that can forego radiologist interpretation altogether, an accurate low score on mammograms without missed cancers is critical. 

Using an AI score of 10 as a threshold in a standalone setting could result in 10% of the examinations requiring radiologist interpretation or 10% of the examinations directly selected for consensus. 

In the latter scenario, the consensus rate would be higher than usual in BreastScreen Norway and likely result in a higher recall rate. 

If radiologists are using an AI system in a screening setting, then it is expected that their assessment and the recall rates will depend on AI scores. 

The optimal timing and format for presenting AI scores to radiologists are unknown and need further investigation.

Being presented with a high AI score may lead to overreliance on the AI system, with radiologists no longer maintaining their own vigilance, or to reduced attention to other suspicious areas (automation bias) ( 26).


Our results indicate favorable histopathologic characteristics for screen-detected cancers with low versus high AI scores.

Studies have shown that less than 10% of screen-detected cancers are clinically insignificant, indicating a low risk of breast cancer death ( 27). 

An AI system that is able to differentiate between clinically significant and nonsignificant cancers could be beneficial for individual women and the screening program. 

Currently, there are limited data on the progression of small low-proliferation cancers, but such information could help women and clinicians to make informed choices on the intensity and extent of treatment.


Interval cancers are known to be less prognostically favorable compared with screen-detected cancers ( 7, 18), and it is essential to keep the rate as low as possible to reduce breast cancer mortality. 

We observed that the invasive interval cancers selected using threshold 1, threshold 2, and threshold 3 by the AI system had more favorable tumor characteristics compared with those not selected. 

This may indicate that interval cancers with low AI scores are true interval cancers and not visible on the screening mammograms. 

Similar results were observed in a retrospective study on a large cohort of interval cancers using the same AI system ( 28).


The strengths of our study are the large study population from a real screening setting and the capture of all cancers through registry linkage.

The main limitation is the retrospective approach; however, this limitation is mitigated by complete follow-up of all screened women.

Additional limitations include the evaluation of mammograms from a single manufacturer, the regionally homogeneous population, an AI system that did not consider prior mammograms, the limited number of radiologists, and the lack of information on laterality, mammographic features, and density.



Conclusion


In conclusion, the proportion of screen-detected cancers not selected by the artificial intelligence (AI) system at the three evaluated thresholds was less than 20%, and several of these would probably also be detected at an early stage in the next screening round. 

However, some examinations that were not selected had tumor characteristics indicative of clinically significant cancers. Prospective studies are needed to better understand the prognostic characteristics of AI-selected and AI-nonselected cases.


Further research is also needed to understand how the relatively large number of negative examinations with a high AI score can influence the recall rate and rate of false-positive results. 

Future studies should also examine mammographic features identified by AI, evaluate multiple AI algorithms in a comparative manner, examine AI in more diverse screening populations, and include cost-effectiveness analyses of using AI in screening.


References

See the original publication

Originally published at https://pubs.rsna.org
