The “Artificial intelligence in medical imaging (AIMI) ranking” project is a research initiative supported by Osimis and led by Chief Innovation Officer Prof. Dr. Sergey Morozov. Literature research and database updates are managed by Pavel Gelezhe, MD, PhD.
AIMI-ranking by Osimis aims to create a unified, regularly updated database of AIMI-solutions’ accuracy metrics, as reported in peer-reviewed publications. With this project we provide guidance to clinical users who are comparing and selecting AI-solutions relevant for their clinical practice. Moreover, we hope to motivate AI-vendors to follow the metrics provided by AIMI-ranking and improve the quality of their solutions. The top-tier of AIMI-ranking is provided via Osimis AI-platform.
Best fit of an AI-solution for a clinical scenario is defined by sensitivity > 95% and specificity > 90%. Conditional fit corresponds to AI-solutions with reported sensitivity > 90% and specificity > 80%. The real-world performance of AI-solutions also depends on the pre-test prevalence of a disease; for example, screening applications require the highest achievable sensitivity. The quality of each publication should also be assessed according to published methodologies, such as CLAIM.
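The two tiers above can be sketched as a small classification rule. This is only an illustrative sketch: the function name `fit_tier` and the label "below thresholds" are hypothetical, the inputs are assumed to be fractions rather than percentages, and the strict ">" comparisons follow the wording of the thresholds.

```python
def fit_tier(sensitivity: float, specificity: float) -> str:
    """Map reported sensitivity and specificity (fractions in [0, 1]) to a tier.

    The tier names "best fit" and "conditional fit" come from the text;
    "below thresholds" is a placeholder label assumed for everything else.
    """
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must be fractions in [0, 1]")
    # Best fit: sensitivity > 95% and specificity > 90%.
    if sensitivity > 0.95 and specificity > 0.90:
        return "best fit"
    # Conditional fit: sensitivity > 90% and specificity > 80%.
    if sensitivity > 0.90 and specificity > 0.80:
        return "conditional fit"
    return "below thresholds"


if __name__ == "__main__":
    print(fit_tier(0.97, 0.93))  # best fit
    print(fit_tier(0.96, 0.85))  # conditional fit
    print(fit_tier(0.88, 0.95))  # below thresholds
```

Note that a value exactly at a threshold (for example, a sensitivity of exactly 95%) falls into the lower tier under this strict reading.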
Version 1.0 of AIMI-ranking was published on September 6, 2023. It provides accuracy data on AI-solutions for interpreting Chest XR, Trauma XR, Mammography and Prostate MRI. We obtain the accuracy metrics from the peer-reviewed papers representing the respective AI-solutions. Authors who disagree with how their data are presented are welcome to email us at sergey.morozov@osimis.io. In case you know of a study that ought to be added to the project, please click here.
Background for the AIMI-ranking
The number of artificial intelligence (AI) products for radiology has increased rapidly in recent years. Although the number of software products on the market is high, market penetration remains limited [van Leeuwen KG, Schalekamp S, Rutten MJCM, et al. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur Radiol. 2021 Jun;31(6):3797-3804.].
There is a known lack of clinical validation of existing artificial intelligence algorithms. Only 6% of recent papers on medical deep learning algorithms included validation against independent external data [Kim DW, Jang HY, Kim KW et al (2019) Korean J Radiol 20:405-410].
The aim of our work is to increase market transparency and to help find the best algorithm for the safe implementation of AI in radiology departments. This analysis presents an overview of commercially available AI solutions. For each of these products, we have collected and analyzed scientific evidence demonstrating the effectiveness of AI software.
The study concept includes a comparative analysis of accuracy metrics reported in pre-selected papers, a decomposition of accuracy metrics by pathology sub-domains and the identification of leading AI-services. Inclusion criteria were the availability of information on sensitivity, specificity, prevalence and AUCs. Reference accuracy metrics were established using quite lenient thresholds. Optimal values for sensitivity and specificity were set at >95% and >90%, respectively. Sensitivity and specificity values of <90% and <80%, respectively, were considered clinically unacceptable.
Our study has limitations. First of all, there is a lack of meta-analytics. Second, there should be an analysis of the publications’ quality according to the existing methodologies such as CLAIM. Third, actual AI-services’ performance usually differs from the reported results. Fourth, the AI-solutions were not analyzed on independent datasets, so future investigations using independent datasets are required.
Basic definitions
- AIMI-solution refers to the software products/services, registered as SaMD, available as SaaS for clinical usage. The input for these solutions is represented by medical images used for diagnostics and/or treatment planning (including radiology, cardiology, nuclear medicine). The storage and transfer of these data is performed according to DICOM standards. The output of these solutions is represented by the augmented images (DICOM SC, GSPS) and diagnostic conclusions (lesions’ identifiers, disease classification, measurements in the format of HL7, xml, json messages used for the clinical workflow management, worklist enrichment, notifications, medical records’ drafting). These outputs are imported into PACS/RIS/HIS for the usage by radiologists, cardiologists, nuclear medicine specialists, surgeons, referring physicians.
- Sensitivity (True Positive Rate): Sensitivity refers to the ability of a diagnostic test to correctly identify individuals with the condition. Suggestion: Sensitivity should ideally be above 90% to ensure accurate identification of true positives.
- Specificity (True Negative Rate): Specificity refers to the ability of a diagnostic test to correctly identify individuals without the condition. Suggestion: Specificity should ideally be above 90% to ensure accurate identification of true negatives.
- Clinical scenario is defined by the pre-test probability of a disease. Two major scenarios are screening and confirmatory diagnostics.
- As a rule of thumb, a pre-test probability of 5% or less would define the screening clinical scenario, meaning that the goal of diagnostics is primarily to exclude the most significant diseases and narrow down a wide array of potential diagnoses. In screening, the major accuracy requirement is the highest achievable sensitivity (so as not to miss a pathology, SnNOut). After a highly sensitive test has ruled out ‘negative’ cases, a highly specific test should be applied to rule in the specific diagnosis (SpPIn).
- In the screening scenario AI is primarily used for triage, requiring the highest possible sensitivity, even at the expense of false-positives. This application of AI decreases the workload by allowing negative cases to be reported much faster without a loss of accuracy. Sensitivity of >95% and specificity of >80% would be minimally required for screening.
- In the confirmatory diagnostics scenario the high sensitivity remains a prerequisite, as an AI missing a pathology would pose a significant risk and a liability. However, in this scenario the requirement for the specificity becomes higher compared to the screening, as the demand is not only to filter-out negatives, but also to properly establish the correct diagnosis. Sensitivity of >95% and specificity >90% would be required for the confirmatory diagnostics.
- Reference: Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M., ... & Cohen, J. F. (2015). STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. Radiology, 277(3), 826-832.
- Please note that these suggestions are based on general guidelines and can vary based on the specific disease, population, and clinical setting. It's important to consider the trade-off between sensitivity and specificity, as well as other factors like prevalence and clinical implications, when determining appropriate reference levels for your specific case. The appropriate reference levels can vary based on the specific condition being diagnosed and the context of the study or application.
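The definitions above can be illustrated with a short sketch that computes sensitivity and specificity from confusion-matrix counts and then uses Bayes' theorem to derive the positive and negative predictive values at a given pre-test probability. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from any of the ranked publications.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of diseased cases the test flags."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of healthy cases the test clears."""
    return tn / (tn + fp)


def predictive_values(sens: float, spec: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) for a test applied at the given pre-test probability."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    true_neg = spec * (1.0 - prevalence)
    false_neg = (1.0 - sens) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Made-up counts giving sensitivity 0.95 and specificity 0.90.
    sens = sensitivity(tp=95, fn=5)
    spec = specificity(tn=900, fp=100)

    # Screening-like setting (pre-test probability 5%).
    ppv, npv = predictive_values(sens, spec, prevalence=0.05)
    print(f"screening:    PPV={ppv:.3f} NPV={npv:.3f}")  # PPV=0.333 NPV=0.997

    # Confirmatory-like setting (pre-test probability 50%, an assumed value).
    ppv, npv = predictive_values(sens, spec, prevalence=0.50)
    print(f"confirmatory: PPV={ppv:.3f} NPV={npv:.3f}")  # PPV=0.905 NPV=0.947
```

With the same sensitivity and specificity, a negative result at 5% pre-test probability is highly reliable (rule-out), while only about one in three positive results is a true positive, which is why a more specific follow-up test is needed to rule in the diagnosis.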
Top-tier AI-solutions of AIMI-ranking
Chest XR AI comparison
- Chest XR domain is the most competitive amongst all AI applications in radiology.
- The target pathology for AI is very diverse and includes conditions such as atelectasis, pleural effusion, pneumonia, pneumothorax, pulmonary mass, tuberculosis and many others.
- There is a proven accuracy gain with AI implementations for chest XR reporting, reaching the point of automated triage and “negative” studies’ autonomous reporting.
- The reported accuracy of a stand-alone AI is significantly higher than that of specialists: Sens and Spec 95-97%. However, there is a discrepancy between high sensitivity of AI for all visible pathologies and relatively lower sensitivities for underlying sub-classes of pathologies.
- Despite the high accuracy of stand-alone AI for chest XR, most of the solutions still require the calibration of their decision-making thresholds in order to maximize the sensitivity at the expense of specificity.
- The overall leading chest XR AI solutions according to the peer-reviewed publications are the following: Annalise.ai, Lunit, Oxipit, Qure.ai.
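The threshold calibration mentioned in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the AI is assumed to output a continuous score per study, a study is called positive when its score is at or above the threshold, and the function name, scores and labels are all hypothetical toy values. It does not describe how any particular vendor calibrates its product.

```python
def threshold_for_sensitivity(scores, labels, target_sensitivity):
    """Find the highest threshold whose sensitivity meets the target.

    scores: per-study AI scores (higher means more suspicious).
    labels: ground truth per study, 1 = pathology present, 0 = absent.
    Returns (threshold, sensitivity, specificity) at that operating point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    # Walk candidate thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        sens = tp / positives
        spec = (negatives - fp) / negatives
        if sens >= target_sensitivity:
            return threshold, sens, spec
    raise ValueError("target sensitivity cannot be reached")


if __name__ == "__main__":
    # Toy data: five studies with pathology, five without.
    scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.70, 0.50, 0.30, 0.20, 0.10]
    labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    print(threshold_for_sensitivity(scores, labels, 0.8))
    print(threshold_for_sensitivity(scores, labels, 1.0))
```

On this toy data, raising the target sensitivity from 0.8 to 1.0 lowers the threshold from 0.60 to 0.40 and the specificity from 0.8 to 0.6, which shows the trade-off that calibration has to manage.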
Trauma XR AI comparison
- Trauma XR domain is one of the most competitive amongst all AI applications in radiology.
- The target pathology for AI is very diverse and covers fractures and dislocations of all skeletal areas.
- There is a proven accuracy gain with AI implementations for trauma XR reporting, reaching the point of automated triage and “negative” studies’ autonomous reporting.
- The reported accuracy of a stand-alone AI is significantly higher than that of specialists: Sens and Spec 97-98%.
- Despite the high accuracy of stand-alone AI for trauma XR, many solutions still require the calibration of their decision-making thresholds in order to maximize the sensitivity at the expense of specificity.
- The overall leading trauma XR AI solutions according to the peer-reviewed publications are the following: Gleamer, Milvue.
MMG AI comparison
- MMG AI domain is quite competitive with several established market leaders.
- There is a proven accuracy gain with AI implementations both for 2D and 3D MMG. With the help of AI a general radiologist reaches the accuracy of a specialized breast radiologist.
- The reported accuracy of a Radiologist is quite average: Sens 52-83%, Spec 63-77%.
- The reported accuracy of a Radiologist and AI working together is slightly higher: Sens 69-86%, Spec 67-79%.
- The reported accuracy of a stand-alone AI is significantly higher: Sens 78-91%, Spec 88-97%.
- Despite the high accuracy of stand-alone AI for MMG 2D, all of the solutions require the calibration of their decision-making thresholds in order to maximize the sensitivity at the expense of specificity.
- There is a drastic heterogeneity in the accuracy of CE-certified AI-solutions, certainly indicating the need for the real-world data on their clinical value. The prospective studies are ongoing and their results are very much awaited for the clarification of AI-value in the real world.
- The overall leading MMG AI solutions according to the peer-reviewed publications are the following: Kheiron, Lunit, Screenpoint, Vara.
We shall be updating the database with other medical imaging domains. In case you know of a study that ought to be added to the project, please click here. In case of an author’s discontent with the data presentation please email us at sergey.morozov@osimis.io.
References
Associated projects
- “AI for radiology”. Find the artificial intelligence based software for radiology that you are looking for. All products listed are available for the European market (CE marked).
- “AI central”. This site is intended to provide easy-to-access, detailed information regarding FDA cleared AI medical products that are related to radiology and other imaging domains. Our editorial board and staff are continuously reviewing data from FDA public facing documents, vendor information and physician user feedback to provide you with up-to-date information that will help you to make appropriate purchasing decisions.