Clinical Validation across 83 Radiographic Condition Classifiers
Veterinary AI Classifier Performance Metrics
83 Condition Classifiers Built on 300,000+ Expert-Reviewed Veterinary Imaging Cases
Vetology’s AI classifiers have been validated using over 300,000 cases from real-world veterinary practices. Below you’ll find transparent performance data, including sensitivity, specificity, Radiologist Agreement Rates, and case counts for every classifier. This deep, real-world dataset helps our AI support veterinarians with reliable screening results at the point of care.
Unlike other veterinary AI imaging platforms, we publish complete performance metrics to help you make informed decisions about diagnostic accuracy and clinical implementation.
While these results offer meaningful insight into expected performance, real-world factors such as image quality, positioning, and patient variability can influence accuracy in clinical settings.
What Makes Our Data Different
Understanding Our AI Virtual Radiologist Report Screening Metrics
Why Radiologist Benchmarking Matters
Instead of comparing AI to a theoretical “perfect” standard, Vetology benchmarks performance against the Radiologist Agreement Rate: how often multiple board-certified radiologists reach the same interpretation. This provides realistic context for conditions where even experts may disagree.
When sensitivity or specificity approaches or exceeds this rate, it indicates performance comparable to specialist-level interpretation for that finding.
These measures help you understand when AI can reinforce diagnostic confidence and when a case may benefit from further review.
Making Sense of the Metrics
Sensitivity
How reliably the AI detects a condition when it is present.
(True Positive Rate)
The proportion of actual positive cases (condition present) that the AI correctly identifies. For example, 89% sensitivity means the AI correctly detects the condition in 89 out of 100 cases where it is actually present. Higher sensitivity reduces false negatives—cases where a condition exists but goes undetected.
Specificity
How accurately the AI rules out a condition when it is absent.
(True Negative Rate)
The proportion of actual negative cases (condition absent) that the AI correctly identifies. For example, 92% specificity means the AI correctly rules out the condition in 92 out of 100 cases where it is actually absent. Higher specificity reduces false positives—cases where the AI flags a condition that is not present.
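To make the arithmetic concrete, here is a minimal sketch in Python using hypothetical confusion-matrix counts (not Vetology’s published figures) showing how both rates are computed:

```python
# Hypothetical confusion-matrix counts for a single classifier.
tp, fn = 89, 11   # condition present: correctly detected vs. missed
tn, fp = 92, 8    # condition absent: correctly ruled out vs. falsely flagged

sensitivity = tp / (tp + fn)  # true positive rate = 89 / 100 = 0.89
specificity = tn / (tn + fp)  # true negative rate = 92 / 100 = 0.92

print(f"Sensitivity: {sensitivity:.0%}, Specificity: {specificity:.0%}")
```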
Radiologist Agreement Rate
The percentage of cases where multiple board-certified veterinary radiologists independently arrive at the same diagnosis. This serves as a benchmark for inherent diagnostic difficulty.
Conditions with lower Radiologist Agreement Rates are more subjective or challenging, even for specialists. When AI performance approaches or exceeds the Radiologist Agreement Rate, it demonstrates specialist-comparable accuracy.
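As an illustration, one simple way to compute an agreement rate is the fraction of cases where every reader makes the same call. This sketch uses hypothetical reads; Vetology’s exact aggregation method is not specified here:

```python
# Hypothetical calls (1 = positive, 0 = negative) from three radiologists
# reading the same five cases.
reads = [
    (1, 1, 1),
    (0, 0, 0),
    (1, 0, 1),  # readers disagree
    (0, 0, 0),
    (1, 1, 0),  # readers disagree
]

# Agreement rate: fraction of cases where all readers give the same call.
agreement_rate = sum(len(set(case)) == 1 for case in reads) / len(reads)
print(f"Radiologist Agreement Rate: {agreement_rate:.0%}")  # 60%
```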
Total Test Cases
The total number of expert-reviewed cases used to evaluate each classifier. Larger test sets produce more stable performance estimates and tighter confidence intervals.
95% Confidence Intervals
(for Sensitivity and Specificity)
The range within which the true sensitivity or specificity is expected to fall, given the size of the validation sample; larger case counts produce tighter intervals.
Clinical meaning: If a classifier shows 89.51% sensitivity with a tight confidence interval, you can trust that number. If the interval is wide, you may want to weight your own clinical findings more heavily.
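To see why sample size drives interval width, here is a minimal sketch using the Wilson score interval (one common method; the exact CI method behind the published numbers is not specified here):

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# The same ~89.5% sensitivity is far more trustworthy with more cases:
print(wilson_ci(179, 200))  # ~(0.845, 0.930) -> tight, reliable
print(wilson_ci(17, 19))    # ~(0.686, 0.971) -> wide, weigh clinical findings
```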
Area Under Curve (AUC)
AUC measures how well the classifier distinguishes between positive and negative cases across all possible decision thresholds. It is reported on a scale from 0 to 1, where 1 represents a classifier that never makes a mistake and 0.5 represents random chance.
Clinical meaning: AUC gives you a single number that captures overall discriminative ability. A classifier with an AUC of 0.94 (like Heart Failure Left in canine thorax) is performing at a high level – it reliably separates patients who have the condition from those who do not.
AUC values between 0.80 and 0.90 are generally considered “excellent” discrimination, and values above 0.90 are considered “outstanding.”[1]
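For intuition, AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch with hypothetical model scores:

```python
# Hypothetical classifier scores and ground-truth labels (1 = condition present).
scores = [0.95, 0.80, 0.75, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Mann-Whitney formulation: fraction of positive/negative pairs ranked
# correctly, counting ties as half-correct.
pairs = [(p, n) for p in pos for n in neg]
auc = sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)
print(f"AUC: {auc:.2f}")  # 0.75
```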
Positive Predictive Value (PPV)
When the classifier flags a finding as positive, PPV tells you the probability that the patient actually has the condition. This metric is directly affected by how common the condition is in the test population.
Clinical meaning: A high PPV means that when the AI flags something, it is very likely real. A lower PPV for a rare condition is expected; it does not mean the classifier is unreliable, but rather that the condition is uncommon and you should correlate with clinical signs.
Negative Predictive Value (NPV)
When the classifier reports no finding, NPV tells you the probability that the patient truly does not have the condition. For screening purposes, this is one of the most important metrics available.
Clinical meaning: A high NPV gives you confidence that a negative AI result genuinely means “nothing concerning here.” This is especially valuable for conditions where missing a diagnosis carries serious consequences.
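Both predictive values follow directly from sensitivity, specificity, and prevalence via Bayes’ theorem. This sketch (hypothetical numbers) shows how the same classifier yields a very different PPV for common versus rare findings while NPV stays high:

```python
def ppv_npv(sens: float, spec: float, prev: float) -> tuple[float, float]:
    """Predictive values from sensitivity, specificity, and prevalence (Bayes)."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# The same classifier (89% sensitivity, 92% specificity) at two prevalences:
print(ppv_npv(0.89, 0.92, 0.30))  # common finding: PPV ~0.83, NPV ~0.95
print(ppv_npv(0.89, 0.92, 0.02))  # rare finding:   PPV ~0.19, NPV ~0.998
```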
Overall Accuracy
The percentage of all cases (both positive and negative) that the classifier categorized correctly. While useful as a general performance summary, accuracy alone can be misleading when a condition is rare or common.
Clinical meaning: Use accuracy as a starting point, then check the class-specific metrics: sensitivity to see how well it catches true positives, and specificity to see how well it avoids false alarms. Then check prevalence in the test set: if a condition appears in only 2% of cases, a classifier that always says “negative” would still show 98% accuracy while missing every positive case.
When accuracy is high but prevalence is low, sensitivity and NPV are the metrics that tell you whether the classifier is actually doing its job.
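A worked version of the 2% example above (hypothetical test set) shows why accuracy alone can flatter a useless classifier:

```python
# Hypothetical 1,000-case test set with 2% prevalence.
n = 1000
positives = 20             # cases with the condition
negatives = n - positives  # 980 cases without it

# A classifier that always reports "negative":
accuracy = negatives / n       # 0.98 -> looks impressive
sensitivity = 0 / positives    # 0.00 -> misses every positive case

print(f"Accuracy: {accuracy:.0%}, Sensitivity: {sensitivity:.0%}")
```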
Prevalence in Test Set
The proportion of positive cases in the validation dataset. This is the denominator that drives PPV and NPV calculations.
Clinical meaning: Knowing the test set prevalence helps you calibrate expectations. If your practice population has a higher prevalence of a condition than the test set (e.g., you are a cardiology referral center), the real-world PPV for your caseload will be higher than the reported number.
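Using the same Bayes formula as the hypothetical ppv_npv() sketch above, you can recalibrate a published PPV to your own caseload’s prevalence (the sensitivity and specificity defaults here are illustrative):

```python
def ppv_at(prev: float, sens: float = 0.89, spec: float = 0.92) -> float:
    """PPV for a given local prevalence (same Bayes formula as above)."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

print(ppv_at(0.05))  # 5% test-set prevalence:      PPV ~0.37
print(ppv_at(0.40))  # 40% referral-caseload rate:  PPV ~0.88
```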
Clinical Urgency Rating
Each classifier is tagged with a clinical urgency level: Critical, High, Moderate, or Low. These ratings are defined through consensus by Vetology’s in-house board-certified veterinary radiologists based on the clinical consequence of a delayed or missed detection – not the AI’s confidence level.
Clinical meaning: Urgency ratings help you prioritize your review queue. When multiple AI findings appear on a single study, urgency tells you where to look first.
Ground Truth Case Counts
(Positive + Negative)
The exact number of board-certified veterinary radiologist-reviewed cases used to validate each classifier, broken into positive and negative counts.
Clinical meaning: More cases mean more reliable statistics. When you see a classifier validated on 10,951 cases, you know those performance numbers are robust. This also helps you evaluate whether the validation dataset was balanced enough to produce meaningful results.
Model Release and Retrained Dates
The date each classifier model was initially released and when it was most recently retrained on updated data.
Clinical meaning: Knowing the retrained date tells you how current the model is. Classifiers that have been recently retrained incorporate the latest validation data and performance improvements.
Vetology’s Published Condition Classifier Metrics
Our industry-leading transparency allows you to evaluate our technology objectively.
- Real-World Validation: Tested on 300,000+ multi-image patient cases from veterinary practices
- Radiologist Benchmarking: Performance compared directly against board-certified veterinary radiologists
- Comprehensive Coverage: Canine and feline imaging across thorax, abdomen, spine, and musculoskeletal studies
Vetology AI Classifier Performance Metrics
Clinical Validation Data by Anatomic Region
Validated on 300,000+ test cases by board-certified veterinary radiologists
Classifier categories:
- Cardiac Conditions
- Vascular
- Pulmonary Parenchymal
- Pleural/Mediastinal
- Airways
- Other Thoracic Findings
- Hepatic
- Splenic
- Renal/Urinary
- Gastrointestinal
- Other Abdominal Findings
- Spine
- Pelvis/Joints
[1] Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd ed. New York: Wiley; 2000 (AUC interpretation: 0.70–0.80 = acceptable, 0.80–0.90 = excellent, ≥0.90 = outstanding). See also: Çorbacıoğlu ŞK, Aksel G. Receiver operating characteristic curve analysis in diagnostic accuracy studies: A guide to interpreting the area under the curve value. Turk J Emerg Med. 2023;23(4):195–198. PMCID: PMC10664195.
Frequently Asked Questions
How accurate are Vetology's AI classifiers?
Accuracy varies by condition, which is why we publish sensitivity, specificity, Radiologist Agreement Rates, and case counts for every classifier in the tables above. All metrics were validated on 300,000+ expert-reviewed cases, so you can evaluate each finding on its own numbers rather than relying on a single headline figure.
What conditions can Vetology's AI detect?
We detect 83 conditions, including cardiomegaly, pleural fluid, hepatomegaly and microhepatia, splenomegaly, abnormal kidney size, thoracic masses, abdominal masses, bladder and kidney stones, and more across canine and feline patients.
Our classifiers cover thorax (20 canine, 15 feline), abdomen (25 canine, 14 feline), and spine/musculoskeletal (9 conditions) imaging.
How does Vetology compare to other veterinary imaging AI platforms?
Unlike other veterinary AI imaging platforms, Vetology publishes complete performance metrics for every classifier, including sensitivity, specificity, confidence intervals, and Radiologist Agreement Rates, so you can evaluate the technology objectively before adopting it.
What is Radiologist Agreement Rate (RAR)?
Radiologist Agreement Rate measures how often multiple board-certified veterinary radiologists independently reach the same interpretation on the same cases. This benchmark reflects the inherent difficulty of each finding, helps practices understand how AI performance compares to specialist interpretation, and provides context for clinical decision-making.
How was this data validated?
All performance metrics are based on a rigorous testing process using over 300,000 real-world veterinary cases.
Each classifier was evaluated independently, with sensitivity, specificity, and case counts calculated from actual clinical imaging studies.
Radiologist Agreement Rates compare AI predictions against board-certified veterinary radiologist interpretations.
Ready to Experience AI-Assisted Radiology?
See how Vetology’s classifiers can improve diagnostic confidence and workflow efficiency in your practice.
