Clinical Validation across 89 Radiographic Condition Classifiers
89+ Condition Classifiers Built on 300,000 Expert-Reviewed Veterinary Imaging Cases
Vetology’s AI classifiers have been validated using a foundation of over 300,000 multi-image patient cases from real-world veterinary practices. Below you’ll find transparent performance data, including sensitivity, specificity, Radiologist Agreement Rates, and case counts for every classifier. Use the dropdown below each condition classifier to see even more in-depth data.
This deep, real-world dataset helps our AI support veterinarians with reliable screening results at the point of care.
Vetology is the ONLY veterinary AI imaging platform that offers this level of transparency. We publish complete performance metrics to help you make informed decisions about diagnostic accuracy and clinical implementation.
While these results offer meaningful insight into expected performance, real-world factors such as image quality, positioning, and patient variability can influence accuracy in clinical settings.
What Makes Our Data Different
Understanding Our AI Virtual Radiologist Report Screening Metrics
Why Radiologist Benchmarking Matters
Instead of comparing AI to a theoretical “perfect” standard, Vetology benchmarks performance against the Radiologist Agreement Rate: how often multiple board-certified radiologists reach the same interpretation. This provides realistic context for conditions where even experts may disagree.
When sensitivity or specificity approaches or exceeds this rate, it indicates performance comparable to specialist-level interpretation for that finding.
These measures help you understand when AI can reinforce diagnostic confidence and when a case may benefit from further review.
Making Sense of the Metrics
Radiologist Agreement Rate
The percentage of cases where multiple board-certified veterinary radiologists independently arrive at the same diagnosis. This serves as a benchmark for inherent diagnostic difficulty.
Conditions with lower Radiologist Agreement Rates are more subjective or challenging, even for specialists. When AI performance approaches or exceeds the Radiologist Agreement Rate, it demonstrates specialist-comparable accuracy.
Total Test Cases
The total number of radiologist-reviewed cases, positive and negative, used to evaluate each classifier. Larger test sets produce more stable metrics and narrower confidence intervals.
Sensitivity
How reliably the AI detects a condition when it is present.
(True Positive Rate)
The proportion of actual positive cases (condition present) that the AI correctly identifies. For example, 89% sensitivity means the AI correctly detects the condition in 89 out of 100 cases where it is actually present. Higher sensitivity reduces false negatives—cases where a condition exists but goes undetected.
Specificity
How accurately the AI rules out a condition when it is absent.
(True Negative Rate)
The proportion of actual negative cases (condition absent) that the AI correctly identifies. For example, 92% specificity means the AI correctly rules out the condition in 92 out of 100 cases where it is actually absent. Higher specificity reduces false positives—cases where the AI flags a condition that is not present.
Confusion Matrix (TP, FN, TN, FP)
The raw counts of True Positives (correctly flagged), False Negatives (missed), True Negatives (correctly ruled out), and False Positives (incorrectly flagged). These four numbers are the foundation from which sensitivity, specificity, and all other metrics are calculated.
Clinical meaning: The confusion matrix lets you see the actual case counts behind every percentage. A classifier with 90% sensitivity and 10 positive test cases missed 1 case. A classifier with 90% sensitivity and 1,000 positive test cases missed 100 cases. The percentages are identical, but the confusion matrix tells you the scale. It also lets you see the absolute number of false positives — which directly affects how often the AI sends you down a path that turns out to be nothing.
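To make the relationship concrete, here is a minimal Python sketch (using hypothetical counts, not data from any Vetology classifier) showing how sensitivity and specificity fall directly out of the four confusion-matrix values:

```python
# Hypothetical confusion-matrix counts for one classifier.
tp, fn = 178, 22   # positive cases: correctly flagged vs. missed
tn, fp = 460, 40   # negative cases: correctly ruled out vs. false alarms

sensitivity = tp / (tp + fn)   # true positive rate: 178 / 200 = 0.89
specificity = tn / (tn + fp)   # true negative rate: 460 / 500 = 0.92

print(f"Sensitivity: {sensitivity:.0%}")  # 89%
print(f"Specificity: {specificity:.0%}")  # 92%
```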
95% Confidence Intervals
(for Sensitivity and Specificity)
The range within which the true sensitivity or specificity value is expected to fall, with 95% confidence. Narrower intervals reflect larger test sets and more precise estimates.
Clinical meaning: If a classifier shows 89.51% sensitivity with a tight confidence interval, you can trust that number. If the interval is wide, you may want to weight your own clinical findings more heavily.
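For readers who want to see how such an interval can be computed, here is a sketch using the Wilson score method, a common choice for binomial proportions; this page does not state which interval method Vetology uses, so treat the code as illustrative:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives ~95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical: 178 detections out of 200 true positives (89% sensitivity).
lo, hi = wilson_ci(178, 200)
print(f"Sensitivity 89.0%, 95% CI: {lo:.1%} - {hi:.1%}")  # about 83.9% - 92.6%
```

The same function applied to a much larger test set (say, 1,780 detections out of 2,000 positives) returns a visibly narrower interval, which is exactly why case counts matter.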
Area Under Curve (AUC)
AUC measures how well the classifier distinguishes between positive and negative cases across all possible decision thresholds. It is reported on a scale from 0 to 1, where 1 represents a classifier that never makes a mistake and 0.5 represents random chance.
Clinical meaning: AUC gives you a single number that captures overall discriminative ability. A classifier with an AUC of 0.94 (like Heart Failure Left in canine thorax) is performing at a high level – it reliably separates patients who have the condition from those who do not.
AUC values between 0.80 and 0.90 are generally considered “excellent” discrimination, and values above 0.90 are considered “outstanding.”[1]
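As an illustration of what AUC captures, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen positive case receives a higher classifier score than a randomly chosen negative one (the scores here are invented):

```python
def auc(scores_pos: list[float], scores_neg: list[float]) -> float:
    """Fraction of positive/negative score pairs ranked correctly (ties count half)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores (higher = more suspicious for the condition).
positives = [0.91, 0.84, 0.78, 0.66, 0.52]
negatives = [0.60, 0.45, 0.38, 0.21, 0.12]
print(f"AUC: {auc(positives, negatives):.2f}")  # 0.96
```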
Positive Predictive Value (PPV)
When the classifier flags a finding as positive, PPV tells you the probability that the patient actually has the condition. This metric is directly affected by how common the condition is in the test population.
Clinical meaning: A high PPV means that when the AI flags something, it is very likely real. A lower PPV in a rare condition is expected – it does not mean the classifier is unreliable, it means the condition is uncommon and you should correlate with clinical signs.
Negative Predictive Value (NPV)
When the classifier reports no finding, NPV tells you the probability that the patient truly does not have the condition. For screening purposes, this is one of the most important metrics available.
Clinical meaning: A high NPV gives you confidence that a negative AI result genuinely means “nothing concerning here.” This is especially valuable for conditions where missing a diagnosis carries serious consequences.
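PPV and NPV follow from sensitivity, specificity, and prevalence through Bayes' rule. The sketch below uses hypothetical inputs (89% sensitivity, 92% specificity, 10% prevalence) to show how the same classifier can produce a modest PPV alongside a very high NPV:

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    """P(condition present | AI positive), via Bayes' rule."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens: float, spec: float, prev: float) -> float:
    """P(condition absent | AI negative), via Bayes' rule."""
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

print(f"PPV: {ppv(0.89, 0.92, 0.10):.1%}")  # 55.3%: about half of flags are real
print(f"NPV: {npv(0.89, 0.92, 0.10):.1%}")  # 98.7%: negatives are trustworthy
```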
Overall Accuracy
The percentage of all cases (both positive and negative) that the classifier categorized correctly. While useful as a general performance summary, accuracy alone can be misleading when a condition is very rare or very common in the test set.
Clinical meaning: Use accuracy as a starting point, then look at the metrics behind it. Check sensitivity to see how well the classifier catches true positives, and specificity to see how well it avoids false alarms. Then check prevalence in the test set: if a condition only appears in 2% of cases, a classifier that always says “negative” would still show 98% accuracy while missing every positive case.
When accuracy is high but prevalence is low, sensitivity and NPV are the metrics that tell you whether the classifier is actually doing its job.
Balanced Accuracy
The average of sensitivity and specificity, giving equal weight to how well the classifier catches true positives and how well it avoids false positives. Unlike overall accuracy, balanced accuracy is not skewed by class imbalance.
Clinical meaning: When a condition is rare (say it appears in only 3% of cases) overall accuracy can look impressive while the classifier misses most positive cases. Balanced accuracy corrects for this by weighting positive and negative performance equally. If you see a large gap between overall accuracy and balanced accuracy, it usually means the condition is rare and the classifier leans on “negative” predictions.
Balanced accuracy gives you the fairer picture.
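A worked example with hypothetical numbers makes the gap visible. On an imbalanced 1,000-case test set, a weak detector can still post high overall accuracy:

```python
# Hypothetical rare condition: 30 of 1,000 test cases are positive (3%).
tp, fn = 12, 18     # the classifier catches only 12 of 30 positives
tn, fp = 960, 10    # but clears 960 of 970 negatives

accuracy = (tp + tn) / (tp + fn + tn + fp)            # 0.972
sensitivity = tp / (tp + fn)                          # 0.40
specificity = tn / (tn + fp)                          # ~0.990
balanced_accuracy = (sensitivity + specificity) / 2   # ~0.695

print(f"Overall accuracy:  {accuracy:.1%}")           # 97.2%
print(f"Balanced accuracy: {balanced_accuracy:.1%}")  # 69.5%
```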
Matthews Correlation Coefficient (MCC)
A single number that summarizes classifier quality using all four confusion matrix values (TP, FP, TN, FN). MCC ranges from −1 (total disagreement) through 0 (no better than random) to +1 (perfect prediction). It is widely regarded as one of the most informative single metrics for imbalanced datasets.
Clinical meaning: MCC answers the question: “Taking everything into account (true positives, true negatives, false positives, and false negatives) how well is this classifier actually performing?”
An MCC above 0.7 indicates strong agreement between the classifier and ground truth. Unlike accuracy, MCC cannot be inflated by class imbalance, which makes it especially valuable for rare conditions.
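Continuing the hypothetical example from the balanced-accuracy sketch above, MCC can be computed directly from the four counts, and it exposes the weakness that 97.2% accuracy concealed:

```python
from math import sqrt

def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(f"MCC: {mcc(tp=12, fp=10, tn=960, fn=18):.2f}")  # 0.45: only moderate agreement
```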
Prevalence in Test Set
The proportion of positive cases in the validation dataset. This is the denominator that drives PPV and NPV calculations.
Clinical meaning: Knowing the test set prevalence helps you calibrate expectations. If your practice population has a higher prevalence of a condition than the test set (e.g., you are a cardiology referral center), the real-world PPV for your caseload will be higher than the reported number.
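To see that effect in numbers, the sketch below re-evaluates PPV for the same hypothetical classifier (89% sensitivity, 92% specificity) at a screening-level prevalence and at a higher referral-population prevalence; both prevalence figures are assumptions for illustration:

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    """P(condition present | AI positive) at a given prevalence."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

print(f"PPV at 10% prevalence: {ppv(0.89, 0.92, 0.10):.1%}")  # ~55.3%
print(f"PPV at 40% prevalence: {ppv(0.89, 0.92, 0.40):.1%}")  # ~88.1%
```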
Clinical Urgency Rating
Each classifier is tagged with a clinical urgency level: Critical, High, Moderate, or Low. These ratings are defined through consensus by Vetology’s in-house board-certified veterinary radiologists based on the clinical consequence of a delayed or missed detection – not the AI’s confidence level.
Clinical meaning: Urgency ratings help you prioritize your review queue. When multiple AI findings appear on a single study, urgency tells you where to look first.
Ground Truth Case Counts
(Positive + Negative)
The exact number of board-certified veterinary radiologist-reviewed cases used to validate each classifier, broken into positive and negative counts.
Clinical meaning: More cases mean more reliable statistics. When you see a classifier validated on 10,951 cases, you know those performance numbers are robust. This also helps you evaluate whether the validation dataset was balanced enough to produce meaningful results.
Model Release Date
The date each classifier model was initially released or, in most cases, when it was most recently retrained and re-released.
Clinical meaning: Knowing the date the model was released tells you how current the model is. At Vetology we regularly retrain classifiers to ensure we are incorporating the latest validation data and performance improvements. Artificial Intelligence is changing fast! Our models are too.
Vetology’s Published Condition Classifier Metrics
Our industry-leading transparency allows you to evaluate our technology objectively.
- Real-World Validation: Tested on a foundation of 300,000+ multi-image patient cases from veterinary practices
- Radiologist Benchmarking: Performance compared directly to board-certified veterinary radiologists
- Comprehensive Coverage: Canine and feline imaging across thorax, abdomen, spine, and musculoskeletal studies
Vetology AI Classifier Performance Metrics
Veterinary Radiology AI — Clinical Validation Data by Anatomical Region
Built on a foundation of 300,000+ board-certified veterinary radiologist-reviewed multi-image patient cases
Understanding These Metrics
Sensitivity
How often the AI correctly detects a condition when it is actually present. Think of it as the "catch rate." A sensitivity of 89% means the AI catches 89 out of every 100 cases where the condition exists. Higher is better – you want fewer missed findings.
Specificity
How often the AI correctly identifies a normal case as normal. A specificity of 92% means that when there is no finding, the AI agrees 92 out of 100 times. Higher specificity means fewer false alarms that could lead to unnecessary follow-up.
95% Confidence Interval
A range showing how precise the sensitivity or specificity measurement is. A narrower range means we tested more cases and can be more certain of the result. For example, "85% - 93%" means the true performance most likely falls within that range.
Radiologist Agreement Rate
How often board-certified veterinary radiologists agree with each other when reading the same images. Some findings are straightforward and radiologists almost always agree; others are more subjective. This number gives you context – if even specialists disagree 30% of the time on a finding, an AI performing in that range is working within the natural variability of expert interpretation.
Area Under Curve (AUC)
A single number that summarizes overall classifier quality. An AUC of 1.0 would be a perfect classifier; 0.5 would be no better than flipping a coin. In practice, values above 0.85 indicate strong performance. This is a metric primarily used by data scientists to compare classifier models.
Positive Predictive Value (PPV)
When the AI flags a finding, how often is it actually there? A PPV of 80% means that 8 out of 10 times the AI says "positive," the condition is truly present. PPV depends heavily on how common the condition is in practice, so we calculate it using real-world condition frequency from our clinical case database (a Bayesian adjustment) rather than the test set, giving you a number that better reflects what you would see in your clinic.
Negative Predictive Value (NPV)
When the AI says a finding is not present, how often is it right? An NPV of 99% means you can be highly confident in a negative result. Like PPV, we calculate NPV using real-world condition frequency from our clinical case database (a Bayesian adjustment) so the value reflects actual clinical practice rather than test set composition.
Prevalence
How common this condition is in real-world veterinary practice. Rather than using the test set (which may over- or under-represent certain conditions), we calculate prevalence from our clinical case database of real patient reports. This gives you a more accurate picture of how often you are likely to encounter each condition. Prevalence matters because it directly affects how much you should trust a positive or negative AI result.
Overall Accuracy
The percentage of all cases – both positive and negative – that the classifier got right. While it sounds like the most important number, accuracy can be misleading for rare conditions. For example, a condition that appears in only 1% of cases could show 99% accuracy simply by always saying "not present." That is why we report sensitivity and specificity alongside accuracy to give the full picture.
Balanced Accuracy
The average of sensitivity and specificity, giving equal weight to detecting positive cases and correctly ruling out negative ones. This is a more reliable measure than overall accuracy, especially for conditions that are rare in practice, because it is not skewed by how common or uncommon the condition is.
Matthews Correlation Coefficient (MCC)
A single number that captures overall classifier quality, accounting for true positives, false positives, true negatives, and false negatives all at once. MCC ranges from -1 (the classifier gets everything wrong) through 0 (no better than random) to +1 (perfect). Data scientists consider MCC one of the most balanced and informative single metrics, particularly when conditions are rare.
Confusion Matrix
The four raw counts behind all the other metrics. True Positives: the AI said yes and was right. False Negatives: the AI missed it. True Negatives: the AI said no and was right. False Positives: the AI flagged something that was not there. Every metric on this page is derived from these four numbers.
Clinical Urgency
How time-sensitive the condition is in clinical practice, rated by our board-certified veterinary radiologists. Critical conditions (such as heart failure or pericardial effusion) require immediate attention. High-urgency findings should be addressed promptly. This helps you prioritize which AI findings to act on first.
Ground Truth Cases
The number of cases used to evaluate this classifier, reviewed and labeled by board-certified veterinary radiologists. More cases generally means more reliable metrics. The count is split into positive (condition present) and negative (condition absent) so you can see how balanced the test was.
[1] Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd ed. New York: Wiley; 2000. AUC interpretation: 0.70–0.80 = acceptable, 0.80–0.90 = excellent, ≥0.90 = outstanding. See also: Çorbacıoğlu ŞK, Aksel G. “Receiver operating characteristic curve analysis in diagnostic accuracy studies: A guide to interpreting the area under the curve value.” Turk J Emerg Med. 2023;23(4):195–198. PMC10664195.
Frequently Asked Questions
How accurate are Vetology's AI classifiers?
What conditions can Vetology's AI detect?
We detect 83 conditions including cardiomegaly, pleural fluid, hepatomegaly and microhepatia, splenomegaly, kidney size, thoracic masses, abdominal masses, bladder and kidney stones, and more across canine and feline patients.
Our classifiers cover thorax (20 canine, 15 feline), abdomen (25 canine, 14 feline), and spine/musculoskeletal (9 conditions) imaging.
How does Vetology compare to other veterinary imaging AI platforms?
What is Radiologist Agreement Rate (RAR)?
Radiologist Agreement Rate measures how often board-certified veterinary radiologists reach the same interpretation when reading the same cases. This benchmark helps practices understand how AI performance compares to specialist interpretation and provides context for clinical decision-making.
How was this data validated?
All performance metrics are based on a rigorous testing process using over 300,000 real-world veterinary cases.
Each classifier was evaluated independently, with sensitivity, specificity, and case counts calculated from actual clinical imaging studies.
Radiologist Agreement Rates compare AI predictions against board-certified veterinary radiologist interpretations.
Ready to Experience AI-Assisted Radiology?
See how Vetology’s classifiers can improve diagnostic confidence and workflow efficiency in your practice.
