How to Read AI Metrics Like a Confident Veterinarian

A practical guide for veterinary professionals who want to understand AI validation data, not just trust it.

ABSTRACT

Vetology publishes 11 performance metrics for each of its 89+ veterinary radiology classifiers, built on a foundation of 300,000 multi-image patient cases. This article explains what each metric means in plain clinical language so veterinary professionals can interpret AI screening results with confidence. It covers sensitivity and specificity (how well the AI classifies cases), prevalence (how common a condition is in real-world practice), positive and negative predictive values (how reliable an individual prediction is once prevalence is factored in), confidence intervals, radiologist agreement rates, AUC, F1 score, and accuracy.

A key distinction: sensitivity and specificity evaluate model performance independent of prevalence, while PPV and NPV evaluate prediction reliability and are directly affected by how common a disease is. For rare conditions, a PPV that meaningfully exceeds the underlying prevalence indicates real predictive value. All metrics are published with full transparency at vetology.net/ai-classifier-performance.

Key terms: veterinary AI, classifier performance, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), prevalence, AUC, radiologist agreement rate, confusion matrix, veterinary radiology, AI validation, diagnostic AI screening

Why This Matters for Your Practice

We recently expanded our public AI performance dashboard from four metrics to eleven for each of our 89+ classifiers. That is a lot of numbers. And if you are like most veterinary professionals, you did not go to vet school to interpret ROC curves.

But these metrics directly affect how you use AI screening results in your clinical decisions. When an AI report flags cardiomegaly or rules out pleural effusion, the metrics behind that classifier tell you how much weight to give the result. Understanding a few key numbers can change how confidently you act on what the AI is telling you.

Here is what each metric means, in plain language, with real examples from our published data.

The Two Metrics You Probably Already Know

Sensitivity (the “catch rate”)
When the condition is present, how often does the AI detect it?
A sensitivity of 89.5% means the AI correctly identifies the condition in roughly 89 or 90 out of every 100 cases where it truly exists. The remaining cases are missed findings (false negatives).

What this means for you: Higher sensitivity means fewer missed findings. For conditions where early detection is critical, like heart failure, you want sensitivity to be as high as possible.

Specificity (the “all clear” rate)

When the condition is absent, how often does the AI correctly say so?
A specificity of 92.1% means that when there is no finding, the AI agrees 92 out of 100 times. The rest are false alarms (false positives).

What this means for you: Higher specificity means fewer unnecessary follow-ups. When the AI says “not present” and specificity is high, you can feel confident about that negative result.
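Both rates fall out of four confusion-matrix counts. As a quick sketch (the counts below are illustrative, chosen to reproduce the 89.5% and 92.1% figures, not Vetology's actual case counts):

```python
# Hypothetical confusion-matrix counts for a single classifier
# (illustrative numbers, not Vetology's published case counts).
tp, fn = 179, 21   # condition present: caught vs. missed
tn, fp = 921, 79   # condition absent: cleared vs. false alarm

sensitivity = tp / (tp + fn)   # the "catch rate"
specificity = tn / (tn + fp)   # the "all clear" rate

print(f"Sensitivity: {sensitivity:.1%}")   # 89.5%
print(f"Specificity: {specificity:.1%}")   # 92.1%
```

Notice that each rate is computed only within its own class: sensitivity never looks at the negative cases, and specificity never looks at the positive ones. That is why neither metric depends on how common the condition is.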

Prevalence

How common is this condition in real-world practice?
We calculate prevalence from our clinical case database rather than the test set, so the number reflects actual clinical frequency. This tells you the baseline probability before the AI even looks at the image. A condition with 15% prevalence behaves very differently than one at 0.5%.

Why it matters here: Prevalence is essential for understanding the next two metrics, PPV and NPV. Without knowing how common a condition is, those numbers cannot be interpreted correctly.

REAL EXAMPLE

Our Heart Failure (Canine) classifier has 89.5% sensitivity and 92.1% specificity.

That means it catches about 9 out of 10 true heart failure cases, and when it says the heart looks normal, it is right about 92% of the time.

The Two Metrics That Answer Your Real Question

Sensitivity and specificity describe how the AI performs in controlled testing. But when you are looking at a patient’s results, the question you are actually asking is different: “The AI flagged this finding. Should I believe it?”

That is where PPV and NPV come in.

While Sensitivity and Specificity evaluate what percentage of the time we expect a case to be classified correctly, Positive Predictive Value (PPV) and Negative Predictive Value (NPV) evaluate what percentage of the time a given prediction is correct.

The biggest difference is that PPV and NPV metrics consider how prevalent a disease is, while Sensitivity and Specificity do not.

Sensitivity and Specificity are more useful for evaluating model performance, whereas PPV and NPV are more useful for interpreting model predictions.

Positive Predictive Value (PPV)

When the AI flags a finding, how often is it actually there?

PPV depends heavily on how common the condition is. A rare condition (low prevalence) will naturally have a lower PPV even with strong sensitivity and specificity, because most of the population does not have it.

We calculate PPV using real-world prevalence from our clinical case database so the number reflects what you would see in practice.

Negative Predictive Value (NPV)

When the AI says a finding is not present, how often is it right?
For most conditions, NPV is very high because most patients do not have any given condition.

An NPV of 99.9% means you can be extremely confident in a negative result. This is where AI screening is often strongest: helping you confidently rule things out.

REAL EXAMPLE

Our Heart Failure (Canine) classifier has 89.5% Sensitivity and 92.1% Specificity, with a PPV of 11.9% and an NPV of 99.9%. That looks lopsided, and it is supposed to.

Heart failure has a prevalence of about 1.2% in our clinical database. So when the AI flags it, there is roughly a 1 in 8 chance the condition is truly present, which is still a significant increase from the baseline 1 in 83 rate. A PPV notably higher than the underlying prevalence indicates the model is providing real predictive power beyond random guessing. When it says “no heart failure,” you can be very confident.
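The example above can be reproduced with the standard Bayes' rule identities that link sensitivity, specificity, and prevalence to PPV and NPV. A minimal sketch, using the rounded 1.2% prevalence (the small gap from the published 11.9% comes from that rounding):

```python
# Deriving PPV and NPV from sensitivity, specificity, and prevalence,
# using the Heart Failure (Canine) numbers quoted above.
sens, spec, prev = 0.895, 0.921, 0.012

ppv = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
npv = (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)

print(f"PPV: {ppv:.1%}")   # ~12%, close to the published 11.9%
print(f"NPV: {npv:.1%}")   # ~99.9%
```

This is also an easy way to see why PPV swings with prevalence: re-running the same formula with a 15% prevalence, instead of 1.2%, pushes PPV above 60% without changing the model at all.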

How much PPV is clinically useful is ultimately a question for clinicians, and should be an ongoing discussion point as we continue to retrain and improve our models.

The clinical takeaway: a positive flag for a rare condition is a signal to look more closely, not a diagnosis. A negative result is a strong reassurance.

The Metrics That Give You Context

95% Confidence Interval

How precise is the measurement?

A confidence interval of “85% – 93%” means the true sensitivity most likely falls within that range. Narrower intervals mean more cases were tested and the measurement is more precise.

Wider intervals (common for rarer conditions) mean fewer test cases were available.

We publish confidence intervals for both sensitivity and specificity so you can judge how much certainty is behind each number.
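The exact interval method Vetology uses is not specified here, but the Wilson score interval is one common choice for proportions like sensitivity. A sketch under that assumption, with a hypothetical 200-case positive set:

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (one common method;
    the dashboard's exact interval method is not specified here)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical: 179 of 200 true positives caught (89.5% sensitivity)
lo, hi = wilson_ci(179, 200)
print(f"Sensitivity 89.5%, 95% CI: {lo:.1%} - {hi:.1%}")   # ~84.5% - 93.0%
```

Doubling the number of test cases with the same observed rate narrows the interval, which is exactly the "more cases tested, more precise measurement" behavior described above.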

Radiologist Agreement Rate

How often do specialists agree with each other on this finding?

This might be the most important context metric on the dashboard. Some findings are straightforward and board-certified radiologists almost always agree; others are more subjective.

If specialists disagree 10-30% of the time on a given finding, an AI performing in that range is working within the natural variability of expert interpretation.

This number gives you a benchmark for what “good” means for each specific condition.

REAL EXAMPLE

Our Cardiomegaly (Canine) classifier has a Radiologist Agreement Rate of 93%. That means even board-certified radiologists disagree about 7% of the time on this finding.

The AI’s sensitivity of 75.6% and specificity of 86.3% should be understood in that context.

The Metrics for the Data-Curious

The remaining metrics are primarily used by data scientists and statisticians to evaluate classifier quality. They are published for completeness and for those who want the full picture.

AUC (Area Under the Curve)

How well does the classifier distinguish positive from negative overall?

A single number summarizing overall quality. 1.0 is perfect; 0.5 is no better than a coin flip. Values above 0.85 indicate strong performance.

Our Heart Failure classifier has an AUC of 0.95.

F1 Score

How well does the classifier balance catching findings with avoiding false alarms?

The harmonic mean of precision and recall. Useful for comparing classifiers where both false positives and false negatives matter.
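As an illustration only (the dashboard's F1 may be computed on the test set rather than at real-world prevalence), plugging in the Heart Failure classifier's published PPV as precision and sensitivity as recall shows how the harmonic mean penalizes imbalance:

```python
# F1 as the harmonic mean of precision (PPV) and recall (sensitivity),
# using the published Heart Failure values as an illustration.
precision, recall = 0.119, 0.895

f1 = 2 * precision * recall / (precision + recall)
print(f"F1: {f1:.3f}")   # ~0.210
```

A simple average of 0.119 and 0.895 would be about 0.51; the harmonic mean lands near 0.21 because it is dragged toward the weaker of the two numbers.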

Accuracy

What percentage of all cases did the AI get right?

This sounds like the most important number, but it can be misleading for rare conditions. If a condition has 1% prevalence, a system that always says “not present” would be 99% accurate while catching nothing. That is why we publish sensitivity and specificity alongside accuracy.
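The accuracy trap is easy to verify with arithmetic. A sketch, using a hypothetical 10,000-case population at 1% prevalence:

```python
# The accuracy trap: at 1% prevalence, a "model" that always says
# "not present" scores 99% accuracy while catching nothing.
n, prevalence = 10_000, 0.01
positives = int(n * prevalence)   # 100 true cases

correct = n - positives           # every negative right, every positive missed
accuracy = correct / n            # 0.99
sensitivity = 0 / positives       # 0.0 -- no true case is ever caught

print(f"Accuracy: {accuracy:.0%}, Sensitivity: {sensitivity:.0%}")
```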

How to Use This in Your Practice

You do not need to memorize these metrics. But the next time you review an AI screening report, three quick checks can change how you use the results:

  1. Check the NPV for negative results. For most conditions, the NPV is 99%+. When the AI says “not present,” you can move on with confidence.
  2. Check the prevalence for positive flags. A positive flag on a rare condition (prevalence under 2%) is a signal to investigate further, not a confirmation. A positive flag on a common condition (prevalence above 10%) carries more weight.
  3. Check the Radiologist Agreement Rate for borderline calls. If specialists disagree 20% of the time on a finding, an AI result in the gray zone is reflecting genuine clinical ambiguity, not a system failure.

The full metrics for all 89+ classifiers are published at vetology.net/ai-classifier-performance. We publish them because informed trust is better than blind trust, and veterinary professionals deserve the data to make their own judgment calls.

View the complete AI performance dashboard

Sensitivity, specificity, PPV, NPV, confidence intervals, and Radiologist Agreement Rate for every classifier we validate.

Vetology’s Veterinary AI Dashboard Now Tracks 11 Public Metrics

FOR IMMEDIATE RELEASE

Vetology Expands Public AI Validation Dashboard to 11 Metrics Per Condition Classifier, Commits to Ongoing Model Retraining

Vetology publishes full statistical profiles for 89+ classifiers, reinforcing that building AI is only half the job.

March 31, 2026 – SAN DIEGO – Vetology, a provider of AI-generated radiology screening reports and board-certified veterinary teleradiology, today expanded its publicly available AI performance dashboard from four metrics per classifier to eleven. The update covers 89+ validated classifiers across canine and feline thoracic, abdominal, and musculoskeletal imaging.

The expanded dashboard now reports sensitivity, specificity, positive predictive value, negative predictive value, AUC, F1 score, accuracy, prevalence, confidence intervals, and Radiologist Agreement Rate for each condition. The data is available on our website. 

The update also reflects Vetology’s commitment to maintaining its existing classifiers alongside building new ones. Of the 89+ classifiers currently published, 31 are retrained versions of previously released models, revalidated against updated board-certified radiologist consensus data. The oldest classifiers in the current dashboard date to August 2024; all have been revalidated with confusion matrices generated as recently as February 2026.

“AI is changing fast, and we are working just as hard to keep pace. We put the same rigor into maintaining our older models as we do into building new ones. Publishing the data for all of them, new and retrained, is how we honor our commitment to our veterinary partners and patients.”

Eric Goldman, President, Vetology

New classifiers added in this update include Obscuring Pleural Effusion, Esophageal Enlargement, Intervertebral Disc Disease (Thoracic), Small Intestine Enlargement (Feline), Colon Diffuse Distension (Feline), and a consolidated Heart Failure classifier for canine imaging. The Heart Failure classifier reports 89.5% sensitivity and 92.1% specificity; the Obscuring Pleural Effusion classifier, designed to flag cases where fluid volume may limit diagnostic interpretation, reports 87.2% sensitivity and 96.7% specificity.

“We’re improving our classifiers every month, and every update is revalidated against fresh consensus reads from board-certified radiologists – not the same training set warmed over. That’s why we publish eleven metrics per classifier instead of the one or two you’ll see from other vendors. Sensitivity by itself doesn’t tell a clinician whether to trust a result. PPV, confidence intervals, specificity – that’s what lets a veterinarian decide how much weight to put on what the model is telling them. We think that level of transparency should be the baseline for veterinary imaging AI. As far as we can tell, nobody else is publishing it.”

Cory Clemmons, CTO, Vetology

Vetology’s validation data is built on a foundation of 300,000 multi-image patient cases, with classifier performance validated against board-certified veterinary radiologist consensus. The company publishes these metrics as part of its commitment to transparency in an industry where, according to a 2026 Frontiers in Veterinary Science audit, 63.3% of commercial veterinary AI vendors do not disclose validation data publicly.

ABOUT VETOLOGY
Vetology provides AI-generated radiology screening reports and on-demand teleradiology consultations from board-certified veterinary radiologists, cardiologists and a dentist, giving veterinary practices both speed and specialist depth in a single platform. The Vetology AI screening system covers a growing list of conditions across canine and feline thoracic, abdominal, and musculoskeletal imaging. Screening results are designed to fit naturally into existing clinic workflows, so veterinary teams can move from image to informed decision without adding steps to their day. Vetology was founded on the belief that humans and AI are better together.

Learn more at vetology.net.

Media Contacts

Thanks for reading! If you’d like to learn more or have any questions, we’d love to hear from you.

Vetology Strengthens Leadership Team with New Director of Sales

FOR IMMEDIATE RELEASE

Veterinary commercial leader Pierre D'Amours joins growing team as Vetology expands board-certified radiologist services and AI diagnostic platform

March 16, 2026 – SAN DIEGO, CA – Vetology, a provider of AI-assisted radiology and board-certified teleradiology services for veterinary practices, today announced the addition of Pierre D’Amours as Director of Sales. The newly created role reflects the company’s growth trajectory.

President Eric Goldman, who has led Vetology’s commercial efforts since founding the company, and D’Amours will partner closely to build a sales organization that brings Vetology’s services to more veterinary practices across North America and internationally.

Vetology’s platform now includes 94+ feline and canine AI classifiers that screen radiographs for conditions across thorax, abdomen, spine, and musculoskeletal studies, with new classifiers releasing monthly and all performance metrics published publicly. The company also provides on-demand access to board-certified veterinary radiologists for specialist-level interpretation. As the platform and radiologist team grow, Vetology is investing in the commercial infrastructure to match.

We started Vetology to close the gap between the number of practices that need diagnostic imaging expertise and the number of board-certified radiologists available to provide it.  AI was the solution — a way to give every practice access to consistent, validated screening regardless of where they are and when they need it. We paired that with our own team of board-certified radiologists so practices have both. I’ve been having this conversation with practices since day one. Pierre has the industry relationships and credibility to help us bring Vetology’s service and solutions to more practices, and I’m excited to work alongside him.

Eric Goldman, President, Vetology

An Industry Insider

D’Amours brings seven years of veterinary commercial experience as Vice President of North America Sales & Operations at Movora (Vimian Group AB), where he ran a $70M+ veterinary medical devices and SaaS business. He is fluent in English and French, holds a Bachelor of Commerce from Concordia University, and has deep relationships across veterinary practices, distributors, and corporate groups throughout North America.

Pierre understands the challenges inherent in running a veterinary practice and how the right technology can solve real problems in day-to-day operations. At Vetology he will work with veterinary doctors and management teams to make sure that we are delivering on our promises both during and after the sale.

When I evaluated Vetology, what stood out was a company that had done the hard work first — building the AI, hiring board-certified radiologists, validating the classifiers, and publishing all of it for the industry to review. That kind of transparency is rare in this space. I’ve spent years working with veterinary practices, and the right technology should solve real operational problems, not add complexity.

My focus is to partner closely with DVMs to make sure we deliver on that promise — during the sales process and well after implementation — and to build a sales team grounded in trust, honest about where our solutions fit, and focused on long-term partnerships over transactions.

Pierre D’Amours, Director of Sales, Vetology 

# # #

ABOUT VETOLOGY

Vetology is a veterinary diagnostic imaging support company that provides AI-generated screening reports and traditional teleradiology services by board-certified veterinary radiologists. Built by radiologists, Vetology focuses on improving patient outcomes through accuracy, speed, and reliability in diagnostic imaging. Our platform is designed to integrate seamlessly into existing hospital workflows, helping clinicians make informed decisions quickly.

Learn more at vetology.net.

Media Contacts

Thanks for reading! If you’d like to learn more or have any questions, we’d love to hear from you.

Radiology Support Built for the Way Veterinarians Work

AI Screenings Add Fast Imaging Analysis to the Veterinary Generalist's Toolkit

Veterinarians have a remarkable and diverse skillset. On any given day, a GP might perform surgery, vaccinate a puppy, help a pet parent manage a complex diabetes case, and perform a dental procedure. No other medical profession asks its practitioners to work across this many disciplines at this level of competence, every single day.

Radiology is one of those disciplines. Board-certified veterinary radiologists spend years in fellowship training after veterinary school, developing expertise in a field that spans thousands of conditions across multiple species and body systems. In general practice, that same breadth of imaging interpretation falls to the veterinarian.

Vetology’s AI screening report was designed with this reality in mind. Not to replace the veterinarian’s judgment, but to add a layer of specialist-level screening that supports the work practicing veterinarians are already doing.

How It Works in Practice

When a practice submits radiographs through Vetology, the AI automatically analyzes every image across our growing list of classifiers covering canine and feline thorax, abdomen, and spine/musculoskeletal conditions. Results arrive in minutes. There is no extra submission, no case selection, and no waiting for a specialist’s availability.

The system has been validated on a foundation of 300,000 board-certified veterinary radiologist-reviewed multi-image cases. These are real patient studies from real veterinary practices, each reviewed by diplomates of the American College of Veterinary Radiology (DACVR) or European College of Veterinary Diagnostic Imaging (ECVDI).

The AI provides a structured analysis that highlights findings across 94+ conditions. Some of these are conditions the veterinarian is already evaluating. Others are incidental findings that benefit from being flagged: subtle lymphadenopathy alongside a cardiac workup, early organ size changes on a GI study, mineralization that warrants monitoring. The veterinarian reviews everything in context and makes every clinical decision.

A Resource for the Whole Practice Team

AI screening benefits more than the doctor reading the images.

For veterinary technicians, AI reports create a learning opportunity built into the daily workflow. Techs who position patients and capture radiographs can see what the AI identified on the images they produced. This builds familiarity with imaging findings over time and adds professional development value to work the team is already doing.

For practice managers and operations leads, the impact shows up in workflow. When more findings are identified during the initial visit, more treatment conversations happen while the client is still in the room. This means smoother scheduling, more complete appointments, and fewer situations where the team needs to coordinate follow-up calls and return visits after the fact.

For front desk staff, the benefit is practical: when cases are more fully resolved on the first visit, there are fewer follow-up calls to coordinate and fewer schedule adjustments to manage. The front desk may not read radiographs, but they feel the difference when the day runs more smoothly.

Confidence and Collaboration

Vetology AI is a screening tool, not a diagnostic replacement. It does not tell the veterinarian what to do. It provides additional information that the DVM incorporates into their clinical picture alongside history, physical exam, and their own radiographic assessment.

Veterinarians who use AI screening consistently describe it as a confidence builder. When the AI confirms their interpretation, it reinforces their treatment plan. When the AI highlights something they had not focused on, it gives them a reason to take a second look. Either way, they have more information available when making their clinical decisions.

That added confidence has a practical impact. Veterinarians who feel well-supported in their imaging interpretation tend to use diagnostic imaging more effectively in discussions with clients, and keep more of their caseload in-house.

Designed for General Practice Economics

Vetology’s unlimited monthly subscription is built for the way general practice operates. There are no per-case fees, no contracts, and we include a PACS for free if you need one. The system integrates with widely used practice management systems and AI scribes, including ezyVet, DaySmart Vet, CoVet, Scribblevet, and VetRocket, with more on the way, and includes free DICOM storage.

For practices that also need board-certified radiologist interpretations, Vetology offers teleradiology with 2-hour STAT and 24-hour routine turnaround from DACVR and ECVDI diplomates, as well as board-certified cardiologists and a board-certified dentist. AI screening and specialist reads work together under one platform.

Specialist-Level Support, Built for Generalists

The breadth of what general practice veterinarians manage every day is extraordinary. Vetology’s role is to make one part of that work a little easier by adding consistent, validated radiology screening to every imaging study the practice performs.

It is the kind of support that lets the whole team do what they do best, with more information and more confidence behind every decision.

Want to see AI in action?

To tour the platform and learn more, contact our team, or book a demo for a firsthand look at our AI and teleradiology platform.

Interpreting Classifier Results: A First Look at Data Science Metrics

What sensitivity, specificity, radiologist agreement rate, and test cases actually tell you about AI diagnostic performance

Written by – Benjamin Cote, Data Scientist | Vetology

As part of Vetology’s push to be transparent about our AI products, we recently published all of our condition classifiers on our website (find them here if you haven’t taken a look yet: AI Classifier Performance). Since we want you to be able to see how each of our models performs and draw your own conclusions, this article is designed to provide you with some extra knowledge and context to interpret our metrics.

On the AI Condition Classifier Performance Metrics page, we include several key metrics on each of our conditions. For the purposes of this article, we will focus on: Sensitivity, Specificity, Radiologist Agreement Rate, and Number of Test Cases. Each measure is a piece of the classifier puzzle, and by understanding the ways they interact, you can see the bigger picture come together. We’ll cover additional metrics in future articles. 

Sensitivity and Specificity

[Figure: sensitivity and specificity columns highlighted on the performance table]

Front and center on each published classifier, you can see the Sensitivity and Specificity scores achieved by that model. These are both common metrics used in data science to measure model performance, and at Vetology they are the primary way we determine if a model is strong enough to be released.

They can be thought of as a pair, each capturing the same measurement on a different class of data. Picture a see-saw: a model that predicts every case as positive would have 100% Sensitivity (but 0% Specificity), and a model that predicts every case as negative would have 100% Specificity (but 0% Sensitivity); neither would be useful. We want both metrics as high as possible, so the challenge is improving each without harming the other.
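The see-saw can be sketched with two degenerate models on a hypothetical 200-case test set (20 positives; the labels are illustrative):

```python
# Two degenerate models that predict one class for every case.
# Labels: 1 = condition present, 0 = condition absent (illustrative).
labels = [1] * 20 + [0] * 180

def sens_spec(preds, labels):
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    sens = tp / sum(labels)                  # caught / all positives
    spec = tn / (len(labels) - sum(labels))  # cleared / all negatives
    return sens, spec

print(sens_spec([1] * 200, labels))   # (1.0, 0.0): always-positive model
print(sens_spec([0] * 200, labels))   # (0.0, 1.0): always-negative model
```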

What is Sensitivity?

Sensitivity (True Positive Rate)

Sensitivity is a measure of how often our model correctly recognizes that a disease is present in the patient. It answers the question: “When a case is actually Positive, how often does the model predict Positive?”

When Sensitivity is high, the model correctly recognizes what a given disease looks like. It’s as if the model is telling us: “I know what heart failure looks like, and that is heart failure.”

One way to improve Sensitivity is to train the model on more examples that are positive for the condition, so it learns the variation a disease shows across different breeds and sizes.

What is Specificity?

Specificity (True Negative Rate)

Specificity is a measure of how often our model correctly determines that a disease is absent. It answers the question: “When a case is actually Negative, how often does the model predict Negative?”

When Specificity is high, the model can correctly distinguish between a given disease and all other diseases, as if the model is telling us: “I don’t know what that is, but I know that is not an example of heart failure.”

One way to improve Specificity is by training the model on more images that are negative for the condition so it understands what kinds of information are unrelated to this disease. For instance, if we are trying to identify pulmonary nodules, the size and shape of the heart are unlikely to help us make our diagnosis. Instead, we want enough data that our classifier can isolate findings related to pulmonary nodules and ignore irrelevant visual information. That way when key findings aren’t present, the classifier will confidently predict that a disease isn’t present.

What Can You Learn from These Metrics?

Viewed together, these metrics estimate how well a model performs when predicting on Positives (Sensitivity) and Negatives (Specificity). However, when forced to choose between prioritizing model Sensitivity or Specificity, we tend to prioritize Specificity. This is because our models are trained on many more negative images than positives.

We train on mismatched proportions because even the most common diseases only occur in a small percentage of cases; this imbalance ensures that we don’t over-predict the presence of diseases. A consequence of this is that Sensitivity and Specificity percentages are calculated on differently-sized classes, and a 1% increase in Specificity usually means a greater increase in total model accuracy than a 1% increase in Sensitivity.

The math behind these metrics is not especially complicated, but there are some nuances that require more context. If you want to learn more about how we calculate Sensitivity and Specificity, look at the In-Depth Calculation section at the end of this article.

Radiologist Agreement Rate

Radiologist Agreement Rate

The percentage of cases where two US Board Certified Veterinary Radiologists produce the same label (Positive or Negative) on an image. This serves as a real-world benchmark for evaluating AI performance.

[Figure: Radiologist Agreement Rate column highlighted on the table]

We calculate the Radiologist Agreement Rate by comparing the labels that expert radiologists provide on a blind set of shared images. Out of all the Positive and Negative images they both review, we count the cases where the radiologists made the same call. Whether that shared call is Positive or Negative, any case where both radiologists reach the same decision counts as an agreement.
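As a sketch of that calculation, with illustrative labels standing in for each radiologist's per-case call (1 = Positive, 0 = Negative):

```python
# Illustrative blind-read labels for two radiologists over ten shared
# cases (not real reads).
rad_a = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
rad_b = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

# Any case where both calls match counts as an agreement,
# whether the shared call was Positive or Negative.
agreements = sum(a == b for a, b in zip(rad_a, rad_b))
rate = agreements / len(rad_a)
print(f"Agreement rate: {rate:.0%}")   # 90%
```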

Interpreting radiographs is as much an art as a science. Some conditions are easily diagnosed from radiographic findings; others are not. Some are plainly visible on a radiograph; others are subtle. Some look similar to other conditions; others are unmistakable. It is understandable, then, that two expert radiologists may disagree when diagnosing the same patient. It also stands to reason that if a condition is hard for an expert radiologist to interpret from radiographs, our classifier may also have trouble identifying it consistently.

What Can You Learn from Radiologist Agreement Rate?

Low Agreement Rate

If the radiologist agreement rate is low, this means a condition is hard for radiologists to reliably diagnose. This is a place where our models often shine.

  • With extremely rare conditions, a clinician or radiologist may encounter it only a handful of times over the course of their career.
    • In contrast, our models are trained on hundreds or thousands of examples, so our sensitivity and specificity metrics can often surpass radiologist agreement rates.
    • Through the aggregation of clinical examples globally, these models can help you feel confident in recognizing rare findings.
  • Other times, agreement rate is low because a disease is hard for radiologists to determine visually.
    • Our models may struggle with these conditions too. Sometimes they can pick up on patterns too minuscule for the human eye to see, but other times it’s just as hard for the neural network to come to a conclusion.
    • When this is the case, our models may have low sensitivity and specificity scores that mirror low radiologist agreement rate.

High Agreement Rate

If the radiologist agreement rate is high, it means that this condition is easier for radiologists to reliably diagnose.

  • This could be because the disease presents consistently on radiographs, because it is easy to identify, or because a particular finding unambiguously indicates that disease.
    • When the agreement rate is high, model performance also tends to be high because the neural network is picking up on the same visual patterns as the radiologists.
  • However, you’ll notice that some models’ performance metrics don’t match their high radiologist agreement rate.
    • This is something we take seriously – we want every model to perform just as well as, if not better than, the agreement rate so you can be confident in our predictions.

TRANSPARENCY NOTE

When you see conditions published with scores that are below the agreement rate, you can be confident that we are working to retrain a higher-performing model. Sometimes we will release a model below agreement rate because clinics have specifically requested it, and we feel confident that it has strong performance even if it is not as high as we would like. Other times, we are limited by low Positive case counts and have trained the highest-performing model we can at the time of publication. The decision usually comes down to whether it’s a high-priority condition or not.

Total Cases Evaluated

Total Test Cases

The number of unique patient cases used to evaluate a classifier and generate Sensitivity and Specificity metrics. This includes both Positive cases (disease present) and Negative cases (disease absent, which may include other conditions).

[Figure: the Total Test Cases metric highlighted on the performance chart]


For example, the Canine Thorax condition Heart Failure Left has 10,951 total test cases, which means our performance metrics come from generating model predictions on 10,951 unique sets of radiographs, all from different dogs.

This number includes both the Positive cases, where the disease is present, and the Negative cases. However, just because a case is labeled Negative doesn’t mean the animal is healthy – in fact, we make sure our set of Negative examples includes cases with a variety of other findings or diseases within the body region, of which a “healthy” study is just one.

What Can You Learn from Test Case Counts?

As the number of test cases grows, so does the variation in examples our model is tested against. Each case introduces a unique combination of animal size, age, scan quality, and number of diseases present or absent. When a condition is tested on a large quantity of data and shows high Sensitivity and Specificity, you can be confident that the model is robust across patient sizes, ages, and image qualities – it can handle any curveball case you throw at it.

An In-Depth Look: Sensitivity and Specificity Calculation

In data science, we often categorize our data by multiple labels at the same time. This can easily lead to confusion, which is why we describe outcomes using terminology like True Positive, True Negative, False Positive, and False Negative.

The table below shows the difference between each label. In short:

  • A case is True if the predicted label matches the actual label, and False if the predicted label does not match the actual label.
    • For example, if a model predicts that cardiomegaly is present in an image but a radiologist has determined that cardiomegaly is not present, we would call that classification a False Positive because the classifier falsely predicted cardiomegaly to be positive.
                                      Condition Is Present   Condition Is Absent
Model Predicts Condition as Present   True Positive (TP)     False Positive (FP)
Model Predicts Condition as Absent    False Negative (FN)    True Negative (TN)

Sensitivity

True Positives ÷ (True Positives + False Negatives)

Total correctly identified positives out of all cases where the condition is actually present

Specificity

True Negatives ÷ (True Negatives + False Positives)

Total correctly identified negatives out of all cases where the condition is actually absent
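For readers who like to see the arithmetic, the two formulas above can be sketched in a few lines of plain Python. The function names here are illustrative only, not part of any Vetology API; the labels are booleans where True means the condition is present.

```python
def confusion_counts(actual, predicted):
    """Tally TP, FP, FN, TN from paired actual/predicted labels (True = condition present)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    # Correctly identified positives out of all cases where the condition is present
    return tp / (tp + fn)

def specificity(tn, fp):
    # Correctly identified negatives out of all cases where the condition is absent
    return tn / (tn + fp)
```

Plugging in 425 true positives, 75 false negatives, 4,250 true negatives, and 750 false positives (the counts used in the next section) gives 85% for both metrics.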

Why Specificity Improvements Have a Bigger Impact

Earlier in this article, I explained that we prioritize model Specificity over Sensitivity when we can no longer improve both metrics at once. Let’s explore why by walking through a short example.

Imagine we have a dataset with 500 Positives, 5,000 Negatives, and the model has 85% Sensitivity and 85% Specificity:

Baseline: 85% Sensitivity, 85% Specificity
                       Positive Cases    Negative Cases    Total Cases
Total                  500               5,000             5,500
Predicted Correctly    425               4,250             4,675
Predicted Incorrectly  75                750               825
Metric Score           85% Sensitivity   85% Specificity   85% Accuracy

Based on the number of Positive cases, the model correctly predicted the disease on 425 cases and misclassified only 75 cases – pretty good! But 85% Specificity on 5,000 Negative cases means that 4,250 cases were correctly predicted as negative, and 750 cases were misclassified. While the scores are the same, they represent very different numbers of misclassified cases.

Let’s look at what happens to model accuracy if we improve either Sensitivity or Specificity by 10 percentage points without changing the other metric’s score:

Scenario A: Improve Sensitivity by 10 Percentage Points
                       Positive Cases             Negative Cases    Total Cases
Total                  500                        5,000             5,500
Predicted Correctly    475 (+50)                  4,250             4,725 (+50)
Predicted Incorrectly  25 (-50)                   750               775 (-50)
Metric Score           95% Sensitivity (+10 pts)  85% Specificity   85.9% Accuracy (+0.9 pts)
Scenario B: Improve Specificity by 10 Percentage Points
                       Positive Cases    Negative Cases             Total Cases
Total                  500               5,000                      5,500
Predicted Correctly    425               4,750 (+500)               5,175 (+500)
Predicted Incorrectly  75                250 (-500)                 325 (-500)
Metric Score           85% Sensitivity   95% Specificity (+10 pts)  94.1% Accuracy (+9.1 pts)

KEY TAKEAWAY

A 10-percentage-point increase in Sensitivity improves overall accuracy by only 0.9 points, while the same increase in Specificity improves it by 9.1 points. When there are far more Negative cases than Positive cases, equal percentage-point gains in the two metrics do not translate into equal gains in overall model accuracy.
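The arithmetic behind this takeaway is easy to verify. The short Python sketch below (illustrative only, using the case counts from the tables above) recomputes overall accuracy for the baseline and both scenarios:

```python
POSITIVES, NEGATIVES = 500, 5_000

def accuracy(sens, spec, positives=POSITIVES, negatives=NEGATIVES):
    """Overall accuracy = correctly predicted cases / total cases."""
    correct = sens * positives + spec * negatives
    return correct / (positives + negatives)

baseline   = accuracy(0.85, 0.85)  # ≈ 85.0%
scenario_a = accuracy(0.95, 0.85)  # ≈ 85.9% (Sensitivity +10 pts)
scenario_b = accuracy(0.85, 0.95)  # ≈ 94.1% (Specificity +10 pts)
```

Because Negatives outnumber Positives ten to one, the Specificity term dominates the numerator, which is exactly why Scenario B moves overall accuracy so much further.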

Conclusion

Assessing the performance of a disease classifier can be tricky. Sometimes it’s unclear what a metric represents, or how to compare it across models. It can also be difficult to judge whether a classifier is performing well, because you have to consider not only its Sensitivity and Specificity scores but also the Radiologist Agreement Rate.

If Sensitivity and Specificity are both around 70% and the Radiologist Agreement Rate is 63%, then it’s a strong model that can pick up on details that even expert radiologists may not see. However, if a model with those same scores had a Radiologist Agreement Rate of 85%, then the model would be significantly underperforming. Everything is relative, and at Vetology we have to consider how all our metrics interact before we publish new condition classifiers.

Now that you have an idea of what these metrics mean, take a look at our classifier results. Transparency means you can be part of this process. Notice the great work we’ve done, but also notice the areas we need to work on. With our monthly bundle releases, we are constantly improving the performance of existing models and adding coverage through new disease classifiers. So please, check back in soon and see where we’ve made our latest improvements.

Want to see AI in action?

To tour the platform and learn more, contact our team, or book a demo for a firsthand look at our AI and teleradiology platform.
