Stats in Veterinary: Reference Intervals

“The world is a very abnormal place” – Salman Rushdie

In veterinary medicine, a long list of common tests, such as albumin, alkaline phosphatase, amylase, lipase and calcium, is run to check for a wide variety of health conditions. When interpreting the level of substances like these, the first and most important question tends to be: is this normal?

For example, the range of values considered normal for canine albumin is quoted as 2.7 – 4.4 g/dL according to one online source. However, as we’ll see below, using this range as quoted without further thought could lead to incorrect conclusions.

This blog post explores some of the ways to determine a reference interval, the nuances of what the resulting interval means, and concludes with an example computed using the statistical software R.

What’s normal?

First devised in 1969 in human medicine, the concept of reference intervals has developed and evolved considerably over the decades. Today, the established recommendations for producing reference intervals are maintained by the International Federation of Clinical Chemistry (IFCC), with the American Society for Veterinary Clinical Pathology (ASVCP) recommending that such guidelines are followed in veterinary medicine. However, veterinary-specific guidelines are also emerging, and can be found on the ASVCP website.

The standard definition of a normal reference interval is a range of values covering 95% of normal cases, where typically, normal means healthy. What exactly constitutes a healthy individual is itself a bit of a grey area (not even the World Health Organisation has a clear definition) and should be defined as well as possible for each study. For example, limits may be placed on age, breed, dietary details, pregnancy status, environment (farmed vs wild, for example), etc. It should then be kept in mind that the consequent range is only applicable to individuals meeting the same criteria. Such criteria should be shared to allow others to check such details for future cases.

Also note a consequence of the 95% span of healthy ranges, namely, that it’s possible for a future healthy individual to fall outside the reference interval (in fact, 5% of cases should be expected to do so).
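
To make the 5% figure concrete, here is a minimal Python sketch using simulated values rather than real analyte data. It assumes that healthy values follow a normal distribution with a hypothetical mean of 3.5 and standard deviation of 0.4, takes the central 95% of that distribution as the reference interval, and counts how many new healthy individuals land outside it.

```python
import random
from statistics import NormalDist


def fraction_outside(n_new=100_000, mean=3.5, sd=0.4, seed=1):
    """Simulate healthy individuals and return the fraction outside the central 95% range."""
    healthy = NormalDist(mean, sd)
    lower = healthy.inv_cdf(0.025)  # 2.5th percentile of the healthy population
    upper = healthy.inv_cdf(0.975)  # 97.5th percentile of the healthy population
    rng = random.Random(seed)
    values = [rng.gauss(mean, sd) for _ in range(n_new)]
    outside = sum(1 for v in values if v < lower or v > upper)
    return outside / n_new


print(f"Fraction of healthy individuals flagged: {fraction_outside():.3f}")
```

The printed fraction should sit close to 0.05, which is simply the definition of the interval showing up in the simulated data.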

As is often seen in nature, the values for these normal cases will sometimes follow a normal (Gaussian) distribution, but this is not always the case. Depending upon the shape of the distribution, different approaches may be required to ascertain a sensible normal range.

Standard recommendations are that reference ranges are created from at least 120 reference individuals, using ‘a priori nonparametric methods’ (what that means will be covered a little later). Individuals should be extensively documented, covering inclusion and exclusion criteria, chosen randomly (a challenge in itself), with the values of interest being measured using ‘quality-controlled analytical procedures’. Fewer than 120 cases may be acceptable using alternative methods (‘Horn’s robust method’), but the resulting ranges should be treated with caution.

Sticking to such guidelines is no small feat. In 37 different reports on the ranges for human creatinine, for example, a survey showed that only 6 met international criteria, prompting a revision and simplification of the recommended methods. This led to new guidelines being published in 2008, which also included new methods for adapting reference ranges developed in other labs (a method called ‘transference’, covered later). The new guidelines also cover methods for smaller sample sizes, but even here, at least 80 individuals are strongly recommended.

Key Methods

Three methods exist for determining reference intervals. These are:

  1. Create an interval from scratch
  2. Transfer an interval when, for example, an instrument is changed
  3. Transfer and validate an interval established elsewhere

The first method is the most common and also the most involved. Best practice is to define inclusion and exclusion criteria a priori, before selecting the reference individuals and creating the reference interval. However, a posteriori determination (i.e. using existing data) is also possible and acceptable. Note that if neither method is available, large scale databases can be trawled and cases used (the ‘indirect method’), but this should only be used when no other options are available, and results should be treated with caution.

Creating a Reference Interval from Scratch

International guidelines (IFCC-CLSI C28-A3) detail a number of steps to establish an a priori reference interval. These are summarised below.

Step 1 – Document Biological Factors or Variation

Knowing how, when and why the analyte in question varies will inform the consequent inclusion and exclusion criteria. For example, does the analyte vary with pregnancy status? If so, should such cases be excluded, or should the reference sample group be partitioned (i.e. two different intervals created for pregnant and non-pregnant cases, respectively)? How about stress levels? Could healthy, but highly stressed, individuals skew the data or produce outliers? The literature should be reviewed to answer such questions wherever possible.

Step 2 – Defining Inclusion, Exclusion and Partitioning Factors

This is a critical step in not only defining suitable reference individuals, but also allowing the interval to be sensibly used in the future. Clear exclusion factors will include aspects such as evidence of disease or current use of medications, but criteria shouldn’t be too strict, or else the remaining individuals will likely represent a niche subsample not representative of the wider, more commonly encountered population. In the veterinary setting, it is recommended that a questionnaire is given to the pet owner/carer to further establish suitability for inclusion into the reference sample group.

Step 3 – How many?

As stated above, a minimum of 120 individuals is recommended, and even more in cases of a highly skewed distribution. If this isn’t possible, the highest number possible should be aimed for. The term ‘multicentre reference interval’ is sometimes seen in the literature, whereby several different laboratories may contribute reference individuals to the sample group.

Step 4 – Gather the data

This step involves selecting the individuals according to the established criteria and obtaining the samples. At this stage, samples should be rejected if they have been incorrectly handled or show signs of haemolysis.

Step 5 – Process the samples

Samples are then processed and measurements made. The analytical methods used on the reference individuals should match the methods to be used in future cases, and details of the methods, such as limits of detection, linearity and agreement with more precise methods, should all be communicated (perhaps only on request). Normal, everyday variation, such as changes of staff or reagent batches, should also be taken into consideration.

Step 6 – Analyse

At this point, the data should be checked, visualised (as a histogram), outliers identified and the reference limits plus their confidence intervals determined. The central 95% is typically used, although some suggest 99% in order to reduce the number of false positives in routine health checking.

The nonparametric method is recommended, which is a method that does not assume that the data follows a normal (Gaussian) distribution. However, the parametric method can be used if the data appears normally distributed or can be transformed accordingly. If data is to be transformed (for example, by taking the log of the data), a goodness-of-fit test should then be used, for example, the Anderson-Darling normality test. For thoroughness, some suggest that both methods should be used, along with the robust method, to check for similarity between methods.
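
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure from the guidelines: the nonparametric limits are taken as rank-based 2.5th and 97.5th percentiles, the parametric limits as the mean plus or minus roughly 1.96 standard deviations, and the function names and simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, quantiles, stdev


def nonparametric_interval(values):
    """Central 95% interval from the rank-based 2.5th and 97.5th percentiles."""
    # n=40 gives cut points every 2.5%; "exclusive" uses the p*(n+1) rank rule.
    cuts = quantiles(values, n=40, method="exclusive")
    return cuts[0], cuts[-1]


def parametric_interval(values, log_transform=False):
    """Central 95% interval as mean +/- 1.96 SD, optionally on the log scale."""
    data = [math.log(v) for v in values] if log_transform else list(values)
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    centre, spread = mean(data), stdev(data)
    lower, upper = centre - z * spread, centre + z * spread
    if log_transform:
        # Back-transform the limits to the original measurement scale.
        lower, upper = math.exp(lower), math.exp(upper)
    return lower, upper


# Simulated, hypothetical analyte values (not real data).
rng = random.Random(42)
sample = [rng.gauss(3.5, 0.4) for _ in range(120)]
print("nonparametric:", nonparametric_interval(sample))
print("parametric:   ", parametric_interval(sample))
```

On normally distributed data the two intervals should be similar. A large disagreement is a hint that the normality assumption behind the parametric method deserves a closer look.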

Confidence intervals present a way to quantify the uncertainty in the reference interval, with 90% confidence intervals being recommended in the established standards. Such calculations on very small sample sizes will lead to very large confidence intervals, but a technique called bootstrapping (a method that repeatedly samples from the original sample data in order to estimate certain parameters) can be used in this case. It is recommended that if the width of a confidence interval exceeds 20% of the reference interval width, efforts should be made to collect additional samples.
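
As a rough illustration of bootstrapping, the Python sketch below resamples the data with replacement, recomputes the nonparametric reference limits each time, and reads the confidence interval for each limit off the spread of those recomputed limits (a percentile bootstrap). The function names, the 0.90 confidence level, the number of resamples and the simulated data are all assumptions made for this sketch.

```python
import random
from statistics import quantiles


def reference_limits(values):
    """Nonparametric 2.5th and 97.5th percentiles of the sample."""
    cuts = quantiles(values, n=40, method="exclusive")
    return cuts[0], cuts[-1]


def percentile(sorted_values, p):
    """Linearly interpolated percentile (p between 0 and 1) of a sorted list."""
    position = p * (len(sorted_values) - 1)
    below = int(position)
    above = min(below + 1, len(sorted_values) - 1)
    fraction = position - below
    return sorted_values[below] * (1 - fraction) + sorted_values[above] * fraction


def bootstrap_limit_cis(values, confidence=0.90, n_boot=1000, seed=0):
    """Percentile-bootstrap confidence intervals for both reference limits."""
    rng = random.Random(seed)
    lowers, uppers = [], []
    for _ in range(n_boot):
        resample = rng.choices(values, k=len(values))  # sample with replacement
        low, high = reference_limits(resample)
        lowers.append(low)
        uppers.append(high)
    lowers.sort()
    uppers.sort()
    tail = (1 - confidence) / 2
    lower_ci = (percentile(lowers, tail), percentile(lowers, 1 - tail))
    upper_ci = (percentile(uppers, tail), percentile(uppers, 1 - tail))
    return lower_ci, upper_ci


def ci_width_ratios(values, **kwargs):
    """Width of each limit's confidence interval relative to the interval width."""
    low, high = reference_limits(values)
    lower_ci, upper_ci = bootstrap_limit_cis(values, **kwargs)
    width = high - low
    return (lower_ci[1] - lower_ci[0]) / width, (upper_ci[1] - upper_ci[0]) / width


# Simulated, hypothetical analyte values (not real data).
rng = random.Random(7)
sample = [rng.gauss(3.5, 0.4) for _ in range(120)]
ratios = ci_width_ratios(sample)
print("CI width / interval width:", ratios)
print("collect more samples?", max(ratios) > 0.2)
```

The last line applies the 20% rule of thumb mentioned above: if either ratio exceeds 0.2, the limits are too uncertain and more reference individuals would help.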

Outliers are a tricky aspect to deal with at this point. Keeping them in may retain diseased cases that slipped through the net, but removing them may delete healthy cases that should be taken into account. Certainly, no points should be removed just in order to make a curve look better!

Visually, outliers can often be seen in histograms or automatically highlighted in box-and-whisker plots. Analytically, Tukey’s method and Dixon’s range statistic are often used, but there appears to be no consensus on the single best technique to use. Note that the distribution assumptions must be checked when selecting a method, as some methods assume that the data is normally distributed, whereas others do not. The most pragmatic approach seems to be to visualise the outliers, then to double-check their original data entry, sample quality and clinical details, and finally to justify why each outlier has been removed or left in.
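
Tukey’s method is the rule behind the whiskers of a box-and-whisker plot, and a minimal Python sketch of it is given below. The function name and the example values are hypothetical, and the multiplier of 1.5 is the conventional default rather than something specified in this post. The rule works best when the data are roughly symmetric, so skewed data may need transforming first.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges beyond the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical measurements with one suspiciously high value.
measurements = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 60]
print("flagged for review:", tukey_outliers(measurements))
```

Note that the function only flags candidates. Whether a flagged value is removed should still depend on the checks of data entry, sample quality and clinical details described above.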

Partitioning presents another tricky decision point, for example, deciding whether separate reference intervals should be determined for each sex or age group. Comparing two such distributions can be done using the Harris-Boyd z-test, but no consensus on when to partition exists. One question that may help in this situation is to ask whether or not such a partition is clinically useful.
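
For illustration, a Python sketch of the Harris-Boyd z-test follows. It assumes the commonly cited form of the test, in which the usual two-sample z statistic is compared against a critical value that grows with sample size, z* = 3 × sqrt((n1 + n2) / 240). That formula and the function name are assumptions brought in for this sketch, not details taken from this post.

```python
import math
from statistics import mean, variance


def harris_boyd(group_a, group_b):
    """Return (z, critical z, partition?) for two candidate subgroups.

    Assumes at least one group has non-zero variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    standard_error = math.sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    z = abs(mean(group_a) - mean(group_b)) / standard_error
    # Critical value scales with the average group size relative to 120.
    z_critical = 3 * math.sqrt((n_a + n_b) / 240)
    return z, z_critical, z > z_critical


# Tiny hypothetical groups, used only to make the arithmetic easy to follow.
z, z_critical, partition = harris_boyd([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(f"z = {z:.2f}, critical z = {z_critical:.2f}, partition? {partition}")
```

Real subgroups would be far larger than the five values used here, and a statistically significant difference still needs to pass the test of clinical usefulness mentioned above.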


Transferring a Reference Interval

This involves using a reference interval that was developed, say, by a different laboratory. In order to transfer such an interval, it must be validated, and all relevant aspects must be known. For example, it should be clear that the original interval was developed correctly (as per the steps above), that analytical methods are comparable between labs (checked by using linear regression analysis of the two methods), and that patient populations are also comparable. Validation is then done by checking the values for, say, 20 new cases against the transferred interval, and checking how many (if any) fall outside the range. It is recommended that this is done for any tests used by an external manufacturer, or for internal tests every 3 – 5 years.
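
The validation step can be sketched in Python as a simple count. The acceptance threshold used here (no more than 2 of the 20 values outside the interval) is an assumption based on common practice with CLSI-style validation and is not stated in this post. The function name, interval and values are likewise hypothetical.

```python
def validate_transferred_interval(new_values, lower, upper, max_outside=2):
    """Count new healthy cases outside a transferred interval and apply a threshold."""
    outside = [v for v in new_values if v < lower or v > upper]
    return {
        "n_tested": len(new_values),
        "outside": outside,
        "accepted": len(outside) <= max_outside,
    }


# Twenty hypothetical results from healthy animals, checked against 2.7 to 4.4.
new_cases = [3.1, 3.4, 2.9, 3.8, 4.0, 3.3, 3.6, 4.6, 3.0, 3.5,
             3.7, 3.2, 2.8, 3.9, 4.1, 3.4, 3.3, 3.6, 4.2, 3.1]
print(validate_transferred_interval(new_cases, lower=2.7, upper=4.4))
```

If too many values fall outside, the usual next steps would be to test a further set of cases or to establish a new interval from scratch.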

Creating a Reference Interval – An example

The statistical software package R possesses a package called ‘referenceIntervals’. This contains all the tools necessary to calculate the reference interval and confidence intervals in a given dataset, using one or more methods.

This package comes with some example data. Let’s take a look at one called ‘set120’, which is a dataset of 120 points taken from the normal distribution. Below is first the relevant R code, followed by the plot,

As you can see, the points appear to be approximately normally distributed (this would be clearer with more data points). The density of the data is also shown (orange line). This is plotted on a larger x-axis scale than necessary to highlight the fact that there are no outliers.

Given we have 120 normally distributed data points, the ASVCP guidelines recommend using the robust method with 90% confidence intervals (where the confidence intervals are calculated using a standard parametric method). Overlaying the plot above with red reference intervals and blue, dashed confidence intervals (and reducing the x-axis range) gives the following (again, preceded with the R code),

The following summary is provided by the package,

The number typically focussed upon here is the upper reference value of 43.4.

What about the effect of outliers? Let’s add some to the above data,



Now the plot looks as follows,

Now let’s plot in red the reference interval leaving these outliers in, and in purple without them (removing them with the ‘Horn’ method),

You can see that with the outliers left in, the upper reference limit is higher than if we automatically remove them (53 vs 43). Before, with no outliers, we had an upper limit of 43, so the outlier removal seems to have worked well.

For further details on this R package, see the official documentation here.

Other related areas

One point of confusion that sometimes arises is the difference between reference intervals and clinical decision limits. The two terms should not be used synonymously: the former is based upon healthy individuals, whereas the latter is based upon both healthy and diseased cases. These decision limits are based upon ROC analysis, and centre around the question not of whether a certain analyte is normal or not, but whether treatment is required or not.

Another topic sometimes explored in the literature is that of subject-based reference intervals. Simply put, this is where reference intervals are determined from multiple measurements taken from a single healthy individual. This reference interval can then be used to monitor the health status of the individual in question. Note that if a patient has a chronic disease, the reference interval can be established during the stable phase of the disease, in order to check for acute deviations in the future.


How to determine reference intervals is a well-established, albeit incomplete, area. In veterinary science, the ASVCP guidelines should be followed. In practice, reference intervals published in papers and textbooks are often used, even when the details of the underlying reference sample group and analytical procedures are not known. This approach should be used with caution, as it may lead to misdiagnosis and inappropriate treatment.

Written by Rob Harrand – Technology & Data Science Lead




References

  1. Normal Feline & Canine Blood Chemistry Values: Blood, Temperature, Urine and Other Values for Your Dog and Cat –
  2. Geffre, Anne, et al. “Reference values: a review.” Veterinary Clinical Pathology, no. 3 (2009): 288-298.
  3. Ozarda, Yesim. “Reference intervals: current status, recent developments and future considerations.” Biochemia Medica, no. 1 (2016): 5-16.
  4. Siest, Gerard, et al. “The theory of reference values: an unfinished symphony.” Clinical Chemistry and Laboratory Medicine, no. 1 (2013): 47-64.
  5. Friedrichs, Kristen R., et al. “ASVCP reference interval guidelines: determination of de novo reference intervals in veterinary species and other related topics.” Veterinary Clinical Pathology, no. 4 (2012): 441-453.
  6. ASVCP Quality Assurance and Laboratory Standards Committee (QALS) Guidelines for the Determination of Reference Intervals in Veterinary Species and other related topics: SCOPE
  7. R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL
  8. Daniel Finnegan (2014). referenceIntervals: Reference Intervals. R package version 1.1.1.