Evaluating the Goodness of Instrument Calibration for LC or GC Procedures

Discussion of Various Approaches to Calibration and Measures of Goodness

Average Response Factor (RF)

Average RF is commonly used in many environmental test methods, especially chromatographic methods. It is calculated as the average of each calibration factor (external standard methods) or response factor (internal standard methods), where the calibration factor is the instrument response divided by the amount or concentration of analyte introduced (2). A key property of the average RF is that each point in the calibration has equal relative weight. For example, a 10% error at the low end of the calibration has equal weight to a 10% error at the high end of the calibration. It can be demonstrated mathematically that the average RF is the same as a linear regression weighted by 1/(concentration)² with the line forced through zero (1).
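The equivalence between the average RF and a 1/(concentration)² weighted regression forced through zero can be checked numerically. The following is a minimal sketch for the external standard case; the function names and the calibration data are hypothetical and invented purely for illustration.

```python
import numpy as np

def average_rf(conc, resp):
    """Mean of the per-level factors (response / concentration)."""
    conc = np.asarray(conc, dtype=float)
    resp = np.asarray(resp, dtype=float)
    return float(np.mean(resp / conc))

def weighted_slope_through_zero(conc, resp, weights):
    """Slope a that minimizes sum(w * (y - a*x)**2), with no intercept term."""
    conc = np.asarray(conc, dtype=float)
    resp = np.asarray(resp, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * conc * resp) / np.sum(w * conc**2))

# Hypothetical calibration data (not taken from the article's tables).
conc = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])
resp = np.array([1050.0, 1960.0, 5200.0, 9800.0, 51500.0, 98000.0])

rf_mean = average_rf(conc, resp)
slope_w = weighted_slope_through_zero(conc, resp, 1.0 / conc**2)
print(rf_mean, slope_w)  # the two values agree to rounding error
```

With weights of 1/x², the weighted slope reduces algebraically to the sum of y/x divided by the number of points, which is the average RF.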

Evaluation of the Average RF Calibration: Relative Standard Deviation (RSD)

To calculate the RSD, the standard deviation of the calibration or response factors must first be determined. Then the standard deviation is divided by the mean of the calibration or response factors, and the result is expressed as a percentage to give the RSD. Typically, an RSD below 15% or 20% is used as the criterion for accepting the calibration.

As with the average RF, an important feature of the RSD is that a given relative deviation from the calibration line has the same impact at low concentration as at high concentration.
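The RSD calculation described in this section can be sketched as follows. This is a minimal example, assuming the sample standard deviation (n - 1 in the denominator); the function name, the factor values, and the 20% limit chosen here are illustrative assumptions rather than values from any specific method.

```python
import numpy as np

def rsd_percent(factors):
    """Relative standard deviation of calibration or response factors, in percent.

    Assumption of this sketch: the sample standard deviation (ddof=1) is used.
    """
    f = np.asarray(factors, dtype=float)
    return float(100.0 * np.std(f, ddof=1) / np.mean(f))

# Hypothetical response factors, one per calibration level.
factors = [1050.0, 980.0, 1040.0, 980.0, 1030.0, 980.0]

rsd = rsd_percent(factors)
passes = rsd < 20.0  # illustrative acceptance limit
print(round(rsd, 2), passes)
```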

Linear Regression

The second technique commonly used for creating calibration lines is linear or curvilinear least squares regression. The regression analysis creates an equation for the relationship between the instrument response and the concentration in the form:

y = ax + b [1]

y = ax + bx² + c [2]

where y is the response, x is the concentration, and a, b and c are the regression coefficients.

The line is created to minimize the sum of the squares of the residuals. The residual is the difference between the predicted response based on the calibration line and the actual measured response from the standard. Consider a calibration curve that extends from 1 ppb to 100 ppb. A residual of 10% at 1 ppb = 0.1 ppb, while a residual of 10% at 100 ppb = 10 ppb. Because the squares of the absolute residuals are minimized, the residual for the 100 ppb level will have 10²/0.1² = 10,000 times more impact than the residual at the 1 ppb level. If we want the relative (percentage) residuals at the top and the bottom of the curve to have the same impact on the curve, then the regression has to be weighted by 1/(concentration)².
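The arithmetic in this paragraph can be reproduced with a short sketch. The function name is hypothetical; the 1 ppb and 100 ppb levels and the 10% residual come from the example in the text.

```python
def squared_residual(conc, relative_error, weight=1.0):
    """Contribution of one point to the (weighted) sum of squared residuals."""
    absolute = conc * relative_error
    return weight * absolute**2

# Unweighted: a 10% residual at 1 ppb versus a 10% residual at 100 ppb.
low = squared_residual(1.0, 0.10)
high = squared_residual(100.0, 0.10)
ratio_unweighted = high / low

# Weighted by 1/(concentration)^2: the same two residuals.
low_w = squared_residual(1.0, 0.10, weight=1.0 / 1.0**2)
high_w = squared_residual(100.0, 0.10, weight=1.0 / 100.0**2)
ratio_weighted = high_w / low_w

print(ratio_unweighted, ratio_weighted)
```

Without weighting, the high point dominates by a factor of about 10,000; with 1/(concentration)² weighting, the two points contribute equally.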

Evaluation of Linear Regression

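The correlation coefficient is calculated as:

r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}

where x is the concentration, y is the response, and n is the number of calibration points.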

The important thing to notice is that large values have more effect than small values. For example, if we have a calibration that runs between 1 ppb and 100 ppb, and the area count for the 1 ppb standard is 1000 units while for the 100 ppb it is 100,000 units, the sum of the x and y values is affected greatly by the 100 ppb standard and hardly at all by the 1 ppb standard. This is a serious problem for evaluation of calibration curves in environmental analysis where we are typically just as interested in values at the low end of the curve as values at the high end of the curve.
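To illustrate how little the low standard influences r, the sketch below computes the standard Pearson correlation coefficient for a hypothetical calibration in which the 1 ppb response is twice what a constant response factor would predict. The function name and the data are assumptions made up for this illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient from sums of x, y, xy, x^2 and y^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x**2) - np.sum(x) ** 2)
                  * (n * np.sum(y**2) - np.sum(y) ** 2))
    return float(num / den)

# Hypothetical data: response factor of 1000 at every level except 1 ppb,
# where the response is 2000 instead of 1000 (a 100% error).
conc = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])
resp = np.array([2000.0, 2000.0, 5000.0, 10000.0, 50000.0, 100000.0])

r = pearson_r(conc, resp)
print(r)
```

Even with a doubled response at the lowest level, r remains well above a typical 0.995 criterion, because the sums are dominated by the 50 ppb and 100 ppb standards.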

Some environmental analysis methods specify the correlation coefficient (r) for evaluating calibration curves, and some specify the coefficient of determination (r²), which is the square of the correlation coefficient. Some specify r for certain types of curves and r² for others. It can be stated that r indicates the degree of correlation between the two variables (x representing concentration and y representing instrument response) while r² allows for determination of the percent of data closest to the line of best fit. These statements may be statistically true, but when r and r² are used as acceptance criteria for calibration curves, the distinction is completely irrelevant. As an acceptance criterion, r ≥ 0.995 is effectively identical to r² ≥ 0.990, because 0.995² ≈ 0.990.

Illustrative Curve

We will use the calibration listed in Table I as an example. There is nothing “special” about this calibration. The issues are not limited to any specific type of instrumentation and apply to data generated using a GC instrument, LC instrument, GC-MS instrument, inductively coupled plasma (ICP) instrument, ICP-MS instrument, and any other type of instrumentation for which multipoint regression generated calibration curves are used.

The units are not important, but if they were ug/L, this would be quite typical of the range commonly calibrated in GC–MS volatile or semi-volatile analysis. The range is quite limited, less than two orders of magnitude, and the effects we illustrate would be more severe for wider calibration ranges. First, we will fit an average RF calibration and a linear unweighted regression to these values. For each point, we can calculate how far the point lies from the calibration line. This is the residual, and it measures how much error there would be when the calibration is used to calculate the amount. For example, consider the 10 ppb standard: if the calibration returned a concentration (x value) of 11 for the measured response (y value) of 211363, then the residual would be +1 in absolute terms, or +10% in relative terms. These absolute and relative residuals for both curve types are shown in Table II, along with the correlation coefficient and RSD.
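The residual calculation in the 10 ppb example can be written out as a small sketch. The function name is hypothetical; the values 10 and 11 are the ones used in the text.

```python
def residuals(true_conc, calculated_conc):
    """Return the absolute residual and the relative residual in percent."""
    absolute = calculated_conc - true_conc
    relative_percent = 100.0 * absolute / true_conc
    return absolute, relative_percent

# The 10 ppb standard reads back as 11 from the calibration.
abs_res, rel_res = residuals(10.0, 11.0)
print(abs_res, rel_res)
```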

The results in Table II show that the average RF calibration keeps the relative residuals fairly consistent across the calibration while the absolute residuals are small at the low end and large at the top end. In contrast, the linear unweighted calibration keeps the absolute residuals fairly consistent while the relative residuals are small at the top end and large at the low end of the calibration.

In order to decide which of these two calibration types we prefer, we must ask ourselves whether it is more important to minimize relative error or absolute error. Assuming that we are equally interested in minimizing error at all levels of the calibration, minimizing absolute error would lead us to the linear unweighted calibration while minimizing relative error would lead us to the average RF calibration.

Consider the largest absolute error, -16.17 ug/L in the average RF calibration. This is substantial, but not too serious from the environmental impact perspective. The environmental impact of a pollutant at 104 ug/L is not that different from the environmental impact of the same pollutant at 120 ug/L.

In contrast, the largest relative error, -157% in the linear regression calibration, is very serious. This error would lead us to believe that none of the analyte was present when it certainly is.

We are generally concerned with relative error in environmental analysis—this is why error limits for quality control measures such as laboratory control samples and matrix spikes are expressed in relative (percentage) terms.

Correlation Coefficient vs. RSD

From our example curve, the RSD does a good job of evaluating the calibration—it lets us know that there is some error, but that the relative errors across the calibration are reasonably small and under control. The correlation coefficient, on the other hand, has a better value for the linear unweighted calibration—it would tell us that the curve with unacceptably large errors at the low end is the curve that we should use. Unfortunately, the RSD can only be calculated for the average RF type calibration—what we need is some way of extending the RSD to use with regression type calibrations. Relative standard error (RSE) provides exactly this function.

Relative Standard Error

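RSE is calculated as follows:

RSE = 100 × √{ Σ[(x’ – x)/x]² / (n – p) }

where x is the true concentration of each calibration level, x’ is the concentration calculated from the calibration curve for that level, n is the number of calibration points, and p is the number of terms in the curve equation (1 for average RF, 2 for linear, 3 for quadratic). The sum is taken over all calibration points.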

In short, the relative error (x’ – x)/x is calculated for each calibration point. These values are squared and summed, then divided by the degrees of freedom. The square root then gives the RSE value. Relative error is already familiar in environmental chemistry, where it is commonly used to evaluate the continuing calibration. One very useful property of the RSE is that it can be calculated for the average RF curve, and when it is, the numerical value is identical to the RSD.

RSE gives us the ability to extend the RSD measure to all types of calibration curves (except for curves that include a point at zero concentration). RSE can be calculated for calibrations that are forced through zero, and in fact, average RF is identical to a 1/(concentration)² weighted linear regression that is forced through zero.
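The RSE calculation, and its agreement with the RSD for an average RF calibration, can be sketched as follows. This is a minimal example, assuming that the degrees of freedom are the number of calibration points minus the number of terms in the curve, and that the RSD uses the sample standard deviation. The function name and the data are hypothetical.

```python
import numpy as np

def rse_percent(true_conc, calc_conc, n_params):
    """Relative standard error in percent.

    n_params is the number of fitted terms in the calibration model
    (assumed here: 1 for average RF, 2 for linear, 3 for quadratic).
    """
    x = np.asarray(true_conc, dtype=float)
    x_calc = np.asarray(calc_conc, dtype=float)
    rel = (x_calc - x) / x
    dof = x.size - n_params
    return float(100.0 * np.sqrt(np.sum(rel**2) / dof))

# Hypothetical calibration data.
conc = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])
resp = np.array([1050.0, 1960.0, 5200.0, 9800.0, 51500.0, 98000.0])

# Average RF calibration: concentration is read back as response / mean RF.
rf = resp / conc
mean_rf = rf.mean()
calc = resp / mean_rf

rse = rse_percent(conc, calc, n_params=1)
rsd = float(100.0 * np.std(rf, ddof=1) / mean_rf)
print(rse, rsd)  # identical apart from rounding error
```

A true concentration of zero would cause a division by zero in the relative error, which is why a calibration that includes a zero point cannot be evaluated this way.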

Other Types of Calibration Curves

Because we are interested in minimizing relative error, we need a way of using a linear regression that achieves that goal. Fortunately, there is a straightforward solution: we can weight the regression. Specifically, the regression can be weighted by the reciprocal of concentration or of concentration squared. Recall that the regression minimizes the sum of the squares of the residuals. Weighting by 1/(concentration)² will therefore apply the same weight to each point in the calibration in relative terms. In other words, with 1/(concentration)² weighting applied, a residual of 10% at the top end of the calibration will have the same weight as a residual of 10% at the bottom end. Use of 1/(concentration) weighting is intermediate and could be used if we felt that accuracy towards the upper end of the calibration was more important than at the low end. Table III displays the relative residuals for weighted and unweighted linear calibrations for our example calibration.
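A weighted linear calibration of this kind can be sketched with NumPy. The data below are hypothetical and are not the values from Table I; the function names are also assumptions. Note that `numpy.polyfit` applies its `w` argument to the unsquared residuals, so weighting the squared residuals by 1/x² corresponds to passing w = 1/x, and 1/x weighting corresponds to w = 1/√x.

```python
import numpy as np

def fit_linear(conc, resp, power):
    """Fit resp = a*conc + b, weighting squared residuals by 1/conc**power."""
    conc = np.asarray(conc, dtype=float)
    resp = np.asarray(resp, dtype=float)
    w = conc ** (-power / 2.0)  # polyfit squares these weights internally
    a, b = np.polyfit(conc, resp, 1, w=w)
    return a, b

def relative_residuals_percent(conc, resp, a, b):
    """Read each standard back from the line; return relative errors in percent."""
    calc = (resp - b) / a
    return 100.0 * (calc - conc) / conc

# Hypothetical calibration data, invented for illustration.
conc = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])
resp = np.array([1400.0, 2300.0, 5600.0, 10200.0, 49000.0, 101500.0])

results = {}
for label, power in [("unweighted", 0), ("1/x", 1), ("1/x^2", 2)]:
    a, b = fit_linear(conc, resp, power)
    results[label] = relative_residuals_percent(conc, resp, a, b)
    print(label, np.round(results[label], 1))
```

For this made-up data set, the unweighted fit leaves a much larger relative residual at the 1 ppb level than the 1/x² weighted fit does.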

The average RF, 1/x weighted, and 1/x² weighted calibrations all look reasonable to use. The unweighted calibration is unacceptable due to the large error at the low end of the calibration. However, if the method criterion was r ≥ 0.995, then the unweighted calibration is the only one that would pass. In other words, the use of correlation coefficient as a criterion forces the use of the worst possible calibration.

Therefore, we should not use a measure of calibration quality that routinely forces the use of calibrations with excessive and unnecessary error at the low end of the calibration. Unfortunately, this is exactly what most environmental analysis methods do.

Literature on the Use of Correlation Coefficient in Calibration

It is by no means novel to point out the problems inherent in the use of r and r² as measures of calibration quality. In 1998, the International Union of Pure and Applied Chemistry (IUPAC) pointed out that “the correlation coefficient, which is a measure of two random variables, has no meaning in calibration…” (3).

Published in 2000, Meier and Zund’s text, Statistical Methods in Analytical Chemistry, stated that “for most applications, and calibration curves in particular, the correlation coefficient must be regarded as a relic of the past” (4).

In 1990, Taylor observed with amazement that “one can even find requirements in Quality Assurance plans to recalibrate if the correlation coefficient is less than 0.995!” (5).

Back in 1981, Van Arendonk and Skogerboe noted that “one practice that should be discouraged is the use of correlation coefficient as a means of evaluating goodness of fit of linear models.” (6)

Current State of Calibration in Environmental Chemistry

The combination of using unweighted linear regression with evaluating the correlation coefficient is particularly pernicious. The unweighted regression will tend to generate a calibration with large or very large relative errors at the low end of the calibration, and the correlation coefficient will routinely allow such calibrations to pass typical method criteria. Fortunately, some progress has been made. The 2016 laboratory accreditation standards published by The NELAC Institute (TNI) require that calibration curves generated using regression analysis be evaluated using either RSE or relative error of the low and mid calibration points (7). Recent revisions of methods in EPA publication SW-846, 40 CFR Part 136, and ASTM have included requirements to evaluate using RSE, relative error of individual points or both (8). These changes are valuable but insufficient, because most methods still retain requirements to evaluate calibrations using r or r²—and these requirements may prohibit use of good calibrations while encouraging the use of bad calibrations.

Conclusions

Necessary changes to remove the use of correlation coefficient and coefficient of determination can be considered for methods produced or utilized by each EPA office separately.

Office of Water: Drinking Water Methods

The EPA developed many of the methods approved for analyzing drinking water, such as the 500 series methods for analyzing organic constituents, and the 200 series for inorganic constituents. Recent revisions of the 500 series methods do not include r or r². Instead, the acceptability of the calibration curve is evaluated using the relative error at each calibration point.

Older EPA methods used for drinking water analysis may just tell the user to develop a regression curve with minimal further instruction. Even these methods ask the user to develop acceptance limits using relative error, such as this language from method 335.4: “Acceptance or control limits should be established using the difference between the measured value of the calibration solution and the “true value” concentration” (9).

Many methods from consensus standards organizations such as ASTM and Standard Methods are approved for drinking water analysis. An effort is currently underway to remove mention of r and r² from ASTM methods.

In short, the drinking water methods, especially those from EPA, are in good order. It would be best if RSE were included, but at least the primary measures of calibration are based on relative, not absolute, error.

Office of Water: Wastewater Methods

Methods for analysis of wastewater are specified at 40 CFR Part 136. There are many EPA methods as well as methods from consensus standards organizations. Many of the EPA-written methods include r/r². Fortunately, a recent method update allows the use of RSE as an alternative to r/r² for evaluation of calibration curves. However, a counterproductive measure such as r/r² should not be allowed to remain; correlation coefficient and coefficient of determination should be removed from the methods completely. The Office of Water could also consider adding relative error (RE) as an alternative to RSE. This would improve compatibility with the drinking water methods and SW-846 methods.

Office of Resource Conservation and Recovery – SW-846 Methods

Methods published in the SW-846 manual are used for analysis of wastes, solids, and, commonly, groundwater. These methods usually include correlation coefficient or coefficient of determination as a measure of calibration quality. Recent methods include RSE as an alternative to r/r² and also include specifications for RE. With both RSE and RE included in the methods, r/r² is completely redundant and unnecessary and should be removed.

References

(1) D. Edgerly, Techniques for Improving the Accuracy of Calibration in the Environmental Laboratory, presented at the WTQA 1998 – 14th Annual Waste Testing and Quality Assurance Symposium, Arlington, Virginia, 1998. https://clu-in.org/download/char/dataquality/aedgerley.pdf

(2) SW-846, Method 8000D. https://www.epa.gov/sites/production/files/2015-12/documents/8000d.pdf

(3) IUPAC, Pure & Appl. Chem. 70(4), 993–1014 (1998).

(4) P.C. Meier and R.E. Zund, Statistical Methods in Analytical Chemistry (Wiley, New York, New York, 2000), p. 93.

(5) J.K. Taylor, Statistical Techniques for Data Analysis (Chapman and Hall, New York, 1990), p. 208.

(6) M.D. Van Arendonk and K. Skogerboe, Anal. Chem. 53, 2349–2350 (1981).

(7) “Management and Technical Requirements for Laboratories Performing Environmental Analysis,” The NELAC Institute (Weatherford, Texas, 2016).

(8) Code of Federal Regulations (CFR), Title 40, Part 136 (U.S. Government Printing Office, Washington, D.C., 2020), p. 4.

(9) J. O’Dell, EPA Method 335.4, Determination of Total Cyanide by Semi-Automated Colorimetry (USEPA Office of Research and Development, 1993).

Richard Burrows is with Eurofins Environment Testing America, in Denver, Colorado. Jerry Parr is with Catalyst Information Resources, in Weatherford, Texas.