BioPharmaceutical Emerging Best Practices Association

BEBPA Blog

Volume 1, Issue 4

Before the Beginning Comes the Question: A Journey Through the Statistical Fundamentals of Comparability

By Kevin Brooks, MSc, PhD, Principal Consultant, K.R. Brooks & Associates

Bold moves can propel you forward, but recklessness at the start can derail your journey.

Complex biopharmaceutical processes often require changes to satisfy production scaling, cost optimization, product safety and efficacy, and the constantly evolving regulatory landscape. It is critical that products manufactured in the post-change environment be comparable to those in the pre-change environment. Proper analysis of appropriate data is essential for demonstrating the required comparability. Regulatory guidance recommends following a stepwise approach utilizing a collaborative totality-of-evidence strategy. [1]

The key steps in a comparability study are to decide quickly what process characteristics will need to be measured, what data will need to be collected, and what statistical methods will need to be used to assess comparability between the original and modified processes. Ideally, a comparability study would make use of data collected through a designed experiment, but when that is not feasible, historical data can be used.

The purpose of this paper is to take us on a journey through the statistical fundamentals of comparability.

The Research Question

For our focus on comparability, the question is always, “Are products manufactured in the post-change environment comparable to those in the pre-change environment?”

Below is a list of activities involved in research that provide a good roadmap for researchers and statisticians working together.

  1. What is the opportunity? What is the problem? What is the goal?
  2. What total information meets that opportunity?
  3. What partial information is known already?
  4. What remaining information must be discovered? (item 2 minus item 3)
  5. What mathematical model will supply that remaining information?
  6. What data are required to fit the mathematical model?
  7. What EXPERIMENTAL DESIGN will efficiently and effectively gather that data?
  8. Carry out the experiments.
  9. Fit the mathematical model to the data.
  10. Extract information from the mathematical model.
  11. Does the information meet the opportunity?
    1. If yes, move on to next project.
    2. If no, go back to 3 and iterate.

List 1. Charting Your Course: Questions and Actionable Steps for Successful Research

A well-defined research question is the cornerstone of efficient, effective, and productive research. It guides your search, ensures you stay focused, and clarifies when you’ve successfully met your research goals. The list above is a guide for such a process that will lead to successful research.

In general, statisticians answer research questions formally using a structured approach. This involves formulating a null hypothesis, which essentially proposes no significant effect or relationship exists between the variables of interest. They then construct a complementary alternative hypothesis, which posits the opposite, suggesting there indeed is an effect or relationship. By statistically analyzing data through various techniques, the aim is to either reject the null hypothesis, supporting the alternative, or fail to reject it, indicating further investigation might be needed.

The Hypotheses

Formulated through the scientific method, a hypothesis is a testable proposition/prediction that helps answer a question. Of note, the answer provided by a hypothesis may not fit neatly into a simple dichotomous conclusion of “yes” or “no”, “true” or “false”, but may be between these extremes in the region that’s called “don’t know.” Essentially, within the context of comparability, when the question is, “are the pre- and post-change processes comparable,” the answer might be “yes,” but if it isn’t “yes,” it doesn’t mean the answer is “no” – the answer may be the uncomfortable “don’t know.” In this case the conclusion must be that the information isn’t strong enough, given the level of confidence, to say that the two processes are comparable.

Many studies are designed to test a specific hypothesis, which can be evaluated/tested – either by algebraic comparisons with one or more reference points, or by the visual observation of confidence intervals and their relationship to these same reference points. The algebraic and visual approaches achieve the same conclusion.

Whatever the parameter, the confidence interval that will be used to carry out statistical testing will look like one of the three general examples shown in Figure 1: A two-sided confidence interval for “between” questions, and one-sided confidence intervals for “greater than” and “less than” questions.[8]

Figure 1: The three confidence intervals used for statistical testing.

We close this Section with an example provided by Mackowiak et al. [5] who hypothesized that the average normal body temperature is less than the widely accepted value of 98.6°F. We might have observed that the body temperature of many healthy people is less than 98.6°F.

If we denote the population mean of normal body temperature as μ, then we can express this research hypothesis as μ < 98.6. Typically, we can find another statement, also expressed as a hypothesis, that is the complement of our proposed hypothesis. For this example, one might hypothesize that μ ≥ 98.6. We refer to this hypothesis as the null hypothesis and denote it as H0. The null hypothesis usually reflects the “status quo” or “nothing of interest”. In contrast, we refer to our research hypothesis (i.e., the hypothesis we are investigating through a scientific study or what the researcher wants to show) as the alternative hypothesis and denote it as HA or H1. [6]
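The body temperature example can be written out as a one-sided test. The block below is a sketch only: the sample mean \bar{x}, sample standard deviation s, sample size n, and significance level \alpha are symbols introduced here for illustration and do not appear in the original example.

```latex
% Hypotheses from the body temperature example
H_0:\ \mu \ge 98.6
\qquad \text{vs.} \qquad
H_A:\ \mu < 98.6

% One-sample t statistic (\bar{x}, s, n are assumed notation)
t = \frac{\bar{x} - 98.6}{s / \sqrt{n}}

% Reject H_0 at level \alpha when t is sufficiently negative,
% which is the same as the upper one-sided confidence bound
% falling below the reference point 98.6
t < -t_{1-\alpha,\, n-1}
\quad \Longleftrightarrow \quad
\bar{x} + t_{1-\alpha,\, n-1}\, \frac{s}{\sqrt{n}} < 98.6
```

The second form is the "less than" one-sided confidence interval of Figure 1: the algebraic test and the visual comparison of the interval with the reference point lead to the same conclusion.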

The Statistical Fundamentals of Comparability

Regulatory agencies acknowledge that product and process changes are necessary for the biotech industry to evolve. It is the manufacturer’s responsibility to demonstrate product comparability (the product’s safety, identity, purity, and potency) between the post-change and pre-change products.

The principle behind confidence intervals, discussed in the previous section, is widely used when considering comparability or equivalence. The current ICH E9 Guideline [9] approach to testing equivalence is to use two one-sided tests (TOST), which can be implemented visually with two one-sided confidence intervals [10]; more details follow.

In this paper, “comparability” is broadly associated with “(bio)similarity”, “bridging”, “equivalence”, etc. The demonstration of comparability does not necessarily mean that the quality attributes and other key parameters of interest of the reference and test product are identical, but that they are highly similar, and that the existing knowledge is sufficiently predictive to ensure that any differences in quality attributes and the key parameters of interest have no adverse impact upon safety or efficacy of the drug product.

During the process of developing the research question and hypotheses, considerable effort must be spent to determine which Critical Quality Attributes (CQAs) may affect safety and efficacy during the proposed change. Tsong, Dong, et al. [11] recommend that the CQAs be categorized into three tiers based on their potential impact on product quality and clinical outcome.

So how exactly is comparability demonstrated? During the process of developing a sound research question and accompanying hypotheses we will have a clear, objective and scientifically relevant definition of what it means to be “comparable,” setting the foundation to definitively demonstrate that two products or processes are indeed similar. Considering the factors presented in Figure 2, we present two examples of analytical methods: (1) the two one-sided tests (TOST); and (2) Passing-Bablok regression, which may be preferred over other methods, such as Deming regression, for comparing clinical methods because it does not assume that measurement error is normally distributed and is robust against outliers.

For Tier 1 CQAs, the most widely used procedure for statistically evaluating equivalence is the TOST, which is advocated by the United States FDA. The measurements made on the reference (pre-change) product (XR) are assumed to follow a normal distribution with mean μR and variance σ²R. Likewise, the measurements made on the test (post-change) product (XT) are assumed to follow a normal distribution with mean μT and variance σ²T.

Figure 2: The three tiers, with their respective definitions and suggested statistical methods for demonstrating comparability. [11]

For a given equivalence margin, δ(>0), the equivalence hypotheses can be stated as follows:

H0: |μR − μT| ≥ δ vs. H1: |μR − μT| < δ

The null hypothesis for the TOST approach is that the groups differ by at least a tolerably small amount. The alternative hypothesis is that the groups differ by less than that amount, that is, they are practically similar within the stated equivalence margins. The null hypothesis H0 decomposes into two separate sub-null hypotheses, H01: μR − μT ≥ δ and H02: μR − μT ≤ −δ. These two components give rise to the ‘two one-sided tests.’
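The two sub-hypotheses can be tested in a few lines of code. The following is a minimal sketch, not a validated implementation: it assumes independent samples, uses Welch's unequal-variance t statistic (one possible choice, since the text does not say whether the two variances are equal), and computes the t distribution numerically so that only the Python standard library is needed. The function names, the lot values, and the margin delta = 0.5 are hypothetical.

```python
import math
from statistics import mean, variance


def t_cdf(t, df, steps=2000):
    """Student's t CDF via Simpson's rule on the density (stdlib only)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = abs(t) / steps                      # steps must be even
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3                    # integral of the density on [0, |t|]
    return 0.5 + area if t >= 0 else 0.5 - area


def tost(ref, test, delta, alpha=0.05):
    """Two one-sided Welch t tests for H0: |muR - muT| >= delta."""
    v_r = variance(ref) / len(ref)
    v_t = variance(test) / len(test)
    diff = mean(ref) - mean(test)
    se = math.sqrt(v_r + v_t)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v_r + v_t) ** 2 / (v_r ** 2 / (len(ref) - 1)
                             + v_t ** 2 / (len(test) - 1))
    p_h01 = t_cdf((diff - delta) / se, df)        # H01: muR - muT >= delta
    p_h02 = 1.0 - t_cdf((diff + delta) / se, df)  # H02: muR - muT <= -delta
    p_value = max(p_h01, p_h02)                   # both nulls must be rejected
    return diff, p_value, p_value < alpha


# Hypothetical potency results (% of label claim) for six lots each
ref_lots = [100.1, 99.8, 100.3, 99.9, 100.0, 100.2]
test_lots = [100.0, 100.4, 99.7, 100.1, 100.2, 99.9]

diff, p_value, equivalent = tost(ref_lots, test_lots, delta=0.5)
print(f"difference={diff:.3f}, TOST p-value={p_value:.4f}, equivalent={equivalent}")
```

Equivalence is concluded only when both one-sided p-values fall below alpha, which is why the larger of the two is reported. With the same data and a much tighter margin, the test fails to reject, which corresponds to the "don't know" outcome rather than to a finding that the processes differ.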

Figure 3: A graphical comparison of the two one-sided tests and a two-sided confidence interval. [8]

As shown graphically in Figure 3, TOST uses two one-sided t tests. One test (represented by the upper one-sided green 95% confidence interval) says that there is at least 95% confidence that m is above the lower specification L. The other test (represented by the lower one-sided green 95% confidence interval) says that there is at least 95% confidence that m is below the upper specification U. An alternative approach is represented by the last green bar, which shows a two-sided 92% CI. [8] In many instances this is computed as a two-sided 90% CI.
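The link between the two one-sided tests and a single two-sided interval can be stated in the notation of the equivalence hypotheses given earlier. This is a sketch under assumed notation: \bar{X}_R and \bar{X}_T are the sample means, SE is the standard error of their difference, \nu is the degrees of freedom, and the specifications are taken to be L = -\delta and U = \delta.

```latex
% Both sub-null hypotheses are rejected at level \alpha
% exactly when the 100(1 - 2\alpha)% two-sided interval
% lies entirely inside the equivalence margins
\text{reject } H_{01} \text{ and } H_{02}
\quad \Longleftrightarrow \quad
\Bigl[\, (\bar{X}_R - \bar{X}_T) - t_{1-\alpha,\,\nu}\, SE,\;
         (\bar{X}_R - \bar{X}_T) + t_{1-\alpha,\,\nu}\, SE \,\Bigr]
\subset (-\delta,\ \delta)

% With \alpha = 0.05 each one-sided bound has 95% confidence,
% and the combined interval has 100(1 - 2 \times 0.05)\% = 90\% confidence
```

This is why two one-sided 95% intervals correspond to a two-sided 90% interval rather than a two-sided 95% interval.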

The goal of the equivalence test (TOST), presented in the previous section, was to demonstrate that two populations (groups of data) are practically equivalent. Equivalence tests in general are based on comparisons of population statistics: mean and variance. In this section the focus will be on method comparison, not on the comparison of industrial production processes. The goal here is to demonstrate that the measurement systems (current and proposed) are practically equivalent in their measurement capacity.

Three key methods are widely used: Passing-Bablok regression, Deming regression, and Bland-Altman analysis. Passing-Bablok regression will be used to demonstrate method comparison, primarily because, compared with Deming regression, it does not assume that measurement error is normally distributed and is robust against outliers.

In brief, Passing and Bablok regression is a nonparametric (robust to outliers) method for fitting a line to two variables that are both measured with error. It is generally used to compare two analytical methods that are expected to produce the same measurement values. The intercept estimates the constant bias between the two methods and the slope estimates the proportional bias between the two methods. Passing-Bablok regression requires checking the assumptions that the measurements are positively correlated and exhibit a linear relationship.
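The point estimates in Passing and Bablok regression can be sketched in a few lines. The code below is an illustrative sketch using only the Python standard library: the slope is the shifted median of all pairwise slopes and the intercept is the median of the residual offsets. It omits the confidence intervals and the cusum linearity test that a full analysis would report, and the function name and data are hypothetical.

```python
import math
from statistics import median


def passing_bablok(x, y):
    """Return (intercept, slope) point estimates for y = a + b * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    slopes = []
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx, dy = x[j] - x[i], y[j] - y[i]
            if dx == 0 and dy == 0:
                continue                      # identical points: slope undefined
            s = dy / dx if dx != 0 else math.copysign(math.inf, dy)
            if s == -1:
                continue                      # slopes of exactly -1 are excluded
            slopes.append(s)
    slopes.sort()
    m = len(slopes)
    k = sum(1 for s in slopes if s < -1)      # offset that shifts the median
    if m == 0 or m // 2 + k >= m:
        raise ValueError("data are not positively correlated enough to fit")
    if m % 2:                                 # odd number of slopes
        slope = slopes[(m - 1) // 2 + k]
    else:                                     # even: average the two middle values
        slope = 0.5 * (slopes[m // 2 - 1 + k] + slopes[m // 2 + k])
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return intercept, slope


# Hypothetical paired results: the methods agree except for one gross outlier
method_x = [float(v) for v in range(1, 11)]
method_y = method_x[:-1] + [100.0]
print(passing_bablok(method_x, method_y))     # the outlier barely moves the fit
```

Because both estimates are medians, a single aberrant pair has little influence on the fitted line, which is the robustness property mentioned above. An ordinary least-squares fit to the same data would be pulled strongly toward the outlier.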

Figure 4 shows two distinctly different results using Passing and Bablok regression.

Figure 4: Example Results from Passing and Bablok regression.

Panel A: Passing and Bablok regression analyses of two methods for total bilirubin, N = 40; concentration range 3-468 μmol/L; Pearson correlation coefficient r = 0.99, P < 0.001. Scatter diagram with regression line and confidence bands for regression line. Identity line is dashed. Regression line equation: y = -3.0 + 1.00 x; 95% CI for intercept -3.8 to -2.1 and for slope 0.98 to 1.01 indicated good agreement. Cusum test for linearity indicates no significant deviation from linearity (P > 0.10).

Panel B: Passing and Bablok regression analyses of two methods for direct bilirubin, N = 70; concentration range 4-357 μmol/L; Pearson correlation coefficient r = 0.99, P < 0.001. Scatter diagram with regression line and confidence bands for regression line. Identity line is dashed. Regression line equation: y = -3.2 + 1.52 x; 95% CI for intercept -4.2 to -1.9 and for slope 1.47 to 1.58 indicated small constant and huge proportional difference. Cusum test for linearity indicates significant deviation from linearity (P < 0.05).

Note that the correlation coefficient in both examples is r = 0.99, even though the two panels lead to very different conclusions about agreement; hence method comparison results cannot be assessed using standard parametric methods such as Pearson’s correlation.
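The reason correlation cannot stand in for agreement is that Pearson's r is unchanged by any constant offset or positive proportional scaling. The short sketch below makes this concrete with hypothetical values; the bias of -3.0 and slope of 1.5 are chosen only for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical results from a reference method
method_a = [10.0, 20.0, 40.0, 80.0, 160.0]
# A second method that agrees perfectly with the first
agreeing = list(method_a)
# A third method with constant bias -3.0 and proportional bias 1.5
biased = [-3.0 + 1.5 * v for v in method_a]

print(pearson_r(method_a, agreeing))  # essentially 1
print(pearson_r(method_a, biased))    # also essentially 1, despite the bias
```

Both comparisons give a correlation of essentially 1, yet the third method overstates the highest value by 77 units. Slope and intercept estimates, not r, are what reveal the disagreement.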

The two methods presented represent very different approaches to assessing comparability. TOST, being an equivalence test, is based on comparisons of population statistics (mean and variance), while Passing and Bablok regression is a more robust nonparametric method for comparing data from two methods with measurement error. Could it be that Passing and Bablok regression, or some version of it, with its robust nonparametric approach, would be adequate for all or most comparability studies?

In closing, finding a statistically significant difference doesn’t guarantee it’s biologically important. Similarly, “not statistically different” doesn’t mean the compounds are biologically identical. Statistical findings must always be placed in the appropriate biological context.

During the ensuing workshop, we will present more on “Comparability – Challenges and Lessons Learned” with the use of demonstrations of TOST, Deming, Passing-Bablok, and Bland-Altman methods as time permits.

References

[1] FDA. (2015a). Scientific Considerations in Demonstrating Biosimilarity to a Reference Product. Silver Spring, MD: U.S. Food and Drug Administration.

[2] Riva JJ, Malik KM, Burnie SJ, Endicott AR, Busse JW. What is your research question? An introduction to the PICOT format for clinicians. J Can Chiropr Assoc 2012;56:167-71.

[3] Hacker D. A pocket style manual, 4th ed. New York: Bedford/St. Martin’s; 1999.

[4] Schensul JJ. The development and maintenance of community research partnerships. www.mapcruzin.com/community-research/schensul1.htm (accessed 2006 Nov 9).

[5] Mackowiak, P.A., Wasserman, S.S., Levine, M.M.: A critical appraisal of 98.6°F, the upper limit of the normal body temperature, and other legacies of Carl Reinhold August Wunderlich. JAMA 268, 1578–1580 (1992)

[6] Shahbaba, B. (2012). Hypothesis Testing. In: Biostatistics with R. Use R!. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1302-8_7

[7] Chow, S.-C. (Ed.). (2018). Encyclopedia of Biopharmaceutical Statistics – Four Volume Set (4th ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781351110273

[8] Deming S. Statistical Analysis of Laboratory Data: Basics (lulu.com)

[9] ICH E9 Expert Working Group. Statistical principles for clinical trials: ICH harmonized tripartite guidelines. Stat. Med. 1999, 18, 1905–1942.

[10] Schuirmann, D.J. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J. Pharmacokinet. Biopharm. 1987, 15, 657–680.

[11] Tsong, Y.; Dong, X.; Shen, M. Development of Statistical Methods for Analytical Similarity Assessment. Journal of Biopharmaceutical Statistics 2015. DOI:10.1080/10543406.2015.1092038

[12] Bilić-Zulle L. Comparison of methods: Passing and Bablok regression. Biochem Med (Zagreb). 2011;21:49-52

Kevin Brooks

About The Author: Kevin Brooks, MSc, PhD

Kevin has extensive experience teaching biostatistics and working in the biopharmaceutical industry supporting regulatory filings, product safety, and quality control (GMP, GLP, and GCP). He also has over 30 years of experience in designing, conducting, and analyzing epidemiologic studies. His training includes a BS in Computer Science, both an MS and Ph.D. in Epidemiology and, finally, an MS in Management, Strategy & Leadership. Kevin is Principal Consultant, Biostatistics, at K. R. Brooks & Associates, CMC Statistical Consulting.