
SPECIAL EDITION

STATISTICS
IN VALIDATION

A division of UBM Americas




Statistics in Validation

Risk Analysis and Design of Experiments in Process Validation Stage 1 .......... 4
Kevin O'Donnell

First Steps in Experimental Design II: More on Screening Experiments .......... 12
John A. Wass

A Further Step in Experimental Design (III): The Response Surface .......... 21
John A. Wass

Linear Regression 101 .......... 29
Yanhui Hu

Linear Regression 102: Stability Shelf Life Estimation Using Analysis of Covariance .......... 36
David LeBlond, Daniel Griffith, and Kelly Aubuchon

Understanding and Reducing Analytical Error: Why Good Science Requires Operational Excellence .......... 55
John McConnell, Brian K. Nunnally, and Bernard McGarvey

Analysis and Control of Variation: Using Process Control to Reduce Variability:
Comparison of Engineering Process Control with Statistical Process Control .......... 61
John McConnell, Brian K. Nunnally, and Bernard McGarvey

Improvement Alphabet: QbD, PAT, LSS, DOE, SPC - How Do They Fit Together? .......... 66
Ronald D. Snee

Statistical Analysis in Analytical Method Validation .......... 70
Eugenie Webster (Khlebnikova)

Statistical Tools for Development and Control of Pharmaceutical Processes:
Statistics in the FDA Process Validation Guidance .......... 78
Paul L. Pluta

Statistical Considerations for Design and Analysis of Bridging Studies .......... 83
Harry Yang and Timothy Schofield

FDA, Globalization, and Statistical Process Validation .......... 90
Robert L. Creighton and Marlene Garcia Swider

Statistical Sampling Plan for Design Verification and Validation of Medical Devices .......... 93
Liem Ferryanto
Peer Reviewed: Process Validation

Risk Analysis and Design of Experiments in Process Validation Stage 1 | IVT
Kevin O'Donnell

Abstract
Process design, the first stage in the US Food and Drug Administration
lifecycle approach to process validation, lays the foundation for process
understanding and control. The work of Stage 1 enables subsequent
Stages 2 and 3 to be successful. Process design involves planning and
forethought, often by utilizing risk analysis and design of experiments
(DOE). Risk analysis tools and a simple DOE experiment are discussed.
Furthermore, several example cases are presented. The efficiency and
ease with which process development studies can be leveraged to create
uneventful and meaningful transitions to full scale validation assumes
acceptable technical characteristics of the associated facility and equip-
ment. Specific areas for consideration are briefly described. While the
overall process may seem complex, these activities will serve the entire
process validation continuum from qualification to maintenance.

Introduction
Process design, the first stage in the FDA lifecycle approach to process
validation, lays the foundation for process understanding and control.
The work of stage one enables subsequent stages, Stage 2 and Stage 3,
to be successful. Process design allows for the characterization and un-
derstanding of how the process responds to the various process inputs
and variables. This effectively addresses one of the significant flaws
in the previous process validation paradigm: the failure to account
for inherent variability in process inputs (1). Stage 1 work allows for a
better understanding of how the process will respond to input varia-
tion before the process is validated. Further, it provides an alternative
approach to flawed practices such as "edge of failure" validation testing
(2). Edge of failure testing was a methodology frequently employed by
validation personnel to qualify the manufacturing process. Processes
were often transferred to full-scale manufacturing with limited develop-
mental data, often with disastrous results. Validation personnel would
run the process using reduced process parameter values until the point
of failure was identified. This approach was generally not conducted as
methodically as would be done for process development wherein mul-
tiple variables are tested across a range to map and measure the process
response. Only the variable deemed to be the most critical would be
tested until the process produced a response that was outside the prod-
uct specifications. Consideration that change in one variable often elicits
interdependent changes in other variables was not addressed.


The use of more efficient and methodical approaches during Stage 1 process design and across the process development continuum is recommended in the FDA Process Validation Guidance. Tools that have been routinely used in other sectors of industry are recommended. The following paper describes important considerations and a general approach to process development as well as some of these methodologies and tools. Examples to illustrate their benefits and utility are also provided.

Strategy and Approach
The first step in the lifecycle approach to process validation involves planning and forethought, and nothing ultimately provides greater benefit. Process design should be conducted in a structured manner through the use of various risk assessment techniques. Risk analysis identifies failure modes that can guide the execution of process development activities. In addition, it is important to ensure that the methods and equipment used are sensitive enough and in the proper range to properly assess the process response by conducting a measurement system analysis (MSA). Further, once the variables that can potentially impact a process have been identified, full factorial and fractional factorial designed experiments (DOE) can be used to efficiently screen variables to determine which have the greatest impact as well as which exhibit significant interactions with each other. When variables and their interactions have been characterized, a process development report should be prepared to properly document information for Stage 2, process qualification, and Stage 3, continued process verification.

Risk Analysis and DOE
Validating any process requires aligning many safety, product quality, and financial details into one focused effort. At first glance, this is a monumental task surely to overwhelm and confuse even the most highly talented individuals. Success rather than chaos may be accomplished by taking a series of precise and focused steps. To begin, the various aspects of the project must be compartmentalized from a "10,000-foot perspective." Risk assessment is often used for this effort. Once tasks can be separated into high, medium, and low-risk buckets, studies can be designed using a DOE approach to challenge and mitigate the high and medium-risk aspects of the project.

Risk Analysis
Risk analysis can utilize one tool to determine the parameters of the process that must be considered and then apply another tool to assess the low, medium, and high-risk buckets. There are many models available for risk analysis, including failure mode effects analysis (FMEA); failure mode, effects, and criticality analysis (FMECA); fault tree analysis (FTA); hazard and operability analysis (HAZOP); preliminary hazard analysis (PHA); hazard analysis and critical control points (HACCP); and fishbone analysis (5, 6). Note that these models are not one size fits all and should be modified or combined depending on the application.

Risk Analysis Methods
When considering the type of risk analysis to perform, it is important to understand the values and shortcomings of each option. For example, FTA and fishbone analysis are similar models that have an objective to deduce cause-and-effect relationships. Both models can be utilized to stimulate critical thinking and identify many hazards and potential failures of a process. However, these techniques do not systematically sort or rank the risks. Other risk assessment models are often used to analyze the results of FTA and fishbone analyses.

FMEA and FMECA are similar in methodology, with the difference being that FMECA adds additional rankings to the risks. These analyses include four variations: systems FMEA (SFMEA), design FMEA (DFMEA), process FMEA (PFMEA), and equipment FMEA (EFMEA). FMEAs are valuable because the output is a weighted risk score for a particular failure event. The design of the tool dictates the scale of the failure event score and what score is considered low, medium, or high risk. Actions or monitoring are typically required for medium and high-risk events until the risk is mitigated and scored at a low-risk level. The shortcomings of FMEA are that the weighted scales are qualitative and are subject to the bias of the creator. It is critical to involve a fairly large and diverse team to build the structure of the tool as well as to participate in the assessment of the risk. Small teams from one department of the organization can significantly bias the final weighted risk scores. In addition, note that this tool is not constructed with the capability to define acceptance criteria or critical boundaries.

HAZOP is similar to FMEA. However, HAZOP is tailored more towards system failures, and a weighted risk score is not assigned to the failure event. Instead, guide-words are defined to help identify failure events. If a failure event can be categorized as a significant deviation, action must be taken to mitigate the risk. PHA is similar to HAZOP since it does not assign a weighted risk score but assigns an overall risk ranking (typically 1-4).


The lack of a weighted score system adds additional variation to the risk analysis and likely increases the bias potential of the assessment team. Similar to FMEA, and potentially more so, a large group of subject matter experts (SMEs) is required to identify effective and meaningful failure event mitigation.

HACCP is tailored towards safety hazards, including biological, chemical, or physical agents, or operations that may cause harm if not controlled. HACCP evaluates entire processes from material reception to final product shipping and is typically not used for specific activities (7). The entire analysis is based upon answering a risk question. For example, the risk question may be, "What Critical Control Parameters (CCPs) are required to reproducibly manufacture the product?" A multi-disciplinary team of SMEs will first develop a process flow diagram of the entire process and use this diagram to identify all possible hazards. Each hazard will then be subject to a thorough hazard analysis that identifies various components of the hazard, including sources, severity, probability, and control measures currently in place. With this information, the team can begin to identify where critical control parameters are required.

Risk assessments as described above primarily produce three results: a list of potential hazards, a risk ranking of the various hazards, and identification of where process parameters need to be controlled to encourage a robust process. However, one obvious shortcoming is that none of these analyses define a clear and scientific way to arrive at values for CCPs. DOE is then used to identify further action.

DOE
DOE, explained in detail in an earlier paper in this journal and the Journal of Validation Technology, may be approached in a systematic manner by parsing it into a phased approach in which the response of the process to various factors is screened, mapped across a response surface (design space exploration), and finally modeled and validated (3). DOE may be used in this manner to assess the impact of three or more variables on selected attributes of the system being studied as well as any significant interactions between variables. Most notably, this methodology enables performance of far fewer experiments than would be required if each variable was tested independently. DOE also utilizes statistical data treatment, which allows clear statements regarding the significance of a variable. The following two examples illustrate the application of both risk assessment and DOE methodologies.

Risk Assessment Example
In this example, a manufacturer has stockpiled and frozen sufficient product for the next year and has planned a shutdown to make equipment and facility upgrades. There have been many suggestions from Process Development, Manufacturing, and Facilities staff, but since production must resume in six months, the various proposed projects must be prioritized in some manner. An FMEA approach was chosen to review the proposed projects and modifications for their relative benefits.

First, the suggestions were organized into five categories without making judgment on any of their qualities, costs, or the urgency of the need, as these may change through the FMEA process. This is represented graphically in the Figure.

Figure: Categorized Suggestions.

Typically, a team of SMEs will convene to assess the suggestions as well as determine their categories. The various attributes of the suggestions are listed to the best of the shutdown team's ability. This particular example has been somewhat simplified for presentation. In this case, the individual attributes were then considered using what-if scenarios, considering that the facility runs a clean-in-place (CIP) cycle 40 times per year. Failure modes could then be determined using this approach as well as historical data from a comparable system.

While the ways a process can fail are unique to the system, the impact rankings are fairly straightforward. For this evaluation, safety was given the highest rank and time loss on equipment the next highest.


Categories such as human-machine interface (HMI) failure, in which operators can simply use a different HMI or have a standby team re-start from the server, received a lower ranking. The time loss on equipment was estimated based on the product of suite time cost per day and the fraction of the day for which it was unavailable. Frequency was calculated using historical online data collection from a similar system. The likelihood of failure detection was ranked where a value of 1 represents the highest chance of detection and a value of 10 represents no chance of detection. Typical questions the group posed for this evaluation were, "Are the correct alarms in place, and do they stop the system?" and "Are there instrumentation control loops, and are the sensors located in the correct parts of the system?"

The rankings are then compared to the budget allocations, resource requirements, and timeline to determine a feasible action limit for the risk priority number (RPN). The RPN is calculated as the product of severity, frequency, and likelihood of detection. The results of this evaluation are shown in Table I:

Step | How It Can Fail | Impact of Failure | Severity Rank | Potential Cause of Failure | Freq. | Existing Controls | Detection | Severity x Frequency x Detection (RPN) | Action
Operator Set-Up | Pipe Misalignment | Leak | 10 | Thermal Expansion | 1 | Facility PM Schedule | 1 | 10 | None
Operator Set-Up | Wrong Spool Piece | Low Flow Alarm | 3 | Improper Installation | 5 | Training | 3 | 45 | None
Operator Set-Up | Filters Left in Housings | Low, low flow alarm (stops CIP) | 8 | Operator Does Not Follow SOP | 8 | Training | 1 | 64 | Add Checklist to SOP
Automation Recipe Start | Operator selects wrong recipe | Leak, failure of other processes | 10 | Operator error; recipe annotation unclear | 10 | Training; verification by supervisor | 3 | 300 | Change recipe names
SCADA | SCADA failure | Re-start | 1 | Touch screen failure | 4 | NEMA cabinet for HMI | 1 | 4 | None
CIP Dilution Batching | Pump failure | Leak, batch failure | 10 | Pump wear, electrical failure | 2 | Facility preventative maintenance schedule | 4 | 60 | Source new pump
CIP Dilution Batching | Chemical quality failure | Residue, poor cleaning | 7 | Supplier change | 3 | Preferred supplier system | 2 | 42 | None
CIP Dilution Batching | Mixing failure | Conductivity limit failure | 5 | Design failure | 7 | None | 1 | 35 | None
CIP Dilution Batching | Water supply failure | CIP failure | 5 | Utility supply inadequate | 30 | Alarm will stop CIP | 1 | 150 | Analyze water production system
Table I: FMEA of Identified Categories and Projects.

Setting the RPN threshold at 50, the manufacturer identified a handful of projects. Notably, the engineering and validation groups were then tasked with sourcing a new pump with improved chemical resistance to the bulk chemical used at the facility. Specifically, the value of this process can be seen as the specifications for the original pump indicate that it is fit-for-purpose. However, data from the maintenance logbooks indicated several emergency repairs for failed diaphragms. In developing new specifications for the pump, a DOE was utilized to identify whether the root cause of the pump failure was a poor choice of diaphragm elastomer or the bulk chemical itself. By conducting a single experiment, the team was able to identify which chemistries were more likely to be compatible with the available pump diaphragm materials.
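To make the RPN arithmetic concrete, here is a minimal sketch in Python (not part of the original article) that computes severity x frequency x detection for the failure modes in Table I and flags those at or above the action threshold of 50 chosen by the team. The list structure and variable names are illustrative only.

    # Minimal sketch: RPN = severity x frequency x detection; flag items at/above the action limit.
    failure_modes = [
        # (failure mode, severity, frequency, detection) -- rankings taken from Table I
        ("Pipe misalignment",             10,  1, 1),
        ("Wrong spool piece",              3,  5, 3),
        ("Filters left in housings",       8,  8, 1),
        ("Operator selects wrong recipe", 10, 10, 3),
        ("SCADA failure",                  1,  4, 1),
        ("Pump failure",                  10,  2, 4),
        ("Chemical quality failure",       7,  3, 2),
        ("Mixing failure",                 5,  7, 1),
        ("Water supply failure",           5, 30, 1),
    ]

    RPN_THRESHOLD = 50  # feasible action limit agreed on by the team

    for name, severity, frequency, detection in failure_modes:
        rpn = severity * frequency * detection
        status = "action required" if rpn >= RPN_THRESHOLD else "no action"
        print(f"{name:30s} RPN = {rpn:4d}  ({status})")

Run as-is, this reproduces the RPN column of Table I and singles out the same handful of items (RPN values of 64, 300, 60, and 150) that the manufacturer acted on.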


DOE Example
Consider a cell culture process in which variables affecting cell density are investigated. Temperature, dissolved oxygen, and agitation speed are the variables being studied. All other variables remain constant. The process is currently set for the following nominal values:

Temperature: 35.0°C
Dissolved Oxygen: 20%
Agitation Speed: 30 RPM

Variables are adjusted by 5%, which provides the following set of eight experiments summarized in Table II.

Temperature (°C) | Dissolved Oxygen (%) | Agitation Speed (RPM) | Cell Density
36.75 | 15 | 28.5 | X1
33.25 | 15 | 28.5 | X2
36.75 | 25 | 28.5 | X3
33.25 | 25 | 28.5 | X4
36.75 | 15 | 31.5 | X5
33.25 | 15 | 31.5 | X6
36.75 | 25 | 31.5 | X7
33.25 | 25 | 31.5 | X8
Table II: Cell Culture Variables.

Once these experiments are conducted, the results can be input into appropriate analysis of variance (ANOVA) software such as Excel, JMP, or Minitab. The software will output a coefficient for each variable for construction of an equation that models the behavior of cell density. In addition, any combination of the variables can be combined to evaluate interaction effects. All variables and combinations of variables can be subjected to a t-test to evaluate the statistical significance of each variable.
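For readers who want to see the arithmetic behind such an analysis outside of Excel, JMP, or Minitab, the following is a minimal Python/NumPy sketch (not from the article) that codes the three factors of Table II to -1/+1, builds the interaction columns as elementwise products, and estimates the coefficients by least squares. The response values stand in for the unreported cell densities X1-X8 and are purely hypothetical.

    import itertools
    import numpy as np

    # Coded levels (-1/+1) for the 2^3 design: temperature (33.25/36.75 C),
    # dissolved oxygen (15/25 %), agitation speed (28.5/31.5 RPM).
    runs = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)
    A, B, C = runs[:, 0], runs[:, 1], runs[:, 2]

    # Hypothetical responses standing in for the measured cell densities X1..X8.
    y = np.array([4.1, 3.8, 4.6, 4.0, 4.3, 3.9, 4.9, 4.2])

    # Model matrix: intercept, main effects, and the two-factor interaction columns.
    X = np.column_stack([np.ones(8), A, B, C, A * B, A * C, B * C])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    terms = ["intercept", "Temp", "DO", "Agitation", "Temp*DO", "Temp*Agit", "DO*Agit"]
    for term, b in zip(terms, coef):
        print(f"{term:10s} coefficient = {b:+.3f}")

With real data, each coefficient (and its t statistic from the ANOVA) indicates how strongly that variable or interaction drives cell density, which is exactly the equation-building step described above.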

Example #1: Cell Culture Manufacturing
Company A wished to upgrade and expand utilities in a cell culture scale-up facility to add additional incubators. This is a highly sensitive area of production that must be properly controlled. Contamination problems, with loss of substantial revenue, are consequences of inadequate controls. Company A's existing scale-up facility had two incubators. It was proposed that additional equipment (a new room to contain six incubators) would alleviate the bottleneck in production. To begin the project, the cross-functional team had to identify objectives and then focus the broad objectives into specific components. This particular project began with the task to create the capacity for six qualified and reliable incubators. In order to achieve this objective, power supply, gas supply (CO2), room orientation, and process monitoring all had to be analyzed for hazard potential to personnel, product systems, and support systems. Once the hazards were identified and ranked, areas requiring the creation of critical control parameters were specified.

With the objectives of the project clearly defined, it was time to identify the risk team and begin a risk analysis. The team consisted of a representative from Engineering, Validation, Operations, Quality Control, and Microbiology. The tool that was selected was the FTA. A few faults, among others, that were considered were an incubator losing power, an uncaptured excursion from a specified process parameter, and local areas of increased/decreased CO2 or temperature. Events that caused these failures are summarized below:

Failure 1: Incubator Losing Power
Facility power outage
Power switch accidentally pressed into the off position
Moisture affecting the power cord during room cleaning.

Failure 2: Uncaptured Excursions from Specified Process Parameter
Process values not recorded at acceptable intervals.

Failure 3: Local Areas of Increased/Decreased CO2 or Temperature
Elevated CO2 inlet pressure
Out-of-tolerance instruments
Non-functioning unit control mechanisms.

The events identified in this example are primarily design-based. These events were then ranked as high, medium, or low-risk by using a DFMEA model. DFMEA was also the specified model in the site standard operating procedure (SOP). Following the procedure, the team evaluated each failure event across three criteria: event severity, frequency, and detectability. For each, a quantitation between one and three could be assigned depending on the perceived risk level. These values were then multiplied to identify the total risk score. Each risk score fell into a grouping to classify the event as high, medium, or low. Within a grouping, a higher score was considered a higher priority. The risk grouping results were as follows:

High-Risk
Process values not recorded at acceptable intervals
Non-functioning unit control mechanisms.

Medium-Risk
Facility power outage
Power switch accidentally pressed into the off position
Moisture affecting the power cord during room cleaning
Elevated CO2 inlet pressure
Out-of-tolerance instruments.

No low risks were identified. At this point, the team was able to consider some existing facility infrastructure and identify ways to mitigate the risks. Some of the mitigation decisions are highlighted in Table III below.

The remaining event was the non-functioning unit control mechanisms, which was ranked as having high risk. Many companies may consider this risk mitigated by mapping the incubator chamber to ensure temperature, humidity, and CO2 remain within an acceptable range. However, the range must be specified based upon growth profiles of the cells in various conditions. A HACCP analysis is valuable in this situation in that the risk team considers a process flow from when the cells are thawed through scale-up and transfer to another location outside of the incubator. The team can identify areas of high risk and determine which parameters should be considered critical. For example, the HACCP analysis may conclude that temperature, humidity, and CO2 are all critical parameters. At this point, a group of experiments should be designed (DOE) to prove or disprove that those parameters are indeed critical. In addition, the DOE should arrive at parameter values that can be used to determine the range for temperature mapping of the incubator. The work to determine the critical parameters was already completed, as this project objective was to expand the cell scale-up capacity for existing products with known critical parameters. However, any new products should be subject to a HACCP analysis to determine if any critical parameters are required to change. These would then need to be compared to the existing mapping studies of the incubator to be used for the cell scale-up.

Example #2: CIP Process
Typical CIP recipes consist of multiple process steps. These include an initial rinse to remove gross soil, a hot caustic wash phase, a rinse phase to remove the caustic residue, an acid wash, and, finally, several rinses culminating in a water for injection (WFI) rinse controlled by conductivity of the final rinse liquid. In a redesign of a CIP recipe intended to save water and chemical use at a large pharmaceutical manufacturer, a standard test soil that was initially determined to be cleanable with hot caustic and water rinses only had repeated visual failures at scale in a stirred tank system.

Initial efforts focused on experiments to understand the failures. A DOE fractional factorial was designed with components of the cell culture not represented in the test soil that could be the source of the failure.

Event | Risk | Mitigation
Process values not recorded at acceptable intervals | High | Connect the incubator to the facility's building management system (BMS) to monitor CO2, temperature, and humidity.
Facility power outage | Medium | Provide power to the incubator units from electric panels that are backed by the facility's backup generators.
Power switch accidentally pressed into the off position | Medium | Install a switch cover.
Moisture affecting the power cord during room cleaning | Medium | Install NEMA Type 4X twist-lock receptacles that are water- and dust-tight.
Elevated/depressed CO2 inlet pressure | Medium | Install a pressure regulator for the inlet of each incubator.
Out-of-tolerance instruments | Medium | Add all instruments into the calibration program.
Table III: DFMEA Mitigations.


These included cells suspended in phosphate buffered saline (PBS), a PBS control, antifoam in a range of 200-600 ppm, cells suspended in media, and a media control. The cell mixtures and the media-based solutions were limited to concentrations found in normal processing because the total organic carbon (TOC) rinse water samples had acceptable results. The experiment demonstrated that concentrations of antifoam greater than 400 ppm left a silica dust residue "bathtub ring" in the tank. The test soil, which had been comparable to process soil at lower concentrations of antifoam, was not representative of process conditions with higher antifoam concentrations. The cleaning validation worst-case test soils and the process development were then adjusted to limit antifoam addition requirements.

Example #3: Cleaning Agent Composition
In an effort to reduce the cost of goods and raw materials, a manufacturer desired to create an in-house cleaning formulation from bulk chemicals rather than purchase commercial cleaning agents. Specifically, each CIP run using commercial formulated cleaning agents incurred costs as high as $12,000 per CIP cycle at commercial scale. To begin the project, the team first identified the general characteristics of commercial cleaning agents. Proprietary cleaning formulations typically contain chelating agents, surfactants, alkaline or acidic components, and, in some cases, oxidizing agents. Typical examples of these include ethylenediaminetetraacetic acid, sodium gluconate, sodium hydroxide, phosphoric acid, citric acid, peracetic acid, hydrogen peroxide, and hypochlorite, respectively.

This commercial manufacturing facility consisted of an outdoor plug-flow recombinant algae plant at 60,000 L scale. A custom formulation of bulk chemicals was created to address the unique properties of the soil. To evaluate this custom formulation, a full factorial DOE was set up comparing several low-foaming surfactants, biocides, and chelators to address the water quality on site (minimizing water treatment requirements), the contamination load, and the high fat content of the unique soil. A formulation was identified within one week that cost approximately $100 per full-scale CIP using non-bulk pricing, by a combination of soiled coupons and 40 L scale-down models of the system.

Technical Equipment and Facility Considerations
The efficiency and ease with which process development studies can be leveraged to create uneventful and meaningful transitions to full-scale validation assumes acceptable technical characteristics of the associated facility and equipment. Technical knowledge, design quality, the communication and understanding of the process that designers have with commercial operations, and the time allocated to the tech transfer process must be adequate for project success. The following are some specific points for consideration:

System Design: Systems should be designed and built as sanitary, with all United States Pharmacopeia (USP) Class VI elastomers. Product-contact surfaces should be electropolished. Drainage should include low-point drains and air breaks, appropriately sloped piping, diaphragm valves, and peristaltic or diaphragm pumps. Spray balls and spray wands must be thoroughly tested for their ability to cover all surfaces at a flow rate of five feet/second or produce acceptable sheeting action inside a tank. Systems should be well characterized with known worst-case soils and a margin of error. The system should ideally be used exclusively on a single product or with a highly characterized platform.

Utilities: Utility equipment should be able to supply more than enough water and chemicals assuming a worst case of all possible equipment being cleaned simultaneously. Process modeling is useful in both sizing equipment accurately and driving the design of process development from an economic standpoint. WFI use is one of the most frequent causes of downtime in the suite and the highest cost of a CIP.

Automation: The system should be robustly automated, with no nuisance alarms or automation-related stoppages and minimal hand-valve manipulation. Not only do hand-valve manipulations add to the risk of operator error, but they may result in safety hazards if a vent line is required to be opened to the room, a port is left uncapped, or a valve, which is part of a facility header, is opened to combine CIP process materials with other operations. Realistically, most systems are sub-optimal in this regard, most frequently in utility supply capacity and automation. Older facilities are expanded for additional products, but manufacturing must continue in the original suites to support product demand, causing utility upgrades to be postponed. It is helpful to revisit the issue of lost productivity time due to stopped and postponed CIPs after three to six months of production and compare it to the cost of the upgrade to determine a timeline for RODI and/or WFI system improvements.

Process Modeling: Economic savings can also be realized via process modeling; water treatment and consumption become significant issues at scale. USP grade water may be substituted for all but the last rinse phase, and air blows or gravity drain steps between phases can minimize the water requirement, but further water savings can often be realized with process development


to limit wash-time requirements, rinse timing, and waste due to paused cycle recipes. There are risks unique to scale that may also be predicted by scale-down models; thermal expansion coefficient differences between fittings and gaskets, especially in unusual fittings such as auto sampling devices, should be tested in scale-down models prior to implementation. Time is a frequently overlooked scale factor; as automation pauses, tank fill rates (especially when other operations utilize the same resources) can cause a CIP recipe to run considerably longer with longer periods between steps, during which surfaces may dry, causing soil to be more difficult to remove.

Equipment Set-Up: Equipment set-up must consider ergonomics not only as part of the JHA process but also in terms of suite time. Spool pieces that are inaccessible without ladders or fall protection, filter housings that require hoist assistance, and misaligned piping can take several times longer to set up than smaller parallel filter housings and sample devices in a readily accessible area, just as manual valve manipulations take longer and add more risk than a thoroughly validated automation system that merely requires users to push a few buttons. When designing a system within budget constraints, the balance of affordability vs. risk must take error rates into account by comparing the proposed process design with similar operations and their error rates. Periodic reviews of process performance, whether monthly or annually, should include failure rates and causes for this reason.

Construction Realities: The economics of construction can also result in a less than perfect sanitary design; higher flow rates, changes in flow direction or flow velocity, and manual cleaning during the equipment set-up can be used to mitigate the lack of surface contact and flow. Importantly, observing joints for leaks should never be used as a substitute for pressure testing to determine the sanitary isolation of a system; small gaps or cracks, in a pipe with high liquid velocity, can behave like a venturi and introduce non-process air into the system, bringing contaminants with it.

Personnel: Planning for a CIP process transfer must consider the capabilities of the recipients and operators. A production suite is not a lab staffed by highly educated or technical engineers and scientists. When there are problems, it is not likely that the expertise to diagnose and work through the problem will be readily available. Even if there is a readily available solution, it may not be possible to implement the solution for regulatory reasons. Manufacturing operators may have minimal technical education. Management may have a business background rather than a scientific degree. The process robustness and operational expectations should be designed for the appropriate level of expertise, not the assumption that a skilled, experienced, highly educated engineer or scientist will be running operations. Until a process is completely validated and has run several batches through to completion with minimal equipment, automation, or operational errors, a process development engineer should be available to assist in the troubleshooting and scale-up process during operations.

Conclusion
While the overall process may seem complex, the tools to conduct Stage 1 process validation activities are available, have been used extensively throughout other industries, and are well defined with considerable precedent. Above all, the use of these tools couples synergistically with robust planning and risk assessment activities. Ultimately, effective FDA Stage 1 work by an appropriate project team, including risk analysis and DOE, will identify the critical process variables, the interactions between them, and how the process responds to changes in each. These activities, when conducted, documented, and recorded properly, will serve the entire process validation continuum from qualification to maintenance. JVT

REFERENCES
1. J. Hyde, A. Hyde, P. Pluta, "FDA's 2011 Process Validation Guidance: A Blueprint for Modern Pharmaceutical Manufacturing," Journal of GXP Compliance 17 (4), 2013, available here.
2. FDA, Guidance for Industry, Process Validation: General Principles and Practices (Rockville, MD, Jan. 2011).
3. J.A. Wass, "First Steps in Experimental Design - The Screening Experiment," Journal of Validation Technology 16 (2), 49-57, 2010.
4. ISPE, ISPE Baseline Guide: Risk-Based Manufacture of Pharmaceutical Products (Risk-MaPP), 2010.
5. ISPE, ISPE Baseline Guide: Risk-Based Manufacture of Pharmaceutical Products (Risk-MaPP), 2010.
6. ICH, Q9, Quality Risk Management.
7. PQRI, HACCP Training Guide, Risk Management Training Guides, available here.
8. M. Rausand, Risk Assessment, Preliminary Hazard Analysis (PHA) Version 1.0, available here.
Originally published in the Autumn 2014 issue of Journal of Validation Technology



Peer Reviewed: Statistical Viewpoint

First Steps in Experimental Design II: More on Screening Experiments | IVT
John A. Wass

Statistical Viewpoint addresses principles of statistics useful to
practitioners in compliance and validation. We intend to present
these concepts in a meaningful way so as to enable their applica-
tion in daily work situations.
The comments, questions, and suggestions of the readers are
needed to help us fulfill our objective for this column. Please contact
our coordinating editor Susan Haigney at shaigney@advanstar.com
with comments, suggestions, or manuscripts for publication.

KEY POINTS
The following key points are discussed:
Design of experiments (DOE) consists of three basic stages: screen-
ing to identify important factors, response surface methodology to
define the optimal space, and model validation to confirm predic-
tions.
A critical preliminary step in the screening stage is for subject mat-
ter experts to identify the key list of factors that may influence the
process.
A DOE design consists of a table whose rows represent experi-
mental trials and whose columns (vectors) give the corresponding
factor levels. In a DOE analysis, the factor level columns are used to
estimate the corresponding factor main effects.
Interaction columns in a design are formed as the dot product of
two other columns. In a DOE analysis, the interaction columns are
used to estimate the corresponding interaction effects.
When two design columns are identical, the corresponding factors
or interactions are aliased and their corresponding effects
cannot be distinguished.
A desirable feature of a screening design is orthogonality in which
the vector products of any two main effect or interaction columns
sum to zero. Orthogonality means that all estimates can be ob-
tained independently of one another.
DOE software provides efficient screening designs whose columns
are not aliased and from which orthogonal estimates can be ob-
tained.
Fractional factorial screening designs include fewer trials and may
be more efficient than the corresponding full factorial design.
The concept of aliasing is one of the tools that can be used to con-
struct efficient, orthogonal, screening designs.
Center points are often included in screening designs to raise


the efficiency and to provide a measure of replication error and lack of model fit.
The order of running and testing experimental trials is often randomized to protect against the presence of unknown lurking variables.
Blocking variables (such as day or run or session) may be included in a design to raise the design efficiency.
Factor effects in screening designs may be missed because they were not included in the screening experiment, because they were not given sufficiently wide factor ranges, because the design was underpowered for those factors, because trial order was not properly randomized or blocked, or because of an inadequate model.

INTRODUCTION
This article is the second in a series that deals with the specialized types of screening designs (1). These designs have been developed to most efficiently accept many inputs that may or may not be relevant to the final product and reduce this list to those few that are most important. Once the results are confirmed, the analyst proceeds to the response surface designs to map the fine detail in the area of optimal response (i.e., decide on the most desirable values of the inputs to get the optimal output of whatever is being manufactured or controlled). The three most important targets usually sought are optimal concentrations, variance reduction, and robustness.

THEORY
Most screening designs are resolution III designs, where main effects are not aliased (confounded) with each other, but the main effects are aliased with the two-way interactions. Factor levels are briefly discussed in the following sections. At this point the reader may wish to review the previous article in the series (1) to re-examine the importance of randomization and replication. Randomization ensures the independence of the observations. Replication assesses variation and more accurately obtains effect estimates.

Before the models (designs) are run, it may be advantageous to decide on design level, blocking (if any), and data transformation (if necessary). Let's examine transformations first, as this is a common problem.

Data Transformation
Transformations are usually employed to stabilize the response variance, make the distribution of the response variable more normal, or improve the fit of the model to the data (2). Note that more than one of these objectives may be simultaneously achieved, and the transformation is many times done with one of the power family (y* = y^λ, where λ is the transforming parameter to be determined; e.g., if λ = 1/2, take the square root of the response variable). The most useful approach has been found to be the Box-Cox procedure, which estimates λ and the other model parameters simultaneously by the method of maximum likelihood. Modern software does this automatically. If the analyst prefers to choose the value of λ, simple values are preferred because, for example, the real-world difference between λ = 0.50 and λ = 0.58 may be small, but the square root is much easier to interpret. Also, if the optimal value of λ is determined to be close to one, no transformation may be necessary.

Blocking
It is often advantageous to minimize or eliminate variability contributed by factors of little or no interest even though they affect the outcome. These nuisance factors may be reduced by a technique called blocking. By grouping these nuisance variables and reducing system variability, the precision of factor (of interest) comparisons is increased. In the example of the chemical reaction in our previous article, if several batches of the substrate are required to run the design, and there is batch-to-batch variability due to supplier methodology, we may wish to block the substrate by supplier, thus reducing the noise from this factor. We tend to consider a block as a collection of homogeneous conditions. In this case, we would expect the difference between different batches to be greater than those within a single batch (supplier). Please note, if it is known or highly suspected that within-block variability is about the same as between-block variability, paired analysis of means will be the same regardless of which design may be used. The use of blocking here would reduce the degrees of freedom and lead to a wider confidence interval for the difference of means.

Factor Levels
The last preliminary item of importance is choosing factor levels. There are an infinite number of values for any continuous variable, and a restricted, but usually large, number for categorical variables. In general, when the objective is to determine the small number of factors that are important to the outcome or characterize the process, it is advisable to keep factor levels low; usually two works well. This is because a full factorial experiment requires k^F runs (F = number of factors, k = number of levels per factor), and as the levels of each factor rise, the number of runs increases dramatically. The drama intensifies further if interactions are included.
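As an aside to the Data Transformation discussion above, the following minimal Python sketch (assuming SciPy is installed; the response values are invented for the demonstration) shows the Box-Cox step: the procedure estimates λ by maximum likelihood, and a nearby simple value such as λ = 0.5 can then be applied if it is easier to interpret.

    import numpy as np
    from scipy import stats

    # Hypothetical, strictly positive response data (Box-Cox requires y > 0).
    rng = np.random.default_rng(42)
    y = rng.lognormal(mean=1.0, sigma=0.6, size=40)

    # Estimate the transforming parameter lambda by maximum likelihood.
    y_transformed, lam = stats.boxcox(y)
    print(f"Estimated lambda: {lam:.2f}")

    # If a simple nearby value is preferred (0.5 = square root), apply it directly.
    y_sqrt = stats.boxcox(y, lmbda=0.5)
    print("Applied fixed lambda = 0.5 (square-root transform)")

If the estimated λ comes back close to one, the data can be left untransformed, as noted above.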


TECHNIQUES: THE DESIGNS
The following are three widely-used screening designs:
Randomized blocks and fractional factorial designs
Nested and split-plot designs
Plackett-Burman (P-B) designs.

Randomized Blocks and Fractional Factorial Designs
As was stated, similar batches of relatively homogeneous units of data may be grouped. This grouping restricts complete randomization, as the treatments are only randomized within the block. By blocking we lose degrees of freedom but have eliminated sources of variability and hopefully gained a better understanding of the process. We cannot always identify these nuisance factors. But by randomization, we can guard against the effects of these factors, as their effects are spread or diluted across the entire experiment.

If we remember our chemical process experiment from the previous article, we had two reagents with two levels of an enzyme, temperature, and mix speeds. We added a center point to check for curvature and ran a single replicate for each point and blocked across four days (see Figure 1). We put all main factors plus an interaction into the model (see Figure 2). Parameter estimates in Figure 2 told us that the blocks were not significant, and when we rerun the model without a block effect, we see the results in Figure 3. Although the parameter estimates in Figure 3 show little difference, the enzyme component is closer to significance. Again, this may represent a power problem or a design flaw (we needed a wider enzyme range). In this example, we may not have needed to block, but it is always wise to test if an effect is suspected or anomalous results are encountered.

Figure 1: Screening design and test data

Fold-over. There is a specialized technique within this group called fold-over. It is mentioned because the analyst may find it useful for isolating effects of interest. It is performed by switching certain signs in the design systematically to isolate these effects. The signs are changed (reversed) in certain factors of the original design to isolate the one of interest in an anti-aliasing strategy. The name derives from the fact that it is a fold-over of the original design. The details are beyond the scope of this introductory article but may be found in standard references (2, 3).

Latin-square design. Yet another specialized technique used with fractional factorial designs is the Latin-square design. This design utilizes the blocking technique on one or more factors to reduce variation from nuisance factors. In this case the design is an n x n square where the number of rows equals the number of columns. It has the desirable property of orthogonality (independence of the factors, great for simplifying the math and strengthening the conclusions). Unfortunately, the row and column arrangement represents restrictions on randomization, as each cell in the square contains one of the n letters corresponding to the treatments, and each letter can occur only once in each row and column. The statistical model for this type of design is an effects model and is completely additive (i.e., there is no interaction between the rows, columns, and treatments).

Saturated design. One last term that the novice may bump up against is the concept of a saturated design, unfortunately all too common in the industrial world. This refers to a situation where the analyst is attempting to include many variables in the model and has few runs (translating ultimately to too few degrees of freedom) to support the analysis. This allows for estimation of main effects only. In some cases, all interactions may be aliased with the main effects, thus condemning the analyst to missing important factors and interactions. If it is not possible to increase the number of runs, it is a good idea to call in subject matter experts (SMEs) (usually chemists or engineers) to assist in eliminating variables.
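To make the aliasing idea concrete before moving on, here is a short, generic Python sketch (not the chemical-process design analyzed above) that builds a 2^(3-1) half-fraction from the defining relation I = ABC, so that factor C is deliberately set equal to the AB interaction column. The check at the end confirms that C and AB are identical (aliased) while the main-effect columns remain pairwise orthogonal.

    import itertools
    import numpy as np

    # Full 2^2 design in factors A and B, coded -1/+1.
    base = np.array(list(itertools.product([-1, 1], repeat=2)))
    A, B = base[:, 0], base[:, 1]

    # Half-fraction of a 2^3 design via I = ABC, i.e., set C = A*B.
    C = A * B
    design = np.column_stack([A, B, C])
    print("2^(3-1) design (rows = runs, columns = A, B, C):")
    print(design)

    # The AB interaction column is the elementwise product of columns A and B.
    print("C aliased with AB:", np.array_equal(C, A * B))

    # Orthogonality: the dot product of any two distinct columns sums to zero.
    for i, j in itertools.combinations(range(3), 2):
        print(f"column {i} . column {j} =", int(design[:, i] @ design[:, j]))

In such a design the estimate labeled "C" is really C plus AB, which is exactly the Resolution III behavior described for screening designs in this series.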
Nested and Split-Plot Designs
These designs are widely used in many industries. They introduce the need for random factor designation and the joys of variance component analysis. The former refers to those factors that may be taken as an adequate representative of a larger population.


Figure 2: Actual by predicted plot

Figure 3: Actual by predicted plot


Figure 4: Actual by predicted plot

Figure 5: Residual by predicted plot
Figure 6: Normal plot

For example, if two instruments are used in an experiment to characterize performance because only two were available, the results may not be generalized to the population of 100 instruments that were manufactured, and the factor "instrument" is not random; therefore, we classify it as a fixed effect. If, however, 20 instruments were available and 7 or 8 were chosen at random, the results are much more likely to represent the population and the factor may be considered random. As the minimal number needed may be calculated from sampling theory and power studies, a statistician may be consulted if there are no industry recommendations.

Variance component analysis involves the calculation of the expected mean squares to determine how much of the total system variance is contributed by each term in the model (including the error term).


Nested design. When levels of an effect B only occur within a single level of an effect A, then B is said to be nested within A (4). This may be contrasted with crossed effects, which are interactions (i.e., the results of one factor are dependent upon the level of another factor). Nested designs are sometimes referred to as hierarchical designs. In our example of the chemical process, if we only had several temperatures and mix speeds available, we might wish to check the effects of using only certain mix speeds with certain temperatures. This is easily done by nesting mix speed within temperature, designated as mix speed[temp]. When the model is analyzed this way, we get the results in Figure 4.

The fit is better only because we now have categorical variables, less to fit, and many other factors are significant. We have, however, lost degrees of freedom by nesting terms, and this may negatively affect power. We would then use residual (error) analysis as our diagnostic tool, followed by standard checks such as normal probability plots, outlier checks, and plotting the residuals versus fitted values (see Figures 5 and 6). Both of the diagnostics in Figures 5 and 6 exhibit problems (i.e., increasing residuals with predicted values and many off-axis values on the normal plot). These may be due to singularities during calculations (e.g., terms going to infinity or division by zero). We may wish to increase the runs to see if the increasing degrees of freedom will stabilize the values.

Split-plot design. In some experiments, due to real-world complications, the run order may not be amenable to randomization and we need to use a generalization of the factorial design called the split-plot design. The name refers to the historical origins of the design in agriculture and posits splitting some factor into sub-factors due to some problem with running a full factorial design or data collection method (e.g., different batches on different days). Therefore, we are running the experiment as a group of runs where within each group some factors (or only one) remain constant. This is done as it may be very difficult or expensive to change these factors between runs. In our example, we can declare the enzyme prep and temperature as difficult to change. The software then optimally designs the experiment around 5 plots in just 10 runs, far fewer than even a fractional factorial design (see Figure 7). It declares only the plots as random so they take up all of the variance. The plots are split by the enzyme and mix speed, as these have been declared hard to change and are the subplots. As we have only the one random effect, we test all others as fixed effects (see Table).

Table: Fixed effect tests

The results are essentially the same as for the fractional factorial, as the design may contain similar flaws.

Plackett-Burman (P-B) Designs
P-B designs are specialized screening designs where the number of runs is not required to be a power of two. If there are funds for extra runs whose number does not increase by a power of two, these designs are ideal, as they also generate columns that are balanced and pairwise orthogonal. They are based upon a very flexible mathematical construct called a Hadamard matrix, where the number of runs increases as a multiple of four and thus will increase much more slowly than the fractional factorial. Note that these are Resolution III designs, where the main effects are not aliased with each other but are aliased with any two-way interactions. The great advantage of these designs is the ability to evaluate many factors with few runs. The disadvantages involve the assumptions made (i.e., that any interactions are not strong enough to mask main effects and that any quadratic effects are closely related to factor linear effects). Although these assumptions usually hold, it is always best to try to verify them with any available diagnostics. Again, for our system the P-B design is seen in Figure 8. The analysis results are seen in Figure 9. It appears that P-B may not be a good design choice here, as it requires more runs and is less sensitive to the present data structure than simpler designs.
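For readers curious how a P-B design is actually generated, the sketch below (Python; independent of the JMP output shown in Figure 8) builds the classic 12-run design by cyclically shifting a generator row and appending a final all-minus run, then verifies the balanced, pairwise-orthogonal columns mentioned above. The generator row is an assumption taken from standard published tables, not from this article, so compare the result with your own software before using it.

    import itertools
    import numpy as np

    # A commonly tabulated generator row for the 12-run Plackett-Burman design (11 factors).
    generator = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

    # Runs 1-11 are cyclic shifts of the generator; run 12 is all -1.
    rows = [np.roll(generator, shift) for shift in range(11)]
    rows.append(np.full(11, -1))
    design = np.array(rows)
    print("Plackett-Burman design, 12 runs x 11 factors:")
    print(design)

    # Verify pairwise orthogonality: every pair of distinct columns has a zero dot product.
    orthogonal = all(design[:, i] @ design[:, j] == 0
                     for i, j in itertools.combinations(range(11), 2))
    print("All column pairs orthogonal:", orthogonal)

Note how the run count (12) is a multiple of four rather than a power of two, which is the economy the Hadamard-matrix construction buys.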
SYNOPSIS: BUT WHICH DO I USE?
The following provides a pathway for application and selection of appropriate screening designs. The following questions are addressed:
Which screening design should be used when resources are a consideration?
Which screening design should be used when flexibility is needed regarding variables to be tested?
What are the advantages of the respective designs regarding special needs (e.g., reducing noise, blocking, and other needs)?


Figure 7: Screening design and test data

Figure 8: Plackett-Burman design.


Figure 9: Plackett-Burman design results.


Figure 10: A flow diagram that may be used to answer these questions.

SOFTWARE
There are numerous software products available to assist the practitioner in design and analysis of their experiments. The author has had experience with the following commercial packages:
Design Expert (www.statease.com)
GenStat (www.vsni.co.uk)
JMP (www.jmp.com)
Minitab (www.minitab.com)
MODDE (www.umetrics.com)
STATISTICA (www.statsoft.com)
SYSTAT (www.systat.com)
Unscrambler (www.camo.no).

CONCLUSIONS
Modern experimental design is sometimes art as well as science. It is the objective of this column to acquaint the reader with the rudiments of the screening design, introduce them to the nomenclature, and supplement the learning experience with a real-world example.

REFERENCES
1. Wass, John A., "Statistical Viewpoint: First Steps in Experimental Design - The Screening Experiment," Journal of Validation Technology, Volume 16, Number 2, Spring 2010.
2. D. C. Montgomery, Design and Analysis of Experiments (5th ed.), John Wiley, 2001.
3. G.E.P. Box, J.S. Hunter, and W.G. Hunter, Statistics for Experimenters (2nd Ed.), Wiley Interscience, 2005.
4. JMP Design of Experiments Guide, Release 7, SAS Institute Inc., 2007.

GENERAL REFERENCES
S.R. Schmidt and R.G. Launsby, Understanding Industrial Designed Experiments (4th ed.), Air Academy Press, 1997.
G.E.P. Box, J.S. Hunter, and W.G. Hunter, Statistics for Experimenters (2nd Ed.), Wiley Interscience, 2005.
D. C. Montgomery, Design and Analysis of Experiments (5th ed.), John Wiley, 2001.
JMP Design of Experiments Guide, Release 7, SAS Institute Inc., 2007.
ECHIP, Reference Manual, Version 6, ECHIP Inc., 1983-1993.
Deming, S.N., "Quality by Design (Part 5)," Chemtech, pp 118-126, Feb. 1990.
Deming, S.N., "Quality by Design (Part 6)," Chemtech, pp 604-607, Oct. 1992.
Deming, S.N., "Quality by Design (Part 7)," Chemtech, pp 666-673, Nov. 1992. JVT

ARTICLE ACRONYM LISTING
DOE: Design of Experiments
P-B: Plackett-Burman
SME: Subject Matter Experts

Originally published in the Winter 2011 issue of Journal of Validation Technology



Peer Reviewed: Statistical Viewpoint

A Further Step in
Experimental Design (III):
The Response Surface
John A. Wass

Statistical Viewpoint addresses principles of statistics useful to prac-
titioners in compliance and validation. We intend to present these
concepts in a meaningful way so as to enable their application in daily
work situations.
Reader comments, questions, and suggestions are needed to help us
fulfill our objective for this column. Please send any comments to man-
aging editor Susan Haigney at shaigney@advanstar.com.

KEY POINTS
The following key points are discussed:
Design of experiments (DOE) consists of three basic stages: screen-
ing (to identify important factors), response surface methodology
(to define the optimal space), and model validation (to confirm
predictions).
A critical preliminary step in the screening stage is for subject mat-
ter experts to identify the key list of factors that might influence the
process.
A DOE design consists of a table whose rows represent experimental
trials and whose columns (vectors) give the corresponding factor
levels. In a DOE analysis, the factor level columns are used to esti-
mate the corresponding factor main effects.
Interaction columns in a design are formed as the dot product of
two other columns. In a DOE analysis, the interaction columns are
used to estimate the corresponding interaction effects.
When two design columns are identical, the corresponding factors
or interactions are aliased and their corresponding effects cannot be
distinguished.
The order of running and testing experimental trials is often ran-
domized to protect against the presence of unknown lurking
variables.
Blocking variables (e.g., day or run or session) may be included in a
design to raise the design efficiency.
Factor effects may be missed because they were not included in the
original screening experiment, because they were not given suf-
ficiently wide factor ranges, because the design was underpowered
for those factors, because trial order was not properly randomized
or blocked, or because of an inadequate model.
Unusual interactions and higher-order effects occasionally may
be needed to account for curvature and work around regions of
singularity.
Where there are inequality constraints (e.g., areas where standard
settings will not work), special designs are needed.

The designs may become rather challenging and


a statistician becomes an invaluable part of the
team when considering problems of non-normal
responses, unbalanced data, specialized covari-
ance structures, and unusual or unexpected
physical or chemical effects.

INTRODUCTION
Response surface methodology (RSM) is the development of the specific types of special designs to most efficiently accept a small number of inputs (relative to screening designs) that are known to be relevant to the final product and optimize a process result to a desired target (1). Once the results are confirmed, the analyst's load becomes lighter (excepting in the case of non-reproducibility or results drifting out of specification). In effect, the response surface maps the fine detail in the area of optimal response (i.e., determines the most desirable values of the inputs to get the optimal output of whatever is being manufactured, controlled, or studied). The three most important targets usually sought are optimal concentrations, variance reduction, and robustness (2). The adequacy of the model is most often checked by residual analysis, influence diagnostics, and lack-of-fit testing (3). JMP 9 is utilized herein for the design and analysis of an industrial example (4).

THEORY
Many response surface designs are collections of specialized statistical and mathematical techniques that have been well implemented in software using efficient algorithms (4, 5). In many real world cases the output includes more than one response, and these need not be continuous functions. Let's examine the case of a chemical engineer who wishes to maximize an important property (y) based on given levels of two chemical inputs, (x1) and (x2). The desired property is now a function of the two chemical entities plus error (ε), as follows:

y = f(x1, x2) + ε

The surface is represented by the following:

E(y) = f(x1, x2)

The response surface is usually displayed graphically as a smoothly curving surface, a practice that may obscure the magnitude of local extremes (see Figure 1).

Figure 1: Surface point.

In many problems using RSM, the experimenter does not know the exact mathematical form of the relationship between the input and output variables and, therefore, must find a workable approximation. The first guesstimate is a low order polynomial (e.g., first order model), as follows:

y = β0 + β1x1 + β2x2 + … + βnxn + ε

Obviously, this is a linear model that will not accommodate curvature. If curvature is suspected, a higher order polynomial may be tried. The following is a second order model:

y = β0 + Σ βixi + Σ βiixi² + ΣΣ βijxixj + ε

where the sums run over i, and over i < j for the cross-product terms. The above two models are found to work well in a variety of situations. However, although these may be reasonable approximations over the entire design space, they will not be an exact fit in all regions. The goal is to find a smaller region of interest where they fit well. A proper experimental design will result in the best estimate of the model parameters. The response surface is usually fitted by a least squares procedure, and optimal values are pursued through sequential algorithms such as the method of steepest ascent. The details of these algorithms are beyond the scope of this series.

Another caveat concerns the danger of using the software tools with an incomplete understanding of the problem. It is sometimes tempting to overlook expert advice from the chemists or engineers and try to force-fit a simple solution. As in the real world, we often find extremely messy data. It might be necessary to use a cubic factor, a three-way interaction, or a highly customized design. At this point the reader may wish to review the previous article in the series (Journal of Validation Technology, Winter 2011) to re-examine the use of data transformations, blocking, and factor levels.
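The first- and second-order approximations above can be fit by ordinary least squares once a model matrix is built. The short sketch below is illustrative only (the article itself uses JMP 9, and the data values here are invented); it constructs the quadratic model matrix for two coded inputs and solves for the coefficients.

```python
import numpy as np

# Invented example data: two coded inputs and a measured response.
x1 = np.array([-1, -1, 1, 1, 0, 0, -1, 1, 0], dtype=float)
x2 = np.array([-1, 1, -1, 1, 0, 0, 0, 0, 1], dtype=float)
y  = np.array([55, 60, 62, 70, 75, 74, 68, 72, 69], dtype=float)

# Second-order model: y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2 + error
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# Least squares solution of X b = y
b, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, est in zip(["b0", "b1", "b2", "b11", "b22", "b12"], b):
    print(f"{name:>4s} = {est:7.3f}")
```

Dropping the squared and cross-product columns reduces this to the first-order model shown earlier.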

TECHNIQUES: THE DESIGNS
There are a variety of specialized designs and analytic techniques available for RSM (6). In this review, we will concentrate on four of the more popular (i.e., commonly used). One will be chosen to analyze standard but messy industrial data.

Factorial Designs
These are one of the most common designs used in situations where there are several factors and the experimenter wishes to examine the interaction effects of these on the response variable(s). The class of 2^k factorial designs is often used in RSM and finds wide application in the following three areas (6):
As a screening design prior to the actual response surface experiments
To fit a first-order response surface model
As a building block to create other response surface designs.
For more details on these designs, the reader is again referred to the article on screening designs in this series.

D-Optimal Designs
This is a class of designs based on the determinant of a matrix, which has an important relationship to the moment matrix (M):

M = X′X/N,

where X′ is the transpose of X, the design matrix of inputs, and N is the number of rows in X. It is noted that the inverse of the moment matrix, M⁻¹ = N(X′X)⁻¹, contains the variances and covariances of the regression coefficients scaled by N and divided by the error variance. Therefore, control of the moment matrix implies control of the variances and covariances, and here is the value of these D-optimal designs (5).
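To make the determinant criterion concrete, the sketch below (an illustration, not part of the original article; the candidate points are invented) computes M = X′X/N for two candidate four-run designs of a first-order, two-factor model and compares their determinants. The candidate with the larger determinant is the more D-efficient choice.

```python
import numpy as np

def moment_det(points):
    """Return det(M) where M = X'X / N for a first-order model with intercept."""
    pts = np.asarray(points, dtype=float)
    X = np.column_stack([np.ones(len(pts)), pts])   # columns: 1, x1, x2
    M = X.T @ X / len(pts)
    return np.linalg.det(M)

# Candidate A: corners of the coded square (a 2^2 factorial).
cand_a = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
# Candidate B: points crowded near the center.
cand_b = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]

print("det(M), candidate A:", round(moment_det(cand_a), 4))
print("det(M), candidate B:", round(moment_det(cand_b), 4))
# The factorial corners give the larger determinant, i.e., smaller joint
# uncertainty in the estimated coefficients -- the idea behind D-optimality.
```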
Box-Behnken Designs
These are three-level designs for fitting response surfaces. They are made by combining 2^k factorial designs with incomplete block designs. They possess the advantages of run efficiency (i.e., a small number of required runs) and rotatability. This last property ensures equivalence of variance for all points in the design space equidistant from the center. The design is constructed by placing each factor, or independent variable, at one of three equally spaced values from the center. By design then, the estimated variances will depend upon the distance of the points from the center (Figure 2).

Figure 2: Box-Behnken design.

It is noticed that the Box-Behnken design has no points at the vertices or corners of the design space. This is a plus when it is desired to avoid those areas because of engineering constraints, but it unfortunately places a higher prediction uncertainty in those areas.

Central Composite Designs
The central composite design (CCD) is a popular and efficient design for fitting a second-order model (Figure 3).

Figure 3: Central composite design.

The CCD most often consists of a 2^k factorial portion (nf runs), 2k axial runs, and nc center runs. The experimenter must specify the distance from the axial runs to the design center (this is done by many software platforms) and the number of center points. As it is
desirable to have a reasonably stable variance at the points of interest, the design should have the property of rotatability.

Figure 4: CCD data.
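As a hedged illustration of how a CCD is assembled (the article builds its 28-run design in JMP 9; the code below is a generic sketch, not that design), the following Python builds a face-centered CCD for three factors: a 2^3 factorial portion, 2k axial runs on the cube faces (alpha = 1), and replicated center points.

```python
import itertools
import numpy as np

def face_centered_ccd(k=3, n_center=3):
    """Return the coded design matrix of a face-centered CCD (alpha = 1)."""
    factorial = np.array(list(itertools.product([-1, 1], repeat=k)), dtype=float)
    axial = []
    for j in range(k):                 # two axial runs per factor
        for a in (-1.0, 1.0):
            pt = np.zeros(k)
            pt[j] = a
            axial.append(pt)
    center = np.zeros((n_center, k))
    return np.vstack([factorial, np.array(axial), center])

design = face_centered_ccd()
print(design.shape)        # (8 + 6 + 3, 3) = (17, 3) runs
print(design)
```

For a rotatable (rather than face-centered) CCD, the axial distance would be set to alpha = (2^k)^(1/4) instead of 1.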

CCD: An Example of the Response Surface Technique
For an example, the use of the central composite design is employed in yet another industrial chemical process. The chemists are now concerned with a formulation that is only slightly different than that used in the screening design, and concentrations have changed due to knowledge gathered from the first series of experiments. We now wish to use this new knowledge to optimize the process. The experiment is designed in JMP 9 software, then run, and the data appear as seen in Figure 4.
The design includes 28 runs with a center point as well as face centers on the design hypercube (there are more than three dimensions). When these data are analyzed, it is apparent that, while the model is significant (analysis of variance [ANOVA] p = 0.0275) and there is no significant lack of fit (p = 0.9803), the adjusted R² is low (~0.41), and more terms might need to be added to the model (Figure 5).

Figure 5: Actual by predicted plot.

There is some confidence in the integrity of the data, as the plot of errors (by increasing Y value) shows randomness rather than any defining patterns that would indicate a problem (Figure 6).

Figure 6: Residual by predicted plot.
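A comparable check can be run outside JMP. The sketch below uses invented data and factor names (only the general workflow follows the text); it fits a second-order model with statsmodels, reports the overall model p-value and adjusted R², and plots residuals against predicted values to look for the kind of non-random pattern discussed above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Invented CCD-style data for two coded factors.
df = pd.DataFrame({"x1": rng.uniform(-1, 1, 28), "x2": rng.uniform(-1, 1, 28)})
df["y"] = 50 + 4*df.x1 + 2*df.x2 + 3*df.x1*df.x2 - 2*df.x1**2 + rng.normal(0, 1.5, 28)

# Full quadratic (second-order) response surface model.
fit = smf.ols("y ~ x1 + x2 + I(x1*x2) + I(x1**2) + I(x2**2)", data=df).fit()
print(f"model p-value    : {fit.f_pvalue:.4f}")
print(f"adjusted R-square: {fit.rsquared_adj:.3f}")

# Residual-by-predicted plot: a random scatter about zero supports the model.
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("predicted")
plt.ylabel("residual")
plt.show()
```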

It appears from the sorted parameter estimates that what the chemists had long suspected was true (i.e., the importance of Reagent 1, the enzyme, and the reaction temperature to the output). See Figure 7. This is also seen on the normal plot, as these three parameters are flagged (Figure 8).

Figure 7: Sorted parameter estimates.
Figure 8: Normal plot.

The interaction profiles give evidence of a putative (expected) interaction between the enzyme and reaction temperature, but this did not appear significant from the parameter estimates.
Use of such tools as prediction profilers will allow the optimization of the process, which in this case, as seen by the position of the dotted red lines, is accomplished by maximizing the values of Reagent 1, enzyme, and temperature. Reagent 2 does not seem important in this respect (flat slope) (Figure 9).

Figure 9: Prediction profiler.

The design space is easily visualized on the surface plot, which corroborates the results of the profiler (Figure 10). When the figure is rotated, the surface curvature is seen. It is apparent by the absence of lack of fit that the inclusion of interaction and quadratic terms in the model was justified (Figure 11).

Figure 10: Surface plot.
Figure 11: Surface curvature.
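A prediction profiler essentially searches the fitted surface for the most desirable settings. As a rough stand-in (invented data again, not the article's JMP profiler), a simple grid search over the coded design region locates the factor combination with the highest predicted response from the fitted quadratic model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.uniform(-1, 1, 28), "x2": rng.uniform(-1, 1, 28)})
df["y"] = 50 + 4*df.x1 + 2*df.x2 + 3*df.x1*df.x2 - 2*df.x1**2 + rng.normal(0, 1.5, 28)
fit = smf.ols("y ~ x1 + x2 + I(x1*x2) + I(x1**2) + I(x2**2)", data=df).fit()

# Grid search over the coded design region [-1, 1] x [-1, 1].
g = np.linspace(-1, 1, 41)
grid = pd.DataFrame([(a, b) for a in g for b in g], columns=["x1", "x2"])
grid["pred"] = fit.predict(grid)

best = grid.loc[grid["pred"].idxmax()]
print(f"predicted maximum {best.pred:.1f} at x1={best.x1:.2f}, x2={best.x2:.2f}")
```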

Although there is good evidence that the model is adequate, the low R² suggests that input values may need to change, or more likely, factors may need to be added. There are many ways to model this system. A custom design may simplify the system.

Custom Designing
The chemists are comfortable with setting the temperature at a level that was determined as optimal in both theoretic calculations and previous batch experience. They now wish to zero in on concentrations for the two reagents and the enzyme. In most modern software with experimental design modules, it is possible to roll your own custom design to non-standard specifications. Suppose the new data set consists of the two reagents and the enzyme, with the reaction temperature and mix speed held at the pre-determined optima. The software designs a 15-run experiment, and the analyst fills in the data table output (Figure 12).

Figure 12: Data table output.

The model is then analyzed with the three reagents and a Reagent1*Enzyme interaction, as seen in Figure 13. The adjusted R² is now ~0.89, evidencing a good fit, and the ANOVA indicates a significant model (p = 0.014). In linear regression with a standard least squares fitting model, significance is a test of zero slope (i.e., a steep slope indicating good predictive ability). The effects test confirms that Reagent1 and the enzyme are most important to the reaction (Figure 14). Chemistry may dictate that Reagent2 is also important, but any of the concentrations used in the experiment will work. The exact figure will be based on convenience and cost. Although there is no overall significance to the Reagent1*enzyme interaction, the enzyme is necessary. Opening up the range of concentrations used would demonstrate this.

Figure 13: Three reagents and a Reagent1*Enzyme interaction.
Figure 14: Effects test.
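The custom-design analysis above amounts to a main-effects-plus-one-interaction least squares fit. The sketch below uses invented 15-run data (the names Reagent1, Reagent2, and Enzyme mirror the text, but the values are not the article's); it fits that model with statsmodels and prints an effects table analogous to Figure 14.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(7)
n = 15
df = pd.DataFrame({
    "Reagent1": rng.uniform(0.5, 2.0, n),
    "Reagent2": rng.uniform(0.5, 2.0, n),
    "Enzyme":   rng.uniform(0.1, 1.0, n),
})
df["Yield"] = (20 + 8*df.Reagent1 + 1*df.Reagent2 + 12*df.Enzyme
               + 2*df.Reagent1*df.Enzyme + rng.normal(0, 1.0, n))

# Main effects plus the Reagent1*Enzyme interaction.
fit = smf.ols("Yield ~ Reagent1 + Reagent2 + Enzyme + Reagent1:Enzyme", data=df).fit()
print(f"adjusted R-square: {fit.rsquared_adj:.2f}")
print(f"model p-value    : {fit.f_pvalue:.3f}")

# Effects test: type II sums of squares, one F-test per model term.
print(anova_lm(fit, typ=2))
```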

Again, a quick check of data integrity is evidenced by the random pattern of the residual by predicted plot (Figure 15).

Figure 15: Residual by predicted plot.

With this model, it is possible to produce an output that overlaps that of the larger CCD model. And this was accomplished in just 15 runs (Figure 16).

Figure 16: Prediction profiler.

The response surface here is a stair-step because of the linear nature of this model with non-continuous inputs. It again illustrates that maximal output is achieved through maximal concentrations of Reagent1 and the enzyme for the ranges used in this experiment (Figure 17).

Figure 17: Maximal output.

SOFTWARE
There are numerous software products available to assist the practitioner in design and analysis of their experiments. The author has had experience with the following commercial packages:
Design Expert (www.statease.com)
GenStat (www.vsni.co.uk)
JMP (www.jmp.com)
Minitab (www.minitab.com)
MODDE (www.umetrics.com)
STATISTICA (www.statsoft.com)
SYSTAT (www.systat.com)
Unscrambler (www.camo.no).

CONCLUSIONS
Modern experimental design is sometimes art as well as science. It is the objective of this column to acquaint the reader with the rudiments of the screening and response surface designs, introduce them to the nomenclature, and supplement the learning experience with real-world examples.

REFERENCES
1. S.R. Schmidt and R.G. Launsby, Understanding Industrial Designed Experiments (4th ed.), Air Academy Press, 1997.
2. G.E.P. Box, J.S. Hunter, and W.G. Hunter, Statistics for Experimenters (2nd ed.), Wiley Interscience, 2005.
3. D.C. Montgomery, Design and Analysis of Experiments (5th ed.), John Wiley, 2001.
4. JMP 9 Design of Experiments Guide, SAS Institute Inc., 2010.
5. ECHIP Reference Manual, Version 6, ECHIP Inc., 1983-1993.
6. R.H. Myers and D.C. Montgomery, Response Surface Methodology, John Wiley, 1995. JVT

ABOUT THE AUTHOR
John A. Wass is a consulting statistician with Quantum Cats Consulting in the Chicago area, as well as a contributing editor at Scientific Computing, and administrator of the Great Lakes JMP Users Group. He served for a number of years as a senior statistician at Abbott Laboratories (in both pharmaceuticals and diagnostics). John may be contacted by e-mail at john.wass@tds.net.

Originally published in the Autumn 2011 issue of Journal of Validation Technology

Linear Regression 101


Yanhui Hu

Statistical Viewpoint addresses principles of statistics useful to


practitioners in compliance and validation. We intend to present these
concepts in a meaningful way so as to enable their application in daily
work situations.
Reader comments, questions, and suggestions are needed to help us
fulfill our objective for this column. Please send any comments to man-
aging editor Susan Haigney at shaigney@advanstar.com.

KEY POINTS
The following key points are discussed:
Linear regression is a widely used statistical tool with numerous
applications in the pharmaceutical and medical device industries.
Fundamental concepts associated with linear regression are
discussed.
An example statistical analysis using simple linear regression of an
active pharmaceutical ingredient chemical reaction is described.
Data and analysis using Minitab software are discussed. Applica-
tions, predictions, other relevant calculations, and model diagnos-
tics are also discussed.

INTRODUCTION
Linear regression is one of the widely used statistical tools that have
applications in all areas of daily life. Does the size of the mutual fund
impact the annual return of the fund? Is there a relationship between
a person's income and his or her years of education? Does the median
particle size of the active pharmaceutical ingredient (API) correlate
to the drug release rate of the final dosage form? These questions and
many others aim to find a relationship (or correlation) between vari-
ables and can be addressed in part through linear regression analysis.

DEPENDENT VERSUS INDEPENDENT VARIABLES


Linear regression models the relationship between a dependent
variable, y, and a set of independent variables, x, through a linear
function. Simple linear regression consists of only one independent
variable, denoted as x, while multiple regression consists of more than
one independent variable. This article focuses on simple linear regres-
sion with a brief discussion of multiple regression. Strictly speaking,
independent variables can be controlled or selected; but more broadly,
they refer to variables that are used to predict or explain other vari-
ables. Therefore, oftentimes independent variables are also known as
input, explanatory, or predictor variables. Dependent variables cannot
be controlled or selected directly, and their values are dependent on
the values of the independent variables. Dependent variables are also
known as output, response, or explained variables. Throughout this
article, the independent variables are assumed to be fixed; that is,
the values of an independent variable are deterministic without any
random errors.

Figure 1: Scatter plot of API concentration vs. reaction time.
Figure 2: Least squares estimate.
Table: API concentration vs. reaction time.

Following the definition of the dependent and independent variables, it is natural to assume there exists such a relationship: y = f(x). A special case of the relationship is called a linear relationship, expressed in the following mathematical equation:

y = a + bx

where a is the intercept and b is the slope. Given a set of independent and dependent variables (x, y), the intercept and slope (a, b) of the relationship can be estimated, so the impact of x on y can be quantified, and the value of y can be predicted given a future value of x. This process is called simple linear regression.

SIMPLE LINEAR REGRESSION MODEL
The following is a hypothetical problem. An API synthesis route includes a coupling reaction step that combines two intermediates to produce the free base API. The scientists carefully monitored the formation of the free base API in the reactor to characterize the reaction kinetics. During the eight hours of reaction time, multiple samples (n = 14) were taken and analyzed using high-performance liquid chromatography (HPLC) to determine the free base concentration, as shown in the scatter plot (Figure 1). A scatter plot is a useful visualization tool and should always be used as the first step of the linear regression analysis. The independent variable (i.e., reaction time) is on the x axis, and the dependent variable (i.e., API concentration) is on the y axis. Visually there exists a linear relationship between the two variables. The data listing is provided in the Table.
The linear relationship between the reaction time (x) and API concentration (y) can be expressed statistically in the following equation:

yij = a + b·xi + eij [Equation 1]

Here, a and b stand for the unknown intercept and slope, respectively. The reaction time is denoted as xi, with i = 1, 2, …, 8 for the 8 hours that data were collected. The API concentration is denoted as yij, for the jth observation taken at the ith hour. For example, y12 equals 4.90 mg/mL, the second observation taken at the first hour; while y51 equals 29.27 mg/mL, the first and only observation taken at the 5th hour. It is acknowledged that the API concentration observations are not likely to be expressed by the linear function exactly, and there will be some discrepancy between the observation and the calculation from the linear function. This discrepancy is denoted as eij, known as the residual error.

SIMPLE LINEAR REGRESSION MODEL FITTING
The goal of linear regression is to find the optimal linear function (i.e., intercept and slope) that can best represent the given data set. First we need to define what is "best." An objective measure is that the best linear function is the closest to all data; therefore, the total residual error from all data points is the smallest. Because the residual error can be positive or negative for each data point, the sum of the squared residual error (SSE) becomes a good measure of the
total distance between the linear function and the data set. Therefore, the objective of the regression is to find the intercept and slope for which SSE is minimal. In statistics, this objective is called the least squares criterion. From Equation 1, the SSE can be defined as:

SSE = Σi Σj e²ij = Σi Σj (yij - a - b·xi)² [Equation 2]

The optimal estimates for intercept (a) and slope (b), to reach the SSE minimum, can be calculated analytically:

b = Σi Σj (xi - x̄)(yij - ȳ) / Σi Σj (xi - x̄)²,  a = ȳ - b·x̄ [Equation 3]

where x̄ and ȳ are the means of the xi and yij observations, respectively.
The computation of Equation 3 is built into all statistical software. Back to our example, the least squares estimates for the intercept and slope are 2.27 and 6.37, respectively (Figure 2).
As a part of regression line fitting, oftentimes people report R², the coefficient of determination, as a way of describing the degree of association between x and y. It can be shown mathematically that R² represents the portion of the variation in y that is associated with the variation of the predictor, x. When R² is 1, all data points lie perfectly on a straight line. When R² is 0, there is no linear association between x and y, and the regression line is flat with slope = 0. R² is always between 0 and 1; and the larger the R², the higher the linear association. In this example, R² is 99.0%, indicating there is a very strong linear relationship between reaction time and API concentration.
Figure 2 includes a term called adjusted R², which is slightly smaller than the R². For a simple regression with one predictor variable, the adjusted R² is determined by the following equation:

adjusted R² = 1 - (1 - R²)·(n - 1)/(n - 2)

where n is the number of data points. It is clear that adjusted R² is always smaller than R². For a simple linear regression, the use of adjusted R² is limited. However, for a multiple regression with many predictor variables, the adjusted R² can help to determine the best fit model.
The coefficient of determination, R², is so widely used (or misused sometimes) that it warrants additional discussion, as follows:
1. R² is only a measure of linear association between x and y. When there exists a nonlinear relationship, a high R² does not indicate the linear regression is a good fit, while a low R² does not indicate there is no association between x and y.
2. R² is affected by the spread of x. A high R² does not indicate the prediction has good precision.
3. R² is a measure of association only. A high R² does not imply a strong cause-and-effect relationship.

REGRESSION MODEL INFERENCE
Now that the best linear function that satisfies the least squares criterion has been determined, more questions can be asked. What is the rate of API formation? Is the rate significantly different from a constant (say 0)? Is the intercept (i.e., the initial API concentration at time 0) significantly different from zero? What will the API concentration be after seven hours of reaction? These questions are answered with statistical inferences.
The least squares estimates of the intercept (2.27) and slope (6.37) are point estimates. They are the best estimates based on the least squares criterion no matter what type of distribution the error term (eij) has. However, to make inferences, the form of the error distribution has to be assumed first. Typically, a normal error distribution is assumed, and it is quite often appropriate for most pharmaceutical applications. Statistically, the error term in Equation 1 becomes:

eij ~ independent N(0, σ²) [Equation 4]

that is, each error value is an independent draw from a normal distribution, with mean 0 and variance σ², which can be estimated as SSE/(n - 2) = Σ e²ij/(n - 2), where n is the total number of data points. The variance estimate is also called MSE (= SSE/(n - 2)), or mean squared error.
With the assumption of independent and identical normal error distribution, the dependent variable (y) is also a normal random variable, with mean of a + bx and variance of σ². Because the slope and intercept estimates (Equation 3) are linear functions of the dependent variable, they are normal random variables as well. It can be demonstrated that the means of the intercept and slope variables are the least squares point estimates defined by Equation 3. The variances of the intercept and slope variables are functions of the MSE and the independent variable. The linear regression example herein was analyzed using Minitab version 16. The model fitting output is provided in Figure 3. The point estimates (or mean) of slope and intercept are 6.37 and 2.27, respectively. The standard errors of the slope and intercept estimates are 0.19 and 0.95, respectively.
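The least squares formulas, R², and adjusted R² above are easy to reproduce in code. The sketch below uses illustrative values only (the article's full data table is not reproduced here); it fits a straight line with statsmodels and prints the quantities discussed in this section.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented reaction-time (h) and concentration (mg/mL) pairs for illustration.
df = pd.DataFrame({
    "time": [1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8],
    "conc": [8.2, 9.0, 15.1, 14.6, 21.3, 20.7, 27.9, 27.0,
             33.8, 40.1, 39.5, 46.2, 45.6, 52.3],
})

fit = smf.ols("conc ~ time", data=df).fit()
a, b = fit.params["Intercept"], fit.params["time"]
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print(f"R-square = {fit.rsquared:.3f}, adjusted R-square = {fit.rsquared_adj:.3f}")
print(f"MSE = {fit.mse_resid:.3f}")
print(fit.conf_int(alpha=0.05))   # 95% confidence intervals for a and b
```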

With these estimates, we can address the question "is the reaction rate (i.e., slope) significantly different from zero?" The null hypothesis for this question is Ho: b = 0, while the alternative hypothesis is Ha: b ≠ 0. Because the slope is a normal random variable, a t-test is appropriate to answer this question. The t-statistic, calculated as the ratio of the point estimate over the standard error of the estimate, is 34.3, as reported in Figure 3, and it is significant with p-value < 0.001. Therefore, we can conclude that the reaction rate is significantly different from zero. Alternatively, this question can be answered with the confidence intervals provided in the output. The 95% confidence interval for the slope is (5.97, 6.78). Because this interval excludes 0, the slope is significantly different from zero with 95% confidence.
Besides the coefficients table, the output (Figure 3) also includes an analysis of variance (ANOVA) table. This table lists the MSE estimate to be 2.88; and the linear model is significant with p-value < 0.001. In fact, for simple linear regression, the model p-value in the ANOVA table should match the p-value of the slope estimate, listed in the coefficient table. Furthermore, the MSE is divided into two terms: lack-of-fit and pure error. The pure error term is only available for situations where there are multiple measurements for a given value of the independent variable. In our example, six out of the eight reaction time points have two measurements collected. The difference between the paired measurements for a given reaction time point cannot be accounted for by the linear function, because they have the same x-value; and this difference is summarized as pure error, which reflects the sample-to-sample error and the measurement system error. The other term is lack-of-fit error, which summarizes the difference between the measurement (or the average of the two measurements, if applicable) and the predicted value for a given reaction time. This lack-of-fit error reflects the deviation of the data points from the perfect linear relationship. If all data points were perfectly aligned on the line, the lack-of-fit error would be zero. The higher the deviation of the data points from the line, the larger the lack-of-fit error. The mean squared error, including contributions from both the lack-of-fit and pure error, provides the estimate of the model variance, σ².

MODEL PREDICTION
One of the main purposes of performing linear regression analysis is to predict the response variable, given a level of the predictor variable. For example, what is the API concentration after 5.5 hours of reaction? This question can be answered easily with the regression equation. The API concentration is estimated to be 32.79 mg/mL at 5.5 hours based on the regression equation. However, because there is uncertainty in both the slope and intercept estimates, the API concentration is unlikely to be exactly 32.79 mg/mL. Oftentimes, it makes sense to report a range of the API concentration that would likely contain the true API concentration with high confidence. This is the concept of the confidence interval, as seen in the following equation:

ŷ ± t·sŷ

The point estimate, ŷ, is 32.79 mg/mL. The t-value is determined from the t-distribution, a function of the confidence level and the degrees of freedom. In this example, the degrees of freedom of the error is 12, provided by the analysis of variance table in Figure 3. The confidence level is set based on the analyst's objective. Frequently, a 95% confidence level is chosen (α = 0.05) to establish a 95% confidence interval. The t-value is 2.179 in this case. The standard deviation of the point estimate, sŷ, is a function of the MSE, the sample size, and the values of the independent variable, and is calculated by most major statistical software. The standard deviation of the API concentration estimate at 5.5 hr is 0.49 mg/mL. Therefore, the 95% confidence interval for API concentration at 5.5 hr is (31.717, 33.854) mg/mL. Figure 4 shows the 95% confidence intervals for the regression line. It can be shown that the standard deviation of the estimate is the smallest at the mean of all reaction times. Therefore, if there is a specific reaction time at which we would like to predict the API concentration, it is best to design the experiment so this reaction time is the mean of all data points, to achieve the highest precision.
Figure 4 also shows the 95% prediction intervals, which are wider than the 95% confidence intervals. The difference between the confidence interval and the prediction interval is that the confidence interval is on the mean of the distribution of y at a given x, while the prediction interval is about an individual outcome from the distribution of y. Because the prediction interval is related to the result of a future new trial, it requires a wider range to accommodate the additional variability. The prediction interval is established in a similar way to that of the confidence interval, except that the variance for the prediction error is adjusted with one mean squared error to represent a new trial. The 95% prediction interval of API concentration at 5.5 hours is (28.93, 36.64) mg/mL.
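Most statistical packages compute both intervals directly. A hedged Python sketch (same invented data as the earlier example, so the numbers will not match the article's 32.79 mg/mL result) obtains the 95% confidence and prediction intervals at a new reaction time with statsmodels get_prediction.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "time": [1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8],
    "conc": [8.2, 9.0, 15.1, 14.6, 21.3, 20.7, 27.9, 27.0,
             33.8, 40.1, 39.5, 46.2, 45.6, 52.3],
})
fit = smf.ols("conc ~ time", data=df).fit()

# Point estimate, 95% confidence interval (mean response), and
# 95% prediction interval (single future observation) at time = 5.5 h.
pred = fit.get_prediction(pd.DataFrame({"time": [5.5]}))
frame = pred.summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]].round(2))
```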

Figure 3: Minitab regression output.

MODEL DIAGNOSTICS
A natural question to ask after model fitting is "is the fitted model appropriate for the data?" Can the relationship between x and y be best described by a linear function, or does the relationship show any curvature? If y is a linear function of x, does the fitted linear model comply with the assumptions about the errors? As discussed earlier, the linear regression equation is the best estimate that meets the minimal residual squares criterion, independent of any assumptions about the error distribution. However, the inferences on the slope and intercept estimates, as well as the prediction and confidence intervals, assume the residual errors are normally and independently distributed with mean of 0 and variance of σ². Therefore, an important step of the regression analysis is to evaluate whether the assumptions about the residual errors are violated.
Many model diagnostics tools are available, both informal visual assessments and formal statistical tests. Here we start with the residual plots provided by the Minitab output (Figure 5), which are a set of useful informal visual assessment tools. The major aspects of linear model diagnostics are as follows.

Normality of the Error
The normality of the error term can be demonstrated with a normal probability plot of the residual errors (not shown in figures). Similarly, the histogram of the residual errors provides a visual assessment of the distribution. Non-normality is suspected if the data points significantly deviate from the linear trend. Oftentimes a small departure from normality should not be a major concern. On the other hand, formal goodness-of-fit tests also exist to evaluate the normality of the residual errors, which is outside of the scope of this paper.

Error Variance Is Homogeneous
The error term is assumed to be normally distributed with a mean of 0 and variance of σ², which is estimated by the MSE. Therefore, when the residual errors are plotted against the fitted values (or predictor variable), it is expected that no special pattern should be observed. The spread of the residual errors is expected to be consistent throughout the range of fitted values. In some cases, the spread may be larger for a larger fitted value, which may indicate a case of constant relative standard deviation (RSD) instead.

Another way to assess the homogeneity of the error variance is the sequence plot (e.g., versus observation order). Special patterns in the residual error spread may indicate the variance is not a constant. Again, a formal statistical test such as the modified Levene test can be used to test whether the error variance is homogeneous.

Figure 4: Confidence interval and prediction interval.
Figure 5: Minitab model diagnostics.

Linear Regression Model Is Appropriate
The residual plot against the fitted value not only assesses the error variance homogeneity, but also reveals deviation from linearity between x and y. A curved pattern of the residual data (Figure 6) indicates that the linear regression model may not be appropriate, and that the true function between x and y includes polynomial terms.

Figure 6: Residual data plot.

Errors Are Independent
Although the unknown true errors defined by Equation 4 are independent, the residual errors are not independent of each other technically, because the sum of all residual errors is 0. However, for practical purposes, when the sample size is large, this is not very important. When the residual errors are plotted against observation order (i.e., time), it is expected that the data points will be randomly scattered around 0 with no recognizable pattern. Any special trend or pattern requires special attention. For example, the trend in Figure 7 indicates that the observation order (i.e., time) should be included as a predictor variable. Other possible patterns of the residual error include the auto-correlation pattern, where errors of adjacent observations tend to cluster together. In summary, non-independent errors oftentimes indicate that the linear model is not adequate to describe the relationship between x and y.

Figure 7: Special patterns in residual errors.

Outliers
Outliers are observations that do not conform to the linear regression line well. These observations have unusually high residual errors. Many diagnostic tests are available to detect outliers, and often they can be easily identified visually using residual plots. When the number of samples is limited, an outlier can significantly influence the regression function, thus leading the regression line away from the rest of the data. Even if a potential outlier is statistically confirmed, its treatment should follow practical considerations. A general rule is not to exclude any outliers based on a statistical test alone.
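The visual checks described in this section translate directly into a few plots. The sketch below again uses invented data, and matplotlib/scipy rather than the Minitab four-in-one panel shown in Figure 5; it draws a normal probability plot of the residuals, residuals versus fitted values, and residuals versus observation order.

```python
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy import stats

df = pd.DataFrame({
    "time": [1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8],
    "conc": [8.2, 9.0, 15.1, 14.6, 21.3, 20.7, 27.9, 27.0,
             33.8, 40.1, 39.5, 46.2, 45.6, 52.3],
})
fit = smf.ols("conc ~ time", data=df).fit()
resid = fit.resid

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Normality: points should follow the straight reference line.
stats.probplot(resid, dist="norm", plot=axes[0])
axes[0].set_title("Normal probability plot")

# Homogeneity/linearity: random scatter about zero, constant spread.
axes[1].scatter(fit.fittedvalues, resid)
axes[1].axhline(0, linestyle="--")
axes[1].set_title("Residuals vs fitted")

# Independence: no trend or clustering versus observation order.
axes[2].plot(range(1, len(resid) + 1), resid, marker="o")
axes[2].axhline(0, linestyle="--")
axes[2].set_title("Residuals vs order")

plt.tight_layout()
plt.show()
```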

MODEL APPLICATIONS
The linear regression model is widely used in all types of applications. The following are several general considerations when a linear model is applied:
A linear regression model is useful in making inferences about the relationship between x and y. However, this relationship does not imply cause and effect. The high degree of association between x and y may be due to some underlying phenomenon that is not included in the model. This is especially true for observational data. That is the reason why a controlled experiment is valuable in establishing a cause-and-effect relationship.
Technically, a linear model is only valid within the range of the independent variable x. In fact, as discussed before, the precision in predicting y is best at the center of x, and worst at the two ends of x. Because we don't have any information beyond the range of x, any prediction outside of that range (i.e., extrapolation) assumes that the same linear relationship and error variance will still be valid. This assumption is likely appropriate if the extrapolation is not too far; however, caution still needs to be taken when extrapolating. Once the prediction through extrapolation is confirmed, the linear model range can be extended.
The model validity should also be checked before use. The linear function of our example is for a given reaction scale (e.g., 100 mL). Does the same relationship hold for a larger scale system? Even if the linear function remains the same across different scales, the error variance may change due to different heat and mass transfer capabilities at different scales. The change in the error variance leads to a change in prediction precision and model inferences.

WHAT NEXT?
This paper focuses on the cases where there is only one continuous predictor variable. What if a dependent variable is influenced by multiple parameters? That is the case of multiple regression. The types of predictor variables can be nominal, ordinal, or continuous. For example, if the same reaction experiment is carried out at three different scales, we can model the API concentration as a function of a continuous variable (i.e., time) and a nominal variable (i.e., scale) at the same time with a multiple regression model. In this case, the impact of time, scale, as well as any possible interaction between reaction time and scale, can be evaluated. If the interaction between reaction time and scale is not significant, the regression lines from the three scales would have similar slopes. Furthermore, if the impact of scale were not significant, all three regression lines would be close to each other. The statistical model development, as well as model inferences and prediction of multiple linear regression, will be the focus of the next installment.

CONCLUSIONS
This paper focuses on the basic concepts related to simple linear regression. Detailed discussions on linear regression can be found in the literature (see Reference). The linear regression function is the best estimate that satisfies the least squares criterion. With the assumption of normal independent errors, the significance of the model parameters can be tested, and confidence intervals can be constructed. Furthermore, confidence intervals and prediction intervals are discussed for a given independent variable. This paper also discusses the common model diagnostic tools, with an emphasis on the informal visual plots, and some of the key considerations when applying the regression model.

REFERENCE
Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W., Applied Linear Statistical Models, Chapters 1-3, 4th edition, Irwin, Chicago, 1996. JVT

ARTICLE ACRONYM LISTING
ANOVA: Analysis of Variance
API: Active Pharmaceutical Ingredient
MSE: Mean Squared Error
RSD: Relative Standard Deviation
SSE: Sum of Squared Residual Errors

Originally published in the Autumn 2011 issue of Journal of Validation Technology

Linear Regression 102: Stability


Shelf Life Estimation Using
Analysis of Covariance
David LeBlond, Daniel Griffith, and Kelly Aubuchon

Statistical Viewpoint addresses principles of statistics useful to


practitioners in compliance and validation. We intend to present these
concepts in a meaningful way so as to enable their application in daily
work situations.
Reader comments, questions, and suggestions are needed to help us
fulfill our objective for this column. Please contact managing editor Su-
san Haigney at shaigney@advanstar.com with comments, suggestions,
or manuscripts for publication.

KEY POINTS
The following key points are discussed:
Analysis of covariance (ANCOVA) is an important kind of multiple
regression that involves two predictor variables: one continuous
(e.g., time) and one categorical (e.g., batch of material).
Like simple linear regression, simple ANCOVA fits straight lines to
response measurements (e.g., potency, related substance, or mois-
ture content) over time: one line for each level (i.e., batch) of the
categorical variable.
A key objective of ANCOVA is to determine whether the straight
lines for all batches are best described as having a common-inter-
cept-common-slope (CICS) model, a separate-intercepts-common-
slope (SICS) model, or a separate-intercepts-separate-slopes (SISS)
model.
In ANCOVA, model choice is based on two statistical F-tests: one
comparing slopes and one comparing intercepts among batches. In
the case of pharmaceutical shelf life estimation, the US Food and
Drug Administration recommends a p-value < 0.25 for significance
in these tests.
ANCOVA model adequacy can be assessed by examining measures
such as a root mean square error (RMSE), lack of fit, PRESS, and
predicted R-square.
Once the appropriate model (i.e., CICS, SICS, or SISS) has been
identified for a given data set, it can be used to obtain expected
values, confidence intervals, and prediction intervals of potency of
a given lot at a given time.
When a lower or an upper specification limit can be identified for
the response, the ANCOVA model can be used to estimate the shelf
life for the batches tested.
The shelf life for a pharmaceutical batch is defined as the maxi-
mum storage period within which the 95% confidence interval

for the batch mean response level remains within the specification range. Depending on the response, the confidence interval may be one or two sided.
The shelf life for a pharmaceutical product is taken to be the minimum shelf life for batches on stability.
ANCOVA analysis and shelf life estimation using the Minitab Stability Studies Macro is illustrated in the cases of pharmaceutical potency, related substance, and moisture content responses.

INTRODUCTION
A previous installment of Statistical Viewpoint described simple linear regression in which there is a single continuous independent variable such as time, temperature, concentration, or weight (1). Many important relationships involve multiple independent variables, some of which may be categorical in nature (e.g., batch of material, supplier, manufacturing site, laboratory, preservative type, clinical subject). Understanding such relationships requires the use of multiple linear regression. In this installment, we deal with the simplest kind of multiple linear regression in which there are two independent variables: one continuous (called the covariate) and one categorical. The following are some examples in which this kind of relationship is important:
Pre-clinical studies. Ten xenograft rodents are treated with a range of doses of an anti-tumor agent, and the tumor weight for each animal decreases as dose increases. The objective is to quantify the animal-to-animal differences in dose response profile. Here tumor weight is the dependent variable, rodent identity is the categorical variable, and dose is the covariate.
Process scale-up. Active pharmaceutical ingredient (API) concentration is measured over time in three chemical reactors. The reactors differ in size (scale). The objective is to estimate scale effects on the rate of API synthesis. Here, API concentration is the dependent variable, scale is the categorical variable, and reaction time is the covariate.
Analytical methods. An assay measures the concentration of an analyte in plasma samples based on a fluorescence response. Samples are tested in duplicate. Each test provides a blank response and a test response. The objective is to compare analyte concentrations among samples, while correcting each for the effect of the blank. In this case, the test response is the dependent variable, sample identity is the categorical variable, and blank is the covariate.
Pharmaceutical product stability. The drug potency, related substance (a degradation product), and moisture level are measured over time in multiple batches of product stored in a temperature- and humidity-controlled chamber. The objective is to estimate the shelf life of the product. Here, the potency, related substance, and moisture levels are the dependent variables, batch identity is the categorical variable, and storage time is the covariate.

Notice the following distinctions in these examples.
The relative importance of the covariate and categorical variable differs. Sometimes, as in the pharmaceutical stability example, the primary interest may be on the effects of the covariate (i.e., stability over time), where the categorical variable, batch, is merely an unavoidable nuisance variable. In other cases, as in the analytical methods example, differences among levels of the categorical variable, sample, are of primary interest, while the effect of the covariate, blank, is an unavoidable nuisance variable. In other cases, as with the pre-clinical studies or process scale-up examples, both the differences between the categorical variable (rodent or scale) and the effects of the covariate (dose or reaction time) may be of equal interest.
The covariate may or may not be truly independent. Sometimes the covariate may be a truly independent variable whose value is well controlled and known with certainty, such as dose level or time. In other cases, the covariate is actually a measured value, such as an analytical blank. This violates one of the assumptions of regression, that the predictor variables are known without error (1). We still often use regression in these cases as long as the covariate is measured relatively accurately.
The experiment may include all or only some of the categorical variable levels of interest. Sometimes we include all levels of a categorical variable that are of interest, such as with the analytical methods example where we are concerned only with the samples being tested. In other cases, the categorical variable levels in our experiment are merely a sampling of all possible levels drawn from a larger population, such as all possible rodents or all possible manufactured batches. In these latter cases, we must remember that the methods we discuss here do not allow us to make strong inferences about that larger population; our conclusions will be limited primarily to the categorical levels (e.g., rodents, batches) we have tested. To make stronger inferences about the larger population, more advanced statistical methods are required.
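Each example above is a regression with one continuous predictor and one categorical predictor. As a generic, hedged sketch (invented data; the rodent and dose names simply echo the pre-clinical example and are not from the article), such a model can be expressed with a model formula in which the categorical variable enters as a set of indicator terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows = []
for r in [f"R{i}" for i in range(1, 6)]:
    baseline = rng.normal(10, 1)              # animal-specific baseline weight
    for dose in [0, 1, 2, 4, 8]:
        rows.append({"rodent": r, "dose": dose,
                     "tumor_wt": baseline - 0.8 * dose + rng.normal(0, 0.3)})
df = pd.DataFrame(rows)

# Covariate (dose) plus categorical variable (rodent) coded as indicator terms.
fit = smf.ols("tumor_wt ~ dose + C(rodent)", data=df).fit()
print(fit.params.round(2))
```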

This article focuses on the important example of pharmaceutical product stability. Thus our categorical variable will be batch and our covariate will be storage time. Design and analysis of stability studies is a mature discipline, and such studies may include additional continuous covariates such as dosage strength, storage temperature, or excipient levels, as well as additional categorical variables such as excipient lot or packaging type. These more complex studies are referred to as multi-factor stability studies (2). The analysis of such studies is beyond the scope of this article.
Because batches often differ in stability, stability studies on a single batch of product are not of interest. The number of batches in such studies is often small, yet the objective is inevitably a shelf-life estimate to be applied to the population of all future manufactured batches. This is somewhat troubling. In light of the last distinction mentioned above, we advise caution and encourage the reader to discuss with a statistician the possibility of using mixed model or Bayesian approaches (2) where appropriate. We will proceed with our description of the traditional approach without apologies because it is common industry practice.

MODELS OF INSTABILITY
We will assume here that, for a given batch, the change in level over time can be approximated by a straight line. Chemists refer to this as a pseudo zero-order kinetic mechanism. The real kinetic mechanism is almost certainly more complex, but this linear assumption is often found to be adequate. In any real application this linear assumption should be justified. In some cases, the response measurements or the time scale can be altered using appropriate transformation(s) to obtain a linear stability profile.
Consider the case where stability data are available for three batches of product. Figure 1 illustrates possible models, or scenarios, of product instability where the response is, for instance, the level of some related substance or degradation product of the active drug. However, the models described in Figure 1 apply equally well for decreasing responses (e.g., potency) or for responses that may rise or fall over time (e.g., moisture).

Figure 1: Multiple-batch models of instability: CICS (common intercept and common slope), SICS (separate intercept and common slope), SISS (separate intercept and separate slope).

In Figure 1, the mean response level for each batch is indicated by a different colored line. Each line can be defined by its intercept (i.e., response level at time zero) and slope (i.e., rate of change in response over time).
The common intercept and common slope (CICS) model represents a scenario where the stability profiles of all batches have a common intercept and common slope. This might be the result of a well controlled manufacturing process where the initial levels of all components, as well as their stability over time, are uniform across batches. The CICS model generally will result in a longer estimated product shelf life because it allows tighter estimates of the mean slope and intercept that are common to all batches.
The separate intercept and common slope (SICS) model represents a scenario where batches have separate intercepts but a common slope. This could result from a manufacturing process in which the initial level of the component of interest is not well controlled batch to batch. However, other aspects of the process that govern batch stability are uniform, such that the rate of change in the level of the component of interest is the same for all batches.
The separate intercept and separate slope (SISS) model represents a scenario where batches have separate intercepts and separate slopes. This could result from an uncontrolled manufacturing process in which neither the initial level nor the stability of the component of interest is well controlled batch to batch.
Clearly the CICS model is most desirable. The SICS model may be acceptable as long as the initial level non-uniformity is controlled within acceptance limits. However, the SISS model is the least desirable scenario because batches may become increasingly less uniform over time. The presence of large batch-to-batch variability makes it difficult to accurately estimate a shelf life for the process from only a few batches.
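In regression terms, the three scenarios correspond to three nested model formulas. The sketch below uses invented three-batch stability data and statsmodels syntax (not the Minitab macro used later in the article); it fits the CICS, SICS, and SISS models so that they can be compared by the nested F-tests discussed in the next section.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(5)
months = [0, 3, 6, 9, 12, 18, 24]
rows = []
for batch, (b0, b1) in {"A": (100.5, -0.20), "B": (99.8, -0.22), "C": (100.9, -0.25)}.items():
    for t in months:
        rows.append({"batch": batch, "time": t,
                     "potency": b0 + b1 * t + rng.normal(0, 0.4)})
df = pd.DataFrame(rows)

cics = smf.ols("potency ~ time", data=df).fit()             # common intercept, common slope
sics = smf.ols("potency ~ C(batch) + time", data=df).fit()  # separate intercepts, common slope
siss = smf.ols("potency ~ C(batch) * time", data=df).fit()  # separate intercepts and slopes

# Nested-model F-tests (slopes first, then intercepts), analogous to Table I.
print(anova_lm(sics, siss))   # test for separate slopes
print(anova_lm(cics, sics))   # test for separate intercepts
```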

Table I: Model comparisons made in the ANCOVA F-tests.

Some readers may notice that a CISS (common intercept and separate slopes) model is missing from Figure 1. Certainly there is no scientific reason to exclude a manufacturing process in which initial levels of batches are very well controlled but other components (such as stabilizers) or process settings that affect batch stability might not be well controlled. However, while the initial levels may be relatively well controlled, they are unlikely to be identical, at least for batches derived from blended powders or unit-dose filling processes. So, unless there are compelling scientific reasons to consider the CISS model, we must use the stability data to choose either the SISS, SICS, or CICS models.
A model that is important in building the analysis of covariance (ANCOVA) table but is not considered in the evaluation of stability data is what we might call the common intercept, no slope (CINS) model. The CINS model assumes that the common slope of all batches is zero. This implies a perfectly stable product. While very stable pharmaceutical products do exist, we never make an assumption of perfect stability in evaluating stability data.

ANCOVA MODEL SELECTION
Well controlled processes that follow a CICS model will more likely result in a longer shelf-life estimate than those that follow the SICS or SISS models. Because the estimate of shelf life depends on the model choice, the first task is to choose the model. While there may be development experience or theoretical reasons to expect one model over another, the traditional approach is to let the stability data themselves guide us to the most appropriate model. The ANCOVA is the statistical procedure for selecting the most appropriate of the three models. ANCOVA is a close cousin of the analysis of variance (ANOVA) associated with simple linear regression (1). Like ANOVA, ANCOVA partitions the variance in the observed measurement in a specific way. This partitioning allows us to make two statistical F-tests for batch differences among slopes and intercepts.
The algebra behind the ANCOVA F-tests is complicated. But it is not necessary to understand the algebra because the calculations are easily handled by statistical software packages such as Minitab Statistical Software (3). However, it is necessary to understand what these F-tests are comparing, what the criteria for test acceptance or rejection are, and to be familiar with the ANCOVA table that statistical software produces.
The ANCOVA F-tests make a comparison between two models: a simple (null or reduced) model and a more complicated (alternative or full) model. The p-value associated with the test F statistic is used to decide whether the portion of response variance attributable to the extra features of the more complicated model is larger than can be explained by measurement variation alone. If so, we reject the simpler model in favor of the more complicated one. Table I shows the models being compared in the ANCOVA F-tests.
The p-value obtained from either ANCOVA test in Table I is the probability of obtaining an F statistic that is as or more extreme than the one we observed, given that the null hypothesis (i.e., the simpler model) is true. If the p-value is below some fixed value, we should select the more complicated model; otherwise we choose the simpler model. This fixed value is referred to as the alpha or type I error level. In many applications, we choose a limit value of 0.05 for our hypothesis tests. However, in the case of pharmaceutical product stability, it is traditional to use the more conservative limit of 0.25 for the p-value (4). The 0.25 limit is controversial because it implies that 25% of the time we will incorrectly choose the more complicated (and less desirable) model.

Figure 2: The ANCOVA model selection process.

The rationale for choosing this more conservative limit has to do with safety and efficacy. If we incorrectly choose the more complicated model, the estimated shelf life will likely be too short. The consumers of this drug product will likely not suffer side effects if a manufacturer establishes a shelf life that is shorter than necessary. On the other hand, if we incorrectly choose the simpler model, the estimated shelf life will likely be too long. In that case, consumers that use product near the end of its shelf life may be under-medicated (if potency declines with time) or be exposed to higher levels of harmful degradation products. Consequently, regulatory agencies have established the more conservative p-value limit of 0.25 to reduce the likelihood of establishing a shelf life that is too long. This practice seems undesirable from a manufacturer's point of view, but remember that the shelf life is meant to apply to the popula-

39 Special edition: Statistics in Validation


Peer Reviewed: Statistical Viewpoint

Table II: ANCOVA table output from the Minitab stability macro.

tion of all future batches. Establishing a shelf life that a process error [MSE]), assuming that the SISS model is appropriate. The
cannot support adds to the cost of operations due to out-of- degrees of freedom (DF) and the Seq SS in the Total source row
specification investigations of batches on stability and potential are merely the sum of those quantities in the rows above.
product recalls. The Batch and Batch*Time sources provide the ANCOVA
The ANCOVA decision process is diagrammed in Figure 2. It F-tests for intercept and slope, respectively, that are of interest
starts with the stability data at the top. There are of course many to us here. The p-values from these F-tests are used to make the
ways to organize stability data. The format shown in Figure model choice as described in Table I and Figure 2.
2 is what is required for input into most statistical packages, DF. This gives the degrees of freedom associated with each
such as Minitab, for ANCOVA analysis. In this format, there are source. This is a measure of the amount of information available
three columns: the response level column, the time (covariate) in the data to estimate the statistics associated with this source.
column, and the batch (categorical variable) column. For brevity, B is the number of batches in the data set, and N is the total
only the first four and last observations are shown. number of independent measurements in the data set. Notice
We start with the worst-case presumption that slopes and how the DF for the Total source equals the sum of the values
intercepts vary among batches. The F-test for separate slopes is above it.
examined first. If this test is statistically significant (i.e., p-value Seq SS. This is the sum of squares associated with this source.
< 0.25) then the ANCOVA process concludes with the selection Larger Seq SS values represent sources that contribute more to
of SISS as the final stability model. As discussed previously, un- variation in the data. This quantity is obtained from the ANOVA
less there is a compelling scientific argument, an F-test compar- error sum of squares (see Reference 1) from the multiple regres-
ing the SISS and CISS models is not made at this point. sion fit to models CINS, CICS, SICS, and SISS. The error sum of
If the F-test for separate slopes is not statistically significant squares is indicated as SSEmodel where the subscript gives the
(i.e., p-value 0.25), there is no evidence in the data for a dif- fitted model. Notice how the Seq SS for the Total source equals
ference in slopes among the batches, and we can presume a the sum of the values above it.
common slope model, SICS. Next, we perform the second F-test Seq MS. Seq MS gives the mean square (or variance) associat-
in Table I that tests for separate batch intercepts, assuming that ed with the source. This is simply the respective Seq SS divided
batch slopes are common. This test is a comparison of model by the DF.
SICS and CICS. If the test is statistically significant, then the F. This gives the F-value for this source that is simply a ratio
ANCOVA concludes with the selection of SICS model. If the test of the respective SS MS to some measure of error variance. In
is not statistically significant, then the remaining model, CICS, the case of pharmaceutical stability ANCOVA, it is common
is selected. to use MSE as the error variance for all F-tests, but in a tradi-
An ANCOVA table that is produced by the Minitab stabil- tional ANCOVA table, the quantity SSESICS/(N-B-1) is used as
ity macro is shown in Table II. It consists of five rows and five the measure of error variance for the test for common intercept
columns of statistical quantities. The quantities in each column, (the Batch source). MSE is used because it is smaller than the
how they are obtained, and what they represent are described as traditional quantity. This leads to a larger F-value, which is more
follows. likely to lead to statistical significance and a more conservative
Source. A label indicates the variable or interaction that con- final model choice.
tributes variation to the measurement. This label also indicates P. This gives the p-value for the F-test associated with this
the particular F-test that this row represents. The Time source Source. This p-value is the complement of the cumulative F-dis-
provides an F test that tests the hypothesis that the common tribution with quantile = FSource, numerator degrees of freedom
slope is zero in the CICS model. A low p-value suggests that = DFSource, and denominator degrees of freedom = DFE.
some instability is present, but is of no interest to us in model To summarize, p-valueB and p-valueBT in Table II are calcu-
selection here because we never entertain a model with zero lated from the stability data and are used to test for common in-
slope. The Error source does not include an F-test but provides tercept and slope, respectively, as described in Table I and Figure
an estimate of total analytical variance (the quantity mean squre 2. The outcome of the ANCOVA process is a final stability model

Special edition: Statistics in Validation 40


Peer Reviewed: Statistical Viewpoint

that is used to estimate the product shelf life. It is important to procedures (1) based on the selected ANCOVA model are used
remember that the model selected through the ANCOVA process to obtain 95% confidence bounds for each batch. Assignment
may change if data are re-analyzed after additional stability time of shelf life for a product is based on worst-case: the response,
points are acquired. batch, and side (upper or lower) giving the shortest shelf life is
used to set the shelf life for the product. Shelf life estimates are
DETERMINATION OF SHELF LIFE often based on extrapolation beyond storage periods of available
Shelf life for a pharmaceutical product is based on measurements stability batches. International Conference on Harmonisation
of one or more stability indicating responses for which upper or (ICH) Q1E guidelines state that the maximum allowable shelf
lower acceptance limits have been established. The responses are life is two times the maximum storage period of available stabil-
measured on a few (typically three) batches of product that are ity data (4).
stored under carefully controlled temperature and humidity in Because our focus here is on the ANCOVA decision process,
the intended packaging. Traditionally, a pharmaceutical product we will emphasize this aspect of the computer output in the
shelf life for a batch is based on the 95% confidence limit for examples below. The shelf life estimation process involves simple
the mean response level over time, as estimated from the avail- or multiple regression and results in additional tables of comput-
able stability batch data. The 95% confidence limit for a mean er output. This consists of prediction equations and stability pro-
regression line is described briefly in Reference 1. The shelf life file graphs for all batches (CICS model) or for each batch (SICS
(S) is based on the shortest Time at which the estimated 95% or SISS models), model summary statistics, and an ANOVA that
confidence bound crosses an acceptance limit. Shelf-life estima- may include a lack-of-fit (LOF) test of nonlinearity in the stabil-
tion for a single batch in three common situations is illustrated ity profile. This output will be illustrated in the examples that
in Figure 3. follow. Interested readers can learn more details about multiple
The left panel of Figure 3 illustrates an increasing response regression from standard statistical textbooks (5).
level over time (such as a degradation product) for which only an
upper acceptance limit is set. In this case, it is common to use a ANCOVA DATA ANALYSIS USING THE MINITAB STABILITY
one-sided upper confidence bound. The middle panel illustrates MACRO
a decreasing response level (such as tablet potency) with only ANCOVA and regression analysis for shelf-life estimation can be
a lower acceptance limit set. In this case, it is common to use a obtained using many commercially available statistical packages.
one-sided lower confidence bound. The right panel illustrates We illustrate this process here using a convenient Minitab macro
the situation for a response level (such as moisture) that may that may be downloaded and saved (3). Once saved, an analysis
either increase or decrease on storage and for which both upper is made as follows:
and lower limits have been set. Cases do exist where lower (or 1.Start Minitab.
upper) limits are in place for responses expected to increase (or 2.Enter stability data into a worksheet using the format given
decrease) over time. In such cases, it may be desirable to employ in Figure 2.
two-sided limits. One-sided confidence limits will lead to longer 3.Select Edit, then Command Line Editor.
shelf-life estimates so their use must be risk justified. 4.Type a short script into the Command Line Editor (syntax
described below) that describes the type of analysis desired.
Figure 3: Illustration of shelf-life determination for a 5.Choose Submit Commands to execute the script.
single batch. Red horizontal lines indicate upper (U) or
lower (L) acceptance limits. The solid straight line is the The script will invoke the macro, and the ANOVA and regres-
mean regresion line, and the dashed line is the upper or sion results, including stability profile graphs and shelf-life
lower confidence interval. The maximum batch shelf life estimates, will be produced. A typical stability macro script is
is indicated by S given as follows:
%stability ycol tcol bcol;
store out.1-out.n;
itype it;
confidence cl;
life c.1 c.z;
xvalues xpredt xpredb;
nograph;
criteria alpha.
The script syntax consists of a main command (%stability),
given in the first line, and a set of optional subcommands, each
. given on subsequent lines. The order of appearance of the sub-
commands is not important. All commands and subcommands
Usually, multiple response data from multiple batches are must end in a semicolon except the last subcommand, which
used to set shelf life. ANCOVA is employed to identify the ap- must end in a period. Each command and subcommand consists
propriate stability model for the batches at hand. Regression of a key word followed by user-specified input parameters whose

41 Special edition: Statistics in Validation


Peer Reviewed: Statistical Viewpoint

values tell the macro what worksheet columns to use for data We can use the following script to analyze these data and obtain
and calculated predictions, and the kind of confidence interval an estimate for the product shelf life:
to employ. In the %stability command, ycol indicates the column %stability c1 c2 c3;
in your worksheet containing your response (e.g., potency, store c4 c5 c6;
related substance, or moisture level,), tcol indicates column for itype -1;
storage time, and bcol indicates the particular batch. The bcol confidence 0.95;
worksheet column can be formatted as either numeric (i.e., 1, 2, life 95;
3,) or text (i.e., A, B, C, ). The macro has a limit of up to 100 criteria 0.25.
batches in the worksheet. The other subcommands are explained
in Table III. Table V provides the ANCOVA and other computer output.
Compare the ANCOVA output in Table V to that shown in Table
STABILITY ANALYSIS II and to the ANCOVA decision process shown in Figure 2. The
The following illustrates five stability analyses using this macro. p-value associated with the test for separate slopes (Source =
The potency data used was obtained from an actual literature ex- Batch*Time) is 0.797 which is > 0.25, so the data provide no
ample (6). The related substance and moisture data are realistic, evidence for separate slopes among the batches. The p-value
but artificially constructed. associated with the test for separate intercepts (Source = Batch)
is 0.651, which is > 0.25, so the data provide no evidence for
Example One: Potency Stability (CICS Model, One- or Two- separate intercepts among the batches. Consequently, we take
Sided Limit) CICS as an appropriate stability model for estimating shelf life.
Table IV provides potency stability data (%LC) obtained over a As seen in Table V, the Minitab macro output refers to the CICS
24-month period from B=3 batches (batches numbered 2, 5, and model as Model 1.
7) of a drug product. A total of N=31 independent measurements The output in Table V provides the regression equation with
are available. The first three columns of this table are in the common intercept (100.567 %LC) and slope (-0.192994 %LC/
format required by the Minitab macro. Notice that independent month). The negative slope indicates that potency is decreasing
replicate measurements on each batch are available for months with time. The output includes the following summary statistics.
3-24. Such independent replicates provide a test of the linearity S. Root mean square estimate of the final model 1 fit. This esti-
assumption as described below. Note also that we are assum- mates total analytical standard deviation.
ing independence of each measurement here (as discussed in PRESS. Prediction sum-of-squares (PRESS). This gives a robust
Reference 1), but independence is a key assumption that must be estimate of your models predictive error. In general, the smaller
justified. The lower acceptance limit for potency for this product
is 95% LC.

Table III: Stability macro subcommands.

Special edition: Statistics in Validation 42


Peer Reviewed: Statistical Viewpoint

Table IV: Example one potency stability data and esti- standard statistical text books for more information on complex
mated fits and limits. ANOVA (5). One useful feature of the ANOVA in Table V is the
LOF test. Simply put, this LOF test compares a models residual
variance to that available from pure replication to form an F
ratio. If this ratio is large and the p-value is significant (i.e., <
0.05), either there is evidence for non-linearity, or the replicates
are not truly independent. Such is the case in this example (p-
value = 0.0000037). If it is determined that this nonlinearity is
impacting the shelf-life estimation, it may be advisable to alter
the model, transform the response, or analyze replicate aver-
ages rather than individual replicates. We will assume in this
example that the LOF has no impact and, for illustration, will
use this model to estimate shelf life.
The shelf-life estimate for this example is given at the bottom
of Table V as 26 months. This estimate is illustrated in Figure
4. This plot shows the individual measurements for each batch
as separate colors. The solid black line is the best-fit regression
line for the mean potency of all three batches. The red dashed
line gives the one-sided lower 95% confidence bound of the
mean potency. It can be seen that this line intersects the lower
acceptance limit for the product (95% LC) at about 26 months.
It is common practice to round a shelf-life estimate down to the
nearest whole month.

Figure 4: Example one potency stabil-


ity profile for all batches based on a CICS
model and a one-sided lower acceptance limit.

Notice the additional numbers in columns c4-c6 of Table IV.


the PRESS value, the better the models predictive ability. The stability macro will place these numbers in the worksheet
R-Sq(pred). A robust version of Adjusted R-Sq useful for com- as a result of the store subcommand (see the script above used
paring models because it is calculated using observations not for this analysis). The Fit and Lower CL (columns c4 and c5)
included in model estimation. Predicted R-Sq ranges between correspond to the black and red dashed lines, respectively, in
0 and 100%. Larger values of predicted R-Sq suggest models of Figure IV. The Lower PL in Table IV is the Lower 95% prediction
greater predictive ability. limit for individual observations. This limit is more conserva-
R-Sq(adj). A robust version of R-Sq, the percentage of response tive (lower) than the 95% confidence for the mean (red line) and
variation that is explained by the model, adjusted for the com- reflects the scatter of individual values about the fitted line (see
plexity of the model. Reference 1 for more description). Notice in Table IV that this
The output in Table V also includes a ANOVA table. This prediction limit is below the acceptance limit at 24 months. Thus
ANOVA table is similar to that described previously (1), but has in this case, while a 26-month shelf life for the product may be
a few additional statistical tests. Interested readers are referred to acceptable from a regulatory point of view, a sponsor

43 Special edition: Statistics in Validation


Peer Reviewed: Statistical Viewpoint

Table V: Example one ANCOVA, regression, ANOVA, and estimated shelf-life outputfrom the Minitab stability
macro.

may want to consider the risk of out-of-specification results for We will use the following script to estimate the product shelf
this product near the end of shelf life. life based on these data:
So far we have assumed a one-sided lower limit of 95%LC. If %stability c1 c2 c3;
the product had an upper limit of 105%LC as well and there is store c4 c5 c6;
risk of batches exceeding the upper limit, then we might want a itype -1;
shelf life based on a two-sided 95% confidence interval. In that confidence 0.95;
case we could use the following analysis script: life 95;
t%stability c1 c2 c3; criteria 0.25.
life 95 105. The ANCOVA and other statistical output from this analysis are
The resulting stability profile is shown in Figure 5. Notice in this given in Table VII.
case that the shelf-life estimate is slightly lower (25.5 months There is no evidence for separate slopes (p-value = 0.834).
which we would likely round down to 25 months). This is be- However, there is evidence for separate intercepts (p-value <
cause two-sided limits will be wider than a one-sided bound and 0.001). A comparison with the ANCOVA decision process of
will thus intersect the limit sooner. Figure 2 shows that the SICS model is appropriate in this case.
The regression equations in Table VII show that the estimated
Example Two: Potency Stability (SICS Model Two, One-Sided slope (-0.213121 %LC/month) is common to each batch, but the
Lower Limit) intercepts differ. As in example one, the LOF test is significant
Another set of potency stability data is given in columns C1-C3 (p-value = 0.0258), but we will assume that the straight-line as-
of Table VI. As before, we will assume a one-sided lower accep- sumption is adequate for illustration purposes here.
tance limit of 95%LC. Figure 6 provides the separate stability profiles for each batch.

Special edition: Statistics in Validation 44


Peer Reviewed: Statistical Viewpoint

Figure 5: Example one potency stabil- Figure 6: Example two potency stabil-
ity profile for all batches based on a CICS ity profiles for each batch on a SICS model
model and a two-sided acceptance limit. and a onesided lower acceptance limit.

Because the intercepts differ, the macro produces a separate


plot for each batch. The shelf life estimated for each batch, based
on when its 95% confidence lower bound crosses the accep-
tance limit of 95%LC, is given on the upper right corner of each
plot. Batch 5 has the lowest estimated shelf life (23.4 months).
Therefore, by the worst-case logic of pharmaceutical shelf-life
estimation, limits the shelf life for the product to 23.4 months as
is also indicated in Table VII. In practice, we would likely round
this down to 23 months. As described in Example one, columns
C4-C6 of Table VI provide the numeric Fit and interval estimates
based on the store subcommand request.

Example Three: Potency Stability (SISS Model, One-Sided


Lower Limit With Predictions)
Yet another set of potency stability data is provided in columns
C1-C3 of Table VIII.
These data are analyzed using the following script:
%stability c1 c2 c3;
store c4 c5 c6;
itype -1;
confidence 0.95;
life 95;
criteria 0.25.
Table IX shows the ANCOVA and other statistical output from
this analysis.
There is evidence for both separate slopes (p-value = 0.17) and
intercepts (p-value < 0.01). Both p-values are below the regula-
tory limit of 0.25. A comparison with the ANCOVA decision
process of Figure 2, shows that the SISS model is appropriate in
this case. The regression equations for each batch are given in
Table IX, and the slopes and intercepts differ for each batch as ex-
pected. We note that in this case, the LOF test is not statistically real stability testing was done at 15 months of storage, but we
significant (p-value = 0.100568). For this test we use the tradi- can use the stability model to obtain estimates by including the
tional Type I error rate of 0.05 to judge statistical significance. desired times and batch numbers in columns c4 and c5, respec-
Stability profiles for each batch are given in Figure 7. tively, prior to the analysis and employing the following script:
As seen in Figure 7 and Table IX, the product shelf life esti- %stability c1 c2 c3;
mated by these data is limited by Batch 8 to 15.6 months. We itype 0;
would likely round this down to 15 months in practice. Howev- confidence 0.95;
er, it would be interesting in this case to see what potencies the life 95 105;
model would predict for these batches at 15 months. No xvalues c4 c5;

45 Special edition: Statistics in Validation


Peer Reviewed: Statistical Viewpoint

Table VI: Example two potency stability data and estimated fits and limits.

Special edition: Statistics in Validation 46


Peer Reviewed: Statistical Viewpoint

Table VII: Example two ANCOVA, regression, ANOVA, and estimated shelf-life output from the Minitab stability
macro.

store C6 c7 c8 c9 c10. Example Four: Related Substance Stability (SISS Model


For illustration, we are requesting two-sided 95% confidence Three, One-Sided Upper Limit)
limits (it=0). This amounts to requesting a 97.5% confidence To illustrate estimation of shelf life for a response whose level
lower bound, which is more conservative than a 95% confidence increases on storage, we will use the data for a related substance
lower bound. The same result could be obtained using it= -1 (degradation product of the active ingredient) given in columns
and cl = 97.5. Columns C4 and C5 contain the time points and C1-C3 of Table XI. The levels in column C1 are expressed as
batches for which we want predictions. The above macro per- a percent of label claim for the active ingredient and the up-
forms the fit as given previously in Table IX and the xvalues sub- per limit for this particular related substance is assumed to be
command produces the predictions in columns C6-C10 of Table 0.3%LC.
X. Note that the lower confidence bound is still within the limit We can obtain the shelf life based on this response by using
of 95%LC, although the lower prediction bound, which reflects the following script:
individual result variation, is below the acceptance limit. %stability c1 c2 c3;

47 Special edition: Statistics in Validation


Peer Reviewed: Statistical Viewpoint

Table VIII: Example three potency data and estimated As a final example of a response that may either increase or
fits and limits. decrease on storage, we examine the moisture data given in
columns C1-C3 of Table XIII. The moisture measurements in
column C1 have units of %(w/w). We will take the acceptance
limits for this product to be 1.5 to 3.5 %(w/w).
We can analyze these data using the following script. Notice
that we have specified both the lower and upper acceptance
limits using the life subcommand and requested two-sided con-
fidence limits using the itype subcommand.
%stability c1 c2 c3;
itype 0;
confidence 0.95;
life 1.5 3.5;
criteria 0.25.
The results of this analysis are provided in Table XIV.
Notice in this case, the ANCOVA analysis leads to the CICS
model because neither the test for separate slopes nor the test
for separate intercepts is statistically significant (i.e., p-values of
0.483 and 0.705, respectively). The stability profile given in Fig-
ure 9 indicates a shelf life for all batches of 45.35 months, which
agrees with the estimate at the bottom of Table XIV. In this case,
it is the 95% confidence upper bound that crosses the upper
limit earliest and that, therefore, governs the product shelf life.

CONCLUSION
We have illustrated here the ANCOVA process that is used to
set product shelf life for pharmaceutical products. We have also
illustrated the use of a convenient Minitab macro that can be
used to perform the ANCOVA analysis, choose the appropriate
stability model, and execute the multiple regressions to estimate
shelf life and produce other useful statistical tests and statistics.
The macro is flexible enough to handle a variety of common
situations and produces graphics that serve as useful regression
diagnostics.
It is essential to stress here the critical aspect of software
validation. Validation is a regulatory requirement for any
software used to estimate pharmaceutical product shelf life.
store c4 c5 c6; Reliance on any statistical software, whether validated or not,
itype 1; carries with it the risk of producing misleading results. It is
confidence 0.95; incumbent on the users of statistical software to determine, not
life 0.3; only that the statistical packages they use can produce accurate
criteria 0.25. results, given a battery of standard data sets, but also that the
Notice in this case that we are requesting a one-sided upper statistical model and other assumptions being made apply to
confidence limit (it=1) of 95% (cl=0.95). The output from this the particular data set being analyzed, and that data and com-
analysis is shown in Table XII. mand language integrity are maintained. It is not uncommon
As in example three, the ANCOVA output in Table XII in- for a computer package to perform differently when installed
dicates an SISS model. The separate slopes and intercepts are on different computing equipment, in different environments,
given in Table XII along with an LOF test that is not statistically or when used under different operating systems. In our hands,
significant, and an estimated shelf life of 15.61 months (which using a number of representative data sets, the Minitab Stabil-
we would usually round down to 15 months). ity macro performs admirably compared to other statistical
Stability profiles for these batches are given in Figure 8, which packages such as JMP, SAS, and R. However, we can make no
confirms that batch 8 is the stability limiting batch for the prod- general claim that it will not be found lacking in other environ-
uct shelf life. Numeric predictions, requested using the STORE ments. Readers are advised to enlist the aid of local statisticians
subcommand are given in columns C4-C6 of Table IX. to assure that the statistical packages they use are properly
Example Five: Moisture Stability (CICS Model, Two-Sided validated. JVT
Limits)

Special edition: Statistics in Validation 48


Peer Reviewed: Statistical Viewpoint

Table IX: Example three ANCOVA, regression, ANOVA, and esti-


mated shelf-life output from the Minitab stability macro.

Table X: Example three fit, confidence limit, and prediction limit estimates for
time and batch combinations not present in the stability data.

49 Special edition: Statistics in Validation


Peer Reviewed: Statistical Viewpoint

Figure 7: Example three potency stability profiles each batch on a SISS model and a one-sided
lower acceptance limit

Special edition: Statistics in Validation 50


Peer Reviewed: Statistical Viewpoint

Table XI: Example four related substance stability data and estimated fits and limits.

51 Special edition: Statistics in Validation


Peer Reviewed: Statistical Viewpoint

Table XII: Example four ANCOVA, regression, ANOVA, and esti-


mated shelf-life output from the Minitab stability macro.

Special edition: Statistics in Validation 52


Peer Reviewed: Statistical Viewpoint

Figure 8: Example four related substance sta- Table XIII: Example two moisture stability data.
bility profiles for each batch on a SISS model
and a one-sided upper acceptance limit.

53 Special edition: Statistics in Validation


Peer Reviewed: Statistical Viewpoint

Table XIV: Example five ANCOVA, regression, ANOVA, and estimated shelf-life output from the Minitab stability
macro.

Figure 9: Example five moisture stabil- 4. International Conference on Harmonization. ICH Q1E, Step 4: Evaluation
ity profile for all batches based on a CICS for Stability Data, 2003. http://www.ich.org/products/guidelines/quality/
model and a two-sided acceptance limit. article/quality-guidelines.html
5. Neter J, Kuntner MH, Nachtsheim CJ, and Wasserman W, Applied Linear
Statistical Models, Chapter 23. 3rd edition, Irwin Chicago, 1996.
6. Schuirmann, DJ, Current Statistical Approaches in the Center for Drug
Evaluation and Research, FDA, Proceedings of Stability Guidelines, AAPS
and FDA Joint Conference, Arlington, VA, Dec 11-12, 1989. JVT

ARTICLE ACRONYM LISTING


ANCOVA Analysis of Covariance
ANOVA Analysis of Variance
API Active Pharmaceutical Ingredient
CICS Common Intercept and Common Slope
CL Confidence Limit
DF Degrees of Freedom
LOF Lack of fit
%LC Percent of Label Claim
REFERENCES MSE Mean Square Error
1. Hu Yanhui, Linear Regression 101, Journal of Validation Technology 17(2), PL Prediction Limit
15-22, 2011. PRESS Predicted Residual Sum of Squares
2. LeBlond D., Chapter 23, Statistical Design and Analysis of Long-Term RMSE Root Mean Squared Error
Stability Studies for Drug Products, In Qui Y, Chen Y, Zhang G, Liu L, Porter R-Sq R-square
W (Eds.), 539-561, 2009. R-Sq(adj) Adjusted R-square
3. Minitab Stability Studies Macro (2011), A technical support document R-Sq(pred) Prediction R-square
describing the use of the Macro in Minitab version 16 is available from the SICS Separate Intercept and Common Slope
Minitab Knowledgebase at http://www.minitab.com/support/answers/an- SISS Separate Intercept and Separate Slope
swer.aspx?id=2686.

Originally published in the Summer 2011 issue of Journal of Validation Technology

Special edition: Statistics in Validation 54


Peer Reviewed: Analysis and Control of Variation

Understanding and Reducing


Analytical ErrorWhy Good
Science Requires Operational
Excellence
John McConnell, Brian K. Nunnally, and Bernard McGarvey

Analysis and Control of Variation is dedicated to revealing weak-


nesses in existing approaches to understanding, reducing, and control-
ling variation and to recommend alternatives that are not only based
on sound science but also that demonstrably work. Case studies will
be used to illustrate both prob-lems and successful methodologies.The
objective of the column is to combine sound science with proven practi-
cal advice.
Reader comments, questions, and suggestions will help us fulfil our
objective for this column. Case studies illustrating the successful reduc-
tion or control of varia-tion submitted by readers are most welcome.
We need your help to make Analysis and Control of Variation a useful
resource.Please send your comments and suggestions to column coordi-
nator John McConnell at john@wysowl.com.au or journal coordinating
editor Susan Haigney at shaigney@advanstar.com.

KEY POINTS DISCUSSED


The following key points are discussed:
Good science in discovery, development, produc-tion, and in labo-
ratories requires stable operations with low variation.
Unstable analytical systems signals from the ana-lytical process
add variation to production data.
Actual examples of variable processes are presented.
Stabilize first is the first principle. Stable pro-cesses are
predictable.
Variation in laboratory operations may mask causal relationships
in other areas.
Compliance to procedures is not acceptable ratio-nale for a vari-
able process.
Senior management should remove obstacles to conquering varia-
tion by making it a strategic imperative.
In environments where high degrees of variation are possible (e.g.,
in biologics), the need for very low levels of variation in operations
is greatest.
Reduced variation means fewer deviations, fewer resources tied up
conducting investigations and reports, more resources dedicated to
doing the core work, and increased security from robust processes
with known capabilities.
Operating in a low-variation environment results in easier detec-

Special edition: Statistics in Validation 55


Peer Reviewed: Analysis and Control of Variation

tion of causal relationships and fewer errors in especially if the process under examination is not
interpreting data. statistically stable.
The US Food and Drug Administrations process Some Actual Examples
validation guidance recommends statistical pro- Before we ask the scientists in the discovery, devel-
cess control techniques to measure and evaluate opment, production, or analytical areas to do good
process stability and process capability. science, we ought to create stable operations. Unfor-
tunately, much of the industry has yet to discover
INTRODUCTION this truth, let alone use it to their advantage. To
This article continues discussion initiated in Blame illustrate the situation, two control charts are shown
the LaboratoryUnderstanding Analytical Error (1). in Figure 1 (2). They show the results of a plant trial
That article generated more comment and discussion whose objective was to drive variation to minimum
than any other article published in this column, and levels in everything. The same people using the same
it soon became clear that readers required more detail technology made the same product for the period of
and guidance. As this article was being written, one the chart. There is no change in the science involved.
of the authors visited a large pharmaceuticals site What changed was operational rather than chemi-
producing biological products. Earlier in the year, cal or biological. What changed was that everyone
a slow and long-term upward drift in the level of involved, from the plant manager down, became
analytical error had been demonstrated. In addition, intolerant of variation in any form. Training was
it was noted that a significant drop in the average of conducted, operational definitions were created, and
the production data was matched with a similar drop method masters were appointed to ensure almost
in the average for laboratory reference standards. exact performance between shift and between opera-
Further studies revealed that analytical error was tor and analyst repeatability. Instruments were tested
likely increasing variation in the formulation of the and calibrated to ensure excellent replication across
final product. instruments. Bacteria from only one working cell
It was clear that analytical error was excessive and bank were used in fermentation. The aim was never
that it needed to be reduced. The cell count for refer- concerned with accuracy for any characteristic. The
ence standards met the desired minimum level only aim was always to create maximum precision, to con-
about 40% of the time. A project to reduce analytical quer variation, and to create repeatability. Nowhere
error was initiated. Six weeks after this project was was this done better than in the laboratory.
introduced, remarkable results had been achieved. There are two elements that ought to be kept in
Cell count met the standard 90% of the time, and the mind when examining Figure 1. First, it should be
standard deviation for this cell count was less than clear that not only was the factory (in this case the
half of that which existed before the project com- fermentation step for a biologic) successful in con-
menced. A quiet revolution is taking place in this quering variation, but also so too was the analytical
analytical system. Analytical error has been slashed, laboratory involved. The laboratory manager and
and the project is far from over; in truth, it has barely the technicians involved reduced assay variation by
begun. Interestingly, nearly all the improvement work just as significant a proportion as did the production
has been done by the technicians. In this example, the people. This must be true; otherwise, the change in
science remains unchanged. It is the conduct of opera- factory performance would not have been so obvi-
tions that has improved. Central to this article is the ous. Secondly, if the instrument failure noted in
notion that if we are to do good science, we are well the pH chart had occurred before the trial, there is
served to start by conquering variation in operations. every chance that it would have gone unnoticed. It is
axiomatic that as we reduce variation in any process,
GOOD SCIENCE REQUIRES GOOD OPERATIONS ever smaller signals can be detected through the
Pharmaceuticals companies are designed, built, and reduced background random noise.
managed by scientists. This is only as one might This is a critical understanding if we are to do good
expect. Nearly always, one of the most important cri- science. Nowhere is this truer than in the analytical
teria for promotion will be technical skills and ability. world. The lower the variation in assays, the easier it
This results in pharmaceuticals businesses having a is to detect disturbances in the analytical process and
culture strongly biased towards technical excellence to correct them before they cause deviations or other
both at a business unit and at an individual level. trouble. The customer of the laboratory also benefits.
Technical excellence is a very good objective. How- The lower the variation is in the assay, the easier it is
ever, when such companies encounter a problem, for production people to detect signals in the pro-
the nature of the business and the people who staff duction data. Figure 2 shows a chart of laboratory
them is to address the problem from a scientific or controls (reference material) in another company.
technical perspective. This can be a terrible mistake,

56 Special edition: Statistics in Validation


Peer Reviewed: Analysis and Control of Variation

Figure 1: Results of a plant trial to reduce variation.

The production people believed the assay to be existed was clearly demonstrable. This was not pos-
inaccurate and were demanding more replicates in sible with variation at the level prior to the steady-
an attempt to improve assay accuracy. The analyti- state trial.
cal laboratory manager disagreed, suggesting the
problem was in assay variation rather than in ac- WHAT SHOULD BE THE INITIAL AIM?
curacy. He assigned a statistician to drive variation Stabilizing the process and reducing variation should
in the conduct of operations to a minimum. Again, be the initial aim for every analytical process. In
a dramatic decrease in analytical error is observed. particular, variation in laboratory operations, which
As before, the improvement is entirely operational, masks the causal relationships from the scientists,
and nearly all the improvement work was done by ought to be an early target.
the analysts. The entire project lasted for a week. No Have any of us ever met anybody working in
change was made to the science. pharmaceuticals or biologics who is not interested in
Good science requires good knowledge and a good variation, and if possible, reducing it? Every chemist,
understanding of that which is being investigated. biologist, virologist, analyst, manager, or operator
This requires understanding of causal relationships. with whom we have discussed this subject has been
The charts in Figure 3 come from the same trial as in agreement that reducing variation is a good thing
those results shown in Figure 1. The two variables to do. Some might claim that it is sometimes not pos-
should have shown a strong correlation based on the sible in certain circumstancesthat we have hit the
science, but until the operations were stabilized with limits of our technology. Others might be adamant
minimum variation, the scientists could not under- that whilst it is necessary to reduce variation, their
stand the process well enough to do good science. In hands are tied because the real causes of variation lie
Figure 3, not only do we note a much reduced scatter in a different department, and so on. Nevertheless, it
and a vastly increased R2 factor, but also that the seems we are all in agreement that reducing variation
angle of slope of the regression line is fundamentally is a good thing to do. The reasons that understanding
altered (the shallow slope in the left chart of Figure 3 and reducing variation is always a good thing to do
is caused by instability). From a scientists perspec- are many.
tive, both are critical understandings. After the trial,
the data made sense and the correlation that always

Special edition: Statistics in Validation 57


Peer Reviewed: Analysis and Control of Variation

Figure 2: Reduced variation in analytical error.

POLIO VACCINE CLINICAL SUPPLIES managers in the pharmaceuticals industry are made aware that
From a quality perspective, lower variation means more pre- their processes contain unnecessary variation many respond
dictable and better quality product. Jonas Salk understood well with, but I am compliant ...what is the problem?
the need for repeatability and predictable outcomes as a key
quality indicator. In 1954, the first batches of polio vaccines WHY NO PROGRESS?
were manufactured for the massive clinical trial. Over 400,000 Some are trying to convince the industry that the approaches
doses were administered without any serious incidents or nega- developed by Shewhart, Deming, Smith, Juran, Harry, and oth-
tive effects. The National Foundation for Infantile Paralysis ers holds the promise of improved quality and productivity as
had demanded that to have their vaccine accepted for the trial, well as fewer deviations and regulatory issues (2). Unfortunately,
manufacturers were required to make 11 successive batches, all change is occurring slowly. In the case study depicted in Figure
of which demonstrated that the live virus was completely inacti- 2, the laboratory manager and statistician who led this analytical
vated. Only two manufacturers met this criterion and only these revolution presented the results of their project to colleagues and
two provided vaccine for the clinical trial. After the successful peers. They intended to explain the methodology and demon-
trial, the federal government assumed oversight of manufactur- strate its benefits.
ing and large-scale vaccination. The requirement to make 11 For the most part, their audience was unresponsive. They
successive inactivated batches was dropped. Soon afterwards, could not see a problem. Generally speaking, they were meet-
a man-made polio epidemic followed that was created almost ing the required standards. Even if a similar project in their
exclusively by a single manufacturer who was not part of the laboratories might yield similar results, why should they bother
initial trial and who had never made more than four batches to drive analytical error to even lower levels? No argument
in a row without detecting live polio virus in finished batches. moved the detractors. Neither improved service to customer
Live virus was, in some batches, being missed during testing departments nor the potential to reduce regulatory deviations
and these batches were paralyzing and killing children. Other impressed them; nor did the opportunity to provide a better
issues did exist. However, the subsequent investigation showed platform for scientific work, now and in the future.
that if the requirement for repeatability and predictability had Until senior management removes options to conquering
been maintained, the man-made epidemic would never have variation by making it a strategic imperative, we ought not to be
occurred because the problematic vaccine would never have surprised if some refuse to switch their focus from technical to
been released for use (3). operational issues. When trouble occurs in the process, there is
a strong tendency for scientists to search for the smoking gun.
SHEWHART AND DEMING Sometimes it exists, and sometimes it does not. Where it does
Eighty years ago, Dr. Deming edited Dr. Shewharts seminal exist, it will be much easier to find in a low variation environ-
work, Economic Control of Quality of Manufactured Product ment. In many cases, however, what exists is not so much a
(4). Until his death in 1993, Deming pleaded with western smoking gun as a 100 tiny firecrackersa plethora of operation-
business to work at understanding and reducing variation in al issues that combine to produce a noisy environment with high
everything they do. Deming stated It is good management to variation in which it is very difficult to do good science.
reduce the variation of any quality characteristic ... whether this From a compliance perspective, reduced variation means
characteristic be in a state of control or not, and even when no fewer deviations, fewer resources tied up conducting investiga-
or few defectives are produced. Reduction in variation means tions and preparing reports, more resources dedicated to doing
greater uniformity and dependability of product, greater output the core work. This enables security for all (i.e., the company,
per hour, greater output per unit of raw material, and better the US Food and Drug Administration, and the consumer) that
competitive position (5). springs from a predictable, repeatable, and precise analytical
Unfortunately, 80 years later we are still learning that process with a known capability.
Shewhart and Deming were correct. To this day when analytical From an operational perspective, Littles Law explains why

58 Special edition: Statistics in Validation


Peer Reviewed: Analysis and Control of Variation

Figure 3: Before and after results of a steady-state trial (sst).

Deming was right when he claimed that reducing variation will be the likely analytical error next month?), a glance
increased output. Increased output from the same resources at Figure 4 soon reveals that any measure of process capability
(people and equipment) means lower costs. only has meaning if the data are reasonably stable (4). Finally,
Finally, from a scientific perspective, operating in a low-vari- how do scientists establish causal relationships when the data
ation environment results in easier detection of causal relation- are unstable? The Winter 2010 issue of the Journal of Validation
ships and fewer errors in interpreting data. Consider pre-clinical Technology (6) illustrates this issue. In one example, it resulted
trials. If the scientists are operating in a low-variation environ- in a potential root cause being moved from the bottom of the
ment, there will be fewer type I and type II errors (2). A type list to the top. Too often, significant errors in interpreting the
I error occurs if we conclude that two candidate molecules science are made. It is not possible to do good science when the
produced different effects when in fact there was no difference data are so unstable.
between them. A type II error occurs if it is concluded that the
two candidate molecules produced the same effect, when in fact Figure 4: The first principle; stabilize first.
there exists a real difference in performance. One is inclined to
wonder how often high levels of analytical error have sent the
wrong candidate molecule to the clinic and what the associated
costs might be. We can never know the answer to such musings.
What we can do is to work now and forever to minimize the
variation in operations to give the scientists the best chance at
doing good science.

THE FIRST PRINCIPLE However, if the data exhibit stability, they are predictable.
Stabilize first is the first principle (2). Figure 4 shows a stable This makes the analytical process trustworthy and easier to
and an unstable process, side-by-side, as a series of distributions manage. It greatly simplifies scheduling and allows us to provide
(2). What are the implications of instability? First, by definition, analytical capability and service guarantees that actually mean
an unstable process is not predictable. A modern Jonas Salk something. Stable data reveal causal relationships much more
would rightly exclude the unstable (unpredictable) supplier of readily. Stabilizing a process is akin to lifting a fog that hitherto
product or of analytical services. In addition, until it is stable, had concealed the truth from all. This allows the scientists to
a process has no known capability (4). One can do the calcula- do good science far more often. Fewer type I and type II errors
tions, but the resultants of these calculations mean nothing if are made. In the laboratory, analytical error can be even further
the data are unstable. What does this imply when the labora- reduced. In production, yields rise and costs fall. In discovery
tory controls investigated by the authors have never once been and development, scientists are able to detect smaller changes
stable at the commencement of investigations? First and fore- in the performance of a molecule or cell and to do a better job of
most, instability makes a mockery of the estimates provided for selecting the most promising candidate to send to the clinic.
analytical error. If the laboratory controls are unstable, no degree
of confidence can be applied to the degree of likely analytical BIOLOGICAL ASSAYS
error in the future, which is what process capability measures in By their nature, biological assays are usually more variable than
a laboratory. Because process capability implies prediction (what their chemical counterparts. It is too easy to shelter behind what

Special edition: Statistics in Validation 59


Peer Reviewed: Analysis and Control of Variation

Figure 5: Unstable laboratory controls (for biologics).

seems to be unavoidable variation, and to claim that the level of training in statistical process control techniques develop the data
variation observed is inherit in the biology and largely collection plan and statistical methods and procedures used in
unavoidable. In an attempt to overcome this high level of varia- measuring and evaluating process stability and process capabil-
tion, a common reaction is to add more replicates and more ity.Procedures should describe how trending and calculations
cost. However, if a control chart made with laboratory reference are to be performed and should guard against overreaction to
standards shows instability, the inevitable conclusion is that the individual events as well as against failure to detect unintended
same people and instruments can produce results with reduced process variability.Production data should be collected to evalu-
variation if only they could stabilize the process. Figure 5 shows ate process stability and capability. The quality unit should
two recent examples of reference standard performance in review this information. If properly carried out, these efforts can
biological assays. Both are unstable. This means that stabilizing identify variability in the process and/or signal potential process
the process will significantly reduce analytical error. Regardless improvements.
of whether the assay under examination is chemical or biologi- There is little room for interpretation of this statement.FDA
cal in nature, stability is more often an operational issue than it is demanding stability as a minimum standard, and with good
is a technical issue. When it is a technical issue, causes for the reason. In the majority of cases, instability is caused by opera-
trouble can be found much more rapidly and with more cer- tional rather than technical aspects. However, scientists tend
tainty when the assay is stable with minimum variation. Con- to examine any issue from a technical or scientific perspective
sider the charts at Figure 4. Once the correct control band has because that is how they are trained and because the culture
been calculated, often only one to three points reveal a change of most pharmaceutical businesses has a strong technical or
in the system, triggering a search for root causes while what- scientific bias. This need not be an issue, providing we under-
ever changed is still there to be found. Alternately, if a deliber- stand that trying to do good science in an unstable process
ate change has been made, often very few points are needed to varies from difficult to impossible, providing we understand that
demonstrate an improvement to the process.A well-constructed stabilize first ought to be the first principle, and providing that
control chart leads to faster, more effective interpretation of time we understand that nowhere is this more important than in the
series data. Laboratory controls are a good place to start. laboratory. JVT
Every example in this article came from biological processes.
Some were vaccines; others were biological therapeutics; but all REFERENCES
were biological. Operational excellence (i.e., reduced operational 1. J. McConnell, B. Nunnally and B. McGarvey, Blame the LaboratoryUn-
variation) is most important when the potential for variability derstanding Analytical Error, Journal of Validation Technology, Volume 15,
in the science is higher and when data are scarce or expensive. Number 3, Summer 2009.
Therefore, in biological analytical processes the need to achieve 2. B. K. Nunnally and J. S. McConnell, Six Sigma in the Pharmaceutical Indus-
operational excellence is greater than might usually be the case. try, CRC Press, 2007.
The same can be said of development areas where data are much 3. P. A. Offit, The Cutter Incident, Yale University Press, 2005.
more scarce. If we combine these two considerations, it is diffi- 4. W. A. Shewhart, Economic Control of Quality of Manufactured Product, Van
cult to avoid the conclusion that assay development for biologics Nostrand, 1931.
is one key area where the requirement to design for operational 5. W. Edwards Deming, On Some Statistical Aids Toward Economic Produc-
excellence and robustness is at its greatest. tion, Interfaces, Vol. 5, No. 4, August 1975.
6. J. McConnell, B. Nunnally and B. McGarvey, The Dos And Donts of Control
CONCLUSION ChartingPart 1, Journal of Validation Technology, Volume 16, Number 1,
In part, the FDA Guidance for Industry-Process Validation: Gen- Winter 2010.
eral Principles and Practices (7) states: 7. FDA, Guidance for Industry-Process Validation: General Principles and
We recommend that a statistician or person with adequate Practices, January 2011.

Originally published in the Winter 2011 issue of Journal of Validation Technology

60 Special edition: Statistics in Validation


Peer Reviewed: Variation

Analysis and Control of Variation:


Using Process Control to Reduce
Variability: Comparison of
Engineering Process Control with
Statistical Process Control | IVT
Bernard McGarvey, Brian K. nunnally, John Mcconnell

Key Points DiscusseD


There are two traditional ways to control the variability in process
parameters statistical process control (SPC) & engineering pro-
cess control (EPC).
Both of these approaches have much in common with respect to
their objectives.
There are differences, however, that determine which situations
each one is applied.
Misapplication of either approach in the wrong situation will
lead to less than optimal results, and in many cases may actually
increase variability in the process parameter.
Understanding how the approaches differ will help ensure they are
applied correctly.

intRoDuction
When I first joined the ranks of employed engineers (back in the early
80s), I worked in a technical services organization where part of my job
responsibilities was to use process control to improve the performance
of the manufacturing processes at the site. My job was to find out from
the chemists and other engineers what the perfect process should
look like and then, as Captain Picard of the USS Enterprise would say,
make it so. I remember one day sitting down with a chemist and ask-
ing her what was important about this part of the process. Her response
was that the temperature in the reactor needed to heat to 60C and then
stay exactly at this temperature until the reaction was complete. I then
worked on this until the temperature was flat-lined in that you could
see little difference between the target (set point) of 60C and the actual
temperature in the reactor during the reaction. If you had asked me
what I was doing, I would have described it as improving the control of
the process. Indeed, at one point, that same chemist described my role
as making her life easier because the improved control made it easier
to see if the process was behaving normally or not. At the time I would
have said I was using EPC to keep the controlled parameter (the reac-
tor temperature) at its set point. EPC has been around for a long time,
having started in the process industries (1). EPC is used to control the
value of a process parameter (the controlled parameter) to a set point by manipulating the value of another (the manipulated parameter), as shown in Figure 1.

Figure 1: A Temperature Control System.

When the temperature is below the set point, the Hot Supply control and Hot Return block valves are opened so that hot liquid flows around the tank jacket and the tank temperature rises. When the temperature rises above the set point, the Cold Supply control and Cold Return block valves open. The heating/cooling liquid is pumped around the jacket of the reactor to enhance the heat transfer rate. The block valves have only two states, open or shut, and are used just to ensure that hot liquid is not returned to the cold supply system and that cold liquid is not returned to the hot supply system. The control valves can vary their open position from zero percent to 100%. The actual amount the control valves open depends on how far the tank temperature is from the set point. The manipulated parameters are the percent open positions of the Cold and Hot Supply control valves.
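To make the mechanics of such a loop concrete, the short sketch below simulates a tank whose temperature is pushed around by random ambient disturbances while a simple proportional controller opens the hot or cold valve in proportion to the gap from a 60°C set point. The disturbance size, controller gain, and heat-transfer effect are invented values, not parameters from the article, and a real reactor loop would use a properly tuned control algorithm; the point is only to show variation being created in the manipulated parameter in order to suppress variation in the controlled parameter.

import random

SET_POINT = 60.0      # degC, target tank temperature (illustrative)
GAIN = 0.8            # proportional controller gain (assumed value)
HEAT_EFFECT = 2.0     # degC change per step at full valve opening (assumed)

def simulate(steps=200, controlled=True, seed=1):
    """Return a list of tank temperatures with or without EPC action."""
    random.seed(seed)
    temp = SET_POINT
    history = []
    for _ in range(steps):
        # Ambient disturbance that cannot be eliminated at its source.
        temp += random.gauss(0.0, 0.5)
        if controlled:
            # Proportional EPC action: valve opening proportional to the gap.
            gap = SET_POINT - temp                    # positive -> too cold, open hot valve
            valve = max(-1.0, min(1.0, GAIN * gap))   # -1 = full cold, +1 = full hot
            temp += valve * HEAT_EFFECT
        history.append(temp)
    return history

def spread(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, var ** 0.5

for label, ctrl in (("No control", False), ("EPC control", True)):
    mean, sd = spread(simulate(controlled=ctrl))
    print(f"{label}: mean = {mean:.2f} degC, standard deviation = {sd:.2f} degC")

Running the sketch shows the controlled temperature holding a much smaller standard deviation than the uncontrolled one, which is the whole purpose of the EPC loop described above.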
In the vernacular used today, my role would be de- has essentially become synonymous with variabil-
scribed as one of reducing variation. However, during ity reduction. Given the success of both methods,
the time I was in this role, the idea of random varia- however, it is clear that both have a role to play in
tion never impacted what we did in any significant process control/variability reduction. It is also intui-
way. It obviously did not need to, since EPC was be- tive to anyone who has practiced both methods that
ing used very successfully by many people to reduce they have similarities and differences and that there
variation. This state of affairs remained for several are situations where one approach is preferred over
years until we began to hear rumors of a new ap- the other. In fact, there are situations where applica-
proach to reducing variation called statistical process tion of one approach is simply wrong. For example,
control. Use of SPC (and other statistical thinking I just had a new gas furnace installed in my home.
approaches) were being attributed to the turnaround It uses a sophisticated EPC control system to keep
in the Japanese economy and the much higher quality the temperature of my house within suitable limits.
levels in goods that were being mass produced in So far, it is working very well. Of course, what the
Japan (2). Anecdotes about the high levels of quality control system is doing is keeping the variation of
of Japanese goods began to emerge (such as the one temperature within my house much smaller than the
where the variation in the Japanese-made items was variations in temperature outside my house (due to
so small that measurement systems in the US could changing weather) by adding heat when the tempera-
not detect the variation!). Eventually, we began to ture is below the set point. It is difficult to imagine
experiment and then implement these ideas and to an SPC control system being able to do this in any
see benefits. The ideas behind SPC were much easier practical way.
to appreciate from an implementation perspective, Not too many people get the opportunity to prac-
consisting mainly of plotting performance data on tice both types of control on an ongoing basis, and
a specially constructed chart, called a control chart, thus it makes it difficult for practitioners to see both
and then reacting to the chart in some predefined sides of the fence. Most engineers (myself included)
ways. For example, Figure 2 shows a control chart are taught EPC without much, if any, reference to
we might construct to monitor the product potency random variation. Most statisticians will be taught

62 Special edition: Statistics in Validation


Peer Reviewed: Variation
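The potency chart of Figure 2 is not reproduced here, but the logic of an individuals control chart is easy to sketch. The snippet below estimates 3-sigma limits from the average moving range, one common convention for individuals charts, and then reports any batch falling outside the limits. The batch potency values are made-up numbers used only to show the mechanics of flagging signals in the style of points A and B.

# Hypothetical batch potencies (% of label claim); not data from the article.
potency = [100.1, 99.9, 100.2, 99.8, 100.0, 100.3, 99.7, 97.0,
           100.1, 99.9, 100.2, 100.0, 103.0, 99.8, 100.1]

# The average moving range between consecutive points estimates short-term variation.
moving_ranges = [abs(b - a) for a, b in zip(potency, potency[1:])]
mr_bar = sum(moving_ranges) / len(moving_ranges)
sigma_hat = mr_bar / 1.128          # d2 constant for subgroups of size 2

center = sum(potency) / len(potency)
ucl = center + 3 * sigma_hat        # upper control limit
lcl = center - 3 * sigma_hat        # lower control limit

print(f"center line = {center:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
for batch, value in enumerate(potency, start=1):
    if value > ucl or value < lcl:
        print(f"batch {batch}: {value:.1f} is outside the control limits -> investigate")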

Because of its apparent simplicity and because of the stories that began to circulate about the success of SPC, it caught the imagination of (at least some) management in the Western Hemisphere. Over the years, this has morphed into a situation where SPC has essentially become synonymous with variability reduction. Given the success of both methods, however, it is clear that both have a role to play in process control and variability reduction. It is also intuitive to anyone who has practiced both methods that they have similarities and differences and that there are situations where one approach is preferred over the other. In fact, there are situations where application of one approach is simply wrong. For example, I just had a new gas furnace installed in my home. It uses a sophisticated EPC control system to keep the temperature of my house within suitable limits. So far, it is working very well. Of course, what the control system is doing is keeping the variation of temperature within my house much smaller than the variations in temperature outside my house (due to changing weather) by adding heat when the temperature is below the set point. It is difficult to imagine an SPC control system being able to do this in any practical way.

Not too many people get the opportunity to practice both types of control on an ongoing basis, and this makes it difficult for practitioners to see both sides of the fence. Most engineers (myself included) are taught EPC without much, if any, reference to random variation. Most statisticians will be taught SPC without any reference to EPC. It is therefore easy to see how biases can creep in. I can remember a conversation with a statistician about the comparison between SPC and EPC. His opinion was that SPC is better because it is the only method that actually reduced variation! Clearly many generations of engineers can refute this by showing case after case where EPC has reduced variation in a process. Therefore, given this risk of bias, it is very important to clearly understand how SPC and EPC are the same and how they are different. This understanding will ensure that these two excellent approaches are not misapplied. The rest of this article will discuss the similarities and differences between the two approaches.

Earlier, it was noted that SPC is easier to appreciate from an implementation perspective. However, the theoretical underpinnings of SPC are just as involved as those of EPC. Just try reading Shewhart's original writings on the topic (3, 4). This can result in situations where people think SPC looks simple and so misapply it. Some such misapplications have been documented (5-7). For example, some people end up thinking that 3-sigma limits were chosen as the control chart limits because only 0.3% of a normally distributed random variable falls outside this range. However, this was never part of Shewhart's argument for 3-sigma limits. His argument is purely empirical: over a long time of using these charts, the use of 3-sigma limits seems to strike the right economic balance between over-reacting and creating more variability and under-reacting and missing opportunities to reduce variability (3). This situation is not only seen in practitioners; in conversations with such practitioners, it is clear that they learned all this from misinformed teachers. The point here is that practitioners of SPC (and EPC, or indeed any other skill) need to invest intellectual energy in understanding why something works, that is, the theory behind the method.

Similarities Between EPC and SPC
The first similarity is that they both recognize the notion of an ideal state, a state of control, for the process parameter being controlled. Shewhart (3) has given us a very good definition of process control: "A phenomenon (process) will be said to be controlled when, through the use of past experience, we can predict, at least within limits, how the phenomenon may be expected to vary in the future. Here it is understood that prediction within limits means that we can state, at least approximately, the probability that the observed phenomenon will fall within the given limits." Because there is this idea of a controlled or stable state, it is possible to decide if the process parameter (temperature, potency) is being controlled adequately so that no control action is currently required.

Both approaches recognize the idea of capability. Once a process is controlled (a stable process), the performance can be assessed against the requirements. Because the process is stable, the data can be assembled into a summary view such as the histogram shown in Figure 3.

Figure 3: Assessing the Capability of a Performance Measure.

The capability is then determined by comparing the summary view with the Lower and Upper Specification Limits (LSL and USL). This can be done visually as in Figure 3 or more quantitatively using some calculated capability index such as Cpk (7).
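As a minimal sketch of the Cpk calculation just mentioned, assuming the data come from a stable and roughly normal process, the few lines below compute the index from a sample mean and standard deviation. The specification limits and sample values are invented for illustration only.

import statistics

# Hypothetical stable-process data and specification limits (not from the article).
data = [10.2, 9.9, 10.1, 10.0, 10.3, 9.8, 10.1, 10.0, 9.9, 10.2]
LSL, USL = 9.0, 11.0

mean = statistics.mean(data)
sd = statistics.stdev(data)          # sample standard deviation

# Cpk: distance from the mean to the nearest specification limit, in 3-sigma units.
cpk = min(USL - mean, mean - LSL) / (3 * sd)
print(f"mean = {mean:.3f}, s = {sd:.3f}, Cpk = {cpk:.2f}")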

Both approaches also focus on economics. Shewhart (3) made this clear when he included the word economic in the title of his book. EPC assumes that the cost to create and operate a control system like Figure 1 is more than offset by the gains in keeping the temperature close to the set point. Neither method promotes the idea of reducing variation without taking the cost of implementation into account. There is no point in spending $1,000 to save one dollar. In saying this, it should also be acknowledged that many times the advantage of reducing variation is hard to quantify, so it should be remembered that just because we cannot quantify the benefit, it does not mean there is no benefit.

Both approaches do not need to know the causes of the variability at the beginning. In fact, EPC is never concerned with these causes. It is inherently assumed (based on process knowledge) that it would not be practical to reduce or eliminate the causes of the variability. For example, I could eliminate the need for a heating/cooling system in my home by finding a location where the natural variations in the weather are within my requirements and so no control is required. However, this is not a practical solution to variability reduction. In fact, the objective of the EPC controller is to make the process robust to sources of variability that cannot be eliminated economically. In the case of SPC, the whole point of the approach is to identify some of the causes of variability so they can be reduced. Since it is assumed that it will make economic sense to do this, if they were known at the beginning, then they would be addressed at the beginning. Of course, it is possible that the SPC approach might identify sources of variability that cannot be reduced economically. How we would deal with this is very situation dependent.

Both approaches reduce variability in the same way; that is to say, in both cases, the variability of one parameter is reduced by causing or reducing variation in another parameter. This is based on the notion that nature is causal. If you want to change something in one place, then you must make a change somewhere else! This is easy to see in the case of EPC by looking at Figure 1. The variability in the tank temperature is reduced by creating variation in the manipulated parameter(s), the position of the control valves. It may not be quite so obvious in the case of SPC. Going back to the potency example in Figure 2, imagine that the batch corresponding to point A has just been completed, the potency has been plotted on the control chart, and a special cause investigation has been started. The investigation reveals that a valve closed more slowly than normal and an extra quantity of a reagent got into the reactor and caused the drop in yield. Further, the valve issue was caused by a gasket that had worn out prematurely. It is obvious that, if the worn gasket is not addressed, then the risk of future low yields is high. So a change must be made! First, the worn gasket is replaced, a change to the process. Secondly, the reason for the premature failure is addressed; that might lead to a need to change how valve gaskets are selected, a change to a business process that supports the process.

Both approaches are based on feedback control, as shown in Figure 4.

Figure 4: A Feedback Control System.

All feedback control works the same way. You start with an objective (keep the temperature at the set point, keep the process stable with no special causes). Then you compare actual performance with the objective; the difference is the (performance) gap. The controller then uses this gap as input to decide if a change is required. This change will cause the actual performance to change, and so the gap is impacted and the cycle is repeated. This should also remind people of the Deming PDCA (Plan/Do/Check/Act) loop (8), which is a feedback loop for process improvement.

Finally, both approaches recognize that a stable process with low variability is key to efficient process improvement. When the variability is low, the impact of changes on the process (both the intended impact and, just as importantly, the unintended impact) will be easier to see, and so the impact of the change will be assessed more quickly and with more certainty.

Thus, it can be seen that there are many similarities between EPC and SPC. However, there are at least three significant differences between them, and it is these differences that account for the different usage of each approach.

Differences Between EPC and SPC
The first major difference is that the EPC approach assumes that a lever can be found that can be adjusted in some economic way to reduce the variation of the controlled parameter. Without this lever (the heating and cooling control valves in Figure 1), EPC is a non-starter. SPC, on the other hand, does not require this assumption to be true. The classic applications of SPC, such as the application to potency in Figure 2, do not have such a lever. Essentially, the purpose of applying the SPC approach is to identify the levers and then modify the levers to reduce the variability in the potency. The worn gasket referred to earlier is an example of such an SPC lever.

The second major difference is the role of random variation. In SPC, the central attribute of the process parameters involved is that they are dominated by random variation. We would certainly expect to have random variability in a potency due to process variability and measurement error. Since it is well known from the Deming Funnel experiment that no control action should be taken if the data are pure random noise (9), the SPC controller has to be able to differentiate between pure random noise and cases where signals (special causes) are present in the data. This, of course, is the primary purpose of the control chart. The mentality of the SPC approach is that you should only take a control action if there is evidence that the data are not a purely random set of data. The SPC approach is very biased towards a hands-off control approach. On the other hand, EPC tends to ignore random variation. The EPC mentality is based on the notion that all changes in the data are real changes and that the controller should react to them. When the temperature increases, it is assumed that whatever caused this to happen was not random. It is assumed that whatever has changed will continue to do so unless some action is taken. Of course, the amount by which the controller adjusts the manipulated parameter will depend on how far the controlled parameter is from the set point. The EPC controller, therefore, may make very small or insignificant changes in some cases. The EPC approach is very biased towards a hands-on approach. This difference in mentality between the two approaches shows up clearly when the parameter to be controlled has significant random variation but also has significant non-random variation present. Examples of such situations abound in the pharmaceutical (and other) industries: controlling the weight of tablets, controlling the fill volume of vials, etc. If the control system was designed with the SPC approach as the starting point, the controller will tend to under-control and thus will tend to make infrequent adjustments. On the other hand, if the control system was designed with the EPC approach as the starting point, then the controller will tend to over-control by making more frequent or larger adjustments. In both cases, this will lead to larger variability in the controlled parameter than would be obtained if the optimal adjustments were made (10, 11).
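The under-control/over-control trade-off described above can be illustrated with a small simulation in the spirit of the Deming funnel experiment. A purely random output is run three ways: left alone, adjusted after every observation by the full observed deviation (an exaggerated EPC mentality applied to pure noise), and adjusted only when a point falls outside 3-sigma limits (an SPC-style dead band). The noise level, adjustment rules, and run length are assumptions chosen for illustration; they are not taken from references 10 or 11.

import random
import statistics

SIGMA = 1.0       # pure random noise level (assumed)
TARGET = 0.0
N = 5000

def run(policy, seed=7):
    random.seed(seed)
    offset = 0.0                     # cumulative effect of the adjustments made
    results = []
    for _ in range(N):
        y = TARGET + offset + random.gauss(0.0, SIGMA)
        results.append(y)
        if policy == "adjust_always":
            offset -= (y - TARGET)               # compensate the full deviation
        elif policy == "dead_band":
            if abs(y - TARGET) > 3 * SIGMA:      # react only to 3-sigma signals
                offset -= (y - TARGET)
        # "no_adjustment": leave the process alone
    return statistics.stdev(results)

for policy in ("no_adjustment", "adjust_always", "dead_band"):
    print(f"{policy:>14}: standard deviation = {run(policy):.2f}")

With pure noise, adjusting after every observation inflates the standard deviation by roughly 40%, while the dead-band rule stays close to the hands-off result, which is the behavior the text describes.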
The presence or absence of significant random variation also impacts how capability is measured. For SPC applications, a capability measure is a statistical measure, and the value of the measure gives an indication of the probability that potency would be outside its specifications. For EPC, there is no consideration of random variation in the control of the parameter, and the measure of capability cannot involve a statistical calculation. We simply look and see if the process parameter is controlled inside the specifications, and, if it is, the system is considered capable of meeting its requirements. The data within a batch cannot be used to define a probability of being inside the specifications in any statistical sense. However, it is possible to look at the variation of the process parameter in a statistical sense if the variation is looked at across batches. For example, the minimum and maximum temperatures during a reaction will vary in a random way across batches, and this random variation can be used to characterize the capability of the EPC controller across batches (12, 13).

Probably the most interesting difference between EPC and SPC lies in the cost of adjustment. EPC assumes that the cost of adjustment is insignificant compared to the benefits of variability reduction. Adjusting the control valves in Figure 1 is a trivial cost compared to the benefit of keeping the tank temperature close to the set point. SPC assumes the exact opposite. This aspect of SPC may not be obvious but is another reason why SPC is so biased against making changes unless you are quite sure that a change has occurred in the process parameter. Consider the point that is being made by SPC. Without SPC, management tends to react to any little change in a parameter as if it were a signal of a real change. They then order their team to figure out what has changed and get to the root cause when in fact nothing has changed. This can waste a lot of time and money and was a big reason why Deming (2) and others promoted statistical methods in general and SPC in particular. By using the control chart limits as a guide to whether or not a real change has occurred, SPC prevents over-reaction. Several years ago, I attended a monthly meeting where metrics for a certain operational area were reviewed. One of the metrics was monthly expenses for the area, and the manager, being educated in statistical thinking and SPC, had the financial team member plot the data on a control chart. The first month I attended the meeting, the monthly expenses were above the mean; however, because the value was below the upper control limit, the manager, despite some concern from some team members, did not ask for an investigation. The second month it was slightly higher again. Still the manager did nothing. The third month it was slightly higher again. By now the team was getting really concerned at the lack of action, but the manager held tough. Finally, in the fourth month, the expenses fell and the team breathed a sigh of relief. This is SPC as it should be practiced! The variation was simply random, and, if the manager had insisted on looking for a reason for the short-term slight trend upwards, it would have been a waste of time. Worse still, the team, in their zeal to find a root cause, might have found a phantom root cause and made unnecessary changes that cost resources and could have made things worse!

SPC using control charts is basically a dead band control strategy. While the data are within a dead band (the control limits), no action is warranted; the variation is simply random. Action is warranted once data appear outside the control limits. It can be shown that when the cost of adjustment is significant compared to the benefits of variability reduction, a dead band control strategy is preferred (11). In effect, the control strategy gives up some of the benefit of variability reduction by making fewer costly adjustments. EPC also recognizes the validity of this trade-off. For example, one of the downsides of EPC is that by making lots of adjustments, the control valves may wear much faster. In this case, a dead band control approach can be used to reduce the frequency of adjustments and reduce the wear on the control valve (1). This will increase the variability of the temperature, so the dead band strategy is valid if the increase in temperature variability is offset by the loss that could occur if the control valve failed prematurely and the batch was significantly impacted.

Summary
It should be noted that there are certain situations where all the attributes for EPC are present except that the amount of random variability is significant. In this case, there is a third approach called statistical process adjustment (SPA) that can be used (11). However, that is a subject for a future paper. JVT

References
1. F.G. Shinskey, Process Control Systems: Application, Design, and Tuning, 3rd ed., McGraw-Hill, 1988, ISBN 0-07-056903-7.
2. W.E. Deming, "On Some Statistical Aids Toward Economic Production," Interfaces 5 (4), 1-15, 1975.
3. W.A. Shewhart, Economic Control of Quality of Manufactured Product, ASQ 50th Anniversary Commemorative Reissue, D. Van Nostrand Company, Inc., 1980, ISBN 0-87389-076-0.
4. W.A. Shewhart, Statistical Method from the Viewpoint of Quality Control, Dover Publications, New York, 1986, ISBN 0-486-65232-7.
5. J.S. McConnell, B. Nunnally, and B. McGarvey, "The Dos and Don'ts of Control Charting, Part I," Journal of Validation Technology 16 (1), 2010.
6. J.S. McConnell, B. Nunnally, and B. McGarvey, "The Dos and Don'ts of Control Charting, Part II," Journal of Validation Technology 17 (4), 2011.
7. D.J. Wheeler, Advanced Topics in Statistical Process Control, SPC Press, Knoxville, Tennessee, 1995, ISBN 0-945320-45-0.
8. W.E. Deming, Out of the Crisis, The Center for Advanced Engineering Study, M.I.T., Cambridge, Mass. 02139, ISBN 0-911379-01-0.
9. J.S. McConnell, Analysis and Control of Variation, 4th ed., Delaware Books, ISBN 0-958-83242-0, 1987.
10. J.F. MacGregor, "A Different View of the Funnel Experiment," Journal of Quality Technology 22 (4), 255-259, 1990.
11. E. Del Castillo, Statistical Process Adjustment for Quality Control, Wiley Series in Probability and Statistics, 2002.
12. G. Mitchell, K. Abhivava, K. Griffiths, K. Seibert, and S. Sethuraman, "Unit Operations Characterization Using Historical Manufacturing Performance," Industrial & Engineering Chemistry Research 47, 6612-6621, 2008.
13. G. Mitchell, K. Griffiths, K. Seibert, and S. Sethuraman, "The Use of Routine Process Capability for the Determination of Process Parameter Criticality in Small-molecule API Synthesis," Journal of Pharmaceutical Innovation 3, 105-112, 2008.
14. K.L. Jensen and S.B. Vardeman, "Optimal Adjustment in the Presence of Deterministic Process Drift and Random Adjustment Error," Technometrics 35 (4), 376-388, 1993.

Originally published in the Autumn 2011 issue of Journal of Validation Technology



Peer Reviewed: Method Validation

Improvement Alphabet: QbD,


PAT, LSS, DOE, SPC: How Do
They Fit Together? | IVT
Ronald D. Snee, Ph.D.

Consider the following scenario: a new pharmaceutical or biotech


scientist or engineer is assigned the job of solving a problem, improving
a process, or just developing a better understanding of how a process works.
Five different people are asked for advice and guidance, and five dif-
ferent recommendations are received and summarized as quality-by-
design (QbD), process analytical technology (PAT), lean six sigma (LSS),
design of experiments (DOE), and statistical process control (SPC).
Each advisor has had success with their recommended approach. So
what should this professional do? Which approach should the profes-
sional use? First, some context is needed to aid understanding of the
five approaches.

Problem Solving and Process Improvement Context


It is important to recognize that all five approaches utilize system and
process thinking, are helpful, and have merit, particularly when used
in the application the approach was designed to handle. There is also
considerable overlap in what the approaches can do regarding concepts,
methods, and tools. Two guiding considerations that aid selection are:
What function is one working in: development or manufacturing?
What is the need: process or product design or redesign, process control, or improvement of a product or process?
Understanding is enabled by reviewing the definitions of the approaches.

Quality-by-Design
QbD is defined as a systematic approach to development that begins
with predefined objectives, emphasizes product and process under-
standing and process control, and is based on sound science and
quality risk management (1). QbD is about designing quality into a
product and its manufacturing process (2) so that in-process and final
product inspection is less critical and can be reduced. The quality com-
munity learned decades ago that quality must be built in, it cannot
be inspected in. Borman, et al (2007); Schweitzer, et al (2010); and
McCurdy, et al (2010) discuss applications of QbD (3-5).
Since announcing the value of QbD, US Food and Drug Administra-
tion has continued to emphasize its importance in the recently released
Process Validation Guidance (6) and again in 2012, requiring the use of
QbD for new abbreviated new drug application (ANDA) filings stating,
We encourage you to apply Quality by Design (QbD) principles to the
pharmaceutical development of your future original ANDA product
submissions, as of January 1, 2013. (7).
In 2012, FDA stated that a risk-based, scientifically sound sub-
mission would be expected to include the following: Quality target
product profile (QTPP), critical quality attributes (CQAs) of the drug product, product design and understanding including identification of critical attributes of excipients, drug substance(s), and/or container closure systems, process design and understanding including identification of critical process parameters and in-process material attributes, and control strategy and justification (8).

Process Analytical Technology
The following direct quote from the FDA guidance explains PAT well: "The Agency considers PAT to be a system for designing, analyzing, and controlling manufacturing through timely measurements (i.e., during processing) of critical quality and performance attributes of raw and in-process materials and processes, with the goal of ensuring final product quality. It is important to note that the term analytical in PAT is viewed broadly to include chemical, physical, microbiological, mathematical, and risk analysis conducted in an integrated manner. The goal of PAT is to enhance understanding and control the manufacturing process, which is consistent with our current drug quality system: quality cannot be tested into products; it should be built-in or should be by design. Consequently, the tools and principles described in this guidance should be used for gaining process understanding and can also be used to meet the regulatory requirements for validating and controlling the manufacturing process." (9).

PAT has many applications. Some identified by Rathore, Bhambure, and Ghare (2010) include: rapid tablet identification using acoustic resonance spectroscopy, near-infrared spectroscopy (NIR) based powder flow characterization, active drug identification and content determination using NIR, and roller compaction dry granulation based on effusivity sensor measurements. As noted above, PAT is a system for designing, analyzing, and controlling a manufacturing process and is thus a collection of concepts, methods, and tools (10).

Lean Six Sigma
LSS is a business improvement strategy and system with supporting concepts, methods, and tools that focuses on increasing process performance, resulting in enhanced customer satisfaction and improved bottom-line results (11). One objective of LSS is the reduction of variation in the output of a process. Process performance is measured by the flow of material and information through the process as well as product quality and cost, process cycle time, and customer satisfaction.

A pharmaceutical company had concern that one of its blockbuster drugs had considerable finished product inventory, and yet product delivery times were very long. An LSS project was chartered with the goal of reducing the cycle time of batch release by 50%. The analysis of batch-release sub-process cycle times showed that review by manufacturing accounted for the major portion of the total cycle time. The review process by manufacturing was revised using lean manufacturing principles. The overall cycle time was reduced by 35-50% depending on the product type, the inventory of the drug was reduced by $5 million (a one-time reduction), and the annual operating costs were reduced by $200,000 (12).

Design-of-Experiment
DOE is a systematic approach to experimentation wherein the process variables (X) are changed in a controlled way and the effects on the process outputs (Y) are measured, critical process variables and interactions are identified, experimental knowledge is maximized, and predictive cause-effect relationships [Y=f(X)] are developed. DOE can be used to design experiments for building knowledge about any product and process in manufacturing and service processes alike where X variables can be controlled and where quantitative Y responses can be reliably measured (13). In the pharmaceutical and biotech QbD world, in addition to the uses above, DOE is used to establish a design space and control strategy for a process or test method. Borman, et al (2007); Schweitzer, et al (2010); and McCurdy, et al (2010) discuss some examples.

Aggarwal (2006) discussed an API development study that was designed to increase the yield of the process, which was approximately 40%. By conducting two designed experiments, the yield was increased to more than 90%, the lab capacity was doubled, and costs were reduced by using less catalyst, as learned from the experiments. In the first experiment, five variables were studied in 20 runs and yields of 75% were observed. The analysis of the data indicated that the ranges of some of the variables should be changed and one variable should be held constant in the next experiment. The resulting 30-run experiment identified a set of conditions that produced more than 97% yield. As a result, the yield of the process was more than doubled using two experiments and 50 runs (14).
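As a sketch of what a designed experiment looks like in practice, the code below builds a two-level full factorial for three hypothetical factors and estimates main effects from simulated yields. The factor names, the underlying "true" model, and the noise level are invented for illustration; the Aggarwal study used larger five-variable designs, and packages such as Minitab, JMP, or statsmodels would normally be used for design generation and analysis.

import itertools
import random

random.seed(3)

# Coded levels (-1, +1) for three hypothetical factors: temperature, catalyst, time.
factors = ["temp", "catalyst", "time"]
design = list(itertools.product([-1, 1], repeat=len(factors)))   # 2^3 = 8 runs

def observed_yield(run):
    """Simulated response: an assumed true model plus experimental noise."""
    temp, cat, time_ = run
    return 70 + 6 * temp + 4 * cat + 1 * time_ + 3 * temp * cat + random.gauss(0, 1)

responses = [observed_yield(run) for run in design]

# For a balanced two-level design, each main effect is the difference between
# the average response at the high and low levels of that factor.
for j, name in enumerate(factors):
    high = [y for run, y in zip(design, responses) if run[j] == 1]
    low = [y for run, y in zip(design, responses) if run[j] == -1]
    effect = sum(high) / len(high) - sum(low) / len(low)
    print(f"main effect of {name}: {effect:+.1f} yield units")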
Statistical Process Control
SPC is a collection of statistical and non-statistical tools that help manufacturers understand the variation observed in process performance, help maintain the process in a state of statistical control (process variation is predictable within upper and lower limits), and identify opportunities for improvement. SPC has a wide range of applicability and can be used to monitor, improve, and control any product and process in manufacturing and service processes alike (15). PAT, described above, often uses SPC as part of the process control procedure.

A biopharmaceutical process was exhibiting a low yield in fermentation, and there was concern that the process would not be able to meet market demand. A control chart analysis identified the problem: there was significant variation between the batches of media used in the process. A quality control procedure for the batches of media was put in place, and the process consistently produced yields 20-25% higher than the previous process had produced, enabling the company to meet market demand for the drug.

At a high level, the relationships between the five approaches, and the areas in which they are used, development and manufacturing, are shown in the Figure. Some conclusions from the figure include:

There is no step-by-step procedure to decide which approach to select. Over time, any organization involved in development and manufacturing will use aspects of all of the approaches. At any point in time, the critical question is, "What approach should I use for this need at this time?"

Clearly, QbD is the broadest approach. It works in both development and manufacturing, having greater utility in development than in manufacturing. QbD utilizes PAT, DOE, and SPC and intersects with LSS. Contrary to the belief of many, QbD is much more than DOE. It also involves things such as the QTPP, CQAs of the drug product, product and process design and understanding (including identification of critical process parameters and attributes of excipients), drug packaging, and process control strategies. DOE is necessary but not sufficient; it is critically important to the successful use of QbD but is not the only element of the system.

LSS, which has a large intersection with QbD, also works in both development and manufacturing and utilizes both DOE and SPC.

PAT has roots in development, where the information to perform PAT in manufacturing is developed. PAT holds the promise of real-time process control and product release.

SPC and DOE have utility outside of development and manufacturing, in areas such as laboratory efficiency, change control, business processes, and sales and marketing (16). In general, if a product or service that needs to be created or improved can be defined, DOE and SPC will be useful in some way.

Now return to the question posed at the beginning. What should this engineer or scientist do? First, the approach taken depends on the situation: the problem, the objectives and goals, and the environment, development or manufacturing. If the goal is development of a new product, process, or both, it is good strategy to think using QbD with PAT to develop the control strategy. Both DOE and SPC will likely be used as part of QbD and PAT in such a situation.

If one needs to improve product or process performance prior to launch, LSS can be useful. In such a situation, DOE and SPC techniques will often be used as part of the LSS approach.

LSS can also be useful in improving a product or process after launch. What is often overlooked is that processes can frequently be improved while remaining within the bounds of the original filing. Large gains in performance with significant financial improvement often result. Of course, if a design space was part of the original filing, then changes within the region of the design space are possible without getting approval from FDA.

Another situation is the need to create and implement a monitoring system to better control the process and comply with the guidance provided in Stage 3 of the FDA Process Validation Guidance (6). Such a system will focus on assuring process stability and capability and will use the SPC tools of control charts and process capability indices (17).

These approaches are most effectively utilized when viewed from a systems perspective. All of these approaches are in fact systems that include a set of concepts, methods, and tools. The systems thinking that underlies these approaches increases the effectiveness of the methods. It has been learned over the years that the effectiveness of any approach is greatly enhanced when a system is created and deployed to implement the approach.

Other strategies are possible, as QbD, PAT, LSS, DOE, and SPC have many embedded elements and tools. The author's hope is that this discussion will help the reader understand the uses and value of the approaches and provide an aid that will be useful as one works to use and implement these approaches to improve products and processes.

© 2013 Ronald D. Snee. JVT


References
1. ICH Q8(R2), Pharmaceutical Development.
2. J.M. Juran, Juran on Quality by Design: The New Steps for Planning Quality into Goods and Services, The Free Press, New York, NY, 1992.
3. P. Borman, P. Nethercote, M. Chatfield, D. Thompson, and K. Truman, "Application of Quality by Design to Analytical Methods," Pharmaceutical Technology, 142-152, 2007.
4. M. Schweitzer, M. Pohl, M. Hanna-Brown, P. Nethercote, P. Borman, P. Smith, and J. Larew, "Implications and Opportunities of Applying QbD Principles to Analytical Measurements," Pharmaceutical Technology 34 (2), 52-59, 2010.
5. V. McCurdy, M.T. am Ende, F.R. Busch, J. Mustakis, P. Rose, and M.R. Berry, "Quality by Design Using an Integrated Active Pharmaceutical Ingredient Drug Product Approach to Development," Pharmaceutical Engineering, 28-38, July/Aug 2010.
6. FDA, Guidance for Industry, Process Validation: General Principles and Practices (Rockville, MD, Jan. 2011).
7. FDA Information for Industry Webpage, available here.
8. Pharmaceutical Manufacturing, July/August 2012.
9. FDA, Guidance for Industry, PAT: A Framework for Innovative Pharmaceutical Development, Manufacturing, and Quality Assurance (Rockville, MD, Sept. 2004).
10. A.S. Rathore, R. Bhambure, and V. Ghare, "Process Analytical Technology (PAT) for Biopharmaceutical Products," Annals of Bioanalytical Chemistry 398 (1), 137-154, 2010.
11. R.D. Snee and R.W. Hoerl, Leading Six Sigma: A Step by Step Guide Based on Experience With General Electric and Other Six Sigma Companies, FT Prentice Hall, New York, NY, 2003.
12. R.D. Snee and R.W. Hoerl, Six Sigma Beyond the Factory Floor: Deployment Strategies for Financial Services, Healthcare, and the Rest of the Real Economy, Financial Times Prentice Hall, New York, NY, 2005.
13. D.C. Montgomery, Design and Analysis of Experiments, 8th ed., John Wiley and Sons, New York, NY, 2012.
14. V.K. Aggarwal, A.C. Staubitz, and M. Owen, "Optimization of the Mizoroki-Heck Reaction Using Design of Experiment (DOE)," Organic Research and Development 10, 64-69, 2006.
15. D.C. Montgomery, Introduction to Statistical Quality Control, 7th ed., John Wiley and Sons, New York, NY, 2011.
16. J. Ledolter and A.J. Swersey, Testing 1-2-3: Experiment Design with Applications to Marketing and Service Operations, Stanford Business Books, Stanford, CA, 2007.
17. R.D. Snee, "Using QbD to Enable CMO Manufacturing Process Development, Control and Improvement," Pharmaceutical Outsourcing, 10-18, January/February 2011.

General References
G.E.P. Box, J.S. Hunter, and W.G. Hunter, Statistics for Experimenters: Design, Innovation and Discovery, 2nd ed., John Wiley and Sons, New York, NY, 2005.

Originally published in the Autumn 2011 issue of Journal of Validation Technology



Peer Reviewed: Method Validation

Statistical Analysis in Analytical


Method Validation | IVT
Eugenie Webster (Khlebnikova)

Abstract
This paper discusses an application of statistics in analytical method
validation. The objective of this paper is to provide an overview of
regulatory expectations related to statistical analysis and the review of
common statistical techniques used to analyze analytical method vali-
dation data with specific examples. The examples provided cover the
minimum expectations of regulators.

Key Points
The following key points are presented:
Regulatory guidelines regarding statistical data analysis in analyti-
cal method validation.
Statistics to analyze data for analytical method validation such
as mean, standard deviation, confidence intervals, and linear
regression.
Data analysis using statistical packages such as Minitab and Excel is discussed.

Introduction
Analytical method validation is an important aspect in the phar-
maceutical industry and is required during drug development and
manufacturing. The objective of validation of an analytical method is
to demonstrate that the method is suitable for the intended use, such
as evaluation of a known drug for potency, impurities, etc. The intent
of method validation is to provide scientific evidence that the analyti-
cal method is reliable and consistent before it can be used in routine
analysis of drug product. The analytical method validation is governed
by the International Conference on Harmonization Guideline Q2(R1),
Validation of Analytical Procedures: Text and Methodology (1). The
ICH guideline on performing analytical method validation provides
requirements to demonstrate method specificity, accuracy, precision,
repeatability, intermediate precision, reproducibility, detection limit,
quantitation limit, linearity, range, and robustness. The ICH definitions
for validation characteristics are listed in Table I.


Table I: Validation Characteristics for Analytical Method Validation.

The validation characteristics should be investigated based on the nature of the analytical method. Results for each applicable validation characteristic are compared against the selected acceptance criteria and are summarized in the analytical method validation report. ICH also provides recommendations on the statistical analysis required to demonstrate method suitability. These recommendations are further discussed in the following sections. In addition to ICH, the US Food and Drug Administration guidance, Draft Guidance for Industry: Analytical Procedures and Methods Validation (2), can be consulted for detailed information on the US requirements.

Statistics in Analytical Method Validation
Statistical analysis of data obtained during a method validation should be performed to demonstrate the validity of the analytical method. The statistics required for the interpretation of analytical method validation results are the calculation of the mean, standard deviation, relative standard deviation, confidence intervals, and regression analysis. These calculations are typically performed using statistical software packages such as Excel, Minitab, etc. The purpose of statistical analysis is to summarize a collection of data in a way that provides an understanding of the examined method characteristic. The acceptance criteria for each validation characteristic are typically set around the individual values as well as the mean and relative standard deviation. The statistical analysis explained in this paper is based on the assumption of a normal distribution. Non-normally distributed data will need to be transformed first, prior to performing any statistical analysis. The statistical tools, with examples of each tool's application, are described in the following.

Mean
The mean or average of a data set is the most basic and most commonly used statistic. The mean is calculated by adding all data points and dividing the sum by the number of samples. It is typically denoted by x̄ (x bar) and is computed using the following formula:

x̄ = (ΣXi)/n

where the Xi are the individual values and n is the number of individual data points.

Standard Deviation
The standard deviation of a data set is the measure of the spread of the values in the sample set and is computed by measuring the difference between the mean and the individual values in the set. It is computed using the following formula:

s = √[ Σ(Xi − x̄)² / (n − 1) ]

where Xi is an individual value, x̄ is the sample mean, and n is the number of individual data points.

Relative Standard Deviation
The relative standard deviation is computed by taking the standard deviation of the sample set multiplied by 100% and dividing it by the sample set average. The relative standard deviation is expressed as a percent. Typically, the acceptance criteria for accuracy, precision, and repeatability of data are expressed in % RSD:

% RSD = (s / x̄) × 100%

Confidence Interval
Confidence intervals are used to indicate the reliability of an estimate. Confidence intervals provide limits around the sample mean to predict the range of the true population mean. The prediction is usually based on a probability of 95%. The confidence interval depends on the sample standard deviation and the sample mean:

Confidence interval for the population mean = x̄ ± z(s/√n)

where s is the sample standard deviation, x̄ is the sample mean, n is the number of individual data points, and z is a constant obtained from statistical tables for z. The value of z depends on the confidence level and is listed in statistical tables for z. For 95%, z is 1.96 (3). For small samples, z can be replaced by the t-value obtained from Student's t-distribution tables (4). The value of t corresponds to n−1 degrees of freedom.
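The calculations above can be reproduced with a few lines of code. The snippet below computes the mean, standard deviation, % RSD, and a 95% confidence interval for a small set of replicate results; the six peak-area values are invented for illustration, and for n − 1 = 5 degrees of freedom the tabulated two-sided 95% t-value of 2.571 is used in place of z = 1.96.

import statistics

# Hypothetical replicate peak areas from a system precision test (not the Table II data).
replicates = [1502.3, 1498.7, 1505.1, 1500.9, 1497.5, 1503.4]

n = len(replicates)
mean = statistics.mean(replicates)
s = statistics.stdev(replicates)           # sample standard deviation (n - 1)
rsd = s / mean * 100                       # relative standard deviation, %

t_95 = 2.571                               # two-sided 95% t-value for 5 degrees of freedom
half_width = t_95 * s / n ** 0.5
print(f"mean = {mean:.1f}")
print(f"standard deviation = {s:.2f}, %RSD = {rsd:.2f}%")
print(f"95% confidence interval: {mean - half_width:.1f} to {mean + half_width:.1f}")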
Table II provides an example of a typical data analysis summary for the evaluation of system precision for a high-performance liquid chromatography (HPLC) analysis.

Table II: An Example of a System Precision Determination for a HPLC Analysis.

Figure 1: Fitted Line Plot.

In this example, the data clearly show a linear relationship. The fitted or estimated regression line equation is computed using the following formula:

Y = b0 + b1X + ei

where b0 is the y-intercept, b1 is the line slope, and ei is the residual.


Table IV provides the calculations that are used to compute y-intercept and the line slope.

Table IV: Manual Calculations.

Table V provides the mathematical formulas and calculations for data listed in Table IV.

Table V: Manual Calculations for Error.

Thus, the equation of the line is Y = 1.13 + 20.39*X.
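The fitted line above comes from the article's Table IV and Table V hand calculations, which are not reproduced here. As a cross-check on the mechanics, the short script below carries out an ordinary least-squares fit, together with R2 and r, on an invented five-point calibration data set; the concentrations and responses are placeholders rather than the article's data, so the numerical results will differ.

# Hypothetical calibration data: concentration (x) vs. instrument response (y).
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [11.4, 21.6, 31.8, 41.5, 52.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx              # slope
b0 = y_bar - b1 * x_bar     # y-intercept

predicted = [b0 + b1 * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
r = r_squared ** 0.5        # positive slope, so r takes the positive root

print(f"Y = {b0:.2f} + {b1:.2f}*X")
print(f"R2 = {r_squared:.4f}, r = {r:.4f}")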


The other important calculations that are typically reported are the coefficient of determination (R2) and the linear correlation coefficient (r). The coefficient of determination (R2) measures the proportion of variation that is explained by the model. Ideally, R2 should be equal to one, which would indicate zero error. The correlation coefficient (r) is the correlation between the predicted and observed values. This will have a value between 0 and 1; the closer the value is to 1, the better the correlation. Any data that form a straight line will give a high correlation coefficient; therefore, extra caution should be taken when interpreting the correlation coefficient. Additional statistical analysis is recommended to provide estimates of systematic errors, not just the correlation of results. For instance, in method comparison studies, if one method gives consistently higher results than the other method, the results would show linear correlation and have a high correlation coefficient, despite a difference between the two methods.

Table VI provides equations that are used to determine the coefficient of determination (R2) and the correlation coefficient (r).

Table VI: Line Equation Formulas.

Figure 2 demonstrates the Excel output, and Figure 3 demonstrates the Minitab output.

Figure 2: Excel Output.


Figure 3 demonstrates Minitab output.

Figure 3: Minitab Output.

Table VII provides the summary of linear regression calculations.

Table VII: Regression Summary


Other Statistical Tools
Other statistical tools used in method validation include comparative studies using Student's t-test, analysis of variance (ANOVA), design of experiments, and assessment of outliers. Information on these statistical tools can be obtained from the statistical books suggested in the reference section.

ICH Data Analysis Recommendations
The ICH guidelines provide suggestions regarding data reporting and analysis. Statistics recommended by ICH to evaluate method suitability are listed below.

Specificity
The results from specificity studies are typically interpreted by a visual inspection. Quantitative interpretation may also be performed using analytical software that is able to manipulate spectral information to analyze spectra.

Accuracy
ICH recommends accuracy assessment using a minimum of nine determinations at three concentration levels covering the specified range. It should be reported as percent recovery by the assay of a known amount of analyte in the sample or as the difference between the mean and the accepted value together with the confidence intervals. Table VIII provides an example of accuracy data assessment.

Table VIII: Accuracy Example.

Precision
Precision is assessed by comparison of results obtained from samples prepared to test the following conditions:
Repeatability expresses the precision under the same operating conditions over a short interval of time. Repeatability is also termed intra-assay precision.
Intermediate precision expresses within-laboratories variations: different days, different analysts, different equipment, etc.
Reproducibility expresses the precision between laboratories (collaborative studies, usually applied to standardization of methodology).

Table IX provides an example of a typical data analysis summary for the evaluation of a precision study for an analytical method. In this example, the method was tested in two different laboratories by two different analysts on two different instruments.

Table IX: Example of Results Obtained for a Precision Study.

In the example provided in Table IX, precision of the analytical procedure is evaluated by statistical analysis of data to determine method precision. Precision is determined at a number of different levels during validation, which include system precision, repeatability, intermediate precision, and reproducibility. The system precision is evaluated by comparing the means and relative standard deviations. Reproducibility is assessed by means of an inter-laboratory trial. The intermediate precision is established by comparing analytical results obtained when using different analysts and instruments and performing the analysis on different days. The repeatability is assessed by measuring the variability in the results obtained when using the analytical method in a single determination. In each case, the mean and % RSD are calculated and compared to the established acceptance criteria.
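A hedged sketch of how a Table IX style summary might be computed: the loop below reports the mean and % RSD for each laboratory/analyst combination and for the pooled data set, so that the different levels of precision can be compared with their acceptance criteria. The result values are invented placeholders, not the article's data, and a formal study would normally separate the variance components (for example, by ANOVA) rather than simply pooling.

import statistics

# Hypothetical assay results (% of label claim); keys are (laboratory, analyst).
results = {
    ("Lab 1", "Analyst 1"): [99.8, 100.2, 99.9, 100.4, 100.0, 99.7],
    ("Lab 1", "Analyst 2"): [100.3, 99.9, 100.5, 100.1, 99.8, 100.2],
    ("Lab 2", "Analyst 1"): [99.5, 99.9, 99.7, 100.1, 99.6, 99.8],
    ("Lab 2", "Analyst 2"): [100.0, 100.4, 99.9, 100.2, 100.3, 99.8],
}

def summarize(values):
    mean = statistics.mean(values)
    rsd = statistics.stdev(values) / mean * 100
    return mean, rsd

for (lab, analyst), values in results.items():
    mean, rsd = summarize(values)
    print(f"{lab}, {analyst}: mean = {mean:.2f}%, RSD = {rsd:.2f}%  (repeatability)")

pooled = [v for values in results.values() for v in values]
mean, rsd = summarize(pooled)
print(f"All data combined: mean = {mean:.2f}%, RSD = {rsd:.2f}%  (overall precision)")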

Detection Limit
The ICH guideline mentions several approaches for determining the detection limit: visual inspection, signal-to-noise, and using the standard deviation of the response and the slope. The detection limit and the method used for determining the detection limit should be presented. If visual evaluation is used, the detection limit is determined by the analysis of samples with known concentrations of analyte and by establishing the minimum level at which the analyte can be reliably detected. The signal-to-noise evaluation is performed by comparing measured signals from samples with known low concentrations of analyte with those of blank samples.


When the detection limit is based on the standard deviation of the response and the slope, it is calculated using the following equation:

DL = 3.3σ/S

where σ is the standard deviation of the response and S is the slope of the calibration curve.

Quantitation Limit
The ICH guideline states several approaches for determining the quantitation limit: an approach based on visual evaluation, an approach based on signal-to-noise, and an approach based on the standard deviation of the response and the slope. The quantitation limit and the method used for determining the quantitation limit should be presented. When the quantitation limit is based on the standard deviation of the response and the slope, it is calculated using the equation below:

QL = 10σ/S

where σ is the standard deviation of the response and S is the slope of the calibration curve.
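Assuming the standard deviation of the response (for example, the residual standard deviation of a low-level calibration line) and the slope are already available, the detection and quantitation limits follow directly from the two equations above. The numbers below are placeholders used only for illustration; the slope value simply reuses the 20.39 from the fitted line discussed earlier.

# Placeholder values: standard deviation of the response and calibration slope.
sigma = 0.21      # standard deviation of the response (assumed)
slope = 20.39     # slope of the calibration curve (illustrative)

detection_limit = 3.3 * sigma / slope
quantitation_limit = 10 * sigma / slope

print(f"detection limit    = {detection_limit:.4f} concentration units")
print(f"quantitation limit = {quantitation_limit:.4f} concentration units")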
Linearity
The ICH guideline states that a linear relationship should be evaluated across the range of the analytical procedure. If there is a linear relationship, test results should be evaluated by linear regression analysis. The correlation coefficient, y-intercept, slope of the regression line, and residual sum of squares should be submitted with a plot of the data.

Range
Range is obtained from the results for linearity, accuracy, and precision. These results should be linear, accurate, and precise to validate a specific range for the method.

Robustness
Robustness is evaluated by performing a comparison of results obtained by deliberately manipulating method parameters (temperature, different columns, etc.). Means and % RSDs are compared against the acceptance criteria to evaluate the impact of changing experimental parameters.

Conclusion
The statistical methods used during analytical method validation involve basic knowledge of statistics. Even though there are statistical packages available to perform statistical calculations, it is important to understand the mathematical basis behind these calculations. It is essential for analysts to be familiar with the basic statistical elements. Statistics used for validation data interpretation should be incorporated into the company's standard procedures and specified in the validation protocol and report. JVT

References
1. ICH, Technical Requirements for Registration of Pharmaceuticals for Human Use, Topic Q2(R1), Validation of Analytical Procedures: Text and Methodology.
2. FDA, Analytical Procedures and Methods Validation (Rockville, MD, 2000).
3. S. Bolton, Pharmaceutical Statistics: Practical and Clinical Applications, 5th ed., Marcel Dekker, Inc., New York, NY, 2004, p. 558, Table IV.2.
4. S. Bolton, Pharmaceutical Statistics: Practical and Clinical Applications, 5th ed., Marcel Dekker, Inc., New York, NY, 2004, p. 561, Table IV.4.
5. Minitab 16 Statistical Software (2010) [Computer software], Minitab, Inc., State College, PA.
6. W.J. Dixon and F.J. Massey, Introduction to Statistical Analysis, McGraw-Hill, New York, NY, 1969.
7. NIST/SEMATECH, e-Handbook of Statistical Methods, available at: http://www.itl.nist.gov/div898/handbook
8. P.C. Meier and R.E. Zünd, Statistical Methods in Analytical Chemistry, 2nd ed., John Wiley & Sons, New York, 2000.
9. J.N. Miller and J.C. Miller, Statistics and Chemometrics for Analytical Chemistry, 6th ed., Pearson/Prentice Hall, Harlow, UK, 2010.
10. AMC Technical Brief, No. 14, The Royal Society of Chemistry, 2003.

Originally published in the Winter 2011 issue of Journal of Validation Technology



Peer Reviewed: Statistics

Statistical Tools for Development


and Control of Pharmaceutical
Processes: Statistics in the FDA
Process Validation Guidance | IVT
Paul L. Pluta

Welcome to Statistical Tools.


This feature provides discussion and examples of statistical methods
useful to practitioners in validation and compliance. We intend to pres-
ent these concepts in a meaningful way so as to enable their use in daily
work situations. Our objective: Useful information.
The recently issued FDA Process Validation Guidance recommended
multiple specific applications for statistics in the lifecycle approach to
process validation. These applications were identified in Stage 1 Process
Design, Stage 2 Process Qualification (PQ), and Stage 3 Continued
Process Verification. FDA recommendations were quite specific for
these respective stages, indicating Agency focus on statistical methods.
The guidance described several specific details of statistics applications,
including design-of-experiment (DOE) studies in formulation and pro-
cess development, statistical metrics in PQ, and trending of material,
process, and product data in monitoring and maintaining validation.
The importance of statistical expertise was emphasized throughout the
guidance.
Statistical Tools will provide relevant practical examples of using
statistics in the various stages of validation. The content of Statistical
Tools will provide readers with theory and practice on topics relevant
to validation. Reader understanding of this vital subject in validation
should be enhanced through these respective discussions.
The first part of Statistical Tools discusses general areas identified in the guidance that recommend applications of statistics, an introduction to the future content in Statistical Tools.
Comments, questions, and suggestions from readers are needed to
help us fulfill our objective for this series. Suggestions for future discus-
sion topics or questions to be addressed are invited. Readers are also
invited to participate and contribute manuscripts for this column. Case-
studies sharing uses of statistics in validation are most welcome. We
need your help to make Statistical Tools a useful resource. Please con-
tact column coordinator Paul Pluta at paul.pluta@comcast.net or IVT
Community Manager Cale Rubenstein at crubenstein@advanstar.com
with comments, questions, suggestions, or case-studies for publication.

Abstract
The recent US Food and Drug Administration Process Validation Guid-
ance has provided clear statements on the need for statistical proce-
dures in process validation. FDA has redefined validation to include
activities taking place over the lifecycle of product and process, from
process design and development through ongoing commercialization. New applications have evolved as a result of this guidance. Statistical applications should be used in process validation and related applications to improve decision-making. Development efforts should include statistically designed experiments to determine relationships and interactions between inputs and outputs. Manufacturers should understand the sources of variation, understand its impact on process and product, and control variation commensurate with the risk. Statistical methods should be used to monitor and quantify variation. Statistical methods should be used in support of sampling and testing in process qualification (PQ). Sampling plans should reflect risk and demonstrate statistical confidence. Validation protocol sampling plans should include sampling points, numbers of samples, sampling frequency, and associated attributes. Acceptance criteria should include statistical methods to analyze data. Continuing process verification data should include data to evaluate process trends, incoming material, in-process materials, and final products. Data should focus on ongoing control of critical quality attributes. FDA recommends that personnel with adequate and appropriate education in statistics should be used for these activities.

Introduction
FDA issued Process Validation: General Principles and Practices (1) in January 2011. This guidance transformed process validation from an individual and singular event to an ongoing continuum of activities during the entire lifecycle (i.e., development through commercialization) of a pharmaceutical product. The guidance incorporates quality-by-design (QbD), process analytical technology (PAT), risk management, and other modern concepts into a comprehensive approach to process validation. The application of statistical methods is an important part of implementing the guidance in pharmaceutical process validation programs. FDA also recently issued the draft guidance Analytical Procedures and Methods Validation for Drugs and Biologics (2). This document describes statistical analysis and models appropriate for validation of analytical methods. The principles and approaches described above are also being applied to other processes (e.g., cleaning, packaging), qualifications (e.g., equipment, facilities, utilities, control systems), hybrid systems (e.g., water, heating, ventilation, and air conditioning [HVAC]), and quality systems. Measurement is itself a process. Statisticians play a role in evaluating capability of the measurement process, without which no other work can be done. Pharmaceutical processes often comprise multiple sub-processes; inside each, further sub-sub-processes are nested, and so on. At the base of all of these is the measurement process itself, without which it is impossible to study any of the higher-order processes. Statistical methods are tools to be utilized for better risk-based decision-making in the face of variation and uncertainty.

Guidance Definition
Process validation is defined in the 2011 guidance as follows (1):
Process validation is defined as the collection and evaluation of data, from the process design stage throughout production, which establishes scientific evidence that a process is capable of consistently delivering quality product. Process validation involves a series of activities taking place over the lifecycle of the product and process. This guidance describes the process validation activities in three stages:
• Stage 1 - Process Design: The commercial process is defined during this stage based on knowledge gained through development and scale-up activities.
• Stage 2 - Process Qualification: During this stage, the process design is confirmed as being capable of reproducible commercial manufacturing.
• Stage 3 - Continued Process Verification: Ongoing assurance is gained during routine production that the process remains in a state of control.

The lifecycle approach to process validation is based on the following basic tenets as stated in the guidance (1):
• Quality, safety, and efficacy are designed or built into the product.
• Quality cannot be adequately assured merely by in-process and finished-product inspection or testing.
• Each step of a manufacturing process is controlled to assure that the finished product meets all design characteristics and quality attributes including specifications.

The above is proposed by FDA for application to human drugs, veterinary drugs, biological and biotechnology products, active pharmaceutical ingredients, finished products, and the drug component of combination drug and medical device products. The above does not specifically apply to process validation of medical devices. However, these same general stages and their respective inclusions have previously been published for medical devices (3).

FDA Expectations - Variation, Control, and Statistics
The FDA guidance document changed and expanded
the scope of process validation. The guidance further raised expectations regarding scope and content of validation activities. Application of statistical methods has become a significant part of these expectations.
A brief section in the opening pages of the FDA guidance clearly states expectations for industry validation programs. This section describes the expanded view of validation for new and legacy products. Key concepts in this section include recognition of variation and associated control of variation throughout the entirety of the product lifecycle. Collection and analysis of data are critical to this effort. Specifically:

A successful validation program depends upon information and knowledge from product and process development. This knowledge and understanding is the basis for establishing an approach to control of the manufacturing process that results in products with the desired quality attributes. Manufacturers should:
• Understand the sources of variation
• Detect the presence and degrees of variation
• Understand the impact of variation in the process and ultimately on product attributes
• Control the variation in a manner commensurate with the risk it represents in the process and product.

Each manufacturer should judge whether it has gained sufficient understanding to provide a high degree of assurance in the manufacturing process to justify commercial distribution of the product. Focusing exclusively on qualification efforts without also understanding the manufacturing process and associated variation may not lead to adequate assurance of quality. After establishing and confirming the process, manufacturers must maintain the process in a state of control over the life of the process, even as materials, equipment, production environment, personnel, and manufacturing procedures change. Manufacturers should use ongoing programs to collect and analyze product and process data to evaluate the state of control of the process. These programs may identify process or product problems or opportunities for process improvements that can be evaluated and implemented through some of the activities described in Stages 1 and 2.
Manufacturers of legacy products can take advantage of the knowledge gained from the original process development and qualification work as well as manufacturing experience to continually improve their processes. Implementation of the recommendations in this guidance for legacy products and processes would likely begin with the activities described in Stage 3. (1).

Regulatory Requirements and Recommendations
Process validation is a legally enforceable requirement in the pharmaceutical good manufacturing practices (GMPs). The guidance identifies two areas to exemplify emphasis on recognition of variation and control. Both sampling and in-process specifications are mentioned as aspects of process validation. Statistical analyses are explicitly mentioned in both these areas. Sampling plans must result in statistical confidence that product batches meet predetermined specifications. In-process limits must be determined by application of statistical procedures. The guidance also provides a list of recommendations that further emphasize recognition of variation and associated control. FDA recommends a team approach to process validation, including representation of expertise in statistics.

Stage 1 - Process Design
The Stage 1 Process Design stage of process validation comprises work conducted towards providing fundamental understanding of the product and process. It includes laboratory-scale experimental studies conducted to determine basic technical relationships between formulation ingredients, process parameters, and product attributes. It also includes work conducted at an increasing scale culminating at the full-scale commercial process. Good understanding of the manufacturing process must be technically and scientifically based. Critical quality attributes and critical process parameters must be identified and their relationships understood. The work of Stage 1 should be commensurate with the identified or expected risk for the product and process.
Stage 1 recommendations address development activities that will ultimately be reflected in the master production record and control records. The guidance clearly states the goal of Stage 1: "To design a process suitable for routine commercial manufacturing that can consistently deliver a product that meets its quality attributes" (1). Two general topics are discussed in the guidance: 1) building and capturing process knowledge and 2) understanding and establishing a strategy for process control.

Application of Statistics
Product and process scientists and engineers working in development of pharmaceutical products must understand and utilize statistical methods whenever possible. Their work provides the bases for future manufacturing and selection of parameters in pharmaceutical processes. Documentation of their work will be utilized in regulatory submissions, regulatory audits, change control, and other activities supportive to products and processes. The FDA guidance
specifically comments on the use of DOE studies to develop process knowledge; reveal relationships, including multivariate interactions; screen variables; and other applications. The guidance mentions applications of DOE in establishing ranges of incoming component quality, equipment parameters, and in-process material quality attributes. Also mentioned are experiments at laboratory or pilot scale that may assist in evaluation of conditions and prediction of process performance. Application of statistical methods is useful in these and associated activities.
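To make the DOE discussion above concrete, the following minimal sketch builds a two-level, three-factor full factorial design and estimates main effects from simulated responses. The factor names and response values are invented for illustration and are not taken from the guidance or from any specific product.

    # Minimal sketch of a 2^3 full factorial design and main-effect estimation.
    # Factor names and responses are illustrative assumptions only.
    from itertools import product

    factors = ["temperature", "mixing_time", "excipient_ratio"]

    # Eight-run design matrix in coded units (-1 = low, +1 = high).
    design = [dict(zip(factors, levels)) for levels in product([-1, 1], repeat=3)]

    # Hypothetical responses (e.g., percent dissolved at 30 minutes), one per run.
    responses = [78, 81, 83, 88, 77, 80, 85, 92]

    def main_effect(factor):
        # Main effect = mean response at the high level minus mean at the low level.
        hi = [y for run, y in zip(design, responses) if run[factor] == 1]
        lo = [y for run, y in zip(design, responses) if run[factor] == -1]
        return sum(hi) / len(hi) - sum(lo) / len(lo)

    for f in factors:
        print(f"{f}: estimated main effect = {main_effect(f):+.2f}")

In practice such a design would be replicated or augmented (e.g., with center points) and analyzed with formal ANOVA or regression, but the layout above is the core structure a screening DOE provides.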
Stage 2 - Process Qualification
The Stage 2 Process Qualification stage comprises performance of the commercial process by means of conformance lots. This stage confirms the work of Stage 1 Process Design and demonstrates that the proposed manufacturing process is capable of reproducible commercial manufacture. Process Performance Qualification (PPQ) conformance lot manufacturing includes increased testing to demonstrate acceptability of the developed formulation and process. The testing of Stage 2 should be commensurate with the risk identified for the product and process.
The FDA guidance specifically discusses design of facility, utilities, and equipment, Process Performance Qualification (PPQ), the PPQ protocol, and PPQ protocol execution and report in Stage 2, all of which are directly connected to specific process validation. PPQ is intended to confirm the process design and development work and demonstrate that the commercial manufacturing process performs as expected. This stage is an important milestone in the product lifecycle. PPQ should be based on sound science and experience as developed in Stage 1 studies and activities. PPQ should have a higher level of testing and sampling. The goal of PPQ is to demonstrate that the process is reproducible and will consistently deliver quality products.

PPQ Protocol and Application of Statistics
A written protocol is essential for acceptable PPQ. Specific requirements mentioned in the FDA guidance, many of which require statistical methods, include the following:
• Manufacturing conditions, process parameters, process limits, and raw material inputs
• Data collection and evaluation
• Testing and acceptance criteria
• Sampling plan, including sampling points and number of samples
• Number of samples, which demonstrate statistical confidence (one illustrative calculation is sketched after this list)
• Confidence level based on risk analysis
• Criteria for a rational conclusion of whether the process is acceptable
• Statistical methods that are used to analyze data, including statistical metrics defining both intra-batch and inter-batch variability
• Provision to address deviations and non-conformances
• Design of facilities and qualification of equipment and facilities
• Personnel training and qualification
• Verification of sources of materials and containers/closures
• Analytical method validation
• Review and approval by appropriate departments and the quality unit.
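One hedged illustration of tying "number of samples" to a stated statistical confidence (not a plan prescribed by the guidance): an exact one-sided lower confidence bound on the fraction conforming can be computed from PPQ sample results. The counts and confidence level below are assumptions.

    # Sketch: exact (Clopper-Pearson) one-sided lower confidence bound on the
    # proportion of units conforming, from hypothetical PPQ sampling results.
    from scipy.stats import beta

    n_sampled, n_conforming = 120, 119   # illustrative counts
    confidence = 0.95                    # illustrative confidence level

    if n_conforming == n_sampled:
        lower = (1 - confidence) ** (1.0 / n_sampled)   # all sampled units conform
    else:
        lower = beta.ppf(1 - confidence, n_conforming, n_sampled - n_conforming + 1)

    print(f"{n_conforming}/{n_sampled} conforming -> {confidence:.0%} lower "
          f"confidence bound on the conforming proportion: {lower:.3f}")

An acceptance criterion could then require this lower bound to exceed a risk-based target proportion.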
Stage 3 - Continuing Process Verification
The Stage 3 Continued Process Verification stage comprises the ongoing commercial manufacturing of the product under the same or equivalent conditions as demonstrated in Stage 2 Process Qualification. This phase continues throughout the entire commercial life of the product/process. Maintenance activities of Stage 3 should be commensurate with the risk identified for the product and process.
Assuming good development of the process, identification of potential sources of variation, and control of this variation, the manufacturer must maintain the process under control over the product lifetime (i.e., the work of Stage 3). This control must accommodate expected changes in materials, equipment, personnel, and other changes throughout the commercial life of the product, and it must do so based on risk analysis.

Application of Statistics
Specific items in this section of the guidance requiring statistical application include the following:
• Ongoing program to collect and analyze process data, including process trends, incoming materials, in-process material, and finished products
• Statistical analysis of data by trained personnel
• Procedures defining trending and calculations
• Evaluation of inter-batch and intra-batch variation
• Evaluation of parameters and attributes at PPQ levels until variability estimates can be established
• Adjustment of monitoring levels based on the above
• Timely assessment of defect complaints, out-of-specification (OOS) findings, deviations, yield variations, and other information
• Periodic discussion with production and quality staff on process performance
• Process improvement changes
• Maintenance of facilities, utilities, and equipment to ensure process control.

Continuing process verification data should include data to evaluate process trends, incoming material, in-process materials, and final products. Data should focus on ongoing control of critical quality attributes.

Expertise in Statistics
The guidance clearly shows scope, objectives, and criticality of data analysis and statistical treatment of data in Stage 3. Specific FDA recommendations regarding expertise in statistics are noteworthy:

An ongoing program to collect and analyze product and process data that relate to product quality must be established. The data collected should include relevant process trends and quality of incoming materials or components, in-process materials, and
finished products. The data should be statistically trended and reviewed by trained personnel. The information collected should verify that the quality attributes are being appropriately controlled throughout the process.
We recommend that a statistician with adequate training in statistical process control techniques develop the data collection plan and statistical methods and procedures used in measuring and evaluating process stability and process capability. Procedures should describe how trending and calculations are to be performed and should guard against overreaction to individual events as well as against failure to detect unintended process variability. Production data should be collected to evaluate process stability and capability. The quality unit should review this information. If properly carried out, these efforts can identify variability in the process and/or signal potential process improvements. (1).

The following paragraph from the guidance provides another clear recommendation:
Many tools and techniques, some statistical and others more qualitative, can be used to detect variation, characterize it, and determine the root cause. We recommend that the manufacturer use quantitative statistical methods whenever feasible. (1).
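As one hedged example of the quantitative trending the guidance calls for (the chart type and constants follow common SPC practice, not the guidance itself), the sketch below computes individuals-chart control limits from assumed batch assay values.

    # Sketch: individuals (I) chart limits from hypothetical batch assay data.
    # The 2.66 factor is the standard moving-range constant (3/d2 with d2 = 1.128).
    batch_assay = [99.1, 100.4, 98.7, 101.2, 99.8, 100.9, 99.5, 100.1, 98.9, 100.6]

    center = sum(batch_assay) / len(batch_assay)
    moving_ranges = [abs(b - a) for a, b in zip(batch_assay, batch_assay[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)

    ucl = center + 2.66 * mr_bar   # upper control limit
    lcl = center - 2.66 * mr_bar   # lower control limit

    print(f"center = {center:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
    print("points outside limits:", [x for x in batch_assay if not lcl <= x <= ucl])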
Series Discussion Topics
The tentative plan for the content in this series will begin with discussion of basic principles. Fundamental topics in this area will include types of data, graphical representation, distributions, central tendencies, dispersions, and probability. Confidence intervals and tolerance intervals will be discussed. Subsequent topics will address areas particularly applicable to the respective lifecycle stages of process validation. These will include topics such as experimental design, including screening studies and multivariate experimental studies. Discussions on metrology, process capability, control charts, trending, and other related topics are planned. Example case-studies and calculations will further describe the above topics. Validation by Design: The Statistical Handbook for Pharmaceutical Process Validation by Torbeck (4) is recommended for a comprehensive summary of statistics topics associated with process validation.
As mentioned above, the objective for this series of discussions on statistical topics is useful information. Reader input through comments, questions, and other discussion is needed. Suggestions for future discussion topics are invited. Readers are also invited to participate and contribute manuscripts reflecting actual experiences utilizing statistical tools for development and control of pharmaceutical processes or analytical methods. JVT

References
1. FDA, Process Validation: General Principles and Practices (Rockville, MD, Jan. 2011).
2. FDA, Analytical Procedures and Methods Validation for Drugs and Biologics (Draft Guidance) (Rockville, MD, Feb. 2014).
3. GHTF, Quality Management Systems Process Validation Guidance, Edition 2, January 2004.
4. L. Torbeck, Validation by Design: The Statistical Handbook for Pharmaceutical Process Validation, PDA and DHI Publishing, 2010.

Originally published in the Autumn 2011 issue of Journal of Validation Technology



Peer Reviewed: Statistics

Statistical Considerations for Design


and Analysis of Bridging Studies |
IVT
Harry Yang and Timothy Shofield

Abstract
Biological products rely on a wide array of analytical methods for
product characterization, lot release, and stability testing. As method
improvement is a continued effort during the lifecycle of a biopharma-
ceutical product, bridging studies are often conducted to demonstrate
comparability of the old and new methods. This paper discusses statis-
tical considerations in the design and analysis of bridging studies for
analytical methods.

Introduction
In biological product development, a wide range of analytical methods
are used to ensure product quality throughout the lifecycle of a product.
These methods include tests for product identity, purity, concentration,
and potency. As the product progresses through early-stage to late-
stage development, and ultimately manufacturing, a parallel effort is
made to improve the analytical methods, taking advantage of emerg-
ing state-of-art analytical technologies and increased understanding
of drug mechanism of action. Bridging studies should be conducted to
demonstrate that when compared to a current method, a new method
provides similar or better reliability in correspondence to their intended
use. While some aspects of method bridging are well understood, many
questions remain unanswered.
Recently, significant progress has been made in the adoption of
risk-based approaches to pharmaceutical process development (1-5).
These approaches are most apparent in the new regulatory definition
for process validation: the collection and evaluation of data, from
the process design through commercial production, which establishes
scientific evidence that a process is capable of consistently deliver-
ing quality. It includes a shift of regulatory requirements from the
traditional test-to-compliance to a quality-by-design approach to
process and analytical development and maintenance. Since a wealth
of knowledge of the old and new methods is readily available before a
bridging study is designed, application of the lifecycle and risk manage-
ment concepts in the design and analysis of the bridging study not only
allows the study to be properly designed and data accurately analyzed
but also enables the knowledge gained from the development of the old
and new methods to be utilized to provide greater assurance of method
reliability and product quality.
The goal of a bridging study is to demonstrate performance equiva-
lence between the old and new methods. To that end, one needs to
design an experiment, collect data, and analyze and interpret the
results so that one can declare if the two methods are comparable or not. As data are variable, statistical inference inevitably will suffer from two types of errors. One is to falsely claim performance equivalence (consumer's risk) and the other is to falsely claim nonequivalence (producer's risk). In practice, these two types of risk are often not adequately managed, either because of poor study design or misuse of statistical methods to evaluate equivalence. For purposes of this paper, the authors will be using the term "method" to mean a technology used to test one or more critical quality attributes. Method bridging or comparability is a study that is performed when there has been a significant change to that method. Thus, this may be more appropriately called method version-bridging. The approaches described here may or may not be used to assess equivalence of multiple methods for testing a quality attribute such as bioactivity, which can be measured using either a binding or cell-based bioassay, or aggregates, which can be measured by multiple methods. This usually depends upon a company's internal strategy and procedures. This paper is intended to address several statistical issues related to method bridging and to discuss other statistical considerations. An equivalence test and a Bayesian approach are suggested to assess performance equivalence.

Bridging Study Design and Analysis
When designing a bridging study and analyzing results of the study, many factors need to be taken into account. These include, but are not limited to: 1) What are the performance characteristics to be compared? 2) What types of samples should be included in the bridging study? 3) How does one establish an acceptance criterion? 4) How large should the bridging study be in terms of number of lots/runs? 5) What are the appropriate statistical approaches for establishing method comparability? In the following, this paper's primary concern is addressing issues related to the design and analysis of bridging studies.

Approaches for Assessing Equivalence
Accuracy and precision are two key quantities characterizing method performance. To bridge the old and new methods, one needs, at a minimum, to demonstrate that the two methods are equivalent in terms of accuracy and precision. In the literature, there are two approaches used for this purpose, which are either lacking in statistical rigor or completely flawed.

Current Approaches
Approach 1 - Point Estimate Approach
The first approach is to compare the point estimate of the difference in mean performance to an acceptance criterion. For example, if the difference in mean potencies between the two versions of a bioassay falls within 80% to 125%, the two bioassays are determined to be comparable. There are two issues associated with this method. Firstly, the point estimate might meet the acceptance criterion, but the confidence that the point estimate will continue to meet the acceptance criterion during future testing is unknown. Secondly, the true value of the difference in mean potency between the new method and the old method may be outside of the acceptance range, but due to random chance alone, the point estimate might be within the acceptance range. In fact, if the true value is at the upper limit of the acceptance range, there is a 50% chance for the point estimate from this particular experiment to be within the acceptance range (refer to Figure 1).

Figure 1: Distribution of Difference in Mean Performance when the Theoretical Difference is Equal to the Upper Acceptance Limit.

This approach of analysis may lead to acceptance of a poor method with the consequence of a high rate of out-of-specification (OOS) results when the new method is put into use, or rejection of a good method with the consequence that a superior technology cannot be used. Ideally, the approach for assessing comparability should account for uncertainty in the point estimate of the difference in mean performance and provide high assurance that the new method would perform well in the future should it be deemed comparable to the old method.

Approach 2 - p-value Approach
Another commonly used approach for testing comparability is to use classical hypothesis testing and the associated p-value. In this setting, a statistical test is conducted to inappropriately test the hypothesis of equal performance in a characteristic (e.g., potency). Performance equivalence is claimed if the p-value of the statistical test is greater than 0.05.
This approach can be illustrated as follows. Let X = (x1, ..., xn) and Y = (y1, ..., yn) be the measured response values obtained from testing the same set of n samples using the old and new methods, respectively. It is assumed that xi and yi are normally independently distributed with means μx and μy and a
common variance of σ². In this case, evaluation of comparability in part involves testing the hypotheses concerning the difference d = μx − μy:

H0: d = 0 vs. H1: d ≠ 0 [Equation 1]

The null hypothesis is rejected if the p-value calculated from a paired t-statistic, p = 2·Pr[Tn−1 ≥ |d̄|·√(n−1)/s], satisfies p < 0.05, where Tn−1 is a random variable with a Student-t distribution of n−1 degrees of freedom and d̄ and s² are sample mean and variance estimates of the mean difference in potency values of the old and new methods. It is noted that the calculations should be performed on log potency for relative potency bioassay. Equivalence is claimed if the null hypothesis is not rejected. Operationally, this test is equivalent to establishing whether the confidence interval

(d̄ − tn−1(0.05)·s/√(n−1), d̄ + tn−1(0.05)·s/√(n−1)) [Equation 2]

contains null or not. The quantity tn−1(0.05) is the lower 5th percentile of the Student-t distribution with n−1 degrees of freedom. The null hypothesis is rejected if the above interval does not cover null. The approach is usually referred to as the confidence interval approach. As an illustration, suppose that four paired old and new method results, A, B, C, and D, are used to test comparability using the p-value approach. The 90% confidence interval (CI) of the paired difference in mean potency of the old and new methods is calculated for each of the four pairs and shown in Figure 2.

Figure 2: 90% Confidence Intervals for Four Pairs of Old and New Method Results Used to Test Comparability.
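A minimal sketch of the confidence interval in Equation 2, using invented paired potency results; the standard error is written so that it matches the s/√(n−1) form printed above when s is computed with divisor n.

    # Sketch: 90% confidence interval for the mean paired difference (Equation 2).
    # The paired potency values are invented for illustration.
    import math
    from scipy import stats

    old = [101.2, 98.7, 100.4, 99.1, 102.3, 97.8, 100.9, 99.6, 101.7, 98.4]
    new = [102.0, 99.5, 101.1, 100.2, 103.0, 98.1, 101.5, 100.4, 102.2, 99.0]

    diffs = [x - y for x, y in zip(old, new)]
    n = len(diffs)
    d_bar = sum(diffs) / n
    ss = sum((d - d_bar) ** 2 for d in diffs)

    # Standard error of the mean difference; algebraically identical whether
    # written as s/sqrt(n-1) with a divisor-n s, or as s/sqrt(n) with divisor n-1.
    se = math.sqrt(ss / (n * (n - 1)))

    t_crit = stats.t.ppf(0.95, df=n - 1)   # magnitude of t_{n-1}(0.05)
    print(f"90% CI for the mean difference: "
          f"({d_bar - t_crit * se:.3f}, {d_bar + t_crit * se:.3f})")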
Since the 90% CIs of pairs A and C do not include null, the old and new methods of either pair are deemed to be not equivalent. By contrast, the old and new methods of pairs B and D might be considered equivalent because the 90% CIs of the paired difference in mean response both contain null.

Issues with p-value Approach
However, there is a major drawback of the p-value approach. Note that method pairs A and D have the same mean difference. Yet, by the p-value approach, Pair A is called not comparable as the interval does not contain zero, but Pair D is deemed comparable. This is due to the fact that the variability of the methods in Pair D is much larger than that of Pair A. This suggests that if one has improved the precision of their method, one is more likely not to be able to bridge the old and new methods, which does not make sense. On the other hand, the mean difference of Pair C appears to be smaller than that of Pair B, as Pair C is more precise. However, the p-value approach claims that the old and new methods of this pair are not equivalent.
The root cause of this issue is that the wrong null hypothesis is being tested. By the sheer construct of the confidence interval in Equation 2, it is ensured that the rate for the null hypothesis in Equation 1 to be falsely rejected, thereby resulting in non-equivalence being claimed, is no more than 5%. In other words, when the methods are equivalent, the risk of concluding they are not equivalent is bounded by 5%. However, when the old and new methods are not equivalent, the chance for the p-value approach to declare so is influenced by both the variability of the methods and sample size. When the method variability is large or sample size is small, the width of the confidence interval is wide. Therefore, it is more likely for the 90% CI to contain zero, causing the rule to claim equivalence. This is sometimes characterized as rewarding poor work. In truth, failure to show a significant difference has nothing to do with demonstrating that the methods are equivalent or not. Likewise, a significant difference is not the same thing as non-equivalence (refer to Pair C). The method could be too precise or the study was excessively large such that a small difference is statistically significant; in other words, the confidence interval might not contain zero. This is sometimes characterized as penalizing good work.

Equivalence Testing
Two-One-Sided Test
To correct these issues of the p-value method, we first need to state the appropriate hypotheses to be tested. If correctly stated, the bridging study can be carried out such that both the risks of falsely claiming equivalence and falsely claiming non-equivalence are controlled. To construct an equivalence test, it
is first necessary to establish equivalence limits (the outside limits in Figure 2). These limits define a difference between the new and old analytical methods that is deemed practically unimportant. Let ±δ denote the equivalence limits. The hypotheses we intend to test are

H0: d < −δ or d > δ vs. H1: −δ ≤ d ≤ δ [Equation 3]

The hypotheses in (3) can be tested using the two-one-sided test (TOST) that rejects the null hypothesis in (3) if

(d̄ + δ)/(s/√(n−1)) > tn−1(0.95) and (d̄ − δ)/(s/√(n−1)) < tn−1(0.05) [Equation 4]

where tn−1(0.05) and tn−1(0.95) are the 5th and 95th percentiles of the Student-t distribution, respectively. The TOST is operationally equivalent to the confidence interval approach in the sense that the null hypothesis of non-equivalence in Equation 3 is rejected if the 90% confidence interval is entirely contained within the limits (−δ, δ) (6). This test ensures that the probability for the old and new methods to be falsely claimed equivalent when they are not is no more than 5%. The risk of falsely claiming non-equivalence when the methods are equivalent is controlled by calculating a sample size that manages this risk to a satisfactory level. In this way the equivalence approach rewards good work. Applying the approach to the four pairs of methods in Figure 2, where we assume the equivalence limits are the two vertical dashed lines, it can be concluded that Pairs B and C are equivalent (the methods are similar), while for Pairs A and D it can be concluded that there is insufficient evidence to declare they are equivalent. In practice, this can be set up so as to evaluate the methods after a fixed number of pairs and then amend the sample size when it has been determined that one or another of the assumptions going into the bridging study is incorrect (e.g., the variability of the methods is greater than was originally assumed). The example shows that this approach overcomes the drawbacks of the p-value approach previously discussed.
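A hedged sketch of the TOST decision rule in Equations 3 and 4, reusing the invented paired data from the previous sketch and an assumed equivalence margin δ = 2.0; a real bridging study would take both from the protocol.

    # Sketch: two one-sided tests (TOST) for the equivalence hypotheses (Equation 3).
    # Data and the margin delta are illustrative assumptions.
    import math
    from scipy import stats

    old = [101.2, 98.7, 100.4, 99.1, 102.3, 97.8, 100.9, 99.6, 101.7, 98.4]
    new = [102.0, 99.5, 101.1, 100.2, 103.0, 98.1, 101.5, 100.4, 102.2, 99.0]
    delta = 2.0   # assumed equivalence margin

    diffs = [x - y for x, y in zip(old, new)]
    n = len(diffs)
    d_bar = sum(diffs) / n
    se = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n * (n - 1)))

    t_crit = stats.t.ppf(0.95, df=n - 1)
    reject_lower = (d_bar + delta) / se > t_crit     # rejects H0: d <= -delta
    reject_upper = (d_bar - delta) / se < -t_crit    # rejects H0: d >= +delta

    # Equivalently, conclude equivalence if the 90% CI lies inside (-delta, delta).
    print("conclude equivalence" if reject_lower and reject_upper
          else "equivalence not demonstrated")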
The equivalence approach is appropriate in cases when it is meaningful to show the two methods fall within an upper and a lower acceptance criterion, such as may be the case when showing that the average potency between the two methods is the same. In some cases, it may be more appropriate to show an improvement in method performance. Thus, it may be desirable to show that a host cell protein method has better coverage using one critical reagent versus another or that an impurity method is more sensitive. In these cases, a superiority approach should be used with a one-sided acceptance criterion.
The equivalence approach described above is a test of a mean shift in the measurements between the two methods. This should be accompanied by an assessment of a change in the variability of the new method. Various approaches exist to assess this for method-bridging and other comparison paradigms such as method transfer. Discussion of these approaches is beyond the scope of this paper.

A Bayesian Approach
Knowledge gained from developing both the old and new methods can be utilized in support of a method performance equivalence assessment. This is in line with the lifecycle and risk-based development paradigm recommended for product, process, and analytical methods by regulatory guidelines in recent years, and this can be accomplished through a statistical approach called Bayesian analysis. The approach, first developed by Rev. Thomas Bayes, provides a general framework for making statistical inference based on newly collected experimental evidence and historical knowledge (7). For a bridging study, the new data consist of measured analytical response values from both the old and new methods. Historical knowledge includes understanding of the performance characteristics of the old and new methods gleaned from the data collected during the development of the methods. Such knowledge is typically described in terms of a prior distribution of the performance characteristics. This, coupled with the distribution of the data collected from the bridging study, enables us to derive the posterior distribution of performance characteristics of the old and new methods, which, in turn, can be used to make inference on the equivalence of the characteristics. Specifically, we assume that d = μx − μy is normally distributed with a mean θd close to zero and variance τ². That is, d ~ N(θd, τ²). The values of θd and τ² might be informed from the prior knowledge about performance characteristics derived from development experience.
Note that, given d, the observed mean difference d̄ is also normally distributed, with mean d and a sampling variance determined by σ² and n. Without loss of generality, one assumes that σ² is known. Therefore, the posterior distribution of d is
given by the standard normal-normal update (8), in which the posterior distribution of d is again normal: its mean is a precision-weighted average of the prior mean θd and the observed mean difference d̄, and its variance combines the prior variance τ² with the sampling variance of d̄ [Equation 5].

One can conclude performance equivalence if

Pr[−δ ≤ d ≤ δ | X, Y] ≥ 95% [Equation 6]

An Example
Suppose that the data from a bridging study consist of 10 measured responses of the old and new methods that follow normal distributions with means x̄ = 100 and ȳ = 110, and an estimate of their common variance of s². It is also assumed that the equivalence limits are (−10, 10). Because the lower 90% confidence limit in (2), (x̄ − ȳ) − tn−1(0.05)·s/√(n−1), is below the lower equivalence limit −δ = −10, the data from the bridging study alone would not warrant an equivalence claim. However, if the historical data suggest that the old and new methods are both very accurate and precise, it is reasonable to assume that d = μx − μy is normally distributed with a prior mean θd close to zero and a small prior variance τ². For the sake of illustration, one can assume the prior mean and variance are θd = −1 and τ² = 1, respectively. Based on (5), the posterior distribution of d is given by

d | X, Y ~ N(−5.5, 0.05)

It can be calculated that the probability for d to be bounded by (−10, 10) is obtained as

Pr[−10 ≤ d ≤ 10 | X, Y] = Φ(31) − Φ(−10)

which is greater than 99.9%. This says there is a very high likelihood that the two methods are equivalent, and therefore, it can be concluded that the new method is successfully bridged to the old method.
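A hedged sketch of the conjugate calculation behind Equations 5 and 6: it applies the standard normal-normal update under the stated assumptions (normal prior on d, known sampling variance for d̄). The prior, observed mean difference, and sampling variance below are illustrative choices, so the resulting posterior is not meant to reproduce the published N(−5.5, 0.05) exactly.

    # Sketch: normal-normal posterior for d and the equivalence probability
    # of Equation 6. All numerical inputs are illustrative assumptions.
    import math
    from scipy.stats import norm

    theta_prior, tau2_prior = -1.0, 1.0   # prior mean and variance for d
    d_bar, var_dbar = -10.0, 1.0          # observed mean difference and Var(d_bar)
    delta = 10.0                          # equivalence limit

    # Precision-weighted conjugate update.
    w = tau2_prior / (tau2_prior + var_dbar)
    post_mean = w * d_bar + (1 - w) * theta_prior
    post_sd = math.sqrt(tau2_prior * var_dbar / (tau2_prior + var_dbar))

    prob = norm.cdf(delta, post_mean, post_sd) - norm.cdf(-delta, post_mean, post_sd)
    print(f"posterior ~ N({post_mean:.2f}, {post_sd**2:.2f}); "
          f"Pr(-10 <= d <= 10 | data) = {prob:.4f}")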
Design Considerations
Number of Lots versus Number of Replicates
A key question concerning the design of a method bridging study is how many lots and how many replicates from each lot need to be assayed to demonstrate performance equivalence. In principle, this may be the wrong question, as lot-to-lot variability, which reflects process consistency, should not interfere with the comparability assessment of the old and new methods. It may be more important to test multiple sample types, such as different intermediates as well as linearity and forced degradation samples, to ensure that the new method is equivalent or better than the old method. In this regard, the problem reduces to how many replicates of each sample type should be included in the study. For some methods, such as bioassay, the replicate determinations should be obtained from independent assays or runs of each method. In addition, special consideration should be given to samples that are grouped together in the same assay run (or tested under similar conditions). The sample size required for the bridging study will be illustrated for a design using k lots (or sample types) tested together in n runs in both methods.
As discussed in Equivalence Testing, regardless of the sample size, the equivalence test should warrant that the rate of falsely claiming equivalence is bounded by 5% (Type I error or consumer's risk). By choosing an adequate sample size, one may also minimize the chance of falsely claiming nonequivalence (Type II error or producer's risk). The rates of Type I and Type II errors are usually expressed as α and β such that 0 < α, β < 1. The sample size n that guarantees the two error rates are no greater than α and β, respectively, and that is based on the paired t-distribution, can be obtained as (9)

n ≥ (tn−1(1−α) + tn−1(1−β/2))²·σ²/(δ − δ0)² [Equation 7]

where tn−1(1−α) and tn−1(1−β/2) are the 100(1−α) and 100(1−β/2) percentiles of the Student-t distribution with n−1 degrees of freedom, respectively, σ is the standard deviation associated with the comparison, δ is the equivalence margin, and δ0 ≥ 0 is an offset accounting for the maximal unknown difference allowed for the two methods. Since both sides of the inequality in Equation 7 involve n, the solution for n is obtained either through an iterative algorithm or simulation. The sample size n can also be obtained from commercially available software packages such as nQuery Advisor (8). An alternative method for sample size calculation is obtained by replacing the right-hand side of the inequality in Equation 7 by its normal approximation. Specifically, n can be calculated as

n = (z1−α + z1−β/2)²·σ²/(δ − δ0)² [Equation 8]

where z1−α and z1−β/2 are the 100(1−α) and 100(1−β/2) percentiles of the standard normal distribution.
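A brief sketch of the sample-size calculation: Equation 8 is evaluated directly and then refined by iterating with t-percentiles in the spirit of Equation 7. The inputs echo the case study below (σ = 15%, δ = 25%, δ0 = 10%, α = 5%, 80% power) but are treated here as assumptions; depending on rounding and the exact form of Equation 7 used, the result may differ slightly from the published n = 10.

    # Sketch: TOST sample size via Equation 8, refined with t-percentiles (Equation 7).
    # Inputs are assumptions echoing the case study values.
    import math
    from scipy import stats

    sigma, delta, delta0 = 0.15, 0.25, 0.10   # comparison SD, margin, allowed offset
    alpha, beta = 0.05, 0.20                  # Type I error; Type II error (80% power)

    z = stats.norm.ppf
    n_normal = math.ceil((z(1 - alpha) + z(1 - beta / 2)) ** 2
                         * sigma ** 2 / (delta - delta0) ** 2)

    n = max(n_normal, 2)
    for _ in range(100):   # iterate because the t-percentiles depend on n
        t = stats.t.ppf
        n_new = math.ceil((t(1 - alpha, n - 1) + t(1 - beta / 2, n - 1)) ** 2
                          * sigma ** 2 / (delta - delta0) ** 2)
        if n_new == n:
            break
        n = n_new

    print(f"Equation 8 (normal approximation): n = {n_normal}; t-based iteration: n = {n}")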

Equivalence Limits
In order to use the equivalence test described in Equivalence Testing to demonstrate method equivalence, it is necessary to pre-specify the equivalence limits (−δ, δ). Selection of the limits needs to consider the impact on product quality. Data on lot-to-lot variability using the old method informs the equivalence limits. The equivalence limits (or limit in the case of an improvement) can be determined as the shift in the lot-to-lot distribution that still satisfies the product specifications. An approach like this is discussed and illustrated in United States Pharmacopeia (USP) Chapter <1033>, Biological Assay Validation.

Sample Size
Table I displays sample sizes calculated using the formula in Equation 8 for various combinations of the equivalence limit δ and the comparison standard deviation σ. It is assumed that the inherent difference between the old and new method is zero, that is, δ0 = 0. The Type I error α and Type II error β are assumed to be 5% and 10%, respectively. As seen from Table I, the larger the comparison variability and the smaller the equivalence limit, the larger the sample size. For example, the sample sizes for (δ, σ) = (10%, 20%) and (5%, 30%) are 5 and 617, respectively.

Table I: Sample Size for Demonstrating Performance Equivalence.

Case Study
In this section, the authors present a case study in which a new and an old method were compared through a bridging study. To fully demonstrate that the new method is comparable to the old method, samples from five drug substance lots and simulated samples at the levels of 75%, 100%, and 125% of the target value were tested by both methods. Precaution was taken to ensure samples were tested under similar conditions for both methods so that the paired t-test can be used in the assessment of equivalence. Such treatment reduced the effect of lot variability.
Based on historical data, the %CV of the methods is no more than 15%. Since the lot release and stability specifications have lower and upper limits of 65% and 135%, respectively, an inherent difference of no more than 10% is deemed acceptable. The equivalence limits are set at +/-25%, taking into account both the allowable inherent difference and the method variability. Based on these quantities, a sample size of 10 is determined using Equation 7 with the Type I error fixed at 5% and power of 80%. The test results are presented in Tables II and III.

Table II: Potency Values from Testing Five Lots with Two Replicates per Lot.

Table III: Normalized Potency Values from Testing Simulated Samples with Five Replicates per Lot.

Based on Equation 2 with n = 10 and tn−1, the 90% confidence intervals of the mean difference for the three sets of simulated samples and the samples from the five lots are presented in Table IV, along with the mean differences and standard deviations.

Table IV: 90% Confidence Intervals of Test Samples.

Since all the 90% confidence intervals are contained within (−25%, 25%), the new method is deemed to be comparable to the old method.

Discussion
Bridging studies play an important role in the lifecycle of an analytical method. A successful bridging study relies on a well-thought-out study design and correct approaches to data analysis. However, two approaches widely used in bridging study data analysis are either not statistically rigorous or aimed at detecting performance difference rather than equivalence. Therefore, they run the risk of rejecting comparable methods that are of high precision or accepting incomparable methods that have large variability. To mitigate such risk, the authors suggest using two alternative approaches, a TOST or a Bayesian approach, to assess performance equivalence. These approaches overcome the shortcomings of the approaches currently in use. Moreover, the Bayesian approach allows for incorporation of historical data in
the assessment of method performance equivalence. This is in accordance with the lifecycle principles recommended in recent regulatory guidelines.
Other design issues such as selection of the number of lots and number of replicates are important considerations in method-bridging. It is shown that the selection is dependent on intra-run and inter-run method variability. In a proper analysis of method bridging, the lots represent within-run replicates, and thus, the impact of the number of lots on the reduction of the bridging study variability is much less than that of the number of runs. The authors recommend reducing the number of lots and increasing the number of runs in a method bridging study. Strategically selected levels of the quality attribute that is being tested by the method should be considered in the selection of study samples. It should also be noted that method performance is characterized through multiple metrics, including accuracy and precision. To control the overall risk of falsely claiming equivalence or nonequivalence, a statistical test needs to be established so that the equivalence of multiple characteristics can be tested simultaneously. This requires establishment of joint acceptance criteria for the performance parameters. Equally important is to estimate the sample size to ensure the rates of false claims are capped at pre-defined levels of risk. Although a closed-form formula like the one seen in Equation 6 usually does not exist for the multivariate test, statistical simulation can be used to determine the sample size. However, detailed discussion of this is beyond the scope of this paper. JVT

References
1. ICH Q8, Pharmaceutical Development, 2006.
2. ICH Q9, Quality Risk Management, 2007.
3. ICH Q10, Pharmaceutical Quality Systems, 2007.
4. ICH Q11, Concept Paper, 2011.
5. FDA, Guidance for Industry on Process Validation: General Principles and Practices (Rockville, MD, Jan. 2011).
6. S.C. Chow and J.P. Liu, Design and Analysis of Bioavailability and Bioequivalence Studies, Marcel Dekker, 1992.
7. T. Bayes, Philos Trans Roy Soc London 53, 370-418, 1763. Reprinted with an introduction by B.G., Biometrika 45, 293-315, 1958.

Originally published in the Summer 2011 issue of Journal of Validation Technology



Peer Reviewed: Process Validation

FDA, Globalization, and Statistical


Process Validation | IVT
Robert L. Creighton and Marlene Garcia Swider

INTRODUCTION
Pharmaceutical manufacturing in the USA has undergone significant
changes in the past few decades. FDA-regulated product manufacturing was once done by relatively few pharmaceutical companies. The
entire manufacturing process, i.e., raw materials through final product
packaging, was accomplished by only one or two firms. As demand
for these products increased along with pressure to reduce costs and
increase productivity, manufacturing firms outsourced various opera-
tions within the total manufacturing process. Today, outsourcing is a
common practice. Globalization provides the opportunity to not only
serve international needs but also to reduce costs. Globalization is now
routine practice for many manufacturers.
All products imported into the USA are required to meet the same
standards as domestic goods. They all must be pure, produced under
sanitary conditions, and contain informative and truthful labeling in
English in order to be marketed in the USA. According to the FDA Quality
System Regulation of 1996 (Trautman, 1997), manufacturers need to
monitor and control their process parameters for validated processes
to continually meet quality attributes. It is through statistical tools and
techniques that manufacturers can help ensure good measurements to
demonstrate purity and sanitary conditions of the products and facili-
ties inspected by the FDA.
According to the FDA, process validation is the collection and evaluation of data, from the process design stage throughout production.
More specifically defined, process validation is providing documented
evidence with high degree of assurance that a specific process will con-
sistently produce a product meeting its pre-determined specifications
and quality characteristics (Campbell, and Katz, 2012). FDA wants
objective evidence that a process is capable of consistently delivering
quality product. Statistical tools and techniques, properly applied, can
provide this evidence.
FDA depends on analytical statistics to assure that batches of products
meet each appropriate specification. FDA requires that appropriate statistical
quality control criteria be met as a condition for product approval and
release. However, many inquire whether the types of analytical tools
and techniques should change depending on which country is working
with the FDA.
FDA continues supporting state-of-the-art science and as such,
adapts as much as possible to the needs of its stakeholders. Stakehold-
ers include manufacturers submitting products for approval and mar-
keting for USA consumption. This support is evidenced by FDA's efforts toward globalization.
Globalization
In the last decade, FDA has opened offices all around the world. This
includes offices in China, Mexico, Costa Rica, Brussels, India, and Italy, among other countries. Staffing FDA offices at these new locations has its own challenges. Interacting, adapting, and understanding new cultures, languages, computerized systems, and policies continue to be part of FDA's globalization efforts. Additional common-sense challenges exist too. These include adapting to a competitive world where the American way is not the only way of doing business. Taking into consideration different points of view not only requires more flexibility from everyone involved but also forces everyone to learn new ways. Perhaps this can signify better ways of doing business in the future with and by FDA.
The authors believe major factors impacting how manufacturing outsourcing could be expanding into globalization include:
1. Keeping up with state-of-the-art science and
2. Innovation emerging from many fields of knowledge, and
3. Harmonizing the different nations' efforts.

These factors are added to other existing trends and affecting factors like customized medication, introduction of new products while increasing quality of care and reducing costs, and the ones previously mentioned like interacting with new cultures and languages. Although this is not an exhaustive list of factors, the authors believe that these same factors will also impact how FDA will be doing business with other countries in the near future (see Figure 1 below).

FIGURE 1. INFLUENCE DIAGRAM

At this time, however, no matter where a manufacturer is located, the same FDA guidance equally applies to everyone manufacturing FDA-regulated products. In other words, the same regulations governing the statistical tools and techniques to be used by USA manufacturing firms apply to manufacturers in any other country seeking FDA approval for products. Not surprisingly, and through experience accumulated through the years, FDA revised the Process Validation Guidance in 2011 for pharmaceutical products. Please note that different FDA guidances apply to different products, such as medical devices.

PROCESS VALIDATION
The three stages for process validation described in the FDA 2011 Process Validation Guidance include:
• Process Design
• Process Qualification
• Continued Process Verification

Process Design
The FDA guidance describes process design as what defines the commercial manufacturing process, as reflected in planned master production and control records. The use of statistics within the first stage primarily focuses on statistically designed experiments and strategies for process control. Recognized statistical techniques discussed in the guidance include design of experiments (DOE) (Chen & Moore, 2006), risk analysis (screening potential variables), and models of the commercial process.

Process Qualification
The FDA guidance states that process qualification is the stage where the evaluation of the process design takes place in order to determine if the process is capable of reproducible commercial manufacture. Elements identified by the FDA in process qualification are:
• Design of facility including equipment and utilities qualification
• Process Performance Qualification (PPQ)
• Intra-batch and inter-batch metrics
• Comparison and evaluation of process measures and in-process and final product attributes.

Other examples considered by FDA are statistical techniques for medical devices based on the Bayes Theorem. According to FDA Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials, these techniques are based on combining prior information with current information. Although sometimes controversial due to experts' points of view vs. empirical data, these techniques are sometimes preferred because they are less burdensome.

Continued Process Verification - Process Control Techniques
In order to address process variability, statistical process control (SPC) techniques should be used. According to Torbeck (2011), such methods are used to monitor, improve, and control the manufacturing process. These methods include:
• SPC techniques including probability, multivariate statistics, and statistical control charts
• Process Analytical Technology (PAT)
• Process capability studies
• Control charts
• Comparison of Cpk and critical quality attributes (CQA) (a brief Cpk sketch follows this list).
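As a hedged illustration of the process capability comparison listed above (the data and specification limits are invented, and Cpk is only one of several capability indices in use), a minimal Cpk calculation looks like this:

    # Sketch: Cpk estimate for a critical quality attribute.
    # Data and specification limits are illustrative assumptions.
    import statistics

    measurements = [49.8, 50.2, 50.1, 49.7, 50.4, 49.9, 50.0, 50.3, 49.6, 50.1]
    lsl, usl = 48.5, 51.5   # lower/upper specification limits

    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)   # sample standard deviation

    cpk = min(usl - mean, mean - lsl) / (3 * sd)
    print(f"mean = {mean:.2f}, sd = {sd:.3f}, Cpk = {cpk:.2f}")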

CONCLUSIONS
FDA's requirements regarding validation statistics are the same for the US as for any other country inspected. It is our opinion that this will change as FDA advances in its relationships with other countries, understands the needs of other countries, and works toward supporting harmonization advances. In the meantime, we recommend:
• Study and familiarize yourself with FDA policies and regulations in order to comply with FDA guidance
• Network and participate in professional forums to keep on top of changes and trends
• Inform yourself through researching the literature
• Describe how trending and calculations are performed and detect unintended process variability
• Collect production data to evaluate process stability and capability -- know your product
• Understand the role of the quality unit and the review process
• Identify variability in the process and/or signal potential process improvements.
It is through mutual understanding, collaboration, and communication with FDA that manufacturers can help expedite their product submission approval time and ensure safer public health.

FINAL THOUGHTS
This article was written solely based on authors Marlene Garcia Swider and Robert L. Creighton's experience working for more than 28 years in FDA. It does not reflect a position, official or unofficial, of the FDA. Any reference to FDA information is based on public information. JVT

REFERENCES
1. Campbell, C. and Katz, P. (2012). FDA 2011 Process Validation Guidance: Process Validation Revisited. Retrieved March 17, 2015 from FDA website: http://www.fda.gov/downloads/AboutFDA/CentersOffices/OfficeofMedicalProductsandTobacco/CDER/UCM334568.pdf
2. FDA Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials, February 2010.
3. FDA, Process Validation: General Principles and Practices, Guidance for Industry, January 2011.
4. Chen, C. and Moore, C. (2006). Role of Statistics in Pharmaceutical Development Using Quality-by-Design Approach - an FDA Perspective. Office of New Drug Quality Assessment, CDER/FDA. Retrieved March 17, 2015 from American Statistical Association website: https://www.google.com/#q=Role+of+Statistics+in+Pharmaceutical+Development+Using+Quality-by-Design+Approach+%E2%80%93+an+FDA+Perspective
5. Torbeck, L. (2011). Case Study: Use of Statistical Process Control to Detect Process Drift; Using Process Capability Measurement. Pharmaceutical Quality System. Retrieved March 17, 2015.
6. Trautman, K.A. (1997). The FDA and Worldwide Quality System Requirements Guidebook for Medical Devices.

Originally published in the Autumn 2011 issue of Journal of Validation Technology



Peer Reviewed: Medical Device Validation

Statistical Sampling Plan for Design Verification and Validation of Medical Devices | IVT
Liem Ferryanto

ABSTRACT
The valid rationale for developing statistical sampling plans for design verification and validation of medical device product performance is to demonstrate the probability of conformance to specification (PCS) of the device performance. Acceptable Quality Limit (AQL) sampling plans are not suitable for testing in the verification and validation phases. Therefore, a non-parametric binomial distribution model and a normal tolerance interval (NTI) model are used here to determine the sample size needed to demonstrate a specified PCS at a given confidence level for characteristics with attribute data and variable data, respectively. A practical step-by-step process for selecting and applying statistical sampling plans and acceptance criteria for verification and validation is also presented and then applied to several cases involving medical device products and processes.

INTRODUCTION
The Food and Drug Administration (FDA) requires, via Sec. 820.30 of Title 21 of the Code of Federal Regulations (CFR), medical device manufacturers that want to market certain categories of medical devices in the USA to establish and maintain procedures to control the design of the device (U.S. FDA, 2014). In essence, design controls are simple and logical steps to ensure that what is developed is what is meant to be developed, and that the final product meets customers' needs and expectations. When a device product reaches the stage where its hardware or software prototype is fully functional, FDA 21 CFR 820.30 Design Controls requires medical device manufacturers to perform design verification and design validation processes. These processes confirm the device design through examination and objective evidence, and ensure that the critical design and development specifications or outputs for the proper function of the device have met the design and development input requirements and are capable of meeting the requirements for the specified application or intended use, as well as safety requirements (U.S. FDA, 2011). In executing design verification and validation (V&V), Sec. 820.250 of Title 21 of the CFR requires manufacturers to establish and maintain procedures for identifying valid statistical techniques required for establishing the acceptability of process capability and product characteristics. Sampling plans shall be written and based on a valid statistical rationale.
This paper provides direction for determining design verification and validation sampling plans and tables that may be used for attribute and variable data. The sampling plans provided must be able to demonstrate that specified reliability or probability of conformance to specification (PCS) levels are met with the desired level of confidence.


STATISTICAL SAMPLING PLANS
The V&V assumes that its requirements have not been met unless testing demonstrates they are so. The plans available for use in manufacturing or routine inspection are Acceptable Quality Limit (AQL) sampling plans. An AQL sampling plan is a statistical method used to test the quality level that would (e.g., 95% of the time) be accepted, by estimating a characteristic of the product population through a sample. The rationale behind the AQL sampling plan is that the lot is assumed to be good right from the beginning until proven bad, biased towards the manufacturer's risk:

H0: probability [non-conformance] ≤ assigned AQL
H1: probability [non-conformance] > assigned AQL

Conformance or non-conformance of the product characteristic is generally defined as the number of passes or fails that occurred in a sample divided by the sample size, respectively. The manufacturer will accept a lot if H0 is not rejected. Failing to reject H0 shows only that there is no statistically significant evidence that the lot, which is assumed to be good, is actually bad. Without more information, we usually accept the lot as a good lot, but only because the test looks for evidence that the lot is not good. When the AQL sampling plan is applied to design V&V and manufacturers do not reject H0, what can be said about the PCS of the design performance? The typical AQL sampling plans applied to demonstrate whether or not the PCS of a system is good enough to meet its goal do not technically allow us to conclude that the PCS of the system is good just because the null hypothesis is not rejected.

In the V&V phases, manufacturers have to demonstrate whether or not the PCS of a system is good enough to meet its goal at a specific confidence level, with the assumption that the requirements have not been met unless testing demonstrates they are so. Therefore, AQL sampling plans are not suitable for testing in the V&V phases. Thus, if manufacturers want to demonstrate how good the PCS of a product's performance is, they should first assume that the requirements have not been met and then try to gather evidence to the contrary, i.e., evidence that suggests they are so. Therefore, the null hypothesis must be stated as follows:

H0: probability [non-conformance] > desired non-conformance level
H1: probability [non-conformance] ≤ desired non-conformance level

The hypotheses above can be written in terms of PCS as follows:

H0: PCS < desired PCS level
H1: PCS ≥ desired PCS level

Validation is passed if H0 is rejected. The rejection criterion is the maximum number of failures, Xc, allowed in a sample of size N at the desired PCS level, chosen such that

Probability [X ≤ Xc | N, desired PCS level] = 1 - Confidence Level,

where X is the number of failures. This is the probability of passing the demonstration test although the device does not meet the requirement, i.e., the consumer's risk (Pardo, 2013).

The basic principle of demonstration is to show that a product characteristic performs as designed, using a sample of devices tested under conditions considered representative of their operational use. Test results are measured by determining whether each unit passed or failed its specification, expressed as the percent of units conforming to the requirement. Based on the results of such a test, a decision is taken on the acceptability of the population of devices which the sample represents, that is, future production items. In any sampling test, there are risks to both the producer and the consumer that a wrong decision can be reached. The degree of risk will vary according to such factors as the sample size and test duration and must therefore be agreed and specified when planning demonstration tests.

PASS-FAIL TEST BASED ON THE NON-PARAMETRIC BINOMIAL DISTRIBUTION FOR ATTRIBUTE DATA
There are two types of data to be evaluated in V&V tests of each product, component, or process characteristic: variables (quantitative) data and attributes (pass/fail) data. In general, these characteristics are the critical-to-quality characteristics of the product performance. A method widely used in practice to determine the sample size needed to demonstrate a specified PCS at a given confidence level for a characteristic with attribute data is based on the non-parametric binomial (NPB) distribution model (Guo et al., 2013). To use the binomial distribution model to predict the PCS for devices, the trials in the sample must meet the following conditions. Each trial has only one of two possible outcomes and must be independent; the outcome of one trial cannot influence the outcome of another trial. All trials have the same PCS, i.e., each trial must come from an identical device or devices under an identical condition.
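As a concrete illustration of this rejection criterion, the following Python sketch (added here; the N = 300 plan is a made-up example, not from the article) searches for the largest acceptance number Xc for which the probability of Xc or fewer failures, at the stated PCS level, stays at or below 1 - CL.

from math import comb

def binom_cdf(k, n, p):
    # P[X <= k] for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def acceptance_number(n, pcs, cl):
    # Largest Xc with Probability[X <= Xc | N, desired PCS] <= 1 - CL, so the
    # consumer's risk of passing a non-conforming design is held at 1 - CL or less.
    q = 1 - pcs                       # probability that a single unit fails to conform
    xc = -1
    while binom_cdf(xc + 1, n, q) <= 1 - cl:
        xc += 1
    return xc                         # -1 means no acceptance number works for this N

# Hypothetical plan: 300 units, desired PCS = 99%, confidence level = 90%
print(acceptance_number(300, 0.99, 0.90))   # prints 0: accept only if no failures occur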


Determining the PCS of a device poses a unique challenge. Therefore, the test planner must have the knowledge necessary to determine the sample size that must be tested to demonstrate a desired PCS of the population at some acceptable level of confidence. The calculations are based on the binomial distribution and the following formula:

1 - CL = Σ (from i = 0 to f) [ N! / ( i! (N - i)! ) ] (1 - R)^i R^(N - i)     (1)

where CL is the confidence level, f is the maximum number of failures, N is the sample size, and R is the demonstrated PCS, which is equal to 1 - proportion non-conformance. 1 - CL is the probability of f or fewer failures occurring in a test of N units, i.e., the probability of passing the demonstration test although the device does not meet the requirement. Therefore, the NPB equation determines the sample size by controlling the error of passing non-conforming devices. If no units fail, the test is called success-run testing. If f = 0 (no devices failed), the CL is defined as 1 - R^N. Sampling plans for V&V will ordinarily provide greater confidence than those used in normal production. Given any three of the variables in equation (1), the remaining one can be solved for. Appendix A provides a table of sample sizes for different combinations of PCS levels (R), confidence levels (CL), and maximum numbers of failures (f). As a comparison to data generated from a normally distributed population, the capability (Ppk) of the process validation can be calculated as 1/3 of the inverse of the normal cumulative distribution for the corresponding reliability performance level; these results are also shown in Appendix A.

Example 1: A geometric characteristic of a newly designed device is being validated. The risk of this characteristic is minor, corresponding to a non-conformity that may cause the product to function poorly or cause an inconvenience but still leave it fit for use. The recommended reliability performance level is 99.0% for the minor risk of this characteristic. The suggested confidence level is 90%, corresponding to the design verification of a new product. A product engineer wants to design a zero-failure demonstration test in order to demonstrate a reliability of 99.0% at a 90% confidence level, using the NPB method to determine the required sample size.

Thus, the sampling plan is R = 99.0%, CL = 90%, and f = 0. Substituting these values into equation (1) gives the corresponding sample size of 230 (Appendix A). This sample size will be collected randomly from the pilot production for this design verification. If those 230 devices are run for the required demonstration test and no failures are observed, i.e., the null hypothesis that failures > 0 is rejected, then a PCS of 99.0% or higher has been demonstrated with a 90% confidence level. If the PCS of the system is less than or equal to 99.0%, the chance of passing this test is equal to 1 - CL = 10%, which is the error of passing non-conforming devices. Therefore, equation (1) determines the sample size by controlling the error of passing non-conforming devices.

Several other methods have been developed to help engineers create sampling plans for V&V tests, such as the Cumulative Binomial, Exponential Chi-Squared, Life Testing, and Non-Parametric Bayesian methods (Guo et al., 2013).
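For readers who prefer to compute plans directly rather than read them from Appendix A, the minimal Python sketch below (added for illustration, not part of the original article) solves equation (1) for N by simple search; with R = 99.0%, CL = 90%, and f = 0 it reproduces the sample size of 230 used in Example 1.

from math import comb

def npb_sample_size(r, cl, f, n_max=100000):
    # Smallest N satisfying equation (1):
    # sum over i = 0..f of C(N, i) (1 - R)^i R^(N - i) <= 1 - CL
    for n in range(f + 1, n_max):
        prob_pass = sum(comb(n, i) * (1 - r)**i * r**(n - i) for i in range(f + 1))
        if prob_pass <= 1 - cl:
            return n
    raise ValueError("no sample size found below n_max")

print(npb_sample_size(0.99, 0.90, 0))   # 230, matching Example 1
print(npb_sample_size(0.99, 0.90, 1))   # allowing one failure requires a larger sample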
A VARIABLE-DATA TEST BASED ON TOLERANCE INTERVALS FOR A NORMAL DISTRIBUTION
A method widely used in practice to determine the sample size needed to demonstrate a specified reliability at a given confidence level for a characteristic with variable data is based on the normal tolerance interval (NTI) model (Hahn and Meeker, 1991). A tolerance interval is a statistical interval within which, with some confidence level, a certain proportion of a sampled population falls. The endpoints of a tolerance interval are called upper and lower tolerance limits. If the demonstration test results are variable data, then the tolerance interval of the data is calculated; a tolerance interval that covers at least a certain PCS of the device at the stated confidence level should be within the specification limits of the device characteristic to pass the V&V requirements.

In most cases a characteristic of the device can be addressed by three types of tolerance intervals: a two-sided interval, a lower one-sided interval, and an upper one-sided interval. The corresponding tolerance intervals are defined by lower (L) and upper (U) tolerance limits, which are computed from a series of n device characteristic measurements Y1, ..., Yn and described as follows:

Two-sided interval: [ Ȳ - k2 s, Ȳ + k2 s ]     (2)
Lower one-sided limit: L = Ȳ - k1 s     (3)
Upper one-sided limit: U = Ȳ + k1 s     (4)

where Ȳ is the average value of Y, s is the standard deviation of Y, and the k factors are determined so that the intervals cover at least a certain R of the device population with a certain CL (NIST/SEMATECH, 2013). Equation (2), (3), or (4) guarantees, with probability CL, that R percent of the PCS measurements is contained in the interval, will not fall below the lower tolerance limit, or will not exceed the upper limit, respectively.


If the data are from a normally distributed population, an approximate value for the k2 factor as a function of R and CL for a two-sided tolerance interval is

k2 = sqrt{ ν (1 + 1/N) z[(1-R)/2]² / χ²[1-CL, ν] }     (5)

where χ²[1-CL, ν] is the critical value of the chi-square distribution with ν degrees of freedom that is exceeded with probability CL, z[(1-R)/2] is the critical value of the normal distribution associated with cumulative probability (1-R)/2, and N is the sample size. The quantity ν represents the degrees of freedom used to estimate the standard deviation. Most of the time the same sample will be used to estimate both the mean and the standard deviation, so that ν = N - 1, but the formula allows for other possible values of ν.

The calculation of an approximate k1 factor for one-sided tolerance intervals comes directly from the following set of formulas:

k1 = ( z[R] + sqrt( z[R]² - a b ) ) / a     (6)
a = 1 - z[CL]² / ( 2 (N - 1) )     (7)
b = z[R]² - z[CL]² / N     (8)

where z[R] and z[CL] are the critical values of the normal distribution associated with cumulative probabilities R and CL, respectively. Given the R, the CL, and the N, the factor k1 can be found from equation (6). Appendices B-1 and B-2 provide tables of combinations of preferred N and factor k1 for different combinations of reliability performance levels (R) and confidence levels (CL). In addition, the capability (Ppk) of the process validation can be calculated as 1/3 of the inverse of the normal cumulative distribution for the corresponding reliability performance level.

Example 2: Packaging seal strength for a new design is being verified. The one-sided specification limit of the seal strength is 10 lbs. minimum. The reliability performance level to be demonstrated is 99.6%, with a confidence level equal to 90%, for one run.

Given R = 99.6% and CL = 90%, equations (6)-(8) provide combinations of sample sizes and k1 factors: N = 20 and k1 = 3.42; N = 30 and k1 = 3.25; N = 40 and k1 = 3.15; etc. (Appendix B-2).

The verification test was run based on a sampling plan of N = 40 and k1 = 3.15. The data passed the normality test for the run: sample average = 13.1 lbs. and s = 0.6 lbs. Thus, the lower tolerance limit is 13.1 lbs. - 3.15 * 0.6 lbs. = 11.21 lbs. Since this lower limit was above the lower specification limit for the design verification run, the new design packaging seal passed.
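The approximate k1 factor in equations (6)-(8) is straightforward to evaluate directly. The Python sketch below (added here for illustration, not part of the original article) reproduces the N = 40, k1 = 3.15 factor and the 11.21 lbs. lower tolerance limit reported in Example 2.

from math import sqrt
from statistics import NormalDist

def k1_one_sided(n, r, cl):
    # Approximate one-sided normal tolerance factor from equations (6)-(8)
    z_r = NormalDist().inv_cdf(r)       # normal quantile for coverage R
    z_cl = NormalDist().inv_cdf(cl)     # normal quantile for confidence CL
    a = 1 - z_cl**2 / (2 * (n - 1))
    b = z_r**2 - z_cl**2 / n
    return (z_r + sqrt(z_r**2 - a * b)) / a

k1 = k1_one_sided(40, 0.996, 0.90)      # about 3.15
lower_limit = 13.1 - k1 * 0.6           # sample average and s reported in Example 2
print(round(k1, 2), round(lower_limit, 2), lower_limit >= 10.0)   # 3.15 11.21 True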
PROCESS STEPS FOR SELECTING THE SAMPLING PLAN AND ACCEPTANCE CRITERIA
Based on the NPB distribution model and the NTI model used above to develop demonstration tests of PCS, this section proposes a flow for determining a sampling plan and deciding whether the plan passes or fails. The process flow diagram for the selection of a sampling plan and acceptance criteria is shown in Figure 1.

Step 1 is to determine the desired R and the overall CL for each product, component, or process characteristic to be evaluated. R and CL must capture the probability of risk of the product characteristic, i.e., the dissatisfaction or harm it may cause to users if the product characteristic does not conform to its specification. Many manufacturers rank risk from cosmetic and minor to major and critical. Cosmetic risk may be defined as a nonconformity that will not affect usability or functionality of the product and affects only its appearance. Minor risk may be defined as a nonconformity which may cause the product to function poorly or cause an inconvenience, but which leaves the product fit for use or may possibly generate a complaint. Major risk may be defined as a nonconformity which may cause the product to be unfit for use, significantly degrades the product's function or performance, or is very likely to generate a complaint. Critical risk may be defined as a nonconformity that is likely to present a hazard to health. For example, product characteristics with critical, major, minor, and cosmetic risk, respectively, shall have R levels > 99%, > 97%, > 95%, and > 90%, respectively, with a confidence level greater than or equal to 90% in order to have at least R > 80%.

Step 2 is to identify the data type of each product, component, or process characteristic to be evaluated, i.e., either variable or pass/fail data. In general, these are the critical quality characteristics of the product or process output.

Step 3 is to select the sampling plan(s) to meet the desired R and CL. Selection for attribute data is provided in the table in Appendix A. Selection for variable data is provided in the tables in Appendices B-1 and B-2. Samples shall represent the behavior of the process validation or design verification runs. Random sampling or another method, such as periodic sampling, stratified sampling, or rational sampling, is commonly used to assure samples are representative of the entire run.

Step 4 is to perform verification and/or validation run(s) to collect test samples. The minimum size or length of each run should normally reflect the expected production run.

Step 5 is to perform statistical analysis of the pass/fail data collected in Step 4. The verification and/or validation run passes if the number of failed units is less than or equal to the maximum number of failures (acceptance number) in the table (Appendix A).

Step 6 is to perform a goodness-of-fit test on the variable-type data to determine whether the data are normally distributed.
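Step 6 leaves the choice of goodness-of-fit test open; one common choice is the Shapiro-Wilk test, sketched below in Python with SciPy (an illustration added here, using hypothetical data, not an example from the article).

from scipy import stats

# Hypothetical variable-type measurements collected in Step 4
data = [13.2, 12.8, 13.5, 13.0, 13.1, 12.9, 13.4, 13.2, 12.7, 13.3]

# Shapiro-Wilk goodness-of-fit test for normality
statistic, p_value = stats.shapiro(data)
print(f"W = {statistic:.3f}, p = {p_value:.3f}")

# A p-value above the chosen significance level (commonly 0.05) gives no evidence
# against normality, so the tolerance-interval analysis of Step 7 can proceed;
# otherwise Step 8 applies.
print("proceed to Step 7" if p_value > 0.05 else "data not normal: see Step 8")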


Step 7 is to perform statistical data analysis by calculating the NTI of the data, if the data pass the normality test. The NTI is calculated based on the sample average, the sample standard deviation, and the normal tolerance interval factor from Appendix B-1 or B-2. The interval should be within the specification limits to pass the run:
I. If the specification has a lower specification limit (LSL) only, then the run passes if (sample average - k1 * sample standard deviation) ≥ LSL.
II. If the specification has an upper specification limit (USL) only, then the run passes if (sample average + k1 * sample standard deviation) ≤ USL.
III. If the specification has two-sided specification limits, then the run passes if (sample average - k2 * sample standard deviation) ≥ LSL and (sample average + k2 * sample standard deviation) ≤ USL.

Step 8 is to add more data, using NTI test sampling as in Step 3, if the normality test in Step 6 fails, and then to perform additional verification and/or validation runs. In this case the normal tolerance interval approach is probably not appropriate.
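To make the pass/fail logic of Steps 5 and 7 explicit, here is a minimal Python sketch (added for illustration, not part of the original article). The sample data and the k factor are hypothetical; in practice k must be taken from Appendix B-1 or B-2, or from equations (5)-(8), for the actual sample size.

from statistics import mean, stdev

def attribute_run_passes(failures, acceptance_number):
    # Step 5: attribute data pass/fail against the plan's acceptance number
    return failures <= acceptance_number

def variable_run_passes(data, k, lsl=None, usl=None):
    # Step 7: variable data pass/fail using a normal tolerance interval;
    # use k1 when only one limit applies and k2 when both limits apply
    avg, s = mean(data), stdev(data)
    if lsl is not None and usl is not None:
        return (avg - k * s) >= lsl and (avg + k * s) <= usl
    if lsl is not None:
        return (avg - k * s) >= lsl
    if usl is not None:
        return (avg + k * s) <= usl
    raise ValueError("at least one specification limit is required")

print(attribute_run_passes(failures=0, acceptance_number=0))           # True
sample = [13.2, 12.8, 13.5, 13.0, 13.1, 12.9, 13.4, 13.2, 12.7, 13.3]  # hypothetical data
print(variable_run_passes(sample, k=3.15, lsl=10.0))                   # True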
Example 3: Some changes within the IO Audio Driver, Wave File, and Sound Manager interaction were made in order to increase Surgical Equipment GUI performance. The data demonstrate that, prior to optimization, the GUI reliability or PCS level (no freeze) was 93% with a 95% confidence level. The target was to increase the PCS level to 99% with a 95% confidence level.

The table to be used in this example is in Appendix A, and the sampling plan is R = 99% and CL = 95%. The corresponding sample size was 459, with acceptance on 0 failures and rejection on 1 or more. Formal engineering testing via a total of 500 (rounded up) simulated surgery tests was done, and the run passed.

Example 4: The fill volume of a new filler with specification limits of 1000 - 1060 ml is being validated. The PCS level to be demonstrated is 99.0% with 99% overall confidence. Three runs at 90% confidence each will give about 99.9% overall; the overall confidence is calculated as (1 - (1 - 0.90)^3) * 100% = 99.9%.

The sampling plan was selected from the two-sided 90% confidence tables: N = 20, k2 = 2.15 for each run. The table to be used is Appendix B-1: Normal Tolerance Limit Factors (k2) for Two-sided Specification Limits. The sampling plan is R = 99.0%, CL = 90% per run; the corresponding sample size is 20 and k2 = 2.15. The data passed the normality test for each run. Based on the summary statistics for each run, where Ȳ is the sample average and s is the sample standard deviation, Ȳ ± 2.15s was within the specification limits for all three runs, so the plan passed. We can be 99.9% confident that the process produces greater than or equal to 99.0% conforming units.
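The overall-confidence arithmetic and the per-run two-sided check in Example 4 can be scripted as follows (a Python illustration added here; the per-run averages and standard deviations are hypothetical placeholders, since the article's summary-statistics table is not reproduced in this text).

# Overall confidence from three independent runs, each at 90% confidence (Example 4)
runs, cl_per_run = 3, 0.90
overall_cl = 1 - (1 - cl_per_run) ** runs
print(f"overall confidence = {overall_cl:.1%}")          # 99.9%

# Two-sided tolerance-interval check per run with k2 = 2.15 (specification: 1000-1060 ml)
lsl, usl, k2 = 1000.0, 1060.0, 2.15
hypothetical_runs = [(1030.2, 5.1), (1029.5, 4.8), (1031.0, 5.4)]   # (average, s), invented values
for avg, s in hypothetical_runs:
    passes = (avg - k2 * s) >= lsl and (avg + k2 * s) <= usl
    print(f"run: average = {avg}, s = {s}, passes = {passes}")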
CONCLUSIONS
In this article, practical sampling plans, and a step-by-step procedure for selecting a suitable plan, are developed based on the NPB distribution model and the NTI model for attribute data and variable data, respectively. These statistically valid sampling plans, which regulators require, are suitable for demonstrating the probability of conformance to specification of medical device performance in the design V&V stages. JVT

REFERENCES
1. H. Guo, E. Pohl, and A. Gerokostopoulo, Determining the Right Sample Size for Your Test: Theory and Application. 2013 Annual Reliability and Maintainability Symposium, IEEE.
2. G.J. Hahn and W.Q. Meeker, Statistical Intervals: A Guide for Practitioners. John Wiley & Sons, Inc., 1991.
3. NIST/SEMATECH, e-Handbook of Statistical Methods, 2013. (Available at: http://www.itl.nist.gov/div898/handbook/prc/section2/prc263.htm accessed February 18, 2015)
4. S. Pardo, Equivalence and Noninferiority Tests for Quality, Manufacturing and Test Engineers, Chapman and Hall/CRC, 2013.
5. U.S. FDA, Code of Federal Regulations Title 21, 2014. (Available at: http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?CFRPart=820 accessed February 18, 2015)
6. U.S. FDA, Guidance for Industry Process Validation: General Principles and Practices Current Good Manufacturing Practices (CGMP), Revision 1, 2011. (Available at: http://www.fda.gov/downloads/Drugs/Guidances/UCM070336.pdf accessed February 18, 2015).


FIGURE 1: Process Flow on Selecting and Applying Statistical Sampling Plan for Design V & V


APPENDIX A: Attributes Sampling Plan - Non-parametric Pass-Fail Test

Originally published in the Summer 2011 issue of Journal of Validation Technology



UBM, LLC All rights reserved.
Reproduction in whole or part is prohibited without prior written permission of the publisher.
SVTQ6000
