
Big Data Analytics

Meet the Authors
The Four Vs
What Is It?
Success Stories
Getting Started
Challenges and Future Research

Meet the Authors


RICHARD D. BRAATZ, PhD, is the Edwin R. Gilliland
Professor of Chemical Engineering at the Massachusetts Institute of Technology (MIT), where he does
research in systems and control theory and its application to chemical and biological systems (Email:
braatz@mit.edu). He was the Millennium Chair and
Professor at the Univ. of Illinois at Urbana-Champaign
and a visiting scholar at Harvard Univ. before moving
to MIT. He has consulted or collaborated with more
than 20 companies including IBM, DuPont, Novartis,
and United Technologies Corp. Honors include the
Donald P. Eckman Award from the American Automatic Control Council, the Curtis W. McGraw Research
Award from the Engineering Research Council, the
IEEE Control Systems Society Transition to Practice
Award, and the CAST AIChE Computing in Chemical
Engineering Award. His 400+ publications include the
textbook, Fault Detection and Diagnosis in Industrial
Systems. He received a BS in chemical engineering from Oregon State Univ. and an MS and PhD in
chemical engineering from the California Institute of
Technology. He is a Fellow of the International Federation of Automatic Control, IEEE, and the American
Association for the Advancement of Science.
LEO H. CHIANG, PhD, is Senior Technical Manager
at the Dow Chemical Co. in Freeport, TX (Email:
hchiang@dow.com). He is the leader of Dow's
Chemometrics and Manufacturing Analytics
departments and is responsible for partnering with
academia to develop and transfer emerging data
analytics technologies for Dow. He has developed
and implemented several systems techniques to
solve complex manufacturing problems, resulting
in 11 Manufacturing Technology Center Awards. In
2010, he received the Vernon A. Stenger Award,
which is the highest individual honor for analytical
sciences research and development at Dow. He has
authored 25 peer-reviewed papers, 33 conference
presentations, and 2 books published by Springer
Verlag. His textbook, Fault Detection and Diagnosis
in Industrial Systems, is available in English and Chinese and has received over 1,300 citations according
to Google Scholar. He received a BS from the Univ.
of Wisconsin-Madison, and an MS and PhD from the
Univ. of Illinois at Urbana-Champaign, all in chemical
engineering. He is an active member of AIChE.
LLOYD F. COLEGROVE, PhD, is the Director of Data
Services and the Director of Fundamental Problem
Solving in the Analytical Technology Center at the
Dow Chemical Co. (Email: lfcolegrove@dow.com). He
began his Dow career in R&D in polymer science and
quickly moved to improvement of analytical methodology in plant labs. While in a role as business quality
leader for four business units, he embarked on his
big data journey in manufacturing before the term big
data came into use. He established the first applied
statistics group in Dow manufacturing and developed
the vision that is taking the company from merely collecting data to actively using data. Colegrove has more
than 29 years of experience in chemical research and
manufacturing. He holds a BS in chemistry and a PhD
in chemical physics from Texas A&M Univ.
SALVADOR GARCÍA MUÑOZ, PhD, is the team leader of
the modeling and simulation department at Eli Lilly and
Co. (Email: sal.garcia@lilly.com). His current responsibilities span the drug substance and drug product
areas, with particular focus on the use of computer-aided engineering in regulatory documents and the


transfer of modeling technology to manufacturing. He
started his career at Aspen Technology, where he spent
four years working as a consultant for the polymer,
petrochemical, and fine chemical manufacturing
industries. After receiving his PhD, he joined Pfizer,
where he spent nine years as a modeling and simulation scientist actively participating in the development
of new medicines and the improvement of commercial
manufacturing operations using model-based control
tools. His research interests include the theory and
application of multivariate statistical methods, optimization, advanced thermodynamics, and advanced
control. He has received multiple awards, including
the Pfizer Achievement Award (2009), the AIChE Food,
Pharmaceutical, and Bioengineering Div. Award (2010),
and the Pfizer Manufacturing Mission Award (2011).
He received a BSc in chemical engineering and an MSc
in chemical and computer systems engineering, both
from the Instituto Tecnológico y de Estudios Superiores
de Monterrey (ITESM), in Mexico, and a PhD in chemical engineering from McMaster Univ., in Canada. He is
an active member of AIChE and collaborates with the
American Association of Pharmaceutical Scientists.
CHAITANYA KHARE is an EMI Development Engineer
and has been leading the design and implementation
of an EMI effort, redefining how Dow aggregates,
visualizes, interprets, and utilizes its plant data in real
time. His work has brought about an overall diagnostic
dashboard tool that is currently being shared by the
technology center and the operations and R&D groups
across the enterprise to monitor plant health in real
time. Khares work was recognized by two prestigious
awards the 2015 Golden Mousetrap Award, and
the 2015 Manufacturing Leadership Award (Frost and
Sullivan). Khare obtained a diploma in chemical engineering from Mumbai Univ., India, a BE in petrochemical engineering from Pune Univ., India, and an MS in
chemical engineering from Twente Univ., The Netherlands. He spent more than seven years in hydrocarbons
research at Dow before moving into the manufacturing
analytics group in the Analytical Technology Center in
2014. Khare is certified as a Black Belt in Six Sigma.
JOHN F. MACGREGOR, PhD, is President and CEO of
ProSensus Inc. (Email: john.macgregor@prosensus.
ca), a company that provides specialized engineering
consulting and state-of-the-art software for the analysis and interpretation of big data from the process
industries, and for the online monitoring, control,
and optimization of processes based on developed
models. He is a Distinguished University Professor
Emeritus at McMaster Univ., Canada, where he spent
36 years in the Chemical Engineering Dept. and in the
McMaster Advanced Control Consortium (MACC), and
where his research group developed many advanced
big data methods in collaboration with the large
international sponsor companies of MACC. He has
received many awards, including the Shewhart Medal
from the American Society for Quality; the Herman
Wold Medal from the Swedish Chemical Society;
the Century of Achievement Award and the R. S.
Jane Award from the Canadian Society for Chemical
Engineering; the Computing and Systems Technology
Award from AIChE; the Nordic Process Control Award
from the Scandinavian Control Society; and the Guido
Stella Award from the World Batch Forum. He received
a BEng from McMaster Univ. in Canada, and an MS
in statistics and chemical engineering and a PhD in
statistics, both from the Univ. of Wisconsin-Madison.
He is a Fellow of the Royal Society of Canada, the
Canadian Academy of Engineering, and the American
Statistical Association.


MARCO S. REIS, PhD, is a professor in the


department of chemical engineering at the Univ.
of Coimbra, Portugal (Email: marco@eq.uc.pt),
where he is responsible for the Process Systems
Engineering (PSE) research group. He currently
serves as president of the European Network for
Business and Industrial Statistics (ENBIS) and of
Associação Para o Desenvolvimento da Engenharia Química (PRODEQ). He lectures on process systems
engineering, quality technology and management,
management and entrepreneurship, and process
improvement. His research interests are centered
on the field of process systems engineering (system
identification, fault detection and diagnosis, control, and optimization), statistical process control
of complex large-scale processes, data-driven
multiscale modeling, chemometrics, design of
experiments, and industrial statistics. Other areas
of interest include multivariate image analysis, systems biology, and process improvement through initiatives such as Six Sigma and lean manufacturing.
He has published about 60 articles in international
journals or book series, four book chapters, and two
books, and authored or coauthored 100+ presentations in international congresses. He received a
Licentiate degree in chemical engineering and a
PhD in chemical engineering, both from the Univ. of
Coimbra, Portugal.
MARY BETH SEASHOLTZ, PhD, is the Technology Leader for the Data Services Capability in the
Analytical Technology Center at the Dow Chemical
Co. (Email: mseasholtz@dow.com). She began her
career applying chemometrics to process analyzers.
Now, her primary responsibility is to drive the technology needed to use data to make money. For this,
she focuses on several areas, including statistics
and chemometrics, as well as software platforms.
In 2015, she and her team were awarded the 2015
Golden Mousetrap Award in Design Tools Hardware
& Software: Analysis & Calculation Software from
Design News. They also were awarded the 2015
Manufacturing Leadership Award: Big Data and
Advanced Analytics Leadership from Manufacturing
Leadership Community (Frost and Sullivan). She
has more than 25 years of experience in the field of
chemometrics. She received a BS in chemistry and
mathematics from Lebanon Valley College, and an
MS in applied mathematics and a PhD in analytical
chemistry, both from the Univ. of Washington.
DAVID WHITE is a senior analyst at the ARC Advisory
Group (Email: DWhite@ARCweb.com), where he
is responsible for research into analytics and big
data. He uses his 20 years of experience from many
industries to research the art and science of getting
the right information, to the right people, at the right
time. Choosing an appropriate analytics solution is
vital. But, many other factors are also crucial for an
analytics project to create lasting business value.
With this in mind, his research has two goals: To
help technology buyers get the most value from
their investments in analytics; and to help suppliers
shape their analytics product and marketing strategies. Immediately before joining ARC, he researched
business analytics for the Aberdeen Group, serving
clients such as SAP, IBM, Qliktech, and Tableau.
Before that, he worked in marketing roles for
companies such as Oracle, Cognos, Dimensional
Insight, and Progress Software. He received a BS in
computer science from the Univ. of Hertfordshire and
an MBA from Cranfield Univ., both in the UK.


Come to the 2016 AIChE Spring Meeting and Discover How Companies are Using Big Data

Learn at 15 Sessions Focused on Operationalizing Big Data and Analytics

More than ever, companies in the process industries are using data to improve their business. This second topical conference devoted to the applications of big data features case studies and sessions including:
Data Management in Refineries
Big Data Analytics and Statistics
Big Data Analytics: Data Visualization
Big Data Analytics: Fundamental Modeling
Big Data Analytics and Smart Manufacturing
Plus sessions offering industry and vendor perspectives and a plenary session.
And you can attend any of the other sessions at the 2016 AIChE Spring Meeting and 12th Global Congress on Process Safety.
View the program and register at www.aiche.org/spring.

Special Section: Big Data Analytics

BIG DATA

The Four Vs

Big data is a big topic with a lot of potential. Before realizing this potential, however, we need to get on the same page about what big data is,
how it can be analyzed, and what we can do with it.
The term big data is somewhat misleading, as it is not only the size
(volume) of the data set that makes it big data. Size is just one aspect, and
it describes the sheer amount of data available. A study conducted by Peter
Lyman and Hal R. Varian of the Univ. of California, Berkeley, estimates that
the amount of new data stored grew by about 30%/yr between 1999 and 2002, reaching roughly 5 exabytes (5 billion gigabytes) of new stored data in 2002. Ninety-two percent of the new data was stored on magnetic media, mostly on hard disks. For reference, 5 exabytes is equivalent to the data stored in 37,000 libraries the size of the Library of Congress, which houses 17 million books. And, according to IBM, about 2.3 trillion gigabytes of data are created each day, and the total amount of data created is expected to reach 43 trillion gigabytes by 2020. In the chemical process industries (CPI), data are coming from many sources, including
employees, customers, vendors, manufacturing plants, and laboratories.
In addition to volume, big data is characterized by three other Vs: velocity, variety, and veracity. Velocity refers to the rate at which data are coming into your organization. Data are now streaming continuously into servers in real time. IBM puts this in context: the New York Stock
Exchange captures 1,000 gigabytes of trade information during each trading session. Furthermore, according to Intel, every minute, 100,000 tweets,
700,000 Facebook posts, and 100 million emails are sent.
Not all data are equal. Variety describes the heterogeneity of data being
generated. One distinction is whether the data are structured or unstructured.
Structured data include digital data from online sensors and monitoring
devices, while unstructured data are not as neat, such as customer feedback
in the form of paragraphs of text in an email. Realizing the benefits of big
data will require the simultaneous analysis and processing of many different
forms of data, from market research information, to online sensor measurements, to images and spectra.
The fourth V, veracity, refers to the quality of data and uncertainty in the
data. Not all data are meaningful. Data quality depends on the way the data are
collected (bias issues may emerge that are very difficult to detect), on whether
the data are updated or no longer make sense (due to time-varying changes
in the system), and on the signal-to-noise ratio (measurement uncertainty),
among other factors.
But the potential of big data is not merely the collection of data. It's the thoughtful collection and analysis of the data, combined with domain knowledge, to answer complex questions. By acting on the answers to these questions, CPI companies will be able to improve operations and increase profits.
Big data analytics is a more appropriate term to emphasize the potential of
big data.


AIChE recognizes the importance of big data and has organized topical
conferences on big data analytics at its meetings, including at the upcoming
Spring Meeting being held in Houston, TX, April 10-14.
The articles in this special section explore the topic of big data analytics
and its potential for the CPI.
In the first article, David White introduces big data. "A common misconception is that big data is a thing," White writes. "A more accurate metaphor is that big data enables a journey toward more-informed business and operational decisions." White discusses this journey, emphasizing the need for a new approach to analytics that eliminates delays and latency. He concludes with recommendations to help you as you embark on the big data journey.
Salvador García Muñoz and John MacGregor provide several examples
of big data success stories in the second article. The examples include the
analysis and interpretation of historical data and troubleshooting process
problems; optimizing processes and product performance; monitoring and
controlling processes; and integrating data from multivariate online analyzers and imaging sensors. Because these examples involve the use of latent
variable methods, the authors briefly discuss such analytics and why they are
suitable in the era of big data.
Once you see big data's potential, how do you get started? In the third
article, Lloyd Colegrove, Mary Beth Seasholtz, and Chaitanya Khare answer
this question. The first steps involve identifying a place to get started, a project
where big data analytics will pay off, and then selecting a software package
appropriate for that project. "Once you have found an analytics opportunity and decided on a data analytics software package, the truly hard work starts: convincing your organization to move forward and then taking those first steps," the authors write. Drawing on their experience at Dow Chemical, they describe
a strategy that has worked for them. They then talk about moving beyond the
initial success and using big data on more than just a few small projects.
Looking to the future, Marco Reis, Richard Braatz, and Leo Chiang identify
challenges and research directions aimed at realizing the potential of big data.
The fourth article explores some of the challenges related to the four Vs and
some potential areas of research that could address these challenges. "Big data creates new possibilities to drive operational and business performance to higher levels," they write. "However, gaining access to such potential is far from trivial. New strategies, processes, mindsets, and skills that are not yet in place are necessary." Pointing back to the previous articles, the authors end on a high note: "Big data offers new opportunities for managing our operations, improving processes at all levels, and even adapting the companies' business models. So the important question is: Can we afford not to enter the big data era?"
CEP extends a special thanks to Leo H. Chiang for serving as guest editor of this special section.


Special Section: Big Data Analytics

BIG DATA

What Is It?
David White
ARC Advisory Group

Big data can pave the way to greater


business and operational insight. New approaches
and technologies are poised to help you
navigate your journey.

Much mystery surrounds big data. A 2014 survey


by the ARC Advisory Group found that 38% of
respondents did not understand what big data is or
why they should care about it (1).
A common misconception is that big data is a thing. A
more accurate metaphor is that big data enables a journey
toward more-informed business and operational decisions.
Most companies have already embarked upon a big-data
journey, whether they realize it or not. For many industrial
companies, big data manifests itself as data from the Industrial Internet of Things (IIoT). The IIoT connects intelligent
physical entities, such as sensors, devices, machines, assets,
and products, to each other, Internet services, and applications. The IIoT is built upon current and emerging technologies, such as mobile and intelligent devices, wired and wireless networks, cloud computing, analytics, and visualization
tools.
IIoT data have surpassed what most industry observers
had anticipated, enabling data to arrive faster, from more
data sources, and in greater volume. This has profoundly
impacted analytics. The classic architecture used for business intelligence, operational intelligence, and analytics is no
longer adequate. New analytics technologies are necessary,
and they will need to be placed closer to the data source to
be effective. This article introduces new analytics technologies and related supporting technologies, and presents some
early big-data success stories.

What is big data?


ARC's survey results are, in many ways, not surprising.
Many industrial organizations do not understand big data.
A definition of big data established in 2001 (2)
stresses three Vs: volume (the amount of data managed),


velocity (the rate of incoming data), and variety (the type of
data). Sometimes veracity (the quality and accuracy of the
incoming data) is also included in the definition.
Think about your current experience in terms of the original three Vs (volume, velocity, and variety) to determine
whether your company is already on the big data journey.
Volume: Is the amount of data you manage growing at an
accelerating rate?
Velocity: Is the rate at which data are generated accelerating? Are the people consuming the information gleaned
from the data demanding more-timely insight?
Variety: Are the number and types of data sources you
use for analytics growing rapidly?
If you answered "yes" to any of these questions, you are already dealing with big data; if you had two or more "yes" answers, you are further along the journey. A common
thread through each question is growth (demand). It is not
necessarily difficult to manage a large volume of data, but it
is stressful to manage a body of data that is growing rapidly
year after year.
A big-data journey must strike a balance between
data supply (data management) and information demand
(managers asking for different information or more-timely
updates). The aim of any big data project must be to add
business value by enabling cost reductions, productivity
gains, or revenue increases. Many older big data projects
never reached the point where they were adding value. For
example, it is not unusual to find projects based on plant
historians that accumulate data for years without business
or operational managers taking full advantage of the data
through analytics.

The big data journey


Figure 1 presents a traditional business intelligence
(BI)/analytics infrastructure of operational systems and a
data warehouse. The operational systems may include enterprise resource planning (ERP) systems, which help manage
financials, supply chain, manufacturing, operations, reporting, and human resources; manufacturing execution systems
(MES), which track and document the transformation of
raw materials to finished goods; supply chain management
software; and financial and accounting software.
Delays and latency (green arrows) are built into this
traditional system. It often takes time for all transactions to
be entered into the operational systems. For example, handwritten maintenance records need to be transcribed into digital records so they can be analyzed and integrated with other
data sources. There is usually a delay between capturing
data in operational systems and copying the data to the data
warehouse that supports reporting, dashboards, and analytics. As a consequence, managers do not see the current state
of operations or business through BI/analytics. Instead, they
see the state of the business the last time the data warehouse
was refreshed, rendering the information a day, week, or
even month out of date.
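To make the latency concrete, here is a minimal sketch (in Python, using only the standard library's sqlite3 module) of the pattern Figure 1 describes: transactions land in an operational table immediately, but a separate reporting table is only brought up to date by a periodic batch job, so any BI query reflects the last refresh rather than the current state. The table and column names are illustrative, not those of any particular ERP or warehouse product.

import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")  # stand-in for separate operational and warehouse databases
conn.execute("CREATE TABLE operational_orders (id INTEGER, amount REAL, created_at TEXT)")
conn.execute("CREATE TABLE warehouse_orders (id INTEGER, amount REAL, loaded_at TEXT)")

def record_transaction(order_id, amount):
    """Writes land in the operational system immediately."""
    conn.execute("INSERT INTO operational_orders VALUES (?, ?, ?)",
                 (order_id, amount, datetime.now().isoformat()))

def nightly_batch_refresh():
    """Periodic ETL job: copy everything not yet loaded into the warehouse."""
    conn.execute("""
        INSERT INTO warehouse_orders
        SELECT id, amount, ? FROM operational_orders
        WHERE id NOT IN (SELECT id FROM warehouse_orders)""",
        (datetime.now().isoformat(),))

# Transactions recorded today...
record_transaction(1, 250.0)
record_transaction(2, 410.0)

# ...are invisible to BI queries until the next batch refresh runs.
stale = conn.execute("SELECT COUNT(*) FROM warehouse_orders").fetchone()[0]
nightly_batch_refresh()
fresh = conn.execute("SELECT COUNT(*) FROM warehouse_orders").fetchone()[0]
print(f"warehouse rows before refresh: {stale}, after refresh: {fresh}")

Until the refresh runs, the warehouse query answers with yesterday's picture of the business, which is exactly the latency the green arrows in Figure 1 represent.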
Delays in business insight (operational equipment effectiveness, on-time shipments, overtime expenses, etc.) can increase costs or cause opportunities to be missed.
At a minimum, this is an inconvenience, but in the future
such delays could be catastrophic. Some data generated by
the IIoT are time-critical, and demand immediate action to
maintain product quality or avoid costly equipment failure.
New approaches to analytics that minimize or eliminate
data latency (i.e., eliminate the green arrows in Figure 1) are
gaining attention.
Use one database
One approach to eliminating data latency is to merge the
operational databases with the data warehouse, creating a
single database that can support recording transactions and
analytics. This eliminates the need for periodic batch transfers to keep the data warehouse up-to-date.
This idea has been proposed before using a traditional
relational database management system (RDBMS), but in
most cases that proved to be impractical. RDBMSs first
came to commercial prominence in the 1980s, and their role
was to record simple transactional information, such as bank
account withdrawals, telephone call details, or sales orders.
The need for analytics arose later. The database structure
required to support high-performance analytics was often
very different from the structure required to support high-performance transaction processing.
Consider designing a car. It is possible to build a car
that comfortably exceeds 200 miles per hour, and it is also

relatively easy to design a car that achieves 50 miles to the


gallon. However, creating a car that does both simultaneously is beyond current mainstream automotive technology.
Similarly, running transactional applications and analytics
using the same database not only increases the workload,
but also makes the workload more complex to manage. The
performance of the two workloads is difficult to optimize to
the satisfaction of all users.
When the transactional applications and BI are separate,
both can be optimized. However, maintaining multiple databases involves higher costs and, more importantly, introduces delays of days or weeks.
A novel in-memory database architecture (3) can support both transactional and analytics workloads in a single
database. Americas Styrenics benefited from this approach
after a divestiture forced an IT overhaul in just a year. For an
enterprise application such as SAP, this time constraint was
clearly a challenge.
The solution was to adopt SAP's Business Suite on the
HANA enterprise cloud. Because the entire database is
stored in memory rather than on a hard disk, reading and
writing data is much faster (microseconds vs. milliseconds).
Analytics are optimized because the data are stored in
columns instead of rows. This scheme enables both transactional and analytics workloads to be served successfully
from a single database.
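The benefit of a column-oriented, in-memory layout for analytics can be illustrated with a small, hedged sketch: an aggregate over one attribute scans a single contiguous array instead of touching every row record. This is only a toy comparison in Python with NumPy; it is not how SAP HANA is implemented, but it shows the design tradeoff the text describes.

import numpy as np
import time

n = 1_000_000

# Row-oriented layout: each record is a tuple (order_id, qty, price), as an RDBMS page would store it.
rows = [(i, i % 50, float(i % 100)) for i in range(n)]

# Column-oriented layout: each attribute is one contiguous array.
price_column = np.array([r[2] for r in rows])

t0 = time.perf_counter()
total_row = sum(r[2] for r in rows)          # analytic scan must walk every record
t1 = time.perf_counter()
total_col = price_column.sum()               # same aggregate over one dense column
t2 = time.perf_counter()

print(f"row scan:    {t1 - t0:.3f} s  (total {total_row:.0f})")
print(f"column scan: {t2 - t1:.3f} s  (total {total_col:.0f})")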

Get closer to the data source


Another way to reduce latency is to place analytics
closer to data generation. This can be particularly valuable
when it is critical to maintain quality and uptime, such as in
production monitoring applications. In an industrial setting,
delays in assembling and accessing information can cause
serious issues, like expensive machine failures.
Analytics that are closer to the data source are able to
intercept data and perform time-critical analyses almost
immediately after the data are generated. The analysis occurs
before the data are written to long-term data storage.

Figure 1. In a traditional business intelligence (BI) architecture, transactions and operational data are fed to operational systems (ERP, MES, supply chain management, and financial software) that organize and manage the data, which are then stored in a data warehouse. The green arrows represent points where latency is introduced into the system.


This is possible, in part, because processing and communication technologies have become smaller, cheaper, and
yet more powerful. Microprocessors can be embedded in
industrial devices, such as pumps or turbines, and software,
such as predictive analytics algorithms, can execute on the
device itself. Time-critical functions, such as device optimization or failure alerts, can be performed using analytics
on the device. For other functions that are not time-critical,
such as historical trend analyses, data are still aggregated in
traditional data storage.
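A minimal sketch of this pattern, assuming a hypothetical oven-temperature tag and alarm limits (this is not Software AG's Apama API or any vendor's interface), shows the key idea: each incoming measurement is checked for time-critical conditions first, and only then appended to long-term storage.

from collections import deque

ARCHIVE = []                        # stand-in for the historian / long-term store
OVEN_TEMP_LIMITS = (180.0, 220.0)   # hypothetical alarm limits for one tag
recent = deque(maxlen=60)           # rolling one-minute window at 1 reading/s

def alert(message):
    print("ALERT:", message)        # in practice: notify the operator or trigger an interlock

def on_measurement(tag, value):
    """Runs on every incoming reading, before it reaches long-term storage."""
    if tag == "oven_temperature_C":
        low, high = OVEN_TEMP_LIMITS
        if not low <= value <= high:
            alert(f"{tag} out of range: {value:.1f} C")
        recent.append(value)
        if len(recent) == recent.maxlen and max(recent) - min(recent) > 15.0:
            alert(f"{tag} drifted {max(recent) - min(recent):.1f} C over the last minute")
    ARCHIVE.append((tag, value))    # archiving happens only after the time-critical checks

for v in (205.0, 207.5, 221.3, 210.0):   # simulated stream
    on_measurement("oven_temperature_C", v)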
Schwering & Hasse Elektrodraht utilizes complex event
processing (event stream processing) technology to manage
the quality of the copper magnet wire it produces. Copper
magnet wire, which is coated with a thin layer of insulation,
is a critical component of many electrical products, such as
transformers and motors. It is made to fine tolerances, and
because it is embedded within other components, quality is
critical. Failures can trigger expensive product recalls.
To ensure product quality, the manufacturing process is
monitored continuously via about 20,000 measurements per
second across the factory. These measurements come from
about 20 different sources and include, for example, oven
temperatures, material feed rates, and cooling fan speeds.
The quality of the wire insulation is also physically checked
at 25-mm intervals along the wire to ensure voltage isolation. Measurements are fed into Software AG's Apama
Streaming Analytics, which monitors the production process
in real time from start to finish.
This monitoring scheme has changed the way the factory operates. In the past, if a wire did not meet quality
requirements, the entire spool of wire had to be scrapped.
Real-time production monitoring reduces scrap and provides more information about the production process and
quality of each spool of wire.
Figure 2. The cost of raw disk storage has fallen dramatically over the last two decades. If the per-gigabyte cost in 1994 is represented by the height of the Empire State Building, the cost in 2014 would be comparable to the length of an almond. Source: Adapted from (4).

Predictive analytics, used in applications such as predictive maintenance, are also more effective when moved closer
to the data source. For example, GE uses predictive analytics
to monitor the performance of over 1,500 of its turbines and
generators in 58 countries. Each turbine typically has more
than 100 physical sensors and over 300 virtual sensors that
monitor factors such as temperature, pressure, and vibration. Data are routed over the Internet to a central monitoring location (more than 40 terabytes of data have been
transferred so far). At the data center, predictive analytics
algorithms check current and recent operating performance
against models of more than 150 failure scenarios to detect
signs of impending failure.
Routing data into central storage makes efficient use of
highly skilled workers, allowing specialist technicians to
monitor all of the turbine and generator installations. This
single-database method also helps to ensure that predictive
models are continually refined and improved, and enables
best practices to be shared with all customers simultaneously
rather than on an incremental basis during system upgrades.
GE estimates that this approach collectively saved its customers $70 million last year.
SAS has applied similar technology in deepwater oil
fields. Electrical submersible pumps (ESPs) are used to
increase oil well production, but unexpected failure of an
ESP is costly, causing hundreds of millions of dollars in lost
production as well as requiring $20 million to replace the
pump. The predictive maintenance application developed
using SAS draws on data stored in historians and other
sources to monitor the performance of thousands of ESPs
and detect abnormal operation. Operators are able to gain
a three-month lead time on potential failures and reduce
downtime to six hours.
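The core of such predictive-maintenance analytics can be sketched, in a heavily simplified form, as a residual check: fit a model of normal behavior from healthy operating data, then flag new readings that deviate from the model's prediction by more than a few standard deviations. The linear vibration model, sensor names, and threshold below are illustrative assumptions, not GE's or SAS's actual algorithms.

import numpy as np

rng = np.random.default_rng(0)

# Fit a simple model of "normal" behavior from healthy operating data:
# vibration is roughly a linear function of load and speed.
load = rng.uniform(0.5, 1.0, 500)
speed = rng.uniform(0.8, 1.2, 500)
vibration = 2.0 * load + 1.5 * speed + rng.normal(0, 0.05, 500)
X = np.column_stack([load, speed, np.ones_like(load)])
coef, *_ = np.linalg.lstsq(X, vibration, rcond=None)
resid_std = np.std(vibration - X @ coef)

def check(load_now, speed_now, vibration_now, threshold=4.0):
    """Flag readings that deviate from the normal-behavior model."""
    expected = coef @ np.array([load_now, speed_now, 1.0])
    z = (vibration_now - expected) / resid_std
    if abs(z) > threshold:
        print(f"possible incipient failure: vibration {z:.1f} sigma from model prediction")

check(0.9, 1.0, 2.0 * 0.9 + 1.5 * 1.0 + 0.02)  # healthy reading: no alarm
check(0.9, 1.0, 4.1)                            # worn-bearing scenario: alarm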

The cloud
The cloud will be key to IIoT data management and
analytics. The IIoT is accelerating the growth of big data,
generating an unprecedented volume of data. Although
analytics are vital in extracting value from IIoT data, without
strong data management capabilities, high-quality analytics
are impossible.
Raw disk storage, like other electronic technologies,
follows Moore's Law. In just 20 years, the price of raw disk
storage has dropped from $950 per gigabyte in 1994 to
$0.03 per gigabyte in 2014 (Figure 2).
Although storage has become cheaper, salaries of IT
staff have kept pace with inflation. As a result, the cost of
storage is not tied to the technology, but rather to the people charged with supporting the technology. As demand for
storage and processing grows, the choice is to pay people
to procure, commission, and maintain the infrastructure, or
outsource this work to the cloud and deploy skilled IT staff

to more important tasks.


Another argument favoring the cloud is that projects can
be initiated sooner and completed faster. Companies will be
able to undertake more projects and accelerate the rate of
innovation. On-demand, cloud-based data warehouses are
available from many software and service providers. Users
pay for a monthly subscription, hourly usage, or a combination of data storage and query volume. Setup is faster than
the traditional approach of procuring and commissioning
hardware and software for an on-premise corporate data
warehouse.
Amazon Redshift provides a high-performance data
warehouse that is scalable on-demand (up to 1.6 petabytes).
It would take an organization months to procure, install, and
commission that amount of disk storage onsite, but setting
up the cloud-based service requires only 20 minutes. The
cloud service also offers flexibility. An organization can
rapidly set up a data warehouse, use it intensively for a short
period of time, and then discontinue it.
This ability to set up, scale, and tear down a data warehouse on-demand shatters conventional data warehouse
economics. A wide range of analytics projects that were
considered not viable before can now be executed for a few
hundred dollars.

NoSQL databases
RDBMSs have long been the dominant tool for organizing and managing enterprise data, and they continue to
dominate. However, an alternative class of databases, collectively known as nonrelational or NoSQL (e.g., MongoDB,
Cassandra, Redis, and HBase), have gained popularity
because they meet the changing demands of data management and the growth of very-large-scale applications.
RDBMSs are from a bygone era when a database server
typically ran in a single box that contained a central processing unit (CPU), memory, and a hard disk. Scalability was
restricted to scaling within the box. The processor could
be updated to run faster or the single-CPU card could be
upgraded to a dual-CPU card, but only within the confines
of the box. That is known as vertical scalability, and it places
fundamental limits on how much the data volume, throughput, and response rate can be scaled.
NoSQL databases are designed out-of-the-box to deliver
performance, scalability, and high availability across distributed computing environments. To gain these advantages
over RDBMS, however, many NoSQL databases take a
more relaxed approach to consistency. The nodes in the database eventually all have a consistent view of the data, which
is often referred to as eventual consistency. This approach
pushes more responsibility for data integrity and consistency
onto the application logic. Whereas an RDBMS usually
centralizes responsibility for data integrity in the database

engine, NoSQL engines shift some of that responsibility to


application programmers. This tradeoff is worth making for
some applications, but not for all.
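The following toy sketch illustrates eventual consistency with two in-memory replicas that exchange updates asynchronously and reconcile with a last-writer-wins rule. It mimics the behavior described above rather than any specific NoSQL product, and it shows why a read can briefly return stale data and why more responsibility shifts to the application.

import time

class Replica:
    """A toy key-value node that applies remote updates only when it syncs."""
    def __init__(self, name):
        self.name, self.data, self.inbox = name, {}, []

    def write(self, key, value, peers):
        self.data[key] = (value, time.time())
        for p in peers:
            p.inbox.append((key, value, time.time()))   # replication is asynchronous

    def sync(self):
        for key, value, ts in self.inbox:
            current = self.data.get(key)
            if current is None or ts >= current[1]:      # last-writer-wins reconciliation
                self.data[key] = (value, ts)
        self.inbox.clear()

a, b = Replica("A"), Replica("B")
a.write("reactor_state", "running", peers=[b])

print(b.data.get("reactor_state"))   # None: replica B has not synced yet (stale read)
b.sync()
print(b.data.get("reactor_state"))   # ('running', ...): the replicas eventually converge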

Recommendations
For most industrial enterprises, big data will manifest
through the IIoT and will require a different approach to
analytics. It will be difficult to extract maximum value from
high volumes of fast-moving data with traditional data architecture. I recommend the following actions as you embark
on your big data journey.
Start small and focus on a real business problem. Many
devices can be connected to the IIoT in a short period. However, it is best to only connect the things that are associated
with business problems. The fastest way to impede a big data
project is to expend resources with no measurable end result.
Pursue a project that promises quick and easy value.
Do not go out of your way to find a particularly nasty problem to solve. Your first big data project will be challenging,
so pick an easy project. Use data you already have, instead
of data that require new sensors. Find a project that requires
only one or two new technologies. The project should have a
relatively short time frame (months, not years).
Assemble a multidisciplinary team. Any successful IT
project requires a blend of technical expertise and domain-specific business or operational expertise. Treating an IIoT
project as purely a technical problem will yield technically
correct but useless insights and recommendations. The IT
team may need to learn and implement new technologies,
but will require operational and business insight to ensure
there is value in implementation.
Measure return on investment for future projects. Before
you start your first big data project, ensure that you have a
process in place to measure the return on investment. If you
cannot demonstrate tangible value for your first project, there
likely will not be a second project. Make sure you understand
the objectives as agreed upon with business leadership, and document progress toward those objectives.

Literature Cited
1. ARC Advisory Group, "What's Driving Industrial Investment in BI and Analytics?," www.arcweb.com/strategy-reports/Lists/Posts/Post.aspx?List=e497b184-6a3a-4283-97bfae7b2f7ef41f&ID=1665&Web=a157b1d0-c84d-440a-a7da9b99faeb14cc (Sept. 4, 2014).
2. Laney, D., "3D Data Management: Controlling Data Volume, Velocity, and Variety," META Group, Inc. (Feb. 6, 2001).
3. ARC Advisory Group, "SAP HANA: The Real-Time Database as Change Agent?," www.arcweb.com/strategy-reports/Lists/Posts/Post.aspx?ID=1656 (Aug. 22, 2014).
4. Komorowski, M., "A History of Storage Costs (update)," www.mkomo.com/cost-per-gigabyte-update (Mar. 9, 2014).


Special Section: Big Data Analytics

BIG DATA

Success Stories in the Process Industries
Salvador García Muñoz
Eli Lilly and Co.
John F. MacGregor
ProSensus, Inc.

Big data holds much potential for optimizing


and improving processes. See how it has
already been used in a range of industries,
from pharmaceuticals to pulp and paper.

Big data in the process industries has many of the characteristics represented by the four Vs: volume,
variety, veracity, and velocity. However, process
data can be distinguished from big data in other industries
by the complexity of the questions we are trying to answer
with process data. Not only do we want to find and interpret
patterns in the data and use them for predictive purposes, but
we also want to extract meaningful relationships that can be
used to improve and optimize a process.
Process data are also often characterized by the presence of large numbers of variables from different sources,
something that is generally much more difficult to handle
than just large numbers of observations. Because of the
multisource nature of process data, engineers conducting a
process investigation must work closely with the IT department that provides the necessary infrastructure to put these
data sets together in a contextually correct way.
This article presents several success stories from different industries where big data has been used to answer
complex questions. Because most of these studies involve
the use of latent variable (LV) methods such as principal
component analysis (PCA) (1) and projection to latent
structures (PLS) (2, 3), the article first provides a brief
overview of those methods and explains the reasons such
methods are particularly suitable for big data analysis.

Latent variable methods


Historical process data generally consist of measurements of many highly correlated variables (often hundreds
to thousands), but the true statistical rank of the process, i.e.,
the number of underlying significant dimensions in which the
process is actually moving, is often very small (about two to
ten). This situation arises because only a few dominant events
are driving the process under normal operations (e.g., raw
material variations, environmental effects). In addition, more
sophisticated online analyzers such as spectrometers and
imaging systems are being used to generate large numbers of
highly correlated measurements on each sample, which also
require lower-rank models.
Latent variable methods are uniquely suited for the
analysis and interpretation of such data because they are
based on the critical assumption that the data sets are of
low statistical rank. They provide low-dimension latent
variable models that capture the lower-rank spaces of
the process variable (X) and the response (Y) data without over-fitting the data. This low-dimensional space is
defined by a small number of statistically significant latent
variables (t1, t2, ...), which are linear combinations of the
measured variables. Such variables can be used to construct simple score and loading plots, which provide a way
to visualize and interpret the data.

The scores can be thought of as scaled weighted averages of the original variables, using the loadings as the
weights for calculating the weighted averages. A score plot
is a graph of the data in the latent variable space. The loadings are the coefficients that reveal the groups of original
variables that belong to the same latent variable, with one
loading vector (W*) for each latent variable. A loading
plot provides a graphical representation of the clustering of
variables, revealing the identified correlations among them.
The uniqueness of latent variable models is that they simultaneously model the low-dimensional X and Y spaces, whereas classical regression methods assume that there is independent variation in all X and Y variables (which is referred to as full rank). Latent variable models show the relationships between combinations of variables and changes in operating conditions, thereby allowing
us to gain insight and optimize processes based on such
historical data.
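A minimal sketch of these ideas on synthetic data (using scikit-learn's PCA; the data set and its dimensions are invented for illustration, not taken from an industrial process) shows how a few latent variables can summarize many correlated measurements, and what the scores and loadings referred to above look like numerically.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# 500 batches, 30 measured variables, but only 2 underlying driving events
# (e.g., a raw-material effect and an ambient-temperature effect).
t_true = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 30))
X = t_true @ mixing + 0.1 * rng.normal(size=(500, 30))

pca = PCA(n_components=5).fit(X)
print("variance explained per component:",
      np.round(pca.explained_variance_ratio_, 3))   # first two dominate: low statistical rank

scores = pca.transform(X)[:, :2]        # t1, t2 for each batch: the points on a score plot
loadings = pca.components_[:2].T        # one weight per measured variable: the loading plot
print("scores shape:", scores.shape, " loadings shape:", loadings.shape)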
The remainder of the article presents several industrial
applications of big data for:
the analysis and interpretation of historical data and
troubleshooting process problems
optimizing processes and product performance
monitoring and controlling processes
integrating data from multivariate online analyzers
and imaging sensors.

Learning from process data


A data set containing about 200,000 measurements was
collected from a batch process for drying an agrochemical
material, the final step in the manufacturing process. The unit is used to evaporate and collect the solvent contained in


the initial charge and to dry the product to a target residual
solvent level.
The objective was to determine the operating conditions
responsible for the overall low yields when off-specification
product is rejected. The problem is highly complex because
it requires the analysis of 11 initial raw material conditions,
10 time trajectories of process variables (trends in the evolution of process variables), and the impact of the process
variables on 11 physical properties of the final product.
The available data were arranged in three blocks:
the time trajectories measured through the batch,
which were characterized by milestone events (e.g., slope,
total time for stage of operation), comprised Block X
Block Z contained measurements of the chemistry of
the incoming materials
Block Y consisted of the 11 physical properties of the
final product.
A multiblock PLS (MBPLS) model was fitted to the three
data blocks. The results were used to construct score plots
(Figure 1), which show the batch-to-batch product quality
variation, and the companion loading plots (Figure 2), which
show the regressor variables (in X and Z) that were most
highly correlated with such variability.
Contrary to the initial hypothesis that the chemistry variables (Z) were responsible for the off-spec product, the analysis isolated the time-varying process variables as a plausible
cause for the product quality differences (Figure 1, red) (4).
This was determined by observing the direction in which the
product quality changes (arrow in Figure 1) and identifying
the variables that line up in this direction of change (Figure 2).
Variables z1-z11 line up in a direction that is close to perpendicular to the direction of quality change.
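A hedged, single-block sketch of this kind of analysis is shown below; it uses scikit-learn's PLSRegression on synthetic data in place of the multiblock PLS software used in the actual study, and it simply ranks the regressors by their weight on the first latent variable, which is the numerical counterpart of reading the loading plot.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)

# 80 batches: 10 trajectory features (X) and 3 chemistry variables (Z), stacked column-wise,
# and 4 final-product properties (Y). Only two trajectory features actually drive quality.
X_traj = rng.normal(size=(80, 10))
Z_chem = rng.normal(size=(80, 3))
Y = np.column_stack([2 * X_traj[:, 0] - X_traj[:, 3] + 0.1 * rng.normal(size=80)
                     for _ in range(4)])
XZ = np.hstack([X_traj, Z_chem])

pls = PLSRegression(n_components=2).fit(XZ, Y)

# Weights on the first latent variable: large entries mark the regressors most
# correlated with the quality variation (the role of W* in the loading plot).
w1 = pls.x_weights_[:, 0]
names = [f"traj{i+1}" for i in range(10)] + [f"z{i+1}" for i in range(3)]
for name, w in sorted(zip(names, w1), key=lambda p: -abs(p[1]))[:5]:
    print(f"{name:>6s}  weight {w:+.2f}")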
Figure 1. A score plot of two latent variables shows lots clustered by product quality. Source: (4).

Figure 2. A companion loading plot reveals the process parameters that were aligned with the direction of change in the score plot. Source: (4).

Optimizing process operations


The manufacture of formulated products (such as
pharmaceutical tablets) generates a complex data set that
extends beyond process conditions to also include information about the raw materials used in the manufacture of each
lot of final product, and the physical properties of the raw
materials. This case study can be represented by multiple
blocks of data: the final quality of the product of interest
(Y), the weighted average for the physical properties of the
raw materials used in each lot (RXI), and the process and
environmental conditions at which each lot was manufactured (Z). These blocks of data were used to build a MBPLS
model that was later embedded within a mixed-integer nonlinear programming (MINLP) optimization framework. The
physical properties of the lots of material available in inventory are represented by data block XA and the properties of
the lots of material used to manufacture the final product are
represented by data block X.
The objective for the optimization routine was to determine the materials available in inventory that should be
combined and the ratios (r) of those that should be blended
to obtain the best next lot of finished product. The square of
the difference between the predicted and the target quality of
the product was used to choose the lots and blending ratios.
The underlying calculations reduce the problem to the
score space, where the differences in quality in this case
tablet dissolution correspond to different locations on
the score plot (Figure 3). The MINLP optimization routine
identified the candidate materials available in inventory
that should be blended together to make the final product
so that the score for the next lot lands in the score space
corresponding to the desired quality (i.e., target dissolution). Implementing this optimization routine in real time


significantly improved the quality of the product produced
in this manufacturing process (Figure 4).
Selecting the materials from inventory to be used in
manufacturing a product is not as simple as choosing those
that will produce the best lot of product. If you choose
materials aiming to produce the best next lot, you will
inevitably consume the best materials very fast; this may
be acceptable for a low-volume product. For high-volume
products, however, using this same calculation will lead to
an undesired situation where the best materials have been
depleted and the less-desirable raw materials are left. In
this case, it is better to perform the optimization routine
for the best next campaign (a series of lots), which will
account for the fact that more than one acceptable lot of
product is being manufactured. The optimization calculation in this latter case will then balance the use of inventory
and enable better management of desirable vs. less-desirable raw materials for the entire campaign of manufactured product.
The MINLP objective function must be tailored to the
material management needs for the given product so that
it adequately considers operational constraints, such as
the maximum number of lots of the same material to
blend (5, 6).
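A heavily simplified sketch of the best-next-lot idea follows: a toy linear quality model stands in for the MBPLS model, and scipy's SLSQP optimizer chooses continuous blend ratios that drive the predicted quality toward target. The published approach additionally treats lot selection as integer decisions (the MINLP part) and handles richer constraints; all numbers and names here are illustrative.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

props = rng.uniform(0, 1, size=(6, 4))              # 6 lots in inventory x 4 measured material properties
quality_model = np.array([1.2, -0.8, 0.5, 0.3])     # toy surrogate for the MBPLS quality model
target_quality = 0.55

def predicted_quality(ratios):
    blended = ratios @ props                         # weighted-average properties of the blend
    return float(blended @ quality_model)

def objective(ratios):
    return (predicted_quality(ratios) - target_quality) ** 2

n_lots = props.shape[0]
x0 = np.full(n_lots, 1.0 / n_lots)
result = minimize(objective, x0, method="SLSQP",
                  bounds=[(0.0, 1.0)] * n_lots,
                  constraints=[{"type": "eq", "fun": lambda r: r.sum() - 1.0}])

print("blend ratios:", np.round(result.x, 2))
print("predicted quality:", round(predicted_quality(result.x), 3), "target:", target_quality)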

Figure 3. The dissolution speed of a pharmaceutical tablet is identified on a score plot of the latent variables. Source: (5).

Figure 4. A control chart of the degree of dissolution of a pharmaceutical tablet reveals the onset of quality problems. Quality problems are reduced by the implementation of a best-next-lot solution, then eliminated by the best-next-campaign approach. Source: (6).

Monitoring processes

Perhaps the most well-known application of principal component analysis in the chemical process industries (CPI) is its use as a monitoring tool, enabling true multivariate statistical process control (MSPC) (7, 8). In this example, a PCA model was used to describe the normal variability in the operation of a closed spray drying system in a pharmaceutical manufacturing process (9).

Figure 5. A closed-loop spray drying system in a pharmaceutical manufacturing facility is being monitored by the measurement of 16 variables that a PCA model projects into two principal components. Source: (9).

The system (Figure 5) includes measurements of 16 process variables, which can be projected by a PCA model into two principal
components (t1 and t2), each of which describes a different source of variability in the process. A score plot that
updates in real time can then be used as a graphical tool
to determine when the process is exhibiting abnormal
behavior. This is illustrated in Figure 6, where the red dots
indicate the current state of the process, which is clearly
outside of the normal operating conditions (gray markers).
It is important to emphasize that this model could be
used to effectively monitor product quality without the
need to add online sensors to measure product properties.
Building an effective monitoring system requires a good
data set that is representative of the normal operating conditions of the process.
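A minimal sketch of such multivariate monitoring, on synthetic data, is given below: PCA is fit to normal operating data, new observations are scored, and those whose Hotelling's T² statistic exceeds a limit are flagged, which is the numerical counterpart of a point falling outside the normal region of the score plot. The empirical control limit and the simulated upset are illustrative choices, not the specific method of Ref. 9.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Normal operating data: 16 correlated process variables driven by 2 latent sources.
t_noc = rng.normal(size=(1000, 2))
P = rng.normal(size=(2, 16))
X_noc = t_noc @ P + 0.1 * rng.normal(size=(1000, 16))

pca = PCA(n_components=2).fit(X_noc)
scores_noc = pca.transform(X_noc)
var = scores_noc.var(axis=0)

def hotelling_t2(x_new):
    t = pca.transform(x_new.reshape(1, -1))[0]
    return float(np.sum(t**2 / var))

limit = np.quantile([hotelling_t2(x) for x in X_noc], 0.99)   # empirical 99% limit

normal_sample = X_noc[0]
fault_sample = X_noc[1] + 10 * P[0]   # an abnormally large excursion along the main direction of variation
for label, x in [("normal", normal_sample), ("upset", fault_sample)]:
    t2 = hotelling_t2(x)
    print(f"{label:>6s}: T2 = {t2:7.1f}  {'ALARM' if t2 > limit else 'ok'}  (limit {limit:.1f})")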

Control of batch processes


Multivariate PLS models built from process data that
relate the initial conditions of the batch (Z), the time-varying
process trajectories (X), and the final quality attributes (Y)
(10) provide an effective way to control product quality and
productivity of batch processes. Those models can be used
online to collect evolving data of any new batch (first the
initial data in Z and then the evolving data in X), which are
then used to update the predictions of final product quality (Y) at every time interval during the batch process. At
certain critical decision points (usually each batch has one
or two), a multivariate optimization routine is run to identify
control actions that will drive the final quality into a desired
target region and maximize productivity while respecting all
operating constraints (11-13).
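The decision-point calculation can be sketched in a toy form: a simple linear model (standing in for the PLS model of Refs. 11-13) predicts final quality from the initial conditions, the trajectory so far, and a candidate adjustment, and a bounded optimizer picks the adjustment that minimizes the predicted deviation from target. All coefficients, bounds, and variable names below are invented for illustration.

import numpy as np
from scipy.optimize import minimize_scalar

# Toy quality model identified from historical batches:
# y_final = b0 + bz.z0 + bx.x_sofar + bu*u, where u is the adjustable input
# (e.g., remaining reaction time or a feed adjustment at the decision point).
b0, bz, bx, bu = 0.2, np.array([0.5, -0.3]), np.array([0.8, 0.1, -0.4]), 1.5
y_target = 1.0

def predict_final_quality(z0, x_sofar, u):
    return b0 + bz @ z0 + bx @ x_sofar + bu * u

def midcourse_correction(z0, x_sofar, u_bounds=(-0.5, 0.5)):
    """Pick the control adjustment that drives predicted quality to target."""
    res = minimize_scalar(lambda u: (predict_final_quality(z0, x_sofar, u) - y_target) ** 2,
                          bounds=u_bounds, method="bounded")
    return res.x

z0 = np.array([0.4, 0.2])              # initial charge conditions for this batch
x_sofar = np.array([0.6, -0.1, 0.3])   # summary of the trajectory up to the decision point

u = midcourse_correction(z0, x_sofar)
print(f"recommended adjustment u = {u:+.3f}, "
      f"predicted quality {predict_final_quality(z0, x_sofar, u):.3f} (target {y_target})")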
Figure 6. A score plot of the two principal components describing the closed-loop spray drying system (Figure 5) shows that the process is operating under abnormal conditions. Source: (9).

Figure 7 displays one quality attribute of a high-value food product before and after this advanced process control method was implemented over many thousands of batches. The process control method reduced the root-mean-square deviation from the target for all final product quality attributes by 50-70% and increased batch productivity by 20%.

Figure 7. Advanced control eliminated the variation in the final product quality attribute of a food product. Source: (9).

Analyzing information from advanced analyzers and imaging sensors
The use of more-sophisticated online analyzers (e.g.,
online spectrometers) and image-based sensors for online
process monitoring is becoming more prevalent in the
CPI. With that comes the need for more powerful methods to handle and extract information from the large and
diverse data blocks acquired from such sophisticated
online monitors. Latent variable methods provide an effective approach (14).
Consider a soft sensor (i.e., virtual sensor software that
processes several measurements together) application for
predicting the quality of product exiting a lime kiln at a pulp

and paper mill. Real-time measurements on many process


variables were combined with images from a color camera
capturing the combustion region of the kiln. The information
extracted from the combustion zone images and data from
the process data blocks were combined using the online
multivariate model to assess combustion stability and make
2-hr-ahead predictions of the exit lime quality.
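A hedged sketch of this soft-sensor pattern is shown below: each camera frame is reduced to a few summary features, concatenated with the process measurements, and fed to a PLS model that predicts the lab quality ahead of time. The synthetic arrays and feature choices are placeholders for the kiln data and image analysis of the cited work.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)

def image_features(frame):
    """Collapse a combustion-zone frame (2D intensity array) to a few summary features."""
    return np.array([frame.mean(), frame.std(),
                     frame[frame > frame.mean()].mean()])   # brightness of the flame region

# Historical training set: 300 time points of camera frames plus 8 process variables,
# paired with the lab-measured lime quality obtained 2 hr later.
frames = rng.gamma(2.0, 1.0, size=(300, 32, 32))
process = rng.normal(size=(300, 8))
X = np.hstack([np.array([image_features(f) for f in frames]), process])
y = 0.7 * X[:, 0] - 0.4 * X[:, 3] + 0.1 * rng.normal(size=300)   # synthetic quality signal

soft_sensor = PLSRegression(n_components=3).fit(X, y)

# Online use: build the same feature vector from the latest frame and readings.
x_now = np.hstack([image_features(rng.gamma(2.0, 1.0, size=(32, 32))),
                   rng.normal(size=8)])
print("predicted exit quality (2 hr ahead):",
      round(float(soft_sensor.predict(x_now.reshape(1, -1))[0, 0]), 3))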

Concluding remarks
Contextually correct historical data is a critical asset
that a corporation can take advantage of to expedite assertive decisions (3). A potential pitfall in the analysis of big
data is assuming that the data will contain information
just because there is an abundance of data. Data contain
information if they are organized in a contextually correct manner; the practitioner should not underestimate the
effort and investment necessary to organize data such that
information can be extracted from them.
Multivariate latent variable methods are effective tools
for extracting information from big data. These methods
reduce the size and complexity of the problem to simple
and manageable diagnostics and plots that are accessible to
all consumers of the information, from the process designers and line engineers to the operations personnel.

Literature Cited
1. Jackson, E., A User's Guide to Principal Components, 1st ed., John Wiley and Sons, Hoboken, NJ (1991).
2. Höskuldsson, A., "PLS Regression Methods," Journal of Chemometrics, 2 (3), pp. 211-228 (June 1988).
3. Wold, S., et al., "PLS: Partial Least-Squares Projection to Latent Structures," in Kubinyi, H., ed., 3D-QSAR in Drug Design, ESCOM Science Publishers, Leiden, The Netherlands, pp. 523-550 (1993).
4. García Muñoz, S., et al., "Troubleshooting of an Industrial Batch Process Using Multivariate Methods," Industrial and Engineering Chemistry Research, 42 (15), pp. 3592-3601 (2003).
5. García Muñoz, S., and J. A. Mercado, "Optimal Selection of Raw Materials for Pharmaceutical Drug Product Design and Manufacture Using Mixed Integer Non-Linear Programming and Multivariate Latent Variable Regression Models," Industrial and Engineering Chemistry Research, 52 (17), pp. 5934-5942 (2013).
6. García Muñoz, S., et al., "A Computer Aided Optimal Inventory Selection System for Continuous Quality Improvement in Drug Product Manufacture," Computers and Chemical Engineering, 60, pp. 396-402 (Jan. 10, 2014).
7. MacGregor, J. F., and T. Kourti, "Statistical Process Control of Multivariable Processes," Control Engineering Practice, 3 (3), pp. 403-414 (1995).
8. Kourti, T., and J. F. MacGregor, "Recent Developments in Multivariate SPC Methods for Monitoring and Diagnosing Process and Product Performance," Journal of Quality Technology, 28 (4), pp. 409-428 (1996).
9. García Muñoz, S., and D. Settell, "Application of Multivariate Latent Variable Modeling to Pilot-Scale Spray Drying Monitoring and Fault Detection: Monitoring with Fundamental Knowledge," Computers and Chemical Engineering, 33 (12), pp. 2106-2110 (2009).
10. Kourti, T., et al., "Analysis, Monitoring and Fault Diagnosis of Batch Processes Using Multiblock and Multiway PLS," Journal of Process Control, 5, pp. 277-284 (1995).
11. Yabuki, Y., and J. F. MacGregor, "Product Quality Control in Semibatch Reactors Using Midcourse Correction Policies," Industrial and Engineering Chemistry Research, 36, pp. 1268-1275 (1997).
12. Yabuki, Y., et al., "An Industrial Experience with Product Quality Control in Semi-Batch Processes," Computers and Chemical Engineering, 24, pp. 585-590 (2000).
13. Flores-Cerrillo, J., and J. F. MacGregor, "Within-Batch and Batch-to-Batch Inferential Adaptive Control of Semi-Batch Reactors," Industrial and Engineering Chemistry Research, 42, pp. 3334-3345 (2003).
14. Yu, H., et al., "Digital Imaging for Online Monitoring and Control of Industrial Snack Food Processes," Industrial and Engineering Chemistry Research, 42 (13), pp. 3036-3044 (2003).

BIG DATA

Getting Started
on the Journey
Lloyd F. Colegrove
Mary Beth Seasholtz
Chaitanya Khare
The Dow Chemical Co.

This article discusses some experiences and challenges in establishing an enterprise manufacturing intelligence (EMI) platform at a major chemical manufacturing company, and recommends steps you can take to convince your management to harness big data.

It's often said that a journey of a thousand miles starts with one step. This holds true for any effort to harness the power of the data in plant historians, laboratory information management systems (LIMS), and online analyzers.
The previous two articles in this special section discuss what big data is and why engineers should care about
data. This article instructs you on how to tackle the vexing
problem of convincing your organization to start the journey
into big data. It discusses how we helped the Dow Chemical
Co. move from merely collecting data to actively harnessing
the data. We offer our experiences, reveal some of the challenges we faced, and explain how we overcame those obstacles.

The what and the why


In many sectors, big data means using large sets of unstructured data to predict buying patterns or market trends. These data can also be used to define triggers, also called the "what," that alert companies to engage in an activity. The "what" refers to an event that signals an action is required. For example, data analysis indicates Internet shoppers are interested in a particular product. This trend might prompt you to meet with suppliers, restock, or increase production. The "what" allows companies to react appropriately in their supply chain and market context. Because of the fickleness of trends in business, the "what" that matters today might not matter tomorrow. In this example, you are not necessarily interested in why a particular product's sales pattern is behaving this way; you just want to respond to the market trend. In this big data journey, the "why" is often irrelevant or ignored.


The chemical industry cannot afford to ignore the "why" that accompanies the "what." The "what" might be the plant trending outside of normal operating conditions; the "why" is the reason for that trend. The "what" might be that product quality does not meet customer requirements; the "why" is the reason for that as determined from the data.
In the chemical industry, engineers typically should not respond until they know the "why." Chemistry and chemical engineering principles do not trend up or down based on consumer whim or stock market variation. Ignoring the "why" can lead to operational (i.e., reliability and productivity) and safety peril. So, given our focus on both the "what" and the "why," how do the ideas of big data translate to the chemical manufacturing industry?
At a chemical plant that has taken advantage of big data,
the right people will instantly receive alerts to let them know
when continuous processes are trending outside of normal
or batch processes are not progressing properly. These alerts
will be accompanied by tools for making improvements: that is, a method for efficient diagnosis of the problem, a
collaborative environment for discussion about what to do,
and a way to store the experience to learn from it.
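As a toy illustration of the alerting idea only (not Dow's EMI platform), the sketch below flags a new measurement of a single monitored variable when it falls outside limits derived from recent normal operation; a real platform applies this kind of logic continuously, across many variables and data sources, and routes the alert to the right people.

```python
# Toy sketch of control-limit alerting on one monitored variable; a real EMI
# platform applies this kind of check continuously across many data sources.
import statistics

def alert_if_abnormal(history, new_value, k=3.0):
    """Return an alert string if new_value falls outside mean +/- k*sigma."""
    mean = statistics.fmean(history)
    sigma = statistics.stdev(history)
    lower, upper = mean - k * sigma, mean + k * sigma
    if not (lower <= new_value <= upper):
        return (f"ALERT: value {new_value:.2f} outside "
                f"[{lower:.2f}, {upper:.2f}] -- investigate the 'why'")
    return None

history = [100.2, 99.8, 100.5, 100.1, 99.9, 100.3, 100.0, 99.7]
print(alert_if_abnormal(history, 100.2))   # None: within normal variation
print(alert_if_abnormal(history, 103.5))   # triggers an alert
```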
Typically, both the alerting system and the tools for making improvements are built into a single platform, referred
to as enterprise manufacturing intelligence (EMI). EMI can
mean different things to different people; in this instance, we
mean a platform that encompasses the automated sampling
of data from data sources, as well as collation, affinitization,
and analysis of those data, all in real time. We are not adding new measures or systems to generate more data. We are using the data we already collect and store to achieve ever-increasing value.
In the program that we established at Dow, we aimed to
achieve total data domination. This entailed mastering all
digital and alphanumeric data that related to a given operation, and correlating and displaying the data to be meaningful for every user.

Where to find big data opportunities


To determine where better data tools are needed, listen
to the way that data are discussed on a daily basis. For
example, at the beginning of our process, we often heard, "I don't trust the lab data." In attempting to ensure accuracy in the analytical measurement and raise the trust within our operations, we discovered two things:
• The analytical measurement was typically performing its duties as prescribed and expected.
• The recipients of the data (e.g., process operations, plant engineering, and improvement and quality release functions, among others) were unable to use the data properly.
The recipients did not understand the effect of natural
variation inherent in the manufacturing process, the analytical method, the sampling technique, etc. This lack of understanding was costing a lot of money, and frustrating plant
custodians and customers alike. Data can bring great value
if understood within the proper context, but the context was
almost always missing.
Look for operations in continual crisis. Many big data
opportunities are identified during post-mortem root-cause
investigations. Other opportunities stem from known crisis
situations, where the plant works feverishly to solve a recognized problem but the issue continues to escalate. Failing to
identify a root cause may result in reduced production rates
or even unexpected downtime.
In both of these cases, it's easy to demonstrate to a desperate audience that the issue could have been avoided if the data had been evaluated in a timely manner. After a process disruption, it's clear that use of data analytics can help the
plant avoid a shutdown, reduce production delays, and/or
deliver better value and service to its customers.
After the storm has passed. Once the root causes of a
crisis are understood, plant personnel may think they have
the situation under control. But, as time passes and engineering personnel change, commitment to maintaining the
necessary changes can wane. This is especially true if the
recommendations coming out of the root-cause investigations were contrary to local lore. If a plant is struggling with
consistently employing the necessary operating discipline,
an EMI platform can be employed to drive the continued use
of the operating discipline.

What software to use


The term EMI platform generally refers to a system that can comprise one or more programs and approaches
that first aggregate data and then allow for contextual
analysis, visualization, and access to this information and
knowledge by a broad spectrum of users. The platform
supports problem identification (the what) and problem
resolution (the why).
Before we chose an EMI platform that was the right
fit for our needs, our early attempts to use data in a more-
efficient manner were often conducted on an electronic
spreadsheet (i.e., Excel). However, this method of recording the data only tracked and conveyed recent plant data
(i.e., only the last few plant problems). While spreadsheets
are acceptable for personal use, they are insufficient for
institutional deployment as an EMI platform. More and
bigger spreadsheets are not the way to deploy an enterprise
data analytics platform, nor can you necessarily count on
the appropriate application of numerical analysis in self-
programmed analytics.
Once you have made the commitment to an enterprise
approach, you need to select a software package. Take a few
for a test drive. We sought out many vendors, chose three
packages to test, and then went back to each vendor to see
if they could withstand our critique and give us the technology we could use. Only one vendor at that time was willing
to work with us, and they became our first analytics partner, a partnership that has grown over the past 13 years.
During the evaluation process, you must consider
whether the tool will fit in your plant's workflow. Beware of the hammer-and-nail problem as applied to software: i.e., once you have a hammer, everything starts to look like a nail.
Carefully consider whether you want a tailored solution or a general solution. A tailored solution will exactly
meet your workflow needs, but tailored solutions are
harder to upgrade and maintain and thus are often more
resource-intensive. A general solution is easy to upgrade
and maintain and requires fewer resources to support (an
important benefit especially in a large organization); however, a general solution might not exactly fit your needs.
Because it was less resource-intensive, we chose the general solution, with one caveat. We worked with the vendor to improve their product, and along with the vendor's other clients, we slowly molded the original product to be generally favorable to our workflows. At times, the original designers' intentions for their tools did not match our usage criteria, and we were able to communicate this to our partner to resolve the gap. This makes codevelopment a win-win for everyone. The platform itself is not the competitive advantage; the way we implement it becomes our competitive advantage.

The first steps of the big data journey


Once you have found an analytics opportunity and
decided on a data analytics software package, the truly hard
work starts: convincing your organization to move forward, and then actually taking those first steps (Figure 1).
Form a team. First, gain the cooperation of a variety of
plant engineers, lab and/or quality control personnel, perhaps
a statistician or chemometrician, an IT infrastructure expert,
and a dedicated coordinator.
You can choose to rely, somewhat, on the software vendor and consulting firms if the full complement of expertise
is not available at your company. An outside consultant can
serve as an excellent check on your organization's understanding of statistics if you do not have your own internal
experts.
If yours is a small company with very few operations,
having such a robust team may not be necessary. However,
larger companies need to dedicate resources starting from
the beginning of this journey, because a lack of IT infrastructure and support can derail the best of intentions.
It may be best to bring the plant quality and process
control experts onboard first, because improved data analysis
will only make their jobs easier. But anyone you come
across who is committed to using data in a more transformative way is a good potential partner.
The goal of bringing together a diverse team is to successfully deploy the initial applications, while at the same
time striving toward a complete EMI platform with supporting documentation, training, support, IT infrastructure, etc.
Such a team is helpful for a sustained engagement along the
big data journey.
Start with a pilot program. Next, set up a pilot program
within the lab or plant. This environment should contain one
or more data sources that are already collecting data. The
pilot program involves collating the data within the prospective EMI platform so that the team can experiment and
explore how improved analytics can better explain measurement capability and variation. Once the connection between
the data and some plant events is well understood, the team
can then learn how monitoring the data can prevent problems. This activity marries context to data.
Your starting point may depend on how knowledgeable
your company, or clients, are. Across our vast company, we
noticed that many users (regardless of role) needed training
in basic statistics because they did not have sufficient understanding of the science of numerical variation. You cannot
create context for the data until the science of variation
(statistics) is understood.
The lightbulb goes on. During the development and early
use of the EMI tool, look for success stories that point to
subtle shifts in the culture. A well-timed phone call (e.g., "I saw that impurity X is increasing; let's talk about what to do to reduce it"), additional samples taken to more fully characterize a feed stream, an interruption avoided because the impurity never exceeded the specification, or an operational improvement identified are all signs of the desired culture change.

Figure 1. The first steps of the big data journey: bring together a diverse team of plant engineers, quality control personnel, IT experts, etc.; set up a pilot program within the lab or plant; listen for success stories, which point to shifts in culture; and expect some resistance, but be prepared to counter it. These first steps typically garner some pushback. Place a metaphorical baited hook in front of skeptics by demonstrating the power of improved analytics.
We often have to show our clients how their results are
changing because of our tools and approaches. The success
stories are told slightly differently, and depend on what part
of the organization you are working with.
After the small success stories are recognized for what
they mean, it is time to expand the pilot within the business
or plant. This is where the EMI platform will get its first test
with deployments beyond the original pilot. Engineers will
see familiar tools appearing all around the business, which
will generate discussion.
The pushback. As we expanded beyond our initial pilot
project, we tried to convince others across the company
that there is value in data. In any software introduction
there is likely to be resistance. "It is too expensive," "It's not my way," and "Why do I need that?" are just three of the responses you may hear. If you have already attempted
to change the analytics culture in your company, you have
heard these and more. To address the skepticism, we tried a
new approach that can be compared to fishing.
What exactly do fish have to do with analytics, you ask.
If we place a hook and bait in front of a potential user, we
may get them to bite. We tempt them by telling success
stories from other areas of the company that demonstrate the
power of improved analytics.
We guide, we explain, we mentor, we hand-hold, all to
provide support as they take the first tentative steps toward
better data usage. If we are successful in this fishing expedition, the mere mention that we might remove the tool will
elicit protests.

Moving beyond initial success


At this point, our small team had several successes and
the acceptance of data analytics started to spread across the
company. We even developed our own training modules for
new adopters, which included a nearly math-less treatise on
how to use statistics and understand variation. These courses
proved popular with a workforce starved for a better understanding of how to explore their data but unable or unwilling
to go back to school for a degree in statistics.
With these successes, our diverse team began to establish an EMI solution and expand our analytics approach.
Remember that we are not talking about adding new
measures or systems to generate more data. We are simply
using the data we already collect and store to achieve ever-
increasing value.
As we incrementally expanded the purpose and function of the EMI, we began to recognize discrete uses for our
approach based on the kinds of decisions that need to be
made (Figure 2):
• transactional (control room)
• tactical (engineering and daily to weekly control)
• operational (across multiple plants and/or unit operations)
• strategic (across an entire business).
The data analytics needs of each of these levels are very different. However, the platform and the underlying IT infrastructure (e.g., collating of data, contextual analysis, visualization, and propagation across the enterprise) are not unique; they are the same for all levels of operation.
If you are thinking about implementing an EMI platform,
keep in mind this advice:
Implementation must have flexibility. No matter how you envision an implementation, be aware of the quirks of a global manufacturing infrastructure and the cultures represented both locally and across the enterprise. Potential users can cite many reasons to deflect the opportunity you bring, so you need to be more flexible to overcome these fears and misgivings.

Figure 2. The data collected and analyzed in the EMI platform can be used to make four types of decisions, each with a different timescale and scope: strategic (large-scale capital decisions across the business; annually), operational (decisions at the plant level, optimizing across multiple unit operations; quarterly), tactical (course corrections; daily to monthly), and transactional (change one variable at a time; hourly). Transactional decisions are the most frequent but smallest in scope, whereas strategic decisions are the least frequent but largest in scope.
We have devised strategies with the sole purpose of
getting the analytics approaches in front of users, regardless
of their misgivings. The personnel who both lead and deliver
the big data analytics tools must have as much flexibility as
the platform itself. It is imperative for long-term success that
the users become comfortable with their data and trust the
message the data are delivering.
Perfection is a myth. There is no such thing as the perfect
tool or perfect approach for harnessing big data. For example, we adopted an attitude that if we gain around 80% of
what we seek, we should move on in the process.
Forgive your data. Accept the data as they come to you.
Do not get into the argument of prescribing how data should
be structured. As you develop a system that harnesses big
data, keep in mind that it is not your job to change, modify,
or replace existing databases.
It should not matter what type of system your data are
stored in, nor how antiquated the data storage is; a tool
must be able to access all sources of data in some manner
without requiring intermediate programming to make the
connection to the data.
Look for all data that are available, even data disconnected from the lab or plant historians, such as readings from remote online analyzers.
If you have bad data (that is, you think your measurement systems lack capability, or that you are not analyzing the right place at the right time), you may think that your big data journey is over before it ever begins. This is
not true! It will become readily apparent in the initial analysis if your data are poor, which will reveal the first, most
significant, improvement opportunity to the operation. It can
also point to the measurements that need to be improved in
capability (precision or accuracy) or perhaps frequency, etc.

Engage your audience


Analytics must be tailored to the process and personnel
that they are aimed at.
If you applied an analytics solution to an operating
environment without an understanding of that environment,
the analytics solution would be doomed to marginalization
or outright failure. It would be destined for marginalization,
because ultimately the tools and approaches would become
the purview of a select few. It could be an outright failure
if the tool never rises above personal-use status, such as an
electronic spreadsheet or personal calculator.
In order for true progress to be made, the team members
and staff have to be open to viewing and interacting with
their data in a way they are not accustomed to, and the analytics team needs to be willing to assemble the analytics in a
way that the users can relate to.
Early in the design of our analytics approach, we committed to aligning our tools and the underlying approach to
the workflow that our first internal customers desired. We
did not sub-optimize the tool by tailoring the programming;
rather, we ensured that the tool was used in a manner that
was familiar to and digestible by a given operational team.
Figure 3 illustrates the higher-order process that was developed in collaboration with the operations team.
Our team sought a tool that alerted us to problems and
opportunities and triggered discussion among operations and
subject matter experts, hence providing an environment that
embraces both the what and the why. The tool would also
prompt users by displaying internally vetted and documented (i.e., company-proprietary) knowledge and wisdom,
which could then be harnessed to develop appropriate action
plans. Finally, we wanted to develop a way for users to
encode new knowledge and wisdom back into the tool so
that it may be used again at an appropriate time; this step is ongoing and has not yet come to fruition.

Closing thoughts
Our big data approach seeks to actively turn data into
information (through the application of statistics), translate that information into knowledge (via contextualization
of data), and then convert that knowledge into wisdom
(e.g., maintaining optimal operation 24/7), while avoiding
any surprises.

Figure 3. When designing an analytics approach to harness big data, you can use data to trigger conversations that enable timely actions to avoid potential plant problems. Data, calculations, predictive models, and big data feed a real-time tracking and notification dashboard; an alert triggers discussion between technical and onsite staff, who consult existing knowledge and agree on actions; the plant makes changes; and the learning is integrated back into the enterprise.

The translation from knowledge to wisdom will take longer than the translation from information to knowledge.
We have implemented a cutting-edge approach to gain
information from our data, and we have built an enterprise
approach that can be applied at all levels (transactional to
strategic). The next stop on our big data journey is to bring
vetted information to bear automatically in real time, when
the user needs this pre-contextualized information. It will
take time to build, connect, and systematize a process for
incorporating new information, past information, and even
valuable (but nonvalidated) local operational lore, but the
payoff will be worth it.
We have made great strides away from a culture of
just collecting data but not using them to their fullest potential, to a culture of realizing that potential. In the future, we
hope to completely automate the process of analyzing the
data, so that the system can spot problems even before control limits and rules are violated. When that occurs, proper
control of the plant will become second nature to process
engineers and operators. New staff will have information
and tools immediately available to them, which will shorten
learning induction times and lead to safer, more reliable
plant operations and practices.
We promised at the beginning of this article to tell you
how to convince your organization to tackle data analytics.
The best advice we can give is to move slowly and deliberately. Start small, look for small data opportunities within
your plants and operations, and build momentum with success stories. Starting the process on a scale that is too grand
will only elicit blank stares and possibly overwhelm your
staff and operational team. At Dow, our big data journey
started within the lab environment, and steadily grew, until
the advent of the first EMI platform that met our needs.
In our quest for total data domination, we are already
improving reliability, reducing costs, increasing value, and
providing safer operations. These efforts will provide our
customers with a level of product consistency and reliability that will become the new standard in the CPI.
CEP

Acknowledgments
The authors wish to acknowledge the editorial contributions of Jim
Petrusich, Vice President, Northwest Analytics, Inc.

Additional Resources


Colegrove, L., "Data Initiative Improves Insights," Chemical Processing, www.chemicalprocessing.com/articles/2015/data-initiative-improves-insights/ (Mar. 12, 2015).
Neil, S., "Big Data Dilemma: Finding the Hidden Value," Automation World, www.automationworld.com/industrial-internet-things/big-data-dilemma-finding-hidden-value/ (June 28, 2015).


BIG DATA

Challenges and Future Research Directions
Marco S. Reis
Univ. of Coimbra, Portugal
Richard D. Braatz
Massachusetts Institute of
Technology

The big data movement is creating opportunities for the chemical process industries to improve their operations. Challenges, however, lie ahead.

Leo H. Chiang
The Dow Chemical Co.

The big data movement is gaining momentum, with companies increasingly receptive to engaging in big data projects. Their expectations are that, with massive data and distributed computing, they will be able to answer all of their questions, from questions related to plant operations to those on market demand. With answers
in hand, companies hope to pave new and innovative paths
toward process improvements and economic growth.
An article in Wired magazine, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" (1), describes a new era in which abundant data and mathematics will replace theory. Massive data is making the hypothesize-model-test approach to science obsolete, the article states.
In the past, scientists had to rely on sample testing and
statistical analysis to understand a process. Today, computer
scientists have access to the entire population and therefore
do not need statistical tools or theoretical models. Why is
theory needed if the entire real thing is now within reach?
Although big data is at the center of many success
stories, unexpected failures can occur when a blind
trust is placed in the sheer amount of data available, highlighting the importance of theory and fundamental
understanding.
A classic example of such failures is actually quite dated.
In 1936, renowned magazine Literary Digest conducted an
extensive survey before the presidential election between
Franklin D. Roosevelt and Alfred Landon, who was then
governor of Kansas. The magazine sent out 10 million postcards, considered a massive amount of data at that time, to gain insight into the voting tendencies of the populace. The Digest collected data from 2.4 million voters, and after triple-checking and verifying the data, forecast a Landon victory over Roosevelt by a margin of 57% to 43%. The final result, however, was a landslide victory by Roosevelt of 61% versus Landon's 37% (the remaining votes were
for a third candidate). Based on a much smaller sample of
approximately 3,000 interviews, George Gallup correctly
predicted a clear victory for Roosevelt.
Literary Digest learned the hard way that, when it
comes to data, size is not the only thing that matters.
Statistical theory shows that sample size affects sampling error, and that error was indeed much lower in the Digest poll. But sample bias must also be considered, and this is especially critical in election polls. (The Digest sample was taken from lists of automobile registrations and telephone directories, creating a strong selection bias toward middle- and upper-class voters.)
Another example that demonstrates the danger of
putting excessive confidence in the analysis of big data
sets regards the mathematical models for predicting loan
defaults developed by Lehman Brothers. Based on a very
large database of historical data on past defaults, Lehman
Brothers developed, and tested for several years, models
for forecasting the probability of companies defaulting on
their loans. Yet those models, built over such an extensive database, were not able to predict the largest bankruptcy in history: Lehman Brothers' own.
These cases illustrate two common flaws that undermine big data analysis:
• the sample, no matter how big, may not accurately reflect the actual target population or process
• the population/process evolves in time (i.e., it is nonstationary), and data collected over the years may not accurately reflect the current situation to which analytics are applied.
These two cases and other well-known blunders show that
domain knowledge is, of course, needed to handle real problems even when massive data are available. Industrial big data
can benefit from past experiences, but challenges lie ahead.
Like any new, promising field, big data must be viewed
in terms of its capabilities as well as its limitations. Some of
these limitations are merely challenges that can be addressed, enabling companies to make the most out of new opportunities created by data, technology, and analytics (Figure 1).
This article outlines ten critical challenges regarding big
data in industrial contexts that need to be addressed, and
suggests some emerging research paths related to them. The
challenges are discussed in terms of the four Vs that define the
context of big data: volume, variety, veracity, and velocity.
Volume challenges
Big data is, first of all, about handling massive amounts of data.
However, in industrial processes,
the first thing to realize is that not
all data are created equal. Several
challenges arise from this point.
Meaningful data. Most industrial big data projects rely
on happenstance data, i.e., data passively collected from
processes operating under normal operating conditions most
of the time. Thus, a large amount of data is indeed available,
but those data span a relatively narrow range of operating
conditions encountered during regular production situations.
Data sets collected under those circumstances may be
suitable for process monitoring and fault detection activities (2), which rely on a good description of the normal
operating conditions (NOC) as a reference to detect any
assignable or significant deviation from such behavior.
However, their value is limited for predictive activities, and
even more so for control and optimization tasks. Prediction
can only be carried out under the same conditions found in
the data used to construct the models. As a corollary, only
when all the NOC correlations linking the input variables
are respected can the model be used for prediction.
For process control and optimization activities, the
process description must capture the actual influence of each manipulated input variable on the process outputs. Its construction requires experimentation, i.e., the active
collection of process data via a design of experiments
(DOE) program for process optimization or via system
identification (SI) experiments for process control.
Future research is needed to determine ways to use
DOE in the context of big data to complement the information already available and increase the datas value for
predictive, control, and optimization activities. This will
likely require methods to selectively remove data with
very little informative value. The presence of such data is not only unnecessary for developing models, but also detrimental, as it induces a bias in the models
toward highly sampled regions of the operational space.
The modern theory of optimal DOE may provide a suitable
framework to begin addressing this challenge.
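As one illustration of how optimal DOE thinking could help select the most informative data or candidate experiments, the sketch below greedily picks, from an assumed pool of candidate operating points, the subset that maximizes the determinant of the information matrix (a D-optimality criterion), rather than accumulating more data from the already highly sampled region.

```python
# Generic sketch of greedy D-optimal selection: pick candidate operating points
# that add the most information, i.e., maximize det(X'X), instead of collecting
# more data from the already well-sampled region. Candidates here are synthetic.
import numpy as np

rng = np.random.default_rng(2)
candidates = rng.uniform(-1.0, 1.0, size=(300, 4))   # candidate settings of 4 inputs

def greedy_d_optimal(X, n_select):
    chosen = []
    ridge = 1e-9 * np.eye(X.shape[1])   # keeps the matrix invertible early on
    for _ in range(n_select):
        best_i, best_logdet = None, -np.inf
        for i in range(len(X)):
            if i in chosen:
                continue
            trial = X[chosen + [i]]
            _, logdet = np.linalg.slogdet(trial.T @ trial + ridge)
            if logdet > best_logdet:
                best_i, best_logdet = i, logdet
        chosen.append(best_i)
    return chosen

print("selected candidate rows:", greedy_d_optimal(candidates, n_select=10))
```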
Information-poor data sets. Despite the sheer volume
of industrial data, the relevant or interesting information
may happen on only a few, dispersed occasions. Examples
include batches with abnormally excellent quality or runs
that experience several types of process upsets.
Current data mining and knowledge discovery tools
(3, 4) can handle very large volumes of data that are rich
in information. Such tools include methodologies such as
partial least-squares regression, least-absolute-shrinkage
and selection operator (LASSO) regression, and ensemble
methods (e.g., random forests and gradient boosting),
among others. However, by design, those methods are not
suited to analyze information-poor data sets, in which the
interesting information is rare and scattered. And, traditional data visualization tools, which are recommended for any data analysis activity, especially to identify potentially interesting outlying data points, may not always be effective when applied to big data. For example, creating a classical plot from big data might produce what looks like a black cloud of data points that is not useful.

Figure 1. The big data movement stems from the availability of data, high-power computer technology, and analytics to handle data characterized by the four Vs: volume, variety, veracity, and velocity.
An engineer who is not able to rely on visualization
might be tempted to perform some sort of massive statistical testing to pinpoint abnormal situations or to extract
potentially interesting correlations, only to find a very large
number of such situations (or correlations). That is a consequence of the extreme power of the tests, induced by the
massive number of observations used. The significant events
detected may not (and most of the time will not) have any
practical relevance because of their small impact.
The situation can be even worse when an engineer
cleans the data using an algorithm that automatically
removes outlying observations from data sets prior to
analysis. Such algorithms often incorporate standard rules
of an empirical nature that eliminate the data embedded
with the rare gems of information.
Future research should focus on the development of
analytical methods applicable to information-poor data,
including visualization tools that can condense large
amounts of data while being sensitive to abnormal observations, and sound ways of detecting outlying (but interesting) observations (and variable associations), namely by
incorporating the available domain knowledge.
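One possible way to surface rare but potentially interesting observations, sketched below for illustration (the data, method choice, and threshold are assumptions), is to rank observations by an anomaly score, e.g., with an isolation forest, and route the most unusual ones to a domain expert for review instead of letting an automatic cleaning rule delete them.

```python
# Sketch: rank observations by how unusual they are so that rare-but-interesting
# points are reviewed by an expert rather than auto-deleted. Data are synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 15))      # bulk of routine operation
X[:5] += 6.0                         # a handful of rare, atypical runs

model = IsolationForest(random_state=0).fit(X)
scores = model.score_samples(X)      # lower score = more anomalous

# Flag the ten most unusual observations for expert review, not deletion
review_idx = np.argsort(scores)[:10]
print("rows to review:", sorted(review_idx.tolist()))
```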
Variety challenges
Big data is also characterized by
its complexity. The complexity
of industrial data can arise from
different sources, and is usually
related to the variety of objects
to be analyzed. Different challenges arise depending on the origin of the complexity.
Multiple data structures. In addition to the usual scalar
quantities (temperature, pressure, and flow measurements),
data collected in modern industrial settings also include
other data structures arranged as higher-order tensors, such
as one-way arrays (e.g., spectra, chromatograms, nuclear
magnetic resonance [NMR] spectra, particle-size distribution curves), two-way arrays (e.g., data obtained from
analytical techniques such as gas chromatography with
mass spectrometry [GC-MS] and high-performance liquid
chromatography with diode array detection [HPLC-DAD]),
and three-way and higher-order arrays (e.g., hyperspectral
images, color videos, hyphenated instruments). These data
structures are examples of profiles (5), abstractly defined as
any data array, indexed by time and/or space, that characterizes a product or process.
Future research should focus on developing analytical platforms that can effectively incorporate and fuse all
of these heterogeneous sources of information found in industrial processes, for instance, through the development
of more flexible multiblock methodologies. Such methodologies incorporate the natural block-wise structure of data,
where each block may carry information about distinct
aspects of the problem and present a characteristic structure and dimensionality.
Heterogeneous data. Variety does not originate only
from the presence of different data structures to be handled
simultaneously. Another source of variety is the presence
of data in the same data set that were collected when the
process underwent meaningful changes, including in its
structure (e.g., new equipment was added, procedures were
changed). By not taking such changes into account during
the analysis of the entire data set, you may fall into the trap
of mixing apples with oranges, an issue that also raises
concerns of data quality, which is discussed in the veracity
section of this article. Overlooking heterogeneity in time is
detrimental for analytical tasks such as process monitoring
and quality prediction.
A future research path to address this challenge is
developing methods to detect and handle these issues, as
well as to deal with the time-varying nature of processes,
namely through evolutionary and adaptive schemes (6).
Such schemes can adapt to complex and/or changing
conditions by continuously seeking the optimal operational
settings or by periodically retuning the models (through
re-estimation or recursive updating approaches).
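One common adaptive scheme is recursive updating with a forgetting factor, so that the reference description slowly tracks legitimate process change. The sketch below is a generic illustration (not a method from the cited references): it recursively updates the mean and variance of a single monitored variable with exponential forgetting.

```python
# Generic sketch of recursive (exponentially weighted) updating of a variable's
# mean and variance, so monitoring limits slowly track legitimate process change.
class AdaptiveStats:
    def __init__(self, init_mean, init_var, forgetting=0.99):
        self.mean = init_mean
        self.var = init_var
        self.lam = forgetting          # closer to 1.0 = slower adaptation

    def update(self, x):
        # Exponentially weighted recursive update
        delta = x - self.mean
        self.mean = self.lam * self.mean + (1.0 - self.lam) * x
        self.var = self.lam * self.var + (1.0 - self.lam) * delta ** 2
        return self.mean, self.var

stats = AdaptiveStats(init_mean=100.0, init_var=1.0)
for x in [100.2, 99.8, 101.0, 102.0, 103.0, 103.5]:   # process slowly drifting up
    mean, var = stats.update(x)
print(f"adapted mean = {mean:.2f}, adapted variance = {var:.2f}")
```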
Multiple data-management systems. Data are also collected from a variety of sources across the company's value
chain, from raw materials, plant operations, and quality
laboratories, to the commercial marketplace. Each stage
usually has its own data-management system, and each
records data in a different way.
Future efforts should be directed toward the development of integrated platforms that link all of the different
sources of data in the value chain. Market data, in particular, have not been included in conventional models
used in the chemical process industries (CPI). Data-driven
methods, which incorporate the time-delayed structure of the processes and use different types of data aggregation, should be developed to make this integration effective.
A priori knowledge. Some knowledge about the main
sources of variety affecting a massive data set is usually
available. However, making use of it in conventional industrial analytics is not straightforward. Big data methods tend to
be of a black box type, lacking the flexibility to incorporate
a priori knowledge about the processes under analysis.
Incorporating information about the structure of the processes in data-driven analysis is an important research path
for the future, especially in the fields of fault diagnosis and
predictive modeling (7–9). Fault diagnosis requires information about the causal structure of the systems, which conventional data-driven monitoring methods cannot provide.
Predictive modeling also requires this type of knowledge, in
particular for process control and optimization applications.
Bayesian approaches (10, 11) and data transformation based
on network inference, together with hybrid gray-box modeling frameworks, are potential ways to introduce a priori
knowledge into data-driven modeling.

Veracity challenges
A major concern in the analysis
of massive data sets has to do
with the quality of data, i.e.,
their veracity. As previously
mentioned, quantity does not
imply quality. On the contrary,
quantity creates more opportunities for problems to occur.
To make matters worse, the detection of bad observations in
massive data sets through visualization techniques is more
challenging and automatic-cleaning algorithms cannot be
relied on either. Data quality also depends on the way the
data are collected (bias issues may emerge that are very
difficult to detect), on whether the information is updated
or no longer makes sense (due to time-varying changes in
the system), and on the signal-to-noise ratio (measurement
uncertainty), among other factors.
Uncertainty data. In addition to the collected data, information associated with uncertainty is also available. Measurement uncertainty is defined as a parameter associated with the
result of a measurement that characterizes the dispersion of
the values that could reasonably be attributed to the quantity
to be measured (12). Combining uncertainty data with the raw
measurements can improve data analysis, empirical modeling,
and subsequent decision-making (13, 14).
Specification of measurement uncertainty in big data contexts and developing methods that take advantage of know
ledge about uncertainty should be explored in more depth.
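One straightforward way to let measurement uncertainty enter the analysis, sketched below on synthetic data (this is a generic illustration, not the specific methods of Refs. 13 and 14), is weighted least squares: each observation is weighted by the inverse of its measurement variance, so noisier measurements influence the fitted model less.

```python
# Generic sketch: weighted least squares, with weights 1/sigma_i^2 taken from the
# reported measurement uncertainty of each observation (synthetic data below).
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 40)
sigma = rng.uniform(0.1, 1.5, size=x.size)       # per-observation uncertainty
y = 2.0 * x + 1.0 + rng.normal(scale=sigma)      # noisier points scatter more

# Design matrix for a straight-line model y = a*x + b
A = np.column_stack([x, np.ones_like(x)])
W = np.diag(1.0 / sigma ** 2)                    # weight = inverse variance

# Solve the weighted normal equations (A^T W A) theta = A^T W y
theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
print("slope and intercept:", theta)
```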
Unstructured variability. Process improvement activities require a careful assessment of the multiple sources of variability of the process, which are typically modeled using suitable mathematical equations (ranging from first-principles models to purely data-driven approaches). The analysis should involve both the deterministic backbone of the process behavior and the unstructured aspects of the process arising from stochastic sources of variability, including disturbances, sample randomness, measurement noise, operator variation, and machine drifting. Jumping into the analysis of massive data sets while overlooking the main sources of unstructured variability is ill-advised, and is contrary to a reliable statistical engineering approach to addressing process improvement activities.
The sources of variability are actually the core of many improvement activities, in particular those aimed at reducing process variation and increasing product quality and consistency. Big data cannot replace the need to understand how data are acquired and the underlying mechanisms that generate variability, and statistical engineering principles should be brought to the analysis of big data sets in the future (15).

Velocity challenges

In big data scenarios, large quantities of data are collected at high speed. This creates several challenges in the implementation of online collection techniques and in defining the appropriate granularity to adopt for data analysis.
Data with a high time resolution. The high speed at
which data are collected in modern chemical plants produces
information with very fine time granularity, i.e., the data have,
by default, a high time resolution (on the order of minutes, or
even seconds). This default is a conditioning factor for all the
subsequent stages of data analysis, as the usual practice is to
avoid throwing out potentially valuable data. Consequently,
the analysis is prone to producing over-parameterized models.
It is important to select the most effective resolution
(16) for your particular data analysis. A default resolution
selected by a third party with no knowledge of your specific
data will probably not be appropriate.
Future research should develop sound ways for selecting the proper resolution, including the possibility of using
multiple time resolutions (17) that take into account the
variables' dynamic and noise features.
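As a simple illustration of working with multiple time resolutions, the sketch below aggregates an assumed 1-s measurement to coarser grids with pandas; choosing the resolution that matches the variable's dynamics and noise is the judgment the text calls for, and the intervals used here are arbitrary.

```python
# Sketch: aggregate a high-frequency measurement to coarser time resolutions;
# the choice of interval should reflect the variable's dynamics and noise.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
t = pd.date_range("2016-03-01", periods=24 * 60 * 60, freq="s")  # 1-s data for one day
signal = pd.Series(np.sin(np.arange(t.size) / 7200.0) + 0.3 * rng.normal(size=t.size),
                   index=t, name="temperature")

one_minute = signal.resample("1min").mean()     # smooth out second-level noise
one_hour = signal.resample("1h").mean()         # capture only slow dynamics
print(len(signal), len(one_minute), len(one_hour))  # 86400, 1440, 24
```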
Adaptive fault detection and diagnosis. The high speed
of data collection provides the potential for fast detection and diagnosis of faults, failures, and other abnormal
conditions. Many effective methods for fault detection and
identification of associated variables are available, including techniques that account for dynamics (18–21).
A limitation of the standard data-based fault diagnosis
methods is that they rely on historical data that were collected, analyzed, and labeled during past abnormal conditions
(22, 23). One way around this requirement is to incorporate
causal information from the process flowsheet (24).
Drawing on ideas from the machine learning community (25), a more effective solution could be to treat fault
diagnosis as an online learning problem. Adaptive learning methods could generate fault diagnosis systems that
become increasingly effective over time, with the objective
of moving toward prognostics (i.e., the early prediction
of future operational problems) instead of learning about
abnormal conditions after a catastrophic incident.
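A minimal sketch of treating fault diagnosis as an online learning problem is shown below (a generic illustration on synthetic, labeled batches, not the approach of Ref. 25): an incremental classifier is updated with each newly confirmed batch of observations via partial_fit, so the diagnosis system keeps improving as new events are investigated and labeled.

```python
# Generic sketch of online (incremental) fault classification: the model is
# updated with each newly labeled batch instead of being retrained from scratch.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(7)
classes = np.array([0, 1, 2])            # 0 = normal, 1 and 2 = known fault types
clf = SGDClassifier(random_state=0)

def new_labeled_batch(n=100):
    """Stand-in for a batch of measurements labeled after investigation."""
    y = rng.integers(0, 3, size=n)
    X = rng.normal(size=(n, 10)) + 2.0 * y[:, None]   # faults shift the variables
    return X, y

for _ in range(20):                      # data arrive and are labeled over time
    X_batch, y_batch = new_labeled_batch()
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test, y_test = new_labeled_batch(500)
print("current diagnosis accuracy:", clf.score(X_test, y_test))
```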

Final thoughts
Big data creates new possibilities to drive operational
and business performance to higher levels. However,
gaining access to such potential is far from trivial. New
strategies, processes, mindsets, and skills that are not yet in
place are necessary. In addition, challenges emerge when
big data problems are considered in industrial contexts.
This article has summarized ten such challenges to be
addressed in the future to make this journey an insightful learning experience and a successful business opportunity for companies. We also believe the dominating ideas
and premises of big data need to evolve and mature.

As we have discussed, big data by itself will not answer all of your questions. Processes evolve over time, under quite restrictive operating conditions, and data just reflect this reality. We cannot expect data to tell us more than the information contained in the data. But big data and domain knowledge can be used synergistically to move forward and answer important questions, to design better experiments, or to determine additional sensors needed to address those questions.
Big data offers new opportunities for managing our operations, improving processes at all levels, and even adapting companies' business models. So the important question is: Can we afford not to enter the big data era?
CEP

Literature Cited
1. Anderson, C., "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," Wired, www.wired.com/2008/06/pb-theory/ (June 23, 2008).
2. Chiang, L. H., et al., "Fault Detection and Diagnosis in Industrial Systems," Springer-Verlag, London (2001).
3. Han, J., and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, San Francisco, CA (2001).
4. Wang, X. Z., "Data Mining and Knowledge Discovery for Process Monitoring and Control," Springer-Verlag, London (1999).
5. Reis, M. S., and P. M. Saraiva, "Prediction of Profiles in the Process Industries," Industrial and Engineering Chemistry Research, 51 (11), pp. 4524–4266 (Feb. 27, 2012).
6. Rato, T. J., et al., "A Systematic Comparison of PCA-Based Statistical Process Monitoring Methods for High-Dimensional, Time-Dependent Processes," AIChE Journal, 62 (1), pp. 127–142 (Jan. 2016).
7. Reis, M. S., et al., "Challenges in the Specification and Integration of Measurement Uncertainty in the Development of Data-Driven Models for the Chemical Processing Industry," Industrial and Engineering Chemistry Research, 54 (37), pp. 9159–9177 (Aug. 31, 2015).
8. Reis, M. S., and P. M. Saraiva, "Integration of Data Uncertainty in Linear Regression and Process Optimization," AIChE Journal, 51 (11), pp. 3007–3019 (Nov. 2005).
9. Chiang, L. H., and R. D. Braatz, "Process Monitoring Using the Causal Map and Multivariate Statistics: Fault Detection and Identification," Chemometrics and Intelligent Laboratory Systems, 65 (2), pp. 159–178 (Feb. 28, 2003).
10. Bakshi, B. R., et al., "Multiscale Bayesian Rectification of Data from Linear Steady-State and Dynamic Systems without Accurate Models," Industrial and Engineering Chemistry Research, 40 (1), pp. 261–274 (Dec. 6, 2000).
11. Yu, J., and M. M. Rashid, "A Novel Dynamic Bayesian Network-Based Networked Process Monitoring Approach for Fault Detection, Propagation, Identification, and Root Cause Diagnosis," AIChE Journal, 59 (7), pp. 2348–2365 (July 2013).
12. Joint Committee for Guides in Metrology, "Evaluation of Measurement Data – Guide to the Expression of Uncertainty in Measurement," JCGM 100:2008, JCGM, Paris, France, p. 134 (Sept. 2008).
13. Reis, M. S., et al., "Challenges in the Specification and Integration of Measurement Uncertainty in the Development of Data-Driven Models for the Chemical Processing Industry," Industrial and Engineering Chemistry Research, 54 (37), pp. 9159–9177 (Aug. 31, 2015).
14. Reis, M. S., and P. M. Saraiva, "Integration of Data Uncertainty in Linear Regression and Process Optimization," AIChE Journal, 51 (11), pp. 3007–3019 (Nov. 2005).
15. Hoerl, R., and R. D. Snee, "Statistical Thinking: Improving Business Performance," Duxbury Press, Pacific Grove, CA (2001).
16. Reis, M. S., and P. M. Saraiva, "Generalized Multiresolution Decomposition Frameworks for the Analysis of Industrial Data with Uncertainty and Missing Values," Industrial and Engineering Chemistry Research, 45 (18), pp. 6330–6338 (Aug. 9, 2006).
17. Reis, M. S., and P. M. Saraiva, "Multiscale Statistical Process Control with Multiresolution Data," AIChE Journal, 52 (6), pp. 2107–2119 (June 2006).
18. Russell, E. L., et al., "Fault Detection in Industrial Processes Using Canonical Variate Analysis and Dynamic Principal Component Analysis," Chemometrics and Intelligent Laboratory Systems, 51, pp. 81–93 (2000).
19. Zhu, X., and R. D. Braatz, "Two-Dimensional Contribution Map for Fault Detection," IEEE Control Systems, 34 (5), pp. 72–77 (Oct. 2014).
20. Jiang, B., et al., "Canonical Variate Analysis-Based Contributions for Fault Identification," Journal of Process Control, 26, pp. 17–25 (Feb. 2015).
21. Jiang, B., et al., "Canonical Variate Analysis-Based Monitoring of Process Correlation Structure Using Causal Feature Representation," Journal of Process Control, 32, pp. 109–116 (Aug. 2015).
22. Chiang, L. H., et al., "Fault Diagnosis in Chemical Processes Using Fisher Discriminant Analysis, Discriminant Partial Least Squares and Principal Component Analysis," Chemometrics and Intelligent Laboratory Systems, 50, pp. 240–252 (2000).
23. Jiang, B., et al., "A Combined Canonical Variate Analysis and Fisher Discriminant Analysis (CVA-FDA) Approach for Fault Diagnosis," Computers and Chemical Engineering, 77, pp. 1–9 (June 9, 2015).
24. Chiang, L. H., et al., "Diagnosis of Multiple and Unknown Faults Using the Causal Map and Multivariate Statistics," Journal of Process Control, 28, pp. 27–39 (April 2015).
25. Severson, K., et al., "Perspectives on Process Monitoring of Industrial Systems," in Proceedings of the 9th IFAC Symposium on Fault Detection, Supervision, and Safety for Technical Processes, Paris, France (Sept. 2–4, 2015).
