RICHARD D. BRAATZ
LEO H. CHIANG
LLOYD F. COLEGROVE
CHAITANYA KHARE
JOHN F. MACGREGOR
MARCO S. REIS
DAVID WHITE
BIG DATA
The Four Vs
Big data is a big topic with a lot of potential. Before realizing this potential, however, we need to get on the same page about what big data is, how it can be analyzed, and what we can do with it.
The term big data is somewhat misleading, as it is not only the size
(volume) of the data set that makes it big data. Size is just one aspect, and
it describes the sheer amount of data available. A study conducted by Peter
Lyman and Hal R. Varian of the Univ. of California, Berkeley, estimated that the amount of new data stored each year increased by 30%/yr between 1999 and 2002, reaching 5 trillion gigabytes. Ninety-two percent of the new data was
stored on magnetic media, mostly on hard disks. For reference, 5 trillion gigabytes is equivalent to the data stored in 37,000 libraries the size of the Library
of Congress, which houses 17 million books. And, according to IBM, the
amount of data created each day is expected to grow to 43 trillion gigabytes by
2020, from about 2.3 trillion gigabytes of data per day in 2005. In the chemical process industries (CPI), data are coming from many sources, including
employees, customers, vendors, manufacturing plants, and laboratories.
In addition to volume, big data is characterized by three other Vs: velocity, variety, and veracity. Velocity refers to the rate at which data are coming into your organization. Data are now streaming continuously into servers in real time. IBM puts this in context: the New York Stock Exchange captures 1,000 gigabytes of trade information during each trading session. Furthermore, according to Intel, every minute, 100,000 tweets, 700,000 Facebook posts, and 100 million emails are sent.
All data are not equal. Variety describes the heterogeneity of data being
generated. One distinction is whether the data are structured or unstructured.
Structured data include digital data from online sensors and monitoring
devices, while unstructured data are not as neat, such as customer feedback
in the form of paragraphs of text in an email. Realizing the benefits of big
data will require the simultaneous analysis and processing of many different
forms of data, from market research information, to online sensor measurements, to images and spectrographs.
The fourth V, veracity, refers to the quality of data and uncertainty in the
data. Not all data are meaningful. Data quality depends on the way the data are
collected (bias issues may emerge that are very difficult to detect), on whether
the data are updated or no longer make sense (due to time-varying changes
in the system), and on the signal-to-noise ratio (measurement uncertainty),
among other factors.
But the potential of big data is not merely the collection of data. It's the thoughtful collection and analysis of the data, combined with domain knowledge, to answer complex questions. By acting on the answers to these questions, CPI companies will be able to improve operations and increase profits. Big data analytics is a more appropriate term to emphasize the potential of big data.
AIChE recognizes the importance of big data and has organized topical
conferences on big data analytics at its meetings, including at the upcoming
Spring Meeting being held in Houston, TX, April 10-14.
The articles in this special section explore the topic of big data analytics
and its potential for the CPI.
In the first article, David White introduces big data. "A common misconception is that big data is a thing," White writes. "A more accurate metaphor is that big data enables a journey toward more-informed business and operational decisions." White discusses this journey, emphasizing the need for a new approach to analytics that eliminates delays and latency. He concludes with recommendations to help you as you embark on the big data journey.
Salvador García Muñoz and John MacGregor provide several examples
of big data success stories in the second article. The examples include the
analysis and interpretation of historical data and troubleshooting process
problems; optimizing processes and product performance; monitoring and
controlling processes; and integrating data from multivariate online analyzers and imaging sensors. Because these examples involve the use of latent
variable methods, the authors briefly discuss such analytics and why they are
suitable in the era of big data.
Once you see big data's potential, how do you get started? In the third article, Lloyd Colegrove, Mary Beth Seasholtz, and Chaitanya Khare answer this question. The first steps involve identifying a place to get started, a project where big data analytics will pay off, and then selecting a software package appropriate for that project. "Once you have found an analytics opportunity and decided on a data analytics software package, the truly hard work starts: convincing your organization to move forward and then taking those first steps," the authors write. Drawing on their experience at Dow Chemical, they describe a strategy that has worked for them. They then talk about moving beyond the initial success and using big data on more than just a few small projects.
Looking to the future, Marco Reis, Richard Braatz, and Leo Chiang identify
challenges and research directions aimed at realizing the potential of big data.
The fourth article explores some of the challenges related to the four Vs and
some potential areas of research that could address these challenges. "Big data creates new possibilities to drive operational and business performance to higher levels," they write. "However, gaining access to such potential is far from trivial. New strategies, processes, mindsets, and skills that are not yet in place are necessary." Pointing back to the previous articles, the authors end on a high note: "Big data offers new opportunities for managing our operations, improving processes at all levels, and even adapting the companies' business models. So the important question is: Can we afford not to enter the big data era?"
CEP extends a special thanks to Leo H. Chiang for serving as guest editor of this special section.
BIG DATA
What Is It?
David White
ARC Advisory Group
[Figure 1 (not shown): traditional operational systems, including enterprise resource planning (ERP), manufacturing execution systems (MES), supply chain management (SCM), and financials, feed a central data warehouse.]
Figure 2. The cost of raw disk storage (U.S.$/GB, 1995 to 2015) has fallen dramatically over the last two decades. If the per-gigabyte cost in 1994 is represented by the height of the Empire State Building, the cost in 2014 would be comparable to the length of an almond. Source: Adapted from (4).
Predictive analytics, used in applications such as predictive maintenance, are also more effective when moved closer
to the data source. For example, GE uses predictive analytics
to monitor the performance of over 1,500 of its turbines and
generators in 58 countries. Each turbine typically has more
than 100 physical sensors and over 300 virtual sensors that
monitor factors such as temperature, pressure, and vibration. Data are routed over the Internet to a central monitoring location (more than 40 terabytes of data have been
transferred so far). At the data center, predictive analytics
algorithms check current and recent operating performance
against models of more than 150 failure scenarios to detect
signs of impending failure.
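The article does not describe GE's algorithms in detail. As a rough illustration of the general idea (comparing recent deviations in sensor readings against a library of known failure-scenario signatures), a minimal sketch might look like the following; the sensor counts, signature patterns, and thresholds are invented for illustration.

```python
import numpy as np

def match_failure_scenarios(window, signatures, threshold=0.9):
    """Compare the latest deviation from recent average operation against a
    library of failure-scenario signatures (illustrative sketch only)."""
    # Deviation of the most recent sample from the mean of the recent window
    deviation = window[-1] - window.mean(axis=0)
    norm = np.linalg.norm(deviation) + 1e-12
    alerts = []
    for name, pattern in signatures.items():
        # Cosine similarity between the observed deviation and the signature
        sim = float(deviation @ pattern / (norm * np.linalg.norm(pattern)))
        if sim > threshold:
            alerts.append((name, sim))
    return sorted(alerts, key=lambda a: -a[1])

# Hypothetical use: a window of readings from two sensors, two failure signatures
rng = np.random.default_rng(0)
window = rng.normal(size=(100, 2))
window[-1] += np.array([3.0, -2.0])                  # abnormal latest reading
signatures = {"bearing_wear": np.array([1.0, -0.7]),
              "fouling": np.array([-0.5, 1.0])}
print(match_failure_scenarios(window, signatures))   # typically flags "bearing_wear"
```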
Routing data into central storage makes efficient use of
highly skilled workers, allowing specialist technicians to
monitor all of the turbine and generator installations. This
single-database method also helps to ensure that predictive
models are continually refined and improved, and enables
best practices to be shared with all customers simultaneously
rather than on an incremental basis during system upgrades.
GE estimates that this approach collectively saved its customers $70 million last year.
SAS has applied similar technology in deepwater oil
fields. Electrical submersible pumps (ESPs) are used to
increase oil well production, but unexpected failure of an
ESP is costly, causing hundreds of millions of dollars in lost
production as well as requiring $20 million to replace the
pump. The predictive maintenance application developed
using SAS draws on data stored in historians and other
sources to monitor the performance of thousands of ESPs
and detect abnormal operation. Operators are able to gain
a three-month lead time on potential failures and reduce
downtime to six hours.
The cloud
The cloud will be key to IIoT data management and
analytics. The IIoT is accelerating the growth of big data,
generating an unprecedented volume of data. Although
analytics are vital in extracting value from IIoT data, without
strong data management capabilities, high-quality analytics
are impossible.
Raw disk storage, like other electronic technologies, follows Moore's Law. In just 20 years, the price of raw disk storage has dropped from $950 per gigabyte in 1994 to $0.03 per gigabyte in 2014 (Figure 2).
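Those two published prices imply that the cost per gigabyte halved roughly every 16 months over that period, which is the sense in which storage tracks a Moore's-Law-like trend. A quick back-of-the-envelope check, using only the two figures quoted above:

```python
import math

cost_1994 = 950.0   # $/GB, as quoted for 1994
cost_2014 = 0.03    # $/GB, as quoted for 2014
years = 2014 - 1994

ratio = cost_1994 / cost_2014        # roughly a 32,000-fold reduction
halvings = math.log2(ratio)          # about 15 halvings in 20 years
months_per_halving = 12 * years / halvings
print(f"{ratio:,.0f}x cheaper; cost halves about every {months_per_halving:.0f} months")
```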
Although storage has become cheaper, salaries of IT
staff have kept pace with inflation. As a result, the cost of
storage is not tied to the technology, but rather to the people charged with supporting the technology. As demand for
storage and processing grows, the choice is to pay people
to procure, commission, and maintain the infrastructure, or
outsource this work to the cloud and deploy skilled IT staff
NoSQL databases
RDBMSs have long been the dominant tool for organizing and managing enterprise data, and they continue to
dominate. However, an alternative class of databases, collectively known as nonrelational or NoSQL databases (e.g., MongoDB, Cassandra, Redis, and HBase), has gained popularity because they meet the changing demands of data management and the growth of very-large-scale applications.
RDBMSs are from a bygone era when a database server
typically ran in a single box that contained a central processing unit (CPU), memory, and a hard disk. Scalability was
restricted to scaling within the box. The processor could
be updated to run faster or the single-CPU card could be
upgraded to a dual-CPU card, but only within the confines
of the box. That is known as vertical scalability, and it places
fundamental limits on how much the data volume, throughput, and response rate can be scaled.
NoSQL databases are designed out-of-the-box to deliver
performance, scalability, and high availability across distributed computing environments. To gain these advantages
over RDBMSs, however, many NoSQL databases take a
more relaxed approach to consistency. The nodes in the database eventually all have a consistent view of the data, which
is often referred to as eventual consistency. This approach
pushes more responsibility for data integrity and consistency
onto the application logic. Whereas an RDBMS usually
centralizes responsibility for data integrity in the database
Recommendations
For most industrial enterprises, big data will manifest
through the IIoT and will require a different approach to
analytics. It will be difficult to extract maximum value from
high volumes of fast-moving data with traditional data architecture. I recommend the following actions as you embark
on your big data journey.
Start small and focus on a real business problem. Many
devices can be connected to the IIoT in a short period. However, it is best to only connect the things that are associated
with business problems. The fastest way to impede a big data
project is to expend resources with no measurable end result.
Pursue a project that promises quick and easy value.
Do not go out of your way to find a particularly nasty problem to solve. Your first big data project will be challenging,
so pick an easy project. Use data you already have, instead
of data that require new sensors. Find a project that requires
only one or two new technologies. The project should have a
relatively short time frame (months, not years).
Assemble a multidisciplinary team. Any successful IT
project requires a blend of technical expertise and domain-specific business or operational expertise. Treating an IIoT
project as purely a technical problem will yield technically
correct but useless insights and recommendations. The IT
team may need to learn and implement new technologies,
but will require operational and business insight to ensure
there is value in implementation.
Measure return on investment for future projects. Before
you start your first big data project, ensure that you have a
process in place to measure the return on investment. If you
cannot demonstrate tangible value for your first project, there
likely will not be a second project. Make sure you understand
the objectives as agreed upon with business leadership, and document progress toward those objectives.
CEP
Literature Cited
1. ARC Advisory Group, "What's Driving Industrial Investment in BI and Analytics?," www.arcweb.com/strategy-reports/Lists/Posts/Post.aspx?List=e497b184-6a3a-4283-97bfae7b2f7ef41f&ID=1665&Web=a157b1d0-c84d-440a-a7da9b99faeb14cc (Sept. 4, 2014).
2. Laney, D., "3D Data Management: Controlling Data Volume, Velocity, and Variety," META Group, Inc. (Feb. 6, 2001).
3. ARC Advisory Group, "SAP HANA: The Real-Time Database as Change Agent?," www.arcweb.com/strategy-reports/Lists/Posts/Post.aspx?ID=1656 (Aug. 22, 2014).
4. Komorowski, M., "A History of Storage Costs (update)," www.mkomo.com/cost-per-gigabyte-update (Mar. 9, 2014).
BIG DATA
The scores can be thought of as scaled weighted averages of the original variables, using the loadings as the
weights for calculating the weighted averages. A score plot
is a graph of the data in the latent variable space. The loadings are the coefficients that reveal the groups of original
variables that belong to the same latent variable, with one
loading vector (W*) for each latent variable. A loading
plot provides a graphical representation of the clustering of
variables, revealing the identified correlations among them.
The uniqueness of latent variable models is that they simultaneously model the low-dimensional X and Y spaces, whereas classical regression methods assume that there is independent variation in all X and Y variables (which is referred to as full rank). Latent variable models show the relationships between combinations of variables and changes in operating conditions, thereby allowing us to gain insight and optimize processes based on such historical data.
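As a minimal sketch of the mechanics (not the authors' software, and using ordinary PCA rather than the PLS weights W* discussed above), scores and loadings can be obtained from mean-centered data with a singular value decomposition; plotting the first two columns of T gives a score plot, and the first two columns of P a loading plot.

```python
import numpy as np

def pca_scores_loadings(X, n_components=2):
    """Minimal PCA sketch: returns scores T and loadings P for data X
    (rows = observations, columns = process variables)."""
    Xc = X - X.mean(axis=0)                    # mean-center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                    # loadings, one column per component
    T = Xc @ P                                 # scores of each observation
    return T, P

# Hypothetical example: 50 observations of 6 correlated process variables
rng = np.random.default_rng(1)
latent = rng.normal(size=(50, 2))              # two underlying driving forces
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(50, 6))
T, P = pca_scores_loadings(X)
# T[:, 0] vs. T[:, 1] is a score plot; P[:, 0] vs. P[:, 1] is a loading plot.
```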
The remainder of the article presents several industrial applications of big data for:
• the analysis and interpretation of historical data and troubleshooting process problems
• optimizing processes and product performance
• monitoring and controlling processes
• integrating data from multivariate online analyzers and imaging sensors.
[Figure (not shown): a score plot (t1 vs. t2) and a loading plot (W*[1] vs. W*[2]) for the batch process example, showing how process variables such as temperatures, step times, temperature slope, and wet cake weight cluster on the latent variables.]
Monitoring processes
Perhaps the most well-known application of principal components analysis in the chemical process industries (CPI) is its use as a monitoring tool, enabling true multivariate statistical process control (MSPC) (7, 8). In this example, a PCA model was used to describe the normal variability in the operation of a closed spray drying system in a pharmaceutical manufacturing process (9).
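Reference 9 gives the details of that application. Purely as a sketch of how PCA-based monitoring works in general, a model fitted to normal operating data can be used to compute Hotelling's T2 and the squared prediction error (SPE) for each new observation, and values above control limits flag abnormal operation; the data and dimensions below are synthetic.

```python
import numpy as np

def fit_pca_monitor(X_normal, n_components=2):
    """Fit a PCA monitoring model to normal operating data (rows = samples)."""
    mean = X_normal.mean(axis=0)
    Xc = X_normal - mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    score_var = s[:n_components] ** 2 / (len(X_normal) - 1)
    return {"mean": mean, "P": P, "score_var": score_var}

def monitor(model, x_new):
    """Return Hotelling's T2 and the squared prediction error (SPE) for one sample."""
    xc = x_new - model["mean"]
    t = xc @ model["P"]                        # scores of the new sample
    t2 = float(np.sum(t**2 / model["score_var"]))
    residual = xc - model["P"] @ t             # variation not captured by the model
    spe = float(residual @ residual)
    return t2, spe

# Synthetic stand-in for the 16 monitored variables in the spray dryer example
rng = np.random.default_rng(2)
X_normal = rng.normal(size=(500, 16))
model = fit_pca_monitor(X_normal)
t2, spe = monitor(model, X_normal[0])
# Values of T2 or SPE above control limits estimated from the normal data
# (e.g., via F- or chi-squared approximations) would flag abnormal operation.
```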
Figure 4. A control chart of the degree of dissolution (%) of a pharmaceutical tablet reveals the onset of quality problems. Quality problems are reduced by the implementation of a best-next-lot solution, then eliminated by the best-next-campaign approach. Source: (6).
Figure 5. A closed-loop spray drying system in a pharmaceutical manufacturing facility is being monitored by the measurement of 16 variables that a PCA model projects into two principal components. Source: (9).
[Figures (not shown): a score plot (t1 vs. t2) highlighting abnormal operating conditions, and distributions of a final product quality attribute with and without control.]
Concluding remarks
Contextually correct historical data is a critical asset
that a corporation can take advantage of to expedite assertive decisions (3). A potential pitfall in the analysis of big
data is assuming that the data will contain information
just because there is an abundance of data. Data contain
information if they are organized in a contextually correct manner; the practitioner should not underestimate the
effort and investment necessary to organize data such that
information can be extracted from them.
Multivariate latent variable methods are effective tools
for extracting information from big data. These methods
reduce the size and complexity of the problem to simple
and manageable diagnostics and plots that are accessible to
all consumers of the information, from the process designers and line engineers to the operations personnel.
CEP
Literature Cited
1. Jackson, E., "A User's Guide to Principal Components," 1st ed., John Wiley and Sons, Hoboken, NJ (1991).
2. Höskuldsson, A., "PLS Regression Methods," Journal of Chemometrics, 2 (3), pp. 211-228 (June 1988).
3. Wold, S., et al., "PLS - Partial Least-Squares Projection to Latent Structures," in Kubinyi, H., ed., "3D-QSAR in Drug Design," ESCOM Science Publishers, Leiden, The Netherlands, pp. 523-550 (1993).
4. García Muñoz, S., et al., "Troubleshooting of an Industrial Batch Process Using Multivariate Methods," Industrial and Engineering Chemistry Research, 42 (15), pp. 3592-3601 (2003).
5. García Muñoz, S., and J. A. Mercado, "Optimal Selection of Raw Materials for Pharmaceutical Drug Product Design and Manufacture Using Mixed Integer Non-Linear Programming and Multivariate Latent Variable Regression Models," Industrial and Engineering Chemistry Research, 52 (17), pp. 5934-5942 (2013).
6. García Muñoz, S., et al., "A Computer Aided Optimal Inventory Selection System for Continuous Quality Improvement in Drug Product Manufacture," Computers and Chemical Engineering, 60, pp. 396-402 (Jan. 10, 2014).
7. MacGregor, J. F., and T. Kourti, "Statistical Process Control of Multivariable Processes," Control Engineering Practice, 3 (3), pp. 403-414 (1995).
8. Kourti, T., and J. F. MacGregor, Recent Developments in
BIG DATA
Getting Started
on the Journey
Lloyd F. Colegrove
Mary Beth Seasholtz
Chaitanya Khare
The Dow Chemical Co.
and analysis of those data, all in real time. We are not adding new measures or systems to generate more data. We are using the data we already collect and store to achieve ever-increasing value.
In the program that we established at Dow, we aimed to achieve total data domination. This entailed mastering all digital and alphanumeric data that related to a given operation, and correlating and displaying the data to be meaningful for every user.
Figure 1. The first steps of the big data journey (form a team, start a pilot program, listen for success stories, counter pushback) typically garner some pushback. Expect some resistance, but be prepared to counter it. Place a metaphorical baited hook in front of skeptics by demonstrating the power of improved analytics.
to reduce it), additional samples taken to more fully characterize a feed stream, an interruption avoided because the
impurity never exceeded the specification, or an operational
improvement identified are all signs of the desired culture
change.
We often have to show our clients how their results are
changing because of our tools and approaches. The success
stories are told slightly differently, and depend on what part
of the organization you are working with.
After the small success stories are recognized for what
they mean, it is time to expand the pilot within the business
or plant. This is where the EMI (enterprise manufacturing intelligence) platform will get its first test
with deployments beyond the original pilot. Engineers will
see familiar tools appearing all around the business, which
will generate discussion.
The pushback. As we expanded beyond our initial pilot
project, we tried to convince others across the company
that there is value in data. In any software introduction
there is likely to be resistance. "It is too expensive," "It's not my way," and "Why do I need that?" are just three of the responses you may hear. If you have already attempted
to change the analytics culture in your company, you have
heard these and more. To address the skepticism, we tried a
new approach that can be compared to fishing.
What exactly do fish have to do with analytics, you ask.
If we place a hook and bait in front of a potential user, we
may get them to bite. We tempt them by telling success
stories from other areas of the company that demonstrate the
power of improved analytics.
We guide, we explain, we mentor, we hand-hold, all to
provide support as they take the first tentative steps toward
better data usage. If we are successful in this fishing expedition, the mere mention that we might remove the tool will
elicit protests.
Figure 2. The data collected and analyzed in the EMI platform can be used to make four types of decisions: transactional, tactical, operational, and strategic. Each type of decision has a different timescale and scope. Transactional decisions are the most frequent but smallest in scope, whereas strategic decisions are the least frequent but largest in scope.
global manufacturing infrastructure and the cultures represented both locally and across the enterprise. Potential users
can cite many reasons to deflect the opportunity you bring,
so you need to be more flexible to overcome these fears and
misgivings.
We have devised strategies with the sole purpose of
getting the analytics approaches in front of users, regardless
of their misgivings. The personnel who both lead and deliver
the big data analytics tools must have as much flexibility as
the platform itself. It is imperative for long-term success that
the users become comfortable with their data and trust the
message the data are delivering.
Perfection is a myth. There is no such thing as the perfect
tool or perfect approach for harnessing big data. For example, we adopted an attitude that if we gain around 80% of
what we seek, we should move on in the process.
Forgive your data. Accept the data as they come to you.
Do not get into the argument of prescribing how data should
be structured. As you develop a system that harnesses big
data, keep in mind that it is not your job to change, modify,
or replace existing databases.
It should not matter what type of system your data are stored in, nor how antiquated the data storage is: a tool must be able to access all sources of data in some manner, without requiring intermediate programming to make the connection to the data.
Look for all data that are available, even data disconnected from the lab or plant historians, such as readings from remote online analyzers.
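The authors do not name specific tools. As a hypothetical illustration of reading dissimilar stores side by side without modifying them, the sketch below builds two stand-in sources (a CSV export from an offline analyzer and a table in a small relational database, both invented for the example) and aligns them in time.

```python
import sqlite3
import pandas as pd

# Build two stand-in "legacy" stores (placeholders for real plant systems)
pd.DataFrame({"timestamp": pd.date_range("2016-01-01", periods=6, freq="H"),
              "impurity_ppm": [12, 14, 13, 30, 15, 14]}
             ).to_csv("remote_analyzer_export.csv", index=False)
with sqlite3.connect("legacy_lims.db") as conn:
    pd.DataFrame({"timestamp": pd.date_range("2016-01-01", periods=3, freq="2H"),
                  "assay": [98.1, 97.9, 96.5]}
                 ).to_sql("lab_results", conn, if_exists="replace", index=False)

# Read each source as it is, without changing how or where it is stored
analyzer = pd.read_csv("remote_analyzer_export.csv", parse_dates=["timestamp"])
with sqlite3.connect("legacy_lims.db") as conn:
    lab = pd.read_sql_query("SELECT timestamp, assay FROM lab_results", conn,
                            parse_dates=["timestamp"])

# Align the two sources on time so they can be analyzed together
combined = pd.merge_asof(analyzer.sort_values("timestamp"),
                         lab.sort_values("timestamp"),
                         on="timestamp", direction="backward")
print(combined)
```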
If you have bad data, that is, you think your measurement systems are lacking capability or you think you are not
analyzing the right place at the right time, you may think that
your big data journey is over before it ever begins. This is
not true! It will become readily apparent in the initial analysis if your data are poor, which will reveal the first, most
significant, improvement opportunity to the operation. It can
also point to the measurements that need to be improved in
capability (precision or accuracy) or perhaps frequency, etc.
their data in a way they are not accustomed to, and the analytics team needs to be willing to assemble the analytics in a
way that the users can relate to.
Early in the design of our analytics approach, we committed to aligning our tools and the underlying approach to
the workflow that our first internal customers desired. We
did not sub-optimize the tool by tailoring the programming;
rather, we ensured that the tool was used in a manner that
was familiar to and digestible by a given operational team.
Figure 3 illustrates the higher-order process that was developed in collaboration with the operations team.
Our team sought a tool that alerted us to problems and
opportunities and triggered discussion among operations and
subject matter experts, hence providing an environment that
embraces both the "what" and the "why." The tool would also
prompt users by displaying internally vetted and documented (i.e., company-proprietary) knowledge and wisdom,
which could then be harnessed to develop appropriate action
plans. Finally, we wanted to develop a way for users to
encode new knowledge and wisdom back into the tool so
that it may be used again at an appropriate time; this step has not yet come to fruition and is ongoing.
Closing thoughts
Our big data approach seeks to actively turn data into
information (through the application of statistics), translate that information into knowledge (via contextualization
of data), and then convert that knowledge into wisdom
(e.g., maintaining optimal operation 24/7), while avoiding
any surprises.
[Figure 3 (workflow diagram): data, calculations, predictive models, and big data feed a real-time tracking and notification dashboard; an alert triggers discussion between technical and onsite staff, who consult existing knowledge and agree on actions; the plant makes changes, and the learning is integrated back into the enterprise.]
Acknowledgments
The authors wish to acknowledge the editorial contributions of Jim
Petrusich, Vice President, Northwest Analytics, Inc.
BIG DATA
Leo H. Chiang
The Dow Chemical Co.
governor of Kansas. The magazine sent out 10 million postcards, considered a massive amount of data at that time, to gain insight into the voting tendencies of the populace. The Digest collected data from 2.4 million voters, and after triple-checking and verifying the data, forecast a Landon victory over Roosevelt by a margin of 57% to 43%. The final result, however, was a landslide victory by Roosevelt of 61% versus Landon's 37% (the remaining votes were for a third candidate). Based on a much smaller sample of approximately 3,000 interviews, George Gallup correctly predicted a clear victory for Roosevelt.
Literary Digest learned the hard way that, when it comes to data, size is not the only thing that matters. Statistical theory shows that sample size affects sample error, and the error was indeed much lower in the Digest poll. But sample bias must also be considered, and this is especially critical in election polls. (The Digest sample was taken from lists of automobile registrations and telephone directories, creating a strong selection bias toward middle- and upper-class voters.)
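A small simulation makes the distinction concrete (the sample sizes mirror the story above; the support levels within each sampling frame are assumed values for illustration): a huge but biased sample converges confidently to the wrong answer, while a much smaller unbiased sample lands near the truth.

```python
import numpy as np

rng = np.random.default_rng(1936)
true_support = 0.61                      # Roosevelt's actual share

# Small unbiased sample (Gallup-style): about 3,000 randomly chosen voters
small = rng.random(3_000) < true_support

# Huge biased sample (Digest-style): drawn from a wealthier subpopulation in
# which Roosevelt support is assumed (for illustration) to be only 43%
biased_support = 0.43
large = rng.random(2_400_000) < biased_support

print(f"small unbiased sample estimate: {small.mean():.3f}")   # near 0.61
print(f"huge biased sample estimate:    {large.mean():.3f}")   # near 0.43
```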
Another example that demonstrates the danger of
putting excessive confidence in the analysis of big data
sets regards the mathematical models for predicting loan
defaults developed by Lehman Brothers. Based on a very
large database of historical data on past defaults, Lehman
Brothers developed, and tested for several years, models
for forecasting the probability of companies defaulting on
Figure 1. The big data movement stems from the availability of data, high-power computer technology, and analytics to handle data characterized by the four Vs: volume, variety, veracity, and velocity.
tion about the causal structure of the systems, which conventional data-driven monitoring methods cannot provide.
Predictive modeling also requires this type of knowledge, in
particular for process control and optimization applications.
Bayesian approaches (10, 11) and data transformation based
on network inference, together with hybrid gray-box modeling frameworks, are potential ways to introduce a priori
knowledge into data-driven modeling.
Veracity challenges
A major concern in the analysis
of massive data sets has to do
with the quality of data, i.e.,
their veracity. As previously
mentioned, quantity does not
imply quality. On the contrary,
quantity creates more opportunities for problems to occur.
To make matters worse, the detection of bad observations in
massive data sets through visualization techniques is more
challenging and automatic-cleaning algorithms cannot be
relied on either. Data quality also depends on the way the
data are collected (bias issues may emerge that are very
difficult to detect), on whether the information is updated
or no longer makes sense (due to time-varying changes in
the system), and on the signal-to-noise ratio (measurement
uncertainty), among other factors.
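As one small illustration of why automatic cleaning cannot simply be trusted, a common rule such as a modified z-score (Hampel-type) filter will flag gross spikes, but deciding whether a flagged point is a bad reading or a genuine process excursion still requires context; the data and threshold below are illustrative.

```python
import numpy as np

def robust_outlier_mask(x, threshold=3.5):
    """Flag points whose modified z-score (based on the median and the median
    absolute deviation) exceeds a threshold. A simple automatic-cleaning rule,
    not a substitute for inspecting the data."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median)) or 1e-12   # guard against zero MAD
    modified_z = 0.6745 * (x - median) / mad
    return np.abs(modified_z) > threshold

# A sensor trace with one spike (a bad reading) and one genuine step change
trace = np.concatenate([np.full(50, 10.0), [45.0], np.full(50, 12.0)])
flagged = np.where(robust_outlier_mask(trace))[0]
print(flagged)   # only the spike is flagged; whether it is truly "bad," and
                 # whether the step change matters, still needs process context
```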
Uncertainty data. In addition to the collected data, information associated with uncertainty is also available. Measurement uncertainty is defined as a parameter associated with the
result of a measurement that characterizes the dispersion of
the values that could reasonably be attributed to the quantity
to be measured (12). Combining uncertainty data with the raw
measurements can improve data analysis, empirical modeling,
and subsequent decision-making (13, 14).
The specification of measurement uncertainty in big data contexts, and the development of methods that take advantage of knowledge about uncertainty, should be explored in more depth.
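One simple way such uncertainty information can enter the analysis is as weights in an empirical fit, with each measurement weighted by the inverse of its variance. A minimal sketch on synthetic data (not taken from the cited references):

```python
import numpy as np

# Synthetic calibration-style data: y depends linearly on x, but each point
# has a different (known) measurement uncertainty sigma
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 20)
sigma = np.where(x > 5, 2.0, 0.2)              # later points are much noisier
y = 1.5 * x + 4.0 + rng.normal(scale=sigma)

# Weighted least squares for y = slope*x + intercept, weights = 1 / sigma^2
A = np.vstack([x, np.ones_like(x)]).T          # design matrix
w = 1.0 / sigma                                # square roots of the weights
coef, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
slope, intercept = coef
# Ignoring the uncertainties (ordinary least squares) lets the noisy points
# drag the estimates; weighting by 1/sigma^2 recovers the trend more reliably.
```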
Unstructured variability. Process improvement activities require a careful assessment of the multiple sources
of variability of the process, which are typically modeled
using suitable mathematical equations (ranging from firstprinciples models to purely data-driven approaches). The
analysis should involve both the deterministic backbone of
the process behavior, as well as the unstructured aspects of
the process arising from stochastic sources of variability,
including disturbances, sample randomness, measurement
noise, operators' variation, and machine drifting. Jumping
into the analysis of massive data sets while overlooking the
main sources of unstructured variability is ill-advised, and
is contrary to a reliable statistical engineering approach to
addressing process improvement activities.
Final thoughts
Big data creates new possibilities to drive operational
and business performance to higher levels. However,
gaining access to such potential is far from trivial. New
strategies, processes, mindsets, and skills that are not yet in
place are necessary. In addition, challenges emerge when
big data problems are considered in industrial contexts.
This article has summarized ten such challenges to be
addressed in the future to make this journey an insightful learning experience and a successful business opportunity for companies. We also believe the dominating ideas
and premises of big data need to evolve and mature.