
Big Data Governance

and Perspectives in
Knowledge Management

Sheryl Kruger Strydom


University of South Africa, South Africa

Moses Strydom
University of South Africa, South Africa

A volume in the Advances in Knowledge


Acquisition, Transfer, and Management (AKATM)
Book Series
Published in the United States of America by
IGI Global
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA, USA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com

Copyright © 2019 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Names: Strydom, Sheryl Kruger, 1959- editor. | Strydom, Moses, 1944- editor.
Title: Big data governance and perspectives in knowledge management / Sheryl
Kruger Strydom and Moses Strydom, editors.
Description: Hershey, PA : Information Science Reference, [2019] | Includes
bibliographical references.
Identifiers: LCCN 2018014732| ISBN 9781522570776 (hardcover) | ISBN
9781522570783 (ebook)
Subjects: LCSH: Knowledge management. | Big data--Social aspects. | Internet
governance.
Classification: LCC HD30.2 .B5445 2019 | DDC 658.4/038--dc23 LC record available at https://lccn.loc.gov/2018014732

This book is published in the IGI Global book series Advances in Knowledge Acquisition, Transfer, and Management
(AKATM) (ISSN: 2326-7607; eISSN: 2326-7615)

British Cataloguing in Publication Data


A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.

For electronic access to this publication, please contact: eresources@igi-global.com.


Advances in Knowledge
Acquisition, Transfer, and
Management (AKATM) Book
Series
Murray E. Jennex
San Diego State University, USA
ISSN:2326-7607
EISSN:2326-7615

Mission
Organizations and businesses continue to utilize knowledge management practices in order to streamline
processes and procedures. The emergence of web technologies has provided new methods of informa-
tion usage and knowledge sharing.
The Advances in Knowledge Acquisition, Transfer, and Management (AKATM) Book Series
brings together research on emerging technologies and their effect on information systems as well as the
knowledge society. AKATM will provide researchers, students, practitioners, and industry leaders with
research highlights surrounding the knowledge management discipline, including technology support
issues and knowledge representation.

Coverage
• Cognitive Theories
• Cultural Impacts
• Information and Communication Systems
• Knowledge acquisition and transfer processes
• Knowledge management strategy
• Knowledge Sharing
• Organizational Learning
• Organizational Memory
• Small and Medium Enterprises
• Virtual communities

IGI Global is currently accepting manuscripts for publication within this series. To submit a proposal for a volume in this series, please contact our Acquisition Editors at Acquisitions@igi-global.com or visit: http://www.igi-global.com/publish/.

The Advances in Knowledge Acquisition, Transfer, and Management (AKATM) Book Series (ISSN 2326-7607) is published by IGI
Global, 701 E. Chocolate Avenue, Hershey, PA 17033-1240, USA, www.igi-global.com. This series is composed of titles available for purchase
individually; each title is edited to be contextually exclusive from any other title within the series. For pricing and ordering information please
visit http://www.igi-global.com/book-series/advances-knowledge-acquisition-transfer-management/37159. Postmaster: Send all address changes
to above address. Copyright © 2019 IGI Global. All rights, including translation in other languages reserved by the publisher. No part of this
series may be reproduced or used in any form or by any means – graphics, electronic, or mechanical, including photocopying, recording, taping,
or information and retrieval systems – without written permission from the publisher, except for non commercial, educational use, including
classroom teaching purposes. The views expressed in this series are those of the authors, but not necessarily of IGI Global.
Titles in this Series
For a list of additional titles in this series, please visit: www.igi-global.com/book-series

Scholarly Content and Its Evolution by Scientometric Indicators Emerging Research and Opportunities
Zahid Ashraf Wani (University of Kashmir, India) and Tazeem Zainab (University of Kashmir, India)
Information Science Reference • copyright 2019 • 201pp • H/C (ISBN: 9781522559450) • US $135.00 (our price)

Technological Innovations in Knowledge Management and Decision Support


Nilanjan Dey (Techno India College of Technology, India)
Information Science Reference • copyright 2019 • 322pp • H/C (ISBN: 9781522561644) • US $195.00 (our price)

Effective Knowledge Management Systems in Modern Society


Murray E. Jennex (San Diego State University, USA)
Information Science Reference • copyright 2019 • 391pp • H/C (ISBN: 9781522554271) • US $195.00 (our price)

Global Information Diffusion and Management in Contemporary Society


Zuopeng (Justin) Zhang (State University of New York at Plattsburgh, USA)
Information Science Reference • copyright 2019 • 331pp • H/C (ISBN: 9781522553939) • US $195.00 (our price)

Contemporary Knowledge and Systems Science


W. B. Lee (Hong Kong Polytechnic University, China) and Farzad Sabetzadeh (Hong Kong Polytechnic University,
China)
Information Science Reference • copyright 2018 • 340pp • H/C (ISBN: 9781522556558) • US $185.00 (our price)

Enhancing Knowledge Discovery and Innovation in the Digital Era


Miltiadis D. Lytras (The American College of Greece, Greece) Linda Daniela (University of Latvia, Latvia) and
Anna Visvizi (Effat University, Saudi Arabia)
Information Science Reference • copyright 2018 • 363pp • H/C (ISBN: 9781522541912) • US $215.00 (our price)

N-ary Relations for Logical Analysis of Data and Knowledge


Boris Kulik (Russian Academy of Science, Russia) and Alexander Fridman (Russian Academy of Science, Russia)
Information Science Reference • copyright 2018 • 297pp • H/C (ISBN: 9781522527824) • US $225.00 (our price)

Radical Reorganization of Existing Work Structures Through Digitalization


Punita Duhan (Meera Bai Institute of Technology, India) Komal Singh (Meera Bai Institute of Technology, India)
and Rahul Verma (Bhai Parmanand Institute of Business Studies, India)
Business Science Reference • copyright 2018 • 287pp • H/C (ISBN: 9781522531913) • US $235.00 (our price)

701 East Chocolate Avenue, Hershey, PA 17033, USA


Tel: 717-533-8845 x100 • Fax: 717-533-8661
E-Mail: cust@igi-global.com • www.igi-global.com
Editorial Advisory Board
Elena Bouleanu, University of Sibiu, Romania
Cyrille Dongno, University of South Africa, South Africa
Fuensanta Medina, University of Madrid, Spain
Eyitayo Olakanmi, Botswana University of Science and Technology, Botswana
Mirjana Pejic-Bach, University of Zagreb, Croatia
Laura Po, University of Modena, Italy
Vasudeva Rao Veeredhi, University of South Africa, South Africa
Eduardo Rodriguez, University of Wisconsin, USA
Kent Rondeau, University of Alberta, Canada
Andras Vajkal, University of Pecs, Hungary

List of Reviewers
Ari Alamaki, University of Applied Science, Finland
Lamyaa el Bassiti, Mohammed V University of Rabat, Morocco
Richard Boire, Boire Filler Group, Canada
David Booth, Kent State University, USA
Sonia Chien-I Chen, Ulster University, UK & SoniChica Company Limited, Taiwan
Dorina Ionescu, University of South Africa, South Africa
Zivko Krstic, Atomic Intelligence, Croatia
David Kruger, University of South Africa, South Africa
Steve MacFeely, United Nations Conference on Trade and Development, Switzerland
Vincent Mzazi, University of South Africa, South Africa
Ceylun Ozgur, Valparaiso University, USA
Jasmina Pivar, University of Zagreb, Croatia
Mustafa Sagsun, Near East University, Cyprus
Daria Sarti, University of Florence, Italy
Samuel Ssemugabi, University of South Africa, South Africa
Azadeh Tabrizi, Near East University, Cyprus
Bruno Tissot, Bank for International Settlements, Switzerland
Teresina Torre, University of Genoa, Italy
Winfred Yaokumah, Pentecost University College, Ghana

Table of Contents

Preface..................................................................................................................................................xiii

Acknowledgment................................................................................................................................. xvi

Chapter 1
Making the Most of Big Data for Financial Stability Purposes............................................................... 1
Bruno Tissot, Bank for International Settlements (BIS), Switzerland

Chapter 2
Big Data and Official Statistics.............................................................................................................. 25
Steve MacFeely, The United Nations Conference on Trade and Development (UNCTAD),
Switzerland

Chapter 3
The Big Data Research Ecosystem: An Analytical Literature Study.................................................... 55
Moses John Strydom, University of South Africa, South Africa
Sheryl Buckley, University of South Africa, South Africa

Chapter 4
Optimization of Aerospace Big Data Including Integrated Health Monitoring With the Help of
Data Analytics........................................................................................................................................ 88
Ranganayakulu Chennu, Aeronautical Development Agency, India
Vasudeva Rao Veeredhi, University of South Africa, South Africa

Chapter 5
Classification Techniques and Data Mining Tools Used in Medical Bioinformatics.......................... 105
Satish Kumar David, King Saud University, Saudi Arabia
Amr T. M. Saeb, King Saud University, Saudi Arabia
Mohamed Rafiullah, King Saud University, Saudi Arabia
Khalid Rubeaan, King Saud University, Saudi Arabia

Chapter 6
Big Data and People Management: The Prospect of HR Managers.................................................... 127
Daria Sarti, University of Florence, Italy
Teresina Torre, University of Genoa, Italy




Chapter 7
Big Data, Semantics, and Policy-Making: How Can Data Dynamics Lead to Wiser Governance?... 154
Lamyaa El Bassiti, Mohammed V University in Rabat, Morocco

Chapter 8
Big Data Governance in Agile and Data-Driven Software Development: A Market Entry Case in
the Educational Game Industry............................................................................................................ 179
Lili Aunimo, Haaga-Helia University of Applied Sciences, Finland
Ari V. Alamäki, Haaga-Helia University of Applied Sciences, Finland
Harri Ketamo, Headai Ltd., Finland

Chapter 9
The Link Between Innovation and Prosperity: How to Manage Knowledge for the Individual’s
and Society’s Benefit From Big Data Governance?............................................................................ 200
Sonia Chien-i Chen, SC Company Limited, Taiwan
Radwan Alyan Kharabsheh, Applied Science University, Bahrain

Chapter 10
Big Data for Prediction: Patent Analysis – Patenting Big Data for Prediction Analysis..................... 218
Mirjana Pejic-Bach, University of Zagreb, Croatia
Jasmina Pivar, University of Zagreb, Croatia
Živko Krstić, Atomic Intelligence, Croatia

Chapter 11
The Components of Big Data and Knowledge Management Will Change Radically How People
Collaborate and Develop Complex Research....................................................................................... 241
Amitava Choudhury, University of Petroleum and Energy Studies, India
Ambika Aggarwal, University of Petroleum and Energy Studies, India
Kalpana Rangra, University of Petroleum and Energy Studies, India
Ashutosh Bhatt, Shivalik College of Engineering, India

Compilation of References................................................................................................................ 258

About the Contributors..................................................................................................................... 295

Index.................................................................................................................................................... 301
Detailed Table of Contents

Preface..................................................................................................................................................xiii

Acknowledgment................................................................................................................................. xvi

Chapter 1
Making the Most of Big Data for Financial Stability Purposes............................................................... 1
Bruno Tissot, Bank for International Settlements (BIS), Switzerland

Big data has become a key topic in data creation, storage, retrieval, methodology, and analysis in the
financial stability area. The flexibility and real-time availability of big data have opened up the possibility
of extracting more timely economic signals, applying new statistical methodologies, enhancing economic
forecasts and financial stability assessments, and obtaining rapid feedback on policy impacts. But, while
public financial authorities appear increasingly interested in big data, their actual use has remained
limited, reflecting a number of operational challenges. Moreover, using big data for policy purposes is
not without risks, such as that of generating a false sense of certainty and precision. Exploring big data
is thus a complex, multifaceted task, and a general and regular production of big data-based information
would take time. Looking ahead, it is key to focus on concrete pilot projects and share these experiences.
International cooperation would certainly add value in this endeavor.

Chapter 2
Big Data and Official Statistics.............................................................................................................. 25
Steve MacFeely, The United Nations Conference on Trade and Development (UNCTAD),
Switzerland

Over recent years, the potential of big data for government, for business, for society has excited much
comment, debate, and even evangelism. But are big data really the panacea to all our data problems
or is this just hype and hubris? This is the question facing official statisticians: Are big data worth the
investment of time and resources? While the statistical possibilities appear endless, big data also present
enormous challenges and potential pitfalls: legal, ethical, technical, and reputational. This chapter
examines the opportunities and challenges presented by big data and also discusses some governance
issues arising for official statistics.

Chapter 3
The Big Data Research Ecosystem: An Analytical Literature Study.................................................... 55
Moses John Strydom, University of South Africa, South Africa
Sheryl Buckley, University of South Africa, South Africa




Big data is the emerging field where innovative technology offers new ways to extract value from an
unequivocal plethora of available information. By its fundamental characteristic, the big data ecosystem
is highly conjectural and is susceptible to continuous and rapid evolution in line with developments in
technology and opportunities, a situation that predisposes the field to research in very brief time spans.
Against this background, both academics and practitioners oddly have a limited understanding of how
organizations translate potential into actual social and economic value. This chapter conducts an in-
depth systematic review of existing penchants in the rapidly developing field of big data research and,
thereafter, systematically reviews these studies to identify some of their weaknesses and challenges.
The authors argue that, in practice, most big data surveys do not focus on technologies, and instead
present algorithms and approaches employed to process big data.

Chapter 4
Optimization of Aerospace Big Data Including Integrated Health Monitoring With the Help of
Data Analytics........................................................................................................................................ 88
Ranganayakulu Chennu, Aeronautical Development Agency, India
Vasudeva Rao Veeredhi, University of South Africa, South Africa

The objective of this chapter is to present the role and advantages of big data governance in the optimal
use of integrated health monitoring systems with a specific reference to the aerospace industry. Aerospace
manufacturers and many passenger airlines have realized the benefits of sharing and analyzing the huge
amounts of data being collected by their latest generation airliners and engines. While aero engines are
already equipped with integrated engine health monitoring concepts, aircraft systems are now being
introduced with integrated vehicle health monitoring concepts, which require a large number of sensors.
The volume of data generated by these sensors is enormously high and grows over time to constitute
big data to be monitored and analyzed. This chapter aims to give an overview of various systems and
their data logging processes, simulations, and data analysis. Various sensors that are required to be used
in important systems of a typical fighter aircraft and their functionalities emphasizing the huge volume
of data generated for the analysis are presented in this chapter.

Chapter 5
Classification Techniques and Data Mining Tools Used in Medical Bioinformatics.......................... 105
Satish Kumar David, King Saud University, Saudi Arabia
Amr T. M. Saeb, King Saud University, Saudi Arabia
Mohamed Rafiullah, King Saud University, Saudi Arabia
Khalid Rubeaan, King Saud University, Saudi Arabia

Increasing volumes of data and the increased availability of information mandate the use of data mining
techniques in order to gather useful information from the datasets. In this chapter, data mining techniques
are described with a special emphasis on classification techniques as one important supervised learning
technique. Bioinformatics tools in the field for medical applications especially in medical microbiology
are discussed. This chapter presents WEKA software as a tool of choice to perform classification analysis
for different kinds of available data. Uses of WEKA data mining tools for biological applications such as
genomic analysis and for medical applications such as diabetes are discussed. Data mining offers novel
tools for medical applications for infectious diseases; it can help in identifying the pathogen and analyzing
the drug resistance pattern. For non-communicable diseases such as diabetes, it provides excellent data
analysis options for analyzing large volumes of data from many clinical studies.


Chapter 6
Big Data and People Management: The Prospect of HR Managers.................................................... 127
Daria Sarti, University of Florence, Italy
Teresina Torre, University of Genoa, Italy

This chapter investigates the role of big data (BD) in human resource management (HRM). The interest
is related to the strategic relevance of human resources (HR) and to the increasing importance of BD in
every dimension of a company’s life. The analysis focuses on the perception of the HR managers on the
impact that BD and BD analytics may have on the HRM and the possible problems the HR departments
may encounter when implementing human resources analytics (HRA). The authors’ opinion is that
attention to the perceptions of HR managers is the most important element conditioning their attitude
towards BD, and it is the first feature influencing the possibility that BD can become a positive
challenge. After the presentation of the topic and of the state of the art, the study is introduced. The main
findings are discussed and commented to offer suggestion for HR managers and to underline some key
points for future research in this field.

Chapter 7
Big Data, Semantics, and Policy-Making: How Can Data Dynamics Lead to Wiser 
Governance?........................................................................................................................................ 154
Lamyaa El Bassiti, Mohammed V University in Rabat, Morocco

At the heart of all policy design and implementation, there is a need to understand how well decisions
are made. It is evidently known that the quality of decision making depends significantly on the quality
of the analyses and advice provided to the associated actors. Over decades, organizations were highly
diligent in gathering and processing vast amounts of data, but they have given less emphasis on how
these data can be used in policy argument. With the arrival of big data, attention has been focused on
whether it could be used to inform policy-making. This chapter aims to bridge this gap, to understand
variations in how big data could yield usable evidence, and how policymakers can make better use of
that evidence in policy choices. An integrated and holistic look at how solving complex problems could
be conducted on the basis of semantic technologies and big data is presented in this chapter.

Chapter 8
Big Data Governance in Agile and Data-Driven Software Development: A Market Entry Case in
the Educational Game Industry............................................................................................................ 179
Lili Aunimo, Haaga-Helia University of Applied Sciences, Finland
Ari V. Alamäki, Haaga-Helia University of Applied Sciences, Finland
Harri Ketamo, Headai Ltd., Finland

Constructing a big data governance framework is important when a company performs data-driven software
development. The most important aspects of big data governance are data privacy, security, availability,
usability, and integrity. In this chapter, the authors present a business case where a framework for big
data governance has been built. The business case is about the development and continuous improvement
of a new mobile application that is targeted for consumers. In this context, big data is used in product
development, in building predictive models related to the users, and for personalization of the product.
The main finding of the study is a novel big data governance framework and that a proper framework
for big data governance is useful when building and maintaining trustworthy and value adding big data-
driven predictive models in an authentic business environment.


Chapter 9
The Link Between Innovation and Prosperity: How to Manage Knowledge for the Individual’s
and Society’s Benefit From Big Data Governance?............................................................................ 200
Sonia Chien-i Chen, SC Company Limited, Taiwan
Radwan Alyan Kharabsheh, Applied Science University, Bahrain

The digital era accelerates the growth of knowledge to such an extent that it is a challenge for individuals
and society to manage it traditionally. Innovative tools are introduced to analyze massive data sets for
extracting business value cost-effectively and efficiently. These tools help extract business intelligence
from explicit information, so that tacit knowledge can be transferred into actionable insights. Big data
are currently fashionable because of their accuracy and their capability to predict future trends. They
have shown their power to bring business prosperity to organizations of all kinds, from supermarket giants
to small businesses and entire disciplines. However, as data spread widely, people are concerned about
the potential risks of increasing inequality and threats to democracy. Big data governance is needed if
people want to keep their right to privacy. This chapter explores how big data can be governed to maintain
the benefits for the individual and society. It aims to allow technology to humanize the digital era, so
that people can benefit from living in the present.

Chapter 10
Big Data for Prediction: Patent Analysis – Patenting Big Data for Prediction Analysis..................... 218
Mirjana Pejic-Bach, University of Zagreb, Croatia
Jasmina Pivar, University of Zagreb, Croatia
Živko Krstić, Atomic Intelligence, Croatia

The technical field of big data for prediction attracts the attention of different stakeholders. The reasons are
related to the potential of big data, which allows for learning from past behavior, discovering patterns
and values, and optimizing business processes based on new insights from large databases. However, in
order to fully utilize the potentials of big data, its stakeholders need to understand the scope and volume
of patenting related to big data usage for prediction. Therefore, this chapter aims to perform an analysis
of patenting activities related to big data usage for prediction. This is done by (1) exploring the timeline
and geographic distribution of patenting activities, (2) exploring the most active assignees of technical
content of interest, (3) detecting the type of protected technical content according to the international patent
classification system, and (4) performing text-mining analysis to discover the topics emerging most
often in patents’ abstracts.

Chapter 11
The Components of Big Data and Knowledge Management Will Change Radically How People
Collaborate and Develop Complex Research....................................................................................... 241
Amitava Choudhury, University of Petroleum and Energy Studies, India
Ambika Aggarwal, University of Petroleum and Energy Studies, India
Kalpana Rangra, University of Petroleum and Energy Studies, India
Ashutosh Bhatt, Shivalik College of Engineering, India

Emerging as a rapidly growing field, big data is already known for promising success and having
considerable synergies with knowledge management. The common goal of this collaboration is to improve
and facilitate decision making, fueling the competition, fostering innovation, and achieving economic


success through acquisition of knowledge to various applications. Knowledge in the entire world or
inside any organization has already expanded itself in various directions and is exponentially increasing
with time. To withstand the current competitive environment, an intensive collaboration of knowledge
management with different approaches and algorithms of big data is required. Classical structuring is
becoming obsolete with the increasing amount of knowledge components.

Compilation of References................................................................................................................ 258

About the Contributors..................................................................................................................... 295

Index.................................................................................................................................................... 301

Preface

The editors take great pleasure in prefacing this book titled Big Data Governance and Perspectives in
Knowledge Management.
It is generally accepted that any organization working with data recognizes the value-added benefits of
being able to aggregate information across multiple complex systems and business processes that enable
the said organizations to be efficient. Data governance is the function responsible for establishing
policies, procedures, standards, and guidelines for ensuring that the optimal value of data can be achieved. Read-
ily available frameworks, tools, and services can be adapted to the requirements and environment of an
organization, yet when it comes to big data governance, the options are more complex.
Articles, conferences and seminars abound, emphasizing the overall management of the availability,
usability, integrity and security of data used in an enterprise. Nobody could escape from this wave of
events organized around the concepts of big data. But what is really behind these expressions which at
first sight seems so simple? Why such enthusiasm with respect to this topic? If data and their process-
ing have been the catalysts of information since its inception, then what has happened in recent years to
evolve from data to big data? How is it a revolution? In what manner is it a revolution? And, above all,
can knowledge management facilitate additional value from big data?
It is to answer all these questions that we invited relevant voices within their fields to highlight new
directions in contemporary research in big data governance and perspectives in knowledge management.
Each author has brought her/his own share of valuable lessons in analyzing this issue.
In this way each chapter is intended to afford fully the benefits of sharing expertise from different
organizations and contexts. The book subsequently reaps these multifaceted benefits.
Written primarily by academics, the chapters of this book have nonetheless the main goal to provide
insight to academics and practitioners alike. In what follows we provide a brief overview of subjects
discussed.
Chapter 1 analyses how the flexibility and real-time availability of big data have advanced the pos-
sibility of extracting more timely economic signals, by applying new statistical methodologies, enhancing
economic forecasts and financial stability assessments, and obtaining rapid feedback on policy impacts.
In Chapter 2, the author presents a comprehensive overview of big data for official statistics, cover-
ing the sources of big data, the manner of accessing them, the opportunities and challenges, and the
mitigating factors, a case in point being governance frameworks.



The big data ecosystem is highly conjectural and is susceptible to continuous and rapid evolution
in line with developments in technology and opportunities, a situation that predisposes the field to re-
search in very brief time spans. It is thus important, nay, vital, to conduct in-depth systematic reviews
of existing penchants in the rapidly developing field of big data research, and, thereafter, systematically
review these studies to identify some of their weaknesses and challenges: this is submitted in Chapter 3.
The objective of Chapter 4 is to present the role and advantages of big data governance in the optimal
use of integrated health monitoring systems with a specific reference to the aerospace industry where
manufacturers have realized the benefits of sharing and analyzing the huge amounts of data being col-
lected by their latest generation airliners and engines.
In Chapter 5 the authors analyze different data mining techniques and tools and illustrate one of them
in the biomedical field. They additionally show the results obtained in a real case where they analyze
diabetes data with the WEKA tool.
The aim of Chapter 6 is to investigate the effective and potential role of big data in the field of human
resource management (HRM), as there is a great deal of interest among HR practitioners, and yet, the
authors argue, a relative lack of research interest in the current academic debate.
With the arrival of big data and the growth of investment in research, attention has been brought to
whether the produced knowledge can be used to inform innovation and policy-making. Chapter 7 aims
to bridge the gap between science, innovation and policy-making while meeting the challenge of using
science as well as considering the global benefit.
Constructing a big data governance framework is important when a company performs data-driven
software development. The authors of Chapter 8 present a business case where a framework for big data
governance has been built based on a new mobile application that is targeted for consumers.
Chapter 9 explores how big data can be governed in order to maintain the benefits of the individual
and society. One of its fundamental objectives is to permit technology to humanize the digital era through
reviving digital forgetting.
While we are unanimous about the huge potential of big data, there is a crying need to understand
the scope and volume of patenting related to big data usage for prediction. Chapter 10 analyses patenting
activities related to big data usage.
With the increasing amount of knowledge components, classical structuring is becoming obsolete. In
order to withstand the current competitive environment, an intensive collaboration of knowledge manage-
ment with different approaches and algorithms of big data is required. Chapter 11 examines this scenario.
It is worthy of note that the book, over its chapters, has managed to treat all of these themes: legal
aspects and subjects addressing social, skills, and economic affairs, on the one hand, and technology
and applications, on the other.
We trust you will find this book a useful reference as we strive collectively for meaningful, high-quality research.

Moses John Strydom


University of South Africa, South Africa


To our dear confidants, Big data governance, and Knowledge management,

Where is the wisdom we lost in knowledge?


Where is the knowledge we lost in information?
T. S. Eliot, Choruses from The Rock (1934).



Acknowledgment

Editing a book of this nature proved, for several reasons, to be a daunting task.

First, constituting the editorial team. Here, Moses Strydom proved to be the pillar that supported the
book structure. His vast knowledge and experience made the task easier. Secondly, though the quality
of submissions was of a high standard, on many occasions we regrettably were obliged to reject the text
because it did not fall within the scope and format of this publication.

To this end, the book spanned both established and up-and-coming scholars from 14 different countries
(Canada, Croatia, Finland, France, India, Ireland, Italy, Morocco, Saudi Arabia, South Africa, United
Kingdom and the USA). The book is derived from an independent double- and in some instances treble-
blind peer review process. We would like to thank the Advisory Board and the 28 reviewers from 15
countries, again spanning the globe, who have critically evaluated the chapters.

Finally, we hope that the chapters will stimulate further progress in the big data discipline and can give
rise to a more global collaborative approach to big data management, underpinned by innovation. In a
fast-changing world, it is only through innovation that we keep ahead of time.

Sheryl Beverley Kruger Strydom


University of South Africa, South Africa




Chapter 1
Making the Most of Big Data
for Financial Stability Purposes
Bruno Tissot
Bank for International Settlements (BIS), Switzerland

ABSTRACT
Big data has become a key topic in data creation, storage, retrieval, methodology, and analysis in the
financial stability area. The flexibility and real-time availability of big data have opened up the possibility
of extracting more timely economic signals, applying new statistical methodologies, enhancing economic
forecasts and financial stability assessments, and obtaining rapid feedback on policy impacts. But, while
public financial authorities appear increasingly interested in big data, their actual use has remained
limited, reflecting a number of operational challenges. Moreover, using big data for policy purposes is
not without risks, such as that of generating a false sense of certainty and precision. Exploring big data
is thus a complex, multifaceted task, and a general and regular production of big data-based information
would take time. Looking ahead, it is key to focus on concrete pilot projects and share these experiences.
International cooperation would certainly add value in this endeavor.

INTRODUCTION

There is general policy interest in “big data”, which is fundamentally changing the character of the
information available to public authorities. To put it simply, this term usually describes extremely large
data-sets that are often a by-product of commercial or social activities and provide a huge amount of
granular information, typically at the level of individual transactions. This form of data is available in,
or close to, real time and can be used to identify behavioral patterns or economic trends. Its impact in
terms of information creation, storage, retrieval, methodology and analysis has gained increasing im-
portance, and the private sector is already making significant use of data patterns from such data-sets
to produce new and timely indicators.
The flexibility and immediacy of big data also provide new opportunities for public authorities involved
in financial stability policies. In particular, central banks, macro-prudential authorities and financial
supervisors are showing an increasing interest, especially at senior policy level.1 They can have access

DOI: 10.4018/978-1-5225-7077-6.ch001

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


to a broader and timelier (often on a real-time basis) range of indicators, opening up the possibility of
extracting more timely economic signals, applying new statistical methodologies, supporting economic
forecasts and financial stability analyses, and obtaining rapid feedback on policy impacts.
Yet there are important peculiarities as regards the role of big data for financial stability issues.
Financial stability needs are quite specific, and not all big data fields are equally promising from this
perspective. For instance, central banks’ focus is more on those big data sources that can effectively
support micro- and macro-economic as well as monetary and financial stability analyses. In contrast,
other types of big data sources – such as geospatial information, an increasingly important source for
national statistical offices – are often of lower interest to central banks.
Not only is the focus specific, the challenges faced in handling and using big data for financial stabil-
ity purposes are also particular. For instance, a key input in financial stability assessments relates to the
identification of pockets of micro fragilities that can have a system-wide impact – but basing such an
identification on large granular data-sets can be difficult because of, for instance, confidentiality protection
rules or commercial copyrights. Another example is the feedback loop that is inherent to policy-making
authorities: big data sources can affect central bank policy-making, and in turn the policies implemented
can generate new data-sets. This relates to the more general Lucas critique, and is a clear distinctive
mark compared to other public statistical bodies in the field of big data. One obvious example is the
growing number of qualitative statements that can be used to decipher central banks’ communication,
by applying big data-related text mining techniques, and which in turn can modify other agents’ actions.
Another example relates to the large number of big data pools generated by various financial regulations,
since public authorities in charge of supervising financial institutions and monitoring specific market
segments can request a lot of (complex and highly rich) information. In turn, big data can strengthen
supervisors’ capacity by providing insights into large amounts of unstructured data (Basel Committee
on Banking Supervision (BCBS), 2018).
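
To make the text-mining idea mentioned above more concrete, the following minimal sketch scores the tone of central bank statements against simple keyword lists. It is only an illustration of the general technique: the keyword lists, the example sentences and the scoring rule are hypothetical and are not drawn from the chapter or from any official methodology.

```python
# Minimal sketch: dictionary-based scoring of central bank communication.
# The keyword lists, example statements and scoring rule are hypothetical.
import re

HAWKISH = {"tighten", "tightening", "inflationary", "overheating", "raise"}
DOVISH = {"accommodative", "easing", "stimulus", "downside", "cut"}

def tone_score(statement: str) -> float:
    """Crude tone score in [-1, 1]: positive = hawkish, negative = dovish."""
    tokens = re.findall(r"[a-z]+", statement.lower())
    hawkish_hits = sum(token in HAWKISH for token in tokens)
    dovish_hits = sum(token in DOVISH for token in tokens)
    total = hawkish_hits + dovish_hits
    return 0.0 if total == 0 else (hawkish_hits - dovish_hits) / total

statements = [  # invented examples
    "The committee stands ready to tighten policy should inflationary pressures persist.",
    "Policy will remain accommodative given downside risks to growth.",
]
for text in statements:
    print(f"{tone_score(text):+.2f}  {text}")
```

In practice such applications rely on much richer dictionaries or on machine-learning classifiers, but the basic pipeline of tokenizing, matching and aggregating qualitative statements is the same in spirit.
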
Another key specificity relates to the public nature of financial authorities and the degree of trust they
benefit from in pursuing their mission of promoting monetary and financial stability. They put great
emphasis on the degree of confidence they enjoy in society and are usually the first to be concerned
about possible ethical or reputational consequences of using big data. Moreover, the risk of misusing
big data has to be carefully considered, since policy decisions based on wrong data could have incom-
mensurable consequences. Indeed, the public and private sectors may have different areas of concerns
from this data quality perspective: for instance, online retailers targeting potential customers based on
past web searches might find it acceptable to be “right” once out of five times, but official statisticians
would usually consider such an accuracy level of 20% as completely inadequate…
The same applies to security issues. Recent news reports have highlighted the risk that large private records of
personal information collected by social media could be accessed and potentially misused by unauthorized
third parties. For public authorities, especially those tasked with collecting official statistics, such risks are of
utmost importance. Among financial supervisors, attention has focused on cyber threat intelligence and
cyber threat modelling techniques that should be mobilized, with a focus on firms’ capability to gather
and interpret related information (which is increasingly big data-type information, such as the number
of computing logs, payments et cetera). More generally, a growing area of focus has been how to ensure
the resilience of financial market infrastructures (Committee on Payments and Market Infrastructures
(CPMI) & International Organization of Securities Commissions (IOSCO), 2016).

Partly reflecting these specificities, but also more basic resource and IT constraints, big data work
is still on an exploratory mode in the financial stability area. It is seen as a complex, multifaceted task,
and any regular production of big data-based information is likely to take time, not least due to the lack
of transparency in methodologies and the poor quality of some data sources.
In view of these limitations, how can one make the most of big data looking forward? Key objec-
tives for public financial authorities are to better understand the specific structure of the new data-sets,
the methodologies needed for analyzing them, and their value added in comparison with “traditional”
statistics. Among central banks, this is mainly achieved by focusing on pilot projects to assess how new
big data information related to financial and economic topics can help to: (1) better monitor the economic
and financial situation; (2) enhance the effectiveness of policy measures, and; (3) assess the impact of
policy actions within the financial system and the economy at large. Possible tasks may well expand
beyond this simple list, as the development of big data is constantly creating new information/research
needs. Haldane (2018) argues for instance that big data can facilitate policy-makers’ understanding of
economic agents’ reactions through the exploration of behaviors in a “virtual economy”.
The present chapter tries to address these issues from various angles. The Background section dis-
cusses the peculiar nature of big data in the financial system. The second section (Main focus of chapter)
reviews its growing importance for those authorities working on financial stability issues. The third
section, Findings and analysis, reviews the challenges posed by handling and using big data. The last
section, Future research directions, highlights the implications of big data for information management
and the need for proper governance frameworks.

BACKGROUND

Looking for a Relevant Definition of Big Data in the Financial System

While “big data” has clearly become a buzzword in our societies, one obvious caveat is to define precisely
what it really is. Indeed, the literature provides little in the way of precise definitions for this term. What
volume of data is needed before one can classify information as “big”? What are the specific charac-
teristics, if any, of big data-sets? How fast does their content change over time? A further difficulty is
that this concept is a fluid one. “Big data”, as defined 10 years ago, may no longer seem “big” today,
owing to technological progress and the steady expansion of the daily volumes of data collected and of
the variety of data – for instance in terms of formats, such as pictures or texts, frequency (daily data,
tick data, trade-by-trade data) and quality. Furthermore, some big data sets available today may not be
available in a few years, for instance if public authorities decide to tighten privacy rules and reduce the
amount of personal data that can be re-used for other purposes by the companies collecting them.
One can start with a relatively broad approach and view big data as composed of data-sets that are a
by-product of commercial or social activities and that provide a huge amount of very granular informa-
tion. Yet there is no formally agreed definition that would cover all possible cases. For instance, it may
not be sufficient for a data-set to be large to qualify as “big data” – indeed, national statistical authorities
have been dealing with data-sets covering millions of records (for instance census data) for many decades
without branding them as “big data”. A key factor to consider is whether the data-set is structured and
can be handled with “traditional” statistical techniques, or if it is unstructured and requires new tools
to process the information.

From a conceptual perspective, one thus usually refers to a “big data-set” when:

1. The data-set is unstructured (and often quite large), as a by-product of a non-statistical activity – it
is “found data”, “produced organically”, in contrast to traditional data-sets, which are produced
for specific purposes and are, by design, clearly structured, or;
2. The data-set is made of a large volume of records that are relatively well structured but neverthe-
less difficult to handle because of their size, granularity or complexity. From this perspective, even
relatively “simple” structured data-sets (that may not fall into the category of big data stricto sensu)
can benefit from the application of big data tools – for instance IT architecture, software packages,
modelling techniques – to process the information more efficiently.

In practice, defining big data leaves room for judgment, and depends on a number of features such
as the following “Vs” (Laney, 2001):2

• Volume (meaning the number of records and attributes);


• Velocity (speed of data production, for instance in the case of tick data); and
• Variety (for instance structure and format of the data-set).

Some observers have added other “Vs” to this list, such as:

• Veracity (accuracy and uncertainty of big data-sets that usually comprise large individual records);
• Valence (interconnectedness of the data); and
• Value (the data collected are often a by-product of an activity and can trigger a monetary reward;
hence they may not be available as a public good, due either to commercial considerations or con-
fidentiality constraints, reflecting the fact that “the world’s most valuable resource is no longer oil,
but data” (The Economist, 2017)).

Other observers have added many dozens of other Vs. The key point is that this variety underscores that
the features characterizing big data can be very diverse. In addition, their actual information content
is also quite heterogeneous. The work conducted under the aegis of the United Nations Department of
Economic and Social Affairs provides a good starting point for classifying it (see Meeting of the Expert
Group on International Statistical Classifications, 2015). Under this approach, big data type of informa-
tion can be classified into three groups, as a product of:

• Social networks (human-sourced information, such as blogs, videos, internet searches);


• Traditional business systems (process-mediated data, such as data produced in the context of com-
mercial transactions, e-commerce, credit cards); and
• The internet of things (machine-generated data, such as data produced by pollution or traffic sen-
sors, mobile phone locational information, logs registered by computer systems).

Big Data and Financial Stability Issues

How do the discussions above translate in the financial stability world? Here too, it would be illusory to
try to provide a unified definition, and public financial authorities appear to have a varying understand-
ing and perception of what big data entails. But at least there appears to be a consensus for considering
that the concept should encompass the variety of large-scale, raw information, which requires more than
“traditional” statistical processes to be combined, processed and analyzed. Table 1 presents a number
of characteristics of “big data-sets” deemed relevant by those working on financial stability issues from
this perspective.
To simplify, two types of data sources can be identified. The first area corresponds to a rather restricted
view of big data, mainly limited to web-based indicators: this covers primarily the type of unstructured
data the private sector is dealing with. Public financial authorities such as central banks are clearly inter-
ested in this kind of indicator, for example using Google Trends for nowcasting (that is, the prediction
of the present) or assessing sentiment and expectations. But this first area – the “internet of things” type
of big data – is not really the core of their work: they are just starting to explore it, and this work is not
yet in production on a large scale.
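
As an illustration of the nowcasting use just mentioned, the sketch below regresses a quarterly activity indicator on a web search-interest index and applies the fit to the latest search reading in order to "predict the present". All figures, including the search index itself, are invented for the example; they are not actual Google Trends or central bank data.

```python
# Minimal nowcasting sketch: fit an activity indicator on a search-interest
# index and estimate the current quarter. All numbers are invented.
import numpy as np

search_index = np.array([42.0, 55.0, 61.0, 48.0, 70.0, 66.0, 73.0, 80.0])  # hypothetical index
activity = np.array([1.1, 1.4, 1.6, 1.2, 1.9, 1.7, 2.0])                   # only 7 quarters published

# Ordinary least squares on the quarters for which the official figure exists.
X = np.column_stack([np.ones_like(activity), search_index[:7]])
beta, *_ = np.linalg.lstsq(X, activity, rcond=None)

# "Predict the present": the search reading for the current quarter is already available.
nowcast = beta[0] + beta[1] * search_index[7]
print(f"Nowcast for the current quarter: {nowcast:.2f}")
```
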

Table 1. Concrete characteristics of “big data-sets” in the context of financial stability work

Issue | Relative importance (*) | Comments
High Granularity | + | Generally made of “granular” or “micro” bits of information that are usually produced by IT systems (for instance web search engines, IT records of financial operations)
Large Size | ++ | Very big size: large data-sets, various & different contents, high frequency (the three Vs: high volume, high velocity, high variety)
Variety of Sources | + | Data-sets drawing on multiple, potentially inconsistent sources (hence the importance of the “V” of “veracity”)
By-Product Nature | ++ | Usually, but not always, a by-product: data are “organic”, meaning that their availability results from business operations (for instance credit card operations) or personal activities (for instance web searches); in contrast, “standard” statistical data are usually compiled for specific purposes (“designed” data-sets)
Quality | ++ | Big data can pose specific quality issues. In particular, there is a risk that big data can be misleading (what is their exact veracity?) or even be manipulated (raising ethical challenges)
Access Cost | - | Big data are often available at low cost (with exceptions); what is really costly is to manage them
Management Complexity | ++ | Big data-sets are often “too large to be manipulated in a conventional way”; that is, big data comprise information that pose challenges to existing statistical systems
Information Extraction | ++ | Big data-sets usually need to be correctly filtered so as to extract appropriate intelligence (such as specific patterns) and inform decisions
Resources Needs | + | Data-sets generally felt to be difficult to manage in terms of IT resources and human skill mix
Storage | + | Data-sets often need to be adequately stored, not least because of their size and confidentiality issues (for instance privacy protection, rights to access/modify personal information, anonymization techniques)
Security | ++ | Storage difficulties are in turn raising IT security risks as well as ethical issues (for instance protection of personal information)

(*): from “-” (characteristic of secondary importance when dealing with big data, compared to “traditional” data sources) to “++” (very important); author’s assessment derived from IFC (2015).

The second area comprises large structured records of micro-level information, such as granular
“administrative data” – that is, the data compiled by authorities as a by-product of their (non-statistical
stricto sensu) work; for instance information derived from “official” central balance sheet data offices
(CBSOs) or central credit registers (CCRs)3 – and micro “financial information data” (for instance
security-by-security data-sets), as an integral part of “big data”. While these records are more structured
than those that are internet-based, their granularity and complexity are clearly important. Moreover,
their key features – not least in terms of degree of confidentiality and quality aspects – depend on their
public or private nature. In particular, “big data-sets” derived from (non web-based) “administrative
systems” and “private business systems” have been used in different contexts over time and raise differ-
ent types of issues.
This second, broader view is the one adopted in the rest of this chapter. Four main types of “big data-
sets” can be considered in the financial area from this perspective (Figure 1):

• Administrative data-set (such as corporate balance sheet data);


• Internet-based data-set (such as web search indicators);
• Commercial data-set (such as credit card operations);
• Financial market data (such as high frequency trading, bid-offer spreads).4

Big data-sets relevant for financial stability analyses are thus quite diverse. They comprise internet-
based data, but also large registers that can be derived as a by-product of various types of financial,
commercial or administrative activities. Certainly, this distinction can be artificial, since a significant –
and indeed increasing – part of the information collected on the web can be the result of such activities.
Indeed, the recent expansion of “Fintech” has highlighted the close interactions between the multiple
aspects of the financial sector that are impacted by parallel innovations, such as big data of course but
also mobile phone, internet, artificial intelligence (AI), et cetera. Fintech can be defined as: technology-
enabled innovation in financial services that could result in new business models, applications, processes
or products with an associated material effect on the provision of financial services (Financial Stability
Board (FSB), 2017) and was estimated to represent about USD 30 billion of global investment in 2016. It
is particularly known for its implications in terms of digital currencies such as Bitcoin (CPMI & Markets
Committee, 2018). But Fintech also covers much wider applications in the payments space, crowdfund-
ing, smart contracts, robot advice et cetera – and, of course, big data applications, for instance for credit
risk assessments and contract pricing. In turn, Fintech is posing new challenges for the supervisors of
financial institutions (BCBS, 2018).

Figure 1. Main big data sources in the financial area
Source: IFC (2015).

In any event, authorities working on financial stability issues have in practice to deal with a large
and heterogeneous range of “big data” sources derived from web-based, administrative, commercial
and financial data sources alike. The need to “match” these various data points is in turn adding further
complexity, requiring the use of common identifiers and rules (for instance to identify similarities or
control relationships between granular entities).
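
The matching problem described above can be illustrated with a minimal sketch: when no common identifier (such as a Legal Entity Identifier) links two registers, one often falls back on similarity rules over entity names. The register contents and the similarity threshold below are purely hypothetical.

```python
# Minimal sketch of matching entities across two registers without a common
# identifier: fall back on name similarity. Names and threshold are invented.
from difflib import SequenceMatcher

register_a = ["Alpha Bank Holding PLC", "Beta Insurance Group", "Gamma Securities Ltd"]
register_b = ["ALPHA BANK HLDG", "Gamma Securities Limited", "Delta Asset Management"]

def similarity(name_a: str, name_b: str) -> float:
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

for name_a in register_a:
    best_match = max(register_b, key=lambda candidate: similarity(name_a, candidate))
    score = similarity(name_a, best_match)
    if score > 0.6:  # arbitrary cut-off for the illustration
        print(f"{name_a!r} ~ {best_match!r} (similarity {score:.2f})")
```

Production systems would typically combine several attributes (legal form, country, partial identifiers) rather than names alone, which is precisely why common identifiers and rules simplify the task.
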

A Moving Target

A key conclusion from the developments above is that big data-sets are usually not directly produced
for a specific statistical purpose, as in the cases of traditional census or survey exercises. But, indirectly,
they can be usefully exploited (with some transformation) for statistical purposes.
From this perspective, the treatment of the raw, “organic” data is key, and many observers prefer to
refer to “smart data” instead of “big data”. To put it simply, there is no such thing as “big data” from
their perspective; there are data sources that can be exploited for addressing statistical information
needs that may independently exist. It is thus the interaction between the data available and the specific
information needs that explains the complexity and the multiple characteristics of the big data universe.
In addition, the use of specific data sources will often depend on the policy question at stake, so not all
countries will have the same practices. For example, the large data-sets related to banking payments
are of interest to central banks in their traditional supervisory role of payment systems; but the data can
also be used for economic analysis depending on country-specific contexts – for instance to track tour-
ism activity in those countries in the South of Europe where foreigners account for a significant part of
credit card operations.
Moreover, as regards complexity, public authorities are just at the beginning of making sense of the increasing volume and variety of data that could be accessed. A key reason is that big data-sets
have a micro-level structure that can be very complex and may evolve over time as new policy needs
arise. Sometimes one has to merge information from different sources and to deal with inconsistent
observations; or one may have to choose a particular information source; the way to aggregate granular
information may also evolve over time; lastly, such choices may depend on circumstances.
A good example is provided by the compilation of international debt issuance statistics by the Bank
for International Settlements (BIS).5 This is done by aggregating micro-level information derived from
large security-by-security data-sets. This data collection is based on a residency concept, meaning that
one can compute all the debt issued by the economic agents that are located in a given country – in line
with the System of National Accounts (SNA) principles (European Commission et al., 2009). But, in
addition, the BIS has developed similar statistics on a so-called “nationality basis”, by looking at all
the debt issued not only by a resident firm of a country but also by the foreign entities controlled by this
national firm – its affiliates can be located outside of the country, for instance in international financial
centers, and would therefore not be captured by the residency-based SNA framework (see Tissot, 2016). This
was less of an issue a few decades ago, but has become a key development to consider in the context of
the growing issuance of debt by global companies on a worldwide basis – the so-called “second phase of
global liquidity”, which has progressively replaced cross-border bank credit intermediation (Shin, 2013).


Confronting these two types of approaches, the residency and the nationality ones, can provide use-
ful insights to assess countries’ financial exposures. But constructing nationality-based statistics can be
quite challenging: one has to identify the perimeter of global firms, reclassify their individual units and
consolidate granular information at the group level. And this construction can be both time-dependent
(meaning that the structure of a business group will change over time as a result of the acquisition/sell-
ing of affiliates) and source-dependent (information provided by the various granular data-sets on debt
security issuance can differ). Dealing with such complexity requires the handling of large and complex
information, which could benefit from the use of big data techniques.
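By way of a concrete illustration of the two concepts, the minimal sketch below (in Python, with invented entities and amounts) aggregates a toy security-by-security data-set once by the issuer's country of residence and once by the country of the controlling parent; real compilations of course involve far richer attributes and careful work on group perimeters.

# Minimal sketch: residency- versus nationality-based aggregation of debt issuance.
# All names, countries and amounts are hypothetical.
import pandas as pd

issues = pd.DataFrame({
    "issuer":         ["Bank A", "Bank A (Cayman affiliate)", "Firm B"],
    "residence":      ["DE", "KY", "FR"],
    "parent_country": ["DE", "DE", "FR"],
    "amount_usd_bn":  [10.0, 5.0, 3.0],
})

# Residency basis (SNA concept): debt issued by entities located in each country.
by_residency = issues.groupby("residence")["amount_usd_bn"].sum()

# Nationality basis: debt consolidated at group level, attributing offshore
# affiliates to the country of the controlling parent.
by_nationality = issues.groupby("parent_country")["amount_usd_bn"].sum()

print(by_residency)    # DE 10.0, FR 3.0, KY 5.0
print(by_nationality)  # DE 15.0, FR 3.0
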

MAIN FOCUS OF CHAPTER

Over the recent decade, big data sources and opportunities have expanded markedly, owing to the
combination of three developments with major implications in the financial stability area: the internet
of things, digitalization, and the expansion of micro financial data-sets in the aftermath of the Great
Financial Crisis of 2007-09 (GFC). Public financial authorities have accordingly expanded their work significantly in these three areas (see Table 2).

The Internet of Things

Public financial authorities have gained significant experience in recent years in collecting various kinds
of information generated by the wide range of web and electronic devices, and for various purposes. The
related internet data-sets comprise various web-based indicators, such as search queries, the recording
of clicks on specific pages, the display of commercial information and text posted online, social media
messages, and so on.
A general lesson is that this increasing amount of data can be effectively used to complement official
statistics. One example is the collection of prices posted online by supermarkets, which allows components
of consumer inflation to be both estimated (complementing traditional collection techniques) and fore-
casted. To this end, authorities have to “scrape” from the internet sales price information that is regularly
posted by retailers with an online presence – web scraping referring to the capture of unstructured data on web pages and its transformation into structured data that can be processed with "traditional"
statistical tools. Since the CPI basket is quite large and in particular comprises goods and services that
are not traded online, such exercises are typically limited to specific inflation components – Hill (2018)
reports indeed that 15% of the US CPI is now collected online. Moreover, there are important difficulties: in particular, statisticians have to capture on the web not only unit-level prices but also the characteristics of the products being considered; further input on quantities is also needed to compute aggregated price indices with adequate weights; et cetera.
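As a purely illustrative sketch of the scraping-and-indexing step described above (the URL, page structure and CSS classes are hypothetical, and a production system would add product matching, weighting and quality adjustment), one could proceed along the following lines:

# Minimal sketch: collect posted prices from a (hypothetical) retailer page and
# compute an unweighted geometric mean of price relatives (Jevons-type index).
import requests
from bs4 import BeautifulSoup

def scrape_prices(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    prices = {}
    # Assumes each product sits in an element of class "product" containing
    # a name and a price tag -- a placeholder page layout.
    for item in soup.select(".product"):
        name = item.select_one(".name").get_text(strip=True)
        price = float(item.select_one(".price").get_text(strip=True).lstrip("$"))
        prices[name] = price
    return prices

def price_index(base: dict, current: dict) -> float:
    common = set(base) & set(current)           # products present in both periods
    relatives = [current[p] / base[p] for p in common]
    product = 1.0
    for r in relatives:
        product *= r
    return 100 * product ** (1 / len(relatives))
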
Nevertheless, these tasks can be easily automated and this approach has proved to be quite a robust and scalable process, requiring little in the way of maintenance costs. It can also be a very effective complement to "standard" statistical processes, for instance when the information is difficult to capture
through the existing apparatus, say for the price of fresh vegetables that can be very volatile. Moreover,
the data can be used not only to estimate inflation patterns, but also to forecast them in advance of actual
publication dates. Lastly, ongoing work such as the one conducted with the Billion Prices Project at the
Massachusetts Institute of Technology (MIT) argues that they can be used for enhancing international comparisons of price indexes in multiple countries and for dealing with measurement biases and distortions in international relative prices (Cavallo & Rigobon, 2016).

Table 2. Examples of big data projects conducted by central banks (big data areas, with types of data-sets and examples of projects)

Administrative Records
• Foreign trade operations / investment transactions: enhancement of balance of payments statistics (for instance non-resident travellers); exports forecasts.
• Taxation / payroll / unemployment insurance: monitoring of employment, wages, business formation (SMEs); identification of household unincorporated enterprises engaged in market production.
• Central balance sheet offices: corporate vulnerabilities assessment; modelling of corporate behaviour and performance (for instance SMEs).
• Loans registers (CCRs, credit bureaus, mortgage registers): entity-level credit information on loans / mortgages; measurement of credit risk, FX exposures.
• Financial market supervisors: network analysis of the financial system; financial exposures and fragilities.
• Public financial statements: corporate balance sheet analyses; (listed) corporate group-level supervision.
• Financial market activity indicators collected by public authorities: payments systems (for instance monitoring of tourism activities); trade repositories’ (TRs) transaction-by-transaction data.

Web-Based Indicators
• Internet clicks: Google searches.
• Media and social networks: analysis of twitter messages to assess confidence and economic sentiment.
• Digitalised content / text data: textual messages analysed for policy communication; analysis of expectations/sentiment of economic agents.
• Websites’ scraping: various uses.
• Job portals: forecasts of employment / activity.
• Prices posted directly on retailers’ and other firms’ websites: web-scraped information to measure specific components of the CPI as well as input/output producer prices; inflation nowcasting / forecasting; pricing strategy analysis.
• Real estate agencies: estimation of house price indices.

Commercial Data-Sets
• Credit card operations: payments patterns; tourism activity.
• Mobile operators: mobile positioning data (for instance international calls for monitoring travelers’ activity); financial inclusion (for instance use of mobile payments).
• Geo spatial information: tasks related to the national statistical system.

Financial Data-Sets
• Credit institutions: monitoring of balance sheet exposures; investor behaviour/expectations.
• Payments and settlement operations (including FX): network analysis; operational risks; market functioning.
• Securities issuance.
• Bid/ask spreads: market liquidity (FX, bond markets, equities).
• Custodians records: securities holding statistics.
• Tick-by-tick data: real-time analysis of financial markets’ patterns.

Source: Author’s update of IFC (2015).

Another, relatively close, growing area of attention relates to the collection of housing prices. Here
again, public agencies can scrape price information displayed by real estate agencies on their websites
in order to create a housing price index. The possibility of producing such indicators can be particularly
relevant for economies with less developed statistical reporting, since “traditional” data collection in
this sector can be hard to implement. Even in more advanced countries, statistics on property prices are
often derived from surveys available with a low frequency and sometimes for only a limited number of
cities. Another advantage is that the approach can help to capture the various housing characteristics
posted in web announcements, facilitating the calculation of quality effects through the use of hedonic
price methodologies – since property prices reflect various elements such as the location of the house, its
size, its degree of comfort et cetera. As for CPI measures, however, the challenge is to be able to scrape
this information – in particular the location and the characteristics of the properties – in a comprehensive
and structured way and also to weight it properly to produce synthetic indicators.
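A hedonic adjustment of the kind mentioned above can be sketched as a regression of (log) listed prices on observed characteristics; all figures below are invented, and an actual compilation would control much more carefully for location, quality and selection effects.

# Minimal sketch: hedonic regression of log listing prices on scraped characteristics.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical listings: floor area (m2), number of rooms, distance to centre (km).
X = np.array([
    [55, 2, 3.0],
    [80, 3, 5.5],
    [120, 4, 8.0],
    [65, 2, 1.5],
    [95, 3, 4.0],
])
posted_prices = np.array([310_000, 390_000, 520_000, 360_000, 450_000])

model = LinearRegression().fit(X, np.log(posted_prices))

# A quality-adjusted index compares the predicted price of a constant
# reference dwelling across successive collection periods.
reference_dwelling = np.array([[75, 3, 4.0]])
print(np.exp(model.predict(reference_dwelling)))
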
Real-side economic indicators can also be measured / forecasted using web-based information. There
has been a growing interest, for instance, in collecting job announcements posted by employment agencies on the web to compute leading indicators of business activity and/or of unemployment. The approach
can be at the overall level of the economy, or more focused on specific sectors – for instance, travelers are
increasingly using web services to book trips or accommodations, and this can provide useful informa-
tion to monitor tourism activity in some economies; internet activity can also be mobilized to monitor
consumption of durable goods (for instance cars); and so on. Yet, as indeed noted by Hill (2018) in the
case of the United States, the use of big data has been relatively incremental and limited, often targeted at
methodological improvements (for instance quality adjustment) and at reducing reporting lags. A similar
picture is provided by the experiences of national statistical agencies in other advanced economies, such
as Australia, Canada, New Zealand and a number of Nordic countries. One notable situation relates to large developing economies such as India, where collecting internet-based data is seen as a potentially
useful alternative to the organization of large surveys that would have to cover millions of reporters.
In general, a first advantage is that the (near) real-time availability of web data improves timeliness, compared to the usual lags observed in producing official statistics. This means that online data sources are often useful for "nowcasting exercises", a key step for macro-economic diagnosis
and forecasting.6 Second, the fact that these data are available more rapidly can help reduce both the
lags and the revisions that hamper the value of “traditional” official statistics, as argued by Hill (2018).
A third advantage is the possibility of capturing unsuspected patterns in the data: instead of inferring
statistical relationships, as with “traditional” statistical modelling, big data algorithms can allow a wide
range of effects to be incorporated (for instance, geographical factors, seasonal patterns, non-linearities,
lagged effects) without the need to make ex ante assumptions. A fourth advantage is that these techniques can be implemented easily and in an automated way using standard software packages – indeed, and in contrast
with common perceptions, the complexity of these techniques should not be overestimated (Mehrhoff,
2017); some of these techniques (e.g. principal component analysis) have been in the information man-
agement toolkit for many decades...
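To illustrate the point that such tools need not be complex, the sketch below extracts a common factor from a simulated panel of timely web indicators with principal component analysis and uses it in a simple nowcasting regression; actual applications would rely on properly calibrated dynamic factor or bridge models, and all data here are artificial.

# Minimal sketch: PCA factor from timely indicators, used to nowcast a target series.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
indicators = rng.normal(size=(36, 10))   # hypothetical monthly panel of web indicators
target = rng.normal(size=36)             # e.g. monthly activity growth, published with a lag

factor = PCA(n_components=1).fit_transform(indicators)      # common component
nowcast_model = LinearRegression().fit(factor[:-1], target[:-1])

# Nowcast the latest month, for which the indicators are already available
# but the official figure is not yet published.
print(nowcast_model.predict(factor[-1:]))
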
An increasing interest is also devoted to qualitative information derived from web search engines and
social media sources. Clicks on specific terms on the web, twitter messages or posts on social media can
be monitored to assess qualitative factors that can be difficult to capture with "traditional" statistics – for instance the sentiment and expectations of economic agents, their assessments of risks, their changing preferences, causality patterns and so on. This can help in assessing factors that play an
important role during episodes of financial stress but are quite difficult to model in a good way – say,
the impact of uncertainty, non-linearities and network effects. As an example, a substantial increase in
the number of times the keyword “inflation” appears in blogs may point to a shift in the public assess-
ment of inflation risks.
Yet there are several drawbacks. One is the limited interpretability of the relationships derived from
“black box” calculations, especially for public financial authorities that have to present a consistent “story”
when communicating their policies. Certainly, this issue is not new, and relates to the difficulties in us-
ing statistical algorithms / econometrics and mining data to derive conclusions that can be meaningful
from an economic perspective – for a recent discussion on this issue and how to combine deductive and
inductive approaches, see Haldane (2018). A long-standing issue, in particular, is the common confusion
between causality and correlation.
Another drawback is that accessing web-based information requires using new techniques (for instance
web-scraping tools) and methodologies to ensure that one can automatically and easily access the data.
But the choice of a given technique should depend on the questions at stake. If not, the risk is to develop
black boxes, which cannot deliver meaningful messages for policy use. Key is thus to explore the various
new techniques being made available, to work on specific projects, and perhaps more importantly to define exactly what policy questions are to be answered.7
A further difficulty is the need to collect information consistently over time; this is a particular issue
when scraping the prices of goods that are kept identical on the web only for a short period. Similarly,
advertisements can be posted for a long time even after the actual economic transaction has been settled – for instance, advertisements for houses tend to remain on the internet after the sale has occurred;
scraping contemporaneous information on the web may thus alter the consistent measurement of transaction
prices over time. Another point to consider from this perspective is that those big data sources currently
used may not remain available in the future, for instance due to potentially growing privacy concerns.
Lastly, there are specific data quality limitations that need to be properly considered. One example is
that announcement prices captured on the web differ from actual transaction prices (an issue of particular
importance for house prices, which are typically posted at a significantly higher value compared to settle-
ment prices). Another quality issue is the accuracy of the information that individuals (when it is not
robots!) input to the web. This information can be prone to errors, typos and self-fulfilling expectations
– for instance, "I am concerned by this because I am receiving increasingly worrying messages through
the social medias”. Yet another limitation is that the data are not well structured, especially compared
to traditional official statistics; for instance, details on the location of a transaction / job offer can be
difficult to get; or the underlying information can be collected several times, especially if it is not posted in an identical way on different websites – this can be the case if an identical property is advertised with
different characteristics depending on the real estate agencies…

Expanded Access to Digitalized Information

Another important development relates to digitalization (Organisation for Economic Co-operation and
Development (OECD), 2017). A growing amount of information, especially textual information, has been moving to the web in recent years while not being produced by internet activities strictly speak-
ing. In particular, many reference documents that used to be only available in paper format have been
digitalized and can be accessed and analyzed like other “web-based” indicators.


The large amount of textual information available on the web can be more easily and automatically
exploited compared to the past, especially through ad hoc “big data” techniques and tools. For instance,
text semantic analysis can be conducted in order to extract the contextual usage of specific words and
analyze similarities across a large sample of documents by building a “semantic similarity database”:
all the words are first extracted from the textual information of interest (for instance central banks’
statements available on the web); these words are then characterized by attributes covering various
dimensions; and similarities between two words can be measured by the proximity of these attributes.
In turn, one can compare the information content of text messages, and classify them in various groups.
For instance, the words that have the highest discriminative power could be selected to define the tone
of central banks’ messages – distinguishing communication phases with a “hawkish tone” from “dov-
ish” ones. Interestingly, such an identification can be tracked over time, helping to assess the influence
of changing circumstances / policy decisions.
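A stripped-down version of such a comparison can be obtained with standard tools, for instance by representing each statement as a weighted bag of words and measuring cosine similarity, with a crude word-list score for the tone; the sample sentences and the "hawkish"/"dovish" lists below are illustrative only.

# Minimal sketch: similarity between policy statements and a crude tone score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

statements = [
    "inflation pressures warrant a tightening of the policy stance",
    "subdued inflation justifies maintaining accommodative conditions",
]

tfidf = TfidfVectorizer().fit_transform(statements)
print(cosine_similarity(tfidf))            # pairwise similarity of the statements

HAWKISH = {"tightening", "pressures", "warrant"}
DOVISH = {"accommodative", "subdued", "maintaining"}

def tone(text: str) -> int:
    words = set(text.lower().split())
    # Positive score = more hawkish wording, negative = more dovish.
    return len(words & HAWKISH) - len(words & DOVISH)

print([tone(s) for s in statements])
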
In addition, these techniques can also be used to measure the impact of policy communication on
the expectations of economic agents. In the past, attention focused mainly on comparing outcomes in
financial markets with policy intentions, for instance, by conducting “event studies” around the times of
central bank decisions. Now, one can gauge the results of public policy communication / actions more
easily and in a more structured way, by analyzing the tone of the messages expressed in reaction by the
various stakeholders – for instance, articles in the press, blogs, posts on social medias. This allows for
having an objective measure of both the stance of public authorities’ communication and its perceived
impact. Indeed, there has been an increasing interest in applying such quantitative approaches to analyze
policy communication in the central banking community in recent years. In turn, such arrangements can
be used to assess how public expectations are formed, providing opportunities for fine-tuning policy
communication in a proactive way.

New Financial Statistics

A third key development has been the revolution in financial statistics observed since the GFC (Borio,
2013). Indeed, this revolution could be compared to what happened in the 1930s for the real economy
– at that time, the Great Depression clearly influenced the subsequent development of the national ac-
counts framework.8 Similarly, the recent financial crisis has triggered unprecedented efforts to collect
more information on the financial sector – especially in the context of the Data Gaps Initiative (DGI)
endorsed by the Group of 20 (International Monetary Fund (IMF) & FSB, 2009). Large, granular and
complex data-sets have come into high demand in this context, and the international community has
been increasingly recognizing the need to increase the sharing and accessibility of such information
despite various legal and operational barriers.9
One important consideration was that the GFC underlined the limitations of looking at a group of various
financial institutions in an aggregated way: it has become essential to better take into consideration those
institutions that are systemic on an individual basis (Bese Goksu & Tissot, 2018). In addition, authorities
are increasingly feeling the need to have a better sense of the distribution of macro indicators, to better
look at “fat tails” and go “beyond the aggregates”. For instance, central banks have been managing a
large and expanding amount of information collected at the level of individual institutions, transactions
and financial instruments since the GFC. They have also realized – like other statistical authorities; see
Bean (2016) in the context of the review of the Office of National Statistics in the United Kingdom – that a huge amount of information is already available and could be better exploited, especially through
the various administrative data-sets compiled in today’s economies.
The willingness to go beyond aggregated indicators and make better use of available/expanding
micro-level data-sets has been a key factor driving interest in big data. It also explains why many financial
public authorities have focused their work on those data-sets made of very granular information, derived
from various sources and which are clearly more complex compared to “typical” web-based ones. As a
result, their day-to-day work is requiring an increasing supply of information, in terms of volume, velocity
and variety; in turn, this information can be used to produce new types of indicators to support policy.
One good example is the rising demand for detailed loan-by-loan information, as credit registries have
become the largest data-sets maintained by some central banks – with important applications including
in particular the analysis of credit conditions to support monetary or financial policy. These data are
well structured, but they qualify as “big data-sets” since the reporting is highly granular (covering most
individuals and corporations applying for a credit), contains multiple attributes (for instance on the
debtor, the credit extended, the instrument used et cetera) and is often complex to analyze. In Europe,
for instance, the AnaCredit (“analytical credit data-sets”) initiative to develop a CCR under the aegis of
the European Central Bank has been leading to the collection of almost 200 attributes per data point on
a monthly basis (and on a daily basis for a subset). In the United States, the FRBNY (Federal Reserve
Bank of New York) Consumer Credit Panel presents detailed information on consumer debt and credit
derived from individuals’ reports (Lee & van der Klaauw, 2010). Security-by-security databases, which
have been in high demand in many countries since the GFC not least in the context of the DGI initiative,
also display similar complex characteristics.
Attention has also focused on information derived from tax registers. Authorities in many countries have
been building up large fiscal databases covering several years or decades of tax records and collected
at the level of individual taxpayers (households or firms). Even if such data-sets have often to be ano-
nymized for confidentiality reasons, a key advantage is their richness across the population of interest
– for instance, they allow the capturing of very small enterprises that are usually not publicly listed or
that can have difficulties in accessing providers of “formal” financial services like commercial banks.
Another important feature is that fiscal databases are usually collected regularly over a long period of
time, allowing for measuring and analyzing selected indicators in a dynamic way – for instance to better
understand the factors driving firms’ demography.
As supervisors of financial firms, public authorities can also make use of private sector experiences
in dealing with large data-sets. In particular, commercial banks’ risk management frameworks and the
related use of large micro data sources have clearly expanded since the GFC, not least because of the
need to comply with more stringent regulation. For instance, the production of "stress tests" to assess their ability to withstand episodes of financial tension requires an increasing amount of data points as well as
quite sophisticated quantitative tools. To this end, commercial banks now amass large data volumes at
a highly disaggregated level, say to measure changes in their portfolios over time, capture the drivers of
their risk profiles with sufficient sensitivity, and validate their modelling tools. As supervisors of these
entities, public financial authorities have to develop their expertise in these areas too (BCBS, 2015).
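The data volumes involved stem from applying scenarios exposure by exposure; a toy version of such a loan-level calculation is sketched below, with the portfolio, the risk parameters and the shock all being purely illustrative.

# Minimal sketch: loan-level stress test -- shock default probabilities and
# recompute expected losses on a granular portfolio.
import pandas as pd

loans = pd.DataFrame({
    "exposure": [1_000_000, 250_000, 600_000],   # outstanding amounts
    "pd":       [0.02, 0.05, 0.01],              # baseline probability of default
    "lgd":      [0.45, 0.40, 0.50],              # loss given default
})

def expected_loss(portfolio: pd.DataFrame, pd_shock: float = 1.0) -> float:
    # Scale default probabilities by the scenario shock, capped at 1.
    stressed_pd = (portfolio["pd"] * pd_shock).clip(upper=1.0)
    return float((portfolio["exposure"] * stressed_pd * portfolio["lgd"]).sum())

print(expected_loss(loans))                  # baseline expected loss
print(expected_loss(loans, pd_shock=2.5))    # adverse scenario
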


FINDINGS AND ANALYSIS

Big data raises a number of challenges for public financial authorities. First, the handling of big data-
sets requires significant resources and proper arrangements for managing the information. In addition,
using big data in policy-making creates opportunities but is not without risks.
These challenges may explain why public authorities’ actual use of big data is still limited, at least
in comparison to the private industry – for instance the well-known American technology companies
(“GAFAs”): Google, Apple, Facebook, Amazon. It also suggests that significant time and effort will be
needed before a regular production of big data-based information can be undertaken to support statistical
and analytical work on a large scale in the financial stability area.

Handling Big Data

The handling of big data raises a number of challenges. Significant resources and proper arrangements
for managing the information are required. The issues faced are mainly due to the sheer size of the data-
sets, their lack of structure and the often limited quality of raw data obtained from internet streaming,
large administrative records or other sources.
The statistical production process itself has to be adapted, usually requiring a very considerable amount
of work to appropriately collect, clean, reconcile and store the new data-sets. In particular, very granu-
lar data-sets are usually produced without the "standard" quality controls applied to "traditional" statistics, leading to a significant number of false/inconsistent/missing records. This can be cumber-
some for public authorities that put a lot of attention on quality issues (and arguably more than private
companies working on big data). In practice, they have usually tried to set up a clear and comprehensive
information process, distinguishing four main phases: data acquisition (for instance, downloading of the
information from the web); data preparation (for instance, detection of selected data attributes of inter-
est, data cleaning); data processing (for instance, removal of outliers or extraction of indices); and data
validation (combining the use of statistical tests, quality checks and judgment analyses).
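The four phases can be made explicit in code; the sketch below wires hypothetical acquisition, preparation, processing and validation steps into a single chain, purely to illustrate the separation of concerns rather than any particular institution's workflow.

# Minimal sketch: a four-phase information process for a scraped data-set.
import statistics

def acquire() -> list[dict]:
    # Placeholder for downloading records from the web or an administrative register.
    return [{"price": 9.9}, {"price": 10.1}, {"price": None}, {"price": 250.0}]

def prepare(records: list[dict]) -> list[float]:
    # Keep the attribute of interest and drop missing values.
    return [r["price"] for r in records if r["price"] is not None]

def process(values: list[float]) -> list[float]:
    # Remove crude outliers (here: more than five times the median).
    med = statistics.median(values)
    return [v for v in values if v <= 5 * med]

def validate(values: list[float]) -> list[float]:
    # Minimal quality check before the data are released for analysis.
    assert values, "no observations left after processing"
    return values

clean = validate(process(prepare(acquire())))
print(statistics.mean(clean))
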
A major area is IT. The implications of big data for information systems are potentially huge. There
are large processing costs and difficult and expensive technology choices have to be made. Moreover,
sophisticated statistical techniques are often required to derive meaningful information from such data
– see the increasing use of “big data algorithms”, “machine learning” techniques and “artificial intel-
ligence”, for instance to detect errors in reported attributes, find similarities between data points, match
individual records across datasets when no common identifier exists, and aggregate individual records
such as affiliate-level data within a corporate group. From this perspective, public financial authorities
are not in a very different situation from their private counterparts involved in big data work. Yet they may have fewer budget resources, despite the increasing need for implementing (costly) IT enhancements.
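Matching records without a common identifier is often approached with string-similarity techniques; the sketch below uses the standard library's sequence matcher on invented entity names, whereas production record linkage relies on much richer attributes and, increasingly, supervised models.

# Minimal sketch: fuzzy matching of entity names across two registers.
from difflib import SequenceMatcher

register_a = ["Alpha Bank AG", "Beta Finance Ltd", "Gamma Insurance SA"]
register_b = ["ALPHA BANK AKTIENGESELLSCHAFT", "Beta Finance Limited", "Delta Re"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# For each name in register A, keep the closest candidate in register B
# only if the similarity exceeds a threshold.
for name in register_a:
    best = max(register_b, key=lambda candidate: similarity(name, candidate))
    score = similarity(name, best)
    if score > 0.6:
        print(f"{name}  ->  {best}  (score {score:.2f})")
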
There are also issues in terms of confidentiality protection and security. Handling transaction-level,
potentially highly confidential, information may expose authorities to significant risk, an issue that is at the top of the agenda of public financial authorities. Reflecting the importance of the granular information requested from reporting entities and the risks of data leakages, operational incidents can lead
to significant privacy and legal issues, with potential financial consequences. For instance, a growing
concern among supervisory authorities is that the risk of not complying with data privacy rules may
increase with the development of big data and Fintech firms (BCBS, 2017). In Europe, attention has
recently focused on the privacy issues related to the large amount of data provided by users through their web-based activities, for instance internet searches and usage of social medias. The European
General Data Protection Regulation (GDPR) entered into force in 2018 with the aim of enhancing data
protection and privacy for individuals, providing greater control over their personal data passed to third
parties, addressing the issues posed by exporting such data, and streamlining the applicable regulatory
environment.10
Key is perhaps reputation risk – central banks are in a very specific position compared to private
companies from this perspective. For instance, if private information is reported but not protected ad-
equately, it can be very damaging for the reputation and credibility of any policy-making body. Turning
to internet-based information that is a by-product of commercial activities, its handling can also pose
significant legal, financial, reputational and ethical issues.
One particular issue from this perspective is that public statisticians have a tendency to be "cloud computing-averse", mainly because of the disclosure risks posed by confidential information. Most
prefer to operate in a “secluded” data environment. But this may well reduce the scope for public au-
thorities to benefit from new big data techniques developed in the marketplace – for instance, some
applications may be available only as part of a cloud-based solution. It also calls for substantial internal
organizational changes to better deal with big data, including the creation of internal centers for big data
statistics, “data lakes”, “internal clouds” and so on.
Another key area is staff. The necessary skills may not be available in-house, for instance as regards
IT, data science and methodology as well as legal expertise. Given the limited supply of graduates, especially
in mathematics, finance and statistics, public authorities may well face a “war for talent”, with intense
competition to attract highly-skilled staff – a competition they may find difficult to engage in with the
private sector. This could be a key obstacle, since skilled staff are a prerequisite for exploiting big data
opportunities as well as managing the associated risks. Moreover, skills shortage can also raise questions
around compensation and staff career paths, as well as management issues – one view, for instance, is
that the relatively important role played by economists in central banks’ managerial positions might well
be called into question by these developments.
Another difficulty is to enhance existing information management processes – not just the IT aspects
but also the whole process for producing relevant information out of the data points collected. Indeed,
authorities have been struggling to effectively handle the new and increasing amounts of web-based,
supervisory, statistical and financial markets information made available in the recent decade. As regards
financial supervisory data, for instance, supervisors’ “traditional” template-driven data collections have
to be replaced to effectively access granular, micro-level data from various different sources – at a rea-
sonable cost, in an automated way, and while keeping a consistent link between micro data points and
aggregated macro indicators. An increasing number of private service providers (the "regtech" industry)
are proposing their services to address these new opportunities, in particular to facilitate compliance
with regulations in the financial industry.
Dealing with these issues requires a greater harmonization of the various data-sets involved, and in
particular the development of adequate statistical standards, identifiers and dictionaries when collecting
and processing the data. Fortunately, public financial authorities have been at the forefront of interna-
tional efforts initiated in these areas, for instance to develop the global system of Legal Entity Identifiers
(LEIs – see Legal Entity Identifier Regulatory Oversight Committee (2016)) as well as automated data exchange standards such as XBRL (eXtensible Business Reporting Language), SDMX (Statistical Data
and Metadata eXchange – see IFC (2016a)) and ISO 20022 (for payments standards). Looking ahead,
efforts are being made to enhance the integration of the various IT systems among both authorities and reporting entities. Recent "fintech innovations" are also explored to facilitate secure data transfer
mechanisms. In particular the distributed ledger technology (DLT) is increasingly seen as an effective
tool for enabling network participants (the so-called “nodes” involved in a DLT arrangement) to securely
propose, validate and record information to a synchronized ledger distributed across the network (CPMI,
2017). The aim is that each financial transaction can be recorded in a batch of transactions (a “block”)
added to the chain comprising the full transactions’ history (the “blockchain”).11
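As a small example of how such common identifiers support automated quality checks, the sketch below verifies the two check digits of an LEI using the ISO 7064 mod 97-10 rule (letters mapped to 10-35 and the full 20-character code taken modulo 97); the identifier shown is a made-up string used purely for illustration.

# Minimal sketch: validating the two LEI check digits (ISO 7064 mod 97-10).
def lei_is_valid(lei: str) -> bool:
    lei = lei.strip().upper()
    if len(lei) != 20 or not lei.isalnum():
        return False
    # Map letters to 10..35 (int(c, 36) does this for alphanumeric characters)
    # and check that the whole number leaves a remainder of 1 modulo 97.
    digits = "".join(str(int(c, 36)) for c in lei)
    return int(digits) % 97 == 1

# Hypothetical 20-character identifier, for illustration only.
print(lei_is_valid("ABCDEF1234567890XX00"))
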

Using Big Data

Turning to their policy-making use, big data create opportunities but are not without risks. Their apparent benefits (in terms, say, of lower production costs or speed in producing information) should be balanced
against the potential large economic and social costs of misguided policy decisions that might be based
on inadequate statistics.
One key question is the extent to which indicators based on “big data” provide a more accurate picture
of economic reality. As noted above, separating noise from signal out of these data may be challeng-
ing. Perhaps more fundamental is the risk that analyses convey a false sense of certainty and precision.
An important issue that is sometimes overlooked is that the use of (very) large amounts of data is no
guarantee of accuracy. It is usually thought that, because of their large size, big data-sets provide by
construction a very good and reliable source of information. But one cannot really judge the accuracy
of a data-set if its coverage is unknown: the quality (or “statistical representativeness”) of an apparently
large non-random sample of data is not determined by its absolute number of records, but by its relative
size compared to the population of interest given the methodology used (Meng, 2014).
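The point can be illustrated with a small simulation: an apparently huge but non-randomly selected sample yields a worse estimate of a population mean than a modest probabilistic sample; the distribution and the selection rule below are, of course, entirely artificial.

# Minimal sketch: a very large biased sample versus a small random sample.
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=100.0, scale=15.0, size=1_000_000)

# "Organic" big data sample: only units above the median self-select into the
# data-set (say, heavier users of a platform) -- half a million observations.
biased_sample = population[population > np.median(population)]

# Designed sample: 1,000 units drawn at random from the same population.
random_sample = rng.choice(population, size=1_000, replace=False)

print(population.mean())       # true mean, around 100
print(biased_sample.mean())    # huge sample, systematically too high
print(random_sample.mean())    # small sample, close to the truth
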
This problem can be exacerbated by the organic nature of many large, non-random big data-sets, since
the data are often self-reported or are the by-product of social activities (for instance financial transactions,
internet clicks). As a result, the coverage bias of these samples is unknown and can be significant. For
instance, social media sources will yield information whose quality depends on differences in the usage
intensity of these social medias; that is, the less one uses them the less one is represented. Furthermore,
social medias information may reflect personal aspirations (the way people want to be seen) rather than
“real” facts – although some observers argue that people are less prone to disguise reality when acting
on the web, compared to when they have to fill in an administrative survey... Nevertheless, the key point
is that even extremely large big data samples (such as internet-based) may compare unfavorably with
(smaller) traditional probabilistic samples – that, in contrast, are precisely designed to be representa-
tive of the population of interest. In other words, using (very) large amounts of data is no guarantee of
accuracy, and there is a key misperception of the intrinsic value of big data from this perspective. For
public financial authorities, the fundamental risk is to base their actions on inaccurate data: this could
undermine both the effectiveness of their policies as well as their reputation/legitimacy.
A last issue is whether a widespread use of “big data” might systematically alter decision-making.
For instance, a greater ability to monitor the economy in real time is certainly of key interest for policy-
makers, but it might create a bias towards responding quickly and more frequently to news, encouraging
shorter horizons. Similarly, a greater reliance on "big data"-based analyses of public sentiment could lead to an excessive fine-tuning of policy communication based on perceived expectations rather than actual
economic developments.12


FUTURE RESEARCH DIRECTIONS

Proper information management frameworks are needed to make the most of big data. The various
challenges faced when handling and using big data, as analyzed above, raise an additional, sometimes
neglected risk: that of spending excessive time and resources on cumbersome activities – cleaning the
data-sets, organizing the underlying platforms et cetera – rather than on actually analyzing the information
collected. To address this risk, public authorities have tended to privilege step-by-step approaches, by
working on specific use cases, instead of “big bang” solutions. Yet it is also essential to focus on the big
picture: the need to make sense of the data and to have coherent information management frameworks.

Making Sense of Data

The combination of expanding internet services, digitalization and new post-crisis data needs means that
authorities are just at the beginning of making sense of the increasing volume, granularity and variety
of information one can access. A key point is that the involved data-sets can be quite complex.
Perhaps more fundamentally, it is important to distinguish between “data” and “information”; the
latter depends on processing the former. Up to now, "traditional" official statistics could be described
as “designed data”, since they were collected for a specified statistical purpose through adequate statisti-
cal processes such as surveys and censuses; the compilation of these data-sets was almost by definition
organized in order to extract meaningful information, a key difference to “organic”-type big data-sets
(Groves, 2011).
Today, with the increasing supply of organic data relative to that of designed data, the risk of confu-
sion between “data” and “information” is clearly an issue. The challenge is to complement, instead of
replace, designed data with (organic) big data-sets so as to maximize information-to-data ratios. This
calls for ensuring a continuum, from the collection of data to their statistical processing and their policy
use. Key is really to ensure that a vast amount of information is not just collected and prepared, it has
to be useful; “in other words, connecting the dots is as important as collecting the dots, meaning the
right data” (Caruana, 2017).
But doing so is not an easy task: the extraction of valuable information from the data collected re-
quires proper IT infrastructure and adequate statistical applications and skills, including through the use
of big data analytics; sometimes legal and human resources (HR) support in terms of skill-sets can also
be in demand; moreover, good co-ordination is required to ensure a consistent and holistic information
production chain.
Indeed, public financial authorities have increasingly started to rethink their information management
processes to be better able to access large data-sets and use big data techniques. For instance, central
banks have significantly revamped their information platforms to handle the new data collections initi-
ated after the GFC, reflecting their greater involvement in financial stability issues and/or supervisory
functions (Glass, 2016). Of course, it is difficult to identify a one-size-fits-all approach: much depends
in particular on the characteristics of the data collections considered, precise country circumstances,
and actual policy needs. The way public authorities organize their information management processes
– creation of a data warehouse or a “data lake”, appointment of a “chief data officer” in charge, set-
ting up of a “data strategy” and so on – will thus vary across jurisdictions. But what matters is less the
organization structure than the coherence of this information management process itself, to transform
“data” into (useful) “information”.


One illustration of this dilemma is the fact that information needs evolve over time, as authorities
need both to have a bird’s eye view of the financial system and to be able to zoom in on specific sectors
depending on circumstances; in other words, “to see the forest as well as the trees within it” (Borio, 2013).
For instance, the assessment of how fragilities are building up in the financial system will typically rely
on aggregated statistics; in contrast, resolution work in the aftermath of a financial crisis will require
much more timely and granular information (for instance firm-level supervisory data).
Another good example that reflects the complexity of processing large data-sets is the AnaCredit
project for building a European-wide loan-by-loan data-set, which has triggered a full rethink of central
bank information management frameworks (Schubert, 2016). In particular, attention has focused on
the rationalization of data collection and management; the need to harmonize the underlying statistical
concepts and ensure that consistent data can be used for multiple purposes; the set-up of a single entry
point for reporting and accessing the information; and the willingness to limit the associated reporting burden as much as possible. Overall, the project has proved instrumental in steering central banks’ attention to
the need to manage information in an integrated way within their institutions.

Information Governance Frameworks

In this context, decisions on data have clearly become of strategic importance for public financial authori-
ties. This has been reinforced with the development of big data, with the need to properly balance cost and resource implications as well as the various financial, legal and reputational risks. In a nutshell,
information management decisions can no longer be taken by statisticians alone, and strong policy sup-
port is increasingly needed. Potential use cases of big data have also expanded, covering “traditional”
short-term forecasting and nowcasting exercises as well as applications related to cybersecurity and the
safety of payments systems. For instance, supervisory authorities are increasingly interested in checking how financial institutions manage information related to IT- or cyber-security incident reports and previ-
ous examination deficiencies (Crisanto & Prenio, 2017). Needless to say, these monitoring efforts are
data intensive and will certainly require an extensive use of sophisticated techniques looking forward.
These various aspects put a premium on enhancing the governance of the related data-sets, in par-
ticular by clarifying the respective responsibilities of the various stakeholders involved – through the
establishment of clear guidelines for reporting entities, of standard protocols to compile the statistics, of
formal Memoranda of Understanding to share the data and specify access rights, of adequate metadata
documentation to support data users (including data catalogues and dictionaries), et cetera…13 This
would clearly help to address the various big data challenges faced by authorities. First are the legal,
financial and ethical issues posed by accessing (often private) information that is a by-product of com-
mercial activities. Then there are the operational, legal and reputational risks entailed in dealing with
transaction-level information that is potentially confidential; the responsibility for authorizing such data
collections, not least as regards aspects such as confidentiality protection, data ownership and privacy;
the degree of statistical accuracy of these data and the level of confidence in their sources; and even the
information content of data derived from self-generated activities – for instance, the information value
of the number of clicks on a specific topic will vary as these clicks are influenced by the search engines
and based on users’ (and robots’…) past searches, or the information provided on social medias may
reflect personal aspirations rather than objective factors...


What is still unknown is whether and how far these developments will trigger a change in the busi-
ness model of public financial institutions. Central banks are certainly relatively new in exploiting big
data, in contrast to the greater experience of national statistical offices (NSOs) in handling large and
confidential data-sets such as censuses and administrative records. A key reason is that central banks
have traditionally been data users rather than data producers (unlike most NSOs, which are typically not
using the data for policy purposes). The situation has clearly changed since the GFC, but the lessons
learned are mainly tentative, as big data sources of information are still under evaluation in most public
financial institutions.
From this perspective, cooperation (both internationally among authorities working on financial
stability issues as well as domestically among statistical authorities) should be expanded to facilitate the
exploration of the opportunities to access new and complementary big data sources. This reflects the
fact that there is a lot to learn from each other. It also puts a premium on collaborative work to explore
the synergies and benefits of using big data for policy purposes.

CONCLUSION

As for other sectors, “big data” is a key topic in data creation, storage, retrieval, methodology and
analysis in the financial stability area. Yet exploring it is a complex, multifaceted task, and any regular
and general production of big data-based information would take time, given the lack of transparency
in methodologies and the poor quality of some data sources. From this perspective, big data may create
new information/research needs, for which international cooperation could add value.
Looking ahead, a key way to monitor big data-related developments and issues – such as the method-
ologies for analysis, their value compared with “traditional” statistics, and the structure of big data-sets
– is to focus on concrete pilot projects and share these experiences (IFC, 2015).
In this endeavor, the following points can be highlighted:

• The flexibility and real-time availability of big data have opened up the possibility of extracting
more timely economic signals, applying new statistical methodologies, enhancing economic fore-
casts and financial stability assessments, and obtaining rapid feedback on policy impacts.
• The financial stability community appears increasingly interested in big data, but its actual use
has remained limited. One reason is that big data raises a number of operational challenges; in
particular, their handling requires significant resources and proper arrangements for managing the
information process effectively.
• Using big data for policy purposes is not without risks, such as that of generating a false sense
of certainty and precision. From this perspective, the apparent benefits of big data (in terms, say,
of lower production costs or speed in producing information) should be balanced against the
potential large economic and social costs of misguided policy decisions that would be based on
inadequate statistics.


REFERENCES

Basel Committee on Banking Supervision (BCBS). (2015). Making supervisory stress tests more mac-
roprudential: Considering liquidity and solvency interactions and systemic risk. Working Paper, no 29.
Author.
BCBS. (2018). Sound Practices - Implications of fintech developments for banks and bank supervisors.
Author.
Bean, C. (2016). Independent review of UK economic statistics. Academic Press.
Bese Goksu, E., & Tissot, B. (2018). Monitoring Systemic Institutions for the Analysis of Micro-macro
Linkages and Network Effects. Journal of Mathematics and Statistical Science, 4(4).
Bholat, D. (2015, March). Big data and central banks. Bank of England, Quarterly Bulletin. Retrieved
from https://ssrn.com/abstract=2577759
Borio, C. (2013). The Great Financial Crisis: setting priorities for new statistics. Journal of Banking
Regulation.
Carnot, N., Koen, V., & Tissot, B. (2011). Economic Forecasting and Policy (2nd ed.). Palgrave Macmil-
lan. doi:10.1057/9780230306448
Caruana, J. (2017). International financial crises: new understandings, new data. Speech at the National
Bank of Belgium, Brussels, Belgium.
Cavallo, A., & Rigobon, R. (2016, Spring). The Billion Prices Project: Using Online Prices for Measure-
ment and Research. The Journal of Economic Perspectives, 30(2), 151–178. doi:10.1257/jep.30.2.151
Cœuré, B. (2017). Policy analysis with big data. Speech at the conference on “Economic and Financial
Regulation in the Era of Big Data”, Banque de France, Paris, France.
Committee on Payments and Market Infrastructures (CPMI). (2017). Distributed ledger technology in
payment, clearing and settlement – An analytical framework. CPMI.
CPMI & Board of the International Organization of Securities Commissions (IOSCO). (2016, June).
Guidance on cyber resilience for financial market infrastructures. Authors.
CPMI & Markets Committee. (2018). Central bank digital currencies. Authors.
Crisanto, J. C., & Prenio, J. (2017). Regulatory approaches to enhance banks’ cyber-security frameworks.
FSI Insights on policy implementation No 2, Financial Stability Institute.
European Commission, International Monetary Fund, Organisation for Economic Co-operation and
Development, United Nations, & World Bank. (2009). System of National Accounts 2008. Authors.
Financial Stability Board (FSB). (2017). Financial Stability Implications from FinTech. Author.
Glass, E. (2016). Survey analysis – Big data in central banks. Central Banking Focus Report, 2016.
Retrieved from www.centralbanking.com/central-banking/content-hub/2474744/big-data-in-central-
banks-focus-report


Groves, R. (2011). Designed data and organic data. Director’s Blog of the US Census Bureau. Retrieved
from www.census.gov/newsroom/blogs/director/2011/05/designed-data-and-organic-data.html
Haldane, A. G. (2018). Will Big Data Keep Its Promise? Speech at the Bank of England Data Analytics
for Finance and Macro Research Centre, King’s Business School.
Hill, S. (2018, May 6). The Big Data Revolution in Economic Statistics: Waiting for Godot... and Gov-
ernment Funding. Goldman Sachs US Economics Analyst.
IFC. (2016a). Central banks’ use of the SDMX standard. IFC.
IFC. (2016b). The sharing of micro data – a central bank perspective. IFC.
IFC. (2017a). Proceedings of the IFC-ECCBSO-CBRT Conference on “Uses of central balance sheet
data office information. IFC Bulletin, 45.
IFC. (2017b). Proceedings of the IFC Workshop on “Data needs and statistics compilation for macro-
prudential analysis”. IFC Bulletin, 46.
IMF & FSB. (2015). The Financial Crisis and Information Gaps – Sixth Implementation Progress Report
of the G20 Data Gaps Initiative. Author.
International Monetary Fund (IMF) and FSB. (2009). The Financial Crisis and Information Gaps. Report
to the G20 Finance Ministers and Central Bank Governors.
Irving Fisher Committee on Central Bank Statistics (IFC). (2015). Central banks’ use of and interest
in ‘big data’. IFC.
Laney, D. (2001). 3D data management: controlling data volume, velocity, and variety. META Group
(now Gartner). Retrieved from https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-
Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Lee, D., & van der Klaauw, W. (2010). An Introduction to the FRBNY Consumer Credit Panel.
Staff Report no 479, November.
Legal Entity Identifier Regulatory Oversight Committee. (2016). Collecting data on direct and ultimate
parents of legal entities in the Global LEI System – Phase 1. Author.
Meeting of the Expert Group on International Statistical Classifications. (2015). Classification of Types
of Big Data. United Nations Department of Economic and Social Affairs, ESA/STAT/AC.289/26, May.
Mehrhoff, J. (2017). Demystifying big data in official statistics – it is not rocket science! Presentation
at the Second Statistics Conference of the Central Bank of Chile.
Meng, X. (2014). A trio of inference problems that could win you a Nobel Prize in statistics (if you help
fund it). In X. Lin, C. Genest, D. Banks, G. Molenberghs, D. Scott, & J.-L. Wang (Eds.), Past, present,
and future of statistical science (pp. 537–562). Chapman and Hall. doi:10.1201/b16720-50
Nymand-Andersen, P. (2015). Big data – the hunt for timely insights and decision certainty: Central
banking reflections on the use of big data for policy purposes. IFC Working Paper, no 14.


Organisation for Economic Co-operation and Development (OECD). (2017). Key issues for digital
transformation in the G20. Report prepared for a joint G20 German Presidency/OECD conference.
Schubert, A. (2016). AnaCredit: banking with (pretty) big data. Central Banking Focus Report.
Shin, H. (2013). The Second Phase of Global Liquidity and Its Impact on Emerging Economies. Keynote
address at the Federal Reserve Bank of San Francisco Asia Economic Policy Conference, Princeton, NJ.
The Economist. (2017, May 6). The world’s most valuable resource is no longer oil, but data. The
Economist.
Tissot, B. (2016). Globalisation and financial stability risks: is the residency-based approach of the
national accounts old-fashioned? BIS Working Papers no 587, October.
Tissot, B. (2017). Using micro data to support evidence-based policy. International Statistical Institute
61st World Statistics Congress.

ADDITIONAL READING

Bank for International Settlements (BIS). (2018). Cryptocurrencies: looking beyond the hype, BIS An-
nual Economic Report, chapter V.
Hammer, C., Kostroch, D., Quiros, G., & Staff of the IMF Statistics Department (STA) Internal Group
(2017). Big data: potential, challenges, and statistical implications, IMF Staff Discussion Note, Staff
Discussion Notes (SDN)/17/06, September.
Irving Fisher Committee on Central Bank Statistics (IFC). (2017). Proceedings of the IFC Satellite
Seminar on “Big Data” at the ISI Regional Statistics Conference 2017, IFC Bulletin, no 44, September.

KEY TERMS AND DEFINITIONS

AI: Artificial intelligence.
BCBS: Basel Committee on Banking Supervision.
BIS: Bank for International Settlements.
CBSO: Central balance sheet data office.
CCR: Central credit register.
CPMI: Committee on Payments and Market Infrastructures.
DGI: Data gaps initiative.
DLT: Distributed ledger technology.
FRBNY: Federal Reserve Bank of New York.
FSB: Financial Stability Board.
FSI: Financial Stability Institute.


G20: Group of Twenty (Argentina, Australia, Brazil, Canada, China, France, Germany, India, In-
donesia, Italy, Japan, Mexico, Russia, Saudi Arabia, South Africa, South Korea, Turkey, the United
Kingdom, the United States, and the European Union).
GAFA: Google, Apple, Facebook, Amazon.
GDPR: General data protection regulation.
GFC: Great financial crisis (2007-09).
HR: Human resources.
IFC: Irving Fisher Committee on Central Bank Statistics.
IMF: International Monetary Fund.
INEXDA: International Network for Exchanging Experience on Statistical Handling of Granular Data.
IOSCO: International Organization of Securities Commissions.
ISO: International Organization for Standardization.
IT: Information technology.
LEI: Legal entity identifier.
MIT: Massachusetts Institute of Technology.
OECD: Organization for Economic Co-Operation and Development.
OTC: Over the counter.
SDMX: Statistical data and metadata exchange.
SNA: System of national accounts.
TR: Trade repository.
XBRL: eXtensible business reporting language.

ENDNOTES
1. Central banks play an important role in the group of public financial authorities, not least because a number of them can cumulate the responsibilities of monetary policy-making, macro-prudential oversight and micro-financial supervision of banks and other financial institutions (the allocation of these various mandates depending obviously on each jurisdiction). Central banks have indeed recently expressed increasing interest in the topic of big data (IFC, 2015), in particular as regards their use for conducting policy (Bholat, 2015). See also IFC (2017b) for the data needs of macro-prudential authorities.
2. Not all observers agree on the precise list and definitions of the Vs one should consider; for a presentation, see Nymand-Andersen (2015).
3. CBSO data usually include information on balance sheet positions and income statements derived from corporations' financial accounts. In several countries, this granular information can be combined with various data-sets, for instance those from TRs – entities tasked to collect information on transactions in over-the-counter (OTC) derivatives markets – or from CCRs – centralized systems for collecting entity-level credit information on loans provided to the economy, usually managed by public authorities; see IFC (2017a).
4. From a different perspective (the real economic side), Hill (2018) distinguishes three broad categories: administrative data; social media and other unstructured data; and structured data-sets generated by private companies. The main difference with the categories proposed in this chapter is to separate "real" commercial data-sets (for instance for credit card operations), which are usually the property of private firms, from those financial market data-sets (for instance quotation prices), which consist of publicly available information (even though these data can be in practice collected (and re-sold) by specialized data vendors).
5. See www.bis.org/statistics/about_securities_stats.htm?m=6%7C33%7C638.
6. For the use of nowcasting in forecasting "bridge models", see for instance Carnot et al. (2011).
7. See in particular the various techniques presented at the ECB Workshop on Using big data for forecasting and statistics, organized in cooperation with the International Institute of Forecasters in April 2014 (retrieved from www.ecb.europa.eu/pub/conferences/html/20140407_workshop_on_using_big_data.en.html).
8. Retrieved from http://unstats.un.org/unsd/nationalaccount/docs/SNA2008.pdf. For an introduction to the national accounts framework, see, for instance, Carnot et al. (2011), Annex I.
9. See for instance the initiative to increase the sharing and accessibility of granular data, if needed by revisiting existing confidentiality constraints, in the context of the second phase of the Data Gaps Initiative (DGI) endorsed by the Group of 20 (IMF & FSB, 2015). Another important initiative is the International Network for Exchanging Experience on Statistical Handling of Granular Data (INEXDA), which is an international cooperative project to enable the exchange of experiences in the statistical handling of granular data for research purposes (see www.bundesbank.de/Navigation/EN/Bundesbank/Research/RDSC/INEXDA/inexda.html).
10. See the GDPR portal at www.eugdpr.org/.
11. For an introduction on these concepts, see the Additional Reading section.
12. On the potential improvements brought by big data and also the issues posed to policy-makers, see Cœuré (2017) in the case of central banks.
13. On recommended clarifications in this area to facilitate data sharing, see IFC (2016b).


Chapter 2
Big Data and Official Statistics
Steve MacFeely
The United Nations Conference on Trade and Development (UNCTAD), Switzerland

ABSTRACT
Over recent years, the potential of big data for government, for business, for society has excited much
comment, debate, and even evangelism. But are big data really the panacea to all our data problems
or is this just hype and hubris? This is the question facing official statisticians: Are big data worth the
investment of time and resources? While the statistical possibilities appear endless, big data also pres-
ent enormous challenges and potential pitfalls: legal, ethical, technical, and reputational. This chapter
examines the opportunities and challenges presented by big data and also discusses some governance
issues arising for official statistics.

INTRODUCTION

Over recent years the potential of big data for government, for business, for society has excited much
comment, debate and even evangelism. Big data have been described by Pat Gelsinger (2012) of EMC as the 'new science' with all the answers, as a paradigm-destroying phenomenon of enormous potential (Stephens-Davidowitz, 2017), and as a panacea to all our data problems. Official statisticians, already with a long history of using non-survey or secondary data, which are often very large in terms of volume, must decide whether big data is really something new, or just more of the same, only more so. On the one hand, some argue that
big data needs to be seen as an entirely new ecosystem and requires serious strategic rethinking on the
part of the official statistical community (Letouzé & Jütting, 2015) whereas others argue to the contrary
that big data is just hype and that big data are just data (Thamm, 2017). Perhaps psychologist Dan Ariely
(2013) was correct when he tweeted hilariously ‘Big Data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they
are doing it.’
Big data is the by-product of a technological revolution. In simplistic terms, one can think of big data
as the collective noun for all of the new digital data arising from our digital activities. Our increasing
day-to-day dependence on technology is leaving digital footprints everywhere. Those digital footprints
or digital exhaust offer official statisticians rich and tantalizing opportunities to augment or supplant

existing data sources or generate completely new statistics. With the computing power now available
these digital data can be shared, cross-referenced, and repurposed as never before opening up a myriad
of new statistical possibilities. Big data also present enormous statistical and governance challenges and
potential pitfalls: legal; ethical; technical; and reputational. Big data also present a significant expectations
management challenge, as it seems many hold the misplaced belief that accessing big data is straight-
forward and that their use will automatically and dramatically reduce the costs of producing statistical
information. As yet the jury is out on whether big data will offer official statistics anything especially
useful. Beyond the hype of big data, and hype it may well be1, statisticians understand that big data are
not always better data and that more data doesn’t automatically mean more insight. In fact more data
may simply mean more noise. As Boyd and Crawford (2012: 668) eloquently counsel ‘Increasing the
size of the haystack does not make the needle easier to find.’
This chapter will examine big data from the perspective of official statistics and outline some of the
opportunities, challenges and governance issues that they present. To understand the governance issues
involved with big data from the unique perspective of official statistics, it is useful to first define what we
mean by big data, administrative data and official statistics. Thereafter the chapter will look at sources
of big data, before examining the opportunities and challenges presented by big data. Before conclud-
ing, the chapter will briefly outline some of the governance structures that National Statistical Offices
(NSOs) and International Organisations (IOs) might need to consider putting in place if they intend to
harvest big data for the purposes of compiling official statistics.

BACKGROUND

Definitions

Big Data

What are big data? While some, such as Stephens-Davidowitz (2017), argue that big data is an inherently vague concept, it is nevertheless important to try to define it. This is important, if only to explain to readers that big data are not simply 'lots of data' and that, despite the name, size is not the defining feature. So if not size, what makes big data big? One of the challenges in trying to answer this
question is to understand that there is no rigorous or universally accepted definition of big data (Mayer-
Schonberger & Cukier, 2013).
Gartner analyst Doug Laney (2001) provided what became known as the three 'Vs' definition. He
described big data as being high-volume, high-velocity, and/or high-variety information assets that de-
mand cost-effective, innovative forms of information processing that enable enhanced insight, decision
making, and process automation. In other words, big data should be huge in terms of volume (i.e. at least
terabytes), have high velocity (i.e. be created in or near real-time), and be varied in type (i.e. contain
structured and unstructured data and span temporal and geographic planes). The European Commission
(2014) definition of big data; ‘large amounts of data produced very quickly by a high number of diverse
sources’ is essentially a summary of the 3V’s definition.
It seemed that the 3Vs definition was generally accepted, within official statistics circles at least, with
the United Nations Statistical Commission (2014: 2) adopting a very similar definition - ‘data sources
that can be described as; high volume, velocity and variety of data that demand cost-effective, innova-
tive forms of processing for enhanced insight and decision making.’ Tam and Clarke (2015) use a more
general definition, describing big data as potentially everything from traditional sources and new sources
that are becoming available from the “web of everything”.
Over the intervening years, the 3Vs have swollen to 10Vs2. Perhaps more usefully, Hammer et al. (2017: 8) selected a 5V definition (the original 3Vs plus an additional two Vs - volatility and veracity).
Veracity refers to the noise and bias in the data and volatility refers to the ‘changing technology or busi-
ness environments in which big data are produced, which could lead to invalid analyses and results, as
well as to fragility in big data as a data source.’ At first glance, the additional Vs may seem odd as they
are not per se defining characteristics of the data or intrinsic to it. Nevertheless, volatility and veracity
are extremely important additions, in particular for understanding the contribution that big data might
make to compiling official statistics. Certainly the ‘5Vs’ definition is more balanced and useful from an
analytical perspective than the 3Vs as it also flags some of the downside risks that prompted Borgman
(2015: 129) to note that using big data is ‘a path with trap doors, land mines, misdirection, and false
clues.’ But arguably a ‘6V’ definition that includes ‘value’, where value means that something useful is
derived from the data offers the best overall definition - introducing the idea of cost-benefit yet striking
a balance between parsimony and utility. This is extremely important, as the costs of investing in big
data must be weighed up against what it can deliver in practical terms. The Figure 1 summarises the
6Vs of big data relevant to official statistics.

Figure 1. The 6Vs of big data for official statistics
Source: Derived from Hammer et al. (2017)
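To make the cost-benefit logic of the 6Vs concrete, the short sketch below (in Python) shows one hypothetical way a candidate source could be scored against the six dimensions before any investment decision is taken; the source, scores and decision rule are invented for illustration only and do not reflect any official methodology.

# Illustrative only: a simple 6V screening checklist for a candidate big data source.
# The source name, ratings and decision rule are hypothetical assumptions.

from dataclasses import dataclass

@dataclass
class SixVAssessment:
    source: str
    volume: int      # 1 (low) to 5 (high)
    velocity: int
    variety: int
    volatility: int  # higher = less stable over time
    veracity: int    # higher = more noise and bias
    value: int       # expected usefulness for official statistics

    def worth_investigating(self) -> bool:
        # Hypothetical rule: expected value must outweigh the combined
        # risk posed by volatility and veracity problems.
        return self.value >= 3 and (self.volatility + self.veracity) <= 6

candidate = SixVAssessment(
    source="web-scraped consumer prices",
    volume=4, velocity=5, variety=3, volatility=3, veracity=2, value=4,
)
print(candidate.source, "->",
      "investigate further" if candidate.worth_investigating() else "deprioritise")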


In understanding big data from a statistics perspective, it is important to understand that like admin-
istrative data, big data is conceptually quite different to traditional survey or census data. They are a
collection of by-product data rather than data designed by statisticians for a specific statistical purpose. In
other words deriving statistics is a secondary purpose. This difference is perhaps obvious but profoundly
important. We find ourselves today in a situation, oddly reminiscent of the storyline in The Hitchhiker's Guide to the Galaxy (Adams, 1979), where we now have the answers but we are still struggling to define
the question. As big data is a by-product of our interactions with technologies that are evolving quickly,
we must accept that as a consequence big data are not a stable platform but a very dynamic one, and so
the definition is likely to require further refinement in the not too distant future.

Administrative Data

Many NSOs already make extensive use of administrative data. Blackwell (1985: 78) defined adminis-
trative or public sector data as ‘information which is collected as a matter of routine in the day-to-day
management or supervision of a scheme or service or revenue collecting system.’ Similarly, the United
Nations Economic Commission for Europe (2011) defined administrative data ‘as collections of data
held by other parts of government, collected and used for the purposes of administering taxes, benefits
or services.’ In other words across public services, a huge volume of administrative records are col-
lected, maintained and updated on a regular basis. These data pertain to the wide range of administrative
functions in which the state is involved, ranging from individual and enterprise tax payments to social
welfare claims or education or farming grants. Typically these administrative records are collected and
maintained at the lowest level of aggregation, i.e. transactions or interactions by individual taxpayers,
applicants, recipients of the state, making these data very rich from an analytical perspective (MacFeely
& Dunne, 2014). Brackstone (1987) identified four key distinguishing characteristics of administrative
data: (1) the data are collected by an agency other than an NSO; (2) the methodology and processing are controlled by an agency other than an NSO; (3) the data were collected for non-statistical purposes; and (4) the data have complete coverage of the target population.
Although administrative datasets are often very large (high volume), they are not typically considered
big data, in the sense that they are not updated in real time (low velocity), they tend to be relatively stable
(low volatility) and they are typically structured (low variety). It is worth noting that in 1997 Eurostat proposed a narrow and a broad definition of administrative data. The narrow view saw administrative data as comprising just public sector non-statistical sources whereas the wider definition included private
sector sources (this would presumably include big data). In 2000, the Conference of European Statisticians
adopted a definition of administrative data consistent with this wider concept (United Nations Economic
Commission for Europe, 2000). Administrative and big data share some important characteristics: both
are secondary data as neither is originally compiled for statistical purposes; and both may suffer from problems with veracity. But there are some important differences too: administrative data are typically national whereas many big data may be global or at least supra-national - this is an important difference as an NSO may be able to influence the content or quality of administrative data but has little influence over supra-national big data; big data are inherently more unstable and volatile than administrative data; and big data will in all likelihood comprise a greater variety of sources and types of data.


Official Statistics

It is also useful to define official statistics. The purpose of official statistics is to provide 'an indispens-
able element in the information system of a democratic society, serving the Government, the economy
and the public with data about the economic, demographic, social and environmental situation. To this
end, official statistics that meet the test of practical utility are to be compiled and made available on
an impartial basis by official statistical agencies to honour citizens’ entitlement to public information’
(United Nations, 2014: Principle 1). This is a demanding and ever more complicated role, as in addition
to measuring the traditional economic, social and environmental dimensions, new ones, such as peace and security and welfare, have emerged.
Official statistics can be national or international (or both). Official National Statistics are all statistics
produced by a NSO in accordance with the Fundamental Principles of Official Statistics (United Na-
tions, 2014), other than those explicitly stated by the NSO not to be official; and all statistics produced
by the National Statistical System (NSS) i.e. by other national organisations that have been mandated by
national government or certified by the head of the NSS to compile statistics for their specific domain.
So in practice, if an NSO produces statistics, they are, de facto, official unless stated otherwise. If another
organisation within a NSS produces statistics, then they are typically also considered official.
Official International Statistics are indicators, aggregates or statistics, produced by a UN agency or
other IO in accordance with the Principles Governing International Statistical Activities (Committee
for the Coordination of Statistical Activities, 2014). It is often necessary for a UN agency, or other IO,
to modify official national statistics that have been provided by an NSO or another organisation of the
NSS, in order to harmonise statistics across countries or to correct evidently erroneous values. Further-
more, in the absence of an official national statistic, a UN agency or other IO may compile estimates.
Thus, it is insufficient to define official international statistics as simply the reproduction of official
national statistics.

Sources of Big Data

In a world where our increasing day-to-day dependence on technology is leaving significant digital
footprints, it seems that just about everything we think or do is now potentially a source of data. Big data
are being generated from a bewildering array of activities and transactions. Our spending and travel pat-
terns, our online search queries, our reading habits, our television and movies choices, our social media
posts - everything, it seems, now yields data. Some examples: big data were generated by the 227.1 bil-
lion global credit/debit card purchase transactions made in 2015 (Nilson Report, 2018). The 7.7 billion
mobile telephone subscribers in 2017 around the world (International Telecommunications Union, 2017)
also unwittingly created big data every time they used their phone. In fact even when they didn’t use
their phones, they were still generating data - according to Goodman (2015) mobile phones generate 600
billion unique data events every day. Every day we send 500 million tweets (Krikorian, 2013), 8 billion
snapchats (Aslam, 2015) and conduct 3.5 billion Google searches. Every day approximately 1.25 million
trades are made on the London Stock Exchange (Statistica, 2018a) and almost 4 million group trades on
the New York Stock Exchange (2018). Each one of these transactions leaves at least one digital footprint,
but usually several, from which new types of statistics can be compiled. In fact as Stephens-Davidowitz
(2017: 103) explains, today ‘Everything is data.’ The torrent of by-product data being generated by our
digital interactions is now so huge it has been described variously as a data deluge; data smog; info-glut
or the original information overload. This deluge is also the result of an important behavioral change,
where people now record and load content for free. Every day we upload 1.8 billion images (Meeker,
2017), approximately 1.7 million TripAdvisor reviews (Statistica, 2018b) and every minute of every day
we upload 400 hours of video to YouTube (Taplin, 2017). Weigand (2009) described this phenomenon, whereby people actively share or supply data directly to social networks and product review sites and which led to the evolution of the wiki model, as a 'social data revolution'. For this reason the European Com-
mission (2014) notes that big data can be created either by people or generated by machines.
Not only have the sources changed, the very concept of data itself has changed - ‘the days of structured,
clean, simple, survey-based data are over. In this new age, the messy traces we leave as we go through
life are becoming the primary source of data’ (Stephens-Davidowitz, 2017: 97). Now data includes text,
sound and images, not just neat columns of numbers. This begs the question: in this digital age, how much data now exists? Definitional differences again make this a difficult question to answer, and consequently there are various estimates to choose from. Hilbert and Lopez (2012) estimated that 300 exabytes (or slightly less than one third of a zettabyte, where one zettabyte is 10²¹ bytes) of data were stored in 2007. According to their
2017 Big Data factsheet, Waterford Technologies (2017) estimated that 2.7 zettabytes of digital data exist.
Goodbody (2018) states that 16 zettabytes of data are produced globally every year and that by 2025 this figure is predicted to have risen to 160 zettabytes annually. IBM (2017) now estimates we
create an additional 2.5 exabytes of data every day.
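As a purely illustrative cross-check of the orders of magnitude quoted above (the estimates themselves belong to the cited sources), an exabyte is 10¹⁸ bytes and a zettabyte is 10²¹ bytes, so 300 exabytes is 0.3 zettabytes and a creation rate of 2.5 exabytes per day amounts to roughly 0.9 zettabytes per year:

# Back-of-the-envelope unit conversions for the estimates cited in the text.
# Decimal prefixes are assumed: 1 EB = 1e18 bytes, 1 ZB = 1e21 bytes.

EXABYTE = 10**18
ZETTABYTE = 10**21

stored_2007_eb = 300        # Hilbert and Lopez (2012)
daily_creation_eb = 2.5     # IBM (2017)

print(stored_2007_eb * EXABYTE / ZETTABYTE)           # 0.3 ZB, "slightly less than one third"
print(daily_creation_eb * 365 * EXABYTE / ZETTABYTE)  # ~0.91 ZB created per year at that rate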
Although the definitions and consequent estimates differ, it is clear that a massive volume of digital data
now exists. But as Harkness (2017: 17) wisely counsels, the ‘proliferation of data is deceptive; we’re
just recording the same things in more detail’. Nor are all of these data necessarily accessible or of good
quality. As Borgman (2015: 131) warns, big data must be treated with caution. A threat ‘to the validity
of tweets as indicators of social activity is the evolution in how online services are being used. A grow-
ing proportion of twitter accounts consists of social robots used to influence public communication….
As few as 35 percent of twitter followers may be real people, and as much as 10 percent of activity in social networks may be generated by robotic accounts.' Furthermore, Goodman (2015) states that 25%
of reviews posted on Yelp are bogus. Facebook themselves have admitted that 3% of accounts are fake
and an additional 6% are clones or duplicates, the equivalent of 270 million accounts (Kulp, 2017).
Taplin (2017) also states that 11% of display ads, almost 25% of video ads, and 50% of publisher traffic
are viewed by bots not people - ‘fake clicks.’ Goodman (2015) refers to these bots as WMD - Weapons
of Mass Disruption.
There are also issues of coverage, as sizeable digital divides exist. For example, the International
Telecommunication Union (2017) estimates that global Internet penetration is only 48% and global mobile
broadband subscription 56%, although they are as high as 97% in the developed world. Notwithstanding
that global coverage is improving rapidly, it still means that in 2017 almost half of the world’s population
did not use the web. The digital divide - limited access and connectivity to the web or mobile phones
is creating a data divide. To quote William Gibson (2003) ‘The future is already here - it’s just not very
evenly distributed.’ Anyone excluded from web access or mobile phones will not have a digital footprint
or at best, a rather limited one. Even within countries, digital divides exist arising from a range of access
barriers - gender, social, geographic or economic - that may lead to important cohorts being excluded,
with obvious bias implications for statistics (see Struijs et al., 2014). The question being asked by NSOs
is whether these data can be safely accessed and whether they are representative and stable enough to
be used to compile official statistics.


USING BIG DATA TO COMPILE OFFICIAL STATISTICS

Opportunities for Official Statistics

There will almost certainly be opportunities in the future to compile official statistics in new and excit-
ing ways. Assuming access problems can be overcome then big data offer the potential to contribute to
the measurement of official statistics in a number of ways. According to the Big Data Project Inventory
(United Nations Statistics Division, 2018) compiled by the UN Global Working Group on Big Data,
34 NSSs from around the world have registered 109 separate big data projects3. NSOs and agencies are
attempting to use a variety of sources, including: satellite imagery; aerial imagery; mobile phone data;
data scraped from websites; smart meters; road sensors; ships identification data; public transport usage
data; social media; scanner data; health records; patent data; criminal records; Google alerts; and credit
card data to compile a wide range of official statistical indicators. These include: improving registers;
compiling mobility, transport and tourism statistics; road safety indicators; price indices; indicators on
corruption and crime; energy consumption; population density and migration; nutrition; land use; well-
being; measures of remoteness; and labour market and job vacancy statistics. From Table 1, it is clear
that NSSs are mainly targeting web scraping, scanner and mobile phone data in particular - accounting
for half of the sources being used. It should be noted that several projects are speculative or aspirational,
where the big data source has not yet been identified or where access to the data (particularly mobile
phone/CDR) has not yet been secured. Improving price indices using scanner data or prices scraped from
the web are by far and away the most popular projects. This is not surprising as these approaches have
been in development for many years4 and typically have fewer data access problems.
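To illustrate why scanner and web-scraped prices are so attractive for price statistics, the sketch below computes a simple elementary price index from a handful of invented price quotes using the Jevons formula (the geometric mean of price relatives), one commonly used formula for such data; the products and prices are hypothetical and real compilation involves far more filtering and classification.

# Illustrative sketch: an elementary price index from scraped or scanner price quotes
# using the Jevons formula (geometric mean of price relatives).
# Product identifiers and prices are invented for the example.

from math import prod

base_prices = {"milk_1l": 0.95, "bread_800g": 1.20, "eggs_12": 2.10}
current_prices = {"milk_1l": 0.99, "bread_800g": 1.25, "eggs_12": 2.05}

# Only items priced in both periods enter the index (a matched sample).
matched = [item for item in base_prices if item in current_prices]
relatives = [current_prices[item] / base_prices[item] for item in matched]

jevons_index = 100 * prod(relatives) ** (1 / len(relatives))
print(f"Jevons elementary index: {jevons_index:.1f} (base period = 100)")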
IOs, most particularly the World Bank and the United Nations Global Pulse, are also investigating
big data. They have logged 91 projects on the Big Data Project Inventory. Here too, a wide variety of
big data sources are being explored: mobile phone CDR records; Wikipedia; Google Trends; scanner
data; web scraping; road sensor data; satellite imagery; credit card transactions; ATM withdrawals data;
online purchases; aerial imagery; financial transaction data; taxi GPS data; freight data; medical insur-
ance records; criminal records; building certification data; OpenStreetMap; Twitter; publicly available

Figure 2. Big Data sources and project topics by National and International Organisations
Source: Authors' own calculations derived from UN Big Data Project Inventory5


Facebook data; social media; aid data; bus fleet AVL data; and electricity data. These big data may be
used in conjunction with or as a replacement for traditional data sources to improve, enhance and comple-
ment existing statistics. Figure 2 suggests that IOs are targeting social media and mobile phone records
to try and address issues regarding transport, poverty and disaster mitigation. While the Big Data Project
Inventory does not provide an exhaustive list of all the big data projects underway, it nevertheless gives
a good sense of the types of new data sources being trialed and the types of statistical gaps that NSOs
and IOs are hoping to fill. In years to come, developments, such as, the Internet of Things6, biometrics
and behaviometrics will all surely present other opportunities to compile new and useful statistics.
Big data may offer new cost-effective or timely ways of compiling statistics or offer some relief to
survey fatigue and burden. Big data also offers the tantalizing potential of being able to generate more
granular or disaggregated statistics, allowing for more segmented and bespoke analyses, or the possi-
bility of generating completely new statistics. That said, Kitchin (2015) sounds a cautionary note with
regard to generating new statistics, noting that it is important that NSOs don’t allow mission drift where
big data drives their direction i.e. falling into the trap of measuring what is easy rather than what needs
to be measured. Notwithstanding this risk, the opportunities presented by big data for official statistics
can be summarized as entirely or partially replacing existing data sources with new big data sources
to compile existing statistics in a more efficient, timely or more precise way or compiling entirely new
statistics altogether.

Linked Data

Big data also offer the potential to compile datasets that are linkable, opening the way to more transversal analyses that may help us to better understand causation. One of the shortcomings of many existing official statistics is that each statistic is compiled discretely, and typically de-
rived from a sample. While this bespoke approach offers many advantages regarding bias, accuracy and
precision, it has the disadvantage that as discrete data, those data cannot be easily connected or linked
(other than at an aggregate level) with other data. It is not always possible subsequently to construct a
comprehensive analysis or narrative for many complex phenomena. As big data sets are more likely
to have full or universal coverage, then provided there are common identifiers, the potential to match
those data with other datasets increases enormously, greatly expanding their analytical power.
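A minimal sketch of the kind of linkage described above, assuming two illustrative data sets that share a common (already pseudonymised) identifier; the column names and values are invented, and in practice such linkage would be subject to the legal and confidentiality safeguards discussed later in this chapter.

# Illustrative record linkage on a shared identifier using pandas.
# Dataset contents and column names are hypothetical.

import pandas as pd

energy = pd.DataFrame({
    "unit_id": ["A1", "A2", "A3"],   # common pseudonymised identifier
    "smart_meter_kwh": [310, 275, 440],
})
travel = pd.DataFrame({
    "unit_id": ["A1", "A3", "A4"],
    "transport_trips": [42, 18, 55],
})

# An inner join keeps only units present in both sources; the size of the
# overlap is itself useful information about coverage and representativity.
linked = energy.merge(travel, on="unit_id", how="inner")
print(linked)
print(f"linkage rate: {len(linked) / len(energy):.0%} of energy records matched")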

Improved Timeliness

The possibility of improving timeliness by utilizing big data is enormously attractive. Policy makers
require not only long-term structural information but they also require up-to-date, real time informa-
tion - particularly during emergencies such as natural disasters or economic crises. Official statistics
has generally been very good at providing the former but rather poorer at the latter. This has been
a long standing criticism of official statistics. In the words of the United Nations Secretary-General’s
Independent Expert Advisory Group on a Data Revolution for Sustainable Development (2014: 22) ‘Data
delayed is data denied…The data cycle must match the decision cycle.’ This presupposes, of course, that
the public policy cycle has the capacity to absorb and analyse more voluminous and timely statistics - it
is not always clear that that is the case. Nevertheless, big data offers the possibility of publishing very
current indicators, using what Choi and Varian (2011: 1) describe as ‘contemporaneous forecasting’ or
‘nowcasting.’ This offers the possibility of identifying turning points much faster, which, from a public
policy perspective could be critical to making better decisions. This will be of critical importance for
containing not only pandemics but also financial crises, and for reacting quickly to natural disasters.
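To make the nowcasting idea concrete, the toy sketch below regresses an official indicator on a timelier big data proxy, in the spirit of Choi and Varian's 'predicting the present'; the series are synthetic and a single-predictor ordinary least squares model is only the simplest possible illustration.

# Toy nowcasting sketch: regress an official indicator on a timelier
# big data proxy (e.g. a search-interest index). All numbers are synthetic.

import numpy as np

search_index = np.array([52.0, 55.0, 61.0, 58.0, 64.0, 70.0])  # available almost in real time
official_growth = np.array([1.1, 1.3, 1.8, 1.5, 1.9, 2.3])     # published with a lag

# Ordinary least squares fit: official_growth ~ a + b * search_index
X = np.column_stack([np.ones_like(search_index), search_index])
coef, *_ = np.linalg.lstsq(X, official_growth, rcond=None)

latest_search = 67.0  # newest observation, for which the official figure is not yet out
nowcast = coef[0] + coef[1] * latest_search
print(f"nowcast of the not-yet-published figure: {nowcast:.2f}")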

International Production

Many of the digital data we are discussing are global or at least supra-national in scope. This globalized
aspect of big data offers exciting, although strategically sensitive, opportunities to reconsider the national
production models currently employed by NSOs and NSSs. Switching from a national to a collaborative
international production model might well make sense from the perspective of improving efficiency and
international comparability, but it would be a dramatic change in approach, and possibly a step too far
for many NSOs and governments. As MacFeely (2016) notes, current modernization initiatives can be
summarized as attempts to de-silo legacy production systems. However, in most cases, these attempts
to de-silo are done within the constraints of national silos, that is, each country is attempting to de-silo
independently. Nevertheless, in the case of global digital data, the most logical and efficient approach
might be to centralize statistical production in a single centre rather than replicating production many
times over in individual countries. Obviously, this would not work for all domains, but for some new
statistics being developed using globalized big data sets it would offer the chance of real international
comparability. Some examples of this might be land use, maritime and fishery statistics derived from
satellite imagery. Such an approach poses some difficult questions, not least legal. Globalized data pres-
ent a particular challenge as they defy national sovereignty, putting the owners and the data themselves
beyond the reach of national legal systems. Governments cannot always enforce national laws or ensure
their citizens are protected.

Jump Ahead

Across the world there exists not only a ‘digital divide’ but also a significant ‘data divide.’ For many
developing countries, the provision of basic statistical information remains a real challenge. The Global
Partnership for Sustainable Development Data (2016) notes that much of the data that does exist is ‘in-
complete, inaccessible, or simply inaccurate.' Jerven (2015) has also been very critical of the quality of statistics available for many developing countries. Most analysts agree that despite significant progress, serious problems with data quality and availability persist. Some (Long & Brindley, 2013; Korte, 2014; Ismail, 2016) have argued that, owing to the falling costs associated with technology, big data may offer developing countries opportunities to skip ahead and compile next-generation statistics.
Examples such as the massive growth of M-Pesa mobile money services in countries like Kenya, where
almost half of the population use it, lend some credence to this argument (Donkin, 2017). Neverthe-
less, others (Mutuku, 2016; United Nations Conference on Trade and Development, 2016; MacFeely &
Barnat, 2017; Runde, 2017) have cautioned that in order to do so, there will need to be improved access
to computers and internet, significant development in numeric and statistical literacy, and in basic data
infrastructure. There are also concerns that as statistical legislation and data protection are often weak
in many parts of the developing world, focusing on big data before addressing these fundamental issues
might do more harm than good.


Better Data

Big data may, in some cases, be better data than survey data. Seth Stephens-Davidowitz (2017) makes
a compelling argument in his book Everybody Lies that the content of social media posts, social media
likes and dating profiles is no more (or less) accurate than what respondents report in social surveys.
However, big data has other types of data available - data that are of much superior quality. He argues
that ‘the trails we leave as we seek knowledge on the internet are tremendously revealing. In other words,
people’s search for information is, in itself, information’ (2017: 4). He describes data generated from
searches, views and clicks as ‘digital truth.’ Thus, big data may be able to provide more honest data with
greater veracity than could ever be achieved from survey data. Hand (2015) makes a similar argument,
noting that as big data are transaction data they are closer to social reality than traditional survey and
census data that are based on opinions, statements or recall.

Data Broker

Finally, big data may offer the United Nations an opportunity to exercise some leadership and regain some
control over an increasingly congested and rapidly fragmenting information space. Two opportunities
spring to mind. Firstly, NSOs and IOs may find opportunities in rethinking and repositioning their role
within the new data ecosystem. In the next section the challenges of accessing proprietary big data are
discussed. This is a challenge, not only for NSOs, but for all sorts of institutions hoping to use big data.
The United Nations Secretary-General’s Independent Expert Advisory Group on a Data Revolution for
Sustainable Development (2014) argued there is a role for someone - presumably the UN - to act as a
data broker, to facilitate the safe sharing of data. At the global level, the UN would seem to be the sensible
body. The 2017 Bogota Declaration on Big Data for Official Statistics (United Nations Statistics Division,
2017) hinted at a similar possibility, recommending a ‘marketplace’ for sharing data as a public good.
But perhaps at national level, there is a role also for NSOs to act as a trusted 3rd party intermediary, or
honest broker, where big datasets could be housed, curated, anonymised and disseminated under strict
and controlled conditions. This would be similar to the approach many NSOs already take for the release
of anonymised microdata. If such a mechanism were available, it might encourage big data owners to
release at least a sample of their data for statistical purposes.

Accreditation

Secondly, ‘Statistical agencies could consider new tasks, such as the accreditation or certification of
data sets created by third parties or public or private sectors. By widening its mandate, it would help
keep control of quality and limit the risk of private big data producers and users fabricating data sets
that fail the test of transparency, proper quality, and sound methodology’ (Hammer et al, 2017: p.19).
Similar proposals have been floated in the past (see Cervera et al. 2014; Landefeld, 2014; Kitchin, 2015;
MacFeely, 2016). NSOs and IOs could adopt a new proactive approach and introduce an accreditation
system (with uniform standards) that would allow un-official compilers of statistical series to be ac-
credited as ‘official’. Compilers of unofficial national indicators would need to demonstrate adherence
to the UN Fundamental Principles of Official Statistics. To secure global accreditation adherence to the
Principles Governing International Statistical Activities (Committee for the Coordination of Statistical
Activities, 2014) would be required. Indicators would also be required to meet a pre-defined set of qual-
ity and metadata standards, such as those set out in the UN Statistical Quality Assurance Framework
(Committee of the Chief Statisticians of the United Nations System, 2018) and the Common Metadata
Framework (United Nations Economic Commission for Europe, 2013) respectively. Perhaps such an
initiative would create sufficient incentive for big data holders to open up and reveal their metadata and
help to make the idea of a multi stakeholder data ecosystem a reality. Such a move would not be without
risks: legal; reputational; or equity. Landefeld (2014) also points out that such a move might face its share of resistance, based on ideological grounds, challenging the right of government to impose more regulation. Ironically, Hayek (1944), the grandfather of neo-liberal doctrine, understood the role
of government in regulating weights and measures.

New or More Granular Statistics

Big data also potentially offer a variety of other benefits and opportunities. The variety offered by big data
provides not only new data sources but the promise of new types of data. These alternative or substitute
data sources may offer a mechanism to relieve survey fatigue and burden to households and businesses.
Given the exhaustive nature or massive volume of big data, they also offer opportunities to improve
existing registers (or develop completely new ones) that could improve sample selection and weighting
for traditional statistical instruments. Continuous big data also offers the chance for new flow or dynamic
statistics to be derived (something that is very difficult and costly to achieve with sample-based statistics),
offering the potential for more policy-relevant, outcome-based statistics. The sheer volume of data may
also allow greater disaggregation of some statistics, or greater segmentation or granular analyses. It also
offers the chance to measure a much wider variety of social, economic and environmental phenomena.
As noted above, this is a double-edged sword, and NSOs must be careful to measure what is important
not just what is easy. Nevertheless, big data may be able to contribute to improving the quality of a
number of existing statistics (for example, tourism expenditure, travel volumes…) while also offering
new approaches to measuring difficult concepts like wellbeing.

Summary of Opportunities

In summary, big data offer a wide range of potential opportunities - cost savings, improved timeliness,
burden reduction, greater granularity, linkability and scalability, greater accuracy, improved international
comparability, greater variety of indicators, new dynamic indicators. Big data may perhaps offer solu-
tions to data deficits in the developing world where traditional approaches have so far failed. Big data
may also offer opportunities to rethink what official statistics means and re-position the role of official
statistics vis-a-vis the wider data ecosystem. But of course, big data also presents risks and challenges
for official statistics. These are examined in the next section.

Accessing Big Data

One of the biggest barriers to using big data is lack of access. Many big data are proprietary i.e. data
that are commercially or privately owned and are not publicly available. For example, data generated
from using credit cards, search engines, social media, mobile phones and store loyalty cards are all pro-
prietary and may not be available for use. Even if data are not proprietary and are publicly accessible,
sensitivities around repurposing data to compile official statistics must be carefully considered. ‘Even
if there are no legal impediments, public perception is a factor that must be taken into account’ warn
Daas et al. (2015: 257). The current proprietary status of some data may change in the future as people
around the world realise that their data is being used and traded. But for the moment many datasets are
not currently accessible by NSOs, either because costs are prohibitive, data protection legislation pre-
vents it or proprietary ownership makes it impossible. Changes to statistical legislation may be required
to give NSOs or NSSs access to big data sources necessary for statistics, as recommended by United
Nations Economic Commission for Europe (2018). MacFeely and Barnat (2017) made similar recom-
mendations, arguing that in order to future-proof statistical legislation, consideration should be given to
mandatory access to all appropriate secondary data, including some important, commercially held data.
NSOs must be extremely careful not to damage their reputation and the public trust they enjoy. To
avoid doing so, an NSO must ensure it does not break the law or stray too far outside the culturally acceptable
boundaries or norms of their country. An NSO must decide whether it is legally permissible, ethical or
culturally acceptable to access and use big data. These are not always easy questions to answer. When it
comes to accessing new sources of digital data, the legal, ethical and cultural boundaries are not always
clear-cut. In some cases NSOs may be forced to confront issues well before the law is clear or cultural
norms have been established. Furthermore, given the speed with which the digital data world is chang-
ing, it is almost impossible for any related legislation to keep up (Rudder, 2014). This poses a challenge
as public trust and reputation are fragile; hard won but easily lost. NSOs depend on the public to supply
information to countless surveys and enquiries. If an NSO breaks that trust, they risk biting the hand that
feeds them. Yet a progressive NSO must to some extent lead public opinion, meaning they must maintain
a delicate balance, innovating and publishing new statistics that deal with sensitive public issues but
without moving too far ahead of public opinion. For example, from a technical, statistical perspective
the most logical and cost-effective method of deriving international travel and tourism statistics might
be to use mobile phone data, but from a data protection and public opinion perspective using these type
of data may not be acceptable.
This tension or trade-off does not appear to be well understood and is certainly not well reflected
in many national and international policy documents. MacFeely (2017) noted that in an increasingly
complex data protection environment, there is a growing but discernible mismatch between potential and
actual, between expectations and reality. The rather fantastic talk of a big data revolution does not seem
to make any allowance for the complex legal and ethical issues that prevent access to many valuable data
sources. The United Nations Economic Commission for Europe (2016) reflecting on their experiences,
note ‘High initial expectations about the opportunities of Big Data had to face the complexity of reality.
The fact that data are produced in large amounts does not mean they are immediately and easily avail-
able for producing statistics… Data from mobile phones represent a notable example in this sense7. It
has been proved that such data can be exploited for a wide range of purposes, but they are still largely
outside the reach of the majority of statistical organizations, due to the high sensitivity of the data.’

OTHER CHALLENGES FOR OFFICIAL STATISTICS

Rapid Change and Instability

Technology, the source of many big data, continues to rapidly evolve. This continuing and rapid evolution
raises questions regarding the long-term stability or maturity of big data and their practicality as a data
source for the compilation of official statistics. As Daas et al. (2015: 258) note ‘The big data sources
encountered so far seem subject to frequent modifications’. For example, social media may tweak their
services to test alternative layouts, colours and design, which in turn may mutate the underlying data.
Kitchin (2015: 9) warns ‘the data created by such systems are therefore inconsistent across users and/or
time’. The United Nations Economic Commission for Europe (2016) caution that official statisticians
using big data will need to accept a general instability in the data. They note ‘Wikipedia access statistics
show a general drop in the overall number of accesses from the time the mobile version of Wikipedia was
released. Similarly, Twitter had a significantly lower number of geo-located tweets after Apple changed
the default options for its products.’ Consequently they note that time series consistency will be affected
by such events. Hence the importance of incorporating the concept of volatility into the definition of
big data. Instability of some big data sources introduces risks to continuity of data supply itself. NSOs
must decide whether access to big data are sufficiently stable and those data are adequately mature to
justify making the investment. Will this ‘exhaust pipe’ data be available over time on a consistent basis?
If not, then this will pose a challenge for official statistics, where a primary focus is to provide consistent
time series over time to serve policy analyses. It is often said that data are the new oil. But data (just
like crude oil) must be refined in order to produce useable statistics. And just like oil, if the quality and
consistency of the raw input data (crude oil) keeps changing, it will be very costly and difficult to refine.

Data Ownership

Ownership of source data is another issue of concern. As an NSO moves away from survey based data
and becomes more reliant on administrative or other secondary data, such as big data, it surrenders
control of its production system. The main input commodity, the source data, is dependent on external
factors, exposing the NSO to the risk of exogenous shocks. Partnerships with third-party data suppliers mean losing control not only of data generation, but perhaps also of sampling and data processing (per-
haps as a solution to overcome data protection concerns). Furthermore NSOs will have limited ability
to shape the input data they rely upon (Landefeld, 2014; Kitchin, 2015). The technologies that produce
the ‘exhaust’ or ‘tailpipe’ data may change or become redundant, leading to changes in or disappear-
ance of the data. Changes in government social or tax policy may lead to alterations or termination of
important administrative datasets. Changes in data protection law, if they do not take the concerns of
official statistics into account, could retard the development of statistics for decades8. These are all risks
that a NSO must carefully consider when deciding whether or not to invest in administrative data and big
data. Reliance on external data sources also introduces new financial and reputational risks. If a NSO is
paying to access a big data set, there is always the risk, that the data provider realizing the value of the
data will increase the price. There are also reputational risks. The first is the public, learning that the
NSO (an office of the State) is using or repurposing their social media, telephone, smart metering or
credit card data without their consent may react negatively. There may also be concerns or perceptions
of state driven ‘big brother’ surveillance or what Raley (2013) terms ‘dataveillance.’ So an NSO must
consider carefully how it communicates with the public to try and mitigate negative public sentiment.
The other reputational risk is that of association. If an NSO is using particular social media data for
example, and that provider becomes embroiled in a public scandal, the reputation of the NSO may be
adversely affected, through no fault of their own.


Data Quality

As noted earlier, big data are essentially re-purposed data. Consequently a lot of contextual knowledge
of the original generating system is required before it can be recycled and used for statistical purposes.
Developing that knowledge can be difficult as frequently data owners have no great incentive to docu-
ment changes or be transparent. Both the data and the algorithms they use are typically proprietary and
potentially of enormous commercial value. But accessing accurate metadata is vitally important to us-
ing any secondary data. For example, understanding how missing data have arisen, perhaps from server
downtime or network outages, is essential to assessing the quality of data and then using the data (Daas
et al., 2015).
Big data can also be gamed or contain fake data (Kitchin, 2015; MacFeely, 2016) and so it is important
to understand vulnerabilities in the data. There may also be challenges with regard to the representativity
and accuracy of many big data: age; gender; language; disability; social class; regional; and cultural biases
may exist. There are concerns too that many social media are simply echo-chambers cultivating less
than rigorous debate and leading to cyber-cascading, where a belief (either correct or incorrect) rapidly
gains currency as a ‘fact’ as it is passed around the web (Weinberger, 2014). As Dave Eggers remarked in his wonderful novel The Circle, social media has elevated gossip, hearsay and conjecture to the level of valid, mainstream communication (Eggers, 2013). There are also concerns for veracity aris-
ing from the concentration of data owners. Reich (2015) notes that in 2010, the top ten websites in the
United States accounted for 75 percent of all page views. According to Taplin (2017) Google has an 88
percent market share in online searches, Amazon has a 70 percent market share in e-book sales, Facebook
has a 77 percent market share in mobile social media. Such concentration introduces obvious risks of
abuse and manipulation, leaving serious questions for the continued veracity of any resultant data. The
decision by the Federal Communications Commission (2017) in the United States in December 2017 to
repeal Net Neutrality9 raises a whole new set of concerns regarding the veracity of big data for statistical
purposes. The United Nations Conference on Trade and Development (2015) noted that ambiguities ex-
ist for a range of issues connected with net neutrality, including traffic management practices and their
effects on quality of service, competition, innovation, investments, and diversity, online freedom, and
protection of human rights. Tim Berners-Lee (2014) has warned against the loss of net neutrality and
the increasing concentration within the web: both trends that are undermining the web as a public good.

The Digital Economy

Apart from utilising big data, another challenge (and opportunity) for NSOs and IOs is to actually measure
the digital economy itself (including big data) to help shed light on the importance of this rapidly emerg-
ing economy for trade, and other aspects of the traditional economy and society. In the coming years,
this will become an increasingly important subject and one where significantly more data and statistics
will be required. More broadly, there is a growing demand for data to better explain the value added
and benefits of the digital economy, the implications for taxation, how it facilitates trade, the contribu-
tion to GDP, and the likely influence on economic growth in the future. Countries are also wondering
whether the digital economy can be harnessed to reduce economic, social and gender inequalities and
what regulatory challenges are associated with trying to supervise what is essentially a globalised
phenomenon. Big data is of course only a sub-set of this bigger picture, but NSOs and IOs must address
the data needs emerging from the complex world of the digital economy.


Competition

The emergence of big data is changing the information world. The digital revolution has created an
abundance of data, a deluge of digital data, challenging the monopolistic position long enjoyed by official statistics as the provider of free, timely and high-quality statistics. This abundance of data, the ba-
sic input commodity or fuel for statistics, has reduced the costs of entry into the statistics compilation
business. Consequently today, there is a battle for the ownership of ‘facts’ – a battle that perhaps the
global statistical community has not taken sufficiently seriously (MacFeely, 2017). Today a variety of
compilers are producing statistics and although little is known about the quality of the input data or the
compilation process, the allure of these statistics is seductive. So much so, we now live in a post-truth
age where virtually all authoritative information sources can be challenged by alternative facts or fake
news with a consequent diminution of trust and credibility of all sources. It seems Huxley (1932) might
have been correct when he predicted that truth would be drowned in a sea of irrelevance. As Fukuyama
(2017) warns ‘In a world without gatekeepers, there is no reason to think that good information will
win out over bad.’ In fact there are mounting concerns at the weaponisation of data (O’Neill, 2016;
Berners-Lee, 2018). Davies (2017) believes official statistics is losing this battle and argues ‘The declin-
ing authority of statistics is at the heart of the crisis that has become known as “post-truth” politics.’
Furthermore, these data are allowing new types of indicators and statistics to be compiled. So not only
is the primacy of NSOs and NSSs being challenged, the legitimacy of many traditional statistics, such as
GDP or unemployment statistics is also being diminished. National level statistics, based on international
agreed classifications, are increasingly viewed by many as overly reductionist and inflexible. Letouzé
and Jütting (2015) warn that the proliferation of alternative statistics is challenging the trustworthiness
of official statistics.

Privacy and Confidentiality

For official statistics safeguarding the confidentiality of individual data is sacrosanct. The importance
of confidentiality is enshrined in Principle 6 of the United Nations Fundamental Principles of Official
Statistics (United Nations, 2014), which states ‘Individual data collected by statistical agencies for sta-
tistical compilation, whether they refer to natural or legal persons, are to be strictly confidential and used
exclusively for statistical purposes.’ The UN Handbook of Statistical Organization (United Nations, 2003)
also underscores repeatedly the requirement that the information that statistical agencies collect should
remain confidential and inviolate. The Scheveningen Memorandum (European Commission, 2013)10
prepared by the Directors General of NSOs in the European Union identified the need to adapt statistical
legislation in order to use big data - both to secure access and to protect privacy. The failure to treat
individual information as a trust would prevent any statistical agency from functioning effectively. For
a NSS to function, the confidentiality of the persons and entities for which it holds individual data must
be protected, i.e., there must be a guarantee to protect the identities and information supplied by all persons, enterprises
or other entities. In short, everyone who supplies data for statistical purposes does so with the reasonable
presumption that their confidentiality will be respected and protected11. In most countries, safeguarding
confidentiality is enshrined in national statistical legislation. But with the increased volumes of big data
being generated, and the potential to match those data, greater attention must be paid to data suppression
techniques to ensure confidentiality can be safeguarded.
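
To make the idea of suppression concrete, the short Python sketch below applies a simple minimum-frequency rule of the kind described in endnote 11: cells built from too few contributors are withheld before publication. The records, field names and the threshold of three contributors are illustrative assumptions, not a prescribed standard or the practice of any particular NSO.

```python
# Hypothetical micro-example of threshold-based primary suppression.
from collections import Counter

def suppress_small_cells(records, key, threshold=3):
    """Tabulate records by `key` and suppress cells with too few contributors."""
    counts = Counter(r[key] for r in records)
    # Publish a cell only if enough units contribute to it; otherwise flag it
    # as confidential ("c") so it is withheld from the released table.
    return {cell: (n if n >= threshold else "c") for cell, n in counts.items()}

records = [
    {"region": "North", "sector": "retail"},
    {"region": "North", "sector": "retail"},
    {"region": "North", "sector": "services"},
    {"region": "South", "sector": "mining"},   # a single contributor
]
print(suppress_small_cells(records, key="region"))
# {'North': 3, 'South': 'c'}
```

In practice, secondary suppression - protecting cells whose values could be recovered from published totals - would also be required, as the endnote explains.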

The emergence of big data is forcing many challenging questions to be asked, not least with regard
to privacy and confidentiality. Mark Zuckerberg, the founder of Facebook, famously claimed that the
age of privacy is over (Kirkpatrick, 2010). Scott McNealy, CEO of Sun Microsystems, likewise asserted that
concerns over privacy are a ‘red herring’ as we ‘have zero privacy’ (Noyes, 2015). John McAfee (2015),
founder of McAfee Associates has stated that the ‘concept of privacy is fast approaching extinction.’
Many disagree and have voiced concerns over the loss of privacy (see Pearson, 2013; Payton & Clay-
poole, 2015). Fry (2017) has likened developments with regard to big data and the loss of privacy to the
opening of Pandora’s Box - what he terms Pandora 5.0. The introduction in Europe of the new General
Data Protection Regulation, which came into effect in May 2018 and reinforces citizens’ data-protection rights, including among other things the right ‘to be forgotten’, suggests that privacy is still a real concern
(European Parliament, 2016) - at least in some regions of the world. Yet in the United States, despite
reported growing anxiety over online privacy (United Nations Conference for Trade and Development,
2018), users who provide information under the ‘third-party doctrine’, i.e. to utilities, banks, social networks, etc., are deemed to have ‘no reasonable expectation of privacy.’
This introduces two new challenges for official statisticians: one technical and one of perception. The
technical challenge arises from the availability of large, linkable datasets which present a problem thought
to have been solved in traditional statistics – anonymization. It is now clear, owing to the availability
of big data and enormous computing power, that simply removing personal identifiers and aggregating
individual data is not a sufficient safeguard. A paper by Ohm (2010) outlining the consequences of fail-
ing to adequately anonymise data graphically illustrates why there is no room for complacency. Thus a
problem that had been solved for traditional official statistics must now be re-solved, in the context of
a richer and more varied data ecosystem.
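
To illustrate the point, the hedged sketch below computes the smallest group size over a set of quasi-identifiers - a basic k-anonymity check. The records, attributes and the target of k ≥ 5 are invented for the example; real disclosure control would also need to consider linkage against external files, which is the kind of risk Ohm describes.

```python
# Hypothetical k-anonymity check over quasi-identifiers.
from collections import Counter

def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

rows = [
    {"age_band": "30-39", "area": "D04", "occupation": "teacher"},
    {"age_band": "30-39", "area": "D04", "occupation": "teacher"},
    {"age_band": "60-69", "area": "D04", "occupation": "judge"},   # unique combination
]
k = smallest_group(rows, ["age_band", "area", "occupation"])
print(k)        # 1 -> at least one record is unique on these attributes
print(k >= 5)   # False: removing names alone has not anonymised the file
```
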
The changing nature of perception is arguably a trickier problem. What if Zuckerberg and McNealy
are correct and future generations are less concerned about privacy? There appears to be some evidence
to suggest that they may be correct. It seems there are clear inter-generational differences in opinion
vis-a-vis privacy and confidentiality, where those ‘born digital’ (roughly those born since 1990) seem
to be less concerned about disclosing personal information than older generations (European Commis-
sion, 2011). Taplin (2017: 157) ponders this, musing ‘It very well may be that privacy is a hopelessly
outdated notion and that Mark Zuckerberg’s belief that privacy is no longer a social norm has won the
day.’ If this is so, what are the implications for official statistics and anonymization? If other statistical
providers, not governed by the Fundamental Principles, take a looser approach to confidentiality and
privacy, it may leave official statistics in a relatively anachronistic and disadvantaged position vis-a-vis
other data providers. But moving away from or discarding principle 6 of the UN Fundamental Principles
for Official Statistics would seem to be a very risky move, given the importance of public trust for NSOs.
A related and emerging challenge for official statistics is that of open data, or more specifically, the
asymmetry in openness expected of private and public sector data. Many of the ‘open data’ initiatives
are in fact drives to open government data12. This of course makes sense, in that taxpayers should to some extent own the data they have paid for, and so those data should be public, within sensible limits.
But arguably people also own much of the data being held by search engines, payments systems and
telecommunication providers. So why is there an exclusive focus on public or government data?
Letouzé and Jütting (2015: 10) have highlighted this issue, remarking that ‘Official statisticians express
an acute and understandable sense of frustration over pressure to open up their data to private-sector
actors, while these same actors are increasingly locking away what they and many consider to be “their”
data.’ Aggregate official statistics, as a public good, should of course be open. But the philosophy of open data should be more evenly applied to avoid asymmetrical conditions. This is a complex challenge,
as to some extent it feeds off poor understanding of privacy issues, statistical literacy and the data wars
that are underway at the moment. Rudder (2014: 241) notes that ‘because so much happens with so little
public notice, the lay understanding of data is inevitably many steps behind the reality.’
Taplin (2017: 157) argues that we trade our privacy with corporations in return for innovation or the
benefits of improved services but challenges the need to surrender information to government. MacFeely
(2016) has warned that if the benefits of privacy are insufficiently clear to the public or policy makers,
then it leaves official statistics vulnerable, and possibly facing a precarious and bleak future. Rudder
(2014: 242) highlights this challenge too noting that ‘the fundamental question in any discussion of pri-
vacy is the trade-off - what you get for losing it.’ Like Taplin, Rudder also argues that the trade-off with
the private sector is clear - better targeted ads! He argues that ‘what we get in return for the government’s
intrusion is less straightforward.’ McNealy too, who seems unconcerned about the lack of privacy in the
private sector, takes a very different attitude when it comes to government, saying ‘It scares me to death
when the NSA or the IRS know things about my personal life…Every American ought to be very afraid
of big government’ (Noyes, 2015). Curiously, while there is a real fear of government Big Brother, there
appears to be little concern regarding the emergence of a corporate Big Brother. To some extent there
is ideology at play here, where a neo-liberal agenda is pushing to minimize the role of the public sector,
but it also illustrates the challenge facing the UN and national governments and their agencies generally, whose contribution, and that of the SDGs, to the wellbeing of economies and societies is poorly understood.

Summary of Challenges

Thus, while big data may offer opportunities, they also present some real challenges for NSOs, NSSs
and IOs. To some extent, these challenges are magnified versions of problems that already exist with
other data sources, such as uncertainty over the quality or veracity of data and dealing with a range of potential biases. Access to external secondary sources, such as administrative data, can already be challenging, and is not unique to big data. But big data do appear to present some rather unique challenges
with regard to rapidly evolving and unstable data, ownership of data, data protection and safeguarding
confidentiality. These are some of the issues that NSOs and IOs will need to carefully consider before
committing resources to any big data projects.

SOLUTIONS AND RECOMMENDATIONS

In considering whether big data provides a viable option, statistical offices and systems must carefully
decide what governance systems will be required to ensure the official statistics brand is not compromised13.
For the purposes of this chapter, governance systems can be defined as the policies and rules, and the
monitoring mechanisms that allow the management of a NSO or IO to direct and control the activities
of the office. That governance system should help decision makers to balance the often-competing needs
of new statistical demands with the rights of data owners and ensure public accountability.
At a global level, questions naturally arise as to whether some sort of global governance framework
for the treatment of big data will be required or whether ad-hoc or bespoke national or regional agree-
ments can work. In a world where big data are being used more extensively, the multinational enterprises
generating those massive global datasets will effectively be setting many of the future data standards. What will this mean for the global statistical system? What will it mean for the United Nations Funda-
mental Principles of Official Statistics and Principles Governing International Statistical Activities?
These massive new globalized data also challenge the concept of sovereign data and the justification for
national or local compilation, raising a host of legal, security and organizational questions.
At individual NSO, NSS or IO level, there are also governance issues to be considered. This section
presents some recommendations regarding governance. The issues identified here are not exhaustive, but give a flavor of the matters that a NSO or IO might need to consider:

Ethics

Many big data are the exhaust from technology. Deriving statistics involves repurposing those data. The
possibilities are exciting and may offer incredible opportunities to derive new statistics. In the rush to compile new statistics, it may be easy to forget where those data came from. Thus it may be sensible to establish an ethics committee to consider whether the compilation of new statistics justifies the potential ‘intrusion’ into citizens’ privacy. An independent board, not immediately involved in the
compilation process, may be better able to weigh up the pros and cons of a big data project and ensure ‘no harm’ is done. A NSO may also wish to consider that, in using a particular big data set, it may be inadvertently taking an ideological or philosophical stance on a range of emerging debates, including, for example, the ownership of data.

Legal

There will be many legal issues to be unpicked in the years to come with regard to big data. For example,
can a NSO or IO access data, such as credit card expenditure information or mobile phone location
data without breaching data protection, statistical or other legislation? The ambiguous sovereignty of
globalized datasets will no doubt raise very particular legal problems in the years to come for statistical offices, data protection commissioners and governments alike. It will probably be necessary (or,
at the very least wise) to establish a small board of specialist legal experts who can adjudicate on these
complex issues and provide comprehensive legal opinion to the management board of the NSO or IO.
The correspondence between statistical and data protection legislation will be of paramount importance
in the coming years.

Oversight and Confidentiality

There will most likely be a growing need for a committee that deals specifically with the confidentiality
and oversight of access to data held by a NSO or IO. Storing big data will present new challenges. Who
has access to those data and why? Who decides who should have access, and using what criteria? How
is confidentiality of published data being safeguarded? Answering these questions will require a mixture
of statistical methodology and broader governance expertise. This board might also play a useful role,
in coordination with the ethics committee, in deciding whether certain data sets should be linked and, if they are, what the likely implications are for protecting confidentiality.

IT and Cyber-Security

Storing large volumes of data, and providing sufficient processing power and memory, will present
technical challenges too. Obviously sufficient space will be required. But new cyber-security protocols
will also be required. ‘Any data collected will invariably leak’ - so warns Goodman (2015: 153). What
does globalized data mean for storage location – does it make sense to continue with the old paradigm
where data are stored locally, in-house? If the data are stored locally, will they be quarantined and stored offline (so that they cannot be hacked or corrupted)? If not, will the NSO require some type of randomized identifier to mask and suppress identities? But does storing global data and re-processing
the same data many times over in different locations make sense? Would it be more efficient to store
the data at source, or in some central location (in the cloud?). How then will the data be integrated with
other data sources stored in different locations? The movement and transfer of data will require very
secure pipeline systems with encryption. This is a complex topic and NSOs and IOs must proceed cau-
tiously. Purely technically or financially driven solutions may yield poor decisions - storage and movement solutions must take reputation and cultural norms into consideration.
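
One possible building block, sketched below purely as an assumption-laden illustration, is keyed pseudonymisation: direct identifiers are replaced with stable but non-reversible tokens before data are moved, linked or stored. In a real system the secret key would be held in a hardware security module or secrets manager, never in the code as it is here for brevity.

```python
# Hypothetical sketch of keyed pseudonymisation prior to storage or transfer.
import hmac
import hashlib
import secrets

# Assumption: in a real deployment this key would be generated and kept in a
# secure key store; it is created in memory here only to keep the example runnable.
PSEUDONYMISATION_KEY = secrets.token_bytes(32)

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    return hmac.new(PSEUDONYMISATION_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

record = {"subscriber_id": "7304251234", "cell": "A17", "events": 42}
record["subscriber_id"] = pseudonymise(record["subscriber_id"])
print(record)   # the same subscriber always maps to the same token
```

Because the same key always yields the same token, pseudonymised records can still be linked for statistical purposes, while the original identifiers need never leave the secure environment.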

Quality Assurance

Assessing the quality of big data is not the same as assessing the quality of traditional datasets. Firstly,
quality must be defined from the perspective of big data, and clear criteria for how it can be measured must be developed. The United Nations Economic Commission for Europe (2016) notes that using
big data may mean accepting ‘different notions of quality.’ Owing to new quality issues, for example,
disorganized data management, more time and effort may be required to organize and properly manage
data. Gao et al. (2016) identify a number of quality parameters unique to cleaning and organizing big
data. They are: determining quality assurance; dealing with data management and data organization;
the particular challenges of data scalability; and transformation and conversion. Using big data may
require an extended quality framework for official statistics and SDG compilation. Such a framework
might put greater emphasis on risk management than is currently used.
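
As a minimal sketch of what such an extended framework might automate, the snippet below computes three simple indicators - completeness, validity and timeliness - for an incoming batch of transaction records. The fields, validity rule and 24-hour timeliness window are illustrative assumptions rather than elements of the UNECE or Gao et al. frameworks.

```python
# Hypothetical quality indicators for an incoming big data feed.
from datetime import datetime, timedelta

def quality_report(rows, required_fields, max_age_hours=24):
    """Return simple completeness, validity and timeliness ratios for a batch of records."""
    now = datetime.utcnow()
    total = len(rows)
    complete = sum(all(r.get(f) not in (None, "") for f in required_fields) for r in rows)
    valid = sum(isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0 for r in rows)
    timely = sum(now - r["timestamp"] <= timedelta(hours=max_age_hours) for r in rows)
    return {"completeness": complete / total,
            "validity": valid / total,
            "timeliness": timely / total}

rows = [
    {"amount": 12.5, "merchant": "A", "timestamp": datetime.utcnow()},
    {"amount": -3.0, "merchant": "",  "timestamp": datetime.utcnow() - timedelta(days=2)},
]
print(quality_report(rows, required_fields=["amount", "merchant"]))
# {'completeness': 0.5, 'validity': 0.5, 'timeliness': 0.5}
```

Indicators of this kind could feed a risk-based quality dashboard that flags batches falling below agreed thresholds before they enter compilation.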

Continuous Professional Development and Training

Using big data will require a blend of skills different from those of the traditional statistician, with more
emphasis on data mining and analytics. Given the demand for mathematically skilled graduates today, it
will be necessary to retrain some existing statisticians. This should be an on-going process in any event
for professional statisticians. Nevertheless, big data may be the catalyst for some NSOs or IOs to con-
sider establishing formal training or a Continuous Professional Development (CPD) programme. It may
also provide an impetus to consider new partnerships and collaborations in order to bring in new skills.

Strategic Partnerships

Using big data presents a range of technical challenges that may require new strategic partnerships. The
decision for NSOs and IOs is whether it makes sense to try and develop all of the skills in-house or
whether it will be better to partner with other entities that have the required skills. These will be critical
decisions, both in terms of costs and efficiencies, but also for legal and reputational reasons. In making these decisions, NSOs and IOs must ensure they do not compromise the Fundamental Principles of
Official Statistics, the Principles Governing International Statistical Activities or statistical legislation.

Communications and Dissemination

Any NSO planning to use big data in the day-to-day compilation process should carefully prepare a
communications strategy. How will repurposing be explained and communicated to the public? Will the
NSO publish an inventory of administrative and big data being accessed, stored and used by the NSO?
What is the plan, when and if, some scandal arises that embroils the NSO in a negative media story?
NSOs and IOs must also carefully consider how to make new statistics available - in particular how to
use technology to make the experience more interactive and user-friendly.

Clear Lines of Responsibility

NSOs are typically headed by a national statistician who understands data issues and is responsible for
the data governance of that office. But many of the institutions that make up a NSS are not statistical
offices. These institutions may have a statistical unit, but may rely on the NSO for key statistical and
methodological support. Nevertheless, these institutions, such as tax authorities, central banks or other
government departments or agencies manage very large volumes of administrative and statistical data
and thus may face some of the same challenges facing a NSO. These institutions are perhaps also in-
vestigating the use of big data. For these reasons, institutions may need to consider appointing a chief
statistician or chief data officer to properly manage data governance and liaise with the NSO. This role
is distinct from a chief information officer who, despite the title, is typically an IT specialist rather than
a data specialist.

CONCLUSION

The purpose of official statistics is to provide high-quality, impartial and timely information that allows governments and their citizens to make decisions and benchmark progress. It is not clear, as yet,
whether big data will contribute anything special to the production of official statistics. It seems likely
that Thamm (2014) was correct when he surmised that big data are just more data: the next phase in the
evolution of data rather than a revolution.
Big data, if they can be harnessed properly, would appear to offer some tantalizing opportunities - not
least improved timeliness and the chance to better align the availability of statistics with policy needs.
Perhaps in some cases they can improve accuracy. The possibilities of matching different digital data
sets may also allow us to dramatically improve our understanding of complex, transversal issues, such
as gender inequality or the challenges of being disabled. As yet, the implications of this ‘big data bang’ for statistics are not immediately clear, but one can envisage a whole host of new ways to measure human
interactions and experiences. But these developments will bring a myriad of new challenges too, not least
the growth of unreliable information. It is already clear that big data will not be a panacea for statistical
agencies confronting demands for more, better, and faster data with fewer resources. This may not be
universally understood and so managing expectations will be an ongoing challenge for official statisticians.

Challenges regarding how best to determine the quality and veracity of big data from a statistical
perspective remain. The growing centralization or monopolization of the internet, the threat to net
neutrality, and the growing volumes of ‘bot’ traffic are just some of the issues that may compromise
the quality and impartiality of any resultant statistics. There are concerns too, that many social media
channels are polarizing social exchange and promoting ‘echo chambers’ and cyber-cascading. Official
statisticians must ensure they can filter the wheat from the chaff.
In relative terms, big data are still new. We are probably only in the first hours of the first day of the
Internet revolution. Many norms and standards are yet to evolve. Official statistics must proceed cau-
tiously - just because something can be measured doesn’t mean it should be. In assessing whether to,
and how to use big data, NSOs must begin to carefully consider what the human rights of citizens are in this digital age. But this is easier said than done - there is a new gold rush underway - a data rush. In
that rush, NSOs and IOs are feeling the pressure to be seen to utilize big data. But as outlined above, it
will be a bumpy road with many challenges along the way. It is of course often easier to see problems
than opportunities, so NSOs and IOs must carefully evaluate the likely costs and benefits of using big
data, both now and in the future. What makes big data so intriguing is the fact that they simultaneously
present both threats and opportunities for official statistics.
From a governance perspective, the challenge for official statistics is to identify and mitigate the
threats while seizing the opportunities. In making that decision, they must not lose sight of their mis-
sion and mandates. Above all else, irrespective of what data sources are used, official statisticians must supply independent and impartial information that allows citizens to challenge stereotypes, governments, public
bodies and private enterprises and hold them to account.

REFERENCES

Adams, D. (1979). The Hitchhiker’s Guide to the Galaxy. Pan.


Ariely, D. (2013, January 6). Big Data is like Teenage Sex. Twitter. Retrieved from: https://twitter.com/
danariely/status/287952257926971392?lang=en
Aslam, S. (2015, October 7). Snapchat by the Numbers: Stats, Demographics and Fun Facts. Omnicore.
Retrieved from: http://www.omnicoreagency.com/snapchatstatistics/
Berners-Lee, T. (2014, August 23). Tim Berners-Lee on the Web at 25: the past, present and future.
Wired. Retrieved from: http://www.wired.co.uk/article/tim-berners-lee
Berners-Lee, T. (2018). The web is under threat. Join us and fight for it. World Wide Web Foundation.
Available from: https://webfoundation.org/2018/03/web-birthday-29/
Blackwell, J. (1985). Information for policy. National and Economic Social Council, Report no. 78.
Dublin: NESC. Retrieved from: http://files.nesc.ie/nesc_reports/en/NESC_78_1985.pdf
Borgman, C. L. (2015). Big Data, Little Data, No Data - Scholarship in the Networked World. Cam-
bridge, MA: MIT Press.

Boyd, D., & Crawford, K. (2012). Critical Questions for Big Data - Provocations for a cultural, techno-
logical, and scholarly phenomenon. Information Communication and Society, 15(5), 662–679. doi:10.
1080/1369118X.2012.678878
Brackstone, G. J. (1987). Statistical Issues of Administrative Data: Issues and Challenges. Survey Meth-
odology, 13(1), 29–43.
Buytendijk, F. (2014). Hype Cycle for Big Data, 2014. Gartner. Retrieved from: https:// www.gartner.
com/doc/2814517/hype-cycle-big-data-
Committee of the Chief Statisticians of the United Nations System. (2018). UN Statistical Quality Assur-
ance Framework. Retrieved from: https://unstats.un.org/unsd/unsystem/documents/UNSQAF-2018.pdf
Coordination Committee for Statistical Activities. (2014). Principles Governing International Statisti-
cal Activities. Retrieved from: https://unstats.un.org/unsd/accsub-public/principles_stat_activities.htm
Cervera, J. L., Votta, P., Fazio, D., Scannapieco, M., Brennenraedts, R., & van der Vorst, T. (2014).
Big Data in Official Statistics. Eurostat ESS Big Data Event. Retrieved from: https://ec.europa.eu/eu-
rostat/cros/system/files/Big%20Data%20Event%202014%20-%20Technical%20Final%20Report%20
-finalV01_0.pdf
Choi, H., & Varian, H. (2011). Predicting the present with Google Trends. Retrieved from: http://people.
ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf
Daas, P. J. H., Puts, M. J., Buelens, B., & van den Hurk, P. A. M. (2015). ‘Big Data as a Source for Of-
ficial Statistics’. Journal of Official Statistics, 31(2), 249–262. doi:10.1515/jos-2015-0016
Davies, W. (2017). How statistics lost their power – and why we should fear what comes next. Retrieved from:
https://www.theguardian.com/politics/2017/jan/19/crisis-of-statistics-big-data-democracy?CMP=share_
btn_link
Donkin, C. (2017). M-Pesa continues to dominate Kenyan market. Mobile World Live. Retrieved from:
https://www.mobileworldlive.com/money/analysis-money/m-pesa-continues-to-dominate-kenyan-market/
Eggers, D. (2013). The Circle. Penguin Books.
European Commission. (2011). Attitudes on Data Protection and Electronic Identity in the European
Union. Special Eurobarometer No. 359, Wave 74.3 - TNS Opinion and Social. Published June 2011.
Retrieved from: http://ec.europa.eu/commfrontoffice/publicopinion/archives/ebs/ebs_359_en.pdf
European Commission. (2013). Scheveningen Memorandum on “Big Data and Official Statistics”.
Adopted by the European Statistical System Committee on 27 September 2013. Retrieved from: https://
ec.europa.eu/eurostat/cros/content/scheveningen-memorandum_en
European Commission. (2014). Big Data. Digital Single Market Policies. Retrieved from: https://ec.europa.
eu/digital-single-market/en/policies/big-data

European Parliament. (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council
of 27 April 2016 on the protection of natural persons with regard to the processing of personal data
and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection
Regulation). Retrieved from: http://ec.europa.eu/justice/data-protection/reform/files/regulation_oj_en.pdf
Eurostat. (2014). Feasibility Study on the Use of Mobile Positioning Data for Tourism Statistics - Con-
solidated Report. Eurostat Contract No 30501.2012.001- 2012.452, 30 June 2014. Retrieved from: http://
ec.europa.eu/eurostat/documents/747990/6225717/MP-Consolidated-report.pdf
Federal Communications Commission. (2017). Restoring Internet Freedom. Retrieved from: https://
www.fcc.gov/restoring-internet-freedom
Fry, S. (2017). The Way Ahead. Lecture delivered on the 28th May 2017, Hay Festival, Hay-on-Wye.
Retrieved from: http://www.stephenfry.com/2017/05/the-way-ahead/
Fukuyama, F. (2017). The Emergence of a Post Fact World. Project Syndicate. Retrieved from: https://
www.project-syndicate.org/onpoint/the-emergence-of-a-post-fact-world-by-francis-fukuyama-2017-01
Gelsinger, P. (2012). Big Data quotes of the week. Retrieved from https://whatsthebigdata.com/2012/06/29/
big-data-quotes-of-the-week-11/
Gibson, W. (2001).’Broadband Blues - Why has broadband Internet access taken off in some countries
but not in others? The Economist. Retrieved from: https://www.economist.com/node/666610
Global Partnership for Sustainable Development Data. (2016). The Data Ecosystem and the Global
Partnership. Retrieved from: http://gpsdd.squarespace.com/who-we-are/
Gao, J., Xie, C., & Tao, C. (2016). Big Data Validation and Quality Assurance - Issues, Challenges,
and Needs. 2016 IEEE Symposium on Service-Oriented System Engineering. Retrieved from: http://
ieeexplore.ieee.org/xpls/icp.jsp?arnumber=7473058
Goodbody, W. (2018). Waterford researchers develop new method to store data in DNA. RTE News.
Retrieved from: https://www.rte.ie/news/ireland/2018/0219/941956-dna-data/
Goodman, M. (2015). Future Crimes - Inside the Digital Underground and the Battle for Our Connected
World. New York: Anchor Books.
Guerreiro, V., Walzer, M., & Lamboray, C. (2018). The use of Supermarket Scanner data in the Luxem-
bourg Consumer Price Index. Economie et Statistiques - Working papers du STATEC, No. 97. Retrieved
from: http://www.statistiques.public.lu/catalogue-publications/economie-statistiques/2018/97-2018.pdf
Hammer, C. L., Kostroch, D. C., & Quiros, G. (2017). Big Data: Potential, Challenges, and Statistical
Implications. IMF Staff Discussion Note, SDN/17/06, September 2017. Retrieved from: http://www.
imf.org/en/Publications/SPROLLs/Staff-Discussion-Notes
Hand, D. J. (2015). Official Statistics in the New Data Ecosystem. Presented at the New Techniques and
Technologies in Statistics conference, Brussels, Belgium. Retrieved from: https://ec.europa.eu/eurostat/
cros/system/files/Presentation%20S20AP2%20%20Hand%20-%20Slides%20NTTS%202015.pdf
Harkness, T. (2017). Big Data: Does size matter? London, UK: Bloomsbury Sigma.

Hayek, F. A. (1944). The Road to Serfdom. Chicago, MA: The University of Chicago Press.
Hilbert, M., & Lopez, P. (2012). How to Measure the World’s Technological Capacity to Store, Com-
municate and Compute Information. International Journal of Communication, 6, 956–979.
Huxley, A. (1932). A brave new world. London: Chatto and Windus.
IBM. (2017). 10 Key Marketing Trends for 2017 and Ideas for Exceeding Customer Expectations. IBM
Marketing Cloud. Retrieved from: https://public.dhe.ibm.com/common/ssi/ecm/wr/en/wrl12345usen/
watson-customer-engagement-watson-marketing-wr-other-papers-and-reports-wrl12345usen-20170719.
pdf
Ismail, N. (2016). Big Data in the developing world. Information Age. Retrieved from: http://www.
information-age.com/big-data-developing-world-123461996/
International Telecommunications Union. (2017). ITU Key 2005 - 2017 ICT data. Retrieved from: https://
idp.nz/Global-Rankings/ITU-Key-ICT-Indicators/6mef-ytg6
Jerven, M. (2015). Africa - Why economists get it wrong. London: Zed Books.
Kirkpatrick, M. (2010). Facebook’s Zuckerberg Says the Age of Privacy is Over. Retrieved from: https://
readwrite.com/2010/01/09/facebooks_zuckerberg_says_the_age_of_privacy_is_ov/
Kitchin, R. (2015). The opportunities, challenges and risks of big data for official statistics. Statistical
Journal of the International Association of Official Statistics, 31(3), 471–481.
Korte, T. (2014). How Data and Analytics Can Help the Developing World. Huffington Post - The Blog.
Retrieved from: https://www.huffingtonpost.com/travis-korte/how-data-and-analytics-ca_b_5609411.html
Krikorian, R. (2013). New Tweets per Second Record, and How! Engineering Blog. Retrieved from:
https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
Kulp, P. (2017). Facebook quietly admits to as many as 270 million fake or clone accounts. Mashable. Re-
trieved from: https://mashable.com/2017/11/02/facebook-phony-accounts-admission/#UyvC2aOAmPqo
Landefeld, S. (2014). Uses of Big Data for Official Statistics: Privacy, Incentives, Statistical Challenges,
and Other Issues. Discussion paper presented at the United Nations Global Working Group on Big Data
for Official Statistics, Beijing, China. Retrieved from: https://unstats.un.org/unsd/trade/events/2014/
beijing/Steve%20Landefeld%20-%20Uses%20of%20Big%20Data%20for%20official%20statistics.pdf
Laney, D. (2001). 3D Data Management: Controlling data volume, velocity and variety. Meta Group, File
949. Retrieved from: https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-
Controlling-Data-Volume-Velocity-and-Variety.pdf
Letouzé, E., & Jütting, J. (2015). Official Statistics, Big Data and Human Development. Data Pop Al-
liance, White Paper Series. Retrieved from: https://www.paris21.org/sites/default/files/WPS_Official-
Statistics_June2015.pdf

Long, J., & Brindley, W. (2013). The role of big data and analytics in the developing world: Insights
into the role of technology in addressing development challenges. Accenture Development Partnerships.
Retrieved from: https://www.accenture.com/us-en/~/media/Accenture/Conversion-Assets/DotCom/Docu-
ments/Global/PDF/Strategy_5/Accenture-ADP-Role-Big-Data-And-Analytics-Developing-World.pdf
MacFeely, S., & Dunne, J. (2014). Joining up public service information: The rationale for a national
data infrastructure. Administration, 61(4), 93–107.
MacFeely, S. (2016). The Continuing Evolution of Official Statistics: Some Challenges and Opportuni-
ties. Journal of Official Statistics, 32(4), 789–810. doi:10.1515/jos-2016-0041
MacFeely, S. (2017). Measuring the Sustainable Development Goals: What does it mean for Ireland?
Administration, 65(4), 41–71. doi:10.1515/admin-2017-0033
MacFeely, S., & Barnat, N. (2017). Statistical capacity building for sustainable development: Develop-
ing the fundamental pillars necessary for modern national statistical systems. Statistical Journal of the
International Association of Official Statistics, 33(4), 895–909.
Mayer-Schonberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live,
Work and Think. London: John Murray.
Meeker, M. (2017). Internet Trends 2017. Presented at the Code Conference, Rancho Palos Verdes, CA.
Retrieved from: http://www.kpcb.com/internet-trends
McAfee, J. (2015). Untitled posting on Facebook. Retrieved from https://www.facebook.com/officialm-
cafee/posts/464114187078100:0
Mutuku, L. (2016) The big data challenge for developing countries. The World Academy of Sciences.
Retrieved from: https://twas.org/article/big-data-challenge-developing-countries
New York Stock Exchange. (2018). Daily NYSE Group Volume in NYSE Listed, 2018. NYSE Transac-
tions, Statistics and Data Library. Retrieved from: https://www.nyse.com/data/transactions-statistics-
data-library
Nilson Report. (2018). Global Cards - 2015: Special Report. The Nilson Report. Retrieved from: https://www.nilsonreport.com/publication_special_feature_article.php
Nordrum, A. (2016). Popular Internet of Things Forecast of 50 Billion Devices by 2020 Is Outdated.
IEEE Spectrum. Retrieved from https://spectrum.ieee.org/tech-talk/telecom/internet/popular-internet-
of-things-forecast-of-50-billion-devices-by-2020-is-outdated
Noyes, K. (2015). Scott McNealy on privacy: You still don’t have any. PC World. Retrieved from: https://
www.pcworld.com/article/2941052/scott-mcnealy-on-privacy-you-still-dont-have-any.html
Nyborg Hov, K. (2018). Using scanner data for sports equipment. Paper written for the joint UNECE/
ILOs Meeting of the Group of Experts on Consumer Price Indices, Geneva, Switzerland. Retrieved from:
https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.22/2018/Norway_-_session_1.pdf
O’Neil, C. (2016). Weapons of Math Destruction - How big data increases inequality and threatens
democracy. London: Allen Lane.

Payton, T., & Claypoole, T. (2015). Privacy in the Age of Big Data - Recognising the Threats Defending
Your Rights and Protecting Your Family. Lanham, MD: Rowman & Littlefield.
Pearson, E. (2013). Growing Up Digital. Presentation to the OSS Statistics System Seminar Big Data
and Statistics New Zealand: A seminar for Statistics NZ staff, Wellington, New Zealand. Retrieved from:
https://www.youtube.com/watch?v=lRgEMSqcKXA
Raley, R. (2013). Dataveillance and countervailance. In “Raw Data” is an Oxymoron. MIT Press.
Reich, R. (2015). Saving Capitalism: For the Many, Not the Few. London: Icon Books Ltd.
Rudder, C. (2014). Dataclysm: What our online lives tell us about our offline selves. 4th Estate.
Runde, D. (2017). The Data Revolution in Developing Countries Has a Long Way to Go. Forbes. Re-
trieved from https://www.forbes.com/sites/danielrunde/2017/02/25/the-data-revolution-in-developing-
countries-has-a-long-way-to-go/2/#3a48f53e482f
Statistica. (2018a). Average daily number of trades on London Stock Exchange (UK order book) in the
United Kingdom from January 2015 to February 2018. Statistica - The Statistics Portal. Retrieved from:
https://www.statista.com/statistics/325326/uk-lse-average-daily-trades/
Statistica. (2018b). Number of user reviews and opinions on TripAdvisor worldwide from 2014 to 2017
(in millions). Statistica - The Statistics Portal. Retrieved from: https://www.statista.com/statistics/684862/
tripadvisor-number-of-reviews/
Stephens-Davidowitz, S. (2017). Everybody lies - What the internet can tell us about who we really are.
London, UK: Bloomsbury.
Struijs, P., Braaksma, B., & Daas, P. J. H. (2014). Official statistics and Big Data. Big Data & Society.
Retrieved from: http://journals.sagepub.com/doi/pdf/10.1177/2053951714538417
Tam, S., & Clarke, F. (2015). Big Data, Official Statistics and Some Initiatives by the Australian
Bureau of Statistics. International Statistical Review. Retrieved from: https://www.researchgate.net/
publication/280972848_Big_Data_Official_Statistics_and_Some_Initiatives_by_the_Australian_Bu-
reau_of_Statistics
Taplin, J. (2017). Move Fast and Break things - How Facebook, Google and Amazon cornered culture
and undermined democracy. New York: Little, Brown and Company.
Thamm, A. (2017). Big Data is dead. LinkedIn. Retrieved from: https://www.linkedin.com/pulse/big-
data-dead-just-regardless-quantity-structure-speed-thamm/
United Nations. (2003). Handbook of Statistical Organization – 3rd Edition: The Operation and Orga-
nization of a Statistical Agency. Department of Economic and Social Affairs Statistics Division Studies
in Methods Series F No. 88. United Nations. Retrieved from: https://www.paris21.org/sites/default/
files/654.pdf
United Nations. (2014). Resolution adopted by the General Assembly on 29 January 2014 - Fundamen-
tal Principles of Official Statistics. General Assembly, A/RES/68/261. Retrieved from: http://unstats.
un.org/unsd/dnss/gp/FP-New-E.pdf

United Nations Conference for Trade and Development. (2015). Mapping of international Internet
public policy issues. E/CN.16/2015/CRP.2, Commission on Science and Technology for Develop-
ment, Eighteenth session, Geneva. Retrieved from: http://unctad.org/meetings/en/SessionalDocuments/
ecn162015crp2_en.pdf
United Nations Conference for Trade and Development. (2016). Development and Globalization: Facts
and Figures 2016. Retrieved from http://stats.unctad.org/Dgff2016/
United Nations Conference for Trade and Development. (2018). Data privacy: new global sur-
vey reveals growing internet anxiety. Retrieved from: http://unctad.org/en/pages/newsdetails.
aspx?OriginalVersionID=1719
United Nations Economic Commission for Europe. (2000). Terminology on Statistical Metadata. Confer-
ence of European Statisticians Statistical Standards and Studies, No.53. Retrieved from: http://ec.europa.
eu/eurostat/ramon/coded_files/UNECE_TERMINOLOGY_STAT_METADATA_2000_EN.pdf
United Nations Economic Commission for Europe. (2011). Using Administrative and Secondary Sources
for Official Statistics - A Handbook of Principles and Practices. Retrieved from: https://unstats.un.org/
unsd/EconStatKB/KnowledgebaseArticle10349.aspx
United Nations Economic Commission for Europe. (2013). The Common Metadata Framework.
UNECE Virtual Standards Helpdesk. Retrieved from: https://statswiki.unece.org/display/VSH/
The+Common+Metadata+Framework
United Nations Economic Commission for Europe. (2016). Outcomes of the UNECE Project on Using
Big Data for Official Statistics. Retrieved from: https://statswiki.unece.org/display/bigdata/Big+Data
+in+Official+Statistics
United Nations Economic Commission for Europe. (2018). Guidance on common elements of statistical
legislation. Conference of European Statisticians, 66th Session, Geneva. Retrieved from: http://www.
unece.org/fileadmin/DAM/stats/documents/ece/ces/2018/CES_6_Common_elements_of_statistical_leg-
islation__Guidance__for_consultation_for_upload.pdf
United Nations Secretary-General’s Independent Expert Advisory Group on a Data Revolution for Sus-
tainable Development. (2014). A World that Counts: Mobilizing the Data Revolution for Sustainable
Development. Report prepared at the request of the United Nations Secretary-General, by the Independent
Expert Advisory Group on a Data Revolution for Sustainable Development. November 2014. Retrieved
from: http://www.undatarevolution.org/wp-content/uploads/2014/11/A-World-That-Counts.pdf
United Nations Statistical Commission. (2014). Big data and modernization of statistical systems; Report
of the Secretary-General. E/CN.3.2014/11 of the forty-fifth session of UNSC 4-7 March 2014. Retrieved
from: https://unstats.un.org/unsd/statcom/doc14/2014-11-BigData-E.pdf
United Nations Statistics Division. (2017). Bogota Declaration on Big Data for Official Statistics. Agreed
at the 4th Global Conference on Big Data for Official Statistics, Bogota, Colombia. Retrieved from: https://
unstats.un.org/unsd/bigdata/conferences/2017/Bogota%20declaration%20-%20Final%20version.pdf

United Nations Statistics Division. (2018). Big Data Project Inventory. Retrieved from: ‘https://unstats.
un.org/bigdata/inventory/
Van Loon, K., & Roels, D. (2018). Integrating big data in the Belgian CPI. Presented to the Meeting of
the UNECE Group of Experts on Consumer Price Indices, Geneva, Switzerland. Retrieved from https://
www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.22/2018/Belgium.pdf
Waterford Technologies. (2017). Big Data Statistics & Facts for 2017. Retrieved from: https://www.
waterfordtechnologies.com/big-data-interesting-facts/
Weigand, A. (2009). The Social Data Revolution(s). Harvard Business Review. Retrieved from https://
hbr.org/2009/05/the-social-data-revolution.html
Weinberger, D. (2014). Too Big to Know. New York: Basic Books.

KEY TERMS AND DEFINITIONS

ATM: Automatic teller machine.


AVL: Automatic vehicle location.
CDR: Call detail record.
CPD: Continuous professional development.
GDP: Gross domestic product.
GDPR: General data protection regulation.
IO: International organization.
IRS: Internal Revenue Service.
NSO: National Statistical Office.
NSA: National Security Agency.
NSS: National statistical system.
OECD: Organization for Economic Cooperation and Development.
OGD: Open government data.
SDG: Sustainable development goals.
UNSD: United Nations Statistics Division.

ENDNOTES


1
Buytendijk (2014) argued that big data had passed the top of the ‘Hype Cycle’ and was moving
towards the ‘Trough of Disillusionment’ and that expectations regarding the use of big data would
now become more realistic.

2
Volume, Velocity, Variety, Variability, Veracity, Validity, Vulnerability, Volatility, Visualisation
and Value.

3
This is a best estimate. Projects are not always well defined or explained on the inventory. Some
projects seem to incorporate several projects or big data sources.

52

Big Data and Official Statistics

4
See Guerreiro et al. (2018); Van Loon & Roels (2018); or Nyborg Hov (2018) for some recent
examples.
5
Readers will note that the totals for data sources and projects do not match. In some cases projects
use several sources, whereas in other cases a single source can be used for several projects. Con-
sequently this is a best estimate based on the text available in the project plans. It should be noted
that several projects are not well defined or do not appear to have any clear objective - hence the
‘other’ categories are quite large.
6
In 2006 there were some 2 billion ‘smart devices’ connected to each other. By 2020 it is projected
that this ‘internet of things’ will contain somewhere between 30 and 50 billion devices (Nordrum,
2016). Goodman (2015) notes the result will be 2.5 sextillion potential networked object-to-object
interactions.
7
See Eurostat (2014) for an excellent summary of how mobile phone data can be used to compile
tourism statistics.
8
For example, within the statistical community of the European Union there are concerns that the
new GDPR has not fully taken the particular needs of official statistics into consideration. If it
hasn’t, this new legislation may significantly retard the development of official statistics in that
region.
9
Net Neutrality sets out the principles for equal treatment of Internet traffic, regardless of the type
of service, the sender, or the receiver. In practice, however, the Internet service providers conduct
a degree of appropriate traffic management aimed at avoiding congestion, and delivering a reliable
quality of service. Concerns regarding the loss of net neutrality focus mainly on definitions of (in)
appropriate and (un)reasonable management and discriminatory practices, especially those that are
conducted for commercial (e.g. anti-competitive behaviour) or political reasons (e.g. censorship).
Net neutrality has three important dimensions: (1) technical (impact on Internet infrastructure); (2)
economic (influence on Internet business models); and (3) human rights (possible discrimination
in the use of the Internet).
10
Para 3 - Recognise that the implications of Big Data for legislation especially with regard to data
protection and personal rights (e.g. access to Big Data sources held by third parties) should be
properly addressed as a matter of priority in a coordinated manner.
11
In effect this means that only aggregate data can be published for general release by official statistical
compilers and those aggregates will have been tested for primary and secondary disclosure. Data
that cannot be published due to the risk of statistical disclosure are referred to as confidential data.
Primary confidentiality disclosure arises when dissemination of data provides direct identification
of an individual person or entity. This usually arises when there are insufficient records in a cell to
mask individuals or when one or two records are dominant and so their identity remains evident
despite many records (this is a recurring challenge for business statistics where ‘hiding’ the identity
of large multinational enterprises can be very difficult). Secondary disclosure may arise when data
that have been protected for primary disclosure nevertheless reveal individual information when
cross-tabulated with other data.
12
For example: the OECD Open Government Data (OGD) is a philosophy - and increasingly a set
of policies - that promotes transparency, accountability and value creation by making government
data available to all - see: http://www.oecd.org/gov/digital-government/open-government-data.htm. In the United States, Data.gov aims to make government more open and accountable. Opening
government data increases citizen participation in government, creates opportunities for economic
development, and informs decision making in both the private and public sectors - see: https://
www.data.gov/open-gov/. In the European Union, there is a legal framework promoting the re-use
of public sector information - Directive 2013/37/EU of the European Parliament and of the Council
of 26 June 2013 amending Directive 2003/98/EC on the re-use of public sector information. See -
http://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32013L0037&from=FR
13
The issue of modernising governance frameworks in the context of big data is specifically addressed
in the 2017 Bogota Declaration on Big Data for Official Statistics (United Nations Statistics Divi-
sion, 2017).


Chapter 3
The Big Data Research
Ecosystem:
An Analytical Literature Study

Moses John Strydom


University of South Africa, South Africa

Sheryl Buckley
University of South Africa, South Africa

ABSTRACT
Big data is the emerging field where innovative technology offers new ways to extract value from an
unequivocal plethora of available information. By its fundamental characteristic, the big data ecosystem
is highly conjectural and is susceptible to continuous and rapid evolution in line with developments in
technology and opportunities, a situation that predisposes the field to research in very brief time spans.
Against this background, both academics and practitioners oddly have a limited understanding of how
organizations translate potential into actual social and economic value. This chapter conducts an in-
depth systematic review of existing trends in the rapidly developing field of big data research and,
thereafter, systematically reviewed these studies to identify some of their weaknesses and challenges.
The authors argue that, in practice, most big data surveys do not focus on technologies, and instead
present algorithms and approaches employed to process big data.

INTRODUCTION

The International Data Corporation (Reinsel et al., 2017) forecasts that by 2025 the global data sphere or
ecosystem will surge to 163 zettabytes (one zettabyte being a trillion gigabytes). This means the projected data sphere would be roughly ten times the 16 zettabytes of data that devices, storage systems and data centers generated in 2016 alone. The direct consequence of these expectations is that research activity in the domain has exploded in parallel.

DOI: 10.4018/978-1-5225-7077-6.ch003

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


Some changes seem to be running at the speed of light. In this morass of hard trends and cosmetic
swirls - an unquestionable tsunami - how do we distinguish the trend from the trendy? How do we know
which changes are truly disruptive, and which are merely potentially transformative? Which technolo-
gies are going to have the biggest impact on decision making? And, of the many initiatives put forward,
which will be pivotal to the researcher’s and the practitioner’s journey?
Predictably, big data governance has thus become increasingly critical, indeed decisive, as more or-
ganizations learn how to leverage their data and exploit it to make better decisions, optimize operations,
create new products and services, and improve profitability.
Beyond their sheer magnitude, these data-sets and the applications built on them pose significant challenges for method and software development.
Hitherto, the bulk of big data reports have focused on discussing opportunities, applications, challenges and issues (Wang et al., 2016). Others have preferred to survey and study the algorithms and techniques used in such contexts (Raguseo, 2018).
Only a limited number of authors treat big data technologies with respect to their ecosystem by fo-
cusing on the phases of the data life-cycle model (Cavanillas et al., 2015).
Additionally, there is a strong need to systematically review these studies in order to render them ac-
cessible to researchers and practitioners.
By clearly defining the opportunity in big data, by examining the big data value chain, and by under-
taking a comprehensive inspection into industry sector applications, this chapter charts a way forward to
new value creation and new opportunities from big data. Decision makers, policy advisers, researchers,
and practitioners on all levels can benefit from this.
The novelty of this study is that it explores the current trends in the field of big data research and
the most relevant research areas during the past three years, that is, 2015 to 2017.
In doing so, the authors introduce the big data value chain, its opportunities and applications. They present the main challenges encountered when facing the complexity of imbalanced data-sets and
difficulties contingent on the V’s of big data.
We are reminded that big data is not a single technology or initiative. Rather, it is a continuous by-product of all our interactions with rapidly evolving technologies. This means that big data is not a stable platform but is as dynamic as the technology that defines it.
The trends were identified, using a systematic literature review methodology, through an extensive audit of peer-reviewed scholarly journals, the objective being to collect, elaborate and synthesize an exhaustive review of current issues related to big data research trends. In the same vein, topics and tendencies in the areas of exascale computing (the exascale ecosystem) and social data analysis were reported.
Primarily, with the objective of analyzing current research, content analysis was employed.
Building a representative dictionary of the subjects being analyzed was a key to the proposed proce-
dure. Likewise, social network analysis tools were employed to interpret the interrelationships among the terms in the big data taxonomy. The results were interpreted using descriptive analysis (frequencies) and social network analysis.
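
A minimal sketch of this kind of dictionary-based content analysis is given below: term frequencies are counted across a set of abstracts, and co-occurring term pairs are recorded as the edges of a term network that a social network analysis tool could then visualise. The dictionary and abstracts are invented for illustration and do not reproduce the authors’ actual corpus or taxonomy.

```python
# Hypothetical dictionary-based content analysis with a term co-occurrence network.
from collections import Counter
from itertools import combinations

dictionary = {"big data", "analytics", "governance", "machine learning", "privacy"}

abstracts = [
    "Big data analytics for the governance of national statistics.",
    "Privacy-preserving machine learning on big data streams.",
]

term_freq = Counter()
cooccurrence = Counter()
for text in abstracts:
    present = sorted(term for term in dictionary if term in text.lower())
    term_freq.update(present)                      # frequency of each dictionary term
    cooccurrence.update(combinations(present, 2))  # edges of the term network

print(term_freq.most_common())
print(cooccurrence.most_common())
```
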
This systematic literature review is expected to contribute to the scientific knowledge on computer science and information system management by (i) studying in detail the current issues related to big data; (ii) identifying some of the research trends in existing literature; and (iii) suggesting directions for future
research.

Finally, the ambitions of this chapter are to provide responses to the following questions:

• What are the current issues related to big data research?


• Is the big data life-cycle model the most efficient way to elicit big data attributes and its
ecosystem?
• Extrapolating, what will be the top-ranking future research topics in big data?

Last but not least, the authors anticipate that the findings obtained in this chapter may additionally
be useful in the exploration of potential research areas and the identification of neglected areas in the
field of big data.

BACKGROUND

Big data is the innovative discipline where cutting-edge technology offers new ways to extract value from
a veritable plethora of available and ever-increasing information. “As with any emerging area, terms and
concepts can be open to different interpretations. The big data domain is no different” (Curry, 2015, p.36).
Different definitions of “big data” that have materialized “show the diversity and use of the term to
label data with different attributes” (Curry, 2015, p.36). Two mainstream concepts from “the business com-
munity, value chains and business ecosystems, can be used to model big data systems and the big data
business environments” (Curry, 2015, p.36). Big data value chains or life-cycle cascades can describe
the information flow within a big data system as a series of steps needed to generate value and useful
insights from data (Cavanillas et al., 2015). Big data ecosystems can, on the other hand, be employed to
understand the business context and relationships between concepts and key stakeholders. “A big data
business ecosystem can be an important factor for the capitalization of big data services, products, and
platforms” (Curry, 2018, p.36). (see Figure 2).
Curry’s study (Curry, 2018) was based on an analogous data life-cycle model where different big
data attributes, from data creation extending to data consumption, were meticulously analyzed.
As underscored elsewhere in this chapter, the phenomenon of big data is such that today there is a
superabundance of diverse, avant-garde representative applications of big data initiatives, technologies
and research in academia, industry and other related disciplines.
Illustrative of this disposition is the pervasive deployment of big data technology in virtually all walks
of life. Indeed, big data is the all-embracing fuel for tomorrow’s business successes.

Business Intelligence (Balachandran & Prasad, 2017)

The main objective of business intelligence is to support corporate executives, business managers and
other operational workers in making better, more informed and faster business decisions. Organizations are
being compelled to capture, understand and harness their data to support decision-making in order to
improve business.

• Business Alignment Strategies: (Kaidalova, 2017): Industry necessitates the implementation of pragmatic solutions for dealing with product-information technology integration into an enterprise’s architecture management framework.
• Organizational Behavioral Strategies: (Mazzei & Noble, 2017): These strategies increase the
task performance and improve productivity. Innovation, competition, and productivity are being
reassessed and re-activated by the increase in the levels of data and its synergistic technology.
• Information Technology Strategies (Grensing-Pophal, 2015; Huang et al., 2015; Lim, 2018):
These studies identify a number of key factors that characterize data-based value creation, among
them being: (i) data source, (ii) data collection, (iii) data, (iv) data analysis, (v) information on the
data source, (vi) information delivery, (vii) customer (information user), (viii) value in informa-
tion use, and (ix) provider network.
• Marketing Strategies (Jianping et al., 2017): This study examines factors related to the effective-
ness of mobile advertising where the authors propose strategies for effective and targeted mobile
advertising.

Fraud Detection/Prediction and Prevention (Broeders et al., 2017; Edwards & Urquhart, 2016)

“Big data analytics in national security, law enforcement and the fight against fraud have the potential
to reap great benefits for states, citizens and society but require extra safeguards to protect citizens’
fundamental rights” (Broeders et al., 2017, p.i).

•	Credit Card Fraud: In their survey paper, Abdallah et al. (2016) provide a systematic and comprehensive overview of the issues (concept drift, non-support of real-time detection, skewed class distribution, large amounts of data) and challenges that obstruct the performance of fraud prevention systems; a brief illustrative sketch of handling such skewed distributions follows this list.
•	Criminal Identification: Made possible by low-cost, high-throughput technologies that support the rapid accumulation of molecular data, to which sophisticated machine-learning algorithms are applied in order to build generalizable, predictive models useful in the criminal justice system (Metcalf, 2017).

Querying, Searching, and Indexing Strategies (Rekik et al., 2018; Sun, 2017)

Both studies examine multi-dimensional complex queries in order to administer and assess the quality of large data-sets.

•	Keyword-Based Analytics (Radojicis et al., 2017; Zhu et al., 2015): Using the basic concepts of big data theory, both of these studies analyze and exploit keyword frequencies and their subsequent mapping.
•	Analytic Pattern Matching (Hu & Zhang, 2017): Through the application of various social network analysis and geographical visualization methods, these scholars attempt to decipher the structure and patterns of cross-national collaborations in big data research.


Data Mining and Knowledge Discovery

•	Big Data Analytics (Ruzgas et al., 2016): By identifying the challenges inherent in the analysis of big data, the authors propose exploratory methods and tools specifically designed for big data analytics.
•	Data Mining Tasks: Zhang et al. (2016) study an approach to mining abnormal time-series patterns in the hydrological field. The authors propose an innovative time-series-based concept for solving the problem of mining hydrological anomalies.
•	Analytics in Healthcare: Typically, in healthcare systems, big data analytics plays a vital role in a variety of disease pattern identification, prediction and therapy deliverables. Cases in point are therapies for diabetes (Gharajeh, 2017), cardiovascular disorders, cancer and Parkinson’s disease (Wu et al., 2017). This is accomplished by exhaustively exploring large data-sets using various data mining techniques. However, these investigations are faced with a major challenge: the use of personal health data for health research (Bietz et al., 2016).
• Statistical Modeling and Analysis (Gokmen & Vlasov, 2016): An illustration of this research
category is autonomous language translation (speech recognition, visual object detection, pat-
tern extraction). Various other day-to-day life transactions are associated with statistical modeling
(Frické, 2015).
•	Analytics for Genomic Medicine (Maia et al., 2017): Advances in genomic technologies in the last decade have revolutionized the field of medicine, especially in cancer, by producing a large amount of genetic information, often referred to as big data. This study identifies new patterns and relations among the genes and other organic structures present in humans and other living beings.
• Climate Predictions and Operative Projections: Can be made based on the effective analytics
of huge amounts of climate and environmental data (Manogaran and Lopez, 2018; Saitoh, 2017;
Asencio–Cortés, 2017; Ge, 2016).

Data Statistical Methods

•	Software Defect Diagnosis (Zhao et al., 2017)
•	Prediction (Abaei et al., 2017)
•	Manufacturing Products (He & Wang, 2017).

As for big data surveys, most give an overview of big data applications, opportunities and challenges.
Others discuss techniques and methodologies used in the big data environment and how they can help to
improve performance, scalability and authenticity of results. Wang et al. (2016) introduce an overview
on big data which includes four elements, namely: (i) concepts, characteristics and processing para-
digms of big data; (ii) the state-of-the-art techniques for decision-making in big data; (iii) accountable
managerial applications of big data in the social sciences; and (iv) the current challenges of big data as
well as possible future trends.
Khan et al. (2017, p.i) “survey scholarly data with respect to the different phases of the big data life-
cycle. They additionally identify the different big data tools and technologies that can be utilized for the
development of scholarly applications”.


Chen and Zhang (2014) likewise present a survey with respect to big data applications, opportunities,
challenges and typical techniques. Furthermore, their review discusses methodologies to handle big data
operative conundrums, for example, granular computing, cloud computing, bio-inspired computing, and
quantum computing.
Emani et al. (2015) present a general overview of big data management with an appraisal of its architecture and the challenge of merging the latter into an already existing information system.
Chen et al. (2014) review the general background of big data and real-world applications. Unlike the
aforementioned papers, they examine big data technologies by focusing on the four phases of the value
chain of big data, id est, data generation, data acquisition, data storage, and data analysis. For each phase,
they present the general background, discuss the technical challenges, and review the latest technologi-
cal advances. For all that, their review still does not give the reader a sufficient comparison and understanding of the current big data ecosystem.
Salleh and Janczewski (2016) provide a literature review on security and privacy issues of big data,
while Raguseo (2018) investigates the adoption levels of big data technologies in companies, and the
big data resources employed by them. The latter article furthermore identifies “the most frequently
recognized strategic, transactional, transformational and informational benefits and risks related to the
usage of big data technologies by companies” (Raguseo, 2018, p.i).
Zhang et al. (2018) consider the most typical deep learning models and their variants, and addition-
ally examine emerging research of deep learning models for big data feature learning.
Sangeetha and Prakash (2017) review in more detail big data mining algorithms, data slicing facilities,
mining platforms and clustering techniques. They also discuss their respective advantages and drawbacks
as well as their performance and quality measurements.
Wang et al. (2018) systematically explore the ideas behind current multiple data source mining
methods. They likewise consolidate recent research results in this field.
Mahdavinejad et al. (2017) assess “the different machine learning methods that deal with the challenges
in the Internet of Things (IoT) data by considering smart cities as the main use case” (p.i). Furthermore,
“a key contribution of this study is a presentation of a taxonomy of machine learning algorithms explain-
ing how different techniques are applied to the data in order to extract higher-level information” (p.i).
Zhang et al. (2017) summarize the history of machine learning and provide insight into recently
developed deep-learning approaches and their applications in rational drug discovery. They suggest that
this evolution of machine intelligence now provides a guide for early-stage drug design and discovery
in the current big data era.
Skourletopoulos et al. (2017, p.24) “present a review of the current big data research vista explor-
ing applications, opportunities and challenges, as well as the state-of-the-art techniques and underly-
ing models that exploit cloud computing technologies, such as the Big Data-as-a-Service (BDaaS) or
Analytics-as-a-Service (AaaS)”.
Much of the literature stresses the opportunities afforded by big data technologies by referring to the
so-called 3 V’s1: volume, velocity, and variety (Laney, 2001).
Since the conception of the 3 V’s, other scholars have added veracity, specifically, how much noise is
in the data (Goes, 2014), granularity (Aaltonen & Tempini, 2014; Yoo, 2015), and many other features
that have largely been associated with the technological functionality of big data (see Figure 1).
In 2001, Gartner (Laney, 2001) perhaps accidentally abetted an avalanche of alliteration with an article that forecast trends in the industry, gathering them under the headings “Data Volume, Data Velocity, and Data Variety”. Naturally, inflation continues its inexorable advance, and about a decade later we had the 4 V’s2 of big data, the 5 V’s3, then the 7 V’s4, and then the 10 V’s5. But it is now the year 2018 and we operate in an even more sophisticated world of analytics. Abreast of the other big data augmentative tendencies, Shafer (2017) proposes 42 V’s6 of big data and data science (Figure 1): can it be surmised that we are now well on the way to 100 V’s of big data and data science?

Figure 1. Eye-weighted spring embedded edge betweenness diagram of the V’s of big data
Source: Strydom (2018)

In Figure 1 each preceding generation, up to the 42 V’s, is embedded, using Cytoscape (Schwikowski, 2015), into the more recent version while orbiting around the initial 3 V’s. For example, the authors of the 4 V’s defined their V’s as consisting of the fundamental 3 V’s - Volume, Velocity and Variety - while adding a fourth V, veracity (Williamson, 2013). Though the evolution of the V’s in big data is beyond the scope of this chapter, it will suffice to say that the 3 V’s were found inadequate to fully describe big data, hence the surge to define other V-based characterizations of big data, where the V is merely used as a mnemonic device to label the different challenges.
“These Vs of big data challenge the fundamentals of existing technical approaches and require new
forms of data processing to enable enhanced decision-making, insight discovery, and process optimiza-
tion” (Curry, 2015, p.30). Organizations also need to understand why, whether and how to invest in big data. Such an understanding will also assist in maximizing the performance of an organization, which ultimately is the objective of these efforts. This led data analysts to replace the V’s with A’s - analytics, awareness, anticipation and action - which they felt better served this objective (Menninger, 2017).


The authors, on the other hand, propose a first-principles approach founded on the ineluctable conviction that big data is not a stable platform but, on the contrary, is as dynamic as the technology that defines it.

METHODOLOGY

The potential of big data is impacting all socioeconomic sectors, from healthcare to energy and transport, from finance and insurance to retail. It has assumed a ubiquitous macrocosm, and its positive transformational potential has already been acknowledged in multitudinous key sectors. A successful data ecosystem, which should be the most conspicuous constituent of any data-driven economy, has as its objective to amalgamate data owners, data analysts, skilled data professionals, cloud service providers, companies from industry, venture capitalists, entrepreneurs, research institutions and universities.
An efficient understanding and utilization of big data as an economic asset carries great financial and collective potential.
The relationship between the conceptual dimensions of a big data ecosystem can be represented as follows (Cavanillas et al., 2015) (Figure 2).

(Legal; Social; Economic; Technology; Application) ⊂ (Data) (1)

where the dimensions Legal, Social, Economic, Technology and Application are subsets of Data.
Figure 2 delineates the conceptual dimensions of a big data ecosystem, which is a prominent feature
of a data-driven economy and which would permit all its dimensions to seamlessly co-exist.

• Data (Big Data): Availability and access to data is the foundation of any data-driven economy.
• Skills (People): A critical challenge faced by all is to ensure the availability of skilled workers in
the big data ecosystem.
• Social (Testing; Piloting): It is critical to increase awareness of the benefits that big data can
deliver.
• Legal (Investigate; Awareness): Appropriate regulatory environments are indispensable to fa-
cilitate the development of a successful big data marketplace.
• Application (Testing; Piloting): Big data has the potential to transform many sectors and domains.
• Technology (Testing; Piloting): Key technical challenges that must be surmounted in order to
develop competitive advantages.
• Economic (Testing; Piloting): A big data ecosystem can support the transformation of existing
business sectors and the development of new innovative start-ups.

That said, an in-depth systematic literature review (Baker, 2016; Kulage & Larson, 2016) was
performed focusing on objectively reporting on the current knowledge on the dimensions of a big data
ecosystem (Figure 2) and subsequently providing an informed perspective or a comprehensive overview
of the knowledge available on the topic.
Figure 2. Conceptual dimensions of a big data ecosystem
Source: Strydom (2018; adapted from Cavanillas et al., 2015)

Building a representative dictionary of the subjects being analyzed was the key to the proposed procedure. This dictionary was based on the big data life-cycle, which can be presented as follows (Cavanillas et al., 2015):
(Data acquisition; Data analysis; Data curation; Data storage; Data usage) ⊂ (Data value chain) (2)

where Data acquisition, Data analysis, Data curation, Data storage and Data usage are subsets of the
big data value chain.
Big data can be acquired, analyzed, curated, stored and used in many different ways (Table 1).
The different big data subsets are delineated as follows (see also Figure 4):

(Structured data; Unstructured data; Event processing; Sensor networks; Protocols; Real-time; Data streams; Multi-modality) ⊂ (Data acquisition) (3)

(Stream mining; Semantic analysis; Machine learning; Information extraction; Linked data; Data discovery; “Whole world semantics”; Ecosystems; Community data analysis; Cross-sectional analysis) ⊂ (Data analysis) (4)

(Data quality; Trust/Provenance; Annotation; Data validation; Human-data interaction; Top-down/Bottom-up; Community/Crowd; Human computation; Curation at scale; Incentivisation; Automation; Interoperability) ⊂ (Data curation) (5)

(In-memory DBs; NoSQL DBs; NewSQL DBs; Cloud storage; Query interfaces; Scalability and performance; Data models; Consistency; Availability; Partition-tolerance; Security and privacy; Standardization) ⊂ (Data storage) (6)

(Decision support; Prediction; In-use analytics; Simulation; Modeling; Control; Domain-specific usage) ⊂ (Data usage) (7)
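To make the coding frame concrete, the listings in equations (3) to (7) can be captured in a simple keyword dictionary such as the minimal Python sketch below. It merely transcribes the elements named above into a data structure that could drive an automated tally; it is an illustrative aid, not the instrument actually used by the authors, and the helper function is hypothetical. Note that the chapter’s full frame catalogs 54 elements in total; the sketch contains only those elements explicitly listed in the equations.

```python
# Illustrative dictionary of the big data value chain, transcribed from
# equations (3) to (7); a sketch of the coding frame, not the authors' tool.
VALUE_CHAIN = {
    "Data acquisition": ["Structured data", "Unstructured data", "Event processing",
                         "Sensor networks", "Protocols", "Real-time", "Data streams",
                         "Multi-modality"],
    "Data analysis": ["Stream mining", "Semantic analysis", "Machine learning",
                      "Information extraction", "Linked data", "Data discovery",
                      "Whole world semantics", "Ecosystems", "Community data analysis",
                      "Cross-sectional analysis"],
    "Data curation": ["Data quality", "Trust/Provenance", "Annotation", "Data validation",
                      "Human-data interaction", "Top-down/Bottom-up", "Community/Crowd",
                      "Human computation", "Curation at scale", "Incentivisation",
                      "Automation", "Interoperability"],
    "Data storage": ["In-memory DBs", "NoSQL DBs", "NewSQL DBs", "Cloud storage",
                     "Query interfaces", "Scalability and performance", "Data models",
                     "Consistency", "Availability", "Partition-tolerance",
                     "Security and privacy", "Standardization"],
    "Data usage": ["Decision support", "Prediction", "In-use analytics", "Simulation",
                   "Modeling", "Control", "Domain-specific usage"],
}

def phase_of(keyword: str) -> str:
    """Return the value-chain phase to which a (hypothetical) keyword belongs."""
    for phase, elements in VALUE_CHAIN.items():
        if keyword.lower() in (e.lower() for e in elements):
            return phase
    return "Unclassified"

print(sum(len(v) for v in VALUE_CHAIN.values()), "elements transcribed")
print(phase_of("Machine learning"))  # -> Data analysis
```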

Every big data source has different attributes, including the frequency, volume, velocity, type, and veracity of the data (Shafer, 2017). When big data is processed and stored, additional dimensions come into play, such as governance, security and policies. Because so many factors have to be taken into consideration, the choice of an architecture, as well as the building of an appropriate big data solution, develops into a complex and highly challenging, yet commendable, exercise.
The big data value chain (Cavanillas et al., 2015, p.49), as illustrated in Table 1, was used “to model
the high-level activities that comprise an information system”: a total of 54 elements were cataloged.
Research linked to the big data value chain was identified by an extensive review of peer-reviewed
scholarly databases, namely, Elton B. Stephens Company (EBSCO), Science Direct, IEEE Xplore and
Scopus, and appropriately recorded.
Because of the rapid evolution of big data technology, a situation that predisposes the field to research
in very brief time spans, the year 2015 was chosen as the cut-off date for this study and the authors, as
follow-up research, also examined articles published in 2018.

Table 1. Value chain classification of the big data ecosystem
Source: Strydom (2018; adapted from Cavanillas et al., 2015)

For this purpose, the frequencies of the following indicators were analyzed: (i) indicated keywords, (ii) chosen research areas, (iii) focused variables, (iv) emphasized theoretical/conceptual backgrounds, and (v) data collection instruments.
The authors suspected that some papers would not have specified ‘‘big data” in the topic, title, abstract,
or subject, but are about data or data analytics in an organizational context and could mention the term
within the article itself. Therefore, three additional keywords were employed to increase the number of
potentially relevant search results (Boell & Cecez-Kecmanovic, 2015): ‘‘business intelligence”, ‘‘data
driven”, and different forms of ‘‘business analytics”.
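The frequency audit described above could, in principle, be scripted along the following lines: tally how often each coded keyword (plus the supplementary search terms) appears in exported article records. The toy records, field names and phrase list in the sketch are assumptions for illustration; they do not describe the authors’ actual tooling.

```python
# Illustrative sketch of a keyword-frequency tally over article records.
# The toy records and the phrase list are hypothetical placeholders.
import re
from collections import Counter

PHRASES = ["machine learning", "data curation", "business intelligence",
           "data driven", "business analytics"]

records = [
    {"title": "A data-driven approach to curation",
     "abstract": "We combine machine learning with data curation at scale."},
    {"title": "Business intelligence in retail",
     "abstract": "Business analytics and machine learning for decision support."},
]

counts = Counter()
for record in records:
    text = " ".join(record.values()).lower()
    for phrase in PHRASES:
        # Word-boundary match; hyphen or whitespace between words,
        # so "data driven" also catches "data-driven".
        pattern = r"\b" + r"[\s\-]+".join(re.escape(w) for w in phrase.split()) + r"\b"
        counts[phrase] += len(re.findall(pattern, text))

for phrase, n in counts.most_common():
    print(f"{phrase}: {n}")
```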
Networks that represent relationships between elements can be a valuable analytical tool.
Cytoscape (n.d.) is an open-source platform for network visualization and data integration, funded by the National Institute of General Medical Sciences and licensed from the Max Planck Institute for Informatics, that has the capacity to interpret the interrelationships between keywords.
Although originally designed for molecular components and interactions (Schwikowski, 2017), it is now a general platform for any type of network, including the two used in this chapter, that is, the big data V’s and the big data attributes.
Operationally, Cytoscape permitted the creation of a cloud network layout of an active sub-network, with each individual node beginning in its own cluster. This pattern enabled the identification of sub-networks characterized by specific features, such as the presence of dense interconnections with respect to the rest of the network.
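For readers who prefer a scriptable equivalent of the Cytoscape workflow just described, the minimal sketch below builds a keyword co-occurrence network and computes betweenness centrality and a modularity-based clustering with the networkx library. The keyword sets are invented placeholders; the sketch illustrates the kind of analysis performed, not the authors’ actual pipeline.

```python
# Illustrative, scriptable stand-in for the Cytoscape analysis described above:
# build a keyword co-occurrence network, then compute betweenness centrality
# and a modularity-based clustering. Keyword sets are hypothetical.
from itertools import combinations
import networkx as nx

articles = [
    {"machine learning", "data analysis", "semantic analysis"},
    {"data storage", "scalability", "cloud storage"},
    {"machine learning", "data storage", "data curation"},
    {"data curation", "data quality", "human computation"},
]

G = nx.Graph()
for keywords in articles:
    for u, v in combinations(sorted(keywords), 2):
        # Edge weight counts how many articles mention both keywords.
        w = G.get_edge_data(u, v, default={"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

centrality = nx.betweenness_centrality(G, weight="weight")
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{node}: {score:.3f}")

clusters = nx.algorithms.community.greedy_modularity_communities(G)
print([sorted(c) for c in clusters])
```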

FINDINGS AND ANALYSIS

The most relevant and synoptic measure of the pace of technical change in the big data ecosystem is its value chain, or data technology life-cycle.
The clusters of research articles pertaining to big data trends are noted in Table 2.
Each of these sectors was subsequently analyzed.
The frequency count presents a descriptive analysis of the 54 keywords used in this technology life-
cycle.
On that premise, and to understand and quantify the influence and importance of the relationship
among the keywords, social network analysis was conducted, and a complete network model was visu-
alized based on their relationships using centrality distribution (Figure 3). Resulting from this analysis, the pattern of these 54 nodes was observed. Their loads and importance were subsequently investigated in 2-dimensional graphs (see Figures 5 to 9).
The first-level big data nodes, which form the inner centrality elements (Figure 3), are defined hereafter.

• Data Acquisition (Figures 4 and 8): It is understood to be the “process of gathering, filtering and
cleaning raw data before it is put in a data warehouse or any other storage solution” (Lyko et al.,
2016, p.39). Data acquisition of big data is most commonly governed by the V’s of big data, which have developed from the 3 V’s to the 42 V’s and may be heading toward the 100 V’s.
• Data Analysis (Figures 4 and 6): This is “the sub-area of big data concerned with adding struc-
ture to raw data to support decision-making as well as assisting domain-specific usage scenarios”
(Domingue et al., 2016, p.63). Furthermore, “raw data, which may or may not be structured, and
which will usually be composed of many different formats, is transformed to be ready for data
curation, data storage and data usage” (Cavanillas et al, 2015, p.63).

Table 2. Clusters of articles addressing taxonomy of big data attributes
Source: Strydom (2018)

Figure 3. A circular attribute node betweenness centrality diagram of the big data value chain
Source: Strydom (2018)

• Data Curation (Figures 4 and 7): “One of the key principles of data analytics is that the quality
of the analysis is dependent on the quality of the information analyzed”, (Freitas & Curry, 2016,
p.87), confer, “garbage in, garbage out”. According to Cavanillas et al. (2015), in addition to data
volume, data consumers in the big data era need to cope with data variety, as a consequence of
decentralized data generation, where data is created under different contexts and requirements.
“Data curation thus provides the methodological and technological big data management support
to address data quality issues maximizing the usability of the data and heavily dependent on the
challenges of scale” (Freitas & Curry, 2016, p.87).
• Data Storage (Figures 4 and 5): The big data storage technology life-cycle is concerned with
storing and managing data in a scalable manner, satisfying “the needs of applications that require fast access to the data” (Curry, 2016, p.32). The ideal big data storage system would allow storage of a virtually unlimited amount of data, cope with high rates of random read/write access, flexibly and efficiently service a range of different data models, support both structured and unstructured data, and, for privacy reasons, work only on encrypted data.
• Data Usage (Figures 4 and 9): “One of the core business tasks of advanced data usage is the sup-
port of business decisions” (Becker, 2016, p.143). Furthermore, “Data usage is a wide field and
includes technology tasks, trends in various sectors, the impact of business models, and require-
ments on human-computer interaction”.

As illustrated in Figure 4, data storage (also Figure 5) and data analysis (also Figure 6) engendered the most wide-spread enthusiasm as research themes, respectively accounting for 39% and 38% of the research efforts during the 2015-2017 triennium.

Figure 4. Articles accredited to global big data attributes
Source: Strydom (2018)

When dealing with big data storage (Figure 5), we are fully cognizant that there is a physical barrier to the amount of information that can be stored in the universe. Yet we are additionally aware that this barrier is orders of magnitude larger than what we currently collect and analyze, and that it therefore offers a very important perspective of growth.
Is there a shadow in this revolution?

Figure 5. Articles accredited to big data storage attributes
Source: Strydom (2018)

Illustrative of this “infobesity” is the data stored during October 2017 (Jarlett, 2017) by the Conseil Européen pour la Recherche Nucléaire (CERN) data center, which amounted to a colossal 12.3 petabytes (the equivalent of roughly 190,000 64 GB smartphones!). Most of this data came from the Large Hadron Collider’s (LHC) experiments, so this record is a direct result of the outstanding LHC performance; the rest is made
up of data from other experiments and backups. By the end of June 2017, they had “already passed a data
storage milestone, with a total of 200 petabytes of data permanently archived on tape” (Jarlett, 2017).
The CERN data center is at the heart of the organization’s infrastructure. “Here data from every
experiment at CERN is collected, the first stage in reconstructing that data is performed, and copies of
all the experiments’ data are archived to long-term tape storage” (Jarlett, 2017).
“This data researchers hope will help the experiment’s ultimate goal of revealing fundamental insights into the laws of nature” (Castelvecchi, 2018, p.1).
Paradoxically, because there is such a dearth of present-day tools to treat this deluge of particle collisions (Castelvecchi, 2018), most of the data collected at CERN will be stored forever; the physics “data is so valuable that it will never be deleted and needs to be preserved for future generations of physicists” (Jarlett, 2017).
When the topic of scalability arises, storage systems play an important role (as recognized by a re-
search output of 25%, Figure 3), especially the indexing techniques and retrieval strategies.
A scalable data platform accommodates rapid changes in the growth of data, whether in traffic or in volume; if scalability is not implemented, it could negatively impact workflow, efficiency and customer retention.
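As a minimal illustration of the indexing and retrieval strategies invoked above, the sketch below shows hash-based sharding, one common way a storage layer can spread records over more nodes as data volume grows. The shard count, key format and records are assumptions chosen purely for demonstration.

```python
# Minimal sketch of hash-based sharding: records are spread across a
# configurable number of storage shards so capacity can grow horizontally.
# Shard count and keys are illustrative assumptions.
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    """Map a record key deterministically onto one of the shards."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value: dict) -> None:
    shards[shard_for(key)][key] = value

def get(key: str) -> dict:
    return shards[shard_for(key)].get(key, {})

for i in range(10):
    put(f"event-{i}", {"payload": i})

print([len(s) for s in shards])   # rough balance across shards
print(get("event-3"))             # retrieval goes straight to one shard
```

A production system would typically use consistent hashing rather than plain modulo hashing, so that growing the shard count does not remap most existing keys.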
There is a growing need for real-time results in application environments such as the Internet of
Things. This demand is driving the need for a new approach to storage. The technology in this space
has a particular goal in mind and that is to reduce the latency between the computational layer and the
storage layer. Storage must be persistent but long-term persistence commonly comes at a cost to com-
putational performance that makes the delivery of real-time analysis difficult at best. To overcome this
debilitating latency, a new approach needs to be found.
Though data security is treated here as a sub-set of data storage it is a concern that transcends this
feature and underpins the entire big data landscape.
By its nature, big data analytics has the capacity to permit organizations to optimize their previous data and use it to discover new operational possibilities. In substance, this illustrates that organizations can and will benefit from smarter business decisions, more efficient operations, healthier profits, and more contented clients and other stakeholders. Against this background, “the amount of data collected from various applications all over the world across a wide variety of fields today is expected to double every two years” (Acharjya & Kauser, 2016, p.517). These data are worthless unless they are analyzed to extract useful information. This latter fact “necessitates the development of techniques which can be used to facilitate big data analysis” (Acharjya & Kauser, 2016, p.517). The development of powerful computers (an edifying example being China’s Tianhe-2 supercomputer, capable of performing 33.86 quadrillion floating-point operations in a single second) is a boon to implementing these techniques, leading to automated systems. Among the mission-critical metrics of every decision-maker is how best to transform big data into knowledge and information to support and also drive decision-making. Besides high-performance, large-scale data processing, it likewise comprises exploiting attributes such as machine learning, ecosystem analysis, information extraction, linked data and semantic analysis (see Figure 6). The fourth V (Williamson, 2013), veracity, demonstrates the essential impact of
analytics uncertainty not only in big data analysis but also in data modeling and data processing. “Many


different models like fuzzy sets, rough sets, soft sets, neural networks, their generalizations and hybrid
models obtained by combining two or more of these models have been found to be fruitful in representing
data” (Acharjya & Kauser, 2016, p. 517). By its nature, data modeling plays a critical role in big data analytics, essentially due to the fact that 85% of big data is unstructured data - word processing, presentations, video and audio files, email, chat, and social media postings (Lim et al., 2016). The exponential growth of big data and the resulting curse of dimensionality also demand that, instead of collecting raw, redundant, inconsistent and noisy data, the effort is switched to concentrating on reduced and relevant data streams. This consequently generates the imperative to research other methods of data analysis, which include machine learning/deep learning/artificial intelligence (Figure 6).
Additionally, machine learning concepts and tools, which represent a prodigious 60% of research efforts (Figure 6), are gaining popularity among researchers with the aim of facilitating meaningful results from these concepts. “Research in the area of machine learning for big data has focused on data processing, algorithm implementation, and optimization” (Acharjya & Kauser, 2016). As highlighted elsewhere
in this chapter, the world’s data is proliferating in virtually every sector of society, and, as we evolve
towards distributed and real-time processing, traditional tools for machine learning are becoming less
efficient. This calls for radical change.
Furthermore, no exclusive toolkit is known to exist that veritably offers a one-size-fits-all type of solution.
The authors argue that while each of these tools has its advantages and limitations, more efficient tools must be developed that depend less on the identification of statistical patterns in multidimensional data sets, particularly when large, labeled training sets are readily available. In its present state, the current family of tools lacks the rich capabilities associated with human learning that allow humans to generalize from small numbers of exemplars, apply previously learned knowledge to new tasks, and cope with changing contexts and dynamic environments (Greenwald & Oertel, 2017).

Figure 6. Articles accredited to big data analysis attributes
Source: Strydom (2018)
Apart from this generalizability associated with humans, modern tools must also be capable of handling noisy and imbalanced data, outliers, uncertainty and inconsistency, and missing values, as well as redundant or harmful features in data.
For the purposes of this chapter, artificial intelligence, machine learning and deep learning have been lodged under the same framework. Deep learning, which is considered a technique for implementing machine learning, is, in fact, an important step toward artificial intelligence (see below).

(Deep learning; Machine learning; Algorithms) ⊂ (Artificial intelligence) (8)

where Deep learning, Machine learning and Algorithms are subsets of Artificial intelligence.
Deep learning not only provides complex representations of data which are suitable for artificial intelligence tasks, but also makes the machines independent of human knowledge, which is the goal of artificial intelligence. AI, by dint of its inherent inference engine, infers patterns from data-sets without reference to known, or labeled, results, extracting representations directly from unsupervised data without human intervention (Najafabadi et al., 2015).
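The notion of learning representations directly from unlabeled data can be illustrated with the toy autoencoder-style sketch below: a small network is trained simply to reconstruct its input, and its hidden layer then serves as a compressed representation. It uses scikit-learn as a convenient stand-in for a proper deep-learning framework; the layer size, data and activation are arbitrary assumptions, not a recipe from the cited works.

```python
# Toy autoencoder-style sketch: train a small network to reconstruct its own
# input (no labels), then reuse the hidden layer as a learned representation.
# Sizes, data and activation are illustrative assumptions only.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                                   # unlabeled data
X[:, 5:] = X[:, :5] + 0.05 * rng.normal(size=(500, 5))           # redundant features

# Input -> 3 hidden units -> input: the training target is the input itself.
ae = MLPRegressor(hidden_layer_sizes=(3,), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# The hidden activations are the learned, compressed representation.
hidden = np.maximum(0.0, X @ ae.coefs_[0] + ae.intercepts_[0])
print("original shape:", X.shape, "-> representation shape:", hidden.shape)
print("reconstruction R^2:", round(ae.score(X, X), 3))
```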
Without question, machine learning is the most brilliant star of the big data research constellation
and will continue to be an area of active research well into the foreseeable future.
Data curation (see Figure 7) provides the methodological and technological data management sup-
port to address data quality issues thus maximizing the usability of the data (Cavanillas et al., 2015).

Figure 7. Articles accredited to big data curation attributes
Source: Strydom (2018)

According to Freitas and Curry (2015, p.89), “The core impact of data curation is to enable more
complete and high-quality data-driven models for knowledge organizations”.
Cavanillas et al. (2015) submitted that poor data quality – validity, accuracy, consistency, completeness, timeliness, integrity (Shafer, 2017) – will successively impact the quality of knowledge productivity and hence suffer the syndrome of “garbage-in, garbage-out” at a decisional level.
Because of the need to aggregate data from diverse sources to form a unique picture of a business
situation, and, on account of big data, data curation, accompanied by the appropriate research activities,
has just recently started to enter corporate parlance. Companies now realize that data curation takes the
value of data to a new level.
The most prevalent areas of research in data curation are the double-digit attributes (Figure 7), namely, Curation at scale (24%), Human computation (21%) and Trust/provenance (16%). The attraction of human computation as an area of research is oddly high. Would human computation be adequate for the scale of problems that are encountered in the field? For example, one web aggregator requires the curation of 100,000 URLs and the CERN data center has the challenge of curating 10 petabytes of data. “At this scale, data curation cannot be a manual (human) effort, but must entail artificial intelligence approaches with human assistance only when necessary” (Devlin, 2017, p.1).
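In that spirit, automated rule-based screening can carry most of the curation load and reserve human attention for the records that fail. The following minimal sketch applies completeness, validity and duplicate checks to a handful of toy records; the field names, rules and records are assumptions for illustration only.

```python
# Minimal sketch of automated, rule-based curation checks that flag records
# for human review. Field names, rules and records are illustrative assumptions.
from datetime import datetime

REQUIRED = ("id", "source", "timestamp")

records = [
    {"id": "r1", "source": "sensor-a", "timestamp": "2018-03-01T12:00:00"},
    {"id": "r2", "source": "", "timestamp": "2018-03-01T12:05:00"},          # missing source
    {"id": "r1", "source": "sensor-a", "timestamp": "2018-03-01T12:00:00"},  # duplicate id
    {"id": "r3", "source": "sensor-b", "timestamp": "not-a-date"},           # invalid date
]

seen_ids = set()
for rec in records:
    problems = []
    for field in REQUIRED:
        if not rec.get(field):
            problems.append(f"missing {field}")
    try:
        datetime.fromisoformat(rec.get("timestamp", ""))
    except ValueError:
        problems.append("invalid timestamp")
    if rec.get("id") in seen_ids:
        problems.append("duplicate id")
    seen_ids.add(rec.get("id"))

    status = "needs human review: " + ", ".join(problems) if problems else "accepted"
    print(rec.get("id"), "->", status)
```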
The big data acquisition repertoire of resources has to contend with high-velocity, variety, as well
as real-time data acquisition (see Figure 8). It is essentially for this reason that operational resources
– both hardware and software - for data acquisition have to ensure very high throughput (Lyko et al.,
2016). “This also means that it is anticipated that data is procured from multiple resources (for example, social networks, sensors, web mining, and logs) with different structures, or be unstructured (text, video, pictures, and media files) and at a very high rate (tens or eventually hundreds of thousands of events per second)” (Lyko et al., 2016, p.51). The main challenge in acquiring big data is therefore to provide frameworks and operational resources – both hardware and software – that ensure the required throughput for the problem at hand without losing any data in the process (Emani et al., 2015; Cavanillas et al.,
2015). Some applications indeed need real-time analytics to quickly find useful correlations, customer
preferences, hidden patterns, and other valuable information that can assist organizations and decision
makers to take more informed business actions. These challenges provided scholars (38%, in Figure 8)
opportunities to investigate different directions of this acquisition attribute.
Several organizations that rely internally on big data processing have devised enterprise-specific protocols, the majority of which have not been publicly released (Lyko et al., 2016). The others are based on advanced message queuing protocols, Java message services and diverse software tools, and are beyond the scope of this chapter, except to indicate that 22% of researchers contributed articles to this attribute of big data protocols.
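Underneath most of those message-oriented acquisition stacks lies the same basic pattern: producers push events into a bounded buffer while a consumer drains it, so that bursts do not overwhelm the storage layer. The sketch below shows that pattern with Python’s standard library only; the event counts, queue size and processing step are illustrative assumptions, not a description of any enterprise protocol.

```python
# Minimal producer/consumer sketch of buffered, high-throughput acquisition.
# Event counts, queue size and the "processing" step are illustrative only.
import queue
import threading
import time

EVENTS = 10_000
buffer = queue.Queue(maxsize=1_000)   # bounded buffer absorbs bursts
SENTINEL = None

def producer() -> None:
    for i in range(EVENTS):
        buffer.put({"event_id": i, "payload": i % 7})  # blocks if buffer is full
    buffer.put(SENTINEL)

def consumer(counter: list) -> None:
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        counter[0] += 1            # stand-in for cleaning/writing to storage

received = [0]
start = time.perf_counter()
threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer, args=(received,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(f"{received[0]} events in {elapsed:.2f}s ({received[0] / elapsed:,.0f} events/s)")
```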
From an acquisition perspective, data variety is probably the biggest obstacle to the effective use of large volumes of data. Incompatible data formats, incomplete data, non-aligned data structures, and inconsistent data semantics represent significant challenges that can lead to acquisition spread over a large area in a cumbersome and irregular manner. Estimates are that anywhere between 80% and 90% of data in an average organization is unstructured. Besides being, at times, the untapped majority of data, unstructured data is additionally the variety that grows faster than any other type, yet in this study it represents a mere 6% of scholars’ efforts. The possible interpretation is that it has been considered under other attributes, for example, in the analysis sector under artificial intelligence.

Figure 8. Articles attributed to big data acquisition attributes
Source: Strydom (2018)

“The usage and exploitation of big data illustrates the value creation possibilities of big data appli-
cations in various sectors, including industry, healthcare, finance, energy, media and public services”
(Cavanillas et al., 2015, p.x). Paradoxically, in Figure 3, this large spectrum of activities is credited with a mere 6% of research articles. The reason for this anomaly is that, in this study, the authors focused on “usage” in its narrow sense and not its broader meaning.
We have understood that big data is not only about data, per se, but also embodies a complete conceptual and technological stack, including raw and processed data, storage, ways of managing data, processing and analytics. A challenge that becomes even more complex is the management of the quality of data in the big data ecosystem. A fortiori, the need for assessing quality-in-use has gained importance, since the real contribution of data – its business value – can only be estimated in its context of use. This need is recognized by scholars (20%, in Figure 9).
Scholars further illustrate that, even in the era of big data, the iron law of data management is still prevalent: “If you want to do anything with data, you are eventually going to have to derive, impute, or invent schema. You have to model data” (Swoyer, 2017, p.1). Researchers acknowledge this attribute as representing 19% (see Figure 9) of their preoccupation.
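Swoyer’s point - that schema must eventually be derived - can be made concrete with the small sketch below, which infers field names, types and nullability from a handful of semi-structured records. The records and the simple type rules are assumptions for illustration; real schema-on-read tooling is considerably richer.

```python
# Minimal sketch of deriving a schema from semi-structured records:
# collect every field, its observed types and whether it is ever missing.
# Records and type rules are illustrative assumptions.
from collections import defaultdict

records = [
    {"user": "u1", "amount": 12.5, "country": "FR"},
    {"user": "u2", "amount": 3,    "country": None},
    {"user": "u3", "amount": 7.8},                      # "country" absent
]

all_fields = {f for r in records for f in r}
observed = defaultdict(set)
for rec in records:
    for field in all_fields:
        value = rec.get(field, None)
        observed[field].add(type(value).__name__ if value is not None else "missing")

for field, types in sorted(observed.items()):
    nullable = "missing" in types
    concrete = sorted(t for t in types if t != "missing") or ["unknown"]
    print(f"{field}: {'/'.join(concrete)}" + (" (nullable)" if nullable else ""))
```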
“The process of decision-making includes reporting, exploration of data (browsing and lookup), and
exploratory search (finding correlations, comparisons, what-if scenarios)” (Becker, 2015, p.143), whose
business value is the control over the big data life-cycle. Scholars, in their research, have endorsed that
control (16%, in Figure 9) provides opportunities and requirements for data markets and services.

Figure 9. Articles attributed to big data usage challenges
Source: Strydom (2018)

FUTURE RESEARCH DIRECTIONS

In the body of this chapter, assorted future enhancements were tabled; they will furthermore be discussed
in this section.
The preceding literature review can be rationally characterized by its many speculations and opinions. In and of itself, this calls for supplementary empirical studies that would critically examine how organizations actually, in practice, realize value from big data. Explicitly, future research needs to empirically examine how different actors within organizations work with big data in the field, how organizational models are developed, and how organizations deal with different stakeholders’ interests to realize value from big data. Organizations, in all probability, adopt different positions on each of the 54 attributes mentioned, depending on a range of factors such as industry context and organizational size. This necessitates more empirical approaches that investigate when, if, and how opposing positions are relevant to them.
In this chapter, with a contingent of approximately 1,000 articles accredited (Table 2 and Figure 6), machine learning was the scholars’ most preferred subject of interest.
The relationship between artificial intelligence and its constitutive variables - data, machine learning and algorithms - can be expressed as follows:

(Big data; Machine learning; Algorithms) ⊂ (Artificial intelligence) (9)

where Big data, Machine learning and Algorithms are subsets of Artificial intelligence.


Press (2017) has postulated that the number of global survey respondents at enterprises with more
than 100 terabytes of unstructured data has doubled since 2016. However, because older-generation text
analytics platforms are so complex, only 32% of companies have successfully analyzed text data, and
even fewer are analyzing other unstructured sources (Press, 2017). Consequently, in spite of the plenitude and potential of available data, scaling the development of predictive models is still difficult and unsustainable in terms of energy efficiency; it is generally presumed that the ultimate victory will be achieved when the boundaries between structured and unstructured data are completely erased.
Considering the low maturity of deep learning, which is still governed by an indistinct fundamental mathematical theory, considerable work still remains to be done (Najafabadi et al., 2015). There is still research needed for improving machine learning and the formulation of high-level abstractions and data representations for big data. In the context of this problem, there is a necessity to explore what volume of input data is generally necessary to train useful data representations by deep learning algorithms, which can subsequently be generalized for new data in the specific big data application domain (Najafabadi et al., 2015).
Najafabadi et al. (2015, p.17) advance that domain adaptation during learning is an important focus of
deep learning, “where the distribution of the training data (from which the representations are learned)
is different from the distribution of the test data (on which the learned representations are deployed)”.
The question raised is how and why deep learning, despite its large capacity and complexity, can be
adequately generalized.
One of the biggest challenges for many organizations is to obtain value from unstructured data. We have seen that semantic technologies can potentially help them gain insights into nuggets of information that were not previously available. Though merely 4% of scholars (see Figure 6) explored this question, it would be interesting to establish what criteria are necessary to allow extracted data representations to provide useful semantic meaning to big data.
At a conceptual level, learning techniques and algorithms serve for the parameterization of deep neural network structures, that is, they extract high-level features from the given data-set. This methodology is, however, unsuitable for most big data analytics, since these involve learning from largely unsupervised data. Further research is needed to correct this oddity.
“The rapid evolution and adoption of big data by industry has leapfrogged the discourse to popular
outlets, forcing the academic press to catch up. Academic journals in numerous disciplines, which will
benefit from a relevant discussion of big data, have yet to cover the topic” (Gandomi & Haider, 2015, p.i). Potentially, in the very near future, academia will become the new insights partner for enterprises. It is imminent that big data will impact the human sciences more extensively.
We suddenly have enough data to make big discoveries and important achievements, that is, the biggest impact. In physics, scientists have confirmed the existence of the Higgs boson, thanks to CERN’s particle accelerator and its ability to process a great deal of different information (Jarlett, 2017). We not only have a view of these data, but we have enough to build rigorous mathematical models. This is the beginning of the passage from “a good science”, which nevertheless remains limited because it is qualitative, to a science which becomes quantitative.
Additionally, the original dream of social physics is about to be realized. Its founders imagined this discipline to be underpinned by a mathematical theory of human culture. But, having neither the necessary mathematics nor the data indispensable for their demonstrations, they finally abandoned this route. Sociology as we know it, associated with qualitative work, makes it possible to identify facts and attempt to understand social principles. It does not, however, really permit the construction of predictive models. Today, we are starting to be able to do this. We have much more data and much better mathematics, and we can discover the real functioning of our society, a society that will be framed by the flow and exchange of ideas, and not the ideas per se.
“Financial crimes remain a major ongoing cost to businesses. Whether perpetrating credit card fraud
and money laundering in the banking industry, or fraud, waste, and abuse in the healthcare field, financial
criminals relentlessly devise new attacks that assail old defenses and all too often slip by, undetected”
(Moss et al., 2018, p.i). Moreover, financial crimes include much wider societal offenses, such as, drug
trafficking, money laundering, human trafficking and terrorist financing, often with devastating conse-
quences. Financial institutions have thus far only resorted to legacy technology to combat the problem
of Cybersecurity; with limited success. Though convinced by the potential contribution of artificial
intelligence to combat this ever-increasing scourge, they have not ready adopted this technology. One
of the prime reasons being that financial institutions are constrained to comply to strict regulatory and
compliance requirements, which inevitably preclude the deployment of artificial intelligence models.
Furthermore, these models must be integrated into the existing, highly regulated operational systems,
among them being, Anti-Money Laundering (AML), Counter Terrorist Financing (CTF), Bank Secrecy
Act (BSA), and Know Your Customer (KYC),
Future research would be to multiply instances of “how”, as opposed to “if”, artificial intelligence
models can reliably analyze massive, and ever-increasing amounts of transactional and client informa-
tion from diverse provenances.
“In efforts to tackle extremism, the British government recently unveiled a machine learning tool it says can accurately detect jihadist content and block it from being viewed” (Lee, 2018, p.1; Vaas, 2018, p.1). “Thousands of hours of content posted by the Islamic State group was run past the tool, to ‘train’ it to automatically spot extremist material” (Lee, 2018, p.1).
Facebook, Google and Twitter are credited with similar efforts (Hern, 2018).
Another point of contention is that, presently, empirical information systems studies have barely alluded to how organizations deal with legal and ethical concerns in innovative, but acceptable, ways. In this regard, we need to critically examine which features of big data influence these concerns.
At this point in time, the major technical conviction is that data or data-sets, and not present-day algorithms, might be the critical restrictive factor in the development of human-level artificial intelligence (Wissner-Gross, 2016; Gupta et al., 2017). Exactly half-a-century ago, at the inception of the artificial
intelligence era, “two of its founders famously anticipated that solving the problem of computer vision
would take only a summer” (Wissner-Gross, 2016, p.1; Lu and Li, 2017). While imaging software is
undeniably not a nascent technology within the vision arena of activities, the developments in this area
in recent years have been consequential, particularly in the use of artificial intelligence, machine learning
and deep learning technologies where they even surpass human performances (Kurzweils et al., 2017).
Why then was the artificial intelligence revolution so erratic in materializing? Is data more influential
than algorithms in artificial intelligence?
History has proven that, especially since the emergence of deep learning, data might matter even more
than algorithms themselves. According to Wissner-Gross (2016); “The average elapsed time between
key algorithm proposals and corresponding advances was about eighteen years, whereas the average
elapsed time between key data-set availabilities and corresponding advances was less than three years,
or about six times faster, suggesting that data-sets might have been limiting factors in the advances”
(Wissner-Gross, 2016, p.1).


Wissner-Gross (2016) furthermore assumes that there exists an unmitigated duality between data-sets and algorithms, whereas Amatriain (2017a, 2017b) maintains that, for data-hungry applications (machine learning and algorithms), this rationalization - that one needs huge amounts of data in order to leverage the latest advances - is not always so linear. However, if one is attempting to vigorously drive cutting-edge technologies to discover very concrete applications, then one would unquestionably need to have internal data that can be employed to train a deep learning approach.
This intra-algorithm debate is probably eclipsed by another, larger debate: that between machine learning and scientific reasoning. While Domingos (2015), with his “master algorithm”, expects machine learning algorithms to asymptotically develop to be capable of a perfect understanding of how the world and people function, the European Commission, via its parliament, has a tangential view: “In particular, buzz-phrases like ‘Artificial Intelligence’, ‘deep learning algorithm’, ‘data-based algorithm’ should act as warning signs that the ‘algorithm’ is in fact probably no more than unscientific reasoning by generalization” (GDPR, 2016, 15.1(h)).
Scholars should be aware that though artificial intelligence is a hugely exciting field, it has limitations
which present grave dangers if given uncritical use.
World-renowned physicist Stephen Hawking has warned that efforts to create thinking machines
pose a threat to our very existence: “Computers can, in theory, emulate human intelligence, and exceed
it. Success in creating effective artificial intelligence could be the biggest event in the history of our
civilization. Or, the worst. We just won’t know. So, we cannot know if we will be infinitely helped by
artificial intelligence, or ignored by it and side-lined, or conceivably destroyed by it” (Stephen Hawking
cited by Osbourne, 2017).
The academic community is not alone in warnings about the potential dangers of artificial intelligence
over and above its prospective benefits. A number of industrial trailblazers have also expressed their
concerns about the damage that super-intelligent artificial intelligence could wreak on humanity. Among
these is the celebrated technology entrepreneur Elon Musk who has warned that artificial intelligence
is “our biggest existential threat”. He further warns that “an immortal digital dictator” could forever
trap humanity in its grasp unless we begin regulating technology as soon as possible (Elon Musk cited
by Specktor, 2018).
The European Commission has additionally expressed similar apprehensions by recently implementing the General Data Protection Regulation (GDPR), whose objective is to strengthen and unify data protection for individuals within the European Union (EU) as well as to regulate the export of personal data outside of the EU. In a nutshell, the GDPR will now legislate where sensitive data is stored, who is accessing it, and who should be accessing it. The immediate impact of these regulations suggests that European companies processing personal data may be discouraged from using artificial intelligence technologies, thus putting them (the EU companies) at a competitive disadvantage with respect to their Asian and North American counterparts (Wallace & Castro, 2018).
Are the GDPR’s AI-limiting provisions actually protecting consumers?
Based on future empirical evidence, we as scholars may be able to judge to what extent these regulations meet expectations, both for organizations seeking to strategically benefit from big data and artificial intelligence, and for society as a whole.
Last, but not least, the opportunities created by scrutinizing the life-cycle of the big data ecosystem impact how organizational actors work with big data, develop organizational models, and manage stakeholder interests. Future research needs to rigorously examine how these opportunities are exploited in practice.


The cohort of technology pioneers advanced by the most recent World Economic Forum (2018) cataloged several high-tech trailblazers who have one essential point in common: their activities were all contingent on artificial intelligence and machine learning – see a selection in Table 3.
This actuality resonates with the analysis of this study, which concluded that the great cluster of the scholars’ research articles (Figures 4 and 6) lay in the domains of artificial intelligence and machine learning, which, we assume, collaterally translated into innovative enterprises.
This, in turn, validates the authors’ first-principles approach through which this reality was enacted.

CONCLUSION

Previous big data reports, which focused on opportunities, applications, challenges and uses, were found
wanting. This is primarily because big data processes differ from traditional data explorations. The
former are characterized by volume, velocity and variety, and other big data V’s, and even A’s. To this
end, the big data life-cycle, which by its nature transcends the V’s definition, was structured around
activities of data acquisition, data analysis, data curation, data storage and data usage. Working through
first principles assured a holistic approach and was found to be the most efficient manner to elicit the
specificity of projects and evaluate the model’s utility in practice. It was found imperative to understand
big data through the prism of its life-cycle.
In the context of auditing keywords in the literature and providing insights on the big data ecosystem,
the authors, employing a big data technology life-cycle approach, analyzed diverse types of big data attributes.

Table 3. Pioneering companies leading the artificial intelligence revolution


In the grand scheme of things, what positions organizations adopt with respect to each of the big data attributes, and consequently how organizations realize value from big data in practice, also depends on what the technology permits.
In the research, it was noted that data analytics and data storage are the two high-focus domains of
data science with machine learning being the preeminent topic among scholars.
Security issues, though in the recent political limelight, did not seem to be a major research topic.
Though this chapter seems to be fundamentally narrow in its focus, it, on that premise and to its credit, likewise attempted to avoid what would have been a rather cumbersome exercise, that is, analyzing each and every big data attribute - there are 45 of them graphically delineated in Figures 4 to 9. The authors, to depict the nature of the methodology, limited their analysis to the most illustrative of the big data attributes, namely, data analysis with its sub-sections machine learning and the ecosystem. The other attributes, especially those associated with data storage – scalability and consistency – and data acquisition – real-time – were principally mentioned to validate the first-principles method.
The central argument of this study was that an analysis by a first-principles approach, as opposed to one organized around opportunities, had the best probability of considering all the operational attributes of big data. This objective was largely accomplished.
Put simply, the big data ecosystem covered all the conceptual dimensions of the big data life-cycle. The key allusion is that one might need “45 V’s” to replicate a similar outcome!
The authors recognize the limitations of their research, and readers, and future academics and re-
searchers should be aware of these and indeed interpret the material presented in this chapter within the
context of these limitations. Axiomatically, a meta-analysis reposes on the existing, as well as accessible, research studies (both conceptual and empirical). While, to identify all possible relevant articles, the authors conducted a thorough literature search, principally through the Science Direct and EBSCO databases, it is possible that some research articles could have been overlooked in this review, especially when compared to some other leading databases, for example, the Web of Science.
Additionally, the analysis and synthesis are based on the authors’ interpretation of the selected articles. The authors attempted to avoid these issues by independently cross-checking papers and thus dealing with embedded bias. Despite this, and though they humbly consider this research to be robust since every effort was taken to mitigate errors, some errors might presumably have slipped through the cracks.

REFERENCES

Aaltonen, A., & Tempini, N. (2014). Everything counts in large amounts: A critical realist case study
on data-based production. Journal of Information Technology, 29(1), 97–110. doi:10.1057/jit.2013.29
Abaei, G., Selamat, A., & Fujita, H. (2015). An empirical study based on semi-supervised hybrid self-
organizing map for software fault prediction. Knowledge-Based Systems, 74, 28–39. doi:10.1016/j.
knosys.2014.10.017
Abdallah, A., Maarof, M. A., & Zainal, A. (2016). Fraud detection system: A survey. Journal of Network
and Computer Applications, 68, 90–113. doi:10.1016/j.jnca.2016.04.007


Acharjya, D. P., & Kauser, A. P. (2016). A Survey on Big Data Analytics: Challenges, Open Research
Issues and Tools. International Journal of Advanced Computer Science and Applications, 7(2). Retrieved
from https://thesai.org/Downloads/Volume7No2/Paper_67-A_Survey_on_Big_Data_Analytics_Chal-
lenges.pdf
Ali-ud-din Khan, M., Uddin, M.F., & Gupta, N. (2014). Seven V’s of Big Data understanding Big Data
to extract Value. Retrieved from https://ieeexplore.ieee.org/document/6820689
Amatriain, X. (2017a). Is Data more important than Algorithms in Artificial Intelligence? Retrieved from https://www.forbes.com/sites/quora/2017/01/26/is-data-more-important-than-algorithms-in-ai/#7424f7dc42c1
Amatriain, X. (2017b). In Machine Learning, is more Data always better than better Algorithms? Retrieved
from https://www.quora.com/In-machine-learning-is-more-data-always-better-than-better-algorithms
Asencio–Cortés, G., Morales–Esteban, A., Shang, X., & Martínez–Álvarez, F. (2017). Earthquake pre-
diction in California using regression algorithms and cloud-based big data infrastructure. Computers &
Geosciences. doi:10.1016/j.cageo.2017.10.011
Baker, J. D. (2016). The Purpose, Process, and Methods of Writing a Literature Review. AORN Journal,
103(3), 265–269. doi:10.1016/j.aorn.2016.01.016 PMID:26924364
Balachandran, B., & Prasad, S. (2017). Challenges and Benefits of Deploying Big Data Analytics
in the Cloud for Business Intelligence. Procedia Computer Science, 112, 1112–1122. doi:10.1016/j.
procs.2017.08.138
Becker, T. (2016). Big Data Usage. In J. Cavanillas, E. Curry, & W. Wahlster (Eds.), New Horizons for
a Data-Driven Economy. Cham: Springer; doi:10.1007/978-3-319-21569-3_8
Bietz, M. J., Bloss, C. S., Calvert, S., Godino, J. G., Gregory, J., Claffey, M. P., Sheehan, J. and Patrick,
K. (2016). Opportunities and challenges in the use of personal health data for health research. Journal
of the American Medical Informatics Association, 23(1), e42-e48. Doi:10.1093/jamia/ocv118
Boell, S. K., & Cecez-Kecmanovic, D. (2015). A Hermeneutic Approach for Conducting Literature
Reviews and Literature Searches. Communications of the Association for Information Systems, 34, 12.
Retrieved from http://aisel.aisnet.org/cais/vol34/iss1/12
Broeders, D., Schrijvers, E., van der Sloot, B., van Brakel, R., de Hoog, J., & Hirsch Ballin, E. (2017).
Big Data and security policies: Towards a framework for regulating the phases of analytics and use of
Big Data. Computer Law & Security Review, 33(3), 309–323. doi:10.1016/j.clsr.2017.03.002
Cano, J. (2014). The V’s of Big Data: Velocity, Volume, Value, Variety and Veracity. Retrieved from
https://www.xsnet.com/blog/bid/205405/the-v-s-of-big-data-velocity-volume-value-variety-and-veracity
Castelvecchi, D. (2018). Particle physicists turn to AI to cope with CERN’s collision deluge. Nature.
Retrieved from https://www.nature.com/articles/d41586-018-05084-2
Cavanillas, J. M., Curry, E., & Wahlster, W. (2015). New Horizons for a Data-Driven Economy: A Roadmap
for Usage and Exploitation of Big Data in Europe. Academic Press. doi:10.1007/978-3-319-21569-3_3


Chen, C. L. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technolo-
gies: A survey on Big Data. Information Sciences, 275, 314–347. doi:10.1016/j.ins.2014.01.015
Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171–209. doi:10.1007/s11036-013-0489-0
Curry, E. (2016). The Big Data Value Chain: Definitions, Concepts, and Theoretical Approaches. In J.
Cavanillas, E. Curry, & W. Wahlster (Eds.), New Horizons for a Data-Driven Economy. Cham: Springer;
doi:10.1007/978-3-319-21569-3_3
Cytoscape. (n.d.). Cytoscape. Retrieved from: http://www.cytoscape.org/
Devlin, B. (2017). In the Middle of Data Integration Is AI. Transforming Data With Intelligence. Retrieved
from https://tdwi.org/articles/2017/10/27/adv-all-middle-of-data-integration-is-ai.aspx
Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will
Remake Our World. Amazon.
Domingue, J., Lasierra, N., Fensel, A., van Kasteren, T., Strohbach, M., & Thalhammer, A. (2016).
Big Data Analysis. In J. Cavanillas, E. Curry, & W. Wahlster (Eds.), New Horizons for a Data-Driven
Economy. Cham: Springer; doi:10.1007/978-3-319-21569-3_5
Edwards, L., & Urquhart, L. (2016). Privacy in public spaces: what expectations of privacy do we have
in social media intelligence? International Journal of Law & Information Technology, 24(3), 279-310.
Doi:10.1093/ijlit/eaw007
Emani, C. K., Cullot, N., & Nicolle, C. (2015). Understandable Big Data: A survey. Computer Science
Review, 17, 70–81. doi:10.1016/j.cosrev.2015.05.002
Firican, G. (2017). The 10V’s of Big data. Retrieved from https://tdwi.org/articles/2017/02/08/10-vs-
of-big-data.aspx
Freitas, A., & Curry, E. (2016). Big Data Curation. In J. Cavanillas, E. Curry, & W. Wahlster (Eds.),
New Horizons for a Data-Driven Economy. Cham: Springer; doi:10.1007/978-3-319-21569-3_6
Frické, M. (2015). Big data and its epistemology. Journal of the Association for Information Science
and Technology, 66(4), 651-661. Doi:10.1002/asi.23212
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. In-
ternational Journal of Information Management, 35(2), 137–144. doi:10.1016/j.ijinfomgt.2014.10.007
GDPR. (2016). The European Parliament and the Council. Retrieved from https://eur-lex.europa.eu/
legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679
Ge, P., Ritchey, N. A., Casey, K. S., Kearns, E. J., Privette, J. L., Saunders, D., … Ansari, S. (2016).
Scientific Stewardship in the Open Data and Big Data Era - Roles and Responsibilities of Stewards and
Other Major Product Stakeholders. D-Lib Magazine, 22(5-6). Doi:10.1045/may2016-peng
Gharajeh, M. S. (2017). Biological Big Data Analytics. Advances in Computers. doi:10.1016/
bs.adcom.2017.08.002


Goes, P. B. (2014). Big Data and IS Research. MIS Quarterly, 38(3), iii-viii.
Gokmen, T., & Vlasov, Y. (2016). Acceleration of Deep Neural Network Training with Resistive Cross-
Point Devices: Design Considerations. Frontiers in Neuroscience, 10, 333. doi:10.3389/fnins.2016.00333
PMID:27493624
Greenwald, H. S., & Oertel, C. K. (2017). Future Directions in Machine Learning. Frontiers in Robotics
and AI. Computational Intelligence. doi:10.3389/frobt.2016.00079
Grensing-Pophal, L. (2015). The State of Content Marketing. EContent, 38(1), 16-17.
Gupta, A., Sun, C., Shrivastava, A., & Singh, S. (2017). Revisiting the Unreasonable Effectiveness of
Data. Google Artificial Intelligence Blog. Retrieved from https://ai.googleblog.com/2017/07/revisiting-
unreasonable-effectiveness.html
He, Q. P., & Wang, J. (2017). Statistical process monitoring as a big data analytics tool for smart manu-
facturing. Journal of Process Control. doi:10.1016/j.jprocont.2017.06.012
Hern, A. (2018). Facebook, Google and Twitter to Testify in Congress over Extremist Content. Retrieved
from https://www.theguardian.com/technology/2018/jan/10/facebook-google-twitter-testify-congress-
extremist-content-russian-election-interference-information
Hu, J., & Zhang, Y. (2017). Structure and patterns of cross-national Big Data research collaborations.
Journal of Documentation, 73(6), 1119-1136. Doi:10.1108/JD-12-2016-0146
Huang, Y., Schuehle, J., Porter, A., & Youtie, J. (2015). A systematic method to create search strategies
for emerging technologies based on the Web of Science: Illustrated for “Big Data”. Scientometrics,
105(3), 2005–2022. doi:10.1007/s11192-015-1638-y
Jarlett, H. (2017). Breaking data records bit by bit. CERN document server. Retrieved from https://home.
cern/about/updates/2017/12/breaking-data-records-bit-bit
Jarlett, H. K. (2017). Breaking data records bit by bit. CERN. Retrieved from: https://home.cern/about/
updates/2017/12/breaking-data-records-bit-bit
Jianping, P., Juanjuan, Q., Le, P., Jing, Q., & Perdue, F. P. (2017). An Exploratory Study of the Effective-
ness of Mobile Advertising. Information Resources Management Journal, 30(4), 24-38. Doi:10.4018/
IRMJ.2017100102
Kaidalova, J., Sandkuhl, K., & Seigerroth, U. (2017). Challenges in Integrating Product-IT into Enterprise
Architecture – a case study. Procedia Computer Science, 121, 525–533. doi:10.1016/j.procs.2017.11.070
Khan, S., Liu, X., Shakil, K. A., & Alam, M. (2017). A survey on scholarly data: From big data perspec-
tive. Information Processing & Management, 53(4), 923–944. doi:10.1016/j.ipm.2017.03.006
Kulage, K. M., & Larson, E. L. (2016). Implementation and Outcomes of a Faculty-Based, Peer Review
Manuscript Writing Workshop. Journal of Professional Nursing, 32(4), 262–270. doi:10.1016/j.prof-
nurs.2016.01.008 PMID:27424926


Kurzweil, R., Brooks, R., Hanson, R., Rothblatt, M., Puri, R., Mead, C., . . . Schmidhuber, J. (2017). Human-level Artificial Intelligence is right around the corner – or hundreds of years away. IEEE Spectrum. Retrieved from https://spectrum.ieee.org/computing/software/humanlevel-ai-is-right-around-the-corner-or-hundreds-of-years-away
Laney, D. (2001). 3D Data management: controlling data volume, velocity and variety. Retrieved from
https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-
Volume-Velocity-and-Variety.pdf
Lee, D. (2018). UK unveils extremism blocking tool. BBC News. Retrieved from https://www.bbc.com/
news/technology-43037899
Lim, C., Kim, K.-H., Kim, M.-J., Heo, J.-Y., & Maglio, P. P. (2018). From data to value: A nine-factor
framework for data-based value creation in information-intensive services. International Journal of
Information Management, 39, 121–135. doi:10.1016/j.ijinfomgt.2017.12.007
Lim, B., Nagai, A., Lim, V., & Ho. (2016). Veritas Global Databerg Report Finds 85% of Stored Data
Is Either Dark or Redundant, Obsolete, or Trivial (ROT). Retrieved from https://www.veritas.com/en/
aa/news-releases/2016-03-15-veritas-global-databerg-report-finds-85-percent-of-stored-data
Lu, H., & Li, Y. (2017). Artificial intelligence and computer vision. Studies in computational intelligence.
Springer. doi:10.1007/978-94-009-7772-3_15
Lyko, K., Nitzschke, M., & Ngonga Ngomo, A. C. (2016). Big Data Acquisition. In J. Cavanillas, E. Curry,
& W. Wahlster (Eds.), New Horizons for a Data-Driven Economy. Cham: Springer; doi:10.1007/978-
3-319-21569-3_4
Mahdavinejad, M. S., Rezvan, M., Barekatain, M., Adibi, P., Barnaghi, P., & Sheth, A. P. (2017). Ma-
chine learning for Internet of Things data analysis: A survey. Digital Communications and Networks.
doi:10.1016/j.dcan.2017.10.002
Maia, A.-T., Sammut, S.-J., Jacinta-Fernandes, A., & Chin, S.-F. (2017). Big data in cancer genomics.
Current Opinion in Systems Biology., 4, 78–84. doi:10.1016/j.coisb.2017.07.007
Manogaran, G., & Lopez, D. (2018). Spatial cumulative sum algorithm with big data analytics for
climate change detection. Computers & Electrical Engineering, 65, 207–221. doi:10.1016/j.compel-
eceng.2017.04.006
Mazzei, M. J., & Noble, D. (2017). Big data dreams: A framework for corporate strategy. Business
Horizons, 60(3), 405–414. doi:10.1016/j.bushor.2017.01.010
Menninger, D. (2017). 2017 Big data prediction: A’s replace V’s. Retrieved from https://davidmenninger.
ventanaresearch.com/2017-big-data-prediction-as-replace-vs-1
Metcalf, J. L., Xu, Z. Z., Bouslimani, A., Dorrestein, P., David, O., Carter, P. D., & Knight, R. (2017).
Microbiome Tools for Forensic Science. Trends in Biotechnology, 35(9), 814–823. doi:10.1016/j.
tibtech.2017.03.006 PMID:28366290
Moss, S., Misra, A., & Evans, C. (2018). Using artificial intelligence to fight financial crimes. O’Reilly
Community. Retrieved from https://www.oreilly.com/pub/e/3930


Najafabadi, M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015).
Deep learning applications and challenges in big data analytics. Journal of Big Data. Retrieved from http://
www.deeplearningitalia.com/wp-content/uploads/2017/12/Dropbox_Big-Data-and-Deep-Learning.pdf
Osborne, H. (2017). Stephen Hawking AI warning: Artificial intelligence could destroy civilisation. Newsweek. Retrieved from http://www.newsweek.com/stephen-hawking-artificial-intelligence-warning-destroy-civilization-703630
Press, G. (2017). Ten predictions for AI, big data and analytics in 2018. Forbes. Retrieved from
https://www.forbes.com/sites/gilpress/2017/11/09/10-predictions-for-ai-big-data-and-analytics-in-
2018/2/#27db0676441c
Radojicis, M., Stankovic, R., & Kaplar, S. (2017). Review of the 2nd KEYSTONE Training Summer School on Keyword Search in Big Linked Data. INFOtheca - Journal for Digital Humanities, 17(1), 108-113.
Raguseo, E. (2018). Big data technologies: An empirical investigation on their adoption, benefits and
risks for companies. International Journal of Information Management, 38, 187-195. doi:10.1016/j.
ijinfomgt.2017.07.008
Reinsel, D., Gantz, J., & Rydning, J. (2017). Data age 2025. The evolution of data to life-critical. Don’t
focus on big data; focus on data that’s big. An IDC white paper, sponsored by Seagate. Retrieved
from https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-
March-2017.pdf
Rekik, R., Kallel, I., Casillas, J., & Alimi, A. M. (2018). Assessing web sites quality: A systematic lit-
erature review by text and association rules mining. International Journal of Information Management,
38(1), 201–216. doi:10.1016/j.ijinfomgt.2017.06.007
Ruzgas, T., Jakubeliene, K., & Buivyte, A. (2016). Big data mining and knowledge discovery. Journal
of Communications Technology, Electronics and Computer Science, (9). Doi:10.22385/jctecs.v9i0.134
Saitoh, M. (2017). Application of satellite remote sensing for marine spatial management: An approach
towards sustainable utilization of fisheries resources. Journal of Information Processing & Management,
60(9), 641-650. Doi:10.1241/johokanri.60.641
Salleh, K. A., & Janczewski, L. (2016). Technological, Organisational and Environmental Security and
Privacy Issues of Big Data: A Literature Review. Procedia Computer Science, 100, 19–28. doi:10.1016/j.
procs.2016.09.119
Sangeetha, J., & Prakash, V. S. J. (2017). A Survey on Big Data Mining Techniques. International Journal
of Computer Science and Information Security, 15(1). Retrieved from https://sites.google.com/site/ijcsis/
Schwikowski, B. (2015). Cytoscape: Visualization and Analysis of omis data in interaction networks,
Institut Pasteur. Gnome Research. Retrieved from https://research.pasteur.fr/en/software/cytoscape/
Shafer, T. (2017). The 42 V’s of Big Data and Data Science. Elder Research. Retrieved from https://
www.elderresearch.com/company/blog/42-v-of-big-data


Skourletopoulos, G., Mavromoustakis, C. X., Mastorakis, G., Batalla, J. M., Dobre, C., Panagiotakis,
S., & Pallis, E. (2016). Big Data and Cloud Computing: A Survey of the State-of-the-Art and Research
Challenges. Studies in Big Data, 22. Retrieved from https://link.springer.com/chapter/10.1007/978-3-
319-45145-9_2#citeas
Specktor, B. (2018). Elon Musk worries that AI Research will create an “Immoral Dictator”. Live Science
Tech. Retrieved from https://www.livescience.com/62239-elon-musk-immortal-artificial-intelligence-
dictator.html
Strydom, M. (Ed.). (2018). Big Data Governance and Perspectives in Knowledge Management. IGI Global.
Sun, H., Tang, Y., Wang, Q., & Liu, X. (2017). Handling multi-dimensional complex queries in key-
value data stores. Information Systems, 66, 82-96. Doi:10.1016/j.is.2017.02.001
Swoyer, S. (2017). You Still Need a Model! Data Modeling for Big Data and NoSQL. Transforming
Data With Intelligence. Retrieved from https://tdwi.org/articles/2017/03/22/data-modeling-for-big-data-
and-nosql.aspx
Vaas, L. (2018). New AI Technology used by UK Government to fight Extremist Content. Naked Security.
Sophos. Retrieved from https://nakedsecurity.sophos.com/2018/02/14/new-ai-technology-used-by-uk-
government-to-fight-extremist-content/
Wallace, N., & Castro, D. (2018). The impact of the EU’s new Data Protection Regulation on AI. Centre
for Data Innovation. Retrieved from http://www2.datainnovation.org/2018-impact-gdpr-ai.pdf
Wang, H., Xu, Z., Fujita, H., & Liu, S. (2016). Towards felicitous decision making: An overview on chal-
lenges and trends of Big Data. Information Sciences, 367–368, 747–765. doi:10.1016/j.ins.2016.07.007
Wang, R., Ji, W., Liu, M., Wang, X., Weng, J., Deng, S., ... Yuan, C. (2018). Review on mining data
from multiple data sources. Pattern Recognition Letters. doi:10.1016/j.patrec.2018.01.013
Williamson, J. (2013). The 4v’s of Big Data. Retrieved from http://www.dummies.com/careers/find-a-
job/the-4-vs-of-big-data/
Wissner-Gross, A. (2016). Datasets over Algorithms. Retrieved from https://www.edge.org/response-
detail/26587
World Economic Forum. (2018). Introducing the technology pioneers cohort of 2018. World Economic
Forum. Retrieved from: http://widgets.weforum.org/techpioneers-2018/
Wu, J., Li, H., Liu, L., & Zheng, H. (2017). Adoption of big data and analytics in mobile healthcare market:
An economic perspective. Electronic Commerce Research and Applications, 22, 24–41. doi:10.1016/j.
elerap.2017.02.002
Yoo, Y. (2015). It is not about size: A further thought on big data. Journal of Information Technology,
30(1), 63–65. doi:10.1057/jit.2014.30
Zhang, L., Tan, J., Han, D., & Zhu, H. (2017). From machine learning to deep learning: Progress in ma-
chine intelligence for rational drug discovery. Drug Discovery Today, 22(11), 1680–1685. doi:10.1016/j.
drudis.2017.08.010 PMID:28881183


Zhang, P., Xiao, Y., Zhu, Y., Feng, J., Wan, D., Li, W., & Leung, H. (2016). A New Symbolization and
Distance Measure Based Anomaly Mining Approach for Hydrological Time Series. International Journal
of Web Services Research, 13(3), 26-45.
Zhang, Q., Yang, L. Y., Chen, Z., & Li, P. (2018). A survey on deep learning for big data. Information
Fusion, 42, 146–157. doi:10.1016/j.inffus.2017.10.006
Zhao, Y., Liu, P., Wang, Z., Lei Zhang, L., & Hong, J. (2017). Fault and defect diagnosis of battery
for electric vehicles based on big data analysis methods. Applied Energy, 207, 354–362. doi:10.1016/j.
apenergy.2017.05.139
Zhu, L., Liu, X., He, S., Shi, J., & Pang, M. (2015). Keywords co-occurrence mapping knowledge do-
main research base on the theory of Big Data in oil and gas industry. Scientometrics, 105(1), 249–260.
doi:10.1007/s11192-015-1658-7

ENDNOTES
1. 3V's (Laney, 2001): Volume, Velocity, Variety.
2. 4V's (Williamson, 2013): Volume, Velocity, Variety, Veracity.
3. 5V's (Cano, 2014): Volume, Velocity, Variety, Veracity, Value.
4. 7V's (Ali-ud-din Khan et al., 2014): Volume, Velocity, Variety, Veracity, Value, Vulnerability, Virtue.
5. 10V's (Firican, 2017): Volume, Velocity, Variety, Veracity, Value, Vulnerability, Virtue, Validity, Volatility, Visualization.
6. 42V's (Shafer, 2017): Volume, Velocity, Variety, Veracity, Value, Virtue, Validity, Volatility, Visualization, Vagueness, Valor, Vane, Vanilla, Vantage, Variability, Varifocal, Varmint, Varnish, Vastness, Vaticination, Vault, Veer, Veil, Venue, Verdict, Versed, Version control, Vet, Vexed, Viability, Vibrant, Victual, Virtuosity, Viscosity, Visibility, Vivify, Vocabulary, Vogue, Voice, Voodoo, Voyage, Vulpine.


Chapter 4

Optimization of Aerospace Big Data Including Integrated Health Monitoring With the Help of Data Analytics

Ranganayakulu Chennu
Aeronautical Development Agency, India

Vasudeva Rao Veeredhi
University of South Africa, South Africa

DOI: 10.4018/978-1-5225-7077-6.ch004

ABSTRACT
The objective of this chapter is to present the role and advantages of big data governance in the optimal
use of integrated health monitoring systems with a specific reference to the aerospace industry. Aerospace
manufacturers and many passenger airlines have realized the benefits of sharing and analyzing the huge
amounts of data being collected by their latest generation airliners and engines. While aero engines are
already equipped with integrated engine health monitoring concepts, aircraft systems are now being
introduced with integrated vehicle health monitoring concepts, which require a large number of sensors. The data generated by these sensors is enormous and grows over time to constitute big data that must be monitored and analyzed. This chapter aims to give an overview of various systems and
their data logging processes, simulations, and data analysis. Various sensors that are required to be
used in important systems of a typical fighter aircraft and their functionalities emphasizing the huge
volume of data generated for the analysis are presented in this chapter.

INTRODUCTION

Aerospace manufacturers and MRO (Maintenance, Repair and Overhaul) providers have seen the future and are investing heavily in digital transformation. Many passenger airlines have no option but to take advantage of sharing and analyzing the huge amounts of data being collected by their airliners and aero-engines. Aero engines are already equipped with integrated engine health monitoring
systems and concepts, whereas aircraft systems are now being introduced with Integrated Vehicle Health Monitoring (IVHM) concepts, which require data from a large number of sensors to be monitored. In this direction, airlines should invest in Big Data analytics, OEMs should generate useful insights from the
vast volumes of data available and independent MROs should carve out a role in this new world. Data
Analytics help the Aerospace and Defense industry optimize their resources and business processes
while promoting new business opportunities. By developing insightful analytical techniques, organiza-
tions have started using this data to improve their business processes by eliminating redundancies for
optimal use of resources.
As more and more people are connected to the internet and sensors become integral parts of everyday hardware, an unprecedented amount of information is being produced. Fundamentally, big data is nothing new for the aerospace industry. Sensors have been collecting data on aircraft for years, ranging from basic flight data such as speed, altitude and stability during flight to damage and crack growth progression recorded at service intervals. The authorities and parties involved have done an incredible job of using routine data and data gathered from failures to raise safety standards.
Bombardier plans to bring to market its C-Series jetliner, which carries Pratt & Whitney's Geared Turbofan (GTF) engine – an engine that comes with a health monitoring system built around 5,000 sensors generating up to 10 GB of data per second. A single twin-engine aircraft with an average flight time of 12 hours can produce 844 TB of data. Therefore, the data generated by the aerospace industry alone could surpass the magnitude of the consumer internet. Most engines today have fewer than 250 sensors, so for someone who has built engine health monitoring solutions on big data platforms and demonstrated a reduction in processing time from days to minutes, the new engines are a different ball game. These scales are beyond imagination, and the kind of data storage and computing infrastructure required to handle such data is truly mind-blowing. GE expects to gain up to 40 percent improvement in factory efficiency by applying the Internet of Things (IoT) and Big Data Analytics. Avionics systems and other mechanical systems are also catching up to this trend quickly. Traditional avionics systems transfer data at up to a maximum of 12.5 kB/s, whereas the Boeing 787 Dreamliner and the A350 use Ethernet-based, next-generation aircraft data networks, called AFDX, that allow up to 12.5 MB/s. With rapid advancements being made in the Internet of Things in aircraft, combined with data analytics, it is a truly exciting time to be working in the aerospace industry. Soon, thousands of sensors will be embedded in each aircraft, allowing data to be streamed down to the ground in real time.
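To put these figures in perspective, the following back-of-the-envelope calculation (a hedged illustration in Python; the per-second rate, flight time and engine count come from the figures above, while the assumption that the peak rate is sustained for the whole flight is ours) shows how per-flight data volumes reach the hundreds of terabytes:

```python
# Hedged back-of-the-envelope estimate of per-flight sensor data volume.
# Assumes the quoted peak rate is sustained for the whole flight, which is
# an upper bound rather than a typical operating figure.

GB_PER_SECOND_PER_ENGINE = 10   # peak rate quoted for a GTF-class engine
ENGINES = 2                     # twin-engine aircraft
FLIGHT_HOURS = 12               # long sector, as in the text

seconds = FLIGHT_HOURS * 3600
total_gb = GB_PER_SECOND_PER_ENGINE * ENGINES * seconds
print(f"Data generated: {total_gb:,} GB, i.e. about {total_gb / 1000:.0f} TB per flight")
# Roughly 864,000 GB, the same order of magnitude as the ~844 TB cited above.
```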
This chapter explains how the abundance of data has opened multiple opportunities for organizations to enhance their operations. Further, it provides a sense of direction in the following new business opportunities enabled by advanced analytics solutions:

• Health monitoring systems in real-time.


• Predictive analytics solutions (Tools to identify system / component failures in advance and sug-
gesting necessary preventive maintenance actions before the failure).
• Models to simulate the system behavior.
• Failure analysis of systems through root cause investigation of complex and inter-dependent
systems.
• Intelligent models for scheduling and forecasting.

Further, some useful solutions for the following aero systems are indicated in this chapter.


• Engine Controls and APUs.


• Air Management Systems.
• Electrical Systems.
• Actuation Systems.
• Nacelle and Thrust Reverser.
• Avionics Systems.
• Mechanical systems such as Fuel, Hydraulics, Landing Gear, Secondary Power System,
Environmental Control System and Life Support Systems.

Big data is characterized by a data stream that is high in volume and velocity and that comes from multiple sources in a variety of forms. This combination of factors makes analyzing and interpret-
ing data via a live stream incredibly difficult, but such a capability is exactly what is needed in the
aerospace environment. For example, structural health monitoring has received a lot of attention within
research institutes because an internal sensory system that provides information about the real stresses
and strains within a structure could improve prognostics about the “health” of a part and indicate when
service intervals and replacements are needed. Similarly, all aircraft systems IVHM concepts will also
be presented in this chapter. Big data could also serve as the underlying structure that establishes au-
tonomous aircraft on a wide scale. Finally, big data opens the door for a new type of adaptive design in
which data from sensors are used to describe the characteristics of a specific outcome, and a design is
then iterated until the desired and actual data match. Big data is also incredibly important because the underlying streams from distributed data systems on aircraft or weather data systems can be aggregated and analyzed together to create new insights for safety. Thus, in the aerospace industry the major
value drivers will be data analytics and data science, which will allow engineers and scientists to combine
datasets in new ways and gain insights from complex systems that are hard to analyze deterministically.
The major challenge is how to upscale the current systems into a new era where the information system
is the foundation of the entire aerospace environment. In this manner data science will transform into
a fundamental pillar of aerospace engineering, alongside the classical foundations such as propulsion,
structures, control and aerodynamics.
In almost every industry and in every part of the world, companies are seeking to tap into the power
derived from analytics-driven insights. Many companies are investing heavily in analytics to optimize the
entire data and to reduce man power requirements. No doubt, analytics consistently delivers significant
value, from strategic to tactical and managing top line bottom line, to the organizations and business
executives who use it. Advent of newer technologies is making data collection faster than ever before,
and it may look like an overwhelming task to turn data into insights that drive the strategic imperative.
Storage and computational capacities have grown by leaps and bounds, opening up doors to intelligent
decision-making for varied business stakeholders, yet many organizations are still looking for better ways
to obtain value from their data and compete more effectively in the market place. It is quite a mammoth
task in itself, with too many ifs and buts. Let us take a high-level perspective of what kind of challenges
most organizations stumble upon and understand the following critical ingredients for a perfect analytics program (Sameer, 2018).

• Right problem statement where analytics could have a strong play.


• Right data to begin with.
• A strong team of analytics professionals with a right blend of skill sets.


• Senior leadership buy-in and requisite budgets.


• Clearing other internal toll gates.
• Program review framework to track progress and suggest realignment.
• A drastic shift in the mind-sets of business users consuming these insights, plan to make the tran-
sition process seamless.

FUTURE OUTLOOK FOR AVIATION

The future of aviation points to sustained growth in the number of passengers and continued dominance of large aircraft, making improvements in reliability and in safety-related aspects even more important. Five key elements are identified for future Integrated Health Monitoring systems according to Iñaki et al. (2016). They are as follows.

• Dedicated sensors with improved reliability and durability.


• Improved data management.
• Complex accurate integrated system with enhanced prognostic skills.
• Integration with the logistic system of the engine or subsystem maintenance provider.
• Enhanced inspection capabilities, joined by a larger portfolio of on-wing repair technologies.

The systems will have to be split between on-ground and on-board. It is expected that on-ground systems will continue to dominate in terms of analytical capability, although on-board systems should grow in relative importance. The on-ground analysis will happen in almost real time, using fully automated systems for most of the prognostics. The on-board systems, although recognized to be more challenging and expensive due to strict airworthiness regulations, and aiming to increase robustness by avoiding false alarms, will play a more important role in the future. In addition, integration with the logistic system will be crucial, providing a more comprehensive analysis and different potential alternatives that take these aspects into account. Finally, the enhanced inspection capabilities, linked to the computational systems, will provide engineers with an accurate remote view of the engine. On-wing repair technologies also need to develop further, and miniaturized robotics will play an interesting role in the future.

DATA HANDLING AND ANALYSIS

The most convenient way to recover data is to retrieve the MPC's PCM cards. Each time a card is recovered from an aircraft, it must be replaced by another one. In order to identify any corrupted PCM card, it is strongly advised to dedicate each card to a specific aircraft in the case of a passenger fleet. The number of cards required for each aircraft to ensure a smooth process may therefore vary and must be determined according to the airline's specificities. The frequency with which data should be recovered is determined by the storage space available on the cards and the recording rate of the aircraft. Frequent recovery is always preferred, as it allows an early response if an incident is found in the data. Wireless data transfer solutions exist in some cases, and others are under development.
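As a hedged illustration of how that retrieval frequency might be estimated, the sketch below combines an assumed card capacity and fleet utilisation with a recording rate of the order quoted elsewhere in this chapter; none of the numbers come from aircraft documentation:

```python
# Hedged sketch: how often should PCM/QAR cards be swapped?
# Card capacity and daily utilisation are assumed values for illustration;
# the recording rate is of the order quoted for traditional avionics buses.

CARD_CAPACITY_MB = 2_000        # assumed card capacity
RECORDING_RATE_KB_S = 12.5      # assumed recording rate, kB per second
FLIGHT_HOURS_PER_DAY = 8        # assumed daily utilisation of the aircraft

mb_per_day = RECORDING_RATE_KB_S * 3600 * FLIGHT_HOURS_PER_DAY / 1024
days_per_card = CARD_CAPACITY_MB / mb_per_day
print(f"About {mb_per_day:.0f} MB recorded per day; "
      f"swap the card roughly every {days_per_card:.0f} days")
```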


FINDINGS AND ANALYSIS

Data Processing

Data is processed by software that provides a corresponding series of flights and flight events to the end
user. Filtering those flights and events is necessary to ensure a good level of relevance and consistency
in the database. Undesired events to be cleaned are generally recurrent and may be due to several factors
such as improper thresholds or event definition under some circumstances or inaccurate terrain database.
Even though such spurious events have to be cleaned to ensure a good quality of the database, their root
cause needs to be identified and corrected when possible, either by configuring the software differently
or contacting the manufacturer/vendor of the equipment at the origin of the issue. The quality of the database can be assessed by the retrieval rate and the quality index, as per Flight Data Monitoring on ATR Aircraft (2016), which are defined as follows:

• The retrieval rate is the ratio of the number of flights processed by the Flight Data Management (FDM) software over the number of flights actually operated (coming from another source).
• The quality index is the ratio of the number of flights properly analyzed over the number of flights processed (a short computational sketch of both metrics follows).
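The sketch below is a minimal illustration of these two metrics; the flight counts are placeholders, since in practice they come from the FDM software on one side and the operational schedule on the other:

```python
# Minimal sketch of the two database-quality metrics defined above.
# The counts are placeholders, not figures from a real fleet.

flights_operated = 1_250   # flights actually operated (from another source)
flights_processed = 1_190  # flights processed by the FDM software
flights_analyzed = 1_160   # flights properly analyzed

retrieval_rate = flights_processed / flights_operated
quality_index = flights_analyzed / flights_processed

print(f"Retrieval rate: {retrieval_rate:.1%}")  # 95.2%
print(f"Quality index:  {quality_index:.1%}")   # 97.5%
```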

Database Filtering

Sometimes it becomes necessary to filter the flight events. Deleting flight events under specific circum-
stances is a normal practice but should be temporary. The long flare or late touchdown on a runway due
to various reasons is one of the examples. In any case, investigation should be conducted to understand
the cause of the issue and to solve it.

Analysis

Analysis is the heart of the FDM process. The simplest way to look at it is to monitor the occurrence rates and trends of the various events configured in the software. Typically, a high occurrence rate of one specific event should be investigated. Analyzing a trend in occurrence rate over a long (> 6 months) period of time requires knowledge of any change in the company that can impact this rate (change in SOP, change in routes or airports, change in FDM algorithm or thresholds, specific training given to the crews, etc.). The analysis consists of putting those numbers or occurrence rates in perspective. To summarize, the FDM team provides the organization (and especially the SMS process downstream) with the validated factual material on which the operational and safety strategy will be based or amended. It has to be noted that analytics plays a major role in the analysis of big data.
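A hedged sketch of the simplest form of this analysis is given below: monthly occurrence rates per 1,000 flights for each configured event, computed with pandas. The column names and the tiny data set are illustrative assumptions, not a real FDM schema:

```python
import pandas as pd

# Hedged sketch: monthly occurrence rate per 1,000 flights for each FDM event.
# 'events' and 'flights' stand in for extracts from the FDM software; the
# column names and values are illustrative only.

events = pd.DataFrame({
    "event_name": ["long_flare", "long_flare", "high_rate_of_descent"],
    "flight_date": pd.to_datetime(["2018-01-03", "2018-02-10", "2018-02-20"]),
})
flights = pd.DataFrame({
    "flight_date": pd.to_datetime(
        ["2018-01-03", "2018-01-05", "2018-01-10", "2018-01-18",
         "2018-02-02", "2018-02-06", "2018-02-10", "2018-02-20"]),
})

flights_per_month = flights.groupby(flights["flight_date"].dt.to_period("M")).size()
events_per_month = (events
                    .groupby([events["flight_date"].dt.to_period("M"), "event_name"])
                    .size()
                    .unstack(fill_value=0))

rate_per_1000 = events_per_month.div(flights_per_month, axis=0) * 1000
print(rate_per_1000)  # a steadily rising column is a trend worth investigating
```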

Statistical Approach

The power of a FDM program is to provide data of a large number of flights over a significant period
of time (generally, at least one year). A statistical approach to this data allows monitoring trends of oc-
currence of events and therefore identifying hazards or following their evolution. Practically, on a regular basis the FDM team would produce statistical reports with systematic data: top 10 events, top 10 red events, top 10 events at each airport, top event trends, etc. By producing those reports in a standard
format it is then possible to monitor the evolution of the situation. Any FDM software should provide
a statistical module or capability. It is important to remember that statistics are relevant only if they are
based on a sufficient amount of data, especially when a breakdown of event rate per airfield or runway
(for instance) is performed. In such a case, a rate at a specific airfield may be artificially low or high just
because there were not many flights to this airfield. Generally speaking, extreme rates (low or high)
should raise the attention of the analyst.
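The following hedged sketch illustrates the per-airfield breakdown and the sample-size caveat: rates are computed per 100 flights and airfields with too few flights are flagged as statistically weak. The DataFrame columns, values and the minimum-sample threshold are assumptions for the example:

```python
import pandas as pd

# Hedged sketch: event rate per 100 flights broken down by airfield, with a
# guard against airfields whose flight count is too small to trust the rate.
# Column names, values and the threshold are illustrative assumptions.

per_airfield = pd.DataFrame({
    "airfield": ["AAA", "BBB", "CCC", "DDD"],
    "flights":  [1200,   340,    25,   780],
    "events":   [  36,    17,     4,     8],
})

MIN_FLIGHTS = 100  # assumed minimum sample before a rate is considered meaningful

per_airfield["rate_per_100"] = 100 * per_airfield["events"] / per_airfield["flights"]
per_airfield["statistically_weak"] = per_airfield["flights"] < MIN_FLIGHTS
print(per_airfield.sort_values("rate_per_100", ascending=False))
# CCC shows the highest rate, but with only 25 flights it should not drive conclusions.
```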

Fine Tuning

Fine tuning is the process with which the airline modifies the logics or the thresholds of one or several
events. It is more likely to take place after several months, once sufficient statistical reports are available. It can also be
the consequence of a change in the SOPs that needs to be reflected in the FDM. The company can tune
these event thresholds to be more relevant by testing limits. This can be tested either by realistically
manipulating normal data to simulate an event, by reducing the event limits such that normal flying will
trigger events, or more acceptably, by replaying historical data known to contain incidents that should
trigger events. It is also important to identify issues such as “false events” generated by the program.
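A simplified, hedged sketch of the replay idea is given below: historical per-flight maxima of a monitored parameter are replayed against candidate thresholds, and the number of triggered events is compared with the number of known incidents captured. The parameter, values and thresholds are illustrative assumptions:

```python
# Hedged sketch of threshold fine-tuning by replaying historical data.
# The per-flight descent-rate maxima, the 'known incident' flags and the
# candidate thresholds are all illustrative assumptions.

max_descent_fpm = [650, 720, 810, 690, 1150, 700, 980, 760]   # one value per flight
known_incident  = [False, False, False, False, True, False, False, False]

for threshold in (700, 800, 900, 1000):
    triggered = [v >= threshold for v in max_descent_fpm]
    caught = sum(t and k for t, k in zip(triggered, known_incident))
    print(f"threshold {threshold} ft/min: {sum(triggered)} events triggered, "
          f"{caught}/{sum(known_incident)} known incidents captured")

# Choose the tightest threshold that still captures the known incidents
# without flooding the analysts with nuisance events.
```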

Aircraft Data Recorder

Apart from the Flight Data Recorder (FDR) and Cockpit Voice Recorder (CVR) used in the case of ac-
cident/incident investigation, the modern aircraft are equipped with recorders and interfaces that enable
the operator to record and retrieve flight data daily for on-ground analysis. The purpose of an airplane
Flight Data Recorder (FDR) system is to collect and record data from a variety of airplane sensors onto a
media designed to survive an accident. To ease data recovery for the airlines’ needs (such as performing
FDM) Quick Access Recorder (QAR) systems have to be used. The QAR records a copy of the FDR
data but on a medium that is easily accessible and interchangeable. A generic schematic of the FDR is shown in Figure 1, as per Flight Data Monitoring on ATR Aircraft (2016).

MILITARY AIRCRAFT DATA

The scenario for military aircraft data is entirely different with respect to data sharing, from a security point of view. Most of the sensor data cannot be provided in the open domain. Hence, the implementation of big data analytics in military aircraft is much more restricted than in passenger aircraft. The various sensors listed in Table 1 are required in the various systems of a typical fighter aircraft for the following functionalities:


Figure 1. Generic schematic of FDR

Table 1. Sensor requirements in a typical fighter aircraft (the approximate number of sensors of each type is given in parentheses)

1. Environmental Control System
   • Temperature (20-25): To assess the thermal performance of various components and to know the limits in the cabin and avionics equipment.
   • Pressure (10-15): To know the limits of cabin pressure and the pressures of critical pressure reducing valves.
2. Hydraulic System
   • Pressure (4-6): Monitoring of pump outlet and return line pressures.
   • Temperature (4-6): Monitoring of hydraulic fluid temperatures.
   • Discrete, tapped parameters (2-4): Operation of brake parachutes and pump failures.
3. Fuel System
   • Pressure (8-10): Monitoring of fuel pump outlets, various tanks and engine inlet pressures.
   • Temperature (10-12): Monitoring of engine bay, heat exchanger outlets, Integrated Drive Generator oil and FADEC fuel inlet/outlet temperatures.
   • Discrete, tapped parameters (8-10): Pump failures, tank pressure failures, fuel low level warning, inflight refuel switch status, etc.
   • Flow meter (2): Fuel flow measurement.
4. Armament System
   • Mostly discrete type (50-55): Firing and fusing status of various armaments and critical pressures of store channels.
5. Landing Gear
   • Pressure (10-12): Main and nose wheel jack extension and retraction pressures.
   • Discrete (14-16): Various door up-lock statuses and arrester hook up/down status in the case of naval aircraft.
   • Strain gauges (80-90): Various locations of main and nose wheel parts and on the arrester hook in the case of naval aircraft.
6. Engine
   • Thermocouples (60-70): Various locations of the engine (fan, compressor, combustion locations), oil coolers and skin temperatures.
   • Pressures (40-45): Various locations of the engine (fan, compressor, combustion locations), oil pressures.
   • Flow meters (2-4): Main afterburner and pilot afterburner fuel flow rates.
   • Analog tapping (6-8): Compressor rotor speed, fuel flow rate, engine oil pressure, engine power lever angle, compressor discharge pressure and turbine discharge pressures.
7. Electrical System
   • Analog (3-4): Generator, battery and VRU voltages.
8. Flight Performance
   • Pressure (2-4): Air intake pressures.
9. Secondary Power System
   • Pressure (3-4): Gearbox oil pressures.
   • Temperature (3-4): Gearbox oil temperatures.
10. Airframe
   • Vibration, acceleration (60-70): Various locations of the aircraft structure.
   • Strain (200-220): Various locations of the aircraft structure.
11. Flight Control System
   • Discrete (6-8): Weight-on-wheels status, airbrake in/out, FCS fail and DFCC channel health status.
12. Multi Mode Radar
   • Temperature (5-6): Transmitter, front end receiver, radar processor and antenna platforms.
   • Pressure (1-2): Radar waveguide pressure.
13. Drop Tanks
   • Strain (35-40): On various locations of the drop tanks.
14. HUMS
   • Strain (25-30): On various locations of aircraft critical parts.
15. Brake Management System
   • Pressure (5-6): Shock absorber and brake pressures.
   • Thermocouples (8-10): Brake temperatures.
   • Analog (10): Pedal demands, servo valve channel currents and nose wheel system angles (in degrees).
   • Potentiometer (3-4): Main and nose wheel shock absorber travels.
   • Strain gauges (20-22): Main and nose landing gear vertical loads, drag loads and brake torque loads.
   • Frequency (4): Wheel speeds.
   • Vibration, linear and angular accelerations (12-14): Main and nose wheel axle brackets.
16. Wake Penetration
   • Pressure (2-4): Air intake pressures.
   • Thermocouples (2-4): Air intake temperatures.

INTEGRATED HEALTH MONITORING

Among the various systems, engine and structural health monitoring play a major role in the aircraft industry. Engine Health Management (EHM) is playing an increasingly important role, and that role is expected to increase significantly over the coming years. Reduction of trouble at departures, the capability to anticipate maintenance actions, reduction in fuel consumption, extension of the life of components, etc., are the types of advantages that can be anticipated. In addition, the large amount of in-service data will also help to drive changes to future engines. Aircraft engine health monitoring is based on the following three basic aspects (Iñaki et al., 2016):

1. Recording of flight parameters such as altitude, ambient temperature and pressure, etc. and on-
board engine measurements such as temperatures at different engine locations, oil pressure and
temperature, fuel flow, shaft speeds, vibration levels, etc. from the engine sensors.
2. Development of a thermodynamic (including air/oil systems) & vibration model simulating the
behavior of the engine parameters under different conditions of their modules.


3. Development of an integrated system that is able to evaluate the current health of the system (diagnostics), to identify and alert about abnormal performance, and to forecast the future behavior of the engine, identifying potential emerging problems (prognostics). The power is in its ability to predict the future condition of the overall engine, its subsystems and engine components, being capable of predicting component failures or malfunctions before they actually occur. Certain parts of the system could be on-board, while the majority is expected to be on-ground (a minimal diagnostic sketch follows this list).
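As a minimal, hedged illustration of the diagnostic part of the third aspect, the sketch below compares a measured gas-path parameter against the value predicted by a drastically simplified, assumed thermodynamic model for the same flight condition, and raises an alert when the residual drifts beyond an assumed band:

```python
# Minimal sketch of residual-based engine diagnostics: compare the measured
# inter-turbine temperature (ITT) with the value a thermodynamic model predicts
# for the same flight condition. The model, the alert band and the sample
# values are assumptions for illustration, not a real engine model.

def predicted_itt(altitude_ft: float, mach: float, shaft_speed_pct: float) -> float:
    """Placeholder for the thermodynamic model; a real model is far richer."""
    return 600 + 3.5 * shaft_speed_pct - 0.002 * altitude_ft + 40 * mach

ALERT_BAND_DEG_C = 25  # assumed alert band on the ITT residual

def check_itt(measured_itt: float, altitude_ft: float,
              mach: float, shaft_speed_pct: float) -> str:
    residual = measured_itt - predicted_itt(altitude_ft, mach, shaft_speed_pct)
    if abs(residual) > ALERT_BAND_DEG_C:
        return f"ALERT: ITT residual {residual:+.0f} degC exceeds +/-{ALERT_BAND_DEG_C} degC"
    return f"OK: ITT residual {residual:+.0f} degC"

# Stabilized-cruise snapshot (illustrative values)
print(check_itt(measured_itt=912, altitude_ft=35000, mach=0.78, shaft_speed_pct=92))
```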

ENGINE HEALTH MONITORING (EHM)

The EHM will also learn from the global experience accumulated. It should be able to track individual
engines (for which some physical measurements would have been taken, such as nozzle guide vane areas
for the turbines, exhaust areas, cold build clearances, etc.) as well as complete fleets. With continued
advances in affordable computing, high speed communication and more sophisticated sensors, EHM systems are now found in a variety of applications, both commercial and military. These basic aspects may not be that different from the health monitoring of other types of products, but the inherent difficulty is due to aspects such as the aggressive environment. As an example, the temperatures of interest could
be as low as -80ºC at the entry of the engine in altitude conditions and as high as 1700ºC at the entry of
the high pressure turbine and the measurement should take place in flows at speeds close to the sonic
condition (Mach numbers from 0.5 to 1.0 are quite typical). A similar case can be seen in the acquisition
of other parameters of interest, such as pressures, vibrations, oil system conditions, etc. In addition to the
intrinsic challenge described, we should also take into account the durability required: periods between maintenance actions for gas turbines have significantly increased over the last decades, and the modular concept means that some modules will not be disassembled for nearly a decade. Hence the
instrumentation used for health monitoring should also survive such periods providing accurate readings.
In terms of measurements it is important to record the parameters at each flight and also to ensure that
stabilized and repeatable conditions (take-off, climb, stabilized cruise, etc.) are continuously available.
As mentioned above, aircraft gas turbine data is available from a variety of sources, including on-board
sensor measurements, maintenance histories, and component models. An ultimate goal of EHM is to
maximize the amount of meaningful information that can be extracted from various data sources to obtain
comprehensive diagnostic and prognostic knowledge regarding the health of the engine. Data fusion,
performed by the integrated system, is the combination of data or information from multiple sources to
achieve enhanced accuracy and more specific inferences than can be obtained from the use of a single
information source alone. The information available includes the following, as per Iñaki et al. (2016):

• Engine Gas Path Measurements: Typically, a reduced number of inter-module pressures and
temperatures, shaft speeds, fuel flow, etc. Gas Path Analysis (GPA) itself can be viewed as a form
of information blend, as these parameters individually provide far less information than when
considered collectively to form signatures that can be correlated to known fault categories. Basic
gas path sensors were always available due to their use in controlling the engine. Progressively,
additional gas path sensors were added for EHM purposes since they were relatively inexpensive
and fairly reliable.


• Oil/Fuel System Measurements: These consist of various oil system temperatures and pressures. Advanced sensors indicating oil quality, oil debris monitoring sensors, and oil quantity measurements are also available in modern engines. The fuel system is also monitored, although it typically has a lower number of sensors.
• Vibration Measurements: Vibration monitoring is typically performed on most engines. The
state-of-the-art is to measure the vibration in all the spools (typically two or three) and includes
specific instrumentation to monitor the bearings and gearbox. Historically, accelerometers were not as abundant on aero-engines. The bandwidth for vibrations is relatively high (2 to 20 kHz), which limited, in the early days, the type of on-board analysis possible due to CPU capacity. The large amount of
information contained in these signals could not be processed on-board and storing the informa-
tion for on-ground analysis was also limited due to storage constraints. Consequently, this form
of mechanical monitoring was restricted to primarily capturing average vibration amplitudes and
comparing them to limits, alerting in case of exceeding those. However, with today’s storage and
CPU capacity, larger on-board processing is possible, as well as preserving all the content of the
vibration signals for subsequent on-ground analysis.
• Structural Assessment Sensors: Structural health monitoring sensors aid in assessing structural
integrity of the engine, such as inlet and exhaust debris monitoring, acoustic and high frequency
vibration sensors, etc.
• Full Authority Digital Engine Control (FADEC): This system, of increased capacity and capa-
bilities over the last years, performs continuous tests on signal condition and fidelity. Cross chan-
nel checks can aid in determining whether or not a sensor is drifting, going out of limit or failing,
providing information regarding engine health.
• Engine Models: Accurate engine models can be used to generate virtual engine measurements to
aid in detecting faulty engine instrumentation or confirming degraded engine performance.
• Maintenance/Analysis History: Information regarding the performance disposition of the major
modules that comprise the engine can potentially be used as a priori information to support the
identification and estimation of performance changes.
• Companion Engine Data: On multi-engine aircraft, information from the companion engines is
proven to be very useful to provide additional independent confirmation / rejection of instrumen-
tation problems or engine events. This is especially true if the companion engine has been through
a similar lifetime maintenance activity, etc. so both are truly comparable.

If information from all sources is combined with analytical methods to collectively manipulate it, the (perceived) need for additional sensors can be minimized. Additional sensors would only be necessary if they acquire unique information (not deducible from the information fusion) or act as a redundant corroboration to reduce uncertainty which may be inherent in our observations (existing sensors and other information).
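The sketch below is a hedged illustration of the data fusion idea: several independent indicators are combined into one confidence score before a maintenance alert is raised, and agreement of the companion engine is used to discount common-cause effects. The indicator names, normalisations and weights are assumptions, not a published fusion scheme:

```python
# Hedged illustration of data fusion across independent EHM information sources.
# Weights, normalisations and the companion-engine discount are assumptions.

def fused_health_score(gas_path_residual_sigma: float,
                       vibration_exceedance: bool,
                       oil_debris_particles_per_hr: float,
                       companion_engine_agrees: bool) -> float:
    """Return a 0-1 score; higher means stronger evidence of a real engine problem."""
    score = 0.0
    score += 0.4 * min(abs(gas_path_residual_sigma) / 3.0, 1.0)   # gas path analysis
    score += 0.25 if vibration_exceedance else 0.0                # vibration monitoring
    score += 0.25 * min(oil_debris_particles_per_hr / 10.0, 1.0)  # oil debris monitoring
    if companion_engine_agrees:
        # Both engines shifting together points to ambient or sensing effects,
        # not a fault in this particular engine.
        score *= 0.3
    return round(score, 2)

print(fused_health_score(2.4, True, 6.0, companion_engine_agrees=False))  # -> 0.72
```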

Powerful Physical Models

With the huge amount of data available, there is an obvious attraction to swing towards statistical tools
to derive conclusions. However, the use of a powerful engine performance tool could be the difference between a conventional system and a state-of-the-art EHM tool. Advanced models adapt themselves to the
conditions observed by the engine, hence providing virtual sensors that can be used to estimate engine
module degradation. Strong knowledge of the overall engine design including as well the different mod-
ules is of paramount importance: location of sensors, deep understanding of the limits, action strategy,
etc. are best known when having OEM capabilities. The assembly and pass-off data also represent key
pieces of information, to be able to differentiate between individual engines. As an additional benefit,
the in service experience will be employed in future design.

Improved Reliability

EHM is a means to mitigate risk in decisions that impact operational integrity and has a profound impact
on safety related aspects, such as In-Flight Shut Downs (IFSD). Current values for large engines with
ETOPS requirements are in the range of 1 per 1 million hours of operation, although the regulation re-
quirement for ETOPS 180 is to be less than 2 per 100 000 EFH. It basically means that the pilots starting
today will probably never experience an IFSD due to an engine malfunction.
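A hedged back-of-the-envelope calculation makes that statement concrete; the career-hours figure is an assumption, while the rate is the one quoted above:

```python
# Back-of-the-envelope illustration of the IFSD statement above.
# The career flight-hours figure is an assumption.

IFSD_RATE_PER_ENGINE_HOUR = 1 / 1_000_000  # ~1 IFSD per million engine flight hours
CAREER_FLIGHT_HOURS = 20_000               # assumed airline flying career
ENGINES = 2                                # twin-engine operations

expected_ifsd = IFSD_RATE_PER_ENGINE_HOUR * CAREER_FLIGHT_HOURS * ENGINES
print(f"Expected IFSDs over a career: {expected_ifsd:.2f}")  # about 0.04
```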

Examples of EHM Management

The following examples (Allan et al., 2004) give a good idea of EHM and how it improves reliability:

1. Turboprop Performance Evolution Due to Normal Wear and Tear: In order to determine the performance behavior, fuel flow, inter-turbine temperature, and HP and LP spool shaft speeds are measured. In the case of normal performance evolution, a continuous and progressive increase of fuel flow and inter-turbine temperature can be observed (the sketch following this list illustrates the trend logic). The deterioration and the life of the components can be better estimated as a function of real operation rather than standard missions. Borescope inspection also shows normal degradation. Based on the comprehensive EHM follow-up, it will be possible to predict the best moment to remove the engine and send it to the MRO shop for maintenance action. In addition, the engine repair work scope is better oriented thanks to the in-service analysis results obtained from the data retrieved through the EHM extracts (module deterioration, individual issues, etc.), which could lead to the decision to solely refurbish the hot section or to go for a complete engine overhaul.
2. Hot Section Temperature Sensor Failure: This shows as a sudden increase in the indicated turbine temperature. None of the other parameters displayed exhibits abnormal behavior, which suggests that the issue is in the thermocouple. Based on the EHM daily follow-up, the sensor can be confirmed defective after troubleshooting. This issue, if detected on time, represents an effort of one hour for removal and installation plus the associated thermocouple cost, and the replacement can easily be planned within the overnight turn-around time (TAT) of the aircraft without incurring further expenses. An Aircraft On Ground (AOG) event represents a significant cost for airlines, which can be drastically reduced thanks to advanced EHM techniques.
3. Hot Section Component Failure Detection: This may indicate a sudden change in the parameters
such as the LP shaft speed and LPT section. This would potentially represent a hazardous situation
which could be identified with EHM.


4. Cold Section Failure / Deterioration: There are other cases when, for example, potential issues
in the cold section are detected. This could indicate failure in a bleed valve or compressor dete-
rioration. Adequate management of the compressor performance through cleaning, identification
of valve malfunctions, etc. could lead to fuel savings in a turboprop aircraft equivalent to two full
tanks per year, which has a very significant economic and environmental impact.
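The sketch below, referenced in example 1, is a simplified and hedged illustration of how EHM trending can separate the gradual drift of normal wear (example 1) from the sudden step change of a sensor or hot-section failure (examples 2 and 3). The monitored parameter, window lengths and limits are assumptions:

```python
# Simplified sketch: distinguish gradual deterioration from a sudden step change
# in a monitored parameter (e.g., inter-turbine temperature at a reference
# condition). Data, window lengths and limits are illustrative assumptions.

def classify_trend(series, step_limit=15.0, drift_limit=8.0):
    """Compare the latest reading and the recent mean against an earlier baseline."""
    baseline = sum(series[:-5]) / len(series[:-5])  # long-window baseline
    recent = sum(series[-5:]) / 5                   # short-window mean
    if series[-1] - series[-2] > step_limit:
        return "sudden step: suspect sensor or hot-section component failure"
    if recent - baseline > drift_limit:
        return "gradual drift: plan engine removal and MRO workscope"
    return "normal scatter: continue monitoring"

gradual = [700, 702, 704, 706, 708, 711, 713, 716, 718, 720]
sudden  = [700, 701, 700, 702, 701, 702, 701, 703, 702, 745]
print(classify_trend(gradual))  # gradual drift
print(classify_trend(sudden))   # sudden step
```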

STRUCTURAL HEALTH MONITORING (SHM)

Structural Health Monitoring (SHM) aims to give, at every moment during the life of a structure, a
diagnosis of the “state” of the constituent materials, of the different parts, and of the full assembly of
these parts constituting the structure as a whole. The state of the structure must remain in the domain
specified in the design, although this can be altered by normal aging due to usage, by the action of the
environment, and by accidental events. Thanks to the time-dimension of monitoring, which makes it
possible to consider the full history database of the structure, and with the help of Usage Monitoring, it
can also provide a prognosis (evolution of damage, residual life, etc.). SHM involves the integration of
sensors, possibly smart materials, data transmission, computational power, and processing ability inside
the structures. It makes it possible to reconsider the design of the structure and the full management
of the structure itself and of the structure considered as a part of wider systems of SHM (Daniel et al., 2010). A Structural Health Monitoring system allows the following:

• Allows an optimal use of the structure, a minimized downtime, and the avoidance of catastrophic
failures,
• Gives the constructor an improvement in its products,
• Drastically changes the work organization of maintenance services: i) by aiming to replace sched-
uled and periodic maintenance inspection with performance-based (or condition-based) mainte-
nance (long term) or at least (short term) by reducing the present maintenance labor, in particular
by avoiding dismounting parts where there is no hidden defect; ii) by drastically minimizing the
human involvement, and consequently reducing labor, downtime and human errors, and thus im-
proving safety and reliability.

In addition to engine and structural health monitoring, there is a need to look into the following other systems' solutions and health monitoring concepts:

FLIGHT MANAGEMENT SYSTEM

The Flight Management System (FMS) is a system that performs functions related to navigation and
flight planning by constantly analyzing data from other aircraft systems. According to Soares (2014), the FMS involves two major components, a computer unit and a master computer display unit. The first one
is responsible for the data processing of the flight data recorders as well as navigation systems to prepare
the aircraft routes and flight planning while the second one is primarily a human/machine interface for
data input/output for the FMS.


AIR MANAGEMENT SYSTEM

There is a need to provide end-to-end systems engineering solutions in air management systems, which include mechanical system packs, cargo refrigeration units (CRU), the nitrogen generation system (NGS), and secondary/galley cooling units (SCU/GCU). Additional sensors need to be installed for health monitoring of these subsystems.

• Engine Control Systems and Auxiliary Power Units (APUs): Design solutions and health mon-
itoring concepts are required for fuel metering unit (FMU), hydro-mechanical valves, oil and fuel
filters, pumps, gear boxes, air turbine starters, and APUs.
• Electrical Systems: Functional feedback from electrical system components such as System 1 and System 2 generators, batteries and other line replaceable units (LRUs) is required. In addition, health monitoring diagnostics are to be provided.
• Actuation Systems: Actuators play a major role in both fixed wing and rotary wing air vehicles.
Hence, there is a need for comprehensive design-to-testing solutions and health monitoring in
actuation systems, which include primary and secondary flight control, helicopter main and tail
rotor servo actuation systems, deck lock, and utility actuators.
• Nacelle and Thrust Reverser: Extensive health monitoring is required on fan cowls, translating
sleeves, thrust reverser actuation systems and other nacelle and thrust reverser parts.
• Avionics Systems: An Avionics system plays a major role in flight control, data acquiring and
storage of large data in aircraft. Hence, an integrated health monitoring system combining all
avionics equipment is required.
• Mechanical Systems: The Mechanical system consists of Fuel, Hydraulics, Landing Gear,
Secondary Power System, Environmental Control System and Life Support Systems. A solution
approach for IVHM of landing gear system for typical transport aircraft, in which they demon-
strated a typical case of landing retraction mechanism was presented by (Divakaran et al., 2017).
Wang (2012) introduced diagnostic, prognostic and health management (DPHM)concept into the
aircraft hydraulic power system development. Karthik (2011) worked out the IVHM concept on
UAV Fuel System Test Rig. Similarly, Tai (2010) describes the Aircraft electrical power system
diagnostics, prognostics and health management.

In the present context, the voluminous data coming from the large number of additional sensors shown in Table 2 below are required for health monitoring of the various mechanical systems, which in turn increases the complexity of the data analytics.

PRESENT AVIATION SCENARIO

As mentioned above, at any moment in time more than 10,000 aircraft are in operation, which corresponds to over 100,000 flights per day, or more than 39 million flights in 2016. Each engine on board carries instrumentation ranging from just 10 to more than 100 sensors. Temperatures, pressures, shaft speeds, vibrations, torque, fuel flow, etc. are the most common parameters. If we consider an acquisition frequency of just 1 Hz, which is clearly insufficient for aspects such as vibration (on the order of 1 to 20 kHz), we could be considering on the order of 10 trillion data points for the entire civil worldwide fleet, increasing significantly if the defense, marine and industrial gas turbines are taken into account. If we consider just a modest fleet of 10 aircraft with an average number of parameters, this figure will still be 4 billion data points per year, which is in itself quite big. The technologies currently under development to handle big data will be an excellent complement to the specific gas turbine advances.
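To make these orders of magnitude concrete, the short Java sketch below computes a rough yearly data-point count for a fleet under stated assumptions. The fleet size, engine count, sensor count and daily utilization used here are illustrative round numbers, not figures from any specific operator or program.

```java
// Back-of-envelope estimate of yearly engine health monitoring data volume.
// All input figures are illustrative assumptions, not measured values.
public class FleetDataEstimate {
    public static void main(String[] args) {
        long aircraft = 10_000;         // aircraft in operation (order of magnitude from the text)
        int enginesPerAircraft = 2;     // assumed twin-engine configuration
        int sensorsPerEngine = 50;      // between the 10 and >100 sensors quoted in the text
        double samplesPerSecond = 1.0;  // 1 Hz acquisition, as assumed in the text
        double flightHoursPerDay = 8.0; // assumed average utilization per aircraft

        double samplesPerYear = aircraft * enginesPerAircraft * sensorsPerEngine
                * samplesPerSecond * flightHoursPerDay * 3600 * 365;
        // With these assumptions the result is on the order of 10^13 data points,
        // consistent with the "10 trillion" order of magnitude quoted above.
        System.out.printf("Approximate data points per year: %.2e%n", samplesPerYear);
    }
}
```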


Table 2. Additional parameters and their functionality for health monitoring of mechanical systems

1. Environmental Control System
     Type of sensor: Vibration, Temperature, Pressure, Speed
     Functionality: 1) Vibration and differential pressure measurements for all heat exchangers. 2) Cold air units' oil quality, unit speed and vibration are to be monitored for health. 3) Data on pressure build-up and the time of opening for pressure reducing valves. 4) Monitoring of control pressure and bellows movement in the cabin pressure control system.

2. Hydraulic System
     Type of sensor: Temperature, Vibration sensor, Differential pressure indicator (DPI) and Level indicators
     Functionality: The following parameters need to be monitored to check the health of hydraulic pumps, pressure filters and reservoirs: pump vibration sensor to capture failure based on frequency; filter electric DPI to indicate clogging; number of logged pump hours to be recorded; accumulator pressure indication (from transducers); reservoir level indication (micro-switch based); filter clogging status (electric DPI).

3. Fuel System
     Type of sensor: Pressure, Temperature, Discrete (tapped parameters), Flow meter
     Functionality: Monitoring of fuel pump outlets, various tank and engine inlet pressures, pump failures, tank pressure failures, fuel low-level warning and in-flight refuel switch status, etc.

4. Landing Gear
     Type of sensor: Pressure transducers, Ultrasonic fluid level sensors, Temperature, Position indicator, Discrete, Strain gauges
     Functionality:
     ➢ Structural health monitoring: monitors damage growth and remaining useful life and triggers maintenance work if damage has exceeded the threshold.
     ➢ System health monitoring: takes care of functional aspects and initiates a maintenance task if there is any degradation in performance.
     ➢ Shock absorber monitoring: nitrogen gas pressure, hydraulic fluid volume (capacitive), temperature, shock absorber position.
     ➢ Corrosion monitoring: monitoring the corrosiveness of parts due to the environment.
     ➢ Hard landing and overload monitoring (dynamic measurements): a) Passive hard landing detector (HLD): employs a metallic mass/spring system; when a predetermined acceleration is exceeded, the mass pivots and is held by a magnet, giving an indication. b) Remote inertial measurement units: record acceleration and compare it with a threshold defined on the aircraft data bus to indicate a hard landing.
     ➢ Landing gear MEMS platform: MEMS-based acceleration sensing system with data logging.
     ➢ Mechanical devices (indenter, deformable pin, fuse): visual inspection of indentation and deformation.
     ➢ System health monitoring: the Landing Gear System Electronic Controller monitors and controls all the important system functions through the available sensors in the system and gives the necessary warnings in real time.

5. Electrical System
     Type of sensor: Temperature, Vibration, Current
     Functionality: 1) Battery management system: charge remaining, overcharge, temperature. 2) Generators: temperature sensor, vibration sensor. 3) AC & DC master box: current monitoring.

6. Secondary Power System (SPS)
     Type of sensor: Pressure sensors, Temperature sensors, Level indicators, Vibration sensors, Fire warning and Electrical chip detectors
     Functionality: The following parameters need to be monitored to check the health of the gear box: a) electrical chip detector; b) remote oil level sensing and warning; c) online debris analysis of lubricating oil; d) PTO vibration limit exceedances; e) SPS bay fire sensing and warning; f) lube oil pressures and temperatures.

7. Oxygen and Life Support Systems
     Type of sensor: Oxygen indicators, Regulators, Pressures and Temperatures
     Functionality: There is a need to monitor oxygen contents, system pressure regulation and temperatures, with an electronic control unit to monitor all parameters.

8. Brake Management System
     Type of sensor: Pressure, Thermocouples, Analog potentiometer, Strain gauges, Frequency and Vibration
     Functionality: There is a need to monitor shock absorbers, brake pressures, brake temperatures, pedal demands, servo valve channel currents, nose wheel system angles (in degrees), and main and nose wheel shock absorber travels.



IVHM Challenges in Aerospace

Ofsthun (2002) indicated that the automotive industry is well on its way to achieving IVHM goals, whereas the following technical and business issues have historically made these goals more difficult to achieve for aerospace platforms:

• Mission Complexity: The number of complex systems that must be tightly integrated into an aerospace platform, and the required level of integration, are both greater for aerospace platforms.
• Affordability and Interoperability: Aerospace affordability initiatives often demand that com-
mon hardware and software be reused across a number of dissimilar vehicle types. Interoperability
complicates IVHM development by elevating health status to a much higher level than any single
air vehicle.
• Lifespan: Aerospace platforms remain in service for decades; integrated vehicle health management is a growing technology area that promises to provide significant life cycle benefits to aerospace programs over such long lifespans.
• Operational Environment: Aerospace platforms, particularly rotorcraft, military fighter/attack aircraft, and space vehicles, can be exposed to much harsher environmental conditions, with extremes of temperature, humidity, shock, vibration, and radiation. As a result, aerospace vehicles place greater demands on their operators.
• Maintenance Environment: Aerospace customers have historically had to provide their own
maintenance personnel. The lack of experienced maintenance personnel makes the level of IVHM
autonomy especially critical for aerospace vehicles.

• Market: The aerospace industry must rely on selling many vehicles to a handful of customers, which complicates the business case.

CONCLUSION

The optimization of aerospace big data, including integrated health monitoring, is a challenge in the aerospace industry. In the aerospace industry, the major value drivers will be data analytics and data science, which will allow engineers and scientists to combine datasets in new ways and gain insights from complex systems that are hard to analyze deterministically. Analytics solutions will help optimize big data in the aerospace industry. This chapter aims to throw some light on the various systems and their data logging processes, simulations and analysis. The various sensors required in the different systems of a typical fighter aircraft, and their functionalities, are presented in this chapter.


Integrated Vehicle Health Management (IVHM) is playing a much more important role than normal data processing, analysis and reporting, and that role is expected to increase significantly over the coming years. Hence, the health monitoring concepts of the various aircraft systems, such as the engine, structure, nacelle, avionics, electrical system, actuation systems and mechanical systems (Fuel, Hydraulics, Landing Gear, Secondary Power System, Environmental Control System and Life Support Systems), are presented in detail. Finally, the future aviation scenario, taking advantage of aerospace big data, is indicated in this chapter.

REFERENCES

Allan, J. V., Tom, B., Robert, L., & Donald, L. S. (2004). Development of an Information Fusion System
for Engine Diagnostics and Health Management. NASA/TM—2004-212924.
Daniel, B., Claus-Peter, F., & Alfredo, G. (2010). Structural Health Monitoring. In Introduction to
Structural Health Monitoring. ISTE Ltd.
Divakaran, V. N., Subrahmanya, R. M., & Ravi Kumar, G.V.V. (2017). Integrated Vehicle Health Man-
agement of a Transport Aircraft Landing Gear System. Infosys Limited. Retrieved from https://www.
infosys.com/engineering-services/white.../aircraft-landing-gear-system.pd
Flight Data Monitoring on ATR Aircraft. (2016). ATR Training Center. Retrieved from ATR Product
Support & Services Portal: https://www.atractive.com
Iñaki, U., Iñaki, A., & Belén, M. (2016). Aircraft Engine Advanced Health Management: The Power of
the Foresee. 8thEuropean Workshop On Structural Health Monitoring (EWSHM 2016), Bilbao, Spain.
Karthikk, R. G. (2011). IVHM On UAV Fuel System Test Rig (MSc thesis). School of Engineering, Aero-
space Vehicle Design, Cranfield University, UK. Retrieved from https://www.dsiintl.com/.../Cranfield-
University-IVHM-Diagnostic-Influence-2011-Go
Ofsthun, S. (2002). Integrated vehicle health management for aerospace platforms. IEEE Instrumentation & Measurement Magazine, 5(3), 21-24. doi:10.1109/MIM.2002.1028368
Sameer, D. (2018). AI and Analytics – Accelerating Business Decisions. Wiley India Pvt. Ltd.
Soares, P. F. M. (2014). Flight Data Monitoring and its Application on Algorithms for Precursor Detec-
tion (MS Thesis). Instituto Superior Tecnico, Lisboa, Portugal.
Tai, Z. (2010). Aircraft electrical power system diagnostics, prognostics and health management (MSc
thesis). School of Engineering, Aerospace Design Program, Cranfield University, UK. Retrieved from
https://dspace.lib.cranfield.ac.uk/bitstream/handle/1826/9593/Tai_z.pdf
Wang, J. (2012). Aircraft hydraulic power system diagnostics, prognostics and health management (MSc
thesis). School of Engineering, Cranfield University, UK. Retrieved from https://dspace.lib.cranfield.
ac.uk/bitstream /handle/.../Wang_Jian_Thesis_2012.pdf


KEY TERMS AND DEFINITIONS

AOG: Aircraft on ground.


ACMS: Aircraft condition monitoring system.
APU: Auxiliary power unit.
CRU: Cargo refrigeration unit.
CVR: Cockpit voice recorder.
DPI: Differential pressure indicator.
DFDR: Digital flight data recorder.
EHM: Engine health monitoring.
FADEC: Full authority digital engine control.
FDA: Flight data analysis.
FDEP: Flight data entry panel.
FDM: Flight data monitoring.
FDR: Flight data recorder.
FMS: Flight management system.
FMU: Fuel metering unit.
IVHM: Integrated vehicle health management.
LRU: Line replaceable unit.
MPC: Multi-purpose computer.
MRO: Maintenance, repair, and overhaul.
OEM: Original equipment manufacturer.
PCM: Pulse code modulation.
PLA: Power lever angle.
QAR: Quick access recorder.
SHM: Structural health monitoring.
SMS: Safety management system.
SSFDR: Solid state flight data recorder.
TBO: Time between overhaul.
UAV: Unmanned air vehicle.


Chapter 5
Classification Techniques
and Data Mining Tools Used
in Medical Bioinformatics
Satish Kumar David
King Saud University, Saudi Arabia

Amr T. M. Saeb
King Saud University, Saudi Arabia

Mohamed Rafiullah
King Saud University, Saudi Arabia

Khalid Rubeaan
King Saud University, Saudi Arabia

ABSTRACT
Increasing volumes of data, together with the increased availability of information, mandate the use of data mining techniques in order to gather useful information from datasets. In this chapter, data mining techniques are described with a special emphasis on classification, an important supervised learning technique. Bioinformatics tools for medical applications, especially in medical microbiology, are discussed. This chapter presents WEKA software as a tool of choice to perform classification analysis
for different kinds of available data. Uses of WEKA data mining tools for biological applications such
as genomic analysis and for medical applications such as diabetes are discussed. Data mining offers
novel tools for medical applications for infectious diseases; it can help in identifying the pathogen and
analyzing the drug resistance pattern. For non-communicable diseases such as diabetes, it provides
excellent data analysis options for analyzing large volumes of data from many clinical studies.

DOI: 10.4018/978-1-5225-7077-6.ch005

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


INTRODUCTION

Developments in information technology have led to significant advancements in how large volumes of data are handled. Advances in healthcare have created enormous amounts of medical data in the form of electronic health records, in which all the medical information and history of patients are stored. Many countries have even set up unique disease registries. With advancements in biomedical research, data from genomics, proteomics and metabolomics have flooded researchers.
Appropriate data analysis is necessary to convert these enormous volumes of raw data into meaningful
and valuable results. Medical data analysis can be beneficial in epidemiology and disease surveillance, to predict disease patterns and track outbreaks. It can be used to analyze clinical data to
evaluate the effectiveness of health programs and identify the people at risk for developing adverse health
outcomes. Medical data along with data from other biomedical research can be useful in the develop-
ment of a faster, economical and effective new drug discovery and development programs. Therefore,
medical data analysis has become an important tool for all the stakeholders involved in the healthcare.
Data analysis requires appropriate tools to be effective. Managing big data has developed into an important field of research known as data mining. Data mining is a method of discovering information by studying data from fields such as medicine, genetics, bioinformatics and education (Fayyad & Stolorz, 1997). It extracts patterns from large data sets, identifying novel, potentially useful and valid information in the data (Fayyad & Stolorz, 1997). It is a tool with enormous potential, which can predict patterns and behaviors and can be implemented on existing software and hardware platforms. Data mining is supported by three enabling developments: massive data accumulation, powerful multiprocessor computers, and data mining algorithms. Data mining methods are not the same as traditional statistical methods, although many data mining processes can be carried out using statistics. Traditional statistical approaches require considerable user interaction to validate the accuracy of a model and can therefore be hard to automate, whereas data mining methods are suited to large data collections and can be automated easily. Data mining includes tasks such as deviation recognition, which identifies irregular data records; dependency modeling, also known as market basket analysis, which looks for associations between variables; clustering; classification; regression; and summarization (Figure 1). It relies on modeling: a model is built in a situation where the answer is known and is then applied to another situation where it is not. This requires knowledge from large datasets to develop models that can analyze current data. Moreover, unlike other methods, data mining tools do not modify the data in order to analyze it.
Data mining comprises two families of techniques, namely unsupervised and supervised learning. An unsupervised learning technique analyzes the data and creates hypotheses to build a model; it is not guided by an outcome variable. Clustering is one of the most commonly used unsupervised techniques (Guerra et al., 2011). In the case of supervised learning, the model is built before the analysis. Classification, statistical regression and association rules are the most commonly used supervised learning techniques in the medical field (Yoo et al., 2012).
Moreover, these techniques are widely used in the field of infectious disease control. Applications include pathogen identification and typing, with comparison of the resulting molecular profiles against pre-existing databases such as the Institut Pasteur MLST database; phylogenetic analysis, which uses classification techniques such as neighbor-joining and Bayesian analysis; and pathogenomics, which depends mainly on data mining of the huge amount of sequence data generated by next-generation sequencing, as the authors will discuss later.


Figure 1. Data mining techniques

The application of Bioinformatics tools and techniques in analyzing the increasing data generated in
molecular biology, genomics, transcriptomics, and proteomics is gaining momentum (Hogeweg, 2011).
Moreover, the amount of information gleaned in the form of databases and literature for generating molecular profiles and for collecting data related to the epidemiology of pathogens has also been mounting (Carrico et al., 2013). Therefore, the use of Bioinformatics tools and techniques in pathogen identification and typing, identifying markers for early diagnosis and treatment, enabling personalized interventions and predicting patient outcomes is imperative (Saeb et al., 2017). Bioinformatics-aided next-generation sequencing (NGS) data analysis is promising for identifying clinically relevant viruses from a variety of specimen types (Petty et al., 2014). Similarly, bacterial pathogens such as Francisella tularensis and Leptospira santarosai were successfully identified using culture-independent NGS identification from primary human clinical specimens (Kuroda et al., 2012; Wilson et al., 2014). The application of Bioinformatics
techniques in the surveillance of pathogen outbreaks in fighting infectious diseases is also essential. In
this chapter, the authors discuss in general the biomedical applications of data mining, applications that
utilize classification techniques, available bioinformatics resources and databases.

CLASSIFICATION TECHNIQUES

Classification is a widely used technique for data mining purposes. It develops rules for grouping data that can then be used to classify future sets of data. A model is built on the basis of a set of pre-classified cases and is then used to classify the dataset, with the aim of assigning a class precisely to every case. Many classification tools are used for the analysis of data from healthcare services. These include decision trees (Kaushik et al., 2013), Bayesian networks (Ozekes & Camurcu, 2002), K-Nearest Neighbor (KNN) (Ismail et al., 2012) and machine learning (Patthy, 1999) (Table 1).


Table 1. Comparison of popular classification techniques used in data mining

Decision Tree
     Functionality: An easy and powerful technique represented by if-then-else rule conditions to classify the data. It uses a depth-first approach to recursively partition the data until all the data are classified. The structure is made of root, internal and leaf nodes. It has two phases of operation: tree building and tree pruning.
     Advantages: No domain knowledge is needed for construction; high-dimensional data can be dealt with; it can be implemented in parallel or serial fashion; the output is easy to interpret.
     Limitations: It is limited to one output attribute, which must be categorical; it is not suited for uncorrelated variables.
     Applications: Decision-making systems, teaching and research.

Bayesian Network
     Functionality: A powerful probabilistic representation and graphical model, also known as belief networks. The classifier learns from training data and works in two stages: a Directed Acyclic Graph (DAG) and conditional probability parameters.
     Advantages: Bayesian networks simplify computations; they are highly accurate and fast when applied to large datasets.
     Limitations: Class conditional independence is assumed; probability data may be unavailable.
     Applications: Bioinformatics, medicine, semantic search, image processing and data fusion.

K-Nearest Neighbor
     Functionality: Popularly known as a distance-based algorithm and considered a statistical learning algorithm. The classifier searches the pattern space for the k training samples nearest to the unknown sample.
     Advantages: It uses local information to yield highly adaptive behavior; it is analytically tractable, can be implemented in parallel and is simple; it is well suited to multimodal classes.
     Limitations: It requires huge storage and is vulnerable to the curse of dimensionality; classifying test tuples is time-consuming; it is affected by irrelevant attributes.
     Applications: Pattern recognition, image databases, internet marketing and cluster analysis.

Machine Learning
     Functionality: Enables computers to learn without being overtly programmed and does not require reprogramming as the machine learns. Random forest classifiers are trained using specific genomic and protein features and pathogenicity factors for each sequence in the training process.
     Advantages: Combines high precision, rapid prediction speed and the ability to deal with noisy data.
     Limitations: It has very few adjustable parameters; the learning process may be slower.
     Applications: Medical microbiology.
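To make the K-Nearest Neighbor entry in Table 1 concrete, the following minimal, self-contained Java sketch classifies an unknown point by majority vote among its k closest training samples. The toy dataset, the two numeric attributes and the choice of k = 3 are arbitrary illustrations, not data from any study cited in this chapter.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Minimal k-Nearest Neighbor classifier on a toy, hard-coded dataset. */
public class SimpleKnn {

    // Each row: two numeric attributes followed by a class label (0 or 1).
    static final double[][] TRAIN = {
        {1.0, 1.1, 0}, {1.2, 0.9, 0}, {0.8, 1.0, 0},
        {3.0, 3.2, 1}, {3.1, 2.9, 1}, {2.8, 3.0, 1}
    };

    static int classify(double[] query, int k) {
        // Sort a copy of the training set by Euclidean distance to the query point.
        double[][] byDistance = TRAIN.clone();
        Arrays.sort(byDistance, Comparator.comparingDouble((double[] row) ->
                Math.hypot(row[0] - query[0], row[1] - query[1])));
        // Majority vote among the k nearest neighbors (labels are 0 or 1).
        int votesForClassOne = 0;
        for (int i = 0; i < k; i++) {
            votesForClassOne += (int) byDistance[i][2];
        }
        return (2 * votesForClassOne > k) ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] unknown = {2.9, 3.1};
        System.out.println("Predicted class: " + classify(unknown, 3));
    }
}
```

The same idea scales to real medical datasets, where the main practical concerns are the storage cost and distance computation time noted in Table 1.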

DATA MINING TOOLS

Software packages that implement these classification techniques include RapidMiner, KNIME, Tanagra, Orange and WEKA (Zupan & Demsar, 2008; Malley, Kruppa, Dasgupta, Malley, & Ziegler, 2012) (Table 2).


Table 2. Comparison of various data mining tools

RapidMiner
     Functionality: The most powerful tool for ETL and data analysis, covering a broad spectrum of tasks and offering flexible data import. RapidMiner offers a comprehensive data mining solution: it loads the data and transforms them to produce descriptive and predictive models, evaluates the models and supports deployment and reporting. It can handle all types of data mining, such as text mining, web mining, audio mining, time series analysis and forecasting, and predictive analytics tasks.
     Applications: Market research, sales, CRM, manufacturing, telecom companies and financial services.
     Advantages/limitations: Deployable and dependable; fast and flexible; can be used stand-alone, as a server, or as a Java library.
     Supported platforms/data formats: MySQL, Access, ODBC, JDBC, ARFF, CSV.

KNIME
     Functionality: The Konstanz Information Miner (KNIME) is a user-friendly and comprehensive open-source data mining tool. It is powerful, reliable and scalable, integrates easily with other tools, offers enhanced data exploration, and its large-volume data processing capability is limited only by hard disk capacity rather than RAM.
     Applications: Pharmaceutical research, CRM, financial services and academic professionals.
     Advantages/limitations: It has only limited error measurement methods.
     Supported platforms/data formats: MySQL, Access, ODBC, JDBC, ARFF, CSV.

Tanagra
     Functionality: Uses supervised learning algorithms for interactive and visual construction of decision trees. It also incorporates exploratory data analysis, statistical learning and machine learning, and serves as a pedagogical tool for learning programming techniques.
     Applications: Students and researchers.
     Advantages/limitations: It cannot validate against an independent validation set; it works only in Windows environments.
     Supported platforms/data formats: ARFF, Excel.

Orange
     Functionality: An open source data visualization and analysis tool for novices and experts. Orange uses Python scripting or visual programming and has excellent visualization options such as scatter plots, bar charts, trees, networks and heat maps. The power of Orange lies in the way it exposes Python scripting in a simpler form.
     Applications: Computational approaches in bioinformatics in science, industry, and teaching.
     Advantages/limitations: It does not have options to save the model, so the model has to be rebuilt every time.
     Supported platforms/data formats: MySQL.

WEKA
     Functionality: Contains a large collection of state-of-the-art machine learning and data mining algorithms. The important tools available in WEKA are regression, classification, clustering, association rules, visualization and data pre-processing.
     Applications: Well-known software among academic and industrial researchers; also widely used in teaching.
     Advantages/limitations: It is platform independent and freely available.
     Supported platforms/data formats: ARFF, CSV, Excel.

WEKA software has shown the highest accuracy in performance comparisons (Zupan & Demsar, 2008). It is very powerful software that can even handle multiclass datasets that other data mining software cannot. Furthermore, it offers better functionality for running a specific algorithm with a selected tool, and it is capable of running up to six selected classifiers across all data sets.


BIOINFORMATICS TOOLS IN MEDICAL MICROBIOLOGY

The development of the multidisciplinary field of bioinformatics occurred as a response to the increased
amount of data generated in the fields of molecular biology, genomics, transcriptomics, and proteomics.
Similarly, the amount of generated information, in databases and literature, regarding molecular profiles
and epidemiology of pathogens has been also mounting. There is no doubt that using bioinformatics
tools can aid in pathogen identification and typing, identifying markers for early diagnosis and treat-
ment, enabling personalized interventions and predicting patient outcomes. In addition, biosurveillance
of pathogen outbreaks is also an important application of bioinformatics in fighting infectious diseases.

Pathogen Identification and Typing

Bioinformatics tools are extensively used in the identification, characterization, and typing of all kinds
of pathogens. This followed the widespread use of genomic approaches in the diagnosis and manage-
ment of viral, bacterial, and fungal infections. Applications of bioinformatics have been used in patho-
gen identification, detection of virulence factors, resistome analysis, and strain typing. For example, after the failure of conventional techniques to identify the cause of infection in an 18-year-old woman, it was only with next-generation sequencing (NGS) technology, supported by bioinformatics, phylogenetic and pathogenomics analyses, that the authors were able to identify the causative agent as a C. haemolyticum isolate. Their bioinformatics analysis showed that this isolate possesses all the virulence factors necessary to establish an infection and cause all the observed symptoms (Saeb et al., 2017). This shows that NGS holds considerable potential for identifying pathogens isolated from human specimens using whole genome sequencing (WGS) assisted by powerful bioinformatics tools (Weinstock, 2012). Bioinformatics tools are needed not only when dealing with WGS but also when dealing with ribosomal RNA (rRNA) gene sequencing data, which are routinely used for the identification of both bacterial and fungal pathogens. Even more powerful bioinformatics tools are needed when dealing with NGS rRNA sequencing data in the emerging field of microbiome studies (Klindworth et al., 2013).
There are many available bioinformatics tools for sequence assembly and analysis, such as Lasergene, the CLCbio workbench, Geneious and Mauve. For microbiome studies, bioinformatics tools are also available for the detection and removal of amplification-derived chimeric sequences, such as DECIPHER, the UCHIME algorithm, ChimeraSlayer, Mothur, AmpliconNoise and CATCh; these are given in Table 3 (Carrico et al., 2013).
In addition, there are ready pipelines dedicated to analyzing both processed data and raw sequences
such as QIIME (Caporaso et al., 2010), Ribosomal Database Project (RDP) (Cole et al., 2009), and mothur
(Schloss et al., 2009). RDP currently provides sequence information of 3,356,809 bacterial 16S rRNAs
and 125,525 fungal 28S rRNAs. This project provides quality-controlled, aligned and annotated bacterial
and archaeal 16S rRNA sequences, and fungal 28S rRNA sequences, and a suite of analysis tools to the
scientific community. It contains a new Fungal 28S Aligner and updated Bacterial and Archaeal 16S
Aligner. It also provides a pipeline for extended processing and analysis for high-throughput sequencing
data, including single-strand and paired-end reads. Several comprehensive reference databases have been developed in order to facilitate accurate bacterial pathogen identification. For instance, Greengenes contains 1,049,116 aligned 16S rDNA records; SILVA contains 6,300,000 available SSU/LSU sequences of bacteria, archaea and eukarya; and the Human Oral Microbiome Database (HOMD) contains comprehensive information on the approximately 700 prokaryote species that are present in the human

Table 3. Bioinformatics tools for sequence assembly & analysis and microbiome studies

Sl.No. Tool Name URL


1. Lasergene http://dnastar.com
2. CLCbio workbench http://www.clcbio.com/products/clcmain-workbench/
3. Geneious http://www.geneious.com/
4. Mauve http://gel.ahabs.wisc.edu/mauve
5. DECIPHER http://DECIPHER.cee.wisc.edu
6. UCHIME algorithm http://drive5.com/usearch/manual/uchime_algo.html
7. ChimeraSlayer http://microbiomeutil.sourceforge.net/#A_CS
8. Mothur https://www.mothur.org/
9. AmpliconNoise http://qiime.org/scripts/ampliconnoise.html
10. CATCh http://science.sckcen.be/en/Institutes/EHS/MCB/MIC/Bioinformatics/CATCH

oral cavity. NGS supported by bioinformatics tools has been used to catalog discrete organisms within
complex, polymicrobial specimens. For example, deep sequencing of 16S rRNA implicated Actinomadura madurae as the cause of mycetoma in a diabetic patient when conventional microbiological and molecular methods were overwhelmed by an overgrowth of Staphylococcus aureus (Salipante et al., 2013). Furthermore, the RAST server is one of the most important tools when dealing with WGS metagenomics analysis, which is considered more complicated than 16S rRNA sequencing. Moreover, high-throughput sequencing (HTS) accompanied by a bioinformatics pipeline (ezVIR) was used to evaluate the entire spectrum of known human viruses at once and provided results that are easy to interpret and
customizable. This pipeline works by identifying the most likely viruses present in the specimen given
the sequencing data. The ezVIR pipeline generates strain typing reports, genome coverage histograms,
and cross-contamination analysis for specimens prepared in series. This pipeline was able to identify all
DNA or RNA viruses in the most common collected clinical specimens.
Bioinformatics tools were also developed to solve the problem of removing host sequences from the mixed pool of pathogen and human sequences resulting from NGS. The filtering step is very important since the amount of viral sequence in the resulting pool is usually less than 1%. For example, rapid identification of non-human sequences (RINS) was able to precisely identify sequencing reads from non-human genomes in the dataset used and robustly produce contigs from these sequences in less than two hours (Bhaduri et al., 2012). Moreover, VirusSeq is an algorithmic method that is also used for detecting known viruses and their integration sites in the human genome using next-generation sequencing data; VirusSeq is implemented in PERL (Chen et al., 2013). In addition, HMMER3-compatible profile hidden Markov models (profile HMMs) were constructed within the vFam software to classify sequences as viral or non-viral (Skewes-Cox et al., 2014). Moreover, PathSeq was developed to identify both known and unknown microorganisms in NGS data. One example of the use of bioinformatics analysis in the identification of a bacterial pathogen was introduced by Saeb et al. (2017). In this study, the authors developed an analysis pipeline to identify the suggested pathogen and annotate it. In the first step, the authors assessed the quality of the reads, and reads with a score of less than 20 bp were removed. In the second step, the selected reads were subjected to the Metaphlan software (Segata et al., 2012) for primary microbial identification of sequence reads based on unique and clade-specific marker genes. In parallel, the authors used the BLAST program


to map each read to the non-redundant nucleotide database of NCBI. As expected, the authors observed high contamination with human, non-pathogen sequences, and they used the program TMAP to remove the contaminating reads. The target non-human sequences were subjected to further analysis. First, MIRA software (version 4) was used to perform de novo assembly of these non-human sequences (Chevreux et al., 2015). The selected sequences were mapped to the bacterial genomes that were top ranked according to the Metaphlan and BLAST findings. The pipeline used in their study was imported into the workflow system Tavaxy (Abouelhoda et al., 2012). Moreover, the authors used the QIIME pipeline for taxonomic assignment and for visualization of the results (Caporaso et al., 2010).
Another important application of bioinformatics tools in the fields of clinical microbiology, popula-
tion genetics, and infection control is microbial typing (Van Belkum et al., 2001; Struelens et al., 1996
and McKnew et al., 2003). The most commonly used techniques in this field are multilocus sequence typing (MLST), single locus sequence typing (SLST) and multilocus variable-number tandem repeat analysis (MLVA) and, less commonly, clustered regularly interspaced short palindromic repeats (CRISPR) (Maiden et al., 1998; Frénay et al., 1996; Schouls et al., 2009). Databases are freely available for submitting, sharing
and comparing MLST analysis data, MLVA typing and SLST analysis (Table 4) (Saeb, 2018).
These include the Multi Locus Sequence Typing home page, the public databases for molecular typing and microbial genome diversity, Institut Pasteur MLST, the European Working Group for Legionella Infections (EWGLI) sequence-based typing database, and the Environmental Research Institute, University College Cork. For example, in the latter database the total number of records is 11,614, the number of sequence types is 2,389, the number of flaA alleles is 38, the number of pilE alleles is 53, the number of asd alleles is 72, the number of

Table 4. Databases for MLST data analysis, MLVA typing and SLST analysis

1. Multi Locus Sequence Typing: http://www.mlst.net
2. Public databases for molecular typing and microbial genome diversity: http://www.pubmlst.org
3. Institut Pasteur MLST: http://www.pasteur.fr/mlst/
4. European Working Group for Legionella Infections (EWGLI) Sequence-based typing database: http://www.hpabioinformatics.org.uk/legionella/legionella_sbt/php/sbt_homepage.php
5. Environmental Research Institute, University College Cork: http://mlst.ucc.ie/
6. MLVAbank: http://mlva.u-psud.fr/mlvav4/genotyping/
7. Groupe d'Etudes en Biologie Prospective: http://www.mlva.eu
8. MLVA-NET: https://research.pasteur.fr/en/publication/mlva-net-a-standardised-web-database-for-bacterial-genotyping-and-surveillance
9. Multiple-Locus Variable number tandem repeat Analysis: http://www.mlva.net
10. ccrB sequence typing: http://www.ccrbtyping.net/
11. dru typing database: http://dru-typing.org/site/
12. RidomSpaServer: http://spaserver.ridom.de/
13. CRISPRs web server: http://crispr.i2bc.paris-saclay.fr


mip alleles is 84, the number of mompS alleles is 96, the number of proA alleles is 54, the number of neuA alleles is 63 and the number of neuAh alleles is 30. Similarly, databases are freely available for MLVA typing. These
include MLVAbank. This site aims to simplify pathogenic bacteria strain genotyping for epidemiologi-
cal purposes. These pathogenic bacteria include Acinetobacter baumannii, Bacillus anthracis, Brucella,
Coxiella burnetii, Legionella pneumophila, Mycobacterium tuberculosis, Pseudomonas aeruginosa,
Staphylococcus aureus and Yersinia pestis. One more database is Groupe d'Etudes en Biologie Prospective. It provides genotyping information for bacterial species by MLVA; these pathogenic bacteria include Staphylococcus aureus, Streptococcus pneumoniae, Pseudomonas aeruginosa, M. tuberculosis, S. enterica and K. pneumoniae. Further resources are MLVA-NET, a standardized web database for bacterial genotyping and surveillance, and the Multiple-Locus Variable number tandem repeat Analysis home page; the latter database includes data for Bordetella pertussis, Haemophilus influenzae, Neisseria meningitidis, Staphylococcus aureus and Streptococcus pneumoniae. Examples of databases for SLST analysis are the ccrB typing tool, an online resource for staphylococcal ccrB sequence typing, and the dru typing database; the latter contains 99 dru repeats and 531 dru types with from 1 to 23 repeats, as of 22 May 2017. Another example of an SLST analysis database is the Ridom SpaServer. The SpaServer is used to gather and match data from different geographical locations. Spa typing is very important for surveillance of methicillin-resistant Staphylococcus aureus (MRSA), since single-locus DNA sequencing of the repeat region of the Staphylococcus protein A gene (spa) can be used for steadfast, precise and discriminatory typing of MRSA. One last molecular typing bioinformatics analysis tool is the CRISPRs web server. It contains CRISPRcompar, a website to compare clustered regularly interspaced short palindromic repeats.

Bioinformatics Tools for Pathogenicity and Virulence

It is of great importance to assess the pathogenicity and virulence of any newly detected human pathogen, and there are freely available bioinformatics tools to achieve that aim. One important tool for testing the pathogenic nature of a newly discovered bacterial pathogen is PathogenFinder 1.1, a pathogenicity prediction program. PathogenFinder is a web server used for the prediction of bacterial pathogenicity from proteomic or genomic data or raw reads. Bacterial pathogenicity prediction in this web server depends on groups of proteins known to be involved in pathogenicity (Cosentino et al., 2013). One more method that has recently come to this field is the machine learning based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). This tool is provided as a user-friendly R package (Deneke et al., 2017). PaPrBaG predicts pathogenicity by training on a large number of established pathogenic species in comparison with non-pathogenic bacteria, and it is suitable for NGS data even at very low genomic coverage. Furthermore, after the genomic contigs of a pathogen produced using NGS techniques have been annotated using the Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP), they can also be annotated using the bacterial bioinformatics database and analysis resource (PATRIC) gene annotation service in order to selectively scan for pathogenicity and virulence factors. In addition, the virulence gene sequences and functions corresponding to the different major bacterial virulence factors of a specific pathogen can be collected from GenBank and validated using the virulence factors of pathogenic bacteria database, the Victors virulence factors search program and the PATRIC_VF tool (Wattam et al., 2014).


Bioinformatics Tools for Identifying and Combating Antimicrobial Resistance

In order to have the upper hand in the ongoing arms race between humans and pathogens in terms of antimicrobial resistance, we must have tools that facilitate rapid and accurate detection and understanding of resistance factors and mechanisms. Bioinformatics (in silico) tools provide a new approach to achieving these goals. For instance, genome contigs can be investigated for the presence of antibiotic resistance loci using both the PGAAP and PATRIC gene annotation services mentioned above. Furthermore, the presence of antibiotic resistance loci in newly isolated bacterial pathogens can then be investigated using specialized search tools and services, namely Antibiotic Resistance Gene Search, Genome Feature Finder (antibiotic resistance), ARDB (Antibiotic Resistance Genes Database), CARD (The Comprehensive Antibiotic Resistance Database), Specialty Gene Search and ResFinder 2.1 (Wattam et al., 2014; Liu & Pop, 2009; McArthur et al., 2013). Similarly, antibacterial biocide and metal resistance genes can be investigated using the PGAAP and PATRIC gene annotation services, the PATRIC Feature Finder search tool and BacMet (the antibacterial biocide and metal resistance genes database) (Zankari et al., 2014; Pal et al., 2014). In addition to what the authors have presented herein, there are many more applications for bioinformatics in this important field of study, such as drug resistance testing, pathogen-host interaction, and infection and treatment outcomes. Requirements still to be met in order to facilitate and further incorporate bioinformatics into clinical microbiology and infectious diseases include the training of personnel and the production of straightforward, validated and user-friendly bioinformatics pipelines. The bioinformatics tools for database search are given in Table 5.

WEKA DATA MINING SOFTWARE

Comparison of different data mining software indicates that WEKA is the most accurate and powerful data mining software in terms of performance. Therefore, the authors discuss the WEKA software in detail in this chapter. WEKA, the "Waikato Environment for Knowledge Analysis," was created by the Machine Learning Group at the University of Waikato in New Zealand. The vision was to build cutting-edge software for developing machine learning (ML) techniques and to apply them to real-world data mining problems. According to the main page of the Weka project, Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from the user's own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization, and it is also well suited to developing new machine learning schemes (Weka 3). Weka can be used on Linux, Mac, and Windows systems.
Data mining tasks are carried out by the machine learning algorithms available in WEKA. The users' own Java code can be used to call the algorithms, or they can be applied directly to the dataset. WEKA executes tasks such as data pre-processing, regression, classification, association rules, clustering, feature selection, and visualization with the respective tools, and it identifies the most appropriate strategy by comparing the available strategies under the same evaluation method.
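For readers who prefer the programmatic route, the minimal sketch below shows one common way of calling WEKA from user Java code: load an ARFF file, choose a classifier, and cross-validate it. The file name patients.arff is a placeholder; any ARFF dataset with a nominal class attribute in the last column would do.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class WekaJ48Example {
    public static void main(String[] args) throws Exception {
        // Load a dataset in ARFF format (file name is a placeholder).
        Instances data = new DataSource("patients.arff").getDataSet();
        // By convention the last attribute is taken as the class attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // Build a C4.5-style decision tree (WEKA's J48 implementation).
        J48 tree = new J48();

        // Estimate performance with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString("\n=== 10-fold cross-validation ===\n", false));
    }
}
```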


Table 5. Bioinformatics tools for database search

1. Ribosomal Database Project (RDP): http://rdp.cme.msu.edu/
2. Greengenes: http://greengenes.secondgenome.com/downloads
3. SILVA: https://www.arb-silva.de/
4. RAST server: http://metagenomics.anl.gov
5. VirusSeq: http://odin.mdacc.tmc.edu/~xsu1/VirusSeq.html
6. vFam software: http://derisilab.ucsf.edu/software/vFam
7. PathSeq: http://www.broadinstitute.org/software/pathseq
8. TMAP: https://github.com/iontorrent/TMAP
9. PathogenFinder 1.1: https://cge.cbs.dtu.dk/services/PathogenFinder/
10. Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP): http://www.ncbi.nlm.nih.gov/
11. PATRIC: https://www.patricbrc.org/app/Annotation
12. Virulence factors of pathogenic bacteria database: http://www.mgc.ac.cn/VFs/
13. Victors: http://www.phidias.us/victors/
14. PATRIC_VF: https://www.patricbrc.org/portal/portal/patric/SpecialtyGeneSource?source=PATRIC_VF&kw
15. Antibiotic Resistance Gene Search: https://www.patricbrc.org/portal/portal/patric/AntibioticResistanceGeneSearch?cType=taxon&cId=131567&dm=
16. ARDB: https://ardb.cbcb.umd.edu/
17. CARD: https://card.mcmaster.ca/
18. ResFinder 2.1: https://cge.cbs.dtu.dk//services/ResFinder/
19. BacMet: http://bacmet.biomedicine.gu.se/
20. RINS: http://khavarilab.stanford.edu/resources.html
21. PaPrBaG R package: https://github.com/crarlus/paprbag

In addition, bioinformatics analyses can also be carried out using the WEKA tools. Some of the reported bioinformatics analyses using WEKA tools are probe selection for gene expression arrays (Tobler,
Molla, Nuwaysir, Green, & Shavlik, 2002), automated protein data annotation (Bazzan, Engel, Schroeder,
& da Silva, 2002; Kretschmann, Fleischmann, & Apweiler, 2001), automatic cancer diagnosis (Li, Liu,
Ng, & Wong, 2003), plant genotype discrimination (Taylor, King, Altmann, & Fiehn, 2002), classifying
gene expression profiles (Li, & Wong, 2002), computational model for frame-shifting sites (Bekaert et
al., 2003) and extracting rules from them (Li et al., 2003).

Interfaces to WEKA

WEKA has four interfaces, and these are available in the main GUI Chooser window. The simple command line interface (CLI) can be used to access all the learning techniques, either as part of shell scripts or from within other Java programs using the WEKA API; WEKA commands can be executed directly from the CLI.


There is another graphical drag-and-drop user interface “Knowledge Flow,” available in WEKA with
support for incremental learning. It provides a more process-oriented view of data mining. A flow of
information can be created graphically by connecting the individual learning components represented
by the Java beans.
The Experimenter is another graphical user interface available in the main GUI window. It compares the performance of several learning schemes on various datasets, can distribute experiments across many computers running remote servers, and can be used to conduct statistical comparisons between the learning schemes.

The WEKA Explorer

The main interface of WEKA is the Explorer (Figure 2). It can be used to run simulations, data visualization and preprocessing, and it is convenient for new users. The Explorer can load data in different formats such as ARFF, CSV, C4.5, and Library.
There are six tabs available in the WEKA Explorer namely preprocesses, classify, cluster, associate,
select attributes and visualize (Bouckaert et al., 2013). These tabs are used to execute the tasks corre-
sponding to the tab names such as preprocess, classify, associate etc.

Preprocess

This tool is also known as "Filters" in WEKA. Data retrieval from a file, from a database such as an SQL database, or from a website URL can be handled by the preprocess tool. In the case of datasets with huge numbers of samples, sub-sampling may be necessary. A histogram in the preprocess tab displays statistics for the selected attribute, and histograms of all the attributes can be viewed simultaneously in a separate window. The required filter is set up using the filter box. The filters available in WEKA cover discretization, normalization, resampling, attribute selection, and attribute combination.
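As a sketch of how the same preprocessing filters can be applied programmatically rather than through the Explorer, the following Java snippet discretizes all numeric attributes of a dataset; the file name patients.arff is a placeholder for any local ARFF file.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class PreprocessExample {
    public static void main(String[] args) throws Exception {
        // "patients.arff" is a placeholder for any local ARFF dataset.
        Instances raw = new DataSource("patients.arff").getDataSet();

        // Unsupervised discretization (equal-width binning by default).
        Discretize discretize = new Discretize();
        discretize.setInputFormat(raw);   // a filter must see the input format first
        Instances binned = Filter.useFilter(raw, discretize);

        System.out.println("Attributes after discretization: " + binned.numAttributes());
    }
}
```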

Figure 2. WEKA Knowledge Explorer


Classify

Classify tools work on the preprocessed data to execute further analysis. Pre-labeled training data are used to produce the classification model. The most important learning techniques for classification and regression, such as Bayesian classifiers, decision trees, rule sets, support vector machines, logistic regression and multilayer perceptrons, linear regression, and nearest-neighbor methods, are available in WEKA. Automatic parameter tuning using cross-validation and cost-sensitive classification can be performed by using "meta-learners" such as bagging, stacking and boosting schemes. Learning algorithms can be evaluated using cross-validation or a hold-out set. Performance measures such as accuracy and root mean squared error, as well as graphical means for visualizing classifier performance such as ROC curves and precision-recall curves, are available in WEKA. Predictions of a classification or regression model can be visualized to identify outliers easily, and the WEKA system can load or save the models that have been generated.
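A minimal sketch of this hold-out evaluation and model persistence workflow is given below; the file names train.arff, test.arff and naivebayes.model are placeholders, and Naive Bayes is chosen only as an example of the classifiers listed above.

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainAndSaveExample {
    public static void main(String[] args) throws Exception {
        // Separate training and hold-out files (placeholders for the user's own data).
        Instances train = new DataSource("train.arff").getDataSet();
        Instances test  = new DataSource("test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Classifier model = new NaiveBayes();
        model.buildClassifier(train);

        // Evaluate on the hold-out set and print accuracy.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        System.out.printf("Hold-out accuracy: %.2f%%%n", eval.pctCorrect());

        // Persist the trained model so it can be reloaded later without retraining.
        SerializationHelper.write("naivebayes.model", model);
        Classifier reloaded = (Classifier) SerializationHelper.read("naivebayes.model");
        System.out.println("Reloaded: " + reloaded.getClass().getSimpleName());
    }
}
```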

Cluster

Groups of cases in datasets can be recognized using the cluster tool. WEKA contains clustering algorithms, such as k-means and a heuristic incremental hierarchical clustering scheme, that can be accessed using the cluster tab. Cluster assignments can be visualized and compared with the actual clusters defined by the attributes in the data.
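The sketch below illustrates the k-means option programmatically; the class attribute (assumed to be the last one) is removed first because clustering is unsupervised, and the cluster count of three is an arbitrary illustrative choice.

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("patients.arff").getDataSet();  // placeholder file name

        // Clustering is unsupervised, so drop the class attribute (assumed to be last) first.
        Remove dropClass = new Remove();
        dropClass.setAttributeIndices("last");
        dropClass.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, dropClass);

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(3);          // number of clusters is an arbitrary choice here
        kMeans.buildClusterer(unlabeled);

        System.out.println(kMeans);        // prints cluster centroids and sizes
    }
}
```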

Associate

Associate tool consists of algorithms for the generation of association rules to discover any association
between groups of attributes in the data.
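A minimal programmatic counterpart of the Associate tab, using WEKA's Apriori implementation, is sketched below. The file name symptoms.arff is a placeholder for a dataset with nominal (or previously discretized) attributes.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AssociationExample {
    public static void main(String[] args) throws Exception {
        // Apriori expects nominal (or discretized) attributes; the file name is a placeholder.
        Instances data = new DataSource("symptoms.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);            // report the ten best rules found
        apriori.buildAssociations(data);

        System.out.println(apriori);        // prints the discovered association rules
    }
}
```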

Select Attributes

The select attributes tab provides the means for identifying the subsets of attributes that are predictive of a target attribute in the data. Many methods are available in WEKA for searching the space of attribute subsets, together with evaluation measures for attributes and attribute subsets. Best-first search, genetic algorithms, forward selection, and attribute ranking are some of the available search methods. The flexibility of the system allows different search and evaluation methods to be combined.
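One such combination, correlation-based subset evaluation with best-first search, is sketched programmatically below; the input file name is again a placeholder.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Arrays;

public class AttributeSelectionExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("patients.arff").getDataSet();  // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Correlation-based subset evaluation combined with best-first search.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);

        System.out.println("Selected attribute indices: "
                + Arrays.toString(selector.selectedAttributes()));
    }
}
```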

Visualize

The visualization in WEKA uses one dimension (1D) for single attributes and two dimensions (2D) for pairs of attributes, presented as a matrix of scatter plots. Visualization tools are useful for assessing the difficulty of the learning problem. These plots can be enlarged in a separate window to read information from individual data points. Nominal attributes, whose data points can obscure one another, can be dealt with using the available "jitter" option.


WEKA for Biomedical Applications

Bioinformatics is the area of research that emphasizes on large-scale investigations of the data coming
from bio molecules (Luscombe, Greenbaum, & Gerstein, 2001). Largely, Bioinformatics investigation
involves problems that can be formulated as machine learning tasks. These include classification or
regression, clustering and feature selection (Frank, Hall, Trigg, Holmes, & Witten, 2004). The Weka
data mining suite provides algorithms for such biological problems. The WEKA data mining suite
was utilized in many bioinformatics applications such as protein sequence information annotation in
the SWISS-PROT portal, with reasonable coverage and confidence levels (Kretschmann et al., 2001;
Bazzan et al., 2002). Numerous bioinformatics applications utilized the WEKA data-mining platform
for performing proteomics and gene expression experiments data analysis. Furthermore, Naïve Bayes
and artificial neural networks have been used in probe selection step for gene-expression arrays (Tobler
et al., 2002). In addition, WEKA data mining was likewise utilized for classifying cancer diagnosis
data to discover significant rules (Li et al., 2003). Additionally, WEKA data mining has been utilized
in molecular investigation of frameshift mutation sites in eukaryotic organisms (Bekaert et al., 2003),
use of metabolomics discrimination in plant genotypes (Taylor et al., 2002) and classifying expression
profiles of genes (Li & Wong, 2002).
Moreover, the existing WEKA structure provides a comprehensive diversity of valuable tools for ma-
chine learning. For example, the BioWEKA project extended it with added bioinformatics tools, including
new input formats for bioinformatics and alignments that make it possible to use with other downstream
analysis tools. These include MAGE-ML (Spellman et al., 2002) and CSV compatible formats for gene
expression data, FASTA (Pearson, & Lipman, 1988), EMBL (Kanz et al., 2005), Swiss-Prot (Bairoch, &
Boeckmann, 1991), GenBank (Benson, Lipman, & Ostell, 1993) for the storage of biological sequences
in ASCII files and InterProScan (Zdobnov, & Apweiler, 2001) for the annotation of sequence patterns.
Within the WEKA machine learning platform, two classical decision tree-building techniques (J48 and SimpleCART), along with an advanced alternating decision tree (ADTree), were utilized to construct decision tree models to study the gene-ranking stability estimation of overlapping genes or classic gene set enrichment analysis. This method discovered very precise descriptive models (Stiglic, Bajgot, & Kokol, 2010). In addition, the random forest method in the WEKA platform was used to study short-read data from small RNA-seq experiments (Fasold, Langenberger, Binder, Stadler, & Hoffmann, 2011). Moreover, decision trees generated using the J48 implementation of the C4.5 decision tree algorithm from the WEKA machine learning workbench were employed to analyze deep sequencing data to investigate the bacterial communities that constitute bacterial vaginosis (Hummelen et al., 2010). Molecular phylogeny is an essential approach to examining species evolution and gene function.

WEKA in Diabetes Data Analysis

Diabetes is a disease with a high morbidity and mortality. The increasing prevalence of diabetes and its
complications has put a lot of pressure on already overburdened healthcare systems worldwide. Thus,
diabetes has become one of the hottest topics in the medical research field. With large volumes of data
being generated regularly, data mining techniques can serve as a valuable tool for extracting knowledge
from these data. The data mining process helps us to determine the pattern of data by applying various
techniques such as machine learning, statistics, and data classification. With provision for new machine
learning techniques and availability of algorithms, WEKA has more flexibility for the users.


Classification algorithms J48, support vector machines (SVM), classification and regression tree
(CART) and k-nearest neighbor (kNN) were used to categorize a dataset that included 10 attributes
from 545 patients (Saravananathan & Velmurugan, 2016). WEKA software was used to calculate the
accuracy, specificity, sensitivity, precision and error rate of the different classification algorithms. They
found that the J48 algorithm was superior to the other algorithms in classifying the selected diabetes dataset.
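For reference, all of these evaluation measures are derived from the standard confusion matrix of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN); the definitions below are the conventional ones and are not taken from the cited study:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity (recall) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
Error rate = 1 - Accuracy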
In one of the studies, a connection was made between the Diabetes Expert System interface and WEKA
(Yasodha & Kannan, 2011). The study aimed to identify the parameters that determine cases of diabetes
and to estimate the number of people with diabetes in the population. Classification was carried out on a
diabetic patients' dataset with 249 instances and 7 different attributes. Bayes Network Classifier, J48
Pruned Tree, REP Tree, and Random Tree were the classification algorithms used in the study. After the
classification, the data was evaluated using 10-fold cross-validation.
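To make this workflow concrete, the minimal sketch below shows how such a 10-fold cross-validation of the J48 decision tree can be run through the WEKA Java API. The file name diabetes.arff and the assumption that the class label is the last attribute are illustrative only and are not taken from the cited studies.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DiabetesJ48CrossValidation {
    public static void main(String[] args) throws Exception {
        // Load the dataset (the ARFF file name is a placeholder).
        Instances data = new DataSource("diabetes.arff").getDataSet();
        // Assume the class attribute (diabetic / non-diabetic) is the last one.
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is WEKA's implementation of the C4.5 decision tree learner.
        J48 tree = new J48();

        // Stratified 10-fold cross-validation, as used in the studies above.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
    }
}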
WEKA classification algorithms were also used for analyzing a dataset of 768 patients taken from
the Pima Indians Diabetes Database of the National Institute of Diabetes and Digestive and Kidney Diseases.
The data had nine attributes. Naïve Bayes (NB), ZeroR, logistic regression, random forest, J48, and
multilayer perceptron were the algorithms used in the data analysis (Hina, Shaikh, & Sattar, 2017).
The multilayer perceptron algorithm showed the highest accuracy of 81.81% with fewer errors, whereas the
least accurate algorithm was ZeroR. However, the multilayer perceptron algorithm required longer processing
time because it calculated the weights of each node. ZeroR was found to be useful for determining baseline
performance for the other classification methods.
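A sketch of such a head-to-head comparison through the WEKA Java API is given below; it assumes a local copy of the Pima dataset in ARFF format (the file name pima-diabetes.arff is a placeholder) and uses the default settings of each learner, so the resulting figures will not necessarily match those reported by Hina, Shaikh, and Sattar (2017).

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PimaClassifierComparison {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("pima-diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = {
            new ZeroR(),                // majority-class baseline
            new NaiveBayes(),
            new Logistic(),
            new J48(),
            new RandomForest(),
            new MultilayerPerceptron() // slowest: learns a weight for every connection
        };

        // Evaluate every model on the same 10-fold cross-validation splits.
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-22s accuracy = %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}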
Diwani and Sam used the NB and J48 algorithms to predict diabetic patients from the Pima Indians dataset
(Diwani & Sam, 2014). NB and J48 correctly classified 76.30% and 73.83% of the dataset, respectively.
Parashar et al. (2014) used SVM and a feed forward neural network (FFNN) for the classification of patients
in the Pima dataset. A supervised learning approach was applied to identify basis vectors, and LDA was used
as a preprocessing step for attribute selection. With an accuracy of 75.65%, the performance of LDA-SVM
was found to be better than that of SVM and FFNN.
Asgarnezhad et al. (2017) used the Pima Indians diabetes mellitus dataset with 768 patients and 8 attributes.
They applied an SVM classifier to predict diabetes mellitus. Among the combinations of missing-value
handling techniques and attribute subset selection methods, replacing missing values with the mean, combined
with evolutionary (genetic algorithm) attribute selection as a data reduction technique, provided the best
results in terms of accuracy and precision of the predictive model for this dataset.
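A minimal sketch of this kind of preprocessing pipeline in the WEKA Java API is given below. The file name is a placeholder, and because the genetic-algorithm search used by Asgarnezhad et al. (2017) is not part of the core WEKA distribution, CfsSubsetEval with a BestFirst search is substituted here purely for illustration; the SVM is WEKA's SMO implementation.

import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PimaPreprocessingPipeline {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("pima-diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Step 1: replace missing values with the attribute mean (mode for nominal attributes).
        ReplaceMissingValues fillMissing = new ReplaceMissingValues();
        fillMissing.setInputFormat(data);
        Instances filled = Filter.useFilter(data, fillMissing);

        // Step 2: attribute subset selection (stand-in for the evolutionary search of the study).
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new CfsSubsetEval());
        select.setSearch(new BestFirst());
        select.setInputFormat(filled);
        Instances reduced = Filter.useFilter(filled, select);

        // Step 3: evaluate the SVM (SMO) on the reduced data with 10-fold cross-validation.
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new SMO(), reduced, 10, new Random(1));
        System.out.printf("Attributes kept: %d, accuracy = %.2f%%%n",
                reduced.numAttributes() - 1, eval.pctCorrect());
    }
}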

CONCLUSION

Bioinformatics involves the understanding and organization of the large-scale information arising from
the biomedical sciences. With vast amounts of data being generated, drawing conclusions from these
data requires specialized techniques. Data mining provides automated and sophisticated solutions for
analyzing these biological data. It is a process of extracting patterns from large datasets, identifying novel,
potentially useful and valid information in the data. Data mining involves the use of machine learning,
pattern recognition and artificial intelligence. It includes tasks such as deviation detection, which identifies
irregular data records; dependency modeling, also known as market basket analysis, which looks for
associations between variables; clustering; classification; regression; and summarization.


Classification is one of the most commonly used techniques in data mining. It uses examples of pre-
classified data to develop a model with which to classify a given dataset. The task of the classification
technique is to predict the target class precisely for each case in the data. Several classification techniques,
such as decision trees, Bayesian networks and k-nearest neighbor, together with other machine learning
methods, are used in analyzing medical data.
Data mining tools are used to carry out the different data mining tasks. These tools provide algorithms
for data analysis and let users customize them according to their requirements. Among the several
types of software that employ classification techniques, such as RapidMiner, KNIME, Tanagra,
Orange and WEKA, the highest performance improvement in accuracy is found with the WEKA software.
It also has the ability to handle multiclass datasets, which other data mining tools lack.
Data mining tools have been extensively used in the fields of genomics, proteomics, metabolomics,
metabonomics, metabolite profiling, gene expression and microbiomics. Data mining offers novel tools
for medical applications; it can help in identifying pathogens and in analyzing drug resistance patterns.
This has followed the widespread use of genomic approaches in the diagnosis and management of viral,
bacterial, and fungal infections. The pathogenicity and virulence of any newly detected human pathogen can
be analyzed using freely available bioinformatics tools. These tools even facilitate rapid and accurate
detection and understanding of resistance factors and mechanisms.
WEKA is a complete data-mining suite, which provides various preprocessing modules and data mining
techniques. Many bioinformatics applications apply the WEKA data-mining suite for carrying out
data analysis involving proteomics and gene expression experiments. Data-mining tools have been used
in probe selection for gene-expression arrays. In addition, WEKA data mining has also been used in classifying
cancer diagnosis data to discover significant rules. With provision for new machine learning techniques
and the availability of many algorithms, WEKA has been used for the data mining of non-communicable
diseases such as diabetes. It provides excellent data analysis options for analyzing large volumes of data
from many clinical studies. Many WEKA classification algorithms have been successfully implemented to
analyze diabetes data.
Recognizing the importance of big data analysis in the biomedical sciences, the National Institutes of
Health has started the "Big Data to Knowledge" (BD2K) program. This will lead to new knowledge
development and to the training of biomedical data scientists. With the new knowledge derived from big
data analysis, more reliable and accurate medical information can be expected in the future.

ACKNOWLEDGMENT

The authors would like to thank Strategic Center for Diabetes Research, College of Medicine at King
Saud University for facilitating the conduct of this work.

REFERENCES

Abouelhoda, M., Issa, S., & Ghanem, M. (2012). Tavaxy: integrating taverna and galaxy workflows with
cloud computing support. BMC Bioinformatics, 13(1).


Asgarnezhad, R., Shekofteh, M., & Boroujeni, F. Z. (2017). Improving Diagnosis of Diabetes Mellitus
Using Combination of Preprocessing Techniques. Journal of Theoretical and Applied Information
Technology, 95(13), 15.
Bairoch, A., & Boeckmann, B. (1991). The SWISS-PROT protein sequence data bank. Nucleic Acids
Research, 19(suppl), 2247–2249. doi:10.1093/nar/19.suppl.2247 PMID:2041811
Bazzan, A. L., Engel, P. M., Schroeder, L. F., & da Silva, S. C. (2002). Automated annotation of keywords
for proteins related to mycoplasmataceae using machine learning techniques. Bioinformatics (Oxford,
England), 18(Suppl 2), 35S–43S. doi:10.1093/bioinformatics/18.suppl_2.S35 PMID:12385981
Bekaert, M., Bidou, L., Denise, A., Duchateau–Nguyen, G., Forest, J. P., Froidevaux, C., ... Termier, M.
(2003). Towards a computational model for –1 eukaryotic frame shifting sites. Bioinformatics (Oxford,
England), 19(3), 327–335. doi:10.1093/bioinformatics/btf868 PMID:12584117
Benson, D., Lipman, D. J., & Ostell, J. (1993). GenBank. Nucleic Acids Research, 21(13), 2963–2965.
doi:10.1093/nar/21.13.2963 PMID:8332518
Bhaduri, A., Qu, K., Lee, C. S., Ungewickell, A., & Khavari, P. A. (2012). Rapid identification of non–
human sequences in high–throughput sequencing datasets. Bioinformatics (Oxford, England), 28(8),
1174–1175. doi:10.1093/bioinformatics/bts100 PMID:22377895
Bouckaert, R. R., Frank, E., Hall, M., Kirkby, R., Reutemann, P., Seewald, A., & Scuse, D. (2013). WEKA
Manual for Version 3–7–8. Academic Press.
Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., ... Knight,
R. (2010). QIIME allows analysis of high-throughputcommunity sequencing data. Nature Methods, 7(5),
335–336. doi:10.1038/nmeth.f.303 PMID:20383131
Carrico, J. A., Sabat, A. J., Friedrich, A. W., Ramirez, M., & on behalf of the ESCMID Study Group, C.
(2013). Bioinformatics in bacterial molecular epidemiology and public health: Databases, tools and the
next-generation sequencing revolution. Eurosurveillance, 18(4), 32–40. doi:10.2807/ese.18.04.20382-
en PMID:23369390
Chen, Y., Yao, H., Thompson, E. J., Tannir, N. M., Weinstein, J. N., & Su, X. (2013). VirusSeq: Software
to identify viruses and their integration sites using next–generation sequencing of human cancer tissue.
Bioinformatics (Oxford, England), 29(2), 266–267. doi:10.1093/bioinformatics/bts665 PMID:23162058
Chevreux, B. (2015). MIRA Assembler. C1997–2014. Retrieved from: www.chevreux.org/projects_mira.
html
Cole, J. R., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R. J., ... Tiedje, J. M. (2009). The Ribo-
somal Database Project: Improved alignments and new tools for rRNA analysis. Nucleic Acids Research,
37(Database), D141–D145. doi:10.1093/nar/gkn879 PMID:19004872
Cosentino, S., Larsen, V. M., Aarestrup, M. F., & Lund, O. (2013). Pathogen Finder – Distinguishing
Friend from Foe Using Bacterial Whole Genome Sequence Data. PLoS One, 8(10), e77302. doi:10.1371/
journal.pone.0077302 PMID:24204795


Deneke, C., Rentzsch, R., & Renard, B. Y. (2017). PaPrBaG: A machine learning approach for the
detection of novel pathogens from NGS data. Scientific Reports, 7(1), 39194. doi:10.1038/srep39194
PMID:28051068
Diwani, D. A., & Sam, A. (2014). Diabetes Forecasting Using Supervised Learning Techniques. ACSIJ
Advances in Computer Science: an International Journal, 3, 10–18.
Fasold, M., Langenberger, D., Binder, H., Stadler, P. F., & Hoffmann, S. (2011). DARIO: A ncRNA
detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Research, 39(suppl
2), W112–W117. doi:10.1093/nar/gkr357 PMID:21622957
Fayyad, U., & Stolorz, P. (1997). Data mining and KDD: Promise and challenges. Future Generation
Computer Systems, 13(2), 99–115. doi:10.1016/S0167-739X(97)00015-0
Frank, E., Hall, M., Trigg, L., Holmes, G., & Witten, I. H. (2004). Data mining in bioinformatics us-
ing Weka. Bioinformatics (Oxford, England), 20(15), 2479–2481. doi:10.1093/bioinformatics/bth261
PMID:15073010
Frénay, H. M., Bunschoten, A. E., Schouls, L. M., van Leeuwen, W. J., Vandenbroucke–Grauls, C. M.,
Verhoef, J., & Mooi, F. R. (1996). Molecular typing of methicillin–resistant Staphylococcus aureus on
the basis of protein A gene polymorphism. European Journal of Clinical Microbiology & Infectious
Diseases, 15(1), 60–64. doi:10.1007/BF01586186 PMID:8641305
Guerra, L., McGarry, M., Robles, V., Bielza, C., Larrañaga, P., & Yuste, R. (2011). Comparison be-
tween supervised and unsupervised classifications of neuronal cell types: A case study. Developmental
Neurobiology, 71(1), 71–82. doi:10.1002/dneu.20809 PMID:21154911
Hina, S., Shaikh, A., & Sattar, S. A. (2017). Analyzing Diabetes Datasets using Data Mining. Journal
of Basic and Applied Sciences, 13, 466–471. doi:10.6000/1927-5129.2017.13.77
Hummelen, R., Fernandes, A. D., Macklaim, J. M., Dickson, R. J., Changalucha, J., Gloor, G. B., & Reid,
G. (2010). Deep sequencing of the vaginal microbiota of women with HIV. PLoS One, 5(8), e12078.
doi:10.1371/journal.pone.0012078 PMID:20711427
Ismail, S. A., Matin, A. F. A., & Mantoro, T. (2012). A Comparison Study of Classifier Algorithms
for Mobile–phone’s Accelerometer Based Activity Recognition. Procedia Engineering, 41, 224–229.
doi:10.1016/j.proeng.2012.07.166
Kanz, C., Aldebert, P., Althorpe, N., Baker, W., Baldwin, A., Bates, K., ... Apweiler, R. (2005). The
EMBL nucleotide sequence database. Nucleic Acids Research, 33(1), D29–D33. PMID:15608199
Raviya, K. H., & Gajjar, B. (2013). Performance Evaluation of different data mining classifica-
tion algorithm using WEKA. Indian Journal of Research, 2(1).
Klindworth, A., Pruesse, E., Schweer, T., Peplies, J., Quast, C., Horn, M., & Glockner, F. O. (2013).
Evaluation of general 16S ribosomal RNA gene PCRprimers for classical and next-generation sequenc-
ing–based diversity studies. Nucleic Acids Research, 41(1), e1. doi:10.1093/nar/gks808 PMID:22933715


Kretschmann, E., Fleischmann, W., & Apweiler, R. (2001). Automatic rule generation for protein annota-
tion with the C4. 5 data mining algorithm applied on SWISS–PROT. Bioinformatics (Oxford, England),
17(10), 920–926. doi:10.1093/bioinformatics/17.10.920 PMID:11673236
Li, J., Liu, H., Ng, S. K., & Wong, L. (2003). Discovery of significant rules for classifying cancer di-
agnosis data. Bioinformatics (Oxford, England), 19(suppl 2), ii93–ii102. doi:10.1093/bioinformatics/
btg1066 PMID:14534178
Li, J., & Wong, L. (2002). Identifying good diagnostic gene groups from gene expression profiles us-
ing the concept of emerging patterns. Bioinformatics (Oxford, England), 18(5), 725–734. doi:10.1093/
bioinformatics/18.5.725 PMID:12050069
Liu, B., & Pop, M. (2009). ARDB––Antibiotic Resistance Genes Database. Nucleic Acids Research,
37(Database), D443–D447. doi:10.1093/nar/gkn656 PMID:18832362
Luscombe, N. M., Greenbaum, D., & Gerstein, M. (2001). What is bioinformatics? An introduction and
overview. Yearbook of Medical Informatics, 1(01), 83–99. doi:10.1055/s-0038-1638103 PMID:27701604
Maiden, M., Bygraves, J., Feil, E. J., Morelli, G., Russell, J., Urwin, R., ... Spratt, B. G. (1998). Multilocus
sequence typing: A portable approach tothe identification of clones within populations of pathogenicmi-
croorganisms. Proceedings of the National Academy of Sciences of the United States of America, 95(6),
3140–3145. doi:10.1073/pnas.95.6.3140 PMID:9501229
Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines:
Consistent probability estimation using nonparametric learning machines. Methods of Information in
Medicine, 51(1), 74–81. doi:10.3414/ME00-01-0052 PMID:21915433
McArthur, A. G., Waglechner, N., Nizam, F., Yan, A., Azad, M. A., Baylay, A. J., ... Wright, G. D.
(2013). The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy,
57(7), 3348–3357. doi:10.1128/AAC.00419-13 PMID:23650175
McKnew, D. L., Lynn, F., Zenilman, J. M., & Bash, M. C. (2003). Por in variation among clinical iso-
lates of Neisseria gonorrhoeae over a 10–yearperiod, as determined by Por variable region typing. The
Journal of Infectious Diseases, 187(8), 1213–1222. doi:10.1086/374563 PMID:12696000
Ozekes, A., & Camurcu, Y. (2002). Classification and Prediction in A Data Mining Application. Journal
of Marmara for Pure and Applied Sciences, 18, 159–174.
Pal, C., Bengtsson–Palme, J., Rensing, C., Kristiansson, E., & Larsson, D. G. (2014). BacMet: Anti-
bacterial Biocide and Metal Resistance Genes Database. Nucleic Acids Research, 42(D1), D737–D743.
doi:10.1093/nar/gkt1252 PMID:24304895
Parashar, A., Burse, K., & Rawat, K. (2014). A Comparative Approach for Pima Indians Diabetes Diagno-
sis using LDA–Support Vector Machine and Feed Forward Neural Network. International Journal of
Advanced Research in Computer Science and Software Engineering, 4, 378–383.
Patthy, L. (1999). Genome evolution and the evolution of exon–shuffling–a review. Gene, 238(1),
103–114. doi:10.1016/S0378-1119(99)00228-0 PMID:10570989


Pearson, W. R., & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proceedings
of the National Academy of Sciences of the United States of America, 85(8), 2444–2448. doi:10.1073/
pnas.85.8.2444 PMID:3162770
Saeb, A. T., Abouelhoda, M., Selvaraju, M., Althawadi, S. I., Mutabagani, M., Adil, M., & Tayeb, H.
T. (2017). The Use of Next–Generation Sequencing in the Identification of a Fastidious Pathogen: A
Lesson from a Clinical Setup. Evolutionary Bioinformatics Online, 13. doi:10.1177/1176934316686072
PMID:28469373
Saeb, A. T. (2018). Current Bioinformatics resources in combating infectious diseases. Bioinformation, 14(1),
31–35.
Salipante, S. J., SenGupta, D. J., Hoogestraat, D. R., Cummings, L. A., Bryant, B. H., Natividad, C.,
... Hoffman, N. G. (2013). Molecular Diagnosis of Actinomadura madurae Infection by 16S rRNA
Deep Sequencing. Journal of Clinical Microbiology, 51(12), 4262–4265. doi:10.1128/JCM.02227-13
PMID:24108607
Saravananathan, K., & Velmurugan, T. (2016). Analyzing Diabetic Data using Classification Algorithms
in Data Mining. Indian Journal of Science and Technology, 9(43). Retrieved from http://www.indjst.org/
index.php/indjst/article/view/93874
Schloss, P. D., Westcott, S. L., Ryabin, T., Hall, J. R., Hartmann, M., Hollister, E. B., ... Weber, C. F.
(2009). Introducing mothur: Open source, platform–independent, community–supported software for
describing and comparing microbial communities. Applied and Environmental Microbiology, 75(23),
7537–7541. doi:10.1128/AEM.01541-09 PMID:19801464
Schouls, L. M., Spalburg, E. C., van Luit, M., Huijsdens, X. W., Pluister, G. N., van Santen–Verheuvel,
M. G., ... de Neeling, A. J. (2009). Multiple–locus variable number tandem repeat analysis of Staphy-
lococcus aureus: Comparison with pulsed–field gel electrophoresis and spa–typing. PLoS One, 4(4),
e5082. doi:10.1371/journal.pone.0005082 PMID:19343175
Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., & Huttenhower, C. (2012). Metage-
nomic microbial community profiling using unique clade–specific marker genes. Nature Methods, 9(8),
811–814. doi:10.1038/nmeth.2066 PMID:22688413
Skewes–Cox, P., Sharpton, T. J., Pollard, K. S., & DeRisi, J. L. (2014). Profile Hidden Markov Models for
the Detection of Viruses within Metagenomic Sequence Data. PLoS ONE, 9(8), e105067. doi:10.1371/
journal.pone.0105067
Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., & Brazma, A.
(2002). Design and implementation of microarray gene expression markup language (MAGE–ML).
Genome Biology, 3(9), research0046.
Stiglic, G., Bajgot, M., & Kokol, P. (2010). Gene set enrichment meta–learning analysis: Next–genera-
tion sequencing versus microarrays. BMC Bioinformatics, 11(1), 176. doi:10.1186/1471-2105-11-176
PMID:20377890


Struelens, M. (1996). Consensus guidelines for appropriate use and evaluation of microbial epidemio-
logic typing systems. Clinical Microbiology and Infection, 2(1), 2–11. doi:10.1111/j.1469-0691.1996.
tb00193.x PMID:11866804
Taylor, J., King, R. D., Altmann, T., & Fiehn, O. (2002). Application of metabolomics to plant genotype
discrimination using statistics and machine learning. Bioinformatics (Oxford, England), 18(Suppl 2),
S241–S248. doi:10.1093/bioinformatics/18.suppl_2.S241 PMID:12386008
Tobler, J. B., Molla, M. N., Nuwaysir, E. F., Green, R. D., & Shavlik, J. W. (2002). Evaluating machine
learning approaches for aiding probe selection for gene–expression arrays. Bioinformatics (Oxford,
England), 18(suppl 1), S164–S171. doi:10.1093/bioinformatics/18.suppl_1.S164 PMID:12169544
Van Belkum, A., Struelens, M., de Visser, A., Verbrugh, H., & Tibayrenc, M. (2001). Role of genomic
typing in taxonomy, evolutionarygenetics, and microbial epidemiology. Clinical Microbiology Reviews,
14(3), 547–560. doi:10.1128/CMR.14.3.547-560.2001 PMID:11432813
Wattam, A. R., Abraham, D., Dalay, O., Disz, T. L., Driscoll, T., Gabbard, J. L., ... Sobral, B. W. (2014).
PATRIC, the Bacterial Bioinformatics Database and Analysis Resource. Nucleic Acids Research, 42(D1),
D581–D591. doi:10.1093/nar/gkt1099 PMID:24225323
Weinstock, G. M. (2012). Genomic approaches to studying the human microbiota. Nature, 489(7415),
250–256. doi:10.1038/nature11553 PMID:22972298
Weka 3: Data Mining Software in Java. (n.d.). Retrieved June 24, 2018, from https://www.cs.waikato.
ac.nz/~ml/weka/
Yasodha, P., & Kannan, M. (2011). Analysis of a Population of Diabetic Patients Databases in Weka
Tool. International Journal of Scientific & Engineering Research, 2(5).
Yoo, I., Alafaireet, P., Marinov, M., Pena–Hernandez, K., Gopidi, R., Chang, J. F., & Hua, L. (2012).
Data mining in healthcare and biomedicine: A survey of the literature. Journal of Medical Systems,
36(4), 2431–2448. doi:10.1007/s10916-011-9710-5 PMID:21537851
Zankari, E., Hasman, H., Cosentino, S., Vestergaard, M., Rasmussen, S., Lund, O., & Larsen, M. V.
(2012). Identification of Acquired Antimicrobial Resistance Genes. The Journal of Antimicrobial Che-
motherapy, 67(11), 2640–2644. doi:10.1093/jac/dks261 PMID:22782487
Zdobnov, E. M., & Apweiler, R. (2001). InterProScan–an integration platform for the signature–recogni-
tion methods in InterPro. Bioinformatics (Oxford, England), 17(9), 847–848. doi:10.1093/bioinformat-
ics/17.9.847 PMID:11590104
Zupan, B., & Demsar, J. (2008). Open–source tools for data mining. Clinics in Laboratory Medicine,
28(1), 37–54. doi:10.1016/j.cll.2007.10.002 PMID:18194717


KEY TERMS AND DEFINITIONS

Amplification-Derived Chimeric Sequences: Chimeras are sequences formed from two or more
biological sequences joined together. Amplicons with chimeric sequences can form during PCR.
BD2K: Big Data to Knowledge.
CRISPR: Clustered regularly inter-spaced short palindromic repeats.
EWGLI: European Working Group for Legionella Infections.
High-Throughput Sequencing: Next-generation sequencing (NGS), also known as high-throughput
sequencing, is the catch-all term used to describe a number of different modern sequencing technologies
including: Illumina (Solexa) sequencing; Roche 454 sequencing; Ion torrent: Proton/PGM sequencing;
SOLiD sequencing. These recent technologies allow us to sequence DNA and RNA much more quickly
and cheaply than the previously used Sanger sequencing, and as such have revolutionized the study of
genomics and molecular biology.
ML: Machine learning.
MLST: Multilocus sequence typing.
MLVA: Multilocus variable-number of tandem repeats analysis.
MRSA: Methicillin-resistant Staphylococcus aureus.
Multilocus Sequence Typing: A technique in molecular biology for the typing of multiple loci. The
procedure characterizes isolates of microbial species using the DNA sequences of internal fragments of
multiple housekeeping genes.
Multiple Loci VNTR Analysis (MLVA): A method employed for the genetic analysis of particular
microorganisms, such as pathogenic bacteria, that takes advantage of the polymorphism of tandemly
repeated DNA sequences. A “VNTR” is a “variable-number tandem repeat.”
PaPrBaG: Pathogenicity prediction for bacterial genomes.
PATRIC: Bacterial bioinformatics resource center.
PGAAP: Prokaryotic genomes automatic annotation pipeline.
RINS: Rapid identification of non-human sequences.
Single Locus Sequence Typing: A technique in molecular biology for the typing of bacterial strains
based on a single locus, such as the blaOXA-51-like gene.
SLST: Single locus sequence typing.
WEKA: An open source Java-based platform containing various machine learning algorithms.


Chapter 6
Big Data and People
Management:
The Prospect of HR Managers

Daria Sarti
University of Florence, Italy

Teresina Torre
University of Genoa, Italy

ABSTRACT
This chapter investigates the role of big data (BD) in human resource management (HRM). The interest
is related to the strategic relevance of human resources (HR) and to the increasing importance of BD in
every dimension of a company’s life. The analysis focuses on the perception of the HR managers on the
impact that BD and BD analytics may have on the HRM and the possible problems the HR departments
may encounter when implementing human resources analytics (HRA). The authors' opinion is that attention
to the perceptions shown by the HR managers is the most important element conditioning their attitude
towards BD, and it is the first feature influencing the possibility that BD can become a positive challenge.
After the presentation of the topic and of the state of the art, the study is introduced. The main findings
are discussed and commented on to offer suggestions for HR managers and to underline some key points
for future research in this field.

INTRODUCTION

Working environments - and organizations in general - nowadays are continuously experiencing massive
and rapid increases in what has come to be identified as 'big data' (BD). Defined by Sivarajah
and colleagues (2017: 263) as an 'overwhelming amount of complex and heterogeneous data pouring
from any-where, any-time and any-device', BD is recognized as one of the most intriguing topics among
scholars and practitioners.

DOI: 10.4018/978-1-5225-7077-6.ch006


Even if of uncertain origins, as Diebold (2018) remarks, the phenomenon has been investigated in
its different dimensions, often in partnership with companies (Angrave et al., 2016), and taken into
consideration from many perspectives. In particular, a significant and increasing interest has been shown
since 2011, as Gandomi and Haider (2017) put in evidence.
The current debate on BD benefits from interdisciplinary contributions coming from diverse research
domains - such as engineering, information and communication technology, economics and management -
and seeks to understand the newness and originality that BD is introducing in every field. A vast
contribution to the overall discussion is also given by the large number of reports and studies carried
out by consultancy firms and by firms operating in the field of big data analytics (BDA) (namely
Gartner, n.a.; McKinsey & Company, 2016).
A lively discussion is ongoing regarding the correct definition of the concept and, subsequently, its
delimitation (Sheng et al., 2017). Indeed, the lack of consensus in the scientific community on the
concept of BD (De Mauro et al., 2016) might itself be a clue to the limited development of the
discipline (Ronda-Pupo and Guerras-Martin, 2012) and, of course, it is not irrelevant that the
debate on BD is relatively recent.
A number of researchers describe BD using a mere quantitative approach, thus considering the vol-
ume of data, which is an inherent property. For example, Manyika and colleagues (2011: 1) suggest that
BD ‘… will range from a few dozen terabytes to multiple perabytes’, even if they do not ignore that the
size can increase, as technology advances. Other scholars also include further intrinsic features, which
contribute to qualify the topic (Angrave et al., 2016; Gandomi and Haider, 2017). In this sense, the most
diffused and cited definition has been given by Gartner Company, which describes BD as ‘high volume,
high velocity, and/or high variety of information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process optimization’ (Gartner IT Glossary, n.a.: 1).
Coherently, Laney’s vision introduces the 3Vs model of BD, focusing on three dimensions represent-
ing its key elements: Volume (that is the magnitude of data available to the organization), Velocity (that
means the speed of data creation, streaming, and aggregation), and Variety (which refers to the richness
of data representation and to their structural heterogeneity) (Laney, 2001).
Supplementary dimensions have been mentioned referring to BD qualifying characteristics. They
include: Veracity (suggested by IBM, which is related to the data quality of captured data, which can
vary greatly, affecting the possible analysis), and Value (proposed by Oracle, it means the usefulness of
data in making decisions to obtain more value) (Erevelles et al., 2016; Power, 2014). Also, Variability
(which measures the variation of the dataset and can hinder the processes needed to handle and manage it) seems to
contribute to delineating the concept. According to Kaisler et al. (2013), Complexity is also an important
characteristic. It measures ‘the degree of interconnectedness (possibly very large) and interdependence
in BD structures, such that a small change (or combination of small changes) in one or a few elements
can yield very large changes, or a small change that ripple across or cascade through the system and
substantially affect its behavior, or no change at all’ (Kaisler et al., p. 997).
A further step forward has been proposed, abandoning the logic of BD as an unstructured, even if very
copious, amount of data and considering it as a 'smart' entity. That means paying attention to the insights
that the volume of data can reasonably offer and the extent to which it is 'able to provide the material to
conduct fine-grained analysis that successfully explains and predicts behaviors and outcomes' (George et
al., 2014: 2). A dynamic perspective on the predictive implications for management is thus opened
(Angrave et al., 2016), and the role of BDA is introduced to gain valid and valuable insights, useful to
predict future outcomes, as De Mauro and colleagues (2016) suggest when they specify that technology
and analytical methods
are necessary to transform BD into value. Indeed, they underline that ‘Big Data is the Information asset
characterized by such a High Volume, Velocity and Variety to require specific Technology and Analyti-
cal Methods for its transformation into Value.’ (De Mauro et al., 2016: 122)
This point of view immediately introduces some important questions. Given that 'Data is being
collected and stored at unprecedented rates', Bakshi (2012) observes that 'The challenge is not only to
store and manage the vast volume of data, but also to analyze and extract meaningful value from them'
(Bakshi, 2012: 1). So greater attention has to be paid to the use of so much data and, moreover, to the
comprehension of its significance and importance. BD, when appropriately managed, processed and
analyzed, has the potential to generate knowledge (Jukic et al., 2015): the point is to identify clearly the
goal the organization is interested in reaching, so that it guides the analysis.
Starting from the evidence that data are increasing also in the field of HRM and that HRM is inevitably
involved in the deep change of approach that BD requires, the aim of this chapter is to contribute to
investigating: (1) what the effective and potential role of BD is for HR practices and processes; (2) what
novelty BD represents and what its relevance is; (3) which changes are introduced (or have to be introduced)
in the traditional way of managing the typical tasks of people management and in the usual approach used
to run these same tasks. Specifically, the focus is on the analysis of the perception of the HR managers
of what BD is, of the impact BD actually has, may have, or could have on the management of HR, and of
the processes in which BD is present and in which terms, and to put in evidence the possible problems HR
Departments may face. It is the authors' opinion that a specific attention to the perceptions shown by
the managers involved is the first and most important element conditioning their attitude towards the
phenomenon and, by consequence, it is the first feature influencing the possibility that BD can be
considered as a positive challenge.
The chapter is organized in the following manner. The first two sections introduce the benefits and
drawbacks of BD and the relationship between BD and HRM, and the next describes how it is possible
to implement a proactive orientation towards BD. Then the research is presented and the main results
are analyzed and commented on. At the end of the chapter, implications for theory and for management
practices are discussed with the specific purpose of proposing useful suggestions for organizations and
HR managers to face (or not to face!) the new BD challenge taken, in any case, as an informed decision.

BACKGROUND

Benefits and Drawbacks of Big Data in the Organization

A growing number of organizations today already operate in data-intensive (or data-centric, as defined by
De Mauro et al., 2016) businesses and increasingly rely on an extensive use of information management
systems (e.g., Enterprise Resource Planning and Customer Relationship Management) to conduct
their business. It is recognized that BD is an asset, a resource of the organization, which—if properly
managed, processed, and analyzed—might favor the company's success (Jukic et al., 2015).
According to the main contributions to this stream of research, at least four main arguments have
been introduced in favor of the use of BD. Firstly, BD may favor better-informed decision-making
processes (Chen & Zhang, 2014). Secondly, through the creation of new knowledge from BD, it may
favor innovation (Jukic et al., 2015) and lead to value creation (Brown et al., 2011), thereby realizing
the value of information. At a third level, the implementation of such a system architecture—which in cur-
rent literature is exemplified, for example, with case studies from retail (Evans & Kitchin, 2018), banks
(Dery et al., 2006), and public services (Carter et al., 2011)—enables greater internal coordination and
control. At the same time, from an inter-organizational perspective, BD also facilitates relationships
and integration with other actors of the whole supply chain (i.e., suppliers and customers). Fourth, and
finally, it is proposed that BD systems facilitate better control of complex systems, which can be
automated through the use of BD and, in turn, result in increased efficiency and reliability of the
workflow (Evans & Kitchin, 2018).
Next to the list of positive implications of BD, the main drawbacks are considered by authors, too.
Indeed it is suggested that BD may have a profound impact on labor practices and that this may occur
at three different levels:
(1) change in the work content, through tasks mediated by digital devices; (2) replacement of some
work tasks with automation; and, in the end, (3) changes in the way in which work is managed and
regulated, the so-called ‘governmentality’.
The improvement in technology through the introduction of digital devices has had a massive impact
on the nature of work itself. Also, the rise of digital technology has led to growth in the number
of alternative forms of work and has veiled the dominant full-time job logic. At the same time, the current
phenomenon is still unexplored (Lengnick-Hall et al., 2018), while new practices are required in order
to manage this 'new workforce', whose psychological contract is based on completely different purposes
compared to the traditional one.
According to Evans and Kitchin (2018), the third and last aspect, governmentality, is a key issue,
particularly relevant from the perspective discussed in this chapter, because it implies a revolution in the
way employees' behavior is regulated. In the case of a BD method of surveillance, the 'manager' is
a constant presence performed by software (and no longer by the boss). This new way in which work
is managed, therefore, exposes workers to the risk of the destruction of any care for individuals (Stiegler,
2010). As a matter of fact, the authors suggest a parallel between the working system produced by the
introduction of BD—the so-called 'algocracy', a situation in which algorithm-based systems structure and
constrain the opportunities for human participation in, and comprehension of, decision-making (Danaher,
2016)—and the classical system of scientific management. Moreover, they underline that two extra
intervening variables further affect current working conditions. These are speed, which asks for a different
reaction capability, and direct intrusion, which modifies the discretionary spaces in which workers were
used to moving.
Multiple risks arise in this working context. The separation of labor from the reason for doing the task,
the degradation of the work content, the precarious employment, the lack of division between public and
private spheres are all possible consequences of this ‘technologically-infused surveillance workplace’,
which are the outcome of the diffusion of BD (Evans & Kitchin, 2018).
It is suggested that the effect of the presence of BD systems in the workplace that monitor and as-
sess labor performance continually is to create a new kind of power that sits in contrast with previous
modes of labor governmentality. BD systems greatly intensify the extent and frequency of monitoring of
labor and shift the managerial logic from surveillance and discipline to capture and control (Deleuze,
1992; Agre, 1994) through the use of digital systems that are distributed, ubiquitous and increasingly
automated, automatic, and autonomous in nature (Dodge & Kitchin, 2007).


Big Data and Human Resources Management

In a very popular article published in the Harvard Business Review, Cappelli (2017) highlights the
important relationship between HRM and BD. In his article the author underlines that every function
of business ‘feels compelled to outline how they are going to use it [BD] to improve their operations.
That is also true for HR Departments, which is where most of a company’s money is spent and where
the real value lies’ (p. 1).
In general, nowadays, HR Managers —and HR professionals in general—show a great deal of interest
in the new BD trend. Sure enough, according to a survey carried out in 2014 by Towers Watson, one in
three organizations was planning to increase HR technology budgets by up to 20% or more, compared
with the indications in the previous year (Towers Watson Survey, 2014), confirming how relevant this
perspective is. The managerial press demonstrates its concern about the importance of this challenge for HR
Departments (SHRM Foundation, 2016). Among practitioners, there is a strong commitment to focus on
HRA and on how BD may contribute to individual HR activities (e.g., recruitment, training, assessments).
Indeed, it is claimed that 'most [HR practitioners] are process oriented generalists who have expertise in
personnel benefits, compensation, and labor relations' (Charan, 2014: 33). Other surveys demonstrated
a dim assessment of HR Departments' abilities to use data for HR functions (Collins & Bennet, 2015).
In addition, most such reports point to the poor clarity in defining the significance of BD and to the
lack of systematization of the studies.
Within the academic domain, despite the recent increased interest, only a relatively limited number
of works are concerned with this issue. Some scholars explore the new possibilities given by digital tech-
nology and BD to the world of work (Lengnick-Hall et al., 2018) thus highlighting the important role
that HRM may play in the development of individual and organizational knowledge (Oltra, 2005; Shah
et al., 2017). Some other authors underline the potential of BD for the HR function itself in favoring the
efficiency of the labor market at large and private enterprises alike (Larsen et al., 2015) and in supporting
its tasks and all the activities related to people management.
Because of studies claiming the lack of the so-called ‘must-have capabilities’ for HR profession-
als (Angrave et al., 2016), scholars underline the importance of developing new competencies for HR
and the need for new ‘hybrid’ professions (Oding, 2015) able to deal with and manage BD (Angrave
et al., 2016). Indeed, it is suggested that organizations should invest in training programs for providing
experts, and company workers, with ‘interdisciplinary business intelligence’ and ‘analytics education’,
thus enabling the company members to manage data properly and to incorporate them in 'new' decision
processes (De Mauro et al., 2016). According to these authors, BD may produce massive changes in
the process of decision-making, which would evolve from a static one to a dynamic one, replacing the
traditional sequential logic of connections and the rules of the process (McAfee et al., 2012), as well
as the scientific method underlying it. At the same time, the process of competencies enrichment, which
would support the new frame of decision-making activities for employees and organizations, might also
have an effect on the organizational culture (De Mauro, 2016). However, the HR departments' lack of
competencies and awareness about BD may weaken their organizational 'position', thus compromising
their chance to take on the role of key player in the organizational and strategic decision process. On this
point, Rasmussen and Ulrich (2015) remark that HR Departments are so 'blind' that they are not able
to ask the right questions of the HR data they have at their disposal, underlining how difficult it is to find
their bearings in such a different context.


In this stream, some authors express their concern on the possible future failure of HRM in the BD
challenge (Angrave et al., 2016; Rasmussen & Ulrich, 2015). In fact, it is affirmed that ‘HR analytics is
being taken over by other functions that are more mature in their analytics journey (in particular finance,
IT, and marketing)’ (Rasmussen and Ulrich, 2015). As suggested by Lengnick-Hall and colleagues (2018:
4) ‘HRM has been slow in join the bandwagon’ of BD and data analytics. In particular, some scholars
view as a threat the secondary importance held by HR Department compared to other business functions
such as finance, marketing and operations, also because of its difficulties in providing hard quantifiable
results of its activities’ (Lengnick-Hall et al., 2018: 236).
According to Angrave and colleagues (2016), one of the reasons for a probable defeat of HR Depart-
ments might be related to the HRA industry. The products and services it offers often do not provide
effective tools to enable HR Departments to capture all the available data, to create links and to propose
pertinent relationships and interdependences among data, so as to propose strategic elements for HR
future orientations. However, the widespread lack of understanding and analytical thinking of HR
professionals (Angrave et al., 2016), together with the 'old paradigm that HRM cannot be a data-driven
function' (Lengnick-Hall et al., 2018), are in themselves conditions weakening the position of
HR Departments compared to other organizational functions. All these circumstances could open up
the risk for HR Departments of 'missing the boat' of the BD challenge and of being relegated to a peripheral
position in the organizational hierarchy, overtaken by other, more prepared departments.
In this chapter, the key role of the HR Department in dealing with BD is suggested. In fact, as it is
the one that is legitimated to be in charge of all HR-related functions (Ulrich and Grochowsky, 2012),
the HR Department should make every effort to seize the chance to win this war. In so doing, it is
advisable that it pursue an 'outside-in' perspective to consider 'doing the right things', rather
than maintaining its current and traditional 'inside-out' perspective—which can be described as 'doing
things right', that is, using validated and existing knowledge in practice.
Therefore, HR Departments should start to play, in this new domain, the role of the devil's advocate,
'disrupting' certainties rather than promoting the status quo. Thus, they should acquire a provocative
and critical role within the company, to contribute proactively to the ongoing and 'compelling' process
of organizational change. Through acquiring a 'systemic view capability', they should contribute in a
strategic manner to the whole process and play the critical and unique role of glue among the various
operational departments; this, for example, might be pursued by playing the role of 'language integrator',
thereby enabling the organizational units to understand each other and to 'connect' their different
perspectives.

Implementing a Proactive Orientation and a Human Resources
'Analytical Capability' for the Big Data Challenge

The use of BD requires an internal awareness of its usefulness, which can be described as a proactive
orientation towards the change that BD, and its processing through analytics, introduces. More exactly, it
means knowledge about the nature of BD and insights into its use, together with the recognition of the
importance of the use of analytics to support a detailed study about the relevance and the implications
that a data-driven approach may have in managerial work in general and in people management, specifi-
cally (HR Review, 2013; Kim et al., 2014). Indeed, the insertion of a BD system in any organization
embodies a process of organizational change. Dery and colleagues (2006), while considering the ERP
application, suggested a five-step implementation process that could be usefully translated even in the
BD context, with which the ERP system shares some similarities. The key steps recommended are: [1]
‘establishing a strong outcomes orientation, [2] clarifying the corporate identity of the business, [3]
ensuring the organization has a clear business strategy, [4] maintaining a “constancy of purpose” over
the long term and [5] pursuing various specific “benefit realization” tactics for maintaining project
momentum, handling consultants, developing performance metrics and deciding on upgrades…’ (Dery
et al., p. 230). It is important to underline the recurrent dynamic of the process itself, which brings the
evolution of the approach inside the organization and which allows it to be followed.
As with any organizational change, this 'ideal' five-step path needs to be managed. In this
sense, the HR Department has to play the classic key role of change agent (Ulrich, 1997), operating as
a 'ferryman' in this process, whose focus is on BD system rethinking and implementation. As suggested,
in the current environment it is essential that the HR Department actively plays the role of change
agent (Brown et al., 2017), defined as being an 'agent of continuous transformation, shaping processes
and a culture that together improve an organization's capacity for change' (Ulrich, 1988: 125). As we
will discuss later in this section, the complexity arising in the process of BD introduction calls for a
redesign of this role, which acquires new meanings and needs to be enriched in terms of activities and
capabilities. Indeed, according to the sociotechnical dimension involved in BD, the breakdown of the
'control zones' implies a massive change to traditional and well-established methods, technology and
culture (Lagoze, 2014).
In the recent debate on BD, and coherently with the resource-based perspective, the concepts of BDA
and big data analytics capability (BDAC) have been introduced (Gandomi & Haider, 2015; Wamba et al., 2017).
BDA are defined as 'a holistic approach to managing, processing and analyzing the 5 V data related
dimensions' (Wamba et al., 2017: 356) which 'allow firms to analyze and manage strategy through a data
lens' (Wamba et al., 2017: 357). BDAC is considered an important organizational capability, necessary to
support the development of BDA. According to the authors, organizations need to focus on 'resources, besides
technology, which are needed to build firm-specific 'hard to imitate' BDA capability' (Gupta & George,
2016: 3). Studies have demonstrated that companies fail in BD investments for a number of reasons. As
suggested, one of the reasons is 'because most companies are still not ready or do not make decisions in
response to the intelligence extracted from data' (Gupta & George, 2016: 3). Further, the authors suggest
that such initiatives may fail because of a lack of managerial support and an unsatisfied need for
fresh talent. Also, a shift toward a decision-making culture based on data is essential in
order to develop successful BD initiatives (McAfee et al., 2012).

Figure 1. The proactive orientation in BD challenge
According to the authors, to allow the implementation of a firm's BDAC, attention to specific
resources (Gupta and George, 2016, see Figure 2) and organizational dimensions is recommended
(Davenport et al., 2012; Wamba et al., 2012).
In detail, relevant work in the field of BDA leads to three key dimensions affecting BDAC, which
are: people, technology and management. The first is described as 'people and data knowledge' and it
is useful in order to understand, develop and apply analytics models. It is also referred to as 'people'
(Davenport et al., 2012), 'data science capability' (Barton and Court, 2012), or 'employees' analytic skills'
(Kiron et al., 2012). Other authors define it as 'personnel management' (McAfee and Brynjolfsson,
2012). The second dimension is 'technology', and it is essential to explore and manage data (Barton and
Court, 2012). This dimension is also defined as: 'analytics platform' (Kiron et al., 2014), 'technology
infrastructure' (McAfee and Brynjolfsson, 2012) and 'technology capability' (Barton and Court, 2012).
Finally, the third dimension is referred to as 'management' (Davenport et al., 2012), and it
is related to decision models. It is also defined, in broad terms, as: 'management capabilities' (Barton and
Court, 2012), 'corporate decision making' (McAfee and Brynjolfsson, 2012) and 'organizational culture',
underlining the essential role of culture in managerial behavior (Kiron et al., 2014). Thus, BDAC is the
organizational capability that results in the company's ability to combine and re-combine, in an effective
and unique way, the organizational dimensions and resources listed herein.

Figure 2. Classification of BD source towards BDA


Authors also suggest that the development of BDAC requires the implementation of a dual strategy
(Galbraith, 2014). The first is described as that of 'build a digital capability to make better and faster
decisions and to enhance existing products' (p. 11); whilst the second is that of 'use data and analytics
to create insights and custom reports that can be sold to customer and became a new profit center' (p.
12). Using the lens of the organizational ambidexterity theorization (Tushman and O’Reilly, 1996) it
is possible to refer to the former strategy as an exploitation strategy and to the latter as an explorative
strategy. As noticed by scholars, organizations should be able to integrate these two strategies and allow
for their complementarity through the effective balance and integration of the dual system enabling, in
turn, the support of the whole structure (see Figure 3).
The model herein presented should be further enriched with two other dimensions able to affect the
BDAC of the company, which are the environment and the individual.
On the side of the environmental dimension, exogenous pressures may strongly influence organizations
to adopt systems enabling the management to undertake the exploitation of BD (Benders et
al., 2006). This general attitude towards BD adoption, following what has been done by others, risks
making companies follow the decisions of other companies, based on the idea that if they use BD, 'we can do
the same', without a careful evaluation of the effective firm-specific peculiarities. Therefore, scholars
recommend paying attention to the implementation of such systems in accordance with a business-led
rather than an IT-led logic (Dery et al., 2006). Indeed, technology should be seen as a means to an
end rather than a goal in itself. The risk otherwise might be to fall in love with an instrument that might
turn out to be simply a vacuous (and dangerous!) fad. In the same vein, the sociotechnical view of BD
stresses the importance of contextual factors (i.e., “community of use”), so that, it is suggested, the
perspective on data and their meaning may differ when employed by different communities or for
different purposes (Lagoze, 2014), thus leading to the need for a firm-specific or business-specific logic.

Figure 3. BDAC, BD sources and exploration-exploitation processes


As for the individual dimension of BDAC, an interesting contribution lies in the work on
contextual ambidexterity, in which the individual and subjective perspective is introduced (Gibson and
Birkinshaw, 2004). In this perspective, ambidexterity is also considered as residing in the single
person, who has to be able to manage internally the conflicts arising from the duality of needs and logics
residing in the person itself. In this context, companies have to support the development of employees'
ambidexterity by 'building a set of processes or systems that enable and encourage individuals to make
their own judgments about how to divide their time between conflicting demands for alignment and
adaptability' (p. 210).
Considering the above-mentioned speculations, it is the authors' opinion that the BD 'revolution',
which takes shape in the way things are done and would lead to a re-evaluation of the epistemological
foundations of the concept of BD, needs guidance, which could be provided by the HR Department
in its role of facilitator. The BD challenge implies a 'revolution' in the cognitive and operational frameworks
of the company. In this sense, the role of the HR Department is essential and should be enriched through a
number of sub-roles. First, the role of 'guidance' and 'protection', preventing the risk of following a
mere IT-led logic and instead enabling organizations to manage technology with awareness. Indeed, it is
suggested that perception and context play a strong role in shaping the way in which the BD challenge
is faced by companies and by HR managers. Second, the role of dialogue promoter, guaranteeing the
balance between dualities, that is, between the two (apparently?) 'opposite' parts of the organization:
people and technology, soft and hard dimensions, which are characterized by different strategic aims
and structural logics, coherently with its vocation to act as a 'facilitator'. Third, the HR Department should
be the promoter of data analytics knowledge and should be able to identify and develop individual positions
capable of understanding and managing the requests of the 'opposite tensions', since such individuals first
of all have to manage a dual identity internally.

METHODOLOGY

The Explorative Research in the Italian Context

In order to study the relationship between BD and HRM with specific attention to the Italian context, the
researchers developed an explorative project, organized into two phases. The first phase aimed to deepen
attention to this theme in a national horizon, through a literature review and a survey of initiatives
related to BD, so as to identify where it was possible to find higher awareness and a potential maturity in
the evolving field of practices connecting new technologies with HR processes. The second phase
consisted of conducting interviews with HR managers who have matured a personal and organizational
attention towards this ongoing change.
The study presented here is based on the analysis of data gathered through semi-structured interviews,
carried out in a multiple case study perspective with a small number of HR managers in Italian enterprises
and conducted with an exploratory approach, coherently with the characteristics of the topic.
In detail, the research has privileged these aspects: (1) the perception of BD usefulness and value
for HR managers and (2) their readiness for its implementation, in terms of knowledge ownership,
through BDA. Also, perceptions of (3) threats and (4) opportunities arising from BD in general and for
the HR Department will be examined, as will those relating to specific organizational characteristics.


Table 1. Interview questions for interviewees

• Presentation of the company and of professional role of the respondent


• Introduction of the point of view of the respondent on BD
• What does your enterprise think about BD?
• Where is BD in your enterprise?
• Do you have an information system for HR?
• Is it related to the other data systems?
• Which problems do you see with BDA?
• What are the benefits and threats you see in BDA?

The Selection of Participants and the Use of a Multiple Case Study

A qualitative approach through case studies has been chosen (Mayan, 2009). Qualitative research is defined as an in-depth, multifaceted investigation (Orum et al., 1991), providing knowledge of the chosen event from the participant's point of view, which is useful to gain understanding and meaning of a specific phenomenon (Merriam, 1998), above all when it is an emerging one, as the present topic is. This form of inquiry represents a rich source (Merriam, 1998) and offers a large variety of evidence (Miles and Huberman, 1994), precisely because it works inductively from individual cases, aiming 'to break a phenomenon open, unfasten… so that the description of the phenomenon, in all of its contradictions, messiness and depth is (re)presented' (Mayan, 2009: 11).
In the qualitative research field, the multiple case study method has been used (Yin, 2009). This choice is justified by three motives. First, the research questions are exploratory in nature. Second, the phenomenon under study (and the privileged perspective, namely that of HR managers) is quite unexplored within the academic debate and, therefore, proper measures are still undeveloped (Bonoma, 1985). Moreover, one of the main constraints of this study was access to interviewees willing to discuss this topic, which, in the Italian context, is still new and in an early stage of development, also for structural conditions (the first being the presence of a large number of small enterprises, which seem less involved in the great question of BD). Therefore, consistent with previous studies in the field of HRM (e.g., Antila, 2006), the multiple case study method was adopted (Eisenhardt, 1991; Yin, 2009).
Also, the choice of respondents followed previous studies on HR roles which are based on convenience sampling of practitioners - such as HR managers attending HR workshops or members of HRM associations (e.g., Conner and Ulrich, 1996; Caldwell, 2003). In this analysis it was decided to choose respondents based on these same characteristics. Hence, the participants involved by the researchers belong to three different groups: (1) HR managers of organizations active in HR topics with whom the researchers had previous research contacts, (2) HR representatives belonging to the national HR practitioners' association (which is particularly active and concerned about new HR challenges), and (3) HR managers attending an event devoted to spreading knowledge and fostering debate on HR and BD issues.
A total of 25 HR managers belonging to the three groups just mentioned were contacted for this research, but 21 declined and, among those who provided explanations in declining our invitation, the main reason cited was: 'We are not ready to discuss the topic yet'. The study is therefore based on the possibility of deepening the experiences of four Italian companies that are interested in strengthening the role of BD and of BDA in HRM and which provide good opportunities to learn about complexity and context features, underlining how the same phenomenon performs in different environments (Stake, 2013). The four companies work in different sectors and thus face this challenge from different
perspectives. Data was collected through semi-structured interviews, developed with unstructured and open-ended questions intended to elicit views and opinions, following the list reported in Table 1 introduced above (Creswell, 2014). They were administered to one HR manager working in a company well known to the researchers (Group 1), to two HR managers belonging to the Association of HR Managers (Group 2) and to one participating in a training event specifically focused on this topic (Group 3). All four interviewees declared that they are facing current organizational change related to ICT implementation and HR challenges.

Data Collection and Analysis

The interviews lasted from 45 minutes to an hour-and-a-half. The contents of the interviews were tran-
scribed and sent back to the respondents as an e-mail attachment for their revision, to increase the quality
of gathered information. The interviewees were asked to comment and complement the text wherever
necessary. The interview transcripts represented the raw data for the subsequent analysis.
Following indications by Yin (2009), the data analysis began with a careful reading of the interview transcripts, according to the predefined order of the questions. Then, the data was coded case by case, enabling the authors to find and describe a structure of the principal elements emerging from the participants and to draw the conclusions proposed as acquisitions on the topic.
The details of the respondents and their organizations were separated from the analysis and the conclusions presented in this chapter, in accordance with the ethical issues and the attention to privacy at the basis of the research project, following established indications (i.e., Stake, 2000).
Data analysis led to the identification of some elements shaping the perceptions of HR managers about BD. The researchers then defined some criteria for selecting the factors used to support the arguments and conclusions (a minimal illustrative sketch of this cross-check follows the list). In detail, the criteria can be summarized in these terms:

• Each factor should have played a significant role in describing the approach towards BD
• Each factor helps in enlightening a specific organizational dimension
• Each factor has to be identified by both the researchers.
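
As a purely illustrative aid, the cross-check implied by the third criterion can be read as an intersection of the factor codes assigned independently by the two researchers. The following minimal sketch shows how such a comparison could be organized; the case labels, factor names and the Python representation are invented for illustration and do not reproduce the actual coding scheme used in the study.

    # Hypothetical factor codes assigned independently by each researcher (labels are illustrative)
    researcher_1 = {"A": {"process_complexity", "customized_it", "tech_acceptance"},
                    "B": {"business_complexity", "central_control", "resistance"}}
    researcher_2 = {"A": {"process_complexity", "customized_it"},
                    "B": {"business_complexity", "central_control", "resistance"}}

    # Retain, for each case, only the factors identified by both researchers (third criterion above)
    retained = {case: researcher_1[case] & researcher_2[case] for case in researcher_1}
    print(retained)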

The Case Studies Examined: A Short Presentation

Company A is an Italian wine producer, with business operations located in the Tuscany area. It is a medium-sized enterprise, recently acquired by a US multinational company. The interviewee is the HR Director.
Company B is an Italian wholesaler of electronic equipment with stores located in central and northern Italy, which is planning new openings in the near future. It employs 600 workers. Recently this Company has collaborated with a German multinational company. In this case, the person with whom the researchers spoke is the HR responsible.
The next case (Company C) is that of an Italian maritime transport company, with an international business and a global approach, having more than 1,000 employees worldwide, even if its head and heart are totally Italian. The representative belongs to the HR Department and is in charge of the administrative function. He has been working in this enterprise for 15 years, developing his career in the same Department. He declares a personal interest towards this topic and appreciation for his enterprise, which has begun to try to understand whether BD is also related to people, and how to use them.
The last company (Company D) works in the retail sector; it is one of the most important competitors in the field with regard to the central Italian regions. The participant is at present in charge of human resources development. Her background, which is relevant for the topic, is characterized by an education in information technology science and a first working experience in the Information Technology (IT) Department, which has influenced her mindset.

FINDINGS AND ANALYSIS

In the following narrative, the researchers adopt an inductive approach to reading the examined case studies, so that, looking at the main organizational dimensions, the principal issues arising from the interviews, which are similar in some of the cases and different in others, are highlighted.

The Structural Issue: Organizational Complexity and Management Uncertainty

A first relevant aspect can be remarked with relation to the degree of organizational complexity. A higher degree of complexity (e.g., dispersed facilities), also in connection with the huge flows of data coming from the market (such as in retail companies), may increase the uncertainty perceived by management.
As emerged from the interviews, differently from company A, which is a wine company strongly rooted in a territory and with facilities located very close to one another, companies B and D are retailers with stores located in different geographical areas. Company C is a maritime company and also owns dispersed facilities, spread across different Italian cities, while its representative offices are in different parts of the world. This company, which is also in a stage of reflection and rethinking of its information system architecture, underlines a high degree of concern about managing data. Its HR manager indeed affirms: 'we have a lot of data, but we do not know what to do with it'. Also the HR director of Company B, which has a number of stores located in different regions of Italy, expressed right from the beginning of the interview the importance of appropriate caution in managing big data. He opens the interview by saying: 'with data I am cautious… because data needs investigations; you need to understand which ones are correct'. Both Companies B and C express major concerns about data. Company C's HR manager said:

Table 2. The four case studies

Case | Industry | Employees | Revenues | Organizational role
Case A | Wine producer | 200 | 100 million euro | HR director
Case B | Electrical equipment wholesaler | 600 | 156 million euro | HR responsible
Case C | Maritime transport | 1,000 | 300 million euro | HR administrative manager
Case D | Retail | 8,000 | 2 billion euro | HR development manager
Source: Authors’ elaboration.


There is still a bit of mistrust, it is feared that too much data will cause confusion. Data, yes, but is it
reliable? And, above all, how do we manage it? We must understand what it means to move from a logic
of KPI (past) to the dashboard perspective (towards the future).

A bit different is the level of awareness in case D: the interviewee explains that the amount of data is really large and that the company is used to managing it, simply to organize its activities (i.e., regarding the variety of products to sell), so information systems are well rooted and managers are accustomed to dealing with them. Gradually, with the growth and diversification of the types of data, coordination problems are arising, but the company is prepared.
It seems that, due to physical barriers and organizational spatial dispersion, BD and information technologies might represent the biggest opportunity as well as the biggest threat. The spatial dispersion of facilities seems to increase the uncertainty perceived and the complexity of managing data. Indeed, a greater effort by all departments, but especially by the HR Department, might be advisable.

The Data Management System and Logic of ‘Control’ of Data

The implementation of the information data system differs among the four companies. Indeed, in firms B, C and D it follows a top-down logic, while in firm A it is more like a bottom-up logic. This means that in B and C the organization of the data is centralized in just one Department, that is the Performance Management Department, which acts as the key player in the management of data, together in some cases (at least in Company B) with the IT Department. Both Companies B and C declare that the highest concentration of big data is in the Management Control Department, defined as '… the most transversal function of the company' (B declares) and 'the real transversal function of the company' (C admits). 'They have the company's pulse: data, budget, sales, purchases, HR.' Furthermore, B adds that 'also the finance area, but they do not have HR data'. In contrast, in company D the whole top management sits in the "control room" of data management and BD is managed by all relevant Departments, including the HR Department. As for Organization A, data seems to be co-built among the different areas. In the case of A, the area where big data is present is the finance department. The HR manager declares that:

in our company, surely in the Finance Department there is more data than anywhere in the company’,
and he buttresses his sentence by declaring: ‘we are part of a multinational company. We have a very
heavy reporting process: every month by the third working day we have to send the monthly budget to
the headquarter. […] Surely, vast amounts of data are also in the cellars where we store the vintages...
Obviously, we have some disciplinary system to follow, and the wine ageing of four years is dependent
on the procedural guideline, so we have different vintages (2017, 2016, 2015, 2014). We have to manage
the wine product data, frauds, and traceability of wine and grapes needs to be controlled. Also, a huge
amount of data is also in the agricultural part of our company. With a GPS system, we have mapped the
vigor of all the wine plants, so that fertilizers can be given accordingly. Also, when the tractors go to
prune plants, they will know the good and the less good grapes, and they will divide them in this manner
in the cellar so that the different kind of grapes will arrive already divided. All this information gives
indications to people about what they have to do, when, and how. So, we see that data is all connected
among them…


Therefore, in the cases analyzed it is suggested that managing complexity may lead, at least in some cases (probably in the early stage of BD adoption), to the adoption of a control-based logic of managing data, left almost solely in the hands of the management control department. On the other hand, firms considering the IT department as a supportive one may experience a higher degree of integration of expertise and the development of new solutions. In this vein, the HR manager of A declared,

In the implementation of the system, the expertise of the finance side was needed, for those who use the
data. In fact, a system was built that could be processed according to requests’. In addition, the same
occurred with the other area of the company. Indeed, he added: ‘When we implemented the system in
the vineyard, we had the agronomic competence that made a request to IT. IT had to provide the tech-
nological support.

Therefore, there is a strong integration of knowledge and expertise, and technology is a support for
the whole process which is business-led rather than IT-led.

The Design of Information and Communication Technologies Infrastructures: System Interaction and Ad-Hoc Solutions

Evidence from the cases analyzed suggests that organizational complexity and the large amount of data coming from different sources and Departments call for highly structured data systems, designed with ever new super-structures able to manage the continuously increasing complexity and enable dialogue. Indeed, companies B, C and D have implemented a Business Intelligence system that identifies, recomposes, and combines data from the different particular systems. The Manager of Company B calls it 'the octopus', and suggests 'to get a broader view, for example, we can simultaneously see sales, turnover, budget and HR data'.
The need for ICT infrastructure interaction is well recognized and is a common problem for most organizations dealing with BD nowadays. Indeed, system interaction represents a key aspect for all of the companies. In Company B, the interviewee suggests that

all different information systems—that for HR, that of Warehouse and the BI system—dialogue the one
with the other. The BI system is managed by the IT department, which manages the system tables and
formulas that allow one to extrapolate information. The system composes and breaks down data

In Case C, the HR manager declares:

We are trying to increase the integration; till now the focus has always been on the business: the point
is to understand that the business is supported by the whole company and, therefore, the ability to read
in an integrated manner should be increased. We need skills, which must be formed, and sensitiveness
to grasp the meaning

However, a possible key issue emerging from the cases analyzed here is the degree of 'customization' of such systems. In Cases B, C and D, well-known commercial ICT systems available on the market are used. A different situation is described for A, for which the HR manager declares:


We use the software of a work consulting office, which then became a software company that started to
implement software. This service company has a vision ahead, and is able to implement certain solu-
tions for us in a more customized way. An agricultural worker does not stamp in the morning... he goes
to the farm...

Therefore, the bigger issue organizations need to face in ICT infrastructure design is the integration of different systems. Indeed, information systems are usually born spontaneously and separately in different Departments, following contingent development logics. Therefore, the need arises for a role or a Department able to have a systemic overview. The other issue is that of standardized systems to which organizations need to adapt, rather than vice-versa. This may have a huge impact on organizational functioning and on the logic of decision-making, which might be strongly affected by the companies providing the analytics.

People and Data Analytics Knowledge

In all the companies, the introduction of such systems increased the commitment of the HR Department. This position, in the case of A, is summarized in these terms: 'The HR Department had to work a lot to help people manage complex machinery'. In the case of B, the HR manager declares that, after the introduction of the Business Intelligence system, 'we have all become "more" analysts and able to manage data; obviously, some training courses have been made to increase employees' capabilities'. The HR manager of company C underlined the following point: 'The data must be structured (but with regard to the needs that I see now? Or those that I can imagine will be?), and grouped to be managed and used (and here I must know what it is that interests me). It is essential to inform people about the use of data and to train them to be informed and aware users.' Finally, in company D, which has been engaged for decades in this process of data management, the importance of this issue was addressed more than 10 years ago by introducing, within the HR Department, a managerial role held by a person with an education in informatics.
In addition, the need for the development of a system thinking capability is also mentioned by the
HR manager of company C. Indeed, while discussing the HR system he affirmed:

... now, we are trying to understand what the analytics give us, but I make myself aware that we need
a different mentality and some more technical skills, too. But it seems that these two different areas of
competences are difficult to be found in the same person, so that integration is a real open problem...

Moreover, he added that:

The source is also important, the certification of data is fundamental. In addition, we have to consider
the problems related to privacy and security, which are crucial for HR data. Furthermore, there are
so many ways to use the data, the risk is to remain cluttered by the volume of data without having well
verified what you need, go a bit ‘blindly’... With regard to some HR processes, the utility to have many
data is evident (for selecting, the more the things I know about my candidate, the better), but as regards
other internal processes, I see the risk of connecting things that are not so connected... That is why we
need to make the most efforts to define what we are interested in and how we can find it.

142

Big Data and People Management

Quite in the same direction, the HR Manager of B asserted:

I prefer to have a few but certain data, which leads me to concrete actions. Having so much data available
creates the risk of getting lost and not being able to focus on actions. With the large amount of data that
we now have, we wonder which data is correct and we lose a lot of time at this stage, that is the one of
validation and certification of the data. But I lose sight of the certain data... and above all the action. My
fear is to become slaves of data... data must be immediate, and it takes people who know how to move...
we must not chase data, but the data must get us to make decisions’. Also, he adds ‘without ever losing
sight of the human value of people’. Furthermore, he made a point: ‘So, the source is also important,
the certification of data is fundamental. I think it is also important (this is my opinion)—the ethics of the
data. Sometimes, I decide to say: “I do not give the data”, because I cannot give an uncertified data,
which has to be used to decide, and decisions have implications for people, for business. So, I must say
“No”. Checking ethics is a hot topic, given the amount of data ....

Company C’s HR manager mentioned some people resistance to change in the company. He states
that ‘like other competitors, we are not so much up-to-date, and we have to often face a sort of resistance
of people in our organization against innovation. Of course, people understand that there is a problem
that needs to be solved… but it is difficult to include a new system in a wholesale project’.
The use of technology helped to take data that are more precise and also was strictly functional to
the workers and simplified for their use through a Taylor-made tools.

The Role of HR Department and Its Responsibilities

The researchers had the opportunity to understand the real feelings about this new challenge for the HR Department. All the interviewees were sincerely committed to the topic, but showed some reverence and caution on the matter. All the interviewees of the organizations involved in our analysis admit that the data they have to face nowadays is huge in amount and is increasing very quickly. Also, all of them, even if at different levels, highlighted the importance of developing a properly structured system and a systemic view within the HR Department and, hopefully, as in the case of company D, of having one or more persons in the HR Department who are system designers, enabling the HR Department to incorporate system thinking capabilities within the function.
However, some discrepancy emerged in the way HR Departments were themselves involved in the use of HR analytics and new technologies supporting employees while working. For example, in one of the companies, technological support is available for employees:

all the employees in the company have an “app” with which they reserve the cafeteria, they manage
the transfers, the holidays etc. The app also serves as a badge, through a QR-code that is read by an
iPad, which then replaces the old time clock. In this case, it was an initiative of the HR Department, an
HR project, in which IT acted in support. Obviously, IT manages the iPad installation in the structure,
etc. Now, our employees can do many things remotely.

On the contrary, other companies are quite cautious in the implementation of such systems. In addition, the awareness that technology has to be functional and easy is another key characteristic highlighted. For example, the HR manager of A said a sentence that is quite iconic of the image of an
organization placed in the middle of the Tuscany countryside and devoted to wine production for de-
cades: ‘Technology is beautiful, but you must not see it!’ Furthermore, he described how technology is
applied in such kinds of contexts:

Technology must stand aside and the operator must just sense ease. The farm worker for each day of
work inputs the hours, the activity and the farm in which he realized it. So that, practically, I can find
myself the analytical accounting already made, attributing the individual expenses to that farm... and
this data is then used by the performance management system.

In the end, HR Departments may play the role of interface for data analytics knowledge development. In fact, Company A also has a strong commitment to dialogue with external sources of innovation. For example, it has a strong relationship with a University to implement the systems on the plants.

LEARNED LESSONS AND SUGGESTIONS ON HR DEPARTMENT ROLE

The organizations considered seem to have to manage two kinds of complexity. On the one side, there is a complexity arising from the business: this is the case of companies B, C and D. In the other case, that of company A, the higher complexity the company has to face arises from the "production process" itself and from strong external regulations and requirements.
The two different kinds of complexity bring uncertainty. The first is an uncertainty arising from the market, according to the different requests of customers; the other is more concerned with the uncertainty arising from natural conditions (weather, climate) and external regulations.
At the same time, the analysis shows, in the case of A, a company that is pursuing an exploitation-exploration strategy, able in this sense to mix tradition and technology, shifting the paradox and managing it by introducing innovation while respecting tradition.
The three cases of B, C and D rather seem to adopt another strategy. Indeed, B and C seem to be in a 'stage' of exploitation strategy, in which the HR Department seems to play, to date, a marginal role. As for D, it seems that its way of managing the IT issue (long tradition) and the rooted system thinking of some

Table 3. Synthetic picture

Cases B, C, D | Structural issue: complexity from the business | Data management system: centralization in just one or two departments (top down) for B and C; dispersed in departments for D | Logic of control over data (power): top down | ICT infrastructure design: standard, non-customized system with continuous need of coordination at a higher level; integrated by system thinking in HR (D) | People: higher resistance to change (B, C); acceptance of change / system thinking in HR (D) | BD strategy: exploitation (B, C); exploitation + exploration (D)

Case A | Structural issue: complexity from the process | Data management system: dispersed in many departments | Logic of control over data (power): top down and bottom up | ICT infrastructure design: customized system | People: accustomed to and accepting of technology (functional and simple) | BD strategy: exploitation + exploration

Source: Authors' elaboration.


of the key actors in the HR Department makes it possible, in this case, to view an exploitation-exploration strategy. As a consequence, it is suggested that if 'system thinking', that is, properly balancing the hard and soft dimensions, resides within the HR Department, and better still if it is located in individuals (rooted in their experience, education, etc.), this might help the organization to adopt an exploitation-exploration strategy in managing BD, which is the way to avoid a 'data-led' (or 'IT-led') system and rather to ensure a business-led system.
The BD challenge is one of the most relevant implications of the ongoing technological evolution, which involves every aspect of a company's life. Different kinds of companies, depending on their business, strategy and history, need to cope with BD implementation in different ways and have to adopt and design different solutions. Regarding the HR Departments, it has been underlined that they usually have a hard time facing new technologies and their role in the organizational context (Camuffo, 2016), although technologies, once adopted into the organizational processes, produce a huge impact on people, the very people of whom HR Departments have to take care. This is the real reason at the basis of the interest in the topic examined in the present chapter.
The narrative presented here suggests theorizing two possible ideal-types of ways of configuring and designing organizational systems because of BD. Of course, the case studies do not represent per se these two ideal-types, but knowledge of them has been useful to identify and clarify the dimensions of the interpretative model. This model is represented in Figure 4.
Some common points seem to emerge as possible suggestions for all companies approaching BD implementation:

• It is important to have a conceptual framework for capturing business value


• It is necessary to understand where value can be created, through data
• It is fundamental to ask the right questions to data
• It is essential to share a common vision developing a common language.

The model presented considers two different orientations of the firms. On the left-hand side, there is a more outcome-focused strategy which, as a result, might be designed as a more 'centralized' organizational structure; on the other hand, a process-based organization is included.
In this second case, the structure is 'decentralized', implying that BD is dispersed across many Departments, not only the traditional ones (i.e., the Performance Management Department and the IT Department); in this latter case, IT systems are most of the time tailor-made, the logic of control over data is more widespread, and data analytics competencies are diffused among workers. It is the authors' opinion that the configuration shown on the right-hand side of Figure 4 should be the target to be reached by any company aiming at surmounting the BD challenge.
Within the right-hand side configuration, the HR Department is enabled to play the whole set of roles: strategic partner, change agent, administrative expert and employee champion (Ulrich & Grochowski, 2012). Also, the authors suggest that the HR Department should expand its role of listening and reporting to employees (i.e., employee champion) even to actors in the external environment.
From an organizational perspective, as well as from that of the HR Department, the more BD becomes widespread in a company (right-hand side of the figure), the more stress has to be placed on rethinking organizational design. Due to the entry of BD among the key variables and resources of our organizations, this means that a deep redesign effort should take place. Indeed, structures, coordination mechanisms and decision processes need to be analyzed and re-designed. At the same time, because of the number of


Figure 4. The model arising from results

new and more precise figures coming from the BD analytics field, the performance management system for human resources should also be implemented with particular attention to listening and communicating with employees.
A source of BD is the diffusion of wearable tools, for example as described in the case of company A. They produce information on organizational behaviors (personal features, commitment, emotion, satisfaction and so on) and can be used in relation to information coming from other processes and from the market (Waber, 2013), to develop knowledge useful for designing appropriate HR policies.
So, the HR Department should again take on the role of organizational analyst, studying and mapping BD in the organization, since they might have a huge impact on what, how, and in what time tasks are, or have to be, performed by employees.
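
As a purely illustrative sketch of the mapping that such an analyst role implies, the snippet below derives a few simple indicators from a hypothetical badge/app log. The column names and values are invented and do not describe any of the systems mentioned by the interviewees.

    import pandas as pd

    # Hypothetical badge/app log (columns and values are illustrative only)
    log = pd.DataFrame({
        "employee": ["e1", "e1", "e2", "e2", "e2"],
        "date": pd.to_datetime(["2018-03-01", "2018-03-02",
                                "2018-03-01", "2018-03-02", "2018-03-03"]),
        "hours_on_site": [8.0, 7.5, 9.0, 8.5, 10.0],
        "remote_requests": [0, 1, 2, 0, 1],
    })

    # Simple indicators an HR analyst might compute before discussing them with line managers
    indicators = log.groupby("employee").agg(
        days_present=("date", "nunique"),
        avg_hours=("hours_on_site", "mean"),
        remote_requests=("remote_requests", "sum"),
    )
    print(indicators)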
Moreover, another important aspect, still related to the HR Department, is that of 'education' towards an appropriate comprehension of data. Indeed, a new challenge arises with this increase in data (about any aspect of work, any attitude and sentiment), standards and higher results to be reached. Also, together with data management, the problem of ethics arises. In this regard, the HR Department may play a key role, paying attention to the meaning of data, due to the simple fact that behind data there are persons and situations to consider.


It is evident that this perspective asks for a deep change from HR managers. Their role has to evolve towards a hybrid one: together with the traditional competences on the basic HR processes, they have to develop capabilities able to connect different pieces and flows of information, and to do so with method, appropriately using analytics and their predictive potential, and organizing cockpits able to answer questions regarding the future.
If the real interest is to put people at the heart of enterprises and of the economic system, as is always declared, there is no other solution than including data analytics competences among the traditional skills of HR managers, so that they can really play a significant role in supporting the business and its strategies.

Research’ Limits

The principal limit of the research presented here is connected with the use of a case study methodology, which, as is known, may imply a lack of generalizability of the results. This is all the more evident given the limited number of companies, within the group identified as potential participants, whose managers were the only ones interested in discussing the problems arising with BD and BDA and who accepted to share their vision on the topic.
A second aspect to underline is that the research considers the HR managers' perspective, that is, one single view of the whole phenomenon. Even if the focus is on the role of the HR Departments, and even if this role is of course essential in relation to the chosen aim, other perspectives are necessary for a complete vision. In detail, other corporate-level management positions can influence the company's behavior towards this topic and, consequently, the orientation of HR managers towards it. Equally, this study does not consider the line managers' view, notwithstanding that their role in managing people (with or without the help of data) is fundamental.

FUTURE RESEARCH DIRECTIONS

The importance of the research interest at the basis of the chapter has been confirmed in the analysis presented here. Both the increasing role of BD - and, of course, of BDA - and the change that they can introduce in HRM are worthy of further study. Building upon the research findings described and the overall understanding presented in the chapter, the authors think that these issues hold promise for contributing to future studies.
Many directions can be followed, consistent with the above-mentioned lack of academic inquiry from the specific perspective of the potential role of the HR Department in the diffusion of BD, their comprehension and their use for predictive scopes.
First of all, a focus on a larger number of firms can help enrich knowledge of the experiences of enterprises dealing with this phenomenon. The enlargement of the set of case studies would allow building a database, organized according to industry and size, with the intention of examining whether there are differences, whether these can be attributed to specific features or, on the contrary, whether similar characteristics have implications for the approach towards BD. Also, a longitudinal approach can be useful to follow the evolution of such an innovative and dynamic phenomenon. Hence, case D represents a potential candidate for this continuation, by virtue of its more mature situation and, moreover, of the peculiar awareness of the persons in charge, whose skills and experiences seem to summarize the profile of a modern HR manager, able to act as a change agent and to mediate among different perspectives, also thanks to their inclination towards ambidexterity.
The final goal is to verify whether the model hypothesized here is able to shed light on how HR managers can intervene to support an appropriate development of the capabilities necessary to support new business models and, above all, new ways of better managing people. The extent to which BD and its role in the processes of HRM are theoretically grounded is limited in the literature, as remarked at the beginning of the chapter, so it is desirable to contribute in this direction as well.

CONCLUSION

This chapter aims to investigate what role BD may play in the field of Human Resource Management
(HRM). The topic is particularly interesting both for the recognized strategic relevance of human resources (HR) and for the increasing importance of BD in every dimension of companies' life. Moreover, a relative lack of research on this domain is documented in the current academic debate, considering the limited number of published papers, alongside a great deal of interest among HR practitioners. It seems that a specific in-depth analysis examining if and how BD and BDA could change the way HR activities are managed would be appreciated. Also, the role of the HR Department in supporting the firm's change toward the adoption of BD and BDA acquires, in this perspective, a strategic relevance and calls for further analysis.
After the presentation of the topic and of the state of the art, the definition of the theoretical context is introduced. In detail, a proactive orientation and a human resources 'analytical capability' for the big data challenge are proposed as basic elements to support the evolution towards a useful adoption of BD logics and BDA.
Then the study is introduced, based on the analysis of data gathered through semi-structured interviews, carried out in a multiple case inquiry on a small number of HR managers belonging to four Italian enterprises, developed with an exploratory approach consistent with the characteristics of the topic and with its specific contextualization in Italy.
The analysis focuses on the perception of the HR managers of the impact that BD and BDA may have, or would have, on HRM as a whole and on the possible problems the HR Departments may encounter when facing this challenge and implementing human resources analytics (HRA). It is the authors' opinion that specific attention to the perceptions shown by the managers involved in this process is the first and most important element conditioning their attitude towards this phenomenon and, consequently, the first feature influencing the possibility that BD can be considered a positive challenge, and managed accordingly. In this vein, the HR managers' awareness of the 'location' of BD and of the power dynamics arising from it, of its dimension and of the need for its effective management, also through the promotion of a shared vision, is essential.
In the end, the main findings are discussed and commented on, to offer suggestions for HR managers and to underline some key points for future research in the field of BD and of BDA, which is expected to become more and more relevant.


ACKNOWLEDGMENT

The authors wish to thank the persons who participated in the interviews for having accepted the challenge of discussing this topic.
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

REFERENCES

Agre, P. E. (1994). Surveillance and capture: Two models of privacy. The Information Society, 10(2),
101–127. doi:10.1080/01972243.1994.9960162
Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M., & Stuart, M. (2016). HR and analytics:
Why HR is set to fail the big data challenge. Human Resource Management Journal, 26(1), 1–11.
doi:10.1111/1748-8583.12090
Antila, E. M. (2006). The role of HR managers in international mergers and acquisitions: A
multiple case study. International Journal of Human Resource Management, 17(6), 999–1020.
doi:10.1080/09585190600693322
Bakshi, K. (2012). Considerations for big data: Architecture and approach. In Aerospace Conference
Proceedings, IEEE (pp. 1-7). Big Sky, MT: IEEE. 10.1109/AERO.2012.6187357
Benders, J., Hoeken, P., Batenburg, R., & Schouteten, R. (2006). First organise, then automate: A modern
socio‐technical view on ERP systems and teamworking. New Technology, Work and Employment, 21(3),
242–251. doi:10.1111/j.1468-005X.2006.00178.x
Bonoma, T. V. (1985). Case research in marketing: Opportunities, problems, and a process. JMR, Journal
of Marketing Research, 22(2), 199–208. doi:10.2307/3151365
Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’. The McKinsey
Quarterly, 4(1), 24–35.
Brown, M., Kulik, C. T., Cregan, C., & Metz, I. (2017). Understanding the Change–Cynicism Cycle:
The Role of HR. Human Resource Management, 56(1), 5–24. doi:10.1002/hrm.21708
Caldwell, R. (2003). Models of change agency: A fourfold classification. British Journal of Manage-
ment, 14(2), 131–142. doi:10.1111/1467-8551.00270
Camuffo, A. (2016). Le nuove sfide dell’HR: Big data, rilevanza e sostenibilità. Economia & Manage-
ment, 5, 117–125.
Cappelli, P. (2017). There’s no such thing as big data in HR. Harvard Business Review, 2.
Carter, B., Danford, A., Howcroft, D., Richardson, H., Smith, A., & Taylor, P. (2011). ‘All they lack is a
chain’: Lean and the new performance management in the British civil service. New Technology, Work
and Employment, 26(2), 83–97. doi:10.1111/j.1468-005X.2011.00261.x

149

Big Data and People Management

Charan, R. (2014). It’s time to split HR. Harvard Business Review, 92(7), 33–34.
Chen, C. P., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies:
A survey on Big data. Information Sciences, 275, 314–347. doi:10.1016/j.ins.2014.01.015
Collins, L., & Bennet, C. (2015). HR and people analytics. Deloitte Insights. Retrieved November 10,
2017, from: https://www2.deloitte.com/insights/us/en/focus/human-capital-trends/2015/people-and-hr-
analytics- human-capital-trends-2015.html
Conner, J., & Ulrich, D. (1996). Human resource roles: Creating value, not rhetoric. People and Strategy,
19(3), 38–49.
Creswell, J. W. (2014). Research design: Qualitative & quantitative approaches (4th ed.). London: Sage
Publications, Inc.
Danaher, J. (2016). The threat of algocracy: Reality, resistance and accommodation. Philosophy &
Technology, 29(3), 245–268. doi:10.1007/s13347-015-0211-1
De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review
of key research topics. In Proceedings of the 4th International Conference on Integrated Information
(Vol. 1644, No. 1, pp. 97-104). Madrid, Spain: AIP Publishing 10.1063/1.4907823
Deleuze, G. (1992). Postscript on the Societies of Control. October, 59, 3-7.
Dery, K., Hall, R., & Wailes, N. (2006). ERPs as ‘technologies-in-practice’: Social construction, mate-
riality and the role of organizational factors. New Technology, Work and Employment, 21(3), 229–241.
doi:10.1111/j.1468-005X.2006.00177.x
Diebold, F. (2018). A Personal Perspective on the Origin (s) and Development of ‘Big Data’: The Phe-
nomenon, the Term, and the Discipline, Second Version. Retrieved May 25, 2018, from: https://www.
sas.upenn.edu/~fdiebold/papers/paper112/Diebold_Big_Data.pdf
Dodge, M., & Kitchin, R. (2007). The automatic management of drivers and driving spaces. Geoforum,
38(2), 264–275. doi:10.1016/j.geoforum.2006.08.004
Eisenhardt, K. M. (1991). Better stories and better constructs: The case for rigor and comparative logic.
Academy of Management Review, 16(3), 620–627. doi:10.5465/amr.1991.4279496
Erevelles, S., Fukawa, N., & Swayne, L. (2016). Big data consumer analytics and the transformation of
marketing. Journal of Business Research, 69(2), 897–904. doi:10.1016/j.jbusres.2015.07.001
Evans, L., & Kitchin, R. (2018). A smart place to work? Big data systems, labour, control and modern
retail stores. New Technology, Work and Employment, 33(1), 44–57. doi:10.1111/ntwe.12107
Galbraith, J. R. (2014). Organizational design challenges resulting from big data. Journal of Organiza-
tion Design, 3(1), 2–13. doi:10.7146/jod.8856
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. In-
ternational Journal of Information Management, 35(2), 137–144. doi:10.1016/j.ijinfomgt.2014.10.007


Gartner. (n.d.). IT Glossary. Retrieved November 5, 2017, from: https://www.gartner.com/it-glossary/


big-data
George, G., Haas, M. R., & Pentland, A. (2014). Big data and management. Academy of Management
Journal, 57(2), 321–326. doi:10.5465/amj.2014.4002
Gibson, C. B., & Birkinshaw, J. (2004). The antecedents, consequences, and mediating role of organi-
zational ambidexterity. Academy of Management Journal, 47(2), 209–226.
Gupta, M., & George, J. F. (2016). Toward the development of a big data analytics capability. Informa-
tion & Management, 53(8), 1049–1064. doi:10.1016/j.im.2016.07.004
HRReview. (2013). 78% of HR managers do not feel they are very effective at workforce analytics. Re-
trieved September 17, 2017, from: http://bit.ly/HRRAnalytics
Jukić, N., Sharma, A., Nestorov, S., & Jukić, B. (2015). Augmenting data warehouses with Big data.
Information Systems Management, 32(3), 200–209. doi:10.1080/10580530.2015.1044338
Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: Issues and challenges moving
forward. In Proceedings of the 46th Hawaii International Conference on System Sciences (HICSS) (pp.
995-1004). Washington, DC: IEEE Computer Society. 10.1109/HICSS.2013.645
Kim, G. H., Trimi, S., & Chung, J. H. (2014). Big-data applications in the government sector. Com-
munications of the ACM, 57(3), 78–85. doi:10.1145/2500873
Lagoze, C. (2014). Big Data, data integrity, and the fracturing of the control zone. Big Data & Society,
1(2), 1–11. doi:10.1177/2053951714558281
Laney, D. (2001). 3D data management: Controlling data volume, velocity and variety. META Group
Research Note, 6(70).
Lengnick-Hall, M. L., Neely, A. R., & Stone, C. B. (2018). Human Resource Management in the Digital
Age: Big Data, HR Analytics and Artificial Intelligence. In P. N. Melo & C. Machado (Eds.), Manage-
ment and Technological Challenges in the Digital Age (pp. 13–42). Boca Raton, FL: CRC Press.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data:
The next frontier for innovation, competition, and productivity. Retrieved June 2, 2017, from: https://www.
mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
Mayan, M. J. (2009). Essentials of Qualitative Inquiry. Walnut Creek, CA: Left Coast Press, Inc.
McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big data: The manage-
ment revolution. Harvard Business Review, 90(10), 60–68. PMID:23074865
McKinsey&Company. (2016). The age of analytics: competing in a data-driven world. Global McKinsey
Institute.
Merriam, S. B. (1998). Qualitative Research and Case Study Applications in Education. Revised and
Expanded from: Case Study Research in Education. San Francisco, CA: Jossey-Bass Publishers.

151

Big Data and People Management

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Thousand
Oaks, CA: Sage Publications, Inc.
Oding, N. (2015). The use of Big data: Challenges and Perspectives in Russia. In C. Larsen, S. Rand,
A. Schmid, M. Mezzanzanica, & S. Dusi (Eds.), Big data and the complexity of Labor Market Policies:
New Approaches in Regional and Local Labor Market Mentoring for Reducing Skills Mismatches (pp.
153-164). Muenchen, Germany: Rainer Hampp Verlag.
Oltra, V. (2005). Knowledge management effectiveness factors: The role of HRM. Journal of Knowledge
Management, 9(4), 70–86. doi:10.1108/13673270510610341
Orum, A. M., Feagin, J. R., & Sjoberg, G. (1991). Introduction: The nature of the case study. In J.R.
Feagin, A.M. Orum, & G. Sjoberg (Eds.), A case for the case study (pp. 1-26). Chapel Hill, NC: The
University of North Carolina Press.
Power, D. J. (2014). Using ‘Big data’ for analytics and decision support. Journal of Decision Systems,
23(2), 222–228. doi:10.1080/12460125.2014.888848
Rasmussen, T., & Ulrich, D. (2015). Learning from practice: How HR analytics avoids being a manage-
ment fad. Organizational Dynamics, 44(3), 236–242. doi:10.1016/j.orgdyn.2015.05.008
Ronda‐Pupo, G. A., & Guerras‐Martin, L. Á. (2012). Dynamics of the evolution of the strategy concept
1962–2008: A co‐word analysis. Strategic Management Journal, 33(2), 162–188. doi:10.1002/smj.948
Scholz, T. M. (2017). Big Data in Organizations and the Role of Human Resource Management. New
York, NY: Peter Lang LTD International Academic Publishers. doi:10.3726/b10907
Shah, N., Irani, Z., & Sharif, A. M. (2017). Big data in an HR context: Exploring organizational change
readiness, employee attitudes and behaviors. Journal of Business Research, 70, 366–378. doi:10.1016/j.
jbusres.2016.08.010
Sheng, J., Amankwah-Amoah, J., & Wang, X. (2017). A multidisciplinary perspective of big data in
management research. International Journal of Production Economics, 191, 97–112. doi:10.1016/j.
ijpe.2017.06.006
SHRM Foundation. (2016). Use of Workforce Analytics for Competitive Advantage, Preparing for future
HR trends Report. Retrieved June 15, 2016, from: https://www.shrm.org/ foundation/ourwork/initiatives/
preparing-for-future-hr-trends/Documents/Workforce%20Analytics%20Report.pdf
Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of Big data challenges
and analytical methods. Journal of Business Research, 70, 263–286. doi:10.1016/j.jbusres.2016.08.001
Stake, R. E. (2013). Multiple case study analysis. New York, NY: Guilford Press.
Stiegler, B. (2010). Technics and Time, 3: Cinematic Time and the Question of Malaise. Stanford, CA:
Stanford University Press.
Tower Watson. (2014). Global Workforce Study. At a glance. Tower Watson. Retrieved June 25, 2017,
from: https://www.towerswatson.com/assets/jls/2014_global_workforce_study_at_a_glance_ emea.pdf


Tushman, M. L., & O’Reilly, C. A. III. (1996). Ambidextrous organizations: Managing evolutionary and
revolutionary change. California Management Review, 38(4), 8–29. doi:10.2307/41165852
Ulrich, D. (1997). Human resources champion. Boston, MA: Harvard Business School Press.
Ulrich, D. (1998). A new mandate for human resources. Harvard Business Review, 76, 124–135.
PMID:10176915
Ulrich, D., & Grochowski, J. (2012). From shared services to professional services. Strategic HR Review,
11(3), 136–142. doi:10.1108/14754391211216850
Waber, B. (2013). People Analytics: How Social Sensing Technology Will Transform Business and What
It Tells Us about the Future of Work. Upper Saddle River, NJ: Pearson Education Inc.
Wamba, S. F., Gunasekaran, A., Akter, S., Ren, S. J. F., Dubey, R., & Childe, S. J. (2017). Big data
analytics and firm performance: Effects of dynamic capabilities. Journal of Business Research, 70,
356–365. doi:10.1016/j.jbusres.2016.08.009
Yin, R. K. (2009). Case study research: design and methods (4th ed.). Thousand Oaks, CA: Sage Pub-
lications, Inc.


Chapter 7
Big Data, Semantics, and Policy-Making: How Can Data Dynamics Lead to Wiser Governance?

Lamyaa El Bassiti
Mohammed V University in Rabat, Morocco

ABSTRACT
At the heart of all policy design and implementation, there is a need to understand how well decisions
are made. It is evidently known that the quality of decision making depends significantly on the quality
of the analyses and advice provided to the associated actors. Over decades, organizations were highly
diligent in gathering and processing vast amounts of data, but they have given less emphasis on how
these data can be used in policy argument. With the arrival of big data, attention has been focused on
whether it could be used to inform policy-making. This chapter aims to bridge this gap, to understand
variations in how big data could yield usable evidence, and how policymakers can make better use of
that evidence in policy choices. An integrated and holistic look at how solving complex problems could
be conducted on the basis of semantic technologies and big data is presented in this chapter.

INTRODUCTION

To ensure global prosperity there is a need to bridge the gap between big data and policy-making, to
“overcome mistrust and misunderstanding, [to] resolve conflicts of goals, and [to] learn to speak the
same language” (Jacoby, 2013: 3). This ongoing divergence is raising a serious question of social re-
sponsibility and is calling for a drastic change by investigating the scientific and moral foundations of
contemporary beliefs to help decision-makers in dealing with complex global issues. It is thus imperative
to place big data and policy-making in the time horizon of doing the informed right thing rather than
merely doing. In other words, there is an urgent need to be endowed with an ability to exercise wise
judgment, to adopt a balanced perception of doing based on the assumption that political, scientific and
ethical aspects are closely interrelated and mutually reinforced. In doing so, serious attention should be
paid to the relationship between big data and policy-making on one side and ethics on the other side with
a perspective to go beyond the use of big data to inform policy, towards seeking novel meta-evidences
inferred from available big data to underpin wise judgment leading to sustainable policies. According
to Marcus (2013) “solving problems will often require a fair amount of ... specific information often
gathered painstakingly by experts. So-called machine learning can sometimes help, but ... Big Data is a
powerful tool for inferring correlations”.
The belief that the most pressing problems organizations face today are characterized by unprecedented levels of complexity and interdependence leads to the breakdown of the conventional problem-solving paradigm, focused on explicit knowledge and incremental improvements, and calls for leveraging the currently experienced change and complexity. USCCF (2014: 7) has stated that regardless of what form
it takes, big data has the potential to identify new connections and opportunities, and enable improved
understanding of the past to shape a better future. Bean (2017) has argued that big data is already being
used to improve operational efficiency, and the ability to make informed decisions based on the very
latest up-to-the-moment information is rapidly becoming the mainstream norm. Yet, the most critical
dilemma facing industries isn’t the big size and the related enormous complexity of big data, but having
the right data (Wessel, 2016). This challenge does not lie in a lack of data processing tools, but more in
a holistic, integrated and unified framework governing the steady flow of data. Despite the increased
adoption of data analytic tools, the current state of the art shows that using big data to solve complex
problems still remains problematic.
By their very construct, big data analytic tools were designed to enable in-depth understanding of complex issues, anticipation of possible scenarios and better decision-making. Although these issues can be similar across different disciplines, too much data will make it too hard to form any coherent judgment.
Maybe it is difficult to deal with this elusive challenge because policy-making is based on a context-
dependent process and the idea of using big data, which is still a very slippery concept, requires making
best use of global knowledge sources. However, a holistic picture can be drawn from the exploration
of the distinguishing characteristics of the social and semantic faces of the web of data. More specifi-
cally, dealing with this gap and finding meaningful correlation relies fundamentally on data indexing
and information integration based on domain knowledge representation. Following today’s transition
towards a new era of wise and smart forms of managing and organizing, this chapter makes a conceptual
contribution by investigating the question How to make best use of big data to inform policy-making
and complex problem-solving? and provides a generic, holistic and integrated framework for big data
governance based on semantic web principles.
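
As a purely illustrative sketch of what information integration based on domain knowledge representation can look like in practice, the snippet below builds a tiny RDF graph with the Python rdflib library and queries it with SPARQL. The namespace, classes and instances are invented for illustration and are not part of the framework proposed in this chapter.

    from rdflib import Graph, Namespace, RDF, Literal

    EX = Namespace("http://example.org/policy#")  # illustrative namespace
    g = Graph()
    g.bind("ex", EX)

    # A minimal domain representation: one policy issue and two heterogeneous data sources
    g.add((EX.AirQuality, RDF.type, EX.PolicyIssue))
    g.add((EX.SensorFeed42, RDF.type, EX.DataSource))
    g.add((EX.HospitalAdmissions, RDF.type, EX.DataSource))
    g.add((EX.SensorFeed42, EX.providesEvidenceFor, EX.AirQuality))
    g.add((EX.HospitalAdmissions, EX.providesEvidenceFor, EX.AirQuality))
    g.add((EX.SensorFeed42, EX.updateFrequency, Literal("hourly")))

    # Integrated view: which sources could inform this policy issue?
    query = "SELECT ?source WHERE { ?source ex:providesEvidenceFor ex:AirQuality . }"
    for row in g.query(query, initNs={"ex": EX}):
        print(row.source)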
Narrowing the focus of the research, the remainder of this chapter unfolds as follows: first, the background of this research work and the foundations in the contribution areas are summarized, in order to better identify the trends to look for and understand the broadness of the topic to be addressed. Then, the focus is on data dynamics-based ontology engineering, as it is the response provided to the increasing complexity facing modern organizations in their pursuit of big data governance. Next, a specific lens is put on policy-making as the application area of the designed framework for big data governance. The author concludes this chapter by summing up what was done and suggesting future directions of research.


BIG DATA AND COMPLEX PROBLEM-SOLVING

Based on the literature review related to the uptake of big data and the investigations of different case
studies trying to use big data to inform complex problem-solving in general and policy-making in par-
ticular, the author aggregated the different shortcomings found under the following key points:

• Big Data and Decision-Makers: According to McAfee et al. (2012: 62) the current big data
movement, like analytics before it, seeks to glean intelligence from data and translate that into im-
proved decision-making and performance. So, managers would be more willing to make better
predictions and smarter decisions, as well as target more-effective interventions in areas that so far
have been dominated by gut and intuition rather than data and rigor. Nevertheless, today the busi-
ness world still commits to the broad agreement that important decision-making relies too much
on experience and intuition and not enough on data. So, to reap the full benefits of using big data,
organizations have to change the way they make decisions by moving from the HiPPO (Highest-
Paid Person’s Opinion) style to a more data-driven style (McAfee et al., 2012: 68). Organizations
interested in leading a big data transition have to focus on asking more-specific questions such as:
Where did data come from? How could data be meaningful? What does the retrieved knowledge
say? Which kinds of analyses should be conducted? How could the deduced evidences be used?
In doing so, there is a need to define a more integrated framework to stitch data across silos, to
discover patterns and derive principles in order to be able to address complex problems from a
holistic perspective.
•	Dynamic Designing and Policy-Making: According to Prewitt et al. (2012: 40), the policy-making process starts with a decision that is pending; research then provides the information that is lacking, and with that knowledge in hand a decision is made. Based on this understanding, the policy-making process can unfold through five stages: (1) problem identification, (2) problem addressing, (3) alternatives prediction, (4) consequences measurement and (5) policy selection. Although rel-
evant and easy to understand, the indicated stages appear successive whereas the policy-making
process is dynamic, which calls for an updated design where stages can mutually influence each
other. Furthermore, the “characterization of distinct functional activities in the policy cycle is
somewhat arbitrary” (Höchtl et al., 2016: 156). So, there is a need to think about a holistic frame-
work allowing an in-depth understanding of the policy-making knowledge domain; a systematic
management of the policy-making process and multi-dimensional measurement of the policy-
making performance. Such a framework is intended to provide a trade-off between granting con-
ditions for creativity and at the same time exercising control, required to ensure that available
resources are used in an effective and efficient way.
• Evidence-Based Practices and Complex Problem-Solving: Generally, evidence-based prac-
tices aim to realize better and more defensible practices by grounding them in the conscientious,
explicit and judicious use of the available knowledge (Prewitt et al., 2012: 50). Although the term
evidence is frequently encountered as claims about predicted or actual consequences - effects,
impacts, outcomes or costs - of a specific action, scientific data can further be used as evidence
for early warning signs of a problem to be addressed, for target setting as well as assessment and
evaluation. Accordingly, the uptake of scientific data is more a matter of framing issues rather than a matter of the straightforward application of scientific findings to discrete judgments (Prewitt et al., 2012: 41). So, evidences cannot solve every problem, but they have the potential to inform
and illuminate the path to wise judgments, which requires an agreement between the involved
actors in the problem-solving process on what the desired ends should be. Gormley (2011: 979)
stated that scientific data should be “translated, condensed, repackaged, and reinterpreted before
it is used”. So, there is a need to identify what knowledge is noticed at a particular stage of the problem-solving process, whether it is understood as pertaining to a given problem, and how it is eventually used to make a wise judgment. These assessments, according to Spillane and Miele
(2007: 48), depend not only on the frames of knowledge named evidences but also on the cogni-
tion of the actors involved in problem-solving activities as well as the surrounding context.

Beyond identifying the main problems facing the use of big data to inform complex problem-solving such as policy-making, and rather than investigating each problem individually, the author espoused a holistic view and formulated three complementary research questions (RQs) to guide the direction of this research work and to answer the main question: How can big data best be used to inform complex problem-solving and make wiser decisions?

RQ 1: How can big data be integrated to support interoperability between different knowledge domains?
RQ 2: How can big data be managed to allow a systematic steady flow of inferred evidences?
RQ 3: How can big data be governed to assist modern organizations in their efforts to practice wisdom?

As a corollary, the uptake of big data to inform complex problem-solving like policy-making is “a
dynamic, complex and mediated process, which is shaped by formal and informal structures, by multiple
actors and bodies of knowledge, and by the relationships and play of politics and power that run through
the wider policy context” (Nutley et al., 2007: 111). So, there is a need to bridge the data producing
and data consuming sides of the big data challenge, to represent big data in a usable form, to synthesize
the problem-solving activity in a consensus-based process, to glean intelligence from data and translate
that into improved performance, and to embrace an ethical dimension throughout the problem-solving process.
In other words, there is a need for a sophisticated big data organization system.

DATA ORGANIZATION SYSTEM

Creativity occurs from interconnections between people and their subjects of interest, and through social
interactions tacit knowledge is shared and further developed; so, the produced content grows quickly.
Catmull (2008: 4) stated that “creativity involves a large number of people from different disciplines
working effectively together to solve a great many problems”. Policy-making as a creative activity
requires that involved actors understand precisely which knowledge will fulfill their needs, keep this
knowledge on the cutting edge, deploy it, leverage it in performing their tasks and spread it across their
organizations. Conversely, lacking these capabilities will impede social interactions and may ultimately cause the policy-making process to fail. Thus, big data, being the result of social interactions on the web, is criti-
cal for developing new policies and garnering support and sponsorship to move them forward towards
realization. Although using big data as part of the complex problem-solving process is still in its very early stages; recently, this trend is gaining more momentum. Yet, despite their potential usefulness for
locating and linking people, big data in general and social media, in particular, do not explain what con-
nects particular people and not others, and lack the capability to create shared knowledge graphs that would help make efficient use of the overwhelming flow of data.

Beyond Traditional Knowledge Organization Systems

Knowledge organization is a core concept within Library and Information Science. It is concerned with
theories and methods for organizing and representing concepts and concept-relations. It is associated
with different kinds of representation systems that are designed for managing, storing and searching for
different kinds of information sources. Knowledge organization is about activities such as document
description, indexing and classification; and concerned with theories about systems and processes con-
nected to subject representation and information retrieval. A Knowledge Organization System (KOS)
comprises the concepts relevant for a certain domain of interest, called vocabulary, and imposes a structure
by inter-relating this vocabulary with different semantic relations (Weller, 2010: 21). The foundation of a
KOS is usually in the form of a concept hierarchy that can be enriched with further complex relationships
like equivalence or association. The basic functions of a KOS are: knowledge representation, resource
indexing and information retrieval.
Over decades different types of KOS have been developed and they can be categorized into four main
classes (Zeng, 2008: 161): (1) Term Lists: limited sets of terms arranged in some sequential order, (2)
Metadata-like Models: lists of terms used for controlling variant names of an entity, (3) Classifications:
vocabularies of controlled terms including relationships between these terms and having a particular
focus on hierarchical relations, and (4) Relationship Lists: controlled and structured vocabularies allow-
ing hierarchical, associative and equivalence relationships. The most prominent KOS belonging to these
categories are nomenclatures, taxonomies and thesauri, ordered according to their evolution in terms of
broadness and semantic strength. The main limitations of traditional KOS in the face of growing data can be summarized as a serious lack of semantic structure and consistency, together with limited automated processing and domain coverage (Weller, 2010: 405). Thus, by applying new specifications of relation-
ships, it may be possible to overcome these limitations and provide more extensive and semantically rich
knowledge representations that will ultimately improve information organization, indexing and retrieval.

Social Trend and Big Data Dilemma

Technological advances dramatically changed the approach to knowledge organization. Organizing the
huge amount of information is one of the major challenges to achieving the full potential of the web. In
other words, the decentralized process of capturing more and more aspects of physical social networks
suffers from the information overload dilemma coupled with a lack of interoperability and linking be-
tween diffused data. According to Schroeck et al. (2012: 4-5), big data is associated with increased data
volume (large scale), velocity (motion in fractions of a second) and variety (many forms, structured and
unstructured) but decreased data veracity (uncertainty). Consequently, the size of datasets exceeds the
abilities of many organizations in terms of capturing, storing, managing, and analyzing data (Manyika
et al., 2011: 116). Such changes in data characteristics call for corresponding changes in data processing
and business process models, to unlock the potential of big data and allow its conversion into intelligence.


In this context, the well-known problem of annotating (indexing) content with descriptive metadata
was rediscovered and addressed from a new user-centered perspective (Weller, 2010: 70). The new
indexing technique consists of labeling a resource with a set of keywords, called tags, that the user believes are relevant to characterize the resource according to his or her own needs. These tags do not rely
on a controlled vocabulary or a previously defined structure but serve as links to resources annotated
both by their owners and by other users. This has led to the emergence of a shared and evolving clas-
sification structure, called folksonomy, which refers to a combination of folk and taxonomy and means
“a taxonomy created by people” (Peters, 2009: 154). Folksonomies are not only a way of representing
concepts through cognitive association by individual users, but also a social feature that encourages social
relationships among users. From this perspective, folksonomies can be seen as a method for sharing,
exchanging and integrating users’ interests through tags attached to social content, which allow mediat-
ing common interests across independent and heterogeneous sources. This act can be an alternative way
towards creating new knowledge from heterogeneous data sources.
Tagging systems in general and folksonomies in particular cannot produce a coherent categorization scheme because users’ contributions do not operate under a centralized controlled vocabulary. Their uncontrolled nature is fundamentally chaotic and suffers from problems of ambiguity, uncontrolled synonymy and discrepancies in granularity, which has been categorized as a vocabulary problem. Moreover, tagging activities that arise from users’ participation are locked into host sites, which prevents related users with common interests from being discovered and connected across heterogeneous data sources using tags (Breslin & Decker, 2007: 87). As a path to improvement, Gruber (2007: 7) emphasized the
need to link social media with semantic technologies and to identify and formalize a conceptualization
of tagging data.
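To make the tripartite nature of tagging and the resulting vocabulary problem concrete, the following minimal sketch (plain Python; the users, resources and tag spellings are invented purely for illustration) represents tagging actions as (user, resource, tag) triples and shows how uncontrolled synonym spellings fragment what is conceptually a single topic:

from collections import defaultdict

# Each tagging action is a tripartite relation: (user, resource, tag).
# Users, resources and tags below are hypothetical.
tagging_actions = [
    ("alice", "doc1", "policy-making"),
    ("bob",   "doc1", "policymaking"),    # spelling variant of the same topic
    ("carol", "doc2", "public-policy"),   # near-synonym
    ("dave",  "doc2", "policy-making"),
]

# Group resources by tag: without a controlled vocabulary, the three
# variant tags split one topic into separate, unlinked clusters.
resources_by_tag = defaultdict(set)
for user, resource, tag in tagging_actions:
    resources_by_tag[tag].add(resource)

for tag, resources in resources_by_tag.items():
    print(tag, sorted(resources))

Under a controlled vocabulary, or once tags are linked to shared concept URIs as discussed in the next section, the three variants would resolve to a single concept.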

Semantic Trend and Opportunities to Big Data

According to Gruber (2008: 5), if the world’s knowledge is to be found on the web, then we should be
able to use it to answer questions, retrieve facts, solve problems, and explore possibilities. This is quali-
tatively different than searching for documents and reading them, even though text search engines are
getting better at helping people do these things. Many major scientific discoveries and breakthroughs
have involved recognizing the connections across domains or integrating insights from several sources.
These are not associations of words; they are deep insights that involve the actual subject matter of these
domains. Semantic web has the machinery to help address the interoperability of data from multiple
sources. Interoperability is regarded as one of the key factors of existence of both people and organiza-
tions in a knowledge society. According to the US Joint Vision 2020, interoperability is considered as a
key element of information superiority (USDOD, 2000: 15). Interoperability happens when two or more
actors (persons, organizations or systems) interact, communicate or collaborate to achieve a common goal.
A standard definition of interoperability is provided by IEEE (1990: 114) as “the ability of two or more
systems or components to exchange information and to use the information that has been exchanged”.
This vision of interoperability of data poses the challenge of information integration (Weller, 2010: 53).
Data a user needs may not be available all in one place but may have to be collected from several sites
and then presented in an integrated manner. To enable information integration and exchange, semantic
technologies have been developed to bring machine-readable descriptions to content already existing
on the web. Berners-Lee et al. (2001: 4) popularized the term semantic web to denote a new version
of the web where data is given well-defined meaning, better enabling computers and people to work in cooperation. In this perspective, information integration can be done through the use of semantic
markups, machine-understandable index terms, that can be associated with web resources. Nevertheless,
using metadata alone will not establish the semantics of what is being marked up. In response, and for more explicit meaning, ontologies have been proposed to provide shared and precisely defined terms and
constraints to describe the meaning of resources using metadata. Accordingly, a promising solution to
big data integration is to index data-resources with semantically well-defined metadata based on ontolo-
gies, which would allow linking both content and people in meaningful ways and enhance browsing and
locating interesting items and people with similar profiles (Breslin & Decker, 2007: 86).

ONTOLOGY ENGINEERING FOR BIG DATA INTEGRATION

Theoretically, the semantic vision aims to provide a structure for miscellaneous data on the web by add-
ing structured metadata to data resources. This will elucidate the actual meaning behind a connection,
which will improve search, navigation and interoperability between systems. Technically, this vision
absolutely needs an interoperable infrastructure based on global standard protocols (De Virgilio et al.,
2010: 481). Accordingly, the semantic vision has been empowered by the development of a set of tech-
nologies promoting formal description of concepts, terms and relationships within a given knowledge
domain. These technologies are specified as W3C standards (see Figure 1) and include:

Figure 1. Semantic web technology stack (Source: Nowak, 2009)


•	Resource Description Framework (RDF): The core data representation format for semantic web technologies. It provides a data representation model and syntax for describing
resources on the web. As shown in Figure 2 it creates statements in the form of triples (subject,
predicate or property, object). The subject identifies the resource that a given statement is about.
Every resource can be described by a number of predicates which state some kind of character-
istics of the resource. Finally, the object expresses the value of the property identified with the
predicate, which can be either a literal or a pointer to another resource. Using RDF all resources
are identified by URI (Uniform Resource Identifiers) addresses. As a result, RDF allows perceiv-
ing data resources as a graph (not as a tree-like data in XML), where subjects and objects are the
nodes and predicates are the edges. One of the benefits from such an approach is to facilitate the
extension of the data resources graph and the integration of data across different systems by refer-
ring to common URI resources. However, anyone can define a vocabulary of terms to be
used for a more detailed description. Thus, to allow a standardized description, a data-modeling
vocabulary for RDF data is required.
• RDF Schema (RDFS): A language that provides a basic vocabulary for RDF to describe par-
ticular data of a given knowledge domain. So, RDFS extends RDF with a schema vocabulary. In
practice, ontologies defined with RDFS are expressed in systems of classes (abstract concepts of a
domain of knowledge), instances (real-world individuals) and the relations between them specify-
ing classes’ properties. However, RDFS serves to create only simple ontologies; more complex ontologies need a more expressive language providing additional standardized vocabulary and more advanced constructs that allow expressing classes and their relationships in more detail.
• Web Ontology Language (OWL): An extension of RDFS that enables defining more complex re-
lationships between concepts to restrict the interpretation of the knowledge base. This is achieved
by a number of new features such as logical expressions and local properties with a possibility to
define certain values as required or optional and limit their range. Those additional description ca-
pabilities are defined in OWL as constructs (union, intersection) and axioms (subclass, equivalent
class, definition of symmetry). OWL provides three sub-languages (OWL Lite, OWL DL and OWL
Full) that increase expressivity according to syntax and computational needs. For instance, OWL
Full allows maximum expressivity with no computational guarantees.
•	SPARQL Protocol and RDF Query Language (SPARQL): A language that can be used to query any RDF-based data (including statements involving RDFS and OWL). It is an SQL-like language, but it uses RDF triples and resources both for matching parts of the query and for returning results. Since both RDFS and OWL are built on RDF, SPARQL can be used to query ontologies and knowledge bases directly. Furthermore, SPARQL also defines a protocol for accessing RDF data. A minimal sketch showing how these building blocks fit together follows this list.
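As a concrete illustration of how these building blocks fit together, the following minimal sketch uses the Python rdflib library; the example namespace, the resource names and the tiny schema are assumptions made purely for illustration, not part of any particular system. It builds a small RDF graph, adds an RDFS/OWL-style vocabulary and queries it with SPARQL:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/policy#")   # hypothetical namespace, for illustration only

g = Graph()
g.bind("ex", EX)

# RDFS-style schema: a class and a property with a declared domain.
g.add((EX.Policy, RDF.type, RDFS.Class))
g.add((EX.addresses, RDF.type, RDF.Property))
g.add((EX.addresses, RDFS.domain, EX.Policy))

# A simple OWL axiom: two classes declared equivalent.
g.add((EX.Regulation, OWL.equivalentClass, EX.Policy))

# RDF triples (subject, predicate, object) describing one resource.
g.add((EX.cleanAirPolicy, RDF.type, EX.Policy))
g.add((EX.cleanAirPolicy, RDFS.label, Literal("Clean air policy")))
g.add((EX.cleanAirPolicy, EX.addresses, Literal("air quality")))

# SPARQL query over the same graph: retrieve every policy and its label.
query = """
    PREFIX ex: <http://example.org/policy#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?policy ?label WHERE {
        ?policy a ex:Policy ;
                rdfs:label ?label .
    }
"""
for row in g.query(query):
    print(row.policy, row.label)

Because the triples, the schema and the query all live in the same graph model, the same machinery can later be pointed at much larger, federated RDF stores.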

Below are examples of ontologies that have been developed to represent uniformly the different artifacts (communities, people, documents, tags) produced and shared in the web of data; a brief illustrative sketch follows the list:

• Friend-Of-A-Friend (FOAF): An ontology that defines a widely used vocabulary for describing
people and relationships between them, as well as things they create and do. It enables people
to create machine-readable content for people, groups, organizations and other related concepts.
foaf: knows is one of the most used FOAF properties that acts as a simple manner to create social
networks by adding knows relationships to individuals and their friends.


Figure 2. RDF triple (subject, predicate, object) (Source: el Bassiti, 2018)

• Semantically Interlinked Online Communities (SIOC): An ontology that aims to interlink on-
line communities’ content from different social media by providing a lightweight vocabulary to
describe the structure of, and activities within, online communities. When combined with FOAF
vocabulary and SKOS model (Simple Knowledge Organization System is an ontology used to
organize knowledge), SIOC allows linking discussion posts and content items to other discussions
and items, people (using their associated user accounts), and topics (using specific tags, hierarchi-
cal categories or concepts represented by URIs).
• Tag Ontology: The first RDF-based model for representing tags and tagging actions. This
Ontology defines the Tag and Tagging classes with related properties to create the tripartite rela-
tionship of tagging. In order to represent the users involved in a tagging action, this ontology relies
on FOAF vocabulary.
• Meaning of a Tag (MOAT): An ontology that aims to represent the meaning of tags using URIs
of existing domain-specific ontology instances or resources from existing public knowledge bases.
MOAT is more than a single model, as it also provides a framework allowing people to easily
bridge the gap between simple free-form tagging and semantic indexing, helping users to an-
notate their content with URIs of semantic web resources from the tags they have already used to
annotate content.
• Social Cloud of Tags (SCOT): An ontology that aims to share and reuse tag data and represent
social relations across different sources. It provides the structure and semantics to describe re-
sources, tags and users. It provides extended tag information such as synonym, spelling variant,
tag frequency, tag co-occurrence frequency and tag equivalence in order to reduce tag ambiguity.
The SCOT’s concepts “user”, “tag” and “resource” have links to FOAF, SKOS and SIOC ontolo-
gies respectively.
• Annotation Ontology (AO): This ontology does not propose a new domain-specific ontology
nor tagging vocabulary to represent activities, but extends existing models (Tag Ontology and
MOAT), so that they can easily interoperate. It aims to allow users to construct annotations, making it possible to publish the annotation data, to query it and to reason about it.
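The sketch below hints, again with rdflib, at how such vocabularies interlink people, content and tags. The FOAF and SIOC namespace URIs are the standard ones, while the individual resources and the MOAT-style "meaning" properties are invented for illustration:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
SIOC = Namespace("http://rdfs.org/sioc/ns#")
EX = Namespace("http://example.org/data#")       # hypothetical data namespace

g = Graph()
g.bind("foaf", FOAF)
g.bind("sioc", SIOC)

# FOAF: two people and a foaf:knows relationship between them.
g.add((EX.alice, RDF.type, FOAF.Person))
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.bob, RDF.type, FOAF.Person))
g.add((EX.alice, FOAF.knows, EX.bob))

# SIOC: a post created by Alice's user account in an online community.
g.add((EX.post1, RDF.type, SIOC.Post))
g.add((EX.post1, SIOC.has_creator, EX.aliceAccount))
g.add((EX.aliceAccount, SIOC.account_of, EX.alice))

# MOAT-style idea (invented properties): attach a free-form tag to the post
# and link that tag to an unambiguous concept URI, giving it a shared meaning.
g.add((EX.post1, EX.taggedWith, EX.tag_policy))
g.add((EX.tag_policy, EX.hasLabel, Literal("policy")))
g.add((EX.tag_policy, EX.hasMeaning, EX.PublicPolicyConcept))

print(len(g), "triples linking people, content and tags")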

Big Data Governance: Generic Integration Using Data Dynamics (GenID Framework)

Big data encompasses an enormous volume of information, including social data (user-generated data
from social media), industrial and sensor data (machine, mobile and GPS data as well as the Internet of
Things), business data (customer, inventory and transactional data) and public data (datasets generated or collected by government agencies as well as universities and non-profit organizations) (USCCF, 2014:
4). So, big data is not static; rather, it is constantly changing over time and depending on the surround-
ing context as well as the desired or produced application. Big data can be seen as both a resource and
a process, both of which are linked to the interactions that occur among actors. Data dynamics can be
understood as the dynamics that emerge from the processes of iterative transition from creation; through
usage, transformation and movement; to the diffusion of valuable data. Embracing a semantic vision
based on data dynamics, organizations aim to shift big data from being machine-readable and human-
understandable to being machine-understandable (read and interpreted correctly by a machine).
This semantic-based vision of big data governance aims to leverage explicit and formal repre-
sentations of knowledge domains (ontologies) to annotate data resources with common metadata. This
vision will be approached from two directions: (1) making use of semantic web technologies to improve
usage of big data by adding metadata which are defined explicitly and formally based on ontologies,
(2) making use of big data to improve the process of creating semantic web metadata. The expected
success of this combination may be due to the remarkable reciprocity between big data (supported by an ecosystem of participation where value is created by the aggregation of many individual user contributions) and the semantic web (supported by an ecosystem of data where value is created by the integration of structured data from many data sources). Together they will have a single objective of
dealing with the fundamental challenge of socially shared meaning of data.
This approach can also be conceptualized as an iterative process allowing the coproduction of knowl-
edge arising from the integration of strongly connected fragments of formal knowledge from multiple
knowledge providers, with different levels of precision and trustworthiness, maybe even inconsisten-
cies. As depicted in Figure 3 the expected result is to propose an ontology-based Linked Big Data with
advanced cross-domain interoperability. A prominent example of this perspective is DBpedia, which
is part of the W3C Linking-Open-Data community project covering the structured data of Wikipedia.
In addition, the semantic web layer on top of big data will make it possible to deal with information fragmentation and data format heterogeneity, knowledge integration and reuse, annotation and retrieval. Once
structured and represented in a uniform way, big data can be mined and leveraged to meet organizational
requirements for linking information, detecting and structuring emergent processes, and providing insight
into and from spontaneous communities and emergent collaborative structures.

GenID Framework: Data Dynamics Process

As depicted in Figure 4, GenID framework to big data governance is supported by a process of data
dynamics comprising three main phases: (1) Data Interoperability, (2) Knowledge Mining and (3) Wis-
dom Governance. These phases will be detailed below:
At the bottom of GenID framework is defined the Data Interoperability phase that aims to translate
big data stores into knowledge bases using ontologies to ensure cross-domain sharing of knowledge
patterns. This phase can be further broken down into two main stages: (1) Localization and selection of
relevant sources of data, typically containing large volumes of raw data having no intrinsic meaning and existing in
different formats. (2) Integration of data from selected sources and giving them meaning using semantic
annotations extracted from the ontological representation of the knowledge domain (contextualization),
which leads to meaningful, subjective and semantically constructed information (knowledge).


Figure 3. Transition towards ontology-based linked big data (Source: el Bassiti, 2018)

In the middle of GenID framework is defined the Knowledge Mining phase that aims to elicit hidden
patterns and translate them into structured evidences to be used in a variety of contexts using knowledge
discovery tools. This phase consists of two main stages: (1) Analysis of knowledge and mining of useful,
ultimately understandable and context-rich insights to be translated into abstract and structured patterns
(evidences), disembodied from their context of creation (de-contextualized/generalized models). (2) Us-
ing these evidences by putting them within a new context of use to inform reasoning while espousing
a moral lens (re-contextualization/specialization of abstract models) towards making wise judgment,
which means the creation of new, practical and moral-based evidences (wisdom).
On the top of GenID framework is defined the Wisdom Governance phase that aims to develop a
cross-domain and moral-driven system of governance that emerges out of the transaction between prac-
ticing the acquired wisdom while focusing on the ethical dimension, establishing a connection with the
strategic roadmap of the larger system (organization or market), as well as assessing the progress and
evaluating the impact. This phase comprises three main stages: (1) Create a trust-based culture to en-
able firms and workers to cluster together, to pursue a common objective and to work more effectively.
(2) Support ongoing and accurate examination of triggers (initiating factors), performance and impact;
which will lead to better awareness of relevant patterns and quick detection of eventual risks, allowing
faster abilities to change direction before mistakes become expensive or great opportunities are missed.
(3) Enable a high level of embeddedness (aggregation) of retrieved patterns and diffusion of the acquired
wisdom in surrounding communities, which can greatly facilitate the absorption of practically validated
patterns of wisdom (practical wisdom).
These phases have been translated into service layers in the conceptual architecture of GenID frame-
work that will be further detailed in the next sub-section.


Figure 4. GenID data dynamics process to big data governance (Source: el Bassiti, 2018)

GenID Framework: Conceptual Architecture

Structured over the three main service layers corresponding respectively to the three phases of the GenID data dynam-
ics process identified in the sub-section above, Figure 5 presents the key components of the architectural
design underlying GenID framework to big data governance, which adopts an ontology-based semantic
vision. GenID framework includes six main components: (1) Big Data Integration, (2) Big Data An-
notation, (3) Knowledge Representation, (4) Evidence Discovery, (5) Wise Reasoning and (6) Wise Learning. Further details about each component are provided below:
At the bottom of GenID architecture is defined the Big Data Integration component that aims to
merge heterogeneous data resources, towards providing the level of personalization required to establish
a clearer vision, which facilitates data grouping, clustering and visualization. The main functions of this
component are: (1) Define a data-model intended to provide the abstraction required to have an overview
of how the pieces of data can be integrated into a coherent whole. (2) Define a metadata-model intended
to provide the abstraction required to have an overview about how data resources can be described, linked
and annotated, stored and reused. (3) Create a schema mapping to link concepts defined in conceptual
schemas of ontologies (representing the different knowledge domains required to deal with a given target)
to big data sources’ schemas (large scale databases). (4) Map ontology instances to raw data in databases.
(5) Convert relational databases into linked databases (ontology-based databases).
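The following sketch illustrates, under simplifying assumptions, the kind of mapping that functions (3) to (5) imply: rows from a hypothetical relational table are lifted into ontology-based triples through a hand-written column-to-property mapping (in practice, mapping standards and dedicated tools would typically be used; the table, columns and ontology terms here are invented):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/policy#")   # hypothetical domain ontology

# Rows as they might come from a relational "policies" table.
relational_rows = [
    {"id": 1, "title": "Clean Air Act", "domain": "environment"},
    {"id": 2, "title": "Open Data Charter", "domain": "governance"},
]

# Hand-written schema mapping: column name -> ontology property.
column_to_property = {"title": RDFS.label, "domain": EX.policyDomain}

g = Graph()
g.bind("ex", EX)
for row in relational_rows:
    subject = EX[f"policy/{row['id']}"]        # mint a URI per row
    g.add((subject, RDF.type, EX.Policy))      # type the row against the ontology
    for column, prop in column_to_property.items():
        g.add((subject, prop, Literal(row[column])))

print(len(g), "triples generated from", len(relational_rows), "rows")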


Figure 5. Conceptual architecture of the GenID framework to big data governance (Source: el Bassiti, 2018)

The Big Data Annotation component seeks the successful use of metadata to be able to define what
the data resource is about in a machine-processable way, which will facilitate the search and retrieval of
useful and relevant data. The main functions of this component are: (1) Define sufficient description of
data elements so that they can be properly interpreted and compared using the defined mapping sche-
mas (linking ontologies and big data sources). (2) Identify domain-specific indexing rules that can infer
high-level semantic descriptions. (3) Recognize and extract the entities to be annotated. (4) Annotate
recognized data resources using controlled vocabulary defined in ontologies (semantic metadata). (5)
Score annotations according to their context of creation to improve their usability.
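A minimal sketch of functions (3) to (5) follows: a naive dictionary-based annotator recognizes ontology terms in a text resource and attaches scored semantic annotations. The vocabulary, the sample text and the scoring rule are invented; a real system would rely on proper entity recognition and disambiguation:

# Controlled vocabulary: surface form -> ontology URI (both hypothetical).
vocabulary = {
    "air quality": "http://example.org/policy#AirQuality",
    "public transport": "http://example.org/policy#PublicTransport",
    "emission": "http://example.org/policy#Emission",
}

def annotate(text, vocab):
    """Return scored semantic annotations for vocabulary terms found in the text."""
    annotations = []
    lowered = text.lower()
    for term, uri in vocab.items():
        occurrences = lowered.count(term)
        if occurrences:
            # Naive score: relative frequency of the term within the resource.
            score = occurrences / max(len(lowered.split()), 1)
            annotations.append({"term": term, "uri": uri, "score": round(score, 4)})
    return sorted(annotations, key=lambda a: a["score"], reverse=True)

resource = ("The proposed policy targets air quality in dense urban areas "
            "by expanding public transport and tightening emission limits.")
for annotation in annotate(resource, vocabulary):
    print(annotation)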
In the middle of GenID architecture is defined the Knowledge Representation component that aims
to use a systematic process of generation, search and selection of key concepts to create a shared con-
ceptualization of a knowledge domain. The main functions of this component are: (1) Design a common
model allowing a holistic and integrated understanding of a knowledge domain, to be discussed with
a multidisciplinary perspective, to achieve some degree of consensus that satisfies all concerns. (2) Break
down the holistic understanding into small chunks (concepts), so each concept can be instantiated as
annotations that can be used independently and (re)used efficiently in various contexts. (3) Investigate
the designed ontology in order to determine if it is a self-contained system aggregating all the informa-
tion needed for a knowledge domain; so it can be easily understood, computationally searched and then
quickly modified according to the user requirements. (4) Merge specific context-dependent ontologies
to create a generic, single, coherent and context-independent meta-ontology. (5) Align ontologies by
establishing links, which facilitates reusing data from different ontologies.
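Functions (4) and (5), merging and aligning ontologies, can be sketched as follows; the two small modules and the single equivalence axiom are invented, and real alignment would rely on dedicated ontology-matching techniques rather than one hand-stated link:

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

HEALTH = Namespace("http://example.org/health#")         # hypothetical module
TRANSPORT = Namespace("http://example.org/transport#")   # hypothetical module

# Two context-dependent sub-ontologies developed independently.
health = Graph()
health.add((HEALTH.Intervention, RDF.type, RDFS.Class))

transport = Graph()
transport.add((TRANSPORT.Measure, RDF.type, RDFS.Class))

# Merge: union of the two modules into a single meta-ontology graph.
meta = Graph()
meta += health
meta += transport

# Align: state that the two classes denote the same notion, so that data
# described with either module can be reused together.
meta.add((HEALTH.Intervention, OWL.equivalentClass, TRANSPORT.Measure))

print(len(meta), "triples in the merged, aligned meta-ontology")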


The Evidence Discovery component aims to use the conceptualized knowledge representations and
reasoning services to infer evidences. The main functions of this component are: (1) Make the designed
frames of knowledge visible to all users, so they can ponder, challenge, condense or adapt them to their
specific needs. (2) Analyze a selection of knowledge frames to clearly define a set of mental images
translating the different understandings of why the evidence exists and of the effects the actor
wants to have on each part of the larger context. This task typically calls for multidisciplinary partici-
pation, and requires a handful of data analysis tools and inspiring stimuli that can trigger relevant pat-
terns. (3) Synthesize the related mental images into a conjecture (a rough approximation of a cohesive
whole that only defines primary insights). This task deals with the what question within a given context.
(4) Mine and infer a generic pattern by improving, prioritizing and clustering complementary groups of
context-rich insights to form the basis of a new compelling pattern based on trusted proofs. This task is
based on a process of focusing and going into details, which requires more than just giving a finishing
touch or opening up to new horizons and contents, but instead a way to make creative use of available
knowledge frames. (5) Design a semantic representation to capture the essence of the inferred patterns.
Using structured evidences provides focus, builds a shared understanding and enables comparability
between alternative patterns.
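One way to ground function (4) is an aggregating SPARQL query over the annotated knowledge base. The sketch below, with an invented namespace and toy facts, counts which topics most often co-occur with positively evaluated policies, a rough stand-in for an inferred evidence pattern:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/policy#")   # hypothetical namespace

g = Graph()
g.bind("ex", EX)
# Tiny illustrative knowledge base: policies, their topics and their outcomes.
facts = [
    (EX.p1, EX.hasTopic, EX.AirQuality), (EX.p1, EX.outcome, Literal("positive")),
    (EX.p2, EX.hasTopic, EX.AirQuality), (EX.p2, EX.outcome, Literal("positive")),
    (EX.p3, EX.hasTopic, EX.Housing),    (EX.p3, EX.outcome, Literal("negative")),
]
for s, p, o in facts:
    g.add((s, p, o))
    g.add((s, RDF.type, EX.Policy))

query = """
    PREFIX ex: <http://example.org/policy#>
    SELECT ?topic (COUNT(?policy) AS ?positives) WHERE {
        ?policy a ex:Policy ;
                ex:hasTopic ?topic ;
                ex:outcome "positive" .
    }
    GROUP BY ?topic
    ORDER BY DESC(?positives)
"""
for row in g.query(query):
    print(row.topic, row.positives)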
On the top of GenID architecture is defined the Wise Reasoning component that is context-dependent
and aims to help users to deal with complex situations using structured representations and similarity-
based retrieval of past problem-solving experiences, considered as a practically-proved source of problem-
solving expertise. The main functions of this component are: (1) Formally represent and clearly describe
the situation or problem at hand as an aggregation of specifications covering the core-subject to deal
with, the associated actors as well as the main features of the surrounding context. (2) Search within
the wisdom memory (knowledge base of proved experiences) for similar or related situations using se-
mantic similarities and relationships. (3) Retrieve cases whose specifications are similar to the current
target situation using case-based reasoning which relies on concrete experiences in the form of codified
rules and strong domain models. (4) Infer or predict the changes to be made to adapt the retrieved cases
to fit the target situation using linked-data-based recommendation. (5) Formally represent and clearly
describe the solution used to solve the problem or to deal with the situation with a particular emphasis
on its practical wisdom side.
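The retrieval step in functions (2) and (3) can be sketched as a simple case-based reasoning loop: past problem-solving cases are described by feature sets and the most similar ones are retrieved with a Jaccard similarity. The cases are invented, and a real system would use richer, ontology-backed case representations and semantic similarity measures:

def jaccard(a, b):
    """Similarity between two feature sets (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Wisdom memory: past cases with their context features and adopted solution.
case_base = [
    {"name": "urban-air-2015", "features": {"air-quality", "urban", "transport"},
     "solution": "low-emission zone"},
    {"name": "rural-water-2012", "features": {"water", "rural", "agriculture"},
     "solution": "irrigation quota"},
    {"name": "city-congestion-2018", "features": {"urban", "transport", "congestion"},
     "solution": "congestion charge"},
]

def retrieve(target_features, cases, k=2):
    """Return the k past cases most similar to the target situation."""
    ranked = sorted(cases, key=lambda c: jaccard(target_features, c["features"]),
                    reverse=True)
    return ranked[:k]

target = {"urban", "air-quality", "congestion"}
for case in retrieve(target, case_base):
    print(case["name"], "->", case["solution"])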
The Wise Learning component is context-dependent and seeks to dynamically learn from concrete
experiences for future consciousness, exercising good judgment and adapting the practice modes accord-
ingly, using an iterative process of moral perception, wise action and continuous feedback. The main
functions of this component are: (1) To be consciously aware of oneself as a problem-solver by creatively
aggregating, synthesizing and applying the evidence-driven wisdom patterns/practices across domains;
which allows developing wise capabilities required to deal with inherent ambiguity, messiness and ab-
surdity of unforeseen challenges. (2) Establish connection between the patterns/practices inferred from
the wisdom memory and the strategic roadmap of the larger system. (3) Assess the progress and evaluate
the impact of the adopted patterns. (4) Flexibly collaborate with others around a common objective and
capture feedback. (5) Purposefully collate feedback into learning items and topics to be used to enrich
the wisdom memory used to reduce risks, build new knowledge frames and continuously improve the
performance. (6) Diffuse practically-proven wisdom patterns/practices into surrounding communities.


GENID FRAMEWORK: USE CASE IN POLICY-MAKING

Because this study is considered a multidisciplinary endeavor, and in order to examine the effective design, delivery, use and impact of the GenID framework to big data governance relying on data dynamics, the author looked for a case study, as the latter focuses on understanding phenomena in their natural setting and cultural context. On the other hand, policy-making typically takes place in complex environments
with many factors covering a whole spectrum of social, environmental, economic and technological
considerations. In recent years, most pressing challenges have been better managed by the introduction
of new tools and research streams, ranging from ontology engineering through data analytics to com-
plexity science and structuration theory. These new developments enable modern organizations to better
understand complex issues, anticipate possible scenarios, and make better policy decisions.
Unlike previous modes of scientific thought dominated by reductionism (the belief that everything
can be reduced to individual parts), cause-effect and determinism (fatalism), the complexity perspective
emphasizes expansionism (the belief that any system can always be a sub-system of some larger system),
producer-product and indeterminism (probabilistic thinking). The essence of this new view is encap-
sulated in the concept of systemic wholeness, which replaces analysis (revealing how a system works and gaining knowledge by understanding its parts) with synthesis (revealing why a system works the way it does by explaining its role in the larger system) (Pourdehnad et al., 2011: 3). So, organizations should
leverage from this promising perspective to facilitate their work because complexity science can add
more possibilities to the on-going efforts by providing a scientific lens for evidence-based approaches.
In doing so, adopting this perspective calls for using more integrated approaches to discover patterns
and derive principles to address problems from a holistic perspective.
Although considerable advances have been made in clarifying the usefulness of complexity principles to
explain social systems as a whole, concerns arise from the lack of understanding the human interaction
dynamics inside and outside these social systems (Hazy & Ashley, 2011: 60-62). An answer is to combine
the complexity perspective with the structuration theory formulated by Giddens to conceptualize this
interplay in social systems as an inseparable and intricate duality of the production and reproduction of
dynamic social structures as they coevolve with human interactions over space and time (Falkheimer,
2018: 190). Using this theory to understand the human interaction dynamics over social systems requires
using models, as they allow thinking about a phenomenon in a holistic way, give the management process a common language, and provide a framework for communicating changes and transitions.
Because this framework presents these characteristics, it looks suitable to provide promising answers
to the challenges facing complex-problem-solving in general and policy-making in particular, within
modern organizations.
Actually, despite countless efforts and spending on policy-making, many organizations do not generate satisfactory results. This problem does not lie in a lack of policies, but rather in the absence of a holistic, integrated and unified framework allowing policy-makers to ensure a sustainable impact. This challenge calls on
organizations to manage all the aspects fostering their policy-making capabilities as well as the necessary
tools and techniques supporting the expected change. This section focuses on this challenge and aims to
facilitate policy-making management by developing three conceptual models: (1) Policy-Making Meta-
Ontology allowing the nascent knowledge domain of policy-making to be semantically represented, and related patterns to be dynamically discovered using an ontology-driven paradigm. This meta-ontology could be used as a nucleus of the Data Interoperability Service Layer in GenID framework. (2) Policy-
Making Life-cycle allowing the interplay between data and organization’s business to be shaped, so the
policy-making processes could be easily simplified, systematically structured and clearly described us-
ing a data-driven approach. This life-cycle could be used as a nucleus of the Knowledge Mining Service
Layer in GenID framework. (3) Policy-Making Governance Activities allowing the policy change to take
place within a strategic planning using a wisdom-based perspective. These activities could be used as a
nucleus of the Wisdom Governance Service Layer in GenID framework.

Data Interoperability: Policy-Making Meta-Ontology

Schumpeter has argued that most innovations are not novel in themselves but they are novel combinations
of elements that already exist (Salter & Alexy, 2013: 5). So, by analogy with policy-making, which is a
creative activity, a new policy rarely involves a single idea or experience, but rather a bundle of experi-
ences that are brought together into a whole. Accordingly, the unification, integration and structuration
of the policy-making knowledge domain requires developing a whole ontology (Meta-Ontology containing
just the global key concepts) that can be decomposed into smaller modules to be developed independently
according to their particular application scenarios; and inversely, the developed modules can be mutu-
ally integrated to compose the larger ontology (Mother-Ontology containing all detailed specifications
of modules). Accordingly, and based on the distinctive features of a problem-solving activity within a complex context, the author elicits three concepts (see Figure 6) that she considers cornerstones for building a domain vocabulary to represent the policy-making knowledge area.
The three concepts underlying the Policy-Making Meta-Ontology, for which a minimal illustrative sketch is provided after the list, are:

1. 	Policy-Making Actor, which refers to individuals, organizations or communities playing a role, making an impact or having an interest/concern, as well as their interactions and involvement in policy-making. Expanding this concept helps to obtain appropriately focused communities as needed
in each phase/stage throughout the policy-making life-cycle, exchanging frequent feedback related
to goal attainment and linking actors’ abilities, recognition, rewards with the organization’s profit-
ability as well as the global goods.
2. 	Policy-Making Core-Object, which refers to the complete set of knowledge patterns discovered and
used by an actor to build, maintain and sustain a policy. Expanding this concept allows easy handling
and quick locating of relevant knowledge patterns, breaking content down into small chunks that
can be used independently and (re)used efficiently in various contexts. Revealing the core-object
of a policy involves excavating the most profound meaning and essence of the related activities by
making close observation in a highly mindful manner, asking deep questions and trying to get as
close as possible to the subject of investigation both intellectually and practically.
3. 	Policy-Making Context, which refers to the contextual variables (either internal or external) im-
pacting the policy-making process. Expanding this concept allows representing alliance-based or
risk-sharing contractual agreements between involved actors along the policy-making life-cycle;
which will assist policy-makers, researchers and community practitioners in planning strategies,
preparing practices, distributing benefits, allocating risks and minimizing conflicts.
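As indicated above, a minimal RDFS sketch of these three cornerstone concepts might look as follows; the namespace and the two linking properties are assumptions introduced only to show the shape of such a meta-ontology skeleton:

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

PM = Namespace("http://example.org/policy-making#")   # hypothetical namespace

onto = Graph()
onto.bind("pm", PM)

# The three cornerstone classes of the meta-ontology.
for cls in (PM.Actor, PM.CoreObject, PM.Context):
    onto.add((cls, RDF.type, RDFS.Class))

# A few illustrative relations linking the three classes.
onto.add((PM.playsRoleIn, RDF.type, RDF.Property))
onto.add((PM.playsRoleIn, RDFS.domain, PM.Actor))
onto.add((PM.playsRoleIn, RDFS.range, PM.CoreObject))

onto.add((PM.situatedIn, RDF.type, RDF.Property))
onto.add((PM.situatedIn, RDFS.domain, PM.CoreObject))
onto.add((PM.situatedIn, RDFS.range, PM.Context))

print(len(onto), "triples in the meta-ontology skeleton")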


Figure 6. Policy-making meta-ontology (Source: el Bassiti, 2018; adapted from el Bassiti, 2017)

Each of these three key concepts can be developed as a sub-ontology. Although there is no broad agreement on all the criteria a sub-ontology needs to satisfy, there is a general consensus on some of the main aspects, which can be summarized as follows: (1) Minimality: to be as small as possible and
each concept should be defined only once in the mother-ontology. (2) Correctness: to contain only the
concepts that are present in mother-ontology and this latter must be a representative union of all the
sub-ontologies. So, any knowledge that can be inferred from the sub-ontology should be possible to be
inferred as well from the mother-ontology. (3) Completeness: to contain all information relevant to the
related elements within the mother-ontology. So, there would be no difference in the logical consequence of
importing an ontology-module or the mother-ontology. (4) Understandability: to be easily understood,
readable and navigated for the designer and the end user.

Knowledge Mining: Policy-Making Life-Cycle

Policy-making is a creative activity that is not different from other forms of problem-solving activities
and has the same requirements for data. According to Bakhshi and Mateos-Garcia (2016: 6) policies need
to be prioritized, targeted, designed, implemented, evaluated and adapted, and data has a crucial role to
play in all stages of that policy cycle. Following a complexity-driven perspective, an efficient and more
effective policy-making process has to shift from a narrow vision focusing on national self-interest, to a
wider perspective seeking global public goods, where a new balance is continually sought between social,
economic and environmental challenges and goals. Accordingly, policy-making can be conceptualized
as the co-production of knowledge arising from the collaboration of multiple knowledge providers.
In this view, the author structures policy-making as a process model called Policy-Making Life-cycle
(see Figure 7) broken down into two phases:


1. 	Policy-Making Design: an analytic phase of creation and making that aims to support individual
or collective effort to generate creative insights about policies by identifying the local problems
to be addressed, investigating the existing policies and searching for new sources of inspiration using
semantic similarity and knowledge discovery (Generation Stage). Next, integrate the generated ideas
about policy into the strategic roadmap of the larger system while espousing a balanced perception
(Networking Stage). Then, establish a clear vision about the aims, the ways to do and the expected
results, synthesize complementary groups of ideas to form a new compelling policy (Modeling
Stage). The main deliverable of this phase is a conceptually-designed Policy-Framework.
2. 	Policy-Making Adoption: a synthetic phase of learning and doing that aims to convert the initial
rough design into factual outcomes and find out the right way to put them into right application
by initiating a pilot program to investigate the designed policy framework; in order to minimize
risks, maximize opportunities and estimate the needed cost and time (Validation Stage). Next,
specify the required competencies (actors), resources and abilities as well as the expected return;
and integrate the whole into the targeted system while embracing a holistic view; in order to ensure
an added value for the global system (Implementation Stage). Then, formally screen and clearly
present the policy to the large public (all players, stakeholders and anyone concerned) and capture
feedback, engage different adopters with multidisciplinary perspectives and capabilities; so the
adopted policy framework could be practically reviewed and calibrated as well as continuously
improved by espousing a regular measurement of the impact the large adoption of the policy has
had (Exploitation Stage). This phase may require a good deal of time before the true impact occurs and before the measurement of how the system changes as the policy is introduced and used can take place.
The main deliverable of this phase is a practically-adopted Policy.

Each stage is followed by a Gate, a decision point that allows for a pause in order to synthesize the current state of progress into the whole. At the heart of the life-cycle, a Learning Engine has been defined
to enable learning to occur and flow, and alignment to be kept with the Context (organization’s strategy,
goals, needs...). Identifying these phases, stages and iterations is instrumental in enabling all stakehold-
ers to increase their creative and collaborative capabilities in a systematic way.
Although these phases and stages appear successive, the inside iterations can overlap, and each stage
is strengthened by an iterative process of feedback, learning, improvement, and evaluation to keep the
relationships among the involved actors around a given task or knowledge-object interactive and mean-
ingful in a given context. This iterative process will guide the process of inquiry to define problems or
capture new possibilities, then design solutions or alternatives before putting the outcomes into practice. The model, therefore, aligns with flexible process models rather than linear ones.
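To make the stage-gate structure concrete, the following sketch encodes the two phases, their stages and the gate decisions as a simple loop. The stage names come from the life-cycle above, while the gate criterion and the progress scores are placeholder assumptions:

# Stages of the policy-making life-cycle, grouped by phase (names from the text).
LIFE_CYCLE = {
    "Design":   ["Generation", "Networking", "Modeling"],
    "Adoption": ["Validation", "Implementation", "Exploitation"],
}

def gate(stage, progress):
    """Placeholder gate decision: pause and assess before moving on."""
    return progress.get(stage, 0.0) >= 0.8   # hypothetical threshold

def run_life_cycle(progress):
    for phase, stages in LIFE_CYCLE.items():
        for stage in stages:
            if gate(stage, progress):
                print(f"{phase}/{stage}: gate passed, continue")
            else:
                print(f"{phase}/{stage}: gate not passed, iterate and learn")
                return phase, stage            # feed back into the Learning Engine
    return None                                # policy practically adopted

run_life_cycle({"Generation": 0.9, "Networking": 0.85, "Modeling": 0.6})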

Wisdom Governance: Policy-Making Governance Key Activities

To achieve a balanced perception of policy-making, the assumption that economic, social and environ-
mental strands matter and are interrelated has to be considered. To concretize such a perception and
achieve a deep change the key activities underlying policy governance should be identified. Deep change
has been always connected with a reflection on deep assumptions and stepping out a core of reference,
which involves going beyond the boundaries of the pre-structured space of knowledge and reframing
it in the sense of constructing and establishing new dimensions and new semantic categories. While
features of complexity science have been considered as an argument for systems thinking, the necessity


Figure 7. Policy-making life-cycle (Source: el Bassiti, 2018; adapted from el Bassiti et al., 2017)

for variety and multi-reasoning pathways and methodologies to explore possibilities of what could be,
and to create desired outcomes that benefit the whole seems to call for a design mindset that is problem
finding-based, solution-focused and action-oriented.
On the other hand, change itself is not new, but it may feel different and faster today because new ICT allows communication with anyone at any time, whereas in the past change was slow because it took so long for information to get from one place to another, which is obviously no longer the case. However,
organizations’ ability to adapt to the fast pace of change is largely dependent on the availability of the
right set of competencies, the implementation of a flexible and wise strategy as well as the advanced
use of technology. This allows the creation of a climate conducive to creativity, collaboration, learning
and wise judgment.
In this perspective, the Policy-Making Governance Key Activities (see Figure 8) required to ensure
good policy governance and successful policy change seeking a sustainable future are:

1. 	Creative Imagination is about the creative balancing and integration of the full range of human mental capacities, as well as using that integrative ability to bring out sound and radical alternatives and act in the best interests of the whole. Based on ethical principles, creative imagination is necessary to flourish and maintain a decent society, as it allows one to creatively see consequences and new possibilities, and to navigate ethical dilemmas and tensions.


2. Collaborative Working is about collective engagement which requires coming together to think
collectively about circumstances, to form a joint identity, to combine and coordinate efforts and to
become united behind a common purpose. Being innate, collaborative working is the art form of the future, as it takes no great insight to realize there is no choice but to think and ponder together, because alone there is no way to confront this tangled world.
3. Intentional Learning is about creative synthesis and application of knowledge across domains,
flexible collaboration with others and active commitment to the unrelenting pursuit of truth. Being
based on active engagement, intentional learning requires the development of interpretative skills
and deep learning consciousness, the capacities to think and read critically, to communicate clearly
and persuasively and to participate thoughtfully in solving complex problems.
4. Wise Judgment is about using a blend of intelligence, creativity, experience and virtue. In other
words, blending the ability to see and comprehend a complex situation encountered in experience
(to have a whole perception), the ability to discover and evaluate possibilities determined by a
particular circumstance, as well as the ability to move fluidly between producing alternatives and
evaluating them and to operate at both levels simultaneously. However interesting it may be, wise judgment cannot take place unless it is internalized and advanced by all concerned actors.

Figure 8. Policy-making governance key activities (Source: el Bassiti, 2018)


CONCLUSION AND FUTURE RESEARCH DIRECTIONS

With the recent explosion of digital data, often unstructured and unwieldy yet incorporating a huge number of signals in the noise waiting to be released, the big data movement seeks to glean intelligence from data and translate it into improved processes and performance. So, managers would be more willing
to make better predictions and smarter decisions, and target more-effective interventions in areas that
so far have been dominated by gut-feeling and intuition rather than data and rigor. Yet, over time a lot
of attention has been granted to the creation and protection of organizational data sources, while little
attention has been paid to the dynamics of transforming data from one form into another until it becomes valu-
able knowledge. Since organizational performance is tightly related to data dynamics, clearer specifica-
tion of the forms data could take and the transition from one form to another should assist managers in
improving their overall organizational performance. In spite of investment, enthusiasm and ambition
to leverage the power of data dynamics, transforming raw data into valuable knowledge, results vary in
terms of success, as organizations still struggle to forge what would be considered a data-driven culture.
The research work presented in this chapter set out with a goal to identify a generic, holistic and
integrated framework for big data governance based on data dynamics and leveraging semantic
technology. The data dynamics process aims to transform data within large-scale databases into inte-
grated knowledge, then formal evidences towards practical wisdom. This process has been divided into
three phases: (1) data interoperability, (2) knowledge mining and (3) wisdom governance. Based on this
process a semantically-enriched architecture has been designed. The main technologies underlying this
architecture are: (1) semantic annotation for big data indexing, (2) ontology engineering for knowledge
representation allowing big data integration, (3) knowledge discovery for patterns mining, (4) case-based
reasoning and semantic search for judicious reasoning, (5) exploratory search and linked-data-based
recommendation for wise learning. Next, a case study for policy-making has been developed to show-
case how using GenID framework to big data governance to deal with complex problem-solving while
espousing an ethical lens.
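As a purely illustrative sketch (the toy ontology, record values and function names below are invented for this example and are not the GenID implementation), the three phases can be read as a pipeline in which raw records are first annotated against a shared ontology, recurring concept patterns are then mined, and only patterns passing a simple ethical filter are retained:

```python
# Illustrative sketch (not the authors' implementation): the three GenID phases
# -- data interoperability, knowledge mining and wisdom governance -- expressed
# as a minimal pipeline over toy policy-making records.

from collections import Counter

# A toy "ontology": maps raw field values to shared concepts (data interoperability).
ONTOLOGY = {
    "co2": "Emission", "nox": "Emission",
    "bus": "PublicTransport", "tram": "PublicTransport",
}

def annotate(record):
    """Semantic annotation: tag each raw term with its ontology concept."""
    return {term: ONTOLOGY.get(term, "Unknown") for term in record}

def mine_patterns(annotated_records):
    """Knowledge mining: count which concepts co-occur across records."""
    pairs = Counter()
    for rec in annotated_records:
        concepts = sorted(set(rec.values()))
        for i, a in enumerate(concepts):
            for b in concepts[i + 1:]:
                pairs[(a, b)] += 1
    return pairs

def govern(patterns, ethical_blacklist=("Unknown",)):
    """Wisdom governance: keep only patterns that pass a simple ethical filter."""
    return {p: n for p, n in patterns.items()
            if not any(c in ethical_blacklist for c in p)}

if __name__ == "__main__":
    raw = [["co2", "bus"], ["nox", "tram"], ["co2", "tram"]]
    annotated = [annotate(r) for r in raw]
    print(govern(mine_patterns(annotated)))
    # {('Emission', 'PublicTransport'): 3}
```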
Practically, this research work is intended to be useful for all types of organizations (large or small, private or public), whatever their work area and in any sector of activity. It is a method to create awareness of the importance of big data governance and to promote the growth of a practical-wisdom-oriented culture that could make this effort easier. It will support fast and intuitive management of complex activities like policy-making and innovation while keeping an eye on the qualitative perspective. In particular, implementing the GenID approach to policy-making will benefit all categories of organizations by enhancing their capability to have a sustainable impact and to participate in global wealth creation. In terms of directions for future work, Bean (2017) has argued that the next phase will be to use data dynamics for generating more radical ideas and creating more disruptive innovations. Accordingly, a wider use of the GenID framework for big data governance in different contexts related to complex problems is planned.


REFERENCES

Bakhshi, H., & Mateos-Garcia, J. (2016). New data for innovation policy. OECD Blue Sky Conference.
Bean, R. (2017). How Companies Say They’re Using Big Data. Harvard Business Review. Retrieved
July 20, 2018, from https://hbr.org/2017/04/how-companies-say-theyre-using-big-data
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 29–37.
PMID:11323639
Breslin, J., & Decker, S. (2007). The future of social networks on the internet: The need for semantics.
IEEE Internet Computing, 11(6), 86–90. doi:10.1109/MIC.2007.138
Catmull, E. (2008). How Pixar Fosters Collective Creativity. Harvard Business Review, 1–13.
De Virgilio, R., Giunchiglia, F., & Tanca, L. (2010). Semantic web information management: a model-
based perspective. Springer Science & Business Media. doi:10.1007/978-3-642-04329-1
El Bassiti, L. (2017). Generic Ontology for Innovation Domain towards “Innovation Interoperability”.
Journal of Entrepreneurship Management and Innovation, 13(2), 105–126. doi:10.7341/20171325
El Bassiti, L., El Haiba, M., & Ajhoun, R. (2017). Generic Innovation Designing -GenID- Framework:
Towards a more Systematic Approach to Innovation Management. Presented at the 18th European Con-
ference on Knowledge Management (ECKM), Barcelona, Spain.
Falkheimer, J. (2018). On Giddens: Interpreting Public Relations through Anthony Giddens’s Structuration
and Late Modernity Theories. In Ø. Ihlen, B. Van Ruler, & M. Fredriksson (Eds.), Public Relations and
Social Theory: Key Figures, Concepts and Developments (2nd ed.; pp. 177–192). London: Routledge.
Gormley, W. T. (2011). From science to policy in early childhood education. Science, 333(6045), 978–981.
doi:10.1126/science.1206150 PMID:21852491
Gruber, T. (2007). Ontology of folksonomy: A mash-up of apples and oranges. International Journal
on Semantic Web and Information Systems, 3(2), 1–11. doi:10.4018/jswis.2007010101 PMID:18974854
Gruber, T. (2008). Collective Knowledge Systems: Where the Social Web meets the Semantic Web.
Journal of Web Semantics, 6(1), 4–13. doi:10.1016/j.websem.2007.11.011
Hazy, J. K., & Ashley, A. (2011). Unfolding the future: Bifurcation in organizing form and emergence
in social systems. Emergence, 13(3), 58–80.
Höchtl, J., Parycek, P., & Schöllhammer, R. (2016). Big data in the policy cycle: Policy decision making
in the digital era. Journal of Organizational Computing and Electronic Commerce, 26(1-2), 147–169.
doi:10.1080/10919392.2015.1125187
IEEE. (1990). IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glos-
saries. IEEE.


Jacoby, R. (2013). Foreword. In R. Picciotto & R. Weaving (Eds.), Security and Development: Investing
in Peace and Prosperity (pp. 3–6). Routledge.
Manyika, J. (2011). Big Data: The Next Frontier for Innovation, Competition and Productivity. McK-
insey Global Institute.
Marcus, G. (2013). Steamrolled by big data. The New Yorker. Retrieved July 20, 2018, from https://
www.newyorker.com/tech/elements/steamrolled-by-big-data
McAfee, A. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60–68.
PMID:23074865
Nowak, B. (2009). The Semantic Web Technology Stack (not a piece of cake...). Retrieved July 20, 2018,
from http://bnode.org/media/2009/07/08/semantic_web_technology_stack.png
Nutley, S., Walter, I., & Davies, H. T. O. (2007). Using Evidence: How Research Can Inform Public
Services. Bristol, UK: The Policy Press. doi:10.2307/j.ctt9qgwt1
Peters, I. (2009). Folksonomies: Indexing and Retrieval in Web 2.0. Berlin: De Gruyter Saur.
doi:10.1515/9783598441851
Pourdehnad, J., Wexler, R., & Wilson, V. (2011). Systems & Design Thinking: A Conceptual Framework
for Their Integration. Organizational Dynamics Working Papers, 1-16.
Prewitt, K., Schwandt, T. A., & Straf, M. L. (2012). Using science as evidence in public policy. Wash-
ington, DC: National Academies Press.
Salter, A., & Alexy, O. (2013). The nature of Innovation. In M. Dodgson, D. Gann, & N. Phillips (Eds.),
The Oxford Handbook of Innovation Management. Oxford, UK: OUP.
Schroeck, M. (2012). Analytics: The real-world use of big data. Academic Press.
Spillane, J. P., & Miele, D. B. (2007). Evidence in practice: A framing of the terrain. In P.A. Moss (Eds.),
Evidence and Decision Making. 106th Year book of the National Society for the Study of Education
(Part I, pp.46-73). Malden, MA: Blackwell.
USCCF. (2014). The future of data-driven innovation. Author.
USDOD. (2000). Joint Vision 2020. Retrieved July 20, 2018, from http://www.fs.fed.us/fire/doctrine/
genesis_and_evolution/source_materials/joint_vision_2020.pdf
Weller, K. (2010). Knowledge representation in the social semantic web. Walter de Gruyter.
doi:10.1515/9783598441585
Wessel, M. (2016). You Don’t Need Big Data — You Need the Right Data. Harvard Business Review.
Retrieved July 20, 2018, from https://hbr.org/2016/11/you-dont-need-big-data-you-need-the-right-data
Zeng, M. L. (2008). Knowledge Organization Systems (KOS). Knowledge Organization, 35(2-3), 160–182.


KEY TERMS AND DEFINITIONS

Big Data: Enormous amount of unstructured, heterogeneous and distributed data that is constantly
changing over time and depending on the surrounding context. Big data can be seen as both a resource
and a process, both of which are linked to the different interactions occurring on the web.
Case-Based Reasoning: Aims to (re)use previous experiences defined as codified rules and strong
domain models (cases) to deal with a challenging situation using similarity retrieval techniques.
Complex Problem-Solving: Refers to the co-production of knowledge arising from the collabora-
tion of multiple knowledge providers seeking global public goods by embracing a balanced perception
between social, economic and environmental challenges and goals.
Complexity Science: Aims to understand how things are connected with each other, and how these
interactions work together. It is concerned with the study of emergent order in what otherwise may be
considered as very disorderly systems.
Data Dynamics: The dynamics that emerge from the processes of iterative transition from creation;
through usage, transformation, and movement; to the diffusion of valuable data.
Evidence: Based on understandable and context-rich insights elicited from knowledge, evidence is
an abstract and structured pattern disembodied from its context of creation (de-contextualized/general-
ized model).
GenID Framework to Big Data Governance: A generic approach that aims to leverage from explicit
and formal representations of knowledge domains (ontologies) to structure and represent useful and
relevant data in a uniform way, so it could be mined and leveraged to meet organizational requirements
for linking information, detecting and structuring emergent process, and providing insight into and from
spontaneous communities and emergent collaborative structures.
Interoperability: To ensure cross-domain sharing of knowledge patterns by translating big data
stores into knowledge bases using ontologies.
Knowledge: Meaningful, subjective, and semantically constructed information by integration of data
from selected sources and giving it meaning using ontological representation of the knowledge domain
(contextualization).
Meta-Ontology: A whole ontology containing just the global key concepts.
Mother-Ontology: A whole large ontology containing all detailed specifications of modules.
Ontology Modularization: An approach allowing an ontology to be perceived simultaneously as
a whole and as a set of parts (modules). A modular ontology can be designed either by composition
(independently developing modules that can be integrated coherently and uniformly) or decomposition
(extracting independent modules from an integrated ontology for supporting a particular use case).
Policy-Making: The act of creating a deliberate system of principles to solve problems, guide deci-
sions, improve the quality of life and achieve global prosperity.
Practical Wisdom: The capacity to make informed, rational judgments without recourse to a formal
decision procedure.
Structuration Theory: Aims to conceptualize the interplay in social systems as an inseparable and
intricate “duality” explaining the production and reproduction of dynamic social structures as they co-
evolved with human interactions, over space and time.


Sub-Ontology: A reusable module of a larger ontology which is self-contained and logically con-
sistent, as well as tied to other sub-ontologies within the mother ontology.
Wisdom: Using moral-based evidence to inform reasoning while espousing a moral lens towards
making wise judgment.
Wisdom Governance: To develop a cross-domain and moral-driven system of governance that
emerges out of the transaction between practicing the acquired wisdom while focusing on the ethical
dimension, establishing a connection with the strategic roadmap of the larger system (organization or
market), as well as assessing the progress and evaluating the impact.
Wisdom Memory: Knowledge base of structured representations of past problem-solving experi-
ences, considered as a practically-proved source of problem-solving expertise.


Chapter 8
Big Data Governance in
Agile and Data-Driven
Software Development:
A Market Entry Case in the
Educational Game Industry

Lili Aunimo
Haaga-Helia University of Applied Sciences, Finland

Ari V. Alamäki
Haaga-Helia University of Applied Sciences, Finland

Harri Ketamo
Headai Ltd., Finland

ABSTRACT
Constructing a big data governance framework is important when a company performs data-driven
software development. The most important aspects of big data governance are data privacy, security,
availability, usability, and integrity. In this chapter, the authors present a business case where a frame-
work for big data governance has been built. The business case is about the development and continuous
improvement of a new mobile application that is targeted for consumers. In this context, big data is used
in product development, in building predictive models related to the users, and for personalization of
the product. The main findings of the study are a novel big data governance framework and the observation that a proper framework for big data governance is useful when building and maintaining trustworthy and value-adding big data-driven predictive models in an authentic business environment.

DOI: 10.4018/978-1-5225-7077-6.ch008

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Big Data Governance in Agile and Data-Driven Software Development

INTRODUCTION

A big data governance framework is critical when a company performs data-driven software development and conducts data-driven business. The authors adhere to the common definitions of big data governance and of data
governance, which were presented by Sarsfield (2009), Soares (2012), and DAMA (2017). To meet the
requirements of an agile start-up company that needs to manage and govern big data based predictive
models and the data related to them, the authors propose a framework for big data governance. This
framework includes five key dimensions for big data governance: data privacy, security, availability,
usability, and integrity. Each dimension is described in the big data governance framework section.
These five key dimensions are important in building and managing a successful big data driven business.
In this chapter, the authors present the business case based on which the authors have derived the
proposed big data governance framework. The business case concerns the development and continuous
improvement of a new mobile educational application that is targeted at children and young people who
wish to learn to play soccer. In this context, big data are used in three contexts: 1) product and business
development, 2) building predictive models about the learners' progress in general, and 3) personalisation
of the product to meet the needs of each individual learner.
Previous research and guidelines on how to govern big data (e.g., Soares, 2012; DAMA, 2017) have been published. However, there is a research gap concerning big data governance in agile, big data driven start-up companies, especially regarding how to govern big data in the product development phase.
The main finding of the study is the proposed novel big data governance framework and the fact that
a proper framework for data governance is necessary when developing trustworthy and value-adding
big data driven software products in an authentic business environment. Without big data governance,
the predictive models and other data-driven applications may not bring added value to the business
because their trustworthiness is uncertain, they might violate the right to privacy of customers, or they
might not be available for use when needed. In addition, without proper big data governance, they might
not meet the needs of the business, or valuable data might be leaked to competitors because of the poor
governance and weak security of the big data. If a company succeeds in big data governance, data can
become its most valuable asset (Panian, 2010).

BACKGROUND

The field of big data governance emerged with the advent of big data. Big data are data that cannot be
processed using traditional data processing software and infrastructure on a personal computer or on
a dedicated server (e.g., Liebowitz, 2013). In addition, big data differ from traditional data in volume,
variety, and/or velocity (Liebowitz, 2013). Typical examples of big data include web and social media
data, machine-to-machine data, big transaction data, biometric data and human-generated data (Soares,
2012). The authors include Internet of Things (IoT) data and data generated in industrial processes in
the broad set of transactional data.

Need for Big Data Governance in Companies

In technology companies, there is an increasing need to develop data analytics especially in launching
new products and services. The diffusion of technology has increased rapidly (Downes & Nunes, 2013),


increasing the need to analyse consumer behaviour in developing new products and services. Further-
more, such changes call for new models of data analytics in launching and developing new products
and services in the rapidly changing digital markets. Companies that launch new mobile applications
and other online services expect thousands of downloads, good user reviews in the market place, and
less turnover in paid services. Thus, understanding consumer behaviour is essential for the entry of
new products and services in the market, and big data analytics play a crucial role in this endeavour. To
understand consumer behaviour in situations where consumers use the mobile application for the first time,
multi-source data analytics provide richer information than a single data source does. Thus, the aim is
to develop a model that combines user data and open data in the product development initiatives of new
mobile services. This model could provide new data sources for personalization and predictive models
about the users. In addition, it could provide open data-based opportunities for innovative ideas about
product features, which then would create more value for consumers in using mobile applications.

Data Management and Data Governance

Many companies have established and mature procedures for data governance. Sarsfield (2009) defined
data governance as a set of processes that ensure the formal management of important data assets through-
out the organisation. It guarantees that data can be trusted, and that people can be made accountable for
any adverse event that happens because of poor data quality. The Data Management Body of Knowledge
(DMBOK) provides the following concise definition of data governance:

The exercise of authority, control, and shared decision making (planning, monitoring and enforcement)
over the management of data assets. (DAMA, 2017)

Although many companies have mature procedures for data governance, very few have any procedures
for the governance of big data for two reasons: first, the field of big data is still immature; second, the
existing big data applications are geared more toward exploratory data analysis and discovery than toward
traditional business intelligence. The latter reason has created a vicious circle: to be governed, data need
to be modelled, and to be modelled, they need to be explored (du Mars, 2012). However, it is probable
that big data, like any other company data, will soon be governed routinely. One indicator of this is that
the Global Data Management Community (DAMA) has dedicated a subchapter to big data governance
in its International Guide to Data Management Body of Knowledge (DAMA DMBOK, DAMA 2017,
Chapter 14.6).
Data management is strongly related to data governance. DAMA International defines data gover-
nance as a part of data management. This relationship is shown in Figure 1.
The nine disciplines of data management that are listed inside the sectors of the circle in Figure 1
are the following:

1. Data Architecture Management: Defining the overall process of managing all data assets in an
organization.
2. Data Development: Analysis, design, implementation, testing, deployment and maintenance of
data.
3. Database Operations Management: Supporting all actions in the database lifecycle: from data
acquisition to data integrity management.


Figure 1. Data governance as defined by DMBOK

4. Data Security Management: Ensuring privacy, confidentiality and appropriate access to data.
5. Reference and Master Data Management: Acquiring and maintaining the relatively stable data
concerning the business domain and customers.
6. Data Warehousing and Business Intelligence Management: Enabling reporting and analytics.
7. Document and Content Management: Managing data found outside of relational databases and data warehouses.
8. Meta-Data Management: Integrating, controlling and providing descriptive information about
the data assets.
9. Data Quality Management: Defining, monitoring and improving the correctness and trustworthiness of data.

One could imagine that before big data governance can be established, an organisation should already
have in place a framework for data governance. However, in start-up companies where big data plays a
key role in the business, this is not the case. Such companies need to start by implementing a big data governance framework directly in their organisation.

Big Data in Software Development

In this study, special attention is paid to the software development process from the big data perspective. The software development process is multidimensional, and several factors affect its success. Flyvbjerg and Budzier (2011) showed how failures in software development or implementation processes can cause significant damage to a business or even bring down entire companies. They showed that overruns of development costs and schedules are less fatal for companies than damage to their business operations and customer satisfaction. Additionally, significant budget and schedule overruns are not rare: Flyvbjerg and Budzier (2011) revealed that one in six large IT projects overruns its budget by 200% and its schedule by 70%. Thus, the successful management of mobile software projects is not only a technological endeavour; it also deals with end-user and business perspectives (Alamäki & Dirin, 2015). Alamäki and Dirin (2015) state that several stakeholders need to be involved in the mobile application development process: developers and designers can evaluate the feasibility of technological features, business professionals validate features against business needs, end users focus on the user experience, and industry experts contribute to the business model of a new mobile application.
Big data provide new ways to improve product and service development and innovation (Paajanen, Valkokari & Aminoff, 2017; Tao et al., 2018). Advanced data collection and analytics help designers, developers and product managers monitor the experience and behaviour of potential users and optimize decisions on investing in the development of new product features (Chen, Zhang & Zhao, 2017). Additionally, more effective data collection and analytics provide useful information to the decision makers of product life cycle management (Zhang et al., 2017). Thus, big data analytics enhance the design and development process across industries and service businesses.
In Figure 2, the authors relate the mobile development perspectives of Alamäki and Dirin (2015) to the business risks of IT projects (Flyvbjerg & Budzier, 2011) and to the principles of data-driven product and service development and management (Chen, Zhang & Zhao, 2017; Zhang et al., 2017; Tao et al., 2018). Big data analytics provides useful information to the various stakeholders of mobile application development, and each stakeholder has a different perspective on the data. Figure 2 shows that mobile application development should focus on several different data sources that analysts can combine when searching for, for example, patterns, segments and profiles. User data create information about the experience of end users, such as usage time, user profiles, navigation paths, activities and preferences. Application data generate device- and performance-related information, such as the end terminals, operating systems and IP addresses used. For example, data saved to the log files of the software systems can potentially provide both user and application data; the goals of the analytics determine which category the collected data belong to. Situational data show the location where the application is used and in what type of context. Business data help managers monitor customer purchase behaviour, such as orders and rankings, and the business environment in which a new product or service will compete. Each data source provides a different type of information to the mobile development process.

Figure 2. Data management perspectives in new application development
(adapted from Alamäki & Dirin, 2015)
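As a purely illustrative sketch (the field names and values below are invented for this example and are not the case company's schema), the four data categories just described might be combined into a single analytics record roughly as follows:

```python
# A minimal sketch (hypothetical field names, not the case company's schema)
# of how the four data categories described above -- user, application,
# situational and business data -- might be combined into one analytics record.

from dataclasses import dataclass, asdict

@dataclass
class UsageRecord:
    # User data: what the end user did in the application
    user_segment: str
    session_minutes: float
    navigation_path: list
    # Application data: device- and performance-related information
    device_model: str
    os_version: str
    # Situational data: where and in what context the app was used
    city: str
    weather: str
    # Business data: signals about purchase behaviour and market context
    is_paying_customer: bool
    store_rating: int

record = UsageRecord(
    user_segment="junior_player", session_minutes=12.5,
    navigation_path=["home", "drill:passing", "video"],
    device_model="Android mid-range", os_version="8.1",
    city="Helsinki", weather="light rain",
    is_paying_customer=False, store_rating=4,
)
print(asdict(record))  # one row that analysts can cluster or profile
```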
Agile, big data driven software development and data governance are often seen as having completely different goals that cannot be pursued in the same project. However, proper big data governance does enable an organization to remain agile (Panian, 2010). In an agile software development project, data governance should also be done in an agile manner. In an agile software company, big data governance should be applied within the existing agile way of working and not as a completely new process (Seiner, 2014).

Privacy Issues of Personal Data

It is important to focus on users’ privacy concerns in designing and planning commercial mobile ap-
plications (Malhotra, Kim, & Agarwal, 2004). Several studies showed that privacy concerns, such as
the feeling that privacy has been invaded or threatened, affect users’ online behaviour and willingness
to use a mobile application (Ackerman, Cranor, & Reagle, 1999). Moreover, privacy issues have a legal
dimension, which is codified in the European Union’s (EU) general data protection regulation (GDPR).
The GDPR has applied since 25 May 2018 in EU countries, providing new rights for individuals to control how their data are collected, processed, and used by companies. The regulation includes strong sanctions for companies that contravene it. Furthermore, companies are required to demonstrate how
they apply the principles of GDPR in their internal data management and privacy processes. In previous
research on e-commerce, data privacy has been related to online trust (Bart et al., 2005). Trust between
the application provider and users affects users’ willingness to provide their personal information to
such providers. Trust is also essential in situations where users consider paying for the use of application
or where they are eager to recommend an application to their friends (Dirin, Laine, & Alamäki, 2018).
Hence, data privacy should be carefully considered in managing big data governance, as it directly affects users' trust.
The type of information that users provide when they access digital services could also affect their
privacy (Acquisti & Gross, 2006). Thus, it is important to identify the type of information that compa-
nies need when they collect user data and develop policies according to such information. For example,
Acquisti and Gross (2006) showed that users were more comfortable in providing general information,
such as gender and age, than in revealing their home address and phone number. Information such as
gender and age is easier to observe and obtain publicly than private information, such as a home address.

Security: Technologies and Processes

In addition to privacy, users consider security issues when they want to adopt new digital services and
technologies (Wilkowska & Ziefle, 2012). The feeling that an application is secure and that it protects
the privacy of users is closely related to the commercial success of the application. Distributed networks
and data processing in cloud systems create security concerns for users. The simultaneous sharing of data
on networks leads to users’ concerns about protecting their personal information (Chen & Zhao, 2012).
Thus, the issue of security does not involve a single application provider, but it extends to communities that create policies regarding the standards and practices of the development and management of
cloud systems (Kaufman, 2009). Hence, companies need to define how they have secured users' personal information, not necessarily in terms of technology but from the viewpoint of process management and documentation. Security issues are an essential dimension of process management in the big data governance policies of technology companies. In fact, problems in ensuring the security of users' information have had negative consequences for the reputations of companies whose user management servers have been hacked and whose users' personal information has been published on the public Internet.
The management of big data security includes both the processes and the technological solutions that ensure the security of big data; it differs from the management of privacy in that it covers only the data (including big data) assets of the company. In addition, this management is concerned with the company's internal processes and the technological solutions used to ensure that each data asset has
the correct level of security. The most important feature of data security is that only authorised people
and software can access the data (Otto, 2011). Big data governance typically consists of both human-
controlled procedures and software that automatically monitors the implementation of big data security
as agreed by the organisation.

Availability of Data

Data availability means that authorised users have access to the data, software, and hardware upon request
(Zissis & Lekkas, 2012). Thus, the availability of big data involves providing the right data to the right
people and processes at a specific time such that when a piece of information is needed, the organisation
can find it. The provision of big data has been included in defining the information architecture (Godinez
et al., 2010; Morville, 2007) of a company or in designing enterprise searches (White, 2015). Information
governance can be used to build metrics for assessing the availability of information in an organisation.

Usability: Value to the Business

The usability of big data means that the big data of an organisation must be in line with the organisation’s
business needs, the overall corporate strategy, and the corporate information strategy. Simply put, the
usability of big data ensures that it can be monetized. Ensuring that data meets business needs is one of
the core goals of data governance (Panian, 2010). It is not always easy to establish measures for defining
the value for a business achieved by big data. For example, an organisation may gain value from big data
because big data analytics helps it to build an accurate model of its customers. This model may be used
to enhance the customers’ experience of the company’s products as well as in targeted marketing, among
other uses. Although it is difficult to measure exactly the degree to which big data contributes in these
endeavours, big data governance provides the appropriate tools to ensure that the big data processed by
an organisation meets its business needs.

Integrity: Trustworthiness, Quality, and Completeness

According to the principles of data integrity, only authorised parties can manage and modify data (Hashem
et al., 2015). Thus, the aim of data integrity is to prevent the unauthorised use of data (Zissis & Lekkas,
2012) to ensure their trustworthiness and quality. This aim relates to the concept of accountability and
professionalism of the management of data by companies. The term big data integrity means that the


big data used by a company is trustworthy or that the organisation knows the extent to which the data
may be trusted. Big data is not as trustworthy as traditional data in an organisation. This dimension of
big data is called veracity (Ramesh et al., 2018). Big data often need to be cleaned to remove erroneous
and unusable data. In some cases, data management must accept that the quality of the data is quite low.
However, it is very important to know the level of the quality of the data. Another critical issue regard-
ing big data integrity is the integration of big data from various sources. Unsuccessful data integration
is also a potential source of erroneous data. This issue is important because many big data projects rely
on the integration of data from various sources. Big data governance provides measures to ensure that
the quality of the data is as agreed. It also provides accountability in the case of an error caused by low
data integrity.

BUSINESS CASE DESCRIPTION

This study demonstrates through a genuine business case the development of big data governance by
a company that provides mobile and web services for its customers. The business case is about the
product development process of new software in an agile start-up company. The new software is an
educational game.

Methodology of the Business Case Research

The method chosen to conduct this research is the case study (Eisenhardt, 1989). The aim was to develop
big data governance in the new product and service development phases. The authors also adopted an
abductive qualitative research approach (Dubois & Gadde, 2002) to develop a new framework for big
data governance. The framework is based on observations concerning big data governance in the case
company’s product development project. In abductive research, the researchers simultaneously review the
prior literature and theories, and they analyse data gathered through empirical research and development
work (Dubois & Gadde, 2002). In this study, the adoption of this iterative research process allowed for
developing a deeper understanding of the empirical data being analysed while simultaneously contribut-
ing to the theory of big data analytics in consumer research.

The Business Case

This case study is based on empirical research on mobile application development. The case company
is an agile software company that is developing a big data driven soccer training mobile application
which uses an artificially intelligent bot that trains junior soccer players.
The use of big data related to the business case can be divided into three main categories:

1. Data used for product and business development,
2. Data used for building predictive models about the general progress of the players, and
3. Data used for personalisation of the product.

Each of these categories will be described in detail in the following. Firstly, log data originating from
user actions while interacting with the software reveals all the bottlenecks or traps a user may face when


using the product. This data is used in developing the product in a user-centric way. In addition to real
user data, artificial data mimicking human behaviour is generated using a genetic algorithm. This data
is used to simulate numerous users and let them run through the game at any time during the development phase (Ketamo, 2008; Ketamo, 2010). This kind of approach is used to mimic the players' behaviour at large scale. Care has to be taken to make sure that all the possible steps are built and that the evolutionary algorithm constructs enough variance and unexpected cases. On the conceptual level, the method is the same as letting the AlphaGo AI play against itself to create more understanding of the game of Go (e.g., Silver & Hassabis, 2016).
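To make the idea concrete, the sketch below is a hedged, toy-scale illustration of the technique (the step names, fitness function and parameters are invented; this is not the case company's generator): a tiny genetic algorithm that evolves synthetic user sessions so that the simulated population covers many distinct in-game steps.

```python
# Illustrative sketch only (invented parameters, not the implementation used
# in the case): a tiny genetic algorithm that evolves synthetic user sessions
# to maximize the variety of in-game steps covered by the simulated users.

import random

STEPS = ["menu", "drill:passing", "drill:corner", "video", "quiz", "quit"]

def random_session(length=6):
    return [random.choice(STEPS) for _ in range(length)]

def fitness(session):
    # Reward sessions that visit many distinct steps (more variance / coverage).
    return len(set(session))

def mutate(session, rate=0.2):
    return [random.choice(STEPS) if random.random() < rate else s for s in session]

def evolve(pop_size=50, generations=30):
    population = [random_session() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]                      # selection
        children = [mutate(random.choice(parents)) for _ in range(pop_size - len(parents))]
        population = parents + children                            # next generation
    return population

if __name__ == "__main__":
    synthetic_users = evolve()
    print(max(fitness(s) for s in synthetic_users), "distinct steps in the best session")
```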
Secondly, the user data, in general, enable building predictive models of which activities or usage patterns might predict success or failure. This is very useful in professional sports: an organization might be able to spot talent in an extremely large population just on the basis of big data. Typically, talent in sports cannot be scouted within large populations simply because of a lack of resources. This might help national sports associations build a better understanding of the factors that reveal talent. Similar measures have been applied in the domain of mathematics (Ketamo, Devlin & Kiili, 2018). The educational application is built so that it enables predictive models based on all user data.
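A minimal sketch of such a model is shown below; the feature names and numbers are synthetic examples invented for illustration, not the application's actual features or data.

```python
# Hedged sketch (synthetic numbers, invented feature names): a simple classifier
# that predicts a player's later progress from early usage patterns, in the
# spirit of the predictive models described above.

from sklearn.linear_model import LogisticRegression

# Features per player: [sessions_per_week, avg_session_minutes, drills_completed]
X = [
    [1, 5, 2], [2, 8, 3], [5, 20, 12], [6, 25, 15],
    [1, 4, 1], [4, 18, 10], [7, 30, 20], [2, 6, 2],
]
# Label: 1 = clear progress observed after the trial period, 0 = no clear progress
y = [0, 0, 1, 1, 0, 1, 1, 0]

model = LogisticRegression().fit(X, y)

new_player = [[3, 15, 8]]
print(model.predict_proba(new_player)[0][1])  # estimated probability of progress
```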
Thirdly, personalisation requires an understanding of human behaviour and of the knowledge domain of the activity (e.g., Brusilovsky, 2001). The domain of the educational soccer learning game was constructed as a knowledge graph when the course was built, and user activities were measured and recorded in the same language. The user data, combined with predictive models of successful (or unsuccessful) behaviour, enable adaptive learning and personalised content recommendations both within and outside the domain. It is very useful for a talented player to get the latest clips related to his or her talent. On the other hand, if the user tends to follow unsuccessful patterns, the adaptive learning features can guide him or her back to a successful track.
The biggest challenge in personalisation is privacy: how to reap all the benefits of big data while ensuring that no information that can be linked to a specific person is revealed without that person's consent. In the educational game in question, all the data are collected anonymously and according to the rules and spirit of the EU's GDPR (General Data Protection Regulation), meaning that the internal use for adaptation and personalisation is secured. However, when passing even small pieces of information to an external service, together with other data from the user's device, the user's consent needs to be asked for explicitly. Even data on suggesting and downloading a single news item like "Losing your nerves every time: try these 10 exercises" might tell something about the user. Typically, this can be connected to a specific person using additional data, such as data from cookies in the browser.
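The sketch below illustrates one possible way to express such a policy in code; the salt, identifiers and consent store are hypothetical and are not taken from the case company's system.

```python
# Minimal sketch (hypothetical policy, not the case company's implementation):
# pseudonymise the user identifier for internal analytics and refuse to pass
# usage details to an external service unless explicit consent has been given.

import hashlib
from typing import Optional

def pseudonymise(user_id: str, salt: str = "rotate-me-regularly") -> str:
    """Replace the real identifier with a salted hash for internal analytics."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

def share_with_external_service(event: dict, consents: dict) -> Optional[dict]:
    """Only forward an event if the user has explicitly consented to sharing."""
    if not consents.get(event["user_id"], False):
        return None  # no consent recorded -> nothing leaves the internal systems
    return {"user": pseudonymise(event["user_id"]), "item": event["item"]}

consents = {"player-42": False}
event = {"user_id": "player-42", "item": "Losing your nerves: try these 10 exercises"}
print(share_with_external_service(event, consents))  # None -> not shared
```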
In addition to studying all the sources of big data that were used in the product development phase,
a field study was conducted to verify how well user opinions collected by a traditional questionnaire
correlated with the actual facts that could be observed in the user log data. This was important for a
deeper understanding of the significance of the log data. The log data is one of the key sources of big
data and important in product development, building predictive models on players’ future steps and on
personalization of the product.
In the field study, traditional user testing with test users was combined with usage information from the application server data and with local weather information. The soccer training application was intended
to be used outdoors. The application includes a training programme guided by an artificially intelligent
bot that guides junior players to try selected soccer techniques, such as corner kicks, passing, and ball
control. Thus, the outdoor context with situational variables such as the local weather were essential
factors that affected the behaviour of the junior players who were the consumers of the application.


The authors conducted a field study in which the application was introduced to 134 junior soccer
players in eight different soccer teams. Data was collected by interviewing the players after the test pe-
riod to understand their user experience and the variables that affected their use of the application. The
native mobile application communicated with the web service based on the cloud server architecture, which enabled the authors to use the server data. To gain a deep understanding of the factors that affected user behaviour in this field study, the authors also analysed the weather data collected during the study period in the locations where the application was used by the junior soccer players. Thus, the authors
formed a comprehensive understanding of the players’ behaviour. The authors combined the traditional
interview data with the usage log data, open data on the local weather, master data concerning the soccer
teams and their training facilities as well as geographical reference data.
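A simplified sketch of this multi-source combination is shown below; the team names, timestamps and weather values are invented examples, not the study's data.

```python
# A simplified sketch (invented values) of the multi-source combination described
# above: usage log events are matched to local weather observations by team
# location and hour, using master data about each team's home city.

usage_log = [
    {"team": "Team A", "hour": "2018-06-01T18", "drill": "passing"},
    {"team": "Team B", "hour": "2018-06-01T19", "drill": "corner"},
]
weather = {
    ("Helsinki", "2018-06-01T18"): "light rain",
    ("Espoo", "2018-06-01T19"): "sunny",
}
master_data = {"Team A": "Helsinki", "Team B": "Espoo"}

combined = []
for event in usage_log:
    city = master_data[event["team"]]                 # master data join
    conditions = weather.get((city, event["hour"]))   # open weather data join
    combined.append({**event, "city": city, "weather": conditions})

print(combined)
```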

RESULTS AND RECOMMENDATIONS

This section first proposes a framework for big data governance. After that it provides recommendations
concerning each of the five dimensions of the framework. Both are based on the business case study
presented above.

The Proposed Big Data Governance Framework

The proposed big data governance framework is formed based on the requirements found in the business case concerning the product development project in the educational game industry. It also takes into account the widely accepted framework of DAMA International presented in the background section. Table 1 lists the five dimensions of the big data governance framework and gives a brief explanation of each.
Figure 3 below shows the relation of the widely accepted data governance framework of DAMA to the proposed big data governance framework. As explained in the background section, DAMA describes data governance as the exercise of authority, control and shared decision making over the nine fields of data management that are listed in the sectors of the circle. The authors will now explain in detail the dimensions of the proposed big data governance framework and how they relate to the existing data governance framework of DAMA.

Table 1. The five dimensions of the proposed big data governance framework and their brief description

Dimension: Meaning

Data privacy: Data containing information about a private person should be treated with special attention according to the organisation's data privacy policy and legislation.
Data security: The processes and technologies that ensure that sensitive and confidential data about an organization are kept secure according to the organisation's policies.
Data availability: Making data available at a given moment, including the usage of data, interface standards, metadata, and the findability of data.
Data usability: The data in an organisation can be used to meet the goals defined in the corporate strategy, including data monetisation.
Data integrity: The trustworthiness of the data, including data lifecycle management and data quality monitoring.


Figure 3. Illustration of the proposed big data governance framework and its relation to the existing
data governance framework presented in DMBOK. The five dimensions of the framework are written
in capital letters.

The figure shows that usability is at the heart of big data governance itself. In addition to controlling the nine traditional fields of data management, all nine fields should be managed with usability in mind. This is because both the volume of big data and the technical challenges related to it are so significant that it is crucial to ensure that all the data being governed are also usable, i.e., relevant from the business point of view. This is one reason why big data governance should not be an endeavour of the IT department but should rather be the responsibility of a business leader. Moreover, to ensure data usability, care should be taken to implement shared decision making about the data assets of an organisation. This shared decision making should involve all relevant parties so that all potentially valuable data assets are considered. Shared decision making is also in line with agile software development (Seiner, 2014).
Data governance itself has a more important role when dealing with big data than when dealing with traditional data. It is important that the personnel of an organization are explicitly made accountable for all the data assets that are used in the organization and that are vital for the business. This ensures that all data assets are governed with authority and control. With the advent of big data, roles and departments that typically have had no accountability for data assets, or only a limited one, will typically have a broader accountability for different data assets. For example, personnel from the marketing department are typically assigned the accountability for social media data. The accountability


for sensor data (also called IoT data) is typically assigned to personnel in the operations and maintenance
units of an organization. Governing big data is very much linked to the usability of the data – even more
than in the framework for governing traditional data.
Now that the special requirements that big data imposes on data governance have been explained, the authors will go through the remaining dimensions of the framework and explain how they are
related to the traditional data governance framework. In Figure 3, the four other dimensions are written
in capital letters directly under the “roof” of big data governance and usability. This means that they are
dimensions that need to be controlled by the framework. The nine fields of data management are written
inside the box under the four big data governance dimensions. They are aligned under the dimension that
most describes them. However, the fields typically can be described by several dimensions. For example,
data architecture management is on the left-hand side, directly under availability. This means that a good data architecture is very important for data availability, since it enables the findability of the data for the right person at the right time. However, a good data architecture also helps with data integrity
because it prevents a situation where the same data is stored in two places, which causes the risk that
the data are not updated simultaneously.

Recommendations on Big Data Privacy

The privacy of big data is an important issue because these data are heterogeneous, and they are derived
from various sources (Morabito, 2015; Soares, 2012). Many sources contain data that originate in the
actions of individual customers who interact with a digital service. In the business case of the educa-
tional soccer application, the data collected on the application’s usage were used to build the predictive
models. The usage data revealed information about the customers, such as their location, date and time
of usage, and the parts of the application they used. It is very important that these data are well protected
and that only authorised people and software may access them (Otto, 2011). If need be, the proposed
big data governance framework enforces data privacy related procedures for acquiring permission from
users to use their personal data.
In the case study, data privacy played a crucial role. In the field test, the authors defined the privacy
issues before meeting the users in person when they received a face-to-face introduction to using the
application. The purpose of the study was explained to the players. Because the users were children, a
letter was distributed to their parents explaining the purpose of the trial. The coaches of the junior soc-
cer teams or research assistants delivered the information letter to the parents. Hence, the researchers
obtained the parents’ permission for their children to use their smartphones and access the data. The
parents understood that their children would download and use the application, which was related to their
hobby. The authors learned that when the persons who collected the data met the users in person, they
built trust more easily than in collecting similar information through online surveys. In the field test, it
was easier to motivate users to use the application, as they received a free application for participating in
the study. This observation supports the findings of previous research that showed that if users received
benefits, they were more willing to allow the use of their personal information (Caudill & Murphy, 2000;
Chellappa & Sin, 2005). The study demonstrated that it is important to define in advance how different
user groups or segments are identified and used in the analytical phase.
The authors found that by combining several data sources, the predefined research questions allowed
for determining the accuracy of the user data, particularly regarding the privacy issue. For example, to
obtain detailed information, the authors needed to identify small user groups in other data sources. The


findings showed that some users were significantly more active than others who used the application
only once or a few times. The log data collected by the server provided statistical information about
the usage during the trial period, which helped the authors to validate and understand the findings of
the field test. Thus, the authors were able to compare these statistics to the findings of the interviews,
which increased the reliability of qualitative results. Integrating the weather information as a dependent
variable was challenging because the authors could not find data on the exact location of the users when
they used the application. The authors learned that weather changes rapidly within a city. For example,
because rain can start and stop within minutes, it was necessary to track and analyse usage accurately in
minutes rather than hours. Additionally, the exact location was also required. Thus, the authors recom-
mend adding a geolocation feature to the application because it was not included during the field test.
However, the addition of this feature could increase the privacy concerns of many users because they
would reveal their location when they used the application. Although many mobile applications already
have a geolocation functionality because it facilitates sending local advertisements or other information
to the screens of smartphones, users could choose to disable the geolocation function. Thus, companies
should provide a trade-off or a value for users who allow tracking by geolocation if it is important to
know their exact location for analytical purposes.

Recommendations on Data Security

In the analyses of the companies, integrating open data did not place data security or the privacy of us-
ers at risk. The findings showed that the open weather data provided interesting information about the
usage environment and the variables that affected the users’ willingness to use the application outdoors.
Furthermore, the data sources were not technologically integrated, which increased the security of the
data. The companies did not need to conduct analyses over networks if the data were saved to a secured
server on their premises. Chen and Zhao (2012) pointed out that sharing data over networks created
potential security risks and could cause privacy concerns in users. In the present study, the authors did
not deal with sensitive information, such as home addresses, credit card numbers, or phone numbers.
Acquisti and Gross (2006) showed that users were more willing to provide general personal information
than factual private information that provides more detailed information about them.

Recommendations on Data Availability

In the case study, big data were readily available. The field study data were small in volume and well managed, while the server log data accumulated constantly in real time. The data management principle was that the predictive models had to be re-created from time to time to reflect changes. In the current case, the authors always used all the available log data. In a future study, the oldest data could be warehoused, and only the newest data used. In the current study, data warehousing procedures had to be constructed to accommodate the old models.
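A minimal sketch of such an availability policy is given below; the 90-day window and record layout are invented assumptions used only to illustrate the idea of keeping an active window for model re-creation and archiving the rest.

```python
# Illustrative sketch (thresholds and fields are invented) of the availability
# policy discussed above: keep only the newest log data active for model
# re-creation and move older records to a warehouse/archive store.

from datetime import datetime, timedelta

def split_for_retraining(log_records, active_days=90, now=None):
    """Return (active, archived) partitions of the server log data."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=active_days)
    active = [r for r in log_records if r["timestamp"] >= cutoff]
    archived = [r for r in log_records if r["timestamp"] < cutoff]
    return active, archived

logs = [
    {"timestamp": datetime(2018, 1, 10), "event": "session_start"},
    {"timestamp": datetime(2018, 6, 20), "event": "drill_completed"},
]
active, archived = split_for_retraining(logs, now=datetime(2018, 7, 1))
print(len(active), "records used for the new model;", len(archived), "warehoused")
```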
The external open weather data were not managed by the case company because they were sourced
from the data provider. According to this experience, the authors suggest that external data should be
governed such that one person or role in the organisation is accountable for them.


Recommendations on Data Usability

The term data usability means that the data should be aligned with business needs. According to the
experiences of the case study, the authors suggest that businesses as well as technology stakeholders in
organisations should be involved in ensuring that all data conform to the needs of the predictive models.
For example, an organisational role must be accountable to ensure that the user action log data conform
to the requirements of the predictive big data model generated after software updates. In this study, part
of the management of data usability was the proper documentation of software updates that might affect
the creation of log data.

Recommendations on Data Integrity

In the case study, data integrity concerned mainly data quality and data lifecycle management, includ-
ing data warehousing. In the business case, the findings showed that in order to provide a trustworthy
model based on several separate datasets, special attention must be paid to the correct integration of data.
Data integrity became a challenge when the authors combined three data sources. It was not difficult
to analyse a single data set, such as clustering users in different segments. However, when there were
three different data sources in different formats in the same period and location, it was difficult to identify
the exact location of the anonymous soccer players or teams when the weather conditions were analysed.
This finding indicates that to maintain data integrity between different data sources accurate location
information and the identification of anonymous users might be required to create profiles of their usage.
However, the requirement of data privacy is a limitation if users do not allow tracking their geolocation.
It is understood that a predictive model that is based on inaccurate or even erroneous data has no
value (Soares, 2012). In applying the framework used in this study, the authors used big data from two sources: the company's internal data and the company's external data. The internal data included real-time and historical data on the company's proprietary servers as well as master data on the soccer teams and soccer practice facilities. The external data consisted of real-time and historical weather data and weather forecasts from the Finnish Meteorological Institute as well as data from Google Maps.
The predictive models were used in real time, and they were automatically updated. The findings
confirmed the requirements for the automatic monitoring of data quality as well as the availability of
data. Hence, special attention was paid to building not only a solid framework for ensuring data qual-
ity but also mechanisms for the automated monitoring of the most important issues in this framework. Examples of issues that this framework monitored include soccer practice facilities and teams that were not included in the master data but appeared in the customer data on the server, as well as unexpected changes to application programming interfaces in the external data sources.
Automated tests for monitoring data integration are an efficient means of ensuring data integrity. For
example, if the server log data showed that a team was using a playing facility that was not in the master
data, an alert would be sent to the person accountable for the master data about the playing facilities.
This person would then be able to amend the master data as soon as possible.
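The sketch below illustrates such an automated check; the facility names and the print-based alert are invented for this example and stand in for whatever notification mechanism an organisation actually uses.

```python
# A minimal sketch (invented names, print-based alert) of the automated
# integrity monitoring described above: flag facilities that appear in the
# server log but are missing from the master data, so the accountable person
# can amend the master data.

master_facilities = {"Töölö field", "Käpylä hall"}

server_log = [
    {"team": "Team A", "facility": "Töölö field"},
    {"team": "Team C", "facility": "Herttoniemi park"},  # not in master data
]

def check_facility_integrity(log, known_facilities):
    unknown = {e["facility"] for e in log} - known_facilities
    for facility in sorted(unknown):
        # In production this would notify the master data owner instead of printing.
        print(f"ALERT: facility '{facility}' found in log but missing from master data")
    return unknown

check_facility_integrity(server_log, master_facilities)
```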


Table 2. The summary of solutions and recommendations

Dimension: Solutions and recommendations based on the case study

Data privacy: Data privacy should be taken into account very carefully in big data development. Different data types and collection methods require special attention to be paid to privacy issues. In the case study, the level of location information in the different user segments was inaccurate due to the privacy policy used. The authors recommend paying special attention to the trade-off between privacy and the value perceived by the user. This can be achieved by motivating users to share the private information that is crucial for more detailed analytics.

Data security: In many companies, private and company confidential data are professionally secured and managed, but sharing data over networks may create security risks. Real-time data analytics on cloud applications creates more security risks than analyzing data sets offline. The authors recommend that companies partner with IT companies that are specialized in secured infrastructure solutions and services. Additionally, the authors recommend that companies use trusted data providers when purchasing sensitive customer information, such as purchase histories or order data.

Data availability: Availability was easy to manage in the small start-up company of the case study, but in a larger corporation cross-functional communication and the co-creation of business insight become more challenging. It is important to design an information architecture that supports the findability of data. The development of meta-data management should also be given special attention. Data warehouses should be given special attention as they contribute to making non-transactional data available. Examples of data warehouse-related decisions are: which data to put into a data warehouse, how long to keep it and at what intervals to update it. Business intelligence is also very important in the era of big data, and it is evolving very quickly: when data is no longer big data, it often becomes just regular business intelligence data. For example, social media stream data used to be big data when it was too large and moving too quickly to be handled by traditional software. Document and content management is very important in a big data governance framework because the majority of big data belongs to this category and is not readily available in a well-structured relational database.

Data usability: In the case study, the careful planning of the customer study helped to align the data collection with the business expectations. However, the accuracy of the data became a challenge: the location information of the weather data was difficult to align with the usage data due to the inaccuracy of location information. The authors found that big data analytics is an iterative process in which the quality of data is gradually enhanced. Especially analytics in an emerging market involves several uncertainties that are difficult to predefine exactly in advance. The authors recommend beginning big data projects as early as possible. Learning to manage data collection and to analyze the related processes in multi-disciplinary teams takes time, and the quality of data needs to be gradually improved by integrating various datasets.

Data integrity: To provide a trustworthy model based on several separate datasets, special attention has to be given to the correct integration of data. In real-time data analysis, unexpected modifications in the data due to changes in the software producing it have to be governed to prevent analyses based on corrupted data.

FUTURE RESEARCH DIRECTIONS

More research is needed on real-life cases of big data governance. Now that the use of big data to create new value for business is common, it is important that suitable governance frameworks are also developed and implemented in business. Big data governance will bring new insights to traditional data governance and vice versa, and at some point data governance will simply include big data governance. Until then, dedicated big data governance is needed because it takes into account the challenges and trade-offs caused by data variety, velocity, and volume. Furthermore, the frameworks and practices for governing big data are by no means mature, and they need to be developed and researched in more detail and depth.


CONCLUSION

One framework of big data governance does not fit all companies. Although the key principles of pri-
vacy, security, availability, usability, and integrity should be the same in general, their implementation
differs. The company’s strategy and the maturity of its product development and business model affect
the implementation of a big data governance framework. The authors learned that a start-up company
in an emerging market needs a flexible data governance framework that is suited for an agile software
development process. In an emerging market, a company cannot predict and manage all issues in advance
as it could do in a mature market. In a mature market, customer behaviour is easier to predict because
the companies in it already have a long history of dealing with customer data. In emerging markets,
companies are faced with a great amount of uncertainty because the products are new, and little is known
about the markets and customers (Blank, 2007). Additionally, start-up companies often use lean or agile
development methodologies and management, and they usually have an experimental corporate culture
unlike established businesses (Ries, 2007). The goal of data analytics is to facilitate learning processes
in companies. The learning cycles of start-up companies are often shorter than those of established busi-
nesses. Specifically, established businesses have longer histories of customer data, but start-ups work
under conditions of uncertainty.
Despite the challenges of collecting data in the new AI-based learning technology markets, the authors
learned that it is important to focus on all five key dimensions of big data governance in technology
companies. The findings from the user study of the soccer learning game showed that diversified data collection methods helped the authors to obtain a realistic understanding of users' thinking. The findings also
showed that information about the users’ locations was important, but it required users to reveal their
geolocation. This requirement calls for ways to motivate users to provide personal information, which
in the literature on data privacy is called a trade-off between privacy and perceived value (e.g., Caudill
& Murphy, 2000; Chellappa & Sin, 2005).
To succeed in customer-behavioural analytics, companies need to be able to manage various dimensions
of data processes. In particular, service companies that launch new mobile and web services should apply
a big data governance framework to manage technological data collection processes and privacy issues
of customers as well as the availability, integrity, and usability of data. An adequate big data governance
framework would enable data-driven insights for marketers and aid in sales and product development.
The findings of this study showed that users need to be motivated to provide their personal information
for use by companies. The implementation of a framework for big data governance in a real-life business
case increased the quality and business value of big data-based product development and predictive
model generation. The big data governance framework also facilitated the consistent and trustworthy
use of customer data, which is essential in maintaining a positive company image.

Managerial Implications

From the managerial point of view, this study presents a framework of big data governance for companies that operate in emerging markets. It is essential for the success of such businesses that they learn fast from the data. Advanced multi-source data analytics provides a way to gain new knowledge for decision-making. However, maintaining data privacy when enriching the primary data with other data sources is not straightforward. The weather data used in this study do not create any privacy challenges, but if the users in the case study were connected to additional data sources, such as social media, there would be a considerable risk to their privacy. In fact, there are no exact rules for determining when an added data source could put a person's privacy at risk. That is why all managers should be aware of the nature of big data. Learning fast through data analytics requires a big data governance framework. Thus, it is important to embed the five key dimensions in company processes for managing data analytics projects.

ACKNOWLEDGMENT

This work was supported by the BIG-research program, funded by TEKES (Finnish Funding Agency
for Innovation) no 2710/31/2016. The authors also thank the eight anonymous soccer teams and the
students who assisted in collecting the qualitative field data.

REFERENCES

Ackerman, M. S., Cranor, L. F., & Reagle, J. (1999). Privacy in e-commerce: Examining user scenarios
and privacy preferences. In Proceedings of the 1st ACM Conference on Electronic Commerce. ACM.
10.1145/336992.336995
Acquisti, A., & Gross, R. (2006). Imagined communities: Awareness, information sharing, and privacy on Facebook. Proceedings of the Privacy Enhancing Technologies Symposium, 36–58. doi:10.1007/11957454_3
Alamäki, A., & Dirin, A. (2015). The stakeholders of a user-centred design process in mobile service
development. International Journal of Digital Information and Wireless Communications, 5(4), 270–284.
doi:10.17781/P001825
Bart, Y., Shankar, V., Sultan, F., & Urban, G. L. (2005). Are the drivers and role of online trust the same
for all web sites and consumers? A large-scale exploratory empirical study. Journal of Marketing, 69(4),
133–152. doi:10.1509/jmkg.2005.69.4.133
Blank, S. (2007). The four steps to the epiphany: Successful strategies for products that win. Quad/
Graphics.
Brusilovsky, P. (2001). Adaptive Hypermedia. User Modeling and User-Adapted Interaction, 11(1/2),
87–110. doi:10.1023/A:1011143116306
Caudill, E. M., & Murphy, P. E. (2000). Consumer online privacy: Legal and ethical issues. Journal of
Public Policy & Marketing, 19(1), 7–19. doi:10.1509/jppm.19.1.7.16951
Chellappa, R. K., & Sin, R. G. (2005). Personalization versus privacy: An empirical examination of the online consumer's dilemma. Information Technology and Management, 6(2–3), 181–202. doi:10.1007/s10799-005-5879-y
Chen, D., & Zhao, H. (2012, March). Data security and privacy protection issues in cloud computing.
In Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference (Vol. 1,
pp. 647–651). IEEE. 10.1109/ICCSEE.2012.193


Chen, Q., Zhang, M., & Zhao, X. (2017). Analysing customer behaviour in mobile app usage. Industrial
Management & Data Systems, 117(2), 425–438. doi:10.1108/IMDS-04-2016-0141
DAMA International. (2017). DAMA-DMBOK: Data management body of knowledge (2nd ed.). Tech-
nics Publications.
Dirin, A., Laine, T., & Alamäki, A. (2018). Managing emotional requirements in a context-aware
mobile application for tourists. International Journal of Interactive Mobile Technologies, 12(2), 177.
doi:10.3991/ijim.v12i2.7933
Du Mars, R. (2012). Mission impossible? Data governance process takes on “big data.” Retrieved from
http://searchdatamanagement.techtarget.com/feature/Mission-impossible-Data-governance-process-
takes-on-big-data
Flyvbjerg, B., & Budzier, A. (2011). Why your IT project may be riskier than you think. Harvard Busi-
ness Review, 89(9), 23–25.
Godinez, M., Hechler, E., Koenig, K., Lockwood, S., Oberhofer, M., & Schroeck, M. (2010). The art
of enterprise information architecture: A systems-based approach for unlocking business insight. IBM
Press, Pearson Higher Ed.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of
“big data” on cloud computing: Review and open research issues. Information Systems, 47, 98–115.
doi:10.1016/j.is.2014.07.006
Kaufman, L. M. (2009). Data security in the world of cloud computing. IEEE Security and Privacy,
7(4), 61–64. doi:10.1109/MSP.2009.87
Ketamo, H. (2008). Cost Effective Testing with Artificial Labour. Proceedings of 2008 Networked &
Electronic Media Summit, 185-190.
Ketamo, H. (2010). Balancing adaptive content with agents: Modelling and reproducing group behavior
as computational system. Proceedings of 6th International Conference on Web Information Systems and
Technologies, 1, 291-296.
Ketamo, H., Devlin, K., & Kiili, K. (2018). Gamifying Assessment: Extending Performance Measures
with Gaming Data. Proceedings of American Educational Researcher Association’s Annual Conference.
Ladley, J. (2012). Data governance: How to design, deploy and sustain an effective data governance
program. Elsevier.
Liebowitz, J. (2013). Business analytics: An introduction. CRC Press, Taylor & Francis Group.
Malhotra, N. K., Kim, S. S., & Agarwal, J. (2004). Internet users’ information privacy concerns (IUIPC):
The construct, the scale, and a causal model. Information Systems Research, 15(4), 336–355. doi:10.1287/
isre.1040.0032
Martin, K. D., & Murphy, P. E. (2017). The role of data privacy in marketing. Journal of the Academy of Marketing Science, 45(2), 135–155. doi:10.1007/s11747-016-0495-4


Morabito, V. (2015). Big data and analytics: Strategic and organizational impacts. Cham: Springer.
doi:10.1007/978-3-319-10665-6
Morville, P. (2007). Information architecture for the World Wide Web. O’Reilly.
Otto, B. (2011). Data governance. Business & Information Systems Engineering, 3(4), 241–244. doi:10.1007/s12599-011-0162-8
Paajanen, S., Valkokari, K., & Aminoff, A. (2017). The opportunities of big data analytics in supply
market intelligence. In Proceedings of the18th IFIP WG 5.5 Working Conference on Virtual Enterprises,
PRO-VE 2017 (pp. 194-205). Springer. 10.1007/978-3-319-65151-4_19
Panian, Z. (2010). Some practical experiences in data governance. World Academy of Science, Engineer-
ing and Technology, 38, 150–157.
Ricci, F., Rokach, L., Shapira, B., & Kantor, P. B. (Eds.). (2010). Recommender systems handbook.
Springer Science & Business Media.
Ries, E. (2010). The Lean Startup: How constant innovation creates radically successful businesses.
London: Penguin Books.
Sarsfield, S. (2009). The data governance imperative. IT Governance Ltd.
Seiner, R. (2014). Non-Invasive Data Governance. The Path of Least Resistance and Greatest Success.
Technics Publications.
Sharda, R., Delen, D., & Turban, E. (2018). Business intelligence, analytics and data science: A mana-
gerial perspective. Pearson.
Silver, D., & Hassabis, D. (2016, January 27). AlphaGo: Mastering the ancient game of Go with Machine
Learning. Google Research Blog.
Soares, S. (2012). Big data governance: An emerging imperative. MC Press.
Tao, F., Cheng, J., Qi, Q., Zhang, M., Zhang, H., & Sui, F. (2018). Digital twin-driven product design, manufacturing and service with big data. International Journal of Advanced Manufacturing Technology, 94(9–12), 3563–3576. doi:10.1007/s00170-017-0233-1
White, M. S. (2015). Enterprise search. O’Reilly Media.
Wilkowska, W., & Ziefle, M. (2012). Privacy and data security in E-health: Requirements from the
user’s perspective. Health Informatics Journal, 18(3), 191–201. doi:10.1177/1460458212442933
PMID:23011814
Zhang, Y., Ren, S., Liu, Y., Sakao, T., & Huisingh, D. (2017). A framework for Big Data driven product
lifecycle management. Journal of Cleaner Production, 159, 229–240. doi:10.1016/j.jclepro.2017.04.172
Zissis, D., & Lekkas, D. (2012). Addressing cloud computing security issues. Future Generation Com-
puter Systems, 28(3), 583–592. doi:10.1016/j.future.2010.12.006


ADDITIONAL READING

Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. Management Information Systems Quarterly, 36(4), 1165–1188.
Connolly, T., & Begg, C. (2010). Database Systems: A Practical Approach to Design, Implementation
and Management (5th ed.). Addison-Wesley.
Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders
and Practitioners. Hoboken, NJ: SAS Institute, Inc. John Wiley & Sons, Inc. doi:10.1002/9781118691786
Demirkan, H., & Delen, D. (2013). Leveraging the capabilities of service-oriented decision support sys-
tems: Putting analytics and big data in cloud. Decision Support Systems, 55(1), 412–421. doi:10.1016/j.
dss.2012.05.048
Groves, P., Kayyali, B., Knott, D., & Van Kuiken, S. (2013). The ‘big data’ revolution in healthcare.
The McKinsey Quarterly, 2, 3.
Gurin, J. (2014). Open Data Now: The Secret to Hot Startups, Smart Investing, Savvy Marketing, and
Fast Innovation. McGraw Hill Education.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of
“big data” on cloud computing: Review and open research issues. Information Systems, 47, 98–115.
doi:10.1016/j.is.2014.07.006
LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and
the path from insights to value. MIT Sloan Management Review, 52(2), 21.
Verhoef, P. C., Kooge, E., & Walk, N. (2016). Creating value with big data analytics: Making smarter
marketing decisions. Routledge.
Zikopoulos, P., & Eaton, C. (2011). Understanding big data: Analytics for enterprise class Hadoop and
streaming data. McGraw-Hill Osborne Media.

KEY TERMS AND DEFINITIONS

Big Data: Data that cannot be processed using traditional data analytics software and infrastructure on a personal computer or on a data analytics server. Compared with traditional data, big data have greater volume, variety, and/or velocity.
DAMA: The global data management community. It is a non-profit and vendor-independent associa-
tion that provides a community and support for information professionals.
Data Availability: Making data available at a given moment, including the usage of data, interface
standards, metadata, and the findability of data.
Data Governance: The processes and technical infrastructure that an organization has in place to
ensure data privacy, security, availability, usability, and integrity.
Data Integrity: The trustworthiness of the data, including data integration, data lifecycle manage-
ment, and data quality monitoring.


Data Privacy: Data containing information about a person should be treated with special attention
according to the organization’s data privacy policy and legislation.
Data Security: The processes and technologies that ensure that sensitive and confidential data about
an organization are kept secure according to the organization’s policies.
Data Usability: The data in an organization can be used to meet the goals defined in the corporate
strategy, including data monetization.
DMBOK: The DAMA International Guide to Data Management Body of Knowledge. A publication
that is dedicated to advancing the concepts and practices of information and data management.
GDPR: The General Data Protection Regulation. It is a regulation in European Union (EU) law on data protection and privacy for all individuals within the EU and the European Economic Area.
IT Governance: The processes that ensure the effective and efficient use of IT in enabling an orga-
nization to achieve its goals.
Predictive Model: A data-driven model, which is used to predict a future event, in contrast to a
descriptive model, which is used to explain a past event.


Chapter 9
The Link Between Innovation
and Prosperity:
How to Manage Knowledge for the
Individual’s and Society’s Benefit
From Big Data Governance?

Sonia Chien-i Chen


SC Company Limited, Taiwan

Radwan Alyan Kharabsheh


Applied Science University, Bahrain

ABSTRACT
The digital era accelerates the growth of knowledge to such an extent that it is a challenge for individuals and society to manage it traditionally. Innovative tools are introduced to analyze massive data sets for extracting business value cost-effectively and efficiently. These tools help extract business intelligence from explicit information so that tacit knowledge can be transferred into actionable insights. Big data are currently fashionable because of their accuracy and their capability of predicting future trends. They have shown their might in bringing business prosperity, from supermarket giants to businesses and disciplines of all kinds. However, as data spread widely, people are concerned about their potential to increase inequality and threaten democracy. Big data governance is needed if people want to keep their right to privacy. This chapter explores how big data can be governed to maintain the benefits of the individual and society. It aims to allow technology to humanize the digital era so that people can benefit from living in the present.

DOI: 10.4018/978-1-5225-7077-6.ch009


INTRODUCTION

Innovation contributes to societal prosperity through problem-solving and value creation by knowledge management (KM), to the benefit of individuals and society. In the wake of the Internet and the digital era, big data governance (BDG) is relevant in KM to ensure human rights and societal equality while pursuing prosperity. The speed of technology development often exceeds the formation of the required laws and regulations (Dalkir, 2013). Consequently, human ethics and democracy are at risk because of big data's potential hazards (Mittelstadt & Floridi, 2016). The lack of institutional review board (IRB) protocols or federal regulations that safeguard the rights of human participants on the Internet remains a gap to be bridged (Fiske & Hauser, 2014; Huser & Cimino, 2016). This is particularly problematic when human participants are entangled in interventions about which they are not fully informed, or when participants' data are used for secondary purposes. Thus, it matters how personal digital data are accessed while scientific discoveries and innovations are advanced. The responsibility and accountability involved in obtaining private data should be addressed to clarify the responsibilities and liabilities of both data owners and data users. The analytical techniques of big data need to be innovated and refined as their technologies grow. This chapter explores how big data can be governed from the perspectives of data audit, data accountability, and human participation, for the benefit of the individual and society. It aims to allow technology to humanize the digital era by reviving digital forgetting, to better serve people living in the present.

BACKGROUND

The digital era accelerates the growth of knowledge to such an extent that it is challenging for individuals and society to manage it traditionally (Hesse et al., 2015). These challenges can be divided into three features of data: volume, velocity, and variety. The massive volume of data at rest requires a cluster of computers to process it. Digital data in motion are generated at a higher speed than ever, which is called the velocity of data (Sutherland & Soares, 2012). Data are also produced in many formats that conventional tools can hardly manage, which is their variety (McHugh, 2015). Therefore, innovative tools are introduced to analyse massive data sets for extracting business value cost-effectively and efficiently. These tools help extract business intelligence from explicit information so that tacit knowledge can be transferred into actionable insights. Big data are currently fashionable because of their accuracy and their capability of predicting future trends.
On the one hand, the growing big data movement gives rise to new business opportunities. Big data have shown their might in bringing business prosperity, from supermarket giants to businesses and disciplines of all kinds. On the other hand, the movement also brings concerns over data security, privacy protection, and the ethical boundaries of accessing personal digital data (Chen, 2018). As data spread widely, people are concerned about their potential to increase inequality and threaten democracy. BDG is needed if people want to keep their rights. In order to make an insightful proposal for BDG, a more comprehensive understanding of knowledge, innovation, and big data management is essential; these topics are reviewed throughout this section.
In this fast-changing world, the generation of new knowledge is becoming more important than merely storing knowledge. New knowledge is no longer limited to the organization. Frequently, new sources of data, such as websites, mobile devices, and physical sensors, appear in digitalized format and are stored online or on cloud platforms. This shift in information storage format will change the way knowledge is organized and processed. Consequently, it may offer unprecedented opportunities in KM and open innovation and create new jobs, not only for computer scientists and systems administrators but also for knowledge workers with strong business analytics or marketing skills.

MAIN FOCUS OF THE CHAPTER

Traditionally, KM classified knowledge into tacit and explicit knowledge; however, big data-based KM has broken up this boundary. It extracts useful knowledge from digital data and information, and it often operates in real time. Knowledge acquisition involves people less and machines more, unlike traditional KM, in which most of the emphasis is on people's support. In big data-based KM, higher analytical skills for knowledge extraction are required compared to the traditional way. Usually, data appear in a continuous flow and require constant processing in big data-based KM. In the past, knowledge was relatively static and was stored in repositories, networks, and people's heads. Also, stakeholders had more face-to-face interaction. In the era of big data, by contrast, less face-to-face interaction is required. The innovation of new knowledge generation has significantly altered the way knowledge is extracted and spread. Despite this change, some rules and logic related to KM may remain the same. The authors propose reviewing the history of how knowledge is relevant to authority and society to gain more insight into BDG.

Knowledge, Power, and Authority

“Ignorance is the curse of God; knowledge is the wing wherewith we fly to heaven” (Shakespeare, 1826,
p. 521).
Shakespeare's play offers people the hope of getting closer to God's wisdom and might through knowledge acquisition. Bacon (1864), in his Meditationes Sacrae of 1597, absorbed the essence of Shakespeare's expression that knowledge is power, to emphasize that those who are in the know frequently command power that the ignorant lack (Brown, 1989). The importance of knowledge and how it is related to power can be seen in the Reformation, which involved competition between the Catholic Church and national authority (Mitchell, 1998).

Those Who Have the Control of Information Have the Power of Knowledge

The Reformation broke the monopoly of the Roman Catholic Church. Then, the increase in literacy and the right to read the Bible fostered a new atmosphere of freedom and tolerance in faith. Consequently, political and economic progress in Europe was promoted by the spread of knowledge. As people gained the free will to hold their own religious beliefs, various European countries also strengthened their national consciousness and cohesion. This implies that knowledge is relevant to power, as it gives people the hope of being the masters of their fate.
In effect, stories that happened ages ago keep transforming into different forms and appearing again and again in human life. The concept that knowledge is power is fundamental in the era of knowledge. Businesses are keen to obtain information from consumers and strive to keep their prosperity, as they are conscious that those who control information have the power of knowledge. However, some individuals may be willing to exchange their information for convenience, while others may be aware of the risk of this exchange and would rather stay on safe ground. It is certain that the competition to control information never ends.
These competitions come along with the development of techniques for managing knowledge, which explains the revolution in information and communication technologies. Indeed, these technologies are standard tools that help facilitate knowledge sharing and KM. The progress of Western society over the past centuries supports this argument. The competition to acquire knowledge may appear in different forms and only grow more intense, but it never stops (e.g., the arms race), since knowledge is considered relevant to power. Innovative technologies are invented as practical tools to acquire and diffuse knowledge and to generate tangible and intangible value. Innovation and the advance of technology drive economic growth and give birth to the knowledge economy.
The authority of the church encountered a tough challenge when the printing press was invented. Scholars (Keohane and Nye, 1997; Mayer and Brodning, 2007) understood that the power of knowledge comes from the capability of controlling information, a view reflected in the development of the concepts of “power and interdependence” and “information power”. How printing press technology enabled the diffusion of knowledge and inspired the demand for new regulations, such as intellectual property and the concept of freedom of speech, parallels how big data technology and its relevant governance now draw more attention than ever because of their potential for market dominance.

Innovation in Knowledge Management

“The next society will be a knowledge society. Knowledge will be its key resource, and knowledge work-
ers will be the dominant group in its workforce” (Drucker, 2001, p. 2).
As societies have evolved, the world economy has gradually turned from a labor-intensive economy into a technology-intensive and then a knowledge-intensive one. According to Drucker (2001), this is called the new knowledge economy, and it will rely heavily on knowledge workers: people with considerable theoretical knowledge and learning, such as doctors, lawyers, teachers, accountants, and chemical engineers (Drucker, 2001, p. 2). Information technology is just one of its features, but it influences people's lives significantly by spreading knowledge almost instantly and making it accessible to everybody. Since the 1990s, KM has become one of the key driving forces for organizational value creation (Liew, 2008) and has been widely discussed in both the social and scientific domains. According to Lichtenthaler and Ernst (2009), the dynamic capability of a firm will distinguish its ability to successfully manage its knowledge-based structure over time, as KM capability reconfigures and readjusts knowledge capacities. However, various scholars in different disciplines have still defined KM differently.
According to Tomas and Hult (2003), KM is the process of managing information, including explicit and tacit knowledge, to create unique value that an organization can use to achieve a competitive advantage in the marketplace. However, Sabherwal (2005) simplifies it into one major function: KM “is about doing what is needed to get the most out of knowledge resources” (p. 20). Sousa and Hendriks (2006; 2008), instead, give KM a more inclusive definition: it “addresses policies, strategies, and techniques aimed at supporting an organization's competitiveness by optimizing the conditions needed for efficiency improvement, innovation, and collaboration among employees” (p. 15). In this definition, the authors introduce the concept of innovation to elaborate the meaning of KM. The definition employed in this chapter is closest to the one proposed by Nicolini et al. (2008, p. 25): “the systematic process of identifying, capturing, and transferring information and knowledge people can use to create, compete, and improve”. Summing up the definitions above, a systematic process for creating, improving, and innovating value from knowledge and information for potential stakeholders can be viewed as the crucial theme of KM.
The role of KM in innovation is important, as knowledge generates wisdom and intelligence, which lead to innovation. In other words, innovation is a kind of intelligence that is rolling, open, and dynamic, and, through questioning current knowledge and learning, people are able to understand themselves and surpass themselves (Moreno et al., 2012). According to Du Plessis (2007), innovation is highly dependent on the acquisition of knowledge, and, in order to ensure successful innovation, the explosion and availability of knowledge need to be recognized and managed (Adams & Lamont, 2003; Du Plessis, 2007). Nonaka and Von Krogh (2009) discussed organizational knowledge creation theory in terms of two key functions: “tacit knowledge” and “knowledge conversion”. According to recent research, the understanding of tacit knowledge is as important as the interaction between tacit and explicit knowledge, and both may have implications for the generation of innovation. In the new global economy (Drucker, 2001), innovation management has become a central issue for KM. On the other hand, KM is an efficient support for innovation and competitiveness.
According to Kosturiak (2010), the pioneers in KM expressed the positive relationship between KM and innovation. Innovative solutions are usually created from the sources of value that KM generates. Chen, Huang, and Hsiao (2010) explored the role of organizational climate and structure in KM and innovativeness. They looked at this subject from the perspectives of social capital and social networks, and the hypotheses were tested on 146 Taiwanese firms. The findings support the argument that KM is positively related to the innovativeness of firms and that an innovative and supportive climate and structure contribute to KM. Noordin and Karim (2015) further investigated this at an organizational level by modeling the relationship between human intelligence, KM practices, and innovation performance.
Knowledge can be considered a kind of capital invested in the complex process of innovation, and it is relevant to producing successful innovative products. KM contributes to innovation in more rewarding markets and increases competitiveness (Carneiro, 2000). As Drucker (2001) predicted, knowledge will be the mainstream of the next society. Therefore, KM is becoming critical to staying competitive, and it requires innovation to support organizational competitiveness (Sousa & Hendriks, 2006). Innovation is beneficial to business growth and sustainability; however, not all innovation will be successful (Moustaghfir and Schiuma, 2013). Nevertheless, the collaboration between innovation and KM is believed to be a key to success.

Do Big Data Make Individual’s Life More Beautiful?

Recently, big data have become part of the solutions to pressing global problems. Not only can they process massive data effectively, but, thanks to their prediction capability, big data can also allow for better preparation and thus reduce the damage from disasters. Benefits can be seen in addressing climate change, eradicating disease, fostering good governance, and contributing to economic development (Mayer-Schonberger and Cukier, 2013). The digitalization of data also helps resolve the challenge of preserving the volume of recorded knowledge, since books might decay faster than they could be reproduced, and big data can extract intelligence from massive and messy data. In the past, an untold amount of wisdom was lost to the ages, and there was little incentive to record more of it on the page. Now, things have changed. Big data can manage information affordably and efficiently, and they can capture value almost like magic. The challenge becomes: how can people and institutions be better prepared to harness the technology?
It is likely that the information people have obtained is interpreted as a signal serving their own benefit, in line with the desire to be the master of one's fate. This happened with the Reformation, which suggests not only a triumph of literacy and the new printing press but also greater political autonomy and economic development. The Protestant work ethic that the Reformation revealed is seen as the introduction of capitalism and the Industrial Revolution. Planning lives and predicting the world's course are tied up with the notion of forecasting and consequently with helping the progress of a society. Indeed, people started to learn how to use their accumulated knowledge to change society. The Protestant ethic and the free press phenomenon contributed to the Industrial Revolution, as both religious and scientific ideas could flourish in freedom from censorship (Silver, 2012).
The Industrial Revolution is an excellent example of how technologies accelerate the development of the economy and the explosion of knowledge. Without Gutenberg's printing press technology, the Ninety-five Theses could not have been so compelling in destroying monopoly positions. Galileo shared his (censored) ideas, and Shakespeare produced his plays, with the help of the printing press, and scientific and literary works have significantly progressed since then. However, a distinct gap between one's fate and being the master of one's fate still exists. Although the Industrial Revolution sped up the growth of knowledge, people's understanding of how to process it has been slow to catch up. Transferring information into useful knowledge is not only about technology; people need the capability to distinguish between signal and noise.
A productivity paradox between information growth and information comprehension has been apparent from the Industrial Revolution until now. From the computer age to the information age, people have made efforts to improve techniques for processing information. However, the speed of producing data and information never slows down. According to IBM (Gandomi & Haider, 2015), approximately 2.5 quintillion bytes of data are generated each day, and more than 90% of the data in existence were created in the last two years. “Big data” has become a fashionable term, not only as a result of the data's volume, variety, and velocity, but also in relation to the techniques used to process them to acquire knowledge.
Big data surprise and impress society to such a degree that they are sometimes seen as a cure-all. Although big data have the capability of predicting outcomes, they are not as almighty as God. Besides, science and theory are still needed to prevent people from making silly mistakes. The worry is that the belief in big data may be a reflection of the desire to be the master of one's fate, so that bias may be introduced when processing information from data. Anderson (2008) claims that the numbers could speak for themselves, but Silver (2012) believes that it is people who imbue data with meaning in their own way, for self-serving purposes. Data may become detached from their objective reality because of humans' intentional interpretation.
For ages, society honored and preferred causality. Now, big data are shifting the preference toward correlation through the attempt to quantify the world. This opens the door to new ways of understanding the world by measuring, storing, analyzing, and sharing things that were rarely quantified before. Exactitude and causality are becoming less relevant as a result (Mayer-Schonberger & Cukier, 2013). Big data may benefit those who can acquire a massive volume of data but bring disadvantages to specific and vulnerable groups. As a result, the gap between the poor and the rich will increase dramatically. This raises the question of whether people manage knowledge to the individual's and society's benefit. Are big data in favor of big companies only, or can they make people's lives more beautiful as well? To answer these questions, the authors will look into how the big data role model is formed, who the beneficiaries of big data are, and what the potential consequences may be.

Issues of Big Data Model

The big data role model refers to the algorithms and technologies that companies have developed to store and analyze massive data, which used to be difficult to manage in the past or with traditional tools, in order to extract knowledge and intelligence. Data scientists combine mathematical models and technology to deal efficiently with both transaction data and interaction data. The combination of the data scientist's model and business intelligence technology creates an outcome called serendipity, which refers to the occurrence and development of events by chance in a beneficial way. Data scientists usually require skills and knowledge from computer science and statistics to be able to capture business insights from big data sets. The correlation between data sets is more important than their causality. However, this focus on correlation suggests that the emphasis of data has shifted from accuracy to probability. People's desires, movements, and spending power will be studied by mathematicians and statisticians. The trustworthiness of people's potential as students, employees, consumers, and criminals will be predicted through calculation. The outcomes will show a tendency toward one side or the other. These kinds of models usually favor one side, according to their business models (BMs). Those who are listed on the favored side will be the winners, and, unfortunately, those who are on the other side of the result will be the losers or victims.
The paragraphs below explain how the BM of big data works. Big data runners tend to adopt a conventional “buy–sell” BM (Figure 1), whereby a company offers a value proposition and delivers it through its value chain and value network in order to obtain reasonable profits under a competitive strategy (Chesbrough, 2007, 2010, 2013).
Figure 1 shows the interaction of the elements in the BM. The logic in this BM is to optimize the benefits of the value proposer; in a big data BM, this refers to the big data company. The model is a simple one-directional (simplex) structure, which may explain why big data companies or big data users may be the largest beneficiaries. The whole design of this BM is based on the proposer's benefit, that is, earning returns from the target market segment. Thus, it implies that it may not be a fair trade for the target market segment.

Figure 1. Conventional “buy-and-sell” business model


Obviously, no statistical systems are perfect. A data-crunching program has the potential to misinterpret people a certain percentage of the time, putting them in the wrong groups and denying them a job or a chance at their dream house. This minority is usually considered unworthy and expendable. Attention should be paid to whether the disadvantaged are usually the same group, which would result in collateral damage: the poor will become poorer. Although it is not surprising that the big data model can multiply chaos and misfortune, organizations are still keen on applying it to more domains because their incentives, in both political capital and ordinary currency, are more attractive than anything else. It is the center of their BM, which is engineered to guzzle up more data and modify their analytics to gain more money. This is called the big data economy, and it attracts organizations with promising benefits.
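To give a rough sense of the scale involved (the figures here are purely illustrative and are not taken from the chapter): a scoring model that misclassifies only 5% of the people it evaluates, applied to 1,000,000 applicants, still mislabels 0.05 × 1,000,000 = 50,000 individuals; if those errors cluster in one group, the collateral damage described above is concentrated rather than random.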
Not all big data models will put democracy at risk, as data-driven predictions are like two sides of one coin: they can be accurate, and they can make mistakes. According to O'Neil (2017), if a model has all three of the following features, it is probably a “weapon of math destruction”: opacity, scale, and damage. A bank's model may be a suitable example. If someone is classified as a high-risk borrower, the world will treat him/her as a deadbeat, as the model is designed to provide no transparency in order to keep its business secrets. The model can be scaled up and may make it difficult for that person to buy a house or find a job, even if he/she has been misjudged. This is the damage. Some people may benefit from it and others may suffer. What is worse, those who suffer have little chance to appeal, as they have been condemned by secret algorithms, which is unfair and harmful to democracy.

What Can Be Wrong If No Governance Occurs in Big Data?

In a democratic society, people love freedom more than governance. Nevertheless, many crimes may be committed in the name of liberty. Writers sang the song of liberty for the French Revolution and, at the same time, lamented the evil of its lack of governance. Scholars have summarized the French Revolution as burning destruction and a world of unwisdom. This is an image of what a day of anarchy in big data may look like under the name of liberty. The good tidings of big data have spread rapidly, and companies are proactive in adopting them for their benefit. The governance issues of big data have been easily ignored, even though they are significant. Again, thinking about the negative bank model, maybe it is time for people to do something to govern the secret and harmful models.
Why might the algorithms designed by excellent data scientists have issues? According to Silver (2017), prediction outcomes rely on the quality and quantity of data. If the data contain more noise than signal, the predictions will be less accurate, and vice versa. Useful information may not increase as fast as information itself grows. Although an estimated 2.5 quintillion bytes of information are generated per day, most of this is just noise rather than signal. Moreover, with the growth of information, more and more hypotheses have to be tested and more data sets have to be mined. On the one hand, the advance of technology, such as the printing press, increases the speed of producing information; on the other hand, it also increases the rate at which errors spread. In other words, the quality of data may lead away from the truth by serving noise rather than knowledge.
The nature of data-driven predictions is correlation-oriented, so people may not be able to explain the reasons behind their decisions. Therefore, if any bias occurs in the model, those who are mistakenly selected may have no chance to appeal and may fall into a negative feedback loop regularly. Another concern is the danger to privacy: those who are predicted to be in a high-risk group for diseases may be charged higher health insurance fees, and the same may apply to mortgages and other services. In education, those who need education to change their lives may be targeted by private colleges that charge them a premium but offer them little hope or help to change their social status. These things challenge ethical considerations and free will, even though statistics may argue otherwise. A new way of controlling and managing data needs to be formed to adapt to the shift in technology.

Does a Mistake Have an Expiration Date?

Crime has a statute of limitations, as if bad memories had the right to be forgotten. An example could be that of an individual who made a mistake as a teenager; it was revealed and spread after he/she had a professional career or a family. Suddenly, his/her professional image went bankrupt. He/she had thought he/she could move on to a new life, until the past came back to bite him/her. Who should be blamed for this drama? This scenario is familiar from Les Miserables, by Victor Hugo (2013). In the big data era, similar situations will take place, only more often than before, as information is easier to preserve and spread. How long before a mistake can be forgotten or forgiven? It is a question that must be asked.
In order to answer this question, it may be necessary to go back to the time before the digital era, when memories could be buried as time went by. When remembering had a price and storage was limited, people tended to treasure their property more, as the cost of losing it was too high. Undeniably, what is gone cannot be returned. The point is whether it is possible to revive digital forgetting, to let “dust to dust”, and to let people focus on living in the present by making the appropriate rules.
The digitalization of data has made information easy and cost-effective to preserve and spread. Technology not only accelerates the diffusion of knowledge and makes it more efficient to manage, but it also speeds up the spread of noise and harmful data. It may threaten people's privacy and ethical considerations. During World War II, significant data were collected through spies and surveillance, which destroyed people's trust. Nowadays, with the advance of the Internet, mobile devices, and the digitalization of data, the risks inherent in big data may increase as much as the datasets themselves grow. Data used to be difficult to duplicate but easy to destroy; after digitalization, destroying them completely before they spread becomes a challenge. Governance is needed to prevent people from being made prisoners of their past by the future presumptions of big data.
Anonymization is a possible solution for protecting privacy. However, perfect anonymization is impossible, as people's connections with one another will reveal their identity. Besides, the nature of big data is to predict based on correlations and to make causal decisions about individual responsibility. Neither individual notice and consent nor opting out can effectively ensure one's privacy in the era of big data. According to Mayer-Schonberger (2009), the virtue of forgetting should be reintroduced in the digital era. People need a chance to move on from their past through the establishment of a system of deletion, since what is done is done. This is feasible by setting an expiration date for information, to let bygones be bygones.
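As a purely illustrative sketch of how such an expiration date could work in practice (the record layout, field names, and retention period below are hypothetical and are not taken from the chapter), each stored record could carry an agreed forgetting date, and a periodic purge job could delete anything past that date:

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Record:
    owner_id: str
    payload: str
    expires_at: datetime  # the agreed "forgetting" date, set when the data are collected

def purge_expired(store: List[Record], now: datetime) -> List[Record]:
    """Keep only the records whose expiration date has not yet passed."""
    return [r for r in store if r.expires_at > now]

# Example: usage data are kept for two years and then allowed to be forgotten.
now = datetime(2019, 1, 1)
store = [
    Record("user-1", "old incident", now - timedelta(days=10)),          # already expired
    Record("user-2", "recent purchase", now + timedelta(days=365 * 2)),  # still within its retention period
]
store = purge_expired(store, now)  # the "user-1" record is forgotten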

SOLUTIONS AND RECOMMENDATIONS

How Can Big Data Be Governed for the Benefit of the Individual and Society?

Innovation occurs when looking for solutions to acquire, share, and manage knowledge. According to Drucker (1969), in the 21st century knowledge is an economic resource that can be employed to generate tangible and intangible value. Combined with technology, it can create significant economic value. Schleicher (2010) elaborated on this further: knowledge is becoming the global currency that people create. The chase for knowledge has no limit. The concept that knowledge is power seems to be a common language that requires no translation. As a fashionable tool for extracting knowledge to power innovation, big data are considered a catalyst for boosting economic development. However, a critical question is: how can big data be managed and governed in a way that benefits both the individual and society?

Governing Data or Under Its Governance

Companies may benefit from big data predictability, but people may not be happy to exchange their privacy for convenience. Therefore, it is time to tackle the issue of the governance of big data. If the world that people are living in were like the movie Minority Report, crime could be predicted before it happened. This raises the choice between governing data and being under their governance. Big data are challenging the boundary of justice more than ever. People aim to get benefits from using data, but things may get out of control if data are not used wisely. Should not the rules of data governance be made to clarify the border between the just and the unjust (Mayer-Schönberger and Lazer, 2007)? In order to achieve data equality and ensure benefits to the whole society, governance should start from how big data may influence people's lives, work, and ways of thinking: data accountability and human agency versus prediction probability, and a data audit system.

Data Accountability

Technology is changing not only history but also the way knowledge is extracted and how data are acquired and collected. Big data change the character of the risk from personal privacy to secondary uses, which suggests that personal notice and consent are no longer the only lawful way to gather and process private data (Hey, 2004). New solutions are needed to face this potential risk, along with a new way to acquire data. Secondary uses of data can be too innovative to be imagined in advance (Chen et al., 2009). Therefore, personal consent makes less sense compared to the past. The way of managing data should be renewed into a data-driven policy, based on the shift occurring in the big data era. The data consent policy implies that the data owner should be aware of the risk of offering his/her data and be responsible for them. Apart from the issue of ownership, data risk sometimes occurs in secondary uses, which personal consent does not cover. The data owner may be too innocent to take that responsibility. Conversely, it may be unfair that data users, who may be the most significant knowledge beneficiaries, take little or no responsibility. Thus, it is suggested that data users should take more responsibility, in line with the concept of data accountability.
Human data, both biomedical and behavioral, are sensitive and can raise ethical issues. Therefore, relevant committees have been created to ensure data are used properly. An IRB is an example. An IRB reviews research methods and purposes to make sure that the research does no harm to the individual or society (Fiske and Hauser, 2014; Huser and Cimino, 2016). The board can approve or reject a study according to whether a professional risk-benefit analysis finds it beneficial or harmful. An IRB acts as an ethical safeguard to protect people's rights and welfare in their participation in research (Rivers and Lewis, 2014). It covers all the potential concerns needed to preserve people from physical or psychological harm by reviewing research protocols and related materials. An IRB covers three topics: scientific, ethical, and regulatory (Nunan and Di Domenico, 2013). Participants should be aware of the purpose of the usage of their data and its potential risks. If a concern arises in the research, modifications will be required of the applicants to ensure participants' rights (Chen, Chiang and Storey, 2012). Relevant risk-benefit analysis is essential in BDG.
An IRB is required in governing big data to ensure that data owners' rights will be protected and that the methods of gathering data are ethical and compliant with regulations. Moreover, the emphasis should be more on the data users, to assure that data are properly employed without ethical concerns for participants. A gatekeeper should accept, reject, or request modifications to ensure the benefits for both individuals and society. It is not only the data but also the big data algorithms that should be taken into account by an IRB, because the algorithm that composes a big data profiling model may contain bias and discrimination and thus affect the outcomes of predictions. Further, the algorithms should be open to modification, as they involve the interactions between people and their behavior. They are not fixed variables; they are flexible and changeable. When people are aware that their behavior is monitored or detected, they may change their original intention, but the algorithm may not be intelligent enough to adapt accordingly. Consequently, the results may lose accuracy.
Competition in the market will affect the algorithms in big data as well. Once a model is built, participants and competitors may not need long to discover its rules. The rules need to be modified to maintain the model's competitiveness, so the life cycle of the model gradually becomes shorter and shorter. If organizations still apply the same algorithms, they may obtain inaccurate results without being aware of it. This causes discrimination and misunderstanding for individuals and society and may be harmful to democracy. Therefore, an independent institute or reviewing board is required to take things under control or get them back on the right track. It is required to keep probability-based models from punishing innocent people.

Data Audit

If data can recommend judgments, they may need to be audited. Since to err is human and to reflect is divine, the same applies to data. Algorithmists need to review big data models to prevent bias, in case someone becomes overly confident in, or obsessed with, the analysis results because they look so impressive. A feedback system should be provided to perfect the data-driven intelligence (O'Neil, 2017). It should start by breaking open the black box of big data. Big data algorithms are extremely complicated, owing to the deep professional domain knowledge they embody and the scale of the data sets involved. Those who are not in the same field have difficulty understanding them. Therefore, an audit system should be built by experts with professional domain knowledge, including both external and internal data scientists. Just as the professions of accountant and auditor emerged to manage the deluge of financial information, today algorithm auditors are needed to handle the explosion of knowledge. Where there is a need, there is a market: agile and self-regulating specialists are in demand. Algorithm auditors should be divided into external and internal auditors in order to evaluate and improve the effectiveness of risk management, governance, and control in their domain. They may strengthen society's confidence in the economy through professional offerings that bring benefits to the market and the community.

Human in the Loop

Following data accountability and data audit, data-driven predictions may need to be weighed against human agency. According to the logic of data-driven predictions, people will be judged by their data predictions, rather than by their behavior. Although big data can predict people's behavior to offer
considerable service, it may threaten human free will and the human rights of defense and appeal. Data scientists develop algorithms to calculate the probability and preference of people doing something. In other words, big data are becoming the judge of people. People will be judged and punished according to data predictions, rather than their own acts, which is against human nature and threatens ethics in society (O'Neil, 2017). Initially, the data-driven concept aims to offer a relatively objective and less emotional reference for risk management. It impresses people with accurate predictions that improve their lives, but it seems that the human role may be ignored in the big data era. It is necessary to go back to the origin and ask why big data are needed. Data should be acquired and used for human benefit. However, humans tend to exercise their free will when making choices; whether a decision is right or not, the point is that it is their own decision. Likewise, people never give up trying to control their lives and fight against their fate. It is unwise to sacrifice people's rights when managing data. Governance should take place to restore the rights of adulthood: being free to make one's own decisions and being responsible for one's own behavior.

Bayes’s Theorem

When designing a big data model that is fair to all and less risky for society as a whole, Bayes' theorem can be considered (Pawlak, 2002). Bayes' theorem describes the probability of an event, based on existing knowledge of conditions associated with it. It may have various probability interpretations when it is applied. Basically, it describes how a subjective degree of certainty should be sensibly updated to account for the availability of related supporting evidence. For instance, if intelligence quotient (IQ) is related to income, a person's income can be used, via Bayes' theorem, to assess the probability that the person has a high IQ more accurately than an evaluation made without prior knowledge of the person's income. The development of algorithms can consider the spirit of Bayes' theorem (Abbott, 2014). The first prediction outcomes can be treated as prior knowledge that the model adds to subsequent probability estimates. The advantage of this application is that it allows more objective, evidence-based algorithms to be developed on the basis of prior facts. If the developer of the algorithms has introduced a prior bias into the model, the bias can be adjusted accordingly, based on existing facts. This will help to protect individuals and society from misunderstanding and mistreatment. The application of Bayes' theorem also implies that people should not only think beyond their current knowledge by adding more facts to the judgement model, but also leave space to test their ideas (Fong, Wong and Vasilakos, 2016). People also need to be comfortable with the reality that probability and uncertainty will always be with them. Individuals' assumptions and beliefs will always need to be tested; even so, problems will still occur, because that is a real feature of life. The only possible action is to try to come close to the signal, the truth, and to stay away from what distracts from it, the noise. Nowadays, people have abundant information, but this does not mean that human understanding has increased. It is expected that Bayes' theorem can help people come closer to the truth.
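To make the intuition concrete, the following minimal sketch applies Bayes' theorem, P(A|B) = P(B|A)·P(A)/P(B), to the IQ-and-income illustration above. All of the probability values are hypothetical placeholders chosen only to show the mechanics of updating a prior belief with evidence; they are not empirical figures.

```python
# Illustrative sketch of Bayes' theorem applied to the IQ-and-income example above.
# All probability values are hypothetical placeholders, not empirical findings.

def bayes_posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical assumptions for illustration only:
p_high_iq = 0.10              # prior: P(high IQ)
p_income_given_iq = 0.40      # likelihood: P(high income | high IQ)
p_high_income = 0.15          # evidence: P(high income) in the population

posterior = bayes_posterior(p_high_iq, p_income_given_iq, p_high_income)
print(f"P(high IQ | high income) = {posterior:.2f}")  # ~0.27 with these numbers
```

With these illustrative numbers, learning that a person has a high income raises the estimated probability of a high IQ from the 10% prior to roughly 27%, which is exactly the kind of evidence-based adjustment described above.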

FUTURE RESEARCH DIRECTIONS

Governance should take place to balance the tradeoff between supporting the development of big data and preventing their potential hazards, so that their rewards can be seized. Human ability should not be overtaken when shaping the technology. Since big data have changed the rules of the game, privacy protections
should be altered from individual consent to data users' accountability. In order to prevent data-driven predictions from replacing human agency, a data auditor system should be established to serve effective and fair governance. Big data have their dark side; it is people who let the light shine. Controlling big data is paramount, in order to avoid people being controlled by them.
Currently, the beneficiaries of big data are the leading class in society; individuals and voluntary groups are still at a disadvantage. Regulations are one approach to solving these problems. While people look for the benefits of technology, they should also consider allowing big data to humanize the digital era. People's role in the process matters. It seems that data-driven predictions are here to help people make objective decisions, or to prepare better for what may happen, since prevention is better than cure. However, it should also be remembered that people still desire to be the masters of their fate. The capability of predicting the future seems to make people feel empowered, but people must admit that they are not very good at predicting things and that data are not almighty. In fact, as no objective truth is available, no data set is big enough to tell people the truth. What is called big data, which people can obtain, is still not large compared to the knowledge in the universe. People need an attitudinal change, realizing how limited they are and how little they can know about the future. Then, people can predict with a humble mind.
Big data should not be an ice-cold world of algorithms and automatons; they should leave humanity space to flourish. Technology is a tool for extracting knowledge for human beings, and the human role should be at the center of technology. People may be afraid of technology; yet uncertainty is actually the fuel of engineering innovation. Throughout history, innovation has usually flourished when people have tried to solve problems by using intuition and faith to confront uncertainty. When humans encounter challenges and barriers, they find ways to overcome them, which creates new industries and jobs. It is the human being that creates new technology through innovation and creativity, and eventually solves problems. Thus, technology should be used to serve people, rather than to replace them. What matters is how to optimize the respective roles of human intelligence and data-driven predictions.

CONCLUSION

Prosperity flows when people make efforts to solve problems or improve the quality of life through innovation. Unlike data-driven predictions, human beings draw their spark from trial and error. In the world of innovation, actions speak louder than words. Every time people work on an issue and collaborate with each other, society makes progress. Advanced technology is invented when a need emerges. Action and collaboration are the logic of innovation that contributes to prosperity. Stakeholders who proceed with an advanced understanding of big data may allow its use for individual and societal benefit to be considered in a meaningful, constructive, and cautious way.
In fact, people do play an essential role in enabling innovation. Mistakes and misperceptions can give birth to innovation, and human creativity, instinct, and genius can emerge from mess and chaos. What makes people suffer can also make them grow. This suggests that people should not mind managing messy data, as such data carry significant meaning. Similarly, people should welcome imperfection, as it is a part of human life. If everyone owns the same data and tools, what differentiates people? Perhaps it is unpredictability, such as instinct, risk-taking, or error, that makes a difference. It is essential to create space for the human, so that intuition and serendipity can occur, and to make sure that data and machine-made answers will not
crowd them out. Space should be preserved for the surprising, the unspoken, and the not-yet-thought. These elements embody the spirit of innovation and entrepreneurship, and even of progress in society.
The advantages that big data have produced should generate more innovations, which lead to prosperity. However, the spark of innovation is usually something that data cannot tell or confirm, as it does not yet exist. No matter how hard one may try, people can only collect and process a small amount of
the information that exists in the universe; it is only a shadow of reality. Big data predictions can never be fully trustworthy, as people can never obtain perfect information. Therefore, people should not use big data without a generous degree of humility and humanity.

REFERENCES

Abbott, D. (2014). Applied predictive analytics: Principles and techniques for the professional data
analyst. John Wiley & Sons.
Adams, G. L., & Lamont, B. T. (2003). Knowledge management systems and developing sustainable com-
petitive advantage. Journal of Knowledge Management, 7(2), 142–154. doi:10.1108/13673270310477342
Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired
Magazine, 16(7).
Baregheh, A., Rowley, J., & Sambrook, S. (2009). Towards a multidisciplinary definition of innovation.
Management Decision, 47(8), 1323–1339. doi:10.1108/00251740910984578
Brown, A. L., & Palincsar, A. S. (1989). Guided, cooperative learning and individual knowledge acquisition.
In Knowing, learning, and instruction: Essays in honor of Robert Glaser (pp. 393-451). Academic Press.
Browning, D. M., Meyer, E. C., Truog, R. D., & Solomon, M. Z. (2007). Difficult conversations in
health care: Cultivating relational learning to address the hidden curriculum. Academic Medicine, 82(9),
905–913. doi:10.1097/ACM.0b013e31812f77b9 PMID:17726405
Carneiro, A. (2000). How does knowledge management influence innovation and competitiveness?
Journal of Knowledge Management, 4(2), 87–98. doi:10.1108/13673270010372242
Chen, C. J., Huang, J. W., & Hsiao, Y. C. (2010). Knowledge management and innovativeness: The
role of organizational climate and structure. International Journal of Manpower, 31(8), 848–870.
doi:10.1108/01437721011088548
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to
big impact. Management Information Systems Quarterly, 1165–1188.
Chen, M., Ebert, D., Hagen, H., Laramee, R. S., Van Liere, R., Ma, K. L., ... Silver, D. (2009). Data,
information, and knowledge in visualization. IEEE Computer Graphics and Applications, 29(1), 12–19.
doi:10.1109/MCG.2009.6 PMID:19363954
Chen, S. C. I. (2018). Technological Health Intervention in Population Aging to Assist People to Work
Smarter not Harder: Qualitative Study. Journal of Medical Internet Research, 20(1), e3. doi:10.2196/
jmir.8977 PMID:29301736

Chesbrough, H. (2007a). Why companies should have open business models. MIT Sloan Management
Review, 48(2), 22–28.
Chesbrough, H. (2007b). Business model innovation: It’s not just about technology anymore. Strategy
and Leadership, 35(6), 12–17. doi:10.1108/10878570710833714
Chesbrough, H. (2010). Business model innovation: Opportunities and barriers. Long Range Planning,
43(2), 354–363. doi:10.1016/j.lrp.2009.07.010
Chesbrough, H. (2013). Open business models: How to thrive in the new innovation landscape. Boston:
Harvard Business Press.
Chesbrough, H., & Vanhaverbeke, W. (2006). Open innovation: Research a new paradigm. New York:
Oxford University Press.
Dalkir, K. (2013). Knowledge management in theory and practice. Routledge.
Drucker, P. F. (1969, November). Management’s new role. Harvard Business Review, 49–54.
Drucker, P. F. (2001). Knowledge work and knowledge society: the social transformations of this century.
British Library.
Du Plessis, M. (2007). The role of knowledge management in innovation. Journal of Knowledge Man-
agement, 11(4), 20–29. doi:10.1108/13673270710762684
Fagerberg, J. (2004). Innovation: A guide to the literature. Oslo: Georgia Institute of Technology.
Fagerberg, J. (2006). Innovation, technology and the global knowledge economy: Challenges for future
growth. Proceedings of the Green Roads to Growth Project and Conference.
Fagerberg, J., Fosaas, M., & Sapprasert, K. (2012). Innovation: Exploring the knowledge base. Research
Policy, 41(7), 1132–1153. doi:10.1016/j.respol.2012.03.008
Fagerberg, J., & Srholec, M. (2009). Knowledge, capabilities, and the poverty trap: The complex inter-
play between technological, social, and geographical factors. International Centre Economic Research
Working Paper, 24, 1-23.
Fiske, S. T., & Hauser, R. M. (2014). Protecting human research participants in the age of big data.
Academic Press.
Fong, S., Wong, R., & Vasilakos, A. (2016). Accelerated PSO swarm search feature selection for data
stream mining big data. IEEE Transactions on Services Computing, (1), 1–1.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. In-
ternational Journal of Information Management, 35(2), 137–144. doi:10.1016/j.ijinfomgt.2014.10.007
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of
“big data” on cloud computing: Review and open research issues. Information Systems, 47, 98–115.
doi:10.1016/j.is.2014.07.006

Hesse, B. W., Moser, R. P., & Riley, W. T. (2015). From big data to knowledge in the social sci-
ences. The Annals of the American Academy of Political and Social Science, 659(1), 16–32.
doi:10.1177/0002716215570007 PMID:26294799
Hey, J. (2004). The data, information, knowledge, wisdom chain: The metaphorical link. Intergovern-
mental Oceanographic Commission, 26, 1–18.
Hugo, V. (2013). Les misérables. Simon and Schuster.
Huser, V., & Cimino, J. J. (2016). Impending challenges for the use of Big Data. International Journal
of Radiation Oncology, Biology, Physics, 95(3), 890-894.
Košturiak, J. (2010). Innovations and knowledge management. Human Systems Management, 29(1), 51–63.
Lichtenthaler, U., & Ernst, H. (2009). The role of champions in the external commercialization of knowl-
edge. Journal of Product Innovation Management, 26(4), 371–387. doi:10.1111/j.1540-5885.2009.00666.x
Liew, C. B. A. (2008). Strategic integration of knowledge management and customer relationship
management. Journal of Knowledge Management, 12(4), 131–146. doi:10.1108/13673270810884309
Mayer-Schönberger, V. (2009). Can we reinvent the internet? Science, 325(5939), 396–397.
doi:10.1126/science.1178418 PMID:19628843
Mayer-Schönberger, V. (2011). Delete: The virtue of forgetting in the digital age. Princeton University
Press.
Mayer-Schonberger, V., & Cukier, K. (2013). Big data: The essential guide to work, life, and learning
in the age of insight. Hachette.
Mayer-Schönberger, V., & Lazer, D. (Eds.). (2007). Governance and information technology: From
electronic government to information government. MIT Press.
McHugh, D. (2015). Traffic prediction and analysis using a big data and visualisation approach. De-
partment of Computer Science, Institute of Technology Blanchardstown.
Mitchell, S. A. (1998). The analyst’s knowledge and authority. The Psychoanalytic Quarterly, 67(1),
1–31. doi:10.1080/00332828.1998.12006029 PMID:9494977
Mittelstadt, B. D., & Floridi, L. (2016). The ethics of big data: Current and foreseeable issues in bio-
medical contexts. Science and Engineering Ethics, 22(2), 303–341. doi:10.1007/s11948-015-9652-2
PMID:26002496
Moreno, O., Shapira, B., Rokach, L., & Shani, G. (2012, October). Talmud: transfer learning for mul-
tiple domains. In Proceedings of the 21st ACM international conference on Information and knowledge
management (pp. 425-434). ACM.
Moustaghfir, K., & Schiuma, G. (2013). Knowledge, learning, and innovation: Research and perspec-
tives. Journal of Knowledge Management, 17(4), 495–510. doi:10.1108/JKM-04-2013-0141

Nicolini, D., Powell, J., Conville, P., & Martinez‐Solano, L. (2008). Managing knowledge in the healthcare
sector. A review. International Journal of Management Reviews, 10(3), 245–263. doi:10.1111/j.1468-
2370.2007.00219.x
Nonaka, I., & Von Krogh, G. (2009). Perspective—Tacit knowledge and knowledge conversion: Con-
troversy and advancement in organizational knowledge creation theory. Organization Science, 20(3),
635–652. doi:10.1287/orsc.1080.0412
Noordin, M. F., & Karim, Z. A. (2015). Modeling the relationship between human intelligence, knowledge
management practices, and innovation performance. Journal of Information & Knowledge Management,
14(01), 1550012. doi:10.1142/S0219649215500124
Nunan, D., & Di Domenico, M. (2013). Market research & the ethics of big data. International Journal
of Market Research, 55(4), 505–520. doi:10.2501/IJMR-2013-015
O’Neil, C. (2017). Weapons of math destruction: How big data increases inequality and threatens de-
mocracy. New York: Broadway Books.
Pawlak, Z. (2002). Rough sets, decision algorithms and Bayes’ theorem. European Journal of Opera-
tional Research, 136(1), 181–189. doi:10.1016/S0377-2217(01)00029-7
Rivers, C. M., & Lewis, B. L. (2014). Ethical research standards in a world of big data. F1000 Research, 3.
Shakespeare, W. (1826). The dramatic works of Shakespeare. William Pickering.
Silver, N. (2012). The signal and the noise: the art and science of prediction. London: Penguin UK.
Sutherland, L. S., & Soares, C. G. (2012). The use of quasi-static testing to obtain the low-velocity im-
pact damage resistance of marine GRP laminates. Composites. Part B, Engineering, 43(3), 1459–1467.
doi:10.1016/j.compositesb.2012.01.002
Wang, S., Wan, J., Zhang, D., Li, D., & Zhang, C. (2016). Towards smart factory for industry 4.0: A
self-organized multi-agent system with big data based feedback and coordination. Computer Networks,
101, 158–168. doi:10.1016/j.comnet.2015.12.017

KEY TERMS AND DEFINITIONS

Big Data: They are data sets that are so voluminous and complex that traditional data-processing
application software is inadequate to deal with them. The technology of processing big data sets is often
called big data as well.
Big Data Governance: It is a comprehensive way to protect information assets both for organizations
and their customers, to ensure they are used in a reliable and secure manner.
Business Intelligence: It refers to the meaningful and useful knowledge extracted from data and
information.

Digitization: It is the process of converting information into a digital format, in which the informa-
tion is organized into bits.
Ethics: They are moral principles that govern a person’s behavior or the conducting of an activity.
Innovation: It can be defined simply as the application of better solutions that meet new requirements,
unarticulated needs, or existing market needs through developing a new idea, device, process, or method.
Knowledge Management: It is an efficient way of managing information and resources within an
organization.
Prosperity: It is the state of flourishing, thriving, good fortune, or successful social status.


Chapter 10
Big Data for Prediction:
Patent Analysis – Patenting Big
Data for Prediction Analysis

Mirjana Pejic-Bach
University of Zagreb, Croatia

Jasmina Pivar
University of Zagreb, Croatia

Živko Krstić
Atomic Intelligence, Croatia

ABSTRACT
The technical field of big data for prediction attracts the attention of different stakeholders. The reasons are related to the potential of big data, which allows for learning from past behavior, discovering patterns and values, and optimizing business processes based on new insights from large databases. However, in order to fully utilize the potential of big data, its stakeholders need to understand the scope and volume of patenting related to big data usage for prediction. Therefore, this chapter aims to perform an analysis of patenting activities related to big data usage for prediction. This is done by (1) exploring the timeline and geographic distribution of patenting activities, (2) exploring the most active assignees of the technical content of interest, (3) detecting the type of protected technical content according to the international patent classification system, and (4) performing text-mining analysis to discover the topics emerging most often in patents' abstracts.

INTRODUCTION

Patent databases are an abundant and important source of information about a particular technical field, and patent analysis has proven to be an effective tool for decision makers who seek a comprehensive overview of different technology topics, such as big data technologies (Madani & Weber, 2016). Decision makers may want to understand relevant trends, to spot new technologies in a particular area, or to estimate the importance of emerging new technologies. Moreover, patent information is a relevant
source for those who want to get familiar with key players of a particular technology, or to learn about
their productivity and patenting behavior.
Big data technologies have attracted lots of attention due to their ability to analyze large amounts of
various data sources, and extract useful information from them. Recently, big data technologies have
become not only a methodology for analyzing the current situation, but are also used as tools for predic-
tion in various fields, such as retailing, marketing and social media (e.g. Bradlow et al., 2017; Miah, Vu,
Gammack & McGrath, 2017; Shirdastian et al., 2017).
The goal of this chapter is to analyze and help to understand patents related to big data for prediction. The chapter will provide answers to the following questions that are of interest to big data inventors and investors: (1) What is the timeline of patents of big data solutions for prediction? (2) Who are the assignees of patents of big data solutions for prediction, and what is their geographic origin? (3) What are the most frequent IPC patent areas of patents of big data solutions for prediction? (4) What are the most frequent topics of patents of big data solutions for prediction? Answers to these questions will provide useful guidance related to competitiveness and new trends that emerge in the usage of big data technologies for prediction. An additional goal of this chapter is to assess the usability of several data mining and text mining methods for the purpose of patent analysis, specifically association analysis of IPC patent areas, key-term extraction and clustering. For this purpose, Statistica Text Miner 13.0 and Provalis WordStat 8.0 have been used.
The chapter consists of the following sections. After the introduction, the second section presents the
background of the research, encompassing the notion of big data, usage of big data for prediction, and
usage of patent analysis. The third section describes the methodology used. The results of the analysis
are presented in the fourth section. Finally, the last section is used to synthesise findings, present limita-
tions, and future research directions of the chapter.

BACKGROUND

Big Data and Predictive Analytics

Big data has become an exciting field of study for practitioners and researchers, due to the need to adapt
to the emergence of huge databases (Parr Rud, 2011). Each of them has different focus and concerns in
this area, which yielded various definitions and descriptions of big data. Practitioners, such as consulting
companies and multinational corporations, define big data by mainly focusing on the technology neces-
sary to handle such data. For example, the National Institute of Standards and Technology describes it as
data that exceed capacity or capability of conventional systems and “require a scalable architecture for
efficient storage, manipulation and analysis” (NIST, 2017, p. 8). On the other hand, scientists describe
big data as the phenomenon related to various characteristics of data generated by different actions, e.g.
social media and business transactions. Boyd and Crawford (2012, p. 662) define big data as “cultural,
technological, and scholarly phenomenon that rests on the interplay of technology, analysis and mythology”.
Furthermore, scientists often use the following three characteristics in order to describe big data: Volume,
Variety and Velocity. Volume describes the large amount of data that depends on the type of data, time
and industry, which “make it impractical to define a specific threshold for big data volumes” (Gandomi
& Haider, 2014, p. 137). Variety refers to the various types of data including structured, semi-structured

and unstructured data (Chen et al., 2014), while velocity relates to rapid and timely data collection and analysis (Chen et al., 2014; Dmitriyev et al., 2015; Vera-Baquero et al., 2015).
Harnessing big data is believed to result in more efficient and effective operations (Günther et al.,
2017). Moreover, big data is being perceived as support for decision-making (Sharma & Kankanhalli,
2014) or as a source of business opportunities (McAfee & Brynjolfsson, 2012; Gandomi & Haider, 2014).
Günther et al. (2017) stress that continuous interaction between work practices, organizational models and stakeholder interests is the prerequisite for the successful usage of big data. Big data analytics is the main source of value generated by big data technologies, since it allows the generation of new knowledge from huge databases, which only recently emerged as a possibility. Big data analytics refers to the exploitation of algorithms that can process a large volume of various types of data at increasing speeds, and can be classified into the following groups: text analytics or text mining, multimedia analytics, social media analytics and predictive analytics:

• Text analytics denotes techniques for extraction of useful information and knowledge from un-
structured textual data (e.g. business documents, emails, social media). Text mining is primarily
based on natural language processing (NLP) which enables computational text analysis, inter-
pretation and generation (Chen, Chiang & Storey, 2012; Chen et al., 2014; Gandomi & Haider,
2015). Examples of common NLP-based techniques used in text analytics are text summarization
techniques, opinion mining, clustering, and so on.
• Multimedia data analytics refers to information extraction from unstructured audio, images and
video streams data. The transcript-based approach and phonetic-based approach are two common
technological approaches to audio analytics (Gandomi & Haider, 2015). Video analytics or video
content analysis refers to various techniques for analyzing and extracting information from video
data.
• Social media analytics encompasses techniques for analyzing both structured and unstructured
data generated by social media (Chen et al., 2012; Gandomi & Haider, 2015). Social media ana-
lytics is classified into content-based analytics and link-based analytics. Content-based analytics
refers to usage of text; video and audio analytics for analyzing data generated by users of social
media, such as images, reviews and so on. Link-based analytics is focused at structure of social
networks and relationships among entities that participate in networks. For example, community
detection techniques can be used to uncover behavioral patterns and predict properties of certain
network. Additionally, participants’ influence or strength of connections in networks can be evalu-
ated by using so-called social influence analysis. Similarly, link prediction strives to predict future
linkages between entities in a network.
• Predictive analytics use both quantitative and qualitative approaches to learn from past behavior,
uncover patterns in data and to optimize business processes based on new insights. It usually
refers to the application of statistical techniques, data mining and machine learning algorithms
to extract information and knowledge from structured data. Common goals of various predictive analytics approaches are to find patterns in data and to explore relationships within data.

Business application domains that currently focus on big data and predictive analytics are retail, marketing, and social media. Bradlow et al. (2017) examine the opportunities of using data about customers,
products, time, location and channel for the purpose of decision making in retailing, using Bayesian
techniques on large dataset. Miah et al. (2017) propose the method for analysing unstructured data,

geo-tagged photos uploaded by tourists to social media, to support strategic decision-making in tour-
ism destination management. Salehan and Kim (2016, p. 31) suggest an approach for development of
“scalable, automated systems for sorting and classification of big online consumer reviews data, which
will benefit both vendors and consumers”. Yi and Wang (2017, p. 188) presented “a big data analytics
based fault prediction approach for shop floor scheduling”. Latent semantic analysis and the support
vector machine were used to examine the sentiments toward a brand to identify the reasons for positive
or negative sentiments on social media (Shirdastian et al., 2017).
Some authors discussed application areas that predictive analytics using big data will greatly influence in the future. Akter and Wamba (2016) review the usage of big data analytics in e-commerce. They concluded that the main application areas of big data analytics in e-commerce are personalization, dynamic pricing, customer service, supply chain visibility, and security and fraud detection, as well as predicting individual customers' theoretical value to the company, predicting sales patterns, forecasting and determining inventory requirements, and predicting consumer preferences and behavior. Big data analytics has attracted attention in various areas such as logistics and supply chain management (Waller and Fawcett, 2013), cyber-physical systems (Lee et al., 2015), auditing (Geep et al., 2018), cognitive computing (Garret, 2014), health care services (Wu et al., 2016), and cybersecurity (Rassam et al., 2017).

Patent Analysis for Decision Making

Decision makers who seek a comprehensive overview of different technology topics in a technical field of interest may rely on patent analyses, which often utilize text mining (Pejić Bach et al., 2017). Madani and Weber (2016) analyze the evolution of patent analysis, focusing on text mining. Brügmann et al. (2014) present a workbench for intelligent patent document analysis, which includes modules for
summarization, entity recognition, segmentation, lexical chain identification and claim-description align-
ment. For example, Kim et al. (2016) use the semantic patent topic analysis-based bibliometric method
to generate patent development maps related to 3D printing technologies. Altunas et al. (2014) analyzed patent documents by using weighted association rules that recognise the different importance of protected technical content based on the following criteria: commercial significance and technological impact. Patent lanes developed with regard to semantic similarities, which can be seen as the deployment of patent clusters, were suggested by Niemann et al. (2017) in order to describe the development of a technological field over time. Han et al. (2017) presented the usage of natural language processing technologies to extract concepts and patent similarity assessments, and to support content-oriented visualisation.
Valuable insights lie in patent citations, whose analysis can reveal patterns of knowledge spillover and diffusion of information between different stakeholders such as countries, universities and companies. Patent citation analysis has proven applicable across different technical fields that serve the creation of technology (Sharma & Tripathi, 2017). Kyebambe et al. (2017) used supervised learning methods to forecast emerging technologies. Furthermore, Kim and Bae (2017) suggested a three-step methodology
for technology forecasting. The first step is to cluster patent documents based on cooperative patent
classification. The second step is to examine the combination of cooperative patent classification of
each derived clusters. The final step is to determine which clusters are promising based on analysis of
patent indicators such as citations, triadic patent families as well as independent patent claims. Song
et al. (2018) used a bibliographic coupling to patents to produce a list of outlier patents, developed the
technological and market measures to evaluate them and determined promising technologies based on
the developed measures.

Patents can be searched and analyzed by using numerous patent databases or platforms. Patent da-
tabases can be divided into national databases and world databases. Examples of national databases are
United States Patent and Trademark Office (USPTO) patent database, Canadian Intellectual Property
Office patent database, Australian patent database - AutPat or DEPATISnet, which contains patents from
the German Patent and Trade Mark Office. Patent databases that contain patent documents from around
the world are Espacenet, Google Patents, The Lens, Patentscope, which provides access to international
Patent Cooperation Treaty (PCT) applications, and OECD Patent Database that contains data on pat-
ent applications to the European Patent Office - EPO and USPTO. Commercial patent platforms allow
advanced patent search and analysis such as patent network analysis or citation analysis. Examples of
commercial patent platforms are PatSeer, Clearstone Elements, PatentCloud, LifeQuest, Derwent In-
novation by ClarivateAnalytics, Total Patent One by Lexis Nexis and Octimine.

METHODOLOGY

Patents from the PatSeer database related to big data usage for prediction analytics from 2013 to 13 October 2017 are analyzed, using a longitudinal approach in combination with text mining techniques. The patent analysis consists of four phases: (i) patent search and selection, (ii) timeline, geographic origin and patent assignee analysis, (iii) patent analysis according to the IPC system patent area, and (iv) text mining.

Phase One: Patent Search and Selection

A patent, in general, is an exclusive right granted for an invention to exclude others from making, using,
or vending the patented invention without the patent owner’s permission. Each patent’s information or
so-called meta-data of patents are provided in the form of highly structured documents. Patent documents
usually contain following patent’s data: title, abstract or description, publication or issue year, filing/
application year, priority country, assignee country, The International Patent Classification codes, The
Cooperative Patent Classification CPC codes, File Index codes, backward/forward citations and so on.
Analysis of patents’ documents containing all of these data sheds light on a technical area of interest
and can serve to stakeholders in their decision-making. Patent databases should provide accurate data
in comprehensible format and deliver data promptly (Madani & Weber, 2016) in order to be relevant
and valuable for decision makers.
PatSeer is an online patent database storing patents in the form of simple patent families. PatSeer is available in several editions: Lite, Standard, Premier, Pro, Explorer and Projects Edition. The authors used the Lite Edition to conduct a preliminary search of the simple patent families to detect the patents related to big data for prediction. In general, the Lite Edition is used to search the worldwide patent database and allows users to manage and save search strings, to narrow down search results by using filters, as well as to extract data in Excel format. Therefore, the authors used the PatSeer solution for searching and extracting patent data only. For
example, PatSeer Pro allows advanced patent analysis such as patent network analysis with semantic
spatial-mapping, to conduct citation analysis, text clustering and more.

The PatSeer database was searched on 13 October 2017 by using the search string (TA: (data AND (predict OR prediction OR forecasting OR forecast OR prognosis OR prognosticate OR foresight OR foresee))), with an option for searching simple patent families. The authors found 316 records for simple patent families in total. Among these records, 296 simple patent families had the status "active" at the time of the search. Therefore, a patent analysis of the 296 simple patent families related to big data for prediction was conducted to achieve the goal of this research.

Phase Two: Timeline, Geographic Origin and Patents Assignees Analysis

The authors performed an extensive analysis of the timeline, geographic origin and current assignees in order to detect which of them were most active in patenting technical content related to big data for prediction. A current assignee is an entity, organization or individual (inventor) that holds the property right to the patent (Sinha and Pandurangi, 2015).

Phase Three: Patents According to IPC System Patent Area

The authors analyzed the protected technical content of big data for prediction simple patent families, using the International Patent Classification (IPC) system, established in 1971 by the Strasbourg Agreement and used in more than 100 countries worldwide. The IPC describes technical knowledge by using a systematic and hierarchical classification, which includes section, class, subclass, group and subgroup (WIPO, 2017). In this research, the analysis of the active simple patent families related to big data usage for prediction is conducted according to sections, subclasses and groups. In order to determine whether the technical content of the selected simple patent families is heterogeneous or homogeneous, the authors use association rules.

Phase Four: Text Mining Patent Analysis

A text mining approach was utilised in order to detect the topics emerging most often in the abstracts of simple patent families related to big data solutions for predictive analytics. The WordStat Provalis software was used for text mining. First, phrases of at most five words, which occur in more than five simple patent abstracts, were extracted. Second, the extracted phrases were used to conduct a cluster analysis in order to detect which topics occur together. The cluster analysis of phrases was conducted using the average-linkage hierarchical clustering algorithm, which creates clusters from a similarity matrix (Everitt et al., 2011). The distance between two clusters is the average distance from each observation in one cluster to every observation in the other. This method is also called Unweighted Pair Group Mean Averaging. For example, the distance between clusters "A" and "B" refers to the average length of the arrows connecting observations in one cluster with observations in the other (Figure 1), as expressed in Formula 1.

Equation 1. Distance Between Clusters – Average Linkage Method

$$ d_{AB} = \frac{1}{kl} \sum_{i=1}^{k} \sum_{j=1}^{l} d\left(A_i, B_j\right) \qquad (1) $$

Figure 1. Average linkage method


Source: (Authors)

Notation:

A_1, A_2, ..., A_k = observations from cluster A

B_1, B_2, ..., B_l = observations from cluster B

d(A_i, B_j) = distance between observation A_i from cluster A and observation B_j from cluster B

The cluster analysis was conducted by using Jaccard's coefficient as a similarity measure. Jaccard's coefficient determines the association between two phrases that occur together in a simple patent abstract. The result is represented by a dendrogram. Single-word clusters were hidden from the dendrogram to simplify its use and to focus only on the strongest associations of meaningful phrases. Since a dendrogram determines only the temporal order of the branching sequence, the sequence of phrases cannot be seen as a linear representation of those distances. This means that any cluster can be rotated around branches of the dendrogram without affecting its meaning. For that reason, the authors used proximity plots generated in the WordStat Provalis software in order to represent the distance between the most frequent phrases and all other phrases. In a proximity plot, phrases that tend to appear near the selected phrase are shown at the top of the plot. In addition, network graphs were used to represent the relationships between phrases by lines connecting those phrases.
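As an illustration of the procedure described above, the following minimal sketch (in Python, not the WordStat Provalis workflow actually used in this chapter) clusters a handful of phrases with average linkage over Jaccard distances computed from a phrase-by-abstract presence matrix. The phrase names and the matrix are hypothetical and serve only to show how such a similarity matrix and dendrogram could be produced.

```python
# Minimal sketch (not the authors' WordStat Provalis workflow): average-linkage
# clustering of extracted phrases, using Jaccard distances computed from a
# hypothetical phrase-by-abstract presence matrix.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

phrases = ["real time", "early warning", "data mining", "neural network"]

# Rows = phrases, columns = patent abstracts; True = phrase occurs in that abstract.
presence = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],
], dtype=bool)

# Jaccard distance between two phrases = 1 - |A ∩ B| / |A ∪ B| over their abstract sets.
distances = pdist(presence, metric="jaccard")

# Average linkage (UPGMA): cluster distance = mean of all pairwise member distances.
tree = linkage(distances, method="average")

# Build the dendrogram structure; set no_plot=False (with matplotlib installed) to draw it.
dendrogram(tree, labels=phrases, no_plot=True)
```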

RESULTS

In this part of the chapter, the patent analysis results are presented as follows: the timeline, geographic origin and assignees of patents related to big data for prediction; the results of the patent analysis according to the IPC system patent area; and the results of the text mining patent analysis.

Timeline, Geographic and Assignee Patent Analysis

In order to provide answers to when, where and by whom protection of big data analytics solutions for predictive analysis is pursued, the timeline, geographic and assignee analysis was conducted.
Table 1 presents the timeline for the period between 2013 and October 2017, and the geographic origin of the simple patent families.
Most of the assignees related to big data for prediction are located in China and South Korea. Figure 2 provides details on the timeline and geographic origin of simple patent families according to priority countries for the period between 2013 and October 2017.
Table 2 provides details on the number of simple patent families related to big data for prediction according to current assignees and countries, which indicates that all organizations with more than 5 patents come from China.

Table 1. Number of big data for prediction simple patent families per publication/issue year and priority
country (from 2013 to 13th October 2017)

Timeline
Publication / Issue Year | No. of Simple Patent Families | % of Total No. of Simple Patent Families
2013 | 1 | 0%
2014 | 16 | 5%
2015 | 50 | 17%
2016 | 122 | 41%
October 2017 | 107 | 36%
Total | 296 | 100.00%

Country of Origin
Priority Country | No. of Simple Patent Families | % of Total No. of Simple Patent Families
China (CH) | 233 | 79%
South Korea (KR) | 35 | 12%
United States of America (USA) | 17 | 6%
India (IN) | 4 | 1%
Taiwan (TW) | 3 | 1%
Japan (JP) | 1 | 0%
None | 3 | 1%
Total | 296 | 100.00%
Source: (Authors, PatSeer, 13th October 2017)

Figure 2. Number of big data for prediction simple patent families per priority country (from 2013 to
13th October 2017)
Source: (Authors, PatSeer, 13th October 2017)

Table 2. Number of big data for prediction simple patent families according to current assignee and
country (from 2013 to 13th October 2017)

Current Assignee | Country | No. of Simple Patent Families | % of Total No. of Simple Patent Families
State Grid Corporation | China | 22 | 7.4%
Inspur Group | China | 7 | 2.4%
Nanjing University | China | 7 | 2.4%
Business Big Data | China | 5 | 1.7%
Hohai University | China | 5 | 1.7%
Other | - | 250 | 84.5%
Total | - | 296 | 100.00%
Source: (Authors, PatSeer, 13th October 2017)

Patents According to IPC System Patent Area

The majority of the simple patent families were assigned to more than one of the IPC's main groups or sub-groups. A simple patent family is usually registered under multiple IPC codes, so the total number of IPC codes (561 codes) is larger than the number of simple patent families examined (296 simple patent families), which indicates that one simple patent family is registered to approximately two IPC main groups or sub-groups on average. The observed simple patent families were registered under the following seven IPC sections: A Human Necessities; B Performing Operations, Transporting; C Chemistry, Metallurgy; E Fixed Constructions; F Mechanical Engineering, Lighting, Heating, Weapons, Blasting Engines or Pumps; G Physics; and H Electricity.

Table 3 presents the number of big data for prediction simple patent families according to the IPC system at the sub-class level, for sub-classes that occur in more than 10 simple patent families. Among the classes assigned to the 296 simple patent families are computing, calculating or counting instruments such as G06Q - Analogue computers (228 times), G06F - Electrical digital data processing (132 times) and G06N - Computer systems based on specific computational models (26 times). Additionally, simple patent families that were registered as an electric communication technique were mostly related to the sub-class H04L - Transmission of digital information.
Table 4 presents the simple patent families according to the IPC main group and sub-group level. Data processing systems or methods specially adapted for forecasting or optimization were the most frequent IPC group. A substantial number were related to administrative, financial, managerial or supervisory purposes – IPC group G06F17/30 (62 simple patent families). Additionally, 40 of the 228 simple patent families registered under G06Q were registered for electricity, gas or water supply purposes.

Co-Occurrence of IPC Areas

In order to detect relationships between IPC codes, an association rule analysis was conducted. Each IPC main group or sub-group code is considered as an item, and each record of a simple patent family is considered as a transaction. Due to the heterogeneity of IPC codes, the task of finding association rules was non-trivial, and association rules between different IPC main group or sub-group level codes were challenging to detect. Therefore, minimal support and confidence were set at the 1% level, which resulted in 39 association rules. Table 5 shows only the rules with a minimal support of 2% and minimal correlation of 10%, which

Table 3. Number of big data for prediction simple patent families according to the IPC system – Sub-
class level (>10 simple patent families)

Subclass | Description | No. of Simple Patent Families
A Human Necessities
A61B | Medical diagnosis, surgery and identification | 12
G Physics
G06Q | Analogue computers | 228
G06F | Electrical digital data processing | 132
G06N | Computer systems based on specific computational models | 26
G08G | Traffic control systems | 16
G06K | Instruments for recognition and presentation of data | 14
G08B | Signaling or calling systems - order telegraphs, alarm systems | 13
G05B | Monitoring or testing arrangements/elements for control systems | 12
H Electricity
H04L | Transmission of digital information | 27
Other | | 76
Total | | 561
Source: (Authors, PatSeer, 13th October 2017)

Table 4. Number of simple patent families related to big data for prediction according to the IPC system
– Main group/Sub-group level (>10 simple patent families)

Main/Sub Group | Description | No. of Simple Patent Families
G06 Physics - Computing, Calculating and Counting Instruments
G06F - Digital Computing or Data Processing Equipment or Methods for:
G06F17/30 | Administrative, financial, managerial, supervisory purposes | 62
G06F19/00 | Specific applications | 28
G06Q - Data Processing Systems or Methods Specially Adapted for:
G06Q10/04 | Forecasting or optimization | 64
G06Q50/06 | Electricity, gas or water supply | 40
G06Q10/06 | Resources, enterprise planning, organizational model | 22
G06Q30/02 | Marketing, e.g. buyer profiling, price estimation | 19
G06Q50/26 | Government or public services | 12
G06Q50/10 | Services | 11
H04 - Electricity - Electric Communication Technique
H04L - Transmission of Digital Information
H04L29/08 | Control procedure, e.g. data link level control procedure | 12
Other | | 291
Total | | 561
Source: (Authors, PatSeer, 13th October 2017)

reveals that the simple patent families registered as data processing systems or methods for forecasting
or optimization were specially adapted for electricity, gas or water supply purposes in 10.47% of the
total number of simple patent families (Rule G06Q10/04 → G06Q50/06). Data processing systems or
methods for resources management were specially adapted for electricity, gas or water supply purposes
in 2.70% of the total number of simple patent families (Rule G06Q10/06 → G06Q50/06).
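For readers who want to reproduce this kind of result with their own patent data, the sketch below shows how the support and confidence of a single rule between two IPC codes could be computed, treating each simple patent family as a transaction. It is only an illustration under hypothetical transactions, not the Statistica Text Miner procedure used in this chapter.

```python
# Illustrative sketch (not the Statistica Text Miner procedure): support and
# confidence of a rule "body -> head" over IPC codes, with each simple patent
# family treated as a transaction. The transactions below are hypothetical.

def rule_metrics(transactions, body, head):
    n = len(transactions)
    body_count = sum(1 for t in transactions if body in t)
    both_count = sum(1 for t in transactions if body in t and head in t)
    support = both_count / n                              # share of families with both codes
    confidence = both_count / body_count if body_count else 0.0
    return support, confidence

families = [
    {"G06Q10/04", "G06Q50/06"},
    {"G06Q10/04", "G06F17/30"},
    {"G06Q10/04", "G06Q50/06", "G06Q10/06"},
    {"G06F17/30"},
]

support, confidence = rule_metrics(families, "G06Q10/04", "G06Q50/06")
print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.50 and 0.67 here
```

With these toy transactions, the rule holds in half of all families (support) and in two thirds of the families that contain the body code (confidence), which mirrors how the support and confidence columns of Table 5 should be read.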

Patent Topics

In order to detect the most frequent topics in the simple patent families' abstracts, the authors used a phrase extraction process combined with a cluster analysis conducted with the WordStat Provalis software. The authors detected the following most frequent phrases: real-time, data mining, early warning and neural networks. Table 6 shows the most frequent phrases in patent applications with a frequency of occurrence ≥ 5.
The TF-IDF column of Table 6 contains the values of a metric for a phrase's importance. The Term Frequency-Inverse Document Frequency (TF-IDF) is a metric that helps to estimate how important a phrase is in a whole collection of documents (e.g. the abstracts of all analyzed patents in a certain area) and not only in a particular document (e.g. the abstract of only one patent). Therefore, for this chapter, TF-IDF is a metric that helps the authors to estimate how important a phrase is in the whole collection of analyzed patents. Specifically, for this research, the collection of patents refers to the patents' abstracts.

Table 5. Summary of association rules - Min. support = >2%, Min. confidence = >2%, Min. correla-
tion = 10%

Body – Description (Application Area or Method) | Head – Description (Application Area or Method) | Support | Confidence
G06Q10/04 - forecasting method | G06Q50/06 - energy supply | 10% | 48%
G06Q50/06 - energy supply | G06Q10/04 - forecasting method | 10% | 78%
G06Q10/06 - enterprise resources planning | G06Q50/06 - energy supply | 3% | 36%
G06Q50/06 - energy supply | G06Q10/06 - enterprise resources planning | 3% | 20%
G06F17/30 - finance/management | G06Q10/04 - forecasting method | 2% | 11%
G06Q10/04 - forecasting method | G06F17/30 - finance/management | 2% | 11%
G06Q10/04 - forecasting method | G06Q50/26 - government/public services | 2% | 9%
G06Q10/04 - forecasting method | G06Q50/26 - government/public services | 2% | 9%
G06Q10/06 - enterprise resources planning | G06Q10/04 - forecasting method | 2% | 27%
G06Q50/26 - government/public services area | G06Q10/04 - forecasting method | 2% | 50%
Source: (Authors, PatSeer, 13th October 2017; Statistica Text Miner)

The reason for using the TF-IDF metric is that common words usually appear several times in a document (an abstract of a certain patent), but they are not important as key phrases to be searched or indexed. Term Frequency measures how frequently a phrase occurs in a patent's abstract. The Term Frequency value for a certain phrase "p" in a certain patent's abstract is defined as the ratio between the frequency of the phrase "p" in the patent's abstract and the total number of phrases in the same patent's abstract. Furthermore, Inverse Document Frequency measures how important a certain phrase "p" is with respect to the whole collection of patents' abstracts. The IDF for a given phrase "p" in the collection of patents is calculated as the logarithm of the ratio between the total number of patents' abstracts in the collection and the number of abstracts in which the phrase "p" appears. Finally, the product of the TF and IDF values gives the TF-IDF value for a certain phrase "p". Therefore, a phrase that has a higher TF-IDF value is of higher importance. The phrases that are most important in the whole collection of patents related to big data for prediction, as indicated by their TF-IDF values, are real-time (TF-IDF value 79.7), data mining (48.9), early warning (58.5) and neural networks (63.2).
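A minimal sketch of the TF-IDF computation just described is given below. The counts are hypothetical; the chapter's reported values (e.g. 79.7 for real-time) were produced by WordStat Provalis, whose exact weighting scheme may differ from this textbook formulation.

```python
# Minimal sketch of the TF-IDF computation described above; the counts are
# hypothetical, and WordStat Provalis may weight terms slightly differently.
import math

def tf_idf(phrase_count_in_abstract, total_phrases_in_abstract,
           total_abstracts, abstracts_containing_phrase):
    tf = phrase_count_in_abstract / total_phrases_in_abstract
    idf = math.log(total_abstracts / abstracts_containing_phrase)
    return tf * idf

# Example: a phrase appearing 3 times among 40 phrases of one abstract,
# and occurring in 55 of the 296 abstracts in the collection.
print(round(tf_idf(3, 40, 296, 55), 3))
```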
Figure 3 presents the results of the cluster analysis, which identified six groups of topics regarding simple patent families related to big data for prediction.

•	Cluster 1 includes 28 simple patent families' abstracts, with the co-occurring phrases: real-time systems used for weather forecasting to provide weather information to a client side; early warning management systems based on monitoring data for managing power supply.
•	Cluster 2 includes 10 simple patent families' abstracts, with the co-occurring phrases: data analysis supported by efficient database technologies, such as managing the power grid based on a power load forecasting method or preprocessing of big traffic data.
•	Cluster 3 includes 6 simple patent families' abstracts, with the co-occurring phrases: environment information and prediction data supported by wireless communication; storage systems and wireless communication supported by cloud computing and wireless networks.

Table 6. Most frequent phrases in patent applications (>5% of Cases)

Phrase | Frequency | No. of Cases | % Cases | TF-IDF
real time | 109 | 55 | 18.58% | 79.7
data mining | 52 | 34 | 11.49% | 48.9
early warning | 58 | 29 | 9.80% | 58.5
neural network | 57 | 23 | 7.77% | 63.2
management system | 30 | 18 | 6.08% | 36.5
machine learning | 29 | 16 | 5.41% | 36.7
historical data | 17 | 15 | 5.07% | 22.0
data platform | 25 | 15 | 5.07% | 32.4
Source: (Authors by using WordStat Provalis software)

•	Cluster 4 includes 11 simple patent families' abstracts, with the co-occurring phrases: predicting and monitoring public opinion, and analyzing behavior data by using feature extraction and neural networks.
•	Cluster 5 includes 12 simple patent families' abstracts, with the co-occurring phrases: using support vector machines to increase prediction accuracy.
•	Cluster 6 includes 12 simple patent families' abstracts, with the co-occurring phrases: information extraction based on data mining and machine learning to analyze historical data; information extraction based on deep learning for control systems and risk assessment, as well as medical diagnosis based on natural language processing.

In a dendrogram, the phrases (keywords) that co-occur tend to appear near each other, but a dendrogram determines only the temporal order of the branching sequence. For that reason, reading dendrograms is not intuitive or very easy. Therefore, the authors used proximity plots to detect phrases that tend to appear near a selected phrase (Figure 4). Such phrases are shown at the top of the plot.
Figure 4 presents four proximity plots indicating which phrases occur most often with the most frequent and most important phrases: real-time, data mining, early warning and neural network. The authors found the following:

• The phrases that occur the most often with the phrase real-time are mostly related to data analysis
such as historical data, management systems, real-time performance and monitoring data; meth-
ods and techniques for data analysis such as statistical analysis, neural networks, machine learning
or data visualization, as well as specific purposes such as traffic big data, power supply, risk as-
sessment, social networks or behavior analysis.
• The phrases that occur the most often with the phrase data mining are mostly related to the phrase
historical data, methods and techniques of data analysis such as machine learning, natural lan-
guage or deep learning, and applications such as medical diagnosis, risk assessment or control
systems.

Figure 3. Cluster dendrogram of phrases that occur in most frequent phrases


Source: (Authors by using WordStat Provalis software)

• The phrases that occur the most often with the phrase early-warning indicate general technical
parts of early warning systems such as management system, an analysis module, real-time and
client side, as well as particular purposes of early-warning systems such as weather forecasting
and power supply management. The phrase is also related to phrases indicating source or type of
data used or generated by early-warning systems such as monitoring data, weather information,
environment information.

•	The phrase neural network occurs most often with the phrase neural network model. Other phrases that occur with the phrase neural network indicate its specific application areas, such as power load, behavior analysis, weather forecast, feature extraction or medical diagnosis. Additionally, the types of data analyzed by neural networks are indicated by the phrases historical data and behavior data.

Furthermore, the connections between keywords (phrases) are visualized by using a network graph, which allows us to explore relationships and to detect underlying patterns and structures of co-occurrences. A network graph was generated for each of the six clusters in the dendrogram. Elements are represented as nodes, while their relationships are represented as lines connecting those nodes. Figure 5 presents six network graphs indicating which phrases co-occurred most often within each cluster.
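Such a co-occurrence network can be reproduced with standard graph libraries. The following sketch assumes the networkx package and uses a few hypothetical phrase sets (one per abstract) in place of the actual extracted phrases; nodes are phrases and edge weights count the abstracts in which two phrases appear together.

from itertools import combinations
import networkx as nx

# Hypothetical phrase sets extracted from three patent abstracts
abstract_phrases = [
    {"real time", "historical data", "neural network"},
    {"data mining", "machine learning", "historical data"},
    {"early warning", "real time", "monitoring data"},
]

G = nx.Graph()
for phrases in abstract_phrases:
    for a, b in combinations(sorted(phrases), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# The strongest co-occurrences correspond to the thickest lines in the network graph
for a, b, data in sorted(G.edges(data=True), key=lambda e: e[2]["weight"], reverse=True):
    print(a, "--", b, "weight:", data["weight"])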

FUTURE RESEARCH DIRECTIONS

This chapter provides an outlook on the questions that can be answered for investors and inventors interested in big data solutions for predictive analytics. Patent analysis can provide answers to the most basic questions: when and where most of the patenting was conducted, by whom, and in which areas. Therefore, future research directions are provided as answers to these questions.

When?

The analysis indicates that the area of big data usage for predictive analytics emerged recently. Only one simple patent family related to big data for prediction was registered in 2013. After that period, the number of simple patent families increased rapidly, with 122 simple patent families registered in 2016 and 107 simple patent families registered in 2017, until October. The emerging trend is expected to continue for at least several more years.

Figure 4. Proximity plots of phrases that occur in more than 20 patent applications
Source: (Authors by using WordStat Provalis software)

Figure 5. Network graphs of phrases that occur most frequently
Source: (Authors by using WordStat Provalis software)

Where?

China is the leading country in patenting activities related to big data for prediction. Chinese organiza-
tions began publishing patents related to this technical area in 2013. South Korea began publishing big
data for prediction patents two years later, in 2015. Among other countries, only India, Japan and Taiwan
published big data for prediction patents.

Who?

The organization that registered the largest number of simple patent families related to big data for prediction in the observed period is State Grid Corporation, registered in China (227 simple patent families). Inspur Group (7 simple patent families) and Nanjing University (7 simple patent families) were active assignees from China as well. The inventor Kim Seung Chan registered three simple patent families, which makes him the only individual on the list of the first ten assignees in the area of interest. Other organizations that registered a more substantial number of simple patent families are companies such as Business Big Data, NAT Computer Network Information Security and Shanghai Fuli Information Technology, and academic institutions such as Hohai University, Beijing Jiaotong University and the University of South China.
Patent applications related to big data and prediction have followed the trends present in worldwide patenting activity generally. According to the patenting indicators for 2016 published by the World Intellectual Property Organization (2017), China is the largest contributor in number of filings. The State Intellectual Property Office of the People’s Republic of China (SIPO) received more than 1.3 million patent applications in 2016, which was more than the European Patent Office, the United States Patent and Trademark Office, the Japan Patent Office and the Korean Intellectual Property Office received together. Many of the patents relate to new technological content in computing, medical technology, semiconductors, and so on. The reasons why patenting activities in China have been growing are as follows. In 2012, China’s government set goals regarding the growth of all types of patenting activities. Since then, it has supported patenting activities with various incentives and by setting new, more patent-friendly regulations regarding the examination of patent applications. Moreover, China’s high-tech companies and telecoms have become significant global players, not only conducting patenting activities but also buying patent rights. State Grid Corporation of China, which is among the top 100 patent applicants worldwide, leads in patenting activities regarding big data and prediction. Specifically, State Grid Corporation took ninth place in the number of patent family applications for the period between 2011 and 2014, especially for the following technological fields: electrical machinery, apparatus and energy, as well as technical content related to measurement.
Stakeholders who are interested in harnessing big data analytics solutions can choose among numerous vendors. However, vendors often acquire patent rights, so they do not have to be patent assignees or inventors. Instead, they make strategic investments in patents, acquire patents and manage patent portfolios, which allows them to focus on their core activities and provide innovative solutions to clients. For example, in 2015, Avigilon Corporation, a global provider of surveillance solutions, including video analytics, acquired 126 US and international patents from VideoMining Corporation, FaceDouble Incorporated, Behavioral Recognition Systems and ITS7 Pty. The total value of the patents was US$135,375,000, and they covered technical content such as different video analytics capabilities (behavioral analysis, in-store object tracking, video segmentation, anomaly detection and image classification), as well as patents related to the programming of remote security cameras and network camera systems.

What?

The search revealed that the most frequent patent topics are related to technological solutions for business purposes (G06Q – Data processing systems specially adapted for administrative, commercial, financial, managerial, supervisory or forecasting purposes), data processing (G06F – Electrical digital data processing), and specific computational models (G06N – Computer systems based on specific computational models). This finding is in line with the specific
challenges related to big data identified by Sivarajah et al. (2017): (i) data challenges that are related to
the characteristics of big data, (ii) process challenges, including challenges related to big data analysis
and modelling, and (iii) management challenges that cover privacy, security, data governance, data and
information sharing, cost and operational expenditure and data ownership challenges.
A number of patent families focus on technological and data processing solutions that try to solve specific challenges related to big data analytics. Techniques of predictive analytics can be divided into two groups (Gandomi & Haider, 2015): (i) techniques for discovering historical patterns and extrapolating outcome variable(s), and (ii) techniques for exploring the interdependencies between outcome and explanatory variables. Predictive analytics mostly relies on statistical techniques. However, while conventional statistical methods rely on statistical significance to examine the significance of a specific relationship, big data analysis is often conducted on the majority or the entirety of a population, so statistical significance is less important for big data than for small samples of a population. Furthermore, when it comes to computational challenges, many conventional methods developed for small samples do not scale up to big data. For that reason, existing methods are extended and modified for parallel and distributed tasks. Additionally, the unique characteristics of big data cause some problems when it comes to estimating predictive models (Gandomi & Haider, 2015): noise accumulation, spurious correlation and incidental endogeneity. Noise accumulation, or accumulated estimation error, sometimes results in overlooking significant variables. Spurious correlation appears when variables seem to be correlated merely because of the high dimensionality of big data. In addition, incidental endogeneity, the dependence between the error term and the variables, is common in big data. Extreme learning machine techniques have been extended for tasks such as clustering and adapted for parallel computation, which makes them feasible for big data analytics (Huang et al., 2014). Zhang et al. (2018) discussed the role and future of deep learning techniques in big data analytics, which are used for image, audio and text analytics. Another issue of big data analytics is big data’s proneness to noise, outliers, inconsistencies and incompleteness (Wu, Zhu, Wu, & Ding, 2014). Additionally, the re-utilization of existing big data should be taken into account. Most big data analytics algorithms are designed to support parallel and distributed computing. This raises problems regarding algorithm bottlenecks that may occur because of synchronization and information exchange issues (Tsai et al., 2015; Wu et al., 2014). Additionally, big data technology needs improvements regarding the efficiency of format conversion of heterogeneous data, big data transfer and the performance of real-time processing of big data.
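The spurious-correlation problem noted by Gandomi and Haider can be illustrated with a short numerical sketch. The example below uses purely synthetic data (not the patent data of this chapter): with many independent predictors and few observations, the largest absolute sample correlation with an unrelated outcome still looks substantial purely by chance.

import numpy as np

rng = np.random.default_rng(42)
n_obs, n_features = 100, 10_000

X = rng.standard_normal((n_obs, n_features))   # predictors, independent of y
y = rng.standard_normal(n_obs)                 # outcome, pure noise

# Sample correlation of each predictor with the outcome
y_centered = y - y.mean()
X_centered = X - X.mean(axis=0)
corr = (X_centered * y_centered[:, None]).sum(axis=0) / (
    np.sqrt((X_centered ** 2).sum(axis=0)) * np.sqrt((y_centered ** 2).sum())
)

print("max |correlation| among 10,000 unrelated predictors:",
      round(float(np.abs(corr).max()), 3))     # typically around 0.4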
The identified association rules indicate some specific domains of usage, such as market research, buyer profiling, price estimation or determination, computer-aided design, and so on. Text mining revealed that the following topics occur together: (i) real-time systems focusing on, for example, weather forecasting, (ii) database technologies related to the preprocessing of specific data, such as big traffic data, (iii) technical challenges, such as cloud computing, (iv) specific topics, such as monitoring public opinion and analyzing behavior data, (v) methodological challenges, such as the usage of support vector machines to increase prediction accuracy, and (vi) specific topics related to information extraction from historical data, for example for risk assessment.
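Association rules of this kind can be mined directly from the IPC codes co-assigned to the same simple patent family. The minimal sketch below is illustrative only: the code baskets are hypothetical and the calculation is a plain support/confidence/lift computation rather than the weighted approach discussed above or the software used by the authors.

from itertools import permutations

# Hypothetical transactions: IPC codes co-assigned to the same simple patent family
families = [
    {"G06F", "G06N"},
    {"G06F", "G06Q"},
    {"G06F", "G06N", "G06Q"},
    {"G06Q", "G08G"},
    {"G06F", "G06N"},
]
n = len(families)

def support(itemset):
    # Share of families containing all codes in the itemset
    return sum(itemset <= fam for fam in families) / n

# Single-antecedent rules A -> B with support, confidence and lift
codes = set().union(*families)
for a, b in permutations(codes, 2):
    sup_ab = support({a, b})
    if sup_ab == 0:
        continue
    conf = sup_ab / support({a})
    lift = conf / support({b})
    if conf >= 0.6:
        print(f"{a} -> {b}: support={sup_ab:.2f}, confidence={conf:.2f}, lift={lift:.2f}")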
Some of these topics indicate patenting activities aimed at challenges related to data management and data integration. Safety and privacy are always key challenges and concerns when it comes to information and communication technology, as well as data. Security-related big data challenges are big data privacy, safety and big data application in information security (Chen et al., 2014). Big data privacy includes the protection of personal privacy during data handling. Nowadays, the usage of information and communication technology facilitates the easy and simple generation and acquisition of large amounts of users’ data. Hence, it is highly important for users to raise their awareness of which of their personal data third parties collect and how it is used. Big data safety mechanisms, such as efficient cryptography of big data and schemes for safety management, are under development. The efficiency of big data mechanisms is assured only if data availability, completeness, controllability, traceability and confidentiality are enabled (Chen et al., 2014).

CONCLUSION

Big data will influence society and the economy, and it will drive the progress of technologies in the near future. It causes a fusion of different disciplines, which is particularly visible when it comes to big data analytics. Big data influences operations and decision making in various application fields. On the other hand, society promotes the progress of technologies, including the widespread usage and development of big data. Additionally, big data has encouraged the fusion of different technologies, such as the Internet of Things and cloud computing, and forces the exploration of new and innovative technologies for handling big data. People are participants in big data, both as users and as generators of big data. The generation of real-time and streaming data, online network data, Internet of Things and mobile data, geographic data (e.g. geo-tagged or location-based real-time geographic data), spatio-temporal data and visual data represents the current trends in the big data area (Lv et al., 2017; Brown & Sankaranarayanan, 2011). In the short term, it is expected that the volume of such data will grow to a large degree due to technological advances and developments in related areas, such as geo-databases or wireless sensor networks. Furthermore, demands from a wide range of application areas, along with new database and processing technologies, drive the modification of existing techniques and the development of new techniques for big data analytics.
This chapter presents a patent analysis of the technical area of big data for prediction, based on data searched and gathered from the PatSeer patent database. The authors analyzed 296 active simple patent families related to big data for prediction assigned from 2013 to October 2017. The patent analysis was conducted in four stages related to (i) the patent search and selection, (ii) the timeline, geographic origin and patent assignees analysis, (iii) the patents according to the IPC system patent area, and (iv) the text mining patent analysis. An analysis of the 296 simple patent families related to big data for prediction was conducted to achieve the goal of this research.
The analysis provided insights into the technical area of big data for prediction. An increasing trend in patenting the technical content of big data for prediction is present from 2013, with 122 simple patent families registered in 2016 and 107 simple patent families registered in 2017, until October. This is due to an increasing interest in big data and the new opportunities big data brings. The authors revealed that the patenting activities related to big data for prediction are concentrated in China and South Korea, whose organizations assigned the majority of patents related to the technology of interest. The organization that registered the largest number of simple patent families related to big data for prediction in the observed period is State Grid Corporation, registered in China (227 simple patent families or 7.43%). Other organizations that registered a larger number of simple patent families are companies such as Business Big Data, NA Computer Network Information Security MAN and Shanghai Fuli Information Technology, and academic institutions such as Hohai University, Beijing Jiaotong University, the University of South China and so on.

Next, the protected technical content of the big data for prediction simple patent families was analyzed by using the International Patent Classification system at the section, class, sub-class, main group or sub-group level. The simple patent families were mostly registered under the section G codes (474 times), with the following classes most frequently assigned: G06 – Computing; calculating; counting instruments (407 times), G08 – Signaling instruments (29 times) and G01 – Measuring; testing instruments (24 times). Therefore, computing instruments have been the major focus of assignees and inventors. Specifically, a significant number of simple patent families relate to information retrieval, database structures or file systems as part of data processing systems specially adapted for administrative, commercial, financial, managerial, supervisory or forecasting purposes.
Furthermore, the association rules analysis revealed rules that are trivial due to dataset limitations. For better results, weighted association rules should be applied in future research with additional patent data, such as backward citations and the number of IPC codes. Therefore, the authors conclude that the technical content of the observed simple patent families is not heterogeneous, but the association rules do indicate some specific domains.
Finally, the authors used the phrase extraction process combined with cluster analysis to detect the most common topics appearing in the abstracts of the big data for prediction simple patent families. The most frequent phrases occurring in these abstracts were real time, data mining, early warning and neural networks. The phrases that occur most often with the phrase real time are mostly related to data analysis, such as historical data, management systems, real-time performance and monitoring data (Belfo et al., 2015). The phrases that occur most often with the phrase data mining are mostly related to the phrase historical data and to methods and techniques of data analysis such as machine learning, natural language or deep learning. The phrases that occur most often with the phrase early warning indicate specific purposes in the weather forecast and power supply domains, and sources of data analyzed by early-warning systems, such as monitoring data, weather information and environment information. The phrase neural network occurs with phrases that indicate its specific application areas, such as power load, behavior analysis, weather forecast, feature extraction or medical diagnosis. Cluster analysis identified six groups of topics regarding big data for prediction patents, and the connections between keywords (phrases) were visualized by using a network graph to explore relationships and to detect underlying patterns and structures of co-occurrences.

REFERENCES

Akter, S., & Wamba, S. F. (2016). Big data analytics in E-commerce: A systematic review and agenda
for future research. Electronic Markets, 26(2), 173–194. doi:10.1007/s12525-016-0219-0
Altunas, S., Dereli, T., & Kusiak, A. (2015). Analysis of patent documents with weighted association
rules. Technological Forecasting and Social Change, 92, 249–262. doi:10.1016/j.techfore.2014.09.012
Belfo, F., Trigo, A., & Estébanez, R. P. (2015). Impact of ICT Innovative Momentum on Real-Time
Accounting. Business Systems Research Journal, 6(2), 1–17. doi:10.1515/bsrj-2015-0007
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679.
Bradlow, E. T., Gangwar, M., Kopalle, P., & Voleti, S. (2017). The Role of Big Data and Predictive
Analytics in Retailing. Journal of Retailing, 93(1), 79–95. doi:10.1016/j.jretai.2016.12.004

Brown, R. A., & Sankaranarayanan, S. (2011). Intelligent store agent for mobile shopping. International
Journal of E-Services and Mobile Applications, 3(1), 57–72. doi:10.4018/jesma.2011010104
Brügmann, S., Bouayad-Agha, N., Burga, A., Carrascosa, S., Ciaramella, A., Ciaramella, M., ... Wan-
ner, L. (2015). Towards content-oriented patent document processing: Intelligent patent analysis and
summarization. World Patent Information, 40, 30–42. doi:10.1016/j.wpi.2014.10.003
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data
to Big Impact. Management Information Systems Quarterly, 36(4), 1165–1188.
Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2),
171–209. doi:10.1007/s11036-013-0489-0
Dmitriyev, V., Mahmoud, T., & Marín-Ortega, P. M. (2015). SOA enabled ELTA: Approach in design-
ing business intelligence solutions in Era of Big Data. International Journal of Information Systems and
Project Management, 3(3), 49–63.
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Hierarchical clustering. In Cluster Analysis
(5th ed.). John Wiley and Sons, Ltd. doi:10.1002/9780470977811.ch4
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–144.
Garret, M. A. (2014). Big Data analytics and cognitive computing – future opportunities for astronomical
research. IOP Conference Series. Materials Science and Engineering, 67, 012017. doi:10.1088/1757-
899X/67/1/012017
Gepp, A., Linnenluecke, M.K., O’Neill, T.J., & Smith, T. (2018). Big data techniques in auditing research
and practice: Current trends and future opportunities. Journal of Accounting Literature, 40, 102-115.
Günther, A. W., Rezazade, M. H., Huysman, M., & Feldberg, F. (2017). Debating big data: A literature
review on realizing value from big data. The Journal of Strategic Information Systems, 26(3), 191–209.
doi:10.1016/j.jsis.2017.07.003
Han, Q., Heimerl, F., Codina-Filba, J., Lohmann, S., Wanner, L., & Ertl, T. (2017). Visual patent trend
analysis for informed decision making in technology management. World Patent Information, 49, 34–42.
doi:10.1016/j.wpi.2017.04.003
Huang, G., Huang, G.-B., Song, S., & You, K. (2014). Trends in extreme learning machines: A review. Neural Networks, 61, 32–48.
Ji, W., & Wang, L. (2017). Big data analytics based fault prediction for shop floor scheduling. Journal
of Manufacturing Systems, 43(1), 187–194. doi:10.1016/j.jmsy.2017.03.008
Kim, G., & Bae, J. (2017). A novel approach to forecast promising technology through patent analysis.
Technological Forecasting and Social Change, 117, 228–237. doi:10.1016/j.techfore.2016.11.023
Kim, M., Park, Y., & Yoon, J. (2016). Generating patent development maps for technology monitoring
using semantic patent-topic analysis. Computers & Industrial Engineering, 98, 289–299. doi:10.1016/j.
cie.2016.06.006

Kyebambe, M., Cheng, G., Huang, Y., He, C., & Zhang, Z. (2017). Forecasting emerging technologies:
A supervised learning approach through patent analysis. Technological Forecasting and Social Change,
125, 236–244. doi:10.1016/j.techfore.2017.08.002
Lee, J., & Ardakani, H. D. (2015). Industrial Big Data Analytics and Cyber-ph. Academic Press.
Lv, Z., Song, H., Basanta-Val, P., Steed, A., & Jo, M. (2017). Next-Generation Big Data Analytics: State
of the Art, Challenges, and Future Research Topics. IEEE Transactions on Industrial Informatics, 13(4),
1891–1899. doi:10.1109/TII.2017.2650204
Madani, F., & Weber, C. (2016). The evolution of patent mining: Applying bibliometrics analysis and
keyword network analysis. World Patent Information, 46, 32–48. doi:10.1016/j.wpi.2016.05.008
McAfee, A., & Brynjolfsson, E. (2012). Big data: The management revolution. Harvard Business Review,
90(10), 60–68. PMID:23074865
Miah, S. J., Vu, Q. H., Gammack, J., & McGrath, M. (2017). A Big Data Analytics Method for Tourist
Behavior Analysis. Information & Management, 54(6), 771–785. doi:10.1016/j.im.2016.11.011
Niemann, H., Moehrle, M. G., & Frischkorn, J. (2017). Use of a new patent text mining and visualization
method for identifying patenting patterns over time: Concept, method and test application. Technological
Forecasting and Social Change, 115, 210–220. doi:10.1016/j.techfore.2016.10.004
NIST Big Data Public Working Group. (2017). Big Data Interoperability Framework: Volume 1, Defini-
tions. Accessed at: http://bigdatawg.nist.gov/home.php
Parr Rud, O. (2011). Invited article: Adaptability. Business Systems Research Journal: International
Journal of the Society for Advancing Business & Information Technology, 2(2), 4-12.
PatSeer. (2017). Retrieved from http://patseer.com/
Pejić Bach, M., Pivar, J., & Dumičić, K. (2017). Data anonymization patent landscape. Croatian Op-
erational Research Review, 8(1), 265–281. doi:10.17535/crorr.2017.0017
Rassam, M. A., Maarof, M. A., & Zainal, A. (2017). Big Data Analytics Adoption for Cyber-Security:
A Review of Current Solutions, Requirements, Challenges and Trends. Journal of Information Assur-
ance and Security, 12(4), 124–145.
Salehan, M., & Kim, D. J. (2016, January). Predicting the performance of online consumer reviews: A
sentiment mining approach to big data analytics. Decision Support Systems, 81, 30–40. doi:10.1016/j.
dss.2015.10.006
Sharma, P., & Tripathi, R. C. (2017). Patent citation: A technique for measuring the knowledge flow
of information and innovation. World Patent Information, 51, 31–42. doi:10.1016/j.wpi.2017.11.002
Sharma, R., & Kankanhalli, A. (2014). Transforming decision-making processes: A research agenda
for understanding the impact of business analytics on organizations. European Journal of Information
Systems, 23(4), 433–441. doi:10.1057/ejis.2014.17

Shirdastian, H., Laroche, M., & Richard, M. O. (2017). Using big data analytics to study brand authen-
ticity sentiments: The case of Starbucks on Twitter. International Journal of Information Management.
doi:10.1016/j.ijinfomgt.2017.09.007
Sinha, M., & Pandurangi, A. (2015). Guide to Practical Patent Searching And How To Use Patseer For
Patent Search And Analysis. Pune: Gridlogics Technologies.
Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of Big Data challenges
and analytical methods. Journal of Business Research, 70, 263–286. doi:10.1016/j.jbusres.2016.08.001
Song, K., Kim, K., & Lee, S. (2018). Identifying promising technologies using patents: A retrospective
feature analysis and a prospective needs analysis on outlier patents. Technological Forecasting and Social
Change, 128, 118–132. doi:10.1016/j.techfore.2017.11.008
Tsai, C. W., Lai, C. F., Chao, H. C., & Vasilakos, A. V. (2015). Big data analytics: A survey. Journal of
Big Data, 2(1), 21. doi:10.1186/s40537-015-0030-3 PMID:26191487
Vera-Baquero, A., Colomo-Palacios, R., Molloy, O., & Elbattah, M. (2015). Business process improve-
ment by means of Big Data based Decision Support Systems: A case study on Call Centers. International
Journal of Information Systems and Project Management, 3(1), 5–26.
Waller, M. A., & Fawcett, S. E. (2013). Data Science, Predictive Analytics, and Big Data: A Revolu-
tion that Will Transform Supply Chain Design and Management. Journal of Business Logistics, 34(2),
77–84. doi:10.1111/jbl.12010
World Intellectual Property Organization, Economics and Statistics Division. (2016). World Intellectual
Property Indicators 2016. Accessed at: http://www.wipo.int/edocs/pubdocs/en/wipo_pub_941_2016.pdf
World Intellectual Property Organization (WIPO). (2017). Guide to the International Patent Classifi-
cation. Accessed at: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf
Wu, X., Zhu, X., Wu, G.-Q., & Ding, W. (2014). Data Mining with Big Data. IEEE Transactions on
Knowledge and Data Engineering, 26(1), 97–107. doi:10.1109/TKDE.2013.109
Zhang, Q., Yang, L. T., Chen, Z., & Li, P. (2018). A survey on deep learning for big data. Information
Fusion, 42, 146–157. doi:10.1016/j.inffus.2017.10.006


Chapter 11
The Components of Big Data
and Knowledge Management
Will Change Radically How
People Collaborate and
Develop Complex Research
Amitava Choudhury
University of Petroleum and Energy Studies, India

Ambika Aggarwal
University of Petroleum and Energy Studies, India

Kalpana Rangra
University of Petroleum and Energy Studies, India

Ashutosh Bhatt
Shivalik College of Engineering, India

ABSTRACT
Emerging as a rapidly growing field, big data is already known for promising success and for having considerable synergies with knowledge management. The common goal of this collaboration is to improve and facilitate decision making, fueling competition, fostering innovation, and achieving economic success through the application of acquired knowledge. Knowledge in the entire world, or inside any organization, has already expanded in various directions and is increasing exponentially with time. To withstand the current competitive environment, an intensive collaboration of knowledge management with different approaches and algorithms of big data is required. Classical structuring is becoming obsolete with the increasing number of knowledge components.

DOI: 10.4018/978-1-5225-7077-6.ch011

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Advances in technology have made life for the present generation comfortable and enjoyable. Compared with the lifestyle of people living in the nineteenth century or earlier, the change is drastic, and it has happened within a relatively short period of time; big data and knowledge management have played a major role in making this possible. Over the years, knowledge management has evolved to integrate information from multiple sources and perspectives. Data integration and manipulation pave the way for decision making. Decisions made by organizations are not based on a single factor; they are the cumulative result of multiple driving forces. Considering the financial decisions of an organization, dealing with revenues, salaries and interest rates alone would not be sufficient for deciding and predicting solutions. Such factors must be comprehensively supported with information on where and when to invest, along with proper consideration of the geographical locations of the market.
Data can exist in multiple forms; by itself it simply exists and has no significance beyond that. Data that has some meaning is information. This information may be put to use, or it can simply be stored without being practically applied to some area. Data gathering brings with it all types of data, including unstructured data, which has to be separated, segregated and formalized to bring out something informative. Unstructured data does not follow any specific format and cannot be put to use directly. It has to be organized and structured so that it can be used and can provide valuable information within the available resources. Structured data has a defined length and a specific format. A document containing data with dates and indexes is an example of structured data in traditional databases. A recent survey (US San Diego, 2018) estimates that only about 20% of digital data is structured; this structured data is further categorized into machine-generated and human-generated data. Sensor data, weblog data and point-of-sale data are typical examples of machine-generated data gathered from web activities and product purchasing. Human-generated data include input data such as online information and click-stream data generated when clicking a website link. Another term, coined for data that lies somewhere between structured and unstructured data, is semi-structured data. Semi-structured data can be understood as data that has a self-describing structure. It is the kind of data that does not conform to any data model typically associated with relational databases. Typical examples of semi-structured data are XML and JSON files.
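The distinction is easiest to see with a small example. The sketch below uses a hypothetical purchase record to show how a self-describing JSON document, a typical piece of semi-structured data, can be flattened into a structured table for further analysis (assuming the pandas library).

import json
import pandas as pd

# A semi-structured JSON record: the keys describe the fields, but the document
# does not conform to a fixed relational schema.
record = """
{
  "order_id": 1001,
  "customer": {"name": "A. Smith", "segment": "retail"},
  "items": [
    {"sku": "X-1", "qty": 2, "price": 9.99},
    {"sku": "Y-7", "qty": 1, "price": 24.50}
  ]
}
"""

data = json.loads(record)

# Flatten the nested document into a structured (tabular) form
table = pd.json_normalize(data, record_path="items",
                          meta=["order_id", ["customer", "name"]])
print(table)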

Knowledge Management

Nowadays, data is considered an important resource for business when compared with material assets and intellectual capital. Organizations aspiring to sustainable growth deliberately need to manage information for innovation. One way to mediate the increasing load of information is the application of knowledge management practices, which have their baseline in conventional approaches such as knowledge creation, innovation, the sharing and transfer of knowledge, the reusability of knowledge and the applicability of knowledge. It does not end here: conventional knowledge management also digs deep into knowledge acquisition. The acquisition of knowledge is a complex area, arising from competing disparate theories and inviting complications. One such complication is the difficulty of predicting future needs for knowledge and skills. Another concern of the knowledge imperative is the application of knowledge. Therefore,

Figure 1. Approximate percentage distribution of digital data

capturing knowledge and its application can be crucial for the competitive and economic success of a firm. While knowledge management components include data identification, data capturing and data management, there is another emergent technology that deals with the acquisition of data in huge volumes, commonly referred to as raw data, mundane knowledge or unstructured data. This form of data is usually collected from social sites and analyzed for predictive insights. Although knowledge management covers a wide area of data processing, its efficiency is still restricted to structured data. Unstructured or semi-structured data can be processed efficiently by combining knowledge management with big data approaches and algorithms.

Big Data Analytics

Big data analytics covers the collection and connection of data, extending to storing, manipulating and presenting vast data stores to the world. Big data can be defined precisely in the dimensions of volume, variety and velocity. Solutions are achieved by obtaining subsets of data, analyzing the subsets and aggregating the results. Big data, being an intensive change for IT, has major implications for knowledge management. Business analytics is one approach that uses big data in collaboration with knowledge management, enabling information to be stored, processed and retrieved using business intelligence. Text analytics allows the creation of data from enormous unstructured sources and helps in building sentiment scores. The speed at which data is generated and communicated in various formats is also significant in big data analytics. The range of data from multiple sources and in multiple formats is yet another important aspect. It is possible to identify several features that offer a wider perspective for understanding the impact of analyzing big data in an effective and influential contextual range covering knowledge management. The nature of the sources from which data is gathered, the understanding of individuals and the language understood by humans also affect the concept of big data analytics. The changing values of data captured from a source may reflect a change in the subject and opinions of an individual over a short period of time. A large volume of data can be derived from changing perspectives, but then the question arises of how much trust can be placed in the information so that it qualifies as valid information. Thus, big data analytics can be critically evaluated in terms of the impact of the 10 V’s for big data supported knowledge management (Crane & Self, 2014):

1. Volume (size)
2. Velocity (speed)
3. Variety (format)
4. Variability (temporal)
5. Value (to whom?)
6. Validity (applicable)
7. Veracity (truth)
8. Volatility (temporal)
9. Verbosity (text)
10. Verification (trust)

Of these, the first three are traditionally regarded as essential in defining big data. Volume, velocity and variety challenge big data analytics to ensure that correct linkages are made between objects and entities in different sources. Social network accounts, for instance, hold additional data that has to be structured and properly linked. The variety of sources therefore brings forward a technical challenge and a problem for the veracity and validity of the data, questioning the reliability of the content. To obtain correct linkages, verification and validation approaches can be applied before relying on analyzed information and knowledge. Changing values of data can be addressed under variability and volatility: data may be altered and affected by changing demographic details. Verbosity deals with the nature of text sources, since computer systems are not as good as humans at understanding the semantics of language. The value in data is correlated with the extent to which it complies with the 10 V’s and with whether the analysis is not limited by pre-conceived ideas about data relations and connections.
Over 100 billion emails are sent and received per day, adding to the burgeoning volume created each day on social sites such as LinkedIn, Twitter and Facebook (Falch, Henten, Tadayoni, & Windekilde, 2009). The metadata takes on tsunami-like proportions, since the amount of data directed at the average person per year is enormously large. IBM reports have shown that the major part of the data created in the world is not structured and that, if the trend continues, most of the data will turn out to be untrustworthy. By emerging as a technology that has revolutionized many service sectors, big data has permeated every operative sector of business and geared up the competitive environment.
In past years, the cost of data collection and storage limited the ability of enterprises to obtain the information required to get a holistic view of a solution for information retrieval. The barriers to accessing data have been removed by the automated collection of digital information and cheap storage. Data is abundantly available these days, but relational databases limit the ability to retrieve sensible information. Big data has emerged as a new solution to deal with this problem.
Emerging applications, including commercial ones, are using big data to deal with interrelated operational and transactional data, thereby providing new dimensions of knowledge for business operations, supply chain management, tracking the performance of distribution channels, and predicting and analyzing customer behavior for business intelligence.
There has been explosive growth in the amount of data streaming into businesses, and there is no sign that this exponential growth will slow down in the near future. Several organizations will gain many benefits for their businesses by leveraging this data to their advantage to gain deeper insights into customer behavior, competitor tracking, operational efficiencies, and much more. The question that requires
an answer here is: how are we going to manage and utilize this data to our advantage? A survey (Dorlenas, de Souza, & Amorim, 2017), carried out as a joint venture with the Outsourcing Center on the procedures of organizational data management, the analysis of proliferating data and the application of big data for the benefit of businesses, produced several findings. The survey identified some general trends in emerging areas:

1. Big data platforms address the issue of managing future big data challenges in the majority of organizations.
2. Organizations face difficulty in analyzing data sufficiency, handling external data and reporting
data in real time.
3. Lack of formal organizational strategy in place to deal with and leverage big data.
4. Becoming more operationally efficient is considered the biggest benefit of implementing big data
strategy.
5. The biggest roadblock to implementing a big data strategy is lack of measurable ROI.

The large amount of data which can generate new and valuable information for firms and organizations, beyond what traditional data sets and trending knowledge management activities can provide, can be called big data. Thus, the concept of big data encompasses the amount of data and the tools and technologies necessary to maintain vast amounts of data, in terms of variety, volume and velocity, that cannot be handled with traditional data management.

Why Big Data Analytics?

Digital data is growing at a huge pace of approximately 40% per annum and is expected to reach nearly 45 ZB by the year 2020. Approximately 1.2 trillion GB of data was generated in 2010 alone, which doubled by the year 2012 to become 2.4 trillion GB. In the year 2014, the amount of data was approximately 5 trillion GB. The size of data is expected to double every 1.2 years across the globe (Cuzzocrea, 2014).
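These figures are broadly consistent with simple compound growth. The short back-of-the-envelope sketch below (an illustration, not a forecast model) shows that a volume doubling roughly every two years, i.e. roughly 41% annual growth, reproduces the 2010 to 2014 volumes quoted above.

# Back-of-the-envelope check: ~41% annual growth (doubling roughly every two
# years) reproduces the volumes quoted above for 2010-2014.

annual_growth = 2 ** 0.5 - 1     # doubling every 2 years ~= 41.4% per year
volume_2010 = 1.2                # trillion GB generated in 2010 (figure quoted above)

for year in range(2010, 2015):
    volume = volume_2010 * (1 + annual_growth) ** (year - 2010)
    print(year, f"~{volume:.1f} trillion GB")
# 2012 -> ~2.4 trillion GB, 2014 -> ~4.8 trillion GB (close to the 5 quoted above)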
Approximately one million customer transactions are processed by Wal-Mart every hour. Every day, users post about 500 million tweets on Twitter. Facebook records approximately 2.7 billion ‘likes’ and comments per day. It is estimated that 2.5 quintillion bytes of data are created per day worldwide. It is also interesting to note that 90% of the data worldwide was created in the last two years alone. The cost of storing data per gigabyte has dropped hugely, and there are a number of user-friendly tools available in the market for big data analytics.

CLASSIFICATION OF ANALYTICS

1. Basic Analytics: Basic analytics deals with classifying and categorizing data in order to obtain business insights. It primarily includes reporting based on historical data, basic data visualization, etc.
2. Operationalized Analytics: Analytics becomes operationalized when it is interlinked with the organization’s business processes.
3. Advanced Analytics: Advanced analytics mainly deals with forecasting the future based on predictive and prescriptive modelling.
4. Monetized Analytics: This kind of analytics is used to derive direct revenue for business organizations.

COMPONENTS OF BIG DATA

• HDFS (Hadoop Distributed File System): This component of big data is responsible for the storage layer. Without a proper storage medium, data cannot be stored and no insights can be derived from it; this layer is therefore what enables the insights that make the technology better and direct us towards our goal.
• NoSQL Databases: NoSQL databases are responsible for handling data that is not structured, since the insights we need do not always come from structured data only; they may come from unstructured or semi-structured data.
• Real-Time Processing: This component of big data allows many modules of the latest technology to be developed. With real-time processing, a query can be answered within microseconds. This component is very helpful in complex research.
• Fault Tolerance: This component also plays a major role in complex research and technology. Consider an important piece of research running on a system: if the system fails and is not fault tolerant, it will be very difficult to recover the progress made.
• MapReduce: This is a programming model for the faster execution of huge amounts of data and for processing distributed data (a minimal sketch follows this list).
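To make the MapReduce component concrete, the minimal sketch below simulates the map, shuffle and reduce phases of a word count on a single machine; a real Hadoop job runs the same three phases in parallel across the nodes of a cluster. The sample documents are, of course, hypothetical.

from collections import defaultdict

documents = ["big data enables prediction",
             "knowledge management and big data",
             "big data for complex research"]

# Map: emit (key, value) pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate the values for each key
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)   # e.g. {'big': 3, 'data': 3, ...}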

COMPONENTS OF KNOWLEDGE MANAGEMENT

• Functionality: This component of knowledge management is responsible for enhancing and supporting knowledge-intensive applications and processes, their transformation, their maintenance, their structures, their evolution, the application of knowledge, and so on. This component is the backbone of applications that require complex research.
• Interface: For the facilitation of knowledge, or of a service that manages knowledge, an interface is required. This interface could be technological, structural or a human being; whatever its type, it must be present for the systems to work properly.
• Strategy: This is a very important component of knowledge management, as a strategy should be in place to deal with a problem or to gain proper benefit from an opportunity. In other words, to develop anything, to cope with a problem, to run a business or to perform complex research, we should have a strategy.
• Persistent Improvement: Nothing is perfect; therefore we should always try to improve a particular thing regardless of how good it is, because there may be flaws that can only be resolved by continuous improvement. Therefore, persistent improvement is necessary.

Returning to the original discussion, this chapter argues that, with the help of the components of big data and knowledge management, there will be a drastic change in how people collaborate and develop complex research. With these components in place, collaboration will take a sharp turn, or an interesting twist, because people will already be able to get insights from big data and will therefore collaborate more intelligently. The likely changes in how people collaborate are the following:

• They will try to collaborate with people who have the same goals.
• They will collaborate in order to create creative conflict, which will be productive for them.
• The people who collaborate will be selected carefully by considering their potential, their skills, their determination towards their work and what they can do creatively in order to gain benefits. People who can merely dig some insights out of the ocean of data and pass the information on will not be sufficient as collaborators; creative people will be needed, because for such extraction tasks we already have the components of big data and knowledge management.
• Collaboration will be done on the basis of how good people are at surfacing creativity and facilitating techniques.
• Collaboration will be done on the basis of engagement.

In the development of complex research, if we are provided with services that automatically perform computation on data and deliver information, this will be very helpful in conducting complex research and will also motivate us to do more. This work is done by the components of big data and knowledge management. It is not that without these components we would be unable to do anything, but with technology as a tool we are able to conduct complex research quickly and effectively.
Technology is not in itself our advantage; we are simply using resources in a smart way to accelerate our learning and enhance our capabilities. Consider a piece of research that is very complex and that we cannot perform normally, or in which we face difficulties. If we use the services provided by the components of big data and knowledge management systems, we may be able to perform it easily; here we are using technology in a smart way to get rid of the problems, or to convert a complex problem into a simple one. The use of the components of big data and knowledge management is very helpful and beneficial, and it will exceed all expectations in a well-structured organization that follows collaboration protocols and conducts research regularly. An example of this is the Human Genome Project. As the complexity of scientific challenges is constantly increasing, we need to use the latest technology, such as big data and knowledge management systems and frameworks, and to collaborate in order to tackle the challenges.
One thing we should always keep in mind is that, while excellence in education is the ultimate outcome, even more important than excellence in education is our ability to learn. We should not think of the ability to learn as an individual achievement; it is a collective one, and collective learning is the backbone of great learning ecosystems.
So, we should keep learning in order to enhance technology and use it in complex research, or anywhere else it can be applied, as we are doing with the components of big data and knowledge management.

BIG DATA FOR SCIENTIFIC RESEARCH

In recent times, big data has evolved rapidly and has emerged as a topic of great interest for researchers, academicians, industry experts and even government organizations across the globe (Mayer-Schonberger & Cukier, 2013; Thomson, Lebiere & Bennati, 2014; Cuzzocrea, 2014). Big data has prompted the entire scientific community to re-evaluate its methodology of scientific research and has triggered a revolution in scientific thinking and methods.
Historically, research was based on experimental results. Then emerged theoretical science, which involved the study of various laws and theorems. However, it was observed that theoretical analysis was becoming too complex and was not viable for solving practical problems, which eventually led researchers to opt for simulation-based methods, and hence computational science was born.
The evolution of big data has given birth to a new research model. With big data in the picture, researchers are able to extract and process only the relevant information, and the objects to be studied do not need to be accessed directly. Jim Gray, the late Turing Award winner, illustrated in his last speech the fourth paradigm of data-intensive scientific research (Hey, Tansley & Tolle, 2009), which separates data-intensive science from computational science. According to Gray, the fourth paradigm was the only possible systematic approach to solving some of the toughest global challenges faced today. In essence, the fourth paradigm was not merely a change in the way of conducting scientific research, but also a change in the way that people think (Mayer-Schonberger & Cukier, 2013).

BIG DATA FOR EMERGING INTERDISCIPLINARY RESEARCH

An emerging interdisciplinary research area known as data science (Loukides, 2011) has come into existence, which also takes advantage of big data technology. The research object of data science is big data, and its aim is to generalize knowledge extraction from data, which eventually

Figure 2. Big data processing model

gives rise to the term knowledge management. The term data science is not restricted to any particular area or field; rather, it expands across various disciplines, including mathematics, information science, network science, social science, system science, economics and psychology (O’Neil & Schutt, 2013). It employs many theories and techniques from various fields, including machine learning, signal processing, statistical learning, probability theory, computer programming, pattern recognition, data engineering, visualization, data warehousing, uncertainty modeling and high-performance computing.
Big data has led to the establishment of numerous research centers and institutes in universities across the globe, including New York University, the University of California at Berkeley, Columbia University, Eindhoven University of Technology, Tsinghua University and the Chinese University of Hong Kong. Big data tools and technologies are being used extensively by industry and academia to facilitate knowledge management and to train data science engineers.

NATIONAL DEVELOPMENT THROUGH BIG DATA

Currently, the world is entering the information age. The extensive use of the Internet, the emergence of smart devices resulting in the Internet of Things (IoT), the evolution of cloud computing and various other data sources have resulted in huge amounts of data, most of which are unstructured and complex to process. Big data processing and analytics will prove to be of great importance for enhancing the competitiveness of companies, hence promoting the economic growth of a nation.
The future lies in Analysis as a Service (AaaS), and various IT giants such as Google, Microsoft and IBM have already started working towards it by employing big data analytics tools and technologies. In the near future, the capacity and capability of a nation to store, process and analyze huge amounts of data will become a new landmark of its strength.

Figure 3. Region specific data co-ordinates

A government report in China proposed that cyberspace, along with deep space and the deep sea, is a key area of national core interest. A country lagging behind in the field of big data research and applications not only risks losing its industrial strategic advantage but also exposes loopholes in its national cyberspace security. Seen against this background, the Big Data Research and Development Initiative (Williams, 2018), announced by the United States in March 2012, was not merely a strategic plan for the US to continue leading in high-tech fields, but also a strategy to shield its national security and enhance its social and economic development.

PREDICTING BETTER FUTURE WITH BIG DATA

Better predictions of future trends for various events can be made by achieving effective integration of big data and more accurate analysis of unstructured, heterogeneous big data. Big data analytics has made it possible to promote the sustainable development of society by enhancing economic growth and providing opportunities to establish new industries based on data services. The ability to analyze huge amounts of data distributed over a network has been greatly developed and effectively implemented in the fields of military and security. For example, the US released a report in 2010 (Louis, 2013) titled “Chinese Nuclear Warhead Storage and Handling System”, stating that the US had found Chinese nuclear bases in areas such as Jiangxi, Shaanxi and Sichuan. The report also quoted the names of the various cities and counties where the nuclear bases were located. This report immediately became a global sensation. The Project 2049 Institute of the United States, founded in 2008 in Washington, DC, got worldwide attention after the release of this report. The institute analyzes and forecasts security issues related to the economy and military in China by utilizing publicly available data such as conference papers and journals. Another report was released by the institute in March 2013, based on China’s Unmanned Aerial Vehicle (UAV) project (Easton & Hsiao, 2013). According to the report, a comprehensive analysis was conducted of the equipment, research and development, and operational deployment of UAVs in China.
Social issues such as public health and economic growth and development have also been addressed by applying big data predictive analysis techniques. Ginsberg et al. concluded in one of their works that the number of influenza patients in the hospital emergency rooms of a particular region will eventually increase if, in the past few weeks, a high number of Google queries were submitted in that area with keywords like “flu symptom” and “flu treatment” (Ginsberg, Mohebbi, Patel, Brammer, Smolinski & Brilliant, 2009). Imagine the advantage of such predictive analysis in the area of public health: hospitals and doctors can prepare themselves in advance with appropriate equipment, medicines and vaccines. In terms of economic growth and development, a project called Global Pulse (UN Global Pulse, 2012) was launched by the United Nations, which aims at utilizing big data technologies to promote and enhance global economic development.
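The logic of such query-based surveillance can be illustrated with a toy model. The sketch below uses entirely synthetic numbers (it is not the Google Flu Trends data or method) and fits a simple linear model relating this week's volume of flu-related searches to next week's influenza-related visits.

import numpy as np

# Synthetic illustration only: weekly flu-related search volume and the
# influenza visits observed one week later.
weekly_queries = np.array([120, 150, 200, 260, 310, 400, 520])   # searches (this week)
flu_visits     = np.array([ 18,  22,  31,  40,  47,  62,  80])   # visits (next week)

# Ordinary least squares fit: visits ~ a * queries + b
a, b = np.polyfit(weekly_queries, flu_visits, 1)

next_week_queries = 600
predicted_visits = a * next_week_queries + b
print(f"predicted influenza visits: {predicted_visits:.0f}")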
Knowledge management (KM) is a collection of systematic approaches that help information and knowledge flow to and between the right people at the right time (in the right format and at the right cost) so that they can act more efficiently and effectively to create value for the organization. The term knowledge management mainly concerns dealing effectively with information within an organization. Big data, on the other hand, is used to process huge amounts of data within or outside the organization. Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization and information privacy. The term often refers simply to the use of predictive analytics or other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reduction and reduced risk. Knowledge management is essentially the processing of data into what is known as information, while these days the size of computational data is very large. A traditional KM system may provide sufficient knowledge within one commercial organization, but it may fail with complex data structures.
As noted before, knowledge management does not provide a good solution at the lower levels of the data-information-knowledge hierarchy, which can be treated as a knowledge management problem. The figure below describes the process of transforming data into information and then into knowledge.

UNSTRUCTURED DATA PROCESSING PROBLEM IN KM

Commercial data is the main type of data applied in knowledge management (Lee, 2009). Such data is unstructured in nature, and with traditional data processing techniques, handling this kind of data was a very challenging task. Big data supports knowledge management and extends its functionality by providing tools like Hadoop for analyzing unstructured data. Having distributed storage capabilities and a distributed processing framework, Hadoop enables working with huge volumes of complex data. The design of Hadoop supports big data, as it can accommodate data that is too big for traditional database technologies. Data is stored in files, since Hadoop does not enforce a schema or structure for data storage. Applications like Sqoop, Hive and HBase integrated with Hadoop enable import and export from

Figure 4. Organizational-learning process through big data analytics

Table 1. Various tools and platforms for big data analytics

The Hadoop Distributed File System (HDFS): HDFS provides the storage structure for a Hadoop cluster. The data is divided into small segments and distributed across various nodes or servers for storage.

MapReduce: Acts as an interface that divides tasks into subtasks, distributes them for processing, and then collects the various outputs and combines them. It also keeps track of the processing on the various nodes or servers.

PIG and PIG Latin: PIG is a programming language configured to integrate any kind of data, whether structured or unstructured. Its two major modules are the language itself, known as PIG Latin, and the runtime environment in which its code is executed.

Hive: A runtime environment supported by Hadoop which is based on Structured Query Language (SQL). It allows programmers to write HQL (Hive Query Language) queries, which are very similar to SQL.

Jaql: Another functional and declarative query language designed for processing larger data sets. “High level” queries are converted into “low level” queries consisting of MapReduce tasks, in order to facilitate parallel processing.

Zookeeper: Facilitates synchronization throughout a cluster of servers, providing a centralized infrastructure offering various services. These services are utilized by big data analytics applications to coordinate parallel processing throughout large clusters.

This allows Hadoop to give structure to otherwise unstructured data and to explore semi-structured data
for further analysis and decision-making. More broadly, Hadoop is seen as a powerful framework for
writing customized code: programmers can implement algorithms of any complexity on top of Hadoop's
features and framework to obtain efficient and reliable solutions, and they thereby gain the flexibility to
understand data at a raw level. As an open source project, Hadoop has grown exponentially in the market,
with numerous applications including speech and image processing, file analysis, and text analysis, often
driven with Python.
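
As a minimal illustration of the kind of customized code Hadoop supports, the sketch below implements a word count in the MapReduce style using Python, the way a Hadoop Streaming job might. The mapper and reducer here work on in-memory lists so the example runs on its own; the sample text is an assumption made purely for illustration.

```python
# A minimal sketch (not a production job) of the MapReduce pattern Hadoop uses
# for unstructured text: a word count written in Python. In a real Hadoop
# Streaming job the mapper and reducer would read lines from stdin and write
# tab-separated key/value pairs to stdout; here they work on in-memory lists
# so the example runs on its own.
from itertools import groupby

def mapper(lines):
    """Map step: emit a (word, 1) pair for every token in the raw text."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce step: sum counts per word; pairs must arrive sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["shipping data grows fast", "data drives shipping decisions"]
    shuffled = sorted(mapper(sample))   # Hadoop performs this shuffle/sort between phases
    for word, count in reducer(shuffled):
        print(f"{word}\t{count}")
```

In a real deployment the same two functions would read from standard input and write tab-separated key/value pairs to standard output, with Hadoop handling distribution, shuffling, and fault tolerance across the cluster.
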

MARITIME BUSINESS CASE: A CASE STUDY

Big data enabled knowledge management solutions can be applied to support maritime innovation
capability. Big data driven knowledge management initiatives can be addressed in maritime organizations
in the following areas:

• Big data for strategy and decision making in shipping industries
• Big data for competitive intelligence for knowing shippers, suppliers, and competitors
• Big data for human capital development on board and ashore
• Data-driven culture: from craftsmanship to knowledge and science orientation


Big Data for Strategy and Decision Making in Shipping Industries

Strategy formulation in shipping primarily addresses investment appraisal and portfolio management,
covering elements such as ship valuations and asset play, risk assessment for financing, market scanning
and positioning, and determining exits. Strategic adjustment is considered the closest knowledge
management system evolution approach using big data: it opens the possibility of new information
services enabled by near-real-time collection of multiple big data sources, both public and private, open
and corporate, to support market interpretation and prediction.
Research and development emphasis has primarily been placed on (a) predictive maintenance
applications to monitor ship health, (b) energy consumption and efficiency monitoring, (c) emission and
environmental impact monitoring platforms, and (d) safety and security platforms for monitoring piracy
and critical incidents. WAVES and DANAOS are commonly used commercial platforms in shipping
companies for monitoring vessel and fleet performance. Centralized policy orientation for operational
authorities is currently emerging, along with vessel tracking and trade-route analysis functionality.

Big Data for Competitive Intelligence for Knowing Shippers, Suppliers, and Competitors

The primary objective of maritime competitive intelligence is the collection and analysis of available
data resources to identify the behavioral patterns of collaborators, customers, and competitors. Predicting
competitor behavior (for example, among liner shipping companies serving the same trade routes) allows
strategic moves to be anticipated. The future behavior of customers, including shippers, forwarders, and
charterers, can be analyzed with high confidence, thus strongly supporting the decision making of any
maritime organization.
This unveils an unexplored area of knowledge management for the maritime domain and gives impetus
to maritime big data and business intelligence technology and the development of applications. Related
technologies range from sentiment analysis for inferring opinions to predicting market trends and
freight rates.
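
As a purely illustrative sketch of how sentiment analysis might feed freight-rate trend prediction, the Python fragment below scores a few maritime market headlines against a toy lexicon and aggregates them into an upward, flat, or downward signal. The lexicon, headlines, and interpretation are assumptions for the example; a production system would apply trained sentiment models to much larger news, broker, and market data sets.

```python
# Illustrative sketch only: a toy lexicon-based sentiment score over maritime
# market headlines, of the kind that could feed freight-rate trend prediction.
# The lexicon, headlines, and interpretation are all assumptions.
POSITIVE = {"surge", "growth", "recovery", "strong", "rising"}
NEGATIVE = {"slump", "oversupply", "weak", "falling", "idle"}

def sentiment(headline: str) -> int:
    """Count positive minus negative lexicon words appearing in a headline."""
    words = set(headline.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

headlines = [
    "Container demand shows strong recovery on Asia-Europe routes",
    "Tanker market faces oversupply as idle fleet grows",
    "Charter rates rising for mid-size bulk carriers",
]

score = sum(sentiment(h) for h in headlines)
outlook = "upward" if score > 0 else "downward" if score < 0 else "flat"
print(f"Aggregate sentiment {score:+d}: freight-rate outlook {outlook}")
```
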

Big Data for Human Capital Development on Board and Ashore

Human capital development needs assessment and a reorientation in terms of soft skills in any maritime
organization. Modern maritime education and training must fulfill current industry needs and complement
future trends. New technical requirements pose new challenges for seafarers, since they have to work with
high-speed equipment on board, in part on unmanned mega ships. This requires a more skilled workforce
that is technically literate and well equipped for problem solving and decision making, along with strong
communication skills. Other marine professionals are continuously challenged to be trained and certified
in order to retain their positions in maritime organizations. Big data trends, together with knowledge
management and e-learning trends, are likely to influence the maritime training sector through the
development of skill estimation and course recommendation.


Data-Driven Culture: From Craftsmanship to Knowledge and Science Orientation

Another major paradigm shift altering the maritime industry is the new technological trajectory currently
evolving around smart ships, smart ports, and logistics infrastructure. Vessels and shipping-company nodes
are highly supported by big data. Innovation-oriented technology is gaining momentum in the shipping
environment, and new strategies are paving the way for timely entry into profitable markets by developing
and redesigning decision support and operational support in maritime software. In essence, knowledge
management is enabling the systematic innovation capabilities of shipping companies.
It can be postulated that unless knowledge management applications for innovation support in the big
data era are appropriately designed, visualized, and discussed, it is not possible to enhance innovative
capabilities in the maritime industry.

HANDLING BIG DATA IN THE AUTOMOTIVE INDUSTRY

Big data is qualitatively different from previous data analytics technologies because it can work with vast
amounts of data of different types. This data is real-time data and can be tapped from a large number of
public and private sources. Big data will enhance results, improve the supply chain, and reduce the lead
time of quality approaches such as Six Sigma, Design for Six Sigma, and lean manufacturing.

Big Data and Supply Chain Management

The supply chain function uses big data to predict car demand and trends from CRM (customer
relationship management) databases or public data. SCM also provides for monitoring dealers and
end-of-line customer satisfaction. Automotive companies apply big data to both economic and social aspects.
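
As a hedged illustration of demand prediction from CRM order history, the short Python sketch below applies a three-month moving average to monthly order counts. The figures are invented for the example, and a real supply chain forecast would combine far richer CRM, dealer, and public data.

```python
# Purely illustrative sketch: forecasting next month's demand for a car model
# from monthly CRM order counts with a simple 3-month moving average.
# The numbers are made up for the example.
monthly_orders = {"Jan": 410, "Feb": 455, "Mar": 470, "Apr": 498, "May": 520, "Jun": 545}

recent = list(monthly_orders.values())[-3:]   # last three observed months
forecast = sum(recent) / len(recent)          # naive moving-average forecast
trend = recent[-1] - recent[0]                # crude trend over the window

print(f"Forecast for next month: {forecast:.0f} orders (recent trend {trend:+d})")
```
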

Big Data for End Customer Focus

Through big data, the customer becomes an eminent part of the control board for project decisions.
Customer needs influence future improvements in car functionality, leading to production of new versions
of a model. Data on the customized driving experience is gathered to provide personalized driving tips
and suggestions for improving driving style. Brake warnings, unnecessary-acceleration alerts, destination
suggestions, and alternative-route suggestions can inform drivers and answer their queries of common interest.
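
A minimal sketch of how such driving tips could be derived from telematics data is shown below: per-second speed readings are scanned for harsh-braking and unnecessary-acceleration events. The thresholds and sample readings are assumptions made for illustration, not values used by any particular vehicle platform.

```python
# A minimal sketch (assumed thresholds, synthetic samples) of how per-second
# speed telemetry from a connected car could be turned into driving tips,
# e.g. harsh-braking and unnecessary-acceleration warnings.
SPEEDS_KMH = [50, 52, 54, 38, 36, 40, 55, 70, 72, 71]  # one reading per second

HARSH_BRAKE = -10   # km/h lost in one second
HARSH_ACCEL = 12    # km/h gained in one second

events = []
for second, (prev, curr) in enumerate(zip(SPEEDS_KMH, SPEEDS_KMH[1:]), start=1):
    delta = curr - prev
    if delta <= HARSH_BRAKE:
        events.append((second, "harsh braking"))
    elif delta >= HARSH_ACCEL:
        events.append((second, "unnecessary acceleration"))

for second, kind in events:
    print(f"t={second}s: {kind} detected; consider smoother driving")
```
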

Big Data for Management Roles

Management requires future forecasting and business prediction. Client needs directly influence business
decisions; with the client effectively acting as a co-manager, the power of decision making is harnessed
directly from the client. This requires implementing big data decision-making algorithms to support CRM.
Business profitability increasingly rests on the growth of intangible resources, and big data supports this
by helping to measure intellectual capital and knowledge management. Based on the processed data,
employee records and performance can be traced and the status of both parameters can be measured. A
big data platform can be used to check how know-how is distributed across companies and subdivisions.
Changes can be implemented for more efficient allocation of resources if the employee is well integrated
into the team. Big data addresses the major question of what the average growth in performance will be
if management is informed and prepared to deal with team issues and balance them without great
difficulty. The serious issue of employee inadaptability can also be addressed using big data measurement
technologies, with decisions made on the basis of suggestions provided by big data algorithms. Big data
enables the organization to gather, store, manage, and manipulate data at the right speed and time,
providing insights for decision-making. Finding new information and patterns leads to new knowledge
management insights that are valuable for the organization. Large volumes of data help to maintain
knowledge that serves as a foundation for decision-making, subject to a few constraints such as the
relevance of the data, the quality of the data, careful analysis, and the type of problem being addressed.

Big Data for Automotive Telematics

Globally and locally interconnected cars can save time, lives, and money by avoiding accidents and
traffic jams. Algorithms that map personal driving behavior can be used to achieve these results. Big
data has a vital place in connected-car environments, delivering real-time solutions for traffic management
and improving traffic efficiency. The concept of connected cars can help cities become intelligent cities
built on intelligent infrastructure. Several devices can interpret the world around the car and handle
exceptions; such devices lay the foundation of automated driving solutions using big data and knowledge
management.
To conclude on the role of big data in the automotive industry: big data will be highly effective in the
supply chain for pursuing customer-oriented policies that reduce warranty costs. Integrating big data
solutions with government, cities, and agencies can support safety systems and revolutionize traffic
systems in collaboration with the traffic authorities. Big data allows data to be combined within and
across organizations to determine customer behavior and produce useful insights in numerous applications
such as public health and safety, fraud detection, cost reduction, and much more.
Large quantities of raw and unstructured data can provide an evolutionary leap beyond classical data
handling and manipulation technologies.

BIG DATA IN HEALTHCARE

Healthcare is another area where big data has been widely applied, with rapid development in data
and knowledge management. Prevalent diseases and disease trends among populations can be analyzed
using clinical data. Clinical data analysis can be performed to determine causality, effect, and association
between risk factors and diseases among populations.
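
The Python sketch below illustrates one elementary form such clinical analysis can take: estimating the association between a single risk factor and a disease from patient records using a two-by-two table and the relative risk. The records and field names are synthetic assumptions; real studies would adjust for confounders with far more sophisticated models.

```python
# Hedged sketch: measuring the association between one risk factor and one
# disease from clinical records via a 2x2 table and the relative risk.
# The records below are synthetic; the field names are assumptions.
records = [
    {"smoker": True,  "copd": True},  {"smoker": True,  "copd": False},
    {"smoker": True,  "copd": True},  {"smoker": False, "copd": False},
    {"smoker": False, "copd": False}, {"smoker": False, "copd": True},
    {"smoker": True,  "copd": True},  {"smoker": False, "copd": False},
]

exposed_cases = sum(r["smoker"] and r["copd"] for r in records)
exposed_total = sum(r["smoker"] for r in records)
unexposed_cases = sum((not r["smoker"]) and r["copd"] for r in records)
unexposed_total = sum(not r["smoker"] for r in records)

risk_exposed = exposed_cases / exposed_total
risk_unexposed = unexposed_cases / unexposed_total
relative_risk = risk_exposed / risk_unexposed
print(f"Relative risk of COPD for smokers vs non-smokers: {relative_risk:.2f}")
```
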
Big data can bring knowledge management to health science by reusing medical record data to aid
medical research. Health science information can be manipulated to have a positive impact on health
science. Based on the queries placed, Google was able to predict the spread of the flu outbreak that started
in the United States in 2009. The processing power and expertise of Google can provide public health
officials with valuable real-time information. The Google example shows what can be achieved when
big data and knowledge management are combined in the service of society's health. Integrating patient
information and related data across entities, and analyzing healthcare data, can exponentially improve
the quality, efficiency, and continuity of healthcare and its outcome predictions. Being one of the most
expensive sectors in the nation, providing employment and services to people, and incurring high
expenditure as the population ages, healthcare is a producer of enormous amounts of data. This data
includes health records and statistics, which can be applied to enhance operational efficiency and
productivity and to improve the services provided. Healthcare executives and policy makers also need
to weigh the implications of integrated big data knowledge management.

CONCLUSION

It would not be wrong to say that intellectual capital (IC), KM, and the escalating trend towards the use
of big data processing and business analytics are all interconnected. All of them relate to
some sort of intangible asset, whether it is data, information, knowledge, or intelligence. A better un-
derstanding of how an organization can benefit from knowledge assets can be obtained by concentrating
more on the strategic aspects of developing and protecting knowledge.
We can obtain an understanding of what kind of knowledge is appropriate to develop in various
industries by examining variables such as the nature of that knowledge. This perspective can be
beneficial in understanding how and when contributions from big data might prove helpful. Further,
such variables can provide insight into whether data is at risk, give us guidelines regarding the protection
of intangible assets, and even illustrate the need to protect data from competitive incursions where
required.

REFERENCES

Crane, L., & Self, R. (2014). Big Data Analytics: A Threat or an Opportunity for Knowledge Manage-
ment? Lecture Notes in Business Information Processing, 185, 25–34. doi:10.1007/978-3-319-08618-7_3
Cuzzocrea, A. (2014). Privacy and security of big data: current challenges and future research perspec-
tives. Proceedings of the First International Workshop on Privacy and Security of Big Data, PSBD ’14.
Easton, I. M., & Hsiao, L. R. (2013). The Chinese People’s Liberation Army’s unmanned aerial vehicle
project: Organizational capacities and operational capabilities. Tech. rep., Project 2049 Institute.
Falch, M., Henten, A., Tadayoni, R., & Windekilde, I. (2009). Business Models in Social Networking.
Aalborg Universitet. Retrieved from: http://vbn.aau.dk/files/19150157/falch_3.pdf
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detect-
ing influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014. doi:10.1038/
nature07634 PMID:19020500
Hey, T., Tansley, S., & Tolle, K. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery.
Microsoft Corporation. doi:10.1145/2609876.2609883
Lee, M.-C. (2009). The combination of knowledge management and data mining with knowledge
warehouse. International Journal of Advancements in Computing Technology, 1(1). doi:10.4156/ijact.
vol1.issue1.6


Lewis, L. (2013). China’s Nuclear Idiosyncrasies and Their Challenges. Security Studies Center. Retrieved
from: https://www.ifri.org/sites/default/files/atoms/files/pp47lewis.pdf
Loukides, M. (2011). What Is Data Science? O’Reilly Media, Inc. Retrieved from: https://www.oreilly.
com/ideas/what-is-data-science
Mayer-Schonberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live,
Work, and Think. Houghton Mifflin Harcourt.
O’Neil, C., & Schutt, R. (2013). Doing Data Science: Straight Talk from the Frontline. O’Reilly Media, Inc.
Simião Dornelas, J., Rodrigues de Souza, K. R., & Amorim, A. N. (2017, May/August). Cloud computing:
Searching its use in public management environments. JISTEM - Journal of Information Systems and
Technology Management, 14(2), 281–306. doi:10.4301/S1807-17752017000200008
Thomson, R., Lebiere, C., & Bennati, S. (2014). Human, model and machine: a complementary ap-
proach to big data. Proceedings of the Workshop on Human Centered Big Data Research, HCBDR ’14.
UN Global Pulse. (2012). Big data for development: Challenges & opportunities. Retrieved from: http://
www.unglobalpulse.org/projects/BigDataforDevelopment
Williams, R. D. (2018). The ‘China, Inc.+’ Challenge to Cyberspace Norms. Hoover Institution. Retrieved
from: https://www.hoover.org/research/china-inc-challenge-cyberspace-norms

ADDITIONAL READING

Angelo, T. A. (1999). Doing assessment as if learning matters most. AAHE Bulletin, 51(9), 3–6.
Erickson, S., & Rothberg, H. (2014). Big Data and Knowledge Management: Establishing a Conceptual
Foundation. Electronic Journal of Knowledge Management.


Compilation of References

Aaltonen, A., & Tempini, N. (2014). Everything counts in large amounts: A critical realist case study
on data-based production. Journal of Information Technology, 29(1), 97–110. doi:10.1057/jit.2013.29
Abaei, G., Selamat, A., & Fujita, H. (2015). An empirical study based on semi-supervised hybrid self-
organizing map for software fault prediction. Knowledge-Based Systems, 74, 28–39. doi:10.1016/j.
knosys.2014.10.017
Abbott, D. (2014). Applied predictive analytics: Principles and techniques for the professional data
analyst. John Wiley & Sons.
Abdallah, A., Maarof, M. A., & Zainal, A. (2016). Fraud detection system: A survey. Journal of Network
and Computer Applications, 68, 90–113. doi:10.1016/j.jnca.2016.04.007
Abouelhoda, M., Issa, S., & Ghanem, M. (2012). Tavaxy: integrating taverna and galaxy workflows with
cloud computing support. BMC Bioinformatics, 13(1).
Acharjya, D. P., & Kauser, A. P. (2016). A Survey on Big Data Analytics: Challenges, Open Research
Issues and Tools. International Journal of Advanced Computer Science and Applications, 7(2). Retrieved
from https://thesai.org/Downloads/Volume7No2/Paper_67-A_Survey_on_Big_Data_Analytics_Chal-
lenges.pdf
Ackerman, M. S., Cranor, L. F., & Reagle, J. (1999). Privacy in e-commerce: Examining user scenarios
and privacy preferences. In Proceedings of the 1st ACM Conference on Electronic Commerce. ACM.
10.1145/336992.336995
Acquisti, A., & Gross, R. (2006). Imagined communities: Awareness information sharing and privacy on
Facebook. Proceedings of the Privacy Enhancing Technologies Symposium, 36–58 10.1007/11957454_3
Adams, D. (1979). The Hitchhikers Guide to the Galaxy. Pan.
Adams, G. L., & Lamont, B. T. (2003). Knowledge management systems and developing sustainable com-
petitive advantage. Journal of Knowledge Management, 7(2), 142–154. doi:10.1108/13673270310477342
Agre, P. E. (1994). Surveillance and capture: Two models of privacy. The Information Society, 10(2),
101–127. doi:10.1080/01972243.1994.9960162
Akter, S., & Wamba, S. F. (2016). Big data analytics in E-commerce: A systematic review and agenda
for future research. Electronic Markets, 26(2), 173–194. doi:10.100712525-016-0219-0




Alamäki, A., & Dirin, A. (2015). The stakeholders of a user-centred design process in mobile service
development. International Journal of Digital Information and Wireless Communications, 5(4), 270–284.
doi:10.17781/P001825
Ali-ud-din Khan, M., Uddin, M.F., & Gupta, N. (2014). Seven V’s of Big Data understanding Big Data
to extract Value. Retrieved from https://ieeexplore.ieee.org/document/6820689
Allan, J. V., Tom, B., Robert, L., & Donald, L. S. (2004). Development of an Information Fusion System
for Engine Diagnostics and Health Management. NASA/TM—2004-212924.
Altunas, S., Dereli, T., & Kusiak, A. (2015). Analysis of patent documents with weighted association
rules. Technological Forecasting and Social Change, 92, 249–262. doi:10.1016/j.techfore.2014.09.012
Amatriain, X. (2017a). Is Data more important then Algorithms in Artificial Integrating? Retrieved
from https://www.forbes.com/sites/quora/2017/01/26/is-data-more-important-than-algorithms-in-
ai/#7424f7dc42c1
Amatriain, X. (2017b). In Machine Learning, is more Data always better than better Algorithms? Retrieved
from https://www.quora.com/In-machine-learning-is-more-data-always-better-than-better-algorithms
Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired
Magazine, 16(7).
Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M., & Stuart, M. (2016). HR and analytics:
Why HR is set to fail the big data challenge. Human Resource Management Journal, 26(1), 1–11.
doi:10.1111/1748-8583.12090
Antila, E. M. (2006). The role of HR managers in international mergers and acquisitions: A
multiple case study. International Journal of Human Resource Management, 17(6), 999–1020.
doi:10.1080/09585190600693322
Ariely, D. (2013, January 6). Big Data is like Teenage Sex. Twitter. Retrieved from: https://twitter.com/
danariely/status/287952257926971392?lang=en
Asencio–Cortés, G., Morales–Esteban, A., Shang, X., & Martínez–Álvare, F. (2017). Earthquake pre-
diction in California using regression algorithms and cloud-based big data infrastructure. Computers &
Geosciences. doi:10.1016/j.cageo.2017.10.011
Asgarnezhad, R., Shekofteh, M., & Boroujeni, F. Z. (2017). Improving Diagnosis of Diabetes Mellitus
Using Combination of Preprocessing Techniques. Journal of Theoretical and Applied Information
Technology, 95(13), 15.
Aslam, S. (2015, October 7). Snapchat by the Numbers: Stats, Demographics and Fun Facts. Omnicore.
Retrieved from: http://www.omnicoreagency.com/snapchatstatistics/
Bairoch, A., & Boeckmann, B. (1991). The SWISS-PROT protein sequence data bank. Nucleic Acids
Research, 19(suppl), 2247–2249. doi:10.1093/nar/19.suppl.2247 PMID:2041811
Baker, J. D. (2016). The Purpose, Process, and Methods of Writing a Literature Review. AORN Journal,
103(3), 265–269. doi:10.1016/j.aorn.2016.01.016 PMID:26924364


Bakhshi, H., & Mateos-Garcia, J. (2016). New data for innovation policy. OECD Blue Sky Conference.
Bakshi, K. (2012). Considerations for big data: Architecture and approach. In Aerospace Conference
Proceedings, IEEE (pp. 1-7). Big Sky, MT: IEEE. 10.1109/AERO.2012.6187357
Balachandran, B., & Prasad, S. (2017). Challenges and Benefits of Deploying Big Data Analytics
in the Cloud for Business Intelligence. Procedia Computer Science, 112, 1112–1122. doi:10.1016/j.
procs.2017.08.138
Baregheh, A., Rowley, J., & Sambrook, S. (2009). Towards a multidisciplinary definition of innovation.
Management Decision, 47(8), 1323–1339. doi:10.1108/00251740910984578
Bart, Y., Shankar, V., Sultan, F., & Urban, G. L. (2005). Are the drivers and role of online trust the same
for all web sites and consumers? A large-scale exploratory empirical study. Journal of Marketing, 69(4),
133–152. doi:10.1509/jmkg.2005.69.4.133
Basel Committee on Banking Supervision (BCBS). (2015). Making supervisory stress tests more mac-
roprudential: Considering liquidity and solvency interactions and systemic risk. Working Paper, no 29.
Author.
Bazzan, A. L., Engel, P. M., Schroeder, L. F., & da Silva, S. C. (2002). Automated annotation of keywords
for proteins related to mycoplasmataceae using machine learning techniques. Bioinformatics (Oxford,
England), 18(Suppl 2), 35S–43S. doi:10.1093/bioinformatics/18.suppl_2.S35 PMID:12385981
BCBS. (2018). Sound Practices - Implications of fintech developments for banks and bank supervisors.
Author.
Bean, C. (2016). Independent review of UK economic statistics. Academic Press.
Bean, R. (2017). How Companies Say They’re Using Big Data. Harvard Business Review. Retrieved
July 20, 2018, from https://hbr.org/2017/04/how-companies-say-theyre-using-big-data
Becker, T. (2016). Big Data Usage. In J. Cavanillas, E. Curry, & W. Wahlster (Eds.), New Horizons for
a Data-Driven Economy. Cham: Springer; doi:10.1007/978-3-319-21569-3_8
Bekaert, M., Bidou, L., Denise, A., Duchateau–Nguyen, G., Forest, J. P., Froidevaux, C., ... Termier, M.
(2003). Towards a computational model for –1 eukaryotic frame shifting sites. Bioinformatics (Oxford,
England), 19(3), 327–335. doi:10.1093/bioinformatics/btf868 PMID:12584117
Belfo, F., Trigo, A., & Estébanez, R. P. (2015). Impact of ICT Innovative Momentum on Real-Time
Accounting. Business Systems Research Journal, 6(2), 1–17. doi:10.1515/bsrj-2015-0007
Benders, J., Hoeken, P., Batenburg, R., & Schouteten, R. (2006). First organise, then automate: A modern
socio‐technical view on ERP systems and teamworking. New Technology, Work and Employment, 21(3),
242–251. doi:10.1111/j.1468-005X.2006.00178.x
Benson, D., Lipman, D. J., & Ostell, J. (1993). GenBank. Nucleic Acids Research, 21(13), 2963–2965.
doi:10.1093/nar/21.13.2963 PMID:8332518
Berners-Lee, T. (2014, August 23). Tim Berners-Lee on the Web at 25: the past, present and future.
Wired. Retrieved from: http://www.wired.co.uk/article/tim-berners-lee


Berners-Lee, T. (2018). The web is under threat. Join us and fight for it. World Wide Web Foundation.
Available from: https://webfoundation.org/2018/03/web-birthday-29/
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 29–37.
PMID:11323639
Bese Goksu, E., & Tissot, B. (2018). Monitoring Systemic Institutions for the Analysis of Micro-macro
Linkages and Network Effects. Journal of Mathematics and Statistical Science, 4(4).
Bhaduri, A., Qu, K., Lee, C. S., Ungewickell, A., & Khavari, P. A. (2012). Rapid identification of non–
human sequences in high–throughput sequencing datasets. Bioinformatics (Oxford, England), 28(8),
1174–1175. doi:10.1093/bioinformatics/bts100 PMID:22377895
Bholat, D. (2015, March). Big data and central banks. Bank of England, Quarterly Bulletin. Retrieved
from https://ssrn.com/abstract=2577759
Bietz, M. J., Bloss, C. S., Calvert, S., Godino, J. G., Gregory, J., Claffey, M. P., Sheehan, J. and Patrick,
K. (2016). Opportunities and challenges in the use of personal health data for health research. Journal
of the American Medical Informatics Association, 23(1), e42-e48. Doi:10.1093/jamia/ocv118
Blackwell, J. (1985). Information for policy. National and Economic Social Council, Report no. 78.
Dublin: NESC. Retrieved from: http://files.nesc.ie/nesc_reports/en/NESC_78_1985.pdf
Blank, S. (2007). The four steps to the epiphany: Successful strategies for products that win. Quad/
Graphics.
Boell, S. K., & Cecez-Kecmanovic, D. (2015). A Hermeneutic Approach for Conducting Literature
Reviews and Literature Searches. Communications of the Association for Information Systems, 34, 12.
Retrieved from http://aisel.aisnet.org/cais/vol34/iss1/12
Bonoma, T. V. (1985). Case research in marketing: Opportunities, problems, and a process. JMR, Journal
of Marketing Research, 22(2), 199–208. doi:10.2307/3151365
Borgman, C. L. (2015). Big Data, Little Data, No Data - Scholarship in the Networked World. Cam-
bridge, MA: MIT Press.
Borio, C. (2013). The Great Financial Crisis: setting priorities for new statistics. Journal of Banking
Regulation.
Bouckaert, R. R., Frank, E., Hall, M., Kirkby, R., Reutemann P, Seewald A, & Scuse D. (2013). WEKA
Manual for Version 3–7–8. Academic Press.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Communicatio Socialis, 15(5), 662–679.
Boyd, J., & Crawford, K. (2012). Critical Questions for Big Data - Provocations for a cultural, techno-
logical, and scholarly phenomenon. Information Communication and Society, 15(5), 662–679. doi:10.
1080/1369118X.2012.678878
Brackstone, G. J. (1987). Statistical Issues of Administrative Data: Issues and Challenges. Survey Meth-
odology, 13(1), 29–43.


Bradlow, E. T., Gangwar, M., Kopalle, P., & Voleti, S. (2017). The Role of Big Data and Predictive
Analytics in Retailing. Journal of Retailing, 93(1), 79–95. doi:10.1016/j.jretai.2016.12.004
Breslin, J., & Decker, S. (2007). The future of social networks on the internet: The need for semantics.
IEEE Internet Computing, 11(6), 86–90. doi:10.1109/MIC.2007.138
Broeders, D., Schrijvers, E., van der Sloot, B., van Brakel, R., de Hoog, J., & Hirsch Ballin, E. (2017).
Big Data and security policies: Towards a framework for regulating the phases of analytics and use of
Big Data. Computer Law & Security Review, 33(3), 309–323. doi:10.1016/j.clsr.2017.03.002
Brown, A. L., & Palincsar, A. S. (1989). Guided, cooperative learning and individual knowledge acquisition.
In Knowing, learning, and instruction: Essays in honor of Robert Glaser (pp. 393-451). Academic Press.
Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’. The McKinsey
Quarterly, 4(1), 24–35.
Browning, D. M., Meyer, E. C., Truog, R. D., & Solomon, M. Z. (2007). Difficult conversations in
health care: Cultivating relational learning to address the hidden curriculum. Academic Medicine, 82(9),
905–913. doi:10.1097/ACM.0b013e31812f77b9 PMID:17726405
Brown, M., Kulik, C. T., Cregan, C., & Metz, I. (2017). Understanding the Change–Cynicism Cycle:
The Role of HR. Human Resource Management, 56(1), 5–24. doi:10.1002/hrm.21708
Brown, R. A., & Sankaranarayanan, S. (2011). Intelligent store agent for mobile shopping. International
Journal of E-Services and Mobile Applications, 3(1), 57–72. doi:10.4018/jesma.2011010104
Brügmann, S., Bouayad-Agha, N., Burga, A., Carrascosa, S., Ciaramella, A., Ciaramella, M., ... Wan-
ner, L. (2015). Towards content-oriented patent document processing: Intelligent patent analysis and
summarization. World Patent Information, 40, 30–42. doi:10.1016/j.wpi.2014.10.003
Brusilovsky, P. (2001). Adaptive Hypermedia. User Modeling and User-Adapted Interaction, 11(1/2),
87–110. doi:10.1023/A:1011143116306
Buytendijk, F. (2014). Hype Cycle for Big Data, 2014. Gartner. Retrieved from: https:// www.gartner.
com/doc/2814517/hype-cycle-big-data-
Caldwell, R. (2003). Models of change agency: A fourfold classification. British Journal of Manage-
ment, 14(2), 131–142. doi:10.1111/1467-8551.00270
Camuffo, A. (2016). Le nuove sfide dell’HR: Big data, rilevanza e sostenibilità. Economia & Manage-
ment, 5, 117–125.
Cano, J. (2014). The V’s of Big Data: Velocity, Volume, Value, Variety and Veracity. Retrieved from
https://www.xsnet.com/blog/bid/205405/the-v-s-of-big-data-velocity-volume-value-variety-and-veracity
Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., ... Knight,
R. (2010). QIIME allows analysis of high-throughputcommunity sequencing data. Nature Methods, 7(5),
335–336. doi:10.1038/nmeth.f.303 PMID:20383131
Cappelli, P. (2017). There’s no such thing as big data in HR. Harvard Business Review, 2.
Carneiro, A. (2000). How does knowledge management influence innovation and competitiveness?
Journal of Knowledge Management, 4(2), 87–98. doi:10.1108/13673270010372242

Carnot, N., Koen, V., & Tissot, B. (2011). Economic Forecasting and Policy (2nd ed.). Palgrave Macmil-
lan. doi:10.1057/9780230306448
Carrico, J. A., Sabat, A. J., Friedrich, A. W., Ramirez, M., & on behalf of the ESCMID Study Group, C.
(2013). Bioinformatics in bacterial molecular epidemiology and public health: Databases, tools and the
next-generation sequencing revolution. Eurosurveillance, 18(4), 32–40. doi:10.2807/ese.18.04.20382-
en PMID:23369390
Carter, B., Danford, A., Howcroft, D., Richardson, H., Smith, A., & Taylor, P. (2011). ‘All they lack is a
chain’: Lean and the new performance management in the British civil service. New Technology, Work
and Employment, 26(2), 83–97. doi:10.1111/j.1468-005X.2011.00261.x
Caruana, J. (2017). International financial crises: new understandings, new data. Speech at the National
Bank of Belgium, Brussels, Belgium.
Castelvecchi, D. (2018). Particle physicists turn to AI to cope with CERN’s collision deluge. Nature.
Retrieved from https://www.nature.com/articles/d41586-018-05084-2
Catmull, E. (2008). How Pixar Fosters Collective Creativity. Harvard Business Review, 1–13.
Caudill, E. M., & Murphy, P. E. (2000). Consumer online privacy: Legal and ethical issues. Journal of
Public Policy & Marketing, 19(1), 7–19. doi:10.1509/jppm.19.1.7.16951
Cavallo, A., & Rigobon, R. (2016, Spring). The Billion Prices Project: Using Online Prices for Measure-
ment and Research. The Journal of Economic Perspectives, 30(2), 151–178. doi:10.1257/jep.30.2.151
Cavanillas, J. M., Curry, E., & Wahlster, W. (2015). New Horizons for a Data-Driven Economy: A Roadmap
for Usage and Exploitation of Big Data in Europe. Academic Press. doi:10.1007/978-3-319-21569-3_3
Cervera, J. L., Votta, P., Fazio, D., Scannapieco, M., Brennenraedts, R., & van der Vorst, T. (2014).
Big Data in Official Statistics. Eurostat ESS Big Data Event. Retrieved from: https://ec.europa.eu/eu-
rostat/cros/system/files/Big%20Data%20Event%202014%20-%20Technical%20Final%20Report%20
-finalV01_0.pdf
Charan, R. (2014). It’s time to split HR. Harvard Business Review, 92(7), 33–34.
Chellappa, R. K., & Sin, R. G. (2005). Personalization versus privacy: An empirical examination of the
online consumer’s dilemma. Information Technology Management, 6(2–3), 181–202. doi:10.100710799-
005-5879-y
Chen, D., & Zhao, H. (2012, March). Data security and privacy protection issues in cloud computing.
In Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference (Vol. 1,
pp. 647–651). IEEE. 10.1109/ICCSEE.2012.193
Chen, C. J., Huang, J. W., & Hsiao, Y. C. (2010). Knowledge management and innovativeness: The
role of organizational climate and structure. International Journal of Manpower, 31(8), 848–870.
doi:10.1108/01437721011088548
Chen, C. L. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technolo-
gies: A survey on Big Data. Information Sciences, 275, 314–347. doi:10.1016/j.ins.2014.01.015


Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data
to Big Impact. Management Information Systems Quarterly, 36(4), 1165–1188.
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to
big impact. Management Information Systems Quarterly, 1165–1188.
Chen, M., Ebert, D., Hagen, H., Laramee, R. S., Van Liere, R., Ma, K. L., ... Silver, D. (2009). Data,
information, and knowledge in visualization. IEEE Computer Graphics and Applications, 29(1), 12–19.
doi:10.1109/MCG.2009.6 PMID:19363954
Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(v2),
171–209. doi:10.100711036-013-0489-0
Chen, Q., Zhang, M., & Zhao, X. (2017). Analysing customer behaviour in mobile app usage. Industrial
Management & Data Systems, 117(2), 425–438. doi:10.1108/IMDS-04-2016-0141
Chen, S. C. I. (2018). Technological Health Intervention in Population Aging to Assist People to Work
Smarter not Harder: Qualitative Study. Journal of Medical Internet Research, 20(1), e3. doi:10.2196/
jmir.8977 PMID:29301736
Chen, Y., Yao, H., Thompson, E. J., Tannir, N. M., Weinstein, J. N., & Su, X. (2013). VirusSeq: Software
to identify viruses and their integration sites using next–generation sequencing of human cancer tissue.
Bioinformatics (Oxford, England), 29(2), 266–267. doi:10.1093/bioinformatics/bts665 PMID:23162058
Chesbrough, H. (2007a). Why companies should have open business models. MIT Sloan Management
Review, 48(2), 22–28.
Chesbrough, H. (2007b). Business model innovation: It’s not just about technology anymore. Strategy
and Leadership, 35(6), 12–17. doi:10.1108/10878570710833714
Chesbrough, H. (2010). Business model innovation: Opportunities and barriers. Long Range Planning,
43(2), 354–363. doi:10.1016/j.lrp.2009.07.010
Chesbrough, H. (2013). Open business models: How to thrive in the new innovation landscape. Boston:
Harvard Business Press.
Chesbrough, H., & Vanhaverbeke, W. (2006). Open innovation: Research a new paradigm. New York:
Oxford University Press.
Chevreux, B. (2015). MIRA Assembler. C1997–2014. Retrieved from: www.chevreux.org/projects_mira.
html
Choi, H., & Varian, H. (2011). Predicting the present with Google Trends. Retrieved from: http://people.
ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf
Cœuré, B. (2017). Policy analysis with big data. Speech at the conference on “Economic and Financial
Regulation in the Era of Big Data”, Banque de France, Paris, France.
Cole, J. R., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R. J., ... Tiedje, J. M. (2009). The Ribo-
somal Database Project: Improved alignments and new tools for rRNA analysis. Nucleic Acids Research,
37(Database), D141–D145. doi:10.1093/nar/gkn879 PMID:19004872


Collins, L., & Bennet, C. (2015). HR and people analytics. Deloitte Insights. Retrieved November 10,
2017, from: https://www2.deloitte.com/insights/us/en/focus/human-capital-trends/2015/people-and-hr-
analytics- human-capital-trends-2015.html
Committee of the Chief Statisticians of the United Nations System. (2018). UN Statistical Quality Assur-
ance Framework. Retrieved from: https://unstats.un.org/unsd/unsystem/documents/UNSQAF-2018.pdf
Committee on Payments and Market Infrastructures (CPMI). (2017). Distributed ledger technology in
payment, clearing and settlement – An analytical framework. CPMI.
Conner, J., & Ulrich, D. (1996). Human resource roles: Creating value, not rhetoric. People and Strategy,
19(3), 38–49.
Coordination Committee for Statistical Activities. (2014). Principles Governing International Statisti-
cal Activities. Retrieved from: https://unstats.un.org/unsd/accsub-public/principles_stat_activities.htm
Cosentino, S., Larsen, V. M., Aarestrup, M. F., & Lund, O. (2013). Pathogen Finder – Distinguishing
Friend from Foe Using Bacterial Whole Genome Sequence Data. PLoS One, 8(10), e77302. doi:10.1371/
journal.pone.0077302 PMID:24204795
CPMI & Board of the International Organization of Securities Commissions (IOSCO). (2016, June).
Guidance on cyber resilience for financial market infrastructures. Authors.
CPMI & Markets Committee. (2018). Central bank digital currencies. Authors.
Crane, L., & Self, R. (2014). Big Data Analytics: A Threat or an Opportunity for Knowledge Manage-
ment? Lecture Notes in Business Information Processing, 185, 25–34. doi:10.1007/978-3-319-08618-7_3
Creswell, J. W. (2014). Research design: Qualitative & quantitative approaches (4th ed.). London: Sage
Publications, Inc.
Crisanto, J.C., & Prenio J. (2017). Regulatory approaches to enhance banks’ cyber-security frameworks.
FSI Insights on policy implementation No 2, Financial Stability Institute.
Cuzzocrea, A. (2014). Privacy and security of big data: current challenges and future research perspec-
tives. Proceedings of the First International Workshop on Privacy and Security of Big Data, PSBD ’14.
Cytoscape. (n.d.). Cytoscape. Retrieved from: http://www.cytoscape.org/
Daas, P. J. H., Puts, M. J., Buelens, B., & van den Hurk, P. A. M. (2015). ‘Big Data as a Source for Of-
ficial Statistics’. Journal of Official Statistics, 31(2), 249–262. doi:10.1515/jos-2015-0016
Dalkir, K. (2013). Knowledge management in theory and practice. Routledge.
DAMA International. (2017). DAMA-DMBOK: Data management body of knowledge (2nd ed.). Tech-
nics Publications.
Danaher, J. (2016). The threat of algocracy: Reality, resistance and accommodation. Philosophy &
Technology, 29(3), 245–268. doi:10.100713347-015-0211-1
Daniel, B., Claus-Peter, F., & Alfredo, G. (2010). Structural Health Monitoring. In Introduction to
Structural Health Monitoring. ISTE Ltd.


Davies, W. (2017). How statistics lost their power – and why we should fear what comes next. Retrieved from:
https://www.theguardian.com/politics/2017/jan/19/crisis-of-statistics-big-data-democracy?CMP=share_
btn_link
De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review
of key research topics. In Proceedings of the 4th International Conference on Integrated Information
(Vol. 1644, No. 1, pp. 97-104). Madrid, Spain: AIP Publishing 10.1063/1.4907823
De Virgilio, R., Giunchiglia, F., & Tanca, L. (2010). Semantic web information management: a model-
based perspective. Springer Science & Business Media. doi:10.1007/978-3-642-04329-1
Deleuze, G. (1992). Postscript on the Societies of Control. October, 59, 3-7.
Deneke, C., Rentzsch, R., & Renard, B. Y. (2017). PaPrBaG: A machine learning approach for the
detection of novel pathogens from NGS data. Scientific Reports, 7(1), 39194. doi:10.1038rep39194
PMID:28051068
Dery, K., Hall, R., & Wailes, N. (2006). ERPs as ‘technologies-in-practice’: Social construction, mate-
riality and the role of organizational factors. New Technology, Work and Employment, 21(3), 229–241.
doi:10.1111/j.1468-005X.2006.00177.x
Devlin, B. (2017). In the Middle of Data Integration Is AI. Transforming Data With Intelligence. Retrieved
from https://tdwi.org/articles/2017/10/27/adv-all-middle-of-data-integration-is-ai.aspx
Diebold, F. (2018). A Personal Perspective on the Origin (s) and Development of ‘Big Data’: The Phe-
nomenon, the Term, and the Discipline, Second Version. Retrieved May 25, 2018, from: https://www.
sas.upenn.edu/~fdiebold/papers/paper112/Diebold_Big_Data.pdf
Dirin, A., Laine, T., & Alamäki, A. (2018). Managing emotional requirements in a context-aware
mobile application for tourists. International Journal of Interactive Mobile Technologies, 12(2), 177.
doi:10.3991/ijim.v12i2.7933
Divakaran, V. N., Subrahmanya, R. M., & Ravi Kumar, G.V.V. (2017). Integrated Vehicle Health Man-
agement of a Transport Aircraft Landing Gear System. Infosys Limited. Retrieved from https://www.
infosys.com/engineering-services/white.../aircraft-landing-gear-system.pd
Diwani, D. A., & Sam, A. (2014). Diabetes Forecasting Using Supervised Learning Techniques. ACSIJ
Advances in Computer Science: an International Journal, 3, 10–18.
Dmitriyev, V., Mahmoud, T., & Marín-Ortega, P. M. (2015). SOA enabled ELTA: Approach in design-
ing business intelligence solutions in Era of Big Data. International Journal of Information Systems and
Project Management, 3(3), 49–63.
Dodge, M., & Kitchin, R. (2007). The automatic management of drivers and driving spaces. Geoforum,
38(2), 264–275. doi:10.1016/j.geoforum.2006.08.004
Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will
Remake Our World. Amazon.


Domingue, J., Lasierra, N., Fensel, A., van Kasteren, T., Strohbach, M., & Thalhammer, A. (2016).
Big Data Analysis. In J. Cavanillas, E. Curry, & W. Wahlster (Eds.), New Horizons for a Data-Driven
Economy. Cham: Springer; doi:10.1007/978-3-319-21569-3_5
Donkin, C. (2017). M-Pesa continues to dominate Kenyan market. Mobile World Live. Retrieved from:
https://www.mobileworldlive.com/money/analysis-money/m-pesa-continues-to-dominate-kenyan-market/
Drucker, P. F. (1969, November). Management’s new role. Harvard Business Review, 49–54.
Drucker, P. F. (2001). Knowledge work and knowledge society: the social transformations of this century.
British Library.
Du Mars, R. (2012). Mission impossible? Data governance process takes on “big data.” Retrieved from
http://searchdatamanagement.techtarget.com/feature/Mission-impossible-Data-governance-process-
takes-on-big-data
Du Plessis, M. (2007). The role of knowledge management in innovation. Journal of Knowledge Man-
agement, 11(4), 20–29. doi:10.1108/13673270710762684
Easton, I.M., & Hsiao, L.R. (2013). The Chinesepeople’s liberation army’s unmanned aerial vehicle
project: Organizational capacities and operational capabilities. Tech. rep., 2049 Project Institute.
Edwards, L., & Urquhart, L. (2016). Privacy in public spaces: what expectations of privacy do we have
in social media intelligence? International Journal of Law & Information Technology, 24(3), 279-310.
Doi:10.1093/ijlit/eaw007
Eggers, D. (2013). The Circle. Penguin Books.
Eisenhardt, K. M. (1991). Better stories and better constructs: The case for rigor and comparative logic.
Academy of Management Review, 16(3), 620–627. doi:10.5465/amr.1991.4279496
El Bassiti, L., El Haiba, M., & Ajhoun, R. (2017). Generic Innovation Designing -GenID- Framework:
Towards a more Systematic Approach to Innovation Management. Presented at the 18th European Con-
ference on Knowledge Management (ECKM), Barcelona, Spain.
El Bassiti, L. (2017). Generic Ontology for Innovation Domain towards “Innovation Interoperability”.
Journal of Entrepreneurship Management and Innovation, 13(2), 105–126. doi:10.7341/20171325
Emani, C. K., Cullot, N., & Nicolle, C. (2015). Understandable Big Data: A survey. Computer Science
Review, 17, 70–81. doi:10.1016/j.cosrev.2015.05.002
Erevelles, S., Fukawa, N., & Swayne, L. (2016). Big data consumer analytics and the transformation of
marketing. Journal of Business Research, 69(2), 897–904. doi:10.1016/j.jbusres.2015.07.001
European Commission, International Monetary Fund, Organisation for Economic Co-operation and
Development, United Nations, & World Bank. (2009). System of National Accounts 2008. Authors.
European Commission. (2011). Attitudes on Data Protection and Electronic Identity in the European
Union. Special Eurobarometer No. 359, Wave 74.3 - TNS Opinion and Social. Published June 2011.
Retrieved from: http://ec.europa.eu/commfrontoffice/publicopinion/archives/ebs/ebs_359_en.pdf


European Commission. (2013). Scheveningen Memorandum on “Big Data and Official Statistics”.
Adopted by the European Statistical System Committee on 27 September 2013. Retrieved from: https://
ec.europa.eu/eurostat/cros/content/scheveningen-memorandum_en
European Commission. (2014). Big Data. Digital Single Market Policies. Retrieved from: https://ec.europa.
eu/digital-single-market/en/policies/big-data
European Parliament. (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council
of 27 April 2016 on the protection of natural persons with regard to the processing of personal data
and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection
Regulation). Retrieved from: http://ec.europa.eu/justice/data-protection/reform/files/regulation_oj_en.pdf
Eurostat. (2014). Feasibility Study on the Use of Mobile Positioning Data for Tourism Statistics - Con-
solidated Report. Eurostat Contract No 30501.2012.001- 2012.452, 30 June 2014. Retrieved from: http://
ec.europa.eu/eurostat/documents/747990/6225717/MP-Consolidated-report.pdf
Evans, L., & Kitchin, R. (2018). A smart place to work? Big data systems, labour, control and modern
retail stores. New Technology, Work and Employment, 33(1), 44–57. doi:10.1111/ntwe.12107
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Hierarchical clustering. In Cluster Analysis
(5th ed.). John Wiley and Sons, Ltd. doi:10.1002/9780470977811.ch4
Fagerberg, J., & Srholec, M. (2009). Knowledge, capabilities, and the poverty trap: The complex inter-
play between technological, social, and geographical factors. International Centre Economic Research
Working Paper, 24, 1-23.
Fagerberg, J. (2004). Innovation: A guide to the literature. Oslo: Georgia Institute of Technology.
Fagerberg, J. (2006). Innovation, technology and the global knowledge economy: Challenges for future
growth. Proceedings of the Green Roads to Growth Project and Conference.
Fagerberg, J., Fosaas, M., & Sapprasert, K. (2012). Innovation: Exploring the knowledge base. Research
Policy, 41(7), 1132–1153. doi:10.1016/j.respol.2012.03.008
Falch, M., Henten, A., Tadayoni, R., & Windekilde, I. (2009). Business Models in Social Networking.
Aalborg Universitet. Retrieved from: http://vbn.aau.dk/files/19150157/falch_3.pdf
Falkheimer, J. (2018). On Giddens: Interpreting Public Relations through Anthony Giddens’s Structuration
and Late Modernity Theories. In Ø. Ihlen, B. Van Ruler, & M. Fredriksson (Eds.), Public Relations and
Social Theory: Key Figures, Concepts and Developments (2nd ed.; pp. 177–192). London: Routledge.
Fasold, M., Langenberger, D., Binder, H., Stadler, P. F., & Hoffmann, S. (2011). DARIO: A ncRNA
detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Research, 39(suppl
2), W112–W117. doi:10.1093/nar/gkr357 PMID:21622957
Fayyad, U., & Stolorz, P. (1997). Data mining and KDD: Promise and challenges. Future Generation
Computer Systems, 13(2), 99–115. doi:10.1016/S0167-739X(97)00015-0
Federal Communications Commission. (2017). Restoring Internet Freedom. Retrieved from: https://
www.fcc.gov/restoring-internet-freedom


Financial Stability Board (FSB). (2017). Financial Stability Implications from FinTech. Author.
Firican, G. (2017). The 10V’s of Big data. Retrieved from https://tdwi.org/articles/2017/02/08/10-vs-
of-big-data.aspx
Fiske, S. T., & Hauser, R. M. (2014). Protecting human research participants in the age of big data.
Academic Press.
Flight Data Monitoring on ATR Aircraft. (2016). ATR Training Center. Retrieved from ATR Product
Support & Services Portal: https://www.atractive.com
Flyvbjerg, B., & Budzier, A. (2011). Why your IT project may be riskier than you think. Harvard Busi-
ness Review, 89(9), 23–25.
Fong, S., Wong, R., & Vasilakos, A. (2016). Accelerated PSO swarm search feature selection for data
stream mining big data. IEEE Transactions on Services Computing, (1), 1–1.
Frank, E., Hall, M., Trigg, L., Holmes, G., & Witten, I. H. (2004). Data mining in bioinformatics us-
ing Weka. Bioinformatics (Oxford, England), 20(15), 2479–2481. doi:10.1093/bioinformatics/bth261
PMID:15073010
Freitas, A., & Curry, E. (2016). Big Data Curation. In J. Cavanillas, E. Curry, & W. Wahlster (Eds.),
New Horizons for a Data-Driven Economy. Cham: Springer; doi:10.1007/978-3-319-21569-3_6
Frénay, H. M., Bunschoten, A. E., Schouls, L. M., van Leeuwen, W. J., Vandenbroucke–Grauls, C. M.,
Verhoef, J., & Mooi, F. R. (1996). Molecular typing of methicillin–resistant Staphylococcus aureus on
the basis of protein A gene polymorphism. European Journal of Clinical Microbiology & Infectious
Diseases, 15(1), 60–64. doi:10.1007/BF01586186 PMID:8641305
Frické, M. (2015). Big data and its epistemology. Journal of the Association for Information Science
and Technology, 66(4), 651-661. Doi:10.1002/asi.23212
Fry, S. (2017). The Way Ahead. Lecture delivered on the 28th May 2017, Hay Festival, Hay-on-Wye.
Retrieved from: http://www.stephenfry.com/2017/05/the-way-ahead/
Fukuyama, F. (2017). The Emergence of a Post Fact World. Project Syndicate. Retrieved from: https://
www.project-syndicate.org/onpoint/the-emergence-of-a-post-fact-world-by-francis-fukuyama-2017-01
Galbraith, J. R. (2014). Organizational design challenges resulting from big data. Journal of Organiza-
tion Design, 3(1), 2–13. doi:10.7146/jod.8856
Gandomi, A., & Haider, M. (2014). Beyond the hype: Big data concepts, methods, and analytics. Inter-
national Journal of Information Management, 35, 137 – 144.
Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. In-
ternational Journal of Information Management, 35(2), 137–144. doi:10.1016/j.ijinfomgt.2014.10.007
Garret, M. A. (2014). Big Data analytics and cognitive computing – future opportunities for astronomical
research. IOP Conference Series. Materials Science and Engineering, 67, 012017. doi:10.1088/1757-
899X/67/1/012017


Gartner. (n.d.). IT Glossary. Retrieved November 5, 2017, from: https://www.gartner.com/it-glossary/big-data
GDPR. (2016). The European Parliament and the Council. Retrieved from https://eur-lex.europa.eu/
legal-content/EN/TXT/PDF/?uri=CELEX:32016R0679
Ge, P., Ritchey, N. A., Casey, K. S., Kearns, E. J., Privette, J. L., Saunders, D., … Ansari, S. (2016).
Scientific Stewardship in the Open Data and Big Data Era - Roles and Responsibilities of Stewards and
Other Major Product Stakeholders. D-Lib Magazine, 22(5-6). Doi:10.1045/may2016-peng
Gelsinger, P. (2012). Big Data quotes of the week. Retrieved from https://whatsthebigdata.com/2012/06/29/
big-data-quotes-of-the-week-11/
George, G., Haas, M. R., & Pentland, A. (2014). Big data and management. Academy of Management
Journal, 57(2), 321–326. doi:10.5465/amj.2014.4002
Gepp, A., Linnenluecke, M.K., O’Neill, T.J., & Smith, T. (2018). Big data techniques in auditing research
and practice: Current trends and future opportunities. Journal of Accounting Literature, 40, 102-115.
Gharajeh, M. S. (2017). Biological Big Data Analytics. Advances in Computers. doi:10.1016/
bs.adcom.2017.08.002
Gibson, W. (2001).’Broadband Blues - Why has broadband Internet access taken off in some countries
but not in others? The Economist. Retrieved from: https://www.economist.com/node/666610
Gibson, C. B., & Birkinshaw, J. (2004). The antecedents, consequences, and mediating role of organi-
zational ambidexterity. Academy of Management Journal, 47(2), 209–226.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detect-
ing influenza epidemics using search engine query data. Nature, 7232(7232), 1012–1014. doi:10.1038/
nature07634 PMID:19020500
Glass, E. (2016). Survey analysis – Big data in central banks. Central Banking Focus Report, 2016.
Retrieved from www.centralbanking.com/central-banking/content-hub/2474744/big-data-in-central-
banks-focus-report
Global Partnership for Sustainable Development Data. (2016). The Data Ecosystem and the Global
Partnership. Retrieved from: http://gpsdd.squarespace.com/who-we-are/
Goa, J., Xie, C., & Tao, C. (2016). Big Data Validation and Quality Assurance - Issues, Challenges,
and Needs. 2016 IEEE Symposium on Service-Oriented System Engineering. Retrieved from: http://
ieeexplore.ieee.org/xpls/icp.jsp?arnumber=7473058
Godinez, M., Hechler, E., Koenig, K., Lockwood, S., Oberhofer, M., & Schroeck, M. (2010). The art
of enterprise information architecture: A systems-based approach for unlocking business insight. IBM
Press, Pearson Higher Ed.
Goes, P. B. (2014). Big Data and IS Research. MIS Quarterly, 38(3), iii-viii.


Gokmen, T., & Vlasov, Y. (2016). Acceleration of Deep Neural Network Training with Resistive Cross-
Point Devices: Design Considerations. Frontiers in Neuroscience, 10, 333. doi:10.3389/fnins.2016.00333
PMID:27493624
Goodbody, W. (2018). Waterford researchers develop new method to store data in DNA. RTE News.
Retrieved from: https://www.rte.ie/news/ireland/2018/0219/941956-dna-data/
Goodman, M. (2015). Future Crimes - Inside the Digital Underground and the Battle for Our Connected
World. New York: Anchor Books.
Gormley, W. T. (2011). From science to policy in early childhood education. Science, 333(6045), 978–981.
doi:10.1126cience.1206150 PMID:21852491
Greenwald, H. S., & Oertel, C. K. (2017). Future Directions in Machine Learning. Frontiers in Robotics
and AI. Computational Intelligence. doi:10.3389/frobt.2016.00079
Grensing-Pophal, L. (2015). The State of Content Marketing. EContent, 38(1), 16-17.
Groves, R. (2011). Designed data and organic data. Director’s Blog of the US Census Bureau. Retrieved
from www.census.gov/newsroom/blogs/director/2011/05/designed-data-and-organic-data.html
Gruber, T. (2007). Ontology of folksonomy: A mash-up of apples and oranges. International Journal
on Semantic Web and Information Systems, 3(2), 1–11. doi:10.4018/jswis.2007010101 PMID:18974854
Gruber, T. (2008). Collective Knowledge Systems: Where the Social Web meets the Semantic Web.
Journal of Web Semantics, 6(1), 4–13. doi:10.1016/j.websem.2007.11.011
Guerra, L., McGarry, M., Robles, V., Bielza, C., Larrañaga, P., & Yuste, R. (2011). Comparison be-
tween supervised and unsupervised classifications of neuronal cell types: A case study. Developmental
Neurobiology, 71(1), 71–82. doi:10.1002/dneu.20809 PMID:21154911
Guerreiro, V., Walzer, M., & Lamboray, C. (2018). The use of Supermarket Scanner data in the Luxem-
bourg Consumer Price Index. Economie et Statistiques - Working papers du STATEC, No. 97. Retrieved
from: http://www.statistiques.public.lu/catalogue-publications/economie-statistiques/2018/97-2018.pdf
Günther, A. W., Rezazade, M. H., Huysman, M., & Feldberg, F. (2017). Debating big data: A literature
review on realizing value from big data. The Journal of Strategic Information Systems, 26(3), 191–209.
doi:10.1016/j.jsis.2017.07.003
Gupta, A., Sun, C., Shrivastava, A., & Singh, S. (2017). Revisiting the Unreasonable Effectiveness of
Data. Google Artificial Intelligence Blog. Retrieved from https://ai.googleblog.com/2017/07/revisiting-
unreasonable-effectiveness.html
Gupta, M., & George, J. F. (2016). Toward the development of a big data analytics capability. Informa-
tion & Management, 53(8), 1049–1064. doi:10.1016/j.im.2016.07.004
Haldane, A. G. (2018). Will Big Data Keep Its Promise? Speech at the Bank of England Data Analytics
for Finance and Macro Research Centre, King’s Business School.

Hammer, C. L., Kostroch, D. C., & Quiros, G. (2017). Big Data: Potential, Challenges, and Statistical
Implications. IMF Staff Discussion Note, SDN/17/06, September 2017. Retrieved from: http://www.
imf.org/en/Publications/SPROLLs/Staff-Discussion-Notes
Hand, D. J. (2015). Official Statistics in the New Data Ecosystem. Presented at the New Techniques and
Technologies in Statistics conference, Brussels, Belgium. Retrieved from: https://ec.europa.eu/eurostat/
cros/system/files/Presentation%20S20AP2%20%20Hand%20-%20Slides%20NTTS%202015.pdf
Han, Q., Heimerl, F., Codina-Filba, J., Lohmann, S., Wanner, L., & Ertl, T. (2017). Visual patent trend
analysis for informed decision making in technology management. World Patent Information, 49, 34–42.
doi:10.1016/j.wpi.2017.04.003
Harkness, T. (2017). Big Data: Does size matter? London, UK: Bloomsbury Sigma.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of
“big data” on cloud computing: Review and open research issues. Information Systems, 47, 98–115.
doi:10.1016/j.is.2014.07.006
Hayek, F. A. (1944). The Road to Serfdom. Chicago, IL: The University of Chicago Press.
Hazy, J. K., & Ashley, A. (2011). Unfolding the future: Bifurcation in organizing form and emergence
in social systems. Emergence, 13(3), 58–80.
He, Q. P., & Wang, J. (2017). Statistical process monitoring as a big data analytics tool for smart manu-
facturing. Journal of Process Control. doi:10.1016/j.jprocont.2017.06.012
Hern, A. (2018). Facebook, Google and Twitter to Testify in Congress over Extremist Content. Retrieved
from https://www.theguardian.com/technology/2018/jan/10/facebook-google-twitter-testify-congress-
extremist-content-russian-election-interference-information
Hesse, B. W., Moser, R. P., & Riley, W. T. (2015). From big data to knowledge in the social sci-
ences. The Annals of the American Academy of Political and Social Science, 659(1), 16–32.
doi:10.1177/0002716215570007 PMID:26294799
Hey, J. (2004). The data, information, knowledge, wisdom chain: The metaphorical link. Intergovern-
mental Oceanographic Commission, 26, 1–18.
Hey, T., Tansley, S., & Tolle, K. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery.
Microsoft Corporation; doi:10.1145/2609876.2609883
Hilbert, M., & Lopez, P. (2012). How to Measure the World’s Technological Capacity to Store, Com-
municate and Compute Information. International Journal of Communication, 6, 956–979.
Hill, S. (2018, May 6). The Big Data Revolution in Economic Statistics: Waiting for Godot... and Gov-
ernment Funding. Goldman Sachs US Economics Analyst.
Hina, S., Shaikh, A., & Sattar, S. A. (2017). Analyzing Diabetes Datasets using Data Mining. Journal
of Basic and Applied Sciences, 13, 466–471. doi:10.6000/1927-5129.2017.13.77

Höchtl, J., Parycek, P., & Schöllhammer, R. (2016). Big data in the policy cycle: Policy decision making
in the digital era. Journal of Organizational Computing and Electronic Commerce, 26(1-2), 147–169.
doi:10.1080/10919392.2015.1125187
HRReview. (2013). 78% of HR managers do not feel they are very effective at workforce analytics. Re-
trieved September 17, 2017, from: http://bit.ly/HRRAnalytics
Hu, J., & Zhang, Y. (2017). Structure and patterns of cross-national Big Data research collaborations.
Journal of Documentation, 73(6), 1119-1136. doi:10.1108/JD-12-2016-0146
Huang, G., Huang, G.-B., Song, S., & You, K. (2014). Trends in extreme learning machines: A review.
Neural Networks, 61, 32-48.
Huang, Y., Schuehle, J., Porter, A., & Youtie, J. (2015). A systematic method to create search strategies
for emerging technologies based on the Web of Science: Illustrated for “Big Data”. Scientometrics,
105(3), 2005–2022. doi:10.1007/s11192-015-1638-y
Hugo, V. (2013). Les misérables. Simon and Schuster.
Hummelen, R., Fernandes, A. D., Macklaim, J. M., Dickson, R. J., Changalucha, J., Gloor, G. B., & Reid,
G. (2010). Deep sequencing of the vaginal microbiota of women with HIV. PLoS One, 5(8), e12078.
doi:10.1371/journal.pone.0012078 PMID:20711427
Huser, V., & Cimino, J. J. (2016). Impending challenges for the use of Big Data. International Journal
of Radiation Oncology, Biology, Physics, 95(3), 890-894.
Huxley, A. (1932). A brave new world. London: Chatto and Windus.
IBM. (2017). 10 Key Marketing Trends for 2017 and Ideas for Exceeding Customer Expectations. IBM
Marketing Cloud. Retrieved from: https://public.dhe.ibm.com/common/ssi/ecm/wr/en/wrl12345usen/
watson-customer-engagement-watson-marketing-wr-other-papers-and-reports-wrl12345usen-20170719.
pdf
IEEE. (1990). IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glos-
saries. IEEE.
IFC. (2016a). Central banks’ use of the SDMX standard. IFC.
IFC. (2016b). The sharing of micro data – a central bank perspective. IFC.
IFC. (2017a). Proceedings of the IFC-ECCBSO-CBRT Conference on “Uses of central balance sheet
data office information. IFC Bulletin, 45.
IFC. (2017b). Proceedings of the IFC Workshop on “Data needs and statistics compilation for macro-
prudential analysis”. IFC Bulletin, 46.
IMF & FSB. (2015). The Financial Crisis and Information Gaps – Sixth Implementation Progress Report
of the G20 Data Gaps Initiative. Author.
Iñaki, U., Iñaki, A., & Belén, M. (2016). Aircraft Engine Advanced Health Management: The Power of
the Foresee. 8th European Workshop On Structural Health Monitoring (EWSHM 2016), Bilbao, Spain.

International Monetary Fund (IMF) and FSB. (2009). The Financial Crisis and Information Gaps. Report
to the G20 Finance Ministers and Central Bank Governors.
International Telecommunications Union. (2017). ITU Key 2005 - 2017 ICT data. Retrieved from: https://
idp.nz/Global-Rankings/ITU-Key-ICT-Indicators/6mef-ytg6
Irving Fisher Committee on Central Bank Statistics (IFC). (2015). Central banks’ use of and interest
in ‘big data’. IFC.
Ismail, N. (2016). Big Data in the developing world. Information Age. Retrieved from: http://www.
information-age.com/big-data-developing-world-123461996/
Ismail, S. A., Matin, A. F. A., & Mantoro, T. (2012). A Comparison Study of Classifier Algorithms
for Mobile–phone’s Accelerometer Based Activity Recognition. Procedia Engineering, 41, 224–229.
doi:10.1016/j.proeng.2012.07.166
Jacoby, R. (2013). Foreword. In R. Picciotto & R. Weaving (Eds.), Security and Development: Investing
in Peace and Prosperity (pp. 3–6). Routledge.
Jarlett, H. K. (2017). Breaking data records bit by bit. CERN document server. Retrieved from: https://home.cern/about/updates/2017/12/breaking-data-records-bit-bit
Jerven, M. (2015). Africa - Why economists get it wrong. London: Zed Books.
Jianping, P., Juanjuan, Q., Le, P., Jing, Q., & Perdue, F. P. (2017). An Exploratory Study of the Effective-
ness of Mobile Advertising. Information Resources Management Journal, 30(4), 24-38. doi:10.4018/
IRMJ.2017100102
Ji, W., & Wang, L. (2017). Big data analytics based fault prediction for shop floor scheduling. Journal
of Manufacturing Systems, 43(1), 187–194. doi:10.1016/j.jmsy.2017.03.008
Jukić, N., Sharma, A., Nestorov, S., & Jukić, B. (2015). Augmenting data warehouses with Big data.
Information Systems Management, 32(3), 200–209. doi:10.1080/10580530.2015.1044338
Kaidalova, J., Sandkuhl, K., & Seigerroth, U. (2017). Challenges in Integrating Product-IT into Enterprise
Architecture – a case study. Procedia Computer Science, 121, 525–533. doi:10.1016/j.procs.2017.11.070
Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: Issues and challenges moving
forward. In Proceedings of the 46th Hawaii International Conference on System Sciences (HICSS) (pp.
995-1004). Washington, DC: IEEE Computer Society. 10.1109/HICSS.2013.645
Kanz, C., Aldebert, P., Althorpe, N., Baker, W., Baldwin, A., Bates, K., ... Apweiler, R. (2005). The
EMBL nucleotide sequence database. Nucleic Acids Research, 33(1), D29–D33. PMID:15608199
Karthikk, R. G. (2011). IVHM On UAV Fuel System Test Rig (MSc thesis). School of Engineering, Aero-
space Vehicle Design, Cranfield University, UK. Retrieved from https://www.dsiintl.com/.../Cranfield-
University-IVHM-Diagnostic-Influence-2011-Go

Kaufman, L. M. (2009). Data security in the world of cloud computing. IEEE Security and Privacy,
7(4), 61–64. doi:10.1109/MSP.2009.87
Kaushik, H., Raviya, & Biren Gajjar. (2013). Performance Evaluation of different data mining classification algorithm using WEKA. Indian Journal of Research, 2(1).
Ketamo, H. (2008). Cost Effective Testing with Artificial Labour. Proceedings of 2008 Networked &
Electronic Media Summit, 185-190.
Ketamo, H. (2010). Balancing adaptive content with agents: Modelling and reproducing group behavior
as computational system. Proceedings of 6th International Conference on Web Information Systems and
Technologies, 1, 291-296.
Ketamo, H., Devlin, K., & Kiili, K. (2018). Gamifying Assessment: Extending Performance Measures
with Gaming Data. Proceedings of American Educational Research Association’s Annual Conference.
Khan, S., Liu, X., Shakil, K. A., & Alam, M. (2017). A survey on scholarly data: From big data perspec-
tive. Information Processing & Management, 53(4), 923–944. doi:10.1016/j.ipm.2017.03.006
Kim, G. H., Trimi, S., & Chung, J. H. (2014). Big-data applications in the government sector. Com-
munications of the ACM, 57(3), 78–85. doi:10.1145/2500873
Kim, G., & Bae, J. (2017). A novel approach to forecast promising technology through patent analysis.
Technological Forecasting and Social Change, 117, 228–237. doi:10.1016/j.techfore.2016.11.023
Kim, M., Park, Y., & Yoon, J. (2016). Generating patent development maps for technology monitoring
using semantic patent-topic analysis. Computers & Industrial Engineering, 98, 289–299. doi:10.1016/j.
cie.2016.06.006
Kirkpatrick, M. (2010). Facebook’s Zuckerberg Says the Age of Privacy is Over. Retrieved from: https://
readwrite.com/2010/01/09/facebooks_zuckerberg_says_the_age_of_privacy_is_ov/
Kitchin, R. (2015). The opportunities, challenges and risks of big data for official statistics. Statistical
Journal of the International Association of Official Statistics, 31(3), 471–481.
Klindworth, A., Pruesse, E., Schweer, T., Peplies, J., Quast, C., Horn, M., & Glockner, F. O. (2013).
Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequenc-
ing–based diversity studies. Nucleic Acids Research, 41(1), e1. doi:10.1093/nar/gks808 PMID:22933715
Korte, T. (2014). How Data and Analytics Can Help the Developing World. Huffington Post - The Blog.
Retrieved from: https://www.huffingtonpost.com/travis-korte/how-data-and-analytics-ca_b_5609411.html
Košturiak, J. (2010). Innovations and knowledge management. Human Systems Management, 29(1), 51–63.
Kretschmann, E., Fleischmann, W., & Apweiler, R. (2001). Automatic rule generation for protein annota-
tion with the C4. 5 data mining algorithm applied on SWISS–PROT. Bioinformatics (Oxford, England),
17(10), 920–926. doi:10.1093/bioinformatics/17.10.920 PMID:11673236
Krikorian, R. (2013). New Tweets per Second Record, and How! Engineering Blog. Retrieved from:
https://blog.twitter.com/2013/new-tweets-per-second-record-and-how

Kulage, K. M., & Larson, E. L. (2016). Implementation and Outcomes of a Faculty-Based, Peer Review
Manuscript Writing Workshop. Journal of Professional Nursing, 32(4), 262–270. doi:10.1016/j.prof-
nurs.2016.01.008 PMID:27424926
Kulp, P. (2017). Facebook quietly admits to as many as 270 million fake or clone accounts. Mashable. Re-
trieved from: https://mashable.com/2017/11/02/facebook-phony-accounts-admission/#UyvC2aOAmPqo
Kurzweil, R., Brooks, R., Hanson, R., Rothblatt, M., Puri, R., Mead, C., . . . Schmidhuber, J. (2017).
Human-level Artificial Intelligence is Right Around the corner – or Hundreds of years away. IEEE Spec-
trum. Retrieved from https://spectrum.ieee.org/computing/software/humanlevel-ai-is-right-around-the-
corner-or-hundreds-of-years-away
Kyebambe, M., Cheng, G., Huang, Y., He, C., & Zhang, Z. (2017). Forecasting emerging technologies:
A supervised learning approach through patent analysis. Technological Forecasting and Social Change,
125, 236–244. doi:10.1016/j.techfore.2017.08.002
Ladley, J. (2012). Data governance: How to design, deploy and sustain an effective data governance
program. Elsevier.
Lagoze, C. (2014). Big Data, data integrity, and the fracturing of the control zone. Big Data & Society,
1(2), 1–11. doi:10.1177/2053951714558281
Landefeld, S. (2014). Uses of Big Data for Official Statistics: Privacy, Incentives, Statistical Challenges,
and Other Issues. Discussion paper presented at the United Nations Global Working Group on Big Data
for Official Statistics, Beijing, China. Retrieved from: https://unstats.un.org/unsd/trade/events/2014/
beijing/Steve%20Landefeld%20-%20Uses%20of%20Big%20Data%20for%20official%20statistics.pdf
Laney, D. (2001). 3D data management: Controlling data volume, velocity and variety. META Group Research Note, 6(70). Retrieved from: https://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Lee, D. (2018). UK unveils extremism blocking tool. BBC News. Retrieved from https://www.bbc.com/
news/technology-43037899
Lee, D., & van der Klaauw, W. (2010). An Introduction to the FRBNY Consumer Credit Panel.
Staff Report no 479, November.
Lee, J., & Ardakani, H. D. (2015). Industrial Big Data Analytics and Cyber-ph. Academic Press.

Lee, M.-C. (2009). The Combination of Knowledge Management and Data mining with Knowledge
Warehouse. International Journal of Advancements in Computing Technology, 1. doi:10.4156/ijact.
vol1.issue1.6
Legal Entity Identifier Regulatory Oversight Committee. (2016). Collecting data on direct and ultimate
parents of legal entities in the Global LEI System – Phase 1. Author.
Lengnick-Hall, M. L., Neely, A. R., & Stone, C. B. (2018). Human Resource Management in the Digital
Age: Big Data, HR Analytics and Artificial Intelligence. In P. N. Melo & C. Machado (Eds.), Manage-
ment and Technological Challenges in the Digital Age (pp. 13–42). Boca Raton, FL: CRC Press.
Letouzé, E., & Jütting, J. (2015). Official Statistics, Big Data and Human Development. Data Pop Al-
liance, White Paper Series. Retrieved from: https://www.paris21.org/sites/default/files/WPS_Official-
Statistics_June2015.pdf
Lewis, L. (2013). China’s Nuclear Idiosyncrasies and Their Challenges. Security Studies Center. Retrieved
from: https://www.ifri.org/sites/default/files/atoms/files/pp47lewis.pdf
Lichtenthaler, U., & Ernst, H. (2009). The role of champions in the external commercialization of knowl-
edge. Journal of Product Innovation Management, 26(4), 371–387. doi:10.1111/j.1540-5885.2009.00666.x
Liebowitz, J. (2013). Business analytics: An introduction. CRC Press, Taylor & Francis Group.
Liew, C. B. A. (2008). Strategic integration of knowledge management and customer relationship
management. Journal of Knowledge Management, 12(4), 131–146. doi:10.1108/13673270810884309
Li, J., Liu, H., Ng, S. K., & Wong, L. (2003). Discovery of significant rules for classifying cancer di-
agnosis data. Bioinformatics (Oxford, England), 19(suppl 2), ii93–ii102. doi:10.1093/bioinformatics/
btg1066 PMID:14534178
Li, J., & Wong, L. (2002). Identifying good diagnostic gene groups from gene expression profiles us-
ing the concept of emerging patterns. Bioinformatics (Oxford, England), 18(5), 725–734. doi:10.1093/
bioinformatics/18.5.725 PMID:12050069
Lim, B., Nagai, A., Lim, V., & Ho. (2016). Veritas Global Databerg Report Finds 85% of Stored Data
Is Either Dark or Redundant, Obsolete, or Trivial (ROT). Retrieved from https://www.veritas.com/en/
aa/news-releases/2016-03-15-veritas-global-databerg-report-finds-85-percent-of-stored-data
Lim, C., Kim, K.-H., Kim, M.-J., Heo, J.-Y., & Maglio, P. P. (2018). From data to value: A nine-factor
framework for data-based value creation in information-intensive services. International Journal of
Information Management, 39, 121–135. doi:10.1016/j.ijinfomgt.2017.12.007
Liu, B., & Pop, M. (2009). ARDB––Antibiotic Resistance Genes Database. Nucleic Acids Research,
37(Database), D443–D447. doi:10.1093/nar/gkn656 PMID:18832362
Long, J., & Brindley, W. (2013). The role of big data and analytics in the developing world: Insights
into the role of technology in addressing development challenges. Accenture Development Partnerships.
Retrieved from: https://www.accenture.com/us-en/~/media/Accenture/Conversion-Assets/DotCom/Docu-
ments/Global/PDF/Strategy_5/Accenture-ADP-Role-Big-Data-And-Analytics-Developing-World.pdf

Loukides, M. (2011). What Is Data Science? O’Reilly Media, Inc. Retrieved from: https://www.oreilly.
com/ideas/what-is-data-science
Lu, H., & Li, Y. (2017). Artificial intelligence and computer vision. Studies in computational intelligence.
Springer. doi:10.1007/978-94-009-7772-3_15
Luscombe, N. M., Greenbaum, D., & Gerstein, M. (2001). What is bioinformatics? An introduction and
overview. Yearbook of Medical Informatics, 1(01), 83–99. doi:10.1055/s-0038-1638103 PMID:27701604
Lv, Z., Song, H., Basanta-Val, P., Steed, A., & Jo, M. (2017). Next-Generation Big Data Analytics: State
of the Art, Challenges, and Future Research Topics. IEEE Transactions on Industrial Informatics, 13(4),
1891–1899. doi:10.1109/TII.2017.2650204
Lyko, K., Nitzschke, M., & Ngonga Ngomo, A. C. (2016). Big Data Acquisition. In J. Cavanillas, E. Curry,
& W. Wahlster (Eds.), New Horizons for a Data-Driven Economy. Cham: Springer; doi:10.1007/978-
3-319-21569-3_4
MacFeely, S. (2016). The Continuing Evolution of Official Statistics: Some Challenges and Opportuni-
ties. Journal of Official Statistics, 32(4), 789–810. doi:10.1515/jos-2016-0041
MacFeely, S. (2017). Measuring the Sustainable Development Goals: What does it mean for Ireland?
Administration, 65(4), 41–71. doi:10.1515/admin-2017-0033
MacFeely, S., & Barnat, N. (2017). Statistical capacity building for sustainable development: Develop-
ing the fundamental pillars necessary for modern national statistical systems. Statistical Journal of the
International Association of Official Statistics, 33(4), 895–909.
MacFeely, S., & Dunne, J. (2014). Joining up public service information: The rationale for a national
data infrastructure. Administration, 61(4), 93–107.
Madani, F., & Weber, C. (2016). The evolution of patent mining: Applying bibliometrics analysis and
keyword network analysis. World Patent Information, 46, 32–48. doi:10.1016/j.wpi.2016.05.008
Mahdavinejad, M. S., Rezvan, M., Barekatain, M., Adibi, P., Barnaghi, P., & Sheth, A. P. (2017). Ma-
chine learning for Internet of Things data analysis: A survey. Digital Communications and Networks.
doi:10.1016/j.dcan.2017.10.002
Maia, A.-T., Sammut, S.-J., Jacinta-Fernandes, A., & Chin, S.-F. (2017). Big data in cancer genomics.
Current Opinion in Systems Biology., 4, 78–84. doi:10.1016/j.coisb.2017.07.007
Maiden, M., Bygraves, J., Feil, E. J., Morelli, G., Russell, J., Urwin, R., ... Spratt, B. G. (1998). Multilocus
sequence typing: A portable approach tothe identification of clones within populations of pathogenicmi-
croorganisms. Proceedings of the National Academy of Sciences of the United States of America, 95(6),
3140–3145. doi:10.1073/pnas.95.6.3140 PMID:9501229
Malhotra, N. K., Kim, S. S., & Agarwal, J. (2004). Internet users’ information privacy concerns (IUIPC):
The construct, the scale, and a causal model. Information Systems Research, 15(4), 336–355. doi:10.1287/
isre.1040.0032

Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines:
Consistent probability estimation using nonparametric learning machines. Methods of Information in
Medicine, 51(1), 74–81. doi:10.3414/ME00-01-0052 PMID:21915433
Manogaran, G., & Lopez, D. (2018). Spatial cumulative sum algorithm with big data analytics for
climate change detection. Computers & Electrical Engineering, 65, 207–221. doi:10.1016/j.compel-
eceng.2017.04.006
Manyika, J. (2011). Big Data: The Next Frontier for Innovation, Competition and Productivity. McK-
insey Global Institute.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data:
The next frontier for innovation, competition, and productivity. Retrieved June 2, 2017, from: https://www.
mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
Marcus, G. (2013). Steamrolled by big data. The New Yorker. Retrieved July 20, 2018, from https://
www.newyorker.com/tech/elements/steamrolled-by-big-data
Martin, K. D., & Murphy, P. E. (2017). The role of data privacy in marketing. Journal of the Academy
of Marketing Science, 45(2), 135–155. doi:10.1007/s11747-016-0495-4
Mayan, M. J. (2009). Essentials of Qualitative Inquiry. Walnut Creek, CA: Left Coast Press, Inc.
Mayer-Schönberger, V. (2009). Can we reinvent the internet? Science, 325(5939), 396–397.
doi:10.1126/science.1178418 PMID:19628843
Mayer-Schönberger, V. (2011). Delete: The virtue of forgetting in the digital age. Princeton University
Press.
Mayer-Schonberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live,
Work and Think. London: John Murray.
Mayer-Schonberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live,
Work, and Think. Houghton Mifflin Harcourt.
Mayer-Schonberger, V., & Cukier, K. (2013). Big data: The essential guide to work, life, and learning
in the age of insight. Hachette.
Mayer-Schönberger, V., & Lazer, D. (Eds.). (2007). Governance and information technology: From
electronic government to information government. MIT Press.
Mazzei, M. J., & Noble, D. (2017). Big data dreams: A framework for corporate strategy. Business
Horizons, 60(3), 405–414. doi:10.1016/j.bushor.2017.01.010
McAfee, J. (2015). Untitled posting on Facebook. Retrieved from https://www.facebook.com/officialmcafee/posts/464114187078100:0
McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big data: The manage-
ment revolution. Harvard Business Review, 90(10), 60–68. PMID:23074865

McArthur, A. G., Waglechner, N., Nizam, F., Yan, A., Azad, M. A., Baylay, A. J., ... Wright, G. D.
(2013). The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy,
57(7), 3348–3357. doi:10.1128/AAC.00419-13 PMID:23650175
McHugh, D. (2015). Traffic prediction and analysis using a big data and visualisation approach. De-
partment of Computer Science, Institute of Technology Blanchardstown.
McKinsey & Company. (2016). The age of analytics: Competing in a data-driven world. McKinsey Global Institute.
McKnew, D. L., Lynn, F., Zenilman, J. M., & Bash, M. C. (2003). Porin variation among clinical isolates of Neisseria gonorrhoeae over a 10-year period, as determined by Por variable region typing. The
Journal of Infectious Diseases, 187(8), 1213–1222. doi:10.1086/374563 PMID:12696000
Meeker, M. (2017). Internet Trends 2017. Presented at the Code Conference, Rancho Palos Verdes, CA.
Retrieved from: http://www.kpcb.com/internet-trends
Meeting of the Expert Group on International Statistical Classifications. (2015). Classification of Types
of Big Data. United Nations Department of Economic and Social Affairs, ESA/STAT/AC.289/26, May.
Mehrhoff, J. (2017). Demystifying big data in official statistics – it is not rocket science! Presentation
at the Second Statistics Conference of the Central Bank of Chile.
Meng, X. (2014). A trio of inference problems that could win you a Nobel Prize in statistics (if you help
fund it). In X. Lin, C. Genest, D. Banks, G. Molenberghs, D. Scott, & J.-L. Wang (Eds.), Past, present,
and future of statistical science (pp. 537–562). Chapman and Hall. doi:10.1201/b16720-50
Menninger, D. (2017). 2017 Big data prediction: A’s replace V’s. Retrieved from https://davidmenninger.
ventanaresearch.com/2017-big-data-prediction-as-replace-vs-1
Merriam, S. B. (1998). Qualitative Research and Case Study Applications in Education. Revised and
Expanded from: Case Study Research in Education. San Francisco, CA: Jossey-Bass Publishers.
Metcalf, J. L., Xu, Z. Z., Bouslimani, A., Dorrestein, P., David, O., Carter, P. D., & Knight, R. (2017).
Microbiome Tools for Forensic Science. Trends in Biotechnology, 35(9), 814–823. doi:10.1016/j.
tibtech.2017.03.006 PMID:28366290
Miah, S. J., Vu, Q. H., Gammack, J., & McGrath, M. (2017). A Big Data Analytics Method for Tourist
Behavior Analysis. Information & Management, 54(6), 771–785. doi:10.1016/j.im.2016.11.011
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook. Thousand
Oaks, CA: Sage Publications, Inc.
Mitchell, S. A. (1998). The analyst’s knowledge and authority. The Psychoanalytic Quarterly, 67(1),
1–31. doi:10.1080/00332828.1998.12006029 PMID:9494977
Mittelstadt, B. D., & Floridi, L. (2016). The ethics of big data: Current and foreseeable issues in bio-
medical contexts. Science and Engineering Ethics, 22(2), 303–341. doi:10.1007/s11948-015-9652-2
PMID:26002496

Morabito, V. (2015). Big data and analytics: Strategic and organizational impacts. Cham: Springer.
doi:10.1007/978-3-319-10665-6
Moreno, O., Shapira, B., Rokach, L., & Shani, G. (2012, October). Talmud: transfer learning for mul-
tiple domains. In Proceedings of the 21st ACM international conference on Information and knowledge
management (pp. 425-434). ACM.
Morville, P. (2007). Information architecture for the World Wide Web. O’Reilly.
Moss, S., Misra, A., & Evans, C. (2018). Using artificial intelligence to fight financial crimes. O’Reilly
Community. Retrieved from https://www.oreilly.com/pub/e/3930
Moustaghfir, K., & Schiuma, G. (2013). Knowledge, learning, and innovation: Research and perspec-
tives. Journal of Knowledge Management, 17(4), 495–510. doi:10.1108/JKM-04-2013-0141
Mutuku, L. (2016). The big data challenge for developing countries. The World Academy of Sciences.
Retrieved from: https://twas.org/article/big-data-challenge-developing-countries
Najafabadi, M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015).
Deep learning applications and challenges in big data analytics. Journal of Big Data. Retrieved from http://
www.deeplearningitalia.com/wp-content/uploads/2017/12/Dropbox_Big-Data-and-Deep-Learning.pdf
New York Stock Exchange. (2018). Daily NYSE Group Volume in NYSE Listed, 2018. NYSE Transac-
tions, Statistics and Data Library. Retrieved from: https://www.nyse.com/data/transactions-statistics-
data-library
Nicolini, D., Powell, J., Conville, P., & Martinez‐Solano, L. (2008). Managing knowledge in the healthcare
sector. A review. International Journal of Management Reviews, 10(3), 245–263. doi:10.1111/j.1468-
2370.2007.00219.x
Niemann, H., Moehrle, M. G., & Frischkorn, J. (2017). Use of a new patent text mining and visualization
method for identifying patenting patterns over time: Concept, method and test application. Technological
Forecasting and Social Change, 115, 210–220. doi:10.1016/j.techfore.2016.10.004
NIST Big Data Public Working Group. (2017). Big Data Interoperability Framework: Volume 1, Defini-
tions. Accessed at: http://bigdatawg.nist.gov/home.php
Nonaka, I., & Von Krogh, G. (2009). Perspective—Tacit knowledge and knowledge conversion: Con-
troversy and advancement in organizational knowledge creation theory. Organization Science, 20(3),
635–652. doi:10.1287/orsc.1080.0412
Noordin, M. F., & Karim, Z. A. (2015). Modeling the relationship between human intelligence, knowledge
management practices, and innovation performance. Journal of Information & Knowledge Management,
14(01), 1550012. doi:10.1142/S0219649215500124
Nordrum, A. (2016). Popular Internet of Things Forecast of 50 Billion Devices by 2020 Is Outdated.
IEEE Spectrum. Retrieved from https://spectrum.ieee.org/tech-talk/telecom/internet/popular-internet-
of-things-forecast-of-50-billion-devices-by-2020-is-outdated

Nowak, B. (2009). The Semantic Web Technology Stack (not a piece of cake...). Retrieved July 20, 2018,
from http://bnode.org/media/2009/07/08/semantic_web_technology_stack.png
Noyes, K. (2015). Scott McNealy on privacy: You still don’t have any. PC World. Retrieved from: https://
www.pcworld.com/article/2941052/scott-mcnealy-on-privacy-you-still-dont-have-any.html
Nunan, D., & Di Domenico, M. (2013). Market research & the ethics of big data. International Journal
of Market Research, 55(4), 505–520. doi:10.2501/IJMR-2013-015
Nutley, S., Walter, I., & Davies, H. T. O. (2007). Using Evidence: How Research Can Inform Public
Services. Bristol, UK: The Policy Press. doi:10.2307/j.ctt9qgwt1
Nyborg Vov, K. (2018). Using scanner data for sports equipment. Paper written for the joint UNECE/
ILOs Meeting of the Group of Experts on Consumer Price Indices, Geneva, Switzerland. Retrieved from:
https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.22/2018/Norway_-_session_1.pdf
Nymand-Andersen, P. (2015). Big data – the hunt for timely insights and decision certainty: Central
banking reflections on the use of big data for policy purposes. IFC Working Paper, no 14.
O’Neil, C. (2017). Weapons of math destruction: How big data increases inequality and threatens de-
mocracy. New York: Broadway Books.
O’Neil, C., & Schutt, R. (2013). Doing Data Science: Straight Talk from the Frontline. O’Reilly Media, Inc.
O’Neil, C. (2016). Weapons of Math Destruction - How big data increases inequality and threatens
democracy. London: Allen Lane.
Oding, N. (2015). The use of Big data: Challenges and Perspectives in Russia. In C. Larsen, S. Rand,
A. Schmid, M. Mezzanzanica, & S. Dusi (Eds.), Big data and the complexity of Labor Market Policies:
New Approaches in Regional and Local Labor Market Mentoring for Reducing Skills Mismatches (pp.
153-164). Muenchen, Germany: Rainer Hampp Verlag.
Ofsthun, S. (2002). Integrated vehicle health management for aerospace platforms. IEEE Instrumenta-
tion & Measurement Magazine, 5(3), 21-24. doi:10.1109/MIM.2002.1028368
Oltra, V. (2005). Knowledge management effectiveness factors: The role of HRM. Journal of Knowledge
Management, 9(4), 70–86. doi:10.1108/13673270510610341
Organisation for Economic Co-operation and Development (OECD). (2017). Key issues for digital
transformation in the G20. Report prepared for a joint G20 German Presidency/OECD conference.
Orum, A. M., Feagin, J. R., & Sjoberg, G. (1991). Introduction: The nature of the case study. In J.R.
Feagin, A.M. Orum, & G. Sjoberg (Eds.), A case for the case study (pp. 1-26). Chapel Hill, NC: The
University of North Carolina Press.
Osbourne, H. (2017). Stephen Hawking AI warning: Artificial Intelligence could destroy civilisation.
Newsweek. Retrieved from http://www.newsweek.com/stephen-hawking-artificial-intelligence-warning-
destroy-civilization-703630
Otto, B. (2011). Data governance. Business & Information Systems Engineering, 3(4), 1–244.
doi:10.1007/s12599-011-0162-8

Ozekes, A., & Camurcu, Y. (2002). Classification and Prediction in A Data Mining Application. Journal
of Marmara for Pure and Applied Sciences, 18, 159–174.
Paajanen, S., Valkokari, K., & Aminoff, A. (2017). The opportunities of big data analytics in supply
market intelligence. In Proceedings of the18th IFIP WG 5.5 Working Conference on Virtual Enterprises,
PRO-VE 2017 (pp. 194-205). Springer. 10.1007/978-3-319-65151-4_19
Pal, C., Bengtsson–Palme, J., Rensing, C., Kristiansson, E., & Larsson, D. G. (2014). BacMet: Anti-
bacterial Biocide and Metal Resistance Genes Database. Nucleic Acids Research, 42(D1), D737–D743.
doi:10.1093/nar/gkt1252 PMID:24304895
Panian, Z. (2010). Some practical experiences in data governance. World Academy of Science, Engineer-
ing and Technology, 38, 150–157.
Parashar, K., Burse, & Rawat, K. (2014). A Comparative Approach for Pima Indians Diabetes Diagno-
sis using LDA-Support Vector Machine and Feed Forward Neural Network. International Journal of Advanced Research in Computer Science and Software Engineering, 4, 378–383.
Parr Rud, O. (2011). Invited article: Adaptability. Business Systems Research Journal: International
Journal of the Society for Advancing Business & Information Technology, 2(2), 4-12.
PatSeer. (2017). Retrieved from http://patseer.com/
Patthy, L. (1999). Genome evolution and the evolution of exon–shuffling–a review. Gene, 238(1),
103–114. doi:10.1016/S0378-1119(99)00228-0 PMID:10570989
Pawlak, Z. (2002). Rough sets, decision algorithms and Bayes’ theorem. European Journal of Opera-
tional Research, 136(1), 181–189. doi:10.1016/S0377-2217(01)00029-7
Payton, T., & Claypoole, T. (2015). Privacy in the Age of Big Data - Recognising the Threats Defending
Your Rights and Protecting Your Family. Lanham, MD: Rowman & Littlefield.
Pearson, E. (2013). Growing Up Digital. Presentation to the OSS Statistics System Seminar Big Data
and Statistics New Zealand: A seminar for Statistics NZ staff, Wellington, New Zealand. Retrieved from:
https://www.youtube.com/watch?v=lRgEMSqcKXA
Pearson, W. R., & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proceedings
of the National Academy of Sciences of the United States of America, 85(8), 2444–2448. doi:10.1073/
pnas.85.8.2444 PMID:3162770
Pejić Bach, M., Pivar, J., & Dumičić, K. (2017). Data anonymization patent landscape. Croatian Op-
erational Research Review, 8(1), 265–281. doi:10.17535/crorr.2017.0017
Peters, I. (2009). Folksonomies: Indexing and Retrieval in Web 2.0. Berlin: De Gruyter Saur.
doi:10.1515/9783598441851
Pourdehnad, J., Wexler, R., & Wilson, V. (2011). Systems & Design Thinking: A Conceptual Framework
for Their Integration. Organizational Dynamics Working Papers, 1-16.
Power, D. J. (2014). Using ‘Big data’ for analytics and decision support. Journal of Decision Systems,
23(2), 222–228. doi:10.1080/12460125.2014.888848

Press, G. (2017). Ten predictions for AI, big data and analytics in 2018. Forbes. Retrieved from
https://www.forbes.com/sites/gilpress/2017/11/09/10-predictions-for-ai-big-data-and-analytics-in-
2018/2/#27db0676441c
Prewitt, K., Schwandt, T. A., & Straf, M. L. (2012). Using science as evidence in public policy. Wash-
ington, DC: National Academies Press.
Radojicis, M., Stankovic, R., & Kaplar, S. (2017). Review of the 2nd KEYSTONE Training Summer School
on Keyword Search in Big Linked Data. INFOtheca - Journal for Digital Humanities, 17(1), 108-113.
Raguseo, E. (2018). Big data technologies: An empirical investigation on their adoption, benefits and
risks for companies. International Journal of Information Management, 38, 187-195. doi:10.1016/j.
ijinfomgt.2017.07.008
Raley, R. (2013). Dataveillance and countervailance. In “Raw Data” is an Oxymoron. MIT Press.
Rasmussen, T., & Ulrich, D. (2015). Learning from practice: How HR analytics avoids being a manage-
ment fad. Organizational Dynamics, 44(3), 236–242. doi:10.1016/j.orgdyn.2015.05.008
Rassam, M. A., Maarof, M. A., & Zainal, A. (2017). Big Data Analytics Adoption for Cyber-Security:
A Review of Current Solutions, Requirements, Challenges and Trends. Journal of Information Assur-
ance and Security, 12(4), 124–145.
Reich, R. (2015). Saving Capitalism: For the Many, Not the Few. London: Icon Books Ltd.
Reinsel, D., Gantz, J., & Rydning, J. (2017). Data age 2025. The evolution of data to life-critical. Don’t
focus on big data; focus on data that’s big. An IDC white paper, sponsored by Seagate. Retrieved
from https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-
March-2017.pdf
Rekik, R., Kallel, I., Casillas, J., & Alimi, A. M. (2018). Assessing web sites quality: A systematic lit-
erature review by text and association rules mining. International Journal of Information Management,
38(1), 201–216. doi:10.1016/j.ijinfomgt.2017.06.007
Report, N. (2018). Global Cards - 2015: Special Report. The Nilson report. Retrieved from: https://www.
nilsonreport.com/publication_special_feature_article.php
Ricci, F., Rokach, L., Shapira, B., & Kantor, P. B. (Eds.). (2010). Recommender systems handbook.
Springer Science & Business Media.
Ries, E. (2010). The Lean Startup: How constant innovation creates radically successful businesses.
London: Penguin Books.
Rivers, C. M., & Lewis, B. L. (2014). Ethical research standards in a world of big data. F1000 Research, 3.
Ronda‐Pupo, G. A., & Guerras‐Martin, L. Á. (2012). Dynamics of the evolution of the strategy concept
1962–2008: A co-word analysis. Strategic Management Journal, 33(2), 162–188. doi:10.1002/smj.948
Rudder, C. (2014). Dataclysm: What our online lives tell us about our offline selves. 4th Estate.

Runde, D. (2017). The Data Revolution in Developing Countries Has a Long Way to Go. Forbes. Re-
trieved from https://www.forbes.com/sites/danielrunde/2017/02/25/the-data-revolution-in-developing-
countries-has-a-long-way-to-go/2/#3a48f53e482f
Ruzgas, T., Jakubeliene, K., & Buivyte, A. (2016). Big data mining and knowledge discovery. Journal
of Communications Technology, Electronics and Computer Science, (9). doi:10.22385/jctecs.v9i0.134
Saeb, A. T. (2018). Current Bioinformatics resources in combating infectious diseases. Bioinformation, 14(1), 31–35.
Saeb, A. T., Abouelhoda, M., Selvaraju, M., Althawadi, S. I., Mutabagani, M., Adil, M., & Tayeb, H.
T. (2017). The Use of Next–Generation Sequencing in the Identification of a Fastidious Pathogen: A
Lesson from a Clinical Setup. Evolutionary Bioinformatics Online, 13. doi:10.1177/1176934316686072
PMID:28469373
Saitoh, M. (2017). Application of satellite remote sensing for marine spatial management: An approach
towards sustainable utilization of fisheries resources. Journal of Information Processing & Management,
60(9), 641-650. doi:10.1241/johokanri.60.641
Salehan, M., & Kim, D. J. (2016, January). Predicting the performance of online consumer reviews: A
sentiment mining approach to big data analytics. Decision Support Systems, 81, 30–40. doi:10.1016/j.
dss.2015.10.006
Salipante, S. J., SenGupta, D. J., Hoogestraat, D. R., Cummings, L. A., Bryant, B. H., Natividad, C.,
... Hoffman, N. G. (2013). Molecular Diagnosis of Actinomadura madurae Infection by 16S rRNA
Deep Sequencing. Journal of Clinical Microbiology, 51(12), 4262–4265. doi:10.1128/JCM.02227-13
PMID:24108607
Salleh, K. A., & Janczewski, L. (2016). Technological, Organisational and Environmental Security and
Privacy Issues of Big Data: A Literature Review. Procedia Computer Science, 100, 19–28. doi:10.1016/j.
procs.2016.09.119
Salter, A., & Alexy, O. (2013). The nature of Innovation. In M. Dodgson, D. Gann, & N. Phillips (Eds.),
The Oxford Handbook of Innovation Management. Oxford, UK: OUP.
Sameer, D. (2018). AI and Analytics – Accelerating Business Decisions. Wiley India Pvt. Ltd.
Sangeetha, J., & Prakash, V. S. J. (2017). A Survey on Big Data Mining Techniques. International Journal
of Computer Science and Information Security, 15(1). Retrieved from https://sites.google.com/site/ijcsis/
Saravananathan, K., & Velmurugan, T. (2016). Analyzing Diabetic Data using Classification Algorithms
in Data Mining. Indian Journal of Science and Technology, 9(43). Retrieved from http://www.indjst.org/
index.php/indjst/article/view/93874
Sarsfield, S. (2009). The data governance imperative. IT Governance Ltd.
Schloss, P. D., Westcott, S. L., Ryabin, T., Hall, J. R., Hartmann, M., Hollister, E. B., ... Weber, C. F.
(2009). Introducing mothur: Open source, platform–independent, community–supported software for
describing and comparing microbial communities. Applied and Environmental Microbiology, 75(23),
7537–7541. doi:10.1128/AEM.01541-09 PMID:19801464

Scholz, T. M. (2017). Big Data in Organizations and the Role of Human Resource Management. New
York, NY: Peter Lang LTD International Academic Publishers. doi:10.3726/b10907
Schouls, L. M., Spalburg, E. C., van Luit, M., Huijsdens, X. W., Pluister, G. N., van Santen–Verheuvel,
M. G., ... de Neeling, A. J. (2009). Multiple–locus variable number tandem repeat analysis of Staphy-
lococcus aureus: Comparison with pulsed–field gel electrophoresis and spa–typing. PLoS One, 4(4),
e5082. doi:10.1371/journal.pone.0005082 PMID:19343175
Schroeck, M. (2012). Analytics: The real-world use of big data. Academic Press.
Schubert, A. (2016). AnaCredit: banking with (pretty) big data. Central Banking Focus Report.
Schwikowski, B. (2015). Cytoscape: Visualization and Analysis of omics data in interaction networks, Institut Pasteur. Genome Research. Retrieved from https://research.pasteur.fr/en/software/cytoscape/
Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., & Huttenhower, C. (2012). Metage-
nomic microbial community profiling using unique clade–specific marker genes. Nature Methods, 9(8),
811–814. doi:10.1038/nmeth.2066 PMID:22688413
Seiner, R. (2014). Non-Invasive Data Governance. The Path of Least Resistance and Greatest Success.
Technics Publications.
Shafer, T. (2017). The 42 V’s of Big Data and Data Science. Elder Research. Retrieved from https://
www.elderresearch.com/company/blog/42-v-of-big-data
Shah, N., Irani, Z., & Sharif, A. M. (2017). Big data in an HR context: Exploring organizational change
readiness, employee attitudes and behaviors. Journal of Business Research, 70, 366–378. doi:10.1016/j.
jbusres.2016.08.010
Shakespeare, W. (1826). The dramatic works of Shakespeare. William Pickering.
Sharda, R., Delen, D., & Turban, E. (2018). Business intelligence, analytics and data science: A mana-
gerial perspective. Pearson.
Sharma, P., & Tripathi, R. C. (2017). Patent citation: A technique for measuring the knowledge flow
of information and innovation. World Patent Information, 51, 31–42. doi:10.1016/j.wpi.2017.11.002
Sharma, R., & Kankanhalli, A. (2014). Transforming decision-making processes: A research agenda
for understanding the impact of business analytics on organizations. European Journal of Information
Systems, 23(4), 433–441. doi:10.1057/ejis.2014.17
Sheng, J., Amankwah-Amoah, J., & Wang, X. (2017). A multidisciplinary perspective of big data in
management research. International Journal of Production Economics, 191, 97–112. doi:10.1016/j.
ijpe.2017.06.006
Shin, H. (2013). The Second Phase of Global Liquidity and Its Impact on Emerging Economies. Keynote
address at the Federal Reserve Bank of San Francisco Asia Economic Policy Conference, Princeton, NJ.
Shirdastian, H., Laroche, M., & Richard, M. O. (2017). Using big data analytics to study brand authen-
ticity sentiments: The case of Starbucks on Twitter. International Journal of Information Management.
doi:10.1016/j.ijinfomgt.2017.09.007

SHRM Foundation. (2016). Use of Workforce Analytics for Competitive Advantage, Preparing for future
HR trends Report. Retrieved June 15, 2016, from: https://www.shrm.org/foundation/ourwork/initiatives/preparing-for-future-hr-trends/Documents/Workforce%20Analytics%20Report.pdf
Silver, D., & Hassabis, D. (2016, January 27). AlphaGo: Mastering the ancient game of Go with Machine
Learning. Google Research Blog.
Silver, N. (2012). The signal and the noise: the art and science of prediction. London: Penguin UK.
Simião Dornelas, J., Rodrigues de Souza, K. R., & Amorim, A. N. (2017, May/August). Cloud Computing: Searching its Use In Public Management Environments. JISTEM USP, Brazil, 14(2), 281–306.
doi:10.4301/S1807-17752017000200008
Sinha, M., & Pandurangi, A. (2015). Guide to Practical Patent Searching And How To Use Patseer For
Patent Search And Analysis. Pune: Gridlogics Technologies.
Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of Big data challenges
and analytical methods. Journal of Business Research, 70, 263–286. doi:10.1016/j.jbusres.2016.08.001
Skewes–Cox, P., Sharpton, T. J., Pollard, K. S., & DeRisi, J. L. (2014). Profile Hidden Markov Models for
the Detection of Viruses within Metagenomic Sequence Data. PLoS ONE, 9(8), e105067. doi:10.1371/
journal.pone.0105067
Skourletopoulos, G., Mavromoustakis, C. X., Mastorakis, G., Batalla, J. M., Dobre, C., Panagiotakis,
S., & Pallis, E. (2016). Big Data and Cloud Computing: A Survey of the State-of-the-Art and Research
Challenges. Studies in Big Data, 22. Retrieved from https://link.springer.com/chapter/10.1007/978-3-
319-45145-9_2#citeas
Soares, P. F. M. (2014). Flight Data Monitoring and its Application on Algorithms for Precursor Detec-
tion (MS Thesis). Instituto Superior Tecnico, Lisboa, Portugal.
Soares, S. (2012). Big data governance: An emerging imperative. MC Press.
Song, K., Kim, K., & Lee, S. (2018). Identifying promising technologies using patents: A retrospective
feature analysis and a prospective needs analysis on outlier patents. Technological Forecasting and Social
Change, 128, 118–132. doi:10.1016/j.techfore.2017.11.008
Specktor, B. (2018). Elon Musk worries that AI Research will create an “Immortal Dictator”. Live Science
Tech. Retrieved from https://www.livescience.com/62239-elon-musk-immortal-artificial-intelligence-
dictator.html
Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., & Brazma, A. (2002). Design and implementation of microarray gene expression markup language (MAGE–ML).
Genome Biology, 3(9), research0046.
Spillane, J. P., & Miele, D. B. (2007). Evidence in practice: A framing of the terrain. In P.A. Moss (Eds.),
Evidence and Decision Making. 106th Year book of the National Society for the Study of Education
(Part I, pp.46-73). Malden, MA: Blackwell.
Stake, R. E. (2013). Multiple case study analysis. New York, NY: Guilford Press.

Statista. (2018a). Average daily number of trades on London Stock Exchange (UK order book) in the United Kingdom from January 2015 to February 2018. Statista - The Statistics Portal. Retrieved from:
https://www.statista.com/statistics/325326/uk-lse-average-daily-trades/
Statista. (2018b). Number of user reviews and opinions on TripAdvisor worldwide from 2014 to 2017 (in millions). Statista - The Statistics Portal. Retrieved from: https://www.statista.com/statistics/684862/
tripadvisor-number-of-reviews/
Stephens-Davidowitz, S. (2017). Everybody lies - What the internet can tell us about who we really are.
London, UK: Bloomsbury.
Stiegler, B. (2010). Technics and Time, 3: Cinematic Time and the Question of Malaise. Stanford, CA:
Stanford University Press.
Stiglic, G., Bajgot, M., & Kokol, P. (2010). Gene set enrichment meta–learning analysis: Next–genera-
tion sequencing versus microarrays. BMC Bioinformatics, 11(1), 176. doi:10.1186/1471-2105-11-176
PMID:20377890
Struelens, M. (1996). Consensus guidelines for appropriate use and evaluation of microbial epidemio-
logic typing systems. Clinical Microbiology and Infection, 2(1), 2–11. doi:10.1111/j.1469-0691.1996.
tb00193.x PMID:11866804
Struijs, P., Braaksma, B., & Daas, P. J. H. (2014). Official statistics and Big Data. Big Data & Society.
Retrieved from: http://journals.sagepub.com/doi/pdf/10.1177/2053951714538417
Strydom, M. (Ed.). (2018). Big Data Governance and Perspectives in Knowledge Management. IGI Global.
Sun, H., Tang, Y., Wang, Q., & Liu, X. (2017). Handling multi-dimensional complex queries in key-
value data stores. Information Systems, 66, 82-96. doi:10.1016/j.is.2017.02.001
Sutherland, L. S., & Soares, C. G. (2012). The use of quasi-static testing to obtain the low-velocity im-
pact damage resistance of marine GRP laminates. Composites. Part B, Engineering, 43(3), 1459–1467.
doi:10.1016/j.compositesb.2012.01.002
Swoyer, S. (2017). You Still Need a Model! Data Modeling for Big Data and NoSQL. Transforming
Data With Intelligence. Retrieved from https://tdwi.org/articles/2017/03/22/data-modeling-for-big-data-
and-nosql.aspx
Tai, Z. (2010). Aircraft electrical power system diagnostics, prognostics and health management (MSc
thesis). School of Engineering, Aerospace Design Program, Cranfield University, UK. Retrieved from
https://dspace.lib.cranfield.ac.uk/bitstream/handle/1826/9593/Tai_z.pdf
Tam, S., & Clarke, F. (2015). Big Data, Official Statistics and Some Initiatives by the Australian
Bureau of Statistics. International Statistical Review. Retrieved from: https://www.researchgate.net/
publication/280972848_Big_Data_Official_Statistics_and_Some_Initiatives_by_the_Australian_Bu-
reau_of_Statistics
Tao, F., Cheng, J., Qi, Q., Zhang, M., Zhang, H., & Sui, F. (2018). Digital twin-driven product design,
manufacturing and service with big data. International Journal of Advanced Manufacturing Technology,
94(9-12), 3563–3576. doi:10.1007/s00170-017-0233-1

Taplin, J. (2017). Move Fast and Break things - How Facebook, Google and Amazon cornered culture
and undermined democracy. New York: Little, Brown and Company.
Taylor, J., King, R. D., Altmann, T., & Fiehn, O. (2002). Application of metabolomics to plant genotype
discrimination using statistics and machine learning. Bioinformatics (Oxford, England), 18(Suppl 2),
S241–S248. doi:10.1093/bioinformatics/18.suppl_2.S241 PMID:12386008
Thamm, A. (2017). Big Data is dead. LinkedIn. Retrieved from: https://www.linkedin.com/pulse/big-
data-dead-just-regardless-quantity-structure-speed-thamm/
The Economist. (2017, May 6). The world’s most valuable resource is no longer oil, but data. The
Economist.
Thomson, R., Lebiere, C., & Bennati, S. (2014). Human, model and machine: a complementary ap-
proach to big data. Proceedings of the Workshop on Human Centered Big Data Research, HCBDR ’14.
Tissot, B. (2016). Globalisation and financial stability risks: is the residency-based approach of the
national accounts old-fashioned? BIS Working Papers no 587, October.
Tissot, B. (2017). Using micro data to support evidence-based policy. International Statistical Institute
61st World Statistics Congress.
Tobler, J. B., Molla, M. N., Nuwaysir, E. F., Green, R. D., & Shavlik, J. W. (2002). Evaluating machine
learning approaches for aiding probe selection for gene–expression arrays. Bioinformatics (Oxford,
England), 18(suppl 1), S164–S171. doi:10.1093/bioinformatics/18.suppl_1.S164 PMID:12169544
Towers Watson. (2014). Global Workforce Study. At a glance. Towers Watson. Retrieved June 25, 2017, from: https://www.towerswatson.com/assets/jls/2014_global_workforce_study_at_a_glance_emea.pdf
Tsai, C. W., Lai, C. F., Chao, H. C., & Vasilakos, A. V. (2015). Big data analytics: A survey. Journal of
Big Data, 2(1), 21. doi:10.1186/s40537-015-0030-3 PMID:26191487
Tushman, M. L., & O’Reilly, C. A. III. (1996). Ambidextrous organizations: Managing evolutionary and
revolutionary change. California Management Review, 38(4), 8–29. doi:10.2307/41165852
Ulrich, D. (1997). Human resources champion. Boston, MA: Harvard Business School Press.
Ulrich, D. (1998). A new mandate for human resources. Harvard Business Review, 76, 124–135.
PMID:10176915
Ulrich, D., & Grochowski, J. (2012). From shared services to professional services. Strategic HR Review,
11(3), 136–142. doi:10.1108/14754391211216850
UN Global Pulse. (2012). Big data for development: challenges & opportunities. Available at: http://
www.unglobalpulse.org/projects/BigDataforDevelopment
United Nations Conference for Trade and Development. (2015). Mapping of international Internet
public policy issues. E/CN.16/2015/CRP.2, Commission on Science and Technology for Develop-
ment, Eighteenth session, Geneva. Retrieved from: http://unctad.org/meetings/en/SessionalDocuments/
ecn162015crp2_en.pdf

United Nations Conference for Trade and Development. (2016). Development and Globalization: Facts
and Figures 2016. Retrieved from http://stats.unctad.org/Dgff2016/
United Nations Conference for Trade and Development. (2018). Data privacy: new global sur-
vey reveals growing internet anxiety. Retrieved from: http://unctad.org/en/pages/newsdetails.
aspx?OriginalVersionID=1719
United Nations Economic Commission for Europe. (2000). Terminology on Statistical Metadata. Confer-
ence of European Statisticians Statistical Standards and Studies, No.53. Retrieved from: http://ec.europa.
eu/eurostat/ramon/coded_files/UNECE_TERMINOLOGY_STAT_METADATA_2000_EN.pdf
United Nations Economic Commission for Europe. (2011). Using Administrative and Secondary Sources
for Official Statistics - A Handbook of Principles and Practices. Retrieved from: https://unstats.un.org/
unsd/EconStatKB/KnowledgebaseArticle10349.aspx
United Nations Economic Commission for Europe. (2013). The Common Metadata Framework.
UNECE Virtual Standards Helpdesk. Retrieved from: https://statswiki.unece.org/display/VSH/
The+Common+Metadata+Framework
United Nations Economic Commission for Europe. (2016). Outcomes of the UNECE Project on
Using Big Data for Official Statistics. Retrieved from: https://statswiki.unece.org/display/bigdata/
Big+Data+in+Official+Statistics
United Nations Economic Commission for Europe. (2018). Guidance on common elements of statistical
legislation. Conference of European Statisticians, 66th Session, Geneva. Retrieved from: http://www.
unece.org/fileadmin/DAM/stats/documents/ece/ces/2018/CES_6_Common_elements_of_statistical_leg-
islation__Guidance__for_consultation_for_upload.pdf
United Nations Secretary-General’s Independent Expert Advisory Group on a Data Revolution for Sus-
tainable Development. (2014). A World that Counts: Mobilizing the Data Revolution for Sustainable
Development. Report prepared at the request of the United Nations Secretary-General, by the Independent
Expert Advisory Group on a Data Revolution for Sustainable Development. November 2014. Retrieved
from: http://www.undatarevolution.org/wp-content/uploads/2014/11/A-World-That-Counts.pdf
United Nations Statistical Commission. (2014). Big data and modernization of statistical systems; Report
of the Secretary-General. E/CN.3.2014/11 of the forty-fifth session of UNSC 4-7 March 2014. Retrieved
from: https://unstats.un.org/unsd/statcom/doc14/2014-11-BigData-E.pdf
United Nations Statistics Division. (2017). Bogota Declaration on Big Data for Official Statistics. Agreed
at the 4th Global Conference on Big Data for Official Statistics, Bogota, Colombia. Retrieved from: https://
unstats.un.org/unsd/bigdata/conferences/2017/Bogota%20declaration%20-%20Final%20version.pdf
United Nations Statistics Division. (2018). Big Data Project Inventory. Retrieved from: https://unstats.un.org/bigdata/inventory/
United Nations. (2003). Handbook of Statistical Organization – 3rd Edition: The Operation and Orga-
nization of a Statistical Agency. Department of Economic and Social Affairs Statistics Division Studies
in Methods Series F No. 88. United Nations. Retrieved from: https://www.paris21.org/sites/default/
files/654.pdf

United Nations. (2014). Resolution adopted by the General Assembly on 29 January 2014 - Fundamen-
tal Principles of Official Statistics. General Assembly, A/RES/68/261. Retrieved from: http://unstats.
un.org/unsd/dnss/gp/FP-New-E.pdf
USCCF. (2014). The future of data-driven innovation. Author.
USDOD. (2000). Joint Vision 2020. Retrieved July 20, 2018, from http://www.fs.fed.us/fire/doctrine/
genesis_and_evolution/source_materials/joint_vision_2020.pdf
Vaas, L. (2018). New AI Technology used by UK Government to fight Extremist Content. Naked Security.
Sophos. Retrieved from https://nakedsecurity.sophos.com/2018/02/14/new-ai-technology-used-by-uk-
government-to-fight-extremist-content/
Van Belkum, A., Struelens, M., de Visser, A., Verbrugh, H., & Tibayrenc, M. (2001). Role of genomic typing in taxonomy, evolutionary genetics, and microbial epidemiology. Clinical Microbiology Reviews, 14(3), 547–560. doi:10.1128/CMR.14.3.547-560.2001 PMID:11432813
Van Loon, K., & Roels, D. (2018). Integrating big data in the Belgian CPI. Presented to the Meeting of
the UNECE Group of Experts on Consumer Price Indices, Geneva, Switzerland. Retrieved from https://
www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.22/2018/Belgium.pdf
Vera-Baquero, A., Colomo-Palacios, R., Molloy, O., & Elbattah, M. (2015). Business process improve-
ment by means of Big Data based Decision Support Systems: A case study on Call Centers. International
Journal of Information Systems and Project Management, 3(1), 5–26.
Waber, B. (2013). People Analytics: How Social Sensing Technology Will Transform Business and What
It Tells Us about the Future of Work. Upper Saddle River, NJ: Pearson Education Inc.
Wallace, N., & Castro, D. (2018). The impact of the EU’s new Data Protection Regulation on AI. Centre
for Data Innovation. Retrieved from http://www2.datainnovation.org/2018-impact-gdpr-ai.pdf
Waller, M. A., & Fawcett, S. E. (2013). Data Science, Predictive Analytics, and Big Data: A Revolu-
tion that Will Transform Supply Chain Design and Management. Journal of Business Logistics, 34(2),
77–84. doi:10.1111/jbl.12010
Wamba, S. F., Gunasekaran, A., Akter, S., Ren, S. J. F., Dubey, R., & Childe, S. J. (2017). Big data
analytics and firm performance: Effects of dynamic capabilities. Journal of Business Research, 70,
356–365. doi:10.1016/j.jbusres.2016.08.009
Wang, J. (2012). Aircraft hydraulic power system diagnostics, prognostics and health management (MSc thesis). School of Engineering, Cranfield University, UK. Retrieved from https://dspace.lib.cranfield.ac.uk/bitstream/handle/.../Wang_Jian_Thesis_2012.pdf
Wang, H., Xu, Z., Fujita, H., & Liu, S. (2016). Towards felicitous decision making: An overview on chal-
lenges and trends of Big Data. Information Sciences, 367–368, 747–765. doi:10.1016/j.ins.2016.07.007
Wang, R., Ji, W., Liu, M., Wang, X., Weng, J., Deng, S., ... Yuan, C. (2018). Review on mining data
from multiple data sources. Pattern Recognition Letters. doi:10.1016/j.patrec.2018.01.013

Wang, S., Wan, J., Zhang, D., Li, D., & Zhang, C. (2016). Towards smart factory for industry 4.0: A
self-organized multi-agent system with big data based feedback and coordination. Computer Networks,
101, 158–168. doi:10.1016/j.comnet.2015.12.017
Waterford Technologies. (2017). Big Data Statistics & Facts for 2017. Retrieved from: https://www.
waterfordtechnologies.com/big-data-interesting-facts/
Wattam, A. R., Abraham, D., Dalay, O., Disz, T. L., Driscoll, T., Gabbard, J. L., ... Sobral, B. W. (2014).
PATRIC, the Bacterial Bioinformatics Database and Analysis Resource. Nucleic Acids Research, 42(D1),
D581–D591. doi:10.1093/nar/gkt1099 PMID:24225323
Weigand, A. (2009). The Social Data Revolution(s). Harvard Business Review. Retrieved from https://
hbr.org/2009/05/the-social-data-revolution.html
Weinberger, D. (2014). Too Big to Know. New York: Basic Books.
Weinstock, G. M. (2012). Genomic approaches to studying the human microbiota. Nature, 489(7415),
250–256. doi:10.1038/nature11553 PMID:22972298
Weka 3: Data Mining Software in Java. (n.d.). Retrieved June 24, 2018, from https://www.cs.waikato.
ac.nz/~ml/weka/
Weller, K. (2010). Knowledge representation in the social semantic web. Walter de Gruyter.
doi:10.1515/9783598441585
Wessel, M. (2016). You Don’t Need Big Data — You Need the Right Data. Harvard Business Review.
Retrieved July 20, 2018, from https://hbr.org/2016/11/you-dont-need-big-data-you-need-the-right-data
White, M. S. (2015). Enterprise search. O’Reilly Media.
Wilkowska, W., & Ziefle, M. (2012). Privacy and data security in E-health: Requirements from the
user’s perspective. Health Informatics Journal, 18(3), 191–201. doi:10.1177/1460458212442933
PMID:23011814
Williams, R. D. (2018). The ‘China, Inc.+’ Challenge to Cyberspace Norms. Hoover Institution. Retrieved
from: https://www.hoover.org/research/china-inc-challenge-cyberspace-norms
Williamson, J. (2013). The 4v’s of Big Data. Retrieved from http://www.dummies.com/careers/find-a-
job/the-4-vs-of-big-data/
Wissner-Gross, A. (2016). Datasets over Algorithms. Retrieved from https://www.edge.org/response-
detail/26587
World Economic Forum. (2018). Introducing the technology pioneers cohort of 2018. World Economic
Forum. Retrieved from: http://widgets.weforum.org/techpioneers-2018/
World Intellectual Property Organization (WIPO). (2017). Guide to the International Patent Classification. Retrieved from: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf

World Intellectual Property Organization, Economics and Statistics Division. (2016). World Intellectual Property Indicators 2016. Retrieved from: http://www.wipo.int/edocs/pubdocs/en/wipo_pub_941_2016.pdf
Wu, J., Li, H., Liu, L., & Zheng, H. (2017). Adoption of big data and analytics in mobile healthcare market:
An economic perspective. Electronic Commerce Research and Applications, 22, 24–41. doi:10.1016/j.
elerap.2017.02.002
Wu, X., Zhu, X., Wu, G.-Q., & Ding, W. (2014). Data Mining with Big Data. IEEE Transactions on
Knowledge and Data Engineering, 26(1), 97–107. doi:10.1109/TKDE.2013.109
Yasodha, P., & Kannan, M. (2011). Analysis of a Population of Diabetic Patients Databases in Weka
Tool. International Journal of Scientific & Engineering Research, 2(5).
Yin, R. K. (2009). Case study research: design and methods (4th ed.). Thousand Oaks, CA: Sage Pub-
lications, Inc.
Yoo, I., Alafaireet, P., Marinov, M., Pena–Hernandez, K., Gopidi, R., Chang, J. F., & Hua, L. (2012). Data mining in healthcare and biomedicine: A survey of the literature. Journal of Medical Systems, 36(4), 2431–2448. doi:10.1007/s10916-011-9710-5 PMID:21537851
Yoo, Y. (2015). It is not about size: A further thought on big data. Journal of Information Technology,
30(1), 63–65. doi:10.1057/jit.2014.30
Zankari, E., Hasman, H., Cosentino, S., Vestergaard, M., Rasmussen, S., Lund, O., & Larsen, M. V.
(2012). Identification of Acquired Antimicrobial Resistance Genes. The Journal of Antimicrobial Che-
motherapy, 67(11), 2640–2644. doi:10.1093/jac/dks261 PMID:22782487
Zdobnov, E. M., & Apweiler, R. (2001). InterProScan–an integration platform for the signature–recogni-
tion methods in InterPro. Bioinformatics (Oxford, England), 17(9), 847–848. doi:10.1093/bioinformat-
ics/17.9.847 PMID:11590104
Zeng, M. L. (2008). Knowledge Organization Systems (KOS). Knowledge Organization, 35(2-3), 160–182.
Zhang, P., Xiao, Y., Zhu, Y., Feng, J., Wan, D., Li, W., & Leung, H. (2016). A New Symbolization and
Distance Measure Based Anomaly Mining Approach for Hydrological Time Series. International Journal
of Web Services Research, 13(3), 26-45.
Zhang, L., Tan, J., Han, D., & Zhu, H. (2017). From machine learning to deep learning: Progress in ma-
chine intelligence for rational drug discovery. Drug Discovery Today, 22(11), 1680–1685. doi:10.1016/j.
drudis.2017.08.010 PMID:28881183
Zhang, Q., Yang, L. Y., Chen, Z., & Li, P. (2018). A survey on deep learning for big data. Information
Fusion, 42, 146–157. doi:10.1016/j.inffus.2017.10.006
Zhang, Y., Ren, S., Liu, Y., Sakao, T., & Huisingh, D. (2017). A framework for Big Data driven product
lifecycle management. Journal of Cleaner Production, 159, 229–240. doi:10.1016/j.jclepro.2017.04.172

Zhao, Y., Liu, P., Wang, Z., Zhang, L., & Hong, J. (2017). Fault and defect diagnosis of battery for electric vehicles based on big data analysis methods. Applied Energy, 207, 354–362. doi:10.1016/j.apenergy.2017.05.139
Zhu, L., Liu, X., He, S., Shi, J., & Pang, M. (2015). Keywords co-occurrence mapping knowledge domain research base on the theory of Big Data in oil and gas industry. Scientometrics, 105(1), 249–260. doi:10.1007/s11192-015-1658-7
Zissis, D., & Lekkas, D. (2012). Addressing cloud computing security issues. Future Generation Com-
puter Systems, 28(3), 583–592. doi:10.1016/j.future.2010.12.006
Zupan, B., & Demsar, J. (2008). Open–source tools for data mining. Clinics in Laboratory Medicine,
28(1), 37–54. doi:10.1016/j.cll.2007.10.002 PMID:18194717


About the Contributors

Sheryl Kruger Strydom is a Research Fellow in the School of Computing, University of South
Africa. Her passion lies in the Information Science discipline. She is a committee member of a number
of international organizations as well as an active peer reviewer. She has presented and published papers
locally and internationally.

Moses Strydom is a retired Professor and, until recently, Chair of the Department of Mechanical and Industrial Engineering at the University of South Africa. An alumnus of the University of Perpignan, France (PhD) and New Mexico State University (MSc), Moses is bilingual in French and English and, over the past 30 years as an academic, has worked in several universities in Africa, France and the USA. His research interests include computational/experimental fluid dynamics, hydrogen fuel cells, big data and robotics, and holographic technology in m-learning. He has published in several research journals and books, and has presented conference papers, both nationally and internationally.

***

Ambika Aggarwal completed her B.Tech degree in First Division with Honors from Uttarakhand Technical University, Dehradun, in 2010 and her M.Tech, also in First Division, from the same university in 2012. She is currently pursuing her PhD in the field of Cloud Computing. She has approximately 7 years of teaching experience and is currently working as an Assistant Professor (Senior Scale) at the University of Petroleum and Energy Studies (UPES), Dehradun. She has attended various conferences and workshops during her career and has published many papers in reputed journals and conference proceedings.

Ari Alamäki is a Principal Lecturer at Haaga-Helia University of Applied Sciences, Finland. His
current research focuses on customer behavior in digital channels, digital service development and
business perspectives on big data analytics and cognitive systems. His work has been published in both
refereed scientific journals and practically oriented business journals and books. He also has over 12
years of experience in business consulting and software business.

Lili Aunimo studied Language Technology and Computer Science at the University of Helsinki,
Finland. She defended her doctoral dissertation in the field of Computer Science in 2007. The disserta-
tion belongs to the field of text analytics. In her thesis, Dr. Aunimo developed a novel method based on
pattern matching and machine learning techniques to be used in natural language question answering systems. Dr. Aunimo currently works in the Department of Digital Business at Haaga-Helia University
of Applied Sciences. Her research focuses on applying data analysis and artificial intelligence techniques
in the areas of data-driven new product development, market intelligence and other fields of business
analytics. The teaching activities of Dr. Aunimo include topics related to data analytics as well as the
innovation and development of new digital services.

Ashutosh Bhatt obtained his B.Tech and M.Tech degrees in First Division from Graphic Era University, Dehradun, in 2008 and 2011 respectively. He is currently pursuing his PhD in the field of Cloud Computing. He has approximately 10 years of teaching experience and currently heads the Department of Computer Science and Engineering at Shivalik College of Engineering, Dehradun. He has attended various conferences and workshops during his career and has published many papers in reputed journals and conference proceedings.

Ranganayakulu C. is a Scientist H/Outstanding Scientist and Associate Technology Director (General Systems) working at the Aeronautical Development Agency, Bangalore. He has 32 years of research and development experience in aeronautics across various aircraft programs. He holds a Ph.D from IIT Madras, Chennai, and completed a Humboldt post-doctorate in Germany. His areas of research are the development of environmental control systems and their components, notably compact heat exchangers for aircraft applications, and the development of advanced technologies such as the all-electric environmental control system and nano-fluid heat transfer. He has guided 15 M.Tech and 8 Ph.D projects and is the author of 2 books, 80 peer-reviewed papers and more than 100 restricted documents. He is a visiting researcher at German and South African universities, a Fellow of the National Academy of Engineers, and a winner of the Sir C. V. Raman Young Scientist State award and the ADA Excellency Award.

Sonia Chien-I Chen is the founder of SoniChica (SC) Company Limited, an International Management
Consulting firm based in Taiwan. She is also a visiting scholar in Ministry of Science and Technology in
Taiwan, holding a PhD and a MSc in Business and Management from Ulster University in the UK and
a BA from Fu-Jen Catholic University in Taiwan. An important strand of her current research relates to
business innovation and entrepreneurship, organic food marketing and business development, bio-medical
industry business model innovation, globalisation and medical innovation and big data governance. Her
guest lectures cover various leading universities in Taiwan. She has published over 10 articles in refereed journals, conference proceedings and book chapters. She has been invited to act as a marketing consultant for a biotech start-up, as she has rich experience in both international marketing and project management.

Amitava Choudhury is currently working as an Assistant Professor in the Department of Cybernetics, School of Computer Science, University of Petroleum and Energy Studies, Dehradun, India. He is also pursuing a Ph.D at the Indian Institute of Engineering Science and Technology, Shibpur (formerly BESU, Shibpur), and completed his M.Tech at Jadavpur University. He has more than five years of experience in teaching and two years in research. His areas of research interest are computational geometry in the field of micromechanical modelling, pattern recognition, character recognition, and fog-based IoT big data. He has published 15 journal and conference papers and has organized various workshops on modern data processing.

Satish Kumar David is currently working as Researcher & Head of the Information Technology
department in the Strategic Center for Diabetes Research, College of Medicine, King Saud University,
Riyadh, Saudi Arabia. He holds MCA, M.Tech (CSE) and PhD degrees and has over 21 years of professional experience working as a department head, assistant professor, researcher and IT specialist. His research interests include data mining, mHealth, systems analysis, computer networks and AI. He has several publications in reputed international journals and has been a reviewer for a few journals.

Lamyaa El Bassiti is a junior researcher graduated from ENSIAS Engineering School, University
Mohammed V in Rabat, Morocco. She is an associate member of the International Association for
Knowledge Management -IAKM. Her main teaching and research interests lie in the area of Innovation
Management and its related fields, especially in Educational Studies, Circular Economy, Cognitive Intel-
ligence, Data Science, Change Management, Evidence-based Policy-Making, Research-based Education,
Socio-Economic Impact of Research and Innovation, University-Industry Collaboration, Business Ethics
and Sustainable Entrepreneurship. She has published articles on these topics in journals and conference
proceedings.

Harri Ketamo, Ph.D., is an entrepreneur with 20 years of experience in learning sciences and artificial intelligence. Currently he is founder and chairman of Headai, a company developing natural language based cognitive artificial intelligence. He also actively participates in academic research as an adjunct professor at Tampere University of Technology and as a senior fellow at the University of Turku. Previously he has been, among other roles, founder and CEO of gameMiner ltd (game AI and data mining) and SkillPixels ltd (serious games). Ketamo was awarded the Eisenhower Fellowship in 2017.

Radwan A. Kharabsheh holds a BS from Yarmouk University in Jordan and an MBA and PhD in international business from Charles Sturt University (CSU), Wagga Wagga, NSW, Australia, where he taught full time. He then moved to the Hashemite University in Jordan, where he worked as head of the department of business administration. Currently, Dr. Kharabsheh works as an associate professor in business administration at Applied Science University (ASU) in the Kingdom of Bahrain. He also worked as the Director of the Quality Assurance and Accreditation Center at ASU. His research interests include organizational learning, knowledge management and international joint ventures. He has published more than 21 articles in refereed journals, obtained numerous grants including a fellowship of the Australia Malaysia Institute, and attended more than 20 international conferences. He is a member of ANZIBA, ANZMAC and the Sydney University Centre for Peace and Conflict Resolution Studies. He has supervised and chaired more than 25 postgraduate vivas, supervised more than 30 students, and works as a reviewer and examiner for numerous journals and international conferences.

Zivko Krstic is a Data Scientist at Atomic Intelligence. He graduated from the Faculty of Economics & Business – Split. Zivko participated in the EU FP7 FERARI project. His areas of expertise are text analytics and big data analytics. He participated in the development of several big data products, such as JupiterOne and Pandora Insight. Zivko is the author of several papers in the field of text analytics.

Steve MacFeely is the Head of Statistics & Information at the United Nations Conference on Trade and Development in Geneva, Adjunct Professor at the Centre for Policy Studies in University College
Cork, Ireland and a Director of the IASE International Statistical Literacy Project. Before working at
UNCTAD, Steve was the Assistant Director-General at the Central Statistics Office (CSO) in Ireland and
was a member of the Oversight Board of the new Irish Government Economic and Evaluation Service.
He established the “Professional Diploma in Official Statistics & Policy Evaluation” at the Institute of
Public Administration in Ireland. He has published several papers in peer reviewed journals covering a
range of topics including family business, cross-border shopping, productivity, tourism, input-output,
regional policy, data infrastructure and SDGs.

Mirjana Pejić Bach is a Full Professor at the Department of Informatics at the Faculty of Economics
& Business. She graduated at the Faculty of Economics & Business – Zagreb, where she also received
her Ph.D. degree in Business, submitting a thesis on “System Dynamics Applications in Business Model-
ling“ in 2003. She is the recipient of the Emerald Literati Network Awards for Excellence 2013 for the
paper Influence of strategic approach to BPM on financial and non-financial performance published in
Baltic Journal of Management. Mirjana was also educated at MIT Sloan School of Management in the
field of System Dynamics Modelling, and at OliviaGroup in the field of data mining. She participates in a number of EU FP7 projects and is an expert for Horizon 2020.

Jasmina Pivar is currently employed as a teaching and research assistant at the Department of Infor-
matics at the Faculty of Economics and Business in Zagreb, where she is also enrolled in a postgraduate
doctoral study program. She graduated with a degree in economics from the Faculty of Organization and
Informatics in Varaždin, where she wrote her master’s thesis on association rules and where she received
Dean’s Award for Excellence and Honorable Mention as Master of Economics with Highest Distinction.
Jasmina was also educated at Northern Institute of Technology (NIT), Hamburg, in the field of Partial
Least Squares Structural Equation Modeling (PLS-SEM). Her primary research interests are big data
technology, smart cities, technology adoption, and data mining.

Mohammed Rafiullah is currently working as an Assistant Professor in the Strategic Center for
Diabetes Research, College of Medicine, King Saud University, Riyadh, Saudi Arabia. In addition, he is
the head of the departments of Pharmacology and Nanotechnology in the center. He supervises various
ongoing research projects in these departments. He obtained his doctoral degree in pharmacy (Phar-
macology) from the Faculty of Pharmacy, Hamdard University, New Delhi, India. After completing his doctoral studies, he worked as a research scientist in the department of medical affairs and clinical research at the research and development center of Ranbaxy Laboratories Limited, India. He has over 10 years of
research experience in preclinical and clinical pharmacology in the areas of pharmacokinetics, diabetes,
metabolism and inflammation. He has several publications in reputed international journals and has
been a reviewer for many journals.

Khalid Al Rubeaan is a Professor and Senior Endocrinologist at College of Medicine, King Saud
University, Riyadh, Saudi Arabia, with major interests in endocrine diseases especially Diabetes Mellitus.
His main areas of interest include basic research, education, prevention and epidemiology of Diabetes
Mellitus in Saudi Arabia and Middle East. He is the Executive Director of Strategic Center for Diabetes
Research, Editor-in-Chief of the International Journal of Diabetes Mellitus, and Head of the Saudi National
Diabetes Registry. He is a member of many associations and has numerous publications in the form of
book chapters and scientific papers covering various aspects of Diabetes Mellitus. His achievements
include many national and international awards from Government agencies as well as research and
academic institutions. He also holds patent in the field of Nano Technology & IT for the management
of Diabetes Mellitus.

Amr T. M. Saeb is currently the Head of Genetics, Microbiology and Biotechnology Department,
Strategic Center for Diabetes Research (SCDR), Riyadh, Kingdom of Saudi Arabia. He is running the
Medical Microbiology Laboratory at the same research facility. Dr. Saeb is also an Assistant Professor at
College of Medicine, King Saud University, KSA. Dr. Saeb earned his Ph.D. (Molecular Phylogenetics
and Population Genetics) from The Ohio State University, the United States of America in 2006. Dr. Saeb
has been in the research and teaching profession for more than 20 years in Egypt, the United States of
America and then in the Kingdom of Saudi Arabia. Recently, he and his team discovered the diabetic foot ulcer pathogenic bacterium Proteus mirabilis SCDR 1 (the first non-induced nanosilver-resistant bacterium), which is one of his various contributions to science. His book "Genetics of Diabetes", currently in press, will indisputably be a great addition to the knowledge of genetics and diabetes. Dr. Saeb is an author and co-author of many papers published in international peer-reviewed scientific journals. His ongoing research is conducted in collaboration with national and international researchers. He is the Editor-in-Chief of
SOP Transaction on Inheritance and Genetic Engineering, and a member of the Editorial Board of sev-
eral scientific journals. Dr. Saeb’s research interest is focused on but not limited to Molecular Genetics,
bacterial pathogenomics and virulence, molecular pathology, comparative genomics, bioinformatics,
metagenomics and human genetics. Dr. Amr T.M. Saeb is a recipient of the internationally acclaimed "Albert Nelson Marquis Lifetime Award" for his dedication and passion in his work. His biography has been included in Who's Who in the World 2015. Dr. Saeb is a distinguished international speaker at many scientific conferences and congresses. He earned a lifetime membership by election from the
National Honor Society of Phi Kappa Phi for his superior academic excellence.

Daria Sarti is Assistant Professor of Organization Studies at the University of Florence (Italy) where,
in 2005, she received her Ph.D. Her primary research interest is in human resource management, work
engagement and workforce wellbeing. Recently she has started working on the impact that new technology and new ways of organizing work have on the workforce.

Bruno Tissot has been working at the Bank for International Settlements (BIS) since 2001, as Senior
Economist and Secretary to the Markets Committee of Central Banks in the Monetary and Economic
Department and then as the Adviser to the General Manager and Secretary to the BIS Executive Com-
mittee. Between 1994 and 2001 he worked for the French Ministry of Finance. He is currently Head of
BIS Statistics and Research Support and Head of Secretariat, Irving Fisher Committee on Central Bank
Statistics (IFC), Monetary and Economic Department, and is a graduate of École Polytechnique (Paris)
and of the French Statistical Office INSEE.

Teresina Torre is Full Professor of Organization Studies and Human Resource Management in the Department of Economics and Business Studies, School of Social Science, University of Genoa (Italy), where she teaches organization theory (bachelor's degree) and human resource management (master's degree). She is coordinator of the Master Degree in Management at the University of Genoa and president of the international MBA run jointly by the University of Lima and the University of Genoa. She is vice-president of Assioa – the Association of Italian Organization Studies Academics – and a member of AIDEA – the Italian Academy of Management – and of ICTO – Technologies de l'Information et de la Communication dans les Organisations et au sein de la Société. She is co-editor of Impresa Progetto – Electronic Journal of Management. Her current research interests focus on human work, its organization and assessment, and its evolution in the digital context, with particular attention to change in work and in human resource management, a field in which she has authored numerous publications, including international ones. She has presented her research at national and international conferences such as WOA, RCM, ECIE, ECKM and ICTO, and has published contributions both in books and in national and international journals. She is the representative for Italy of the HRM Digital Lab project, sponsored by Télécom Ecole de Management.

Veeredhi Vasudeva Rao holds a Bachelor's Degree in Mechanical Engineering and a Master's Degree with specialization in Heat Transfer from Andhra University, India. He holds a Doctoral Degree from the faculty of engineering of the Indian Institute of Science (IISc) Bangalore, India, with specialization in Heat Transfer. During his master's and doctoral studies he was a recipient of a GATE national scholarship and a Council of Scientific and Industrial Research (CSIR) fellowship respectively. He was a Post-Doctoral Fellow at Nanyang Technological University, Singapore. Formerly, he was the principal of SreeNidhi Institute of Science and Technology, Hyderabad, India. He was also the Director of the Technology Development and Test Center (TDTC) recognized by the Government of India. He has published 80 research papers in national and international journals of repute and in peer-reviewed conference proceedings. He has guided 4 doctoral research scholars and 25 master's students to successful completion, and has presented several research papers at international conferences in India and the USA. As part of a World Bank project under the Technical Education Quality Improvement Program (TEQIP), he visited the department of mechanical engineering, Texas A&M University, USA. He is principal or co-investigator for 6 national-level research projects. He received the Engineer of the Year award from the Institution of Engineers, Government of Andhra Pradesh, India, and the Best Teacher Award from SreeNidhi Institute of Science and Technology. He was also recognized as the best faculty in the state by Cognizant Technology Solutions, a well-known multinational company in India. During his tenure as principal, SreeNidhi Institute of Science and Technology was given the Best Engineering College award in the state by the Indian Society for Technical Education (ISTE). His current research interests are contact heat transfer, nanotechnology including heat transfer in nano-fluids, energy systems including fuel cells, and cooling of electronic equipment. Currently, he is a full Professor in the Department of Mechanical and Industrial Engineering, College of Science, Engineering and Technology, University of South Africa, where he is in charge of R&D activities in the department. At present, 6 candidates, at UNISA and at universities abroad, are working under his guidance for their doctoral studies.


Index

A
ACMS 104
administrative data 6, 26, 28, 37, 41
aerospace industry 88-90, 102
AI 6, 22, 72
Amplification-Derived Chimeric Sequences 110, 126
annotation 113-115, 118, 126, 163, 165-166, 174
AOG 104
APU 104
association rules 106, 114, 117, 221, 223, 227, 235, 237
ATM 31, 52
AVL 32, 52

B
BCBS 2, 7, 13-14, 22
BD2K 120, 126
big data 1-8, 10-19, 25-45, 55-64, 66, 68-80, 88-90, 92-93, 102-103, 106, 120, 127-129, 131-132, 139-140, 148, 154-160, 162-166, 168, 174, 177, 179-195, 198, 200-213, 216, 218-227, 229, 232-237, 241-256
big data analytics 17, 58, 70-71, 76, 89, 93, 128, 181, 183, 185-186, 220-221, 225, 234-236, 243-245, 249-251
big data governance 56, 88, 155, 162-163, 165-166, 168, 174, 177, 179-182, 184-186, 188-190, 193-195, 200-201, 216
big data technologies 56, 60, 218-220, 250
big data value chain 56, 64, 68
bioinformatics 105-107, 110-115, 118-120, 126
BIS 7, 22
business intelligence 57, 66, 131, 141-142, 181, 200-201, 206, 216, 243-244, 253

C
capability 2, 90-91, 93, 95, 130, 132-135, 142, 148, 158, 174, 200-201, 203-205, 212, 219, 249, 252
case-based reasoning 167, 174, 177
CBSO 22
CCR 13, 22
CDR 31, 52
central banks 1-3, 5, 7, 12-13, 15, 17-19
classification 105-108, 114, 117-120, 134, 158-159, 218, 221-223, 234, 237, 245
complexity science 168, 171, 177
complex problem-solving 155-157, 174, 177
content analysis 56
CPD 43, 52
CPMI 2, 6, 16, 22
CRISPR 112, 126
CRU 100, 104
CVR 93, 104

D
DAMA 180-181, 188, 198-199
data analytics 17, 58, 62, 66, 70-71, 76, 80, 88-90, 93, 100, 102, 128, 132, 136, 142, 144-145, 147, 168, 180-181, 183, 185-186, 194-195, 198, 220-221, 225, 234-236, 243-245, 249-251, 254
data availability 185, 190-191, 198, 236
data dynamics 154, 162-163, 165, 168, 174, 177
data governance 44, 56, 88, 155, 162-163, 165-166, 168, 174, 177, 179-182, 184-186, 188-190, 193-195, 198, 200-201, 209, 216, 235
data integrity 185-186, 190, 192, 198
data mining 43, 59-60, 105-109, 114, 116, 118-120, 219, 228-229, 237
data privacy 14, 179-180, 184, 190, 192, 194, 198-199, 235
data quality 2, 11, 33, 38, 72-73, 128, 181, 192, 198
data security 70, 185, 191, 199, 201
data usability 189, 192, 199
DFDR 104
DGI 12-13, 22
diabetes 105, 118-120
digital data 25-26, 30, 33, 36, 39, 44, 174, 201-202, 227, 234, 243, 245
digitalization 8, 11, 17, 204, 208
Digitization 217
DLT 16, 22
DMBOK 181-182, 189, 199
DPI 104

E
educational application 180, 187
EHM 95-96, 98, 104
engine health monitoring systems 88
ethical consideration 208
ethics 42, 143, 146, 155, 201, 211, 217
evidence 40, 78, 128-129, 137, 141, 154, 165, 167, 177, 211
Evidence Discovery 165, 167
EWGLI 112, 126

F
FADEC 104
family 71, 208, 223, 226-227, 232
FDA 104
FDEP 104
FDM 92-93, 104
FDR 93-94, 104
financial crisis 8, 12, 23
Flight data analysis 104
flight data recorder 93, 104
FMS 99, 104
FMU 104
FRBNY 13, 22
FSB 6, 12, 22
FSI 22

G
G20 23
GAFA 23
GDP 38-39, 52
GDPR 15, 23, 52, 78, 184, 187, 199
GenID framework to big data governance 163, 165-166, 168, 174, 177
GFC 8, 12-13, 17, 19, 23
governance 3, 18, 25-26, 41-42, 44-45, 56, 64, 88, 154-155, 162-166, 168-169, 171-174, 177-182, 184-186, 188-190, 193-195, 198-201, 203-204, 207-212, 216, 235

H
high-throughput sequencing 110-111, 126
HR 17, 23, 127, 129, 131-133, 136-148
Human Resource Management 127, 148

I
IFC 6, 15, 19, 23
IMF 12, 23
INEXDA 23
information retrieval 158, 237, 244
innovation 6, 38, 41, 129, 143-144, 174, 183, 195, 200-204, 208-209, 212-213, 217, 222, 241-242, 252, 254
integration 15, 66, 91, 99, 111, 130, 135, 141-142, 155, 159-160, 162-163, 165, 169, 174, 177, 186, 192, 198, 235, 242, 250, 255
intelligence 2, 6, 14, 22, 57, 60, 66, 71-73, 75, 77-79, 119, 131, 134, 141-142, 157-158, 174, 181, 200-201, 204, 206, 210, 212, 216, 243-244, 253, 256
International Patent Classification 218, 222-223, 237
Internet 5-6, 8, 10-11, 14-17, 30, 32-34, 45, 60, 70, 89, 162, 180, 185, 201, 208, 236, 249
Interoperability 158-160, 163, 169, 174, 177
IO 29, 41-42, 52
IOSCO 2, 23
IRS 41, 52
ISO 15, 23
IT Governance 199
IVHM 89-90, 102-104

K
knowledge 34, 38, 56, 59, 62, 70, 72-73, 92, 96, 98, 106, 114, 116, 118, 120, 126, 129, 131-132, 134, 136-137, 141-142, 144-147, 155, 157-160, 163-171, 174, 177-178, 181, 187, 194, 199-212, 216-217, 220-221, 223, 241-245, 247-256
knowledge management 201, 203, 217, 241-243, 245, 247, 249-256

L
LEI 23
LRU 104

M
machine learning 14, 60, 70-72, 75-80, 107, 113-114, 118-120, 126, 155, 235, 237, 249
Market Entry 179
Meta-Ontology 166, 168-170, 177
micro data 13, 15
mining 2, 11, 43, 59-60, 73, 105-109, 114, 116, 118-120, 163-164, 169-170, 174, 219-224, 228-229, 235-237
MIT 8, 23
ML 114, 126
MLST 106, 112, 126
MLVA 112-113, 126
mother-ontology 169-170, 177
MPC 91, 104
MRO 88, 104
MRSA 113, 126
Multilocus sequence typing 112, 126
Multiple Loci VNTR Analysis (MLVA) 126

N
national statistical offices 2, 19, 26
NSA 41, 52
NSO 28-29, 36-37, 41-45, 52
NSS 29, 39, 42, 44, 52

O
OECD 11, 23, 52, 222
OEM 98, 104
OGD 52
ontology 155, 160, 165-166, 168-169, 174, 177-178
Ontology Modularization 177
organic data 17
OTC 23

P
PaPrBaG 113, 126
patent 31, 218-219, 221-229, 232-237
patent analysis 218-219, 221-225, 232, 236
pathogenicity 113, 120, 126
PATRIC 113-114, 126
PatSeer 222-223, 226, 236
PCM 91, 104
personalisation 180, 187
PGAAP 113-114, 126
PLA 104
policy-making 2, 14-16, 154-157, 168-174, 177
policy use 11, 17
practical wisdom 164, 167, 174, 177
prediction 5, 58, 113, 120, 126, 204, 207, 209-211, 218-219, 221-227, 229, 232-237, 250, 253-254
predictive model 119, 192, 194, 199
prosperity 154, 177, 200-202, 212-213, 217
public authorities 1-3, 7, 12-15, 17

Q
QAR 93, 104

R
recommendation 167, 174
research trends 56
RINS 111, 126

S
SDG 43, 52
search 8, 10, 18, 29, 34-35, 40, 66, 74, 80, 113-114, 117, 159-160, 166-167, 174, 222-223, 234, 236, 250
SHM 99, 104
similarity 12, 177, 221, 223-224
simple patent 222-229, 232-234, 236-237
Single locus sequence typing 112, 126
SLST 112-113, 126
SMDX 15, 23
SMS 92, 104
SNA 7, 23
social network analysis 56, 66
SSFDR 104
start-up company 180, 186, 194
Structuration Theory 168, 177
sub-ontology 170, 178

T
TBO 104
technological field 221
text mining 2, 219-224, 235-236
TR 23

U
UAV 104, 250
UNDS 52

W
WEKA 105, 108-109, 114-120, 126
wisdom 163-164, 167, 169, 171, 174, 177-178, 202, 204
Wisdom Governance 163-164, 169, 171, 174, 178
wisdom memory 167, 178

X
XBRL 15, 23
