
Formalizing Big Data Processing Lifecycles:

Acquisition, Serialization, Aggregation, Analysis,


Mining, Knowledge Representation, and
Information Dissemination
Khalil Abuosba
Ted Rogers School of Information Technology Management
Ryerson University
Toronto, Canada

Abstract—In today's e-Business environment, ERP, CRM, collaboration tools, and networked sensors may be characterized as data-generating resources. Business Intelligence (BI) is a term that incorporates a range of analytical and decision support applications in business, including data mining, decision support systems, knowledge management systems, and online analytical processing; processing data within these systems produces new data that grow so rapidly that they cause data management limitations if handled by a Relational Database Management System (RDBMS) or statistical tools. Collectively, these structured and unstructured data are referred to as Big Data. Successful and efficient handling of Big Data requires deployment of specific IT infrastructure components as well as adoption of an emerging service model. In this research we introduce a conceptual model that abstracts the processing scheme of the big data processing lifecycle. The model addresses the main phases of the lifecycle: data acquisition, data serialization, data aggregation, data analysis, data mining, knowledge representation, and information dissemination. The model is driven by projecting Service Oriented Architecture attributes onto the building blocks of the lifecycle and adhering to the Lifecycle Modeling Language specification.

Keywords—big data, processing, lifecycle, model.

978-1-4799-6908-1/15/$31.00 ©2015 Crown

I. INTRODUCTION

Business Intelligence (BI) is a term that incorporates a range of analytical and decision support applications in business, including data mining, decision support systems, knowledge management systems, and online analytical processing [1]. Big data can be described as large collections of data, structured or unstructured, that grow so large and so quickly that they are difficult to manage with conventional database or statistical tools [2]. Big data is defined as large pools of data that can be captured, communicated, aggregated, stored, and analyzed; it is now part of every sector and function of the global economy. [3] argues that big data is data which "exceed(s) the capacity or capability of current or conventional methods and systems." NIST defines Big Data as a paradigm where the data volume, acquisition velocity, or data representation limits the ability to perform effective analysis using traditional relational approaches, or requires the use of significant horizontal scaling (more nodes) for efficient processing.

II. DATA TYPES, SOURCES AND PROPERTIES

Data are classified as structured (a relational data model), semi-structured (a partial data model), or unstructured (no data model).

Big data are generated by executing business and personal data transformation processes; these data may be classified as transaction data, web text resources, log data (aka machine data), events, e-mail, social media, sensors, external feeds and live streams, RFID scans, form text, explicit geographic positioning information known as geospatial data, audio, still images, and videos.

Fig. 1. Types of Big Data

Big data, in general, are grouped into three types: archived data, repository data, and flowing data (Fig. 1). Flowing data (aka data in motion) depend on five factors: velocity (the rate of data flow), variability (change in the rate of data flow, structure, and refresh rate), accessibility, transport format, and transport protocols.

III. THE BUILDING BLOCKS OF THE BIG DATA PROCESSING LIFECYCLE

We identify the main building blocks of the big data lifecycle based on the functional requirements of the system: Data Acquisition and Collection, Data Serialization, Data Aggregation, Data Mining, and Information Dissemination. Fig. 2 illustrates a context-level diagram that represents the data flows
between the main entities of the system: Data Serializer, Data Collector, Data Aggregator, Data Warehouse System, Service Requester, and Analytics.

IV. BIG DATA PROCESSING AND SERVICE ORIENTED ARCHITECTURE
The Organization for the Advancement of Structured
Information Standards (OASIS) defines Service Oriented
Architecture (SOA) as “a paradigm for organizing and utilizing
distributed capabilities that may be under the control of
different ownership domains.”[4]. SOA description features
three attributes: visibility, interaction, and effect. Visibility
refers to the provider’s and requester’s ability to interact and
accomplish the requested service capability. This attribute
introduces opportunities for matching the requester’s needs to
the provider’s capabilities and vice versa. We can decompose
visibility according to related attributes or functional
requirements, such as awareness, willingness, reachability, and
self-serviceability. The second attribute is interaction, which occurs when the requester uses one of the provider's capabilities. Interaction is mediated through message exchange and produces an effect (or set of effects). An effect (the third
attribute of SOA) can take the form of information resulting
from using a capability, a state change in entities (defined or
undefined), or a combination of both [4]. Here, we define a
service as a mechanism that lets a requester (consumer) use
predefined capabilities.
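The three SOA attributes can be made concrete with a small sketch. This is illustrative only: the paper defines a service abstractly as a mechanism letting a requester use predefined capabilities, and the class, method, and capability names below are assumptions, not part of the OASIS model.

```python
from dataclasses import dataclass, field

@dataclass
class Effect:
    """The result of using a capability: returned information and/or a state change."""
    information: dict
    state_changed: bool

@dataclass
class ServiceProvider:
    # Visibility: the provider advertises its capabilities so a requester
    # can discover them (awareness and reachability).
    capabilities: dict = field(default_factory=dict)

    def advertise(self):
        return list(self.capabilities)

    def handle(self, message):
        # Interaction: mediated through message exchange.
        capability = self.capabilities[message["capability"]]
        result = capability(message["payload"])
        # Effect: information produced by using the capability.
        return Effect(information={"result": result}, state_changed=False)

@dataclass
class ServiceRequester:
    def request(self, provider, name, payload):
        if name not in provider.advertise():  # visibility check
            raise LookupError(f"capability {name!r} not visible")
        return provider.handle({"capability": name, "payload": payload})

provider = ServiceProvider(capabilities={"aggregate": lambda xs: sum(xs)})
effect = ServiceRequester().request(provider, "aggregate", [1, 2, 3])
print(effect.information)  # {'result': 6}
```

The point of the sketch is the separation of concerns: visibility is checked before any exchange, interaction happens only through messages, and the outcome is always an explicit effect.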

Fig. 2. A Context-level Data Flow Diagram (DFD) for the Big Data
Processing System.
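The context-level diagram can be thought of as a small directed graph of entities and labeled data flows. The entity names below come from the text; the specific flows are assumptions for illustration, not a reproduction of Fig. 2.

```python
# Hypothetical flows between the six context-level entities named in the text.
flows = [
    ("Data Collector",    "Data Serializer",       "raw data"),
    ("Data Serializer",   "Data Aggregator",       "serialized records"),
    ("Data Aggregator",   "Data Warehouse System", "aggregated data"),
    ("Service Requester", "Analytics",             "query"),
    ("Analytics",         "Data Warehouse System", "data request"),
]

def entities(edges):
    """All distinct entities appearing in the flow list, in first-seen order."""
    seen = []
    for src, dst, _ in edges:
        for e in (src, dst):
            if e not in seen:
                seen.append(e)
    return seen

for src, dst, label in flows:
    print(f"{src} --[{label}]--> {dst}")
print(sorted(entities(flows)))
```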

A. The Data Acquisition Process

B. The Data Serialization Process

C. The Data Aggregation Process

D. The Data Analysis Process

E. The Data Mining Process

F. The Knowledge Representation Process

G. The Information Dissemination Process

Fig. 3. The main building blocks of the Big Data Processing Lifecycle.

Fig. 4. SOA Functional Requirements Conceptual Model.

Mapping the SOA conceptual model to Big Data scenarios produces the SOA Big Data Processing Conceptual Model (Fig. 5).

Fig. 5. SOA Big Data Processing Conceptual Model.

Fig. 6. Level-0 Data Flow Diagram (DFD) for the Big Data Processing System.
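The lifecycle processes above (A–G) form a staged pipeline. A minimal sketch follows, with placeholder stage implementations and assumed sample data; only the serialization step does real work, using JSON as the transport format.

```python
import json

# Placeholder implementations of the lifecycle phases; a real system would
# back each stage with dedicated infrastructure. Stage names follow A-G above.

def acquire():
    # A. Data acquisition: e.g. sensor readings or log events (sample data).
    return [{"sensor": "s1", "value": 21.5}, {"sensor": "s2", "value": 19.5}]

def serialize(records):
    # B. Data serialization: encode records in a transport format (JSON here).
    return [json.dumps(r) for r in records]

def aggregate(serialized):
    # C. Data aggregation: decode and combine records into a summary.
    records = [json.loads(s) for s in serialized]
    return {"count": len(records),
            "mean": sum(r["value"] for r in records) / len(records)}

def analyze(summary):
    # D./E. Analysis and mining: derive knowledge from the aggregate.
    return {"anomaly": summary["mean"] > 25, **summary}

def represent(knowledge):
    # F. Knowledge representation: a human-readable form.
    return f"{knowledge['count']} readings, mean={knowledge['mean']:.1f}"

def disseminate(report):
    # G. Information dissemination: deliver to consumers (stdout here).
    print(report)
    return report

report = disseminate(represent(analyze(aggregate(serialize(acquire())))))
# prints "2 readings, mean=20.5"
```

Composing the stages as plain functions mirrors the data-flow view of the lifecycle: each process consumes the previous process's output, exactly as the flows in the DFD suggest.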

Based on the context diagram, a logical Level-0 DFD is exploded (Fig. 6). The model illustrates the main processes' interactions with the main entities of the system.

V. THE LIFECYCLE MODELING LANGUAGE

Lifecycle Modeling Language (LML) [5] is a modeling language based on an entity, relationship, and attribute metadata model [5]. Entities in LML are represented as actors. LML is considered an extension of SysML [6]. Advantages of the language include support for capturing and tracing information throughout the lifecycle. LML supports a Documentation Model, a Functional Model (modeling actions and input/output processes), a Physical Model (modeling assets, resources, and connections), and Parametric Entities (addressing parametric entities such as measures, risk, and location). Functional models can be easily deduced from logical and physical DFDs. Lifecycles are defined by drafting the Action Diagram, the Asset Diagram, and the Spider Diagram; some other optional diagrams are also supported.

CONCLUSION AND FUTURE WORK

Formalizing the Big Data Processing Lifecycle is a complicated process. In this research, we have identified the main building blocks of the processing lifecycle and abstracted the main entities of the system. A context diagram and level-0 data flow diagrams are presented. We have utilized the Lifecycle Modeling Language to draft a big data processing lifecycle model. These models shall provide data scientists with the knowledge they need to understand big data processing requirements.

Future work will cover the process of generating a level-1 diagram by exploding the level-0 diagram and decomposing each of the processes within the level-0 diagram into a set of sub-processes. A formal requirements engineering methodology will be evaluated and selected, and processes will be defined based on the requirements analysis process. A physical DFD will also be generated based on an implementation of a big data processing platform. Additionally, a system lifecycle model will be designed based on the logical and physical DFDs in conformance with the Lifecycle Modeling Language.
REFERENCES

[1] Marakas, O'Brien, Introduction to Information Systems, 6th Edition, McGraw-Hill, 2013.
[2] Raul F. Chong, Clara Liu, DB2 Essentials: Understanding DB2 in a Big Data World, IBM Press, 2014.
[3] The National Institute of Standards and Technology's Joint Cloud and Big Data Workshop, http://www.nist.gov/itl/cloud/cloudbdworkshop.cfm.
[4] "OASIS Reference Model for Service Oriented Architecture 1.0," public review draft 2, 2006; www.oasis-open.org/committees/download.php/18486/pr-2changes.pdf.
[5] Lifecycle Modeling Language Specification, V.1, http://www.lifecyclemodeling.org/specification/.
[6] OMG SysML 1.3 Specification, http://www.omg.org/spec/SysML/1.3/.
