
Abstract: Cloud computing is an emerging paradigm that provides computing, communication and storage resources as a service over a network. Communication resources often become a bottleneck in service provisioning for many cloud applications. Data replication, which brings data (e.g., databases) closer to data consumers (e.g., cloud applications), is therefore seen as a promising solution, as it minimizes network delays and bandwidth usage. In this paper we study the performance characteristics of a replicated database under two different update policies. In the synchronous case, a request for any replica of the database can be processed only if no copy of the database is being updated due to a previous write request, whereas in the non-synchronous case read requests may be processed at any time if a database copy is free. We formulate a queueing-theoretic model of the system assuming a Poisson arrival process for both read and write requests. The model is solved using the matrix-geometric solution method, and the relevant performance metrics are derived and analysed. The evaluation results, obtained from both the mathematical model and extensive simulations, help to characterize system performance and to guide future data replication solutions.

1. Introduction

Cloud computing is an emerging technology that attracts ICT service providers by offering tremendous opportunities for the online distribution of services. It offers computing as a utility, sharing the resources of scalable data centers [1,2]. To some, cloud computing seems to be little more than a marketing umbrella encompassing topics such as distributed computing, grid computing, utility computing and software-as-a-service, which have already received significant research focus and commercial implementation. Nonetheless, there is an increasing number of large companies offering cloud computing infrastructure products and services that do not entirely resemble the visions of these individual component topics. End users can benefit from the convenience of accessing data and services globally, from centrally managed backups, high computational capacity and flexible billing strategies [3]. Cloud computing is also ecologically friendly: it benefits from the efficient utilization of servers, data center power planning, large-scale virtualization and optimized software stacks. In 2010, data centers consumed around 1.1–1.5% of global electricity, and between 1.7% and 2.2% of the electricity used in the US [5,6]. Data center consumption was projected to reach almost 140 TWh in 2020 [7].
In data centers, there is an over-provisioning of computing, storage, power distribution and cooling resources to ensure high levels of reliability [8]. Cooling and power distribution systems consume around 45% and 15% of the total energy, respectively, leaving roughly 40% to the IT equipment [9]. This 40% is shared between computing servers and networking equipment. Depending on the data center load level, the communication network consumes 30–50% of the total power used by the IT equipment [10]. In this article we discuss the limitations and opportunities of deploying data management systems on these emerging cloud computing platforms (e.g., Amazon Web Services). We speculate that large-scale data analysis tasks, decision support systems, and application-specific data marts are more likely to take advantage of cloud computing platforms than operational, transactional database systems (at least initially).
There are two main approaches for making data centers consume less energy: shutting components down or scaling down their performance. Both approaches are applicable to computing servers [11,12] and network switches [10,13]. The performance of cloud computing applications, such as gaming, voice and video conferencing, online office suites, storage, backup and social networking, depends largely on the availability and efficiency of high-performance communication resources [14]. For better reliability and high-performance, low-latency service provisioning, data resources can be brought closer to (replicated near) the physical infrastructure where the cloud applications are running. A large number of replication strategies for data centers have been proposed in the literature [8,15–18]. These strategies optimize system bandwidth and data availability between geographically distributed data centers. However, none of them focuses on energy efficiency and replication techniques inside data centers.
To address this gap, we propose a data replication technique for cloud computing data centers which optimizes energy consumption, network bandwidth and communication delay, both between geographically distributed data centers and inside each data center. Specifically, our contributions can be summarized as follows.

 Modeling of energy consumption characteristics of data center IT infrastructures.
 Development of a data replication approach for joint optimization of energy consumption and bandwidth capacity of data centers.
 Optimization of communication delay to provide quality of user experience for cloud applications.
 Performance evaluation of the developed replication strategy through mathematical modeling and using a packet-level cloud computing simulator, GreenCloud [19].
 Analysis of the trade-off between performance, serviceability, reliability and energy consumption.

The rest of the paper is organized as follows: Section 2 highlights relevant related work on
energy efficiency and data replication. In Section 3 we develop a mathematical model for
energy consumption, bandwidth demand and delay of cloud applications. Section 4 provides
evaluation of the model outlining theoretical limits for the proposed replication scenarios.
Section 5 presents evaluation results obtained through simulations. Section 6 concludes
the paper.

2. Related Work

At the component level, there are two main alternatives for making data centers consume less energy: (a) shutting hardware components down or (b) scaling down hardware performance. Both methods are applicable to computing servers and network switches. When applied to servers, the former method is commonly referred to as dynamic power management (DPM) [11]. DPM accounts for most of the energy savings. It is most efficient when combined with a workload consolidation scheduler, i.e., a policy that maximizes the number of idle servers that can be put into a sleep mode, since the average load in cloud computing systems often stays below 30% [11]. The second method corresponds to the dynamic voltage and frequency scaling (DVFS) technology [12]. DVFS exploits the relation between power consumption P, supplied voltage V, and operating frequency f:

P ∝ V² · f.

Reducing the voltage or frequency reduces power consumption. The effect of DVFS is limited, however, as the power reduction applies only to the CPU, while the system bus, memory, disks and peripheral devices continue consuming at their peak rates. Similar to computing servers, most energy-efficient solutions for communication equipment rely on (a) downgrading the operating frequency (or transmission rate) or (b) powering down the entire device or its hardware components in order to conserve energy. Power-aware networks were first studied by Shang et al. [10]: their 2003 work was the first to propose a power-aware interconnection network, utilizing dynamic voltage scaling (DVS) links. DVS technology was later combined with dynamic network shutdown (DNS) to further optimize energy consumption [13].
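To make the scaling relation concrete, the short sketch below estimates total server power at a reduced clock frequency. It is only an illustration of the proportionality discussed above: the assumption that supply voltage scales roughly linearly with frequency, and the wattage figures, are hypothetical rather than taken from the cited works.

```python
# Toy illustration of the DVFS relation P ~ V^2 * f discussed above.
# Assumption (not from the paper): voltage scales roughly linearly with
# frequency, so dynamic CPU power falls approximately with the cube of the
# frequency scaling factor, while static/peripheral power is unaffected.

def dvfs_power(p_dynamic_peak: float, p_static: float, freq_scale: float) -> float:
    """Estimate total server power when the CPU runs at freq_scale * f_max."""
    voltage_scale = freq_scale               # simplifying assumption: V ~ f
    cpu_power = p_dynamic_peak * voltage_scale**2 * freq_scale
    return cpu_power + p_static              # memory, disks, NICs stay at peak

if __name__ == "__main__":
    # Hypothetical numbers: 120 W dynamic CPU power at peak, 80 W for the rest.
    for scale in (1.0, 0.8, 0.6):
        print(f"f = {scale:.1f} * f_max -> {dvfs_power(120.0, 80.0, scale):.1f} W")
```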
Another technology which indirectly affects energy consumption is virtualization.
Virtualization is widely used in current systems [20] and allows multiple virtual machines
(VMs) to share the same physical server. Server resources can be dynamically provisioned to
a VM based on the application requirements. Similar to DPM and DVFS power management,
virtualization can be applied in both the computing servers and network switches, however,
with different objectives. In networking, virtualization enables implementation of logically
different addressing and forwarding mechanisms, and may not necessarily have the goal of
energy efficiency [21].

Cloud computing enables the deployment of immense IT services which are built on
top of geographically distributed platforms and offered globally. For better reliability and
performance, resources can be replicated at redundant locations and using redundant
infrastructures. To address the exponential increase in data traffic [22] and to optimize energy and bandwidth usage in data center systems, several data replication approaches have been proposed.
Maintaining replicas at multiple sites clearly improves performance by reducing remote access delay and mitigating single points of failure. However, additional infrastructure, such as storage and networking devices, is required to maintain data replicas. On top of that, new replicas need to be synchronized, and any changes made at one of the sites need to be reflected at the other locations. This involves underlying communication costs, both in terms of energy and network bandwidth. Data center infrastructures consume significant amounts of energy and remain underutilized [23]. Underutilized resources can be exploited without additional costs. Moreover, the cost of electricity differs across geographical locations [24], making it another parameter to consider in the data replication process.
In [15], an energy-efficient data replication scheme for data center storage is proposed. Underutilized storage servers can be turned off to minimize energy consumption, while one replica server for every data object is kept alive to guarantee availability. In [8], dynamic data replication in a cluster of data grids is proposed. This approach creates a policy maker which is responsible for replica management. It periodically collects information from the cluster heads, whose significance is determined by a set of weights selected according to the age of the reading. The policy maker further determines the popularity of a file based on its access frequency. To achieve load balancing, the number of replicas for a file is computed in relation to the access frequency of all other files in the system. This solution follows a centralized design approach and is thus exposed to a single point of failure.
In [16], the authors suggest a replication strategy across multiple data centers to
minimize power consumption in the backbone network. This approach is based on linear
programming and determines optimal points of replication based on the data center traffic
demands and popularity of data objects. Since power consumption of aggregation ports is
linearly related to the traffic load, an optimization based on the traffic demand can bring
significant power savings. This work focuses on replication strategies between different data
centers, but not inside data centers.
Another optimization of data replication across data centers is proposed in [17]. The aim is to minimize data access delay by replicating data closer to data consumers. The optimal location of the replicas of each data object is determined by periodically processing a log of recent data accesses. The replica site is then determined by applying a weighted k-means clustering to the user locations and deploying a replica close to the centroid of each cluster. Migration from an old site to a new site is performed if the quality-of-service gain of the migration (in communication cost) is higher than a predefined threshold.
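The placement step of [17] can be illustrated with a small weighted k-means sketch. The code below is not the authors' implementation; it simply runs plain Lloyd iterations in which each user location is weighted by its access count, so the resulting centroids suggest candidate replica sites.

```python
import numpy as np

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Plain Lloyd iterations with per-point weights (access frequencies)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the weighted mean of its members.
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask]
                centroids[j] = (points[mask] * w[:, None]).sum(0) / w.sum()
    return centroids, labels

if __name__ == "__main__":
    # Hypothetical user coordinates and their access counts.
    users = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [5.1, 5.0]])
    accesses = np.array([10.0, 3.0, 40.0, 25.0, 30.0])
    centroids, _ = weighted_kmeans(users, accesses, k=2)
    print("Candidate replica sites near centroids:\n", centroids)
```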
A cost-based data replication approach for cloud data centers is proposed in [18]. It analyzes data storage failures and data loss probability, which are directly related, and builds a reliability model. Replica creation time points are then determined from the data storage reliability function.
The approach presented in this paper differs from the replication approaches discussed above in
(a) the scope of data replication, which is implemented both within a data center and between geographically distributed data centers, and
(b) the optimization target, which takes into account system energy consumption, network bandwidth and communication delay to define the employed replication strategy.

3. Data Replication

Data replication is the process of storing data at more than one site or node. It is useful for improving the availability of data: data is copied from a database on one server to another server, so that all users can share the same data without any inconsistency. The result is a distributed database in which users can access data relevant to their tasks without interfering with the work of others.

Data replication encompasses duplication of transactions on an ongoing basis, so that each replica is kept in a consistently updated state and synchronized with the source. This is in contrast to data fragmentation, in which data is also distributed across locations but each particular relation resides at only one location.

Replication can be full, in which the whole database is stored at every site, or partial, in which some frequently used fragments of the database are replicated and others are not.

3.1 Types of Data Replication –

1. Transactional Replication – In transactional replication, users receive a full initial copy of the database and then receive updates as the data changes. Data is copied in (near) real time from the publisher to the receiving database (the subscriber) in the same order as the changes occur at the publisher; therefore, transactional consistency is guaranteed. Transactional replication is typically used in server-to-server environments. It does not simply copy the data changes, but consistently and accurately replicates each change.
2. Snapshot Replication – Snapshot replication distributes data exactly as it appears at a specific moment in time and does not monitor for subsequent updates to the data. The entire snapshot is generated and sent to the subscribers. Snapshot replication is generally used when data changes are infrequent. It is a bit slower than transactional replication, because on each attempt it moves multiple records from one end to the other. Snapshot replication is a good way to perform the initial synchronization between the publisher and the subscriber.
3. Merge Replication – Data from two or more databases is combined into a single database. Merge replication is the most complex type of replication, because it allows both the publisher and the subscribers to make changes to the database independently. Merge replication is typically used in server-to-client environments. It allows changes to be sent from one publisher to multiple subscribers. A toy sketch contrasting the transactional and snapshot modes is given below.
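The toy sketch below contrasts the first two modes. It is purely illustrative and not tied to any particular DBMS: the transactional subscriber applies the publisher's ordered change log incrementally, while the snapshot subscriber only receives periodic full copies and may therefore serve stale data between refreshes.

```python
class Publisher:
    """Toy publisher that records every change in an ordered log."""
    def __init__(self):
        self.data = {}
        self.log = []                      # ordered list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class TransactionalSubscriber:
    """Receives each change in commit order, so it stays consistent."""
    def __init__(self):
        self.data = {}
        self.applied = 0                   # position in the publisher's log

    def sync(self, pub):
        for key, value in pub.log[self.applied:]:
            self.data[key] = value
        self.applied = len(pub.log)

class SnapshotSubscriber:
    """Receives a full copy of the database as of a point in time."""
    def __init__(self):
        self.data = {}

    def sync(self, pub):
        self.data = dict(pub.data)         # whole snapshot, not individual changes

if __name__ == "__main__":
    pub = Publisher()
    tx, snap = TransactionalSubscriber(), SnapshotSubscriber()
    pub.write("x", 1); pub.write("y", 2)
    tx.sync(pub)                           # incremental, ordered updates
    snap.sync(pub)                         # full copy as of this moment
    pub.write("x", 3)
    tx.sync(pub)
    print(tx.data, snap.data)              # {'x': 3, 'y': 2} vs stale {'x': 1, 'y': 2}
```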

3.2 Replication Schemes

1. Full Replication – The most extreme case is replication of the whole database at every site in the distributed system. This improves the availability of the system, because the system can continue to operate as long as at least one site is up.

Advantages of full replication –

 High availability of data.
 Improved performance for the retrieval of global queries, as the result can be obtained locally from any of the local sites.
 Faster execution of queries.

Disadvantages of full replication –

 Concurrency control is difficult to achieve under full replication.
 The update process is slow, as a single update must be performed at every database copy to keep the copies consistent.

2. No Replication – The other extreme case involves having no replication at all – that is, each fragment is stored at only one site.

Advantages of no replication –

 The data can be easily recovered.
 Concurrency control is easy to achieve.

Disadvantages of no replication –

 Since multiple users access the same server, query execution may be slowed down.
 The data is less readily available, as no other site holds a copy.

3. Partial Replication – In this type of replication, some fragments of the database are replicated whereas others are not. The number of copies of a fragment may range from one up to the total number of sites in the distributed system. The description of the replication of fragments is sometimes called the replication schema (a small illustration of such a schema is given after this item).

Advantages of partial replication –

 The number of copies of each fragment can be matched to the importance of the data.
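A replication schema of this kind can be viewed simply as a mapping from each fragment to the set of sites that hold a copy; full, partial and no replication then differ only in the size of these sets. The fragment and site names in the following sketch are hypothetical.

```python
SITES = {"dc1", "dc2", "dc3"}

# Hypothetical replication schemas: fragment -> set of sites holding a copy.
full_replication    = {"customers": set(SITES), "orders": set(SITES)}
no_replication      = {"customers": {"dc1"}, "orders": {"dc2"}}
partial_replication = {"customers": {"dc1", "dc2"}, "orders": {"dc3"}}

def sites_for(schema, fragment):
    """Return the sites that can serve a read of the given fragment."""
    return schema.get(fragment, set())

def update_cost(schema, fragment):
    """Number of copies that must be kept consistent on a write."""
    return len(sites_for(schema, fragment))

if __name__ == "__main__":
    for name, schema in [("full", full_replication),
                         ("none", no_replication),
                         ("partial", partial_replication)]:
        print(name, "write cost for 'customers':", update_cost(schema, "customers"))
```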

3.3 ADVANTAGES OF DATA REPLICATION – Data replication is generally performed to:

 Provide a consistent copy of data across all the database nodes.
 Increase the availability of data.
 Increase the reliability of data.
 Support multiple users and deliver high performance.
 Keep secondary (slave) databases up to date by refreshing outdated or incomplete copies.
 Reduce data movement, since with replicas the data is often found at the site where the transaction is executing.
 Achieve faster execution of queries.

3.4 DISADVANTAGES OF DATA REPLICATION –

 More storage space is needed, as storing replicas of the same data at different sites consumes more space.
 Data replication becomes expensive when the replicas at all the different sites need to be updated.
 Maintaining data consistency across all the different sites requires complex measures.

4. Replication in the Cloud

To ensure performance, or rather to satisfy an agreed-upon performance level, service providers can benefit from a plethora of choices. Among these, data replication is a very well-known and well-researched data management technique that has been used for decades in many systems. The benefits of data replication include increased performance through the strategic placement of replicas, improved availability by keeping multiple copies of data sets, and better fault tolerance against possible server failures.
When tenant queries are submitted to the data management system, they may, depending on the execution plan (e.g., the number of joins), require a number of relations in order to proceed with execution. Naturally, in a large-scale environment where relations are fragmented and distributed geographically over multiple servers, not all required data may be present on the executing node itself. Considering that a query is processed on multiple servers according to inter-operator and intra-operator parallelism, it is quite likely that some remote data will have to be shipped from faraway servers. When the network bandwidth to the remote servers is not abundant, e.g. because the remote data resides at a geographically separate location, a bottleneck may arise during query execution that ultimately leads to response-time dissatisfaction.
To ensure that the query response time objective is met, the bottleneck data should be identified heuristically and selected for possible replication before the query even starts executing. When to trigger the actual replication is another important decision that must be made towards the same goal. Deciding how many replicas to create, and how to retire unused replicas, must also be dealt with further down the road in the data replication decision process. Strategic placement of the newly created replicas plays a key role in reducing data access latency and improving response-time satisfaction. Undoubtedly, all of these replication decisions should be made from a cost-effective point of view to ensure the economic benefit of the provider, which is especially important in economy-based large-scale systems such as cloud computing.
To deal with the issues above, a good data replication strategy must be able to decide in a meaningful way: (i) what to replicate, to correctly determine which fragments of relations are in need of replication; (ii) when to replicate, to respond to changes in data demand in a timely manner and quickly resolve performance problems; (iii) how many replicas to create, to avoid wasting precious resources such as storage, keep costs down and retire unnecessary replicas accordingly; and finally (iv) where to replicate, to strategically place newly created replicas so that tenant performance expectations are met and any possible penalties are avoided. Moreover, all of these decisions should be based on criteria that are consistent with the aims of both the tenant and the provider. A small illustrative decision rule is sketched below.
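The following sketch turns the four questions above into a tiny rule set for a single fragment. The thresholds, cost terms and statistics used here are illustrative assumptions, not values or rules from the strategy proposed in this paper.

```python
from dataclasses import dataclass

@dataclass
class FragmentStats:
    name: str
    read_rate: float        # reads per second observed for this fragment
    write_rate: float       # writes per second (every replica must apply these)
    remote_delay: float     # average delay (s) when served from a remote site

# Illustrative thresholds, not values from the paper.
DELAY_TARGET = 0.050        # response-time objective per access (s)
MAX_REPLICAS = 3            # storage/cost budget per fragment

def decide(stats: FragmentStats, current_replicas: int) -> str:
    """Very small what/when/how-many heuristic for one fragment."""
    # WHAT/WHEN: replicate if remote accesses are about to violate the target.
    if stats.remote_delay > DELAY_TARGET and current_replicas < MAX_REPLICAS:
        # HOW MANY: only add a copy if reads dominate the extra update traffic.
        if stats.read_rate > 2.0 * stats.write_rate * (current_replicas + 1):
            return "add replica (place near the busiest consumer site)"
    # Retirement: drop a copy whose update cost outweighs its read benefit.
    if current_replicas > 1 and stats.read_rate < stats.write_rate:
        return "retire one replica"
    return "keep as is"

if __name__ == "__main__":
    hot = FragmentStats("orders", read_rate=400.0, write_rate=20.0, remote_delay=0.120)
    cold = FragmentStats("audit_log", read_rate=2.0, write_rate=15.0, remote_delay=0.030)
    print(decide(hot, current_replicas=1))   # -> add replica ...
    print(decide(cold, current_replicas=2))  # -> retire one replica
```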

Data availability and durability are paramount for cloud storage providers, as data loss or unavailability can be damaging both to the bottom line (by failing to hit targets set in service level agreements [2]) and to business reputation (outages often make the news [3]). Data availability and durability are typically achieved through under-the-covers replication, i.e., data is automatically replicated without customer interference or requests. Large cloud computing providers with data centers spread throughout the world are able to provide high levels of fault tolerance by replicating data across large geographic distances. Amazon's S3 cloud storage service, for example, replicates data across "regions" and "availability zones" so that data and applications can persist even in the face of the failure of an entire location. The customer should, however, be careful to understand the details of the replication scheme; for example, Amazon's EBS (Elastic Block Store) replicates data only within the same availability zone and is thus more prone to failures.

5. Replicated Database Analysis in the Cloud

Consider the performance of a database system in which there are two replicas of the data. Each replica is independently accessible and is modelled as a server. Requests to access the database are assumed to queue at a central location and correspond to read or write operations. To preserve the integrity of both copies of the database, we assume that write requests must wait until both copies of the database are available before beginning execution. Both copies are assumed to be updated in parallel and released simultaneously. Read requests can be processed by either copy of the database. Both types of requests wait in the queue in the order in which they arrive. We assume that requests arrive to the system from a Poisson point source with intensity λ, and that the probability that a given request is a read (resp., write) is r (resp., 1−r). Service times for both read and write requests are assumed to be exponentially distributed with rate μ. Since writes are served in parallel, the total service time of a write request equals the maximum of two exponential random variables with parameter μ.

We let I_t be the number of requests at time t that are waiting in the queue, and let J_t, 0 ≤ J_t ≤ 2, be the number of replicas that are involved in a read or write operation at time t. Our assumptions above imply that (I_t, J_t) is a Markov process. The state transition diagram for the process is given in Figure 9.5.

We explain some of the transitions in the repeating portion of the process. In state (2,2), both servers are busy serving requests, and the request at the head of the queue is an as-yet-unexamined arrival; thus it is a read (resp., write) request with probability r (resp., 1−r). Upon a service completion, which occurs at rate 2μ, there are two possibilities depending on the type of request at the head of the queue. If it is a read request, then the next state is (1,2), and hence the transition rate from (2,2) to (1,2) is 2μr. On the other hand, if the head of the queue is a write request, then the next state is (2,1), since the write must wait until all servers are available before beginning execution; hence the transition rate from (2,2) to (2,1) is 2μ(1−r). The remaining transitions can be explained in a similar manner.
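As a sanity check on these transition rates, the chain (I_t, J_t) can be simulated directly. The sketch below is an illustration under the stated model assumptions (it is not part of the original analysis): it draws exponential holding times, picks the next state in proportion to the rates just described, and reports the time-average number of waiting requests for hypothetical parameter values. Its output can be compared against the matrix-geometric computation sketched at the end of this section.

```python
import random

def simulate(lam, mu, r, horizon=200_000.0, seed=1):
    """Simulate the (I_t, J_t) chain using the transition rates described above."""
    random.seed(seed)
    i, j = 0, 0                      # requests waiting, replicas busy
    t, area = 0.0, 0.0               # elapsed time and time-integral of I_t
    while t < horizon:
        rates = []
        # Arrival transitions (rate lambda, split by request type when it matters).
        if j == 0:                                # empty system
            rates += [(lam * r, (0, 1)), (lam * (1 - r), (0, 2))]
        elif i == 0 and j == 1:                   # one replica busy, nothing queued
            rates += [(lam * r, (0, 2)), (lam * (1 - r), (1, 1))]
        else:                                     # new arrivals simply join the queue
            rates += [(lam, (i + 1, j))]
        # Service-completion transitions.
        if j == 1:
            # With one replica busy, anything waiting must be a write at the head.
            rates += [(mu, (0, 0) if i == 0 else (i - 1, 2))]
        elif j == 2:
            if i == 0:
                rates += [(2 * mu, (0, 1))]
            else:                                 # head of queue: read w.p. r, write w.p. 1-r
                rates += [(2 * mu * r, (i - 1, 2)), (2 * mu * (1 - r), (i, 1))]
        total = sum(rate for rate, _ in rates)
        dt = random.expovariate(total)
        area += i * dt
        t += dt
        u, acc = random.uniform(0.0, total), 0.0
        for rate, nxt in rates:                   # pick the next state proportionally to its rate
            acc += rate
            if u <= acc:
                i, j = nxt
                break
    return area / t

if __name__ == "__main__":
    # Hypothetical, stable parameters (lambda < 2*mu / (3 - 2*r)).
    print("estimated mean number waiting:", round(simulate(lam=0.8, mu=1.0, r=0.7), 3))
```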

If we order the states lexicographically, i.e., (0,0), (0,1), (0,2), (1,1), (1,2), (2,1), (2,2), ..., the generator matrix of the process is given by

$$
Q = \begin{pmatrix}
-\lambda & \lambda r & \lambda(1-r) & 0 & 0 & 0 & 0 & \cdots \\
\mu & -(\lambda+\mu) & \lambda r & \lambda(1-r) & 0 & 0 & 0 & \cdots \\
0 & 2\mu & -(\lambda+2\mu) & 0 & \lambda & 0 & 0 & \cdots \\
0 & 0 & \mu & -(\lambda+\mu) & 0 & \lambda & 0 & \cdots \\
0 & 0 & 2\mu r & 2\mu(1-r) & -(\lambda+2\mu) & 0 & \lambda & \cdots \\
0 & 0 & 0 & 0 & \mu & -(\lambda+\mu) & 0 & \cdots \\
0 & 0 & 0 & 0 & 2\mu r & 2\mu(1-r) & -(\lambda+2\mu) & \cdots \\
\vdots & & & & & \ddots & & \ddots
\end{pmatrix}
$$

This generator is of matrix-geometric form. To see this, identify the matrices of (9.13) as follows:

$$
B_{0,0} = \begin{pmatrix} -\lambda & \lambda r & \lambda(1-r) \\ \mu & -(\lambda+\mu) & \lambda r \\ 0 & 2\mu & -(\lambda+2\mu) \end{pmatrix}, \qquad
B_{0,1} = \begin{pmatrix} 0 & 0 \\ \lambda(1-r) & 0 \\ 0 & \lambda \end{pmatrix},
$$

$$
B_{1,0} = \begin{pmatrix} 0 & 0 & \mu \\ 0 & 0 & 2\mu r \end{pmatrix}, \qquad
B_{1,1} = A_1 = \begin{pmatrix} -(\lambda+\mu) & 0 \\ 2\mu(1-r) & -(\lambda+2\mu) \end{pmatrix},
$$

$$
A_0 = \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix} \qquad \text{and} \qquad
A_2 = \begin{pmatrix} 0 & \mu \\ 0 & 2\mu r \end{pmatrix}.
$$

