Data Replica Placement Mechanism For Open Heterogeneous Storage Systems

Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 109C (2017) 18–25
www.elsevier.com/locate/procedia
The 8th International Conference on Ambient Systems, Networks and Technologies (ANT 2017)
X. Xu^{a,b,∗}, C. Yang^{a}, J. Shao^{c}
a School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
b State Key Laboratory of Information Security, Chinese Academy of Sciences, Beijing 100093, China
c Institute of Big Data Research at Yancheng, Nanjing University of Posts and Telecommunications, Yancheng 224005, China
Abstract
Many big data-oriented network storage platforms are open and heterogeneous, which means that the data placement mechanism has a great impact on system performance. For open heterogeneous storage systems, the data replica placement mechanism needs to consider data availability, load balance and quality of service. However, current data placement mechanisms are mainly designed for simple homogeneous storage systems and are not entirely suitable for heterogeneous ones. We propose a novel data replica placement mechanism with the adjustable replica deployment strategy (ARDS) for open heterogeneous data storage systems, which considers data availability, data access frequency and storage capacity. The mechanism takes data availability as the premise for the initial number of replicas, then adjusts the number of replicas according to the data access frequency, and places replicas based on the spare capacity of nodes, which effectively avoids load imbalance. We implemented simulation experiments based on OMNeT++. The experimental results show that the ARDS-based mechanism improves system performance, including the system load balance and the rate of successful data access.
1877-0509 © 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Data Replica, Replica Placement, Heterogeneous Storage Systems
1. Introduction
The Internet of Things, social networks and mobile computing are becoming more and more popular, and big data storage platforms are becoming increasingly open and heterogeneous. The data replication mechanism of a storage system has a great impact on system performance [1]. The key issues in placing data replicas in heterogeneous storage systems are determining the number of replicas and their storage locations [2]. The number of replicas has a great impact on the data availability of distributed storage systems. Too few replicas can easily lead to overheating of some replicas and overload of storage nodes, while too many replicas waste storage resources. In heterogeneous storage systems, nodes differ in performance, so the choice of nodes and of replica storage locations affects the quality of service when replicas are accessed.

∗ Corresponding author. Tel.: +8613813885172; fax: +862585866433.
E-mail address: xuxl@njupt.edu.cn
DOI: 10.1016/j.procs.2017.05.290
Current replica placement strategies can be divided into source request placement, priority placement, path placement, neighbor node placement and random placement, among others [3,4,5]. Lightweight adaptive replication is a typical source request placement strategy. Its advantage is that the creation of a new replica is triggered only when the existing nodes reach the placement threshold, which reduces the creation of redundant replicas and thus the cost of replica creation and maintenance. Its disadvantage is that the load of a storage node is easily exceeded, which unbalances the overall system load. In priority placement, when an access request arrives at the target storage node, the other storage nodes transfer the same replica to the visited node. The advantage is that the number of storage nodes is reduced; the disadvantage is that hot spots are easily generated and the overall system load becomes unbalanced. Path placement is a simple and convenient method that queries all nodes on the request path when users access a replica, but it easily causes data redundancy, which wastes storage resources and raises the maintenance cost of replica consistency. Neighbor node placement relies on the saved history of replica accesses: when the number of requests to a node reaches the threshold, the node selects a neighbor node as the new storage node and directs accesses to that neighbor. The advantages of random placement are load balancing and reduced access latency; its shortcoming is that the number of replicas becomes too large. Yuan et al. proposed a data replica placement strategy based on the simulated annealing algorithm, which provides a local replica of remote data that users can quickly access and process, and finds the optimal replica placement node by simulated annealing [6]. Pang et al. proposed a replica creation policy that includes an intra-domain replica derivation policy and an inter-domain replica extension strategy [7]. Replicas spread across domains according to the frequency of replica visits, which improves the response speed of user access and reduces bandwidth consumption. Yang et al. proposed an optimal replica placement algorithm based on user interest, which extracts the content interest of user groups and prioritizes replicas according to the group's interest value [8].
However, existing data placement mechanisms mainly target simple homogeneous storage environments rather than the increasingly complex and dynamic heterogeneous storage environment [9]. For open heterogeneous storage systems, the data replica placement mechanism focuses on replica consistency and reliability. Heterogeneous storage systems need to achieve data availability and load balancing, and to ensure the quality of service (QoS). In this paper, we propose a new replica placement mechanism with the adjustable replica deployment strategy (ARDS) for heterogeneous data storage systems, which considers data availability, replica access frequency and the residual capacity of storage nodes. The mechanism first initializes the number of replicas on the premise of data availability. The number of replicas is then adjusted according to the replica access frequency. Replica placement is based on the remaining storage capacity of nodes and prefers nodes with higher currently available capacity, thus effectively avoiding load imbalance among storage nodes.
2. Data replica placement mechanism
2.1. Determination of the initial number of data replicas
In an open heterogeneous storage system, not all online storage nodes are stable and high-performance, so it is important to consider node availability when determining the initial number of replicas. Node availability is defined as event $N$ with probability of occurrence $P(N)$; node unavailability is then the event $\bar{N}$, with probability $P(\bar{N}) = 1 - P(N)$.
We also need to consider data availability, which is measured at both the file level and the data block level. The probability that a data block is available is defined as $P(B)$, and the probability that it is unavailable is $P(\bar{B}) = 1 - P(B)$. Likewise, the probability that a data file is available is $P(F)$, and the probability that it is unavailable is $P(\bar{F}) = 1 - P(F)$.
It is assumed that a file consists of $j$ data blocks $\{block_1, block_2, block_3, \ldots, block_j\}$, and that each data block has $r$ replicas, described as $\{replica_1, replica_2, replica_3, \ldots, replica_r\}$. A block $block_j$ is unavailable only when all of its replicas are unavailable, that is

$$
\begin{aligned}
P(\bar{B}) &= P(\bar{N}_1 \times \bar{N}_2 \times \ldots \times \bar{N}_R \times \ldots \times \bar{N}_r) \\
&= P(\bar{N}_1) \times P(\bar{N}_2) \times \ldots \times P(\bar{N}_R) \times \ldots \times P(\bar{N}_r) \\
&= \prod_{R=1}^{r} f_R \quad (R \in (1, 2, \ldots, r))
\end{aligned}
\tag{1}
$$

Matching the last two lines term by term, $f_R$ stands for $P(\bar{N}_R)$, the unavailability probability of the node holding replica $R$.
To assess the overall availability of a file, the structure of the file needs to be considered:
• The data blocks in the file are tightly coupled. All, or at least most, data blocks cannot be removed without affecting the availability of the file, which is described as

$$
P(F) = P(B_1 \times B_2 \times \ldots \times B_j) = 1 - \sum_{j=1}^{b} (-1)^{j+1} C_b^j \Bigl(\prod_{R=1}^{r_j} f_R\Bigr)^j \tag{2}
$$
• The data blocks in the file are loosely coupled. A typical example is a streaming media file, whose data blocks are relatively independent. The availability of the file is described as

$$
\begin{aligned}
P(F) &= P(B_1 \cup B_2 \cup \ldots \cup B_j) = P(B_1) + P(B_2) + \ldots + P(B_b) \\
&= \sum_{j=1}^{b} P(B_j) = \prod_{R=1}^{r_j} \bigl(1 - P(\bar{N}_1 \times \bar{N}_2 \times \ldots \times \bar{N}_R \times \ldots \times \bar{N}_r)\bigr)
\end{aligned}
\tag{3}
$$
In (2) and (3), $j \in \{1, 2, \ldots, b\}$, where $b$ is the total number of data blocks of the file, $R \in (1, r)$, and $r_j$ is the number of replicas of $block_j$.
It is assumed that the expected file availability level is set to $W_{expect}$ when the number of replicas $r_j$ is initialized. This value can be set dynamically by the user, and $P(F)$ should be greater than or equal to $W_{expect}$:
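$$
W_{expect} \le
\begin{cases}
1 - \sum_{j=1}^{b} (-1)^{j+1} C_b^j \Bigl(\prod_{R=1}^{r_j} f_R\Bigr)^j & \text{(tightly coupled)} \\
\prod_{R=1}^{r_j} \bigl(1 - P(\bar{N}_1 \times \bar{N}_2 \times \ldots \times \bar{N}_R \times \ldots \times \bar{N}_r)\bigr) & \text{(loosely coupled)}
\end{cases}
\tag{4}
$$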
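As an illustration of how the initial replica count could be derived from Eqs. (1), (2) and (4), the following is a minimal sketch for the tightly coupled case only. It assumes, for simplicity, that every node has the same unavailability probability `f` and every block gets the same number of replicas `r`; the function names and the search bound `r_max` are hypothetical and do not come from the paper.

```python
from math import comb


def block_unavailability(node_unavail):
    """Eq. (1): a block is lost only if every node holding a replica is down."""
    p = 1.0
    for f in node_unavail:
        p *= f
    return p


def tightly_coupled_availability(b, q):
    """Eq. (2), assuming the same block unavailability q for all b blocks."""
    return 1.0 - sum((-1) ** (j + 1) * comb(b, j) * q ** j for j in range(1, b + 1))


def initial_replicas(b, f, w_expect, r_max=32):
    """Smallest r whose file availability meets w_expect (tightly coupled case of Eq. (4)).

    Assumption: identical node unavailability f and identical r for every block.
    Returns None if no r up to the hypothetical bound r_max is sufficient.
    """
    for r in range(1, r_max + 1):
        q = block_unavailability([f] * r)
        if tightly_coupled_availability(b, q) >= w_expect:
            return r
    return None
```

Under these assumptions the alternating sum in Eq. (2) is the binomial expansion of $1 - (1 - q)^b$, so the availability reduces to $(1 - q)^b$. For example, `initial_replicas(10, 0.1, 0.99)` returns 3, because two replicas per block give roughly 0.904 and three give roughly 0.990.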
2.2. Adjustable replica deployment strategy
To keep the system overhead as small as possible, we should create as few replicas as possible while still ensuring data availability. However, the emergence of hot data blocks leads to uneven data access and node load imbalance. Therefore, after the initial number of replicas has been determined, the number of replicas is adjusted as the system runs. The system uses ARDS to adjust the number of replicas dynamically in order to balance the system load.
Xiong et al. proposed a QoS preference-aware replica selection strategy in cloud computing, which keeps track of the average access number of files and uses its maximum and minimum values as the premise for triggering replication [10]. If a user frequently accesses one data block of a file, it is unnecessary to duplicate the entire file; only the corresponding data block needs to be copied. We propose a replication strategy based on the historical replica access frequency, which balances the load of nodes and reduces the response time of file access by increasing the number of replicas of hot data blocks flexibly, dynamically and selectively.
The total number of accesses to the data block $SF_iDB_j$ of the file $SF_i$ in a certain period of time is counted, and the average access number $W_{ij}$ over all replicas of $SF_iDB_j$ is then calculated:

$$
W_{ij} = \frac{\sum_{R=1}^{r_{SF_iDB_j}} SF_iDB_j}{r_{SF_iDB_j}} \tag{5}
$$

where $r_{SF_iDB_j}$ is the total number of replicas of the data block $SF_iDB_j$. If $W_{ij}$ reaches a certain threshold value, denoted $W_{yz}$, new replicas are created.
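The threshold check around Eq. (5) can be sketched as follows. This is only an illustration: it assumes that the summand in Eq. (5) is the access count recorded for each replica during the observation period, and that "reaching" the threshold means greater than or equal to it. The function names are hypothetical.

```python
def average_access(per_replica_accesses):
    """Eq. (5): mean number of accesses over all replicas of one data block."""
    if not per_replica_accesses:
        raise ValueError("a data block needs at least one replica")
    return sum(per_replica_accesses) / len(per_replica_accesses)


def needs_new_replica(per_replica_accesses, w_yz):
    """True when the average access count W_ij reaches the threshold W_yz."""
    return average_access(per_replica_accesses) >= w_yz
```

For a block with three replicas accessed 120, 80 and 100 times, the average is 100, so a threshold of 100 would trigger the creation of a new replica while a threshold of 150 would not.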
2.3. Storage node selection strategy
The placement of replicas in an open heterogeneous storage system must take the types of storage nodes into account. Zhao et al. pointed out that a node on which a file replica is created should satisfy the following two requirements: 1) the log of the node contains an access record of the file; 2) the node is not currently overloaded [11]. According to the number of requests for data blocks, the candidate node with the most visits is selected as the placement site of the new replica. We consider not only the load of storage nodes, but also the connections between nodes in an open heterogeneous storage environment and the attributes of data replicas, including the load of candidate nodes, the candidate node type, the importance of replicas and the number of replicas stored on the nodes. Based on the principle of risk sharing, it is important to avoid storing replicas of the same data block on the same node.
When the system receives a request for a data block, the first step is to check whether the system holds a replica of that data block. A node can serve the request only if its current workload, $\alpha_{actual}$, is less than or equal to the stability threshold $\alpha_{nodeyz}$; if a node's $\alpha_{actual}$ is higher than $\alpha_{nodeyz}$, it is unsuitable for providing services. An appropriate node must satisfy

$$
\alpha_{actual} \le \alpha_{nodeyz}, \qquad \alpha_{actual} = C_{Usedcapacity} / C_{totalcapacity} \tag{6}
$$

where $C_{Usedcapacity}$ represents the node's resources that have already been used and $C_{totalcapacity}$ indicates the total capacity of the node. The storage node selection strategy selects the node with the smallest $\alpha_{actual}$ on which to create the corresponding replica, to ensure data availability.
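The selection rule of Eq. (6), combined with the risk-sharing principle stated earlier in this subsection, might look like the sketch below. The `Node` class, its field names and the `select_node` function are hypothetical illustrations; the paper does not specify a data model, and the other attributes it mentions (node type, replica importance) are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    used: float                 # C_Usedcapacity
    total: float                # C_totalcapacity
    blocks: set = field(default_factory=set)  # ids of blocks already stored here

    @property
    def alpha_actual(self):
        """Current workload ratio from Eq. (6)."""
        return self.used / self.total


def select_node(nodes, block_id, alpha_nodeyz):
    """Pick the least-loaded node that satisfies Eq. (6) and does not
    already hold a replica of the block; return None if there is none."""
    candidates = [
        n for n in nodes
        if n.alpha_actual <= alpha_nodeyz and block_id not in n.blocks
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.alpha_actual)
```

With nodes loaded at 0.5, 0.2 (already holding block `b1`), 0.9 and 0.3 and a threshold of 0.8, a new replica of `b1` would go to the node at 0.3, since the least-loaded node is excluded for already storing that block.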
3. Experiments
3.1. Experimental environment
The experiments were conducted with OMNeT++ [12], which supports distributed parallel simulation. We used OMNeT++ to build the simulation network and the open heterogeneous storage network platform, as shown in Fig. 1.
As shown in Fig. 1, SN is the resource index node of the open heterogeneous storage platform, CNs are high-performance nodes, of which there are 20, and PNs are relatively low-performance nodes, of which there are 2000. The platform adopts a hierarchical management strategy: the SN manages the CNs, and each CN manages 100 PNs.
3.2. Performance criteria
We compare the replica placement strategy presented in this paper, i.e. ARDS, with other replica placement strategies under the same experimental conditions.
• Load rate of node
$L_{Node}$ indicates the load rate of a node:

$$
L_{Node} = \frac{N_{usedCapacity}}{N_{totalCapacity}} \tag{7}
$$

where $N_{usedCapacity}$ is the current workload of a node, and $N_{totalCapacity}$ represents the total capacity of a node.
• Load rate of system
$\alpha_{actual}$ indicates the load rate of the system:

$$
\alpha_{actual} = \sum_{x=1}^{X} \frac{N_{x.usedCapacity}}{N_{x.totalCapacity}}, \quad x \in (1, X) \tag{8}
$$

where $N_{x.usedCapacity}$ represents the current workload of node $x$, $N_{x.totalCapacity}$ represents the total capacity of node $x$, and $X$ represents the number of nodes in the system.
[Figure: resource index node SN; high-performance nodes CN1–CN19; low-performance node groups PN1[100]–PN19[100].]
Fig. 1. The simulation network and the open heterogeneous storage network platform.
• Rate of successful access
$\eta_{success}$ indicates the rate of successful access:

$$
\eta_{success} = \frac{Num_{success}}{Num_{total}} \tag{9}
$$

where $Num_{success}$ represents the number of successful accesses, and $Num_{total}$ represents the total number of access requests.
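The three criteria above are simple ratios and could be computed as in the following sketch. It reads Eq. (8) as a sum of per-node load rates, which is an assumption about the formula's layout, and represents each node as a hypothetical `(used, total)` pair; none of these function names come from the paper.

```python
def node_load_rate(used, total):
    """Eq. (7): load rate of a single node."""
    return used / total


def system_load_rate(nodes):
    """Eq. (8), read here as the sum of per-node load rates.

    `nodes` is an iterable of (used_capacity, total_capacity) pairs.
    """
    return sum(node_load_rate(used, total) for used, total in nodes)


def success_rate(num_success, num_total):
    """Eq. (9): fraction of access requests that succeeded."""
    return num_success / num_total
```

For instance, two nodes with (30, 120) and (50, 100) have load rates 0.25 and 0.5, giving a system value of 0.75 under this reading, and 45 successful accesses out of 50 requests give a success rate of 0.9.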
3.3. Experimental results and performance analysis
Fig. 2 shows the currently available capacities of all nodes in the experimental system.
We choose the typical priority-based replica deployment strategy (PR) for comparison with ARDS. Fig. 3 shows the changing trend of the workloads of CNs in the ARDS-based system with different numbers of replicas deployed: 2,000, 4,000, 6,000, 8,000 and 9,603 replicas. It can be seen that the available capacities of CNs differ greatly at first. However, as the system operates, the available capacities of CNs tend to become balanced, which means that their workloads are gradually balanced.
It can further be seen from Fig. 3 that, as the number of replicas increases, the nodes with higher available capacities are chosen preferentially, which means that ARDS helps to maintain the workload balance.
Fig. 4 shows the changing trend of the workloads of CNs in the PR-based system with different numbers of replicas deployed: 2,000, 4,000, 6,000, 8,000 and 9,602 replicas.

[Figure: (a) Capacity of CN (GB) versus No. of CN; (b) Capacity of PN (GB) versus No. of PN.]
Fig. 2. (a), (b) The currently available capacities of all nodes in the experimental system.

[Figure: Capacity of CN (GB) versus No. of CN; series: Initial Status and five "Number of Replicas" settings.]
Fig. 3. The changing trend of the workloads of CNs in the ARDS-based system.

[Figure: Capacity of CN (GB) versus No. of CN; series: Initial Status and five "Number of Replicas" settings.]
Fig. 4. Node load change (based on PR algorithm).
As shown in Fig. 4, with PR the system prefers to select the CN with the highest available capacity as the target storage node until this node reaches its saturation point, which tends to accumulate replicas on the same node and leads to workload imbalance.
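The contrast between the two placement behaviors can be illustrated with a toy model. This sketch is not the authors' OMNeT++ experiment: it assumes unit-size replicas, a hypothetical saturation threshold of 0.9, a PR rule that follows only the description of Fig. 4 above, and an ARDS rule reduced to "pick the node with the smallest load rate". All names are illustrative.

```python
from statistics import pstdev


def place_ards(used, total, n):
    """ARDS-style choice: each replica goes to the node with the smallest load rate."""
    used = list(used)
    for _ in range(n):
        i = min(range(len(used)), key=lambda k: used[k] / total[k])
        used[i] += 1  # assumption: every replica occupies one capacity unit
    return used


def place_pr(used, total, n, saturation=0.9):
    """PR-style choice as described for Fig. 4: keep filling the node that had the
    highest available capacity until it reaches the saturation point."""
    used = list(used)
    target = None
    for _ in range(n):
        if target is None or used[target] / total[target] >= saturation:
            open_nodes = [k for k in range(len(used)) if used[k] / total[k] < saturation]
            if not open_nodes:
                break  # every node is saturated
            target = max(open_nodes, key=lambda k: total[k] - used[k])
        used[target] += 1
    return used


def load_spread(used, total):
    """Population standard deviation of node load rates (0 means perfectly balanced)."""
    return pstdev([u / t for u, t in zip(used, total)])
```

Starting from four nodes of capacity 100 with 10, 20, 30 and 40 units used and placing 100 replicas, the ARDS-style rule ends at 50 units on every node, while the PR-style rule ends at 90, 40, 30 and 40, which is a visibly larger spread of load rates.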
Fig. 5 further compares the changing trends of the workloads of CNs with ARDS and with PR.

[Figure: four panels (a)–(d), each plotting Capacity of CN (GB) versus No. of CN for ARDS and PR.]
Fig. 5. (a) With 2,000 replicas (b) With 4,000 replicas (c) With 6,000 replicas (d) With 8,000 replicas.
It can be seen from Fig. 5 that, as the number of replicas increases, the available capacities of CNs change more steadily with ARDS than with PR, which makes the load variation across all CNs more balanced and reduces the probability of hot spots.
Next, we examine the rate of successful data access. As shown in Fig. 6, when the total number of deployed replicas reaches 6,000 to 8,000, the failure rate of nodes is lower with ARDS than with PR.
Fig. 6. The success rate of data access.
4. Conclusion
With the development and popularization of the Internet and social networks, current storage systems have become increasingly unable to meet the storage demands of big data. We believe that effectively integrating cloud server resources and edge node resources to build open heterogeneous storage systems is a good approach. The data replica management mechanism has an important impact on a system's reliability, balance and QoS. In this paper, we proposed a novel data replica placement mechanism with ARDS for open heterogeneous data storage systems. Compared with the PR-based mechanism, the ARDS-based mechanism improves system performance, including the system load balance and the rate of successful data access. In the future, we will apply the ARDS-based mechanism to a typical streaming media system, and we will study mechanisms for maintaining data consistency in open heterogeneous storage systems.
Acknowledgements
We would like to thank the reviewers for their detailed comments to help us improve the quality of this paper.
This work was jointly sponsored by the National Natural Science Foundation of China under Grant 61472192 and the
Scientific and Technological Support Project (Society) of Jiangsu Province under Grant BE2016776.
References
1. Wang Y, Jin X, Cheng X. Network Big Data: present future. Chinese Journal of Computers 2013; 36: 1125-1138.
2. Wang W, Wei W. A dynamic replica placement mechanism based on response time measure. Proceedings of 2010 International Conference
on Communications and Mobile Computing. Shenzhen, China: 2010: 169-173.
3. Rao L, Yang F, Li X, et al. Dynamic replica creation algorithm based on temperatures analysis. Journal of Computer Applications 2014; 34:
130-134.
4. Xu X, Zou Q, Yang G. Data replication management mechanisms for DSS. Computer Technology and Development, 2013; 23: 245-249.
5. Yang H. Research and experience about the data replica placement algorithm in cloud storage System. University of Electronic Science and
Technology of China, 2013.
6. Yuan M, Liu J, Liu T, et al. Data grid replica deployment strategy based on simulated annealing algorithm. Computer Engineering 2009; 35:
22-24.
7. Pang L, Chen Y. Data Replication strategies for the network environment. Computer Engineering and Science 2005; 27: 1-2.
8. Yang X, Wang X, Zhang M, et al. User interest-aware content replica optimized placement algorithm. Journal on Communications 2014; 35:
21-27.
9. Lemma F, Schad J, Fetzer C. Dynamic replication technique for micro-clouds based distributed storage system. Proceedings of 2013 the 3th
International Conference on Cloud and Green Computing. Karlsruhe, Germany: 2013.
10. Xiong R, Luo J, Song A, et al. QoS preference-aware replica selection strategy in cloud computing. Journal on Communications. 2011; 32:
93-102.
11. Zhao W, Xu X, Wang Z. Load balancing-based replica placement strategy in data grid system. Proceedings of 2010 Third International
Conference on Education Technology and Training. Hubei, China: 2010.
12. OMNeT++. https://omnetpp.org/
