
Cloud Storage Performance Enhancement by Real-time Feedback Control and De-duplication


Tin-Yu Wu, Wei-Tsong Lee, Chia Fan Lin

Department of Electrical Engineering, Tamkang University, Taiwan, R.O.C.


{tyw, wtlee}@mail.tku.edu.tw, 698470357@s98.tku.edu.tw

Abstract- In a cloud storage environment, file distribution and storage are handled by storage device providers or by physical storage devices rented from third-party companies. Through centralized management and virtualization, files are integrated into available resources for users to access. However, because of the increasing number of files, the manager cannot guarantee the optimal status of each storage node. The great number of files not only wastes hardware resources but also increases the control complexity of the data center, which further degrades the performance of the cloud storage system. For this reason, to decrease the workload caused by duplicated files, this paper proposes a new data management structure, the Index Name Server (INS), which integrates data de-duplication with node optimization mechanisms for cloud storage performance enhancement. INS can manage and optimize the nodes according to the client-side transmission conditions. With INS, each node can be controlled to work in its best status and matched to suitable clients as far as possible. In this manner, we can improve the performance of the cloud storage system efficiently and distribute the files reasonably to reduce the load of each storage node.

Keywords: INS, Duplication, Cloud storage

I. INTRODUCTION

Cloud computing technology has not only contributed to various applications and influenced the operation of original networks, but also allowed the Internet to exist in different corners with different devices. Nevertheless, with the appearance of numerous devices and data, data access and the management and control of nodes in the cloud network have become significant issues, because the efficiency of control methods enormously affects the performance and quality of the cloud network.
In a cloud storage environment, data is usually stored in space provided by third-party companies, instead of on one single host, and must be managed and integrated into available resources for users to access. The common storage protocols include two types: NAS and SAN [1]. However, because of the great number of users and devices, the cloud network manager often cannot control the performance of different storage nodes, which increases the complexity of controlling hardware and network traffic and further decreases the performance of the cloud network.
To avoid the burden resulting from duplicate data and to optimize the nodes’ performance, this paper presents an optimized self-adjusted WAN cloud network management architecture, the Index Name Server (INS), which integrates de-duplication and transmission optimization techniques for cloud storage performance enhancement.

II. RELATED WORKS

Based on cloud computing, cloud storage refers to saving data in servers maintained by a third party, instead of an exclusive or single server. Co-location companies run large-scale datacenters for users in want of data storage space and services to rent virtual machines and storage space [2,3]. According to the clients’ demands, the database providers prepare remote virtualized resources and offer them in the form of a storage pool so that the clients can save files and data in the storage pool. The cloud storage interface will be installed on different end devices based on client requirements. As a result, operating cloud storage is like operating a local storage device.
Via networked devices or servers, cloud storage changes the cloud storage application programming interfaces (APIs), like Simple Object Access Protocol (SOAP) and Representational State Transfer (REST), to block-based protocols, like iSCSI and Fibre Channel, or file-based protocols, like NFS and CIFS [4]. To achieve better transmission and access performance, many recent studies have proposed adding de-duplication and feedback control techniques to enhance the transmission efficiency and reduce the load of the storage nodes.
[5] proposed using utility functions and a predictor (to estimate how configuration changes of the controller affect the system) to monitor and distribute system resources. Because a de-duplication mechanism is included, this scheme performs well in reading/writing efficiency and load balancing. Nevertheless, the block size and duplication ratio in its simulated scenario are fixed, so more experiments are needed to prove its efficiency when applied to a WAN cloud network.
[6] presented a resource management mechanism that adopts job managers and a feedback control system to distribute resources and manage the QoS. However, since the job manager is an independent program that is generated according to different demands, too many job managers might be generated as the data increases, which places an extra burden on the server.
Besides using various systems for feedback control, many methods have been proposed to design different standards for different purposes for extra regulation and control. For
example, QoS Controller [7], which regulates the bandwidth of accesses and monitors the bandwidth for load distribution, is presented to distribute the load when the storage system is overloaded. Nevertheless, because the metadata server is not a specific server and the heterogeneity of different domains is not considered, its performance is greatly affected. On the other hand, [8] proposed Service Level Objectives (SLOs) to adjust and balance the load according to different service levels. This scheme monitors the I/O commands and delay to achieve distributed storage, and is applied to virtualized storage devices. However, owing to its concentration on I/O performance optimization, its performance still needs to be verified when applied to a WAN cloud network.
Generally speaking, most existing methods use control systems for feedback control and propose various optimization and control strategies based on the feedback. However, these schemes are triggered only when storage nodes exceed a certain threshold, which means that these schemes are compensative, instead of predictive. As the network and the number of users grow, the cost of resource regulation grows as well. By estimating the performance of clients and storage nodes in advance and choosing suitable storage nodes according to different user requirements, our proposed INS architecture can control the load of each storage node efficiently and attain load balancing.

III. INDEX NAME SERVER (INS)

Index Name Server (INS), an index server similar in structure to the Domain Name System (DNS), manages the cloud data with a complex P2P-like architecture [9,10]. Although INS resembles DNS in structure and functions, INS mainly processes the one-to-many matches between the storage nodes' IP addresses and hash codes. In general, INS has three chief functions:
• Switching between the fingerprints and their corresponding storage nodes.
• Affirming and balancing the storage nodes' load.
• Satisfying client demands for transmission as much as possible.
For file transmission optimization, every INS has exclusive databases of its own domain, which include the fingerprints and their corresponding storage nodes. However, in a WAN cloud network environment, managing the file system with only a few INSs would place a great burden on them. Therefore, based on the existing DNS structure, we propose to divide the INSs according to domain and loading capacity and to adopt a hierarchical management architecture to reduce the workload of the INSs.

3.1 INS Architecture

Based on the stack structure of DNS, INS manages the storage nodes in the domain and handles the client file-access requirements according to its database. Though similar to DNS in architecture and functions, INS not only manages and queries the data between fingerprints and storage nodes, but also coordinates the transmissions by feedback control between storage nodes and clients [11].
As shown in Figure 1, INSs, the central managers of the nodes, have server-client relationships [12] with one another in a hierarchical architecture and record the fingerprints and the storage node information of each data chunk. It is worth noting that INSs only record the locations of the fingerprints and manage the storage nodes; other information is not recorded [13]. Each storage node provides its own status and information for INSs to record, while clients request related information from INSs during transmissions. Whenever a new INS is established, the storage node with the maximum throughput in the domain will be selected for backup. Since the tasks of INS center on data computation and transmission, we put the emphasis on the performance of the database and the throughput of data transmission, instead of storage space.

Figure 1. Hierarchical INS architecture.

3.2 De-duplication

Owing to the large-scale architecture of the WAN cloud network system and the similarity of user habits and user groups, the data duplication rate increases greatly. Thus, a de-duplication technique is adopted in our scheme to scatter and remix the data at local hosts, divide the file into several chunks for uploading, and designate a unique fingerprint to each file by MD5.
Because of its uniqueness, every fingerprint is regarded as the identifier of a data chunk. After checking a requested fingerprint, the INSs determine whether a file chunk with the same fingerprint exists in the storage space. If not, the system continues the following uploading procedure and assigns tasks to the storage node. Figure 3.3 displays the INS flowchart.

3.3 INS Querying Process

Each domain-based INS has databases of fingerprints and storage nodes. The database of fingerprints records the fingerprints of different files and their corresponding storage
nodes. When a user is looking for specific fingerprints, the INS queries and confirms whether the file already exists in a storage node within the domain before taking the next step. When clients want to access data, they can use the obtained fingerprints as the index and query the INS of the upper layer, which searches for the best access node based on the content of the database in case of inefficiency of the access node or data loss.
Different requirements lead to different query results. If the file that the client wants to access does not exist in a storage node of the local domain, the INS queries the INS of the upper layer. With the aid of the Bloom Filter, the INS can find out the domain of the INS holding that file chunk and determine the storage node through the destination INS for transmission. By increasing the length of the mapping array, the search error rate of the Bloom Filter can be reduced. This is shown in the following:

$$1 - \frac{1}{m} \tag{1}$$

$$\left(1 - \frac{1}{m}\right)^{k} \tag{2}$$

$$\left(1 - \frac{1}{m}\right)^{kn} \tag{3}$$

$$1 - \left(1 - \frac{1}{m}\right)^{kn} \tag{4}$$

$$P_e = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} \tag{5}$$

Assuming that m is the length of the assembly matrix, Equation 1 gives the probability that a specific bit is still 0 after one hash function has set a bit. With k, the number of hash functions, we get Equation 2. After adding n elements, the probability for a specific bit in the matrix to be 0 is shown as Equation 3. Consequently, the probability for a specific bit to be 1 is equal to 1 minus Equation 3, namely Equation 4. If an element does not exist in the matrix but all k of its positions have already been set to 1, a search error occurs; its probability is given by Equation 5. This shows that by altering the length of the matrix, the search error rate of the Bloom Filter can be decreased to fit our standard.

3.4 Performance Parameters of Storage Nodes

Since storage nodes play fundamental roles in cloud storage, the performance parameters of the storage nodes influence the network tremendously. For each storage node to present the best efficiency according to its performance, we define the parameter metric with the unit of files/s, that is, the number of files that a storage node can process in practice. The parameters adopted to define the metric include:
• Read/write capability of storage hardware
• Available bandwidth
• Performance extremity of CPU/RAM
When triggered for the first time, the storage node first performs a self-examination. Because the hardware architecture of each storage node differs (CPU, RAM, and hard disk, for example), we cannot determine its actual efficiency by estimating the performance from hardware specifications. Thus, our proposed measuring method is to test the maximum write/read speed of the system before it reaches 90% of the load, and to determine the access efficiency of the storage node by considering its maximum available bandwidth. Moreover, because the size of the chunks is fixed, our method can validly measure the performance metric of each storage node in the INS environment.

3.5 Client-side Parameters

When the clients request access to files from the INSs, the clients must provide the fingerprints of the desired chunks and the bandwidth for accessing the chunks, so that the INSs can integrate them for transmissions. With this information, the INS can select the storage nodes that fit the clients’ needs as closely as possible. The calculation method is displayed in Equation 6:

$$B_s = \frac{B_c}{N_{download} + N_{upload}} \tag{6}$$

where $B_s$ means the available bandwidth of the storage node, n means the number of storage nodes for the upcoming transmission, $N_{download}$ and $N_{upload}$ respectively mean the number of files for the upcoming transmission (de-duplicated fixed-size file chunks, 256KB for instance), and $B_c$ means the client’s available bandwidth for transmission. However, to account for the instability of the WAN cloud network, we take more environmental variables into consideration and include a file duplication management scheme to enhance the utilization efficiency of storage nodes. The equation is thus modified into Equation 7:

$$B_s = \frac{B_c}{\left(N_{download} + \left(N_{upload} - F_u\right)\right) \times \left(1 + F_d\right)} \tag{7}$$

$F_d$ refers to the delay time obtained after several transmissions. This value is included to ensure that the INS can assign the optimal storage node with the most appropriate bandwidth for clients and improve the utilization efficiency of the storage node. $F_u$ symbolizes the number of duplicate files determined by the INS databases. These duplicate file chunks therefore will not be re-uploaded to the storage node.

3.6 INS Controlling Process

Due to the interference of external factors like network delay, the actual transmission value is usually not equal to the bandwidth that clients can use. As a result, the INSs might overestimate the clients, select an inappropriate storage node, and waste resources. For this reason, we design a system loop to optimize the performance of the storage nodes.
An INS itself can be regarded as an automatic control system, which receives the feedback of the previous transmission and regulates the transmission parameters
according to the feedback to achieve the maximum performance of the storage node. The INS controlling process is shown in Figure 2.

Figure 2. INS Controlling Process.

• R(k): the initial expected value
• F(k): the output feedback
• M(k): the modified feedback
• Fs(k): the modified internal function of the storage node
• D(k): external interference factor (random variable)
• X(k): the result inside of the storage node
• Y(k): the actual result
• K_ins: the best node chosen by the INS according to the feedback

Initially, the INS uses the client-side access parameters to calculate the bandwidth that the client will use for the storage node and thus gets R(k). Based on the result of the previous transmission, the system modifies the feedback to regulate the client-side parameters and assign a suitable storage node to the client. The feedback loop inside the storage node is executed by the algorithm mentioned in the previous section to ensure that the client-side bandwidth is utilized fully and efficiently.

IV. PERFORMANCE SIMULATION AND ANALYSIS

In addition to defining the simulation parameters, this section includes the simulation and analysis of multi-node transmission, average load and average transmission delay of the INS system.

4.1 Simulation Parameters

An INS system is composed of three parts: client, storage node and INS itself. The simulation parameters are listed below.

Table 1. The simulation parameters.

Parameter                          Content
Storage Node Performance Metric    200~500 (files/second)
Space on Storage Node              1~10 (TB)
Bandwidth of Storage Node          10~100 (Mbps)
Client-Side Write Request          1~5 (files/second)
Client-Side Read Request           1~5 (files/second)
Client-Side Bandwidth              2~10 (Mbps)
Extra Transmission Delay           5~100 (ms)
Number of Clients                  1~2700+

As mentioned previously, the storage node performance metric denotes the handling capability of the storage node, that is, the number of files that the storage node can read and write simultaneously. The space and bandwidth of a storage node represent its limits in storage and transmission. Client-side read and write requests refer to the number of files that the client will transmit through the INS system.

4.2 Simulation Results

Figure 3 displays the average delay time of different transmission methods in the same environment. The hybrid mechanism selects the nodes with better performance and bandwidth out of the storage nodes, and transmits files according to their current load. This method guarantees that the nodes with better capability are loaded first for transmissions, but these nodes consequently carry a comparatively heavier load.
Compared with other mechanisms that choose nodes according to various orientations, the random mechanism ignores each node’s capability and distributes the load equally to all storage nodes, instead of to part of the nodes first. By taking the load, bandwidth and performance metric of the storage nodes into account, our proposed INS not only selects the nodes with the optimal bandwidth and performance, but also chooses the most appropriate storage nodes for client-side transmissions.

Figure 3. The average delay time (0% data duplication rate).
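The bandwidth calculation of Section 3.5 and the choice of "the most appropriate storage node" can be illustrated with a short Python sketch. It assumes Equation 7 reads B_s = B_c / ((N_download + N_upload - F_u) * (1 + F_d)), which reduces to Equation 6 when F_u = F_d = 0; the operators are an interpretation of the printed formula. The names `StorageNode`, `required_bandwidth` and `pick_node`, as well as the best-fit selection rule, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StorageNode:
    name: str
    available_mbps: float  # bandwidth the node can still offer (hypothetical field)


def required_bandwidth(client_mbps, n_download, n_upload,
                       dup_files=0, delay_factor=0.0):
    """B_s under the assumed reading of Equation 7.

    With dup_files=0 and delay_factor=0.0 this is Equation 6.
    """
    # Duplicate chunks (F_u) are not re-uploaded, so they are subtracted.
    files = n_download + (n_upload - dup_files)
    if files <= 0:
        raise ValueError("no files left to transfer")
    # A larger measured delay (F_d) lowers the bandwidth requested per file.
    return client_mbps / (files * (1.0 + delay_factor))


def pick_node(nodes, needed_mbps):
    """Assumed best-fit rule: the node that covers the need with least surplus."""
    candidates = [n for n in nodes if n.available_mbps >= needed_mbps]
    if not candidates:
        # No node can cover the need: fall back to the fastest one.
        return max(nodes, key=lambda n: n.available_mbps)
    return min(candidates, key=lambda n: n.available_mbps - needed_mbps)


if __name__ == "__main__":
    nodes = [StorageNode("a", 10.0), StorageNode("b", 3.0), StorageNode("c", 50.0)]
    need = required_bandwidth(client_mbps=10.0, n_download=3, n_upload=2)
    print(need, pick_node(nodes, need).name)
```

A best-fit rule is used here only because it avoids assigning far more node bandwidth than the client can consume; other rules (for example weighting by the files/s performance metric of Table 1) would fit the description equally well.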
In order to measure the storage nodes’ actual performance in the INS architecture, the client-side bandwidth is not limited while measuring the average load and average delay time, which means that each storage node uses its maximum bandwidth for every transmission.
Figure 4 shows the influence of different mechanisms on the system’s average loading rate, in which the loading rate of the INS at 100% is taken as the benchmark. This figure reveals that the INS mechanism, which considers bandwidth, performance and load balancing, can distribute the load efficiently to all storage nodes. On the other hand, the load of the hybrid mechanism is higher because the load is distributed to only some of the nodes. Although the random mechanism tends to distribute the load equally, different storage nodes have different bandwidth and performance, so it cannot achieve fair load balancing.

Figure 4. The average loading rate (0% data duplication rate).

Another concern is that the actual client-side bandwidth cannot match the full-speed storage nodes, which leads to a waste of resources. Therefore, after adding user bandwidth to the simulated scenario for further analysis, we get the result shown in Figure 5, which can be expressed by Equation 8:

$$\text{Exceeding\_rate} = 1 - \frac{\text{Client\_Bandwidth}}{\text{Assigned\_Bandwidth}} \tag{8}$$

This equation refers to the exceeding rate resulting from assigning too much storage-node bandwidth to the clients. If the storage nodes do not regulate the bandwidth according to client capability, or the managers (or clients) of different mechanisms do not choose the best storage nodes based on client-side bandwidth, unnecessary waste easily occurs.

Figure 5. The exceeding rate caused by not regulating the bandwidth according to client-side bandwidth (0% data duplication rate).

After the previous analysis, we include the scheme of data duplication identification and management to reduce the load of the INS and storage nodes so that clients can have more resources for performance improvement. Table 2 lists different data duplication rates and the corresponding maximum number of clients when the INS’s loading rate reaches 100%.

Table 2. Data duplication rate and the corresponding maximum number of clients.

Data Duplication Rate (%)    Maximum Number of Clients
0%                           2600
10%                          2800
20%                          2900
30%                          3000
40%                          3300
50%                          3700
60%                          3700
70%                          4000
80%                          4300
90%                          5200

Figure 6 displays the average loading rate of the INS under different data duplication rates and uses the 0% data duplication rate as the maximum for comparison. The figure shows that as the duplication rate increases, the loading rate for the same client-side load falls by approximately 50%, which shows that the de-duplication technique indeed decreases the system load effectively.

Figure 6. Average loading rate under different data duplication rates (0% duplication rate as the benchmark).

V. CONCLUSION AND FUTURE OBJECTIVE

Simulation results show that our proposed INS architecture can keep the system’s average load and the delay time influenced by the bandwidth to a minimum and provide the maximum bandwidth for clients to attain the optimal transmission. Our future objective is to classify file formats and to regulate and optimize their transmission parameters according to different INS domains and transmission times in order to achieve the best performance of INS in various domains.
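As a rough cross-check of the Section 4.2 figures, the Table 2 values and Equation 8 can be evaluated with a few lines of Python. The sketch assumes that INS load grows linearly with the number of clients, so that the load of a fixed client population is the inverse of the capacity ratio; this assumption and the function names `relative_load` and `exceeding_rate` are illustrative and do not come from the paper. Equation 8 is read here as 1 minus the ratio of client bandwidth to assigned bandwidth.

```python
# Maximum number of clients at 100% INS loading, copied from Table 2
# (keys are data duplication rates in percent).
MAX_CLIENTS = {0: 2600, 10: 2800, 20: 2900, 30: 3000, 40: 3300,
               50: 3700, 60: 3700, 70: 4000, 80: 4300, 90: 5200}


def relative_load(dup_rate, baseline=0):
    """Load of a fixed client population relative to the baseline rate.

    Assumption (not stated in the paper): load is linear in the number of
    clients, so the load ratio is the inverse of the capacity ratio.
    """
    return MAX_CLIENTS[baseline] / MAX_CLIENTS[dup_rate]


def exceeding_rate(client_bandwidth, assigned_bandwidth):
    """Equation 8: share of the assigned bandwidth the client cannot use."""
    if assigned_bandwidth <= 0:
        raise ValueError("assigned_bandwidth must be positive")
    return 1.0 - client_bandwidth / assigned_bandwidth


if __name__ == "__main__":
    for rate in sorted(MAX_CLIENTS):
        print(f"{rate:2d}% duplication: relative load {relative_load(rate):.2f}")
    print(exceeding_rate(client_bandwidth=5.0, assigned_bandwidth=10.0))
```

Under the linear-load assumption, the 90% row gives 2600 / 5200 = 0.5, which agrees with the reduction of approximately 50% read from Figure 6.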
REFERENCES
[1] Show Me the Gateway,
http://gigaom.com/2010/06/22/show-me-the-gateway-taking-storage-to-the-cloud/
[2] Yanmei Huo; Hongyuan Wang; Liang Hu; Hongji Yang; "A Cloud
Storage Architecture Model for Data-Intensive Applications", in Proc.
Computer and Management (CAMAN), 2011, pp. 1-4.
[3] Microsoft SMB Protocol and CIFS Protocol Overview, MSDN,
http://msdn.microsoft.com/en-us/library/aa365233
[4] Direct hosting of SMB over TCP/IP. Microsoft,
http://support.microsoft.com/kb/204279
[5] Costa, L.B.; Ripeanu, M.; “Towards automating the configuration of a
distributed storage system”, in Proc. Grid Computing (GRID), 2010, pp.
201-208.
[6] Ohsaki, H.; Watanabe, S.; Imase, M.; “On dynamic resource
management mechanism using control theoretic approach for wide-area
grid computing”, in Proc. Control Applications, 2005, pp. 891-897.
[7] Dezhi Han; Fu Feng; “Research on Self-Adaptive Distributed Storage
System”, in Proc. Wireless Communications, Networking and Mobile
Computing (WiCOM '08.), 2008.
[8] Jianzong Wang; Varman, P.; Changsheng Xie; “Avoiding performance
fluctuation in cloud storage”, in Proc. High Performance Computing
(HiPC), 2010.
[9] A Survey of DHT Security Techniques
http://www.cs.vu.nl/~steen/papers/2009.acm-cs.pdf
[10] Xin Sun, Kan Li, Yushu Liu, "An Efficient Replica Location Method in
Hierarchical P2P Networks", in Proc. Computer and Information
Science (ICIS 2009), 2009, pp. 769-774.
[11] He Huang; Liqiang Wang; “P&P: A Combined Push-Pull Model for
Resource Monitoring in Cloud Computing Environment”, in Proc. Cloud
Computing (CLOUD), 2010, pp. 260-267.
[12] Wenzheng Li; Hongyan Shi; "Dynamic Load Balancing Algorithm
Based on FCFS", in Proc. Innovative Computing, Information and
Control (ICICIC), 2009, pp. 1528-1531.
[13] Dinerstein, J.; Dinerstein, S.; Egbert, P.K.; Clyde, S.W.;
“Learning-Based Fusion for Data Deduplication”, in Proc. Machine
Learning and Applications ICMLA '08, 2008, pp. 66-71.