
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR)
ISSN(P): 2249-6831; ISSN(E): 2249-7943
Vol. 4, Issue 5, Oct 2014, 101-106
© TJPRC Pvt. Ltd.

A HYBRID METHOD OF DE-DUPLICATION TO IMPROVE SPEED AND EFFICIENCY


PREETIKA SINGH & MEENAKSHI SHARMA
Department of Computer Science and Engineering, Sri Sai Institute of Engineering and Technology,
Badhani, Punjab, India

ABSTRACT
Cloud computing has redefined the way computing resources are used. Several individual technologies, when united with cloud computing, produce even better results, and De-duplication technology is one of them. While De-duplication helps shrink the storage space required, it also invites a few other problems, chief among them the access speed of data and the security of data. This paper proposes a De-duplication system that is a hybrid of two different types of De-duplication, file-level and block-level; the proposed system inherits the advantages of both while discarding the disadvantages of both.

KEYWORDS: Cloud Computing, De-Duplication, Speed, Efficiency


INTRODUCTION
In this digital era the amount of data in use is growing vast and networks are becoming denser. Many business and scientific applications work with huge amounts of data and access them frequently, and the files containing these data do not necessarily reside at the user's location, as in grid and cloud computing. As people grow closer to computers, their dependency on data increases day by day, and this dependency creates the need for vast data storage space. While cloud computing emerged as a solution to this requirement, the storage of redundant data on the cloud remained a point of concern. To cope with this problem, De-duplication was introduced into cloud storage systems; it reduces the amount of storage space required by de-duplicating the data [2].

Figure 1: Iconic Visualization of Data De-Duplication


For example, when several users of the same cloud storage service upload an identical audio file to the storage server, instead of storing several separate copies the server stores only one file and adds each user's name to the access control list of that file; the storage space, which would otherwise grow in proportion to the number of copies, is thereby saved. Figure 1 shows how unique chunks are formed out of two files, saving storage space equal to three chunks.
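As a minimal illustration of this file-level behaviour, the following sketch stores each distinct file once and records additional uploaders in an access control list. It is only a sketch: the SHA-256 hash, the dictionary layout, and the function name are our own assumptions, not the paper's implementation.

```python
import hashlib

stored_files = {}      # file hash -> name under which the single copy is stored
access_control = {}    # file hash -> set of users allowed to read that copy

def upload(user, file_name, data):
    """Store a file only once; later identical uploads just extend the access list."""
    file_hash = hashlib.sha256(data).hexdigest()
    if file_hash in access_control:        # duplicate content: no new storage is used
        access_control[file_hash].add(user)
    else:                                  # first copy: store it and open an access list
        stored_files[file_hash] = file_name
        access_control[file_hash] = {user}
```

With this scheme, the same audio file uploaded by ten users occupies the space of a single copy, while the access control list records all ten names.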

While De-duplication technology helped in shrinking the storage space requirement [1], it also invited certain other problems, the most prominent being the access speed of data and the maintenance of data security. Security suffered because of side channels created by virtual machines sharing the same data [3]. Access speed became a challenge because of the overhead of traversing the innumerable blocks corresponding to the thousands of files stored on a server. This traversal had to be performed every time (1) a new file arrived, to check for a De-duplication possibility, and (2) a file retrieval request arrived, to recollect the blocks into a single file. To address this time spent traversing blocks, this paper proposes a De-duplication method that decreases the time required to check the De-duplication possibility of a file and its corresponding blocks.

RELATED WORK
Based on the location of the De-duplication process, it can be divided into two categories: server-side De-duplication and client-side De-duplication. In server-side De-duplication the process takes place at the cloud service provider's premises, i.e. the files travel through the network before the De-duplication decision is taken, wasting the network bandwidth used to send every file, duplicate or not, to the server. To cope with this problem, client-side De-duplication was introduced; it removes this extra bandwidth usage because the De-duplication check is performed at the client side.
From another perspective, De-duplication can also be categorized by its unit: file-level De-duplication or block-level De-duplication. In file-level De-duplication, the hash value of the whole file is calculated and stored at an indexed location, and files whose hash values match are deduplicated. It was later observed that De-duplication can also be performed within a file; De-duplication of blocks (the original files being divided into blocks) was therefore introduced and named block-level De-duplication. One variant of block-level De-duplication is the fixed-size blocking system, which divides the whole file into a number of fixed-size blocks [7]. Another variant is the content-defined blocking system, which divides the file into variable-sized blocks based on their content; this was implemented in a low-bandwidth file system, LBFS [6].
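A fixed-size blocking pass of the kind used in [7] can be sketched as follows. The 4 KB block size and the SHA-256 hash are illustrative assumptions; content-defined blocking as in LBFS [6] would instead cut blocks at content-dependent boundaries.

```python
import hashlib

def fixed_size_blocks(path, block_size=4096):
    """Yield (hash, block) pairs for the fixed-size blocks of a file."""
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:          # end of file reached
                break
            yield hashlib.sha256(block).hexdigest(), block
```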
Sparse indexing [9] was introduced to decrease the latency of index lookup. Its indexing is not based on the hash values of full blocks; instead, sampling is used to increase lookup speed.
Another idea was to use solid state drives (SSDs) [8]: SSDs were employed to enhance lookup speed, as they perform lookups faster. This approach, however, raised issues of extra hardware cost and scalability.

PROBLEMS ADDRESSED AND SOLUTION PROPOSED


The related work discussed above shows that De-duplication has emerged as a continuously improving technology, with improvements proposed every year. However, the speed of data access still remains a big problem. With block-level De-duplication, a large amount of time is spent checking the De-duplication possibility of the blocks, and the time wasted traversing a large number of blocks delays both upload and download of a file. File-level De-duplication is fast in both the De-duplication possibility check and retrieval, since the number of files is far smaller than the number of blocks, but block-level De-duplication was introduced precisely to increase the De-duplication possibility. It can thus be deduced that file-level De-duplication supports fast De-duplication, while block-level De-duplication supports efficient De-duplication. Keeping these points in mind, the application proposed below unites file-level and block-level De-duplication to use the advantages of both while discarding their disadvantages. The methodology involves the
installation of an application on the client machine that is responsible for uploading files, downloading files, and executing the De-duplication process. The detailed working of the application is described in the section below.

METHODOLOGY
The application dedicated to the De-duplication process is installed on the client-side LAN, so that the process remains client-side De-duplication and saves a large amount of network bandwidth. An abstract representation of the system setup is shown in Figure 2. The first block represents the cloud storage service provider, containing the server; to support client-side De-duplication, the service provider remains completely unaware of the process happening at the client's end. The second block gives an abstract representation of the application, which contains a database managing three different lists; this database is dedicated to implementing the hybrid approach. The encryption/decryption block encrypts the blocks before sending them to the storage server in order to maintain the confidentiality of the data, which also helps establish trust between the storage provider and the user. The third block represents the cloud storage service user.

Figure 2: Abstract Representation of the Application
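The encryption/decryption block can be realised with any symmetric cipher; the sketch below uses Fernet from the Python cryptography package purely as an illustration. The paper does not prescribe a cipher, and the key handling and function names here are assumptions.

```python
from cryptography.fernet import Fernet

# The key stays on the client-side LAN; the storage provider only ever sees ciphertext.
key = Fernet.generate_key()
cipher = Fernet(key)

def encrypt_block(block: bytes) -> bytes:
    """Encrypt one block before it leaves the client premises."""
    return cipher.encrypt(block)

def decrypt_block(token: bytes) -> bytes:
    """Decrypt a block fetched back from the storage server."""
    return cipher.decrypt(token)
```

Because the De-duplication decision is taken on plaintext hashes inside the client's application, a randomised cipher such as this one does not hinder De-duplication; systems that instead deduplicate ciphertext across clients typically use a deterministic (convergent) scheme.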


The server dedicated to the De-duplication possibility check stores three different lists. The first list holds the file names and their hash values, the second holds the block names and their hash values, and the third holds the file name and the associated blocks of every file. When a file arrives with a storage request, the first list is checked to see whether the file already exists; if it does, the time otherwise wasted in dividing the file into blocks and checking each block for De-duplication is saved, and the file is deduplicated. When an access request arrives for a file that is already stored, the third list recollects the various blocks of the file and the corresponding blocks are fetched from the server. The steps described below elaborate the working of the system; a code sketch of each procedure follows its list of steps.

Steps Involved during the Storage Request

• A new file storage request is received.
• Calculate the hash value of the file.
• Check List 1.
• If the hash value is already present:
  o Add the file name against its corresponding hash value in List 1 and add the user's name to the access control list of the file.
• Else:
  o Update List 3 by adding the new file name to the list.
  o Divide the file into blocks.
  o Calculate the hash value of each block.
  o For each block, check List 2:
    - If the hash value is already present, update List 3 by adding the block name under the corresponding file name.
    - Else:
      - Update List 2 by adding the block name and its hash value.
      - Update List 3 by adding the block name under the corresponding file name.
      - Encrypt the block.
      - Send the block to the server for storage.
• Stop.
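A minimal sketch of this storage procedure is given below, with the three lists modelled as in-memory dictionaries. The SHA-256 hash, the 4 KB block size, the block-naming scheme, and the helpers encrypt_block and send_to_server are assumptions made for illustration, not the paper's implementation.

```python
import hashlib

BLOCK_SIZE = 4096            # assumed; the paper does not fix a block size

list1 = {}   # List 1: file hash  -> file names sharing that content
list2 = {}   # List 2: block hash -> stored block name
list3 = {}   # List 3: file name  -> ordered block names of the file
acl = {}     # file hash -> users on the file's access control list

def encrypt_block(block):
    """Placeholder for the client-side encryption step (see the Fernet sketch above)."""
    return block

def send_to_server(block_name, payload):
    """Placeholder for the call that uploads one encrypted block to the cloud server."""
    pass

def store_file(user, file_name, data):
    file_hash = hashlib.sha256(data).hexdigest()
    if file_hash in list1:                       # file-level hit: deduplicate immediately,
        list1[file_hash].append(file_name)       # no blocking or block lookups are needed
        acl[file_hash].add(user)
        return
    list1[file_hash] = [file_name]
    acl[file_hash] = {user}
    list3[file_name] = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        block_hash = hashlib.sha256(block).hexdigest()
        if block_hash not in list2:              # unseen block: record it, encrypt, upload
            block_name = f"{file_hash[:8]}-{offset // BLOCK_SIZE}"   # naming is assumed
            list2[block_hash] = block_name
            send_to_server(block_name, encrypt_block(block))
        list3[file_name].append(list2[block_hash])   # List 3 maps the file to its blocks
```

The file-level check at the top is what gives the hybrid approach its speed: a duplicate file is deduplicated after a single hash comparison, and only genuinely new files ever reach the block-level pass.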

Steps Involved during File Retrieval
• A file retrieval request arrives.
• Check the file in List 3.
• Recollect all the blocks listed under the file.
• Retrieve the corresponding blocks from the server.
• Join the blocks into the file and send the file to the user.
• Stop.
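Retrieval then reduces to a lookup in List 3 followed by block fetches, continuing the sketch above; fetch_from_server and decrypt_block are assumed counterparts of the upload-side helpers.

```python
def fetch_from_server(block_name):
    """Placeholder for the call that downloads one encrypted block from the cloud server."""
    raise NotImplementedError   # depends on the storage provider's API

def retrieve_file(file_name):
    """Reassemble a stored file from its blocks using List 3."""
    block_names = list3[file_name]                       # ordered block names of the file
    blocks = (decrypt_block(fetch_from_server(name)) for name in block_names)
    return b"".join(blocks)
```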

RESULT AND ANALYSIS


The methodology described above was tested on a private cloud built with the Eucalyptus cloud infrastructure. The application developed was installed on the virtual machine requested by the user. Two different datasets were created for testing: the first was dummy data for an educational institute, containing roll-call lists and result analyses, and the second was a dummy dataset resembling a private medical clinic, consisting of test reports and patient records. Both datasets were De-duplicated with the proposed application and also with the two most highly rated De-duplication codes available on GitHub [11]. Although the De-duplication ratio remained the same in all three cases, the hybrid approach uploaded files several seconds faster. File 1 (11.6 KB), in dataset 1, was a .doc (Microsoft Word) file containing the roll-call list of third-semester Computer branch students. File 2 (32 KB), in dataset 1, was an .xlsx (Microsoft Excel) file containing the result analysis of the same students. File 3 (18.2 KB), in dataset 2, was a .doc file containing a sugar test report for patient number 1. File 4 (18.2 KB), in dataset 2, was a .doc file containing a sugar test report for patient xxxx. At different times the file uploads gave different timing results; however, an approximate average upload time was noted down and
summarized in the form of a graph for comparison, shown in Figure 3.

Figure 3: Approximate Readings of File Upload Time


It can be seen that, because of the similarity between File 3 and File 4 in dataset 2, their upload took significantly less time, which was not the case for the other files.

CONCLUSIONS AND FUTURE WORK


This paper describes how a hybrid approach to De-duplication can improve both the speed and the De-duplication efficiency of a system. In addition, the application proposed in the paper tries to enhance the security of confidential data by encrypting the data at the client premises. However, the methodology described above was developed and tested as a prototype on a single machine. Further work in this direction can be towards applying the methodology in an actual cloud system where a number of virtual machines are running, a large amount of data is backed up, and faster lookup is required to upload and download files with De-duplication. The approach will be most effective for medical record and bank record systems, where most of the contents of the various files are usually the same.

REFERENCES
1. D. Russell, "Data De-duplication Will Be Even Bigger in 2010," Gartner, February 2010.
2. M. Dutch, "Understanding Data De-duplication Ratios," White paper, June 2008.
3. D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Side Channels in Cloud Services: De-duplication in Cloud Storage," IEEE Security & Privacy, 8(6), Nov. 2010.
4. Z. Sun, J. Shen, and J. Yong, "A Novel Approach to Data De-duplication over the Engineering-Oriented Cloud Systems," Integrated Computer Aided Engineering, 20(1), 45-57, 2013.
5. S. Mkandawire, "Improving Backup and Restore Performance for De-duplication-based Cloud Backup Services," Computer Science and Engineering: Theses, Dissertations, and Student Research, Paper 39, 2012.
6. A. Muthitacharoen, B. Chen, and D. Mazieres, "A Low-Bandwidth Network File System," in Symposium on Operating Systems Principles, 2001, pp. 174-187. [Online]. Available: http://citeseer.ist.psu.edu/muthitacharoen01lowbandwidth.html

7. S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Storage," in First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8085
8. D. Meister and A. Brinkmann, "dedupv1: Improving De-duplication Throughput Using Solid State Drives (SSD)," in IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-6, 2010.
9. D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, "Extreme Binning: Scalable, Parallel De-duplication for Chunk-Based File Backup," in IEEE International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.
10. B. H. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," Communications of the ACM, 13(7):422-426, 1970.
11. www.github.com, accessed 17 April 2014.
