
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR)
ISSN(P): 2249-6831; ISSN(E): 2249-7943
Vol. 4, Issue 5, Oct 2014, 101-106
© TJPRC Pvt. Ltd.

A HYBRID METHOD OF DE-DUPLICATION TO IMPROVE SPEED AND EFFICIENCY


PREETIKA SINGH & MEENAKSHI SHARMA
Department of Computer Science and Engineering, Sri Sai Institute of Engineering and Technology,
Badhani, Punjab, India

ABSTRACT
Cloud computing has redefined the way computing resources are used. Several individual technologies, when united with cloud computing, produce even better results, and De-duplication technology is one of them. While De-duplication helps shrink the storage space required, it also invites a few other problems, chief among them the access speed of data and the security of data. This paper proposes a De-duplication system that is a hybrid of two different types of De-duplication, file-level and block-level; the proposed system inherits the advantages of both while discarding the disadvantages of both.

KEYWORDS: Cloud Computing, De-Duplication, Speed, Efficiency


INTRODUCTION
In this digital era the amount of data in use is growing vast and networks are becoming denser. Many business and scientific applications work with huge amounts of data and access them frequently, and the files containing these data do not necessarily reside at the user's location, as in grid and cloud computing. As people grow closer to computers, their dependency on data increases day by day, and this dependency creates the need for vast data storage space. While cloud computing emerged as a solution to this requirement, the storage of redundant data on the cloud remained a point of concern. To cope with this problem, De-duplication was introduced into cloud storage systems; it reduces the amount of storage space required by de-duplicating the data [2].

Figure 1: Iconic Visualization of Data De-Duplication


For example, when several users of the same cloud storage service upload an identical audio file to the storage server, instead of storing several separate copies the server stores only one file and adds each user's name to the access control list of that file; the storage space, which would otherwise grow in proportion to the number of copies, is thereby saved. Figure 1 shows how unique chunks are formed out of two files, saving storage space equal to three chunks.
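As a minimal illustration of this file-level behaviour, the following sketch stores each distinct file once and records additional uploaders in an access control list. It is only a sketch: the SHA-256 hash, the dictionary layout, and the function name are our own assumptions, not the paper's implementation.

```python
import hashlib

stored_files = {}      # file hash -> name under which the single copy is stored
access_control = {}    # file hash -> set of users allowed to read that copy

def upload(user, file_name, data):
    """Store a file only once; later identical uploads just extend the access list."""
    file_hash = hashlib.sha256(data).hexdigest()
    if file_hash in access_control:        # duplicate content: no new storage is used
        access_control[file_hash].add(user)
    else:                                  # first copy: store it and open an access list
        stored_files[file_hash] = file_name
        access_control[file_hash] = {user}
```

With this scheme, the same audio file uploaded by ten users occupies the space of a single copy, while the access control list records all ten names.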

While De-duplication technology helped in shrinking the storage space requirement [1], it also invited certain other problems, the most prominent being the access speed of data and the maintenance of data security. Security suffered because of side channels created by virtual machines sharing the same data [3]. Access speed became a challenge because of the overhead of traversing the innumerable blocks corresponding to the thousands of files stored on a server. This traversal had to be performed every time (1) a new file arrived, to check for a De-duplication possibility, and (2) a file retrieval request arrived, to recollect the blocks into a single file. To address this time spent traversing blocks, this paper proposes a De-duplication method that decreases the time required to check the De-duplication possibility of a file and its corresponding blocks.

RELATED WORK
Based on the location of the De-duplication process, it can be divided into two categories: server-side De-duplication and client-side De-duplication. In server-side De-duplication the process takes place at the cloud service provider's premises, i.e. the files travel through the network before the De-duplication decision is taken, wasting the network bandwidth used to send every file, duplicate or not, to the server. To cope with this problem, client-side De-duplication was introduced; it removes this extra bandwidth usage because the De-duplication check is performed at the client side.
From another perspective, De-duplication can also be categorized by its unit: file-level De-duplication or block-level De-duplication. In file-level De-duplication, the hash value of the whole file is calculated and stored at an indexed location, and files whose hash values match are deduplicated. It was later observed that De-duplication can also be performed within a file; De-duplication of blocks (the original files being divided into blocks) was therefore introduced and named block-level De-duplication. One variant of block-level De-duplication is the fixed-size blocking system, which divides the whole file into a number of fixed-size blocks [7]. Another variant is the content-defined blocking system, which divides the file into variable-sized blocks based on their content; this was implemented in a low-bandwidth file system, LBFS [6].
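A fixed-size blocking pass of the kind used in [7] can be sketched as follows. The 4 KB block size and the SHA-256 hash are illustrative assumptions; content-defined blocking as in LBFS [6] would instead cut blocks at content-dependent boundaries.

```python
import hashlib

def fixed_size_blocks(path, block_size=4096):
    """Yield (hash, block) pairs for the fixed-size blocks of a file."""
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:          # end of file reached
                break
            yield hashlib.sha256(block).hexdigest(), block
```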
Sparse indexing [9] was introduced to decrease the latency of index lookup. Its indexing is not based on the hash values of full blocks; instead, sampling is used to increase lookup speed.
Another idea was to use solid state drives (SSDs) [8]: SSDs were employed to enhance lookup speed, as they perform lookups faster. This approach, however, raised issues of extra hardware cost and scalability.

PROBLEMS ADDRESSED AND SOLUTION PROPOSED


The related work discussed above shows that De-duplication has emerged as a continuously improving technology, with improvements proposed every year. However, the speed of data access still remains a big problem. With block-level De-duplication, a large amount of time is spent checking the De-duplication possibility of the blocks, and the time wasted traversing a large number of blocks delays both upload and download of a file. File-level De-duplication is fast in both the De-duplication possibility check and retrieval, since the number of files is far smaller than the number of blocks, but block-level De-duplication was introduced precisely to increase the De-duplication possibility. It can thus be deduced that file-level De-duplication supports fast De-duplication, while block-level De-duplication supports efficient De-duplication. Keeping these points in mind, the application proposed below unites file-level and block-level De-duplication to use the advantages of both while discarding their disadvantages. The methodology involves the
installation of an application on the client machine that is responsible for uploading files, downloading files, and executing the De-duplication process. The detailed working of the application is described in the section below.

METHODOLOGY
The application dedicated to the De-duplication process is installed on the client-side LAN, so that the process remains client-side De-duplication and saves a large amount of network bandwidth. An abstract representation of the system setup is shown in Figure 2. The first block represents the cloud storage service provider, containing the server; to support client-side De-duplication, the service provider remains completely unaware of the process happening at the client's end. The second block gives an abstract representation of the application, which contains a database managing three different lists; this database is dedicated to implementing the hybrid approach. The encryption/decryption block encrypts the blocks before sending them to the storage server in order to maintain the confidentiality of the data, which also helps establish trust between the storage provider and the user. The third block represents the cloud storage service user.

Figure 2: Abstract Representation of the Application
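The encryption/decryption block can be realised with any symmetric cipher; the sketch below uses Fernet from the Python cryptography package purely as an illustration. The paper does not prescribe a cipher, and the key handling and function names here are assumptions.

```python
from cryptography.fernet import Fernet

# The key stays on the client-side LAN; the storage provider only ever sees ciphertext.
key = Fernet.generate_key()
cipher = Fernet(key)

def encrypt_block(block: bytes) -> bytes:
    """Encrypt one block before it leaves the client premises."""
    return cipher.encrypt(block)

def decrypt_block(token: bytes) -> bytes:
    """Decrypt a block fetched back from the storage server."""
    return cipher.decrypt(token)
```

Because the De-duplication decision is taken on plaintext hashes inside the client's application, a randomised cipher such as this one does not hinder De-duplication; systems that instead deduplicate ciphertext across clients typically use a deterministic (convergent) scheme.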


The server dedicated to the De-duplication possibility check stores three different lists. The first list holds the file names and their hash values, the second holds the block names and their hash values, and the third holds the file name and the associated blocks of every file. When a file arrives with a storage request, the first list is checked to see whether the file already exists; if it does, the time otherwise wasted in dividing the file into blocks and checking each block for De-duplication is saved, and the file is deduplicated. When an access request arrives for a file that is already stored, the third list recollects the various blocks of the file and the corresponding blocks are fetched from the server. The steps described below elaborate the working of the system; a code sketch of each procedure follows its list of steps.

Steps Involved during the Storage Request

• A new file storage request is received.
• Calculate the hash value of the file.
• Check List 1.
• If the hash value is already present:
  o Add the file name against its corresponding hash value in List 1 and add the user's name to the access control list of the file.
• Else:
  o Update List 3 by adding the new file name to the list.
  o Divide the file into blocks.
  o Calculate the hash value of each block.
  o For each block, check List 2:
    - If the hash value is already present, update List 3 by adding the block name under the corresponding file name.
    - Else:
      - Update List 2 by adding the block name and its hash value.
      - Update List 3 by adding the block name under the corresponding file name.
      - Encrypt the block.
      - Send the block to the server for storage.
• Stop.
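A minimal sketch of this storage procedure is given below, with the three lists modelled as in-memory dictionaries. The SHA-256 hash, the 4 KB block size, the block-naming scheme, and the helpers encrypt_block and send_to_server are assumptions made for illustration, not the paper's implementation.

```python
import hashlib

BLOCK_SIZE = 4096            # assumed; the paper does not fix a block size

list1 = {}   # List 1: file hash  -> file names sharing that content
list2 = {}   # List 2: block hash -> stored block name
list3 = {}   # List 3: file name  -> ordered block names of the file
acl = {}     # file hash -> users on the file's access control list

def encrypt_block(block):
    """Placeholder for the client-side encryption step (see the Fernet sketch above)."""
    return block

def send_to_server(block_name, payload):
    """Placeholder for the call that uploads one encrypted block to the cloud server."""
    pass

def store_file(user, file_name, data):
    file_hash = hashlib.sha256(data).hexdigest()
    if file_hash in list1:                       # file-level hit: deduplicate immediately,
        list1[file_hash].append(file_name)       # no blocking or block lookups are needed
        acl[file_hash].add(user)
        return
    list1[file_hash] = [file_name]
    acl[file_hash] = {user}
    list3[file_name] = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        block_hash = hashlib.sha256(block).hexdigest()
        if block_hash not in list2:              # unseen block: record it, encrypt, upload
            block_name = f"{file_hash[:8]}-{offset // BLOCK_SIZE}"   # naming is assumed
            list2[block_hash] = block_name
            send_to_server(block_name, encrypt_block(block))
        list3[file_name].append(list2[block_hash])   # List 3 maps the file to its blocks
```

The file-level check at the top is what gives the hybrid approach its speed: a duplicate file is deduplicated after a single hash comparison, and only genuinely new files ever reach the block-level pass.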

Steps Involved during File Retrieval
• A file retrieval request arrives.
• Check the file in List 3.
• Recollect all the blocks listed under the file.
• Retrieve the corresponding blocks from the server.
• Join the blocks into the file and send the file to the user.
• Stop.
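Retrieval then reduces to a lookup in List 3 followed by block fetches, continuing the sketch above; fetch_from_server and decrypt_block are assumed counterparts of the upload-side helpers.

```python
def fetch_from_server(block_name):
    """Placeholder for the call that downloads one encrypted block from the cloud server."""
    raise NotImplementedError   # depends on the storage provider's API

def retrieve_file(file_name):
    """Reassemble a stored file from its blocks using List 3."""
    block_names = list3[file_name]                       # ordered block names of the file
    blocks = (decrypt_block(fetch_from_server(name)) for name in block_names)
    return b"".join(blocks)
```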

RESULT AND ANALYSIS


The methodology described above was tested on a private cloud built with the Eucalyptus cloud infrastructure. The application developed was installed on the virtual machine requested by the user. Two different datasets were created for testing: the first was dummy data for an educational institute, containing roll-call lists and result analyses, and the second was a dummy dataset resembling a private medical clinic, consisting of test reports and patient records. Both datasets were De-duplicated with the proposed application and also with the two most highly rated De-duplication codes available on GitHub [11]. Although the De-duplication ratio remained the same in all three cases, the hybrid approach uploaded files several seconds faster. File 1 (11.6 KB), in dataset 1, was a .doc (Microsoft Word) file containing the roll-call list of third-semester Computer branch students. File 2 (32 KB), in dataset 1, was an .xlsx (Microsoft Excel) file containing the result analysis of the same students. File 3 (18.2 KB), in dataset 2, was a .doc file containing a sugar test report for patient number 1. File 4 (18.2 KB), in dataset 2, was a .doc file containing a sugar test report for patient xxxx. At different times the file uploads gave different timing results; however, an approximate average upload time was noted down and
summarized in the form of a graph for comparison, shown in Figure 3.

Figure 3: Approximate Readings of File Upload Time


It can be seen that, because of the similarity between File 3 and File 4 in dataset 2, their upload took significantly less time, which was not the case for the other files.

CONCLUSIONS AND FUTURE WORK


This paper describes how a hybrid approach to De-duplication can improve both the speed and the De-duplication efficiency of a system. In addition, the application proposed in the paper tries to enhance the security of confidential data by encrypting the data at the client premises. However, the methodology described above was developed and tested as a prototype on a single machine. Further work in this direction can be towards applying the methodology in an actual cloud system where a number of virtual machines are running, a large amount of data is backed up, and faster lookup is required to upload and download files with De-duplication. The approach will be most effective for medical record and bank record systems, where most of the contents of the various files are usually the same.

REFERENCES
1. D. Russell, "Data De-duplication Will Be Even Bigger in 2010," Gartner, February 2010.
2. M. Dutch, "Understanding Data De-duplication Ratios," White paper, June 2008.
3. D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Side Channels in Cloud Services: De-duplication in Cloud Storage," IEEE Security & Privacy, 8(6), Nov. 2010.
4. Z. Sun, J. Shen, and J. Yong, "A Novel Approach to Data De-duplication over the Engineering-Oriented Cloud Systems," Integrated Computer Aided Engineering, 20(1), 45-57, 2013.
5. S. Mkandawire, "Improving Backup and Restore Performance for De-duplication-based Cloud Backup Services," Computer Science and Engineering: Theses, Dissertations, and Student Research, Paper 39, 2012.
6. A. Muthitacharoen, B. Chen, and D. Mazieres, "A Low-Bandwidth Network File System," in Symposium on Operating Systems Principles, 2001, pp. 174-187. [Online]. Available: http://citeseer.ist.psu.edu/muthitacharoen01lowbandwidth.html

7. S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Storage," in First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8085
8. D. Meister and A. Brinkmann, "dedupv1: Improving De-duplication Throughput Using Solid State Drives (SSD)," in IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-6, 2010.
9. D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, "Extreme Binning: Scalable, Parallel De-duplication for Chunk-Based File Backup," in IEEE International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.
10. B. H. Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors," Communications of the ACM, 13(7):422-426, 1970.
11. www.github.com, accessed 17 April 2014.
