ABSTRACT
Cloud computing has redefined the way computing resources are consumed. Several individual technologies produce even better results when united with cloud computing, and De-duplication is one of them. While De-duplication technology helps shrink the storage space required, it also introduces other problems, chief among them the access speed of data and the security of data. This paper proposes a De-duplication system that is a hybrid of two different types of De-duplication, file-level and block-level De-duplication. The proposed system inherits the advantages of both while discarding the disadvantages of both.
www.tjprc.org
editor@tjprc.org
While De-duplication technology helps shrink the storage space requirement [1], it also introduces certain other problems, chief among them the access speed of data and the maintenance of data security. Security suffers because of side channels created by virtual machines sharing the same data [3]. Access speed becomes a challenge because of the overhead of traversing the innumerable blocks corresponding to the thousands of files stored on a server. This traversal has to be performed every time (1) a new file arrives, to check for a De-duplication possibility, and (2) a file retrieval request arrives, to recollect the blocks into a single file. To address this problem of time spent traversing blocks, this paper proposes a De-duplication method that decreases the time required to check the De-duplication possibility of a file and its corresponding blocks.
RELATED WORK
Based on the location of the De-duplication process, it can be divided into two categories: server-side De-duplication and client-side De-duplication. In server-side De-duplication the process takes place at the cloud service provider's premises, i.e. the files travel through the network before the De-duplication decision is made, wasting the network bandwidth used to send every file to the server. To cope with this problem, client-side De-duplication was introduced; it removes the extra bandwidth usage because the De-duplication check is performed on the client side.
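The bandwidth saving of client-side De-duplication can be sketched as follows. This is a minimal illustration, not the paper's implementation: `server_index` is a hypothetical in-memory stand-in for the provider's hash index, which in practice would be queried over the network.

```python
import hashlib

# Hypothetical stand-in for the server's index of already-stored file hashes.
server_index = {}

def client_side_upload(data: bytes) -> bool:
    """Return True only if the file bytes actually had to be transmitted.

    The client sends just the hash first; the full file crosses the
    network only when the server has not seen that hash before.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest in server_index:      # duplicate: skip the upload entirely
        return False
    server_index[digest] = data     # unique file: transmit and store it
    return True
```

In server-side De-duplication the full file would be sent in every case; here a duplicate costs only one hash exchange.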
From another perspective, De-duplication can also be categorized by the unit of De-duplication: file-level De-duplication or block-level De-duplication. In file-level De-duplication, the hash value of the whole file is calculated and stored at an indexed location; files whose hash values match are deduplicated. It was later observed that De-duplication can also be performed within a file, so De-duplication of blocks (the original files divided into blocks) was introduced, named block-level De-duplication. One variant of block-level De-duplication is the fixed-size blocking system, which divides the whole file into a number of fixed-size blocks [7]. Another variant is the content-defined blocking system, which divides the file into variable-sized blocks based on their content; this was implemented in a low-bandwidth file system, LBFS [6].
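The fixed-size blocking variant can be sketched in a few lines. This is an illustrative sketch only (the 4096-byte block size is an assumed example, not a value from the paper): two files that differ in one block still share the digests of their identical blocks, which file-level hashing would miss.

```python
import hashlib

def fixed_size_blocks(data: bytes, block_size: int = 4096):
    """Split a file into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def block_hashes(data: bytes, block_size: int = 4096):
    """Hash every block; equal blocks anywhere yield equal digests,
    so only one copy of each distinct block needs to be stored."""
    return [hashlib.sha256(b).hexdigest()
            for b in fixed_size_blocks(data, block_size)]
```

Content-defined blocking differs only in how the cut points are chosen: a rolling hash of the content, rather than a fixed offset, decides where each block ends.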
Sparse indexing [9] was introduced to decrease the latency of index lookup. Indexing in this technique is not based on the hash values of all blocks; instead, sampling is used to increase lookup speed.
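The sampling idea can be illustrated with a small sketch. This is a simplified assumption about how such sampling might look, not the scheme of [9] in full: only hashes whose low bits are zero are kept as "hooks" in the in-memory index, shrinking it by roughly a factor of 2**sample_bits.

```python
import hashlib

def sample_hooks(chunk_hashes, sample_bits: int = 3):
    """Keep only the hashes whose low `sample_bits` bits are zero.

    A sparse index stores just these sampled hooks instead of every
    chunk hash, so the index stays small enough to hold in memory.
    """
    mask = (1 << sample_bits) - 1
    return [h for h in chunk_hashes if int(h, 16) & mask == 0]
```

An incoming segment is matched by looking up only its hooks; the full chunk lists of the segments those hooks point to are then fetched for the detailed comparison.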
Another idea was to use solid-state drives (SSDs) [8]. SSDs were used to enhance lookup speed, as they can perform lookups faster than disks. This approach, however, requires extra hardware and raises scalability issues.
installation of an application on the client machine, which is responsible for file upload, file download, and execution of the De-duplication process. The detailed working of the application is described in the section below.
METHODOLOGY
The application dedicated to the De-duplication process is installed on the client-side LAN, so that the process remains client-side De-duplication, which saves a large amount of network bandwidth. An abstract representation of the system setup is shown in Figure 1. The first block represents the cloud storage service provider, containing the server. The service provider remains completely unaware of the process happening at the client's end, to support client-side De-duplication. The second block gives an abstract representation of the application. The application contains a database managing three different lists; this database is dedicated to implementing the hybrid approach. The encryption/decryption block encrypts the blocks while sending them to the storage server in order to maintain the confidentiality of the data. This also helps establish trust between the storage provider and the user. The third block represents the cloud storage service user.
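One property the encryption step must preserve is that identical blocks still encrypt to identical ciphertext, otherwise the server could no longer deduplicate them. A convergent-style scheme, where the key is derived from the block's own content, has this property. The sketch below is a toy illustration of that idea only; the XOR keystream is NOT a secure cipher, and the paper does not specify which cipher its encryption block uses.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Expand a key into a keystream by counter-chained hashing (toy PRG)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def convergent_encrypt(block: bytes) -> bytes:
    """Toy convergent encryption: the key is the hash of the block itself,
    so identical plaintext blocks yield identical ciphertext and remain
    de-duplicable on the server. Illustration only, not a real cipher."""
    key = hashlib.sha256(block).digest()
    return bytes(a ^ b for a, b in zip(block, keystream(key, len(block))))
```

The client keeps the per-block key (the plaintext hash); XORing the ciphertext with the same keystream recovers the block on download.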
Check in List 1.
    Add the name of the file to its corresponding hash value in List 1, and add the name of the user to the access control list of the file.
Else

Update List 3 by adding the block name under the corresponding file name.
Else
    Update List 3 by adding the block name under the corresponding hash value.
Stop.

Send the file to the user after joining the blocks back into the file.
Stop.
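The upload and download steps above can be sketched end to end. This is a hypothetical reconstruction for illustration: the names `list1`, `list3`, and `file_map`, the return strings, and the 4096-byte block size are assumptions, not the paper's exact data structures, and encryption and access-control bookkeeping are omitted.

```python
import hashlib

BLOCK = 4096

list1 = {}     # file hash  -> [file names]     (file-level index)
list3 = {}     # block hash -> block bytes      (block-level store)
file_map = {}  # file name  -> [block hashes]   (recipe to rebuild a file)

def upload(name: str, data: bytes) -> str:
    """Hybrid check: try whole-file De-duplication first; fall back to
    block-level De-duplication only for files never seen before."""
    fhash = hashlib.sha256(data).hexdigest()
    if fhash in list1:                       # whole file already stored
        first_owner = list1[fhash][0]
        list1[fhash].append(name)
        file_map[name] = file_map[first_owner]
        return "file-level duplicate"
    list1[fhash] = [name]
    recipe = []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        bhash = hashlib.sha256(block).hexdigest()
        if bhash not in list3:               # store only unseen blocks
            list3[bhash] = block
        recipe.append(bhash)
    file_map[name] = recipe
    return "stored"

def download(name: str) -> bytes:
    """Rejoin the blocks of a file from its recipe before sending it."""
    return b"".join(list3[h] for h in file_map[name])
```

A file that repeats in full is caught by the single file-level lookup, so its blocks are never traversed; only genuinely new files pay the per-block cost, which is the source of the time saving claimed above.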
REFERENCES
1. D. Russell, "Data De-duplication Will Be Even Bigger in 2010," Gartner, February 2010.
2.
3. D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Side Channels in Cloud Services: De-duplication in Cloud Storage," IEEE Security & Privacy, 8(6), Nov. 2010.
4. Z. Sun, J. Shen, and J. Yong, "A novel approach to data De-duplication over the engineering-oriented cloud systems," Integrated Computer-Aided Engineering, 20(1), 45-57, 2013.
5. S. Mkandawire, "Improving Backup and Restore Performance for De-duplication-based Cloud Backup Services," Computer Science and Engineering: Theses, Dissertations, and Student Research, Paper 39, 2012.
6. A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in Symposium on Operating Systems Principles, 2001, pp. 174-187. [Online]. Available: http://citeseer.ist.psu.edu/muthitacharoen01lowbandwidth.html
7. S. Quinlan and S. Dorward, "Venti: A new approach to archival storage," in First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.8085
8. D. Meister and A. Brinkmann, "dedupv1: Improving De-duplication throughput using solid state drives (SSD)," in IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-6, 2010.
9. D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, "Extreme Binning: Scalable, parallel De-duplication for chunk-based file backup," in IEEE International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.
10. B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, 13(7):422-426, 1970.
11. www.github.com, accessed 17 April 2014.