ABSTRACT:
1. INTRODUCTION:
Cloud computing is essentially a service-oriented architecture that renders easy access to all who
make use of it. The demand for computational power is rising continuously: CPU performance roughly
doubles every three years, yet file sizes are growing at an even more striking rate. Twenty years
ago the common format was the plain text file; later, computers could handle graphics well and play
low-quality movies. In recent years users, no longer satisfied with DVD quality, have adopted the
Blu-ray disc, and typical file sizes have grown from a few KB to nearly 1 TB.
The characteristics that distinguish the cloud from other computing technologies are on-demand
self-service, agility, autonomic computing, virtualization, a parallel and distributed architecture,
and a pay-per-use model. A cloud is said to have a parallel and distributed architecture because it
consists of a family of interconnected and virtualized computers.
An organization moving from the traditional model to the cloud model switches from buying
dedicated hardware and depreciating it over a period of time to using shared resources in a cloud
infrastructure and paying based on usage.
Cloud computing reduces up-front infrastructure costs, runs applications faster with better
manageability and less maintenance, lets resources be adjusted rapidly to meet inconsistent and
unpredictable business demand, and allows users to concentrate on their business instead of on
infrastructure. Cloud providers use a "pay-as-you-go" model; if this model is not used carefully,
it can unexpectedly lead to high charges. In this report, a physical machine (PM) denotes a cloud
server, and an instance or virtual machine (VM) denotes the virtual server provided to the users.
Parallel systems make computation faster
File sizes keep increasing, but algorithms do not improve as quickly. Moreover, newer formats
usually require more complicated decoding algorithms, which lengthens processing time. The only
way to greatly speed up the process is to run the job in parallel on different machines.
The cloud reduces cost
Buying expensive facilities for a single purpose is unreasonable, whereas the cloud charges only
for the time it is actually used. This enables many people to run large-scale projects at an
acceptable cost.
Agility
The system re-provisions resources to match users' needs; provisioning is altered according to
the customers' requests.
Cost:
A public-cloud delivery model converts capital expenditure on resources into operating
expenditure. This supposedly lowers barriers to entry: instead of acquiring the tools and software
outright, the infrastructure uses services provided by a third party. The cost saving depends on
the kind of infrastructure available.
Device and location independence
The user is allowed to access the system through a web browser regardless of location or of which
device is used. The infrastructure is provided by an intermediary and sits off-site, so the user
can reach resources from anywhere with an Internet connection.
Maintenance
Rather than being installed on each user's system, resources can be accessed from any location,
which makes it easy for the user to maintain records in the cloud.
Multi-tenancy
Numerous customers use the same public cloud. Across this large pool of users, multi-tenancy
enables the sharing of resources and costs, allowing for:
Centralization of infrastructure
Several users access the same infrastructure from different locations at reduced cost, without
installing the software on their own systems.
Peak-load capacity
The capacity of the system increases according to the services users request, so services are
provided even at peak load without interruption.
Utilization and efficiency
The system adapts to changing needs by provisioning and de-provisioning resources, so that
resources are utilized in an efficient manner.
Performance
Loosely coupled architectures are constructed with web services as the system interface. The
performance of the system is monitored and consistency is maintained.
Reliability
Multiple redundant sites are accessed by several users simultaneously, which makes cloud
computing suitable for business continuity and disaster recovery.
Business Continuity
Business continuity means the organization must have measures in place to mitigate the impact of
planned and unplanned outages.
Disaster Recovery
Disaster recovery refers to recovering system data: copies of data are backed up to an alternative
site so that they can be restored when the primary site is incapacitated by a disaster. Resources
are provided to users' requests without interruption. VM start-up time varies by VM type,
location, operating system and cloud provider.
Security
When data is scattered over a broader region or a greater number of devices, and in multi-tenant
systems shared by unrelated users, the complexity of security is greatly increased. Private cloud
installations are motivated in part by users' desire to retain control over the infrastructure and
avoid losing control of information security.
Figure 1.1. WORKING OF CLOUD COMPUTING
On-demand self-services
Broad network access
Resource pooling
1.1.3. TYPES OF CLOUD SERVICES
Infrastructure as a Service (IaaS)
The customer is provided with processing, data storage, networks, and other essential computing
resources, and is able to deploy and run arbitrary software, including operating systems and
applications. The cloud infrastructure itself is not managed or controlled by the consumer, but
the consumer has access to operating systems, storage and deployed applications, and possibly
limited control of select networking components such as host firewalls.
Figure 1.2. TYPES OF CLOUD SERVICES
Platform as a Service (PaaS)
The capability afforded to the end user is to deploy onto the cloud infrastructure
consumer-created or acquired applications built using programming languages and tools supported
by the provider. The end user does not manage or control the underlying cloud infrastructure,
including the network, servers, operating systems and storage, but has control over the deployed
applications and possibly over application-hosting environment configurations.
Deployment model:
Private cloud
A private cloud is cloud infrastructure operated solely for a single organization, whether
managed internally or by a third party, and hosted either internally or externally.
Public cloud:
A cloud is said to be public when its services are rendered over a network that is open for
public use. Public clouds offer services either free or on a pay-per-usage model.
The main difference from a private cloud is the security consideration for services such as
applications, storage, and other resources that are made available to an unrestricted audience.
Hybrid cloud
Hybrid cloud is a composition of two or more clouds, i.e. private, community or public.
Hybrid cloud services are cloud computing services composed of some combination of private,
public and community cloud services, obtained from different service providers.
A hybrid cloud service crosses isolation and provider boundaries, so it cannot simply be placed
in the private, public or community category.
A major advantage of cloud bursting and the hybrid cloud model is that an organization pays for
extra compute resources only when they are needed.
Community cloud:
A community cloud shares infrastructure between several organizations from a specific community
with common concerns (security, compliance, jurisdiction, etc.), whether managed internally or by
a third party, and hosted internally or externally. Because the costs are spread over fewer users
than in a public cloud, only some of the cost-saving potential of cloud computing is realized.
Distributed cloud:
Cloud computing can also be provided as a distributed set of machines running at different
locations, connected to a single network or hub. Such a distributed system voluntarily shares
some of its resources over the network.
Inter-cloud
It resembles the hybrid and multi-cloud models, but focuses on interoperability between public
cloud service providers rather than between providers and consumers.
Multi-cloud
It differs from hybrid cloud in that it refers to the use of multiple cloud services rather than
multiple deployment modes.
CHAPTER 2:
2.1 INTRODUCTION:
Storing large amounts of data efficiently, in terms of both time and space, is of paramount
concern in the design of backup and restore systems. Users may wish to periodically (e.g., hourly,
daily or weekly) back up data stored on their computers as a precaution against possible crashes,
corruption or accidental deletion of important data. Commonly, most of the data has not changed
since the last backup was performed, so much of the current data can already be found in the
backup repository, with only minor changes. If the data in the repository that is similar to the
current backup data can be located efficiently, then there is no need to store the data again;
only the changes need be recorded. This process of storing common data only once is known as data
deduplication. Data deduplication is much easier to achieve with disk-based storage than with tape
backup. The technology bridges the price gap between disk-based and tape-based backup, making
disk-based backup affordable. Disk-based backup has several distinctive advantages over tape in
terms of reducing backup windows and improving restore reliability and speed. In a backup and
restore system with deduplication it is very likely that a new input data stream is similar to
data already in the repository, but many different types of changes are possible. Given the
potential size of the repository, which may hold hundreds of terabytes of data, identifying the
regions of similarity to the new incoming data is a major challenge. In addition, similarity
matching must be performed quickly in order to meet high backup-bandwidth requirements.
This project aims to alleviate the disk bottleneck of fingerprint lookup, reduce fingerprint
lookup time, and improve the throughput of data deduplication. Two problems arise in cluster
deduplication. First, deduplication is typically performed only within individual servers due to
overhead considerations, which leaves cross-node redundancy untouched; data routing, a technique
to concentrate data redundancy within individual nodes, reduce cross-node redundancy and balance
load, therefore becomes a key issue in cluster deduplication design. Second, in the intra-node
scenario, the system suffers from the disk chunk-index lookup bottleneck.
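To make the data-routing idea concrete, the following is a minimal Python sketch, not this
project's actual implementation: under the illustrative assumption that a file is routed by its
minimum chunk fingerprint, copies of similar files concentrate on the same node, where intra-node
deduplication can then remove their shared chunks without cross-node lookups.

    # Minimal sketch of stateless, hash-based data routing (assumed scheme:
    # route each file by its minimum chunk fingerprint).
    import hashlib

    def fingerprint(chunk):
        return hashlib.sha1(chunk).hexdigest()

    def route_file(chunks, num_nodes):
        # Pick the target node from a representative (minimum) fingerprint.
        representative = min(fingerprint(c) for c in chunks)
        return int(representative, 16) % num_nodes

    # Two copies of the same file always route to the same node, so their
    # duplicate chunks can be removed there by intra-node deduplication.
    file_chunks = [b"chunk-a", b"chunk-b", b"chunk-c"]
    assert route_file(file_chunks, 8) == route_file(list(file_chunks), 8)

Because the routing decision depends only on the data itself, no central state is needed, which
is what makes such a scheme "stateless".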
2.2 STUDY OF DEDUPLICATION:
Storage-based data deduplication reduces the amount of storage needed for a given set of files. It
is most effective in applications where many copies of very similar or even identical data are
stored on a single disk—a surprisingly common scenario. In the case of data backups, which
routinely are performed to protect against data loss, most data in a given backup remain
unchanged from the previous backup. Common backup systems try to exploit this by omitting
(or hard linking) files that haven't changed or storing differences between files. Neither approach
captures all redundancies, however. Hard-linking does not help with large files that have only
changed in small ways, such as an email database; differences only find redundancies in adjacent
versions of a single file (consider a section that was deleted and later added in again or a logo
image included in many documents).
Network data deduplication is used to reduce the number of bytes that must be transferred
between endpoints, which can reduce the amount of bandwidth required.
Virtual servers benefit from deduplication because it allows nominally separate system files for
each virtual server to be coalesced into a single storage space. At the same time, if a given server
customizes a file, deduplication will not change the files on the other servers—something that
alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies
of virtual environments is similarly improved.
DEDUPLICATION METHODS:
One of the most common forms of data deduplication implementations works by comparing
chunks of data to detect duplicates. For that to happen, each chunk of data is assigned identification,
calculated by the software, typically using cryptographic hash functions. In many implementations,
the assumption is made that if the identification is identical, the data is identical, even though this
cannot be true in all cases due to the pigeonhole principle; other implementations do not assume that
two blocks of data with the same identifier are identical, but actually verify that data with the same
identification is identical. If the software either assumes that a given identification already exists in
the deduplication namespace or actually verifies the identity of the two blocks of data, depending on
the implementation, then it will replace that duplicate chunk with a link.
Once the data has been deduplicated, upon read back of the file, wherever a link is found, the system
simply replaces that link with the referenced data chunk. The deduplication process is intended to be
transparent to end users and applications.
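The following toy Python sketch illustrates this general method under assumed, illustrative names
(an in-memory store with SHA-1 chunk identifiers, which is one common choice): a duplicate chunk
is recorded only as a link, and read-back replaces each link with the referenced chunk.

    import hashlib

    class ChunkStore:
        """Toy content-addressed store: one stored copy per unique chunk."""
        def __init__(self):
            self.chunks = {}   # fingerprint -> chunk bytes
            self.files = {}    # file name -> list of fingerprints (the "links")

        def put(self, name, data, chunk_size=4096):
            links = []
            for i in range(0, len(data), chunk_size):
                chunk = data[i:i + chunk_size]
                fp = hashlib.sha1(chunk).hexdigest()
                # A duplicate chunk is recorded as a link only; nothing is re-stored.
                self.chunks.setdefault(fp, chunk)
                links.append(fp)
            self.files[name] = links

        def get(self, name):
            # On read-back, every link is replaced by the referenced chunk.
            return b"".join(self.chunks[fp] for fp in self.files[name])

    store = ChunkStore()
    store.put("a.txt", b"hello world" * 1000)
    store.put("b.txt", b"hello world" * 1000)   # fully deduplicated against a.txt
    assert store.get("b.txt") == b"hello world" * 1000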
Client backup deduplication. This is the process where the deduplication hash calculations are
initially performed on the source (client) machines. Files whose hashes match files already in
the target device are not sent; the target device just creates appropriate internal links to
reference the duplicated data. The benefit of this is that it avoids sending data unnecessarily
across the network, thereby reducing traffic load.
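A minimal sketch of this client-side protocol, assuming a simple has/receive interface between
client and target device (all names here are illustrative, not from a particular product):

    import hashlib

    class TargetDevice:
        """Server side: remembers fingerprints of chunks it already stores."""
        def __init__(self):
            self.stored = {}

        def has(self, fp):
            return fp in self.stored

        def receive(self, fp, chunk):
            self.stored[fp] = chunk

    def client_backup(chunks, target):
        sent = 0
        for chunk in chunks:
            fp = hashlib.sha1(chunk).hexdigest()
            if not target.has(fp):          # hash is computed at the source, so
                target.receive(fp, chunk)   # duplicates never cross the network
                sent += 1
        return sent

    target = TargetDevice()
    chunks = [b"x" * 4096, b"y" * 4096, b"x" * 4096]
    assert client_backup(chunks, target) == 2   # the repeated chunk is never sent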
Primary storage and secondary storage. By definition, primary storage systems are designed for
optimal performance, rather than lowest possible cost. The design criteria for these systems are
to increase performance, at the expense of other considerations. Moreover, primary storage
systems are much less tolerant of any operation that can negatively impact performance. Also by
definition, secondary storage systems contain primarily duplicate or secondary copies of data.
These copies of data are typically not used for actual production operations and as a result are
more tolerant of some performance degradation, in exchange for increased efficiency.
A Hybrid Cloud Approach for Secure Authorized Deduplication:
This hybrid aiming at efficiently solving the problem of deduplication with differential
privileges in cloud computing, consider a hybrid cloud architecture consisting of a public cloud and
a private cloud. Unlike existing data deduplication systems, the private cloud is involved as a proxy
to allow data owner/users to securely perform duplicate check with differential privileges. Such
architecture is practical and has attracted much attention from researchers. The data owners only
outsource their data storage by utilizing public cloud while the data operation is managed in private
cloud. A new deduplication system supporting differential duplicate check is proposed under this
hybrid cloud architecture where the S-CSP resides in the public cloud. The user is only allowed to
perform the duplicate check for files marked with the corresponding privileges.
Furthermore, the system is enhanced in security. Specifically, an advanced scheme supports
stronger security by encrypting the file with differential privilege keys. In this way, users
without the corresponding privileges cannot perform the duplicate check; moreover, such
unauthorized users cannot decrypt the ciphertext even if they collude with the S-CSP. Security
analysis demonstrates that the system is secure in terms of the definitions specified in the
proposed security model. Finally, a prototype of the proposed authorized duplicate check is
implemented and testbed experiments evaluate its overhead.
Secure Auditing and Deduplicating Data in Cloud:
SecCloud introduces an auditing entity that maintains a MapReduce cloud, which helps clients
generate data tags before uploading and audits the integrity of data stored in the cloud. This
design fixes the issue of previous work that the computational load at the user or auditor is too
large for tag generation. For fine-grained completeness, the auditing functionality designed in
SecCloud is supported at both the block level and the sector level. In addition, SecCloud enables
secure deduplication. Note that the "security" considered in SecCloud is the prevention of leakage
of side-channel information. In order to prevent such leakage, the authors follow tradition and
design a proof-of-ownership protocol between clients and cloud servers, which allows clients to
prove to cloud servers that they indeed own the target data.
A Study of Practical Deduplication
File systems often contain redundant copies of information: identical files or sub-file regions,
possibly stored on a single host, on a shared storage cluster, or backed-up to secondary storage.
Deduplicating storage systems take advantage of this redundancy to reduce the underlying space
needed to contain the file systems (or backup images thereof). Deduplication can work at either the
sub-file or whole-file level. More fine-grained deduplication creates more opportunities for space
savings, but necessarily reduces the sequential layout of some files, which may have significant
performance impacts when hard disks are used for storage (and in some cases necessitates
complicated techniques to improve performance). Alternatively, whole-file deduplication is simpler
and eliminates file-fragmentation concerns, though at the cost of some otherwise reclaimable
storage. Because the disk technology trend is toward improved sequential bandwidth and reduced
per-byte cost with little or no improvement in random access speed, it’s not clear that trading away
sequentiality for space savings makes sense, at least in primary storage. Complicating matters, these
files are in opaque unstructured formats with complicated access patterns. At the same time there are
increasingly many small files in an increasingly complex file system tree.
ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory
EXISTING SYSTEM:
3.1 INTRODUCTION:
For VM snapshot backup, file-level semantics are normally not provided. Snapshot operations take
place at the virtual device driver level, which means no fine-grained file system metadata can be
used to determine the changed data. Backup systems have been developed that use content
fingerprints to identify duplicate content. Offline deduplication removes previously written
duplicate blocks during idle time. Several techniques have been proposed to speed up the search
for duplicate fingerprints. Existing approaches focus on inline duplicate detection, in which
deduplication of an individual block lies on the critical write path. In existing work this
constraint is difficult to satisfy, since many duplicate detection requests must be served without
waiting; relaxing it is unacceptable because, in this context, it becomes difficult to finish the
backup of the required VM images within a reasonable time window.
3.2 ALGORITHM:
Whole File Hashing: In a whole-file hashing (WFH) technique, the whole file is fed to a hashing
function, which is always a cryptographic hash like MD5 or SHA-1. The cryptographic hash is used
to find entirely duplicate files. This approach is fast, with low computation and low additional
metadata overhead, and it works very well for complete system backups, where fully duplicate files
are common. However, the coarse granularity of duplicate matching stops it from matching two files
that differ by only a single byte or bit of data.
Sub File Hashing: Sub-file hashing (SFH) is appropriately named: whenever SFH is used, the file is
broken into a number of smaller sections before deduplication. The number of sections depends on
the type of SFH in use. The two most common types of SFH are fixed-size chunking and
variable-length chunking. In a fixed-size chunking approach, a file is divided into a number of
fixed-size pieces called "chunks". In a variable-length chunking approach, a file is broken into
"chunks" of variable length, with techniques such as Rabin fingerprinting applied to determine the
"chunk boundaries". Each section is passed to a cryptographic hash function (usually MD5 or SHA-1)
to obtain the "chunk identifier", which is used to locate duplicate data. Both SFH approaches find
duplicate data at a finer granularity, but at a price.
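The two SFH variants can be sketched as follows in Python. The rolling additive hash used here is
a deliberately simplified stand-in for Rabin fingerprinting, and the window, mask and size limits
are illustrative assumptions rather than values from this report:

    import hashlib

    def fixed_size_chunks(data, size=4096):
        return [data[i:i + size] for i in range(0, len(data), size)]

    def content_defined_chunks(data, window=48, mask=0xFFF,
                               min_size=512, max_size=8192):
        """Variable-length chunking: declare a boundary where the rolling
        window hash matches the mask, so boundaries depend on content."""
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h += byte
            if i >= window:
                h -= data[i - window]          # slide the window
            size = i - start + 1
            if (size >= min_size and (h & mask) == mask) or size >= max_size:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    # Chunk identifiers for duplicate lookup:
    data = b"some data, repeated many times. " * 400
    ids = [hashlib.sha1(c).hexdigest() for c in content_defined_chunks(data)]

Because boundaries in the content-defined variant depend on the bytes themselves rather than on
byte offsets, inserting data near the start of a file shifts only nearby chunks, which is the
property that makes variable-length chunking attractive despite its extra cost.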
Delta Encoding: The term delta encoding (DE) comes from the mathematical use of the delta symbol:
in math and science, delta denotes the "change" or "rate of change" in an object. Delta encoding
is applied to express the difference between a source object and a target object. If block A is
the source and block B is the target, the DE of B is the difference between A and B that is unique
to B. How the difference is expressed and stored depends on how delta encoding is applied.
Normally it is used when SFH does not produce results but there is a strong enough similarity
between two items/blocks/chunks that storing the difference would take less space than storing the
non-duplicate block.
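As a minimal illustration of the idea (using Python's standard difflib rather than a production
delta format such as VCDIFF or xdelta, which is an assumption for brevity), the delta stores copy
instructions into the source plus only the bytes unique to the target:

    import difflib

    def delta_encode(source, target):
        """Store copy instructions into source plus bytes unique to target."""
        sm = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
        delta = []
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "equal":
                delta.append(("copy", i1, i2))          # reuse bytes from source
            else:
                delta.append(("insert", target[j1:j2]))  # bytes unique to target
        return delta

    def delta_decode(source, delta):
        out = bytearray()
        for instr in delta:
            if instr[0] == "copy":
                out += source[instr[1]:instr[2]]
            else:
                out += instr[1]
        return bytes(out)

    A = b"the quick brown fox jumps over the lazy dog"
    B = b"the quick brown cat jumps over the lazy dog!"
    assert delta_decode(A, delta_encode(A, B)) == B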
3.3 DISADVANTAGES:
PROPOSED SYSTEM
Figure: Proposed system architecture (user registration, SHA implementation, data holder, ID
check, file upload with duplicate-file detection, content encryption and location, re-encryption
with a re-encryption key, and data search).
SHA was designed by NIST and the NSA and is the US federal standard for use with the DSA
signature scheme (nb: the algorithm is SHA, the standard is SHS). It produces 160-bit hash values.
SHA overview
o pad the message so its length is a multiple of 512 bits
o initialize the 5-word (160-bit) buffer (A,B,C,D,E) to
o (67452301, efcdab89, 98badcfe, 10325476, c3d2e1f0)
o process the message in 16-word (512-bit) chunks, using 4 rounds of 20 bit operations
each on the chunk and buffer
o the output hash value is the final buffer value
SHA is a close relative of MD5, sharing much of its design but with some differences. SHA was
subject to modification following NIST's identification of concerns whose exact nature was not
made public. The SHA-1 version described here was long regarded as secure, but practical collision
attacks have since been demonstrated, and the SHA-2 family is now recommended for new designs.
Step 1: Append Padding Bits
The message is "padded" with a single 1 bit followed by as many 0 bits as necessary to bring the
message length to 64 bits less than an even multiple of 512.
Step 2: Append Length
64 bits are appended to the end of the padded message. These bits hold the length of the original
message, in bits, in binary format.
Step 3: Prepare Processing Functions….
SHA1 requires 80 processing functions defined as:
f(t;B,C,D) = (B AND C) OR ((NOT B) AND D) ( 0 <= t <= 19)
f(t;B,C,D) = B XOR C XOR D (20 <= t <= 39)
f(t;B,C,D) = (B AND C) OR (B AND D) OR (C AND D) (40 <= t <=59)
f(t;B,C,D) = B XOR C XOR D (60 <= t <= 79)
Step 4: Prepare Processing Constants....
SHA1 requires 80 processing constant words defined as:
K(t) = 0x5A827999 ( 0 <= t <= 19)
K(t) = 0x6ED9EBA1 (20 <= t <= 39)
K(t) = 0x8F1BBCDC (40 <= t <= 59)
K(t) = 0xCA62C1D6 (60 <= t <= 79)
Step 5: Initialize Buffers….
SHA1 requires 160 bits or 5 buffers of words (32 bits):
H0 = 0x67452301
H1 = 0xEFCDAB89
H2 = 0x98BADCFE
H3 = 0x10325476
H4 = 0xC3D2E1F0
The basic flow applies these steps to each 512-bit block of the padded message in turn, as the
sketch below illustrates.
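For concreteness, the following is a compact, unoptimized Python rendering of the published SHA-1
algorithm (a textbook sketch, not code from this project), combining the padding, the processing
functions f(t), the constants K(t) and the buffers H0..H4 listed above; it is checked against the
standard library.

    import hashlib
    import struct

    def rol(x, n):
        return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

    def sha1(message):
        # Steps 1-2: pad with 0x80 then zeros to 56 mod 64, append bit length.
        length = len(message) * 8
        message += b"\x80"
        message += b"\x00" * ((56 - len(message) % 64) % 64)
        message += struct.pack(">Q", length)
        # Step 5: the five 32-bit buffers H0..H4.
        h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
        for block in range(0, len(message), 64):
            w = list(struct.unpack(">16I", message[block:block + 64]))
            for t in range(16, 80):
                w.append(rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
            a, b, c, d, e = h
            for t in range(80):     # steps 3-4: f(t) and K(t) per 20-step round
                if t <= 19:
                    f, k = (b & c) | (~b & d), 0x5A827999
                elif t <= 39:
                    f, k = b ^ c ^ d, 0x6ED9EBA1
                elif t <= 59:
                    f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
                else:
                    f, k = b ^ c ^ d, 0xCA62C1D6
                a, b, c, d, e = ((rol(a, 5) + f + e + k + w[t]) & 0xFFFFFFFF,
                                 a, rol(b, 30), c, d)
            h = [(x + y) & 0xFFFFFFFF for x, y in zip(h, (a, b, c, d, e))]
        return "".join(format(x, "08x") for x in h)

    # Sanity check against the standard library implementation.
    assert sha1(b"abc") == hashlib.sha1(b"abc").hexdigest()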
4.3 ADVANTAGES:
• SiLo is able to remove large amounts of redundant data and dramatically reduce the number of
accesses to the on-disk index.
Hardware Requirements:
Software Requirements:
Operating system : Windows OS (XP, 2007, 2008)
Front End : HADOOP
Back End : MySQL 5.0.51b
HADOOP
Hadoop is an open-source, Java-based framework that supports the processing of large sets of data
in a distributed computing environment. It is part of the Apache project sponsored by the Apache
Software Foundation. A Hadoop cluster uses a master/slave structure (Lu, Huang 2012). Using
Hadoop, large data sets can be processed across a cluster of servers, and applications can be run
on systems with thousands of nodes involving thousands of terabytes.
Distributed file system in Hadoop helps in rapid data transfer rates and allows the system to continue
its normal operation even in the case of some node failures. This approach lowers the risk of an
entire system failure, even in the case of a significant number of node failures. Hadoop enables a
computing solution that is scalable, cost effective, flexible and fault tolerant. Hadoop Framework is
used by popular companies like Google, Yahoo, Amazon and IBM etc., to support their applications
involving huge amounts of data. Hadoop has two main sub-projects – MapReduce and the Hadoop
Distributed File System (HDFS).
Hadoop MapReduce is a framework (Wie, Jiang 2010) used to write applications that process large
volumes of data in parallel, on large clusters of commodity hardware, in a reliable and
fault-tolerant manner. A MapReduce job first divides the data into individual chunks, which are
processed by map tasks in parallel. The outputs of the maps, sorted by the framework, are then
input to the reduce tasks. Generally both the input and the output of the job are stored in a
file system; scheduling, monitoring and re-execution of failed tasks are handled by the framework.
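As an illustration of the map and reduce phases, here is a self-contained Python simulation (not
an actual Hadoop job; real jobs would run as separate mapper and reducer scripts under Hadoop
Streaming or as Java classes) of the classic word-count example:

    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                yield word, 1                     # map: emit (key, 1) pairs

    def reducer(pairs):
        # The framework sorts and groups by key between the phases;
        # here sorted() plays that role.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)   # reduce: sum per key

    result = dict(reducer(mapper(["to be or not to be"])))
    print(result)   # {'be': 2, 'not': 1, 'or': 1, 'to': 2}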
HDFS (K, Chitharanjan 2013) is a file system that spans all the nodes in a Hadoop cluster for
data storage. It links together the file systems on local nodes to make them into one large file
system. HDFS improves reliability by replicating data across multiple sources to overcome node
failures.
Big data applications are large-scale distributed applications that usually work with large data
sets. Data exploration and analysis have become difficult problems in many sectors in the era of
big data. With large and complex data, computation becomes too difficult to be handled by
traditional data processing applications, which has triggered the development of big data
applications (F.C.P, Muhtaroglu 2013). Google's MapReduce framework and Apache Hadoop are the
de facto software systems (Zhao, 2013) for big data applications, and these applications generate
huge amounts of intermediate data. Manufacturing and bioinformatics are two major areas of big
data applications. Big data provides an infrastructure for transparency in the manufacturing
industry, with the ability to unravel uncertainties such as inconsistent component performance and
availability. In these big data applications, a conceptual framework of predictive manufacturing
begins with data acquisition, where different types of sensory data such as pressure, vibration,
acoustics, voltage, current, and controller data can be acquired. The combination of sensory data
and historical data constructs the big data in manufacturing, which acts as the input to
predictive tools and preventive strategies such as prognostics and health management. Another
important application for Hadoop is bioinformatics, which covers next-generation sequencing and
other biological domains. Bioinformatics (Xu-bin, 2012), which requires large-scale data analysis,
uses Hadoop, and cloud computing supplies the parallel, distributed platform on which such
analyses run.
a) In big data, the software packages provide a rich set of tools and options with which an
individual can map the entire data landscape across the company, allowing the individual to
analyze the threats he/she faces internally. This is considered one of the main advantages, as big
data keeps the data safe: an individual can detect potentially sensitive information that is not
protected in an appropriate manner and make sure it is stored according to regulatory
requirements.
b) Addresses speed and scalability, mobility and security, flexibility and stability.
c) In big data, the time to information is critical for extracting value from various data
sources, including mobile devices, radio-frequency identification, the web and a growing list of
automated sensory technologies.
All organizations and businesses would benefit from the speed, capacity, and scalability of cloud
storage. Moreover, end users can visualize the data, and companies can find new business
opportunities. Another notable advantage of big data is data analytics, which allows the content
or the look and feel of a website to be personalized in real time so that it suits each customer
entering the website. When big data is combined with predictive analytics, the combination poses a
challenge for many industries and opens up the exploration of four further areas.
5.2.2 Proxy re-encryption scheme:
Several efficient proxy re-encryption schemes are presented that offer security improvements over
earlier approaches. Their primary advantage is that they are unidirectional (i.e., Alice can
delegate to Bob without Bob having to delegate to her) and do not require delegators to reveal all
of their secret key to anyone, or even to interact with the delegatee, in order to allow a proxy
to re-encrypt their ciphertexts. In these schemes, only a limited amount of trust is placed in the
proxy: for example, it is not able to decrypt the ciphertexts it re-encrypts, and the schemes are
proved secure even when the proxy publishes all the encryption information it knows. This enables
a number of applications that would not be practical if the proxy needed to be fully trusted.
5.2.3 Deduplication scheme:
Deduplication is a technology that can be used to reduce the amount of storage required for a set
of files by identifying duplicate "chunks" of data in the set and storing only one copy of each
chunk. Subsequent requests to store a chunk that already exists in the chunk store are served by
simply recording the identity of the chunk in the file's block list; by not storing the chunk a
second time, the system stores less data, thus reducing cost. In this module we implement a
fingerprinting scheme to identify chunks: both fixed-size and variable-size chunking use
cryptographically secure content hashes such as MD5 or SHA-1, allowing the system to quickly
discover that newly generated chunks already have stored instances.
5.2.4 File system analysis:
In this module, we first search the data and then analyze different sets of chunks to determine
both the amount of deduplication possible and the source of chunk similarity. We use the term
disk image to denote the logical abstraction containing all of the data, while image files refers
to the actual files that make up a disk image. An image file is always associated with a single
disk image; a monolithic disk image consists of a single image file, and a spanning disk image
has one or more image files, each limited to a particular size. Files are stored in the data
server with a block id, and this can be monitored by the data servers. Data servers are mapped by
the cloud services provider.
In this module, we analyze the data-sharing components and the cloud services provider, which is
responsible for managing all data servers. A dedicated background daemon thread immediately sends
a heartbeat message to a problematic data server and determines whether it is alive. This
mechanism ensures that failures are detected and handled at an early stage. The stateless routing
algorithm can be implemented since it can detect duplicate data servers even if no one is
communicating with them; a minimal sketch of the heartbeat daemon follows.
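The sketch below assumes, for illustration, that a successful TCP connection stands in for the
heartbeat message and that the data servers' addresses are known; the names are hypothetical, not
from this project's code.

    import socket
    import threading
    import time

    def probe(server, timeout=2.0):
        """One heartbeat probe: the server is alive if it accepts a connection."""
        try:
            with socket.create_connection(server, timeout=timeout):
                return True
        except OSError:
            return False

    def heartbeat_daemon(servers, interval=5.0, on_failure=print):
        def loop():
            while True:
                for server in servers:
                    if not probe(server):
                        # Early failure detection: report the unresponsive server.
                        on_failure("data server %s missed heartbeat" % (server,))
                time.sleep(interval)
        t = threading.Thread(target=loop, daemon=True)   # background daemon thread
        t.start()
        return t

    # Example: monitor two (hypothetical) data servers every 5 seconds.
    heartbeat_daemon([("10.0.0.1", 9000), ("10.0.0.2", 9000)])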
7.1 CONCLUSION:
In the cloud, the same data are often stored again and again by users, so users need more space to
store additional data, which reduces the memory space the cloud can offer them. To overcome this
problem we use the deduplication concept. Data deduplication is a method for reducing the amount
of storage space an organization needs to save its data. In many organizations, the storage
systems contain duplicate copies of many pieces of data: for instance, the same file might be
saved in several different places by different users, and two or more files that are not identical
may still include much of the same data. Deduplication removes these extra copies by saving just
one copy of the data and replacing the other copies with pointers that lead back to the unique
copy. We therefore proposed block-level deduplication, which frees up more space; a particular
category known as variable-block or variable-length deduplication has become very popular. We also
implemented a heartbeat protocol to recover data from a corrupted cloud server. Experimental
metrics show that the proposed approach provides improved results in the deduplication process.
7.2 FUTURE WORK:
In future we can extend this work to handle multimedia data, including audio, images and video,
in deduplication storage, and also extend the heartbeat protocol to recover each data server and
increase the scalability of the system.
REFERENCES:
[1] Jin Li, Yan Kit Li, "A Hybrid Cloud Approach for Secure Authorized Deduplication," IEEE
Transactions on Parallel and Distributed Systems, vol. 26, no. 5, May 2015.
[2] Jingwei Li, Jin Li, "Secure Auditing and Deduplicating Data in Cloud," IEEE Transactions on
Computers, vol. PP, no. 99, Jan. 2015.
[3] D. Meyer and W. Bolosky, “A study of practical deduplication,” in Proceedings of the 9th
USENIX Conference on File and Storage Technologies, 2011.
[4] B. Debnath, S. Sengupta, and J. Li, “Chunkstash: speeding up inline storage deduplication using
flash memory,” in Proceedings of the 2010 USENIX conference on USENIX annual technical
conference. USENIX Association, 2010.
[5] W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy, and P. Shilane, “Tradeoffs in scalable data
routing for deduplication clusters,” in Proceedings of the 9th USENIX conference on File and
storage technologies. USENIX Association, 2011.
[6] E. Kruus, C. Ungureanu, and C. Dubnicki, “Bimodal content defined chunking for backup
streams,” in Proceedings of the 8th USENIX conference on File and storage technologies. USENIX
Association, 2010.
[7] G.Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu,
“Characteristics of backup workloads in production systems,” in Proceedings of the Tenth USENIX
Conference on File and Storage Technologies, 2012.
[8] A. Broder, “On the resemblance and containment of documents,” in Compression and
Complexity of Sequences 1997.
[9] D. Bhagwat, K. Eshghi, and P. Mehra, “Content-based document routing and index partitioning
for scalable similarity-based searches in a large corpus,” in Proceedings of the 13th ACM SIGKDD
international conference on Knowledge discovery and data mining. ACM, 2007, pp. 105–112.
[10] Y. Tan, H. Jiang, D. Feng, L. Tian, Z. Yan, and G. Zhou, “SAM: A Semantic-Aware Multi-
Tiered Source De-duplication Framework for Cloud Backup,” in IEEE 39th International
Conference on Parallel Processing. IEEE, 2010, pp. 614–623.