
PROXY RE-ENCRYPTION (PRE) BASED DEDUPLICATION SCHEME ON ENCRYPTED BIG DATA FOR CLOUD

ABSTRACT:

Cloud computing is widely considered a candidate for the next dominant technology in the IT industry. It offers simplified system maintenance and scalable resource management through Virtual Machines (VMs). As a fundamental technology of cloud computing, virtual machines have been a hot research topic in recent years. The high overhead of virtualization has been well addressed by hardware advances in the CPU industry and by software improvements in the hypervisors themselves. However, the high demand on virtual machine storage remains a challenging problem. Existing systems have tried to reduce virtual machine storage consumption by means of deduplication within a storage area network (SAN) system. Nevertheless, a storage area network cannot satisfy the increasing demand of large-scale virtual machine hosting for cloud computing because of its cost. In this project, we propose SILO, a scalable deduplication file system designed specifically for large-scale virtual machine deployment. Its design provides fast virtual machine deployment through a similarity- and locality-based fingerprint index for data transfer, and low storage consumption by means of deduplication on virtual machine images. It also provides a comprehensive set of storage features, including instant cloning of virtual machines, on-demand fetching through the network, and caching on local disks using copy-on-read techniques. Experiments show that the SILO features perform well and introduce only minor performance overhead.
CHAPTER 1

1. INTRODUCTION:

1.1 CLOUD COMPUTING:

Cloud computing is essentially a service-oriented architecture that renders easy access to all who make use of it. The demand for computing power is continuously rising: CPU computing power roughly doubles every few years, but file sizes are also growing at a remarkable rate. Twenty years ago the common format was plain text; later, computers could handle graphics well and play low-quality movies. In recent years users are no longer satisfied with DVD quality and have moved to Blu-ray discs, and data sets have grown from a few KBs to nearly 1 TB.

The characteristics that distinguish the cloud from other computing technologies are on-demand self-service, agility, autonomic computing, virtualization, a parallel and distributed architecture, and the pay-for-use model. A cloud is said to have a parallel and distributed architecture because it consists of a family of interconnected and virtualized computers.

An organization moving from the traditional model to the cloud model switches from buying dedicated hardware and depreciating it over a period of time to using shared resources in the cloud infrastructure and paying based on usage.

Cloud computing reduces upfront infrastructure costs, lets applications run faster with improved manageability and less maintenance, allows resources to be adjusted rapidly to meet inconsistent and unpredictable business demand, and lets users focus on differentiating their business instead of on infrastructure. Cloud providers use the "pay-as-you-go" model; not using the cloud model can lead to unexpectedly high charges. In this system, the physical machine (PM) is defined as the cloud server, and the instance or virtual machine (VM) is the virtual server provided to the users.

Parallel systems can make computation faster

File sizes are increasing, but the algorithms do not improve as quickly. Moreover, newer formats usually require more complicated decoding algorithms, which makes processing take longer. The only way to greatly speed up processing is to run the job in parallel on different machines.

The cloud reduces cost

Buying expensive facilities for a single purpose is unreasonable, whereas the cloud charges only for how long it is used. This enables many people to run large-scale projects at an acceptable cost.

1.1.1. CLOUD CONCEPTS


Cloud computing is web-based processing with a large pool of resources. Users of the cloud obtain the service through a network (either the internet or an intranet); in other words, they use or buy computing services from others. The cloud resource can be anything IT-related. Because the resource pool is very large, a user can scale an application on the cloud to any size, fully under the user's control.

1.1.2. CHARACTERISTICS OF CLOUD

Cloud computing demonstrates the following key characteristics:

 Agility
 Application programming interface (API)
 Pay for use
 Anywhere, anytime access
 Virtualization
 Multi-tenancy
 Reliability
 Scalability and elasticity
 Easier maintenance

Agility

The system re-provisions resources to match users' needs; provisioning is altered according to the customers' requests.

Application programming interface (API)


It is analogous to the traditional user interface, which facilitates interaction between humans and computers: an Application Programming Interface lets machines interact with the cloud software. Representational State Transfer (REST)-based APIs are typically used in cloud computing systems.

Cost:

By using the public-cloud delivery model, capital expenditure on resources is converted into operational expenditure. This supposedly lowers barriers to entry, because instead of purchasing tools and software for its own infrastructure, an organization uses services provided by a third party. The cost saving depends on the kind of infrastructure available.

Device and location independence

Users can access the system using a web browser regardless of their location or the device they use. Because the infrastructure is provided by a third party and accessed off-site, users can connect to it from anywhere with an internet connection.

Maintenance

Rather than installing the resources on each user's system, the resources can be accessed from anywhere. This makes it easy for users to maintain their records in the cloud.

Multi-tenancy

Numerous customers use the same public cloud. Across this large pool of users, multi-tenancy enables the sharing of resources and costs. Multi-tenancy allows

 Centralization of infrastructure

Several users accessing the same infrastructure from different location at reduced cost
without installing the software in their own system.

 peak-load capacity
The capacity of the system increases depending upon the users requesting services.
The system provides the services at even peak-load without any interruption.
 Utilization and efficiency
The system adapts to changing needs by provisioning and de-provisioning resources, so that the resources are utilized in an efficient manner.

Performance

Loosely coupled architectures are constructed with web services as the system interface. The performance of the system is monitored, and consistency is maintained.

Reliability

Multiple redundant sites are accessed by several users simultaneously, which makes cloud computing suitable for business continuity and disaster recovery.

 Business Continuity

The term business continuity means the business must be prepared to mitigate the impact of planned and unplanned outages.

 Disaster Recovery

The term disaster recovery refers to recovering system data. Copies of the data are backed up to an alternative site for use when the primary site is unavailable due to a disaster.

Scalability and Elasticity

The resources are provided without any interruption to the users’ request. The VM start-up time
varies by VM type, location, operating systems and cloud providers.

Security

When data is scattered over a broader region or a greater number of devices, as well as in multi-tenant systems shared by unrelated users, the complexity of security is greatly increased. Private cloud installations are motivated in part by users' desire to retain control over the infrastructure and avoid losing control of information security.
Figure 1.1. WORKING OF CLOUD COMPUTING

Advantages of cloud computing

 On-demand self-services
 Broad network access
 Resource pooling
1.1.3. TYPES OF CLOUD SERVICES

Infrastructure as a service (IaaS)

 The consumer can provision processing, data storage, networks, and other essential computing resources.
 The consumer is able to deploy and run arbitrary software, including operating systems and applications.
 The consumer does not manage or control the underlying cloud infrastructure, but has control over operating systems, storage, and deployed applications, and possibly limited control of select networking components such as host firewalls.
Figure 1.2. TYPES OF CLOUD SERVICES

Platform as a service (PaaS)

 The capability provided to the end user is to deploy onto the cloud infrastructure consumer-created or acquired applications, built using programming languages and tools supported by the provider.
 The end user does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, or storage, but has control over the deployed applications and possibly the application-hosting environment configuration.

Software as a service (SaaS)


 The capability offered to the end user is to use the provider's applications running on the cloud infrastructure.
 The applications are accessible from various client devices through a thin-client interface such as a web browser (e.g., web-based email).
 The end user does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

Deployment model:

Private cloud

 A private cloud is cloud infrastructure operated solely for a single organization, whether managed internally or by a third party, and hosted either internally or externally.

 Undertaking a private cloud project requires a significant level and degree of engagement to virtualize the business environment, and requires the organization to re-evaluate decisions about existing resources.

Public cloud:

 A cloud is called a public cloud when its services are rendered over a network that is open for public use. Pay-per-usage and free services are the service models offered by the public cloud.

 The main difference lies in the security considerations for services such as applications, storage, and other resources made available to a general audience.

Hybrid cloud

 A hybrid cloud is a composition of two or more clouds (private, community, or public).

 Hybrid cloud services are cloud computing services composed of some combination of private, public, and community cloud services, obtained from different service providers.

 A hybrid cloud service crosses isolation and provider boundaries, so it cannot be placed simply in the private, public, or community cloud category.

 A major advantage of cloud bursting and a hybrid cloud model is that an organization only pays for extra compute resources when they are needed.

Community cloud:
 A community cloud shares infrastructure between several organizations from a specific community with common concerns (security, compliance, jurisdiction, etc.), whether managed internally or by a third party, and hosted internally or externally.
 Only some of the cost-saving potential of cloud computing is realized, because the costs are spread over fewer users than in a public cloud.

Distributed cloud:
 Cloud computing can also be provided by a distributed set of machines running at different locations but connected to a single network or hub service.
 Such a distributed system voluntarily shares resources over a network.

Inter-cloud

 The Inter-cloud resembles the hybrid and multi-cloud models, but it focuses on interoperability between public cloud service providers more than between providers and consumers.

 The Inter-cloud is an interconnected, large-scale "cloud of clouds" and an extension of the Internet's "network of networks" on which it is based.

Multi-cloud

 Multi-cloud is the use of multiple cloud computing services in a single heterogeneous architecture, in order to reduce reliance on any single vendor, increase flexibility through choice, mitigate against disasters, and so on.

 It differs from a hybrid cloud in that it refers to multiple cloud services rather than multiple deployment modes.
CHAPTER 2:

2.1 INTRODUCTION:

Storing large amounts of data efficiently, in terms of both time and space, is of paramount concern in the design of backup and restore systems. Users may wish to periodically (e.g., hourly, daily, or weekly) back up data stored on their computers as a precaution against possible crashes, corruption, or accidental deletion of important data. It commonly happens that most of the data has not changed since the last backup was performed, so much of the current data can already be found in the backup repository, with only minor changes. If the data in the repository that is similar to the current backup data can be located efficiently, then there is no need to store the data again; only the changes need be recorded. This process of storing common data only once is known as data deduplication. Data deduplication is much easier to achieve with disk-based storage than with tape backup. The technology bridges the price gap between disk-based and tape-based backup, making disk-based backup affordable. Disk-based backup has several distinctive advantages over tape backup in terms of reducing backup windows and improving restore reliability and speed. In a backup and restore system with deduplication it is very likely that a new input data stream is similar to data already in the repository, but many different types of changes are possible. Given the potential size of the repository, which may hold hundreds of terabytes of data, identifying the regions similar to the new incoming data is a major challenge. In addition, the similarity matching must be performed quickly in order to meet high backup bandwidth requirements. This project aims to alleviate the disk bottleneck of fingerprint lookup, reduce fingerprint lookup time, and improve the throughput of data deduplication. In cluster settings, deduplication is often performed only within individual servers because of overhead considerations, which leaves cross-node redundancy untouched. Thus data routing, a technique to concentrate data redundancy within individual nodes, reduce cross-node redundancy, and balance load, becomes a key issue in cluster deduplication design. Second, in the intra-node scenario, the system suffers from the disk chunk-index lookup bottleneck.
2.2 STUDY OF DEDUPLICATION:

In computing, data deduplication is a specialized data compression technique for eliminating


duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data)
compression and single-instance (data) storage. This technique is used to improve storage utilization
and can also be applied to network data transfers to reduce the number of bytes that must be sent. In
the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a
process of analysis. As the analysis continues, other chunks are compared to the stored copy and
whenever a match occurs, the redundant chunk is replaced with a small reference that points to the
stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of
times (the match frequency is dependent on the chunk size), the amount of data that must be stored
or transferred can be greatly reduced.[1] This type of deduplication is different from that performed
by standard file-compression tools, such as LZ77 and LZ78. Whereas these tools identify short
repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect
large volumes of data and identify large sections – such as entire files or large sections of files – that
are identical, in order to store only one copy of it. This copy may be additionally compressed by
single-file compression techniques. For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.
BENEFITS OF DEDUPLICATION:

 Storage-based data deduplication reduces the amount of storage needed for a given set of files. It
is most effective in applications where many copies of very similar or even identical data are
stored on a single disk—a surprisingly common scenario. In the case of data backups, which
routinely are performed to protect against data loss, most data in a given backup remain
unchanged from the previous backup. Common backup systems try to exploit this by omitting
(or hard linking) files that haven't changed or storing differences between files. Neither approach
captures all redundancies, however. Hard-linking does not help with large files that have only
changed in small ways, such as an email database; differences only find redundancies in adjacent
versions of a single file (consider a section that was deleted and later added in again or a logo
image included in many documents).
 Network data deduplication is used to reduce the number of bytes that must be transferred
between endpoints, which can reduce the amount of bandwidth required.
 Virtual servers benefit from deduplication because it allows nominally separate system files for
each virtual server to be coalesced into a single storage space. At the same time, if a given server
customizes a file, deduplication will not change the files on the other servers—something that
alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved.
DEDUPLICATION METHODS:

One of the most common forms of data deduplication implementations works by comparing
chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification,
calculated by the software, typically using cryptographic hash functions. In many implementations,
the assumption is made that if the identification is identical, the data is identical, even though this
cannot be true in all cases due to the pigeonhole principle; other implementations do not assume that
two blocks of data with the same identifier are identical, but actually verify that data with the same
identification is identical. If the software either assumes that a given identification already exists in
the deduplication namespace or actually verifies the identity of the two blocks of data, depending on
the implementation, then it will replace that duplicate chunk with a link.
Once the data has been deduplicated, upon read back of the file, wherever a link is found, the system
simply replaces that link with the referenced data chunk. The deduplication process is intended to be
transparent to end users and applications.

 Chunking. Between commercial deduplication implementations, technology varies primarily in chunking method and in architecture. In some systems, chunks are defined by physical-layer constraints (e.g., the 4 KB block size in WAFL). In some systems only complete files are compared, which is called single-instance storage or SIS. The most intelligent (but CPU-intensive) method of chunking is generally considered to be sliding-block: a window is passed along the file stream to seek out naturally occurring internal file boundaries.

 Client backup deduplication. This is the process where the deduplication hash calculations are
initially created on the source (client) machines. Files that have identical hashes to files already
in the target device are not sent, the target device just creates appropriate internal links to
reference the duplicated data. The benefit of this is that it avoids data being unnecessarily sent
across the network thereby reducing traffic load.

 Primary storage and secondary storage. By definition, primary storage systems are designed for
optimal performance, rather than lowest possible cost. The design criteria for these systems are
to increase performance, at the expense of other considerations. Moreover, primary storage
systems are much less tolerant of any operation that can negatively impact performance. Also by
definition, secondary storage systems contain primarily duplicate or secondary copies of data.
These copies of data are typically not used for actual production operations and as a result are
more tolerant of some performance degradation, in exchange for increased efficiency.
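To make the lookup-and-link flow described above concrete, the following minimal sketch keeps one copy of each unique chunk under its SHA-1 fingerprint and hands back only the fingerprint when a duplicate arrives. It is an in-memory illustration under stated assumptions (the ChunkStore class name and the HashMap index are choices made for the example), not any particular product's implementation.

```java
import java.security.MessageDigest;
import java.util.*;

/** Minimal in-memory chunk store: duplicates are replaced by a fingerprint reference. */
public class ChunkStore {
    private final Map<String, byte[]> chunksByFingerprint = new HashMap<>();

    /** Store a chunk if unseen and return its fingerprint (the "link" kept in the file's chunk list). */
    public String put(byte[] chunk) throws Exception {
        String fp = fingerprint(chunk);
        chunksByFingerprint.putIfAbsent(fp, chunk.clone());  // duplicate chunks are not stored again
        return fp;
    }

    /** On read-back, a stored fingerprint is transparently resolved to the referenced chunk. */
    public byte[] get(String fp) {
        return chunksByFingerprint.get(fp);
    }

    private static String fingerprint(byte[] chunk) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(chunk);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        ChunkStore store = new ChunkStore();
        String fpA = store.put("the same byte pattern".getBytes());
        String fpB = store.put("the same byte pattern".getBytes()); // second copy only adds a reference
        System.out.println(fpA.equals(fpB));                        // true: one stored chunk, two references
    }
}
```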
A Hybrid Cloud Approach for Secure Authorized Deduplication:
Aiming at efficiently solving the problem of deduplication with differential privileges in cloud computing, this approach considers a hybrid cloud architecture consisting of a public cloud and a private cloud. Unlike existing data deduplication systems, the private cloud is involved as a proxy to allow data owners/users to securely perform duplicate checks with differential privileges. Such an architecture is practical and has attracted much attention from researchers. The data owners only outsource their data storage to the public cloud, while the data operations are managed in the private cloud. A new deduplication system supporting differential duplicate checks is proposed under this hybrid cloud architecture, where the S-CSP resides in the public cloud. A user is only allowed to perform the duplicate check for files marked with the corresponding privileges.
Furthermore, the system is enhanced in security. Specifically, an advanced scheme is presented to support stronger security by encrypting the file with differential privilege keys. In this way, users without the corresponding privileges cannot perform the duplicate check. Furthermore, such unauthorized users cannot decrypt the ciphertext even if they collude with the S-CSP. Security analysis demonstrates that the system is secure in terms of the definitions specified in the proposed security model. Finally, a prototype of the proposed authorized duplicate check is implemented and testbed experiments are conducted to evaluate its overhead.
Secure Auditing and Deduplicating Data in Cloud:
SecCloud introduces an auditing entity with a MapReduce cloud, which helps clients generate data tags before uploading and audits the integrity of data that has been stored in the cloud. This design fixes the issue in previous work that the computational load for tag generation at the user or auditor is too large. For completeness of fine-grained auditing, the auditing functionality designed in SecCloud is supported at both the block level and the sector level. In addition, SecCloud enables secure deduplication. Note that the "security" considered in SecCloud is the prevention of leakage of side-channel information. To prevent such leakage, the design follows tradition and uses a proof-of-ownership protocol between clients and cloud servers, which allows clients to prove to the cloud servers that they actually own the target data.
A Study of Practical Deduplication

File systems often contain redundant copies of information: identical files or sub-file regions,
possibly stored on a single host, on a shared storage cluster, or backed-up to secondary storage.
Deduplicating storage systems take advantage of this redundancy to reduce the underlying space
needed to contain the file systems (or backup images thereof). Deduplication can work at either the
sub-file or whole-file level. More fine-grained deduplication creates more opportunities for space
savings, but necessarily reduces the sequential layout of some files, which may have significant
performance impacts when hard disks are used for storage (and in some cases necessitates
complicated techniques to improve performance). Alternatively, whole-file deduplication is simpler
and eliminates file-fragmentation concerns, though at the cost of some otherwise reclaimable
storage. Because the disk technology trend is toward improved sequential bandwidth and reduced
per-byte cost with little or no improvement in random access speed, it’s not clear that trading away
sequentiality for space savings makes sense, at least in primary storage. Complicating matters, these
files are in opaque unstructured formats with complicated access patterns. At the same time there are
increasingly many small files in an increasingly complex file system tree.
ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory

ChunkStash is a flash-assisted inline storage deduplication system incorporating a high-performance chunk metadata store on flash. When a key-value pair (i.e., a chunk-id and its metadata) is written, it is sequentially logged in flash. A specialized, RAM-space-efficient hash table index employing a variant of cuckoo hashing and compact key signatures is used to index the chunk metadata stored in flash memory and to serve chunk-id lookups using one flash read per lookup. ChunkStash works in concert with existing RAM prefetching strategies. The flash requirements of ChunkStash are well within the range of currently available SSD capacities: as an example, ChunkStash can index on the order of terabytes of unique (deduplicated) data using on the order of tens of gigabytes of flash. ChunkStash uses an in-memory hash table to index key-value pairs on flash, with hash collisions resolved by a variant of cuckoo hashing. The in-memory hash table stores compact key signatures instead of full keys so as to strike a tradeoff between RAM usage and false flash reads. Further, by indexing only a small fraction of chunks per container, ChunkStash can reduce RAM usage significantly with negligible loss in deduplication quality.
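The compact-signature idea can be sketched as follows. This is a simplified illustration, not ChunkStash itself: the cuckoo-hashing collision resolution is omitted (a colliding signature simply overwrites the older RAM entry), the flash log is simulated with an in-memory list, and the class and field names are assumptions made for the example.

```java
import java.util.*;

/** Simplified signature index in the spirit of ChunkStash: RAM keeps (compact signature -> log offset),
 *  while the full chunk-id lives only in an append-only "flash" log (simulated here with a list). */
public class SignatureIndex {
    private final Map<Short, Integer> ramIndex = new HashMap<>(); // 2-byte signature -> log position
    private final List<String> flashLog = new ArrayList<>();      // stand-in for the on-flash metadata log

    private static short signature(String chunkId) {
        return (short) chunkId.hashCode();                        // compact signature; collisions possible
    }

    public void insert(String chunkId) {
        flashLog.add(chunkId);                                    // sequentially logged "on flash"
        ramIndex.put(signature(chunkId), flashLog.size() - 1);    // real design resolves collisions by cuckoo hashing
    }

    /** One simulated flash read per lookup; a signature hit may still be a false flash read. */
    public boolean contains(String chunkId) {
        Integer pos = ramIndex.get(signature(chunkId));
        return pos != null && flashLog.get(pos).equals(chunkId);  // verify the full key to reject false hits
    }

    public static void main(String[] args) {
        SignatureIndex idx = new SignatureIndex();
        idx.insert("sha1:aa11");
        System.out.println(idx.contains("sha1:aa11")); // true
        System.out.println(idx.contains("sha1:bb22")); // false (a signature hit would be rejected on verify)
    }
}
```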
The following table summarizes the existing works:

Title: A Hybrid Cloud Approach for Secure Authorized Deduplication (Jin Li, 2015)
Advantages: The scheme is secure in terms of the definitions specified in the proposed security model.
Disadvantages: Increases the overhead at the time of encryption.

Title: Secure Auditing and Deduplicating Data in Cloud (Jingwei Li, Jin Li, 2015)
Advantages: The dictionary attack is prevented; guarantees file confidentiality.
Disadvantages: Clients find it difficult to generate data tags before uploading.

Title: A Study of Practical Deduplication (D. Meyer and W. Bolosky, 2011)
Advantages: Simpler and eliminates file-fragmentation concerns; reduced per-byte cost.
Disadvantages: File fragmentation is difficult to manage; requires ever-deepening file-structure analysis.

Title: ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory (Biplob Debnath, 2010)
Advantages: ChunkStash can reduce RAM usage significantly.
Disadvantages: Introduces some loss in deduplication quality.

Title: Tradeoffs in Scalable Data Routing for Deduplication Clusters (Wei Dong, 2011)
Advantages: High throughput is achieved because of stream and inter-file locality, and per-chunk memory overhead is minimized.
Disadvantages: Scalability of the system is limited across a broad range of cluster sizes.

Title: Bimodal Content Defined Chunking for Backup Streams (Erik Kruus, 2010)
Advantages: Able to perform content-defined chunking in a scalable manner.
Disadvantages: Requires a larger per-chunk storage overhead.

Title: Characteristics of Backup Workloads in Production Systems (Grant Wallace, 2012)
Advantages: Particularly helpful in analyzing the effectiveness of deduplication parameters and caching algorithms.
Disadvantages: Shows degraded hit ratios at the time of deduplication.
CHAPTER 3

EXISTING SYSTEM:

3.1 INTRODUCTION:
For VM snapshot backup, file level semantics are normally not provided. Snapshot
operations take place at the virtual device driver level, which means no fine-grained file system
metadata can be used to determine the changed data. Backup systems have been developed to use
content fingerprints to identify duplicate content. Offline deduplication is used to remove previously
written duplicate blocks during idle time. Several techniques have been proposed to speedup
searching of duplicate fingerprints. Existing approaches have focused on inline duplicate detection, in which deduplication of an individual block sits on the critical write path and duplicate detection requests cannot be deferred. In this context that constraint makes it difficult to finish the backup of the required VM images within a reasonable time window.
3.2 ALGORITHM:

Whole File Hashing: In a whole file hashing (WFH) technique, the whole file is directed to a
hashing function. The hashing function is always cryptographic hash like MD5 or SHA-1. The
cryptographic hash is used to find entire replicate files. This approach is speedy with low
computation and low additional metadata overhead. It works very well for complete system backups
when total duplicate files are more common. However, the larger granularity of replicate matching
stops it from matching two files that only differ by one single byte or bit of data.

Sub File Hashing: Sub file hashing (SFH) is appropriately named. Whenever SFH is being used, it
means the file is broken into a number of smaller sections before data de-duplication. The number of
sections depends on the type of SFH that is being used. The two most common types of SFH are
fixed size chunking and variable-length chunking. In a fixed-size chunking approach, a file is
divided up into a number of fixed-size pieces called “chunks”. In a variable-length chunking
approach, a file is broken up into “chunks” of variable length. Some techniques such as Rabin
fingerprinting are applied to determine “chunk boundaries”. Each section is passed to a
cryptographic hash function (usually MD5 or SHA-1) to get the “chunk identifier”. The chunk
identifier is used to locate replicate data. Both of these SFH approaches find replicate data at a finer
granularity but at a price.
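The difference between fixed-size and variable-length chunking can be illustrated with the sketch below. The chunk sizes, the boundary mask, and the simple accumulated hash are illustrative assumptions standing in for a real Rabin rolling fingerprint, so the boundaries it produces are only indicative.

```java
import java.util.*;

/** Toy chunker contrasting fixed-size cuts with content-defined cuts. */
public class Chunker {
    static final int FIXED_SIZE = 4096;   // fixed-size chunking granularity (illustrative)
    static final int MIN_CHUNK = 2048;    // minimum chunk size before a content-defined cut is allowed
    static final int MASK = 0x1FFF;       // cut when (hash & MASK) == 0, giving roughly 8 KB average chunks

    static List<byte[]> fixedSize(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += FIXED_SIZE)
            chunks.add(Arrays.copyOfRange(data, off, Math.min(off + FIXED_SIZE, data.length)));
        return chunks;
    }

    static List<byte[]> contentDefined(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        int start = 0, hash = 0;
        for (int i = 0; i < data.length; i++) {
            // Hash accumulated since the last boundary; a real implementation would use a rolling
            // Rabin fingerprint over a fixed window so boundaries depend only on local content.
            hash = hash * 31 + (data[i] & 0xFF);
            if (i - start + 1 >= MIN_CHUNK && (hash & MASK) == 0) {
                chunks.add(Arrays.copyOfRange(data, start, i + 1)); // boundary chosen by the content
                start = i + 1;
                hash = 0;
            }
        }
        if (start < data.length) chunks.add(Arrays.copyOfRange(data, start, data.length));
        return chunks;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100_000];
        new Random(7).nextBytes(data);
        System.out.println("fixed-size chunks: " + fixedSize(data).size()
                + ", content-defined chunks: " + contentDefined(data).size());
    }
}
```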

Delta Encoding: The term delta encoding (DE) comes from the mathematical use of the delta symbol. In mathematics and science, delta denotes the "change" or "rate of change" of an object. Delta encoding is applied to express the difference between a source object and a target object. For example, if block A is the source and block B is the target, the DE of B is the difference between A and B that is unique to B. How the difference is expressed and stored depends on how delta encoding is applied. Normally it is used when SFH does not produce matches but there is a strong enough similarity between two items/blocks/chunks that storing the difference would take less space than storing the non-duplicate block.
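As a hedged illustration of this idea, the sketch below records only the byte positions at which a target block differs from an equal-length source block and reconstructs the target from the source plus that delta. Real delta encoders use copy/insert instructions and handle insertions and deletions, which this toy positional encoding does not.

```java
import java.util.*;

/** Toy positional delta: for same-length blocks, keep only (offset, new byte) pairs unique to the target. */
public class DeltaEncoding {
    /** Encode target B against source A; the result is what would be stored instead of B. */
    static Map<Integer, Byte> encode(byte[] source, byte[] target) {
        Map<Integer, Byte> delta = new LinkedHashMap<>();
        for (int i = 0; i < source.length; i++)
            if (source[i] != target[i]) delta.put(i, target[i]);
        return delta;
    }

    /** Reconstruct B from A plus the stored delta. */
    static byte[] decode(byte[] source, Map<Integer, Byte> delta) {
        byte[] target = source.clone();
        for (Map.Entry<Integer, Byte> e : delta.entrySet()) target[e.getKey()] = e.getValue();
        return target;
    }

    public static void main(String[] args) {
        byte[] a = "block A: mostly identical content".getBytes();
        byte[] b = "block B: mostly identical content".getBytes();
        Map<Integer, Byte> delta = encode(a, b);
        System.out.println("differing bytes stored: " + delta.size()); // far smaller than b.length
        System.out.println(Arrays.equals(b, decode(a, delta)));        // true
    }
}
```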

3.3 DISADVANTAGES:

 There is no scalability in distributed data sharing systems.


 Difficult to implement fault tolerance mechanisms when the number of nodes keeps
changing.
 Produces a huge volume of garbage data to be collected.
CHAPTER 4

PROPOSED SYSTEM

4.1 PROPOSED SYSTEM ARCHITECTURE:

Figure: Proposed system architecture. The diagram shows user registration, SHA implementation, the data holder performing file upload and encryption, a duplicate check on ID, file name, content, and location, re-encryption with a re-encryption key, data search through the AP, and the user decrypting with the re-decryption key and decryption key.
In the deduplication framework, the proposed system implements block-level deduplication, named the similarity- and locality-based deduplication (SiLo) framework, which is a scalable, low-overhead, near-exact deduplication system designed to overcome the aforementioned shortcomings of existing schemes. The main idea of SiLo is to consider both similarity and locality in the backup stream at the same time. Specifically, it exposes and exploits more similarity by grouping strongly correlated small files into a segment and by segmenting large files, and it leverages locality in the backup stream by grouping adjacent segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. By keeping the similarity index and preserving the spatial locality of backup streams in RAM (i.e., a hash table and a locality cache), SiLo is able to remove huge amounts of redundant data, dramatically reduce the number of accesses to the on-disk index, and substantially increase RAM utilization. This approach divides a large file into many small segments to better expose similarity among large files while increasing the efficiency of the deduplication pipeline.
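A rough sketch of the similarity side of this design is given below. It is not the actual SiLo implementation: representing each segment by its minimum chunk fingerprint, and the SimilarityIndex class name, are assumptions chosen to illustrate how a probabilistic similarity hit lets the system prefetch one segment's index instead of consulting the on-disk index for every chunk.

```java
import java.util.*;

/** Illustrative similarity detection: each segment is represented by its minimum chunk fingerprint,
 *  and only that representative is kept in the RAM similarity index. */
public class SimilarityIndex {
    private final Map<String, Integer> representativeToSegment = new HashMap<>();

    /** Register a stored segment under its representative (minimum) fingerprint. */
    public void addSegment(int segmentId, List<String> chunkFingerprints) {
        representativeToSegment.put(Collections.min(chunkFingerprints), segmentId);
    }

    /** For an incoming segment, find a likely-similar stored segment whose full chunk index can then
     *  be prefetched into the locality cache, instead of looking up every chunk on disk. */
    public Integer findSimilarSegment(List<String> chunkFingerprints) {
        return representativeToSegment.get(Collections.min(chunkFingerprints));
    }

    public static void main(String[] args) {
        SimilarityIndex index = new SimilarityIndex();
        index.addSegment(1, Arrays.asList("0a9f", "77b2", "c3d4"));
        // A new segment sharing most chunks yields the same minimum fingerprint with high probability.
        System.out.println(index.findSimilarSegment(Arrays.asList("0a9f", "77b2", "ffee"))); // 1
    }
}
```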

4.2 ALGORITHM APPROACH TECHNIQUES:

4.2.1 Secure Hash Algorithm:

 SHA was designed by NIST and the NSA and is the US federal standard for use with the DSA signature scheme (note that the algorithm is SHA, the standard is SHS)
 It produces 160-bit hash values
 SHA overview:
o pad the message so its length is a multiple of 512 bits
o initialize the 5-word (160-bit) buffer (A,B,C,D,E) to (67452301, efcdab89, 98badcfe, 10325476, c3d2e1f0)
o process the message in 16-word (512-bit) chunks, using 4 rounds of 20 operations each on the chunk and buffer
o the output hash value is the final buffer value
 SHA is a close relative of MD5, sharing much common design, but each has differences
 SHA has very recently been subject to modification following NIST's identification of some concerns, the exact nature of which is not public
 The current version is regarded as secure
Step 1: Append Padding Bits….
Message is “padded” with a 1 and as many 0’s as necessary to bring the message length to 64
bits less than an even multiple of 512.
Step 2: Append Length....
64 bits are appended to the end of the padded message. These bits hold the binary format of
64 bits indicating the length of the original message.
Step 3: Prepare Processing Functions….
SHA1 requires 80 processing functions defined as:
f(t;B,C,D) = (B AND C) OR ((NOT B) AND D) ( 0 <= t <= 19)
f(t;B,C,D) = B XOR C XOR D (20 <= t <= 39)
f(t;B,C,D) = (B AND C) OR (B AND D) OR (C AND D) (40 <= t <=59)
f(t;B,C,D) = B XOR C XOR D (60 <= t <= 79)
Step 4: Prepare Processing Constants....
SHA1 requires 80 processing constant words defined as:
K(t) = 0x5A827999 ( 0 <= t <= 19)
K(t) = 0x6ED9EBA1 (20 <= t <= 39)
K(t) = 0x8F1BBCDC (40 <= t <= 59)
K(t) = 0xCA62C1D6 (60 <= t <= 79)
Step 5: Initialize Buffers….
SHA1 requires 160 bits or 5 buffers of words (32 bits):
H0 = 0x67452301
H1 = 0xEFCDAB89
H2 = 0x98BADCFE
H3 = 0x10325476
H4 = 0xC3D2E1F0
The basic flowchart of these SHA-1 processing steps is given in the accompanying figure.
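In practice these steps are provided by standard libraries rather than hand-coded. The minimal Java sketch below uses java.security.MessageDigest to obtain the 160-bit digest as a 40-character hexadecimal fingerprint; the sample input string is arbitrary.

```java
import java.security.MessageDigest;

/** Compute a 160-bit SHA-1 fingerprint and print it as 40 hex characters. */
public class Sha1Fingerprint {
    static String sha1Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(data); // padding, rounds, and buffer
        StringBuilder hex = new StringBuilder();                         // updates (steps 1-5) happen internally
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha1Hex("example chunk content".getBytes("UTF-8")));
    }
}
```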
4.3 ADVANTAGES:

• SiLo is able to remove large amounts of redundant data, dramatically reduce the numbers of
accesses to on-disk index.

• Maintain a very high deduplication throughput

4.4 SYSTEM SPECIFICATIONS:

Hardware Requirements:

 Processor : Dual-core processor, 2.6 GHz
 RAM : 1 GB
 Hard disk : 160 GB
 Compact disk : 650 MB
 Keyboard : Standard keyboard
 Monitor : 15-inch color monitor

Software Requirements:
 Operating system : Windows OS (XP, 2007, 2008)
 Front end : HADOOP
 Back end : MySQL 5.0.51b

4.4 SOFTWARE DESCRIPTION

4.4.1 FRONT END SOFTWARE

HADOOP

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. A Hadoop cluster uses a master/slave structure (Lu and Huang 2012). Using Hadoop, large data sets can be processed across a cluster of servers, and applications can run on systems with thousands of nodes involving thousands of terabytes. The distributed file system in Hadoop provides rapid data transfer rates and allows the system to continue normal operation even when some nodes fail. This approach lowers the risk of an entire system failure, even in the case of a significant number of node failures. Hadoop enables a computing solution that is scalable, cost-effective, flexible, and fault tolerant. The Hadoop framework is used by popular companies such as Google, Yahoo, Amazon, and IBM to support applications involving huge amounts of data. Hadoop has two main subprojects: Map Reduce and the Hadoop Distributed File System (HDFS).

Hadoop Map Reduce is a framework (Wie and Jiang 2010) used to write applications that process large amounts of data in parallel on clusters of commodity hardware in a reliable, fault-tolerant manner. A Map Reduce job first divides the data into individual chunks, which are processed by map tasks in parallel. The outputs of the maps, sorted by the framework, are then input to the reduce tasks. Generally both the input and the output of the job are stored in a file system. Scheduling, monitoring, and re-executing failed tasks are taken care of by the framework.
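As a hedged illustration of how such a job is typically written against the Hadoop Map Reduce API, the sketch below counts how many times each chunk fingerprint appears in the input files (one fingerprint per line); a count above one flags a duplicate. The job itself, the class names, and the line-oriented input format are illustrative assumptions, not the project's actual pipeline.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts occurrences of each chunk fingerprint listed one per line in the input files;
 *  a count greater than one marks a duplicate chunk. */
public class FingerprintCount {

    public static class FingerprintMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(Object key, Text line, Context context) throws IOException, InterruptedException {
            String fingerprint = line.toString().trim();
            if (!fingerprint.isEmpty()) context.write(new Text(fingerprint), ONE); // emit (fingerprint, 1)
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text fingerprint, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(fingerprint, new IntWritable(sum)); // sum > 1 means redundant copies exist
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fingerprint count");
        job.setJarByClass(FingerprintCount.class);
        job.setMapperClass(FingerprintMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```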

Hadoop Distributed File System (HDFS)

HDFS (K, Chitharanjan 2013) is a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on the local nodes to form one large file system. HDFS improves reliability by replicating data across multiple nodes to overcome node failures.

Big data applications

Big data applications are large-scale distributed applications that usually work with large data sets. Data exploration and analysis have become difficult problems in many sectors in the era of big data. With large and complex data, computation becomes difficult for traditional data processing applications to handle, which triggered the development of big data applications (F.C.P, Muhtaroglu 2013). Google's MapReduce framework and Apache Hadoop are the de facto software systems (Zhao, 2013) for big data applications, and these applications generate a huge amount of intermediate data. Manufacturing and bioinformatics are two major areas of big data applications. Big data provide an infrastructure for transparency in the manufacturing industry, with the ability to unravel uncertainties such as inconsistent component performance and availability. In these applications, a conceptual framework of predictive manufacturing begins with data acquisition, where different types of sensory data such as pressure, vibration, acoustics, voltage, current, and controller data can be acquired. The combination of sensory data and historical data constitutes the big data in manufacturing, which acts as input to predictive tools and preventive strategies such as prognostics and health management. Another important application for Hadoop is bioinformatics, which covers next-generation sequencing and other biological domains. Bioinformatics (Xu-bin, 2012), which requires large-scale data analysis, uses Hadoop. Cloud computing brings this parallel distributed computing framework together with computer clusters and web interfaces.

Big data advantages

In big data, the software packages provide a rich set of tools and options with which an individual can map the entire data landscape across the company, allowing the individual to analyze the threats faced internally. This is considered one of the main advantages, as big data keeps the data safe: an individual can detect potentially sensitive information that is not protected in an appropriate manner and make sure it is stored according to regulatory requirements.

There are some common characteristics of big data, such as:

a) Big data integrates both structured and unstructured data.

b) It addresses speed and scalability, mobility and security, flexibility and stability.

c) In big data the time to information is critical in order to extract value from various data sources, including mobile devices, radio frequency identification, the web, and a growing list of automated sensory technologies.

All organizations and businesses would benefit from the speed, capacity, and scalability of cloud storage. Moreover, end users can visualize the data and companies can find new business opportunities. Another notable advantage of big data is data analytics, which allows the content or the look and feel of a website to be personalized in real time so that it suits each customer entering the website. If big data is combined with predictive analytics, it produces a challenge for many industries. The combination results in the exploration of these four areas:

a) Calculate the risks on large portfolios

b) Detect, prevent, and re-audit financial fraud

c) Improve delinquent collections

d) Execute high-value marketing campaigns


5.2 LIST OF MODULES:

 Cloud resource allocation


 Proxy re encryption
 Deduplication scheme
 File system analysis
 Data sharing components
 Evaluation criteria

5.2.1 Cloud resource allocation:

The virtualization is being used to provide ever-increasing number of servers on virtual


machines, reducing the number of physical machines required while preserving isolation between
machine instances. This approach better utilizes server resources, allowing many different operating
system instances to run on a small number of servers, saving both hardware acquisition costs and
operational costs such as energy, management, and cooling. Individual storage instances can be
separately managed, allowing them to serve a wide variety of purposes and preserving the level of
control that many users want. In this module, clients store data on data servers for future use, and the data servers in turn store the data with the cloud service provider.

5.2.2 Proxy re encryption

We present several efficient proxy re-encryption schemes that offer security improvements over earlier approaches. The primary advantage of these schemes is that they are unidirectional (i.e., Alice can delegate to Bob without Bob having to delegate to her) and do not require delegators to reveal their secret key to anyone, or even to interact with the delegatee, in order to allow a proxy to re-encrypt their ciphertexts. In these schemes, only a limited amount of trust is placed in the proxy: for example, it is not able to decrypt the ciphertexts it re-encrypts, and the schemes are proven secure even when the proxy publishes all the encryption information it knows. This enables a number of applications that would not be practical if the proxy needed to be fully trusted.
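The roles involved can be captured in a hypothetical interface such as the one below. It only names the operations of a unidirectional scheme and deliberately contains no cryptographic construction; the interface name, type parameters, and method signatures are assumptions for illustration, not an actual PRE library API.

```java
/** Hypothetical interface capturing the roles in a unidirectional proxy re-encryption scheme;
 *  it names the operations only and deliberately omits any concrete cryptographic construction. */
public interface ProxyReEncryption<PublicKey, SecretKey, ReKey, Ciphertext> {

    /** Encrypt under the delegator's (Alice's) public key. */
    Ciphertext encrypt(PublicKey alicePublic, byte[] plaintext);

    /** Alice derives a unidirectional re-encryption key toward Bob using only her secret key
     *  and Bob's public key; Bob does not have to delegate back to her. */
    ReKey reKeyGen(SecretKey aliceSecret, PublicKey bobPublic);

    /** The semi-trusted proxy transforms Alice's ciphertext into one for Bob.
     *  Holding only the re-encryption key, it can decrypt neither the input nor the output. */
    Ciphertext reEncrypt(ReKey aliceToBob, Ciphertext underAlice);

    /** Bob decrypts the re-encrypted ciphertext with his own secret key. */
    byte[] decrypt(SecretKey bobSecret, Ciphertext underBob);
}
```

In a typical flow, Alice encrypts, derives a re-encryption key toward Bob, the proxy transforms the stored ciphertext, and Bob decrypts with his own secret key.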
5.2.3 Deduplication scheme:

Deduplication is a technology that can be used to reduce the amount of storage required for a set of files by identifying duplicate "chunks" of data in the set and storing only one copy of each chunk. Subsequent requests to store a chunk that already exists in the chunk store are handled by simply recording the identity of the chunk in the file's block list; by not storing the chunk a second time, the system stores less data and thus reduces cost. In this module we implement a fingerprinting scheme to identify chunks: both fixed-size and variable-size chunking use cryptographically secure content hashes such as MD5 or SHA-1 to identify chunks, allowing the system to quickly discover that newly generated chunks already have stored instances.
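The following minimal sketch, under the assumption of an in-memory fingerprint set and a per-file block list (the FileRecipeBuilder name is illustrative), shows how a file's block list records a fingerprint for every chunk while the chunk payload would be written to the chunk store only on its first occurrence.

```java
import java.security.MessageDigest;
import java.util.*;

/** Builds a per-file block list of chunk fingerprints; chunk payloads are "stored" only on first sight. */
public class FileRecipeBuilder {
    private final Set<String> knownFingerprints = new HashSet<>();        // fingerprints already stored
    private final Map<String, List<String>> blockLists = new HashMap<>(); // file name -> ordered fingerprints
    private int storedChunks = 0;

    public void addFile(String fileName, List<byte[]> chunks) throws Exception {
        List<String> blockList = new ArrayList<>();
        for (byte[] chunk : chunks) {
            String fp = hex(MessageDigest.getInstance("SHA-1").digest(chunk));
            if (knownFingerprints.add(fp)) {
                storedChunks++;      // a real system would write the chunk payload to the chunk store here
            }
            blockList.add(fp);       // the file records only the identity of the chunk, new or duplicate
        }
        blockLists.put(fileName, blockList);
    }

    private static String hex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        FileRecipeBuilder builder = new FileRecipeBuilder();
        byte[] shared = "identical chunk".getBytes();
        builder.addFile("a.img", Arrays.asList(shared, "only in a".getBytes()));
        builder.addFile("b.img", Arrays.asList(shared, "only in b".getBytes()));
        System.out.println("unique chunks stored: " + builder.storedChunks + " for 4 chunk references");
    }
}
```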
5.2.4 File system analysis:
In this module, we first search the data and then analyze different sets of chunks to determine both the amount of deduplication possible and the source of chunk similarity. We use the term disk image to denote the logical abstraction containing all of the data, while image files refers to the actual files that make up a disk image. Each disk image is associated with a single storage layout: a monolithic disk image consists of a single image file, and a spanning image has one or more image files, each limited to a particular size. Files are stored on data servers with a block id, and this can be monitored by the data servers. Data servers are mapped by the cloud service provider.

5.2.5 Data sharing components:

In this module we analyze the data sharing components; the cloud service provider is responsible for managing all data servers. A dedicated background daemon thread immediately sends a heartbeat message to a problematic data server and determines whether it is alive. This mechanism ensures that failures are detected and handled at an early stage. The stateless routing algorithm can be implemented because it can detect duplicate data servers even if no one is communicating with them.
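A minimal sketch of the heartbeat idea is given below. The DataServer interface, the single probe pass, and the class name are illustrative assumptions; the sketch does not reproduce the project's daemon thread or its stateless routing.

```java
import java.util.*;

/** Minimal heartbeat monitor: servers that miss the probe are flagged for early failure handling. */
public class HeartbeatMonitor {

    /** Illustrative stand-in for a data server that can answer a heartbeat probe. */
    interface DataServer {
        String name();
        boolean answerHeartbeat();   // true if the server responds in time
    }

    /** Probe every server once and return the ones that appear dead. */
    static List<String> probeAll(List<DataServer> servers) {
        List<String> failed = new ArrayList<>();
        for (DataServer server : servers) {
            if (!server.answerHeartbeat()) failed.add(server.name()); // failure detected at an early stage
        }
        return failed;
    }

    public static void main(String[] args) {
        DataServer alive = new DataServer() {
            public String name() { return "data-server-1"; }
            public boolean answerHeartbeat() { return true; }
        };
        DataServer dead = new DataServer() {
            public String name() { return "data-server-2"; }
            public boolean answerHeartbeat() { return false; }
        };
        System.out.println("failed servers: " + probeAll(Arrays.asList(alive, dead)));
    }
}
```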

5.2.6 Evaluation criteria:

Deduplication is an efficient approach to reduce storage demands in environments hosted by cloud service providers. As we have shown, deduplication can save 80% or more of the space required to store the operating system and application environment; we explored the impact of many factors on the effectiveness of deduplication. We showed that data localization has little impact on the deduplication ratio. However, factors such as the base operating system or even the Linux distribution can have a major impact on deduplication effectiveness. Thus, we recommend that hosting centers suggest "preferred" operating system distributions for their users to ensure maximal space savings. If this preference is followed, subsequent user activity will have little impact on deduplication effectiveness.
CHAPTER 7

7. CONCLUSION AND FUTURE WORK

7.1 CONCLUSION:

In cloud many data are stored again and again by user. So the user need more spaces store
another data. That will reduce the memory space of the cloud for the users. To overcome this
problem uses the deduplication concept. Data deduplication is a method for sinking the amount of
storage space an organization wants to save its data. In many associations, the storage systems
surround duplicate copies of many sections of data. For instance, the similar file might be keep in
several dissimilar places by dissimilar users, two or extra files that aren't the same may still include
much of the similar data. Deduplication remove these extra copies by saving just one copy of the
data and replace the other copies with pointers that lead reverse to the unique copy. So we proposed
Block-level deduplication frees up more spaces and exacting category recognized as variable block
or variable length deduplication has become very popular. And implemented heart beat protocol to
recover the data from corrupted cloud server. Experimental metrics are proved that our proposed
approach provide improved results in deduplication process.
7.2 FUTURE WORK:

In the future we can extend this work to handle multimedia data for deduplication storage; multimedia data includes audio, images, and video. We also plan to extend the heartbeat protocol to recover each data server and increase the scalability of the system.
REFERENCES:

[1] Jin Li, Yan Kit Li, "A Hybrid Cloud Approach for Secure Authorized Deduplication," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 5, May 2015.
[2] Jingwei Li, Jin Li, "Secure Auditing and Deduplicating Data in Cloud," IEEE Transactions on Computers, vol. PP, no. 99, Jan 2015.

[3] D. Meyer and W. Bolosky, “A study of practical deduplication,” in Proceedings of the 9th
USENIX Conference on File and Storage Technologies, 2011.

[4] B. Debnath, S. Sengupta, and J. Li, “Chunkstash: speeding up inline storage deduplication using
flash memory,” in Proceedings of the 2010 USENIX conference on USENIX annual technical
conference. USENIX Association, 2010.
[5] W. Dong, F. Douglis, K. Li, H. Patterson, S. Reddy, and P. Shilane, “Tradeoffs in scalable data
routing for deduplication clusters,” in Proceedings of the 9th USENIX conference on File and
storage technologies. USENIX Association, 2011.
[6] E. Kruus, C. Ungureanu, and C. Dubnicki, “Bimodal content defined chunking for backup
streams,” in Proceedings of the 8th USENIX conference on File and storage technologies. USENIX
Association, 2010.
[7] G.Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu,
“Characteristics of backup workloads in production systems,” in Proceedings of the Tenth USENIX
Conference on File and Storage Technologies, 2012.
[8] A. Broder, “On the resemblance and containment of documents,” in Compression and
Complexity of Sequences 1997.
[9] D. Bhagwat, K. Eshghi, and P. Mehra, “Content-based document routing and index partitioning
for scalable similarity-based searches in a large corpus,” in Proceedings of the 13th ACM SIGKDD
international conference on Knowledge discovery and data mining. ACM, 2007, pp. 105–112.
[10] Y. Tan, H. Jiang, D. Feng, L. Tian, Z. Yan, and G. Zhou, “SAM: A Semantic-Aware Multi-
Tiered Source De-duplication Framework for Cloud Backup,” in IEEE 39th International
Conference on Parallel Processing. IEEE, 2010, pp. 614–623.
