
BIG DATA ANALYSIS

Big data analytics examines large amounts of data to uncover hidden patterns, correlations and
other insights. With today's technology, it's possible to analyze your data and get answers from it
almost immediately - an effort that is slower and less efficient with more traditional business
intelligence solutions.

History and evolution of big data analytics:

The concept of big data has been around for years; most organizations now understand that if
they capture all the data that streams into their businesses, they can apply analytics and get
significant value from it. But even in the 1950s, decades before anyone uttered the term big
data, businesses were using basic analytics (essentially numbers in a spreadsheet that were
manually examined) to uncover insights and trends.

The new benefits that big data analytics brings to the table, however, are speed and efficiency.
Whereas a few years ago a business would have gathered information, run analytics and
unearthed information that could be used for future decisions, today that business can identify
insights for immediate decisions. The ability to work faster and stay agile gives organizations
a competitive edge they didn't have before.

Why is big data analytics important?

Big data analytics helps organizations harness their data and use it to identify new opportunities.
That, in turn, leads to smarter business moves, more efficient operations, higher profits and
happier customers. In his report "Big Data in Big Companies," IIA Director of Research Tom
Davenport interviewed more than 50 businesses to understand how they used big data. He found
they got value in the following ways:

Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data - plus they can
identify more efficient ways of doing business.

Faster, better decision making. With the speed of Hadoop and in-memory analytics,
combined with the ability to analyze new sources of data, businesses are able to analyze
information immediately and make decisions based on what they've learned.

New products and services. With the ability to gauge customer needs and satisfaction
through analytics comes the power to give customers what they want. Davenport points
out that with big data analytics, more companies are creating new products to meet
customers' needs.
Mahout

The Apache Mahout project's goal is to build an environment for quickly creating scalable, performant machine learning applications. Apache Mahout is an official Apache project and is therefore available from the Apache Software Foundation. Mahout is a pretty big beast at this point; it can be built from source, and much of the Mahout code depends on Hadoop.

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform. Mahout also provides Java libraries for common maths operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing.
While Mahout's core algorithms for clustering, classification and batch-based collaborative
filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, it does not
restrict contributions to Hadoop-based implementations. Contributions that run on a single node
or on a non-Hadoop cluster are also welcomed. For example, the 'Taste' collaborative-filtering
recommender component of Mahout was originally a separate project and can run stand-alone
without Hadoop.
Starting with release 0.10.0, the project shifted its focus to building a backend-independent
programming environment, code-named "Samsara". The environment consists of an algebraic
backend-independent optimizer and an algebraic Scala DSL unifying in-memory and distributed
algebraic operators. At the time of this writing, the supported algebraic platforms are Apache
Spark, H2O, and Apache Flink. Support for MapReduce algorithms is being gradually phased
out.


Distributed storage system for massive data

The amount of data being collected is growing tremendously, and organizations want to collect
and analyze this data. The big challenge they face, though, is how to store so much
information. Many organizations have turned to relatively large distributed file
systems and databases, as exemplified by Hadoop. Such challenges are why the concept of "big
data" has gotten so much attention in the first place.

Indeed, "big data" is mostly an outgrowth of the huge amounts of information that were collected
and needed to be used by the big Internet services. Most of the big services quickly found out that
the amount of data they needed to store and the frequency with which they needed to update and
access it could not be met by traditional relational database management systems such as Oracle
DB or IBM DB2. This led most of them to create their own proprietary new data stores; most
notably, Google created its own Google File System and BigTable database, while Amazon
created a new storage system called Dynamo.

As a group, these products are highly distributed file systems and databases, designed so that
each server can work independently but data stays consistent across a very decentralized
environment. In general, they use a decentralized, persistent "key value" store; often they use a
"map reduce" algorithm that parses out input to multiple servers, then combines the results.
Many of the database systems are focused around columns of data rather than rows (as in a
traditional RDBMS).
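To make the idea concrete, here is a toy Python sketch of the map/reduce pattern described above: each partition of the input is mapped to key/value pairs independently, and a reduce step combines the per-partition results. The partitions here are simulated in-process rather than spread across real servers.

from collections import Counter
from itertools import chain

# Simulated input split across several "servers" (partitions).
partitions = [
    ["click", "view", "click"],
    ["view", "view", "click"],
]

def map_phase(records):
    """Each node emits (key, 1) pairs for its own slice of the input."""
    return [(record, 1) for record in records]

def reduce_phase(mapped):
    """Combine the per-node results into a single set of counts."""
    totals = Counter()
    for key, value in chain.from_iterable(mapped):
        totals[key] += value
    return dict(totals)

print(reduce_phase(map_phase(p) for p in partitions))   # {'click': 3, 'view': 3}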

Perhaps the best-known framework for dealing with this is Hadoop, a project of the Apache
Software Foundation, inspired by MapReduce and the Google File System. Hadoop is often used
for things such as log data, click data, and Web traffic, because it allows for the data to be stored
quickly and economically. Generally, this data isn't used with the traditional business analytics
tools, but rather with things such as Apache Hive data warehousing tools and the Pig platform.
This is more of a file system or a software framework than a database, but it is often used in
conjunction with database and analysis tools.

While most of the companies I have talked to that use Hadoop use the open-source version of the
software, many are turning to commercial firms that offer support and implementation, such as
Cloudera and Hortonworks. (Think of these firms as the equivalent of Red Hat or Canonical in
the Linux world.) In addition, we're seeing more and more big vendors making Hadoop part of
their software solutions, including enterprise vendors such as Amazon, Oracle, and Microsoft.

The need to store the data and use it for analytics has led to the creation of a number of more
widely available tools with similar characteristics, generally known as the NoSQL (or !SQL)
movement. Note that most people now take NoSQL to mean "not only SQL," since in almost all of
these organizations traditional relational databases such as MySQL are also heavily used.

The noSQL nomenclature, which has only been around for two or three years, isn't a formal
definition, but rather something applied to a class of products that share certain characteristics,
but may perform quite differently from each other. NoSQL databases are usually distributed (or
"sharded") among multiple machines. They are often not consistent at every point in
time - the idea being that any differences between the systems can be addressed later, so they
will eventually become consistent over time. (In technical terms, this means they aren't trying to
meet the ACID (atomicity, consistency, isolation, durability) requirements assumed for relational
databases.) And as the name implies, they typically do not use the Structured Query Language
(SQL) for querying, although there are solutions that do use the SQL language without the
relational structure.

Among the best-known of the key value data stores is Cassandra, another Apache project that
uses a data model similar to BigTable and an infrastructure similar to Dynamo. Facebook, for
example, is said to use this. Another is Project Voldemort, an open-source implementation started
by LinkedIn, which is similar to Amazon's Dynamo system; shopping sites like Gilt use this.

There are a number of different variations within the NoSQL movement. Some are really meant
as "document stores"; they handle unstructured or semi-structured data. (This is somewhat
different from the data stores used by content and document management systems, which would
be another discussion.) Among the most established of these are Apache CouchDB, MongoDB,
and Amazon's SimpleDB, which is part of the company's Amazon Web Services. An alternative
is a "graph database," which connects separate "nodes" of data, typically without a central index.
Examples include FlockDB, an open-source project created by Twitter.

There are many other variations, including Couchbase, which expands on CouchDB, and
Citrusleaf, which I wrote about at Demo a couple of weeks ago. Both of these emphasize real-
time performance.

In general, all of these tools vary in terms of target markets, how the file systems work, how the
data is stored, what tools they work with, and which kinds of problems they are optimized for. As
a class, they are enabling all sorts of new websites and cloud services, including many of the
emerging new tools based around location and social information.

But this also has huge implications for many organizations that are now gathering more data than
ever and applying real-time analytics - for instance, to change pricing and offer specials very
quickly; to better manage inventory; to offer personalized recommendations; and, in general, to
understand buying patterns in ways that are faster, more accurate, and more personal.
What is MongoDB?

In recent years, we have seen a growing interest in database management systems that differ
from the traditional relational model. At the heart of this is the concept of NoSQL, a term used
collectively to denote database software that does not use the Structured Query Language (SQL)
to interact with the database. One of the more notable NoSQL projects out there is MongoDB, an
open source document-oriented database that stores data in collections of JSON-like documents.
What sets MongoDB apart from other NoSQL databases is its powerful document-based query
language, which makes the transition from a relational database to MongoDB easy because the
queries translate quite easily.
MongoDB is written in C++. It stores data inside JSON-like documents (using BSON, a
binary version of JSON), which hold data using key/value pairs. One feature that differentiates
MongoDB from other document databases is that it is very straightforward to translate SQL
statements into MongoDB query function calls. This makes it easy for organizations currently
using relational databases to migrate. It is also very straightforward to install and use, with
binaries and drivers available for major operating systems and programming languages.
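As an illustration of how directly SQL queries map onto MongoDB's query language, here is a small sketch using the official Python driver (pymongo); the database, collection and field names are hypothetical, and a local MongoDB instance is assumed.

from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB instance
users = client["shop"]["users"]                     # hypothetical database/collection

# SQL:     SELECT name, age FROM users WHERE age > 30 ORDER BY age DESC LIMIT 10;
# MongoDB: the same query expressed as a find() call on the collection.
cursor = (
    users.find({"age": {"$gt": 30}}, {"name": 1, "age": 1, "_id": 0})
         .sort("age", DESCENDING)
         .limit(10)
)
for doc in cursor:
    print(doc)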
MongoDB is an open-source project, with the database itself licensed under the GNU AGPL
(Affero General Public License) version 3.0. This license is a modified version of the GNU GPL
that closes a loophole where the copyleft restrictions do not apply to the software's usage but
only its distribution. This of course is important in software that is stored on the cloud and not
usually installed on client devices. Using the regular GPL, one could perceive that no distribution
is actually taking place, and thus potentially circumvent the license terms.
The AGPL only applies to the database application itself, and not to other elements of
MongoDB. The official drivers that allow developers to connect to MongoDB from various
programming languages are distributed under the Apache License Version 2.0. The MongoDB
documentation is available under a Creative Commons license.
Document-oriented databases
Document-oriented databases are quite different from traditional relational databases. Rather
than store data in rigid structures like tables, they store data in loosely defined documents. With
relational database management systems (RDBMS) tables, if you need to add a new column, you
need to change the definition of the table itself, which will add that column to every existing
record (albeit with potentially a null value). This is due to RDBMS' strict schema-based design.
However, with documents you can add new attributes to individual documents without changing
any other documents. This is because document-oriented databases are generally schema-less by
design.
Another fundamental difference is that document-oriented databases don't provide strict
relationships between documents, which helps maintain their schema-less design. This differs
greatly from relational databases, which rely heavily on relationships to normalize data storage.
Instead of storing "related" data in a separate storage area, document databases embed it in
the document itself. This is much faster than storing a reference to another
document where the related data is stored, as each reference would require an additional query.
This works extremely well for many applications where it makes sense for the data to be self-
contained inside a parent document. A good example (which is also given in MongoDB
documentation) is blog posts and comments. The comments only apply to a single post, so it
does not make sense to separate them from that post. In MongoDB, your blog post document
would have a "comments" attribute that stores the comments for that post. In a relational database
you would probably have a comments table with an ID primary key, a posts table with an ID
primary key and an intermediate mapping table post_comments that defines which comments
belong to which post. This is a lot of unnecessary complexity for something that should be very
straightforward.
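As a sketch of what such a self-contained document might look like (the field names here are illustrative, not taken from the MongoDB documentation):

# A blog post stored as a single self-contained document, with its comments
# embedded in a "comments" array.
post = {
    "title": "Why document databases?",
    "body": "Rather than tables, data lives in flexible documents...",
    "comments": [
        {"author": "alice", "text": "Great overview!"},
        {"author": "bob", "text": "What about transactions?"},
    ],
}

# With pymongo this would be inserted in one call:
#   db.posts.insert_one(post)
# and the post plus all of its comments retrieved with a single query:
#   db.posts.find_one({"title": "Why document databases?"})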
However, if you must store related data separately, you can do so easily in MongoDB using a
separate collection. Another good example, also given in the MongoDB docs, is storing customer
order information. This can typically comprise information about a customer, the order itself, line
items in the order, and product information. Using MongoDB, you would probably store
customers, products, and orders in individual collections, but you would embed line item data
inside the relevant order document. You would then reference
the products and customers collections using foreign key-style IDs, much like you would in a
relational database. The simplicity of this hybrid approach makes MongoDB an excellent choice
for those accustomed to working with SQL. With that said, take time and care to decide on the
approach for each individual use case, as the performance gains from embedding data inside the
document rather than referencing it in other collections can be significant.
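A minimal sketch of this hybrid approach, with hypothetical field names: line items are embedded in the order, while the customer and product are referenced by ID.

from bson import ObjectId   # bson ships with pymongo

# Customers and products live in their own collections; the order embeds its
# line items but references the customer and products by ID, much like foreign keys.
customer_id = ObjectId()
product_id = ObjectId()

order = {
    "customer_id": customer_id,          # reference into the customers collection
    "status": "shipped",
    "line_items": [                      # embedded: always read together with the order
        {"product_id": product_id, "qty": 2, "unit_price": 19.99},
    ],
    "total": 39.98,
}
# db.orders.insert_one(order)
# db.customers.find_one({"_id": order["customer_id"]})   # follow the reference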

Features at a glance

MongoDB is a lot more than just a basic key/value store. Let's take a brief look at some of its
other features:
Official binaries available for Windows, Mac OS X, Linux and Solaris, source
distribution available for self-build
Official drivers available for C, C#, C++, Haskell, Java, JavaScript, Perl, PHP, Python,
Ruby and Scala, with a large range of community-supported drivers available for other languages
Ad-hoc JavaScript queries that allow you to find data using any criteria on any document
attribute. These queries mirror the functionality of SQL queries, making it very straightforward
for SQL developers to write MongoDB queries.
Support for regular expressions in queries
MongoDB query results are stored in cursors that provide a range of functions for
filtering, aggregation, and sorting including limit(), skip(), sort(), count(), distinct() and group().
map/reduce implementation for advanced aggregation
Large file storage using GridFS
RDBMS-like attribute indexing support, where you can create indexes directly on
selected attributes of a document
Query optimization features using hints, explain plans, and profiling
Master/slave replication similar to MySQL
Collection-based object storage, allowing for referential querying where normalized data
is required
Horizontal scaling with auto-sharding
In-place updates for high-performance contention-free concurrency
Online shell allows you to try out MongoDB without installing
In-depth documentation, with several books published and more currently being written
Big Data Storage: Data Deduplication Techniques

Dedupe is a data management technique that eliminates redundancy to make more efficient use of
space, enabling improvements in capacity.

The deduplication approach plays a vital role in removing redundancy in large-scale cluster computing
storage. As a result, deduplication provides better storage utilization by eliminating redundant
copies of data and saving only one copy of the data on storage devices.

Data deduplication first divides large data objects into smaller parts called chunks and represents
them by unique hash values (computed with MD5 or SHA-1) in order to identify duplicate data.
Experimental results with real big data show that this deduplication approach improves the
DER (data elimination ratio) and saves storage space.

In the data deduplication process, the data blocks are analysed to identify duplicate blocks, and
the system stores only one copy while discarding the rest. By doing so, it does not require
huge space to store all of the data, thus reducing capacity needs and utilizing storage space more
efficiently.
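A minimal Python sketch of this process, assuming fixed-size chunks and the SHA-1 fingerprints mentioned above (the chunk size and names are purely illustrative):

import hashlib

CHUNK_SIZE = 4096           # illustrative fixed chunk size
store = {}                  # fingerprint -> chunk data (the single stored copy)

def dedupe_write(data: bytes):
    """Split data into chunks, store each unique chunk once, and return
    the list of fingerprints needed to reconstruct the object."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()    # SHA-1 fingerprint, as mentioned above
        if fp not in store:                     # only new chunks consume space
            store[fp] = chunk
        recipe.append(fp)
    return recipe

def read_back(recipe):
    """Reassemble the original object from its chunk fingerprints."""
    return b"".join(store[fp] for fp in recipe)

obj = b"A" * 10000 + b"B" * 10000
r1 = dedupe_write(obj)
r2 = dedupe_write(obj + b"C" * 100)             # mostly duplicate data
print(len(store), "unique chunks stored for", len(r1) + len(r2), "chunk references")
assert read_back(r1) == obj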

Three Techniques for Deduplication

1. POST PROCESS

The earliest dedupe technique:

Post-process was the first of all deduplication methods in the storage solutions market. Here
dedupe happens at the disk level, which means that the incoming data has to be stored first
(taking up capacity) before any dedupe takes place. All of the incoming data (un-deduplicated) is
written to the cache, and then moved to disk or SSD. Deduplication happens only after this
move.

Depending on the actual product, the dedupe may occur at the SSD or disk level or both. All
blocks have to be scanned and compared with each other to find the duplicates. Since this
process is very slow, it is often scheduled to run only at night. In most products it also makes it
impractical to dedupe at the system level - blocks are only compared within the scope of a single
volume or a RAID set.

Since all incoming data has to be stored somewhere before deduplication can take place, the
capacity has to be large enough to store it.

Post-process dedupe was designed at a time when CPU resources were quite expensive and the
main media was disk drives. For this reason, delaying the deduplication process until after the
data is ingested frees up the potential bottleneck and allows for faster write throughput.
However, newer in-line implementations using multi-core processors and faster media have proven
that dedupe can occur in closer proximity to the arrival of data into the system without impacting
performance.

2. TRADITIONAL IN-LINE

In in-line deduplication, all of the incoming data is written to the cache, but unlike post-process,
not all of it is moved to the media. Deduplication happens within the cache, and only
deduped data is written to the media. Since the output of dedupe is highly random, this
technology was only implemented in SSD-based systems. The number of read/writes is
significantly reduced between the cache and flash disk. With In-line deduplication in place,
precious SSD capacity becomes more affordable because you are storing only deduped data in
flash.

But, what about processing time?

In-line requires a lot of processing power, and in cases where high-volume data enters the system,
network bottlenecks occur due to latency in write operations, decreasing server performance. But
in comparison with post-process, in-line has several advantages, such as increased effective
capacity and reduced IOs inside the system, thus saving plenty of time and resources. Then again,
in-line still has to write the (possibly duplicate) data into memory before deduplication can take place.

Writing duplicate data into media during overloads is a major drawback for inline, because it
acts like post-process. It only partially dedupes inline in order to keep up the performance,
compromising capacity.

3. IN-LINE IN-MEMORY

In-line in-memory dedupe eliminates the need for a separate deduplication pass, thereby speeding up the
storage process and making its predecessors look like dinosaurs. In-line in-memory eliminates
the existence of duplicate data anywhere, anytime, on any tier by identifying duplicate data right
as it falls off the wire, before it is written anywhere in the system.

It's possible to pull this off by hashing the data (calculating a mathematical fingerprint for the data blocks)
as soon as it arrives at the system as a write request. The dedupe engine then looks for a
match of the fingerprint across the entire database (comparing with blocks in any of the tiers). If
there is a match, it ignores the write request and references the already-written data block. If
there is no match, meaning this data block is unique, the data is written into the cache.

All this happens before the data is written anywhere.

Therefore, the cache, RAM, SSD and disks only hold unique data blocks at all times, drastically
increasing the effective capacity and storage performance. The non-duplicated data in the cache
is also compressed before it leaves to the SSD, effectively doubling the capacity beyond the
deduplication ratio.
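A rough sketch of this hash-then-lookup write path (the index and storage structures below are simplified stand-ins, not any vendor's implementation):

import hashlib

fingerprint_index = {}   # fingerprint -> location of the already-written block
cache = {}               # location -> unique block data (stands in for cache/SSD/disk tiers)

def write_block(block: bytes):
    """Fingerprint the block as it 'comes off the wire'; if it is already known,
    only return a reference, otherwise write the unique block to the cache."""
    fp = hashlib.sha256(block).hexdigest()
    if fp in fingerprint_index:
        return fingerprint_index[fp]          # duplicate: nothing is written
    location = len(cache)                     # next free slot (illustrative)
    cache[location] = block
    fingerprint_index[fp] = location
    return location

# Two hosts writing the same 4 KiB block: only one copy ever reaches the cache.
loc1 = write_block(b"\x00" * 4096)
loc2 = write_block(b"\x00" * 4096)
print(loc1 == loc2, "blocks stored:", len(cache))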
Simple concept, but huge benefits.

This simple yet powerful technology has huge implications when it comes to optimizing the number
of IOPS. By not allowing any duplicate data to be written to the system, right from the moment it
comes off the wire, the number of reads/writes between the tiers is far smaller, since only
deduped data is in motion from the cache to SSDs to spinning disks, which also means more
durability (less wear and tear).

By identifying duplicate data as the IO comes in, instead of at the cache, RAM or SSD levels,
in-line in-memory dedupe resolves the performance issues of traditional in-line dedupe, making it a
better solution.

Another important feature that sets it apart from the others is its dedupe-aware cache. In other
approaches, when the host reads the data (even if deduped), the cache loads multiple copies of it,
whereas with in-line in-memory, the cache loads only one copy, leaving enough room and
resources for other operations and not affecting performance. The clear benefit of this can be
seen in VM boots: several hundred VMs can be started concurrently right from the RAM itself,
which makes it super fast.

FIXED SIZE AND VARIABLE SIZE BLOCK BASED DEDUPLICATION

Deduplication technology is quickly becoming the new hotness in the IT industry. Previously,
deduplication was delegated to secondary storage tiers, as the controller could not always keep up
with the storage IO demand. These devices were designed to handle streams of data in and out,
versus the random IO that may show up on primary storage devices. Heck, deduplication has been
around in email environments for some time - just not in the same form we are seeing it today.

However, deduplication is slowly sneaking into new areas of IT and we are seeing more and
more benefit elsewhere. Backup clients, backup servers, primary storage, and who-knows-where
in the future.

As deduplication is being deployed across the IT world, the technology continues to advance and
become quicker and more efficient. So, in order to try and stay on top of your game, knowing a
little about the techniques for deduplication may add another tool in your tool belt and allow you
to make a better decision for your company/clients.

Deduplication is accomplished by sharing common blocks of data in storage environments and
only storing the changes to the data, versus storing a copy of the data AGAIN! This allows for
some significant storage savings, especially when you consider that many file changes are
minor adjustments versus major data loads (at least as far as corporate IT user data goes).

So, how is this magic accomplished? Great question, I am glad you asked! Enter Fixed Block
deduplication and Variable Block deduplication.
Fixed Block deduplication involves determining a block size and segmenting files/data into
blocks of that size. Those blocks are then what is stored in the storage subsystem.
Variable Block deduplication involves using algorithms to determine a variable block size. The
data is split based on the algorithm's determination, and those blocks are stored in the
subsystem.

Check out the following example based on the following sentence: "deduplication technologies
are becoming more an more important now."

Notice how the variable block deduplication has some funky block sizes. While this does not
look too efficient compared to fixed block, check out what happens when I make a correction to
the sentence. Oops - it looks like I used "an" when it should have been "and". Time to change the
file: "deduplication technologies are becoming more and more important now." File > Save.
After the file was changed and deduplicated, this is what the storage subsystem saw:

The red sections represent the blocks that have changed. By adding a single character in
the sentence, a "d", the sentence length shifted and more blocks suddenly changed. The Fixed
Block solution saw 4 out of 9 blocks changed. The Variable Block solution saw 1 out of 9 blocks
changed. Variable block deduplication ends up providing a higher storage density.
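The same behaviour can be reproduced with a toy Python sketch; the 8-byte fixed block size and the "cut after each space" boundary rule are invented purely for illustration, and with these particular choices the counts come out as the 4-of-9 and 1-of-9 described above:

def fixed_blocks(data: bytes, size: int = 8):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_blocks(data: bytes):
    """Cut after 'natural' boundaries in the content (here: after each space),
    so an insertion only disturbs the block it lands in."""
    blocks, start = [], 0
    for i, byte in enumerate(data):
        if byte == ord(" "):
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])
    return blocks

old = b"deduplication technologies are becoming more an more important now."
new = b"deduplication technologies are becoming more and more important now."

for name, split in [("fixed", fixed_blocks), ("variable", variable_blocks)]:
    before, after = set(split(old)), set(split(new))
    print(f"{name}: {len(after - before)} of {len(split(new))} blocks are new")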

Now, if you determine you have something doing fixed block deduplication, don't go and return
it right now. It probably rocks and you are definitely seeing value in what you have. However, if
you are in the market for something that deduplicates data, it is not going to hurt to ask the
vendor whether they use fixed block or variable block deduplication. You should find that you get
better density and maximize your storage purchase even more.

FIXED VS. VARIABLE-BLOCK DEDUPLICATION


Fixed-block deduplication breaks the incoming data stream into pieces or blocks that are all the
same size. The blocks are compared, and only new unique blocks are stored on disk after being
compressed. Duplicates are discarded. A system of pointers is used to map the ingested data to
the pool of unique blocks. Some approaches let the administrator choose the block size and refer
to this as "variable." But once a block size is selected, that size is used for all data - the ability to
change the block size manually does not qualify a solution as variable-block. With variable-
block deduplication, the block size is not fixed. Instead, the algorithm divides the data into
blocks of varying sizes based on natural boundaries in the data. The block size is varied
automatically, in real time, in response to the incoming data. New unique blocks are compressed
and stored on disk, and pointers are used to map the ingested data to the unique blocks.

What happens when the data changes? With fixed-block methods, unless the changes made to the
file are exactly a multiple of the fixed-block size, all of the data past the first change is shifted.
This shift changes subsequent blocks in the file with respect to the fixed-block boundaries, so to
the fixed-block algorithm they look new. Variable-block algorithms divide data into blocks
based on the characteristics of the data itself, not an arbitrary block size. This makes them
flexible when data changes. Only the new or changed data is stored and the uniqueness of the
remainder of the file is not impacted.

TEST CONFIGURATION

Deduplication capability of three products was compared:
Quantum's DXi6900 appliance, utilizing Quantum's patented variable-block deduplication
A Symantec NetBackup 5200 appliance with fixed-block deduplication
CommVault Simpana 10, also using fixed-block deduplication
The same hardware, software, and data were used for all tests. Source data was hosted on the data mover
server directly, and backups were performed to a DXi system or the NetBackup 5200
appliance. For the Simpana deduplication test, the data was sent to an NFS disk share. The NFS
share was configured on a DXi, but with the DXi's deduplication and compression disabled. Any
disk could have been used as the target, but by using the DXi, the DXi's standard Advanced
Reporting capability could be leveraged to examine the results. For the NBU 5200, statistics
were recorded from the device's interface. For the DXi and NetBackup 5200 tests, Symantec
NetBackup 7.6 was used as the data mover.

Fixed size vs. variable size (illustration): take the phrase "NOW IS THE TIME" and break it into
blocks. Now add one character at the front: "SNOW IS THE TIME". With fixed-size blocks, every
block changes; with variable-size blocks, only one block changes.

Content Defined Chunking


Backup programs need to deal with large volumes of changing data. Saving the whole copy of
each file again to the backup location when a subsequent (usually called "incremental") backup
is created is not efficient. Over time, different strategies have emerged to handle data in such a
case.
In a backup program, data de-duplication can be applied in two locations: Removing duplicate
data from the same or different files within the same backup process (inter-file de-duplication),
e.g. during the initial backup, or removing it between several backups that contain some of the
same data (inter-backup de-duplication). While the former is desirable to have, the latter is much
more important.

Strategies
The most basic strategy is to only save files that have changed since the last backup; this is
where the term "incremental backup" comes from. This way, unmodified files are not stored
again on subsequent backups. But what happens if just a small portion of a large file is
modified? Using this strategy, the modified file will be saved again, although most of it did not
change.
A better idea is to split files into smaller fixed-size pieces (called "chunks" in the following) of
e.g. 1 MiB in size. When the backup program saves a file to the backup location, it is sufficient to
save all chunks and the list of chunks the file consists of. These chunks can be identified, for
example, by the SHA-256 hash of their content, so duplicate chunks can be detected and saved
only once. This way, for a file containing a large number of consecutive null bytes, only one
chunk of null bytes needs to be stored.
On a subsequent backup, unmodified files are not saved again because all chunks have already
been saved before. Modified files on the other hand are split into chunks again, and new chunks
are saved to the backup location.
But what happens when the user adds a byte to the beginning of the file? The chunk boundaries
(where a chunk ends and the next begins) would shift by one byte, changing every chunk in the
file. When the backup program now splits the file into fixed-sized chunks, it would (in most
cases) end up with a list of different chunks, so it needs to save every chunk as a new chunk to
the backup location. This is not satisfactory for a modern backup program.

Content Defined Chunking


Restic works a bit differently. It also operates on chunks of data from files and only uploads new
chunks, but it uses a more sophisticated approach for splitting files into chunks, called Content
Defined Chunking. It works by splitting a file into chunks based on the contents of the file, rather
than always splitting after a fixed number of bytes.
In the following, the function F(b_0 ... b_63) returns a 64-bit Rabin fingerprint of the byte
sequence in the argument (where b_i is the byte at offset i). This function can be efficiently
computed as a rolling hash, which means that F(b_1 ... b_64) can be computed without much
overhead once F(b_0 ... b_63) is already known. Restic uses 64 bytes as the window size for the
rolling hash.
When restic saves a file, it first computes the Rabin fingerprints for all 64-byte sequences in the
file, so it starts by computing F(b_0 ... b_63), then F(b_1 ... b_64), then F(b_2 ... b_65), and so on.
For each fingerprint, restic then tests whether the lowest 21 bits are zero. If this is the case,
restic has found a new chunk boundary.
A chunk boundary therefore depends only on the last 64 bytes before the boundary, in other
words the end of a chunk depends on the last 64 bytes of a chunk. This especially means that
chunks are variable-sized, within reasonable limits.
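The following Python sketch illustrates the idea with a simple polynomial rolling hash standing in for the Rabin fingerprint; the 13-bit cut mask is chosen only to keep the example's chunks small (restic itself uses a real Rabin fingerprint and tests 21 bits):

import hashlib
import os

WINDOW = 64           # rolling-hash window size in bytes (restic also uses 64)
MASK = (1 << 13) - 1  # cut when the lowest 13 bits are zero (small chunks for the demo)
BASE = 257
MOD = (1 << 61) - 1
POW = pow(BASE, WINDOW - 1, MOD)  # factor for removing the byte that leaves the window

def cdc_chunks(data: bytes):
    """Split data into content-defined chunks with a polynomial rolling hash
    (a simplified stand-in for restic's Rabin fingerprint)."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * POW) % MOD  # drop the outgoing byte
        h = (h * BASE + byte) % MOD                 # add the incoming byte
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])        # boundary found
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                 # final partial chunk
    return chunks

def chunk_hashes(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}

# Chunk 1 MiB of random data, then the same data with bytes prepended:
data = os.urandom(1 << 20)
shifted = b"some extra bytes" + data
a, b = chunk_hashes(cdc_chunks(data)), chunk_hashes(cdc_chunks(shifted))
print(f"{len(a)} vs {len(b)} chunks, {len(a & b)} identical")  # almost all chunks are reused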
Returning to our earlier example, if a user creates a backup of a file and then inserts bytes at
the beginning of the file, restic will find the same chunk boundary for the first chunk during the
second run. The content of this first chunk will have changed (due to the additional bytes), but
any subsequent chunk will remain identical thanks to the content-defined chunk boundaries.
Let's say our file consists of 4 MiB of data, and restic detects the following chunk boundaries,
where the offset is the byte offset of the last byte of the sliding window:

Offset    Fingerprint
577536    0x77db45c60d400000
1990656   0xc0da6ed30fe00000
2945019   0x309235f507600000
4194304   (end of file)

The file is therefore split into four chunks. Adding 20 bytes at the beginning of the file still yields
the same chunk boundaries, shifted by 20 bytes:

Offset    Fingerprint
577556    0x77db45c60d400000
1990676   0xc0da6ed30fe00000
2945039   0x309235f507600000
4194304   (end of file)
When restic computes a cryptographic hash (SHA-256) over the data in each chunk, it detects
that the first chunk has been changed (we added 20 bytes, remember?), but the remaining three
chunks have the same hash. Therefore, it only needs to save the changed first chunk.
Examples
So, let's take the things explained above into the real world and have a bit of fun with restic. For
the sake of simplicity, we'll save the repository location and the password in environment
variables (RESTIC_REPOSITORY and RESTIC_PASSWORD) so that we don't have to type
the password for every action:
$ export RESTIC_REPOSITORY=/tmp/restic-test-repository RESTIC_PASSWORD=foo
Please be aware that this way the password will be contained in your shell history.
First, we'll initialize a new repository at a temporary location:
$ restic init
created restic backend 2b310bf378 at /tmp/restic-test-repository
Please note that knowledge of your password is required to access
the repository. Losing your password means that your data is
irrecoverably lost.
At this point, nothing has been saved to the repository, so it is rather small:
$ du -sh $RESTIC_REPOSITORY
8.0K /tmp/restic-test-repository
Next, we create a new directory called testdata for our test, containing a file file.raw, filled with
100MiB of random data:
$ mkdir testdata
$ dd if=/dev/urandom of=testdata/file.raw bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 5.76985 s, 18.2 MB/s
We then backup this directory with restic (into the repository we specified via the environment
variable $RESTIC_REPOSITORY above):
$ restic backup testdata
scan [/home/fd0/tmp/testdata]
scanned 1 directories, 1 files in 0:00
[0:02] 100.00% 41.457 MiB/s 100.000 MiB / 100.000 MiB 0 / 2 it...ETA 0:00
duration: 0:02, 44.21MiB/s
snapshot 7452bd17 saved
We can see that restic created a backup with a size of 100MiB in about two seconds. We can
verify this by checking the size of the repository again:
$ du -sh $RESTIC_REPOSITORY
101M /tmp/restic-test-repository
Not surprisingly, the repository is roughly the same size as the data we have created the backup
of.
Now, we run the backup command for a second time:
$ restic backup testdata
using parent snapshot 7452bd17
scan [/home/fd0/tmp/testdata]
scanned 1 directories, 1 files in 0:00
[0:00] 100.00% 0B/s 100.000 MiB / 100.000 MiB 0 / 2 items ... ETA 0:00
duration: 0:00, 20478.98MiB/s
snapshot 0b870550 saved
Again we've instructed restic to back up 100MiB of data, but in this case restic was much faster
and finished the job in less than a second. By the way, restic would have also been able to
efficiently back up a file that was renamed or even moved to a different directory.
Looking at the repository size we can already guess that it is still about 100MiB, since we didn't
really add any new data:
$ du -sh $RESTIC_REPOSITORY
101M /tmp/restic-test-repository
When we make a copy of the file file.raw and backup the same repository again, restic
recognises that all data is already known and the repository size does not grow at all, although
the directory testdata now contains 200MiB of data:
$ cp testdata/file.raw testdata/file2.raw
$ du -sh testdata
200M testdata
$ restic backup testdata
using parent snapshot 0b870550
scan [/home/fd0/tmp/testdata]
scanned 1 directories, 2 files in 0:00
[0:01] 100.00% 163.617 MiB/s 200.000 MiB / 200.000 MiB 1 / 3 ... ETA 0:00
duration: 0:01, 129.06MiB/s
snapshot ab8e0047 saved
$ du -sh $RESTIC_REPOSITORY
101M /tmp/restic-test-repository
Now for the final demonstration we'll create a new file file3.raw which is a nasty combination of
the 100MiB we've initially saved in file.raw, so that testdata now contains about 400MiB:
$ (echo foo; cat testdata/file.raw; echo bar; cat testdata/file.raw; echo baz) \
> testdata/file3.raw
$ du -sh testdata
401M testdata
We'll create a new backup of the directory with restic and observe that the repository has grown
by about 10MiB:
$ restic backup testdata
using parent snapshot ab8e0047
scan [/home/fd0/tmp/testdata]
scanned 1 directories, 3 files in 0:00
[0:03] 100.00% 127.638 MiB/s 400.000 MiB / 400.000 MiB 2 / 4 ... ETA 0:00
duration: 0:03, 123.73MiB/s
snapshot a8897ae3 saved
$ du -sh $RESTIC_REPOSITORY
111M /tmp/restic-test-repository
This is expected because we've created a few new chunks when creating file3.raw; e.g. the first
chunk will be saved again because a few bytes (the string "foo\n") were added. Restic managed
this challenge quite well and only introduced minor overhead for storing this incremental
backup.
Content Defined Chunking is a clever idea to split large amounts of data (e.g. large files) into
small chunks, while being able to recognize the same chunks again when shifted or (slightly)
modified.
This enables restic to de-duplicate data on the level of chunks so that each chunk of data is only
stored at (and transmitted to) the backup location once. This gives us not only inter-file
de-duplication, but also the more relevant inter-backup de-duplication.

CLOUD STORAGE
Cloud computing is an information technology (IT) paradigm, a model for enabling ubiquitous
access to shared pools of configurable resources (such as computer networks, servers, storage,
applications and services), which can be rapidly provisioned with minimal management effort,
often over the Internet. Cloud computing allows users and enterprises with various computing
capabilities to store and process data either in a privately-owned cloud, or on a third-party server
located in a data centre - thus making data-accessing mechanisms more efficient and
reliable. Cloud computing relies on sharing of resources to achieve coherence and economy of
scale, similar to a utility.
Advocates note that cloud computing allows companies to avoid or minimize up-front IT
infrastructure costs. As well, third-party clouds enable organizations to focus on their core
businesses instead of expending resources on computer infrastructure and maintenance.
Proponents also claim that cloud computing allows enterprises to get their applications up and
running faster, with improved manageability and less maintenance, and that it enables
IT teams to more rapidly adjust resources to meet fluctuating and
unpredictable business demand. Cloud providers typically use a "pay-as-you-go" model. This
could lead to unexpectedly high charges if administrators are not familiarized with cloud-pricing
models.
In 2009, the availability of high-capacity networks, low-cost computers and storage devices, as
well as the widespread adoption of hardware virtualization, service-oriented architecture,
and autonomic and utility computing, led to a growth in cloud computing. Companies can scale
up as computing needs increase and then scale down again when demands decrease. In 2013 it
was reported that cloud computing had become a highly demanded service or utility due to the
advantages of high computing power, cheap cost of services, high performance, scalability, and
accessibility, as well as availability. Some cloud vendors experience growth rates of 50% per
year, but while cloud computing remains in a stage of infancy, it has pitfalls that need to be
addressed to make cloud-computing services more reliable and user-friendly.
Cloud Storage is a service where data is remotely maintained, managed, and backed up. The
service allows the users to store files online, so that they can access them from any location via
the Internet. According to a recent survey conducted with more than 800 business decision
makers and users worldwide, the number of organizations gaining competitive advantage
through high cloud adoption has almost doubled in the last few years and by 2017, the public
cloud services market is predicted to exceed $244 billion. Now, let's look into some of the
advantages and disadvantages of Cloud Storage.

Advantages of Cloud Storage


1. Usability: All cloud storage services reviewed in this topic have desktop folders for Macs and
PCs. This allows users to drag and drop files between the cloud storage and their local storage.

2. Bandwidth: You can avoid emailing files to individuals and instead send a web link to
recipients through your email.

3. Accessibility: Stored files can be accessed from anywhere via Internet connection.

4. Disaster Recovery: It is highly recommended that businesses have an emergency backup
plan ready in the case of an emergency. Cloud storage can be used as a backup plan by
businesses by providing a second copy of important files. These files are stored at a remote
location and can be accessed through an internet connection.

5. Cost Savings: Businesses and organizations can often reduce annual operating costs by using
cloud storage; cloud storage costs about 3 cents per gigabyte to store data internally. Users can
see additional cost savings because it does not require internal power to store information
remotely.

Disadvantages of Cloud Storage


1. Usability: Be careful when using drag/drop to move a document into the cloud storage folder.
This will permanently move your document from its original folder to the cloud storage location.
Do a copy and paste instead of drag/drop if you want to retain the document's original location in
addition to moving a copy onto the cloud storage folder.

2. Bandwidth: Several cloud storage services have a specific bandwidth allowance. If an
organization surpasses the given allowance, the additional charges could be significant.
However, some providers allow unlimited bandwidth. This is a factor that companies should
consider when looking at a cloud storage provider.

3. Accessibility: If you have no internet connection, you have no access to your data.

4. Data Security: There are concerns with the safety and privacy of important data stored
remotely. The possibility of private data commingling with other organizations makes some
businesses uneasy. If you want to know more about those issues that govern data security and
privacy, here is an interesting article on the recent privacy debates.

5. Software: If you want to be able to manipulate your files locally through multiple devices,
you'll need to download the service on all devices.
