Big data analytics examines large amounts of data to uncover hidden patterns, correlations and
other insights. With today's technology, it's possible to analyze your data and get answers from it
almost immediately, an effort that's slower and less efficient with more traditional business
intelligence solutions.
The concept of big data has been around for years; most organizations now understand that if
they capture all the data that streams into their businesses, they can apply analytics and get
significant value from it. But even in the 1950s, decades before anyone uttered the term "big
data," businesses were using basic analytics (essentially numbers in a spreadsheet that were
manually examined) to uncover insights and trends.
The new benefits that big data analytics brings to the table, however, are speed and efficiency.
Whereas a few years ago a business would have gathered information, run analytics and
unearthed information that could be used for future decisions, today that business can identify
insights for immediate decisions. The ability to work faster and stay agile gives organizations
a competitive edge they didn't have before.
Big data analytics helps organizations harness their data and use it to identify new opportunities.
That, in turn, leads to smarter business moves, more efficient operations, higher profits and
happier customers. In his report Big Data in Big Companies, IIA Director of Research Tom
Davenport interviewed more than 50 businesses to understand how they used big data. He found
they got value in the following ways:
Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data; plus, they can
identify more efficient ways of doing business.
Faster, better decision making. With the speed of Hadoop and in-memory analytics,
combined with the ability to analyze new sources of data, businesses are able to analyze
information immediately and make decisions based on what they've learned.
New products and services. With the ability to gauge customer needs and satisfaction
through analytics comes the power to give customers what they want. Davenport points
out that with big data analytics, more companies are creating new products to meet
customers' needs.
Mahout
The Apache Mahout project's goal is to build an environment for quickly creating scalable,
performant machine learning applications. Mahout is an official Apache project; it can be built
from source, and its code depends on Hadoop.
How the amount of data being collected is growing tremendously and why organizations want to
collect and analyze this data. The big challenge they face, though, is how to store so much
information. Many organizations have turned to the ideas of relatively large distributed file
systems and databases, as exemplified by Hadoop. Such challenges are why the concept of "big
data" has gotten so much attention in the first place.
Indeed, "big data" is mostly an outgrowth of the huge amounts of information that were collected
and needed to be used by the big Internet services. Most of the big services quickly found that
the amount of data they needed to store, and the frequency with which they needed to update and
access it, could not be met by traditional relational database management systems such as Oracle
DB or IBM DB2. This led most of them to create their own proprietary new data stores; most
notably, Google created its own Google File System and BigTable database, while Amazon
created a new storage system called Dynamo.
As a group, these products are highly distributed file systems and databases, designed so that
each server can work independently, but that data stays consistent across a very decentralized
environment. In general, they use a decentralized, persistent "key value" store; often they use a
"map reduce" algorithm that parses out input to multiple servers, then combines the results.
Many of the database systems are focused around columns of data rather than rows (as in a
traditional RDBMS).
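The "map reduce" pattern described above can be sketched in miniature. The following is a toy, single-process word count in Python; real frameworks distribute the map and reduce phases across many servers, and all names here are illustrative:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs; here, (word, 1) for each word.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: combine all values that share a key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data big insights", "big value"]
# In a real cluster, map runs on many servers and results are merged.
pairs = [p for doc in documents for p in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 3, 'data': 1, 'insights': 1, 'value': 1}
```

The same split-then-combine shape is what Hadoop applies at scale, with the key/value pairs shuffled between machines instead of kept in one process.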
Perhaps the best-known framework for dealing with this is Hadoop, a project of the Apache
Software Foundation, inspired by MapReduce and the Google File System. Hadoop is often used
for things such as log data, click data, and Web traffic, because it allows for the data to be stored
quickly and economically. Generally, this data isn't used with the traditional business analytics
tools, but rather with things such as Apache Hive data warehousing tools and the Pig platform.
This is more of a file system or a software framework than a database, but it is often used in
conjunction with database and analysis tools.
While most of the companies I have talked to that use Hadoop use the open-source version of the
software, many are turning to commercial firms that offer support and implementation, such as
Cloudera and Hortonworks. (Think of these firms as the equivalent of Red Hat or Canonical in
the Linux world.) In addition, we're seeing more and more big vendors making Hadoop part of
their software solutions, including enterprise vendors such as Amazon, Oracle, and Microsoft.
The need to store the data and use it for analytics has led to the creation of a number of more
widely available tools with similar characteristics, generally known as the noSQL (or !SQL)
movement. Note that most people now take noSQL to mean "not-only SQL," since in almost all of
these organizations, traditional relational databases such as MySQL are also heavily used.
The noSQL nomenclature, which has only been around for two or three years, isn't a formal
definition, but rather something applied to a class of products that share certain characteristics,
but may perform quite differently from each other. NoSQL databases are usually distributed (or
"sharded") among multiple machines. They often are not always consistent at a given point in
timethe idea being that any differences between the systems can be addressed later, so they
will eventually become consistent over time. (In technical terms, this means they aren't trying to
meet the ACID (atomicity, consistency, isolation, durability) requirements assumed for relational
databases.) And as the name implies, they typically do not use the Structured Query Language
(SQL) for querying, but there are solutions that do use the SQL language though not the
relational structure.
Among the best-known of the key value data stores is Cassandra, another Apache project that
uses a data model similar to BigTable and an infrastructure similar to Dynamo. Facebook, for
example, is said to use this. Another is Project Voldemort, an open-source implementation started
by LinkedIn, which is similar to Amazon's Dynamo system; shopping sites like Gilt use this.
There are a number of different variations within the noSQL movement. Some are really meant
as "document stores;" they handle unstructured or semi-structured data. (This is somewhat
different from the data stores used by content and document management systems, which would
be another discussion.) Among the most established of these are Apache CouchDB, MongoDB,
and Amazon's SimpleDB, which is part of the company's Amazon Web Services. An alternative
is a "graph database," which connects separate "nodes" of data typically without a central index.
Examples include FlockDB, an open-source project created by Twitter.
There are many other variations, including Couchbase, which expands on CouchDB, and
Citrusleaf, which I wrote about at Demo a couple of weeks ago. Both of these emphasize real-
time performance.
In general, all of these tools vary in terms of target markets, how the file systems work, how the
data is stored, what tools they work with, and which kinds of problems they are optimized for. As
a class, they are enabling all sorts of new websites and cloud services, including many of the
emerging new tools based around location and social information.
But this movement also has huge implications for many organizations that are now gathering
more data than ever and applying real-time analytics: for instance, to change pricing and offer
specials very quickly; to better manage inventory; to offer personalized recommendations; and,
in general, to understand buying patterns in ways that are faster, more accurate, and more personal.
What is MongoDB?
In recent years, we have seen a growing interest in database management systems that differ
from the traditional relational model. At the heart of this is the concept of NoSQL, a term used
collectively to denote database software that does not use the Structured Query Language (SQL)
to interact with the database. One of the more notable NoSQL projects out there is MongoDB, an
open source document-oriented database that stores data in collections of JSON-like documents.
What sets MongoDB apart from other NoSQL databases is its powerful document-based query
language, which makes the transition from a relational database to MongoDB easy because the
queries translate quite easily.
MongoDB is written in C++. It stores data inside JSON-like documents (using BSON, a
binary version of JSON), which hold data using key/value pairs. One feature that differentiates
MongoDB from other document databases is that it is very straightforward to translate SQL
statements into MongoDB query function calls. This makes it easy for organizations currently
using relational databases to migrate. It is also very straightforward to install and use, with
binaries and drivers available for major operating systems and programming languages.
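To illustrate how directly SQL predicates map onto MongoDB's query documents, here is a small Python sketch. The comments show real MongoDB shell syntax; the matcher itself is a simplified stand-in so the example runs without a server, and the field names and data are made up:

```python
# SQL:     SELECT * FROM users WHERE age > 30 AND city = 'Oslo'
# MongoDB: db.users.find({"age": {"$gt": 30}, "city": "Oslo"})
#
# A toy evaluator for a small subset of MongoDB's filter syntax,
# to show how directly the two languages translate.

def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):          # operator form, e.g. {"$gt": 30}
            for op, val in cond.items():
                if op == "$gt" and not doc.get(field, 0) > val:
                    return False
                if op == "$lt" and not doc.get(field, 0) < val:
                    return False
        elif doc.get(field) != cond:        # plain equality
            return False
    return True

users = [
    {"name": "Ann", "age": 42, "city": "Oslo"},
    {"name": "Bob", "age": 25, "city": "Oslo"},
]
found = [u for u in users if matches(u, {"age": {"$gt": 30}, "city": "Oslo"})]
print([u["name"] for u in found])  # ['Ann']
```

A `WHERE` clause becomes a filter document, with comparison operators spelled `$gt`, `$lt`, and so on, which is why the migration path from SQL is considered gentle.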
MongoDB is an open-source project, with the database itself licensed under the GNU AGPL
(Affero General Public License) version 3.0. This license is a modified version of the GNU GPL
that closes a loophole where the copyleft restrictions do not apply to the software's usage but
only its distribution. This of course is important in software that is stored on the cloud and not
usually installed on client devices. Using the regular GPL, one could perceive that no distribution
is actually taking place, and thus potentially circumvent the license terms.
The AGPL only applies to the database application itself, and not to other elements of
MongoDB. The official drivers that allow developers to connect to MongoDB from various
programming languages are distributed under the Apache License Version 2.0. The MongoDB
documentation is available under a Creative Commons license.
Document-oriented databases
Document-oriented databases are quite different from traditional relational databases. Rather
than store data in rigid structures like tables, they store data in loosely defined documents. With
relational database management systems (RDBMS) tables, if you need to add a new column, you
need to change the definition of the table itself, which will add that column to every existing
record (albeit with potentially a null value). This is due to RDBMS' strict schema-based design.
However, with documents you can add new attributes to individual documents without changing
any other documents. This is because document-oriented databases are generally schema-less by
design.
Another fundamental difference is that document-oriented databases don't provide strict
relationships between documents, which helps maintain their schema-less design. This differs
greatly from relational databases, which rely heavily on relationships to normalize data storage.
Instead of storing "related" data in a separate storage area, in document databases they are
embedded in the document itself. This is much faster than storing a reference to another
document where the related data is stored, as each reference would require an additional query.
This works extremely well for many applications where it makes sense for the data to be self-
contained inside a parent document. A good example (which is also given in MongoDB
documentation) is blog posts and comments. The comments only apply to a single post, so it
does not make sense to separate them from that post. In MongoDB, your blog post document
would have a "comments" attribute that stores the comments for that post. In a relational database
you would probably have a comments table with an ID primary key, a posts table with an ID
primary key and an intermediate mapping table post_comments that defines which comments
belong to which post. This is a lot of unnecessary complexity for something that should be very
straightforward.
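The embedded-document layout described above can be sketched as plain Python data (the attribute names and sample values are illustrative; MongoDB would store such a document as BSON):

```python
# A blog post with its comments embedded, MongoDB-style:
post = {
    "title": "Why document stores?",
    "body": "Schema-less storage has its uses...",
    "comments": [
        {"author": "ann", "text": "Nice overview."},
        {"author": "bob", "text": "What about joins?"},
    ],
}

# Adding a comment touches only this one document:
# no mapping table, no extra query.
post["comments"].append({"author": "eve", "text": "Thanks!"})
print(len(post["comments"]))  # 3
```

Contrast this with the relational version, where the same append would be an INSERT into a comments table plus a row in a post_comments mapping table.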
However, if you must store related data separately you can do so easily in MongoDB using a
separate collection. Another good example, also given in the MongoDB docs, is storing customer
order information. This can typically comprise information about a customer, the order itself, line
items in the order, and product information. Using MongoDB, you would probably store
customers, products, and orders in individual collections, but you would embed line item data
inside the relevant order document. You would then reference
the products and customers collections using foreign key-style IDs, much like you would in a
relational database. The simplicity of this hybrid approach makes MongoDB an excellent choice
for those accustomed to working with SQL. With that said, take time and care to decide on the
approach you need to take for each individual use case, as the performance gains can be
significant by embedding data inside the document rather than referencing it in other collections.
Features at a glance
MongoDB is a lot more than just a basic key/value store. Let's take a brief look at some of its
other features:
- Official binaries available for Windows, Mac OS X, Linux and Solaris, with a source
distribution available for self-build
- Official drivers available for C, C#, C++, Haskell, Java, JavaScript, Perl, PHP, Python,
Ruby and Scala, with a large range of community-supported drivers available for other languages
- Ad-hoc JavaScript queries that allow you to find data using any criteria on any document
attribute; these queries mirror the functionality of SQL queries, making it very straightforward
for SQL developers to write MongoDB queries
- Support for regular expressions in queries
- Query results stored in cursors that provide a range of functions for filtering, aggregation,
and sorting, including limit(), skip(), sort(), count(), distinct() and group()
- A map/reduce implementation for advanced aggregation
- Large file storage using GridFS
- RDBMS-like attribute indexing support, where you can create indexes directly on
selected attributes of a document
- Query optimization features using hints, explain plans, and profiling
- Master/slave replication similar to MySQL
- Collection-based object storage, allowing for referential querying where normalized data
is required
- Horizontal scaling with auto-sharding
- In-place updates for high-performance, contention-free concurrency
- An online shell that allows you to try out MongoDB without installing it
- In-depth documentation, with several books published and more being written
Big Data Storage: Data Deduplication Techniques
Deduplication ("dedupe") is a data management technique that eliminates redundant copies of
data to make more efficient use of storage space, enabling improvements in capacity.
Deduplication plays a vital role in removing redundancy in large-scale cluster computing
storage. It provides better storage utilization by eliminating redundant copies of data and saving
only one copy in storage devices.
Data deduplication first divides large data objects into smaller parts called chunks and represents
each of them by a unique hash value, computed with MD5 or SHA-1, to identify duplicate data.
Experimental results with real big data show that this deduplication approach improves the
DER (data elimination ratio) and saves storage space.
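The chunk-and-hash step can be sketched as follows, using SHA-1 from Python's standard library and a deliberately tiny chunk size (real systems use chunks of kilobytes to megabytes):

```python
import hashlib

def dedupe(data, chunk_size=4):
    # Split into fixed-size chunks, keyed by their SHA-1 fingerprint;
    # only the first copy of each distinct chunk is actually stored.
    store = {}                          # fingerprint -> unique chunk
    manifest = []                       # hash list needed to rebuild the data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha1(chunk).hexdigest()
        store.setdefault(h, chunk)
        manifest.append(h)
    return store, manifest

data = b"ABCDABCDABCDXYZ!"
store, manifest = dedupe(data)
print(len(manifest), "chunks written,", len(store), "unique chunks stored")
# The original can still be rebuilt losslessly from the manifest:
assert b"".join(store[h] for h in manifest) == data
```

Here four logical chunks collapse to two stored chunks, a data elimination ratio of 2:1 on this toy input.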
In the data deduplication process, the data blocks are analysed to identify duplicate blocks, and
the system stores only one copy of each while discarding the rest. By doing so, it does not require
huge space to store all of the data, thus reducing capacity needs and utilizing storage space more
efficiently.
1. POST PROCESS
Post-process was the first deduplication method to reach the storage solutions market. Here,
dedupe happens at the disk level, which means that the incoming data has to be stored first
(taking up capacity) before any dedupe takes place. All of the incoming (un-deduplicated) data is
written to the cache, and then moved to disk or SSD. Deduplication happens only after this
move.
Depending on the actual product, the dedupe may occur at the SSD or disk level or both. All
blocks have to be scanned and compared with each other to find the duplicates. Since this
process is very slow, it is often scheduled to run only at night. In most products this also makes it
impractical to dedupe at the system level: blocks are only compared within the scope of a single
volume or a RAID set.
Since all data coming in has to be stored somewhere before deduplication can take place, the
capacity has to be large enough to hold it.
Post-process dedupe was designed at a time when CPU resources were quite expensive and the
main media was disk drives. For this reason, delaying the deduplication process until after the
data is ingested frees up the potential bottleneck and allows for faster write throughput.
However, newer in-line implementations using multi-core processors and faster media have
proven that dedupe can occur closer to the arrival of data into the system without impacting
performance.
2. TRADITIONAL IN-LINE
In in-line deduplication, all of the incoming data is written to the cache, but unlike post-process,
not all of it is moved to the media level. Deduplication happens within the cache, and only
deduped data is written to the media. Since the output of dedupe is highly random, this
technology was only implemented in SSD-based systems. The number of read/writes is
significantly reduced between the cache and flash disk. With In-line deduplication in place,
precious SSD capacity becomes more affordable because you are storing only deduped data in
flash.
In-line requires a lot of processing power, and in cases where high-volume data enters the system,
network bottlenecks occur due to latency in write operations, decreasing server performance. But
in comparison with post-process, in-line has several advantages, such as increased effective
capacity and reduced IOs inside the system, saving plenty of time and resources. Still, in-line
has to write the (possibly duplicate) data into memory before deduplication can take place.
Writing duplicate data to media during overloads is a major drawback for in-line, because it then
acts like post-process: it only partially dedupes in-line in order to keep up performance,
compromising capacity.
3. IN-LINE IN-MEMORY
In-line in-memory dedupe eliminates the need for a separate deduplication pass, thereby speeding
up the storage process and making its predecessors look like dinosaurs. It eliminates the
existence of duplicate data anywhere, at any time, on any tier by identifying duplicate data right
as it falls off the wire, before it is written anywhere in the system.
It's possible to pull this off by hashing the data (calculating a mathematical fingerprint for the
data blocks) as soon as it arrives at the system as a write request. The dedupe engine then looks for a
match of the fingerprint across the entire database (comparing with blocks in any of the tiers). If
there is a match, it ignores the write request and references it to the already-written data block. If
there is no match, meaning this data block is unique, the data is written into the cache.
Therefore, the cache, RAM, SSD and disks only hold unique data blocks at all times, drastically
increasing the effective capacity and storage performance. The non-duplicated data in the cache
is also compressed before it leaves to the SSD, effectively doubling the capacity beyond the
deduplication ratio.
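The write path described above can be sketched roughly as follows. The class and method names are hypothetical, and production systems layer reference counting, tiering, and compression on top of this idea:

```python
import hashlib

class InlineDedupeCache:
    """Dedupe on the write path: duplicate blocks are never stored."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> unique block (the "cache")
        self.refs = {}      # fingerprint -> reference count

    def write(self, block):
        # Fingerprint the block as it "comes off the wire".
        fp = hashlib.sha256(block).hexdigest()
        if fp in self.blocks:
            self.refs[fp] += 1        # duplicate: just add a reference
        else:
            self.blocks[fp] = block   # unique: actually store it
            self.refs[fp] = 1
        return fp                     # caller keeps the reference

cache = InlineDedupeCache()
for b in (b"aaaa", b"bbbb", b"aaaa", b"aaaa"):
    cache.write(b)
print(len(cache.blocks))  # 2 unique blocks stored for 4 writes
```

Because the duplicate writes are turned into reference bumps before anything reaches media, every tier below the fingerprint table holds unique blocks only.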
Simple concept, but huge benefits.
This simple yet powerful technology has huge implications when it comes to optimizing the
number of IOPS. By not allowing any duplicate data to be written to the system, right from the
moment it comes off the wire, the number of reads/writes between tiers is far smaller, since only
deduped data is in motion from cache to SSDs to spinning disks. This also means more
durability (less wear and tear).
By identifying duplicate data as the IO comes in, instead of at the cache, RAM or SSD levels,
in-line in-memory dedupe resolves the performance issues of traditional in-line dedupe, making
it a better solution.
Another important feature that sets it apart is its dedupe-aware cache. In other systems, when the
host reads the data (even if deduped), the cache loads multiple copies of it, whereas with in-line
in-memory dedupe the cache loads only one copy. This leaves enough room and resources for
other operations without affecting performance. The clear benefit of this can be seen in VM
boots: several hundred VMs can be started concurrently right from the RAM itself, which makes
it super fast.
Deduplication technology is quickly becoming the new hotness in the IT industry. Previously,
deduplication was delegated to secondary storage tiers as the controller could not always keep up
with the storage IO demand. These devices were designed to handle streams of data in and out
versus random IO that may show up on primary storage devices. Heck, deduplication has been
around in email environments for some time, just not in the same form we are seeing today.
However, deduplication is slowly sneaking into new areas of IT and we are seeing more and
more benefit elsewhere. Backup clients, backup servers, primary storage, and who-knows-where
in the future.
As deduplication is being deployed across the IT world, the technology continues to advance and
become quicker and more efficient. So, in order to try and stay on top of your game, knowing a
little about the techniques for deduplication may add another tool in your tool belt and allow you
to make a better decision for your company/clients.
So, how is this magic accomplished? Great question, I am glad you asked! Enter fixed-block
deduplication and variable-block deduplication.
Fixed-block deduplication involves determining a block size and segmenting files/data into
blocks of that size. Those blocks are then what is stored in the storage subsystem.
Variable-block deduplication involves using algorithms to determine a variable block size. The
data is split based on the algorithm's determination, and the resulting blocks are stored in the
subsystem.
Check out the following example based on this sentence: "deduplication technologies
are becoming more an more important now."
Notice how the variable-block deduplication has some funky block sizes. While this does not
look too efficient compared to fixed block, check out what happens when I make a correction to
the sentence. Oops, it looks like I used "an" when it should have been "and." Time to change the
file: "deduplication technologies are becoming more and more important now." File > Save.
After the file was changed and deduplicated, this is what the storage subsystem saw:
The red sections represent the blocks that changed. By adding a single character to the sentence,
a "d," the sentence length shifted and more blocks suddenly changed. The fixed-block solution
saw 4 out of 9 blocks change; the variable-block solution saw 1 out of 9. Variable-block
deduplication ends up providing a higher storage density.
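The 4-of-9 versus 1-of-9 result can be reproduced with a toy chunker. Here the "variable block" boundaries are simply the spaces in the text; real variable-block systems derive boundaries from a rolling hash over the bytes, but the principle, boundaries that come from the content itself, is the same:

```python
def fixed_chunks(text, size=8):
    # Cut every `size` characters, regardless of content.
    return [text[i:i + size] for i in range(0, len(text), size)]

def content_chunks(text):
    # Toy "variable block": cut wherever the content has a space.
    # Real systems find boundaries with a rolling hash over the bytes.
    return [w + " " for w in text.split(" ")]

before = "deduplication technologies are becoming more an more important now"
after  = "deduplication technologies are becoming more and more important now"

for splitter in (fixed_chunks, content_chunks):
    old, new = splitter(before), splitter(after)
    changed = sum(1 for a, b in zip(old, new) if a != b)
    print(splitter.__name__, "changed:", changed, "of", len(new))
# fixed_chunks changed: 4 of 9
# content_chunks changed: 1 of 9
```

The single inserted "d" shifts every fixed boundary after it, so four fixed blocks look new, while the content-derived boundaries move with the text and only the one touched chunk changes.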
Now, if you determine you have something doing fixed-block deduplication, don't go and return
it right now. It probably rocks, and you are definitely seeing value in what you have. However, if
you are in the market for something that deduplicates data, it is not going to hurt to ask the
vendor whether they use fixed-block or variable-block deduplication. With variable block, you
should find that you get better density and maximize your storage purchase even more.
What happens when the data changes? With fixed-block methods, unless the changes made to the
file are exactly a multiple of the fixed-block size, all of the data past the first change is shifted.
This shift changes subsequent blocks in the file with respect to the fixed-block boundaries, so to
the fixed-block algorithm they look new. Variable-block algorithms divide data into blocks
based on the characteristics of the data itself, not an arbitrary block size. This makes them
flexible when data changes. Only the new or changed data is stored and the uniqueness of the
remainder of the file is not impacted.
TEST CONFIGURATION
The deduplication capability of three products was compared:
- Quantum's DXi6900 appliance, utilizing Quantum's patented variable-block deduplication
- A Symantec NetBackup 5200 appliance with fixed-block deduplication
- CommVault Simpana 10, also using fixed-block deduplication
The same hardware, software, and data were used for all tests. Source data was hosted on the
data mover server directly, and backups were performed to a DXi system or the NetBackup 5200
appliance. For the Simpana deduplication test, the data was sent to an NFS disk share. The NFS
share was configured on a DXi, but with the DXi's deduplication and compression disabled. Any
disk could have been used as the target, but by using the DXi, the DXi's standard Advanced
Reporting capability could be leveraged to examine the results. For the NBU 5200, statistics
were recorded from the device's interface. For the DXi and NetBackup 5200 tests, Symantec
NetBackup 7.6 was used as the data mover.
[Figure: the phrase "NOW IS THE TIME" broken into blocks, illustrating fixed-size versus
content-defined block boundaries.]
Strategies
The most basic strategy is to save only files that have changed since the last backup; this is
where the term "incremental backup" comes from. This way, unmodified files are not stored
again on subsequent backups. But what happens if just a small portion of a large file is
modified? Using this strategy, the modified file will be saved again, although most of it did not
change.
A better idea is to split files into smaller fixed-size pieces (called chunks in the following) of
e.g. 1 MiB in size. When the backup program saves a file to the backup location, it is sufficient to
save all chunks and the list of chunks the file consists of. These chunks can be identified, for
example, by the SHA-256 hash of their content, so duplicate chunks can be detected and saved
only once. This way, for a file containing a large number of consecutive null bytes, only one
chunk of null bytes needs to be stored.
On a subsequent backup, unmodified files are not saved again because all chunks have already
been saved before. Modified files on the other hand are split into chunks again, and new chunks
are saved to the backup location.
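A minimal sketch of such a chunk-based backup store, with SHA-256 as the chunk identifier and a toy chunk size (the file names and contents are illustrative):

```python
import hashlib

CHUNK_SIZE = 4   # toy size; the text suggests e.g. 1 MiB in practice

def backup(files, store):
    """Save files into a content-addressed store; return per-file manifests."""
    manifests = {}
    for name, data in files.items():
        hashes = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            store.setdefault(h, chunk)   # duplicate chunks saved only once
            hashes.append(h)
        manifests[name] = hashes         # a file is just a list of chunk hashes
    return manifests

store = {}
backup({"a.txt": b"hellohello"}, store)
n_first = len(store)
# A second backup of the unmodified file adds no new chunks:
backup({"a.txt": b"hellohello"}, store)
print(len(store) == n_first)  # True
```

Because chunks are addressed by the hash of their content, the second backup rediscovers every chunk in the store and only has to write a new manifest.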
But what happens when the user adds a byte to the beginning of the file? The chunk boundaries
(where a chunk ends and the next begins) would shift by one byte, changing every chunk in the
file. When the backup program now splits the file into fixed-sized chunks, it would (in most
cases) end up with a list of different chunks, so it needs to save every chunk as a new chunk to
the backup location. This is not satisfactory for a modern backup program.
CLOUD STORAGE
Cloud computing is an information technology (IT) paradigm, a model for enabling ubiquitous
access to shared pools of configurable resources (such as computer networks, servers, storage,
applications and services), which can be rapidly provisioned with minimal management effort,
often over the Internet. Cloud computing allows users and enterprises with various computing
capabilities to store and process data either in a privately-owned cloud, or on a third-party server
located in a data centre - thus making data-accessing mechanisms more efficient and
reliable. Cloud computing relies on sharing of resources to achieve coherence and economy of
scale, similar to a utility.
Advocates note that cloud computing allows companies to avoid or minimize up-front IT
infrastructure costs. As well, third-party clouds enable organizations to focus on their core
businesses instead of expending resources on computer infrastructure and maintenance.
Proponents also claim that cloud computing allows enterprises to get their applications up and
running faster, with improved manageability and less maintenance, and that it enables
IT teams to more rapidly adjust resources to meet fluctuating and
unpredictable business demand. Cloud providers typically use a "pay-as-you-go" model. This
could lead to unexpectedly high charges if administrators are not familiarized with cloud-pricing
models.
In 2009 the availability of high-capacity networks, low-cost computers and storage devices as
well as the widespread adoption of hardware virtualization, service-oriented architecture,
and autonomic and utility computing led to a growth in cloud computing. Companies can scale
up as computing needs increase and then scale down again when demands decrease. In 2013 it
was reported that cloud computing had become a highly demanded service or utility due to the
advantages of high computing power, cheap cost of services, high performance, scalability, and
accessibility - as well as availability. Some cloud vendors experience growth rates of 50% per
year, but while cloud computing remains in a stage of infancy, it has pitfalls that need to be
addressed to make cloud-computing services more reliable and user-friendly.
Cloud Storage is a service where data is remotely maintained, managed, and backed up. The
service allows the users to store files online, so that they can access them from any location via
the Internet. According to a recent survey conducted with more than 800 business decision
makers and users worldwide, the number of organizations gaining competitive advantage
through high cloud adoption has almost doubled in the last few years and by 2017, the public
cloud services market is predicted to exceed $244 billion. Now, let's look into some of the
advantages and disadvantages of cloud storage.
Advantages of cloud storage
1. Usability: Most cloud storage services provide desktop folders for Macs and PCs. This allows
users to drag and drop files between the cloud storage and their local storage.
2. Bandwidth: You can avoid emailing files to individuals and instead send a web link to
recipients.
3. Accessibility: Stored files can be accessed from anywhere via an Internet connection.
4. Disaster Recovery: It is highly recommended that businesses have an emergency backup
plan ready in the case of an emergency. Cloud storage can be used as a backup plan by
businesses by providing a second copy of important files. These files are stored at a remote
location and can be accessed through an Internet connection.
5. Cost Savings: Businesses and organizations can often reduce annual operating costs by using
cloud storage; cloud storage costs about 3 cents per gigabyte to store data internally. Users can
see additional cost savings because it does not require internal power to store information
remotely.
Disadvantages of cloud storage
1. Usability: Be careful when using drag and drop to move a document into the cloud storage
folder. This will permanently move your document from its original folder to the cloud storage
location. Do a copy and paste instead of drag/drop if you want to retain the document's original
location as well.
2. Bandwidth: Several cloud storage services have a specific bandwidth allowance. If an
organization surpasses the given allowance, the additional charges could be significant.
However, some providers allow unlimited bandwidth. This is a factor that companies should
consider when selecting a provider.
3. Accessibility: If you have no Internet connection, you have no access to your data.
4. Data Security: There are concerns with the safety and privacy of important data stored
remotely. The possibility of private data commingling with data from other organizations makes
some businesses uneasy.
5. Software: If you want to be able to manipulate your files locally through multiple devices,
you will need to install the service on all of them.