Variety: Manage and benefit from diverse data types and data structures
Velocity: Analyze streaming data and large volumes of persistent data
Volume: Scale from terabytes to zettabytes
BI vs Big Data Analysis
BI:
Business users determine what question to ask, then IT structures the data to
answer that question.
Sample BI tasks: monthly sales reports, profitability analysis, customer surveys
NoSQL Database
What are the Parquet, RC/ORC, and Avro file formats?
Parquet
Parquet is a columnar storage format.
Allows compression schemes to be specified on a per-column level.
Offers better write performance by storing metadata at the end of the file.
Provides the best results in benchmark performance tests.
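Columnar layout is what makes per-column compression possible. A minimal pure-Python sketch (NOT the real Parquet on-disk format; the sample data is invented) shows how transposing rows into contiguous columns lets a low-cardinality column compress heavily:

```python
import zlib

# Toy sketch of columnar layout (not the real Parquet format):
# 1000 rows whose "country" column has only two distinct values.
rows = [("user%d" % i, "US" if i % 2 else "UK", 20 + i % 50) for i in range(1000)]

# Column-oriented storage: transpose the rows so each column is contiguous.
names, countries, ages = zip(*rows)

# Because each column holds values of one kind, a compression scheme can be
# chosen per column; the repeated values in "country" compress very well.
raw = ",".join(countries).encode()
packed = zlib.compress(raw)
print(len(raw), len(packed))
assert len(packed) < len(raw) // 10   # heavy compression on the repeated column
```

In a row-oriented file the same "country" values would be interleaved with names and ages, so a per-column compressor could not exploit the repetition.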
Avro
Avro data files are a compact, efficient binary format
NoSQL Databases
NoSQL is a new way of handling a wide variety of data.
A NoSQL database can handle millions of queries per second, while a typical RDBMS handles
only thousands of queries per second; both are subject to the CAP theorem.
* Consistency means every read receives the most recent write or an error
HBase and MongoDB ---> CP [favor consistency over availability]
Cassandra and CouchDB ---> AP [favor availability over consistency]
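The CP/AP split can be illustrated with a toy model: under a network partition, a CP-style replica refuses reads it cannot verify, while an AP-style replica answers with possibly stale data. This is an illustration only; real systems such as HBase and Cassandra are far more nuanced:

```python
# Toy model of the CP vs. AP trade-off under a network partition
# (illustrative only, not the behavior of any specific database).

class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = "v1"         # last value this replica saw
        self.partitioned = False  # True = cut off from the rest of the cluster

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: cannot confirm this is the latest write.
            raise RuntimeError("unavailable during partition")
        # Availability first: answer anyway, possibly with stale data.
        return self.value

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True   # a newer "v2" was written elsewhere

print(ap.read())            # -> v1 (available, but possibly stale)
try:
    cp.read()
except RuntimeError as e:
    print(e)                # -> unavailable during partition
```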
DataNodes
Hadoop HDFS: read and write files from HDFS
Create sample text file on Linux
# echo "My First Hadoop Lesson" > test.txt
HBase is a NoSQL column-family database that runs on top of Hadoop HDFS (it is the default Hadoop database).
It can handle large tables with billions of rows and millions of columns, with fault tolerance and horizontal scalability.
The HBase concept was inspired by Google's Bigtable.
Schema does not need to be defined up front.
Supports high-performance random read/write applications.
hive> CREATE TABLE IF NOT EXISTS employee (id int, name String, salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
$ javac HiveQLOrderBy.java
$ java HiveQLOrderBy
Phoenix
Apache Phoenix
Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for
Hadoop using Apache HBase as its backing store.
Phoenix provides a JDBC driver that hides the intricacies of the noSQL store enabling users to create, delete,
and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data
through SQL. Phoenix compiles queries and other statements into native noSQL store APIs rather than using
MapReduce enabling the building of low latency applications on top of noSQL stores.
Apache Phoenix is a good choice for low-latency queries over mid-size tables (1M - 100M rows).
Apache Phoenix is faster than Hive and Impala for such workloads.
Phoenix main features
Supports transactions
Supports user-defined functions
Supports secondary indexes
Supports view syntax
Solr
Solr (enterprise search engine)
Solr is used to build search applications that deliver high performance, with support for
executing parallel SQL queries. It was built on top of Lucene (a full-text search engine).
Solr can be used along with Hadoop to search large volumes of text-centric data. Beyond
search, Solr can also be used for storage. Like other NoSQL databases, it is
a non-relational data storage and processing technology.
Supports full-text search; it is RAM-intensive rather than CPU-intensive.
PDF, Word document indexing
Auto-suggest, Stop words, synonyms, etc.
Supports replication
Communicate with the search server via HTTP (it can even return JSON, or native PHP/Ruby/Python formats)
Index directly from the database with custom queries
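As a sketch of the HTTP interface, the snippet below builds a query URL for Solr's standard /select handler and parses a hand-written response in Solr's JSON shape. The host, port, collection name, and response contents are assumptions for illustration; no live server is contacted:

```python
import json
from urllib.parse import urlencode

# The /select handler and the q/rows/wt parameters are standard Solr;
# host, port, and collection name here are assumptions.
params = {"q": "title:hadoop", "rows": 10, "wt": "json"}
url = "http://localhost:8983/solr/docs/select?" + urlencode(params)
print(url)

# A live server would be queried over HTTP; instead we parse a
# hand-written response in Solr's JSON response shape.
response = json.loads('{"response": {"numFound": 2, "docs": '
                      '[{"id": "1", "title": "Hadoop basics"},'
                      ' {"id": "2", "title": "Hadoop and Solr"}]}}')
for doc in response["response"]["docs"]:
    print(doc["id"], doc["title"])
```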
Apache Spark set the world record for sorting in 2014,
sorting 100 TB of data on 207 machines in 23 minutes,
whereas Hadoop MapReduce took 72 minutes on 2100 machines.
Spark libraries
Spark SQL: a Spark module for structured data processing, with in-memory processing at its core. Using Spark
SQL, you can read data from any structured source, such as JSON, CSV, Parquet, Avro, SequenceFiles, JDBC, Hive, etc.
example:
scala> sqlContext.sql("SELECT * FROM src").collect
scala> hiveContext.sql("SELECT * FROM src").collect
MLlib: Spark 2+ has a new optimized library supporting machine learning functions on a cluster, based on the new
DataFrame-based API in the spark.ml package.
Kafka
Messages are persisted in a topic. Consumers can subscribe to one or more topics and
consume all the messages in those topics.
Knox
Knox
Hadoop clusters are unsecured by default, and anyone can call them; direct access
to Hadoop can be blocked using Knox.
Knox Gateway is a REST API gateway for interacting with Hadoop clusters.
Knox allows control, integration, monitoring, and automation of administrative and analytical tasks.
Knox provides authentication using LDAP and Active Directory.
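As a sketch, a client reaches a cluster service such as WebHDFS through Knox's /gateway/&lt;topology&gt;/&lt;service&gt; URL pattern rather than hitting the cluster directly. The host, port, topology name, and path below are assumptions for illustration:

```python
from urllib.parse import urlencode

# Assumed deployment details for this sketch:
gateway = "https://knox.example.com:8443"   # Knox's default port is 8443
topology = "default"                        # topology name is an assumption
path = "/user/guest"

# WebHDFS request routed through the gateway instead of the NameNode.
url = "%s/gateway/%s/webhdfs/v1%s?%s" % (
    gateway, topology, path, urlencode({"op": "LISTSTATUS"}))
print(url)
# Knox authenticates the caller (e.g. against LDAP/AD) before forwarding
# the request into the cluster.
```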
Slider
Slider (support long-running distributed services)
YARN resource management and scheduling work well for batch workloads, but not for interactive or real-time data
processing services.
Apache Slider extends YARN to support long-running distributed services on a Hadoop cluster. It supports restart after
process failure and Live Long and Process (LLAP).
Applications can be stopped then started
The distribution of the deployed application across the YARN cluster is persisted
This enables best-effort placement close to the previous locations
Applications which remember the previous placement of data (such as HBase) can exhibit fast start-up times from this
feature.
YARN itself monitors the health of "YARN containers" hosting parts of the deployed application
YARN notifies the Slider manager application of container failure
Slider then asks YARN for a new container, into which Slider deploys a replacement for the failed component, keeping
the size of managed applications consistent with the specified configuration
Slider implements all its functionality through YARN APIs and the existing application shell scripts
The goal of the application was to have minimal code changes and impact on existing applications
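The recovery loop described above can be sketched as a toy simulation: when YARN reports a container failure, a replacement is requested so the running count stays at the configured size. This is an illustration of the idea only, not the Slider or YARN APIs:

```python
# Toy simulation of Slider-style recovery (illustrative only).

DESIRED = 4                          # configured number of component instances
containers = {"c1", "c2", "c3", "c4"}   # hypothetical running container ids

def on_container_failure(failed):
    """Handle a failure notification: replace the lost container."""
    containers.discard(failed)       # YARN notifies the manager of the failure
    replacement = "c%d" % (max(int(c[1:]) for c in containers) + 1)
    containers.add(replacement)      # a new container is requested and deployed

on_container_failure("c2")
assert len(containers) == DESIRED    # managed size matches the configuration
print(sorted(containers))            # -> ['c1', 'c3', 'c4', 'c5']
```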
ZooKeeper
ZooKeeper
• Zookeeper is a distributed coordination service that manages large sets of nodes. On any partial failure, clients
can connect to any node to receive correct, up-to-date information
• Services that depend on ZooKeeper include HBase, MapReduce, and Flume
Z-Node
• znode is a file that persists in memory on the ZooKeeper servers
• znode can be updated by any node in the cluster
• Applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper
znode, which would then inform the rest of the cluster of a specific node’s status change.
• ZNode shell commands: create, delete, exists, getChildren, getData, setData, getACL, setACL, sync
Watches events
• Any node in the cluster can register to be informed of changes to a specific znode (a watch)
• Watches are one-time triggers and always ordered. Client sees watched event before new ZNode data.
• ZNode watches events: NodeChildrenChanged, NodeCreated, NodeDataChanged, NodeDeleted
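The one-time watch semantics above can be sketched with an in-memory znode store. This is an illustration only; a real ensemble replicates this state across servers and delivers events through client sessions:

```python
# In-memory sketch of ZooKeeper-style znodes with one-time watches
# (illustrative only, not the ZooKeeper client API).

class ZNodeStore:
    def __init__(self):
        self.data = {}       # path -> bytes
        self.watches = {}    # path -> callbacks waiting for one event

    def create(self, path, value):
        self.data[path] = value
        self._fire(path, "NodeCreated")

    def set_data(self, path, value):
        self.data[path] = value
        self._fire(path, "NodeDataChanged")

    def get_data(self, path, watch=None):
        if watch:
            self.watches.setdefault(path, []).append(watch)
        return self.data[path]

    def _fire(self, path, event):
        # Watches are one-time triggers: deliver the event, then forget them.
        for cb in self.watches.pop(path, []):
            cb(path, event)

events = []
zk = ZNodeStore()
zk.create("/app/status", b"starting")
zk.get_data("/app/status", watch=lambda p, e: events.append((p, e)))
zk.set_data("/app/status", b"ready")   # fires NodeDataChanged once
zk.set_data("/app/status", b"done")    # no watch left, nothing fires
print(events)   # -> [('/app/status', 'NodeDataChanged')]
```

A client that wants continuous notifications must re-register its watch each time it handles an event, which is exactly how applications track a specific node's status changes.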
Ambari
Ambari (GUI tool to manage Hadoop)
https://www.ibm.com/hadoop
https://www.ibm.com/analytics/us/en/technology/hadoop/hadoop-trials.html
IBM BigInsights for Apache Hadoop Offering Suite
GPFS-FPO
HDFS vs. GPFS-FPO:
HDFS: files can only be accessed with Hadoop APIs, so standard Windows/Unix applications cannot use them.
GPFS: any application can access files using all the standard commands used in Windows/Unix.
HDFS: does not replicate metadata; the NameNode is a single point of failure.
GPFS: the distributed metadata feature eliminates any single point of failure (metadata is replicated just like data).
Copy:
GPFS/regular UNIX: cp /source/path /target/path
HDFS: hadoop fs -cp /source/path /target/path
Move:
GPFS/regular UNIX: mv path1/ path2/
HDFS: hadoop fs -mv path1/ path2/
Diff:
GPFS/regular UNIX: diff file1 file2
HDFS: diff <(hadoop fs -cat file1) <(hadoop fs -cat file2)
IBM Spectrum Symphony
[Adaptive MapReduce]
Adaptive MapReduce (Platform Symphony)
While Hadoop clusters normally run one job at a time, Platform Symphony is designed for
concurrency, allowing up to 300 job trackers to run on a single cluster at the same time with
agile reallocation of resources based on real-time changes to job priorities.
What is Platform Symphony?
Platform Symphony distributes and virtualizes compute-intensive application services and processes
across existing heterogeneous IT resources.
Platform Symphony creates a shared, scalable, and fault-tolerant infrastructure, delivering faster,
more reliable application performance while reducing cost.
It provides an application framework that allows you to run distributed or parallel applications in a
scaled-out grid environment.
Platform Symphony is fast middleware written in C++ although it presents programming interfaces in
multiple languages including Java, C++, C#, and various scripting languages.
Client applications interact with a session manager through a client-side API, and the session manager
guarantees the reliable execution of tasks distributed to various service instances. Service instances
are orchestrated dynamically based on application demand and resource-sharing policies.
So, Platform Symphony is a high-performance computing (HPC) software system designed to deliver scalability and
enhanced performance for compute-intensive risk and analytical applications.
The product lets users run applications using distributed computing.
"the first and only solution tailored for developing and testing Grid-ready service-oriented architecture
applications"
Details
Rich SQL:
Comprehensive SQL support
IBM SQL PL compatibility
Extensive analytic functions
Federation: distributed requests to multiple data sources within a single SQL statement
Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server
Enterprise features:
Advanced security/auditing
Resource and workload management
Self-tuning memory management
Comprehensive monitoring
BIG SHEETS
What can you do with BigSheets?
[Diagram: develop, test, profile, and export extractors - starting from sample snippets of the input documents to find clues, then moving from the analyzed sample to all available information]
# Connect to BigInsights
> bigr.connect(host="192.168.153.219", user="bigr", password="bigr")
Sqoop
Shell
SSH
HDFS FS
Simple Oozie workflow