You are on page 1of 2

TRAINING SHEET

CLOUDERA DATA ANALYST TRAINING


Take your knowledge to the next level

Cloudera University’s four-day data analyst training course will teach you to apply traditional
data analytics and business intelligence skills to big data tools like Apache Impala, Apache Hive,
“Cloudera has not only prepared
and Apache Pig. Cloudera presents the tools data professionals need to access, manipulate,
us for success today, but has also transform, and analyze complex data sets using SQL and familiar scripting languages.
trained us to face and prevail
over our big data challenges in Learn a modern toolset
the future by using Hadoop.” Students will have the chance to learn and work with modern tools, such as:
_ Apache Impala enables instant interactive analysis of the data stored in Apache Hadoop
Persado
via a native SQL environment.
_ Apache Hive provides a SQL-like query language with HiveQL that makes data accessible
to analysts, database administrators, and others without Java programming expertise.
_ Apache Pig applies the fundamentals of familiar scripting languages to the Hadoop cluster.

Get hands-on experience


Through instructor-led discussion and interactive, hands-on exercises, participants will
navigate the Hadoop ecosystem, learning how to:
_ Acquire, store, and analyze data using features in Pig, Hive, and Impala
_ Perform fundamental ETL (extract, transform, and load) tasks with Hadoop tools
_ Use Pig, Hive, and Impala to improve productivity for typical analysis tasks
_ Join diverse datasets to gain valuable business insight
_ Perform interactive, complex queries on datasets

What to expect
This course is designed for data analysts, business intelligence specialists, developers, system
architects, and database administrators. Prior knowledge of Apache Hadoop is not required.
_ Knowledge of SQL is assumed
_ Basic familiarity with the Linux command line is expected
_ Knowledge of a scripting language (such as Bash scripting, Perl, Python, or Ruby)
is helpful but not essential.

Get certified
Upon completion of the course, attendees are encouraged to continue their study and
register for the CCA Data Analyst exam. Certification is a great differentiator. It helps
establish you as a leader in the field, providing employers and customers with tangible
evidence of your skills and expertise.
TRAINING SHEET

Course Details:
Introduction Apache Pig Troubleshooting Relational Data Analysis with
and Optimization Apache Hive and Impala
Apache Hadoop Fundamentals
_ Troubleshooting Pig _ Joining Datasets
_ The Motivation for Hadoop
_ Logging _ Common Built-In Functions
_ Hadoop Overview
_ Using Hadoop’s Web UI _ Aggregation and Windowing
_ Data Storage: HDFS
_ Data Sampling and Debugging Complex Data with Apache Hive
_ Distributed Data Processing:
_ Performance Overview and Impala
YARN, MapReduce, and Spark
_ Understanding the Execution Plan _ Complex Data with Hive
_ Data Processing and Analysis:
_ Tips for Improving the Performance _ Complex Data with Impala
Pig, Hive, and Impala
_ Database Integration: Sqoop of Pig Jobs Analyzing Text with Apache Hive
_ Other Hadoop Data Tools Introduction to Apache Hive and Impala
and Impala _ Using Regular Expressions with
_ Exercise Scenarios
_ What is Hive? Hive and Impala
Introduction to Apache Pig _ What is Impala? _ Processing Text Data with SerDes
_ What is Pig? in Hive
_ Why Use Hive and Impala?
_ Pig’s Features _ Sentiment Analysis and n-grams
_ Schema and Data Storage
_ Pig Use Cases in Hive
_ Comparing Hive and Impala
_ Interacting with Pig to Traditional Databases Apache Hive Optimization
Basic Data Analysis with Apache Pig _ Use Cases _ Understanding Query Performance
_ Pig Latin Syntax Querying with Apache Hive _ Bucketing
_ Loading Data and Impala _ Indexing Data
_ Simple Data Types _ Databases and Tables _ Hive on Spark
_ Field Definitions _ Basic Hive and Impala Query
Language Syntax Apache Impala Optimization
_ Data Output _ How Impala Executes Queries
_ Data Types
_ Viewing the Schema _ Improving Impala Performance
_ Using Hue to Execute Queries
_ Filtering and Sorting Data
_ Using Beeline (Hive’s Shell) Extending Apache Hive and Impala
_ Commonly Used Functions
_ Using the Impala Shell _ Custom SerDes and File Formats
Processing Complex Data in Hive
with Apache Pig Apache Hive and Impala
Data Management _ Data Transformation with
_ Storage Formats
_ Data Storage _ Custom Scripts in Hive
_ Complex/Nested Data Types
_ Creating Databases and Tables _ User-Defined Functions
_ Grouping
_ Loading Data _ Parameterized Queries
_ Built-In Functions for Complex Data
_ Altering Databases and Tables Choosing the Best Tool for the Job
_ Iterating Grouped Data
_ Simplifying Queries with Views _ Comparing Pig, Hive, Impala,
Multi-Dataset Operations
_ Storing Query Results and Relational Databases
with Apache Pig
_ Techniques for Combining Datasets _ Which to Choose?
Data Storage and Performance
_ Joining Datasets in Pig _ Partitioning Tables Conclusion
_ Set Operations _ Loading Data into Partitioned Tables
_ Splitting Datasets _ When to Use Partitioning
_ Choosing a File Format
_ Using Avro and Parquet File Formats

Cloudera, Inc. 395 Page Mill Road Palo Alto, CA 94306 USA cloudera.com
© 2018 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks
of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies.
Information is subject to change without notice. Cloudera_Data_Analyst_Training_Sheet_106 201712