Loading Unstructured Data into Hive Through PySpark
Apache Spark is a popular open source framework that provides fast, distributed data processing and supports several languages, including Scala, Python, Java, and R. Here, we will use Spark with Python to demonstrate how Python leverages the functionality of Apache Spark to load unstructured/semi-structured data into a Hive database.
Hive, as an ETL and data warehousing tool, supports loading structured data (from other RDBMSs), semi-structured data (from XML files), and unstructured data (.txt, .csv files) from a defined source, and querying that data according to specific requirements.
In this tip, we will explain how to load unstructured data in text file format into a Hive table through a Python script invoked with PySpark, Apache Spark's Python interface.
When the data source is a set of files, we may have any number of files in different formats, so a separate automated Unix script needs to be prepared to invoke the Python script and pass each file name along with its path as arguments; the file is then extracted, transformed according to business standards, and finally loaded into the Hive database.
2.1 Scope of the code
The code works correctly when executed on Spark version 2.2.0.cloudera2 with Scala version 2.11.8.
The Hive version used is Hive 1.1.0-cdh5.12.2.
With Spark, we can read data from a CSV file, a text file, an external SQL data store, or another data source, apply transformations to the data, and store it on Hadoop in the Hadoop Distributed File System (HDFS) or in Hive.
For other RDBMSs, we can read data with the Sqoop command and load it into the Hive database.
Text files with separators can be imported into a Spark DataFrame and then stored as a Hive table using the steps described here.
In this example, we will explain how to use Spark's primary data abstraction, the resilient distributed dataset (RDD), translate it into a DataFrame, and store it in Hive. It is also possible to load CSV files directly into DataFrames using the spark-csv package.
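As a rough sketch of that flow (the file path, separator, and table/column names below are illustrative assumptions, not taken from the demo code):

from pyspark import SparkContext
from pyspark.sql import HiveContext, Row

sc = SparkContext(appName="TextToHiveDemo")
sqlContext = HiveContext(sc)      # enables access to Hive from Spark SQL

# read the raw file as an RDD of lines, split each line on the separator,
# and map the fields onto Rows
rdd = sc.textFile("/user/demo/FlatFileSource.txt")
rows = rdd.map(lambda line: line.split(",")) \
          .map(lambda cols: Row(col1=cols[0], col2=cols[1]))

# translate the RDD into a DataFrame and persist it as a Hive table
df = sqlContext.createDataFrame(rows)
df.write.saveAsTable("demo_db.demo_table")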
Let's assume we have a demo source file (.txt) with the contents shown below.
TEST-CELL XR 2.00
ID : TEST001_14SEP17_AAAA
File name : "TEST001_14SEP17_AAAA.txt"
Time :
LogDate : 12 Sep 2017 1:34:26 PM
RunDate : 12 Sep 2017 1:51:42 PM
FileDate : 12 Sep 2017 1:53:22 PM
User : TESTUSR001
Analysis version : 2.88
Results:
Demo Images : 5
Demo Total cells : 1111
Demo Viable cells : 2323
Demo Nonviable cells : 320
Demo Viability (%) : 9.10
Demo Total cells / ml (x 10^6) : 11.66
Demo Total viable cells / ml (x 10^6) : 9.423
DemoSizeData :
3,7,6,15,5,3,1,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DemoViableSizeData :
2,3,1,12,3,6,4,5,6,9,16,11,12,14,24,28,33,39,58,79,78,69,63,74,84,88,83,103,103,88,84,64,85,56,48,44,5
1,36,24,27,26,25,16,17,12,9,9,8,7,3,4,7,1,4,2,2,1,0,0,3,1,0,1,2,1,0,0,0,0,0,
DemoViabilityData :
92.8571,96.0000,97.6744,93.9394,96.8750,92.5000,97.2973,93.3333,88.8889,91.1111,85.2941,93.7500,
95.2381,83.3333,94.4444,92.1053,91.8919,92.8571,90.6977,97.5000,89.7436,83.7209,87.7551,90.6977,
94.8718,86.0465,89.7436,91.3043,91.4286,96.6667
DemoCountData :
28,50,43,33,32,40,37,45,36,45,34,48,42,36,54,38,37,42,43,40,39,43,49,43,39,43,39,46,35,30,35,48,32,5
0,40,49,48,40,44,56,38,44,34,40,46,37,31,39,45,44
4. Code Implementation
Suppose the source data is in a file in text format. The requirement is to load the text file into a Hive table using PySpark.
Create a Python script in the vi editor and import some open source Python packages, which help with reading the text file and with several data manipulation tasks; also import some Spark SQL classes (a representative import block is sketched after the list below).
I. Spark: used to parse the raw file, transform it according to business standards, and apply customizations.
II. Hive: used to store the result in the database.
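A representative import block for such a script could look as follows (this is an assumption about typical imports, not the exact original header):

import sys
import datetime
import traceback

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext, Row

conf = SparkConf().setAppName("FlatFileToHive")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)   # used later for createDataFrame and sqlContext.sql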
The Python pseudocode reads a flat file from HDFS and, after transformation, loads the file contents into Hive tables.
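A hypothetical transformation step, reusing the sc and sqlContext objects from the import sketch above (the parsing logic is illustrative, not the original business rules):

def parse_line(line):
    # split a "key : value" line of the source file into a Row
    key, _, value = line.partition(":")
    return Row(attribute=key.strip(), value=value.strip())

raw_rdd = sc.textFile("/user/demo/TEST001_14SEP17_AAAA.txt")     # flat file on HDFS
parsed_rdd = raw_rdd.filter(lambda l: ":" in l).map(parse_line)  # keep only "key : value" lines
result_df = sqlContext.createDataFrame(parsed_rdd)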
This is the step where the DataFrame is loaded into the Hive table. In the demo code, we use write.insertInto to append the transformed data to the Hive table.
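For example, the append could look like this (the target table name is an assumption; note that insertInto matches columns by position, not by name, so the DataFrame must match the table layout):

result_df.write.insertInto("demo_db.demo_staging", overwrite=False)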
We can validate the loaded data through PySpark using the sqlContext.sql command.
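A quick sanity check on the loaded rows could be (the table name is the same assumption as above):

sqlContext.sql("SELECT COUNT(*) AS row_count FROM demo_db.demo_staging").show()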
Method Invocation:
We can define a try block that wraps the entire demo code, and we may include code that writes a message into a log file on successful execution.
try:
    # the HDFS file location is passed as the first argument
    v_filePath = sys.argv[1]
    # the source file name is passed as the second argument
    v_fileName = sys.argv[2]
    # the third argument holds the log file location
    v_stgfilePath = sys.argv[3]

    # open the log file in append (or write) mode; the current date is captured for logging only
    f = open(v_stgfilePath + "/StagingDataLoad.csv", 'a')
    v_currdt = datetime.datetime.now()

    filestr = v_filePath + v_fileName
    <demo_function>(filestr)
    f.write('STAGING,<demo_function>,<demo_script_name.py>,' +
            v_currdt.strftime("%d-%m-%Y %H:%M %p") + ',SUCCEEDED' + '\n')
5. Auditing
A separate log file can be placed in a specified directory to capture the success message, or an error message in case any exception is encountered.
In the example script shown above, for demo purposes, we have defined variables for the log file path and file name. A file handler is defined to open the log file in append mode and write a success message into the file if the load succeeds.
In the demo code, an except block is defined to catch runtime errors and write the stack trace into the log file. Only the last line of the traceback (the exception message itself) is kept.
except:
    exceptiondata = traceback.format_exc().splitlines()
    exceptionarray = [exceptiondata[-1]]
    f.write('STAGING,<demo_function>,<demo_script_name.py>,' +
            v_currdt.strftime("%d-%m-%Y %H:%M %p") + ',FAILED,' + str(exceptionarray) + '\n')
finally:
    f.close()
When an exception occurs, a corresponding FAILED entry with the exception message is written to the log file.
6. SPARK-SUBMIT Command
$ spark2-submit --master yarn demo_script_name.py <file_location> FlatFileSource.txt <log_file_location>
7. PySpark Advantages
Easy integration with other languages: the Spark framework underlying PySpark also supports languages such as Scala, Java, and R.
RDD: PySpark makes it easy for data scientists to work with Resilient Distributed Datasets.
Speed: the framework is known for much greater speed than traditional data processing frameworks.
Caching and disk persistence: PySpark provides a powerful caching and disk persistence mechanism for datasets, which makes it significantly faster, as illustrated below.
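For example, caching could be applied to the illustrative result_df DataFrame from the earlier sketches roughly as follows:

from pyspark import StorageLevel

# keep a frequently reused DataFrame in memory, spilling to disk when memory is tight
result_df.persist(StorageLevel.MEMORY_AND_DISK)
result_df.count()        # the first action materializes and caches the data
result_df.unpersist()    # release the storage when it is no longer needed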