

VDA-lab Visual Data Analysis Lab


From data to insight: putting the human back in the loop of data analysis

Shark or Spark?
In a previous post, we discussed the first steps in using Spark for analysing genomic data at scale. We now turn to Shark to see which one may be better suited for the task at hand.

Introduction
Shark is built on top of Spark and is similar to Apache Hive in that it provides a SQL-like interface to the data: HiveQL. As we will see, it may have some benefits, but also some disadvantages.
Note: we had to increase the amount of memory available to the Shark shell by adding the following JVM option in shark-env.sh:
-XX:MaxPermSize=512m

Using Shark
We first have to create the table:
CREATE TABLE genomea(chr STRING, lower BIGINT, upper BIGINT, annot STRING)
row format delimited
fields terminated by '\t';

Please note that we immediately specify the correct data types for the columns. Then adding the data to the table is a query away:
load data local inpath "201101_encode_motifs_in_tf_peaks.bed"
into table genomea;

Information on the data file and format used can be found here.

Queries can now be performed easily:


select * from genomea where chr == "chr4" and (upper > 190930000 AND lower < 190940000);
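For comparison, the same range query through the plain Spark Scala interface from the previous post means parsing and filtering by hand. Below is a minimal sketch under our own assumptions (local master, the BED file in the working directory, tuple fields in the order declared above); it is not the exact code from that post:

import org.apache.spark.SparkContext

// Minimal sketch: parse the tab-separated BED file into (chr, lower, upper, annot) tuples.
val sc = new SparkContext("local[2]", "bed-query")
val genome = sc.textFile("201101_encode_motifs_in_tf_peaks.bed")
  .map(_.split("\t"))
  .map(f => (f(0), f(1).toLong, f(2).toLong, f(3)))

// Equivalent of the Shark query above: chr4 regions overlapping 190930000-190940000.
val hits = genome.filter { case (chr, lower, upper, _) =>
  chr == "chr4" && upper > 190930000L && lower < 190940000L
}
hits.take(5).foreach(println)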

Caching the data is done like this:


create table genomea_cached as select * from genomea;
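The _cached suffix in the table name is what marks the table for in-memory caching in Shark. For comparison, caching in the plain Spark interface is a single method call on the RDD; a small sketch, reusing the genome value from the sketch above:

// Mark the parsed RDD for in-memory caching; the first action materializes it.
val genomeCached = genome.cache()
genomeCached.count()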

Shark as a service
Shark can easily be run as a service in order to access it, e.g., via JDBC, using a script that is included in the installation. This enables any custom-built application to connect to it, effectively turning your data handling into a tiered architecture with lots of possibilities.
Shark can also be called from the CLI:
bin/shark -e "SELECT * FROM genomea where chr == \"chr4\" and (upper > 190930000 AND lower < 190940000)"
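As an illustration of that tiered setup, the sketch below connects to a running Shark server from Scala over JDBC. The driver class, URL and port are assumptions (the usual HiveServer defaults) to check against your own installation:

import java.sql.DriverManager

// Minimal JDBC sketch; driver class, URL and port are assumptions (HiveServer defaults).
Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery(
  "SELECT * FROM genomea WHERE chr = 'chr4' AND upper > 190930000 AND lower < 190940000")
while (rs.next()) {
  println(rs.getString("chr") + "," + rs.getLong("lower") + "," + rs.getLong("upper"))
}
rs.close(); stmt.close(); conn.close()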

Benefits
At first sight, Shark seems to have some benefits over the default Scala interface to Spark.
First of all, rows have the correct data types because the table is created with those types. Second, the SQL syntax enables a lot of people to access the data using knowledge they already have, especially when combined with the fact that Shark can be exposed as a service out of the box.
Another possibility is to extend Shark with approximate queries. That is precisely what BlinkDB is all about: it is based on Shark, but additionally takes confidence intervals or response times into account.

But
There is one main disadvantage to the Shark approach: flexibility. A SQL-like syntax is very nice, but the data needs to fit in a table structure, and not all data can easily be transformed to do that. Most data cleaning will have to be done before using Shark to expose the data to users. This means that we need other tools to pre-process the data, but then couldn't those other tools be used for querying as well?
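As an illustration of that pre-processing step, plain Spark could be used to drop malformed lines before handing the file to Shark. A minimal sketch (the output path is our own):

import org.apache.spark.SparkContext

// Minimal cleaning sketch: keep only lines with four tab-separated fields and numeric
// coordinates, then write the result out for Shark to load.
val sc = new SparkContext("local[2]", "bed-clean")
sc.textFile("201101_encode_motifs_in_tf_peaks.bed")
  .map(_.split("\t"))
  .filter(f => f.length == 4 && f(1).forall(_.isDigit) && f(2).forall(_.isDigit))
  .map(_.mkString("\t"))
  .saveAsTextFile("encode_motifs_cleaned")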
A second disadvantage lies in the fact that one cannot add a column to a table in HiveQL. This has to be done by creating a second table and joining the two into a third.

A smaller disadvantage may be that additional dependencies are introduced, especially when
deploying BlinkDB.

Calling Shark from Scala


It is possible to interface to Shark from Scala code:
val tst = sql2rdd("SELECT * FROM genomea where chr == \"chr4\" and (upper > 190930000 AND lower < 190940000)")

Something peculiar happens here, or not if you think about it. The result is wrapped in an RDD of type Row. This is a class that has getters for all the primitive types available to HiveQL. Getting the data out requires some work. This is an example to get the data out as a comma-separated list:
tst foreach ( x => println(x.getString("chr") + "," + x.getLong("lower") + "," + x.getLong("upper") + "," + x.getString("annot")))

It can get a little easier for Strings and Ints, because get by itself is enough:

tst foreach ( x => println(x.get("chr") + "," + x.get("lower")))

In this case, however, the type is (the most general) Object. Only the specific getters return the correct type.
It all depends on what one wants to do with the results afterwards. If one needs to write a JSON stream as a result, it does not matter.
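If the downstream code does need proper types, one option is to map the Row objects onto a small case class once, using the specific getters, and work with that from there on. A sketch (the Region class is our own invention):

// Hypothetical case class mirroring the genomea table; tst is the RDD of Row from above.
case class Region(chr: String, lower: Long, upper: Long, annot: String)

val regions = tst.map(x =>
  Region(x.getString("chr"), x.getLong("lower"), x.getLong("upper"), x.getString("annot")))

// From here on the fields carry their proper types:
regions.filter(r => r.upper - r.lower > 1000).foreach(r => println(r.annot))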
At the recent Spark Summit, one of the slides mentioned the following code:
val points = sc.runSql[Double,Double]("select latitude, longitude from historic_tweets")

This means that it should be possible to specify the types of the data to be read from the query. I'm not sure, however, whether this method is already implemented, and I haven't tried it out yet.

Conclusion
For our work, we will be using Spark, not Shark. The benefits of Shark in our use case do not outweigh the additional effort of transforming the data back and forth. Moreover, as will be discussed in a later post, Spark itself can be exposed in the form of a REST API, covering the most important advantage of Shark.

Additional Info
https://github.com/amplab/shark/wiki/Running-Shark-Locally
https://github.com/amplab/shark/wiki/Shark-User-Guide
http://www.youtube.com/watch?v=w0Tisli7zn4
This entry was posted in howto, review, shark, spark on January 29, 2014
[http://homes.esat.kuleuven.be/~bioiuser/blog/shark-or-spark/] by Toni Verbeiren.
