Distributed Code Analysis Over Computer Clusters

University Politehnica of Bucharest, Automatic Control and Computers Faculty, Computer Science and Engineering Department
National University of Singapore, School of Computing
MASTER THESIS Distributed Code Analysis over Computer Clusters
Scientific Advisers: Prof. Khoo Siau Cheng (NUS), Prof. Nicolae Țăpuș (UPB)
Author: Călin-Andrei Burloiu
Bucharest, 2012
Acknowledgements
I would like to thank my parents and my brother for their care and support. I would also like to thank Professor Nicolae Țăpuș for offering me the opportunity to do an internship at the National University of Singapore, where I had the chance to contribute to this interesting and promising project. Many thanks to Professor Khoo Siau Cheng for his involvement in the project and for guiding my work in the right direction.
Abstract
This master thesis marks the first steps towards building an Internet-scale source code search engine, forked from the Sourcerer infrastructure [4]. The first part of the work is an in-depth analysis of whether a Hadoop stack is appropriate for scaling up Sourcerer. The second part describes the design and implementation of the storage layer for the code analysis engine of the system, using HBase, a distributed database for Hadoop. The third part is an implementation over Hadoop MapReduce of an algorithm named Generalized CodeRank, which scores code entities by their popularity as an extended application of Google's PageRank. As far as we know, this approach is unique because it considers all entities during the calculation, not only subsets of particular types. The results show that Generalized CodeRank gives relevant results even though all entity types are used in the computation.
Contents

Acknowledgements
Abstract
1 Introduction
    1.1 Overview
    1.2 Sourcerer: A Code Search and Analysis Infrastructure
2 The Choice for Cluster Computing Technologies
    2.1 Motivation
    2.2 MapReduce and Hadoop
        2.2.1 MapReduce Algorithm
        2.2.2 Storage
    2.3 HDFS
        2.3.1 Fault Tolerance
        2.3.2 Data Sharding
        2.3.3 High Throughput for Sequential Data Access
    2.4 NoSQL
        2.4.1 Context
        2.4.2 Reasons
    2.5 HBase
        2.5.1 Data Structuring
        2.5.2 Node Roles and Data Distribution
        2.5.3 Data Mutations
        2.5.4 Data Retrieval
        2.5.5 House Keeping
    2.6 The Reasons for the Chosen Technologies
        2.6.1 Why Hadoop?
        2.6.2 ACID
        2.6.3 CAP Theorem and PACELC
        2.6.4 SQL vs. NoSQL
    2.7 Summary
3 Database Schema Design and Querying
    3.1 Database Purpose
    3.2 Design Principles
    3.3 Projects Data
        3.3.1 Former SQL Database
        3.3.2 Functional Requirements
        3.3.3 Schema Design
        3.3.4 Querying
    3.4 Files Data
        3.4.1 Former SQL Database
        3.4.2 Functional Requirements
        3.4.3 Schema Design
        3.4.4 Querying
    3.5 Entities Data
        3.5.1 Former SQL Database
        3.5.2 Functional Requirements
        3.5.3 Schema Design
        3.5.4 Querying
    3.6 Relations Data
        3.6.1 Former SQL Database
        3.6.2 Functional Requirements
        3.6.3 Schema Design
        3.6.4 Querying
    3.7 Dangling Entities Cache
    3.8 Summary
4 Generalized CodeRank
    4.1 Reputation, PageRank and CodeRank
        4.1.1 PageRank
        4.1.2 The Random Web Surfer Behavior
        4.1.3 CodeRank
        4.1.4 The Random Code Surfer Behavior
    4.2 Mathematical Model
        4.2.1 CodeRank Basic Formula
        4.2.2 CodeRank Matrix Representation
    4.3 Computing Generalized CodeRank with MapReduce
        4.3.1 Storing Data in HBase
        4.3.2 Hadoop Jobs
    4.4 Experiments
        4.4.1 Setup
        4.4.2 Convergence
        4.4.3 Probability Distribution
        4.4.4 Entities CodeRank Top
        4.4.5 Performance Results
5 Implementation
    5.1 Database Implementation
        5.1.1 Data Modeling
        5.1.2 Database Retrieval Queries API
        5.1.3 Database Insertion Queries API
        5.1.4 Indexing Data from Database
    5.2 CodeRank Implementation
        5.2.1 CodeRank and Metrics Calculation Jobs
        5.2.2 Utility Jobs
    5.3 Database Querying Tools
    5.4 Database Utility Tools
    5.5 Database Indexing Tools
    5.6 CodeRank Tools
6 Conclusions
    6.1 Summary
    6.2 Future Work
    6.3 Related Work
A Model Types
B Top 100 Entities CodeRank
List of Figures

1.1 A Java code graph example
1.2 Sourcerer system architecture (as it appears in [4])
4.1 CodeRank Example
4.2 Variation of Euclidean distance with the growth of iteration illustrating the convergence of Generalized CodeRank algorithm
4.3 Probability distribution represented by CodeRanks vector
4.4 log-log plot for CodeRanks distribution and a power law distribution
4.5 Left: Top 10 Entities CodeRank chart; Right: Distribution of Top 10 Entities CodeRanks within the whole set of entities
List of Tables

3.1 Columns for projects MySQL table
3.2 projects HBase Table
3.3 Columns for files MySQL table
3.4 files HBase Table
3.5 Columns for entities MySQL table
3.6 entities_hash HBase Table
3.7 entities HBase Table
3.8 Columns for relations MySQL table
3.9 relations_hash HBase Table
3.10 relations_direct HBase Table
3.11 relations_inverse HBase Table
4.1 Top 10 Entities CodeRank
4.2 Experiments and jobs running time
5.1 Common CLI arguments for Hadoop tools (CodeRank and Database indexing tools)
5.2 Common CLI arguments for CodeRankCalculator tool
5.3 Common CLI arguments for CodeRankUtil tool
A.1 Project Types
A.2 File Types
A.3 Entity Types
A.4 Relation Types
A.5 Relation Classes
B.1 Top 100 Entities CodeRank (No. 1-33)
B.2 Top 100 Entities CodeRank (No. 34-67)
B.3 Top 100 Entities CodeRank (No. 68-100)
Chapter 1
Introduction
This work marks the first steps of the Semantic-based Code Search project, initiated at the National University of Singapore, School of Computing, by Professor Khoo Siau Cheng. Its main objective is to build an Internet-scale search engine for source code. The implementation is a fork of Sourcerer [11], a code search platform developed at the University of California, Irvine. My work concentrates on laying the foundation of a code search and analysis platform by scaling Sourcerer up to Internet scale.

My contribution can be divided into three parts. The first part is an in-depth analysis of whether a Hadoop stack is appropriate for scaling up Sourcerer. The second describes the design and implementation of the storage layer for the code analysis engine, using HBase, a distributed database for Hadoop. The third part is an implementation over Hadoop MapReduce of an algorithm for ranking code entities by their popularity, as an extended application of Google's PageRank.

In recent years we have witnessed an increase in the popularity of open-source and free software. The Web may have had a big impact on this by allowing everyone to share content with others. To make this more accessible to people, web hosting services offered access to free technologies like the LAMP (Linux, Apache, MySQL, PHP) stack. Having a lot of users motivated the open-source communities to develop better and better software.

The expansion of open-source software has also had an impact on the IT business. Many companies, such as Google, Yahoo!, Cloudera, DataStax and MapR, finance open-source projects in exchange for paid support and premium services. On the other side of the business field, more and more companies are adopting open-source projects, technologies and libraries in their products, not only because they are free but also because of the large communities around them, which are able to offer good support.

The open-source movement has also changed the way software architects and developers work. They need to reserve a lot of time to find libraries or technologies capable of accomplishing a particular task, or to figure out how to use them. In this context a search engine provides a good starting point for finding pieces of code, documentation and code examples, or for downloading the source code. During development with open-source technologies programmers often encounter issues, and searching the Web for similar problems in order to find code snippets, code examples and solutions is often part of the development phase.

Currently Google dominates the market of mainstream search engines [53]. Developers often use this kind of search engine to download code, to find code snippets and examples, and to solve their issues, although it is not adapted for this purpose. A dedicated source code search engine would be more appropriate and would give better results. But searching for code that is able to accomplish a particular task is not easy. Others have tried to implement code search engines, such as the commercial solutions Koders [48], Krugle [49], Codase [14] and Google Code Search. The fact that none of them is very popular and that users prefer mainstream search engines suggests that state-of-the-art code search does not satisfy users' needs. Google Code Search was shut down in 2012, most likely because of the unsolved challenges in this field.
1.1 Overview
When we started working on the Semantic-based Code Search project we wanted to incorporate basic state-of-the-art code search techniques without reinventing the wheel. So we searched for an open-source code search platform that we could extend by building our own algorithms on top of it. After analyzing multiple alternatives we settled on Sourcerer [4]. One of the first reasons for choosing it was the fact that, like our system, it aimed to be an Internet-scale search engine. Secondly, it had the basic information retrieval techniques already implemented. Most importantly, it included a MySQL [61] database with information extracted from code, which can be used as a basis for many code analysis tools and algorithms. Sourcerer only handles Java code.

We named our Sourcerer fork [11] Distributed Sourcerer because it runs on a cluster of computers, a set of closely connected commodity machines that work together on a common task. After studying Sourcerer's architecture and source code in depth, as well as testing it, we soon realized that it would not scale to Internet size as its authors intended, as discussed in the next section. The basic problem was that each of its components could only run on a single machine. So I started to redesign it to run on a computer cluster.

My area of investigation is the code analysis part of the Sourcerer infrastructure, which provides algorithms vital to a code search engine, such as code indexing techniques, ranking of code entities, code clone detection and code clustering. The code analysis field investigates the structure and the semantics of program code; it is part of the program analysis field, which focuses on behavior. Code analysis algorithms can be classified as compile-based or non-compile-based, depending on whether they need to compile the source code before performing the analysis. Non-compile-based algorithms are generally faster because they perform static code analysis, and they can cope with code that contains errors and therefore cannot be compiled. The main disadvantage of these algorithms is that they cannot analyze dynamically loaded code entities, like Java classes loaded at runtime. Compile-based code analysis algorithms do not have this disadvantage, but are generally slower.
Figure 1.1: A Java code graph example

Code analysis usually deals with code graphs, like the one presented in Figure 1.1. Their nodes are code entities like classes, methods, primitives and variables, and their edges are relations, like "returns" in "method returns primitive" (see Figure 1.1). When a code graph only contains method nodes and "calls" relations it is called a call graph. In a similar way, graphs that only capture the class inheritance hierarchy or class connections can be constructed. More details about entities and relations can be found in Section 3.1. A full list of all entity types and relation types, as well as an explanation of them, can be found in Appendix A.

When talking about large-scale systems, data plays an important role because of its size and the difficulty of accessing it. A recent solution for large-scale data processing is the Hadoop platform [24], an open-source implementation of Google's MapReduce programming paradigm [15]. Chapter 2 presents our investigation into using this platform, as well as the reasons we decided to port the MySQL [61] database to a Hadoop-compliant database named HBase [25]. The schema design of the new database, the reasons behind it and the techniques used to implement queries on it are explained in Chapter 3.

After porting the storage layer of the system to HBase, I implemented a ranking algorithm for the search engine (see Chapter 4), named Generalized CodeRank, which is a PageRank [63] adaptation for code analysis used for calculating the popularity of entities. Other state-of-the-art works have implemented CodeRank before [64][51][55], but our approach differs from theirs in that we apply the algorithm to all entities and relations in the database. Portfolio [55] only applies it to C functions and their call relations. Puppin et al. [64] apply it only to classes. As far as we know, an older version of Sourcerer [51] only applied it to several types of entities, but not simultaneously.
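The notion of a code graph used in this section can be made concrete with a small sketch. It is only an illustration written in Python: the entity names, the type labels (CLASS, METHOD, PRIMITIVE), the relation labels (EXTENDS, RETURNS, CALLS) and the helper `call_graph` are assumptions chosen for the example, not the actual model types listed in Appendix A.

```python
from collections import namedtuple

# Hypothetical representation: one edge of the code graph is a
# (source entity, relation kind, target entity) triple.
Relation = namedtuple("Relation", ["source", "kind", "target"])

# Nodes of the graph: entity name -> entity type (illustrative labels).
entities = {
    "Shape": "CLASS",
    "Circle": "CLASS",
    "Circle.area": "METHOD",
    "Circle.radius": "METHOD",
    "double": "PRIMITIVE",
}

# Edges of the graph: relations between entities.
code_graph = [
    Relation("Circle", "EXTENDS", "Shape"),
    Relation("Circle.area", "RETURNS", "double"),
    Relation("Circle.area", "CALLS", "Circle.radius"),
]


def call_graph(entities, relations):
    """Keep only the CALLS edges whose endpoints are both methods."""
    return [
        r for r in relations
        if r.kind == "CALLS"
        and entities[r.source] == "METHOD"
        and entities[r.target] == "METHOD"
    ]


calls_only = call_graph(entities, code_graph)
```

Filtering on other relation kinds in the same way would yield, for instance, the class inheritance graph mentioned above.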
1.2 Sourcerer: A Code Search and Analysis Infrastructure
Developed at the University of California, Irvine, mostly by S. Bajracharya, J. Ossher and C. Lopes, Sourcerer is a code search infrastructure which aims to grow to Internet scale.

Figure 1.2: Sourcerer system architecture (as it appears in [4])

A crawler downloads source code found on the Internet in various repositories and stores the data in three different forms:
1. In the Managed Repository (referred to from now on as the repository), which keeps the original code and additional libraries.
2. In the Code Database, named SourcererDB [62], which stores data as a metamodel obtained by parsing the code from the repository (details in Chapter 3).
3. In the Code Index, which is an inverted index for keywords extracted from the code.

The system architecture is illustrated in Figure 1.2. At its core, Sourcerer applies basic information retrieval techniques to index tokenized source code into the Code Index, implemented with Apache Lucene [26], an open-source search engine library. To hide the complexity of Lucene, a higher-level technology is used, namely Apache Solr [29], a search server.

In 2010, the Sourcerer team published a paper [5] which proposed an innovative way to efficiently retrieve code examples. Their technique associates keywords with source code entities that have similar API usage. This similarity is obtained from the Code Database, and the associated keywords are stored in the Code Index.

The Code Database is a relational MySQL [61] database, which stores metamodels of projects, files, entities and relations in tables. More about this can be found in Chapter 3.

The data obtained by the crawler from the web is first stored in the Managed Repository. In order to populate the Code Database and the Code Index the extractor is used, which is implemented as a headless Eclipse [37] plugin. This component is able to parse the source code and obtain code entity and code relation data.

An older version of Sourcerer [51] implemented CodeRank, the PageRank-like algorithm for ranking code entities, but the current version no longer implements it.

Chapter 2 discusses the limitations of Sourcerer with respect to scalability and proposes replacing the database with a distributed one called SourcererDDB (Sourcerer Distributed Database). Chapter 3 describes in detail the schema design of the new database and Chapter 4 presents the Generalized CodeRank algorithm, which runs on SourcererDDB's data.
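The Code Index introduced in this section is an inverted index over keywords. As a rough illustration of that data structure only, the Python sketch below maps each keyword to the files that contain it. The function names and the sample file contents are hypothetical, and a real Lucene index does considerably more (analysis, scoring, on-disk segments).

```python
import re
from collections import defaultdict


def tokenize(source):
    """Split source text into lowercase identifier-like tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)]


def build_index(documents):
    """Map each keyword to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search(index, query):
    """Return the ids of the documents that contain every query keyword."""
    result = None
    for token in tokenize(query):
        ids = set(index.get(token, ()))
        result = ids if result is None else result & ids
    return sorted(result or ())


# Hypothetical repository content, used only for the example.
documents = {
    "A.java": "class Parser { void parseFile() {} }",
    "B.java": "class Lexer { void parseToken() {} }",
}
index = build_index(documents)
hits = search(index, "class parser")
```

Looking up a keyword costs one dictionary access regardless of how many files were indexed, which is why this structure suits keyword search.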
Chapter 2
The Choice for Cluster Computing Technologies

This chapter presents the cluster computing technologies used for scaling up Sourcerer and the reasons they were chosen over the alternatives.
2.1 Motivation
Section 1.2 described Sourcerer, the open-source project from which our implementation started. Although its goal of building a large-scale system matches ours, we soon realized the limitations of Sourcerer regarding scalability:
1. SourcererDB (the Sourcerer database), which uses MySQL, showed poor performance for repositories of hundreds of gigabytes and for the difficult queries required by some applications.
2. The extractor, implemented as a headless Eclipse [37] plugin, can only run on a single machine, and parsing hundreds of gigabytes takes days.
3. The real-time code search application uses Apache Solr [29], a search server based on the Apache Lucene [26] search library. Solr runs in single-machine mode, so it is not able to scale. There is a multi-machine version, called Distributed Solr, but it currently lacks some of the Solr features. Other Lucene-based search servers like ElasticSearch [18] should be investigated in the future.

Our basic idea is to port Sourcerer to technologies capable of running in a distributed manner on a computer cluster. This master thesis deals only with the first point above, by investigating solutions to scale the database and by designing and implementing a new distributed database called SourcererDDB (Sourcerer Distributed Database), capable of scaling to thousands of machines and of dealing with petabytes of data. The database schema design for storing code data is presented in Chapter 3, and an algorithm implemented on top of it, Generalized CodeRank, is presented in Chapter 4.

The next three sections present the technologies we chose for running our data on a computer cluster. We use HBase [25], a large-scale distributed database, and Hadoop [24], a platform for running distributed computations based on the MapReduce programming model [15]. These technologies are capable of scaling linearly and horizontally on commodity hardware.
2.2 MapReduce and Hadoop
Hadoop [24] is an open-source framework for running distributed applications, developed by the Apache Software Foundation [28]. Its implementation is based on two Google papers, one about MapReduce [15] and the other about the Google File System [39]. The Hadoop counterpart of the latter is named HDFS and is described in Section 2.3, while the former is detailed in this section. At the time of writing this thesis, Hadoop is very popular and widely used in industry by important companies like Facebook, Yahoo!, Twitter, IBM and Amazon [35]. For example, a news article published on gigaom.com states that Facebook has a 100 PB Hadoop cluster [41].

MapReduce [15] is a programming model and framework which can be used to solve embarrassingly parallel problems involving very large amounts of data distributed across a cluster of computers.
2.2.1 MapReduce Algorithm
The MapReduce problem-solving model involves two main functions, map and reduce, inspired by functional programming. The execution of map functions is supervised by Map tasks and the execution of reduce functions by Reduce tasks. From a distributed system point of view, there is a master node which coordinates jobs, and multiple slave nodes which execute tasks. The master is responsible for job scheduling, that is, assigning tasks to slave nodes, coordinating the slaves' activity and monitoring them. The domain and range of the map and reduce functions are values structured as (key, value) pairs [15][72].

In Hadoop, the input data is passed to an InputFormat class implementation which splits the data into multiple parts. The master assigns a split to each slave, which uses a RecordReader to parse the input split into input (key, value) pairs. During the map stage, each map function receives a pair and, by processing it, outputs a set of intermediate (key, value) pairs. The map stage is finished after the completion of all Map tasks in the whole cluster. During the sort and shuffle stage the master schedules the allocation of intermediate (key, value) pairs to Reduce tasks. During the reduce stage, each reduce function receives as input a set of intermediate (key, value) pairs having the same key and, through processing, outputs a set of output (key, value) pairs [15][72].
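The three stages described above can be imitated on a single machine. The following Python sketch is not Hadoop code and uses no Hadoop API; `run_job`, `map_fn` and `reduce_fn` are hypothetical names, and the classic word count problem is used only as an example of a map and a reduce function operating on (key, value) pairs.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word of one line."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: sum all the counts that share the same word."""
    yield key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Run the map, sort-and-shuffle and reduce stages sequentially."""
    # Map stage: every input pair produces a set of intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Sort and shuffle stage: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce stage: each call receives all the values of one key.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Input pairs as a RecordReader might produce them: (line number, line text).
lines = [(0, "to be or not"), (1, "to be")]
counts = run_job(lines, map_fn, reduce_fn)
```

In a real cluster the map calls and the reduce calls run in parallel on different nodes; the sequential loops here only preserve the data flow between the stages.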
2.2.2 Storage
Hadoop stores input and output data in a distributed file system or in a distributed database [72]. It comes with its own distributed file system implementation, HDFS, but other implementations can be used. The most common distributed database used with Hadoop is Apache HBase [25], but other solutions like Apache Cassandra [22] can be used as well. The resources shared by the cluster nodes are stored in the distributed file system. Hadoop achieves data locality by trying to allocate tasks on the same nodes where the data is located or, if that is not possible, in the same rack.
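The data locality preference described above can be sketched as a simple selection rule. The Python function below is an illustration under stated assumptions, not the Hadoop scheduler: `pick_node` and its parameters are hypothetical, and the final fallback to any free node is an assumption added so that the example always returns a decision.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Pick a free node for a task, preferring data locality.

    replica_nodes: set of nodes holding a copy of the task's input split
    free_nodes:    list of nodes that currently have a free task slot
    rack_of:       mapping from node name to rack name
    """
    # 1. Node-local: a free node that already stores the data.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"

    # 2. Rack-local: a free node in the same rack as one of the replicas.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Fallback (assumption): any free node, data crosses racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"


# Hypothetical three-node cluster spread over two racks.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
choice = pick_node({"n1"}, ["n3", "n2"], rack_of)
```

With the sample topology, node n1 holds the data but is busy, so the task goes to n2, which shares its rack.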
2.3 HDFS
The Hadoop framework comes with HDFS, a distributed file system which offers the following features:
1. Fault tolerance in case of node failures
2. Data sharding across the cluster
3. High throughput for streaming data access

HDFS is based on the Google File System paper [39] and exposes to the client an API that provides access to the file system with UNIX-like semantics.
2.3.1 Fault Tolerance
Fault tolerance guarantees that in case of a node failure all files continue to be available and the system continues to work. This is provided through block replication. Each file is made of a set of blocks and each block is replicated by default on two other nodes, one in the same rack and the other in another rack. This ensures that in case of a node failure the in-rack replica can be used without losing data locality. In case of a rack failure the replica from the other rack is used to serve data. Data replication is done automatically and transparently for the client [34].
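The replication policy as described in this section can be sketched in a few lines of Python. This is a toy model, not the HDFS implementation: `place_replicas` and `readable_from` are hypothetical helpers, and the deterministic choice of the first node in sorted order is an assumption made to keep the example reproducible.

```python
def place_replicas(writer, rack_of):
    """Return [writer, same-rack node, other-rack node] for one block."""
    same_rack = sorted(
        n for n in rack_of if n != writer and rack_of[n] == rack_of[writer]
    )
    other_rack = sorted(n for n in rack_of if rack_of[n] != rack_of[writer])
    if not same_rack or not other_rack:
        raise ValueError("topology too small for the described policy")
    return [writer, same_rack[0], other_rack[0]]


def readable_from(replicas, failed_nodes):
    """Return the replicas that can still serve the block."""
    return [n for n in replicas if n not in failed_nodes]


# Hypothetical cluster: two racks, A and B, with two nodes each.
rack_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
replicas = place_replicas("a1", rack_of)

# Node failure: the in-rack replica is still available.
after_node_failure = readable_from(replicas, {"a1"})
# Rack failure: only the replica in the other rack survives.
after_rack_failure = readable_from(replicas, {"a1", "a2"})
```

The two failure scenarios correspond to the two cases discussed above: a single node failing and a whole rack failing.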
2.3.2 Data Sharding
Files are automatically sharded across cluster nodes without user intervention. DataNodes store the blocks and a master node, named NameNode, keeps track of where each block is located. The client only talks with the master to find which DataNodes store the desired blocks. The NameNode is not susceptible to overloading because transferring blocks between clients and DataNodes does not involve the master, and block locations are cached on the client. New blocks are written in a daisy chain, i.e., while a DataNode receives data (from a client or another DataNode) and writes it to disk, it also sends the data in a pipeline to the next replica [34][39][72].
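The division of labour between the NameNode and the DataNodes can be illustrated with a toy model in Python. The class below is hypothetical and only mirrors the idea that the master keeps metadata while the data itself stays on the DataNodes; the round-robin placement is an assumption for the example and is not the HDFS placement policy.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


class NameNode:
    """Toy master that only remembers which DataNodes hold each block."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}  # (path, block index) -> list of DataNodes

    def add_file(self, path, file_size, block_size):
        """Register a file and return the number of blocks it occupies."""
        sizes = split_into_blocks(file_size, block_size)
        for i in range(len(sizes)):
            # Round-robin placement (assumption). The list order can also be
            # read as the daisy chain through which the block is written.
            self.block_map[(path, i)] = [
                self.datanodes[(i + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
        return len(sizes)

    def locate(self, path, index):
        """Metadata lookup: the client then reads from the DataNodes directly."""
        return self.block_map[(path, index)]


# Hypothetical cluster and file; sizes are in MiB for readability.
name_node = NameNode(["dn1", "dn2", "dn3", "dn4"])
block_count = name_node.add_file("/repo/a.jar", 150, 64)
second_block = name_node.locate("/repo/a.jar", 1)
```

Only `locate` involves the master; the block contents never pass through it, which is the reason given above for the NameNode not becoming a bottleneck.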
2.3.3 High Throughput for Sequential Data Access
When HDFS was designed, besides the need to create a distributed system that offers a consistent view of the files to all cluster nodes, the goal was to transfer data from commodity hard disks at a higher speed than traditional Linux file systems do. HDFS provides high throughput for sequential data access. It is known that the biggest bottleneck of a hard disk is disk seeks, not the transfer rate. Using larger data blocks reduces the number of disk seeks and improves throughput, but increases access latency. For an average user this is not acceptable, because small files are frequently accessed and a high latency for each file affects user experience. But in the case of MapReduce, which aims at processing large amounts of data, high throughput is a must, and high latencies are not a concern if only big files are used. HDFS data blocks are typically 64 or 128 MiB in size. For the best performance, files should be larger than the block size. By using large block sizes, data is read from disk at the transfer rate, not at the seek rate [34][39].
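The effect of block size on throughput can be checked with simple arithmetic. The Python sketch below assumes one disk seek per block, a seek time of 10 ms and a sequential transfer rate of 100 MB/s; these figures are illustrative assumptions, not measurements from this work.

```python
def effective_throughput(block_bytes, seek_s, transfer_bytes_per_s):
    """Bytes per second when every block costs one seek plus its transfer."""
    transfer_s = block_bytes / transfer_bytes_per_s
    return block_bytes / (seek_s + transfer_s)


SEEK_S = 0.010       # assumed average seek time: 10 ms
RATE = 100 * 10**6   # assumed sequential transfer rate: 100 MB/s
MIB = 1024 * 1024

for size in (4 * 1024, MIB, 64 * MIB, 128 * MIB):
    share = effective_throughput(size, SEEK_S, RATE) / RATE
    print(f"{size:>11} bytes per block: {share:6.1%} of the transfer rate")
```

Under these assumptions a 4 KiB block reaches well under 1% of the raw transfer rate because the seek dominates, while a 64 MiB block reaches more than 98% of it, at the price of roughly 0.7 s to read a single block.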
2.4 NoSQL
NoSQL (No SQL or Not Only SQL) is an emerging database category which proposes different approaches to storing data than the ubiquitous relational database management systems (RDBMS) based on SQL. It has been stated that there are so many differences between NoSQL databases that they were grouped together based on what they don't have in common [54]. Usually the main differences from RDBMS are the following [68][6]:
1. They do not have a relational model, so there is no SQL support.
2. They are usually column-based or offer a different way of structuring data.
3. They sacrifice some guarantees such as ACID (Atomicity, Consistency, Isolation, Durability). For details about what ACID means see Section 2.6.
2.4.1
Context
NoSQL databases appeared in the context of the Web expansion, which required much more data to be stored and many more users to access it. Web applications that need to store large data sets, also known as big data [73], started to appear. The scalability limits of SQL-based databases started to show. Big Web sites like Facebook, Twitter, Reddit and Digg started to experience problems with their SQL databases and, as a consequence, began to look for alternatives. Databases like MongoDB [1], CouchDB [23], HBase and Cassandra [22] come with different approaches to structuring data and with different guarantees.[68][6]
2.4.2
Reasons
There are three reasons one would choose a NoSQL solution instead of SQL. The most common reason is the need for more scalability, which is typically obtained through distributed computing. RDBMS offer many guarantees, like strong consistency and durability (see Section 2.6), which have a big overhead in a distributed environment or simply make scaling difficult and expensive. Running computation in a distributed system also creates availability problems. Web applications have availability requirements, because if a site goes down its owner may lose a lot of money and clients. Ensuring scalability, high availability and low latency comes at the expense of consistency, which is usually weakened. Users typically don't care if they don't see the latest version of a post as long as they can still access the site to some extent. For applications where consistency is important, like banking, SQL databases are still the best choice.[68][6]

The second reason for choosing NoSQL over SQL is the availability requirements. This reason is strongly linked with the first one. At large scale the database runs on a cluster of computers. For commodity hardware, failures are the norm, not the exception, and as Section 2.6 will show, SQL databases cannot meet these availability guarantees at large scale.

The third reason why one would choose NoSQL is when the way an RDBMS structures data and the relational model do not fit the needs [68][6]. NoSQL databases are usually column-based and scale horizontally, as opposed to SQL databases, which scale vertically. In SQL the schema is fixed, but databases like HBase, Cassandra and MongoDB support an arbitrary number of columns with any name to be stored on a row. Others, like Amazon Dynamo [16], Amazon S3 [52] and memcached [56], are based on key-values and are optimized for fast retrieval by key. Neo4j [57] is well suited for graph-structured data like maps.
2.5
HBase
Storing data in files is not always advantageous and there are many scenarios in which a database, which offers more structured data access, is more profitable. With these thoughts in mind HBase was developed, a NoSQL database based on Google's BigTable paper [13]. Nicknamed The Hadoop Database, it offers good integration with MapReduce and is able to scale to billions of rows and millions of columns. It is used successfully in industry by many important companies such as Facebook, Twitter, Yahoo! and Adobe [33][9][3]. Facebook uses it to store all user private messages and chats [9][3] and has a 100 petabyte Hadoop cluster [41].
2.5.1
Data Structuring
As in RDBMS, data is organized in tables, rows and columns. The difference is that any number of columns with any names can be stored on each row. Columns on a row are independent of the columns of another row; for instance, if row a has columns x, y and z, row b may have columns m, n and o (without having any column of a). The column names are called column qualifiers. A table can have one or more column families, which group columns together. In order to identify a column, both the column family and the column qualifier need to be given, a pair which is called the column key.

HBase stores data internally as key-value pairs having three-dimensional keys with the coordinates row key, column key and timestamp. The last coordinate is a way to store multiple versions of a table value, based on the number of milliseconds since the Epoch. By default, when a new value is inserted the current time is used as the timestamp, but a custom value can be used as well. The key-value pairs are sorted first by row key, then by column key and then in decreasing order by timestamp, such that the newest versions are retrieved first.[38][25]

When a new table is created only the column families and the configuration parameters for each one need to be given, because new rows and columns can be freely created when performing inserts.
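The sort order of the key-value pairs can be sketched in a few lines. This is only a minimal in-memory model, not HBase code: the KeyValue tuple, the sort_key function and the sample cells are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class KeyValue(NamedTuple):
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int  # milliseconds since the Epoch
    value: bytes


def sort_key(kv: KeyValue):
    # Ascending by row key and column key, descending by timestamp.
    return (kv.row, kv.family, kv.qualifier, -kv.timestamp)


cells = [
    KeyValue(b"b", b"d", b"m", 1000, b"v1"),
    KeyValue(b"a", b"d", b"x", 1000, b"old"),
    KeyValue(b"a", b"d", b"x", 2000, b"new"),
    KeyValue(b"a", b"d", b"y", 1500, b"v2"),
]
ordered = sorted(cells, key=sort_key)


def latest(row: bytes, family: bytes, qualifier: bytes) -> Optional[bytes]:
    """Return the newest value of a cell: the first match in sort order."""
    for kv in ordered:
        if (kv.row, kv.family, kv.qualifier) == (row, family, qualifier):
            return kv.value
    return None


print(latest(b"a", b"d", b"x"))
```

Because timestamps are sorted in decreasing order, a reader that stops at the first matching entry obtains the newest version without inspecting the older ones.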
2.5.2
Node Roles and Data Distribution
The data range of each table, consisting of key-value pairs, is split into regions which can be stored on different nodes in the cluster, known as region servers. Clients talk directly with these servers, which are responsible for serving data reads and writes for the regions they are assigned. If a region grows beyond a threshold because of new data, the region is split and a new region is assigned to a different region server, so HBase scales automatically.[38][25]

HBase relies on HDFS for data persistence, replication and availability. Because region servers serve data reads and writes, data locality is achieved. This is because HDFS first writes data locally and then updates the replicas on other nodes in a daisy chain. In case of a region server failure, its regions are assigned to other region servers and data locality is temporarily lost. However, compactions, described in Subsection 2.5.5, reestablish data locality after a while.

A master node is responsible for region assignment. To do this, it uses ZooKeeper [44], a distributed, highly available, reliable and persistent coordination and configuration service. ZooKeeper is also necessary for client bootstrap, because it stores the contact information needed to reach the catalog regions, which are able to tell on which region server a row key is stored. Clients cache the region to region server mappings for efficient future requests.[38][25]
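How a client might map a row key to a region server from cached region boundaries can be sketched as follows. The boundaries, server names and the locate function are hypothetical, and the sketch assumes that each region covers the keys from its start key up to, but not including, the start key of the next region.

```python
import bisect

# Hypothetical cached mapping: region i covers [start_keys[i], start_keys[i+1]).
region_start_keys = [b"", b"g", b"n", b"t"]
region_servers = ["rs1", "rs2", "rs3", "rs1"]  # server hosting each region


def locate(row_key: bytes) -> str:
    """Return the region server whose region contains row_key."""
    # The last region whose start key is <= row_key is the one that holds it.
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[index]


for key in (b"apple", b"g", b"zebra"):
    print(key, "->", locate(key))
```

Since the start keys are sorted, the lookup is a binary search over the cached boundaries and does not need to contact the master.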
2.5.3
Data Mutations
Data mutations are insertions or updates to the database. HBase keeps some key-value pairs in memory, in the MemStore. In order to guarantee durability, mutations received from the client are first persisted to a log in HDFS, named the write-ahead log (WAL), and then the information is also updated in the MemStore. By doing this, no data loss occurs in case of a failure like a power outage, when the data in memory is lost. When the MemStore data grows beyond a threshold, it is persisted to disk in HDFS as an HFile. These files offer an efficient way to store data on disk because they contain an index which is used to quickly locate key-values within the file. When the HFile is completely written, the WAL can be discarded.[38][25] Mutations in HBase are atomic on a row basis.[38]
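The write path described above can be modeled with a toy class. This is a minimal sketch under simplifying assumptions: the RegionStore class and its fields are hypothetical, the log and the files are plain Python lists rather than HDFS files, and the flush threshold counts entries instead of bytes.

```python
class RegionStore:
    """Toy model of the write path: WAL first, then MemStore, then flush."""

    def __init__(self, flush_threshold: int = 3):
        self.wal = []        # stands in for the write-ahead log in HDFS
        self.memstore = {}   # in-memory key -> value
        self.hfiles = []     # each flush produces one immutable sorted file
        self.flush_threshold = flush_threshold

    def put(self, key: bytes, value: bytes) -> None:
        self.wal.append((key, value))  # 1. persist the mutation to the log
        self.memstore[key] = value     # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # Write the MemStore contents as a sorted, immutable file.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()  # the logged edits are now safely in the file

    def recover(self) -> None:
        """After a crash the MemStore is empty; replay the log to rebuild it."""
        self.memstore = dict(self.wal)


store = RegionStore(flush_threshold=3)
store.put(b"a", b"1")
store.put(b"b", b"2")
store.memstore.clear()  # simulate losing the in-memory data
store.recover()
print(store.memstore)
```

The order of the two steps in put is the important design choice: because the log entry is written before the in-memory update, any mutation that was acknowledged can be rebuilt by replaying the log.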
2.5.4
Data Retrieval
When a region server needs to retrieve a key-value for a client, it searches both the MemStore and the HFiles stored in HDFS. By requesting only particular timestamps, searching in some HFiles can be avoided, which improves performance. The MemStore is also used as a cache for recently retrieved key-values.[38][25]

Searching in memory is very efficient because keys are looked up in B+ Trees with O(log(n)) complexity. Searching in HFiles is accomplished, as stated before, by using the index from the file, which avoids the need to load the entire file in memory, which is usually impossible.[38][25]

Because each column is stored internally as a key-value, carrying all the information that identifies it along with the value, it does not matter from the point of view of space requirements where the actual data is stored. This allows users to move some data from the value to the row key or to the column qualifier, if the column needs to be indexed by that information. However, keys (row keys and column keys) should be kept as small as possible in order to keep the HFile index small.
2.5.5
House Keeping
After many mutations, multiple flushes from memory to disk occur, so a lot of HFiles are created. Retrieval performance decreases as the number of files grows. To eliminate this issue HBase executes compactions regularly, in order to reduce the number of HFiles.
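The idea behind a compaction can be sketched as a merge of several sorted files into one. The compact function and the sample files are hypothetical; as a simplifying assumption the sketch keeps only the newest version of each key, whereas a real system may retain several versions.

```python
import heapq


def compact(hfiles):
    """Merge sorted (key, timestamp, value) files, keeping the newest versions."""
    # Each input is sorted by key ascending, then by timestamp descending.
    merged = heapq.merge(*hfiles, key=lambda kv: (kv[0], -kv[1]))
    result, last_key = [], None
    for key, timestamp, value in merged:
        if key != last_key:  # the first entry seen for a key is the newest
            result.append((key, timestamp, value))
            last_key = key
    return result


hfile_1 = [(b"a", 100, b"a-old"), (b"c", 100, b"c1")]
hfile_2 = [(b"a", 200, b"a-new"), (b"b", 150, b"b1")]
compacted = compact([hfile_1, hfile_2])
print(compacted)
```

Because every input file is already sorted, the merge streams through the files once and never needs to hold all of them in memory.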
2.6
The Reasons for the Chosen Technologies
NoSQL technologies have become very popular, but many startups seem to choose them just because they are a trend. Before making a decision we analyzed our data needs in depth. This section presents our reasons for using Hadoop and for choosing HBase as our database.
2.6.1
Why Hadoop?
We require a solution capable of scaling linearly by just adding commodity hardware without additional overhead. This is exactly the reason why Hadoop was created. Scaling does not require changing the code or restructuring the data.

Code search and analysis require processing of both unstructured and structured data. We aim at building a distributed extractor which will parse source code and extract facts about it. The input source code constitutes unstructured plain text data whose processing is embarrassingly parallel and can be distributed with Hadoop by assigning groups of files to each Map task. The facts extracted from code are usually structured data which may have a graph structure. Hadoop is not usually recommended for this kind of data unless the data is structured and optimized for that particular usage scenario.

The nature of our project requires a fixed set of algorithms that are going to be run on a long-term basis. Querying data in unexpected ways is not required in our case. Our information retrieval processing requires a fixed set of steps which are rarely changed: crawling, filtering, indexing, ranking, retrieval etc. Only crawling and indexing require massive updates of the database. In our ranking algorithms reads prevail (see Chapter 4). The input data is written once during the indexing phase, but is read multiple times iteratively during ranking, with only small updates of some fields at the end of each iteration. If the data written during indexing is structured properly, Hadoop performs well when it is read repeatedly during the ranking phase.

The last reason for choosing Hadoop is its good support and well-written documentation. Being used by major actors in IT boosts its support and stability, making it a reliable solution.
2.6.2
ACID
RDBMSs generally offer ACID guarantees, an abbreviation of atomicity, consistency, isolation and durability. This subsection explains these concepts and analyzes whether they are required for our system. If not, we can drop some of them in order to gain other advantages. All these guarantees are usually linked with the concept of a transaction, which is a unit of work in a database that may involve several steps.[6]

Atomicity guarantees that a transaction either succeeds completely or fails completely [6]. If some step fails in the middle of the transaction, the system must return to the state it was in before the transaction started and declare a failure. HBase guarantees atomic row mutations [32][38], which meets our requirements for the Generalized CodeRank algorithm. We do have updates that span several rows at a time, and even several tables, during the indexing phase, but a failure which leaves one data field inconsistent with another will statistically have an insignificant impact on our system. Besides that, such indexing errors are easy to recover from without data loss.

Consistency in the ACID sense differs from the same concept found in distributed systems, which will be discussed in the next subsection. Here consistency guarantees that a transaction will bring the system from one valid state to another [6]. This means that if, after the transaction, some constraints, triggers or cascades are not valid, the transaction must be rolled back and the system must return to its original state. So, consistency deals with the logical conditions of the system, as opposed to atomicity, which deals with failures and errors. HBase consistency guarantees are linked to its atomic row mutations feature. Retrieval of a row will return a complete image of that row that existed at some point in history [32]. Additionally, time travel or updates from the past are not possible.
HBase does not come with any other consistency guarantees in the ACID sense, but developers are free to implement this logic in their application, either on the client side or on the server side through an HBase feature called coprocessors. This could come with some performance penalties, especially when it is implemented on the client side. However, for our application ACID consistency is not required. Logical constraints can be invalidated only through programming errors, and there is no reason to sacrifice performance for constraint checking if violations are not very likely to occur.

Isolation ensures that a transaction will not influence other concurrent transactions, i.e., transactions are independent of each other [6]. HBase offers atomic row mutations [38] and as a consequence isolation is guaranteed at the same granularity [32]. It is not very likely that a higher isolation guarantee is required for our applications. We are not planning to run concurrent algorithms that require atomic operations on multiple rows or tables. We plan to run read queries or distributed, non-concurrent MapReduce algorithms in batch jobs. Most of our usage patterns will consist of reads. Isolation violations can only occur due to human or programming errors, which are to be expected in any system.

Durability guarantees that when a transaction is reported as successful its data mutations have already been persisted, such that in case of a system failure (like a power outage) there are no data losses [6]. HBase aims at offering complete durability through the WAL by ensuring that a mutation is not reported as successful until writing to the log has finished. However, there are still issues with this feature and at the time of this writing the only guarantee is that the data has been flushed to the operating system buffer [42]. If an outage occurs before the buffer is flushed to disk the data is lost. This is not an HBase issue, but an HDFS one, which has recently been solved [36], but its integration into HBase is pending [21]. It is very likely that the next HBase version will support full durability. However, small data losses from an unflushed OS buffer are not critical for our applications. Our data is usually obtained from crawling or derived from other data through processing, thus it can be easily recovered.
2.6.3
CAP Theorem and PACELC
Eric Brewer conceived the CAP principles in 2000, CAP being an abbreviation of consistency, availability and partition-tolerance. In 2002, Seth Gilbert and Nancy Lynch formalized them into a theorem [40] and since then it has become a fundamental model for describing distributed databases.

CAP Theorem: It is impossible for a distributed system to have all three qualities of consistency, availability and partition-tolerance at the same time.[20][40]

As stated in the previous section, consistency has a different meaning in the context of distributed systems than in ACID. Actually, this is the meaning usually intended when the term is used. Consistency in the distributed systems sense subsumes the atomic and consistent meanings from the ACID concepts [40], so it may be defined as atomic consistency.

Consistency in a distributed system (or atomic consistency) guarantees that any observer will always see the latest version of the data no matter which replica is read.[6][40][20]

Availability ensures that the system will continue to work as a whole even when a node fails, i.e., a response is always received for a request.[40][20]

Partition-tolerance requires a distributed system to continue to operate even if arbitrarily many messages between two nodes are lost, when a partition appears in the network.

A model that better describes the CAP Theorem, named PACELC, was proposed by Daniel Abadi [2]. Each letter of the abbreviation appears as a capital letter in the following scheme:

if Partition: trade between Availability and Consistency
Else: trade between low Latency and Consistency

The PACELC model explains that in case of a network partition (the P from PACELC) a system needs to trade between availability (the A) and consistency (the C). Else (the E), the system must decide whether providing low latency (the L) or stronger consistency (the C) is more important. As stated, atomic consistency covers both the atomicity and consistency terms from ACID.
Thus, HBase guarantees the consistency condition in distributed systems terms. Availability is weakened in the sense that when a region server fails it takes some time until the master reassigns its regions and the newly assigned region servers replay the failed server's log (WAL). By default it takes up to three minutes for ZooKeeper to figure out that a region server has failed. This can also have implications on latency and data locality if the data from one of the remaining replicas is not located on the same machine as the newly allocated region server. But this problem is solved when compactions are performed.

The tradeoff made by HBase at the expense of availability is not a big concern for Distributed Sourcerer because the database is not designed to be used by critical real-time applications like code search. The database is mostly used by algorithms, which can be programmed to cope with this kind of situation by waiting for the region to be recovered.
In PACELC terms, HBase can be characterized as a PC/EC system: in case of a network partition it prefers keeping consistency and weakening availability to some extent, and during normal operation writes have a larger latency because of the consistency requirements of the underlying HDFS implementation. Latency also suffers when regions are reassigned, but only temporarily. However, the consistency is weaker than in RDBMS, such that latency is kept within a controllable range.
2.6.4
SQL vs. NoSQL
By applying the CAP Theorem [40] it is now obvious why SQL and RDBMS do not scale for big data. Ensuring the ACID constraints guarantees atomic consistency, which is a strong consistency requirement and in PACELC terms translates to a PC/EC distributed system. This sacrifices availability and latency for the benefit of consistency. As the system grows, the latency also gets larger and parts of the data become unavailable due to failures, which are normal in a commodity hardware cluster.

As described in the previous section, HBase is also a PC/EC system, but with more relaxed consistency requirements. Only row mutations are atomic; there are no transactions and no constraints between columns. By giving up joins, data denormalization and duplication are encouraged, such that only one big table is queried, reducing the overhead. However, this imposes limitations in scenarios where a relational model is more appropriate.

Another problem with SQL databases is the algorithms they use. Most of them use B+ Trees to store indexes [38], which offer good performance, with O(log(n)) complexity for reads, updates and inserts. But as the database size grows, more updates are performed and the B+ Trees get imbalanced. Rebalancing is a very expensive operation which can significantly slow down the database. On the other hand, HBase uses a more appropriate design for big data by storing the B+ Trees in the MemStore for recently accessed key-values and by using an index for HFiles, which are stored on disk in HDFS [38]. An overhead occurs during compactions, but those are performed in two different stages, which lowers the impact on performance.

SQL databases are able to run on a cluster in a distributed way, but scaling them involves a big operational overhead [38]. HBase scales automatically without human intervention. When a region grows beyond a limit it is automatically split into two regions, as described in Section 2.5.
2.7
Summary
We saw in this chapter that HBase, by offering atomic row mutations, guarantees enough consistency for our usage requirements. Read latency is kept low as the system grows, ensuring good performance for MapReduce. The availability at scale is much better than what SQL can offer, and partial outages are controllable and predictable (they do not last longer than 3 minutes by default). Apart from the unflushed operating system buffer case discussed in Subsection 2.6.2, no data losses can occur in case of hardware failures, because mutations are always persisted to the log first. Since all these HBase advantages fit our needs and none of the disadvantages is a concern for our applications, we decided to use HBase to reimplement SourcererDB into what we call SourcererDDB.
Chapter 3
Database Schema Design and Querying

This chapter presents the motivations behind the schema design decisions for the HBase database used in Distributed Sourcerer. The former database, based on MySQL, is also presented for comparison, highlighting the differences.
3.1
Database Purpose
As described in Section 1.2, Sourcerer uses a database to store information about code entities and the relations between them, as well as information about projects and files. The extractor parses Java source files, JARs and class files from the repository in order to extract this information, which is described by using the following models [62]:

Projects: The biggest division in a repository is a project, which consists of a set of files that comprise the same system and are typically developed by the same team in the same company or organization. For each project there is a database entry which stores metadata fields like project name, description, type, version and path within the repository.

Files: The repository stores Java source files (with .java extension), JAR (Java archive) files (with .jar extension) and Java class files (with .class extension), which are compiled bytecode files contained within the JAR files. Class files not packed into JARs are ignored by Sourcerer. For each file, metadata fields like path, file type, file hash and the ID of the project that contains it are stored in the database.

Entities: The smallest metamodel divisions extracted from code are represented by entities such as methods, variables, classes, interfaces, packages etc.

Relations: The relationships between entities are modeled by relations, such as a calling relationship between two methods, an inheritance relationship between two classes or a containment relationship between a class and a method.

Various algorithms can be built on top of the infrastructure, using as input the file structure of the projects and the relations between code entities. Chapter 4 describes such an algorithm for computing CodeRank, a metric used to rank code entities by their popularity, similar to the way Google's PageRank is used to rank web pages by popularity. The relations between code entities are used as input to compute the CodeRank of each entity in the database.
The database API can be used to search for projects, files, entities and relations matching several criteria. For example, an application may need to retrieve all methods called from a particular class instance, in a particular project. This chapter describes how the HBase database was designed in order to facilitate searching by several matching criteria and how querying is performed on this schema design.
3.2
Design Principles
All schema design decisions were made such that the processing time of database operations is minimized. The most important factor considered was reading time, because the algorithms that work with the database perform faster with low latency and high throughput when reading their input. Usually only a small number of small fields are updated in the database. The most complex writing process takes place at the beginning, when the database is populated, but after this stage most operations are reads accompanied by some updates.

Some algorithms require reading large amounts of data repeatedly. For instance, CodeRank runs iteratively until convergence is reached, so it must repeatedly read all relations from the database. In this kind of situation, loading large batches of relations into memory with high throughput and low latency is vital. On the other hand, in the case of CodeRank a smaller amount of information needs to be written into the database. After each iteration the current CodeRank (a double-precision floating point value) must be written for each entity. The number of entities is much smaller than the number of relations.

There is no best design that performs well in all situations, so compromises need to be made to optimize performance for some particular scenarios. These scenarios were chosen by studying all the MySQL SELECT queries used in Sourcerer as well as the data requirements for computing CodeRank for the entities.

As described in Chapter 2, NoSQL schema design principles for databases such as HBase differ substantially from their relational counterparts. Because join operations are not natively supported and an arbitrary number of columns with arbitrary names can be used for a row, normalization is not required. On the contrary, according to the DDI (Denormalization, Duplication, Intelligent keys) principle [19], denormalization should be used instead.
By using this principle fewer reads are needed to retrieve the data because all the required columns can be stored on the same row, not on different rows from different tables as in normalized relational data. Denormalization is often used together with duplication if the required data must be retrieved by different matching criteria. In this way no secondary indexes need to be created as in SQL databases. In HBase data is sorted by keys, so an intelligent key design must be chosen such that the most common search criteria are optimized. Additionally, as discussed in Chapter 2, because data is stored in HFiles as KeyValues, it makes no difference for storage requirements whether data is stored in the key part or in the value part.
3.3
Projects Data
Project metadata is stored in HBase in a similar way to MySQL. The main difference, detailed in the following sections, lies in the way the row key was designed. The project types defined in Sourcerer are described in Table A.1 [60][62][59].
3.3.1
Former SQL Database
The original MySQL database used in Sourcerer has the columns described in Table 3.1 [60][62][59]. Most of the columns can have a null value, thus are optional, and important columns like project_id and project_type are indexed for fast retrieval in O(log n).

Table 3.1: Columns for projects MySQL table

Column        Is Indexed  Null  Description
project_id    yes         no    Numerical unique ID of the project.
project_type  yes         no    Type of the project.
name          yes         no    Name of the project from the original repository.
description   no          no    An optional human readable project description.
version       no          yes   Version number for MAVEN projects.
groop         yes         yes   Group for MAVEN projects.
path          no          yes   Project path within the repository.
hash          yes         yes   Project MD5 hash for JAR and MAVEN projects.
has_source    yes         no    Whether the project has or does not have source files.
An additional SQL table named project_metrics stores metrics for projects, like the number of lines of code and the number of non-whitespace lines of code. Each row contains the project ID, the metric type and the metric value. Thus, a join by project_id is required in order to obtain the metric values for a project.
3.3.2
Functional Requirements
The distributed database should be able to quickly retrieve a project by its ID. As will be seen in the next sections, files, entities and relations are attached to a project by referring to its ID. When more information about a project is required, the project can be looked up by its ID.
    SELECT project_id, project_type, path, hash FROM projects
    WHERE project_type = ?

Listing 3.1: SQL Query used to retrieve projects by their type

There are a few methods implemented in Sourcerer that retrieve information about projects by their type by using SQL queries like the one from Listing 3.1. The new database needs to provide an efficient way to retrieve project entries by their type.
3.3.3
Schema Design
Each project must be uniquely identified by an ID. An MD5 hash can be used to generate such an ID. Some of the metadata fields used to describe a project can be hashed to generate the unique MD5. For JAVA_LIBRARY and CRAWLED projects the path within the repository is used as the hash input, since every project has a unique path. Other types of projects do not have this field, so different fields are used to generate a unique ID. For JAR and MAVEN projects the hash field is used. For the two SYSTEM projects the ID is a 16-byte array containing the ASCII string primitives or unknowns respectively, right-padded with null bytes.

Table 3.2: projects HBase Table

Row Key                 <projectType><projectID>
Default Column Family   name, description, version, groop, path, hasSource
Metrics Column Family   linesOfCode, nonWhitespaceLinesOfCode
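The ID generation rules above can be sketched as follows. The function name, its parameters and the sample values are hypothetical. The sketch also assumes that the hash field of JAR and MAVEN projects is a hex-encoded MD5 that is reused directly as the ID; the text only states that the hash field is used, so this detail is an assumption.

```python
import hashlib

ID_LENGTH = 16  # an MD5 digest is 16 bytes long


def make_project_id(project_type: str, path: str = "",
                    file_hash: str = "", system_name: str = "") -> bytes:
    """Build a 16-byte project ID according to the project type."""
    if project_type in ("JAVA_LIBRARY", "CRAWLED"):
        # The repository path is unique, so it is used as the hash input.
        return hashlib.md5(path.encode("utf-8")).digest()
    if project_type in ("JAR", "MAVEN"):
        # Assumption: the hash field is a hex MD5 reused directly as the ID.
        return bytes.fromhex(file_hash)
    if project_type == "SYSTEM":
        # ASCII name right-padded with null bytes up to 16 bytes.
        return system_name.encode("ascii").ljust(ID_LENGTH, b"\x00")
    raise ValueError(f"unknown project type: {project_type}")


print(make_project_id("SYSTEM", system_name="primitives"))
print(make_project_id("CRAWLED", path="/repo/example/project").hex())
```

All three branches produce IDs of the same length, which keeps the row key layout fixed regardless of the project type.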
Project metadata is stored in the projects HBase table described in Table 3.2. Each project entry is assigned to one row, creating a tall-narrow table with a lot of rows and just a few columns, which have the same meaning as the SQL columns in Table 3.1. Some of these fields can be stored in the key part to achieve efficient retrieval. All the other columns, which are homologous to the ones from the SQL schema, are grouped together in the default column family. Because an arbitrary number of columns can be stored on each row there is no need to store null values, so only those metadata fields that are available are set as HBase columns. Another column family, named metrics, is used to store any metric defined for the project of that row. Currently Sourcerer only uses two metrics, but more metrics can be added at no cost in the future.

The main question that arises is how to design the row key for efficient retrieval by both project ID and project type. If the project ID is used as the row key, any project can be efficiently retrieved by its ID with a get operation. Using a hash function for all project IDs, except for the two SYSTEM projects, causes project rows to be randomly distributed across regions, no matter what type they have. So row scans cannot be used to efficiently retrieve projects by type if the project ID is used as the row key. Filtering only the project rows that have a particular type is very inefficient because it requires scanning the whole projects table.

The project type can be encoded as a single byte and placed in the row key before the 16-byte project ID hash, as described in Table 3.2. In this way data locality is achieved and all project entries with a particular type can be retrieved by using row scans. No project type seems to appear much more often than all the others in the dataset, so region hotspotting [38] shouldn't be a problem.
The issue with this approach is that project entries can no longer be retrieved by their ID without knowing the type in advance. If the type is not known, a get operation can be tried for each project type with the given ID. All these requests can be served in parallel, and for a big dataset they will be served by different regions, exploiting the distributed nature of HBase. Additionally, the number of types to be tried is very small. There are very few projects of JAVA_LIBRARY type and only two projects of type SYSTEM, so the number of projects of these types can be neglected. Most of the projects have type CRAWLED or JAR and some of them have type MAVEN. So basically there are only three project types to be tried, making this approach very efficient.
3.3.4
Querying
As described in Subsection 3.3.3, efficient retrieval of project entries is possible when the project type is known. All projects of a particular type can be retrieved with a row scan: the start row is set to the 1-byte project type and the stop row is the same byte incremented by 1. A particular project entry can be retrieved with a get operation on a row key that has the project type as the first byte and the project ID as the remaining bytes. If the project type is not known, the previously described technique of trying all types can be applied. As described, this does not incur a serious performance penalty. Querying by any other criterion, such as path, is not efficient with this schema design. It can be done with value filters, but it requires scanning the whole table, which can take a long time.
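The start and stop rows of such a type scan can be illustrated with a small simulation. The sketch below does not use the HBase client API; it assumes only that row keys are kept in lexicographic byte order, and the helper names are hypothetical.

```python
from bisect import bisect_left


def type_scan_range(type_byte: int) -> tuple:
    """Start row (inclusive) and stop row (exclusive) covering one project type."""
    if not 0 <= type_byte < 255:
        raise ValueError("type byte must leave room for the incremented stop row")
    return bytes([type_byte]), bytes([type_byte + 1])


def scan(sorted_keys: list, start: bytes, stop: bytes) -> list:
    """Simulate a row scan over lexicographically sorted row keys."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]
```

Because every key of a given type shares the same first byte, the half-open range [type, type + 1) selects exactly the rows of that type.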
3.4
Files Data
For files, as for projects, the database only stores metadata. That is why the HBase schema design in this case is also similar to the SQL one. The file types defined in Sourcerer are described in Table A.2 [60][62][59].
3.4.1
Former SQL Database
There are two MySQL tables with file information. One of them, the files table, has its columns described in Table 3.3 and stores metadata [60][62][59]. The other one, named file_metrics, stores metrics related to files in a similar manner to project_metrics. Currently the same two metrics are used: the number of lines of code and the number of lines of code excluding whitespace lines.

Table 3.3: Columns for files MySQL table

Column       Is Indexed   Null   Description
file_id      yes          no     Numerical unique ID of the file.
file_type    yes          no     Type of the file.
name         yes          no     Name of the file.
path         no           yes    File path within the repository.
hash         yes          yes    File MD5 hash for JAR files.
project_id   yes          no     ID of the project that contains this file.
3.4.2
As reflected in the next sections, entities and relations can refer to the ID of the file they belong to. It should be possible to retrieve file entries, which contain metadata about files, by their unique ID as well as by their type or by the ID of the project they belong to. Different combinations of those three criteria should be considered.
3.4.3
Schema Design
Each file from the repository must be uniquely identified by an ID, which is obtained by using an MD5 hash. For JAR files the name field is hashed and for other file types the path field is hashed, resulting in a unique ID for each file entry. For more information about file metadata fields see the SQL columns of the former database in Table 3.3. As in the case of projects, it is necessary to store both metadata and metrics in the database. A similar HBase schema can be used by storing one file entry on each row of the files HBase table, described in Table 3.4. The default column family contains the same information as the SQL columns described in Table 3.3, except for some metadata fields which are moved to the key part for efficient retrieval. The metrics column family stores file metrics in the same manner as in the projects HBase table.

Table 3.4: files HBase Table

Row Key                   <projectID><fileType><fileID>
Default Column Family     name, path, hash
Metrics Column Family     linesOfCode, nonWhitespaceLinesOfCode
Entities Column Family    <entityType><fqn>
Relations Column Family   <relationKind><targetEntityID><sourceEntityID>
After defining the column families for file data and the column keys used, the remaining challenge was to design the row key for efficient retrieval by file ID, file type and project ID. All three fields are placed in the row key and encoded as 33 bytes. The first 16 bytes represent the project ID, the next byte encodes the file type and the last 16 bytes represent the file ID, as illustrated in Table 3.4. For efficient retrieval of a file entry, both the file ID and the ID of the project it belongs to need to be known in advance. Knowing the file type is less important, since all three types can be tried without sacrificing much performance. A similar approach was described in Subsection 3.3.3 for trying all project types to retrieve a project entry. In the case of files there are even fewer types to try: only three.
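The 33-byte layout can be made concrete with a minimal encoding and decoding sketch. The function names are hypothetical and the file type is treated as an opaque byte value, because the text does not list the numeric codes.

```python
def file_row_key(project_id: bytes, file_type: int, file_id: bytes) -> bytes:
    """Build the row key <projectID (16)><fileType (1)><fileID (16)>."""
    if len(project_id) != 16 or len(file_id) != 16:
        raise ValueError("project ID and file ID must be 16-byte hashes")
    if not 0 <= file_type <= 255:
        raise ValueError("file type must fit in one byte")
    return project_id + bytes([file_type]) + file_id


def parse_file_row_key(row_key: bytes) -> tuple:
    """Split a 33-byte row key back into (project ID, file type, file ID)."""
    if len(row_key) != 33:
        raise ValueError("row key must be exactly 33 bytes")
    return row_key[:16], row_key[16], row_key[17:]
```

Placing the project ID first is what later allows all files of one project to be read with a single prefix scan.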
3.4.4
Querying
As discussed in the previous section, efficient retrieval of a file entry is achieved when querying HBase by both file ID and project ID. Also knowing the file type would not bring a substantial performance improvement. Knowing the project ID is not a problem for the current design of the database because, as can be seen in the next sections, whenever a file ID is stored for an entity or relation the project ID is kept as well. However, in case the project ID is not known, it is still possible to retrieve a file entry by using a filter. A custom row filter has been implemented which passes all rows whose row key suffix (the last 16 bytes) contains the file ID. This retrieval approach is not optimal since it requires scanning the whole table, but at least it makes the scenario possible.

Retrieving all files from a project is possible by scanning all rows that begin with the 16 bytes of the project ID. If an additional byte representing the file type is appended, only files of that type are retrieved from the project. If all files of a particular type must be retrieved from the whole repository, the entire table must be scanned, and a custom row filter can be used which passes only rows whose 17th byte is set to the value corresponding to the file type. Querying by other fields can be achieved inefficiently by using column value filters and scanning the whole table, which can take some time for big datasets.
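The two custom row filters mentioned above reduce to simple byte comparisons on the row key. The following sketch only mimics their matching logic in plain Python; it is not the filter implementation from the project and the function names are hypothetical.

```python
def has_file_id(row_key: bytes, file_id: bytes) -> bool:
    """Suffix match: the last 16 bytes of a files row key hold the file ID."""
    return len(row_key) == 33 and row_key[-16:] == file_id


def has_file_type(row_key: bytes, file_type: int) -> bool:
    """The 17th byte (index 16) of a files row key holds the file type."""
    return len(row_key) == 33 and row_key[16] == file_type


def full_scan(row_keys, predicate) -> list:
    """Simulate a full table scan that keeps only rows accepted by the filter."""
    return [key for key in row_keys if predicate(key)]
```

Both predicates have to be evaluated on every row, which is why these queries cost a full table scan.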
3.5
Entities Data
Code entities are described by many information fields. In order to achieve efficient retrieval by several fields, the duplication design principle [19], described in Section 3.2, is applied. Thus, entities data is stored redundantly in multiple HBase tables. The entity types available in Sourcerer are described in Table A.3 [60][62][59].
3.5.1
Former SQL Database
As in the case of projects and files, two MySQL tables are used to store entity information. General information used to describe entities is placed in the entities table [60][62][59]. Metric information is stored in entity_metrics in the same way as for projects and files. Table 3.5 describes the columns used in the entities SQL table.
3.5.2
The schema design for the entities HBase tables should provide efficient retrieval by the following data fields:
- FQN (Fully-Qualified Name)
- entity type
- project ID
- file ID

These requirements are found in the Sourcerer API to the SQL database, where many SQL queries select rows by these criteria. For example, Listing 3.2 shows three queries extracted from Sourcerer's code. All of them filter results by entity type, marking this field as very important. One of the queries searches for entity entries that have a particular FQN prefix, so partial FQNs should be a search criterion, not only exact FQNs. The other two queries search by project ID and file ID respectively.

Table 3.5: Columns for entities MySQL table

Column        Is Indexed   Null   Description
entity_id     yes          no     Numerical unique ID of the entity.
entity_type   yes          no     Type of the entity.
fqn           yes          yes    FQN (Fully-Qualified Name) of the entity.
modifiers     no           yes    Java modifiers for entity types that are allowed to have them.
multi         no           yes    Multipurpose column for additional information.
project_id    yes          no     ID of the project that contains the entity.
file_id       yes          yes    ID of the file that contains the entity.
offset        no           yes    Byte offset of the entity in the source file.
length        no           yes    Byte length of the entity in the source file.
    -- Retrieval by FQN prefix and filtering by entity type:
    SELECT entity_id, entity_type, fqn, project_id
    FROM entities
    WHERE fqn LIKE '${PREFIX}%'
      AND entity_type NOT IN ('PARAMETER', 'LOCAL_VARIABLE')

    -- Retrieval by project ID and filtering by entity type:
    SELECT entity_id, entity_type, fqn, project_id
    FROM entities
    WHERE project_id = ?
      AND entity_type IN ('ARRAY', 'WILD_CARD', 'TYPE_VARIABLE',
                          'PARAMETRIZED_TYPE', 'DUPLICATE')

    -- Retrieval by file ID and filtering by entity type:
    SELECT entity_id, entity_type, fqn, project_id
    FROM entities
    WHERE file_id = ?
      AND entity_type IN ('CLASS', 'INTERFACE', 'ANNOTATION', 'ENUM')

Listing 3.2: SQL queries used to retrieve entities data

The former SQL database uses secondary indexes for all four of these fields, confirming their importance (see Table 3.5).
3.5.3
Schema Design
Entities data is stored redundantly in three HBase tables by applying the duplication design principle [19], thus ensuring efficient retrieval by several criteria. Each entity is uniquely identified by an MD5 hash ID, calculated from the following fields described in Table 3.5: entity type, FQN, modifiers, multi, project ID, file ID, offset and length.

The entities_hash HBase table, described in Table 3.6, stores entity data by entity ID. It does this by storing the unique ID as the row key and the other fields as columns in the default column family. Entity IDs are used in relations data, so this table is useful when more information about an entity must be retrieved.

Table 3.6: entities_hash HBase Table

Row Key                   <entityID>
Default Column Family     entityType, fqn, modifiers, multi, projectID, fileID, fileType, offset, length
Metrics Column Family     linesOfCode, nonWhitespaceLinesOfCode
Relations Column Family   sourceEntityType, codeRank, targetEntitiesCount, targetEntities, relationIDs
To achieve efficient retrieval by the four fields mentioned in the previous section, i.e. FQN, entity type, file ID and project ID, they need to be stored in the key part of two of the tables which store entities data, whether this key part is the row key or the column qualifier. The remaining fields, which are not stored in the key part, are serialized in the value part. For scenarios in which searching by project ID or file ID is required, entities data is stored in the files HBase table, previously described in Table 3.4 and Subsection 3.4.3. For searching entities by FQN or FQN prefix a special table named entities is used (see Table 3.7).

Table 3.7: entities HBase Table

Row Key                 <fqn>0x00<projectID><fileID>
Default Column Family   <entityType>
By using the row key design of the files table, entities can be efficiently searched by project ID, file type and file ID. The entities column family is used to separate entities data from file metadata, which is stored in the default and metrics column families as described in Table 3.4. The entity type and FQN fields are placed, in this order, in the column qualifiers of the entities column family.

The one-byte entity type and the 16-byte MD5 hashes for file ID and project ID require exact matching when performing a search. For FQN, however, it must be possible to search for all entities that have a particular FQN prefix. The most efficient way to do this is to put this field at the beginning of the row keys of the entities table and to perform a scan by the required FQN prefix. After the FQN field the row key includes a null byte, which is useful for exact FQN matches. For example, assume we need to find the entity whose exact FQN is java.lang. If we perform a scan only by this string, other entities with different FQNs but the same prefix will be returned, such as java.lang.Object, java.lang.String etc. But by appending the null byte to the scan start row string, i.e. "java.lang\0", only the exact FQN is matched. The next fields in the row key are the MD5 hashes of the project ID and file ID (see Table 3.7), which help to narrow results by these two other criteria.

All three queries from Listing 3.2 narrow the results by including or excluding entities of particular types. Placing the entity type byte in the column qualifiers makes this filtering possible when searching by FQN or FQN prefix. All the columns associated with entity types are placed in the default column family.
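The effect of the null byte separator can be checked with a small simulation of a prefix scan over sorted row keys. This is a sketch in plain Python rather than HBase client code; the helper names and the placeholder IDs in the usage note are hypothetical.

```python
from bisect import bisect_left


def entity_row_key(fqn: str, project_id: bytes, file_id: bytes) -> bytes:
    """Build the row key <fqn>0x00<projectID><fileID> of the entities table."""
    return fqn.encode("utf-8") + b"\x00" + project_id + file_id


def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest byte string that sorts after every key starting with prefix."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        raise ValueError("no finite stop row exists for this prefix")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_scan(sorted_keys: list, prefix: bytes) -> list:
    """Return all keys that start with prefix, assuming byte-ordered keys."""
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_left(sorted_keys, prefix_stop_row(prefix))
    return sorted_keys[lo:hi]
```

Scanning with the prefix b"java.lang" returns every entity in that namespace, while b"java.lang\x00" matches only keys whose FQN ends right there, because the bytes that follow the FQN in any longer name (such as ".") are greater than the null byte.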
3.5.4
Querying
As mentioned in the previous section, retrieving an entity entry by its ID is performed on the entities_hash table. If entities need to be searched by several criteria, the other two tables can be used, i.e. the files table and the entities table.

When the FQN or an FQN prefix is known, the entities table should be used. The following scenarios cover the use cases for this table:
- the FQN or an FQN prefix is known
- both the exact FQN and the project ID are known
- the exact FQN, project ID and file ID are known

Since these three fields are placed in the row key, the scenarios are implemented with row scans or get operations. In the first case the FQN is used as the start row, and in the second the null byte and the project ID are appended to it. In the third case a get operation can be performed, because adding the file ID makes the whole row key known. Column qualifiers hold entity types, so requesting only some columns narrows the results to the corresponding entity types.

When neither the FQN nor an FQN prefix is known, operations on the entities column family of the files table should be performed. The use cases of this table are usually covered when the project ID or file ID is known. Similar scenarios were presented in Subsection 3.4.4 for searching files. Here the same matching requirements apply, but to entities instead of files, so the entities column family is used. Narrowing results to a specific entity type can be implemented with a column qualifier prefix filter, which passes only those columns that have the required value as the first byte.
3.6
Relations Data
Storing relations data in HBase relies on the same duplication design principle in order to achieve efficient retrieval for the desired scenarios. Data is stored redundantly in multiple HBase tables, each one used for a particular scenario. The relation types defined in Sourcerer are described in Table A.4 [60][62][59]. Besides its type, a relation also has a class, which defines the location of the target entity, as described in Table A.5 [60][62][59].
3.6.1
Former SQL Database
There is only one MySQL table that stores relations data, named relations. Table 3.8 describes its columns [60][62][59].
3.6.2
Retrieval of relation entries should be optimized for the following fields, explained in Table 3.8:
- source entity ID
- target entity ID
- relation type and relation class
- project ID
- file ID

Table 3.8: Columns for relations MySQL table

Column           Is Indexed   Null   Description
relation_id      yes          no     Numerical unique ID of the relation.
relation_type    yes          no     Type of the relation.
relation_class   no           no     Class of the relation.
lhs_eid          yes          no     ID for the source entity of the relation.
rhs_eid          yes          no     ID for the target entity of the relation.
project_id       yes          no     ID of the project that contains the relation.
file_id          yes          yes    ID of the file that contains the relation.
offset           no           yes    Byte offset of the relation in the source file.
length           no           yes    Byte length of the relation in the source file.

All these requirements can be found in SQL queries from the Sourcerer source code, except for the source entity ID. Two of those SQL queries are shown in Listing 3.3. Optimizations for the source entity ID field were considered for practical reasons and because its column in the relations MySQL table is indexed.
    -- Retrieve relations by target entity ID and type:
    SELECT project_id FROM relations
    WHERE rhs_eid = ? AND relation_type IN ?

    -- Retrieve relations by project and type:
    SELECT * FROM relations
    WHERE project_id = ? AND relation_type IN ?

Listing 3.3: SQL queries used to retrieve relations data
3.6.3
Schema Design
The duplication design principle [19] has been applied in order to achieve efficient retrieval by several criteria. Thus, relations data is stored redundantly in multiple HBase tables. Depending on the application, some tables may not need to be implemented. For example, there is a table named relations_hash, described in Table 3.9, which stores relations data by relation ID, similar to entities_hash. Currently no feature or algorithm uses it, so it may be dropped in the future if it is not required.

Table 3.9: relations_hash HBase Table

Row Key                 <relationID>
Default Column Family   relationKind, sourceEntityID, sourceEntityType, targetEntityID, targetEntityType, projectID, fileID, fileType, offset, length
The fields by which retrieval should be optimized are stored in the key part of the tables relations_direct (see Table 3.10), relations_inverse (see Table 3.11) and files (see Table 3.4). The remaining fields, i.e. offset and length, are serialized in the value part of those tables.
Table 3.10: relations_direct HBase Table

Row Key                 <sourceEntityID><relationKind><targetEntityID>
Default Column Family   <projectID><fileID>

Table 3.11: relations_inverse HBase Table

Row Key                 <targetEntityID><relationKind><sourceEntityID>
Default Column Family   <projectID><fileID>
Relation type and relation class are combined in a single byte in the HBase tables, resulting in a field named relation kind. The three most significant bits are used for the relation class and the remaining 5 bits for the relation type. Relation IDs are calculated by applying an MD5 hash to the following fields: relation kind, source entity ID, target entity ID, project ID, file ID, offset and length.

Efficient retrieval by source entity ID is achieved by querying the relations_direct HBase table, described in Table 3.10, whose row key contains the following fields in this order: source entity ID (16 bytes), relation kind (1 byte) and target entity ID (16 bytes). All relations with a particular source entity ID, and optionally with a particular relation kind, can be retrieved through row scanning. The same principles apply to the relations_inverse table, described in Table 3.11, which is optimized for retrieval by target entity ID and has the following fields in its row key in this order: target entity ID, relation kind, source entity ID. Here scanning can be performed by target entity ID or by both target entity ID and relation kind.

The relations_direct and relations_inverse tables have the same column key design. They have a default column family whose column qualifiers contain the project ID and the file ID in this order. By selecting only specific columns or by using column qualifier filters, the results can be narrowed to the relations that are part of a particular project or source file.

Relations data is also stored in the relations column family of the files HBase table, described in Table 3.4, similar to entities data in the entities column family or to file data in the default column family. The row key design of this table allows efficient retrieval by project ID and file ID. The other important relation fields are stored in the column qualifiers in the following order: relation kind, target entity ID and source entity ID. Narrowing results by matching these fields is achieved by selecting particular columns or by using column qualifier filters.

The Generalized CodeRank algorithm described in Chapter 4 gets relations information from the relations column family of the entities_hash table (see Table 3.6). The row key represents the source entity ID. The target entity IDs of the relations that have this source entity ID, as well as their relation kinds, are serialized in the targetEntities column. The current CodeRank of the source entity identified by the row key is stored in the codeRank column.
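The relation kind byte and the two mirrored row keys can be sketched as follows. The numeric values used for relation classes and types are placeholders, since the actual codes come from Tables A.4 and A.5; the function names are hypothetical.

```python
def pack_relation_kind(relation_class: int, relation_type: int) -> int:
    """Pack the class into the 3 most significant bits and the type into the low 5 bits."""
    if not 0 <= relation_class < 8:
        raise ValueError("relation class must fit in 3 bits")
    if not 0 <= relation_type < 32:
        raise ValueError("relation type must fit in 5 bits")
    return (relation_class << 5) | relation_type


def unpack_relation_kind(kind: int) -> tuple:
    """Recover (relation class, relation type) from a relation kind byte."""
    return kind >> 5, kind & 0x1F


def relations_direct_key(source_id: bytes, kind: int, target_id: bytes) -> bytes:
    """Row key <sourceEntityID><relationKind><targetEntityID> (Table 3.10)."""
    return source_id + bytes([kind]) + target_id


def relations_inverse_key(target_id: bytes, kind: int, source_id: bytes) -> bytes:
    """Row key <targetEntityID><relationKind><sourceEntityID> (Table 3.11)."""
    return target_id + bytes([kind]) + source_id
```

With this layout a scan prefix of 16 bytes selects all relations of one entity, and a prefix of 17 bytes additionally fixes the relation kind.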
3.6.4
Querying
A relation entry can be retrieved by its ID from the relations_hash table, which stores the ID in the row key. If efficient retrieval by source entity ID and relation kind, or just by source entity ID, is desired, the relations_direct table should be used. If the target entity ID needs to be matched instead of the source entity ID, relations_inverse should be used. These two tables have the same column qualifier design. Requesting specific columns retrieves only those relations that are included in the files and projects identified by the columns. If only a particular project is required and all files from the project need to be included, a column qualifier prefix filter can be used which only matches the first 16 bytes, i.e. the project ID MD5 hash. A custom filter has been implemented which can match the last 16 bytes of the qualifier if filtering by file ID is required.

Relations can be efficiently retrieved by project ID, file type and file ID from the relations column family of the files table. If all or some of those fields need to be matched, the same approach as the one used to retrieve files or entities from this table must be applied. Further narrowing of the results can be performed by selecting specific columns or by using column qualifier filters. For instance, the first byte can be matched to select a specific relation kind. Bytes 2 to 17 can be matched to narrow results to a particular target entity ID, and finally the last 16 bytes can be matched for a specific source entity ID (see Table 3.4).

The CodeRank algorithm described in Chapter 4 queries relations data from the relations column family of the entities_hash table, which provides efficient retrieval by source entity ID. The current CodeRank of an entity is also stored there.
3.7
Dangling Entities Cache
The Generalized CodeRank algorithm described in Chapter 4 needs a fast way to retrieve all dangling entities without scanning the whole entities_hash table and checking whether each entity is dangling. Dangling entities and their CodeRanks can be obtained by querying the dangling entities cache. The cache needs to be rewritten each time a dangling entity's rank is updated. The dangling entities cache is implemented by redundantly storing all dangling entities at the end of the entities_hash table. This is accomplished through row key design. Each cache entry is a row whose row key consists of 16 bytes with the maximum value (hex FF) followed by a dangling entity ID. Because keys are ordered byte by byte, the 16-byte prefix ensures that no regular entity row is placed after the cache rows. A row scan whose start row is the 16 maximum-value bytes therefore retrieves all dangling entities.
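The key trick behind the cache can be simulated over a sorted list of row keys. This is a minimal sketch, not the project's implementation; CACHE_PREFIX and the function names are hypothetical, and the length check is an extra safeguard added for the illustration.

```python
from bisect import bisect_left

# 16 bytes of 0xFF sort after any prefix that a regular 16-byte entity ID can start with.
CACHE_PREFIX = b"\xff" * 16


def cache_row_key(entity_id: bytes) -> bytes:
    """Row key of a cache entry: <16 x 0xFF><danglingEntityID>."""
    return CACHE_PREFIX + entity_id


def scan_dangling(sorted_keys: list) -> list:
    """Return the IDs of all cached dangling entities with one scan from CACHE_PREFIX."""
    start = bisect_left(sorted_keys, CACHE_PREFIX)
    # Only 32-byte keys are cache rows; the check skips a regular row equal to the prefix.
    return [key[16:] for key in sorted_keys[start:] if len(key) == 32]
```

Since the scan starts at the prefix and runs to the end of the table, its cost depends on the number of dangling entities rather than on the size of the whole table.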
3.8
Summary
As can be observed from this chapter, a schema design needs to be engineered for particular querying requirements or data access patterns. A one-size-fits-all design which works for every problem is not possible. SQL is more flexible from this point of view, but unfortunately it does not scale to our needs, as shown in Chapter 2. This chapter presented a demonstrative schema design which tries to match the original Sourcerer data access patterns as closely as possible. Many changes can be made to this schema in future development if needed.
Chapter 4
Generalized CodeRank
This chapter presents the design of a code analysis algorithm, named Generalized CodeRank, which ranks code entities by their popularity within the repository. Inspired by Google PageRank [63], instead of considering the links between web pages, it uses relations between code entities to compute ranks.
4.1
Reputation, PageRank and CodeRank
This section introduces the concept of PageRank, illustrates how it can be applied to code by introducing the concept of CodeRank and describes how the general concept of PageRank models the reputation of nodes within a graph.
4.1.1
PageRank
The PageRank algorithm was first published in the article "The anatomy of a large-scale hypertextual Web search engine" by the Google Inc. founders Brin and Page [10]. Since then, it has sparked a lot of research, and many variants, improvements and alternative uses have appeared besides its original use for the web. PageRank uses the graph structure of the web, created by the links between web pages, to rank the popularity of those pages. The algorithm is inspired by academic citation literature [10], where an article is considered important if many other articles cite it. Taking the idea further, PageRank ranks higher the pages that have more inbound links and, more importantly, the pages that are linked by other important pages. The algorithm's philosophy is that when a page links to another page, it trusts its content and vouches for its quality. The same thing happens in academia: when a paper cites another one, it relies on its content. It is a way to measure a page's reputation within the web context. If a popular and thus important page, such as one from CNN (http://www.cnn.com/) or Yahoo! (http://www.yahoo.com/), links to a web page, that page might be ranked higher than another page which has many inbound links from non-popular pages.
4.1.2
The Random Web Surfer Behavior
PageRank models the behavior of a random surfer [10] who starts from a random web page and keeps following random links without using the browser's back button. After a while he gets bored and jumps to a random page from the web. If he reaches a dangling page, also known as a sink page, which has no outbound links, he goes to a random page. The set of all PageRanks can be viewed as a probability distribution vector. So the PageRank of a page has a value between 0 and 1 and represents the probability that a user reaches that page by following the random surfer behavior. Because all PageRanks make up a probability distribution, their sum should be 1. Web surfing can be modeled with a Markov chain where each page is a state and the links between pages give the probabilities of passing to other states. In the case of a random surfer there is an equal probability of moving from one page to any of the pages it links to. More complex PageRank models can assign different probabilities in order to better model real user behavior, in which pages are chosen by different criteria, such as their topic, the language they are written in or the location of the link on the page.
4.1.3
CodeRank
The concept of PageRank can be generalized so that the algorithm can be applied in fields other than the web. To do this, the Web link structure is viewed as a graph where web pages are nodes and links are edges. Thus, PageRank can be applied to any directed graph. This has already been used in several other fields. One example is a proposal to replace the ISI IF (Institute for Scientific Information, Impact Factor) ranking of science and social science journals, which only counts the number of citations over two years, with a PageRank-like prestige rank [8]. Another example is ecosystem modeling, where it is used to determine which species are important for the health of the environment [12]. Following this idea, state-of-the-art research in code search and code analysis has proposed applications of PageRank to measure the importance of methods, classes, packages and other code entities [64][51][55]. The name CodeRank for this approach was proposed in an older implementation of Sourcerer [51].

The concepts of code entities and relations are explained in Chapter 3. Calculating CodeRank follows the principles of the PageRank algorithm, but considers entities instead of nodes and relations between them instead of graph edges. The result is a ranking of the most popular entities from the source code used as input. For instance, the classes java.lang.Object and java.lang.String should be very popular for Java source code.

This master thesis proposes a PageRank-like approach to rank all code entities from a repository by following all relations between them, not just some kinds of code entities like methods, classes or packages. From this point on, we will refer to the algorithm that follows this approach as Generalized CodeRank. As far as we know, calculating PageRank on code entities by taking into account all entities and all relations from a repository has not been done until now.
There are many useful applications for CodeRank:
- Improving results ranking in source code search engines.
- Creating a top of the most important projects from a repository.
- Listing the most important packages, classes and methods from a project to help developers get started with a new project. If they must read the code in order to understand how it works, they might start with the important packages and read the code of the most important methods of the most important classes.
- In the context of a Web-scale source code repository, a top of the most important libraries for a specific purpose can be computed. This might be useful for project managers who need to choose the best library or technology that accomplishes a desired task in their project.
As shown by the state of the art, CodeRank has been used successfully to improve the ranking of source code search engine results, providing better results to the user [51][55].
4.1.4
The Random Code Surfer Behavior
The original PageRank algorithm was used in the Web context and models the web surfer's behavior of following links from page to page. Generalized CodeRank can be imagined as modeling a programmer's behavior of surfing source code. For better understanding, assume the following scenario. A new developer is hired by a company to work on a software system implemented in Java, which already has a big code base of multiple tightly coupled projects. Additionally, third-party libraries such as JUnit and Apache Commons are used. Before the new employee starts coding, he needs to understand how the already written code works and how it is organized. So he will start from a main method and surf the code to understand it. While doing this, he will read code entities like methods, fields, local variables, classes and packages, and follow the relations between them. For example, while he reads a method he may follow a call relation to another method, and so on. From time to time, or when he reaches a dangling entity (sink entity), which has no outbound relations, he may jump to a random point in the source code, i.e. a random entity. Following this model, we can interpret CodeRank as the probability that a programmer will encounter an entity while surfing source code. Entities encountered more often are more popular, and thus the chance that someone will search for them in a source code search engine is bigger.
4.2
Mathematical Model
This section describes the mathematics behind the PageRank concept as it can be applied to any directed graph, no matter if the nodes are web pages, code entities or any other concept and no matter if the edges are links or code entity relations. However, throughout this section we will use the term CodeRank instead of PageRank, for consistency with the topic of this chapter. The expression CodeRank of an entity is used interchangeably with rank of an entity.
4.2.1
CodeRank Basic Formula
The CodeRank of an entity is a probability, so it has a value between 0 and 1. When an entity has outbound relations to other entities, it transfers its rank to each of them, as illustrated in Figure 4.1. According to the simplest form of the CodeRank algorithm, the rank of an entity sums up the rank amounts propagated through all inbound relations, as illustrated by the following formula [63]:

\[ R(u) = \sum_{v \in B_u} \frac{R(v)}{N_v} \tag{4.1} \]
$R(u)$ is the CodeRank of an entity $u$, $B_u$ is the set of entities that have outbound relations to entity $u$ and $N_v$ is the number of outbound relations of entity $v$. The formula shows that, by dividing the rank of an entity by its number of outbound relations, an equal amount of its rank is transferred to each target entity of the outbound relations, as happens in Figure 4.1.
Figure 4.1: CodeRank Example

According to the code surfer model, a programmer might get bored of following relations through entities and suddenly jump to a random entity, an action known in the literature as teleportation [55]. A damping factor $d$, which represents the probability of following relations without teleporting, is introduced into the previous formula to model this behavior:

\[ R(u) = d \sum_{v \in B_u} \frac{R(v)}{N_v} + (1 - d) \frac{1}{n} \tag{4.2} \]
In this formula the value n represents the total number of the entities.
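One application of formula (4.2) to every entity can be sketched as below. The graph is a hypothetical toy example without dangling entities, and the default d = 0.85 is an assumption borrowed from common PageRank practice, not a value stated in this section.

```python
def coderank_step(ranks: dict, outbound: dict, d: float = 0.85) -> dict:
    """Apply formula (4.2) once: R(u) = d * sum(R(v) / N_v) + (1 - d) / n."""
    n = len(ranks)
    # Teleportation share (1 - d) / n received by every entity.
    new_ranks = {u: (1.0 - d) / n for u in ranks}
    for v, targets in outbound.items():
        for u in targets:
            # Entity v passes an equal share of its rank along each outbound relation.
            new_ranks[u] += d * ranks[v] / len(targets)
    return new_ranks


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a (no dangling entities).
toy_outbound = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_ranks = {u: 1.0 / 3 for u in toy_outbound}
toy_result = coderank_step(toy_ranks, toy_outbound)
```

In this toy graph entity c receives rank from both a and b, so after one step it holds the largest rank, while the ranks still sum to 1.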
4.2.2
CodeRank Matrix Representation
The CodeRanks of all entities can be grouped together in a vector $r$. If $M$ is a transition matrix that models the code surfer's behavior of moving from one entity to another, the following formula holds:

\[ r = M r \tag{4.3} \]
As in PageRank, the ranks vector r is the dominant eigenvector of the transition matrix M. Solving equation (4.3) for r directly is not feasible because of the size of matrix M, but r can be approximated with the formula $r = M^j r_0$, where $r_0$ is an initial CodeRanks vector. The values of the initial ranks are not important, because for a large enough j an approximate value of r is obtained. $r_0$ is typically set to a uniform distribution, where each rank is 1/n, n being the size of the vector. The ideal r is obtained as j tends to infinity:

$$r = \lim_{j \to \infty} M^j r_0 \tag{4.4}$$
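The approximation $r = M^j r_0$ can be illustrated with a small dense matrix. The following is a minimal sketch, assuming a column-stochastic matrix that fits in memory; the class `PowerIterationSketch` and the 2 x 2 example matrix are hypothetical and only serve to show the iteration, since the real transition matrix is far too large to be stored this way.

```java
import java.util.Arrays;

/** Illustrative power iteration r = M^j r0 for a small dense matrix. */
final class PowerIterationSketch {

    /** Multiplies the square matrix m by the column vector r. */
    static double[] multiply(double[][] m, double[] r) {
        int n = r.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                out[i] += m[i][j] * r[j];
            }
        }
        return out;
    }

    /** Starts from the uniform vector and stops at maxIter or when the change drops below tol. */
    static double[] iterate(double[][] m, int maxIter, double tol) {
        int n = m.length;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n);
        for (int j = 0; j < maxIter; j++) {
            double[] next = multiply(m, r);
            double squared = 0.0;
            for (int i = 0; i < n; i++) {
                squared += (next[i] - r[i]) * (next[i] - r[i]);
            }
            r = next;
            if (Math.sqrt(squared) < tol) {
                break;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical column-stochastic matrix: each column sums to 1.
        double[][] m = { {0.9, 0.5}, {0.1, 0.5} };
        System.out.println(Arrays.toString(iterate(m, 200, 1e-12)));
    }
}
```

For this example matrix the iteration approaches the vector (5/6, 1/6), which satisfies r = M r.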
The transition matrix can be decomposed as follows:

$$M = dP + (1 - d)Q = d(A + D) + (1 - d)Q \tag{4.5}$$
d is the damping factor, A is the adjacency matrix, which models the relations graph, D models transitions from dangling entities and Q models teleportation, i.e. a random transition to any entity. An element $a_{i,j}$ of matrix A is 0 if there is no relation from entity j to entity i; otherwise it represents the probability that the random surfer will go from entity j to entity i (4.8). Each column of A that belongs to a non-dangling entity sums to 1, while the column of a dangling entity contains only zeros, so A alone is not stochastic; adding D makes P = A + D a stochastic matrix.

From a dangling entity there are no outbound relations, so in order to model the random code surfer behavior, we state that there is an equal probability of a transition to any entity. This behavior is modeled by the transition matrix D. If an entity j is dangling (a sink), then all elements of column j of matrix D are 1/n. Otherwise (j is not dangling), all elements of the column are 0. The first equation below describes a way to decompose D, where e is a vector having all its elements equal to 1 and $s^T$ is the transposed vector of sink entities, i.e. element j of s is 1 if entity j is dangling and 0 otherwise (4.9) [58]:

$$D = \frac{e\,s^T}{n}, \qquad D r = \frac{e\,(s^T r)}{n} \tag{4.6}$$
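Computing D r for the CodeRank equation (4.3) therefore reduces to calculating the inner product $s^T r$, as can be seen in the equations above. Calculating this inner product, referred to from now on as the dangling entities inner product (DEIP), is equivalent to summing the CodeRanks of all dangling entities. By replacing D from (4.6) in (4.5) and M from (4.5) in (4.3), the basic CodeRank formula (4.2) can be rewritten as:

$$
\begin{pmatrix} R(0) \\ R(1) \\ \vdots \\ R(n-1) \end{pmatrix}
= d
\begin{pmatrix}
a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\
a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n-1,0} & a_{n-1,1} & \cdots & a_{n-1,n-1}
\end{pmatrix}
\begin{pmatrix} R(0) \\ R(1) \\ \vdots \\ R(n-1) \end{pmatrix}
+ (1 - d)
\begin{pmatrix} \frac{1}{n} \\ \frac{1}{n} \\ \vdots \\ \frac{1}{n} \end{pmatrix}
+ d
\begin{pmatrix} \frac{1}{n} \\ \frac{1}{n} \\ \vdots \\ \frac{1}{n} \end{pmatrix}
\begin{pmatrix} s_0 & s_1 & \cdots & s_{n-1} \end{pmatrix}
\begin{pmatrix} R(0) \\ R(1) \\ \vdots \\ R(n-1) \end{pmatrix}
\tag{4.7}
$$

$$
a_{i,j} =
\begin{cases}
0 & \text{if there is no relation from entity } j \text{ to entity } i \\
\frac{1}{N_j} & \text{if there is a relation from entity } j \text{ to entity } i
\end{cases}
\tag{4.8}
$$

$$
s_j =
\begin{cases}
0 & \text{if } j \text{ is not a dangling entity} \\
1 & \text{if } j \text{ is a dangling entity}
\end{cases}
\tag{4.9}
$$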
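To connect the matrix form back to the per-entity formula, it may help to read a single row u of (4.7). The following is a worked restatement under the definitions (4.8) and (4.9), not an additional result:

```latex
R(u) = d \sum_{j=0}^{n-1} a_{u,j}\, R(j)
     + \frac{d}{n} \sum_{j=0}^{n-1} s_j\, R(j)
     + \frac{1 - d}{n}
     = d \left( \sum_{v \in B_u} \frac{R(v)}{N_v} + \frac{s^T r}{n} \right) + \frac{1 - d}{n}
```

The first sum keeps only the entities that have a relation to u, because $a_{u,j}$ is 0 otherwise, and the second sum is the DEIP. When there are no dangling entities, $s^T r = 0$ and the expression reduces to formula (4.2).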
4.3
Computing Generalized CodeRank with MapReduce
As shown in the previous section, the CodeRanks vector is computed by repeatedly multiplying the transition matrix with the current CodeRanks vector. This section shows how to accomplish this with Hadoop MapReduce.
4.3.1
Storing Data in HBase
Subsection 3.5.3 of Chapter 3 described how the entities_hash HBase table stores the data required by the Generalized CodeRank algorithm in the relations column family. The row key is an entity ID and the codeRank column stores the current rank of the entity. Relations whose source entity has this ID can be retrieved from the same row. Besides the current CodeRank of an entity, the Generalized CodeRank algorithm also needs the number of outbound relations of the entity, stored in the targetEntitiesCount column, and the target entities of those relations, stored in the targetEntities column along with the kind of each relation.
4.3.2
Hadoop Jobs
There are two mandatory MapReduce jobs that need to be performed for one iteration of the algorithm:

- DEIP Job: calculates the DEIP scalar value
- CodeRank Job: calculates the CodeRank of each entity; takes the DEIP scalar value as an input parameter

These jobs are repeated until a maximum number of iterations is reached or an error tolerance is achieved.

The Map tasks of CodeRank Job take as input all rows from the relations column family of the entities_hash table. The input key is the table row key, which represents the source entity ID of a relation, and the input value consists of the columns codeRank, targetEntitiesCount and targetEntities. For each target entity ID received in the input value, a Map task outputs a pair whose key is that target entity ID [58]. Every output value is the CodeRank of the source entity divided by its number of outbound relations, corresponding to the terms $R(v)/N_v$ of the sum in (4.2). In other words, each Map task computes the contribution of a source entity to the rank of each target entity of its outbound relations.

Each Reduce task sums up the contributions that an entity receives from the source entities of its inbound relations. The input key is the entity ID and the values are the contributions received. After calculating the sum of the contributions, referred to here as a, the Reduce task calculates the CodeRank with the formula below and outputs it as the value, with the entity ID as the key [58]:

$$R(u) = d\left(a + \frac{b}{n}\right) + (1 - d)\,\frac{1}{n} \tag{4.10}$$
The above formula is obtained from (4.7) by replacing the contributions sum (which corresponds to the product between the adjacency matrix and the CodeRanks vector) with a and the DEIP scalar with b. Each Reduce task writes the CodeRank calculated for the entity received as input into the codeRank column of the entities_hash table.

Calculating DEIP is equivalent to summing the CodeRanks of all dangling entities. To do this with MapReduce, each Map task reads a dangling entity from the relations column family of the entities_hash table and outputs the input key with its CodeRank as value. Each Reduce task sums the CodeRanks received as input values for an entity received as input key. The output key is the same as the input key and the output value is the calculated sum.

There are two different Hadoop jobs capable of calculating DEIP. One of them, named Metrics Job, scans the whole entities_hash table; each Map task has to check whether the entity received as input is dangling and outputs its rank only if it is. This is highly inefficient, because the Map tasks need to read all entities and, according to our statistics, only about 1% of the entities are dangling. DEIP Job has an efficient Map implementation which only reads dangling entities. To make this possible, dangling entities are stored redundantly along with their CodeRanks at the end of the entities_hash table, as described in Section 3.7. This table range is called the Dangling Entities Cache. Prefixing dangling entity IDs with 16 bytes that have the maximum hex value FF ensures that they are placed at the end of the table. By scanning all rows that start with 16 FF-valued bytes, all dangling entities are retrieved.

Metrics Job can be used to calculate useful metrics:

- the Euclidean distance between the current CodeRank vector and the one from the previous iteration
- the sum of all CodeRanks (should be approximately 1 if the computation was correct)
- DEIP

The Generalized CodeRank algorithm runs until a maximum number of iterations is reached. An optional stopping condition can be set so that the algorithm stops when a tolerance is reached, i.e., when the Euclidean distance metric falls below a given threshold. When this condition is set, DEIP Job is replaced by Metrics Job, which calculates both the DEIP scalar and the Euclidean distance metric. If desired, the sum can be calculated as well. Metrics Job is inefficient for DEIP calculation, but if calculating metrics is desired this compromise must be made, because the Euclidean distance and the sum require the CodeRanks of all entities.
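The interplay between the jobs described in this subsection can be imitated in memory, without Hadoop or HBase. The sketch below is an assumption-laden illustration: the class `CodeRankIterationSketch`, its nested `Row` type and the decision to emit an empty contribution list for entities without inbound relations are choices made for this example, not a description of the actual job classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory imitation of one Generalized CodeRank iteration (no Hadoop involved). */
final class CodeRankIterationSketch {

    /** Hypothetical stand-in for one row of the relations column family. */
    static final class Row {
        final String id;
        final double codeRank;
        final List<String> targetEntities;

        Row(String id, double codeRank, List<String> targetEntities) {
            this.id = id;
            this.codeRank = codeRank;
            this.targetEntities = targetEntities;
        }
    }

    /** DEIP: sum of the CodeRanks of all dangling entities. */
    static double deip(List<Row> rows) {
        double b = 0.0;
        for (Row row : rows) {
            if (row.targetEntities.isEmpty()) {
                b += row.codeRank;
            }
        }
        return b;
    }

    /** Map phase: emit (target ID, R(v) / N_v) for every outbound relation. */
    static Map<String, List<Double>> map(List<Row> rows) {
        Map<String, List<Double>> emitted = new HashMap<>();
        for (Row row : rows) {
            // Sketch-only choice: keep entities that receive no contribution at all.
            emitted.computeIfAbsent(row.id, k -> new ArrayList<>());
            for (String target : row.targetEntities) {
                emitted.computeIfAbsent(target, k -> new ArrayList<>())
                       .add(row.codeRank / row.targetEntities.size());
            }
        }
        return emitted;
    }

    /** Reduce phase: apply R(u) = d(a + b/n) + (1 - d)/n to each entity. */
    static Map<String, Double> reduce(Map<String, List<Double>> emitted, double b, double d, int n) {
        Map<String, Double> ranks = new HashMap<>();
        emitted.forEach((id, contributions) -> {
            double a = contributions.stream().mapToDouble(Double::doubleValue).sum();
            ranks.put(id, d * (a + b / n) + (1.0 - d) / n);
        });
        return ranks;
    }

    /** Euclidean distance between the previous ranks and the new ones. */
    static double distance(List<Row> previous, Map<String, Double> current) {
        double sum = 0.0;
        for (Row row : previous) {
            double diff = current.get(row.id) - row.codeRank;
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Hypothetical graph: A -> B, B -> C, C is dangling.
        List<Row> rows = List.of(
                new Row("A", 1.0 / 3, List.of("B")),
                new Row("B", 1.0 / 3, List.of("C")),
                new Row("C", 1.0 / 3, List.of()));
        Map<String, Double> ranks = reduce(map(rows), deip(rows), 0.85, rows.size());
        System.out.println(ranks + " distance=" + distance(rows, ranks));
    }
}
```

A driver would repeat these steps, feeding the new ranks back as input, until the iteration limit is reached or `distance` falls below the chosen tolerance.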
4.4
Experiments
This section presents our experiments with the Generalized CodeRank algorithm by describing our data input, the infrastructural setup and the results accompanied by statistical remarks.
4.4.1
Setup
A sample repository of about 333 MiB was used to populate the database. We did not test with a bigger repository because the extractor has not yet been ported to run on the cluster. Populating our HBase database involves a lot of overhead, because at the moment we import the data from the old MySQL database, which takes a lot of time. After running the extractor, the HBase database contained 111 projects, 29,236 files, 611,262 entities and 1,936,757 relations.

Hadoop and HBase were deployed on a 9-node cluster reserved from the Tembusu Cluster of the National University of Singapore, School of Computing. Each node is a Dell PC with 2 x quad-core Xeon E5620 2.4 GHz CPUs and 24 GiB of RAM. It runs the CentOS GNU/Linux operating system, installed on a 500 GiB hard disk. Two additional hard disks of 1 TiB each, connected in a RAID array, are used to store HDFS data. On one node we run the HBase master, the Hadoop JobTracker and the HDFS NameNode. On each of the others we placed an HBase RegionServer, a Hadoop TaskTracker and an HDFS DataNode. Three of these nodes hosted a ZooKeeper cluster.
4.4.2
Convergence
In a first experiment, 29 iterations of the Generalized CodeRank algorithm were run, and for each one the Euclidean distance between the current CodeRank vector and the one from the previous iteration was calculated. Figure 4.2 shows how this Euclidean distance varies with each iteration. A desired tolerance of $10^{-5}$ is reached after 15 iterations, which is enough for good CodeRank results.
Figure 4.2: Variation of the Euclidean distance with the iteration number, illustrating the convergence of the Generalized CodeRank algorithm

When metrics like the Euclidean distance are calculated, Metrics Job is executed instead of DEIP Job. In order to test with DEIP Job as well, another experiment was run with 15 iterations. The results shown in the next subsections are obtained from this second experiment.
4.4.3
Probability Distribution
The mathematical model of CodeRank states that the CodeRanks vector is a probability distribution. This distribution is plotted in Figure 4.3 for the first 40 largest CodeRank values. The ranks are sorted in decreasing order and indexed from 0 to 38. The x axis shows the index values and the y axis shows the rank values.
Figure 4.3: Probability distribution represented by CodeRanks vector

It is known from previous studies that PageRank follows a power law distribution with a power value of approximately 1 [71]. Figure 4.4 plots the CodeRanks distribution obtained from the second experiment and a power law distribution f(x) modeled by the following equation:

$$f(x) = \frac{\max(r)}{x + 1} \tag{4.11}$$

Figure 4.4: log-log plot for CodeRanks distribution and a power law distribution

r is the CodeRanks vector and max(r) is the biggest CodeRank, i.e., the one that has index x = 0 in Figure 4.3. Both axes of Figure 4.4 are on a logarithmic scale, so CodeRank points have coordinates (log x, log R(x)) and power law points have coordinates (log x, log f(x)).

$$
\begin{aligned}
\log f(x) &= \log \frac{\max r}{x + 1} \\
\log f(x) &= \log \max r - \log (x + 1) \\
g(\log (x + 1)) &= k - \log (x + 1) \\
g(x) &= k - x
\end{aligned}
\tag{4.12}
$$

In the above equation, log f(x) = g(log(x + 1)) and k = log max r is a constant. Equation g(x) = k - x, which describes the power law from Figure 4.4, is linear, so its graph is a line. The CodeRank points from the figure are very close to the power law line, which indicates that CodeRanks follow a power law distribution like PageRanks from the web.
4.4.4
Entities CodeRank Top
Table 4.1 lists the 10 entities with the largest CodeRank, expressed as percentages. An extended version of this table, which shows 100 entities instead of 10, can be found in Appendix B. Their importance in the repository is very high: these 10 entities, out of a total of 611,262, account for 21.42% of the total amount of CodeRank, as illustrated on the right side of Figure 4.5.

Table 4.1: Top 10 Entities CodeRank

| # | FQN | Entity Type | CodeRank |
|---|-----|-------------|----------|
| 1 | java.lang | package | 4.72% |
| 2 | void | primitive | 4.24% |
| 3 | java.lang.Object | class | 4.06% |
| 4 | int | primitive | 2.29% |
| 5 | java.lang.String | class | 2.20% |
| 6 | boolean | primitive | 1.12% |
| 7 | java.io | package | 1.03% |
| 8 | java.io.Serializable | interface | 1.00% |
| 9 | java.lang.CharSequence | interface | 0.39% |
| 10 | java.lang.Comparable<java.lang.String> | parametrized type | 0.37% |
Figure 4.5: Left: Top 10 Entities CodeRank chart; Right: Distribution of Top 10 Entities CodeRanks within the whole set of entities

It can be observed that the 10 most popular entities either belong to the Java Standard Library or are primitives that are ubiquitous in any Java application. The reason is that any Java program needs them in order to work. java.lang, which takes first place, is the package that is imported by default. Any program should have at least a main method, which has return type void, the primitive that occupies second place. The base of all classes is java.lang.Object, in third place: a Java program must have at least one class, which by default inherits from Object. We can conclude that the correctness of the results is supported both by the relevance of the CodeRank top and by the shared statistical model (i.e. CodeRank and PageRank follow the same power law distribution).
4.4.5
Performance Results
Table 4.2 contains the time required by Generalized CodeRank to run the experiments, as well as the time for each job involved in the process.

Table 4.2: Experiments and jobs running time

| Experiment / Job | Time |
|------------------|------|
| Experiment 1 (with metrics), 29 iterations | 4485 s (74 min 45 s) |
| Experiment 2 (without metrics), 15 iterations | 1882 s (31 min 22 s) |
| Experiment 3 (with metrics), 15 iterations | 2255 s (37 min 35 s) |
| DEIP Job (for Experiment 2) | 41 s |
| Metrics Job (for Experiment 1 and 3) | 69 s (1 min 9 s) |
| CodeRank Job | 84 s (1 min 24 s) |
As explained in this chapter, DEIP can be calculated either with a DEIP Job or with a Metrics Job. The results from the table show that, by using the Dangling Entities Cache, DEIP Job achieves a performance boost of 68.29% over Metrics Job. The computation of Generalized CodeRank as a whole benefits from a performance boost of 19.82%.
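The two percentages are consistent with expressing the extra time of the slower variant relative to the faster one, using the values from Table 4.2. This reading is inferred from the numbers and is not stated explicitly in the text:

```latex
\frac{t_{\text{Metrics Job}} - t_{\text{DEIP Job}}}{t_{\text{DEIP Job}}}
  = \frac{69 - 41}{41} \approx 68.29\%
\qquad
\frac{t_{\text{Exp. 3}} - t_{\text{Exp. 2}}}{t_{\text{Exp. 2}}}
  = \frac{2255 - 1882}{1882} \approx 19.82\%
```

Experiments 2 and 3 both run 15 iterations, so the second ratio compares complete runs that differ only in the job used for DEIP calculation.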
Chapter 5
Implementation
The original Sourcerer code search infrastructure was developed in Java at the University of California, Irvine. This master thesis describes the work around a fork of Sourcerer, named Distributed Sourcerer, which aims at scaling Sourcerer up to Internet scale. The original database implementation, which relied on MySQL, has been rewritten from scratch in order to work with HBase. The implementation details and the new API to the database are described in Section 5.1. A higher level interface to the new database, in the form of a set of command-line interface (CLI) tools, is described in Section 5.3. The Generalized CodeRank algorithm, described in Chapter 4, has been implemented over Hadoop MapReduce as described in Section 5.2. A CLI user interface which facilitates CodeRank calculation is described in Section 5.6.
5.1
Database Implementation
The new distributed database implementation is called SourcererDDB and is located in the distributed-database Eclipse project, under the path infrastructure/tools/java/distributed-database of the Distributed Sourcerer repository [11]. Its implementation can be divided into four parts, described in the next subsections:

- Subsection 5.1.1: classes used to model data, HBase tables and model types
- Subsection 5.1.2: classes that provide the programming interface to retrieve data from HBase tables
- Subsection 5.1.3: classes that provide the programming interface to insert data into HBase tables
- Subsection 5.1.4: Hadoop MapReduce jobs which duplicate data into additional tables for efficient retrieval
5.1.1
Data Modeling
The data modeling part of the implementation consists of three Java packages:

1. edu.nus.soc.sourcerer.ddb.tables
   Eclipse project: distributed-database
   Path: infrastructure/tools/java/distributed-database/src/edu/nus/soc/sourcerer/ddb/tables/
   Description: classes from this package contain information about HBase tables and provide access to them.
2. edu.uci.ics.sourcerer.model
   Eclipse project: model
   Path: infrastructure/tools/java/model/src/edu/uci/ics/sourcerer/model/
   Description: this package contains classes that abstract the model types described in Appendix A for projects, files, entities and relations.
3. edu.nus.soc.sourcerer.model.ddb
   Eclipse project: distributed-database
   Path: infrastructure/tools/java/distributed-database/src/edu/nus/soc/sourcerer/model/ddb/
   Description: classes used to abstract data exchanged with HBase in an object-oriented way.

Each HBase table has an associated class in package edu.nus.soc.sourcerer.ddb.tables, named <name>HBTable, where <name> is the camel case form of the table name. For example, the relations_inverse HBase table is associated with class RelationsInverseHBTable. All classes associated with tables follow the singleton design pattern and extend the HBTable abstract class. The unique instance of a class is obtained with the getInstance() method. The only abstract method that needs to be overridden is getName(), which returns the name of the associated table. The base class HBTable implements the getHTable() method, which returns an HTable instance for the associated table; this instance is used to access the table as described in the HBase documentation [38]. When the instance is created for the first time, the setupHTable() method is called, where special configuration code for an HTable instance can be added. Besides the table name, classes associated with tables also contain static final fields which store column family names and column qualifier names. An HTableDescriptor (see the HBase documentation [38]) can be obtained by calling the static method getTableDescriptor(). The obtained instance can be used to create new tables, modify table schemas, delete tables, etc.
These administrative operations are implemented in class DatabaseInitializer from package edu.nus.soc.sourcerer.ddb.tools. Updating table schemas is not currently implemented; a NotImplementedException is thrown instead.

Package edu.uci.ics.sourcerer.model, which contains enums that abstract model types, was included in the original Sourcerer implementation. Project types, file types, entity types, relation types and relation classes are abstracted in classes Project, File, Entity, Relation and RelationClass, respectively. I modified those classes in order to encode a byte value for each type, which is returned by the getValue() method.

Classes from package edu.nus.soc.sourcerer.model.ddb are used to model data exchanged with HBase. All of them implement the Model interface and have names ending in Model. Some of them implement the interface indirectly by extending class ModelWithID. Method computeId from this class returns an MD5 hash of the class fields passed as parameters. The data from those fields is obtained through Java reflection, and the returned value can be used to set the id field of a model. For most models that extend ModelWithID this is done in the constructor.
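Two of the ideas from this subsection, the singleton table classes and the MD5-based identifiers, can be sketched without any HBase dependency. Everything in the block below is hypothetical: the names `TableSketch`, `RelationsInverseTableSketch` and `IdSketch`, the eager creation of the singleton, and the UTF-8 encoding of string fields are assumptions made for illustration, and the reflection step of the real computeId is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of the table-class pattern; HBase types are deliberately left out. */
abstract class TableSketch {
    /** Returns the name of the associated HBase table. */
    abstract String getName();
}

/** Hypothetical counterpart of a table class, reduced to the singleton part. */
final class RelationsInverseTableSketch extends TableSketch {
    // Assumption: eager creation; the text describes creation on first use.
    private static final RelationsInverseTableSketch INSTANCE = new RelationsInverseTableSketch();

    private RelationsInverseTableSketch() {
    }

    static RelationsInverseTableSketch getInstance() {
        return INSTANCE;
    }

    @Override
    String getName() {
        return "relations_inverse";
    }
}

/** Sketch of an MD5-based identifier computed from the field values of a model. */
final class IdSketch {

    /** Hashes the given field values in order; a real scheme would also need to delimit fields. */
    static byte[] computeId(String... fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                md5.update(field.getBytes(StandardCharsets.UTF_8));
            }
            return md5.digest(); // an MD5 digest is always 16 bytes long
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(RelationsInverseTableSketch.getInstance().getName());
        System.out.println(computeId("java.lang", "PACKAGE").length);
    }
}
```

Because the identifier is derived from the field values, two models built from the same values receive the same ID, which is what allows the hash tables to be addressed by MD5 hash.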
5.1.2
Database Retrieval Queries API
Package edu.nus.soc.sourcerer.ddb.queries from the Eclipse project distributed-database provides specialized classes to retrieve and add data about projects, files, entities and relations. Listing 5.1 presents an example of retrieving relations by several criteria.
```java
// sourceID, kind, targetID, projectID, fileID and fileType are assumed to be
// declared by the caller; a criterion passed as null is ignored.
try {
  /* Instantiate the object used to retrieve relations from HBase. */
  RelationsRetriever rr = new RelationsRetriever();

  /* Results are going to be printed as they are retrieved from HBase. */
  ModelAppender<Model> appender = new PrintModelAppender<Model>();

  /* Retrieve relations call. */
  rr.retrieveRelations(appender, sourceID, kind, targetID,
      projectID, fileID, fileType);
} catch (HBaseConnectionException e) {
  LOG.fatal("Could not connect to HBase database: " + e.getMessage());
} catch (HBaseException e) {
  LOG.fatal("An HBase error occurred: " + e.getMessage());
}
```
Listing 5.1: Retrieving relations example

In order to retrieve logical entries from HBase, referred to from now on as models, the following API steps must be followed. Each step is exemplified in Listing 5.1. The models are implemented in package edu.nus.soc.sourcerer.model.ddb.

1. A retrieval class is used to search HBase tables.
   In the example from Listing 5.1, the RelationsRetriever class is used.

2. A retrieval method of that class is called. The first parameter is always a ModelAppender object. Subsequent parameters represent several searching criteria. If one of these parameters is null, searching will not be performed by that criterion. Depending on which searching criteria parameters are not null, the method works out which HBase table to look the entries up in, in order to optimize the query.
   In the example, method retrieveRelations is called. After the appender, the following searching criteria are passed, in this order: source entity ID, relation kind (a byte representing relation type and relation class), target entity ID, project ID, file ID and file type. Depending on which of these criteria parameters are null, the lookup is performed in the relations_direct, relations_inverse or files table.

3. Inside the retrieval method, the HBase client API is used, which retrieves table rows as results. In some tables each row is mapped to exactly one model, but in others a row is mapped to multiple models. A result-to-model(s) method is used to convert a table row result to a model or to a set of models.
   In the example, the HBase client API retrieves table rows as results from the relations_direct, relations_inverse or files table, depending on which criteria parameters are not null. In the relations_hash table each row is mapped to exactly one relation entry, but in the relations_direct or relations_inverse tables a row may contain several relation entries. The relationsInverseResultToRelationsGroupedModels method converts a relations_inverse table row result to a set of RelationsGroupedModels.

4. For each model retrieved, method add of a ModelAppender object is called. ModelAppender is an interface which facilitates a visitor design pattern. Depending on the processing desired for each model retrieved, a special implementation of this interface can be written.
   In the provided example, PrintModelAppender prints the model passed to the add method. ListModelAppender, another ModelAppender implementation, creates a list of the models passed to add, which can be returned afterwards by calling getList().
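The four steps above can be sketched with stand-in types. The interface `AppenderSketch`, the classes `ListAppenderSketch` and `RetrieverSketch`, and the use of plain strings as "rows" are all hypothetical simplifications; the real ModelAppender interface and retrieval classes have their own signatures, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reduction of the appender idea; not the project's actual interface. */
interface AppenderSketch<M> {
    void add(M model);
}

/** Collects every model it receives, in the spirit of ListModelAppender. */
final class ListAppenderSketch<M> implements AppenderSketch<M> {
    private final List<M> models = new ArrayList<>();

    @Override
    public void add(M model) {
        models.add(model);
    }

    List<M> getList() {
        return models;
    }
}

/** Stand-in for a retrieval class: pushes each match to the appender as it is found. */
final class RetrieverSketch {

    static void retrieve(List<String> rows, String prefix, AppenderSketch<String> appender) {
        for (String row : rows) {
            // A null criterion means "do not filter by this criterion".
            if (prefix == null || row.startsWith(prefix)) {
                appender.add(row);
            }
        }
    }

    public static void main(String[] args) {
        ListAppenderSketch<String> appender = new ListAppenderSketch<>();
        retrieve(List.of("java.lang.Object", "java.io.File", "java.lang.String"), "java.lang", appender);
        System.out.println(appender.getList());
    }
}
```

Passing the appender into the retrieval method lets the caller decide what happens to each result (printing, collecting, counting) without changing the retrieval code.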
Retrieval classes, retrieval methods, result-to-model(s) methods and ModelAppender implementations follow some naming conventions:

- Retrieval classes: <Entries>Retriever, where <Entries> can be Projects, Files, Entities or Relations. So, the following retrieval classes exist: ProjectsRetriever, FilesRetriever, EntitiesRetriever and RelationsRetriever.
- Retrieval methods: retrieve<Models>[From<Table>][WithGet]. The parts in square brackets are optional. <Models> is the model class name which is passed to the add method of the ModelAppender object. <Table> is the camel case name of the table (for example: RelationsDirect for the relations_direct table). If [From<Table>] is included in the name of the method, the retrieval is performed from that particular table. If [WithGet] is included in the name, the retrieval is performed by using a get HBase operation instead of a scan operation. Examples: retrieveProjects, retrieveFilesWithGet, retrieveRelationsFromFilesTableWithGet.
- Result-to-model(s) methods: <table>ResultTo<Model>[s]. <table> is the camel case name of the table with the first letter in lower case. <Model> is the model class name. The optional [s] represents a plural for the model. If it appears, a List of model types will be retrieved, rather than a single model type. Examples: resultToFileModel, entitiesHashResultToEntityModel, filesResultToEntitiesGroupedModels.
- ModelAppender implementations: <Name>ModelAppender.
5.1.3
Database Insertion Queries API
The same package edu.nus.soc.sourcerer.ddb.queries contains the classes for inserting code data into HBase tables. Insertion classes, which implement the ModelInserter interface, are used to add the data contained in a collection of models to the database, as shown in Listing 5.2. Interface ModelInserter is parametrized by the model class. Currently, there are four implementations of this interface, for inserting projects, files, entities and relations, respectively. The example from Listing 5.2 shows how two relations, whose data is stored in models relationModelA and relationModelB, are inserted into the database by using the RelationModelInserter class, which implements the ModelInserter<RelationModel> interface.
```java
// Requires java.util.Collection and java.util.Vector; relationModelA and
// relationModelB are assumed to be created by the caller.
try {
  /* Create a list of relation models. */
  Collection<RelationModel> relationModels =
      new Vector<RelationModel>(2);
  relationModels.add(relationModelA);
  relationModels.add(relationModelB);

  /* Insert the models from the list into HBase tables. */
  ModelInserter<RelationModel> modelInserter =
      new RelationModelInserter(2);
  modelInserter.insertModels(relationModels);
} catch (HBaseException e) {
  LOG.fatal("An HBase error occurred: " + e.getMessage());
}
```
Listing 5.2: Inserting relations example

All ModelInserter implementations should have names that follow the format <ModelClass>Inserter. ProjectModelInserter adds data to the projects table, FileModelInserter to the files table, EntityModelInserter to the entities_hash table and RelationModelInserter to the relations_hash table. Currently, class MySQLImporter uses all of these insertion classes to import code data from the old MySQL database (SourcererDB) into the new database (SourcererDDB), based on HBase. The next section shows how the newly imported data is indexed into more tables for efficient retrieval.
5.1.4
Indexing Data from Database
The insertion classes described in the previous subsection populate only one table for each of the metamodels projects, files, entities and relations. These tables are basically hash tables for efficient retrieval by MD5 hash ID. In order to optimize retrieval by other searching criteria, additional tables need to be populated with redundant data, as explained in Chapter 3. To achieve this, some Hadoop MapReduce jobs need to be run. By running those jobs, the duplication and denormalization principles are satisfied for the data [38]. For projects and files, the two HBase tables are enough, but for entities and relations storing data redundantly is required.

All Hadoop MapReduce classes are currently organized in the distributed-database Eclipse project. Classes which implement MapReduce jobs are placed in package edu.nus.soc.sourcerer.ddb.mapreduce.jobs, Map classes in edu.nus.soc.sourcerer.ddb.mapreduce.map, and Reduce classes in edu.nus.soc.sourcerer.ddb.mapreduce.reduce.

In order to index entities, a MapReduce job needs to be run. Its class name and the corresponding Map and Reduce class implementations are as follows:

- EntitiesIndexerJob: indexes entities by duplicating their data from the entities_hash table to the entities table and to the entities column family of the files table.
  - Map task class: EntitiesMapper
  - Reduce task class: EntitiesReducer

In order to index relations, two MapReduce jobs need to be run. Their class names and the corresponding Map and Reduce class implementations are as follows:

- RelationsIndexerJob: indexes relations by duplicating their data from the relations_hash table to the relations_direct table, the relations_inverse table and the relations column family of the files table.
  - Map task class: RelationsMapper
  - Reduce task class: RelationsReducer
- CRRelationsIndexerJob: indexes relations for efficient retrieval during CodeRank calculation. Data from the relations_hash table is redundantly stored in the relations column family of the entities_hash table, as explained in Chapter 4.
  - Map task class: RelationsSourceMapper
  - Reduce task class: RelationsSourceReducer
5.2
CodeRank Implementation
Chapter 4 explained in detail which MapReduce jobs are required to calculate CodeRank and how these jobs need to be combined and repeated in order to obtain a final result. As a consequence, most of Subsection 5.2.1 specifies which classes are used for each job and which map and reduce task classes they use. Subsection 5.2.2 describes some additional utility jobs that have been implemented. The Generalized CodeRank source code is located in the distributed-database Eclipse project. The same package structure for MapReduce jobs, Map tasks and Reduce tasks is used as specified in Subsection 5.1.4.
5.2.1
CodeRank and Metrics Calculation Jobs
If the database has just been populated and indexed, the values from the relations column family of the entities_hash table need to be initialized. An initialization job performs the following operations:

- The initial CodeRank of all entities is set to 1/n, where n is the total number of entities.
- The Dangling Entities Cache is created with its initial CodeRank values (1/n, as previously stated).
- For entities with no outbound relations, the target entities count column must store a value of 0, so that they can be identified as dangling entities in a MetricsJob.

The CodeRank and metrics calculation job class names and the corresponding Map and Reduce class implementations are as follows:

- CRInitJob: the initialization job described above.
  - Map task class: CRInitMapper
  - Reduce task class: none
- CRJob: the CodeRank Job described in Subsection 4.3.2.
  - Map task class: CRMapper
  - Combine task class: CRCombiner
  - Reduce task class: CRReducer
- DEIPJob: the DEIP Job described in Subsection 4.3.2.
  - Map task class: DEIPMapper
  - Combine task class: DEIPReducer
  - Reduce task class: DEIPReducer
- CRMetricsJob: the Metrics Job described in Subsection 4.3.2.
  - Map task class: CRMetricsMapper
  - Combine task class: CRMetricsCombiner
  - Reduce task class: CRMetricsReducer
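The Dangling Entities Cache created by the initialization job relies on the fact that HBase keeps rows sorted by the unsigned, byte-wise order of their keys. The following sketch shows why a 16-byte FF prefix places cache rows after every plain 16-byte entity ID. The class `DanglingCacheKeySketch` and its method names are hypothetical, and the exact key layout used by the real job beyond "prefix followed by ID" is an assumption.

```java
import java.util.Arrays;

/** Sketch of how an FF-valued prefix pushes row keys to the end of a sorted table. */
final class DanglingCacheKeySketch {
    static final int PREFIX_LENGTH = 16;

    /** Builds a cache row key: 16 bytes of 0xFF followed by the entity ID. */
    static byte[] cacheKey(byte[] entityId) {
        byte[] key = new byte[PREFIX_LENGTH + entityId.length];
        Arrays.fill(key, 0, PREFIX_LENGTH, (byte) 0xFF);
        System.arraycopy(entityId, 0, key, PREFIX_LENGTH, entityId.length);
        return key;
    }

    /** True if the row key is longer than the prefix and starts with 16 FF-valued bytes. */
    static boolean isCacheKey(byte[] rowKey) {
        if (rowKey.length <= PREFIX_LENGTH) {
            return false;
        }
        for (int i = 0; i < PREFIX_LENGTH; i++) {
            if (rowKey[i] != (byte) 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Unsigned lexicographic comparison of two row keys. */
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        byte[] id = new byte[16];
        Arrays.fill(id, (byte) 0x7A); // hypothetical MD5 value
        byte[] key = cacheKey(id);
        System.out.println(isCacheKey(key) + " " + (compare(id, key) < 0));
    }
}
```

Even an entity ID made only of FF bytes sorts before its cache key, because it is a strict prefix of that key, so a scan that starts at the 16-byte FF prefix only encounters cache rows.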
5.2.2
Utility Jobs
Currently there is only one utility job, which writes the top of the entities by CodeRank to an HDFS text file. Entities are sorted in descending order and each one is stored on a line. There are four tab-separated columns:

- CodeRank (as a subunit value, not a percentage)
- entity ID (hex representation of the MD5 hash)
- entity type
- FQN (Fully-Qualified Name)

The job class name and the corresponding Map and Reduce class implementations are as follows:

- CRTop
  - Map task class: CRTopMapper
  - Reduce task class: none
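The four-column line format can be illustrated as follows. The class `TopLineSketch`, the lowercase hex digits and the use of `Double.toString` for the rank are assumptions made for this sketch; the text only fixes the column order and the tab separator.

```java
import java.util.Locale;

/** Sketch of one output line: CodeRank, entity ID, entity type and FQN, tab separated. */
final class TopLineSketch {

    /** Hex representation of an ID, two digits per byte (lowercase is an assumption). */
    static String toHex(byte[] id) {
        StringBuilder sb = new StringBuilder(id.length * 2);
        for (byte b : id) {
            sb.append(String.format(Locale.ROOT, "%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Joins the four columns with tabs; the rank is a fraction of 1, not a percentage. */
    static String format(double codeRank, byte[] entityId, String entityType, String fqn) {
        return String.join("\t", Double.toString(codeRank), toHex(entityId), entityType, fqn);
    }

    public static void main(String[] args) {
        byte[] id = { 0x00, (byte) 0xFF, 0x1A }; // shortened, hypothetical ID
        System.out.println(format(0.0472, id, "PACKAGE", "java.lang"));
    }
}
```

A consumer can split each line on the tab character and parse the first column as a double to recover the rank.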
5.3
Database Querying Tools
Database querying tools have their source code in the Eclipse project distributed-database, main class edu.nus.soc.sourcerer.ddb.tools.Main. When running the tools at the command line, their names and their arguments are prefixed by a double minus (--). The following list presents the tools. Parameter --help, for any tool, prints usage information, including an explanation of the arguments. All tools that work with the HBase database support --hbase-table-prefix, which adds the specified prefix to the names of all tables that are going to be accessed. Argument --properties-file can be used to pass a Java properties file where predefined arguments are stored as key-value pairs, separated by the equals sign (=).

- --retrieve-projects: search projects from the database
  - --pt: project type as upper case string
  - --pid: project ID as a hex of the MD5 hash
- --retrieve-files: search files from the database
  - --pid: project ID as a hex of the MD5 hash
  - --ft: file type as upper case string
  - --fid: file ID as a hex of the MD5 hash
- --retrieve-entities: search entities from the database
  - --eid: entity ID as a hex of the MD5 hash
  - --et: entity type as upper case string
  - --fqn: fully-qualified name
  - --fqn-prefix: fully-qualified name prefix
  - --pid: project ID as a hex of the MD5 hash
  - --fid: file ID as a hex of the MD5 hash
  - --ft: file type as upper case string
- --retrieve-relations: search relations from the database
  - --rid: relation ID as a hex of the MD5 hash
  - --seid: source entity ID as a hex of the MD5 hash
  - --teid: target entity ID as a hex of the MD5 hash
  - --rk: relation kind as upper case string, composed of relation type and relation class separated by a double colon
  - --fqn: fully-qualified name
  - --fqn-prefix: fully-qualified name prefix
  - --pid: project ID as a hex of the MD5 hash
  - --fid: file ID as a hex of the MD5 hash
  - --ft: file type as upper case string
- --retrieve-relations-by-source: retrieve relations by source entity ID from the relations column family of the entities_hash table
  - --eid: entity ID as a hex of the MD5 hash
- --retrieve-code-rank: prints the CodeRank of an entity by its ID
  - --eid: entity ID as a hex of the MD5 hash
5.4
Database Utility Tools
Utility tools share the same main class as the querying tools. The following list presents the tools:

- --initialize-db: tool used to initialize the HBase database by creating the tables
  - --empty-existing: if a table already exists, it is emptied (by deleting it and creating it again)
  - --update-existing: if a table already exists, its configuration and column families definition are updated if necessary. This feature is not currently implemented
- --import-mysql: imports data from an old SourcererDB database, based on MySQL
  - --database-url
  - --database-user
  - --database-password
5.5 Database Indexing Tools
Database indexing tools are used to duplicate data in multiple tables for efficient retrieval, as discussed in Section 5.1.4, and have their main classes located in the distributed-database project, in package edu.nus.soc.sourcerer.ddb.mapreduce. Tools that use Hadoop rely on a different library for parsing command-line arguments: the other database tools rely on the Sourcerer library, whereas the database indexing and CodeRank tools rely on the Apache Commons library. Each Hadoop tool has its own main class, and its arguments have both a short one-letter form prefixed by one hyphen (-) and a long form prefixed by two hyphens (--). The arguments common to all these tools are described in Table 5.1.

Table 5.1: Common CLI arguments for Hadoop tools (CodeRank and database indexing tools)

Long arg.             Short arg.  Description
--hbase-table-prefix  -p          Prefix added to HBase table names
--debug               -d          Turn on debug
In order to index entities, the tool with the main class EntitiesIndexer must be used. To index relations, the tool with the main class RelationsIndexer must be used.
5.6 CodeRank Tools
CodeRank tools are used for CodeRank calculation and have their main classes located in the distributed-database project, in package edu.nus.soc.sourcerer.ddb.mapreduce. Being Hadoop applications, they also use the Apache Commons CLI argument-parsing library, as explained in the previous section. The most important tool is the one that calculates CodeRanks for all entities. Its main class is CodeRankCalculator and its command-line arguments are described in Table 5.2.
The arguments --num-iter and --entities-count are mandatory. If the initialization job needs to be run, as explained in Subsection 5.2.1, the --init argument must be set. To iterate the algorithm until a tolerance is reached, the --tolerance argument must be provided with a small floating-point value; this also requires setting --metric-euclidian-distance. Setting any argument that starts with --metric- causes the Metric Job to be used instead of the DEIP Job, as discussed in Subsection 4.3.2. Performance is affected, but this is the only option when metric calculation or iteration until a tolerance is reached is required.

Another tool, with the main class CodeRankUtil, is used exclusively to calculate metrics or to output to HDFS a text file with the top entities by CodeRank. Its CLI arguments are described in Table 5.3.
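The interplay between the iteration limit, the tolerance and the teleportation probability can be sketched on a single machine. The code below is a minimal in-memory sketch, not the MapReduce implementation. It assumes the standard PageRank-style update with uniform teleportation and uniform redistribution of the rank held by dangling entities, and it reads DEIP as the sum of the ranks of dangling entities. The class name CodeRankSketch, the adjacency-array input and the convention that a tolerance of 0 disables early stopping are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical in-memory sketch of the iteration controlled by
// --num-iter, --tolerance and --teleportation-probab.
final class CodeRankSketch {
    private CodeRankSketch() {}

    /** Performs one iteration; outLinks[i] holds the indices of the entities that entity i points to. */
    static double[] iterate(int[][] outLinks, double[] rank, double teleport) {
        int n = rank.length;
        // Rank held by dangling entities (no outgoing relations), read here as the DEIP value.
        double deip = 0.0;
        for (int i = 0; i < n; i++) {
            if (outLinks[i].length == 0) {
                deip += rank[i];
            }
        }
        double[] next = new double[n];
        // Every entity receives the teleportation share plus an equal share of the dangling rank.
        Arrays.fill(next, teleport / n + (1.0 - teleport) * deip / n);
        for (int i = 0; i < n; i++) {
            for (int target : outLinks[i]) {
                next[target] += (1.0 - teleport) * rank[i] / outLinks[i].length;
            }
        }
        return next;
    }

    /** Euclidean distance between two rank vectors of equal length. */
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Runs at most numIter iterations; a positive tolerance enables early stopping. */
    static double[] run(int[][] outLinks, int numIter, double teleport, double tolerance) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int iter = 0; iter < numIter; iter++) {
            double[] next = iterate(outLinks, rank, teleport);
            double distance = euclideanDistance(rank, next);
            rank = next;
            if (tolerance > 0.0 && distance <= tolerance) {
                break;
            }
        }
        return rank;
    }
}
```

Under these assumptions the ranks always sum to 1, which is the property that the --metric-coderanks-sum check relies on. The distance computation requires comparing two full rank vectors after every iteration, which gives an intuition for why requesting a metric costs additional work.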
Table 5.2: Common CLI arguments for CodeRankCalculator tool

Long arg.                    Short arg.  Description
--num-iter                   -n          The number of CodeRank iterations to run
--init                       -i          Initialize the database before CodeRank calculation. Required if the tables have just been populated
--entities-count             -c          The number of entities
--teleportation-probab       -r          Probability of jumping from one entity to another random one. Defaults to 0.15
--metric-euclidian-distance  -e          Calculate the Euclidean distance between the current CodeRank vector and the previous one. Use the --tolerance / -t argument to set the distance at which computation should stop
--tolerance                  -t          Euclidean distance between the current iteration and the previous one which stops computation if it is reached. Requires setting the --metric-euclidian-dist / -e argument
--metric-coderanks-sum       -s          Calculate the sum of CodeRanks for all entities. Should be close to 1 if the computation was correct
--metrics-output             -o          Output directory in HDFS where metrics should be saved (one file for each iteration). This argument is ignored if no metric calculation is requested. One of the arguments --metric-euclidian-dist / -e or --metric-coderanks-sum / -s should be set. Output file(s) will contain by default an additional metric, which is DEIP (dangling entities inner product)
Table 5.3: Common CLI arguments for CodeRankUtil tool

Long arg.                Short arg.  Description
--coderank-top           -T          Generate a file with the top of all entities by CodeRank
--metric-euclidian-dist  -e          Calculate the Euclidean distance between the current CodeRank vector and the previous one
--metric-coderanks-sum   -s          Calculate the sum of CodeRanks for all entities. Should be close to 1 if the computation was correct
--metric-deip            -D          Calculate DEIP (Dangling Entities Inner Product)
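The kind of output requested with --coderank-top can be pictured with a small in-memory sketch that selects the highest-ranked entities from a map of fully-qualified names to CodeRank values. This is not how CodeRankUtil works on the cluster, where the data lives in HBase and the result is written to HDFS. The class name CodeRankTop and the bounded min-heap approach are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical in-memory sketch of a "top entities by CodeRank" selection.
final class CodeRankTop {
    private CodeRankTop() {}

    // Orders entries by ascending CodeRank.
    private static final Comparator<Map.Entry<String, Double>> BY_RANK =
            (a, b) -> Double.compare(a.getValue(), b.getValue());

    /** Returns the n entries with the highest CodeRank, highest first. */
    static List<Map.Entry<String, Double>> top(Map<String, Double> ranks, int n) {
        // Min-heap holding at most n entries; the lowest kept rank sits at the head.
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(BY_RANK);
        for (Map.Entry<String, Double> entry : ranks.entrySet()) {
            heap.add(entry);
            if (heap.size() > n) {
                heap.poll(); // Drop the currently lowest-ranked entry.
            }
        }
        List<Map.Entry<String, Double>> result = new ArrayList<>(heap);
        result.sort(BY_RANK.reversed());
        return result;
    }
}
```

Keeping only n entries in the heap means that memory use depends on the size of the requested top rather than on the number of entities, which matters when the number of entities is large.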
Chapter 6
Conclusions
This chapter summarizes the contributions of this work, outlines future research for the Semantic-based Code Search project and compares this work with state-of-the-art contributions.
6.1 Summary
I have chosen a cluster computing technology stack, based on Hadoop and HBase, as the basis for an Internet-scale code search and code analysis platform. Through a rigorous analysis I showed that an SQL database would not scale for our needs because of the high latencies involved. We showed that the tradeoffs made by moving to HBase, such as giving up some consistency guarantees, affect our applications in a negligible way. The system can now scale linearly just by adding new commodity hardware machines, and it benefits from the popular Hadoop MapReduce platform, which is widely used in industry and has a large community around it, both of volunteers and of companies with a commercial interest.

I have engineered an HBase database schema for the storage layer of the system. It allows basic code queries to be performed and stores the data needed to calculate Generalized CodeRank. I have shown that no single schema meets every application's needs and exemplified why the chosen schema would not be appropriate for other data access patterns.

I implemented [11] a PageRank variant for ranking code entities which, as far as we know, is unique in considering all entities during calculation and not only subsets of particular types. I validated the results with both statistical evidence and intuitive observations. These results show that CodeRank gives relevant results even when all entity types are considered during computation. The algorithm was implemented over Hadoop MapReduce and terminates in reasonable time: about 30 minutes for a Java repository of about 300 MiB. Ranked entities improve code search, as the state of the art shows [51][55].
6.2 Future Work
The next step in building our code search engine is to parallelize the extractor so that it can run on a cluster. Our idea is to use Hadoop for this purpose by running an extractor instance in each Map task. The files need to be accessible to each Map task. Putting the files directly in HDFS is not a good idea, because this file system performs well for sequential access to big files, whereas source and jar files are small. To address this issue we could write source files to HBase, because they are small enough to fit as values. Additionally, random access to particular files is possible with good performance. Jar files can embed a lot of files, hence they can grow larger, and storing them in HBase can create problems [31]. These files can be stored in HDFS as SequenceFiles, by concatenating multiple jars into one big HDFS file. Thus, sequential access is achieved for optimum MapReduce performance. The only drawback is a bigger overhead for random access to a particular jar file.

The second task that we want to accomplish in the future is scaling up the search server. As discussed in Chapter 2, Sourcerer uses Solr as a search server. Its distributed version, named Distributed Solr [30], is currently limited in comparison with single-machine Solr. We are considering using ElasticSearch [18] instead of Solr; it also uses Lucene [26] and performs better than Solr for real-time access to a large-scale index [67].

Our third plan is linked to a contribution we want to make to the code search field. We are currently investigating a way to improve the results by using code clone detection techniques and clustering.
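The idea of concatenating many small jar files into one large file, mentioned in the first plan above, can be illustrated without Hadoop. The sketch below is a plain-Java stand-in for a SequenceFile, not the Hadoop API: it writes (name, length, content) records one after another and finds a single file by scanning from the start. The class name JarPack and the record layout are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

// Hypothetical stand-in for packing small files into one sequential container.
final class JarPack {
    private JarPack() {}

    /** Writes (name, length, content) records one after another into a single container. */
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                out.writeUTF(file.getKey());
                out.writeInt(file.getValue().length);
                out.write(file.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Finds one file by reading records from the start; returns null if the name is absent. */
    static byte[] find(byte[] container, String name) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            while (in.available() > 0) {
                String key = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                if (key.equals(name)) {
                    return content;
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading all records in order is a single sequential pass, which suits MapReduce. Looking up one particular file has to read past every record stored before it, which is the random-access overhead noted above.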
6.3 Related Work
Besides Sourcerer [4], from which our system has been forked, there are several other infrastructures for code search or code analysis. Portfolio [55] is a code search system for the C programming language which focuses on the retrieval and visualization of relevant functions and their usages. Similar to my work, it implements PageRank for code, but it only uses functions and their call relation. Besides this, it also proposes a technique called SAN (Spreading Activation Network) to improve ranking. For the purpose of indexing, it uses Lucene, like Sourcerer. An older version of Sourcerer used to implement CodeRank [51], but this component is currently no longer available. Another search tool that relies on Sourcerer is CodeGenie [50], which uses test cases to search and reuse source code.

Another code search engine which uses test cases is the prototype by Reiss presented in [66]. Besides test cases and standard keyword-based retrieval techniques, it also uses contracts and security constraints. The distinctive characteristic of this engine is its ability to apply program transformations to adapt to user requirements.

Keivanloo et al. proposed another Internet-scale code search infrastructure called SE-CodeSearch [45], based on the semantic web. Instead of relying on a relational model like Sourcerer, this infrastructure uses an ontology to represent facts about source code and inference to acquire new knowledge about missing code entities.

For the purpose of querying source code based on its entities and relations, other mathematical models have been proposed besides relational algebra (used in relational databases) and description logics (used in semantic-web ontologies). It is important to notice that the storage solution proposed in this work uses a relational model, although it is built on HBase, which is not a relational database. Query languages using relational algebra have been implemented, like SemmleCode [70] and JGraLab [17]. Similar to Codd's relational algebra is Tarski's Binary Relational Calculus [69]. Grok [43], Rscript [46] and JRelCal [65] use this formalism. Other approaches, like CrocoPat [7] and JTransformer [47], use predicate logic.
Appendix A
Model Types
Table A.1: Project Types

Project Type  Description
SYSTEM        Project type used for only two core projects. One of them groups primitive types provided by the Java language and the other one unknown entities with unresolved references.
JAVA_LIBRARY  Projects associated with Java Standard Library JARs, like rt.jar.
CRAWLED       Projects downloaded by the crawler from online repositories.
JAR           All unique JARs aggregated from the CRAWLED projects are also considered a project on their own.
MAVEN         Used for Maven projects [27].

Table A.2: File Types

File Type  Description
SOURCE     Files containing Java source code from any project except SYSTEM.
CLASS      Files containing Java byte code from any project except SYSTEM and CRAWLED. Class files are extracted from jar files. The extractor ignores crawled class files which are not packed into a jar.
JAR        Jar files from CRAWLED projects.
Table A.3: Entity Types

Entity Type         Description
UNKNOWN             used for an undefined type
PACKAGE             package declaration
CLASS               class declaration
INTERFACE           interface declaration
ENUM                enum declaration
ANNOTATION          annotation declaration
INITIALIZER         for instance or static initializer declaration
FIELD               field declaration
ENUM_CONSTANT       enum constant declaration
CONSTRUCTOR         constructor declaration
METHOD              method declaration
ANNOTATION_ELEMENT  annotation type element declaration
PARAMETER           formal parameter declaration
LOCAL_VARIABLE      local variable declaration
PRIMITIVE           only used in primitives SYSTEM project
ARRAY               array declaration
TYPE_VARIABLE       type variable declaration
WILDCARD            wildcard declaration
PARAMETRIZED_TYPE   parametrized type declaration
DUPLICATE           an entity created when it is unclear exactly which type was referenced by a relation
Table A.4: Relation Types

Relation Type      Description
UNKNOWN            used for an undefined type.
INSIDE             physical containment. Example: METHOD INSIDE CLASS.
EXTENDS            class inheritance. Example: CLASS EXTENDS CLASS.
IMPLEMENTS         interface implementation or inheritance. Example: CLASS IMPLEMENTS INTERFACE or INTERFACE IMPLEMENTS INTERFACE.
HOLDS              defines the type of a field. Example: FIELD HOLDS CLASS.
RETURNS            defines the return type of a method. Example: METHOD RETURNS CLASS.
READS              a field being read. Example: METHOD READS FIELD.
WRITES             a field being written. Example: METHOD WRITES FIELD.
CALLS              method invocation. Example: METHOD CALLS METHOD.
USES               type reference. Example: METHOD USES CLASS.
INSTANTIATES       constructor invocation for object instantiation. Example: METHOD INSTANTIATES CONSTRUCTOR.
THROWS             defines a throws clause. Example: METHOD THROWS CLASS.
CASTS              defines a cast expression. Example: METHOD CASTS CLASS.
CHECKS             defines an instanceof expression. Example: METHOD CHECKS CLASS.
ANNOTATED_BY       an entity is annotated. Example: METHOD ANNOTATED_BY CLASS.
HAS_ELEMENTS_OF    defines the item type of an array. Example: ARRAY HAS_ELEMENTS_OF CLASS.
PARAMETRIZED_BY    defines the type parameters for an entity. Example: METHOD PARAMETRIZED_BY TYPE_VARIABLE.
HAS_BASE_TYPE      defines the base type of a parametrized type. Example: PARAMETRIZED_TYPE HAS_BASE_TYPE CLASS.
HAS_TYPE_ARGUMENT  defines the binding of a type parameter to a specific type. Example: PARAMETRIZED_TYPE HAS_TYPE_ARGUMENT CLASS.
HAS_UPPER_BOUND    defines the upper bound of a wildcard. Example: WILDCARD HAS_UPPER_BOUND CLASS.
HAS_LOWER_BOUND    defines the lower bound of a wildcard. Example: WILDCARD HAS_LOWER_BOUND CLASS.
OVERRIDES          defines when a method overrides a parent class/interface method. Example: METHOD OVERRIDES METHOD.
MATCHES            defines when a DUPLICATE type matches a number of types. Example: DUPLICATE MATCHES CLASS.
Table A.5: Relation Classes

Relation Class  Description
UNKNOWN         It is unknown where the target entity is.
JAVA_LIBRARY    The target entity is located in the JAVA_LIBRARY project.
INTERNAL        The target entity is in the same project.
EXTERNAL        The target entity is in an external project.
NOT_APPLICABLE  It makes no sense to classify the target entity as internal or external.
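The relation classes of Table A.5 can be read as a small decision rule over the project of the target entity. The following sketch is one possible reading, not the extractor's actual logic. The class name RelationClasses, the use of string project IDs and the order in which the rules are tried are assumptions. The sketch does not attempt to decide when NOT_APPLICABLE applies, because the table does not define a criterion for it.

```java
// Hypothetical sketch of classifying a relation by the location of its target entity.
final class RelationClasses {
    private RelationClasses() {}

    enum RelationClass { UNKNOWN, JAVA_LIBRARY, INTERNAL, EXTERNAL, NOT_APPLICABLE }

    enum ProjectType { SYSTEM, JAVA_LIBRARY, CRAWLED, JAR, MAVEN }

    /**
     * Classifies a relation from the project of the source entity and the
     * project (ID and type) of the target entity. A null target project means
     * that the location of the target is not known.
     */
    static RelationClass classify(String sourceProjectId, String targetProjectId,
                                  ProjectType targetProjectType) {
        if (targetProjectId == null || targetProjectType == null) {
            return RelationClass.UNKNOWN;
        }
        // Assumption: the JAVA_LIBRARY rule takes precedence over the same-project rule.
        if (targetProjectType == ProjectType.JAVA_LIBRARY) {
            return RelationClass.JAVA_LIBRARY;
        }
        if (targetProjectId.equals(sourceProjectId)) {
            return RelationClass.INTERNAL;
        }
        return RelationClass.EXTERNAL;
    }
}
```

Combined with a relation type from Table A.4, the resulting class gives a relation kind such as CALLS::INTERNAL, the form expected by the --rk argument of the querying tools.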
Appendix B
Top 100 Entities CodeRank
Table B.1: Top 100 Entities CodeRank (No. 1-33)

#   Entity Type         FQN                                     CodeRank
1   package             java.lang                               4.715620%
2   primitive           void                                    4.242388%
3   class               java.lang.Object                        4.059280%
4   primitive           int                                     2.293895%
5   class               java.lang.String                        2.199717%
6   primitive           boolean                                 1.116965%
7   package             java.io                                 1.034111%
8   interface           java.io.Serializable                    1.000685%
9   interface           java.lang.CharSequence                  0.388675%
10  parameterized type  java.lang.Comparable<java.lang.String>  0.374038%
11  package             java.util                               0.304947%
12  primitive           long                                    0.251881%
13  interface           java.lang.Comparable                    0.232919%
14  package             java.awt                                0.222480%
15  interface           java.lang.Cloneable                     0.221790%
16  primitive           byte                                    0.200660%
17  package             javax.swing                             0.195239%
18  type variable       <T+java.lang.Object>                    0.163133%
19  primitive           short                                   0.155148%
20  class               java.lang.Exception                     0.139547%
21  package             java.sql                                0.138024%
22  package             sun.awt.X11                             0.136395%
23  unknown             java.lang.Object                        0.136028%
24  package             org.w3c.dom                             0.130180%
25  primitive           float                                   0.108922%
26  type variable       <E+java.lang.Object>                    0.099308%
27  primitive           double                                  0.096172%
28  package             javax.accessibility                     0.094041%
29  class               java.lang.Throwable                     0.089935%
30  package             org.omg.CORBA                           0.089068%
31  package             com.lowagie.text.pdf                    0.088002%
32  class               java.util.Vector                        0.087758%
33  class               java.util.ListResourceBundle            0.087504%
Table B.2: Top 100 Entities CodeRank (No. 34-67)

#   Entity Type    FQN                                       CodeRank
34  primitive      char                                      0.083407%
35  interface      javax.accessibility.Accessible            0.080270%
36  interface      org.w3c.dom.Node                          0.079507%
37  class          javax.swing.JComponent                    0.075495%
38  package        org.xml.sax                               0.074926%
39  package        sun.org.mozilla.javascript                0.073773%
40  interface      java.util.List                            0.072944%
41  interface      java.util.EventListener                   0.071246%
42  package        org.biomage.Interface                     0.070408%
43  unknown        java.lang.String                          0.067242%
44  class          java.lang.Class                           0.066851%
45  package        net.sf.pizzacompiler.compiler             0.065243%
46  package        java.lang.reflect                         0.063781%
47  class          java.io.InputStream                       0.062156%
48  type variable  <E>                                       0.060435%
49  package        java.awt.event                            0.059796%
50  package        ru.novosoft.uml.foundation.core           0.059524%
51  array          byte[]                                    0.058651%
52  package        com.sun.java.swing.plaf.nimbus            0.058159%
53  package        org.hsqldb                                0.057687%
54  package        org.hsqldb                                0.057557%
55  class          java.awt.Color                            0.057352%
56  class          java.awt.Component                        0.056617%
57  package        javax.swing.text                          0.056141%
58  class          java.io.File                              0.055589%
59  package        com.sun.media.sound                       0.054086%
60  package        java.security                             0.052590%
61  package        com.sun.org.apache.bcel.internal.generic  0.051423%
62  package        javax.swing.plaf.basic                    0.050802%
63  package        com.ibm.db2.jcc.b                         0.050419%
64  interface      org.w3c.dom.Element                       0.049696%
65  class          java.util.Hashtable                       0.049590%
66  class          java.util.ResourceBundle                  0.049359%
67  interface      java.lang.Runnable                        0.048984%
Table B.3: Top 100 Entities CodeRank (No. 68-100)

#    Entity Type    FQN                                                   CodeRank
68   interface      java.util.Collection                                  0.048234%
69   constructor    java.lang.Object.<init>()                             0.047192%
70   package        java.awt.image                                        0.046850%
71   package        antlr                                                 0.046490%
72   interface      java.util.Map                                         0.045774%
73   class          java.sql.SQLException                                 0.045667%
74   interface      java.sql.Wrapper                                      0.045381%
75   package        javax.swing.plaf                                      0.045180%
76   interface      java.io.Closeable                                     0.045055%
77   package        java.net                                              0.044337%
78   package        xjavadoc                                              0.043616%
79   package        com.sun.org.apache.xalan.internal.xsltc.compiler      0.043496%
80   class          javax.swing.JPanel                                    0.042380%
81   package        org.w3c.dom.svg                                       0.042105%
82   package        ca.gcf.util                                           0.041985%
83   class          com.sun.java.swing.plaf.nimbus.AbstractRegionPainter  0.041813%
84   class          java.lang.RuntimeException                            0.041065%
85   class          java.lang.Integer                                     0.040916%
86   class          java.io.OutputStream                                  0.040798%
87   class          com.sun.corba.se.impl.logging.ORBUtilSystemException  0.039915%
88   class          java.io.Writer                                        0.038651%
89   class          java.awt.Container                                    0.038380%
90   package        com.lowagie.text                                      0.038072%
91   package        java.nio                                              0.037661%
92   interface      java.util.Iterator                                    0.037292%
93   array          java.lang.String[]                                    0.037119%
94   type variable  <V+java.lang.Object>                                  0.036814%
95   package        com.ibm.db2.jcc.a                                     0.036752%
96   interface      java.sql.ResultSet                                    0.036514%
97   interface      sun.awt.X11.XKeySymConstants                          0.036440%
98   class          java.util.ArrayList                                   0.036350%
99   class          com.sun.corba.se.spi.logging.LogWrapperBase           0.035903%
100  package        javax.management                                      0.035815%
Bibliography
[1] 10gen, Inc. MongoDB. http://www.mongodb.org/, August 2012.
[2] Daniel Abadi. Problems with CAP, and Yahoo's little known NoSQL system. http://dbmsmusings.blogspot.ro/2010/04/problems-with-cap-and-yahoos-little.html, April 2010.
[3] Amitanand S. Aiyer, Mikhail Bautin, Guoqiang Jerry Chen, Pritam Damania, Prakash Khemani, Kannan Muthukkaruppan, Karthik Ranganathan, Nicolas Spiegelberg, Liyin Tang, and Madhuwanti Vaidya. Storage Infrastructure Behind Facebook Messages: Using HBase at Scale. IEEE Data Eng. Bull., 35(2):4-13, 2012.
[4] Sushil Bajracharya, Joel Ossher, and Cristina Lopes. Sourcerer: An internet-scale software repository. In Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation, SUITE '09, pages 1-4, Washington, DC, USA, 2009. IEEE Computer Society.
[5] Sushil Krishna Bajracharya, Joel Ossher, and Cristina Videira Lopes. Leveraging usage similarity for effective retrieval of examples in code repositories. In Gruia-Catalin Roman and Kevin J. Sullivan, editors, SIGSOFT FSE, pages 157-166. ACM, 2010.
[6] Daniel Bartholomew. SQL vs. NoSQL. Linux Journal, 2010(195), July 2010.
[7] D. Beyer, A. Noack, and C. Lewerentz. Efficient relational calculation for software analysis. 31:137-149, 2005.
[8] Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3):669-687, December 2006.
[9] Dhruba Borthakur, Jonathan Gray, Joydeep Sen Sarma, Kannan Muthukkaruppan, Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, Dmytro Molkov, Aravind Menon, Samuel Rash, Rodrigo Schmidt, and Amitanand Aiyer. Apache Hadoop goes realtime at Facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 1071-1080, New York, NY, USA, 2011. ACM.
[10] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107-117, April 1998.
[11] Călin-Andrei Burloiu. Distributed Sourcerer code on GitHub. https://github.com/calinburloiu/Sourcerer, September 2012.
[12] Judith Burns. Google trick tracks extinctions. http://news.bbc.co.uk/2/hi/science/nature/8238462.stm, September 2009.
[13] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1-4:26, June 2008.
[14] Codase. Codase. http://www.codase.com/, September 2012.
[15] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004.
[16] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6):205-220, October 2007.
[17] Jürgen Ebert, Daniel Bildhauer, Hannes Schwarz, and Volker Riediger. Using Difference Information to Reuse Software Cases. Softwaretechnik-Trends, 27(2), 2007.
[18] Elasticsearch. Elasticsearch. http://www.elasticsearch.org/, September 2012.
[19] D. Salmen et al. Cloud data structure diagramming techniques and design patterns. https://www.data-tactics-corp.com/index.php/component/jdownloads/finish/22-white-papers/68-cloud-data-structure-diagramming, November 2009.
[20] Dietrich Featherston. Cassandra: Principles and application. http://dfeatherston.com/cassandra-cs591-su10-fthrstn2.pdf.
[21] Apache Software Foundation. Allow proper fsync support for HBase. https://issues.apache.org/jira/browse/HBASE-5954, August 2012.
[22] Apache Software Foundation. Apache Cassandra. http://cassandra.apache.org/, September 2012.
[23] Apache Software Foundation. Apache CouchDB. http://couchdb.apache.org/, August 2012.
[24] Apache Software Foundation. Apache Hadoop. http://hadoop.apache.org/, September 2012.
[25] Apache Software Foundation. Apache HBase. http://hbase.apache.org/, September 2012.
[26] Apache Software Foundation. Apache Lucene. http://lucene.apache.org/, September 2012.
[27] Apache Software Foundation. Apache Maven Project. http://maven.apache.org/, August 2012.
[28] Apache Software Foundation. Apache Software Foundation. http://www.apache.org/, September 2012.
[29] Apache Software Foundation. Apache Solr. http://lucene.apache.org/solr/, September 2012.
[30] Apache Software Foundation. Distributed Solr. http://wiki.apache.org/solr/DistributedSearch, September 2012.
[31] Apache Software Foundation. HBase FAQ Design. http://wiki.apache.org/hadoop/Hbase/FAQ_Design#A3, September 2012.
[32] Apache Software Foundation. HBase ACID Properties. http://hbase.apache.org/acid-semantics.html, September 2012.
[33] Apache Software Foundation. HBase/PoweredBy - Hadoop Wiki. http://wiki.apache.org/hadoop/Hbase/PoweredBy, September 2012.
[34] Apache Software Foundation. HDFS architecture guide. http://hadoop.apache.org/docs/r1.0.3/hdfs_design.html, August 2012.
[35] Apache Software Foundation. Powered by Hadoop wiki. http://wiki.apache.org/hadoop/PoweredBy, September 2012.
[36] Apache Software Foundation. Support hsync in HDFS. https://issues.apache.org/jira/browse/HDFS-744, August 2012.
[37] Eclipse Foundation. Eclipse. http://eclipse.org/, September 2012.
[38] Lars George. HBase: The Definitive Guide. O'Reilly, September 2011.
[39] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29-43, October 2003.
[40] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51-59, June 2002.
[41] Derrick Harris. How Facebook keeps 100 petabytes of Hadoop data online. http://gigaom.com/cloud/how-facebook-keeps-100-petabytes-of-hadoop-data-online/, September 2012.
[42] Lars Hofhansl. HBase, HDFS and durable sync. http://hadoop-hbase.blogspot.ro/2012/05/hbase-hdfs-and-durable-sync.html, May 2012.
[43] Richard C. Holt. Structural manipulations of software architecture using Tarski relational algebra. In Proceedings of the Working Conference on Reverse Engineering (WCRE'98), WCRE '98, pages 210, Washington, DC, USA, 1998. IEEE Computer Society.
[44] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC'10, pages 11-11, Berkeley, CA, USA, 2010. USENIX Association.
[45] Iman Keivanloo, Laleh Roostapour, Philipp Schugerl, and Juergen Rilling. SE-CodeSearch: A scalable Semantic Web-based source code search infrastructure. In Proceedings of the 2010 IEEE International Conference on Software Maintenance, ICSM '10, pages 1-5, Washington, DC, USA, 2010. IEEE Computer Society.
[46] Paul Klint. How understanding and restructuring differ from compiling - a rewriting perspective. In Proceedings of the 11th IEEE International Workshop on Program Comprehension, IWPC '03, pages 2, Washington, DC, USA, 2003. IEEE Computer Society.
[47] Gunter Kniesel and Uwe Bardey. An analysis of the correctness and completeness of aspect weaving. In Proceedings of the 13th Working Conference on Reverse Engineering, WCRE '06, pages 324-333, Washington, DC, USA, 2006. IEEE Computer Society.
[48] Koders. Koders. http://koders.com/, September 2012.
[49] Krugle. Krugle. http://krugle.com/, September 2012.
[50] Otávio Augusto Lazzarini Lemos, Sushil Krishna Bajracharya, and Joel Ossher. CodeGenie: a tool for test-driven source code search. In Richard P. Gabriel, David F. Bacon, Cristina Videira Lopes, and Guy L. Steele Jr., editors, OOPSLA Companion, pages 917-918. ACM, 2007.
[51] Erik Linstead, Sushil Bajracharya, Trung Ngo, Paul Rigor, Cristina Lopes, and Pierre Baldi. Sourcerer: mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery, 18:300-336, 2009.
[52] Amazon Web Services LLC. Amazon S3. http://aws.amazon.com/s3/, August 2012.
[53] Karma Snack LLC. Search engine market share. http://www.karmasnack.com/about/search-engine-market-share/, September 2012.
[54] M. Loukides. What is data science? http://radar.oreilly.com/2010/06/what-is-data-science.html, August 2012.
[55] Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 111-120, New York, NY, USA, 2011. ACM.
[56] memcached. memcached. http://memcached.org/, August 2012.
[57] neo4j.org. Neo4j graph database. http://neo4j.org/, August 2012.
[58] Michael Nielsen. Using MapReduce to compute PageRank. http://michaelnielsen.org/blog/using-mapreduce-to-compute-pagerank/, January 2009.
[59] University of California, Irvine. Sourcerer code on GitHub. https://github.com/sourcerer/Sourcerer, September 2012.
[60] University of California, Irvine. SourcererDB web page. http://sourcerer.ics.uci.edu/sourcerer-db.html, September 2012.
[61] Oracle. MySQL. http://www.mysql.com/, September 2012.
[62] Joel Ossher, Sushil Bajracharya, Erik Linstead, Pierre Baldi, and Cristina Lopes. SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects. In Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, MSR '09, pages 183-186, Washington, DC, USA, 2009. IEEE Computer Society.
[63] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.
[64] Diego Puppin and Fabrizio Silvestri. The social network of Java classes. In Proceedings of the 2006 ACM Symposium on Applied Computing, SAC '06, pages 1409-1413, New York, NY, USA, 2006. ACM.
[65] Peter Rademaker. Binary relational querying for structural source code analysis. 2008.
[66] Steven P. Reiss. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering, ICSE '09, pages 243-253, Washington, DC, USA, 2009. IEEE Computer Society.
[67] Ryan Sonnek. Realtime Search: Solr vs Elasticsearch. http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/, May 2011.
[68] Michael Stonebraker. SQL databases v. NoSQL databases. Commun. ACM, 53(4):10-11, April 2010.
[69] A. Tarski. On the calculus of relations. Journal of Symbolic Logic, 6(3):73-89, September 1941.
[70] Mathieu Verbaere, Elnar Hajiyev, and Oege de Moor. Improve software quality with SemmleCode: an Eclipse plugin for semantic code search. In Richard P. Gabriel, David F. Bacon, Cristina Videira Lopes, and Guy L. Steele Jr., editors, OOPSLA Companion, pages 880-881. ACM, 2007.
[71] Yana Volkovich, Nelly Litvak, and Debora Donato. Determining factors behind the PageRank log-log plot. In Proceedings of the 5th International Conference on Algorithms and Models for the Web-Graph, WAW'07, pages 108-123, Berlin, Heidelberg, 2007. Springer-Verlag.
[72] Tom White. Hadoop: The Definitive Guide (third edition). O'Reilly, Yahoo! Press, January 2012.
[73] Wikipedia. Big data. http://en.wikipedia.org/wiki/Big_data, August 2012.
