2015 International Conference on Information and Communication Technology Research (ICTRC2015)

Hardware Accelerators for Information Retrieval and Data Mining

Valery Sklyarov, Iouliia Skliarova, João Silva
Department of Electronics, Telecommunications and Informatics/IEETA,
University of Aveiro, Aveiro, Portugal

Alexander Sudnitson, Artjom Rjabov
Department of Computer Engineering,
Tallinn University of Technology, Tallinn, Estonia

Abstract: Many algorithms in informatics require a set of objects with similar properties
to be grouped (clustered) on the basis of some predefined criteria. The proposed technique
involves hierarchical merging in which software, responsible for solving the entire problem,
is enhanced with highly parallel networks in hardware accelerators. Additional improvements
are achieved with the aid of supporting methods, namely sorting and verification of object
intersections, which may also be used autonomously for other types of information processing
and database management. It is shown and experimentally confirmed that the proposed solutions
are efficient. They can be used in such areas as health care, statistical data manipulation
and so on.

Keywords: hardware accelerator; FPGA; parallel processing; information retrieval; data mining

978-1-4799-8966-9/15/$31.00 ©2015 IEEE, pp. 202-205

I. INTRODUCTION
Information retrieval is required in many algorithms. Let us consider some examples.
Clustering is an activity that permits a given set of objects with similar properties to be
grouped [1]. Hierarchical clustering methods represent a major technique allowing the desired
set to be built through searching common attributes and combining objects with such
attributes. Similar problems arise in statistical data manipulation and other data mining
algorithms. In [2] an analogy to a shopping cart is shown and discussed. A basket is a set of
items purchased at one time. A frequent item is an item that often occurs in a database. A
frequent set is a set of items that often occur together in the same basket. A researcher can
request a particular support value and find the items which occur together in a basket either
a maximum or a minimum number of times within the database [2]. Similar problems arise in
determining frequent Internet queries, customer transactions, credit card purchases, etc.,
which produce very large volumes of data in the span of a day [2]. Computing subsets of the
most frequent or the least frequent items in large data sets permits the relevant data mining
algorithms to be simplified and accelerated. Sorting is involved in many known algorithms
from this area (e.g. [3-5]).

Let us consider the bottom-up clustering method [1], which begins with L objects that are
merged into the desired number of clusters (K) in such a way that at the first step individual
objects are merged and in subsequent steps the previously built clusters are also joined in
accordance with some predefined criteria. Similar steps are executed in merge sort [6]: 1) at
the first step individual data items are merged into 2-item blocks; 2) at the second step
2-item blocks are merged into 4-item blocks, and this process is continued until a single
sorted set is built. Acceleration can be achieved if some initial steps are executed faster
and subsequent merging starts from relatively large blocks instead of individual items. A
similar fast partial merge can be used in the bottom-up clustering method [1]. Note that
discovering common attributes is significantly simplified if such attributes are presorted
for objects and clusters.

We suggest the following technique:
1. Pre-processing in hardware accelerators, which enables core subtasks to be done
   significantly faster than in software. Note that the complete merge in hardware cannot be
   done in practice due to the large volume of data to be processed and constraints that are
   common to hardware accelerators (e.g. limits on the available resources).
2. Subsequent processing in software beginning with relatively large blocks instead of
   individual objects.

The remainder of the paper is organized in six sections. Section II describes the background.
Section III discusses the proposed clustering method. Section IV is dedicated to sorting of
individual items, which has to be done in the proposed method. Section V explains how
individual objects are merged into clusters. Section VI reports the results of implementation
and experiments. The conclusion is given in Section VII.

II. BACKGROUND
Combining capabilities of software and hardware permits many characteristics of developed
applications to be improved. The earliest work in this direction was done at the University
of California at Los Angeles [7]. The idea was to create a fixed + variable structure
computer and to augment a standard processor by an array of reconfigurable logic, assuming
that this logic can be utilized to solve some processor tasks faster and more efficiently.
Such a combination of the flexibility of software with the speed of hardware was considered
as a new way to evolve higher performance computing from any general-purpose computer. The
level of technology in 1959-1960 did not permit this idea to be put into practice. Today a
very similar technique has been implemented in Zynq-7000 microchips combining
high-performance dual-core Cortex-A9 processors, embedded blocks, and advanced
reconfigurable logic [8]. Existing design tools such as the Software Development Kit [9] and
Vivado Design Suite [10] permit the complete system to be developed, implemented and tested.
Two available books [11,12] dedicated to the design techniques for Zynq-7000 present all
necessary details. Acceleration of software with the aid of hardware can also be done for a
general-purpose PC interacting with accelerators through the PCI express bus. Two-level
systems including the level of general-purpose software in a PC and the level of
application-specific hardware implemented in Field-Programmable Gate Arrays (FPGA) are widely
used in practical designs such as [13]. Three-level systems (PC, Zynq Processing System (PS),
and Zynq Programmable Logic (PL)) can also be used, and the functionality of such systems is
described in [12].

Software/hardware co-design for data sort is studied in detail in [14-19]. It is shown in
[17] that the fastest known even-odd merge and bitonic merge circuits [20] are very resource
consuming and can only be used effectively in existing FPGAs for sorting very small data
sets. An alternative solution is based on an iterative even-odd transition network [17] that
is very regular and can be implemented efficiently in FPGAs for larger data sets. Besides,
for many practical applications the effective throughput is higher than in other known
networks, which is demonstrated in [17] by the results of numerous experiments in FPGA. The
size of blocks (subsets of data) sorted in FPGA is increased, but it is still small and
constrained by the available FPGA resources. Additional experiments were done with the
expensive prototyping system VC707 from Xilinx [21], but even using the advanced FPGA
XC7VX485T from the Xilinx Virtex-7 family, the maximum size N of one block is limited to
4,096 32-bit items. To sort more data items with significantly cheaper microchips, we propose
to split the problem into two parts [17], one executed in the reconfigurable logic and the
other in software running on embedded high-performance processors. Zynq All Programmable
Systems-on-Chip (APSoC) are very appropriate for such decomposition.

III. THE METHOD
The following steps have been applied for the proposed hierarchical bottom-up clustering
method:
1. Objects are collected and saved in memory by the PS. Depending on the number of objects,
   either on-chip memory (OCM) or external DDR memory is used, and it is accessed from the PS
   and from the PL (the memory is shared by the PS and PL).
2. Attributes for each object are read from memory by the PL, sorted using fast and highly
   parallel iterative networks from [17], and the sorted attributes are copied back to the
   same memory.
3. Depending on available resources and performance requirements, subsequent steps can be
   either executed in software of the PS or partitioned once again between software of the PS
   and hardware of the PL.
4. Final steps for producing the desired number of clusters are done in software of the PS.

Note that in practical cases (such as those reported in [1]) the number of attributes (for
objects that have to be clustered) is limited and they can be completely sorted in the PL.
However, we would like to develop a data sorter that could also be used for significantly
more complex tasks, such as sorting millions of data items. Thus, the following two
directions were proposed and studied:
1. Sorting just in the PL assuming that hardware resources are sufficient.
2. Sorting blocks in the PL and subsequent merge of the sorted blocks in the PS.

The second direction will be described below since it incorporates the first one.

IV. SORTING INDIVIDUAL ITEMS
If an initial set of data items requires more resources than are available in the PL, then
the set is divided into blocks that can be sorted in the PL with the aid of iterative
networks [17] and further merged in the PS. For example, the APSoC available on ZedBoard [22]
can sort up to N=512 32-bit data items [12]. The blocks are saved in memory, which can be
either DDR or OCM. The PL reads blocks from memory, sorts them by the iterative network [17],
and copies the sorted blocks to the same location in memory (i.e. unsorted data are replaced
with the sorted data). As soon as the sorted subsets are ready, the PL generates an interrupt
to the PS indicating that further processing can be started. The PS reads the sorted blocks
from memory and merges them, producing the results of sorting. If merging is not needed, it
is simply skipped and the problem becomes less complicated.

Let us compare the hardware/software sorter described above with a sorter that is implemented
in software only with the C function qsort. The comparison is done as follows (see Fig. 1):
1. Two copies of data are prepared in the DDR memory and kept in different memory areas.
2. The first copy is used by the software only sorter and the second copy by the developed
   hardware/software sorter.
3. Execution times are measured and compared, and conclusions are drawn. All the involved
   communication overheads are taken into account.

The results will be reported in section VI.

V. MERGING OBJECTS
We mentioned in section III that merging can be done in software (in the PS) and in
hardware/software (in the PL and PS). In both cases, at the beginning, the number of common
attributes is found as shown in Fig. 2. Let Bi and Bj be sorted attributes for objects Oi and
Oj. The component C in Fig. 2 compares the smallest values (bi and bj) in the sets Bi and Bj.
If bi = bj then the number of common attributes (which initially is
equal to 0) is incremented and the values of both Bi and Bj are shifted, allowing the next
pair of smallest values to be compared. If bi < bj then only the values of Bi are shifted and
if bi > bj then only the values of Bj are shifted. This process is finished as soon as any
set (Bi, Bj or both) is empty. It permits the number of common attributes for objects Oi and
Oj to be found.

[Fig. 1: block diagram. Software in the PS: C function rand, which generates the items of an
unsorted set randomly; sorting data in software only by the C function qsort;
application-specific merge function running in software; interaction with hardware including
an interrupt handler. Hardware in the PL: iterative sorting network [17] for N items. Memory:
two identical areas with L 32-bit data items each, connected by data transfer; L varies from
N to 38,400,000.]
Fig. 1. Software/hardware architecture for sorting blocks in the PL and merging sorted blocks
in the PS.

[Fig. 2: the comparator C receives the smallest values bi and bj of the sets Bi and Bj. If
bi < bj: shift Bi. If bj < bi: shift Bj. If bj = bi: increment the counter (+1) and shift Bi
and Bj.]
Fig. 2. Counting the number of common attributes.

Similar operations (that are shown in Fig. 2 for individual objects) are executed for
partially merged clusters. In this case the sets Bi and Bj contain attributes that are
included in all objects of the first cluster (for the set Bi) and all objects of the second
cluster (for the set Bj).

Final merging (that is based on the number of common attributes) is done in software of the
PS as follows. At the first step, objects that have the maximum number of common attributes
are sequentially merged, i.e. two objects that have the maximum number of common attributes
are merged, then among the remaining objects two objects that have the next maximum number of
common attributes are merged, etc. until all pairs of objects are merged. Similar operations
are executed for partially merged clusters. The number of final clusters is predefined by the
requirements.

VI. IMPLEMENTATION AND EXPERIMENTS
Initial data are either generated randomly in software of the PS with the aid of the C
language rand function (see Fig. 1) or prepared in the host PC. In the latter case data may
be either randomly generated by rand or other functions or copied from benchmarks. Files with
data items are copied from the host PC to the APSoC using projects from [12]. Two separate
scenarios were tested: sorting and clustering.

Sorting is done completely in Zynq APSoCs available on ZyBo [23] and ZedBoard [22] using the
methods described in section IV. The results are verified in software running either in the
PS or in the host PC. Functions for verification of the results are given in [12].
Verification time is not taken into account in the measurements below.

Execution time for sorting different sets of data was measured. Fig. 3 presents the results
of experiments with software only and software/hardware data sorters and reports the achieved
acceleration. Synthesis and implementation of the projects were done in Xilinx Vivado 2014.4
and SDK 2014.4.

Fig. 3. Experiments with software only and software/hardware sorters (acceleration of
hardware/software sorters over software only sorters is by a factor ranging from 4.08 to
1.13).

Objects for clustering were either created using examples from [1] or generated randomly. We
found that the most time consuming part is the sort of attributes, which is done
significantly faster in hardware than in software. In all the design cases just one block is
sorted, and the acceleration obtained by using larger blocks, i.e. by processing up to 512
attributes (instead of the 64 attributes shown in Fig. 3), ranges from 4 to 10. Experiments
for up to 128 attributes were done with ZyBo [23] and for up to 512
attributes with ZedBoard [22]. The latter contains a more advanced Zynq-7000 device and
permits more complicated problems to be solved.

Finding common attributes in a set of objects (or clusters) was tested in software and in
hardware. The results differ insignificantly, mainly due to the involved communication
overheads. In all the experiments the AXI ACP (Advanced eXtensible Interface Accelerator
Coherency Port) [24] was used for communications between the PS and PL.

Future work will be continued in the following directions. We intend to consider the next two
designs:
1. PC and FPGA interacting through PCI-express, which is possible for the prototyping system
   [21]. The main advantage of the system [21] is the availability of very large programmable
   resources, which permits the complexity of designs to be increased drastically. This is
   very important for autonomous data sorters.
2. PC and Zynq (PS and PL) working in such a way that the problem is split in two levels of
   software (PC and Zynq PS) and one level of hardware (Zynq PL) operating in parallel and
   interacting through PCI-express (PC-Zynq) and high-performance (HP) on-chip interfaces
   such as AXI HP and AXI ACP in Zynq devices. This is a true three-level system. The support
   for PCI-express communications is provided in the advanced prototyping systems with Zynq
   APSoC [25].

VII. CONCLUSION
The paper describes a method for clustering objects based on the number of their common
attributes. There are two basic steps in the method: sorting of the attributes and merging
objects (partially merged clusters). The experiments show that the proposed hardware/software
solutions are faster than the software only ones (with all protocol and communication
overheads taken into account). Besides, the sorting method can be used autonomously, and this
permits very large data sets to be processed.

ACKNOWLEDGMENT
This research was supported by EU through European Regional Development Funds, the
institutional research funding IUT 19-1 of the Estonian Ministry of Education and Research,
ESF grant 9251, and Portuguese National Funds through FCT - Foundation for Science and
Technology, in the context of the projects PEst-OE/EEI/UI0127/2014 and
Incentivo/EEI/UI0127/2014.

REFERENCES
[1] G. Serban and A. Câmpan, Hierarchical Adaptive Clustering, Informatica, vol. 19, no. 1, pp. 101-112, 2008.
[2] Z.K. Baker and V.K. Prasanna, An Architecture for Efficient Hardware Data Mining using Reconfigurable Computing Systems, Proc. 14th Annual IEEE Symp. on Field-Programmable Custom Computing Machines - FCCM, Napa, USA, April 2006, pp. 67-75.
[3] S. Sun, Analysis and acceleration of data mining algorithms on high performance reconfigurable computing platforms, Ph.D. thesis, Iowa State University, 2011, available at: http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1421&context=etd.
[4] X. Wu, V. Kumar, J.R. Quinlan, et al. Top 10 algorithms in data mining, Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2014.
[5] M.F.M. Firdhous, Automating Legal Research through Data Mining, Int. Journal of Advanced Computer Science and Applications, vol. 1, no. 6, pp. 9-16, 2010.
[6] D.E. Knuth, The Art of Computer Programming. Sorting and Searching, vol. III, Addison-Wesley, New York, 2011.
[7] G. Estrin, Organization of Computer Systems: The Fixed Plus Variable Structure Computer, Proc. Western Joint IRE-AIEE-ACM Computer Conference, New York, May 1960, pp. 33-40.
[8] M. Santarini, Zynq-7000 EPP Sets Stage for New Era of Innovations, Xcell Journal, no. 75, 2011, available at: http://www.eetimes.com/design/programmable-logic/4217069/Zynq-7000-EPP-sets-stage-for-new-era-of-innovations.
[9] Xilinx, Inc., All Programmable SoC Software Developers Guide, available at: http://www.xilinx.com/support/documentation/user_guides/ug821-zynq-7000-swdev.pdf.
[10] Xilinx, Inc., Vivado Design Suite Guides, available at: www.xilinx.com.
[11] L.H. Crockett, R.A. Elliot, M.A. Enderwitz, and R.W. Stewart, The Zynq Book, University of Strathclyde, 2014.
[12] V. Sklyarov, I. Skliarova, J. Silva, A. Rjabov, A. Sudnitson, and C. Cardoso, Hardware/Software Co-design for Programmable Systems-on-Chip, TUT Press, 2014.
[13] B. Sukhwani, H. Min, M. Thoennes, P. Dube, B. Brezzo, S. Asaad, and D.E. Dillenberger, Database Analytics: A Reconfigurable-Computing Approach, IEEE Micro, vol. 34, no. 1, pp. 19-29, 2014.
[14] R. Mueller, J. Teubner, and G. Alonso, Sorting Networks on FPGAs, The International Journal on Very Large Data Bases, vol. 21, no. 1, pp. 1-23, 2012.
[15] M. Zuluaga, P. Milder, and M. Püschel, Computer Generation of Streaming Sorting Networks, Proc. 49th Design Automation Conference, San Francisco, June 2012, pp. 1245-1253.
[16] R. Mueller, Data Stream Processing on Embedded Devices, Ph.D. thesis, ETH, Zurich, 2010.
[17] V. Sklyarov and I. Skliarova, High-performance implementation of regular and easily scalable sorting networks on an FPGA, Microprocessors and Microsystems, vol. 38, no. 5, pp. 470-484, 2014.
[18] V. Sklyarov, I. Skliarova, A. Barkalov, and L. Titarenko, Synthesis and Optimization of FPGA-based Systems, Springer, 2014.
[19] O. Arnold, S. Haas, G. Fettweis, B. Schlegel, T. Kissinger, and W. Lehner, An application-specific instruction set for accelerating set-oriented database primitives, Proc. ACM SIGMOD International Conference on Management of Data, Utah, USA, June 2014, pp. 767-778.
[20] S.W. Al-Haj Baddar and K.E. Batcher, Designing Sorting Networks. A New Paradigm, Springer, 2011.
[21] Xilinx, Inc., VC707 Evaluation Board for the Virtex-7 FPGA User Guide, 2015, available at: http://www.xilinx.com/support/documentation/boards_and_kits/vc707/ug885_VC707_Eval_Bd.pdf.
[22] Avnet, Inc., ZedBoard (Zynq™ Evaluation and Development) Hardware Users Guide, 2014, available at: http://www.zedboard.org/sites/default/files/documentations/ZedBoard_HW_UG_v2_2.pdf.
[23] Digilent, Inc., ZyBo Reference Manual, 2014, available at: http://digilentinc.com/Data/Products/ZYBO/ZYBO_RM_B_V6.pdf.
[24] Xilinx, Inc., Zynq-7000 All Programmable SoC Technical Reference Manual, 2014, available at: http://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf.
[25] Xilinx, Inc., ZC706 Evaluation Board for the Zynq-7000 XC7Z045 All Programmable SoC User Guide, 2014.
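
As a rough illustration of the two-stage sorter of Section IV, the following C sketch models both stages purely in software. Blocks are sorted by a loop-based model of an even-odd transition network (the part the paper assigns to the PL), and the sorted blocks are then merged (the part assigned to the PS). All function names are hypothetical and are not taken from the paper or from [12], [17]. The pairwise bottom-up merge is an assumption, since the paper does not state how the PS merges the blocks. The interrupt and the shared-memory transfers are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Software model of an iterative even-odd transition network: n passes that
   alternate between pairs (0,1),(2,3),... and pairs (1,2),(3,4),...
   In hardware all comparators of one pass could work in parallel; here they
   are simply executed one after another. */
static void even_odd_transition_sort(uint32_t *a, size_t n)
{
    for (size_t pass = 0; pass < n; pass++) {
        for (size_t i = pass % 2; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1]) {
                uint32_t t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
            }
        }
    }
}

/* Merge the adjacent sorted runs src[lo..mid) and src[mid..hi) into dst. */
static void merge_runs(const uint32_t *src, uint32_t *dst,
                       size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Hypothetical two-stage sorter: sort blocks of block_size items, then merge
   the sorted blocks pairwise until one sorted set remains.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int block_sort_then_merge(uint32_t *data, size_t len, size_t block_size)
{
    assert(block_size > 0);
    for (size_t lo = 0; lo < len; lo += block_size) {
        size_t n = (len - lo < block_size) ? len - lo : block_size;
        even_odd_transition_sort(data + lo, n);
    }
    uint32_t *tmp = malloc(len * sizeof *tmp);
    if (tmp == NULL && len > 0) return -1;
    uint32_t *src = data, *dst = tmp;
    for (size_t width = block_size; width < len; width *= 2) {
        for (size_t lo = 0; lo < len; lo += 2 * width) {
            size_t mid = (lo + width < len) ? lo + width : len;
            size_t hi  = (lo + 2 * width < len) ? lo + 2 * width : len;
            merge_runs(src, dst, lo, mid, hi);
        }
        uint32_t *t = src; src = dst; dst = t;   /* swap buffer roles */
    }
    if (src != data) memcpy(data, src, len * sizeof *data);
    free(tmp);
    return 0;
}

static int cmp_u32(const void *x, const void *y)
{
    uint32_t a = *(const uint32_t *)x, b = *(const uint32_t *)y;
    return (a > b) - (a < b);
}

/* Software only reference sorter based on the C function qsort. */
void software_only_sort(uint32_t *data, size_t len)
{
    qsort(data, len, sizeof *data, cmp_u32);
}
```

Calling `block_sort_then_merge(data, len, 512)` would mimic the ZedBoard configuration mentioned in Section IV, and the result can be cross-checked against `software_only_sort` applied to a second copy of the data, in the spirit of the comparison shown in Fig. 1.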
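
The counting procedure of Section V (Fig. 2) can be sketched in C as a two-index scan over the sorted attribute sets. This is a minimal sketch, assuming that each set is stored as an ascending array without repeated values; the function name `count_common` and the array representation are illustrative and do not come from the paper. Advancing an index plays the role of shifting the corresponding set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the values that occur in both sorted arrays bi[0..ni) and bj[0..nj).
   Assumption: both arrays are in ascending order with no repeated values. */
size_t count_common(const uint32_t *bi, size_t ni,
                    const uint32_t *bj, size_t nj)
{
    assert(bi != NULL || ni == 0);
    assert(bj != NULL || nj == 0);
    size_t i = 0, j = 0;
    size_t common = 0;                 /* the counter is initially 0 */
    while (i < ni && j < nj) {         /* stop as soon as either set is empty */
        if (bi[i] == bj[j]) {          /* equal: count it and shift both sets */
            common++;
            i++;
            j++;
        } else if (bi[i] < bj[j]) {    /* bi < bj: shift Bi only */
            i++;
        } else {                       /* bi > bj: shift Bj only */
            j++;
        }
    }
    return common;
}
```

Because each comparison advances at least one index, the scan needs at most ni + nj - 1 comparisons, which is why presorting the attributes simplifies the search for common attributes.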

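
One pass of the final merging described in Section V can be sketched as a greedy pairing. Among the objects not yet paired, the two with the largest number of common attributes are selected, and this is repeated until no pair is left. The sketch is written under stated assumptions: the type `object_t`, the limit `MAX_OBJECTS`, the function names and the tie-breaking rule (the first pair found wins) are all hypothetical, and building the attribute set of a merged cluster is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OBJECTS 64          /* illustrative limit, not from the paper */

typedef struct {
    const uint32_t *attr;       /* sorted attributes of the object */
    size_t n;                   /* number of attributes */
} object_t;

/* Number of values shared by two ascending arrays without repeated values. */
static size_t count_common(const uint32_t *bi, size_t ni,
                           const uint32_t *bj, size_t nj)
{
    size_t i = 0, j = 0, common = 0;
    while (i < ni && j < nj) {
        if (bi[i] == bj[j]) { common++; i++; j++; }
        else if (bi[i] < bj[j]) i++;
        else j++;
    }
    return common;
}

/* One greedy merging pass. pair_of[k] receives the index of the partner of
   object k, or k itself if the object is left without a partner.
   Returns the number of pairs formed. */
size_t greedy_pairing(const object_t *obj, size_t count, size_t *pair_of)
{
    assert(count <= MAX_OBJECTS);
    int used[MAX_OBJECTS] = {0};
    size_t pairs = 0;
    for (size_t k = 0; k < count; k++) pair_of[k] = k;
    for (;;) {
        size_t best_i = 0, best_j = 0, best = 0;
        int found = 0;
        for (size_t i = 0; i < count; i++) {
            if (used[i]) continue;
            for (size_t j = i + 1; j < count; j++) {
                if (used[j]) continue;
                size_t c = count_common(obj[i].attr, obj[i].n,
                                        obj[j].attr, obj[j].n);
                if (!found || c > best) {
                    best = c; best_i = i; best_j = j; found = 1;
                }
            }
        }
        if (!found) break;              /* fewer than two objects remain */
        used[best_i] = used[best_j] = 1;
        pair_of[best_i] = best_j;
        pair_of[best_j] = best_i;
        pairs++;
    }
    return pairs;
}
```

For partially merged clusters the same pass could be repeated with each cluster represented by the attributes common to all of its objects, until the predefined number of clusters is reached. The sketch recomputes the counts in every round for brevity; a practical version would probably cache them.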