
BIGMat: A Distributed Affinity-Preserving Random Walk Strategy for Instance Matching on Knowledge Graphs

Ali Assi, University of Quebec At Montreal, Montreal, Canada, assi.ali@courrier.uqam.ca
Hamid Mcheick, University of Quebec At Chicoutimi, Chicoutimi, Canada, hamid_mcheick@uqac.ca
Wajdi Dhifli, University of Lille, Lille, France, wajdi.dhifli@univ-lille.fr

Abstract—Instance Matching (IM) is the process of matching instances that refer to the same real-world object (e.g., the same person) across different independent Knowledge Bases (KBs). This process is considered a key step, for instance, in the integration of KBs. In this paper, we propose BIGMat, a novel approach to the IM problem based on Markov random walks. Our approach takes into account the local and global information mutually calculated from a pairwise similarity graph. Precisely, we first build an expanded association graph consisting of pairs of IM candidates. Then, we rank each candidate pair through the stationary distribution computed from the Markov random walk on the association graph. We provide a scalable distributed implementation on top of the Spark framework and we evaluate it on benchmark datasets from the instance track of the Ontology Alignment Evaluation Initiative (OAEI). The experiments show the efficiency and scalability of our approach compared to several state-of-the-art IM approaches.

Index Terms—Data linking, Instance matching, Affinity-preserving random walk

I. INTRODUCTION

In recent years, the amounts and availability of openly accessible data have witnessed an impressive growth. Combined with the advances in algorithmic techniques for information extraction, this has facilitated the design and structuring of information, giving rise to large-scale Knowledge Bases (KBs) in a "Big Data" context. By now, the Linked Open Data (LOD) cloud comprises 1239 datasets structured as knowledge graphs and covering several domains such as geography, biology, etc. These knowledge graphs are centralized repositories represented in the Resource Description Framework (RDF). They store billions of facts about entities (e.g., persons, locations), their attributes and interdependence relationships. However, these knowledge graphs are often created independently from each other. Thus, they may introduce entities (with distinct descriptions) that co-refer to the same entity in the real world. Yet, such a connection is not explicitly defined. By establishing semantic links between entities described in these different KBs, intelligent agents are able to navigate between them as if they operated on a local integrated database. As a result, richer information is provided in response. Instance Matching (IM) [1] is defined as the process of establishing a specific kind of semantic link called the identity link. The latter is expressed by the "owl:sameAs" property of the OWL (Web Ontology Language) vocabulary. It explicitly links two instances that refer to the same entity in the real world.

In the IM process, the large amount of instances to be processed in these large-scale KBs can be very costly. As a result, besides the qualitative challenges that classic instance matching approaches face, additional quantitative and scalability challenges are introduced. To address these challenges, two opportunities can be pursued. (1) Indexing (i.e., blocking) techniques can be used to reduce the search space. Such techniques split instances into blocks (that may overlap) and execute the matching process only for the instances that lie within the same block. (2) Parallel and distributed programming models (such as Spark [2] and MapReduce [3]) constitute a complementary technique to improve the scalability of the matching process. Indeed, their distributed architectures permit partitioning the instance matching process into several matching tasks that can be executed in parallel across several servers (also called nodes or workers).

The Random Walk (RW) technique has long been established for a range of tasks including outlier detection [4], keyword extraction [5], web search [6], etc. The common idea behind all RW-based techniques is to model the input data as a stochastic graph where the vertices usually correspond to the objects in the data. Then, a RW is executed over the paths of the graph to deduce the importance of the vertices.

In this paper, we propose BIGMat (Bipartite Graph-based Instance Matching), an approach to the instance matching problem based on the affinity-preserving RW. We cast the IM problem as a graph-based node ranking [6] and selection problem in a constructed candidates association graph. Our approach takes into account the local and global information mutually calculated from a pairwise similarity graph. Precisely, we first build an expanded candidates association graph consisting of pairs (each as a graph node) of IM candidates. The edges between nodes reflect their pairwise structural harmony. Then, we rank each candidate pair through the stationary distribution vector computed from a Markov RW strategy on that graph. Candidate pairs with higher rank scores in this vector are more likely to be co-referents. This novel RW technique preserves the initial similarities of instance matching pairs. For that, an absorbing node is introduced in the candidates association graph in order to approximately separate the nodes of estimated true candidates from the false ones. We further present a bipartite graph-based strategy to filter out the final set of matched instances among the candidate nodes of pairs of instances in the association graph.

We experimentally evaluate our approach on several KBs from the benchmark instance matching tracks of OAEI 2009 and 2010. We also compare it with a wide range of existing state-of-the-art IM techniques. Furthermore, we perform a scalability test on our Spark-based implementation to show the potential speedup of BIGMat on large-scale datasets when leveraging distributed computation resources. The obtained results show the efficiency and scalability of our approach.

The remainder of this paper is structured as follows. Section II gives the preliminary definitions of the RDF data model, the instance matching problem and the random walk technique. Section III describes the affinity-preserving random walk technique (Section III-A) used for the instance matching task, as well as the association graph of candidate instance matching pairs (Section III-B). Section IV depicts the workflow of our approach as well as our bipartite graph-based solution. Evaluation and experimental results on KBs from OAEI benchmarks are reported in Section V. Section ?? presents and discusses existing related works from the literature. Finally, Section VI concludes the paper and discusses our future work.

II. PRELIMINARIES AND PROBLEM STATEMENT

A. RDF Data Model

The RDF data model [7] represents the descriptions of resources by triples of the form <s, p, o>. Accordingly, a triple <s, rdf:type, o> declares that s is an instance of o (i.e., a class).

Definition 1: (Instance matching) Given two sets of instances S and T belonging to two KBs (KB1 and KB2), the aim of IM is to discover the set M of owl:sameAs links which are not already defined in the KBs. Formally, M = {(i1, i2) | (i1, i2) ∈ S × T, <i1, owl:sameAs, i2>} where M ⊄ KB1 ∪ KB2.

Definition 2: (RDF knowledge graph) An RDF knowledge graph is a set of facts of the form <s, p, o> ∈ (E ∪ B) × P × (E ∪ L ∪ B), where E is the set of instances, B is the set of blank nodes, P is the set of predicates and L is the set of literals (basic values).

B. Random Walks on Graphs

Consider a graph G = (V, E) and an imaginary surfer s which starts a walk at time t0 at an arbitrary vertex X0 = x0 ∈ V. At time t1, the surfer is located at X1 = x1, picked uniformly at random from the 1-hop neighbors of x0. Considering time as discrete (i.e., t ∈ N), after N time steps the surfer will be at a vertex XN = xn. The sequence of visited vertices is called a simple RW in G [8]. In general, when the surfer is at vertex x at time t, the neighbor y to move toward at time t + 1 is chosen with a probability proportional to the weight w_xy of the edge (x, y) ∈ E. This means that y is not chosen uniformly at random. Therefore, this RW is equivalent to a specific family of random processes, namely Markov chains [9]. Formally:

Definition 3: (Random walk) A random walk is defined as a Markov chain with a discrete sequence of states (i.e., random variables) drawn from a discrete state space X = (Xn)n∈N = (X0, X1, ...) and a given matrix P of transition probabilities:

    P_xy = P(X_{t+1} = y | X_t = x) = w_xy / Σ_{k∈V} w_xk   if (x, y) ∈ E,   and 0 otherwise.   (1)

The normalization by Σ_{k∈V} w_xk, ∀(x, k) ∈ E, is called "democratic normalization" (equivalent to "Internet Democracy" in [10]). P can be obtained as a row-stochastic matrix (i.e., ∀x ∈ V, Σ_{y∈V} P_xy = 1) by P = D^{-1} W, where D is a diagonal matrix with D_ii = Σ_{k∈V} w_ik, ∀i ∈ V, and W is the adjacency matrix of G with W_ij = w_ij. The RW leads to a probability distribution p⃗ = (p(v))_{v∈V} at time t [10]. The value p(v) reflects the likelihood that the vertex v is visited. This probability distribution is updated at each time step until it reaches (i.e., converges to) a "stationary distribution", i.e., the probability distribution does not change anymore. The stationary distribution of the RW is determined by solving the iterative equation p⃗^{(t+1)T} = p⃗^{(t)T} × P, where p⃗^T is a row vector, until the norm ‖p⃗^{(t+1)} − p⃗^{(t)}‖ becomes equal to 0 or falls below a predefined threshold.

Since the update is done by performing moves toward neighbors, the RW can get trapped in a dead vertex (i.e., a sink vertex). To avoid such a situation, an imaginary action (i.e., a jump) can be taken by the surfer at the current vertex, with a small "restart probability" α [11]. This new move allows the surfer to jump to any randomly selected vertex in G. Such an action makes the transition matrix both irreducible and aperiodic [9]. As a result, the existence and the uniqueness of the stationary distribution (i.e., convergence) are guaranteed. This action can be implemented by defining a "personalization vector" v⃗ (of length |V|) which constrains the surfer to restart from a set of vertices, called seed(s). Only the seed vertices get values different from zero in v⃗. In the RW scenario where the surfer can jump to any vertex in the graph, each value in v⃗ is initialized to 1/|V|.

III. RANDOM WALKS FOR INSTANCE MATCHING

As the KBs are multi-digraphs, IM can be transformed into a graph matching problem between the source and target KBs given by G1 = (V1, E1) and G2 = (V2, E2), respectively. To solve this problem, we adopt the RW technique to compute the similarities among the vertices of G1 and G2. More precisely, to be able to run this technique, we create a stochastic graph
called the candidates association graph Grw (rw stands for random walk) (see Section III-B). The vertices of Grw are simply the candidate pairs between G1 and G2. The weights (also known as affinities) of the edges between the candidate pairs in Grw are encoded in a matrix W (i.e., the affinity matrix), which can be transformed into P by P = D^{-1} W. Consequently, the values in the stationary distribution vector are interpreted as vertex (i.e., candidate pair) ranking scores.

In principle, not all the vertices in Grw represent true candidate pairs. Indeed, only the ones with "high" similarity values represent true candidate pairs. They are considered inlier candidate pairs. The rest (i.e., those with "low" similarity values) represent false candidate pairs and are considered outliers. Another important remark is that the number of outliers is bigger than the number of inliers. Thus, by detecting the outliers, the inliers constitute the bounded set of candidate pairs that includes the true matches. The outliers can be detected through the "stationary distribution" vector [12]. Intuitively, the more rarely a vertex v ∈ Grw is visited by the RW, the more probably v is an outlier. In other words, the smaller the score p_t(v), the more presumably v is an outlier. In addition, the affinity sum of the outgoing edges of an outlier vertex is usually smaller than that of an inlier vertex. Thus, the democratic normalization applied to this stochastic graph will generate a distorted probability distribution vector where only few vertices get their true ranking values. This is because the affinity sum of such an outlier will be amplified whereas that of an inlier will be abated, making them both misclassified.

A. Affinity-Preserving Random Walk

The outliers are not known a priori by the surfer. Thus, the required normalization method should guide the surfer to distinguish between an outlier and an inlier vertex. This guidance can be realized by preserving the original affinity relations between the candidate IM pairs through a special stochastic graph called the "absorbing association graph", denoted by Garw = (V^arw, E^arw). Garw adds to Grw a special vertex called the "absorbing vertex", defined as follows:

Definition 4: (Absorbing vertex) An absorbing vertex, denoted by v_abs, is a trapped vertex from which, once visited, the random walk cannot move out. Formally: P_{v_abs v_abs} = P(v_abs, v_abs) = 1.

Executing the RW on Garw leads to a special Markov chain called the "Affinity-Preserving Random Walk" (APRW), which can be defined as follows:

Definition 5: (Affinity-preserving random walk) An affinity-preserving random walk on a graph Garw = (V^arw, E^arw) is an absorbing Markov chain with a discrete sequence of states drawn from a discrete state space X = (Xn)n∈N = (X0, X1, ...) and a given transition probability matrix P:

    P_xy = P(X_{t+1} = y | X_t = x) =
        w_xy / W_max                     if (x, y) ∈ E^arw
        1 − Σ_{k∈V} w_xk / W_max         if x ∈ V^arw and y = v_abs
        0                                if y ∈ V^arw and x = v_abs     (2)

where W_max is the maximum affinity weight in Garw. Note that ‖W‖₁ = max_{1≤j≤n} Σ_{i=1}^{n} |a_ij|. Note also that in practice v_abs does not need to be materialized in Definition 5, and thus the graph Garw = (V^arw, E^arw) can simply be reduced to Grw = (V^rw, E^rw).

The trick here is to find a way that prevents the surfer from moving out to another vertex once it is at an outlier vertex. In fact, the surfer will be redirected toward the absorbing vertex. In other words, once the surfer is at a vertex of Garw with a big affinity in Grw, it will less probably move toward v_abs. The surfer gets the opposite behavior if its affinity sum is small in Grw: it will more probably walk toward and be absorbed by v_abs. In this way, the inaccurate ranking scores of the outliers will be absorbed by the absorbing vertex.

With this new normalization, the affinity matrix W is transformed to P by P = W / ‖W‖₁. It is worth pointing out here that some rows in P can get a sum less than 1. In the following, we adopt the formulation used in [13] for P and the absorbing Markov chain corresponding to the APRW:

    (x^{(t+1)T}, x^{(t+1)}_{v_abs}) = (x^{(t)T}, x^{(t)}_{v_abs}) × P,   with   P = [ W/‖W‖₁    1⃗ − a⃗/‖W‖₁ ;  0⃗^T    1 ]     (3)

where W/‖W‖₁ is a |V^rw| × |V^rw| square matrix (called a sub-stochastic matrix), 1⃗_{|V^rw|×1} denotes an all-ones vector of size |V^rw|, 0⃗_{|V^rw|×1} denotes an all-zeros vector of size |V^rw|, and a⃗ = (W_1, W_2, ..., W_n)^T, 1 ≤ n ≤ |V^rw|, with W_i = Σ_{j∈V^rw} w_ij, ∀i ∈ V^rw. The probability distribution of an absorbing Markov chain admits a fixed solution p⃗ = (0⃗^T 1). It is trivial that p⃗ in this case cannot be used for vertex ranking as in [6]. To alleviate this problem, [13] defines the following conditional probability:

    x̄^{(t)}_{ia} = P(X^{(t)} = v_ia | X^{(t)} ≠ v_abs) = x^{(t)}_{ia} / (1 − x^{(t)}_{v_abs})     (4)

which refers to the distribution over unabsorbed vertices X^{(t)} = v_ia ∈ Garw at time t. Hence, the probability distribution of where the surfer stays at time t is now represented by x̄^{(t)}. The solution (i.e., stationary distribution) of equation (3) is called the "quasi-stationary distribution", denoted by the (|V^rw| × 1) vector x̄. It is obtained when x̄^{(t+1)} = x̄^{(t)}.
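To make the APRW concrete, the following is a minimal NumPy sketch of the iteration behind equations (2)–(4): the affinity matrix is normalized by ‖W‖₁, the per-row probability deficit flows to the (implicit) absorbing vertex, and the unabsorbed mass is renormalized into the quasi-stationary vector. This is an illustrative sketch assuming a small symmetric affinity matrix, not the paper's actual distributed implementation; the toy matrix and all names are ours.

```python
import numpy as np

def aprw_quasi_stationary(W, tol=1e-6, max_iter=300):
    """Rank candidate pairs by the quasi-stationary distribution of the
    affinity-preserving (absorbing) random walk. W is the |V^rw| x |V^rw|
    affinity matrix, assumed symmetric so row sums never exceed ||W||_1."""
    n = W.shape[0]
    norm = np.abs(W).sum(axis=0).max()       # ||W||_1 = maximum column sum
    P = W / norm                             # sub-stochastic block of eq. (3)
    absorb = 1.0 - P.sum(axis=1)             # leftover mass goes to v_abs
    x = np.full(n, 1.0 / n)                  # uniform start over candidate pairs
    x_abs = 0.0                              # mass captured by the absorbing vertex
    x_bar = x.copy()
    for _ in range(max_iter):
        x_new = x @ P                        # unabsorbed mass after one step
        x_abs = x_abs + x @ absorb
        x_bar_new = x_new / (1.0 - x_abs)    # eq. (4): condition on "not absorbed"
        if np.abs(x_bar_new - x_bar).sum() < tol:
            x_bar = x_bar_new
            break
        x, x_bar = x_new, x_bar_new
    return x_bar                             # ranking scores of the candidate pairs

# Toy affinity matrix: pairs 0-2 form a tight inlier clique, pair 3 is an outlier
# with a small affinity sum, so most of its mass drains into v_abs.
W = np.array([[0., 2., 2., .1],
              [2., 0., 2., .1],
              [2., 2., 0., .1],
              [.1, .1, .1, 0.]])
scores = aprw_quasi_stationary(W)
```

On this toy graph the outlier pair ends up with a far lower quasi-stationary score than the three inliers, which is exactly the separation the absorbing vertex is meant to produce.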
B. Candidates Association Graph

a) Vertices generation: We rely on the predicate values of the instances to build the IM candidate pairs (i.e., the vertices of the association graph). For this purpose, we build an inverted index for each dataset (a Source (G1) and a Target (G2)). Each index has <Token, {i1, i2, ...}> as a schema, where i is an instance in the considered dataset (we denote by s_i and t_j instances in G1 and G2, respectively). Then, we join both inverted indexes on the common Token to produce a table Tab = <Token, <{s1, s2, ..., sn}, {t1, t2, ..., tm}>>. The IM candidate pairs with their initial similarities are then generated by transforming Tab into a new matrix I of size n × m where I = <v_ij, sim_ij> and sim_ij = (1/‖s_i‖) × Σ_{w ∈ s_i ∩ t_j} TF1(w) × TF2(w) is the similarity between s_i and t_j. TF is a metric that quantifies the infrequency of a given word in the given dataset [1]: TF(w) = 1/log2(F(w)+1), where F(w) is the number of instances containing w in their description.

b) Neighbors: Two instances s and o are considered neighbors in a given KB graph if there is a statement between them (i.e., <s, p, o>). In case o is a blank node, we adopt the idea of our previous work [1] to generate the neighbors of s. In fact, we consider as neighbors of s all the leaf nodes of the k-hop paths starting from o and ending in an object node.

Algorithm 1: Computation of the weight values of the edges in the candidates association graph
Input: I: IM candidate pairs (V^rw), G1 = (V1, E1): a source dataset, G2 = (V2, E2): a target dataset, and Grw = (V^rw, E^rw): the candidates association graph between V1 and V2
Output: W: the affinity matrix of Grw
1  W ← 0
2  foreach v^k ∈ V^rw do
3      v_i ← the entity v1^k ∈ V1
4      v_a ← the entity v2^k ∈ V2
5      foreach v^p ∈ Neighbors(v^k, Grw) do
6          v_j ← the entity v1^p ∈ V1
7          v_b ← the entity v2^p ∈ V2
8          if v_j ∈ Neighbors(v_i, G1) and v_b ∈ Neighbors(v_a, G2) then
9              w_{ia;jb} ← exp((sim_ia + sim_jb)/2)
10 foreach w_{m;n} ∈ W do
11     if w_{m;n} = 0 then
12         w_{m;n} ← 1/|V^rw|
13 return W
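Algorithm 1 can be sketched in plain Python as follows. This is a simplified, non-distributed sketch, not the paper's Spark implementation; the dictionary-based inputs (`pairs`, `sim`, `neighbors1`, `neighbors2`) are illustrative assumptions, and since an edge is assumed between every two nodes, the sketch simply scans all candidate-pair combinations.

```python
from math import exp

def build_affinity_matrix(pairs, sim, neighbors1, neighbors2):
    """Algorithm 1 (sketch): affinity weights between candidate pairs.
    pairs      : list of candidate pairs (vi, va) with vi in V1, va in V2
    sim        : dict mapping a pair (vi, va) to its initial similarity sim_ia
    neighbors1 : dict mapping an instance of G1 to its set of neighbors
    neighbors2 : dict mapping an instance of G2 to its set of neighbors
    Returns W as an n x n nested list."""
    n = len(pairs)
    default = 1.0 / n                        # weight for structurally unrelated pairs
    W = [[default] * n for _ in range(n)]
    for k, (vi, va) in enumerate(pairs):
        for p, (vj, vb) in enumerate(pairs):
            if p == k:
                continue
            # two pairs reinforce each other when both of their components
            # are neighbors in their respective KB graphs
            if vj in neighbors1.get(vi, ()) and vb in neighbors2.get(va, ()):
                W[k][p] = exp((sim[(vi, va)] + sim[(vj, vb)]) / 2)
    return W

# Toy example: two candidate pairs whose components are neighbors in both KBs.
pairs = [("s1", "t1"), ("s2", "t2")]
sim = {("s1", "t1"): 0.9, ("s2", "t2"): 0.8}
neighbors1 = {"s1": {"s2"}, "s2": {"s1"}}
neighbors2 = {"t1": {"t2"}, "t2": {"t1"}}
W = build_affinity_matrix(pairs, sim, neighbors1, neighbors2)
```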
c) Edges generation: We suppose that an edge exists between every two nodes in the candidates association graph. The weights of these edges are determined in a way that reflects the intuition that if two instances form a potential co-referent pair, then there is a high possibility that their neighbors are also potential co-referent pairs. Setting the values of these edge weights is detailed in the following paragraph.

d) CA-graph construction: Formally, let us define Grw = (V^rw, E^rw) as the association graph where V^rw ⊆ V1 × V2 is the set of vertices and E^rw the set of edges between the vertices. The edges are weighted according to different schemas. Let us consider two nodes v^k = (v_i, v_a) and v^p = (v_j, v_b) in Grw. If v_j is a neighbor of v_i in G1 and v_b is a neighbor of v_a in G2, then the weight of the edge e = (v^k, v^p), denoted by w_{ia;jb}, is equal to exp((sim_ia + sim_jb)/2). Otherwise, it is equal to 1/|V^rw|. The weights w_{ia;jb} are the affinities encoded in the affinity matrix W. Algorithm 1 shows the full procedure for computing the weight values of the edges E^rw of Grw. Note that the function Neighbors identifies the set of neighbors of a query node in its graph.

e) Optimization: A naive solution to build the candidates association graph is to take all the possible instance pairs from the source and target datasets. Such a solution raises two main issues. Firstly, the potentially large size of the compared KBs yields a scalability challenge, whether in the number of nodes (i.e., n × m, where n and m are the total numbers of instances in the source and target KBs, respectively) or in the number of iterations until the RW converges. Secondly, a large number of false positive pairs can be comprised in the association graph, yielding a large number of noisy edges between the nodes which may hinder the quality of the IM results. In iterative algorithms such as RW, this drawback increases in magnitude since erroneous matching results are propagated between the nodes over the iterations. A pruning step can highly alleviate these issues.

In order to reduce the number of nodes in the CA-graph, it is possible to leverage background knowledge about the used data to restrain the search space (i.e., the number of nodes in the CA-graph). Indeed, we rely on the instances' labels to reduce the number of candidate pairs, such that two instances v_i ∈ V1 and v_a ∈ V2 (where G1 = (V1, E1) and G2 = (V2, E2) represent the graphs of a source and a target dataset, respectively) are directly considered a match if they have an identical unique label and there does not exist any other instance v_p ∈ {V1\{v_i} ∪ V2\{v_a}} that shares this label. As a result, we remove from Grw all the nodes v^k ∈ Grw that include v_i or v_a as a component. To automatically determine the predicates that could act as labels, the harmonic average of discriminability and support is used [14]. The discriminability of a property measures the diversity of its objects. The coverage of a property measures its instance-wise frequency. For instances with non-unique labels, we select for each source instance the 25% (up from the 3rd quartile) most similar target instances (based on the similarity values computed as in Section III-B.a) that share at least one word with it in their local descriptions.
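The pruning step described above can be sketched as follows: direct matches via the unique-label heuristic, then upper-quartile filtering of the remaining candidates. All names and the toy data are illustrative assumptions, not the paper's actual implementation.

```python
def prune_candidates(labels1, labels2, similarities):
    """Sketch of the CA-graph pruning step.
    labels1/labels2 : dicts mapping instances of V1/V2 to their label
    similarities    : dict mapping a source instance to a dict of
                      {target instance: initial similarity}
    Returns (matches, kept): pairs accepted directly by the unique-label
    heuristic, and the candidate pairs kept in the CA-graph."""
    # Unique-label heuristic: a label borne by exactly one source and exactly
    # one target instance yields a direct match, removed from the CA-graph.
    by_label = {}
    for inst, lab in labels1.items():
        by_label.setdefault(lab, ([], []))[0].append(inst)
    for inst, lab in labels2.items():
        by_label.setdefault(lab, ([], []))[1].append(inst)
    matches = {(s[0], t[0]) for s, t in by_label.values()
               if len(s) == 1 and len(t) == 1}
    matched = {x for pair in matches for x in pair}

    # For the rest, keep for each source only the top-25% (upper quartile)
    # most similar target instances.
    kept = set()
    for src, targets in similarities.items():
        if src in matched:
            continue
        cands = sorted((t for t in targets if t not in matched),
                       key=lambda t: targets[t], reverse=True)
        top = max(1, len(cands) // 4)
        kept.update((src, t) for t in cands[:top])
    return matches, kept

# Toy data: label "A" is unique on both sides; label "B" is ambiguous.
labels1 = {"s1": "A", "s2": "B", "s3": "B"}
labels2 = {"t1": "A", "t2": "B", "t3": "B", "t4": "C"}
similarities = {"s2": {"t1": 0.05, "t2": 0.9, "t3": 0.2, "t4": 0.1},
                "s3": {"t2": 0.3, "t3": 0.8}}
matches, kept = prune_candidates(labels1, labels2, similarities)
```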
IV. BIPARTITE GRAPH-BASED INSTANCE MATCHING

The workflow of our approach is depicted in Fig. 1. In the first phase, we determine the candidate instance pairs (i.e., nodes) based on the common tokens in their local descriptions. These nodes are refined by the unique-label heuristic defined at the end of Section III-B. Then, based on the join of both inverted indexes, we determine the vertices of the candidates association graph and set up the edges between them. The (affinity-preserving) RW phase computes the ("quasi-")stationary probability distribution vector which represents the final node ranks. In fact, the larger the rank score of a node v_xy ∈ V^arw with x ∈ G1 and y ∈ G2, the more x and y tend to be co-referents. The last phase, called bipartite graph-based post-processing, discovers the co-referent pairs from the vertices of Garw. It permits setting up a one-to-one identity link constraint to clean the instance matching result. For this purpose, multiple approaches can be adopted to obtain a match in bipartite graphs, including the Stable Marriage Problem (SMP) [15], the Hungarian algorithm [16], the Symmetric Best Match strategy (SBM) [17], or simply taking the Best Match (BM) among the targets for each source instance. BM can be seen here as a relaxed version of SBM. Simply, we first build a weighted bipartite graph Gb = (V1, V2, Eb) where the obtained score of each vertex v_ia ∈ Garw is assigned to the edge (v_i, v_a) ∈ Eb. The final set of co-referent pairs can then be obtained by applying any of the post-processing strategies mentioned above on Gb.

Fig. 1: General architecture of our RW-based instance matching approach.

V. EXPERIMENTAL EVALUATION

A. Datasets

We evaluate our approach on three benchmark datasets: PR (synthetic), DI (real) and A-R-S (real). The first two benchmarks are used in OAEI 2010 while A-R-S is used in OAEI 2009. Table I reports statistics about the considered benchmarks. From PR, we choose only the restaurant RDF datasets, which include structural and value modifications only. DI includes four RDF subsets named Sider, DrugBank, Diseasome and Dailymed, related to the health and drug domains. A-R-S includes three RDF files named eprints, rexa and dblp, related to scientific publication RDF data. This benchmark encompasses very noisy values as well as ambiguous labels (author names and paper titles). For each set, the OAEI provides a mapping file ("gold standard") including the co-referent pairs between the source and the target RDF files. The PR benchmark provides a complete gold standard that allows an accurate IM evaluation.

TABLE I: Benchmarks statistics
Benchmarks | ID | Datasets                | Source      | Target          | Gold Standard
PR         | D1 | Restaurant1-Restaurant2 | 399 URIs    | 2,256 URIs      | 112
A-R-S      | D2 | eprints-rexa            | 1,130 URIs  | 18,492 URIs     | 777
A-R-S      | D3 | eprints-dblp            | 1,130 URIs  | 2,650,832 URIs  | 554
A-R-S      | D4 | rexa-dblp               | 18,492 URIs | 2,650,832 URIs  | 1,540
DI         | D5 | Sider-Diseasome         | 16,187 URIs | 20,064 URIs     | 238
DI         | D6 | Sider-Dailymed          | 16,187 URIs | 12,515 URIs     | 349

B. Experimental Setup

Our approach is implemented in Python 3.5 and Spark 2.3. To conduct our experiments, we deploy a Linux cluster of 5 nodes. Each node has 128 GB RAM and an Intel E5-2683 v4 "Broadwell" CPU at 2.1 GHz with 32 cores. We allow paths composed of at most 2 blank nodes when we determine the neighbors in the candidates association graphs, as described in Section III-B. Following [18], we set the predefined threshold for RW convergence to 10^-6 and fix the maximum number of iterations of the RW to 300.

C. Baselines

We compare BIGMat with several IM systems which we categorize into three blocks:
• Supervised approaches: These approaches need training data and sometimes require the intervention of a domain expert. In this category, we compare with AdaBoost [19] and cLink [14].
• Unsupervised approaches: Here, the approaches do not go through a training phase but use knowledge declared in an ontology or provided by an expert. In this category, we compare with SERIMI [20], PARIS [21], VMI [22], SIGMa [23], RiMOM [24], HMATCH(I) [25], DSSIM [26] and FBEM [27].
• Semi-supervised approaches: In this category, the methods self-learn by labeling unlabeled examples and extending the labeled training set. In this category, we compare with ObjectCoref [28].

D. Results and Discussion

a) Analysis of the effect of the affinity-preserving random walk on the IM: We first analyze the effect of the affinity-preserving random walk on the accuracy of the IM process. Table II reports the obtained results for our approach with the APRW (BIGMat (APRW)) and with the traditional RW (BIGMat (RW)). We notice that APRW permits avoiding the drawback of the internet democracy problem in the IM task. This new variant of RW minimizes the effect of low-quality data (i.e., the case of eprints-rexa) and therefore increases the accuracy of the IM process. As a result, the F-measures over the considered datasets are improved by approximately 1% to 4%. Note that, in these experiments, we used the BM strategy as post-processing to discover the final match pairs.

TABLE II: F-Measure of BIGMat using random walk (RW) or affinity-preserving random walk (APRW)
Datasets       | D1   | D2   | D3   | D4   | D5   | D6
BIGMat (RW)    | 0.99 | 0.83 | 0.79 | 0.92 | 0.87 | 0.76
BIGMat (APRW)  | 1    | 0.87 | 0.81 | 0.94 | 0.88 | 0.78

b) Comparative analysis: In Table III, we report the results of multiple state-of-the-art IM approaches on selected datasets from the PR, A-R-S and DI benchmarks, and we compare our approach against them. BIGMat achieves better results compared to the other approaches. By analyzing the false matching pairs obtained on the A-R-S benchmark, we noticed that several of these instances in A-R-S were isolated nodes in the original RDF graph, and thus their neighborhood in the

candidate association graph lacked the inlier nodes that could boost the rank of such nodes. As for the DI benchmark, we notice that instances with very similar descriptions exist too. These instances also possess highly similar neighbors, which in turn have recursively similar descriptions. This scenario constitutes a challenging case for IM approaches in general and for iterative IM approaches (i.e., RW) in particular.

TABLE III: Results of the compared approaches on the PR, A-R-S and DI benchmarks
Datasets         | D1   | D2   | D3   | D4   | D5   | D6
BIGMat           | 1    | 0.87 | 0.81 | 0.94 | 0.88 | 0.78
RIMOM [24]       | 0.81 | 0.80 | 0.73 | 0.73 | 0.46 | 0.63
SERIMI [20]      | 0.77 | -    | -    | -    | 0.87 | 0.66
AdaBoost [19]    | -    | -    | -    | -    | 0.79 | 0.73
cLink [14]       | -    | -    | -    | -    | 0.82 | 0.78
PARIS [21]       | 0.91 | -    | -    | 0.91 | 0.11 | 0.15
ObjectCoref [28] | 0.90 | -    | -    | -    | 0.74 | 0.71
VMI [22]         | -    | 0.85 | 0.66 | 0.76 | -    | -
SIGMa [23]       | 0.96 | -    | -    | 0.94 | -    | -
DSSIM [26]       | -    | 0.38 | 0.13 | -    | -    | -
HMATCH(I) [25]   | -    | 0.62 | 0.65 | 0.45 | -    | -
FBEM [27]        | -    | 0.18 | 0.28 | 0.21 | -    | -

c) Efficiency evaluation: In Fig. 2, we report the scalability test of BIGMat on three datasets from OAEI 2009 and 2010. This test represents the effect of the number of used CPU cores, in the deployed cluster, on the running time and the speedup of BIGMat. In each sub-figure (in Fig. 2), the left (resp. right) vertical axis records the running time (resp. speedup). In principle, Spark creates a task for each partition of the initial data and runs the obtained tasks on the cluster. Spark best practice recommends 2 to 3 tasks per CPU core in the cluster. As a result, the number of tasks will be equal to the number of available cores (in the cluster) multiplied by the number of executed tasks per CPU core. In our experimentation, we set it to 3 and fix the number of tasks across all the dataset tests.

The results in Fig. 2 show that BIGMat benefits from the increased number of available cores. Its running time was shortened according to the number of CPU cores used in each experimentation. On the small restaurant datasets, BIGMat finished in 40 seconds. This running time is due to the overhead of the switching of tasks initialized in the Spark platform. This case shows that Spark may hinder the running time on small datasets. In contrast, for large-scale datasets as in rexa-dblp, our approach records a running time of 6 minutes, while VMI took 19min52s, RIMOM 36h34min and DSSim 20h32min, respectively. For the sider-dailymed datasets, BIGMat took 4min3s to finish its job. To the best of our knowledge, none of the state-of-the-art approaches that we compare with on sider-dailymed reports its running time.

VI. CONCLUSION

In this paper, we proposed a novel approach to the instance matching problem based on random walks. This approach represents and exploits the global interdependence among the different candidate pairs to infer the final co-referents. Our constrained CA-graph plays an important role in permitting our approach to be scalable. This scalability is furthermore increased by leveraging the distributed computation platform Spark. We established an affinity-preserving random walk technique, a variant of RW, to overcome the drawback of democratic normalization. Our approach achieves competitive results on benchmark datasets compared to state-of-the-art approaches. In our method, we break ties arbitrarily in the case where a source query instance has multiple target instances with the same rank score. For future work, we plan to integrate a semantic-based instance matching approach (as in [1]) to solve such instance assignments. It will also be interesting to integrate more pruning heuristics into our CA-graph. This will help in further reducing the running time of our approach and in removing false positives. Finally, we plan to study the effect of different post-processing strategies on the quality of the IM task in terms of F-measure.

ACKNOWLEDGEMENTS

We would like to thank Dr. Vasilis Efthymiou for the helpful comments and suggestions that considerably improved the experiments and the manuscript.

REFERENCES

[1] A. Assi, H. Mcheick, A. Karawash, and W. Dhifli, "Context-aware instance matching through graph embedding in lexical semantic space," Knowledge-Based Systems, 2019.
Fig. 2: Scalability test of BIGM AT. The left vertical axis shows the running time of the approach and the right vertical axis
shows the respective speedup.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, alignment of relations, instances, and schema,” PVLDB, vol. 5, no. 3,
“Spark: Cluster computing with working sets,” in HotCloud, 2010, pp. pp. 157–168, 2011.
10–10. [22] J. Li, Z. Wang, X. Zhang, and J. Tang, “Large scale instance matching
[3] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on via multiple indexes and candidate selection.” Knowledge-Based Sys-
large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008. tems, vol. 50, pp. 112–120, 2013.
[4] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos, “Neighborhood forma- [23] S. Lacoste-Julien, K. Palla, A. Davies, G. Kasneci, T. Graepel, and
tion and anomaly detection in bipartite graphs,” Fifth IEEE International Z. Ghahramani, “Sigma: Simple greedy matching for aligning large
Conference on Data Mining (ICDM’05), p. 8 pp., 2005. knowledge bases,” in ACM SIGKDD, 2013, pp. 572–580.
[5] R. Mihalcea and P. Tarau, “TextRank: Bringing order into text,” [24] J. Li, J. Tang, Y. Li, and Q. Luo, “Rimom: A dynamic multistrategy
in Proceedings of the 2004 Conference on Empirical Methods in ontology alignment framework,” IEEE TKDE, vol. 21, no. 8, pp. 1218–
Natural Language Processing. Barcelona, Spain: Association for 1232, 2009.
Computational Linguistics, Jul. 2004, pp. 404–411. [Online]. Available: [25] S. Castano, A. Ferrara, S. Montanelli, and D. Lorusso, “Instance
https://www.aclweb.org/anthology/W04-3252 matching for ontology population,” in SEBD, 2008, pp. 121–132.
[6] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation [26] M. Nagy, M. Vargas-Vera, and E. Motta, “Dssim - managing uncertainty
ranking: Bringing order to the web.” Stanford InfoLab, Technical Report on the semantic web,” in OM, ser. CEUR Workshop Proceedings, vol.
1999-66, November 1999. 304, 2007.
[27] H. Stoermer and N. Rassadko, “Results of okkam feature based en-
[7] G. Klyne and J. J. Carroll, “Resource description framework
tity matching algorithm for instance matching contest of oaei 2009,”
(rdf): Concepts and abstract syntax,” http://www.w3.org/TR/2004/
in OM, ser. CEUR Workshop Proceedings, P. Shvaiko, J. Euzenat,
REC-rdf-concepts-20040210/, W3C, 2004.
F. Giunchiglia, H. Stuckenschmidt, N. F. Noy, and A. Rosenthal, Eds.,
[8] L. Lovász, “Random walks on graphs: A survey,” in Combinatorics, vol. 551, 2009.
Paul Erdős is Eighty, 1996, vol. 2, pp. 353–398. [28] W. Hu, J. Chen, and Y. Qu, “A self-training approach for resolving object
[9] E. Seneta, Non-Negative Matrices and Markov Chains. Springer, 2006. coreference on the semantic web,” in WWW. ACM, 2011, pp. 87–96.
[10] A. N. Langville and C. D. Meyer, “Deeper inside PageRank,” Internet
Mathematics, vol. 1, no. 3, pp. 335–380, 2004.
[11] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu, “Automatic multi-
media cross-modal correlation discovery.” in KDD, W. Kim, R. Kohavi,
J. Gehrke, and W. DuMouchel, Eds. ACM, 2004, pp. 653–658.
[12] H. D. K. Moonesignhe and P. Tan, “Outlier detection using random
walks,” in ICTAI, Nov 2006, pp. 532–539.
[13] M. Cho, J. Lee, and K. M. Lee, “Reweighted random walks for graph
matching.” in ECCV, vol. 6315. Springer, 2010, pp. 492–505.
[14] K. Nguyen and R. Ichise, “Linked data entity resolution system en-
hanced by configuration learning algorithm,” IEICE TRANSACTIONS
on Information and Systems, vol. 99, no. 6, pp. 1521–1530, 2016.
[15] D. Gale and L. S. Shapley, “College admissions and the stability of
marriage,” The American Mathematical Monthly, vol. 69, no. 1, pp. 9–
15, 1962.
[16] J. Munkres, “Algorithms for the assignment and transportation prob-
lems,” Journal of the society for industrial and applied mathematics,
vol. 5, no. 1, pp. 32–38, 1957.
[17] S. Melnik, H. Garcia-Molina, and E. Rahm, “Similarity flooding: a ver-
satile graph matching algorithm and its application to schema matching,”
in ICDE, Feb 2002, pp. 117–128.
[18] D. Smedley, S. Khler, J. C. Czeschik, J. Amberger, C. Bocchini,
A. Hamosh, J. Veldboer, T. Zemojtel, and P. N. Robinson, “Walking
the interactome for candidate prioritization in exome sequencing studies
of Mendelian diseases,” Bioinformatics, vol. 30, no. 22, pp. 3215–3222,
07 2014.
[19] S. Rong, X. Niu, E. W. Xiang, H. Wang, Q. Yang, and Y. Yu, “A machine
learning approach for instance matching based on similarity metrics,” in
ISWC. Springer, 2012, pp. 460–475.
[20] S. Araujo, D. Tran, A. DeVries, J. Hidders, and D. Schwabe, “Serimi:
Class-based disambiguation for effective instance matching over hetero-
geneous web data.” in WebDB, 2012, pp. 25–30.
[21] F. M. Suchanek, S. Abiteboul, and P. Senellart, “Paris: Probabilistic

Вам также может понравиться