A Fast Algorithm For Classifying Seismic Events Using Distributed Computations in Apache Spark Framework

ISSN 0361-7688, Programming and Computer Software, 2020, Vol. 46, No. 1, pp. 35–48. © Pleiades Publishing, Ltd., 2020.
Russian Text © The Author(s), 2020, published in Programmirovanie, 2020, Vol. 46, No. 1.
S. E. Popov^{a,*} and R. Yu. Zamaraev^{a,**}
^a Institute of Computer Technologies, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, 630090 Russia
*e-mail: popov@ict.sbras.ru
**e-mail: zamaraev@ict.sbras.ru
Received May 14, 2019; revised August 19, 2019; accepted August 19, 2019
Abstract: The main ideas behind a software implementation of an algorithm for the fast automatic classification of seismic signals based on diagnostic patterns are described. The adaptation of this implementation to the distributed computing system Apache Spark and its integration into that system are described in detail. A software solution for the preliminary processing of the signals and for the optimization of the mathematical model for parallel computations using broadcast variables is presented. Performance tests of the classification algorithm on a set of day-long signals are carried out. The execution time of the algorithm in massively parallel mode is ten times shorter than that of sequential execution.
DOI: 10.1134/S0361768820010053
1. INTRODUCTION

The monitoring and analysis of a regional geodynamic situation is a complicated task. Indeed, along with powerful disturbances originating in the known focal zones, one has to analyze and classify a diverse stream of events, including industrial blasts of various power and charge placement depth as well as regional and local seismic events. In mining regions of Russia (Kemerovo region, Krasnoyarsk Territory, and others), there are many enterprises that regularly perform massive blasting operations. In total, more than 2.5 thousand seismic events of similar magnitude (1.5–2.5 points) can be registered yearly; this magnitude is characteristic of both regional earthquakes and typical (in terms of technology and power) blasts in opencast mines. There is an extensive network of seismic stations (e.g., ten stations in Krasnoyarsk Territory and seven in Kemerovo region), which register the data stream at an average rate of 100 samples per second (one sample every 10 ms) in each of the three measurement channels of a seismic signal. Thus, a huge amount of data is accumulated: about 150 Mb (174 million samples) in a week for one station or more than 1 Gb (1 billion samples) for several stations. Taking this into account, the detection and classification of various disturbances even in a set of week-long data obtained from 4–6 stations requires significant computing resources and time. This is also indirectly confirmed by a review of the literature devoted to the analysis of seismic signals [1–17]. In these works, a number of effective approaches are described that are based on various combinations of filtering procedures, autocorrelation methods, spectral and wavelet analysis, artificial neural networks, and cluster analysis. In the majority of such studies, the proposed algorithms are evaluated on real-life signals on timeframes not longer than 60–80 s. The signal segments containing a priori reliable significant (detectable) disturbances are considered. The execution time of the algorithms for complete day-long signal sequences is not presented. Special attention should be paid to paper [16], in which the Fingerprint And Similarity Thresholding (FAST) algorithm is used for detecting natural earthquakes; this algorithm is tested on real-life week-long signal series. An execution time of 1 hour 36 minutes is reported, while the execution time of the autocorrelation algorithm was nine days (the data were taken from a single station). Therefore, the analysis of even 2–3 week-long timeframes from several seismic stations can require one day, and other methods can require about a month.

Considering the performance of algorithms from the viewpoint of improving the efficiency of regional monitoring of seismic events, we distinguish the following main requirements: stable arrival of sets of actual seismic data, their fast processing (less than 10–12 s per full 24-hour seismic record), and analysis and drawing of expert conclusions using fast algorithms. There are many software solutions that implement various functions of processing and analysis of seismic data. The largest system for collecting, storing, and providing access to seismic data
is IRIS DMC [18]. Among the tools designed for processing and analyzing seismic data, one can distinguish software for detecting wave signal disturbances [19–21], for interactive simulation of the seismic wave field and verification of geological hypotheses in classification of seismic data [22], and software for solving the inverse geophysics problem based on open GRID systems [23–25].

However, the available software solutions do not support the classification of seismic events based on the analysis of the set of real-life day-long data obtained from different observation stations. The data are processed manually by loading files and finding short timeframes of events of interest. The classification algorithms implemented in this software cannot be executed in parallel mode for processing the signals on the entire time interval. As a result, the performance of the algorithms is insufficiently high. Most of the software solutions are static applications designed for work with specialized hardware. This significantly reduces the capabilities of the analysis, search, and confirmation of the nature of seismic events using the information from various sources.

Taking into account the above reasoning, we face an important task of developing software for classifying seismic events that supports high-performance processing of large amounts of data in computing environments in massively parallel mode.

2. A MATHEMATICAL MODEL OF THE CLASSIFICATION ALGORITHM

The classification algorithm discussed here was developed by the authors of this paper [27, 28]. In the current version of the algorithm, the classification based on the correlation function was extended by using the statistical distance functions. As a result, the resolution of the algorithm (the number of correct classifications) has been significantly improved; this is confirmed by the logs and reports of the Kemerovo geophysical monitoring service.

2.1. Initial Data

The source of the initial data for the algorithm is the seismic signal in the miniSEED format, which has three channels in the form of separate files (e.g., EHE, EHN, EHZ). The day-long data in each channel contain, on average, about 8.5 million samples together with the time at which each sample was taken, i.e., (24 hours) × (3600 s per hour) × (100 samples per second, sample_rate). The initial data in a channel may be shifted from the beginning of the day (00:00:00). To synchronize the data, the latest initial sample is chosen (the one that has the maximum time from the beginning), and all values in each channel starting from this time are extracted. Next, the minimum length of these arrays is found (the minimum time from the signal end), and the remaining arrays are truncated on the right to fit this length. Thus, the initial set of signal values (1) to be classified is formed; the signal values are in double precision:

$$CH = \{ch_{ij},\ i \in [0, L_{ch}] \subset \mathbb{Z},\ j \in [0, 2] \subset \mathbb{Z}\}, \tag{1}$$

where $L_{ch}$ is the data array length for each synchronized channel (8.3–8.5 million elements on average) and $j$ is the channel index.

2.2. Preliminary Signal Processing

The algorithm is able to analyze complete three-component day-long signals. Its input is matrix (1). A moving window of size m = 6145 samples with the shift size step = 100 samples is specified. At each step, a seismogram in the form of the matrix $X = \{x_{ij},\ i \in [0, m] \subset \mathbb{Z},\ j \in [0, 2] \subset \mathbb{Z}\}$, which contains a part of signal CH (1), is formed. The algorithm determines (classifies) the seismogram type as follows.

Step 1. The elements of the matrix $X = \{x_{ij}\}$ are replaced by the squared swings $sw_{i,j}$, which guarantees that the values used in the following computations are nonnegative:

$$sw_{i,j} = (x_{i,j} - x_{i+1,j})^2, \quad i \in [0, m-1], \quad j \in [0, 2]. \tag{2}$$

Step 2. The matrix of weights (3) and the matrix of entropies (4) are calculated:

$$q_{i,j} = \frac{sw_{i,j}}{\sum_{i=0}^{m-1} sw_{i,j}}, \quad i \in [0, m-1], \quad j \in [0, 2]. \tag{3}$$

$$E_{i,j} = -q_{i,j} \ln(q_{i,j}), \quad i \in [0, m-1], \quad j \in [0, 2]. \tag{4}$$

The entropy model of the discrete signal (3)–(4) possesses the properties of the Shannon sample entropy [29]. This model guarantees that the elements related to the same signal are additive and, most importantly, that the elements of different signals in synchronized samples (taken at the same time) are additive.

Step 3. The generalized data vector in three measurement channels is computed:

$$H_i = E_{i,0} + E_{i,1} + E_{i,2}, \quad i \in [0, m-1]. \tag{5}$$

Step 4. The characteristic function (6) in the working window is constructed. This process is called signal accumulation:

$$C_i^s = \sum_{l=0}^{i} H_l, \quad i \in [0, m-1]. \tag{6}$$

Due to accumulation, the three signal components are reduced to the one-dimensional steady form [30]. Due to the steady-state property of model (6), good
approximation over the averaged (smoothed) data (patterns with known characteristics) is ensured.

2.3. Construction of Standard Patterns

The idea underlying the classification algorithm makes it possible to use any available number of seismograms containing confirmed industrial blasts and regional earthquakes for the pattern construction. It is clear that the pattern averaging reveals features of the characteristic functions of the corresponding classes of events but fades the differences between events of the same class.

For each confirmed event, algorithm (1)–(6) described above is used to construct the characteristic functions. If the number of confirmed events is sufficiently large, then three patterns are found separately for the set of blasts and the set of earthquakes: the middle pattern and two patterns corresponding to the boundaries S = 0.5σ, where σ is the root-mean-square deviation found by

$$\sigma_i = \sqrt{\frac{1}{U} \sum_{j=1}^{U} (C_{i,j} - \mu_i)^2}, \quad \mu_i = \frac{1}{U} \sum_{j=1}^{U} C_{i,j}, \quad U = U_1 \text{ or } U_2.$$
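The construction of a characteristic function by formulas (2)–(6) and of the middle and boundary patterns can be illustrated by the following minimal Java sketch. It is not the authors' implementation: the class and method names are hypothetical, the convention 0·ln 0 = 0 is an assumption made to keep the entropy finite for zero swings, and the population form of σ (division by the number of events) is assumed.

```java
public class CharacteristicFunctionSketch {

    // Steps 1-4 for one window: x has m + 1 rows (samples) and one column per channel.
    // Returns the characteristic function C^s of length m, formulas (2)-(6).
    static double[] characteristicFunction(double[][] x) {
        int m = x.length - 1;
        int channels = x[0].length;
        double[][] sw = new double[m][channels];
        double[] total = new double[channels];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < channels; j++) {
                double d = x[i][j] - x[i + 1][j];
                sw[i][j] = d * d;            // squared swing, formula (2)
                total[j] += sw[i][j];        // denominator of formula (3)
            }
        }
        double[] c = new double[m];
        double acc = 0.0;
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < channels; j++) {
                double q = total[j] > 0.0 ? sw[i][j] / total[j] : 0.0; // weight, formula (3)
                if (q > 0.0) {               // assumption: 0 * ln 0 is treated as 0
                    acc -= q * Math.log(q);  // entropy (4), summed over channels (5)
                }
            }
            c[i] = acc;                      // accumulation, formula (6)
        }
        return c;
    }

    // events[e][i] is the characteristic function of confirmed event e.
    // Returns three rows: middle - S, middle, middle + S, with S = 0.5 * sigma.
    static double[][] patterns(double[][] events) {
        int u = events.length;
        int m = events[0].length;
        double[][] out = new double[3][m];
        for (int i = 0; i < m; i++) {
            double mu = 0.0;
            for (double[] e : events) {
                mu += e[i];
            }
            mu /= u;
            double var = 0.0;
            for (double[] e : events) {
                var += (e[i] - mu) * (e[i] - mu);
            }
            double s = 0.5 * Math.sqrt(var / u);
            out[0][i] = mu - s;
            out[1][i] = mu;
            out[2][i] = mu + s;
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny synthetic three-channel window, for illustration only.
        double[][] x = {{0, 0, 0}, {1, 1, 1}, {3, 3, 3}};
        double[] c = characteristicFunction(x);
        System.out.println("C^s has " + c.length + " values; last value = " + c[c.length - 1]);
    }
}
```

Because every term −q ln q is nonnegative, the returned function is nondecreasing, which matches the accumulation described in Step 4.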
In the formulas for $\sigma_i$ and $\mu_i$, $U_1$ and $U_2$ are the required numbers of confirmed seismograms of industrial blasts and regional earthquakes, respectively. On average, $U_1 = 30$ and $U_2 = 19$ are chosen for one seismograph.

Thus, for the sets of blasts and earthquakes, we obtain three patterns for each set, the middle pattern and two boundary patterns: for blasts, we have Blast, Blast ± S (B, B ± S), and for earthquakes, EarthQuake, EarthQuake ± S (E, E ± S).

Our analysis showed that, for seismic stations that are sufficiently far from noisy industrial zones, the characteristic time of passing and effective damping of a seismic disturbance caused by blasts is in the range 50–75 s. Therefore, the pattern width (working window) can be set to m = 6145 samples, or 61.45 seconds at the signal sampling rate of 100 Hz.

Another feature of the algorithm is abstract patterns. They are obtained using the function

$$f(t) = A \left(\frac{t}{Th}\right)^{h} \exp\left(h - \frac{t}{Th}\right) + \varepsilon(t).$$

By varying the values of the parameters T and h of this function for t = 0, …, m, the set of unimodal envelopes for the generalized data vector H (5) is formed. Using formula (6), we obtain a set of abstract characteristic functions from these envelopes. These functions represent the sequential passage of the unimodal seismic disturbance through the working window. We will use the following notation:
● three patterns at the entry of the working window: WaveFront-I (WF-I), WaveFront-II (WF-II), and WaveFront-III (WF-III);
● three patterns at the exit from the working window: WaveRear-I (WR-I), WaveRear-II (WR-II), and WaveRear-III (WR-III);
● three patterns in the middle of the working window: WaveMiddle (WM), WaveLeft (WL), and WaveRight (WR).

Additionally, we introduce the abstract pattern WhiteNoise (WN) that represents the seismic background and possible multimodal low-amplitude disturbances. This pattern is obtained as the characteristic function of the single-level signal f(t) = 1.

In the current version of the algorithm, we use an array of 16 double precision patterns represented by $C^t_{ij}$, $i \in [0, 6144] \subset \mathbb{Z}$, $j \in [0, 15] \subset \mathbb{Z}$.

2.4. Construction of the Diagnostic Matrix

According to (6), all the patterns are in the same metric space. By interpreting the patterns as features and samples as objects (independent observations), we can complement their set with the sample characteristic function and calculate an analog of the Bayesian diagnostic function by standardizing the matrix C in the objects by formula (7).

Step 5. Add the column vector $C^s$ to the matrix $C^t_{ij}$ ($i \in [0, 6144] \subset \mathbb{Z}$, $j \in [0, 15] \subset \mathbb{Z}$) on the right to obtain the matrix $C_{ij}$, $i \in [0, m-1]$, $j \in [0, n]$, $n = 16$.

$$S_{i,j} = \frac{C_{i,j} - \mu_i}{\sigma_i}, \quad \text{where } \mu_i = \frac{1}{n} \sum_{j=1}^{n} C_{i,j} \text{ and } \sigma_i = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (C_{i,j} - \mu_i)^2}. \tag{7}$$

Now, we can estimate the similarity between the characteristic function (6) and each pattern in the set $C^t_{ij}$ in matrix (7) as the distance between two features (one-dimensional vectors). For this purpose, we fix j = 16 and form the pair with $S_{i,16}$ for each $S_{i,j}$, $j \in [0, 15]$; this gives 16 pairs.

Step 6. For each pair, the distances are calculated using Table 1.

Thus, we obtain the matrix of distances

$$D = \{D_{k,j},\ k \in [0, 11],\ j \in [0, 15]\}. \tag{8}$$
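The row-wise standardization of formula (7) can be sketched in Java as follows. This is an illustration under stated assumptions rather than the authors' code: the class and method names are hypothetical, the mean and deviation are taken over all columns of a row (the 16 patterns plus the sample column), and a row with zero deviation is mapped to zeros to avoid division by zero.

```java
public class StandardizeSketch {

    // c[i][j]: row i is a sample position in the window, column j is a pattern
    // (the last column is assumed to hold the sample characteristic function).
    static double[][] standardize(double[][] c) {
        int rows = c.length;
        int cols = c[0].length;
        double[][] s = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            double mu = 0.0;
            for (int j = 0; j < cols; j++) {
                mu += c[i][j];
            }
            mu /= cols;                       // row mean
            double var = 0.0;
            for (int j = 0; j < cols; j++) {
                var += (c[i][j] - mu) * (c[i][j] - mu);
            }
            double sigma = Math.sqrt(var / cols); // row root-mean-square deviation
            for (int j = 0; j < cols; j++) {
                // assumption: a constant row is standardized to zeros
                s[i][j] = sigma > 0.0 ? (c[i][j] - mu) / sigma : 0.0;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] s = standardize(new double[][]{{1, 2, 3}, {5, 5, 5}});
        System.out.println(java.util.Arrays.toString(s[0]));
        System.out.println(java.util.Arrays.toString(s[1]));
    }
}
```

After this step every row has zero mean and unit deviation, so the columns can be compared as vectors in the same scale.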

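Table 1. Statistical distances

| Distance name | Computation formula | Formula for the computation with a weighting coefficient |
|---|---|---|
| Bray-Curtis | $D_{0,j} = \dfrac{\sum_{i=0}^{m-1} \lvert S_{i,16} - S_{i,j} \rvert}{\sum_{i=0}^{m-1} \lvert S_{i,16} + S_{i,j} \rvert}$ | – |
| Canberra | $D_{1,j} = \sum_{i=0}^{m-1} \dfrac{\lvert S_{i,16} - S_{i,j} \rvert}{\lvert S_{i,16} \rvert + \lvert S_{i,j} \rvert}$ | $D_{2,j} = \sum_{i=0}^{2(m-1)/3} w_i \dfrac{\lvert S_{i,16} - S_{i,j} \rvert}{\lvert S_{i,16} \rvert + \lvert S_{i,j} \rvert}$ |
| CityBlock | $D_{3,j} = \sum_{i=0}^{m-1} \lvert S_{i,16} - S_{i,j} \rvert$ | – |
| Normalized correlation | $D_{4,j} = 1 - \mathrm{cov}/(\sigma_1 \sigma_2)$, where $\mathrm{cov} = \dfrac{\sum_{i=0}^{m-1} S_{i,16} S_{i,j}}{m-1} - \dfrac{\sum_{i=0}^{m-1} S_{i,16} \sum_{i=0}^{m-1} S_{i,j}}{(m-1)^2}$, $\sigma_1 = \sqrt{\dfrac{\sum_{i=0}^{m-1} S_{i,16}^2}{m-1} - \dfrac{\left(\sum_{i=0}^{m-1} S_{i,16}\right)^2}{(m-1)^2}}$, $\sigma_2 = \sqrt{\dfrac{\sum_{i=0}^{m-1} S_{i,j}^2}{m-1} - \dfrac{\left(\sum_{i=0}^{m-1} S_{i,j}\right)^2}{(m-1)^2}}$ | – |
| Euclidean | $D_{5,j} = \lVert S_{16} - S_j \rVert_2$ | $D_{6,j} = \sqrt{\sum_{i=0}^{2(m-1)/3} w_i (S_{i,16} - S_{i,j})^2}$ |
| Euclidean squared | $D_{7,j} = \lVert S_{16} - S_j \rVert_2^2$ | $D_{8,j} = \sum_{i=0}^{2(m-1)/3} w_i (S_{i,16} - S_{i,j})^2$ |
| Minkowski of third degree | $D_{9,j} = \lVert S_{16} - S_j \rVert_3$ | $D_{10,j} = \sqrt[3]{\sum_{i=0}^{2(m-1)/3} w_i \lvert S_{i,16} - S_{i,j} \rvert^3}$ |
| Cosine distance | $D_{11,j} = 1 - \dfrac{S_{16} \cdot S_j}{\lVert S_{16} \rVert_2 \lVert S_j \rVert_2}$ | – |

Note. A dash indicates that there is no similar distance with a weighting coefficient.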
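A few of the distances in Table 1 can be sketched in Java as shown below. The sketch is illustrative only: the class and method names are hypothetical, the upper index 2(m − 1)/3 of the weighted variant is assumed to be rounded down by integer division, and vectors with zero norm are not handled in the cosine distance.

```java
public class DistanceSketch {

    // City block distance: sum of absolute differences.
    static double cityBlock(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Bray-Curtis distance: the city block sum divided by the sum of |a_i + b_i|.
    static double brayCurtis(double[] a, double[] b) {
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < a.length; i++) {
            num += Math.abs(a[i] - b[i]);
            den += Math.abs(a[i] + b[i]);
        }
        return num / den;
    }

    // Cosine distance: one minus the cosine of the angle between the vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double na = 0.0;
        double nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Euclidean distance with the weighting coefficient: only the indices
    // i in [0, 2(m - 1)/3] contribute (w_i = 1 there and 0 elsewhere).
    static double weightedEuclidean(double[] a, double[] b) {
        int last = 2 * (a.length - 1) / 3; // assumption: integer division
        double sum = 0.0;
        for (int i = 0; i <= last; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] sample = {1, 2, 3, 4};
        double[] pattern = {1, 0, 3, 0};
        System.out.println("city block = " + cityBlock(sample, pattern));
        System.out.println("Bray-Curtis = " + brayCurtis(sample, pattern));
        System.out.println("cosine = " + cosine(sample, pattern));
        System.out.println("weighted Euclidean = " + weightedEuclidean(sample, pattern));
    }
}
```

In the algorithm, the first argument would be the standardized sample column and the second one a standardized pattern column of matrix (7).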
In the weighted formulas of Table 1, the weighting coefficient is

$$w_i = \begin{cases} 1 & \text{for } i \in [0, 2(m-1)/3], \\ 0 & \text{otherwise.} \end{cases}$$

There are no a priori arguments in support of any particular distance (see Table 1); for this reason, in the current version of the algorithm we use all distances that are feasible for nominal features with different variations [31].

2.5. Classification

In the algorithm, we use the simple voting scheme in which each distance has a single vote. Each vote is given to the pattern with the minimum distance (8) to the sample characteristic function. The straightforward summation of the votes given to each pattern determines its rank.

Step 7. The matrix D is transformed as follows: for each $k \in [0, 11]$,

$$D_{k,j} = \begin{cases} 1 & \text{for } D_{k,j} = \min(D_k), \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$

Step 8. The ranks

$$R_j = \sum_{k=0}^{11} D_{k,j} \tag{10}$$

are computed, and the maximum of $\{R_j\}$ is found.
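The voting of Steps 7 and 8 can be sketched in Java as follows, together with the conclusion codes of the ranked voting scheme given in Step 9 (0 undefined, 1 strictly, 2 not strictly, 3 perhaps). The class and method names are hypothetical, and returning the pattern index −1 for an undefined conclusion is an assumption of this sketch.

```java
public class RankedVotingSketch {

    // d[k][j] is distance k between the sample and pattern j.
    // Formulas (9)-(10): each distance votes for every pattern that attains its minimum.
    static int[] ranks(double[][] d) {
        int patterns = d[0].length;
        int[] r = new int[patterns];
        for (double[] row : d) {
            double min = Double.POSITIVE_INFINITY;
            for (double v : row) {
                min = Math.min(min, v);
            }
            for (int j = 0; j < patterns; j++) {
                if (row[j] == min) {
                    r[j]++;
                }
            }
        }
        return r;
    }

    // Returns {pattern index, conclusion code} according to the rules of Step 9.
    static int[] conclude(int[] r) {
        int best = -1;
        int max = -1;
        boolean unique = false;
        for (int j = 0; j < r.length; j++) {
            if (r[j] > max) {
                max = r[j];
                best = j;
                unique = true;
            } else if (r[j] == max) {
                unique = false;
            }
        }
        if (!unique) {
            return new int[]{-1, 0};   // no unique maximum: undefined
        }
        if (max > 10) {
            return new int[]{best, 1}; // strictly
        }
        if (max > 8) {
            return new int[]{best, 2}; // not strictly
        }
        return new int[]{best, 3};     // perhaps
    }

    public static void main(String[] args) {
        // Twelve distances, three patterns: pattern 1 wins nine votes, pattern 2 wins three.
        double[][] d = new double[12][];
        for (int k = 0; k < 12; k++) {
            d[k] = k < 9 ? new double[]{5, 1, 3} : new double[]{5, 3, 1};
        }
        int[] result = conclude(ranks(d));
        System.out.println("pattern " + result[0] + ", conclusion " + result[1]);
    }
}
```

With nine votes for a single pattern the sketch yields the conclusion "not strictly", which is the case illustrated for the pattern Blast in Table 2.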

Table 2. Example of classification result for one step of the working window
№ Pattern type D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 R Conclusion
1 WR-I 0 0 0 0 0 0 0 0 0 0 0 0 0
2 WR-II 0 0 0 0 0 0 0 0 0 0 0 0 0
3 WR-III 0 0 0 0 0 0 0 0 0 0 0 0 0
4 WL 0 0 0 0 0 0 0 0 0 0 0 0 0
5 B+S 0 0 0 0 0 0 0 0 0 0 0 0 0
6 B 0 0 1 1 1 1 1 1 1 1 1 0 9 2
7 B-S 0 0 0 0 0 0 0 0 0 0 0 0 0
8 WM 0 0 0 0 0 0 0 0 0 0 0 0 0
9 EQ+S 0 0 0 0 0 0 0 0 0 0 0 0 0
10 EQ 0 0 1 0 0 0 0 1 0 0 1 0 3
11 EQ-S 0 0 0 0 0 0 0 0 0 0 0 0 0
12 WR 0 0 0 0 0 0 0 0 0 0 0 0 0
13 WF-III 0 0 0 0 0 1 1 0 0 0 0 0 2
14 WF-II 0 0 0 0 0 0 0 0 0 0 0 0 0
15 WF-I 0 0 0 0 0 0 0 0 0 0 0 0 0
16 WN 0 0 0 0 0 0 0 0 0 0 0 0 0
Step 9. The classification conclusion is formed using the following scheme, which is called ranked voting (Table 2):
(a) if there is no unique maximum, then the conclusion is undefined = 0;
(b) if there is a unique maximum greater than 10, then the conclusion is strictly = 1, which means the strict correspondence to the pattern;
(c) if the unique maximum is greater than 8 and less than 11, then the conclusion is not strictly = 2, which means the nonstrict correspondence to the pattern;
(d) if the unique maximum is less than 9, then the conclusion is perhaps = 3, which means the probable correspondence to the pattern.

It is seen from Table 2 that, at the current step of the working window, the shape of the signal of length 6144 samples corresponds (nonstrictly) to the pattern Blast; indeed, on the basis of the classifier conditions (Step 9), we obtain 8 < max(R) < 11 and the maximum is unique. At each iteration step 1 s long, we obtain the result in the form of the iteration index equal to the number of seconds elapsed from the signal start and in the form of the pair (pattern type, result index).

The shift of the working window from the signal start to its end makes it possible to detect and identify according to the patterns all significant disturbances of the day-long timeframe, thus forming the complete classification map (Fig. 1).

[Fig. 1 plot. X axis: time, 08:41:30 to 08:43:15, Jan 14, 2013. Y axis: pattern type (WaveRear-I, WaveRear-II, WaveRear-III, WaveLeft, Blast+S, Blast, Blast-S, WaveMiddle, EarthQuake+S, EarthQuake, EarthQuake-S, WaveRight, WaveFront-III, WaveFront-II, WaveFront-I, WhiteNoise, Undefined). Point types: undefined, strictly, not strictly, perhaps.]
Fig. 1. Classification map for a day-long record of a seismic signal with the boxed pattern of an industrial blast.

To analyze and identify the classifications, for each shift of the working window, we seek patterns that are typical for regional earthquakes and industrial blasts (Fig. 1). A pattern is a sequence of classification conclusions (Table 2), i.e., points in the plot arranged in a certain order in the working window. For the patterns of industrial blasts and of regional earthquakes, the arrangement and type of points on the axis X is the same; the differences are only on the axis Y. For the unambiguous identification, the following conditions must be simultaneously fulfilled:
● on the interval not shorter than 5 seconds, the Blast/Earthquake pattern must have one or more classification conclusions of type strictly = 1. A number of classification conclusion values of type not strictly = 2 and perhaps = 3 is also allowed;
● the same 5-second time interval must contain one or more classification conclusions of type strictly = 1 and (or) not strictly = 2 and (or) perhaps = 3 for the patterns Blast/Earthquake ± S, respectively.

To automate the search for patterns, the classification map is written to a JSON file (Fig. 2, see the next section).

"conclusion type": {"sample (no. of s)": [...], " pattern type": [...]}
"strictly": {"x": [...,31345,31346,...], "y": [...,6,6,...]},

"strictly": {"x": [...,31352,...],"y": [...,5,...]},
"perhaps": {"x": [...,31353,...],"y": [...,5,...]},
"not strictly": {"x": [...,31348,...],"y": [...,7,...]},
"not strictly": {"x": [...,31349,...],"y": [...,7,...]},
Fig. 2. Fragment of the desired pattern corresponding to an industrial blast written in JSON format.

3. ADAPTATION OF THE ALGORITHM FOR DISTRIBUTED COMPUTATIONS IN APACHE SPARK

The analysis of the classification algorithm (Steps 1–9) shows that the computations at each iteration of the working window shift are independent. This implies that the classification conclusion is obtained as an abstract value of one of the features (see Step 9 in Section 2). Therefore, we may decompose the day-long record of length $L_{ch}$ into partitions with the number of partitions equal to the number of available cores in the cluster, and then compute model (2)–(10) in parallel (the stage Map). Next, after all Map tasks have completed, the results of each Task are collected (collect), and each result is assigned the partition index (Fig. 4). At the last stage, the Driver Program joins the results by ordering them by time in the initial signal (classification map); this is the Reduce stage. Thus, it is possible to organize the computations using the classical MapReduce scheme with the intermediate stage Collect (Fig. 3).

[Fig. 3 diagram. The Driver Program broadcasts the variables CH, swO, sw_sum, C, and C_sum and parallelizes an array of 30 partition numbers (0, 1, 2, ..., 29) into an RDD; the number of partitions equals the number of CPU cores allocated on the worker nodes. Each of the three worker nodes holds the broadcast variables and runs one Executor with ten tasks (Task #1 to Task #10, #11 to #20, #21 to #30). Map: each task processes model (3)–(10) for all steps per core, where the number of tasks equals the number of CPU cores per worker node, the number of steps per core = (Lch)/(number of partitions)/(window step length), and the window step length equals the sample rate, 100. Collect: the result of each task is collected and identified by its partition number, giving an RDD of partition numbers and associated results (result_part_0, ..., result_part_29) as String objects formatted as JSON. Reduce: the classification map is created as a JSON file.]
Fig. 3. Generic scheme of computation organization in Apache Spark for the classification algorithm on three physical cluster nodes with 10 cores on each node.

The classification map is formed by the driver program after the stage Reduce in the form of a JSON file (an example is shown in Fig. 4), where the field x contains the step number of the working window or the number of seconds elapsed from the signal start, the field y contains the index of the classified pattern (Table 2) for the current working window, and the field time contains the time corresponding to the first sample in the working window. Additionally, the names of the channel files and the initial and end time after their synchronization are indicated.

At the first stage, the data are prepared. To this end, the object RDD (Resilient Distributed Datasets) is created, which contains the partition indexes (partition) according to the number of available cluster cores (number_of_partitions). Next, the number of steps in the working window for each core (steps_per_core) is determined by formula (11). One step of the working window is the shift by one second in the day-long record of the signal.

$$\text{steps\_per\_core} = \frac{L_{ch}}{(\text{number\_of\_partitions})(\text{sample\_rate})}. \tag{11}$$

Each task (Task) will process only a part of the array CH. The index of the first element of the array is found by formula (12):

$$\text{start\_idx} = (\text{steps\_per\_core})(\text{part}_i), \quad i = 0, \ldots, \text{number\_of\_partitions} - 1. \tag{12}$$

The entire array CH is represented by a special object Broadcast of the framework Spark API. Broadcast is a broadcast variable stored in the cache at each worker node of the cluster. Unlike an ordinary variable, Broadcast is not sent as a copy to each task, which makes it possible to efficiently deal with large sets of invariable data, and the day-long seismic signal records are indeed invariable.

The use of the broadcast variables allowed us to significantly optimize the algorithm described above. To reduce the execution time of the algorithm implementation and adapt it for working in the Apache Spark environment, we did the following.

1. In advance, the elements $sw_{i,j}$ (2) for the three channels of the day-long record are calculated, and the array $swO_{ij}$, $i \in [0, L_{ch} - 1]$, is formed. The sums

$\sum_{i=0}^{m-1} sw_{i,j}$ (2) for each shift step are calculated in advance. This gives the array

$$\text{sw\_sums}_{j,s} = \text{sw\_sums}_{j,s-1} \pm \sum_{l=1}^{100} swO_{100s \pm l,\, j}, \quad s \in \left[1, \frac{L_{ch}}{100}\right],$$

where

$$\text{sw\_sums}_{j,0} = \sum_{i=0}^{m-1} sw_{i,j}.$$

The arrays swO and sw_sums are declared broadcast variables for all tasks in the Apache Spark environment, and they are broadcast using the object Broadcast. This optimization significantly reduces the time needed to compute (2) and (3) because at each shift step of the working window only the sums of the 100 preceding and 100 succeeding elements of sw are computed (the elements that leave the window are subtracted and the elements that enter it are added).

(a) Stage «Collect»
{
  "partition00": {
    "undefined": {...},
    "strictly": {
      "x": [721, 722, 723, ...],
      "y": [3, 4, 6, ...],
      "time": ["2013-01-14 00:12:04", "2013-01-14 00:12:05", "2013-01-14 00:12:13", ...]
    },
    "notstrictly": {...},
    "perhaps": {...}
  },
  ...
  "partition29": {
    "strictly": {
      "x": [..., 85432],
      "y": [..., 12],
      "time": [..., "2013-01-14 23:43:54"]
    }
  }
}

(b) Stage «Reduce»
{
  "undefined": {...},
  "strictly": {
    "x": [721, 722, 723, ..., 85432],
    "y": [3, 4, 6, ..., 12],
    "time": ["2013-01-14 00:12:04", "2013-01-14 00:12:05", "2013-01-14 00:12:13", ..., "2013-01-14 23:43:54"]
  },
  "notstrictly": {...},
  "perhaps": {...},
  "channel1": "AN.BRCR.81.EHE.D.2013.014",
  "channel2": "AN.BRCR.81.EHN.D.2013.014",
  "channel3": "AN.BRCR.81.EHZ.D.2013.014",
  "signalStartTime": "2013-01-14 00:04:02",
  "signalEndTime": "2013-01-14 23:58:01"
}
Fig. 4. Classification map of the day-long seismic signal in the form of JSON file at the stages (a) Collect and (b) Reduce, respectively.

2. For the array $C^t$, the sum $CSum_i = \sum_{j=1}^{n} C^t_{i,j}$ is computed in advance. $CSum_i$ is also declared a

{
// Main Java class
"className": "org.myapp.seismatica.classifiers.ClassificationProcessor",
// Path to the file containing the program
"file": "hdfs://cloudera-node04/user/jars/seismatica/seismatica-classifier-1.0.jar",
"name": "Seismatica – Classifier",
"args": [
"DistanceClassifier", // Classifier name
chFilesArg, // Array of paths to channel files
templateFile, // Path to the file of patterns
"100", // Step of the window shift
"30" // Number of tasks to be executed concurrently
],
"conf": {
"spark.executor.instances": "3", // Number of Executor objects to be executed concurrently
"spark.task.cpus": "1", // Number of CPUs per one task
"spark.executor.cores": "10", // Number of tasks per one Executor
"spark.executor.memory": "4g", // Memory allocated for one Executor
"spark.driver.memory": "2g", // Memory allocated for the Driver Program
"spark.driver.extraClassPath": "/mnt/hdfs/user/jars/seismatica/*", // Path to java classes
"spark.executor.extraClassPath": "/mnt/hdfs/user/jars/seismatica/*" // Path to java classes
}
}
Fig. 5. Example of JSON object in the POST request to the service Apache Livy for remote task run.
broadcast variable. Then, the expression μi = result, and identifies it as partition##, where ## is the
partition index (Fig. 4a); next, the reduce procedure is

1 n C in (7) can be replaced by μ = 1 (C |i,16 +
n j =1
i, j i
n
started (Fig. 4b). The result is saved to a JSON file.
CSumi) at each step, which also reduces the number of At the third stage, the classification map is ana-
operations with the sum of elements. lyzed and the characteristic patterns of event type are
distinguished. The table of classified seismic events is
3. The algorithms used to find distances (see [14]) constructed.
include repeating expressions. For example, the dis-
tance Bray–Curtis includes the expression Si,16 − Si, j ,
which also appears in Canberra, and Σ Si,16 − Si, j 4. SOFTWARE IMPLEMENTATION
(j ∈ [0,15]) appears in City Block. The computation of The algorithm is implemented in the form of a
the correlation through covariance yields the expres- library in Java (jar file) (see Section 6, Source Code).
The computation kernel uses the framework Apache
sions ΣSi2,16 , ΣSi2, j , and ΣSi,16Si, j , which are used in the Spark API (Java) and the resource manager Apache
computation of the Euclidean and cosine distances. YARN. The results are saved in the HDFS file system.
Taking into account the fact that the weighting coeffi- To make input/output operations more convenient,
cient takes the value of one at two thirds of the total the HDFS file system is mounted as a folder in Linux
number of iterations and the other values are zero, we using the package hadoop-hdfs-fuse. The computa-
should register the sum value at the iteration number tions are started through Apache Livy using the file of
2(m − 1)/3 and use it for computing the distance with parameters shown in Fig. 5.
the weighting coefficient. That is, if the Euclidean dis- The library contains Java classes that are responsi-
tance is ΣSi,162 + ΣSi, j 2 − 2ΣSi,16Si, j for i ∈ [0, m − 1], ble for the computational part of the algorithm and
then the Euclidean distance with the weighting coeffi- additional classes that implement the methods of pre-
cient is computed in the same way, only for i ∈ [0 , 2(m – liminary seismic signal processing and for the work
1)/3]. with Apache Spark API objects (Fig. 6).
ClassificationProcessor is the main class responsi-
Such optimizations significantly reduce the algo-
rithm execution time because for each shift the same
ble for starting the classification process. It contains
computations for the same datasets should be per-
the subclass classify, which extends the class Function
formed several times. Taking into account the number
(Spark API), for passing it to the function map. The
subclass classify implements the classification algo-
of steps (84000 on average) and the window size (m − 1 = rithm (the classes SignalProcessor and DistanceClas-
6144 samples), the proposed optimizations signifi- sifier) adapted for work in distributed mode on cluster
cantly improve the performance. At the second stage, each task independently computes model (3)–(10), obtains a part of classification

nodes. The class ClassificationProcessor configures the task execution environment (Executor) using the object SparkContext. ClassificationProcessor implements the procedure of allocating the broadcast variables containing the precomputed data (see Section 3) in the classes TemplateProcessor and MiniSEEDProcessor (Fig. 7). These variables can be accessed from all cluster nodes in the shared memory of the current context object.

Fig. 6. Class diagram of the algorithm library.

The template data are stored in HDFS in a CSV file. The class TemplateProcessor has methods for reading the data, processing the CSV data, and constructing the array $C_{ijt}$ and the sum $\sum_{j=1}^{n} C_{i,j}$ for each $i$th sample, which are added to the pool of broadcast variables (see Section 3).

The class MiniSEEDProcessor contains methods for working with files in miniSEED format using the library iris-WS.jar. This library makes it possible to decode and read data from files containing seismic channel records. MiniSEEDProcessor implements the channel synchronization procedure with respect to time and forms the broadcast variable for the matrix CH (1); in addition, this class computes the signal length and the metadata for each channel (name, sampling rate, and initial/final record time).

The method classify (Fig. 8) produces a JSON file (Fig. 4a) as its result; it contains one partition##. After all the tasks have been completed, the driver program joins the results into a single string-type object (Fig. 2b). This object is written as a JSON file in the HDFS file system (Fig. 9).

5. PERFORMANCE TEST
The performance test was carried out by executing the classification process on day-long records made by the BRCR¹ station. In total, 150 runs were made. All the signal files (three channels) were different. Table 3 shows the mean execution time of the algorithm. Results are presented for four implementations of the algorithm: Matlab and Java (console application) for the local test, and Java and Python (Spark API applications) for the distributed test. We measured only the processing time, from feeding the input parameters (channel files, template file, Spark configuration parameters, etc.) to obtaining the JSON files of the classification map.

¹ The codes of the International Registry of Seismograph Stations (IR) are available at http://www.isc.ac.uk/registries/.
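The speedups implied by the mean times in Table 3 can be checked with simple arithmetic. The following is a minimal sketch and not part of the authors' library: the class name SpeedupCheck and the method speedup are hypothetical, and the only inputs are the execution times (in seconds) reported in Table 3.

```java
import java.util.Locale;

// Illustrative only: SpeedupCheck and speedup() are hypothetical names,
// not part of the authors' library. Inputs are the mean times from Table 3.
class SpeedupCheck {

    // Ratio of a baseline (sequential) time to a distributed time.
    static double speedup(double baselineSeconds, double distributedSeconds) {
        return baselineSeconds / distributedSeconds;
    }

    public static void main(String[] args) {
        // Day-long record, seconds (Table 3).
        double javaSparkDay = 17, pythonSparkDay = 24, matlabDay = 3574, javaDay = 801;
        // Week-long timeframe, seconds (Table 3).
        double javaSparkWeek = 154, pythonSparkWeek = 183, matlabWeek = 25145, javaWeek = 5599;

        System.out.printf(Locale.ROOT, "Java / Java Spark, day:      %.1f%n", speedup(javaDay, javaSparkDay));
        System.out.printf(Locale.ROOT, "Matlab / Java Spark, day:    %.1f%n", speedup(matlabDay, javaSparkDay));
        System.out.printf(Locale.ROOT, "Java / Python Spark, day:    %.1f%n", speedup(javaDay, pythonSparkDay));
        System.out.printf(Locale.ROOT, "Java / Java Spark, week:     %.1f%n", speedup(javaWeek, javaSparkWeek));
        System.out.printf(Locale.ROOT, "Matlab / Python Spark, week: %.1f%n", speedup(matlabWeek, pythonSparkWeek));
    }
}
```

With these figures, the Java Spark API version is roughly 47 times faster than the sequential Java console application on a day-long record and roughly 36 times faster on a week-long timeframe.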

// Pattern processor
TemplateProcessor templateProcessor = new TemplateProcessor();
templateProcessor.process(templatesFile);
// Channel processor
MiniSEEDProcessor miniSEEDProcessor = new MiniSEEDProcessor(...);
miniSEEDProcessor.process(chFiles[0], chFiles[1], chFiles[2], true);
// Configuration of the run context
SparkConf sparkConf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// Public data allocation
Broadcast<double[][]> CBroadcast = sc.broadcast(templateProcessor.C);
...
// Starting the task
List<String> classificationMapParts = sc.parallelize(
        IntStream.range(0, partitionCount).boxed().collect(Collectors.toList()), partitionCount)
    .map(new classify(..., ..., CBroadcast, ...))
    .collect();
...
// Implementation of the object function for the method map (concurrently for each task)
class classify implements Function<Integer, String> {
    ...
    public classify(..., ..., Broadcast<double[][]> CBroadcast, ...) {
        this.CBroadcast = CBroadcast;
    }
    @Override
    public String call(Integer core) {
        SignalProcessor signalProcessor = new SignalProcessor(...);

        DistanceClassifier distanceClassifier = new DistanceClassifier(...);
        ...
        // Shared data access within the task SparkContext
        double[][] C = CBroadcast.value();
        ...
        ...
        // Running the classification algorithm
        double[][] D = signalProcessor.process(ch1RawData, ch2RawData, ch3RawData, C, CSums);
        conclusionResult = distanceClassifier.process(D);
        ...
        return conclusionResult; // Result in the form of a JSON string
        ...
Fig. 7. Fragment of code for configuring the Spark context and starting the computation task of the class ClassificationProcessor.
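The pattern in Fig. 7 (one integer per partition, mapped to one JSON fragment, then collected in order) can be imitated without a cluster. The sketch below is a local stand-in that uses Java parallel streams in place of sc.parallelize(...).map(...).collect(); it is not Spark code. The class PartitionSketch, its methods, and the even split of shifts across partitions are assumptions made for illustration, since the authors' actual partitioning rule (stepsPerCores in Fig. 8) is not shown in the paper. The figures of 83558 shifts and 30 cores are taken from the remark to Table 3.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Local stand-in for the Spark pattern of Fig. 7; NOT Spark code.
// PartitionSketch and all of its members are hypothetical names.
class PartitionSketch {

    // Number of window shifts handled by each partition (ceiling division),
    // assuming an even split; the last partition may receive fewer shifts.
    static int stepsPerPartition(int totalShifts, int partitionCount) {
        return (totalShifts + partitionCount - 1) / partitionCount;
    }

    // Half-open range [start, end) of shift indices owned by one partition.
    static int[] shiftRange(int partition, int totalShifts, int partitionCount) {
        int steps = stepsPerPartition(totalShifts, partitionCount);
        int start = Math.min(partition * steps, totalShifts);
        int end = Math.min(start + steps, totalShifts);
        return new int[] {start, end};
    }

    // Placeholder for the per-task work: it only reports the owned range,
    // where the real task would run the classification over that range.
    static String classifyPartition(int partition, int totalShifts, int partitionCount) {
        int[] r = shiftRange(partition, totalShifts, partitionCount);
        return String.format(Locale.ROOT,
                "\"partition%03d\":{\"from\":%d,\"to\":%d}", partition, r[0], r[1]);
    }

    // Analogue of parallelize(...).map(...).collect(): partitions are processed
    // concurrently, and the collected list keeps partition order.
    static List<String> run(int totalShifts, int partitionCount) {
        return IntStream.range(0, partitionCount)
                .parallel()
                .mapToObj(p -> classifyPartition(p, totalShifts, partitionCount))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> parts = run(83558, 30);
        System.out.println("{" + String.join(",", parts) + "}");
    }
}
```

Because each shift of the moving window is independent, the ranges do not overlap and no task needs results from another; this independence is what allows the work to be distributed this way.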
Note that there are many algorithms for classifying seismic events. Comparing all of them with the proposed algorithm is beyond the scope of this paper. It is practically impossible to analyze the mathematical models underlying these algorithms and their implementations; therefore, it is difficult to assess whether they can be run in parallel (distributed) mode, or to implement and optimize them. Our purpose was to achieve the optimal execution time of the proposed algorithm so that it can be used for receiving and processing seismic signals in stream mode while retaining the accuracy of the results. However, even in comparison with the FAST algorithm [16] mentioned above, we achieved a 25-fold acceleration. Furthermore, as the number of worker nodes in the cluster and, correspondingly, the number of cores increases, the execution time decreases linearly. This is possible because a greater number of tasks can be run for one day-long signal and because the number of simultaneous processes can be increased depending on the number of signals in a week-long timeframe.

We compared our results with the observation logs of the Kemerovo region geophysical monitoring service. In 95% of cases, the classification conclusions of the proposed algorithm coincided with these observations (the data for 2013, containing about 500 seismic events (industrial blasts) obtained from two stations, were used).

6. SOURCE CODE
The source code of the Java implementation of the algorithm described in this paper and the initial data (patterns and files containing the day-long seismic signals) are freely available on the Internet at https://bitbucket.org/ogidog/seismatica-classifier/src/master/.

double[][] D = signalProcessor.process(ch1RawData, ch2RawData, ch3RawData, C, CSums);

conclusionResult = distanceClassifier.process(D);
if (conclusionResult[0] != -1) {
    date.setTime(syncStartTime + (core * stepsPerCores * 1000 + i * 1000));
    // Forming the JSON string for each conclusion
    ...
    if (conclusionResult[0] == 1) {
        strictlyX += (globalStartPosition + i) + ",";
        strictlyY += conclusionResult[1] + ",";
        strictlyTime += "\"" + simpleDateFormat.format(date).toString() + "\",";
    }
    ...
    if (conclusionResult[0] == 3) {
        perhapsX += (globalStartPosition + i) + ",";
        perhapsY += conclusionResult[1] + ",";
        perhapsTime += "\"" + simpleDateFormat.format(date).toString() + "\",";
    }
}
...
// Concatenation of conclusion strings

result = "\"undefined\":{\"x\":[" + undefinedX + "],\"y\":[" + undefinedY +
    "],\"times\":[" + undefinedTime + "]},"
    + "\"strictly\":{\"x\":[" + strictlyX + "],\"y\":[" + strictlyY + "],\"times\":[" +
    strictlyTime + "]},"
    + "\"notstrictly\":{\"x\":[" + notStrictlyX + "],\"y\":[" + notStrictlyY + "],\"times\":["
    + notStrictlyTime + "]},"
    + "\"perhaps\":{\"x\":[" + perhapsX + "],\"y\":[" + perhapsY + "],\"times\":[" +
    perhapsTime + "]},";
return "\"partition" + String.format("%03d", core) + "\":" + "{" + result + "}";
Fig. 8. Fragment of code for forming the String object containing JSON.
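The fragment in Fig. 8 appends a comma after every element; how the trailing commas are handled is not shown in the paper. As an alternative sketch, java.util.StringJoiner inserts separators only between elements, so no trimming is needed. The class JsonPartSketch, its methods, and the sample time strings below are hypothetical and are not taken from the authors' code; only the key names ("strictly", "x", "y", "times") and the partition%03d label follow Fig. 8.

```java
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper, not the authors' code: assembles one JSON fragment
// with the same key names as Fig. 8, using StringJoiner for separators.
class JsonPartSketch {

    // Joins integers as a JSON array, e.g. [1,2,3].
    static String jsonArray(int[] values) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (int v : values) {
            joiner.add(Integer.toString(v));
        }
        return joiner.toString();
    }

    // Joins strings as a JSON array of quoted values. No escaping is done,
    // so the inputs are assumed to contain no quotes or backslashes.
    static String jsonStringArray(String[] values) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (String v : values) {
            joiner.add("\"" + v + "\"");
        }
        return joiner.toString();
    }

    // One class of conclusions: "name":{"x":[...],"y":[...],"times":[...]}.
    static String conclusionGroup(String name, int[] x, int[] y, String[] times) {
        return "\"" + name + "\":{\"x\":" + jsonArray(x)
                + ",\"y\":" + jsonArray(y)
                + ",\"times\":" + jsonStringArray(times) + "}";
    }

    // Wraps the groups produced by one task as "partitionNNN":{...}.
    static String partition(int core, String... groups) {
        return "\"partition" + String.format(Locale.ROOT, "%03d", core) + "\":{"
                + String.join(",", groups) + "}";
    }

    public static void main(String[] args) {
        // Sample values are made up for the demonstration.
        String strictly = conclusionGroup("strictly",
                new int[] {31345, 31346}, new int[] {6, 6},
                new String[] {"08:42:25", "08:42:26"});
        System.out.println(partition(7, strictly));
    }
}
```

The driver can then join the per-partition fragments with commas and wrap them in braces to obtain a single JSON object.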
Table 3. Performance test of the software implementation of the classification algorithm (execution time)

Implementation   | Mode of execution                                                  | Day-long record, s | Week-long timeframe, s
Java Spark API   | simultaneous, on 30 computation cores, decomposed into partitions  | 17                 | 154
Python Spark API | simultaneous, on 30 computation cores, decomposed into partitions  | 24                 | 183
Matlab           | sequential, without using the Apache Spark API                     | 3574               | 25145
Java             | sequential, without using the Apache Spark API                     | 801                | 5599

Remark. Each day-long record synchronized with respect to channels contained on average 8355839 samples spaced at 10 ms, or 83558 shifts. The following hardware was used: 3 servers (AMD Ryzen 1700 (8+8 cores (Simultaneous Multi-Threading)), 3.2 GHz, 32 GB RAM, 1 Gb/s data transmission rate between servers). The local test was run on one server.

[Diagram: the Driver Program (TemplateProcessor, MiniSEEDProcessor, ClassificationProcessor) reads chFiles and templatesFile from HDFS and broadcasts chData, C, and CSums to the tasks on the Worker Nodes; in each task, SignalProcessor.process() produces D and DistanceClassifier.process() produces conclusionResult via classify(); the driver gathers the results with collect() and reduce() and writes resultAsJSON to HDFS.]
Fig. 9. Data stream diagram of model (2)–(10) for one shift of the working window per second.

7. CONCLUSIONS
A software library for the fast detection and classification of seismic events in a day-long timeframe based on distributed computations in the Apache Spark environment has been developed. Based on the independence of the iteration steps of shifting the moving window, approaches to distributing the execution of the proposed algorithm and to its optimization are demonstrated. Using special broadcast variables in the shared memory of the Spark cluster, preliminary computations were carried out with the aim of reducing the number of arithmetic operations in the tasks involved in the Apache Spark organization of computations.

The performance tests demonstrated a significant reduction of the time needed for processing day-long and week-long seismic records compared with sequential processing. Due to the extremely fast execution, the proposed approach to the optimization of the mathematical model and implementation of the algorithms makes it possible to use it for stream signal processing.

Methods of the adaptation of seismological mathematical models to modern technologies of massively parallel execution of tasks on clusters are demonstrated. In our opinion, the proposed approach can also be useful in other fields of science and engineering.

FUNDING
This work was supported by the Russian Foundation for Basic Research, project no. 18-07-00013A.

REFERENCES
1. Scarpetta, S., Giudicepietro, F., Ezin, E.C., Petrosino, S., Del Pezzo, E., Martini, M., and Marinaro, M., Automatic classification of seismic signals at Mt. Vesuvius volcano, Italy, using neural networks, Bull. Seismol. Soc. Am., 2005, vol. 95, no. 1, pp. 185–196.
2. Benbrahim, M., Daoudi, A., Benjelloun, K., and Ibenbrahim, A., Discrimination of seismic signals using artificial neural networks, Proc. World Acad. Sci. Eng. Technol., 2005, vol. 4, pp. 4–7.
3. Diersen, S., Lee, E.-J., Spears, D., Chen, P., and Wang, L., Classification of seismic windows using artificial neural networks, Proc. Comput. Sci., 2011, vol. 4, pp. 1572–1581.
4. Hamer, R.M. and Cunningham, J.W., Cluster analyzing profile data confounded with interrater differences: A comparison of profile association measures, Appl. Psychol. Meas., 1981, vol. 5, pp. 63–72.
5. Kedrov, E.O. and Kedrov, O.K., Spectral time method of identification of seismic events at distances of 15°–40°, Izv., Phys. Solid Earth, 2006, vol. 42, no. 5, pp. 398–415.
6. Langer, H., Falsaperla, S., Powell, T., and Thompson, G., Automatic classification and a posteriori analysis of seismic event identification at Soufrière Hills volcano, Montserrat, J. Volcanology Geotherm. Res., 2006, vol. 153, no. 1, pp. 1–10.
7. Lyubushin, A.A., Jr., Kaláb, Z., and Častová, N., Application of wavelet analysis to the automatic classification of three-component seismic records, Izv., Phys. Solid Earth, 2004, vol. 40, no. 7, pp. 587–593.
8. Musil, M. and Pleginger, A., Discrimination between local microearthquakes and quarry blasts by multi-layer perceptrons and Kohonen maps, Bull. Seismol. Soc. Am., 1996, vol. 86, no. 4, pp. 1077–1090.

9. Ryzhikov, G.A., Biryulina, M.S., and Husebye, E.S., A novel approach to automatic monitoring of regional seismic events, IRIS Newsletter, 1996, vol. XV, no. 1, pp. 12–14.
10. Shimshoni, Y. and Intrator, N., Classification of seismic signals by integrating ensembles of neural networks, IEEE Trans. Signal Proc., 1998, vol. 46, no. 5, pp. 1194–1201.
11. Ryan, T.M., Borisov, D., Lefebvre, M., and Tromp, J., SeisFlows – flexible waveform inversion software, Comput. & Geosci., 2018, vol. 115, pp. 88–95.
12. Lesage, P., Interactive Matlab software for the analysis of seismic volcanic signals, Comput. & Geosci., 2009, vol. 35, no. 10, pp. 2137–2144.
13. Jiang, W., Yu, H., Li, L., and Huang, L., A robust algorithm for earthquake detector, Proc. of the 15th World Conference on Earthquake Engineering, Lisbon, 2012.
14. Álvarez, I., García, L., Mota, S., Cortés, G., Benítez, C., and De La Torre, A., An automatic P-phase picking algorithm based on adaptive multiband processing, IEEE Geosci. Remote Sensing Lett., 2013, vol. 10, no. 6, pp. 1488–1492.
15. Madureira, G. and Ruano, A., A neural network seismic detector, IFAC Proc. Vol., 2009, vol. 42, no. 19, pp. 304–309.
16. Clara, E.Y., Ossian, O’R., Karianne, J.B., and Beroza, G.C., Earthquake detection through computationally efficient similarity search, Sci. Advances, 2015, vol. 1, pp. E1501057 (1–13).
17. Paul, B.Q., Pierre, G., Yoann, C., and Munkhuu, U., Detection and classification of seismic events with progressive multichannel correlation and hidden Markov models, Comput. & Geosci., 2015, vol. 83, pp. 110–119.
18. IRIS. Incorporated Research Institutions for Seismology. https://www.iris.edu/hq/. Cited May 4, 2019.
19. Romero, L.E., Titos, M., Bueno, Á., Álvarez, I., García, L., de la Torre, Á., and Benítez, M.C., APASVO: A free software tool for automatic P-phase picking and event detection in seismic traces, Comput. & Geosci., 2016, vol. 90, Part A, pp. 213–220.
20. GeoSeisQC. http://www.geoleader.ru/index.php/ru/produkty-ru/geoseicqc. Cited May 7, 2019.
21. ZETLAB STA/LTA detector. https://zetlab.com/shop/programmnoe-obespechenie/funktsii-zetlab/analiz-signalov/detektor-sta-lta/. Cited May 7, 2019.
22. Stratimagic. http://www.pdgm.com/products/stratimagic/. Cited May 7, 2019.
23. Development and creation of GRID applications for solving applied problems of geophysics, project no. 10-07-00491-a, Russian Foundation for Basic Research. http://www.rfbr.ru/rffi/ru/project_search/o_49145. Cited May 7, 2018.
24. The use of weakly coupled computer systems for solving inverse problems of geophysics, project no. 11-05-00988-a, Russian Foundation for Basic Research. http://www.rfbr.ru/rffi/ru/project_search/o_43212. Cited May 7, 2018.
25. Development of a GRID system and computation services for analyzing geodynamic space-time processes given Earth remote sensing data from, project no. 11-07-12045-ofi, Russian Foundation for Basic Research. http://www.rfbr.ru/rffi/ru/project_search/o_46676. Cited May 7, 2018.
26. Distance computations, SciPy.org. https://docs.scipy.org/doc/scipy/reference/spatial.distance.html. Cited May 11, 2018.
27. Zamaraev, R.Yu., Popov, S.E., and Logov, A.B., The algorithm for classifying seismic events based on the entropy mapping of signals, Izv., Phys. Solid Earth, 2016, vol. 52, no. 3, pp. 364–370.
28. Zamaraev, R.Yu. and Popov, S.E., An algorithm for the automatic detection and classification of industrial blasts based on the entropy mapping of signals, Geofiz. Issled., 2019, vol. 20, no. 1, pp. 38–51.
29. McKay, D., Information Theory, Inference, and Learning Algorithms, Cambridge: Cambridge Univ. Press, 2003.
30. Kortström, J., Uski, M., and Tiira, T., Automatic classification of seismic events within a regional seismograph network, Comput. & Geosci., 2016, no. 87, pp. 22–30.
31. Guojun, Gan, Chaoqun, Ma, and Jianhong, Wu., Data Clustering: Theory, Algorithms, and Applications (ASA-SIAM Series on Statistics and Applied Probability), Philadelphia: Society for Industrial and Applied Mathematics, 2007.

Translated by A. Klimontovich
