Russian Text © The Author(s), 2020, published in Programmirovanie, 2020, Vol. 46, No. 1.
Abstract—The main ideas behind the software implementation of an algorithm for the fast automatic classification of seismic signals based on diagnostic patterns are described. The process of adapting and integrating this implementation into the distributed computing system Apache Spark is described in detail. A software solution for the preliminary processing of the signals and for the optimization of the mathematical model for parallel computations using broadcast variables is presented. Performance tests of the classification algorithm on a set of day-long signals are carried out. The execution time of the algorithm under massively parallel computation was reduced tenfold compared with sequential execution.
DOI: 10.1134/S0361768820010053
POPOV, ZAMARAEV
is IRIS DMC [18]. Among the tools designed for processing and analyzing seismic data, one can distinguish software for detecting wave signal disturbances [19–21], software for interactive simulation of the seismic wave field and verification of geological hypotheses in the classification of seismic data [22], and software for solving the inverse geophysics problem based on open GRID systems [23–25].

However, the available software solutions do not support the classification of seismic events based on the analysis of sets of real-life day-long data obtained from different observation stations. The data are processed manually, by loading files and finding short timeframes of events of interest. The classification algorithms implemented in this software cannot be executed in parallel mode to process the signals over the entire time interval. As a result, the performance of the algorithms is insufficiently high. The majority of the software solutions are static applications designed to work with specialized hardware. This significantly reduces the capabilities for analysis, search, and confirmation of the nature of seismic events using information from various sources.

Taking into account the above reasoning, we face the important task of developing software for classifying seismic events that supports high-performance processing of large amounts of data in computing environments in massively parallel mode.

2. A MATHEMATICAL MODEL OF THE CLASSIFICATION ALGORITHM

The classification algorithm discussed here was developed by the authors of this paper [27, 28]. In the current version of the algorithm, the classification

Thus, the initial set of signal values (1) to be classified is formed; the signal values are in double precision:

CH = {ch_{i,j}, i ∈ [0, L_ch] ∩ Z, j ∈ [0, 2] ∩ Z},   (1)

where L_ch is the data array length for each synchronized channel (8.3–8.5 million elements on average) and j is the channel index.

2.2. Preliminary Signal Processing

The algorithm analyzes complete three-component day-long signals. Its input is matrix (1), CH = {ch_{i,j}, i ∈ [0, L_ch] ∩ Z, j ∈ [0, 2] ∩ Z}. A moving window of size m = 6145 samples with shift step = 100 samples is specified. At each step, a seismogram in the form of the matrix X = {x_{i,j}, i ∈ [0, m] ∩ Z, j ∈ [0, 2] ∩ Z}, which contains a part of signal CH (1), is formed. The algorithm determines (classifies) the seismogram type as follows.

Step 1. The elements of the matrix X = {x_{i,j}} are replaced by the squared swings sw_{i,j}, which guarantees that the values used in the subsequent computations are nonnegative:

sw_{i,j} = (x_{i,j} − x_{i+1,j})²,  i ∈ [0, m − 1],  j ∈ [0, 2].   (2)

Step 2. The matrix of weights (3) and the matrix of entropies (4) are calculated:

q_{i,j} = sw_{i,j} / Σ_{i=0}^{m−1} sw_{i,j},  i ∈ [0, m − 1],  j ∈ [0, 2].   (3)
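As an illustration, Steps 1 and 2 can be sketched in plain Java outside Spark. This is a hypothetical standalone sketch; the class and method names are ours, not from the described library.

```java
// Sketch of Steps 1-2 of the model for one working window of a
// three-component signal (hypothetical standalone code).
public class SwingSketch {
    // Squared swings, eq. (2): sw[i][j] = (x[i][j] - x[i+1][j])^2,
    // i in [0, m-1], j in [0, 2]; x holds m+1 samples per channel.
    static double[][] swings(double[][] x, int m) {
        double[][] sw = new double[m][3];
        for (int j = 0; j < 3; j++)
            for (int i = 0; i < m; i++) {
                double d = x[i][j] - x[i + 1][j];
                sw[i][j] = d * d;
            }
        return sw;
    }

    // Weight matrix, eq. (3): q[i][j] = sw[i][j] / sum over i of sw[i][j].
    static double[][] weights(double[][] sw, int m) {
        double[][] q = new double[m][3];
        for (int j = 0; j < 3; j++) {
            double sum = 0;
            for (int i = 0; i < m; i++) sum += sw[i][j];
            for (int i = 0; i < m; i++) q[i][j] = sw[i][j] / sum;
        }
        return q;
    }
}
```

Squaring the first differences both removes the sign of the swing and suppresses any constant offset of the channel, which is why the subsequent weights in (3) are always in [0, 1].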
minimum length of these arrays is found (the minimum time from the signal end), and the remaining arrays are truncated on the right to fit this length. Due to accumulation, the three signal components are reduced to the one-dimensional steady form [30]. Due to the steady-state property of model (6), good approximation over the averaged (smoothed) data (patterns with known characteristics) is ensured.

2.3. Construction of Standard Patterns

The idea underlying the classification algorithm makes it possible to use any available number of seismograms containing confirmed industrial blasts and regional earthquakes for the pattern construction. It is clear that the pattern averaging reveals features of the characteristic functions of the corresponding classes of events but fades the differences between events of the same class.

For each certain event, algorithm (1)–(6) described above is used to construct the characteristic functions. If the number of certain events is sufficiently large, then three patterns are found separately for the set of blasts and the set of earthquakes: the middle pattern and two patterns corresponding to the boundaries S = 0.5σ, where σ is the root-mean-square deviation found by

σ_i = ((1/U_{1 or 2}) Σ_{j=1}^{U_{1 or 2}} (C_{i,j} − μ_i)²)^{1/2},  μ_i = (1/U_{1 or 2}) Σ_{j=1}^{U_{1 or 2}} C_{i,j},

where U_1 and U_2 are the required numbers of certain seismograms of industrial blasts and regional earthquakes, respectively. On average, for one seismograph, U_1 = 30 and U_2 = 19 are chosen.

Thus, for the sets of blasts and earthquakes, we obtain three patterns for each set, the middle pattern and two boundary patterns: for blasts, Blast and Blast ± S (B, B ± S); for earthquakes, EarthQuake and EarthQuake ± S (E, E ± S).

Our analysis showed that, for seismic stations that are sufficiently far from noisy industrial zones, the characteristic time of passing and effective damping of a seismic disturbance caused by blasts is in the range of 50–75 s. Therefore, the pattern width (working window) can be set to m = 6145 samples, or 61.45 seconds for the signal sampling rate of 100 Hz.

Another feature of the algorithm is abstract patterns. They are obtained using the function f(t) = A(t/T_h)^h exp(h − t/T_h) + ε(t): by varying the values of the parameters T and h in (6) for t = 0, …, m, the set of unimodal envelopes for the generalized data vector H (5) is formed. Using formula (6), we obtain a set of abstract patterns of wave disturbances. This pattern is obtained as the characteristic function of the single-level signal f(t) = 1.

In the current version of the algorithm, we use an array of 16 double-precision patterns represented by C^t_{i,j}, i ∈ [0, 6144] ∩ Z, j ∈ [0, 15] ∩ Z.

2.4. Construction of the Diagnostic Matrix

According to (6), all the patterns are in the same metric space. By interpreting the patterns as features and the samples as objects (independent observations), we can complement their set with the sample characteristic function and calculate an analog of the Bayesian diagnostic function by standardizing the matrix C in the objects by formula (7).

Step 5. Add the column vector C^s to the matrix C^t_{i,j} (i ∈ [0, 6144] ∩ Z, j ∈ [0, 15] ∩ Z) on the right to obtain the matrix C_{i,j}, i ∈ [0, m − 1], j ∈ [0, n], n = 16:

S_{i,j} = (C_{i,j} − μ_i)/σ_i,  where μ_i = (1/n) Σ_{j=1}^{n} C_{i,j}.   (7)

The standard deviations used in the distance computations are

σ_1 = ([Σ_{i=0}^{m−1} S²_{i,16} − (Σ_{i=0}^{m−1} S_{i,16})²/(m − 1)]/(m − 1))^{1/2},

σ_2 = ([Σ_{i=0}^{m−1} S²_{i,j} − (Σ_{i=0}^{m−1} S_{i,j})²/(m − 1)]/(m − 1))^{1/2},

where the weighting coefficient is

w_i = 1 for i ∈ [0, 2(m − 1)/3], and w_i = 0 otherwise.

There are no a priori arguments in support of various distances (see Table 1); for this reason, in the current version of the algorithm, we use all distances that are feasible for nominal features with different variations [31].

2.5. Classification

In the algorithm, we use a simple voting scheme in which each distance has a single vote. Each vote is given to the pattern with the minimum distance (8) to the sample characteristic function. The straightforward summation of the votes given to each pattern determines its rank.

Step 7. The matrix D is transformed as follows: for each k ∈ [0, 11],

D_{k,j} = 1 for D_{k,j} = min(D_k), and D_{k,j} = 0 otherwise.   (9)

Step 8. The ranks

R_j = Σ_{k=0}^{11} D_{k,j}   (10)

are computed, and the maximum of {R_j} is found.
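The voting of Steps 7 and 8 can be sketched as follows. This is a hypothetical standalone sketch (names are ours): D is the matrix of distances between the sample and the patterns, with one row per distance measure and one column per pattern; as in eq. (9), every pattern reaching the row minimum receives the vote.

```java
// Sketch of Steps 7-8: binarize each row of D at its minimum (eq. (9))
// and sum the votes into the ranks R_j (eq. (10)).
public class VoteSketch {
    static int[] ranks(double[][] d) {
        int nDistances = d.length;    // 12 distance measures, k in [0, 11]
        int nPatterns = d[0].length;  // 16 patterns, j in [0, 15]
        int[] r = new int[nPatterns];
        for (int k = 0; k < nDistances; k++) {
            double min = d[k][0];
            for (int j = 1; j < nPatterns; j++) min = Math.min(min, d[k][j]);
            // eq. (9): vote for the pattern(s) at the minimum; eq. (10): accumulate.
            for (int j = 0; j < nPatterns; j++) if (d[k][j] == min) r[j]++;
        }
        return r;
    }
}
```

On the data of Table 2, this produces the rank column R, whose maximum (here 9, for the pattern B) is then examined by the ranked-voting scheme of Step 9.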
Table 2. Example of classification result for one step of the working window

№   Pattern type   D0  D1  D2  D3  D4  D5  D6  D7  D8  D9  D10 D11   R   Conclusion
1   WR-I           0   0   0   0   0   0   0   0   0   0   0   0     0
2   WR-II          0   0   0   0   0   0   0   0   0   0   0   0     0
3   WR-III         0   0   0   0   0   0   0   0   0   0   0   0     0
4   WL             0   0   0   0   0   0   0   0   0   0   0   0     0
5   B+S            0   0   0   0   0   0   0   0   0   0   0   0     0
6   B              0   0   1   1   1   1   1   1   1   1   1   0     9   2
7   B−S            0   0   0   0   0   0   0   0   0   0   0   0     0
8   WM             0   0   0   0   0   0   0   0   0   0   0   0     0
9   EQ+S           0   0   0   0   0   0   0   0   0   0   0   0     0
10  EQ             0   0   1   0   0   0   0   1   0   0   1   0     3
11  EQ−S           0   0   0   0   0   0   0   0   0   0   0   0     0
12  WR             0   0   0   0   0   0   0   0   0   0   0   0     0
13  WF-III         0   0   0   0   0   1   1   0   0   0   0   0     2
14  WF-II          0   0   0   0   0   0   0   0   0   0   0   0     0
15  WF-I           0   0   0   0   0   0   0   0   0   0   0   0     0
16  WN             0   0   0   0   0   0   0   0   0   0   0   0     0
Step 9. The classification conclusion is formed using the following scheme, which is called ranked voting (Table 2):

(a) if there is no unique maximum, then the conclusion is undefined = 0;

(b) if there is a unique maximum greater than 10, then the conclusion is strictly = 1, which means strict correspondence to the pattern;

(c) if the unique maximum is greater than 8 and less than 11, then the conclusion is not strictly = 2, which means nonstrict correspondence to the pattern;

(d) if the unique maximum is less than 9, then the conclusion is perhaps = 3, which means probable correspondence to the pattern.

It is seen from Table 2 that, at the current step of the working window, the shape of the signal of length 6144 samples corresponds (nonstrictly) to the pattern Blast; indeed, by the classifier conditions (Step 9), we have 8 < max(R) < 11, and the maximum is unique. At each iteration step 1 s long, we obtain the result in the form of the iteration index, equal to the number of seconds elapsed from the signal start, and the pair (pattern type, result index).

The shift of the working window from the signal start to its end makes it possible to detect and identify, according to the patterns, all significant disturbances in the day-long timeframe, thus forming the complete classification map (Fig. 1).

To analyze and identify the classifications, for each shift of the working window, we seek patterns that are typical for regional earthquakes and industrial blasts (Fig. 1). A pattern is a sequence of classification conclusions (Table 1), i.e., points in the plot arranged in a certain order in the working window. For the patterns of industrial blasts and of regional earthquakes, the arrangement and type of points on the X axis are the same; the differences are only on the Y axis. For unambiguous identification, the following conditions must be simultaneously fulfilled:

● on an interval not shorter than 5 seconds, the Blast/Earthquake pattern must have one or more classification conclusions of type strictly = 1; a number of classification conclusions of type not strictly = 2 and perhaps = 3 are also allowed;

● the same 5-second time interval must contain one or more classification conclusions of type strictly = 1 and (or) not strictly = 2 and (or) perhaps = 3 for the patterns Blast ± S/EarthQuake ± S, respectively.

To automate the search for patterns, the classification map is written to a JSON file (Fig. 2, see the next section).

(Figure 1 plots the classification conclusions over time, 08:41:30–08:43:15 on Jan 14, 2013; the Y axis lists the patterns WaveRear-I, WaveRear-II, WaveRear-III, WaveLeft, Blast+S, Blast, Blast−S, WaveMiddle, EarthQuake+S, EarthQuake, EarthQuake−S, WaveRight, WaveFront-III, WaveFront-II, WaveFront-I, WhiteNoise, and Undefined; the point legend distinguishes undefined, strictly, not strictly, and perhaps.)

Fig. 1. Classification map for a day-long record of a seismic signal with the boxed pattern of an industrial blast.

"conclusion type": {"sample (no. of s)": [...], "pattern type": [...]}

Fig. 2. Fragment of the desired pattern corresponding to an industrial blast written in JSON format.

3. ADAPTATION OF THE ALGORITHM FOR DISTRIBUTED COMPUTATIONS IN APACHE SPARK

The analysis of the classification algorithm (Steps 1–9) shows that the computations at each iteration of the working window shift are independent. This implies that the classification conclusion is obtained as an abstract value of one of the features (see Step 9 in Section 2). Therefore, we may decompose the day-long record of length L_ch into partitions, with the number of partitions equal to the number of available cores in the cluster, and then compute model (2)–(10) in parallel (the Map stage). Next, after all Map tasks have completed, the results of the tasks are collected (collect), and each result is assigned its partition index (Fig. 4). At the last stage, the Driver Program joins the results, ordering them by time in the initial signal (the classification map); this is the Reduce stage. Thus, it is possible to organize the computations using the classical MapReduce scheme with an intermediate Collect stage (Fig. 3).

(Figure 3 shows the scheme: the Driver Program broadcasts the variables CH, swO, sw_sum, C, and C_sum and parallelizes the array of partition numbers 0–29 into an RDD, the number of partitions being equal to the number of CPU cores of the allocated worker nodes; at the map stage, each task on a worker node processes the model for its steps per core; at the collect stage, the result of each task, a JSON-formatted String identified by its partition number, is gathered; the reduce stage creates the JSON classification map.)

Fig. 3. Generic scheme of computation organization in Apache Spark for the classification algorithm on three physical cluster nodes with 10 cores on each node.

The classification map is formed by the driver program after the Reduce stage in the form of a JSON file (an example is shown in Fig. 4), where the field x contains the step number of the working window, i.e., the number of seconds elapsed from the signal start, the field y contains the index of the classified pattern (Table 2) for the current working window, and the field time contains the time corresponding to the first sample in the working window. Additionally, the names of the channel files and the initial and end times after their synchronization are indicated.

At the first stage, the data are prepared. To this end, an RDD (Resilient Distributed Dataset) object is created that contains the partition indexes (partition) according to the number of available cluster cores (number_of_partitions). Next, the number of steps of the working window for each core (steps_per_core) is determined by formula (11); one step of the working window is a shift by one second in the day-long record of the signal:

steps_per_core = L_ch / ((number_of_partitions)(sample_rate)).   (11)

Each task (Task) processes only its part of the array CH. The index of the first element of that part is found by formula (12):

start_idx_i = (steps_per_core)(partition_i),  i = 0, …, number_of_partitions − 1.   (12)

The entire array CH is represented by a special Broadcast object of the Spark API. Broadcast is a broadcast variable stored in the cache at each worker node of the cluster. In distinction from an ordinary variable, a Broadcast variable is not sent as a copy to each task, which makes it possible to deal efficiently with large sets of invariable data; the day-long seismic signal records are indeed invariable. The use of broadcast variables allowed us to significantly optimize the algorithm described above.
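For illustration, the partitioning arithmetic of formulas (11) and (12) can be sketched as follows. This is hypothetical standalone code; the sample values in the test assume a full day at 100 Hz (8,640,000 samples) and 30 partitions, as in the cluster configuration of Fig. 3.

```java
// Sketch of formulas (11) and (12): split the day-long record among partitions.
public class PartitionSketch {
    // Eq. (11): number of one-second window shifts handled by each core.
    static long stepsPerCore(long lch, int numberOfPartitions, int sampleRate) {
        return lch / ((long) numberOfPartitions * sampleRate);
    }

    // Eq. (12): index of the first working-window step of a given partition.
    static long startIdx(long stepsPerCore, int partition) {
        return stepsPerCore * partition;
    }
}
```

With these hypothetical values, each of the 30 tasks classifies 2880 one-second shifts, which together cover the 86,400 seconds of the day.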
(a) Stage "Collect":

{
  "partition00": {
    "undefined": {...},
    "strictly": {
      "x": [721, 722, 723, ...],
      "y": [3, 4, 6, ...],
      "time": ["2013-01-14 00:12:04", "2013-01-14 00:12:05", "2013-01-14 00:12:13", ...]
    },
    "notstrictly": {...},
    "perhaps": {...}
  },
  ...
  "partition29": {
    "strictly": {
      "x": [..., 85432],
      "y": [..., 12],
      "time": [..., "2013-01-14 23:43:54"]
    }
  }
}

(b) Stage "Reduce":

{
  "undefined": {...},
  "strictly": {
    "x": [721, 722, 723, ..., 85432],
    "y": [3, 4, 6, ..., 12],
    "time": ["2013-01-14 00:12:04", "2013-01-14 00:12:05", "2013-01-14 00:12:13", ..., "2013-01-14 23:43:54"]
  },
  "notstrictly": {...},
  "perhaps": {...},
  "channel1": "AN.BRCR.81.EHE.D.2013.014",
  "channel2": "AN.BRCR.81.EHN.D.2013.014",
  "channel3": "AN.BRCR.81.EHZ.D.2013.014",
  "signalStartTime": "2013-01-14 00:04:02",
  "signalEndTime": "2013-01-14 23:58:01"
}

Fig. 4. Classification map of the day-long seismic signal in the form of a JSON file at the stages (a) Collect and (b) Reduce, respectively.
To reduce the execution time of the algorithm implementation and to adapt it for working in the Apache Spark environment, we did the following.

1. Preliminarily, the elements sw_{i,j} (2) for the three channels of the day-long record are calculated, and the array swO_{i,j}, i ∈ [0, L_ch − 1], is formed. The sums Σ_{i=0}^{m−1} sw_{i,j} in (2) and (3) for each shift step are also calculated in advance. This gives the array

sw_sums_{j,s} = sw_sums_{j,s−1} ± Σ_{l=1}^{100} swO_{100s±l, j},  s ∈ [1, L_ch/100],

where

sw_sums_{j,0} = Σ_{i=0}^{m−1} sw_{i,j}.

The arrays swO and sw_sums are declared broadcast variables for all tasks in the Apache Spark environment and are distributed using the Broadcast object. This optimization significantly reduces the time needed to compute (2) and (3) because, at each shift step of the working window, only the sums of the 100 preceding and 100 succeeding elements of sw are computed.
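The running-sum update behind sw_sums can be sketched as follows. This is hypothetical standalone code; the test uses a small window and step instead of m = 6145 and step = 100.

```java
// Sketch of the sw_sums optimization: when the window shifts by `step`
// samples, update the window sum by adding the elements that enter and
// subtracting the elements that leave, instead of re-summing all m elements.
public class RunningSumSketch {
    static double[] windowSums(double[] swo, int m, int step) {
        int nShifts = (swo.length - m) / step + 1;
        double[] sums = new double[nShifts];
        double s = 0;
        for (int i = 0; i < m; i++) s += swo[i];  // sw_sums_0: full sum, computed once
        sums[0] = s;
        for (int k = 1; k < nShifts; k++) {
            for (int l = 0; l < step; l++)
                s += swo[(k - 1) * step + m + l]   // element entering the window
                   - swo[(k - 1) * step + l];      // element leaving the window
            sums[k] = s;
        }
        return sums;
    }
}
```

Each shift then costs 2·step additions instead of m, which for m = 6145 and step = 100 is roughly a thirtyfold reduction in the work for the denominators of (3).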
{
// Main Java class
"className": "org.myapp.seismatica.classifiers.ClassificationProcessor",
// Path to the file containing the program
"file": "hdfs://cloudera-node04/user/jars/seismatica/seismatica-classifier-1.0.jar",
"name": "Seismatica – Classifier",
"args": [
"DistanceClassifier", // Classifier name
chFilesArg, // Array of paths to channel files
templateFile, // Path to the file of patterns
"100", // Step of the window shift
"30" // Number of tasks to be executed concurrently
],
"conf": {
"spark.executor.instances": "3", // Number of Executor objects to be executed concurrently
"spark.task.cpus": "1", // Number of CPUs per one task
"spark.executor.cores": "10", // Number of tasks per one Executor
"spark.executor.memory": "4g", // Memory allocated for one Executor
"spark.driver.memory": "2g", // Memory allocated for the Driver Program
"spark.driver.extraClassPath": "/mnt/hdfs/user/jars/seismatica/*", // Path to java classes
"spark.executor.extraClassPath": "/mnt/hdfs/user/jars/seismatica/*" // Path to java classes
}
}
Fig. 5. Example of JSON object in the POST request to the service Apache Livy for remote task run.
2. For the array C^t, the sum CSum_i = Σ_{j=1}^{n} C^t_{i,j} is preliminarily computed. CSum_i is also declared a broadcast variable. Then, the expression μ_i = (1/n) Σ_{j=1}^{n} C_{i,j} in (7) can be replaced at each step by μ_i = (1/n)(C_{i,16} + CSum_i), which also reduces the number of operations with the sums of elements.

3. The algorithms used to find the distances (see [14]) include repeating expressions. For example, the Bray–Curtis distance includes the expression |S_{i,16} − S_{i,j}|, which also appears in the Canberra distance, and Σ |S_{i,16} − S_{i,j}| (j ∈ [0, 15]) appears in the City Block distance. The computation of the correlation through the covariance yields the expressions Σ S²_{i,16}, Σ S²_{i,j}, and Σ S_{i,16} S_{i,j}, which are used in the computation of the Euclidean and cosine distances. Taking into account the fact that the weighting coefficient equals one at two thirds of the total number of iterations and is zero elsewhere, we should register the sum values at iteration number 2(m − 1)/3 and use them for computing the distances with the weighting coefficient. That is, if the Euclidean distance is Σ S²_{i,16} + Σ S²_{i,j} − 2 Σ S_{i,16} S_{i,j} for i ∈ [0, m − 1], then the Euclidean distance with the weighting coefficient is computed in the same way, only for i ∈ [0, 2(m − 1)/3].

Such optimizations significantly reduce the algorithm execution time because, without them, the same computations on the same datasets would be performed several times for each shift. Taking into account the number of steps (84000 on average) and the window size (m − 1 = 6144 samples), the proposed optimizations significantly improve the performance.

At the second stage, each task independently computes model (3)–(10), obtains its part of the classification result, and identifies it as partition##, where ## is the partition index (Fig. 4a); next, the reduce procedure is started (Fig. 4b). The result is saved to a JSON file.

At the third stage, the classification map is analyzed, and the characteristic patterns of event types are distinguished. The table of classified seismic events is constructed.

4. SOFTWARE IMPLEMENTATION

The algorithm is implemented in the form of a library in Java (a jar file) (see Section 6, Source Code). The computation kernel uses the Apache Spark API (Java) framework and the resource manager Apache YARN. The results are saved in the HDFS file system. To make input/output operations more convenient, the HDFS file system is mounted as a folder in Linux using the package hadoop-hdfs-fuse. The computations are started through Apache Livy using the file of parameters shown in Fig. 5.

The library contains Java classes that are responsible for the computational part of the algorithm and additional classes that implement the methods of preliminary seismic signal processing and the work with Apache Spark API objects (Fig. 6).

ClassificationProcessor is the main class responsible for starting the classification process. It contains the subclass classify, which extends the class Function (Spark API), for passing it to the function map. The subclass classify implements the classification algorithm (the classes SignalProcessor and DistanceClassifier) adapted for work in distributed mode on cluster nodes. The class ClassificationProcessor configures the task execution environment (Executor) using the object SparkContext. ClassificationProcessor implements the procedure of allocating the broadcast variables containing the preliminarily computed data (see Section 3) in the classes TemplateProcessor and MiniSEEDProcessor (Fig. 7). These variables can be accessed from all cluster nodes in the shared memory of the current context object.

The template data are stored in HDFS in a CSV file. The class TemplateProcessor has methods for reading the data, processing the CSV data, and constructing the array C^t_{i,j} and the sum Σ_{j=1}^{n} C_{i,j} for each ith sample for adding to the pool of broadcast variables (see Section 3).

The class MiniSEEDProcessor contains methods for working with files in the miniSEED format using the library iris-WS.jar. This library makes it possible to decode and read data from files containing seismic channel records. MiniSEEDProcessor implements the channel synchronization procedure with respect to time and forms the broadcast variable for the matrix CH (1); moreover, some additional data, namely, the signal length and the metadata for each channel (name, sampling rate, and initial/final record time), are also computed in this class.

The method classify (Fig. 8) produces a JSON file (Fig. 4a) as its result; it contains one partition##. After all the tasks have been completed, the driver program joins the results into a single string-type object (Fig. 4b). This object is written as a JSON file to the HDFS file system (Fig. 9).

5. PERFORMANCE TEST

The performance test was carried out by executing the classification process on day-long records made by the BRCR¹ station. 150 runs were made. All the files of signals (three channels) were different. Table 3 shows the mean algorithm execution time. The results for four implementations of the algorithm are presented: Matlab and Java (console application) in a local test, and Java and Python (Spark API application) in a distributed test. We measured only the processing time, beginning from feeding the input parameters (channel files, template file, Spark configuration parameters, etc.) to obtaining the JSON files of the classification map.

¹ The codes of the International Registry of Seismograph Stations (IR) are available at http://www.isc.ac.uk/registries/.
// Pattern processor
TemplateProcessor templateProcessor = new TemplateProcessor();
templateProcessor.process(templatesFile);
// Channel processor
MiniSEEDProcessor miniSEEDProcessor = new MiniSEEDProcessor(...);
miniSEEDProcessor.process(chFiles[0], chFiles[1], chFiles[2], true);
// Configuration of the run context
SparkConf sparkConf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// Public data allocation
Broadcast<double[][]> CBroadcast = sc.broadcast(templateProcessor.C);
...
// Starting the tasks
List<String> classificationMapParts = sc.parallelize(IntStream.range(0, partitionCount)
        .boxed().collect(Collectors.toList()), partitionCount)
    .map(new classify(..., ..., CBroadcast, ...))
    .collect();
...
// Implementation of the function object for the method map (run concurrently in each task)
class classify implements Function<Integer, String> {
    ...
    public classify(..., ..., Broadcast<double[][]> CBroadcast, ...) {
        this.CBroadcast = CBroadcast;
    }
    @Override
    public String call(Integer core) {

Fig. 7. Fragment of code for configuring the Spark context and starting the computation task of the class ClassificationProcessor.
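The distance optimization described in item 3 of Section 3 can be sketched outside Spark as follows. This is hypothetical standalone code (names are ours): one pass accumulates the three sums used by the Euclidean distance and registers their values at the cutoff 2(m − 1)/3, so the weighted variant costs no extra pass over the data.

```java
// Sketch of the registered-partial-sums trick for the Euclidean distance:
// a single pass over i yields both the full squared distance (i in [0, m-1])
// and the weighted one (i in [0, 2(m-1)/3], where w_i = 1).
public class WeightedDistSketch {
    // Returns {full squared distance, weighted squared distance} between
    // pattern column j and the sample column 16 of the standardized matrix S.
    static double[] euclidSquared(double[][] s, int j, int m) {
        int cutoff = 2 * (m - 1) / 3;
        double a = 0, b = 0, c = 0, aw = 0, bw = 0, cw = 0;
        for (int i = 0; i < m; i++) {
            a += s[i][16] * s[i][16];   // sum of S^2_{i,16}
            b += s[i][j] * s[i][j];     // sum of S^2_{i,j}
            c += s[i][16] * s[i][j];    // sum of S_{i,16} S_{i,j}
            if (i == cutoff) { aw = a; bw = b; cw = c; }  // register partial sums
        }
        return new double[] { a + b - 2 * c, aw + bw - 2 * cw };
    }
}
```

The same registered sums also feed the covariance-based correlation and the cosine distance, which is why the paper groups them into one preliminary computation.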
Note that there are many algorithms for classifying seismic events. A comparison of all of them with the proposed algorithm is out of the scope of this paper: it is practically impossible to analyze the mathematical models underlying these algorithms and their implementations, and it is therefore difficult to assess the possibility of running them in parallel (distributed) mode and of implementing and optimizing them. Our purpose was to achieve the optimal execution time of the proposed algorithm with the aim of using it for receiving and processing seismic signals in stream mode while retaining the accuracy of the results. However, even in comparison with the FAST algorithm [16] mentioned above, we achieved a 25-fold acceleration. Furthermore, as the number of worker nodes in the cluster and, correspondingly, the number of cores increase, the execution time decreases linearly. This is possible due to running a greater number of tasks for one day-long signal and due to increasing the number of simultaneous processes depending on the number of signals in a week-long timeframe.

We compared our results with the observation logs of the geophysical monitoring service of the Kemerovo region. In 95% of cases, the classification conclusions of the proposed algorithm coincided with these observations (the data for the year 2013, containing about 500 seismic events (industrial blasts) obtained from two stations, were used).

6. SOURCE CODE

The source code of the Java implementation of the algorithm described in this paper and the initial data (patterns and files containing the day-long seismic signals) are freely available on the Internet at https://bitbucket.org/ogidog/seismatica-classifier/src/master/.
Fig. 8. Fragment of code for forming the String object containing JSON.
Table 3. Performance test of the software implementation of the classification algorithm (execution time)
Day-long record: Java Spark API | Python Spark API | Matlab | Java

(Figure 9 shows the data flow: the channel files (chFiles) and the pattern file (templatesFile) are read from HDFS; MiniSEEDProcessor.process() forms chData, and ClassificationProcessor.process() prepares the broadcast variables chData, C, and CSums, which are shared with the Task on each worker node; there, DistanceClassifier.process() computes the distance matrix D and the conclusionResult, which is returned to the driver and saved to HDFS as resultAsJSON.)

Fig. 9. Data stream diagram of model (2)–(10) for one shift of the working window per second.

7. CONCLUSIONS

A software library for the fast detection and classification of seismic events in a day-long timeframe based on distributed computations in the Apache Spark environment is developed. Based on the independence of the iteration steps of shifting the moving window, approaches to distributing the execution of the proposed algorithm and to its optimization are demonstrated. Using special broadcast variables in the shared memory of the Spark cluster, preliminary computations were carried out with the aim of reducing the number of arithmetic operations in the tasks involved in the Apache Spark organization of computations.

The performance tests demonstrated a significant reduction of the time needed for processing day-long and week-long seismic records compared with sequential processing. Due to the extremely fast execution, the proposed approach to the optimization of the mathematical model and the implementation of the algorithms makes it possible to use it for stream signal processing.

Methods of adaptation of seismological mathematical models to modern technologies of massively parallel execution of tasks on clusters are demonstrated. In our opinion, the proposed approach can also be useful in other fields of science and engineering.

FUNDING

This work was supported by the Russian Foundation for Basic Research, project no. 18-07-00013А.

REFERENCES

1. Scarpetta, S., Giudicepietro, F., Ezin, E.C., Petrosino, S., Del Pezzo, E., Martini, M., and Marinaro, M., Automatic classification of seismic signals at Mt. Vesuvius volcano, Italy, using neural networks, Bull. Seismol. Soc. Am., 2005, vol. 95, no. 1, pp. 185–196.

2. Benbrahim, M., Daoudi, A., Benjelloun, K., and Ibenbrahim, A., Discrimination of seismic signals using artificial neural networks, Proc. World Acad. Sci. Eng. Technol., 2005, vol. 4, pp. 4–7.

3. Diersen, S., Lee, E.-J., Spears, D., Chen, P., and Wang, L., Classification of seismic windows using artificial neural networks, Proc. Comput. Sci., 2011, vol. 4, pp. 1572–1581.

4. Hamer, R.M. and Cunningham, J.W., Cluster analyzing profile data confounded with interrater differences: A comparison of profile association measures, Appl. Psychol. Meas., 1981, vol. 5, pp. 63–72.

5. Kedrov, E.O. and Kedrov, O.K., Spectral time method of identification of seismic events at distances of 15°–40°, Izv., Phys. Solid Earth, 2006, vol. 42, no. 5, pp. 398–415.

6. Langer, H., Falsaperla, S., Powell, T., and Thompson, G., Automatic classification and a posteriori analysis of seismic event identification at Soufrière Hills volcano, Montserrat, J. Volcanol. Geotherm. Res., 2006, vol. 153, no. 1, pp. 1–10.

7. Lyubushin, A.A., Jr., Kaláb, Z., and Častová, N., Application of wavelet analysis to the automatic classification of three-component seismic records, Izv., Phys. Solid Earth, 2004, vol. 40, no. 7, pp. 587–593.

8. Musil, M. and Plešinger, A., Discrimination between local microearthquakes and quarry blasts by multi-layer perceptrons and Kohonen maps, Bull. Seismol. Soc. Am., 1996, vol. 86, no. 4, pp. 1077–1090.