PII: S0010-4825(18)30174-4
DOI: 10.1016/j.compbiomed.2018.06.026
Reference: CBM 3006
Please cite this article as: B. Bozkurt, I. Germanakis, Y. Stylianou, A study of time-frequency features
for CNN-based automatic heart sound classification for pathology detection, Computers in Biology and
Medicine (2018), doi: 10.1016/j.compbiomed.2018.06.026.
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to
our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and all
legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
1
Electrical and Electronics Engineering Department, Izmir Democracy University,
Turkey
2
Faculty of Medicine, University of Crete, Greece
3
Computer Science Department, University of Crete, Greece
Corresponding author contact information:
Baris Bozkurt1, Electrical and Electronics Engineering Department, Izmir Democracy
Karabağlar/İZMİR,
Phone: +90 232 260 1001, Fax: +90 232 260 1004
Abstract
This study concerns the task of automatic structural heart abnormality risk detection
from digital phonocardiogram (PCG) signals aiming at pediatric heart disease screening
outperform systems using hand-crafted features. This study focuses on the segmentation
the most commonly used features (MFCC and Mel-Spectrogram) used in state-of-the-
1
This study was carried out during a research visit of Baris Bozkurt to the Computer
Science Department of the University of Crete from January to July 2017.
namely sub-band envelopes as an alternative feature. Via tests carried on two high
quality databases with a large set of possible settings, we show that sub-band envelopes
are preferable to the most commonly used features and period synchronous windowing
is preferable over asynchronous windowing.
Keywords: heart disease screening, heart sound classification, phonocardiogram
analysis, automated cardiac auscultation, time-frequency features
1. Introduction
The recording and analysis of acoustic vibrations recorded at the chest of a patient using
of mitral and aortic regurgitation, murmur of mitral and aortic stenosis and rheumatic
valvular lesions [1]. Clinicians listen to the heart sounds of a patient for monitoring
functionalities of the heart tissues, especially the opening and closing of the valves.
and typically involves analysis of time and frequency characteristics of heart sounds and
Heart disease represents a major health issue with significant costs worldwide2.
heart disease including various heart malformations may already be present since birth
2
http://www.who.int/mediacentre/factsheets/fs317/en/
form of CHD [2], with a wide spectrum of clinical presentations based on the severity
important first line clinical screening tools to detect individuals with CHD risk.
Although early CHD screening offers considerable health advantages, the primary
health care physician is confronted with the difficult clinical task of differentiating
(innocent) murmurs often present among healthy children from those
associated with abnormal hemodynamics indicative of CHD (abnormal murmurs) [3].
Referring all children with a murmur for expensive diagnostic tests (such as
echocardiography) is not a cost-effective approach [4]. Still, expert auscultation is
frequently recommended as first line screening tool prior to application of diagnostic
One important resource that can support pediatric structural (CHD) screening is the
use of automatic heart sound classification technologies. Efficient screening has high
potential to both lower the financial costs and also allow use of expert resources more
effectively. A low-cost, non-invasive and fast screening method would also provide the
advances in machine learning and computing, close to human performances have been
reached in many audio classification tasks including the heart sound classification. Our
In the present study, we followed the approach common to design of most of the
recently developed high performance systems: convolutional neural networks trained on
functional system for automatic PCG classification that has been tested and shown to
have a performance at the level of the state-of-the-art, our main focus is to study various
this study are as follows: we present results of extensive comparative tests carried out with
a multitude of settings of segmentation (period-synchronous and asynchronous with
various sizes), time-frequency representations and neural network models on two large
databases: i) our proprietary PCG database which is composed of PCG recordings from
patients referred to a cardiology expert by pediatricians, ii) a recent challenge data used
this domain, is preferable over the commonly used time-frequency representations like
share our code for one of the settings, which can be tested using the publicly available
data.
cardiac auscultation and the literature of automatic heart sound classification with a
specific focus on CNN-based approaches. The proposed method is presented in section 3
with subsections on feature extraction (in which segmentation is also covered) and the
machine learning models. We explain the test design (together with the specification of
the test data) in section 4 followed by test results in section 5. Section 6 is dedicated to
conclusions drawn from the test results. Finally, discussions and future work are
presented in section 7.
2. Automatic PCG classification, a short review
Basics of cardiac auscultation
Before developing an automated heart murmur classification system, a basic understanding of
heart sound generation is needed. Heart sounds are acoustic signals created within the
heart through blood flow and heart apparatus (mainly valve) motion and transmitted to
the chest surface where they can be audible either through direct ear placement
to the ears (modern auscultation since the invention of the stethoscope by Laennec) [7].
Portable electronic stethoscopes are largely used today for recording, which facilitate
characteristics of normal fundamental heart sounds (S1 and S2, corresponding to inflow
and outflow valve closure, respectively) and (if present) of murmurs or extra heart
sounds (such as clicks, abnormal split of S2, etc.). Pitch, duration, location and shape
characteristics of heart sounds and murmurs are the main components investigated. Most of
the CHD cases are associated with abnormal blood flow patterns due to the presence of
based on their temporal classification within the heart cycle (with innocent murmurs
maximum (the location where best heard) and finally due to a very subjective murmur
sound quality validation (with innocent murmurs often described as having “musical” or
Cardiac auscultation, especially pediatric cardiac auscultation, remains a challenging
clinical task. Not only does it require long-term practice and experience, but there are also
perceptual difficulties: the heart sounds and murmurs involve low-frequency
components (carrying discriminative characteristics for detection of abnormalities)
which are hardly audible, often with the presence of a high degree of noise
(environmental noise, breath, scratches due to microphone movement) in many cases,
especially in pediatric cardiac auscultation.
etc. some of which are costly. A very large percentage of the cases referred to a
cardiology expert for such costly analysis have no serious problems [11]. Automatic
heart sound classification technology has high potential to support the screening process
state-of-the-art methods (such as [12]) are already available. Leng et al [13] and
abnormalities and related PCG characteristics. Shen [15] reviewed the use of
phonocardiogram (PCG) signals for diagnosis from its early days to today. Abbas and
Bassam [16] overviewed the signal processing steps involved in processing PCG signals
in detail. Leng et al [13] further reviewed the state-of-the-art hardware systems used for
electronic stethoscopes for recording PCG signals and recent techniques used in
automatic PCG classification. Marascio and Modesti [17] and Liu et al [18] presented
detailed reviews of the trends in feature selection and automatic classification strategies.
automatic classification systems are: (training) data, segmentation, feature extraction
and machine learning. Each of these components (and their interrelation) has been
discussed in various dimensions within this very large literature (with a few hundred
papers). Our particular interest in this work is the feature extraction and machine
learning components.
Features used for automatic PCG classification can be grouped as follows: time
domain, frequency domain, statistical domain and time–frequency domain features [18].
Studies using the time domain features typically include in the feature vector, duration
measures (for S1, S2, diastole, systole, R-to-R) and their ratios (for example the ratio of
systolic interval to the heart beat), relative amplitude/energy measures of heart sound
components, and other common time-domain features like the zero-crossing rate. An
open-source system involving such features is presented in [18] as the baseline system
representations and/or measures as is the case for other automatic sound classification
tasks. Schmidt et al [19] have considered a wide range of spectral features for automatic
PCG classification such as parametric models for the spectra, instantaneous frequency
and amplitude (IFA), power in octave bands with a conclusion that the low-frequency
bands carry important information that can be effectively modeled for designing
discriminative features for PCG signals. In our previous study [20], we have used
modeled/computed and used as feature, such as sample entropy, simplicity and spectral
entropy [19].
Here, due to the availability of these numerous reviews on this topic, we limit our
discussion to a specific methodology which has been used since the early days of
automatic PCG classification and has recently attracted more attention due to
neural networks to build effective automatic systems for cardiac abnormality detection.
Time-frequency features and neural networks for automatic heart sound classification
representation of) spectra computed from windowed segments of the signal to form a
representation in audio classification is that for specific classes of sounds, some patterns
exist within these image-like representations. While a multitude of options exist for
mimicking human auditory response such as Bark or Mel. Introduced by Davis and
Mermelstein in 1980 [21], Mel Frequency Cepstral Coefficients (MFCC) is possibly the
most frequently used feature in all automatic sound classification domains. MFCC is
also very commonly and effectively used in automatic PCG classification studies for
use wavelet-based features as a time-frequency representation (for example [23] and
[27]) as wavelets have certain advantages in terms of resolution over the STFT.
various different sound classification tasks using time-frequency representations and
deep neural network architectures reporting high success rates [28, 29]. For the
automatic heart sound classification task, this approach is also gaining popularity and
systems based on this approach rank among the best in recent challenges. In PhysioNet-
2016 challenge [12], half of the systems [24-26, 30] among the top 8 (selected out of 48
systems) use such an approach. Except [30] for which we could not find a detailed
description/documentation, all other three systems make use of MFCC and some other
milliseconds (ms) windows with hop size of 10 ms, etc.) that is fed into a neural
network classifier. Some of the parameter choices seem to be highly influenced by other
audio classification tasks. For example, the window and hop sizes used in speech
processing tasks are often chosen to be a multiple of some commonly accepted average
period). A window length of a few periods (25-30 ms) and a hop length of about one period
(10 ms) are common in speech processing. Applying the same lengths (25-30 ms
given the maximum frequency would not exceed 5 Hz (200 ms period), yet appear to be
3. Proposed method
the following difference: while most studies present a single best selected configuration,
asynchronous, different sizes (close to average PCG period lengths)), various features
(MFCC, Mel-Spectrogram, Spectrogram, etc. with different sizes (time resolution and
of different settings is worth testing. We have designed tests to compare various settings
of common time-frequency representations used directly as the input to a CNN
classifier. The features considered for this study are: Mel-Spectrogram, MFCC and sub-band
envelopes. With tests on two high quality datasets, we show that sub-band
envelopes are preferable over other options in many settings and a system with
relatively simple architecture, built using this feature achieves high performances. Next,
Feature extraction can be performed at the whole signal level or frame/segment level
where multiple frames are extracted via windowing. As explained in the introduction,
within this study, we limit ourselves to frame-level time-frequency feature extraction.
we include both of these strategies in our comparative tests. For period synchronous
Period marking (segmentation into heart cycles or marking of cycle starting instances)
can be carried out directly on PCG signals; there is a large literature on this task and
publicly available state-of-the-art tools [31]. As the location of the PCG recording (on
the body of the patient) influences the relative energies of S1-S2 components and high
challenging task. When Electrocardiogram (ECG) signals, recorded in parallel, are
available, automatic marking can be more reliably performed since ECG signals are less
noisy and they include a main peak which can be tracked for reliable period marking.
Our database (explained in the test design part) includes ECG signals recorded
simultaneously together with the PCG signals, hence an algorithm with the following
steps is implemented and used for extracting period marks from the ECG signals
(detecting R-peak locations of the ECG, which correspond to the S1 onset of the PCG [31]):
● High-pass filtering the ECG signal to remove very low frequency variations
● Signal peak detection by applying a threshold: the threshold (with an initial
value of 0.5) is lowered incrementally until the peak count is larger than four times
the estimated number of cycles. This choice aims at coping with possible octave
surrounding peaks
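
The thresholding step of the list above can be illustrated with a short Python sketch. It is not the implementation used in the study: the moving-average high-pass filter, the step size of 0.05, the lower bound on the threshold and all function names are assumptions made for illustration, and the estimated number of cycles is passed in as an argument because its estimation is not described in the surviving text. The remaining steps of the list (handling of octave errors and selection among surrounding peaks) are not reproduced.

```python
def highpass_moving_average(x, win):
    """Remove slow baseline drift by subtracting a centred moving average.

    Assumption: any high-pass filter could be used here; this one only
    keeps the sketch free of external dependencies.
    """
    n = len(x)
    half = win // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(x[i] - sum(x[lo:hi]) / (hi - lo))
    return out


def local_peaks(x, threshold):
    """Indices of local maxima whose value exceeds the threshold."""
    return [i for i in range(1, len(x) - 1)
            if x[i] > threshold and x[i] >= x[i - 1] and x[i] > x[i + 1]]


def candidate_peaks(ecg, est_cycles, win=51, start=0.5, step=0.05, floor=0.05):
    """Lower the threshold until more than 4 * est_cycles peaks are found.

    `win`, `step` and `floor` are hypothetical parameters; only the initial
    threshold of 0.5 and the factor of four come from the text.
    """
    x = highpass_moving_average(ecg, win)
    peak = max(abs(v) for v in x) or 1.0
    x = [v / peak for v in x]            # make the threshold relative to the peak
    threshold = start
    peaks = local_peaks(x, threshold)
    while len(peaks) <= 4 * est_cycles and threshold - step >= floor:
        threshold -= step
        peaks = local_peaks(x, threshold)
    return peaks, threshold
```

The function returns candidate peak locations (in samples) together with the threshold at which the loop stopped; deciding which candidates are true R-peaks is left to the later steps.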
We did not carry out formal testing of this algorithm but visually checked most of the
samples in our database to observe potential problems. The method provides high
quality period marking for almost all cases. In Figure 1, two samples from our database
are presented. Top figures include the ECG signals where period marks are represented
with red dots together with fixed-length frames obtained and the bottom figures show
the corresponding PCG signals with period marks. For the windowing operation to
extract frames that also involve S1 component of the heart sound, period marks are
Figure 1: Automatic period marking and period synchronous segmentation on ECG
signals into 0.5 second frames. Period marks are indicated with red dots and obtained fixed-
Once the period marks are available, different strategies can be used for
segmentation to obtain PCG frames. The following segmentation strategies are worth
testing:
the local period (half a period, one period, two periods, etc.). For segment
● Period synchronous segmentation with fixed segment length (0.5 sec., 1 sec., 2
sec., etc.). Wherever the segment length exceeds period length, overlap is
inherent.
● Period asynchronous segmentation with fixed segment length (0.5 sec., 1 sec., 2
sample is depicted together with frame boundaries for fixed length of 2 sec. and
Figure 2: Asynchronous segmentation example with length of 2 sec. and hop size of 1
sec. Frame boundaries are indicated with black solid lines.
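
The two fixed-length strategies (period synchronous and asynchronous) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: frame boundaries are expressed in samples, synchronous frames are assumed to start exactly at the period marks (any offset applied to include the S1 component is not modeled), and the function names are hypothetical.

```python
def synchronous_frames(period_marks, frame_len, n_samples):
    """One fixed-length frame per period mark; frames may overlap."""
    return [(m, m + frame_len) for m in period_marks
            if m + frame_len <= n_samples]


def asynchronous_frames(frame_len, hop, n_samples):
    """Fixed-length frames at a constant hop, ignoring the heart cycle."""
    return [(s, s + frame_len)
            for s in range(0, n_samples - frame_len + 1, hop)]


# Example matching Figure 2: 2 s frames with a 1 s hop at 4410 Hz.
fs = 4410
frames = asynchronous_frames(2 * fs, 1 * fs, 5 * fs)
print(frames)
```

With period marks closer together than the frame length, the synchronous variant produces overlapping frames, which is the inherent overlap mentioned in the list above.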
A number of such options are considered in the tests carried out. This of course adds a
common (which are also often included in audio processing software libraries):
● Spectrogram
time-frequency representation. Here, sub-band envelopes of a given PCG signal (in the
of sub-bands obtained by band-pass filtering the PCG signal (discussed in detail in the
next subsection). For all features, various time and frequency resolutions were tested as
explained in the test design section. A Tukey window (with r=0.08) is applied to signal
Sub-band envelopes as a time-frequency feature
One of the important steps in PCG analysis by a cardiology expert is the investigation
of signal shapes of murmurs and heart sounds. The experts often use dedicated software
tools to apply band-pass filters (with flexible settings they can control) and check the
shape and localization of signal components. Using their experience with prior cases,
they check for patterns in the shapes of the signals and signal envelopes. In
development of automatic classification systems, this practice can be imitated by
feature in the automatic speech recognition domain [32, 33]. In the automatic PCG
analysis literature, the use of envelope signals is more common for segmentation
purposes. Liu et al [18] presented a detailed review of envelope-based methods used for
automatic segmentation of PCG signals. While envelope signals are successfully used in
segmentation tasks (for example [34]), they were also directly used as features fed into
neural network classifiers, although rarely, since the early days of automatic PCG
classification [35]. Some of the wavelet based features, when the extracted coefficients
study following that approach is presented by Deng & Han [36], where sub-band
envelopes are calculated from discrete wavelet decomposition (DWT) coefficients. [24]
uses median powers of sub-band signals which can also be considered as a similar
representation where very few samples were used for each sub-band envelope.
The sub-band envelopes can be computed in various ways. We have chosen the
following steps (also depicted in Figure 3) for computing sub-band envelopes of a PCG
segment:
● Envelope detection via computing the analytic signal using the Hilbert transform
● Envelopes are resampled to a specific time-resolution (inherently involves low-
● Logarithmic compression applied to the final envelope signals
● All envelopes are stacked to obtain an image-like time-frequency representation
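
The steps above can be sketched with NumPy as follows. This is an illustrative sketch rather than the implementation used in the study: an ideal FFT-domain band-pass with logarithmically spaced band edges stands in for the gammatone filter bank referred to in footnote 3, resampling is done by averaging within equal-length bins, and the function name, the lower band edge of 20 Hz and the small constant added before the logarithm are all assumptions. The segment is assumed to contain at least `n_time` samples.

```python
import numpy as np


def subband_envelopes(x, fs, n_bands=8, n_time=128, f_lo=20.0, f_hi=None):
    """Return an (n_bands, n_time) log-envelope matrix for one PCG segment."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    f_hi = f_hi or fs / 2.0
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)    # assumed band layout
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Keeping only the positive frequencies of the band and doubling them
        # yields the analytic signal of the band-passed segment, so band-pass
        # filtering and the Hilbert transform are done in one step.
        mask = (freqs >= lo) & (freqs < hi)
        analytic = np.fft.ifft(2.0 * spectrum * mask)
        envelope = np.abs(analytic)
        # Resample to n_time points; bin averaging acts as a crude low-pass.
        coarse = np.array([b.mean() for b in np.array_split(envelope, n_time)])
        rows.append(np.log(coarse + 1e-8))           # logarithmic compression
    return np.vstack(rows)                           # stack into an image
```

For a 2 second segment sampled at 4410 Hz, `subband_envelopes(segment, 4410)` returns an 8 by 128 matrix that can be used as a single-channel image input.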
In Figure 3, we present the flow diagram for the process and an example for feature
extraction, depicting the sub-band signal envelopes computed and final feature obtained
in matrix form. The top sub-plots include 8 sub-band signals and their resampled
versions extracted from the original PCG signal (shown in blue). Considering this
(number of time bins), an 8×128 image-like representation is derived and plotted with
color coding element values resulting in the bottom sub-plot which is the main feature
3
Filter banks designed using Python library by Jason Heeris: https://github.com/detly/gammatone
Figure 3: Sub-band envelopes feature computation. Original PCG signal depicted in
blue. Sub-band envelope feature is the matrix obtained at the output of the feature
extraction process which is depicted as a colored image via mapping matrix coefficients
to color code (low values: dark, high values: bright).
A large variety of neural network models may be used for the PCG classification task.
Our tests are limited to use of feed-forward CNN models applied on frame level
features, one of the most popular approaches in the recent state-of-the-art systems in
this domain.
To keep the number of tests limited (so that tests can be repeated in a reasonable
amount of time), we have considered three similar models which include common
sequence of layers used for similar tasks in literature: 2D convolutional layers (with
kernel size 3 by 3, rectified linear unit activation) followed by max-pooling and drop-
out layers. The input dimension is equal to the feature dimension and the output
dimension is two (number of categories: normal and pathological). The models are
implemented using Keras4 with TensorFlow5 as the backend. The Keras models and all
other design parameters are available from the accompanying repository6. Since PCG
database sizes are often relatively small (compared to other automatic classification
tasks with CNNs), complex models with high capacities learn to memorize the training
data. For this reason, the number of layers was kept small and L1-regularisation
models are: 1, 2 and 4.
Each model is designed to compute probability of a segment to belong to a recording
belonging to pathological class: all frame probabilities are sorted, the 15% lowest and 15%
highest values are discarded and finally, the probability for a file is computed as the
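
The trimming step described here can be sketched in Python as follows. The statistic applied to the remaining values is cut off in the text above, so the sketch assumes a plain mean (a trimmed mean overall); the function name and the rounding of the 15% count down to an integer are likewise assumptions.

```python
def file_probability(frame_probs, trim=0.15):
    """Combine frame-level probabilities into one file-level probability.

    Assumption: the values left after trimming are averaged.
    """
    ordered = sorted(frame_probs)
    k = int(len(ordered) * trim)          # frames dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# One outlier frame among ten does not affect the file-level result.
print(file_probability([0.2] * 9 + [1.0]))
```

Discarding both tails makes the file-level decision less sensitive to a few noisy frames than a plain average over all frames would be.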
4. Test design
Here, we first explain the data used in the tests and further discuss various dimensions
Databases:
Two databases (with large differences in patient ages and the pathologies) were
considered for the comparative tests. The first database is a proprietary database
4
https://keras.io
5
https://www.tensorflow.org
6
https://github.com/barisbozkurt/AutomaticPCGclassification script models.py
database is a publicly available one, involving mainly adult heart sound recordings, it
represents the largest up-to-date phonocardiogram database worldwide, and is used for
University of Crete, PCGs with murmur (UoC-murmur) database:
length including 4 to 18 PCG cycles with an average of 8 cycles), obtained from
pediatric cardiology outpatients as standard of care (provided time allowance) and from
a pilot pediatric cardiology screening program for school age children (8-year olds),
approved by Greek Ministry of Education and local Health Authorities, including digital
phonocardiogram as a component for pediatric heart disease screening (Cretan Pediatric
associated with various types and severity levels of CHD. This database is proprietary
Each recording was labeled as normal (i.e. having innocent murmur) or abnormal by
a single expert in pediatric cardiac auscultation (I.G, the second author) based on
involves therefore samples with abnormal murmurs obtained from children of various
ages, and often suboptimal recording conditions, or innocent murmurs which were
either difficult for their pediatricians to classify as such, or were recorded during
primary school visits (associated with high probability of external noise). The available
conditions.
database (83 PCG samples) has been cross-validated blindly by two pediatric cardiology
experts independently [37]. The database includes samples with various levels of CHD
real-life daily clinical challenges scenario. Selected recordings of the same database
have also been used for teaching purposes [6, 38]. Representative digital
phonocardiograms of this database, along with extended introductory web-lectures in
pediatric cardiac auscultation, are freely available as open-source material on the
institutional web server7.
The database contains 336 recordings from 327 healthy children with innocent
murmurs and 130 recordings from 117 children with various forms of CHD, of various
been standardized and described previously [37]. Briefly, a sensor based electronic
stethoscope with incorporated 3-lead ECG8 was used. Four recordings were
performed for each patient, corresponding to the apical, lower left (fourth intercostal
space) and upper (second intercostal space) left and right parasternal locations. Digital
acoustic data (with a sampling rate of 44100 Hz, 16-bit dynamic resolution) and ECG
signals were transferred and stored as wave files on a personal laptop computer using
the designated software9. Any personal identification data has been removed and
For each patient, one or two recordings were selected by the expert as having the highest
quality for murmur detection and all other recordings were removed from the set. The
7
https://opencourses.uoc.gr/courses/enrol/index.php?id=367 Password for the video lectures is available
upon request from the authors.
8
TheStethoscope®; Welch Allyn-Meditron, Welch Allyn Inc., NY, USA
9
Meditron Analyzer 4®
following steps were applied for pre-processing of the original data: i) ECG data was
down sampled to 882 Hz and PCG data was down sampled to 4410 Hz, ii) both ECG
and PCG signals were amplitude normalized to have a maximum level of 0.9.
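
The two pre-processing steps can be illustrated with a short Python sketch. It is an illustration only: block averaging is used as a crude stand-in for a proper anti-aliasing filter followed by decimation, and the function names are hypothetical. The integer factors follow from the stated rates (44100 Hz to 4410 Hz is a factor of 10, 44100 Hz to 882 Hz is a factor of 50).

```python
def downsample(x, factor):
    """Integer-factor down-sampling by block averaging (assumed method)."""
    n_blocks = len(x) // factor
    return [sum(x[i * factor:(i + 1) * factor]) / factor
            for i in range(n_blocks)]


def normalize_peak(x, peak=0.9):
    """Scale the signal so that its maximum absolute value equals `peak`."""
    top = max(abs(v) for v in x)
    return list(x) if top == 0 else [v * peak / top for v in x]


# One second of audio at 44100 Hz becomes 4410 samples for the PCG channel.
pcg = normalize_peak(downsample([0.1, -0.4, 0.2, 0.3, -0.1] * 8820, 10))
print(len(pcg), max(abs(v) for v in pcg))
```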
PhysioNet-2016 database:
Normal/Abnormal Heart Sound Recordings: the PhysioNet/Computing in Cardiology
Challenge 201610 [12]. This database includes a compilation of various other databases
and is a very good resource for comparing a specific system with various state-of-the-art
algorithms without the need of implementing them and running experiments. The
PhysioNet-2016 data includes some very noisy data (even some non-PCG samples) and
does not include ECG channels. Detailed profiles of the 9 included databases are
provided in Section 2 and Table 1 of [12], reaching a total number of 2435 PCG
recordings.
segmentation, feature extraction and machine learning. Our test design started with
considering cross-combinational settings for these blocks. As the first step, an initial list
10
https://www.physionet.org/challenge/2016/
segments, or fixed sizes of 0.5, 1, 2, 3 seconds segments. Overlap exists if size exceeds
seconds with overlap of 1 second.
● Spectrogram, Mel-spectrogram, MFCC and sub-band envelopes
● Time resolutions: 32, 64, 128 (points)
Machine learning models (3 options):
● Models with 1, 2 or 4 2D convolutional layers
Databases (2 options):
This initial list refers to 1584 systems (asynchronous systems to be tested on two
databases) where each test would also need to be repeated several times to remove bias
While we think it is worthwhile to consider all these settings, due to this high number
of tests, several additional options (such as using other machine learning models
(LSTM, RNN, etc.) or other file-level features from the literature) have been left out. For
isolation without test repetition. Leaving out the worst-cases in these preliminary tests,
the list has been reduced to a total of 90 systems for the final tests. Due to space
considerations, here, we will only mention our observations that have led us to leaving
systems with respect to their F1 measures and observed that systems using spectrogram
performed with the lowest scores. Hence, spectrogram was removed from the feature
list. Segment lengths defined relative to the local period length did not appear more
advantageous than fixed sizes and were also removed. For tests with period
asynchronous segments, 0.5 and 1 second lengths were too short: learning could not
converge for those cases. Using more than 16 frequency bands did not bring improvement
(PCG spectrum is limited to 2.2 kHz) and the performances observed for 8 frequency
bands and 16 frequency bands were similar. Sorting all systems, machine learning
models with 2 and 4 convolutional layers were ranked higher than the systems with a
single convolutional layer. Using delta coefficients with Mel-Spectrogram and MFCC
We finally arrived at the following reduced list of (90) systems for which the tests
can be re-run/repeated with a single GPU in a few days for one of the databases.
Features (9 options):
● Frequency bands: 16
Our tests involve performing repeated experiments for 90 systems (54 period
synchronous and 36 asynchronous) on the UoC-murmur database and then picking a
high performance period asynchronous system and repeating tests for this system on
PhysioNet-2016 data. In the tests with the UoC-murmur database, for each
segmentation strategy the following options have been tested: use of three different
features (Mel-spectrogram, MFCC and sub-band envelopes) with three different time
resolutions (32, 64, 128) and two different CNN models. Our shared repository includes
the implementation of this period asynchronous system and testing scripts. Readers
can reproduce our results with PhysioNet-2016 data by simply running our shared test
script.
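
The count of 90 systems follows from the stated options: three features, three time resolutions and two CNN models give 18 systems per segmentation strategy, so 54 synchronous and 36 asynchronous systems correspond to three and two segmentation strategies respectively. The short Python sketch below enumerates the grid; the segmentation labels are placeholders (the actual segment lengths are not assumed here) and the model names M1 and M2 follow the naming convention of Table 1.

```python
from itertools import product

features = ["mel_spectrogram", "mfcc", "subband_envelopes"]
time_resolutions = [32, 64, 128]
cnn_models = ["M1", "M2"]
# Placeholder labels only: 54 / 18 = 3 synchronous and 36 / 18 = 2
# asynchronous segmentation strategies.
segmentations = ["eSyn_a", "eSyn_b", "eSyn_c", "ASyn_a", "ASyn_b"]

systems = list(product(segmentations, features, time_resolutions, cnn_models))
n_sync = sum(1 for s in systems if s[0].startswith("eSyn"))
n_async = len(systems) - n_sync
print(len(systems), n_sync, n_async)   # 90 54 36
```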
For the learning experiments, the data needs to be split into three subsets: train,
validation/development and testing. In our tests, the validation set is used to observe
how accuracies and losses vary during learning, altering the model parameters
saving the best model learned in a learning test (when highest accuracy is achieved for
the validation set). The split ratios used for train, validation and test are 65%, 15% and
20%.
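
A 65%/15%/20% split can be sketched in Python as follows. The sketch splits each class separately so that the class proportions are similar in the three subsets; this stratification, the random seed and the function name are assumptions made for illustration, and splitting is done per recording rather than per patient.

```python
import random


def stratified_split(labels, ratios=(0.65, 0.15, 0.20), seed=0):
    """Split sample indices into train/validation/test, class by class."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(ratios[0] * len(idx))
        n_val = round(ratios[1] * len(idx))
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


# Class sizes of the UoC-murmur database: 336 normal, 130 abnormal recordings.
labels = [0] * 336 + [1] * 130
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```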
distribution of sample numbers in categories per set (i.e. train, validation and test sets
increase the size of the database for training and has been shown to be beneficial in
many applications [39]. One straightforward way of adding new samples is creating
new copies of existing samples by applying transformations to which the system
should be invariant. For our problem, we would like our system to be invariant to
minor or moderate variations in heart rate and murmur frequency band. One easy way to
create new samples with varied heart rate and murmur frequency band is to resample
existing samples and save them as if the sampling rate is not altered. This would
compress/expand the spectrum which corresponds to modification of the murmur
Data augmentation is performed by changing the sampling rate with a random value
2 is used (i.e. the size of the data is doubled). Data augmentation is applied to only the
train set.
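
The resampling-based augmentation can be sketched in Python as follows. This is a minimal illustration: linear interpolation is used as the resampling method, and the range of the random rate change (`max_change`, here plus or minus 10%) is a hypothetical value because the range used in the study is not preserved in the text above. The signal is assumed to contain at least two samples.

```python
import random


def resample_linear(x, ratio):
    """Resample by linear interpolation to about len(x) * ratio samples."""
    n_out = max(2, int(round(len(x) * ratio)))
    step = (len(x) - 1) / (n_out - 1)
    out = []
    for k in range(n_out):
        pos = k * step
        i = min(int(pos), len(x) - 2)
        frac = pos - i
        out.append((1.0 - frac) * x[i] + frac * x[i + 1])
    return out


def augment(x, max_change=0.1, rng=random):
    """Return a copy whose heart rate and spectrum are scaled randomly.

    The copy is meant to be stored with the ORIGINAL sampling rate: a ratio
    below 1 shortens the cycles (faster heart rate, spectrum shifted up)
    and a ratio above 1 lengthens them.
    """
    ratio = 1.0 + rng.uniform(-max_change, max_change)
    return resample_linear(x, ratio)
```

Calling `augment` once per training recording doubles the size of the training set, which corresponds to an augmentation factor of 2.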
category: samples in the pathological category are lower in number. Balancing the data
could be easily performed by leaving out samples from the largely populated category.
However, we cannot afford leaving out samples due to the low database size. We have
followed an alternative path: creating new samples for the category with few samples
using re-sampling. The procedure used is the same as data augmentation step. Balancing
operation, via creating new transformed samples of original files, is applied to the train
5. Test results
While the number of systems to be tested and compared is reduced to 90, there is still
need for a way to sort the systems in terms of performance. For our application of
screening, we would like our automatic systems to detect as many pathological cases as
possible (i.e. we want to increase the true positive rate (TPR)) while we can tolerate
some normal cases being labeled as pathological (i.e. we can tolerate some increase in
the false positive rate (FPR)). In a real life scenario, this would correspond to labeling
a high number of samples as pathological, referring some extra normal cases to an expert
for consultation. The output of the automatic classification system for each sample is
U
the probability for belonging to a category. The straight-forward class assignment is
AN
performed by using 0.5 as threshold for probability in a binary classification task.
Reducing the threshold for pathology detection, more cases will be labeled as
M
pathological. This would increase both true and false positive rates. For finding an
D
optimum point of operation, TPR versus FPR are plotted for different threshold values
and the commonly used Receiver Operating Characteristics(ROC) curves are obtained.
TE
The area under the ROC curve is considered as the main measure of performance for the
EP
ranking. Following the sorting, we also provide other performance measures for a
selected system.
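The threshold sweep described above can be sketched as follows. This is an illustrative example rather than the code used in the study; the function names and the toy labels and probabilities are assumptions. A sample is labeled pathological when its predicted probability is at or above the threshold, and the area under the resulting curve is computed with the trapezoidal rule.

```python
def roc_points(labels, probs):
    """Return (fpr, tpr) lists obtained by sweeping the decision threshold.

    labels: 1 for pathological, 0 for normal.
    probs:  predicted probability of the pathological class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Start above every probability: nothing is labeled pathological.
    fpr, tpr = [0.0], [0.0]
    # Lowering the threshold step by step can only add positives.
    for th in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(labels, probs) if p >= th and y == 1)
        fp = sum(1 for y, p in zip(labels, probs) if p >= th and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under a curve with non-decreasing fpr values."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    labels = [1, 1, 0, 1, 0, 0]
    probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    fpr, tpr = roc_points(labels, probs)
    print(area_under_curve(fpr, tpr))
```

In the toy data, a threshold of 0.5 detects two of the three pathological samples, whereas lowering it to 0.4 detects all three at the cost of no additional false positives; lowering it further only adds false positives.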
To start our comparison of various features with an example, below we present three ROC
curves obtained for three different features while keeping all other settings the same:
time resolution of 32, 16 frequency bands, ECG synchronous segmentation with a fixed
length of 500 milliseconds, and a CNN model with 2 convolutional layers. ROC curves
database.
Figure 4: ROC curves for systems using three different features while keeping all other
In Figure 4, the best system among the three is the one using sub-band envelopes,
since the ROC curve for that system is closer to the upper left corner (high TPR, low
FPR) and the area under the ROC is the largest. Following the intuition of this example, for
the comparison of the 90 systems, we have used the area under the ROC as a single measure to
sort all system performances. Since random splitting is involved, tests are repeated 5
times and average ROC curves are used for each system.
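One possible way to average ROC curves over repeated runs and then sort systems by area is sketched below. The text does not state how the averaging was performed, so the vertical averaging on a common FPR grid, the function names and the system names `sysA` and `sysB` are all assumptions for illustration.

```python
def tpr_at(fpr, tpr, x):
    """Highest TPR of a ROC curve at false positive rate x."""
    best = 0.0
    for i in range(1, len(fpr)):
        x0, x1, y0, y1 = fpr[i - 1], fpr[i], tpr[i - 1], tpr[i]
        if x0 <= x <= x1:
            # Vertical segments (x0 == x1) contribute their upper end.
            y = y1 if x1 == x0 else y0 + (y1 - y0) * (x - x0) / (x1 - x0)
            best = max(best, y)
    return best


def average_roc(curves, n_grid=101):
    """Average several (fpr, tpr) curves on a common FPR grid."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    mean_tpr = [sum(tpr_at(f, t, x) for f, t in curves) / len(curves)
                for x in grid]
    return grid, mean_tpr


def rank_systems(curves_by_system):
    """Sort systems by the area under their averaged ROC curve."""
    scores = {}
    for name, curves in curves_by_system.items():
        grid, mean_tpr = average_roc(curves)
        scores[name] = sum(
            (grid[i] - grid[i - 1]) * (mean_tpr[i] + mean_tpr[i - 1]) / 2.0
            for i in range(1, len(grid)))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    perfect = ([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
    chance = ([0.0, 1.0], [0.0, 1.0])
    # Hypothetical systems, each with 5 repeated runs.
    ranking = rank_systems({"sysA": [chance] * 5, "sysB": [perfect] * 5})
    for name, area in ranking:
        print(name, round(area, 4))
```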
Table 1: Sorted list of systems with respect to area under the ROC. Naming convention:
M1/2: CNN model number, eSyn: ECG synchronous, ASyn: asynchronous. Rightmost
number refers to fixed length of the frame in milliseconds. Table includes the best and
worst 25 systems. Please refer to the github repository for the table with all 90 systems:
/results4allSystems_UocDba/sortingWRTareaUnderROC.txt
Rank  Area under ROC  System setting (Rank: 1-25)  Rank  Area under ROC  System setting (Rank: 66-90)
1     0.8772          M2SubEnv128by16_eSyn500      66    0.6945          M2MelSpec128by16_nASyn2000
2     0.8764          M2SubEnv32by16_eSyn500       67    0.6897          M2MelSpec64by16_nASyn3000
3     0.8736          M2SubEnv64by16_eSyn1000      68    0.6878          M2MelSpec32by16_eSyn500
4     0.8716          M1SubEnv32by16_nASyn2000     69    0.6873          M2MelSpec32by16_eSyn1000
5     0.8705          M2SubEnv32by16_eSyn2000      70    0.6808          M2MelSpec32by16_nASyn2000
6     0.8691          M2SubEnv64by16_eSyn500       71    0.6719          M2MFCC64by16_eSyn500
7     0.8679          M2SubEnv64by16_eSyn2000      72    0.6662          M2MelSpec128by16_nASyn3000
8     0.8648          M2SubEnv32by16_nASyn2000     73    0.6657          M1MFCC128by16_nASyn2000
9     0.8646          M2SubEnv128by16_eSyn2000     74    0.6552          M2MFCC64by16_eSyn1000
10    0.8640          M1SubEnv32by16_eSyn500       75    0.6544          M2MelSpec64by16_nASyn2000
11    0.8636          M2SubEnv32by16_eSyn1000      76    0.6516          M1MFCC32by16_nASyn3000
12    0.8617          M1SubEnv32by16_eSyn1000      77    0.6509          M1MFCC64by16_nASyn2000
13    0.8595          M2SubEnv128by16_eSyn1000     78    0.6502          M1MFCC32by16_nASyn2000
14    0.8522          M1SubEnv32by16_nASyn3000     79    0.6482          M1MFCC64by16_nASyn3000
15    0.8497          M1SubEnv32by16_eSyn2000      80    0.6452          M1MFCC128by16_nASyn3000
16    0.8437          M1SubEnv64by16_eSyn1000      81    0.6334          M2MFCC32by16_eSyn500
17    0.8404          M1SubEnv64by16_eSyn2000      82    0.6209          M2MFCC64by16_eSyn2000
18    0.8362          M1SubEnv64by16_eSyn500       83    0.6205          M2MFCC32by16_eSyn2000
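The naming convention of the system settings in Table 1 can be decoded mechanically. The sketch below is an illustration, not part of the shared repository: the class, the field names and the regular expression are assumptions, and the reading of "128by16" as time resolution by number of frequency bands is inferred from the settings quoted in the text (time resolution of 32, 16 frequency bands).

```python
import re
from dataclasses import dataclass


@dataclass
class SystemSetting:
    model: int            # CNN model number (M1 or M2)
    feature: str          # SubEnv, MelSpec or MFCC
    time_resolution: int  # assumed: first number in "128by16"
    n_bands: int          # assumed: second number in "128by16"
    synchronous: bool     # True for ECG synchronous (eSyn) segmentation
    frame_ms: int         # fixed frame length in milliseconds


_NAME = re.compile(
    r"^M(?P<model>\d)"
    r"(?P<feature>SubEnv|MelSpec|MFCC)"
    r"(?P<time>\d+)by(?P<bands>\d+)"
    r"_(?P<seg>eSyn|nASyn|ASyn)"
    r"(?P<ms>\d+)$")


def parse_system_name(name):
    """Split a Table 1 system name into its settings."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized system name: {name}")
    return SystemSetting(
        model=int(match.group("model")),
        feature=match.group("feature"),
        time_resolution=int(match.group("time")),
        n_bands=int(match.group("bands")),
        synchronous=match.group("seg") == "eSyn",
        frame_ms=int(match.group("ms")))


if __name__ == "__main__":
    print(parse_system_name("M2SubEnv128by16_eSyn500"))
    print(parse_system_name("M1SubEnv32by16_nASyn2000"))
```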
ROC curves for these best and worst 20 systems are presented below.
Figure 5: ROC curves of best and worst 20 systems tested on the UoC-murmur database
In the figure below, we present ROC curves of systems using specific features.
The test results show that systems using sub-band envelopes are ranked higher than
those using MFCC and Mel-Spectrogram features: 23 systems out of the best 25 use
sub-band envelope as the feature. ROC curves also support this observation: ROC
curves of systems using sub-band envelopes are closer to the top left corner compared
to other ROC curves. One interesting observation is that a system using asynchronous
important since period marking, therefore the ECG channel, is not needed in the design
of such systems.
To compare performances of synchronous and asynchronous systems, in Figure 7,
we provide the ROC curves for systems using sub-band envelopes, split into two groups, one
Figure 7 (panels: "Synchronous systems (sub-band env.)" and "Asynchronous systems (sub-band env.)"): ROC curves for systems using sub-band envelopes, split
into two groups: systems applying synchronous segmentation and systems applying
asynchronous segmentation
higher but a few of the asynchronous system performances are comparable to the highest
performances of the synchronous systems. This is also reflected in Table 1: the
asynchronous system ranked fourth has an ROC area of 0.8716, whereas the best system
Zabihi et al [25]).
Test results with PhysioNet-2016 data:
Thanks to its authors, the PhysioNet-2016 data [18] serves as an excellent resource for
comparing new proposals with recent state-of-the-art systems without the need to
implement these systems and re-run the experiments of the challenge, since the
performances of these systems are already reported in [12]. Here, we present our tests
carried out with this openly available data and report our system performance, which can be
contrasted with the results in [12].
For PhysioNet-2016 data, ECG channels are not available. Recently, Zabihi et al. [25]
have shown that high performance can also be achieved using asynchronous frames
have run experiments for the most performant system using asynchronous frames that
Using a hop size of 1 second, and balancing via creation of new samples, the number
of frames extracted from the PhysioNet-2016 database was 103228. As the number of
segments is very high, data augmentation is not applied in these tests. Each
test is repeated 5 times and results are averaged. In Table 2 we present the confusion
Table 2: Confusion matrix for CHD risk detection (after averaging results of 5
M1SubEnv32by16_nASyn2000
Pathological (actual)   127.6   23.4    151
                        159.8   141.2
Sensitivity = 0.845, Specificity = 0.785, Accuracy = 0.815
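The reported measures can be re-derived from the surviving entries of Table 2 with a short calculation. This sketch rests on stated assumptions: the columns are read as predicted pathological, predicted normal and row total, the row for normal cases (not present in the table as reproduced here) is inferred from the column totals, and "Accuracy" is taken to be the mean of sensitivity and specificity, in line with the "mean accuracy" wording used later in the text.

```python
def screening_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and their mean from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mean_accuracy": (sensitivity + specificity) / 2.0,
    }


# Pathological (actual) row of Table 2, averaged over 5 tests.
TP, FN = 127.6, 23.4
# Column totals of Table 2 (assumed: predicted pathological, predicted normal).
TOTAL_PRED_PATHOLOGICAL, TOTAL_PRED_NORMAL = 159.8, 141.2
# Inferred row for normal cases; not read from the table itself.
FP = TOTAL_PRED_PATHOLOGICAL - TP
TN = TOTAL_PRED_NORMAL - FN

if __name__ == "__main__":
    for name, value in screening_metrics(TP, FN, FP, TN).items():
        print(f"{name} = {value:.3f}")
```

Under these assumptions the three values agree with the reported 0.845, 0.785 and 0.815 after rounding to three decimals.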
The openly accessible PhysioNet-2016 data contains a train set and a validation set (shared
with the aim of pre-testing the functionality of a submission to the challenge), which is
actually a subset of the train set. Since the main aim is to run an open challenge, the test set
is not available. To facilitate comparison of our results with tests in other studies, we
decided to use the provided validation set as the test set, removed its copies from the train
set and further split the train set into train and validation subsets (this validation set
system, testing scripts (that download PhysioNet-2016 data, perform splitting and
run the experiments) and more detailed results involving other evaluation measures have
been shared openly on GitHub (footnote 11) to facilitate reproducibility of our test results.
In Table 3 of [12], the performances of the top 8 systems (out of 48 submitted systems) are
listed, with sensitivity values ranging from 0.7120 to 0.9424, specificity values ranging
from 0.7569 to 0.9521 and mean accuracy ranging from 0.7057 to 0.8602. These values are
computed by applying weighting with respect to the signal quality on classification
results obtained on the test data that is not openly available. In the tests we have carried out
(with the train, validation and test set split explained above), the following scores have been
obtained for our shared system: 0.845 sensitivity, 0.785 specificity and 0.815 mean
accuracy. While these results cannot be directly compared with the results in [12] (since
they are not computed on the same test subset and weighting is not applied), they show
that our system performs similarly to the top-ranked state-of-the-art systems. The reader
can refer to the complete table in [12] for details regarding the performances of the best
systems in the challenge.
11 https://github.com/barisbozkurt/AutomaticPCGclassification
6. Conclusions
This study targeted comparing various features and segmentation strategies in the
context of automatic PCG classification for screening purposes, based on feedforward
PCG frames. To arrive at an optimum design, 90 different system settings were tested
database) and a system selected to have high performance in these tests was also tested
on the PhysioNet-2016 data containing normal and pathological cases. The code (of
this specific system and the test scripts) has been openly shared with the community for
reproducibility of our study and to facilitate comparisons with the state of the art. We should
stress here that our main contribution with this manuscript is in comparing various
segmentation and feature computation strategies, not proposing a single best system that
is more performant than the state of the art. Our analysis with PhysioNet data supports the
fact that the comparative tests have been carried out using system architectures as
For ranking 90 distinct systems, ROC curves are obtained by applying different
threshold levels for final categorization from probabilities of pathology, and the area
under the ROC has been used as the single measure representing the potential of each system
for screening applications. All systems are sorted in terms of area under the ROC for
comparison. Further, we have provided other performance measures for a selected
system. Sensitivity and specificity are critical measures for screening applications.
Together with accuracy, these evaluation metrics are the most common in comparative
studies such as [12].
As presented in Table 1, the systems using sub-band envelopes have the highest
ranks in the sorted list of systems (with respect to area under ROC): the 23 highest-ranked
systems out of 90 use sub-band envelopes as the feature. Considering most of the state-
observation. The ROC curves of the systems using sub-band envelopes are in general
closer to the top-left corner than those of systems using MFCC or Mel-Spectrogram (Figure 6).
The UoC-murmur database included PCG samples with murmur which were
recorded from patients who were referred to a cardiology expert. That means the
pediatricians have considered all cases in this data set to have a potential for heart
Compared to adult auscultation, specific challenges also exist for auscultation of young
children. Obtaining clean recordings free of scratch noise is in some cases difficult.
Heart rate is often higher compared to adults (up to double the adult norm) which
The best system developed through tests on our data (UoC-murmur database) is
envelopes with a time resolution of 64 and 16 frequency bands, computed on period-synchronous
1-second frames, as the feature. This system has not been tested on the PhysioNet-2016
data due to unavailability of the ECG channel. Following the tests with the UoC-murmur
database, a system using period asynchronous frames has been tested on the
in tests with UoC-murmur database). We have shown that our asynchronous system
performs similarly to the top-ranked state-of-the-art systems reported in [12] with 0.845
Our study involves some processes requiring further in-depth analysis, which we
consider as challenges for further studies. The first is gaining a better understanding of
the effectiveness of the data augmentation step applied and alternatives to it. While the
uniform resampling does not reflect the variability of the cardiac cycles governed by
physical constraints of the heart. Data augmentation strategies respecting the physical
For sub-band envelope computation, we have only considered one specific setting of
Gammatone filter banks: simply setting the number of banks to 8, 16, 24, etc. Our study
lacks an in-depth analysis of the sub-band filtering process. Gammatone filter banks have
been preferred as they reflect some of the auditory response characteristics (although not
all, such as the loudness-related non-linear auditory behavior). A study of optimization
We have applied frame-level classification, the results of which were later fused via averaging to
deduce the probability that the whole PCG signal belongs to a category. Many other options
exist for such a step (for example majority voting). We have not tested other strategies
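The two fusion options named above, averaging and majority voting, can be contrasted with a small sketch. The function names, the 0.5 threshold and the toy frame probabilities are assumptions for illustration and do not reproduce the shared implementation.

```python
def fuse_by_averaging(frame_probs, threshold=0.5):
    """Average frame probabilities, then threshold the mean."""
    mean_prob = sum(frame_probs) / len(frame_probs)
    return mean_prob, mean_prob >= threshold


def fuse_by_majority_vote(frame_probs, threshold=0.5):
    """Threshold each frame first, then take the majority decision."""
    votes = sum(1 for p in frame_probs if p >= threshold)
    return votes * 2 > len(frame_probs)


if __name__ == "__main__":
    # Toy probabilities of pathology for the frames of one recording.
    frames = [0.95, 0.9, 0.4, 0.45, 0.3]
    print(fuse_by_averaging(frames))
    print(fuse_by_majority_vote(frames))
```

The toy recording shows how the two rules can disagree: two very confident frames pull the mean above 0.5, while only two of the five frames vote for pathology individually.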
Design of automatic PCG classification systems requires optimization of a large
number of settings. Improving system performance through parameter optimization is
one option for future studies. Another direction for further studies is the use of multi-sensor
signal processing techniques to lower the need for experienced operators for
screening applications. In [40], the authors propose noise cancellation using
multi-channel PCG recordings, which would result in more robust PCG analysis systems. Joint
recorded using multisensor systems [41,42] also has the potential to lead to improved
performance for screening applications. Building end-products and testing them in real-life
scenarios is an important future direction our research community should consider.
Acknowledgements
This project has been funded by the Special Account for Research of the University of Crete
(code number 4305). We would like to thank the Greek Ministry of Education and the
local Health Authorities (7th Health Region Crete) for their support of the CPCS program,
and the University of Crete for the support of innovative cardiac auscultation teaching
approaches (including web-lecture hosting). We would like to thank Vassilis Tsiaras for
his valuable help and assistance throughout the study and the fruitful discussions that
led to the final designs, and Alena Burianova Bagaki for the valuable assistance in
References
[1] Rangayyan, R. M., & Lehner, R. J. (1986). Phonocardiogram signal analysis: a
[2] Ferencz, C., Rubin, J. D., Mccarter, R. J., Brenner, J. I., Neill, C. A., Perry, L. W., ...
& Downing, J. W. (1985). Congenital heart disease: prevalence at livebirth: the
[3] Van Oort, A., Le Blanc-Botden, M., De Boo, T., Van Der Werf, T., Rohmer, J., &
Daniels, O. (1994). The vibratory innocent heart murmur in schoolchildren: difference
[4] Michael, S. Y., Kimball, T. R., Tsevat, J., Mrus, J. M., & Kotagal, U. R. (2002).
[5] Cheitlin, M. D., Armstrong, W. F., Aurigemma, G. P., Beller, G. A., Bierman, F. Z.,
Davis, J. L., ... & Kussmaul, W. G. (2003). ACC/AHA/ASE 2003 guideline update for
1146-1162.
[6] Germanakis, I., Petridou, E. T., Varlamis, G., Matsoukis, I. L., Papadopoulou-
[7] Hanna, I. R., & Silverman, M. E. (2002). A history of cardiac auscultation and some
[8] Newburger, J. W., Rosenthal, A., Williams, R. G., Fellows, K., & Miettinen, O. S.
(1983). Noninvasive tests in the initial evaluation of heart murmurs in children. New
England Journal of Medicine, 308(2), 61-64.
[9] Smythe, J. F., Teixeira, O. H., Vlad, P., Demers, P. P., & Feldman, W. (1990).
Initial evaluation of heart murmurs: are laboratory tests necessary? Pediatrics, 86(4),
497-500.
[10] Geva, T., Hegesh, J., & Frand, M. (1988). Reappraisal of the approach to the child
with heart murmurs: is echocardiography mandatory? International Journal of
Cardiology, 19(1), 107-113.
[11] Telatar, Z., & Erogul, O. (2003, September). Heart sounds modification for the
[12] Clifford, G. D., Liu, C., Moody, B., Springer, D., Silva, I., Li, Q., & Mark, R. G.
http://doi.org/10.22489/CinC.2016.179-154
[13] Leng, S., Tan, R. S., Chai, K. T. C., Wang, C., Ghista, D., & Zhong, L. (2015). The
http://doi.org/10.1186/s12938-015-0056-y
[14] Noponen, A.-L., Lukkarinen, S., Angerla, A., & Sepponen, R. (2007). Phono-
http://doi.org/10.1186/1471-2431-7-23
[15] Shen, C.-H. (2012). Acoustic based condition monitoring. University of Akron.
http://doi.org/10.2200/S00187ED1V01Y200904BME031
[17] Marascio, G., & Modesti, P. A. (2013). Current trends and perspectives for
http://doi.org/10.1136/heartasia-2013-010392
[18] Liu, C., Springer, D., Li, Q., Moody, B., Juan, R. A., Chorro, F. J., … Clifford, G.
D. (2016). An open access database for the evaluation of heart sound algorithms.
Physiological Measurement, 37(12), 2181–2213. http://doi.org/10.1088/0967-3334/37/12/2181
[19] Schmidt, S. E., Holst-Hansen, C., Hansen, J., Toft, E., & Struijk, J. J. (2015).
Acoustic features for the identification of coronary artery disease. IEEE Transactions on
http://doi.org/10.1109/TBME.2015.2432129
[20] Markaki, M., Germanakis, I., & Stylianou, Y. (2013, May). Automatic
[21] Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for
[22] Chauhan, S., Wang, P., Sing Lim, C., & Anantharaman, V. (2008). A computer-aided
MFCC-based HMM system for automatic auscultation. Computers in Biology and
[23] Vepa, J. (2009). Classification of heart murmurs using cepstral features and support
vector machines. Proceedings of the 31st Annual International Conference of the IEEE
[24] Potes, C., Parvaneh, S., Rahman, A., & Conroy, B. (2016). Ensemble of Feature-based
and Deep learning-based Classifiers for Detection of Abnormal Heart Sounds.
Computing in Cardiology. http://doi.org/10.22489/CinC.2016.182-399
[25] Zabihi, M., Rad, A. B., Kiranyaz, S., Gabbouj, M., & Katsaggelos, A. K. (2016).
Heart Sound Anomaly and Quality Detection using Ensemble of Neural Networks
without Segmentation. Computing in Cardiology.
http://doi.org/10.22489/CinC.2016.180-213
[26] Rubin, J., Abreu, R., Ganguli, A., Nelaturi, S., Matei, I., & Sricharan, K. (2016).
Classifying Heart Sound Recordings using Deep Convolutional Neural Networks and
http://doi.org/10.22489/CinC.2016.236-175
[27] Ergen, B., Tatar, Y., & Gulcur, H. O. (2012). Time-frequency analysis of
http://doi.org/10.1080/10255842.2010.538386
[28] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-R., Jaitly, N., … Kingsbury,
B. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. Signal
networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal
Processing (MLSP).
[30] Kay, E., & Agarwal, A. (2017). DropConnected neural networks trained on time-frequency
and inter-beat features for classifying heart sounds. Physiological
[31] Springer, D. B., Tarassenko, L., & Clifford, G. D. (2016). Logistic regression-HSMM-based
heart sound segmentation. IEEE Transactions on Biomedical
[32] Lu, X., Unoki, M., & Nakamura, S. (2011). Sub-band temporal modulation
envelopes and their normalization for automatic speech recognition in reverberant
http://doi.org/10.1016/j.csl.2010.10.002
[33] Mitra, V., Wang, W., & Franco, H. (2014, December). Deep convolutional nets and
[34] Schmidt, S. E., Toft, E., Holst-Hansen, C., Graff, C., & Struijk, J. J. (2008).
http://doi.org/10.1109/CIC.2008.4749049
[35] Barschdorff, D., Bothe, A., & Rengshausen, U. (1989). Heart sound analysis using
[36] Deng, S. W., & Han, J. Q. (2016). Towards heart sound classification without
[37] Germanakis, I., Dittrich, S., Perakaki, R., & Kalmanti, M. (2008). Digital
phonocardiography as a screening tool for heart disease in childhood. Acta Paediatrica,
2227.2008.00697.x
[38] Germanakis, I., & Kalmanti, M. (2009). Paediatric cardiac auscultation teaching
[39] Wong, S. C., Gatt, A., Stamatescu, V., & McDonnell, M. D. (2016, November).
Understanding data augmentation for classification: when to warp? In Digital Image
(pp. 1-6).
[40] Nunes, D., Leal, A., Couceiro, R., Henriques, J., Mendes, L., Carvalho, P., &
(EMBC), 2015 37th Annual International Conference of the IEEE (pp. 5936-5939).
IEEE.
[41] Nedoma, J., Fajkus, M., Martinek, R., Kepak, S., Cubik, J., Zabka, S., & Vasinek,
V. (2017). Comparison of BCG, PCG and ECG signals in application of heart rate
[42] Marcelli, E., Capucci, A., Minardi, G., & Cercenelli, L. (2017). Multi-sense
Highlights:
● Automatic PCG classification technology is well developed to start supporting
real-life screening applications
Conflict of interest statement for manuscript:
The authors of this manuscript declare no conflict of interest with any person or institutional
body.