
Accepted Manuscript
A study of time-frequency features for CNN-based automatic heart sound
classification for pathology detection
Baris Bozkurt, Ioannis Germanakis, Yannis Stylianou
PII: S0010-4825(18)30174-4
DOI: 10.1016/j.compbiomed.2018.06.026
Reference: CBM 3006
To appear in: Computers in Biology and Medicine
Received Date: 29 August 2017

Revised Date: 24 June 2018
Accepted Date: 24 June 2018
Please cite this article as: B. Bozkurt, I. Germanakis, Y. Stylianou, A study of time-frequency features
for CNN-based automatic heart sound classification for pathology detection, Computers in Biology and
Medicine (2018), doi: 10.1016/j.compbiomed.2018.06.026.
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to
our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and all
legal disclaimers that apply to the journal pertain.
A study of time-frequency features for CNN-based automatic heart sound
classification for pathology detection
Baris Bozkurt1, Ioannis Germanakis2, Yannis Stylianou3
barisbozkurt0@gmail.com, germjohn@med.uoc.gr, yannis@csd.uoc.gr
1
Electrical and Electronics Engineering Department, Izmir Democracy University,
Turkey
2
Faculty of Medicine, University of Crete, Greece
3
Computer Science Department, University of Crete, Greece
Corresponding author contact information:
Baris Bozkurt1, Electrical and Electronics Engineering Department, Izmir Democracy
University, Üçkuyular Mahallesi, Gürsel Aksel Bulvarı, No:14 35140
Karabağlar/İZMİR,
Phone: +90 232 260 1001, Fax: +90 232 260 1004
Abstract
This study concerns the task of automatic structural heart abnormality risk detection
from digital phonocardiogram (PCG) signals, aiming at pediatric heart disease screening
applications. Recently, various systems based on convolutional neural networks trained
on time-frequency representations of segmental PCG frames have been presented that
outperform systems using hand-crafted features. This study focuses on the segmentation
and time-frequency representation components of the CNN-based designs. We consider
the most commonly used features (MFCC and Mel-Spectrogram) in state-of-the-art
systems and a time-frequency representation influenced by domain knowledge,
namely sub-band envelopes, as an alternative feature. Via tests carried out on two
high-quality databases with a large set of possible settings, we show that sub-band
envelopes are preferable to the most commonly used features and that period-synchronous
windowing is preferable to asynchronous windowing.
1
This study was carried out during a research visit of Baris Bozkurt to the Computer
Science Department of the University of Crete in the period January to July 2017.
Keywords: heart disease screening, heart sound classification, phonocardiogram
analysis, automated cardiac auscultation, time-frequency features
1. Introduction
The recording and analysis of acoustic vibrations recorded at the chest of a patient using
a microphone-transducer is referred to as phonocardiography (PCG). Several heart
conditions can be successfully studied via analysis of phonocardiogram signals: murmurs
of mitral and aortic regurgitation, murmurs of mitral and aortic stenosis, and rheumatic
valvular lesions [1]. Clinicians listen to the heart sounds of a patient to monitor
the functioning of the heart tissues, especially the opening and closing of the valves.
Such evaluation of heart sounds targeted towards diagnosis is referred to as auscultation
and typically involves analysis of time and frequency characteristics of heart sounds and
murmurs, the overall periodicity characteristics and the quality of sounds.

Heart disease represents a major health issue with significant costs worldwide2.
Although coronary heart disease and hypertension predominate in adults, structural
heart disease including various heart malformations may already be present at birth
(congenital heart disease (CHD)), which is a considerable cause of pediatric morbidity
and mortality. Up to 1% of newborn children are considered to be affected by some
form of CHD [2], with a wide spectrum of clinical presentations based on the severity
of the underlying heart malformation. Experienced cardiac auscultation is amongst the
most important first-line clinical screening tools to detect individuals with CHD risk.
Although early CHD screening offers considerable health advantages, the primary
health care physician is confronted with the difficult clinical task of differentiating
(innocent) murmurs often present among healthy children from those
associated with abnormal hemodynamics indicative of CHD (abnormal murmurs) [3].
Referring all children with a murmur for expensive diagnostic tests (such as
echocardiography) is not a cost-effective approach [4]. Still, expert auscultation is
frequently recommended as a first-line screening tool prior to application of diagnostic
echocardiography [5]. In order to address the observed decline in clinical skills in
cardiac auscultation, several approaches have been applied, including multimedia
teaching interventions, tele-medical applications or other computer-based clinical
decision support systems [6].
2
http://www.who.int/mediacentre/factsheets/fs317/en/

One important resource that can support pediatric structural heart disease (CHD)
screening is the use of automatic heart sound classification technologies. Efficient
screening has high potential both to lower the financial costs and to allow more
effective use of expert resources. A low-cost, non-invasive and fast screening method
would also provide the opportunity to screen a large number of people at an early age,
resulting in timely diagnosis of some of the pathological cases. Thanks to recent
advances in machine learning and computing, close-to-human performance has been
reached in many audio classification tasks, including heart sound classification. Our
study targets improving the performance of automated heart sound classification
technology aimed at screening for CHD risk detection.
Focus and contributions of our study
In the present study, we followed the approach common to the design of most of the
recently developed high-performance systems: convolutional neural networks trained on
time-frequency representations of segmental PCG frames. While we present a
functional system for automatic PCG classification that has been tested and shown to
perform at the level of the state of the art, our main focus is to study various
strategies for feature extraction in a CNN-based approach. The novel contributions of
this study are as follows: we present results of extensive comparative tests carried out
with a multitude of settings for segmentation (period-synchronous and asynchronous with
various sizes), time-frequency representations and neural network models on two large
databases: i) our proprietary PCG database, which is composed of PCG recordings from
patients referred to a cardiology expert by pediatricians, and ii) a recent challenge
dataset used for comparing the performance of state-of-the-art systems. As a result of
these tests, we show that sub-band envelopes, comparatively rarely used as a
time-frequency feature in this domain, are preferable to commonly used time-frequency
representations such as Mel Frequency Cepstral Coefficients (MFCC). For reproducibility
of our study, we share our code for one of the settings, which can be tested using the
publicly available data.
This manuscript is organized as follows: In section 2, we first review the basics of
cardiac auscultation and the literature on automatic heart sound classification with a
specific focus on CNN-based approaches. The proposed method is presented in section 3
with subsections on feature extraction (in which segmentation is also covered) and the
machine learning models. We explain the test design (together with the specification of
the test data) in section 4, followed by test results in section 5. Section 6 is dedicated
to conclusions drawn from the test results. Finally, discussions and future work are
presented in section 7.
2. Automatic PCG classification, a short review
Basics of cardiac auscultation
Before developing an automated heart murmur classification system, a basic understanding
of heart sound generation is needed. Heart sounds are acoustic signals created within the
heart through blood flow and the motion of the heart apparatus (mainly the valves) and
transmitted to the chest surface, where they can be heard either through direct ear
placement (historical auscultation since the time of Hippocrates) or by using a
stethoscope as a transmitter to the ears (modern auscultation since the invention of the
stethoscope by Laennec) [7]. Portable electronic stethoscopes are widely used today for
recording, which facilitates further analysis on computer-based systems.

Proper cardiac auscultation typically involves analysis of time and frequency
characteristics of normal fundamental heart sounds (S1 and S2, corresponding to inflow
and outflow valve closure, respectively) and (if present) of murmurs or extra heart
sounds (such as clicks, abnormal split of S2, etc.). Pitch, duration, location and shape
characteristics of heart sounds and murmurs are the main components investigated. Most
CHD cases are associated with abnormal blood flow patterns due to the presence of
intracavity communications (shunt lesions) and/or valvular abnormalities (narrowing or
leakage). The clinical classification of heart murmurs as innocent or abnormal is mainly
based on their temporal classification within the heart cycle (with innocent murmurs
being exclusively early to mid-systolic), spatial classification based on punctum
maximum (the location where the murmur is best heard) and, finally, a very subjective
validation of murmur sound quality (with innocent murmurs often described as having
“musical” or “vibratory” properties) [8, 9, 10].
Cardiac auscultation, especially pediatric cardiac auscultation, remains a challenging
clinical task. Not only does it require long-term practice and experience, but there are
also perceptual difficulties: the heart sounds and murmurs involve low-frequency
components (carrying discriminative characteristics for detection of abnormalities)
which are hardly audible, often in the presence of a high degree of noise
(environmental noise, breath, scratches due to microphone movement),
especially in pediatric cardiac auscultation.
A thorough cardiologic analysis involves, in addition to auscultation and
phonocardiography, the use of a multitude of techniques and technologies such as
echocardiography, ultrasound imaging, angiography, electrocardiograms, chest X-rays,
etc., some of which are costly. A very large percentage of the cases referred to a
cardiology expert for such costly analysis have no serious problems [11]. Automatic
heart sound classification technology has high potential to support the screening process
and lower costs.

Automatic heart sound classification
Cardiac abnormality detection via automatic heart sound (HS)/phonocardiogram (PCG)
classification is a very widely studied research domain. Manuscripts providing in-depth
reviews of PCG processing methods and comparisons of large lists of
state-of-the-art methods (such as [12]) are already available. Leng et al. [13] and
Noponen et al. [14] presented a detailed overview of the problem, discussing types of
abnormalities and related PCG characteristics. Shen [15] reviewed the use of
PCG signals for diagnosis from its early days to today. Abbas and
Bassam [16] reviewed in detail the signal processing steps involved in processing PCG
signals. Leng et al. [13] further reviewed the state-of-the-art hardware systems used in
electronic stethoscopes for recording PCG signals and recent techniques used in
automatic PCG classification. Marascio and Modesti [17] and Liu et al. [18] presented
detailed reviews of the trends in feature selection and automatic classification strategies.
As for many other automatic classification problems, the crucial components of an
automatic classification system are: (training) data, segmentation, feature extraction
and machine learning. Each of these components (and their interrelations) has been
discussed in various dimensions within this very large literature (comprising a few
hundred papers). Our particular interest in this work is the feature extraction and
machine learning components.
Features used for automatic PCG classification can be grouped as follows: time
domain, frequency domain, statistical domain and time-frequency domain features [18].
Studies using time domain features typically include in the feature vector duration
measures (for S1, S2, diastole, systole, R-to-R) and their ratios (for example the ratio of
systolic interval to the heart beat), relative amplitude/energy measures of heart sound
components, and other common time-domain features like the zero-crossing rate. An
open-source system involving such features is presented in [18] as the baseline system
for the PhysioNet-2016 challenge [12].
Spectral (frequency domain) features involve a large variety of spectral
representations and/or measures, as is the case for other automatic sound classification
tasks. Schmidt et al. [19] considered a wide range of spectral features for automatic
PCG classification, such as parametric models for the spectra, instantaneous frequency
and amplitude (IFA), and power in octave bands, concluding that the low-frequency
bands carry important information that can be effectively modeled for designing
discriminative features for PCG signals. In our previous study [20], we used the
reassigned spectrogram for feature computation. Signal complexity can also be
modeled/computed and used as a feature, for example via sample entropy, simplicity and
spectral entropy [19].
Here, due to the availability of these numerous reviews on this topic, we limit our
discussion to a specific methodology which has been in use since the early days of
automatic PCG classification and has recently attracted more attention due to
advances in machine learning technology: the use of time-frequency representations and
neural networks to build effective automatic systems for cardiac abnormality detection.
Time-frequency features and neural networks for automatic heart sound classification
A time-frequency representation of a sound signal often refers to stacking (some
representation of) spectra computed from windowed segments of the signal to form a
two-dimensional representation. The basic assumption in using time-frequency
representations in audio classification is that, for specific classes of sounds, some
patterns exist within these image-like representations. While a multitude of options
exist for computing such representations, the most common technique is to map a linear
representation like the Short-Time Fourier Transform (STFT) onto a logarithmic scale
mimicking the human auditory response, such as Bark or Mel. Introduced by Davis and
Mermelstein in 1980 [21], Mel Frequency Cepstral Coefficients (MFCC) are possibly the
most frequently used feature in all automatic sound classification domains. MFCC are
also very commonly and effectively used in automatic PCG classification studies for
pathology detection [22-26]. A large number of automatic PCG classification studies
use wavelet-based features as a time-frequency representation (for example [23] and
[27]), as wavelets have certain advantages over the STFT in terms of resolution.
Following advances in deep learning, a large number of recent studies have addressed
various sound classification tasks using time-frequency representations and
deep neural network architectures, reporting high success rates [28, 29]. For the
automatic heart sound classification task, this approach is also gaining popularity, and
systems based on it rank among the best in recent challenges. In the PhysioNet-2016
challenge [12], half of the systems [24-26, 30] among the top 8 (selected out of 48
systems) use such an approach. Except for [30], for which we could not find a detailed
description/documentation, the other three systems make use of MFCC and some other
time-frequency features with fixed settings (13-14 MFCC coefficients computed on 25
millisecond (ms) windows with a hop size of 10 ms, etc.) that are fed into a neural
network classifier. Some of the parameter choices seem to be highly influenced by other
audio classification tasks. For example, the window and hop sizes used in speech
processing tasks are often chosen to be multiples of some commonly accepted average
period lengths (for example, a fundamental frequency of 100 Hz corresponding to a 10 ms
period). The use of a window length of a few periods (25-30 ms) and a hop length of
about one period (10 ms) is common in speech processing. Applying the same lengths
(25-30 ms window length and 10 ms hop length) in PCG classification is an interesting
choice given that the maximum frequency would not exceed 5 Hz (200 ms period), yet it
appears to be preferred in some state-of-the-art systems (for example [25]).
3. Proposed method
Our study follows the Convolutional Neural Network (CNN)-based classification approach
with the following difference: while most studies present a single best selected
configuration, there exist various options for segmentation (period-synchronous or
asynchronous, with different sizes close to average PCG period lengths), various features
(MFCC, Mel-Spectrogram, Spectrogram, etc. with different sizes, i.e. time resolution and
number of frequency coefficients) and various CNN architectures. Hence, a large number
of different settings is worth testing. We have designed tests to compare various settings
of common time-frequency representations used directly as the input to a CNN
classifier. The features considered for this study are: Mel-Spectrogram, MFCC and
sub-band envelopes. With tests on two high-quality datasets, we show that sub-band
envelopes are preferable to the other options in many settings and that a system with a
relatively simple architecture built using this feature achieves high performance. Next,
we discuss the computation of these features.

3.1 Feature extraction

Feature extraction can be performed at the whole-signal level or at the frame/segment
level, where multiple frames are extracted via windowing. As explained in the
introduction, within this study we limit ourselves to frame-level time-frequency feature
extraction. Further, to classify individual files, frame-level decisions are fused.
Frame-level feature extraction necessitates automatic segmentation of a signal into
frames. Segmentation can be performed period-synchronously or asynchronously, and
we include both of these strategies in our comparative tests. For period-synchronous
segmentation, period marking is necessary.
Period marking and segmentation
Period marking (segmentation into heart cycles or marking of cycle starting instants)
can be carried out directly on PCG signals; there is a large literature on this task, and
state-of-the-art tools are publicly available [31]. As the location of the PCG recording
(on the body of the patient) influences the relative energies of the S1-S2 components and
a high amount of noise may be present, performing marking reliably on PCG signals is a
challenging task. When Electrocardiogram (ECG) signals recorded in parallel are
available, automatic marking can be performed more reliably, since ECG signals are less
noisy and include a main peak which can be tracked for reliable period marking.
Our database (explained in the test design part) includes ECG signals recorded
simultaneously with the PCG signals; hence, an algorithm with the following
steps was implemented and used for extracting period marks from the ECG signals
(detecting R-peak locations of the ECG, which correspond to the S1 onset of the PCG [31]):
● High-pass filtering the ECG signal to remove very low frequency variations
● Element-wise multiplication of the pre-emphasized ECG with the original ECG
to obtain a more impulsive version of the signal
● Computing energy signal and amplitude normalization of the energy signal
● Autocorrelation based period detection from energy signal
● Estimation of number of heart cycles in the signal
● Signal peak detection by applying a threshold: the threshold (with an initial
value of 0.5) is lowered incrementally until the peak count is larger than four times
the estimated number of cycles. This choice aims at coping with possible octave
errors in the period estimation. In addition, a secondary peak may be prominent
within the cycle.
● Removing spurious peaks using peak amplitude comparison and distance to
surrounding peaks
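The steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `mark_periods`, the first-order difference used as the high-pass/pre-emphasis filter, the heart-rate search range, the threshold step of 0.05 and the minimum peak distance of 0.6 periods are all assumptions made for the example.

```python
import numpy as np

def mark_periods(ecg, fs, min_hr_bpm=40, max_hr_bpm=220):
    """Return (period marks in samples, estimated period in samples) for an ECG signal."""
    ecg = np.asarray(ecg, dtype=float)
    # 1) First-order difference as a simple high-pass / pre-emphasis filter (assumption).
    pre = np.diff(ecg, prepend=ecg[0])
    # 2) Product with the original signal gives a more impulsive signal.
    impulsive = pre * ecg
    # 3) Energy signal, normalized to a maximum of 1.
    energy = impulsive ** 2
    energy = energy / (energy.max() + 1e-12)
    # 4) Autocorrelation-based period estimate within a plausible heart-rate range.
    centered = energy - energy.mean()
    ac = np.correlate(centered, centered, mode="full")[len(centered) - 1:]
    lag_min = int(fs * 60.0 / max_hr_bpm)
    lag_max = min(int(fs * 60.0 / min_hr_bpm), len(ac) - 1)
    period = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    # 5) Estimated number of heart cycles in the signal.
    n_cycles = max(1, int(round(len(ecg) / period)))
    # 6) Lower the threshold from 0.5 until more than 4 * n_cycles local maxima are found
    #    (the loop stops at the lowest threshold if that never happens).
    interior = energy[1:-1]
    is_local_max = (interior > energy[:-2]) & (interior >= energy[2:])
    for threshold in np.arange(0.5, 0.0, -0.05):
        candidates = np.flatnonzero(is_local_max & (interior >= threshold)) + 1
        if len(candidates) > 4 * n_cycles:
            break
    # 7) Remove spurious peaks: visit candidates from strongest to weakest and drop any
    #    candidate closer than 0.6 period to an already accepted peak (assumption).
    kept = []
    for idx in candidates[np.argsort(energy[candidates])[::-1]]:
        if all(abs(idx - k) >= 0.6 * period for k in kept):
            kept.append(int(idx))
    return sorted(kept), period

# Synthetic check: ten narrow "R peaks", 200 samples apart, sampled at 250 Hz.
centers = np.arange(100, 2000, 200)
n = np.arange(2000)
ecg = sum(np.exp(-0.5 * ((n - c) / 3.0) ** 2) for c in centers)
marks, period = mark_periods(ecg, fs=250)
```

Because the energy signal is built from a differenced signal, the marks in this sketch fall on the rising edge of each peak, a few samples before its maximum.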
We did not formally test this algorithm but visually checked most of the
samples in our database to observe potential problems. The method provides
high-quality period marking for almost all cases. In Figure 1, two samples from our
database are presented. The top plots show the ECG signals, where period marks are
represented with red dots together with the fixed-length frames obtained, and the bottom
plots show the corresponding PCG signals with period marks. For the windowing operation
to extract frames that also include the S1 component of the heart sound, period marks are
shifted to the left by 75 milliseconds.

Figure 1: Automatic period marking and period-synchronous segmentation on ECG
signals into 0.5 second frames. Period marks are indicated with red dots and the
fixed-length frames obtained with black rectangles.
Once the period marks are available, different strategies can be used for
segmentation to obtain PCG frames. The following segmentation strategies are worth
testing:
● Period-synchronous segmentation with segment length defined proportionally to
the local period (half a period, one period, two periods, etc.). For segment
lengths longer than one period, overlap is inherent.
● Period-synchronous segmentation with fixed segment length (0.5 sec., 1 sec., 2
sec., etc.). Wherever the segment length exceeds the period length, overlap is
inherent.
● Period-asynchronous segmentation with fixed segment length (0.5 sec., 1 sec., 2
sec., etc.) with or without overlap. This is exemplified in Figure 2, where a
sample is depicted together with frame boundaries for a fixed length of 2 sec. and a
hop size of 1 sec. (i.e. 50% overlap).
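As a minimal sketch of the fixed-length strategies above, the two hypothetical helpers below compute frame boundaries as sample indices. The function names, the choice to drop frames that do not fit entirely inside the signal, and the example period marks are assumptions for illustration; the 75 ms left shift follows the description given earlier.

```python
def asynchronous_frames(num_samples, fs, frame_sec=2.0, hop_sec=1.0):
    """(start, end) sample indices of fixed-length frames taken at a constant hop."""
    frame_len = int(round(frame_sec * fs))
    hop_len = int(round(hop_sec * fs))
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, hop_len)]

def synchronous_fixed_frames(period_marks, num_samples, fs,
                             frame_sec=0.5, shift_sec=0.075):
    """Fixed-length frames starting at period marks shifted to the left by shift_sec."""
    frame_len = int(round(frame_sec * fs))
    shift = int(round(shift_sec * fs))
    frames = []
    for mark in period_marks:
        start = mark - shift
        end = start + frame_len
        # Dropping frames that cross the signal boundaries is an assumption.
        if start >= 0 and end <= num_samples:
            frames.append((start, end))
    return frames

fs = 1000                      # hypothetical sampling rate in Hz
num_samples = 6 * fs           # a 6 second recording
async_frames = asynchronous_frames(num_samples, fs)
sync_frames = synchronous_fixed_frames([50, 800, 1600, 5700], num_samples, fs)
```

With a 2 sec. frame and a 1 sec. hop, consecutive asynchronous frames share half of their samples, which is the 50% overlap mentioned above.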
Figure 2: Asynchronous segmentation example with length of 2 sec. and hop size of 1
sec. Frame boundaries are indicated with black solid lines.
A number of such options are considered in the tests carried out. This of course adds a
multiplicative factor to the number of tests to be performed.

Computing time-frequency features

Considering automatic audio/sound classification systems with CNN architectures, the
following time-frequency representations were selected as being amongst the most
common (they are also often included in audio processing software libraries):
● Spectrogram
● MFCC (with or without delta coefficients)
● Mel-Spectrogram (with or without delta coefficients)
In addition to these common representations, we included sub-band envelopes as a
time-frequency representation. Here, the sub-band envelopes of a given PCG signal (in
the form of a time-frequency feature) are formed by stacking temporal modulation
envelopes of sub-bands obtained by band-pass filtering the PCG signal (discussed in
detail in the next subsection). For all features, various time and frequency resolutions
were tested, as explained in the test design section. A Tukey window (with r=0.08) is
applied to signal segments just before the feature computation.
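To illustrate what such a window looks like, here is a small pure-Python sketch of a Tukey (tapered cosine) window. It assumes that r denotes the ratio of the cosine-tapered portion to the total window length, which is the usual convention for this parameter; the function name and the 500-sample frame are hypothetical.

```python
import math

def tukey_window(n_samples, r=0.08):
    """Tukey window: flat top with a cosine taper of r/2 of the length at each end."""
    r = min(r, 1.0)                      # r = 1 gives a Hann window
    if n_samples < 2 or r <= 0:
        return [1.0] * n_samples         # degenerate cases: rectangular window
    last = n_samples - 1
    edge = r * last / 2.0                # taper length (in samples) at each end
    window = []
    for n in range(n_samples):
        d = min(n, last - n)             # distance to the nearest end of the frame
        if d < edge:
            window.append(0.5 * (1.0 - math.cos(math.pi * d / edge)))
        else:
            window.append(1.0)
    return window

# A 0.5 second frame at a hypothetical sampling rate of 1000 Hz.
frame_window = tukey_window(500)
```

With r=0.08 only about 4% of the frame is attenuated at each end, so the window mainly removes edge discontinuities while leaving the heart sound components essentially untouched.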
Sub-band envelopes as a time-frequency feature
One of the important steps in PCG analysis by a cardiology expert is the investigation
of the signal shapes of murmurs and heart sounds. Experts often use dedicated software
tools to apply band-pass filters (with flexible settings they can control) and check the
shape and localization of signal components. Using their experience with prior cases,
they check for patterns in the shapes of the signals and signal envelopes. In the
development of automatic classification systems, this practice can be imitated by
building time-frequency representations as stacks of sub-band signal envelopes.

The sub-band (temporal) envelope has been successfully used as a time-frequency
feature in the automatic speech recognition domain [32, 33]. In the automatic PCG
analysis literature, the use of the envelope signal is more common for segmentation
purposes. Liu et al. [18] presented a detailed review of envelope-based methods used for
automatic segmentation of PCG signals. While envelope signals are successfully used in
segmentation tasks (for example [34]), they have also been used directly as features fed
into neural network classifiers, although rarely, since the early days of automatic PCG
classification [35]. Some of the wavelet-based features, when the extracted coefficients
are down-sampled, can also be interpreted as sub-band envelope features. A recent
study following that approach is presented by Deng & Han [36], where sub-band
envelopes are calculated from discrete wavelet decomposition (DWT) coefficients. The
system in [24] uses median powers of sub-band signals, which can also be considered a
similar representation in which very few samples are used for each sub-band envelope.
The sub-band envelopes can be computed in various ways. We have chosen the
following steps (also depicted in Figure 3) for computing the sub-band envelopes of a PCG
segment:
● Band-pass filtering applied to PCG signal using Gammatone filter banks3
● Envelope detection via computing analytical signal using Hilbert transform
● Envelopes are resampled to a specific time-resolution (inherently involves low-
pass filtering that removes high-frequency components)
● Logarithmic compression applied to the final envelope signals
● All envelopes are stacked to obtain an image-like time-frequency representation
● The matrix obtained is processed to have zero-mean and normalized amplitude
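The processing chain above can be sketched with NumPy as follows. This is a simplified illustration under several assumptions: a crude FFT brick-wall band-pass is swapped in for the Gammatone filter bank, resampling is done by averaging within equal-width time bins, "normalized amplitude" is taken to mean a maximum absolute value of 1, and the band edges, function names and test signal are hypothetical.

```python
import numpy as np

def band_pass_fft(x, fs, f_lo, f_hi):
    """Crude brick-wall band-pass via FFT masking (stand-in for a Gammatone filter)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs >= f_hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def analytic_envelope(x):
    """Magnitude of the analytic signal, computed with an FFT-based Hilbert transform."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))

def sub_band_envelopes(x, fs, band_edges, n_time_bins=128, eps=1e-6):
    """Stack log-compressed, resampled sub-band envelopes into a 2D feature matrix."""
    rows = []
    for f_lo, f_hi in zip(band_edges[:-1], band_edges[1:]):
        envelope = analytic_envelope(band_pass_fft(x, fs, f_lo, f_hi))
        # Resample to n_time_bins by averaging within time bins (acts as a low-pass).
        resampled = np.array([b.mean() for b in np.array_split(envelope, n_time_bins)])
        rows.append(np.log(resampled + eps))      # logarithmic compression
    feature = np.vstack(rows)                     # shape: (bands, time bins)
    feature = feature - feature.mean()            # zero mean
    peak = np.max(np.abs(feature))
    return feature / peak if peak > 0 else feature

# Hypothetical test signal: a 60 Hz burst in the middle of a 0.5 second segment.
fs = 1000
t = np.arange(500) / fs
pcg = np.exp(-0.5 * ((t - 0.25) / 0.03) ** 2) * np.sin(2 * np.pi * 60 * t)
edges = np.geomspace(25, 400, 9)                  # 8 bands between 25 and 400 Hz
feature = sub_band_envelopes(pcg, fs, edges)
```

For this test signal the brightest row of the resulting 8 by 128 matrix is the band that contains 60 Hz, with the energy concentrated around the center of the time axis.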

In Figure 3, we present the flow diagram for the process and an example of feature
extraction, depicting the sub-band signal envelopes computed and the final feature
obtained in matrix form. The top sub-plots include 8 sub-band signals and their resampled
versions extracted from the original PCG signal (shown in blue). Considering this
particular example, after stacking 8 vectors (corresponding to 8 sub-bands) of size 128
(number of time bins), an 8*128 image-like representation is derived and plotted by
color-coding element values, resulting in the bottom sub-plot, which is the main feature
used as input to the classifier.
3
Filter banks designed using Python library by Jason Heeris: https://github.com/detly/gammatone
Figure 3: Sub-band envelopes feature computation. Original PCG signal depicted in
blue. Sub-band envelope feature is the matrix obtained at the output of the feature
extraction process which is depicted as a colored image via mapping matrix coefficients
to color code (low values: dark, high values: bright).
3.2 Machine learning
A large variety of neural network models may be used for the PCG classification task.
Our tests are limited to the use of feed-forward CNN models applied to frame-level
features, one of the most popular approaches in recent state-of-the-art systems in
this domain.
To keep the number of tests limited (so that tests can be repeated in a reasonable
amount of time), we have considered three similar models which include a common
sequence of layers used for similar tasks in the literature: 2D convolutional layers (with
kernel size 3 by 3 and rectified linear unit activation) followed by max-pooling and
drop-out layers. The input dimension is equal to the feature dimension and the output
dimension is two (number of categories: normal and pathological). The models are
implemented using Keras4 with TensorFlow5 as the backend. The Keras models and all
other design parameters are available from the accompanying repository6. Since PCG
database sizes are often relatively small (compared to other automatic classification
tasks with CNNs), complex models with high capacities learn to memorize the training
data. For this reason, the number of layers was kept small and L1 regularisation
is applied to avoid over-fitting. The numbers of 2D convolutional layers in the three
models are 1, 2 and 4.
Each model is designed to compute the probability that a segment belongs to a recording
of a patient with pathology. Further, to compute a file’s/patient’s probability of
belonging to the pathological class, all frame probabilities are sorted, the 15% lowest
and 15% highest values are discarded and, finally, the probability for a file is computed
as the mean probability of the remaining frames.
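This fusion rule is a trimmed mean and can be sketched as below. The function name and the example probabilities are hypothetical, and rounding the number of discarded frames down to an integer is an assumption, since the text does not specify how fractional counts are handled.

```python
def fuse_frame_probabilities(frame_probs, trim_fraction=0.15):
    """File-level probability as the trimmed mean of frame-level probabilities."""
    if not frame_probs:
        raise ValueError("at least one frame probability is required")
    ordered = sorted(frame_probs)
    # Number of frames discarded at each end; rounding down is an assumption.
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Ten hypothetical frame probabilities for one recording.
probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90, 0.99]
file_prob = fuse_frame_probabilities(probs)
```

In this example one value is discarded at each end (0.05 and 0.99) and the file-level probability is the mean of the remaining eight values, which makes the decision less sensitive to a few outlier frames such as noisy segments.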

4. Test design
Here, we first explain the data used in the tests and then discuss the various dimensions
of the automatic system considered in the test design process.

Databases:
Two databases (with large differences in patient ages and pathologies) were
considered for the comparative tests. The first database is a proprietary database
representing real-life scenarios for PCG-based pediatric cardiology screening. It is one
of the largest exclusively pediatric digital phonocardiogram databases. The second
database is a publicly available one involving mainly adult heart sound recordings; it
represents the largest phonocardiogram database worldwide to date and is used for
comparing various classification studies in the literature.
4
https://keras.io
5
https://www.tensorflow.org
6
https://github.com/barisbozkurt/AutomaticPCGclassification script models.py
University of Crete, PCGs with murmur (UoC-murmur) database:
Our database is composed of anonymized digital phonocardiograms (4 to 10 seconds
in length, including 4 to 18 PCG cycles with an average of 8 cycles), obtained from
pediatric cardiology outpatients as standard of care (provided time allowance) and from
a pilot pediatric cardiology screening program for school-age children (8-year-olds),
approved by the Greek Ministry of Education and local Health Authorities, which includes
the digital phonocardiogram as a component of pediatric heart disease screening (Cretan
Pediatric Cardiology Screening program, CPCS). The database includes abnormal murmurs
associated with various types and severity levels of CHD. This database is proprietary
and is not publicly available.
Each recording was labeled as normal (i.e. having an innocent murmur) or abnormal by
a single expert in pediatric cardiac auscultation (I.G., the second author) based on
clinical auscultation and a final confirmatory echocardiography study. Our database
therefore involves samples with abnormal murmurs obtained from children of various
ages, often under suboptimal recording conditions, and innocent murmurs which were
either difficult for their pediatricians to classify as such or were recorded during
primary school visits (associated with a high probability of external noise). The
available database is representative of the assessment of pediatric patients of various
ages in real-life conditions.
Due to the high cost of echocardiography confirmatory analysis, only part of this
database (83 PCG samples) has been cross-validated blindly by two pediatric cardiology
experts independently [37]. The database includes samples with various levels of CHD
considered “difficult to classify”, even for experts. It is a good representation of
the challenges of real-life daily clinical practice. Selected recordings of the same database
have also been used for teaching purposes [6, 38]. Representative digital
phonocardiograms of this database, along with extended introductory web-lectures in
pediatric cardiac auscultation, are freely available as open-source material on the
institutional web-server⁷.
The database contains 336 recordings from 327 healthy children with innocent
murmurs and 130 recordings from 117 children with various forms of CHD, of various
ages (infants-adolescents). The technique of digital phonocardiogram recording has
been standardized and described previously [37]. Briefly, a sensor-based electronic
stethoscope with incorporated 3-lead ECG⁸ was used. Four recordings were
performed from each patient, corresponding to the apical, lower left (fourth intercostal
space) and upper (second intercostal space) left and right parasternal locations. Digital
acoustic data (with a sampling rate of 44100 Hz, 16-bit dynamic resolution) and ECG
signals were transferred and stored as wave files on a personal laptop computer using
the designated software⁹. Any personal identification data has been removed and
replaced by a random ID prior to data analysis.
⁷ https://opencourses.uoc.gr/courses/enrol/index.php?id=367 Password for the video lectures is available
upon request from the authors.
⁸ TheStethoscope®; Welch Allyn-Meditron, Welch Allyn Inc., NY, USA
⁹ Meditron Analyzer 4®
For each patient, one or two recordings were selected by the expert as having the highest
quality for murmur detection, and all other recordings were removed from the set. The
following steps were applied for pre-processing of the original data: i) ECG data was
downsampled to 882 Hz and PCG data was downsampled to 4410 Hz, ii) both ECG
and PCG signals were amplitude normalized to have a maximum level of 0.9.
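The two pre-processing steps can be illustrated with a minimal sketch. It is not the code used in the study: the function names are hypothetical, and block averaging is assumed here as a crude stand-in for whatever anti-aliasing filter the actual downsampling used. Both target rates are integer divisors of the original 44100 Hz rate (factors 10 and 50).

```python
from typing import List, Sequence

ORIGINAL_RATE_HZ = 44100
PCG_RATE_HZ = 4410  # decimation factor 10
ECG_RATE_HZ = 882   # decimation factor 50


def decimate_by_averaging(x: Sequence[float], factor: int) -> List[float]:
    """Reduce the sampling rate by an integer factor.

    Each output sample is the mean of `factor` consecutive input samples.
    This is an assumed, simplistic low-pass filter, not the authors' method.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    n_out = len(x) // factor  # trailing samples that do not fill a block are dropped
    return [sum(x[i * factor:(i + 1) * factor]) / factor for i in range(n_out)]


def peak_normalize(x: Sequence[float], peak: float = 0.9) -> List[float]:
    """Scale the signal so that its maximum absolute value equals `peak`."""
    max_abs = max((abs(v) for v in x), default=0.0)
    if max_abs == 0.0:
        return list(x)  # silent signal: nothing to scale
    gain = peak / max_abs
    return [v * gain for v in x]


def preprocess(x: Sequence[float], target_rate_hz: int) -> List[float]:
    """Downsample from 44100 Hz to the target rate, then normalize to 0.9."""
    factor = ORIGINAL_RATE_HZ // target_rate_hz
    return peak_normalize(decimate_by_averaging(x, factor))
```

A one-second recording at 44100 Hz passed through `preprocess(x, PCG_RATE_HZ)` yields 4410 samples whose largest absolute value is 0.9.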
PhysioNet-2016 database:
Recently, a large open PCG database has been announced: Classification of
Normal/Abnormal Heart Sound Recordings: the PhysioNet/Computing in Cardiology
Challenge 2016¹⁰ [12]. This database is a compilation of various other databases
and is a very good resource for comparing a specific system with various state-of-the-art
algorithms without the need to implement them and run experiments. The
PhysioNet-2016 data includes some very noisy data (even some non-PCG samples) and
does not include ECG channels. Detailed profiles of the 9 included databases are
provided in Section 2 and Table 1 of [12], reaching a total number of 2435 PCG
recordings.
Deciding system settings

An automatic PCG classification system involves the following basic blocks:
segmentation, feature extraction and machine learning. Our test design started with
considering cross-combinational settings for these blocks. As the first step, an initial list
of combinations has been created:
Segmentation strategies (11 options):
● Period synchronous segmentation with: 0.5 period, 1 period, 2 period sized
segments, or fixed sizes of 0.5, 1, 2, 3 seconds segments. Overlap exists if size exceeds
the local period.
● Period asynchronous segmentation with fixed sized segments of 0.5, 1, 2 or 3
seconds with overlap of 1 second.
Features (48 options):
● Spectrogram, Mel-spectrogram, MFCC and sub-band envelopes
● Time resolutions: 32, 64, 128 (points)
● Frequency bands: 8, 16, 24, 32 (bands)
Machine learning models (3 options):
● Models with 1, 2 or 4 2D convolutional layers
Databases (2 options):
● UoC-murmur database: innocent murmur versus pathological murmur
● PhysioNet-2016 database: normal versus pathological
¹⁰ https://www.physionet.org/challenge/2016/
This initial list refers to 1584 systems (11 segmentation strategies × 48 features ×
3 models; asynchronous systems to be tested on two databases), where each test would
also need to be repeated several times to remove the bias of the random splitting
applied to the databases (explained further in this section).
While we think it is worthwhile to consider all these settings, due to this high number
of tests, several additional options (such as using other machine learning models
(LSTM, RNN, etc.) or using other file-level features from the literature) have been left
out. For practical reasons of computation time, some dimensions have been considered in
isolation without test repetition. Leaving out the worst cases in these preliminary tests,
the list has been reduced to a total of 90 systems for the final tests. Due to space
considerations, here we will only mention the observations that have led us to leave
out some options.
In isolated and reduced combinational tests without repetition, we have sorted
systems with respect to their F1 measures and observed that systems using the spectrogram
had the lowest scores. Hence, the spectrogram was removed from the feature
list. Segment lengths defined relative to the local period length did not appear more
advantageous than fixed sizes and were also removed. For tests with period
asynchronous segments, 0.5 and 1 second lengths were too short: learning could not
converge for those cases. Using more than 16 frequency bands did not bring improvement
(the PCG spectrum is limited to 2.2 kHz), and the performances observed for 8 frequency
bands and 16 frequency bands were similar. Sorting all systems, machine learning
models with 2 and 4 convolutional layers were ranked higher than the systems with a
single convolutional layer. Using delta coefficients with Mel-Spectrogram and MFCC
was also tested and observed to bring no improvement.

Tested system settings

We finally arrived at the following reduced list of (90) systems for which the tests
can be re-run/repeated with a single GPU in a few days for one of the databases.
Segmentation strategies (5 options):
● Period synchronous segmentation with 0.5, 1, 2 second lengths
● Period asynchronous segmentation with 2, 3 second lengths
Features (9 options):
● Mel-spectrogram, MFCC and sub-band envelopes
● Time resolutions: 32, 64, 128
● Frequency bands: 16
Machine learning models (2 options):
● Models with 2 or 4 2D convolutional layers
Our tests involve performing repeated experiments for 90 systems (54 period
synchronous and 36 asynchronous) on the UoC-murmur database and then picking a
high performance period asynchronous system and repeating tests for this system on
PhysioNet-2016 data. In the tests with the UoC-murmur database, for each
segmentation strategy the following options have been tested: use of three different
features (Mel-spectrogram, MFCC and sub-band envelopes) with three different time
resolutions (32, 64, 128) and two different CNN models. Our shared repository includes
the implementation of this period asynchronous system and testing scripts. Readers
can reproduce our results with PhysioNet-2016 data simply by running our shared test
script.
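The system counts quoted in this section (1584 initial combinations; 90 tested systems, of which 54 are period synchronous and 36 asynchronous) follow directly from the option lists. The short sketch below enumerates both grids to check the arithmetic. The labels used for the options are illustrative placeholders, not identifiers from the shared repository.

```python
from itertools import product

# Initial grid, with option values as listed in the text.
initial_segmentations = (
    [("sync", s) for s in ("0.5 period", "1 period", "2 periods",
                           "0.5 s", "1 s", "2 s", "3 s")]
    + [("async", s) for s in ("0.5 s", "1 s", "2 s", "3 s")]
)
initial_features = list(product(
    ("Spectrogram", "MelSpec", "MFCC", "SubEnv"),  # feature type
    (32, 64, 128),                                  # time resolution
    (8, 16, 24, 32),                                # frequency bands
))
initial_models = (1, 2, 4)  # number of 2D convolutional layers
n_initial = (len(initial_segmentations)
             * len(initial_features)
             * len(initial_models))

# Reduced grid used for the final tests (frame lengths in milliseconds).
reduced_segmentations = [("sync", 500), ("sync", 1000), ("sync", 2000),
                         ("async", 2000), ("async", 3000)]
reduced_features = list(product(("MelSpec", "MFCC", "SubEnv"),
                                (32, 64, 128),
                                (16,)))
reduced_models = (2, 4)
reduced_systems = list(product(reduced_segmentations,
                               reduced_features,
                               reduced_models))
n_sync = sum(1 for (mode, _), _, _ in reduced_systems if mode == "sync")
n_async = len(reduced_systems) - n_sync

print(n_initial, len(reduced_systems), n_sync, n_async)
```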
Data splitting, augmentation and balancing

For the learning experiments, the data needs to be split into three subsets: train,
validation/development and testing. In our tests, the validation set is used to observe
how accuracies and losses vary during learning, to alter the model parameters
(manually) based on these observations (to avoid overfitting) and to (automatically)
save the best model learned in a learning test (when the highest accuracy is achieved for
the validation set). The split ratios used for train, validation and test are 65%, 15% and
20%.
An important detail in random splitting is to ensure each set is composed of
completely independent samples. Splitting is performed at file level, keeping a similar
distribution of sample numbers in categories per set (i.e. the train, validation and test
sets include similar distributions of normal and pathological cases).
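A file-level split that preserves the class distribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the splitting code of the shared repository: the function name, the dictionary layout and the rounding of subset sizes are all hypothetical choices.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_file_split(
    labels_by_file: Dict[str, str],
    ratios: Tuple[float, float, float] = (0.65, 0.15, 0.20),
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Split file IDs into train/validation/test, one class at a time.

    Splitting each class separately keeps the class proportions similar
    in the three subsets; every file lands in exactly one subset.
    """
    rng = random.Random(seed)
    files_by_label = defaultdict(list)
    for file_id, label in sorted(labels_by_file.items()):
        files_by_label[label].append(file_id)

    splits: Dict[str, List[str]] = {"train": [], "validation": [], "test": []}
    for files in files_by_label.values():
        rng.shuffle(files)
        n_train = round(ratios[0] * len(files))
        n_val = round(ratios[1] * len(files))
        splits["train"] += files[:n_train]
        splits["validation"] += files[n_train:n_train + n_val]
        splits["test"] += files[n_train + n_val:]  # remainder (about 20%)
    return splits
```

Because frames are extracted from files only after this step, all frames of a given recording stay within a single subset.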
Data augmentation refers to creating new samples in an artificial/automatic way to
increase the size of the database for training and has been shown to be beneficial in
many applications [39]. One straightforward way of adding new samples is creating
new copies of existing samples by applying transformations to which the system
should be invariant. For our problem, we would like our system to be invariant to
minor or moderate variations in heart rate and murmur frequency band. One easy way to
create new samples with varied heart rate and murmur frequency band is to resample
existing samples and save them as if the sampling rate were not altered. This
compresses/expands the spectrum, which corresponds to a modification of the murmur
frequency band and the heart rate.

Data augmentation is performed by changing the sampling rate by a random amount
in the range 10%-20% on randomly selected samples. In all tests, an augmentation ratio of
2 is used (i.e. the size of the data is doubled). Data augmentation is applied only to the
train set.
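The resampling-based augmentation can be sketched as below. Several details are assumptions made for illustration only: linear interpolation is used as the resampler, the direction of the rate change (speeding up or slowing down) is drawn at random because the text does not specify it, and all function names are hypothetical. Each input signal needs at least two samples.

```python
import random
from typing import List, Sequence


def resample_linear(x: Sequence[float], rate_factor: float) -> List[float]:
    """Resample `x` to about len(x) * rate_factor samples by linear interpolation.

    Played back at the original sampling rate, rate_factor > 1 lengthens the
    signal (lower heart rate, compressed spectrum) and rate_factor < 1
    shortens it (higher heart rate, expanded spectrum).
    """
    n_out = max(2, int(round(len(x) * rate_factor)))
    step = (len(x) - 1) / (n_out - 1)
    out = []
    for k in range(n_out):
        pos = k * step
        i = min(int(pos), len(x) - 2)
        frac = pos - i
        out.append((1.0 - frac) * x[i] + frac * x[i + 1])
    return out


def random_rate_factor(rng: random.Random,
                       low: float = 0.10, high: float = 0.20) -> float:
    """Draw a rate change of 10%-20%; the random sign is an assumption."""
    change = rng.uniform(low, high)
    return 1.0 + rng.choice((-1.0, 1.0)) * change


def augment(train_signals: Sequence[Sequence[float]],
            ratio: int = 2, seed: int = 0) -> List[List[float]]:
    """Return the originals plus resampled copies of randomly chosen signals."""
    rng = random.Random(seed)
    n_new = (ratio - 1) * len(train_signals)
    new_signals = [
        resample_linear(rng.choice(train_signals), random_rate_factor(rng))
        for _ in range(n_new)
    ]
    return [list(s) for s in train_signals] + new_signals
```

With `ratio=2` the returned list is twice as long as the input, matching the augmentation ratio stated above.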
Our original database is unbalanced in terms of the number of samples in each
category: samples in the pathological category are fewer in number. Balancing the data
could be easily performed by leaving out samples from the more populated category.
However, we cannot afford to leave out samples due to the small database size. We have
followed an alternative path: creating new samples for the category with fewer samples
using resampling. The procedure used is the same as in the data augmentation step. The
balancing operation, via creating new transformed samples of original files, is applied
to the train and validation sets but not to the test set.
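The balancing step amounts to topping up the smaller category with transformed copies until both categories have the same number of samples. A minimal sketch follows; the function name is hypothetical, and the transformation is passed in as a callable (`make_variant`) so that any resampling routine can be plugged in.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def balance_by_oversampling(
    samples: Sequence[Tuple[T, str]],
    make_variant: Callable[[T, random.Random], T],
    seed: int = 0,
) -> List[Tuple[T, str]]:
    """Add transformed copies to every minority class until classes are equal.

    `samples` is a sequence of (item, label) pairs; `make_variant` returns a
    transformed copy of an item (for example a resampled signal).
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    balanced = list(samples)
    for label, count in counts.items():
        pool = [item for item, lab in samples if lab == label]
        for _ in range(target - count):
            balanced.append((make_variant(rng.choice(pool), rng), label))
    return balanced
```

Applied to a set with 10 normal and 4 pathological items, the function returns 20 items, 10 per category, with 6 of the pathological ones being transformed copies.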
5. Test results
While the number of systems to be tested and compared is reduced to 90, there is still
a need for a way to sort the systems in terms of performance. For our screening
application, we would like our automatic systems to detect as many pathological cases as
possible (i.e. we want to increase the true positive rate (TPR)), while we can tolerate
some normal cases being labeled as pathological (i.e. we can tolerate some increase in
the false positive rate (FPR)). In a real-life scenario, this would correspond to labeling
a higher number of samples as pathological, referring some extra normal cases to an expert
for consultation. The output of the automatic classification system for each sample is
the probability of belonging to a category. The straightforward class assignment is
performed by using 0.5 as the probability threshold in a binary classification task.
When the threshold for pathology detection is reduced, more cases are labeled as
pathological, which increases both the true and the false positive rates. To find an
optimum point of operation, TPR is plotted against FPR for different threshold values,
yielding the commonly used Receiver Operating Characteristic (ROC) curve.
The area under the ROC curve is considered the main measure of performance for the
ranking. Following the sorting, we also provide other performance measures for a
selected system.
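The threshold sweep that produces an ROC curve, and the area under it, can be written in a few lines. The sketch below is a generic illustration with hypothetical function names, assuming labels coded as 1 (pathological) and 0 (normal) and at least one sample of each class; the area is computed with the trapezoidal rule.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int],
               p_pathological: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold.

    A sample is labeled pathological when its probability is >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every probability
    for t in sorted(set(p_pathological), reverse=True):
        tp = sum(1 for y, p in zip(y_true, p_pathological) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_pathological) if p >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as points with increasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: 3 pathological and 3 normal samples.
labels = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(area_under_curve(roc_points(labels, probs)))
```

In the toy example, 8 of the 9 (pathological, normal) pairs are ordered correctly by the probabilities, so the area under the ROC curve is 8/9.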
Test results with the UoC-murmur database:
To start our comparison of various features with a sample, below we present three ROC
curves obtained for three different features while keeping all other settings the same:
time resolution of 32, 16 frequency bands, ECG synchronous segmentation with a fixed
length of 500 milliseconds, using the CNN model with 2 convolutional layers. ROC curves
are obtained via averaging 5 repeated random experiments on the UoC-murmur
database.
Figure 4: ROC curves for systems using three different features while keeping all other
design parameters fixed
In Figure 4, the best system among the three is the one using sub-band envelopes,
since the ROC curve for that system is closer to the upper left corner (high TPR, low
FPR) and the area under the ROC is the largest. Following the intuition of this sample, for
the comparison of 90 systems, we have used the area under the ROC as a single measure to
sort all system performances. Since random splitting is involved, tests were repeated 5
times and average ROC curves were used for each system.
Table 1: Sorted list of systems with respect to area under the ROC. Naming convention:
M1/2: CNN model number, eSyn: ECG synchronous, nASyn: asynchronous. The rightmost
number refers to the fixed length of the frame in milliseconds. The table includes the best and
worst 25 systems. Please refer to the github repository for the table with all 90 systems:
/results4allSystems_UocDba/sortingWRTareaUnderROC.txt
Rank  Area under ROC  System setting (Rank: 1-25)   Rank  Area under ROC  System setting (Rank: 66-90)
1     0.8772          M2SubEnv128by16_eSyn500       66    0.6945          M2MelSpec128by16_nASyn2000
3     0.8736          M2SubEnv64by16_eSyn1000       68    0.6878          M2MelSpec32by16_eSyn500
4     0.8716          M1SubEnv32by16_nASyn2000      69    0.6873          M2MelSpec32by16_eSyn1000
6     0.8691          M2SubEnv64by16_eSyn500        71    0.6719          M2MFCC64by16_eSyn500
8     0.8648          M2SubEnv32by16_nASyn2000      73    0.6657          M1MFCC128by16_nASyn2000
11    0.8636          M2SubEnv32by16_eSyn1000       76    0.6516          M1MFCC32by16_nASyn3000
24    0.8238          M1MelSpec32by16_eSyn1000      89    0.5762          M2MFCC32by16_nASyn2000
25    0.8227          M1MelSpec32by16_eSyn2000      90    0.5604          M2MFCC64by16_nASyn3000
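The system names in Table 1 pack all settings into one string, which can be unpacked mechanically. The parser below is a hypothetical helper written for illustration (it is not part of the shared repository); the regular expression is inferred from the names that appear in the table, and an optional underscore after the model number is tolerated as an assumption.

```python
import re
from typing import Dict, Union

_SYSTEM_NAME = re.compile(
    r"^M(?P<model>\d+)_?"
    r"(?P<feature>SubEnv|MelSpec|MFCC)"
    r"(?P<time>\d+)by(?P<bands>\d+)_"
    r"(?P<seg>eSyn|nASyn)(?P<ms>\d+)$"
)


def parse_system_name(name: str) -> Dict[str, Union[str, int, bool]]:
    """Split a Table 1 system name into its individual settings."""
    match = _SYSTEM_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized system name: {name!r}")
    return {
        "model": int(match.group("model")),            # CNN model number
        "feature": match.group("feature"),             # SubEnv, MelSpec or MFCC
        "time_resolution": int(match.group("time")),   # points along time
        "frequency_bands": int(match.group("bands")),
        "ecg_synchronous": match.group("seg") == "eSyn",
        "frame_ms": int(match.group("ms")),            # fixed frame length
    }


print(parse_system_name("M1SubEnv32by16_nASyn2000"))
```

For the fourth-ranked system, the parser reports model 1, sub-band envelopes, a 32 by 16 representation and asynchronous 2000 ms frames.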
ROC curves for these best and worst 20 systems are presented below.
Figure 5 (panels: Best 20 systems; Worst 20 systems): ROC curves of best and worst 20
systems tested on the UoC-murmur database
In the figure below, we present ROC curves of systems using specific features.
Figure 6 (panels: Systems using sub-band envelopes; Systems using MFCC; Systems using
Mel-Spectrogram): ROC curves grouped in terms of features
The test results show that systems using sub-band envelopes are ranked higher than
those using MFCC and Mel-Spectrogram features: 23 systems out of the best 25 use
sub-band envelopes as the feature. ROC curves also support this observation: ROC
curves of systems using sub-band envelopes are closer to the top left corner compared
to other ROC curves. One interesting observation is that a system using asynchronous
frames has high enough performance to be ranked fourth. This is especially
important since period marking, and therefore the ECG channel, is not needed in the design
of such systems.
To compare performances of synchronous and asynchronous systems, in Figure 7
we provide the ROC curves for systems using sub-band envelopes split into two groups:
one that uses synchronous frames, the other asynchronous frames.
Figure 7 (panels: Synchronous systems (sub-band env.); Asynchronous systems (sub-band
env.)): Break-down of Figure 6a (ROC curves of systems using sub-band envelopes)
into two groups: systems applying synchronous segmentation and systems applying
asynchronous segmentation
From Figure 7, we observe that synchronous system performances are in general
higher, but a few of the asynchronous system performances are comparable to the highest
performances of the synchronous systems. This is also reflected in Table 1: the fourth
ranked system (asynchronous) has a ROC area of 0.8716, whereas the best system
(synchronous) has a ROC area of 0.8772. This observation suggests that an
asynchronous system (designed with successful parameter optimization) can have a
performance very close to that of synchronous systems (as also reported by
Zabihi et al. [25]).
Test results with PhysioNet-2016 data:
Thanks to its authors, the PhysioNet-2016 data [18] serves as an excellent resource for
comparing new proposals with recent state-of-the-art systems without the need to
implement these systems and re-run the experiments of the challenge, since the
performances of these systems are already reported in [12]. Here, we present our tests
carried out with this openly available data and report our system performance, which can be
contrasted with the results in [12].
For PhysioNet-2016 data, ECG channels are not available. Recently, Zabihi et al. [25]
have shown that high performances can also be achieved using asynchronous frames
(systems without segmentation), which is also supported by our observation above. We
have run experiments for the best-performing system using asynchronous frames, which
was ranked fourth in experiments with the UoC-murmur database.

Using a hop size of 1 second, and balancing via creation of new samples, the number
of frames extracted from the PhysioNet-2016 database was 103228. As the number of
segments is relatively high, data augmentation is not applied in these tests. Each
test is repeated 5 times and results are averaged. In Table 2 we present the confusion
matrix and other performance measures for the system M1SubEnv32by16_nASyn2000.
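Asynchronous segmentation with fixed-length frames and a fixed hop needs no period marking at all, as the sketch below illustrates. The function names are hypothetical, the sampling rate is left as a parameter, and discarding a trailing partial frame is an assumption made for this illustration.

```python
from typing import List, Sequence


def frame_starts(n_samples: int, rate_hz: int,
                 frame_s: float = 2.0, hop_s: float = 1.0) -> List[int]:
    """Start indices of fixed-length frames taken every `hop_s` seconds.

    Only complete frames are kept (assumption: a trailing partial frame
    is discarded rather than padded).
    """
    frame_len = int(round(frame_s * rate_hz))
    hop_len = int(round(hop_s * rate_hz))
    return list(range(0, n_samples - frame_len + 1, hop_len))


def split_into_frames(x: Sequence[float], rate_hz: int,
                      frame_s: float = 2.0,
                      hop_s: float = 1.0) -> List[Sequence[float]]:
    """Cut a signal into overlapping fixed-length frames."""
    frame_len = int(round(frame_s * rate_hz))
    return [x[s:s + frame_len]
            for s in frame_starts(len(x), rate_hz, frame_s, hop_s)]
```

For example, a 10 second recording cut into 2 second frames with a 1 second hop yields 9 frames, each overlapping its neighbor by 1 second.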
Table 2: Confusion matrix for CHD risk detection (after averaging results of 5
experiments, using 0.5 as probability threshold for class assignment). System:
M1SubEnv32by16_nASyn2000
n = 301 (test set size)   Pathological (predicted)   Normal (predicted)   Total
Pathological (actual)     127.6                      23.4                 151
Normal (actual)           32.2                       117.8                150
Total                     159.8                      141.2
Sensitivity = 0.845, Specificity = 0.785, Accuracy = 0.815
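The summary measures can be recomputed from the four cells of the confusion matrix, which serves as a quick consistency check of Table 2. The variable names below are illustrative. The last line also computes the mean of sensitivity and specificity, which is assumed here to be what "mean accuracy" refers to when results are set against the challenge scores.

```python
# Averaged confusion-matrix cells from Table 2 (rows: actual class).
true_pos, false_neg = 127.6, 23.4   # pathological recordings
false_pos, true_neg = 32.2, 117.8   # normal recordings

sensitivity = true_pos / (true_pos + false_neg)   # true positive rate
specificity = true_neg / (true_neg + false_pos)   # true negative rate
accuracy = (true_pos + true_neg) / (true_pos + false_neg + false_pos + true_neg)
mean_accuracy = (sensitivity + specificity) / 2   # assumed challenge-style score

print(round(sensitivity, 3), round(specificity, 3),
      round(accuracy, 3), round(mean_accuracy, 3))
```

All four values agree with the figures reported under the table to three decimals.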
The openly accessible PhysioNet-2016 data contains a train set and a validation set
(shared with the aim of pre-testing the functionality of a submission to the challenge), which is
actually a subset of the train set. Since the main aim is to run an open challenge, the test set
is not available. To facilitate comparison of our results with tests in other studies, we
decided to use the provided validation set as the test set, removed its copies from the train
set and further split the train set into train and validation subsets (this validation set
referring to the subset in a machine learning experiment). The implementation of this
system, testing scripts (that download PhysioNet-2016 data, perform splitting and
run the experiments) and more detailed results involving other evaluation measures have
been shared openly on GitHub¹¹ to facilitate reproducibility of our test results.
¹¹ https://github.com/barisbozkurt/AutomaticPCGclassification
In Table 3 of [12], the performances of the top 8 systems (out of 48 submitted systems) are
listed, with sensitivity values ranging from 0.7120 to 0.9424, specificity values ranging
from 0.7569 to 0.9521 and mean accuracy ranging from 0.7057 to 0.8602. These values are
computed by applying weighting with respect to signal quality to the classification
results obtained on the test data, which is not openly available. In the tests we have carried
out (with the train, validation and test set split explained above), the following scores have been
obtained for our shared system: 0.845 sensitivity, 0.785 specificity and 0.815 mean
accuracy. While these results cannot be directly compared with the results in [12] (since
they are not computed on the same test subset and weighting is not applied), they show
that our system performs similarly to the top ranked state-of-the-art systems. The reader
can refer to the complete table in [12] for details regarding the performances of the best
systems in the challenge.
6. Conclusions
This study targeted comparing various features and segmentation strategies in the
context of automatic PCG classification for screening purposes, based on feedforward
convolutional neural networks trained on time-frequency representations of segmental
PCG frames. To arrive at an optimum design, 90 different system settings were tested
on a challenging dataset containing innocent and abnormal murmur cases (UoC-murmur
database), and a system selected for its high performance in these tests was also tested
on the PhysioNet-2016 data containing normal and pathological cases. The codes (of
this specific system and the test scripts) have been openly shared with the community for
reproducibility of our study and to facilitate comparisons with the state of the art. We should
stress here that our main contribution with this manuscript is in comparing various
segmentation and feature computation strategies, not in proposing a single best system that
outperforms the state of the art. Our analysis with PhysioNet data indicates that
the comparative tests have been carried out using system architectures that perform
comparably to state-of-the-art systems.
For ranking 90 distinct systems, ROC curves are obtained by applying different
threshold levels for the final categorization from probabilities of pathology, and the area
under the ROC has been used as the single measure representing the potential of each system
for screening applications. All systems are sorted in terms of area under the ROC for
comparison. Further, we have provided other performance measures for a selected
system. The sensitivity and specificity are critical measures for screening applications.
Together with accuracy, these evaluation metrics are the most common in comparative
studies such as [12].
As presented in Table 1, the systems using sub-band envelopes have the highest
ranks in the sorted list of systems (with respect to area under ROC): the 23 highest ranked
systems out of 90 use sub-band envelopes as the feature. Considering that most of the
state-of-the-art systems prefer MFCCs as the time-frequency representation, this is an important
observation. The ROC curves of the systems using sub-band envelopes are in general
closer to the top left corner than those of systems using MFCC or Mel-Spectrogram (Figure 6).
The UoC-murmur database includes PCG samples with murmur which were
recorded from patients who were referred to a cardiology expert. That means the
pediatricians have considered all cases in this data set to have a potential risk of heart
malfunction/defect; hence this is indeed a challenging set for an automatic
classification task. Our database consists exclusively of pediatric digital
phonocardiograms, corresponding to various levels of CHD (with the mildest forms
also being misclassified by experienced pediatric cardiologists), as well as to
innocent murmurs, most of which were misclassified as abnormal by treating physicians.
Compared to adult auscultation, specific challenges also exist for auscultation of young
children. Obtaining clean recordings free of scratch noise is in some cases difficult.
Heart rate is often higher compared to adults (up to double the adult norm), which
leads to further challenges in period marking and selection.
The best system developed through tests on our data (UoC-murmur database) is
M1_SubEnv64by16_eSyn1000: CNN with 2 convolutional layers using sub-band
envelopes with time resolution of 64 and 16 frequency bands, computed on period
synchronous 1 second frames, as the feature. This system has not been tested on the
PhysioNet-2016 data due to unavailability of the ECG channel. Following the tests with the
UoC-murmur database, a system using period asynchronous frames has been tested on the
PhysioNet-2016 challenge data: the highest performing asynchronous system (ranked fourth
in tests with the UoC-murmur database). We have shown that our asynchronous system
performs similarly to the top ranked state-of-the-art systems reported in [12], with 0.845
sensitivity, 0.785 specificity and 0.815 mean accuracy.

7. Discussions and future work

Our study involves some processes requiring further in-depth analysis, which we
consider as challenges for further studies. The first is gaining a better understanding of
the effectiveness of the data augmentation step applied and of alternatives to it. While the
transformation applied is mild (expanding/contracting by 10-20% of the original duration),
uniform resampling does not reflect the variability of the cardiac cycles governed by
the physical constraints of the heart. Data augmentation strategies respecting the physical
settings of the heart should be developed.
For sub-band envelope computation, we have only considered one specific setting of
Gammatone filter banks: simply setting the number of bands to 8, 16, 24, etc. Our study
lacks an in-depth analysis of the sub-band filtering process. Gammatone filter banks have
been preferred as they reflect some of the auditory response characteristics (although not
all, such as the loudness related non-linear auditory behavior). A study of optimization
of filter-bank computation may potentially lead to improved performance.
We have applied frame-level classification, the results of which were later fused via
averaging to deduce the probability that the whole PCG signal belongs to a category. Many
other options exist for such a step (for example majority voting). We have not tested other
strategies, to avoid including one more dimension of complexity in our tests.
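The difference between fusing frame-level outputs by averaging and by majority voting can be seen on a small example. The sketch below uses hypothetical function names and a 0.5 decision threshold; it is an illustration of the two fusion rules, not the fusion code of the shared system.

```python
from typing import Sequence, Tuple


def fuse_by_averaging(frame_probs: Sequence[float],
                      threshold: float = 0.5) -> Tuple[float, bool]:
    """File-level probability as the mean of frame probabilities."""
    p_file = sum(frame_probs) / len(frame_probs)
    return p_file, p_file >= threshold


def fuse_by_majority_vote(frame_probs: Sequence[float],
                          threshold: float = 0.5) -> bool:
    """File is pathological when more than half of its frames are."""
    votes = sum(1 for p in frame_probs if p >= threshold)
    return 2 * votes > len(frame_probs)


# Two confident pathological frames among five: the two rules disagree.
frame_probs = [0.95, 0.90, 0.40, 0.45, 0.30]
print(fuse_by_averaging(frame_probs))      # mean is about 0.6 -> pathological
print(fuse_by_majority_vote(frame_probs))  # 2 of 5 votes -> normal
```

Averaging keeps a continuous file-level probability, so an ROC curve can still be drawn at file level, whereas voting yields only a hard decision unless the vote fraction itself is thresholded.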
Design of automatic PCG classification systems requires optimization of a large
number of settings. Improving system performance through parameter optimization is
one option for future studies. Another direction for further studies is the use of
multi-sensor signal processing techniques to lower the need for experienced operators in
screening applications. In [40], the authors propose noise cancellation using
multi-channel PCG recordings, which would result in more robust PCG analysis systems. Joint
analysis of various modalities such as Ballistocardiography (BCG), ECG and PCG
recorded using multisensor systems [41,42] also has the potential to lead to improved
performance for screening applications. Building end-products and testing them in
real-life scenarios is an important future direction our research community should consider.
Acknowledgements
This project has been funded by the Special Account for Research of the University of Crete
(code number 4305). We would like to thank the Greek Ministry of Education and the
local Health Authorities (7th Health Region Crete) for their support of the CPCS program,
and the University of Crete for the support of innovative cardiac auscultation teaching
approaches (including web-lecture hosting). We would like to thank Vassilis Tsiaras for
his valuable help and assistance throughout the study and the fruitful discussions that
led to the final designs, and Alena Burianova Bagaki for the valuable assistance in
digital phonocardiogram recordings.
References
[1] Rangayyan, R. M., & Lehner, R. J. (1986). Phonocardiogram signal analysis: a
review. Critical reviews in biomedical engineering, 15(3), 211-236.
[2] Ferencz, C., Rubin, J. D., Mccarter, R. J., Brenner, J. I., Neill, C. A., Perry, L. W., ...
& Downing, J. W. (1985). Congenital heart disease: prevalence at livebirth: the
Baltimore-Washington Infant Study. American journal of epidemiology, 121(1), 31-36.
[3] Van Oort, A., Le Blanc-Botden, M., De Boo, T., Van Der Werf, T., Rohmer, J., &
Daniels, O. (1994). The vibratory innocent heart murmur in schoolchildren: difference
in auscultatory findings between school medical officers and a pediatric cardiologist.
Pediatric cardiology, 15(6), 282-287.

[4] Michael, S. Y., Kimball, T. R., Tsevat, J., Mrus, J. M., & Kotagal, U. R. (2002).
Evaluation of heart murmurs in children: cost-effectiveness and practical implications.
The Journal of pediatrics, 141(4), 504-511.

[5] Cheitlin, M. D., Armstrong, W. F., Aurigemma, G. P., Beller, G. A., Bierman, F. Z.,
Davis, J. L., ... & Kussmaul, W. G. (2003). ACC/AHA/ASE 2003 guideline update for
the clinical application of echocardiography: summary article. Circulation, 108(9),
1146-1162.
[6] Germanakis, I., Petridou, E. T., Varlamis, G., Matsoukis, I. L., Papadopoulou-
Legbelou, K., & Kalmanti, M. (2013). Skills of primary healthcare physicians in
paediatric cardiac auscultation. Acta Paediatrica, International Journal of Paediatrics,
102(2), 74–78. http://doi.org/10.1111/apa.12062
[7] Hanna, I. R., & Silverman, M. E. (2002). A history of cardiac auscultation and some
of its contributors. The American journal of cardiology, 90(3), 259-267.
[8] Newburger, J. W., Rosenthal, A., Williams, R. G., Fellows, K., & Miettinen, O. S.
(1983). Noninvasive tests in the initial evaluation of heart murmurs in children. New
England Journal of Medicine, 308(2), 61-64.
[9] Smythe, J. F., Teixeira, O. H., Vlad, P., Demers, P. P., & Feldman, W. (1990).
Initial evaluation of heart murmurs: are laboratory tests necessary?. Pediatrics, 86(4),
497-500.
[10] Geva, T., Hegesh, J., & Frand, M. (1988). Reappraisal of the approach to the child
with heart murmurs: is echocardiography mandatory?. International journal of
cardiology, 19(1), 107-113.
[11] Telatar, Z., & Erogul, O. (2003, September). Heart sounds modification for the
diagnosis of cardiac disorders. In IJCI Proceedings of International Conference on
Signal Processing (Vol. 1, No. 2, pp. 100-105).

[12] Clifford, G. D., Liu, C., Moody, B., Springer, D., Silva, I., Li, Q., & Mark, R. G.
(2016). Classification of Normal/Abnormal Heart Sound Recordings: the PhysioNet/
Computing in Cardiology Challenge 2016. Computing in Cardiology, 3–6.
http://doi.org/10.22489/CinC.2016.179-154
[13] Leng, S., Tan, R. S., Chai, K. T. C., Wang, C., Ghista, D., & Zhong, L. (2015). The
electronic stethoscope. Biomedical Engineering Online, 14(66).
http://doi.org/10.1186/s12938-015-0056-y
[14] Noponen, A.-L., Lukkarinen, S., Angerla, A., & Sepponen, R. (2007). Phono-
spectrographic analysis of heart murmur in children. BMC Pediatrics, 7(1), 23.
http://doi.org/10.1186/1471-2431-7-23
[15] Shen, C.-H. (2012). Acoustic based condition monitoring. University of Akron.
[16] Abbas, A. K., & Bassam, R. (2009). Phonocardiography Signal Processing.
Synthesis Lectures on Biomedical Engineering (Vol. 4).
http://doi.org/10.2200/S00187ED1V01Y200904BME031
[17] Marascio, G., & Modesti, P. A. (2013). Current trends and perspectives for
automated screening of cardiac murmurs. Heart Asia, 5(1), 213–218.
http://doi.org/10.1136/heartasia-2013-010392
[18] Liu, C., Springer, D., Li, Q., Moody, B., Juan, R. A., Chorro, F. J., … Clifford, G.
D. (2016). An open access database for the evaluation of heart sound algorithms.
Physiological Measurement, 37(12), 2181–2213.
http://doi.org/10.1088/0967-3334/37/12/2181
[19] Schmidt, S. E., Holst-Hansen, C., Hansen, J., Toft, E., & Struijk, J. J. (2015).
Acoustic features for the identification of coronary artery disease. IEEE Transactions on
Biomedical Engineering, 62(11), 2611–2619.
http://doi.org/10.1109/TBME.2015.2432129
[20] Markaki, M., Germanakis, I., & Stylianou, Y. (2013, May). Automatic
classification of systolic heart murmurs. In Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International Conference on (pp. 1301-1305).

[21] Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences. IEEE transactions on
acoustics, speech, and signal processing, 28(4), 357-366.
[22] Chauhan, S., Wang, P., Sing Lim, C., & Anantharaman, V. (2008). A computer-
aided MFCC-based HMM system for automatic auscultation. Computers in Biology and
Medicine, 38(2), 221–233. http://doi.org/10.1016/j.compbiomed.2007.10.006
[23] Vepa, J. (2009). Classification of heart murmurs using cepstral features and support
vector machines. Proceedings of the 31st Annual International Conference of the IEEE
Engineering in Medicine and Biology Society: Engineering the Future of Biomedicine,
EMBC 2009, 2539–2542. http://doi.org/10.1109/IEMBS.2009.5334810
[24] Potes, C., Parvaneh, S., Rahman, A., & Conroy, B. (2016). Ensemble of
Feature-based and Deep learning-based Classifiers for Detection of Abnormal Heart Sounds.
Computing in Cardiology. http://doi.org/10.22489/CinC.2016.182-399
[25] Zabihi, M., Rad, A. B., Kiranyaz, S., Gabbouj, M., & Katsaggelos, A. K. (2016).
Heart Sound Anomaly and Quality Detection using Ensemble of Neural Networks
without Segmentation. Computing in Cardiology.
http://doi.org/10.22489/CinC.2016.180-213
[26] Rubin, J., Abreu, R., Ganguli, A., Nelaturi, S., Matei, I., & Sricharan, K. (2016).
Classifying Heart Sound Recordings using Deep Convolutional Neural Networks and
Mel-Frequency Cepstral Coefficients. Computing in Cardiology, 1–4.
http://doi.org/10.22489/CinC.2016.236-175
[27] Ergen, B., Tatar, Y., & Gulcur, H. O. (2012). Time-frequency analysis of
phonocardiogram signals using wavelet transform: a comparative study. Computer
Methods in Biomechanics and Biomedical Engineering, 15(4), 371–381.
http://doi.org/10.1080/10255842.2010.538386
[28] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-R., Jaitly, N., … Kingsbury,
B. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE
Signal Processing Magazine, 29(6), 82–97. http://doi.org/10.1109/MSP.2012.2205597
[29] Piczak, K. J. (2015). Environmental sound classification with convolutional neural
networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal
Processing (MLSP).
[30] Kay, E., & Agarwal, A. (2017). DropConnected neural networks trained on
time-frequency and inter-beat features for classifying heart sounds. Physiological
Measurement, 38(8), 1645.
[31] Springer, D. B., Tarassenko, L., & Clifford, G. D. (2016). Logistic
regression-HSMM-based heart sound segmentation. IEEE Transactions on Biomedical
Engineering, 63(4), 822–832. http://doi.org/10.1109/TBME.2015.2475278
[32] Lu, X., Unoki, M., & Nakamura, S. (2011). Sub-band temporal modulation
envelopes and their normalization for automatic speech recognition in reverberant
environments. Computer Speech and Language, 25(3), 571–584.
http://doi.org/10.1016/j.csl.2010.10.002
[33] Mitra, V., Wang, W., & Franco, H. (2014, December). Deep convolutional nets and
robust features for reverberation-robust speech recognition. In Spoken Language
Technology Workshop (SLT), 2014 IEEE (pp. 548-553).
[34] Schmidt, S. E., Toft, E., Holst-Hansen, C., Graff, C., & Struijk, J. J. (2008).
Segmentation of heart sound recordings from an electronic stethoscope by a duration
dependent Hidden-Markov model. Computers in Cardiology, 35, 345–348.
http://doi.org/10.1109/CIC.2008.4749049
[35] Barschdorff, D., Bothe, A., & Rengshausen, U. (1989). Heart sound analysis using
neural and statistical classifiers: a comparison. Computers in Cardiology, 415–418.
[36] Deng, S. W., & Han, J. Q. (2016). Towards heart sound classification without
segmentation via autocorrelation feature and diffusion maps. Future Generation
Computer Systems, 60, 13–21. http://doi.org/10.1016/j.future.2016.01.010
[37] Germanakis, I., Dittrich, S., Perakaki, R., & Kalmanti, M. (2008). Digital
phonocardiography as a screening tool for heart disease in childhood. Acta Paediatrica,
International Journal of Paediatrics, 97(4), 470–473.
http://doi.org/10.1111/j.1651-2227.2008.00697.x
[38] Germanakis, I., & Kalmanti, M. (2009). Paediatric cardiac auscultation teaching
based on digital phonocardiography. Medical Education, 43(5), 489.
[39] Wong, S. C., Gatt, A., Stamatescu, V., & McDonnell, M. D. (2016, November).
Understanding data augmentation for classification: when to warp? In Digital Image
Computing: Techniques and Applications (DICTA), 2016 International Conference on
(pp. 1-6).
[40] Nunes, D., Leal, A., Couceiro, R., Henriques, J., Mendes, L., Carvalho, P., &
Teixeira, C. (2015, August). A low-complex multi-channel methodology for noise
detection in phonocardiogram signals. In Engineering in Medicine and Biology Society
(EMBC), 2015 37th Annual International Conference of the IEEE (pp. 5936-5939).
IEEE.
[41] Nedoma, J., Fajkus, M., Martinek, R., Kepak, S., Cubik, J., Zabka, S., & Vasinek,
V. (2017). Comparison of BCG, PCG and ECG signals in application of heart rate
monitoring of the human body. In Telecommunications and Signal Processing (TSP),
2017 40th International Conference on (pp. 420-424). IEEE.
[42] Marcelli, E., Capucci, A., Minardi, G., & Cercenelli, L. (2017). Multi-sense
cardiopatch: A wearable patch for remote monitoring of electro-mechanical cardiac
activity. ASAIO Journal, 63(1), 73-79.
Highlights:
● In a CNN-based PCG classification scheme using time-frequency
representations, sub-band envelope features are preferable to the most
commonly used features: MFCC and Mel-Spectrum
● While deep learning based methods are considered end-to-end with little use
of domain knowledge, domain knowledge (how humans perform a particular
classification task) can be beneficial in designing the input representation (in
our case sub-band envelopes), leading to better-performing systems
● Automatic PCG classification technology is sufficiently developed to start
supporting real-life screening applications
Conflict of interest statement for manuscript:
Title: A study of time-frequency features for CNN-based automatic heart sound
classification for pathology detection
Authors: Baris Bozkurt, Ioannis Germanakis, Yannis Stylianou
The authors of this manuscript declare no conflict of interest with any person or
institutional body.
