Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Article history: Received 15 July 2019; Revised 18 April 2020; Accepted 29 April 2020; Available online xxx

Keywords: Hierarchical deep learning; Explainability; Channel attention; Spatial attention; Safety-critical CADS; Skin cancer

Abstract: Deep neural networks have rapidly become an indispensable tool in many classification applications. However, the inclusion of deep learning methods in medical diagnostic systems has come at the cost of diminishing their explainability. This significantly reduces the safety of a diagnostic system, since the physician is unable to interpret and validate the output. Therefore, in this work we aim to address this major limitation and improve the explainability of a skin cancer diagnostic system. We propose to leverage two sources of information: (i) medical knowledge, in particular the taxonomic organization of skin lesions, which will be used to develop a hierarchical neural network; and (ii) recent advances in channel and spatial attention modules, which can identify interpretable features and regions in dermoscopy images. We demonstrate that the proposed approach achieves competitive results on two dermoscopy data sets (ISIC 2017 and 2018) and provides insightful information about its decisions, thus increasing the safety of the model.

© 2020 Published by Elsevier Ltd.
https://doi.org/10.1016/j.patcog.2020.107413
Please cite this article as: C. Barata, M.E. Celebi and J.S. Marques, Explainable skin lesion diagnosis using taxonomies, Pattern Recognition,
https://doi.org/10.1016/j.patcog.2020.107413
JID: PR
ARTICLE IN PRESS [m5G;May 21, 2020;9:28]
Fig. 2. Block diagram of the proposed system. The colors identify the channel and spatial attention weights (β_t and α_t) associated with each of the sequential diagnoses.
where i_t, f_t, c_t, o_t, and h_t are the input, forget, memory, output, and hidden state of the LSTM, respectively. The network parameters (the weights W• and biases b•) are learned during the training phase, as well as the class-embedding matrix E ∈ R^(M×K), where M is the class-embedding dimension and K is the number of words/classes. Finally, the context vector z_t is computed using the attention module, which will be discussed in the following section.

At each time step t, the hidden state h_t is used to obtain p(C_t | I, C_{t−1}) and predict the next level in the taxonomy. Various strategies can be used to obtain this probability. In this work, we compare the following two:

Class Inference - Method 1 (CI1):

p(C_t | I, C_{t−1}) = softmax(W_o1 h_t), (3)

Class Inference - Method 2 (CI2):

p(C_t | I, C_{t−1}) = softmax(W_o2 (W_z z_t + W_h h_t)), (4)

where W_o1 ∈ R^(K×P), W_z ∈ R^(M×D), W_h ∈ R^(M×P), and W_o2 ∈ R^(K×M) are trainable weight matrices. The predicted class Ĉ_t is then selected as the one that maximizes p(C_t | I, C_{t−1}).

To be able to use the LSTM to predict the first level of the taxonomy, i.e., at time step t = 1, it is necessary to initialize its state and memory (h_0 and c_0). Following Xu et al. [33], we use two perceptrons to infer these parameters from the average values of the activation maps obtained from the CNN: (1/L) Σ_{l=1}^{L} x_l.

4.3. Attention module

The goal of the attention module is to provide the LSTM network with the most relevant image features (called the context vector z_t) that can be used to predict the t-th level of the taxonomy. This approach aims to mimic the analysis performed by dermatologists while diagnosing skin lesions: inspect the lesion for localized criteria that give clues about its origin, followed by a detailed analysis of some of the identified criteria to perform the differential diagnosis. We reproduce this analysis by incorporating a spatial attention mechanism in our model. Spatial attention forces the LSTM network to selectively focus on different parts of the skin lesion, taking into account the previously predicted hierarchical labels. Additionally, recent works, such as that of Van Simoens and Dhoedt [18], have shown that CNN filters respond to different properties of the skin lesions. Since some of these filters may convey more discriminative information than others, we also incorporate a channel attention mechanism in our model. Similarly to the spatial attention block, the channel one is also influenced by the previously generated hierarchical labels and dynamically provides the LSTM with the most relevant image features (activation maps) to predict the following labels. Below we provide details about these two attention mechanisms.

Channel and spatial attention have been combined in various frameworks, such as those proposed in Chen et al. [34] and Woo et al. [38]. Both of these works have demonstrated that the best performances are achieved when spatial attention is applied to the output of a channel attention block. Therefore, we also adopt this organization. Channel attention has been addressed in the literature as a technique to perform feature selection and to improve the representation power of the model [38]. The non-negative channel attention weights β_t are computed as follows:

β_t = σ(W_ca (tanh(W_cax x̄ + W_cax x̃ + W_cah h_{t−1}))),
x_t = β_t ⊙ x, (5)

where σ is the sigmoid function, ⊙ denotes element-wise multiplication, x̄ and x̃ are respectively the average- and max-pooled feature maps, and the matrices W_ca, W_cax ∈ R^((D/r)×D), and W_cah are all trainable. The new features x_t can then be used to compute the spatial attention weights, as described next.

Pure spatial attention methods can be divided into hard and soft attention [33]. Soft attention, as proposed by Chorowski et al. [39], is used in this work. Based on this method, we estimate the context vector z_t as follows:

z_t = Σ_{i=1}^{L} α_ti x_ti, (6)

where α_t = {α_t1, ..., α_tL} is a set of non-negative weights that represent the relative importance of each position in the feature map. In this work, we apply the method proposed by Xu et al. [33] to estimate the weights α_t as follows:

α_t = softmax(W_sa (tanh(W_sax x + W_sah h_{t−1}))), (7)

where W_sa, W_sax, and W_sah are all trained in parallel with the other parameters of the model.

The outputs of both attention mechanisms may be visually inspected, i.e., we can identify the activation maps with the highest values in β_t and visualize the spatial map α_t for each of the predicted labels. This greatly improves the explainability of the model, since we are able to provide visual cues that justify each of the hierarchical decisions.

5. Experimental setup

We developed our model using the ISIC 2017 and 2018 dermoscopy data sets [4,40]. The first set comprises 2750 images divided into training (2000), validation (150), and test (600) sets. These images contain examples of the following classes of lesions: melanocytic (melanoma and nevi) and non-melanocytic (seborrheic keratosis). The second set is larger and more complex, containing 11,527 examples of the following lesions: melanocytic (melanoma and nevi) and non-melanocytic (basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, and vascular lesions). Similarly to ISIC 2017, the 2018 data set is also divided into training (10,015) and test (1512) sets. Neither data set was augmented with external data and, in order to deal with the image color variability induced by different acquisition setups, all of the images were normalized using the color normalization scheme proposed in Barata et al. [41]. Fig. 3 shows some examples of normalized images.

We report the results of several ablation studies conducted on the ISIC 2017 data set, namely:

i) Image encoder: The representation power of three CNN architectures (DenseNet-161, VGG-16, and ResNet-50) was evaluated by using each network separately as an image encoder.

ii) Hierarchical loss function: A commonly used loss function in captioning problems is the categorical cross-entropy. However, using this loss function alone gives no guarantee that the model will learn the taxonomic structure of the various classes, i.e., the sequential path between the coarse and finer classes. Thus, we propose to explicitly model the taxonomic constraints in the loss function and compare this approach against solely using the cross-entropy loss. Details about the hierarchical loss function will be given in the following subsection.

iii) Channel attention: The combined use of the channel and spatial attention mechanisms (recall Section 4.3) was evaluated for several values of the channel-reduction factor r ∈ {1, 2, 4, 8, 16}, where 1/r denotes the proportion of channels that the model should select. Setting r = 1 is equivalent to using only spatial attention.
Fig. 3. Examples of dermoscopy images (1st column) and their color normalized versions using [41] (2nd column): ISIC 2017 (1st row) and ISIC 2018 (2nd row).
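The exact normalization scheme of [41] is not reproduced here, but a simple gray-world color constancy sketch conveys the idea of removing acquisition-dependent color casts before training. This is an illustrative stand-in under that assumption, not the authors' method:

```python
import numpy as np

def gray_world_normalize(img):
    """Gray-world color constancy: rescale each channel so that its mean
    matches the global mean intensity. img: float array (H, W, 3) in [0, 1]."""
    channel_means = img.reshape(-1, 3).mean(axis=0)       # per-channel means
    gain = channel_means.mean() / np.maximum(channel_means, 1e-8)
    return np.clip(img * gain, 0.0, 1.0)                  # keep values in range
```

After normalization, every channel has (approximately) the same mean, which suppresses the color shifts introduced by different dermoscopes and lighting setups.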
iv) Class inference: The two class inference approaches presented in Section 4.2 are compared to determine if the predicted state h_t conveys sufficient information for the diagnosis or if it is beneficial to add the context z_t to the classification model.

v) Model hyperparameters: The impact of various model hyperparameters on the performance of the hierarchical diagnostic system is assessed. In particular, we compare the performance for different values of M (the size of the lesion class-embedding) and P (the size of the hidden state h of the LSTM).

We train a hierarchical model for each of the aforementioned configurations, using the ISIC 2017 training set. The validation set is used for early stopping and selection of the best hyperparameters, while the test set is used to quantify the performance of the model. We would like to emphasize that we do not use any external data set to augment the ISIC 2017 training set.

The ISIC 2018 data set was used to assess the performance of the model in a more complex scenario, where there are more types of non-melanocytic lesions (this data set contains examples of five types of non-melanocytic lesions, while ISIC 2017 contained only one). Additionally, we conduct the following experiments:

a) Varying taxonomy length: The ISIC 2018 data set contains seven classes of skin lesions, five of which are non-melanocytic. Among the non-melanocytic lesions, one can further split them into benign (keratosis, dermatofibroma, and vascular) and malignant (BCC and actinic) lesions, prior to their differential diagnosis. In order to infer if this additional hierarchical class helps the model to learn more discriminative representations of the skin lesions, we compare the following approaches: (i) two-level taxonomy (T = 2), where two hierarchical decisions are performed (melanocytic/non-melanocytic and the differential diagnosis); and (ii) varying length, where the model first classifies the lesion as melanocytic/non-melanocytic and then, if the lesion is classified as melanocytic, performs the differential diagnosis (melanoma/nevus); otherwise it first classifies the lesion as malignant/benign and only then performs the differential diagnosis.

b) Changing the taxonomic order: The taxonomy shown in Fig. 1 was defined by dermatologists to take the origin of the lesions as the first splitting criterion. However, the decision to excise a lesion is often based on whether or not it shows signs of malignancy. Only at the pathologist level will the lesion be diagnosed according to its origin and differential class. To mimic this process and determine its influence on the performance of the model, we change the order of the hierarchical decisions, i.e., first classify the lesion as malignant/benign, then diagnose it as melanocytic or non-melanocytic, and finally perform the differential diagnosis.

c) Robustness experiments: Recently, various authors have reported improved performances by augmenting the test set with geometric transformations, such as central crops of different ratios and horizontal/vertical flips (e.g., [42]), followed by an ensemble-like classification of the augmented images. Although the results are promising, this also suggests that the learned models may not be robust to changes in viewpoint. This is undesirable, since it means that the same lesion may be assigned a different label depending on its acquisition. Thus, we evaluate the robustness of our method w.r.t.: i) performing a horizontal flip on the images; ii) performing a central crop that keeps only 80% of the original image; and iii) removing the color normalization exemplified in Fig. 3, i.e., applying the model directly to the original image.

Unfortunately, the ground truth information (i.e., diagnostic labels) for the test set of ISIC 2018 is not publicly available. The performance on this set can only be assessed through an online platform. To be able to conduct the same training procedure as in the case of ISIC 2017, we used 5-fold cross-validation on the training set. Then, each of the five models was used to classify the images in the test set, and the final diagnosis was determined by averaging the scores of the models.

5.2. Model training

Each of the model's configurations discussed in the previous section was trained end-to-end using a data set I = {I_1, ..., I_N}, for which we have the corresponding ground truth sequences of hierarchical labels C = {C^1, ..., C^N}. In order to train the model, we apply the balanced categorical cross-entropy loss to each training example [43]:

L_CE(I, C) = −(1/T) Σ_{t=1}^{T} w_{C_t} log p(C_t | I, C_{t−1}), (8)

where T is the length of the taxonomy of the example, w_{C_t} is the label-weight, and p(C_t | I, C_{t−1}) is given by either (3) or (4). In our models, we set T = 2 or 3, i.e., we either consider two levels in the taxonomy (melanocytic/non-melanocytic and the differential diagnosis) or two levels for the melanocytic lesions and three for the non-melanocytic ones. The label-weights w_{C_t} are used to deal with the severe imbalance of classes in the training set, and are given by

w_{C_t} = #N / #N_c, (9)

where #N is the size of the training set and #N_c is the number of examples in each class. To ensure that these weights do not influence the learning rate of the stochastic gradient descent, we normalize the weights in each batch such that they have unit mean.

Despite being used in classification and captioning problems, solely relying on the cross-entropy loss may be sub-optimal in our hierarchical problem, since this loss only penalizes incorrect labels and not deviations from the taxonomic structure (e.g., reaching the correct differential diagnosis through an incorrect path). To address this issue, we also propose to use a hierarchical loss function, defined as follows:

L(I, C) = L_CE(I, C) + L_H(I, C), (10)

where L_H(I, C) is given by the cosine distance

L_H(I, C) = 1 − (H · Ĥ) / (‖H‖ ‖Ĥ‖). (11)

Here H ∈ R^A is a binary path-vector, A is the set of labels in the taxonomy tree, and H_a = 1 if the a-th label belongs to the taxonomic classification of the lesion. Finally, Ĥ ∈ R^A is the estimated path.

The model is optimized end-to-end using the Adam variation of stochastic gradient descent [44] with mini-batches of size 20, using an initial learning rate of 10^−6, which decays every 200 epochs. In total, the model is trained for 600 epochs with an early-stop criterion, on an NVIDIA Titan Xp.¹ The model parameters are set as follows: M ∈ {50, 100, 300}, and P tuned in the interval {2^8, ..., 2^10}. To improve the generalization of the model, we adopted the following strategies: (i) careful initialization of several model weights W•; (ii) online data augmentation (a sequence of random crop, random flip, and random color transformation at each epoch); and (iii) incorporation of dropout with 50% probability in several of the layers. The carefully initialized weights are those of the image encoder (CNNs), where we used the weights of the models pre-trained on ImageNet, and those of the word encoding E, which were initialized from the GloVe embeddings [45].

¹ The source code will be available at https://github.com/catarina-barata/skin-hierarchy/

5.3. Model evaluation

All of the trained model configurations were evaluated using an independent test set, and the performance was quantified using the following metrics: sensitivity (SE), specificity (SP), balanced accuracy (BACC), and area under the curve (AUC). The metrics SE, SP, and AUC are class specific, while BACC is computed over the entire data set.

For the ISIC 2017 data set, we also compare our results with others recently reported in the literature. In particular, we establish comparisons with the results of: (i) Harangi [46], who compared multiple CNN architectures using the ISIC 2017 data set but does not use a hierarchy; (ii) Barata and Marques [28], who investigated the inclusion of a class-hierarchy in a CNN framework using two levels of fully connected layers; and (iii) the top ranked participants of the ISIC 2017 challenge.

6. Experimental results

6.1. Ablation studies on ISIC 2017

In this section, we report the results of the several ablation studies described in Section 5.1, as follows. First, we show the results for the analysis of the loss function for all image encoders, but using only CI1 (3) as the class inference method, since the performance for CI2 (4) was similar. We then evaluate the performance after the incorporation of the channel attention module, using all of the image encoders. We also use this scenario to compare the inference strategies CI1 and CI2. Finally, we select the two best models to perform a detailed analysis of the influence of the model parameters: P (the size of the LSTM hidden layer) and M (the size of the lesion class embedding).

Hierarchical loss function: Table 1 shows the scores for models trained using either the cross-entropy loss (8) or the hierarchical loss (10).

Table 1. Comparison of loss functions - best scores on the ISIC 2017 test set using CI1 as inference approach. HL stands for hierarchical loss.

By inspecting Table 1, we can observe that enforcing the hierarchical structure in the loss function improves the performance of the models, particularly when ResNet-50 and DenseNet-161 are used as image encoders (e.g., note the improvement in the BACC scores). Moreover, if we compare the SE and SP values of the keratosis class (the only non-melanocytic lesion) with those of the melanocytic/non-melanocytic task, we conclude that only the hierarchical loss (10) guarantees that the model does not violate the taxonomy.
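The training objective of Eqs. (8)-(11) can be sketched in NumPy as follows. The inputs (per-level probability vectors, a class-count table, and binary path-vectors over the taxonomy labels) are hypothetical stand-ins for illustration:

```python
import numpy as np

def balanced_ce(path_probs, class_counts, labels, N):
    """Balanced cross-entropy, Eq. (8): average over the T levels of the
    taxonomy, weighting each label by #N / #N_c (Eq. (9)).

    path_probs   : list of T probability vectors, one per taxonomy level.
    class_counts : dict mapping label -> number of training examples #N_c.
    labels       : ground truth label index at each level.
    N            : total number of training examples #N.
    """
    T = len(labels)
    loss = 0.0
    for p, c in zip(path_probs, labels):
        w = N / class_counts[c]          # label weight, Eq. (9)
        loss -= w * np.log(p[c])
    return loss / T

def hierarchical_loss(H, H_hat):
    """Cosine-distance path loss, Eq. (11): H is the binary ground-truth
    path over all labels of the taxonomy tree, H_hat the estimated path."""
    return 1.0 - (H @ H_hat) / (np.linalg.norm(H) * np.linalg.norm(H_hat))

def total_loss(path_probs, class_counts, labels, N, H, H_hat):
    """Combined objective of Eq. (10)."""
    return balanced_ce(path_probs, class_counts, labels, N) + hierarchical_loss(H, H_hat)
```

Note that the cosine term is zero only when the predicted path matches the ground-truth path, which is what penalizes reaching a correct leaf through an incorrect branch of the taxonomy.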
Table 2. Comparison of channel attention and inference strategies (termed configuration) - best scores on the ISIC 2017 test set. Bold highlights the best results.

channels contain redundant information and can be discarded. Interestingly, the best ratio r seems to be either 2 or 4; larger values led to a significant performance degradation. This may be due to the variability across dermoscopy images or to the size of the training set, which does not allow the training of larger channel-attention modules.

Between the two inference strategies, CI2 seems to improve the classification scores for the melanoma class. However, the performance for the remaining classes degrades and some of the decisions violate the taxonomy (e.g., compare the SP of the non-melanocytic lesions with the SE of the keratosis class). Thus, we conclude that it is preferable to use only the state of the LSTM to infer the class, i.e., inference method CI1 (3). The remaining results will be reported for this formulation.

Model hyperparameters: Table 3 shows a comparison of the BACC scores for different values of P and M. For simplicity, we limit our analysis to the two best performing image encoders: VGG-16 and DenseNet-161. Both networks seem to achieve better performances when the size of the hidden state of the LSTM is P ≥ 512, thus we opt to use P = 512 to reduce the number of parameters. Regarding the size of the class-embedding M, the two models show distinct behaviors. The performance of the model that uses DenseNet-161 degrades, possibly due to the smaller size of the activation maps obtained with this network (9 × 9 vs. 18 × 18 for VGG-16) and the significantly higher number of filters (D = 2208 vs. D = 512 for VGG-16).

6.2. Qualitative assessment of attention modules

In this section, we visualize the channel and spatial attention maps obtained using the VGG-16 and DenseNet-161 based models. For the sake of simplicity, we do not discuss the model based on ResNet-50, since this encoder achieved the worst overall performance. The discussed examples are all from the ISIC 2017 data set.

Fig. 4. Correctly diagnosed skin lesions from the ISIC 2017 test set and the corresponding spatial attention maps for the following image encoders: VGG-16 (top maps) and DenseNet-161 (bottom maps).

Figs. 4 and 5 show the spatial attention maps obtained with VGG-16 and DenseNet-161 as image encoders. Two observations can be made regarding the attention maps: i) VGG-16 (top maps) leads to more detailed maps, although it achieves slightly worse diagnostic performances; and ii) both models "attend" to regions that are relevant for the medical diagnosis and are able to show regions of interest, even when the lesion is difficult to distinguish from the surrounding skin (see the keratosis example in Fig. 4). To understand the importance of the regions identified by the models, we can take a closer look at the melanocytic images. In both examples, the models "attend" to areas that contain a dermoscopic structure called pigment network, which is considered one of the hallmarks of melanocytic lesions [15]. As its name suggests, pigment network consists of a mesh of dark lines over a lighter background. In Fig. 4, the pigment network areas are located near the border of the two melanocytic lesions. Pigment network may also be used to diagnose skin lesions, as a network with an atypical shape (enlarged and irregular lines) is taken as a sign of melanoma. This may justify why a benign lesion was diagnosed as melanoma in Fig. 5 (left), since both networks select one of the regions with atypical network to perform the differential diagnosis.

Spatial attention may also be very useful to help us identify bias in the data set, as exemplified in Fig. 5 (right). Several keratosis images from the ISIC 2017 data set contain illumination artifacts that were detected by the attention modules, allowing the model to learn incorrect features from the data.

Fig. 6. Highest probability activation maps (channels) for a correctly diagnosed nevus. The maps for VGG-16 and DenseNet-161 and their corresponding probabilities are shown on the left and right columns, respectively.

Fig. 6 shows a comparison between the three channels with the highest weights β_t (computed using (5)) for VGG-16 and DenseNet-161. To improve the visualization, we expanded the channel activation maps to the size of the original dermoscopy image. The maps from VGG-16 seem to capture more localized information than those from DenseNet-161. This may be due to the properties of the two networks, since in DenseNet the maps from top layers are propagated to the deeper ones. Nonetheless, in both cases it is possible to see that the networks extract information related to the surrounding skin, the border of the lesion and its center, and the presence of hair. These findings confirm those reported by Van Simoens and Dhoedt [18], who observed that the convolutional layers were sensitive to several visual cues, including the aforementioned ones.

6.3. Comparison with other works

Table 4 compares our best performing ISIC 2017 models to various state-of-the-art methods. The ISIC 2017 data set was released as part of a challenge, thus most of the works in the literature only report performance scores for the melanoma and keratosis classes, which were the metrics used to rank the participants. For comparison purposes, we also adopt those metrics. Additionally, we report two criteria that may influence the performance of the models: (i) ensembles of networks; and (ii) augmenting the training set with external data.

By comparing the scores of our hierarchical models with flat models that use the same CNN architecture and do not use external data [28,46], we are able to see that imposing a hierarchical diagnosis leads to impressive improvements in the performance of the model, since the SE and SP scores for both classes are significantly higher for all of our models. When external data is used for training (see [17]), the flat model performs better than our hierarchical approach (see the results with ResNet-50). Nonetheless, our approach is still able to achieve a competitive performance using less training data. Once again, this suggests that imposing a hierarchical decision process allows the training of more generalizable models.

The methods that ranked in the first four positions of the ISIC 2017 challenge used external data to train the models. Moreover, three of these works also adopted an ensemble strategy to predict the final diagnosis. When we compare our scores with theirs, it is possible to see that only the top ranked method achieved better SE scores. Ensembles may be used to improve the representation power of a model by combining the capabilities of different CNN architectures. Our results suggest that imposing a hierarchical diagnosis allows a single architecture to learn better representations of the various lesion classes.

A final comparison can be made between the proposed models and the hierarchical approach of [28], which sequentially adds
Fig. 5. Incorrectly diagnosed skin lesion (left) and data set bias (right), and their corresponding spatial attention maps for the following image encoders: VGG-16 (top maps) and DenseNet-161 (bottom maps).
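Overlaying the maps of Figs. 4-6 on the dermoscopy images requires expanding the L = g × g attention weights to the image resolution. A minimal nearest-neighbour sketch is given below; the paper does not state the exact interpolation used, so this choice (and the assumption that the image side is a multiple of the grid side) is ours:

```python
import numpy as np

def upsample_attention(alpha, grid, image_hw):
    """Expand an L-dimensional spatial attention vector (L = grid * grid
    positions) to the original image resolution for visualization.

    alpha    : (grid*grid,) non-negative spatial attention weights.
    grid     : side of the square feature-map grid (e.g., 18 for VGG-16).
    image_hw : (H, W) of the original image; both assumed divisible by grid.
    """
    amap = alpha.reshape(grid, grid)
    H, W = image_hw
    # nearest-neighbour expansion: each grid cell becomes an (H/g, W/g) block
    up = np.kron(amap, np.ones((H // grid, W // grid)))
    return up / up.max()          # normalize to [0, 1] for a heat-map overlay
```

A smoother overlay would use bilinear interpolation instead of `np.kron`, but the nearest-neighbour version already shows which image regions each hierarchical decision attends to.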
Table 4. Comparison with other works on the ISIC 2017 test set. The table is organized by CNN architecture, and the methods identified with numbers correspond to the methods that ranked in the first four positions in the ISIC 2017 challenge.

Table 5. Classification scores (SE, SP, and AUC per class, plus data set level BACC) on the ISIC 2018 test set using different taxonomies: a) T = 2 two-level taxonomy (melanocytic/non-melanocytic and differential diagnosis); b) T = 3 three-level taxonomy (melanocytic/non-melanocytic and benign/malignant for non-melanocytic lesions, prior to the differential diagnosis); c) Inv. inverted taxonomy (malignant/benign followed by melanocytic/non-melanocytic and differential diagnosis).

           |       T = 2          |       T = 3          |       Inv.
           | SE     SP     AUC    | SE     SP     AUC    | SE     SP     AUC
Melanoma   | 57.3%  93.1%  87.7%  | 67.8%  90.1%  86.1%  | 59.1%  94.6%  88.9%
Nevus      | 83.9%  93.0%  96.2%  | 82.0%  93.7%  96.1%  | 89.4%  89.7%  96.0%
BCC        | 75.3%  98.0%  97.7%  | 74.2%  98.4%  98.5%  | 76.3%  98.4%  98.3%
Actinic    | 55.8%  99.2%  96.3%  | 60.5%  99.5%  95.5%  | 62.8%  99.0%  95.6%
Keratosis  | 76.0%  95.8%  95.0%  | 72.8%  96.6%  94.4%  | 76.0%  95.8%  94.6%
Derm.      | 63.6%  99.7%  98.1%  | 68.2%  99.8%  97.2%  | 68.2%  99.8%  96.7%
Vascular   | 60.0%  99.7%  97.7%  | 57.1%  99.7%  98.5%  | 62.9%  99.8%  97.8%
Average    | 67.4%  96.9%  97.7%  | 69.0%  96.8%  95.2%  | 70.7%  96.7%  95.4%
BACC       | 72.3%                | 72.6%                | 71.7%
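The class-specific SE/SP and the data set level BACC reported in the tables can be computed as follows; BACC is taken here as the mean of the per-class sensitivities, the usual definition of balanced multi-class accuracy (the paper does not spell out its formula, so this is our assumption):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes):
    """Class-specific sensitivity (SE) and specificity (SP), plus the
    data set level balanced accuracy (BACC = mean per-class SE).

    y_true, y_pred : integer label arrays of equal length.
    """
    se, sp = [], []
    for c in range(num_classes):
        tp = np.sum((y_true == c) & (y_pred == c))   # true positives for class c
        fn = np.sum((y_true == c) & (y_pred != c))   # missed examples of class c
        tn = np.sum((y_true != c) & (y_pred != c))   # correctly rejected others
        fp = np.sum((y_true != c) & (y_pred == c))   # others assigned to class c
        se.append(tp / (tp + fn))
        sp.append(tn / (tn + fp))
    bacc = float(np.mean(se))
    return np.array(se), np.array(sp), bacc
```

Because BACC averages the per-class sensitivities, a model that ignores the minority classes is penalized even when its plain accuracy is high, which is why it is the headline metric for these imbalanced data sets.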
fully-connected layers to make hierarchical decisions. The proposed approach outperforms the one presented in Barata and Marques [28], suggesting that an LSTM is more suitable for hierarchical classification.

6.4. ISIC 2018 - more classes and robustness

Table 5 shows the best performances on the ISIC 2018 test set, using different taxonomic lengths and orders. These results were obtained with the following configuration (selected by cross-validation): DenseNet-161 as the image encoder, channel (r = 2) and spatial attention, and CI1 for class inference. Since the ground truth diagnosis is not publicly available, it is not possible to evaluate the model on the other hierarchical decisions for the test set.

Similarly to what was observed for the ISIC 2017 data set, melanoma remains one of the most challenging classes. This was expected, as this type of skin cancer often mimics other types of lesions [15]. The other difficult classes are actinic, dermatofibroma, and vascular, while the model is able to achieve good performances for nevus, keratosis, and BCC. Actinic, dermatofibroma, and vascular lesions correspond respectively to 3.3%, 1.2%, and 1.4% of the training set, which is an extremely imbalanced scenario. Nevertheless, although we do not conduct any form of data augmentation, our approach is still able to achieve sensitivities above 50% for each of the minority classes.

It is possible to appreciate the influence of the hierarchical structure on the performance of the system by comparing the several taxonomic configurations. Just by increasing the taxonomic level from T = 2 to T = 3 we are able to significantly improve the diagnostic performance for two of the malignant classes (melanoma and actinic), while the performance for BCC remains almost the same. There was also a slight improvement in the BACC score, as well as in the average SE over all the classes. If we modify the taxonomic order, it is possible to improve the performance for almost all lesion types when compared with T = 2. However, this configuration leads to poorer melanoma scores than T = 3, which is undesirable as melanoma is a very aggressive form of cancer, as well as a worse BACC. Analyzing the channel and spatial attention maps of the lesions that were incorrectly classified could allow us to further understand the performance differences among the various taxonomic configurations, but unfortunately we do not have access to the ground truth labels for the images in the ISIC 2018 test set.

Similarly to ISIC 2017, in this case it is also possible to compare our results with those reported in the literature. The BACC scores for the ISIC 2018 test set fall in the range [13.2%, 88.5%],² which clearly shows how challenging this problem is. As in the case of the ISIC 2017 challenge, several of the best performing methods are large ensemble models that were trained with augmented versions of the original training set, which is not the case for the proposed approach.

In order to evaluate the robustness of the method to viewpoint and acquisition changes, we applied the model with T = 3 to transformed versions of the test set, as described in Section 5.1. The results are shown in Table 6. The sets with geometric transformations lead to slight improvements over the BACC score of the original set, but the performances remain fairly similar for all lesion types. On the other hand, removing the color normalization has a significant effect on the performance of the model: the SE for keratosis and vascular lesions drops drastically, while the SE for melanoma increases. Although we have conducted a form of data augmentation that used random color transformations to improve the robustness of the network, this result suggests that the fea-

² Source: https://challenge2018.isic-archive.com/
Please cite this article as: C. Barata, M.E. Celebi and J.S. Marques, Explainable skin lesion diagnosis using taxonomies, Pattern Recognition,
https://doi.org/10.1016/j.patcog.2020.107413
JID: PR
ARTICLE IN PRESS [m5G;May 21, 2020;9:28]
10 C. Barata, M.E. Celebi and J.S. Marques / Pattern Recognition xxx (xxxx) xxx
Table 6
Classification scores on the modified ISIC 2018 test sets.
Melanoma 67.8% 88.7% 73.2% 60.2% 93.0% 73.1% 83.0% 76.9% 66.6%
Nevus 80.2% 95.4% 84.5% 91.7% 74.1% 93.9%
BCC 75.3% 98.4% 77.4% 98.0% 76.4% 98.2%
Actinic 58.1% 99.6% 69.8% 98.6% 62.8% 99.1%
Keratosis 74.2% 95.3% 67.7% 97.3% 33.6% 99.5%
Derm. 70.5% 99.7% 65.9% 99.8% 56.8% 99.9%
Vascular 65.7% 99.8% 60.0% 99.7% 48.6% 99.7%
tures extracted by CNN are color sensitive. These results confirm consistently diagnosed as melanoma and another that is diagnosed
the findings of other works in the literature, such as [47], who as BCC for the original and flipped images, and as vascular for
found out that color normalization has a positive impact on the the remaining transformations. For each of the examples, we in-
performance of CNN. Two relevant points come out of this experi- spected their spatial attention maps as well as the most proba-
ment. First, it is important to standardize the color of dermoscopy ble activation map given by channel attention. Fig. 7 shows the
images through normalization. Second, future work should address results, for simplicity we only show the maps for two of the de-
the perception of color by CNNs, in order to make them more ro- cisions: melanocytic/non-melanocytic and the differential diagno-
bust to color changes. A few works investigated the sensitiveness sis. In all of the scenarios, the decision between melanocytic/non-
of CNNs to color using real images. However, to the best of our melanocytic is made taking into account the skin that surrounds
knowledge such a study is still missing in the dermoscopy field. the lesions, as shown in the corresponding spatial attention maps.
Based on our experimental findings, we believe that this is a nec- The spatial maps for the differential diagnosis are consistent across
essary direction, in order to improve the safety of the method. transformations, i.e., on the left example the model consistently
The explainability of our model can be used to understand the “attends” to the darkest regions, while on the right example it fo-
aforementioned performance differences. Thus, we selected two cuses on the blue-oval area. The main difference lies in the se-
examples from one of the cross-validation data sets: one that is lected channels. These maps show activations significantly differ-
ent from those obtained for the original image (1st row), demon-
strating the lack of robustness of the CNN architecture to these
transformations.
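The robustness protocol described above — re-scoring a fixed model on transformed copies of the test set, using per-class sensitivity (SE) and balanced accuracy (BACC, the mean of the per-class SE values) — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `model`, `evaluate_robustness`, and the transform names are hypothetical helpers, and only simple flips stand in for the full set of geometric and color transformations of Section 5.1.

```python
import numpy as np

def per_class_sensitivity(y_true, y_pred, n_classes):
    """SE_k: fraction of class-k samples that are classified as class k."""
    se = np.zeros(n_classes)
    for k in range(n_classes):
        mask = (y_true == k)
        se[k] = np.mean(y_pred[mask] == k) if mask.any() else 0.0
    return se

def balanced_accuracy(y_true, y_pred, n_classes):
    """BACC: the mean of the per-class sensitivities."""
    return per_class_sensitivity(y_true, y_pred, n_classes).mean()

def evaluate_robustness(model, images, labels, n_classes, transforms):
    """Re-evaluate the SAME trained model on transformed test-set copies."""
    scores = {}
    for name, tf in transforms.items():
        preds = np.array([model(tf(img)) for img in images])
        scores[name] = balanced_accuracy(labels, preds, n_classes)
    return scores

# Hypothetical stand-ins for the geometric transformations; the paper's
# protocol also includes rotations, crops, and color changes.
transforms = {
    "original": lambda img: img,
    "hflip": lambda img: img[:, ::-1],  # horizontal flip (reverse columns)
    "vflip": lambda img: img[::-1, :],  # vertical flip (reverse rows)
}
```

A robust model should yield near-identical BACC across the entries of `transforms`; a large drop for one entry (e.g., the color-modified set in Table 6) points to a sensitivity of the learned features to that factor.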
7. Conclusions
Acknowledgments
[4] P. Tschandl, et al., The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions, Sci. Data 5 (2018) 180161.
[5] A. Esteva, et al., Dermatologist-level classification of skin cancer with deep neural networks, Nature 542 (2017) 115–118.
[6] Y. Fujisawa, et al., Deep-learning-based, computer-aided classifier developed with a small dataset of clinical images surpasses board-certified dermatologists in skin tumour diagnosis, Br. J. Dermatol. 180 (2) (2019) 373–381.
[7] G. Litjens, et al., A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017) 60–88.
[8] B. Zhou, et al., Learning deep features for discriminative localization, in: Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[9] R.R. Selvaraju, et al., Grad-CAM: visual explanations from deep networks via gradient-based localization, in: International Conference on Computer Vision, 2017, pp. 618–626.
[10] J. Gu, et al., Recent advances in convolutional neural networks, Pattern Recognit. 77 (2018) 354–377.
[11] B. Zhou, et al., Interpreting deep visual representations via network dissection, IEEE Trans. Pattern Anal. Mach. Intell. 41 (9) (2018) 2131–2145.
[12] I. Rafegas, et al., Understanding trained CNNs by indexing neuron selectivity, Pattern Recognit. Lett. (2019).
[13] N. Sarafianos, et al., Curriculum learning of visual attribute clusters for multi-task classification, Pattern Recognit. 80 (2018) 94–108.
[14] J. Zhang, et al., Learning multi-layer coarse-to-fine representations for large-scale image classification, Pattern Recognit. 91 (2019) 175–189.
[15] G. Argenziano, et al., Interactive Atlas of Dermoscopy, EDRA Medical Publishing & New Media, 2000.
[16] J. Yang, et al., Classification for dermoscopy images using convolutional neural networks based on region average pooling, IEEE Access 6 (2018) 65130–65138.
[17] J. Zhang, et al., Attention residual learning for skin lesion classification, IEEE TMI 38 (9) (2019) 2092–2103.
[18] P. Simoens, B. Dhoedt, Visualizing convolutional neural networks to improve decision support for skin lesion classification, in: MLCN 2018, DLF 2018, and iMIMIC 2018, 11038, Springer, 2018, pp. 115–123.
[19] M. Yu, et al., Facial expression recognition based on a multi-task global-local network, Pattern Recognit. Lett. (2020).
[20] J. Kawahara, et al., 7-point checklist and skin lesion classification using multi-task multi-modal neural nets, IEEE JBHI 23 (2) (2019) 538–546.
[21] I. González-Díaz, DermaKNet: incorporating the knowledge of dermatologists to convolutional neural networks for skin lesion diagnosis, IEEE JBHI 23 (2) (2019) 547–559.
[22] P. Tschandl, et al., Accuracy of the first step of the dermatoscopic 2-step algorithm for pigmented skin lesions, Dermatol. Pract. Concept. 2 (3) (2012) 43–49.
[23] Y. Guo, et al., CNN-RNN: a large-scale hierarchical image classification framework, Multimed. Tools Appl. 77 (8) (2018) 10251–10271.
[24] D. Roy, et al., Tree-CNN: a hierarchical deep convolutional neural network for incremental learning, Neural Netw. 121 (2020) 148–160.
[25] D. Nguyen, T. Okatani, Multi-task learning of hierarchical vision-language representation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10492–10501.
[26] P. Kinghorn, et al., A hierarchical and regional deep learning architecture for image description generation, Pattern Recognit. Lett. 119 (2019) 77–85.
[27] K. Shimizu, et al., Four-class classification of skin lesions with task decomposition strategy, IEEE TBME 62 (1) (2015) 274–283.
[28] C. Barata, J.S. Marques, Deep learning for skin cancer diagnosis with hierarchical architectures, in: ISBI, 2019.
[29] X. Shu, et al., Hierarchical long short-term concurrent memory for human interaction recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
[30] C. Barata, et al., Deep attention model for the hierarchical diagnosis of skin lesions, in: Computer Vision and Pattern Recognition Workshops, 2019.
[31] S. Bai, S. An, A survey on automatic image caption generation, Neurocomputing 311 (2018) 291–304.
[32] O. Vinyals, et al., Show and tell: a neural image caption generator, in: Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[33] K. Xu, et al., Show, attend and tell: neural image caption generation with visual attention, in: International Conference on Machine Learning, 2015, pp. 2048–2057.
[34] L. Chen, et al., SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning, in: Computer Vision and Pattern Recognition, 2017, pp. 5659–5667.
[35] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556 (2014).
[36] G. Huang, et al., Densely connected convolutional networks, in: Computer Vision and Pattern Recognition, 2017, p. 3.
[37] K. He, et al., Deep residual learning for image recognition, in: Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] S. Woo, et al., CBAM: convolutional block attention module, in: ECCV, 2018, pp. 3–19.
[39] J.K. Chorowski, et al., Attention-based models for speech recognition, in: NeurIPS, 2015, pp. 577–585.
[40] N. Codella, et al., Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC), in: ISBI, 2018, pp. 168–172.
[41] C. Barata, et al., Improving dermoscopy image classification using color constancy, IEEE JBHI 19 (2015) 1146–1152.
[42] F. Perez, et al., Data augmentation for skin lesion analysis, in: OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, Springer, 2018, pp. 303–311.
[43] Y. Cui, et al., Class-balanced loss based on effective number of samples, in: Computer Vision and Pattern Recognition, 2019, pp. 9268–9277.
[44] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv:1412.6980 (2014).
[45] J. Pennington, et al., GloVe: global vectors for word representation, in: EMNLP, 2014, pp. 1532–1543.
[46] B. Harangi, Skin lesion classification with ensembles of deep convolutional neural networks, Elsevier JBI 86 (2018) 25–32.
[47] A. Mahbod, et al., Fusing fine-tuned deep features for skin lesion classification, CMIG 71 (2019) 19–29.

Catarina Barata received B.Sc. and M.Sc. degrees in Biomedical Engineering, and a Ph.D. degree in Electrical and Computer Engineering from Instituto Superior Técnico, University of Lisbon, in 2009, 2011, and 2017, respectively. Currently, she is an Invited Assistant Professor at Instituto Superior Técnico, and a Research Assistant at the Institute for Systems and Robotics (ISR). Her research interests include the development of interpretable classification methods with applications in medical image analysis and surveillance.

M. Emre Celebi received the B.Sc. degree in computer engineering from the Middle East Technical University, Ankara, Turkey, in 2002, and the M.Sc. and Ph.D. degrees in computer science and engineering from The University of Texas at Arlington, Arlington, TX, USA, in 2003 and 2006, respectively. He is currently Professor and Chair of the Department of Computer Science, University of Central Arkansas, Arkansas, USA. He has pursued research in the field of image processing and analysis.

Jorge S. Marques received the E.E., Ph.D., and Aggregation degrees from the Technical University of Lisbon, Lisbon, Portugal, in 1981, 1990, and 2002, respectively. He is currently a Full Professor with the Electrical and Computer Engineering Department, Instituto Superior Técnico, Lisbon, and a Researcher at the Institute for Systems and Robotics. His research interests include the areas of image processing and pattern recognition. Dr. Marques was Co-chairman of the IAPR Conference IbPRIA 2005, and President of the Portuguese Association for Pattern Recognition (2001–2003).