Supporting Skin Lesion Diagnosis with Content-Based Image Retrieval

—In recent years, many attempts have been dedicated to the creation of automated devices that could assist both expert and beginner dermatologists towards fast and early diagnosis of skin lesions. Tasks such as skin lesion classiﬁcation and segmentation have been extensively addressed with deep learning algorithms, which in some cases reach a diagnostic accuracy comparable to that of expert physicians. However, the general lack of interpretability and reliability severely hinders the ability of those approaches to actually support dermatologists in the diagnosis process. In this paper a novel skin image retrieval system is presented, which exploits features extracted by Convolutional Neural Networks to gather similar images from a publicly available dataset, in order to assist the diagnosis process of both expert and novice practitioners. In the proposed framework, ResNet-50 is initially trained for the classiﬁcation of dermoscopic images; then, the feature extraction part is isolated, and an embedding network is built on top of it. The embedding learns an alternative representation, which allows to check image similarity by means of a distance measure. Experimental results reveal that the proposed method is able to select meaningful images, which can effectively boost the classiﬁcation accuracy of human dermatologists.


I. INTRODUCTION
Skin cancer is one of the most common forms of human cancer worldwide [1].
Malignant melanoma is less common than basal and squamous cell skin carcinoma (it accounts for only about 3-4% of all skin cancers) but it is responsible for most of skin cancer deaths [1], [2], despite the existence of new therapeutic agents, such as checkpoint and BRAF inhibitors that improve survival of advanced cases [3].However, Squamous Cell Carcinoma (SCC) can become lethal when it metastasizes, since few standardized and effective therapies for advanced SCC have been established.Although metastatic Basal Cell Carcinoma (BCC) is very rare, any delay in diagnosis may allow tumors to become unresectable [4].Therefore, early detection of all skin cancers, not limited to melanoma, is required to prevent progression of these cancers to advanced stages and reduce skin cancer-related deaths [4], [5].The clinical diagnosis of malignant melanoma is still difficult since the morphological characteristics of other pigmented skin lesions may sometimes mimic it.In fact, even in specialized centers, the melanoma diagnosis accuracy achieved with the unaided eye is slightly better than 60% [6], [7], making the early detection very hard to obtain.Nowadays, dermoscopy represents one of the most relevant imaging techniques for melanoma diagnosis.Dermoscopic images are obtained through a non-invasive in vivo examination based on a microscope that exploits an incident light and oil/gel immersion to make skin subsurface structures, dermoepidermal junction, and upper dermis accessible to visual examination [8], [9], [10].
Several approaches have been proposed to improve the diagnostic performance of clinicians: the ABCD rule [11], the CASH algorithm [12], Menzies method [13], 7-point checklist [14] or some other pattern classification methods [15] to distinguish between melanoma and non-melanoma pigmented skin lesions.However, becoming an experienced dermoscopic reader requires a significant time and training investment [4].Moreover, even after such training, the readings are often complex and subjective.
Unfortunately, these approaches require huge amounts of data, which are hard to obtain and particularly expensive to annotate.However, since the first convolutional layers of CNNs learn to recognize simple elements like lines and colors, pre-training neural networks with existing collections of natural images [25] can mitigate the need for large annotated medical datasets [26].Nevertheless, this approach can introduce biases towards certain characteristics and features.As an example, CNNs trained using ImageNet are strongly biased in recognizing textures rather than shapes [27].
With the aim of mitigating this problem, since 2016 the International Skin Imaging Collaboration (ISIC) has begun to aggregate a large-scale, publicly available dataset of dermoscopic skin lesions images (Fig. 1) and hosting multiple challenges and workshops [28].The availability of this substantial amount of dermoscopic images allowed to significantly improve the performance of machine learning algorithms.A recent international diagnostic study demonstrates that the accuracy of machine learning algorithms for pigmented skin lesion classification proposed in the last years is comparable with those of expert dermatologists [29].This is true considering only individual images and ignoring the clinical history of the patient: information that the dermatologist has often available when visiting.
Nevertheless, in the medical diagnosis field the goal should not be to take an action on behalf of an expert practitioner, but rather to assist his/her choice.Indeed, empirical experiments have shown that providing the physician with a second computerized and non-interactive opinion is not well seen by the dermatologists that even tends to change their own diagnoses when presented with a second computerized point of view [30].
In the last years, many attempts have been made to devise AI approaches, which are not only performing well, but are trustworthy, transparent, interpretable and explainable for a human expert [31].Among others, the Grad-Cam technique [32] has proven to be a valid solution for dermoscopic images analysis: it is able to explain dermatologists on which parts of the input image the model is mostly focused, by simply providing a heat map [33].
This paper mainly aims at providing an interactive approach to effectively support dermatologists in the decision making process.Given a skin lesion, the designed deep learningbased algorithm will retrieve a number of other diagnosed cases that are similar to the original image, thus supporting and driving the dermatologist to a more precise diagnosis, without directly providing a second opinion.We thus propose a Content-Based Image Retrieval (CBIR) tool for searching and retrieving most similar dermoscopic images.A set of quantitative experiments shows that the proposed approach provides a better ranking than existing state-of-the-art solutions on the same topic.Moreover, to test the usefulness of the devised strategy, five dermatologists have been asked to classify different images with and without the output provided by our algorithm: experimental results clearly demonstrate the benefit of the proposal.
The rest of the paper is organized as follows.Section II describes the state-of-the-art dermoscopic image analysis solutions and content-based image retrieval systems available in literature.In Section III the proposed model is described and then it is evaluated from both a quantitative and qualitative point of view in Section IV.Finally, in Section V conclusion are drawn.

Dermoscopic Images Analysis
The extensive usage of dermoscopic images has been causing a wide interest in their computed analysis.Traditional computer vision techniques have focused on several tasks, from basic lesion segmentation [34] to the identification of meaningful patterns for the final classification [35].The groundbreaking evolution of deep learning pushed the limits of automated dermoscopic images analysis, with great results especially on the segmentation and the complete classification tasks [33], [36].Esteva et al. [23] compared the performance of CNNs to the accuracy of expert dermatologists, across two binary classification tasks of both clinical and dermoscopic images.The results of this study indicate that neural networks are able to classify skin cancer with a level of competence comparable to dermatologists.
The International Skin Imaging Collaboration (ISIC) has been playing a major role in the growth of skin lesion analysis.They have been collecting a large dataset of publicly available dermoscopic images [28], [37], [38], and have been hosting challenges since 2016; the 2019 challenge official training set counts 25 331 images.The work by Tschandl et al. [29] thoroughly compares results obtained by CNNs trained using the public ISIC dataset and the diagnosis carried out by numerous expert practitioners, proving that, when trained for the dermoscopic images classification task, neural networks obtain comparable results to human dermatologists.The authors also demonstrated the lack of reliability and resilience to out of distribution samples of these models.

Content-Based Image Retrieval
In the work by Ballarini et al. [39], authors developed a content-based image retrieval system for a dataset of 533 images, divided into five classes of lesions, including two non-melanoma cancer types.Their system relies on visual features, such and color and composite texture, evolved using genetic algorithms.The similarity matching function is obtained by composing Bhattacharyya distance [40] for color covariance-based features, and Euclidean distance for texture features.In [41], Baldi et al. proposed another method based on low-level representative features.They find visually similar images to a query by means of a hierarchical multi-scale computation of the Bhattacharyya distance of all the database images.
Another CBIR system for skin lesions was developed by Jiji and Raj [42], who exploit features such as color, shape and texture, in order to find the most visually similar images to a query in a dataset composed of 20 different diseases.Their categories, however, do not include melanocytic nevi, and melanomas are not distinguished from other kinds of skin cancer.In 2016, Rahman et al. [43] proposed a decision support system based on a fusion between classification and retrieval.Their model considered high-level features such as Non-Subsampled Contourlet Transform (NSCT) and HOG based on Hessian matrix1 .The model has been validated on the ISIC 2016 dataset, containing two skin lesion categories, melanoma and benign.Belattar et al. [44] used a kernelized SVM classification algorithm with an active learning technique and a histogram intersection matching in a retrieval system with relevance feedback for melanoma diagnosis.Their experiments were conducted on a dataset of melanomas and benign lesions.Finally, Pu et al. [45] applied the retrieval method proposed in [46] to the skin lesion domain.Their approach consists of training a deep network that learns to map images to hash codes, in a way that preserves similar semantics.Their network minimizes a classification loss function, with additional regulation terms to achieve desirable hash code properties.Then, similar images to a query are ranked by means of the Hamming distance, and clustered with Affinity Propagation (AP).This last proposal improves the measured precision with respect to previous methods based on hand-crafted features, and therefore represents a reference for comparison.
With this paper we present another content-based image retrieval system that requires no hand-crafted feature.Instead, we exploit CNNs' automatically learned filters, and we discuss three variations of a model trained with a loss function specific for the retrieval task, that allows us to overcome the limits of the classification loss and improve the retrieval performance with respect to state-of-the-art.

III. IMAGE RETRIEVAL
As stated, the aim of this paper is to find a model for Content-Based Image Retrieval (CBIR) for skin lesion medical images.Given a query, i.e. a new image of an unknown class, the task consists of retrieving k images from a labeled dataset, that are similar to the query.Since the final aim is to provide physicians with a valuable support for the classification task, two images are considered similar if they possess common features ascribable to a certain class of lesions.In order to extract such features, our approach to the problem consists of learning an embedding function f from skin lesion images space I into a compact Euclidean feature space R d , where distances correspond to a measure of similarity.In feature space, distance between images of the same class should be small, and distance between images of different classes should be large.After such a mapping has been established, the Fig. 2. Example result of the ranking system.On the left an example query image belonging to the melanocytic nevus class.On the left side the first five results in order of score from top to bottom.In this example all the output images are from the same class of the query.retrieval task can be accomplished by taking the k labeled images with the minimum distance from the query in the embedded space.In order to achieve this task, we built and evaluated four different variations of a deep convolutional neural network.
For every model, input images are resized to 512×512, and each channel is normalized by subtracting the mean and dividing by the standard deviation.Moreover, training images undergo augmentation techniques to reduce overfitting, consisting in random flip and random rotation.

A. Classification
The first embedding model is part of a deep neural network trained for classification.We conducted experiments with multiple variations of ResNet (with 18, 34, 50, 101 and 152 layers), which have been proved to be effective solutions for the classification task of the ISIC challenge [28].The loss used for this network is the cross-entropy loss.The network is trained with Stochastic Gradient Descent (SGD), with an initial learning rate of 0.001, lowered each time the evaluation accuracy reaches a plateau, with initial weights pretrained on the ImageNet dataset.After the training process, the embedding network is obtained by removing the last fully connected layer, which outputs the classification results.In this way, only the feature extraction part of the network is preserved, whose output is a vector of size 512 or 2048, depending on the ResNet variation used.This intermediate feature vector represents high-level visual features, strictly related to the image class, and represents the mapping in the new euclidean space.This simple model is mainly built for comparison with the similar proposal by Pu et al. [45], which also employs a classification network, and with the more complex models discussed further on.

B. Embedding End-to-End
The second model is a convolutional network trained endto-end for the embedding task.Like the previous model, it is based on ResNet.However, instead of outputting class probabilities, the last fully connected layer, with output size d, represents the embedding in R d .The output vector is divided by its L2-norm; this regularization ensures that embedded vectors lie on the surface of the unit (d − 1)-sphere, i.e., f (x) 2 = 1.
The loss used for training is the triplet loss [47], that directly reflects the specific goal of this work.The triplet loss is calculated over a triplet of input samples composed of an anchor (a), a positive (p) and a negative (n).Sample a and p belong to the same class, while n belongs to a different class.Given a triplet, the loss is computed as: In the formula, d(x, y) is the distance between x and y in feature space, and it can be calculated using a chosen metric, while α represents a desired margin between positive and negative pairs.The triplet loss aims at minimizing the distance from anchor to positive, and maximizing the distance from anchor to negative.We performed experiments with the euclidean distance and cosine distance (defined as 1 -cosine similarity), and we verified that, for this task, cosine distance yields better results.It is worth noticing that, if feature vectors lie on the unit (d − 1)-sphere, cosine similarity is the same as the dot product.
The training process starts with the sampling of balanced minibatches, with n random elements per class, so that each skin lesion category is equally represented.For every minibatch, a network forward yields a corresponding list of embeddings.Then, triplets are chosen in the following way: (i) for every class, all possible pairs (a, p), i.e. combinations of two samples, are listed.Since there are n samples for each of the 8 classes, the total number of pairs is n 2 ×8 = 4n(n−1); (ii) for every pair, a hard negative sample is chosen from the other 7 classes.A hard negative is a sample n for which L(a, p, n) > 0.
Only hard negatives are useful for the training process, because they yield a non-zero loss.We experimented the three most common criteria for the choice of the hard negative: • random hard: a hard negative randomly selected.
• hardest: the negative sample yielding the highest loss value.
• random semihard: a random negative n for which The hardest negative criterion leads to the fastest convergence, but it is also known to often lead to local minima [48], and the random semihard criterion is useful to reduce that risk.These observations hold true in our case, and we got the best results applying the random semihard criterion, even if the random hard choice proved to be nearly as effective.After the choice of negatives, the loss is computed for every triplet, and losses are averaged to get an aggregated value.This network is trained with SGD and initial learning rate of 0.01, decreased every 30 epochs.

C. Classification & Embedding
The third model taken into account is composed of both a classification network and a separate embedding network.It builds upon the first model, in the sense that, after the end of the training process, feature vectors extracted by the classification network constitute a new dataset, and they become the input of a fully connected network, that outputs the final embedding in d dimensions.This embedding network is composed of three fully connected layers of size 2048, 1024 and d respectively, randomly initialized.The first two layers are followed by a PReLU activation and a dropout layer; the whole architecture is depicted in Fig. 3. Like the previous case, the loss function is the online triplet loss.We trained this network with SGD, learning rate starting at 0.01 and lowered every 30 epochs, and batches of 30 samples per class.

D. Classification & Embedding End-to-End
The fourth model shares the same architecture of the third one: the difference is that, in this case, the feature extractor and the embedding network are not separated.Instead, after the training of the classification network, its final fully connected layer is substituted with the embedding network, and a new end-to-end training process starts, with online triplet loss.This means that the backward propagation does not stop at the first layer of the embedding network, as in Model-C, but instead reaches the beginning of the feature extractor.Because of the large size of the network, we are strongly limited in the batch size: in our tests, we used batches with 3 examples for each class.The whole architecture, that is the same as Model-C, is depicted in Fig. 3.

A. Quantitative
The dataset used for training and evaluating the proposed models is the ISIC 2019 archive [28], [37], [38], an international repository of dermoscopic images built for both clinical ), and computing precision at k (P @k), i.e., the number of neighbors that share the same class of the query, divided by k.The values for k have been chosen in a way that reflects the use case, that is showing a physician similar lesions to a new image, which needs to be classified.P @k values computed on the validation set are used for hyperparameters optimization and early stopping of the training process.Then, final values of P @k are computed by substituting validation set with test set.In literature, another common metric used to evaluate the performance of retrieval systems is the recall, i.e. the ratio of relevant retrieved images to the total number of relevant images in the dataset.However, the recall is not meaningful for this use case, because the amount of retrieved images is orders of magnitude lower than the total number of dataset images belonging to each class.
As regards the choice of hyperparameters, experiments revealed that, independently of the model, the best value for α is 0.2, cosine distance is more effective than euclidean distance, and the L2 regularization actually improves the results.Model-dependent hyperparameters, instead, are reported It can be observed that the feature extractors used in the final models, ResNet-34 and ResNet-50, include a relatively small number of layers.We verified that deeper versions of ResNet, with 101 or 152 layers, are more effective for the classification task, i.e., lead to higher accuracy; however, when measuring P @k, the results are not satisfactory.As regards models employing triplet loss, one of the reasons is that implementation details of the loss itself make large batches more efficient [48], and, under the same hardware, smaller networks allow for larger batches.Another consequence is that Model-C (Classification & Embedding End-to-End) reaches lower AP @k than Model-D (Classification & Embedding): the latter, despite being a simpler model, can exploit a batch size larger by an order of magnitude.
For comparative reasons, we have implemented the method proposed in [45] and known as Hash-AP.To the best of our knowledge, it is the only skin lesion CBIR system that makes use of deep learning to extract visual features, and it represents the current state-of-the-art in the field.Hash-AP is a classification network with an added hidden layer that learns a binary embedding, or hash, and retrieves similar images to a query measuring the Hamming distance of hash codes, then clustering the results and the query itself using affinity propagation.The original model is based on AlexNet [49].In order to provide a fair comparison with our proposals, we also built a variation of Hash-AP based on ResNet, and we verified that, at least on the ISIC dataset, the retrieval performance notably improves w.r.t. the original model.
Final results are reported in Table I: P @k is calculated for each class, and the last column reports the average precision at k (AP @k), which represents a summary of the model performance.AP @k is the average value of per class P @k: in this way, every class is given the same weight, independently of its representativeness.Models Hash-AP and Hash-AP ResNet retrieve a variable number of images, which depends on the clustering result, and therefore only one value is reported for them.Novel proposals are marked by a star.
The precision measured for the original Hash-AP model is quite low.This result can be attributed to the classification network used, AlexNet, which fails to compete with more modern architectures, and to the difference between the ISIC dataset used in this work and the smaller dataset considered in the original paper.However, the ResNet-50 variation we implemented alongside the original one reaches an average precision of almost 0.76, proving the effectiveness of the idea.As regards models discussed in this paper, the Classification one yields the highest AP @1, being the only one to exceed 0.82.Nevertheless, for higher values of k, its performance is less satisfactory and, indeed, this network was not directly trained for image retrieval.The best performing model for higher values of k is Model-C, Classification & Embedding.As stated before, this result is partially related to the batch size used in training: Model-C has been trained with a batch size respectively 5 times and 10 times larger than that used for Model-B and D (Table II).A larger batch size means that, in the triplet selection process, more negatives are evaluated for the same positive pair, and thus there is a higher probability to find a random semihard negative.The improvement with respect to Hash-AP ResNet demonstrates the effectiveness of the triplet loss function, that is more suitable than a classification loss for the retrieval task.

B. Qualitative
In order to understand if our model can be effectively useful in supporting medical diagnosis, we asked five dermatologists to undergo a test, composed of two tasks.The first task simply consisted in classifying 100 dermoscopic images, randomly sampled from the ISIC test set, without any additional help.In the second task, the physicians were asked to classify the same images, but we also showed them the 5 nearest neighbors of each sample, labeled with their ground truth classes.The neighbors were found in the training set using Model-C.The tests were performed using exactly the same 100 images, with the aim of reducing the probability of a possible improvement being only imputable to the dataset variance.In order to mitigate bias in the second task, each image was rotated by 180°, and the list was shuffled.We measured the accuracy of the physician classification for both tasks.Average result is 67.4% for the first one (images only) and 76.6% for the second one (images and 5 nearest neighbors).Results obtained by individual dermatologists are shown in Table IV.Although a set of five dermatologists constitutes a relatively small statistical sample, an average improvement of 9.2% on the very same images proves the usefulness of our CBIR model as a support in the decision making process.

V. CONCLUSION
In this paper we introduced a novel content-based image retrieval system for dermoscopic image analysis.The proposed system is based on learned features obtained through state-ofthe-art convolutional neural networks.
An exhaustive experimental evaluation proved that the proposed solution outperforms competitors on the well known ISIC dataset.Moreover, the effectiveness and the usefulness of the devised solution have been demonstrated by a set of qualitative experiments carried out by a group of dermatologists.
These results prove that, although the existing knowledge of dermatologists is invaluable for the diagnosis of skin cancer, the proposed CBIR model can assist physicians with dermatologist-grade decision support.

Fig. 3 .
Fig. 3. Architecture of Model-C (Classification & Embedding) and D (Classification & Embedding End-to-End).The difference between the two approaches is that, after the training of the ResNet feature extractor with a classification loss, Model-D is fine-tuned end-to-end with triplet loss, while Model-C only updates weights on the embedding part of the network.

TABLE I AVERAGE
PRECISION AT K MEASURED FOR EVERY MODEL ANALYZED, FOR THREE VALUES OF K, 1, 5, AND 10.CENTRAL COLUMNS REPORT AVERAGE VALUES SEPARATED FOR EACH CLASS, AND THE LAST COLUMN REPORTS THE BALANCED AVERAGE.NOVEL PROPOSALS ARE IDENTIFIED BY *.SCC).A priori probability of each class is reported in Table III.We split the dataset into training, validation and test sets, composed of 19 331, 1 000 and 5 000 images respectively.Every model is trained with the aim of minimizing the corresponding loss function on the training set.After each epoch, a partial evaluation of the model is performed by retrieving the k nearest neighbors taken from the training set of every sample in validation set, for three values of k (1, 5, and 10

TABLE II OPTIMAL
HYPERPARAMETERS USED FOR EACH MODEL, FOUND WITH GRID SEARCH.FE, DIM, LOSS, LR AND BS RESPECTIVELY STAND FOR FEATURE EXTRACTOR, DIMENSIONS OF FEATURE SPACE, LOSS FUNCTION, LEARNING RATE AND BATCH SIZE.

TABLE III A
PRIORI PROBABILITY OF EACH SKIN LESION CLASS.DERMATOLOGISTS ON TASK 1 AND TASK 2, CONSISTING IN CLASSIFYING 100 SKIN LESION IMAGES WITH AND WITHOUT THE SUPPORT OF OUR RETRIEVAL NETWORK.