Comparison of the accuracy of human readers versus machine-learning algorithms for pigmented skin lesion classification: an open, web-based, international, diagnostic study

BACKGROUND
Whether machine-learning algorithms can diagnose all pigmented skin lesions as accurately as human experts is unclear. The aim of this study was to compare the diagnostic accuracy of state-of-the-art machine-learning algorithms with human readers for all clinically relevant types of benign and malignant pigmented skin lesions.


METHODS
For this open, web-based, international, diagnostic study, human readers were asked to diagnose dermatoscopic images selected randomly in 30-image batches from a test set of 1511 images. The diagnoses from human readers were compared with those of 139 algorithms created by 77 machine-learning labs, who participated in the International Skin Imaging Collaboration 2018 challenge and received a training set of 10 015 images in advance. The ground truth of each lesion fell into one of seven predefined disease categories: intraepithelial carcinoma including actinic keratoses and Bowen's disease; basal cell carcinoma; benign keratinocytic lesions including solar lentigo, seborrheic keratosis and lichen planus-like keratosis; dermatofibroma; melanoma; melanocytic nevus; and vascular lesions. The two main outcomes were the differences in the number of correct specific diagnoses per batch between all human readers and the top three algorithms, and between human experts and the top three algorithms.


FINDINGS
Between Aug 4, 2018, and Sept 30, 2018, 511 human readers from 63 countries had at least one attempt in the reader study. 283 (55·4%) of 511 human readers were board-certified dermatologists, 118 (23·1%) were dermatology residents, and 83 (16·2%) were general practitioners. When comparing all human readers with all machine-learning algorithms, the algorithms achieved a mean of 2·01 (95% CI 1·97 to 2·04; p<0·0001) more correct diagnoses (17·91 [SD 3·42] vs 19·92 [4·27]). 27 human experts with more than 10 years of experience achieved a mean of 18·78 (SD 3·15) correct answers, compared with 25·43 (1·95) correct answers for the top three machine algorithms (mean difference 6·65, 95% CI 6·06-7·25; p<0·0001). The difference between human experts and the top three algorithms was significantly lower for images in the test set that were collected from sources not included in the training set (human underperformance of 11·4%, 95% CI 9·9-12·9 vs 3·6%, 0·8-6·3; p<0·0001).


INTERPRETATION
State-of-the-art machine-learning classifiers outperformed human experts in the diagnosis of pigmented skin lesions and should have a more important role in clinical practice. However, a possible limitation of these algorithms is their decreased performance for out-of-distribution images, which should be addressed in future research.


FUNDING
None.



Introduction
Diagnosis of skin cancer needs specific expertise that might not be available in many clinical settings. Accurate diagnosis of early melanoma in particular demands experience in dermatoscopy, a non-invasive examination technique 1 that improves diagnosis compared with examination with the naked eye. 2 Dermatoscopy, which requires proper training and experience, is used widely by dermatologists, 3 but also by general practitioners 4 and other health-care professionals in areas where specialist dermatological services are not readily available.
The paucity of experts and the rising incidence of skin cancer in an aging population 5 have increased the demand for point-of-care decision support systems that can diagnose skin lesions without the need for human expertise. There has been a long tradition of translational research involving machine learning for melanoma diagnosis based on dermatoscopic images. 6-8 Although some automated diagnostic devices have been approved by the US Food and Drug Administration, 9,10 such devices are not widely adopted in clinical practice for various reasons; for example, the devices are approved for melanocytic lesions only and they require preselection of lesions by human experts.
Recent advancements in the field of machine learning, particularly the introduction of convolutional neural networks, have boosted interest in this area of research. 11 Codella and colleagues 12 used ensembles of multiple algorithms to show melanoma recognition accuracies greater than those of expert dermatologists. Subsequently, Esteva and colleagues 13 and Han and colleagues 14 fine-tuned convolutional neural networks with large datasets of clinical images and observed dermatologist-level accuracy for general skin disease classification. Furthermore, Haenssle and colleagues 15 reported expert-level accuracy of algorithms for dermatoscopic images of melanocytic lesions. However, in patients with severe chronic sun damage, up to 50% of pigmented lesions that are biopsied or excised for diagnostic reasons are non-melanocytic. 16 Training of neural networks for automated diagnosis of pigmented skin lesions has been hampered by the insufficient diversity of available datasets and by selection and verification bias. We tackled this problem by collecting dermatoscopic images of all clinically relevant types of pigmented lesions, and created a publicly available training set of 10 015 images for machine learning. 17 We provided this training set and a test set of 1511 dermatoscopic images to the participants of the International Skin Imaging Collaboration (ISIC) 2018 challenge, with the aim of attracting the best machine-learning labs worldwide to obtain reliable estimates of the accuracy of state-of-the-art machine-learning algorithms. We planned and organised an open, web-based, reader study under the umbrella of the International Dermoscopy Society and invited their members to compare their diagnostic accuracy with that of algorithms. Therefore, the aim of this study was to compare the most advanced machine-learning algorithms with the most experienced human experts using publicly available data.

Study design
For this open, web-based, international, diagnostic study, invitations to participate were first issued at the World Congress of Dermoscopy (June 14, 2018) and continued until Sept 28, 2018. 3Gen (San Juan Capistrano, CA, USA) and HealthCert (Singapore) sponsored prizes (a dermatoscope and books) for the best participants. No other compensation was offered to readers. Cumulative numbers of registrations were correlated with specific mailings and social media posts to targeted groups (appendix p 1).
The study protocol was approved by the ethics review boards of the University of Queensland (Brisbane, QLD, Australia) and the Medical University of Vienna (Vienna, Austria), which waived written, informed consent for retrospectively collected and deidentified dermatoscopic images. Before participation, human readers and participants of the ISIC 2018 challenge provided written consent to allow analysis of their ratings.

Procedures
We created a web-based rating platform accessible via username and password on which we ran the screening tests. Upon registration of participants (human readers), we collected information about age, sex, medical education, and years of experience with dermatoscopy. The basic functionality of the platform was to show an image together with a multiple choice question, which included seven predefined disease categories and a single correct answer. Before the main test, each reader had to complete four screening tests, which were used to stratify readers according to skill and to verify if self-reported experience matched actual skill.
The actual survey was done identically to the screening test, but used the test set of 1511 unknown images. The ground truth of each lesion fell into one of seven predefined disease categories: intraepithelial carcinoma including actinic keratoses and Bowen's disease; basal cell carcinoma; benign keratinocytic lesions including solar lentigo, seborrheic keratosis, and lichen planus-like keratosis; dermatofibroma; melanoma; melanocytic nevus; and vascular lesions. These seven disease categories comprise more than 95% of all pigmented lesions biopsied or excised for diagnostic reasons in clinical practice. 16 As we did not expect human readers to rate all 1511 images, each reader received batches of 30 randomly selected images. Readers could repeat the survey with different batches at their own discretion. Each test set image was rated by a mean of 80 readers (range 43-184; 95% CI 78·6-80·7). We stratified random sampling in four ways to analyse potential effects of class distributions (appendix p 1). The first batch was balanced with regard to number of lesions from each class (balanced), the second batch had more benign lesions (benign; 25 [83%] of 30 lesions), the third more malignant lesions (malignant; 21 [70%] of 30 lesions), and all subsequent batches were randomly drawn from the test set without stratification (random).
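The batch scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of the rating platform: the function names, the class abbreviations, and the toy test set are hypothetical, and the rule used to fill the two slots left over after taking an equal number of images per class is an assumption. Only the balanced and the unstratified (random) draws are shown.

```python
import random
from collections import Counter

# Assumed abbreviations for the seven disease categories (not defined in the text).
CLASSES = ["AKIEC", "BCC", "BKL", "DF", "MEL", "NV", "VASC"]
BATCH_SIZE = 30


def draw_random_batch(test_set, rng, size=BATCH_SIZE):
    """Unstratified draw: `size` distinct images from the whole test set."""
    return rng.sample(test_set, size)


def draw_balanced_batch(test_set, rng, size=BATCH_SIZE):
    """Equal number of images per class; leftover slots filled at random (assumption)."""
    per_class = size // len(CLASSES)  # 4 images per class for a batch of 30
    batch = []
    for cls in CLASSES:
        pool = [img for img in test_set if img[1] == cls]
        batch.extend(rng.sample(pool, per_class))
    chosen = set(batch)
    rest = [img for img in test_set if img not in chosen]
    batch.extend(rng.sample(rest, size - len(batch)))
    rng.shuffle(batch)  # present the images in random order
    return batch


# Toy test set of (image_id, label) pairs, 20 images per class, purely illustrative.
toy_test_set = [(f"img_{cls}_{i}", cls) for cls in CLASSES for i in range(20)]
rng = random.Random(0)
balanced = draw_balanced_batch(toy_test_set, rng)
random_batch = draw_random_batch(toy_test_set, rng)
balanced_counts = Counter(label for _, label in balanced)
```

The benign-enriched and malignant-enriched batches could be drawn in the same way by changing the number of images taken from each class.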
We randomly divided a master set of 11 210 dermatoscopic images into a training set (10 015 images; 89·3%) and a test set (1195 images; 10·7%). The images were collected during a period of 20 years from two sites, the Vienna Dermatologic Imaging Research Group (ViDIR) at the Department of Dermatology at the Medical University of Vienna (Vienna, Austria), and the skin cancer practice of Cliff Rosendahl in Queensland (Capalaba, QLD, Australia). The set, which has been described previously, 17 included consecutively collected images of pigmented lesions from different populations. Ground truth was routine pathology evaluation (>50% of all lesions), biology (>1·5 years sequential dermatoscopic imaging without changes), and expert consensus in some cases of common, straightforward, non-melanocytic cases that were not excised. Controversial cases with ambiguous histopathological reports were excluded. The Austrian image set could be divided into the following three subgroups: ViDIR legacy (images captured before 2005 with analog cameras and archived as diapositives), ViDIR current (images captured after 2005 with the DermLite FOTO [3Gen] system or Delta 20 [Heine; Herrsching, Germany]), and ViDIR MoleMax (images captured with the MoleMax HD system [Derma Medical Systems; Vienna, Austria]). The Australian image set included lesions from the patients of a primary care facility in an area with high skin cancer incidence. We added 316 images from other centres to the test set (external data), specifically from Turkey, New Zealand, Sweden, and Argentina, to ensure diversity of skin types, giving a final test set of 1511 images. Our original protocol did not mention test set images from other sources and did not specify the number of disease categories. These amendments were approved by the ethics board of the Medical University of Vienna on Dec 4, 2018.
Predictions of the machine-learning algorithms were provided by the participants of the ISIC 2018 challenge. We co-organised this challenge and an associated workshop 18 at the 21st International Conference On Medical Image Computing & Computer Assisted Intervention, which took place on Sept 20, 2018, in Granada, Spain. Detailed descriptions of submissions can be found at the challenge website. We removed the two lowest scoring (1·4%) of 141 submissions because a formatting error caused them to produce random predictions. Machine-learning groups were allowed up to three technically distinct submissions to the challenge, resulting in multiple entries from some groups (there were a total of 139 algorithms from 77 machine-learning labs). For each test case, the class (disease category) with the highest probability was regarded as the diagnosis given by the algorithm.
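The rule for turning an algorithm's output into a single diagnosis can be illustrated with a short Python sketch. The class abbreviations, the function name, and the example probabilities are assumptions made for illustration; the handling of ties (the first class in the list wins) is also an assumption, as the text does not describe it.

```python
# Assumed abbreviations for the seven disease categories, in a fixed order.
CLASSES = ["AKIEC", "BCC", "BKL", "DF", "MEL", "NV", "VASC"]


def predicted_diagnosis(probabilities):
    """Return the class with the highest predicted probability."""
    if len(probabilities) != len(CLASSES):
        raise ValueError("expected one probability per disease category")
    # max() returns the first index on ties (assumed tie-breaking rule).
    best_index = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[best_index]


# Hypothetical output of one algorithm for one test image.
example = [0.05, 0.10, 0.05, 0.02, 0.55, 0.20, 0.03]
diagnosis = predicted_diagnosis(example)  # "MEL", the class with probability 0.55
```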
The two main outcomes were the differences in the number of correct specific diagnoses per batch between human readers and the top three algorithms, and between human experts and the top three algorithms. For a batch of lesions with equal distribution of classes, this difference corresponds to the difference in balanced multiclass accuracy, which is the mean sensitivity calculated for every class in a one-versus-all manner. We chose this metric because it ignores the bias of highly prevalent classes, such as nevi, and gives a good overall estimation of performance in a multiclass setting, as it indirectly accounts for false positive cases, each of which is missing from the directly measured true positives of its actual class. Secondary outcomes were differences regarding unbalanced batches.
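Balanced multiclass accuracy, as defined above, can be computed with the following minimal Python sketch. The function name and the toy labels are illustrative assumptions. The example shows why the metric is insensitive to class prevalence: a strategy that always predicts the most prevalent class scores well on plain accuracy but poorly on the balanced metric.

```python
from collections import defaultdict


def balanced_multiclass_accuracy(truth, predicted):
    """Mean of the per-class sensitivities, each computed one-versus-all."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(truth, predicted):
        total[t] += 1
        if t == p:
            correct[t] += 1
    sensitivities = [correct[cls] / total[cls] for cls in total]
    return sum(sensitivities) / len(sensitivities)


# Toy example: eight nevi and two melanomas, with every lesion called a nevus.
truth = ["NV"] * 8 + ["MEL"] * 2
always_nv = ["NV"] * 10

plain_accuracy = sum(t == p for t, p in zip(truth, always_nv)) / len(truth)  # 0.8
balanced = balanced_multiclass_accuracy(truth, always_nv)  # (1.0 + 0.0) / 2 = 0.5
```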

Statistical analysis
We aimed to include 500 human readers in the study. We used a one-sample t test to compare human readers and algorithms and determine whether the difference in the number of correct diagnoses in batches of 30 cases was different from 0. With an SD of 15%, the study had a power of 80% to detect a difference of 1·9% in the number of correct diagnoses at α=0·05.
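The sample size reasoning can be checked approximately with a short Python sketch. It uses the normal approximation to the one-sample t test rather than the exact non-central t distribution, and it assumes that the SD (15%) and the detectable difference (1·9%) are expressed in the same units; the function names and the toy differences are hypothetical. Under these assumptions the required number of readers comes out slightly below the 500 readers targeted.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev


def required_sample_size(sd, difference, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided one-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = z.inv_cdf(power)          # quantile corresponding to the power
    return ceil(((z_alpha + z_power) * sd / difference) ** 2)


def one_sample_t(differences):
    """t statistic for testing whether the mean difference is 0."""
    n = len(differences)
    return mean(differences) / (stdev(differences) / sqrt(n))


n_readers = required_sample_size(sd=15.0, difference=1.9)

# Toy per-batch differences (algorithm minus human correct diagnoses), illustrative only.
toy_differences = [2, 1, 3, 2, 2]
t_statistic = one_sample_t(toy_differences)
```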
Because the random batch could be attempted more than once, only the first attempt was included in the analyses of two main outcomes to avoid bias. We calculated the probability of a correct diagnosis for human readers and algorithms by summing the instances of correct diagnoses per lesion and dividing this by the number of readers or number of algorithms.
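The per-lesion probability of a correct diagnosis described above amounts to a simple proportion, sketched below in Python. The data layout (a list of lesion and diagnosis pairs plus a ground truth lookup), the function name, and the toy ratings are assumptions for illustration.

```python
def probability_correct_per_lesion(ratings, ground_truth):
    """Fraction of correct diagnoses per lesion.

    ratings: iterable of (lesion_id, diagnosis) pairs, one per reader or algorithm.
    ground_truth: dict mapping lesion_id to the true disease category.
    """
    correct = {}
    total = {}
    for lesion_id, diagnosis in ratings:
        total[lesion_id] = total.get(lesion_id, 0) + 1
        is_correct = diagnosis == ground_truth[lesion_id]
        correct[lesion_id] = correct.get(lesion_id, 0) + int(is_correct)
    return {lesion_id: correct[lesion_id] / total[lesion_id] for lesion_id in total}


# Toy data: lesion "a" is a melanoma rated by four readers, lesion "b" a nevus rated by two.
ground_truth = {"a": "MEL", "b": "NV"}
ratings = [("a", "MEL"), ("a", "NV"), ("a", "MEL"), ("a", "BCC"), ("b", "NV"), ("b", "NV")]
probabilities = probability_correct_per_lesion(ratings, ground_truth)
# Lesion "a": 2 of 4 correct; lesion "b": 2 of 2 correct.
```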
The probability of correct predictions per lesion, diagnostic values, and areas under receiver operating characteristic curves were post-hoc exploratory analyses. For diagnostic values and confusion matrices, we used the majority vote of all ratings for each image. We calculated binary diagnostic values, such as sensitivity and specificity, in a one-versus-all manner. Receiver operating characteristic curves, areas under the curves, and their 95% CIs were calculated with pROC, 19 and we compared areas under the curves with the method described by DeLong and colleagues. 20 Baseline characteristics are reported as n (%) or mean and 95% CI. All p values are two-sided, and p<0·05 was regarded as significant. Bonferroni correction was used for all p values unless otherwise stated. Calculations and plotting were done with R version 3.4.0. 21
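The majority vote and the one-versus-all diagnostic values can be sketched in Python as follows. The analyses in the study were done in R; this sketch only illustrates the logic. The function names and toy votes are hypothetical, and the tie-breaking rule (the diagnosis encountered first wins) is an assumption, as the text does not specify one.

```python
from collections import Counter


def majority_vote(votes):
    """Most frequent diagnosis among all ratings of one image (first seen wins ties)."""
    return Counter(votes).most_common(1)[0][0]


def one_vs_all(truth, predicted, positive):
    """Sensitivity and specificity for one class against all other classes."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: four images, each rated by three readers.
truth = ["MEL", "MEL", "NV", "NV"]
votes_per_image = [
    ["MEL", "MEL", "NV"],  # majority: MEL (correct)
    ["NV", "NV", "MEL"],   # majority: NV (melanoma missed)
    ["NV", "MEL", "NV"],   # majority: NV (correct)
    ["NV", "NV", "BCC"],   # majority: NV (correct)
]
consensus = [majority_vote(votes) for votes in votes_per_image]
sensitivity, specificity = one_vs_all(truth, consensus, positive="MEL")
# Melanoma: 1 of 2 detected (sensitivity 0.5); no nevus called melanoma (specificity 1.0).
```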

Role of the funding source
There was no funding source for this study. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Results
Between Aug 4, 2018, and Sept 30, 2018, 951 (52·7%) of 1804 potential readers registered on the study platform finished all screening tests, and 511 (28·3%) readers from 63 countries had at least one attempt in the reader study (figure 1). 283 (55·4%) of 511 human readers were board-certified dermatologists, 118 (23·1%) were dermatology residents, and 83 (16·2%) were general practitioners. The distribution of professions in participants of the reader study was similar to that of users who finished screening but did not participate (appendix p 1). 236 (46·2%) of 511 human readers were aged between 31 and 40 years and 321 (62·8%) were female. As the number of years of experience was the most important predictor of a high score in the screening tests, human readers with more than 10 years of experience were regarded as experts.
The probability for correct diagnosis of an image increased with the number of years of experience of the human reader and depended on the image source. For experts, the highest probability of a correct diagnosis was found in the ViDIR MoleMax dataset (91·4%, 95% CI 90·1-92·7) and the lowest in the Australian dataset (60·1%, 56·0-64·1; appendix p 5). Compared with other image sets, the difference between experts and the top three algorithms was significantly lower for images that were collected from centres that did not provide images to the training set (human underperformance of 11·4%, 95% CI 9·9-12·9 vs 3·6%, 0·8-6·3; p<0·0001).

Discussion
We provide a state-of-the-art comparison of machine-learning algorithms with human readers for the diagnosis of all clinically relevant types of pigmented skin lesions using dermatoscopic images. Machine-learning algorithms outperformed human readers with respect to most outcome measures. In sets of 30 randomly selected lesions, the best machine-learning algorithms achieved a mean of 7·94 more correct diagnoses than the average human reader, and a mean of 6·65 more correct diagnoses than expert readers.
A common problem in human reader studies is the definition of experts. In a screening test, we compared the self-reported domain-specific experience of participants with their actual performance and found that self-reported years of experience reliably predicted domain-specific expertise (appendix p 2). Unlike the test sets in similar studies, 15,22,23 our test set included not only melanoma and nevi, but also non-melanocytic lesions. The primary task in our study was a multiclass problem with seven disease categories, and not just the simple binary problem of melanoma versus nevi. Therefore, our diagnostic study could be considered closer to a real-life situation than other studies in this field. Our test set is unique because of the large number of benign lesions that were not biopsied or excised. Inclusion of typical benign lesions avoids verification bias, which is a common limitation of diagnostic studies. Most benign lesions were nevi that we monitored for more than 18 months without any changes, which is as reliable a ground truth as pathological verification. The lesions were collected in two different settings: a tertiary referral centre in Europe and a skin cancer clinic in Australia. European patients are typified by a high number of nevi and a personal history of melanoma, and Australian patients by severe chronic sun damage. Human readers, including experts, achieved the lowest accuracy in the Australian dataset, which is not surprising since this dataset was more challenging and contained many equivocal lesions on chronically sun-damaged skin that were biopsied to rule out malignancy. This set also contained difficult-to-diagnose melanomas and many pigmented intraepithelial carcinomas, which were often misdiagnosed by human readers. However, the top three algorithms performed equally well across all datasets, including the Australian set, and across all diagnoses, including pigmented intraepithelial carcinomas.
Overfitting to the distribution of images in the training set might explain the superior performance of algorithms. However, overfitting would lead to lack of generalisability. We anticipated overfitting and tried to quantify it by including a set of images from sources that did not provide images for the training set. As we expected, the accuracy of the top three machine-learning algorithms was lower in the set of new lesions, but still higher than the accuracy of human experts, which was also shown previously by Han and colleagues. 14 This result indicates a potential limitation of algorithms for out-of-distribution images, which should be addressed in future research.
The low sensitivity of human experts for melanoma is striking and might be explained by the difficult test set, especially with regard to the Australian set, and by the framing of the task and presentation of images. A limitation of our study is that we did not provide additional data, for example, anatomical site, age, and sex, beyond dermatoscopic images, although these data were also lacking in the development of the algorithm. In a real-world situation, human readers would consider the variability of lesions within a given patient. This approach, which is a variant of the so-called ugly duckling rule, 24 increases sensitivity and specificity, but requires examination of the entire patient and not just single lesions. Therefore, our diagnostic study deviated from a real-world scenario and simulates a telemedical approach, which could be a future domain for machine-learning algorithms.
Another obstacle for human readers was that the lesions in the test set and training set were not standardised. The images were photographed with different devices and magnifications but, in reality, human readers could be used to a single device with fixed magnification and constant representation of colours. However, the variations in the dataset are representative of the variations observed in the field of skin imaging, which are a consequence of the high diversity of dermatoscopes and cameras, and the absence of applied standards. 25 We asked human readers to rate lesions from the training set to get used to the diversity of the test set to mitigate this effect.
Although machine-learning algorithms outperformed human experts in nearly every aspect, higher accuracy in a diagnostic study with digital images does not necessarily mean better clinical performance or patient management. 26 The metrics used in this study treated all diagnoses equally. The algorithms were trained to optimise the mean sensitivity across all classes, and did not consider that it is more detrimental to mistake a malignant for a benign lesion than vice versa. We deliberately chose a balanced metric because the test set was highly imbalanced towards nevi, and we wanted to penalise strategies that optimise accuracy by preferring predictions in favour of the most prevalent class. However, in practice, diagnosis of a melanoma as a basal cell carcinoma will be of no major clinical consequence for a patient with regard to primary diagnostic tests, because both lesions are usually excised or biopsied. Therefore, a metric that is based on the binary outcome of benign or malignant (or excise or dismiss) might be more clinically relevant. When we dichotomised the diagnostic classes into a benign and a malignant group and compared the accuracy of the majority vote of human readers with the top three algorithms, we found no difference in the area under the curve. Similar findings were reported in radiology, where so-called swarm intelligence improved the diagnostic accuracy of human readers. 27 Although the lack of superiority in melanoma sensitivity of experts compared with the average human reader was outweighed by the superiority of experts for other diagnoses, this fact deserves an explanation. We hypothesise that, given their lower level of confidence, the non-expert readers tended to give false positives for melanoma, since the cost of a false negative decision on a possible melanoma is more severe than the cost of a false positive. The expert readers, who had a higher level of confidence, preferred to use their highest likelihood prediction.
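The effect of dichotomising the seven classes can be illustrated with a brief Python sketch. The class abbreviations and the assignment of classes to the malignant group (intraepithelial carcinoma, basal cell carcinoma, and melanoma) are assumptions, because the text does not list the grouping; the toy labels are also hypothetical. The example shows that a melanoma labelled as a basal cell carcinoma is wrong under the multiclass metric but correct under the binary one.

```python
# Assumed malignant group; all other classes are treated as benign.
MALIGNANT = {"AKIEC", "BCC", "MEL"}


def to_binary(label):
    """Map one of the seven disease categories to 'malignant' or 'benign'."""
    return "malignant" if label in MALIGNANT else "benign"


# Toy ground truth and predictions for four lesions.
truth = ["MEL", "NV", "BCC", "BKL"]
predicted = ["BCC", "NV", "BCC", "MEL"]

# Multiclass: only the nevus and the basal cell carcinoma are labelled correctly.
multiclass_correct = sum(t == p for t, p in zip(truth, predicted))

# Binary: the melanoma called a basal cell carcinoma also counts as correct,
# whereas the benign keratosis called a melanoma remains an error.
binary_correct = sum(to_binary(t) == to_binary(p) for t, p in zip(truth, predicted))
```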
Our study is a simulation and deviates from a real-life setting. In a real-life setting, evaluation of skin lesions is not limited to a timeframe of 20 s and human readers might make different decisions when faced with a patient in person. In future, it is probable that automated classifiers will be used under human guidance, rather than alone. 28 Hence, it might be more appropriate to test the accuracy of automated classifying algorithms in the hands of human readers rather than to test classifiers and humans alone.

Evidence before this study
We searched the online databases Medline, arXiv, and PubMed Central using the search terms "melanoma diagnosis" or "melanoma detection" for articles published between Jan 1, 2002, and Dec 15, 2017, in English. After screening 1375 abstracts, we found 90 studies that investigated the accuracy of automated diagnostic systems for the diagnosis of melanoma. 57 studies provided enough data for a quantitative analysis and nine made direct comparisons with human experts. The summary estimate of the accuracy of machine-learning algorithms was on par with, but did not exceed, that of human experts. Many studies did not use an independent, external test set and we found no study that fully covered the heterogeneity of pigmented lesions by including all relevant types of non-melanocytic lesions. Many studies were also prone to different types of biases, including selection and verification bias, and did not use publicly available data. Most studies focused on a single machine-learning algorithm and compared it with a small number (fewer than 100) of human readers.

Added value of this study
We provide a state-of-the-art comparison of the most advanced machine-learning algorithms with a large number of human readers, including the most experienced human experts. We included all types of clinically relevant pigmented skin lesions, not only melanoma and nevi, and algorithms and humans were tested with publicly available images, including images from sites with different populations and skin types. Most algorithms were also trained with a standard image set; hence, performance should be easily reproducible by other research teams. Our results show that state-of-the-art machine-learning algorithms outperform even the most experienced human experts.

Implications of all the available evidence
The results of our study could improve the accuracy of the diagnosis of pigmented skin lesions in areas where specialist dermatological service is not readily available, and might accelerate the acceptance and implementation of automated diagnostic devices in the field of skin cancer diagnosis.

Figure captions (figures not reproduced): Error bars denote 95% CIs. Machine-learning groups were allowed up to three technically distinct test set submissions, resulting in multiple entries for some groups. The further down the y axis an algorithm is listed, the better its performance relative to humans. Blue dots indicate single human sensitivities and specificities, the purple box indicates the mean, and the error bars around the mean indicate 95% CI.