Is Extraprostatic Extension of Cancer Predictable? A Review of Predictive Tools and an External Validation based on a Large and a Single Center Cohort of Prostate Cancer Patients

Our aim was to review and externally validate all the available predictive tools (PTs) predicting EPE using the area under the curve (AUC), calibration plots and the scaled Brier score. A literature search identified 19 models predicting EPE. External validation (EV) was carried out on 6360 prostate cancer (PCa) patients who underwent RP. Most of the PTs showed poor discrimination and unsatisfactory calibration. The majority of the available PTs are not reliable for the prediction of EPE in populations other than the development one; thus, they may not be completely appropriate for patients’ counselling or for surgical strategy preplanning.


A C C E P T E D M
A N U S C R I P T

Introduction
Prostate cancer (PCa) represents a major health concern for men. International guidelines recommend radical prostatectomy (RP) for localized PCa patients <65 years old with a life expectancy >10 years 1,2. Erectile dysfunction is a potential drawback of RP, which therefore involves a trade-off between oncological safety and functional outcomes 3. In 1983, Walsh introduced the nerve-sparing RP (NSRP) to improve post-operative erectile function 4.
The American Urological Association (AUA) and the European Association of Urology (EAU) guidelines emphasize the value of NSRP for localized PCa patients seeking post-operative potency 1,2.
NSRP may lead to an increased incidence of positive surgical margins (PSM) and subsequent biochemical recurrence 5,6. Thus, prediction of extraprostatic extension (EPE) of PCa is the cornerstone for determining patients' eligibility for NSRP 7. EPE at final pathology is found in approximately 20% of men with clinically localized PCa 8. Pre-surgical planning has been increasingly performed using predictive tools (PTs) based on common clinical-pathological features 4,7,9-27, and the EAU guidelines recommend referral to externally validated PTs to select patients for NSRP 2. Moreover, some of those models have user-friendly web access, and patients can easily consult them. However, there is an ominous gap between their potential and actual predictive performance in clinical practice 28, because of their probably optimistic performance during development and the lack of high-quality external validation (EV) studies 29.
The aim of our study is to provide an accurate EV of the available PTs of EPE on a large cohort of patients.


Reporting
The EV was performed according to the TRIPOD statement 30 .

Patient population
Data of 6360 patients who underwent robotic-assisted prostatectomy (RALP) between 2008 and 2016 at the Global Robotics Institute of Celebration (FL, USA) were used as the validation dataset.

Surgical technique
All the procedures were performed by a single surgeon (VP) using the Da Vinci Surgical System, as previously described 31 .

Preoperative clinical variables analyzed
Preoperative clinical variables included patient's age, body mass index, total prostate specific antigen (PSA) level, PSA density, prostate volume, and clinical stage (American Joint Committee on Cancer (AJCC) TNM staging 1992/2002) 32. Moreover, a side-specific clinical T-stage was determined by analyzing 11,794 prostatic lobes (6,360 patients). For example, when a patient was assigned to cT2a, the abnormally palpable lobe was considered to be stage T2a while the normal lobe was assigned to stage T1c. On the other hand, a patient with abnormally palpable tumor on both sides was considered to have cT2c in each lobe 15,18.

Pathological analysis of prostate biopsy cores
Biopsy variables considered for each lobe were total number of cores, Gleason score, and the number of positive cores. Moreover, the percentage of positive cores and maximum percentage of cancer were considered.

Pathologic analysis of prostate specimen
Pathological analysis of specimens was described before 33 and includes: 1) The pathological T-stage (AJCC TNM Staging, 1992/2002) 32 .
3) PSM: the presence of carcinoma on the prostatic-inked surface.

Statistical analysis
Receiver operating characteristics (ROC) curves were calculated to assess the ability of the prediction models to discriminate between patients with or without EPE. The area under the ROC curve (AUC) with 95% confidence interval (CI) was estimated. AUC ranges between 0.5 and 1; a value of 0.5 indicates no discrimination, 0.5 < AUC < 0.7 poor discrimination, 0.7 ≤ AUC < 0.8 acceptable discrimination, 0.8 ≤ AUC < 0.9 excellent discrimination, 0.9 ≤ AUC < 1 outstanding discrimination, and 1 indicates perfect discrimination 35.
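As an illustration of the discrimination measure described above, the following minimal Python sketch computes the AUC as the proportion of (EPE, non-EPE) patient pairs in which the EPE patient receives the higher predicted risk, and maps the value to the categories listed in the text. The function names are hypothetical and the pairwise formula is a standard equivalent of the area under the ROC curve; it is not necessarily the routine used in the original R analysis, and no confidence interval is computed here.

```python
def auc_pairwise(y_true, y_score):
    """AUC as the fraction of (event, non-event) pairs ranked correctly.

    y_true: 1 for EPE, 0 for no EPE; y_score: predicted probabilities.
    Ties between an event and a non-event count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def discrimination_label(auc):
    """Map an AUC value to the categories quoted in the text."""
    if auc >= 1.0:
        return "perfect"
    if auc >= 0.9:
        return "outstanding"
    if auc >= 0.8:
        return "excellent"
    if auc >= 0.7:
        return "acceptable"
    if auc > 0.5:
        return "poor"
    return "none"


# Toy data (illustrative only, not from the study cohort)
toy_auc = auc_pairwise([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80])
toy_label = discrimination_label(toy_auc)
```

The double loop is quadratic in the number of patients, which is acceptable for a sketch; a rank-based implementation would be preferable for very large cohorts.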
Calibration of the model was investigated to show the relationship between model-predicted and observed rates of EPE. Agreement between predicted and actual probabilities was assessed graphically by plotting the LOESS-smoothed calibration curve together with the 45° line of perfect calibration. Deviations from the ideal line were characterized by estimating the intercept and slope of the line approximating the calibration curve 36. Furthermore, the estimated calibration index (ECI) was calculated to compare the calibration of the different PTs, with 0 representing perfect calibration 37.
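One common way to obtain a calibration intercept and slope is to regress the observed outcome on the logit of the predicted probability with a logistic model; a slope below 1 then suggests predictions that are too extreme, and an intercept different from 0 suggests systematic over- or underestimation. The sketch below implements this with a plain Newton-Raphson fit in Python. It is an assumption that this matches the procedure used in the study, which describes a line approximating the smoothed calibration curve; the function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def calibration_intercept_slope(y_true, p_pred, n_iter=50, tol=1e-10):
    """Fit P(y=1) = sigmoid(a + b * logit(p_pred)) by Newton-Raphson.

    Returns (a, b): a is the calibration intercept, b the calibration slope.
    Perfect calibration corresponds to a = 0 and b = 1.
    """
    x = [logit(p) for p in p_pred]
    a, b = 0.0, 1.0  # start from the line of perfect calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            r = yi - mu
            # gradient of the log-likelihood
            g0 += r
            g1 += r * xi
            # information matrix (negative Hessian)
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("predictions must take at least two distinct values")
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b


# Toy data (illustrative only): predictions of 0.1 and 0.9 for groups whose
# observed event rates are 0.2 and 0.8, i.e. predictions that are too extreme.
toy_y = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
toy_p = [0.1] * 10 + [0.9] * 10
toy_a, toy_b = calibration_intercept_slope(toy_y, toy_p)
```

In this toy example the fitted slope is below 1, reflecting predictions that are more extreme than the observed rates.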
The Brier score is the average squared difference between the actual outcomes and the predicted probabilities. When the scaled Brier score (SBS) is negative or close to zero, the overall predictive ability of the model is worse than or similar to that of a non-informative model; when SBS is 1, the model returns a perfect prediction 36.
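The overall performance measure can be sketched in a few lines of Python. The scaling used here, SBS = 1 - BS / BS_max with BS_max = p̄(1 - p̄), is the usual definition of the scaled Brier score; taking p̄ as the observed event rate in the validation data is an assumption of this sketch, and the function names are hypothetical.

```python
def brier_score(y_true, p_pred):
    """Average squared difference between outcomes (0/1) and predicted probabilities."""
    n = len(y_true)
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / n


def scaled_brier_score(y_true, p_pred):
    """1 - BS / BS_max, where BS_max is the Brier score of a non-informative
    model that predicts the average outcome probability for every patient."""
    p_bar = sum(y_true) / len(y_true)  # assumed: observed event rate
    bs_max = p_bar * (1.0 - p_bar)
    return 1.0 - brier_score(y_true, p_pred) / bs_max


# Toy data (illustrative only)
toy_outcomes = [1, 0, 0, 0]
sbs_noninformative = scaled_brier_score(toy_outcomes, [0.25, 0.25, 0.25, 0.25])
sbs_perfect = scaled_brier_score(toy_outcomes, [1.0, 0.0, 0.0, 0.0])
```

A model that predicts the event rate for everyone obtains an SBS of 0, a perfect model obtains 1, and a model worse than the non-informative one obtains a negative value.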
As regards the calculation of predictions, when coefficients of logistic regression were available, predicted probabilities were calculated using the formula associated to this model.
However, when only a nomogram was given, the image was digitized, the coefficients of the linear functions were estimated, the single scores were added up, the inverse logit (logistic) function was applied and finally the predicted probabilities were calculated.
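The calculation of a predicted probability from logistic regression coefficients can be sketched as follows. The coefficient values and covariate names in the example are purely hypothetical and do not correspond to any of the published PTs; for a digitized nomogram, the same final step would apply once the summed score has been converted to a linear predictor.

```python
import math


def predicted_probability(intercept, coefficients, covariates):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_i b_i * x_i))).

    coefficients and covariates are dicts keyed by variable name.
    """
    linear_predictor = intercept + sum(
        coefficients[name] * covariates[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients, for illustration only (not from any published tool)
example_coefficients = {"psa": 0.05, "gleason": 0.40}
example_patient = {"psa": 10.0, "gleason": 7.0}
example_risk = predicted_probability(-3.0, example_coefficients, example_patient)
```

Here the linear predictor is -3.0 + 0.05 × 10 + 0.40 × 7 = 0.3, which the logistic function maps to a probability slightly above 0.5.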
For each PT, a comparison between the distribution of patients' characteristics in development and EV datasets (Supplementary Table 2) was performed using the two-sample test of proportions or Pearson's chi-squared test (categorical variables) and the two-sample t-test (numerical variables). The comparability of EV and development populations was also assessed using the standardized difference 38, which is defined for dichotomous variables as $d = (\hat{p}_{EV} - \hat{p}_{dev}) / \sqrt{[\hat{p}_{EV}(1-\hat{p}_{EV}) + \hat{p}_{dev}(1-\hat{p}_{dev})]/2}$, where $\hat{p}_{EV}$ and $\hat{p}_{dev}$ denote the prevalence in EV and development populations, respectively; for continuous variables it is defined as $d = (\bar{x}_{EV} - \bar{x}_{dev}) / \sqrt{(s_{EV}^2 + s_{dev}^2)/2}$, where $\bar{x}_{EV}$, $\bar{x}_{dev}$ and $s_{EV}$, $s_{dev}$ denote the mean and standard deviation in EV and development populations, respectively. For categorical variables with k>2 categories, the maximum of the k standardized differences was reported. Values of $d$ in the range -0.1 ≤ d ≤ 0.1 can be considered a sign of good balance between variable distributions in the two populations 38.
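The standardized difference can be computed directly from summary statistics, which is convenient when only the published characteristics of a development cohort are available. The Python sketch below uses the usual pooled-variance definitions for dichotomous and continuous variables; the function names and the example values are hypothetical.

```python
import math


def std_diff_binary(p_ev, p_dev):
    """Standardized difference between two prevalences (proportions in [0, 1])."""
    pooled_var = (p_ev * (1.0 - p_ev) + p_dev * (1.0 - p_dev)) / 2.0
    return (p_ev - p_dev) / math.sqrt(pooled_var)


def std_diff_continuous(mean_ev, sd_ev, mean_dev, sd_dev):
    """Standardized difference between two means, using the pooled standard deviation."""
    pooled_sd = math.sqrt((sd_ev ** 2 + sd_dev ** 2) / 2.0)
    return (mean_ev - mean_dev) / pooled_sd


def is_balanced(d, threshold=0.1):
    """Good balance when -threshold <= d <= threshold."""
    return abs(d) <= threshold


# Illustrative values only (not taken from Supplementary Table 2)
d_age = std_diff_continuous(62.0, 7.0, 60.0, 7.0)  # mean age, EV vs development
d_prev = std_diff_binary(0.215, 0.20)              # prevalence, EV vs development
```

Unlike a p-value, the standardized difference does not depend on sample size, which makes it useful for comparing a very large validation cohort with smaller development cohorts.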
Regarding missing values, no imputation method was used and a complete-case analysis was performed. All analyses were performed using R software (version 3.4.3; R Development Core Team, Vienna, Austria).


Search results
The search identified 748 manuscripts. The selection process consisted of two phases: (1) an initial screening of titles and abstracts to exclude irrelevant articles, which resulted in the exclusion of 674 articles, and (2) a full-text review of the remaining manuscripts (74 articles), which resulted in the exclusion of 55 more manuscripts for the reasons reported. Overall, our search identified 19 manuscripts describing different EPE predictive tools, accounting for a total of 44901 patients. Supplementary Figure 2 shows a detailed analysis of the search process with reasons for exclusion.
The sample size of the included studies ranged from 96 12 to 5,730 19,21. In terms of pathological staging, organ-confined disease ranged from 54% 12 to 80% 16,26. Several predictive variables were used, but only PSA level and Gleason score were considered by all the authors. Supplementary Table 1 shows all the covariates used and the number of studies integrating them.
In 15/19 studies (79%), the internal AUC was reported by the authors ranging from 0.420 23 to 0.856 12 and from 0.777 4 to 0.840 18 in the PTs developed to predict pT3a and wEPE, respectively.


External validation
The characteristics of patients in the validation cohort are summarized in Table 2. The analysis of the prostatectomy specimen revealed that 1,365 (21.5%) and 1,803 (28.4%) patients had pT3a and wEPE, respectively. The inclusion and exclusion criteria of each predictive tool were respected, and the total number of patients used for EV for each of them is reported in Table 1. The degree of balance for each covariate in the validation and derivation cohorts is summarized in Supplementary Table 2.
As for discrimination, when considering the event that each model had been developed to predict, the AUC at EV ranged from 0.610 to 0.801. The nomogram developed by Ohori et al 15 showed the highest AUC, similar to the one reported by the authors (0.801 versus 0.806, respectively). Moreover, Tsuzuki 17 (AUC 0.787), Satake 22 (AUC 0.783), Chung 4 (AUC 0.772) and Jeong 24 (AUC 0.715) were among the top five models as regards the discrimination of the event they intended to predict, even though they fall within the "acceptable" range of predictive performance 35.
We then tested each nomogram, independently of what it was originally developed for, to predict both wEPE and pT3a status. Interestingly, all the PTs showed better discrimination for wEPE than for pT3a status. Figure 1 summarizes the discriminative performance of all the models when predicting wEPE and pT3a.
As regards calibration, Supplementary Figures 3-23 show the curves for all the PTs except for Tsuzuki et al. 17, because it lacks some essential data for the calculation of predicted probabilities. Most of the PTs showed poor calibration with a tendency towards overestimation. Regarding the ECI, Tosoian 26 (0.148), Naito 20 (0.184), Chung 4 (0.294), Egawa 12 (0.314), and Ohori 15 (0.405) showed the best calibration considering the event they were developed to predict.
The popularity of the PTs was assessed based on their number of citations in Google Scholar per year (total number of citations/number of years since publication), in order to give more realistic information about their popularity (Table 1). The most cited PTs are Partin 1997 11, Partin 1993 9, and Partin 2001 14. The popularity of those PTs does not seem to be related to their predictive performance on an external cohort.

Discussion
EPE prediction is crucial for surgical planning, as the localization and quantification of a possible EPE could allow the surgical approach to be tailored to the cancer's characteristics.
Although multiparametric Magnetic Resonance Imaging (mpMRI) has gained great acceptance as a useful diagnostic and staging tool for PCa 27, its sensitivity in predicting EPE appears to be low (0.57) 39. Furthermore, the incremental value of adding mpMRI parameters to the currently available PTs is debatable 40,41. On the other hand, the standardized use of intra-operative frozen section has been proposed for NSRP; however, its role is still debatable and it has not yet gained widespread popularity in clinical practice. Furthermore, visual and tactile assessment during surgery is only partially reliable and not reproducible 17,42,43.
After Partin's innovative idea to create a statistical tool to predict pathological stage 9, several authors developed other PTs 4, 7, 9-26; however, the majority lack appropriate EV.

Independent EV studies are uncommon, with only a 16% probability for a PT to be externally validated by different authors within 5 years of development 29. Most of the published EV studies are based on small sample sizes and are therefore not reliable; an EV should include a minimum of 100 events and 100 non-events 28.
It is noteworthy that EPE is not only prone to diagnostic pitfalls and interobserver variability, but is also characterized by heterogeneous definitions, often elusive and equivocal. In this study, we included fourteen papers considering EPE as pure pT3a disease 9-14, 16, 19-21, 23-26, whereas the remaining ones looked for the global presence of disease outside the prostate regardless of SVI (wEPE) 4,15,17,18,22.
In this setting, a large cohort was used to perform the EV of all the available PTs of EPE published since 1993 considering both definitions, and all of them have been externally validated for the prediction of both events (pT3a and wEPE).
As far as methodology is concerned, AUC is commonly used due to its user-friendly output 35. Considering the EV, Ohori's 15 PT was the only model with a discriminative performance exceeding 0.8 (AUC 0.801). This supports the results of Clement 44, who compared the performance of Ohori's 15 and Steuber's 18 PTs, reporting an AUC of 0.80 and 0.78, respectively. Seven models reached an "acceptable" AUC 4, 10, 17-19, 22, 24-26, whereas the discrimination of the remaining 11 models was "poor".

The different versions of Partin Tables 9,11,14,19,21,25,26 showed worse discriminative performance in our EV than in the original derivation cohorts.Despite the highest popularity and the good internal AUC (0.818), the discriminative ability of Partin Table 1997 in the current EV was poor (AUC 0.675).
These findings are consistent with those from previous EV studies, which showed that the performance of the Partin Tables seems to worsen when applied to different populations. The transportability of such models to other geographical areas (Sweden, UK, France, Italy and Austria) showed poor performance 45-48. On the contrary, some authors showed acceptable discriminative performance in German and North American patients 49,50.
Despite the clinical importance of PT calibration 51, our study showed poor calibration with a tendency towards overestimation of the EPE risk in most of the predictive models, consistent with other EV studies 46,48. In this setting, recalibration can be considered for poorly calibrated PTs (regardless of their predictive performance) before their introduction into clinical practice 51.
As far as the overall performance is concerned, all the PTs exhibited moderate performance on SBS, with Chung 4 providing the best predictive performance (SBS = 0.204) for the event it was developed to predict (wEPE).
These differences of the PTs performances between the development and the validation studies may be explained by the temporal, geographical and domain differences, which in turn may affect the measurement of variables and outcomes, the case-mix (like age and tumor characteristics) and the sample size 28 .
Interestingly, when considering the different definitions of EPE, all the models had a worse discrimination capability for the prediction of pure pT3a status than for the prediction of wEPE, including those developed specifically to predict pure pT3a status. There is no clear explanation for this; however, it can be speculated that wEPE status is easier to predict because it also includes patients with SVI, who probably have a higher burden of concomitant EPE, which is usually easier to detect 7.
Interestingly, only four models provide a side-specific risk of EPE to aid the surgical decision between unilateral and bilateral NSRP 15,17,18,22. It is noteworthy that, beyond the prediction of the presence or of the laterality of EPE, none of the PTs estimates the amount of disease outside the prostatic capsule or provides a decision rule to grade the preservation of the NVB. More recently, the PRECE tool 7 was specifically developed for this purpose, providing a decision rule to grade the dissection. However, it has not been validated in this study since it had been developed using the present cohort. To our knowledge, the PRECE tool 7 has not been externally validated yet.
Finally, the primary goal of the current study was to provide a broader picture of the currently available PTs and their performance using a single large cohort of PCa patients, which might help to identify the limitations and challenges encountered in developing new PTs in the contemporary era. This EV study showed that the discriminative performance of most of the included PTs ranged from poor to acceptable, with poor calibration, suggesting that new clinical, pathological and radiological predictors might be integrated in the development of new PTs to improve their performance. In this setting, mpMRI could have the potential to improve the performance of predictive tools 41,52. Furthermore, some authors suggested that 68Gallium-prostate specific membrane antigen-positron emission tomography/CT or MRI (68Ga-PSMA-PET/CT or MRI) has good potential for the preoperative prediction of EPE 53,54; thus, the identification of 68Ga-PSMA-PET/CT variables may improve the predictive performance of PTs. Moreover, Dean et al 55 demonstrated that the quantification of Gleason pattern 4 (total length of Gleason pattern 4 across all cores) is an independent predictor of pathological adverse events after RP, suggesting that its addition to PTs may improve their predictive performance.

Strengths:
• The simultaneous EV of all the PTs of EPE published in the last 25 years.
• Evaluation of the models' performance using discrimination, calibration and SBS.
• Large and heterogeneous cohort of patients as validation dataset.

Limitations:
• Our PCa cases are representative of patients mostly coming from the USA; however, it remains uncertain whether they are representative of patients from different populations.
• Absence of central review of the biopsy specimen and central PSA measurement, since the cohort comes from a large referral center.
• The use of a single center cohort of patients may limit the generalizability of our results; however, it should be remarked that, as this is a large referral center, patients come from different parts of the world.

Conclusion
To the authors' knowledge, this is the largest cohort of patients used to externally validate the available PTs of EPE developed since 1993. Although the included PTs may have shown acceptable results based on the variables used and the time when they were developed, the current EV study raises several concerns about the limitations of almost all the existing PTs of EPE in contemporary settings. In the era of precise and personalized medicine, surgeons have to consider these limitations when using these PTs to plan an NSRP, and more reliable covariates need to be identified to improve the predictive performance of newly developed PTs; the inclusion of mpMRI variables could be a future perspective to address the problem of EPE prediction and surgical management.

TAKE HOME MESSAGE:
Although the use of predictive models is widespread and recommended by most of the international guidelines, surgeons have to be aware of their moderate to poor predictive performance for pT3a disease and of the consequent risk of applying those decision-making tools in a population other than the development one.

According to the definitions found in the literature, two distinct definitions were considered for EPE (Supplementary Figure 1 illustrates the difference):
• pT3a: the presence of tumor beyond the confines of the prostate without invasion of the seminal vesicles.
• Whole EPE (wEPE): the presence of tumor beyond the confines of the prostate regardless of the status of the seminal vesicles.
The Brier score is computed as $BS = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{p}_i)^2$, where $y_i$ is the actual outcome and $\hat{p}_i$ the predicted probability for patient $i$. It is a measure of overall performance because it can be decomposed into two components: the first related to calibration and the second related to discrimination. For convenience, the scaled Brier score (SBS) was reported in the study: $SBS = 1 - BS/BS_{max}$, where $BS_{max} = \bar{p}(1 - \bar{p})$ and $\bar{p}$ indicates the average probability of the outcome.

TABLE 1 :
Summary of the characteristics of the included predictive tools with the results of the external validation (AUC and

TABLE 2 :
Summary of the external validation patients' characteristics