Enhancing PFI Prediction with GDS-MIL: A Graph-based Dual Stream MIL Approach

Abstract. Whole-Slide Images (WSIs) are emerging as a promising resource for studying biological tissues, demonstrating great potential in aiding cancer diagnosis and improving patient treatment. However, the manual pixel-level annotation of WSIs is extremely time-consuming and practically unfeasible in real-world scenarios. Multi-Instance Learning (MIL) has gained attention as a weakly supervised approach able to address tasks that lack annotations. MIL models aggregate patches (e.g., crops of a WSI) into bag-level representations (e.g., the WSI label), but neglect the spatial information of the WSIs, which is crucial for histological analysis. In the High-Grade Serous Ovarian Cancer (HGSOC) context, spatial information is essential to predict a prognosis indicator (the Platinum-Free Interval, PFI) from WSIs. Such a prediction would bring highly valuable insights both for patient treatment and for the prognosis of chemotherapy resistance. Indeed, NeoAdjuvant ChemoTherapy (NACT) induces changes in tumor tissue morphology and composition, making the prediction of PFI from WSIs extremely challenging. In this paper, we propose GDS-MIL, a method that integrates a state-of-the-art MIL model with a Graph ATtention layer (GAT in short) to inject local context into each instance before MIL aggregation. Our approach achieves a significant improvement in accuracy on the “Ome18” PFI dataset. In summary, this paper presents a novel solution for enhancing PFI prediction in HGSOC, with the potential of significantly improving treatment decisions and patient outcomes.


Introduction
High-Grade Serous Ovarian Cancer (HGSOC) is a form of ovarian cancer characterized by multiple treatment recurrences with variable response to platinum-based chemotherapy. The prediction of the Platinum-Free Interval (PFI), defined as the time interval between the end of chemotherapy and disease recurrence [27], is a determining factor for treatment planning and is usually performed by analyzing histological tissue digitized as Whole-Slide Images (WSIs). Unfortunately, NeoAdjuvant ChemoTherapy (NACT), recommended for HGSOC patients who are ineligible for Primary Debulking Surgery (PDS) [10,22], causes strong and variable changes and heterogeneity in tumor morphology and composition, making the prediction of PFI from WSIs extremely challenging.
Whole-slide imaging has emerged in recent years as a promising technology to enable the digitization and analysis of tissue sections [12]. The creation of multi-resolution gigapixel WSIs provides the opportunity to develop novel diagnostic tools for treatment and monitoring [6,25]. However, the manual pixel-level annotation of WSIs is a time-consuming and labor-intensive task. As an alternative, WSIs are often labelled with metadata (e.g., genetic or other molecular features) characterizing the disease. In addition, because of their gigapixel size, WSIs are usually cropped into patches before being fed into a deep learning model. Given all these conditions, Convolutional Neural Networks (CNNs), which have provided remarkable results for a multitude of tasks [13,16,21,31], cannot be directly applied to such data.
Consequently, Multi-Instance Learning (MIL) methods have gained considerable attention in WSI analysis [15,32], avoiding the need for pixel-level annotations. MIL is a weakly supervised learning approach used to assign a label to a set (or bag) composed of unlabelled instances. The label of the bag (e.g., a WSI) is determined by the presence or absence of at least one positive instance (e.g., a patch containing tumour), so it is generally assumed that negative bags only contain negative instances (e.g., patches not containing tumour), while positive bags contain at least one positive instance. When dealing with histological images, such an assumption may not be sufficient, and Attention-Based MIL (AB-MIL) [9,2] should be employed to improve patch aggregation [32,35]. However, AB-MIL approaches do not exploit any spatial dependency between instances, which may be crucial in some applications [7]. While some tasks can rely solely on morphology analysis (e.g., tumor detection), others would benefit from a more comprehensive tissue analysis. An example of such a task is the aforementioned prediction of PFI on chemotherapy-treated HGSOC tissue.
This paper proposes GDS-MIL, which integrates a state-of-the-art MIL model with Graph Neural Networks (GNNs) to better contextualize local patch interactions. Specifically, we use Graph ATtention networks (GATs) [33] to capture the spatial relationships between instances before MIL aggregation, introducing a local context into each instance. This approach has shown promising results, achieving a significant improvement on the "Ome18" PFI dataset. Our study provides a novel solution to improve the accuracy of PFI prediction in HGSOC, which could ultimately lead to better treatment decisions and improved patient outcomes [27].

Related Works
In this section, we briefly review recent developments in MIL models, as well as relevant studies that employ MIL for WSI analysis, and existing strategies for PFI prediction.

Multi-Instance Learning for WSI Analysis
Consider a bag $X_{bag}$ composed of a set of $N$ feature vectors: $X_{bag} = \{x_1, \dots, x_N\}$. Each instance $x_i \in X_{bag}$ can be assigned to a class through a mapping $f : X_{bag} \rightarrow \{0, 1\}$, where the negative and positive classes correspond to 0 and 1, respectively. While traditional MIL approaches rely on simple aggregators like mean-pooling and max-pooling [8,24], recent studies have shown that there may be benefits in parameterizing the aggregation operator with neural networks [17,23]. Attention-Based MIL (AB-MIL) [9] employs a side-branch network to calculate attention scores. Similarly, in [37], Zhang et al. apply an attention mechanism to support a double-tier feature distillation approach, where relevant features are distilled from pseudo-bags to the WSI using either "MaxMin" or Aggregated Feature Selection (AFS). Another approach, DS-MIL [15], applies non-local attention aggregation to measure the distance from the most relevant patch. Lu et al. [18] proposed an algorithm that applies a clustering loss to single or multiple branches (CLAM-SB and CLAM-MB), a variant of the classic AB-MIL. Shao et al. [28], instead, employ a transformer architecture named TransMIL.
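To make the difference between these aggregators concrete, the following minimal sketch compares max-pooling, mean-pooling, and a softmax-weighted (attention-style) pooling on one bag. The per-patch scores and attention logits are hypothetical values chosen for illustration; in AB-MIL the logits would come from a learned side-branch network, which is not reproduced here.

```python
import math

def max_pooling(scores):
    """Bag score decided by the single strongest instance."""
    return max(scores)

def mean_pooling(scores):
    """Bag score as the plain average of all instance scores."""
    return sum(scores) / len(scores)

def attention_pooling(scores, logits):
    """Softmax-weighted average; `logits` stand in for learned attention scores."""
    top = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, scores))

# Hypothetical bag of four patches (values are illustrative, not from the paper).
instance_scores = [0.1, 0.2, 0.9, 0.3]
attention_logits = [0.0, 0.0, 2.0, 0.0]

bag_max = max_pooling(instance_scores)
bag_mean = mean_pooling(instance_scores)
bag_att = attention_pooling(instance_scores, attention_logits)
print(bag_max, bag_mean, bag_att)
```

In this toy bag the attention-weighted score falls between the mean and the maximum, because the weights emphasize the high-scoring patch without discarding the others.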

PFI Prediction
A few algorithms for automatic PFI prediction have been proposed in the literature. Both Yu et al. [36] and Laury et al. [14] use pixel-level annotated WSIs for their studies. Yu et al. propose a method based on a VGG [29], using portions of WSIs for regression analysis aimed at PFI prediction, while Laury et al. develop a method based on multiple neural networks used in series, i.e., the output of the first becomes the input of the following network, after human-supervised rearrangements. The final aggregation is based on the ratio between digital biomarkers associated with a poor or good prognosis. Their approach employs WSIs of treatment-naïve HGSOC. Only tumoral areas are analyzed for the PFI prediction, exploiting pixel-level annotations for the segmentation. Moreover, because the method focuses on treatment-naïve patients, the tumor tissue presents a higher homogeneity in its morphology and texture than tissues undergoing treatment.
Instead, our approach focuses on patients with HGSOC who underwent NACT. Therefore, the WSIs analyzed in this paper are characterized by unique morphological characteristics resulting from the treatment effects. Furthermore, to better understand the effects of the treatment, our method analyzes different tissues and compartments in the WSI (e.g., tumor, stroma, inflammatory cells, etc.), and not only tumoral areas, increasing data heterogeneity.
Finally, our method does not require pixel-level annotations to predict the PFI score, relying only on the global label. To achieve this goal, a graph attention layer has been incorporated into the model to analyze tissue as a complex system composed of multiple interconnected parts.

Model
In this study, we propose the use of a GAT to contextualize instances (WSI patches) through local interaction before MIL aggregation. Fig. 1 summarizes the key elements of the proposed method, which are detailed in the remainder of this section.

Graph Integration
Given the data as a set of instances $x^{ins}_i \in X_{bag}$, where $x^{ins}_i$ represents a patch extracted from $X_{bag}$ (i.e., the entire WSI), and a self-supervised feature extractor $f$, informative and discriminative embeddings are obtained as $E^{ins}_i = f(x^{ins}_i)$. Each embedding contains important local information inside the patch (e.g., representing the morphology). In order to also capture the micro and macro interactions between instances, we apply a GNN $G$ [11,26,34], implemented with GATs. Given an adjacency matrix $A$ built from the spatial coordinates of the instances (e.g., each patch is connected to at most its 8 closest neighbors), a more contextualized instance representation is obtained as $E_{bag} \leftarrow G(E_{bag}, A)$.

Graph Attention Layer
The GAT applies masked attention to each instance $E^{ins}_i \in E_{bag}$ and its neighborhood $E^{ins}_j \in N_i$. The neighborhood of each instance is given by the adjacency matrix $A$. First, each instance is processed with a shared weight matrix $W$ as $H^{ins} = W(E^{ins})$. The instance interaction is measured by a coefficient $\alpha_{ij}$ which, as in the standard GAT formulation [33], is computed as:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}\left[H^{ins}_i \,\|\, H^{ins}_j\right]\right)\right)}{\sum_{l \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^{\top}\left[H^{ins}_i \,\|\, H^{ins}_l\right]\right)\right)}$$

where $a \in \mathbb{R}^{2F}$ is a single-layer feedforward neural network and $\|$ is the concatenation operator. A multi-head attention produces a new instance representation as the average of the linear combinations of the neighborhood among each head $k \in K$:

$$E^{ins}_i = \sigma\left(\frac{1}{|K|} \sum_{k \in K} \sum_{j \in N_i} \alpha^{k}_{ij} H^{ins,k}_j\right)$$

where $\sigma$ is a softmax operation.
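The two steps above (building the neighborhood from patch coordinates, then computing masked attention over it) can be sketched in plain Python. This is a simplified single-head illustration under several assumptions not stated in the text: patches lie on an integer grid, a patch is not its own neighbor, isolated patches keep their projected feature, and the final softmax and the averaging over heads are omitted. All function names and toy values are hypothetical.

```python
import math

def grid_neighbors(coords):
    """Link each patch to the patches in the 8 surrounding grid cells (at most 8)."""
    index = {rc: i for i, rc in enumerate(coords)}
    neighbors = []
    for r, c in coords:
        neighbors.append([
            index[(r + dr, c + dc)]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and (r + dr, c + dc) in index
        ])
    return neighbors

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_head(E, neighbors, W, a):
    """One attention head: returns per-instance coefficients and aggregated features."""
    H = [matvec(W, e) for e in E]           # shared linear projection
    alphas, out = [], []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:                        # assumption: isolated patch keeps H[i]
            alphas.append({})
            out.append(list(H[i]))
            continue
        # a^T [H_i || H_j]; list addition concatenates the two vectors
        logits = [leaky_relu(sum(p * q for p, q in zip(a, H[i] + H[j]))) for j in nbrs]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        norm = sum(exps)
        alpha = {j: e / norm for j, e in zip(nbrs, exps)}  # softmax over the neighborhood
        alphas.append(alpha)
        out.append([sum(alpha[j] * H[j][d] for j in nbrs) for d in range(len(H[i]))])
    return alphas, out

# Toy example: a 2x2 block of patches plus one isolated patch.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]                # identity projection, F = 2
a = [0.5, -0.5, 1.0, 0.0]                   # attention vector of length 2F
neighbors = grid_neighbors(coords)
alphas, contextualized = gat_head(E, neighbors, W, a)
```

A multi-head variant would run `gat_head` once per head with separate `W` and `a`, then average the outputs element-wise.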

Bag-level Representation
Taking inspiration from DS-MIL [15], the bag representation is built through a dual-stream approach. In particular, starting from the graph output $E^{ins}_i \in E_{bag}$, a first patch classifier $f_{patch}$ is used to identify the most critical patch instance as $E^{ins}_{crit} = \operatorname{argmax}_{E^{ins}_i \in E_{bag}} f_{patch}(E^{ins}_i)$. Given the most relevant instance, $E^{ins}_{crit}$, and a linear layer $U$, an attention score is built for each instance $E^{ins}_i$ from its similarity with $E^{ins}_{crit}$. The bag label is then obtained by applying a classifier $W_{CLS}$ to the bag embedding, which is built from these attention scores and a second linear layer $V$.

Experimental Setup

Dataset
The dataset is composed of 176 omentum-tissue WSIs [20] belonging to 77 different HGSOC patients who underwent NACT. The staining procedure used for the WSIs was Hematoxylin and Eosin (HE) [19]. Images were scanned by a Pannoramic SCAN 150 with a resolution of 0.22 µm/pixel at 40× resolution. Each WSI is assigned a label based on the patient's PFI: 99 scans have a low PFI (≤ 6 months), indicating a poor prognosis, while the other 77 scans have a high PFI (≥ 12 months). The dataset is split into 4 folds in order to perform cross-validation. For each split, a balance between low and high PFI was respected. We also ensured that WSIs from the same patient were not mixed between training and test sets.

Fig. 2: Example of segmentation masks generated by the pre-processing algorithm. Green contours identify the considered tissue; blue ones are holes that the algorithm will discard. The procedure allows for filtering out background, fat, and blood.
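The patient-level split described in the dataset paragraph can be sketched as a round-robin assignment of whole patients to folds, done separately for each class so that low- and high-PFI cases stay balanced. This is one possible procedure, not necessarily the one used by the authors; the function name and the toy slide list are hypothetical.

```python
from collections import defaultdict

def patient_level_folds(slides, n_folds=4):
    """slides: list of (slide_id, patient_id, label). Returns {slide_id: fold index}."""
    slides_of = defaultdict(list)
    label_of = {}
    for slide_id, patient_id, label in slides:
        slides_of[patient_id].append(slide_id)
        label_of[patient_id] = label        # PFI is a patient-level label
    fold_of = {}
    next_slot = defaultdict(int)            # one round-robin counter per class
    # Patients with more slides are placed first to keep fold sizes roughly even.
    for pid in sorted(slides_of, key=lambda p: (-len(slides_of[p]), p)):
        fold = next_slot[label_of[pid]] % n_folds
        next_slot[label_of[pid]] += 1
        for slide_id in slides_of[pid]:     # all slides of a patient share one fold
            fold_of[slide_id] = fold
    return fold_of

# Toy data: 8 hypothetical patients with 1 to 3 slides each, alternating labels.
slides = [(f"s{p}_{k}", f"p{p}", p % 2) for p in range(8) for k in range(1 + p % 3)]
fold_of = patient_level_folds(slides)
```

Because a fold is chosen per patient rather than per slide, no patient can contribute slides to both the training and the test side of any split.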

Pre-processing
The state-of-the-art CLAM [18] framework has been employed to crop each WSI into multiple patches. This strategy involves selecting only relevant tissue by means of Otsu thresholding [38] and Connected Components Analysis [1]. Additionally, a red filter is used to remove blood. An example of the resulting segmentation mask is shown in Fig. 2. The green contour delineates a portion of tissue that is preserved; the blue one indicates a removed area (holes). The preserved area is then cropped into non-overlapping 256×256 patches at different resolution scales. The 20× and 5× resolutions were chosen to capture both micro and macro details in the dataset. On average, each WSI contains 5,960 patches at 20× resolution and 370 patches at 5× resolution.
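As an illustration of the thresholding step, the sketch below implements Otsu's method on a flat list of 8-bit intensities by maximizing the between-class variance. It is a generic textbook version, not CLAM's actual pipeline (which channel is thresholded, and any smoothing or morphological steps, are not reproduced), and the pixel values are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Class 0 holds pixels <= t, class 1 holds pixels > t.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]                       # number of pixels in class 0
        sum0 += t * hist[t]                 # intensity mass of class 0
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:              # strict '>' keeps the first maximum
            best_t, best_var = t, between
    return best_t

# Hypothetical bimodal intensities: a dark cluster and a bright cluster.
pixels = [20] * 50 + [30] * 50 + [200] * 60 + [220] * 40
threshold = otsu_threshold(pixels)
dark_mask = [p <= threshold for p in pixels]
```

Any threshold between the two clusters separates them equally well; this implementation returns the lowest such value.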
DINO [4], a self-supervised method built on a Vision Transformer (ViT) [5], is then employed to produce high-quality patch representations, while ensuring fast processing with low computational resource requirements. This approach aligns only positive pairs by leveraging a teacher-student framework, which comprises two separate networks. We trained the model over the entire set of patches, separately for each resolution level.

Implementation Details
The optimization is performed using Adam with a learning rate of $2 \times 10^{-4}$ and a weight decay of $5 \times 10^{-3}$. The training is carried out for 200 epochs with the CosineAnnealingLR scheduler. We employ a single GAT layer with 3 heads for multi-head attention. All the experiments are conducted using a unified codebase and under identical experimental conditions. Each bag is sub-sampled using a patch dropout probability of 0.5 to increase the number of bags and promote randomness during training. The Area Under the Curve (AUC) and the accuracy metrics are calculated as described in [30]. To ensure a fair comparison, all methods considered in our analysis are evaluated using the same metrics.
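Two of these settings can be written down compactly: the cosine-annealed learning rate and the patch dropout applied to each bag. The sketch below uses the closed-form cosine schedule; the period of 200 epochs and the minimum learning rate of 0 are assumptions (the text states only the scheduler name), and keeping at least one patch per bag is a safeguard added here for illustration.

```python
import math
import random

BASE_LR = 2e-4          # learning rate stated in the text
EPOCHS = 200            # training length stated in the text
PATCH_DROPOUT = 0.5     # patch dropout probability stated in the text

def cosine_annealing_lr(epoch, base_lr=BASE_LR, t_max=EPOCHS, eta_min=0.0):
    """Closed-form cosine schedule; t_max and eta_min are assumed values."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

def subsample_bag(instances, drop_prob=PATCH_DROPOUT, rng=random):
    """Drop each patch independently with probability drop_prob."""
    kept = [x for x in instances if rng.random() >= drop_prob]
    # Assumption: never return an empty bag.
    return kept if kept else [rng.choice(instances)]

rng = random.Random(0)                      # seeded for reproducibility
bag = list(range(20))                       # hypothetical patch indices of one WSI
sampled = subsample_bag(bag, rng=rng)
lr_start = cosine_annealing_lr(0)
lr_mid = cosine_annealing_lr(100)
lr_end = cosine_annealing_lr(200)
```

Under these assumptions the learning rate starts at the base value, reaches half of it at epoch 100, and decays to zero at the final epoch, while each epoch sees a different random subset of every bag.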

Results and Discussion
A comparison of the proposed solution with state-of-the-art MIL approaches is reported in Tab. 1 and Tab. 2. All the experiments have been performed on the previously described dataset and repeated 5 times to stress the robustness of the algorithms. The tables report the average performance and the associated standard deviation at 5× and 20× resolutions.
We compared the proposed model, GDS-MIL, with MaxPooling and MeanPooling to understand the effectiveness of patch-level classifiers, and with AB-MIL [9] and DS-MIL [15] as state-of-the-art attention-based MIL solutions. The performance of each approach is measured with average accuracy and average AUC, both at the best and at the last epoch. The best epoch is the one where the model obtains the best performance considering the average between accuracy and AUC on the test set, while the last epoch is the end of the training phase.
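The selection rule for the best epoch can be stated in a few lines. The per-epoch values below are hypothetical placeholders, not results from the paper.

```python
def best_epoch(history):
    """history: list of (accuracy, auc) per epoch; returns the index with the highest mean."""
    return max(range(len(history)), key=lambda e: (history[e][0] + history[e][1]) / 2)

# Hypothetical (accuracy, AUC) pairs for four epochs.
history = [(0.60, 0.55), (0.70, 0.62), (0.66, 0.68), (0.64, 0.60)]
best = best_epoch(history)
last = len(history) - 1
```

Here the third epoch is selected even though the second has the higher accuracy, because the criterion weighs accuracy and AUC equally.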
Experimental results demonstrate that GDS-MIL outperforms all the other approaches on both scales, achieving the highest accuracy and AUC scores at the best epochs. DS-MIL also performs well, achieving good scores on both scales, while MeanPooling and AB-MIL show moderate performance. Overall, the results suggest that integrating a graph-based solution improves our baseline (DS-MIL) by 3.5% in accuracy and 2% in AUC.
Even when considering only the last epoch, GDS-MIL outperforms the baselines, improving on DS-MIL by 1.3%.
In Tab. 2 we investigate a specific dataset split characterized by significant tissue heterogeneity. In this case, the contextualization introduced with the graph plays an even more relevant role: our model outperforms DS-MIL by 9.4% in accuracy and 9.3% in AUC.
A further analysis is reported in Tab. 3, stressing the relevance of the main graph hyper-parameters, such as layer type, number of sequential layers, and number of heads within the same graph layer.

Model Analysis
Experimental results demonstrate that the 20× scale resolution is the most effective when tackling the PFI prediction task on omentum WSIs taken from NACT patients. Specifically, a patch-level classifier such as MaxPooling can achieve surprisingly good performance at 20× resolution, with an accuracy of 0.676 and an AUC of 0.637. This phenomenon implies the existence of morphology and patterns correlated with the PFI which can be exploited to solve the task. This conclusion is also supported by the effectiveness of DS-MIL, which achieves an accuracy of 0.681 and an AUC of 0.649. The attention mechanism used by DS-MIL makes it possible to identify the most relevant WSI regions, guiding the PFI classification.
However, adding a graph attention layer can significantly improve the performance at both considered resolutions. This finding suggests that incorporating spatial context into each instance, including both neighborhood morphology and interaction, changes what constitutes a critical patch. In GDS-MIL, the relevance score of each instance is not limited to the instance itself, but is also influenced by the area where it is located, allowing for a more fine-grained criticality assessment. These results suggest that the proposed model is highly effective and can offer significant improvements over existing state-of-the-art approaches. The high standard deviation of all reported experiments is intrinsically connected to the small number of WSIs and to the high heterogeneity of the task.

Hyperparameter Analysis
Tab. 3 reports the contribution of different graph layers. The results indicate that, in general, using layers of a Graph Convolutional Network [11] leads to worse performance compared to GAT [33] and GATv2 [3]. When relying on convolutional layers, the patch representation becomes similar to its neighborhood, resulting in a loss of important details. In contrast, leveraging an attention layer enables the patch to acquire context information, while preserving its own unique features. No significant difference can be observed between GAT and GATv2, with the latter performing slightly better than the former. The experiments reported in Tab. 3 also reveal that a higher number of graph layers has a negative impact on the performance. This is mainly related to the smoothing operation performed by the graph on the patch representation. If the smoothing is too strong, it becomes challenging for the MIL module to distinguish what is actually important. Therefore, it is crucial to identify a trade-off between the number of layers and the overall performance.
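The smoothing effect mentioned above can be illustrated with a deliberately simplified experiment: repeatedly averaging each node with its neighbors, with no learned weights, on a small path graph. This is not the GCN layer evaluated in Tab. 3 (which uses learned weights and a specific normalization); it only shows how stacking neighborhood averages erodes the difference between one distinctive patch and its surroundings. The graph and feature values are hypothetical.

```python
def mean_aggregate(features, neighbors):
    """One unweighted smoothing step: average each node with its neighbors."""
    out = []
    for i, nbrs in enumerate(neighbors):
        group = [features[i]] + [features[j] for j in nbrs]
        out.append(sum(group) / len(group))
    return out

def spread(features):
    """Gap between the most and least activated node."""
    return max(features) - min(features)

# Path graph of five patches; only the central patch carries a distinctive signal.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
features = [0.0, 0.0, 1.0, 0.0, 0.0]

spreads = []
x = features
for _ in range(4):                          # four stacked smoothing steps
    x = mean_aggregate(x, neighbors)
    spreads.append(spread(x))
```

The spread shrinks at every step, which mirrors the intuition that deeper stacks of graph layers make patches harder to tell apart for the downstream MIL module.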
Moreover, increasing the number of heads applied to the attention mechanism generally provides better performance. Indeed, using a multi-head approach enhances the ability to capture the most important information from the neighborhood and build a more contextualized representation of each instance.
In summary, our analysis highlights the importance of carefully selecting the graph hyper-parameters. Specifically, the adoption of attention layers usually provides better performance than convolutional graph layers. Limiting the number of graph layers and considering a higher number of heads during the self-attention process can also improve the final results. This is the reason why we opted for a single GAT layer consisting of three heads.

Conclusions
This paper proposes the GDS-MIL method, which integrates a GAT into a MIL architecture for predicting the PFI from WSIs obtained from NACT patients. Our results demonstrate that introducing spatial contextualization has beneficial effects on the MIL architecture. Future work will analyze which biological patterns have the greatest impact on the prediction in order to better explain the PFI task.

Fig. 1: The DINO [4] feature extractor is applied to patches tiled from the original WSI. The embeddings thus obtained are fed to a GAT module to capture the patches' context and generate a more contextualized representation. A dual-stream MIL aggregation module is then employed to obtain the final prediction by averaging the scores of the instance and bag classifiers.

Table 1: Performance comparison. Experiments were run 5 times, each with a 4-fold cross-validation. This table reports the average results and the corresponding standard deviation.

Table 2: Performance comparison on an Out-of-Distribution (OOD) test set.

Table 3: Performance comparison when changing the type of graph layer (type), the number of layers (L), and the number of heads (H) used by the graph neural network.