Conditional Channel Gated Networks for Task-Aware Continual Learning

Convolutional Neural Networks experience catastrophic forgetting when optimized on a sequence of learning problems: as they meet the objective of the current training examples, their performance on previous tasks drops drastically. In this work, we introduce a novel framework to tackle this problem with conditional computation. We equip each convolutional layer with task-specific gating modules, which select the filters to apply to the given input. This way, we achieve two appealing properties. Firstly, the execution patterns of the gates allow us to identify and protect important filters, ensuring no loss in the performance of the model on previously learned tasks. Secondly, by using a sparsity objective, we can promote the selection of a limited set of kernels, retaining sufficient model capacity to digest new tasks. Existing solutions require, at test time, awareness of the task to which each example belongs. This knowledge, however, may not be available in many practical scenarios. Therefore, we additionally introduce a task classifier that predicts the task label of each example, to deal with settings in which a task oracle is not available. We validate our proposal on four continual learning datasets. Results show that our model consistently outperforms existing methods both in the presence and the absence of a task oracle. Notably, on the Split SVHN and Imagenet-50 datasets, our model yields up to 23.98% and 17.42% improvement in accuracy, respectively, w.r.t. competing methods.


Introduction
Machine learning and deep learning models are typically trained offline, by sampling examples independently from the distribution they are expected to deal with at test time. However, when trained online in real-world settings, models may encounter multiple tasks as a sequential stream of activities, without having any knowledge about their relationship or duration in time. Such challenges typically arise in robotics [2], reinforcement learning [31], vision systems [28] and many more (cf. Chapter 4 in [7]). In such scenarios, deep learning models suffer from catastrophic forgetting [24,9], meaning they discard previously acquired knowledge to fit the current observations. The underlying reason is that, while learning the new task, models overwrite the parameters that were critical for previous tasks.
Continual learning research (also called lifelong or incremental learning) tackles the above-mentioned issues [7]. The typical setting considered in the literature is that of a model learning disjoint classification problems one by one. Depending on the application requirements, the task for which the current input should be analyzed may or may not be known. The majority of the methods in the literature assume that the label of the task is provided during inference. Such a continual learning setting is generally referred to as task-incremental. In many real-world applications, such as classification and anomaly detection systems, a model can seamlessly instantiate a new task whenever novel classes emerge from the training stream. However, once deployed in the wild, it has to process inputs without knowing in which training task similar observations were encountered. Such a setting, in which task labels are available only during training, is known as class-incremental [37]. Existing methods employ different strategies to mitigate catastrophic forgetting, such as memory buffers [29,19], knowledge distillation [18], synaptic consolidation [15] and parameter masking [22,34]. However, recent evidence has shown that existing solutions fail, even for simple datasets, whenever task labels are not available at test time [37].
This paper introduces a solution based on conditional computing to tackle both task-incremental and class-incremental learning problems. Specifically, our framework relies on separate task-specific classification heads (multi-head architecture), and it employs channel-gating [6,3] in every layer of the (shared) feature extractor. To this aim, we introduce task-dedicated gating modules that dynamically select which filters to apply conditioned on the input feature map. Along with a sparsity objective encouraging the use of fewer units, this strategy enables per-sample model selection and can be easily queried for information about which weights are essential for the current task. Those weights are frozen when learning new tasks, but gating modules can dynamically select to either use or discard them. Conversely, units that are never used by previous tasks are reinitialized and made available for acquiring novel concepts. This procedure prevents any forgetting of past tasks and allows considerable computational savings in the forward propagation. Moreover, we obviate the need for a task label during inference by introducing a task classifier selecting which classification head should be queried for the class prediction. We train the task classifier alongside the classification heads under the same incremental learning constraints. To mitigate forgetting on the task classification side, we rely on example replay from either episodic or generative memories. In both cases, we show the benefits of performing rehearsal at a task level, as opposed to previous replay methods that operate at a class level [29,5]. To the best of our knowledge, this is the first work that carries out supervised task prediction in a class-incremental learning setting.
We perform extensive experiments on four datasets of increasing difficulty, both in the presence and absence of a task oracle at test time. Our results show that, whenever task labels are available, our model effectively prevents the forgetting problem, and performs similarly to or better than state-of-the-art solutions. In the task-agnostic setting, we consistently outperform competing methods.
* Research conducted during an internship at Qualcomm Technologies Netherlands B.V.
† Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

Related work
Continual learning. Catastrophic forgetting is a long-standing problem of neural networks [24]. Early approaches to alleviate the issue involved orthogonal representation learning and replay of prior samples [9]. Recent advances in deep learning have led to the widespread use of deep neural networks in the continual learning field. First attempts, such as Progressive Neural Networks [32], tackle the forgetting problem by introducing a new set of parameters for each new task, at the expense of limited scalability. Another popular solution is to apply knowledge distillation by using the past parametrizations of the model as a reference when learning new tasks [18]. Consolidation approaches emerged recently with the focus of identifying the weights that are critically important for prior tasks and preventing significant updates to them during the learning of new tasks. The importance estimation for each parameter can be carried out through the Fisher Information Matrix [15], the path integral of loss gradients [41], gradient magnitude [1] and a posteriori uncertainty estimation in a Bayesian Neural Network [26]. Other popular consolidation strategies rely on the estimation of binary masks that directly map each task to the set of parameters responsible for it. Such masks can be estimated by random assignment [23], pruning [22] or gradient descent [21,34]. However, existing mask-based approaches can only operate in the presence of an oracle providing the task label. Our work is akin to the above-mentioned models, with two fundamental differences: i) our binary masks (gates) are dynamically generated and depend on the network input, and ii) we extend mask-based approaches to class-incremental learning settings, by relying on a novel architecture comprising a task classifier.
Several models allow access to a finite-capacity memory buffer (episodic memory), holding examples from prior tasks. A popular approach is iCaRL [29], which computes class prototypes as the mean feature representation of stored memories, and classifies test examples in a nearest-neighbor fashion. Alternatively, other approaches intervene in the training algorithm, proposing to adjust the gradient computed on the current batch towards an update direction that guarantees non-destructive effects on the stored examples [19,5,30]. Such an objective can imply the formalization of constrained optimization problems [19,5] or the employment of meta-learning algorithms [30]. Generative memories, in contrast, do not rely on the replay of any real example whatsoever, in favor of generative models from which fake examples of past tasks can be efficiently sampled [36,40,28]. In this work, we also rely on either episodic or generative memories to deal with the class-incremental learning setting. However, we carry out replay only to prevent forgetting of the task predictor, thus avoiding any update of the task-specific classification heads.

Conditional computation.
Conditional computation research focuses on deep neural networks that adapt their architecture to the given input. Although the first work in this area was applied to language modeling [35], several later works applied the concept to computer vision problems. In this respect, prior works employ binary gates deciding whether a computational block has to be executed or skipped. Such gates may either drop entire residual blocks [38,39] or specific units within a layer [6,3]. In our work, we rely on the latter strategy, learning a set of task-specific gating modules that select which kernels to apply to the given input. To our knowledge, this is the first application of data-dependent channel-gating in continual learning.

Problem setting and objective
We are given a parametric model, i.e., a neural network, called a backbone or learner network, which is exposed to a sequence of N tasks to be learned, $T = \{T_1, \dots, T_N\}$. Each task $T_i$ takes the form of a classification problem, $T_i = \{x_j, y_j\}_{j=1}^{n_i}$, where $x_j \in \mathbb{R}^m$ and $y_j \in \{1, \dots, C_i\}$.
A task-incremental setting requires optimizing:
$$\max_\theta \; \mathbb{E}_{t \sim T} \Big[ \mathbb{E}_{(x,y) \sim T_t} \big[ \log p_\theta(y \mid x, t) \big] \Big], \tag{1}$$
where θ identifies the parametrization of the learner network, and x, y and t are random variables associated with the observation, the label and the task of each example, respectively. Such a maximization problem is subject to the continual learning constraints: as the model observes tasks sequentially, the outer expectation in Eq. 1 is troublesome to compute or approximate. Notably, this setting requires the assumption that the identity of the task each example belongs to is known at both training and test stages. Such information can be exploited in practice to isolate relevant output units of the classifier, preventing the competition between classes belonging to different tasks through the same softmax layer (multi-head).
Class-incremental models solve the following optimization:
$$\max_\theta \; \mathbb{E}_{t \sim T} \Big[ \mathbb{E}_{(x,y) \sim T_t} \big[ \log p_\theta(y \mid x) \big] \Big]. \tag{2}$$
Here, the absence of task conditioning prevents any form of task-aware reasoning in the model. This setting requires merging the output units into a single classifier (single-head) in which classes from different tasks compete with each other, often resulting in more severe forgetting [37]. Although the model could learn based on task information, this information is not available during inference.
To deal with observations from unknown tasks, while retaining the advantages of multi-head settings, we jointly optimize for class as well as task prediction, as follows:
$$\max_\theta \; \mathbb{E}_{t \sim T} \Big[ \mathbb{E}_{(x,y) \sim T_t} \big[ \log p_\theta(y, t \mid x) \big] \Big] = \max_\theta \; \mathbb{E}_{t \sim T} \Big[ \mathbb{E}_{(x,y) \sim T_t} \big[ \log p_\theta(y \mid x, t) + \log p_\theta(t \mid x) \big] \Big]. \tag{3}$$
Eq. 3 describes a twofold objective. On the one hand, the term $\log p_\theta(y \mid x, t)$ is responsible for the class prediction given the task, and resembles the multi-head objective in Eq. 1. On the other hand, the term $\log p_\theta(t \mid x)$ aims at predicting the task from the observation. This prediction relies on a task classifier, which is trained incrementally in a single-head fashion. Notably, the objective in Eq. 3 shifts the single-head complexities from a class prediction to a task prediction level, with the following benefits:
• given the task label, there is no drop in class prediction accuracy;
• classes from different tasks never compete with each other, neither during training nor during test;
• the challenging single-head prediction step is shifted from class to task level; as tasks and classes form a two-level hierarchy, the prediction of the former is arguably easier (as it acts at a coarser semantic level).
Figure 1: The proposed gating scheme for a convolution layer. Depending on the input feature map, the gating module $G^l_t$ decides which kernels should be used.
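The twofold objective of Eq. 3 relies on the factorization log p(y, t | x) = log p(y | x, t) + log p(t | x). The following is a minimal sketch of that factorization and of the resulting two-step inference (pick the task, then the class inside the selected head). The function names, the zero-based indexing and the toy logits are illustrative assumptions, not part of the original implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def joint_log_likelihood(task_logits, head_logits, t, y):
    """Return log p(y, t | x) = log p(y | x, t) + log p(t | x).

    task_logits: scores of the single-head task classifier, one per task.
    head_logits: one list of class scores per task (multi-head).
    t, y: zero-based task index and class index inside that task (assumed).
    """
    log_p_task = log_softmax(task_logits)[t]
    log_p_class = log_softmax(head_logits[t])[y]
    return log_p_class + log_p_task


def predict(task_logits, head_logits):
    """Task-agnostic inference: select the task, then query its head."""
    t_hat = max(range(len(task_logits)), key=task_logits.__getitem__)
    scores = head_logits[t_hat]
    y_hat = max(range(len(scores)), key=scores.__getitem__)
    return t_hat, y_hat
```

Because each head normalizes only over the classes of its own task, classes from different tasks never share a softmax in this sketch; the only single-head competition happens among task scores.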

Multi-head learning of class labels
In this section, we introduce the conditional computation model we used in our work. Fig. 1 illustrates the gating mechanism used in our framework. We limit the discussion of the gating mechanism to the case of convolutional layers, since it extends directly to other parametrized mappings such as fully connected layers or residual blocks. Consider $h^l \in \mathbb{R}^{c^l_{in}, h, w}$ and $h^{l+1} \in \mathbb{R}^{c^l_{out}, h', w'}$ to be the input and output feature maps of the l-th convolutional layer, respectively. Instead of $h^{l+1}$, we forward to the following layer a sparse feature map $\hat{h}^{l+1}$, obtained by pruning uninformative channels. During the training of task t, the decision regarding which channels have to be activated is delegated to a gating module $G^l_t$, which is conditioned on the input feature map $h^l$:
$$\hat{h}^{l+1} = G^l_t(h^l) \odot h^{l+1}, \tag{4}$$
where $G^l_t(h^l) = [g^l_1, \dots, g^l_{c^l_{out}}]$, $g^l_i \in \{0, 1\}$, and $\odot$ refers to channel-wise multiplication. To be compliant with the incremental setting, we instantiate a new gating module each time the model observes examples from a new task. However, each module is designed as a light-weight network with negligible computation costs and number of parameters. Specifically, each gating module comprises a Multi-Layer Perceptron (MLP) with a single hidden layer featuring 16 units, followed by a batch normalization layer [12] and a ReLU activation. A final linear map provides log-probabilities for each output channel of the convolution. Back-propagating gradients through the gates is challenging, as non-differentiable thresholds are employed to take binary on/off decisions. Therefore, we rely on Gumbel-Softmax sampling [13,20], and get a biased estimate of the gradient utilizing the straight-through estimator [4]. Specifically, we employ the hard threshold in the forward pass (zero-centered) and the sigmoid function in the backward pass (with temperature τ = 2/3). Moreover, we penalize the number of active convolutional kernels with the sparsity objective:
$$L_{sparse} = \mathbb{E}_{(x,y) \sim T_t} \left[ \frac{\lambda_s}{L} \sum_{l=1}^{L} \frac{\lVert G^l_t(h^l) \rVert_1}{c^l_{out}} \right], \tag{5}$$
where L is the total number of gated layers, and $\lambda_s$ is a coefficient controlling the level of sparsity. The sparsity objective instructs each gating module to select a minimal set of kernels, allowing us to conserve filters for the optimization of future tasks. Moreover, it allows us to effectively adapt the capacity of the allocated network depending on the difficulty of the task and the observation at hand. Such a data-driven model selection contrasts with other continual learning strategies that employ fixed ratios for model growing [32] or weight pruning [22]. At the end of the optimization for task t, we compute a relevance score $r^l_k$ for each unit in the l-th layer by estimating the firing probability of its gate on a validation set $T^{val}_t$:
$$r^l_k = \mathbb{E}_{(x,y) \sim T^{val}_t} \big[ p(\mathbb{I}[g^l_k = 1]) \big], \tag{6}$$
where $\mathbb{I}[\cdot]$ is an indicator function, and $p(\cdot)$ denotes a probability distribution. By thresholding such scores, we obtain two sets of kernels. On the one hand, we freeze relevant kernels for the task t, so that they will be available but not updatable during future tasks. On the other hand, we reinitialize non-relevant kernels, and leave them learnable by subsequent tasks. In all our experiments, we use a threshold equal to 0, which prevents any forgetting at the expense of a reduced model capacity left for future tasks.
Figure 2: Illustration of the task prediction mechanism for a generic backbone architecture. First (block 'a'), the l-th convolutional layer is fed with multiple gated feature maps, each of which is relevant for a specific task. Every feature map is then convolved with kernels selected by the corresponding gating module $G^l_x$, and forwarded to the next module. At the end of the network the task classifier (block 'b') takes as input candidate feature maps and decides which task to solve.
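The gating pipeline described in this section (binary channel gates, channel-wise masking, the sparsity penalty, and relevance scores thresholded at 0) can be sketched in plain Python as follows. This is a forward-only illustration under stated assumptions: the straight-through backward pass is not modelled, the training-time noise is logistic (the difference of two Gumbel samples, assumed here to be the binary form of the Gumbel trick), the sparsity term is normalized by the number of layers and of output channels, and all function names are hypothetical.

```python
import math
import random


def hard_gates(logits, training=False, rng=None):
    """One binary gate per output channel via a zero-centred hard threshold.

    During training, logistic noise is added to each logit before
    thresholding (assumption: binary counterpart of Gumbel noise).
    """
    rng = rng or random
    gates = []
    for logit in logits:
        if training:
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            logit = logit + math.log(u) - math.log(1.0 - u)
        gates.append(1 if logit > 0 else 0)
    return gates


def apply_gates(feature_map, gates):
    """Channel-wise multiplication: zero every channel whose gate is off.

    feature_map: list of channels, each a flat list of activations.
    """
    return [[g * v for v in channel] for channel, g in zip(feature_map, gates)]


def sparsity_loss(gates_per_layer, lambda_s):
    """lambda_s / L * sum over layers of (active gates / output channels)."""
    num_layers = len(gates_per_layer)
    return lambda_s / num_layers * sum(sum(g) / len(g) for g in gates_per_layer)


def relevance_scores(gate_samples):
    """Firing frequency of each gate over a set of validation examples."""
    n = len(gate_samples)
    return [sum(column) / n for column in zip(*gate_samples)]


def split_kernels(scores, threshold=0.0):
    """Kernels scoring above the threshold are frozen, the others are freed."""
    frozen = [k for k, r in enumerate(scores) if r > threshold]
    free = [k for k, r in enumerate(scores) if r <= threshold]
    return frozen, free
```

With a threshold of 0, a kernel is freed only if its gate never fired on the validation examples, which is what makes the scheme forgetting-free at the price of consuming capacity faster.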

Single-head learning of task labels
The gating scheme presented in Sec. 3.2 allows the immediate identification of important kernels for each past task. However, it cannot be applied in the task-agnostic setting as is, since it requires knowing which gating module $G^l_x$ has to be applied at layer l, where x ∈ {1, . . . , t} represents the unknown task. Our solution is to employ all gating modules $[G^l_1, \dots, G^l_t]$ in parallel, so that each layer processes one gated stream of computation per task (Fig. 2). The cost of these parallel streams remains contained thanks to the gating modules, which select a limited number of convolutional filters in each stream. After the last convolutional layer, indexed by L, we are given a list of t candidate feature maps $[\hat{h}^{L+1}_1, \dots, \hat{h}^{L+1}_t]$ and as many classification heads. The task classifier is fed with a concatenation of all feature maps:
$$h = \bigoplus_{i=1}^{t} \big[ \mu(\hat{h}^{L+1}_i) \big], \tag{7}$$
where µ denotes the global average pooling operator over the spatial dimensions and $\bigoplus$ describes the concatenation along the feature axis. The architecture of the task classifier is based on a shallow MLP with one hidden layer featuring 64 ReLU units, followed by a softmax layer predicting the task label. We use the standard cross-entropy objective to train the task classifier. Optimization is carried out jointly with the learning of class labels at task t. Thus, the network not only learns features to discriminate the classes inside task t, but also features that ease the discrimination of input data from task t against all prior tasks.
The single-head task classifier is exposed to catastrophic forgetting. Recent papers have shown that replay-based strategies represent the most effective continual learning strategy in single-head settings [37]. Therefore, we choose to ameliorate the problem by rehearsal. In particular, we consider the following approaches.
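To make the input of the task classifier concrete, the sketch below pools each candidate feature map over its spatial positions, concatenates the pooled vectors along the feature axis, and feeds the result to a shallow ReLU MLP with a softmax output. It is an illustrative toy in plain Python: the weight matrices are passed explicitly, their sizes in the test are far smaller than the 64 hidden units mentioned above, and every function name is an assumption.

```python
import math


def global_average_pool(feature_map):
    """Average each channel over its (flattened) spatial positions."""
    return [sum(channel) / len(channel) for channel in feature_map]


def task_classifier_input(candidate_maps):
    """Concatenate the pooled features of the t candidate streams."""
    h = []
    for feature_map in candidate_maps:
        h.extend(global_average_pool(feature_map))
    return h


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def task_probabilities(h, w1, b1, w2, b2):
    """Shallow MLP: one ReLU hidden layer followed by a softmax over tasks."""
    hidden = [max(0.0, v) for v in linear(h, w1, b1)]
    return softmax(linear(hidden, w2, b2))
```

Note that the length of the concatenated vector grows with the number of observed tasks, so in this sketch the first layer of the classifier would need to be widened whenever a new stream is added.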
Episodic memory. A small subset of examples from prior tasks is used to rehearse the task classifier. During the training of task t, the buffer holds C random examples from past tasks 1, . . . , t − 1 (where C denotes a fixed capacity). Examples from the buffer and the current batch (from task t) are re-sampled so that the distribution of task labels in the rehearsal batch is uniform. At the end of task t, the data in the buffer is subsampled so that each past task holds m = C/t examples. Finally, m random examples from task t are selected for storage.
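The buffer management just described can be sketched as follows. The end-of-task update keeps m = C/t examples per task (integer division is an assumption for capacities that do not divide evenly), and the rehearsal batch is drawn by first sampling a task label uniformly, which is one possible reading of the uniform re-sampling step. The dictionary layout and function names are hypothetical.

```python
import random


def update_buffer(buffer, new_examples, task_id, capacity, rng):
    """End-of-task update of the episodic memory.

    buffer: dict mapping a 1-based task id to its stored examples.
    After task `task_id`, every task keeps m = capacity // task_id examples.
    """
    m = capacity // task_id
    for past_task in list(buffer):
        if len(buffer[past_task]) > m:
            buffer[past_task] = rng.sample(buffer[past_task], m)
    buffer[task_id] = rng.sample(new_examples, min(m, len(new_examples)))
    return buffer


def rehearsal_batch(buffer, current_examples, current_task, batch_size, rng):
    """Draw (example, task label) pairs with uniformly distributed task labels."""
    pools = dict(buffer)
    pools[current_task] = current_examples
    tasks = sorted(pools)
    batch = []
    for _ in range(batch_size):
        t = rng.choice(tasks)
        batch.append((rng.choice(pools[t]), t))
    return batch
```

A typical loop would call `rehearsal_batch` at every training step of task t and `update_buffer` once, after the task has been learned.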
Generative memory. A generative model is employed for sampling fake data from prior tasks. Specifically, we utilize Wasserstein GANs with Gradient Penalty (WGAN-GP [10]). To overcome forgetting in the sampling procedure, we use multiple generators, each of which models the distribution of examples of a specific task.
In both cases, replay is only employed for rehearsing the task classifier and not the classification heads. To summarize, the complete objective of our model includes: the cross-entropy at a class level (p θ (y|x, t) in Eq. 3), the cross-entropy at a task level (p θ (t|x) in Eq. 3) and the sparsity term (L sparse in Eq. 5).

Datasets and backbone architectures
We experiment with the following datasets:
• Split MNIST: the MNIST handwritten classification benchmark [17] is split into 5 subsets of consecutive classes. This results in 5 binary classification tasks that are observed sequentially.
• Split SVHN: the same protocol as in Split MNIST, applied to the SVHN dataset [25].
• Split CIFAR-10: the same protocol as in Split MNIST, applied to the CIFAR-10 dataset [16].
• Imagenet-50 [28]: a subset of the ILSVRC-2012 dataset [8] containing 50 randomly sampled classes and 1300 images per category, split into 5 consecutive 10-way classification problems. Images are resized to a resolution of 32x32 pixels.
As for the backbone models, for the MNIST and SVHN benchmarks, we employ a three-layer CNN with 100 filters per layer and ReLU activations (SimpleCNN in what follows). All convolutions except for the last one are followed by a 2x2 max-pooling layer. Gating is applied after the pooling layer. A final global average pooling followed by a linear classifier yields class predictions. For the CIFAR-10 and Imagenet-50 benchmarks, we employ a ResNet-18 [11] model as backbone. The gated version of a ResNet basic block is represented in Fig. 3. As illustrated, two independent sets of gates are applied after the first convolution and after the residual connection, respectively. All models were trained with SGD with momentum until convergence. After each task, model selection is performed for all models by monitoring the corresponding objective on a held-out set of examples from the current task (i.e., we don't rely on examples of past tasks for validation purposes). We apply the sparsity objective introduced in Sec. 3.2 only after a predetermined number of epochs, to give the model the chance to learn meaningful kernels before it starts pruning the uninformative ones. We refer to the supplementary material for further implementation details.
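The split protocol shared by these benchmarks amounts to grouping consecutive class indices into disjoint tasks. A minimal sketch, assuming classes are relabeled as zero-based integers (for Imagenet-50, after the 50 classes have been sampled), with hypothetical function names:

```python
def split_into_tasks(num_classes, classes_per_task):
    """Group consecutive class ids into disjoint tasks."""
    if num_classes % classes_per_task != 0:
        raise ValueError("num_classes must be a multiple of classes_per_task")
    return [
        list(range(start, start + classes_per_task))
        for start in range(0, num_classes, classes_per_task)
    ]


def to_task_and_local_label(label, classes_per_task):
    """Map a global class id to (task index, class index inside the task)."""
    return divmod(label, classes_per_task)
```

For instance, `split_into_tasks(10, 2)` gives the five binary tasks of the Split MNIST protocol, and `split_into_tasks(50, 10)` gives five 10-way tasks as in Imagenet-50.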

Task-incremental setting
In the task-incremental setting, an oracle can be queried for task labels at test time. Therefore, we don't rely on the task classifier, exploiting ground-truth task labels to select which gating modules and classification head should be active. This section validates the suitability of the proposed data-dependent gating scheme for continual learning. We compare our model against several competing methods:
- Joint: the backbone model trained jointly on all tasks while having access to the entire dataset. We consider its performance as the upper bound.
Tab. 1 reports the comparison between methods, in terms of accuracy on all tasks after the whole training procedure. Although all methods perform very similarly on MNIST, the gap in the consolidation capability of different models emerges as the dataset grows more challenging. It is worth mentioning several recurring patterns. First, LwF struggles when the number of tasks grows larger than two. Although its distillation objective is an excellent regularizer against forgetting, it does not allow enough flexibility to the model to acquire new knowledge. Consequently, its accuracy on the most recent task gradually decreases during sequential learning, whereas the performance on the first task is kept very high. Moreover, results highlight the suitability of gating-based schemes (HAT and ours) with respect to other consolidation strategies such as EWC Online. Whereas the former prevent any update of relevant parameters, the latter only penalizes updating them, eventually incurring a significant degree of forgetting. Finally, the table shows that our model performs on par with or better than HAT on all datasets, suggesting the beneficial effect of our data-dependent gating scheme and sparsity objective.

Class-incremental with episodic memory
Next, we move to a class-incremental setting in which no awareness of task labels is available at test time, significantly increasing the difficulty of the continual learning problem. In this section, we set up an experiment for which the storage of a limited amount of examples (buffer) is allowed. We compare against:
- Full replay: upper bound performance given by replay to the network of an unlimited number of examples.
- iCaRL [29]: an approach based on a nearest-neighbor classifier exploiting examples in the buffer. We report the performances both with the original buffer-filling strategy (iCaRL-mean) and with the randomized algorithm used for our model (iCaRL-rand).
- A-GEM [5]: a buffer-based method correcting parameter updates on the current task so that they don't contradict the gradient computed on the stored examples.
Results are summarized in Fig. 4, illustrating the final average accuracy on all tasks at different buffer sizes for the class-incremental Split MNIST and Split SVHN benchmarks. The figure highlights several findings. Surprisingly, A-GEM yields a very low performance on MNIST, while providing higher results on SVHN. Further examination on the former dataset revealed that it consistently reaches competitive accuracy on the most recent task, while mostly forgetting the prior ones. The performance of iCaRL, on the other hand, does not seem to be significantly affected by changing its buffer-filling strategy. Moreover, its accuracy does not seem to scale with the number of stored examples. In contrast to these methods, our model primarily utilizes the few stored examples for the rehearsal of coarse-grained task prediction, while retaining the accuracy of fine-grained class prediction. As shown in Fig. 4, our approach consistently outperforms competing approaches in the class-incremental setting with episodic memory.

Class-incremental with generative memory
Next, we experiment with a class-incremental setting in which no examples are allowed to be stored whatsoever. A popular strategy in this framework is to employ generative models to approximate the distribution of prior tasks and rehearse the backbone network by sampling fake observations from them. Among these, DGM [28] is the state-of-the-art approach, which proposes a class-conditional GAN architecture paired with a hard attention mechanism similar to the one of HAT [34]. Fake examples from the GAN generator are replayed to the discriminator, which includes an auxiliary classifier providing a class prediction. As for our model, as mentioned in Sec. 3

Model analysis
Episodic vs. generative memory. To understand which rehearsal strategy should be preferred when dealing with class-incremental learning problems, we raise the following question: which is more beneficial, a limited amount of real examples or a (potentially) unlimited amount of generated examples? To shed light on this matter, we report our model's performance on Split SVHN and Split CIFAR-10 as a function of memory budget. Specifically, we compute the memory consumption of episodic memories as the cumulative size of the stored examples. As for generative memories, we consider the number of bytes needed to store their parameters (in single-precision floating-point format), discarding the corresponding discriminators as well as inner activations generated in the sampling process. Fig. 5 presents the result of the analysis. As can be seen, the variant of our model relying on memory buffers consistently outperforms its counterpart relying on generative modeling. In the case of CIFAR-10, the generative replay yields an accuracy comparable with that of an episodic memory of ≈ 1.5 MB, which is more than 20 times smaller than its generators. The gap between the two strategies shrinks on SVHN, due to the simpler image content resulting in better samples from the generators. Finally, our method, when based on memory buffers, outperforms the DGMw model [28] on Split SVHN, while requiring 3.6 times less memory.
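The memory accounting used in this comparison is simple arithmetic, sketched below. Two assumptions are made for illustration: stored images occupy one byte per pixel value (uint8), and the parameter count passed to `generative_memory_bytes` is a hypothetical input rather than a figure from the paper. Generators are counted at 4 bytes per parameter (single precision), one generator per task.

```python
def episodic_memory_bytes(num_examples, height, width, channels, bytes_per_value=1):
    """Cumulative size of the stored examples (uint8 pixels assumed)."""
    return num_examples * height * width * channels * bytes_per_value


def generative_memory_bytes(params_per_generator, num_generators, bytes_per_param=4):
    """Size of the generator parameters in single-precision floats."""
    return params_per_generator * num_generators * bytes_per_param


def examples_within_budget(budget_bytes, height, width, channels, bytes_per_value=1):
    """How many stored examples fit into a given memory budget."""
    return budget_bytes // (height * width * channels * bytes_per_value)
```

Under these assumptions, a 32x32 RGB image takes 3072 bytes, so a budget of 1.5 MB (taken as 1,500,000 bytes) holds 488 such images.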
Gate analysis. We provide a qualitative analysis of the activation of gates across different tasks in Fig. 6. Specifically, we use the validation sets of Split MNIST and Imagenet-50 to compute the probability of each gate to be triggered by images from different tasks; we report such probabilities for specific layers, namely layer 1 for Split MNIST (SimpleCNN) and block 5 for Imagenet-50 (ResNet-18). The analysis of the figure suggests two pieces of evidence. First, as more tasks are observed, previously learned features are re-used. This pattern shows that the model does not fall into degenerate solutions, e.g., by completely isolating tasks into different sub-networks. On the contrary, our model profitably exploits pieces of knowledge acquired from previous tasks for the optimization of the future ones. Second, a significant number of gates never fire, suggesting that a considerable portion of the backbone capacity is available for learning even more tasks. Additionally, we showcase how images from different tasks activating the same filters show some resemblance in low-level or semantic features (see the caption for details).
Figure 6: Illustration of the gate execution patterns for continually trained models on MNIST (left) and Imagenet-50 (right) datasets. The histograms in the top left and top right show the firing probability of gates in the 1st layer and the 5th residual block, respectively. For better illustration, gates are sorted by overall execution rate over all tasks. The bottom-left box shows images from different tasks either triggering or not triggering a specific gate on Split MNIST. The bottom-right box illustrates how, on Imagenet-50, correlated classes from different tasks fire the same gates (e.g., fishes, different breeds of dogs, birds).
On the cost of inference. We next measure the inference cost of our model as the number of tasks increases. Tab. 3 reports the average number of multiply-add operations (MAC count) of our model on the test set of Split MNIST and Split CIFAR-10 after learning each task. Moreover, we report the MACs of HAT [34] as well as the cost of forward propagation in the backbone network (i.e., the cost of any other competing method mentioned in this section). In the task-incremental setting, our model obtains a meaningful saving in the number of operations, thanks to the data-dependent gating modules selecting only a small subset of filters to apply. In contrast, forward propagation in a class-incremental setting requires as many computational streams as the number of tasks observed so far. However, each of them is extremely cheap, as few convolutional units are active. As presented in the table, the number of operations never exceeds the cost of forward propagation in the backbone model. The reduction in inference cost is particularly significant for Split CIFAR-10, which is based on a ResNet-18 backbone.
Limitations and future work. Training our model can require a lot of GPU memory for bigger backbones. However, by exploiting the inherent sparsity of activation maps, several optimizations are possible. Moreover, we expect the task classifier to be susceptible to the degree of semantic separation among tasks. For instance, a setting where tasks are semantically well-defined, like T 1 = {cat,dog}, T 2 = {car,truck} (animals / vehicles), should favor the task classifier with respect to its transpose T 1 = {cat,car}, T 2 = {dog,truck}. However, we remark that in our experiments the assignment of classes to tasks is always random. Therefore, our model could perform even better in the presence of coherent tasks.
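To see why several sparse streams can cost less than one dense pass, the MAC count of a convolution can be written as a product of output positions, kernel area, input channels and output channels. The sketch below assumes that a gated layer computes only its active output kernels and reads only the input channels left active by the previous layer, and it ignores the (small) cost of the gating modules themselves. The layer sizes in the usage note are illustrative, not taken from Tab. 3.

```python
def conv_macs(out_h, out_w, kernel, c_in, c_out):
    """Multiply-accumulate operations of a dense kernel x kernel convolution."""
    return out_h * out_w * kernel * kernel * c_in * c_out


def gated_conv_macs(out_h, out_w, kernel, active_in, active_out):
    """Cost when only active input channels and output kernels are computed."""
    return conv_macs(out_h, out_w, kernel, active_in, active_out)


def class_incremental_macs(out_h, out_w, kernel, active_per_stream):
    """Total cost over parallel streams, one (active_in, active_out) pair per task."""
    return sum(
        gated_conv_macs(out_h, out_w, kernel, a_in, a_out)
        for a_in, a_out in active_per_stream
    )
```

For a 3x3 layer with 100 input and 100 output channels on a 16x16 output, the dense cost is 23,040,000 MACs, whereas two streams with (20, 25) and (30, 10) active channels total 1,843,200 MACs.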

Conclusions
We presented a novel framework based on conditional computation to tackle catastrophic forgetting in convolutional neural networks. Having task-specific light-weight gating modules allows us to prevent catastrophic forgetting of previously learned knowledge. Besides learning new features for new tasks, the gates allow for dynamic usage of previously learned knowledge to improve performance. Our method can be employed both in the presence and in the absence of task labels at test time. In the latter case, a task classifier is trained to take the place of a task oracle. Through extensive experiments, we validated the performance of our model against existing methods both in task-incremental and class-incremental settings and demonstrated state-of-the-art results on four continual learning datasets.
In this section we report training details and hyperparameters used for the optimization of our model. As already specified in Sec. 4.1 of the main paper, all models were trained with Stochastic Gradient Descent with momentum. Gradient clipping was applied, keeping the gradient magnitude below a predetermined threshold. Moreover, we employed a scheduler dividing the learning rate by a factor of 10 at certain epochs. Such details can be found, for each dataset, in Tab. 4, where we highlight two sets of hyperparameters:
• optim: general optimization choices that were kept fixed both for our model and competing methods, in order to ensure fairness.
• our: hyperparameters that only concern our model, such as the weight of the sparsity loss and the number of epochs after which sparsity was introduced (patience).

[10]). The reader can find the specification of the architecture in Tab. 9. For every dataset, we trained the WGANs for 2 × 10^5 total iterations, each of which was composed of 5 discriminator updates and 1 generator update. As for the optimization, we rely on Adam [14] with a learning rate of 10^−4, fixing β_1 = 0.5 and β_2 = 0.9. The batch size was set to 64. The weight for gradient penalty [10] was set to 10. Inputs were normalized before being fed to the discriminator. Specifically, for MNIST we normalize each image into the range [0, 1], whilst for other datasets we map inputs into the range [−1, 1].
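The optimization choices described in this section can be sketched in plain Python as follows. This is a minimal illustration rather than our training code: the function names, the milestone epochs, and the assumption of 8-bit input pixels are hypothetical, and clipping by the global L2 norm is one possible reading of "gradient magnitude". The actual thresholds and decay epochs are those listed in Tab. 4. The values in `WGAN_CONFIG` are the ones stated in the text.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def step_lr(base_lr, epoch, milestones):
    """Divide base_lr by 10 once for every milestone epoch already reached."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr / (10 ** n_decays)


def normalize_image(pixels, dataset):
    """Map 8-bit pixel values (assumed) to [0, 1] for MNIST, to [-1, 1] otherwise."""
    if dataset == "mnist":
        return [p / 255.0 for p in pixels]
    return [p / 127.5 - 1.0 for p in pixels]


# WGAN settings as stated in the text above.
WGAN_CONFIG = {
    "total_iterations": 2 * 10 ** 5,
    "discriminator_updates_per_iteration": 5,
    "generator_updates_per_iteration": 1,
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "betas": (0.5, 0.9),
    "batch_size": 64,
    "gradient_penalty_weight": 10,
}

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
lr_late = step_lr(0.1, epoch=25, milestones=[10, 20])     # hypothetical milestones
```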

On mixing real and fake images for rehearsal.
The common practice when adopting generative replay for continual learning is to exploit a generative model to synthesize examples for prior tasks {1, . . . , t − 1}, while utilizing real examples as representative of the current task t. In early experiments we followed this exact approach, but it led to sub-optimal results. Indeed, the task classifier consistently reached good discrimination capabilities during training, yet yielded very poor performance at test time.
After an in-depth analysis, we conjectured that the task classifier, while being trained on a mixture of real and fake examples, fell into the following degenerate classification logic (Fig. 7). It first discriminated the nature of the image (real/fake), learning to map real examples to task t. Only for inputs deemed fake was a further categorization into tasks {1, . . . , t − 1} carried out. Such behavior, perfectly legitimate during training, led to terrible test performance. Indeed, at test time only real examples are presented to the network, causing the task classifier to consistently label them as coming from task t. To overcome this issue, we remove the mixing of real and fake examples during rehearsal, by presenting to the task classifier fake examples also for the task t. In the incremental learning paradigm, this only requires shifting the training of the WGAN generators from the end of a given task to its beginning.

Table 5: Numerical results for Fig. 4 in the main paper. Average accuracy for the episodic memory experiment, for different buffer sizes (C).
[5] 0.5680 0.5411 0.5933 0.5704
iCaRL-rand [29] 0.4972 0.5492 0.4788 0.5484
iCaRL-mean [29] 0.5626 0.5469 0.5252 0.5511
ours 0.6745 0.7399 0.7673 0.8102
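The rehearsal scheme without real/fake mixing can be sketched as follows: every example shown to the task classifier, including those of the current task t, comes from a generator. This is a minimal illustration under stated assumptions. The function `build_task_classifier_batch`, the toy generators, and uniform sampling over tasks are hypothetical choices made for the example, not a description of our implementation.

```python
import random


def build_task_classifier_batch(generators, batch_size, rng):
    """Sample a batch for the task classifier using generated examples only.

    `generators` maps each task id in {1, ..., t} (current task included) to a
    zero-argument callable returning one synthetic example. Tasks are sampled
    uniformly, which is an assumption of this sketch.
    """
    task_ids = sorted(generators)
    batch = []
    for _ in range(batch_size):
        task = rng.choice(task_ids)
        batch.append((generators[task](), task))  # (example, task label)
    return batch


def make_toy_generator(task):
    """Toy stand-in for a trained WGAN generator: returns a tagged placeholder."""
    return lambda: {"source": "fake", "task": task}


t = 3  # current task; its generator is trained at the beginning of the task
generators = {k: make_toy_generator(k) for k in range(1, t + 1)}
batch = build_task_classifier_batch(generators, batch_size=64, rng=random.Random(0))
```

Since no real image ever enters the batch, the real/fake nature of an input carries no information about its task label, which removes the shortcut described above.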

Quantitative results for figures
To foster future comparisons with our work, we report in this section quantitative results that are represented in Fig. 4 and 5 of the main paper. Such quantities can be found in Tab

Table 7: Performance of our model based on generative memory against a baseline comprising a class-conditional generator for each task (C-Gen).

Comparison w.r.t. conditional generators
To validate the benefit of employing generated examples for the rehearsal of task prediction only, we compare our model based on generative memory (Sec. 4.4 of the main paper) against a further baseline. To this end, we still train a WGAN-GP for each task, but instead of training unconditional models we train class-conditional ones, following the AC-GAN framework [27]. After training N conditional generators, we train the backbone model by generating labeled examples in an i.i.d. fashion. We refer to this baseline as C-Gen, and report the final results in Tab. 7. The results presented for Split SVHN and Split CIFAR-10 illustrate that generative rehearsal at a task level, instead of at a class level, is beneficial on both datasets. We believe our method behaves better for two reasons. First, our model never updates classification heads guided by a loss function computed on generated examples (i.e., potentially poor in visual quality). Therefore, when the task label is predicted correctly, the classification accuracy is comparable to the one achieved in a task-incremental setup. Second, given equivalent generator capacities, conditional generative modeling may be more complex than unconditional modeling, potentially resulting in higher degradation of generated examples.
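The first argument above can be illustrated with a toy sketch of class-incremental inference, in which a task classifier routes each input to a task-specific head. All names and the toy classifiers below are hypothetical. The heads are assumed to be perfect within their own task, so every error in the example is caused by task misprediction.

```python
def predict_class(x, task_classifier, heads):
    """Predict a task for x, then classify x with the head of that task."""
    task = task_classifier(x)
    return task, heads[task](x)


def class_incremental_accuracy(examples, task_classifier, heads):
    """Fraction of (x, task, label) examples whose class is predicted correctly."""
    correct = 0
    for x, _, label in examples:
        _, predicted = predict_class(x, task_classifier, heads)
        correct += int(predicted == label)
    return correct / len(examples)


task_classes = {0: ["cat", "dog"], 1: ["car", "truck"]}


def make_head(classes):
    """Toy head: perfect on its own classes, falls back to its first class otherwise."""
    return lambda x: x if x in classes else classes[0]


heads = {k: make_head(v) for k, v in task_classes.items()}


def task_classifier(x):
    """Toy task classifier that misroutes "truck" to task 0."""
    return 0 if x in ("cat", "dog", "truck") else 1


examples = [("cat", 0, "cat"), ("dog", 0, "dog"), ("car", 1, "car"), ("truck", 1, "truck")]
accuracy = class_incremental_accuracy(examples, task_classifier, heads)
```

In this toy setting the class-incremental accuracy equals the task prediction accuracy (3 out of 4), because the heads are never degraded by training on generated examples.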

Confidence of task-incremental results
To validate the gap between the performance of our model and that of HAT (Tab. 1 in the main paper), we assess the confidence of this experiment by repeating it 5 times with different random seeds. Results in Tab. 8 show that the margin between our proposal and HAT is slight, yet consistent.
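A minimal sketch of how such a repeated-seed comparison could be summarized is given below. The accuracy values are placeholders invented for illustration and are not the numbers of Tab. 8. Reporting the mean and sample standard deviation, and pairing runs by seed, are assumptions of this sketch.

```python
import statistics


def summarize(runs):
    """Return the mean and sample standard deviation of per-seed accuracies."""
    return statistics.mean(runs), statistics.stdev(runs)


def margin_is_consistent(ours, baseline):
    """True if our accuracy exceeds the baseline for every seed (runs paired by seed)."""
    return all(a > b for a, b in zip(ours, baseline))


# Placeholder accuracies for 5 seeds (hypothetical, NOT taken from Tab. 8).
placeholder_ours = [0.91, 0.92, 0.90, 0.91, 0.92]
placeholder_hat = [0.90, 0.91, 0.89, 0.90, 0.91]

mean_ours, std_ours = summarize(placeholder_ours)
mean_hat, std_hat = summarize(placeholder_hat)
consistent = margin_is_consistent(placeholder_ours, placeholder_hat)
```

A slight yet consistent margin corresponds to a small difference between the two means together with a positive gap for every seed.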