A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays

Ravaglia, L.; Rusci, M.; Nadalini, D.; Capotondi, A.; Conti, F.; Benini, L.

doi:10.1109/JETCAS.2021.3121554

In the last few years, research and development on Deep Learning models and techniques for ultra-low-power devices (in a word, TinyML) has mainly focused on a train-then-deploy assumption, with static models that cannot be adapted to newly collected data without cloud-based data collection and fine-tuning. Latent Replay-based Continual Learning (CL) techniques (Pellegrini et al., 2020) enable online, serverless adaptation in principle, but so far they have been too computation- and memory-hungry for ultra-low-power TinyML devices, which are typically based on microcontrollers. In this work, we introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power (PULP) processor. We rethink the baseline Latent Replay CL algorithm, leveraging quantization of the frozen stage of the model and of the Latent Replays (LRs) to reduce their memory cost with minimal impact on accuracy. In particular, 8-bit compression of the LR memory proves to be almost lossless (-0.26% with 3000 LRs) compared to the full-precision baseline implementation while requiring 4 times less memory; 7-bit compression can also be used with an additional minimal accuracy degradation (up to 5%). We also introduce optimized primitives for forward and backward propagation on the PULP processor, together with data tiling strategies that fully exploit its memory hierarchy while maximizing efficiency. Our results show that, by combining these techniques, continual learning can be achieved in practice using less than 64 MB of memory, an amount compatible with embedding in TinyML devices. On an advanced 22 nm prototype of our platform, called VEGA, the proposed solution performs on average 65 times faster than a low-power STM32 L4 microcontroller while being 37 times more energy efficient, enough for a lifetime of 535 h when learning a new mini-batch of data once every minute.
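The 4 times memory saving quoted for 8-bit Latent Replays follows directly from storing one byte per activation instead of a 4-byte FP32 value. The sketch below illustrates that arithmetic with a simple uniform quantizer. It is only an illustration under stated assumptions, not the quantization scheme used in the paper: the sample activations, the non-negative (post-ReLU) range, the per-tensor `scale`, and the helper names `quantize_uint8` and `dequantize` are all hypothetical.

```python
from array import array


def quantize_uint8(values, scale):
    """Uniform quantization (assumed scheme): q = clamp(round(x / scale), 0, 255)."""
    return array("B", (min(255, max(0, round(v / scale))) for v in values))


def dequantize(codes, scale):
    """Map 8-bit codes back to approximate FP32 values: x ~ q * scale."""
    return array("f", (q * scale for q in codes))


# Hypothetical latent activations, assumed non-negative (e.g. taken after a ReLU).
latents = array("f", [0.0, 0.5, 1.25, 3.0, 6.0])

# Assumed per-tensor scale so that the largest value maps to code 255.
scale = max(latents) / 255

codes = quantize_uint8(latents, scale)
restored = dequantize(codes, scale)

# Storage cost: 4 bytes per FP32 value versus 1 byte per 8-bit code.
fp32_bytes = latents.itemsize * len(latents)
int8_bytes = codes.itemsize * len(codes)
print("memory ratio:", fp32_bytes // int8_bytes)

# The rounding error of a uniform quantizer is bounded by half a step.
max_err = max(abs(a - b) for a, b in zip(latents, restored))
print("max abs error:", max_err, "half step:", scale / 2)
```

With this kind of quantizer the reconstruction error of each stored activation is at most half a quantization step, which gives some intuition for why 8-bit storage can be close to lossless; the accuracy figures themselves come from the paper's experiments, not from this sketch.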

A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays / Ravaglia, L., Rusci, M., Nadalini, D., Capotondi, A., Conti, F., Benini, L.. - In: IEEE JOURNAL OF EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS. - ISSN 2156-3357. - 11:4(2021), pp. 789-802. [10.1109/JETCAS.2021.3121554]

Publication year: 2021
Journal: IEEE JOURNAL OF EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS
Volume: 11
Issue: 4
First page: 789
Last page: 802
DOI: https://dx.doi.org/10.1109/JETCAS.2021.3121554
WoS ID: WOS:000730514000025
Scopus ID: 2-s2.0-85118254450
Type: Journal article

Files in this record:

File	Access	Type	Size	Format
A_TinyML_Platform_for_On-Device_Continual_Learning_With_Quantized_Latent_Replays.pdf	Open access	VOR - Version published by the publisher	3.29 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.
In case of copyright violation, contact Iris Support.