Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning

Distributed implementations are crucial for speeding up large-scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. Straggling workers are a significant bottleneck for the per-iteration completion time in distributed synchronous GD. Coded distributed computation techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers. In this paper, we introduce a novel paradigm of dynamic coded computation, which assigns redundant data to workers in order to gain the flexibility to choose dynamically among a set of possible codes depending on the past straggling behavior. In particular, we propose gradient coding (GC) with dynamic clustering, called GC-DC, which regulates the number of stragglers in each cluster by dynamically forming the clusters at each iteration. When the straggling behavior is correlated over time, GC-DC adapts to it: at each iteration, GC-DC aims to distribute the stragglers across clusters as uniformly as possible based on the past straggler behavior. For both homogeneous and heterogeneous worker models, we numerically show that GC-DC provides significant improvements in the average per-iteration completion time without an increase in the communication load compared to the original GC scheme.
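The idea of spreading stragglers across clusters as uniformly as possible can be illustrated with a minimal sketch. This is not the GC-DC algorithm from the paper: it assumes, purely for illustration, that any worker may be placed in any cluster, that clusters have equal size, and that the workers that straggled in the previous iteration are the ones expected to straggle next. The names `spread_stragglers` and `max_stragglers_per_cluster` are hypothetical.

```python
from typing import List


def spread_stragglers(was_straggler: List[bool], num_clusters: int) -> List[List[int]]:
    """Form equal-size clusters so that past stragglers are spread evenly.

    Illustrative assumption: every worker can join every cluster. A real
    scheme would also have to respect which data each worker stores.
    """
    n = len(was_straggler)
    if n % num_clusters != 0:
        raise ValueError("number of workers must be divisible by num_clusters")
    # Put past stragglers first (sorted() is stable, so ties keep index order).
    order = sorted(range(n), key=lambda w: not was_straggler[w])
    clusters: List[List[int]] = [[] for _ in range(num_clusters)]
    # Deal workers round-robin, so consecutive stragglers land in different clusters.
    for position, worker in enumerate(order):
        clusters[position % num_clusters].append(worker)
    return clusters


def max_stragglers_per_cluster(clusters: List[List[int]], was_straggler: List[bool]) -> int:
    """Return the largest number of past stragglers found in any single cluster."""
    return max(sum(was_straggler[w] for w in cluster) for cluster in clusters)


if __name__ == "__main__":
    # 12 workers; workers 0, 1, 2 and 5 straggled in the previous iteration.
    past = [True, True, True, False, False, True,
            False, False, False, False, False, False]
    static = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
    dynamic = spread_stragglers(past, num_clusters=4)
    print("static :", static, max_stragglers_per_cluster(static, past))
    print("dynamic:", dynamic, max_stragglers_per_cluster(dynamic, past))
```

In this toy setting, the fixed clustering puts three past stragglers in one cluster, while the round-robin assignment leaves at most one per cluster. Under a per-cluster code that tolerates only a limited number of stragglers, that kind of balancing is what keeps every cluster decodable.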

Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning / Buyukates, B.; Ozfatura, E.; Ulukus, S.; Gunduz, D.. - In: IEEE TRANSACTIONS ON COMMUNICATIONS. - ISSN 0090-6778. - 714:6(2022), pp. 3317-3332. [10.1109/TCOMM.2022.3166902]

Full record

Publication year	2022
Journal	IEEE TRANSACTIONS ON COMMUNICATIONS
Volume	714
Issue	6
First page	3317
Last page	3332
DOI	https://dx.doi.org/10.1109/TCOMM.2022.3166902
WoS ID	WOS:001013660800012
Scopus ID	2-s2.0-85128302247
Citation	Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning / Buyukates, B.; Ozfatura, E.; Ulukus, S.; Gunduz, D.. - In: IEEE TRANSACTIONS ON COMMUNICATIONS. - ISSN 0090-6778. - 714:6(2022), pp. 3317-3332. [10.1109/TCOMM.2022.3166902]
All authors	Buyukates, B.; Ozfatura, E.; Ulukus, S.; Gunduz, D.
Type	Journal article

Files in this record:

File	Access	Type	Size	Format
Gradient_Coding_with_Dynamic_Clustering_for_Straggler-Tolerant_Distributed_Learning (1).pdf	Restricted access	Author's version, revised and accepted for publication	7.59 MB	Adobe PDF
2103.01206.pdf	Open access	Author's version, revised and accepted for publication	249.23 kB	Adobe PDF
Gradient_Coding_With_Dynamic_Clustering_for_Straggler-Tolerant_Distributed_Learning.pdf	Restricted access	Publisher's published version	1.54 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1280023