Straggler-aware distributed learning: Communication-computation latency trade-off

Ozfatura, E.; Ulukus, S.; Gunduz, D.

doi:10.3390/E22050544

When gradient descent (GD) is scaled to many parallel workers for large-scale machine learning applications, its per-iteration computation time is limited by straggling workers. Straggling workers can be tolerated by assigning redundant computations and/or coding across data and computations, but in most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations. Imposing such a limitation results in two drawbacks: over-computation due to inaccurate prediction of the straggling behavior, and under-utilization due to discarding partial computations carried out by stragglers. To overcome these drawbacks, we consider multi-message communication (MMC) by allowing multiple computations to be conveyed from each worker per iteration, and propose novel straggler avoidance techniques for both coded computation and coded communication with MMC. We analyze how the proposed designs can be employed efficiently to balance computation and communication latency. Furthermore, we identify the advantages and disadvantages of these designs in different settings through extensive simulations, both model-based and based on a real implementation on Amazon EC2 servers, and demonstrate that the proposed schemes with MMC can improve upon existing straggler avoidance schemes.
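The contrast the abstract draws between one message per worker and MMC can be illustrated with a toy simulation. The sketch below is not the paper's coded scheme: it assumes plain uncoded replication with a cyclic assignment, a hypothetical delay model (each task takes 1 plus an exponential random time), and zero communication latency. The names `simulate` and `completion` are invented for this illustration.

```python
import random


def completion(messages, num_partitions):
    """Return (time, messages received) when every partition is covered.

    `messages` is a list of (arrival_time, partitions_in_message) pairs.
    """
    covered = set()
    ordered = sorted(messages, key=lambda m: m[0])
    for count, (t, parts) in enumerate(ordered, start=1):
        covered.update(parts)
        if len(covered) == num_partitions:
            return t, count
    raise ValueError("some partition was never delivered")


def simulate(num_workers=10, load=3, seed=0):
    """Compare single-message and multi-message reporting on the same delays."""
    rng = random.Random(seed)
    # Assumed cyclic assignment: worker i processes partitions
    # i, i+1, ..., i+load-1 (mod num_workers), in that order.
    assignments = [[(i + j) % num_workers for j in range(load)]
                   for i in range(num_workers)]

    single, multi = [], []
    for tasks in assignments:
        t = 0.0
        for p in tasks:
            # Hypothetical delay model: 1 + Exp(1) time units per task.
            t += 1.0 + rng.expovariate(1.0)
            # MMC: each finished computation is sent immediately.
            multi.append((t, [p]))
        # Single message: everything is sent after the last task finishes.
        single.append((t, list(tasks)))

    return completion(single, num_workers), completion(multi, num_workers)


if __name__ == "__main__":
    (t_single, m_single), (t_multi, m_multi) = simulate()
    print(f"single-message: time={t_single:.2f}, messages={m_single}")
    print(f"multi-message:  time={t_multi:.2f}, messages={m_multi}")
```

In this toy model the multi-message completion time can never exceed the single-message one, because every partial result arrives no later than it would inside a single final message. The number of messages the PS must receive can be larger, however, which is the communication side of the trade-off once per-message latency is no longer assumed to be zero.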

Straggler-aware distributed learning: Communication-computation latency trade-off / Ozfatura, E.; Ulukus, S.; Gunduz, D.. - In: ENTROPY. - ISSN 1099-4300. - 22:5(2020), pp. 544-544. [10.3390/E22050544]

Publication year	2020
Journal	ENTROPY
Volume	22
Issue	5
First page	544
Last page	544
DOI	https://dx.doi.org/10.3390/E22050544
WoS code	WOS:000541900700090
Scopus code	2-s2.0-85085695701
Citation	Straggler-aware distributed learning: Communication-computation latency trade-off / Ozfatura, E.; Ulukus, S.; Gunduz, D.. - In: ENTROPY. - ISSN 1099-4300. - 22:5(2020), pp. 544-544. [10.3390/E22050544]
All authors	Ozfatura, E.; Ulukus, S.; Gunduz, D.
Type	Journal article

Files in this item:

File	Size	Format
entropy-22-00544-v2.pdf (Open access; Type: VOR, version published by the publisher)	1.15 MB	Adobe PDF

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.