On The Importance of Execution Ordering in Graph-Based Distributed Machine Learning Systems

Research output: Contribution to conference › Paper › peer-review

Abstract

The execution of operations in distributed machine learning systems
has largely ignored the dependencies between communication and
computation ops. In this paper, we make the case that model-aware
ordering of operations at individual machines can decrease the step
time of a training iteration in distributed machine learning systems
while also improving network utilization. The contributions of this
work are:
• We introduce a metric for quantitatively measuring the efficiency
of ordering of ops (§1).
• We propose an ordering heuristic for Model-Replica with Parameter Server systems (§2.1).
• We outline a roadmap for developing fast heuristics for model-aware ordering of ops in Model-Replica systems with all-reduce
synchronization (§2.2).
• We evaluate our ordering mechanism on Model-Replica with
Parameter Server on TensorFlow and show that training
efficiency can be improved by up to 78% through better ordering
of tasks, with a 46% reduction in step time.
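
The claim that the order of communication ops affects step time can be illustrated with a toy calculation. The sketch below is not the paper's metric or heuristic. It assumes a hypothetical worker that receives one layer's parameters at a time over a single link and can run layer i only after layer i-1 has finished and its own parameters have arrived. The function name `step_time` and all timings are made up for illustration.

```python
def step_time(transfer_order, transfer_time, compute_time):
    """Finish time of one forward pass on a toy worker.

    Assumptions (illustrative, not from the paper): transfers are
    serialized over one link, and layers run in index order on a
    single compute device.
    """
    # Time at which each layer's parameters finish arriving.
    arrival = {}
    clock = 0.0
    for layer in transfer_order:
        clock += transfer_time[layer]
        arrival[layer] = clock

    # A layer starts once the previous layer is done and its
    # parameters are available, whichever happens later.
    finish = 0.0
    for layer in range(len(compute_time)):
        start = max(finish, arrival[layer])
        finish = start + compute_time[layer]
    return finish


# Hypothetical per-layer costs in arbitrary time units.
transfer_time = [1.0, 1.0, 1.0]
compute_time = [1.0, 1.0, 1.0]

# Transfers issued in the order the model consumes them.
in_model_order = step_time([0, 1, 2], transfer_time, compute_time)
# Same transfers, issued in reverse order.
in_reverse_order = step_time([2, 1, 0], transfer_time, compute_time)

print(in_model_order, in_reverse_order)
```

In this toy setting the same three transfers and the same three compute ops finish at time 4.0 when parameters arrive in the order the model consumes them, and at time 6.0 when they arrive in reverse, because computation cannot overlap with communication until the first layer's parameters are present.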
Original language: English (US)
State: Published - 2018
Event: The Conference of Systems and Machine Learning
Duration: Aug 18 2020 → Aug 18 2020
