Abstract
Execution of operations in distributed machine learning systems
has largely ignored dependencies between communication and
computation ops. In this paper, we make the case that model-aware
ordering of operations at individual machines can decrease the step
time of a training iteration in distributed machine learning systems
while also improving network utilization. The contributions of this
work are:
• We introduce a metric for quantitatively measuring the efficiency
of ordering of ops (§1).
• We propose an ordering heuristic for Model-Replica with Parameter Server systems (§2.1).
• We chalk out a roadmap for developing fast heuristics for model-aware ordering of ops in Model-Replica systems with all-reduce
synchronization (§2.2).
• We evaluate our ordering mechanism on Model-Replica with
Parameter Server on TensorFlow and show that the training
efficiency can be improved by up to 78% through better ordering
of tasks, with a 46% reduction in step time.
Original language | English (US) |
---|---|
State | Published - 2018 |
Event | The Conference of Systems and Machine Learning - Duration: Aug 18 2020 → Aug 18 2020 |