On The Importance of Execution Ordering in Graph-Based Distributed Machine Learning Systems

Research output: Contribution to conference › Paper

Abstract

Execution of operations in distributed machine learning systems has largely ignored dependencies between communication and computation ops. In this paper, we make the case that model-aware ordering of operations at individual machines can decrease the step time of training iteration in distributed machine learning systems while also improving network utilization. The contributions of this work are:
• We introduce a metric for quantitatively measuring the efficiency of ordering of ops (§1).
• We propose an ordering heuristic for Model-Replica with Parameter Server systems (§2.1).
• We chalk out a roadmap for developing fast heuristics for model-aware ordering of ops in Model-Replica systems with all-reduce synchronization (§2.2).
• We evaluate our ordering mechanism on Model-Replica with Parameter Server on TensorFlow and show that the training efficiency can be improved by up to 78% through better ordering of tasks with 46% reduction in step time.
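
The effect of ordering on step time can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's metric or heuristic: a worker receives the parameters of each layer over a single link, one tensor at a time, and a layer can compute only after its parameters have arrived and the previous layer has finished. The function name `step_time` and all timings are hypothetical.

```python
def step_time(order, transfer, compute):
    """Finish time of a forward pass when parameters arrive in `order`.

    Toy model (assumed, not from the paper): one link, one transfer at a
    time, and layers that must run in model order.
    """
    arrival = {}
    clock = 0.0
    for layer in order:
        # The link is busy until this layer's parameters are fully received.
        clock += transfer[layer]
        arrival[layer] = clock

    done = 0.0
    for layer in range(len(compute)):
        # A layer starts once its parameters are present and the
        # previous layer has finished computing.
        start = max(done, arrival[layer])
        done = start + compute[layer]
    return done


# Hypothetical per-layer transfer and compute times (arbitrary units).
transfer = [1.0, 2.0, 3.0]
compute = [2.0, 2.0, 2.0]

in_order = step_time([0, 1, 2], transfer, compute)        # model-aware order
reversed_order = step_time([2, 1, 0], transfer, compute)  # model-oblivious order
print(in_order, reversed_order)
```

With these made-up timings, sending parameters in the order the layers consume them finishes in 8.0 time units, while the reverse order takes 12.0, because computation sits idle until the first layer's parameters arrive last.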
Original language: English (US)
State: Published - 2018
Event: The Conference of Systems and Machine Learning
Duration: Aug 18 2020 to Aug 18 2020

Conference

Conference: The Conference of Systems and Machine Learning
Period: 8/18/20 to 8/18/20

Fingerprint

Learning systems
Servers
Communication

Cite this

Campbell, R. (2018). On The Importance of Execution Ordering in Graph-Based Distributed Machine Learning Systems. Paper presented at The Conference of Systems and Machine Learning.

@conference{4db0ce7273744f81bbf7ca05b3d3817e,
title = "On The Importance of Execution Ordering in Graph-Based Distributed Machine Learning Systems",
abstract = "Execution of operations in distributed machine learning systems has largely ignored dependencies between communication and computation ops. In this paper, we make the case that model-aware ordering of operations at individual machines can decrease the step time of training iteration in distributed machine learning systems while also improving network utilization. The contributions of this work are: • We introduce a metric for quantitatively measuring the efficiency of ordering of ops (§1). • We propose an ordering heuristic for Model-Replica with Parameter Server systems (§2.1). • We chalk out a roadmap for developing fast heuristics for model-aware ordering of ops in Model-Replica systems with all-reduce synchronization (§2.2). • We evaluate our ordering mechanism on Model-Replica with Parameter Server on TensorFlow and show that the training efficiency can be improved by up to 78{\%} through better ordering of tasks with 46{\%} reduction in step time.",
author = "Roy Campbell",
year = "2018",
language = "English (US)",
note = "The Conference of Systems and Machine Learning; Conference date: 18-08-2020 Through 18-08-2020",

}

TY - CONF

T1 - On The Importance of Execution Ordering in Graph-Based Distributed Machine Learning Systems

AU - Campbell, Roy

PY - 2018

Y1 - 2018

N2 - Execution of operations in distributed machine learning systems has largely ignored dependencies between communication and computation ops. In this paper, we make the case that model-aware ordering of operations at individual machines can decrease the step time of training iteration in distributed machine learning systems while also improving network utilization. The contributions of this work are: • We introduce a metric for quantitatively measuring the efficiency of ordering of ops (§1). • We propose an ordering heuristic for Model-Replica with Parameter Server systems (§2.1). • We chalk out a roadmap for developing fast heuristics for model-aware ordering of ops in Model-Replica systems with all-reduce synchronization (§2.2). • We evaluate our ordering mechanism on Model-Replica with Parameter Server on TensorFlow and show that the training efficiency can be improved by up to 78% through better ordering of tasks with 46% reduction in step time.

M3 - Paper

ER -