TY - GEN
T1 - Reining in the outliers in map-reduce clusters using Mantri
AU - Ananthanarayanan, Ganesh
AU - Kandula, Srikanth
AU - Greenberg, Albert
AU - Stoica, Ion
AU - Lu, Yi
AU - Saha, Bikas
AU - Harris, Edward
PY - 2010
Y1 - 2010
N2 - Experience from an operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.
UR - http://www.scopus.com/inward/record.url?scp=85076916744&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076916744&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85076916744
T3 - Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2010
SP - 265
EP - 278
BT - Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2010
PB - USENIX Association
T2 - 9th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2010
Y2 - 4 October 2010 through 6 October 2010
ER -