TY - GEN
T1 - MON
T2 - 20th ACM Symposium on Operating Systems Principles, SOSP 2005
AU - Liang, Jin
AU - Ko, Steven Y.
AU - Gupta, Indranil
AU - Nahrstedt, Klara
PY - 2005
Y1 - 2005
N2 - The recent deployment of large distributed computing systems such as content distribution networks and the Planet-Lab has made it possible for researchers and practitioners to experiment with real-world, large-scale distributed applications. However, running an application in such an environment is difficult because of the scale and frequent node failures of such systems. Thus, a tool is needed that helps application developers and deployers manage their applications. Our goal in this work is to develop MON, an extremely lightweight and failure-resilient system for managing distributed applications. MON allows users to execute instant management commands on the distributed computing nodes, such as querying the current status of the application or starting/stopping a process on the distributed nodes. The commands are propagated to all the nodes and executed on each node, and the results are aggregated and returned. We believe the ability to execute such instant commands is especially useful for the initial deployment of a distributed application and for the monitoring and diagnostics of (unexpected) application failures.
AB - The recent deployment of large distributed computing systems such as content distribution networks and the Planet-Lab has made it possible for researchers and practitioners to experiment with real-world, large-scale distributed applications. However, running an application in such an environment is difficult because of the scale and frequent node failures of such systems. Thus, a tool is needed that helps application developers and deployers manage their applications. Our goal in this work is to develop MON, an extremely lightweight and failure-resilient system for managing distributed applications. MON allows users to execute instant management commands on the distributed computing nodes, such as querying the current status of the application or starting/stopping a process on the distributed nodes. The commands are propagated to all the nodes and executed on each node, and the results are aggregated and returned. We believe the ability to execute such instant commands is especially useful for the initial deployment of a distributed application and for the monitoring and diagnostics of (unexpected) application failures.
UR - http://www.scopus.com/inward/record.url?scp=84885574968&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84885574968&partnerID=8YFLogxK
U2 - 10.1145/1095810.1118585
DO - 10.1145/1095810.1118585
M3 - Conference contribution
AN - SCOPUS:84885574968
SN - 1595930795
SN - 9781595930798
T3 - Proceedings of the 20th ACM Symposium on Operating Systems Principles, SOSP 2005
BT - Proceedings of the 20th ACM Symposium on Operating Systems Principles, SOSP 2005
Y2 - 23 October 2005 through 26 October 2005
ER -