Hardware mechanism for dynamic extraction and relayout of program hot spots

Matthew C. Merten, Andrew R. Trick, Erik M. Nystrom, Ronald D. Barnes, Wen-mei W. Hwu

Research output: Contribution to journal › Conference article

Abstract

This paper presents a new mechanism for collecting and deploying runtime optimized code. The code-collecting component resides in the instruction retirement stage and lays out hot execution paths to improve instruction fetch rate as well as enable further code optimization. The code deployment component uses an extension to the Branch Target Buffer to migrate execution into the new code without modifying the original code. No significant delay is added to the total execution of the program due to these components. The code collection scheme enables safe runtime optimization along paths that span function boundaries. This technique provides a better platform for runtime optimization than trace caches, because the traces are longer and persist in main memory across context switches. Additionally, these traces are not as susceptible to transient behavior because they are restricted to frequently executed code. Empirical results show that on average this mechanism can achieve better instruction fetch rates using only 12 KB of hardware than a trace cache requiring 15 KB of hardware, while producing long, persistent traces more suited to optimization.

Original language: English (US)
Pages (from-to): 59-70
Number of pages: 12
Journal: Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA
State: Published - Jan 1 2000
Event: ISCA-27: The 27th Annual International Symposium on Computer Architecture - Vancouver, BC, Can
Duration: Jun 10 2000 → Jun 14 2000

Fingerprint

Hardware
Switches
Data storage equipment

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Hardware mechanism for dynamic extraction and relayout of program hot spots. / Merten, Matthew C.; Trick, Andrew R.; Nystrom, Erik M.; Barnes, Ronald D.; Hwu, Wen-mei W.

In: Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA, 01.01.2000, p. 59-70.

@article{58a6f342ce9642b7b815b7389e07e0e5,
title = "Hardware mechanism for dynamic extraction and relayout of program hot spots",
abstract = "This paper presents a new mechanism for collecting and deploying runtime optimized code. The code-collecting component resides in the instruction retirement stage and lays out hot execution paths to improve instruction fetch rate as well as enable further code optimization. The code deployment component uses an extension to the Branch Target Buffer to migrate execution into the new code without modifying the original code. No significant delay is added to the total execution of the program due to these components. The code collection scheme enables safe runtime optimization along paths that span function boundaries. This technique provides a better platform for runtime optimization than trace caches, because the traces are longer and persist in main memory across context switches. Additionally, these traces are not as susceptible to transient behavior because they are restricted to frequently executed code. Empirical results show that on average this mechanism can achieve better instruction fetch rates using only 12 KB of hardware than a trace cache requiring 15 KB of hardware, while producing long, persistent traces more suited to optimization.",
author = "Merten, {Matthew C.} and Trick, {Andrew R.} and Nystrom, {Erik M.} and Barnes, {Ronald D.} and Hwu, {Wen-mei W.}",
year = "2000",
month = "1",
day = "1",
language = "English (US)",
pages = "59--70",
journal = "Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA",
issn = "1063-6897",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Hardware mechanism for dynamic extraction and relayout of program hot spots

AU - Merten, Matthew C.

AU - Trick, Andrew R.

AU - Nystrom, Erik M.

AU - Barnes, Ronald D.

AU - Hwu, Wen-mei W.

PY - 2000/1/1

Y1 - 2000/1/1

N2 - This paper presents a new mechanism for collecting and deploying runtime optimized code. The code-collecting component resides in the instruction retirement stage and lays out hot execution paths to improve instruction fetch rate as well as enable further code optimization. The code deployment component uses an extension to the Branch Target Buffer to migrate execution into the new code without modifying the original code. No significant delay is added to the total execution of the program due to these components. The code collection scheme enables safe runtime optimization along paths that span function boundaries. This technique provides a better platform for runtime optimization than trace caches, because the traces are longer and persist in main memory across context switches. Additionally, these traces are not as susceptible to transient behavior because they are restricted to frequently executed code. Empirical results show that on average this mechanism can achieve better instruction fetch rates using only 12 KB of hardware than a trace cache requiring 15 KB of hardware, while producing long, persistent traces more suited to optimization.

UR - http://www.scopus.com/inward/record.url?scp=0033700757&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0033700757&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:0033700757

SP - 59

EP - 70

JO - Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA

JF - Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA

SN - 1063-6897

ER -