An architectural framework for runtime optimization

Matthew C. Merten, Andrew R. Trick, Ronald D. Barnes, Erik M. Nystrom, Christopher N. George, John C. Gyllenhaal, Wen-Mei W. Hwu

Research output: Contribution to journal › Article

Abstract

Wide-issue processors continue to achieve higher performance by exploiting greater instruction-level parallelism. Dynamic techniques such as out-of-order execution and hardware speculation have proven effective at increasing instruction throughput. Runtime optimization promises to provide an even higher level of performance by adaptively applying aggressive code transformations on a larger scope. This paper presents a new hardware mechanism for generating and deploying runtime optimized code. The mechanism can be viewed as a filtering system that resides in the retirement stage of the processor pipeline, accepts an instruction execution stream as input, and produces instruction profiles and sets of linked, optimized traces as output. The code deployment mechanism uses an extension to the branch prediction mechanism to migrate execution into the new code without modifying the original code. These new components do not add delay to the execution of the program except during short bursts of reoptimization. This technique provides a strong platform for runtime optimization because the hot execution regions are extracted, optimized, and written to main memory for execution and because these regions persist across context switches. The current design of the framework supports a suite of optimizations, including partial function inlining (even into shared libraries), code straightening optimizations, loop unrolling, and peephole optimizations.
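The retirement-stage filtering idea in the abstract can be illustrated with a minimal software model: a counter-based filter observes retired branches, flags the ones that cross a hotness threshold, and grows a linear trace by following each branch's most frequently observed successor. This is only an illustrative sketch; the threshold, the data structures, and the example addresses below are assumptions for exposition, not the paper's actual hardware design.

```python
# Minimal software model of a retirement-stage hot-spot filter and trace
# former, loosely inspired by the mechanism described in the abstract.
# All thresholds and structures here are illustrative assumptions.

from collections import Counter, defaultdict

HOT_THRESHOLD = 8  # retirements before a branch counts as "hot" (assumed value)

class HotSpotFilter:
    def __init__(self):
        self.exec_counts = Counter()            # per-branch retirement counts
        self.successors = defaultdict(Counter)  # observed control-flow edges

    def retire(self, pc, next_pc):
        """Observe one retired branch: count it and record where it went."""
        self.exec_counts[pc] += 1
        self.successors[pc][next_pc] += 1

    def hot_branches(self):
        """Branches whose retirement count has crossed the hotness threshold."""
        return {pc for pc, n in self.exec_counts.items() if n >= HOT_THRESHOLD}

    def form_trace(self, start_pc, max_len=8):
        """Grow a linear trace from a hot start point by repeatedly taking
        the most frequent successor, stopping at a back-edge."""
        trace, pc, seen = [start_pc], start_pc, {start_pc}
        while len(trace) < max_len and self.successors[pc]:
            nxt = self.successors[pc].most_common(1)[0][0]
            if nxt in seen:  # loop back-edge: keep the trace linear and stop
                break
            trace.append(nxt)
            seen.add(nxt)
            pc = nxt
        return trace
```

For example, feeding the filter ten iterations of a three-branch loop marks the loop head as hot and yields a trace covering the loop body once, mirroring how a hot region would be extracted before optimization.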

Original language: English (US)
Pages (from-to): 567-589
Number of pages: 23
Journal: IEEE Transactions on Computers
Volume: 50
Issue number: 6
DOI: 10.1109/12.931894
State: Published - Jun 1 2001


Keywords

  • Code layout
  • Dynamic optimization
  • Hardware profiling
  • Low-overhead profiling
  • Partial function inlining
  • Postlink optimization
  • Program hot spot
  • Runtime optimization
  • Trace formation and optimization

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

Merten, M. C., Trick, A. R., Barnes, R. D., Nystrom, E. M., George, C. N., Gyllenhaal, J. C., & Hwu, W-M. W. (2001). An architectural framework for runtime optimization. IEEE Transactions on Computers, 50(6), 567-589. https://doi.org/10.1109/12.931894

