Beating in-order stalls with "flea-flicker" two-pass pipelining

R. D. Barnes, Sanjay Jeram Patel, E. M. Nystrom, N. Navarro, J. W. Sias, Wen-Mei W. Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Accommodating the uncertain latency of load instructions is one of the most vexing problems in in-order microarchitecture design and compiler development. Compilers can generate schedules with a high degree of instruction-level parallelism but cannot effectively accommodate unanticipated latencies; incorporating traditional out-of-order execution into the microarchitecture hides some of this latency but redundantly performs work done by the compiler and adds pipeline stages. Although effective techniques, such as prefetching and threading, have been proposed to deal with anticipable, long-latency misses, the shorter, more diffuse stalls due to difficult-to-anticipate, first- or second-level misses are less easily hidden on in-order architectures. This paper addresses this problem by proposing a microarchitectural technique, referred to as two-pass pipelining, wherein the program executes on two in-order back-end pipelines coupled by a queue. The "advance" pipeline executes instructions greedily, without stalling on unanticipated latency dependences (executing independent instructions while otherwise blocking instructions are deferred). The "backup" pipeline allows concurrent resolution of instructions that were deferred in the other pipeline, resulting in the absorption of shorter misses and the overlap of longer ones. This paper argues that this design is both achievable and a good use of transistor resources and shows results indicating that it can deliver significant speedups for in-order processor designs.
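The abstract describes the advance/backup division of labor only at a high level. The following is a minimal, illustrative Python sketch of that idea, not the authors' design: the names (Instr, advance_pass, backup_pass, POISON), the one-instruction-per-cycle timing, and the simplified instruction set are all assumptions made for the example.

```python
# Illustrative sketch (assumed, not from the paper): an "advance" pass runs the
# in-order schedule greedily and defers instructions whose operands are not yet
# available; a "backup" pass re-executes the deferred instructions from a queue
# once the missing values (e.g. returning cache misses) arrive.
from collections import deque
from dataclasses import dataclass

POISON = object()  # marker for a value the advance pass could not produce

@dataclass
class Instr:
    op: str                 # "load" or "alu" is enough for this sketch
    dst: str
    srcs: tuple
    ready_cycle: int = 0    # cycle at which a load's data returns (loads only)
    value: int = 0          # memory value delivered by the load (loads only)

def execute(ins, srcs):
    """Stand-in functional unit."""
    return ins.value if ins.op == "load" else sum(srcs)

def advance_pass(program, regs, cycle=0):
    """Advance pipeline: execute greedily, deferring any instruction whose
    operands are unavailable instead of stalling on it."""
    deferred = deque()
    for ins in program:
        srcs = [regs[s] for s in ins.srcs]
        load_not_ready = ins.op == "load" and ins.ready_cycle > cycle
        if POISON in srcs or load_not_ready:
            regs[ins.dst] = POISON   # record that this result is still unknown
            deferred.append(ins)     # hand the instruction to the backup queue
        else:
            regs[ins.dst] = execute(ins, srcs)
        cycle += 1
    return deferred, cycle

def backup_pass(deferred, regs, cycle):
    """Backup pipeline: re-execute deferred instructions in program order once
    their operands become available, absorbing the remaining miss latency."""
    while deferred:
        ins = deferred.popleft()
        while ins.op == "load" and ins.ready_cycle > cycle:
            cycle += 1               # wait out whatever latency remains
        regs[ins.dst] = execute(ins, [regs[s] for s in ins.srcs])
        cycle += 1
    return cycle

if __name__ == "__main__":
    # r1 <- load (misses; data back at cycle 8), r2 depends on r1,
    # r3 is independent and completes in the advance pass.
    regs = {"r0": 5, "r1": POISON, "r2": POISON, "r3": POISON}
    prog = [
        Instr("load", "r1", (), ready_cycle=8, value=40),
        Instr("alu", "r2", ("r1", "r0")),
        Instr("alu", "r3", ("r0", "r0")),
    ]
    deferred, cycle = advance_pass(prog, regs)
    cycle = backup_pass(deferred, regs, cycle)
    print(regs, "finished at cycle", cycle)   # r2 = 45, r3 = 10
```

In this toy run the independent instruction (r3) completes in the advance pass while the missing load and its dependent re-execute later from the queue, which is the behavior the abstract attributes to the coupled pipelines; real hardware concerns (register checkpointing, stores, exceptions) are deliberately omitted.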

Original language: English (US)
Title of host publication: Proceedings - 36th International Symposium on Microarchitecture, MICRO 2003
Publisher: IEEE Computer Society
Pages: 387-398
Number of pages: 12
ISBN (Electronic): 076952043X
DOI: 10.1109/MICRO.2003.1253243
State: Published - Jan 1 2003
Event: 36th International Symposium on Microarchitecture, MICRO 2003 - San Diego, United States
Duration: Dec 3 2003 - Dec 5 2003

Publication series

Name: Proceedings of the Annual International Symposium on Microarchitecture, MICRO
Volume: 2003-January
ISSN (Print): 1072-4451

Other

Other: 36th International Symposium on Microarchitecture, MICRO 2003
Country: United States
City: San Diego
Period: 12/3/03 - 12/5/03

Keywords

  • Computer aided instruction
  • Delay
  • Microarchitecture
  • Out of order
  • Parallel processing
  • Pipeline processing
  • Process design
  • Processor scheduling
  • Registers
  • Runtime

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Barnes, R. D., Patel, S. J., Nystrom, E. M., Navarro, N., Sias, J. W., & Hwu, W-M. W. (2003). Beating in-order stalls with "flea-flicker" two-pass pipelining. In Proceedings - 36th International Symposium on Microarchitecture, MICRO 2003 (pp. 387-398). [1253243] (Proceedings of the Annual International Symposium on Microarchitecture, MICRO; Vol. 2003-January). IEEE Computer Society. https://doi.org/10.1109/MICRO.2003.1253243
