Beating in-order stalls with "Flea-Flicker" two-pass pipelining

Ronald D. Barnes, John W. Sias, Erik M. Nystrom, Sanjay J. Patel, Jose Navarro, Wen-mei W. Hwu

Research output: Contribution to journal › Article

Abstract

While compilers have generally proven adept at planning useful static instruction-level parallelism for in-order microarchitectures, the efficient accommodation of unanticipable latencies, like those of load instructions, remains a vexing problem. Traditional out-of-order execution hides some of these latencies, but repeats scheduling work already done by the compiler and adds pipeline overhead. Other techniques, such as prefetching and multithreading, can hide some anticipable, long-latency misses, but not the shorter, more diffuse stalls due to difficult-to-anticipate first- or second-level misses. Our work proposes a microarchitectural technique, two-pass pipelining, in which the program executes on two in-order back-end pipelines coupled by a queue. The "advance" pipeline defers instructions that reach dispatch with unready operands rather than stalling. The "backup" pipeline resolves the deferred instructions concurrently, overlapping useful "advanced" execution with miss resolution. An accompanying compiler technique and instruction marking further enhance the handling of miss latencies. Applied to an Itanium 2-like design, our technique achieves a speedup of 1.38x in mcf, the most memory-intensive SPECint2000 benchmark, and an average of 1.12x across other selected benchmarks, yielding between 32 and 67 percent of an idealized out-of-order design's speedup at much lower design cost and complexity.
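The deferral mechanism the abstract describes can be illustrated with a toy software model. This is a minimal sketch, not the paper's microarchitecture: the `advance_pass` and `backup_pass` functions, the register-dictionary representation, and the example program are all invented here for illustration. The advance pass executes instructions whose operands are ready and pushes the rest (those "poisoned" by an unresolved miss) onto a queue; the backup pass re-executes the deferred instructions in order once the miss resolves, so independent work overlaps miss latency.

```python
from collections import deque

# Toy model of two-pass pipelining (illustrative only, not the paper's design).
# An instruction is (dest_register, operation, source_registers).

def advance_pass(program, regs, miss_regs):
    """First pass: execute ready instructions, defer those whose operands
    depend (directly or transitively) on an unresolved cache miss."""
    deferred = deque()
    poisoned = set(miss_regs)              # registers without valid values yet
    for dest, op, srcs in program:
        if any(s in poisoned for s in srcs):
            deferred.append((dest, op, srcs))
            poisoned.add(dest)             # result invalid until the backup pass
        else:
            regs[dest] = op(*(regs[s] for s in srcs))
    return deferred

def backup_pass(deferred, regs, resolved):
    """Second pass: the misses have resolved, so deferred instructions
    now execute in program order with valid operands."""
    regs.update(resolved)
    while deferred:
        dest, op, srcs = deferred.popleft()
        regs[dest] = op(*(regs[s] for s in srcs))

# Example: r1 comes from a load that misses; r3 depends on it, r4 does not.
regs = {"r2": 5}
prog = [
    ("r3", lambda a, b: a + b, ("r1", "r2")),  # depends on the miss: deferred
    ("r4", lambda a: a * 2,    ("r2",)),       # independent: runs in advance pass
]
deferred = advance_pass(prog, regs, miss_regs={"r1"})
backup_pass(deferred, regs, resolved={"r1": 10})  # the load returns 10
print(regs["r4"], regs["r3"])  # -> 10 15
```

Note how `r4` is produced during the advance pass while the load is still outstanding; only the truly dependent `r3` waits for the backup pass, which is the overlap the technique exploits.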

Original language: English (US)
Pages (from-to): 18-33
Number of pages: 16
Journal: IEEE Transactions on Computers
Volume: 55
Issue number: 1
DOI: 10.1109/TC.2006.4
State: Published - Jan 1 2006

Keywords

  • Cache-miss tolerance
  • Out-of-order execution
  • Prefetching
  • Runahead execution

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

Beating in-order stalls with "Flea-Flicker" two-pass pipelining. / Barnes, Ronald D.; Sias, John W.; Nystrom, Erik M.; Patel, Sanjay J.; Navarro, Jose; Hwu, Wen Mei W.

In: IEEE Transactions on Computers, Vol. 55, No. 1, 01.01.2006, p. 18-33.

@article{919b506e6a544ce3b3700ebfc74474c3,
title = "Beating in-order stalls with {"}Flea-Flicker{"} two-pass pipelining",
keywords = "Cache-miss tolerance, Out-of-order execution, Prefetching, Runahead execution",
author = "Barnes, {Ronald D.} and Sias, {John W.} and Nystrom, {Erik M.} and Patel, {Sanjay J.} and Jose Navarro and Hwu, {Wen Mei W.}",
year = "2006",
month = "1",
day = "1",
doi = "10.1109/TC.2006.4",
language = "English (US)",
volume = "55",
pages = "18--33",
journal = "IEEE Transactions on Computers",
issn = "0018-9340",
publisher = "IEEE Computer Society",
number = "1",

}

Scopus record: http://www.scopus.com/inward/record.url?scp=33947629734&partnerID=8YFLogxK