Compiler-directed early load-address generation

Ben Chung Cheng, Daniel A. Connors, Wen-Mei W Hwu

Research output: Contribution to journal > Conference article

Abstract

Two orthogonal hardware techniques, table-based address prediction and early address calculation, for reducing the latency of load instructions have been recently proposed. The key idea behind both of these techniques is to speculatively perform loads early in the processor pipeline using predicted values for the loads' addresses. These techniques have required either a large hardware table or complex register bypass logic to be implemented in order to accurately predict the important loads in the presence of a large number of less-important loads. This paper proposes a compiler-directed approach that allows a streamlined version of both of these techniques to be effectively used together. The compiler provides directives to indicate which prediction mechanism to use or, when appropriate, that a prediction should not be made. The hardware therefore can be focused on their target cases so that a smaller prediction table and simpler bypass logic suffice. Our results show that through straightforward compiler heuristics, we obtain an average speedup of 34% with a 256-entry direct-mapped address table and only one cached register. And with the help of address profiling, an extra 4% of speedup can be obtained.
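
The directive scheme summarized in the abstract can be pictured with a small software model. This is only an illustrative sketch, not the paper's design: the directive names, the stride-based table entry, the PC indexing function, and the `LoadAddressPredictor` class are all assumptions made for this example. Only the 256-entry direct-mapped table and the single cached register come from the abstract.

```python
from enum import Enum


class Directive(Enum):
    """Hypothetical per-load compiler directives (names are assumptions)."""
    TABLE = "table"       # use table-based address prediction
    EARLY_CALC = "early"  # use early address calculation from a cached register
    NONE = "none"         # make no prediction for this load


TABLE_ENTRIES = 256  # table size quoted in the abstract


class LoadAddressPredictor:
    def __init__(self):
        # Direct-mapped table; each entry is (tag_pc, last_address, stride).
        # The stride field is an assumption, not taken from the abstract.
        self.table = [None] * TABLE_ENTRIES
        # The single cached register, stored as (register_name, value).
        self.cached_reg = None

    @staticmethod
    def _index(pc):
        # Assumed indexing: drop the two low PC bits, then wrap to table size.
        return (pc >> 2) % TABLE_ENTRIES

    def cache_register(self, name, value):
        self.cached_reg = (name, value)

    def predict(self, pc, directive, base_reg=None, offset=0):
        """Return a speculative load address, or None when no prediction is made."""
        if directive is Directive.NONE:
            return None
        if directive is Directive.EARLY_CALC:
            if self.cached_reg is not None and self.cached_reg[0] == base_reg:
                return self.cached_reg[1] + offset
            return None
        entry = self.table[self._index(pc)]
        if entry is None or entry[0] != pc:
            return None  # empty slot, or the slot belongs to a different load
        _, last_address, stride = entry
        return last_address + stride

    def update(self, pc, directive, actual_address):
        """Train the table, but only for loads the compiler marked TABLE."""
        if directive is not Directive.TABLE:
            return  # unmarked loads never occupy a table entry
        idx = self._index(pc)
        entry = self.table[idx]
        stride = actual_address - entry[1] if entry and entry[0] == pc else 0
        self.table[idx] = (pc, actual_address, stride)
```

In this model, loads tagged `NONE` or `EARLY_CALC` never take up table entries, which is one way to read the abstract's claim that a smaller table suffices once the hardware is focused on its target cases.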

Original language: English (US)
Pages (from-to): 138-147
Number of pages: 10
Journal: Proceedings of the Annual International Symposium on Microarchitecture
State: Published - Dec 1 1998
Event: Proceedings of the 1998 31st Annual ACM/IEEE International Symposium on Microarchitecture - Dallas, TX, USA
Duration: Nov 30 1998 to Dec 2 1998

Fingerprint

Hardware
Pipelines

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software

Cite this

Compiler-directed early load-address generation. / Cheng, Ben Chung; Connors, Daniel A.; Hwu, Wen-Mei W.

In: Proceedings of the Annual International Symposium on Microarchitecture, 01.12.1998, p. 138-147.

@article{abbb6d05c9734f83ae2626ffdc1507e5,
title = "Compiler-directed early load-address generation",
abstract = "Two orthogonal hardware techniques, table-based address prediction and early address calculation, for reducing the latency of load instructions have been recently proposed. The key idea behind both of these techniques is to speculatively perform loads early in the processor pipeline using predicted values for the loads' addresses. These techniques have required either a large hardware table or complex register bypass logic to be implemented in order to accurately predict the important loads in the presence of a large number of less-important loads. This paper proposes a compiler-directed approach that allows a streamlined version of both of these techniques to be effectively used together. The compiler provides directives to indicate which prediction mechanism to use or, when appropriate, that a prediction should not be made. The hardware therefore can be focused on their target cases so that a smaller prediction table and simpler bypass logic suffice. Our results show that through straightforward compiler heuristics, we obtain an average speedup of 34{\%} with a 256-entry direct-mapped address table and only one cached register. And with the help of address profiling, an extra 4{\%} of speedup can be obtained.",
author = "Cheng, {Ben Chung} and Connors, {Daniel A.} and Hwu, {Wen-Mei W}",
year = "1998",
month = "12",
day = "1",
language = "English (US)",
pages = "138--147",
journal = "Proceedings of the Annual International Symposium on Microarchitecture, MICRO",
issn = "1072-4451",

}

TY - JOUR

T1 - Compiler-directed early load-address generation

AU - Cheng, Ben Chung

AU - Connors, Daniel A.

AU - Hwu, Wen-Mei W

PY - 1998/12/1

Y1 - 1998/12/1

AB - Two orthogonal hardware techniques, table-based address prediction and early address calculation, for reducing the latency of load instructions have been recently proposed. The key idea behind both of these techniques is to speculatively perform loads early in the processor pipeline using predicted values for the loads' addresses. These techniques have required either a large hardware table or complex register bypass logic to be implemented in order to accurately predict the important loads in the presence of a large number of less-important loads. This paper proposes a compiler-directed approach that allows a streamlined version of both of these techniques to be effectively used together. The compiler provides directives to indicate which prediction mechanism to use or, when appropriate, that a prediction should not be made. The hardware therefore can be focused on their target cases so that a smaller prediction table and simpler bypass logic suffice. Our results show that through straightforward compiler heuristics, we obtain an average speedup of 34% with a 256-entry direct-mapped address table and only one cached register. And with the help of address profiling, an extra 4% of speedup can be obtained.

UR - http://www.scopus.com/inward/record.url?scp=0032315196&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032315196&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:0032315196

SP - 138

EP - 147

JO - Proceedings of the Annual International Symposium on Microarchitecture, MICRO

JF - Proceedings of the Annual International Symposium on Microarchitecture, MICRO

SN - 1072-4451

ER -