TY - JOUR
T1 - Run-time cache bypassing
AU - Johnson, Teresa L.
AU - Connors, Daniel A.
AU - Merten, Matthew C.
AU - Hwu, Wen Mei W.
N1 - Funding Information:
The authors would like to thank all the members of the IMPACT research group, whose comments and suggestions helped to improve the quality of this research. We would also like to thank the anonymous referees for their constructive comments. This research has been supported by the U.S. National Science Foundation (under grant CCR-9629948), Hewlett-Packard, Advanced Micro Devices, and Intel. This research was conducted while Teresa Johnson was a graduate student at the University of Illinois.
PY - 1999
Y1 - 1999
N2 - The growing disparity between processor and memory performance has made cache misses increasingly expensive. Additionally, data and instruction caches are not always used efficiently, resulting in large numbers of cache misses. Therefore, the importance of cache performance improvements at each level of the memory hierarchy will continue to grow. In numeric programs, there are several known compiler techniques for optimizing data cache performance. However, integer (nonnumeric) programs often have irregular access patterns that are more difficult for the compiler to optimize. In the past, cache management techniques such as cache bypassing were implemented manually at the machine-language-programming level. As the available chip area grows, it makes sense to spend more resources to allow intelligent control over the cache management. In this paper, we present an approach to improving cache effectiveness, taking advantage of the growing chip area, utilizing run-time adaptive cache management techniques, optimizing both performance and cost of implementation. Specifically, we are aiming to increase data cache effectiveness for integer programs. We propose a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior. This scheme is fully compatible with existing Instruction Set Architectures. This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity. Then, detailed trace-driven simulations of the integer applications are used to show that the implementation described in this paper can achieve performance close to that of the upper bound.
AB - The growing disparity between processor and memory performance has made cache misses increasingly expensive. Additionally, data and instruction caches are not always used efficiently, resulting in large numbers of cache misses. Therefore, the importance of cache performance improvements at each level of the memory hierarchy will continue to grow. In numeric programs, there are several known compiler techniques for optimizing data cache performance. However, integer (nonnumeric) programs often have irregular access patterns that are more difficult for the compiler to optimize. In the past, cache management techniques such as cache bypassing were implemented manually at the machine-language-programming level. As the available chip area grows, it makes sense to spend more resources to allow intelligent control over the cache management. In this paper, we present an approach to improving cache effectiveness, taking advantage of the growing chip area, utilizing run-time adaptive cache management techniques, optimizing both performance and cost of implementation. Specifically, we are aiming to increase data cache effectiveness for integer programs. We propose a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior. This scheme is fully compatible with existing Instruction Set Architectures. This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity. Then, detailed trace-driven simulations of the integer applications are used to show that the implementation described in this paper can achieve performance close to that of the upper bound.
UR - http://www.scopus.com/inward/record.url?scp=0033319665&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0033319665&partnerID=8YFLogxK
U2 - 10.1109/12.817393
DO - 10.1109/12.817393
M3 - Article
AN - SCOPUS:0033319665
SN - 0018-9340
VL - 48
SP - 1338
EP - 1354
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 12
ER -