Adaptive cache bypass and insertion for many-core accelerators

Xuhao Chen, Shengzhao Wu, Li Wen Chang, Wei Sheng Huang, Carl Pearson, Zhiying Wang, Wen-Mei W Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Many-core accelerators, e.g. GPUs, are widely used for accelerating general-purpose compute kernels. With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many regular applications. To support more applications with irregular memory access patterns, a cache hierarchy is introduced to the GPU architecture to capture input data sharing and mitigate the effect of irregular accesses. However, GPU caches suffer from poor efficiency due to severe contention, which makes it difficult to adopt heuristic management policies, and also limits system performance and energy efficiency. We propose an adaptive cache management policy specifically for many-core accelerators. The tag array of the L2 cache is enhanced with extra bits to track memory access history, and thus the locality information is captured and provided to the L1 cache as heuristics to guide its run-time bypass and insertion decisions. By preventing un-reused data from polluting the cache and alleviating contention, cache efficiency is significantly improved. As a result, system performance is improved by 31% on average for cache-sensitive benchmarks, compared to the baseline GPU architecture.
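The abstract describes a policy where reuse history (tracked in the L2 tag array in the actual hardware proposal) guides the L1's decision to bypass a block or insert it. As a rough software illustration only, not the authors' design, the following sketch models that idea: blocks with no observed reuse bypass the L1, while predicted-reused blocks are inserted at the MRU position. The class and method names (`ReuseTracker`, `AdaptiveL1Cache`) are hypothetical.

```python
from collections import OrderedDict

class ReuseTracker:
    """Toy stand-in for the L2-side history bits: counts how often
    each block address has been referenced."""
    def __init__(self):
        self.seen = {}  # block address -> access count

    def record(self, addr):
        self.seen[addr] = self.seen.get(addr, 0) + 1

    def has_reuse(self, addr):
        # A block touched more than once is predicted to have locality
        # worth caching in L1.
        return self.seen.get(addr, 0) > 1

class AdaptiveL1Cache:
    """Toy fully-associative L1 that bypasses blocks with no predicted
    reuse and inserts predicted-reused blocks at the MRU position."""
    def __init__(self, capacity, tracker):
        self.capacity = capacity
        self.tracker = tracker
        self.lines = OrderedDict()  # addr -> None; order = LRU .. MRU

    def access(self, addr):
        self.tracker.record(addr)
        if addr in self.lines:            # hit: promote to MRU
            self.lines.move_to_end(addr)
            return "hit"
        if not self.tracker.has_reuse(addr):
            return "bypass"               # streaming data skips the L1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict LRU line
        self.lines[addr] = None           # insert at MRU
        return "insert"
```

With this model, a purely streaming address sequence is bypassed entirely, leaving the cache for blocks that demonstrate reuse; this is the contention-reduction effect the abstract attributes to the proposed policy.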

Original language: English (US)
Title of host publication: 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014 - In Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014
Publisher: Association for Computing Machinery
Pages: 1-8
Number of pages: 8
ISBN (Print): 9781450328227
DOI: 10.1145/2613908.2613909
State: Published - Jan 1 2014
Event: 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014, Held in Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014 - Minneapolis, MN, United States
Duration: Jun 14 2014 - Jun 15 2014

Publication series

NameACM International Conference Proceeding Series

Other

Other: 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014, Held in Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014
Country: United States
City: Minneapolis, MN
Period: 6/14/14 - 6/15/14

Keywords

  • Bypass
  • Cache management
  • GPGPU
  • Insertion

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications

Cite this

Chen, X., Wu, S., Chang, L. W., Huang, W. S., Pearson, C., Wang, Z., & Hwu, W-M. W. (2014). Adaptive cache bypass and insertion for many-core accelerators. In 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014 - In Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014 (pp. 1-8). (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/2613908.2613909

Adaptive cache bypass and insertion for many-core accelerators. / Chen, Xuhao; Wu, Shengzhao; Chang, Li Wen; Huang, Wei Sheng; Pearson, Carl; Wang, Zhiying; Hwu, Wen-Mei W.

2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014 - In Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014. Association for Computing Machinery, 2014. p. 1-8 (ACM International Conference Proceeding Series).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Chen, X, Wu, S, Chang, LW, Huang, WS, Pearson, C, Wang, Z & Hwu, W-MW 2014, Adaptive cache bypass and insertion for many-core accelerators. in 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014 - In Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014. ACM International Conference Proceeding Series, Association for Computing Machinery, pp. 1-8, 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014, Held in Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014, Minneapolis, MN, United States, 6/14/14. https://doi.org/10.1145/2613908.2613909
Chen X, Wu S, Chang LW, Huang WS, Pearson C, Wang Z et al. Adaptive cache bypass and insertion for many-core accelerators. In 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014 - In Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014. Association for Computing Machinery. 2014. p. 1-8. (ACM International Conference Proceeding Series). https://doi.org/10.1145/2613908.2613909
Chen, Xuhao ; Wu, Shengzhao ; Chang, Li Wen ; Huang, Wei Sheng ; Pearson, Carl ; Wang, Zhiying ; Hwu, Wen-Mei W. / Adaptive cache bypass and insertion for many-core accelerators. 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014 - In Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014. Association for Computing Machinery, 2014. pp. 1-8 (ACM International Conference Proceeding Series).
@inproceedings{a0229a23ff9f42af9a4d4303a86d2164,
title = "Adaptive cache bypass and insertion for many-core accelerators",
abstract = "Many-core accelerators, e.g. GPUs, are widely used for accelerating general-purpose compute kernels. With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many regular applications. To support more applications with irregular memory access patterns, a cache hierarchy is introduced to the GPU architecture to capture input data sharing and mitigate the effect of irregular accesses. However, GPU caches suffer from poor efficiency due to severe contention, which makes it difficult to adopt heuristic management policies, and also limits system performance and energy efficiency. We propose an adaptive cache management policy specifically for many-core accelerators. The tag array of the L2 cache is enhanced with extra bits to track memory access history, and thus the locality information is captured and provided to the L1 cache as heuristics to guide its run-time bypass and insertion decisions. By preventing un-reused data from polluting the cache and alleviating contention, cache efficiency is significantly improved. As a result, system performance is improved by 31{\%} on average for cache-sensitive benchmarks, compared to the baseline GPU architecture.",
keywords = "Bypass, Cache management, GPGPU, Insertion",
author = "Xuhao Chen and Shengzhao Wu and Chang, {Li Wen} and Huang, {Wei Sheng} and Carl Pearson and Zhiying Wang and Hwu, {Wen-Mei W}",
year = "2014",
month = "1",
day = "1",
doi = "10.1145/2613908.2613909",
language = "English (US)",
isbn = "9781450328227",
series = "ACM International Conference Proceeding Series",
publisher = "Association for Computing Machinery",
pages = "1--8",
booktitle = "2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014 - In Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014",

}

TY - GEN

T1 - Adaptive cache bypass and insertion for many-core accelerators

AU - Chen, Xuhao

AU - Wu, Shengzhao

AU - Chang, Li Wen

AU - Huang, Wei Sheng

AU - Pearson, Carl

AU - Wang, Zhiying

AU - Hwu, Wen-Mei W

PY - 2014/1/1

Y1 - 2014/1/1

N2 - Many-core accelerators, e.g. GPUs, are widely used for accelerating general-purpose compute kernels. With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many regular applications. To support more applications with irregular memory access patterns, a cache hierarchy is introduced to the GPU architecture to capture input data sharing and mitigate the effect of irregular accesses. However, GPU caches suffer from poor efficiency due to severe contention, which makes it difficult to adopt heuristic management policies, and also limits system performance and energy efficiency. We propose an adaptive cache management policy specifically for many-core accelerators. The tag array of the L2 cache is enhanced with extra bits to track memory access history, and thus the locality information is captured and provided to the L1 cache as heuristics to guide its run-time bypass and insertion decisions. By preventing un-reused data from polluting the cache and alleviating contention, cache efficiency is significantly improved. As a result, system performance is improved by 31% on average for cache-sensitive benchmarks, compared to the baseline GPU architecture.

AB - Many-core accelerators, e.g. GPUs, are widely used for accelerating general-purpose compute kernels. With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many regular applications. To support more applications with irregular memory access patterns, a cache hierarchy is introduced to the GPU architecture to capture input data sharing and mitigate the effect of irregular accesses. However, GPU caches suffer from poor efficiency due to severe contention, which makes it difficult to adopt heuristic management policies, and also limits system performance and energy efficiency. We propose an adaptive cache management policy specifically for many-core accelerators. The tag array of the L2 cache is enhanced with extra bits to track memory access history, and thus the locality information is captured and provided to the L1 cache as heuristics to guide its run-time bypass and insertion decisions. By preventing un-reused data from polluting the cache and alleviating contention, cache efficiency is significantly improved. As a result, system performance is improved by 31% on average for cache-sensitive benchmarks, compared to the baseline GPU architecture.

KW - Bypass

KW - Cache management

KW - GPGPU

KW - Insertion

UR - http://www.scopus.com/inward/record.url?scp=84904471467&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84904471467&partnerID=8YFLogxK

U2 - 10.1145/2613908.2613909

DO - 10.1145/2613908.2613909

M3 - Conference contribution

AN - SCOPUS:84904471467

SN - 9781450328227

T3 - ACM International Conference Proceeding Series

SP - 1

EP - 8

BT - 2nd ACM International Workshop on Many-Core Embedded Systems, MES 2014 - In Conjunction with the 41st International Symposium on Computer Architecture, ISCA 2014

PB - Association for Computing Machinery

ER -