Abstract
A recent latency tolerance technique, read miss clustering, restructures code to send demand miss references in parallel to the underlying memory system. An alternate, widely-used latency tolerance technique is software prefetching, which initiates data fetches ahead of expected demand miss references by a certain distance. Since both techniques seem to target the same types of latencies and use the same system resources, it is unclear which technique is superior or if both can be combined. This paper shows that these two techniques are actually mutually beneficial, each helping to overcome limitations of the other. We perform our study for uniprocessor and multiprocessor configurations, in simulation and on a real machine (Convex Exemplar). Compared to prefetching alone (the state-of-the-art implemented in systems today), the combination of the two techniques reduces execution time an average of 21% across all cases studied in simulation and an average of 16% for 5 out of 10 cases on the Exemplar. The combination sees execution time reductions relative to clustering alone averaging 15% for 6 out of 11 cases in simulation and 20% for 6 out of 10 cases on the Exemplar.
Original language | English (US) |
---|---|
Pages (from-to) | 292-303 |
Number of pages | 12 |
Journal | Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT |
State | Published - 2001 |
Externally published | Yes |
Event | Internatinal Conference on Parallel Architectures and Compilation Techniques PACT 2001 - Barcelona, Spain Duration: Sep 8 2001 → Sep 12 2001 |
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture