TY - GEN
T1 - A DSL for Performance Orchestration
AU - Teixeira, Thiago Santos Faria Xavier
AU - Padua, David
AU - Gropp, William
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/10/31
Y1 - 2017/10/31
N2 - The complexity and diversity of today's computer architectures demand more attention from software developers in order to harness all the computing power available. Furthermore, each modern architecture requires a potentially non-overlapping set of optimizations to attain a high fraction of its nominal peak speed. This leads to challenges in performance portability and code maintainability, in particular how to manage different optimized versions of the same code tailored to different architectures and how to keep them up to date as new algorithmic features are added. The increasing complexity of architectures and the growth of the optimization space tend to make compilers deliver unsatisfactory performance, and the gap between the performance of hand-tuned and compiler-generated code has grown dramatically. Even the use of advanced optimization flags is not enough to narrow this gap. On the other hand, optimizing applications manually is very time-consuming, and the developer needs to understand and interact with many different hardware features for each architecture. Successful research has been developed to assist the programmer in this painful and error-prone process of implementing, optimizing, and porting applications to different architectures. Nonetheless, the adoption of these works has been mostly restricted to specific domains, such as dense linear algebra, Fourier transforms, and signal processing. We have developed the framework ICE, which decouples the performance-expert role from the application-expert role (separation of concerns). It allows the use of architecture-specific optimizations while keeping the code maintainable in the long term. It orchestrates the application of multiple optimization tools to the application's baseline version and performs an empirical search to find the best sequence of optimizations and their parameters. The baseline version is regarded as having no architecture- or compiler-specific optimizations. The optimizations and the empirical search are directed by a domain-specific language (DSL) in an external file. Application code is often dramatically altered by adding multiple optimization cases for each target architecture; this DSL allows the performance expert to apply optimizations without disrupting the original code. The DSL has constructs to expose the options of the optimizations and generates a search space that can be traversed by different search tools. For instance, it has conditional statements that can be used to specify which optimizations should be carried out for each compiler. The DSL is not only the input of the empirical search but also its output: it can be used to save the best sequence of transformations found in previous searches. The application's code is annotated with unique identifiers that are referenced in the DSL. Currently, source-to-source loop optimizations, algorithm selection, and pragma selection are supported. The framework interface is flexible enough to integrate new optimization and search tools, and in case of any failure it falls back to the baseline version. We have applied the framework to linear algebra problems, stencil computations, and a production code for the simulation of plasma-coupled combustion, achieving up to a 3x speedup. Other works have tried to facilitate the optimization of applications, but they lack important features provided by ICE. CHiLL, Orio, and X Language simplify the generation of optimized code. CHiLL is the only one of these in which the instructions to carry out the optimizations are given in an external file, but it references loops by their position in the source, and modifications in the source require modifications in the external file, restricting its use in large production codes. Only Orio empirically evaluates variants of the annotated code. In summary, the contributions of the framework are: separation of concerns, incremental adoption, a DSL to specify the optimization space, an interface to plug in and compare different optimization and search tools, and the combination of empirical search with expert knowledge.
AB - The complexity and diversity of today's computer architectures demand more attention from software developers in order to harness all the computing power available. Furthermore, each modern architecture requires a potentially non-overlapping set of optimizations to attain a high fraction of its nominal peak speed. This leads to challenges in performance portability and code maintainability, in particular how to manage different optimized versions of the same code tailored to different architectures and how to keep them up to date as new algorithmic features are added. The increasing complexity of architectures and the growth of the optimization space tend to make compilers deliver unsatisfactory performance, and the gap between the performance of hand-tuned and compiler-generated code has grown dramatically. Even the use of advanced optimization flags is not enough to narrow this gap. On the other hand, optimizing applications manually is very time-consuming, and the developer needs to understand and interact with many different hardware features for each architecture. Successful research has been developed to assist the programmer in this painful and error-prone process of implementing, optimizing, and porting applications to different architectures. Nonetheless, the adoption of these works has been mostly restricted to specific domains, such as dense linear algebra, Fourier transforms, and signal processing. We have developed the framework ICE, which decouples the performance-expert role from the application-expert role (separation of concerns). It allows the use of architecture-specific optimizations while keeping the code maintainable in the long term. It orchestrates the application of multiple optimization tools to the application's baseline version and performs an empirical search to find the best sequence of optimizations and their parameters. The baseline version is regarded as having no architecture- or compiler-specific optimizations. The optimizations and the empirical search are directed by a domain-specific language (DSL) in an external file. Application code is often dramatically altered by adding multiple optimization cases for each target architecture; this DSL allows the performance expert to apply optimizations without disrupting the original code. The DSL has constructs to expose the options of the optimizations and generates a search space that can be traversed by different search tools. For instance, it has conditional statements that can be used to specify which optimizations should be carried out for each compiler. The DSL is not only the input of the empirical search but also its output: it can be used to save the best sequence of transformations found in previous searches. The application's code is annotated with unique identifiers that are referenced in the DSL. Currently, source-to-source loop optimizations, algorithm selection, and pragma selection are supported. The framework interface is flexible enough to integrate new optimization and search tools, and in case of any failure it falls back to the baseline version. We have applied the framework to linear algebra problems, stencil computations, and a production code for the simulation of plasma-coupled combustion, achieving up to a 3x speedup. Other works have tried to facilitate the optimization of applications, but they lack important features provided by ICE. CHiLL, Orio, and X Language simplify the generation of optimized code. CHiLL is the only one of these in which the instructions to carry out the optimizations are given in an external file, but it references loops by their position in the source, and modifications in the source require modifications in the external file, restricting its use in large production codes. Only Orio empirically evaluates variants of the annotated code. In summary, the contributions of the framework are: separation of concerns, incremental adoption, a DSL to specify the optimization space, an interface to plug in and compare different optimization and search tools, and the combination of empirical search with expert knowledge.
KW - Autotuning
KW - Compilers
KW - Optimization
UR - http://www.scopus.com/inward/record.url?scp=85043569182&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85043569182&partnerID=8YFLogxK
U2 - 10.1109/PACT.2017.50
DO - 10.1109/PACT.2017.50
M3 - Conference contribution
AN - SCOPUS:85043569182
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 372
BT - Proceedings - 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017
Y2 - 9 September 2017 through 13 September 2017
ER -