A DSL for Performance Orchestration

Thiago Santos Faria Xavier Teixeira, David A Padua, William D Gropp

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The complexity and diversity of today's computer architectures demand more attention from software developers who want to harness all the available computing power. Furthermore, each modern architecture requires a potentially non-overlapping set of optimizations to attain a higher fraction of its nominal peak speed. This raises challenges for performance portability and code maintainability, in particular how to manage different optimized versions of the same code tailored to different architectures, and how to keep them up to date as new algorithmic features are added. The increasing complexity of architectures and the growth of the optimization space tend to make compilers deliver unsatisfactory performance, and the gap between the performance of hand-tuned and compiler-generated code has grown dramatically. Even advanced optimization flags are not enough to narrow this gap. On the other hand, optimizing applications manually is very time-consuming, and the developer needs to understand and interact with many different hardware features on each architecture. Successful research has been carried out to assist the programmer in this painful and error-prone process of implementing, optimizing, and porting applications to different architectures. Nonetheless, adoption of this work has been mostly restricted to specific domains, such as dense linear algebra, Fourier transforms, and signal processing. We have developed ICE, a framework that decouples the role of the performance expert from that of the application expert (separation of concerns). It allows the use of architecture-specific optimizations while keeping the code maintainable in the long term. It orchestrates the application of multiple optimization tools to the application's baseline version and performs an empirical search to find the best sequence of optimizations and their parameters.
The baseline version is regarded as having no architecture- or compiler-specific optimizations. The optimizations and the empirical search are directed by a domain-specific language (DSL) in an external file. Application code is often dramatically altered when multiple optimization cases are added for each architecture used; the DSL instead allows the performance expert to apply optimizations without disturbing the original code. The DSL has constructs that expose the options of the optimizations, and it generates a search space that can be traversed by different search tools. For instance, it has conditional statements that can specify which optimizations should be carried out for each compiler. The DSL is not only the input of the empirical search but also its output: it can be used to save the best sequence of transformations found in previous searches. The application's code is annotated with unique identifiers that are referenced in the DSL. Currently, source-to-source loop optimizations, algorithm selection, and pragma selection are supported. The framework's interface is flexible enough to integrate new optimization and search tools, and in case of any failure it falls back to the baseline version. We have applied the framework to linear algebra problems, stencil computations, and a production code for the simulation of plasma-coupled combustion (XPACC), achieving up to 3x speedup. Other works have tried to make optimizing applications easier, but they lack important features that ICE provides. CHiLL, Orio, and X Language simplify the generation of optimized code. CHiLL is the only one of these in which the instructions for carrying out the optimizations are given in an external file, but it references loops by their position in the source, so modifications to the source require modifications to the external file, which restricts its use in large production codes. Only Orio empirically evaluates variants of the annotated code.
In summary, the contributions of the framework are: separation of concerns, incremental adoption, a DSL to specify the optimization space, an interface to plug in and compare different optimization and search tools, and the combination of empirical search with expert knowledge.
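
The abstract does not show the DSL's syntax, so the following is only a minimal Python sketch of the general idea it describes: an untouched baseline, a search space chosen per compiler, an empirical search over variants, and a fallback to the baseline when a variant fails or does not help. Every name here (`baseline`, `tiled`, `search_space`, `empirical_search`, the compiler labels, and the tile sizes) is hypothetical and is not taken from ICE.

```python
import time


def baseline(n):
    """Reference kernel with no tuning: sum of i*j over an n-by-n grid."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total


def tiled(n, tile):
    """Same computation with both loops blocked by `tile` (a stand-in optimization)."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    total = 0
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    total += i * j
    return total


def search_space(compiler):
    """Enumerate candidate configurations; the branch mimics a per-compiler conditional."""
    tiles = [4, 8, 16] if compiler == "gcc" else [8, 32]
    return [{"tile": t} for t in tiles]


def timed(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def empirical_search(n, space):
    """Return (config, seconds). config is None when the baseline is kept."""
    expected, best_time = timed(baseline, n)
    best = None
    for cfg in space:
        try:
            result, elapsed = timed(tiled, n, cfg["tile"])
        except Exception:
            continue  # failing variant is skipped; the baseline remains the fallback
        # Accept a variant only if it is both correct and faster than the best so far.
        if result == expected and elapsed < best_time:
            best, best_time = cfg, elapsed
    return best, best_time


if __name__ == "__main__":
    print(empirical_search(200, search_space("gcc")))
```

The returned configuration plays the role of the "output" mentioned in the abstract: it could be saved and reused in place of a new search. A real tuner would repeat timings and use a smarter traversal than this exhaustive loop.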

Original language: English (US)
Title of host publication: Proceedings - 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 1
ISBN (Electronic): 9781467395243
DOIs: https://doi.org/10.1109/PACT.2017.50
State: Published - Oct 31 2017
Event: 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017 - Portland, United States
Duration: Sep 9 2017 to Sep 13 2017

Publication series

Name: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
Volume: 2017-September
ISSN (Print): 1089-795X

Other

Other: 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017
Country: United States
City: Portland
Period: 9/9/17 to 9/13/17

Fingerprint

Orchestration
Domain-specific Languages
Optimization
Compiler
Baseline
Linear algebra
Maintainability
Computer Architecture
Architecture
Computer architecture
Portability
Plug-in

Keywords

  • Autotuning
  • Compilers
  • Optimization

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

Cite this

Teixeira, T. S. F. X., Padua, D. A., & Gropp, W. D. (2017). A DSL for Performance Orchestration. In Proceedings - 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017 (Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT; Vol. 2017-September). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/PACT.2017.50

@inproceedings{a8a455310192415297031bd88fecbb24,
title = "A DSL for Performance Orchestration",
keywords = "Autotuning, Compilers, Optimization",
author = "Teixeira, {Thiago Santos Faria Xavier} and Padua, {David A} and Gropp, {William D}",
year = "2017",
month = "10",
day = "31",
doi = "10.1109/PACT.2017.50",
language = "English (US)",
series = "Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "Proceedings - 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017",
address = "United States",

}

TY - GEN

T1 - A DSL for Performance Orchestration

AU - Teixeira, Thiago Santos Faria Xavier

AU - Padua, David A

AU - Gropp, William D

PY - 2017/10/31

Y1 - 2017/10/31

KW - Autotuning

KW - Compilers

KW - Optimization

UR - http://www.scopus.com/inward/record.url?scp=85043569182&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85043569182&partnerID=8YFLogxK

U2 - 10.1109/PACT.2017.50

DO - 10.1109/PACT.2017.50

M3 - Conference contribution

AN - SCOPUS:85043569182

T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT

BT - Proceedings - 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -