TY - GEN
T1 - Semi-Supervised Code Translation: Overcoming the Scarcity of Parallel Code Data
AU - Zhu, Ming
AU - Karim, Mohimenul
AU - Lourentzou, Ismini
AU - Yao, Danfeng (Daphne)
N1 - Publisher Copyright:
Copyright held by the owner/author(s).
PY - 2024/10/27
Y1 - 2024/10/27
N2 - Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation through synthesizing high-quality parallel code data and curriculum learning on code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment to address the challenge of data scarcity. We evaluate the proposed method along with strong baselines including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the "shallow translation" problem. This issue arises when translated code copies keywords, statements, and even code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger in size. MIRACLE also achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples.
AB - Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation through synthesizing high-quality parallel code data and curriculum learning on code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment to address the challenge of data scarcity. We evaluate the proposed method along with strong baselines including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the "shallow translation" problem. This issue arises when translated code copies keywords, statements, and even code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger in size. MIRACLE also achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples.
KW - cross-language code alignment
KW - curriculum learning
KW - neural code translation
KW - semi-supervised learning
KW - static analysis
UR - http://www.scopus.com/inward/record.url?scp=85211265593&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85211265593&partnerID=8YFLogxK
U2 - 10.1145/3691620.3695524
DO - 10.1145/3691620.3695524
M3 - Conference contribution
AN - SCOPUS:85211265593
T3 - Proceedings - 2024 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024
SP - 1545
EP - 1556
BT - Proceedings - 2024 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024
PB - Association for Computing Machinery
T2 - 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024
Y2 - 28 October 2024 through 1 November 2024
ER -