Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data

Ming Zhu, Mohimenul Karim, Ismini Lourentzou, Daphne Yao

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation through synthesizing high-quality parallel code data and curriculum learning on code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment to address the challenge of data scarcity. We evaluate the proposed method along with strong baselines including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the "shallow translation" problem. This issue arises when translated code copies keywords, statements, and even code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger in size. MIRACLE also achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples.
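The "shallow translation" symptom described in the abstract can be illustrated with a small sketch. This is not MIRACLE's method and not the paper's evaluation procedure; it is only a hypothetical heuristic that flags source-language tokens left behind in a translation. The function name `leaked_source_tokens`, the token list `PYTHON_ONLY_TOKENS`, and both Java snippets are assumptions made up for illustration.

```python
import re

# Hypothetical, deliberately tiny list of tokens that are valid Python
# but are not Java keywords. Not taken from the paper.
PYTHON_ONLY_TOKENS = {"def", "elif", "None", "True", "False", "self", "lambda", "pass"}


def leaked_source_tokens(translated_code: str, source_only_tokens: set[str]) -> set[str]:
    """Return source-language tokens that appear as words in the translated code.

    This is a crude lexical check: it also matches words inside string
    literals and comments, so it can over-report.
    """
    words = set(re.findall(r"[A-Za-z_]\w*", translated_code))
    return words & source_only_tokens


# A made-up Python-to-Java translation that copied `elif` from the source.
shallow_java = """
public static int sign(int x) {
    if (x > 0) { return 1; }
    elif (x < 0) { return -1; }
    return 0;
}
"""

# The same function with the Python keyword properly translated.
clean_java = """
public static int sign(int x) {
    if (x > 0) { return 1; }
    else if (x < 0) { return -1; }
    return 0;
}
"""

print(sorted(leaked_source_tokens(shallow_java, PYTHON_ONLY_TOKENS)))  # ['elif']
print(sorted(leaked_source_tokens(clean_java, PYTHON_ONLY_TOKENS)))    # []
```

A leaked token such as `elif` would make the Java output fail to compile, which is the kind of error the abstract attributes to shallow translation. A lexical check like this only spots the symptom. The abstract names static analysis and compilation as the tools MIRACLE uses to build its synthetic parallel data.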

Original language: English (US)
Title of host publication: Proceedings - 2024 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024
Publisher: Association for Computing Machinery
Pages: 1545-1556
Number of pages: 12
ISBN (Electronic): 9798400712487
State: Published - Oct 27 2024
Event: 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024 - Sacramento, United States
Duration: Oct 28 2024 to Nov 1 2024

Publication series

Name: Proceedings - 2024 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024

Conference

Conference: 39th ACM/IEEE International Conference on Automated Software Engineering, ASE 2024
Country/Territory: United States
City: Sacramento
Period: 10/28/24 to 11/1/24

Keywords

  • cross-language code alignment
  • curriculum learning
  • neural code translation
  • semi-supervised learning
  • static analysis

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Safety, Risk, Reliability and Quality
