Researchers at Facebook have developed a new artificial intelligence (AI) system that converts source code from one high-level programming language to another.
Called “TransCoder”, this AI-based tool translates code between high-level languages such as C++, Java, and Python with high accuracy.
The research is detailed in the paper “Unsupervised Translation of Programming Languages.”
Converting code from one programming language to another is a difficult task for even an experienced programmer, as it requires knowledge of both the source language and the target language, making code-translation projects expensive.
Transcompilers eliminate the need to rewrite code from scratch. However, it is still a complicated task for a developer to deal with languages that differ in syntax, libraries, variable types and APIs.
Facebook’s “neural translator” TransCoder addresses these issues with an unsupervised learning approach. This means it is trained on unlabeled data, here source code with no paired examples of the same program written in two languages, and finds patterns in that data with minimal human supervision. According to the researchers, it outperforms rule-based and commercial baselines by a “significant” margin.
In the paper, the researchers propose applying recent approaches in unsupervised machine translation, using a large amount of monolingual source code from GitHub to train a model, TransCoder, to translate between three popular languages: C++, Java and Python.
According to the paper, the TransCoder model relies on three main principles:
- First, the model is initialized with cross-lingual masked language model pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language.
- Next comes denoising auto-encoding, where the decoder is trained to generate valid sequences even when fed noisy data, which also makes the encoder more robust to input noise.
- Finally, there is a “back-translation” process, in which translated code is converted back to the first language. This allows TransCoder to generate its own parallel data and compare the result with the original code. The difference found in this process is used as a training signal.
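The input corruption behind the first two principles can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `<MASK>` token, the function names, and the rates and window size are all assumptions chosen for illustration. `mask_tokens` hides random tokens, as masked language model pretraining would before asking the model to predict them, and `add_noise` drops and locally shuffles tokens, as a denoising auto-encoder's input might be corrupted before the decoder is asked to recover the clean sequence.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, rate=0.15, rng=None):
    """Replace each token with MASK with probability `rate` (assumed value)."""
    rng = rng or random.Random()
    return [MASK if rng.random() < rate else tok for tok in tokens]


def add_noise(tokens, drop_rate=0.1, window=3, rng=None):
    """Drop some tokens, then shuffle the survivors locally.

    Each surviving token gets the sort key `index + uniform(0, window)`,
    so it can only move a few positions away from where it started.
    """
    rng = rng or random.Random()
    kept = [tok for tok in tokens if rng.random() >= drop_rate]
    keys = [i + rng.uniform(0, window) for i in range(len(kept))]
    return [tok for _, tok in sorted(zip(keys, kept), key=lambda pair: pair[0])]


# A tiny, pre-tokenized function used as example input.
tokens = "def add ( a , b ) : return a + b".split()
rng = random.Random(0)  # fixed seed so repeated runs give the same corruption

masked = mask_tokens(tokens, rng=rng)
noisy = add_noise(tokens, rng=rng)

print(" ".join(masked))
print(" ".join(noisy))
```

In a real training loop, the model would receive `masked` or `noisy` as input and be trained to produce the original `tokens`.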
The tool was trained on source code from more than 2.8 million open-source GitHub repositories. The Facebook research team also ran tests on 852 parallel functions in C++, Java and Python from GeeksforGeeks, an online platform that gathers coding problems and presents solutions in several programming languages.
To evaluate the translations, they introduced a new metric, computational accuracy, which checks whether a translated function produces the same outputs as the reference function when given the same inputs.
For example, TransCoder achieved 74.8% computational accuracy when converting from C++ to Java, 57.8% from Python to C++, and 91.6% from Java to C++.
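The idea behind computational accuracy can be sketched in a few lines of Python. This is a simplified illustration, not the researchers' evaluation harness: it assumes both the reference and the translated function are available as Python callables, whereas a real evaluation would compile and run code in the target language. The function names and the toy test cases are hypothetical.

```python
def same_outputs(reference, candidate, inputs):
    """Return True if `candidate` matches `reference` on every input tuple."""
    for args in inputs:
        expected = reference(*args)
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A translation that crashes counts as a failed translation.
            return False
    return True


def computational_accuracy(cases):
    """Fraction of (reference, candidate, inputs) cases whose outputs agree."""
    if not cases:
        return 0.0
    passed = sum(1 for ref, cand, inputs in cases if same_outputs(ref, cand, inputs))
    return passed / len(cases)


# Toy stand-ins for a reference function and two "translations" of it.
def ref_max(a, b):
    return a if a > b else b


def good_max(a, b):
    return max(a, b)


def bad_max(a, b):
    return a  # deliberately wrong: ignores b


cases = [
    (ref_max, good_max, [(1, 2), (5, 3), (0, 0)]),
    (ref_max, bad_max, [(1, 2), (5, 3)]),
]
score = computational_accuracy(cases)
print(score)  # 0.5: one of the two translations matches the reference
```

A metric of this kind rewards translations that behave correctly even when their text differs from the reference, which token-overlap scores cannot capture.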
“Although never provided with parallel data, the model manages to translate functions with a high accuracy, and to properly align functions from the standard library across the three languages, outperforming rule-based and commercial baselines by a significant margin,” wrote the co-authors in the paper.
“Our approach is simple, does not require any expertise in the source or target languages, and can easily be extended to most programming languages. Although not perfect, the model could help to reduce the amount of work and the level of expertise required to successfully translate a codebase.”