This repository contains artifacts of AlphaTrans from the paper "Repository-Level Compositional Code Translation and Validation".
We provide a Dockerfile which installs all necessary dependencies to reproduce the results of AlphaTrans. Please install Docker, and then execute the following to build a Docker image and run the container in interactive mode:
docker build --no-cache -t alphatrans .
docker run -it alphatrans bash
Note
If you are using macOS with an Apple chip, please consider adding --platform=linux/amd64 to the docker build command.
Please refer to Reproduce AlphaTrans Results for instructions on how to reproduce the results of AlphaTrans. If you are interested in translating more projects, please refer to Translate New Java Projects.
AlphaTrans currently supports prompting OpenAI models (e.g., gpt-4o-2024-11-20) and open-source models (e.g., deepseek-ai/deepseek-coder-33b-instruct) served by ollama (please see the Ollama Project on how to start an engine). We have created a .env file to store API keys and model endpoints. If prompting with ollama, simply paste in your OLLAMA_HOST (e.g., http://0.0.0.0:5000, where the engine IP is 0.0.0.0 and PORT is 5000). If prompting with OpenAI models, you only need to paste your key in OPENAI_API_KEY.
vim .env
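As a quick sanity check before running the translation scripts, the following sketch parses a .env-style file and reports which backend is configured. It is only an illustration: the parsing rules (KEY=VALUE lines, # comments, optional quotes) and the helper names parse_env and configured_backends are assumptions and not part of AlphaTrans. Only the variable names OLLAMA_HOST and OPENAI_API_KEY come from the text above.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def configured_backends(env: dict[str, str]) -> list[str]:
    """Return the backends whose setting is present and non-empty."""
    backends = []
    if env.get("OPENAI_API_KEY"):
        backends.append("openai")
    if env.get("OLLAMA_HOST"):
        backends.append("ollama")
    return backends


if __name__ == "__main__":
    env_path = Path(".env")  # assumed to sit in the repository root
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("configured backends:", configured_backends(env) or "none")
```

If the script reports "none", neither variable is set and the translation scripts will likely have no model to prompt.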
For all ten projects, we provide the project skeletons and partial translations. Please execute the following to start translating a project (e.g., the commons-fileupload project with the gpt-4o-2024-11-20 model and temperature=0.0):
bash scripts/get_dependencies.sh _decomposed_tests
bash scripts/generate_test_invocation_map.sh _decomposed_tests
bash scripts/extract_coverage.sh commons-fileupload _decomposed_tests
bash scripts/translate_fragment.sh commons-fileupload 0.0 gpt-4o-2024-11-20
Note
Executing the script extract_coverage.sh can take some time. Please be patient.
These scripts translate the project fragment by fragment in reverse call graph order and store the translations in JSON files along with validation results (e.g., syntactic correctness, GraalVM correctness, and test execution correctness). If you want to create standalone Python projects, simply recompose all translations with the following script:
bash scripts/recompose.sh commons-fileupload 0.0 gpt-4o-2024-11-20
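To make the phrase "reverse call graph order" concrete, here is a minimal sketch of one plausible reading: every callee is visited before the fragments that call it, so a caller is translated only after its dependencies. This is not the AlphaTrans implementation. The fragment names are hypothetical, and the sketch assumes an acyclic call graph; how AlphaTrans handles recursion is not covered here.

```python
from graphlib import TopologicalSorter


def reverse_call_graph_order(call_graph: dict[str, set[str]]) -> list[str]:
    """Order fragments so that every callee precedes its callers.

    call_graph maps each caller to the set of fragments it calls.
    TopologicalSorter treats the mapped values as prerequisites, so
    callees are emitted first. Raises graphlib.CycleError on recursion.
    """
    return list(TopologicalSorter(call_graph).static_order())


# Hypothetical fragment names, for illustration only.
example = {
    "Parser.parse": {"Parser.readHeader", "Util.trim"},
    "Parser.readHeader": {"Util.trim"},
    "Util.trim": set(),
}
order = reverse_call_graph_order(example)
print(order)
```

In this example Util.trim comes first because it calls nothing, and Parser.parse comes last because it depends on both other fragments.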
If you want to check the effectiveness of gpt-4o-2024-11-20 in translating commons-fileupload, please run the following script:
bash scripts/print_results.sh commons-fileupload 0.0 gpt-4o-2024-11-20 data/schemas_decomposed_tests/translations
Note
Due to the probabilistic behavior of models, the results may differ slightly when re-translating projects. You may run the experiment multiple times to account for this variation.
If you want to merge the results of two different models, please first move each model's results under data/results, and then execute the following:
bash scripts/merge_results.sh 0.0 deepseek-coder-33b-instruct gpt-4o-2024-11-20
This will merge the results and create a new directory under data/results/{$first_model}_{$second_model}_MERGED.
In this section, we discuss how to add more projects and translate them with AlphaTrans. Below, we provide the steps for the ten subject projects in our work. If you add a new project, it should be structured similarly to the existing ones. For every project, we provide two specific snapshots, as shown below:
original_projects: Original snapshot of the projects cloned from GitHub.
cleaned_final_projects_decomposed_tests: Snapshot of the original_projects with third-party libraries removed, overloaded methods/constructors transformed, and tests decomposed.
You can start experimenting from the second snapshot (cleaned_final_projects_decomposed_tests), as project reduction, transformation, and decomposition can potentially take hours. Please refer to Project Reduction, Program Transformation and Test Decomposition for further preprocessing steps.
AlphaTrans requires the CodeQL CLI for database creation and static analysis of projects. CodeQL is already installed by the Docker image, which also clones the vscode-codeql-starter repository required for executing CodeQL queries. Please follow the steps below to create the project databases and execute queries on cleaned_final_projects_decomposed_tests:
Create the project databases with CodeQL by executing the following script:
bash scripts/create_database.sh _decomposed_tests
After successful execution, the databases should be created under databases/<project_name>_decomposed_tests.
We have already copied all CodeQL files from queries into the vscode-codeql-starter/codeql-custom-queries-java directory. Execute the following to run all necessary CodeQL queries:
cd vscode-codeql-starter/codeql-custom-queries-java
bash run.sh
Once all queries are executed, query outputs will be stored under data/query_outputs_decomposed_tests.
Execute the following to decompose programs and create project schemas:
bash scripts/create_schema.sh _decomposed_tests
bash scripts/extract_call_graph.sh _decomposed_tests
These scripts store the project schemas in JSON format under data/schemas_decomposed_tests.
We provide our universal type map under data/type_resolution/universal_type_map_final.json. This type map can be used directly. However, if you want to translate types again, please execute the following from the root directory of the repository to perform type translation on the projects:
bash scripts/extract_types.sh _decomposed_tests
bash scripts/crawl_type_desc.sh
bash scripts/translate_types.sh <type> <model_name>
The <type> can be either simple or source_description. The former prompts the model with a vanilla prompt, while the latter prompts the model with a description of the type in the source programming language. The model can be either deepseek-coder-33b-instruct or gpt-4o-2024-11-20.
Execute the following from the root directory of the repository to generate skeletons of projects and check their syntactic correctness:
bash scripts/get_dependencies.sh _decomposed_tests
bash scripts/create_skeleton.sh _decomposed_tests
These commands should create skeletons in the target language under data/skeletons/<project_name>.
Finally, execute the following from the root directory of the repository to perform compositional translation and validation on the projects:
bash scripts/generate_test_invocation_map.sh _decomposed_tests
bash scripts/extract_coverage.sh <project_name> _decomposed_tests
bash scripts/translate_fragment.sh <project_name> <temperature> <model>
You can use project_name=commons-fileupload, temperature=0.0, and model=gpt-4o-2024-11-20 as an example.
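If you plan to translate several projects, it can help to generate the three commands above from their parameters rather than typing them by hand. The sketch below does only that: it builds the argument lists and prints them. The helper name translation_commands is hypothetical and not part of the repository; the script paths and argument order are copied from the commands above.

```python
import shlex


def translation_commands(project_name: str, temperature: float, model: str,
                         suffix: str = "_decomposed_tests") -> list[list[str]]:
    """Return the three commands above as argument lists, in order."""
    return [
        ["bash", "scripts/generate_test_invocation_map.sh", suffix],
        ["bash", "scripts/extract_coverage.sh", project_name, suffix],
        ["bash", "scripts/translate_fragment.sh", project_name, str(temperature), model],
    ]


commands = translation_commands("commons-fileupload", 0.0, "gpt-4o-2024-11-20")
for argv in commands:
    # Print instead of executing; pass argv to subprocess.run(argv, check=True)
    # from the repository root to actually run each step.
    print(shlex.join(argv))
```

Printing first makes it easy to review the exact commands before committing to the long-running coverage extraction step.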
In this section, we describe how to remove third-party libraries from the original projects, transform programs, and perform test decomposition.
Run the following script to add the maven-jar-plugin to a project so that a test JAR is built:
bash scripts/add_plugin.sh <project_name>
Run this script to build the project and merge the source and test JARs:
bash scripts/merge_jar.sh <project_name>
Note
If a project uses an older version of Java, please consider changing the pom.xml file or using the -Dmaven.compiler.source=1.8 -Dmaven.compiler.target=1.8 flags to override the compiler versions during compilation.
Generate a simple call graph (via JavaCG) of the entire project using the merged JAR:
bash scripts/generate_cg.sh <project_name>
python3 src/preprocessing/reduce_third_party_libs.py <project_name>
The project name is the name of a directory from java_projects/original_projects. After reduction, a directory of the same name will appear under java_projects/automated_reduced_projects/.
Before performing program transformation, please follow the steps in CodeQL Database Creation and Static Analysis to create databases of the reduced projects and generate query outputs. Then, execute the following to perform program transformation on a specific project. The <project_dir_overload_methods> and <project_dir_overload_constructors> arguments are the directories of projects with overloaded methods and constructors. You can choose the directory names.
bash scripts/program_transformation.sh <project_dir_overload_methods> <project_dir_overload_constructors> <project_name>
AlphaTrans performs test decomposition on transformed projects to address the long call chain problem when executing tests in the target language. Please execute the following to first extract the executed tests and their coverage, and then use this information to decompose the tests:
bash scripts/extract_coverage.sh <project_name> ''
Once this executes properly, it should create a directory called source_test_execution under data. Then execute the following to decompose tests:
bash scripts/decompose_test.sh
After successful execution, this should create the cleaned_final_projects_decomposed_tests directory under java_projects.
Note
Some small manual changes may be needed after test decomposition. For instance, you may need to remove @Test(expected = IllegalArgumentException.class) from a test's annotation, since we do not know ahead of time which decomposed tests throw an exception. Please refer to our reference decomposed tests (e.g., cleaned_final_projects_decomposed_tests) for specific examples. A project with decomposed tests is considered acceptable as long as it can be compiled by Maven.
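The manual edit described in the note can be partly automated. The sketch below is one possible helper, not a script shipped with AlphaTrans: it assumes the intended fix is to drop the expected attribute while keeping a plain @Test, and it only handles annotations whose sole attribute is expected. The function name strip_expected and the test method name in the example are hypothetical. Review each change by hand and compare against the reference decomposed tests.

```python
import re

# Matches @Test(expected = SomeException.class), tolerating whitespace.
# Annotations with additional attributes (e.g. timeout) are left untouched.
EXPECTED_TEST = re.compile(r"@Test\(\s*expected\s*=\s*[\w.]+\.class\s*\)")


def strip_expected(java_source: str) -> str:
    """Replace @Test(expected = X.class) with a plain @Test."""
    return EXPECTED_TEST.sub("@Test", java_source)


# Hypothetical decomposed test, for illustration only.
before = (
    "@Test(expected = IllegalArgumentException.class)\n"
    "public void testParse_test0_decomposed() { parse(null); }\n"
)
print(strip_expected(before))
```

After applying such a rewrite, recompile the project with Maven to confirm that it still builds.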
We look forward to hearing your feedback. Please open an issue for any questions or comments 🙏.