v1.9.0-rc.0 (2025-01-07)
- Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
- Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
- Update the name of PVC in train API (#2187 by @helenxie-bit)
- Remove support for MXJob (#2150 by @tariq-hasan)
- Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
- Add JAX controller (#2194 by @sandipanpanda)
- Add JAX API (#2163 by @sandipanpanda)
- JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
- FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
- Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)
- Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
- [Feature] Support managed by external controller (#2203 by @mszadkow)
- Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
- Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
- Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
- ARM64 supported in PyTorch examples (#2116 by @danielsuh05)
- [SDK] Adding env vars (#2285 by @tarekabouzeid)
- [SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
- [SDK] move env var to constants.py (#2268 by @varshaprasad96)
- [SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
- [SDK] Read namespace from the current context (#2255 by @andreyvelich)
- [SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
- [SDK] Explain Python version support cycle (#2144 by @andreyvelich)
- KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
- KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
- Always update TrainJob status on errors (#2352 by @astefanutti)
- Fix TrainJob status comparison and update (#2353 by @astefanutti)
- Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
- KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
- KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
- KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
- KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
- KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
- KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
- KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
- KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
- KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
- KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
- KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
- KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
- KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
- KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
- KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
- KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
- [v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
- KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
- KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
- KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
- KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
- KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
- KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
- KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
- KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
- KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
- KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)
- [release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @google-oss-robot)
- Pin accelerate package version in trainer (#2340 by @gavrissh)
- [fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
- [SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
- [SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
- [Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
- Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
- [SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
- fix volcano podgroup update issue (#2079 by @ckyuto)
- [SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)
- [release-1.9] Add release branch to the image push trigger (#2377 by @google-oss-robot)
- Add e2e test for train API (#2199 by @helenxie-bit)
- buildx link was broken (#2356 by @Veer0x1)
- Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
- Upgrade Go version to v1.23 (#2302 by @tenzen-y)
- Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
- Added test for create-pytorchjob.ipynb python notebook (#2274 by @saileshd1402)
- Remove zw0610 from approvers (#2343 by @zw0610)
- Upgrade kustomization files to Kustomize v5 (#2326 by @oksanabaza)
- Add openapi-generator CLI option to skip SDK v2 test generation (#2338 by @astefanutti)
- Refine the server-side apply installation args (#2337 by @tenzen-y)
- Ignore cache exporting errors in the image building workflows (#2336 by @tenzen-y)
- Pin Gloo repository in JAX Dockerfile to a specific commit (#2329 by @sandipanpanda)
- Update tf job examples to tf v2 (#2270 by @YosiElias)
- Remove Prometheus Monitoring doc (#2301 by @sophie0730)
- Upgrade Deepspeed demo dependencies (#2294 by @Syulin7)
- [SDK] test: add unit test for list_jobs method of the training_client (#2267 by @seanlaii)
- [SDK] Training Client Conditions related unit tests (#2253 by @Bobbins228)
- [SDK] test: add unit test for get_job_logs method of the training_client (#2275 by @seanlaii)
- [SDK] test: add unit test for get_job method of the training_client (#2205 by @Bobbins228)
- [SDK] test: add unit tests for delete_job() method (#2232 by @Bobbins228)
- [SDK] Add UTs for wait_for_job_conditions (#2196 by @Electronic-Waste)
- [SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job (#2192 by @YosiElias)
- [SDK] Add more unit tests for TrainingClient APIs - get_job_pods (#2175 by @YosiElias)
- Update JAX image to use image published by Kubeflow (#2264 by @sandipanpanda)
- Update README and out-of-date docs (#2252 by @andreyvelich)
- Clean up Go modules (#2238 by @tenzen-y)
- Change isort profile to black for full compatibility (#2234 by @Ygnas)
- Enhance pre-commit hooks with flake8 linting (#2195 by @Ygnas)
- Implement pre-commit hooks (#2184 by @droctothorpe)
- Add command to re-run GitHub Actions tests (#2167 by @andreyvelich)
- Update JAX integration proposal (#2165 by @sandipanpanda)
- Update release document (#2153 by @andreyvelich)
- update volcano to v1.9.0 (#2148 by @lowang-bh)
- Update Slack Invitation (#2142 by @andreyvelich)
- Refine the integration tests for the immutable PyTorchJob queueName (#2130 by @tenzen-y)
- Add GitHub Issue Template (#2129 by @andreyvelich)
- Update the images to the latest tag in master branch (#2128 by @johnugeorge)
- Updated Github Action Workflows as per issue #2117 (#2123 by @hkiiita)
- changed package name to flake8 to fix pytests pip install (#2109 by @ChristopheBrown)
- chore(fix): isort xgboost (#2098 by @harshithbelagur)
- Fix isort on examples/pytorch (#2094 by @marcmaliar)
v1.8.1 (2024-09-10)
- [Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
- Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
v1.8.0 (2024-07-15)
- [SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
- Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
- Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
- Deprecation Notice for MXJob (#2058 by @tenzen-y)
- ⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)
- Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
- [SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
- Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
- Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
- Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
- Train api dataset download changes (#1959 by @deepanker13)
- Train api init container creation (#1958 by @deepanker13)
- [SDK] Add docstring for Train API (#2075 by @andreyvelich)
- Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
- Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
- Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
- Implement webhook validation for the TFJob (#2051 by @tenzen-y)
- Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
- Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
- Upgrade Go version to v1.22 (#2046 by @tenzen-y)
- [SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
- [SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
- [SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
- [SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
- Training operator SDK unit test (#1938 by @deepanker13)
- [SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)
- [SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
- [SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
- [SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
- [SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
- Fix volcano podgroup update issue (#2079 by @ckyuto)
- Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
- Updated examples for train API (#2077 by @shruti2522)
- Fail job for non-retryable exit codes (#2071 by @kellyaa)
- E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
- fix wrong filepath in the simple example command (#2062 by @qzoscar)
- fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
- fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
- Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
- Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
- Fix URL in python SDK setup.py (#2011 by @garymm)
- Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
- train api jupyternotebook fix (#1984 by @deepanker13)
- fix: volcano podgroup should have a non-empty queue name (#1977 by @lowang-bh)
- Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
- IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
- [fix] replace ${go env GOPATH} with $(go env GOPATH) to get the prope… (#1952 by @double12gzh)
- Fixing issues with providing existing service account (#1918 by @rpemsel)
- Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)
- Update training operator image to latest (#2089 by @johnugeorge)
- Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
- Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
- Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
- docs: updating docs for local development (#2074 by @franciscojavierarceo)
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
- Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
- adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
- CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
- Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
- Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
- publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
- Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
- [docs] development guide update (#1995 by @shashank-iitbhu)
- Add Kubeflow Website links to README (#1983 by @andreyvelich)
- publish trainer hugging face image (#1985 by @deepanker13)
- Adding Training image needed for train api (#1963 by @deepanker13)
- Add test to create PyTorchJob from func (#1979 by @andreyvelich)
- Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
- torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
- utils changes needed to add train api (#1954 by @deepanker13)
- Adding parallel support for coveralls (#1956 by @johnugeorge)
- chore: pkg import only once (#1950 by @testwill)
- fix nproc env in elastic mode for pytorchjob (#1948 by @kuizhiqing)
- Avoid modifying log level globally (#1944 by @droctothorpe)
- Add @andreyvelich to Approvers (#1941 by @andreyvelich)
- Merge v1.7 branch changes to Main (#1940 by @johnugeorge)
- Increase the root volume size on the github runner when building container images (#1931 by @tenzen-y)
- Check podGroup CRD for the volcano and the scheduler-plugins as default. (#1929 by @Syulin7)
- Use a community hosted image in MXJob E2E (#1928 by @tenzen-y)
- Build MXJob examples in CI (#1927 by @tenzen-y)
- Bump k8s.io/* deps to 1.28 (#1920 by @afritzler)
- Replace XGBoost image for E2E with community hosted (#1922 by @tenzen-y)
- Creating service account where appropriate for MPI Job (#1917 by @rpemsel)
- Build XGBoostJob example images in CI (#1913 by @tenzen-y)
- Manage kube-delivery image from training-operator and update it (#1909 by @rpemsel)
- Adding Yuki to Approvers (#1901 by @johnugeorge)
- docs: Remove reference to tf-operator specific design doc (#1903 by @terrytangyuan)
- Add Training WG Community Call (#1900 by @andreyvelich)
- update full change list in changelog (#1895 by @lowang-bh)
- update volcano scheduler to 1.8.0 (#1894 by @lowang-bh)
- Changelog updated for 1.7.0 rc0 release (#1892 by @johnugeorge)
- Add Stale GitHub Action (#1893 by @andreyvelich)
- Refactor core/pod tests (#1890 by @tenzen-y)
- Remove klog v1 (#1886 by @tenzen-y)
v1.8.0-rc.1 (2024-06-25)
- [SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
- [SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
- [SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
- Fix volcano podgroup update issue (#2079 by @ckyuto)
v1.8.0-rc.0 (2024-04-28)
- Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
- Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
- Deprecation Notice for MXJob (#2058 by @tenzen-y)
- ⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)
- Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
- [SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
- Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
- Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
- Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
- Train api dataset download changes (#1959 by @deepanker13)
- Train api init container creation (#1958 by @deepanker13)
- [SDK] Add docstring for Train API (#2075 by @andreyvelich)
- Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
- Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
- Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
- Implement webhook validation for the TFJob (#2051 by @tenzen-y)
- Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
- Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
- Upgrade Go version to v1.22 (#2046 by @tenzen-y)
- [SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
- [SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
- [SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
- [SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
- Training operator SDK unit test (#1938 by @deepanker13)
- [SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)
- Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
- Updated examples for train API (#2077 by @shruti2522)
- Fail job for non-retryable exit codes (#2071 by @kellyaa)
- E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
- fix wrong filepath in the simple example command (#2062 by @qzoscar)
- fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
- fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
- Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
- Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
- Fix URL in python SDK setup.py (#2011 by @garymm)
- Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
- train api jupyternotebook fix (#1984 by @deepanker13)
- fix: volcano podgroup should have a non-empty queue name (#1977 by @lowang-bh)
- Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
- IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
- [fix] replace ${go env GOPATH} with $(go env GOPATH) to get the prope… (#1952 by @double12gzh)
- Fixing issues with providing existing service account (#1918 by @rpemsel)
- Update training operator image to latest (#2089 by @johnugeorge)
- Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
- Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
- Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
- docs: updating docs for local development (#2074 by @franciscojavierarceo)
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
- Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
- adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
- CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
- Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
- Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
- publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
- Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
- [docs] development guide update (#1995 by @shashank-iitbhu)
- Add Kubeflow Website links to README (#1983 by @andreyvelich)
- publish trainer hugging face image (#1985 by @deepanker13)
- Adding Training image needed for train api (#1963 by @deepanker13)
- Add test to create PyTorchJob from func (#1979 by @andreyvelich)
- Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
- torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
- utils changes needed to add train api (#1954 by @deepanker13)
- Adding parallel support for coveralls (#1956 by @johnugeorge)
- chore: pkg import only once (#1950 by @testwill)
- fix nproc env in elastic mode for pytorchjob (#1948 by @kuizhiqing)
- Avoid modifying log level globally (#1944 by @droctothorpe)
- Add @andreyvelich to Approvers (#1941 by @andreyvelich)
- Merge v1.7 branch changes to Main (#1940 by @johnugeorge)
- Increase the root volume size on the github runner when building container images (#1931 by @tenzen-y)
- Check podGroup CRD for the volcano and the scheduler-plugins as default. (#1929 by @Syulin7)
- Use a community hosted image in MXJob E2E (#1928 by @tenzen-y)
- Build MXJob examples in CI (#1927 by @tenzen-y)
- Bump k8s.io/* deps to 1.28 (#1920 by @afritzler)
- Replace XGBoost image for E2E with community hosted (#1922 by @tenzen-y)
- Creating service account where appropriate for MPI Job (#1917 by @rpemsel)
- Build XGBoostJob example images in CI (#1913 by @tenzen-y)
- Manage kube-delivery image from training-operator and update it (#1909 by @rpemsel)
- Adding Yuki to Approvers (#1901 by @johnugeorge)
- docs: Remove reference to tf-operator specific design doc (#1903 by @terrytangyuan)
- Add Training WG Community Call (#1900 by @andreyvelich)
- update full change list in changelog (#1895 by @lowang-bh)
- update volcano scheduler to 1.8.0 (#1894 by @lowang-bh)
- Changelog updated for 1.7.0 rc0 release (#1892 by @johnugeorge)
- Add Stale GitHub Action (#1893 by @andreyvelich)
- Refactor core/pod tests (#1890 by @tenzen-y)
- Remove klog v1 (#1886 by @tenzen-y)
v1.7.0-rc.0 (2023-07-07)
- Upgrade Scheduler Plugins version to v0.25.7 #1824 (tenzen-y)
- Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)
- Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
- Merge kubeflow/common to training-operator #1813 (johnugeorge)
- Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
- Implement suspend semantics #1859 (tenzen-y)
- Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
- Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)
- Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
- Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
- Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)
- Removing reconciler code #1879 (johnugeorge)
- Make Condition and ReplicaStatus optional #1862 (tenzen-y)
- Use the same reasons for Condition and Event #1854 (tenzen-y)
- Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
- Clean up /pkg/common/util/v1 #1845 (tenzen-y)
- Refactoring tests in common/controller.v1 #1843 (tenzen-y)
- remove duplicate code of add task spec annotation #1839 (lowang-bh)
- fetch volcano log when e2e failed #1837 (lowang-bh)
- Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
- Replace dummy client with fake client #1818 (tenzen-y)
- Add default Intel MPI env variables to MPIJob #1804 (tkatila)
- Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
- xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
- make timeout configurable from e2e tests #1787 (nagar-ajay)
v1.6.0 (2023-03-21)
Note: Since scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1769
Note: The latest Python SDK 1.6 version does not support earlier training operator versions. The minimum required training operator version is v1.6.0. Related: #1702
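The two notes above can be read as simple version gates. The sketch below is only an illustration of those rules and is not part of the training operator or its SDK; the function names are hypothetical, and it assumes plain `major.minor.patch` version strings, treating a release candidate such as `v1.7.0-rc.0` as `1.7.0`.

```python
def parse_version(text):
    """Convert 'v1.7.0' or '0.24.9' into a tuple of ints.

    A leading 'v' and any pre-release suffix (e.g. '-rc.0') are ignored,
    which is a simplification made for this sketch.
    """
    core = text.lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def scheduler_plugins_ok_for_operator_v1_7_plus(scheduler_plugins_version):
    """Hypothetical check for the first note.

    Training operator v1.7+ does not support scheduler-plugins
    v0.24.x or lower, so anything from 0.25 upward passes.
    """
    return parse_version(scheduler_plugins_version) >= (0, 25)


def operator_ok_for_sdk_1_6(operator_version):
    """Hypothetical check for the second note.

    Python SDK 1.6 requires training operator v1.6.0 or newer.
    """
    return parse_version(operator_version) >= (1, 6, 0)


if __name__ == "__main__":
    print(scheduler_plugins_ok_for_operator_v1_7_plus("v0.24.9"))
    print(scheduler_plugins_ok_for_operator_v1_7_plus("v0.25.7"))
    print(operator_ok_for_sdk_1_6("v1.5.0"))
    print(operator_ok_for_sdk_1_6("v1.6.0"))
```

A check like this could run in a deployment script before upgrading; the authoritative compatibility information remains in the related issues #1769 and #1702.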
- Support for k8s v1.25 in CI #1684 (johnugeorge)
- HPA support for PyTorch Elastic #1701 (johnugeorge)
- Adopting coscheduling plugin #1724 (tenzen-y)
- Support for Paddlepaddle #1675 (kuizhiqing)
- Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
- [SDK] Use Training Client without Kube Config #1740 (andreyvelich)
- [SDK] Create Unify Training Client #1719 (andreyvelich)
- [SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
- Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
- fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
- Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
- Fix XGBoost conditions bug #1737 (tenzen-y)
- To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
- fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
- fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
- Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
- fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
- fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)
- fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
- Fix status lost #1697 (ggaaooppeenngg)
- handle all restart policies #1649 (abin-thomas-by)
- [chore] fix typo #1648 (tenzen-y)
- Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
- Configure controller worker threads #1707 (HeGaoYuan)
- Validation Spec consistency #1705 (HeGaoYuan)
- [SDK] Remove Final Keyword from constants #1676 (andreyvelich)
- Fix Python installation in CI #1759 (tenzen-y)
- Update mpijob_controller.go #1755 (yshalabi)
- Set the default value of CleanPodPolicy to None #1754 (Syulin7)
- Update join Slack link #1750 (Syulin7)
- Update latest operator image #1742 (johnugeorge)
- Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
- Add Yuki to reviewer group #1739 (johnugeorge)
- Trim down CRD descriptions #1735 (tenzen-y)
- Add CI to build example images #1731 (tenzen-y)
- Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
- Fix indents on examples for tensorflow #1726 (tenzen-y)
- docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
- chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
- Removing deprecated Job Labels #1702 (johnugeorge)
- Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
- Add myself to reviewer. #1689 (kuizhiqing)
- Upgrade the envtest version #1687 (tenzen-y)
- [chore] Upgrade some actions version #1686 (tenzen-y)
- Upgrade Golangci-lint #1685 (johnugeorge)
- Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
- Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
- Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
- Update deployment.yaml #1668 (OmriShiv)
- Upgrade Go version to v1.19 #1663 (tenzen-y)
- Upgrade kubernetes version for test #1667 (tenzen-y)
- Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
- style: Refine name and signature of 2 replicaName functions #1660 (houz42)
- Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
- Add finalizers to cluster-role #1646 (ArangoGutierrez)
- Update the cmd to support MPI operator in ReadME #1656 (denkensk)
- The default value for CleanPodPolicy is inconsistent. #1753
- HPA support for PyTorch Elastic #1751
- Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
- paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
- *job API(master) cannot compatible with old job #1725
- Support coscheduling plugin #1722
- Number of worker threads used by the controller can't be configured #1706
- Conformance: Training tests #1698
- PyTorch and MPI Operator pulls hardcoded initContainer #1696
- PaddlePaddle Training: why can't find pods #1694
- Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
- [SDK] Create unify client for all Training Job types #1691
- Support Kubernetes v1.25 #1682
- panic happened when add podgroup watch #1679
- OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
- There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
- Change Kubernetes version for test #1665
- Support for multiplatform container image (amd64 and arm64) #1664
- Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
- After setting hostNetwork to true, mpi does not work #1657
- What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
- When will MPIJob support v2beta1 version? #1653
- Kubernetes HPA doesn't work with elastic PytorchJob #1645
- training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
- Training operator fails to create HPA for TorchElastic jobs #1626
- Release v1.5.0 tracking #1622
- upgrade client-go #1599
- training-operator may need to monitor PodGroup #1574
- Error: invalid memory address or nil pointer dereference #1553
- The pytorchJob training is slow #1532
- pytorch elastic scheduler error #1504
v1.4.0-rc.0 (2022-01-26)
- [bug] Missing init container in PyTorchJob #1482
- Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381
- Restore KUBEFLOW_NAMESPACE options #1522
- Improve test coverage #1497
- swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
- PytorchJob DDP training will stop if I delete a worker pod #1478
- Write down e2e failure debug process #1467
- How can i add the Priorityclass to the TFjob? #1466
- github.com/go-logr/zapr.(*zapLogger).Error #1444
- Podgroup is constantly created and deleted after tfjob is success or failure #1426
- Cut official release of 1.3.0 #1425
- Add "not maintained" notice to other operator repos #1423
- Python SDK for Kubeflow Training Operator #1380
- Update manifests with latest image tag #1527 (johnugeorge)
- add option for mpi kubectl delivery #1525 (zw0610)
- restore option namespace in launch arguments #1524 (zw0610)
- remove unused scripts #1521 (zw0610)
- remove ChanYiLin from approvers #1513 (ChanYiLin)
- add StacktraceLevel for zapr #1512 (qiankunli)
- add unit tests for tensorflow controller #1511 (zw0610)
- add the example of MPIJob #1508 (hackerboy01)
- Added 2022 roadmap and migrated previous roadmap from kubeflow/common #1500 (terrytangyuan)
- Fix a typo in mpi controller log #1495 (LuBingtan)
- feat(pytorch): Add init container config to avoid DNS lookup failure #1493 (gaocegege)
- chore: Fix GitHub Actions script #1491 (tenzen-y)
- chore: Fix misspelling in tfjob #1490 (tenzen-y)
- chore: Update OWNERS #1489 (gaocegege)
- Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf_operator #1487 (dependabot[bot])
- fix comments for mpi-controller #1485 (hackerboy01)
- add expectation-related functions for other resources used in mpi-controller #1484 (zw0610)
- Add MPI job to README now that it's supported #1480 (terrytangyuan)
- add mpi doc #1477 (zw0610)
- Set Go version of base image to 1.17 #1476 (tenzen-y)
- update label for tf-controller #1474 (zw0610)
- Add Akuity to the list of adopters #1473 (terrytangyuan)
- Add PR template with doc checklist #1470 (andreyvelich)
- Add e2e failure debugging guidance #1469 (Jeffwan)
- chore: Add .gitattributes to ignore Jsonnet test code for linguist #1463 (terrytangyuan)
- Migrate additional examples from xgboost-operator #1461 (terrytangyuan)
- Minor edits to README.md #1460 (terrytangyuan)
- add mpi-operator(v1) to the unified operator #1457 (hackerboy01)
- fix tfjob status when enableDynamicWorker set true #1455 (zw0610)
- feat(pytorch): Support elastic training #1453 (gaocegege)
- fix: generate printer columns for job crds #1451 (henrysecond1)
- Fix README typo #1450 (davidxia)
- consistent naming for better readability #1449 (pramodrj07)
- Fix set scheduler error #1448 (qiankunli)
- Add CI to run the tests for Go #1440 (tenzen-y)
- fix: Add missing retrying package that failed the import #1439 (terrytangyuan)
- Generate a single swagger.json file for all frameworks #1437 (alembiewski)
- Update links and files with the new URL #1434 (andreyvelich)
- chore: update CHANGELOG.md #1432 (Jeffwan)
- Add acknowledgement section in README to credit all contributors #1422 (terrytangyuan)
- Add Cisco to Adopters List #1421 (andreyvelich)
- Add Python SDK for Kubeflow Training Operator #1420 (alembiewski)
- docs: Move myself to approvers #1419 (terrytangyuan)
- fix hyperlinks in the 'overview' section #1418 (pramodrj07)
- docs: Migrate adopters of all operators to this repo #1417 (terrytangyuan)
- Feature/support pytorchjob set queue of volcano #1415 (qiankunli)
- Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1409 (Jeffwan)
- Update scripts to generate sdk for all frameworks #1389 (Jeffwan)
v1.3.0 (2021-10-03)
- Unable to specify pod template metadata for TFJob #1403
v1.3.0-rc.2 (2021-09-21)
- Missing Pod label for Service selector #1399
v1.3.0-rc.1 (2021-09-15)
- [bug] Reconciliation fails when upgrading common to 0.3.6 #1394
- Update manifests with latest image tag #1406 (johnugeorge)
- 2010: fix to expose correct monitoring port #1405 (deepak-muley)
- Fix 1399: added pod matching label in service selector #1404 (deepak-muley)
- fix: runPolicy validation error in the examples #1401 (Jeffwan)
v1.3.0-rc.0 (2021-08-31)
- chore: Update training-operator tag #1396 (Jeffwan)
- Add simple verification jobs #1391 (Jeffwan)
- fix: volcano pod group creation issue #1390 (Jeffwan)
- chore: Bump kubeflow/common version to 0.3.7 #1388 (Jeffwan)
v1.3.0-alpha.3 (2021-08-29)
- Update guidance to install all-in-one operator in README.md #1386
- chore(doc): Update README.md #1387 (Jeffwan)
- Remove tf-operator from the codebase #1378 (thunderboltsid)