Skip to content

Deploys a data science tool suite to a Kubernetes distributed compute cluster

License

Notifications You must be signed in to change notification settings

alexander-manley/kubernetes-data-science

Repository files navigation

One click kubernetes-data-science

One click deploys a data science tool suite to a local Kubernetes distributed compute cluster by helping to integrate a broad spectrum of high quality opensource software. Each tool is made available depending on use case, without requiring the end user to seek out, individually install, configure, update, and integrate disparate tooling into new or existing analysis workflows.

Each component dependency is pulled from each respective upstream project and/or docker repository, and no upstream code is ever modified. Different industries and user roles have varied needs; as such the platform is fully customizable in that analytic containers can be added, removed, or rolled back to earlier versions depending on host infrastructure and the applicable template.

Use on-prem Kubernetes or Okteto Cloud: Managed Kubernetes service designed for developers. Free developer accounts. The apps sleep after 24 hours of inactivity. Details: https://okteto.com/docs/cloud

Browser navigate to unique link generated under "Endpoints:"

kubernetes-datascience-launcher

R Spreadsheet Analytics (Jamovi):

  • Opensource alternative to SPSS, MATLAB, Statistica, and SAS.

Develop on Okteto

Jupyter-Spark Tool Suite:

  • Apache Spark
  • Jupyterlab notebook interface
  • Python interpreter with Pandas, Scikit-learn, Matplotlib, and Statsmodels
  • R statistics

Launch, startup can take up to 10 minutes:

Develop on Okteto

Dataiku Data Science Studio (DSS):

  • Free edition
  • Data preparation and model management

Launch, startup can take up to 10 minutes:

Develop on Okteto

RAPIDS/DASK Distributed Machine Learning:

  • Distributed python clustering and regression algorithms

Launch, startup can take up to 10 minutes, requires account upgrade for sufficient compute resource allocation:

Develop on Okteto

MLFlow:

  • Track experiments to record & compare parameters and results (MLflow Tracking).

  • Packaging ML code in a reproducible form to share with data scientists or transfer to production (MLflow Projects).

  • Manage and deploy models from a variety of ML libraries to a variety of model serving and inference platforms (MLflow Models).

  • Provide a central model store to collaboratively manage the full lifecycle of an MLflow Model, including model versioning, stage transitions, and annotations (MLflow Model Registry).

Launch MLFlow Now

Tensorflow Notebook:

  • Training and inference of deep neural networks

Develop on Okteto

Additional Opensource ML Platforms:

Seldon: https://github.com/SeldonIO/seldon-core (https://github.com/SeldonIO/seldon-core/tree/master/helm-charts/seldon-core-operator)

Generic Kubeflow via Kustomize manifests: https://github.com/kubeflow/manifests

Polyaxon: https://github.com/polyaxon/polyaxon

DVC: https://github.com/iterative/dvc

ClearML: https://github.com/allegroai/clearml/

Guild AI: https://github.com/guildai/guildai

Sacred: https://github.com/IDSIA/sacred

Tensorboard: https://github.com/tensorflow/tensorboard

SaaS Enterprise Alternatives:

AWS Sagemaker

Azure ML Studio

Google Vertex AI

Neptune AI (Warsaw)

Comet (Tel Aviv, NY)

Pachyderm (San Francisco)

Weights & Biases (San Francsico)

Tool Suite Add-ons:

  • DataProfiler: Sensitive data detection, equiped with a pre-trained deep learning model identify sensitive data (PII / NPI).
  • DataComPy: Prints out a human-readable report summarizing and sampling differences between two pandas dataframes.
  • DataCompareR: Compare two R datasets and view a report on the similarities and differences.
  • Rubicon-ml: Captures and stores searchable model training and execution information, like parameters and outcomes, and display them in a dashboard.
  • Synthetic-data: Sample data generation (https://github.com/capitalone/synthetic-data)
  • Jupyterlab templates: (https://github.com/jpmorganchase/jupyterlab_templates)

Kubeflow (K3ai) Add-ons:

  • argo-workflow: Argo Workflow plugin
  • h2o-single: H2O.ai
  • jupyter-minimal: Minimal Jupyter Configuration
  • katib: Kubeflow Katib
  • kf-pipelines-tekton: Kubeflow Pipelines based on Tekton
  • kubeflow-pipelines: Kubeflow Pipelines platform agnostic
  • mpi-op: MPI-Operator
  • nvidia-gpu: Nvidia GPU support
  • pytorch-op: Pytorch op
  • tekton: Kubeflow Pipelines based on Tekton
  • tensorflow-op: Kubeflow Tensorflow

Pipeline Integration:

Machine Learning Operations (MLOps):

  • Deployment: single node bare metal (nerdctl)
  • Deployment: single node kubernetes (kubectl)
  • Deployemnt: cloud service provider kuberenetes: (aws eks / kubectl), (gcloud container clusters / kubectl), (az aks / kubectl)
  • Shared notebooks and data access/transfers
  • Logging and metrics
  • Multi-format data ingest (HL7, SAS, CSV, XLSX, ZIP, TXT, JSON, XML, HTML, Images, HDF, PDF, DOCX, MP3, MP4...)
  • Dataset clean-up and type conversion
  • Lifecycle management (https://github.com/mlflow/mlflow/releases)

kubeflow-katib

Opensource Analytics on Apache Spark:

(https://spark.apache.org/docs/latest/ml-statistics.html)

  • Collinearity and covariance
  • Boosting: RandomForest & XGBoost (Regressor and Classifier)
  • Extracting, transforming and selecting features
  • Classification and Regression
  • Clustering
  • Collaborative filtering
  • Frequent Pattern Mining
  • Model selection and tuning
  • Limited-memory BFGS (L-BFGS)
  • Normal equation solver for weighted least squares
  • Iteratively reweighted least squares (IRLS)
  • Clustering
  • Dimensionality reduction
  • Feature extraction and transformation
  • Frequent pattern mining
  • Evaluation metrics
  • PMML model export
  • Optimization
  • Timeseries collection and analyis
  • Interpretability (https://github.com/interpretml/interpret/releases)
  • Dashboarding (https://github.com/apache/superset/releases)

Data layer

ML Model Monagement

jupyterlab

spark-kubernetes-notebook

spark-3 2-shell

kubeflow

K3S Kubernetes Container Image Management

kubernetes

spark-kubernetes-notebook-images

k3s-images

CI/CD

kds-container-ui

About

Deploys a data science tool suite to a Kubernetes distributed compute cluster

Resources

License

Stars

Watchers

Forks