Commit 949918f: Update
leestott authored Nov 25, 2024
1 parent b9ee45e commit 949918f
Showing 10 changed files with 620 additions and 0 deletions.
2 changes: 2 additions & 0 deletions src/02.ONNXRuntime/02.phi-3.5-webgpu/README.md
@@ -2,6 +2,8 @@

A simple React + Vite application for running [Phi-3.5-mini-instruct](https://huggingface.co/onnx-community/Phi-3.5-mini-instruct-onnx-web), a powerful small language model, locally in the browser using Transformers.js and WebGPU-acceleration.

If you don't want to install the solution, you can run it from a [Hugging Face Space](https://huggingface.co/spaces/webml-community/phi-3.5-webgpu).

## Getting Started

Follow the steps below to set up and run the application in the Codespaces environment.
1 change: 1 addition & 0 deletions src/02.ONNXRuntime/Readme.md
@@ -4,6 +4,7 @@
This folder contains demos for ONNX Runtime WebGPU samples.

### [Demo 1. Phi-3.5 WebGPU Chat](/src/02.ONNXRuntime/02.phi-3.5-webgpu/README.md)
[Link to deployed demo](https://huggingface.co/spaces/webml-community/phi-3.5-webgpu)

### [Demo 2. ONNXRuntime WebGPU RAG](/src/02.ONNXRuntime/01.WebGPUChatRAG/Readme.md)

289 changes: 289 additions & 0 deletions src/03.AIToolsSolutionE2E/Olive_Demo/data/data_sample_travel.jsonl

236 changes: 236 additions & 0 deletions src/03.AIToolsSolutionE2E/Olive_Demo/readme.md
@@ -0,0 +1,236 @@
# Optimize AI models for on-device inference

## Introduction

> [!IMPORTANT]
> This demo requires an **Nvidia A10 or A100 GPU** with the associated drivers and CUDA toolkit (version 12+) installed.

> [!NOTE]
> This demo takes around **35 minutes** to set up and provides a hands-on introduction to the core concepts of optimizing models for on-device inference using Olive. The key part of the demo is Step 5: inference with the fine-tuned model.

## Learning Objectives

By the end, you will be able to use OLIVE to:

- Quantize an AI Model using the AWQ quantization method.
- Fine-tune an AI model for a specific task.
- Generate LoRA adapters (fine-tuned model) for efficient on-device inference on the ONNX Runtime.

### What is Olive

Olive (*O*NNX *live*) is a model optimization toolkit with an accompanying CLI that enables you to ship models for the [ONNX Runtime](https://onnxruntime.ai) with quality and performance.

![Olive Flow](./images/olive-flow.png)

The input to Olive is typically a PyTorch or Hugging Face model and the output is an optimized ONNX model that is executed on a device (deployment target) running the ONNX runtime. Olive will optimize the model for the deployment target's AI accelerator (NPU, GPU, CPU) provided by a hardware vendor such as Qualcomm, AMD, Nvidia or Intel.

Olive executes a *workflow*, which is an ordered sequence of individual model optimization tasks called *passes*. Example passes include model compression, graph capture, quantization and graph optimization. Each pass has a set of parameters that can be tuned to achieve the best metrics, such as accuracy and latency, as evaluated by the respective evaluator. Olive employs a search strategy that uses a search algorithm to auto-tune each pass one by one, or a set of passes together.
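
To make the idea of a workflow of passes concrete, here is a small, self-contained sketch in plain Python. It is not Olive's actual API; it only illustrates that a workflow applies an ordered list of passes, each with tunable parameters, to a model artifact.

```python
# Conceptual illustration of a workflow of passes - NOT Olive's actual API.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Pass:
    """One optimization step with tunable parameters (e.g. quantization bits)."""
    name: str
    run: Callable[[Any, Dict[str, Any]], Any]
    params: Dict[str, Any] = field(default_factory=dict)

def run_workflow(model: Any, passes: List[Pass]) -> Any:
    """Apply each pass in order; a search strategy could sweep the params."""
    for p in passes:
        print(f"running pass '{p.name}' with params {p.params}")
        model = p.run(model, p.params)
    return model

# Hypothetical passes standing in for compression, quantization and fusion.
passes = [
    Pass("quantize", lambda m, p: f"{m}+int{p['bits']}", {"bits": 4}),
    Pass("graph_optimize", lambda m, p: f"{m}+fused"),
]

print(run_workflow("phi-3.5-mini", passes))   # phi-3.5-mini+int4+fused
```

In Olive itself, the equivalent structure is expressed declaratively in a YAML/JSON workflow file, or implicitly by the CLI commands used later in this lab.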

#### Benefits of Olive

- **Reduce frustration and time** of trial-and-error manual experimentation with different techniques for graph optimization, compression and quantization. Define your quality and performance constraints and let Olive automatically find the best model for you.
- **40+ built-in model optimization components** covering cutting edge techniques in quantization, compression, graph optimization and finetuning.
- **Easy-to-use CLI** for common model optimization tasks. For example: `olive quantize`, `olive auto-opt`, `olive finetune`.
- Model packaging and deployment built-in.
- Supports generating models for **Multi LoRA serving**.
- Construct workflows using YAML/JSON to orchestrate model optimization and deployment tasks.
- **Hugging Face** and **Azure AI** Integration.
- Built-in **caching** mechanism to **save costs**.

## Lab Instructions
> [!NOTE]
> Please ensure you have provisioned your Azure AI Hub and Project and set up your A100 compute.

### Step 0: Connect to your Azure AI Compute

You'll connect to the Azure AI compute using the remote feature in **VS Code.**

1. Open your **VS Code** desktop application.
1. Open the **command palette** using **Shift+Ctrl+P**.
1. In the command palette, search for **AzureML - remote: Connect to compute instance in New Window**.
1. Follow the on-screen instructions to connect to the compute. This will involve selecting your Azure Subscription, Resource Group, Project and the Compute name you set up in Lab 1.
1. Once you're connected to your Azure ML compute node, this will be displayed in the **bottom left of VS Code**: `><Azure ML: Compute Name`

### Step 1: Clone this repo

In VS Code, you can open a new terminal with **Ctrl+J** and clone this repo:

In the terminal you should see the prompt

```
azureuser@computername:~/cloudfiles/code$
```
Clone the solution:

```bash
cd ~/localfiles
git clone https://github.com/microsoft/aitour-exploring-cutting-edge-models.git
```

### Step 2: Open Folder in VS Code

To open VS Code in the relevant folder execute the following command in the terminal, which will open a new window:

```bash
code src/03.AIToolsSolutionE2E/Olive_Demo
```

Alternatively, you can open the folder by selecting **File** > **Open Folder**.

### Step 3: Dependencies

Open a terminal window in VS Code in your Azure AI Compute Instance (tip: **Ctrl+J**) and execute the following commands to install the dependencies:

```bash
conda create -n olive-ai python=3.11 -y
conda activate olive-ai
pip install -r requirements.txt
az extension remove -n azure-cli-ml
az extension add -n ml
```

> [!NOTE]
> It will take ~5 minutes to install all the dependencies.

In this lab you'll download and upload models to the Azure AI Model catalog. To access the model catalog, you'll need to log in to Azure using:

```bash
az login
```

> [!NOTE]
> At login time you'll be asked to select your subscription. Ensure you set the subscription to the one provided for this lab.

### Step 4: Execute Olive commands

Open a terminal window in VS Code in your Azure AI Compute Instance (tip: **Ctrl+J**) and ensure the `olive-ai` conda environment is activated:

```bash
conda activate olive-ai
```

Next, execute the following Olive commands in the command line.

1. **Inspect the data:** In this example, you're going to fine-tune the Phi-3.5-mini model so that it is specialized in answering travel-related questions. Each record in the dataset contains a `prompt` and a `response` field (the fields referenced by the `--text_template` used during fine-tuning). The command below displays the first few records of the dataset, which is in JSON Lines format:

```bash
head data/data_sample_travel.jsonl
```
1. **Quantize the model:** Before training the model, you first quantize it with the following command, which uses a technique called [Activation-aware Weight Quantization (AWQ)](https://arxiv.org/abs/2306.00978). AWQ quantizes the weights of a model by considering the activations produced during inference, so the quantization process takes the actual data distribution of the activations into account, which preserves model accuracy better than traditional weight-only quantization methods (a conceptual sketch of this idea appears after these steps).

```bash
olive quantize \
--model_name_or_path microsoft/Phi-3.5-mini-instruct \
--trust_remote_code \
--algorithm awq \
--output_path models/phi/awq \
--log_level 1
```

It takes **~8 minutes** to complete the AWQ quantization, which will **reduce the model size from ~7.5GB to ~2.5GB**.

In this lab, we're showing you how to input models from Hugging Face (for example: `microsoft/Phi-3.5-mini-instruct`). However, Olive also allows you to input models from the Azure AI catalog by updating the `model_name_or_path` argument to an Azure AI asset ID (for example: `azureml://registries/azureml/models/Phi-3.5-mini-instruct/versions/4`).
1. **Train the model:** Next, the `olive finetune` command finetunes the quantized model. Quantizing the model *before* fine-tuning instead of afterwards gives better accuracy as the fine-tuning process recovers some of the loss from the quantization.
```bash
olive finetune \
--method lora \
--model_name_or_path models/phi/awq \
--data_files "data/data_sample_travel.jsonl" \
--data_name "json" \
--text_template "<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>" \
--max_steps 100 \
--output_path ./models/phi/ft \
--log_level 1
```
It takes **~6 minutes** to complete the fine-tuning (with 100 steps).
1. **Optimize:** With the model trained, you now optimize it using Olive's `auto-opt` command, which will capture the ONNX graph and automatically perform a number of optimizations to improve the model's performance on CPU by compressing the model and applying operator fusions. Note that you can also optimize for other devices, such as an NPU or GPU, by simply updating the `--device` and `--provider` arguments; for the purposes of this lab we'll use CPU.
```bash
olive auto-opt \
--model_name_or_path models/phi/ft/model \
--adapter_path models/phi/ft/adapter \
--device cpu \
--provider CPUExecutionProvider \
--use_ort_genai \
--output_path models/phi/onnx-ao \
--log_level 1
```
It takes **~5 minutes** to complete the optimization.
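
As referenced in the quantization step above, the following is a minimal conceptual sketch of the activation-aware idea behind AWQ, written in plain NumPy. It is not the actual AWQ or Olive implementation; it only illustrates that per-channel scales derived from calibration activations can protect the channels that matter most before the weights are quantized.

```python
# Toy illustration of activation-aware weight quantization - NOT the real AWQ/Olive code.
import numpy as np

rng = np.random.default_rng(0)

# Calibration activations (tokens x in_channels) with one salient "outlier" channel,
# and a weight matrix (in_channels x out_channels) whose salient-channel weights are small.
X = rng.normal(size=(256, 8))
X[:, 0] *= 10.0
W = rng.normal(size=(8, 4))
W[0, :] *= 0.1

def quantize_int4(w):
    """Symmetric per-tensor int4 quantization, dequantized back for comparison."""
    step = np.abs(w).max() / 7.0
    return np.round(w / step).clip(-8, 7) * step

# Plain weight-only quantization ignores which channels the activations stress.
W_plain = quantize_int4(W)

# Activation-aware: scale salient input channels up before quantizing, then fold
# the inverse scale back (at runtime it would be folded into the activations).
s = np.abs(X).mean(axis=0) ** 0.5          # per-channel scale from activation stats
W_aware = quantize_int4(W * s[:, None]) / s[:, None]

err_plain = np.abs(X @ W - X @ W_plain).mean()
err_aware = np.abs(X @ W - X @ W_aware).mean()
print(f"mean output error - plain: {err_plain:.4f}, activation-aware: {err_aware:.4f}")
```

With a salient activation channel like this, the activation-aware variant typically reproduces the original layer output more closely than plain weight-only quantization; the real AWQ additionally searches for the best scaling strength and quantizes weights in groups.
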
### Step 5: Model inference quick test
To test inference with the model, create a Python file in your folder called **app.py** and copy and paste the following code:
```python
import onnxruntime_genai as og

# Load the optimized ONNX model and the fine-tuned LoRA adapter.
print("loading model and adapters...", end="", flush=True)
model = og.Model("models/phi/onnx-ao/model")
adapters = og.Adapters(model)
adapters.load("models/phi/onnx-ao/model/adapter_weights.onnx_adapter", "travel")
print("DONE!")

tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=100, past_present_share_buffer=False)

# Build the prompt using the same chat template used during fine-tuning.
user_input = "what is the best thing to see in chicago"
params.input_ids = tokenizer.encode(f"<|user|>\n{user_input}<|end|>\n<|assistant|>\n")

generator = og.Generator(model, params)
generator.set_active_adapter(adapters, "travel")

print(f"{user_input}")

# Stream tokens until generation completes.
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)
print("\n")
```
Execute the code using:
```bash
python app.py
```
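
Because adapters are loaded separately from the base ONNX model, you can register more than one and switch between them at generation time (the Multi LoRA serving scenario mentioned earlier). Below is a minimal sketch; the second adapter file `support_weights.onnx_adapter` is hypothetical and would have to be produced by fine-tuning on another dataset.

```python
import onnxruntime_genai as og

model = og.Model("models/phi/onnx-ao/model")
adapters = og.Adapters(model)

# The travel adapter produced in this lab, plus a second, hypothetical adapter.
adapters.load("models/phi/onnx-ao/model/adapter_weights.onnx_adapter", "travel")
adapters.load("models/phi/onnx-ao/model/support_weights.onnx_adapter", "support")

tokenizer = og.Tokenizer(model)

def generate(adapter_name: str, prompt: str) -> str:
    """Run one generation with the requested adapter active."""
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=100, past_present_share_buffer=False)
    params.input_ids = tokenizer.encode(f"<|user|>\n{prompt}<|end|>\n<|assistant|>\n")

    generator = og.Generator(model, params)
    generator.set_active_adapter(adapters, adapter_name)

    stream = tokenizer.create_stream()
    text = ""
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        text += stream.decode(generator.get_next_tokens()[0])
    return text

print(generate("travel", "what is the best thing to see in chicago"))
print(generate("support", "my order arrived damaged, what should I do"))
```
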
### Step 6: Upload model to Azure AI

Uploading the model to an Azure AI model repository makes it shareable with other members of your development team and also handles version control of the model.

To find your resource group and Azure AI Project name, run the following command:

```
az ml workspace show
```

Alternatively, go to https://ai.azure.com and select **Management center** > **Project** > **Overview**.

> [!NOTE]
> Update the `{}` placeholders with the name of your resource group and your Azure AI Project name.

To upload the model, run the following command:
```bash
az ml model create \
--name ft-for-travel \
--version 1 \
--path ./models/phi/onnx-ao \
--resource-group {RESOURCE_GROUP_NAME} \
--workspace-name {PROJECT_NAME}
```
You can then see your uploaded model and deploy it at https://ml.azure.com/model/list
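
If you prefer Python over the CLI, a minimal sketch of the same registration step using the `azure-ai-ml` SDK (already pinned in `requirements.txt`) might look like the following; the subscription ID, resource group and project name are placeholders you'd replace with your own values.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Placeholders - use the values from `az ml workspace show` / ai.azure.com.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP_NAME>",
    workspace_name="<PROJECT_NAME>",
)

# Register the optimized model folder as version 1 of "ft-for-travel".
model = Model(
    path="./models/phi/onnx-ao",
    name="ft-for-travel",
    version="1",
    type=AssetTypes.CUSTOM_MODEL,
    description="Phi-3.5-mini fine-tuned for travel Q&A (AWQ + LoRA, ONNX)",
)
ml_client.models.create_or_update(model)
```

The CLI command above and this SDK call register the same asset; use whichever fits your workflow.
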
13 changes: 13 additions & 0 deletions src/03.AIToolsSolutionE2E/Olive_Demo/requirements.txt
@@ -0,0 +1,13 @@
olive-ai==0.7.1
transformers==4.44.2
autoawq==0.2.6
optimum==1.23.1
peft==0.13.2
bitsandbytes==0.44.1
accelerate>=0.30.0
scipy==1.14.1
azure-ai-ml==1.21.1
onnxruntime-genai-cuda==0.5.0
tabulate==0.9.0
openai==1.54.4
python-dotenv==1.0.1
33 changes: 33 additions & 0 deletions src/03.AIToolsSolutionE2E/Olive_Demo/scripts/app.py
@@ -0,0 +1,33 @@
import onnxruntime_genai as og
import numpy as np
import time

model = og.Model("models/phi/onnx-ao/model")
adapters = og.Adapters(model)
adapters.load("models/phi/onnx-ao/model/adapter_weights.onnx_adapter", "travel")

tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=100, past_present_share_buffer=False)
params.input_ids = tokenizer.encode("<|user|>\nwhere is the best place in london<|end|>\n<|assistant|>\n")

generator = og.Generator(model, params)

generator.set_active_adapter(adapters, "travel")

print(f"[Travel]: Tell me what to do in London")
start = time.time()
token_count = 0
while not generator.is_done():
generator.compute_logits()
generator.generate_next_token()

new_token = generator.get_next_tokens()[0]
print(tokenizer_stream.decode(new_token), end='', flush=True)
token_count = token_count+1

print("\n")
end = time.time()
print(f"Tk.sec:{token_count/(end - start)}")
31 changes: 31 additions & 0 deletions src/03.AIToolsSolutionE2E/Olive_Demo/scripts/flow.sh
@@ -0,0 +1,31 @@
#!/bin/bash
echo -e "\n>>>>>> running awq quantization >>>>>>>>\n"

olive quantize \
--model_name_or_path azureml://registries/azureml/models/Phi-3.5-mini-instruct/versions/4 \
--algorithm awq \
--output_path models/phi/awq \
--log_level 1

echo -e "\n>>>>>> running finetuning >>>>>>>>\n"

olive finetune \
--method lora \
--model_name_or_path models/phi/awq \
--trust_remote_code \
--data_files "data/data_sample_travel.jsonl" \
--data_name "json" \
--text_template "<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>" \
--max_steps 100 \
--output_path ./models/phi/ft \
--log_level 1

echo -e "\n>>>>>> running optimizer >>>>>>>>\n"

olive auto-opt \
--model_name_or_path models/phi/ft/model \
--adapter_path models/phi/ft/adapter \
--device cpu \
--provider CPUExecutionProvider \
--use_ort_genai \
--output_path models/phi/onnx-ao \
--log_level 1
@@ -0,0 +1,6 @@
az ml model create \
--name ft-for-travel \
--version 1 \
--path ./models/phi/onnx-ao \
--resource-group RESOURCE_GROUP \
--workspace-name PROJECT_NAME
9 changes: 9 additions & 0 deletions src/03.AIToolsSolutionE2E/Readme.md
@@ -10,6 +10,15 @@ This demo provides a structured approach to fine-tuning the Phi-3 model using AI

**Sample Code**

**Demo 1. Olive Demo**

| Step | Description | Operation |
|-------------------|----------------------------------|-------------------|
|01.Installation| Please follow this step to set up your env|[Go](./Olive_Demo/readme.md)|
|02.Use Microsoft Olive to architect | Using Microsoft Olive tools to fit your SLMOps cycle|[Go](./Olive_Demo/readme.md)|
|04.Inference your fine-tuned model| Inference your ONNX model after fine-tuning|[Go](./Olive_Demo/readme.md)|

**Demo 2. QA_E2E**

| Step | Description | Operation |
|-------------------|----------------------------------|-------------------|
|01.Installation| Please follow this step to set your env|[Go](./qa_e2e/docs/01.Installation.md)|
