diff --git a/.buildinfo b/.buildinfo index 83ed1f5f..6964678a 100644 --- a/.buildinfo +++ b/.buildinfo @@ -1,4 +1,4 @@ # Sphinx build info version 1 # This file records the configuration used when building these files. When it is not found, a full rebuild will be done. -config: c9e032fa7f49dc809f164063dfb2e28e +config: dabe4189cc532017a30e750cb5d38b1e tags: 645f666f9bcd5a90fca523b33c5a78b7 diff --git a/_downloads/2b236384a146de2d4a64081ddb0a7c9a/developer_01_ir_builder.zip b/_downloads/2b236384a146de2d4a64081ddb0a7c9a/developer_01_ir_builder.zip index 37c81a1b..5736fe5f 100644 Binary files a/_downloads/2b236384a146de2d4a64081ddb0a7c9a/developer_01_ir_builder.zip and b/_downloads/2b236384a146de2d4a64081ddb0a7c9a/developer_01_ir_builder.zip differ diff --git a/_downloads/39c6904b3f007c07e3d59200d0bf98b4/dive_03_composition.ipynb b/_downloads/39c6904b3f007c07e3d59200d0bf98b4/dive_03_composition.ipynb new file mode 100644 index 00000000..b695a4e5 --- /dev/null +++ b/_downloads/39c6904b3f007c07e3d59200d0bf98b4/dive_03_composition.ipynb @@ -0,0 +1,183 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n# Kernel Composition\n\n**Author**: Hongzheng Chen (hzchen@cs.cornell.edu)\n\nThis document will discuss kernel composition.\nIn the previous tutorials, we have seen how to write a simple kernel.\nHowever, in real applications, we often need to compose multiple kernels together.\n\nIn the following example, we define a ``matrix_add`` and a ``gemm`` kernel, and wrap them into a ``top``-level function.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import allo\nfrom allo.ir.types import int32, float32\n\nM, K, N = 32, 32, 32\n\n\ndef matrix_add(A: int32[M, N]) -> int32[M, N]:\n B: int32[M, N] = 0\n for i, j in allo.grid(M, N):\n B[i, j] = A[i, j] + 1\n return B\n\n\ndef gemm(A: int32[M, K], B: int32[K, N]) -> int32[M, N]:\n C: int32[M, N] = 0\n for i, j in allo.grid(M, N):\n for k in allo.reduction(K):\n C[i, j] += A[i, k] * B[k, j]\n return C\n\n\ndef top(A: int32[M, K], B: int32[K, N]) -> int32[M, N]:\n C = gemm(A, B)\n D = matrix_add(C)\n return D" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Different teams or people can then work on different parts of the code and optimize each kernel.\nWe first create a schedule for the ``matrix_add`` kernel, and add several optimizations.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s1 = allo.customize(matrix_add)\ns1.pipeline(\"j\")\nprint(s1.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then we create a schedule for the ``gemm`` kernel and optimize it.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s2 = allo.customize(gemm)\ns2.reorder(\"k\", \"j\")\ns2.buffer_at(s2.C, axis=\"i\")\ns2.pipeline(\"j\")\nprint(s2.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that now we only optimize the separate kernels but do not incorporate them into the top-level function, as shown in the following printed module.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s = allo.customize(top)\nprint(s.module)" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "Therefore, after each part has been optimized, we need to explicitly *compose* them together.\nIn Allo, we can use the ``.compose()`` primitive to compose the schedules together into the parent function.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s.compose([s1, s2])\nprint(s.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the schedules for the ``matrix_add`` and ``gemm`` kernels are both correctly optimized in the top-level function.\n\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Template Composition\nSometimes we may define template kernels and invoke the kernel with different template arguments. Allo provides an *id* option to specify the exact kernel to be composed.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def kernel[T_in, T_out, S](A: \"T_in[S]\") -> \"T_out[S]\":\n B: T_out[S] = 0\n for i in range(S):\n with allo.meta_if(T_out == int32):\n B[i] = A[i] + 1\n with allo.meta_else():\n B[i] = A[i] * 2\n return B\n\n\ndef top2(A: int32[M]) -> float32[M]:\n C = kernel[int32, int32, M, \"K1\"](A)\n D = kernel[int32, float32, M, \"K2\"](C)\n return D" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Specifically, the last argument of the template kernel is the *id* of the kernel. Later on we can use this ID for distinguishing different kernels during composition.\nWe also customize the two template kernels with different optimizations first.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s1 = allo.customize(kernel, instantiate=[int32, int32, M])\ns1.unroll(\"i\", factor=4)\nprint(s1.module)\n\ns2 = allo.customize(kernel, instantiate=[int32, float32, M])\ns2.pipeline(\"i\")\nprint(s2.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we compose the two template kernels into the top-level function with the ID specified.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s = allo.customize(top2)\ns.compose(s1, id=\"K1\")\ns.compose(s2, id=\"K2\")\nprint(s.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see from the printed module that the loop in the first kernel is unrolled by a factor of 4, and the loop in the second kernel is pipelined.\n\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/_downloads/4569d89feac47262c8c4e3e128da7a7e/dive_04_features.zip b/_downloads/4569d89feac47262c8c4e3e128da7a7e/dive_04_features.zip new file mode 100644 index 00000000..bc7cd824 Binary files /dev/null and b/_downloads/4569d89feac47262c8c4e3e128da7a7e/dive_04_features.zip differ diff --git a/_downloads/48b69635df4cfe1643d9d6b9bdf6cd79/tutorial_01_get_started.zip 
b/_downloads/48b69635df4cfe1643d9d6b9bdf6cd79/tutorial_01_get_started.zip index 5c881bf2..49c413eb 100644 Binary files a/_downloads/48b69635df4cfe1643d9d6b9bdf6cd79/tutorial_01_get_started.zip and b/_downloads/48b69635df4cfe1643d9d6b9bdf6cd79/tutorial_01_get_started.zip differ diff --git a/_downloads/4fba383e419c1fc1ea22179140eb2d12/dive_01_data_types.py b/_downloads/4fba383e419c1fc1ea22179140eb2d12/dive_01_data_types.py new file mode 100644 index 00000000..180bf685 --- /dev/null +++ b/_downloads/4fba383e419c1fc1ea22179140eb2d12/dive_01_data_types.py @@ -0,0 +1,114 @@ +# Copyright Allo authors. All Rights Reserved. +# SPDX-License-Identifier: Apache-2.0 + +""" +Data Types and Type Casting +=========================== + +**Author**: Hongzheng Chen (hzchen@cs.cornell.edu) + +This document will discuss the Allo-supported data types in detail. +All the data types are defined in the ``allo.ir.types`` module. +""" + +import allo +from allo.ir.types import int16, int32, float32, Int, UInt, Float, Fixed + +############################################################################## +# Currently, Allo supports three base data types for mathematical operations: +# +# - Integers: ``Int(bitwdith)``, ``UInt(bitwidth)`` +# - Floating points: ``Float(bitwidth)`` (only support 16, 32, and 64 bits) +# - Fixed points: ``Fixed(bitwidth, frac)``, ``UFixed(bitwidth, frac)`` +# +# For example, one can declare a 15-bit integer as ``Int(15)`` and an unsigned 8-bit fixed-point number with 3 fractional bits as ``UFixed(8, 3)``. +# For all the C/C++ supported data types, we provide shorthands like ``float32`` and ``int16`` to easily declare them. + +# %% +# Notice different from native Python, Allo requires the program to be **strongly and statically typed**. +# The variable types are either declared explicitly or inferred from the context. +# For a variable that first appears in the program, we should declare it with an expected data type using Python's type hint notation: + +a: int32 + +# %% +# Once the data types are defined, an important consideration is how to handle +# operations between variables of different types. Allo supports two types of casting: +# (1) implicit casting that is automatically done by the Allo compiler; +# and (2) explicit casting that is manually done by the user. + +############################################################################## +# Implicit Casting +# ---------------- +# Allo has a strong type system that follows the `MLIR convention `_ to enforce the operand types are the same for the arithmetic operations. +# However, it is burdensome for users to cast the variables every time, and it is also error-prone to avoid overflow when performing computations. +# Therefore, Allo is equipped with builtin casting rules to automatically cast the variables to the same type before the operation, which is called *implicit casting*. +# An example is shown below: + + +def add(a: int32, b: int32) -> int32: + return a + b + + +s = allo.customize(add) +print(s.module) + +# %% +# We can see that ``a`` and ``b`` are firstly casted to ``int33``, added +# together, and converted back to ``int32``. +# This is to avoid overflow and is automatically inferred by the Allo compiler. 
+ + +############################################################################## +# Explicit Casting +# ---------------- +# One can also explicitly cast the variable to a specific type by creating an intermediate variable, +# or use Python-builtin functions like ``float()`` and ``int()`` to explicitly cast a variable to ``float32`` or ``int32``. +# Another example is shown below: + + +def cast(a: int32) -> int16: + b: float32 = a # explicit + c: float32 = b * 2 + d: float32 = float(a) * 2 + e: int16 = c + d + return e + + +s = allo.customize(cast) +print(s.module) + +# %% +# By explicitly creating an intermediate variable ``b``, we can cast the ``int32`` variable ``a`` to the desired floating-point type. +# Similarly, calling ``float(a)`` can also cast ``a`` to a floating-point type. +# +# .. note:: +# +# The above stated explicit casting between integers and floating points preserves the value but the precision may be changed. +# If you want to use a union type to represent both integers and floating points, please use the `.bitcast()` API instead. For example, ``a.bitcast()`` can convert ``int32`` to ``float32`` representation with the bit pattern preserved. + +############################################################################## +# Bit Operations +# -------------- +# As hardware accelerators have ability to manipulate each bit of the data, Allo supports bit operations on +# those integer types. For example, we can access a specific bit in an integer ``a`` using the indexing operator: +# +# .. code-block:: python +# +# a[15] + +# %% +# We can also extract a chunk of bits from an integer using the slicing operator: +# +# .. code-block:: python +# +# a[0:16] +# +# .. note:: +# +# Allo follows the Python convention that the upper bound is not included, so ``[0:16]`` means +# extracting the first 16 bits, which is different from the Xilinx HLS convention that uses ``[0:15]`` +# to indicate the first 16 bits. + +# %% +# Not only constant values are supported, but also variables can be used as the index or the slice range. 
diff --git a/_downloads/5c3db288c9103701a8cc33d4c4f30066/dive_03_composition.zip b/_downloads/5c3db288c9103701a8cc33d4c4f30066/dive_03_composition.zip new file mode 100644 index 00000000..ecea986b Binary files /dev/null and b/_downloads/5c3db288c9103701a8cc33d4c4f30066/dive_03_composition.zip differ diff --git a/_downloads/68e0932078b39343e70c899a03d3ae7c/dive_01_data_types.ipynb b/_downloads/68e0932078b39343e70c899a03d3ae7c/dive_01_data_types.ipynb new file mode 100644 index 00000000..690f7223 --- /dev/null +++ b/_downloads/68e0932078b39343e70c899a03d3ae7c/dive_01_data_types.ipynb @@ -0,0 +1,146 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n# Data Types and Type Casting\n\n**Author**: Hongzheng Chen (hzchen@cs.cornell.edu)\n\nThis document will discuss the Allo-supported data types in detail.\nAll the data types are defined in the ``allo.ir.types`` module.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import allo\nfrom allo.ir.types import int16, int32, float32, Int, UInt, Float, Fixed" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Currently, Allo supports three base data types for mathematical operations:\n\n- Integers: ``Int(bitwdith)``, ``UInt(bitwidth)``\n- Floating points: ``Float(bitwidth)`` (only support 16, 32, and 64 bits)\n- Fixed points: ``Fixed(bitwidth, frac)``, ``UFixed(bitwidth, frac)``\n\nFor example, one can declare a 15-bit integer as ``Int(15)`` and an unsigned 8-bit fixed-point number with 3 fractional bits as ``UFixed(8, 3)``.\nFor all the C/C++ supported data types, we provide shorthands like ``float32`` and ``int16`` to easily declare them.\n\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice different from native Python, Allo requires the program to be **strongly and statically typed**.\nThe variable types are either declared explicitly or inferred from the context.\nFor a variable that first appears in the program, we should declare it with an expected data type using Python's type hint notation:\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "a: int32" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once the data types are defined, an important consideration is how to handle\noperations between variables of different types. 
Allo supports two types of casting:\n(1) implicit casting that is automatically done by the Allo compiler;\nand (2) explicit casting that is manually done by the user.\n\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implicit Casting\nAllo has a strong type system that follows the [MLIR convention](https://mlir.llvm.org/docs/Dialects/ArithOps/) to enforce the operand types are the same for the arithmetic operations.\nHowever, it is burdensome for users to cast the variables every time, and it is also error-prone to avoid overflow when performing computations.\nTherefore, Allo is equipped with builtin casting rules to automatically cast the variables to the same type before the operation, which is called *implicit casting*.\nAn example is shown below:\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def add(a: int32, b: int32) -> int32:\n return a + b\n\n\ns = allo.customize(add)\nprint(s.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that ``a`` and ``b`` are firstly casted to ``int33``, added\ntogether, and converted back to ``int32``.\nThis is to avoid overflow and is automatically inferred by the Allo compiler.\n\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explicit Casting\nOne can also explicitly cast the variable to a specific type by creating an intermediate variable,\nor use Python-builtin functions like ``float()`` and ``int()`` to explicitly cast a variable to ``float32`` or ``int32``.\nAnother example is shown below:\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def cast(a: int32) -> int16:\n b: float32 = a # explicit\n c: float32 = b * 2\n d: float32 = float(a) * 2\n e: int16 = c + d\n return e\n\n\ns = allo.customize(cast)\nprint(s.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By explicitly creating an intermediate variable ``b``, we can cast the ``int32`` variable ``a`` to the desired floating-point type.\nSimilarly, calling ``float(a)`` can also cast ``a`` to a floating-point type.\n\n

Note

The explicit casting between integers and floating points stated above preserves the value, but the precision may change.\n    If you want to use a union type to represent both integers and floating points, please use the ``.bitcast()`` API instead. For example, ``a.bitcast()`` converts an ``int32`` to its ``float32`` representation with the bit pattern preserved.

\n\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bit Operations\nAs hardware accelerators have the ability to manipulate each bit of the data, Allo supports bit operations on\nthose integer types. For example, we can access a specific bit in an integer ``a`` using the indexing operator:\n\n```python\na[15]\n```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also extract a chunk of bits from an integer using the slicing operator:\n\n```python\na[0:16]\n```\n

Note

Allo follows the Python convention that the upper bound is not included, so ``[0:16]`` means\n extracting the first 16 bits, which is different from the Xilinx HLS convention that uses ``[0:15]``\n to indicate the first 16 bits.

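As a quick plain-Python illustration of the exclusive upper bound (ordinary Python arithmetic, not Allo code):

```python
a = 0b1010_0101_1111_0000
# bits 0 through 15 -> 16 bits in total, the same bits the Allo slice a[0:16] selects
low16 = a & ((1 << 16) - 1)
assert low16 == a & 0xFFFF
```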
\n\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Not only constant values are supported, but also variables can be used as the index or the slice range.\n\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/_downloads/77583a2b4d388a5f4b2bf7b3eec828d2/dive_01_data_types.zip b/_downloads/77583a2b4d388a5f4b2bf7b3eec828d2/dive_01_data_types.zip new file mode 100644 index 00000000..f8f8a0b2 Binary files /dev/null and b/_downloads/77583a2b4d388a5f4b2bf7b3eec828d2/dive_01_data_types.zip differ diff --git a/_downloads/849a5b43539eae829c9f79867111880a/dive_02_template.zip b/_downloads/849a5b43539eae829c9f79867111880a/dive_02_template.zip new file mode 100644 index 00000000..e06fcf76 Binary files /dev/null and b/_downloads/849a5b43539eae829c9f79867111880a/dive_02_template.zip differ diff --git a/_downloads/8d3af32bb0bffe35477d27ae08e595fe/tutorial_01_get_started.py b/_downloads/8d3af32bb0bffe35477d27ae08e595fe/tutorial_01_get_started.py index 815d271d..253d0411 100644 --- a/_downloads/8d3af32bb0bffe35477d27ae08e595fe/tutorial_01_get_started.py +++ b/_downloads/8d3af32bb0bffe35477d27ae08e595fe/tutorial_01_get_started.py @@ -34,8 +34,10 @@ # %% # We then define a function that takes two 32x32 matrices as inputs and # returns a 32x32 matrix as output. The variable declaration is defined -# as ``: []``. We require **strict type annotation** in -# Allo's kernels, which is different from directly programming in Python. +# as ``: []``, and the function type is defined as +# ``(, , ...) -> ``. +# We require **strict type annotation** in Allo's kernels, which is different +# from directly programming in Python. # # Inside the kernel, we provide a shorthand for the loop iterator. 
For example, # ``for i, j, k in allo.grid(32, 32, 32)`` is equivalent to the following diff --git a/_downloads/90b883f891c63f481ffa4756cd7e0781/dive_04_features.ipynb b/_downloads/90b883f891c63f481ffa4756cd7e0781/dive_04_features.ipynb new file mode 100644 index 00000000..a90ba5e3 --- /dev/null +++ b/_downloads/90b883f891c63f481ffa4756cd7e0781/dive_04_features.ipynb @@ -0,0 +1,86 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n# Other Features\n\n**Author**: Hongzheng Chen (hzchen@cs.cornell.edu)\n\nThis document will discuss other features that are not covered in the previous tutorials.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dynamic Shapes\nIn some cases, the shape of the tensor is not known at compile time, so we can use ``[...]`` to represent the dynamic shape.\nFrom the generated MLIR module, we can see it has a ``\"?\"`` in the shape of the tensor, which means the shape is not predefined,\nbut we can still run the LLVM module with arbitrary shapes of NumPy arrays.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import allo\nfrom allo.ir.types import int32, float32\nimport numpy as np\n\n\ndef kernel(A: float32[...], B: float32[...], size: int32):\n for i in range(size):\n B[i] = A[i]\n\n\ns = allo.customize(kernel)\nprint(s.module)\nnp_A = np.random.random((256,)).astype(np.float32)\nallo_A = np.zeros((256,)).astype(np.float32)\nmod = s.build()\nmod(np_A, allo_A, 256)\nnp.testing.assert_allclose(np_A, allo_A)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also check the generated HLS code that the arguments are declared as pointers.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "code = s.build(target=\"vhls\")\nprint(code)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tuple Return\nAnother feature is the tuple support. 
As in Python, we can return multiple values from a function, Allo\nalso supports this by explicitly specifying the return type as a tuple.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def callee(a: float32, b: float32) -> (float32, float32):\n c: float32 = a + b\n d: float32 = a - b\n return c, d\n\n\ndef kernel(A: float32[10], B: float32[10]) -> (float32[10], float32[10]):\n C: float32[10] = 0\n D: float32[10] = 0\n for i in range(10):\n C[i], D[i] = callee(A[i], B[i])\n return C, D\n\n\ns = allo.customize(kernel)\nprint(s.module)\nmod = s.build()\nnp_A = np.random.random((10,)).astype(np.float32)\nnp_B = np.random.random((10,)).astype(np.float32)\nnp_C, np_D = mod(np_A, np_B)\nnp_C_ref = np.zeros((10,), dtype=np.float32)\nnp_D_ref = np.zeros((10,), dtype=np.float32)\nfor i in range(10):\n np_C_ref[i], np_D_ref[i] = callee(np_A[i], np_B[i])\nnp.testing.assert_allclose(np_C, np_C_ref)\nnp.testing.assert_allclose(np_D, np_D_ref)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/_downloads/9b041eab9a2cc4c12c883027dcc37a54/dive_02_template.py b/_downloads/9b041eab9a2cc4c12c883027dcc37a54/dive_02_template.py new file mode 100644 index 00000000..d0868c4c --- /dev/null +++ b/_downloads/9b041eab9a2cc4c12c883027dcc37a54/dive_02_template.py @@ -0,0 +1,82 @@ +# Copyright Allo authors. All Rights Reserved. +# SPDX-License-Identifier: Apache-2.0 + +""" +Template Kernels +================ + +**Author**: Hongzheng Chen (hzchen@cs.cornell.edu) + +This document explains how to write a template kernel in Allo. +Template kernels are useful when we need to reuse a kernel with different data types or when certain computation patterns depend on specific constants. +By leveraging template kernels, we can achieve greater flexibility and reusability in the code. +""" + +import allo +from allo.ir.types import int32, float32 + +# %% +# We follow Python's convention to use *type variable* to define a template kernel. +# Specifically, the type variable is specified after the function name using square brackets: ``def kernel[T](...)``, and the type variable can be used in the function signature and body. +# Importantly, as the native Python interpreter does not support Allo's type declaration (i.e., base type + shape), we need to use string annotations like ``"T[10]"`` to specify the type of the variables. +# Otherwise, it will raise a type error. +# +# In the following, we define a simple addition function that adds 1 to each element of the input array. +# To invoke the kernel with a specific data type, we can use the ``instantiate`` argument in the ``allo.customize`` function. + + +def kernel[T](A: "T[10]") -> "T[10]": + B: T[10] + for i in range(10): + B[i] = A[i] + 1 + return B + + +s = allo.customize(kernel, instantiate=[int32]) +print(s.module) + +# %% +# We can see that the kernel is specialized with the given ``int32`` data type. +# Similarly, we can directly declare a new kernel by specifying ``float32`` as the data type. 
+ +s = allo.customize(kernel, instantiate=[float32]) +print(s.module) + +# %% +# If we not only want to specialize the data type but also the shape of the array, we can provide another type variable, and pass it to the ``instantiate`` argument. +# Note that here we also use the ``: base_type`` notation to constrain the type of the type variable. Here we constrain the type variable ``M`` to be an integer. + + +def kernel2[T, M: int32](A: "T[M]") -> "T[M]": + B: T[M] + for i in range(M): + B[i] = A[i] + 1 + return B + + +s = allo.customize(kernel2, instantiate=[int32, 20]) +print(s.module) + +# %% +# Furthermore, Allo's template also enables metaprogramming that can evaluate type variables at compile time. +# Specifically, we can use the ``allo.meta_if``, ``allo.meta_elif``, and ``allo.meta_else`` to conditionally generate code based on the type variables. +# Just to make sure the conditions can be determined at compile time. + + +def kernel3[T, M: int32](A: "T[M]") -> "T[M]": + B: T[M] + for i in range(M): + with allo.meta_if(T == int32): + B[i] = A[i] + 1 + with allo.meta_else(): + B[i] = A[i] - 1 + return B + + +# %% +# In final generated code, we can see that only a single branch is generated based on the given data type. + +s = allo.customize(kernel3, instantiate=[int32, 20]) +print(s.module) +s = allo.customize(kernel3, instantiate=[float32, 20]) +print(s.module) diff --git a/_downloads/9fbd96ba55c84b58bccde28bb525c3ff/developer_02_mlir.zip b/_downloads/9fbd96ba55c84b58bccde28bb525c3ff/developer_02_mlir.zip index b1ad6292..c7d6b5b1 100644 Binary files a/_downloads/9fbd96ba55c84b58bccde28bb525c3ff/developer_02_mlir.zip and b/_downloads/9fbd96ba55c84b58bccde28bb525c3ff/developer_02_mlir.zip differ diff --git a/_downloads/a1303c8436389bcc90cc384bd5c2d23e/tutorial_02_vhls.zip b/_downloads/a1303c8436389bcc90cc384bd5c2d23e/tutorial_02_vhls.zip index a6b639a4..87d20891 100644 Binary files a/_downloads/a1303c8436389bcc90cc384bd5c2d23e/tutorial_02_vhls.zip and b/_downloads/a1303c8436389bcc90cc384bd5c2d23e/tutorial_02_vhls.zip differ diff --git a/_downloads/aac8c815d185f6d5646a9509ba2daa13/dive_03_composition.py b/_downloads/aac8c815d185f6d5646a9509ba2daa13/dive_03_composition.py new file mode 100644 index 00000000..ef3fc175 --- /dev/null +++ b/_downloads/aac8c815d185f6d5646a9509ba2daa13/dive_03_composition.py @@ -0,0 +1,120 @@ +# Copyright Allo authors. All Rights Reserved. +# SPDX-License-Identifier: Apache-2.0 + +""" +Kernel Composition +================== + +**Author**: Hongzheng Chen (hzchen@cs.cornell.edu) + +This document will discuss kernel composition. +In the previous tutorials, we have seen how to write a simple kernel. +However, in real applications, we often need to compose multiple kernels together. + +In the following example, we define a ``matrix_add`` and a ``gemm`` kernel, and wrap them into a ``top``-level function. +""" + +import allo +from allo.ir.types import int32, float32 + +M, K, N = 32, 32, 32 + + +def matrix_add(A: int32[M, N]) -> int32[M, N]: + B: int32[M, N] = 0 + for i, j in allo.grid(M, N): + B[i, j] = A[i, j] + 1 + return B + + +def gemm(A: int32[M, K], B: int32[K, N]) -> int32[M, N]: + C: int32[M, N] = 0 + for i, j in allo.grid(M, N): + for k in allo.reduction(K): + C[i, j] += A[i, k] * B[k, j] + return C + + +def top(A: int32[M, K], B: int32[K, N]) -> int32[M, N]: + C = gemm(A, B) + D = matrix_add(C) + return D + + +# %% +# Different teams or people can then work on different parts of the code and optimize each kernel. 
+# We first create a schedule for the ``matrix_add`` kernel, and add several optimizations. + +s1 = allo.customize(matrix_add) +s1.pipeline("j") +print(s1.module) + +# %% +# Then we create a schedule for the ``gemm`` kernel and optimize it. + +s2 = allo.customize(gemm) +s2.reorder("k", "j") +s2.buffer_at(s2.C, axis="i") +s2.pipeline("j") +print(s2.module) + +# %% +# Notice that now we only optimize the separate kernels but do not incorporate them into the top-level function, as shown in the following printed module. + +s = allo.customize(top) +print(s.module) + +# %% +# Therefore, after each part has been optimized, we need to explicitly *compose* them together. +# In Allo, we can use the ``.compose()`` primitive to compose the schedules together into the parent function. + +s.compose([s1, s2]) +print(s.module) + +# %% +# We can see that the schedules for the ``matrix_add`` and ``gemm`` kernels are both correctly optimized in the top-level function. + +############################################################################## +# Template Composition +# -------------------- +# Sometimes we may define template kernels and invoke the kernel with different template arguments. Allo provides an *id* option to specify the exact kernel to be composed. + + +def kernel[T_in, T_out, S](A: "T_in[S]") -> "T_out[S]": + B: T_out[S] = 0 + for i in range(S): + with allo.meta_if(T_out == int32): + B[i] = A[i] + 1 + with allo.meta_else(): + B[i] = A[i] * 2 + return B + + +def top2(A: int32[M]) -> float32[M]: + C = kernel[int32, int32, M, "K1"](A) + D = kernel[int32, float32, M, "K2"](C) + return D + + +# %% +# Specifically, the last argument of the template kernel is the *id* of the kernel. Later on we can use this ID for distinguishing different kernels during composition. +# We also customize the two template kernels with different optimizations first. + +s1 = allo.customize(kernel, instantiate=[int32, int32, M]) +s1.unroll("i", factor=4) +print(s1.module) + +s2 = allo.customize(kernel, instantiate=[int32, float32, M]) +s2.pipeline("i") +print(s2.module) + +# %% +# Finally, we compose the two template kernels into the top-level function with the ID specified. + +s = allo.customize(top2) +s.compose(s1, id="K1") +s.compose(s2, id="K2") +print(s.module) + +# %% +# We can see from the printed module that the loop in the first kernel is unrolled by a factor of 4, and the loop in the second kernel is pipelined. diff --git a/_downloads/addf17760130f22dafec92dedc62e16a/tutorial_01_get_started.ipynb b/_downloads/addf17760130f22dafec92dedc62e16a/tutorial_01_get_started.ipynb index d2f89745..95e32ef5 100644 --- a/_downloads/addf17760130f22dafec92dedc62e16a/tutorial_01_get_started.ipynb +++ b/_downloads/addf17760130f22dafec92dedc62e16a/tutorial_01_get_started.ipynb @@ -40,7 +40,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We then define a function that takes two 32x32 matrices as inputs and\nreturns a 32x32 matrix as output. The variable declaration is defined\nas ``: []``. We require **strict type annotation** in\nAllo's kernels, which is different from directly programming in Python.\n\nInside the kernel, we provide a shorthand for the loop iterator. 
For example,\n``for i, j, k in allo.grid(32, 32, 32)`` is equivalent to the following\nnested for-loop:\n\n```python\nfor i in range(32):\n for j in range(32):\n for k in range(32):\n # body\n```\nThe ``allo.grid`` API is used to define the iteration space of the loop.\nThe arguments denote the upper bounds of the loop iterators.\nNotice the above range-loop is also supported in the new Allo, so\nusers have more flexibility to define the loop structure.\n\n" + "We then define a function that takes two 32x32 matrices as inputs and\nreturns a 32x32 matrix as output. The variable declaration is defined\nas ``: []``, and the function type is defined as\n``(, , ...) -> ``.\nWe require **strict type annotation** in Allo's kernels, which is different\nfrom directly programming in Python.\n\nInside the kernel, we provide a shorthand for the loop iterator. For example,\n``for i, j, k in allo.grid(32, 32, 32)`` is equivalent to the following\nnested for-loop:\n\n```python\nfor i in range(32):\n for j in range(32):\n for k in range(32):\n # body\n```\nThe ``allo.grid`` API is used to define the iteration space of the loop.\nThe arguments denote the upper bounds of the loop iterators.\nNotice the above range-loop is also supported in the new Allo, so\nusers have more flexibility to define the loop structure.\n\n" ] }, { diff --git a/_downloads/d58a09ade6135cf6e79cb2fe738ace28/dive_04_features.py b/_downloads/d58a09ade6135cf6e79cb2fe738ace28/dive_04_features.py new file mode 100644 index 00000000..e087285f --- /dev/null +++ b/_downloads/d58a09ade6135cf6e79cb2fe738ace28/dive_04_features.py @@ -0,0 +1,76 @@ +# Copyright Allo authors. All Rights Reserved. +# SPDX-License-Identifier: Apache-2.0 + +""" +Other Features +============== + +**Author**: Hongzheng Chen (hzchen@cs.cornell.edu) + +This document will discuss other features that are not covered in the previous tutorials. +""" + +############################################################################## +# Dynamic Shapes +# -------------- +# In some cases, the shape of the tensor is not known at compile time, so we can use ``[...]`` to represent the dynamic shape. +# From the generated MLIR module, we can see it has a ``"?"`` in the shape of the tensor, which means the shape is not predefined, +# but we can still run the LLVM module with arbitrary shapes of NumPy arrays. + +import allo +from allo.ir.types import int32, float32 +import numpy as np + + +def kernel(A: float32[...], B: float32[...], size: int32): + for i in range(size): + B[i] = A[i] + + +s = allo.customize(kernel) +print(s.module) +np_A = np.random.random((256,)).astype(np.float32) +allo_A = np.zeros((256,)).astype(np.float32) +mod = s.build() +mod(np_A, allo_A, 256) +np.testing.assert_allclose(np_A, allo_A) + +# %% +# We can also check the generated HLS code that the arguments are declared as pointers. + +code = s.build(target="vhls") +print(code) + +############################################################################## +# Tuple Return +# ------------ +# Another feature is the tuple support. As in Python, we can return multiple values from a function, Allo +# also supports this by explicitly specifying the return type as a tuple. 
+ + +def callee(a: float32, b: float32) -> (float32, float32): + c: float32 = a + b + d: float32 = a - b + return c, d + + +def kernel(A: float32[10], B: float32[10]) -> (float32[10], float32[10]): + C: float32[10] = 0 + D: float32[10] = 0 + for i in range(10): + C[i], D[i] = callee(A[i], B[i]) + return C, D + + +s = allo.customize(kernel) +print(s.module) +mod = s.build() +np_A = np.random.random((10,)).astype(np.float32) +np_B = np.random.random((10,)).astype(np.float32) +np_C, np_D = mod(np_A, np_B) +np_C_ref = np.zeros((10,), dtype=np.float32) +np_D_ref = np.zeros((10,), dtype=np.float32) +for i in range(10): + np_C_ref[i], np_D_ref[i] = callee(np_A[i], np_B[i]) +np.testing.assert_allclose(np_C, np_C_ref) +np.testing.assert_allclose(np_D, np_D_ref) diff --git a/_downloads/de72dcd3242a3c85b41c9c54a3424409/dive_02_template.ipynb b/_downloads/de72dcd3242a3c85b41c9c54a3424409/dive_02_template.ipynb new file mode 100644 index 00000000..b910fc97 --- /dev/null +++ b/_downloads/de72dcd3242a3c85b41c9c54a3424409/dive_02_template.ipynb @@ -0,0 +1,133 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n# Template Kernels\n\n**Author**: Hongzheng Chen (hzchen@cs.cornell.edu)\n\nThis document explains how to write a template kernel in Allo.\nTemplate kernels are useful when we need to reuse a kernel with different data types or when certain computation patterns depend on specific constants.\nBy leveraging template kernels, we can achieve greater flexibility and reusability in the code.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import allo\nfrom allo.ir.types import int32, float32" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We follow Python's convention to use *type variable* to define a template kernel.\nSpecifically, the type variable is specified after the function name using square brackets: ``def kernel[T](...)``, and the type variable can be used in the function signature and body.\nImportantly, as the native Python interpreter does not support Allo's type declaration (i.e., base type + shape), we need to use string annotations like ``\"T[10]\"`` to specify the type of the variables.\nOtherwise, it will raise a type error.\n\nIn the following, we define a simple addition function that adds 1 to each element of the input array.\nTo invoke the kernel with a specific data type, we can use the ``instantiate`` argument in the ``allo.customize`` function.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def kernel[T](A: \"T[10]\") -> \"T[10]\":\n B: T[10]\n for i in range(10):\n B[i] = A[i] + 1\n return B\n\n\ns = allo.customize(kernel, instantiate=[int32])\nprint(s.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the kernel is specialized with the given ``int32`` data type.\nSimilarly, we can directly declare a new kernel by specifying ``float32`` as the data type.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s = allo.customize(kernel, instantiate=[float32])\nprint(s.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we not only want to specialize the data type but also the shape of the array, we can provide another type variable, and pass it to the 
``instantiate`` argument.\nNote that here we also use the ``: base_type`` notation to constrain the type of the type variable. Here we constrain the type variable ``M`` to be an integer.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def kernel2[T, M: int32](A: \"T[M]\") -> \"T[M]\":\n B: T[M]\n for i in range(M):\n B[i] = A[i] + 1\n return B\n\n\ns = allo.customize(kernel2, instantiate=[int32, 20])\nprint(s.module)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Furthermore, Allo's template also enables metaprogramming that can evaluate type variables at compile time.\nSpecifically, we can use the ``allo.meta_if``, ``allo.meta_elif``, and ``allo.meta_else`` to conditionally generate code based on the type variables.\nJust to make sure the conditions can be determined at compile time.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def kernel3[T, M: int32](A: \"T[M]\") -> \"T[M]\":\n B: T[M]\n for i in range(M):\n with allo.meta_if(T == int32):\n B[i] = A[i] + 1\n with allo.meta_else():\n B[i] = A[i] - 1\n return B" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In final generated code, we can see that only a single branch is generated based on the given data type.\n\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "s = allo.customize(kernel3, instantiate=[int32, 20])\nprint(s.module)\ns = allo.customize(kernel3, instantiate=[float32, 20])\nprint(s.module)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/_images/sphx_glr_dive_01_data_types_thumb.png b/_images/sphx_glr_dive_01_data_types_thumb.png new file mode 100644 index 00000000..8a5fed58 Binary files /dev/null and b/_images/sphx_glr_dive_01_data_types_thumb.png differ diff --git a/_images/sphx_glr_dive_02_template_thumb.png b/_images/sphx_glr_dive_02_template_thumb.png new file mode 100644 index 00000000..8a5fed58 Binary files /dev/null and b/_images/sphx_glr_dive_02_template_thumb.png differ diff --git a/_images/sphx_glr_dive_03_composition_thumb.png b/_images/sphx_glr_dive_03_composition_thumb.png new file mode 100644 index 00000000..8a5fed58 Binary files /dev/null and b/_images/sphx_glr_dive_03_composition_thumb.png differ diff --git a/_images/sphx_glr_dive_04_features_thumb.png b/_images/sphx_glr_dive_04_features_thumb.png new file mode 100644 index 00000000..8a5fed58 Binary files /dev/null and b/_images/sphx_glr_dive_04_features_thumb.png differ diff --git a/_modules/allo/customize.html b/_modules/allo/customize.html index e034f2ee..4b0822d9 100644 --- a/_modules/allo/customize.html +++ b/_modules/allo/customize.html @@ -5,7 +5,7 @@ allo.customize — Allo Documentation - + @@ -140,6 +140,15 @@

@@ -166,17 +175,17 @@

      Source code for allo.customize

       # SPDX-License-Identifier: Apache-2.0
       # pylint: disable=no-name-in-module
       
      -import re
      -import inspect
      -import textwrap
      -import copy
      -from dataclasses import dataclass
      -from functools import wraps
      -from types import FunctionType as PyFunctionType
      -from typing import Union
      -from collections.abc import Callable
      -
      -from ._mlir.ir import (
      +import re
      +import inspect
      +import textwrap
      +import copy
      +from dataclasses import dataclass
      +from functools import wraps
      +from types import FunctionType as PyFunctionType
      +from typing import Union
      +from collections.abc import Callable
      +
      +from ._mlir.ir import (
           Context,
           Location,
           InsertionPoint,
      @@ -192,7 +201,7 @@ 


           AffineMap,
           AffineMapAttr,
       )
      -from ._mlir.dialects import (
      +from ._mlir.dialects import (
           allo as allo_d,
           memref as memref_d,
           affine as affine_d,
      @@ -200,50 +209,46 @@ 


           arith as arith_d,
           func as func_d,
       )
      -from ._mlir.dialects.affine import (
      +from ._mlir.dialects.affine import (
           AffineExpr,
           AffineDimExpr,
       )
      -from ._mlir.exceptions import (
      +from ._mlir.exceptions import (
           AlloValueError,
       )
       
      -from . import primitives as prim
      -from .ir.visitor import ASTContext
      -from .ir.utils import MockArg, MockBuffer, parse_ast, get_global_vars
      -from .ir.builder import ASTTransformer
      -from .ir.infer import TypeInferer
      -from .ir.transform import (
      +from . import primitives as prim
      +from .ir.visitor import ASTContext
      +from .ir.utils import MockArg, MockBuffer, parse_ast, get_global_vars
      +from .ir.builder import ASTTransformer
      +from .ir.infer import TypeInferer
      +from .ir.transform import (
           get_affine_loop_nests,
           find_loop_in_bands,
           find_buffer,
           find_func_in_module,
           LoopWrapper,
       )
      -from .ir.use_def import UseDefChain
      -from .passes import (
      +from .passes import (
           _mlir_lower_pipeline,
           lower_linalg_and_attach_names,
      +    analyze_use_def,
       )
      -from .backend.llvm import LLVMModule
      -from .backend.hls import HLSModule
      -from .library import KERNEL2SCHEDULE
      +from .backend.llvm import LLVMModule
      +from .backend.hls import HLSModule
      +from .library import KERNEL2SCHEDULE
       
       
      -def getsourcefile(obj):
      +def getsourcefile(obj):
           ret = inspect.getsourcefile(obj)
           if ret is None:
               ret = inspect.getfile(obj)
           return ret
       
       
      -def getsourcelines(obj):
      -    return inspect.getsourcelines(obj)
      -
      -
      -def wrapped_apply(fn):
      +def wrapped_apply(fn):
           @wraps(fn)
      -    def wrapper(*args, **kwargs):
      +    def wrapper(*args, **kwargs):
               sch = args[0]
               with sch.module.context, Location.unknown():
                   res = fn(*args, **kwargs)
      @@ -267,7 +272,7 @@ 


       
       
       @dataclass
      -class Partition:
      +class Partition:
           Complete = 0
           Block = 1
           Cyclic = 2
      @@ -275,15 +280,14 @@ 


       
       
[docs]
-class Schedule:
-    def __init__(
+class Schedule:
+    def __init__(
         self,
         module,
         top_func,
         func_args,
         ip,
         ext_libs=None,
-        use_def_chain=None,
         inst_list=None,
     ):
         self.module = module
@@ -295,24 +299,23 @@


               if ext_libs is None:
                   ext_libs = []
               self.ext_libs = ext_libs
      -        self.use_def_chain = use_def_chain
               self.partitioned_arrays = {}
               self.inst_list = inst_list if inst_list is not None else []
       
      -    def get_loops(self, func=None):
      +    def get_loops(self, func=None):
               if isinstance(func, str):
                   func = self._find_function(func)
               if func is None:
                   func = self.top_func
               return get_affine_loop_nests(func)
       
      -    def _find_band(self, band_name, func=None):
      +    def _find_band(self, band_name, func=None):
               loops = self.get_loops(func)
               if band_name in loops.loops:
                   return loops[band_name]
               raise RuntimeError(f"Band {band_name} not found")
       
      -    def _find_function(self, name, error=True):
      +    def _find_function(self, name, error=True):
               for func in self.module.body.operations:
                   if isinstance(func, func_d.FuncOp) and func.name.value == name:
                       return func
      @@ -320,7 +323,7 @@ 


                   raise RuntimeError(f"Function {name} not found")
               return None
       
      -    def _get_func_and_axis(self, axis):
      +    def _get_func_and_axis(self, axis):
               if isinstance(axis, LoopWrapper):
                   func = self._find_function(axis.func)
                   return func, axis
      @@ -334,7 +337,7 @@ 


       
      [docs] @wrapped_apply - def split(self, axis, factor): + def split(self, axis, factor): """ `split` will find the loop with loop index `axis` and tile it with each tile size `factor` The new inner loop will be named `axis.inner` and the outer loop will be named `axis.outer` @@ -360,7 +363,7 @@


       
      [docs] @wrapped_apply - def reorder(self, *args): + def reorder(self, *args): """ Reorders nested loops with indices listed in `args` such that the outermost loop is the first index listed in `args`, the second is the second outermost, and so on. @@ -385,7 +388,7 @@


       
      [docs] @wrapped_apply - def unroll(self, axis, factor=0): + def unroll(self, axis, factor=0): """ Unrolls a loop with loop index `axis` by `factor`. @@ -411,7 +414,7 @@


       
      [docs] @wrapped_apply - def fuse(self, *args): + def fuse(self, *args): """ Combines loops with indices listed in `args` into a single loop over a single index. @@ -435,7 +438,7 @@


       
      [docs] @wrapped_apply - def partition(self, target, partition_type=Partition.Complete, dim=0, factor=0): + def partition(self, target, partition_type=Partition.Complete, dim=0, factor=0): """ Partitions a given array, for example if the array is `B`, this would be `<schedule>.B`. There are three types, `Partition.Complete`, `Partition.Block`, and `Partition.cyclic`. @@ -464,14 +467,15 @@


                   raise AlloValueError("Invalid dimension")
               if factor < 0:
                   raise AlloValueError("Invalid factor")
      -        if partition_type == Partition.Complete:
      -            partition_type = 0
      -        elif partition_type == Partition.Block:
      -            partition_type = 1
      -        elif partition_type == Partition.Cyclic:
      -            partition_type = 2
      -        else:
      -            raise AlloValueError("Not supported partition type")
      +        match partition_type:
      +            case Partition.Complete:
      +                partition_type = 0
      +            case Partition.Block:
      +                partition_type = 1
      +            case Partition.Cyclic:
      +                partition_type = 2
      +            case _:
      +                raise AlloValueError("Not supported partition type")
               # test whether partitioning the same array
               for parray, items in self.partitioned_arrays.items():
                   for item in items:
      @@ -492,15 +496,23 @@ 


               visited_target_names = []
               visited_func_calls = []
       
      -        def recursive_partition(inner_target):
      +        def recursive_partition(inner_target):
                   name = f"{inner_target.func}:{inner_target.name}"
                   if name in visited_target_names:
                       return
                   visited_target_names.append(name)
                   _, _, mlir_target = find_buffer(self.module, inner_target, self.func_args)
                   # equivalent users
      -            for tensor in self.use_def_chain.get_equivalent_tensors(name):
      -                recursive_partition(MockBuffer(tensor.path, tensor.name))
      +            if inner_target.name in self.func_args[inner_target.func]:
      +                # is a function argument
      +                idx = self.func_args[inner_target.func].index(inner_target.name)
      +                name = f"{inner_target.func}:{idx}"
      +            for buf_name in self.get_equivalent_variables(name):
      +                path, buf_name = buf_name.split(":")
      +                if buf_name.isdigit():
      +                    # function argument
      +                    buf_name = self.func_args[path][int(buf_name)]
      +                recursive_partition(MockBuffer(path, buf_name))
                   # calling the same function
                   if isinstance(mlir_target, func_d.CallOp):
                       visited_func_calls.append(mlir_target)
      @@ -590,7 +602,7 @@ 


       
      [docs] @wrapped_apply - def buffer_at(self, target, axis): + def buffer_at(self, target, axis): """ Creates a chip buffer to hold the values of `target` written to in loop with index `axis` instead of immediately writing them to memory. @@ -617,7 +629,7 @@


       
      [docs] @wrapped_apply - def reshape(self, target, shape): + def reshape(self, target, shape): """ Takes an array in the kernel, `target`, for example if the array is `B`, then would be `target` would be `<schedule>.B`, and reshapes it to tuple `shape`. As an example, if the desired shape is 32 by 4 by 8, the `<shape>` would be `(32, 4, 8)`. @@ -639,7 +651,7 @@


       
      [docs] @wrapped_apply - def pipeline(self, axis, initiation_interval=1, rewind=False): + def pipeline(self, axis, initiation_interval=1, rewind=False): """ Pipelines a loop with index `axis` into `initiation_interval` stages. @@ -670,7 +682,7 @@


       
      [docs] @wrapped_apply - def parallel(self, axis): + def parallel(self, axis): """ Instantiates a loop with index `axis` to be computed in parallel with the loops it is nested with. @@ -691,7 +703,7 @@


       
      [docs] @wrapped_apply - def inline(self, axis=None): + def inline(self, axis=None): """ Inlines a function `axis`. @@ -711,7 +723,7 @@


       
      [docs] @wrapped_apply - def dataflow(self, axis): + def dataflow(self, axis): """ Applies a "dataflow" attribute to function `axis`. This allows for parallelism if the given function uses streams or the `to` schedule. @@ -731,7 +743,7 @@


               band_name = band_name.split(":")[1]
               cnt = 0
       
      -        def locate_loop(op):
      +        def locate_loop(op):
                   nonlocal cnt
                   for ope in op.body.operations:
                       if isinstance(ope, (scf_d.ForOp, affine_d.AffineForOp)):
      @@ -758,7 +770,7 @@ 


       
      [docs] @wrapped_apply - def compute_at(self, from_loop, target_loop): + def compute_at(self, from_loop, target_loop): """ If `from_loop` and `target_loop` are indices over the same range, `<schedule>.compute_at(from_loop, target_loop)` merges the two loops, taking the body of `from_loop` and appending it to the body of `target_loop`. @@ -787,7 +799,7 @@


       
      [docs] @wrapped_apply - def reuse_at(self, target, axis): + def reuse_at(self, target, axis): """ Takes an array in a kernel, for example if the array is `B`, this would be `<schedule>.B`, accessed by index `axis` and creates a reuse buffer to reuse values from `target` which are accessed in a sequentially moving window. @@ -809,7 +821,7 @@


               loop_hdl = allo_d.CreateLoopHandleOp(op_hdl.result, StringAttr.get(axis), ip=ip)
               memref_type = MemRefType.get((1,), F32Type.get())
       
      -        def find_reuse_buffers(res):
      +        def find_reuse_buffers(res):
                   for func in self.module.body.operations:
                       if isinstance(func, func_d.FuncOp):
                           for op in func.entry_block.operations:
      @@ -841,7 +853,7 @@ 


       
      [docs] @wrapped_apply - def to(self, target, dst, axis=None, depth=-1): + def to(self, target, dst, axis=None, depth=-1): """ Takes an array in the kernel, `target`, for example if the array is `B`, this would be `target` would be `<schedule>.B`, and converts it into a stream. `dst` is the name of the array any value of `target` is written to. @@ -871,7 +883,7 @@


       
      [docs] @wrapped_apply - def unfold(self, band_name, axes): + def unfold(self, band_name, axes): """ Finds a set of nested loops with name `band_name` and for every `<i>` in list `axes`. The `<i>th` nested loop is unfolded into a constant number of copies of it's loop body. @@ -920,7 +932,7 @@

      Source code for allo.customize

                       if isinstance(op, affine_d.AffineYieldOp):
                           break
       
      -            def update_operand(op, old, new):
      +            def update_operand(op, old, new):
                       if isinstance(op, affine_d.AffineForOp):
                           # pylint: disable=cell-var-from-loop
                           for in_op in op.body.operations:
      @@ -974,7 +986,7 @@ 

      Source code for allo.customize

       
[docs] @wrapped_apply - def compose(self, schs: list, id=None, instantiate=None): + def compose(self, schs: list, id=None, instantiate=None): """ Incorporates `schs`, the schedules of kernels called by this kernel, into this kernel's schedule. @@ -995,7 +1007,7 @@

      Source code for allo.customize

                   This is a list of objects used to instantiate types `schs` is generic over.
               """
       
      -        def get_name(arg):
      +        def get_name(arg):
                   if isinstance(arg, (LoopWrapper, MockBuffer)):
                       arg = copy.copy(arg)
                       orig_func_name = arg.func if arg.func is not None else sch.top_func_name
      @@ -1073,7 +1085,14 @@ 

      Source code for allo.customize

                           self.primitive_sequences.append((primitive[0], args, kwargs))
      - def build(self, target=None, mode=None, project=None, configs=None): + def get_equivalent_variables(self, name): + use_def = analyze_use_def(self.module) + for ele in use_def: + if name in ele: + return ele + return [] + + def build(self, target=None, mode=None, project=None, configs=None, wrap_io=True): if target is None or target == "llvm": target = "llvm" return LLVMModule( @@ -1081,22 +1100,32 @@

      Source code for allo.customize

                       top_func_name=self.top_func_name,
                       ext_libs=self.ext_libs,
                   )
      -        if target in {"vhls", "vivado_hls", "vitis_hls"}:
      +        if target in {"vhls", "vivado_hls", "vitis_hls", "tapa", "ihls"}:
      +            match target:
      +                case "vitis_hls":
      +                    platform = "vitis_hls"
      +                case "tapa":
      +                    platform = "tapa"
      +                case "ihls":
      +                    platform = "intel_hls"
      +                case _:
      +                    platform = "vivado_hls"
                   return HLSModule(
                       self.module,
                       top_func_name=self.top_func_name,
      -                platform="vivado_hls" if target != "vitis_hls" else "vitis_hls",
      +                platform=platform,
                       mode=mode,
                       project=project,
                       ext_libs=self.ext_libs,
                       configs=configs,
                       func_args=self.func_args,
      +                wrap_io=wrap_io,
                   )
               raise NotImplementedError(f"Target {target} is not supported")
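Given the target handling above, a schedule might be built for the different backends roughly as follows; the project names and the ``csim`` mode string are illustrative assumptions, and ``kernel`` stands for any schedulable function:

.. code-block:: python

   s = allo.customize(kernel)
   cpu_mod = s.build()                      # defaults to the LLVM JIT backend
   hls_code = s.build(target="vhls")        # Vivado HLS C++ source
   vitis_prj = s.build(target="vitis_hls", mode="csim", project="kernel.prj")
   tapa_prj = s.build(target="tapa", project="kernel_tapa.prj", wrap_io=False)
   # The new helper added above can also be queried for buffers that alias a
   # given name (assuming the kernel has a buffer "B"):
   print(s.get_equivalent_variables("B"))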
      -def customize( +def customize( fn: Union[Callable, str], verbose: bool = False, enable_tensor: bool = False, @@ -1107,37 +1136,36 @@

      Source code for allo.customize

       ):
           # Get Python AST
           if isinstance(fn, str):
      -        src = fn
      +        src, starting_line_no = fn, 1
           else:
      -        src, _ = getsourcelines(fn)
      +        src, starting_line_no = inspect.getsourcelines(fn)
               src = [textwrap.fill(line, tabsize=4, width=9999) for line in src]
               src = textwrap.dedent("\n".join(src))
      -    tree = parse_ast(src, verbose)
      +    tree = parse_ast(src, starting_line_no=starting_line_no, verbose=verbose)
           if instantiate is None:
               instantiate = []
           if global_vars is None:
               global_vars = get_global_vars(fn)
      -    # Use-def chain analysis
      -    use_def_chain = UseDefChain(global_vars.copy(), instantiate)
      -    use_def_chain.visit(tree)
           # Type construction
           ctx_type_inf = ASTContext(
      +        tree=tree,
               global_vars=global_vars.copy(),
               mlir_ctx=Context() if context is None else context,
      +        inst=instantiate,
               enable_tensor=enable_tensor,
               verbose=verbose,
           )
      -    ctx_type_inf.inst = instantiate
           tree = TypeInferer()(ctx_type_inf, tree)
           ctx_type_inf = None
           # Start building IR
           ctx = ASTContext(
      +        tree=tree,
               global_vars=global_vars,
               mlir_ctx=Context() if context is None else context,
      +        inst=instantiate,
               enable_tensor=enable_tensor,
               verbose=verbose,
           )
      -    ctx.inst = instantiate
           module = ASTTransformer()(ctx, tree)
           if lower_linalg:
               lower_linalg_and_attach_names(module)
      @@ -1148,7 +1176,6 @@ 

      Source code for allo.customize

               ctx.func_args,
               InsertionPoint.at_block_terminator(ctx.top_func.entry_block),
               ext_libs=ctx.ext_libs,
      -        use_def_chain=use_def_chain,
               inst_list=instantiate,
           )
           # Attach buffers to schedule:
      @@ -1203,7 +1230,7 @@ 

      Source code for allo.customize

       
       
           
       
      diff --git a/_modules/index.html b/_modules/index.html
      index 52084588..38110646 100644
      --- a/_modules/index.html
      +++ b/_modules/index.html
      @@ -5,7 +5,7 @@
           
           
           Overview: module code — Allo Documentation
      -    
      +    
           
           
           
      @@ -140,6 +140,15 @@ 

      Quick search

    • Getting Started
    • Vivado/Vitis HLS Backend
    +

    Deep Dive

    +

    Developer Guide

    • Developer Setup
    • @@ -187,7 +196,7 @@

      All modules for which code is available

diff --git a/_sources/dive/ip.rst.txt b/_sources/dive/ip.rst.txt new file mode 100644 index 00000000..6d4fab85 --- /dev/null +++ b/_sources/dive/ip.rst.txt @@ -0,0 +1,88 @@ +.. Copyright Allo authors. All Rights Reserved. + SPDX-License-Identifier: Apache-2.0 + +.. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +############## +IP Integration +############## + +Apart from directly writing Allo kernels in Python, we also support integrating existing C++ HLS kernels into Allo. This feature is useful when you have existing, already-optimized C++ HLS code that you want to integrate into Allo. The following example shows how to integrate a simple vector addition kernel written in C++ into Allo. + +Suppose the C++ kernel header is defined in the ``vadd.h`` file: + +.. code-block:: cpp + + #ifndef VADD_H + #define VADD_H + + void vadd(int A[32], int B[32], int C[32]); + + #endif // VADD_H + +And the corresponding implementation is defined in the ``vadd.cpp`` file: + +.. code-block:: cpp + + #include "vadd.h" + using namespace std; + + void vadd(int A[32], int B[32], int C[32]) { + for (int i = 0; i < 32; ++i) { + C[i] = A[i] + B[i]; + } + } + +In Allo, we can create an *IP module* to wrap the C++ kernel. Basically, we need to provide the top-level function name, the header files, and the implementation files. Currently, an Allo signature is also required to specify the input and output types of the kernel. Allo will automatically compile the C++ kernel and generate the corresponding Python wrapper based on the provided files and signature. The last argument ``link_hls`` determines whether the C++ compiler should link the Vitis HLS libraries (e.g., ``ap_int``), which is only available when Vitis HLS is installed on your machine. + +.. code-block:: python + + vadd = allo.IPModule( + top="vadd", + headers=["vadd.h"], + impls=["vadd.cpp"], + signature=["int32[32]", "int32[32]", "int32[32]"], + link_hls=False, + ) + +After creating the IP module, we can use it in Allo as a normal Python function. For example, we can directly call the ``vadd`` function to perform vector addition. The inputs and outputs will be automatically wrapped and unwrapped as NumPy arrays, which greatly simplifies the burden of complex C-Python interface management. This is also very useful when you want to debug the HLS kernels with Python data. + +.. code-block:: python + + np_A = np.random.randint(0, 100, (32,)).astype(np.int32) + np_B = np.random.randint(0, 100, (32,)).astype(np.int32) + np_C = np.zeros((32,), dtype=np.int32) + vadd(np_A, np_B, np_C) + np.testing.assert_allclose(np_A + np_B, np_C, atol=1e-6) + +Moreover, the IP module can also be called in a normal Allo kernel. In the following example, we wrap the ``vadd`` function into an Allo ``kernel`` and use it to perform vector addition.
The Allo kernel can then be further customized and compiled with the external C++ HLS kernel. + +.. code-block:: python + + def kernel(A: int32[32], B: int32[32]) -> int32[32]: + C: int32[32] = 0 + vadd(A, B, C) + return C + + s = allo.customize(kernel) + print(s.module) + mod = s.build() + np_A = np.random.randint(0, 100, (32,)).astype(np.int32) + np_B = np.random.randint(0, 100, (32,)).astype(np.int32) + allo_C = mod(np_A, np_B) + np.testing.assert_allclose(np_A + np_B, allo_C, atol=1e-6) diff --git a/_sources/dive/pytorch.rst.txt b/_sources/dive/pytorch.rst.txt new file mode 100644 index 00000000..9076dd06 --- /dev/null +++ b/_sources/dive/pytorch.rst.txt @@ -0,0 +1,70 @@ +.. Copyright Allo authors. All Rights Reserved. + SPDX-License-Identifier: Apache-2.0 + +.. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +################### +PyTorch Integration +################### + +In this document, we will show how to directly compile PyTorch models to Allo. +First, users can define a PyTorch module as usual: + +.. code-block:: python + + import torch + import torch.nn.functional as F + import torch.nn as nn + + class Model(nn.Module): + def __init__(self): + super(Model, self).__init__() + + def forward(self, x, y): + x = x + y + x = F.relu(x) + return x + + model = Model() + model.eval() + +Then, users can compile the PyTorch model to Allo by using the ``allo.frontend.from_pytorch`` API: + +.. code-block:: python + + import allo + example_inputs = [torch.rand(1, 3, 10, 10), torch.rand(1, 3, 10, 10)] + llvm_mod = allo.frontend.from_pytorch(model, example_inputs=example_inputs) + +Then, we can use the generated Allo LLVM module as usual by passing in the NumPy inputs: + +.. code-block:: python + + golden = model(*example_inputs) + np_inputs = [x.detach().numpy() for x in example_inputs] + res = llvm_mod(*np_inputs) + torch.testing.assert_close(res, golden.detach().numpy()) + print("Passed!") + +The process should be very similar to the original Allo workflow. +The default target is LLVM. We can also change the backend to other compilers such as Vitis HLS by specifying the ``target``: + +.. code-block:: python + + mod = allo.frontend.from_pytorch(model, example_inputs=example_inputs, target="vhls") + print(mod.hls_code) diff --git a/_sources/gallery/developer_02_mlir.rst.txt b/_sources/gallery/developer_02_mlir.rst.txt index 08f9e222..35a548b1 100644 --- a/_sources/gallery/developer_02_mlir.rst.txt +++ b/_sources/gallery/developer_02_mlir.rst.txt @@ -331,7 +331,7 @@ the lowering pass from tensor dialect to LLVM dialect, and that is something we .. rst-class:: sphx-glr-timing - **Total running time of the script:** (0 minutes 0.074 seconds) + **Total running time of the script:** (0 minutes 0.070 seconds) .. 
_sphx_glr_download_gallery_developer_02_mlir.py: diff --git a/_sources/gallery/dive_01_data_types.rst.txt b/_sources/gallery/dive_01_data_types.rst.txt new file mode 100644 index 00000000..ed9b15aa --- /dev/null +++ b/_sources/gallery/dive_01_data_types.rst.txt @@ -0,0 +1,266 @@ + +.. DO NOT EDIT. +.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. +.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: +.. "gallery/dive_01_data_types.py" +.. LINE NUMBERS ARE GIVEN BELOW. + +.. only:: html + + .. note:: + :class: sphx-glr-download-link-note + + :ref:`Go to the end ` + to download the full example code. + +.. rst-class:: sphx-glr-example-title + +.. _sphx_glr_gallery_dive_01_data_types.py: + + +Data Types and Type Casting +=========================== + +**Author**: Hongzheng Chen (hzchen@cs.cornell.edu) + +This document will discuss the Allo-supported data types in detail. +All the data types are defined in the ``allo.ir.types`` module. + +.. GENERATED FROM PYTHON SOURCE LINES 13-17 + +.. code-block:: Python + + + import allo + from allo.ir.types import int16, int32, float32, Int, UInt, Float, Fixed + + + + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 18-26 + +Currently, Allo supports three base data types for mathematical operations: + +- Integers: ``Int(bitwdith)``, ``UInt(bitwidth)`` +- Floating points: ``Float(bitwidth)`` (only support 16, 32, and 64 bits) +- Fixed points: ``Fixed(bitwidth, frac)``, ``UFixed(bitwidth, frac)`` + +For example, one can declare a 15-bit integer as ``Int(15)`` and an unsigned 8-bit fixed-point number with 3 fractional bits as ``UFixed(8, 3)``. +For all the C/C++ supported data types, we provide shorthands like ``float32`` and ``int16`` to easily declare them. + +.. GENERATED FROM PYTHON SOURCE LINES 28-31 + +Notice different from native Python, Allo requires the program to be **strongly and statically typed**. +The variable types are either declared explicitly or inferred from the context. +For a variable that first appears in the program, we should declare it with an expected data type using Python's type hint notation: + +.. GENERATED FROM PYTHON SOURCE LINES 31-34 + +.. code-block:: Python + + + a: int32 + + + + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 35-39 + +Once the data types are defined, an important consideration is how to handle +operations between variables of different types. Allo supports two types of casting: +(1) implicit casting that is automatically done by the Allo compiler; +and (2) explicit casting that is manually done by the user. + +.. GENERATED FROM PYTHON SOURCE LINES 41-47 + +Implicit Casting +---------------- +Allo has a strong type system that follows the `MLIR convention `_ to enforce the operand types are the same for the arithmetic operations. +However, it is burdensome for users to cast the variables every time, and it is also error-prone to avoid overflow when performing computations. +Therefore, Allo is equipped with builtin casting rules to automatically cast the variables to the same type before the operation, which is called *implicit casting*. +An example is shown below: + +.. GENERATED FROM PYTHON SOURCE LINES 47-56 + +.. code-block:: Python + + + + def add(a: int32, b: int32) -> int32: + return a + b + + + s = allo.customize(add) + print(s.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. 
code-block:: none + + module { + func.func @add(%arg0: i32, %arg1: i32) -> i32 attributes {itypes = "ss", otypes = "s"} { + %0 = arith.extsi %arg0 : i32 to i33 + %1 = arith.extsi %arg1 : i32 to i33 + %2 = arith.addi %0, %1 : i33 + %3 = arith.trunci %2 : i33 to i32 + return %3 : i32 + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 57-60 + +We can see that ``a`` and ``b`` are firstly casted to ``int33``, added +together, and converted back to ``int32``. +This is to avoid overflow and is automatically inferred by the Allo compiler. + +.. GENERATED FROM PYTHON SOURCE LINES 63-68 + +Explicit Casting +---------------- +One can also explicitly cast the variable to a specific type by creating an intermediate variable, +or use Python-builtin functions like ``float()`` and ``int()`` to explicitly cast a variable to ``float32`` or ``int32``. +Another example is shown below: + +.. GENERATED FROM PYTHON SOURCE LINES 68-81 + +.. code-block:: Python + + + + def cast(a: int32) -> int16: + b: float32 = a # explicit + c: float32 = b * 2 + d: float32 = float(a) * 2 + e: int16 = c + d + return e + + + s = allo.customize(cast) + print(s.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + module { + func.func @cast(%arg0: i32) -> i16 attributes {itypes = "s", otypes = "s"} { + %0 = arith.sitofp %arg0 : i32 to f32 + %alloc = memref.alloc() {name = "b"} : memref + affine.store %0, %alloc[] {to = "b"} : memref + %c2_i32 = arith.constant 2 : i32 + %1 = arith.sitofp %c2_i32 : i32 to f32 + %2 = affine.load %alloc[] {from = "b"} : memref + %3 = arith.mulf %2, %1 : f32 + %alloc_0 = memref.alloc() {name = "c"} : memref + affine.store %3, %alloc_0[] {to = "c"} : memref + %4 = arith.sitofp %arg0 : i32 to f32 + %c2_i32_1 = arith.constant 2 : i32 + %5 = arith.sitofp %c2_i32_1 : i32 to f32 + %6 = arith.mulf %4, %5 : f32 + %alloc_2 = memref.alloc() {name = "d"} : memref + affine.store %6, %alloc_2[] {to = "d"} : memref + %7 = affine.load %alloc_0[] {from = "c"} : memref + %8 = affine.load %alloc_2[] {from = "d"} : memref + %9 = arith.addf %7, %8 : f32 + %10 = arith.fptosi %9 : f32 to i16 + %alloc_3 = memref.alloc() {name = "e"} : memref + affine.store %10, %alloc_3[] {to = "e"} : memref + %11 = affine.load %alloc_3[] {from = "e"} : memref + %12 = affine.load %alloc_3[] {from = "e"} : memref + return %12 : i16 + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 82-89 + +By explicitly creating an intermediate variable ``b``, we can cast the ``int32`` variable ``a`` to the desired floating-point type. +Similarly, calling ``float(a)`` can also cast ``a`` to a floating-point type. + +.. note:: + + The above stated explicit casting between integers and floating points preserves the value but the precision may be changed. + If you want to use a union type to represent both integers and floating points, please use the `.bitcast()` API instead. For example, ``a.bitcast()`` can convert ``int32`` to ``float32`` representation with the bit pattern preserved. + +.. GENERATED FROM PYTHON SOURCE LINES 91-99 + +Bit Operations +-------------- +As hardware accelerators have ability to manipulate each bit of the data, Allo supports bit operations on +those integer types. For example, we can access a specific bit in an integer ``a`` using the indexing operator: + +.. code-block:: python + + a[15] + +.. GENERATED FROM PYTHON SOURCE LINES 101-112 + +We can also extract a chunk of bits from an integer using the slicing operator: + +.. code-block:: python + + a[0:16] + +.. 
note:: + + Allo follows the Python convention that the upper bound is not included, so ``[0:16]`` means + extracting the first 16 bits, which is different from the Xilinx HLS convention that uses ``[0:15]`` + to indicate the first 16 bits. + +.. GENERATED FROM PYTHON SOURCE LINES 114-115 + +Not only constant values are supported, but also variables can be used as the index or the slice range. + + +.. rst-class:: sphx-glr-timing + + **Total running time of the script:** (0 minutes 0.170 seconds) + + +.. _sphx_glr_download_gallery_dive_01_data_types.py: + +.. only:: html + + .. container:: sphx-glr-footer sphx-glr-footer-example + + .. container:: sphx-glr-download sphx-glr-download-jupyter + + :download:`Download Jupyter notebook: dive_01_data_types.ipynb ` + + .. container:: sphx-glr-download sphx-glr-download-python + + :download:`Download Python source code: dive_01_data_types.py ` + + .. container:: sphx-glr-download sphx-glr-download-zip + + :download:`Download zipped: dive_01_data_types.zip ` + + +.. only:: html + + .. rst-class:: sphx-glr-signature + + `Gallery generated by Sphinx-Gallery `_ diff --git a/_sources/gallery/dive_02_template.rst.txt b/_sources/gallery/dive_02_template.rst.txt new file mode 100644 index 00000000..1bed2753 --- /dev/null +++ b/_sources/gallery/dive_02_template.rst.txt @@ -0,0 +1,300 @@ + +.. DO NOT EDIT. +.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. +.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: +.. "gallery/dive_02_template.py" +.. LINE NUMBERS ARE GIVEN BELOW. + +.. only:: html + + .. note:: + :class: sphx-glr-download-link-note + + :ref:`Go to the end ` + to download the full example code. + +.. rst-class:: sphx-glr-example-title + +.. _sphx_glr_gallery_dive_02_template.py: + + +Template Kernels +================ + +**Author**: Hongzheng Chen (hzchen@cs.cornell.edu) + +This document explains how to write a template kernel in Allo. +Template kernels are useful when we need to reuse a kernel with different data types or when certain computation patterns depend on specific constants. +By leveraging template kernels, we can achieve greater flexibility and reusability in the code. + +.. GENERATED FROM PYTHON SOURCE LINES 14-18 + +.. code-block:: Python + + + import allo + from allo.ir.types import int32, float32 + + + + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 19-26 + +We follow Python's convention to use *type variable* to define a template kernel. +Specifically, the type variable is specified after the function name using square brackets: ``def kernel[T](...)``, and the type variable can be used in the function signature and body. +Importantly, as the native Python interpreter does not support Allo's type declaration (i.e., base type + shape), we need to use string annotations like ``"T[10]"`` to specify the type of the variables. +Otherwise, it will raise a type error. + +In the following, we define a simple addition function that adds 1 to each element of the input array. +To invoke the kernel with a specific data type, we can use the ``instantiate`` argument in the ``allo.customize`` function. + +.. GENERATED FROM PYTHON SOURCE LINES 26-38 + +.. code-block:: Python + + + + def kernel[T](A: "T[10]") -> "T[10]": + B: T[10] + for i in range(10): + B[i] = A[i] + 1 + return B + + + s = allo.customize(kernel, instantiate=[int32]) + print(s.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. 
code-block:: none + + module { + func.func @kernel(%arg0: memref<10xi32>) -> memref<10xi32> attributes {itypes = "s", otypes = "s"} { + %alloc = memref.alloc() {name = "B"} : memref<10xi32> + affine.for %arg1 = 0 to 10 { + %0 = affine.load %arg0[%arg1] {from = "A"} : memref<10xi32> + %1 = arith.extsi %0 : i32 to i33 + %c1_i32 = arith.constant 1 : i32 + %2 = arith.extsi %c1_i32 : i32 to i33 + %3 = arith.addi %1, %2 : i33 + %4 = arith.trunci %3 : i33 to i32 + affine.store %4, %alloc[%arg1] {to = "B"} : memref<10xi32> + } {loop_name = "i", op_name = "S_i_0"} + return %alloc : memref<10xi32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 39-41 + +We can see that the kernel is specialized with the given ``int32`` data type. +Similarly, we can directly declare a new kernel by specifying ``float32`` as the data type. + +.. GENERATED FROM PYTHON SOURCE LINES 41-45 + +.. code-block:: Python + + + s = allo.customize(kernel, instantiate=[float32]) + print(s.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + module { + func.func @kernel(%arg0: memref<10xf32>) -> memref<10xf32> attributes {itypes = "_", otypes = "_"} { + %alloc = memref.alloc() {name = "B"} : memref<10xf32> + affine.for %arg1 = 0 to 10 { + %0 = affine.load %arg0[%arg1] {from = "A"} : memref<10xf32> + %c1_i32 = arith.constant 1 : i32 + %1 = arith.sitofp %c1_i32 : i32 to f32 + %2 = arith.addf %0, %1 : f32 + affine.store %2, %alloc[%arg1] {to = "B"} : memref<10xf32> + } {loop_name = "i", op_name = "S_i_0"} + return %alloc : memref<10xf32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 46-48 + +If we not only want to specialize the data type but also the shape of the array, we can provide another type variable, and pass it to the ``instantiate`` argument. +Note that here we also use the ``: base_type`` notation to constrain the type of the type variable. Here we constrain the type variable ``M`` to be an integer. + +.. GENERATED FROM PYTHON SOURCE LINES 48-60 + +.. code-block:: Python + + + + def kernel2[T, M: int32](A: "T[M]") -> "T[M]": + B: T[M] + for i in range(M): + B[i] = A[i] + 1 + return B + + + s = allo.customize(kernel2, instantiate=[int32, 20]) + print(s.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + module { + func.func @kernel2(%arg0: memref<20xi32>) -> memref<20xi32> attributes {itypes = "s", otypes = "s"} { + %alloc = memref.alloc() {name = "B"} : memref<20xi32> + affine.for %arg1 = 0 to 20 { + %0 = affine.load %arg0[%arg1] {from = "A"} : memref<20xi32> + %1 = arith.extsi %0 : i32 to i33 + %c1_i32 = arith.constant 1 : i32 + %2 = arith.extsi %c1_i32 : i32 to i33 + %3 = arith.addi %1, %2 : i33 + %4 = arith.trunci %3 : i33 to i32 + affine.store %4, %alloc[%arg1] {to = "B"} : memref<20xi32> + } {loop_name = "i", op_name = "S_i_0"} + return %alloc : memref<20xi32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 61-64 + +Furthermore, Allo's template also enables metaprogramming that can evaluate type variables at compile time. +Specifically, we can use the ``allo.meta_if``, ``allo.meta_elif``, and ``allo.meta_else`` to conditionally generate code based on the type variables. +Just to make sure the conditions can be determined at compile time. + +.. GENERATED FROM PYTHON SOURCE LINES 64-76 + +.. code-block:: Python + + + + def kernel3[T, M: int32](A: "T[M]") -> "T[M]": + B: T[M] + for i in range(M): + with allo.meta_if(T == int32): + B[i] = A[i] + 1 + with allo.meta_else(): + B[i] = A[i] - 1 + return B + + + + + + + + + +.. 
GENERATED FROM PYTHON SOURCE LINES 77-78 + +In final generated code, we can see that only a single branch is generated based on the given data type. + +.. GENERATED FROM PYTHON SOURCE LINES 78-83 + +.. code-block:: Python + + + s = allo.customize(kernel3, instantiate=[int32, 20]) + print(s.module) + s = allo.customize(kernel3, instantiate=[float32, 20]) + print(s.module) + + + + +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + module { + func.func @kernel3(%arg0: memref<20xi32>) -> memref<20xi32> attributes {itypes = "s", otypes = "s"} { + %alloc = memref.alloc() {name = "B"} : memref<20xi32> + affine.for %arg1 = 0 to 20 { + %0 = affine.load %arg0[%arg1] {from = "A"} : memref<20xi32> + %1 = arith.extsi %0 : i32 to i33 + %c1_i32 = arith.constant 1 : i32 + %2 = arith.extsi %c1_i32 : i32 to i33 + %3 = arith.addi %1, %2 : i33 + %4 = arith.trunci %3 : i33 to i32 + affine.store %4, %alloc[%arg1] {to = "B"} : memref<20xi32> + } {loop_name = "i", op_name = "S_i_0"} + return %alloc : memref<20xi32> + } + } + + module { + func.func @kernel3(%arg0: memref<20xf32>) -> memref<20xf32> attributes {itypes = "_", otypes = "_"} { + %alloc = memref.alloc() {name = "B"} : memref<20xf32> + affine.for %arg1 = 0 to 20 { + %0 = affine.load %arg0[%arg1] {from = "A"} : memref<20xf32> + %c1_i32 = arith.constant 1 : i32 + %1 = arith.sitofp %c1_i32 : i32 to f32 + %2 = arith.subf %0, %1 : f32 + affine.store %2, %alloc[%arg1] {to = "B"} : memref<20xf32> + } {loop_name = "i", op_name = "S_i_0"} + return %alloc : memref<20xf32> + } + } + + + + + + +.. rst-class:: sphx-glr-timing + + **Total running time of the script:** (0 minutes 0.372 seconds) + + +.. _sphx_glr_download_gallery_dive_02_template.py: + +.. only:: html + + .. container:: sphx-glr-footer sphx-glr-footer-example + + .. container:: sphx-glr-download sphx-glr-download-jupyter + + :download:`Download Jupyter notebook: dive_02_template.ipynb ` + + .. container:: sphx-glr-download sphx-glr-download-python + + :download:`Download Python source code: dive_02_template.py ` + + .. container:: sphx-glr-download sphx-glr-download-zip + + :download:`Download zipped: dive_02_template.zip ` + + +.. only:: html + + .. rst-class:: sphx-glr-signature + + `Gallery generated by Sphinx-Gallery `_ diff --git a/_sources/gallery/dive_03_composition.rst.txt b/_sources/gallery/dive_03_composition.rst.txt new file mode 100644 index 00000000..018b0fa8 --- /dev/null +++ b/_sources/gallery/dive_03_composition.rst.txt @@ -0,0 +1,529 @@ + +.. DO NOT EDIT. +.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. +.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: +.. "gallery/dive_03_composition.py" +.. LINE NUMBERS ARE GIVEN BELOW. + +.. only:: html + + .. note:: + :class: sphx-glr-download-link-note + + :ref:`Go to the end ` + to download the full example code. + +.. rst-class:: sphx-glr-example-title + +.. _sphx_glr_gallery_dive_03_composition.py: + + +Kernel Composition +================== + +**Author**: Hongzheng Chen (hzchen@cs.cornell.edu) + +This document will discuss kernel composition. +In the previous tutorials, we have seen how to write a simple kernel. +However, in real applications, we often need to compose multiple kernels together. + +In the following example, we define a ``matrix_add`` and a ``gemm`` kernel, and wrap them into a ``top``-level function. + +.. GENERATED FROM PYTHON SOURCE LINES 16-44 + +.. 
code-block:: Python + + + import allo + from allo.ir.types import int32, float32 + + M, K, N = 32, 32, 32 + + + def matrix_add(A: int32[M, N]) -> int32[M, N]: + B: int32[M, N] = 0 + for i, j in allo.grid(M, N): + B[i, j] = A[i, j] + 1 + return B + + + def gemm(A: int32[M, K], B: int32[K, N]) -> int32[M, N]: + C: int32[M, N] = 0 + for i, j in allo.grid(M, N): + for k in allo.reduction(K): + C[i, j] += A[i, k] * B[k, j] + return C + + + def top(A: int32[M, K], B: int32[K, N]) -> int32[M, N]: + C = gemm(A, B) + D = matrix_add(C) + return D + + + + + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 45-47 + +Different teams or people can then work on different parts of the code and optimize each kernel. +We first create a schedule for the ``matrix_add`` kernel, and add several optimizations. + +.. GENERATED FROM PYTHON SOURCE LINES 47-52 + +.. code-block:: Python + + + s1 = allo.customize(matrix_add) + s1.pipeline("j") + print(s1.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + module { + func.func @matrix_add(%arg0: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "s", otypes = "s"} { + %alloc = memref.alloc() {name = "B"} : memref<32x32xi32> + %c0_i32 = arith.constant 0 : i32 + linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<32x32xi32>) + affine.for %arg1 = 0 to 32 { + affine.for %arg2 = 0 to 32 { + %0 = affine.load %arg0[%arg1, %arg2] {from = "A"} : memref<32x32xi32> + %1 = arith.extsi %0 : i32 to i33 + %c1_i32 = arith.constant 1 : i32 + %2 = arith.extsi %c1_i32 : i32 to i33 + %3 = arith.addi %1, %2 : i33 + %4 = arith.trunci %3 : i33 to i32 + affine.store %4, %alloc[%arg1, %arg2] {to = "B"} : memref<32x32xi32> + } {loop_name = "j", pipeline_ii = 1 : ui32} + } {loop_name = "i", op_name = "S_i_j_0"} + return %alloc : memref<32x32xi32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 53-54 + +Then we create a schedule for the ``gemm`` kernel and optimize it. + +.. GENERATED FROM PYTHON SOURCE LINES 54-61 + +.. code-block:: Python + + + s2 = allo.customize(gemm) + s2.reorder("k", "j") + s2.buffer_at(s2.C, axis="i") + s2.pipeline("j") + print(s2.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. 
code-block:: none + + module { + func.func @gemm(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "ss", otypes = "s"} { + %alloc = memref.alloc() {name = "C"} : memref<32x32xi32> + %c0_i32 = arith.constant 0 : i32 + linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<32x32xi32>) + affine.for %arg2 = 0 to 32 { + %alloc_0 = memref.alloc() : memref<32xi32> + affine.for %arg3 = 0 to 32 { + affine.store %c0_i32, %alloc_0[%arg3] : memref<32xi32> + } {buffer, loop_name = "j_init", pipeline_ii = 1 : i32} + affine.for %arg3 = 0 to 32 { + affine.for %arg4 = 0 to 32 { + %0 = affine.load %arg0[%arg2, %arg3] {from = "A"} : memref<32x32xi32> + %1 = affine.load %arg1[%arg3, %arg4] {from = "B"} : memref<32x32xi32> + %2 = arith.extsi %0 : i32 to i64 + %3 = arith.extsi %1 : i32 to i64 + %4 = arith.muli %2, %3 : i64 + %5 = affine.load %alloc_0[%arg4] : memref<32xi32> + %6 = arith.trunci %4 : i64 to i32 + %7 = arith.addi %5, %6 : i32 + affine.store %7, %alloc_0[%arg4] : memref<32xi32> + } {loop_name = "j", pipeline_ii = 1 : ui32} + } {loop_name = "k", op_name = "S_k_0", reduction} + affine.for %arg3 = 0 to 32 { + %0 = affine.load %alloc_0[%arg3] : memref<32xi32> + affine.store %0, %alloc[%arg2, %arg3] : memref<32x32xi32> + } {buffer, loop_name = "j_back", pipeline_ii = 1 : i32} + } {loop_name = "i", op_name = "S_i_j_0"} + return %alloc : memref<32x32xi32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 62-63 + +Notice that now we only optimize the separate kernels but do not incorporate them into the top-level function, as shown in the following printed module. + +.. GENERATED FROM PYTHON SOURCE LINES 63-67 + +.. code-block:: Python + + + s = allo.customize(top) + print(s.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. 
code-block:: none + + module { + func.func @gemm(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "ss", otypes = "s"} { + %alloc = memref.alloc() {name = "C"} : memref<32x32xi32> + %c0_i32 = arith.constant 0 : i32 + linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<32x32xi32>) + affine.for %arg2 = 0 to 32 { + affine.for %arg3 = 0 to 32 { + affine.for %arg4 = 0 to 32 { + %0 = affine.load %arg0[%arg2, %arg4] {from = "A"} : memref<32x32xi32> + %1 = affine.load %arg1[%arg4, %arg3] {from = "B"} : memref<32x32xi32> + %2 = arith.extsi %0 : i32 to i64 + %3 = arith.extsi %1 : i32 to i64 + %4 = arith.muli %2, %3 : i64 + %5 = affine.load %alloc[%arg2, %arg3] {from = "C"} : memref<32x32xi32> + %6 = arith.trunci %4 : i64 to i32 + %7 = arith.addi %5, %6 : i32 + affine.store %7, %alloc[%arg2, %arg3] {to = "C"} : memref<32x32xi32> + } {loop_name = "k", op_name = "S_k_0", reduction} + } {loop_name = "j"} + } {loop_name = "i", op_name = "S_i_j_0"} + return %alloc : memref<32x32xi32> + } + func.func @matrix_add(%arg0: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "s", otypes = "s"} { + %alloc = memref.alloc() {name = "B"} : memref<32x32xi32> + %c0_i32 = arith.constant 0 : i32 + linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<32x32xi32>) + affine.for %arg1 = 0 to 32 { + affine.for %arg2 = 0 to 32 { + %0 = affine.load %arg0[%arg1, %arg2] {from = "A"} : memref<32x32xi32> + %1 = arith.extsi %0 : i32 to i33 + %c1_i32 = arith.constant 1 : i32 + %2 = arith.extsi %c1_i32 : i32 to i33 + %3 = arith.addi %1, %2 : i33 + %4 = arith.trunci %3 : i33 to i32 + affine.store %4, %alloc[%arg1, %arg2] {to = "B"} : memref<32x32xi32> + } {loop_name = "j"} + } {loop_name = "i", op_name = "S_i_j_0"} + return %alloc : memref<32x32xi32> + } + func.func @top(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "ss", otypes = "s"} { + %0 = call @gemm(%arg0, %arg1) {name = "C"} : (memref<32x32xi32>, memref<32x32xi32>) -> memref<32x32xi32> + %1 = call @matrix_add(%0) {name = "D"} : (memref<32x32xi32>) -> memref<32x32xi32> + return %1 : memref<32x32xi32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 68-70 + +Therefore, after each part has been optimized, we need to explicitly *compose* them together. +In Allo, we can use the ``.compose()`` primitive to compose the schedules together into the parent function. + +.. GENERATED FROM PYTHON SOURCE LINES 70-74 + +.. code-block:: Python + + + s.compose([s1, s2]) + print(s.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. 
code-block:: none + + module { + func.func @gemm(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "ss", otypes = "s"} { + %alloc = memref.alloc() {name = "C"} : memref<32x32xi32> + %c0_i32 = arith.constant 0 : i32 + linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<32x32xi32>) + affine.for %arg2 = 0 to 32 { + %alloc_0 = memref.alloc() : memref<32xi32> + affine.for %arg3 = 0 to 32 { + affine.store %c0_i32, %alloc_0[%arg3] : memref<32xi32> + } {buffer, loop_name = "j_init", pipeline_ii = 1 : i32} + affine.for %arg3 = 0 to 32 { + affine.for %arg4 = 0 to 32 { + %0 = affine.load %arg0[%arg2, %arg3] {from = "A"} : memref<32x32xi32> + %1 = affine.load %arg1[%arg3, %arg4] {from = "B"} : memref<32x32xi32> + %2 = arith.extsi %0 : i32 to i64 + %3 = arith.extsi %1 : i32 to i64 + %4 = arith.muli %2, %3 : i64 + %5 = affine.load %alloc_0[%arg4] : memref<32xi32> + %6 = arith.trunci %4 : i64 to i32 + %7 = arith.addi %5, %6 : i32 + affine.store %7, %alloc_0[%arg4] : memref<32xi32> + } {loop_name = "j", pipeline_ii = 1 : ui32} + } {loop_name = "k", op_name = "S_k_0", reduction} + affine.for %arg3 = 0 to 32 { + %0 = affine.load %alloc_0[%arg3] : memref<32xi32> + affine.store %0, %alloc[%arg2, %arg3] : memref<32x32xi32> + } {buffer, loop_name = "j_back", pipeline_ii = 1 : i32} + } {loop_name = "i", op_name = "S_i_j_0"} + return %alloc : memref<32x32xi32> + } + func.func @matrix_add(%arg0: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "s", otypes = "s"} { + %alloc = memref.alloc() {name = "B"} : memref<32x32xi32> + %c0_i32 = arith.constant 0 : i32 + linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<32x32xi32>) + affine.for %arg1 = 0 to 32 { + affine.for %arg2 = 0 to 32 { + %0 = affine.load %arg0[%arg1, %arg2] {from = "A"} : memref<32x32xi32> + %1 = arith.extsi %0 : i32 to i33 + %c1_i32 = arith.constant 1 : i32 + %2 = arith.extsi %c1_i32 : i32 to i33 + %3 = arith.addi %1, %2 : i33 + %4 = arith.trunci %3 : i33 to i32 + affine.store %4, %alloc[%arg1, %arg2] {to = "B"} : memref<32x32xi32> + } {loop_name = "j", pipeline_ii = 1 : ui32} + } {loop_name = "i", op_name = "S_i_j_0"} + return %alloc : memref<32x32xi32> + } + func.func @top(%arg0: memref<32x32xi32>, %arg1: memref<32x32xi32>) -> memref<32x32xi32> attributes {itypes = "ss", otypes = "s"} { + %0 = call @gemm(%arg0, %arg1) {name = "C"} : (memref<32x32xi32>, memref<32x32xi32>) -> memref<32x32xi32> + %1 = call @matrix_add(%0) {name = "D"} : (memref<32x32xi32>) -> memref<32x32xi32> + return %1 : memref<32x32xi32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 75-76 + +We can see that the schedules for the ``matrix_add`` and ``gemm`` kernels are both correctly optimized in the top-level function. + +.. GENERATED FROM PYTHON SOURCE LINES 78-81 + +Template Composition +-------------------- +Sometimes we may define template kernels and invoke the kernel with different template arguments. Allo provides an *id* option to specify the exact kernel to be composed. + +.. GENERATED FROM PYTHON SOURCE LINES 81-99 + +.. code-block:: Python + + + + def kernel[T_in, T_out, S](A: "T_in[S]") -> "T_out[S]": + B: T_out[S] = 0 + for i in range(S): + with allo.meta_if(T_out == int32): + B[i] = A[i] + 1 + with allo.meta_else(): + B[i] = A[i] * 2 + return B + + + def top2(A: int32[M]) -> float32[M]: + C = kernel[int32, int32, M, "K1"](A) + D = kernel[int32, float32, M, "K2"](C) + return D + + + + + + + + + +.. 
GENERATED FROM PYTHON SOURCE LINES 100-102 + +Specifically, the last argument of the template kernel is the *id* of the kernel. Later on we can use this ID for distinguishing different kernels during composition. +We also customize the two template kernels with different optimizations first. + +.. GENERATED FROM PYTHON SOURCE LINES 102-111 + +.. code-block:: Python + + + s1 = allo.customize(kernel, instantiate=[int32, int32, M]) + s1.unroll("i", factor=4) + print(s1.module) + + s2 = allo.customize(kernel, instantiate=[int32, float32, M]) + s2.pipeline("i") + print(s2.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + module { + func.func @kernel(%arg0: memref<32xi32>) -> memref<32xi32> attributes {itypes = "s", otypes = "s"} { + %alloc = memref.alloc() {name = "B"} : memref<32xi32> + %c0_i32 = arith.constant 0 : i32 + linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<32xi32>) + affine.for %arg1 = 0 to 32 { + %0 = affine.load %arg0[%arg1] {from = "A"} : memref<32xi32> + %1 = arith.extsi %0 : i32 to i33 + %c1_i32 = arith.constant 1 : i32 + %2 = arith.extsi %c1_i32 : i32 to i33 + %3 = arith.addi %1, %2 : i33 + %4 = arith.trunci %3 : i33 to i32 + affine.store %4, %alloc[%arg1] {to = "B"} : memref<32xi32> + } {loop_name = "i", op_name = "S_i_0", unroll = 4 : i32} + return %alloc : memref<32xi32> + } + } + + module { + func.func @kernel(%arg0: memref<32xi32>) -> memref<32xf32> attributes {itypes = "s", otypes = "_"} { + %c0_i32 = arith.constant 0 : i32 + %0 = arith.sitofp %c0_i32 : i32 to f32 + %alloc = memref.alloc() {name = "B"} : memref<32xf32> + linalg.fill ins(%0 : f32) outs(%alloc : memref<32xf32>) + affine.for %arg1 = 0 to 32 { + %1 = affine.load %arg0[%arg1] {from = "A"} : memref<32xi32> + %2 = arith.extsi %1 : i32 to i64 + %c2_i32 = arith.constant 2 : i32 + %3 = arith.extsi %c2_i32 : i32 to i64 + %4 = arith.muli %2, %3 : i64 + %5 = arith.sitofp %4 : i64 to f32 + affine.store %5, %alloc[%arg1] {to = "B"} : memref<32xf32> + } {loop_name = "i", op_name = "S_i_0", pipeline_ii = 1 : ui32} + return %alloc : memref<32xf32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 112-113 + +Finally, we compose the two template kernels into the top-level function with the ID specified. + +.. GENERATED FROM PYTHON SOURCE LINES 113-119 + +.. code-block:: Python + + + s = allo.customize(top2) + s.compose(s1, id="K1") + s.compose(s2, id="K2") + print(s.module) + + + + + +.. rst-class:: sphx-glr-script-out + + .. 
code-block:: none + + module { + func.func @kernel_K1(%arg0: memref<32xi32>) -> memref<32xi32> attributes {itypes = "s", otypes = "s"} { + %alloc = memref.alloc() {name = "B"} : memref<32xi32> + %c0_i32 = arith.constant 0 : i32 + linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<32xi32>) + affine.for %arg1 = 0 to 32 { + %0 = affine.load %arg0[%arg1] {from = "A"} : memref<32xi32> + %1 = arith.extsi %0 : i32 to i33 + %c1_i32 = arith.constant 1 : i32 + %2 = arith.extsi %c1_i32 : i32 to i33 + %3 = arith.addi %1, %2 : i33 + %4 = arith.trunci %3 : i33 to i32 + affine.store %4, %alloc[%arg1] {to = "B"} : memref<32xi32> + } {loop_name = "i", op_name = "S_i_0", unroll = 4 : i32} + return %alloc : memref<32xi32> + } + func.func @kernel_K2(%arg0: memref<32xi32>) -> memref<32xf32> attributes {itypes = "s", otypes = "_"} { + %c0_i32 = arith.constant 0 : i32 + %0 = arith.sitofp %c0_i32 : i32 to f32 + %alloc = memref.alloc() {name = "B"} : memref<32xf32> + linalg.fill ins(%0 : f32) outs(%alloc : memref<32xf32>) + affine.for %arg1 = 0 to 32 { + %1 = affine.load %arg0[%arg1] {from = "A"} : memref<32xi32> + %2 = arith.extsi %1 : i32 to i64 + %c2_i32 = arith.constant 2 : i32 + %3 = arith.extsi %c2_i32 : i32 to i64 + %4 = arith.muli %2, %3 : i64 + %5 = arith.sitofp %4 : i64 to f32 + affine.store %5, %alloc[%arg1] {to = "B"} : memref<32xf32> + } {loop_name = "i", op_name = "S_i_0", pipeline_ii = 1 : ui32} + return %alloc : memref<32xf32> + } + func.func @top2(%arg0: memref<32xi32>) -> memref<32xf32> attributes {itypes = "s", otypes = "_"} { + %0 = call @kernel_K1(%arg0) {name = "C"} : (memref<32xi32>) -> memref<32xi32> + %1 = call @kernel_K2(%0) {name = "D"} : (memref<32xi32>) -> memref<32xf32> + return %1 : memref<32xf32> + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 120-121 + +We can see from the printed module that the loop in the first kernel is unrolled by a factor of 4, and the loop in the second kernel is pipelined. + + +.. rst-class:: sphx-glr-timing + + **Total running time of the script:** (0 minutes 0.731 seconds) + + +.. _sphx_glr_download_gallery_dive_03_composition.py: + +.. only:: html + + .. container:: sphx-glr-footer sphx-glr-footer-example + + .. container:: sphx-glr-download sphx-glr-download-jupyter + + :download:`Download Jupyter notebook: dive_03_composition.ipynb ` + + .. container:: sphx-glr-download sphx-glr-download-python + + :download:`Download Python source code: dive_03_composition.py ` + + .. container:: sphx-glr-download sphx-glr-download-zip + + :download:`Download zipped: dive_03_composition.zip ` + + +.. only:: html + + .. rst-class:: sphx-glr-signature + + `Gallery generated by Sphinx-Gallery `_ diff --git a/_sources/gallery/dive_04_features.rst.txt b/_sources/gallery/dive_04_features.rst.txt new file mode 100644 index 00000000..b4048409 --- /dev/null +++ b/_sources/gallery/dive_04_features.rst.txt @@ -0,0 +1,250 @@ + +.. DO NOT EDIT. +.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. +.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: +.. "gallery/dive_04_features.py" +.. LINE NUMBERS ARE GIVEN BELOW. + +.. only:: html + + .. note:: + :class: sphx-glr-download-link-note + + :ref:`Go to the end ` + to download the full example code. + +.. rst-class:: sphx-glr-example-title + +.. _sphx_glr_gallery_dive_04_features.py: + + +Other Features +============== + +**Author**: Hongzheng Chen (hzchen@cs.cornell.edu) + +This document will discuss other features that are not covered in the previous tutorials. + +.. 
GENERATED FROM PYTHON SOURCE LINES 14-19 + +Dynamic Shapes +-------------- +In some cases, the shape of the tensor is not known at compile time, so we can use ``[...]`` to represent the dynamic shape. +From the generated MLIR module, we can see it has a ``"?"`` in the shape of the tensor, which means the shape is not predefined, +but we can still run the LLVM module with arbitrary shapes of NumPy arrays. + +.. GENERATED FROM PYTHON SOURCE LINES 19-38 + +.. code-block:: Python + + + import allo + from allo.ir.types import int32, float32 + import numpy as np + + + def kernel(A: float32[...], B: float32[...], size: int32): + for i in range(size): + B[i] = A[i] + + + s = allo.customize(kernel) + print(s.module) + np_A = np.random.random((256,)).astype(np.float32) + allo_A = np.zeros((256,)).astype(np.float32) + mod = s.build() + mod(np_A, allo_A, 256) + np.testing.assert_allclose(np_A, allo_A) + + + + + +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + module { + func.func @kernel(%arg0: memref, %arg1: memref, %arg2: i32) attributes {itypes = "__s", otypes = ""} { + %c0_i32 = arith.constant 0 : i32 + %0 = arith.index_cast %c0_i32 : i32 to index + %1 = arith.index_cast %arg2 : i32 to index + %c1_i32 = arith.constant 1 : i32 + %2 = arith.index_cast %c1_i32 : i32 to index + scf.for %arg3 = %0 to %1 step %2 { + %3 = memref.load %arg0[%arg3] {from = "A"} : memref + memref.store %3, %arg1[%arg3] {to = "B"} : memref + } {loop_name = "i", op_name = "S_i_0"} + return + } + } + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 39-40 + +We can also check the generated HLS code that the arguments are declared as pointers. + +.. GENERATED FROM PYTHON SOURCE LINES 40-44 + +.. code-block:: Python + + + code = s.build(target="vhls") + print(code) + + + + + +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + + //===------------------------------------------------------------*- C++ -*-===// + // + // Automatically generated file for High-level Synthesis (HLS). + // + //===----------------------------------------------------------------------===// + #include + #include + #include + #include + #include + #include + #include + #include + using namespace std; + void kernel( + float *v0, + float *v1, + int32_t v2 + ) { // L2 + int v3 = v2; // L5 + for (int v4 = 0; v4 < v3; v4 += 1) { // L8 + float v5 = *v0[v4]; // L9 + *v1[v4] = v5; // L10 + } + } + + + + + + +.. GENERATED FROM PYTHON SOURCE LINES 45-49 + +Tuple Return +------------ +Another feature is the tuple support. As in Python, we can return multiple values from a function, Allo +also supports this by explicitly specifying the return type as a tuple. + +.. GENERATED FROM PYTHON SOURCE LINES 49-77 + +.. code-block:: Python + + + + def callee(a: float32, b: float32) -> (float32, float32): + c: float32 = a + b + d: float32 = a - b + return c, d + + + def kernel(A: float32[10], B: float32[10]) -> (float32[10], float32[10]): + C: float32[10] = 0 + D: float32[10] = 0 + for i in range(10): + C[i], D[i] = callee(A[i], B[i]) + return C, D + + + s = allo.customize(kernel) + print(s.module) + mod = s.build() + np_A = np.random.random((10,)).astype(np.float32) + np_B = np.random.random((10,)).astype(np.float32) + np_C, np_D = mod(np_A, np_B) + np_C_ref = np.zeros((10,), dtype=np.float32) + np_D_ref = np.zeros((10,), dtype=np.float32) + for i in range(10): + np_C_ref[i], np_D_ref[i] = callee(np_A[i], np_B[i]) + np.testing.assert_allclose(np_C, np_C_ref) + np.testing.assert_allclose(np_D, np_D_ref) + + + + +.. rst-class:: sphx-glr-script-out + + .. 
code-block:: none + + module { + func.func @callee(%arg0: f32, %arg1: f32) -> (f32, f32) attributes {itypes = "__", otypes = "__"} { + %0 = arith.addf %arg0, %arg1 : f32 + %alloc = memref.alloc() {name = "c"} : memref + affine.store %0, %alloc[] {to = "c"} : memref + %1 = arith.subf %arg0, %arg1 : f32 + %alloc_0 = memref.alloc() {name = "d"} : memref + affine.store %1, %alloc_0[] {to = "d"} : memref + %2 = affine.load %alloc[] {from = "c"} : memref + %3 = affine.load %alloc_0[] {from = "d"} : memref + return %2, %3 : f32, f32 + } + func.func @kernel(%arg0: memref<10xf32>, %arg1: memref<10xf32>) -> (memref<10xf32>, memref<10xf32>) attributes {itypes = "__", otypes = "__"} { + %c0_i32 = arith.constant 0 : i32 + %0 = arith.sitofp %c0_i32 : i32 to f32 + %alloc = memref.alloc() {name = "C"} : memref<10xf32> + linalg.fill ins(%0 : f32) outs(%alloc : memref<10xf32>) + %c0_i32_0 = arith.constant 0 : i32 + %1 = arith.sitofp %c0_i32_0 : i32 to f32 + %alloc_1 = memref.alloc() {name = "D"} : memref<10xf32> + linalg.fill ins(%1 : f32) outs(%alloc_1 : memref<10xf32>) + affine.for %arg2 = 0 to 10 { + %2 = affine.load %arg0[%arg2] {from = "A"} : memref<10xf32> + %3 = affine.load %arg1[%arg2] {from = "B"} : memref<10xf32> + %4:2 = func.call @callee(%2, %3) : (f32, f32) -> (f32, f32) + affine.store %4#0, %alloc[%arg2] {to = "C"} : memref<10xf32> + affine.store %4#1, %alloc_1[%arg2] {to = "D"} : memref<10xf32> + } {loop_name = "i", op_name = "S_i_0"} + return %alloc, %alloc_1 : memref<10xf32>, memref<10xf32> + } + } + + + + + + +.. rst-class:: sphx-glr-timing + + **Total running time of the script:** (0 minutes 0.240 seconds) + + +.. _sphx_glr_download_gallery_dive_04_features.py: + +.. only:: html + + .. container:: sphx-glr-footer sphx-glr-footer-example + + .. container:: sphx-glr-download sphx-glr-download-jupyter + + :download:`Download Jupyter notebook: dive_04_features.ipynb ` + + .. container:: sphx-glr-download sphx-glr-download-python + + :download:`Download Python source code: dive_04_features.py ` + + .. container:: sphx-glr-download sphx-glr-download-zip + + :download:`Download zipped: dive_04_features.zip ` + + +.. only:: html + + .. rst-class:: sphx-glr-signature + + `Gallery generated by Sphinx-Gallery `_ diff --git a/_sources/gallery/index.rst.txt b/_sources/gallery/index.rst.txt index 192c4b2f..7213e9ff 100644 --- a/_sources/gallery/index.rst.txt +++ b/_sources/gallery/index.rst.txt @@ -28,6 +28,23 @@ Allo Documentations +.. raw:: html + +
      + +.. only:: html + + .. image:: /gallery/images/thumb/sphx_glr_dive_01_data_types_thumb.png + :alt: + + :ref:`sphx_glr_gallery_dive_01_data_types.py` + +.. raw:: html + +
      Data Types and Type Casting
      +
      + + .. raw:: html
      @@ -62,6 +79,40 @@ Allo Documentations
      +.. raw:: html + +
      + +.. only:: html + + .. image:: /gallery/images/thumb/sphx_glr_dive_02_template_thumb.png + :alt: + + :ref:`sphx_glr_gallery_dive_02_template.py` + +.. raw:: html + +
      Template Kernels
      +
      + + +.. raw:: html + +
      + +.. only:: html + + .. image:: /gallery/images/thumb/sphx_glr_dive_04_features_thumb.png + :alt: + + :ref:`sphx_glr_gallery_dive_04_features.py` + +.. raw:: html + +
      Other Features
      +
      + + .. raw:: html
      @@ -79,6 +130,23 @@ Allo Documentations
      +.. raw:: html + +
      + +.. only:: html + + .. image:: /gallery/images/thumb/sphx_glr_dive_03_composition_thumb.png + :alt: + + :ref:`sphx_glr_gallery_dive_03_composition.py` + +.. raw:: html + +
      Kernel Composition
      +
      + + .. thumbnail-parent-div-close .. raw:: html @@ -90,9 +158,13 @@ Allo Documentations :hidden: /gallery/developer_01_ir_builder + /gallery/dive_01_data_types /gallery/tutorial_02_vhls /gallery/tutorial_01_get_started + /gallery/dive_02_template + /gallery/dive_04_features /gallery/developer_02_mlir + /gallery/dive_03_composition diff --git a/_sources/gallery/sg_execution_times.rst.txt b/_sources/gallery/sg_execution_times.rst.txt index bade13f8..c5632403 100644 --- a/_sources/gallery/sg_execution_times.rst.txt +++ b/_sources/gallery/sg_execution_times.rst.txt @@ -6,7 +6,7 @@ Computation times ================= -**00:00.607** total execution time for 4 files **from gallery**: +**00:01.961** total execution time for 8 files **from gallery**: .. container:: @@ -32,14 +32,26 @@ Computation times * - Example - Time - Mem (MB) + * - :ref:`sphx_glr_gallery_dive_03_composition.py` (``dive_03_composition.py``) + - 00:00.731 + - 0.0 + * - :ref:`sphx_glr_gallery_dive_02_template.py` (``dive_02_template.py``) + - 00:00.372 + - 0.0 + * - :ref:`sphx_glr_gallery_dive_04_features.py` (``dive_04_features.py``) + - 00:00.240 + - 0.0 * - :ref:`sphx_glr_gallery_tutorial_02_vhls.py` (``tutorial_02_vhls.py``) - - 00:00.332 + - 00:00.192 - 0.0 * - :ref:`sphx_glr_gallery_tutorial_01_get_started.py` (``tutorial_01_get_started.py``) - - 00:00.196 + - 00:00.181 + - 0.0 + * - :ref:`sphx_glr_gallery_dive_01_data_types.py` (``dive_01_data_types.py``) + - 00:00.170 - 0.0 * - :ref:`sphx_glr_gallery_developer_02_mlir.py` (``developer_02_mlir.py``) - - 00:00.074 + - 00:00.070 - 0.0 * - :ref:`sphx_glr_gallery_developer_01_ir_builder.py` (``developer_01_ir_builder.py``) - 00:00.005 diff --git a/_sources/gallery/tutorial_01_get_started.rst.txt b/_sources/gallery/tutorial_01_get_started.rst.txt index 9cf0bb7f..7dce558d 100644 --- a/_sources/gallery/tutorial_01_get_started.rst.txt +++ b/_sources/gallery/tutorial_01_get_started.rst.txt @@ -71,12 +71,14 @@ use ``int32`` as the data type for all the variables. -.. GENERATED FROM PYTHON SOURCE LINES 35-55 +.. GENERATED FROM PYTHON SOURCE LINES 35-57 We then define a function that takes two 32x32 matrices as inputs and returns a 32x32 matrix as output. The variable declaration is defined -as ``: []``. We require **strict type annotation** in -Allo's kernels, which is different from directly programming in Python. +as ``: []``, and the function type is defined as +``(, , ...) -> ``. +We require **strict type annotation** in Allo's kernels, which is different +from directly programming in Python. Inside the kernel, we provide a shorthand for the loop iterator. For example, ``for i, j, k in allo.grid(32, 32, 32)`` is equivalent to the following @@ -94,7 +96,7 @@ The arguments denote the upper bounds of the loop iterators. Notice the above range-loop is also supported in the new Allo, so users have more flexibility to define the loop structure. -.. GENERATED FROM PYTHON SOURCE LINES 55-64 +.. GENERATED FROM PYTHON SOURCE LINES 57-66 .. code-block:: Python @@ -114,7 +116,7 @@ users have more flexibility to define the loop structure. -.. GENERATED FROM PYTHON SOURCE LINES 65-71 +.. GENERATED FROM PYTHON SOURCE LINES 67-73 Create the Schedule ------------------- @@ -123,7 +125,7 @@ kernel in order to achieve high performance. We call ``allo.customize`` to create a schedule for the kernel, where **schedule** denotes the set of transformations. -.. GENERATED FROM PYTHON SOURCE LINES 71-74 +.. GENERATED FROM PYTHON SOURCE LINES 73-76 .. 
code-block:: Python @@ -137,7 +139,7 @@ transformations. -.. GENERATED FROM PYTHON SOURCE LINES 75-80 +.. GENERATED FROM PYTHON SOURCE LINES 77-82 Inspect the Intermediate Representation (IR) -------------------------------------------- @@ -145,7 +147,7 @@ Allo leverage the `MLIR `_ infrastructure to represent the program, and we can directly print out the IR by using ``s.module``. -.. GENERATED FROM PYTHON SOURCE LINES 80-83 +.. GENERATED FROM PYTHON SOURCE LINES 82-85 .. code-block:: Python @@ -188,7 +190,7 @@ represent the program, and we can directly print out the IR by using -.. GENERATED FROM PYTHON SOURCE LINES 84-99 +.. GENERATED FROM PYTHON SOURCE LINES 86-101 Let's take a close look at the generated IR. Basically an MLIR program is a set of operations in different dialects, and the operations are referred @@ -206,7 +208,7 @@ operations and some arithmetic operations. Allo also attaches some attributes to the operations, including the tensor names, loop names, and operation names, which are further used for optimization. -.. GENERATED FROM PYTHON SOURCE LINES 101-106 +.. GENERATED FROM PYTHON SOURCE LINES 103-108 Apply Transformations --------------------- @@ -214,7 +216,7 @@ Next, we start transforming the program by using the schedule primitives. We can refer to the loops by using the loop names. For example, to split the outer-most loop into two, we can call the ``.split()`` primitive as follows: -.. GENERATED FROM PYTHON SOURCE LINES 106-109 +.. GENERATED FROM PYTHON SOURCE LINES 108-111 .. code-block:: Python @@ -228,7 +230,7 @@ the outer-most loop into two, we can call the ``.split()`` primitive as follows: -.. GENERATED FROM PYTHON SOURCE LINES 110-116 +.. GENERATED FROM PYTHON SOURCE LINES 112-118 We can print out the IR again to see the effect of the transformation. @@ -237,7 +239,7 @@ We can print out the IR again to see the effect of the transformation. In the Allo DSL, all the transformations are applied **immediately**, so users can directly see the changes after they apply the transformations. -.. GENERATED FROM PYTHON SOURCE LINES 116-119 +.. GENERATED FROM PYTHON SOURCE LINES 118-121 .. code-block:: Python @@ -285,7 +287,7 @@ We can print out the IR again to see the effect of the transformation. -.. GENERATED FROM PYTHON SOURCE LINES 120-125 +.. GENERATED FROM PYTHON SOURCE LINES 122-127 We can see that the outer-most loop is split into two loops, and the original loop is replaced by the two new loops. The new loops are named @@ -293,7 +295,7 @@ as ``i.outer`` and ``i.inner``. Similarly, we can split the ``j`` loop: -.. GENERATED FROM PYTHON SOURCE LINES 125-129 +.. GENERATED FROM PYTHON SOURCE LINES 127-131 .. code-block:: Python @@ -345,13 +347,13 @@ Similarly, we can split the ``j`` loop: -.. GENERATED FROM PYTHON SOURCE LINES 130-133 +.. GENERATED FROM PYTHON SOURCE LINES 132-135 We can further reorder the loops by using ``.reorder()``. For example, we can move the splitted outer loops together, and move the splitted inner loops together. -.. GENERATED FROM PYTHON SOURCE LINES 133-137 +.. GENERATED FROM PYTHON SOURCE LINES 135-139 .. code-block:: Python @@ -402,11 +404,11 @@ loops together. -.. GENERATED FROM PYTHON SOURCE LINES 138-139 +.. GENERATED FROM PYTHON SOURCE LINES 140-141 We can see the changes from the loop names in the generated IR. -.. GENERATED FROM PYTHON SOURCE LINES 141-149 +.. GENERATED FROM PYTHON SOURCE LINES 143-151 Create the Executable --------------------- @@ -417,7 +419,7 @@ can be executed on the CPU. 
Otherwise, you can also specify the target as ``vhls`` to generate a Vivado HLS program that can be synthesized to an FPGA accelerator. -.. GENERATED FROM PYTHON SOURCE LINES 149-152 +.. GENERATED FROM PYTHON SOURCE LINES 151-154 .. code-block:: Python @@ -431,13 +433,13 @@ accelerator. -.. GENERATED FROM PYTHON SOURCE LINES 153-156 +.. GENERATED FROM PYTHON SOURCE LINES 155-158 .. note:: ``s.build(target="llvm")`` is equivalent to ``s.build()``. -.. GENERATED FROM PYTHON SOURCE LINES 158-167 +.. GENERATED FROM PYTHON SOURCE LINES 160-169 Prepare the Inputs/Outputs for the Executable --------------------------------------------- @@ -449,7 +451,7 @@ but we still need to make sure the data types are consistent. By default, when defining our kernel function, so we need to explicitly cast the data type to ``np.int32``. -.. GENERATED FROM PYTHON SOURCE LINES 167-173 +.. GENERATED FROM PYTHON SOURCE LINES 169-175 .. code-block:: Python @@ -466,7 +468,7 @@ to ``np.int32``. -.. GENERATED FROM PYTHON SOURCE LINES 174-179 +.. GENERATED FROM PYTHON SOURCE LINES 176-181 Run the Executable ------------------ @@ -474,7 +476,7 @@ With the prepared inputs/outputs, we can feed them to our executable. Notice our module can return a new array as output, so we can directly assign the output to a new variable. -.. GENERATED FROM PYTHON SOURCE LINES 179-182 +.. GENERATED FROM PYTHON SOURCE LINES 181-184 .. code-block:: Python @@ -488,11 +490,11 @@ assign the output to a new variable. -.. GENERATED FROM PYTHON SOURCE LINES 183-184 +.. GENERATED FROM PYTHON SOURCE LINES 185-186 Finally, we can do a sanity check to see if the results are correct. -.. GENERATED FROM PYTHON SOURCE LINES 184-188 +.. GENERATED FROM PYTHON SOURCE LINES 186-190 .. code-block:: Python @@ -516,7 +518,7 @@ Finally, we can do a sanity check to see if the results are correct. .. rst-class:: sphx-glr-timing - **Total running time of the script:** (0 minutes 0.196 seconds) + **Total running time of the script:** (0 minutes 0.181 seconds) .. _sphx_glr_download_gallery_tutorial_01_get_started.py: diff --git a/_sources/gallery/tutorial_02_vhls.rst.txt b/_sources/gallery/tutorial_02_vhls.rst.txt index a002dda2..246f3a8a 100644 --- a/_sources/gallery/tutorial_02_vhls.rst.txt +++ b/_sources/gallery/tutorial_02_vhls.rst.txt @@ -521,7 +521,7 @@ you can check the following files: .. rst-class:: sphx-glr-timing - **Total running time of the script:** (0 minutes 0.332 seconds) + **Total running time of the script:** (0 minutes 0.192 seconds) .. _sphx_glr_download_gallery_tutorial_02_vhls.py: diff --git a/_sources/index.rst.txt b/_sources/index.rst.txt index abbfa8d1..289577db 100644 --- a/_sources/index.rst.txt +++ b/_sources/index.rst.txt @@ -40,6 +40,17 @@ Allo is an Accelerator Design Language (ADL) and compiler that facilitates the c gallery/tutorial_02_vhls.rst +.. toctree:: + :maxdepth: 1 + :caption: Deep Dive + + gallery/dive_01_data_types.rst + gallery/dive_02_template.rst + gallery/dive_03_composition.rst + dive/ip.rst + dive/pytorch.rst + gallery/dive_04_features.rst + .. toctree:: :maxdepth: 1 :caption: Developer Guide diff --git a/_sources/sg_execution_times.rst.txt b/_sources/sg_execution_times.rst.txt index 029a9c7c..bcc5f3bf 100644 --- a/_sources/sg_execution_times.rst.txt +++ b/_sources/sg_execution_times.rst.txt @@ -6,7 +6,7 @@ Computation times ================= -**00:00.607** total execution time for 4 files **from all galleries**: +**00:01.961** total execution time for 8 files **from all galleries**: .. 
container:: @@ -32,14 +32,26 @@ Computation times * - Example - Time - Mem (MB) + * - :ref:`sphx_glr_gallery_dive_03_composition.py` (``../../tutorials/dive_03_composition.py``) + - 00:00.731 + - 0.0 + * - :ref:`sphx_glr_gallery_dive_02_template.py` (``../../tutorials/dive_02_template.py``) + - 00:00.372 + - 0.0 + * - :ref:`sphx_glr_gallery_dive_04_features.py` (``../../tutorials/dive_04_features.py``) + - 00:00.240 + - 0.0 * - :ref:`sphx_glr_gallery_tutorial_02_vhls.py` (``../../tutorials/tutorial_02_vhls.py``) - - 00:00.332 + - 00:00.192 - 0.0 * - :ref:`sphx_glr_gallery_tutorial_01_get_started.py` (``../../tutorials/tutorial_01_get_started.py``) - - 00:00.196 + - 00:00.181 + - 0.0 + * - :ref:`sphx_glr_gallery_dive_01_data_types.py` (``../../tutorials/dive_01_data_types.py``) + - 00:00.170 - 0.0 * - :ref:`sphx_glr_gallery_developer_02_mlir.py` (``../../tutorials/developer_02_mlir.py``) - - 00:00.074 + - 00:00.070 - 0.0 * - :ref:`sphx_glr_gallery_developer_01_ir_builder.py` (``../../tutorials/developer_01_ir_builder.py``) - 00:00.005 diff --git a/_static/pygments.css b/_static/pygments.css index 0d49244e..5f2b0a25 100644 --- a/_static/pygments.css +++ b/_static/pygments.css @@ -6,26 +6,26 @@ span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: .highlight .hll { background-color: #ffffcc } .highlight { background: #eeffcc; } .highlight .c { color: #408090; font-style: italic } /* Comment */ -.highlight .err { border: 1px solid #FF0000 } /* Error */ +.highlight .err { border: 1px solid #F00 } /* Error */ .highlight .k { color: #007020; font-weight: bold } /* Keyword */ -.highlight .o { color: #666666 } /* Operator */ +.highlight .o { color: #666 } /* Operator */ .highlight .ch { color: #408090; font-style: italic } /* Comment.Hashbang */ .highlight .cm { color: #408090; font-style: italic } /* Comment.Multiline */ .highlight .cp { color: #007020 } /* Comment.Preproc */ .highlight .cpf { color: #408090; font-style: italic } /* Comment.PreprocFile */ .highlight .c1 { color: #408090; font-style: italic } /* Comment.Single */ -.highlight .cs { color: #408090; background-color: #fff0f0 } /* Comment.Special */ +.highlight .cs { color: #408090; background-color: #FFF0F0 } /* Comment.Special */ .highlight .gd { color: #A00000 } /* Generic.Deleted */ .highlight .ge { font-style: italic } /* Generic.Emph */ .highlight .ges { font-weight: bold; font-style: italic } /* Generic.EmphStrong */ -.highlight .gr { color: #FF0000 } /* Generic.Error */ +.highlight .gr { color: #F00 } /* Generic.Error */ .highlight .gh { color: #000080; font-weight: bold } /* Generic.Heading */ .highlight .gi { color: #00A000 } /* Generic.Inserted */ -.highlight .go { color: #333333 } /* Generic.Output */ -.highlight .gp { color: #c65d09; font-weight: bold } /* Generic.Prompt */ +.highlight .go { color: #333 } /* Generic.Output */ +.highlight .gp { color: #C65D09; font-weight: bold } /* Generic.Prompt */ .highlight .gs { font-weight: bold } /* Generic.Strong */ .highlight .gu { color: #800080; font-weight: bold } /* Generic.Subheading */ -.highlight .gt { color: #0044DD } /* Generic.Traceback */ +.highlight .gt { color: #04D } /* Generic.Traceback */ .highlight .kc { color: #007020; font-weight: bold } /* Keyword.Constant */ .highlight .kd { color: #007020; font-weight: bold } /* Keyword.Declaration */ .highlight .kn { color: #007020; font-weight: bold } /* Keyword.Namespace */ @@ -33,43 +33,43 @@ span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: .highlight .kr { color: 
#007020; font-weight: bold } /* Keyword.Reserved */ .highlight .kt { color: #902000 } /* Keyword.Type */ .highlight .m { color: #208050 } /* Literal.Number */ -.highlight .s { color: #4070a0 } /* Literal.String */ -.highlight .na { color: #4070a0 } /* Name.Attribute */ +.highlight .s { color: #4070A0 } /* Literal.String */ +.highlight .na { color: #4070A0 } /* Name.Attribute */ .highlight .nb { color: #007020 } /* Name.Builtin */ -.highlight .nc { color: #0e84b5; font-weight: bold } /* Name.Class */ -.highlight .no { color: #60add5 } /* Name.Constant */ -.highlight .nd { color: #555555; font-weight: bold } /* Name.Decorator */ -.highlight .ni { color: #d55537; font-weight: bold } /* Name.Entity */ +.highlight .nc { color: #0E84B5; font-weight: bold } /* Name.Class */ +.highlight .no { color: #60ADD5 } /* Name.Constant */ +.highlight .nd { color: #555; font-weight: bold } /* Name.Decorator */ +.highlight .ni { color: #D55537; font-weight: bold } /* Name.Entity */ .highlight .ne { color: #007020 } /* Name.Exception */ -.highlight .nf { color: #06287e } /* Name.Function */ +.highlight .nf { color: #06287E } /* Name.Function */ .highlight .nl { color: #002070; font-weight: bold } /* Name.Label */ -.highlight .nn { color: #0e84b5; font-weight: bold } /* Name.Namespace */ +.highlight .nn { color: #0E84B5; font-weight: bold } /* Name.Namespace */ .highlight .nt { color: #062873; font-weight: bold } /* Name.Tag */ -.highlight .nv { color: #bb60d5 } /* Name.Variable */ +.highlight .nv { color: #BB60D5 } /* Name.Variable */ .highlight .ow { color: #007020; font-weight: bold } /* Operator.Word */ -.highlight .w { color: #bbbbbb } /* Text.Whitespace */ +.highlight .w { color: #BBB } /* Text.Whitespace */ .highlight .mb { color: #208050 } /* Literal.Number.Bin */ .highlight .mf { color: #208050 } /* Literal.Number.Float */ .highlight .mh { color: #208050 } /* Literal.Number.Hex */ .highlight .mi { color: #208050 } /* Literal.Number.Integer */ .highlight .mo { color: #208050 } /* Literal.Number.Oct */ -.highlight .sa { color: #4070a0 } /* Literal.String.Affix */ -.highlight .sb { color: #4070a0 } /* Literal.String.Backtick */ -.highlight .sc { color: #4070a0 } /* Literal.String.Char */ -.highlight .dl { color: #4070a0 } /* Literal.String.Delimiter */ -.highlight .sd { color: #4070a0; font-style: italic } /* Literal.String.Doc */ -.highlight .s2 { color: #4070a0 } /* Literal.String.Double */ -.highlight .se { color: #4070a0; font-weight: bold } /* Literal.String.Escape */ -.highlight .sh { color: #4070a0 } /* Literal.String.Heredoc */ -.highlight .si { color: #70a0d0; font-style: italic } /* Literal.String.Interpol */ -.highlight .sx { color: #c65d09 } /* Literal.String.Other */ +.highlight .sa { color: #4070A0 } /* Literal.String.Affix */ +.highlight .sb { color: #4070A0 } /* Literal.String.Backtick */ +.highlight .sc { color: #4070A0 } /* Literal.String.Char */ +.highlight .dl { color: #4070A0 } /* Literal.String.Delimiter */ +.highlight .sd { color: #4070A0; font-style: italic } /* Literal.String.Doc */ +.highlight .s2 { color: #4070A0 } /* Literal.String.Double */ +.highlight .se { color: #4070A0; font-weight: bold } /* Literal.String.Escape */ +.highlight .sh { color: #4070A0 } /* Literal.String.Heredoc */ +.highlight .si { color: #70A0D0; font-style: italic } /* Literal.String.Interpol */ +.highlight .sx { color: #C65D09 } /* Literal.String.Other */ .highlight .sr { color: #235388 } /* Literal.String.Regex */ -.highlight .s1 { color: #4070a0 } /* Literal.String.Single */ +.highlight .s1 { color: 
#4070A0 } /* Literal.String.Single */ .highlight .ss { color: #517918 } /* Literal.String.Symbol */ .highlight .bp { color: #007020 } /* Name.Builtin.Pseudo */ -.highlight .fm { color: #06287e } /* Name.Function.Magic */ -.highlight .vc { color: #bb60d5 } /* Name.Variable.Class */ -.highlight .vg { color: #bb60d5 } /* Name.Variable.Global */ -.highlight .vi { color: #bb60d5 } /* Name.Variable.Instance */ -.highlight .vm { color: #bb60d5 } /* Name.Variable.Magic */ +.highlight .fm { color: #06287E } /* Name.Function.Magic */ +.highlight .vc { color: #BB60D5 } /* Name.Variable.Class */ +.highlight .vg { color: #BB60D5 } /* Name.Variable.Global */ +.highlight .vi { color: #BB60D5 } /* Name.Variable.Instance */ +.highlight .vm { color: #BB60D5 } /* Name.Variable.Magic */ .highlight .il { color: #208050 } /* Literal.Number.Integer.Long */ \ No newline at end of file diff --git a/api/index.html b/api/index.html index 96862381..2fe74412 100644 --- a/api/index.html +++ b/api/index.html @@ -6,7 +6,7 @@ Schedule Primitives — Allo Documentation - + @@ -142,6 +142,15 @@

      Quick search

    • Getting Started
    • Vivado/Vitis HLS Backend
    +

    Deep Dive

    +

    Developer Guide

    • Developer Setup
    • @@ -167,7 +176,7 @@

      Quick search

Schedule Primitives¶

-class allo.customize.Schedule(module, top_func, func_args, ip, ext_libs=None, use_def_chain=None, inst_list=None)[source]¶
+class allo.customize.Schedule(module, top_func, func_args, ip, ext_libs=None, inst_list=None)[source]¶
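A Schedule is normally not constructed directly; it is returned by allo.customize and then driven through the schedule primitives listed below. A minimal sketch, assuming a trivial kernel written only for illustration:

import allo
from allo.ir.types import int32

def scalar_add(A: int32[32]) -> int32[32]:
    # illustrative kernel, not part of this API reference
    B: int32[32] = 0
    for i in range(32):
        B[i] = A[i] + 1
    return B

s = allo.customize(scalar_add)  # returns an allo.customize.Schedule
s.pipeline("i")                 # schedule primitives are methods on the Schedule
print(s.module)                 # inspect the transformed MLIR module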

      Methods:

      @@ -524,7 +533,7 @@

      Data Types - © Copyright 2024, Allo Authors. + © Copyright 2025, Allo Authors. Created using Sphinx 8.1.3. diff --git a/developer/index.html b/developer/index.html index 83fea737..53d64194 100644 --- a/developer/index.html +++ b/developer/index.html @@ -6,7 +6,7 @@ Developer Setup — Allo Documentation - + @@ -24,7 +24,7 @@ - + diff --git a/dive/ip.html b/dive/ip.html new file mode 100644 index 00000000..bdcf2bf5 --- /dev/null +++ b/dive/ip.html @@ -0,0 +1,268 @@ + + + + + + + + IP Integration — Allo Documentation + + + + + + + + + + + + + + + + + + + + + +
      + + + +
      + + + + + +
      +
      +
      +
      + +
      +

IP Integration¶

      +

Apart from directly writing Allo kernels in Python, we also support integrating existing C++ HLS kernels into Allo. This feature is useful when you have existing, already-optimized C++ HLS code that you want to integrate into Allo. The following example shows how to integrate a simple vector addition kernel written in C++ into Allo.

      +

      Suppose the C++ kernel header is defined in the vadd.h file:

      +
      #ifndef VADD_H
      +#define VADD_H
      +
      +void vadd(int A[32], int B[32], int C[32]);
      +
      +#endif // VADD_H
      +
      +
      +

      And the corresponding implementation is defined in the vadd.cpp file:

      +
      #include "vadd.h"
      +using namespace std;
      +
      +void vadd(int A[32], int B[32], int C[32]) {
      +    for (int i = 0; i < 32; ++i) {
      +        C[i] = A[i] + B[i];
      +    }
      +}
      +
      +
      +

In Allo, we can create an IP module to wrap the C++ kernel. Basically, we need to provide the top-level function name, the header files, and the implementation files. Currently, an Allo signature is also required to specify the input and output types of the kernel. Allo will automatically compile the C++ kernel and generate the corresponding Python wrapper based on the provided files and signature. The last argument link_hls determines whether the C++ compiler should link the Vitis HLS libraries (e.g., ap_int), which is only possible when Vitis HLS is installed on your machine.

      +
      vadd = allo.IPModule(
      +    top="vadd",
      +    headers=["vadd.h"],
      +    impls=["vadd.cpp"],
      +    signature=["int32[32]", "int32[32]", "int32[32]"],
      +    link_hls=False,
      +)
      +
      +
      +
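As a sketch of the link_hls option: if the C++ implementation relies on Vitis HLS headers such as ap_int.h, the IP module must be created with link_hls=True, which assumes Vitis HLS is installed on the machine. The kernel and file names below are hypothetical; only the IPModule call itself follows the pattern shown above.

import allo

# Hypothetical kernel whose C++ implementation uses ap_int<> internally;
# link_hls=True asks Allo to link the Vitis HLS headers/libraries when
# compiling the wrapper (requires a local Vitis HLS installation).
vadd_ap = allo.IPModule(
    top="vadd_ap",
    headers=["vadd_ap.h"],
    impls=["vadd_ap.cpp"],
    signature=["int32[32]", "int32[32]", "int32[32]"],
    link_hls=True,
)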

After creating the IP module, we can use it in Allo as a normal Python function. For example, we can directly call the vadd function to perform vector addition. The inputs and outputs are automatically wrapped and unwrapped as NumPy arrays, which greatly reduces the burden of managing the complex C-Python interface. This is also very useful when you want to debug the HLS kernels with Python data.

      +
      np_A = np.random.randint(0, 100, (32,)).astype(np.int32)
      +np_B = np.random.randint(0, 100, (32,)).astype(np.int32)
      +np_C = np.zeros((32,), dtype=np.int32)
      +vadd(np_A, np_B, np_C)
      +np.testing.assert_allclose(np_A + np_B, np_C, atol=1e-6)
      +
      +
      +

      Moreover, the IP module can also be called in a normal Allo kernel. In the following example, we wrap the vadd function into an Allo kernel and use it to perform vector addition. The Allo kernel can then be further customized and compiled with the external C++ HLS kernel.

      +
      def kernel(A: int32[32], B: int32[32]) -> int32[32]:
      +    C: int32[32] = 0
      +    vadd(A, B, C)
      +    return C
      +
      +s = allo.customize(kernel)
      +print(s.module)
      +mod = s.build()
      +np_A = np.random.randint(0, 100, (32,)).astype(np.int32)
      +np_B = np.random.randint(0, 100, (32,)).astype(np.int32)
      +allo_C = mod(np_A, np_B)
      +np.testing.assert_allclose(np_A + np_B, allo_C, atol=1e-6)
      +
      +
      +
      + + +
      +
      +
      +
      + + +
      +
      +
      + +
      + + + + +

      Styled using the Piccolo Theme

      + + \ No newline at end of file diff --git a/dive/pytorch.html b/dive/pytorch.html new file mode 100644 index 00000000..41b88b83 --- /dev/null +++ b/dive/pytorch.html @@ -0,0 +1,253 @@ + + + + + + + + PyTorch Integration — Allo Documentation + + + + + + + + + + + + + + + + + + + + + +
      + + + +
      + + + + + +
      +
      +
      +
      + +
      +

PyTorch Integration¶

      +

      In this document, we will show how to directly compile PyTorch models to Allo. +First, users can define a PyTorch module as usual:

      +
      import torch
      +import torch.nn.functional as F
      +import torch.nn as nn
      +
      +class Model(nn.Module):
      +    def __init__(self):
      +        super(Model, self).__init__()
      +
      +    def forward(self, x, y):
      +        x = x + y
      +        x = F.relu(x)
      +        return x
      +
      +model = Model()
      +model.eval()
      +
      +
      +

      Then, users can compile the PyTorch model to Allo by using the allo.frontend.from_pytorch API:

      +
      import allo
      +example_inputs = [torch.rand(1, 3, 10, 10), torch.rand(1, 3, 10, 10)]
      +llvm_mod = allo.frontend.from_pytorch(model, example_inputs=example_inputs)
      +
      +
      +

      Then, we can use the generated Allo LLVM module as usual by passing in the NumPy inputs:

      +
      golden = model(*example_inputs)
      +np_inputs = [x.detach().numpy() for x in example_inputs]
      +res = llvm_mod(*np_inputs)
      +torch.testing.assert_close(res, golden.detach().numpy())
      +print("Passed!")
      +
      +
      +

      The process should be very similar to the original Allo workflow. +The default target is LLVM. We can also change the backend to other compilers such as Vitis HLS by specifying the target:

      +
      mod = allo.frontend.from_pytorch(model, example_inputs=example_inputs, target="vhls")
      +print(mod.hls_code)
      +
      +
      +
      + + +
      +
      +
      +
      + + +
      +
      +
      +
      + + +
      + + Other Features> + +
      +
      +
      + + + + +

      Styled using the Piccolo Theme

      + + \ No newline at end of file diff --git a/gallery/developer_01_ir_builder.html b/gallery/developer_01_ir_builder.html index a640b4e8..127ed0cf 100644 --- a/gallery/developer_01_ir_builder.html +++ b/gallery/developer_01_ir_builder.html @@ -6,7 +6,7 @@ IR Builder Walkthrough — Allo Documentation - + @@ -143,6 +143,15 @@

      Quick search

    • Getting Started
    • Vivado/Vitis HLS Backend
    • +

      Deep Dive

      +

      Developer Guide

      +

      Deep Dive

      +

      Developer Guide

      • Developer Setup
      • @@ -174,8 +183,8 @@

        Quick search

        Author: Hongzheng Chen (hzchen@cs.cornell.edu)

        This guide will give some examples on how to invoke the MLIR toolchain to verify the correctness of a handwritten or generated MLIR program.

        -
        import allo
        -import numpy as np
        +
        import allo
        +import numpy as np
         
        @@ -288,7 +297,7 @@

        Define an MLIR program with linalg dialect
        def kernel(A: int32[32, 32], B: int32[32, 32]) -> int32[32, 32]:
        +
        def kernel(A: int32[32, 32], B: int32[32, 32]) -> int32[32, 32]:
             C = allo.matmul(A, B)
             return C
         
        @@ -350,7 +359,7 @@

Define an MLIR program with Tensor dialect
Total running time of the script: (0 minutes 0.074 seconds)

        +

        Total running time of the script: (0 minutes 0.070 seconds)

      +

      Deep Dive

      +

      Developer Guide

      +

      Deep Dive

      +

      Developer Guide

      • Developer Setup
      • @@ -164,7 +173,7 @@

        Quick search

Computation times¶

        -

        00:00.607 total execution time for 4 files from gallery:

        +

        00:01.961 total execution time for 8 files from gallery: