We'd like to share our ideas on how to implement the AOT feature in Taichi. AOT (ahead-of-time compilation) refers to using Taichi as a GPU compute shader/kernel compiler: AOT users compile their Taichi kernels into an *AOT module*, package this module alongside their apps (likely without a Python environment, e.g. an Android app), and load/execute the compiled shaders from this module inside the app.
Note that AOT is already ongoing work, so some of the tasks below have already been implemented. For a quick peek at the Taichi AOT workflow, please check out [this test](https://github.com/taichi-dev/taichi/blob/master/tests/python/test_aot.py).
# Goals
* Provide a workflow to package the Taichi kernels into a module that can be loaded and executed from a non-Python environment (e.g. an Android app using Vulkan).
* Prioritize deployment on mobile platforms, including OpenGL ES, Vulkan and Apple Metal.
* Prioritize *dense* Taichi fields. That said, we plan to further decouple AOT from Taichi fields (see the *Ndarray* section).
# API Proposal
Taichi provides a utility, [`taichi.aot.Module`](https://github.com/taichi-dev/taichi/blob/fdf665e4dfe6c1cee41e67b12b288749b2c8cb8f/python/taichi/aot/module.py#L58), for compiling the Taichi kernels and fields info into a module file. It provides these APIs:
* `add_kernel(kernel_fn)`: Add a Taichi kernel to the AOT module.
* `add_kernel_template(kernel_templ_fn)`: Add a *Taichi kernel template* to the AOT module.
* (*) `add_field(name, field)`: Add a Taichi field to the AOT module. That said, we hope Ndarray can serve as a more convenient dense data container in AOT use cases.
* `save(filepath, filename)`: Save this AOT module to `filepath/filename`.
We will walk through the `Module` usage with the following example.
```py
x = ti.Vector.field(2, float, n_particles)
v = ti.Vector.field(2, float, n_particles)
...

@ti.kernel
def init():
    ...

@ti.kernel
def substep():
    ...

def run_jit():
    gui = ti.GUI('mpm88')
    # Driver logic
    init()
    while True:
        for _ in range(50):
            substep()
        gui.circles(x.to_numpy())
        gui.show()

def run_aot():
    # 1
    mod = ti.aot.Module(arch=ti.opengl)
    # 2
    mod.add_kernel(init)
    mod.add_kernel(substep)
    # 3
    mod.add_field(x, name="x")
    mod.add_field(v, name="v")
    # 4
    mod.save('/path/to/dir', 'opengl.tcb')  # .tcb for "taichi binary"
```
1. We have created an AOT module, `mod`, targeting the GL/ES shading language.
2. The snippet has defined two Taichi kernels: `init` and `substep`. This step adds both kernels to `mod`.
3. The snippet has also defined two Taichi fields: `x` and `v`. Both are added to `mod`, too.
4. Finally, we save the module to `/path/to/dir/opengl.tcb`.
This completes the work required on the Taichi/Python side.
Assuming we then want to deploy this to an Android app and have added `opengl.tcb` to the app project, we imagine the following set of C++ APIs being useful. Note that the language implementing the API is mostly irrelevant and should be chosen for suitability on the targeted platform (e.g. ObjC/Swift for iOS, Java/Kotlin for Android). We choose C++ here simply for developer familiarity (and, being low-level, C++ is suitable for both mobile platforms).
<details>
<summary>C++ API</summary>
```cpp
namespace taichi {

// Corresponds to a ti.field
class Field {
 public:
  /**
   * Copies the content of this field to @param dest_buffer.
   *
   * Internally uses glCopyBufferSubData() along with
   * GL_COPY_READ_BUFFER/GL_COPY_WRITE_BUFFER.
   */
  void CopyTo(GLint dest_buffer, GLintptr write_offset = 0);
  /**
   * Copies the content of this field to @param host_buffer.
   */
  void CopyTo(char* host_buffer);
  /**
   * Number of bytes this field occupies.
   */
  std::size_t size() const;
};

class KernelArgBuilder {
 public:
  void SetInt(int val);
  void SetFloat(float val);
  void SetNdarray(GLuint ssbo, const std::vector<int>& shape);
};

class ArgsContext {
 public:
  KernelArgBuilder GetArgBuilder(int index);
};

// Corresponds to a @ti.kernel with NO ti.template parameter
class Kernel {
 public:
  // ArgsContext is for passing the kernel arguments
  void Run(const ArgsContext& c);
  // Small helper, in case there is no argument
  void Run();
};

// Examples:
//   {"bool_key", true}
//   {"int_key", 42}
//   {"field_key", a taichi::Field object}
class TemplateArg;

// Corresponds to a @ti.kernel with AT LEAST ONE
// ti.template parameter
class KernelTemplate {
 public:
  void Run(const std::vector<TemplateArg>& tmpl_args, const ArgsContext& c);
  // Small helper, in case there is no argument
  void Run(const std::vector<TemplateArg>& tmpl_args);
};

// Corresponds to an AOT module that is compiled for
// the OpenGL backend.
class GLProgram {
 public:
  explicit GLProgram(const std::string& module_path);

  Field GetField(const std::string& name);
  Kernel GetKernel(const std::string& name);
  KernelTemplate GetKernelTemplate(const std::string& name);
};

}  // namespace taichi
```
</details>
We can then use the above API in the following manner:
```cpp
GLuint x_ssbo;
glGenBuffers(1, &x_ssbo);

/************* App initialization *************/
taichi::GLProgram program{"/path/to/dir/opengl.tcb"};
auto init_kernel = program.GetKernel("init");
init_kernel.Run();

/************* App rendering loop *************/
auto substep_kernel = program.GetKernel("substep");
for (int i = 0; i < 50; i++) {
  substep_kernel.Run();
}
auto x_field = program.GetField("x");
x_field.CopyTo(x_ssbo);
// The MPM88 particle position data are now copied into `x_ssbo`, and can be
// used in a particle-system rendering pipeline.
```
## Taichi kernel template
So far we have only talked about regular Taichi kernels. However, there is a special kind of kernel: one with at least one `ti.template` parameter. For example:
```py
x = ti.field(ti.f32, shape=8)
y = ti.field(ti.f32, shape=(8, 4))

@ti.kernel
def add_one(f: ti.template()):
    for I in ti.grouped(f):
        f[I] += 1

# This instantiates two Taichi kernels, bound to different Taichi fields, `x` and `y`.
add_one(x)
add_one(y)
```
The special part is that Taichi instantiates a separate kernel body for each distinct set of template arguments. Readers with a C++ background can relate this to C++ function templates: the compiler does not instantiate a function definition until you invoke the template with concrete type arguments. As a result, a compiled Taichi kernel cannot be identified by its name alone; it takes the combination of a string (the kernel template name) and the template args.
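The analogy can be made concrete with a toy C++ snippet (purely illustrative, not part of the proposed API):

```cpp
// A function template produces no code by itself; the compiler
// instantiates a separate definition per distinct template argument,
// just as a Taichi kernel template produces one compiled kernel per
// distinct ti.template() argument.
template <typename T>
T add_one(T v) {
  return v + T{1};
}

// Each instantiation is a distinct symbol, so the bare name "add_one"
// is ambiguous; identification requires name + template arguments,
// e.g. "add_one<int>" vs. "add_one<double>".
int add_one_int(int v) { return add_one<int>(v); }
double add_one_double(double v) { return add_one<double>(v); }
```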
`Module.add_kernel_template()` is for handling this situation.
```py
with mod.add_kernel_template(add_one) as kt:
    kt.instantiate(f=x)
    kt.instantiate(f=y)
```
Then on the app side, we can retrieve and run these instantiated kernels with the code below.
```cpp
auto x_field = program.GetField("x");
auto y_field = program.GetField("y");
auto add_one_tmpl = program.GetKernelTemplate("add_one");
add_one_tmpl.Run(/*template_args=*/{taichi::TemplateArg{"f", x_field}});
add_one_tmpl.Run(/*template_args=*/{taichi::TemplateArg{"f", y_field}});
```
## Ndarray: making data containers more flexible
Currently, Taichi fields are the official way to pass data between the kernel side and the host side. However, they come with a few restrictions:
1. All the Taichi fields are currently packed into a single GPU buffer (the *root buffer*). This means that to retrieve the data of a single field, we need a look-up table to figure out its offset and range in the buffer. (Note that different GPU APIs have different terminology for this buffer, e.g. SSBO for OpenGL, `MTLBuffer` for Apple Metal, etc.)
2. Because the root buffer size is determined at compile time, we cannot use fields of different shapes at runtime. The implication is that each time we want to change a field's size, we have to re-run the AOT pipeline.
3. Users cannot plug their existing GPU buffers into a Taichi kernel. Say we already have an SSBO `x_ssbo` holding the particles' positions in our particle system: we have to run the Taichi kernels, then copy the data from the root buffer into `x_ssbo`. Ideally, we could achieve zero-copy here by binding `x_ssbo` directly to the GL shaders generated by Taichi.
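To make restriction 1 concrete, here is a minimal sketch of the kind of per-field lookup table the runtime has to carry in order to locate one field inside the single root buffer (all names here are hypothetical, not part of the proposal):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical metadata the AOT module would ship so the runtime can
// find an individual field inside the root buffer.
struct FieldSlot {
  std::size_t offset;  // byte offset into the root buffer
  std::size_t size;    // number of bytes the field occupies
};

class RootBufferLayout {
 public:
  void Register(const std::string& name, FieldSlot slot) {
    slots_[name] = slot;
  }
  // Returns the [offset, offset + size) byte range to copy for `name`.
  FieldSlot Lookup(const std::string& name) const {
    return slots_.at(name);
  }

 private:
  std::unordered_map<std::string, FieldSlot> slots_;
};
```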
To overcome these disadvantages, we have been prototyping a new data container called *Ndarray*. An Ndarray can be viewed as a more flexible and systematic implementation of Taichi's external arrays.
Say we'd like to pass a 2-D array of `vec2` into a Taichi kernel. Here's how we can rewrite the kernels using the Ndarray container:
```py
x = ti.Vector.ndarray(n=2, dtype=ti.f32, shape=(128, 128))

@ti.kernel
def substep(x_arr: ti.types.Vector.ndarray(n=2, dtype=ti.f32, dim=2)):
    ...

def run_jit():
    ...
    for _ in range(50):
        substep(x)
    ...

def run_aot():
    mod = ti.aot.Module(arch=ti.opengl)
    mod.add_kernel(init)
    mod.add_kernel(substep)
    mod.save('/path/to/dir', 'opengl.tcb')  # .tcb for "taichi binary"
```
If our app already has an SSBO `x_ssbo` of the matching traits, we can pass it to the compiled kernel in this way:
```cpp
GLuint x_ssbo;
glGenBuffers(1, &x_ssbo);

/************* App initialization *************/
taichi::GLProgram program{"/path/to/dir/opengl.tcb"};
auto init_kernel = program.GetKernel("init");
init_kernel.Run();

/************* App rendering loop *************/
auto substep_kernel = program.GetKernel("substep");
taichi::ArgsContext ctx;
// |x_ssbo| and its shape are passed directly to the
// compiled Taichi kernels as runtime arguments.
ctx.GetArgBuilder(/*index=*/0)
    .SetNdarray(/*ssbo=*/x_ssbo, /*shape=*/{128, 128});
for (int i = 0; i < 50; i++) {
  substep_kernel.Run(ctx);
}
// Note that there is no need to copy the data out of a Taichi
// field. |x_ssbo| is now populated with the correct data.
```
# Implementation Roadmap
* [x] AOT module implementation
* [x] OpenGL
* [x] Vulkan
* [x] Apple Metal
* [x] Ndarray
* [ ] OpenGL (WIP https://github.com/taichi-dev/taichi/issues/3280)
* [x] Vulkan
* [x] Apple Metal https://github.com/taichi-dev/taichi/pull/4720
* [ ] Split the runtime and the codegen into separate folders (https://github.com/taichi-dev/taichi/issues/2626, https://github.com/taichi-dev/taichi/issues/3679).
* [x] Provide a way to build just the runtime part into a sub-library, which can be included into the users' app projects.
* [ ] CPU-side constant folding. This would avoid the necessity to initialize any backend runtime during the codegen. With this capability, the targeted arch can be completely decoupled from the OS running the AOT, i.e. we can run AOT targeted at Apple Metal on a Windows machine.
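To sketch what CPU-side constant folding means here: expressions whose operands are all compile-time constants can be evaluated on the host during codegen, so no backend runtime needs to be brought up just to compute them. A toy fold over a single binary op (illustrative only, not Taichi's actual IR):

```cpp
#include <optional>

// Fold `lhs op rhs` on the host when both operands are known constants;
// return std::nullopt when the expression must stay in the generated code.
std::optional<int> FoldBinary(char op, std::optional<int> lhs,
                              std::optional<int> rhs) {
  if (!lhs || !rhs) return std::nullopt;
  switch (op) {
    case '+':
      return *lhs + *rhs;
    case '*':
      return *lhs * *rhs;
    default:
      return std::nullopt;
  }
}
```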
# Q & A
* What Taichi features do you plan to support?
1. Dense Taichi fields + Ndarray
2. Non-LLVM backends, including OpenGL, Vulkan and Apple Metal
3. Taichi template kernels
* What Taichi features are currently out of the scope?
* sparse fields
* `@ti.data_oriented`
* Other limitations?
The logic that invokes these kernels still needs to be rewritten in the users' apps (e.g. the `run_jit()` body in the example above). We may consider adding a compute graph in the future. Discussion & proposals welcome!
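As a rough sketch of what such a compute graph could look like on the app side (names hypothetical, not a committed design), the driver logic would be recorded once as a sequence of kernel launches and then replayed without re-implementing it per platform:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical compute-graph sketch: record a sequence of kernel
// launches once, then replay it without rewriting the driver logic
// in app code.
class ComputeGraph {
 public:
  void AddStep(std::function<void()> step) {
    steps_.push_back(std::move(step));
  }
  // Runs the recorded steps in order, `times` iterations.
  void Replay(int times) const {
    for (int i = 0; i < times; i++) {
      for (const auto& step : steps_) step();
    }
  }

 private:
  std::vector<std::function<void()>> steps_;
};
```

Usage would mirror the rendering loop above, e.g. `graph.AddStep([&] { substep_kernel.Run(); }); graph.Replay(50);`.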
* How to locate a Taichi kernel?
For a regular kernel, the kernel name (a string) is enough as the identifier. For a kernel template, it is a combination of the kernel name and the instantiating template args.
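A minimal sketch of such an identifier scheme (hypothetical; not the actual serialization format):

```cpp
#include <string>
#include <vector>

// Hypothetical: build a lookup key for a compiled kernel by combining
// the kernel (template) name with its serialized template args. A
// regular kernel has no template args, so its key is just the name.
std::string MakeKernelKey(const std::string& name,
                          const std::vector<std::string>& tmpl_args) {
  std::string key = name;
  for (const auto& arg : tmpl_args) {
    key += "|" + arg;
  }
  return key;
}
```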
* How to support upgrading?
We can embed a version number in the AOT module and check it at load time.
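For instance, the module header could carry a format version that the loader checks before anything else (a sketch; the field and constant names are hypothetical):

```cpp
// Hypothetical version gate performed when loading an AOT module:
// reject modules produced by an incompatible compiler version.
struct ModuleHeader {
  int aot_format_version;
};

constexpr int kMinSupportedAotVersion = 1;
constexpr int kMaxSupportedAotVersion = 1;

bool IsCompatible(const ModuleHeader& header) {
  return header.aot_format_version >= kMinSupportedAotVersion &&
         header.aot_format_version <= kMaxSupportedAotVersion;
}
```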