We'd like to share our ideas on how to implement the AOT feature in Taichi. AOT (ahead-of-time compilation) refers to using Taichi as a GPU compute shader/kernel compiler: AOT users compile their Taichi kernels into an *AOT module*, package this module alongside their apps (likely without a Python environment, e.g. an Android app), and load/execute the compiled shaders from this module inside the app.
Note that AOT is already ongoing work, so some of the tasks below have already been implemented. For a quick peek at the Taichi AOT workflow, please check out [this test](https://github.com/taichi-dev/taichi/blob/master/tests/python/test_aot.py).
# Goals
* Provide a workflow to package the Taichi kernels into a module that can be loaded and executed from a non-Python environment (e.g. an Android app using Vulkan).
* Prioritize deployment on mobile platforms, including OpenGL ES, Vulkan and Apple Metal.
* Prioritize *dense* Taichi fields. That said, we plan to further decouple AOT from Taichi fields (see the *Ndarray* section).
# API Proposal
Taichi provides a utility, [`taichi.aot.Module`](https://github.com/taichi-dev/taichi/blob/fdf665e4dfe6c1cee41e67b12b288749b2c8cb8f/python/taichi/aot/module.py#L58), for compiling the Taichi kernels and fields info into a module file. It provides these APIs:
* `add_kernel(kernel_fn)`: Add a Taichi kernel to the AOT module.
* `add_kernel_template(kernel_templ_fn)`: Add a *Taichi kernel template* to the AOT module.
* (*) `add_field(name, field)`: Add a Taichi field to the AOT module. That said, we hope Ndarray can serve as a more convenient dense data container in AOT use cases.
* `save(filepath, filename)`: Save this AOT module to `filepath/filename`.
We will walk through the `Module` usage with the following example.
```py
x = ti.Vector.field(2, float, n_particles)
v = ti.Vector.field(2, float, n_particles)
...

@ti.kernel
def init():
    ...

@ti.kernel
def substep():
    ...

def run_jit():
    gui = ti.GUI('mpm88')
    # Driver logic
    init()
    while True:
        for _ in range(50):
            substep()
        gui.circles(x.to_numpy())
        gui.show()

def run_aot():
    # 1
    mod = ti.aot.Module(arch=ti.opengl)
    # 2
    mod.add_kernel(init)
    mod.add_kernel(substep)
    # 3
    mod.add_field(x, name="x")
    mod.add_field(v, name="v")
    # 4
    mod.save('/path/to/dir', 'opengl.tcb')  # .tcb for "taichi binary"
```
1. We have created an AOT module, `mod`, targeting the GL/ES shading language.
2. The snippet has defined two Taichi kernels: `init` and `substep`. This step adds both kernels to `mod`.
3. The snippet has also defined two Taichi fields: `x` and `v`. Both are added to `mod`, too.
4. Finally, we save the module to `/path/to/dir/opengl.tcb`.
This completes the work required on the Taichi/Python side.
Assuming we then want to deploy this to an Android app and have added `opengl.tcb` to the app project, we imagine the following set of C++ APIs being useful. Note that the language implementing the API is mostly irrelevant and should be chosen for suitability on the targeted platform (e.g. ObjC/Swift for iOS, Java/Kotlin for Android). We choose C++ here simply for developer familiarity (and, being low-level, C++ is suitable for both mobile platforms).
<details>
<summary>C++ API</summary>
```cpp
namespace taichi {

// Corresponds to a ti.field
class Field {
 public:
  /**
   * Copies the content of this field to @param dest_buffer.
   *
   * Internally uses glCopyBufferSubData() along with
   * GL_COPY_READ_BUFFER/GL_COPY_WRITE_BUFFER.
   */
  void CopyTo(GLint dest_buffer, GLintptr write_offset = 0);
  /**
   * Copies the content of this field to @param host_buffer.
   */
  void CopyTo(char* host_buffer);
  /**
   * Number of bytes this field occupies.
   */
  std::size_t size() const;
};

class KernelArgBuilder {
 public:
  void SetInt(int val);
  void SetFloat(float val);
  void SetNdarray(GLuint ssbo, const std::vector<int>& shape);
};

class ArgsContext {
 public:
  KernelArgBuilder GetArgBuilder(int index);
};

// Corresponds to a @ti.kernel with NO ti.template parameter
class Kernel {
 public:
  // ArgsContext is for passing the kernel arguments
  void Run(const ArgsContext& c);
  // Small helper, in case there is no argument
  void Run();
};

// Examples:
//   {"bool_key", true}
//   {"int_key", 42}
//   {"field_key", a taichi::Field object}
class TemplateArg;

// Corresponds to a @ti.kernel with AT LEAST ONE
// ti.template parameter
class KernelTemplate {
 public:
  void Run(const std::vector<TemplateArg>& tmpl_args, const ArgsContext& c);
  // Small helper, in case there is no argument
  void Run(const std::vector<TemplateArg>& tmpl_args);
};

// Corresponds to an AOT module that is compiled for
// the OpenGL backend.
class GLProgram {
 public:
  explicit GLProgram(const std::string& module_path);

  Field GetField(const std::string& name);
  Kernel GetKernel(const std::string& name);
  KernelTemplate GetKernelTemplate(const std::string& name);
};

}  // namespace taichi
```
</details>
We can then use the above API in the following manner:
```cpp
GLuint x_ssbo;
glGenBuffers(1, &x_ssbo);

/************* App initialization *************/
taichi::GLProgram program{"/path/to/dir/opengl.tcb"};
auto init_kernel = program.GetKernel("init");
init_kernel.Run();

/************* App rendering loop *************/
auto substep_kernel = program.GetKernel("substep");
for (int i = 0; i < 50; i++) {
  substep_kernel.Run();
}
auto x_field = program.GetField("x");
x_field.CopyTo(x_ssbo);
// The MPM88 particle position data are now copied into `x_ssbo`, and can be
// used in a particle-system rendering pipeline.
```
## Taichi kernel template
So far we have only talked about regular Taichi kernels. However, there is a special kind of kernel: one with at least one `ti.template` parameter. For example:
```py
x = ti.field(ti.f32, shape=8)
y = ti.field(ti.f32, shape=(8, 4))

@ti.kernel
def add_one(f: ti.template()):
    for I in ti.grouped(f):
        f[I] += 1

# This instantiates two Taichi kernels, bound to different Taichi fields, `x` and `y`.
add_one(x)
add_one(y)
```
The special part is that Taichi instantiates a separate kernel body for each distinct set of template arguments. Readers with a C++ background can relate this to C++ function templates: the compiler does not instantiate a function definition until you invoke the template with concrete type arguments. As a result, a compiled Taichi kernel cannot be identified by its name alone; it takes the combination of a string (the kernel template name) and the template args.
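The analogy can be made concrete with a toy C++ snippet (purely illustrative, not part of the proposed API):

```cpp
// A function template produces no code by itself; the compiler
// instantiates a separate definition per distinct template argument,
// just as a Taichi kernel template produces one compiled kernel per
// distinct ti.template() argument.
template <typename T>
T add_one(T v) {
  return v + T{1};
}

// Each instantiation is a distinct symbol, so the bare name "add_one"
// is ambiguous; identification requires name + template arguments,
// e.g. "add_one<int>" vs. "add_one<double>".
int add_one_int(int v) { return add_one<int>(v); }
double add_one_double(double v) { return add_one<double>(v); }
```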
`Module.add_kernel_template()` is for handling this situation.
```py
with mod.add_kernel_template(add_one) as kt:
    kt.instantiate(f=x)
    kt.instantiate(f=y)
```
Then on the app side, we can retrieve and run these instantiated kernels with the code below.
```cpp
auto x_field = program.GetField("x");
auto y_field = program.GetField("y");
auto add_one_tmpl = program.GetKernelTemplate("add_one");
add_one_tmpl.Run(/*template_args=*/{taichi::TemplateArg{"f", x_field}});
add_one_tmpl.Run(/*template_args=*/{taichi::TemplateArg{"f", y_field}});
```
## Ndarray: making data containers more flexible
Currently, Taichi fields are the official way to pass data between the kernel side and the host side. However, they come with a few restrictions:
1. All the Taichi fields are currently packed into a single GPU buffer (the *root buffer*). This means that to retrieve the data of a single field, we need a look-up table to figure out its offset and range in the buffer. (Note that different GPU APIs have different terminology for this buffer, e.g. SSBO for OpenGL, `MTLBuffer` for Apple Metal, etc.)
2. Because the root buffer size is determined at compile time, we cannot use fields of different shapes at runtime. The implication is that each time we want to change a field's size, we have to re-run the AOT pipeline.
3. Users cannot plug their existing GPU buffers into a Taichi kernel. Say we already have an SSBO `x_ssbo` holding the particles' positions in our particle system: we have to run the Taichi kernels, then copy the data from the root buffer into `x_ssbo`. Ideally, we could achieve zero-copy here by binding `x_ssbo` directly to the GL shaders generated by Taichi.
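To make restriction 1 concrete, here is a minimal sketch of the kind of per-field lookup table the runtime has to carry in order to locate one field inside the single root buffer (all names here are hypothetical, not part of the proposal):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical metadata the AOT module would ship so the runtime can
// find an individual field inside the root buffer.
struct FieldSlot {
  std::size_t offset;  // byte offset into the root buffer
  std::size_t size;    // number of bytes the field occupies
};

class RootBufferLayout {
 public:
  void Register(const std::string& name, FieldSlot slot) {
    slots_[name] = slot;
  }
  // Returns the [offset, offset + size) byte range to copy for `name`.
  FieldSlot Lookup(const std::string& name) const {
    return slots_.at(name);
  }

 private:
  std::unordered_map<std::string, FieldSlot> slots_;
};
```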
To overcome these disadvantages, we have been prototyping a new data container called *Ndarray*. An Ndarray can be viewed as a more flexible and systematic implementation of Taichi's external arrays.
Say we'd like to pass a 2-D array of `vec2` into a Taichi kernel. Here's how we can rewrite the kernels using the Ndarray container:
```py
x = ti.Vector.ndarray(n=2, dtype=ti.f32, shape=(128, 128))

@ti.kernel
def substep(x_arr: ti.types.Vector.ndarray(n=2, dtype=ti.f32, dim=2)):
    ...

def run_jit():
    ...
    for _ in range(50):
        substep(x)
    ...

def run_aot():
    mod = ti.aot.Module(arch=ti.opengl)
    mod.add_kernel(init)
    mod.add_kernel(substep)
    mod.save('/path/to/dir', 'opengl.tcb')  # .tcb for "taichi binary"
```
If our app already has an SSBO `x_ssbo` of the matching traits, we can pass it to the compiled kernel in this way:
```cpp
GLuint x_ssbo;
glGenBuffers(1, &x_ssbo);

/************* App initialization *************/
taichi::GLProgram program{"/path/to/dir/opengl.tcb"};
auto init_kernel = program.GetKernel("init");
init_kernel.Run();

/************* App rendering loop *************/
auto substep_kernel = program.GetKernel("substep");
taichi::ArgsContext ctx;
// |x_ssbo| and its shape are passed directly to the
// compiled Taichi kernels as runtime arguments.
ctx.GetArgBuilder(/*index=*/0)
    .SetNdarray(/*ssbo=*/x_ssbo, /*shape=*/{128, 128});
for (int i = 0; i < 50; i++) {
  substep_kernel.Run(ctx);
}
// Note that there is no need to copy the data out of a Taichi
// field. |x_ssbo| is now populated with the correct data.
```
# Implementation Roadmap
* [x] AOT module implementation
* [x] OpenGL
* [x] Vulkan
* [x] Apple Metal
* [x] Ndarray
* [ ] OpenGL (WIP https://github.com/taichi-dev/taichi/issues/3280)
* [x] Vulkan
* [x] Apple Metal https://github.com/taichi-dev/taichi/pull/4720
* [ ] Split the runtime and the codegen into separate folders (https://github.com/taichi-dev/taichi/issues/2626, https://github.com/taichi-dev/taichi/issues/3679).
* [x] Provide a way to build just the runtime part into a sub-library, which can be included into the users' app projects.
* [ ] CPU-side constant folding. This would avoid the necessity to initialize any backend runtime during the codegen. With this capability, the targeted arch can be completely decoupled from the OS running the AOT, i.e. we can run AOT targeted at Apple Metal on a Windows machine.
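To sketch what CPU-side constant folding means here: expressions whose operands are all compile-time constants can be evaluated on the host during codegen, so no backend runtime needs to be brought up just to compute them. A toy fold over a single binary op (illustrative only, not Taichi's actual IR):

```cpp
#include <optional>

// Fold `lhs op rhs` on the host when both operands are known constants;
// return std::nullopt when the expression must stay in the generated code.
std::optional<int> FoldBinary(char op, std::optional<int> lhs,
                              std::optional<int> rhs) {
  if (!lhs || !rhs) return std::nullopt;
  switch (op) {
    case '+':
      return *lhs + *rhs;
    case '*':
      return *lhs * *rhs;
    default:
      return std::nullopt;
  }
}
```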
# Q & A
* What Taichi features do you plan to support?
1. Dense Taichi fields + Ndarray
2. Non-LLVM backends, including OpenGL, Vulkan and Apple Metal
3. Taichi template kernels
* What Taichi features are currently out of the scope?
* sparse fields
* `@ti.data_oriented`
* Other limitations?
The logic that invokes these kernels still needs to be rewritten in the users' apps (e.g. the `run_jit()` body in the example above). We may consider adding a compute graph in the future. Discussion & proposals welcome!
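As a rough sketch of what such a compute graph could look like on the app side (names hypothetical, not a committed design), the driver logic would be recorded once as a sequence of kernel launches and then replayed without re-implementing it per platform:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical compute-graph sketch: record a sequence of kernel
// launches once, then replay it without rewriting the driver logic
// in app code.
class ComputeGraph {
 public:
  void AddStep(std::function<void()> step) {
    steps_.push_back(std::move(step));
  }
  // Runs the recorded steps in order, `times` iterations.
  void Replay(int times) const {
    for (int i = 0; i < times; i++) {
      for (const auto& step : steps_) step();
    }
  }

 private:
  std::vector<std::function<void()>> steps_;
};
```

Usage would mirror the rendering loop above, e.g. `graph.AddStep([&] { substep_kernel.Run(); }); graph.Replay(50);`.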
* How to locate a Taichi kernel?
For a regular kernel, the kernel name (a string) is enough as the identifier. For a kernel template, it is a combination of the kernel name and the instantiating template args.
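A minimal sketch of such an identifier scheme (hypothetical; not the actual serialization format):

```cpp
#include <string>
#include <vector>

// Hypothetical: build a lookup key for a compiled kernel by combining
// the kernel (template) name with its serialized template args. A
// regular kernel has no template args, so its key is just the name.
std::string MakeKernelKey(const std::string& name,
                          const std::vector<std::string>& tmpl_args) {
  std::string key = name;
  for (const auto& arg : tmpl_args) {
    key += "|" + arg;
  }
  return key;
}
```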
* How to support upgrading?
We can embed a version number in the AOT module and check it at load time.
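For instance, the module header could carry a format version that the loader checks before anything else (a sketch; the field and constant names are hypothetical):

```cpp
// Hypothetical version gate performed when loading an AOT module:
// reject modules produced by an incompatible compiler version.
struct ModuleHeader {
  int aot_format_version;
};

constexpr int kMinSupportedAotVersion = 1;
constexpr int kMaxSupportedAotVersion = 1;

bool IsCompatible(const ModuleHeader& header) {
  return header.aot_format_version >= kMinSupportedAotVersion &&
         header.aot_format_version <= kMaxSupportedAotVersion;
}
```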