Blender Molecular in Taichi

Pavel_Blend · December 12, 2020, 11:51

Hello.

I have a question. But first, a foreword.

There is an addon for Blender 3D that calculates collisions between particles. It can also create links between particles, which makes it possible to simulate various materials (snow, soil, sand, and others). The addon is called Molecular, and its author is Pyroevil. The original code is in this repository on GitHub:
github → Pyroevil/Blender-Molecular-Script

I am adding new features to this addon. At the moment I am writing code that will let you specify the physical properties of particles and links using textures. My code is in this fork:

github → PavelBlend/blender-molecular

With the Molecular addon, you can simulate millions of particles and tens of millions of links between particles. The core of the addon is written in Cython (which is effectively C).

Here are examples with lots of particles:

Calculating millions of particles can take several days.

Now the question itself:

If we rewrite the Molecular core from Cython to Taichi, is it possible to increase the speed of the calculations without using CUDA? Does it make sense to rewrite it in Taichi? Is it possible to compute millions of particles and tens of millions of links using Taichi without CUDA?

archibate · December 21, 2020, 05:56

Yes, it is worth it. If you’d like to write one, please add the solver to my Taichi-Blend so that people can install it more easily. Utilizing GPGPU for physics simulation is exactly what Taichi is good at.

without CUDA

It is still worthwhile: we also support Metal and OpenGL (and OpenCL in the future) as backends. The multi-threading support of the CPU backend can also beat Cython under certain conditions, not to mention that we will soon restore SIMD support. What’s more, you only have to write the Taichi code once to deploy it to all backends. This could be revolutionary IMO.

Pavel_Blend · December 21, 2020, 08:56

I recently tried to rewrite Molecular in Taichi, but I failed because Taichi does not support function recursion. Recursion is used often in Molecular, and I don’t know how to rewrite the code so that it works without recursion.

archibate · December 22, 2020, 01:46

Yes, many algorithms demand recursion, and it would be great for Taichi to support it.
But in fact there are no actual functions in Taichi: all functions are force-inlined, so there is only a single function at the top level.
Even if Taichi supported real functions, recursion could still be hard, because many backends (like OpenGL compute shaders) prohibit recursion for simplicity.
Until Taichi supports real functions and recursion, we can use a stack to simulate recursion: push the arguments of each call you want to make onto the stack, then keep popping the stack until it is empty.
For example, here’s how I implement an octree:

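import taichi as ti

@ti.data_oriented
class Stack:
    def __init__(self, func, *args):
        self.buf = func(*args)        # backing storage, e.g. ti.field(int, 1024)
        self.len = ti.field(int, ())  # number of elements currently stored

    @ti.func
    def subscript(self, I):
        return self.buf[I]

    @ti.func
    def push(self, val):
        # atomic_add returns the old length, which is the slot to write to
        self.buf[ti.atomic_add(self.len[None], 1)] = val

    @ti.func
    def pop(self):
        # atomic_sub returns the old length; the top element sits one below it
        return self.buf[ti.atomic_sub(self.len[None], 1) - 1]

    @ti.func
    def size(self):
        return self.len[None]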

@ti.data_oriented
class Octree:
    def __init__(self):
        ...
        self.global_stack = Stack(ti.field, int, 1024)

    @ti.func
    def walk(self, r):
        stack = ti.static(self.global_stack)
        stack.push(0)  # push root node
        while stack.size():
            cur = stack.pop()
            if not self.hit_bound(cur, r):
                continue
            for c in range(8):
                if self.children[cur, c] != 0:
                    stack.push(self.children[cur, c])  # call child
            ...

However, it uses a global field for the stack, which works very well on a single-threaded CPU but won’t work on multi-threaded CPUs and GPUs, because all threads would share the same stack. This problem can be solved if Taichi provides support for thread-local arrays. I’ve raised an issue asking for such support:

github.com/taichi-dev/taichi

[Lang] [IR] [Type] Support allocating thread-local arrays

opened 03:45PM - 16 Dec 20 UTC

closed 10:52PM - 14 Apr 22 UTC

archibate

feature request welcome contribution

**Concisely describe the proposed feature**
I'd like to dynamically allocate arrays locally within the current thread so that we have local stack and enumerate tree-like structures without recursion.

**Describe the solution you'd like (if any)**
First of all, the usage of local arrays must not share between offloads. This should adds several backend IR statements:
- `arr = ArrayAllocStmt(size)` allocates a new local array with fixed constant size
- `ArrayPtrStmt(arr, index)` gives the pointer to `arr[index]`.

For the OpenGL backend, it support allocating local arrays by simply `int arr[size];`, it is thread-private.

**Additional comments**
After having local arrays, users may self-define a `top` variable to implement stacks. Btw, how does our research project (the type system) helps? Is local array be considered as a non-primitive type IIUC? Also, will fixed-length local array be the future of `ti.Vector` after the research project is done? Or this is actually already in your TODO items?

But they say they are rushing to finish a research project that may conflict with the feature I requested…
@yuanming @k-ye IMO this feature could be a great test for your research project on types, so why would it conflict? Otherwise I’ll have to hand-write ti.asm to get non-portable arrays in GLSL…
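
To make the stack-instead-of-recursion idea from this post concrete, here is a minimal plain-Python sketch (not Taichi code, and not the octree from the post). It walks a tiny made-up tree twice: once with ordinary recursion and once with a fixed-size buffer plus a `top` index, which is roughly what a thread-local array would allow. The names `CAPACITY`, `children`, `walk_recursive`, and `walk_with_stack` are hypothetical and exist only for this illustration.

```python
# Plain-Python sketch: depth-first tree walk without recursion.
# CAPACITY and the tiny `children` table are made up for illustration.

CAPACITY = 16  # fixed size, in the spirit of a local `int arr[size];`

# children[node] lists the child node ids; node 0 is the root.
children = {
    0: [1, 2],
    1: [3, 4],
    2: [],
    3: [],
    4: [5],
    5: [],
}


def walk_recursive(node, visited):
    """Reference version that relies on function recursion."""
    visited.append(node)
    for child in children[node]:
        walk_recursive(child, visited)


def walk_with_stack(root):
    """Same traversal, using a fixed buffer plus a `top` index."""
    buf = [0] * CAPACITY
    top = 0
    visited = []

    buf[top] = root  # push the root
    top += 1
    while top > 0:
        top -= 1  # pop
        node = buf[top]
        visited.append(node)
        # Push children in reverse so they are popped in their original order.
        for child in reversed(children[node]):
            assert top < CAPACITY, "stack overflow: raise CAPACITY"
            buf[top] = child
            top += 1
    return visited


expected = []
walk_recursive(0, expected)
result = walk_with_stack(0)
print(result == expected)
```

Both versions visit the nodes in the same order. The only state the second version needs is the buffer and the integer `top`, which is why a per-thread array would be enough to make it safe in parallel code.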

k-ye · December 24, 2020, 09:37

I’m not sure what exactly your use case will be, but I think it’s possible to extend @archibate’s global stack with dynamic (or even dense, if you know your max stack depth) as an immediate workaround.

Say if you are looping over a field x and want a stack per x element, one workaround could be to define another dynamic field along with x itself. For example:

x = ti.field(...)
x_stack = ti.field(...)
block = ti.root.dense(...)
block.place(x)
block.dynamic(...).place(x_stack)

@ti.kernel
def foo():
  for i in x:
    stack = x_stack[i]  # |stack| is dynamically indexable
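
The layout above can be mimicked in plain Python to show why it helps. This is only a sketch of the idea, not Taichi SNode code: every element of `x` gets its own stack row and its own length counter, so loop iterations never share stack state and can run concurrently. `MAX_DEPTH`, `x_top`, `left`, `right`, and `subtree_size` are hypothetical names, and the small tree is made up.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_DEPTH = 8  # hypothetical upper bound on stack depth

# A small binary tree stored as child indices; -1 means "no child".
left = [1, 3, -1, -1, -1]
right = [2, 4, -1, -1, -1]

x = [0, 1, 2, 3, 4]  # one work item per start node
n = len(x)

# One private stack row and one length counter per element of x.
x_stack = [[0] * MAX_DEPTH for _ in range(n)]
x_top = [0] * n


def subtree_size(i):
    """Count the nodes in the subtree rooted at x[i], using only row i."""
    stack = x_stack[i]
    stack[0] = x[i]
    x_top[i] = 1
    count = 0
    while x_top[i] > 0:
        x_top[i] -= 1
        node = stack[x_top[i]]
        count += 1
        for child in (left[node], right[node]):
            if child != -1:
                stack[x_top[i]] = child
                x_top[i] += 1
    return count


# Iterations can run concurrently because no two of them share a row.
with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(subtree_size, range(n)))

print(sizes)
```

The cost of this approach is memory: the number of stacks grows with the number of elements, not with the number of threads that actually run at once.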

archibate · December 26, 2020, 06:43

It works for small ranges, thanks @k-ye! But I got:

[cuda_driver.h:operator()@80] CUDA Error CUDA_ERROR_ASSERT: device-side assert triggered while calling stream_synchronize (cuStreamSynchronize)

when using a larger value for it even with dynamic SNodes…

FYI I want to launch my kernel over a range of 512x512 (for ray tracing), but I believe there aren’t that many threads because of the grid-stride loop. So here’s my feature request for adding the following two APIs:

ti.get_linear_thread_id() to get the linear thread id, can be called inside a kernel.
ti.get_max_thread_count() to get the maximum possible thread count of current backend.

That way I would only need to allocate ti.get_max_thread_count() stacks, which saves memory and therefore allows each stack to grow longer?
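
As a rough illustration of this request, the plain-Python sketch below emulates a grid-stride loop in which each thread reuses one private stack for all of its items. The two proposed Taichi APIs do not exist here, so the thread id is passed in explicitly as `tid`. `NUM_ITEMS`, `NUM_THREADS`, `MAX_DEPTH`, `count_set_bits`, and `worker` are hypothetical names, and the per-item job is a toy chosen only because it needs a scratch stack.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_ITEMS = 64    # stands in for the 512x512 launch range
NUM_THREADS = 4   # stands in for the proposed maximum thread count
MAX_DEPTH = 16    # hypothetical stack capacity

# Only NUM_THREADS stacks are allocated, not NUM_ITEMS.
stacks = [[0] * MAX_DEPTH for _ in range(NUM_THREADS)]
results = [0] * NUM_ITEMS


def count_set_bits(value, stack):
    """Toy per-item job that needs a scratch stack: popcount via push/pop."""
    top = 0
    while value > 0:
        stack[top] = value & 1  # push the lowest bit
        top += 1
        value >>= 1
    total = 0
    while top > 0:
        top -= 1                # pop
        total += stack[top]
    return total


def worker(tid):
    """Grid-stride loop: thread tid handles items tid, tid + T, tid + 2T, ..."""
    stack = stacks[tid]  # private to this thread id, reused for every item
    for i in range(tid, NUM_ITEMS, NUM_THREADS):
        results[i] = count_set_bits(i, stack)


with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
    list(pool.map(worker, range(NUM_THREADS)))

print(results[:8])
```

Because the stack is indexed by thread id and not by item index, the memory footprint depends only on the thread count, however large the launch range becomes.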

k-ye · January 1, 2021, 05:42

I see. Yeah, these sound like very useful features.

So that I only needs to allocate ti.get_max_thread_count() number of stacks

I guess this relies on the assumption that all kernels use the same number of threads? IIRC, the Metal backend might launch more threads than the grid-stride loop cap if it knows in advance how many threads are needed (range_for, for example). So this may not work on all backends. Does OpenGL always use a grid-stride loop now?

archibate · January 1, 2021, 14:06

Does OpenGL always use grid stride loop now?

Yes, but that doesn’t mean it always uses the same number of threads.
I understand that ti.get_max_thread_count() can be hard to determine for GPU backends, but at least for CPU we can implement ti.get_linear_thread_id(), which should be easy IMO. Its max value is simply ti.cfg.cpu_max_num_threads, so I wouldn’t have to allocate 512x512 stacks, only 4 stacks for 4 CPU cores, which is much more efficient.
Currently I’m working on a toy ray tracer that already gives satisfying accuracy and speed on CPU compared to Blender’s built-in Cycles with just 34 samples:

However, it can’t get any faster because each CPU thread needs to loop over 512x128 bytes of memory, just because we can’t determine which thread we are currently on. In fact, this huge memory overhead cancelled out my well-implemented importance sampling algorithm… So please support ti.get_linear_thread_id() and local stacks soon, thanks!
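
For a sense of scale, the short calculation below compares one stack per pixel with one stack per CPU thread. The 512x512 range and the 4 CPU threads come from this thread, and the depth of 1024 is the capacity used in the Octree example earlier. The 4-byte entry size is an assumption (32-bit integers), and the ray tracer's real stack depth is not stated, so treat the numbers as an order-of-magnitude estimate only.

```python
BYTES_PER_ENTRY = 4   # assumption: 32-bit integer stack entries
STACK_DEPTH = 1024    # capacity used in the Octree example earlier
NUM_PIXELS = 512 * 512
NUM_CPU_THREADS = 4

per_pixel_bytes = NUM_PIXELS * STACK_DEPTH * BYTES_PER_ENTRY
per_thread_bytes = NUM_CPU_THREADS * STACK_DEPTH * BYTES_PER_ENTRY

print(per_pixel_bytes // 2**20, "MiB with one stack per pixel")
print(per_thread_bytes // 2**10, "KiB with one stack per CPU thread")
print(per_pixel_bytes // per_thread_bytes, "times less memory per-thread")
```

Under these assumptions the per-pixel layout needs 1024 MiB while the per-thread layout needs 16 KiB.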

Pavel_Blend · January 26, 2024, 15:59

I rewrote the core of the Molecular addon from Cython to C:

But the code did not run any faster. I want to try again to rewrite the Molecular core in Taichi, but I can’t do this quickly on my own. Therefore, I would like to get help from experienced physics engine programmers.

Questions:
Which simulation method is best for producing simulations similar to those of the Molecular addon? The current method has several disadvantages, and I would like to replace it completely.

Main criteria:
The method should be significantly faster than the current molecular addon method.
The results obtained should not be inferior to the current results of the molecular addon.