Blender Molecular in Taichi

Hello.

I have a question. But first, a foreword.

There is an addon for blender 3d that allows you to calculate collisions between particles. This addon can also create links between particles. To create various materials (snow, soil, sand and others). The name of this addon is Molecular. The author of this addon is Pyroevil. The original code is in this repository on github:
github --> Pyroevil/Blender-Molecular-Script

I am adding new features to this addon. At the moment I am writing code that will allow you to specify the physical properties of particles and links using textures. My code is in the given fork:

github --> PavelBlend/blender-molecular

With the molecular addon, you can simulate millions of particles and tens of millions of links between particles. The core of the add-on is written in Cython (you can say that in C).

Here are examples with lots of particles:


Calculating millions of particles can take several days.

Now the question itself:

If we rewrite the molecular core from cython to taichi, is it possible to increase the performance and speed of calculations without using CUDA? Does it make sense to rewrite in taichi? Is it possible to compute millions of particles and tens of millions of links using taichi without CUDA?

Yes, it worth. Please add the solver to my Taichi-Blend so that people can install it easiler if you’d like to write one. Utilizing GPGPU for physics simulation is exactly what Taichi is good at.

without CUDA

Still profitable, we support Metal and OpenGL (and OpenCL in future) as backends too. Also the multi-threading support of the CPU backend can still beat Cython at certain conditions. Not to say we will soon restore SIMD support. What’s more, you only have to write the Taichi code once to deploy it to all backends. This could be revolutionary IMO.

1 Like

I recently tried to rewrite molecular to taichi, but I failed due to the fact that taichi does not support function recursion. Recursion is often used in molecular. I don’t know how to rewrite the code so that it works without recursion.

1 Like

Yes, many algorithms demands recursion, it would be great for Taichi to support it.
But in fact, there are no actual functions in Taichi, all functions are force-inlined so there are only a single function on top-level.
Even if Taichi supports real functions it could also be hard for recursion as many backends (like OpenGL compute shader) prohibits recursion for simplicity reasons.
Before Taichi could support real functions and recursion, we may use a stack to simulate recursion by pushing the arguments you want to call next into the stack, and keep poping stack after that.
For example, here’s how I implement octree:

@ti.data_oriented
class Stack:
    def __init__(self, func, *args):
        self.buf = func(*args)
        self.len = ti.field(int, ())

    @ti.func
    def subscript(self, I):
        return self.buf[I]

    @ti.func
    def push(self, val):
        self.buf[ti.atomic_add(self.size(), 1)] = val

    @ti.func
    def pop(self):
        return self.buf[ti.atomic_sub(self.size(), 1) - 1]

    @ti.func
    def size(self):
        return self.len[None]

@ti.data_oriented
class Octree:
    def __init__(self):
        ...
        self.global_stack = Stack(ti.field, int, 1024)

    @ti.func
    def walk(self, r):
        stack = ti.static(self.global_stack)
        stack.push(0)  # push root node
        while stack.size():
            cur = stack.pop()
            if not self.hit_bound(cur, r):
                continue
            for c in range(8):
                if self.children[cur, c] != 0:
                    stack.push(self.children[cur, c])  # call child
            ...

However, it use a global field for stack, which works very well on single thread CPU, but won’t work for multi-threading CPU and GPUs. This problem can be solved if Taichi provide support thread-local array. I’ve raised an issue for asking such support:


But they claim that they are rushing for a research project that may conflict with my issued feature…
@yuanming @k-ye IMO this feature can be a great test to your research project on types, why conflict? Otherwise I’ll have to hand-write ti.asm to write non-portable arrays in GLSL…
1 Like

I’m not sure what exactly your use case will be, but I think it’s possible to extend @archibate’s global stack with dynamic (or even dense in you know your max stack depth) as an immediate workaround.

Say if you are looping over a field x and want a stack per x element, one workaround could be to define another dynamic field along with x itself. For example:

x = ti.field(...)
x_stack = ti.field(...)
block = ti.dense(...)
block.place(x)
block.dyanmic.place(x_stack)

@ti.kernel
def foo():
  for i in x:
    stack = x_stack[i]  # |stack| is dynamically indexable
1 Like

It works for small ranges, thanks @k-ye! But I got:

[cuda_driver.h:operator()@80] CUDA Error CUDA_ERROR_ASSERT: device-side assert triggered while calling stream_synchronize (cuStreamSynchronize)

when using a larger value for it even with dynamic SNodes…

FYI I want to launch my kernel over a range of 512x512 (for ray tracing), but I believe there isn’t so much threads due to grid-stride-loop. So here’s my feature request for adding the following two APIs:

  1. ti.get_linear_thread_id() to get the linear thread id, can be called inside a kernel.
  2. ti.get_max_thread_count() to get the maximum possible thread count of current backend.

So that I only needs to allocate ti.get_max_thread_count() number of stacks to save memory therefore allows stacks to extend longer?

I see. Yeah, these sound like very useful features.

So that I only needs to allocate ti.get_max_thread_count() number of stacks

I guess this relies on the assumption that all kernels are using the same number of threads? IIRC, the Metal backend might launch more threads than the grid-stride loop cap, if it knows in advance how many threads are needed. (range_for, for example). So this may not work on all the backends. Does OpenGL always use grid stride loop now?

Does OpenGL always use grid stride loop now?

Yes, but it doesn’t mean it always use a same number of threads.
I understand that ti.get_max_thread_count() can be hard to guess for GPU backends, but at least for CPU, we can implement ti.get_linear_thread_id() which should be easy IMO. It’s max value is simply ti.cfg.cpu_max_num_threads, so that I don’t have allocate 512x512 stacks, but only 4 stacks for 4 CPU cores, much efficient.
Currently I’m working on a toy ray tracer which already gives a satisfying accuracy and speed on CPU compared to Blender built-in Cycles with just 34 samples:


However it can’t be faster thanks to the fact that each CPU thread needs to loop over 512x128 bytes of memory just because we can’t determine which thread we current is. In fact this huge memory overhead just cancelled out my well implemented importance sampling algorithm… So please support ti.get_linear_thread_id() and local stacks soon, thanks!