About __syncthreads

lonelywm · November 9, 2021, 13:17

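I have just started learning Taichi. I may not have read the tutorials thoroughly, but so far I have not found anything like CUDA's "__syncthreads", or any other way to implement a barrier. ti.sync() has no effect inside a ti.kernel. My test code is below: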

import taichi as ti
import numpy as np
import random

length = 50000+1
ti.init(arch=ti.cuda)
boxes = ti.field(ti.f32, length)
lis = [i for i in range(length)]
# random.shuffle(lis)
lisnp = np.array(lis)
boxes.from_numpy(lisnp)
boxes_cache = ti.field(ti.f32, length)

@ti.kernel
def com_max(boxes: ti.template(), boxes_cache: ti.template(), length: int) -> ti.f32:
    # ti.block_local(boxes_cache)
    for tid in range(length):
        stride = length // 2
        length = length - stride
        boxes_cache[tid] = boxes[tid]
        ti.sync()
        while stride > tid:
            a = boxes_cache[tid]
            b = boxes_cache[tid + stride]
            ti.sync()
            boxes_cache[tid] = a if a > b else b
            stride = length // 2
            length = length - stride
    return boxes_cache[0]

a = com_max(boxes, boxes_cache, length)

# ======> output
print(lisnp.max())   # output: 50000
print(a)             # output: 47269.0

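This max-finding code fails to return the correct result on both CPU and CUDA. I'm not sure whether Taichi implements a barrier; it seems fairly important for parallel programming.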

archibate · November 9, 2021, 13:50

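Taichi indeed has no barrier or shared memory.
But you can simply use atomic_max; the Taichi compiler will automatically optimize it into a parallel reduction that uses barriers and BLS (block-local storage):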

@ti.kernel
def com_max(boxes: ti.template(), length: int) -> ti.f32:
    ret = boxes[0]
    for i in range(length):
        ret = ti.atomic_max(ret, boxes[i])

a = com_max(boxes, length)

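That is Taichi's philosophy: the user writes the kernel with serial logic and the compiler decides on the optimizations, generating appropriate code for the CPU and CUDA backends, so you do not need to care about CUDA-specific implementation details such as barriers.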

lonelywm · November 9, 2021, 14:10

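That is really elegant. Now I don't have to worry about temporary memory allocation or a lot of synchronization issues!!
By the way, I ran the code you posted and the result seems wrong. Could you take a look?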

import taichi as ti
import numpy as np

length = 50000+1
ti.init(arch=ti.cuda)
boxes = ti.field(ti.f32, length)
lis = [i for i in range(length)]
lisnp = np.array(lis)
boxes.from_numpy(lisnp)


@ti.kernel
def com_max(boxes: ti.template(), length: int) -> ti.f32:
    ret = boxes[0]
    for i in range(length):
        ret = ti.atomic_max(ret, boxes[i])
    return ret

a = com_max(boxes, length)

print(lisnp.max()) # output: 50000
print(a)           # output: 0.0

I can't see anything wrong with this code, but the result is just incorrect.

lonelywm · November 9, 2021, 14:31

Solved it. I couldn't find atomic_max in the API docs, so I took a guess, and dropping the assignment fixes it:

ret = ti.atomic_max(ret, boxes[i])  # ret === 0
ti.atomic_max(ret, boxes[i])          # ok
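
A likely explanation: Taichi's atomic operations return the old value of their first argument, so assigning the result back to ret overwrites the freshly stored maximum with the previous value every time. The plain-Python sketch below models that behavior serially. It does not use Taichi at all; atomic_max_model and the one-element list standing in for ret are hypothetical helpers made up for illustration, under the assumption that the real atomic_max behaves this way.

```python
def atomic_max_model(cell, value):
    """Hypothetical serial model of an atomic max (not a Taichi API).

    Stores max(cell[0], value) into cell[0] and returns the OLD value,
    which is the assumed behavior of ti.atomic_max.
    """
    old = cell[0]
    cell[0] = max(old, value)
    return old


def max_with_assignment(values):
    # Mirrors `ret = ti.atomic_max(ret, boxes[i])`.
    ret = [values[0]]
    for v in values:
        # The returned old value overwrites the maximum that was just stored.
        ret[0] = atomic_max_model(ret, v)
    return ret[0]


def max_without_assignment(values):
    # Mirrors `ti.atomic_max(ret, boxes[i])` with the return value ignored.
    ret = [values[0]]
    for v in values:
        atomic_max_model(ret, v)
    return ret[0]


# Same data as the thread: 0, 1, ..., 50000 stored as floats.
values = [float(i) for i in range(50000 + 1)]
print(max_with_assignment(values))     # 0.0
print(max_without_assignment(values))  # 50000.0
```

In this model the version with the assignment never gets past the initial boxes[0], which matches the 0.0 seen above.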