Ascend 学习笔记 06 - ReduceLogSumExp
ReduceLogSumExp算子思路草稿。
首先根据S4赛季的经验,服务器和我的开发板均是单核的,分片不用考虑大小核。
先按单核实现,之后提交一份代码检查判题机aicore的数量。
Reduce算子和之前的Select不一样,每个位置的计算并不是独立的。所以可能得记录每次计算的中间结果。
参考算子:
torch.logsumexp
算子输入取值范围:
N∈[1,1000]
N2∈[1,1000]
N3∈[200, 2000]
N4∈[200,2000]
算子输入特征说明:
N~N4均可能为非32的整倍数,需要考虑非对齐场景。
x的维度数为dim,axes取值范围为[-dim, dim-1]
题目里给出N的取值范围。另外注意到 axes 的取值可以是负数。含义是从最后一维往前数的第几个维度。
想到题目最终计分的只有一个测试点。
往往会是最简单的形状,比如 axes.size() = 1的 float。
正常
Case1: Pass, Result: 210.17
Case2: Pass, Result: 44807.09
Case3: Run failed!
Case4: Run failed!
Case5: Run failed!
禁用 half
Case1: Wrong answer
Case2: Pass, Result: 44807.09
Case3: Wrong answer
Case4: Run failed!
Case5: Run failed!
case3说明,在half的情况下我的代码有可能运行时报错。
禁用fp32
Case1: Pass, Result: 210.17
Case2: Wrong answer
Case3: Run failed!
Case4: Wrong answer
Case5: Wrong answer
axes_num == 1 then return
Case1: Wrong answer
Case2: Wrong answer
Case3: Wrong answer
Case4: Wrong answer
Case5: Wrong answer
惊了,全是 Wrong answer,说明只用归约一个维度
x_ndim == 1 then return
Case1: Wrong answer
Case2: Pass, Result: 23626.96
Case3: Run failed!
Case4: Run failed!
Case5: Run failed!
x_ndim == 3 then return
Case1: Pass, Result: 201.3
Case2: Wrong answer
Case3: Wrong answer
Case4: Wrong answer
Case5: Run failed!
注释掉Process
Case1: Wrong answer
Case2: Wrong answer
Case3: Wrong answer
Case4: Wrong answer
Case5: Wrong answer
说明不是Init的时候报错。
axes[0] == 0 then return
Case1: Wrong answer
Case2: Wrong answer
Case3: Run failed!
Case4: Run failed!
Case5: Run failed!
axes[0] == 1 then return
Case1: Pass, Result: 201.81
Case2: Wrong answer
Case3: Pass, Result: 795316.39
Case4: Run failed!
Case5: Wrong answer
??????再提交一次一样的,难道axes是随机的?Run failed 有没有可能就是因为超时了?因为他的算子里有50次重复,如果太慢了,就报错了?
Case1: Pass, Result: 205.64
Case2: Pass, Result: 31030.32
Case3: Wrong answer
Case4: Wrong answer
Case5: Run failed!
确定了,axes也是随机的。shape的尺寸应该是固定的,内容是随机的。
并且时间的随机性有点大。
这里的 Case3 碰巧成功了。
现在怀疑的是,超时会导致 run failed 吗?
Case1: Run failed!
Case2: Run failed!
Case3: Run failed!
Case4: Run failed!
Case5: Run failed!
对了!超时会报错 Run failed. 所以就应该改进算法了。
又侧了一下,大多数测试点的最后一维度,能放满8个datablock。
| case | 类型 | axes | x_ndim | 对齐 |
|---|---|---|---|---|
| case1 | fp16 | 0 | 1 | √ |
| case2 | fp32 | 随机 | 3 | √ |
| case3 | fp16 | 随机 | 3 | √ |
| case4 | fp32 | 随机 | 3 | x |
| case5 | fp32 | 随机 | 4 | √ |
Case1: Pass, Result: 24.33
Case2: Pass, Result: 26025.45
Case3: Pass, Result: 52674.41
Case4: Pass, Result: 614477.86
Case5: Run failed!
Case1: Pass, Result: 23.07
Case2: Pass, Result: 13752.02
Case3: Pass, Result: 357978.04
Case4: Pass, Result: 115878.35
Case5: Run failed!
沃趣,开了O3就快了好几倍……………………S4亏了几千块……
Case1: Pass, Result: 10.46
Case2: Pass, Result: 296.57
Case3: Wrong answer
Case4: Wrong answer
Case5: Pass, Result: 91518.82
Case1: Pass, Result: 11.2
Case2: Wrong answer
Case3: Wrong answer
Case4: Wrong answer
Case5: Wrong answer
Case1: Pass, Result: 10.41
Case2: Pass, Result: 864.72
Case3: Wrong answer
Case4: Wrong answer
Case5: Pass, Result: 405289.68
Case1: Pass, Result: 10.37
Case2: Pass, Result: 864.0
Case3: Wrong answer
Case4: Pass, Result: 115822.27
Case5: Pass, Result: 405290.45
Case1: Pass, Result: 10.22
Case2: Pass, Result: 2303.0
Case3: Pass, Result: 35795.71
Case4: Pass, Result: 115763.58
Case5: Pass, Result: 356334.49
prof_sum: 510207.0
然后开始更高并行的优化。
每次迭代最多算256B,mask [1, 256 / sizeof(type)]. 最大 128/64
现在消耗的空间是:
tile_data_num * sizeof(DTYPE_X) * dpb * 2
sizeof(DTYPE_X) * dpb * 2
sizeof(DTYPE_X) * dpb
BLOCK_SIZE * (2 * tile_data_num + 3)
还要保证
bpt * BLOCK_SIZE <= 256
优化完之后:
Case1: Pass, Result: 10.51
Case2: Pass, Result: 142.76
Case3: Pass, Result: 4632.55
Case4: Pass, Result: 724551.2
Case5: Wrong answer
Case1: Pass, Result: 10.43
Case2: Pass, Result: 300.22
Case3: Pass, Result: 4639.45
Case4: Pass, Result: 614716.15
Case5: Pass, Result: 69201.51 // 356334.49
prof_sum: 688867.75
Case1: Pass, Result: 10.55
Case2: Pass, Result: 301.08
Case3: Wrong answer
Case4: Pass, Result: 724559.5
Case5: Pass, Result: 69202.31
Case1: Pass, Result: 10.62
Case2: Pass, Result: 292.77
Case3: Wrong answer
Case4: Pass, Result: 724575.87
Case5: Pass, Result: 69201.29
Case1: Pass, Result: 11.22
Case2: Pass, Result: 318.59
Case3: Pass, Result: 4789.99
Case4: Pass, Result: 614270.17
Case5: Wrong answer
Case1: Pass, Result: 10.4
Case2: Pass, Result: 142.85
Case3: Pass, Result: 4975.8
Case4: Pass, Result: 614251.78
Case5: Wrong answer
Case1: Pass, Result: 11.43
Case2: Pass, Result: 142.81
Case3: Wrong answer
Case4: Pass, Result: 724589.49
Case5: Pass, Result: 69201.51
中间凑巧有一次全跑对了……
现在的问题:
- 非对齐未实现
- 对齐但归约最后一维度,没优化
- 对齐归约非最后一维度,优化了,但是实现有bug,并且优化性能提升不够(远不如第一名)
后续方案:
- 首先解决目前的bug,评估一下在目前报错的维度中,性能提升是否更加显著
- 检查现在的tiling策略是不是有缺陷,可以分配的更密集一点
- 非对齐实现
很好,现在解决了bug。
Case1: Pass, Result: 19.57
Case2: Pass, Result: 313.68
Case3: Pass, Result: 10087.66
Case4: Pass, Result: 724563.58
Case5: Pass, Result: 78690.74
Case1: Pass, Result: 11.49
Case2: Pass, Result: 168.13
Case3: Wrong answer
Case4: Pass, Result: 115767.7
Case5: Pass, Result: 64360.94
Case1: Pass, Result: 11.07
Case2: Pass, Result: 299.66
Case3: Wrong answer
Case4: Pass, Result: 614316.69
Case5: Pass, Result: 78297.76
但是结果非常不尽人意……变得更慢了。而且怎么还有错误的情况。
首先可能是因为CopyIn的标量计算过多。提取一些重复标量计算,再试试。
发现错误的原因是 fp16.
99.56,99.5,x
99.44,99.4,x
74.8,74.75,x
并不是很密集,都是有些地方有误差。
明天:
- 解决精度问题。
- 实现最后一维度归约的并行。(对齐)
- 看看效果有没有提升。
- 性能分析。
- 重新收集数据,更密集的计算。
- 非对齐实现。
性能分析的结果:数据搬运的时间要远远小于计算时间。
所以可以考虑增加数据搬运的时间,降低计算时间。
归约最后一维度
Case1: Pass, Result: 13.55
Case2: Pass, Result: 172.85
Case3: Wrong answer
Case4: Pass, Result: 724587.46
Case5: Pass, Result: 64361.66
Case1: Pass, Result: 16.31
Case2: Pass, Result: 172.11
Case3: Wrong answer
Case4: Pass, Result: 614351.63
Case5: Pass, Result: 70522.96
Case1: Pass, Result: 14.08
Case2: Pass, Result: 230.86
Case3: Pass, Result: 10064.64
Case4: Pass, Result: 724589.36
Case5: Pass, Result: 64361.34
prof_sum: 799260.28
[0,1,2,3,4,5,6,7,8...n-1]
[n+0,n+1,n+2,n+3,n+4,n+5,n+6,n+7,n+8...2n-1]
[2n+0,2n+1,2n+2,2n+3,2n+4,2n+5,2n+6,2n+7,2n+8...3n-1]
...
[mn+0,mn+1,mn+2,mn+3,mn+4,mn+5,mn+6,mn+7,mn+8...mn-1]
[0,n,2n,3n,...mn]
[1,n+1,2n+1,3n+1,...mn+1]
[0, 1, 2, 3, ]
[0, 1, 2, 3, ]
[0, 1, 2, 3, ]
xn
[0, n, 2n, 3n, ]
[0, n, 2n, 3n, ]
[0, n, 2n, 3n, ]
+1,2,3...
[0, n, 2n, 3n, ]
[0+1, n+1, 2n+1, 3n+1, ]
[0+2, n+2, 2n+2, 3n+2, ]