Hi, while working with Paddle's split/slice operations I found that, compared with torch, Paddle launches an extra kernel, and performance is only about 30% of torch's. My test script is as follows:
```python
import paddle

if __name__ == '__main__':
    paddle.set_device("gpu:0")
    x = paddle.ones(shape=[2 * 2048, 5120], dtype='float16')
    weight = paddle.ones(shape=[5120, 5120], dtype='float16')
    res = paddle.empty(shape=[2 * 2048, 5120], dtype='float16')
    for i in range(100):
        res[0:2048, :] = paddle._C_ops.linear(x[0:2048, :], weight, None)
```
Profiling with nsys shows an extra Eigen::internal::EigenMetaKernel, whereas torch implements this with a d2d copy; Paddle's version of this operation is about 30% slower than torch's. Is there a direction for modifying this, or a way to work around it?
This is most likely an issue with the kernel implementation invoked by the slice assignment. You can try the following workaround and see how it performs:

```python
linear_result = paddle._C_ops.linear(x[0:2048, :], weight, None)
res = paddle.concat([linear_result, res[2048:, :]], axis=0)
```

If the result of every loop iteration needs to be accumulated into `res`, try collecting them and stacking once:

```python
matmul_results = []
for i in range(xxx):
    matmul_results.append(linear_res)
res = paddle.stack(matmul_results, axis=0)
```
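As a sanity check that the concat-based workaround is numerically equivalent to in-place slice assignment, here is a minimal NumPy sketch of the two patterns (NumPy and shrunken shapes stand in for the GPU tensors; `paddle.empty` is replaced by zeros so the untouched half is deterministic):

```python
import numpy as np

# Shrunken shapes standing in for the [2 * 2048, 5120] float16 tensors.
rows, cols = 8, 6
x = np.ones((2 * rows, cols), dtype=np.float16)
weight = np.ones((cols, cols), dtype=np.float16)

# Pattern 1: in-place slice assignment (what the original script does).
res_assign = np.zeros((2 * rows, cols), dtype=np.float16)
res_assign[0:rows, :] = x[0:rows, :] @ weight

# Pattern 2: compute the new block, then concatenate (the suggested workaround).
linear_result = x[0:rows, :] @ weight
bottom_half = np.zeros((rows, cols), dtype=np.float16)
res_concat = np.concatenate([linear_result, bottom_half], axis=0)

# Both patterns produce the same tensor.
assert np.array_equal(res_assign, res_concat)
```

The concat version trades the extra assignment kernel for a single copy into a fresh output buffer, which is closer to what torch's d2d copy does.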