
Extra overhead compared with torch's split #71173

Open

tianyuzhou668 opened this issue Feb 18, 2025 · 1 comment

@tianyuzhou668

Please ask your question

Hi, I was recently working with Paddle's split operation and found that, compared with torch, paddle launches an extra kernel and reaches only about 30% of torch's performance. My test script is as follows:

import paddle
from paddle import nn
import time


if __name__ == '__main__':

    paddle.set_device("gpu:0")

    x = paddle.ones(shape=[2 * 2048, 5120], dtype='float16')
    weight = paddle.ones(shape=[5120, 5120], dtype='float16')

    res = paddle.empty(shape=[2 * 2048, 5120], dtype='float16')
    for i in range(100):
        # Slice assignment into res: this is the step that shows the extra kernel in the profile.
        res[0 : 2048, :] = paddle._C_ops.linear(x[0 : 2048, :], weight, None)

When profiling with nsys, I see an extra Eigen::internal::EigenMetaKernel launch, whereas torch implements this through a device-to-device (d2d) copy; as a result this operation in paddle is about 30% slower than in torch. Is there a direction for fixing this, or a way to work around it?

[Screenshot: nsys timeline showing the extra Eigen::internal::EigenMetaKernel launch]
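
For a wall-clock comparison alongside the nsys trace, the same workload can also be timed directly. This is only a minimal sketch (not part of the original report), assuming paddle.device.cuda.synchronize() is available to flush queued GPU work before reading the clock:

import time
import paddle

paddle.set_device("gpu:0")

x = paddle.ones(shape=[2 * 2048, 5120], dtype='float16')
weight = paddle.ones(shape=[5120, 5120], dtype='float16')
res = paddle.empty(shape=[2 * 2048, 5120], dtype='float16')

# Warm up so one-time allocation/compilation costs stay out of the measurement.
for _ in range(10):
    res[0 : 2048, :] = paddle._C_ops.linear(x[0 : 2048, :], weight, None)

paddle.device.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    res[0 : 2048, :] = paddle._C_ops.linear(x[0 : 2048, :], weight, None)
paddle.device.cuda.synchronize()  # wait for all queued kernels before stopping the clock
print(f"slice-assignment loop: {time.perf_counter() - start:.4f} s")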

@xiaoguoguo626807
Contributor

This looks like an implementation issue in the kernel that the slice assignment dispatches to. You can try the following workaround and see how the performance looks:

linear_result = paddle._C_ops.linear(x[0 : 2048, :], weight, None)
res = paddle.concat([linear_result, res[2048:, :]], axis=0)

If the result of every loop iteration needs to go into res, try something like:

matmul_results = []
for i in range(xxx):
    linear_res = paddle._C_ops.linear(x[0 : 2048, :], weight, None)  # the linear result of this iteration, as in the original script
    matmul_results.append(linear_res)
res = paddle.stack(matmul_results, axis=0)
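
Filled out into a runnable form with the same shapes as the original script, the concat workaround might look like the following sketch (the timing harness mirrors the baseline above and is an addition for comparison, not part of the original suggestion):

import time
import paddle

paddle.set_device("gpu:0")

x = paddle.ones(shape=[2 * 2048, 5120], dtype='float16')
weight = paddle.ones(shape=[5120, 5120], dtype='float16')
res = paddle.empty(shape=[2 * 2048, 5120], dtype='float16')

def step_concat(res):
    # Compute the linear result into a fresh tensor, then rebuild res with
    # concat instead of assigning into a slice of it.
    linear_result = paddle._C_ops.linear(x[0 : 2048, :], weight, None)
    return paddle.concat([linear_result, res[2048:, :]], axis=0)

for _ in range(10):  # warm-up
    res = step_concat(res)
paddle.device.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    res = step_concat(res)
paddle.device.cuda.synchronize()
print(f"concat workaround loop: {time.perf_counter() - start:.4f} s")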
