Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supports Chinese Docstring #6281

Closed
wants to merge 3 commits into from
Closed

Supports Chinese Docstring #6281

wants to merge 3 commits into from

Conversation

Aruelius
Copy link

@Aruelius Aruelius commented Nov 1, 2023

\w+ only matches one or more word characters (same as [a-zA-Z0-9_]+), when the docstring is chinese (or other) it will not be matched.

microsoft/pylance-release#4840

@@ -589,7 +589,7 @@

// catch-all for styles except reST
const hasArguments =
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g);
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:[\s\S]*/g);

Check failure

Code scanning / CodeQL

Inefficient regular expression High

This part of the regular expression may cause exponential backtracking on strings starting with '0(' and containing many repetitions of ')('.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aruelius, note the issue detected above. This will need to either be addressed or an explanation given for why it's OK.

Copy link
Contributor

github-actions bot commented Nov 1, 2023

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@@ -589,7 +589,7 @@ class DocStringConverter {

// catch-all for styles except reST
const hasArguments =
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g);
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*[\u4e00-\u9fa5\w+]*/g);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bschnurr, do we have existing unit tests that we can expand to cover this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bschnurr, any thoughts here?

@@ -589,7 +589,7 @@ class DocStringConverter {

// catch-all for styles except reST
const hasArguments =
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g);
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*[\u4e00-\u9fa5\w+]*/g);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really the only regex that needs to be updated for us to properly support Chinese characters? I understand that it fixes this one scenario, but are there others we should update at the same time?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this one scenario is the biggest issue for me, for now, I have to add the \n newline character after docstring.
a: 中文\n
And I think it need be fixed, I thank for you help, it's very helpful for me.

@@ -589,7 +589,7 @@ class DocStringConverter {

// catch-all for styles except reST
const hasArguments =
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g);
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\p{L}+/gu);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably we want to add some tests covering issue it is fixing to prevent regressions in future?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that will be good, thank you.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aruelius, are you planning to add tests to this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually have a fix.. i used this line
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\S+(\s*\(.*?\))*\s*:\s*\S+/g);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bschnurr, do you mean that you're going to update this PR? Or that you're going to fix the problem in a separate PR and we should close this one?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\S includes non-letter characters like <, >, +, $, ,, ', etc, whereas \p{L} only includes characters that Unicode says are letters. Is \S what we want? I'm not familiar with the docstring format requirements, but going from \w to \S seems strange to me unless we should always have been including symbol characters.

Btw, my initial suggestion of \p{L} only includes letters, not numbers. So if we wanted to use that approach, I believe [\p{L}\p{N}] would be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought using \S would be the most flexible when it comes to user naming stuff

@bschnurr
Copy link
Member

bschnurr commented Nov 2, 2023

Closing. new PR here with a test. #6307
thank you

@bschnurr bschnurr closed this Nov 2, 2023
@Aruelius Aruelius deleted the patch-1 branch November 4, 2023 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants