Supports Chinese Docstring #6281

Aruelius · 2023-11-01T08:01:10Z

\w+ matches only one or more ASCII word characters (equivalent to [a-zA-Z0-9_]+), so when the docstring text is Chinese (or another non-ASCII script) it is not matched.

microsoft/pylance-release#4840
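A minimal sketch of the behavior described above, using the pattern from the original line. The helper name `looksLikeArgument` is hypothetical and exists only for this illustration; it is not part of pyright.

```typescript
// Pattern copied from the original line in docStringConversion.ts.
const asciiOnly = /^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g;

// Hypothetical helper for illustration only; not part of pyright.
function looksLikeArgument(line: string): boolean {
    // String.prototype.match with a /g regex returns null when nothing matches.
    return line.match(asciiOnly) !== null;
}

console.log(looksLikeArgument("a: int"));  // true
console.log(looksLikeArgument("a: 中文")); // false: \w does not match 中 or 文
```

The second line fails because, after the colon and whitespace, the pattern requires at least one ASCII word character.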

erictraut · 2023-11-02T18:53:55Z

packages/pyright-internal/src/analyzer/docStringConversion.ts

@@ -589,7 +589,7 @@

        // catch-all for styles except reST
        const hasArguments =
-            !line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g);
+            !line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:[\s\S]*/g);


@Aruelius, note the issue detected above. This will need to either be addressed or an explanation given for why it's OK.

github-actions · 2023-11-01T08:15:15Z

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅


debonte · 2023-11-01T13:54:07Z

packages/pyright-internal/src/analyzer/docStringConversion.ts

@@ -589,7 +589,7 @@ class DocStringConverter {

        // catch-all for styles except reST
        const hasArguments =
-            !line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g);
+            !line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*[\u4e00-\u9fa5\w+]*/g);


@bschnurr, do we have existing unit tests that we can expand to cover this?

@bschnurr, any thoughts here?
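One property of this revision that a unit test might surface: the trailing character class is followed by `*`, so it may match zero characters, and the `+` inside the brackets is a literal plus sign rather than a quantifier. The sketch below compares the revised pattern with a copy that drops the class entirely. The helper `matches` and the sample lines are illustrative assumptions, not taken from pyright.

```typescript
// Pattern from the second revision of this PR (commit 2defb9e).
const revised = /^\s*.*?\w+(\s*\(.*?\))*\s*:\s*[\u4e00-\u9fa5\w+]*/g;
// Same pattern with the trailing character class removed (for comparison only).
const withoutClass = /^\s*.*?\w+(\s*\(.*?\))*\s*:\s*/g;

// Hypothetical helper for illustration only.
function matches(pattern: RegExp, line: string): boolean {
    return line.match(pattern) !== null;
}

const samples = ["a: 中文", "a: <>", "a: の", "no colon here"];
for (const line of samples) {
    // Both patterns agree on every sample: true, true, true, false.
    console.log(line, matches(revised, line), matches(withoutClass, line));
}
```

Under these assumptions the class does not affect whether a line is accepted, so the revision behaves like "a word followed by a colon" for this check.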

debonte · 2023-11-01T13:54:47Z

packages/pyright-internal/src/analyzer/docStringConversion.ts

@@ -589,7 +589,7 @@ class DocStringConverter {

        // catch-all for styles except reST
        const hasArguments =
-            !line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g);
+            !line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*[\u4e00-\u9fa5\w+]*/g);


Is this really the only regex that needs to be updated for us to properly support Chinese characters? I understand that it fixes this one scenario, but are there others we should update at the same time?

Hi, this one scenario is the biggest issue for me. For now, I have to add the \n newline character after the docstring:
a: 中文\n
I think it needs to be fixed. Thank you for your help; it's very helpful to me.

Co-authored-by: Erik De Bonte <[email protected]>

heejaechang · 2023-11-01T17:17:16Z

packages/pyright-internal/src/analyzer/docStringConversion.ts

@@ -589,7 +589,7 @@ class DocStringConverter {

        // catch-all for styles except reST
        const hasArguments =
-            !line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\w+/g);
+            !line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\w+(\s*\(.*?\))*\s*:\s*\p{L}+/gu);


We probably want to add some tests covering the issue this fixes, to prevent regressions in the future.

That would be good, thank you.

@Aruelius, are you planning to add tests to this PR?
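A rough sketch of what such a regression check could look like, assuming the \p{L} pattern suggested in this thread. The function `hasArguments` here is a hypothetical stand-in that mirrors the expression in the diff; it is not pyright's test code, and the sample lines are made up for illustration.

```typescript
// Pattern suggested in the review above, built with the RegExp constructor
// so that the `u` flag (required for \p{L}) is explicit.
const unicodeLetters = new RegExp(
    "^\\s*.*?\\w+(\\s*\\(.*?\\))*\\s*:\\s*\\p{L}+",
    "gu"
);

// Hypothetical stand-in for the hasArguments expression; not pyright's code.
function hasArguments(line: string): boolean {
    return !line.endsWith(":") && !line.endsWith("::") && line.match(unicodeLetters) !== null;
}

// Each entry pairs an input line with the expected result.
const cases: Array<[string, boolean]> = [
    ["a: int", true],
    ["a: 中文", true],
    ["param (int): 说明", true],
    ["Args:", false],      // section header, ends with ':'
    ["Example::", false],  // reST literal block marker
    ["no colon here", false],
];

for (const [line, expected] of cases) {
    if (hasArguments(line) !== expected) {
        throw new Error(`unexpected result for ${JSON.stringify(line)}`);
    }
}
```

A table of inputs like this makes it cheap to add further scripts or edge cases later.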

I actually have a fix. I used this line:
!line?.endsWith(':') && !line?.endsWith('::') && !!line.match(/^\s*.*?\S+(\s*\(.*?\))*\s*:\s*\S+/g);

@bschnurr, do you mean that you're going to update this PR? Or that you're going to fix the problem in a separate PR and we should close this one?

\S includes non-letter characters like <, >, +, $, ,, ', etc., whereas \p{L} includes only characters that Unicode classifies as letters. Is \S what we want? I'm not familiar with the docstring format requirements, but going from \w to \S seems strange to me unless we should always have been including symbol characters.

Btw, my initial suggestion of \p{L} only includes letters, not numbers. So if we wanted to use that approach, I believe [\p{L}\p{N}] would be better.
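To make the trade-off concrete, here is a small comparison of the three candidates for the part after the colon. Sharing one prefix across all three is an assumption made to isolate that difference, and the helper `accepts` is hypothetical.

```typescript
// Only the part after the colon differs between these candidates.
// The shared prefix is an assumption made to isolate that difference.
const prefix = "^\\s*.*?\\w+(\\s*\\(.*?\\))*\\s*:\\s*";
const nonSpace = new RegExp(prefix + "\\S+");
const lettersOnly = new RegExp(prefix + "\\p{L}+", "u");
const lettersAndNumbers = new RegExp(prefix + "[\\p{L}\\p{N}]+", "u");

// Hypothetical helper for illustration only.
function accepts(pattern: RegExp, line: string): boolean {
    return pattern.test(line); // no /g flag, so test() keeps no state
}

// line        \S+     \p{L}+   [\p{L}\p{N}]+
// "a: 中文"   true    true     true
// "a: 42"     true    false    true
// "a: <b>"    true    false    false
for (const line of ["a: 中文", "a: 42", "a: <b>"]) {
    console.log(line, accepts(nonSpace, line), accepts(lettersOnly, line), accepts(lettersAndNumbers, line));
}
```

The rows show the distinction raised above: \p{L} alone rejects a description that starts with a digit, while \S accepts a description that starts with a symbol.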

I thought using \S would be the most flexible when it comes to how users name things.

bschnurr · 2023-11-02T19:54:39Z

Closing. New PR here with a test: #6307
Thank you.

Supports Chinese Docstring

a2d537a

microsoft/pylance-release#4840

erictraut requested review from heejaechang, debonte and rchiodo November 1, 2023 08:04

heejaechang requested a review from bschnurr November 1, 2023 08:06

github-advanced-security bot found potential problems Nov 1, 2023


Replace [\s\S]* with \s*[\u4e00-\u9fa5\w+]*

2defb9e

debonte reviewed Nov 1, 2023


Proper regex to handle Unicode letters

aa0aac5

Co-authored-by: Erik De Bonte <[email protected]>

heejaechang reviewed Nov 1, 2023


bschnurr closed this Nov 2, 2023

Aruelius deleted the patch-1 branch November 4, 2023 11:50
