Incorrect handling of double-width characters #14

Open
ulidtko opened this issue Oct 20, 2015 · 4 comments

@ulidtko

ulidtko commented Oct 20, 2015

Take this TSV:

date        |P東       |東 score  |P南       |南 score  |P西       |西 score  |P北       |北 score  |comment
2015-04-04  john    35100       bob     32100       mary    12000       katy    20800
2015-04-04  mary    33500       bob     49500       katy    21600       john    -4600

It looks aligned in ST+elastic tabstops, complete with column headers. But in any other text viewer (less or this Markdown view above) the column headers are not aligned, because an extra space is inserted between the double-width characters 東南西北 and the following tab separator.

For clarity, I'll visualize the whitespace characters involved:

date······↦   |P東····↦   |東·score↦   |P南····↦   |南·score↦   |P西····↦   |西·score↦   |P北····↦   |北·score↦   |comment
2015-04-04↦   john···↦   35100···↦   bob····↦   32100···↦   mary···↦   12000···↦   katy···↦   20800
2015-04-04↦   mary···↦   33500···↦   bob····↦   49500···↦   katy···↦   21600···↦   john···↦   -4600

In a fixed-width environment like a terminal (e.g. less), the string |P東 takes 4 character places to render (even though it's a 3-character string: |, P, 東). This is exactly the width that the john and mary cells have. But, and this is the bug, john and mary have 3 U+20's after them, while |P東 has 4. This is what breaks alignment in monospace viewers that aren't elastic-tabstop-aware.

Conceptually, this is easily fixed by using "em width" (which is 1 or 2 for a character C where unicodedata.east_asian_width(C)=='Na' or unicodedata.east_asian_width(C)=='W', respectively) instead of the plain character count when computing the number of spaces that the plugin inserts for compatibility alignment.
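A minimal sketch of that idea in Python, for illustration only. The helper names `display_width` and `pad_cell` are hypothetical and are not taken from the plugin. Counting 'F' (fullwidth) as two cells alongside 'W' is an assumption added here, and combining or ambiguous-width characters are not handled.

```python
import unicodedata


def display_width(text):
    """Approximate number of terminal cells `text` occupies.

    Hypothetical helper: 'W' (wide) and, as an added assumption,
    'F' (fullwidth) characters count as 2 cells; everything else as 1.
    Combining marks and ambiguous-width characters are not special-cased.
    """
    width = 0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


def pad_cell(text, target_width):
    """Pad `text` with spaces up to `target_width` cells (not characters)."""
    return text + " " * max(0, target_width - display_width(text))


# "|P東" is 3 characters long but occupies 4 cells, the same as "john",
# so both cells should receive the same number of padding spaces.
header = pad_cell("|P東", 7)
row = pad_cell("john", 7)
```

With padding computed this way, `|P東` and `john` both get 3 trailing spaces, which is what a fixed-width viewer needs in the example above.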

Whew. I do realize that this report is futile, but still, it's here for the record.

@adzenith
Member

Actually I think this might be one of the few reports that's not futile. It wouldn't be too hard to figure out the character width of each character if Unicode will tell you like that. My one concern is that in Sublime Text, double-width characters aren't quite double-width. It would probably work fine if you just had one or two and wide tabs, but with a large number of characters there may be an offset.

Do you feel comfortable hacking Python? Want to give it a shot? Or I can look at it sometime soon.

@ulidtko
Author

ulidtko commented Oct 21, 2015

I think the offset is not a problem, since ST's double-width characters take up slightly less than two places, and that should be perfectly compensated by the increased width of the following tab. It'd be a problem if wide characters took more space :)

I might give it a try, should be simple...

@adzenith
Member

Right, it would only be a problem if you had quite a few characters in a row combined with a relatively narrow tab width: at a certain point you might be able to lose enough space so that you end up at an earlier tabstop.
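To make that failure mode concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the rendered width of 1.8 cells per wide character is a made-up figure, not a measurement of Sublime Text, and the functions `next_tabstop` and `lands_on_earlier_tabstop` are hypothetical and do not model the plugin's actual padding logic.

```python
import math


def next_tabstop(pos, tab_width):
    """Position of the first tabstop strictly after `pos` (pos may be fractional)."""
    return (math.floor(pos / tab_width) + 1) * tab_width


def lands_on_earlier_tabstop(n_wide, n_narrow, n_spaces, tab_width, wide_render=1.8):
    """True if the editor's tab would stop earlier than the terminal's.

    Assumptions: a terminal draws each wide character at exactly 2 cells,
    while the editor draws it at `wide_render` cells (1.8 is a guess).
    """
    terminal_end = 2 * n_wide + n_narrow + n_spaces
    editor_end = wide_render * n_wide + n_narrow + n_spaces
    return next_tabstop(editor_end, tab_width) < next_tabstop(terminal_end, tab_width)


# Ten wide characters plus one space: the editor loses about 2 cells in total.
narrow_tabs = lands_on_earlier_tabstop(10, 0, 1, tab_width=4)
wide_tabs = lands_on_earlier_tabstop(10, 0, 1, tab_width=8)
```

Under these assumed numbers the accumulated shortfall crosses a tabstop boundary with a tab width of 4 but not with a tab width of 8, which matches the intuition that many wide characters plus narrow tabs is the risky combination.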

Give it a shot and let me know how it goes! I'm excited to see the PR :)

ulidtko added a commit to ulidtko/ElasticTabstops that referenced this issue Oct 23, 2015
@ulidtko
Author

ulidtko commented Oct 23, 2015

Well @adzenith, turns out you were quite right: I couldn't get this to work with tab width less than 5!

But otherwise, I'm quite satisfied with the result. Looks excellent in the terminal, what more could one want :)

PR incoming, any comments welcome
