FIX #6 by encoding UTF-8 into UTF-16LE #42

Merged 1 commit on Aug 18, 2018

keanpantraw (Contributor)

I ran into #6 on Ubuntu 16.04 with g++ 5.4.0.
Debugging shows that Tokenizer.Process(trainText) treats the entire text as a single sentence because the text gets encoded big-endian, so the 16-bit code units appear byte-swapped (for example, 3804 instead of 438).
First 20 characters on the machine where the bug reproduces:

[info] loading text
1b04
3804
4204
3204
3004
a00
1b04
3804
4204
3204
3004
103
2000
2800
2900
2c30
2000
3e04
4404
3804
4604
3804
[info] generating N-grams 1

First 20 characters on a machine without the bug:

[info] loading text
43b
438
442
432
430
a
43b
438
442
432
430
301
20
28
29
2c
20
43e
444
438
446
438
[info] generating N-grams 1
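
The two dumps look consistent with a byte-order mix-up: swapping the two bytes of each 16-bit value from the first dump yields ordinary Cyrillic and ASCII code points. A minimal Python sketch of that check follows. It is only an illustration (the project itself is C++), and `swap16` is a hypothetical helper, not something from the codebase.

```python
def swap16(unit: int) -> int:
    """Swap the high and low bytes of a 16-bit code unit."""
    return ((unit & 0xFF) << 8) | (unit >> 8)

# A few values copied from the dump on the machine with the bug.
buggy = [0x1B04, 0x3804, 0x4204, 0x3204, 0x3004, 0x0A00]

swapped = [swap16(u) for u in buggy]

# Show the swapped values as hex and as text.
print([hex(u) for u in swapped])
print(repr("".join(chr(u) for u in swapped)))
```

After the swap the first value is 0x41b (uppercase Л), while the working machine shows 0x43b (lowercase л). That would fit lowercasing also having no effect on byte-swapped data, though this is a guess from the dumps rather than something verified in the code.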

I can't reproduce this on the more recent Ubuntu 18.04 with g++ 7.3.0.
Explicitly converting the text to UTF-16LE fixes it.
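
To illustrate why naming the byte order explicitly matters, here is a small Python sketch. It is not the C++ change from this PR, just a demonstration under assumptions: the sample string and the `units_le` helper are made up for illustration. It encodes the same UTF-8 input as UTF-16LE and as UTF-16BE, then reads both results back as little-endian 16-bit units, roughly what a little-endian machine would see in memory.

```python
import struct

def units_le(data: bytes) -> list:
    """Interpret a byte string as little-endian 16-bit code units."""
    return list(struct.unpack(f"<{len(data) // 2}H", data))

# Sample UTF-8 input, chosen for illustration only.
utf8_input = "лит".encode("utf-8")
text = utf8_input.decode("utf-8")

le_units = units_le(text.encode("utf-16-le"))  # explicit little-endian
be_units = units_le(text.encode("utf-16-be"))  # big-endian bytes misread as LE

print([hex(u) for u in le_units])  # ['0x43b', '0x438', '0x442']
print([hex(u) for u in be_units])  # ['0x3b04', '0x3804', '0x4204']
```

The first list matches the pattern in the dump from the working machine, the second the pattern from the machine with the bug.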

bakwc (Owner) commented Aug 18, 2018

Thank you! Great job!

bakwc merged commit 02a6f32 into bakwc:master on Aug 18, 2018