FIX #6 by encoding UTF-8 into UTF-16LE #42

keanpantraw · 2018-08-18T17:13:33Z

I ran into #6 on Ubuntu 16.04 with g++ 5.4.0.
Debugging shows that Tokenizer.Process(trainText) treats the entire text as a single sentence, because the decoded text ends up byte-swapped (big-endian).
First characters, printed in hex, on the machine where the bug reproduces:

[info] loading text
1b04
3804
4204
3204
3004
a00
1b04
3804
4204
3204
3004
103
2000
2800
2900
2c30
2000
3e04
4404
3804
4604
3804
[info] generating N-grams 1

The same characters, printed in hex, on a machine without the bug:

[info] loading text
43b
438
442
432
430
a
43b
438
442
432
430
301
20
28
29
2c
20
43e
444
438
446
438
[info] generating N-grams 1
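
The relation between the two dumps can be checked with a small sketch. This is an illustration only, not code from this PR; `SwapBytes` and `SwapAll` are hypothetical helper names. Swapping the two bytes of a 16-bit value from the first dump gives the value at the same position in the second dump (for example `1b04` becomes `41b`). The one entry that does not fit this pattern is `2c30`, where `2c00` would be expected.

```cpp
#include <cstdint>
#include <vector>

// Swap the two bytes of a 16-bit code unit, e.g. 0x1b04 -> 0x041b.
constexpr std::uint16_t SwapBytes(std::uint16_t unit) {
    return static_cast<std::uint16_t>((unit << 8) | (unit >> 8));
}

// Apply the swap to a whole sequence of code units.
std::vector<std::uint16_t> SwapAll(const std::vector<std::uint16_t>& units) {
    std::vector<std::uint16_t> out;
    out.reserve(units.size());
    for (std::uint16_t u : units) {
        out.push_back(SwapBytes(u));
    }
    return out;
}

// Values taken from the two dumps above, checked at compile time.
static_assert(SwapBytes(0x1b04) == 0x041b, "1b04 <-> 43b");
static_assert(SwapBytes(0x0a00) == 0x000a, "a00 <-> a (line feed)");
static_assert(SwapBytes(0x0103) == 0x0301, "103 <-> 301");
```

With the swapped values the tokenizer never sees a real line feed (`a`) or space (`20`), which is consistent with the whole text being treated as one sentence.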

I can't reproduce this on the more recent Ubuntu 18.04 with g++ 7.3.0.
The fix is to convert the text to UTF-16LE explicitly instead of relying on the default byte order.
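
As a rough illustration of what an explicit UTF-8 to UTF-16LE conversion involves, here is a minimal hand-written sketch. It is not the code in this PR, and the names `DecodeUtf8`, `AppendLE` and `Utf8ToUtf16LE` are hypothetical. It assumes well-formed input and does only minimal validation (it does not reject overlong forms or encoded surrogates). The point is that the output byte order is fixed by the code rather than by the compiler or standard library version.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Decode one UTF-8 sequence starting at text[pos] and advance pos past it.
// Throws on a bad lead byte, a bad continuation byte, or truncated input.
std::uint32_t DecodeUtf8(const std::string& text, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(text[pos++]);
    std::uint32_t cp = 0;
    int extra = 0;
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::runtime_error("invalid UTF-8 lead byte");
    for (; extra > 0; --extra) {
        if (pos >= text.size()) throw std::runtime_error("truncated UTF-8");
        const unsigned char next = static_cast<unsigned char>(text[pos++]);
        if ((next & 0xC0) != 0x80) {
            throw std::runtime_error("invalid UTF-8 continuation byte");
        }
        cp = (cp << 6) | (next & 0x3F);
    }
    return cp;
}

// Append one 16-bit code unit as two bytes, low byte first (little-endian).
void AppendLE(std::vector<std::uint8_t>& out, std::uint16_t unit) {
    out.push_back(static_cast<std::uint8_t>(unit & 0xFF));
    out.push_back(static_cast<std::uint8_t>(unit >> 8));
}

// Convert UTF-8 text to UTF-16LE bytes, independent of host byte order.
std::vector<std::uint8_t> Utf8ToUtf16LE(const std::string& text) {
    std::vector<std::uint8_t> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::uint32_t cp = DecodeUtf8(text, pos);
        if (cp > 0x10FFFF) throw std::runtime_error("code point out of range");
        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            AppendLE(out, static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            AppendLE(out, static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            AppendLE(out, static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}
```

For example, `Utf8ToUtf16LE("\xD0\x9B")` (U+041B, the first character in the dumps) yields the two bytes `0x1B, 0x04`, low byte first. A little-endian host reads those bytes back as the code unit `0x041B`.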

bakwc · 2018-08-18T22:48:54Z

Thank you! Great job!

FIX bakwc#6 by encoding UTF-8 into UTF-16LE

47c1810

keanpantraw force-pushed the master branch from 6d21370 to 47c1810 on August 18, 2018 17:26

bakwc merged commit 02a6f32 into bakwc:master Aug 18, 2018
