encoding error #8

Open
arademaker opened this issue Jan 31, 2020 · 4 comments
@arademaker
Contributor

See UniversalDependencies/UD_English-EWT#83

The UD treebank kept the word basically as a single token, regardless of the encoding error, but the PropBank data split it into two tokens:

google/ewt/answers/00/20111107175720AAlb2TB_ans.xml  14   16        basic    GW            (S(ADVP*         -            -        *   (ARGM-ADV*             *             *
google/ewt/answers/00/20111107175720AAlb2TB_ans.xml  14   17         ally    RB                   *)        -            -        *            *)            *             *
@manning

manning commented Oct 30, 2020

I think this has to be regarded as a text-processing error in the preparation of the PropBank materials.

The word has a U+00AD soft hyphen character in it. This is a valid Unicode character. It's not an encoding error, and it is most definitely not a space character.

I think there are only two good choices: either preserve the original as a single token, or decide that you don't want to deal with soft hyphen characters and delete the character, leaving the single token basically. This is just a processing mistake.
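For illustration, a minimal Python sketch of the two options above. The helper name `strip_soft_hyphen` is hypothetical and not taken from any PropBank or UD tooling; the sketch only shows that U+00AD is a format character rather than whitespace, so whitespace tokenization keeps the word whole.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphen(token: str) -> str:
    """Hypothetical helper: option 2, delete soft hyphens from a token."""
    return token.replace(SOFT_HYPHEN, "")


word = "basic" + SOFT_HYPHEN + "ally"

# U+00AD is in Unicode general category Cf (format), not a space separator.
category = unicodedata.category(SOFT_HYPHEN)
is_space = SOFT_HYPHEN.isspace()

# Option 1: whitespace splitting leaves the original as a single token.
tokens_preserved = word.split()

# Option 2: drop the soft hyphen and keep one plain token.
token_cleaned = strip_soft_hyphen(word)
```

Either way the result is one token; a split into basic and ally only happens if the character is wrongly treated as a separator.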

@arademaker
Contributor Author

Regardless of the decision in UniversalDependencies/UD_English-EWT#83, the data here must be compatible with it, so basic<U+00AD>ally or basically needs to be a single token.

The original https://catalog.ldc.upenn.edu/LDC2012T13 data contains (ADVP (GW basic) (RB ally)). But I am assuming that fixing the LDC data is hard.

Can we fix the 20111107175720AAlb2TB_ans.xml.gold_skel?
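As a rough sketch of what such a fix could look like, the following Python joins a GW fragment with the token that follows it. It assumes that GW marks a word fragment that belongs with the next token, and that the merged token should take the tag of the last piece; the function `merge_gw` and the `(form, pos)` tuple layout are hypothetical and do not reflect the actual gold_skel format.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (form, pos); an assumed, simplified layout


def merge_gw(tokens: List[Token]) -> List[Token]:
    """Join each run of GW fragments with the token that follows it."""
    merged: List[Token] = []
    pending = ""  # text of GW fragments waiting for their final piece
    for form, pos in tokens:
        if pos == "GW":
            pending += form
            continue
        # The merged token keeps the tag of its last piece (assumption).
        merged.append((pending + form, pos))
        pending = ""
    if pending:
        # A trailing GW with nothing after it is kept unchanged.
        merged.append((pending, "GW"))
    return merged


fixed = merge_gw([("basic", "GW"), ("ally", "RB")])
```

Merging reduces the token count, so any stand-off annotation that refers to later token indices in the same sentence would need to be shifted accordingly.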

@timjogorman
Member

I would absolutely love for the "stand-off" PropBank EWT data to be switched over to point to English UD -- removing the reliance on LDC2012T13 would let us fix all of these issues easily (and people have easier access to the data). As long as PropBank is based on LDC2012T13, it's a pain to do any of these fixes (and any LDC update could take years).

@MarthaSPalmer

MarthaSPalmer commented Nov 12, 2020 via email
