encoding error #8
Comments
I think this has to be regarded as a text processing error in the preparation of the PropBank materials. The word has a U+00AD soft hyphen character in it. This is a valid Unicode character. It's not an encoding error, and it is most definitely not a space character. I think the only two good choices are either to preserve the original as a single token, or to decide that you don't want to deal with soft hyphen characters and delete it, leaving one token.
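To make the two options concrete, here is a minimal Python sketch. The position of the soft hyphen inside the word and the helper name `strip_soft_hyphens` are assumptions for illustration only; they are not taken from the PropBank or UD data.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(token: str) -> str:
    """Hypothetical helper: delete soft hyphens, leaving a single token."""
    return token.replace(SOFT_HYPHEN, "")


# Assumed example: the exact position of the soft hyphen in the
# original word is a guess made for illustration.
word = "basi" + SOFT_HYPHEN + "cally"

# U+00AD is a format character (category Cf), not whitespace.
print(unicodedata.category(SOFT_HYPHEN))
print(SOFT_HYPHEN.isspace())

# Option 1: preserve the original. Whitespace tokenization keeps one token.
print(len(word.split()))

# Option 2: delete the soft hyphen, which still leaves one token.
print(strip_soft_hyphens(word))
```

Either way the result is a single token; splitting at the soft hyphen is the one outcome neither option produces.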
Regardless of the decision in UniversalDependencies/UD_English-EWT#83, the data here must be compatible with it. So the original https://catalog.ldc.upenn.edu/LDC2012T13 data contains Can we fix the 20111107175720AAlb2TB_ans.xml.gold_skel?
I would absolutely love for the "stand-off" PropBank EWT data to be switched over to point to English UD -- removing the reliance on LDC2012T13 would let us fix all of these issues easily (and people have easier access to the data). As long as PropBank is based on LDC2012T13, it's a pain to do any of these fixes (and any LDC update could take years).
Oh, yes, please! That would be terrific. I don’t have a real PB master like Tim working at CU anymore but I have a student who just graduated with an MS in Computational Linguistics and has some experience with moving PB mappings from Treebank parsers to UD. He would need supervision but if it would be helpful I am happy to volunteer him to help with this.
On Nov 11, 2020, at 3:54 PM, timjogorman wrote:
I would absolutely love for the "stand-off" PropBank EWT data to be switched over to point to English UD -- removing the reliance on LDC2012T13 would let us fix all of these issues easily (and people have easier access to the data). As long as PropBank is based on LDC2012T13, it's a pain to do any of these fixes (and any LDC update could take years).
See UniversalDependencies/UD_English-EWT#83
The UD treebank preserved the word "basically" as a single token, regardless of the encoding error. But the PropBank data broke it into two tokens.