Ontonotes raw content #15
All of the ON files came from LDC originally. They might have them.
Martha
On Oct 4, 2021, at 7:57 AM, Alexandre Rademaker ***@***.***> wrote:
EWT dataset does contain the raw content before tokenization. I suppose that allows the UD version to obtain the source string to add to the sentences metadata (see UniversalDependencies/UD_English-EWT#252).
What about the Ontonotes? The .onf files are the only ones with the plain sentence, but even these sentences seem tokenized. Do we have the actual source content of OntoNotes sentences somewhere?
|
Thank you @MarthaSPalmer, but no, they don't have the raw. See
|
I'm sorry. We only worked with what we got from LDC.
Martha
On Oct 5, 2021, at 2:59 PM, Alexandre Rademaker ***@***.***> wrote:
Thank you @MarthaSPalmer, but no, they don't have the raw. See ./wb/sel/16/sel_1677.onf:
Plain sentence:
---------------
Your ' answer ' did n't address the specific question because you never did return one to walmart remember, you refused
to shop there and pay for my return privledges ?
Treebanked sentence:
--------------------
Your ' answer ' did n't address the specific question because you never did return one to walmart *PRO* remember , you
refused *PRO*-1 to shop there and pay for my return privledges ?
|
@arademaker your concrete example can be found in the metadata/context directory. Try to
The LDC distribution contains only the subfolder
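@arademaker if the OntoNotes sentences are only tokenised (with no token editing), it should be easy to find each sentence with a regular expression. Pseudocode example:

    import re
    from typing import List

    no_trace_tokens: List[str]  # sentence tokens with trace tokens such as *PRO* removed
    raw_text: str               # raw document text

    # Cheap pre-check: every token must occur somewhere in the raw text.
    if all(t in raw_text for t in no_trace_tokens):
        # Candidate text found; now search for the sentence itself.
        # Escape each token so characters like '?' match literally.
        sentence_regex = r'\s*'.join(re.escape(t) for t in no_trace_tokens)
        match = re.search(sentence_regex, raw_text)

Yes, there is some work to be done, but I can only propose a solution to your problem.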
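The regular-expression idea above can be sketched as a small runnable function. This is only an illustration under stated assumptions: the helper names `is_trace` and `find_sentence` are hypothetical, the trace pattern (tokens such as `*PRO*` or `*PRO*-1`) is inferred from the single example in this thread and may miss other empty elements, and the `raw_text` string is made up for the demo rather than taken from any LDC file.

```python
import re
from typing import List, Optional

# Assumption: trace tokens look like *PRO*, *PRO*-1 or *T*-2. Other empty
# elements in the treebank may need extra rules.
TRACE_RE = re.compile(r"\*[^*\s]+\*(?:-\d+)?")


def is_trace(token: str) -> bool:
    """Return True for tokens that look like treebank traces."""
    return TRACE_RE.fullmatch(token) is not None


def find_sentence(tokens: List[str], raw_text: str) -> Optional[str]:
    """Return the span of raw_text matching the tokens, or None if not found."""
    no_trace_tokens = [t for t in tokens if not is_trace(t)]
    if not no_trace_tokens:
        return None
    # Escape each token so punctuation is matched literally, and allow
    # optional whitespace between consecutive tokens.
    pattern = r"\s*".join(re.escape(t) for t in no_trace_tokens)
    match = re.search(pattern, raw_text)
    return match.group(0) if match else None


# Hypothetical raw text, invented for this demo.
raw_text = "Re: returns\nYour 'answer' didn't address the specific question."
tokens = ["Your", "'", "answer", "'", "did", "n't", "address",
          "the", "specific", "question"]
print(find_sentence(tokens, raw_text))
```

Because the tokens are joined with `\s*`, the pattern matches whether or not the raw text has whitespace at a token boundary, which is what lets `did` + `n't` line up with `didn't`. It would not cope with tokens that were edited or normalised during tokenisation. Those cases would need a fuzzier alignment.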
EWT dataset does contain the raw content before tokenization. I suppose that allows the UD version to obtain the source string to add to the sentences metadata (see UniversalDependencies/UD_English-EWT#252).
What about the Ontonotes? The .onf files are the only ones with the plain sentence, but even these sentences seem tokenized. Do we have the actual source content of OntoNotes sentences somewhere?