Ontonotes raw content #15

arademaker · 2021-10-04T13:56:43Z

The EWT dataset does contain the raw content before tokenization. I suppose that allows the UD version to obtain the source string to add to the sentence metadata (see UniversalDependencies/UD_English-EWT#252).

What about OntoNotes? The .onf files are the only ones with the plain sentence, but even these sentences seem tokenized. Do we have the actual source content of OntoNotes sentences somewhere?

MarthaSPalmer · 2021-10-04T14:51:37Z

All of the ON files came from LDC originally. They might have them. Martha

arademaker · 2021-10-05T20:59:31Z

Thank you @MarthaSPalmer, but no, they don't have the raw text. See ./wb/sel/16/sel_1677.onf:

Plain sentence:
---------------
    Your ' answer ' did n't address the specific question because you never did return one to walmart remember, you refused
    to shop there and pay for my return privledges ?

Treebanked sentence:
--------------------
    Your ' answer ' did n't address the specific question because you never did return one to walmart *PRO* remember , you
    refused *PRO*-1 to shop there and pay for my return privledges ?

MarthaSPalmer · 2021-10-05T23:03:04Z

I'm sorry. We only worked with what we got from LDC. Martha

V3RGANz · 2022-08-06T18:06:19Z

@arademaker your concrete example can be found in the metadata/context directory. Try grep -r /path/to/dir -e 'text' with a substring of the example you are looking for.
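
If a script is more convenient than grep, the same search can be sketched in Python. The helper name files_containing and the choice to decode files as UTF-8 while ignoring undecodable bytes are assumptions made for illustration; nothing in the OntoNotes release prescribes them.

```python
from pathlib import Path
from typing import List


def files_containing(root: str, needle: str) -> List[Path]:
    """Return the files under root whose text contains needle (hypothetical helper)."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Assumption: files decode as UTF-8; undecodable bytes are skipped.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in text:
                hits.append(path)
    return hits
```

Calling files_containing('/path/to/dir', 'privledges') would then list the candidate context files, much like the grep call above.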

arademaker · 2022-08-08T15:41:30Z

The LDC distribution contains only the subfolder wb in the ontonotes-release-5.0/data/files/data/english/metadata/context folder. Moreover, the <text> tag contains the raw text of the whole document, not the raw text of the individual sentences... recovering the sentence split would be extra hard work.

V3RGANz · 2022-08-08T16:09:48Z

@arademaker if OntoNotes sentences are only tokenised (without any token editing), it should be easy to find a sentence with a regular expression.

pseudocode example:

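import re
from typing import List, Optional

def find_sentence(no_trace_tokens: List[str], raw_text: str) -> Optional[re.Match]:
    # Cheap pre-check: every token must occur somewhere in the candidate text.
    if not all(t in raw_text for t in no_trace_tokens):
        return None
    # Escape each token (e.g. '?') and allow optional whitespace between tokens.
    sentence_regex = r'\s*'.join(re.escape(t) for t in no_trace_tokens)
    return re.search(sentence_regex, raw_text)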

Yes, there is some work to be done, but I can only propose a solution to your problem.
About wb: does pre-tokenisation occur in some other corpus?
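
To make the idea concrete, here is a minimal sketch applied to the second half of the treebanked sentence quoted earlier in this thread. Two things are assumptions: the raw string below is made up for illustration (it is not taken from the LDC files), and treating every token that starts with * as a trace is only a heuristic that happens to cover *PRO* and *PRO*-1. The function names are hypothetical too.

```python
import re
from typing import List, Optional, Tuple


def strip_traces(tokens: List[str]) -> List[str]:
    """Drop tokens that look like empty elements such as *PRO* (heuristic)."""
    return [t for t in tokens if not t.startswith("*")]


def locate(tokens: List[str], raw_text: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) character span of the sentence in raw_text, or None."""
    # Escape every token and allow any amount of whitespace between tokens.
    pattern = r"\s*".join(re.escape(t) for t in strip_traces(tokens))
    match = re.search(pattern, raw_text)
    return match.span() if match else None


treebanked = "you refused *PRO*-1 to shop there and pay for my return privledges ?".split()
# Hypothetical raw text, invented for this example.
raw = "Well, you refused to shop there and pay for my return privledges? Enough said."

span = locate(treebanked, raw)
recovered = raw[span[0]:span[1]] if span else None
print(recovered)  # you refused to shop there and pay for my return privledges?
```

Because the tokens are joined with optional whitespace, the detached ? in the treebank still matches the attached one in the raw text. This only works if tokenisation did not change any characters; any normalised token would need to be mapped back before matching.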
