Ontonotes raw content #15

Open
arademaker opened this issue Oct 4, 2021 · 6 comments

Comments

@arademaker
Contributor

arademaker commented Oct 4, 2021

The EWT dataset does contain the raw content before tokenization. I suppose that is what allows the UD version to obtain the source string and add it to the sentence metadata (see UniversalDependencies/UD_English-EWT#252).

What about OntoNotes? The .onf files are the only ones with the plain sentence, but even these sentences seem to be tokenized. Do we have the actual source content of the OntoNotes sentences somewhere?

@MarthaSPalmer

MarthaSPalmer commented Oct 4, 2021 via email

@arademaker
Contributor Author

Thank you @MarthaSPalmer, but no, they don't contain the raw text. See ./wb/sel/16/sel_1677.onf:

Plain sentence:
---------------
    Your ' answer ' did n't address the specific question because you never did return one to walmart remember, you refused
    to shop there and pay for my return privledges ?

Treebanked sentence:
--------------------
    Your ' answer ' did n't address the specific question because you never did return one to walmart *PRO* remember , you
    refused *PRO*-1 to shop there and pay for my return privledges ?
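
To make the difference between the two views concrete, here is a minimal sketch that drops the trace markers from the treebanked sentence, leaving only the tokens that could appear in the source text. The pattern `TRACE_RE` and the helper `strip_traces` are hypothetical names for illustration; the pattern is an assumption that covers markers such as `*PRO*` and `*PRO*-1` seen in the excerpt and similar star-delimited forms, and it may also drop a literal `*` token.

```python
import re
from typing import List

# Assumption: trace / empty-category tokens look like "*", "*-1", "*PRO*",
# "*PRO*-1", "*T*-2" or "*?*". Illustrative only, not an official inventory.
TRACE_RE = re.compile(r"^\*(?:[A-Z?]+\*)?(?:-\d+)?$")


def strip_traces(treebanked: str) -> List[str]:
    """Split a treebanked sentence on whitespace and drop trace-like tokens."""
    return [tok for tok in treebanked.split() if not TRACE_RE.match(tok)]


# The treebanked sentence quoted above, joined onto one line.
treebanked = (
    "Your ' answer ' did n't address the specific question because you never "
    "did return one to walmart *PRO* remember , you refused *PRO*-1 to shop "
    "there and pay for my return privledges ?"
)
tokens = strip_traces(treebanked)
print(" ".join(tokens))
```

Even after the traces are removed, the tokens remain split (`did n't`, `' answer '`), so this alone does not give back the original string.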

@MarthaSPalmer

MarthaSPalmer commented Oct 5, 2021 via email

@V3RGANz

V3RGANz commented Aug 6, 2022

@arademaker your concrete example can be found in the metadata/context directory. Try grep -r /path/to/dir -e 'text' with some substrings of the example you are looking for.
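
The same recursive search can be sketched in Python, which may be handy when the result feeds further processing. `files_containing` is a hypothetical helper, and the UTF-8 encoding is an assumption about the files, so undecodable bytes are replaced rather than raising.

```python
from pathlib import Path
from typing import Iterator


def files_containing(root: str, needle: str) -> Iterator[Path]:
    """Yield files under root whose content contains needle (roughly grep -rl)."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Assumption: files are UTF-8; bad bytes are replaced, not fatal.
            text = path.read_text(encoding="utf-8", errors="replace")
            if needle in text:
                yield path
```

Calling `list(files_containing("/path/to/dir", "privledges"))` would return the matching paths. A needle taken from inside a single token is safer than a multi-word phrase, since spacing may differ between the tokenized and raw text.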

@arademaker
Contributor Author

The LDC distribution contains only the wb subfolder in the ontonotes-release-5.0/data/files/data/english/metadata/context folder. Moreover, the <text> tag contains the raw text of the whole document, not of the individual sentences... recovering the sentence split would be extra hard work.
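
As a rough illustration of pulling the raw text out of such a file, the sketch below extracts everything between `<text>` and `</text>`. The exact layout of the context files is an assumption here (a plain tag pair, possibly with attributes on the opening tag); `TEXT_RE` and `extract_text_blocks` are hypothetical names.

```python
import re
from typing import List

# Assumption: raw text sits between <text ...> and </text>; this has not been
# checked against the actual LDC files.
TEXT_RE = re.compile(r"<text\b[^>]*>(.*?)</text>", re.DOTALL | re.IGNORECASE)


def extract_text_blocks(document: str) -> List[str]:
    """Return the stripped contents of every <text>...</text> element."""
    return [block.strip() for block in TEXT_RE.findall(document)]
```

This yields document-level text only; the sentence boundaries still have to be recovered separately.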

@V3RGANz

V3RGANz commented Aug 8, 2022

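@arademaker if OntoNotes sentences are only tokenised (without token editing), it should be easy to find a sentence with a regular expression.

example:

import re
from typing import List, Optional

def find_sentence(no_trace_tokens: List[str], raw_text: str) -> Optional[re.Match]:
    # Cheap pre-check: every token must occur somewhere in the raw text.
    if all(t in raw_text for t in no_trace_tokens):
        # Escape tokens so that characters like '?' or '.' match literally.
        sentence_regex = r'\s*'.join(re.escape(t) for t in no_trace_tokens)
        return re.search(sentence_regex, raw_text)
    return None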

Yes, there is some work to be done, but I can only propose a solution to your problem.
About wb: does pre-tokenisation occur in some other corpus?
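
Extending the regular-expression idea to the sentence-split problem, one possible sketch searches for each sentence in order, starting where the previous match ended. `token_pattern` and `align_sentences` are hypothetical names, and the sample text is made up. The approach assumes tokens were only split, never rewritten; if a token was altered during tokenisation, the search fails and `None` is recorded for that sentence.

```python
import re
from typing import List, Optional, Tuple


def token_pattern(tokens: List[str]) -> re.Pattern:
    """Regex matching the tokens in order, with optional whitespace between."""
    return re.compile(r"\s*".join(re.escape(t) for t in tokens))


def align_sentences(
    sentences: List[List[str]], raw_text: str
) -> List[Optional[Tuple[int, int]]]:
    """Return (start, end) character offsets per sentence, or None if not found."""
    spans: List[Optional[Tuple[int, int]]] = []
    pos = 0  # never search before the end of the previous match
    for tokens in sentences:
        match = token_pattern(tokens).search(raw_text, pos)
        if match is None:
            spans.append(None)
        else:
            spans.append(match.span())
            pos = match.end()
    return spans


# Made-up sample: two tokenised sentences and the raw text they came from.
raw = "Hello, world!  It doesn't work."
sentences = [["Hello", ",", "world", "!"], ["It", "does", "n't", "work", "."]]
spans = align_sentences(sentences, raw)
for span in spans:
    if span is not None:
        print(repr(raw[span[0]:span[1]]))
```

Searching forward from the previous match keeps repeated sentences from all mapping to the first occurrence, at the cost of assuming the sentences appear in document order.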
