This issue was moved to a discussion.

Is handling of singular / plural forms ('sentence' and 'sentences') correct / consistent? #231

Closed
0dB opened this issue Aug 6, 2023 · 3 comments

Contributor

0dB commented Aug 6, 2023

In https://derwen.ai/docs/ptr/sample/ in the "Scrubber" section it says

Different variations of "sentence(s)" are now represented as part of single entry in phrase list.

To me this implies that "sentence" and "sentences" should be "grouped" (lemmatized), but in my experiments and in the output shown, the singular and plural forms are listed separately.

ic| phrase: Phrase(text='sentences', chunks=[sentences, the sentences], count=2, rank=0.14407775200046075)
ic| phrase: Phrase(text='sentence', chunks=[every sentence, every other sentence], count=2, rank=0.07909136977858265)
ic| phrase: Phrase(text='two sentences', chunks=[the two sentences, two sentences], count=2, rank=0.06654666312136172)

Is this correct or wrong behavior? If it is correct, maybe just the tutorial needs to make this clear?

With the bugfix I propose in #232 and the token list I used for scrubbing, I get these results:

0.13134098, 05, sentences, [sentences, the two sentences, sentences, two sentences, the sentences]
0.07117996, 02, sentence, [every sentence, every other sentence]

but I now assume that I should actually be getting only one line for both "sentence" and "sentences". Am I wrong?

Contributor

Ankush-Chander commented Aug 6, 2023

Hi @0dB
Thanks for bringing this to our attention.
The grouping you are seeing is what the scrubber code in the example produces.
Since the scrubber function in the example code returns span.text, occurrences of "sentences" are grouped into one entry, while occurrences of "sentence" are grouped into a separate entry.

We can get the desired behaviour by changing the example code from

return span.text

to

return span.lemma_

This will group all occurrences of sentence and sentences together.
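To illustrate why the return value matters, here is a minimal, dependency-free sketch. It is not pytextrank's implementation: `FakeSpan` and `group_by_scrubber` are hypothetical stand-ins, and the spans are assumed to have already had their prefixes ("the", "every", ...) stripped. It only shows that whatever string the scrubber returns acts as the grouping key.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span; only the two attributes used here."""
    text: str
    lemma_: str


def group_by_scrubber(
    spans: List[FakeSpan],
    scrubber: Callable[[FakeSpan], str],
) -> Dict[str, List[str]]:
    """Group span texts under the key returned by the scrubber (toy model)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for span in spans:
        groups[scrubber(span)].append(span.text)
    return dict(groups)


# Assumed input: spans after prefix stripping, with the lemmas one would
# expect a lemmatizer to assign.
spans = [
    FakeSpan(text="sentences", lemma_="sentence"),
    FakeSpan(text="sentences", lemma_="sentence"),
    FakeSpan(text="sentence", lemma_="sentence"),
    FakeSpan(text="sentence", lemma_="sentence"),
]

# Keying on the surface text keeps singular and plural apart.
by_text = group_by_scrubber(spans, lambda span: span.text)

# Keying on the lemma merges them into a single entry.
by_lemma = group_by_scrubber(spans, lambda span: span.lemma_)

print(sorted(by_text))   # keys when returning span.text
print(sorted(by_lemma))  # keys when returning span.lemma_
```

In the real notebook, the equivalent change is the single `return` line inside the scrubber function shown above; the rest of the pipeline configuration stays as it is.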

Please feel free to make this change in the example notebook in your existing PR #233 .

Contributor Author

0dB commented Aug 6, 2023

Thanks, let me try that out and see what effect it has overall, and then I will also update the sample output. I can do this sometime soon.

Update: I think I am more pleased with the results. I am getting better summaries this way, since singular and plural forms of words are now "equal" to the algorithm and together have more weight, instead of carrying separate and weaker weights. I will test some more and then propose a few updates to the sample page.

Collaborator

ceteri commented Aug 7, 2023

Many thanks @0dB and @Ankush-Chander !

It would help to have examples/sample.ipynb updated to illustrate the behaviors discussed here.

@0dB, the changes in your PR #233 look good.

We're having issues with our CI pipeline (see #235) and as soon as I get that cleared (hopefully tonight) I'll accept/merge the PR.

I also noticed the typo "toekn" in that same notebook :) FWIW, these notebooks get rendered as Markdown to build portions of our docs, so the docs will be updated by the same fix.

ceteri self-assigned this Aug 7, 2023
DerwenAI locked and limited conversation to collaborators Aug 22, 2023
ceteri converted this issue into discussion #242 Aug 22, 2023
