Is handling of singular / plural forms ('sentence' and 'sentences') correct / consistent? #231

0dB · 2023-08-06T06:00:11Z

In https://derwen.ai/docs/ptr/sample/ in the "Scrubber" section it says

Different variations of "sentence(s)" are now represented as part of single entry in phrase list.

To me this implies that "sentence" and "sentences" should be grouped (lemmatized) into one entry, but in my experiments and in the output shown, the singular and plural forms are listed separately.

ic| phrase: Phrase(text='sentences', chunks=[sentences, the sentences], count=2, rank=0.14407775200046075)
ic| phrase: Phrase(text='sentence', chunks=[every sentence, every other sentence], count=2, rank=0.07909136977858265)
ic| phrase: Phrase(text='two sentences', chunks=[the two sentences, two sentences], count=2, rank=0.06654666312136172)

Is this correct or wrong behavior? If it is correct, maybe just the tutorial needs to make this clear?

With the bugfix I propose in #232 and the token list I used for scrubbing I get the results

0.13134098, 05, sentences, [sentences, the two sentences, sentences, two sentences, the sentences]
0.07117996, 02, sentence, [every sentence, every other sentence]

but I now assume that I should actually be getting only one line, covering both "sentence" and "sentences". Am I wrong?

Ankush-Chander · 2023-08-06T15:34:32Z

Hi @0dB
Thanks for bringing this to our attention.
The grouping you see is what the scrubber code in the example produces.
Since the scrubber function in the example code returns span.text, all occurrences of "sentences" are grouped into one entry, while all occurrences of "sentence" are grouped into a separate entry.

We can get the desired behaviour by changing the example code from

return span.text

to

return span.lemma_

This will group all occurrences of sentence and sentences together.
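
To illustrate why the return value of the scrubber decides the grouping, here is a minimal, self-contained sketch. It does not use pytextrank or spaCy: FakeSpan is a hypothetical stand-in for a spaCy Span (only the text and lemma_ attributes are modelled, with hand-assigned values), and group_chunks is an assumed simplification of how phrases that scrub to the same key end up in one entry.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span; values are assigned by hand."""
    text: str
    lemma_: str


def scrub_text(span):
    # Mirrors `return span.text`: surface forms stay distinct.
    return span.text


def scrub_lemma(span):
    # Mirrors `return span.lemma_`: singular and plural share one key.
    return span.lemma_


def group_chunks(spans, scrubber):
    # Assumed simplification: chunks are grouped by the scrubber's return value.
    groups = defaultdict(list)
    for span in spans:
        groups[scrubber(span)].append(span.text)
    return dict(groups)


spans = [
    FakeSpan(text="sentences", lemma_="sentence"),
    FakeSpan(text="sentence", lemma_="sentence"),
    FakeSpan(text="sentences", lemma_="sentence"),
]

by_text = group_chunks(spans, scrub_text)
by_lemma = group_chunks(spans, scrub_lemma)

print(sorted(by_text))   # two separate keys
print(sorted(by_lemma))  # one shared key
```

With scrub_text the three chunks fall under two keys, and with scrub_lemma they fall under one, which is the effect described above.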

Please feel free to make this change in the example notebook in your existing PR #233 .

0dB · 2023-08-06T19:45:04Z

Thanks, let me try that out and see what effect it has overall, and then I will update the sample output as well. I can do this sometime soon.

Update: I think I am more pleased with the results. I am getting better summaries this way, since singular and plural forms of a word are now "equal" to the algorithm and together carry more weight, instead of each carrying a separate, weaker weight. I will test some more and then propose a few updates to the sample page.

ceteri · 2023-08-07T05:22:09Z

Many thanks @0dB and @Ankush-Chander !

It would help to have examples/sample.ipynb updated to illustrate the behaviors discussed here.

@0dB, the changes in your PR #233 look good.

We're having issues with our CI pipeline (see #235), and as soon as I get that cleared (hopefully tonight) I'll accept/merge the PR.

I also noticed the typo toekn in that same notebook :) FWIW, these notebooks get rendered as Markdown to build portions of our docs, so the docs will become updated by the same fix.

ceteri self-assigned this Aug 7, 2023

ceteri added enhancement question and removed enhancement labels Aug 7, 2023

DerwenAI locked and limited conversation to collaborators Aug 22, 2023

ceteri converted this issue into discussion #242 Aug 22, 2023
