Support cultivar epithets #173

tobymarsden · 2021-06-22T13:48:58Z

This PR introduces support for cultivar epithets in the context of uninomials, binomials, infraspecies, named hybrids and hybrid formulae.

It supports input including:

Sarracenia flava "Maxima"
Leucocasia gigantea 'Jack's Giant'
Sarracenia alata cv Black Tube
Sarracenia alata ’Black Tube’ x Sarracenia flava cv. Maxima

It doesn't support cultivar groups yet.

A few notes:

I've kept the existing references to uninomials, even if they may now have a cultivar epithet too. E.g. Spathiphyllum ‘Mauna Loa’ produces these details:
```
"uninomial": {
  "uninomial": "Spathiphyllum",
  "cultivar": "‘Mauna Loa’"
}
```
The cardinality is updated, however, so we can end up with a uninomial with a cardinality of 2...

I don't know how you feel about that, but I didn't want to go making breaking changes that were out of proportion to the benefit.
A couple of existing tests had unparsed tails which now (reasonably) get parsed to cultivar epithets; I updated these tests.
The stemmer includes the cultivar epithet but doesn't stem it. e.g. Sarracenia flau ’Maxima’. Happy to remove the cultivar epithet from the stemmed canonical if you prefer.
Capitalization of cultivar epithets is not addressed; the input capitalization is retained. Something could be done here, though I don't think it can ever be entirely unambiguous due to ICNCP rules.
Currently, subsequent genera names inside hybrid formulae get stemmed:, e.g. Arthopyrenia hyalospora X Hydnellum scrobiculatum gives Arthopyrenia hyalospor × Hydnell scrobiculat.

Updating the stemmer to support verbatim cultivar epithets had the side effect of retaining the verbatim genera, so this example is now stemmed as Arthopyrenia hyalospor × Hydnellum scrobiculat. That didn't seem unreasonable, so I updated those tests (of which there were many). This change is the one that's perhaps the most 'breaking', and it should be easy enough to maintain existing behavior at least for non-cultivars if needed.

I'm not very familiar with Go and it's been many years since I used PEG (!) so while we're going to put this into production here, I do understand if you don't consider it ready for primetime.

That said, if you let me know what I've missed or any problems I've caused, I'd be happy to find time to give it another pass.

dimus · 2021-06-22T14:04:32Z

Impressive @tobymarsden! @havardo is also interested in cultivar support, and I was giving it a thought too. In my understanding cultivar naming can be quite flexible in practice, and I wonder if a solution would be to create a new peg file for cultivars specifically, and add a flag to gnparser that will clearly state that the parsing is of cultivar names. What do you think? I am worried that trying to mix cultivars, ICZN, ICN, bactrial codes all together would create ambiguous situations and increase the number of parsing mistakes.

tobymarsden · 2021-06-22T16:26:33Z

@dimus Possibly!

I like the idea of a flag to enable cultivar parsing, though I'm not familiar enough with the libraries to know whether it can be done without causing a maintenance nightmare. Parsing cultivars as presented here uses most or all of the existing grammar.

I'm pretty sure it's possible to cause parsing errors with a scientific name containing multiple apostrophes, for example, though I couldn't come up with any plausible tests to show it -- I reckon it would currently end up in the unparsed tail virtually all of the time.

havardo · 2021-06-22T17:22:58Z

Wonderful news @tobymarsden.
Agree with @dimus regarding a potential flag approch. It sounds reasonable to make cultivar support an "opt in" capability if that can help on the overall design and performance.

We mainly deal with botanical taxonomy and cultivars are omnipresent, even in datasets that you would not expect to see them. So for our situation, we would allways opt in. Look forward to trying it out :)

tobymarsden · 2021-06-22T19:18:51Z

@dimus @havardo I took a quick look into adding a flag. It seems that pointlander/peg may have some issues supporting multiple peg files in the same package. That's resolvable, but I'm not sure compiling multiple grammars is the best approach anyway as I don't see how to avoid massive rule duplication?

One option would be to have a subgrammar file (e.g. cultivars.peg) with a build step that merges it into the main grammar.peg file before compilation, but then the flag is a compile-time option rather than runtime, which is barely helpful. Testing could get "interesting", plus it's creating some footguns for the future especially if more subgrammars come along later.

It would be great if there were a way to allow conditional rules at runtime within pointlander/peg but that would be... quite a thing.

A lo-fi approach would be to enable cultivar parsing in all cases but increase test coverage of non-cultivar names in the outlying cases where there is the potential for collision with the cultivar rules, and at least document these.

(I benchmarked parsing performance and couldn't detect any difference at all between the branches over 1000 runs on my MacBook Air.)

Any thoughts? Thanks for everything!

dimus · 2021-06-22T20:30:31Z

@tobymarsden @havardo, I think if cultivar parsing does cover the most common cases and does not try to cover every nook and cranny one PEG grammar might be enough. An additional flexibility can be achieved after PEG, when abstract syntax tree is processed for the output, at this point a 'cultivar flag' might be of help.

tobymarsden · 2021-06-22T20:57:09Z

@dimus Yep! I can definitely add a flag into the tree processing to disable adding the cultivars into the output. I'll update the PR.

As far as I can tell the only parsing errors that should hit non-cultivar names are when two or more apostrophes are included in the input -- I might be missing something, honestly there isn't enough caffeine in the world, but I think that should the edgiest of edge cases.

Happily, I suspect there's no way to get too clever regarding the ICNCP anyway given that epithets, group names (and the group marker itself!) can be in any language. Even the simplest capitalization rules become nonsensical when you're dealing with cultivars in Chinese...

tobymarsden · 2021-06-22T22:37:10Z

The flag --disable_cultivars is added to the PR along with some tests... as it doesn't affect the grammar parsing, just the AST processing and hence the output I've made it opt-out, but of course I can make it opt-in if you prefer.

dimus · 2021-06-23T01:46:10Z

opt in would probably have more sense, as most of the users, at least for now, are more interested in ICN, BC or ICZN codes. May be '-C' '-cultivar'?

tobymarsden · 2021-06-23T08:08:28Z

What about including the cultivar in the optional details section regardless, which shouldn't have any impact on people currently using the library as it's purely additive, but only include it in the normalized/canonical names if the --cultivar option is enabled? The library's default output would remain unchanged in that case but it will be clearer that there is fuller support for botanical names.

Otherwise I'd propose adding a new quality warning when names with cultivars are parsed without the option enabled. Without that, in the default case the cultivar name data will be lost: currently they'll get an unparsed tail warning but that will disappear if the cultivar epithets are recognized but suppressed from the output.

Thanks!

dimus · 2021-06-23T09:02:05Z

Stemming of genera in hybrid formulas is a bug, thank you for spotting and fixing it @tobymarsden

dimus · 2021-06-23T10:14:42Z

What about including the cultivar in the optional details section regardless, which shouldn't have any impact on people currently using the library as it's purely additive, but only include it in the normalized/canonical names if the --cultivar option is enabled? The library's default output would remain unchanged in that case but it will be clearer that there is fuller support for botanical names.

I like the idea of keeping canonical, normalized forms and cardinality depentent on the flag and providing cultivar details with or without the flag.

Otherwise I'd propose adding a new quality warning when names with cultivars are parsed without the option enabled. Without that, in the default case the cultivar name data will be lost: currently they'll get an unparsed tail warning but that will disappear if the cultivar epithets are recognized but suppressed from the output.

So if --canonical is not set, there is a level 2 warning "Cultivar name", and if flag is enabled the
warning is not set? I think it makes sense.

dimus · 2021-06-23T10:23:30Z

In case of Ligusticum sinense cv 'chuanxiong' S.H. Qiu & et al. the authorship goes into uparsed tail currently. Is it ok for cultivars? I am going to ask around for ICNCP today, it does not look like there is a free download of its text at https://www.ishs.org/scripta-horticulturae/international-code-nomenclature-cultivated-plants-ninth-edition

tobymarsden · 2021-06-23T10:41:03Z

I like the idea of keeping canonical, normalized forms and cardinality depentent on the flag and providing cultivar details with or without the flag.

Great! I'll amend.

So if --canonical is not set, there is a level 2 warning "Cultivar name", and if flag is enabled the
warning is not set? I think it makes sense.

OK! I'll implement.

In case of Ligusticum sinense cv 'chuanxiong' S.H. Qiu & et al. the authorship goes into uparsed tail currently. Is it ok for cultivars? I am going to ask around for ICNCP today, it does not look like there is a free download of its text at https://www.ishs.org/scripta-horticulturae/international-code-nomenclature-cultivated-plants-ninth-edition

I haven't added support for cultivar authorship (though uninomial and binomial authorship is retained) -- the ICNCP doesn't recommend adding cultivar authorship though it's permitted. It seems to be very rare in practice, but I don't see why it should present any particular problems. I can see if it's trivial to implement without breaking anything.

The ICNCP 9th edition is available at https://www.ishs.org/sites/default/files/static/ScriptaHorticulturae_18.pdf

dimus · 2021-06-23T10:52:45Z

I went through your PR, looks great so far @tobymarsden!

If authorship is not recommended, we can skip it for now then.
Thank you for the link to ICNCP

dimus · 2021-06-23T11:10:23Z

To make testing simpler I suggest to make a separate file for cultivars, so they are tested fully with --cultivar flag

tobymarsden · 2021-06-23T22:17:10Z

@dimus PR updated:

new --cultivar/-C flag which is opt-in
with the flag omitted, when a cultivar epithet is parsed a quality warning is added and the cultivar epithet is added to the details but not to the canonical/normalized outputs
with the flag present, there is no quality warning and the cultivar epithet is added both to the details and the canonical/normalized outputs
a single cultivar test is retained in the main test file (to verify that the quality warning, cardinality and canonical forms are correct)
the remaining cultivars tests are moved to a second test file which is run with the --cultivar flag

Thanks for your perseverance as we heave this towards the line!

dimus · 2021-06-24T00:23:46Z

great @tobymarsden! Let me merge it with master, and I will start adding missing parts, docs, openapi stuff, web option. When all is settled I'll tag a new version

Support cultivar epithets

a290451

tobymarsden mentioned this pull request Jun 22, 2021

Add cultivar tests with quotes and apostrophes #171

Merged

Add command line flag to disable cultivars

12ebe81

Add '--cultivar/-C' flag; cultivar quality warnings

378a4d1

dimus merged commit acce122 into gnames:master Jun 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support cultivar epithets #173

Support cultivar epithets #173

tobymarsden commented Jun 22, 2021

dimus commented Jun 22, 2021

tobymarsden commented Jun 22, 2021

havardo commented Jun 22, 2021

tobymarsden commented Jun 22, 2021

dimus commented Jun 22, 2021

tobymarsden commented Jun 22, 2021

tobymarsden commented Jun 22, 2021

dimus commented Jun 23, 2021 •

edited

Loading

tobymarsden commented Jun 23, 2021

dimus commented Jun 23, 2021

dimus commented Jun 23, 2021 •

edited

Loading

dimus commented Jun 23, 2021 •

edited

Loading

tobymarsden commented Jun 23, 2021

dimus commented Jun 23, 2021

dimus commented Jun 23, 2021

tobymarsden commented Jun 23, 2021 •

edited

Loading

dimus commented Jun 24, 2021

Support cultivar epithets #173

Support cultivar epithets #173

Conversation

tobymarsden commented Jun 22, 2021

dimus commented Jun 22, 2021

tobymarsden commented Jun 22, 2021

havardo commented Jun 22, 2021

tobymarsden commented Jun 22, 2021

dimus commented Jun 22, 2021

tobymarsden commented Jun 22, 2021

tobymarsden commented Jun 22, 2021

dimus commented Jun 23, 2021 • edited Loading

tobymarsden commented Jun 23, 2021

dimus commented Jun 23, 2021

dimus commented Jun 23, 2021 • edited Loading

dimus commented Jun 23, 2021 • edited Loading

tobymarsden commented Jun 23, 2021

dimus commented Jun 23, 2021

dimus commented Jun 23, 2021

tobymarsden commented Jun 23, 2021 • edited Loading

dimus commented Jun 24, 2021

dimus commented Jun 23, 2021 •

edited

Loading

dimus commented Jun 23, 2021 •

edited

Loading

dimus commented Jun 23, 2021 •

edited

Loading

tobymarsden commented Jun 23, 2021 •

edited

Loading