-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support cultivar epithets #173
Conversation
Impressive @tobymarsden! @havardo is also interested in cultivar support, and I was giving it a thought too. In my understanding cultivar naming can be quite flexible in practice, and I wonder if a solution would be to create a new peg file for cultivars specifically, and add a flag to |
@dimus Possibly! I like the idea of a flag to enable cultivar parsing, though I'm not familiar enough with the libraries to know whether it can be done without causing a maintenance nightmare. Parsing cultivars as presented here uses most or all of the existing grammar. I'm pretty sure it's possible to cause parsing errors with a scientific name containing multiple apostrophes, for example, though I couldn't come up with any plausible tests to show it -- I reckon it would currently end up in the unparsed tail virtually all of the time. |
Wonderful news @tobymarsden. We mainly deal with botanical taxonomy and cultivars are omnipresent, even in datasets that you would not expect to see them. So for our situation, we would allways opt in. Look forward to trying it out :) |
@dimus @havardo I took a quick look into adding a flag. It seems that pointlander/peg may have some issues supporting multiple peg files in the same package. That's resolvable, but I'm not sure compiling multiple grammars is the best approach anyway as I don't see how to avoid massive rule duplication? One option would be to have a subgrammar file (e.g. cultivars.peg) with a build step that merges it into the main grammar.peg file before compilation, but then the flag is a compile-time option rather than runtime, which is barely helpful. Testing could get "interesting", plus it's creating some footguns for the future especially if more subgrammars come along later. It would be great if there were a way to allow conditional rules at runtime within pointlander/peg but that would be... quite a thing. A lo-fi approach would be to enable cultivar parsing in all cases but increase test coverage of non-cultivar names in the outlying cases where there is the potential for collision with the cultivar rules, and at least document these. (I benchmarked parsing performance and couldn't detect any difference at all between the branches over 1000 runs on my MacBook Air.) Any thoughts? Thanks for everything! |
@tobymarsden @havardo, I think if cultivar parsing does cover the most common cases and does not try to cover every nook and cranny one PEG grammar might be enough. An additional flexibility can be achieved after PEG, when abstract syntax tree is processed for the output, at this point a 'cultivar flag' might be of help. |
@dimus Yep! I can definitely add a flag into the tree processing to disable adding the cultivars into the output. I'll update the PR. As far as I can tell the only parsing errors that should hit non-cultivar names are when two or more apostrophes are included in the input -- I might be missing something, honestly there isn't enough caffeine in the world, but I think that should the edgiest of edge cases. Happily, I suspect there's no way to get too clever regarding the ICNCP anyway given that epithets, group names (and the group marker itself!) can be in any language. Even the simplest capitalization rules become nonsensical when you're dealing with cultivars in Chinese... |
The flag |
opt in would probably have more sense, as most of the users, at least for now, are more interested in ICN, BC or ICZN codes. May be '-C' '-cultivar'? |
What about including the cultivar in the optional Otherwise I'd propose adding a new quality warning when names with cultivars are parsed without the option enabled. Without that, in the default case the cultivar name data will be lost: currently they'll get an unparsed tail warning but that will disappear if the cultivar epithets are recognized but suppressed from the output. Thanks! |
Stemming of genera in hybrid formulas is a bug, thank you for spotting and fixing it @tobymarsden |
I like the idea of keeping canonical, normalized forms and cardinality depentent on the flag and providing cultivar details with or without the flag.
So if --canonical is not set, there is a level 2 warning "Cultivar name", and if flag is enabled the |
In case of |
Great! I'll amend.
OK! I'll implement.
I haven't added support for cultivar authorship (though uninomial and binomial authorship is retained) -- the ICNCP doesn't recommend adding cultivar authorship though it's permitted. It seems to be very rare in practice, but I don't see why it should present any particular problems. I can see if it's trivial to implement without breaking anything. The ICNCP 9th edition is available at https://www.ishs.org/sites/default/files/static/ScriptaHorticulturae_18.pdf |
I went through your PR, looks great so far @tobymarsden! If authorship is not recommended, we can skip it for now then. |
To make testing simpler I suggest to make a separate file for cultivars, so they are tested fully with --cultivar flag |
@dimus PR updated:
Thanks for your perseverance as we heave this towards the line! |
great @tobymarsden! Let me merge it with master, and I will start adding missing parts, docs, openapi stuff, web option. When all is settled I'll tag a new version |
This PR introduces support for cultivar epithets in the context of uninomials, binomials, infraspecies, named hybrids and hybrid formulae.
It supports input including:
It doesn't support cultivar groups yet.
A few notes:
I've kept the existing references to uninomials, even if they may now have a cultivar epithet too. E.g.
Spathiphyllum ‘Mauna Loa’
produces these details:The cardinality is updated, however, so we can end up with a uninomial with a cardinality of 2...
I don't know how you feel about that, but I didn't want to go making breaking changes that were out of proportion to the benefit.
A couple of existing tests had unparsed tails which now (reasonably) get parsed to cultivar epithets; I updated these tests.
The stemmer includes the cultivar epithet but doesn't stem it. e.g.
Sarracenia flau ’Maxima’
. Happy to remove the cultivar epithet from the stemmed canonical if you prefer.Capitalization of cultivar epithets is not addressed; the input capitalization is retained. Something could be done here, though I don't think it can ever be entirely unambiguous due to ICNCP rules.
Currently, subsequent genera names inside hybrid formulae get stemmed:, e.g.
Arthopyrenia hyalospora X Hydnellum scrobiculatum
givesArthopyrenia hyalospor × Hydnell scrobiculat
.Updating the stemmer to support verbatim cultivar epithets had the side effect of retaining the verbatim genera, so this example is now stemmed as
Arthopyrenia hyalospor × Hydnellum scrobiculat
. That didn't seem unreasonable, so I updated those tests (of which there were many). This change is the one that's perhaps the most 'breaking', and it should be easy enough to maintain existing behavior at least for non-cultivars if needed.I'm not very familiar with Go and it's been many years since I used PEG (!) so while we're going to put this into production here, I do understand if you don't consider it ready for primetime.
That said, if you let me know what I've missed or any problems I've caused, I'd be happy to find time to give it another pass.