Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add unicode upcase, downcase #2547

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

liquidaty
Copy link

See #492

@pkoppstein
Copy link
Contributor

@liquidaty - Thanks for keeping this issue alive.

Given that jq already has ascii_upcase and ascii_downcase, wouldn't it be preferable to use the names upcase and downcase for the new filters?

Also, perhaps you could trigger a restart of the automated "checks"?

@liquidaty
Copy link
Author

liquidaty commented Jul 3, 2023

thank you @pkoppstein -- absolutely, will do. Your suggested names sound fine to me.

I realized after submitting the last PR state that the autoconf did not properly work with regards to libutf8proc. I've already done the work (outside of the PR) to handle that so that will be easy to update.

The approach it will take is the same as for libutf8proc as it does for onigurama, which is to include it as a submodule in modules/, with the same options and choices and defaults. Feel free to let me know if you / other maintainers prefer a different approach.

@pkoppstein
Copy link
Contributor

Feel free to let me know if you / other maintainers prefer a different approach.

What would the main “cons” be in including libutf8proc unconditionally? Do we have any numbers at this point?

@liquidaty
Copy link
Author

liquidaty commented Jul 4, 2023

What would the main “cons” be in including libutf8proc unconditionally? Do we have any numbers at this point?

No cons in my view, unless someone had some preference to use their own version of libutf8proc or anyone wants the option to have a (very slightly) lighter-weight jq that did not support these functions. No numbers I'm aware of

At the moment the updated PR I'm prepping already makes it optional, defaulting to yes, and is basically ready to go (I just want to add a simple test before re-submitting), so easy to change it to unconditional

@pkoppstein
Copy link
Contributor

Assuming we’re ok with the license, I’d suggest NOT making its inclusion conditional, partly because the functionality that will be enabled is so central to Unicode and thus JSON, and because it’s already such a pain having even one external library optional (e.g. w.r.t. man.test).

liquidaty and others added 2 commits July 5, 2023 10:46
add one simple test to jq.test for each
include utf8proc files directly in repo as an unconditional part of the build
update COPYING to include utf8proc license
@wader
Copy link
Member

wader commented Jul 5, 2023

I agree with @pkoppstein to make it non-optional as it feels fundamental. Only cons i can come up with is if it makes libjq more complicated to link? should we namespace the symbols somehow?

Nice addition!

@liquidaty
Copy link
Author

liquidaty commented Jul 5, 2023

OK should be all set now, utf8proc is included unconditionally (hence no need to modify autoconf), filter names changed to upcase and downcase, couple tests have been added, license notice added, all check pass. LMK if any further changes wanted in order to merge

@nicowilliams
Copy link
Contributor

I've not reviewed this, but yeah, we need this!

(We probably also need normalization and form-insensitive string comparisons, but that's for another issue/PR and another time.)

@liquidaty
Copy link
Author

Actually, please do not merge this just yet. It occurs to me that the current way of unconditionally including libutf8 is to simply include the utf8proc.c into the source code. However, this may cause issues if jq is used as a library together with other code that links the utf8proc library. I will update this PR to unconditionally include libutf8 but still providing an option for the user to specify their own

@pkoppstein pkoppstein marked this pull request as draft July 7, 2023 18:26
@nicowilliams
Copy link
Contributor

Actually, please do not merge this just yet. It occurs to me that the current way of unconditionally including libutf8 is to simply include the utf8proc.c into the source code. However, this may cause issues if jq is used as a library together with other code that links the utf8proc library. I will update this PR to unconditionally include libutf8 but still providing an option for the user to specify their own

Can of worms opening alert :)

There was an issue asking for us to make sure all non-static symbols are named jq_... or _jq_... or jv_.... The idea is that when static linking one has no control over what non-static symbols are "exported" by libraries. I was not, and still am not sympathetic because there is no reason that static linking should be stuck with 1980 semantics while dynamic linking has much better semantics. But the reality is that static linking still exists, it is still stuck with 1980 semantics, and we probably still want to support it, so we probably need to do some symbol renaming.

So I'm inclined to agree that any external library sources that we fork right into jq's should have symbols renamed.

Whereas if we use some external library and import it as a submodule like we did with Oniguruma, then users who want static linking would probably not use the submodules when building jq, and we can leave any symbol conflict issues in those submodules to the user.

So I'm further inclined to believe that if we can treat libutf8 like Oniguruma, that's probably better, and we might be able to then avoid opening this can of worms. Though we should rename our non-static symbols that don't start with jq_, _jq_, or jv_.

@pkoppstein
Copy link
Contributor

asking for us to make sure all non-static symbols are named jq_... or jq... or jv_....

Couldn't we just add utf8proc_ to the list of "reserved prefixes"?

See https://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html

Having recently tried and failed to make a "-static" version of jq, my sense is that the benefits of including the library unconditionally outweigh the drawbacks. If it's good enough for Julia, it's good enough for jq.

@nicowilliams
Copy link
Contributor

asking for us to make sure all non-static symbols are named jq_... or jq... or jv_....

Couldn't we just add utf8proc_ to the list of "reserved prefixes"?

Anyone building a busybox-like thing will probably want jq's internal libutf8 to not conflict with other utilities'. Is libutf8 something we can link with and -if need be- include as a submodule, just like Oniguruma?

See https://juliastrings.github.io/utf8proc/doc/utf8proc_8h.html

Having recently tried and failed to make a "-static" version of jq, my sense is that the benefits of including the library unconditionally outweigh the drawbacks. If it's good enough for Julia, it's good enough for jq.

True for you, and true for me. I'm not very sympathetic to users who want to link statically, but I am willing to do a bit for them. We could have a header that can be included to rename all the relevant symbols, say.

I really wish that the various binutils-like tools out there could learn ELF semantics for static linking... (Specifically:

  • .a archive files should have a well-known .o whose contents is the set of relevant link-editor options used to produce the archive, such as all the -L..., -l..., -rpath ... options, including global/local symbol demotions from mapfiles
  • when doing a final link-edit of an executable, ld should
    • not require the full, flattened dependency tree to be provided on the command-line, instead
    • ld should learn the non-flattened dependency tree from inspecting the direct dependencies' dependencies and so on
    • ld should resolve symbol conflicts like dynamic linking does (pick the symbol from direct dependencies)
    • "map files"

Then we'd not have to consider these silly symbol naming issues.)

@nicowilliams
Copy link
Contributor

Anyways, for now I'll take the inclusion of this utf8 library and forget about symbol renaming.

@nicowilliams
Copy link
Contributor

Looking at the header, it looks like we can also get normalization and other Unicode functionality with this!

@nicowilliams
Copy link
Contributor

I've not reviewed the code in src/builtin.c yet, but assuming we get a review of that (possibly mine) I'm willing to just take this. Is there any reason this is a Draft PR and not just a PR?

allow custom utf8proc location; search default prefix if not found, else use builtin
omit utf8proc from libjq
changed original modules/utf8proc/Makefile to omit checks that require ruby to be installed
@liquidaty
Copy link
Author

liquidaty commented Jul 8, 2023

I've updated the PR with a proposed way to thread the needle as follows (apologies for the weird path of merging master into the branch, that was a workaround for an originally ill-targeted commit in the first attempt to update the PR):

  • the JQ executable is unconditionally compiled with a statically linked libutf8proc.a library (irrespective of ENABLE_ALL_STATIC)
  • moved libutf8proc code to modules and added all of that repo into this repo as normal files
  • User can specify --with-utf8proc=prefix configure option, else the default prefix will be searched for utf8proc.h and used if that is found, else built-in will be used
  • the JQ library does not internally bundle the utf8proc library. However, Libs.private is added to libjq.pc, and contains the path to the libutf8proc.a that was used

This way:

  • the jq executable has no additional shared lib dependencies
  • No symbol renaming is needed (libutf8proc currently does not provide a way to automatically do this, so this avoids having to apply some additional symbol renaming processing)
  • the jq library can be linked without any issues via standard pkg-config using --libs --static

Some downsides:

  • If the installation proceeds with the built-in utf8proc, then, at least as currently implemented, the location of utf8proc.a will be {srcdir}/modules/utf8proc, rather than a "normal" location such as /usr/local. This issue can be mitigated by providing notice/warning when configure is run (e.g. "Using built-in utf8proc. If you will be using jq as a library, you may want to consider installing utf8proc and using that location, which you can do by running make -C modules/utf8proc install") (we could even include a configure option --with-utf8proc=install to auto-install if we wanted)
  • libjq linking is now a bit more complicated, requiring a libutf8proc.a link, than if we had bundled libutfproc with renamed symbols

As an aside, I believe this will address any issues building a -static version, at least insofar as they are related to libutf8proc, but please send details if not.

@liquidaty liquidaty marked this pull request as ready for review July 8, 2023 07:59
@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 8, 2023 via email

@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 8, 2023 via email

@liquidaty
Copy link
Author

I'd be OK with the UTF-8 builtins being added by the jq executable and not being in libjq, but I'd rather they be in libjq.

OK, maybe we can have our cake and eat it. Updated configure.ac and Makefile.am to support --with-utf8=bundled which will simply include the bundled modules/utf8proc/utf8proc.c into jqlib_src, thereby adding it directly into jq as well as libjq binaries. Did not add any tests specific to this option but looks like it's working as expected by doing a grep utf8proc_version .libs/*.*

@pkoppstein
Copy link
Contributor

@liquidaty - I hope you haven't run out of steam! Is there some way I can help push this over the finishing line? It would be a great shame if all your work didn't make it into jq 1.7.

@liquidaty
Copy link
Author

@pkoppstein have been getting pulled away but will take a look this week, hopefully on the earlier side (I'm also entirely fine with maintainers making any of the contemplated changes, in case anyone prefers that which sometimes is less work than chasing the change to begin with...)

@pkoppstein
Copy link
Contributor

@liquidaty - I noticed that you recently merged master into the PR (88a102f). My understanding is that the jq administrators have a very strong preference for rebasing. That's just a "heads-up" as I would not hazard to say what your best option at this point would be.

@liquidaty
Copy link
Author

liquidaty commented Jul 24, 2023

My apologies,
I am new to multi-external-party merging with GitHub and I thought that was the way we were "supposed" to resolve the merge conflicts. I definitely tried to explore whether the GitHub UI provided alternative ways, and afaict GitHub only seemed to offer that 1 and only 1 way for me to resolve the conflict (i.e. click "Resolve", make changes, click "Save")

I assume from your note however that there are other ways that I'm not aware of. I will have to ask for assistance next time the "Resolve conflict" button pops up for me then, as I don't know how else to resolve. If there are further steps I should take at this time to get it to the way you need/want it to be, please feel free to lmk.

@nicowilliams
Copy link
Contributor

@liquidaty just do this in the future git pull --rebase or git fetch origin && git rebase origin/master.

@pkoppstein
Copy link
Contributor

@nicowilliams - What would you suggest @liquidaty do to undo the damage? Start a new branch?

@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 25, 2023

@nicowilliams - What would you suggest @liquidaty do to undo the damage? Start a new branch?

No need to start a new branch. Just git pull --rebase, or git fetch origin; git rebase origin/master. I also wrote a lengthy comment in a reply to you (I think on #2731) about this same sort of issue on another PR, and linked to a gist I wrote about git.

@liquidaty
Copy link
Author

liquidaty commented Jul 25, 2023

OK I can do that. I also though it preferable to consolidate all the changes for this PR into the following 3 chunks. that sound ok?

I used the following commands to rebase:

git fetch origin pull/2547/head:pr2547
git checkout pr2547
git rebase -i origin/master
# do the rebase-- see below for end result

#### if all is ok, the next cmds will be:
  git remote add liquidaty https://github.com/liquidaty/jq
  git remote set-url --push liquidaty [email protected]:liquidaty/jq
  git push -f liquidaty HEAD:2547-resolve-jq.test
####

and the rebase consolidated the commit history into:

commit 235cd5e0c69e3a9b0e1d53235752df9dc968d469 (HEAD -> pr2547)
Author: liquidaty <[email protected]>
Date:   Sat Jul 15 10:37:10 2023 -0700

    add upcase/downcase to manual.yml

    passing tests

    update manual.yml to correct `example` vs `examples`

    updated jq.1.prebuilt from prior commit

    updated tests/man.test

    restore modules/oniguruma, from prior inadvertent modification and commit

commit c6830a7036d072351a87c4cd914d8a3bf19626e2
Author: liquidaty <[email protected]>
Date:   Sun Jul 9 07:27:35 2023 +0200

    add --with-utf8proc=bundled option

commit 290ef0d20c7b3d1435efa2b6469d2c4ddcee594b
Author: liquidaty <[email protected]>
Date:   Thu Jun 9 14:15:29 2022 -0700

    add utf8tolower, utf8toupper

    change new filter names to `upcase` and `downcase`
    add one simple test to jq.test for each
    include utf8proc files directly in repo as an unconditional part of the build
    update COPYING to include utf8proc license

    statically link utf8proc
    allow custom utf8proc location; search default prefix if not found, else use builtin
    omit utf8proc from libjq
    changed original modules/utf8proc/Makefile to omit checks that require ruby to be installed

(I'm still figuring out how to clean up the multi-line comments...)

@nicowilliams
Copy link
Contributor

Tomorrow I'll try to rebase and clean up the branch's history and force-push to the PR, and I'll include a log of what I did.

@ryenus
Copy link

ryenus commented Aug 15, 2023

Possible to include this in the upcoming 1.7 release?
Hopefully not too late, as it's been a few years since 1.6 :-)

@liquidaty
Copy link
Author

@itchyny @nicowilliams @pkoppstein et al: great job re release 1.7.

Coming back to this issue: I had (apparently mistakenly) thought it would be in 1.7 given the prior discussion and the state of the pull which afaik is totally ready to go. Could we put this issue on a definitive roadmap to be merged?

@pkoppstein
Copy link
Contributor

@liquidaty - Rebasing might be necessary.

@itchyny - Once this PR has been prepared for merging, could you please help ensure it’s merged before it becomes outdated again? Many of us have been looking forward to the inclusion of this functionality in a master version of jq, and now that 1.7 has been released, this would be an excellent time, not least because it would allow for any imperfections to be resolved calmly. Thank you.

@itchyny itchyny added this to the 1.8 release milestone Sep 24, 2023
@itchyny itchyny linked an issue Sep 24, 2023 that may be closed by this pull request
@bitsondatadev
Copy link

ufff, this one is soooo close.

Great work @liquidaty! Hopefully a maintainer can get some time to bash out the last few changes needed.

@liquidaty
Copy link
Author

@bitsondatadev it was fully ready to go well before 1.7 and at this point I'm resigned to just using my own modified version of jq and not waiting for it to ever be officially merged. I'd be happy to get it in release-ready shape again if the maintainers are willing to provide some indication that the effort won't just be ignored, but otherwise I'm probably just never going to use anything past my own modified version 1.6

@wader
Copy link
Member

wader commented Mar 20, 2024

Hello i did a rebase, squash, cleanup and resolved a bunch of review comment and pushed a variant of this here https://github.com/wader/jq/commits/2547-resolve-jq.test-cleanup/ should i push it to the PR branch?

@liquidaty
Copy link
Author

FWIW I don't believe it's my decision to make but if I'm wrong about that then yes please

@wader
Copy link
Member

wader commented Mar 20, 2024

FWIW I don't believe it's my decision to make but if I'm wrong about that then yes please

👍 could you take a quick peek before i push to see if things looks sane? i'm mostly concern about configure.ac and Makefile.am which has quite a lot of messy conflicts.

@liquidaty
Copy link
Author

sure will do. imv the configure.ac changes generally should be fine if it still passes all tests; a separate related question would be, do we want to add more tests to verify the different options that configure.ac now offers related to this feature? I generally prefer a "don't write it if you don't use / test it" principle, but in this case I would override that and vote "no" for various reasons I won't expand on unless folks disagree...

@wader
Copy link
Member

wader commented Mar 23, 2024

TODO:

  • Update to latest utf8proc https://github.com/JuliaStrings/utf8proc and maybe make sure no changes are needed to build to make it easier to update in the future. I think current vendored does some changes to use ${builddir}?
  • Different link modes: builtin or external utf8proc, bundled or dynamic for libjq? more?
  • Make sure make dist works. Wonder if this should be some kind of CI test?
  • More tests?

@git-developer
Copy link

Is this PR alive?

@liquidaty
Copy link
Author

@wader apologies for the belated response. looks good to me. thank you!

@wader
Copy link
Member

wader commented Oct 14, 2024

@liquidaty no worries!

@git-developer I think the PR it stalled/blocked on the TODOs i mentioned above. Someone has to do some work/decide about those points. I'm leaning towards somehow embedding the utf8proc source for simplicity, less cmake, make dist and submodules mess. But in that case we should probably somehow prefix all utf8proc symbols so that someone using libjq and their own utf8proc code would not get a symbol conflict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add toUpper and toLower methods
10 participants