Add analysis and runtime hooks for ibm_db #765

Open
wants to merge 1 commit into
base: master

@imavo imavo commented Jul 17, 2024

Added analysis hook and runtime hook for ibm_db module (they work together).
Supports PyInstaller 6.8.0 and higher, Python 3.8 and higher, and ibm_db 3.2.3 (wheel) and higher, on Linux x64 and Microsoft Windows x64. AIX, BSD, Cygwin, zLinux and macOS are not yet supported and therefore not tested. Tested on Win10, Win11 and Ubuntu 20.04, 22.04, 24.04 with Python 3.8, 3.10, 3.11, 3.12. Not tested with conda.

@bwoodsend
Member

Supports pyinstaller 6.8.0 and higher, python 3.8 and higher, ibm_db versions 3.2.3 (wheel) and higher, and supports linux x64 and microsoft-windows x64. Not tested on AIX or bsd or cygwin or zLinux. Tested on Win10, Win11 and ubuntu 20.04, 22.04, 24.04 with python 3.8, 3.10, 3.11, 3.12. Not tested with conda.

We have CI for this. No need to overdo it... 🙂

@rokm
Member

rokm commented Jul 29, 2024

Hmmm, I guess there are two main questions here:


First, is it possible to test ibm_db in any (from pyinstaller's perspective) meaningful way without having an actual database instance available?

For example, would a basic import ibm_db test show that we are not collecting everything and that the hook is required?

Would a basic connection attempt show that the libs and/or clidriver are missing (e.g., would it throw a library-missing-error before a host-not-found error)?


Second, how much of this is actually required for ibm_db and how much of it is specific to your application? I recall you mentioned in https://github.com/orgs/pyinstaller/discussions/8632 something to the effect of "at least for my use cases", but did not elaborate on what exactly that was.

That approach of collecting environment variables from the build environment, storing them in a file, and then restoring them at run-time via a run-time hook is, in my opinion, a no-go.
How do you handle that in the original unfrozen script? Are you relying on the environment variables to be set externally?
If the library requires parametrization via environment variables, then it is the application code's job to set that up before importing ibm_db (based on hard-coded values, or reading a config file, or whatever is applicable here).
Then both the unfrozen and frozen codepaths are handled in the same way, instead of trying to off-load this into PyInstaller's run-time hooks (which are not the place to implement extra functionality that does not exist in the base package).
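
A minimal sketch of the application-side approach described above, assuming the settings live in a JSON file. The function name `apply_driver_env` and the file name `driver_env.json` are hypothetical and appear in neither ibm_db nor PyInstaller; the same code runs unchanged in frozen and unfrozen builds.

```python
import json
import os
from pathlib import Path


def apply_driver_env(config_path, environ=os.environ):
    """Copy settings from a JSON config file into the environment.

    Variables that are already set are left alone, so a value configured
    outside the application wins over the config file.
    Returns the variables that were actually applied.
    """
    settings = json.loads(Path(config_path).read_text(encoding="utf-8"))
    applied = {}
    for name, value in settings.items():
        if name not in environ:
            environ[name] = str(value)
            applied[name] = str(value)
    return applied


# Hypothetical usage at the top of the application's entry point:
#   apply_driver_env("driver_env.json")
#   import ibm_db
```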

I've asked before in https://github.com/orgs/pyinstaller/discussions/8632, but why do you need to set LD_LIBRARY_PATH or use os.add_dll_directory()?
How is this handled in the original unfrozen script? If it is expected that this is set up externally (before running python code), then again, it is not up to PyInstaller's hook to handle this sort of thing, but up to your application code.
But even so, why is it necessary at all? Is it required because executables in clidriver/bin are run as sub-processes? Is there a plug-in system that explicitly scans LD_LIBRARY_PATH?

@imavo
Author

imavo commented Jul 31, 2024

Thank you for the response.

I will try to answer your questions as I understand them; I am happy to clarify if I have misunderstood or explained anything poorly.

"...to test ibm_db in any (from pyinstaller's perspective) meaningful way without having actual database instance available?"

No. The sole purpose of the ibm_db module is to interface with a Db2 database or Db2 instance, so if you do not have an available Db2 database on Db2-LUW, i-series (license required), or z/OS (license required), then the module cannot be properly tested, either unfrozen or frozen, at least as far as I understand the term "testing".

You can run a Docker container on premises that delivers a Db2-LUW instance and database (Db2 Community Edition), if that is what you seek. You can also download and run an on-premises installation of Db2 Community Edition on a suitably equipped server, provided the relevant skills are available. I do not know whether the CI environment has this capability.

"...how much of it is specific to your application?" Almost nothing. I use ibm_db for many purposes and many simple apps, because Python lets me run the resulting scripts on Microsoft Windows, Linux x64, AIX, or zLinux without hassle, provided that I coded them properly, of course. Bundling is sometimes convenient. So there is not a single app. Previously, shell scripting on AIX/Linux/zLinux etc. would suffice, but the need for a Microsoft Windows capability never goes away.

An import ibm_db will only test the load-time dependencies, not the run-time dependencies. The import statement is still meaningful on Microsoft Windows x64 environments, where it exposes the DLL load-time dependencies, so it is a useful sanity check.
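
A rough sketch of how a smoke test could separate the two stages: a failure during import points at load-time dependencies, while a failure in a later call points at run-time or connectivity problems. The helper `probe` is hypothetical; the connection is passed in as a callable so the sketch stays runnable without ibm_db or a database.

```python
import importlib


def probe(module_name, connect=None):
    """Return (stage, message), where stage is 'import-failed',
    'connect-failed' or 'ok'."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Missing extension module or an unresolved shared-library/DLL dependency.
        return "import-failed", str(exc)
    if connect is None:
        return "ok", ""
    try:
        connect(module)
    except Exception as exc:
        # The module loaded, so this is a run-time or connectivity failure.
        return "connect-failed", str(exc)
    return "ok", ""


# Hypothetical usage; the connection string values are placeholders:
#   probe("ibm_db", lambda m: m.connect(
#       "DATABASE=...;HOSTNAME=...;PORT=...;PROTOCOL=TCPIP;UID=...;PWD=...;", "", ""))
```

Without a reachable database the second stage can only report that the connection failed; it cannot confirm that encrypted or token-based connections would work.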

"Would basic connection attempt show that the libs and/or clidriver are missing (e.g., would it throw a library-missing-error before a host-not-found error)?" YES. As long as a connectable database exists, a basic connection (that is, an unencrypted connection with userid/password authentication) to a pre-existing Db2 database will suffice for many situations (but not all), and it WILL show that the load-time dependencies are met. However, corporate sites often use encrypted connections, tokens for authentication (instead of userid/password), certificates for authentication (z/OS), gateways, and various other complexities. These are properties determined by the site or by the (remote) Db2 server; they are neither application-specific nor workstation-specific.

By "at least for my use cases" I mean raw usage of the ibm_db API, as distinct from higher-level usage of the same module by other Python modules. Higher-level usages include the DBI (a la Perl) via the ibm_db_dbi module (shipped with ibm_db), SQLAlchemy, Alembic, or the Django framework, all of which have the ibm_db module at the bottom, just encapsulating clidriver and gsk (the encryption toolkit).

Some Db2 customers never use the raw ibm_db interface and instead use SQLAlchemy or the Django framework, but they also sometimes want to bundle. The idea of my hooks is that the same Python source can run unfrozen or frozen. I also have hooks for SQLAlchemy and Django (not yet submitted), but they require the ibm_db hooks for analysis and runtime.

The Python ibm_db module is only a thin wrapper around an IBM-supplied call-level-interface binary driver (called clidriver) for Db2 databases on various Db2 server platforms (cloud, i-series (AS/400), z/OS, zLinux, AIX etc.), and it also works with suitably configured Informix databases. The wheel provided by current ibm_db versions delivers clidriver, and this is what the analysis hook and runtime hook exploit. In other words, customers that do NOT use the default clidriver (but instead use an alternative call-level-interface driver) are NOT the target here, and that is enforced by the hooks. (Call-level-interface drivers for Db2 are available from IBM, from Microsoft, and from other companies.) However, most enterprise customers will choose an IBM-supplied call-level-interface driver because that is what they pay for (the support etc.).

"That approach of collecting environment variables from build environment, storing them in a file, and then restoring them at run-time via run-time hook is, in my opinion, a no-go."

I do not understand your rationale, so perhaps you can explain more after you see my explanation. Only very specific, optional environment variables that are meaningful only to clidriver are preserved, and only if they exist at the source AND do not already exist at the target, and no other configuration method was provided. Those environment variables are NOT app-specific per se; they are specific to the Db2 server or the site. They influence the runtime behaviour of clidriver, which is why they need to be preserved under exactly those conditions.
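
The rule described above (capture only allowlisted variables that exist at the source, restore only those the target does not already define) could look roughly like the sketch below. This is not the code from the pull request: the function names are hypothetical, and the two names in `ALLOWED` are only illustrative, since the thread does not list which variables the hooks handle.

```python
# Illustrative allowlist (an assumption; the real list is not given in this thread).
ALLOWED = ("DB2DSDRIVER_CFG_PATH", "DB2CLIINIPATH")


def capture(source_env, allowed=ALLOWED):
    """Build time: keep only allowlisted variables that exist at the source."""
    return {name: source_env[name] for name in allowed if name in source_env}


def restore(saved, target_env):
    """Run time: set a saved variable only if the target does not define it.

    Returns the names that were restored.
    """
    restored = []
    for name, value in saved.items():
        if name not in target_env:
            target_env[name] = value
            restored.append(name)
    return restored
```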

The clidriver library does not require parameterisation via environment variables; they are just ONE way it can be configured. If such variables are necessary, they are not application-specific but are either site-specific or determined by the Db2 server. If clidriver is NOT configured with environment variables, it might be configured by an XML file, which the analysis hook captures via the datas object.
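
As an illustration of capturing such a file through datas, the sketch below builds PyInstaller-style `(source, dest_dir)` pairs using only the standard library. The helper name, the file name `db2dsdriver.cfg`, and the `clidriver` destination prefix are assumptions made for this example, not details taken from the submitted hook.

```python
from pathlib import Path


def collect_cfg_datas(clidriver_root, names=("db2dsdriver.cfg",)):
    """Return (source, dest_dir) pairs for driver config files under the root."""
    root = Path(clidriver_root)
    datas = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name in names:
            # Preserve the file's position relative to the driver root.
            dest_dir = Path("clidriver") / path.parent.relative_to(root)
            datas.append((str(path), dest_dir.as_posix()))
    return datas
```

In an analysis hook, a list built this way would be assigned to the module-level `datas` variable.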

"why do you need to set LD_LIBRARY_PATH or use os.add_dll_directory()"
This is not a requirement of the hooks per se; it is a requirement of the ibm_db module v3.2.3 and higher. It appears that the developers of the ibm_db module want to avoid DLL/lib problems caused by fixed PATH values, because it is common for multiple different versions of the zero-install clidriver to co-exist on the same host, running independently. So they advise removing any clidriver references from the PATH and instead locating the driver from within the Python script by searching the relevant site-packages tree.
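
A minimal sketch of that in-script setup, assuming the wheel places the driver at `<site-packages>/clidriver/bin` (the layout is an assumption here, and both helper names are hypothetical). `os.add_dll_directory()` exists only on Windows, so the call is guarded.

```python
import os
import sys
from pathlib import Path


def find_clidriver_bin(site_packages):
    """Return the absolute <site_packages>/clidriver/bin path, or None if absent."""
    candidate = (Path(site_packages) / "clidriver" / "bin").resolve()
    return candidate if candidate.is_dir() else None


def register_clidriver(site_packages):
    """On Windows, add the driver's bin directory to the DLL search path."""
    bin_dir = find_clidriver_bin(site_packages)
    if bin_dir is None or sys.platform != "win32":
        return None
    # Available on Windows only (Python 3.8+); requires an absolute path.
    return os.add_dll_directory(str(bin_dir))


# Hypothetical usage before importing the module:
#   import sysconfig
#   register_clidriver(sysconfig.get_paths()["platlib"])
#   import ibm_db
```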

"How is this handled in original unfrozen script?" The ibm_db module (release 3.2.3 and higher) requires that the unfrozen script call add_dll_directory() in the Microsoft Windows environment. So the DLL directory (lib directory) is NOT meant to be set externally. That is not my decision; it is what the ibm_db implementers chose (allegedly to avoid DLL problems with incorrect PATH settings on Microsoft Windows when multiple different versions of clidriver co-exist).

"why is it necessary at all?" The answer (from the ibm_db devs) is that multiple different versions of the zero-install clidriver can co-exist on the same host, so instead of having a fixed PATH value, the decision was to remove such entries from the PATH and configure them dynamically inside the Python script according to exactly which versions (i.e. locations) of the clidriver are in use.

"Is there a plug-in system that explicitly scans LD_LIBRARY_PATH?" If I understand you correctly, the answer is no, not by ibm_db itself (though of course the shared-library search order depends on the operating system loader and its configuration for shared objects/DLLs). In other words, the Python script just tells the operating system (Linux, Windows, AIX, or zLinux) which clidriver it wants to use, by calling add_dll_directory() etc.

I hope this answers your questions, and that I have understood them. Open to clarifications.
