
Statically link the python executable to libpython and disable the shared library #540

Draft · wants to merge 1 commit into main
Conversation

@zanieb (Member) commented Feb 26, 2025

As part of investigating #535, we posited that Conda's static linking of the python executable was part of the performance difference.

This change gives a 10% performance improvement (geometric mean on pyperformance).

@zanieb added labels platform:darwin (Specific to the macOS platform) and platform:linux (Specific to the Linux platform) on Feb 26, 2025
@indygreg (Collaborator)

Does this affect the actual run-time performance of the Python interpreter? Or just the time to start a new process and init the interpreter?

I.e. what is the benchmark actually measuring?

@zanieb (Member, Author) commented Feb 26, 2025

It seems to have a significant holistic effect; this matches some expectations set by the conda-forge folks and @carljm.

The referenced number is on the full pyperformance benchmark suite.

It also seemed to drastically improve performance on the benchmark in #535. I can tweak the number of calculations such that it's not dominated by interpreter startup, i.e., a runtime per Python process of >5s.
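As a sketch of the kind of adjustment described (a hypothetical CPU-bound workload, not the actual #535 benchmark), one can scale a pure-Python loop so that per-process runtime dominates interpreter startup:

```python
import time

def workload(n: int) -> int:
    # Pure CPU-bound loop; pick n large enough that total runtime per
    # process comfortably exceeds startup cost (e.g. > 5 s).
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
workload(2_000_000)
elapsed = time.perf_counter() - start
print(f"workload took {elapsed:.3f}s")
```

With startup amortized away, any remaining delta between the statically and dynamically linked builds reflects steady-state interpreter performance.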

I definitely intend to do more benchmarking before marking this as ready for review; I'll post full results then.

@zanieb (Member, Author) commented Feb 26, 2025

Note to self: should consider updating the following:

```python
extension_module_loading.append("shared-library")

extension_module_loading.append("shared-library")

bi["core"]["shared_lib"] = "install/lib/libpython%s%s.so.1.0" % (

bi["core"]["shared_lib"] = "install/lib/libpython%s%s.dylib" % (

"libpython_link_mode": "static" if "musl" in target_triple else "shared",
```

Though I think we may want to retain the shared library even if python doesn't link to it?
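As an aside, one can check from within a given interpreter how it was built; a minimal sketch using `sysconfig` (`Py_ENABLE_SHARED` is 1 when the `python` binary links libpython dynamically):

```python
import sysconfig

# Py_ENABLE_SHARED is 1 if the interpreter was configured with
# --enable-shared, i.e. bin/python links against libpython.so/.dylib.
shared = sysconfig.get_config_var("Py_ENABLE_SHARED")
link_mode = "shared" if shared else "static"
print(f"libpython link mode: {link_mode}")
```

This is also a convenient way to compare a Conda-installed Python against a PBS build when reproducing the benchmarks.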

@indygreg (Collaborator)

I think it is important to understand why static linking is faster. It could be many different things. Some of them might be fixable on shared libraries.

As a first step I would compare binaries without PGO, LTO, and BOLT.

Static linking unlocks all kinds of optimizations. I suspect what we're seeing is the result of aggressive inlining or something of that nature.

Also, I strongly prefer we still ship a libpython.so, even if Python doesn't link it. This gets you the performance without losing the shared library, which some customers will want.

@zanieb (Member, Author) commented Feb 26, 2025

Agree on all those points.

As a note, @geofft has been investigating some other problems that statically linking would solve. I expect he'll engage on exploring this further.

@indygreg (Collaborator)

Are you referring to symbol resolution issues with binary packages pulling in 3rd party libraries [that can overlap with the libraries we statically link]?

@geofft (Collaborator) commented Feb 26, 2025

There is a very old (2002) Debian bug reporting that statically linking libpython is good for performance: https://bugs.debian.org/131813

There too it's about steady-state runtime performance, not startup cost. I think the idea is that there is less back-and-forth between the executable and the library, but it's a good question why this is actually true, given that most of the hot code paths should be fully within the library.

Debian does something unusual in that they ship a libpython.so too, and the way they do it is to build twice, once with --enable-shared=no and once with --enable-shared=yes; the resulting package has the python3.x binary and libpython.a from the former and libpython.so from the latter. The linker seems to find libpython.so first if you specify -lpython3, though I am not yet sure whether this is guaranteed.

If we were to ship a libpython.a, then yes, downstream consumers would have an easier time of things because no rpath is required. (Notably, cargo test in a pyo3 project would work out of the box; right now you need to write a build.rs file to set the rpath for the test runner binary.) We would have to ensure the linker finds libpython.a first, or stop shipping a libpython.so.

Note that, as implied by what Debian does, whether we ship a libpython.a and/or a libpython.so is not necessarily correlated with which one our bin/python3 uses. So, we could (at the cost of a longer build time) ship a bin/python3 that statically links libpython but also continue to ship a shared library for people who want it.

(I suppose it's also possible that this doesn't actually require two builds, and with sufficient changes to the CPython build system, you can get it to produce both a libpython.a and a libpython.so in the same build.)

Fun fact, for the third-party libraries, a handful of downstream consumers would have an easier time if we moved from a static e.g. Tcl/Tk to a shared one. (Notably, PyInstaller outputs a C binary whose splash screen uses Tcl/Tk, so they need the ability to get to those libraries from C, before they've unpacked the Python distribution.)

@KRRT7 commented Feb 26, 2025

> There is a very old (2002) Debian bug reporting that statically linking libpython is good for performance: https://bugs.debian.org/131813

I wanted to note that statically linking libpython has yielded proven performance gains for Nuitka as well.

@geofft (Collaborator) commented Feb 26, 2025

That makes more intuitive sense to me: Nuitka compiles what it can, so you're going back and forth between the main program and libpython for the stuff that didn't get compiled. But bin/python is literally just `int main(int argc, char **argv) { return Py_BytesMain(argc, argv); }`, so there should be no back and forth. So the fact that there's a difference there too is a little weird, at least to my intuition!

(One mildly weird idea, btw, is that it's possible for a shared library to have an entry point—try running /lib/x86_64-linux-gnu/libc.so.6 directly, for instance. So you could imagine a distribution where bin/python is a symlink to ../lib/libpython.so, as opposed to an actual executable that depends on it, which would act sort of like a bin/python built against static libpython but be usable as a libpython.so too... but that might not help things if the actual performance problem is behavior differences from Py_ENABLE_SHARED being defined as opposed to merely being a loaded library.)

@indygreg (Collaborator)

> Fun fact, for the third-party libraries, a handful of downstream consumers would have an easier time if we moved from a static e.g. Tcl/Tk to a shared one. (Notably, PyInstaller outputs a C binary whose splash screen uses Tcl/Tk, so they need the ability to get to those libraries from C, before they've unpacked the Python distribution.)

Yeah, this is what I was getting at. Having them as separate libraries helps with symbol resolution issues. It was always on my undocumented backlog to split out at least tcl/tk and the x11 libraries into standalone shared libraries to mitigate this issue.

On the static vs. dynamic bit, python is literally just a function call into a function in libpython, so execution shouldn't be bouncing around between those two ELF binaries. The speedup can't be explained by that.

I think the speedup is coming from the compiler/linker no longer having to provide strong ABI guarantees around functions. Statically linking libpython enables it to optimize more aggressively without regard to function boundaries.

It might be doing some funky copying of functions, because I thought you still needed to export the libpython symbols so loaded extension modules could continue using them. You'd really need to do some low-level debugging, maybe disassembling, to get to the bottom of things. I'd feed the statically linked binary into Ghidra and look at the core interpreter loop to see if any funky inlining of libpython symbols is going on.
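One quick check of the symbol-export point: even a statically linked python must re-export libpython symbols for extension modules, which can be observed from Python itself via `dlopen(NULL)` with ctypes (a sketch; works on ELF/Mach-O platforms):

```python
import ctypes
import sys

# PyDLL(None) is dlopen(NULL): a handle to the running process's global
# symbol table. If Py_GetVersion resolves, libpython symbols are exported
# from the process regardless of whether libpython was linked statically.
handle = ctypes.PyDLL(None)
handle.Py_GetVersion.restype = ctypes.c_char_p
version = handle.Py_GetVersion().decode()
print(version)
```

If this resolves on the statically linked build, extension-module loading should be unaffected by the link-mode change at the symbol level; any inlining inside the binary would be invisible here and would need the disassembly pass described above.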

@traversaro

Just as an additional FYI, it seems that some corner-case downstream uses of Python do not work as expected when using a statically linked Python; see for example (just a few I encountered in the past):

I also recall a lot of macOS segfaults in CMake projects creating extensions as SHARED instead of MODULE libraries, but I can't find an issue at the moment. Probably nothing blocking, but something it could make sense to consider.

@traversaro

> I recall also a lot of macos segfaults in CMake projects creating extensions as SHARED instead of MODULE libraries, but I can't find an issue at the moment.

Found: pybind/pybind11#3907.

@indygreg (Collaborator)

Yeah, these linked issues seemingly confirm what I thought: extension module builds really want to run against the Python they were built against. If there is a mismatch between the build and runtime Python, things can blow up.

In Conda's world, they have their own universe of binary dependencies. But in the PBS / uv world, there isn't as much of a buffer here. So my fear is that if PBS ships a static libpython, we're signing ourselves up for all kinds of random extension module breakage.

We could assess risk by downloading popular PyPI packages and verifying their extension modules load and run. But the "run" part is difficult, since there's no guaranteed way to run tests from a wheel. And even if PyPI is fine, you are still going to find people building extensions behind corporate walls who encounter issues.

I want to support this work. But I'm worried about side-effects.
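A minimal sketch of that load-verification idea (using stdlib extension modules as stand-ins; a real harness would install a list of popular PyPI packages into a fresh environment first and discover their module names):

```python
import importlib

# Stand-ins for extension-heavy packages; these stdlib modules are
# compiled C extensions, so they exercise the same import machinery.
candidates = ["math", "zlib", "_ctypes", "select"]
failures = {}
for name in candidates:
    try:
        importlib.import_module(name)
    except ImportError as exc:
        failures[name] = str(exc)
print("load failures:", failures)
```

Running the same harness under both the static and shared builds would at least catch symbol-resolution breakage at import time, even if it can't exercise each package's full test suite.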

@zanieb (Member, Author) commented Feb 27, 2025

Thanks for sharing those @traversaro! That's helpful context.

Just for some context on how I'm thinking about this pull request: I posted this for discussion and testing; I'm not in any rush to land it.

@traversaro commented Feb 27, 2025

> Yeah, these linked issues seemingly confirm what I thought: extension module builds really want to run against the Python they were built against. If there is a mismatch between the build and runtime Python, things can blow up.

Just to clarify, all those issues were related to conda installations, so that was not the problem, as compatible versions of python and libpython were used. I am not saying that mismatching build and runtime Pythons may not be a problem, just that the issues I linked are related to other problems. I guess what connects all the linked issues is how the macOS linking model deals with duplicate symbols (even if the symbols are identical) in the Python executable and in a libpython linked into a Python extension that is being opened via dlopen.

@jjhelmus commented Mar 1, 2025

Statically linking libpython can give a significant speed-up, but this configuration is not universal. I think Fedora/RHEL dynamically link, whereas Debian/Ubuntu statically link.

When libpython is dynamically linked, there used to be a significant performance gain from disabling semantic interposition (via -fno-semantic-interposition). Fedora found performance improvements of up to ~27% when this flag was set. Note that Fedora found similar gains from statically linking.

This flag has been included by default since Python 3.10 if --enable-optimizations is specified and GCC is used. I think disabling semantic interposition for symbols within the same library is the default for Clang, so this flag is not needed there.

cf. conda-forge/python-feedstock#287

@zanieb (Member, Author) commented Mar 1, 2025

We do use --enable-optimizations and Clang (for most builds), so the -fno-semantic-interposition case should be accounted for.
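For reference, whether a given build passed the flag can be checked from `sysconfig`; on Clang-based builds it typically will not appear, since Clang does not assume cross-DSO semantic interposition by default:

```python
import sysconfig

# CFLAGS records the compiler flags the interpreter was built with;
# GCC builds using --enable-optimizations (3.10+) will include the flag,
# Clang builds generally will not need or show it.
cflags = sysconfig.get_config_var("CFLAGS") or ""
has_flag = "-fno-semantic-interposition" in cflags
print("built with -fno-semantic-interposition:", has_flag)
```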
