use -flto=thin for clang-cl on Windows #131005

Draft
wants to merge 22 commits into
base: main

chris-eibl
Contributor

This started off as a build time analysis, since @zanieb asked for numbers in #129907 (comment) and collecting them manually is too painful.
So I hacked two new targets, BeginTimeStamp and EndTimeStamp, into pyproject.props. They are only needed for the detailed per-project timings; the overall wall times come for free, since msbuild and -m test --pgo both print timings. analyse_build_times.py is still useful there to create the tables.

Numbers represent seconds. Please note that the sum of the detailed project times does not match the total time, because most of the projects are built in parallel, except _freeze_module and python314. Therefore, I have listed them first in the details table and then sorted the others by build time of the first column.
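
To illustrate why the per-project sum exceeds the wall time, here is a minimal sketch. The record layout and the time stamps are made up for illustration (the real output of the BeginTimeStamp/EndTimeStamp targets is not shown here), and `project_durations` and `wall_time` are hypothetical helpers, not functions from analyse_build_times.py.

```python
from datetime import datetime

# Hypothetical (project, begin, end) records. The real output format of the
# BeginTimeStamp/EndTimeStamp targets is an assumption here.
RECORDS = [
    ("_freeze_module", "12:00:00.0", "12:00:31.0"),  # built on its own
    ("python314", "12:00:31.0", "12:01:15.9"),       # built on its own
    ("liblzma", "12:01:15.9", "12:01:30.7"),         # built in parallel ...
    ("sqlite3", "12:01:15.9", "12:01:24.6"),         # ... with this one
]


def parse(stamp):
    """Parse an assumed HH:MM:SS.f time stamp."""
    return datetime.strptime(stamp, "%H:%M:%S.%f")


def project_durations(records):
    """Seconds spent in each project, from its begin and end stamps."""
    return {name: (parse(end) - parse(begin)).total_seconds()
            for name, begin, end in records}


def wall_time(records):
    """Seconds between the first begin stamp and the last end stamp."""
    begins = [parse(begin) for _, begin, _ in records]
    ends = [parse(end) for _, _, end in records]
    return (max(ends) - min(begins)).total_seconds()


durations = project_durations(RECORDS)
print(f"sum of project times: {sum(durations.values()):.1f}")
print(f"wall time:            {wall_time(RECORDS):.1f}")
```

Because liblzma and sqlite3 overlap in this made-up example, the summed project times come out larger than the wall time, which is the same effect seen in the detailed tables.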

In the pgupdate details we still see _freeze_module, because #130420 is not on that branch. Hence, we see what that PR saves us.

I intentionally branched off commit 9db1a29 to have the same environment.

  • MSVC is much faster for both debug and release builds
  • older clangs are faster than newer in debug builds

Debug build times:

| | debug_clang_18.1.8 | debug_clang_19.1.1 | debug_clang_20.1.0-rc2 | debug_msvc |
|---|---|---|---|---|
| total time | 128.3 | 154.8 | 163.5 | 84.9 |

Release build times:

| | release_clang_18.1.8 | release_clang_19.1.1 | release_clang_20.1.0-rc2 | release_msvc |
|---|---|---|---|---|
| total time | 278.3 | 277.2 | 274.2 | 172.3 |

PGO build times:

  • MSVC is still faster in the pginstr phase
  • but the instrumented binaries take much longer to execute (pgo is short for the pgo task in the table)
  • kill is due to call :Kill in case of build.bat --pgo - can be ignored, takes almost no time
  • pgupd phase takes longer for MSVC
  • so the overall build.bat --pgo times are longer for MSVC

| | pgo_clang_18.1.8 | pgo_clang_19.1.1 | pgo_clang_20.1.0-rc2 | pgo_clang_thin_20.1.0-rc2 | pgo_msvc |
|---|---|---|---|---|---|
| pginstr | 288.7 | 279.5 | 297.2 | 219.3 | 155.9 |
| pgo | 77.0 | 70.0 | 70.0 | 69.0 | 559.0 |
| kill | 1.1 | 1.2 | 1.2 | 0.5 | 1.1 |
| pgupd | 284.8 | 271.5 | 282.8 | 231.7 | 359.0 |
| total time | 651.7 | 622.1 | 651.2 | 520.6 | 1075.0 |
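
As a quick cross-check of the PGO table, the following sketch sums the four phases per toolchain and compares the result with the reported total. The numbers are copied from the table; `relative_saving` is a hypothetical helper, and the small differences against the reported totals are assumed to be rounding.

```python
# Phase times in seconds, copied from the PGO table above.
PHASES = ("pginstr", "pgo", "kill", "pgupd")
TIMES = {
    "pgo_clang_18.1.8": (288.7, 77.0, 1.1, 284.8),
    "pgo_clang_19.1.1": (279.5, 70.0, 1.2, 271.5),
    "pgo_clang_20.1.0-rc2": (297.2, 70.0, 1.2, 282.8),
    "pgo_clang_thin_20.1.0-rc2": (219.3, 69.0, 0.5, 231.7),
    "pgo_msvc": (155.9, 559.0, 1.1, 359.0),
}
REPORTED_TOTAL = {
    "pgo_clang_18.1.8": 651.7,
    "pgo_clang_19.1.1": 622.1,
    "pgo_clang_20.1.0-rc2": 651.2,
    "pgo_clang_thin_20.1.0-rc2": 520.6,
    "pgo_msvc": 1075.0,
}


def relative_saving(baseline, candidate):
    """Fraction of the baseline time saved by the candidate."""
    return 1.0 - candidate / baseline


for name, phases in TIMES.items():
    print(f"{name}: phases sum to {sum(phases):.1f}, "
          f"reported {REPORTED_TOTAL[name]:.1f}")

full = REPORTED_TOTAL["pgo_clang_20.1.0-rc2"]
thin = REPORTED_TOTAL["pgo_clang_thin_20.1.0-rc2"]
print(f"-flto=thin saves {relative_saving(full, thin):.1%} of the total time")
```

With these numbers, -flto=thin cuts the overall build.bat --pgo time for clang 20.1.0-rc2 by about 20%.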

Very interesting: pyexpat and _elementtree take much longer to build with 20.1.0-rc2 in the pginstr phase (52.7 s and 51.8 s, see details), but come back to "normal" with -flto=thin. Because these are such outliers, I retested several times with the same result :-O

-flto=thin:

Since I now have the timing infrastructure, I wanted to try -flto=thin, too. Unsurprisingly, build times are faster.
Performance seems neutral:

| Benchmark | clang.pgo.20.1.0-rc2 | clang.pgo.thin.20.1.0-rc2 |
|---|---|---|
| Geometric mean | (ref) | 1.00x faster |
Detailed pybenchmark results

Benchmark clang.pgo.20.1.0-rc2 clang.pgo.thin.20.1.0-rc2
float 95.0 ms 89.7 ms: 1.06x faster
json_loads 29.8 us 28.6 us: 1.04x faster
mdp 2.86 sec 2.77 sec: 1.03x faster
html5lib 68.3 ms 66.2 ms: 1.03x faster
async_tree_none_tg 330 ms 320 ms: 1.03x faster
pyflate 518 ms 505 ms: 1.03x faster
sqlite_synth 3.21 us 3.13 us: 1.03x faster
pidigits 228 ms 223 ms: 1.02x faster
bench_mp_pool 168 ms 165 ms: 1.02x faster
async_tree_eager_io 742 ms 727 ms: 1.02x faster
generators 34.5 ms 33.8 ms: 1.02x faster
comprehensions 18.3 us 17.9 us: 1.02x faster
async_tree_cpu_io_mixed 641 ms 629 ms: 1.02x faster
scimark_sparse_mat_mult 4.51 ms 4.43 ms: 1.02x faster
async_tree_memoization 425 ms 417 ms: 1.02x faster
sympy_expand 538 ms 529 ms: 1.02x faster
unpack_sequence 57.0 ns 56.0 ns: 1.02x faster
regex_dna 209 ms 205 ms: 1.02x faster
async_generators 465 ms 458 ms: 1.02x faster
scimark_sor 140 ms 137 ms: 1.02x faster
sympy_str 319 ms 314 ms: 1.02x faster
async_tree_io_tg 751 ms 740 ms: 1.01x faster
regex_effbot 3.14 ms 3.10 ms: 1.01x faster
async_tree_eager_tg 272 ms 268 ms: 1.01x faster
pickle_dict 27.3 us 27.0 us: 1.01x faster
async_tree_eager_memoization_tg 363 ms 359 ms: 1.01x faster
sympy_integrate 22.5 ms 22.2 ms: 1.01x faster
sympy_sum 181 ms 179 ms: 1.01x faster
2to3 390 ms 386 ms: 1.01x faster
hexiom 6.68 ms 6.61 ms: 1.01x faster
docutils 3.03 sec 3.00 sec: 1.01x faster
sqlglot_normalize 121 ms 120 ms: 1.01x faster
async_tree_memoization_tg 392 ms 389 ms: 1.01x faster
async_tree_cpu_io_mixed_tg 614 ms 609 ms: 1.01x faster
tomli_loads 2.20 sec 2.18 sec: 1.01x faster
spectral_norm 102 ms 101 ms: 1.01x faster
python_startup_no_site 34.4 ms 34.2 ms: 1.01x faster
genshi_text 24.6 ms 24.5 ms: 1.01x faster
dulwich_log 119 ms 118 ms: 1.00x faster
go 128 ms 128 ms: 1.00x faster
deltablue 3.62 ms 3.63 ms: 1.00x slower
unpickle_pure_python 247 us 248 us: 1.00x slower
xml_etree_generate 107 ms 107 ms: 1.01x slower
django_template 39.2 ms 39.4 ms: 1.01x slower
coroutines 24.8 ms 25.0 ms: 1.01x slower
mako 13.3 ms 13.5 ms: 1.01x slower
unpickle 15.9 us 16.1 us: 1.01x slower
nbody 119 ms 121 ms: 1.01x slower
fannkuch 465 ms 472 ms: 1.01x slower
crypto_pyaes 81.3 ms 82.6 ms: 1.02x slower
json_dumps 11.5 ms 11.7 ms: 1.02x slower
deepcopy 285 us 291 us: 1.02x slower
pprint_safe_repr 858 ms 876 ms: 1.02x slower
xml_etree_iterparse 136 ms 139 ms: 1.02x slower
gc_traversal 5.03 ms 5.14 ms: 1.02x slower
meteor_contest 115 ms 117 ms: 1.02x slower
deepcopy_memo 33.8 us 34.7 us: 1.03x slower
richards_super 51.1 ms 52.6 ms: 1.03x slower
scimark_fft 327 ms 337 ms: 1.03x slower
richards 44.9 ms 46.3 ms: 1.03x slower
pickle_list 4.83 us 4.99 us: 1.03x slower
deepcopy_reduce 2.93 us 3.03 us: 1.03x slower
pprint_pformat 1.74 sec 1.80 sec: 1.03x slower
logging_simple 10.9 us 11.4 us: 1.05x slower
logging_format 12.1 us 12.6 us: 1.05x slower
xml_etree_parse 197 ms 208 ms: 1.05x slower
Geometric mean (ref) 1.00x faster
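
For readers who want to see how a "geometric mean" line can come out neutral although individual benchmarks move by a few percent, here is a minimal sketch. It uses only four rows from the table above and assumes the summary is the geometric mean of the per-benchmark ratios reference/changed; `speedup_geomean` is a hypothetical helper and not necessarily how the benchmark tool computes its summary.

```python
from statistics import geometric_mean

# (reference, thin) timings for a few rows of the table above. The units
# cancel out in the ratio, so ms and us values can be mixed.
SAMPLE = {
    "float": (95.0, 89.7),
    "json_loads": (29.8, 28.6),
    "logging_simple": (10.9, 11.4),
    "xml_etree_parse": (197.0, 208.0),
}


def speedup_geomean(pairs):
    """Geometric mean of reference/changed ratios; above 1.0 means faster."""
    return geometric_mean(ref / changed for ref, changed in pairs)


print(f"geometric mean speedup: {speedup_geomean(SAMPLE.values()):.3f}x")
```

Speedups and slowdowns of similar size cancel each other in a geometric mean, which fits the picture of the full table: roughly as many benchmarks get slightly faster as get slightly slower.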

Maybe this is the reason why it is used in makefile based clang builds, too? CONFIGURE_CFLAGS_NODIST and CONFIGURE_LDFLAGS_NOLTO both use -flto=thin when I configure for clang in WSL Ubuntu-24.04.
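
To help others verify this on their own configure-based build, here is a small sketch. It assumes that sysconfig exposes the two Makefile variables named above, which should be the case on POSIX builds where sysconfig reads the generated Makefile; on Windows they are expected to be missing. `uses_thin_lto` is a hypothetical helper.

```python
import sysconfig


def uses_thin_lto(flags):
    """True if a flags string contains the exact option -flto=thin."""
    return isinstance(flags, str) and "-flto=thin" in flags.split()


# Variable names taken from the paragraph above; a missing variable yields None.
for var in ("CONFIGURE_CFLAGS_NODIST", "CONFIGURE_LDFLAGS_NOLTO"):
    value = sysconfig.get_config_var(var)
    print(f"{var}: {value!r} -> thin LTO: {uses_thin_lto(value)}")
```

Run it with the interpreter produced by the build in question, since the values describe how that particular interpreter was configured.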

What to do with it?

Most probably a skip news. I don't even know whether we want anything from this branch. Maybe just -flto=thin?
I'd like others to verify / experiment.

Detailed build times:

Detailed debug build times

debug_clang_18.1.8 debug_clang_19.1.1 debug_clang_20.1.0-rc2 debug_msvc
_freeze_module 31.0 36.5 38.5 16.8
python314 44.9 56.5 62.7 31.4
liblzma 14.8 15.7 14.3 7.5
sqlite3 8.7 8.0 7.8 6.9
_bz2 5.2 6.4 5.1 7.7
_wmi 5.0 5.3 4.5 4.7
_ctypes 5.0 5.9 5.7 7.9
_decimal 4.2 5.5 5.2 3.4
_testcapi 4.0 6.4 7.0 2.2
_ssl 3.4 5.0 3.8 2.3
_overlapped 3.1 3.9 3.6 2.5
_uuid 3.1 1.9 4.2 4.0
_socket 3.0 3.3 3.0 4.7
_tkinter 3.0 3.6 3.5 2.2
_sqlite3 2.9 4.0 4.2 1.4
_hashlib 2.4 2.9 2.8 4.2
venvwlauncher 2.4 2.8 2.8 4.5
_elementtree 2.4 2.8 2.5 2.3
_testlimitedcapi 2.3 3.6 3.9 1.4
_multiprocessing 2.3 3.0 2.6 1.8
_asyncio 2.3 2.8 2.7 3.5
pyshellext 2.2 2.3 2.4 3.3
_zoneinfo 2.1 2.7 2.5 3.1
unicodedata 2.0 2.2 2.0 2.9
py 1.9 2.1 2.2 3.7
pyw 1.9 2.1 2.2 4.0
_queue 1.9 2.0 2.1 3.5
venvlauncher 1.9 1.9 1.5 3.8
pyexpat 1.8 1.7 1.6 1.8
_ctypes_test 1.6 1.6 1.7 1.1
select 1.6 1.7 2.3 2.9
_testinternalcapi 1.5 2.0 2.0 1.1
winsound 1.4 1.8 1.7 7.5
_testclinic 1.1 1.3 1.4 0.8
_testembed 1.0 1.2 1.3 0.8
pythonw 0.9 1.1 1.1 0.7
_testconsole 0.8 1.0 1.1 0.7
_testbuffer 0.8 0.9 1.0 0.6
_lzma 0.8 1.0 1.1 1.1
_testimportmultiple 0.7 0.8 0.9 0.5
python 0.7 1.4 1.0 0.6
_testmultiphase 0.7 0.9 1.0 0.6
_testclinic_limited 0.7 0.8 0.9 0.5
_testsinglephase 0.7 0.9 1.0 0.6
python3 0.5 0.5 0.5 0.5
total 186.8 221.8 227.1 169.8

Detailed release build times

release_clang_18.1.8 release_clang_19.1.1 release_clang_20.1.0-rc2 release_msvc
_freeze_module 26.4 35.5 37.6 13.8
python314 147.0 135.6 131.0 98.2
sqlite3 50.5 45.8 44.2 18.8
liblzma 14.6 15.3 15.3 11.7
_decimal 10.9 11.0 10.8 7.0
_bz2 7.4 7.8 7.2 11.5
_ctypes 7.3 7.3 7.2 9.7
_testcapi 6.0 7.9 8.3 2.7
pyexpat 5.6 5.0 4.3 5.9
_ssl 5.0 5.3 5.5 5.2
_wmi 4.8 4.4 5.0 6.2
_tkinter 4.4 4.4 4.3 6.0
_ctypes_test 3.9 3.7 3.6 6.3
_socket 3.8 4.0 4.3 5.1
_elementtree 3.5 3.5 3.6 5.0
_uuid 3.4 5.0 4.0 5.2
_testlimitedcapi 3.4 4.5 4.8 2.7
_lzma 3.3 3.0 3.1 6.0
_asyncio 3.3 3.5 3.5 1.9
_hashlib 3.0 3.3 3.7 4.9
_overlapped 3.0 3.2 3.4 3.4
venvwlauncher 2.8 3.0 2.8 6.0
_zoneinfo 2.7 4.0 3.1 4.8
pyw 2.6 2.8 2.8 6.4
unicodedata 2.5 2.5 2.4 2.6
_sqlite3 2.4 3.1 3.1 1.4
py 2.4 2.5 2.6 3.4
pyshellext 2.4 2.7 2.7 6.0
_multiprocessing 2.0 2.7 1.9 3.3
_testclinic 2.0 2.0 2.0 1.0
_testinternalcapi 1.9 2.3 2.4 1.2
venvlauncher 1.9 1.8 1.9 3.5
_queue 1.6 2.3 2.4 2.3
select 1.5 1.6 1.6 2.6
_testembed 1.3 1.5 1.5 0.8
winsound 1.2 1.7 1.9 7.6
_testbuffer 1.2 1.4 1.4 0.7
_testconsole 0.8 1.1 1.2 0.7
pythonw 0.8 1.1 1.1 0.7
_testmultiphase 0.8 1.0 1.0 0.6
_testsinglephase 0.7 0.9 1.0 0.6
_testclinic_limited 0.7 0.9 0.9 0.5
python 0.7 0.9 1.0 0.6
_testimportmultiple 0.6 0.9 0.9 0.5
xxlimited 0.6 0.9 0.8 0.5
xxlimited_35 0.6 0.8 0.8 0.5
python3 0.5 0.5 0.5 0.4
total 359.9 365.9 360.3 296.1

Detailed pginstrument build times

pgo_clang_18.1.8 pgo_clang_19.1.1 pgo_clang_20.1.0-rc2 pgo_clang_thin_20.1.0-rc2 pgo_msvc
_freeze_module 26.6 35.1 38.5 40.0 14.0
python314 159.6 139.7 141.5 81.3 86.6
sqlite3 50.0 45.1 46.0 42.4 18.2
_ctypes 14.3 8.6 6.9 7.5 8.1
_bz2 12.7 8.4 7.0 4.9 6.1
liblzma 12.7 18.3 18.2 16.5 11.3
_decimal 11.4 10.9 12.4 7.7 3.6
pyexpat 10.4 6.1 52.7 3.9 6.2
_testcapi 5.8 7.6 8.3 7.1 2.7
_asyncio 5.5 4.4 4.0 5.2 3.4
_elementtree 5.0 5.4 51.8 5.3 3.2
_wmi 4.9 6.1 4.5 3.0 5.7
_lzma 4.5 3.5 3.8 1.8 6.2
_ssl 3.9 5.6 3.7 5.5 5.6
_ctypes_test 3.9 3.6 3.7 3.4 6.3
venvwlauncher 3.6 2.8 3.3 2.7 4.1
_testlimitedcapi 3.4 4.4 4.9 4.3 2.7
_sqlite3 3.0 3.4 3.4 2.8 1.4
_overlapped 2.9 3.2 4.5 3.2 2.5
_zoneinfo 2.9 3.6 3.1 3.4 3.3
_socket 2.8 4.2 2.4 3.7 4.6
unicodedata 2.7 2.6 2.7 3.0 2.4
_tkinter 2.6 4.6 2.2 4.1 3.4
_multiprocessing 2.5 1.7 3.5 2.7 9.8
pyw 2.4 2.8 2.7 2.7 3.2
py 2.4 2.5 2.6 2.5 3.2
pyshellext 2.3 2.8 2.7 2.6 2.9
_testclinic 2.0 2.0 2.0 1.9 1.0
_hashlib 1.9 3.3 1.8 3.1 5.0
_testinternalcapi 1.8 2.2 2.4 2.2 1.2
venvlauncher 1.7 1.8 1.8 1.7 2.7
select 1.4 1.6 1.8 2.2 2.3
_uuid 1.4 1.6 1.6 3.2 2.4
_queue 1.4 1.7 1.6 2.3 2.8
winsound 1.4 1.6 1.7 3.3 2.4
_testembed 1.3 1.4 1.5 1.5 0.8
_testbuffer 1.3 1.3 1.4 1.3 0.7
_testconsole 0.8 1.0 1.1 1.1 0.7
pythonw 0.8 1.1 1.1 1.1 0.7
_testmultiphase 0.7 0.9 1.0 1.0 0.6
_testsinglephase 0.7 0.9 1.0 1.0 0.6
_testclinic_limited 0.7 0.9 0.9 0.9 0.6
python 0.7 0.9 1.0 0.9 0.6
_testimportmultiple 0.6 0.8 0.9 0.9 0.6
python3 0.5 0.5 0.5 0.5 0.5
total 385.7 372.4 465.8 303.3 257.1

Detailed pgupdate build times

pgo_clang_18.1.8 pgo_clang_19.1.1 pgo_clang_20.1.0-rc2 pgo_clang_thin_20.1.0-rc2 pgo_msvc
_freeze_module 26.7 34.7 38.0 39.5 13.8
python314 154.3 137.6 141.9 95.4 287.1
sqlite3 47.4 46.1 44.4 42.9 16.1
_ctypes 12.1 6.9 8.0 7.2 6.3
liblzma 12.0 16.6 17.3 16.5 7.9
_bz2 11.0 7.1 7.8 5.5 9.1
_decimal 10.0 10.8 11.2 8.7 6.1
_testcapi 6.0 7.6 8.6 7.3 2.7
pyexpat 5.3 4.4 4.6 3.6 4.7
_elementtree 4.1 3.4 3.5 4.5 12.0
_lzma 4.1 3.3 3.2 1.9 4.8
_ctypes_test 3.8 3.6 3.7 3.4 6.4
unicodedata 3.6 4.1 3.2 3.0 3.0
venvwlauncher 3.5 2.8 3.1 3.0 5.2
_testlimitedcapi 3.4 4.4 5.0 4.2 2.7
_ssl 3.3 5.1 5.2 5.6 2.6
_asyncio 3.0 4.7 4.5 4.6 3.4
_overlapped 2.9 2.9 3.5 3.7 7.2
_uuid 2.9 3.4 2.6 2.8 7.2
_zoneinfo 2.8 3.1 3.2 3.2 2.5
_wmi 2.6 3.3 3.5 3.1 5.3
_sqlite3 2.6 2.9 3.1 2.7 1.7
pyw 2.4 2.5 2.6 2.6 2.7
_socket 2.4 3.8 4.3 3.5 14.6
py 2.3 2.5 2.6 2.7 2.5
pyshellext 2.3 2.5 2.7 2.6 2.4
_tkinter 2.2 3.6 4.0 4.2 4.4
_testclinic 2.0 2.0 2.0 1.9 1.0
_testinternalcapi 1.9 2.2 2.4 2.2 1.2
venvlauncher 1.7 1.6 1.7 1.5 2.4
_multiprocessing 1.5 1.8 2.8 2.6 2.7
select 1.5 1.6 1.6 2.0 3.5
_queue 1.5 1.7 1.9 2.2 2.5
_hashlib 1.5 3.5 3.1 3.3 3.2
winsound 1.3 1.6 1.8 3.0 4.5
_testembed 1.3 1.5 1.5 1.4 0.8
_testbuffer 1.2 1.3 1.4 1.3 0.8
_testconsole 0.8 1.1 1.1 1.0 0.7
pythonw 0.8 1.0 1.1 1.1 0.7
_testmultiphase 0.7 1.0 1.0 1.1 0.6
_testsinglephase 0.7 0.9 1.0 1.0 0.6
python 0.7 0.9 1.0 0.9 0.6
_testclinic_limited 0.6 0.9 0.9 0.9 0.6
_testimportmultiple 0.6 0.8 0.9 0.9 0.6
python3 0.5 0.5 0.5 0.5 0.4
total 360.0 359.7 372.9 316.8 472.1

chris-eibl and others added 21 commits February 9, 2025 16:30
for _freeze_module in case of clang-cl to speed up the build
Speeds up both MSVC and clang-cl builds.

Should most probably be done in a separate PR and issue, though.
I've previously gotten compile errors from clang, because the needed
intrinsics were not available without that option.

Cannot reproduce anymore. Most probably, because I've upgraded to
Visual Studio 17.13.0 Preview 5.0, which now ships with clang 19.1.1
instead of 18.1.8 and they've done that for compatibility with MSVC?

Anyway, let's keep the PR small :)
This reverts commit 26fb51f.

Shall be done in a separate PR.
This better matches the behaviour of build.bat in case of MSVC PGO builds.
and make it a target with inputs and outputs
because the name is too MSVC specific
@@ -23,7 +23,7 @@ extern "C" {
declaration \
_GENERATE_DEBUG_SECTION_LINUX(name)

-#if defined(MS_WINDOWS)
+#if defined(MS_WINDOWS)&& !defined(__clang__)
Contributor Author

Interestingly, for debug builds I now need this, too. Never needed before, see #130040 (comment). Will anyway come with that PR...

@@ -420,6 +420,7 @@
<ClCompile Include="..\Modules\blake2module.c">
<PreprocessorDefinitions Condition="'$(Platform)' == 'x64'">HACL_CAN_COMPILE_SIMD128;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<PreprocessorDefinitions Condition="'$(Platform)' == 'x64'">HACL_CAN_COMPILE_SIMD256;%(PreprocessorDefinitions)</PreprocessorDefinitions>
<AdditionalOptions Condition="'$(Platform)' == 'x64' and '$(LLVMToolsVersion)' &lt; '19'">/arch:AVX</AdditionalOptions>
Contributor Author

Taken from #130447 as an interim measure only, until we vendor the latest hacl-star, which abstracts the AVX intrinsics (#130960)

@@ -0,0 +1,264 @@
import argparse
Contributor Author

Not a beauty but it creates the tables for me :)

@@ -167,6 +167,30 @@ public override bool Execute() {
</Task>
</UsingTask>

<Target Name="BeginTimeStamp" BeforeTargets="PrepareForBuild">
Contributor Author

Only needed for detailed timings per project. Could be guarded behind a Condition="'$(PrintBuildTimeStamps)' == 'true'", in case this shall be merged.

from datetime import datetime, date, time

# Verstrichene Zeit 00:00:00.74
msbuild_time_str = "Verstrichene Zeit"
Contributor Author

That's the thing I dislike most: msbuild does this localized :(
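
One possible way around the localized label, sketched here under assumptions: instead of matching the label text ("Verstrichene Zeit" in a German msbuild, "Time Elapsed" in an English one), match only the trailing hh:mm:ss.ff value. `parse_elapsed` is a hypothetical helper, and the sketch assumes the summary line is the only line of interest that ends in such a value.

```python
import re
from datetime import timedelta

# Matches a trailing hh:mm:ss.ff value, whatever label precedes it.
# Both "." and "," are accepted as decimal separator, as an assumption
# about what other locales might print.
ELAPSED_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[.,](\d+)\s*$")


def parse_elapsed(line):
    """Return the elapsed time at the end of the line, or None."""
    match = ELAPSED_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds, fraction = match.groups()
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds) + float("0." + fraction))


for line in ("Verstrichene Zeit 00:00:00.74", "Time Elapsed 00:02:08.30"):
    print(line, "->", parse_elapsed(line).total_seconds(), "seconds")
```

The trade-off is that any other log line ending in the same pattern would match as well, so a real script would probably restrict the search to the last lines of the msbuild output.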
