Segfault using Julia 1.11-alpha2 on AMD EPYC 9554 #68

Closed
jonas-schulze opened this issue Mar 28, 2024 · 7 comments

Comments

@jonas-schulze

Running

using BFloat16s # v0.5

A = ones(BFloat16, 10)
A + A

sometimes leads to a segfault, sometimes a stack overflow, and sometimes one CPU sits at 100% until ^Ced.
Nothing breaks on my Intel Core i5-12600K that does not support avx512_bf16.

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-alpha2 (2024-03-18)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> include("../mwe.jl")
ERROR: LoadError: StackOverflowError:
in expression starting at /home/jschulze/tmp/julia-bfloat16/mwe.jl:4

julia> include("../mwe.jl")
ERROR: LoadError: StackOverflowError:
in expression starting at /home/jschulze/tmp/julia-bfloat16/mwe.jl:4

julia> include("../mwe.jl")
ERROR: LoadError: StackOverflowError:
in expression starting at /home/jschulze/tmp/julia-bfloat16/mwe.jl:4

julia> 
jschulze@hostname:~/tmp/julia-bfloat16/v0.5.0$ julia +1.11 ../mwe.jl 
ERROR: LoadError: StackOverflowError:
in expression starting at /home/jschulze/tmp/julia-bfloat16/mwe.jl:4
jschulze@hostname:~/tmp/julia-bfloat16/v0.5.0$ julia +1.11 ../mwe.jl 
Segmentation fault (core dumped)
jschulze@hostname:~/tmp/julia-bfloat16/v0.5.0$ julia +1.11 ../mwe.jl 
^C
[207628] signal 2: Interrupt
in expression starting at none:0
_ZN4llvm8ExpectedINS_8ArrayRefINS_6object12Elf_Sym_ImplINS2_7ELFTypeILNS_7support10endiannessE1ELb1EEEEEEEED2Ev at /home/jschulze/.julia/juliaup/julia-1.11.0-alpha2+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZNK4llvm6object13ELFObjectFileINS0_7ELFTypeILNS_7support10endiannessE1ELb1EEEE14getSymbolFlagsENS0_11DataRefImplE at /home/jschulze/.julia/juliaup/julia-1.11.0-alpha2+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZNK4llvm6object10ObjectFile14getSymbolValueENS0_11DataRefImplE at /home/jschulze/.julia/juliaup/julia-1.11.0-alpha2+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
_ZNK4llvm6object13ELFObjectFileINS0_7ELFTypeILNS_7support10endiannessE1ELb1EEEE16getSymbolAddressENS0_11DataRefImplE at /home/jschulze/.julia/juliaup/julia-1.11.0-alpha2+0.x64.linux.gnu/bin/../lib/julia/libLLVM-16jl.so (unknown line)
getAddress at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/usr/include/llvm/Object/ObjectFile.h:408 [inlined]
get_function_name_and_base at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/src/debuginfo.cpp:746 [inlined]
jl_dylib_DI_for_fptr at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/src/debuginfo.cpp:1142
jl_getDylibFunctionInfo at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/src/debuginfo.cpp:1174 [inlined]
jl_getFunctionInfo_impl at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/src/debuginfo.cpp:1247
ijl_lookup_code_address at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/src/stackwalk.c:589
lookup at ./stacktraces.jl:108
stacktrace at ./stacktraces.jl:164
stacktrace at ./stacktraces.jl:162 [inlined]
scrub_repl_backtrace at ./client.jl:96
jfptr_scrub_repl_backtrace_70894.1 at /home/jschulze/.julia/juliaup/julia-1.11.0-alpha2+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
scrub_repl_backtrace at ./client.jl:103
exec_options at ./client.jl:321
_start at ./client.jl:526
jfptr__start_71122.1 at /home/jschulze/.julia/juliaup/julia-1.11.0-alpha2+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/src/julia.h:2154 [inlined]
true_main at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/src/jlapi.c:1059
main at /cache/build/builder-amdci4-1/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7f3a5a5dbd8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
unknown function (ip: (nil))
Allocations: 1 (Pool: 1; Big: 0); GC: 0
Manifest-v1.11.toml
# This file is machine-generated - editing it directly is not advised

julia_version = "1.11.0-alpha2"
manifest_format = "2.0"
project_hash = "911edae1ed7fd2de4577c3badb415b11dc83b1e4"

[[deps.Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"
version = "1.11.0"

[[deps.BFloat16s]]
deps = ["LinearAlgebra", "Printf", "Random", "Test"]
git-tree-sha1 = "2c7cc21e8678eff479978a0a2ef5ce2f51b63dff"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.5.0"

[[deps.Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
version = "1.11.0"

[[deps.CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"
version = "1.1.1+0"

[[deps.InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"
version = "1.11.0"

[[deps.Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"
version = "1.11.0"

[[deps.LinearAlgebra]]
deps = ["Libdl", "OpenBLAS_jll", "libblastrampoline_jll"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
version = "1.11.0"

[[deps.Logging]]
deps = ["StyledStrings"]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"
version = "1.11.0"

[[deps.Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"
version = "1.11.0"

[[deps.OpenBLAS_jll]]
deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"]
uuid = "4536629a-c528-5b80-bd46-f80d51c5b363"
version = "0.3.26+2"

[[deps.Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"
version = "1.11.0"

[[deps.Random]]
deps = ["SHA"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
version = "1.11.0"

[[deps.SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"
version = "0.7.0"

[[deps.Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
version = "1.11.0"

[[deps.StyledStrings]]
uuid = "f489334b-da3d-4c2e-b8f0-e476e12c162b"
version = "1.11.0"

[[deps.Test]]
deps = ["InteractiveUtils", "Logging", "Random", "Serialization"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
version = "1.11.0"

[[deps.Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
version = "1.11.0"

[[deps.libblastrampoline_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850b90-86db-534c-a0d3-1478176c7d93"
version = "5.8.0+1"
@milankl
Member

milankl commented Mar 28, 2024

Nothing breaks on my Intel Core i5-12600K that does not support avx512_bf16.

For arithmetic operations, BFloat16 values are converted to Float32 and the result is then truncated back down to BFloat16 (though I'm not sure when or how native BFloat16 arithmetic is used if it is available). You can check this with

julia> a = one(BFloat16)
BFloat16(1.0)

julia> @code_lowered a+a
CodeInfo(
1 ─ %1 = BFloat16s.Float32(x)
│   %2 = BFloat16s.Float32(y)
│   %3 = %1 + %2
│   %4 = BFloat16s.BFloat16(%3)
└──      return %4
)
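
The widen, add, narrow sequence in the lowered code above can be sketched without the package. This is only an illustration: the names `bf16_to_f32`, `f32_to_bf16` and `bf16_add` are hypothetical, a bfloat16 value is held as a raw `UInt16`, and the narrowing step assumes round-to-nearest-even, which may differ from what BFloat16s.jl actually does.

```julia
# Hypothetical helper names; this sketch does not use BFloat16s.jl.
# A bfloat16 value is represented here as the upper 16 bits of a Float32.

# Widen: put the 16 bits into the high half of a Float32 bit pattern (exact).
bf16_to_f32(b::UInt16) = reinterpret(Float32, UInt32(b) << 16)

# Narrow: round to nearest, ties to even (an assumption), keep the upper 16 bits.
function f32_to_bf16(x::Float32)
    isnan(x) && return 0x7fc0                    # canonical quiet NaN
    u = reinterpret(UInt32, x)
    u += 0x00007fff + ((u >> 16) & 0x00000001)   # rounding bias
    return (u >> 16) % UInt16
end

# Software addition: widen both operands, add in Float32, narrow the result.
bf16_add(a::UInt16, b::UInt16) = f32_to_bf16(bf16_to_f32(a) + bf16_to_f32(b))

one_bf16 = f32_to_bf16(1.0f0)
two_bf16 = bf16_add(one_bf16, one_bf16)
println(bf16_to_f32(two_bf16))
```

Storing the value as the high half of a Float32 is what makes the widening step exact; only the narrowing step can lose precision.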

I also have an Intel i5 in my MacBook, and with Julia 1.10.2 I cannot reproduce your error, even if I execute this a million times:

julia> using BFloat16s
julia> A = ones(BFloat16,10)
julia> for _ in 1:1000000
           A + A
       end

julia>

Are you sure that .../mwe.jl really only contains these lines of code that you copied in?

@jonas-schulze
Author

jonas-schulze commented Mar 28, 2024

Are you sure that .../mwe.jl really only contains these lines of code that you copied in?

Yes, I am. I was also testing v0.4.2, hence the v0.5.0/ to separate the environments and the ../ to the common mwe.jl.

BFloat16 is for arithmetic operations converted to Float32 [...]

Starting with Julia 1.11 (JuliaLang/julia@5487046) and BFloat16s 0.5 (#51), native LLVM bfloat is used if available. On the AMD CPU, I see the following.

julia> BFloat16s.llvm_storage
true

julia> BFloat16s.llvm_arithmetic
true

julia> a = one(BFloat16)
BFloat16(1.0)

julia> @code_lowered a+a
CodeInfo(
1 ─ %1 = Base.add_float
│   %2 = (%1)(x, y)
└──      return %2
)
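
To compare machines in this thread, it may help to confirm whether the CPU itself advertises `avx512_bf16`. The following is a rough sketch for Linux only: `has_cpu_flag` is a hypothetical helper that scans `/proc/cpuinfo`-style text, and it says nothing about how BFloat16s decides the values of `llvm_storage` or `llvm_arithmetic`.

```julia
# Hypothetical helper: does /proc/cpuinfo-style text list the given CPU flag?
function has_cpu_flag(cpuinfo::AbstractString, flag::AbstractString)
    for line in split(cpuinfo, '\n')
        startswith(line, "flags") || continue
        parts = split(line, ':'; limit = 2)
        length(parts) == 2 || continue
        # Flags are whitespace-separated; match whole names only.
        return flag in split(parts[2])
    end
    return false
end

# Linux only; other platforms expose CPU features differently.
if isfile("/proc/cpuinfo")
    info = read("/proc/cpuinfo", String)
    println("avx512_bf16: ", has_cpu_flag(info, "avx512_bf16"))
end
```

Matching whole flag names avoids false positives from prefixes such as `avx512f`.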

@milankl
Member

milankl commented Mar 28, 2024

But what happens if you look at the LLVM code? For me the same conversion happens there (with 1.11), but you're hoping it would call fadd bfloat directly?

julia> @code_llvm a+a
; Function Signature: +(Core.BFloat16, Core.BFloat16)
;  @ /Users/milan/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:225 within `+`
define bfloat @"julia_+_5925"(bfloat %"x::BFloat16", bfloat %"y::BFloat16") #0 {
top:
  %0 = fpext bfloat %"x::BFloat16" to float
  %1 = fpext bfloat %"y::BFloat16" to float
  %2 = fadd float %0, %1
  %3 = fptrunc float %2 to bfloat
  ret bfloat %3
}

@jonas-schulze
Author

Yes, I was hoping for fadd bfloat, but I see the same IR you posted ... 🤔
Do I need to compile Julia with a custom LLVM that has BF16 enabled ... somehow?

Interestingly, I can't even generate the LLVM IR for A + A from the original MWE:

julia> A = ones(Core.BFloat16, 32);

julia> @code_llvm 2A
ERROR: StackOverflowError:

julia> @code_llvm A + A
ERROR: StackOverflowError:

julia> @code_llvm BFloat16(1) * A
ERROR: StackOverflowError:

Sometimes I even get one core sitting at 100% load just generating the LLVM IR. I am a bit clueless here.

@jonas-schulze
Author

The problem persists on the current nightly, Version 1.12.0-DEV.629 (2024-05-30).

@giordano
Member

Works for me on AMD EPYC 9654: JuliaLang/julia#54025 (comment)

@ViralBShah
Member

Please reopen if still broken.
