Hi, @dsroberts, @rbeucher :
Today an error occurred while I was running ilamb with mpiexec on access-med-0.6. The details are below (this run used 24 processes, so some of the output is repeated).
Loading conda/access-med-0.6
Loading requirement: singularity
Currently Loaded Modulefiles:
1) singularity 2) conda/access-med-0.6
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_MCAST] No MCAST components selected
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[gadi-cpu-clx-2405:1119691:0:1119691] Caught signal 7 (Bus error: nonexistent physical address)
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
==== backtrace (tid:1119691) ====
0 0x0000000000012d20 __funlockfile() :0
1 0x00000000000a8186 NC4_def_var() ???:0
2 0x000000000003f2ad nc_def_var() ???:0
3 0x00000000000b75b3 __pyx_pw_7netCDF4_8_netCDF4_8Variable_1__init__() _netCDF4.c:0
4 0x000000000013ddbb type_call() :0
5 0x000000000002288f __Pyx_PyObject_Call() _netCDF4.c:0
6 0x0000000000040928 __pyx_pw_7netCDF4_8_netCDF4_7Dataset_47createVariable() _netCDF4.c:0
7 0x00000000001445a6 cfunction_call() :0
8 0x000000000013da6b _PyObject_MakeTpCall.localalias() :0
9 0x0000000000139c53 _PyEval_EvalFrameDefault() ???:0
10 0x0000000000150582 method_vectorcall() :0
11 0x00000000001358fa _PyEval_EvalFrameDefault() ???:0
12 0x0000000000144a2c _PyFunction_Vectorcall() ???:0
13 0x0000000000134c5c _PyEval_EvalFrameDefault() ???:0
14 0x0000000000144a2c _PyFunction_Vectorcall() ???:0
15 0x0000000000134850 _PyEval_EvalFrameDefault() ???:0
16 0x00000000001d7c60 _PyEval_Vector() :0
17 0x00000000001d7ba7 PyEval_EvalCode() ???:0
18 0x000000000020812a run_eval_code_obj() :0
19 0x0000000000203523 run_mod() :0
20 0x000000000009a6f5 pyrun_file.cold() :0
21 0x00000000001fd9fe _PyRun_SimpleFileObject.localalias() :0
22 0x00000000001fd594 _PyRun_AnyFileObject.localalias() :0
23 0x00000000001fa78b Py_RunMain.localalias() :0
24 0x00000000001cb1f7 Py_BytesMain() ???:0
25 0x000000000003a7e5 __libc_start_main() ???:0
26 0x00000000001cb0f1 _start() ???:0
=================================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[gadi-cpu-clx-2405:1119698:0:1119698] Caught signal 7 (Bus error: nonexistent physical address)
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
--------------------------------------------------------------------------
mpiexec noticed that process rank 9 with PID 0 on node gadi-cpu-clx-2405 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
The same script works fine on /g/data/hh5/public/modules/conda_concept/analysis3, so I think this might be an issue with the module.
Hi @rhaegar325, apologies for the delayed response. Since I've left CLEX, the containerised conda environments that this is based on are no longer supported. That being said, I'm applying this approach to another application and have encountered this error (not the bus error, but the ml_discover_hierarchy exited with error message). The error can be made to go away by disabling HCOLL (export OMPI_MCA_coll=^hcoll), but the circumstances under which it appears seem to be quite specific. As far as I can tell, this error is specific to using mpi4py in a container. I'm yet to come up with a small reproducer, but I'll keep you up to date on progress.
Edit: turns out this happens outside of containerised environments too; import mpi4py is enough to reproduce when run in parallel.
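For anyone who wants to try the HCOLL workaround without editing their job script, here is a minimal sketch in Python. It assumes Open MPI is the MPI implementation behind mpiexec and that it picks up OMPI_MCA_coll from the environment of the launched processes. The helper names env_without_hcoll and run_parallel are hypothetical, not part of ilamb or mpi4py, and this has not been checked against the bus error itself, only against the ml_discover_hierarchy message described above.

```python
import os
import subprocess


def env_without_hcoll(base_env=None):
    """Return a copy of the environment with Open MPI's hcoll component excluded.

    Hypothetical helper: equivalent to `export OMPI_MCA_coll=^hcoll` in the shell.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_MCA_coll"] = "^hcoll"
    return env


def run_parallel(command, nprocs=24, base_env=None):
    """Launch `command` under mpiexec with hcoll disabled (hypothetical helper).

    Assumes `mpiexec` is on PATH, e.g. after loading the relevant module.
    """
    return subprocess.run(
        ["mpiexec", "-n", str(nprocs), *command],
        env=env_without_hcoll(base_env),
        check=False,  # report the exit code rather than raising
    )
```

Calling run_parallel(["python3", "-c", "import mpi4py"], nprocs=2) should rerun the small parallel import test mentioned in the edit above with HCOLL disabled. Comparing its output with a plain mpiexec run may help confirm whether the hcoll messages and the bus error are related.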