Investigate PeerDAS KZG library stack overflow during block production #6105
Comments
Looks like a panic inside
Yeah, I think Pawan figured it's most likely something to do with the
I think the stack overflow in this case was a red herring -- running the LH node in debug mode on the devnet shows the issue was that
The fix, replacing rust-embed with
Hi @kevaundray
I can take another look tomorrow too.
Oh that's surprising -- if you run it in debug mode, then that should give a stack trace. No rush on this, I can try running it on a different x86_64 machine too, and in the meantime I will update the image to use debug mode.
To help debugging further, I have created three docker images in debug mode based on the last three commits on the

**Parts of the code I modified**

**Full backtrace**

```rust
if std::env::var("RUST_BACKTRACE").is_err() {
std::env::set_var("RUST_BACKTRACE", "1");
}
```

to

```rust
if std::env::var("RUST_BACKTRACE").is_err() {
std::env::set_var("RUST_BACKTRACE", "full");
}
```

**Enabling debug mode**

So I could have just modified the Dockerfile and used the

**How the images were created**

Created a bash script that checked out the following commits from the
These are in order of newest to oldest. I then just built and published each of these commits, with the commit hash in the tag name.

**Docker Images**
**Kurtosis config**

```yaml
# filename: peerdas.yaml
participants:
  # Supernodes
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerdas-initial-sync
    cl_extra_params: [--subscribe-all-subnets]
  - cl_type: lighthouse
    cl_image: kevaundray/lighthouse-eth:das-dbg-{COMMIT_HASH}
    cl_extra_params: [--subscribe-all-data-column-subnets]
  # Non supernodes
  - cl_type: prysm
    cl_image: ethpandaops/prysm-beacon-chain:peerdas-initial-sync
  - cl_type: lighthouse
    cl_image: kevaundray/lighthouse-eth:das-dbg-{COMMIT_HASH}
network_params:
  eip7594_fork_epoch: 0
  eip7594_fork_version: "0x50000038"
snooper_enabled: true
global_log_level: debug
additional_services:
  - dora
  - goomy_blob
```

**Running the docker images**

Using Kurtosis, I run the following command:
The main difference here between the

**Results**

On the

On the

On the

**Observations**
This is Kurtosis waiting for Lighthouse to complete the startup process and become healthy before proceeding.
Yeah, I'm not exactly sure what's taking so long, but I don't think it's to do with KZG, as it loads in about 500ms. I'll look into the slow startup issue.
Seems to load pretty quickly on my AMD machine. It would be useful to check the Lighthouse container logs of the while "Execution in progress" is happening.
For the latest commit (04d9eef), the logs for vc-geth-lighthouse were:

```
docker logs -f vc-2-geth-lighthouse--66190c11946b4c7980461b123f308375
Jul 18 08:24:39.178 INFO Logging to file path: "/validator-keys/keys/logs/validator.log"
Jul 18 08:24:39.274 INFO Lighthouse started version: Lighthouse/v5.2.1-04d9eef+
Jul 18 08:24:39.274 INFO Configured for network name: custom (/network-configs)
Unable to initialize validator config: Unable to parse proposer_nodes: Unknown argument or group id. Make sure you are using the argument id and not the short or long flags
```

My current hypothesis is that the lengthy startup times are due to retries. This is because in the error messages from Kurtosis where the service did not start up, it notes the retry attempts:

```
Caused by: Unsuccessful ports check for IP '172.16.0.17' and port spec '{privatePortSpec:0x400088c570}', even after '240' retries with '500' milliseconds in between retries. Timeout '2m0s' has been reached
```

If this is correct, then the long startup times on
For completeness, I'll add what the Dockerfile and Makefile that I used to build the docker images look like:

```dockerfile
FROM rust:1.78.0-bullseye AS builder
RUN apt-get update && apt-get -y upgrade && apt-get install -y cmake libclang-dev
COPY . lighthouse
ARG FEATURES
# ARG PROFILE=release
ARG CARGO_USE_GIT_CLI=true
ENV FEATURES $FEATURES
# ENV PROFILE $PROFILE
ENV CARGO_NET_GIT_FETCH_WITH_CLI=$CARGO_USE_GIT_CLI
RUN cd lighthouse && make
FROM ubuntu:22.04
RUN apt-get update && apt-get -y upgrade && apt-get install -y --no-install-recommends \
libssl-dev \
ca-certificates \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /usr/local/cargo/bin/lighthouse /usr/local/bin/lighthouse
```

The main change is that I removed the PROFILE env variable. The only parts I changed in the Makefile were the following targets:

```makefile
install:
cargo install --debug --path lighthouse --force --locked \
--features "$(FEATURES)" \
$(CARGO_INSTALL_EXTRA_FLAGS)
# Builds the lcli binary in release (optimized).
install-lcli:
cargo install --debug --path lcli --force --locked \
--features "$(FEATURES)" \
$(CARGO_INSTALL_EXTRA_FLAGS)
```

The main changes are that I removed the profile flag and added `--debug`.
I've pushed a new image with the fix in PR 6125 -- the tag is

For startup times, I get ~10 seconds to start up a
I am still getting a stack overflow, though it's a bit more detailed:
Full logs: https://gist.github.com/kevaundray/04504a2bbb175a9485f120251b9a24a3
Yep, your image actually revealed a bug we've had for a while! We rarely run the validator client in debug mode, so this bug has been around for a while without us noticing. Fixed in #6125
Oh great, I can confirm that that does fix it in debug mode!
@jimmygchen I was away for a bit; I added debug lines into the code, and they seem to imply that the issue was never peerdas-kzg:

```
Jul 20 06:44:46.174 DEBG let kzg_proofs = Vec::from(proofs) service: beacon
Jul 20 06:44:46.174 DEBG let kzg = self.kzg service: beacon
Jul 20 06:44:46.174 DEBG let blobs_ref: Vec<_> service: beacon
[beacon_node/beacon_chain/src/kzg_utils.rs:85:5] blobs.len() = 4
thread 'tokio-runtime-worker' has overflowed its stack
fatal runtime error: stack overflow
```

**complete_partial_beacon_block**

In `complete_partial_beacon_block`, I added the following debug lines:

```rust
let kzg_proofs = Vec::from(proofs);
debug!(
self.log,
"let kzg_proofs = Vec::from(proofs)";
);
let kzg = self
.kzg
.as_ref()
.ok_or(BlockProductionError::TrustedSetupNotInitialized)?;
debug!(
self.log,
"let kzg = self.kzg";
);
let blobs_ref: Vec<_> = blobs.iter().collect();
debug!(
self.log,
"let blobs_ref: Vec<_>";
);
kzg_utils::validate_blobs::<T::EthSpec>(
kzg,
expected_kzg_commitments,
blobs_ref,
&kzg_proofs,
)
.map_err(BlockProductionError::KzgError)?;
debug!(
self.log,
"kzg_utils::validate_blobs";
);
```

**kzg_utils::validate_blobs**

In validate_blobs, I added the following debug lines. Note, I use `dbg!` here:

```rust
pub fn validate_blobs<E: EthSpec>(
kzg: &Kzg,
expected_kzg_commitments: &[KzgCommitment],
blobs: Vec<&Blob<E>>,
kzg_proofs: &[KzgProof],
) -> Result<(), KzgError> {
let _timer = crate::metrics::start_timer(&crate::metrics::KZG_VERIFICATION_BATCH_TIMES);
dbg!("blobs size: ", blobs.len());
let blobs = blobs
.into_iter()
.map(|blob| ssz_blob_to_crypto_blob::<E>(blob))
.collect::<Result<Vec<_>, KzgError>>()?;
dbg!("blobs collected");
let ok = kzg.verify_blob_kzg_proof_batch(&blobs, expected_kzg_commitments, kzg_proofs);
dbg!("verified verify_blob_kzg_proof_batch");
ok
}
```

**Stackoverflow**

This implies that the stack overflow is happening with four blobs on this line:

```rust
let blobs = blobs
.into_iter()
.map(|blob| ssz_blob_to_crypto_blob::<E>(blob))
.collect::<Result<Vec<_>, KzgError>>()?;
```

Looking further,
The from_bytes method looks like this:

```rust
pub fn from_bytes(bytes: &[u8]) -> Result<Self, Error> {
if bytes.len() != BYTES_PER_BLOB {
return Err(Error::InvalidBytesLength(format!(
"Invalid byte length. Expected {} got {}",
BYTES_PER_BLOB,
bytes.len(),
)));
}
let mut new_bytes = [0; BYTES_PER_BLOB];
new_bytes.copy_from_slice(bytes);
Ok(Self::new(new_bytes))
}
```

With four blobs this would allocate 4 * 4096 * 32 bytes = 524 KB onto the stack (each blob is 4096 field elements of 32 bytes) -- this is likely not the only contributor, but a big part.

**My mistake**

I had assumed that the stack overflow was coming from rust-embed and was being converted into a panic in debug mode. The issue with rust-embed only happens in debug mode, and it only happens because the Dockerfile copies the binary from one image to another. I have created a proof of concept of this here: https://github.com/kevaundray/poc-rust-embed

This would imply that the original issue was always the stack overflow due to
I'll note that a possible quick fix for this is to decrease the batch size to 1 and, for now, verify each blob individually using a for loop. This should allow the compiler to free each blob after every loop iteration/verification, because it can see the blob variable going out of scope.
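For illustration, a minimal sketch of what that per-blob loop could look like, following the `validate_blobs` snippet above. This is not the actual Lighthouse code: it assumes a single-blob `verify_blob_kzg_proof` method exists on `Kzg` as the counterpart of the batch call, and it omits error handling details.

```rust
// Hedged sketch only: verify one blob per iteration so that only a single
// converted blob is live on the stack at a time, instead of collecting all
// of them before the batch verification.
pub fn validate_blobs_individually<E: EthSpec>(
    kzg: &Kzg,
    expected_kzg_commitments: &[KzgCommitment],
    blobs: Vec<&Blob<E>>,
    kzg_proofs: &[KzgProof],
) -> Result<(), KzgError> {
    // (A real implementation would also check that the three lengths match.)
    for ((blob, commitment), proof) in blobs
        .into_iter()
        .zip(expected_kzg_commitments)
        .zip(kzg_proofs)
    {
        // The converted blob goes out of scope at the end of each iteration.
        let blob = ssz_blob_to_crypto_blob::<E>(blob)?;
        // Assumed single-blob counterpart of `verify_blob_kzg_proof_batch`.
        kzg.verify_blob_kzg_proof(&blob, *commitment, *proof)?;
    }
    Ok(())
}
```

The trade-off is giving up the batch verification speed-up, which is why this is framed as a temporary fix.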
Is the crash only happening in

One of the main sources of stack bloat for us is futures, and previous investigations using the compiler's

```
RUSTFLAGS='-Zprint-type-sizes' cargo +nightly build --release > type_sizes.txt
```

I suspect running that command on the relevant branch without

We made some improvements when we were getting stack overflows on the HTTP API in this PR, although the root of the issue still wasn't fully established:

It seems that

If it is of interest to you, I would appreciate your insights and research here @kevaundray, but I also think you'd be justified in tapping out and focussing on PeerDAS.

As a workaround, if you'd like to run Lighthouse with debug assertions, you could turn them on in the

```toml
# In top-level Cargo.toml
[profile.release]
debug-assertions = true
```

Or add a new debug-release profile:

```toml
[profile.debug-release]
inherits = "release"
debug-assertions = true
```

This would also have the advantage of optimising the BLS crypto code, which is usually very slow in debug builds too.
Oh yeah, this has been a known issue for a while, and I raised an issue for this last week: #6120. Although I think people were getting a stack overflow with the image built by
I've managed to get a core dump of the crash with Lighthouse running with the release profile. Here's the backtrace - no debug symbols, so it's not nice to read, but we know it's failing at
Ooh, you're using
Ah, I didn't realize that debug was not expected to work; I was mainly using it to get better stack traces, however they were roughly the same in both profiles. I'm happy to continue looking into this until a solution is found!
Yep, so it seems that there is a stack overflow in debug mode due to validate_blobs, which is separate from the stack overflow in release mode for the
Yeah, the stack overflow in debug mode is probably the same issue reported in #6120 (the conversion (array copy) from Blob to KZG blob). The issue in release mode seems to be rayon related; I just had a chat with Michael about this. Going to try some workarounds (increase the number of threads, or increase the thread pool stack size).
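For reference, a rough sketch of the second workaround using rayon's public builder API. The function name and the numbers are illustrative only, not what Lighthouse will necessarily use:

```rust
use rayon::ThreadPoolBuilder;

// Build a dedicated pool with a larger per-thread stack for the KZG work,
// rather than relying on the global pool's default stack size.
fn build_kzg_thread_pool() -> rayon::ThreadPool {
    ThreadPoolBuilder::new()
        .num_threads(8)                // illustrative value
        .stack_size(8 * 1024 * 1024)   // 8 MiB per worker thread, illustrative
        .build()
        .expect("failed to build rayon thread pool")
}

// Usage sketch: run the parallel blob/cell computation inside this pool,
// e.g. pool.install(|| blobs.into_par_iter().map(...).collect()).
```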
Based on my rudimentary understanding of

I think the relevant part of code in

The

Increasing the thread pool size may alleviate the issue somewhat because jobs will get stolen off the stack before too many can build up. But I think it's still not foolproof. We may want to optimise

If multiple of these jobs were spawned from different threads outside Rayon's pool, I think they could end up on the same Rayon thread and therefore the same stack. I guess that method doesn't often get called multiple times in a short period though?
I believe this is correct; there was an issue for it here: rayon-rs/rayon#1105, with the docs updated here: https://docs.rs/rayon-core/1.12.1/src/rayon_core/thread_pool/mod.rs.html#85
Yeah, I thought that comment looked suss. But I don't think it's relevant to us, because we are not usually calling that function (with

I was more referring to the case where we call
I added a regression test here that seems to reliably give me a stack overflow: #6151

I think one possible way, which should also validate the hypothesis above, is to not call the thread pool inside of a parallel iterator, so instead of:

```rust
let blob_cells_and_proofs_vec = blobs
.into_par_iter()
.map(|blob| {
let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
kzg.compute_cells_and_proofs(&blob)
})
.collect::<Result<Vec<_>, KzgError>>()?;
```

We would do:

```rust
let blob_cells_and_proofs_vec = blobs
.into_iter()
.map(|blob| {
let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
kzg.compute_cells_and_proofs(&blob)
})
.collect::<Result<Vec<_>, KzgError>>()?;
```

Where I changed `into_par_iter` to `into_iter`.

I can also Box the return type for
My understanding (could be wrong) was that in the following code:

```rust
let blob_cells_and_proofs_vec = blobs
.into_par_iter()
.map(|blob| {
let blob = KzgBlob::from_bytes(blob).map_err(KzgError::from)?;
kzg.compute_cells_and_proofs(&blob)
})
.collect::<Result<Vec<_>, KzgError>>()?;
```
Yes, agree. Are we using this pattern anywhere outside of your unit test?
Yep, I believe we are still doing this in the
Added a test here showing the CI failing even if we use vectors as return types, and then removing the stack allocation being done inside of `map` and showing it passing. The stack allocation is akin to the
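As a rough standalone illustration of the pattern that test exercises (made-up names and sizes, not the code from #6151): a large fixed-size temporary, akin to `KzgBlob::from_bytes`, is created inside a `par_iter` map whose body also fans out onto the global rayon pool. It illustrates the shape of the problem rather than being a guaranteed reproduction.

```rust
use rayon::prelude::*;

// Roughly the size of one blob: 4096 field elements * 32 bytes.
const BYTES_PER_BLOB: usize = 131_072;

// Stand-in for KzgBlob::from_bytes: builds a ~128 KiB array on the stack.
fn stack_allocating_convert(bytes: &[u8]) -> Box<[u8]> {
    let mut tmp = [0u8; BYTES_PER_BLOB];
    tmp.copy_from_slice(bytes);
    tmp.to_vec().into_boxed_slice()
}

// Stand-in for compute_cells_and_proofs, which itself uses the global rayon
// pool, so the outer map ends up doing nested parallelism.
fn inner_parallel_work(blob: &[u8]) -> u64 {
    blob.par_chunks(2048)
        .map(|chunk| chunk.iter().map(|&b| b as u64).sum::<u64>())
        .sum()
}

fn main() {
    let blobs = vec![vec![1u8; BYTES_PER_BLOB]; 8];
    let sums: Vec<u64> = blobs
        .par_iter()
        .map(|b| {
            // Large stack temporary + nested pool work on the same worker stack.
            let converted = stack_allocating_convert(b);
            inner_parallel_work(&converted)
        })
        .collect();
    println!("{sums:?}");
}
```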
This has been fixed after upgrading to
Description
We've recently switched from the `c-kzg` library to the Rust `peerdas-kzg` library on the `das` branch (#5941). The purpose for the switch was to:

- be able to merge the `das` branch to `unstable` without having to update `c-kzg` to a development version

However, we've recently run into some issues during interop testing after the change, and we're getting stack overflow panics, likely during block production:
Kurtosis config to reproduce: