Hanging connections due to IO wait #584
Comments
Could you give some details about your use case? In particular, what is the usage pattern, and where is this app deployed? Does this connection transit any cloud providers' firewalls? We've seen issues with firewalls closing idle connections without sending RST packets, leading to symptoms similar to yours (in fact, the stack trace looks the same). Take a look at cockroachdb/cockroach#13823 - does any of this sound familiar?
@tamird We use an AWS NAT instance (not to be confused with a NAT "gateway"). From the docs:
The program itself is for ETL, so it issues queries at a fairly low rate, but each query takes a significant amount of time (e.g. schema ALTERs, COPYs, INSERT INTO ... SELECT, etc.). This problem seems to occur more frequently as the number of connections we make to the 3rd-party databases (not managed by us) grows-- whether in this program or in general from our NAT. For context, we sometimes have multiple instances of this program running against a single 3rd-party DB, and when this happens, they all start to hang around the same time. What would you expect to happen if lib/pq is executing a query and the connection is closed by an external source (e.g. the NAT instance)?
Well, this is where it can get messy. Normally, I'd expect both sides of the connection to receive a TCP RST from the intermediate device, but as I mentioned earlier, this is not how Azure's firewall behaves, for instance. My advice would be to read through the CockroachDB issue and try some of the same debugging steps (e.g. tcpdump) and see where that leads you.
@tejasmanohar ... FWIW we see the exact same problem. Sigh. Were you able to glean any insights?
@bilalaslamseattle We realized lib/pq didn't enable TCP keepalives by default while pgx did, so we made a wrapper: https://gist.github.com/tejasmanohar/fdaafe17d7ac1c083147055f2c03959b
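The gist itself isn't reproduced in this thread, but the general shape of such a wrapper -- a sketch, not the gist's exact code -- is to implement lib/pq's `Dialer` interface with a `net.Dialer` that has `KeepAlive` set, and register a custom driver that opens connections through `pq.DialOpen`. The driver name `postgres-keepalive` and the 30-second interval are arbitrary choices for illustration:

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"net"
	"time"

	"github.com/lib/pq"
)

// keepaliveDialer satisfies pq.Dialer and enables TCP keepalives
// on every connection it opens.
type keepaliveDialer struct {
	keepalive time.Duration
}

func (d keepaliveDialer) Dial(network, address string) (net.Conn, error) {
	return (&net.Dialer{KeepAlive: d.keepalive}).Dial(network, address)
}

func (d keepaliveDialer) DialTimeout(network, address string, timeout time.Duration) (net.Conn, error) {
	return (&net.Dialer{KeepAlive: d.keepalive, Timeout: timeout}).Dial(network, address)
}

// keepaliveDriver opens connections through pq.DialOpen so the custom
// dialer is used for every connection in the database/sql pool.
type keepaliveDriver struct {
	dialer keepaliveDialer
}

func (d keepaliveDriver) Open(dsn string) (driver.Conn, error) {
	return pq.DialOpen(d.dialer, dsn)
}

func init() {
	// Register under a distinct name so the stock "postgres" driver is untouched.
	sql.Register("postgres-keepalive", keepaliveDriver{
		dialer: keepaliveDialer{keepalive: 30 * time.Second},
	})
}

func main() {
	db, err := sql.Open("postgres-keepalive", "postgres://user:pass@host/db?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	// use db as usual ...
}
```

(Recent Go releases enable TCP keepalives by default in net.Dialer, so an explicit wrapper like this mainly matters on older toolchains or when you want a non-default interval.)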
Can we re-open this? The gist solution above does not help when PostgreSQL has really high IO-- the keepalive does nothing in that case. In my case the network is OK, but PostgreSQL is sometimes too overloaded to respond to the cancel request.
@chrispassas Not an expert here, but that seems like a different problem. I thought TCP keepalives are specifically designed to protect against the case when the peer cannot respond with a FIN packet.
@chrispassas I've seen Postgres go into a state where it is too overloaded to cancel requests or even discard connections that have been closed. This may be what's happening. To be fair, this usually happens when the system is in an unhealthy state (e.g. around an OOM). Depending on what you're doing, you may benefit from setting server-side limits / timeouts like statement_timeout.
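For illustration, here is one way that suggestion can be applied from Go -- a sketch, assuming lib/pq's documented behaviour of forwarding unrecognized connection-string parameters to the server as run-time parameters; the DSN and timeout values are placeholders:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// statement_timeout is a PostgreSQL run-time parameter; lib/pq passes
	// unrecognized DSN parameters through to the server, so it can be set
	// for every connection in the pool (value here is in milliseconds).
	dsn := "postgres://user:pass@host/db?sslmode=disable&statement_timeout=30000"
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Alternatively, scope the timeout to a single transaction with SET LOCAL,
	// which avoids leaking session state back into the shared pool.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()
	if _, err := tx.Exec("SET LOCAL statement_timeout = '30s'"); err != nil {
		log.Fatal(err)
	}
	// ... run the long-running query inside this transaction ...
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```

As the next comment points out, though, a server-side timeout only helps if the server is healthy enough to enforce it.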
@tejasmanohar In most cases the database server is in distress. I have tried setting statement_timeout. The problem is that when statement_timeout is reached or a context is canceled, the code seems to hang on db.QueryContext() because it's waiting for some acknowledgment from the database server. The whole point of QueryContext is to ensure we can time out a query, but if it requires the remote server to agree or respond, then the QueryContext timeout is broken.
Any update on this? Does the library currently support keepalive or not?
I'm gonna leave a link to the original issue regarding TCP keepalive settings here - #360. |
We manage a connection pool against customer databases through lib/pq + database/sql. From time to time, we see all connections across the pool start to hang around the same time, and upon killing the Go process with signal ABRT, we get the following stack trace for each open connection. Restarting the program and reestablishing the connection pool fixes it every time, but obviously, this is not ideal in production. It seems like there is some ephemeral networking issue that causes this. Do you have any ideas on what may be at fault here? The stack looks very similar to that of jackc/pgx#225, so I must ask-- is it possible this is an issue in some underlying Go networking libraries?
Or, is it possible that this is due to a code error on our end (e.g. not closing something properly)? I don't wanna put it past me, but I'm slightly skeptical here since the program works fine against most databases, and this only comes up for a few databases we work against every so often.
Also, do you have any recommendations on handling this case? I'm considering adding timeouts using the Go 1.8 context + db feature (#535) to stop the fire for now. Many thanks in advance!
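For reference, the Go 1.8+ pattern mentioned above looks roughly like this (a minimal sketch; the DSN, query, and 30-second timeout are placeholders). As later comments in this thread note, the driver may still block if the backend never acknowledges the cancellation:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@host/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Give the whole query at most 30 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, "SELECT id FROM some_table")
	if err != nil {
		log.Fatal(err) // includes context.DeadlineExceeded when the timeout fires
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		// ... process id ...
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```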