Hanging connections due to IO wait #584

Closed

tejasmanohar opened this issue Mar 19, 2017 · 11 comments

tejasmanohar commented Mar 19, 2017

We manage a connection pool against customer databases through lib/pq + database/sql. From time to time, all connections across the pool start to hang around the same time, and upon killing the Go process with SIGABRT, we get the following stack trace for each open connection. Restarting the program and reestablishing the connection pool fixes it every time, but obviously that is not ideal in production.

It seems like some ephemeral networking issue causes this. Do you have any idea what may be at fault here? The stack looks very similar to the one in jackc/pgx#225, so I have to ask: is it possible this is an issue in the underlying Go networking libraries?

Or is it possible that this is due to a bug on our end (e.g. not closing something properly)? I can't rule that out, but I'm slightly skeptical, since the program works fine against most databases and the hang only shows up every so often with a few of the databases we work against.

Also, do you have any recommendations for handling this case? As a stopgap, I'm considering adding timeouts using the Go 1.8 context support in database/sql (#535); a rough sketch of what I have in mind follows the stack trace below. Many thanks in advance!

goroutine 1394 [IO wait, 1029 minutes]:
net.runtime_pollWait(0x7f9cc085f740, 0x72, 0x1a)
	/usr/local/go/src/runtime/netpoll.go:160 +0x59
net.(*pollDesc).wait(0xc42300c300, 0x72, 0xc4204568c0, 0xc42009e0b0)
	/usr/local/go/src/net/fd_poll_runtime.go:73 +0x38
net.(*pollDesc).waitRead(0xc42300c300, 0x10fc400, 0xc42009e0b0)
	/usr/local/go/src/net/fd_poll_runtime.go:78 +0x34
net.(*netFD).Read(0xc42300c2a0, 0xc4214eb000, 0x1000, 0x1000, 0x0, 0x10fc400, 0xc42009e0b0)
	/usr/local/go/src/net/fd_unix.go:243 +0x1a1
net.(*conn).Read(0xc420faa0c0, 0xc4214eb000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:173 +0x70
bufio.(*Reader).fill(0xc420f54180)
	/usr/local/go/src/bufio/bufio.go:97 +0x10c
bufio.(*Reader).Read(0xc420f54180, 0xc42040d8e0, 0x5, 0x200, 0x2, 0xc420456a68, 0x0)
	/usr/local/go/src/bufio/bufio.go:209 +0x1bc
io.ReadAtLeast(0x10f6d80, 0xc420f54180, 0xc42040d8e0, 0x5, 0x200, 0x5, 0xc43f253320, 0x20, 0x18)
	/usr/local/go/src/io/io.go:307 +0xa4
io.ReadFull(0x10f6d80, 0xc420f54180, 0xc42040d8e0, 0x5, 0x200, 0xc42040d8e0, 0xc420495800, 0x7f9cc08b34b0)
	/usr/local/go/src/io/io.go:325 +0x58
github.com/segmentio/connector-postgres/vendor/github.com/lib/pq.(*conn).recvMessage(0xc42040d8c0, 0xc43f253320, 0xb519e0, 0x53a701, 0x11540d0)
	/go/src/github.com/segmentio/connector-postgres/vendor/github.com/lib/pq/conn.go:954 +0x117
github.com/segmentio/connector-postgres/vendor/github.com/lib/pq.(*conn).recv1Buf(0xc42040d8c0, 0xc43f253320, 0x0)
	/go/src/github.com/segmentio/connector-postgres/vendor/github.com/lib/pq/conn.go:1004 +0x39
github.com/segmentio/connector-postgres/vendor/github.com/lib/pq.(*conn).recv1(0xc42040d8c0, 0xc420456c68, 0x123)
	/go/src/github.com/segmentio/connector-postgres/vendor/github.com/lib/pq/conn.go:1025 +0x87
github.com/segmentio/connector-postgres/vendor/github.com/lib/pq.(*conn).simpleExec(0xc42040d8c0, 0xc4364d0c80, 0x122, 0xc420456d38, 0xc420456ce8, 0x419679, 0xbb1180, 0xc83d20, 0x1)
	/go/src/github.com/segmentio/connector-postgres/vendor/github.com/lib/pq/conn.go:620 +0x1ef
github.com/segmentio/connector-postgres/vendor/github.com/lib/pq.(*conn).Exec(0xc42040d8c0, 0xc4364d0c80, 0x122, 0x11540d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/segmentio/connector-postgres/vendor/github.com/lib/pq/conn.go:874 +0x2e2
database/sql.(*Tx).Exec(0xc42dfd02d0, 0xc4364d0c80, 0x122, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/database/sql/sql.go:1382 +0x441
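
For reference, the stopgap I have in mind looks roughly like this (a minimal sketch; the DSN, query, and 30-second timeout are placeholders, not values from our setup):

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder DSN; substitute real connection parameters.
	db, err := sql.Open("postgres", "host=localhost dbname=example sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound the statement with a context so a hung connection eventually
	// returns context.DeadlineExceeded instead of blocking forever in IO wait.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if _, err := db.ExecContext(ctx, "ALTER TABLE example ADD COLUMN added_at timestamptz"); err != nil {
		log.Printf("exec failed (possibly timed out): %v", err)
	}
}
```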

tamird commented Mar 19, 2017

Could you give some details about your use case? In particular, what is the usage pattern, and where is this app deployed? Does this connection transit any cloud providers' firewalls?

We've seen issues with firewalls closing idle connections without sending RST packets, leading to symptoms similar to yours (in fact, the stack trace looks the same). Take a look at cockroachdb/cockroach#13823 - does any of this sound familiar?


tejasmanohar commented Mar 20, 2017

@tamird We use an AWS NAT instance (not to be confused with a NAT "gateway"). From the docs:

When a connection times out, a NAT instance sends a FIN packet to resources behind the NAT instance to close the connection.

The program itself is for ETL, so it issues queries at a fairly low rate, but each query takes a significant amount of time (e.g. schema ALTERs, COPYs, INSERT INTO ... SELECT, etc.). This problem seems to occur more frequently as the number of connections we make to the 3rd-party databases (not managed by us) grows, whether in this program or in general from our NAT. For context, we sometimes have multiple instances of this program running against a single 3rd-party DB, and when the hang happens, they all start to hang around the same time.

What would you expect to happen if lib/pq is executing a query and the connection is closed by an external source (e.g. NAT instance)?


tamird commented Mar 20, 2017

What would you expect to happen if lib/pq is executing a query and the connection is closed by an external source (e.g. NAT instance)?

Well, this is where it can get messy. Normally, I'd expect both sides of the connection to receive a TCP RST from the intermediate device, but as I mentioned earlier, this is not how Azure's firewall behaves, for instance.

My advice would be to read through the CockroachDB issue and try some of the same debugging steps (e.g. tcpdump) and see where that leads you.

@bilalaslamseattle

@tejasmanohar ... FWIW we see the exact same problem. Sigh. Were you able to glean any insights?


tejasmanohar commented Aug 9, 2017

@bilalaslamseattle We realized lib/pq didn't enable TCP keepalives by default while pgx did, so we made a wrapper: https://gist.github.com/tejasmanohar/fdaafe17d7ac1c083147055f2c03959b
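
In rough outline, the wrapper does something like the following (a minimal sketch, not the gist verbatim; the driver name and 30-second keepalive interval are just illustrative):

```go
package pqkeepalive

import (
	"database/sql"
	"database/sql/driver"
	"net"
	"time"

	"github.com/lib/pq"
)

// keepaliveDialer satisfies pq.Dialer and enables TCP keepalives on every
// connection it opens. The interval here is illustrative, not prescriptive.
type keepaliveDialer struct{}

func (keepaliveDialer) Dial(network, address string) (net.Conn, error) {
	return (&net.Dialer{KeepAlive: 30 * time.Second}).Dial(network, address)
}

func (keepaliveDialer) DialTimeout(network, address string, timeout time.Duration) (net.Conn, error) {
	return (&net.Dialer{KeepAlive: 30 * time.Second, Timeout: timeout}).Dial(network, address)
}

// keepaliveDriver is a thin database/sql driver that delegates to lib/pq but
// injects the keepalive-enabled dialer via pq.DialOpen.
type keepaliveDriver struct{}

func (keepaliveDriver) Open(dsn string) (driver.Conn, error) {
	return pq.DialOpen(keepaliveDialer{}, dsn)
}

func init() {
	// Register under a distinct name so callers opt in explicitly:
	//   sql.Open("postgres-keepalive", dsn)
	sql.Register("postgres-keepalive", keepaliveDriver{})
}
```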

@chrispassas

Can we re-open this? The gist solution above does not help when PostgreSQL is under very heavy IO load; the keepalive does nothing in that case. In my case the network is fine, but PostgreSQL is sometimes too overloaded to respond to the cancel request.

@tejasmanohar

@chrispassas Not an expert here, but that seems like a different problem. I thought TCP keepalives were specifically designed to protect against the case where the peer cannot respond with a FIN packet.


tejasmanohar commented Nov 28, 2017

@chrispassas I've seen Postgres get into a state where it is too overloaded to cancel requests or even discard connections that have already been closed. That may be what's happening here. To be fair, this usually happens when the system is in an unhealthy state (e.g. around an OOM). Depending on what you're doing, you may benefit from setting server-side limits/timeouts like statement_timeout.
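
For example (a sketch with illustrative values; the connection-string variant relies on lib/pq forwarding unrecognized DSN parameters to the server as run-time settings, which is my understanding of its behavior):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Option 1: set statement_timeout for every connection via the DSN.
	// 60000 is milliseconds; host/dbname are placeholders.
	db, err := sql.Open("postgres",
		"host=localhost dbname=example sslmode=disable statement_timeout=60000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Option 2: set it per session. With a connection pool this only affects
	// whichever pooled connection happens to run it, so the DSN approach (or
	// ALTER ROLE/DATABASE ... SET statement_timeout) is more reliable.
	if _, err := db.Exec("SET statement_timeout = '60s'"); err != nil {
		log.Fatal(err)
	}
}
```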

@chrispassas

@tejasmanohar In most cases the database server is in distress. I have tried setting statement_timeout. The problem is that when statement_timeout is reached or a context is canceled, the code seems to hang in db.QueryContext() because it's waiting for some acknowledgment from the database server.

The whole point of QueryContext is to ensure we can time out a query, but if that requires the remote server to agree or respond, then the QueryContext timeout is effectively broken.
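
One way to at least bound the wait on the caller side is to race the call against the context deadline (a sketch; execWithDeadline is a hypothetical helper, not a lib/pq or database/sql API). Note that the stuck goroutine and its connection are leaked until the socket finally errors out, so this only contains the damage rather than fixing it:

```go
package dbutil

import (
	"context"
	"database/sql"
	"time"
)

// execWithDeadline bounds how long the caller waits, even if the underlying
// driver call never unblocks. The abandoned goroutine and its connection are
// leaked until the server or OS eventually closes the socket.
func execWithDeadline(ctx context.Context, db *sql.DB, query string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	errc := make(chan error, 1)
	go func() {
		_, err := db.ExecContext(ctx, query)
		errc <- err // buffered, so this never blocks even after we stop waiting
	}()

	select {
	case err := <-errc:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```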

@meetme2meat

Any update on this? Does the library currently support keepalives or not?

@LasTshaMAN

I'm gonna leave a link to the original issue regarding TCP keepalive settings here - #360.
