Hanging connections due to IO wait #584
Comments
Could you give some details about your use case? In particular, what is the usage pattern, and where is this app deployed? Does this connection transit any cloud providers' firewalls? We've seen issues with firewalls closing idle connections without sending RST packets, leading to symptoms similar to yours (in fact, the stack trace looks the same). Take a look at cockroachdb/cockroach#13823 - does any of this sound familiar?
@tamird We use an AWS NAT instance (not to be confused with a NAT "gateway"). From the docs:
The program itself is for ETL, so it issues queries at a fairly low rate, but each query takes a significant amount of time (e.g. schema ALTERs, COPYs, INSERT INTO ... SELECT, etc.). This problem seems to occur more frequently as the number of connections we make to the 3rd-party databases (not managed by us) grows-- whether in this program or in general from our NAT. For context, we sometimes have multiple instances of this program running against a single 3rd-party DB, and when this happens, they all start to hang around the same time. What would you expect to happen if lib/pq is executing a query and the connection is closed by an external source (e.g. the NAT instance)?
Well, this is where it can get messy. Normally, I'd expect both sides of the connection to receive a TCP RST from the intermediate device, but as I mentioned earlier, this is not how Azure's firewall behaves, for instance. My advice would be to read through the CockroachDB issue and try some of the same debugging steps (e.g. tcpdump) and see where that leads you.
@tejasmanohar ... FWIW we see the exact same problem. Sigh. Were you able to glean any insights?
@bilalaslamseattle We realized lib/pq didn't enable TCP keepalives by default while pgx did, so we made a wrapper: https://gist.github.com/tejasmanohar/fdaafe17d7ac1c083147055f2c03959b
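The gist itself isn't reproduced in this thread, but the general shape of such a wrapper -- a sketch, not the gist's exact code -- is to implement lib/pq's `Dialer` interface with a `net.Dialer` that has `KeepAlive` set, and register a custom driver that opens connections through `pq.DialOpen`. The driver name `postgres-keepalive` and the 30-second interval are arbitrary choices for illustration:

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"net"
	"time"

	"github.com/lib/pq"
)

// keepaliveDialer satisfies pq.Dialer and enables TCP keepalives
// on every connection it opens.
type keepaliveDialer struct {
	keepalive time.Duration
}

func (d keepaliveDialer) Dial(network, address string) (net.Conn, error) {
	return (&net.Dialer{KeepAlive: d.keepalive}).Dial(network, address)
}

func (d keepaliveDialer) DialTimeout(network, address string, timeout time.Duration) (net.Conn, error) {
	return (&net.Dialer{KeepAlive: d.keepalive, Timeout: timeout}).Dial(network, address)
}

// keepaliveDriver opens connections through pq.DialOpen so the custom
// dialer is used for every connection in the database/sql pool.
type keepaliveDriver struct {
	dialer keepaliveDialer
}

func (d keepaliveDriver) Open(dsn string) (driver.Conn, error) {
	return pq.DialOpen(d.dialer, dsn)
}

func init() {
	// Register under a distinct name so the stock "postgres" driver is untouched.
	sql.Register("postgres-keepalive", keepaliveDriver{
		dialer: keepaliveDialer{keepalive: 30 * time.Second},
	})
}

func main() {
	db, err := sql.Open("postgres-keepalive", "postgres://user:pass@host/db?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	// use db as usual ...
}
```

(Recent Go releases enable TCP keepalives by default in net.Dialer, so an explicit wrapper like this mainly matters on older toolchains or when you want a non-default interval.)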
Can we re-open this? The gist solution above does not help when PostgreSQL has really high IO-- the keepalive does nothing in that case. In my case the network is OK, but PostgreSQL is sometimes too overloaded to respond to the cancel request.
@chrispassas Not an expert here, but that seems like a different problem. I thought TCP keepalives are specifically designed to protect against the case when the peer cannot respond with a FIN packet.
@chrispassas I've seen Postgres go into a state where it is too overloaded to cancel requests or even discard connections that have been closed. This may be what's happening. To be fair, this usually happens when the system is in an unhealthy state (e.g. around an OOM). Depending on what you're doing, you may benefit from setting server-side limits / timeouts like statement_timeout.
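For illustration, here is one way that suggestion can be applied from Go -- a sketch, assuming lib/pq's documented behaviour of forwarding unrecognized connection-string parameters to the server as run-time parameters; the DSN and timeout values are placeholders:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// statement_timeout is a PostgreSQL run-time parameter; lib/pq passes
	// unrecognized DSN parameters through to the server, so it can be set
	// for every connection in the pool (value here is in milliseconds).
	dsn := "postgres://user:pass@host/db?sslmode=disable&statement_timeout=30000"
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Alternatively, scope the timeout to a single transaction with SET LOCAL,
	// which avoids leaking session state back into the shared pool.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()
	if _, err := tx.Exec("SET LOCAL statement_timeout = '30s'"); err != nil {
		log.Fatal(err)
	}
	// ... run the long-running query inside this transaction ...
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```

As the next comment points out, though, a server-side timeout only helps if the server is healthy enough to enforce it.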
@tejasmanohar In most cases the database server is in distress. I have tried setting statement_timeout. The problem is that when statement_timeout is reached or a context is canceled, the code seems to hang on db.QueryContext() because it's waiting for some acknowledgment from the database server. The whole point of QueryContext is to ensure we can time out a query, but if it requires the remote server to agree or respond, then the QueryContext timeout is broken.
Any update on this? Does the library currently support keepalive or not?
I'm gonna leave a link to the original issue regarding TCP keepalive settings here - #360. |
We manage a connection pool against customer databases through lib/pq + database/sql. From time to time, we see all connections across the pool start to hang around the same time, and upon killing the Go process with signal ABRT, we get the following stack trace for each open connection. Restarting the program and reestablishing the connection pool fixes it every time, but obviously, this is not ideal in production. It seems like there is some ephemeral networking issue that causes this. Do you have any ideas on what may be at fault here? The stack looks very similar to that of jackc/pgx#225, so I must ask-- is it possible this is an issue in some underlying Go networking libraries?
Or, is it possible that this is due to a code error on our end (e.g. not closing something properly)? I don't wanna put it past me, but I'm slightly skeptical here since the program works fine against most databases, and this only comes up for a few databases we work against every so often.
Also, do you have any recommendations on handling this case? I'm considering adding timeouts using the Go 1.8 context + db feature (#535) to stop the fire for now. Many thanks in advance!
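For reference, the Go 1.8+ pattern mentioned above looks roughly like this (a minimal sketch; the DSN, query, and 30-second timeout are placeholders). As later comments in this thread note, the driver may still block if the backend never acknowledges the cancellation:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@host/db?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Give the whole query at most 30 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	rows, err := db.QueryContext(ctx, "SELECT id FROM some_table")
	if err != nil {
		log.Fatal(err) // includes context.DeadlineExceeded when the timeout fires
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		// ... process id ...
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```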