Data race while reconnecting #703

Open
NiklasCi opened this issue Mar 5, 2025 · 4 comments


NiklasCi commented Mar 5, 2025

One of our tests fails with a data race when we test the logic we implemented for client reconnects.
It seems there is a data race in the client itself; at least the trace does not show any of our Go files/packages.

Trace:

WARNING: DATA RACE
Write at 0x00c00047e441 by goroutine 147:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).resume()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:1090 +0x13de
  github.com/eclipse/paho%2emqtt%2egolang.(*client).reconnect()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:341 +0x973
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1.gowrap1()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:562 +0x44

Previous read at 0x00c00047e441 by goroutine 167:
  github.com/eclipse/paho.mqtt.golang/packets.(*FixedHeader).pack()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/packets/packets.go:267 +0x71
  github.com/eclipse/paho.mqtt.golang/packets.(*PublishPacket).Write()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/packets/publish.go:47 +0x2dd
  github.com/eclipse/paho%2emqtt%2egolang.startOutgoingComms.func1()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/net.go:293 +0xd5e

Goroutine 147 (running) created at:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:562 +0x60a

Goroutine 167 (running) created at:
  github.com/eclipse/paho%2emqtt%2egolang.startOutgoingComms()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/net.go:264 +0x3a4
  github.com/eclipse/paho%2emqtt%2egolang.startComms()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/net.go:394 +0x104
  github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:655 +0xc5e
  github.com/eclipse/paho%2emqtt%2egolang.(*client).reconnect()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:340 +0x93e
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1.gowrap1()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:562 +0x44

The following options are set (a rough sketch of the corresponding configuration follows the list):

KeepAlive: 30s
PingTimeout: 1s
AutoReconnect: true
ConnectTimeout: 10s
ConnectRetry: true
ConnectRetryInterval: 500ms
ResumeSubs: false
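
For reference, a minimal sketch of how these options map onto a client configuration (the broker URL and client ID are placeholders, not taken from this issue):

package main

import (
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func newClient() mqtt.Client {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		SetClientID("reconnect-test").              // placeholder client ID
		SetKeepAlive(30 * time.Second).
		SetPingTimeout(1 * time.Second).
		SetAutoReconnect(true).
		SetConnectTimeout(10 * time.Second).
		SetConnectRetry(true).
		SetConnectRetryInterval(500 * time.Millisecond).
		SetResumeSubs(false)
	return mqtt.NewClient(opts)
}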

@MattBrittan (Contributor)

The Write is at [[email protected]/client.go:1090](http://github.com/eclipse/[email protected]/client.go:1090):

if p.Qos != 0 { // spec: The DUP flag MUST be set to 0 for all QoS 0 messages
   p.Dup = true // HERE
}

This is part of the code that sends packets that are in the store upon reconnect.

The previous read will have been triggered by [email protected]/net.go:293, which is:

if err := msg.Write(conn); err != nil {

This is sending the outgoing packet.

So it looks like the following is happening without synchronisation (a minimal stand-alone illustration follows the list):

  • Attempt to send outgoing packet
  • Loss of connection, reconnection
  • Resend packet (including modification of packet)
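
This is not the client's actual code, just a self-contained sketch of the shared-pointer pattern using the library's packets package; run under the race detector (go run -race) it should report the same kind of read/write conflict:

package main

import (
	"bytes"
	"sync"

	"github.com/eclipse/paho.mqtt.golang/packets"
)

func main() {
	// A QoS 1 publish packet, standing in for one held in the memory store.
	p := packets.NewControlPacket(packets.Publish).(*packets.PublishPacket)
	p.Qos = 1
	p.TopicName = "some/topic"
	p.Payload = []byte("payload")

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // plays the role of the outgoing-comms writer
		defer wg.Done()
		var buf bytes.Buffer
		_ = p.Write(&buf) // reads p.Dup while packing the fixed header
	}()
	go func() { // plays the role of resume() after a reconnect
		defer wg.Done()
		p.Dup = true // unsynchronised write to the same packet
	}()
	wg.Wait()
}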

This will have an impact on the memory store only (because the file store will load a new copy of the packet from disk), and it's highly unlikely that the operations will occur concurrently, because the previous connection is closed before a reconnect is attempted (and this will lead to the Write operation failing).

Because of the above, and the fact that the change is always to set p.Dup = true, the race has no real negative impact.

I think the preferred fix is to have the memory store return a copy of the packet, rather than the same bit of memory that was passed in. This would prevent any possibility of the contents of the store being updated unintentionally.
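
For illustration only, one way to make such a copy is to round-trip the stored packet through its wire encoding before handing it back; a rough sketch, where copyPacket is a hypothetical helper (not part of the library) that the memory store's Get could call:

package mqtt

import (
	"bytes"

	"github.com/eclipse/paho.mqtt.golang/packets"
)

// copyPacket returns an independent copy of a control packet by encoding it
// into a buffer and decoding it again, so callers cannot mutate the stored one.
func copyPacket(cp packets.ControlPacket) packets.ControlPacket {
	if cp == nil {
		return nil
	}
	var buf bytes.Buffer
	if err := cp.Write(&buf); err != nil {
		return cp // fall back to the shared instance if encoding fails
	}
	c, err := packets.ReadPacket(&buf)
	if err != nil {
		return cp
	}
	return c
}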

NiklasCi (Author) commented Mar 6, 2025


Thanks for the fast reply and analysis. Does that mean there will be a fix from your side? I assume the fix does not have high priority because, as you pointed out, the race has no real negative impact (which is absolutely fine for me)?
If you say this is a good issue for new contributors, I might give fixing it a try.

If I want to avoid the data race in my test for now, I should avoid killing the broker while a publish is in progress, or did I misunderstand something?

@MattBrittan (Contributor)

Does that mean there will be a fix from your side?

Probably, eventually. Happy to accept a PR (it should just be a matter of copying the response before returning it). Note that using the memory store for QoS 1+ means you can lose messages upon restart (I only use the file store for this reason).
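
For anyone who wants to persist in-flight messages, switching to the file store is a small change on the client options; a sketch (the directory path is just an example):

package main

import (
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func newPersistentOptions() *mqtt.ClientOptions {
	return mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		// Persist in-flight QoS 1+ packets to disk so they survive a restart.
		SetStore(mqtt.NewFileStore("/var/lib/myapp/mqtt-store"))
}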

@NiklasCi (Author)


OK, I will try to fix it this month.
