Data race while reconnecting #703

Open
NiklasCi opened this issue Mar 5, 2025 · 4 comments


NiklasCi commented Mar 5, 2025

One of our tests fails with a data race when we test the logic we implemented for client reconnects.
It seems there is a data race in the client itself; at least the trace does not show any of our Go files/packages.

Trace:

WARNING: DATA RACE
Write at 0x00c00047e441 by goroutine 147:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).resume()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:1090 +0x13de
  github.com/eclipse/paho%2emqtt%2egolang.(*client).reconnect()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:341 +0x973
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1.gowrap1()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:562 +0x44

Previous read at 0x00c00047e441 by goroutine 167:
  github.com/eclipse/paho.mqtt.golang/packets.(*FixedHeader).pack()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/packets/packets.go:267 +0x71
  github.com/eclipse/paho.mqtt.golang/packets.(*PublishPacket).Write()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/packets/publish.go:47 +0x2dd
  github.com/eclipse/paho%2emqtt%2egolang.startOutgoingComms.func1()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/net.go:293 +0xd5e

Goroutine 147 (running) created at:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:562 +0x60a

Goroutine 167 (running) created at:
  github.com/eclipse/paho%2emqtt%2egolang.startOutgoingComms()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/net.go:264 +0x3a4
  github.com/eclipse/paho%2emqtt%2egolang.startComms()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/net.go:394 +0x104
  github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:655 +0xc5e
  github.com/eclipse/paho%2emqtt%2egolang.(*client).reconnect()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:340 +0x93e
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1.gowrap1()
      /home/niklas/go/pkg/mod/github.com/eclipse/[email protected]/client.go:562 +0x44

The following options are set (a rough sketch of the corresponding configuration follows the list):

KeepAlive: 30s
PingTimeout: 1s
AutoReconnect: true
ConnectTimeout: 10s
ConnectRetry: true
ConnectRetryInterval: 500ms
ResumeSubs: false
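
For reference, a minimal sketch of how these options map onto a client configuration (the broker URL and client ID are placeholders, not taken from this issue):

package main

import (
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func newClient() mqtt.Client {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		SetClientID("reconnect-test").              // placeholder client ID
		SetKeepAlive(30 * time.Second).
		SetPingTimeout(1 * time.Second).
		SetAutoReconnect(true).
		SetConnectTimeout(10 * time.Second).
		SetConnectRetry(true).
		SetConnectRetryInterval(500 * time.Millisecond).
		SetResumeSubs(false)
	return mqtt.NewClient(opts)
}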

@MattBrittan (Contributor)

The Write is at [[email protected]/client.go:1090](http://github.com/eclipse/[email protected]/client.go:1090):

if p.Qos != 0 { // spec: The DUP flag MUST be set to 0 for all QoS 0 messages
   p.Dup = true // HERE
}

This is part of the code that sends packets that are in the store upon reconnect.

The previous read will have been triggered by [email protected]/net.go:293, which is:

if err := msg.Write(conn); err != nil {

This is sending the outgoing packet.

So it looks like the following is happening without synchronisation (a minimal stand-alone illustration follows the list):

  • Attempt to send outgoing packet
  • Loss of connection, reconnection
  • Resend packet (including modification of packet)
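
This is not the client's actual code, just a self-contained sketch of the shared-pointer pattern using the library's packets package; run under the race detector (go run -race) it should report the same kind of read/write conflict:

package main

import (
	"bytes"
	"sync"

	"github.com/eclipse/paho.mqtt.golang/packets"
)

func main() {
	// A QoS 1 publish packet, standing in for one held in the memory store.
	p := packets.NewControlPacket(packets.Publish).(*packets.PublishPacket)
	p.Qos = 1
	p.TopicName = "some/topic"
	p.Payload = []byte("payload")

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // plays the role of the outgoing-comms writer
		defer wg.Done()
		var buf bytes.Buffer
		_ = p.Write(&buf) // reads p.Dup while packing the fixed header
	}()
	go func() { // plays the role of resume() after a reconnect
		defer wg.Done()
		p.Dup = true // unsynchronised write to the same packet
	}()
	wg.Wait()
}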

This will have an impact on the memory store only (because the file store will load a new copy of the packet from disk), and it's highly unlikely that the operations will occur concurrently, because the previous connection is closed before a reconnect is attempted (and this will lead to the Write operation failing).

Because of the above, and the fact that the change is always to set p.Dup = true, the race has no real negative impact.

I think the preferred fix is to have the memory store return a copy of the packet, rather than the same bit of memory that was passed in. This would prevent any possibility of the contents of the store being updated unintentionally.
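
For illustration only, one way to make such a copy is to round-trip the stored packet through its wire encoding before handing it back; a rough sketch, where copyPacket is a hypothetical helper (not part of the library) that the memory store's Get could call:

package mqtt

import (
	"bytes"

	"github.com/eclipse/paho.mqtt.golang/packets"
)

// copyPacket returns an independent copy of a control packet by encoding it
// into a buffer and decoding it again, so callers cannot mutate the stored one.
func copyPacket(cp packets.ControlPacket) packets.ControlPacket {
	if cp == nil {
		return nil
	}
	var buf bytes.Buffer
	if err := cp.Write(&buf); err != nil {
		return cp // fall back to the shared instance if encoding fails
	}
	c, err := packets.ReadPacket(&buf)
	if err != nil {
		return cp
	}
	return c
}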

NiklasCi (Author) commented Mar 6, 2025


Thanks for the fast reply and analysis. Does that mean there will be a fix from your side? I assume the fix does not have high priority because, as you pointed out, the race has no real negative impact (which is absolutely fine for me)?
If you say this is a good issue for new contributors, I might give fixing it a try.

If I want to avoid the data race in my test for now, I should avoid killing the broker while a publish is in progress, or did I misunderstand something?

@MattBrittan (Contributor)

Does that mean there will be a fix from your side?

Probably, eventually. Happy to accept a PR (it should just be a matter of copying the response before returning it). Note that using the memory store for QoS 1+ means you can lose messages upon restart (I only use the file store for this reason).
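
For anyone who wants to persist in-flight messages, switching to the file store is a small change on the client options; a sketch (the directory path is just an example):

package main

import (
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func newPersistentOptions() *mqtt.ClientOptions {
	return mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		// Persist in-flight QoS 1+ packets to disk so they survive a restart.
		SetStore(mqtt.NewFileStore("/var/lib/myapp/mqtt-store"))
}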

@NiklasCi (Author)


OK, I will try to fix it this month.
