[APP-7566]: Add bluetooth provisioning as a default provisioning method which runs parallel to hotspot provisioning. #70

Open
wants to merge 41 commits into
base: main

Conversation

@maxhorowitz maxhorowitz commented Feb 20, 2025

Summary

This PR encompasses everything needed for adding bluetooth provisioning as a default provisioning method in the viam-agent (and is designed to run in parallel to the existing WiFi hotspot provisioning method).

Setup

I want you to test as if you are a customer who has just received a machine in the mail. The machine will just have a viam-agent binary installation. It will not have WiFi credentials (and thus no internet connectivity).

Machine setup

Using a Linux machine or laptop (one whose /opt/viam/*, /etc/viam/*, and other files you are comfortable having tampered with), pull down this branch. Because you will need to remove connectivity to emulate the user perspective, please do not SSH into the machine over WiFi (and don't use Ethernet, because it can interfere with provisioning). Instead, test directly on the device (keyboard and mouse needed).

Uninstall Viam

First, you will need to remove any Viam-related stuff from your computer (to emulate the "newly-out-of-the-box" scenario our customers will be in). Please run the following from the repository root:
sudo ./uninstall.sh

Remove local networks

You will then need to remove local networks from your machine to emulate the "offline-ness" that our customers will be dealing with. I've been running nmcli con show and subsequently sudo nmcli con delete <name> (for every network that could "interfere" with the provisioning flow).

Preinstall the Viam Agent

Then, run the following (again from the repository root):

  1. make
  2. sudo ./bin/viam-agent-custom-aarch64 --install
  3. sudo /opt/viam/bin/viam-agent --debug

I've been running the commands all together, like so:
make && sudo ./bin/viam-agent-custom-aarch64 --install && sudo /opt/viam/bin/viam-agent --debug

Phone setup

Download the LightBlue app (available on both Android and iPhone). This is what we will use to communicate with the bluetooth service that is being advertised from the Linux machine.

Testing Procedure

At this point, we are ready to test.

LightBlue

Pairing

In the app, use the search bar to find the name of the Linux machine that you are currently testing against. It should be there. Pair with your machine (one of the tests is to see if the pairing request is automatically accepted on the machine side, so I am hoping this part works!).

If it does not pair, check your screen for a six-digit pairing code and accept the request manually. This is a known limitation that I am working on fixing. The TLDR is it's complicated because of reliance on systemd bus signals which may be picked up and discarded as negligible elsewhere in the viam-agent.

Bluetooth service used to "transmit" credentials

Once paired, look at the bluetooth service and nested characteristics available to us. You should be able to take the provided UUIDs and map them to the logs on your viam-agent machine:

There is an encoding that will be helpful for you to know here. The last 4 characters of each UUID's first 8-character sequence (the part preceding the first -) will always be one of the following:

  • xxxx1111-... is the encompassing service
  • xxxx2222-... is the write-only characteristic for SSID
  • xxxx3333-... is the write-only characteristic for passkey
  • xxxx4444-... is the write-only characteristic for part ID
  • xxxx5555-... is the write-only characteristic for part secret
  • xxxx6666-... is the write-only characteristic for app address
  • xxxx7777-... is the read-only characteristic for nearby available WiFi networks that the machine has detected
2025-02-25T20:41:24.245Z	DEBUG	viam-agent	networking/bluetooth.go:507	BlueZ version (5.66) meets the requirement (5.66 or later)
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:43	Bluetooth peripheral service UUID: 79ff1111-4f38-44b9-b3b5-78fb7e14757e
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:45	WiFi SSID can be written to the following bluetooth characteristic: db8f2222-baea-452f-bad0-62b440f98161
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:47	WiFi passkey can be written to the following bluetooth characteristic: 06af3333-4442-42e5-9e6a-1729ffa1378d
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:49	Robot part key ID can be written to the following bluetooth characteristic: 26d24444-161f-4474-a4a5-e83c563ac52b
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:51	Robot part key can be written to the following bluetooth characteristic: 12db5555-ca13-4a0b-879a-c76fb1542fe1
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:53	Viam app address can be written to the following bluetooth characteristic: 9eaa6666-14f0-47a9-b69f-a093f8358b18
2025-02-25T20:41:24.246Z	DEBUG	viam-agent	networking/bluetooth.go:55	Available WiFi networks can be read from the following bluetooth characteristic: a8ee7777-2496-485a-b0ea-dc63a1122f1
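
The suffix convention above can be checked mechanically when matching LightBlue UUIDs against the logs. This is a minimal Go sketch, not code from this PR: `roleForUUID` and the role labels are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// roleBySuffix maps the last 4 characters of a UUID's first group to the
// role described in the list above. The label strings are illustrative only.
var roleBySuffix = map[string]string{
	"1111": "service",
	"2222": "SSID (write-only)",
	"3333": "passkey (write-only)",
	"4444": "part ID (write-only)",
	"5555": "part secret (write-only)",
	"6666": "app address (write-only)",
	"7777": "available WiFi networks (read-only)",
}

// roleForUUID (hypothetical helper) returns the role encoded in a UUID such
// as "79ff1111-4f38-44b9-b3b5-78fb7e14757e".
func roleForUUID(uuid string) (string, error) {
	first, _, found := strings.Cut(uuid, "-")
	if !found || len(first) != 8 {
		return "", fmt.Errorf("unexpected UUID format: %q", uuid)
	}
	suffix := first[4:]
	role, ok := roleBySuffix[suffix]
	if !ok {
		return "", fmt.Errorf("unrecognized role suffix %q in %q", suffix, uuid)
	}
	return role, nil
}

func main() {
	for _, uuid := range []string{
		"79ff1111-4f38-44b9-b3b5-78fb7e14757e",
		"db8f2222-baea-452f-bad0-62b440f98161",
	} {
		role, err := roleForUUID(uuid)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", uuid, role)
	}
}
```

Only the first group of the UUID is inspected, so the rest of the UUID can change between runs without affecting the lookup.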

The LightBlue interface is pretty easy to follow. You may need to change LightBlue messages from hex or binary to utf-8 string. Once you've confirmed you're in utf-8 string mode, you can write individual messages to the SSID, passkey, part ID, part secret, and app address "characteristics" by clicking "write value." Similarly, you can read from the available WiFi networks "characteristic" by clicking "read value." Any value written will get "sucked in" to the viam-agent provisioning flow. You can check this in the viam-agent logs. Once all values are submitted, the provisioning loop should close out the BT connection, connect to WiFi, retrieve its cloud config, and start up a viam-server (thus ending the provisioning loop).

Cases

  1. Hotspot provisioning flow (i.e. check the captive portal flow still works)
  2. Bluetooth provisioning flow (i.e. check the LightBlue flow works)

…d and rename waitForBLEValue to retryCallbackOnEmptyCharacteristicError.
…unexported interface so that the BT functionality can be mocked from provisioning tests.
… methods that exist on its implementation from the calling code (in networkmanager.go).
…ember to revert this commit once the fix for rc0.14 is released).
…(and remember to revert this commit once the fix for rc0.14 is released)."

This reverts commit a4c226a.
Comment on lines 226 to 231
userInput, err := n.bluetoothService.waitForCredentials(ctx, true, true) // Background goroutine ultimately cancelled by context.
if err != nil {
n.logger.Errorw("failed to wait for user input of credentials", "err", err)
return
}
inputChan <- *userInput
Member Author

@maxhorowitz maxhorowitz Feb 24, 2025

Would people like it more if I pass the inputChan into the waitForCredentials call itself? Then, I could get rid of the goroutine here and make the waitForCredentials call do nonblocking listening. In that case I would rename it to listenForCredentials.
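
To make the question concrete, here is a rough sketch of what a non-blocking `listenForCredentials` could look like. Everything in it is an assumption for illustration: the `userInput` fields, the `poll` callback, and the polling interval are stand-ins, not the signatures used in this PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// userInput is a hypothetical stand-in for the agent's credentials type.
type userInput struct {
	SSID string
	PSK  string
}

// listenForCredentials returns immediately. It polls until credentials are
// ready, sends them on inputChan once, and exits early if ctx is cancelled.
// The poll function and interval are assumptions made for this sketch.
func listenForCredentials(
	ctx context.Context,
	poll func() (*userInput, bool),
	interval time.Duration,
	inputChan chan<- userInput,
) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				in, ready := poll()
				if !ready {
					continue
				}
				// Deliver the result, but never block past cancellation.
				select {
				case inputChan <- *in:
				case <-ctx.Done():
				}
				return
			}
		}
	}()
}

// demo wires in a fake poll function that becomes ready on its third call.
func demo() (userInput, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	calls := 0
	poll := func() (*userInput, bool) {
		calls++
		if calls < 3 {
			return nil, false
		}
		return &userInput{SSID: "home", PSK: "example-psk"}, true
	}

	inputChan := make(chan userInput, 1)
	listenForCredentials(ctx, poll, time.Millisecond, inputChan)

	select {
	case in := <-inputChan:
		return in, nil
	case <-ctx.Done():
		return userInput{}, ctx.Err()
	}
}

func main() {
	in, err := demo()
	fmt.Println(in.SSID, err)
}
```

The goroutine does not go away in this version; it moves inside the call, so it would still need to be tracked by whatever manages the other background routines.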

Signal int32
Connected bool
LastError string
Type string `json:"type,omitempty"`
Member Author

@maxhorowitz maxhorowitz Feb 24, 2025

json:"xxx" tags are shorthand on purpose to minimize bytes wasted as uninformative JSON keys.

Member

Our coding standards are not to use shorthand for json. The json should reflect the exact spelling (though lower/snake cased) of the object to avoid confusion.

Also, changing this may break the existing mobile provisioning, as this would now get marshalled differently. If you need a new/more compact syntax, create a new struct.
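
A sketch of the "create a new struct" option, assuming the goal is a compact wire format for BLE only. The struct names, the short keys, and every tag other than `json:"type,omitempty"` are hypothetical; only the four field names come from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// networkInfo mirrors the fields visible in the diff. Every tag except
// `json:"type,omitempty"` is an assumption made for this sketch.
type networkInfo struct {
	Signal    int32  `json:"signal,omitempty"`
	Connected bool   `json:"connected,omitempty"`
	LastError string `json:"last_error,omitempty"`
	Type      string `json:"type,omitempty"`
}

// compactNetworkInfo is a hypothetical BLE-only wire format with short keys.
// The existing struct (and its marshalled output) stays untouched.
type compactNetworkInfo struct {
	Signal    int32  `json:"s,omitempty"`
	Connected bool   `json:"c,omitempty"`
	LastError string `json:"e,omitempty"`
	Type      string `json:"t,omitempty"`
}

// toCompact copies the fields across explicitly so the mapping is obvious.
func toCompact(n networkInfo) compactNetworkInfo {
	return compactNetworkInfo{
		Signal:    n.Signal,
		Connected: n.Connected,
		LastError: n.LastError,
		Type:      n.Type,
	}
}

// mustJSON marshals v and panics on failure; acceptable for a demo only.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	n := networkInfo{Signal: 70, Connected: true, Type: "wifi"}
	fmt.Println(mustJSON(n))
	fmt.Println(mustJSON(toCompact(n)))
}
```

Because the conversion is one-way and explicit, existing consumers of the spelled-out keys are unaffected by whatever the compact format ends up looking like.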

@maxhorowitz maxhorowitz changed the title [APP-7566]: Bluetooth Provisioning [APP-7566]: Add bluetooth provisioning as a default provisioning method which runs parallel to hotspot provisioning. Feb 24, 2025
)

// bluetoothService provides an interface for retrieving cloud config and/or WiFi credentials for a robot over bluetooth.
type bluetoothService interface {
Member Author

The private interface lets us

  1. shield the provisioning code path from all Linux-specific BLE behaviors, and
  2. write modular unit tests that validate state behaviors in our provisioning flow in the event of various (mocked) bluetooth errors
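
As an illustration of the second point, here is a simplified sketch of mocking an unexported interface. The method set, `mockBluetooth`, and `provisionOverBluetooth` are all hypothetical and much smaller than what the PR defines.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// userInput is a hypothetical stand-in for the agent's credentials type.
type userInput struct {
	SSID string
	PSK  string
}

// bluetoothService is a simplified, hypothetical version of the unexported
// interface; the real method set in the PR may differ.
type bluetoothService interface {
	start(ctx context.Context) error
	stop() error
	waitForCredentials(ctx context.Context) (*userInput, error)
}

// mockBluetooth returns canned results and records whether stop was called.
type mockBluetooth struct {
	startErr error
	waitErr  error
	input    *userInput
	stopped  bool
}

func (m *mockBluetooth) start(context.Context) error { return m.startErr }

func (m *mockBluetooth) stop() error {
	m.stopped = true
	return nil
}

func (m *mockBluetooth) waitForCredentials(context.Context) (*userInput, error) {
	if m.waitErr != nil {
		return nil, m.waitErr
	}
	return m.input, nil
}

// provisionOverBluetooth stands in for the provisioning code path. It only
// sees the interface, so tests can inject failures at any step.
func provisionOverBluetooth(ctx context.Context, bt bluetoothService) (*userInput, error) {
	if err := bt.start(ctx); err != nil {
		return nil, fmt.Errorf("starting bluetooth: %w", err)
	}
	in, err := bt.waitForCredentials(ctx)
	// Always attempt to stop once start has succeeded.
	if stopErr := bt.stop(); stopErr != nil {
		err = errors.Join(err, stopErr)
	}
	if err != nil {
		return nil, err
	}
	return in, nil
}

// errAdapterLost is a made-up failure used to drive the mock.
var errAdapterLost = errors.New("mock: adapter lost")

// runWithMock runs the flow against a mock with a background context.
func runWithMock(m *mockBluetooth) (*userInput, error) {
	return provisionOverBluetooth(context.Background(), m)
}

func main() {
	failing := &mockBluetooth{waitErr: errAdapterLost}
	_, err := runWithMock(failing)
	fmt.Println(err, failing.stopped)
}
```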

@@ -0,0 +1,630 @@
package networking
Member Author

Generally speaking, I'm having a difficult time enforcing behaviors via unit tests for the functions/methods in this file, because many of them are low-level BT commands.

Member

Same problem with wifi provisioning... unit tests for stuff that interacts with a complex system are just nearly impossible to create in a useful way. So don't worry too much about that. We just have to do human-in-the-loop testing.

return err
}
return errors.Join(err, err2)
n.connState.setProvisioning(false)
Member Author

Moved this from top of function (it should only be false after clean shutdown, right?).

Member

No, that's the exact opposite. You set state when you're TRYING to close down that state. If things fail halfway through, then it doesn't think it's in a WORKING provisioning mode, and will restart from the ground up.

For STARTUP, the opposite is true. Mark it at the end, when it's fully started/good.
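
The ordering rule can be sketched like this. The `connState` type here is a minimal stand-in and the step functions are hypothetical; the point is only where the flag is written relative to the work.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// connState is a minimal stand-in for the agent's connection state.
type connState struct {
	mu           sync.Mutex
	provisioning bool
}

func (c *connState) setProvisioning(v bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.provisioning = v
}

func (c *connState) getProvisioning() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.provisioning
}

// startProvisioning marks the state only after every startup step succeeded.
func startProvisioning(c *connState, steps ...func() error) error {
	for _, step := range steps {
		if err := step(); err != nil {
			return err // flag stays false: startup never completed
		}
	}
	c.setProvisioning(true)
	return nil
}

// stopProvisioning clears the state first, then attempts every shutdown
// step. A partial failure therefore never leaves the flag set.
func stopProvisioning(c *connState, steps ...func() error) error {
	c.setProvisioning(false)
	var errs error
	for _, step := range steps {
		errs = errors.Join(errs, step())
	}
	return errs
}

// errStep is a made-up failure used by the demo.
var errStep = errors.New("step failed")

func main() {
	c := &connState{}
	ok := func() error { return nil }
	fail := func() error { return errStep }

	fmt.Println(startProvisioning(c, ok, ok), c.getProvisioning())
	fmt.Println(stopProvisioning(c, fail, ok), c.getProvisioning())
}
```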

Member

@Otterverse Otterverse left a comment

I apologize in advance for how long this review is. I'm leaving a summary here at the top of (most of) the important things, so it's hopefully easier to discuss (and check off as you go). But it will overlap a lot with the inline comments too.

As always, hit me up in slack to discuss things whenever. Plenty of this is non-obvious, as you're deep in the guts of the most complicated part of Agent here.

  1. Background goroutines/threads need to be used only when/where required, and must be fully manageable. Remember that calls to start/stop can come in at any point in the flow, so everything must be able to exit cleanly and quickly. No need to background threads if you're just going to wait for them in the same function. Just handle things linearly to keep it simpler.

  2. Need to be careful about data races, and protect any data that can be changed from another thread with locks. Note that even if it's not explicitly in a thread, outside calls from start/stop/update can happen at any point, so everything in those paths has to be race-safe.

  3. BT characteristics (and their associated UUIDs) are a bunch of individual variables resulting in ~1/3 of bluetooth.go being very repetitive boilerplate. Should get them arranged into a simpler data structure, with a map to use directly, or a true structure with (generalized) getter/setter methods. E.g. calls should be able to look like val, err := bt.GetCharacteristic("ssid") or bt.SetCharacteristic("ssid", myVal) and be human readable. Look at networkState and connState for hints.

  4. As discussed in slack, the 20-character limit for a characteristic could be a real problem. SSIDs and PSKs can be longer than that by themselves. And scan results may contain dozens of networks.

  5. Need a new config setting to disable bluetooth.

  6. If agent starts on a device without bluetooth (or with an otherwise unsupported version, etc.), it needs to detect this and quietly do nothing (beyond an initial log.) Likely best to integrate with above, and just have a "temporary" flag that disables it until a full restart.

  7. BLE needs to integrate with the health checking. Any backgrounded routines need to report health status regularly using health.healthySleep() (add a new one on Networking{} to track ble health.)

  8. Ideally, this should be abstracted just like the portal is. E.g. whenever start/stopPortal() is called, start/stopBLE() is called, and otherwise works the same. They should even use the same inputChan, so the outer provisioning code only has to listen to that one channel.

  9. Network scans (and other info) are updated in real time. This is why they're managed in their own structures like n.connState(). BLE needs to use those and not just pass in a static list one time at startup.
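
For item 3, one possible shape is a mutex-protected map behind generalized getter/setter methods. This is only a sketch: `characteristicStore` and its constructor are hypothetical, and only the `GetCharacteristic`/`SetCharacteristic` call shapes come from the comment above.

```go
package main

import (
	"fmt"
	"sync"
)

// characteristicStore is a hypothetical container that replaces one variable
// per characteristic with a single lock-protected map keyed by name.
type characteristicStore struct {
	mu     sync.Mutex
	values map[string]string
}

// newCharacteristicStore registers the allowed characteristic names.
func newCharacteristicStore(names ...string) *characteristicStore {
	values := make(map[string]string, len(names))
	for _, name := range names {
		values[name] = ""
	}
	return &characteristicStore{values: values}
}

// SetCharacteristic stores val under name, rejecting unregistered names.
func (s *characteristicStore) SetCharacteristic(name, val string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.values[name]; !ok {
		return fmt.Errorf("unknown characteristic %q", name)
	}
	s.values[name] = val
	return nil
}

// GetCharacteristic returns the current value for name.
func (s *characteristicStore) GetCharacteristic(name string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	val, ok := s.values[name]
	if !ok {
		return "", fmt.Errorf("unknown characteristic %q", name)
	}
	return val, nil
}

func main() {
	bt := newCharacteristicStore("ssid", "psk", "part_id", "part_secret", "app_address")
	if err := bt.SetCharacteristic("ssid", "home"); err != nil {
		fmt.Println("error:", err)
		return
	}
	val, err := bt.GetCharacteristic("ssid")
	fmt.Println(val, err)
}
```

Keeping the lock inside the store also covers item 2, since callers from start/stop/update paths cannot touch the map without going through it.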

webServer *http.Server
grpcServer *grpc.Server
portalData *portalData
hotspotIsActive bool
Member

This duplicates the functionality of connState.GetProvisioning() (which properly mutex locks things and records timestamps.)

err = errors.Join(err, n.deactivateConnection(n.Config().HotspotInterface, n.Config().HotspotSSID))
return errw.Wrap(err, "starting web/grpc portal")
// Simultaneously start both the hotspot captive portal and the bluetooth service provisioning methods.
wg := sync.WaitGroup{}
Member

Suggested change
wg := sync.WaitGroup{}
var wg sync.WaitGroup

Empty vars should be declared with "var" when possible, not an empty struct

)
n.bluetoothIsActive = true
}, wg.Done)
wg.Wait()
Member

I do not understand why you're backgrounding functions above just to wait here. WaitGroups are for async thread management. You're firing off async threads then waiting in the same function that just fired them.

bluetoothErr = err
return
}
goutils.ManagedGo( // Listen for user input asynchronously. How should we be handling errors here?
Member

This is firing a new thread from within another backgrounded thread? This one doesn't look like it's being tracked/managed by a waitgroup, so it'll just be orphaned as far as I can tell.

Comment on lines +266 to +268
return fmt.Errorf("failed to stop advertising: %w", err)
}
bsl.advActive = false
Member

What happens if this fails and returns an error, but is still marked active?

}

// getBlueZVersion retrieves the installed BlueZ version and extracts the numeric value correctly.
func getBlueZVersion() (float64, error) {
Member

This should get/check the version via dbus (which is how the rest of this interacts) not via exec calls (if at all possible.) Userspace utilities may not actually be what's running, or may not even be installed. If version isn't available directly, then check that the properties needed exist perhaps. Should allow wider compatibility too.

Comment on lines +438 to +443
func checkOS() error {
if runtime.GOOS != "linux" {
return fmt.Errorf("this program requires Linux, detected: %s", runtime.GOOS)
}
return nil
}
Member

This likely shouldn't exist. Use build flags to guard code that simply can't run elsewhere at compile time, not checking at runtime.
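
A minimal sketch of the build-constraint approach, assuming the Linux-only code lives in its own file. The helper name `bluetoothSupported` is hypothetical.

```go
//go:build linux

// This file is only compiled when targeting Linux, so no runtime OS check
// is needed. A sibling file guarded by "//go:build !linux" (not shown, and
// an assumption of this sketch) could provide a stub for other platforms.
package main

import "fmt"

// bluetoothSupported is a hypothetical helper. In this Linux-only file it
// can simply report true, because the file is excluded everywhere else.
func bluetoothSupported() bool {
	return true
}

func main() {
	fmt.Println("bluetooth supported:", bluetoothSupported())
}
```

The constraint line must appear before the package clause and be followed by a blank line for the toolchain to honor it.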

}

// Not ready to return (do not have the minimum required set of credentials), so sleep and try again.
time.Sleep(time.Second)
Member

This needs to integrate with healthchecks. Use the healthySleep functions there. May need to extend if the health of bluetooth is separate from the general subsystems loops.

goutils.ManagedGo(func() {
if n.bluetoothService == nil {
bt, err := newBluetoothService(
n.logger, fmt.Sprintf("%s.%s.%s", n.cfg.Manufacturer, n.cfg.Model, n.cfg.FragmentID), n.getVisibleNetworks())
Member

the config values should be n.Config() otherwise you can hit a race.

Suggested change
n.logger, fmt.Sprintf("%s.%s.%s", n.cfg.Manufacturer, n.cfg.Model, n.cfg.FragmentID), n.getVisibleNetworks())
n.logger, fmt.Sprintf("%s.%s.%s", n.Config().Manufacturer, n.Config().Model, n.Config().FragmentID), n.getVisibleNetworks())

Member

n.getVisibleNetworks() is a one-time call here, but the list of visible networks changes frequently, especially if someone is trying to get a new network to show up. This needs to be handled like portal does, and updated when things change, not passed in once here only at startup.
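
One way to avoid the stale snapshot is to hand the service a lookup function instead of a list, so each read reflects the latest scan. This is a sketch under assumptions: `networkState`, `bleService`, and the comma-separated encoding are stand-ins, not the types or format used by the agent.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// networkState is a simplified stand-in for shared, lock-protected scan state.
type networkState struct {
	mu      sync.RWMutex
	visible []string
}

// setVisible replaces the list of visible SSIDs (e.g. after a new scan).
func (s *networkState) setVisible(ssids []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.visible = append([]string(nil), ssids...)
}

// getVisible returns a copy so callers cannot mutate shared state.
func (s *networkState) getVisible() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.visible...)
}

// bleService holds a lookup function rather than a one-time snapshot.
type bleService struct {
	visibleNetworks func() []string
}

// readAvailableNetworks models what a read of the networks characteristic
// would return. The comma-separated encoding is an assumption.
func (b *bleService) readAvailableNetworks() string {
	return strings.Join(b.visibleNetworks(), ",")
}

func main() {
	state := &networkState{}
	svc := &bleService{visibleNetworks: state.getVisible}

	state.setVisible([]string{"home"})
	fmt.Println(svc.readAvailableNetworks())

	// A later scan finds another network; the service sees it without restart.
	state.setVisible([]string{"home", "office"})
	fmt.Println(svc.readAvailableNetworks())
}
```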
