[APP-7566]: Add bluetooth provisioning as a default provisioning method which runs parallel to hotspot provisioning. #70
base: main
…ed interface for OS abstraction).
…irware differences.
… is yet to be implemented.
…fix used elsewhere in the Agent.
…t sending information to a closed channel.
…d and rename waitForBLEValue to retryCallbackOnEmptyCharacteristicError.
…move custom types that are now redundant.
…unexported interface so that the BT functionality can be mocked from provisioning tests.
… methods that exist on its implementation from the calling code (in networkmanager.go).
… unimplemented and left for follow up PRs).
…for machine settings config updates).
…ember to revert this commit once the fix for rc0.14 is released).
…(and remember to revert this commit once the fix for rc0.14 is released)." This reverts commit a4c226a.
…rmed with manual test).
userInput, err := n.bluetoothService.waitForCredentials(ctx, true, true) // Background goroutine ultimately cancelled by context.
if err != nil {
	n.logger.Errorw("failed to wait for user input of credentials", "err", err)
	return
}
inputChan <- *userInput
Would people like it more if I pass the inputChan into the waitForCredentials call itself? Then, I could get rid of the goroutine here and make the waitForCredentials call do nonblocking listening. In that case I would rename it to listenForCredentials.
Signal int32
Connected bool
LastError string
Type string `json:"type,omitempty"`
json:"xxx" tags are shorthand on purpose to minimize bytes wasted as uninformative JSON keys.
Our coding standard is not to use shorthand for JSON keys. The JSON should reflect the exact spelling (though lower/snake cased) of the object to avoid confusion.
Also, changing this may break the existing mobile provisioning, as this would now get marshalled differently. If you need a new/more compact syntax, create a new struct.
)

// bluetoothService provides an interface for retrieving cloud config and/or WiFi credentials for a robot over bluetooth.
type bluetoothService interface {
The private interface lets us
- isolate all Linux-related BLE behaviors from the provisioning code path, and
- write modular unit tests that validate state behaviors in our provisioning flow in the event of various (mocked) bluetooth errors.
@@ -0,0 +1,630 @@
package networking
Generally speaking, I'm having a difficult time enforcing behaviors via unit tests for the functions/methods in this file, because many of them are low-level BT commands.
Same problem with wifi provisioning... unit tests for stuff that interacts with a complex system are just nearly impossible to create in a useful way. So don't worry too much about that. We just have to do human-in-the-loop testing.
return err
}
return errors.Join(err, err2)
n.connState.setProvisioning(false)
Moved this from top of function (it should only be false after clean shutdown, right?).
No, that's the exact opposite. You set state when you're TRYING to close down that state. If things fail halfway through, then it doesn't think it's in a WORKING provisioning mode, and will restart from the ground up.
For STARTUP, the opposite is true. Mark it at the end, when it's fully started/good.
I apologize in advance for how long this review is. I'm leaving a summary here at the top of (most of) the important things, so it's hopefully easier to discuss (and check off as you go). But it will overlap a lot with the inline comments too.
As always, hit me up in slack to discuss things whenever. Plenty of this is non-obvious, as you're deep in the guts of the most complicated part of Agent here.
- Background goroutines/threads need to be used only when/where required, and must be fully manageable. Remember that calls to start/stop can come in at any point in the flow, so everything must be able to exit cleanly and quickly. No need to background threads if you're just going to wait for them in the same function. Just handle things linearly to keep it simpler.
- Need to be careful about data races, and protect any data that can be changed from another thread with locks. Note that even if it's not explicitly in a thread, outside calls from start/stop/update can happen at any point, so everything in those paths has to be race-safe.
- BT characteristics (and their associated UUIDs) are a bunch of individual variables, resulting in ~1/3 of bluetooth.go being very repetitive boilerplate. Should get them arranged into a simpler data structure, with a map to use directly, or a true structure with (generalized) getter/setter methods. E.g. calls should be able to look like val, err := bt.GetCharacteristic("ssid") or bt.SetCharacteristic("ssid", myVal) and be human readable. Look at networkState and connState for hints.
- As discussed in slack, the 20 character limit for a characteristic could be a real problem. SSIDs and PSKs can be longer than that by themselves. And scan results may contain dozens of networks.
- Need a new config setting to disable bluetooth.
- If agent starts on a device without bluetooth (or an otherwise unsupported version/etc.), it needs to detect this and quietly do nothing (beyond an initial log). Likely best to integrate with above, and just have a "temporary" flag that disables it until a full restart.
- BLE needs to integrate with the health checking. Any backgrounded routines need to report health status regularly using health.healthySleep() (add a new one on Networking{} to track BLE health).
- Ideally, this should be abstracted just like the portal is. E.g. whenever start/stopPortal() is called, start/stopBLE() is called, and otherwise works the same. They should even use the same inputChan, so the outer provisioning code only has to listen to that one channel.
- Network scans (and other info) are updated in real time. This is why they're managed in their own structures like n.connState(). BLE needs to use those and not just pass in a static list one time at startup.
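To make the characteristics point in the list above concrete, here is a rough sketch of a map-backed store with generalized, lock-protected accessors. Only the GetCharacteristic/SetCharacteristic call shapes come from the review; the characteristics type, newCharacteristics, and the key names other than "ssid" are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// characteristics is a hypothetical replacement for one variable per BLE
// characteristic: a single map guarded by a lock, so reads and writes from
// other goroutines (or from start/stop/update calls) stay race-safe.
type characteristics struct {
	mu     sync.RWMutex
	values map[string]string
}

// newCharacteristics registers the known names up front so that typos in a
// name surface as errors instead of silently creating new entries.
func newCharacteristics(names ...string) *characteristics {
	c := &characteristics{values: make(map[string]string, len(names))}
	for _, name := range names {
		c.values[name] = ""
	}
	return c
}

// GetCharacteristic returns the current value for a registered name.
func (c *characteristics) GetCharacteristic(name string) (string, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.values[name]
	if !ok {
		return "", fmt.Errorf("unknown characteristic %q", name)
	}
	return v, nil
}

// SetCharacteristic stores a value for a registered name.
func (c *characteristics) SetCharacteristic(name, value string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.values[name]; !ok {
		return fmt.Errorf("unknown characteristic %q", name)
	}
	c.values[name] = value
	return nil
}

func main() {
	// Key names here are illustrative only.
	bt := newCharacteristics("ssid", "psk", "part_id", "part_secret", "app_address")
	if err := bt.SetCharacteristic("ssid", "my-network"); err != nil {
		fmt.Println("set failed:", err)
		return
	}
	val, err := bt.GetCharacteristic("ssid")
	fmt.Println(val, err)
}
```

A real version would presumably also carry each characteristic's UUID and read/write permissions alongside its value; that part is left out here.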
webServer *http.Server
grpcServer *grpc.Server
portalData *portalData
hotspotIsActive bool
This duplicates the functionality of connState.GetProvisioning() (which properly mutex locks things and records timestamps.)
err = errors.Join(err, n.deactivateConnection(n.Config().HotspotInterface, n.Config().HotspotSSID))
return errw.Wrap(err, "starting web/grpc portal")
// Simultaneously start both the hotspot captive portal and the bluetooth service provisioning methods.
wg := sync.WaitGroup{}
- wg := sync.WaitGroup{}
+ var wg sync.WaitGroup
Empty vars should be declared with "var" when possible, not with an empty struct literal.
)
n.bluetoothIsActive = true
}, wg.Done)
wg.Wait()
I do not understand why you're backgrounding functions above just to wait here. WaitGroups are for async thread management. You're firing off async threads then waiting in the same function that just fired them.
bluetoothErr = err
return
}
goutils.ManagedGo( // Listen for user input asynchronously. How should we be handling errors here?
This is firing a new thread from within another backgrounded thread? This one doesn't look like it's being tracked/managed by a waitgroup, so it'll just be orphaned as far as I can tell.
return fmt.Errorf("failed to stop advertising: %w", err)
}
bsl.advActive = false
What happens if this fails and returns an error, but is still marked active?
}

// getBlueZVersion retrieves the installed BlueZ version and extracts the numeric value correctly.
func getBlueZVersion() (float64, error) {
This should get/check the version via dbus (which is how the rest of this interacts) not via exec calls (if at all possible.) Userspace utilities may not actually be what's running, or may not even be installed. If version isn't available directly, then check that the properties needed exist perhaps. Should allow wider compatibility too.
func checkOS() error {
	if runtime.GOOS != "linux" {
		return fmt.Errorf("this program requires Linux, detected: %s", runtime.GOOS)
	}
	return nil
}
This likely shouldn't exist. Use build flags to guard code that simply can't run elsewhere at compile time, not checking at runtime.
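For reference, a compile-time guard might look roughly like this. The function name bluetoothSupported is hypothetical, and the example assumes a sibling file constrained with "//go:build !linux" that defines the same function returning false.

```go
//go:build linux

package main

import "fmt"

// bluetoothSupported is compiled only on Linux because of the build
// constraint at the top of this file. A sibling file guarded by
// "//go:build !linux" (not shown, hypothetical) would define the same
// function returning false, so callers never check runtime.GOOS.
func bluetoothSupported() bool {
	return true
}

func main() {
	fmt.Println("bluetooth supported:", bluetoothSupported())
}
```

The same constraint on the file holding the BlueZ/D-Bus code keeps it out of non-Linux builds entirely.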
}

// Not ready to return (do not have the minimum required set of credentials), so sleep and try again.
time.Sleep(time.Second)
This needs to integrate with healthchecks. Use the healthySleep functions there. May need to extend if the health of bluetooth is separate from the general subsystems loops.
goutils.ManagedGo(func() {
	if n.bluetoothService == nil {
		bt, err := newBluetoothService(
			n.logger, fmt.Sprintf("%s.%s.%s", n.cfg.Manufacturer, n.cfg.Model, n.cfg.FragmentID), n.getVisibleNetworks())
The config values should be read via n.Config(); otherwise you can hit a race.
- n.logger, fmt.Sprintf("%s.%s.%s", n.cfg.Manufacturer, n.cfg.Model, n.cfg.FragmentID), n.getVisibleNetworks())
+ n.logger, fmt.Sprintf("%s.%s.%s", n.Config().Manufacturer, n.Config().Model, n.Config().FragmentID), n.getVisibleNetworks())
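Why the accessor matters, as a reduced sketch. The config and networking types below are hypothetical stand-ins with only the three fields used in the suggestion; the Agent's real Config() may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// config and networking are hypothetical, reduced stand-ins for the Agent's
// types, carrying only the fields used in the suggestion above.
type config struct {
	Manufacturer string
	Model        string
	FragmentID   string
}

type networking struct {
	mu  sync.RWMutex
	cfg config
}

// Config returns a copy of the current config under a read lock, so a
// concurrent update cannot be observed halfway through.
func (n *networking) Config() config {
	n.mu.RLock()
	defer n.mu.RUnlock()
	return n.cfg
}

// update swaps in a new config under the write lock.
func (n *networking) update(c config) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.cfg = c
}

// deviceName takes a single snapshot, so all three fields come from the
// same config version even if update runs concurrently.
func (n *networking) deviceName() string {
	c := n.Config()
	return fmt.Sprintf("%s.%s.%s", c.Manufacturer, c.Model, c.FragmentID)
}

func main() {
	n := &networking{}
	n.update(config{Manufacturer: "viam", Model: "custom", FragmentID: "abc123"})
	fmt.Println(n.deviceName())
}
```

Taking one snapshot and reading all three fields from it goes slightly beyond the suggested change, which calls n.Config() three times; both avoid the unlocked read of n.cfg.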
n.getVisibleNetworks() is a one-time call here, but the list of visible networks changes frequently, especially if someone is trying to get a new network to show up. This needs to be handled like portal does, and updated when things change, not passed in once here only at startup.
Summary
This PR encompasses everything needed for adding bluetooth provisioning as a default provisioning method in the viam-agent (and is designed to run in parallel to the existing WiFi hotspot provisioning method).
Setup
I want you to test as if you are a customer who has just received a machine in the mail. The machine will just have a viam-agent binary installation. It will not have WiFi credentials (and thus no internet connectivity).
Machine setup
Using a Linux machine/laptop (one where you are comfortable having your /opt/viam/*, /etc/viam/* and other files tampered with), pull down this branch. Because you will need to remove connectivity to emulate the user perspective, please do not SSH into a machine over WiFi (and don't use Ethernet because it can interfere with provisioning). Instead, test directly on the device (keyboard + mouse needed).
Uninstall Viam
First, you will need to remove any Viam-related stuff from your computer (to emulate the "newly-out-of-the-box" scenario our customers will be in). Please run the following from the repository root:
sudo ./uninstall.sh
Remove local networks
You will then need to remove local networks from your machine to emulate the "offline-ness" that our customers will be dealing with. I've been running nmcli con show and subsequently sudo nmcli con delete <name> (for every network that could "interfere" with the provisioning flow).
Preinstall the Viam Agent
Then, run the following (again from the repository root):
make
sudo ./bin/viam-agent-custom-aarch64 --install
sudo /opt/viam/bin/viam-agent --debug
I've been running the commands altogether as such:
make && sudo ./bin/viam-agent-custom-aarch64 --install && sudo /opt/viam/bin/viam-agent --debug
Phone setup
Download the LightBlue app (available on both Android and iPhone). This is what we will use to communicate with the bluetooth service that is being advertised from the Linux machine.
Testing Procedure
At this point, we are ready to test.
LightBlue
Pairing
In the app, use the search bar to find the name of the Linux machine that you are currently testing against. It should be there. Pair with your machine (one of the tests is to see if the pairing request is automatically accepted on the machine side, so I am hoping this part works!).
If it does not pair, check your screen for a six-digit pairing code and accept the request manually. This is a known limitation that I am working on fixing. The TLDR is it's complicated because of reliance on systemd bus signals which may be picked up and discarded as negligible elsewhere in the viam-agent.
Bluetooth service used to "transmit" credentials
Once paired, look at the bluetooth service and nested characteristics available to us. You should be able to take the provided UUIDs and map them to the logs on your viam-agent machine.
There is an encoding that will be helpful for you to know here. The last 4 characters of each UUID's first 8-character sequence (preceding the first -) will always be one of the following:
- xxxx1111-... is the encompassing service
- xxxx2222-... is the write-only characteristic for SSID
- xxxx3333-... is the write-only characteristic for passkey
- xxxx4444-... is the write-only characteristic for part ID
- xxxx5555-... is the write-only characteristic for part secret
- xxxx6666-... is the write-only characteristic for app address
- xxxx7777-... is the read-only characteristic for nearby available WiFi networks that the machine has detected
The LightBlue interface is pretty easy to follow. You may need to change LightBlue messages from hex or binary to utf-8 string. Once you have confirmed you're in utf-8 string mode, you can write individual messages to the SSID, passkey, part ID, part secret, and app address "characteristics" by clicking "write value." Similarly, you can read from the available WiFi networks "characteristic" by clicking "read value." Any value written will get "sucked in" to the viam-agent provisioning flow. You can check this in the viam-agent logs. Once all values are submitted, the provisioning loop should close out the BT connection, connect to WiFi, retrieve its cloud config, and start up a viam-server (thus ending the provisioning loop).
Cases