Rsdk 9697 support multiple nvs #376

Merged: 11 commits into viamrobotics:main, Jan 13, 2025

Conversation

npmenard (Member) commented Jan 8, 2025:

Implements the storage traits for any type we can iterate over; we can then easily pass an array or a Vec as storage to the viam server.
With this, something like
let storage = vec![ NVSStorage::new("nvs").unwrap(), NVSStorage::new("nvs2").unwrap(), ];
will allow us to store the robot data across multiple NVS partitions.

I didn't want to add an extra type to represent a collection of storages, and this way the logic is driven by the underlying storage implementation.

@npmenard npmenard requested a review from a team as a code owner January 8, 2025 20:51
}

#[derive(Clone)]
pub struct NVSStorage {
// esp-idf-svc partition driver ensures that only one handle of a type can be created
// so inner mutability can be achieved safely with RefCell
nvs: Rc<RefCell<EspCustomNvs>>,
partition_name: CString,
npmenard (Member, Author): Store a copy of the partition rather than the partition name, to avoid cloning a CString as we pass the Storage through different parts of the code.

fn reset_ota_metadata(&self) -> Result<(), Self::Error>;
}

pub trait StorageDiagnostic {
fn log_space_diagnostic(&self);
}

#[derive(Error, Debug)]
#[error("empty storage collection")]
pub struct EmptyStorageCollectionError;
npmenard (Member, Author), Jan 8, 2025: This is a special error to handle the case where a collection is empty.

@@ -158,6 +163,22 @@ struct RAMCredentialStorageInner {
#[derive(Default, Clone)]
pub struct RAMStorage(Rc<Mutex<RAMCredentialStorageInner>>);

#[derive(Error, Debug)]
pub enum RAMStorageError {
npmenard (Member, Author): Add an error type for RAMStorage to make it easier to test.

self.into_iter().any(OtaMetadataStorage::has_ota_metadata)
}
fn get_ota_metadata(&self) -> Result<OtaMetadata, Self::Error> {
self.into_iter().fold(
npmenard (Member, Author), Jan 8, 2025: The idea here is to use fold to call get_ota_metadata until Ok is returned. If all storages in a collection return an error, only the last error is reported.
We don't have an iterator adapter that behaves like try_fold but short-circuits on the first Ok, so we will iterate through the whole collection even after a get_ota_metadata call has succeeded; get_ota_metadata will, however, not be called again on the remaining storages once it has succeeded.
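For illustration, a minimal standalone sketch of this fold-until-Ok behavior, using plain function pointers in place of the real storage trait (everything here is hypothetical, not micro-rdk code):

// Fold an iterator of fallible calls, keeping the first Ok.
// `or_else` skips its closure once the accumulator is Ok, so later calls are
// not invoked, even though the fold still visits every element.
fn first_ok<T, E>(
    calls: impl IntoIterator<Item = fn() -> Result<T, E>>,
    empty_collection_error: E,
) -> Result<T, E> {
    calls
        .into_iter()
        .fold(Err(empty_collection_error), |acc, call| acc.or_else(|_| call()))
}

fn main() {
    fn fail_a() -> Result<i32, &'static str> { Err("error from A") }
    fn fail_b() -> Result<i32, &'static str> { Err("error from B") }
    fn succeed() -> Result<i32, &'static str> { Ok(42) }

    // fail_a runs, succeed supplies the value, the trailing fail_b is never called.
    let calls: Vec<fn() -> Result<i32, &'static str>> = vec![fail_a, succeed, fail_b];
    assert_eq!(first_ok(calls, "empty collection"), Ok(42));

    // With only failures, the last error is the one reported.
    let calls: Vec<fn() -> Result<i32, &'static str>> = vec![fail_a, fail_b];
    assert_eq!(first_ok(calls, "empty collection"), Err("error from B"));
}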

{
type Error = Storage::Error;
fn has_ota_metadata(&self) -> bool {
self.into_iter().any(OtaMetadataStorage::has_ota_metadata)
npmenard (Member, Author): Short-circuit on the first true.

#[cfg(feature = "ota")]
impl<Iterable, Storage: OtaMetadataStorage> OtaMetadataStorage for Iterable
where
for<'a> &'a Iterable: IntoIterator<Item = &'a Storage>,
npmenard (Member, Author), Jan 8, 2025: This will only work on collections that implement impl<'a, T> IntoIterator for &'a ...
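For reference, a minimal self-contained sketch of that bound with a made-up trait; it mirrors the shape of the impl above, but none of these names come from micro-rdk:

// Hypothetical storage trait standing in for the real storage traits.
trait HasKey {
    fn has_key(&self, key: &str) -> bool;
}

struct MemStorage {
    keys: Vec<String>,
}

impl HasKey for MemStorage {
    fn has_key(&self, key: &str) -> bool {
        self.keys.iter().any(|k| k == key)
    }
}

// Blanket impl for any collection whose reference is iterable over storages,
// i.e. anything providing `impl<'a> IntoIterator for &'a Collection`.
impl<Iterable, Storage: HasKey> HasKey for Iterable
where
    for<'a> &'a Iterable: IntoIterator<Item = &'a Storage>,
{
    fn has_key(&self, key: &str) -> bool {
        // short-circuits on the first storage that reports the key
        self.into_iter().any(|s| s.has_key(key))
    }
}

fn main() {
    let storages = vec![
        MemStorage { keys: vec![] },
        MemStorage { keys: vec!["robot_id".to_string()] },
    ];
    // Vec<MemStorage> and [MemStorage; N] both satisfy the bound.
    assert!(storages.has_key("robot_id"));
}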

acmorrow (Member):
I have one small concern here. I don't have a fully worked counterexample, but I wonder a bit about using something like a HashSet as the type backing IntoIterator, because the iteration order isn't deterministic.

@@ -136,7 +137,7 @@ impl StorageDiagnostic for NVSStorage {
fn log_space_diagnostic(&self) {
let mut stats: nvs_stats_t = Default::default();
if let Err(err) =
esp!(unsafe { nvs_get_stats(self.partition_name.as_ptr(), &mut stats as *mut _) })
esp!(unsafe { nvs_get_stats(self.part.handle() as _, &mut stats as *mut _) })
npmenard (Member, Author): The handle is the partition name.

fn reset_ota_metadata(&self) -> Result<(), Self::Error> {
self.into_iter().fold(
Err::<_, Self::Error>(EmptyStorageCollectionError.into()),
|val, s| val.or(s.reset_ota_metadata()),
npmenard (Member, Author): Taking advantage of the eagerness of or might not be the best approach.

acmorrow (Member): Can you elaborate a bit? I think I know what you are saying, but I'd rather not guess.

npmenard (Member, Author), Jan 8, 2025: As opposed to or_else, or will always evaluate its argument (because it doesn't accept a closure) before considering whether to replace val, hence the eagerness. We use it to our advantage to make sure all copies of an object are erased across all storages. In theory we shouldn't need this, since we use or_else elsewhere, so only one copy should be in the storage array at any time.
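A tiny self-contained demonstration of that eager-vs-lazy difference (not micro-rdk code):

fn noisy(tag: &str) -> Result<i32, String> {
    println!("evaluated {tag}"); // shows when the argument is actually computed
    Ok(1)
}

fn main() {
    let already_ok: Result<i32, String> = Ok(0);

    // `or` takes a value, so its argument is evaluated before `or` even runs:
    // "evaluated eager" is printed although `already_ok` is already Ok.
    let _ = already_ok.clone().or(noisy("eager"));

    // `or_else` takes a closure that only runs when the receiver is Err,
    // so "evaluated lazy" is never printed here.
    let _ = already_ok.or_else(|_| noisy("lazy"));
}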

acmorrow (Member) left a comment: This is neat, LGTM, a few small comments. I'd like to think about it a bit tomorrow before it merges so I can make sure I totally understand how the various folds are working and think about whether there are any tricky edge cases.

micro-rdk/src/common/credentials_storage.rs (resolved)
let mut inner_ref = self.0.lock().unwrap();
let _ = inner_ref.robot_creds.insert(creds);
Ok(())
}
fn get_robot_credentials(&self) -> Result<RobotCredentials, Self::Error> {
let inner_ref = self.0.lock().unwrap();
let cfg = inner_ref.robot_creds.clone().unwrap_or_default().clone();
Ok(cfg)
log::info!("get robot creds");
acmorrow (Member): Stray log, or debug!?

As an aside, I do miss some of the logging we used to have pre #316 that gave a little more context about progression through the startup sequence, whether we were using cached credentials or configs, etc. I'm usually for "silence is golden" but in this case that's very useful information to understand how things are coming up, especially in offline mode. Would you be OK with a ticket to re-introduce some of that logging? Or did you not propagate into the refactored world because you found it too chatty?

npmenard (Member, Author): Stray log, I will remove it.
Yes, let's make a ticket for some logging housekeeping.

acmorrow (Member): Done, in https://viam.atlassian.net/browse/RSDK-9715. Feel free to add more things you'd like to see added to a logging pass.

}
fn reset_robot_credentials(&self) -> Result<(), Self::Error> {
let mut inner_ref = self.0.lock().unwrap();
let _ = inner_ref.robot_creds.take();
log::info!("reset robot creds");
acmorrow (Member): As above.

assert!(!v.has_robot_credentials());
assert!(ram2
.store_robot_credentials(&CloudConfig {
app_address: "http://downloadramstorage.org".to_owned(),
acmorrow (Member): maybe .example.org

npmenard (Member, Author): Yes, that was for the joke :P

acmorrow (Member) commented Jan 8, 2025:

How do you envision configuring the space of available NVS, for, say, callers of viam_server_start?

npmenard (Member, Author) commented Jan 8, 2025:

> How do you envision configuring the space of available NVS, for, say, callers of viam_server_start?

What do you mean?

acmorrow (Member) commented Jan 9, 2025:

> > How do you envision configuring the space of available NVS, for, say, callers of viam_server_start?
>
> What do you mean?

As discussed in person, I was interested in how you saw the ability to configure the selection of available partitions expressed in the FFI.

acmorrow (Member) left a comment: LGTM, though I'm interested in your thoughts about my iteration order worry.

self.set_string(NVS_ROBOT_ID_KEY, &cfg.id)?;
self.set_string(NVS_ROBOT_APP_ADDRESS, &cfg.app_address)?;
self.set_string(NVS_ROBOT_ID_KEY, &cfg.id).or_else(|err| {
let _ = self.reset_robot_credentials();
acmorrow (Member):
  • By using reset_robot_credentials, I think there is now a requirement that it erase keys in the same order that store_robot_credentials writes them, since a failed erase_key ends the method. Maybe it is better to explicitly just unset the keys as acted on so far in the store methods rather than relying on ordering in a different method? Accumulate some guards?
  • ⬇️ - Is it a bug that reset_robot_credentials doesn't clear NVS_ROBOT_APP_ADDRESS?
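A rough sketch of the first bullet's suggestion, assuming a hypothetical key/value trait in place of the real NVS wrapper (the trait, method, and key names below are made up):

// Hypothetical key/value interface standing in for the NVS storage wrapper.
trait KeyValue {
    type Error;
    fn set_string(&self, key: &str, value: &str) -> Result<(), Self::Error>;
    fn erase_key(&self, key: &str) -> Result<(), Self::Error>;
}

// Store two keys; if the second write fails, erase only the key already written
// in this call, then surface the original error. No reliance on a separate reset
// method erasing keys in the same order they were stored.
fn store_credentials<S: KeyValue>(s: &S, id: &str, app_address: &str) -> Result<(), S::Error> {
    s.set_string("robot_id", id)?;
    if let Err(err) = s.set_string("robot_app_address", app_address) {
        let _ = s.erase_key("robot_id"); // best-effort rollback of the partial write
        return Err(err);
    }
    Ok(())
}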

@@ -284,7 +315,22 @@ pub unsafe extern "C" fn viam_server_start(ctx: *mut viam_server_context) -> via
#[cfg(not(target_os = "espidf"))]
let storage = micro_rdk::common::credentials_storage::RAMStorage::default();
#[cfg(target_os = "espidf")]
let storage = micro_rdk::esp32::nvs_storage::NVSStorage::new("nvs").unwrap();
//let storage = {micro_rdk::esp32::nvs_storage::NVSStorage::new("nvs").unwrap()};
acmorrow (Member): Remove commented line.

@@ -257,7 +288,7 @@ pub unsafe extern "C" fn viam_server_start(ctx: *mut viam_server_context) -> via
use micro_rdk::common::credentials_storage::RAMStorage;
use micro_rdk::common::credentials_storage::RobotConfigurationStorage;
use micro_rdk::proto::provisioning::v1::CloudConfig;

log::info!("when the FFI was built a robot config was supplied, defaulting to this robot");
acmorrow (Member): Maybe take this out for now but add a TODO(RSDK-9715), which is my init logging ticket, so we can do this in coordination with other startup logging changes?

@@ -310,6 +310,18 @@ void app_main(void)
return;
}

ret = viam_server_add_nvs_storage(viam_ctx, "nvs");
if (ret != VIAM_OK) {
ESP_LOGE(TAG,"couldn't set add nvs partition, error : %i", ret);
acmorrow (Member): space after , here and below please

acmorrow (Member) commented:

> LGTM, though I'm interested in your thoughts about my iteration order worry.

I think the fix you've made here about ensuring that writes of multiple keys are all-or-nothing in a given partition is necessary. But I'm still a little concerned about iteration order determinism over the set of storages. Per my Slack message:

The situation I had in mind looked something like this: you have NVS partitions named A and B, but they are referenced via a HashSet, so the iteration order is indeterminate. If the firmware has some sort of unconditional write on startup (e.g. writing the robot config), you might come up with A iterated first on one boot and B iterated first on another, so the value ends up stored in both. If those values differ, and you then come up in a mode where you need to consume the value (e.g. offline), you may get either value depending on whether you iterate A or B first, and potentially one of them is out of date.

I haven’t totally convinced myself that this is possible or a real problem, but it has me a little hesitant.

Maybe it can easily be fixed by iterating in sorted order, based on partition name or address?

Or by requiring the partitions to be paired with “priorities”, so we can say “prefer the nvs partition, but spill to config if needed”. Then just iterate sorted on prio, and it wouldn’t matter if the container supporting iteration was non-deterministic. Setting prio though wouldn’t be 100% as you could accidentally set two with prio N and then you are back to non-determinism.
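A hypothetical sketch of that "priorities" idea, not anything in this PR: pair each storage with a priority and sort before handing the collection to the blanket impls, so the backing container's iteration order stops mattering:

// Hypothetical wrapper pairing a storage with an explicit priority.
struct Prioritized<S> {
    priority: u8, // lower value = preferred partition
    storage: S,
}

// Return the storages in deterministic, priority-sorted order, regardless of how
// the original container iterates. As noted above, equal priorities would
// reintroduce non-determinism among the tied storages.
fn in_priority_order<S>(mut storages: Vec<Prioritized<S>>) -> Vec<S> {
    storages.sort_by_key(|p| p.priority);
    storages.into_iter().map(|p| p.storage).collect()
}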

npmenard (Member, Author) commented:

> LGTM, though I'm interested in your thoughts about my iteration order worry.
>
> I think the fix you've made here about ensuring that writes of multiple keys are all-or-nothing in a given partition is necessary. But I'm still a little concerned about iteration order determinism over the set of storages. Per my Slack message:
>
> The situation I had in mind looked something like this: you have NVS partitions named A and B, but they are referenced via a HashSet, so the iteration order is indeterminate. If the firmware has some sort of unconditional write on startup (e.g. writing the robot config), you might come up with A iterated first on one boot and B iterated first on another, so the value ends up stored in both. If those values differ, and you then come up in a mode where you need to consume the value (e.g. offline), you may get either value depending on whether you iterate A or B first, and potentially one of them is out of date.
>
> I haven't totally convinced myself that this is possible or a real problem, but it has me a little hesitant.
>
> Maybe it can easily be fixed by iterating in sorted order, based on partition name or address?
>
> Or by requiring the partitions to be paired with "priorities", so we can say "prefer the nvs partition, but spill to config if needed". Then just iterate sorted on prio, and it wouldn't matter if the container supporting iteration was non-deterministic. Setting prio though wouldn't be 100% as you could accidentally set two with prio N and then you are back to non-determinism.

I think this can be an issue only when storing the config or certificate, since we don't explicitly erase these objects when storing them during viam server initialization; we implicitly expect the storage driver to overwrite.
Everywhere else we either check whether an object is in storage or erase it unilaterally.
We can add a ticket to change the initialization behavior to do an erase before storing, but that would increase the wear on the flash.

@npmenard npmenard requested a review from acmorrow January 13, 2025 18:26
acmorrow (Member) left a comment: LGTM

@npmenard npmenard merged commit 0cbf5b1 into viamrobotics:main Jan 13, 2025
6 checks passed