Reliable Delivery model algorithm #485
Comments
I love the idea of having a standard for this, but I've never done the work you've done to test it.
Yes, I've got lots of gray hairs from having to meet targets of 99.999% (5 9's) reliability: generate 200,000 messages or telephone calls and have only one fail for the test to pass. Ahhh! Sometimes there was a chain of 5 or 6 little Mayflies (or tiny 8051s) for all the messages to pass through - something like I2C to the modem controller, then the modem software itself, the network... The reliability was a very visible characterization, and sometimes took some custom test beds and setups to achieve. What worked was being able to easily start the test at the end of the day, then check the numbers easily the next morning and spot any potential issues that might creep in. I believe it also makes it easier on the server side, as it gets a defined server loading burst that keeps it ahead of actual real data, and makes it possible to see potential problems earlier. I was thinking of proposing a standard test setup for a Mayfly, maybe under ModularSensors/tools/tests, that would be easy to build and deploy.
Since there is no comment on this from the server side, I propose any significant delivery errors are flagged.
@neilh10, I also love this idea. On the server side, we'll unfortunately need to wait for funding. The good news is that we have an excellent proposal pending, so hopefully that starts sooner rather than later.
Great. I guess this is a marker for anybody thinking about what "reliable delivery" looks like from the server side. A data point: in recent tests last week, when testing the "low battery" condition, I ended up with about 1000 outstanding readings, and the number of POSTs in one session was varying between about 8 and 30 before the server side timed out. They all got delivered successfully, which is really nice (#489).
An observation - the Mayfly Xbee S6B WiFi became unstable (EnviroDIY/ModularSensors#347) in my testing in Jan 2021. The issue for all "clients" POSTing to ODM2/MonitorMyWatershed is what the model for POSTing is, and then being able to characterize it on the device side. The Xbee WiFi S6B device has a limited method of setting up and tearing down TCP/IP links, and a number of hacks have been tried to manage the TCP/IP links to MMW. The problem is that this may beat on the server a lot. So this is just a plea, based on hours of debugging, for documenting the ODM2 model for accessing the server, with all the internal network-facing timeouts identified.
In this update I'm identifying what I understand to be the state of the different systems supporting a "reliable delivery" algorithm. The target I suggest is that data recorded on ModularSensors/Mayfly for 1 year at 15-minute intervals will be successfully delivered to MMW. Release v0.12.x is now production on the AWS servers and live for POSTs to data.enviroDIY.org. A sensor node, typically running ModularSensors on a Mayfly, takes readings and periodically POSTs the results to the server, where they are accessed and downloaded through MMW. All POSTs that are accepted into the database, or are already in the database, receive an HTTP 201 (per the resolution in #538). The term "reliable delivery" is from functionality discussed in these issues. The released https://github.com/EnviroDIY/ModularSensors/releases/tag/v0.32.2 is best effort - that is, POST, and if everything works well the data should be there. However, there are many potential situations where this may not result in data arriving: communication channels and servers cannot be guaranteed to operate perfectly. I've been working on a forked version of ModularSensors that includes reliable delivery - I designate this fork azModularSensors - and initial functionality for reliable delivery was added in https://github.com/neilh10/ModularSensors/releases/tag/v0.25.0.release1_200906. The reliable delivery algorithm is implemented in azModularSensors and on the server; provided the server is up and there is a reasonable transmission medium, the readings are transferred to the server. Reliable delivery is configured through a local configuration file:
[COMMON]
LOGGING_INTERVAL_MINUTES=2 ; aggressive testing; typically every 15 minutes
[NETWORK]
COLLECT_READINGS=5 ; number of readings to collect before sending (0 to 30)
POST_MAX_NUM=100 ; max number of queued readings to POST
SEND_OFFSET_MIN=0 ; minutes to wait after collection completes before sending
[PROVIDER_MMW]
CLOUD_ID=data.enviroDIY.org
TIMER_POST_TOUT_MS=7000 ; gateway timeout (ms), depending on medium
TIMER_POST_PACE_MS=3000 ; pause (ms) between POSTs
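To make this concrete, below is a minimal sketch of a delivery loop driven by these parameters. It is illustrative only: the names (Reading, postReading, pace, deliveryWindow) are hypothetical, not the actual azModularSensors API.

```cpp
#include <cstdint>
#include <deque>

struct Reading { uint32_t timestamp; float value; };

// ini-derived parameters (values from the block above)
const size_t COLLECT_READINGS   = 5;    // readings to queue before a send
const size_t POST_MAX_NUM       = 100;  // max POSTs per connection window
const long   TIMER_POST_TOUT_MS = 7000; // per-POST response timeout
const long   TIMER_POST_PACE_MS = 3000; // pause between POSTs

std::deque<Reading> readingQueue;  // file-backed on the real device

int  postReading(const Reading& r, long timeoutMs); // returns HTTP status, -1 on timeout
void pace(long ms);                                 // delay() on Arduino

void deliveryWindow() {
    if (readingQueue.size() < COLLECT_READINGS) return;  // not time to send yet
    size_t posted = 0;
    while (!readingQueue.empty() && posted < POST_MAX_NUM) {
        int status = postReading(readingQueue.front(), TIMER_POST_TOUT_MS);
        if (status != 201) break;      // anything but 201: keep queued, retry later
        readingQueue.pop_front();      // 201 Created == delivered
        ++posted;
        pace(TIMER_POST_PACE_MS);      // be gentle on the server
    }
}
```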
@neilh10, I think it is worth changing your host; see my explanation here.
@neilh10, now that we have completed release 0.12 to AWS, we're finally in a position to start working with you closely on your proposed Reliable Delivery Model Algorithm. That said, there is still a lot of tech debt for us to work on for release 0.13, all of which will substantially help toward our collective goals for reliable data delivery. So we may not be able to fully address your proposal with 0.13, even as we are working with you toward that goal. We very much appreciate your detailed suggestions and error reporting, and appreciate your patience with the amount of time it has taken us to get to a position where we might start working toward your proposal.
Based on #543, to reiterate: the intent of the reliable delivery algorithm is to throttle if the server isn't responding, and to follow normal comms-industry practice of making it configurable for each IoT device and potential server. Part of the purpose of this discussion is so that a (future) large number of ModularSensors/Mayflies, properly configured, will work gracefully with the server as traffic scales. Happy to have any comments/feedback on what works better for the server side - and the intent is absolutely for individual ModularSensors nodes, through easy statistical configuration, to reduce future costs on the server technology as it scales. Under "standard" data delivery with local power savings, my fork of ModularSensors is configured through ms_cfg.ini to take a reading every 15 minutes (could be faster) and queue readings for delivery every fourth reading period (4 x 15 min = one hour) in a local file. Every reading that doesn't receive an HTTP 201 is written to a file and retried in a later connection window:
[COMMON]
LOGGING_INTERVAL_MINUTES=15 ; collect a reading every 15 minutes
[NETWORK]
COLLECT_READINGS=4 ; collect 4 readings, then deliver
SEND_OFFSET_MIN=3 ; delay (minutes) after collection completes before sending
POST_MAX_NUM=100 ; max number of queued readings to POST
[PROVIDER_MMW]
TIMER_POST_TOUT_MS=25000 ; gateway timeout (ms)
TIMER_POST_PACE_MS=3000 ; pause (ms) between readings
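As a minimal sketch of the queue-on-failure step described above, using the Arduino SD library as on the Mayfly (the file name and function name are hypothetical):

```cpp
#include <SD.h>  // Arduino SD library, as used on the Mayfly

// Append any reading that was not acknowledged with HTTP 201 to a local
// queue file; the delivery loop retries it in a later connection window.
void recordPostResult(const String& postBody, int httpStatus) {
    if (httpStatus == 201) return;              // delivered, nothing to keep
    File f = SD.open("QUEUE.TXT", FILE_WRITE);  // FILE_WRITE appends on SD
    if (f) {
        f.println(postBody);
        f.close();
    }
}
```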
Update on EnviroDIY/ModularSensors#194. These notes I have gleaned are for any recipes anybody is stewing on; any opinions are only mine, no QA implied. There has been monitoring of the server for the last 9 months:
- Processing of POSTs is serialized and takes the time - essentially the validation of the message UUIDs. There is no advantage to multiple POSTs on the same TCP/IP link from ModularSensors/Mayfly. With the batched-queue algorithm the ModularSensors code does POSTs on the same connection to MMW, but processes each POST the same old way, which I think establishes the TCP/IP link, POSTs, then tears it down. Tinkering with the algorithm seems unlikely to provide any benefit to the server.
- POST time offsets: suggestion between 3 min and 8 min (this seems to assume a 15-minute window).
- Pacing between POSTs on the same connection: not likely to save much time - depends on server loading; see serialization above. IMHO it needs characterizing between 100 ms and 500 ms. I have tried 1 s and 2 s to try to be friendly, without any noticeable difference.
- POST timeout: define it in the device settings, based on characterizing the server's response. The server's internal setting is currently 5 minutes, to provide an internal optimal guarantee of insertion into the database - not possible to set sensibly for the device. Improved response times to device POSTs seem only likely with AWS Simple Queue Service (though maybe MQTT could be a better long-term solution).
- Max number of POSTs on any one connection: any number greater than 0 has the same effect and is serialized, so it needs some characterization. It does of course impact power.
One other parameter not figured out, and needing feedback, is the timedb compression algorithms. The Mayfly of course uses less power if it connects to MMW less often - the connection interval is the product of the sample time (e.g. 15 min) and the number of samples to collect (e.g. 4), for a connection attempt every 60 min. However, in some edge cases where readings failed on the first POST, or conditions delayed the connection (e.g. #658 & #661), the timedb may have compressed the endpoint data. When new data is POSTed it has to be uncompressed, which of course could take a long time, and then the Mayfly times out. The algorithm will of course try again in the next time window - which could be 1 hour or longer - but will the timedb compression algorithms have kicked in again? (Can chatGPT arbitrate this condition? Oops - still need people to do that.) Does this describe what was seen with this field system? So for open source, the integration testing I've been doing provides an effective characterization source of published data - latest on #661. I've gleaned that there is some serious noodling on how data flows on MMW, and thanks so much for the detailed effort to improve response times. Whew! Note to self: well done for proposing this sensible network-layer "orderly data delivery" in 2018 when reviewing this wonderful open-source ModularSensors effort - the value of checking out the architecture of open-source code.
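To illustrate the connection pattern under discussion - several paced POSTs reusing one TCP connection, rather than one connection per POST - here is a sketch. Per the notes above the server serializes processing either way, so the choice mainly affects device power, not server throughput. sendHttpPost and readStatus are hypothetical helpers, not real ModularSensors functions.

```cpp
#include <Arduino.h>  // Client, String, delay()
#include <deque>

void sendHttpPost(Client& c, const String& body);  // hypothetical HTTP layer
int  readStatus(Client& c, long timeoutMs);        // returns HTTP status, -1 on timeout

// POST a batch over one TCP connection, pacing between POSTs.
bool postBatch(Client& client, const char* host,
               std::deque<String>& batch, long paceMs, long toutMs) {
    if (!client.connect(host, 80)) return false;       // one TCP setup
    while (!batch.empty()) {
        sendHttpPost(client, batch.front());
        if (readStatus(client, toutMs) != 201) break;  // stop on first failure
        batch.pop_front();
        delay(paceMs);                                 // 100-500 ms suggested above
    }
    client.stop();                                     // one TCP teardown
    return batch.empty();
}
```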
Refreshing on the issue of the effects of database compression: in a normal type of field issue, a field system gets behind in sending readings to the cloud and may not have a person do a field visit for some months. When it starts POSTing again, the data may not be received, depending on the algorithm used for the ODM2 database compression.
Answered here: #665 (comment)
@aufdenkampe, many thanks for the quick response. As this is an open-source discussion, this thread has been a way of phrasing "edge conditions" - conditions that aren't thought about in the beginning and then have consequences later, when a software hero has to be found to decode and solve them. As an engineer, I'm just saying that part of the discipline taught to students is to identify likely edge conditions ("requirements") and then also what needs to be done to test for them. Restating with data from #665, the architectural challenge for the server is field systems that get behind and then start posting to the server again. The proposed architectural solution on the server has an uncalibrated assumption built into it.
@neilh10, SQS, API Gateway, and the equivalent are current best practice in Cloud DevOps. There are no untested assumptions that they will provide substantial benefits. We understand software architecture and test-driven development, and are doing an excellent job refactoring and improving the legacy code stack we received 3 years ago while enabling the system to grow from receiving 6 million data points per month to the current 16 million per month.
It seems to me, with all the work that has gone into characterizing the servers, and in line with standard communications-stack architecture, that the API's successful ACK is a response code of 2XX, as per https://en.wikipedia.org/wiki/List_of_HTTP_status_codes. For all other response codes, in line with standard communications architecture, the posting devices need to retry until they receive a successful ACK. I hope that, after the amazing work that has happened over four years on understanding the code, this won't cause any unexpected problems. If there are any suspected pitfalls to slip on, it would be great to hear about them.
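A one-line sketch of that ACK rule - any 2xx counts as a successful ACK, and anything else (including no response at all) leaves the reading queued for retry:

```cpp
bool isAcked(int httpStatus) {
    return httpStatus >= 200 && httpStatus < 300;  // 2xx == delivered
}
```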
I'm just wondering what the "reliable delivery" algorithm parameters could be from the server interface and the client interface. So putting this out for discussion.
For software systems with a core communication element, communications reliability is a system-level characterization. It is often a target, and is best quantified and agreed upon, with the system characterized over time and across repeated software releases.
For a wireless network, which is inherently unreliable due to the complex nature of wireless signals, of the wireless footprint, and of the environmental conditions that affect wireless connectivity, a "reliable delivery" algorithm is key to confidence in the value of a remote device.
A "reliable delivery" algorithm is that set of parameters that a client Mayfly should execute and responses, for the successfully guaranteed delivery of that data over a network to a level of specified reliability.
The reliability bar should be set high enough to be meaningful, and low enough to make a reasonable test case.
Reliability is typically defined by the number of messages that could be lost while still meeting the bar: 99% is one message in 100, 99.9% is one in 1000, and so on.
To be able to say a standard of reliability (e.g. 2 9's or 99%) has been met, twice that number of messages needs to be transmitted with at most one lost.
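Written out: to claim reliability R with at most one lost message, a test needs to transmit N = 2 / (1 - R) messages. R = 0.99 gives N = 200; R = 0.999 gives N = 2000 - matching the worked examples at the end of this comment.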
As a straw poll, for the client Mayfly I would suggest the following:
On a client Mayfly POSTing, if a response isn't normally received within 2 seconds, that client should consider the gateway to be in timeout. This is generally in line with the characterization data I have seen to date. The timeout directly affects the power draw, so there is a tradeoff; if the wireless network is unavailable, there is a benefit to shorter timeouts until it comes back.
For characterization purposes a client may set a gateway timeout of 5 seconds, to determine the range of responses the client gets on its specific network.
The gateway timing starts after the data has been transmitted to the modem, which is currently a slow 9600-baud link; the network beyond is typically much faster (~100 megabaud).
The reliable delivery algorithm is such that if a POST returns a SUCCESS (HTTP STATUS 201 CREATED) it will be considered delivered.
If any other response is received, including none, the POST will be repeated at a later time, until a SUCCESS is received.
For any client POST attempt, it shall not repeat more than 60 messages on any single connection attempt in any one-hour period.
For any client POST attempt on one connection, it shall not POST more than 10 messages after the first unsuccessful indication (see the sketch after this list).
The reliability target for releases of client and server software should be 99%.
Longer-term field reliability can aim to characterize a system to 99.9%.
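A minimal sketch (hypothetical names) of the two throttle rules above - at most 60 messages per connection attempt in any one-hour period, and at most 10 further POSTs after the first unsuccessful indication:

```cpp
const int MAX_POSTS_PER_HOUR      = 60;
const int MAX_POSTS_AFTER_FAILURE = 10;

int  postsThisHour     = 0;      // reset by the hourly scheduler
bool failureSeen       = false;  // reset at the start of each connection
int  postsAfterFailure = 0;

// Call before each POST: may another message be sent on this connection?
bool mayPostAnother() {
    if (postsThisHour >= MAX_POSTS_PER_HOUR) return false;
    if (failureSeen && postsAfterFailure >= MAX_POSTS_AFTER_FAILURE) return false;
    return true;
}

// Call after each POST with the HTTP status (-1 for no response).
void recordResult(int httpStatus) {
    ++postsThisHour;
    if (failureSeen) ++postsAfterFailure;
    if (httpStatus != 201) failureSeen = true;
}
```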
For comparison purposes:
i) If a Mayfly unit is being characterized in the field on a wireless network of variable reliability, with a sample time of 15 minutes and an objective of 99.9% reliability, it needs to send 2000 messages over ~21 days (about three weeks) with only one message lost. There could be a significant number of retries, and it takes reliable software to be able to characterize a potentially lossy network.
ii) If Mayfly software is UNDER TEST on a reliable network, and it generates a message every 2 minutes and uploads every 2 messages (that is, every 4 minutes), it can reach an objective reliability of 99%: 200 messages in 400 minutes, or 6 hours 40 minutes - comfortably an overnight test that can verify the combined software under good network conditions.