Thanks for sharing your thoughts. I'll discuss BEVE here because REPE is being completely reworked into a format-agnostic binary IPC specification that will not affect the data schema at all (it will only package up an arbitrary query format and body format).

By including type tags and string keys within the BEVE document, it is possible to use a BEVE document as a schema for generating a coherent interface (not possible with protocol buffers, which lacks string keys). The BEVE document can act as the schema: any programming language can generate structures, with default values, from a BEVE document. Now, it would be terrible for a human to write out a BEVE file as the schema, so we need a human-readable syntax for defining one. This is where your point about a schema definition language would be extremely useful and provide the single source of truth, as you pointed out. I totally agree with you here, and I think the protocol buffer schema language is a good place to start, but it will need to be tweaked to support the BEVE spec. I'll begin looking into this and write up an experimental spec.

I'll address your comments next:
This is true in terms of being able to auto-generate code that matches the specification and therefore reduce implementation errors. In terms of runtime safety, both BEVE and protocol buffers encode type information (BEVE calls these tags; protocol buffers calls them wire types). BEVE tends to be easier to debug: a mismatched BEVE file still contains human-readable string keys, whereas protocol buffers uses numeric keys that differ across the mismatching schemas.
Note that protocol buffers does not remove type information from its messages (it is needed for error checking), so BEVE and protocol buffers are similar here. BEVE is often more space efficient. For example, protocol buffers encodes a boolean as a varint value plus a tag byte, where the field index is packed into the varint that holds the wire type and uses more bytes as there are more fields. BEVE, on the other hand, encodes the boolean in a single byte that holds both the type information and the boolean value. Now, the key is encoded separately, so this adds to the cost of the boolean, but let me explain the benefit of string keys over indices.

**Why string keys versus numeric indices?** First, string keys are easier to use and debug, because the human-readable key name exists right in the document. A schema is not necessary to understand the document, and if a schema is lost or corrupted the meaning of your data is not lost. API changes are also easier to make with string keys: numeric keys require new fields to use ever-increasing integers, and when fields are removed you can't just decrement the indices without breaking the API. This makes API management more complex, whereas with string keys you can simply add and remove fields, and the parser will error when trying to decode a deprecated value or, if it is valid, handle the new value gracefully without needing a schema document update (albeit a schema document to generate the new interface can be useful, as previously remarked). Protocol buffers' documentation says in bold about field numbers: "This number cannot be changed once your message type is in use". This means that as the message evolves over time it becomes less efficient to encode, whereas BEVE can add and remove fields over time without incurring additional encoding costs.

For scientific data sources with large arrays of data, which my company often deals with, the size of the key has no significant impact on the document size. For messages where size is absolutely critical, the data is better encoded in array form, which removes the cost of the keys, albeit the API is then rigidly tied to the types within the array. In my experience, the number of cases where size isn't critical but is still important enough to justify numeric keys is limited. The protocol buffer approach seems to aim for this uncommon use case at the cost of API flexibility, ease of use, and human readability, because an array would be used rather than an object with keys if performance were critical.

The nail in the coffin for protocol buffers' approach is key compression in BEVE. BEVE has been written in such a way as to offer an LZ4-like approach to key compression, except with nearly zero compression cost. I'm still actively developing this extension to BEVE, but it will exist within the tag layout and indicate when a key refers to a previously decoded key rather than storing it in place. By keeping an active decoding buffer, which is common in networking or when dealing with the document in memory, we can decode keys only once and then refer back to them as they are encountered in the future. This lets us encode repeated keys (quite common) with only two bytes of information, the same as many protocol buffer messages. This key compression can be opt-in. The place where numeric keys make the most sense is when you are sending small messages with keys that are never repeated (see the sketch just below).
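To make the back-reference idea concrete, here is a minimal Python sketch. The markers, length prefix, and layout here are invented purely for illustration; the real mechanism will live in BEVE's tag layout, which is still being designed.

```python
# A toy sketch of key back-referencing, NOT the actual BEVE tag layout.
# New keys are written in full once; repeated keys become a 2-byte
# reference into the table of keys the decoder has already seen.

NEW_KEY = 0x00   # hypothetical marker: full key string follows
KEY_REF = 0x01   # hypothetical marker: 1-byte index into the key table

def encode_keys(keys):
    """Encode a sequence of keys, back-referencing repeats."""
    table, out = {}, bytearray()
    for key in keys:
        if key in table:
            out += bytes([KEY_REF, table[key]])       # 2 bytes total
        else:
            raw = key.encode("utf-8")
            out += bytes([NEW_KEY, len(raw)]) + raw   # 1-byte length: fine for a sketch
            table[key] = len(table)
    return bytes(out)

def decode_keys(buf):
    """Decode, rebuilding the same key table on the fly."""
    table, keys, i = [], [], 0
    while i < len(buf):
        if buf[i] == KEY_REF:
            keys.append(table[buf[i + 1]])
            i += 2
        else:
            n = buf[i + 1]
            key = buf[i + 2 : i + 2 + n].decode("utf-8")
            table.append(key)
            keys.append(key)
            i += 2 + n
    return keys

msg = ["name", "value", "name", "value", "name"]
assert decode_keys(encode_keys(msg)) == msg   # repeats cost 2 bytes each
```

The point is that the decoder rebuilds the same key table the encoder used, so a repeated key costs a fixed two bytes regardless of its length.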
Even in that case, though, you are paying for less flexible APIs than string keys and worse performance than a packed array. I feel it is best for the default behavior to be more flexible, easier to debug, and more human readable, and where performance is critical to move to an array approach, where BEVE is more efficiently packed than protocol buffers.

**In conclusion**, I've explained why a schema is not required for BEVE, because the document contains all the information (e.g. key names), unlike protocol buffers. But I totally agree that a schema definition for auto code generation would be useful.
Agreed. It won't make encodings more efficient, but it will make dealing with a variety of programming languages safer and save time, though I tend to avoid auto code generation myself. The one argument against a custom schema language for BEVE is that users could write a structure in any language of their choice, serialize it to BEVE, and then use that file to auto-generate an interface in any other programming language. So, the schema language could be any programming language that has a BEVE library. The benefit is that users can visualize the schema in whatever language they are most familiar with. What are your thoughts on using your programming language of choice instead of a BEVE-specific schema language?
---
I didn't expect such an extensive explanation, and I think we're on the same page about the intent of my suggestion. As for the BEVE definition being written in any language: I'm not sure how that would translate into an implementation in another language, and I don't yet know what the tooling would look like. Say I have a project in C++ and Python; it could look as follows:

- The definition is in Python, and I want to build my C++ application.
- The definition is in C++, and I want to package my Python application.
- The definition is in a separate application (language doesn't matter), and I want to build both a C++ and a Python package.
This sounds a lot like a schema language step, but without standardisation. I'm not saying that's bad; I have had similar ideas in one of my projects at work, to just write a generator with the definition simply being part of the source instead of some extra IDL file. For DX you also don't necessarily want to compile and then run a C++ application before you can use your Python application, for instance. At the least you would need a "BEVE -> your lang" tool somehow, whether that be a library in your language or a protoc-like tool for your language; I wouldn't mind either way, as long as it's convenient and agnostic somehow. Just as a dump of my brain, here's a pseudo example of what this may look like:

```python
import beve.types as t
import beve
import beve.cpp as bpp
import beve.rust as bever  # I couldn't help myself

myDef = {
    'myType': {
        'fieldOne': t.str,
        'fieldTwo': t.f32,
    }
}

myDefBeve = beve.to_beve(myDef)
bpp.export(myDefBeve, "myDef.cpp", "myDef.hpp")
bever.export(myDefBeve, "myDef.rs")
```

Is that what you had in mind?

EDIT, as a clarification: I didn't consider using the BEVE itself as the source of truth, because it's binary. Version control won't be of much use, and it cannot be manually altered without tooling, so you at least need some kind of text-based definition. Another advantage of something like protobuf is that you can simply look for the .proto file and there's your interface, defined, even if you never worked on that project. If it's hidden away, there's a lot of magic under the hood, which in my experience tends to be fragile and hard to maintain, understand, and update. Also, regarding this:
If there's a usability vs. performance trade-off to make here then I don't disagree, but I also like a "performance by default" approach.

On another note: the compression of the strings really reminds me of EXI (I work on EV chargers, which use EXI for communication with EVs), which is even more extreme with compression (shame it's based on XML). I can recommend reading its spec for inspiration. It's trying to do too much, so don't try to be like it 😝, and I think its lack of byte alignment hurts its performance a lot. Strict schema-informed EXI doesn't bother with indices for every field: it knows the next thing will be field x and can't be anything else, unless the field is optional. Integers are variable length, depending on the value range and the value. Anything with a maximum range that fits in 12 bits is encoded in 12 bits; everything over that is encoded in bytes with a single bit of padding, where the first bit of each byte is 1 until the last byte (rough sketch at the end of this comment). And this goes on and on. I'm getting a headache just thinking of implementing it again. It's impressively tiny though, smaller than XZ-compressed JSON for instance.

Thanks for the amazing response.
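P.S. here's a rough sketch of that continuation-bit integer scheme as I understand it (7 value bits per byte, top bit set while more bytes follow); it illustrates the general idea, not EXI's exact bit-level rules:

```python
# Continuation-bit variable-length unsigned integers: each byte carries
# 7 value bits; the top bit is 1 while more bytes follow, 0 on the last.

def encode_uint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0x00))  # top bit: "more follows"
        if not n:
            return bytes(out)

def decode_uint(buf: bytes) -> int:
    n = 0
    for shift, byte in enumerate(buf):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:                       # top bit clear: last byte
            break
    return n

assert decode_uint(encode_uint(300)) == 300       # 300 fits in 2 bytes
```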
---
I set the ball rolling over here: #11
---
Since BEVE is cross-language, and REPE is intended for IPC, I would consider using some sort of schema or definition language, like protobuf, EXI, or ASN.1 have, which can be used to generate (or in some cases dynamically cast) a coherent interface that speaks the same language as all other consumers of the description file.
This could also open the possibility of a stricter, more efficient encoding, since the decoder simply "knows" what to expect (like schema-informed EXI vs. schema-less); see the toy sketch below.
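As a toy illustration of what "schema informed" buys (nothing BEVE or EXI specific, just the general idea that a shared definition removes per-field tags and keys from the wire):

```python
# If both sides share the definition, fields need no per-field tags or
# keys, because the decoder already knows what comes next and in what
# order. The schema below is hypothetical.
import struct

# Shared definition: myType { fieldOne: f32, fieldTwo: u16 }
SCHEMA = struct.Struct("<fH")                 # little-endian f32 + u16

def encode(field_one: float, field_two: int) -> bytes:
    return SCHEMA.pack(field_one, field_two)  # no tags, no keys

def decode(buf: bytes) -> tuple:
    return SCHEMA.unpack(buf)                 # field order is the contract

assert decode(encode(1.5, 42)) == (1.5, 42)   # 6 bytes on the wire
```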
Rather than inventing a new language, we could use an existing, less complicated object notation language, though this may need to be discussed further. We could also simply use a subset of the protobuf language, because it has solved a lot of these problems already, including RPC.
To summarise:
Any thoughts?