Thanks for sharing your thoughts. I'll discuss BEVE here because REPE is being completely reworked into a format-agnostic binary IPC specification that will not affect the data schema at all (it will only package up an arbitrary query format and body format).

By including type tags and string keys within the BEVE document, it is possible to use a BEVE document as a schema for generating a coherent interface (not possible with protocol buffers, which lacks string keys). The BEVE document can act as the schema: any programming language can generate structures, with default values, from a BEVE document. Now, it would be terrible for a human to write out a BEVE file as the schema, so we need a human-readable syntax for defining one. This is where your point about a schema definition language would be extremely useful and provide the single source of truth, as you pointed out. I totally agree with you here, and I think the protocol buffer schema language is a good place to start, but it will need to be tweaked to support the BEVE spec. I'll begin looking into this and write up an experimental spec.

I'll address your comments next:
This is true in terms of being able to auto-generate code that matches the specification and therefore reduce implementation errors. In terms of runtime safety, both BEVE and protocol buffers encode type information (BEVE calls these tags; protocol buffers calls them wire types). BEVE tends to be easier to debug: a mismatched BEVE file still contains human-readable string keys, whereas protocol buffers uses numeric keys that differ across the mismatching schemas.
Note that protocol buffers does not remove type information from its messages (it is needed for error checking), so BEVE and protocol buffers are similar here. BEVE is often more space efficient. For example, protocol buffers encodes a boolean as a varint value plus a tag byte, where the field index is packed into the varint that holds the wire type and uses more bytes as there are more fields. BEVE, on the other hand, encodes the boolean in a single byte that holds both the type information and the boolean value. Now, the key is encoded separately, so this adds to the cost of the boolean, but let me explain the benefit of string keys over indices.

**Why string keys versus numeric indices?** First, string keys are easier to use and debug, because the human-readable key name exists right in the document. A schema is not necessary to understand the document, and if a schema is lost or corrupted the meaning of your data is not lost. API changes are also easier to make with string keys: numeric keys require new fields to use ever-increasing integers, and when fields are removed you can't just decrement the indices without breaking the API. This makes API management more complex, whereas with string keys you can simply add and remove fields, and the parser will error when trying to decode a deprecated value or, if it is valid, handle the new value gracefully without needing a schema document update (albeit a schema document to generate the new interface can be useful, as previously remarked). Protocol buffers' documentation says in bold about field numbers: "This number cannot be changed once your message type is in use". This means that as the message evolves over time it becomes less efficient to encode, whereas BEVE can add and remove fields over time without incurring additional encoding costs.

For scientific data sources with large arrays of data, which my company often deals with, the size of the key has no significant impact on the document size. For messages where size is absolutely critical, the data is better encoded in array form, which removes the cost of the keys, albeit the API is then rigidly tied to the types within the array. In my experience, the number of cases where size isn't critical but is still important enough to justify numeric keys is limited. The protocol buffer approach seems to aim for this uncommon use case at the cost of API flexibility, ease of use, and human readability, because an array would be used rather than an object with keys if performance were critical.

The nail in the coffin for protocol buffers' approach is key compression in BEVE. BEVE has been written in such a way as to offer an LZ4-like approach to key compression, except with nearly zero compression cost. I'm still actively developing this extension to BEVE, but it will exist within the tag layout and indicate when a key refers to a previously decoded key rather than storing it in place. By keeping an active decoding buffer, which is common in networking or when dealing with the document in memory, we can decode keys only once and then refer back to them as they are encountered in the future. This lets us encode repeated keys (quite common) with only two bytes of information, the same as many protocol buffer messages. This key compression can be opt-in. The place where numeric keys make the most sense is when you are sending small messages with keys that are never repeated (see the sketch just below).
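To make the back-reference idea concrete, here is a minimal Python sketch. The markers, length prefix, and layout here are invented purely for illustration; the real mechanism will live in BEVE's tag layout, which is still being designed.

```python
# A toy sketch of key back-referencing, NOT the actual BEVE tag layout.
# New keys are written in full once; repeated keys become a 2-byte
# reference into the table of keys the decoder has already seen.

NEW_KEY = 0x00   # hypothetical marker: full key string follows
KEY_REF = 0x01   # hypothetical marker: 1-byte index into the key table

def encode_keys(keys):
    """Encode a sequence of keys, back-referencing repeats."""
    table, out = {}, bytearray()
    for key in keys:
        if key in table:
            out += bytes([KEY_REF, table[key]])       # 2 bytes total
        else:
            raw = key.encode("utf-8")
            out += bytes([NEW_KEY, len(raw)]) + raw   # 1-byte length: fine for a sketch
            table[key] = len(table)
    return bytes(out)

def decode_keys(buf):
    """Decode, rebuilding the same key table on the fly."""
    table, keys, i = [], [], 0
    while i < len(buf):
        if buf[i] == KEY_REF:
            keys.append(table[buf[i + 1]])
            i += 2
        else:
            n = buf[i + 1]
            key = buf[i + 2 : i + 2 + n].decode("utf-8")
            table.append(key)
            keys.append(key)
            i += 2 + n
    return keys

msg = ["name", "value", "name", "value", "name"]
assert decode_keys(encode_keys(msg)) == msg   # repeats cost 2 bytes each
```

The point is that the decoder rebuilds the same key table the encoder used, so a repeated key costs a fixed two bytes regardless of its length.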
Even in that case, though, you are paying for less flexible APIs than string keys and worse performance than a packed array. I feel it is best for the default behavior to be more flexible, easier to debug, and more human readable, and where performance is critical to move to an array approach, where BEVE is more efficiently packed than protocol buffers.

**In conclusion**, I've explained why a schema is not required for BEVE, because the document contains all the information (e.g. key names), unlike protocol buffers. But I totally agree that a schema definition for auto code generation would be useful.
Agreed. It won't make encodings more efficient, but it will make dealing with a variety of programming languages safer and save time, though I tend to avoid auto code generation myself. The one argument against a custom schema language for BEVE is that users could write a structure in any language of their choice, serialize it to BEVE, and then use that file to auto-generate an interface in any other programming language. So, the schema language could be any programming language that has a BEVE library. The benefit is that users can visualize the schema in whatever language they are most familiar with. What are your thoughts on using your programming language of choice instead of a BEVE-specific schema language?
---
I didn't expect such an extensive explanation, and I think we're on the same page about the intent of my suggestion. As for the BEVE definition being written in any language: I'm not sure how that would translate into an implementation in another language, and I don't yet know what the tooling would look like. Say I have a project in C++ and Python; it could look as follows:

- The definition is in Python, and I want to build my C++ application.
- The definition is in C++, and I want to package my Python application.
- The definition is in a separate application (language doesn't matter), and I want to build both a C++ and a Python package.
This sounds a lot like a schema language step, but without standardisation. I'm not saying that's bad; I have had similar ideas in one of my projects at work, to just write a generator with the definition simply being part of the source instead of some extra IDL file. For DX you also don't necessarily want to compile and then run a C++ application before you can use your Python application, for instance. At the least you would need a "BEVE -> your lang" tool somehow, whether that be a library in your language or a protoc-like tool for your language; I wouldn't mind either way, as long as it's convenient and agnostic somehow. Just as a dump of my brain, here's a pseudo example of what this may look like:

```python
import beve.types as t
import beve
import beve.cpp as bpp
import beve.rust as bever  # I couldn't help myself

myDef = {
    'myType': {
        'fieldOne': t.str,
        'fieldTwo': t.f32,
    }
}

myDefBeve = beve.to_beve(myDef)
bpp.export(myDefBeve, "myDef.cpp", "myDef.hpp")
bever.export(myDefBeve, "myDef.rs")
```

Is that what you had in mind?

EDIT, as a clarification: I didn't consider using the BEVE itself as the source of truth, because it's binary. Version control won't be of much use, and it cannot be manually altered without tooling, so you at least need some kind of text-based definition. Another advantage of something like protobuf is that you can simply look for the .proto file and there's your interface, defined, even if you never worked on that project. If it's hidden away, there's a lot of magic under the hood, which in my experience tends to be fragile and hard to maintain, understand, and update. Also, regarding this:
If there's a usability vs. performance trade-off to make here then I don't disagree, but I also like a "performance by default" approach.

On another note: the compression of the strings really reminds me of EXI (I work on EV chargers, which use EXI for communication with EVs), which is even more extreme with compression (shame it's based on XML). I can recommend reading its spec for inspiration. It's trying to do too much, so don't try to be like it 😝, and I think its lack of byte alignment hurts its performance a lot. Strict schema-informed EXI doesn't bother with indices for every field: it knows the next thing will be field x and can't be anything else, unless the field is optional. Integers are variable length, depending on the value range and the value. Anything with a maximum range that fits in 12 bits is encoded in 12 bits; everything over that is encoded in bytes with a single bit of padding, where the first bit of each byte is 1 until the last byte (rough sketch at the end of this comment). And this goes on and on. I'm getting a headache just thinking of implementing it again. It's impressively tiny though, smaller than XZ-compressed JSON for instance.

Thanks for the amazing response.
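P.S. here's a rough sketch of that continuation-bit integer scheme as I understand it (7 value bits per byte, top bit set while more bytes follow); it illustrates the general idea, not EXI's exact bit-level rules:

```python
# Continuation-bit variable-length unsigned integers: each byte carries
# 7 value bits; the top bit is 1 while more bytes follow, 0 on the last.

def encode_uint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0x00))  # top bit: "more follows"
        if not n:
            return bytes(out)

def decode_uint(buf: bytes) -> int:
    n = 0
    for shift, byte in enumerate(buf):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:                       # top bit clear: last byte
            break
    return n

assert decode_uint(encode_uint(300)) == 300       # 300 fits in 2 bytes
```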
---
I set the ball rolling over here: #11
---
Since BEVE is cross-language, and REPE is intended for IPC, I would consider using some sort of schema or definition language, like protobuf, EXI, or ASN.1 have, which can be used to generate (or in some cases dynamically cast) a coherent interface that speaks the same language as all other consumers of the description file.
This could also open the possibility of a stricter, more efficient encoding, since the decoder simply "knows" what to expect (like schema-informed EXI vs. schema-less); see the toy sketch below.
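As a toy illustration of what "schema informed" buys (nothing BEVE or EXI specific, just the general idea that a shared definition removes per-field tags and keys from the wire):

```python
# If both sides share the definition, fields need no per-field tags or
# keys, because the decoder already knows what comes next and in what
# order. The schema below is hypothetical.
import struct

# Shared definition: myType { fieldOne: f32, fieldTwo: u16 }
SCHEMA = struct.Struct("<fH")                 # little-endian f32 + u16

def encode(field_one: float, field_two: int) -> bytes:
    return SCHEMA.pack(field_one, field_two)  # no tags, no keys

def decode(buf: bytes) -> tuple:
    return SCHEMA.unpack(buf)                 # field order is the contract

assert decode(encode(1.5, 42)) == (1.5, 42)   # 6 bytes on the wire
```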
Rather than inventing a new language, we could use an existing, less complicated object notation language, though this may need to be discussed further. We could also simply use a subset of the protobuf language, because it has solved a lot of these problems already, including RPC.
To summarise:
Any thoughts?