Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data: Replace custom formats with msgpack #374

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fstachura
Copy link
Collaborator

I have previously attempted to refactor data converters used in data.py in this branch. In the end I wasn't happy with the result, because I believe that writing custom parsers for each database is the wrong approach.

This PR replaces all data.py parsers with msgpack. Plain python objects can be serialized and deserialized into the databases.
The main advantage is convenience - values can be manipulated like normal Python objects, no string parsing is required anywhere in the codebase that interacts with the database.

From what I remember, larger databases were also a bit smaller, mainly because large ints take less space in msgpack than in base10 representation. But to be fair, there is some storage overhead for other datatypes.
I also wouldn't be surprised if average serialization/deserialization times were a bit smaller, although I don't have numbers on that and I doubt it's a major bottleneck anywhere.

Leaving this as a draft - I tested it only a little bit.

self.families = parsed_data[1]
else:
self.entries = []
self.families = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we work only on the raw data, and parsing it when things are requested? Goal is to store only a bytes buffer without taking loads of memory if we want to have loads in memory.

else:
return self.ctype(p)
else:
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early return the if p is None case.

if type(key) is str:
key = key.encode()
elif type(key) is int:
key = msgpack.dumps(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do this? Isn't it rather an error if someone gives us a string or int? A key is of type bytes, callers that don't respect that have a bug IMO.

elif type(arg) is int:
arg = str(arg).encode()
arg = msgpack.dumps(arg)
return arg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment: is this used? Shouldn't callers know what they have and do the right thing themselves.

@@ -54,7 +54,7 @@ function generateSymbolDefinitionsHTML(symbolDefinitions, project, version) {
result += '<ul>';
previous_type = sd.type;
}
let ln = sd.line.toString().split(',');
let ln = [sd.line];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that related to msgpack?

Copy link
Member

@tleb tleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing major, the change isn't massive, no surprise. Some more cleanup (like random code comments here and there) is required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants