Pre-retrieved data formatting #5
Comments
I'm happy to be convinced otherwise, but here's how I was thinking about it:
It does technically work. This is the method I used when writing https://blog.phronemophobic.com/dewey-analysis.html. However, it does require a ton of RAM and takes about 15 minutes to load. It probably makes sense to document that caveat and maybe provide a helper that can scan through the file in constant memory.
The goal of dewey is to provide the raw data. The raw data isn't queryable or indexed. I think what you want is to load it into a database which can sort/organize/index the data for you. I don't want to force consumers to use a specific database (that's the point of providing the raw data). However, it probably makes sense to provide some examples for loading data into popular databases. There's a principle in data pipelines where you start with raw data as you find it (i.e. the data pulled from GitHub) and you create a pipeline of steps to clean/transform/summarize the data until it's in a form that's easy to query. I'm not totally against adding a Also, I'm curious what data in
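To make the "scan through the file in constant memory" idea concrete, here is a minimal sketch in Python. It is not part of dewey. It assumes the layout this issue reports for analysis.edn.gz (an opening `[` on the first line, then one map per line) and that any closing `]` sits on its own line or directly after the last map. The function name `iter_edn_lines` is hypothetical, and each record is yielded as a raw EDN string, so parsing is left to whatever EDN reader the consumer prefers.

```python
import gzip
from typing import Iterator


def iter_edn_lines(path: str) -> Iterator[str]:
    """Yield one raw EDN map string per line, holding only one line in memory."""
    # gzip.open in text mode decompresses lazily as lines are read.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            text = line.strip()
            # Skip blank lines and the bare brackets of the outer vector.
            if not text or text in ("[", "]"):
                continue
            # Assumption: a line starting with "{" that ends in "]" is the last
            # map followed by the outer vector's closing bracket.
            if text.startswith("{") and text.endswith("]"):
                text = text[:-1].rstrip()
            yield text
```

Because this is a generator, a consumer can filter or count records in a single pass, for example `sum(1 for _ in iter_edn_lines("analysis.edn.gz"))`, without building the whole vector first.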
I know that the formatting of Dewey's data dumps is still a work in progress and open to change, so I wanted to flag some concerns and potential areas for improvement.
It looks like, at the moment, the formatting of the pre-retrieved data dump files is not consistent:
analysis.edn.gz: first line is an opening bracket [, followed by one project's analysis map {...} per line
all-repos.edn: single line containing ({...} ...)
deps-libs.edn: single line containing a map keyed by lib
deps-tags.edn: single line containing a vector of 2-element vectors
These differences create a bit of friction when consuming the data. It would be nice to either unify the format somehow, provide clear documentation about the differences, or add some helpers like read-edn to assist those trying to consume the data (or a combination of all three).
Related: read-edn does not currently work as documented in the README, because it consumes too much CPU and takes too long for certain files, like analysis.edn.gz.
As for the formatting, it would be nice if all of the provided data were easy to join. Making the :lib key ubiquitous, as mentioned in #3, would be a step in the right direction. If all the data could be keyed by :lib, it would be better still, allowing lookups without iterating sequences to find the data to join. I don't know exactly how to achieve that without requiring the whole map to be loaded into memory first. Perhaps SQLite can be part of the solution?
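As a rough illustration of the SQLite idea, the sketch below stores each record's raw EDN text in a table keyed by lib, so a lookup or join does not need the whole map in memory. Everything here is an assumption for illustration: the function names `load_by_lib` and `lookup`, the `lib` and `edn` columns, and the idea of one row per lib are not part of dewey. Extracting the :lib value from each record needs an EDN parser, which is not shown; the functions simply take `(lib, raw_edn)` pairs.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def load_by_lib(conn: sqlite3.Connection, table: str,
                rows: Iterable[Tuple[str, str]]) -> None:
    """Insert (lib, raw_edn) pairs into a table keyed by lib."""
    # The table name is interpolated, so it must come from trusted code.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        "(lib TEXT PRIMARY KEY, edn TEXT NOT NULL)"
    )
    # executemany consumes any iterable, so rows can be a generator
    # that streams records from a dump file.
    conn.executemany(
        f'INSERT OR REPLACE INTO "{table}" (lib, edn) VALUES (?, ?)', rows
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, table: str, lib: str) -> Optional[str]:
    """Return the raw EDN stored for lib, or None if it is absent."""
    row = conn.execute(
        f'SELECT edn FROM "{table}" WHERE lib = ?', (lib,)
    ).fetchone()
    return row[0] if row else None
```

With two such tables, say a hypothetical `analysis` and `repos`, joining on `lib` becomes an ordinary SQL join that uses the primary-key index rather than a scan over sequences.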