
Pre-retrieved data formatting #5

Open
sheluchin opened this issue Dec 5, 2022 · 1 comment

Comments

@sheluchin

I know that the formatting of Dewey's data dumps is still a work in progress and open to change, so I wanted to flag some concerns and potential areas for improvement.

At the moment, the formatting of the pre-retrieved data dump files is inconsistent:

  • analysis.edn.gz: first line is opening bracket [ followed by one project's analysis map {...} per line
  • all-repos.edn: single line containing ({...} ...)
  • deps-libs.edn: single line containing a map keyed by lib
  • deps-tags.edn: single line containing a vector of 2-element vectors:
    • first element: map of repo information
    • second element: vector of maps, where each map represents a git tag of the repo
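For concreteness, here's roughly what consuming the single-form files looks like today (the helper name and code are mine, not part of dewey):

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; all-repos.edn, deps-libs.edn and deps-tags.edn each hold one
;; EDN form on a single line, so a plain edn/read suffices:
(defn read-single-form-edn [path]
  (with-open [rdr (java.io.PushbackReader. (io/reader path))]
    (edn/read rdr)))
```

analysis.edn.gz is the odd one out, since it's gzipped and spreads one vector across many lines.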

These differences create friction when consuming the data. It would be nice to unify the formats, provide clear documentation of the differences, or add some helpers like read-edn to assist consumers (or some combination of the three).

Related: read-edn does not currently work as documented in the README, because it consumes too much CPU and takes too long for certain files, like analysis.edn.gz.

As for the formatting, it would be nice if all of the provided data were easy to join. Making the :lib key ubiquitous, as mentioned in #3, would be a step in the right direction. If all the data could be keyed by :lib, it would be better still, allowing lookups without iterating sequences to find the data to join. I don't know exactly how to achieve that without requiring the whole map to be loaded into memory first. Perhaps SQLite can be part of the solution?
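To illustrate: if every dataset carried :lib, a consumer could build constant-time lookup tables instead of scanning sequences. A hypothetical sketch (nothing here exists in dewey today):

```clojure
;; Hypothetical helpers, assuming every entry has a :lib key.
(defn index-by-lib [coll]
  (into {} (map (juxt :lib identity)) coll))

;; Join two datasets on :lib without iterating to find matches:
(defn join-on-lib [a b]
  (merge-with merge (index-by-lib a) (index-by-lib b)))
```

This still loads everything into memory, which is the part I'm unsure about for the larger files.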

@phronmophobic
Owner

I'm happy to be convinced otherwise, but here's how I was thinking about it:

  • except for analysis.edn.gz, all of the data files fit comfortably in memory.
  • I think finding better ways to document the data formats makes a lot of sense. Would love to have good examples of libs that do this.
  • As long as the data is documented, it doesn't seem like a problem that some files are maps, some are vectors, etc. Data comes in all shapes/forms and Clojure has good tools for dealing with heterogeneous data.
  • I use tools similar to portal/reveal/clerk/cider-inspect for "seeing" the data I'm working with. Better documentation always helps, but using these sorts of tools also makes working with the data much, much easier. In some cases, it's even better than sifting through documentation.
  • The all-repos.edn and deps-tags.edn files are really just intermediate files that are used to produce deps-libs.edn. In my head, consumers would generally just look in deps-libs.edn. Is there data missing from deps-libs.edn that you're finding in the other data files?

> Related: read-edn does not currently work as documented in the README, because it consumes too much CPU and takes too long for certain files, like analysis.edn.gz.

It does technically work. This is the method I used when writing https://blog.phronemophobic.com/dewey-analysis.html. However, it does require a ton of RAM and takes about 15 minutes to load. It probably makes sense to document that caveat and maybe provide a helper that can scan through the file in constant memory.
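Such a helper might look something like this (just a sketch, assuming the one-map-per-line layout described above; not currently part of dewey):

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io]
         '[clojure.string :as str])
(import '[java.util.zip GZIPInputStream])

(defn scan-analysis
  "Calls f on each analysis map in analysis.edn.gz without
   holding the whole file in memory. Assumes an opening \"[\"
   line followed by one EDN map per line."
  [path f]
  (with-open [rdr (io/reader (GZIPInputStream. (io/input-stream path)))]
    (->> (line-seq rdr)
         (remove #(contains? #{"[" "]"} (str/trim %)))
         (map edn/read-string)
         (run! f))))
```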

> As for the formatting, it would be nice if all of the provided data would be easy to join. Making the :lib key ubiquitous, as mentioned in #3 would be a step in the right direction. If all the data could be keyed by :lib, it would be better still, allowing lookups without iterating sequences to find the data to join. I don't know exactly how to achieve that without requiring the whole map to be loaded into memory first. Perhaps SQLite can be part of the solution?

The goal of dewey is to provide the raw data. The raw data isn't queryable or indexed. I think what you want is to load it into a database which can sort/organize/index the data for you. I don't want to force consumers to use a specific database (that's the point of providing the raw data). However, it probably makes sense to provide some examples for loading the data into popular dbs.
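For instance, an example along these lines could show loading deps-libs.edn into SQLite. The library choices here (next.jdbc with the xerial SQLite JDBC driver) are just one option, not a recommendation:

```clojure
(require '[next.jdbc :as jdbc])

;; :dbtype/:dbname per next.jdbc; requires org.xerial/sqlite-jdbc
;; on the classpath.
(def db {:dbtype "sqlite" :dbname "dewey.db"})

(defn load-deps-libs!
  "Stores each lib's data as an EDN string, keyed by lib name."
  [deps-libs]
  (jdbc/execute! db
    ["create table if not exists libs (lib text primary key, data text)"])
  (doseq [[lib m] deps-libs]
    (jdbc/execute! db
      ["insert or replace into libs (lib, data) values (?, ?)"
       (str lib) (pr-str m)])))
```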

There's a principle in data pipelines where you start with raw data as you find it (i.e. the data pulled from GitHub) and you create a pipeline of steps to clean/transform/summarize the data until it's in a form that's easy to query. I'm not totally against adding a :lib key to more of the data files, but I am hesitant to insert "extra" data into the raw data sets. I would rather have it included as part of a "cleaned" dataset at the end of a pipeline (e.g. deps-libs.edn).

Also, I'm curious what data is in all-repos.edn and deps-tags.edn that's not in deps-libs.edn.
