Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Split catalogue reader #76

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Draft: Split catalogue reader #76

wants to merge 4 commits into from

Conversation

bwvdnbro
Copy link
Member

@bwvdnbro bwvdnbro commented Feb 9, 2022

This adds support for reading distributed VR catalogues, as requested in #72. Metadata is present in each file of a distributed catalogue, so the only significant difference between a normal and a distributed catalogue is the way you read datasets. This is handled by using a new VelociraptorCatalogueReader object that takes care of reading and distinguishes between single file and distributed file reads.

Currently I have only implemented this for the actual catalogue (*properties.* files), but I assume something similar would need to be done for the particle and SO files.

@bwvdnbro bwvdnbro marked this pull request as draft February 9, 2022 12:42
@JBorrow
Copy link
Member

JBorrow commented Feb 14, 2022

A couple of general comments:

  1. Don't forget your type hints!
  2. I think the filename=x.filenames[0] is pretty dirty. Maybe have x.filename be a property that returns that internally?
  3. You should use pathlib.Path objects, then you can use e.g. the glob() method on that to search rather than using regex.

@bwvdnbro
Copy link
Member Author

I have attempted to add type hints, let me know if you want more. I agree that the filenames[0] was ugly and I have created a property.

Regarding your last comment: the issue here is not so much finding all the files, but generating an ordered list of files. While glob() would work to generate the list, there is no clean way to sort that list. glob() does not sort files and because numbers for VR are not padded, sort() sorts them wrong (so you get x.properties.1, x.properties.11... x.properties.2 etc). So in order to sort correctly, you need to write your own comparison function and that requires... a regex. I thought just using the regex from the start would be easier.

Any thoughts on the particle and SO list files? Where are those used?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants