Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add KDVH migration package #30

Open
wants to merge 24 commits into
base: trunk
Choose a base branch
from
Open

Add KDVH migration package #30

wants to merge 24 commits into from

Conversation

Lun4m
Copy link
Collaborator

@Lun4m Lun4m commented Sep 25, 2024

This PR tackles the first part of issue #14.
I adapted some scripts originally written by Ketil for ODA to dump tables from KDVH and import them into LARD. Almost happy with the code...

Things I'm still not 100% sure about:

  1. I'm using the KDVH proxy to dump the tables, but I couldn't find T_MONTH_INTERPOLATED and T_DIURNAL_INTERPOLATED, probably I need to connect to KDVH directly?
  2. Is blobData used for non-scalar parameters (KLOBS for example)?
  3. Do we need to dump observations that have both NULL data and flags? Related to ->
  4. in data_functions.go we do a lot of postprocessing that I haven't touched since I don't really understand it. My main concern is with the insertion of "null" values (i.e. -32767, etc.), are these used somewhere downstream? Otherwise I might simply drop them, since our schema can handle real NULLs.

@ketilt, do you have any insight regarding these points? Or comments about the code, if you have time to take a look at it?

Still left to do:

  • remove CorrKDVH field, since original and corrected have the same value
  • Double check that the derived controlinfo and useinfo are correct

@ketilt
Copy link
Collaborator

ketilt commented Sep 30, 2024

  1. It makes sense that they are not available in the proxy. Do not worry about dumping them. The tables are not in active use by anything. I can make a dump of them.
  2. Yes
  3. Related to what? There should be no reason to dump nulls. They likely exist only because of the table format of KDVH (empty cells in a row where something is not null)
  4. This is a kvalobs feature. The missing indicator exists in two versions: -32767 means expected value missing and -32766 means value was found but was censored in QC. These numbers will not feature in the KDVH data, but they are all over the queues and all over ODA. If you've found a lossless way to handle this differently, it's probably better to do that instead

@ketilt
Copy link
Collaborator

ketilt commented Sep 30, 2024

Flag definitions:
https://dokit.met.no/klima/bjornn/kvalobs/flagg_i_kvalobs_versjon_8.7.1

@Lun4m
Copy link
Collaborator Author

Lun4m commented Sep 30, 2024

  1. Related to what? There should be no reason to dump nulls.
    They likely exist only because of the table format of KDVH (empty cells in a row where something is not null)

Related to point 4, sorry for the confusion! I was specifically asking since in makeDataPage it's fine if both are null:

if kdvh.flagsAreInvalid() {
useinfo = []byte("9999900900000000")
} else {
useinfo = []byte(kdvh.Flags + "00900000000")
}
if !nullData {
controlinfo = []byte("0000000000000000")
} else {
controlinfo = []byte("0000003000000000")
floatval = -32767
}

But I'm with you, it doesn't make sense to dump them 👌

@ketilt
Copy link
Collaborator

ketilt commented Sep 30, 2024

  1. Related to what? There should be no reason to dump nulls.
    They likely exist only because of the table format of KDVH (empty cells in a row where something is not null)

Related to point 4, sorry for the confusion! I was specifically asking since in makeDataPage it's fine if both are null:

if kdvh.flagsAreInvalid() {
useinfo = []byte("9999900900000000")
} else {
useinfo = []byte(kdvh.Flags + "00900000000")
}
if !nullData {
controlinfo = []byte("0000000000000000")
} else {
controlinfo = []byte("0000003000000000")
floatval = -32767
}

But I'm with you, it doesn't make sense to dump them 👌

Hmm, could this be in the case of blob data?

@Lun4m Lun4m force-pushed the kdvh_importer branch 4 times, most recently from cc7a951 to 40b69e6 Compare November 11, 2024 08:38
@Lun4m Lun4m marked this pull request as ready for review November 11, 2024 08:51
@Lun4m Lun4m requested review from intarga and ketilt November 11, 2024 08:51
CREATE TABLE IF NOT EXISTS flags.kdvh (
timeseries INT4 REFERENCES public.timeseries,
obstime TIMESTAMPTZ NOT NULL,
controlinfo TEXT NULL,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the reverse-engineered flags would be stored? Or the original 5 digits of useinfo that KDVH has kept?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both, the controlinfo is completely reverse-engineered, while the 5-digit KDVH flags are stored in useinfo with the 11 trailing digits set to their default values (line 242 in kdvh/import_functions.go). They should both be following what's defined in the excel document you shared with me.

migrations/kdvh/dump.go Show resolved Hide resolved
return data, nil
}

// TODO: add CALL_SIGN? It's not in stinfosys?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this column appear? It might be ship names, but it might also be the initials of the observer for older data

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T_MDATA (basically the only table I've been testing against). I'll add this info to the code!

Copy link
Collaborator Author

@Lun4m Lun4m Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, if T_MDATA (or any other table) contains data from moving stations we probably should not store it in the same table we store static stations, since they require additional metadata (position?). But we haven't decided on the table structure yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-moving installations like oil platforms also have call signs, and they make up most of that table, I think. We do also have ship data, and there are paramid for MLON and MLAT which contains timeseries of the coordinates of these stations. That is how it was solved in KDVH. So if you see MLAT and MLON, that means there could be ship data in a table. (The table T_CDCV_DATA also contains data from buoys, though I don't remember if they move around a lot.)

migrations/kdvh/import_functions.go Outdated Show resolved Hide resolved
migrations/kdvh/import_functions.go Show resolved Hide resolved
migrations/kdvh/main.go Show resolved Hide resolved
migrations/kdvh/main.go Outdated Show resolved Hide resolved
migrations/kdvh/main.go Outdated Show resolved Hide resolved
migrations/kdvh/main.go Outdated Show resolved Hide resolved
migrations/kdvh/table.go Outdated Show resolved Hide resolved
migrations/kdvh/table.go Show resolved Hide resolved
justfile Outdated Show resolved Hide resolved
migrations/README.md Outdated Show resolved Hide resolved
migrations/kdvh/main.go Outdated Show resolved Hide resolved
migrations/kdvh/dump.go Show resolved Hide resolved
migrations/kdvh/dump.go Outdated Show resolved Hide resolved
migrations/kdvh/import.go Show resolved Hide resolved
migrations/kdvh/table.go Outdated Show resolved Hide resolved
"github.com/jackc/pgx/v5/pgxpool"
)

// TODO: I'm not sure I like the interface solution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like it either. Would it be better to generate all the rows up front and use CopyFromRows?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, do you mean decoupling LardObs to separate data and flags timeseries, instead of having them together? That sounds reasonable

Copy link
Member

@intarga intarga Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's also a good idea, but I meant have methods on the timeseries type that generate [][]any (all the rows at once). Then you could drop the interface and just do pgx.CopyFromRows(ts.new_method()). Not sure what the performance implications would be, depends a bit on Go's runtime and pgx's implementation.

migrations/kdvh/product_offsets.csv Outdated Show resolved Hide resolved
migrations/kdvh_test.go Outdated Show resolved Hide resolved
@intarga
Copy link
Member

intarga commented Nov 13, 2024

@Lun4m Do you think there would be a point in removing the indices while importing?

I would think it's maybe redundant with COPY, but this doc page suggests it: https://www.postgresql.org/docs/current/populate.html#POPULATE-RM-INDEXES

In any case we should probably run ANALYZE after importing

@Lun4m
Copy link
Collaborator Author

Lun4m commented Nov 13, 2024

I don't have an opinion about it, but there aren't any drawbacks at the moment, since we don't have users, right? I don't see any reason why we shouldn't do it 👍

On another note, do you think KDVH and Kvalobs flags should be stored in the same table?
Kvalobs has more fields (cfailed and corrected, as seen in the kvdata table).
Just wondering how we plan to handle queries that target imported data.
Is there any use case where someone might want to query old flags?

@intarga
Copy link
Member

intarga commented Nov 13, 2024

On another note, do you think KDVH and Kvalobs flags should be stored in the same table?

Hmm, I'm not sure... I think it makes sense to store them in the same table, they would have been migrated to the same table in ODA. It's really a question of what Frost needs, so I would maybe ask Jo if and how Frost handles this. Louise would be the best person to ask if she wasn't on permisjon

@intarga
Copy link
Member

intarga commented Nov 14, 2024

@Lun4m Are you filtering open and closed data? I don't remember seeing anything about that when reading the code, but perhaps I missed it

@Lun4m
Copy link
Collaborator Author

Lun4m commented Nov 14, 2024

Good point, yeah, I forgot I need to filter those out for the time being

@Lun4m Lun4m force-pushed the kdvh_importer branch 2 times, most recently from 2337032 to 93fbb62 Compare November 14, 2024 11:15
@ketilt
Copy link
Collaborator

ketilt commented Nov 14, 2024

@Lun4m Are you filtering open and closed data? I don't remember seeing anything about that when reading the code, but perhaps I missed it

I can provide a list of which stnr / tables are defined as closed

@intarga
Copy link
Member

intarga commented Nov 14, 2024

I can provide a list of which stnr / tables are defined as closed

Thanks, but I don't think that's necessary, we can fetch it from stinfosys

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants