
feat: add endpoint for querying multiple metadata #311

Merged: 16 commits into DHI:main on Aug 20, 2023

Conversation

@DenizYil (Contributor) commented Aug 9, 2023

This PR closes #309

This PR gives users the ability to send a POST request to the /metadata endpoint, where they can filter the returned columns via path parameters, e.g. POST /metadata/bounds, which will only return the bounds (plus the keys, which are included by default for every dataset).
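
For illustration only (the endpoint shape changes later in this thread), a request against the proposed API might look like the following sketch; the key values are borrowed from the benchmark script further down, and the response handling is an assumption:

import requests

# Originally proposed style: the requested column ("bounds") is a path parameter,
# and the dataset keys are sent in the JSON body.
response = requests.post(
    "http://localhost:5000/metadata/bounds",
    json={"keys": [["01012012", "HRL", "annually", "IMD"]]},
)
print(response.json())  # expected to contain only the keys and bounds of each dataset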

Notes:

  • I had a challenge with the naming convention and the structure itself. Should this endpoint be in its own file, or share a file with the GET request?
  • Should the methods be called post_metadata, get_multiple_metadata, get_metadatas, or something else? For now, I named the endpoint post_metadata and every other method get_multiple_metadata.
  • Should the user be able to specify e.g. bounds_south as a column or should we only allow them to specify bounds? (same with range & min/max). For now, I have added this functionality.
  • Please let me know if you have a better way to do the specific-decoding handling instead of the added _decode_specific_data function.
  • Please let me know as well if any variable names/types should be changed.
  • What should we do in the case of lazy-loading? Should we allow the user to lazy-load and generate metadata for everything? For now, I excluded this functionality as it could take a very long time. Let me know if you'd like this added.

Please let me know if you have any feedback; it is very much appreciated. I'm excited to see this feature being used.

@dionhaefner (Collaborator)

Thanks a lot for the fast implementation @DenizYil !

General notes

New methods in driver

I am not a fan of introducing additional driver code for this feature. The way I envisioned it was something like this in the handler:

out = []

with driver.connect():
    for keys in requested_datasets:
        metadata = driver.get_metadata(keys)
        metadata = {k: metadata[k] for k in requested_columns}
        out.append(metadata)

Sure, it's not getting optimal performance out of the SQL queries, but these are rarely a bottleneck anyway. On the flipside, we don't need to introduce hundreds of additional lines of code that will come back to haunt us if we ever implement a non-SQL metadata driver.

I could be convinced otherwise with hard evidence (benchmarks that show this can be a real-world bottleneck), but in the absence of that I would strongly prefer the simpler option.

API design

We should keep consistency between the GET and POST endpoints wherever we can. POST /metadata/bounds is confusing because it looks like bounds could be a key value. I suggest the following API instead:

GET /metadata/<keys>?columns=[bounds,range]
POST /metadata?columns=[bounds,range]

This keeps the symmetry between both endpoints.
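
As an illustrative sketch of how the suggested API could be exercised (the column names, key values, and columns=[...] syntax are taken from elsewhere in this thread; the response handling is an assumption):

import requests

# GET: metadata for a single dataset, restricted to two columns
single = requests.get(
    "http://localhost:5000/metadata/01012012/HRL/annually/IMD?columns=[bounds,range]"
)

# POST: the same columns for several datasets in one request
multiple = requests.post(
    "http://localhost:5000/metadata?columns=[bounds,range]",
    json={
        "keys": [
            ["01012012", "HRL", "annually", "IMD"],
            ["01012012", "HRL", "annually", "TCD"],
        ]
    },
)
print(single.json(), multiple.json())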

Specific questions

  • I had a challenge with the naming convention and the structure itself. Should this endpoint be in its own file, or share a file with the GET request?

Sharing a file is fine.

  • Should the methods be called post_metadata or get_multiple_metadata or get_metadatas or something else..? For now, I named the endpoint post_metadata and every other method get_multiple_metadata.

Since the GET endpoint is called get_metadata, I think the new endpoint should be called get_multiple_metadata. For methods see comment above.

  • Should the user be able to specify e.g. bounds_south as a column or should we only allow them to specify bounds? (same with range & min/max). For now, I have added this functionality.

I don't think the added complexity is worth it for this feature, both in terms of actual code and more complex API. IMO, doing metadata["bounds"][0] in the frontend isn't more complex than passing bounds_south explicitly.

  • What should we do in the case of lazy-loading? Should we allow the user to lazy-load and generate metadata for everything? For now, I excluded this functionality as it could take a very long time. Let me know if you'd like this added.

If the user explicitly requests data then they should get it. Maybe emit a warning to the logger when using the POST request with lazy loading.
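
A minimal sketch of what such a warning could look like (the logger name, helper function, and call site are assumptions, not code from this PR):

import logging

logger = logging.getLogger("terracotta")  # assumed logger name

def warn_if_lazy_loading(num_datasets: int, lazy_loading: bool) -> None:
    # Hypothetical helper for the POST /metadata handler: metadata that has not been
    # precomputed must be generated on the fly, which can be slow for many datasets.
    if lazy_loading:
        logger.warning(
            "POST /metadata is computing metadata lazily for %d datasets; "
            "this may take a long time",
            num_datasets,
        )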

@DenizYil (Contributor, Author) commented Aug 9, 2023

Thanks a lot for the fast feedback @dionhaefner - it's truly appreciated. I believe I have addressed all of your comments.

Some notes:

  • That being said, I kept the multiple_data handler function. I think it makes sense not to have this functionality in the endpoint itself - but let me know if I should change this. I could merge the functionality with the existing function and add a multi parameter (or use isinstance) to check whether the contents of the list are lists or numbers - but that feels like added complexity.

  • I'll also note that on my computer (I haven't run this in a real production setting yet), performance is a lot slower. Querying 36 datasets takes 12 seconds on my machine (and it's a pretty good PC), whereas before it was almost instantaneous. I'm sure performance will be better in production, but I wanted to note it nonetheless.

  • Let me know if you think CommaSeparatedListField is over-engineered (see the rough sketch just below this list). I wanted to stay true to using Marshmallow.
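
A minimal sketch of the CommaSeparatedListField idea, assuming Marshmallow's standard custom-field hook; this is an illustration, not necessarily the field as implemented in the PR:

from marshmallow import ValidationError, fields

class CommaSeparatedListField(fields.Field):
    # Deserialize query strings such as "[bounds,range]" into a list of column names.
    def _deserialize(self, value, attr, data, **kwargs):
        if not isinstance(value, str):
            raise ValidationError("Expected a comma-separated string")
        return [item.strip() for item in value.strip("[]").split(",") if item.strip()]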

Thanks a lot again, and I'm looking forward to your feedback! 😄

@dionhaefner (Collaborator) commented Aug 9, 2023

I'll also note that on my computer (I haven't run this in a real production setting yet), performance is a lot slower. Querying 36 datasets takes 12 seconds on my machine (and it's a pretty good PC), whereas before it was almost instantaneous. I'm sure performance will be better in production, but I wanted to note it nonetheless.

That seems very slow to me. Can you share the SQLite database? (No need to include the rasters.)

@DenizYil (Contributor, Author) commented Aug 9, 2023

That seems very slow to me. Can you share the SQLite database?

Agreed.

Apologies though, perhaps I should have clarified. I'm using a PostgreSQL database which is hosted on Azure - so the DB is not running locally. I've hosted a Terracotta instance using Azure Functions, which only contains 36 datasets. This is the URL: https://100ktrees-tc.azurewebsites.net/datasets

@dionhaefner (Collaborator)

Before we get into code review, there should be a feature to limit the amount of data that can be requested (as a protection against accidental DoS). At first I thought we should use pagination, but that's probably not needed here because the number of datasets is known a priori. How about we introduce a runtime setting for this, like MAX_POST_METADATA_KEYS, and set it to a reasonable default (100?).
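
A minimal sketch of such a guard, assuming a handler-side check; the setting name comes from this comment, but the get_settings usage, default value, and exception type are assumptions:

from terracotta import get_settings

def check_metadata_key_limit(requested_keys: list) -> None:
    # Hypothetical guard for the POST /metadata handler: reject requests that ask for
    # more datasets than the proposed MAX_POST_METADATA_KEYS setting allows.
    settings = get_settings()
    max_keys = getattr(settings, "MAX_POST_METADATA_KEYS", 100)  # proposed setting
    if len(requested_keys) > max_keys:
        # The real handler would probably raise Terracotta's own exception type here.
        raise ValueError(
            f"Cannot query metadata for more than {max_keys} datasets per request"
        )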

@dionhaefner (Collaborator)

Apologies though, perhaps I should have clarified. I'm using a PostgreSQL database which is hosted on Azure - so the DB is not running locally. I've hosted a Terracotta instance using Azure Functions, which only contains 36 datasets. This is the URL: https://100ktrees-tc.azurewebsites.net/datasets

Can you say more about how you tested the performance of the new endpoint, then?

@DenizYil (Contributor, Author) commented Aug 9, 2023

Apologies though, perhaps I should have clarified. I'm using a PostgreSQL database which is hosted on Azure - so the DB is not running locally. I've hosted a Terracotta instance using Azure Functions, which only contains 36 datasets. This is the URL: https://100ktrees-tc.azurewebsites.net/datasets

Can you say more about how you tested the performance of the new endpoint, then?

I simply ran this script, and the results are consistent.

import requests
import time

start = time.time()

data = requests.post(
    "http://localhost:5000/metadata?columns=[metadata]",
    json={
        "keys": [
            ["01012012", "HRL", "annually", "IMD"],
            ["01012012", "HRL", "annually", "TCD"],
            ["01012015", "HRL", "annually", "GRA"],
            ["01012015", "HRL", "annually", "IMD"],
            ["01012015", "HRL", "annually", "TCD"],
            ["01012015", "HRL", "annually", "WAW"],
            ["01012018", "HRL", "annually", "GRA"],
            ["01012018", "HRL", "annually", "IMD"],
            ["01012018", "HRL", "annually", "TCD"],
            ["01012018", "HRL", "annually", "WAW"],
            ["202206241029443", "NED", "1", "B01"],
            ["202206241029443", "NED", "1", "B02"],
            ["202206241029443", "NED", "1", "B03"],
            ["202206241029443", "NED", "1", "B04"],
            ["202206241029443", "RGB", "1", "B01"],
            ["202206241029443", "RGB", "1", "B02"],
            ["202206241029443", "RGB", "1", "B03"],
            ["202206241029443", "RGB", "1", "B04"],
            ["202206241029444", "1", "F", "P"],
            ["202206241029444", "NED", "1", "B01"],
            ["202206241029444", "NED", "1", "B02"],
            ["202206241029444", "NED", "1", "B03"],
            ["202206241029444", "NED", "1", "B04"],
            ["202206241029444", "RGB", "1", "B01"],
            ["202206241029444", "RGB", "1", "B02"],
            ["202206241029444", "RGB", "1", "B03"],
            ["202206241029444", "RGB", "1", "B04"],
            ["202304081018382", "1", "F", "P"],
            ["202304081018382", "NED", "1", "B01"],
            ["202304081018382", "NED", "1", "B02"],
            ["202304081018382", "NED", "1", "B03"],
            ["202304081018382", "NED", "1", "B04"],
            ["202304081018382", "RGB", "1", "B01"],
            ["202304081018382", "RGB", "1", "B02"],
            ["202304081018382", "RGB", "1", "B03"],
            ["202304081018382", "RGB", "1", "B04"],
        ]
    },
)

end = time.time()

print(end - start)

Before we get into code review, there should be a feature to limit the amount of data that can be requested (as a protection against accidental DoS). At first I thought we should use pagination, but that's probably not needed here because the number of datasets is known a priori. How about we introduce a runtime setting for this, like MAX_POST_METADATA_KEYS, and set it to a reasonable default (100?).

Sure! I can add that to the PR. If we're going with one SQL query per dataset, then 100 seems reasonable, though it does seem a bit on the low end - in legitimate use cases (like querying the GET /metadata endpoint), the dataset counts have been 300+. If we stick with one SQL query per dataset, we could still consider increasing the limit later. I'll keep it at 100 for now though :)

@dionhaefner (Collaborator)

So you are running a local Terracotta server that connects to the remote PostgreSQL database?

@DenizYil (Contributor, Author) commented Aug 9, 2023

Yes. I'll do some testing tomorrow with an Azure Function that connects to an Azure PostgreSQL database to see performance in a real production environment. I'll also do some testing with a locally running database (and server). I'll keep you updated :)

@codecov bot commented Aug 11, 2023

Codecov Report

Merging #311 (ecff640) into main (3b28c08) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #311      +/-   ##
==========================================
+ Coverage   98.07%   98.11%   +0.04%     
==========================================
  Files          52       52              
  Lines        2280     2335      +55     
  Branches      320      327       +7     
==========================================
+ Hits         2236     2291      +55     
  Misses         29       29              
  Partials       15       15              
Files Changed Coverage Δ
terracotta/config.py 100.00% <100.00%> (ø)
terracotta/handlers/metadata.py 100.00% <100.00%> (ø)
terracotta/scripts/click_types.py 98.61% <100.00%> (ø)
terracotta/server/metadata.py 100.00% <100.00%> (ø)


@dionhaefner (Collaborator)

Starting to look good! Let's get that runtime config implemented and add test coverage for the POST /metadata endpoint; then we should be good to go.

@dionhaefner (Collaborator) left a comment


Some last nits and we're good to merge. Thanks again @DenizYil

Inline review comments (resolved): terracotta/server/metadata.py (outdated), terracotta/server/metadata.py, terracotta/handlers/metadata.py
@DenizYil (Contributor, Author)

Should be good now @dionhaefner! Thank you so much for your thorough feedback and your patience :-) I know I made some silly mistakes, but I've learned a good amount from this, so thanks a lot 😄

This will be a very nice feature to have 🥳

@dionhaefner (Collaborator) left a comment


Looking great, thanks!

I didn't see any silly mistakes here. I think this PR was a display of healthy OSS dynamics at work. Thanks for contributing.

@dionhaefner merged commit e6f1952 into DHI:main on Aug 20, 2023
9 checks passed
Successfully merging this pull request may close: Reduction in /metadata API calls for multiple datasets