enable import from GCS emulator without PublicHost #248

@totem3 totem3 (Collaborator) commented Nov 23, 2023

fixes #209

Summary of problem:

There is an issue with the job that imports files from GCS, specifically when using the GCS Emulator. As detailed in issue #209, attempts to import data from the GCS Emulator sometimes fail.

This happens when publicHost is not set in the GCS Emulator, or when the emulator is accessed through a host other than publicHost.

We have spent quite some time investigating this issue, and since there is already an existing issue with comments on it, we believe there is value in making it work without needing to set a publicHost.

Cause:

The problem arises due to two different URL formats used for accessing objects in the GCS Emulator:

  • /storage/v1/b/{bucketName}/o/{objectName}
  • /{bucketName}/{objectName}

The second URL pattern is only valid for requests addressed to publicHost in the GCS Emulator. When downloading files from GCS (using client.Bucket(...).Object(...).NewReader()), the Go GCS SDK uses the second format, which requires a valid publicHost and fails if it is not set.
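To make the difference concrete, here is a minimal Go sketch of the two path shapes. The helper names `jsonAPIPath` and `directPath` and the sample bucket and object are hypothetical, and percent-encoding the object name as a single path segment is an assumption about what the JSON API expects. This is not the SDK's own code.

```go
package main

import (
	"fmt"
	"net/url"
)

// jsonAPIPath builds the JSON API style path, which carries the storage/v1 prefix.
// Assumption: the object name is percent-encoded as one path segment,
// so "data/users.csv" becomes "data%2Fusers.csv".
func jsonAPIPath(bucket, object string) string {
	return fmt.Sprintf("/storage/v1/b/%s/o/%s", url.PathEscape(bucket), url.PathEscape(object))
}

// directPath builds the prefix-less path, which the emulator only serves
// for requests addressed to its publicHost.
func directPath(bucket, object string) string {
	return fmt.Sprintf("/%s/%s", bucket, object)
}

func main() {
	// Hypothetical bucket and object names, for illustration only.
	fmt.Println(jsonAPIPath("my-bucket", "data/users.csv"))
	fmt.Println(directPath("my-bucket", "data/users.csv"))
}
```

Only the first path contains the API prefix, so only the first one can be routed by an emulator that does not know which host counts as its publicHost.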

The issue can be pinpointed in the code here:
When building the URL for data reading, the method at google-cloud-go#L788-L793 is used. This method does not take the API prefix (storage/v1) into account, considering only the host, bucket name, and object path. It is internally used in the NewReader method at bigquery-emulator#L1087.

The JSON API does not have this problem, because it uses the first URL format even when reading data (google-api-go-client#L12441).

This issue seems to be specific to the Emulator rather than a problem with standard GCS usage, most likely because storage.googleapis.com serves objects directly at URLs without an API prefix.

Changes made in this PR:

I have enabled the option to use the JSON API, ensuring that imports work even when a publicHost is not set for the emulator. Since the JSON download API was introduced in v1.30.0, I have upgraded the cloud.google.com/go/storage version.
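As a rough illustration of why reading through the JSON API path sidesteps the problem, the sketch below runs a toy HTTP server that, like an emulator without publicHost, only routes the `/storage/v1/...` form. The server, `newToyEmulator`, `fetch`, and the sample bucket and object are all hypothetical stand-ins, not the emulator or the SDK. In the real client, the corresponding switch is presumably the `storage.WithJSONReads()` option passed to `storage.NewClient`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newToyEmulator starts a hypothetical stand-in for an emulator that has no
// publicHost configured: only the JSON API style path is routed, so the
// prefix-less path falls through to the mux's default 404 handler.
func newToyEmulator() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/storage/v1/b/my-bucket/o/users.csv", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "id,name\n1,alice\n")
	})
	return httptest.NewServer(mux)
}

// fetch issues a GET for base+path and returns the status code and body.
func fetch(base, path string) (int, string, error) {
	resp, err := http.Get(base + path)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", err
	}
	return resp.StatusCode, string(body), nil
}

func main() {
	srv := newToyEmulator()
	defer srv.Close()

	paths := []string{
		"/storage/v1/b/my-bucket/o/users.csv?alt=media", // JSON API style read
		"/my-bucket/users.csv",                          // prefix-less read
	}
	for _, p := range paths {
		code, _, err := fetch(srv.URL, p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(p, "->", code)
	}
}
```

The JSON API style request is served, while the prefix-less request is rejected, which mirrors the import failure described in #209.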

This might be more a problem with the Go GCS SDK than with the BigQuery Emulator. So, if this fix isn't right, please let me know. If that's the case, I'm thinking of making another PR to add guidelines in the README about setting a publicHost for the GCS Emulator.

Thank you for maintaining such a great product.

@goccy goccy (Owner) left a comment

Thank you for your contribution !! LGTM 👍
Please resolve conflict 🙏

@goccy goccy added the reviewed label Apr 6, 2024
@totem3 totem3 force-pushed the feature/import-from-gcs-emulator-without-public-host branch from b4ea961 to 8bf5a73 Compare April 7, 2024 06:03
@totem3 totem3 (Collaborator, Author) commented Apr 7, 2024

@goccy
Thank you for the review! I've rebased onto main.
I dropped the commit that was causing conflicts because the dependency modules in the latest main were already updated enough, making that commit unnecessary. Other than that, I haven't made any changes.

@goccy goccy (Owner) commented Apr 7, 2024

Thank you for your quick response !!

@goccy goccy merged commit 5ad569f into goccy:main Apr 7, 2024
4 checks passed

Successfully merging this pull request may close these issues.

Loading a CSV from emulated GCS fails