pylance write to local disk with a path started with s3:// #799

Open
Renkai opened this issue Apr 24, 2023 · 4 comments
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed python rust Rust related tasks

Comments

@Renkai
Contributor

Renkai commented Apr 24, 2023

edit by @changhiskhan: 

One final thing to test:

If you don't have any S3 credentials set up, writing to S3 should raise an exception rather than silently writing to the local drive.

With the code below, I want to write to an S3 path, but it writes to a local directory named s3: instead.

import lance
import duckdb
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset
import shutil

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hehe').getOrCreate()

data = [("James", "", "Smith", "36636", "M", 60000),
        ("Michael", "Rose", "", "40288", "M", 70000),
        ("Robert", "", "Williams", "42114", "", 400000),
        ("Maria", "Anne", "Jones", "39192", "F", 500000),
        ("Jen", "Mary", "Brown", "", "F", 0)]

columns = ["first_name", "middle_name", "last_name", "dob", "gender", "salary"]
pysparkDF = spark.createDataFrame(data=data, schema=columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)

df = pysparkDF.toPandas()
print("converted to pandas")
# df = pd.DataFrame({"a": [5]})
# shutil.rmtree("/tmp/test_df.lance", ignore_errors=True)
import os

os.environ['AWS_PROFILE'] = 'some-profile-name'
dataset = lance.write_dataset(df, "s3://the-bucket-name/the-path-under-bucket")

If I remove the AWS_PROFILE environment variable, I get this error

  File "/Users/renkaige/tubi/Grinder/python/hehe.py", line 32, in <module>
    dataset = lance.write_dataset(df, "s3://the-bucket-name/the-path-under-bucket")
  File "/usr/local/Caskroom/miniconda/base/envs/spock3/lib/python3.10/site-packages/lance/dataset.py", line 654, in write_dataset
    _write_dataset(reader, uri, params)
OSError: LanceError(I/O): Generic S3 error: response error "request error", after 0 retries: error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: Operation timed out (os error 60)

instead of the local s3: directory that is created when the variable is set:

➜  python git:(master) ✗ tree s3:
s3:
└── the-bucket-name
    └── the-path-under-bucket
        ├── _latest.manifest
        ├── _versions
        │   └── 1.manifest
        └── data
            └── 6e3c3a9c-8a43-4c96-b2bf-7c8d0018170f.lance

5 directories, 3 files

The pylance version is 0.4.3, installed from PyPI.

@Renkai
Contributor Author

Renkai commented Apr 24, 2023

After adding some logging in my branch, I got err: IO("Generic S3 error: Profile support requires aws_profile feature").

@Renkai
Contributor Author

Renkai commented Apr 24, 2023

Here we try the object store first. If that fails, we then try to write to a local path built from the same URL. If the URL that failed uses an object-store scheme, the local path should not be used as a fallback.

https://github.com/eto-ai/lance/blob/d66f2b3887c4fd75bbefbc3e0e055eba9ce618ad/rust/src/io/object_store.rs#L101-L102
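The behaviour proposed above can be sketched in Python (the linked code itself is Rust). This is only an illustration under stated assumptions, not lance's implementation: `open_store`, `REMOTE_SCHEMES`, and the two opener callables are hypothetical names, and the set of remote schemes is a guess.

```python
from urllib.parse import urlparse

# Assumption: schemes treated as object stores (illustrative only).
REMOTE_SCHEMES = {"s3", "gs"}


def open_store(uri, open_remote, open_local):
    """Open `uri`, using a local path only when the URI is not remote.

    `open_remote` and `open_local` are hypothetical callables standing in
    for the real object-store and filesystem constructors.
    """
    scheme = urlparse(uri).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        # An object-store URI is never reinterpreted as a local path:
        # any error (bad credentials, missing feature) propagates.
        return open_remote(uri)
    return open_local(uri)
```

With this shape, a failure such as the `aws_profile` feature error surfaces to the caller instead of being hidden by a write to a local `s3:` directory.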

@gsilvestrin
Contributor

Thanks for the bug submission @Renkai! Appreciate you attaching a PR. We had to move away from handling ParseError::RelativeUrlWithoutBase because it doesn't work well with Windows paths.

This part of the code was a little brittle, so I refactored it to make it explicit which schemes are treated as local and which as remote. It will be part of the next lance release.
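As a rough illustration of what explicit scheme handling could look like, here is a Python sketch (the refactor itself is in Rust and may differ). `classify_uri` and both scheme sets are hypothetical. The single-letter check is one possible way to keep Windows drive-letter paths such as `C:\data` local, since a URL parser reads the drive letter as a scheme.

```python
from urllib.parse import urlparse

# Assumptions: illustrative scheme sets, not the ones lance actually uses.
LOCAL_SCHEMES = {"", "file"}
REMOTE_SCHEMES = {"s3", "gs"}


def classify_uri(uri):
    """Return "local" or "remote" for `uri`, or raise on an unknown scheme."""
    scheme = urlparse(uri).scheme.lower()
    if len(scheme) == 1:
        # A one-letter "scheme" is treated as a Windows drive letter (C:\...).
        return "local"
    if scheme in LOCAL_SCHEMES:
        return "local"
    if scheme in REMOTE_SCHEMES:
        return "remote"
    raise ValueError(f"unsupported URI scheme: {scheme!r}")
```

Raising on an unrecognised scheme, rather than defaulting to local, avoids the silent fallback reported in this issue.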

@gsilvestrin gsilvestrin added the bug Something isn't working label Apr 24, 2023
@changhiskhan
Contributor

One final thing to test:

If you don't have any S3 credentials set up, writing to S3 should raise an exception rather than silently writing to the local drive.
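One way to check this could look like the sketch below. It is a hedged example, not an existing lance test: `check_no_local_fallback` and `write_fn` are hypothetical, with `write_fn` standing in for something like `lambda uri: lance.write_dataset(df, uri)` run in an environment without S3 credentials.

```python
import os
import tempfile


def check_no_local_fallback(write_fn, uri="s3://the-bucket-name/the-path-under-bucket"):
    """Return True if `write_fn(uri)` raises and creates no local 's3:' directory.

    `write_fn` is a hypothetical stand-in for the real write call.
    """
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as workdir:
        os.chdir(workdir)  # any stray output lands in a scratch directory
        try:
            try:
                write_fn(uri)
                raised = False
            except Exception:
                raised = True
            # The bug showed up as a directory literally named "s3:".
            stray_dir = os.path.exists(os.path.join(workdir, "s3:"))
        finally:
            os.chdir(original_cwd)  # restore before the scratch dir is removed
    return raised and not stray_dir
```

The helper changes the working directory because the stray `s3:` directory in the report was created relative to where the script ran.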

@changhiskhan changhiskhan added good first issue Good for newcomers help wanted Extra attention is needed python rust Rust related tasks labels Jul 2, 2023
3 participants