Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TILEDB_DATETIME_DAY type support for Arrow #2002

Merged
merged 8 commits into from
Oct 23, 2024

Conversation

kounelisagis
Copy link
Member

@kounelisagis kounelisagis commented Jul 9, 2024

The Arrow C data interface supports date32[days]. Let's use it as a conversion for TILEDB_DATETIME_DAY in Arrow.

The date32[days] type will be useful when Arrow is used, for example, to create a Pandas DataFrame.

Since all TileDB datetime values for attributes use the same representation as NumPy, np.datetime64, we have to find a way to transform this 64bit representation into a 32bit representation as expected by Arrow.

The contents of bufferinfo.data for a TILEDB_DATETIME_DAY attribute:

228 30 0 0 0 0 0 0 19 0 0 0 0 0 0 0 248 25 0 0 0 0 0 0 203 1 0 0 0 0 0 0

but what we would like to have in the 32bit representation, achieved by this PR, is:
228 30 0 0 19 0 0 0 248 25 0 0 203 1 0 0

The possibility of overflow seems impossible given the ranges of days that both 32bit and 64bit buffers can handle.


The initial isue:

R
> library(tiledb)
> library(palmerpenguins)
> praw <- penguins_raw
> fromDataFrame(praw, "/tmp/penguinsraw")
python
>>> import tiledb
>>> a = tiledb.open("/tmp/penguinsraw/")
>>> a.df[:]

We used to get the following error. Now we are not.

TileDBError                               Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 a.df[:]

File ~/work/git/TileDB-Py/tiledb/multirange_indexing.py:259, in _BaseIndexer.__getitem__(self, idx)
    257     self.subarray = Subarray(self.array)
    258     self._set_ranges(idx)
--> 259 return self if self.return_incomplete else self._run_query()

File ~/work/git/TileDB-Py/tiledb/multirange_indexing.py:401, in DataFrameIndexer._run_query(self)
    399 elif self.use_arrow:
    400     with timing("buffer_conversion_time"):
--> 401         table = self.pyquery._buffers_to_pa_table()
    403     columns = []
    404     pa_schema = table.schema

TileDBError: TileDB-Arrow: tiledb datatype not understood ('DATETIME_DAY', cell_val_num: 1)

Copy link
Collaborator

@nguyenv nguyenv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me but would like to check the data being read back in the unit test.

tiledb/tests/test_pandas_dataframe.py Outdated Show resolved Hide resolved
@kounelisagis kounelisagis marked this pull request as draft July 29, 2024 17:09
@nguyenv nguyenv self-requested a review August 7, 2024 18:08
@kounelisagis kounelisagis force-pushed the agis/sc-25572/add-datetime64-days-support-pyarrow branch 2 times, most recently from 6b80fd2 to d2f36ca Compare August 9, 2024 18:05
@kounelisagis kounelisagis marked this pull request as ready for review August 9, 2024 18:25
@kounelisagis kounelisagis changed the title Add datetime64[D] attribute support for PyArrow Add TILEDB_DATETIME_DAY type support for Arrow Aug 9, 2024
@kounelisagis kounelisagis changed the title Add TILEDB_DATETIME_DAY type support for Arrow Add TILEDB_DATETIME_DAY type support for Arrow Aug 9, 2024
@kounelisagis kounelisagis force-pushed the agis/sc-25572/add-datetime64-days-support-pyarrow branch from d2f36ca to 6c91c64 Compare August 12, 2024 07:51
@teo-tsirpanis teo-tsirpanis removed their request for review August 12, 2024 09:44
@eddelbuettel
Copy link
Contributor

eddelbuettel commented Aug 29, 2024

The underlying numpy data model has resolution increments for every power of ten. R very much does not, it has native 'Date' (integer width) and POSIXct aka Datetime (double) (and an add-on package for nanoseconds). So for the R package I mapped that at the two different corresponding resolutions:

> uri <- tempfile()
> D <- data.frame(ind = 1:10, days = Sys.Date() + 0:9, seconds = Sys.time() + 0:9)
> fromDataFrame(D, uri, col_index=1)
> schema(uri)
tiledb_array_schema(
    domain=tiledb_domain(c(
        tiledb_dim(name="ind", domain=c(1L,10L), tile=10L, type="INT32", filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
    )),
    attrs=c(
        tiledb_attr(name="days", type="DATETIME_DAY", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1)))),
        tiledb_attr(name="seconds", type="DATETIME_MS", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
    ),
    cell_order="COL_MAJOR", tile_order="COL_MAJOR", capacity=10000, sparse=TRUE, allows_dups=TRUE,
    coords_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
    offsets_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
    validity_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("RLE"),"COMPRESSION_LEVEL",-1)))
)
> chk <- tiledb_array(uri, return_as="data.frame")[]
> chk
   ind       days             seconds
1    1 2024-08-29 2024-08-29 09:35:01
2    2 2024-08-30 2024-08-29 09:35:02
3    3 2024-08-31 2024-08-29 09:35:03
4    4 2024-09-01 2024-08-29 09:35:04
5    5 2024-09-02 2024-08-29 09:35:05
6    6 2024-09-03 2024-08-29 09:35:06
7    7 2024-09-04 2024-08-29 09:35:07
8    8 2024-09-05 2024-08-29 09:35:08
9    9 2024-09-06 2024-08-29 09:35:09
10  10 2024-09-07 2024-08-29 09:35:10
> 

I do not know pandas very well (or, at all, really) so I am not sure why you need to bit operation logic (but maybe it just standard casting...). Can you not resort to the Arrow level representation for DAY and DATETIME_MS? If you do and I missed it, my bad.

@eddelbuettel
Copy link
Contributor

PS This becomes clearer when I read as arrow (well: nanoarrow, at user-level return converted to Arrow):

> chk <- tiledb_array(uri, return_as="arrow")[]
> chk
Table
10 rows x 3 columns
$ind <int32 not null>
$days <date32[day] not null>
$seconds <timestamp[ms] not null>
> 

Comment on lines 784 to 785
std::memcpy(static_cast<uint8_t *>(buffers[1]) + i * 4,
static_cast<uint8_t *>(buffers[1]) + i * 8, 4);
Copy link
Member

@teo-tsirpanis teo-tsirpanis Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::memcpy(static_cast<uint8_t *>(buffers[1]) + i * 4,
static_cast<uint8_t *>(buffers[1]) + i * 8, 4);
*static_cast<uint32_t*>(buffers[1])[i] = cast_checked(*static_cast<uint64_t*>(buffers[1])[i]);

cast_checked should be implemented to throw if its parameter is larger than std::numeric_limits<int32>::max() and static_cast otherwise.

And buffers[1] should be declared as a variable outside of the loop (for clarity mostly).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Where is cast_checked defined?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not, it will have to be written. 😅 Reworded my comment.

Copy link
Member

@ihnorton ihnorton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we make sure we're not leaking the rest of the buffer?

@kounelisagis
Copy link
Member Author

How do we make sure we're not leaking the rest of the buffer?

The only difference is the data shifting. The array_ variable is being freed in the same way as before:

std::free(array_);

@kounelisagis kounelisagis force-pushed the agis/sc-25572/add-datetime64-days-support-pyarrow branch from dd67a3b to 06f90d7 Compare October 9, 2024 13:23
@kounelisagis kounelisagis merged commit f206545 into dev Oct 23, 2024
15 checks passed
@kounelisagis kounelisagis deleted the agis/sc-25572/add-datetime64-days-support-pyarrow branch October 23, 2024 11:33
kounelisagis added a commit that referenced this pull request Oct 23, 2024
* Add in place buffer shift for TILEDB_DATETIME_DAY
* Add tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants