Add `TILEDB_DATETIME_DAY` type support for Arrow #2002

kounelisagis · 2024-07-09T11:40:16Z

The Arrow C data interface supports date32[days]. Let's use it as the Arrow mapping for TILEDB_DATETIME_DAY.

The date32[days] type will be useful when Arrow is used, for example, to create a Pandas DataFrame.

All TileDB datetime attribute values use the same 64-bit representation as NumPy's np.datetime64, so we need a way to turn that 64-bit representation into the 32-bit representation that Arrow expects for date32.

The contents of bufferinfo.data for a TILEDB_DATETIME_DAY attribute:

228 30 0 0 0 0 0 0 19 0 0 0 0 0 0 0 248 25 0 0 0 0 0 0 203 1 0 0 0 0 0 0

The 32-bit representation that Arrow expects, and that this PR produces, is:
228 30 0 0 19 0 0 0 248 25 0 0 203 1 0 0
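
As a sanity check of the two dumps above, here is a minimal Python sketch (not the PR's C++ code) that decodes the quoted bytes as 64-bit day counts and re-encodes them as 32-bit values. It assumes the dump was taken on a little-endian machine.

```python
import struct

# Bytes of bufferinfo.data quoted above: four 64-bit day counts.
wide = bytes([228, 30, 0, 0, 0, 0, 0, 0,
              19, 0, 0, 0, 0, 0, 0, 0,
              248, 25, 0, 0, 0, 0, 0, 0,
              203, 1, 0, 0, 0, 0, 0, 0])

# Assumption: little-endian layout ("<"), which matches the dump.
days = struct.unpack("<4q", wide)    # four signed 64-bit integers
narrow = struct.pack("<4i", *days)   # the same values as signed 32-bit integers

print(days)          # (7908, 19, 6648, 459)
print(list(narrow))  # [228, 30, 0, 0, 19, 0, 0, 0, 248, 25, 0, 0, 203, 1, 0, 0]
```

The values are days since 1970-01-01; only the width of each element changes.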

Overflow is very unlikely in practice: a signed 32-bit day count already spans roughly ±5.8 million years around the epoch, even though the 64-bit source buffer can in principle hold values outside that range.
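
To put rough numbers on the ranges involved, a small Python sketch. The 365.2425 days-per-year figure is the average Gregorian year and is only used for a ballpark estimate.

```python
from datetime import date

INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

# Approximate span of a signed 32-bit day count, in Gregorian years.
years_int32 = INT32_MAX / 365.2425

# Python's own date type tops out at year 9999; its offset from the
# 1970-01-01 epoch still fits comfortably in 32 bits.
max_date_days = (date.max - date(1970, 1, 1)).days

print(round(years_int32))          # about 5.88 million years
print(max_date_days <= INT32_MAX)  # True
print(INT64_MAX > INT32_MAX)       # True: a 64-bit value can exceed the date32 range
```

So any realistic calendar date fits, but a 64-bit buffer can still carry a value that date32 cannot represent.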

The initial issue:

R
> library(tiledb)
> library(palmerpenguins)
> praw <- penguins_raw
> fromDataFrame(praw, "/tmp/penguinsraw")

python
>>> import tiledb
>>> a = tiledb.open("/tmp/penguinsraw/")
>>> a.df[:]

This used to raise the following error. With this PR it no longer does.

TileDBError                               Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 a.df[:]

File ~/work/git/TileDB-Py/tiledb/multirange_indexing.py:259, in _BaseIndexer.__getitem__(self, idx)
    257     self.subarray = Subarray(self.array)
    258     self._set_ranges(idx)
--> 259 return self if self.return_incomplete else self._run_query()

File ~/work/git/TileDB-Py/tiledb/multirange_indexing.py:401, in DataFrameIndexer._run_query(self)
    399 elif self.use_arrow:
    400     with timing("buffer_conversion_time"):
--> 401         table = self.pyquery._buffers_to_pa_table()
    403     columns = []
    404     pa_schema = table.schema

TileDBError: TileDB-Arrow: tiledb datatype not understood ('DATETIME_DAY', cell_val_num: 1)

nguyenv

Looks good to me but would like to check the data being read back in the unit test.

tiledb/tests/test_pandas_dataframe.py

tiledb/py_arrow_io_impl.h

eddelbuettel · 2024-08-29T14:38:57Z

The underlying numpy data model has resolution increments for every power of ten. R very much does not: it has a native 'Date' (integer width) and POSIXct aka Datetime (double), plus an add-on package for nanoseconds. So for the R package I mapped these at the two corresponding resolutions:

> uri <- tempfile()
> D <- data.frame(ind = 1:10, days = Sys.Date() + 0:9, seconds = Sys.time() + 0:9)
> fromDataFrame(D, uri, col_index=1)
> schema(uri)
tiledb_array_schema(
    domain=tiledb_domain(c(
        tiledb_dim(name="ind", domain=c(1L,10L), tile=10L, type="INT32", filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
    )),
    attrs=c(
        tiledb_attr(name="days", type="DATETIME_DAY", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1)))),
        tiledb_attr(name="seconds", type="DATETIME_MS", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
    ),
    cell_order="COL_MAJOR", tile_order="COL_MAJOR", capacity=10000, sparse=TRUE, allows_dups=TRUE,
    coords_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
    offsets_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
    validity_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("RLE"),"COMPRESSION_LEVEL",-1)))
)
> chk <- tiledb_array(uri, return_as="data.frame")[]
> chk
   ind       days             seconds
1    1 2024-08-29 2024-08-29 09:35:01
2    2 2024-08-30 2024-08-29 09:35:02
3    3 2024-08-31 2024-08-29 09:35:03
4    4 2024-09-01 2024-08-29 09:35:04
5    5 2024-09-02 2024-08-29 09:35:05
6    6 2024-09-03 2024-08-29 09:35:06
7    7 2024-09-04 2024-08-29 09:35:07
8    8 2024-09-05 2024-08-29 09:35:08
9    9 2024-09-06 2024-08-29 09:35:09
10  10 2024-09-07 2024-08-29 09:35:10
>

I do not know pandas very well (or at all, really), so I am not sure why you need the bit-operation logic (but maybe it is just standard casting...). Can you not rely on the Arrow-level representation for DATETIME_DAY and DATETIME_MS? If you already do and I missed it, my bad.

eddelbuettel · 2024-08-29T14:40:28Z

PS This becomes clearer when I read as arrow (well: nanoarrow, at user-level return converted to Arrow):

> chk <- tiledb_array(uri, return_as="arrow")[]
> chk
Table
10 rows x 3 columns
$ind <int32 not null>
$days <date32[day] not null>
$seconds <timestamp[ms] not null>
>

teo-tsirpanis · 2024-09-10T12:57:15Z

tiledb/py_arrow_io_impl.h

+        std::memcpy(static_cast<uint8_t *>(buffers[1]) + i * 4,
+                    static_cast<uint8_t *>(buffers[1]) + i * 8, 4);


Suggested change

-        std::memcpy(static_cast<uint8_t *>(buffers[1]) + i * 4,
-                    static_cast<uint8_t *>(buffers[1]) + i * 8, 4);
+        static_cast<uint32_t *>(buffers[1])[i] = cast_checked(static_cast<uint64_t *>(buffers[1])[i]);

cast_checked should be implemented to throw if its parameter is larger than std::numeric_limits<int32_t>::max(), and to static_cast otherwise.

And buffers[1] should be declared as a variable outside of the loop (for clarity mostly).

eddelbuettel · 2024-09-10

Interesting. Where is cast_checked defined?

teo-tsirpanis · 2024-09-10

It's not, it will have to be written. 😅 Reworded my comment.
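
For illustration only, a Python sketch of the idea discussed in this thread: narrow each 64-bit element to 32 bits in place, and refuse values that do not fit. `cast_checked` here is a hypothetical stand-in for the helper that still had to be written, and `narrow_in_place` is a name made up for this example; neither is the code in py_arrow_io_impl.h. The sketch checks both the upper and the lower bound, and it converts values rather than copying raw bytes, so it does not depend on the machine's byte order.

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def cast_checked(value: int) -> int:
    """Return value unchanged if it fits in a signed 32-bit integer, else raise."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in int32")
    return value

def narrow_in_place(buf: bytearray, count: int) -> memoryview:
    """Rewrite `count` native-endian int64 values at the start of `buf` as int32.

    Element i is read from offset i*8 before anything is written to offset
    i*4, and i*4 <= i*8, so no unread element is overwritten. If a value is
    out of range, the elements before it have already been rewritten.
    """
    for i in range(count):
        (wide,) = struct.unpack_from("=q", buf, i * 8)
        struct.pack_into("=i", buf, i * 4, cast_checked(wide))
    return memoryview(buf)[: count * 4]

# Usage with the four day counts from the PR description.
buf = bytearray(struct.pack("=4q", 7908, 19, 6648, 459))
view = narrow_in_place(buf, 4)
print(struct.unpack("=4i", view))  # (7908, 19, 6648, 459)
print(len(buf))                    # 32: the allocation keeps its original size
```

Only the first half of the buffer is meaningful afterwards; the allocation itself is unchanged.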

ihnorton

How do we make sure we're not leaking the rest of the buffer?

tiledb/py_arrow_io_impl.h

kounelisagis · 2024-10-09T12:56:29Z

How do we make sure we're not leaking the rest of the buffer?

The only difference is the data shifting. The array_ variable is being freed in the same way as before:

TileDB-Py/tiledb/py_arrow_io_impl.h

Line 571 in 687d549

std::free(array_);

kounelisagis requested review from ihnorton and nguyenv July 9, 2024 11:40

nguyenv approved these changes Jul 9, 2024

kounelisagis marked this pull request as draft July 29, 2024 17:09

teo-tsirpanis reviewed Jul 31, 2024

nguyenv self-requested a review August 7, 2024 18:08

kounelisagis force-pushed the agis/sc-25572/add-datetime64-days-support-pyarrow branch 2 times, most recently from 6b80fd2 to d2f36ca Compare August 9, 2024 18:05

kounelisagis marked this pull request as ready for review August 9, 2024 18:25

kounelisagis requested a review from teo-tsirpanis August 9, 2024 18:25

kounelisagis changed the title ~~Add datetime64[D] attribute support for PyArrow~~ Add TILEDB_DATETIME_DAY type support for Arrow Aug 9, 2024

kounelisagis changed the title ~~Add TILEDB_DATETIME_DAY type support for Arrow~~ Add TILEDB_DATETIME_DAY type support for Arrow Aug 9, 2024

kounelisagis added 2 commits August 12, 2024 10:51

Add in place buffer shift for TILEDB_DATETIME_DAY

04ba022

Add tests

6c91c64

kounelisagis force-pushed the agis/sc-25572/add-datetime64-days-support-pyarrow branch from d2f36ca to 6c91c64 Compare August 12, 2024 07:51

teo-tsirpanis removed their request for review August 12, 2024 09:44

ihnorton reviewed Aug 29, 2024

ihnorton requested a review from eddelbuettel August 29, 2024 14:29

teo-tsirpanis reviewed Sep 10, 2024

kounelisagis added 4 commits September 10, 2024 16:13

Add overflow check

5265747

Add out of range and overflow tests

f2fbfe3

Linting

50e422d

Merge branch 'dev' into agis/sc-25572/add-datetime64-days-support-pya…

687d549

ihnorton reviewed Sep 11, 2024

eddelbuettel reviewed Sep 11, 2024

kounelisagis added 2 commits October 9, 2024 13:22

Fix non-little-endian

dd68bef

Merge branch 'dev' into agis/sc-25572/add-datetime64-days-support-pya…

06f90d7

kounelisagis force-pushed the agis/sc-25572/add-datetime64-days-support-pyarrow branch from dd67a3b to 06f90d7 Compare October 9, 2024 13:23

kounelisagis requested review from ihnorton and teo-tsirpanis October 9, 2024 13:23

kounelisagis merged commit f206545 into dev Oct 23, 2024
15 checks passed

kounelisagis deleted the agis/sc-25572/add-datetime64-days-support-pyarrow branch October 23, 2024 11:33

kounelisagis added a commit that referenced this pull request Oct 23, 2024

Add TILEDB_DATETIME_DAY type support for Arrow (#2002)

a0b885b

* Add in place buffer shift for TILEDB_DATETIME_DAY * Add tests
