Add TILEDB_DATETIME_DAY type support for Arrow #2002
Looks good to me but would like to check the data being read back in the unit test.
6b80fd2
to
d2f36ca
Compare
datetime64[D] attribute support for PyArrow → TILEDB_DATETIME_DAY type support for Arrow
d2f36ca
to
6c91c64
Compare
The underlying NumPy data model has resolution increments for every power of ten. R very much does not; it has native 'Date' (integer width) and
> uri <- tempfile()
> D <- data.frame(ind = 1:10, days = Sys.Date() + 0:9, seconds = Sys.time() + 0:9)
> fromDataFrame(D, uri, col_index=1)
> schema(uri)
tiledb_array_schema(
domain=tiledb_domain(c(
tiledb_dim(name="ind", domain=c(1L,10L), tile=10L, type="INT32", filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
)),
attrs=c(
tiledb_attr(name="days", type="DATETIME_DAY", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1)))),
tiledb_attr(name="seconds", type="DATETIME_MS", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
),
cell_order="COL_MAJOR", tile_order="COL_MAJOR", capacity=10000, sparse=TRUE, allows_dups=TRUE,
coords_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
offsets_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
validity_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("RLE"),"COMPRESSION_LEVEL",-1)))
)
> chk <- tiledb_array(uri, return_as="data.frame")[]
> chk
ind days seconds
1 1 2024-08-29 2024-08-29 09:35:01
2 2 2024-08-30 2024-08-29 09:35:02
3 3 2024-08-31 2024-08-29 09:35:03
4 4 2024-09-01 2024-08-29 09:35:04
5 5 2024-09-02 2024-08-29 09:35:05
6 6 2024-09-03 2024-08-29 09:35:06
7 7 2024-09-04 2024-08-29 09:35:07
8 8 2024-09-05 2024-08-29 09:35:08
9 9 2024-09-06 2024-08-29 09:35:09
10 10 2024-09-07 2024-08-29 09:35:10
>
I do not know pandas very well (or at all, really), so I am not sure why you need the bit-operation logic (but maybe it is just standard casting...). Can you not resort to the Arrow-level representation for DAY and DATETIME_MS? If you do and I missed it, my bad.
PS This becomes clearer when I read as Arrow (well: nanoarrow, converted to Arrow at user-level return):
> chk <- tiledb_array(uri, return_as="arrow")[]
> chk
Table
10 rows x 3 columns
$ind <int32 not null>
$days <date32[day] not null>
$seconds <timestamp[ms] not null>
>
tiledb/py_arrow_io_impl.h
std::memcpy(static_cast<uint8_t *>(buffers[1]) + i * 4,
            static_cast<uint8_t *>(buffers[1]) + i * 8, 4);
- std::memcpy(static_cast<uint8_t *>(buffers[1]) + i * 4,
-             static_cast<uint8_t *>(buffers[1]) + i * 8, 4);
+ static_cast<uint32_t *>(buffers[1])[i] = cast_checked(static_cast<uint64_t *>(buffers[1])[i]);
cast_checked should be implemented to throw if its parameter is larger than std::numeric_limits<int32>::max() and static_cast otherwise.
And buffers[1] should be declared as a variable outside of the loop (for clarity mostly).
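A minimal sketch of what such a helper could look like, written as a guess at the reviewer's intent rather than as the code that ended up in the PR. The names cast_checked and narrow_days_in_place are hypothetical, the sketch assumes the day counts are signed 64-bit values (as with np.datetime64), and it checks the lower bound as well as the upper one, which goes slightly beyond what the comment asks for. The buffer pointer is taken once, outside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <stdexcept>

// Hypothetical helper: narrow a 64-bit day count to 32 bits,
// throwing instead of silently truncating when the value does not fit.
inline int32_t cast_checked(int64_t value) {
  if (value > std::numeric_limits<int32_t>::max() ||
      value < std::numeric_limits<int32_t>::min()) {
    throw std::overflow_error("day count does not fit in 32 bits");
  }
  return static_cast<int32_t>(value);
}

// Hypothetical loop: compact n 64-bit values into n 32-bit values at the
// front of the same buffer. Slot i is read in full before bytes
// [4*i, 4*i + 4) are written, so no unread input is overwritten.
inline void narrow_days_in_place(void *data, std::size_t n) {
  assert(data != nullptr || n == 0);
  auto *bytes = static_cast<unsigned char *>(data);  // hoisted out of the loop
  for (std::size_t i = 0; i < n; ++i) {
    int64_t wide;
    std::memcpy(&wide, bytes + i * 8, sizeof(wide));
    const int32_t narrow = cast_checked(wide);
    std::memcpy(bytes + i * 4, &narrow, sizeof(narrow));
  }
}
```

Going through memcpy into local variables avoids reading the same memory through two differently typed pointers, at the cost of being a little more verbose than the one-line suggestion.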
Interesting. Where is cast_checked defined?
It's not, it will have to be written. 😅 Reworded my comment.
How do we make sure we're not leaking the rest of the buffer?
The only difference is the data shifting. The TileDB-Py/tiledb/py_arrow_io_impl.h Line 571 in 687d549
dd67a3b
to
06f90d7
Compare
* Add in place buffer shift for TILEDB_DATETIME_DAY
* Add tests
The Arrow C data interface supports date32[days]. Let's use it as a conversion for TILEDB_DATETIME_DAY in Arrow.
The date32[days] type will be useful when Arrow is used, for example, to create a Pandas DataFrame.
Since all TileDB datetime values for attributes use the same representation as NumPy, np.datetime64, we have to find a way to transform this 64-bit representation into a 32-bit representation as expected by Arrow.
The contents of bufferinfo.data for a TILEDB_DATETIME_DAY attribute:
228 30 0 0 0 0 0 0
19 0 0 0 0 0 0 0
248 25 0 0 0 0 0 0
203 1 0 0 0 0 0 0
but what we would like to have in the 32bit representation, achieved by this PR, is:
228 30 0 0
19 0 0 0
248 25 0 0
203 1 0 0
Overflow seems impossible in practice given the ranges of days that both the 32-bit and 64-bit buffers can handle.
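The byte layouts above can be reproduced with a small standalone sketch. This is an illustration of the in-place shift, not the code in py_arrow_io_impl.h: the function names are hypothetical, and it uses std::memmove rather than std::memcpy because for the first element the source and destination ranges coincide. It performs no overflow check, and keeping the first 4 bytes of each value only preserves the number on a little-endian layout such as the one shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: move the first 4 bytes of each 8-byte value to the
// front of the buffer, so n 64-bit slots become n packed 32-bit slots.
inline void shift_days_in_place(uint8_t *data, std::size_t n) {
  assert(data != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    // memmove tolerates the i == 0 case, where source and destination match.
    std::memmove(data + i * 4, data + i * 8, 4);
  }
}

// Runs the shift on the four example values from the description
// (7908, 19, 6648 and 459 days, stored little-endian).
inline std::vector<uint8_t> example_bytes() {
  std::vector<uint8_t> buf = {
      228, 30, 0, 0, 0, 0, 0, 0,  //
      19,  0,  0, 0, 0, 0, 0, 0,  //
      248, 25, 0, 0, 0, 0, 0, 0,  //
      203, 1,  0, 0, 0, 0, 0, 0};
  shift_days_in_place(buf.data(), 4);
  buf.resize(4 * 4);  // only the first 16 bytes hold packed values now
  return buf;
}
```

After the shift, only the first 4 * n bytes are meaningful; the tail of the original buffer still holds stale data and should not be read as values.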
The initial issue:
We used to get the following error; now we do not.