Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add LogicalScalar #14617

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from
Draft

WIP: Add LogicalScalar #14617

wants to merge 9 commits into from

Conversation

tobixdev
Copy link
Contributor

Which issue does this PR close?

This change is related to #12622.

While the PR is not yet complete, I won't be able to work on it for at least a few days. Therefore, I'd like to gather some feedback on the current direction and spark some further discussions in #12622.

Rationale for this change

In #12622, we discussed how can decouple ScalarValue from the physical DataType.
While there is already work on the logical-types branch, this PR explores a different approach that may be easier to integrate into the current main branch.

This PR introduces the enum LogicalScalar - a scalar value that is decoupled from the physical arrow DataType.
By doing this, ScalarValue can remain tightly integrated with DataType and no existing code breaks while still allowing us to use LogicalScalar in situations where a coupling to DataType is unwanted.

Some design considerations of LogicalScalar:

  • Mirrors variants from NativeType. There is currently no support for extension types. However, adding a LogicalScalar::Extension variant should be possible.
  • Non-trivial values (e.g., Timestamp have a corresponding LogicalTimestamp concept). For some variants this prevents creating invalid values. Furthermore, it allows us to attach logic to these values (e.g., transforming LogicalTimestamp to chrono::NaiveDateTime).
  • I tried to translate most of the equality/printing/ordering behavior from ScalarValue. However, some fixes may still be necessary.

LogicalScalar is supposed to be used in at least two ways:

  • Providing a logical view on a physical ScalarValue (LogicalScalar::from(scalar_value)). The resulting LogicalScalar can be more ergonomic to work with. To name a few "benefits": no working with arrow arrays, no dictionaries, no multiple encodings of Utf8 strings and hopefully more. I have adapted the code in datafusion/sql/src/unparser/expr.rs to demonstrate such a use case.
  • Migration path for code using ScalarValue: As we want to decouple logical from physical types, the new LogicalScalar can provide a vehicle for replacing ScalarValue in LogicalPlan etc.

The goal of this approach in the foreseeable future is to support the migration of LogicalPlan to logical types.
To do this, we must replace the variant Expr::Literal(ScalarValue) with Expr::Literal(LogicalScalar).
However, as this breaks quite a few things, doing this will be done in other PRs (I think #14609 is exploring this impact)

What changes are included in this PR?

  • Adds LogicalScalar enum and the contained structs and enums.
  • Changes to datafusion/sql/src/unparser/expr.rs to demonstrate the use of LogicalScalar.

Are these changes tested?

Some tests via the changes in datafusion/sql/src/unparser/expr.rs.
I would like to improve upon that.

Are there any user-facing changes?

Yes Public APIs (LogicalScalar) have been added.
Documentation on LogicalScalar is still incomplete (see Open Issues).

Open Issues

  • Currently, transforming a physical scalar into a logical scalar does not return a Result<LogicalScalar>. I think, in most situations, extracting a logical value should succeed. However, there may be some special cases that I did not consider and there are quite some expect calls in scalar/logical/from_scalar_value.rs. Basically this boils down to ensuring that we can always extract a ScalarValue from the arrays contained in, for example, ScalarValue::List. What is your opinion on that?
  • Breaking change in unparser: Is DATETIME equivalent to TIMESTAMP is this context? See changes in the test suite.
  • Improve Documentation
  • Add tests for converting ScalarValue to LogicalScalar
  • I am a bit unsure about the value in LogicalDecimal. I think it makes sense to use a dedicated decimal library here for working with decimal values and only convert to the arrow decimals once we need them. Any opinions?
  • Polish the implementation

cc @jayzhan211

@github-actions github-actions bot added sql SQL Planner common Related to common crate labels Feb 11, 2025
@tobixdev tobixdev marked this pull request as draft February 11, 2025 20:52
@@ -1831,15 +1625,15 @@ mod tests {
),
(
Expr::Literal(ScalarValue::Date64(Some(0))),
r#"CAST('1970-01-01 00:00:00' AS DATETIME)"#,
r#"CAST('1970-01-01 00:00:00' AS TIMESTAMP)"#,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still an open issue and we need to check whether this is equivalent.

Maybe I also misunderstood Date64 (see LogicalScalar::from)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dateTime and timestamp is not equivalent

@@ -47,3 +50,53 @@ singleton!(LOGICAL_FLOAT64, logical_float64, Float64);
singleton!(LOGICAL_DATE, logical_date, Date);
singleton!(LOGICAL_BINARY, logical_binary, Binary);
singleton!(LOGICAL_STRING, logical_string, String);

pub fn logical_timestamp(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some interning or similar could be beneficial here to share Arcs between invocations.

@jayzhan211
Copy link
Contributor

However, as this breaks quite a few things, doing this will be done in other PRs (I think #14609 is exploring this impact)

I hope we have a complete PR before merging LogicalScalar, maybe we can merge it into a branch

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 12, 2025

For decimal, I think it should be the same as ScalarValue given we need to differentiate the precision of it

    Decimal128(Option<i128>, u8, i8),
    Decimal256(Option<i256>, u8, i8),

I would be great if we don't have Result or unwrap for ScalarValue to LogicalScalar, if we can't avoid unwrap, we need Result

/// TODO logical-types
#[derive(Debug, Clone, Eq, Hash, PartialEq, PartialOrd, Ord)]
pub struct LogicalDate {
value: i32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need LogicalScalar::Date32(i32) and LogicalScalar::Date64(i64) 🤔

/// A null value
Null,
/// Stores a scalar for [`NativeType::Boolean`].
Boolean(bool),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need Option for LogicalScalar too 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we a 100% sure on that? I think this is one of the things I don't like when working with ScalarValue. Can't we just use LogicalScalar::Null and use the schema information when converting to ScalarValue?

#[derive(Clone, PartialEq, Eq, Hash, Debug, PartialOrd)]
pub struct LogicalFixedSizeList {
/// The inner list with a fixed size.
inner: LogicalList,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might need len: i32

impl FixedSizeListArray {
    /// Create a new [`FixedSizeListArray`] with `size` element size, panicking on failure
    ///
    /// # Panics
    ///
    /// Panics if [`Self::try_new`] returns an error
    pub fn new(field: FieldRef, size: i32, values: ArrayRef, nulls: Option<NullBuffer>) -> Self {
        Self::try_new(field, size, values, nulls).unwrap()
    }


/// Returns the value of this timestamp as [DateTime] or [NaiveDateTime] depending on whether
/// there is a time zone.
pub fn value(&self) -> Result<LogicalTimestampValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need value for LogicalScalar, we don't have similar functionality for ScalarValue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to obtain a value that you would also use in non-arrow rust for representing this concept (e.g. types from chrono) while still storeing the "arrow-like" values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants