Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres: Add support for partitioned tables #2465

Open
mingfang opened this issue Apr 17, 2024 · 6 comments
Open

Postgres: Add support for partitioned tables #2465

mingfang opened this issue Apr 17, 2024 · 6 comments
Assignees
Labels
Engine: Postgres Improvement Improves existing functionality

Comments

@mingfang
Copy link

mingfang commented Apr 17, 2024

Postgres natively supports partitioned tables

https://www.postgresql.org/docs/current/ddl-partitioning.html

@izeigerman izeigerman added Improvement Improves existing functionality Engine: Postgres labels Apr 23, 2024
@jrhorne
Copy link

jrhorne commented Sep 20, 2024

How large of a lift would it be to support partitioned tables? Is that something I could contribute to? I'm guessing some involved logic in the create stage, given you have to declare the partitions manually in PostgreSQL.

I have an incremental_by_time model with 1.7B rows, and maintaining the (necessary) indexes has caused things to slow dramatically. Partitioning by month, so at least the insert by day only has to reference one month's worth of data, would likely help tremendously.

@izeigerman

@erindru
Copy link
Collaborator

erindru commented Sep 22, 2024

It would be a new concept for SQLMesh, right now its concept of partitioning is "tell the underlying engine what columns to partition on and let it handle it transparently".

This would be active manual partitioning where SQLMesh itself has to create, drop and keep track of partitions. On the plus side, adding this for Postgres would also benefit Hive/Athena which has the same problem as Postgres

@mingfang
Copy link
Author

Postgres supports Declarative Partitioning, as described in the link above, 5.11.2. Declarative Partitioning.

@erindru
Copy link
Collaborator

erindru commented Sep 23, 2024

Indeed it does, and even with Declarative Partitioning you still have to manually create the partitions (you just don't have to manually attach and detach them).

For example:

create table foo (id int, day date) partition by range (day);

insert into foo (id, day) values (1, '2023-01-01'::date);
--FAIL: ERROR: no partition of relation "foo" found for row.  Detail: Partition key of the failing row contains (day) = (2023-01-01).

create table foo_2023 partition of foo for values from ('2023-01-01') to ('2024-01-01');

insert into foo (id, day) values (1, '2023-01-01'::date); --succeeds

insert into foo (id, day) values (1, '2024-01-01'::date);
--FAIL: ERROR: no partition of relation "foo" found for row  Detail: Partition key of the failing row contains (day) = (2024-01-01).

Today, SQLMesh assumes that the DB will automatically create partitions when you start trying to insert data, which isnt the case for Postgres. So we would have to track partitions manually and issue the correct commands to create + drop them so that we can successfully insert data into the table.

@erindru erindru self-assigned this Sep 23, 2024
@jrhorne
Copy link

jrhorne commented Sep 25, 2024

Would it simplify things to tie back to snapshot intervals? So forcing partitions to align (but contain possibly multiple of) with the interval unit?

Given the way SQLMesh also creates all the physical tables, then evaluates models, making it inherit from the parent table with pg_partman would be possible, right?

Then as snapshots are evaluated, it first checks if the physical table has a partition available for the interval.

I’m not sure if incremental by time and range based partitioning have to go together in SQLMesh, but to me, it seems simpler to implement if both are used in conjunction.

@erindru
Copy link
Collaborator

erindru commented Sep 25, 2024

Yes, for incremental models partitioned by time, the plan is to use RANGE-based partitioning. I'm debating whether to make it multiples of the interval unit specifically or copy what Iceberg did which is nice and simple (make year(time_col), month(time_col), day(time_col) or hour(time_col) the available options).

I'm also looking at how we can add LIST and HASH based partitioning because this lines up with partitioning strategies of other engines (identity and bucket in Iceberg).

I'm not planning to add a dependency on pg_partman at this stage. The SQLMesh Postgres adapter will be responsible for ensuring partitions exist on the target table prior to inserting data and should work out of the box on any Postgres instance without special configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Engine: Postgres Improvement Improves existing functionality
Projects
None yet
Development

No branches or pull requests

4 participants