
Feat: Athena adapter #3154

Open · erindru wants to merge 2 commits into main from erin/athena-adapter
Conversation

erindru (Collaborator) commented Sep 19, 2024

Initial implementation of an Athena adapter. Addresses #1315

@georgesittas (Contributor) commented:

Released v25.22.0, should be able to upgrade and get the CI working now.

branches:
  only:
    - main
    #- snowflake
erindru (Collaborator, Author):

Note: I'll uncomment these immediately prior to merging. It's helpful to be able to run the Athena integration tests for this PR.

erindru force-pushed the erin/athena-adapter branch 5 times, most recently from 88e2e48 to 3110b64, on September 19, 2024 at 23:15.
# SQLMesh options
s3_warehouse_location: t.Optional[str] = None
concurrent_tasks: int = 4
register_comments: bool = False  # because Athena doesn't support comments in most cases
Member:

Should we even let users set true here then?

erindru (Collaborator, Author):

How do we force it to false and make it unsettable? Is that the register_comments: t.Literal[False] = False syntax?

Member:

Yes, exactly
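
For illustration, pinning the field with t.Literal[False] looks roughly like this in a pydantic-style config (the class name here is illustrative, not necessarily the actual SQLMesh config class):

```python
import typing as t

from pydantic import BaseModel, ValidationError


class AthenaConnectionConfig(BaseModel):
    # Literal[False] means the only accepted value is False, so users cannot enable it.
    register_comments: t.Literal[False] = False


AthenaConnectionConfig()  # fine, register_comments defaults to False
try:
    AthenaConnectionConfig(register_comments=True)
except ValidationError as e:
    print(e)  # rejected: True is not a permitted value
```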



def _ensure_valid_location(value: str) -> str:
    if not value.startswith("s3://"):
Member:

Shouldn't this be validated in the connection config instead?

erindru (Collaborator, Author):

It's also used in _table_location to ensure that the final location is valid, since that is assembled at runtime and relies on user input from physical_properties.

Member:

I see. I'd still add validation to the config to fail faster.
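
A minimal sketch of what that could look like, reusing the same check as a validator on the connection config so bad values fail at config-load time (pydantic v2 style; class and field names mirror the snippets above but are illustrative):

```python
import typing as t

from pydantic import BaseModel, field_validator


def _ensure_valid_location(value: str) -> str:
    if not value.startswith("s3://"):
        raise ValueError(f"Location '{value}' must be an s3:// URI")
    return value


class AthenaConnectionConfig(BaseModel):
    s3_warehouse_location: t.Optional[str] = None

    @field_validator("s3_warehouse_location")
    @classmethod
    def _validate_location(cls, v: t.Optional[str]) -> t.Optional[str]:
        # Fail fast when the config is parsed rather than later at runtime.
        return _ensure_valid_location(v) if v is not None else v
```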

exp.select(
    exp.case()
    .when(
        # 'awsdatacatalog' is the default catalog that is invisible for all intents and purposes
Member:

This doesn't quite explain why we set catalog to NULL if the actual value is awsdatacatalog. What does "invisible" actually mean here?

erindru (Collaborator, Author):

It's because the integration test test_get_data_objects expects that if it passes a schema like test_schema_x (as opposed to a catalog-schema combo like test_catalog.test_schema_x) to get_data_objects(), the resulting data objects should have None set on the catalog property.

I'll amend the comment
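
In other words, the query maps the implicit default catalog to NULL so schema-only lookups get catalog=None back. Roughly, as an illustrative sqlglot sketch (not the PR's exact expression):

```python
from sqlglot import exp

catalog_case = (
    exp.case()
    .when(exp.column("table_catalog", table="t").eq("awsdatacatalog"), exp.null())
    .else_(exp.column("table_catalog", table="t"))
)
print(catalog_case.sql())
# CASE WHEN t.table_catalog = 'awsdatacatalog' THEN NULL ELSE t.table_catalog END
```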

)
.from_(exp.to_table("information_schema.tables", alias="t"))
.where(
    exp.and_(
Member:

AND seems redundant here
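
For context, sqlglot's .where() already combines multiple conditions with AND, which is why the explicit exp.and_() wrapper adds nothing. A small standalone illustration (not the PR's actual query):

```python
from sqlglot import exp

query = (
    exp.select("t.table_name")
    .from_("information_schema.tables AS t")
    .where(
        exp.column("table_schema", table="t").eq("my_schema"),
        exp.column("table_type", table="t").eq("BASE TABLE"),
    )
)
print(query.sql())
# SELECT t.table_name FROM information_schema.tables AS t
# WHERE t.table_schema = 'my_schema' AND t.table_type = 'BASE TABLE'
```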


elif self.s3_warehouse_location:
    # If the user has set `s3_warehouse_location` in the connection config, the base URI is <s3_warehouse_location>/<catalog>/<schema>/
    catalog_name = table.catalog if hasattr(table, "catalog") else None
Member:

Looking at the signatures, we always expect table to be of type exp.Table. How could the catalog and db attributes be missing, given that they are property methods on that class?

erindru (Collaborator, Author):

Good catch, this didn't get updated when I tightened up the signature.

# Assume the user has set a default location for this schema in the metastore
return None

table_name = table.name if hasattr(table, "name") else None
Member:

ditto
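
To spell out why the hasattr() checks are redundant: catalog, db and name are properties on sqlglot's exp.Table, so they always exist and return an empty string when that part of the name is absent:

```python
from sqlglot import exp

table = exp.to_table("my_schema.my_table")
print(repr(table.catalog))  # '' - no catalog part in the name
print(repr(table.db))       # 'my_schema'
print(repr(table.name))     # 'my_table'
```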


is_hive = self._table_type(table_properties) == "hive"

# Filter any PARTITIONED BY properties from the main column list since they can't be specified in both places
Member:

good old hive
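
The filtering the comment describes amounts to dropping any PARTITIONED BY columns from the main column list before generating the Hive DDL. A hypothetical sketch (variable names are illustrative, not the PR's code):

```python
from sqlglot import exp

# Hypothetical inputs: the full column list and the columns declared in PARTITIONED BY.
columns_to_types = {"id": exp.DataType.build("int"), "ds": exp.DataType.build("varchar")}
partitioned_by = [exp.column("ds")]

partitioned_by_names = {col.name for col in partitioned_by}
main_columns = {k: v for k, v in columns_to_types.items() if k not in partitioned_by_names}
print(list(main_columns))  # ['id'] - 'ds' appears only in PARTITIONED BY
```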

Use the user-specified table_properties to figure out if this is a Hive or an Iceberg table
"""
# if table_type is not defined or is not set to "iceberg", this is a Hive table
if table_properties and (table_type := table_properties.get("table_type", None)):
Member:

Why are we not using storage_format for this instead?

izeigerman (Member) commented Sep 20, 2024:

Basically any value that is not iceberg should be treated as hive.

erindru (Collaborator, Author) commented Sep 20, 2024:

I thought about storage_format but decided not to use it because it describes a different concept. Both Hive and Iceberg tables support different storage formats.

For example, a Hive table can be STORED AS PARQUET or STORED AS ORC, or if you really don't like your colleagues, STORED AS TEXTFILE.

Same for Iceberg, the internal format can be set to parquet or orc or whatever the engine supports.

So storage_format=hive / storage_format=iceberg doesn't make sense because they're table formats that can encompass a particular storage format.

We don't have a top-level table_format property and I didn't want to add one just for Athena.

Member:

FWIW, when used with Spark, iceberg is provided through storage_format because the SparkSQL syntax looks like:

CREATE TABLE ... USING [iceberg|parquet|etc]

Should we be consistent?
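
Either way, the dispatch being discussed is: anything other than an explicit table_type of iceberg is treated as Hive. A minimal sketch of that logic (illustrative, not the PR's exact code):

```python
import typing as t


def _table_type(table_properties: t.Optional[t.Dict[str, t.Any]] = None) -> str:
    if table_properties and (table_type := table_properties.get("table_type", None)):
        if str(table_type).lower() == "iceberg":
            return "iceberg"
    return "hive"


print(_table_type())                           # 'hive'
print(_table_type({"table_type": "iceberg"}))  # 'iceberg'
print(_table_type({"table_type": "hive"}))     # 'hive'
```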

# To make a CTAS expression persist as iceberg, alongside setting `table_type=iceberg` (which the user has already
# supplied in physical_properties and is thus set above), you also need to set:
# - is_external=false
# - table_location='s3://<path>'
Member:

Should we ensure somehow that the location has already been set at this point?

erindru (Collaborator, Author):

Oh nice catch!

erindru (Collaborator, Author) commented Sep 20, 2024:

Actually, the location will be set already a few lines up if the user supplied it (or s3_warehouse_location was set in the config). The original idea was that if it wasn't set at all, Athena could figure out what to do.

But I've just done some tests and, unlike Trino, it looks like Athena will not automatically generate table locations for you even if the schema the table is in was created with a location set.

I created a schema using CREATE SCHEMA foo LOCATION 's3://path' and then tried to create both Hive and Iceberg tables in that schema without setting a location explicitly. Both times it failed with an error asking to set the location.

So I'll tighten this up and throw an error if SQLMesh can't figure out the table location.
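
A minimal sketch of that tightened behaviour, assuming a helper that raises when no location can be determined (the error type and message wording here are illustrative, not the PR's code):

```python
import typing as t

from sqlglot import exp


class SQLMeshError(Exception):
    """Stand-in for SQLMesh's error type, used here to keep the sketch self-contained."""


def _require_table_location(location: t.Optional[str], table: exp.Table) -> str:
    if not location:
        raise SQLMeshError(
            f"Cannot determine an S3 location for table '{table.sql()}'. "
            "Set s3_warehouse_location in the connection config or supply a location "
            "in the model's physical_properties."
        )
    return location
```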
