feat(snapshot) Provide a formatter configuration when taking a snapshot. #28

fpacifici · 2021-03-11T05:40:12Z

Based on #27

When we take a snapshot of a Postgres table with cdc (to import it into ClickHouse), dates are formatted as 2019-06-16 06:21:39+00. Transforming these dates into ClickHouse dates (2019-06-16 06:21:39) is by far the biggest bottleneck when ingesting the snapshot into ClickHouse. The bottleneck hurts most on huge tables like groupedmessages_local.

We can largely avoid this overhead by formatting the snapshot in a way ClickHouse can ingest directly. Piping the resulting CSV straight into ClickHouse is about one order of magnitude faster.
At the same time, I would rather not tightly couple the cdc producer to ClickHouse.

So this PR adds more structure to the snapshot configuration so that we can configure a format for each column we query from Postgres. This puts the coupling in the configuration we pass to cdc each time we extract a snapshot, instead of in the cdc code itself.
The formatting logic is applied in the Postgres query.
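To make the idea of a per-column format concrete, here is a minimal sketch of what a structured column entry could look like and how it might be parsed. The field names ("name", "formatter", "type", "precision"), the ColumnSketch class, and the table name are all hypothetical and chosen for illustration; the actual shape is the one in test_config.py.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, Optional, Sequence


@dataclass(frozen=True)
class ColumnSketch:
    """Hypothetical stand-in for a column entry in the snapshot config."""

    name: str
    # None means "emit the column as Postgres returns it".
    formatter: Optional[Mapping[str, str]] = None


def parse_columns(raw: Sequence[Mapping[str, Any]]) -> List[ColumnSketch]:
    """Turn the raw config entries into typed column descriptions."""
    return [
        ColumnSketch(name=entry["name"], formatter=entry.get("formatter"))
        for entry in raw
    ]


# Hypothetical config: columns are objects rather than plain strings,
# so each one can optionally carry a formatter.
example_table = {
    "table": "example_table",
    "columns": [
        {"name": "id"},
        {"name": "last_seen", "formatter": {"type": "datetime", "precision": "second"}},
    ],
}

columns = parse_columns(example_table["columns"])
```

With this shape, the knowledge of what the destination expects lives in the config passed at snapshot time, and the producer only needs to know how to apply a formatter.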

fpacifici · 2021-03-11T05:40:52Z

cdc/__main__.py

@@ -129,7 +129,29 @@ def snapshot(ctx, snapshot_config):
                        "type": "object",
                        "properties": {
                            "table": {"type": "string"},
-                            "columns": {"type": "array", "items": {"type": "string"}},
+                            "columns": {


See test_config.py for an example of what the config now looks like.

fpacifici · 2021-03-11T05:43:16Z

cdc/snapshots/sources/postgres_snapshot.py

 from cdc.snapshots.destinations import SnapshotDestination
 from cdc.utils.logging import LoggerAdapter
 from cdc.utils.registry import Configuration

 logger = LoggerAdapter(logging.getLogger(__name__))

+


From here to line 78 (except for line 68), the code did not change; it was only reformatted when I saved the file.

fpacifici · 2021-03-11T16:48:16Z

cdc/snapshots/sources/postgres_snapshot.py

+        raise ValueError(f"Unknown formatter type {type(column.formatter)}")
+
+
+# We could add more mappings if needed, which is unlikely.


The additional complexity in mapping a config format to the Postgres format comes from the config being meant to be DB agnostic: this system has a concept of a source (the source DB) with multiple possible backend implementations, one of which is Postgres. This is the Postgres implementation.
Right now there is only one implementation, though.

Do you foresee a lot of different formats for postgres? Why not have a lookup on a specific format, "%Y-%m-%d%H:%M:%S": "YYYY-MM-DDHH24:MI:SS". Why do the regex substitution part by part?

I think you may be right; it is very unlikely we would change it. However, a config that looks like it accepts arbitrary format strings but in practice supports only one would be quite confusing. Since the regex substitution already works, I would rather not restrict it; that is less confusing than the alternative. What do you think?

This works fine so let's roll with it. It's not a blocker for me.
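The part-by-part substitution discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the code in the PR: the mapping table and the function name to_postgres_pattern are hypothetical, and only the six directives mentioned in the thread are covered. The Postgres side uses the standard to_char template patterns.

```python
import re

# strftime directive -> Postgres to_char template pattern.
# Only the directives mentioned in the thread; more could be added.
STRFTIME_TO_POSTGRES = {
    "%Y": "YYYY",
    "%m": "MM",
    "%d": "DD",
    "%H": "HH24",
    "%M": "MI",
    "%S": "SS",
}

_DIRECTIVE = re.compile(r"%[A-Za-z]")


def to_postgres_pattern(fmt: str) -> str:
    """Translate a strftime-style format into a to_char pattern, directive by directive."""

    def replace(match: "re.Match[str]") -> str:
        directive = match.group(0)
        if directive not in STRFTIME_TO_POSTGRES:
            # Fail loudly instead of passing an untranslated directive to Postgres.
            raise ValueError(f"Unsupported directive {directive}")
        return STRFTIME_TO_POSTGRES[directive]

    return _DIRECTIVE.sub(replace, fmt)


print(to_postgres_pattern("%Y-%m-%d %H:%M:%S"))  # YYYY-MM-DD HH24:MI:SS
```

The single-lookup alternative would replace the function body with one dictionary keyed on the whole format string. That is simpler, but it accepts exactly the formats listed, whereas the substitution accepts any combination of the supported directives. Note that this sketch does not escape literal text that to_char might itself interpret as a pattern.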

evanh · 2021-03-11T20:36:01Z

cdc/snapshots/sources/postgres_snapshot.py

+        raise ValueError(f"Unknown formatter type {type(column.formatter)}")
+
+
+# We could add more mappings if needed, which is unlikely.


Do you foresee a lot of different formats for postgres? Why not have a lookup on a specific format, "%Y-%m-%d%H:%M:%S": "YYYY-MM-DDHH24:MI:SS". Why do the regex substitution part by part?

fpacifici · 2021-03-14T02:34:56Z

I changed the way the date is formatted. Instead of using to_char, this now uses DATE_TRUNC('second', column)::timestamp, which makes taking the snapshot 50% faster.

I can take a snapshot of 8M rows in 1 minute both without formatting and with this formatter, while it takes 90 seconds with to_char.
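For readers less familiar with the expression, the following sketch mimics in plain Python what DATE_TRUNC('second', column)::timestamp produces. It assumes the column is a timestamp with time zone and the session time zone is UTC (consistent with the +00 suffix in the PR description); the function name clickhouse_style is hypothetical and nothing here runs against Postgres.

```python
from datetime import datetime, timezone


def clickhouse_style(value: datetime) -> str:
    """Emulate DATE_TRUNC('second', col)::timestamp for a UTC session.

    Truncation drops the sub-second part; the cast to timestamp drops
    the zone offset, leaving a plain 'YYYY-MM-DD HH:MM:SS' value.
    """
    as_utc = value.astimezone(timezone.utc)
    truncated = as_utc.replace(microsecond=0, tzinfo=None)
    return truncated.isoformat(sep=" ")


sample = datetime(2019, 6, 16, 6, 21, 39, 123456, tzinfo=timezone.utc)
print(clickhouse_style(sample))  # 2019-06-16 06:21:39
```

The point of doing this inside the query is that Postgres emits the value already in the shape the destination wants, so no per-row string conversion is needed afterwards.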

evanh · 2021-03-15T15:04:13Z

cdc/snapshots/snapshot_types.py


 class FormatterConfig(ABC):
    """
    Parent class to all the the formatter configs.
    """

+    @abstractmethod
+    def to_dict(self) -> Mapping[str, str]:


It seems a little odd not to have this as a Mapping[str, Any] to line up with the other to_dict functions. I understand that you can be more specific with this function than with the other ones. I'm not saying you should change it, but it is something to think about.

evanh · 2021-03-15T15:07:46Z

cdc/snapshots/sources/postgres_snapshot.py

+
+
+def format_datetime(col_name: str, formatter: DateTimeFormatterConfig) -> sql.SQL:
+    return sql.SQL("DATE_TRUNC({precision} ,{column})::timestamp AS {alias}").format(


Suggested change

-    return sql.SQL("DATE_TRUNC({precision} ,{column})::timestamp AS {alias}").format(
+    return sql.SQL("DATE_TRUNC({precision}, {column})::timestamp AS {alias}").format(

Very much a nit

Right, this is weird. Will fix.

lynnagara · 2021-03-15T18:06:36Z

cdc/snapshots/destinations/file_snapshot.py

+        def safe_dump_default(value: Any) -> Any:
+            if isinstance(value, Enum):
+                return value.value
+            else:
+                raise TypeError
+


Isn't this handled by the from_dict() methods now?

Right, I forgot to remove this. Thanks for noticing.

fpacifici added 8 commits March 9, 2021 18:20

Change the type of the message sent to the stream

9d848f3

Fix tests

80ba000

Add tests for serialization

558c9e6

Improve testing

0b87648

Add format to config management

30bec0f

Add test for config loader

8b552d2

Format the csv according to config

cad1d06

Add comments

8d80bf4

fpacifici requested a review from a team March 11, 2021 05:40

fpacifici commented Mar 11, 2021

evanh approved these changes Mar 11, 2021

Use DATE_TRUC

227877e

fpacifici mentioned this pull request Mar 14, 2021

feat(snapshot) Compress table files after taking the snapshot #29

Merged

Fix serialization of JSON

96090c2

evanh approved these changes Mar 15, 2021

evanh reviewed Mar 15, 2021

lynnagara reviewed Mar 15, 2021

fpacifici changed the base branch from fix/message_type to master March 15, 2021 18:42

fpacifici added 2 commits March 15, 2021 11:43

Merge branch 'master' into feat/support_formatting

48d0b3e

Review comments

447f8c8

fpacifici merged commit a4be8c8 into master Mar 15, 2021
