Skip to content
This repository has been archived by the owner on May 6, 2024. It is now read-only.

feat(snapshot) Provide a formatter configuration when taking a snapshot. #28

Merged
merged 12 commits into from
Mar 15, 2021

Conversation

fpacifici
Copy link
Contributor

Based on #27

When we take a snapshot of a postgres table with cdc (to import it into clickhouse) dates are formatted as 2019-06-16 06:21:39+00. The process of transforming these dates into clickhouse dates (2019-06-16 06:21:39) is by far the bottleneck when ingesting the snapshot into Clickhouse. This bottleneck is specifically impactful on tables like groupedmessages_local that are huge.

We can largely avoid this overhead by just formatting the snapshot in a way that is good for clickhouse. Just piping the CSV into clickhouse is one order of magnitude faster.
At the same time I would not like to tightly couple the cdc producer with clickhouse.

So this PR adds more structure to the snapshot configuration so that we can configure a format for each column we query from postgres. This puts the coupling in the configuration we pass to cdc each time we extract a snapshot, instead of being in the cdc code itself.
THe formatting logic is applied in the postgres query.

@fpacifici fpacifici requested a review from a team March 11, 2021 05:40
@@ -129,7 +129,29 @@ def snapshot(ctx, snapshot_config):
"type": "object",
"properties": {
"table": {"type": "string"},
"columns": {"type": "array", "items": {"type": "string"}},
"columns": {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See test_config.py for an example of how how the config now looks like.

from cdc.snapshots.destinations import SnapshotDestination
from cdc.utils.logging import LoggerAdapter
from cdc.utils.registry import Configuration

logger = LoggerAdapter(logging.getLogger(__name__))


Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here to line 78 (except for line 68) the code did not change, it was only formatted when I saved it.

raise ValueError(f"Unknown formatter type {type(column.formatter)}")


# We could add more mappings if needed, which is unlikely.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional complexity in mapping a config format to the postgres format, is that the config is supposed to be DB agnostic (as this system has a concept of source, which is the source DB, and multiple backend implementations one of which is postgres). This is the postgres implementation.
Right now there is one implementation only though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you foresee a lot of different formats for postgres? Why not have a lookup on a specific format, "%Y-%m-%d%H:%M:%S": "YYYY-MM-DDHH24:MI:SS". Why do the regex substitution part by part?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may be right. It is very unlikely we would change it. Though, having a config that allows for a regex like system just on paper, would be quite confusing. Since the regex system already works, I would not restrict it, this is less confusing than the other option. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works fine so let's roll with it. It's not a blocker for me.

raise ValueError(f"Unknown formatter type {type(column.formatter)}")


# We could add more mappings if needed, which is unlikely.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you foresee a lot of different formats for postgres? Why not have a lookup on a specific format, "%Y-%m-%d%H:%M:%S": "YYYY-MM-DDHH24:MI:SS". Why do the regex substitution part by part?

@fpacifici
Copy link
Contributor Author

I changed the way to format the date. Instead of using to_char, this uses DATE_TRUNC('second', column)::timestamp which is 50% faster in taking the snapshot.

I can take a snapshot of 8M rows in 1 minute without format and with this formatter, while it takes 90 seconds with to_char


class FormatterConfig(ABC):
"""
Parent class to all the the formatter configs.
"""

@abstractmethod
def to_dict(self) -> Mapping[str, str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little odd to not have this as a Mapping[str, Any] to line up with the other to_dict functions. I understand that you can be more specific with this function than with the other ones. I'm not saying you should change it, but something to think about.



def format_datetime(col_name: str, formatter: DateTimeFormatterConfig) -> sql.SQL:
return sql.SQL("DATE_TRUNC({precision} ,{column})::timestamp AS {alias}").format(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return sql.SQL("DATE_TRUNC({precision} ,{column})::timestamp AS {alias}").format(
return sql.SQL("DATE_TRUNC({precision}, {column})::timestamp AS {alias}").format(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much a nit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. this is weird. Will fix.

Comment on lines 82 to 87
def safe_dump_default(value: Any) -> Any:
if isinstance(value, Enum):
return value.value
else:
raise TypeError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this handled by the from_dict() methods now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I forgot to remove this. Thanks for noticing.

@fpacifici fpacifici changed the base branch from fix/message_type to master March 15, 2021 18:42
@fpacifici fpacifici merged commit a4be8c8 into master Mar 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants