Update to collector main branch #1

Open · wants to merge 88 commits into main

Conversation

jwieringa (Member)

No description provided.

msakrejda and others added 30 commits September 23, 2024 15:13
Collect system info and metrics for Azure Flexible Server and Cosmos DB for PostgreSQL.
To start using this, the customer needs to supply a new config variable AZURE_SUBSCRIPTION_ID to the collector, as well as set up the managed identity. When AZURE_SUBSCRIPTION_ID is not supplied, the collector doesn't error out; it simply won't collect the system info.
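
A minimal sketch of the optional-config behavior described above (the config struct and field names are illustrative, not the collector's actual types):

```go
package azure

import "log"

// Config holds the relevant collector settings (hypothetical struct).
type Config struct {
	AzureSubscriptionID string // populated from AZURE_SUBSCRIPTION_ID
}

// CollectSystemInfo skips Azure system info collection when no
// subscription ID is configured, rather than treating it as an error.
func CollectSystemInfo(cfg Config) {
	if cfg.AzureSubscriptionID == "" {
		log.Println("AZURE_SUBSCRIPTION_ID not set; skipping Azure system info collection")
		return
	}
	// ... query Azure APIs using the managed identity ...
}
```
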
…er with text format (#586)

* Add auto_explain parameters support for JSON format
* Support both new lines and Query Parameters with text format auto_explain better
* Make sure not to include query params with explain JSON with normalized filter
In #494, I missed updating the --analyze-logfile flag. The flag is
intended to analyze a logfile "offline", which does not work with the
new mechanism, since that relies on being able to read the server's
log_timezone and log_line_prefix.

This adds additional flags to pass that information to the
--analyze-logfile path. The timezone defaults to UTC; the prefix is
required.
In passing, this also renames max_dead_tuples and num_dead_tuples
to max_dead_item_ids/num_dead_item_ids, to track the upstream
clarification on the naming.
Previously the collector would emit system catalogs that only exist
on EPAS, causing unnecessary information to show in the pganalyze dashboard.

Additionally, EPAS allows function signatures to be re-used between functions
and procedures (presumably this was allowed before upstream Postgres added
CREATE PROCEDURE), which causes a server-side error with pganalyze. Since this
is an edge case and there are no plans to support this upstream, ignore
duplicate functions/procedures that have the same signature.
In some cases, it's possible for the user, database, or other free-form parts of
the log line prefix to match too much of the input string, causing the LogLevel
to unintentionally match part of the content.

E.g., with CloudSQL the vacuum analysis logs are output as

> 2024-10-04 06:38:28.808 UTC [2569017]: [1-1] db=,user= LOG: automatic vacuum of table \"some_database.some_schema.some_table\": index scans: ...

This is matched as user " LOG: automatic vacuum of table \"some_database.some_schema.some_table\": index"
and log level "scans". The bogus log level is then mapped to 0, and the line is not
handled correctly.

To avoid this, change our regexp to only look for the log levels (and additional
context strings that appear in place of log level on detail lines) we recognize.
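
A minimal illustration of the restricted matching in Go (the level list and prefix handling here are simplified, not the collector's actual regexp):

```go
package main

import (
	"fmt"
	"regexp"
)

// Only match known log levels (and detail-line context keywords),
// instead of a free-form capture group.
var levelRe = regexp.MustCompile(`\b(LOG|ERROR|FATAL|PANIC|WARNING|NOTICE|INFO|DEBUG|DETAIL|HINT|CONTEXT|STATEMENT|QUERY):\s`)

func main() {
	line := `2024-10-04 06:38:28.808 UTC [2569017]: [1-1] db=,user= LOG: automatic vacuum of table "some_database.some_schema.some_table": index scans: ...`
	if m := levelRe.FindStringSubmatch(line); m != nil {
		fmt.Println("log level:", m[1]) // "LOG", rather than a bogus "scans" level
	}
}
```
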
RPM base packages have changed and no longer contain an init system by
default. Since we mainly care about having a running process to issue
exec commands against, we can use a sleep here instead to avoid the
failure.
This asks for confirmation, which we don't want in the automated flow.
* Bump protoc version to 28.2
* Update CI/release protoc version to 28.2

Also update the setup-protoc version to 3. Note that we still use the
forked version to reduce a security risk:
#308
…616)

* Track cluster identifier (cluster ID) as part of system information

This is intended to help identify servers that belong together, to allow
the pganalyze app to group them in an overview, offer data in combined
views, or allow Index Advisor to avoid considering indexes unused if
they are used on a replica.

To start with, this identifies the cluster ID for the following providers:
- AWS: DBClusterIdentifier (user-assigned cluster name)
- Azure: Source Server ID (replication parent server ID + resource group)
- Crunchy Bridge: Cluster Parent ID (replication parent server ID)
- Self hosted: pg_controldata system identifier (unique ID shared by all physical replicas)

There are providers (e.g. GCP) and edge cases (e.g. when the helper doesn't
run for self-hosted servers) where this doesn't work. Handling those cases,
and allowing manual overrides of the cluster ID, is left for a later follow-up.
lfittl and others added 30 commits January 3, 2025 23:06
Fixes CVE-2024-45338, a DoS-style vulnerability that's not a
practical problem, but may be flagged by dependency scanners.
Some time ago (at least nine years) we introduced a maximum connection
lifetime of 30 seconds via the SetConnMaxLifetime database/sql function.

This was likely intended as an extra safety measure in case the collector
didn't clean up its DB handle correctly. We've had a few minor issues
with incorrect cleanups over the years (though usually not with the DB
connection), but at this point our connection cleanup should be reliable.

The connection lifetime causes an issue: it makes Go reconnect to the
database after 30 seconds have elapsed, which effectively removes any
connection-local settings, in particular a statement_timeout.

In practice this meant that queries may unintentionally not have been
subject to the statement_timeout, if the same connection had previously
taken 30 seconds to do its work. To resolve this, remove the max lifetime.
This executes on-demand query runs for the Query Tuning feature via
a helper function, which prevents reading the table data directly and
avoids granting the collector user unnecessary permissions.

This is now the only way that query runs can execute EXPLAIN ANALYZE,
direct execution without the helper is intentionally not supported.

In passing, expand the protobufs to pass down parameters, parameter
types, and optional planner settings, and refactor the query run code to
handle work requiring a DB connection in a separate Go function.

Further, this adds tests that utilize a new TEST_DATABASE_URL environment
setting, which is used to run unit tests that require a database
connection (previously only the integration tests used the database).
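
A hypothetical sketch of invoking such a helper from the collector; the signature of pganalyze.explain_analyze shown here is illustrative, not the actual definition:

```go
package main

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// runExplainAnalyze invokes the server-side helper instead of executing
// EXPLAIN ANALYZE directly (hypothetical signature, for illustration).
func runExplainAnalyze(ctx context.Context, db *sql.DB, query string, params []string) (string, error) {
	var plan string
	err := db.QueryRowContext(ctx,
		"SELECT pganalyze.explain_analyze($1, $2::text[])",
		query, pq.Array(params),
	).Scan(&plan)
	return plan, err
}
```
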
Like "--generate-stats-helper-sql", this allows generating a SQL script
that installs the pganalyze.explain_analyze helper into all databases on
a server, as specified by the server section name passed as an argument.

Additionally, the "--generate-helper-explain-analyze-role" option allows
setting the role that the helper function should be owned by as a value,
defaulting to "pganalyze_explain".

Note that the role creation, as well as any GRANT statements for the role,
must be handled separately.
#660)

This acts as an additional defense in case the user does not revoke
permissions from non-superuser for dblink as we recommend.
In some edge cases, for example unknown CONTEXT patterns, we may not
detect the contents of a secondary line. If filtering of unidentified
content is enabled, such lines need to be redacted.
This can occur in certain cases, for example auto_explain output
on some Postgres versions. We previously would not have detected
this CONTEXT line, but are now correctly detecting it as containing
statement_parameter log secrets.

In passing add a test case for the new "Query Parameters" field in
auto_explain which gets redacted based on the statement_text filter that
filters out the whole plan text (we normalize the individual fields in
resulting query samples, but the log text for auto_explain is redacted
altogether).
Because the buffercache logic initializes its own database connection (so
it can run in parallel to other collection tasks), it also needs to ensure
it closes its DB handle so the connection is released again.

In passing, avoid the double assignment of "db" in the later parts of the
function.
This makes it clear that we must clean up the database connection that
was established, since we leak connections otherwise.

To support this, refactor the log-based EXPLAIN code so defer can be
used easily.
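
A minimal sketch of the defer pattern described in the two changes above (connection handling is illustrative): the handle is always released, on every return path.

```go
package main

import (
	"database/sql"

	_ "github.com/lib/pq"
)

func collectBuffercache(dsn string) error {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return err
	}
	defer db.Close() // always release the connection, even on early returns

	// ... run pg_buffercache queries ...
	return nil
}
```
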
In almost all cases we have an existing database connection, so use
that to retrieve the value of Postgres settings.
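
For example, a setting can be read over the existing handle with a one-row query (sketch):

```go
package main

import (
	"context"
	"database/sql"
)

// currentSetting reads a Postgres setting over an existing connection,
// instead of opening a new one.
func currentSetting(ctx context.Context, db *sql.DB, name string) (string, error) {
	var value string
	err := db.QueryRowContext(ctx, "SELECT current_setting($1)", name).Scan(&value)
	return value, err
}
```
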
* Start collecting and sending pg_stat_statements_info stats
* Send diff dealloc number instead of absolute number
Due to security enhancements in newer macOS versions, unsigned Go
binaries may hang when built and executed locally. To prevent this,
add code signing to `make build`.
Currently, the collector log processing code assumes that the log
parser has always been initialized by the time we receive log messages
to process.  This is generally true, but it can be false if the
collector failed to connect to the database (to fetch log_line_prefix
to initialize the parser). That can lead to confusing errors (that
are not relevant to the root cause of the connection problem).

Skip attempting to process log lines when the log parser is not yet
initialized.
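
A sketch of the guard, with illustrative type names:

```go
package main

// LogParser stands in for the collector's parser type (illustrative).
type LogParser struct{ /* ... */ }

func (p *LogParser) Parse(line string) { /* ... */ }

// processLogLines skips work while the parser has not been initialized
// yet (e.g. because fetching log_line_prefix from the database failed),
// instead of producing confusing downstream errors.
func processLogLines(parser *LogParser, lines []string) {
	if parser == nil {
		return
	}
	for _, line := range lines {
		parser.Parse(line)
	}
}
```
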
#669)

* Dockerfile: Update alpine base image from 3.18 to 3.21
* Bump go version to 1.23.5
* Explicitly make a copy of logLine

With the Go version bump to 1.22+, this code triggered the warning "loop
variable logLine now per-iteration, heap-allocated". To potentially avoid
the heap allocation, explicitly make a copy of logLine.
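
A sketch of the pattern (the struct type is illustrative):

```go
package main

// logLineT stands in for the collector's log line struct (illustrative).
type logLineT struct{ Content string }

func collectPointers(logLines []logLineT) []*logLineT {
	out := make([]*logLineT, 0, len(logLines))
	for _, logLine := range logLines {
		logLine := logLine // explicit copy, making the per-iteration value intentional
		out = append(out, &logLine)
	}
	return out
}
```
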
This adds a new configuration setting, `gcp_region` / `GCP_REGION`.
If `db_use_iam_auth` / `DB_USE_IAM_AUTH` is enabled and `gcp_region` / `GCP_REGION` is specified, the collector uses a service account associated with the collector VM as a monitoring user, instead of traditional passwords.
In order to use this setting, a service account needs to be created with the proper roles, the database needs to be configured for IAM database authentication, and the service account needs to be added to the database, so that it can access the database to collect statistics.
We currently do not resolve symlinks when looking up $PGDATA from the
environment or the `data_directory` GUC setting. This can misattribute
storage costs to the wrong partition.

Always follow symlinks in the helper that resolves the data directory.
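
A sketch of the resolution step, using the standard library (function name is illustrative):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveDataDirectory follows symlinks so storage metrics are attributed
// to the partition that actually holds the data directory.
func resolveDataDirectory(dataDir string) (string, error) {
	resolved, err := filepath.EvalSymlinks(dataDir)
	if err != nil {
		return "", fmt.Errorf("resolving %s: %w", dataDir, err)
	}
	return resolved, nil
}
```
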
These were located in the structs that are typically used for metadata
about the table, with statistics kept separate, even if the statistics
in question are not diffed.

This does not make much of a difference in practice in the current way
this data is utilized (it's merely a matter of following the existing
pattern), but it will enable a subsequent refactoring of the partition
statistics merging logic, since that can then rely on the statistics
state information alone (which is indexed by OID).
We do this as a best practice, since theoretically a table that's not
fully qualified could be a problem if an attacker creates a maliciously
named table in the public schema.
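
As a hedged illustration of the pattern (the specific query is not from this change):

```go
package main

// Relies on search_path; a maliciously named object earlier in the path
// could theoretically shadow the intended one.
const before = `SELECT pid, query FROM pg_stat_activity`

// Pinned to the intended schema.
const after = `SELECT pid, query FROM pg_catalog.pg_stat_activity`
```
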
Previously we merged this data after we completed gathering all output
statistics. Besides this being challenging to make work correctly in
light of locked tables (which mean indexes into the statistics array
need to be looked up carefully via RelationIdx), it also had to resort
to some workarounds to find the relationship of child indexes to their
parent indexes.

Instead, replace the logic with an approach that gathers all declarative
partition child tables and indexes during the input phase, and then
use that to look up the statistics to be added when we first construct
the output statistics data.

Note this replaces the previous index child/parent heuristic (which worked
based on the columns being identical, and only considered unique indexes)
with simply relying on child indexes being tracked in pg_inherits to
belong to their parent.

This means the parent index sizes will now exclude any indexes that were
directly created on the children but shared a name with an empty parent
index, which may be the case when inheritance-based partitioning was
migrated to declarative partitioning.

On the other hand, this will now consistently make most declaratively
partitioned tables have their index sizes correctly added up to the
parent, even if the indexes are not unique, or use expressions.
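
A sketch of the kind of input-phase lookup this enables, relying on pg_inherits for both partitioned tables and partitioned indexes (the query shape is illustrative, not the collector's exact SQL):

```go
package main

// Partition parent/child relationships can be gathered up front from
// pg_inherits; relkind 'p' marks partitioned tables, 'I' partitioned
// indexes.
const partitionChildrenSQL = `
SELECT i.inhrelid AS child_oid, i.inhparent AS parent_oid
  FROM pg_catalog.pg_inherits i
  JOIN pg_catalog.pg_class parent ON parent.oid = i.inhparent
 WHERE parent.relkind IN ('p', 'I')`
```
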
Currently, we assume that a Heroku-deployed collector will always
monitor Heroku databases specified through Heroku env config
(`HEROKU_POSTGRES_..._URL` values). It's possible to work around this
with `CONFIG_CONTENTS`, but that can be an awkward workflow.

Allow monitoring databases deployed elsewhere by checking for
pganalyze-specific env vars first.
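
A sketch of the precedence change (the pganalyze-specific variable name is illustrative; the Heroku variables keep the elided naming pattern from above):

```go
package main

import (
	"os"
	"strings"
)

// databaseURLs prefers pganalyze-specific configuration and only falls
// back to the Heroku-provided values.
func databaseURLs() []string {
	if url := os.Getenv("DB_URL"); url != "" { // illustrative pganalyze-specific var
		return []string{url}
	}
	var urls []string
	for _, kv := range os.Environ() {
		name, value, _ := strings.Cut(kv, "=")
		if strings.HasPrefix(name, "HEROKU_POSTGRES_") && strings.HasSuffix(name, "_URL") {
			urls = append(urls, value)
		}
	}
	return urls
}
```
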
This was a bug introduced in #494. The code here was intended to pass
syslog-format messages to the old parsing code, but the old parsing
code did not actually use this code path for syslog. All the
syslog-specific processing is actually in
input/system/selfhosted/syslog_handler.go, and then the messages are
handed off to the parser. They will be matched against log_line_prefix
like any other log line.

Before #494 this worked fine, because #145 introduced a couple of
log_line_prefix options that could be used with this format.

The RsyslogRegexp was actually introduced in
73712dd (before our workflow used
pull requests consistently), and appears to have been intended to tail
log files in an older syslog format (the obsolete RFC3164, versus the
RFC5424 we currently support).

Drop the mistaken delegation of syslog message handling to the obsolete
parsing mechanism. Also drop that mechanism: it was broken by #494
anyway, and given it's a legacy format (RFC5424 came out in 2009),
there's no reason to believe it's still in use.
This splits the package building and testing logic into x86_64 and arm64
targets, and allows triggering either both, or just one of them.

It also changes the GitHub actions for package testing and release to
no longer use QEMU emulation (which recently started failing with
segfaults), and instead uses separate runners for each architecture.

Further, we add an explicit check for cgroups v2. Testing the Amazon Linux 2
package is only supported on a cgroups v1 host, since the container itself
ships a systemd version too old to support cgroups v2. Because the ARM
runner is only available on Ubuntu 22.04 or newer (which requires cgroups v2),
we do not test the ARM packages on Amazon Linux 2.

In passing, remove the stale "DOCKER_BUILDKIT=1" settings from the package
release action, which should no longer be needed since GH Actions updated
their Docker version, and allow overriding the docker command used with
"DOCKER_CMD", as we already supported for the release step.
…ion (#683)

For the collector-based Query Tuning workflow, the collector receives
a query text from the pganalyze server and runs it on the target
database. This query text may include pg_hint_plan hints, as users use
workbooks to optimize queries.

However, it looks like pg_hint_plan ignores hints if they are not in
the _first_ comment in a query. The collector prepends a query marker
as a separate comment, and this causes pg_hint_plan to ignore any
hints specified in the query.

Avoid passing our query marker, since we're already prepending it to
the top-level explain_analyze invocation, and nested queries will not
show up in pg_stat_activity anyway.
During the release process we group together the created packages as
an artifact, to then upload them individually to the GitHub release entry.

Previously this was able to use a single "packages" zip file for both
x86_64 and arm64 packages, but we now need to split this since we split
the actions.
We've previously used Ubuntu 20.04 for the x86_64 binary, to avoid
linking against a glibc that's too new. Unfortunately, with the Ubuntu
20.04 GH Actions runners on the way out, this no longer seems to work
reliably. Separately, we were already building the arm64 packages with an
undefined Ubuntu version (likely 24.04?).

To fix, consistently use Ubuntu 22.04 (glibc 2.35) for both x86_64 and
arm64. This also switches from Ubicloud to GitHub's ARM runners, since
Ubicloud does not offer tagged Ubuntu versions in their runners.