Mitigation Strategy: Per-Tenant Rate Limiting and Resource Quotas (Cortex Config)
Description:
- Identify Tenant Boundaries: Ensure your application correctly identifies tenants and passes this information to Cortex (e.g., via the
X-Scope-OrgID
header). - Baseline Usage: Analyze historical data within Cortex (using PromQL queries against Cortex metrics) to determine typical resource usage per tenant.
- Configure Rate Limits (limits_config): Use Cortex's
limits_config
(YAML) to set per-tenant limits. Crucially, this is within the Cortex configuration:limits: ingestion_rate: 1000 # Samples per second per tenant ingestion_burst_size: 2000 max_series_per_user: 100000 max_samples_per_query: 1000000 max_series_per_metric: 5000 # Limit series per metric, per tenant max_metadata_per_user: 10000 # Limit metadata entries
- Set
ingestion_rate
,ingestion_burst_size
,max_series_per_user
,max_samples_per_query
,max_series_per_metric
, andmax_metadata_per_user
appropriately.
- Set
- Configure Resource Quotas (runtime_config): Use the
runtime_config
file (which can be dynamically reloaded) to set per-tenant resource quotas. This is also a Cortex-specific configuration. While the enforcement might happen via Kubernetes (if deployed there), the configuration of the limits is done through Cortex. Example (conceptual, as the exact format depends on your setup):# Example - this might be JSON or YAML, depending on your setup tenant_resource_limits: tenant1: cpu_limit: "2" # 2 cores memory_limit: "4Gi" tenant2: cpu_limit: "1" memory_limit: "2Gi"
- Alerting (using Cortex Rules): Set up alerting rules within Cortex itself (using the Ruler component and PromQL) to monitor for tenants approaching or exceeding their limits. This uses Cortex's own alerting capabilities. Example:
groups: - name: tenant_limits rules: - alert: TenantNearIngestionLimit expr: cortex_distributor_ingest_requests_total{status_code="429"} / cortex_distributor_ingest_requests_total > 0.1 # 10% of requests are rate-limited for: 5m labels: severity: warning annotations: summary: "Tenant {{ $labels.org_id }} is nearing ingestion limit"
- Regular Review (using Cortex Metrics): Periodically query Cortex's own metrics (e.g.,
cortex_request_duration_seconds
,cortex_distributor_ingest_requests_total
,cortex_memory_requests_total
) to review and adjust limits.
Threats Mitigated:
- DoS/DDoS from a Single Tenant: (Severity: High)
- Resource Exhaustion: (Severity: High)
- Unfair Resource Allocation: (Severity: Medium)
- Accidental Overload: (Severity: Medium)
Impact:
- High reduction in risk for DoS, resource exhaustion, and unfair allocation, specifically through Cortex's built-in mechanisms.
Currently Implemented:
- Cortex's
limits_config
provides the core functionality. runtime_config
for resource quotas is supported.- Basic alerting via Ruler is possible.
Missing Implementation:
- Comprehensive alerting rules within Cortex for all limit types.
- Dynamic adjustment of quotas based on real-time usage (requires custom logic within a Cortex component or a custom controller interacting with the
runtime_config
). - Consistent enforcement across all custom Cortex components (if any are built).
Mitigation Strategy: Robust Authentication and Authorization (Cortex Config and Code)
Description:
- Disable Basic Authentication (Cortex Config): Ensure basic authentication is disabled in the Cortex configuration.
- Implement JWT Authentication (Cortex Config and Code):
- Cortex Config: Configure Cortex to accept JWTs (usually in the
server
section of the config). Specify the expected claims (e.g.,user_claim
,tenant_id_claim
). - Cortex Code (Middleware): If building custom Cortex components, implement JWT validation middleware within the Go code of those components. This middleware should:
- Verify the JWT signature (using a library like
github.com/golang-jwt/jwt
). - Check the expiration time.
- Validate the issuer and audience.
- Extract the tenant ID and roles.
- Verify the JWT signature (using a library like
- Cortex Config: Configure Cortex to accept JWTs (usually in the
- Implement mTLS for Inter-Component Communication (Cortex Config):
- Cortex Config: Configure each Cortex component (in its respective configuration section) to use mTLS. This involves specifying:
client_ca_file
: Path to the CA certificate.cert_file
: Path to the component's certificate.key_file
: Path to the component's private key.tls_enabled
: Set totrue
.
- Cortex Config: Configure each Cortex component (in its respective configuration section) to use mTLS. This involves specifying:
- Implement Fine-Grained Authorization (Cortex Code):
- Cortex Code (Middleware/Handlers): Within the Go code of Cortex components (especially queriers and rulers), implement authorization checks before processing requests. Use the tenant ID and roles from the JWT to enforce policies. Example (conceptual):
// In a request handler... tenantID := getTenantIDFromContext(ctx) roles := getRolesFromContext(ctx) if !isAuthorized(tenantID, roles, "read:metrics") { http.Error(w, "Unauthorized", http.StatusForbidden) return } // ... proceed with processing the request ...
- Cortex Code (Middleware/Handlers): Within the Go code of Cortex components (especially queriers and rulers), implement authorization checks before processing requests. Use the tenant ID and roles from the JWT to enforce policies. Example (conceptual):
- Key Rotation (External Tool Integration, but triggered by Cortex): While key rotation itself might be handled by an external tool (like Vault), Cortex needs to be configured to reload its configuration when keys change. This often involves using a sidecar container or a custom controller that monitors the secret management system and signals Cortex to reload.
Threats Mitigated:
- Unauthorized Access: (Severity: Critical)
- Data Breach: (Severity: Critical)
- Man-in-the-Middle (MitM) Attacks (Internal): (Severity: High)
- Compromised Component: (Severity: High)
- Credential Stuffing/Brute-Force Attacks: (Severity: High)
Impact:
- High reduction in risk, achieved through Cortex's configuration and, crucially, code-level enforcement in custom components.
Currently Implemented:
- Cortex supports JWT authentication (configuration).
- Cortex supports mTLS (configuration).
- Basic tenant isolation (likely implemented in code).
Missing Implementation:
- Fine-grained authorization within the Go code of Cortex components (beyond tenant isolation).
- JWT validation middleware in custom components.
- Mechanism for Cortex to automatically reload configuration when keys/certificates are rotated (requires integration with a secret manager, but the reload logic is Cortex-related).
Mitigation Strategy: Data Validation (Cortex Code)
Description:
- Identify Data Ingestion Points (Cortex Code): Locate the code within Cortex components (primarily ingesters and distributors) that handles incoming data.
- Implement Input Validation (Cortex Code): Within the Go code of these components, add validation logic:
- Timestamp Validation: Check
timestamp
fields in incoming samples. - Metric Name Validation: Validate
__name__
against a schema or whitelist (potentially configurable via Cortex config). - Label Validation: Validate label names and values. Limit label cardinality (potentially configurable via Cortex config).
- Value Validation: Implement sanity checks (e.g., counters only increase).
- Duplicate Sample Detection: Implement logic to detect and handle duplicates (e.g., drop or "last write wins" - configurable). This is often done using Cortex's internal caching mechanisms.
- Timestamp Validation: Check
- Configuration Options (Cortex Config): Expose some validation parameters (e.g., allowed metric name patterns, maximum label cardinality) as configuration options in Cortex's configuration file. This allows operators to customize the validation rules without modifying the code.
Threats Mitigated:
- Data Corruption: (Severity: Medium)
- Data Tampering: (Severity: Medium)
- Incorrect Query Results: (Severity: Medium)
- Denial of Service (DoS) via Data Injection: (Severity: High)
- Series Explosion: (Severity: High)
Impact:
- Reduces risk of data issues and improves query accuracy, directly through code-level checks within Cortex.
Currently Implemented:
- Cortex performs basic data type validation.
Missing Implementation:
- Comprehensive input validation logic within the Go code of ingesters and distributors (timestamp, metric name, label, value checks).
- Configurable validation rules (e.g., via
limits_config
or a dedicated validation section). - Duplicate sample detection and handling logic within the relevant Cortex components.
Mitigation Strategy: Principle of Least Privilege for Cortex Components (Cortex Configuration and Deployment)
Description:
- Identify Component Roles: Clearly define roles of each component (ingester, distributor, querier, etc.).
- Define Minimum Permissions (Cortex Configuration and Deployment):
- Network Access (Cortex Configuration): Use Cortex's configuration options (e.g., listen addresses, advertised addresses) to restrict which network interfaces each component listens on and which other components it can connect to. This is directly within the Cortex configuration.
- Resource Limits (Cortex Configuration/Deployment): While the enforcement might be done by Kubernetes or the OS, the definition of resource limits (CPU, memory) can often be done via Cortex's configuration (e.g., using flags or environment variables that Cortex itself reads).
- Implement Access Controls (Deployment, but influenced by Cortex): While the primary enforcement is at the deployment level (Kubernetes RBAC, IAM roles), Cortex's design and configuration influence how these are applied. For example, Cortex's multi-tenancy features and the way it handles authentication tokens are crucial for implementing least privilege.
Threats Mitigated:
- Compromised Component: (Severity: High)
- Unauthorized Access: (Severity: High)
- Privilege Escalation: (Severity: High)
Impact:
- High reduction in risk, achieved through a combination of Cortex configuration and deployment practices.
Currently Implemented:
- Cortex's configuration options allow for some control over network access and resource usage.
Missing Implementation:
- Systematic application of least privilege principles throughout the Cortex configuration and deployment.
- Fine-grained control over network access within Cortex's configuration.
Mitigation Strategy: Disable Unused Features (Cortex Configuration)
Description:
- Review Cortex Configuration: Examine the Cortex configuration file (usually YAML).
- Identify Unused Features: Determine which features (e.g., specific storage backends, alerting integrations, experimental features) are not in use.
- Disable Unused Features (Cortex Configuration): Explicitly disable unused features by setting appropriate configuration options to
false
or removing relevant configuration sections. This is entirely within the Cortex configuration. For example:# If you're not using the Ruler component, disable it: ruler: enabled: false # If you're not using a specific storage backend, disable it: # (Example - replace with the actual backend name) my_unused_storage_backend: enabled: false
Threats Mitigated:
- Vulnerabilities in Unused Features: (Severity: Medium)
- Misconfiguration of Unused Features: (Severity: Medium)
- Resource Consumption by Unused Features: (Severity: Low)
Impact:
- Reduces attack surface and simplifies configuration, directly through Cortex's configuration file.
Currently Implemented:
- Some features may be disabled by default.
Missing Implementation:
- Systematic review and explicit disabling of all unused features in the Cortex configuration.