Skip to content

Latest commit

 

History

History
200 lines (167 loc) · 12.5 KB

File metadata and controls

200 lines (167 loc) · 12.5 KB

Mitigation Strategies Analysis for cortexproject/cortex

Description:

  1. Identify Tenant Boundaries: Ensure your application correctly identifies tenants and passes this information to Cortex (e.g., via the X-Scope-OrgID header).
  2. Baseline Usage: Analyze historical data within Cortex (using PromQL queries against Cortex metrics) to determine typical resource usage per tenant.
  3. Configure Rate Limits (limits_config): Use Cortex's limits_config (YAML) to set per-tenant limits. Crucially, this is within the Cortex configuration:
    limits:
      ingestion_rate: 1000  # Samples per second per tenant
      ingestion_burst_size: 2000
      max_series_per_user: 100000
      max_samples_per_query: 1000000
      max_series_per_metric: 5000 # Limit series per metric, per tenant
      max_metadata_per_user: 10000 # Limit metadata entries
    • Set ingestion_rate, ingestion_burst_size, max_series_per_user, max_samples_per_query, max_series_per_metric, and max_metadata_per_user appropriately.
  4. Configure Resource Quotas (runtime_config): Use the runtime_config file (which can be dynamically reloaded) to set per-tenant resource quotas. This is also a Cortex-specific configuration. While the enforcement might happen via Kubernetes (if deployed there), the configuration of the limits is done through Cortex. Example (conceptual, as the exact format depends on your setup):
    # Example - this might be JSON or YAML, depending on your setup
    tenant_resource_limits:
      tenant1:
        cpu_limit: "2"  # 2 cores
        memory_limit: "4Gi"
      tenant2:
        cpu_limit: "1"
        memory_limit: "2Gi"
  5. Alerting (using Cortex Rules): Set up alerting rules within Cortex itself (using the Ruler component and PromQL) to monitor for tenants approaching or exceeding their limits. This uses Cortex's own alerting capabilities. Example:
    groups:
    - name: tenant_limits
      rules:
      - alert: TenantNearIngestionLimit
        expr: cortex_distributor_ingest_requests_total{status_code="429"} / cortex_distributor_ingest_requests_total > 0.1  # 10% of requests are rate-limited
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.org_id }} is nearing ingestion limit"
  6. Regular Review (using Cortex Metrics): Periodically query Cortex's own metrics (e.g., cortex_request_duration_seconds, cortex_distributor_ingest_requests_total, cortex_memory_requests_total) to review and adjust limits.

Threats Mitigated:

  • DoS/DDoS from a Single Tenant: (Severity: High)
  • Resource Exhaustion: (Severity: High)
  • Unfair Resource Allocation: (Severity: Medium)
  • Accidental Overload: (Severity: Medium)

Impact:

  • High reduction in risk for DoS, resource exhaustion, and unfair allocation, specifically through Cortex's built-in mechanisms.

Currently Implemented:

  • Cortex's limits_config provides the core functionality.
  • runtime_config for resource quotas is supported.
  • Basic alerting via Ruler is possible.

Missing Implementation:

  • Comprehensive alerting rules within Cortex for all limit types.
  • Dynamic adjustment of quotas based on real-time usage (requires custom logic within a Cortex component or a custom controller interacting with the runtime_config).
  • Consistent enforcement across all custom Cortex components (if any are built).

Description:

  1. Disable Basic Authentication (Cortex Config): Ensure basic authentication is disabled in the Cortex configuration.
  2. Implement JWT Authentication (Cortex Config and Code):
    • Cortex Config: Configure Cortex to accept JWTs (usually in the server section of the config). Specify the expected claims (e.g., user_claim, tenant_id_claim).
    • Cortex Code (Middleware): If building custom Cortex components, implement JWT validation middleware within the Go code of those components. This middleware should:
      • Verify the JWT signature (using a library like github.com/golang-jwt/jwt).
      • Check the expiration time.
      • Validate the issuer and audience.
      • Extract the tenant ID and roles.
  3. Implement mTLS for Inter-Component Communication (Cortex Config):
    • Cortex Config: Configure each Cortex component (in its respective configuration section) to use mTLS. This involves specifying:
      • client_ca_file: Path to the CA certificate.
      • cert_file: Path to the component's certificate.
      • key_file: Path to the component's private key.
      • tls_enabled: Set to true.
  4. Implement Fine-Grained Authorization (Cortex Code):
    • Cortex Code (Middleware/Handlers): Within the Go code of Cortex components (especially queriers and rulers), implement authorization checks before processing requests. Use the tenant ID and roles from the JWT to enforce policies. Example (conceptual):
      // In a request handler...
      tenantID := getTenantIDFromContext(ctx)
      roles := getRolesFromContext(ctx)
      
      if !isAuthorized(tenantID, roles, "read:metrics") {
          http.Error(w, "Unauthorized", http.StatusForbidden)
          return
      }
      // ... proceed with processing the request ...
  5. Key Rotation (External Tool Integration, but triggered by Cortex): While key rotation itself might be handled by an external tool (like Vault), Cortex needs to be configured to reload its configuration when keys change. This often involves using a sidecar container or a custom controller that monitors the secret management system and signals Cortex to reload.

Threats Mitigated:

  • Unauthorized Access: (Severity: Critical)
  • Data Breach: (Severity: Critical)
  • Man-in-the-Middle (MitM) Attacks (Internal): (Severity: High)
  • Compromised Component: (Severity: High)
  • Credential Stuffing/Brute-Force Attacks: (Severity: High)

Impact:

  • High reduction in risk, achieved through Cortex's configuration and, crucially, code-level enforcement in custom components.

Currently Implemented:

  • Cortex supports JWT authentication (configuration).
  • Cortex supports mTLS (configuration).
  • Basic tenant isolation (likely implemented in code).

Missing Implementation:

  • Fine-grained authorization within the Go code of Cortex components (beyond tenant isolation).
  • JWT validation middleware in custom components.
  • Mechanism for Cortex to automatically reload configuration when keys/certificates are rotated (requires integration with a secret manager, but the reload logic is Cortex-related).

Mitigation Strategy: Data Validation (Cortex Code)

Description:

  1. Identify Data Ingestion Points (Cortex Code): Locate the code within Cortex components (primarily ingesters and distributors) that handles incoming data.
  2. Implement Input Validation (Cortex Code): Within the Go code of these components, add validation logic:
    • Timestamp Validation: Check timestamp fields in incoming samples.
    • Metric Name Validation: Validate __name__ against a schema or whitelist (potentially configurable via Cortex config).
    • Label Validation: Validate label names and values. Limit label cardinality (potentially configurable via Cortex config).
    • Value Validation: Implement sanity checks (e.g., counters only increase).
    • Duplicate Sample Detection: Implement logic to detect and handle duplicates (e.g., drop or "last write wins" - configurable). This is often done using Cortex's internal caching mechanisms.
  3. Configuration Options (Cortex Config): Expose some validation parameters (e.g., allowed metric name patterns, maximum label cardinality) as configuration options in Cortex's configuration file. This allows operators to customize the validation rules without modifying the code.

Threats Mitigated:

  • Data Corruption: (Severity: Medium)
  • Data Tampering: (Severity: Medium)
  • Incorrect Query Results: (Severity: Medium)
  • Denial of Service (DoS) via Data Injection: (Severity: High)
  • Series Explosion: (Severity: High)

Impact:

  • Reduces risk of data issues and improves query accuracy, directly through code-level checks within Cortex.

Currently Implemented:

  • Cortex performs basic data type validation.

Missing Implementation:

  • Comprehensive input validation logic within the Go code of ingesters and distributors (timestamp, metric name, label, value checks).
  • Configurable validation rules (e.g., via limits_config or a dedicated validation section).
  • Duplicate sample detection and handling logic within the relevant Cortex components.

Description:

  1. Identify Component Roles: Clearly define roles of each component (ingester, distributor, querier, etc.).
  2. Define Minimum Permissions (Cortex Configuration and Deployment):
    • Network Access (Cortex Configuration): Use Cortex's configuration options (e.g., listen addresses, advertised addresses) to restrict which network interfaces each component listens on and which other components it can connect to. This is directly within the Cortex configuration.
    • Resource Limits (Cortex Configuration/Deployment): While the enforcement might be done by Kubernetes or the OS, the definition of resource limits (CPU, memory) can often be done via Cortex's configuration (e.g., using flags or environment variables that Cortex itself reads).
  3. Implement Access Controls (Deployment, but influenced by Cortex): While the primary enforcement is at the deployment level (Kubernetes RBAC, IAM roles), Cortex's design and configuration influence how these are applied. For example, Cortex's multi-tenancy features and the way it handles authentication tokens are crucial for implementing least privilege.

Threats Mitigated:

  • Compromised Component: (Severity: High)
  • Unauthorized Access: (Severity: High)
  • Privilege Escalation: (Severity: High)

Impact:

  • High reduction in risk, achieved through a combination of Cortex configuration and deployment practices.

Currently Implemented:

  • Cortex's configuration options allow for some control over network access and resource usage.

Missing Implementation:

  • Systematic application of least privilege principles throughout the Cortex configuration and deployment.
  • Fine-grained control over network access within Cortex's configuration.

Description:

  1. Review Cortex Configuration: Examine the Cortex configuration file (usually YAML).
  2. Identify Unused Features: Determine which features (e.g., specific storage backends, alerting integrations, experimental features) are not in use.
  3. Disable Unused Features (Cortex Configuration): Explicitly disable unused features by setting appropriate configuration options to false or removing relevant configuration sections. This is entirely within the Cortex configuration. For example:
    # If you're not using the Ruler component, disable it:
    ruler:
      enabled: false
    
    # If you're not using a specific storage backend, disable it:
    # (Example - replace with the actual backend name)
    my_unused_storage_backend:
      enabled: false

Threats Mitigated:

  • Vulnerabilities in Unused Features: (Severity: Medium)
  • Misconfiguration of Unused Features: (Severity: Medium)
  • Resource Consumption by Unused Features: (Severity: Low)

Impact:

  • Reduces attack surface and simplifies configuration, directly through Cortex's configuration file.

Currently Implemented:

  • Some features may be disabled by default.

Missing Implementation:

  • Systematic review and explicit disabling of all unused features in the Cortex configuration.