Mitigation Strategies Analysis for pola-rs/polars

  1. Prioritize Safe Formats: For data exchange with external or untrusted sources, default to using inherently safer data formats like Parquet, CSV, or JSON. Completely avoid using Pickle for any data that originates from outside a fully trusted environment.

  2. Polars Schema Enforcement: Utilize Polars' built-in schema enforcement capabilities during data loading (read_csv, read_parquet, etc.). Define the expected schema (column names, data types) explicitly within the read_... function call using the dtypes and/or schema parameters. This leverages Polars' internal validation, which is generally more efficient and reliable than external validation (see the sketch after this list).

  3. Polars Data Type Validation: Leverage Polars' data type system. Ensure that the specified dtypes in the schema are as restrictive as possible (e.g., use pl.Int32 instead of pl.Int64 if the data is known to fit within the smaller range). This helps prevent unexpected data from being loaded.

  4. Polars read_csv Specific Options: When using read_csv, utilize parameters like:

    • ignore_errors=False (default): Ensure that parsing errors are raised as exceptions, rather than silently ignored.
    • null_values: Explicitly define how null values are represented in the CSV file.
    • infer_schema_length: Control how many rows are used to infer the schema (use a sufficiently large value to avoid incorrect inference).
    • row_count_name and row_count_offset: Add a row-count column while loading so you can verify that the expected number of rows was read.
  5. Post-Load Validation (with Polars Expressions): Even after initial loading and schema enforcement, perform additional validation using Polars expressions. This allows for more complex checks that are difficult to express in a simple schema:

    • Check for values within specific ranges using .filter() and .is_between().
    • Check for uniqueness of values in certain columns (e.g., with .is_duplicated(), or by comparing .n_unique() to the column length).
    • Check for relationships between columns (e.g., one column should always be greater than another).
    • Use .is_null().any() to check for unexpected null values after transformations.
  6. Encoding: Use the encoding parameter in read_csv to explicitly specify the file's character encoding.
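
  The sketch below is a minimal illustration of items 2–6: loading a CSV with an explicit schema and strict parsing options, then validating the result with Polars expressions. The file name, column names, dtypes, and value ranges are hypothetical placeholders, and the dtypes parameter may appear as schema_overrides in newer Polars releases.

    import polars as pl

    # Hypothetical file and columns; adjust names, dtypes, and ranges to your data.
    df = pl.read_csv(
        "events.csv",
        dtypes={"event_id": pl.UInt32, "amount": pl.Float64, "category": pl.Utf8},
        null_values=["", "NA", "null"],   # explicit null markers
        infer_schema_length=10_000,       # only affects columns without an explicit dtype
        ignore_errors=False,              # parsing errors raise instead of being skipped
        encoding="utf8",
    )

    # Post-load validation with Polars expressions (item 5).
    checks = df.select(
        pl.col("amount").is_between(0, 1_000_000).all().alias("amount_in_range"),
        pl.col("event_id").is_duplicated().any().alias("event_id_has_dups"),
        pl.col("category").is_null().any().alias("category_has_nulls"),
    )
    in_range, has_dups, has_nulls = checks.row(0)
    if not in_range or has_dups or has_nulls:
        raise ValueError(f"Post-load validation failed: {checks.to_dicts()[0]}")

  Raising on a failed check keeps malformed data from silently propagating into later transformations.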

    • Threats Mitigated:

      • Arbitrary Code Execution via Untrusted Data (Deserialization): (Severity: Critical) - Avoiding Pickle and using Polars' built-in validation for other formats significantly reduces the risk.
      • Denial of Service (DoS) via Resource Exhaustion: (Severity: High) - Data type restrictions and post-load validation can help prevent excessively large or malformed data from being processed.
      • Logic Errors in Data Transformations: (Severity: Medium) - Schema enforcement and post-load validation ensure data conforms to expectations, reducing errors.
    • Impact:

      • Arbitrary Code Execution: Risk reduced from Critical to Low (if Pickle is avoided).
      • DoS: Risk reduced from High to Medium.
      • Logic Errors: Risk reduced from Medium to Low.
    • Currently Implemented: (Example - Needs to be filled in with your project's specifics)

      • Data loading functions in data_loader.py use Parquet and CSV.
      • dtypes are specified in read_parquet.
    • Missing Implementation:

      • dtypes are not consistently specified in read_csv.
      • ignore_errors=True is sometimes used in read_csv.
      • No post-load validation using Polars expressions.
      • null_values is not explicitly defined.
      • infer_schema_length is left at its default value.
  1. Lazy Evaluation Optimization: When using Polars' lazy evaluation, carefully construct and review the query plans. Use .explain() on the LazyFrame to inspect the optimized execution plan. Optimize complex queries by:

    • Applying filters (.filter()) as early as possible in the pipeline to reduce the amount of data processed.
    • Selecting only the necessary columns (.select()) early on.
    • Avoiding unnecessary or redundant operations.
    • Breaking down very complex queries into smaller, chained operations.
  2. Streaming Data Processing (for Large Datasets): If a dataset exceeds available memory, use Polars' streaming capabilities (the lazy API with scan_... functions) to process data in chunks (see the sketch after this list):

    • Use pl.scan_csv, pl.scan_parquet, etc., instead of pl.read_....
    • Define the processing pipeline using lazy operations.
    • Use .collect(streaming=True) to execute the pipeline in a streaming fashion.
  3. Timeouts within Polars operations: If you use the apply method with a user-defined function, you can implement a timeout (or an elapsed-time budget check) inside the function that is applied.
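
  A minimal sketch of items 1 and 2, assuming a hypothetical events.csv with event_id and amount columns: the filter and column selection are applied at the start of the lazy pipeline, the optimized plan is inspected with .explain(), and the query is executed in streaming mode. The exact streaming flag may differ across Polars versions.

    import polars as pl

    # Lazy scan: nothing is read from disk yet.
    lf = (
        pl.scan_csv("events.csv")
        .filter(pl.col("amount") > 0)                      # filter as early as possible
        .select(["event_id", "amount"])                    # keep only the needed columns
        .group_by("event_id")
        .agg(pl.col("amount").sum().alias("total_amount"))
    )

    # Inspect the optimized query plan before running it.
    print(lf.explain())

    # Execute in a streaming fashion to keep memory usage bounded.
    result = lf.collect(streaming=True)

  Reviewing the plan printed by .explain() confirms that the filter and column selection are pushed down before any data is processed.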

    • Threats Mitigated:

      • Denial of Service (DoS) via Resource Exhaustion: (Severity: High) - Lazy evaluation optimization and streaming prevent excessive memory usage and long-running operations.
      • Logic Errors (Indirectly): (Severity: Medium) - Optimized query plans are less likely to contain hidden performance bottlenecks caused by logic errors.
    • Impact:

      • DoS: Risk reduced from High to Low (with streaming and optimization).
      • Logic Errors: Risk remains Medium (but impact is reduced).
    • Currently Implemented: (Example)

      • Some use of lazy evaluation.
    • Missing Implementation:

      • No consistent use of df.explain() to optimize query plans.
      • Streaming is not used, even for large datasets.
      • No timeouts within the apply method.
  1. Explicit Data Type Selection: Always explicitly specify data types when creating or loading DataFrames. Choose the most restrictive data type that can accommodate the expected range of values. For example, if a column will only contain positive integers less than 1000, use pl.UInt16 instead of pl.Int64 (a sketch follows this list).
  2. Intermediate Type Casting (within Polars Expressions): When performing calculations within Polars expressions (e.g., using .apply() or custom aggregations), be mindful of potential intermediate overflow. If necessary, cast to a larger data type within the expression before performing the calculation, and then cast back to the desired final type if appropriate. Example:
    import polars as pl
    # Column "a" is stored as Int32 (e.g., loaded via an explicit schema).
    df = pl.DataFrame({"a": [2**30, 2**30]}, schema={"a": pl.Int32})
    # Incorrect: the multiplication stays in Int32 and can overflow
    df = df.with_columns((pl.col("a") * 2).alias("result_incorrect"))
    # Correct: cast to Int64 for the calculation
    df = df.with_columns((pl.col("a").cast(pl.Int64) * 2).alias("result_correct"))
  3. Polars Expression Checks: Use Polars expressions to check for potential overflow after calculations, if necessary. This is less efficient than preventing overflow in the first place, but it can be a useful safeguard. Example:
    import polars as pl
    df = pl.DataFrame({"a": [2**30, 2**30]})
    df = df.with_columns((pl.col("a") * 2).alias("result"))
    # Keep only rows whose result fits in the Int32 range (if that's the intended type);
    # rows outside the range are dropped, so count them first if you need to fail loudly.
    df = df.filter((pl.col("result") >= -2**31) & (pl.col("result") <= 2**31 - 1))
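
  A minimal sketch of item 1, using hypothetical data in which the count column is known to stay below 1000, so pl.UInt16 is sufficient:

    import polars as pl

    # Explicit, restrictive dtypes chosen for the expected value ranges.
    df = pl.DataFrame(
        {"item": ["a", "b", "c"], "count": [3, 999, 42]},
        schema={"item": pl.Utf8, "count": pl.UInt16},
    )
    print(df.schema)  # confirm the resulting dtypes
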
    • Threats Mitigated:

      • Integer Overflow/Underflow in Calculations: (Severity: Medium) - Prevents incorrect results and potential vulnerabilities.
    • Impact:

      • Integer Overflow/Underflow: Risk reduced from Medium to Low.
    • Currently Implemented: (Example)

      • Data types are sometimes specified.
    • Missing Implementation:

      • No consistent use of the most restrictive data types.
      • No intermediate type casting within expressions.
      • No Polars expression checks for overflow.