## 1. Mitigation Strategy: Safe `read_csv` and Input Handling
- **Description:**
    - **`dtype` Parameter:** When using `pd.read_csv()`, always pass the `dtype` parameter to explicitly specify the data type of each column. This prevents pandas from inferring types incorrectly, which could lead to vulnerabilities or unexpected behavior. Example: `pd.read_csv(..., dtype={'col1': str, 'col2': int, 'col3': float})`
    - **`converters` Parameter:** Use the `converters` parameter to apply custom sanitization functions to specific columns during the read. This is more efficient than reading the entire DataFrame and cleaning it afterwards. Example: `pd.read_csv(..., converters={'col1': sanitize_string})`, where `sanitize_string` is a custom function you define.
    - **`chunksize` Parameter:** For large CSV files, use the `chunksize` parameter to read the data in smaller chunks. This prevents pandas from attempting to load the entire file into memory at once. Example: `for chunk in pd.read_csv('large_file.csv', chunksize=10000): process_chunk(chunk)`
    - **Avoid `read_clipboard()`:** Do not use `pd.read_clipboard()` with untrusted clipboard data, as it could introduce malicious content.
    - **Limit Input Size (pre-pandas, but relevant):** Before calling `read_csv`, check the file size and reject files above an agreed limit. A combined sketch follows this list.
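A minimal, runnable sketch combining these parameters. The column names, the 100 MB limit, and the `sanitize_string`/`process_chunk` helpers are illustrative assumptions, not part of any pandas API:

```python
import os

import pandas as pd

MAX_FILE_SIZE = 100 * 1024 * 1024  # hypothetical 100 MB limit


def sanitize_string(value: str) -> str:
    # Illustrative sanitizer: strip padding, keep only printable characters.
    return "".join(ch for ch in value.strip() if ch.isprintable())


def process_chunk(chunk: pd.DataFrame) -> None:
    print(f"processing {len(chunk)} rows")  # hypothetical downstream handler


path = "large_file.csv"
if os.path.getsize(path) > MAX_FILE_SIZE:  # size check before pandas touches the file
    raise ValueError(f"{path} exceeds the allowed input size")

# dtype pins column types up front, converters sanitize 'col1' during the read,
# and chunksize streams the file in 10,000-row batches to bound memory use.
for chunk in pd.read_csv(
    path,
    dtype={"col2": int, "col3": float},
    converters={"col1": sanitize_string},
    chunksize=10_000,
):
    process_chunk(chunk)
```

Note that with a strict `dtype` such as `int`, malformed values make `read_csv` raise instead of silently coercing, which is usually the desired fail-fast behavior for untrusted input.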
- **Threats Mitigated:**
    - **Untrusted Data Input (Severity: High):** Prevents malicious or malformed data from causing unexpected behavior or vulnerabilities within pandas.
    - **Resource Exhaustion (Severity: Medium):** `chunksize` prevents memory exhaustion from large files.
    - **Data Corruption (Severity: Medium):** `dtype` and `converters` ensure data is interpreted correctly.
- **Impact:**
    - **Untrusted Data Input:** Risk reduced significantly (High impact).
    - **Resource Exhaustion:** Risk reduced significantly with `chunksize` (High impact).
    - **Data Corruption:** Risk reduced moderately (Medium impact).
- **Currently Implemented:**
    - Example: "`dtype` is used in all `read_csv` calls in `data_loading.py`. `chunksize` is used for files larger than 100 MB."
- **Missing Implementation:**
    - Example: "We are not currently using `converters` for sanitization. This should be added to `data_loading.py`."
    - Example: "`read_clipboard` is used in a testing script. This should be removed."
## 2. Mitigation Strategy: Avoid Pickle with Untrusted Data
- **Description:**
    - **Prohibit `pd.read_pickle()` with Untrusted Data:** Never use `pd.read_pickle()` to load data from an untrusted source (e.g., user uploads, external APIs). Pickle can execute arbitrary code during deserialization.
    - **Safe Alternatives:** Use safer serialization formats such as JSON (`pd.read_json()`, `pd.to_json()`), CSV (`pd.read_csv()`, `pd.to_csv()`), or Parquet (`pd.read_parquet()`, `pd.to_parquet()`) for data exchange. See the sketch after this list.
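A brief sketch of the safe pattern, round-tripping a DataFrame through JSON instead of pickle; the file name and data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"user": ["alice", "bob"], "score": [10, 20]})

# Safe: JSON carries only data, never executable objects.
df.to_json("scores.json", orient="records")
safe_df = pd.read_json("scores.json", orient="records")

# Unsafe with external input: unpickling can execute arbitrary code.
# df = pd.read_pickle(untrusted_path)  # never call this on untrusted data
```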
- **Threats Mitigated:**
    - **Deserialization Vulnerabilities (Pickle) (Severity: Critical):** Eliminates the risk of arbitrary code execution.
- **Impact:**
    - **Deserialization Vulnerabilities (Pickle):** Risk eliminated completely if the rule is followed (High impact).
- **Currently Implemented:**
    - Example: "Code review policy prohibits `pd.read_pickle()` with external data. All external data exchange uses JSON."
- **Missing Implementation:**
    - Example: "Legacy code in `old_module.py` uses `pd.read_pickle()`. This needs to be refactored."
## 3. Mitigation Strategy: Safe `eval()`, `query()`, and Indexing
- **Description:**
    - **Avoid Untrusted Input:** Never pass user-provided strings directly to `df.eval()` or `df.query()`, and never use them to construct dynamic selections with `df.loc[]` or `df.iloc[]`. These can execute arbitrary code.
    - **Boolean Indexing:** If filtering based on user input is needed, construct boolean indexing conditions programmatically from validated and sanitized user input. Do not embed user input directly into a string passed to `eval()` or `query()`. See the sketch after this list.
    - **Sanitize Query Strings:** Sanitize any string that will be used within a query.
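A minimal sketch contrasting the injectable pattern with programmatic boolean indexing; the column names and whitelist are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"status": ["open", "closed", "open"], "value": [1, 2, 3]})

# Unsafe: interpolating raw user input into a query string invites code injection.
# df.query(f"status == '{user_input}'")  # never build query strings this way

# Safe: validate against a whitelist, then filter with boolean indexing.
ALLOWED_STATUSES = {"open", "closed"}


def filter_by_status(frame: pd.DataFrame, status: str) -> pd.DataFrame:
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"invalid status: {status!r}")
    return frame[frame["status"] == status]


print(filter_by_status(df, "open"))
```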
- **Threats Mitigated:**
    - **Code Injection via `eval()`/`query()` (Severity: Critical):** Prevents arbitrary code execution.
- **Impact:**
    - **Code Injection:** Risk eliminated completely if the rule is followed (High impact).
- **Currently Implemented:**
    - Example: "All filtering in `data_analysis.py` uses boolean indexing with validated input."
- **Missing Implementation:**
    - Example: "`report_generator.py` uses `df.query()` with user input. Refactor to use boolean indexing."
## 4. Mitigation Strategy: Post-Read Data Type Enforcement
- **Description:**
    - **`astype()`:** After reading data into a DataFrame (even after using `dtype` in `read_csv`), explicitly cast columns to their intended data types with the `.astype()` method. Example: `df['col1'] = df['col1'].astype(str)`
    - **Categorical Data:** For columns with a limited set of known values, use the pandas categorical data type: `df['col2'] = df['col2'].astype('category')`.
    - **Downcasting:** For numerical columns, consider downcasting to smaller data types if the data range allows: `df['col3'] = pd.to_numeric(df['col3'], downcast='integer')`. See the sketch after this list.
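A short sketch of a post-read enforcement pass; the column names, the allowed `status` values, and the chosen target types are illustrative:

```python
import pandas as pd


def enforce_types(df: pd.DataFrame) -> pd.DataFrame:
    df["name"] = df["name"].astype(str)  # explicit cast to the intended type
    df["status"] = df["status"].astype("category")  # limited set of known values
    df["count"] = pd.to_numeric(df["count"], downcast="integer")  # shrink if range allows
    return df


df = pd.DataFrame({"name": [1, 2], "status": ["open", "closed"], "count": [10, 20]})
print(enforce_types(df).dtypes)  # name: object, status: category, count: int8
```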
- **Threats Mitigated:**
    - **Untrusted Data Input (Severity: High):** Provides a second layer of defense against incorrect data types.
    - **Data Corruption (Severity: Medium):** Ensures data is interpreted correctly.
    - **Resource Exhaustion (Severity: Low):** Categorical types and downcasting can reduce memory usage.
- **Impact:**
    - **Untrusted Data Input:** Risk reduced moderately (Medium impact).
    - **Data Corruption:** Risk reduced moderately (Medium impact).
    - **Resource Exhaustion:** Risk reduced slightly (Low impact).
- **Currently Implemented:**
    - Example: "`.astype(str)` is used for text columns in `data_processing.py`."
- **Missing Implementation:**
    - Example: "Categorical types are not consistently used. Implement for 'status' and 'category' columns."
    - Example: "Downcasting is not implemented. Review numerical columns in `data_analysis.py`."
## 5. Mitigation Strategy: Chunked Processing with `read_sql`
- **Description:**
    - **`chunksize` with `read_sql`:** When using `pd.read_sql` to read data from a database, use the `chunksize` parameter to retrieve data in smaller batches. This is analogous to the `chunksize` parameter in `read_csv`.
    - **Iterate and Process:** Iterate through the chunks and process each one individually, just as you would with chunks from `read_csv`. See the sketch after this list.
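A minimal sketch using SQLite; the database file, table name, and `process_chunk` helper are illustrative assumptions:

```python
import sqlite3

import pandas as pd


def process_chunk(chunk: pd.DataFrame) -> None:
    print(f"processing {len(chunk)} rows")  # hypothetical downstream handler


conn = sqlite3.connect("example.db")
# With chunksize set, read_sql returns an iterator of DataFrames instead of
# materializing the whole result set, keeping memory use bounded.
for chunk in pd.read_sql("SELECT * FROM events", conn, chunksize=10_000):
    process_chunk(chunk)
conn.close()
```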
- **Threats Mitigated:**
    - **Resource Exhaustion (Severity: Medium):** Prevents pandas from attempting to load an entire, potentially very large, database result set into memory at once.
- **Impact:**
    - **Resource Exhaustion:** Risk reduced significantly (High impact).
- **Currently Implemented:**
    - Example: "Not currently implemented. All database reads use `pd.read_sql` without `chunksize`."
- **Missing Implementation:**
    - Example: "Implement `chunksize` in all `pd.read_sql` calls within the `database_connector.py` module. Determine an appropriate chunk size based on testing and expected data volumes."