Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDBC logging and safety improvements #1117

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

zaleslaw
Copy link
Collaborator

@zaleslaw zaleslaw commented Apr 3, 2025

This pull request includes several changes aimed at enhancing the validation and execution of SQL queries and table names, as well as improving the database connection handling in the dataframe-jdbc module.

The most important changes include the introduction of strict validation for SQL queries and table names, the replacement of createStatement with prepareStatement for better security, and the addition of new validation methods.

Enhancements to SQL Validation:

  • Added strictValidation parameter to methods readSqlTable, readSqlQuery, and readDataFrame to enforce or bypass strict validation of SQL queries and table names. Default is true.
  • Implemented isValidSqlQuery and isValidTableName methods to validate SQL queries and table names, ensuring they do not contain forbidden patterns and have balanced quotes.

Security Improvements:

Code Cleanup:

  • Removed the unused constant MULTIPLE_SQL_QUERY_SEPARATOR from readJdbc.kt.

Documentation Updates:

  • Updated method documentation to include descriptions for the new strictValidation parameter.

These changes collectively improve the robustness, security, and clarity of the codebase, ensuring safer and more reliable database interactions.

zaleslaw added 5 commits April 2, 2025 11:01
Introduces a new `strictValidation` parameter to SQL reading methods, defaulting to `true`, to ensure table names and queries follow a valid format. If disabled, a warning is logged, granting users flexibility while maintaining safety by default.
@zaleslaw zaleslaw requested a review from Copilot April 3, 2025 19:18
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces JDBC logging and safety improvements by adding strict validation for SQL queries and table names while also updating test cases to prevent SQL injection. Key changes include:

  • Adding tests to verify that invalid table names and malicious SQL queries are rejected.
  • Refactoring query execution to use prepared statements and introducing a strictValidation flag.
  • Enhancing validation methods with detailed logging for both SQL queries and table names.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
dataframe-jdbc/src/test/kotlin/org/jetbrains/kotlinx/dataframe/io/h2/h2Test.kt Added tests for SQL injection prevention and strictValidation behavior
dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/readJdbc.kt Refactored query execution to use prepared statements; added strictValidation parameter and enhanced SQL validation
dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt Updated database utility to use prepared statements for improved safety

@zaleslaw zaleslaw marked this pull request as ready for review April 3, 2025 19:22
@zaleslaw zaleslaw requested a review from Jolanrensen April 3, 2025 19:22
* Checks if a given string contains forbidden patterns or keywords.
* Logs a clear and friendly message if any forbidden pattern is found.
*/
private fun containsForbiddenPatterns(input: String): Boolean {
Copy link
Collaborator

@Jolanrensen Jolanrensen Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we, as DataFrame, are responsible for catching these types of attacks. Allow me to elaborate:
This PR makes it so that we can catch 5 types of attacks, which is good! But I'm pretty sure there are more, and we'll miss all of those. Couldn't we better ask the JDBC implementation or a proven validation library whether the statement is safe instead?

Copy link
Collaborator Author

@zaleslaw zaleslaw Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated a few libraries with statement validation: some of them supports a limited number of databases or have complex API and seems like overengineering or for example licensing is SOMETHING SPECIAL

Pros of Catching Attacks in DataFrame.readSql*:

  • Defense in depth: Even if JDBC is secure, a second layer helps prevent human errors in API usage.
  • Prevents misusing readSqlQuery for multiple queries or data modifications.
  • Improves developer ergonomics: Clear logs/warnings/errors when something sketchy is passed.
  • Customizability: The user can turn strictValidation off in edge cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also mostly protected on JDBC level with PreparedStatement and executeQuery, but...different drivers give different possiblities to push some DROP Table/Database stuff through executeQuery and execute it BEFORE exception

executeQuery("SELECT * FROM users; DROP TABLE X") could lead to SQL injection, especially if multiple statements are allowed by the DB/driver (in URL for example)

My idea was a 3-tier protection:

✅ Prepared statements

✅ Manual validation (for SELECT-only queries, no ;, no suspicious tokens)

✅ strictValidation as opt-in/out

Together, that's a pretty solid defense-in-depth.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like a good approach indeed :)

}

// Validate the table name structure: letters, numbers, underscores, and dots are allowed
val tableNameRegex = Regex("^[A-Z0-9_]+(\\.[A-Z0-9_]+)*$")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex can be a const somewhere

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only uppercase letters are allowed? Oh I see, normalizedTableName is made uppercase, got it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, do we specifically only allow A-Z? So, not Æ, Ø, Å, etc? Otherwise, it might make more sense to replace [A-Z0-9_] by unicode category with [\p{L}\p{N}_] (https://www.regular-expressions.info/unicode.html). If SQL has a specific ISO standard for these names, let's stick to that instead.

Copy link
Collaborator Author

@zaleslaw zaleslaw Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree about possibility to use other name, it seems inconsistent with other DataFrame parts, from other hand, I thought that this behavior could be covered with disabling strictValidation

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah probably, but it seems unnecessarily strict to me. If you're from a country that uses a different alphabet you will always have to turn off strictValidation in this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said I agree, we're supporting different alphabets and symbols in other stuff, better to be aligned with that, I consider this and add test cases (honestly never named in Cyrillic any table in my life)


// Check if there are balanced quotes (single and double)
val singleQuotes = sqlQuery.count { it == '\'' }
val doubleQuotes = sqlQuery.count { it == '"' }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about backticks? are they fine?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backticks aren't standard across all databases:

  • PostgreSQL uses double quotes (") for quoting identifiers.
  • SQL Server uses square brackets ([table]).

Or you need to provide different implementation for different databases (and for custom database) or handle it in one manner and for specific cases you could disable strictValidation

The idea of strictValidation was mostly not to PROTECT API or something like that, but avoid some typical use-cases, when somebody copying invalid code, for example from stackoverflow or ChatGPT not to ruin their databases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants