feat: Add support for pgvector [NAIVE] #251

Draft · wants to merge 9 commits into main

Conversation

@amotl (Contributor) commented Dec 13, 2023

Note: This PR is just for educational purposes and is not to be taken seriously, unless otherwise endorsed.

About

This patch aims to add very basic and naive support for populating data into a vector()-type column as provided by pgvector. A vector() is effectively an array of floating point numbers, so the implementation follows that model.
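
To make that concrete, here is a minimal sketch, independent of this patch, of how a vector() column is declared through the pgvector SQLAlchemy integration; the table and column names are made up for illustration:

import sqlalchemy as sa
from pgvector.sqlalchemy import Vector

metadata = sa.MetaData()

# A hypothetical table with a 4-dimensional `vector(4)` column.
documents = sa.Table(
    "documents",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("embedding", Vector(4)),
)

# Values are supplied as plain lists of floating point numbers.
insert = documents.insert().values(id=1, embedding=[1.0, 2.0, 3.0, 4.0])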

Details

As outlined below, my implementation is very naive and doesn't take any Singer SCHEMA standardization processes into account, effectively just hacking in an extra "additionalProperties": {"storage": {"type": "vector", "dim": 4}} attribute.
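
Spelled out on a single property schema, the hacked-in annotation looks like this (illustration only; the structure follows the docstring in the diff further down):

# The `additionalProperties.storage` slot is the hacked-in annotation.
jsonschema_type = {
    "type": "array",
    "items": {"type": "number"},
    "additionalProperties": {"storage": {"type": "vector", "dim": 4}},
}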

I am sure this is not the appropriate way, so I will take it as an opportunity to learn how new types are actually integrated into the type system, and ask for your patience in the meantime.

The patch is stacked on top of GH-250, which is why the diff is hard to read. Merging GH-250 and rebasing this branch will improve the situation. In the meantime, the diff can be inspected by visiting the commit ea08740.

Usage

Install the package in development mode, including the pgvector extra.

poetry install --extras=pgvector

Invoke all array tests, including test_array_float_vector.

pytest -vvv -k array

Thoughts

Let me know if you think it is a good idea to explore this feature within target-postgres, or whether it should be approached within a different target implementation, like target-pgvector, which would build upon the former but separate concerns.

target_postgres/connector.py (outdated)
Comment on lines 279 to 335
if "array" in jsonschema_type["type"]:
# Select between different kinds of `ARRAY` data types.
#
# This currently leverages an unspecified definition for the Singer SCHEMA,
# using the `additionalProperties` attribute to convey additional type
# information, agnostic of the target database.
#
# In this case, it is about telling different kinds of `ARRAY` types apart:
# Either it is a vanilla `ARRAY`, to be stored into a `jsonb[]` type, or,
# alternatively, it can be a "vector" kind `ARRAY` of floating point numbers,
# effectively what pgvector is storing in its `VECTOR` type.
#
# Still, `type: "vector"` is only a surrogate label here, because other
# database systems may use different types for implementing the same thing,
# and need to translate accordingly.
"""
Schema override rule in `meltano.yml`:

type: "array"
items:
type: "number"
additionalProperties:
storage:
type: "vector"
dim: 4

Produced schema annotation in `catalog.json`:

{"type": "array",
"items": {"type": "number"},
"additionalProperties": {"storage": {"type": "vector", "dim": 4}}}
"""
if (
"additionalProperties" in jsonschema_type
and "storage" in jsonschema_type["additionalProperties"]
):
storage_properties = jsonschema_type["additionalProperties"]["storage"]
if (
"type" in storage_properties
and storage_properties["type"] == "vector"
):
# On PostgreSQL/pgvector, use the corresponding type definition
# from its SQLAlchemy dialect.
from pgvector.sqlalchemy import (
Vector, # type: ignore[import-untyped]
)

return Vector(storage_properties["dim"])
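
For clarity only, the selection logic above can be restated as a standalone function; the function name is made up and not part of the patch:

from pgvector.sqlalchemy import Vector

def pick_array_type(jsonschema_type: dict):
    """Sketch of the branch above; returns None for vanilla arrays."""
    storage = jsonschema_type.get("additionalProperties", {}).get("storage", {})
    if storage.get("type") == "vector":
        return Vector(storage["dim"])
    return None  # fall through to the vanilla `jsonb[]` handling

sql_type = pick_array_type(
    {
        "type": ["array"],
        "items": {"type": "number"},
        "additionalProperties": {"storage": {"type": "vector", "dim": 4}},
    }
)
assert isinstance(sql_type, Vector) and sql_type.dim == 4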
@amotl (Contributor, Author) commented:
After learning more details about the Singer specification, I've now used the additionalProperties schema slot to convey additional type information, agnostic of the target database.

It might not be what you had in mind, but it seems to work well. 🤷

A Member commented:
Hi Andreas!

Reading through the JSON Schema Draft 7 spec, and thinking about our own use of additionalProperties in the SDK, I think I have a different understanding of its use cases:

This keyword determines how child instances validate for objects, and does not directly validate the immediate instance itself.

Validation with "additionalProperties" applies only to the child values of instance names that do not match any names in "properties", and do not match any regular expression in "patternProperties".

For all such properties, validation succeeds if the child instance validates against the "additionalProperties" schema.

So I think it's used to constrain the schema that extra fields, not included in an object's properties mapping, should conform to.

What do you think?
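
To illustrate that reading with the jsonschema package (example added for clarity; not part of the thread):

from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    # Any property *not* listed in `properties` must validate as a number.
    "additionalProperties": {"type": "number"},
}

validate({"name": "a", "score": 1.5}, schema)  # passes

try:
    validate({"name": "a", "score": "high"}, schema)
except ValidationError:
    pass  # fails: the extra property "score" is not a number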

@amotl (Contributor, Author) commented Dec 14, 2023:

Hi Edgar,

thanks for your response. I am absolutely sure I failed to respect the specification. 💥 😇

Please bear with me, as I have not yet become well acquainted with the baseline specification, and apologies for taking up your time.

It is clear that I've abused the additionalProperties field, and I will be happy to wait until you unlock other corresponding attributes for conveying additional type information, as you suggested at meltano/sdk#2102.

My patch was merely meant to exercise a few other details which may be needed to make this fly, just on the level of the target, and to see whether the pieces could be reassembled into something actually runnable [1], also as a learning exercise.

With kind regards,
Andreas.

Footnotes

  1. https://github.com/singer-contrib/meltano-examples/tree/main/to-database

@amotl (Contributor, Author) commented Dec 14, 2023:

Let me know whether we should rather close this PR, and let the topic sit until corresponding improvements have made it into the SDK.

I probably can't help there, as I believe it needs more intimate knowledge and discussion amongst you and your colleagues.

Alternatively, if you think there are other, more feasible workarounds in the same style as my hack, I will be happy to receive your guidance.

Commits

In PostgreSQL, it all boils down to the `jsonb[]` type, but arrays are reflected as `sqlalchemy.dialects.postgresql.ARRAY` instead of `sqlalchemy.dialects.postgresql.JSONB`.

In order to prepare for more advanced type mangling & validation, and to better support databases pretending to be compatible with PostgreSQL, the new test cases exercise arrays with different kinds of inner values, because, on other databases, ARRAYs may need to have uniform content.

Along the way, this adds a `verify_schema` utility function in the spirit of the `verify_data` function, refactored and generalized from the `test_anyof` test case.
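
A minimal sketch of what such a verify_schema helper might look like; the signature and behavior here are assumptions, not the actual implementation:

import sqlalchemy as sa

def verify_schema(engine: sa.engine.Engine, table_name: str, expected: dict) -> None:
    """Assert that reflected column types match an expected mapping.

    Sketch only; the real helper may differ.
    """
    inspector = sa.inspect(engine)
    columns = {
        column["name"]: type(column["type"])
        for column in inspector.get_columns(table_name)
    }
    for name, expected_type in expected.items():
        assert name in columns, f"missing column: {name}"
        assert issubclass(columns[name], expected_type), (
            f"column {name}: {columns[name]} is not a {expected_type}"
        )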
Dispose the SQLAlchemy engine object after use within test utility functions.

Within `BasePostgresSDKTests`, new database connections via SQLAlchemy were not being closed, and started filling up the connection pool, eventually saturating it.
Dispose the SQLAlchemy engine object after use within `PostgresConnector`.
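
In SQLAlchemy terms, both fixes boil down to calling engine.dispose() once the engine is no longer needed, e.g. (connection string made up):

import sqlalchemy as sa

engine = sa.create_engine("postgresql://postgres:secret@localhost/testdb")
try:
    with engine.connect() as connection:
        connection.execute(sa.text("SELECT 1"))
finally:
    # Close all pooled connections so that repeated test invocations
    # do not saturate the server's connection limit.
    engine.dispose()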
Wrap the assertion helper functions into a container class `AssertionHelper`, making it easy to parameterize them, and to provide them to test functions using a pytest fixture.

This way, they are reusable from database adapter implementations which derive from PostgreSQL.

The motivation for this is that the metadata column prefix `_sdc` needs to be adjusted for other database systems, as they reject such columns, which are reserved for system purposes. In the specific case of CrateDB, it is enough to rename it to `__sdc`. Sad but true.
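
A sketch of the container-class-plus-fixture pattern described above; apart from `AssertionHelper` and the `_sdc` prefix, all names are made up:

import pytest

class AssertionHelper:
    """Bundle assertion helpers so derived adapters can parameterize them."""

    def __init__(self, metadata_column_prefix: str = "_sdc") -> None:
        self.metadata_column_prefix = metadata_column_prefix

    def metadata_column(self, name: str) -> str:
        return f"{self.metadata_column_prefix}_{name}"

@pytest.fixture
def helper() -> AssertionHelper:
    # A CrateDB-derived suite could instead return
    # AssertionHelper(metadata_column_prefix="__sdc").
    return AssertionHelper()

def test_metadata_column_name(helper: AssertionHelper) -> None:
    assert helper.metadata_column("extracted_at") == "_sdc_extracted_at"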