Using Amundsen's APIs as a bridge rather than interacting with Neo4J? #3
Thanks for the suggestion, @hashhar - that's super interesting! My only hesitation with going in this direction to start was that I didn't want to heavily clog the metadata endpoint, but I'll scope this out and let you know. Are you guys on Atlas?
Yes @rsyi, I'm using the Atlas backend. I haven't yet looked into the code, but I'm assuming that with the existing code you essentially dump all data from the remote Neo4j to a local Neo4j instance using databuilder. If my assumption is correct, then you're right that this would either mean each interaction needs to make a call to the metadata service, or that you'd need to effectively re-implement a data store and queries for the metadata API responses.

I think another possible way to handle this is to instruct people to write their own databuilder jobs to do the same with Atlas as you are doing for Neo4j, but I'll need to check whether exporting and importing is possible via Atlas. Maybe the folks at ING WBAA might have an idea. Not sure which approach makes sense, though.
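For reference, the "call the metadata service for each interaction" route would look roughly like the sketch below. This is only an illustration, not metaframe code; the host, the port (5002 is the common metadataservice default), and the table-URI format are assumptions about a stock Amundsen deployment.

```python
# Minimal sketch: fetch one table's metadata through Amundsen's metadataservice
# instead of querying Neo4j/Atlas directly. The base URL, port, and table-URI
# format below are assumptions about a default deployment.
import requests

METADATA_SERVICE = "http://localhost:5002"

def get_table_metadata(database: str, cluster: str, schema: str, table: str) -> dict:
    # Amundsen table URIs typically follow <database>://<cluster>.<schema>/<table>
    table_uri = f"{database}://{cluster}.{schema}/{table}"
    resp = requests.get(f"{METADATA_SERVICE}/table/{table_uri}", timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(get_table_metadata("hive", "gold", "core", "fact_orders"))
```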
The code actually doesn't dump into a local Neo4j instance (it stores all metadata as text), but your point is otherwise right on the money! Because I'm storing the data locally and searching over it there, I can't go through metadataservice to access the data for each table -- I have to dump it all. I went with this architecture primarily for:
And I'm very open to contributions! I'm currently writing docs with a more extensive tutorial on how to create custom ETL jobs, and I have a rough draft here: https://docs.metaframe.sh/custom-etl. I took a quick look at the metadata endpoints, and it actually doesn't seem too bad. But if you (or anyone) wants to give this a try, I'd be happy to help out/walk you through the code. :)
I'll be able to look at this over the next weekend. I think it's much better to write an extractor for Atlas rather than Amundsen, since people using Atlas without Amundsen will also get the feature for free. The initial dump into text files via the metadata service might also not be feasible for even moderately large catalogs.
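As a rough starting point, an Atlas extractor in the databuilder style could be skeletoned like the sketch below. The Atlas REST path and payload, the config keys, and the shape of the returned records are assumptions that would need to be checked against the Atlas v2 API and databuilder's model classes, so treat it as a sketch rather than working code.

```python
# Rough skeleton of an Atlas-backed extractor in the databuilder style.
# The REST path, request payload, config keys, and returned record shape are
# assumptions about the Atlas v2 API; pagination and model mapping are omitted.
import requests
from pyhocon import ConfigTree
from databuilder.extractor.base_extractor import Extractor


class AtlasTableExtractor(Extractor):
    """Pulls hive_table entities from Atlas via its basic-search REST API."""

    def init(self, conf: ConfigTree) -> None:
        self._base_url = conf.get_string('atlas_url')  # e.g. http://atlas-host:21000
        self._auth = (conf.get_string('user', 'admin'),
                      conf.get_string('password', 'admin'))
        self._tables = iter(self._search_tables())

    def _search_tables(self):
        # Assumed Atlas v2 basic-search endpoint; a real extractor would paginate.
        resp = requests.post(
            f"{self._base_url}/api/atlas/v2/search/basic",
            json={"typeName": "hive_table", "limit": 100, "offset": 0},
            auth=self._auth,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("entities", [])

    def extract(self):
        # databuilder calls extract() repeatedly until it returns None.
        return next(self._tables, None)

    def get_scope(self) -> str:
        return 'extractor.atlas_table'
```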
Awesome! Let me know if you need any help/clarity. You could even just DM me on the Amundsen slack. Happy to talk there as well.
# This is the 1st commit message:

update to newer amundsen-databuilder and requests. Also add connect_args to the Presto driver.

# This is the commit message rsyi#2:

fix sql_alchemy_engine.py to use connect_args as JSON. connect_args could be set like this:

```
connect_args: {'protocol':'https'}
```

# This is the commit message rsyi#3:

add support for connect_args for the Presto/Trino connector (rsyi#184)

* update to newer amundsen-databuilder and requests; also add connect_args to the Presto driver
* fix sql_alchemy_engine.py to use connect_args as JSON; connect_args could be set like this: `connect_args: {'protocol':'https'}`
* add missing packages
* fix some missing pieces to support SQLAlchemy connect_args
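For illustration, the connect_args value above is ultimately handed to SQLAlchemy's standard create_engine keyword. A minimal example, assuming a PyHive-backed Presto/Trino URL (the host, port, and catalog are placeholders):

```python
# Minimal illustration of passing connect_args through SQLAlchemy's create_engine.
# The presto:// URL assumes the PyHive dialect is installed; adjust host, port,
# and catalog for your own deployment.
from sqlalchemy import create_engine, text

engine = create_engine(
    "presto://presto-host:8443/hive",
    connect_args={"protocol": "https"},
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).fetchone())
```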
Amundsen's APIs (specifically metadataservice) might be a good thing to integrate as a backend, since they abstract away the Atlas/Neo4J/other backends and don't depend on the schema of the backend data.
If the goal is to be able to run completely offline with a local copy of the backing data, then I can understand that too, but if depending on connectivity/access to Amundsen isn't a concern, it might be worthwhile.
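As a rough illustration of that trade-off, a tool could prefer the live metadataservice and fall back to a local copy when offline. A minimal sketch, assuming the /table/<table_uri> endpoint on port 5002 and a hypothetical local cache layout (this is not metaframe's actual behavior):

```python
# Hedged sketch of the online/offline trade-off: prefer the live Amundsen
# metadataservice, fall back to a locally cached copy when unreachable.
# The endpoint, port, and cache layout are assumptions for illustration only.
import json
from pathlib import Path

import requests

METADATA_SERVICE = "http://localhost:5002"
LOCAL_CACHE = Path.home() / ".metaframe" / "metadata"

def table_metadata(table_uri: str) -> dict:
    try:
        resp = requests.get(f"{METADATA_SERVICE}/table/{table_uri}", timeout=5)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Offline: fall back to whatever was dumped locally for this table.
        cached = LOCAL_CACHE / f"{table_uri.replace('/', '_')}.json"
        return json.loads(cached.read_text())
```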