Using Amundsen's APIs as a bridge rather than interacting with Neo4J? #3
Thanks for the suggestion, @hashhar - that's super interesting! My only hesitation with going in this direction to start was that I didn't want to heavily clog the metadata endpoint, but I'll scope this out and let you know. Are you guys on Atlas?
Yes @rsyi, I'm using the Atlas backend. I haven't yet looked into the code, but I'm assuming that with the existing code you essentially dump all data from the remote Neo4j to a local Neo4j instance using databuilder. If my assumption is correct, then you're right that this would either mean each interaction needs to make a call to the metadata service, or that you'd need to effectively re-implement a data store and queries for the metadata API responses.

I think another possible way to handle this is to instruct people to write their own databuilder jobs to do the same with Atlas as you are doing for Neo4j, but I'll need to check whether exporting and importing is possible via Atlas. Maybe the folks at ING WBAA might have an idea. Not sure which approach makes sense, though.
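For reference, the "call the metadata service for each interaction" route would look roughly like the sketch below. This is only an illustration, not metaframe code; the host, the port (5002 is the common metadataservice default), and the table-URI format are assumptions about a stock Amundsen deployment.

```python
# Minimal sketch: fetch one table's metadata through Amundsen's metadataservice
# instead of querying Neo4j/Atlas directly. The base URL, port, and table-URI
# format below are assumptions about a default deployment.
import requests

METADATA_SERVICE = "http://localhost:5002"

def get_table_metadata(database: str, cluster: str, schema: str, table: str) -> dict:
    # Amundsen table URIs typically follow <database>://<cluster>.<schema>/<table>
    table_uri = f"{database}://{cluster}.{schema}/{table}"
    resp = requests.get(f"{METADATA_SERVICE}/table/{table_uri}", timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(get_table_metadata("hive", "gold", "core", "fact_orders"))
```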
The code actually doesn't dump into a local Neo4j instance (it stores all metadata as text), but your point is otherwise right on the money! Because I'm storing the data locally and searching over it there, I can't go through metadataservice to access the data for each table -- I have to dump it all. I went with this architecture primarily for:
And I'm very open to contributions! I'm currently writing docs with a more extensive tutorial on how to create custom ETL jobs, and I have a rough draft here: https://docs.metaframe.sh/custom-etl. I took a quick look at the metadata endpoints, and it actually doesn't seem too bad. But if you (or anyone) wants to give this a try, I'd be happy to help out/walk you through the code. :)
I'll be able to look at this over the next weekend. I think it's much better to write an extractor for Atlas rather than Amundsen, since people using Atlas without Amundsen will also get the feature for free. The initial dump into text files via the metadata service might also not be feasible for even moderately large catalogs.
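As a rough starting point, an Atlas extractor in the databuilder style could be skeletoned like the sketch below. The Atlas REST path and payload, the config keys, and the shape of the returned records are assumptions that would need to be checked against the Atlas v2 API and databuilder's model classes, so treat it as a sketch rather than working code.

```python
# Rough skeleton of an Atlas-backed extractor in the databuilder style.
# The REST path, request payload, config keys, and returned record shape are
# assumptions about the Atlas v2 API; pagination and model mapping are omitted.
import requests
from pyhocon import ConfigTree
from databuilder.extractor.base_extractor import Extractor


class AtlasTableExtractor(Extractor):
    """Pulls hive_table entities from Atlas via its basic-search REST API."""

    def init(self, conf: ConfigTree) -> None:
        self._base_url = conf.get_string('atlas_url')  # e.g. http://atlas-host:21000
        self._auth = (conf.get_string('user', 'admin'),
                      conf.get_string('password', 'admin'))
        self._tables = iter(self._search_tables())

    def _search_tables(self):
        # Assumed Atlas v2 basic-search endpoint; a real extractor would paginate.
        resp = requests.post(
            f"{self._base_url}/api/atlas/v2/search/basic",
            json={"typeName": "hive_table", "limit": 100, "offset": 0},
            auth=self._auth,
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("entities", [])

    def extract(self):
        # databuilder calls extract() repeatedly until it returns None.
        return next(self._tables, None)

    def get_scope(self) -> str:
        return 'extractor.atlas_table'
```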
Awesome! Let me know if you need any help/clarity. You could even just DM me on the Amundsen slack. Happy to talk there as well.
# This is the 1st commit message:

update to newer amundsen-databuilder and requests. Also add connect_args to the Presto driver.

# This is the commit message rsyi#2:

fix sql_alchemy_engine.py to use connect_args as JSON. connect_args could be set like this:

```
connect_args: {'protocol':'https'}
```

# This is the commit message rsyi#3:

add support for connect_args for the Presto/Trino connector (rsyi#184)

* update to newer amundsen-databuilder and requests; also add connect_args to the Presto driver
* fix sql_alchemy_engine.py to use connect_args as JSON; connect_args could be set like this: `connect_args: {'protocol':'https'}`
* add missing packages
* fix some missing pieces to support SQLAlchemy connect_args
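For illustration, the connect_args value above is ultimately handed to SQLAlchemy's standard create_engine keyword. A minimal example, assuming a PyHive-backed Presto/Trino URL (the host, port, and catalog are placeholders):

```python
# Minimal illustration of passing connect_args through SQLAlchemy's create_engine.
# The presto:// URL assumes the PyHive dialect is installed; adjust host, port,
# and catalog for your own deployment.
from sqlalchemy import create_engine, text

engine = create_engine(
    "presto://presto-host:8443/hive",
    connect_args={"protocol": "https"},
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).fetchone())
```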
Amundsen's APIs (specifically metadataservice) might be a good thing to integrate as a backend, since they abstract away the Atlas/Neo4J/other backends and don't depend on the schema of the backend data.
If the goal is to be able to run completely offline with a local copy of the backing data, then I can understand that too, but if depending on connectivity/access to Amundsen isn't a concern, it might be worthwhile.
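As a rough illustration of that trade-off, a tool could prefer the live metadataservice and fall back to a local copy when offline. A minimal sketch, assuming the /table/<table_uri> endpoint on port 5002 and a hypothetical local cache layout (this is not metaframe's actual behavior):

```python
# Hedged sketch of the online/offline trade-off: prefer the live Amundsen
# metadataservice, fall back to a locally cached copy when unreachable.
# The endpoint, port, and cache layout are assumptions for illustration only.
import json
from pathlib import Path

import requests

METADATA_SERVICE = "http://localhost:5002"
LOCAL_CACHE = Path.home() / ".metaframe" / "metadata"

def table_metadata(table_uri: str) -> dict:
    try:
        resp = requests.get(f"{METADATA_SERVICE}/table/{table_uri}", timeout=5)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Offline: fall back to whatever was dumped locally for this table.
        cached = LOCAL_CACHE / f"{table_uri.replace('/', '_')}.json"
        return json.loads(cached.read_text())
```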