-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unifying Deduplication API Modules #516
Comments
I think it is okay to mark this as done. I see what you are saying about semantic dedupe, but since the removal logic is already baked in I think it is fine as is. Plus, once Only other thing I can think of is if you want to change the names of the |
I don't think that's true, since IIRC Semantic returns "doc ids to keep" rather than "documents to keep" (difference being only |
Oh I see, I didn't realize the other columns aren't kept. Then maybe adding an extra function to the semantic dedupe class? |
I think what we're trying to achieve here is unifying API (which can be interpreted differently as that the "call should achieve the same results"). Though I would say inner join might be more "parallelizable" than a left-anti join, so if we're okay to "achieve same results" while having "differing API" in sem dedup we can have We could benchmark which is faster, assuming sem dedup fings ~30% dupes in a dataset, is left-anti join with 30% faster or inner join with 70% and then decide the API too |
identify
/remove
andidentify_and_remove
?__call__
should behave asidentify_and_remove
and advanced users who need to configure which dupe among dupes to keep (for exact / fuzzy) can call identify and remove separately?Architectural Design
BaseDeduplicator
that has the abstract methods (feel free to suggest)ids_to_keep
orids_to_remove
is to be decided as we learn more on the performance implications in a dask mergeThe text was updated successfully, but these errors were encountered: