Unifying Deduplication API Modules #516

praateekmahajan · 2025-02-04T22:01:38Z

SemDedup currently outputs "ids to keep", while Fuzzy/Exact dedup output "ids to remove". We should expose the same API across all of our deduplication modules.
Should the deduplicator modules have identify, remove, and identify_and_remove methods?
Should __call__ behave like identify_and_remove, so that advanced users who need to configure which duplicate in a group to keep (for exact / fuzzy) can call identify and remove separately?
Rename / remove the current (Fuzzy)Duplicates in favor of a (Fuzzy)Deduplicator that has both methods.

Architectural Design

A base class called BaseDeduplicator that declares the abstract methods (feel free to suggest).
By default, Exact and Fuzzy will keep one randomly chosen document from each "matched" group; users who have opinions on which duplicate to keep can split the call into identify and remove.
Whether we output ids_to_keep or ids_to_remove is to be decided as we learn more about the performance implications in a Dask merge.
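
A minimal sketch of what such a base class could look like, assuming plain Python lists of dicts stand in for the real dataset type. The names BaseDeduplicator, identify, remove, and identify_and_remove come from the proposal above; ExactDeduplicator's body, the id_field and text_field parameters, and the keep-first policy (instead of a random pick) are illustrative assumptions, not existing library code.

```python
from abc import ABC, abstractmethod
import hashlib


class BaseDeduplicator(ABC):
    """Hypothetical base class; method names follow the proposal above."""

    def __init__(self, id_field="id", text_field="text"):
        # Field names are assumptions made for this sketch.
        self.id_field = id_field
        self.text_field = text_field

    @abstractmethod
    def identify(self, docs):
        """Return the set of document ids to remove."""

    def remove(self, docs, ids_to_remove):
        """Drop every document whose id is in ids_to_remove."""
        return [d for d in docs if d[self.id_field] not in ids_to_remove]

    def identify_and_remove(self, docs):
        return self.remove(docs, self.identify(docs))

    def __call__(self, docs):
        # The default entry point behaves like identify_and_remove.
        return self.identify_and_remove(docs)


class ExactDeduplicator(BaseDeduplicator):
    def identify(self, docs):
        seen = set()
        duplicates = set()
        for d in docs:
            digest = hashlib.sha256(d[self.text_field].encode("utf-8")).hexdigest()
            if digest in seen:
                # Keep the first document of each group, flag the rest.
                duplicates.add(d[self.id_field])
            else:
                seen.add(digest)
        return duplicates


docs = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "world"},
    {"id": 3, "text": "hello"},
]
dedup = ExactDeduplicator()
deduplicated = dedup(docs)
```

A user who wants a different keep policy could call identify, adjust the returned set, and then pass it to remove.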

sarahyurick · 2025-02-10T21:38:31Z

I think it is okay to mark this as done. I see what you are saying about semantic dedupe, but since the removal logic is already baked in I think it is fine as is. Plus, once perform_removal=True is the default for exact and fuzzy dedupe, all 3 modules will be the same from a user perspective.

Only other thing I can think of is if you want to change the names of the compute_semantic_match_dfs and extract_dedup_data functions within the SemanticClusterLevelDedup class, I think that could be nice.

praateekmahajan · 2025-02-10T22:41:05Z

> all 3 modules will be the same from a user perspective.

I don't think that's true, since IIRC Semantic returns "doc ids to keep" rather than "documents to keep" (the difference being that only the id column is returned, rather than the id and text columns). This means we need to perform an inner join with the original dataset to get the "same" behavior.
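
To make the inner join concrete, here is a small pandas sketch. The column names "id" and "text" and the toy data are assumptions for illustration; the real modules operate on Dask dataframes, where merge takes the same on and how arguments.

```python
import pandas as pd

# Toy dataset; the column names are assumptions for this sketch.
docs = pd.DataFrame(
    {"id": [1, 2, 3, 4], "text": ["a", "b", "a (near copy)", "c"]}
)

# What a semantic-dedup-style step hands back: only the ids to keep.
ids_to_keep = pd.DataFrame({"id": [1, 2, 4]})

# An inner join against the original dataset recovers the full rows.
kept_docs = docs.merge(ids_to_keep, on="id", how="inner")
```

After the join, kept_docs carries the text column again, which is what exact and fuzzy dedup would return once removal is performed.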

sarahyurick · 2025-02-10T22:49:07Z

Oh I see, I didn't realize the other columns aren't kept. Then maybe adding an extra function to the semantic dedupe class?

praateekmahajan · 2025-02-10T23:35:13Z

> Then maybe adding an extra function to the semantic dedupe class

I think what we're trying to achieve here is a unified API (which can also be interpreted as "the call should achieve the same results").

That said, an inner join might be more "parallelizable" than a left-anti join, so if we're okay with "achieving the same results" while having a "differing API", sem dedup could have identify_documents_to_keep (vs. identify_duplicates in exact / fuzzy), and "remove" stays the same except that it does an inner join.

We could benchmark which is faster: assuming sem dedup finds ~30% dupes in a dataset, is a left-anti join against the 30% to remove faster than an inner join against the 70% to keep? Then we can decide the API accordingly.
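
A rough harness for that comparison might look like the sketch below. It uses pandas on a single machine with a toy row count, so the timings say nothing about a distributed Dask merge; it only shows the two join shapes and checks that they keep the same rows. The helper names inner_join, left_anti_join, and timed are hypothetical.

```python
import time

import pandas as pd

N = 100_000  # toy size; a real benchmark would run on Dask at dataset scale
docs = pd.DataFrame({"id": range(N), "text": ["x"] * N})

# Assume ~30% of ids are flagged as duplicates, as in the comment above.
cutoff = N * 3 // 10
ids_to_remove = pd.DataFrame({"id": range(0, cutoff)})
ids_to_keep = pd.DataFrame({"id": range(cutoff, N)})


def inner_join(docs, ids_to_keep):
    return docs.merge(ids_to_keep, on="id", how="inner")


def left_anti_join(docs, ids_to_remove):
    # Emulate a left-anti join with a left join plus the merge indicator.
    merged = docs.merge(ids_to_remove, on="id", how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


kept_inner, t_inner = timed(inner_join, docs, ids_to_keep)
kept_anti, t_anti = timed(left_anti_join, docs, ids_to_remove)
print(f"inner join: {t_inner:.4f}s, left-anti join: {t_anti:.4f}s")
```

Varying the duplicate fraction would show whether the smaller side of the join changes which approach wins.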

praateekmahajan added the "enhancement" (New feature or request) label on Feb 4, 2025
