Show HN: Fuzzy deduplicate any CSV using vector embeddings (app.dedupe.it)

5 points by remolacha 1 year ago | 5 comments

I made an app to fuzzy-deduplicate my Google Sheets and CRM records.

- No manual configuration required

- Works out of the box on most data types (e.g. people, companies, product catalogs)

Implementation details:

- Embeds records using an E5-family model

- Performs similarity search using DuckDB w/ vector similarity extension

- Does last-mile comparison and merges duplicates using Claude
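
A minimal sketch of the first two stages listed above, under loud assumptions: `embed` is a toy hashed-trigram function standing in for the E5-family model, and `nearest_neighbors` is a brute-force cosine search standing in for DuckDB's vector similarity search. Neither is taken from the repo, the sample records are made up, and the Claude comparison step is left out entirely.

```python
import math
import zlib


def embed(text: str, dim: int = 256) -> list[float]:
    """Toy stand-in for an embedding model: hashed character trigrams, L2-normalized."""
    padded = f"  {text.lower().strip()}  "
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def nearest_neighbors(vectors: list[list[float]], index: int, k: int) -> list[int]:
    """Brute-force top-k by cosine similarity (vectors are unit length, so dot = cosine)."""
    query = vectors[index]
    scored = [
        (sum(a * b for a, b in zip(query, other)), j)
        for j, other in enumerate(vectors)
        if j != index
    ]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


# Made-up rows, each flattened to one string before embedding.
records = [
    "Acme Corporation, 12 Main St, Springfield",
    "ACME Corp., 12 Main Street, Springfield",
    "Globex Inc, 9 Ocean Ave, Shelbyville",
    "Initech LLC, 400 Office Park, Austin",
]
vectors = [embed(r) for r in records]

# Each record gets a short list of candidate duplicates for the later LLM check.
candidates = {i: nearest_neighbors(vectors, i, k=1) for i in range(len(records))}
```

With a real embedding model the neighbors would reflect meaning rather than shared character sequences; the shape of the pipeline (embed, search, shortlist) is the part this sketch is meant to show.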

Demo video: https://youtu.be/7mZ0kdwXBwM

GitHub repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it

Background story: My company has a table for tracking leads, which includes website visitors, demo form submissions, app signups, and manual entries. It’s full of duplicates. And writing formulas to merge those dupes has been a massive PITA.

I figured that an LLM could handle any data shape and give me a way to deal with tricky custom rules like “treat international subsidiaries as distinct from their parent company”.

The challenging thing was avoiding an NxN comparison matrix. The solution I came up with was first narrowing down our search space using vector embeddings + semantic similarity search, and then using a generative LLM only to compare a few nearest neighbors and merge.
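
To put rough numbers on the NxN problem, here is a small arithmetic sketch. The values of `n` and `k` are illustrative assumptions, not figures from the app.

```python
def all_pairs(n: int) -> int:
    """Number of unordered pairs when every record is compared with every other."""
    return n * (n - 1) // 2


def candidate_pairs_upper_bound(n: int, k: int) -> int:
    """At most k LLM comparisons per record when only k nearest neighbors are checked."""
    return n * k


n, k = 10_000, 5  # assumed table size and neighbor count
print(all_pairs(n))                       # 49995000
print(candidate_pairs_upper_bound(n, k))  # 50000
```

The LLM workload grows linearly in the number of records instead of quadratically, at the cost of missing any duplicate that falls outside a record's k nearest neighbors.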

Some cool attributes of this approach:

- Can work incrementally (no reprocessing the entire dataset)

- Allows processing all records in parallel

- Composes with deterministic dedupe rules
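
One way these properties could fit together, sketched with a disjoint-set (union-find) structure. This is an assumption about how such a pipeline might be organized, not the repo's code: the `leads` data, the exact-email rule, and the `confirmed_fuzzy_pairs` list (imagined output of the embedding + LLM step) are all hypothetical.

```python
class DisjointSet:
    """Union-find: tracks which record ids belong to the same merged cluster."""

    def __init__(self) -> None:
        self.parent: dict[int, int] = {}

    def find(self, x: int) -> int:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


clusters = DisjointSet()
first_seen_email: dict[str, int] = {}


def add_lead(lead_id: int, lead: dict) -> None:
    """Deterministic rule: identical normalized, non-empty emails are the same lead."""
    clusters.find(lead_id)  # register the id even if no rule fires
    email = lead["email"].strip().lower()
    if email:
        clusters.union(first_seen_email.setdefault(email, lead_id), lead_id)


def groups(ids) -> list[list[int]]:
    by_root: dict[int, list[int]] = {}
    for i in ids:
        by_root.setdefault(clusters.find(i), []).append(i)
    return sorted(by_root.values())


leads = {
    1: {"name": "Jane Doe", "email": "Jane.Doe@Example.com"},
    2: {"name": "J. Doe", "email": "jane.doe@example.com "},
    3: {"name": "Jane Doe (Acme)", "email": ""},
    4: {"name": "Bob Roe", "email": "bob@example.org"},
}
for lead_id, lead in leads.items():
    add_lead(lead_id, lead)

# Hypothetical pairs confirmed by the fuzzy (embedding + LLM) step.
confirmed_fuzzy_pairs = [(1, 3)]
for a, b in confirmed_fuzzy_pairs:
    clusters.union(a, b)
initial = groups(leads)

# Incremental: a new lead only touches its own cluster; nothing is reprocessed.
leads[5] = {"name": "Robert Roe", "email": "BOB@example.org"}
add_lead(5, leads[5])
updated = groups(leads)
```

Because deterministic and fuzzy matches both reduce to `union` calls, the order in which they arrive does not change the final clusters, which is what makes the rules composable and the per-record work easy to run in parallel.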

Lmk any feedback on how to make this better!

K0IN 1 year ago

This is very interesting. I was building something similar, but I used https://github.com/K0IN/string-embed (embeddings based on a distance function, Levenshtein in my case) to generate embeddings for deterministic matching.

I will follow your project; I'm interested in your ANN search speeds :)

remolacha 1 year ago

Very cool :) I initially tried something like this, but had trouble getting reliable results without tuning my distance functions to the specific schema & domain. Did you find a way around that?

K0IN 1 year ago

No, I tuned a model on my (unique) table data, which does not take long, since the model is small.

In my tests, at least, my model seemed to hold up well enough, since it's only used as a preselect to find "good enough" candidates to compare with Levenshtein later on.

But yes, a universal model (maybe a fine-tuned transformer / embedding model) might be better, but I have not had the time (or the knowledge) to build one yet.
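
The preselect-then-verify idea can be sketched without a trained model. The block below swaps in a much simpler technique than string-embed: a character-count vector, whose L1 distance is at most twice the Levenshtein distance (each edit changes the counts by at most 2), so it can discard candidates cheaply without ever dropping a true match. The function names and sample strings are made up for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Exact edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def count_vector_distance(a: str, b: str) -> int:
    """L1 distance between character-count vectors; never exceeds 2 * levenshtein(a, b)."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())


def fuzzy_matches(query: str, corpus: list[str], max_distance: int) -> list[str]:
    # Cheap preselect: anything beyond the bound cannot be within max_distance.
    shortlist = [s for s in corpus if count_vector_distance(query, s) <= 2 * max_distance]
    # Exact check only on the survivors.
    return [s for s in shortlist if levenshtein(query, s) <= max_distance]


corpus = ["kitten", "sitting", "mitten", "bitten", "knitting", "tinket", "zebra"]
matches = fuzzy_matches("kitten", corpus, max_distance=1)
print(matches)  # ['kitten', 'mitten', 'bitten']
```

Note that "tinket" passes the preselect (it is an anagram, so its count vector is identical) and is only rejected by the exact Levenshtein check, which is why the second stage is still needed.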

DigiFreeze 1 year ago

High-key useful! Are you thinking of making a Google Sheets extension? How are you thinking about data privacy? Any plans to make a local-only app?

remolacha 1 year ago

Thanks! Yeah, we'd do a GSheet extension if there's enough interest. Privacy-wise, we don't store any data. Local-only isn't a priority, but it should be easy to self-host if you take a look at the GitHub README.