Dictionary of Algorithms and Data Structures (1998) (xlinux.nist.gov)

222 points by nullgeo 9 years ago | 18 comments

almata 9 years ago

Just to mention one that's often overlooked, I love the simplicity and usefulness of the Levenshtein distance: https://xlinux.nist.gov/dads/HTML/Levenshtein.html

I once implemented it in a typical scenario where salespeople had to look up a client whose name could have been entered in several ways:

1/ That Client With A Strange Name, Ltd.
2/ The Client With Strange Name, Ltd.
3/ That Client With A Strange Name
4/ [etc]

It worked really well and prevented a lot of duplicate entries.
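For anyone who hasn't seen it, here is a minimal sketch of the idea in Python. The function names, the lowercasing step, and the `max_distance` cutoff are my own assumptions for illustration, not what the original system did.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_clients(query, clients, max_distance=5):
    """Hypothetical helper: existing client names within max_distance
    of the query, nearest first. Comparison is case-insensitive."""
    scored = [(levenshtein(query.lower(), name.lower()), name) for name in clients]
    return sorted((d, name) for d, name in scored if d <= max_distance)


if __name__ == "__main__":
    existing = [
        "That Client With A Strange Name, Ltd.",
        "Some Other Company, Inc.",
    ]
    # Shows near-duplicates before a new record is created.
    print(closest_clients("The Client With Strange Name, Ltd.", existing))
```

Note that this scans every existing name on each lookup, which is fine for a modest client list.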

lstamour 9 years ago

See also: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-... (from a few years back)

As if to prove its utility, https://xlinux.nist.gov/dads/HTML/Levenshtein.html lists most of the same algorithms under "See also".

todd8 9 years ago

For name matching there is also an old algorithm called Soundex (first patented in 1918). It is well described in Knuth's The Art of Computer Programming, Vol. 3: Sorting and Searching. It's a simple algorithm, so the Wikipedia page is enough: https://en.wikipedia.org/wiki/Soundex#cite_ref-10
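To give a feel for how simple it is, here is a sketch of the common "American Soundex" variant in Python. It assumes plain ASCII names; any other letter is treated like a vowel, which is my simplification and not part of the original algorithm.

```python
# Letter groups for digits 1 through 6; vowels, y, h and w have no code.
_GROUPS = ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
_CODES = {ch: str(digit)
          for digit, letters in enumerate(_GROUPS, start=1)
          for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")  # the first letter's code still counts
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code is kept
    return result.ljust(4, "0")  # pad short codes with zeros


if __name__ == "__main__":
    for n in ["Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"]:
        print(n, soundex(n))
```

Names that sound alike in English, such as Robert and Rupert, map to the same code, so the code can be stored and indexed like any other column.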

cema 9 years ago

Soundex was designed for a very specific purpose. It is very culture-dependent and, in my experience, works very poorly in most practical applications related to matching names.

amelius 9 years ago

Yes, but the downside is that (afaik) you can't use an index to quickly retrieve the matches in order. You really have to scan your complete dataset on every search.

On a related note, does anybody know of a good JavaScript implementation of a 3-way merge of strings, and perhaps also of JSON-like structures?

peff 9 years ago

You can store the values in a trie (e.g., with one node per character in the string). Exact lookup in the trie is O(string_length), like a hash table. Inexact lookup can similarly walk the tree, but explore side branches within a certain budget.

So if your query is "abc", you'd follow the node for "a", then the one for "b", but _also_ go straight from "a" to the node for "c", at a cost of 1 (because dropping the "b" incurs an edit distance of 1).
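A rough sketch of that budgeted walk in Python. It carries one row of the Levenshtein table down each branch and abandons a branch once every entry in the row exceeds the budget. That is one common way to do it and only my assumption about the details; `TrieNode`, `insert`, and `fuzzy_search` are made-up names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.word = None    # set when a stored string ends at this node


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word


def fuzzy_search(root, query, budget):
    """Return (word, distance) for stored words within `budget` edits of query."""
    results = []
    first_row = list(range(len(query) + 1))  # distance from "" to query prefixes
    if root.word is not None and first_row[-1] <= budget:
        results.append((root.word, first_row[-1]))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, budget, results)
    return sorted(results, key=lambda r: (r[1], r[0]))


def _walk(node, ch, query, prev_row, budget, results):
    # Extend the table by one row for the character on this trie edge.
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        row.append(min(
            row[i - 1] + 1,                          # skip a query character
            prev_row[i] + 1,                         # extra character in the trie
            prev_row[i - 1] + (query[i - 1] != ch),  # match or substitute
        ))
    if node.word is not None and row[-1] <= budget:
        results.append((node.word, row[-1]))
    if min(row) <= budget:  # otherwise nothing below this node can match
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, budget, results)


if __name__ == "__main__":
    root = TrieNode()
    for w in ["abc", "ac", "abd", "abcd", "xyz"]:
        insert(root, w)
    print(fuzzy_search(root, "abc", budget=1))
```

The pruning check is what avoids the full scan: a subtree like the one under "x" is dropped after a couple of characters.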

lorenzhs 9 years ago

There is a variety of approximate string matching algorithms that speed up search by using an index. https://arxiv.org/abs/1008.1191 is one that should be fairly easy to implement. Levenshtein automata are another approach that makes the rounds on HN every now and then, but they are a tough beast to implement, and I wouldn't really recommend them in practice.

justin66 9 years ago

> Yes, but the downside is that (afaik) you can't use an index to quickly retrieve the matches in order. You really have to scan your complete dataset on every search.

I'm not sure I see what you're driving at there. If you had a finite set of strings that you might have to compare, you could (for example) populate a graph or something with weighted edges representing the Levenshtein distance between strings (vertices). Offhand, it seems like your search could basically use a hash table to find the position of the vertex representing your string on an already-populated adjacency list.

It'd be big, but in reality you'd probably only populate the edges with especially high or low weights, depending on the application?
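Something like this sketch in Python, where a dict plays the role of the hash table over an adjacency list. The names and the `max_distance` cutoff are hypothetical, and it assumes the strings are unique.

```python
from itertools import combinations


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def build_neighbor_graph(strings, max_distance):
    """Map each string to its (neighbor, distance) edges, keeping only
    edges whose weight is at most max_distance."""
    graph = {s: [] for s in strings}
    for a, b in combinations(strings, 2):  # every unordered pair, once
        d = levenshtein(a, b)
        if d <= max_distance:
            graph[a].append((b, d))
            graph[b].append((a, d))
    return graph


if __name__ == "__main__":
    graph = build_neighbor_graph(["abc", "abd", "abx", "xyz"], max_distance=1)
    print(graph["abc"])  # a single hash lookup at query time
```

Building it compares every pair of strings, and a lookup only works for a string that is already a vertex, so this fits the "finite set of strings" case rather than arbitrary queries.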

jonathanstrange 9 years ago

I often look up algorithms there; it's a great resource. Unfortunately, you cannot search for parallel algorithms specifically, and you need to know what you're looking for anyway.

I was wondering whether someone could recommend similar resources (or books) for parallel algorithms. These are still very underrepresented on websites and in books, often treated as an afterthought or mentioned only in passing.

Any recommendations?

0xmohit 9 years ago

"Algorithm Design: Parallel and Sequential" [0] -- developed at Carnegie Mellon University.

[0] http://www.parallel-algorithms-book.com/

justin66 9 years ago

Joseph JaJa's An Introduction to Parallel Algorithms is still worthwhile if you've never seen it.