The Trie: A Neglected Data Structure(toptal.com) |
The Trie: A Neglected Data Structure(toptal.com) |
def prefix_tree(words):
d = dict()
for word in words:
root = d
for c in word:
root = root.setdefault(c, {})
root['$$'] = {} # mark end of word with '$$' token
return d
For example: >>> prefix_tree('foo bar baz'.split())
{'b': {'a': {'r': {'$$': {}},
'z': {'$$': {}}}},
'f': {'o': {'o': {'$$': {}}}}}
To query whether a word is present in a tree you basically compute reduce(dict.__getitem__, word, tree)
and test whether the result (a forest) has an end-of-word '$$' tree.Even then, tries only win if the distribution is suitable to give you the memory efficiency you are hoping for. If you don't have many common prefixes, you're better off with just a hashtable.
You can answer questions like "how many files does this directory have, immediately in it" and also "under it recursively" relatively quickly, and if you use a Patricia trie with a little cleverness in the compression of the text, it needn't take up much space at all even for millions of paths. Whereas a binary tree isn't so great at the memory saving bit.
Anyway, you just got right the point that I was trying to make, a trie is definitely not a general purpouse data structure, but for the specific case (limited number of elements, many common prefixes etc) it performs very well. As for the point of returning the original object, it suffices to store the original object in end-word nodes. I didn't do that in this code because it would have added unnecessary code, since the problem was really just membership and prefixes.
In the specific case, hashtable wouldn't have worked, since it's not sorted and it doesn't solve the original problem.
That is surprising to me.
Personally I'll continue to pronounce it as 'try'. It has, at least, fewer conflicting interpretations in a programming context. When I say 'tree', people will probably assume I mean a binary tree, not a radix tree. So, if I'm not going at 30% light speed, I'll just say that.
Unfortunately, they were highly optimized for 32-bit machines.
A B-Tree is better due to better exploitation of locality of reference in memory.
The Clojure programming language is completely designed around several immutable variants.
Tries are the fastest tree-based data structures for managing strings in-memory, but are space-intensive. The burst-trie is almost as fast but reduces space by collapsing trie-chains into buckets. This is not however, a cache-conscious approach and can lead to poor performance on current processors. In this paper, we introduce the HAT-trie, a cache-conscious trie-based data structure that is formed by carefully combining existing components.
P.S. I love tries!
For a fun example of this last application, there was a recent Google Code Jam problem called "Alien Languages" [1]. My solution [2] basically counts the leaf nodes in the intersection of two prefix trees. (Note that we can compute the count on the fly during the search and need not actually construct the intersection.)
[1] http://code.google.com/codejam/contest/90101/dashboard#s=p0
[2] https://github.com/tmoertel/practice/blob/master/google-code...
but i've used a trie once to good effect to replace a red-black tree with a depth-3 trie whose leaves were red-black trees, trading off the increased space usage for the ability to update the rbtree entries in parallel.
1) Keep an array of size equal to the alphabet size, indexed by the characters, holding pointers to successive nodes.
- but this blows the memory to O(N * alphabet_size) where N is the input size
2) Keep a linked list of pairs (character, pointer).
- but this blows the lookup time to O(word_length * alphabet_size)
3) Keep something like an STL map<character, pointer>.
- still suboptimal complexity of O(word_length * log alphabet_size) + it complicates the code + it dramatically increases the constant time per operation
Or you research a bit more and actually learn a way to properly implement a trie, called a Ternary Search Tree: http://en.wikipedia.org/wiki/Ternary_search_tree
- storing this at byte per char = 12.8MB
- storing this in a basic trie = 12.8 * 64 = 820MB
- storing this in the TST = 12.8 * 3 = 38MB
1. http://ejohn.org/blog/dictionary-lookups-in-javascript/
2. http://ejohn.org/blog/javascript-trie-performance-analysis/
3. http://ejohn.org/blog/revised-javascript-dictionary-search/
Since I sort the strings before insertion and since DAWGs naturally collapse similar sub trees, I get high level of compression (e.g. "eat", "ate", "tea" all become 1 path a-e-t with a list of numbers at the leaf node indicating which permutations of a-e-t are valid). Searching for anagrams of a given string is super fast and trivial now as a path from root to leaf holds all the valid anagrams of that path at the leaf node using permutation-numbers.
http://www.inference.phy.cam.ac.uk/dasher/DasherSummary2.htm...
(Prof. Mackay, its author, uses it to teach information theory and compression, among other things.)
The end result is that university CS programs are run with the assumption that highschool students had zero prior experience with programming.
There's a reason that so many software geeks are hardcore libertarians and Bill Gates was fighting to reform teaching into a meritocracy - the educational system basically ignored their core skill-set and so the students were often self-taught. Obviously schools are playing catch-up now, but you can't change the past experiences of two generations of programmers.
Oh how times have changed.
However, this was back in the days of tiny caches. I suspect they may have lost the edge they had in 2003.
And my apologies - looking back it was Patricia trees I was using as you can take a pointer to a mid-point in the path: useful for disk caches etc.
I was using them for small dictionaries (mainly symbol table-size structures) comparing them against an early super-fast hash (and I recall for really small dictionaries linear search as I naively assumed that a linear search would win for tiny dictionaries - it didn't). In all my tests then (as I say, 2003, on a 3 year old IBM Thinkpad) the tries won.
What's the difference between O(1) and ~O(1)?
I would read it as close to, but maybe not quite O(1).
In secondary school ("high school"), The only computer subject they taught was IT, and people weren't allowed to take it unless they had a special reason not to take a foreign language. It was rumored to be ridiculously easy.
The UK is supposed to be bringing in a programming GCSE - I'll believe it when I see it.
I've never had a problem with people using different pronunciations. It certainly keeps talking about code fresh.
With a set of random strings, for alphabet with 64 chars, there are 16M different 4 character prefixes, so the savings from overlapping prefixes for 128char long strings is likely less than 3%.
Furthermore, using precise set notation would, IMO, make it more difficult to use. A big blob of set notation would not be easy to read for most people.
No. O(n) and O(n)+C and different sets. However, if an algorithm belongs to the first set, it also belongs to the latter. The confusion comes from using the equality sign instead of the "element of" operator.
>Furthermore, using precise set notation would, IMO, make it more difficult to use. A big blob of set notation would not be easy to read for most people.
No, the only difference is that f(x) = O(n) becomes f(x) ∈ O(n).
I recognize pronunciation is more about exceptions than rules, but even so...
Can you state this in a mathematically precise notation?
>Where is the confusion?
The confusion comes from stateents like 'O(n)+C = O(n)', which are actually false. O(n)+C and O(n) are different sets. If a function belongs to the latter set, it also belongs to the former, but that doesn't mean they're equal.
Using the equal sign here is "abuse of notation". 'O(n)+C = O(n)' would seem to implicate that C=0, but that's untrue. C is a constant which can be non-zero. Therefore the whole statement doesn't really make sense unless we assign a new definition to the equal sign.
Using set notation solves all the confusion.
'Implied +C' is pretty much nonsense. There are an infinite number of different sets that a single function belongs to. Just because f(x) belongs to O(n) and therefore to O(n)+C doesn't mean that 'big-O implies +C'. 'big-O' doesn't imply anything like that. '+C' is just a special case, and there's an infinite number of extra 'implies'.
A function f(x) is a member of the set O(n) iff there is some M,x0 such that for all x>x0, |f(x)|<=Mn. This means functions like lg n, n+30, n+1e999, 1e999n + sqrt(n), sin(n), and 7 are all members of the set O(n).
It is difficult to give a formal proof that O(n) + C = O(n) because there isn't really a formal notion of what is meant by O(n) + C. To me, that looks like the result of adding a real constant to a set of functions. It is common to see people do algorithmic analysis and replace terms in a runtime with a set they belong to, in which case an expression like O(n) + C might appear, but what this means is just "some function that is in O(n) plus some constant," and it is simple to prove that the set of all functions that can be described in this way is equal to the set O(n). O(n) is closed under the operation of adding constants, and the "adding constants" operation is its own inverse.
Yes, and no. O(n) isn't a set until you specify the limit that goes with it; for some limits O(n) is different from O(n)+C -- the latter of which is equivalent to O(n+C). However, in CS, O(n) with no specified limit refers by convention to the set O(n) as n->infinity, and O(n) as n->infinity is equal to O(n+C) as n->infinity, because the limit as n->infinity of n is equal to the limit as n->infinity of n+C.
Big-O isn't meant to be precise. The entire point is to roughly describe algorithm behavior.
> 'Implied +C' is pretty much nonsense.
Implied +C is built in to the definition of big-O notation. As n increases towards infinity the influence of C becomes negligible, so it's left off. Read the first chapter of any algorithms book.