How to Write a Spelling Corrector (2007)

How to Write a Spelling Corrector (2007)(norvig.com)

76 points by fybs 4 years ago | 23 comments

Joker_vD 4 years ago |

And after you've read that, here's a related blogpost: "A Spellchecker Used to Be a Major Feat of Software Engineering" [0], because Python being "fast enough" and having enough memory for large dictionaries hasn't always been the case.

[0] https://prog21.dadgum.com/29.html

tgv 4 years ago | |

I'll repeat my feat in this area: a spelling corrector that had 32kB memory (because it was tied to a keyboard trap, or something like that) and could do four 8kB reads from CD-ROM before telling the user. It contained not only orthographic similarity (the actual letters), but also phonetic similarity. You rarely see that in older English spell checkers, because it's such an irregular language, but for other languages it's quite informative, although in extremely regular languages, such as Spanish, you could also put it in the weighting/probability model (e.g. v<->b is a pretty common confusion).

The CD-ROM spelling corrector was not really great, BTW, but at least it replied in 1s on a typical end-user PC.

Edit: this was late 1980s.

mtreis86 4 years ago |

One of my most common spelling mistakes is physical mistypes on the keyboard, yet no spell checker seems to account for keyboard layout and locality of keys, or for something like my hand being one position off on the board but typing all the keys relatively correct only positionally shifted.

wootest 4 years ago | |

Many designs does this implicitly. At least one, the original iPhone keyboard autocomplete/suggestion algorithm, did it explicitly.

Ken Kocienda's book Creative Selection has a very good chapter on the algorithm being built piece by piece, but finding out words being created by surrounding keys was part of collecting all the candidates.

They even used this in marketing. One of the pre-original-iPhone-launch videos was focused just on the on-screen keyboard (probably because almost everyone thought it was a really kooky idea at the time), and used the example of explicitly pressing "O-U-Z-Z-A" but still getting "pizza" as the autocomplete because it was the closest recommendation.

One of the iOS versions a few years ago became incredibly fond of including the space bar and considering alternatives with slightly off key presses near the space bar split into two or more words. When you're using a language with a lot of compound words like Swedish, this yielded some almost postmodern corrections with one or more words often completely different (but close on the keyboard, of course). I don't know if this was a tweak to the manual algorithm going off the rails or an AI version that wasn't quite tuned well yet.

brian_cloutier 4 years ago | |

spell checkers which account for keyboard layout absolutely exist. [1] is one example, but in general it's hard to believe that any trained model would fail to notice that certain mistakes are more likely than others, and it's hard to believe that any spell checker google releases these days would not be based on a trained model.

[1] https://ai.googleblog.com/2017/05/the-machine-intelligence-b...

m463 4 years ago | |

I think the problem is that there are multiple layers of correction - both the physical and the spelling/semantic.

On ios I just had to turn off autocorrect, because I had reminders I jotted down quickly which missed the physical correction and were irretrievably corrected semantically into garbage.

now I find a note with "spekk" and I can figure out I meant "spell" (need a better example)

z3t4 4 years ago | |

The problem is that if you take an english word and change one letter it will likely become another valid word.

tyingq 4 years ago | |

I would guess that the iPhone autocorrect does this implicitly since it's ML trained.

killingtime74 4 years ago |

Those interested in toy implementations in this area might also enjoy this blog https://blog.burntsushi.net/transducers/ on FSMs.

Also the NLP Book on the data side https://web.stanford.edu/~jurafsky/slp3/

pandatigox 4 years ago |

I've had a thought and am curious how people would solve it. Sometimes, if you copy words off a PDF lecture slide, all the words are mashed together (eg. Hello Foo bar → HelloFoobar). Is this an AI domain or can it solved by simple programming?

eigenhombre 4 years ago |

One of the more interesting parts of the post, for me, is the list of implementations in other languages, including: one for Clojure written by that language's author, Rich Hickey; an interesting one in R that clocks in at 2 lines (with a longer, more readable version further down in the linked post); and one written in functional Java. The first one in Awk is also interesting.

graycat 4 years ago |

My favorite, long standard spell checker is Aspell long part of a TeX distribution.

the-smug-one 4 years ago |

Could use a Hidden Markov Model, how to implement: https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf

Here's an impl of some kind: https://github.com/crisbal/hmm-spellcheck

da39a3ee 4 years ago |

From a pedagogical point of view presenting that as a Bayesian model and then using the error model he does, is a bit questionable. But as always his python style is inspiring.

blondin 4 years ago |

just wait for old hn to show up and tell new hn that the famous Norvig's spelling corrector is not efficient and is not teaching people how to do it right.

import collections import math import heapq with open('count_1w.txt', 'r') as f: unigrams = [l.split() for l in f] unigram_map = collections.defaultdict(lambda: 0) for word, count in unigrams: unigram_map[word] = int(count) with open('count_2w.txt', 'r') as f: bigrams = [l.split() for l in f] bigram_map = collections.defaultdict(lambda: collections.defaultdict(lambda: {})) for word0, word1, count in bigrams: bigram_map[word0][word1] = int(count) log_p_unseen = collections.defaultdict(lambda: 0.) for word0, counts in bigram_map.items(): for word1, count in counts.items(): unigram_map[word1] += count total = sum(counts.values()) #smoothing for unseen words mn, mx = min(counts.values()), max(counts.values()) if mn == mx: p_unseen = 0.5 else: #geometric series approximation r = (mn / mx) ** (1. / (len(counts) - 1)) n = mn * r / (1. - r) p_unseen = n / (n + total) log_p_unseen[word0] = math.log(p_unseen) c = (1. - p_unseen) / total bigram_map[word0] = {word1: math.log(c * count) for word1, count in counts.items()} c = 1. / sum(unigram_map.values()) unigram_map = {word: math.log(c * count) for word, count in unigram_map.items()} max_len = max(map(len, unigram_map)) def optimal_parse(text): word_spans = {j: [] for j in range(len(text) + 1)} for i in range(len(text)): for j in range(i + 1, min(i + max_len, len(text)) + 1): if text[i:j] in unigram_map: word_spans[i].append(j) min_cost = collections.defaultdict(lambda: float('inf')) parent = {} queue = [(0., 0, 0)] while queue: cost, i, j = heapq.heappop(queue) if cost > min_cost[(i, j)]: continue if j == len(text): break if j == 0: word0 = '<s>' else: word0 = text[i:j] for k in word_spans[j]: word1 = text[j:k] if word1 in bigram_map[word0]: word1_cost = -bigram_map[word0][word1] else: #It would technically be more correct to normalize the unigram probability only over unseen words. word1_cost = -(log_p_unseen[word0] + unigram_map[word1]) cost1 = cost + word1_cost if cost1 < min_cost[(j, k)]: min_cost[(j, k)] = cost1 parent[(j, k)] = i heapq.heappush(queue, (cost1, j, k)) words = [] while j != 0: words.append(text[i:j]) i, j = parent.get((i, j)), i return words[::-1] print(optimal_parse('ivehadathoughtandamcurioushowpeoplewouldsolveit'))