How to Write a Spelling Corrector (2007)(norvig.com) |
How to Write a Spelling Corrector (2007)(norvig.com) |
The CD-ROM spelling corrector was not really great, BTW, but at least it replied in 1s on a typical end-user PC.
Edit: this was late 1980s.
Ken Kocienda's book Creative Selection has a very good chapter on the algorithm being built piece by piece, but finding out words being created by surrounding keys was part of collecting all the candidates.
They even used this in marketing. One of the pre-original-iPhone-launch videos was focused just on the on-screen keyboard (probably because almost everyone thought it was a really kooky idea at the time), and used the example of explicitly pressing "O-U-Z-Z-A" but still getting "pizza" as the autocomplete because it was the closest recommendation.
One of the iOS versions a few years ago became incredibly fond of including the space bar and considering alternatives with slightly off key presses near the space bar split into two or more words. When you're using a language with a lot of compound words like Swedish, this yielded some almost postmodern corrections with one or more words often completely different (but close on the keyboard, of course). I don't know if this was a tweak to the manual algorithm going off the rails or an AI version that wasn't quite tuned well yet.
[1] https://ai.googleblog.com/2017/05/the-machine-intelligence-b...
On ios I just had to turn off autocorrect, because I had reminders I jotted down quickly which missed the physical correction and were irretrievably corrected semantically into garbage.
now I find a note with "spekk" and I can figure out I meant "spell" (need a better example)
Also the NLP Book on the data side https://web.stanford.edu/~jurafsky/slp3/
Here's an impl of some kind: https://github.com/crisbal/hmm-spellcheck
There also exists research on solving this problem unsupervised which basically invents new word boundaries for a language (remember that spoken languages doesn’t have word boundaries - it was invented for writing and strictly speaking, current spelling isn’t the only way to solve word boundaries for a given language)
https://github.com/grantjenks/python-wordsegment https://github.com/InstantDomain/instant-segment (rust)
I used the latter to process pdf plain text output quite successfully.
look at the Speech and Language processing book, particularly chapter 3 about language models https://web.stanford.edu/~jurafsky/slp3/
You can implement a language model based on character n-grams to calculate whether a sequence is more likely with or without a space. Of course you would need a way of estimating the proability of each sequence, which means you need a corpus to train your language model on.
This version assumes non-dictionary "words" have probability 0. But it's the same basic idea as a fancier answer, and it's quick.
e.g.
logp(_, "") = 0
logp(word0, text) = max(logp_bigram(word0, word1) + logp(word1, rest) for word1, rest in prefix_words(text))
import collections
import math
import heapq
with open('count_1w.txt', 'r') as f:
unigrams = [l.split() for l in f]
unigram_map = collections.defaultdict(lambda: 0)
for word, count in unigrams:
unigram_map[word] = int(count)
with open('count_2w.txt', 'r') as f:
bigrams = [l.split() for l in f]
bigram_map = collections.defaultdict(lambda: collections.defaultdict(lambda: {}))
for word0, word1, count in bigrams:
bigram_map[word0][word1] = int(count)
log_p_unseen = collections.defaultdict(lambda: 0.)
for word0, counts in bigram_map.items():
for word1, count in counts.items():
unigram_map[word1] += count
total = sum(counts.values())
#smoothing for unseen words
mn, mx = min(counts.values()), max(counts.values())
if mn == mx:
p_unseen = 0.5
else:
#geometric series approximation
r = (mn / mx) ** (1. / (len(counts) - 1))
n = mn * r / (1. - r)
p_unseen = n / (n + total)
log_p_unseen[word0] = math.log(p_unseen)
c = (1. - p_unseen) / total
bigram_map[word0] = {word1: math.log(c * count) for word1, count in counts.items()}
c = 1. / sum(unigram_map.values())
unigram_map = {word: math.log(c * count) for word, count in unigram_map.items()}
max_len = max(map(len, unigram_map))
def optimal_parse(text):
word_spans = {j: [] for j in range(len(text) + 1)}
for i in range(len(text)):
for j in range(i + 1, min(i + max_len, len(text)) + 1):
if text[i:j] in unigram_map:
word_spans[i].append(j)
min_cost = collections.defaultdict(lambda: float('inf'))
parent = {}
queue = [(0., 0, 0)]
while queue:
cost, i, j = heapq.heappop(queue)
if cost > min_cost[(i, j)]:
continue
if j == len(text):
break
if j == 0:
word0 = '<s>'
else:
word0 = text[i:j]
for k in word_spans[j]:
word1 = text[j:k]
if word1 in bigram_map[word0]:
word1_cost = -bigram_map[word0][word1]
else:
#It would technically be more correct to normalize the unigram probability only over unseen words.
word1_cost = -(log_p_unseen[word0] + unigram_map[word1])
cost1 = cost + word1_cost
if cost1 < min_cost[(j, k)]:
min_cost[(j, k)] = cost1
parent[(j, k)] = i
heapq.heappush(queue, (cost1, j, k))
words = []
while j != 0:
words.append(text[i:j])
i, j = parent.get((i, j)), i
return words[::-1]
print(optimal_parse('ivehadathoughtandamcurioushowpeoplewouldsolveit'))and tried to find which sentence without spaces matches your sentence