Elegant n-gram generation in Python (locallyoptimal.com)
ngrams' :: Int -> [b] -> [[b]]
ngrams' n = filter ((==) n . length) . map (take n) . tails
ghci session:
λ> let ngrams' n = filter ((==) n . length) . map (take n) . tails
λ> inputList
["all","this","happened","more","or","less"]
λ> tails inputList
[["all","this","happened","more","or","less"],["this","happened","more","or","less"],["happened","more","or","less"],["more","or","less"],["or","less"],["less"],[]]
λ> map (take 2) (tails inputList)
[["all","this"],["this","happened"],["happened","more"],["more","or"],["or","less"],["less"],[]]
λ> filter ((==) 2 . length) (map (take 2) (tails inputList))
[["all","this"],["this","happened"],["happened","more"],["more","or"],["or","less"]]
Pointfree Haskell code often ends up being a lot like piping together Unix commands. Except typed and pure. With global type inference, if you have a target type in mind, you can slap functions together, query the result's type with :t in ghci, and see if it looks like what you wanted.
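For anyone reading along in Python, here is a rough sketch of the same tails / take / filter pipeline. The helper names `tails` and `ngrams_via_tails` are made up for illustration, and this assumes a sliceable sequence (list, tuple, string) rather than an arbitrary iterable.

```python
def tails(seq):
    """Yield every suffix of seq, longest first, ending with the empty suffix."""
    for start in range(len(seq) + 1):
        yield seq[start:]


def ngrams_via_tails(seq, n):
    """Mirror the Haskell pipeline: tails, then take n, then keep full-length windows."""
    windows = (suffix[:n] for suffix in tails(seq))  # map (take n) . tails
    return [w for w in windows if len(w) == n]       # filter ((==) n . length)


input_list = ["all", "this", "happened", "more", "or", "less"]
print(ngrams_via_tails(input_list, 2))
# [['all', 'this'], ['this', 'happened'], ['happened', 'more'], ['more', 'or'], ['or', 'less']]
```

Like the Haskell version, the short windows near the end of the list are generated first and then filtered out.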
Here's a generator that yields ngrams from an arbitrary iterable:
from collections import deque
from itertools import islice

def ngram_generator(iterable, n):
    iterator = iter(iterable)
    # Pre-fill the window with the first n-1 items; maxlen=n drops the oldest item on each append.
    d = deque(islice(iterator, n-1), maxlen=n)
    for item in iterator:
        d.append(item)
        yield tuple(d)
I would've just written:
def find_bigrams(input_list):
    ngrams = []
    last_word = '-EOL-'  # sentinel that gets paired with the first word
    for word in input_list:
        ngrams.append((last_word, word))
        last_word = word
    return ngrams
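For comparison, a quick sketch that runs both snippets from this thread on the same short list. The sample list `words` is made up for illustration. The difference worth noticing is that find_bigrams emits a leading ('-EOL-', first word) pair, while the deque-based generator only yields windows made of real items.

```python
from collections import deque
from itertools import islice


def ngram_generator(iterable, n):
    iterator = iter(iterable)
    # Window starts with n-1 items; each append completes (and later slides) it.
    d = deque(islice(iterator, n - 1), maxlen=n)
    for item in iterator:
        d.append(item)
        yield tuple(d)


def find_bigrams(input_list):
    ngrams = []
    last_word = '-EOL-'  # sentinel paired with the first word
    for word in input_list:
        ngrams.append((last_word, word))
        last_word = word
    return ngrams


words = ["all", "this", "happened"]  # illustrative input
print(list(ngram_generator(words, 2)))
# [('all', 'this'), ('this', 'happened')]
print(find_bigrams(words))
# [('-EOL-', 'all'), ('all', 'this'), ('this', 'happened')]
```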
I have trouble seeing the requirement to generalize to arbitrary n as important... If the data is big enough to want n >= 4, it's probably large enough that you'll write this in another language anyway. And n is unlikely ever to be larger than 5.
And n of quite large degree is not uncommon in hardcore natural language processing or bioinformatics, both of which Python (wrapping NumPy and SciPy, usually) is heavily used for.
For instance, written Chinese doesn't put delimiters between words (all the characters are packed together), which means you usually end up taking character n-grams (of potentially large degree), doing a lot of lookups into a dictionary and a language model, and seeing if you can get everything to "fit" so that all characters are accounted for and the resulting sentence makes sense.
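A toy sketch of the "make everything fit" idea, under heavy assumptions: the function name `segmentations`, the `max_len` cutoff, and the six-entry dictionary are all invented for illustration, and there is no language model here, so nothing ranks the candidates. It just tries every character n-gram at the current position, looks it up, and keeps the splits that account for every character.

```python
def segmentations(text, dictionary, max_len):
    """Return every way to split text into dictionary words (toy version, no scoring)."""
    if not text:
        return [[]]  # one way to segment nothing: the empty split
    results = []
    # Try each character n-gram that starts at position 0, up to max_len characters.
    for n in range(1, min(max_len, len(text)) + 1):
        candidate = text[:n]
        if candidate in dictionary:
            for rest in segmentations(text[n:], dictionary, max_len):
                results.append([candidate] + rest)
    return results


# Hypothetical toy dictionary; a real system would use a large lexicon.
toy_dictionary = {"北京", "大学", "北京大学", "大学生", "学生", "生"}
print(segmentations("北京大学生", toy_dictionary, max_len=4))
# [['北京', '大学', '生'], ['北京', '大学生'], ['北京大学', '生']]
```

Three segmentations cover all five characters, which is where a language model would come in to pick the one that makes sense. The plain recursion is exponential in the worst case, so treat it as an illustration rather than something to run on real text.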
But as far as word n-grams go, I've been doing NLP research for over ten years, and you almost never want 4- or 5-grams, let alone n-grams of greater length. The data's simply too sparse to be useful. So it's really a matter of generating bigrams and generating trigrams, for which I think it's reasonable to have separate functions.
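If you do go the separate-functions route, one possible sketch uses zip over shifted slices. The names `bigrams` and `trigrams` are just illustrative, and this assumes the words are already in a list (slicing won't work on a generator).

```python
def bigrams(words):
    """Pair each word with its successor."""
    return list(zip(words, words[1:]))


def trigrams(words):
    """Group each word with its next two successors."""
    return list(zip(words, words[1:], words[2:]))


input_list = ["all", "this", "happened", "more", "or", "less"]
print(bigrams(input_list))
# [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
print(trigrams(input_list))
# [('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
```

zip stops at the shortest slice, so incomplete windows at the end are dropped without any extra filtering.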