Head to Head Comparison of Text Extraction Algorithms (readwriteweb.com)
1) Remove any sort of remaining inline HTML tags
2) Remove all punctuation characters
3) Remove all control characters
4) Remove all non-ASCII characters (due to unreliable information of the document encoding)
5) Normalize to lowercase
6) Split on whitespace
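For anyone who wants to see what those six steps do to a sentence, here is a rough Python sketch. It is my own reading of the list, not the article author's code: the function name `normalize`, the tag regex, replacing tags with a space rather than nothing, and keeping whitespace characters during the control-character step (so that step 6 still has something to split on) are all assumptions.

```python
import re
import string

# Assumption: a "tag" is anything between < and >; real HTML is messier.
TAG_RE = re.compile(r"<[^>]+>")


def normalize(text: str) -> list[str]:
    """Apply the six quoted normalization steps and return the tokens."""
    # 1) Remove remaining inline HTML tags (replaced by a space here,
    #    so words on either side of a tag are not glued together).
    text = TAG_RE.sub(" ", text)
    # 2) Remove all ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3) Remove control characters, but keep whitespace for step 6.
    text = "".join(
        ch for ch in text
        if ch.isspace() or (ord(ch) >= 32 and ord(ch) != 127)
    )
    # 4) Remove all non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    # 5) Normalize to lowercase.
    text = text.lower()
    # 6) Split on whitespace.
    return text.split()


if __name__ == "__main__":
    print(normalize("Let's eat, Grandma!"))
    print(normalize("lets eat grandma"))
```

Under these assumptions the two inputs in the usage example come out as the same token list, which shows how much punctuation and case information the normalization throws away before anything is compared.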
This seems to me a case of measuring what is easy to measure rather than measuring what is right. What would the author think about adding a rule to 'remove all vowels' or 'arbitrarily split words'? Yet he happily removes meaning and context in the form of punctuation and case. If the underlying text extraction algorithms are not similarly handicapped, then one or more of them might be a better standard of measurement than the one the author applies. Rather like measuring the accuracy of an atomic clock by using a rusty stopwatch.
Edit: Oh, I found it out myself. There's a link on the page to post the story on Hacker News. So it seems like the URL fragment was added by http://www.addthis.com