Head to Head Comparison of Text Extraction Algorithms

Head to Head Comparison of Text Extraction Algorithms (readwriteweb.com)

40 points by dusano 15 years ago | 4 comments

m_eiman 15 years ago |

Link to actual comparison: http://tomazkovacic.com/blog/122/evaluating-text-extraction-...

vannevar 15 years ago |

The metric used in the comparison seems significantly flawed. It compares each algorithm's output against a reference set of tokens produced by applying the following rules:

1) Remove any sort of remaining inline HTML tags
2) Remove all punctuation characters
3) Remove all control characters
4) Remove all non-ASCII characters (due to unreliable information of the document encoding)
5) Normalize to lowercase
6) Split on whitespace

This seems to me a case of measuring what is easy to measure rather than measuring what is right. What would the author think about adding a rule to 'remove all vowels' or 'arbitrarily split words'? Yet he happily removes meaning and context in the form of punctuation and case. If the underlying text extraction algorithms are not similarly handicapped, then one or more of them might be a better standard of measurement than the one the author applies. Rather like measuring the accuracy of an atomic clock by using a rusty stopwatch.
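
To make the objection concrete, here is a rough sketch of the six quoted rules in Python. It is only my reading of the list, not the author's actual code: the function name `normalize_tokens`, the regex used for tags, and the choice to replace a tag with a space (rather than nothing) are all assumptions.

```python
import re
import string

# Assumption: a simple regex is good enough to drop leftover inline tags.
_TAG_RE = re.compile(r"<[^>]+>")
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_tokens(text):
    """Hypothetical helper applying the six quoted rules in order."""
    # 1) Remove remaining inline HTML tags (replaced by a space: an assumption).
    text = _TAG_RE.sub(" ", text)
    # 2) Remove all punctuation characters.
    text = text.translate(_PUNCT_TABLE)
    # 3) Remove control characters, keeping whitespace so step 6 can still split.
    text = "".join(
        ch for ch in text
        if ch.isspace() or (ord(ch) >= 32 and ord(ch) != 127)
    )
    # 4) Remove all non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    # 5) Normalize to lowercase.
    text = text.lower()
    # 6) Split on whitespace.
    return text.split()


sample = "<b>Hello</b>, World! It's a caf\u00e9.\x00"
print(normalize_tokens(sample))
# ['hello', 'world', 'its', 'a', 'caf']
```

Note that "It's" and "its" collapse to the same token and "café" loses its last letter, which is the kind of lost meaning I am complaining about.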

sigil 15 years ago |

Author submitted the post directly to HN a few days ago: http://news.ycombinator.com/item?id=2639214

jannes 15 years ago |

Just out of curiosity: how did the URL fragment #.TfMwNJgETxs;hackernews end up in the URL? How did ReadWriteWeb know that this URL would be posted on Hacker News?

Edit: Oh, I found out by myself. There's a link on the page to post the story on Hacker News. So it seems like the url fragment was added by http://www.addthis.com

beagledude 15 years ago |

Goose also does image extraction; there is a demo online here:

http://jimplush.com/blog/goose