Scrapely: The brains behind Portia, our visual web scraping tool (blog.scrapinghub.com)
https://c1.staticflickr.com/7/6096/6306406141_3b237e21ee_b.j...
Look at those big, soulful black eyes...
and since this is all open source, here's a link to GitHub: https://github.com/scrapinghub/portia2code
- document conversion (pdftotext, Apache PDFBox, Tabula, etc.)
- OCR (Tesseract, pypdfocr, etc.)
- Named-Entity Recognition (NER), i.e. finding and recognizing entities in text (DBpedia Spotlight, Stanford NER via NLTK, spaCy)
- coreference resolution, dependency parsing (spaCy, SyntaxNet)