Table Detection and Extraction Using Deep Learning (nanonets.com)
The AI approaches are definitely still worse than human-written rules. I can infer - and I've chatted with the devs to confirm - from the quality of the text and table extraction whether the company is using a modern NN approach or someone has sat down and handwritten some simple rules that understand indents and baselines etc.
[1] https://edinburghhacklab.com/2013/09/probabalistic-scraping-...
One of the contributors to the PDF library I'm developing has been implementing some interesting algorithms for layout analysis https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Anal...
In the delicious pics of results I can see the bullets treated as one column, and the paragraphs for each bullet point actually run together as a single chunk of text?
What do you think about tackling bullets and indents?
I think a lot of folks find this out as I did, when they run into a project where they need to extract info from PDF documents. Without knowing anything about PDF, one can easily assume it will be possible to do things like "can't we just extract some semantic structures like headings, tables, etc."... but nooo, it don't work that way!
Discovering the true nature of PDF is a major WTF moment because we're so conditioned to expect documents to have a semantic structure. It's hard to understand how a standard can take the exact opposite approach and be so successful.
Imagine how bogged down and limited vector graphics would be if every element had to have semantic meaning? "This line connects the <body> of the <car> to the 13th <spoke> on the <wheel>".
Scanning the comments I see two mentions of Camelot [1] and one mention each of PDFTron [2] and ExtractTable [3].
[1]: https://camelot-py.readthedocs.io/en/master/
[2]: https://www.pdftron.com/pdf-tools/pdf-table-extraction/
[3]: https://extracttable.com/
Would love to hear if you’ve compared across multiple options.
While it isn't the sexiest project, I've had a number of companies reach out about it. Human-written rule-based approaches are pretty bad at the task, and even humans doing it manually aren't great (likely due to sloppiness).
It is disappointing just how haphazardly most PDFs are structured. Too many of the PDF production tools remove all document structure metadata or fail to include it by default.
This is a very interesting field, and PDFTron has been doing similar work with ML and table extraction as part of our document understanding platform. We've made pretty good progress over the past year -- you can try it on your docs here:
https://www.pdftron.com/pdf-tools/pdf-table-extraction/
We also have a rules-based table extraction product (PDFGenie) that works reasonably well, but ML is most definitely the future.
But does anyone have insight into how hard it is to be in a space where all the big cloud providers seem to be offering very similar products? Can you survive by focusing on a niche segment? Is the market growing so fast that there's room for multiple companies offering (roughly) the same thing?
The example of TableNet using deep learning for table extraction on top of Tesseract for OCR means two layers of ML, either of which could individually introduce pathologies without human oversight. It reminds me of the photocopier that changed numbers for you - https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...
If an ML engine was trained to be able to do things like look for totals and sub-totals in numerical tables and flag errors in summation, then that would clearly add more value in parsing for moderation (the use-case described at the end). But that doesn't seem to be something that's yet... on the table.
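The summation check itself does not need ML once the rows have been extracted. A minimal sketch, assuming the extractor already yields (label, amount) pairs in reading order and that any row whose label contains "total" is meant to equal the sum of the rows since the previous such row. The function name, the row format and the sample figures are all made up for illustration:

```python
from decimal import Decimal

def find_sum_errors(rows, total_label="total", tolerance=Decimal("0.005")):
    """Return (row index, expected, stated) for every total row that does not add up.

    rows: list of (label, amount) string pairs in reading order (assumed format).
    A row whose label contains total_label is treated as the sum of the rows
    since the previous total row.
    """
    errors = []
    running = Decimal("0")
    for index, (label, amount) in enumerate(rows):
        value = Decimal(amount.replace(",", ""))  # drop thousands separators
        if total_label in label.lower():
            if abs(value - running) > tolerance:
                errors.append((index, running, value))
            running = Decimal("0")  # start a new section
        else:
            running += value
    return errors

# Made-up rows: the second subtotal is off by 5.00
rows = [
    ("Laptops", "1,200.00"),
    ("Monitors", "300.50"),
    ("Subtotal hardware", "1,500.50"),
    ("Installation", "80.00"),
    ("Support", "40.00"),
    ("Subtotal services", "125.00"),
]
print(find_sum_errors(rows))
```

Grand totals that sum several subtotals would need an extra rule; this sketch only handles flat sections.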
https://www.microsoft.com/en-us/research/publication/melford...
I would like to have a neural net that can give me the data from a chart in an image. I have a hunch image segmentation NNs are able to do this because the size of the surface has a predictive value in the causal relationship. Artificial data could be created at scale with the Google Sheets API.
For tables with numbers in them, it worked pretty well, but I'm yet to find a tool that can parse/understand documents where the entire page is a table layout with lots of merged cells. I think even for humans it's hard to understand the structure in those cases...
He's using a CNN for digit recognition.
The mentioned service is not perfect either. There are always limitations; minimizing them is the key.
P.S. I work with the team at ExtractTable.
My area of work on the project has been the core file-reading and file-creation stuff so I haven't had much of a chance to review the layout algorithm performance across documents.
I've been working on a purely rules-based approach in a private repository for a side project. From that experience, the algorithms the contributor has implemented get you a lot closer to the correct result than starting from rules alone, but it definitely feels like adding some context-aware rules would get you all the way there. I'm not sure whether those rules would be in scope for the layout analysis project itself, or whether someone could take the open core and extend it, as I was attempting in my side project.
There is no real differentiation between formatting and content in a PDF, so it's not possible to truly separate them.
The current layout analysis algorithms don't do much normalization as far as I'm aware. The Recursive-XY Cut algorithm uses page-level font-size information [0] to tune parameters, but it doesn't infer a common structure or format, either as an input or as a result.
The aim of most layout analysis algorithms is to produce classifications for regions, e.g. paragraphs, titles, lists, which I suppose counts as denormalizing the document? Arriving at those classifications generally relies on first splitting the document into sections or regions and then classifying those regions. So far the implemented algorithms mainly focus on the first step, splitting a document into discrete regions. An example of the second step, using ML approaches to classify those regions, by the same contributor can be found here [1].
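To make the first step (splitting a page into regions) concrete, here is a simplified recursive XY cut over word bounding boxes. It is a sketch of the general technique, not PdfPig's implementation: the box format, the function names and the gap factors tied to font size are all assumptions for illustration.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom), with y growing downward (assumed convention)
Box = Tuple[float, float, float, float]

def _find_cut(boxes: List[Box], lo: int, hi: int, min_gap: float) -> Optional[float]:
    """Midpoint of the widest empty gap >= min_gap along one axis, or None."""
    spans = sorted((b[lo], b[hi]) for b in boxes)
    best_gap, best_cut = min_gap, None
    reach = spans[0][1]  # furthest extent covered so far
    for start, end in spans[1:]:
        gap = start - reach
        if gap >= best_gap:
            best_gap, best_cut = gap, (reach + start) / 2
        reach = max(reach, end)
    return best_cut

def xy_cut(boxes: List[Box], min_gap_x: float, min_gap_y: float) -> List[List[Box]]:
    """Recursively split boxes into regions; min gaps are assumed to be > 0."""
    if len(boxes) <= 1:
        return [boxes] if boxes else []
    # Try a horizontal cut (gap along y) first, then a vertical cut (gap along x).
    for lo, hi, min_gap in ((1, 3, min_gap_y), (0, 2, min_gap_x)):
        cut = _find_cut(boxes, lo, hi, min_gap)
        if cut is not None:
            before = [b for b in boxes if b[hi] <= cut]
            after = [b for b in boxes if b[lo] >= cut]
            return (xy_cut(before, min_gap_x, min_gap_y)
                    + xy_cut(after, min_gap_x, min_gap_y))
    return [boxes]  # no gap large enough: this is one region

# Hypothetical page: a title above two columns, thresholds scaled by font size
font_size = 10.0
page = [(0, 0, 100, 10),
        (0, 20, 40, 30), (0, 32, 40, 42),
        (60, 20, 100, 30), (60, 32, 100, 42)]
regions = xy_cut(page, min_gap_x=1.0 * font_size, min_gap_y=0.5 * font_size)
print(len(regions))
```

The output is only a list of regions; deciding whether each one is a title, paragraph or list is the separate classification step.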
With the rule-based approaches I've been experimenting with, you can use certain information from the common producers to normalize certain features. For example, line spacing and font size have a well-defined relationship, as do whitespace size and font size (though this is a fuzzier relationship and goes out the window entirely for justified text).
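One way to use the line spacing to font size relationship is to express the gap between baselines as a multiple of the font size, so the same threshold works for 8pt footnotes and 14pt body text. A rough sketch, assuming lines arrive as (baseline_y, font_size, text) sorted top to bottom; the 1.5 cutoff is a guess for illustration, not a value taken from any particular producer:

```python
def group_lines_into_paragraphs(lines, max_ratio=1.5):
    """Group text lines into paragraphs using baseline gap relative to font size.

    lines: list of (baseline_y, font_size, text), sorted top to bottom with
    y growing downward (assumed format). max_ratio is a hypothetical cutoff.
    """
    paragraphs = []
    for baseline, size, text in lines:
        if paragraphs:
            prev_baseline, prev_size, _ = paragraphs[-1][-1]
            ratio = (baseline - prev_baseline) / max(size, prev_size)
            # Same font size and normal leading: continue the paragraph.
            if size == prev_size and ratio <= max_ratio:
                paragraphs[-1].append((baseline, size, text))
                continue
        paragraphs.append([(baseline, size, text)])
    return paragraphs

# Made-up lines: two paragraphs separated by a larger gap
lines = [(100, 10, "first line"), (112, 10, "second line"),
         (140, 10, "new paragraph"), (152, 10, "continues")]
for paragraph in group_lines_into_paragraphs(lines):
    print([text for _, _, text in paragraph])
```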
Here's an example where you rely on non-locality to parse a document. This SEC filing contains both key-values and a table: https://www.sec.gov/Archives/edgar/data/1428796/000110465920...
For the values following the subheading "Institutional Investment Manager Filing this Report:", the left-hand column holds the keys for the right-hand values.
At the bottom of the document there's a table containing the columns "Form 13F File Number" and "Name".
Now you could use a couple of rules to infer the difference between the key-values and the table:
1) The keys in a key-value list end in ':'.
2) The keys in a key-value list have a different font/color to the values.
Both of those rules hold true here, but not in all or even most documents. For this reason you need to use the whole page to deduce the type of these sections, rather than only the immediately surrounding features/pixels as an ML algorithm might.
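The two rules above can be sketched as a vote over every row in a two-column region, rather than a decision on any single row. This assumes each cell has already been reduced to a (text, font) pair; the function name, the 0.8 threshold and all sample values are invented for illustration, and as noted the rules will misfire on many documents:

```python
def classify_two_column_region(rows, threshold=0.8):
    """Label a two-column region as 'key-value' or 'table'.

    rows: list of ((left_text, left_font), (right_text, right_font)) pairs
    (assumed format). Uses every row in the region, not just one.
    """
    if not rows:
        return "unknown"
    # Rule 1: keys end in ':'
    colon_rows = sum(1 for (left, _), _ in rows if left.rstrip().endswith(":"))
    # Rule 2: keys use a different font to their values
    font_rows = sum(1 for (_, lfont), (_, rfont) in rows if lfont != rfont)
    score = max(colon_rows, font_rows) / len(rows)
    return "key-value" if score >= threshold else "table"

# Invented sample data in the shape of the two sections described above
key_values = [
    (("Name:", "Bold"), ("Example Adviser LLC", "Regular")),
    (("Address:", "Bold"), ("1 Example Street", "Regular")),
]
table = [
    (("Form 13F File Number", "Bold"), ("Name", "Bold")),
    (("028-00000", "Regular"), ("Example Adviser LLC", "Regular")),
]
print(classify_two_column_region(key_values), classify_two_column_region(table))
```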
[0]: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad....
(it's well supported by Adobe tools...)
But by its nature text is intrinsically semantic. I am just surprised that a document format utterly free of semantics has lasted so long. Perhaps because we (as people and organizations) can't agree on the structure of documents?
Another way to see this, perhaps, is as the failure of the promise of XML and the ecosystem around it? In the late 90's many of us thought that all documents would be XML for content, and styling would be through XSLT or even its big sister, XSL. Well, THAT went nowhere despite all the W3C meetings and papers.
It's interesting you brought up graphics as an analogy. It's true that you can have graphics which are literally just lines, and that's adequate for many needs. However, modern CAD drawing systems increasingly use notions of 2D/3D objects and a disciplined series of transformations. They call it "parametric modeling", and it's where all drawings consist of a series of transformations that can be represented in a timeline. I suspect modern parametric model CAD can very much be semantic.
I think more than lack of agreement, it's just that there aren't really universal document structures. There are relatively useful chunks like paragraphs that are more or less universal (at least for a given language), but those don't need much structure to be clear.