Table Detection and Extraction Using Deep Learning

Table Detection and Extraction Using Deep Learning (nanonets.com)

154 points by ole_gooner 6 years ago | 48 comments

willvarfar 6 years ago |

I've worked with several companies that try to parse things in PDF documents, extracting tables and paragraphs etc. This is actually challenging because a PDF is a large bag of words and fragments of words with x y positions. There is a particularly popular word processor that emits individual characters. Just determining that two fragments are part of the same word is challenging as is detecting bullet points, etc.

The AI approaches are definitely still worse than human-written rules. I can infer - and I've chatted with the devs to confirm - from the quality of the text and table extraction whether the company is using a modern NN approach or someone has sat down and handwritten some simple rules that understand indents and baselines etc.
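
To make the fragment problem concrete, here is a minimal sketch of the kind of handwritten rule described above: group fragments that share a baseline, then join neighbours whose horizontal gap is small. The `Fragment` type, the tolerances, and the bottom-left coordinate origin are all assumptions for illustration, not taken from any particular PDF library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x0: float        # left edge
    x1: float        # right edge
    baseline: float  # y of the text baseline (assumed origin: bottom-left)


def merge_fragments(frags, gap_tol=1.0, baseline_tol=0.5):
    """Return a list of lines, each a list of words rebuilt from fragments."""
    # Step 1: cluster fragments into lines by baseline, top of page first.
    lines = []
    for f in sorted(frags, key=lambda f: f.baseline, reverse=True):
        if lines and abs(f.baseline - lines[-1][-1].baseline) <= baseline_tol:
            lines[-1].append(f)
        else:
            lines.append([f])

    # Step 2: within a line, join fragments separated by a small gap.
    result = []
    for line in lines:
        line.sort(key=lambda f: f.x0)
        words = [line[0].text]
        prev = line[0]
        for f in line[1:]:
            if f.x0 - prev.x1 <= gap_tol:
                words[-1] += f.text   # same word, e.g. per-character output
            else:
                words.append(f.text)  # gap is wide enough to be a space
            prev = f
        result.append(words)
    return result


frags = [
    Fragment("T", 10, 15, 100), Fragment("a", 15.2, 19, 100.1),
    Fragment("b", 19.1, 23, 100), Fragment("le", 23.2, 30, 100),
    Fragment("1", 34, 38, 100), Fragment("x", 10, 14, 80),
]
print(merge_fragments(frags))
```

Fixed tolerances like these are the weak point: a realistic rule would scale them by font size, which is part of why the rules take effort to get right.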

tlarkworthy 6 years ago | |

Yes exactly, table comprehension is a logic-driven, non-local inference problem. Critically, it's the non-locality that trips up common machine learning techniques. I wrote an approach using mixed integer programming once [1].

[1] https://edinburghhacklab.com/2013/09/probabalistic-scraping-...

UglyToad 6 years ago | |

I had to check we hadn't worked for the same company! Yeah, text extraction and layout analysis from PDFs is a super interesting challenge and still relatively underdeveloped. I'd put table detection at about the hardest challenge in that field.

One of the contributors to the PDF library I'm developing has been implementing some interesting algorithms for layout analysis https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Anal...

willvarfar 6 years ago | | |

Really really interesting, hadn't seen pdfpig before!

In the delicious pics of results I can see the bullets treated as one column, and the paragraphs for each bullet point actually run together as a single chunk of text?

What do you think about tackling bullets and indents?

tastyminerals 6 years ago | | |

It depends. We do commercial PDF and scanned information extraction as well as table detection for line items for invoices, receipts and remittance slips. We have been successfully using a rule-based system for years but are mixing in deep learning now. I also know of about 5 other companies competing in the same field. So, I wouldn't say it is underdeveloped.

ahpearce 6 years ago | | |

Referring to the above poster's "non-locality", are we talking about denormalization of formatting? Is there a way to "normalize" PDF structure? Calculate margins or common formats beforehand to normalize?

crispyambulance 6 years ago | |

It was really shocking when I learned that the way pdf works is as you describe, literally fragments of text with positions and essentially no semantics.

I think a lot of folks find this out as I did, when they run into a project where they need to extract info from pdf documents. Without knowing anything about pdf, one can easily assume that it will be possible to do things like "can't we just extract some semantic structures like headings, tables, etc"... but nooo, it don't work that way!

Discovering the true nature of pdf is a major WTF moment because we're so conditioned to expect documents to have a semantic structure. It's hard to understand how a standard can take the exact opposite approach and be so successful.

nightcracker 6 years ago | | |

It's so successful precisely because it doesn't have semantics. It is a print format with one goal: show the output as desired. Semantics only confuse and limit this goal.

Imagine how bogged down and limited vector graphics would be if every element had to have semantic meaning? "This line connects the <body> of the <car> to the 13th <spoke> on the <wheel>".

divbzero 6 years ago | |

Do you have recommendations for which libraries or APIs currently perform the best at extracting tables and extracting text?

Scanning the comments I see two mentions of Camelot [1] and one mention each of PDFTron [2] and ExtractTable [3].

[1]: https://camelot-py.readthedocs.io/en/master/

[2]: https://www.pdftron.com/pdf-tools/pdf-table-extraction/

[3]: https://extracttable.com/

Would love to hear if you’ve compared across multiple options.

tensor 6 years ago | |

Having worked with OCR products doing table detection for years, I can say that simple hand-written rules cannot solve the general case. They can work for specific documents, but if you want to be able to handle any document, they're just not accurate once you include non-gridded tables.

chriskanan 6 years ago |

With collaborators at Adobe Research, my lab published a paper recently showing how to do table reconstruction from infographics (e.g., bar charts) using deep learning [1].

While it isn't the sexiest project, I've had a number of companies reach out about the project. Human written rule-based approaches are pretty bad at the task, and even humans doing it manually aren't great (likely due to sloppiness).

[1] https://arxiv.org/abs/1908.01801

jessaustin 6 years ago | |

I've found that when PDFs are produced by a single entity for a particular purpose, I can automate this pretty well with a loop and some regex... maybe I've just gotten lucky?
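
A minimal sketch of the "loop and some regex" approach, assuming the text has already been extracted and that each line item looks like `description  qty  price`. The layout, the `LINE_ITEM` pattern, and the sample text are made up for illustration; a real single-source document would need its own pattern.

```python
import re
from decimal import Decimal

# Assumed layout: description, two or more spaces, integer qty, price with cents.
LINE_ITEM = re.compile(
    r"^(?P<desc>.+?)\s{2,}(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$"
)


def parse_line_items(text):
    """Return (description, quantity, unit price) for each matching line."""
    items = []
    for line in text.splitlines():
        m = LINE_ITEM.match(line.strip())
        if m:  # headers, totals and other lines simply do not match
            items.append((m["desc"], int(m["qty"]), Decimal(m["price"])))
    return items


SAMPLE = """ACME Corp Invoice
Widget, large    3   19.99
Gasket           12  0.50
Total                26.49
"""

for item in parse_line_items(SAMPLE):
    print(item)
```

This works only because the producer and layout are fixed; the pattern silently skips anything it does not recognise, so a layout change shows up as missing rows rather than an error.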

theSage 6 years ago |

For what it's worth, at my previous place we built a YOLO based model for detecting paragraphs/tables/headlines/page layouts mixed with traditional rule based OCR/layout detection.

https://www.youtube.com/watch?v=VVdHFqhQRUk

https://voody.clapresearch.com/

bondolo 6 years ago |

This capability has a lot of value for accessibility. Recovering the table structure for logical presentation allows navigation by blind users as well as users who are not using a pointing device.

It is disappointing just how haphazardly most PDFs are structured. Too many of the PDF production tools remove all document structure metadata or fail to include it by default.

jjohansson 6 years ago |

Disclaimer: I work for PDFTron

This is a very interesting field, and PDFTron has been doing similar work with ML and table extraction as part of our document understanding platform. We've made pretty good progress over the past year -- you can try it on your docs here:

https://www.pdftron.com/pdf-tools/pdf-table-extraction/

We also have a rules-based table extraction product (PDFGenie) that works reasonably well, but ML is most definitely the future.

Pandabob 6 years ago |

I applaud the Nanonets folks for starting a business around AI APIs, and it seems clear there's lots of value to unlock with solutions like these.

But does anyone have insight into how hard it is to be in a space where all the big cloud providers seem to be offering very similar products? Can you survive by focusing on a niche segment? Is the market growing so fast that there's room for multiple companies offering (roughly) the same thing?

ackbar03 6 years ago | |

I think it's pretty hard, and even the big cloud providers don't necessarily have a perfect solution. It's not a particularly creative idea to come up with, I think. I've thought about making something similar as a product but I'm kinda glad I didn't.

nanoamp 6 years ago |

I can see the use-case and potential for ML in exfiltrating tables, but I'd be worried about the potential for decision-making mistakes in environments the author identifies, such as finance.

The example of TableNet using deep learning for table extraction on top of tesseract for OCR means two layers of ML, either of which could individually introduce pathologies without human oversight. It reminds me of the photocopier that changed numbers for you - https://www.theregister.co.uk/2013/08/06/xerox_copier_flaw_m...

If an ML engine was trained to be able to do things like look for totals and sub-totals in numerical tables and flag errors in summation, then that would clearly add more value in parsing for moderation (the use-case described at the end). But that doesn't seem to be something that's yet... on the table.
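
The summation check itself does not need ML once the cells are extracted. A minimal sketch, assuming the amounts arrive as decimal strings and the table states a single total (the function name and tolerance parameter are hypothetical):

```python
from decimal import Decimal


def check_total(amounts, stated_total, tolerance=Decimal("0.00")):
    """Compare the sum of extracted amounts with the total printed in the table.

    Returns (ok, computed_sum). Decimal avoids binary floating-point
    rounding, which matters when comparing currency values exactly.
    """
    computed = sum((Decimal(a) for a in amounts), Decimal("0"))
    ok = abs(computed - Decimal(stated_total)) <= tolerance
    return ok, computed


# Clean extraction: the line items add up to the stated total.
print(check_total(["59.97", "6.00"], "65.97"))

# One digit misread (6 -> 8): the mismatch flags the table for review.
print(check_total(["59.97", "8.00"], "65.97"))
```

A check like this can only flag that something in the table is inconsistent; it cannot tell whether the misread digit is in a line item or in the total.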

fny 6 years ago | |

There's a project from Microsoft Research that's really interesting which does just that:

https://www.microsoft.com/en-us/research/publication/melford...

nanoamp 6 years ago | | |

It looks like it's not quite the same thing, in that it identifies Excel values that should be formulae. It could be used in a pipeline with spreadsheets extracted by ML/OCR to reverse-engineer formulae though, which is an interesting prospect.

tastyminerals 6 years ago | |

Yes, that's why in the financial domain you use rules with ML as a fallback.
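
One way to read "rules with ML as fallback" is as a simple dispatch: try each handwritten rule first and only call the model when none of them recognises the layout. The sketch below is an assumption about how such a pipeline could be wired, not a description of any real product; `pipe_rule` and `stub_ml` are hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
Rule = Callable[[str], Optional[Table]]


def pipe_rule(page: str) -> Optional[Table]:
    """Toy rule: recognise rows whose cells are separated by '|'."""
    rows = [line.split("|") for line in page.splitlines() if "|" in line]
    if not rows:
        return None  # layout not recognised, let the next extractor try
    return [[cell.strip() for cell in row] for row in rows]


def stub_ml(page: str) -> Table:
    """Stand-in for a learned model; here it just splits on whitespace."""
    return [line.split() for line in page.splitlines()]


def extract_with_fallback(
    page: str, rules: Sequence[Rule], ml_extractor: Callable[[str], Table]
) -> Tuple[Table, str]:
    """Return the table plus which path produced it, for auditing."""
    for rule in rules:
        table = rule(page)
        if table:
            return table, "rules"
    return ml_extractor(page), "ml"


print(extract_with_fallback("a | b\n1 | 2", [pipe_rule], stub_ml))
print(extract_with_fallback("a b\n1 2", [pipe_rule], stub_ml))
```

Returning the source label alongside the table lets downstream checks treat model output with more suspicion than rule output.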

lowdose 6 years ago |

There is a lot of data locked up in pdfs but even more so in images.

I would like to have a neural net that can give me the data from a chart in an image. I have a hunch image segmentation NNs are able to do this because the size of the surface has a predictive value in the causal relationship. Artificial data could be created at scale with the Google Sheets API.

ivansavz 6 years ago |

Also in the extracting-structured-data-from-PDFs solution space, there is Parsr, which was recently posted on HN: https://github.com/axa-group/Parsr (see https://news.ycombinator.com/item?id=22035258). It's based on a pipeline of various JS modules and pluggable backends (e.g. Tesseract, GCP Cloud Vision, Abbyy API, etc.).

For tables with numbers in them, it worked pretty well, but I'm yet to find a tool that can parse/understand documents where the entire page is a table layout with lots of merged cells. I think even for humans it's hard to understand the structure in those cases...

yorwba 6 years ago |

Related: table extraction using mixed integer programming to encode constraints: https://news.ycombinator.com/item?id=21256005

tastyminerals 6 years ago |

Table detection is useful for line item extraction from financial documents and it is solvable. However, generic table extraction is very difficult.

cafard 6 years ago |

I was very impressed with "Camelot" (https://camelot-py.readthedocs.io/en/master/). My impression was that it extracted maybe 80 or 90% of the text properly, far better than anything else I had tried.

busymom0 6 years ago |

Partially related - is this what someone could use to detect a sudoku grid? The spaces and the digits from a picture?

speps 6 years ago | |

Some related articles:

https://medium.com/@braddwyer/behind-the-magic-how-we-built-...

https://blog.scottlogic.com/2020/01/03/webassembly-sudoku-so...

sarthakjain 6 years ago | |

Sudoku is a relatively simple problem since the structure is known a priori, so it becomes as simple as pattern matching.
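
To illustrate why a known structure helps: once the outer square of the puzzle has been located and straightened (assumed here, and the harder part in practice), every cell position follows from arithmetic. The function name and coordinate convention are illustrative assumptions.

```python
def sudoku_cells(x, y, size, n=9):
    """Map (row, col) to a (left, top, right, bottom) box for an n x n grid.

    Assumes the grid is an axis-aligned square whose top-left corner is
    at (x, y) with side length `size`, in image pixel coordinates.
    """
    step = size / n
    return {
        (r, c): (x + c * step, y + r * step,
                 x + (c + 1) * step, y + (r + 1) * step)
        for r in range(n)
        for c in range(n)
    }


cells = sudoku_cells(0, 0, 90)
print(len(cells), cells[(0, 0)], cells[(8, 8)])
```

Each box can then be cropped and passed to a digit classifier. A general table offers no such shortcut, because the number of rows and columns and their sizes are unknown.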

ovi256 6 years ago | | |

Exactly, sudoku can be solved with classical CV through OpenCV, see for example https://www.youtube.com/watch?v=QR66rMS_ZfA

He's using a CNN for digit recognition.

mushufasa 6 years ago |

see camelot https://camelot-py.readthedocs.io/en/master/

PaulHoule 6 years ago |

Woohoo!

Animats 6 years ago |

Table extraction has been a feature of better OCR programs for at least a decade. It's easier than the OCR part. Look up "OCR table" for examples, products, code, papers, etc.

curiousgal 6 years ago | |

You'd think that until you try them with tables that contain empty cells that you still need recognized or tables that span multiple pages. I wouldn't say this has been solved for a decade.
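
The empty-cell case can be sketched as follows: if words are only read left to right, a missing cell shifts everything after it into the wrong column. A minimal workaround, assuming left-aligned columns and words given as `(x0, text)` pairs (both assumptions for illustration), is to infer column anchors from all rows and then place each word at its nearest anchor:

```python
def column_anchors(rows, tol=5.0):
    """Cluster the left edges of all words into column x-positions."""
    xs = sorted(x for row in rows for x, _ in row)
    if not xs:
        return []
    anchors, cluster = [], [xs[0]]
    for x in xs[1:]:
        if x - cluster[-1] <= tol:
            cluster.append(x)
        else:
            anchors.append(sum(cluster) / len(cluster))
            cluster = [x]
    anchors.append(sum(cluster) / len(cluster))
    return anchors


def to_grid(rows, tol=5.0):
    """Place each word in its nearest column, keeping empty cells as ''."""
    anchors = column_anchors(rows, tol)
    grid = []
    for row in rows:
        cells = [""] * len(anchors)
        for x, text in row:
            col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - x))
            cells[col] = (cells[col] + " " + text).strip()
        grid.append(cells)
    return grid


rows = [
    [(10, "Item"), (100, "Qty"), (200, "Price")],
    [(10, "Widget"), (201, "19.99")],            # Qty cell is empty
    [(11, "Gasket"), (101, "12"), (199, "0.50")],
]
for line in to_grid(rows):
    print(line)
```

The sketch breaks as soon as columns are right-aligned or centred, or a cell spans several columns, and it does nothing for tables that continue on the next page.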

pathsjs 6 years ago | |

I wish it was, but it isn't. There are various kinds of tables: they may or may not have delimiting lines, or they may consist of unaligned cells, each showing a key and a value... If you actually have in mind some solution that works well (a paper, a GitHub project, or a commercial product), I'd be eager to know.

m1sta_ 6 years ago | |

You're wrong. Robust and easy to use table extraction might be solvable, but from a business perspective it isn't solved.

saradhi 6 years ago | | |

Did you try https://extracttable.com?

The mentioned service is not perfect either. There are always limitations; minimizing them is the key.

P.S.: I work with the team at extracttable

tastyminerals 6 years ago | |

It does not work reliably, and the quality is such that you can only sell it as an add-on feature. This is what Abbyy does, for example.

tensor 6 years ago | |

Gridded tables are not too hard, but once you remove the grid lines, even a portion of them, it becomes a complete crapshoot.

Ididntdothis 6 years ago | |

Then go ahead, make a table extractor for PDF and get very rich. A lot of people have tried.