Are text mining tools capable of getting specific data from files?

1 points by aanfhn 8 years ago | 0 comments

I'm trying to extract data from files such as docx, xlsx, and pdf. I'm having a tough time because the files aren't consistent in layout, yet they contain some structured data (tables, labeled fields) that I need to pull out.

For example, some of these docx files have tables. When I extract the text, some tables come out with a single cell's data split across several lines, and others come out in no recognizable order at all...
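To show what I mean by keeping the table structure instead of dumping plain text, here is a rough sketch that reads the tables straight out of the docx container. It only uses the Python standard library and relies on a .docx being a zip archive with the body in word/document.xml. `docx_tables` is a name I made up, and it assumes simple, non-nested tables, so treat it as a starting point rather than a solution.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_tables(path_or_file):
    """Return every table as a list of rows, each row a list of cell strings.

    Hypothetical helper: assumes simple tables. Nested tables would show up
    both inside their parent cell and again as separate tables.
    """
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            row = []
            for tc in tr.findall(W + "tc"):
                # A cell can hold several paragraphs; join them with a space
                # so one cell's text is not split across lines.
                paras = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.iter(W + "p")
                ]
                row.append(" ".join(s for s in paras if s).strip())
            rows.append(row)
        tables.append(rows)
    return tables
```

Calling `docx_tables("report.docx")` (the file name is just an example) would give a list of tables, each a list of rows, so a label cell and its value cell stay next to each other.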

So I'm wondering whether any text mining tools, or even AI/ML tools and packages, are able to do this. By "this" I mean finding a field that might be labeled "author" in these files and then getting the corresponding value.
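To make the goal concrete, this is roughly the naive rule-based version of what I'm after, run on lines of text that have already been extracted. `find_field` is a made-up name, and the assumption that the value sits either on the same line as the label or on the next non-empty line is mine. It will not cope with every layout.

```python
import re

def find_field(lines, label):
    """Return the first value found for `label`, or None if there is none.

    Hypothetical helper: handles "Author: Jane Doe" on one line, and a bare
    "Author" line followed by the value on the next non-empty line.
    """
    pattern = re.compile(
        r"^\s*" + re.escape(label) + r"\b\s*[:\-]?\s*(.*)$",
        re.IGNORECASE,
    )
    for i, line in enumerate(lines):
        m = pattern.match(line)
        if not m:
            continue
        value = m.group(1).strip()
        if value:
            return value
        # The label stood alone, so look for the next non-empty line.
        for nxt in lines[i + 1:]:
            if nxt.strip():
                return nxt.strip()
    return None
```

For example, `find_field(["Title: Report", "Author: Jane Doe"], "author")` returns `"Jane Doe"`. What I'd like to know is whether existing tools do something more robust than this kind of pattern matching.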
