Nougat: Neural Optical Understanding for Academic Documents (facebookresearch.github.io)
It is impressive but...it really feels like those are the details that really really matter.
I don't have any love for PDF, but I'm actually not sure what's more cross-platform. Any browser will render PDF, so everyone already has a viewer on their computer. A browser will also print any document to PDF, and many other editors can export to PDF (though perhaps not import for editing).
It can't be replaced by an Office format, like docx, because even today apps like Pages can't render MS Office docs correctly half the time.
Doesn't seem like HTML would fly, either, given all the kinds of things that get embedded into PDF.
> Doesn't seem like HTML would fly, either, given all the kinds of things that get embedded into PDF.
That's ironic. Browser PDF readers, at least open source ones, render PDFs as HTML using JavaScript. At least I'm sure about FF, because I just checked that text from a native-digital PDF showed up in the DOM in developer tools.
What's the obsession with "looking the same everywhere"?
Page references: this shouldn't be a thing. Academia has already solved this problem for notable texts. Rather than nearly uncountable numbers of paragraphs that all run together, paragraphs or short sections or lines are numbered. See any good edition of Plato or Aristotle, or just about any notable play or longer poem ever translated. Relying on a single published layout of a work to reference is dumb.
Citing exact line numbers isn't even necessary for native-language works. When they're digital, search works. It works even better in flowed-format texts than it does in pdfs, which sometimes, depending on how the pdf was constructed, won't match text properly across newlines.
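To illustrate the newline problem: a minimal sketch, assuming the text has already been extracted from the PDF into a plain string with hard line breaks. The helper name `find_across_linebreaks` is made up for this example. It treats any run of whitespace in the phrase as matching any run of whitespace, including newlines, in the text.

```python
import re

def find_across_linebreaks(text: str, phrase: str):
    """Search for phrase in text, letting spaces match line breaks too."""
    # Escape each word, then allow any whitespace run between words.
    words = phrase.split()
    pattern = r"\s+".join(re.escape(w) for w in words)
    return re.search(pattern, text, flags=re.IGNORECASE)

# Hypothetical extracted text with a hard line break mid-sentence.
extracted = "The quick brown\nfox jumped over the lazy dog."

# A naive substring search misses the phrase because of the newline.
naive_hit = "brown fox jumped" in extracted

# The whitespace-tolerant search finds it.
match = find_across_linebreaks(extracted, "brown fox jumped")
```

This does not handle words hyphenated across lines, which is another way PDF text search can fail. Flowed formats avoid both problems because the line breaks are not baked into the text.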
Visual quality: As long as images (data, charts, graphs, photographs) are not degraded beyond usefulness, the actual text, and its display, is up to the reader application. Everyone uses the web complete with MathJax, and those pages don't have Knuth-approved formatting in every respect. But they're good enough, and they work everywhere on every device without squinting or pinch to zoom. There are some people who insist on putting pre-rendered images of math in HTML, and they always look worse, because they don't match the text without a lot of work to have extra high-res images that are auto-scaled according to viewport and surrounding font size. That is work that I bet not many people have ever done in the history of HTML publishing.
I look at pdfs on my phone all the time, it's great. 'Optimized for mobile' usually means oversized fonts and a shitty UI so I get RSI in my thumb from endless scrolling.
PDF is kind of an ugly format, but the problem with realtime text flow etc. is that designers (at the behest of clients) are always trying to look visually distinct, and as a result nothing is standardized or predictable at the rendering end. 95% of digital layout is ass compared to the print version.
Or, as I like to call it, SOUNDYMAREHEATRONER.
How do you email that to someone as an attachment? Can you embed all of that stuff into a single .html file?
You could (or maybe you can't, but ebook readers should allow you to) disable any network access without explicit confirmation, so the javascript can't do anything evil other than modify the ebook being displayed. If you can't do that, that's up to ebook readers to solve, and not a flaw with epubs.
Technically yes, but there are two problems.
First is that inline style and scripts are a potential security vulnerability.
Second is that if someone does not inline everything and instead references css/js from the web the document will stop rendering correctly when those resources go offline.
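For what it's worth, the "inline everything" approach can be sketched like this. It is only an illustration, assuming one stylesheet and one image; the function name `inline_html` and its parameters are invented for the example. The image is embedded as a base64 `data:` URI and the CSS goes into a `<style>` element, so the resulting file references nothing on the network.

```python
import base64

def inline_html(body_html: str, css: str, image_bytes: bytes,
                image_mime: str = "image/png") -> str:
    """Build a single self-contained HTML page with no external references."""
    # Encode the image so it can live inside the src attribute.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{image_mime};base64,{encoded}"
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">\n'
        f"<style>{css}</style>\n"
        "</head><body>\n"
        f"{body_html}\n"
        f'<img src="{data_uri}" alt="figure">\n'
        "</body></html>\n"
    )

# Placeholder bytes stand in for a real image file here.
page = inline_html("<p>hi</p>", "p{color:red}", b"\x89PNG")
```

The output can be emailed as one .html attachment. It still reflows differently per device, and it does nothing about the inline-script concern raised above.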
mhtml would somewhat fit part of the bill of what PDF offers: a single downloadable "file" you can archive or forward, and you know the recipient will see exactly what you saw.
however, mhtml doesn't look the same on every device, and looking exactly the same helps a great deal in convincing a judge that we are all talking about the same thing.
get me right.
I hate PDF with all the passion of my heart. epub (similar to mhtml) imho is a much better format for many intents and purposes, and it allows the contents to reflow depending on the device.
but the claim was "PDF is useless and shall go", and that's cutting it too short.
I still don't understand why it needs to look exactly the same. I get that habitually people say "turn to page X and look 2/3 down the page for the line starting 'The quick brown fox jumped'", but with digital documents that's not, in my experience, how anything works. You just say "the sentence starting with 'The quick brown fox'", and everyone can search for it in a few seconds.
If an official proceeding needs to be sure everyone's working from the same document, they can distribute, or publish hashes of, an epub or mhtml the same as they can for a pdf. There's no assurance that two pdfs that you think are the same document are actually the same document, any more than two epubs would be.
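The hash check is the same few lines whatever the file format. A minimal sketch using Python's standard `hashlib`; the helper name `sha256_of_file` and the file name in the usage comment are placeholders, not anything from a real proceeding.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large documents don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (hypothetical file name): compare against the published digest.
# sha256_of_file("exhibit_a.epub") == published_digest
```

Two parties whose files produce the same digest are working from byte-identical documents, which is a stronger guarantee than "it looks the same on my screen".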
For the vast majority of works that are untranslated, that isn't necessary, because, as mentioned, search works fine, and it's faster, too. For translated works, the concept of one published source of truth for page numbers is already broken, so you need some alternative to page numbers anyway.