Xerox responds to the recent character substitution issue (realbusinessatxerox.blogs.xerox.com)
Follow-up blog post about a conference call with Xerox: http://www.dkriesel.com/en/blog/2013/0806_conference_call_wi...
It shouldn't be seen with any setting. Nothing you can do to the device (short of involving a hammer) should change the content in any way. Compress, resize, zoom, do whatever, but it simply must not change the content at any time at any resolution/quality.
I'm just flabbergasted that such a compression scheme was ever implemented in the first place. Surely there are alternative OCR-based methods to do compression that don't introduce these artifacts (and that's putting it mildly) at lower resolutions.
I can just see a legal loophole now for anyone using these devices, for example "the electronic document was modified by a Xerox and we don't have the original, those numbers were not what we signed, contract void".
Whether or not it's an optional setting, and whatever the size of the font involved, this can have major consequences for people who trust the device to produce an accurate representation, in all cases, of what they put into it.
Looks like the company is trying to weasel out of it and there are going to have to be lawsuits. Though I didn't really expect otherwise; if the dice come up badly, the damage from this could exceed the net value of the company.
Don't get me wrong - using OCR is a great compression technique, but if it isn't reliable enough, it shouldn't be the default or "normal" setting.
I was expecting "Here is new firmware and we apologize for using JBIG2, won't happen again."
One wonders if JBIG2 is used in the storing of checks by banks (my bank these days only sends me images of my checks, never the actual check any more) or DMV records, or any number of things.
So in the previous thread I suggested a JBIG2 test image, now I want to build one that if you copy it, it goes from one thing to something else entirely!
First and foremost, I agree that Xerox putting their name on a product which creates an unfaithful copy is corporate suicide. Such an ancient paragon of computer innovation should be able to come up with a clever algorithm that compresses but doesn't substitute image bits.
But...
- The original story[1] didn't mention that the product itself warns against the very thing they are reporting. Did they ignore that warning, did the copier not show it, did they use a setting that did not have the warning? Their further posts cover the issue, so it looks like somebody else set the resolution and ignored the warning.
- Calling what the JBIG2 algorithm does "OCR" is misleading. OCR is pretty much understood to be analog text (image) to digital text (ASCII, UTF-32). Matching to a real character set and outputting those characters is a defining part of true OCR. It's also confusing because the copiers have a true OCR function, and this is not related. What JBIG2 does, I would call it "sub-image matching and substitution."
- Calling JBIG2 "lossy" is also misleading. I suppose it is lossy by definition, but "lossy" is usually understood to mean pixel-level effects as seen in JPG, not whole image blocks being swapped.
- JBIG2 seems like an algorithm that shouldn't be used on low-res text documents. You might say it's just a configuration of the algorithm, but if engineers can't take it as a tool and use it correctly, you start to wonder if it's a problem with the tool.
[1] http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...?
There comes a point when the quality is so poor that you no longer trust your interpretation. Is that a 3? An 8? If you can't tell, you will not act on that information without further clarification.
This compression algorithm destroys this process.
How can you trust what you are reading anymore? How do we know there isn't a bug that sometimes causes the content substitution when the source text is large and perfectly legible?
Disk space is not at enough of a premium to justify this.
convert *.jpg JPEG.pdf -- 43777 kb
convert *.png PNG.pdf -- 6907 kb
jbig2 -b J -d -p -s *.jpg; pdf.py J > JBIG2.pdf -- 947 kb
jbig2 -b J -d -p -s -2 *.jpg; pdf.py J > 2xJBIG2.pdf -- 1451 kb
Quite a difference. I don't quite understand how JPEG fares so poorly compared to (lossless) PNG, maybe because it doesn't do monochrome?
[1] http://ssdigit.nothingisreal.com/2010/03/pdfs-jpeg-vs-png-vs...
The only acceptable fix for this is to disable the ability to use lower compression qualities that could EVER cause this to happen.
"Normal" is an overly aggressive compression setting? Is that an overly aggressive setting for the end-user or for Xerox to be implementing in their hardware marketed to law firms?
I expected something better from Xerox. Instead it is a sort of: "You are a stupid customer, leave it on default and stop bothering me, it is not my fault you find bugs when not using the default."
Pretend you care, blame the users, and don't take any action. Hey, what could be wrong with that?
Why on earth does a scanner have a web interface?
All it's doing is recognizing "similar" patches of the image and coalescing them, which is what it's supposed to do, according to the standard. Yes, it's too aggressive.
I might have read it wrong, but from how I understood it the default settings don't have this problem. It's when people adjust the quality settings to be lower. Am I wrong?
A major and highly pertinent difference is that if this OCR-ish procedure incorrectly classifies two identical letters as being different, accuracy is not affected, and the only consequence is a larger file. With normal OCR, seeing two As and saying they're different would be an error, but in this case, it's fine.
What this means is that, while regular OCR is inherently error-prone, this compression procedure can be fully tuned anywhere between no errors and nothing but errors, with file size being the tradeoff.
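A toy Python sketch of that tradeoff, in case it helps. Everything in it is made up for illustration: the 3x5 bitmaps, the greedy first-match rule, and the names `compress` and `sweep` are assumptions, not how JBIG2 or Xerox's firmware actually works.

```python
# Toy 3x5 glyph bitmaps, flattened row by row (hypothetical data).
SIX         = "###" "#.." "###" "#.#" "###"
EIGHT       = "###" "#.#" "###" "#.#" "###"
EIGHT_NOISY = "###" "#.#" "###" "#.#" "##."  # same digit with one speck of scan noise
ONE         = ".#." "##." ".#." ".#." "###"

# Each entry pairs the digit that was really on the page with its scanned bitmap.
PAGE = [("6", SIX), ("8", EIGHT), ("1", ONE), ("8", EIGHT_NOISY), ("6", SIX)]


def hamming(a, b):
    """Count the pixels that differ between two equal-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def compress(page, threshold):
    """Reuse the first stored symbol within `threshold` pixels, else store a new one."""
    dictionary = []  # list of (true_label, bitmap)
    indices = []     # which dictionary entry stands in for each glyph on the page
    for label, bitmap in page:
        for i, (_, stored) in enumerate(dictionary):
            if hamming(bitmap, stored) <= threshold:
                indices.append(i)
                break
        else:
            dictionary.append((label, bitmap))
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def sweep(page, threshold):
    """Return (symbols stored, digits that would read differently after decoding)."""
    dictionary, indices = compress(page, threshold)
    wrong = sum(dictionary[i][0] != label for (label, _), i in zip(page, indices))
    return len(dictionary), wrong


if __name__ == "__main__":
    for t in (0, 1, 2, 10):
        size, wrong = sweep(PAGE, t)
        print(f"threshold={t}: {size} symbols stored, {wrong} wrong digits")
```

With this toy data, threshold 0 stores four symbols (the two 8s are kept separately, which only costs space) and gets every digit right, while threshold 2 stores two symbols and turns both 8s into 6s.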
The ability to run this algorithm in a way that produces no errors may be enough to disqualify it as "OCR", depending on your point of view. In any case, it certainly changes things from "that's just how it is" to "this is a royal cock-up on Xerox's part".
And even the support person didn't know about the consequences of the setting.
It also seems that the setting was used when copying, not just when scanning (still seeking confirmation on that one), which would be quite useless.
This seems a bit hair splitty when the end result is the same as invalid OCR dictionaries.
"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded, typically also using context-dependent arithmetic coding, and the regions are encoded by describing which symbols appear where."
Then from the OCR wiki[2].
"Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching" or "pattern recognition"."
Furrow your brow and smash the down-vote arrow all you wish. It won't stop JBIG2 from doing much of what people consider OCR to be doing today: recognizing characters. JBIG2 just adds building its own dictionary, which is what opened the path to today's topic.
[1] http://en.wikipedia.org/wiki/JBIG2 [2] http://en.wikipedia.org/wiki/Optical_character_recognition
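The "dictionary of symbols" plus "which symbols appear where" idea from the Wikipedia quote above could be sketched roughly like this in Python. It assumes exact matching only, so the round trip is lossless, and it leaves out the arithmetic coding entirely; the `encode` and `decode` names and the tiny bitmaps are hypothetical.

```python
def encode(placed_glyphs):
    """Turn a list of (x, y, bitmap) into a symbol dictionary plus placements.

    Only bit-identical bitmaps share a dictionary entry, so nothing is lost.
    """
    dictionary = []   # each distinct bitmap, stored once
    index_of = {}     # bitmap -> position in `dictionary`
    placements = []   # (symbol_index, x, y): which symbol appears where
    for x, y, bitmap in placed_glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements


def decode(dictionary, placements):
    """Rebuild the list of (x, y, bitmap) from the dictionary and placements."""
    return [(x, y, dictionary[i]) for i, x, y in placements]


if __name__ == "__main__":
    # Hypothetical 2x2 bitmaps flattened to strings, placed along one text line.
    glyphs = [(0, 0, "#..#"), (3, 0, ".##."), (6, 0, "#..#")]
    dictionary, placements = encode(glyphs)
    print(len(dictionary), "symbols for", len(glyphs), "glyphs")
    print(decode(dictionary, placements) == glyphs)
```

The trouble in this thread starts when the "is this the same symbol?" test is loosened from bit-identical to merely similar.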
Having various compression/quality options allows you to pick the tradeoff (file size/resulting quality) that is acceptable for your situation. There is no perfect setting for all situations. Even the original bitmap is an imperfect (i.e. lossy) rendering of the original document.
I don't expect the scanner to have any semantic awareness of the document content, so when I hear "lossy compression", my expectation is "image may become illegible", and not "image may remain legible, but become inaccurate".
The issue only involves small letters, because the compression scheme breaks up the image into patches and then tries to identify visually similar blocks and reuse them. Certain settings can allow for small blocks of text to be deemed identical, within a threshold, and thus replaced. That's all. Coincidence, not semantic awareness.
Hence the advisory notice to use a higher resolution -- smaller block sizes.
A document will be covered in numbers, and the compression algorithm looks for similar blocks it can re-use; the side effect is sometimes it says "that blurry 4 looks pretty close to this blurry two, so I'll just store that block once and reuse it"
The problem is that this is a minor side effect to a programmer and an absolutely massive issue to an end user that no-one had thought of previously, and now we all have to be worried that all our scanned documents might be incorrect. (just because this was found in fuji-xerox scanners doesn't mean other brands don't also have the issue)
According to Adam (https://news.ycombinator.com/item?id=6156418) this is a known problem that Xerox, who call themselves document people for crying out loud, should have known and compensated for.
Copiers very commonly copy printed material. This sort of algorithm makes it likely that sometimes one character will be replaced by another, so it is a bad algorithm for the job.
Xerox should have known better.
As opposed to what, ImageCompression News where you can expect everyone to know it?
Clearly, the compression algorithm is designed around human perception (i.e. looking for visually-similar segments to, I assume, tokenize), and therefore does relate to the actual semantics of the document, albeit in a coarse and mechanical way. It did know enough to replace character glyphs with other character glyphs, but didn't know enough to choose the right ones.
My point is that it's not coincidental at all - this algorithm is obviously in a sort of "uncanny valley" in its attempt to model human visual perception.
Again from the JBIG2 wiki[1]:
"Textual regions are compressed as follows: the foreground pixels in the regions are grouped into symbols. A dictionary of symbols is then created and encoded.."
It seems not only is JBIG2 being deployed as OCR by Xerox for whatever reason, its implementation in this case is an absolute failure.
edit: by the definition you seem to be going on, any facial recognition is also OCR, since you could consider a face a 'glyph' (edit: 'symbol'). The only 'text' thing here that I can see is that it is intended to be used on text, which allows some optimizations; that doesn't mean it's actually text-based in any way.
Say that the scanner internally splits the scan into regions of 10x10 pixels that it saves in memory. If another region differs in fewer than (say) 10% of its pixels, the two regions are assumed to be identical and the first one is used in place of the second. The regions have no semantic meaning.
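Here is roughly what that hypothetical scanner would do, as a Python sketch. The 10x10 region size and the 10% cutoff are just the example numbers from above, and `lossy_copy` is an invented name; real JBIG2 encoders segment symbols rather than cutting fixed tiles.

```python
def split_regions(image, size):
    """Yield (row, col, tile) for each size x size tile of a 0/1 image.

    Assumes the image height and width are multiples of `size`.
    """
    for r in range(0, len(image), size):
        for c in range(0, len(image[0]), size):
            tile = tuple(tuple(row[c:c + size]) for row in image[r:r + size])
            yield r, c, tile


def diff_fraction(a, b):
    """Fraction of pixels that differ between two equal-sized tiles."""
    total = sum(len(row) for row in a)
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def lossy_copy(image, size=10, max_diff=0.10):
    """Replace each tile with an earlier stored tile that differs in < max_diff of pixels.

    Returns (output image, number of tiles stored, number of tiles whose pixels changed).
    """
    stored = []
    out = [list(row) for row in image]
    changed = 0
    for r, c, tile in split_regions(image, size):
        match = next((s for s in stored if diff_fraction(tile, s) < max_diff), None)
        if match is None:
            stored.append(tile)  # nothing close enough: keep this tile as-is
            continue
        if match != tile:
            changed += 1  # the copy no longer matches what was scanned
        for dr, row in enumerate(match):
            out[r + dr][c:c + size] = row
    return out, len(stored), changed
```

Nothing in there knows what a digit is. A tile holding a small "6" and one holding a small "8" are just two pixel grids that may or may not land under the cutoff.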
OCR translates the scan into a character set.
Also, something to think about: an EBCDIC document accidentally printed as ASCII/8859-1 would have equally zero semantic meaning when fed into an OCR program. But I don't think anyone would argue it wasn't OCR.
That mapping isn't a very big thing. Sometimes text-based PDFs don't even have it, and you don't notice unless you try to copy out and get the wrong letters.