Show HN: Copyfish – Extract text from images, videos or PDF (addons.mozilla.org)
On the other hand, the OCR.space OCR API has a very strict privacy policy:
https://ocr.space/privacypolicy - All uploaded images and the extracted text are deleted immediately after processing.
It looks like many free extensions either have malware in them from the start or get sold to malware companies later on, who then deploy the malware via updates:
http://lifehacker.com/many-browser-extensions-have-become-ad...
Here are archived versions of the URLs mentioned in the issue:
Without "partner extension": http://archive.is/anu2E
With "partner extension": http://archive.is/bp93l
As is evident, their "partner extension" in fact maliciously hijacks and replaces ad space on websites visited by the user.
Strangely, searching for their name among the issues on GitHub does not show other such results. I guess they usually make contact directly and that the person at that company who filed this issue did not realize it would be visible to the public.
Here is the full text of the issue:
> Adnow is interested in byuiing your extension traffic #1
> Dear Kyong Tsu,
> My name is Anastasia, I am a manager from international advertising network Adnow.
> Extension traffic is a hot trend nowadays, and we are interested in buying traffic from Facebook Video Downloader extension and the others. We are ready to share an idea of monetization extensions with you and give you a method.
> We offer:
> * high payouts
> * 100% fill rate (we buy traffic from all over the world)
> * Integration through JS Tag / XML / JSON feed
> * Integration method
> That's how the page looks without partner extension: https://gyazo.com/5d635a9dc7bdc142e18e6775a1d1340d
> And that's how it looks for user with our plugin/code in extension: https://gyazo.com/a2b48b16d304a3ba37cdf6967fa4d9d8
> Please contact me in case you are interested in monetization your extensions.
> I am looking forward to your answer.
> Thank you in advance.
> Best regards,
> --
> Anastasia Nova
> Sales manager | Adnow LLP
> e.: tasya@sales.adnow.com
> Skype: tasya@adnow.com
[1]: https://github.com/KyongTsu/TabMemorySaver/issues/1
Archived snapshot of above issue: http://archive.is/Z5mJl
I started using it a bit ago, the area selection seems a bit wonky, but otherwise works.
It is beyond irresponsible for Mozilla to do nothing to prevent this malware from being recommended on their platform.
Look down the bottom.
It uploads everything to a commercial OCR service. Which provides these CPU cycles 'for free'.
Who owns this data? Do you have a privacy agreement with ocr.space? Can you trust them as far as you could spit?
It doesn't matter that this is documented, though, unless it shows a popup banner EVERY TIME YOU USE IT saying "Your data will be sent to a cloud service for OCR, which may keep/index/sell your data without restriction."
Are you misunderstanding the extension or am I missing something bigger?
E: A total guess: "the server will see the image you are trying to OCR"? That's about as much privacy as I could see being intruded upon.
It is good that it isn't scanning everything, i.e. complete exfiltration, but that is a low bar. It leaks every time you use it.
Abbyy (best recognition rate but by far most expensive), Google Cloud Vision (second best recognition rate), Microsoft OCR and... our OCR.space service with a very generous free tier and a competitively priced PRO tier.
They should have an API to point to. It is fairly accurate. I use them occasionally via ShareX, which uses their API for OCR.
If your images however differ from the typical text document, recognition from those services will fail. OCR is highly dependent on the particular application and the kind of images that you're dealing with. Preprocessing and segmentation are very important.
If you need a custom solution, my email is in my profile.
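To illustrate the preprocessing point above: one common first step before OCR is binarization, i.e. turning a grayscale image into pure black and white so the text stands out from the background. Below is a minimal sketch using Otsu's method in plain Python. It is only an illustration, not what Copyfish or any of the services mentioned actually do; the function names `otsu_threshold` and `binarize` and the tiny sample image are made up for the example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best separates pixels <= t
    from pixels > t, by maximizing the between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below t
    sum0 = 0  # sum of gray levels at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no split at this level
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows of 0-255 ints) to 0/255 values."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]


# Hypothetical 2x3 image: dark "text" pixels next to a light background.
sample = [[30, 40, 200],
          [35, 210, 220]]
print(binarize(sample))
```

A single global threshold like this tends to work for clean scans. For text over a busy background such as movie subtitles, a local or adaptive threshold is usually a better starting point.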
But how's the accuracy here? Cause when I used previous plugins for this functionality, I often found they'd return gibberish if the text was even slightly ambiguous looking in image form.
How does it compare to the other plugins doing the same thing here?
> For extension gurus: You might have heard of Project Naptha, a great addon that applies state-of-the-art computer vision algorithms on every image you see while browsing the web. Copyfish solves the same problem, but it takes a different user interface approach. It does not try to alter the website. Instead, it lets you mark the text in the image that you want to extract. As a result Copyfish works with every website, even videos and PDF documents.
Thanks for making this!
I saw the first example screenshot on the page was a Chinese movie and thought "Great, it does"
I saw the enlarged version of the screenshot and the Chinese subtitles contain multiple mistakes: "Nice try, but maybe not so great after all for the use case I'd personally be interested in".
The tricky part for the OCR in this example is the diverse background, as the Chinese characters are directly inside the movie.
Your comment is interesting, as the original motivation for creating the Copyfish extension was to help me watch Chinese movies. So I can confirm that for this purpose, it works fine. Of course, once in a while it gets some characters wrong but it works ok with many movies.
Here is a screencast of Copyfish doing subtitle OCR:
Wondering what you're using for OCR?
For developers: Copyfish is published under the GPL open-source license. As OCR software, it uses the free OCR API from https://ocr.space/
Neat! Brother. +1 =100 Ace
Yep, same with TV shows, and soft-copies of transcripts are difficult to come by, hence my interest in something like this.
I just watched the video. When used on a video does it keep a history of all OCRed text?
Finally, you might also like to try posting this on http://www.chinese-forums.com If it mostly works well for TV and films, I'm sure there will be quite a few people there who are interested in it.
Not yet - but this feature is already on my todo list ;)
Thanks for the hint about the Chinese forums!
Another interesting feature would be to do some sort of statistical analysis of Chinese text being OCRed and then combining that with possible characters suggested by the OCR. This would almost certainly prevent the mistake in the last two characters of the Chinese movie screenshot.
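A rough sketch of that idea, assuming the OCR engine can return several candidate characters per position with a confidence score (whether OCR.space exposes this is not stated in the thread). The candidates are combined with character-pair (bigram) probabilities and the most likely sequence is picked with a Viterbi search. The function name `pick_characters` and all the numbers below are made up for illustration.

```python
import math


def pick_characters(candidates, bigram_prob, floor=1e-6):
    """Pick the most likely character sequence.

    candidates:  one list per text position, each holding
                 (character, ocr_confidence) pairs, confidence in (0, 1].
    bigram_prob: dict mapping (previous_char, char) to a probability
                 estimated from some text corpus (hypothetical here).
    floor:       probability assumed for character pairs not in the dict.
    """
    if not candidates:
        return ""
    # best[ch] = (log score of the best path ending in ch, that path)
    best = {ch: (math.log(conf), ch) for ch, conf in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for ch, conf in position:
            # Best previous path to extend with ch, judged by the bigram model.
            score, path = max(
                (prev_score + math.log(bigram_prob.get((prev, ch), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[ch] = (score + math.log(conf), path + ch)
        best = new_best
    return max(best.values())[1]


# Made-up example: the OCR slightly prefers the wrong second character,
# but the pair statistics make the correct reading win.
candidates = [
    [("你", 0.9)],
    [("奸", 0.55), ("好", 0.45)],
]
bigrams = {("你", "好"): 0.2, ("你", "奸"): 0.0001}
print(pick_characters(candidates, bigrams))
```

Without the bigram table the second position would come out as the higher-confidence but wrong candidate. With it, the common pair outweighs the small confidence gap.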
Based on this GitHub they might be using the Microsoft OCR library.
As long as the plugin is clear that they are using a third party service that will receive your images, I think it is fine to leave it at that. Not everyone feels that is a deal breaker, and they shouldn't be annoyed by a pop up just because their deal breaker is different than yours.
If the end user clicks the 'do not show again' checkbox on the message, sure. But it should still be graphically represented whenever you use an insecure cloud plugin, e.g. via an unlocked padlock sub icon if it doesn't use TLS, maybe a cloud sub-icon to represent someone else's computer.
While you might want to believe that a user would actually think about what they are accepting, reality is almost all don't. Even the more security minded people among us will start to get numb to the requests. Only the most paranoid would pay attention to all of them, and those people are probably already doing things that would make that sort of pop up redundant.
I think this is a very common trap we fall into, where we want to provide MORE warnings to people and let them use their judgement. However, there is such a thing as 'alert fatigue'.
In California, companies that produce carcinogens took advantage of this aspect of human nature; when California wanted to place warning signs about cancer causing substances, they realized they couldn't win the fight against the warnings. Instead, they fought for MORE warnings; they wanted warning signs for even very slight risk carcinogens. They knew that if the signs were EVERYWHERE, people would stop paying attention to them.
It worked. Basically every building in California has a warning that 'substances known to cause cancer or birth defects are present'. Since every building has the same warning, I have no way of knowing which ones are ACTUALLY dangerous.
Until they are served with a subpoena for a particular client, or a sweeping subpoena to store everything forever, or the company is sold and the new parent has different values, or the company decides to mine customer data for advertising uses, or there's a bug in the software, or there's a long-lived cache of the data, or it gets into their backups accidentally or deliberately, or they don't keep the data but keep "just" the meta-data, or they do statistics or analytics before deleting the data, or they are hacked, or they simply change their minds.
In terms of privacy, even a non-free non-open-source local app with DRM or license management is better than a server app with a "strict privacy policy". With a good firewall setup, you can be pretty sure that the local app won't betray you.
http://www.daemonology.net/blog/2012-01-19-playing-chicken-w...
You mean that we have to place some trust that they are. Some users cannot afford that kind of trust.
Personally, I chose .space simply because it's cool, cheap, and not overcrowded. It also seems to lend itself well to being part of a name.
I know spam is a hard problem, but I wish you wouldn't label me a spammer simply because of the TLD I chose.
There are a few others which you may want to avoid according to this report: https://securityintelligence.com/enticing-clicks-with-spam/
> Why did you end up going with a .space domain? We blocked that whole TLD because we were getting massive amounts of spam from it when it first came out.
From your comment:
> I know spam is a hard problem, but I wish you wouldn't label me a spammer simply because of the TLD I chose.
The author is not "labeling you a spammer". They're simply stating a fact about their experience. And in fact, it doesn't even mention you.
I only tried to highlight that they have, in effect, labeled everyone in .space (not just me, but me included) as a spammer.
It's heavy handed, but I understand there are sometimes pressing needs for quick solutions, like when having your mailboxes flooded with SPAM. Hence, the "I know ..." clause.
`For developers: Copyfish is published under the GPL open-source license. As OCR software, it uses the free OCR API from https://ocr.space/ .`
Also, for nearly all documents I ever need to scan, if they're important enough to require scanning, they're important enough that a third party should have nothing to do with them.
The majority of exceptions to the above being, ironically, documents without text, sketches, doodles, etc.
However, I don't believe for a second, without some kind of law, punishable by death, a requirement like that would have lasted. It would take only one browser to default "Never prompt for permissions to run JavaScript". Typical users would flock to it (because sites would say they only work with it) and compliant browsers would have to copy to compete. Users ruin everything.