What Is Synthetic Data? The Good, the Bad, and the Ugly (benthamsgaze.org)
Quick Start:
curl https://clickhouse.com/ | sh
./clickhouse obfuscator --help
Source code: https://github.com/ClickHouse/ClickHouse/tree/master/program...
It does not use differential privacy.
There are so many ways in which data can point to individuals, you'd need to process every datapoint with a lot of care and investigation.
For example, rare medical conditions can be a good identification tool if the adversary knows the relation between such a condition and a person. How would an automatic tool know if a medical condition is rare enough? How will it know if such information is already available elsewhere?
Information may be transferred as images, or as audio. What if the database simply stores these as blobs and only the application knows what format is used inside the blob?
Or, even if the format is known, it may be a format such as DICOM, where it's hard to tell whether a piece of information is significant. You can often recognize an MRI machine from features of the images it takes, e.g. artifacts that show up in every image. DICOM files usually carry information such as the date the image was taken, alongside the patient's name. Even with the name removed, someone who connects the date and the machine may be able to infer which patient was pictured, if they also know that the patient paid for a cab ride around that time. Or, even simpler: sometimes there is text in the DICOM images themselves that identifies the patient in some way.
Or, in a situation like my office: there's one woman and 30 men working there. Surprisingly, gender becomes a very precise tool for identifying people.
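To make the date-plus-machine linkage concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the scan records, the cab ride records, the field names, and the `link` helper are invented for illustration and do not reflect real DICOM tags or any real dataset.

```python
from datetime import date

# Hypothetical "anonymized" scan metadata: names removed, date and machine kept.
scans = [
    {"scan_id": "s1", "date": date(2020, 1, 6), "machine": "MRI-A"},
    {"scan_id": "s2", "date": date(2020, 1, 7), "machine": "MRI-A"},
    {"scan_id": "s3", "date": date(2020, 1, 7), "machine": "MRI-B"},
]

# Hypothetical side information held by an adversary: who took a cab
# to the site of which machine, and on what day.
cab_rides = [
    {"name": "Alice", "date": date(2020, 1, 6), "site": "MRI-A"},
    {"name": "Bob", "date": date(2020, 1, 7), "site": "MRI-A"},
    {"name": "Carol", "date": date(2020, 1, 7), "site": "MRI-A"},
]


def link(scans, rides):
    """Map scan_id -> name wherever exactly one rider matches date and site."""
    matches = {}
    for scan in scans:
        candidates = [
            ride["name"]
            for ride in rides
            if ride["date"] == scan["date"] and ride["site"] == scan["machine"]
        ]
        if len(candidates) == 1:  # an unambiguous match re-identifies the scan
            matches[scan["scan_id"]] = candidates[0]
    return matches


reidentified = link(scans, cab_rides)
```

In this toy data only the first scan has a single matching rider, so only that one is re-identified. The point is that no name field was needed, just two attributes that looked harmless on their own.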
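The office example is essentially a k-anonymity failure: a column that looks harmless singles out one person. A rough sketch of how one might count group sizes per combination of quasi-identifiers follows. The records, the column names, and the `group_sizes` helper are all made up for illustration.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


# Hypothetical office: one woman and 30 men, no names stored anywhere.
office = [{"gender": "F", "dept": "eng"}] + [
    {"gender": "M", "dept": "eng"} for _ in range(30)
]

groups = group_sizes(office, ["gender"])
k = min(groups.values())  # the table is only k-anonymous for these columns
unique = [key for key, n in groups.items() if n == 1]  # combinations that pinpoint one person
```

Here `k` comes out as 1, i.e. at least one person is uniquely identified by gender alone. A check like this only covers the columns you thought to list, which is exactly the limitation described above.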
What obfuscation mainly does is remove the PII that neither side wants to handle before the data gets transferred over so the data is "safe" and the receiver of the data no longer has the burden of stewardship over PII.
Contrary to a lot of handwringing on the internet, almost everyone who handles your data couldn't care less about you as a person. Their overwhelming interest in you is as a bag of attributes that they can statistically correlate with other bags of attributes. It's a relief for them if they can scrub all the PII from their databases while retaining all of the other bag-of-attributes qualities that they care about. Of course, the few entities that do care about deanonymization are the ones that make this entire process so difficult.
It’s kind of like the word “secure”. The threat model matters - what is being protected and from whom?
Late in the decision process, I couldn't resist, I blurted out the joke that had been on my mind for days,
"I can't believe it's not data!"
("Not butter", if you're too young for the margarine commercial reference.)
This did not go over well, and probably cost math a grant.
For stationary data, only stationary data, it is very powerful.
Look up "Stationary Bootstrap"
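For anyone who does look it up, here is a minimal sketch of the stationary bootstrap as I understand it (Politis and Romano): resample the series in blocks whose lengths are geometrically distributed, wrapping around the end of the series. The function name, parameter names, and sample data are my own choices, not from any library.

```python
import random


def stationary_bootstrap(series, mean_block_len, rng=None):
    """Draw one resample of `series` using geometric-length, circular blocks.

    mean_block_len must be >= 1; the block length is geometric with that mean.
    """
    rng = rng or random.Random()
    n = len(series)
    p = 1.0 / mean_block_len  # probability of starting a new block at each step
    out = []
    idx = rng.randrange(n)  # the first block starts at a uniformly random position
    while len(out) < n:
        out.append(series[idx])
        if rng.random() < p:
            idx = rng.randrange(n)  # jump: start a new block somewhere else
        else:
            idx = (idx + 1) % n  # continue the current block, wrapping around
    return out


# Hypothetical series; a seeded generator keeps the run reproducible.
data = [0.1, 0.4, 0.3, 0.7, 0.6, 0.9, 0.8, 1.2, 1.1, 1.0]
sample = stationary_bootstrap(data, mean_block_len=3, rng=random.Random(0))
```

The resample keeps short-range dependence inside each block, which is why it only makes sense when the series is stationary. The mean block length is a tuning choice that this sketch leaves to the caller.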
Precisely. They care about my credit card number and enough identifying details to impersonate me to the credit company...
So, he claimed that whoever builds such a thing will be instantly the richest person in the world, eclipsing Bill Gates and Jeff Bezos combined.
Well, having worked with many different databases, I can see how that's a mission impossible... So, what does this have to do with anonymization? Well, most databases in the world are either built by application developers or later extended to meet the demands of application developers, in such a way that, in all but the most trivial cases, the meaning of the stored data is impossible to determine without the application that works with the database. Not to mention that the data in the majority of databases is generated by humans, and even though both application developers and data administrators try to prevent invalid inputs, they too make mistakes.
To continue the example of DICOM files: those are typically generated by a combination of a technician operating the machine, a radiologist who reads the image, a doctor who ordered the imaging, and a medical secretary who collected the patient's data upon arrival. All of these people are very busy and have very little time to spend on each patient. This often leads to a mismatch between the field type and the data stored in the field. E.g. the patient's address gets stored in the name field, the name is stored in the allergies field, and so on. Some data are essential for the file to move around the system, but a lot of the properties won't prevent the file from reaching its target, even if they contain completely nonsensical data.
----
My wife participated in some Kaggle challenges that had to do with chest CT. In order to do that, she went through some of the publicly available sets of images that belong to this general category. Each contained defective images, up to and including CTs of other body parts, X-rays and so on. (Needless to say, stuff like the proper radiological modality was wiped from the sets, so there was no contrast information attached to the images etc.) And that was only what she could find with some simple scripts which relied on heuristics.
What I'm trying to say is that dealing automatically with large quantities of data acquired in real-world situations will almost certainly not live up to expectations. It will require a human in the loop until we have AI comparable to human intelligence.