What Is Synthetic Data? The Good, the Bad, and the Ugly

(benthamsgaze.org)

66 points by sjm217 3 years ago | 12 comments

zX41ZdbW 3 years ago

If you need to anonymize a dataset (structured, possibly linked tables), I recommend clickhouse-obfuscator - a tool designed specifically for this purpose: https://clickhouse.com/blog/five-methods-of-database-obfusca...

Quick Start:

    curl https://clickhouse.com/ | sh
    ./clickhouse obfuscator --help

Source code:

https://github.com/ClickHouse/ClickHouse/tree/master/program...

It does not use differential privacy.

crabbone 3 years ago

Anyone who believes they can anonymize data automatically will be very disappointed...

There are so many ways in which data can point to individuals that you'd need to process every datapoint with a lot of care and investigation.

For example, rare medical conditions can be a good identification tool if the adversary knows the relation between such a condition and a person. How would an automatic tool know if a medical condition is rare enough? How will it know if such information is already available elsewhere?

Information may be transferred as images, or as audio. What if the database simply stores these as blobs and only the application knows what format is used inside the blob?

Or, even if the format is known, in a format such as DICOM it's hard to tell whether the information is significant or not. You can often recognize MRI machines by various features of the images they take, e.g. there might be artifacts that appear in every image. DICOMs usually have information such as the date the image was taken, besides the patient's name. But by connecting the date and a machine, one may be able to infer which patient was pictured, if they also know that the patient paid for a cab ride around that time. Or, even simpler: sometimes there is text in DICOM images identifying patients in some way.

Or, in a situation like my office: there's one woman and 30 men working there. Surprisingly, gender becomes a very precise tool for identifying people.
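The office example can be sketched as a group-size check, which is the idea behind k-anonymity: count how many records share each combination of quasi-identifier values and look at the smallest group. This is only an illustration; the toy table, the `gender` column and the `smallest_group` helper are all made up here.

```python
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Return (k, combos): k is the size of the smallest group of records
    sharing the same quasi-identifier values, and combos lists the value
    combinations that have exactly k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    k = min(counts.values())
    combos = sorted(c for c, n in counts.items() if n == k)
    return k, combos

# Hypothetical office: one woman and 30 men, names already removed.
office = [{"gender": "F"}] + [{"gender": "M"} for _ in range(30)]

k, combos = smallest_group(office, ["gender"])
print(k, combos)  # 1 [('F',)]
```

A result of k = 1 means at least one record is unique on those columns alone, so dropping names did nothing for that person. A check like this only covers the columns you thought to list, which is the point being made about automatic tools.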

shalmanese 3 years ago

I don't think antagonistic data obfuscation is the primary problem to be solved since, as you noted, it's extremely hard and not valid in most circumstances. Antagonism should be filtered out at the client selection stage: most clients have no incentive to pierce the veil, and it's relatively easy to vet a client and make sure they have no benefit in deanonymization.

What obfuscation mainly does is remove the PII that neither side wants to handle before the data gets transferred over so the data is "safe" and the receiver of the data no longer has the burden of stewardship over PII.

Contrary to a lot of handwringing on the internet, almost everyone who handles your data couldn't care less about you as a person. Their overwhelming interest in you is as a bag of attributes that they can statistically correlate with other bags of attributes. It's a relief for them if they can scrub all the PII from their databases while retaining all of the other bag-of-attributes qualities that they care about. Of course, the few entities that do care about deanonymization are the ones that make this entire process so difficult.

edmundsauto 3 years ago

I don’t think of “anonymization” as a single thing. The requirements depend on the use case and sensitivity of the data. 100% full irreversibility is indeed a difficult task, but even partial anonymization for less sensitive types of data has value.

It’s kind of like the word “secure”. The threat model matters - what is being protected and from whom?

riedel 3 years ago

Another nice tool for anonymization that can take demographics into account: https://amnesia.openaire.eu/

Syzygies 3 years ago

I've been on various NSF grant panels. One was math / applied math / statistics. Everyone shares their concerns reading proposals; that's how one builds cred. Synthetic data got mentioned.

Late in the decision process, I couldn't resist and blurted out the joke that had been on my mind for days:

"I can't believe it's not data!"

("Not butter", if you're too young for the margarine commercial reference.)

This did not go over well, and probably cost math a grant.

wannabebarista 3 years ago

For context, here's another view on differentially private synthetic data: https://differentialprivacy.org/synth-data-1/

worik 3 years ago

They left bootstrapping off their list.

For stationary data, and only stationary data, it is very powerful.

Look up "Stationary Bootstrap".
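For anyone who does look it up, here is a rough sketch of the resampling step as I understand it, not a reference implementation. Blocks start at a uniformly random position, have geometrically distributed length, and wrap around the end of the series. The function name and the `mean_block_len` parameter are my own choices for illustration.

```python
import random

def stationary_bootstrap(series, mean_block_len, rng=None):
    """Return one resample of `series`, the same length as the input,
    built from wrapped blocks with geometric length (mean `mean_block_len`)."""
    if mean_block_len < 1:
        raise ValueError("mean_block_len must be >= 1")
    n = len(series)
    if n == 0:
        return []
    rng = rng or random.Random()
    p = 1.0 / mean_block_len  # chance of starting a new block at each step
    out = []
    idx = rng.randrange(n)
    for _ in range(n):
        out.append(series[idx])
        if rng.random() < p:
            idx = rng.randrange(n)   # jump: start a new block anywhere
        else:
            idx = (idx + 1) % n      # continue the block, wrapping around
    return out

# Usage: one synthetic series drawn from a toy series of 10 points.
sample = stationary_bootstrap(list(range(10)), mean_block_len=3,
                              rng=random.Random(0))
print(sample)
```

With `mean_block_len=1` this reduces to the ordinary i.i.d. bootstrap. Larger values keep more of the short-range dependence in each block, and picking that value well is a separate problem that this sketch does not address.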

hinkley 3 years ago

Synthetic data seems like a potentially useful application of GPT and friends.

Pandabob 3 years ago

The new ChatGPT API is really good at this. I had it create fake documents for a demo, where hallucinations were not an issue. Really surprised at how well it worked.