Website data leaks pose greater risks than most people realize(seas.harvard.edu) |
Website data leaks pose greater risks than most people realize(seas.harvard.edu) |
Part of the problem is that there are still no good criteria available to define anonymity. Concepts like differential privacy are a step in the right direction but they still provide room for error, and in many cases they are either too restrictive (transformed data is not useful anymore) or too lax (transformed data is useful but can be easily re-identified).
Society is a tapestry of bullshit and low-level swindling is generally tolerated or quickly forgotten about. Thus, there's nothing to prod the unprincipled in charge to do the right thing. As long as something seems to be good(anonymized, in this cage), and problems can be hidden behind the corporate veil long enough, the unwritten rule is to half-ass security solutions because, well, security is boring and there's other things to devote company time and resources to(that will advance upper management).
Security measures, especially those that protect the users, don't make money. At best, they're insurance against the fallout that might occur when it's revealed that your company has been silently screwing people over. Like most human beings, businesses often put off serious consideration of the future in order to enjoy quick and immediate gain.
I wouldn't put it past most companies to screw up an approach like differential privacy. Not enough people actually care that much.
This is why the government has to make regulations with teeth in this space (of course, the government could be the "unprincipled in charge" you referred to).
Lots of companies are content to stop at "our data can't be linked back to a person's identity", which doesn't prevent building a uniquely-identifying user profile. (e.g. via browser fingerprinting, plus enough metadata to associate a user's computer and phone accounts.) Even if they do better than that, its typically "our data is not uniquely identifying in isolation", which still isn't enough. If your differential privacy model says that these four pieces of data have a specificity of 10,000 possible individuals, that's a good start. But if someone with an individual's PII and three of those keys comes looking, they can still narrow down information about the fourth value from your aggregates.
And even if no one screws up, what happens when someone queries a half dozen differential datasets for different subsets of a uniquely identifying key? It's something like the file-drawer problem, where one researcher hiding bad data is malicious, but a dozen studies failing to coordinate produces the same result innocently. If outright failures to anonymize become rarer, cross-dataset approaches become more rewarding.
https://fpf.org/wp-content/uploads/2017/06/FPF_Visual-Guide-...
You keep data because data is economically valuable, but even when you care enough to implement some techniques that depends on the invariants you still fail to achieve something the better because of scale and because you don't want to refine the techniques. This also means that somehow somebody may have a technique that, provided enough pieces of data, can reverse you transformation.
If companies were required to aggregate information in this way and throw away their logs, perhaps leaks would be much less risky for their users.
Today this might seem far-fetched, but it could come to pass in the future, when people raised in this environment and able to understand the implications and technical aspects come to political power.
For freeways, lots of small segments, and fuzzing of timestamps to co-mingle users. Where there's a stoplight snap the intersection cross-time to the green light (guess) for anyone in the queue.
The anonymity would come from breaking up both requests and observed telemetry to fragments too small to tie back to a single user or session (and thus form a pattern; I hope).
Do NOT record end-times, only an intended route. Do NOT associate that movement to any particular user or persistent session (ideally in memory on the mobile device only, not saved: though it could save favorite routes locally). Packages of transition times between various freeway exits would generally help add to anonymity.
That would also be part of generally improving the UI for the user. The application on the device should be making most of the decisions, by asking about the traffic in a given region on a grid. I also want it to show me (the driver) the data (heatmap) on the rejected routes so I know what isn't a good option.
https://www.hhs.gov/hipaa/for-professionals/privacy/special-...
https://www.deccanchronicle.com/technology/in-other-news/201...
There's tons of PHI on the internet. Your local hospital's online medical chart, your insurance companies bill-pay, etc...
The main take-away from the talk - an in fact all the talks I saw on the same day - was that while DP is touted as a silver bullet and the new hotness, in reality it can not protect against the battery of information theoretical attacks advertisers have been aware of for couple of decades, and intelligence agencies must have been doing for a lot longer. Hiding information is really hard. Cross-correlating data across different sets, even if each set in itself contains nothing but weak proxies, remains a powerful deanonymisation technique.
After all, if you have huge pool of people and dozens or even hundreds of unique subgroups, the Venn-diagram-like intersection of just a handful will carve out a small and very specific population.
There's a lot of privacy snakeoil out there and even large govt departments fall for it.
https://pursuit.unimelb.edu.au/articles/the-simple-process-o...
It's possible I'm misreading, but your paper seems to focus on the very anonymization techniques diff privacy was invented to improve on, specifically because these kinds of attacks exist. While I agree it's no silver bullet, the reason is because it's too strong (it's hard to get useful results while providing such powerful guarantees) rather than not strong enough.
I've found the introduction to this textbook on it to be useful and very approachable if others are interested: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
After spending three years working on privacy technologies I'm convinced that anonymization of high-dimensional datasets (say more than 1000 bits of information entropy per individual) is simply not possible for information-theoretic reasons, the best we can do for such data is peudonymization or deletion.
I want to be better equipped to respond to this slowly emerging "DP is a silver bullet" meme and your response implies that you'd have actual research to back the position up.
Also, it's not a magical solution. Here's one of the issues from the linked paper (edited for clarity):
"The proponents of differential privacy have always maintained that the setting of the [trade-off between privacy loss (ε) and accuracy] is a policy question, not a technical one. [...] To date, the Census committee has set the values of ε far higher than those envisioned by the creators of differential privacy. (In their contemporaneous writings, differential privacy’s creators clearly imply that they expected values of ε that were “much less than one.”)
us census, RIP
One of the leaks they talk about way from Experian, a credit reporting agency. Not only would this approach work poorly for them, it wouldn't be legal (they need to be able to back up any claims they make about people, which requires going back to the source data).
Not specifically, but I suppose I wouldn't say that politicians are more or less principled than corporate executives. I know some would argue otherwise, but I'm too black pilled at this point to have faith in any "public servant".
Nevertheless, government regulation is probably the way to actually address these issues. Government may lock competence or will, but at least it provides us some leverage, little it may be.
https://medium.com/@francis_49362/dear-differential-privacy-...
Here's my simple take: Imagine you want to protect individuals by using differential privacy when collecting their data. Imagine you want to publish datapoints that each contain only 1 bit of information (i.e. each datapoint says "this individual is member of this group"). To protect the individual, you introduce strong randomization: In 90 % of cases you return a random value (0 or 1 with 50 % probability), and only in 10 % of the cases you return the true value. This is differentially private and for a single datapoint it protects the individual very well, because he/she has very good plausible deniability. If you want a physical analogy, you can think of this as adding a 5 Volt signal on top of 95 Volt noise background: For a single individual, no meaningful information can be extracted from such data, if you combine the data of many individuals you can average out the noise and gain some real information. However, averaging out the noise also works if you can combine multiple datapoints from the same individual, if those datapoints describe the same information or are strongly correlated. An adversary who knows the values of some of the datapoints as context information can therefore infer if an individual is in the dataset (which might already be a breach of privacy). If the adversary knows which datapoints represent the same information or are correlated he can also infer the value of some attributes of the individual (e.g. learn if the individual is part of a given group). How many datapoints an adversary needs for such an attack varies based on the nature of the data.
Example: Let's assume you randomize a bit by only publishing the real value in 10 % of the cases and publish a random (50/50) value in the other cases. If the true value of the bit is 1, the probability of publishing a 1 is 55 %. This is a small difference but if you publish this value 100 times (say you publish the data once per day for each individual) the standard deviation of the averaged value of the randomized bits is just under 5 %, so an adversary who observes the individual randomized bits can already infer with a high probability the true value of the bit. You can defend against this by increasing the randomization (a value of 99 % would require 10.000 bits for the standard deviation to equal the probability difference), but this of course reduces the utility of the data for you as well. You can also use techniques like "sticky noise" (i.e. always produce the same noise value for a given individual and bit), in that case the anonymity depends on the secrecy of the seed information for generating that noise though. Or you can try to avoid publishing the same information multiple times, this can be surprisingly difficult though, as individual bits tend to be highly correlated in many analytics use cases (e.g. due to repeating patterns or behaviors).
That said differential privacy & randomization are still much more secure than other naive anonymization techniques like pure aggregation using k-anonymity.
We have a simple Jupyter notebook that shows how randomization works for the one-bit example btw:
https://github.com/KIProtect/data-privacy-for-data-scientist...