Website data leaks pose greater risks than most people realize (seas.harvard.edu)

194 points by tonicb 6 years ago | 31 comments

Most companies still don’t know what anonymization means and confuse anonymized with pseudonymized or masked data.

Part of the problem is that there are still no good criteria available to define anonymity. Concepts like differential privacy are a step in the right direction but they still provide room for error, and in many cases they are either too restrictive (transformed data is not useful anymore) or too lax (transformed data is useful but can be easily re-identified).

ravenstine 6 years ago

It's not that most of them don't know what anonymization is or are confused about it.

Society is a tapestry of bullshit and low-level swindling is generally tolerated or quickly forgotten about. Thus, there's nothing to prod the unprincipled in charge to do the right thing. As long as something seems to be good (anonymized, in this case), and problems can be hidden behind the corporate veil long enough, the unwritten rule is to half-ass security solutions because, well, security is boring and there are other things to devote company time and resources to (things that will advance upper management).

Security measures, especially those that protect the users, don't make money. At best, they're insurance against the fallout that might occur when it's revealed that your company has been silently screwing people over. Like most human beings, businesses often put off serious consideration of the future in order to enjoy quick and immediate gain.

I wouldn't put it past most companies to screw up an approach like differential privacy. Not enough people actually care that much.

dfxm12 6 years ago

Security measures, especially those that protect the users, don't make money.

This is why the government has to make regulations with teeth in this space (of course, the government could be the "unprincipled in charge" you referred to).

Bartweiss 6 years ago

And even the ones who do practice decent anonymization are generally contributing to the problem just by holding a lot of data.

Lots of companies are content to stop at "our data can't be linked back to a person's identity", which doesn't prevent building a uniquely-identifying user profile (e.g. via browser fingerprinting, plus enough metadata to associate a user's computer and phone accounts). Even if they do better than that, it's typically "our data is not uniquely identifying in isolation", which still isn't enough. If your differential privacy model says that these four pieces of data have a specificity of 10,000 possible individuals, that's a good start. But if someone with an individual's PII and three of those keys comes looking, they can still narrow down information about the fourth value from your aggregates.
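
To make the "three keys narrow the fourth" point concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the aggregate table, the key names (zip3, birth_decade, sex, diagnosis), and the counts are all invented, and the function name narrow_fourth_value is hypothetical.

```python
from collections import Counter

# Hypothetical published aggregate: counts per (zip3, birth_decade, sex, diagnosis).
# All values are invented for illustration.
published_counts = {
    ("021", 1980, "F", "asthma"): 40,
    ("021", 1980, "F", "diabetes"): 3,
    ("021", 1980, "M", "asthma"): 25,
    ("021", 1990, "F", "diabetes"): 30,
}


def narrow_fourth_value(aggregate, zip3, birth_decade, sex):
    """Return the distribution of the unknown fourth key given three known keys."""
    matches = Counter()
    for (z, b, s, diagnosis), count in aggregate.items():
        if (z, b, s) == (zip3, birth_decade, sex):
            matches[diagnosis] += count
    total = sum(matches.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in matches.items()}


# An attacker who already knows the target's zip3, birth decade, and sex:
posterior = narrow_fourth_value(published_counts, "021", 1980, "F")
for diagnosis, probability in sorted(posterior.items()):
    print(f"{diagnosis}: {probability:.2f}")
```

No row here identifies anyone, yet the attacker's belief about the fourth value moves a long way from the population baseline.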

And even if no one screws up, what happens when someone queries a half dozen differential datasets for different subsets of a uniquely identifying key? It's something like the file-drawer problem, where one researcher hiding bad data is malicious, but a dozen studies failing to coordinate produces the same result innocently. If outright failures to anonymize become rarer, cross-dataset approaches become more rewarding.

sarnowski 6 years ago

As one step to raise awareness about the differences, I really like this overview:

https://fpf.org/wp-content/uploads/2017/06/FPF_Visual-Guide-...

stebann 6 years ago

Having read about anonymization techniques, I have started to believe that the definitions of anonymity and pseudo-anonymity are well settled by now, but the criteria that contribute to the invariants for performing data transformations are not, so the result is that these criteria fail to guide the implementations of the transformations.

You keep data because data is economically valuable, but even when you care enough to implement some techniques that depend on the invariants, you still fail to achieve something better because of scale and because you don't want to refine the techniques. This also means that somebody may have a technique that, provided enough pieces of data, can reverse your transformation.

inciampati 6 years ago

Differential privacy provides a system that can allow the sharing of databases without allowing an external observer to determine if a particular individual was included.

If companies were required to aggregate information in this way and throw away their logs, perhaps leaks would be much less risky for their users.

Today this might seem far-fetched, but it could come to pass in the future, when people raised in this environment and able to understand the implications and technical aspects come to political power.

https://www.cis.upenn.edu/~aaroth/privacybook.html

https://en.wikipedia.org/wiki/Differential_privacy
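
For readers who want something concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query, which is one standard way to get differential privacy. The function names (laplace_noise, private_count), the sample data, and the choice of epsilon are assumptions for illustration, not anything from the article. A production system would also need to track the privacy budget across queries and use a hardened noise sampler.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching predicate.

    A count changes by at most 1 when one person is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon is used.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of users in a log that would otherwise be retained.
ages = [23, 35, 41, 29, 52, 37, 64, 31]
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
print(f"noisy count of users aged 40+: {noisy:.1f}")
```

Smaller epsilon means more noise and stronger protection, and larger epsilon means the opposite. That is the "too restrictive or too lax" trade-off mentioned upthread.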

mjevans 6 years ago

I've considered how I would like e.g. GPS / driving apps to anonymize data.

For freeways, lots of small segments, and fuzzing of timestamps to co-mingle users. Where there's a stoplight, snap the intersection cross-time to the (guessed) green light for anyone in the queue.

The anonymity would come from breaking up both requests and observed telemetry to fragments too small to tie back to a single user or session (and thus form a pattern; I hope).

Do NOT record end-times, only an intended route. Do NOT associate that movement to any particular user or persistent session (ideally in memory on the mobile device only, not saved: though it could save favorite routes locally). Packages of transition times between various freeway exits would generally help add to anonymity.
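
A rough sketch of the fragmenting idea, to show what would and would not leave the device. The function name fragment_trip, the segment IDs, and the bucket and jitter sizes are all hypothetical, and this illustrates the shape of the data only. It is not a proven privacy guarantee, since travel times alone might still be linkable.

```python
import random


def fragment_trip(segment_times, bucket_seconds=300, jitter_seconds=60, rng=None):
    """Turn one trip into independent per-segment reports.

    segment_times: list of (segment_id, enter_ts, exit_ts) tuples for a trip.
    Returns (segment_id, time_bucket, travel_seconds) tuples with no user ID,
    no session ID, no trip end time, and no ordering between segments.
    """
    if rng is None:
        rng = random.Random()
    reports = []
    for segment_id, enter_ts, exit_ts in segment_times:
        # Fuzz the entry time, then snap it to a coarse bucket so that
        # many users fall into the same time slot.
        fuzzed = enter_ts + rng.uniform(-jitter_seconds, jitter_seconds)
        bucket = int(fuzzed // bucket_seconds) * bucket_seconds
        reports.append((segment_id, bucket, exit_ts - enter_ts))
    # Shuffle so the upload order does not reveal the route order.
    rng.shuffle(reports)
    return reports


# Hypothetical trip over three freeway segments (timestamps in seconds).
trip = [("I-90:12-13", 1000, 1090), ("I-90:13-14", 1090, 1200), ("I-90:14-15", 1200, 1275)]
for report in fragment_trip(trip):
    print(report)
```

Each report would ideally be sent as a separate request so the server cannot rejoin them by connection.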

That would also be part of generally improving the UI for the user. The application on the device should be making most of the decisions, by asking about the traffic in a given region on a grid. I also want it to show me (the driver) the data (heatmap) on the rejected routes so I know what isn't a good option.

redis_mlc 6 years ago

Largely true, but there are HHS rules and guidelines that are accepted in the US healthcare space:

https://www.hhs.gov/hipaa/for-professionals/privacy/special-...

kube-system 6 years ago

HIPAA data is not immune to a data leak... not even the organization that wrote those guidelines is immune:

https://www.deccanchronicle.com/technology/in-other-news/201...

There's tons of PHI on the internet. Your local hospital's online medical chart, your insurance company's bill-pay, etc...

SiempreViernes 6 years ago

The title refers to claims by marketing companies that they have appropriately anonymised the data, and is not an attack on the concept of anonymisation itself.

akavel 6 years ago

What does "computer science concentrator" or "statistics concentrator" mean? It's the first time I've seen such a title.

hwbehrens 6 years ago

Harvard calls their fields of study "concentrations", not majors [0]. Thus, a CS concentrator is an undergraduate student who is majoring in CS.

[0]: https://en.wikipedia.org/wiki/Academic_major

ComodoHacker 6 years ago

Students have found data enrichment techniques exist and can be effectively applied to breach datasets. Good for them.

ghostpepper 6 years ago

Yeah, I was a bit surprised when I read this was a project for a first-year course, Privacy and Technology (CS 105). I don't see it being reported anywhere other than Harvard's own website.

ansmithz42 6 years ago

I think this should be sent to the government officials that they were able to find in their research. It might get them to wake up and stop treating it so lightly.

lwb 6 years ago

Relevant XKCD: https://xkcd.com/792/

kache_ 6 years ago

Is it just data leaks? How about Google's reports on how busy a certain area is (restaurants, malls)? That is pretty much telling a potential terrorist the optimal time to target an area. We leak data everywhere, and all we need is a single bad actor to utilize it for a catastrophe to occur.