Notes on AI Bias (ben-evans.com)
I'm doing a lot of such algorithms (well, not for images). Does someone know if such algorithms have a name? I'm calling it "heuristics" and I think it falls under "AI".
Every single photo was of a cat.
I have to say I was humbled by the amount of human and computing power that had gone into developing this system over the years, that could achieve such a complicated, impressive technical feat, without requiring any effort or money on my part, and yet also be 100% wrong.
This really is quite impressive. It's rare for humans to do worse than random guessing on tasks, and they almost never do much worse. There's something almost charming about the ability of AI to put real effort into actively avoiding correct answers.
Calling it "feature engineering" implies it's still being fed into some sort of trained classifier to make the final decision, though.
What you're describing of your own work might better fall under the broad category of an "expert system".
https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer...
Couldn't they have retrained the system with a 50/50 mix of male and female resumes? Or restricted the use of the algorithm to sorting male resumes? Or maybe resumes don't actually correlate at all with success at Amazon...
I swear, when someone starts building autonomous killer robots, the first set of concerned articles will probably be asking whether robots were properly trained to target all genders and races with equal accuracy. This is not a sensible way to approach AI ethics.
>It was recently reported that Amazon had tried building a machine learning system to screen resumés for recruitment. Since Amazon’s current employee base skews male, the examples of ‘successful hires’ also, mechanistically, skewed male and so, therefore, did this system’s selection of resumés.
There is nothing "mechanistic" about this. It depends on how you select sample resumes and how you split them between "good" and "bad" labels.
I worked on a similar thing as an "encouraged" side-project at a certain company. Except I realized from day 1 that using AI on resumes is a bad idea and aimed to show this with data. My model was aiming to detect people who will quit or get fired within the first 6 months (with the intent of lowering them in priority for interviews, supposedly). It miraculously achieved 85% accuracy... by figuring out how to detect summer interns.
Framing this problem as "bias" and especially hyper-focusing everyone's attention on diversity aspect of it is extremely irresponsible. (I'm not saying that's what the author is doing, but that's definitely what's being done at large.) Fundamentally, there are significant higher-level problems with using statistical ML models for things like hiring or crime prediction.
More importantly, the only way to really show causation is by positing a mechanism.
Given a statistically large enough sample, 2 outcomes: 1) The Siemens sensor actually is at fault. 2) The Siemens sensor is a part of a larger system, which is different in non-Siemens turbines, and that system is failing.
Either way, the model prediction on turbine failures is enhanced with that Siemens feature. But to even get to this granularity, you are diving into model explainability, or what features were important for each prediction. Here, you try to understand the black-box to find reasons for particular input->output.
"just as a dog is much better at finding drugs than people, but you wouldn’t convict someone on a dog’s evidence. And dogs are much more intelligent than any machine learning."
Because in my head I followed it with the sentence "but we're all confident that we will have dogs driving our cars in about 5 years." Food for thought for sure.
They didn't say dogs were better than technology at solving problems, in any sort of general sense.
1. The AI system accurately predicted employee success across both genders
AND
2. The AI system predicted that women would do worse than men
That's politically embarrassing and something that you can't necessarily 'fix' by improving the system. (see: all the 'will this person commit a crime if let out on parole' systems that end up accurately discriminating based on race)
This isn't to say that women are worse engineers than men, or anything of that sort - only that the applicant pool to Amazon was skewed, or women were treated worse in the workplace and thus performed worse, or a dozen other possible causes. (And only in this hypothetical scenario! I have no inside info from Amazon!)
Assume that the ability curves of male applicants and female applicants are identical; that the majority of applicants are male; and that Amazon wants to hire more females than would be expected given the portion of applicants that are female.
A natural way of accomplishing this goal is to give extra points to female applicants [0].
Due to selection bias, the ability curve of women within the population of Amazon engineers would skew lower than that of men within the population of Amazon engineers.
This is a special case of a more general phenomenon. If you have a signal S that is positively correlated with a desired trait in the general population, and you over-select for S, you will find that S is negatively correlated with the trait within your selected population.
[0]. All proposals I have seen amount to either a good approximation of this or changing the applicant pool. And, by assumption, the latter is excluded.
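To make the selection effect concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the trait and signal are synthetic Gaussians, the signal is only weakly tied to the trait, and the admission score weights the signal as heavily as the trait itself (the "over-selecting" part). It is not a model of any real hiring process.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n=20000, signal_weight=1.0, keep_fraction=0.10, seed=0):
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n)]
    # Signal S is weakly but positively related to the trait (assumed).
    signal = [0.3 * t + rng.gauss(0, 1) for t in trait]
    # Hypothetical admission score that leans heavily on S.
    score = [t + signal_weight * s for t, s in zip(trait, signal)]
    cutoff = sorted(score, reverse=True)[int(n * keep_fraction) - 1]
    admitted = [i for i in range(n) if score[i] >= cutoff]
    corr_everyone = pearson(signal, trait)
    corr_admitted = pearson([signal[i] for i in admitted],
                            [trait[i] for i in admitted])
    return corr_everyone, corr_admitted

corr_everyone, corr_admitted = simulate()
print(f"corr(S, trait), everyone: {corr_everyone:+.2f}")
print(f"corr(S, trait), admitted: {corr_admitted:+.2f}")
```

Under these assumptions the correlation is positive across everyone and negative among the admitted group, which is the sign flip described above.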
> Gender bias was not the only issue. Problems with the data that underpinned the models’ judgments meant that unqualified candidates were often recommended for all manner of jobs, the people said. With the technology returning results almost at random, Amazon shut down the project, they said.
Apparently the recommendation system really did create gender bias, neither inherited from real differences nor from replicated human biases. (It looks like an issue with mismatched training data and task.) But that initial bias was found and corrected (2015) more than a year before the project was cancelled (2017) for providing "random" results. I think this is the most extreme case of algorithmic bias I've ever seen, but also the least commonly relevant; Amazon appears to have built a model which contained almost no rules except sexism, and scrapped it for not knowing anything worthwhile.
https://www.reuters.com/article/us-amazon-com-jobs-automatio...
If it isn't acceptable to use an AI to create biased outcomes, how is it acceptable to use people to create the same outcomes? AI decision making can be examined and tuned in ways that people cannot.
The parole software was NOT being fed data for "will this person commit another crime". It was being fed data for, "will this person be a suspect for another crime".
The significant difference is that selective enforcement biases the data that it was trained on. Said selective enforcement has multiple causes, including the fact that heavier patrolling in black neighborhoods makes catching crimes more likely.
The size of the selective enforcement bias shows in a number of ways. For example consider drugs. In surveys, the usage of illegal drugs is the same in blacks and whites. And yet 6 times as many blacks are arrested for using illegal drugs as whites.
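A back-of-the-envelope sketch of that label problem, with made-up numbers: both groups offend at the same rate, but the chance that an offence turns into an arrest record differs by a factor of six. The function name and the probabilities are assumptions chosen only to mirror the 6x figure quoted above.

```python
def observed_arrest_rate(offence_rate, detection_prob):
    """Fraction of a group that ends up with an arrest record."""
    return offence_rate * detection_prob

# Hypothetical numbers: identical behaviour, 6x difference in enforcement.
offence_rate = 0.10
rate_a = observed_arrest_rate(offence_rate, detection_prob=0.05)
rate_b = observed_arrest_rate(offence_rate, detection_prob=0.30)

print(f"true offence rate (both groups): {offence_rate:.2f}")
print(f"arrest-label rate, group A: {rate_a:.3f}")
print(f"arrest-label rate, group B: {rate_b:.3f}")
print(f"label ratio B/A: {rate_b / rate_a:.1f}")
```

A model trained on the arrest label can only see the last two rates, so it would "accurately" reproduce the enforcement gap rather than the underlying behaviour.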
Humans are pretty happy to create nonsensical results if it fits their goals... especially if it benefits them. I wonder if with AI we do that to the point that it is somewhat irrelevant.
To some extent, you're bringing in your human bias to prefer human biases when you make that statement. We humans have a hierarchy of important attributes, and for various reasons believe race and gender are more important than eye color or height. But the machine learning algorithm just gets a multidimensional point in hyperspace. It doesn't, a priori, "know" that it needs to do a "per capita" adjustment based on FIELD_1 any more than it knows it needs to do a per capita adjustment on FIELD_2. And you can't "adjust" on all the fields because that'll just cancel out.
We are also in the weird position of wanting the machine to do adjustments based on FIELD_1, but without us having to actually admit to ourselves that we're doing it. From a technical perspective, probably the best answer is to do a straight-up training based on the data, then have a cleanly separated after-the-fact cleanup process to perform whatever social adjustments it is we want on the outcome. But nobody is willing to admit that's what we want, and to put those adjustments down on paper in the form of code, because the instant they're concrete, pretty much everybody is going to decide they're wrong, and no two people are going to agree on the manner in which they are wrong, and an epic, national-front-page-news shitstorm will ensue. So here we are, trying to make adjustments without making adjustments, or, alternatively, trying to make adjustments in a place where we can blame the AI rather than humans.
(The ironic thing is that because we can't admit what we're trying to do, we're going to end up doing a really poor job of it. Tools will be applied haphazardly, the results can't be measured except very grossly at the very end of the process, and the goals won't be obtained and the system is always going to be quirky and weird. If we could clearly declare what it is we actually wanted, it would be fairly easy to get it from the AIs.)
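For what "cleanly separated" could look like in practice, here is a minimal sketch. The model is a stand-in that just returns a stored score, and the adjustment table, field values, and numbers are all hypothetical; the only point is that the adjustment lives in its own explicit, reviewable step instead of being buried in training.

```python
def model_scores(records):
    """Stand-in for a model trained straight on the data (hypothetical)."""
    return [r["raw_score"] for r in records]

# Every adjustment is written down in one place, so it can be read,
# argued about, and measured. The values are made up for illustration.
ADJUSTMENTS = {"FIELD_1": {"a": 0.0, "b": 0.05}}

def adjust(records, scores, adjustments=ADJUSTMENTS):
    """Separate, auditable step applied after the model has run."""
    adjusted = []
    for record, score in zip(records, scores):
        delta = sum(table.get(record.get(field), 0.0)
                    for field, table in adjustments.items())
        adjusted.append(score + delta)
    return adjusted

records = [
    {"raw_score": 0.70, "FIELD_1": "a"},
    {"raw_score": 0.68, "FIELD_1": "b"},
]
raw = model_scores(records)
final = adjust(records, raw)
print(raw, final)
```

Keeping `raw` and `final` side by side is what makes the effect of the adjustment measurable at every step rather than only at the very end.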
Going by the details of the Reuters story and several others, it appears that what actually happened was a training/task mismatch. Amazon wanted an algorithm to do resume discovery, which recruiters would run and get quality predictions as they viewed resumes. But they trained it on resume results, giving it past resumes which had been submitted to Amazon and telling it to seek similar resumes. None of the stories make it clear if there even was negative training data; it looks like the tool was simply told to compute degree-of-similarity to past inputs, and possibly told to prioritize resumes which were ultimately hired.
As a result, the tool was trying to convert a relatively gender-neutral pool (resumes found online) to a skewed one (Amazon applicant resumes), and did so by weighting gendered terms. It also seems to have underweighted technical terms, failing to appreciate them as mandatory or strictly position-specific.
The developers were sufficiently aware of that to catch and correct the known gender biases (e.g. devaluing women's colleges or the literal word "women's"), but were scared there were other uncaught biases. And the results were apparently terrible all around, so the tool was scrapped. Which is pretty much what you'd expect from something trained on exclusively positive, sample-biased examples. The story has been seriously distorted, but the real plan also seems terrible...
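A toy sketch of how a "similar to past submissions" scorer can end up weighting an incidental word. This is not Amazon's system; the resumes are invented, and the scoring rule (a smoothed log-ratio of token frequencies between the submitted pool and a background pool) is just one simple stand-in for "degree of similarity".

```python
from collections import Counter
import math

def token_weights(positive_docs, background_docs):
    """Smoothed log-ratio of token frequency: positives vs. background."""
    pos = Counter(tok for doc in positive_docs for tok in doc.lower().split())
    bg = Counter(tok for doc in background_docs for tok in doc.lower().split())
    vocab = set(pos) | set(bg)
    pos_total = sum(pos.values()) + len(vocab)
    bg_total = sum(bg.values()) + len(vocab)
    return {
        tok: math.log((pos[tok] + 1) / pos_total)
             - math.log((bg[tok] + 1) / bg_total)
        for tok in vocab
    }

# Toy, made-up "resumes": similar skills in both pools, but the submitted
# pool happens to lack one incidental word entirely.
submitted = ["python java chess club", "java python rugby club",
             "python sql chess club"]
background = ["python java women's chess club", "java sql women's rugby club",
              "python sql chess club", "java python rugby club"]

weights = token_weights(submitted, background)
for tok, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{tok:8s} {w:+.2f}")
```

With only positive examples drawn from a skewed pool, the most strongly penalized token here is the incidental one, while the technical terms barely move.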
The typical AI system doesn't work on the basis of selecting candidates entirely at random, pro rata, in order to meet a quota. It works on the basis of criteria for success. One thing it might learn (unfortunately) is that most posts at the company are filled by men.
Using the blog's skin cancer example, couldn't the labelled images be augmented by altering the skin tones and adding these new examples to the training set?
It seems to me that some of the anomalous results discussed in the article are actually the result of poor model design or poor pre-processing data choices. We can't just throw anything at any ol' machine learning model and expect it to be magic.
As far as I can tell from later stories (e.g. [1], [2]), what Amazon actually did was build a tool to show recruiters 'quality' predictions for all resumes, for instance as they scrolled LinkedIn. But they trained it on resumes submitted to Amazon for various positions, possibly also adding weight to resumes which produced hires.
In which case the problem is painfully obvious; the system effectively had no negative training data, and its positive examples (submitted resumes) didn't actually match the desired output (qualified resumes). It was computing degree of similarity between a gender-neutral-ish pool (resumes posted online) and a gender-skewed pool (resumes submitted to Amazon), and tried to make that conversion with whatever data was available - like devaluing resumes that mentioned women's colleges. (This wasn't just a proxy-variable thing, the model essentially learned to weight on gender.) Amazon's team apparently caught this issue and did the usual things like blinding on those words. But they were scared of uncaught factors; reading between the lines, they were unable to "detrain" biases like neural nets do because their dataset and task didn't match.
Ultimately, the tool was apparently scrapped because it made selections "almost at random". Which, again, isn't exactly surprising in light of the absolutely bonkers choice of training examples.
[1] https://www.aclu.org/blog/womens-rights/womens-rights-workpl...
[2] https://www.ml.cmu.edu/news/news-archive/2018/october/amazon...
More topically, you're quite right to object to that Amazon reference. As far as I can tell, the real story is even worse than mislabeling. Amazon devs wanted a system to spot candidates in resume banks, so they trained it to recognize resumes similar to the ones submitted to Amazon in the past. The entire dataset was 'positive', and output degrees of similarity instead of classifications. Amazon applicants are mostly male while the pool was presumably 50/50, so that was learned as an element of "Amazon-candidate-ness".
That's also an interesting story, but from the first publication (in Reuters) it's been framed as an uneven base rate 'inevitably/predictably/mechanistically' producing a biased result. Which is not only untrue but downright backwards, since it implies that the rate in the general data is what matters, rather than the relative rate between samples and positive classifications. It's yet another variant of the mammogram base rates question, and I wish people would stop trying to reinforce the incorrect answer to that.
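For anyone who hasn't seen the mammogram question, here is the arithmetic as a small sketch. The numbers are the commonly used textbook-style figures (1% prevalence, 80% sensitivity, 9.6% false-positive rate) and are assumptions for illustration, not clinical data.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

p = posterior(prior=0.01, sensitivity=0.80, false_positive_rate=0.096)
print(f"P(condition | positive) = {p:.3f}")
```

The answer depends on how the rates compare between the positive and negative cases, not on the base rate alone, which is the same mistake as reading a skewed base rate as "mechanistically" forcing a skewed classifier.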
Post your bank! Let's be like Magnus Carlsen and occasionally ask ourselves, "What would DeepMind do?"
Except that's exactly what it is. Much as your model was biased against interns.
> and especially hyper-focusing everyone's attention on diversity aspect of it is extremely irresponsible.
Why? Pointing out a specific and concrete harm badly designed ML models cause is irresponsible? Just because the same kind of methodological flaw can cause other harms, it's irresponsible to use a motivating example?
In my opinion, yes, if it leads most readers to misjudge some fundamental properties of the problem as a whole. Again, I'm not saying this article is guilty, but most are.
Using the term 'bias' has certain political motivations behind it. It's not so much about the term being technically untrue as it is about the term being non-neutral. For instance, here are some definitions of 'bias' I just grabbed from American Heritage:
"A preference or an inclination, especially one that inhibits impartial judgment."
"An unfair act or policy stemming from prejudice."
"A statistical sampling or testing error caused by systematically favoring some outcomes over others."
The ML model does not have a preference, inclination, or prejudice relating to interns, except insofar as we anthropomorphize it to have them. What does using a word suggesting that add?
A more neutral account of what's going on is along the lines: It's easy to accidentally train ML models so that they will make systematic errors. (Among those errors is the possibility for it to exhibit behavior resembling prejudice.)
Isn't that what the article is trying to say, though? That your model can only be as accurate as your data set… and that even then, you have to be very careful to make sure it's not inferring patterns from entirely unrelated information?
Curiously though, did you compare the non-hire (full time) rates of interns vs fire rates of non-interns?
That's not what happened in the example at all. The example company isn't biased against summer interns, "who stops working after x time" was just a bad question.
The comment you're replying to can boil down to "do you want a monkey's paw solving your problem? If so then AI may be for you"
Or perhaps "stop pretending you're ever going to get ethics or empathy out of a computer"
Not sure I understand the question. IIRC, the way data was set up there was no way to tell why an intern stopped working for the company, because for all interns "reason code" for separation was the same.
Meaning, isn't it prudent to spend time on this issue?
That was the logical next step and we started on that, but it required exporting more historic data out of the HR system and filtering out anyone who started as an intern as well. Sounds simple, but in practice it's anything but. Just for reference, data extraction, cleaning and filtering in that project took at least an order of magnitude more time than anything related to machine learning.
The project eventually lost steam and got abandoned.
>Do you still suspect a skewed result?
Absolutely. My personal intuition is that there is very little correlation between resumes and candidate quality. If that is true, any seemingly accurate predictions would be the result of a similar problem. Testing this hypothesis was a large portion of why I agreed to work on the project in the first place.
We aren't just looking for patterns. We are looking for patterns so that we can take action and affect the future. If the patterns, which are real enough in the historical data, don't correctly predict the impact of a choice, then they are anti-helpful bias.
For example, it may be that the company bought Siemens sensors years ago and then switched to another brand later. Unsurprisingly, older turbines fail more than newer ones. So, really, it's age that is the causative factor and the concrete action you want to take is to pay closer attention to older turbines. Even though the correlation to Siemens is real, if the action you take is "replace all the Siemens sensors with another brand", that won't make those old turbines work any better.
In other words, understanding data doesn't just mean "see which bits are correlated with which other bits". In order to be useful, we need to understand which changes to those bits in the future will be correlated with which desired outcomes. Anything less than that and you don't yet have information, just data.
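The age confound can be shown with a few invented counts. In this sketch (all numbers are made up), Siemens sensors sit mostly on old turbines, old turbines fail more, and within each age band the two brands fail at exactly the same rate.

```python
# Made-up counts: (turbines, failures) per age band and sensor brand.
counts = {
    ("old", "Siemens"): (100, 30),
    ("old", "other"):   (20, 6),
    ("new", "Siemens"): (20, 1),
    ("new", "other"):   (100, 5),
}

def failure_rate(brand, age=None):
    """Failure rate for a brand, pooled or within one age band."""
    rows = [v for (a, b), v in counts.items()
            if b == brand and (age is None or a == age)]
    total = sum(n for n, _ in rows)
    failed = sum(f for _, f in rows)
    return failed / total

for brand in ("Siemens", "other"):
    print(f"{brand:8s} pooled: {failure_rate(brand):.3f}  "
          f"old: {failure_rate(brand, 'old'):.3f}  "
          f"new: {failure_rate(brand, 'new'):.3f}")
```

Pooled, Siemens looks far worse; split by age, the gap disappears, so swapping sensor brands would change nothing about which turbines fail.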
Yes, AI systems presume induction to be true. But so does... uh, science and most other things we do?
The point is the Siemens sensor is a superfluous correlation with turbine failure, because the underlying dataset is biased towards Siemens sensors. The scenario suggested by the author is one in which your turbine failure dataset does not match reality.
No amount of sample enlargement will correct sample bias. You have a variable which is disproportionately represented in your underlying dataset despite being independent from a collection of variables correlated to failure, and the algorithm is learning that one instead.
Real world ways this is plausible and cannot be corrected by increased sampling:
1. Your telemetry data is accurate, but your logging service providing that data is faulty and only consumes data from a subset of meaningful publishers.
2. Whoever provided this dataset fat fingered a SQL query which joined too few tables including the sensor vendors, but correctly returned only the failing turbines.
3. Your data has (unnormalized) duplicates, because more than one system is providing telemetry data for Siemens sensors without the older systems being retired.
4. You use mostly Siemens sensors, and simply didn't correct for this in your sample.
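A quick simulation sketch of the first scenario, under assumptions I made up: both brands truly fail 10% of the time, but the logging pipeline always keeps failed-Siemens records and drops half of everything else. The function name and probabilities are hypothetical.

```python
import random

def biased_sample_gap(n, seed=0):
    """Observed failure-rate gap (Siemens minus other) in a biased log."""
    rng = random.Random(seed)
    kept = {"Siemens": [0, 0], "other": [0, 0]}  # [records, failures]
    for _ in range(n):
        brand = "Siemens" if rng.random() < 0.5 else "other"
        failed = rng.random() < 0.10  # same true rate for both brands
        # Assumed fault: the logger always keeps failed-Siemens records
        # but drops half of everything else.
        keep_prob = 1.0 if (brand == "Siemens" and failed) else 0.5
        if rng.random() < keep_prob:
            kept[brand][0] += 1
            kept[brand][1] += failed
    rate = {b: f / r for b, (r, f) in kept.items()}
    return rate["Siemens"] - rate["other"]

for n in (20_000, 200_000):
    print(f"n={n:>7}: observed gap = {biased_sample_gap(n):+.3f}")
```

Collecting ten times more records through the same faulty pipeline tightens the estimate around the wrong value; it does not move it toward zero.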
1. Not a spurious correlation - Siemens sensors are in fact associated with increased failure rates in the dataset and if you continue to sample data with the same methodology this correlation will continue. You need to fix your data collection methodology, but it's not a spurious correlation.
2. See #1.
3. See #1.
4. The original problem statement said that a low percentage of unfailed turbines used Siemens sensors, and a high percentage of failed turbines used Siemens sensors. So 'you use mostly Siemens sensors' would imply that most of your turbines have failed, which seems a little unlikely to me.
Given how incredibly hard it is to avoid sample bias, you can't take it for granted that your training data doesn't have any sample bias.
- The aforementioned Tetris story: an undirected learner set to maximize score at Tetris learned normal play techniques, but also learned to pause the game immediately before losing so that the score wouldn't "decline" at game over.
- In the same vein as interns quitting, proxy detection of all sorts. Identify "field with sheep" by finding green fields with grey skies, or letting heuristics like "humans pick up dogs and cats" override correct identifications. (It's a goat until you pick it up, then it's a dog!)
- An agent playing Q*bert found a known bug for infinite lives, then escalated to an unknown bug which disabled the game while overflowing the score counter.
- Agents in a physics sim tasked with jumping as high as possible instead learned to 'fly' by abusing collision detection bugs, hitting themselves in ways that created upward momentum.
- Another "maximize jump height" task demonstrated that "highest" is an extremely fuzzy term. Initially measured by highest point, they became incredibly tall. Measured by lowest point, they stayed tall and grew top-heavy to 'kick' their base upwards.
- Number-handling bugs of all kinds. In one case, small twitches led to floating-point errors that created energy. In another, a "minimize force" task got solved by maximizing force and triggering integer wraparound.
My personal favorite is an adversarial bug. An agent playing tic-tac-toe on an infinite grid with a time limit submitted extremely remote moves which caused timeouts/crashes in any agent that tried to model the full board.
[1] https://arxiv.org/pdf/1803.03453.pdf
[2] https://aiweirdness.com/post/172894792687/when-algorithms-su...
Photos creates folders for you based on identified themes, and then adds new photos to them as they're taken. I haven't checked, but I'm guessing it doesn't relabel existing buckets to avoid causing confusion. And I'm not sure whether bucketing is done by assessing theme or similarity to other photos in a folder. If it's the latter, the system could have hit the confidence threshold to make a Dog folder out of a few images, then ceaselessly dumped similar-looking photos (i.e. cats) into that bucket.
More sophisticated approaches are possible.
Which problem? The general statement of this problem is "models, trained on [somehow] misrepresentative data [or even technically representative data] can draw unintended conclusions that lead to harm". Specifically in this case, the harm was "the model was basically just trained to ignore all women applicants due to bad inference of conditional probabilities".
This is a common thing. Because our society draws lines and has bias, it's fairly common for modelling failures to exist along those lines. Indeed, sometimes the failures are mostly harmless and immediately obvious, but often they aren't. And people building models should be made aware of those failure scenarios, and be especially aware of failure scenarios that affect underrepresented groups, because those groups are the most likely for the model to fail on if you aren't actively looking for them.
And this stuff is pervasive. Facial recognition tech is much worse at noticing the faces of darker skinned people [1]. Some of this is because the people building the common models (eigenfaces etc.) didn't use diverse skin tones, but some of it goes back further: white balance in film was tuned for lighter skin tones until the 90s[2]. Some of that has likely persisted into modern film and camera technology, unfortunately. People working with data need to understand their data. And that means understanding how bias infests their data.
> fundamental properties of the problem as a whole
You've yet to state the "whole problem" or the fundamental properties that people might misjudge. So I'm unclear what they are.
[1]: Arguably an advantage now.
[2]: https://petapixel.com/2015/09/19/heres-a-look-at-how-color-f...
Throwing AI at answering an ill-formed question or optimizing a process that shouldn't happen in the first place is not something that can be corrected by getting better training data.
Moreover, automation can have consequences that aren't detectable by analyzing some test set.
Gender bias was not the only issue. Problems with the data that underpinned the models’ judgments meant that unqualified candidates were often recommended for all manner of jobs, the people said. With the technology returning results almost at random, Amazon shut down the project, they said.
The main difference is that you’d write code to extract features from the image and then learn a model using those features (as opposed to using the pixel data directly and learning a model from that as in CNNs).
As an example, you wouldn’t necessarily write code for “fur texture” but instead would extract histograms of pixel brightness gradients and feed those (along with other things) to a machine learning algorithm. In this example, fur texture would generate a different histogram (to be used as a feature) than skin texture.
https://en.m.wikipedia.org/wiki/Histogram_of_oriented_gradie...
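A stripped-down sketch of that kind of hand-coded feature, in plain Python. It is a simplification of histogram-of-oriented-gradients (one histogram over the whole image, no cells or block normalization), and the function name and tiny test images are my own illustration.

```python
import math

def gradient_histogram(image, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    hist = [0.0] * bins
    height, width = len(image), len(image[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # horizontal change
            gy = image[y + 1][x] - image[y - 1][x]  # vertical change
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Fold direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / (180.0 / bins)) % bins] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist

vertical_edge = [[0, 0, 1, 1] for _ in range(4)]
horizontal_edge = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]
print(gradient_histogram(vertical_edge))
print(gradient_histogram(horizontal_edge))
```

The two edges land in different bins, and it is vectors like these (not raw pixels) that would be handed to a conventional classifier.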
What I meant when asking for a name of an algorithm class are algorithms where the feature extraction is done using hand-coded algorithms.
I don't think people are against using ML and for biased human systems. Just pointing out the ignorant, naive and lazy deference to computers that often occurs in human systems that share the same bias.
In short I'd think most people who are against biased AI are also against biased human systems for very similar reasons.
[1]: With the political motivation.
Depending on what the appropriate quantification of 'often' is, that might make sense. Do we have enough reason to believe it would take on a high enough value to merit the usage of a term that refers only to it?
The other problem with what you're describing is that all we actually know is that the model is reflecting the current state of things. Your statement attributes particular causes to the current state of things, and implies a certain valuation of the current state of things (which I don't personally disagree with, necessarily—but I don't think my personal views should be reflected in scientific/engineering jargon).
So given the uncertain value of 'often,' and the unsettled nature of the causes behind various aspects of the 'current state of things,' it seems to be solidly jumping the gun to frame the entire general problem with a term that refers to this partial and fraught aspect of it.
I didn't, nor should it matter how we got to where we are for a builder of a thing.
> and implies a certain valuation of the current state of things
This may have happened, but I'd disagree: recognizing that there exists inequality doesn't cast value judgement on that inequality. I simply stated that they're there. Perhaps saying "how to prevent them" is casting value judgement, so I might walk that back, model creators should be aware of the biases and aware of tools and strategies to account for them, if so desired.
Personally I think you're a bad person if, armed with the tools to detect and correct, you decide it's okay to build something that has a systemic error that wrongly disfavors some group. But perhaps that's just me.
No, this does not match either.
One of the easiest ways to get an ML model that makes systematic errors is a spam filter. If I take my spam folder as training data with no further consideration, what the filter will learn is that any language which isn't my own is spam, and that servers located outside my nation are spammers. This resembles prejudice.
The cause of this systematic error is that individual email addresses do not get ham emails uniformly from every nation and every language. Proximity warps the data. I would need to normalize the data based on language and nation if I wanted to remove those errors in the filter. Looking at it from a political perspective does not make the filter perform better, and fixing it from that side has a high risk of causing even more errors in the model.
If you set a team of scientists to find a way of predicting failure of turbines, they might notice a correlation between Siemens sensors and failure. They would then look for and attempt to prove theories to explain this discrepancy. In doing so, they would likely discover that not only can they not find a causative theory, but the correlation goes away when they control for age.
AI systems stop after the first step, yet somehow are perceived as better than expert humans.
You just asserted your attribution of cause right there: inequality. There are multiple possible causes for differing demographic representations in various roles. This is not a settled issue, even though people on both sides promote competing ideologies to the effect that it is.
(And again, I have intentionally left my own views on the subject out of this, even though I suspect they align with yours (insofar as cause attribution goes): I'm just pointing out the fact that this isn't something society agrees on, nor is it something the scientific data resolves unambiguously.)
> Personally I think you're a bad person if, armed with the tools to detect and correct, you decide it's okay to build something that has a systemic error that wrongly disfavors some group.
Agreed, hinging on that point about cause attribution.
There's no point to having an ML model unless you are applying it to something outside of the training data.
If you plan on applying the model to different turbines, then there is potential for sample bias in which turbines you selected. If you apply it to the same turbines at some point in the future, then you sampled points in time so there is a potential for sample bias based on which points in time you selected.
There is no way of completely avoiding the potential for sample bias unless you completely abandon ML as a useful concept.
Why would I care about the fact that only 10% of turbines globally have Siemens sensors? I don't know the failure data outside of the turbines I own and operate, and those are the only ones I need to predict failures for.
Say that turbines have an average lifespan of X years, and from year 0 to 10 you bought 90% Siemens and then from year 10 to 20 you bought 10% Siemens and then you measure failure rates from year X to year X+10.
Based on that data you would predict that Siemens turbines will be the most likely to fail next, but they are probably actually less likely to fail because most of the ones that are likely to fail soon are already gone.
You can modify reality, but our understanding of biology - especially hormones - clearly tells us that the AI was right: men are generally better than women at weight lifting.
I'm not saying that every issue is like that, but it would be foolish to ignore that sometimes reality is biased, sometimes in obvious ways and sometimes more subtly.
Your post is great for the assumptions it encodes. Like what does it mean to be good at weight lifting? And that for some reason being good at weight lifting is a good proxy for being a good bouncer or construction worker?
For an off the cuff example it’s a great way to demonstrate the sort of bias we can naively introduce then defend because it’s just ‘reality’. When really it’s much more complex than identifying a relevant trait and assuming everything else falls out of it.
Being a good weight lifter means you can lift heavier weights than a less-good weight lifter. Whether this is a proxy for anything isn't relevant, because it's a purely contrived example. There are clearly jobs where physical strength (among other things) is important, and given the context of this example, there is no guarantee that a more complicated model evens out the differences.
The point of the example is, basically, "there are some things which might discriminate strongly on the basis of physical traits, which might end up correlating with race/sex etc". Ask for a better model by all means, but there is no guarantee the perfect model will never correlate strongly with some political demographic, and hence be controversial.
It could be a spurious correlation, sure - but that'll go away as the amount of data increases.
"bias" and "reality" can equally cover for model simplicity.