ML Beyond Curve Fitting: An Intro to Causal Inference and Do-Calculus (inference.vc)

184 points by dil8 8 years ago | 41 comments

Something to note about this formulation is the explicit assumption that in p(y|do(x)), the 'do' operation is completely independent of previously observed variables, i.e. the doers are 'unmoved movers' [1].

That fits the model where you randomly 'do' one thing or another (e.g. blinded testing). However, this is not the same thing as p(y|do'(x)), where do' is your empirical observation of the times you yourself set X=x in a more natural context.

E.g. let's say you will always turn on the heat when it's cold outside. P(cold outside | do(turn on heat)) = P(cold outside), because turning on the heat does not affect the temperature outdoors.

However, P(cold outside | do'(turned on heat)) > P(cold outside), because empirically, you actually only choose to turn on the heat when it's cold outdoors.

These two are also different from P(cold outside | heat was turned on) (since someone else might have access to the thermostat).
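To make the gap between do(...) and plain conditioning concrete, here is a minimal sketch of the heating example. All the numbers are made up for illustration (a 30% chance of cold weather, and a thermostat habit that is not quite deterministic), and there is only one person touching the thermostat, so in this toy version do' and ordinary conditioning coincide. The intervention is modelled by overriding the habit so the heat is on regardless of the weather.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
P_COLD = Fraction(3, 10)
# Habitual policy: P(heat on | cold outside?), keyed by whether it is cold.
HEAT_POLICY = {True: Fraction(9, 10), False: Fraction(1, 20)}


def p_cold_given_heat_on(intervene):
    """Return P(cold | heat on), either observed or under do(heat on)."""
    if intervene:
        # do(heat on): the heat is on no matter what the weather is,
        # which cuts the arrow from "cold outside" to "heat on".
        policy = {True: Fraction(1), False: Fraction(1)}
    else:
        policy = HEAT_POLICY
    joint_cold = P_COLD * policy[True]
    joint_warm = (1 - P_COLD) * policy[False]
    return joint_cold / (joint_cold + joint_warm)


print(p_cold_given_heat_on(intervene=False))  # 54/61, well above the prior
print(p_cold_given_heat_on(intervene=True))   # 3/10, equal to P(cold outside)
```

Observing that the heat is on is strong evidence of cold weather, while forcing it on tells you nothing about the weather.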

In reality our choices and actions are also products of the initial states (including our own beliefs, and our own knowledge of what would happen if we did x). Our actions move the world, but we are also moved by it.

Does do-calculus have a careful treatment of 'mixed' scenarios where actions are both causes and effects of other causes?

[1] https://en.wikipedia.org/wiki/Unmoved_mover

Darmani 8 years ago | |

But, you might want to consider making "turned on heat" part of the system in this case, and go back to using the classic conditioning operator instead of the do operator.

This is covered in chapter 4 of Pearl's Causality.

Darmani 8 years ago |

For those trying to understand the difference between action and observation, here's a good example from a friend:

Every bug you fix in your code increases your chances of shipping on time, but provides evidence that you won't.

phkahler 8 years ago |

I really enjoyed the humility the author had in the introduction to this piece. He paused and took a hard look at what seemed to be harsh or arrogant criticism of his field and found insight.

mlthoughts2018 8 years ago | |

Can you cite any parts of the article that support your view on this? I’ve read it a few times now and don’t see any. The author describes glossing past do-calculus before but for practical reasons, and doesn’t mention anything about “harsh or arrogant criticism” — and in fact doesn’t make reference to fair criticisms, like Rubin’s & Gelman’s.

phkahler 8 years ago | | |

>> Can you cite any parts of the article that support your view on this?

How about this: "In the interview, Pearl dismisses most of what we do in ML as curve fitting. While I believe that's an overstatement (conveniently ignores RL for example), it's a nice reminder that most productive debates are often triggered by controversial or outright arrogant comments. Calling machine learning alchemy was a great recent example."

When a person is dismissive of an entire field and claims to have a better way, that often comes off as arrogant (even if it is true). My interpretation is "harsh" while the author uses the word "overstatement". You'll also see "arrogant" in there and that last line calling it "alchemy" really has to be interpreted with negative connotations. Perhaps I read more into it than was written, but that was the impression I got.

seandougall 8 years ago | | |

That was what I took away from basically the whole introduction. The first paragraph describes his reaction to the criticism as “harsh” and “arrogant” (author’s words), the second describes his change of heart, and the third describes himself as “embarrassed” at having previously dismissed do-calculus.

It is written in a way that suggests he still regards the criticism as harsh and arrogant, but not incorrect, if that makes sense.

thadk 8 years ago |

Here is a paper explaining the essentials of how 45+ years of Causal Inference applies to ML: http://www.nber.org/chapters/c14009.pdf

If you're not in the mood for an academic paper, this podcast with the same author explains the potential of sharing lessons from both worlds: http://www.econtalk.org/archives/2016/09/susan_athey_on.html

mlthoughts2018 8 years ago | |

It's very important to note that the term 'causal inference' in this research paper is not the same thing as Pearl's causal inference techniques, and in fact the main two statistics and econometrics researchers cited in your linked article are Imbens and Rubin, two of the biggest critics of Pearl's methods.

The linked paper mostly goes into instrumental variables and mixed effects modeling for how classical econometrics has dealt with trying to understand the causality of intentionally varying a treatment. And, despite citing Rubin heavily, the paper doesn't go much into the Bayesian methods for solving similar problems (hierarchical models), even though they are a state of the art approach with modern computational MCMC techniques.

The last few sections do offer some interesting research citations for how classical instrumental effects models have been morphed with advances in machine learning, with things like causal trees.

But just look at one of the takeaway points of the survey, in section 5:

> "4. No fundamental changes to theory of identification of causal effects"

Overall, the link you've shared would be strongly in favor of ML-extended classical econometrics and possibly Bayesian hierarchical models or latent variable approaches, but almost surely would be against the notion that do-calculus could lead to a widespread set of models applicable in the real world.

gowld 8 years ago |

How does someone use do-calculus? It's a nice mathematization of Goodhart's law (https://en.wikipedia.org/wiki/Goodhart%27s_law), but how would it help an algorithm make better predictions?

Sure, the reason a person turns on the heat affects our belief in the outside weather (were they feeling cold, or were they just trolling?), but how do you know the reason a person turned on the heat, and couldn't you learn which reasons are predictive by measuring correlations with other observables? If you know the reason directly ("I'm just playing with the dial because I'm 4 years old") that's a data point you could throw into your ML model without explicitly knowing it's a reason.

sjg007 8 years ago | |

See http://www.michaelnielsen.org/ddi/if-correlation-doesnt-impl...

And

https://www.statisticssolutions.com/structural-equation-mode...

mlthoughts2018 8 years ago |

I am interested in a phenomenon that accompanies the recent interest in causal models in machine learning: at least in computer vision, the idea is not new at all and has been important for many decades.

One of the original sources that took this approach is "The Ecological Approach to Visual Perception" (1979) [0], by James Gibson, which discussed at length the idea of "affordances" of an algorithmic model, similar in some respects to topics in reinforcement learning as well. Affordances represented the information about outcomes you gained by varying your degrees of observational freedom (i.e. you learn how to generalize beyond occluded objects by moving your head a little to the left or right and seeing how the visual input varies. This lets you get food, or hide from a predator that's partially blocked by a tree, etc., so over time generalizing past occlusions becomes better and better -- this is much more interesting than a naive approach, like using data augmentation to augment a labeled data set with synthetically occluded variations, for example as is often done to improve rotational invariance).

Then this idea was extended with a lot of formality in the mid-to-late 00's by Stefano Soatto in his papers on "Actionable Information" [1].

I wish more effort had been made by e.g. Pearl to look into this and unify his approach with what had already been thought of. It turns me off a lot when someone tries to create a "whole new paradigm" and it starts to feel like they want to generate sexy marketing hype about it, rather than saying, "hey, this is an extension of, connection to, or alternative to this older idea already in machine learning." Instead it comes across as, "Us over here in causal inference world already know so much more about what to do ... so now let's apply it to your domain where you never thought of this". Pearl has a history of doing this stuff too, like with his previous debates with Gelman about Bayesian models. It almost feels to me like he is shopping around for some sexy application area where his one-upmanship approach will catch on, to give him a chance at the hype gravy train or something.

[0] https://en.wikipedia.org/wiki/James_J._Gibson#Major_works

[1] http://www.vision.cs.ucla.edu/papers/soatto09.pdf

carapace 8 years ago |

Worth mentioning, perhaps, that Cybernetics originated from the study of "circular loops of causality", systems where e.g. A causes B, B causes C, and in turn C causes A, etc...

thanatropism 8 years ago |

This is really sexy.

offpolicy 8 years ago |

Nothing to see here. The do-calculus is just fancy notation for what reinforcement learning is already doing: trying different possible actions and trying to maximize reward. If you know possible actions in advance, this is basically minimizing regret of wrong policy actions.

Darmani 8 years ago | |

First, RL and causal inference do fundamentally different things. RL is trying to train a controller; causal inference gives you a theory so that you can predict the results of a randomized controlled experiment without running one.

Second, consider this: Classic ML techniques will tell you that you should never go to the doctor because it increases the probability that you have a disease. Causal inference does not have this problem.

How does RL dodge this?
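The doctor example can be worked through with a small, fully hypothetical model. Assume being sick is the only thing that drives both doctor visits and outcomes, and assume sickness is recorded; every probability below is invented for illustration. The sketch compares naive conditioning on "visited the doctor" with the back-door adjustment, P(well | do(visit)) = sum over s of P(well | s, visit) P(s), which is one way to predict what a randomized experiment would show without running it.

```python
from fractions import Fraction as F

# Hypothetical model: sick -> visit, and (sick, visit) -> well.
P_SICK = F(1, 10)
P_VISIT_GIVEN_SICK = {True: F(4, 5), False: F(1, 10)}  # keyed by sick
P_WELL = {  # P(ends up well | sick, visit), keyed by (sick, visit)
    (True, True): F(7, 10),
    (True, False): F(3, 10),
    (False, True): F(1),
    (False, False): F(1),
}


def p_sick(sick):
    return P_SICK if sick else 1 - P_SICK


def p_visit(visit, sick):
    p = P_VISIT_GIVEN_SICK[sick]
    return p if visit else 1 - p


def naive(visit):
    """P(well | visit): weights strata by P(sick | visit)."""
    weights = {s: p_sick(s) * p_visit(visit, s) for s in (True, False)}
    total = sum(weights.values())
    return sum(P_WELL[(s, visit)] * w / total for s, w in weights.items())


def adjusted(visit):
    """P(well | do(visit)): weights strata by the marginal P(sick)."""
    return sum(P_WELL[(s, visit)] * p_sick(s) for s in (True, False))


print(naive(True), naive(False))        # 73/85 408/415: visiting looks harmful
print(adjusted(True), adjusted(False))  # 97/100 93/100: visiting helps
```

With these numbers the naive comparison says people who see a doctor do worse, purely because sick people visit more often; the adjusted comparison reverses the sign.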

Eridrus 8 years ago | | |

Not an RL expert, but Model-Based RL is a thing, where you try to train a model of how actions affect the world, and then use that model to choose/influence your actions.

But I don't think it's true that we always need a model, or at least I don't necessarily think we always need a human understandable model.

Your doctor example is weird to me tbh. A non-causal ML approach would seek to determine whether a patient has a disease based on some symptoms, and then send them to a doctor based on those results, sidestepping the need for causal models.

To rephrase it in a way that makes a bit more sense to me: let's assume we want to know if a specific procedure would be good for a patient (basically the same example). With a non-causal approach we would want to predict whether a patient would have a better outcome from doing a procedure than not.

A natural way to solve this (to me) would be to build one model that estimates the probability of various outcomes from the procedure, and one that estimates the probability of various outcomes from not undergoing the procedure.

Or if you're working in the world of Neural Nets/Deep RL, have a model that takes all the non-intervention data as input and outputs the expected outcomes from the procedure and the expected outcomes from not doing the procedure, and when you train it, you only supervise the outcomes that you had data for.
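The two-model idea described above could be sketched roughly as follows. This is an illustration only: the records, the single "severity" feature, and the per-group averaging that stands in for a real learned model are all hypothetical. One estimator is fit on patients who had the procedure, another on those who did not, and each is only ever trained on the outcomes that were actually observed.

```python
from collections import defaultdict


def fit_arm(rows):
    """Fit a trivial 'model': mean observed outcome per feature value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for features, outcome in rows:
        sums[features] += outcome
        counts[features] += 1
    return {key: sums[key] / counts[key] for key in sums}


def fit_two_models(records):
    """records: iterable of (features, had_procedure, outcome)."""
    treated = [(x, y) for x, t, y in records if t]
    control = [(x, y) for x, t, y in records if not t]
    return fit_arm(treated), fit_arm(control)


def predicted_benefit(models, features):
    """Predicted outcome with the procedure minus predicted outcome without."""
    treated_model, control_model = models
    return treated_model[features] - control_model[features]


# Hypothetical records: (severity, had_procedure, outcome with 1 = good).
RECORDS = [
    ("severe", True, 1), ("severe", True, 0),
    ("severe", False, 0), ("severe", False, 0),
    ("mild", True, 1), ("mild", True, 1),
    ("mild", False, 1), ("mild", False, 0),
]

models = fit_two_models(RECORDS)
print(predicted_benefit(models, "severe"))  # 0.5
print(predicted_benefit(models, "mild"))    # 0.5
```

The difference between the two predictions can be read as the effect of the procedure only under the assumption that the features capture everything that influenced who received it; if they do not, the two models are fit on systematically different patients.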

This ignores the Bayesian/Distributional Shift issue, but I don't think the do-calculus has a real answer to that either.

I would be interested in knowing if this ad-hoc modelling approach is any different from the causal modelling that Pearl is arguing for, or if causal modelling is more necessary when you have more complicated causal relationships than a single intervention.

thanatropism 8 years ago | | |

I think what offpolicy was trying to clumsily say is that policy evaluation (I come from the economic policy econometrics world originally) can be used for RL.

Maybe it can, but isn't Bayesian stuff really costly most of the time?