Scaling Machine Learning at Uber with Michelangelo (eng.uber.com)
We recently ported some models from Stan to Pyro (SVI on PyTorch), and it’s been really exciting (except for the dark corner of poutines). It really has the performance of something being used in production, except for the occasional NaN explosion.
Edit: we are lazy and use our GitLab CI/CD to drive model development iteration. It’s not as fully featured as what’s in the article, but it’s a zero-effort start.
[1] https://medium.com/@gc/ubers-path-forward-b59ec9bd4ef6 [2] https://www.sfchronicle.com/business/article/Uber-drivers-in...
*Disclaimer: I work at Uber, and my opinions are solely my own. We're hiring.
And that was an unforced error, by Silicon Valley. It was in their DNA. They didn't have to give Travis Kalanick, a guy they despised and never trusted, for good reason—They didn't have to give him all that venture capital.
But they saw him as an expendable probe, so they cynically gave him money, to see how much law-breaking he could get away with in the name of their disruption activities.
That was hubris—and nemesis is well on the way."
- NEXT17 | Bruce Sterling | Live from 2027
Though it does have a business model that (did?) flagrantly disregards the law in pretty much every market it moved into.
And we'll see how the privacy thing turns out when they figure out the data they have on millions/billions of people is worth a bunch of money and Wall Street is demanding "more cowbell".
Pachyderm is another one I’ve looked at but we don’t have the sys admin bandwidth for that stuff right now.
And if you use express pools, it will always tell you to go to the wrong side of an intersection. I like Uber because of the drivers, but their fancy technology is flawed.
They can use GPS data to chart usage metrics, plan pool rides, check for anomalies, and harass journalists, for example.
Navigation is pathfinding in the real world plus directions. GPS is a useful (but far from the only) system for determining where you are. Navigation is often implemented by taking a route from point A (in lat, lng) to point B (in lat, lng) and running an algorithm (such as Dijkstra's or A*, but usually something far more advanced) from A to B. The algorithm runs on a routing graph of some kind, produced by processing real-world map data.
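To make the routing-graph idea concrete, here is a minimal sketch of plain Dijkstra over a toy graph. The `ROADS` network, its node names, and the edge costs are made up for illustration; a production router would use a far larger graph built from map data and a more advanced algorithm, as noted above.

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm over a dict-of-dicts routing graph.

    graph[u][v] is the non-negative travel cost of the edge u -> v.
    Returns (total_cost, [nodes on the route]), or (inf, []) if unreachable.
    """
    best = {start: 0.0}          # cheapest known cost to each node
    previous = {}                # back-pointers for rebuilding the route
    frontier = [(0.0, start)]    # min-heap of (cost so far, node)
    visited = set()
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node in visited:
            continue             # stale heap entry, already settled
        visited.add(node)
        if node == goal:
            break
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(frontier, (new_cost, neighbor))
    if goal not in best:
        return float("inf"), []
    route = [goal]
    while route[-1] != start:
        route.append(previous[route[-1]])
    route.reverse()
    return best[goal], route

# Hypothetical toy road network: intersections A..E, edge costs in minutes.
ROADS = {
    "A": {"B": 4, "C": 2},
    "B": {"D": 5},
    "C": {"B": 1, "D": 8, "E": 10},
    "D": {"E": 2},
    "E": {},
}

cost, route = shortest_route(ROADS, "A", "E")
```

The turn-by-turn directions would then be generated from the returned list of nodes, which is a separate step from finding the route.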
For the same reason, I don't think it's fair to compare the average Uber driver salary to a full-time salary. Some Uber drivers work full-time, but I'd guess most don't. Lots probably only work a few hours a week. Uber provides students/parents/anyone with a way to make extra money on the side.
Also, from the article I linked to, 900k US drivers make $13 Billion a year. So $14k-$15k/job/year. When you factor in that many (probably most) Uber drivers are only working part-time, that's significant income from a super flexible job.
yep so they should be compensated more, not less, in light of the precarity of their job
> When you factor in that many (probably most) Uber drivers are only working part-time, that's significant income from a super flexible job
super flexible for whom? the drivers? or for Uber?
There are two sides to gig-style work: one side is a corp that's got teams of PhDs and a cloud calculating its optimal risk-reward strategy, and the other side is some poor people trying to make money. The idea that this can turn out well for the latter is a pipe dream.
Driving for Uber is as flexible a job as it gets. I don't see how it's flexible for Uber - Uber can only offer a ride if a driver decides of their own free will to accept a ride.
For many people (2+ million), driving for Uber is worth the money. They decided that it's a better gig than their other options. If they decide that it's not worth it, then they stop. And because it's such a flexible job, they don't even need to give two weeks' notice.
Even though you think that driving for Uber isn't worth the money, millions of people around the world do.
this is purified hubris, how can you not vomit on your keyboard while writing that?
> roughly 200 to 400 total bikes. It's really a pathetic number of bikes and doesn't come close to meeting demand
maybe the mandate of SF is not to satisfy demand or Uber's profit incentive but to keep public interest in mind, e.g. ensure that Uber doesn't develop a monopoly on whatever a jump bike is.
I'm curious how you get around. Do you own a car? It's a fairly privileged view that only the well-off should have access to point-to-point transportation and maybe some civil disobedience was in order to rectify this injustice.
>maybe the mandate of SF is not to satisfy demand or Uber's profit incentive but to keep public interest in mind, e.g. ensure that Uber doesn't develop a monopoly on whatever a jump bike is.
Prior restraint on free enterprise that lacks negative externalities opens the door to crony capitalism replete with bribes, donations, and rent-seeking. In government, never ascribe to benevolence that which can be better explained by greed or power-seeking.
As far as I can tell, the guy doesn't like America or even representative democracy very much. Take that for what you will.
They avoid responsibility to the communities they generate profits in by exporting negative externalities at a much higher level than traditional businesses.
Also, I don't think Uber drivers are 'unskilled'. The lowest rung is filtered out by not being able to bring their own $20000 vehicle to participate.
GPU operability is well-supported, and much like Keras, pymc provides well-designed abstractions over top of TensorFlow, making the downsides of raw TensorFlow mostly irrelevant.
I like PyTorch a lot too, but any time I see someone say PyTorch is easier than TensorFlow, it usually just means that person only tried PyTorch, learned some special knowledge about it, and now they don’t want to admit using a different framework might be the better choice, even if it requires giving up some of what’s nice about PyTorch.
Both TF and Theano require a static graph, while PyTorch lets you use Python’s regular control flow (if, for, while, etc). This makes building modular model components much easier, since you can reason about execution mostly as if it’s normal numerical Python code.
I have tried running PyMC3 models on GPUs (when they were on Theano; not sure if they have transitioned since) and it was slower than on CPUs, and not for small models but for the big, SIMD-wide ones. When I ported the same thing to Pyro/PyTorch, it was clearly making good use of the GPU, not bottlenecked by useless CPU-GPU transfers.
Maybe that’s changed now, so as they say the only useful benchmark is your own code.
Can you post a link to your code with some synthetic data of the sizes you’re talking about to demonstrate this? I hear it as a criticism a lot, but have never found it to be true (full disclosure: I work on a large-scale production system that uses pymc for huge Bayesian logistic regression and huge hierarchical models, both in GPU mode out of necessity).
> “Both TF and Theano require a static graph, while PyTorch lets you use Python’s regular control flow (if, for, while, etc). This makes building modular model components much easier, since you can reason about execution mostly as if it’s normal numerical Python code.”
I can’t tell if you’ve looked into pymc or not based on this (or Keras either for that matter), since in pymc, GPU mode is just a Theano setting. You don’t actually write any Theano code, manipulate any graphs or sessions directly, or anything else. You just call pm.sample with the appropriate mode settings and it is executed on the GPU.
Much like with Keras, where you can also easily use Python native control flow, context managers and so on, pymc doesn’t require low-level usage of underlying computation graph abstractions.
Again, I really like PyTorch too, but people just seem to have only ever tried PyTorch, liked one or two things about it, forgive the parts that are bad about it (like needing to explicitly write a wrapper for the backwards calculation for custom layers, which you don’t need to do in Keras for example), and generalize to criticize other tools.
> free will to accept a ride
Whenever money is involved, free will goes out the window. It’s naive to think of all these people as rational actors.
all off topic generalities.
Having witnessed countless uber/lyft drivers do their thing I have to agree with your "skilled" assessment.
More on point -- my main objection is they are paying basically at cost pricing to the "driver-partners" when you add up all the costs. Basically, though many will disagree, all they're doing is taking the equity out of their vehicle now instead of at resale time.
Actually the question was more around "how do you create your models and what do you mean treating them as code", "why slurm and not something like airflow" , "what is the test/performance setup - backtesting, smoke test" etc etc
The Gitlab stuff is easier to understand.
> how do you create your models and what do you mean treating them as code
we start with local Jupyter notebooks, and refactor bits of code into modules that get tested, which for our models mainly means recovering parameters from simulations, and then test them on real data, where we assess performance with LOO approximations for Bayesian models (notably PSIS) and some labeling from experts (which is not taken too seriously tbh)
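For readers who haven't seen a "recover parameters from simulations" test, here is a minimal sketch of the idea. It is not our actual code: the straight-line model, the least-squares fit (standing in for a Bayesian fit), the function names, and the tolerances are all assumptions chosen to keep the example small.

```python
import random

def simulate(intercept, slope, noise_sd, n, rng):
    """Forward-simulate n noisy observations from a known straight-line model."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def check_parameter_recovery():
    """Simulate with known parameters, refit, and require the fit to land close."""
    rng = random.Random(0)  # fixed seed so the check is reproducible in CI
    true_intercept, true_slope = 1.5, -0.7
    xs, ys = simulate(true_intercept, true_slope, noise_sd=0.5, n=2000, rng=rng)
    est_intercept, est_slope = fit_line(xs, ys)
    assert abs(est_intercept - true_intercept) < 0.1
    assert abs(est_slope - true_slope) < 0.05
    return est_intercept, est_slope

est_intercept, est_slope = check_parameter_recovery()
```

With a Bayesian model the same structure applies, except that the check would typically ask whether the true value falls inside the posterior interval rather than within a fixed tolerance.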
> why slurm and not something like airflow
because the HPC resources we have access to are built with Slurm, which is super fast, supports DAGs of jobs, schedules our jobs reliably and quickly. I don't really want the other stuff on the Airflow feature list to be honest.
>we start with local Jupyter notebooks, and refactor bits of code into modules that get tested, which for our models mainly means recovering parameters from simulations, and then test them on real data
This is the part that everyone seems to be reinventing. Have you looked at PyML (https://eng.uber.com/michelangelo-pyml/)? What are some of your learnings around Jupyter -> production code? A lot of these are around conventions - "write a function called train(), fit(), test()". Is that the basis of your pipeline as well?
I don’t have pymc code anymore since we have moved to Stan, and are now starting to port code to Pyro.
> forgive the parts that are bad about it (like needing to explicitly write a wrapper for the backwards calculation for custom layers
Why do that when AD does it for you?
Not sure I understand - you will need to write a backward pass regardless of whether you use Keras, PyTorch, or anything else. With Keras, you would need to modify the underlying backend code (e.g. with tf.RegisterGradient or tf.custom_gradient). With PyTorch you write the backward() function, which is about the same amount of effort.
In PyTorch, you still do have to define the backward function and worry about bookkeeping the gradient, clearing gradient values at the appropriate time, and explicitly calling to calculate these things in verbose optimizer invocation code.
I encourage you to check out how this works in Keras, because it is simply just factually different than what you are saying, in ways that are specifically designed to remove certain types of boilerplate or overhead or bookkeeping that are required by PyTorch.
Regarding more verbose PyTorch code for the update step, compare:

In TensorFlow:

    loss = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=output_logits)
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
    sess.run(optimizer)

In PyTorch:

    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    loss = nn.CrossEntropyLoss()(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In my opinion, PyTorch makes the parameter update process a lot easier to understand, control, and modify (if needed). For example what if you want to modify gradients right before the weight update? In PyTorch I'd do it right here in my code after the loss.backward() statement, while in TF I'd have to modify the optimizer code. Which option would you prefer?
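To show what "modify gradients right before the weight update" looks like when the steps are spelled out, here is a framework-free sketch in plain Python. It is not PyTorch code: the one-parameter model, `loss_and_grad`, `clip`, and all the numbers are made up, and the comments only mark which line plays the role of which PyTorch call. In PyTorch itself the same spot would be between loss.backward() and optimizer.step(), for instance with torch.nn.utils.clip_grad_norm_.

```python
def loss_and_grad(w, xs, ys):
    """Mean squared error of the model y ~ w * x, and its derivative in w."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad

def clip(value, limit):
    """Clamp value into the interval [-limit, limit]."""
    return max(-limit, min(limit, value))

def train(xs, ys, lr=0.05, clip_limit=1.0, steps=200):
    w = 0.0
    for _ in range(steps):
        _, grad = loss_and_grad(w, xs, ys)  # plays the role of loss.backward()
        grad = clip(grad, clip_limit)       # custom edit of the gradient before the update
        w -= lr * grad                      # plays the role of optimizer.step()
    return w

# Hypothetical data generated by y = 3 * x, so the fitted weight should approach 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w = train(xs, ys)
```

The point is only that the gradient is an ordinary value you can inspect or change between the two steps; whether that is easier than customizing an optimizer is the judgment call being argued above.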
[1] https://stackoverflow.com/questions/44428784/when-is-a-pytor...
I’ve definitely never had to do that. Where do you get this from?
Usually when we are doing more of the train/fit/test cycle, there’s an argparse script to quickly try different parameter values succinctly (which is run and tracked by the above CI setup)
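A minimal sketch of what such an argparse entry point could look like. Every option name and default here is hypothetical, and `run` is a placeholder that only echoes the settings instead of training anything.

```python
import argparse

def build_parser():
    """Command-line options for one experiment run (all names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Run one train/fit/test cycle.")
    parser.add_argument("--dataset", default="synthetic", help="name of the dataset to use")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--num-steps", type=int, default=1000)
    parser.add_argument("--seed", type=int, default=0)
    return parser

def run(args):
    """Placeholder for the real train/fit/test code; it only echoes the settings."""
    return {
        "dataset": args.dataset,
        "learning_rate": args.learning_rate,
        "num_steps": args.num_steps,
        "seed": args.seed,
    }

def main(argv=None):
    # argv=None makes argparse read sys.argv[1:], which is what a CI job would use.
    args = build_parser().parse_args(argv)
    return run(args)

# Explicit argument list here so the example does not depend on the real command line.
settings = main(["--dataset", "pilot", "--learning-rate", "0.1"])
```

Each CI job can then call the script with a different argument list, so the parameter values tried end up recorded alongside the job logs.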
I wouldn’t say we’re reinventing since a better solution isn’t very clear (though PyML et al look interesting)
Edit: forward simulation isn't a frequent thing in posts on generic ML algorithms, so just as an example: suppose you run a model and see an oscillatory component along a temporal dimension in your residual error, and you add an oscillatory component to your model, and rerun it but still see a residual with an oscillation. You can run a forward simulation of your model to see what frequency it's predicting and check against what's seen in the data, and fix it. This is a contrived example but when you have multiple competing priors or model components, this is an effective way to debug their behavior.
Maybe a missing detail is that our models are run-once: once results are QA'd, they are sent to the relevant practitioner, so Uber's query-per-second stuff is irrelevant for us (for now), which I can see simplifies the deployment question enormously.
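A minimal sketch of that frequency check, under heavy assumptions: the "model" is a single sine wave, the "observed" series is synthetic stand-in data rather than anything real, and the dominant frequency is read off a naive discrete Fourier transform. All names and numbers are invented for the example.

```python
import cmath
import math

def forward_simulate(frequency_hz, amplitude, sample_rate, n):
    """Generate n samples from a toy model with one oscillatory component."""
    return [amplitude * math.sin(2 * math.pi * frequency_hz * t / sample_rate)
            for t in range(n)]

def dominant_frequency(signal, sample_rate):
    """Frequency (Hz) of the largest non-DC component of a naive O(n^2) DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * sample_rate / n

SAMPLE_RATE = 100.0  # samples per second (assumed)
N = 200              # gives a frequency resolution of 0.5 Hz

observed = forward_simulate(5.0, 1.0, SAMPLE_RATE, N)   # stand-in for the real data
simulated = forward_simulate(8.0, 1.0, SAMPLE_RATE, N)  # model with a wrong frequency

observed_hz = dominant_frequency(observed, SAMPLE_RATE)
simulated_hz = dominant_frequency(simulated, SAMPLE_RATE)
mismatch_hz = abs(observed_hz - simulated_hz)
```

A nonzero `mismatch_hz` is the signal that the oscillatory component in the model is predicting the wrong frequency, which would explain why the residual still oscillates after adding it.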
do you have specific questions?
Would love to know more about your packaging setup - the branch name to divide datasets is a nice trick (I'll use it as well).
How does your CI know where to find models? I'm betting you are using some kind of convention here - one model per py file... so package each py file in a Docker container.
If it is possible, would love to see the skeleton structure of one of your pre-packaged files.
TL;DR - it seems you invented something like PyML as well. Are the deployment scripts + model skeletons open source?
In the ML projects, it serves mainly to package dependencies and to enforce some basic security constraints: raw datasets are accessible read-only, so that if we suspect some issue with cached results (because our inner orchestrator is Make...) we can nuke all the results and start over from scratch, sure that the raw data is intact.
The models and arguments are in the CI config. No magic there, but since it’s all in the repo I’m ok with it.
This whole setup was put together for an upcoming clinical trial as steps toward ISO quality norms compliance, and I can’t share it now. I do intend to reproduce it in an open form alongside our existing software (GitHub.com/the-virtual-brain) when it’s ready.
In any case I appreciate your questions a lot: they drove me to think a little harder and see why stuff like Michelangelo and PyML is stuff that even we (an academic/clinical group) should be using... if we can find the time to do it.