I think this is one of those techniques that every serious computer scientist should have in their toolbox. Someone who knew some statistics, a smattering of something like SMT (enough to drive Z3) and some linear programming could routinely resemble Gandalf at any shop that has people struggling with difficult problems.
The main thing is not to turn into a cookbook guy. I've known a few people whose ability to just build an actual algorithm to solve some interesting problem has atrophied since they reach for a ILP solver the minute anything gets difficult.
And then if weird constraints come along that broke the max-flow reduction, I could usually shoehorn that into the formulation.
Can't remember how to write Dijkstra's algorithm? Meh, integer programming it is. Want to write a sudoku solver and you can't be arsed with dancing links or backtracking or prolog or... Whatever, CBC will do it. It'll be nasty and a hell of a lot slower than doing it "the right way" in a fast language, but it might not be slower than doing it in Python or Javascript.
Constraint programming generates lots of potential solutions (combinatorially) and prunes the search tree to make the large numbers tractable.
The intuitions behind the two techniques are quite different.
But, now (A)-(D) are no longer so difficult. This should be the beginning of a new Golden Age for such work.
Sure, since the results of such optimization can look darned smart, smarter than the average human, might call the work artificial intelligence. Really, though, the work is mostly just some classic applied math now enabled in practice by the progress in computer hard/software.
The OP is a nice introduction to vehicle routing via integer linear programming (ILP) set partitioning. For the linear programming and the case of ILP, I'll give a quick view below. But, now, let's just dig in:
Here is an explanation of the secret approach, technique, trick that can work for vehicle routing and many other problems: The real problems can have some just awful non-linear cost functions, just absurdly tricky constraints, e.g., from labor contract work rules, equipment maintenance schedules, something can't do near point A near lunch, even handle some random things, even responding to them in real-time (dynamically), etc. yet can still have a good shot at getting a least cost solution or nearly so. The "nearly so" part can mean save a lot of money not available otherwise. When there is randomness, then try to get least expected cost.
So, first, the trick is to do the work in two steps.
The first step call evaluation and the second, optimization.
From 50,000 feet up, all the tricky, non-linear, goofy stuff gets handled by essentially enumeration in the first part leaving some relatively simple data for the optimization in the second part.
In practice, this first step typically needs a lot of data on, say, the streets of a city and requires writing some software unique to the specific problem. The second step, the optimization, may require deriving some math and/or writing some unique software, but the hope is that the step can be done just by routine application of some existing optimization software.
The OP mentions the now famous optimization software Gurobi from R. Bixby and maybe some people from Georgia Tech, e.g., from George Nemhauser and Ellis Johnson (long at IBM Research and behind IBM's Optimization Subroutine Library (OSL) and its application to crew scheduling at American Airlines).
First Step.
Suppose you are in Chicago and have 20,000 packages to deliver and 300 trucks. Okay, what trucks deliver what packages to make all the deliveries on time, not overload any trucks, and minimize the cost of driving the trucks? You do have for each package the GPS coordinates and street address. And you have a lot of data on the streets, where and when traffic is heavy during the day, etc.
Okay, let's make some obvious, likely doable progress: Of those 20,000 packages, maybe have only 15,000 unique addresses. So, for each address, bundle all the packages that go to that address. Then regard the problem as visiting 15,000 addresses instead of delivering 20,000 packages.
So, you write some software to enumerate. The enumeration results in a collection of candidate routes, stops, and packages to be delivered for a single truck. For each of those candidates, you adjust the order in which the stops are made to minimize cost -- so here get some early, first-cut, simple optimization. You keep only those candidates that get the packages delivered on time, meet other criteria, etc. You may have 1 million candidate single truck routes. For each of the the candidates, you find the (expected) operating cost.
So, suppose you have n = 1 million candidate single truck routes.
Also you have m = 15,000 addresses to visit.
So, you have a table with m = 15,000 rows and n = 1 million columns. Each column is for some one candidate route. Each row is for some one address. In each column there is a 1 in the row of each address that candidate route visits and a 0 otherwise. One more row at the top of the table is, for each column, the operating cost of that candidate route.
So, you have a table of 0's and 1's with m = 15,000 rows and n = 1 million columns. You have a row with 1 million costs, one cost for each column.
Again, you have 300 trucks. So, you want to pick, from the n columns, some <= 300 columns so that all the m addresses get served and the total costs of the columns selected is minimized. That is, if add the columns as column vectors, then get all 1's.
Second Step.
Well, consider variables x_i for i = 1 to n = 1 million. Then we want x_i = 1 if we use the route in column i and 0 otherwise. Let the cost of the route in column i be c_i. We want the total cost (TeX notation):
z(x) = sum_{i = 1}^n x_i c_i
to be minimized. So, right, we take the big table of m = 15,000 rows and n = 1 million columns and call it m x n matrix A = [a_{ij}]. We let m x 1 column vector b have all 1's. We regard x as n x 1 where in row j = 1 to n is x_j. Then, we get linear program
minimize z(x)
subject to
Ax = b
x >= 0
So, this is a case of linear programming.
Except in our problem we have one more constraint -- each x_i is 0 or 1, and in this case our problem is 0-1 integer linear programming.
Linear Programming.
In linear programming with n variables, with the real numbers R, we go into the n-dimensional vector space R^n. The
Ax = b
x >= 0
are the constraints, and the set of all x that satisfies those is the feasible region F, a subset of Rn.
In R^n, a closed half space* is a plane and everything on some one side of it.
Then F can be regarded as an intersection of m closed half spaces. So, F has flat sides, straight edges, and some sharp points (extreme points).
Well, if there is an optimal solution, then there is an optimal solution at at least one of those extreme points. So, the famous Dantzig simplex algorithm looks for optimal solutions in iterations were each iteration starts at an extreme point, moves along an edge, and stops at the next extreme point. That's for the geometric view; the algebraic view is a tweak of the standard Gauss elimination algorithm.
Linear programming and the simplex and other algorithms have a huge collection of nice properties, including some surprisingly good performance both in practice and in theory.
But, asking for each x_i to be an integer is in principle and usually in practice a huge difference and gives us a problem in NP-complete -- at one time, this was a huge, bitter surprise.
Warning: At one time, the field of the applied math of optimization in operations research sometimes had an attitude, that is, placed a quasi-religious importance on optimal solutions and was contemptuous of any solutions even 10 cents short of optimal. Well, that attitude was costly for all concerned. Instead of all the concentration on saving the last 10 cents, consider saving the first $1 million. Commonly in practice, we can get close to optimality, may be able to show that we are within 1% of optimality, so close the rest wouldn't even buy a nice dinner, and see no way in less than two weeks more of computer time to try to get an optimal solution.
So, concentrate on the big, fat doughnut, not the hole.
How to solve ILP problems is a huge subject -- e.g., can start with George Nemhauser -- but a major fraction of the techniques exploit some surprisingly nice properties of the simplex algorithm. Right, likely the best known approach is the tree search technique of branch and bound.
Great definition of linear programming. In a sense the thing that makes integer programming hard is that the feasible region is not convex -- for two feasible solutions x and y, ax + (1-a)y for 0<=a<=1 is not necessarily feasible.
I'd also say that integer programming is a kind of constraint-based programming extended with the addition of an objective function -- we're not just looking for any satisfactory answer, but an answer with a cost or value that cannot be improved upon. You're definitely right that "things called constraint-based programs" tend to be solved in different ways, though (and the languages they're expressed in tend to be nicer, too.)
Further, I think you tried to describe Linear programming in the beginning of your post. But Linear programming searches the vertices of the polytope and not the edges.
The biggest issue with Integer programming is the lack of convexity of the solution space when compared to Linear programming. So the approach for solving them is very different compared to Linear programming. If I made any mistakes in my post, I welcome to be corrected.
So, what is optimization? For positive integers m and n, the real numbers R, and functions
f: R^n --> R
g: R^n --> R^m
find x in R^n to (problem MP1)
maximize f(x)
subject to
g(x) <= 0
where g(x) is regarded as an m x 1 matrix and the 0 is also m x 1 but with all its components 0.
The constraints are the
g(x) <= 0
We say that we have m constraints.
The subject is essentially the same using
g(x) >= 0
or
g(x) = 0
instead or minimizing instead of maximizing.
Point x in R^n is feasible provided it satisfies the constraints, that is,
g(x) <= 0
The feasible region is the set F of all feasible x.
A point x is optimal if it is feasible and makes f(x) as large as possible. That is, x is optimal provided for any feasible y we have f(x) >= f(y).
If one or more of the functions f or g is non-linear, then we have non-linear programming. Here we may use the Kuhn-Tucker conditions to seek a point that satisfies necessary conditions for optimality. For an algorithm, we may take gradients of the constraints to set up a linear program, solve that, do a line search, and try again.
If functions f and g are linear, easily written using matrix algebra, then the optimization problem is linear programming (here programming is used in the British sense of operational planning).
Function f is convex provided for any x, y in R^n and any t in [0, 1] we have
t f(x) + (1 - t) f(y) >= f( t x + (1 - t) y )
That is, in a graph of f, a straight line between (x, f(x) and (y, f(y) ) is on or over the graph and never under the graph. So, the straight line over approximates the function.
If f is convex, then it is continuous (proof is not very difficult but is in Fleming, Functions of Several Variables). IIRC convex implies differentiable almost everywhere (maybe proof in R. T. Rockafellar, Convex Analysis). The epigraph is the graph of f and everything above it, and it is a convex set so has supporting hyperplanes and is the intersection of all the closed half spaces from its supporting (tangent) hyperplanes (IIRC, proof in Fleming). Such a supporting hyperplane is called a subgradient and can be quite useful in optimization, e.g., Lagrangian relaxation. There is a closely related, really nice result in Hilbert space: Every non-empty, closed convex subset has a unique element of minimum norm. That can be very nice to know!
To be more clear, a convex function, while continuous, need not be differentiable. Why not? Intuitively, at a point, the function makes a sharp turn. Similarly, a cube is a convex set but has 10 sharp edges and 8 sharp points that have supporting (tangent) planes, but these planes are not unique. So, a subgradient at a point would be a gradient and a tangent plane if at that point it was the only tangent plane. So, intuitively, a subgradient is nearly a tangent plane.
Function f is concave provided that -f is convex. If f is convex and -f is convex, then f is linear -- so, intuitively, a function that is convex (or concave) is half of being linear.
We have some astoundingly powerful methods for solving linear programming problems.
A significant fraction of the Nobel prizes in economics have been from economic models based on linear programming.
Some special cases, e.g., least cost network flows, maximum matchings, are special cases with even better results for the algorithms.
If in MP1 function f is concave and funtion g is linear, then again we have some good methods.
If in MP1 function f is quadratic and concave and function g is linear, then we have the problem of Markowitz in investment portfolio management and his Nobel prize in economics.
If we are maximizing a convex function subject to linear constraints, then we have, IIRC, a problem in NP-complete.
If in linear programming we require in addition that some or all of the components of x are required to be integers, then we have some integer constraints and a problem in integer linear programming (ILP). ILP was one of the earliest problems shown to be in NP-complete.
So, that's a nutshell version of optimization.
Can also extend to accommodating randomness and doing much the same over time.
The over time part is control theory, e.g., how to fly an airplane from one point to another in least time -- there are books, e.g., by Athans and Falb. Mike Athans was long at MIT and Falb, at the Brown Division of Applied Mathematics. Control theory had a flurry of interest in the space program. There is an extension, differential game theory, which is like how best for a missile to chase an airplane where the airplane is best trying to avoid the missile. Typically in control theory and differential game theory we look for necessary conditions for optimality which usually turn out to be in terms of Lagrange multipliers. Optimization over time with randomness is essentially stochastic optimal control, and the main technique is a version of dynamic programming. An early proponent was Richard Bellman.
The Black-Scholes Nobel prize work in option pricing is a relatively simple case of stochastic optimal control.
So, in terms of MP1, constraint programming is, say, find x in R^n so that
g(x) <= 0
that is, just find a point in the feasible region R.
An example might be, yesterday killed 5000 hogs and hung and chilled the results. Today cut the hogs and put the pieces in boxes in 18 wheel trucks, about 40,000 pounds of load per truck, for delivery to customers about 900 miles away.
Have only three bays in the loading dock. So, want to know the order in which to load the trucks to get them all loaded in time.
So, have a constraint satisfacation problem and not really an optimization problem.
Fundamentally, i.e., in the sense of computational complexity, finding a feasible point is as difficult as finding an optimal point. So, really, fundamentally, constraint programming is no easier than optimization.
IIRC at one time SAP and ILOG were interested in CPLEX as means of solving constraint satisfaction problems. So, to solve a constraint satisfaction problem, start with some optimization software.
Gurobi, CPLEX, and friends are less magical if you have enough mathematical maturity. Typically, if you are comfortable with linear algebra (a subject woefully ignored in most undergraduate curricula), you can work with them. In my experience, working with optimization engines is a two step process. First, you need to figure out how to formulate the problem in a smart way (typically this means that it's convex with constraints in a certain format), which is normally the hard part. Then you scroll through the list of available algorithms that the engine has at your disposal and pick one with properties that you like (they often have suggestions if you have no strong preference). Normally, if you can formulate the problem so it's convex, you will at least have a good shot of doing well with your optimization.
Recently I re-implemented in Pulp in an afternoon using ILP.
It's similarly fast, both can solve similar sets of problems. But the ILP solution was so much easier and shorter.
What class of cutting planes did you use in your B&C algorithm (full disclure: I do my PhD on cutting planes, thus I'm interested)?
I think the heuristics may have been based on selecting cells that had the least possible solutions, in order to quickly 'zoom in' on the areas where you can quickly prune infeasible solutions.