Musings on Probprog

People in my circles periodically talk and write about the nature of this emerging new beast that is called probabilistic programming. There’s various talk about how it’s about samplers, or density functions, or Markov chains, or making machine learning or artificial intelligence more accessible, or various other such stuff. Certainly those things hover in the air around probabilistic programming, but I don’t think that gets to the essence of it. I think that probabilistic programming, as opposed to standard programming, is a new scope of problems that can be solved by computing.

This new scope is obtained by relaxing what it means for a problem to be “solved”. To wit, a problem like “Consider a hypothetical Earth satellite in geosynchronous orbit, weighing 500 kg. What country plausibly launched it and why?” has no definite “answer”. Yet, in the presence of a model of satellites, computation may be applied and plausible (and, one hopes, useful) candidates (e.g., “It could be a US communications satellite”) may be generated.

To me, this shift is analogous to my college experience of shifting from studying math to studying computers. The standard I learned from the math department for what constitutes a “solution” was the presence of a clear, understandable, and communicable answer, supported by a (potentially very difficult or obscure) proof of its correctness. The Fermat-Wiles theorem provides a spectacular example: For what integers \(n > 0\) are there any triples of positive integers \(x\), \(y\), and \(z\) such that \(x^n + y^n = z^n\)? The answer is simple and clear: \(n \leq 2\). The proof is an eighty-page book about elliptic curves that not many people in the world can read, but the problem is considered solved.

The standard of solution I learned from the computer science department is different: a problem is solved when there is a program that computes the answer acceptably quickly. It is necessary to trust or verify that the program is correct, but it is not necessary to be able to predict its output on any given instance. The problem of finding the shortest path from one node of a network to another is classic. Even though someone can hand me a network in which I won’t be able to intuit shortest paths, I nonetheless consider the problem solved (as a computational problem) because effective algorithms are well-disseminated and high quality implementations are available.

In some sense, the computational standard is a relaxation of the mathematical one. An answer that the mathematicians consider good looks to the CS people like an extremely fast program—much faster than necessary. And so the scope of addressable problems broadens. Computer programming offers attacks on all sorts of problems that simply do not have simple, understandable answers.

The relationship between classical and probabilistic programming seems similar to me. A “normal” program computes an answer. As a probability distribution, this has zero entropy—much less than strictly necessary, from the probprog lens. And so the scope, similarly, broadens, to problems like our satellite that simply do not have definite answers.

So what is the shape of this new scope of attackable problems? What do the solutions look like? The common characteristic of the problems is the presence of modelable uncertainty. That’s why it’s called “probabilistic”: probability theory serves as the framework in which to approach uncertainties in one’s problem with computational objects.

As for solutions, and the internal workings of the discipline, I can name several relatively novel features:

Uncertainty in, uncertainty out.
Imprecisely posed questions necessarily have imprecise answers. It could be a US communications satellite; it could be a Chinese one too, though perhaps that’s less likely. The “solution” to such a problem, then, is (a program that samples from) a probability distribution on possible answers.
Knowable unknowns.
Generally, the uncertainty within a given model of a given situation can be characterized. For instance, the meta-question “What is the probability (in this model) that it’s a US communications satellite?” does have a definite answer. Such answers are often intractable to compute to full floating-point precision, but can always be approximated arbitrarily well with sufficient computation. (The Central Limit Theorem guarantees \(O(\sqrt{N})\) accuracy for \(O(N)\) computation for many questions, including probabilities.) There are, of course, also many such probability questions to which the exact answer is known.
New problems, new tactics.
Expanding the space of approachable problems has the side-effect of expanding the space of valid subproblems. This, in turn, expands the universe of tactics by which problems may be decomposed and solved. For instance, instead of directly forming a probabilistic model of 500-kg satellites in geosynchronous orbit, or even of the Earth satellites that exist, we can form a probabilistic model of possible structures for a domain of satellites. If we then ask that model “How is the true satellite domain plausibly structured, in light of this data set about real satellites?”, answers to that question can become models of the extant Earth satellites, which can then be queried for inferences about origin and purpose from mass and orbit class. It often turns out that this indirection is actually an easier path to decent answers for the original question.
Inverse reasoning.
A specific problem template to which probability applies very well is “Here is an uncertain model of a causal process; here is an observed result; how might the process have gone to produce it?” The problem is formalized as conditioning the joint distribution over cause-result pairs on the observed result. There is no acceptably efficient general solution, but many usable (usually approximate) methods have been developed. This template is fruitful enough that many authors view it as paradigmatic of the entire field.
All models are wrong, but some are useful.
It seems, looking at the practice of probabilistic problem solving, that it tolerates more severe, or at least more explicit, gaps between the real-world problems it attacks and the models with which it attacks them than classical programming does. Perhaps this is because uncertainty is already present within the formalism, so the uncertainty in the model-reality gap falls under some of the same quality assurance processes as the in-model uncertainty does.
An interesting side-effect of models that explicitly admit uncertainty is that they can often “go with the flow” if something unexpected happens. If the data fails to conform to the model, the model just concludes that there must have been a lot of noise, and makes the best of it. This both makes it easier to get useful results about a new situation than traditional programming, and harder to diagnose errors in model specification.
Natural approximation.
To the extent that “a sample from a probability distribution”, or “a sampler for a probability distribution”, constitutes a solution to a problem, the probability paradigm admits a new natural notion of approximation. To wit, a sample or a sampler may often still be acceptable if its distribution is merely close (in, say, Kullback-Leibler divergence) to the “exact” distribution. Many probabilistic techniques naturally trade computational cost for nearness of approximation; much effort can sometimes be saved by not making the approximation gap very much smaller than the model-reality gap.
Squishy Testing.
Testing a probabilistic program is squishier. After all, even a correct program can produce arbitrarily weird results by sheer chance, as can an incorrect program produce plausible results by sheer chance. So the testing is also a probabilistic activity, though generally its error can be driven down by computation.

What, then, is probabilistic programming? It is the practice of applying computation to generate probabilistic solutions to probabilistically posed problems. It is also the practice of making (software) agents that operate in light of a capability to solve such problems.