A Program Synthesis Primer
My colleague Somesh Jha recently asked me to give a lecture on program synthesis to his class. As I prepared my notes, I realized that a single lecture is long enough to formally define the problem and code up some cool examples that demonstrate the process. This post contains the notes and code I used in class. I go through two simple examples in this post; the full code is on GitHub.
What is program synthesis anyways?
In the fields of programming languages and verification, the traditional problem of program synthesis involves constructing a program from a high-level mathematical specification. For example, suppose you want to write a program that computes the factorial of a number $n \gt 1$. You start with the mathematical definition of factorial, $ n! $, and you gradually massage it—rewrite it using your knowledge of the properties of factorial and mathematics—until you get to an executable program.
While lofty in its goals and rich in its tradition, this synthesis approach—called deductive synthesis—has had limited success. Simply, mathematical specifications are hard to write, and turning a specification to a program is very hard to automate.
Recently, researchers have been looking at forms
of synthesis that are simpler to automate.
The style of synthesis I will focus on here is
the one pioneered in the Sketch project.
The idea is that we start with a reference implementation
Think of this as the naive solution of a problem
that you’re sure works correctly, but that you won’t put into production, perhaps because it’s pretty inefficient.
Your goal is to find a program
that is equivalent to
but that is more efficient and hard to get right.1
For illustration, let’s look at an extremely simple example.
The naive program is the following,
where for simplicity we assume all
ints are bitvectors with an 8-bit width.
int p_naive(int x): y = x * 2 return y
You, as a smart (but not very smart) programmer, think
that you can do better by writing a program of the following
form, but you’re not sure what to place in the
int p_smart(int x): y = x << ?? // the operator << is shift left return y
In a sense,
p_smart represents a family
of programs—all possible instantiations of the hole
Of course, the right completion here is replacing
the hole with
So how do we automatically find the value of
p_smart equivalent to
Easy! There are $ 2^8 $ different instantiations
??, so we enumerate
them until we find the right one.
In general this incurs a combinatorial explosion.
We cannot avoid the combinatorial problem,
but it turns out we can neatly characterize the
search space as a formula that we can give to an efficient
off-the-shelf solver—as we will see in a bit.
First, let’s try to gradually formalize the problem.
We will use the variable
to denote the hole
Definition 1 Find a value for
such that for all values of
p_naive(x) == p_smart(x).
Definition 2 We can think of
an input to
So, our goal is really to
find a value for
such that for all values of
p_naive(x) == p_smart(x,h).
Definition 3 Now, we can view a program, say,
as a logical relation $ \varphi_n(x,y) $
over its input and output variables, $ x $ and $ y $.
The idea is that the relation is true if and only
p_naive(x) == y.
So, our goal is really to find find a value for $ h $ such that for all values of $ x $ and $y$, we have $ \varphi_n(x,y) \iff \varphi_s(x,h,y) $ is true.
Final definition Finally, we are ready to state the problem logically. Find a satisfying assignment of the following formula Notice that the variable $ h $ is free. Therefore, any satisfying assignment of this formula is only over $ h $.
At this point, we have completely reduced the synthesis problem to solving a logical formula. We can do this with an SMT solver like Z3.
A detailed example
We can now apply the above
process to encode
p_naive as formulas.
We will encode them in the first-order theory of bitvectors,
which has all the standard operations we care about.
We will now encode those in Z3 using its Python API as follows.
First, we define all the variables as 8-bitvectors.
The code for this example is available on GitHub
x = BitVec('x',8) y = BitVec('y',8) h = BitVec('h',8)
Next, we encode $ \varphi_n $ and $ \varphi_s $:
phi_n = y == x * 2 phi_s = y == x << h
Finally, we are ready to invoke Z3:
# first, encode the universally quantified formula # (the == symbol is if and only if) encoding = ForAll([x,y], phi_n == phi_s) # second, call Z3 and check if there is a model s = Solver() s.add(encoding) s.check() # print the model (i.e., the value of h) print s.model()
If you run this code, you will get the output
[h = 1],
just as expected!
The approach presented above works over two programs: it tries to find a program that is equivalent to some reference implementation. Instead of writing a full-blown reference implementation, we will now just write some test cases, and find a program that passes all of them. You could think of this as test-driven development on steroids!
Say we want to write a function that, given
x, returns a bitvector
y that is 1 in the
position of the first 0 from the right occuring in
x, and 0 everywhere else.2
# test 1 x = 00000000 y = 00000001 # test 2 x = 00000011 y = 00000100 # test 3 x = 00000010 y = 00000001
Suppose you have the following hunch on how to write such a program efficiently:
y = (~(x + ??)) & (x + ??)
~ is bitwise not and
& is bitwise and.
Our goal is to find solutions to the two holes
that result in our desired program.
Just as before, we encode the program as a logical relation:
Note how we encoded the two holes as two variables.
Similarly, we can encode the test cases as a relation:
Finally, we solve the following formula:
Observe that we have an implication and not an equivalence
between the two relations.
Intuitively, the relation defined by the test cases
is small and does not define the behavior other than for
the 3 test cases we’ve supplied.
So what we’re looking for is a program
test cases appear in (or are a subset of) its relation.
Let’s encode this in Z3. The process is similar to what we saw above:
The code for this example is available on GitHub
phi_s = y == (~(x + h1)) & (x + h2) t1 = And(x == 0, y == 1) t2 = And(x == 1, y == 2) t3 = And(x == 2, y == 1) phi_t = Or(t1,t2,t3) encoding = ForAll([x,y], Implies(phi_t, phi_s))
Solving the formula
[h2 = 1, h1 = 0], which results in a correct program.
Since we are dealing with test cases,
the test cases we supply may be insufficient to force the SMT
solver to find the right program; in such case,
you can supply more tests.
For example, if I just give Z3 test 3, on my machine I get
[h2 = 15, h1 = 14], which is incorrect (check it).
We looked at a combinatorial form of program synthesis, where we define the search space as a program with holes. You might wonder how to encode control-flow, and not only holes that are replaced with consants. A great paper to read is Synthesis of Loop-free Programs.
The idea of searching the space of programs has recently been explored in a wide array of settings. It turns out that handing the problem to the SMT solver is not always the right way to go; sometimes a custom enumeration algorithm outperforms SMT solvers. SMT solvers, however, are very good at finding magic constants, which simple enumeration may never get to. So, in bit-twiddling problems, symbolic encodings like the one shown here are the way to go.
“But this is what an optimizing compiler does!” you protest. Yes! But here we are taking a combinatorial approach: a compiler applies a fixed set of tricks to get to an efficient program; here we will manually define a space of programs, and search it exhaustively until we find the right one. In some sense, we are doing superoptimization. ↩
Example borrowed from Sketch project and found through Loris D’Antoni’s notes. ↩