A look at the development of machine-code analysis and the difference between academic and commercial research
Recently, the Reps at Sixty workshop was held in Edinburgh, Scotland in honor of my having turned 60 earlier in the year. It was really a lot of fun! There were interesting talks by former colleagues, collaborators, a sabbatical office mate, etc., most of which were related to problems that I had worked on at some point in my career. GrammaTech CEO Tim Teitelbaum‘s talk was titled “Reps at Twenty-Two,” in which he described how we came to work together on the Cornell Program Synthesizer, and how that progressed into the work on the Synthesizer Generator and the founding of GrammaTech.
That evening, there was an elegant dinner at the Balmoral Hotel, which included a hilarious talk by Reinhard Wilhelm, Tom Reps—A Man with Principles. In the talk, he compared the six principles that I had helped articulate in our TOPLAS 2002 paper on shape analysis with the principles of Donald Trump! Reinhard started with the “truth-blurring principle” — which we used to abstract linked data structures into three-valued logical structures — and went on from there.
Back to the workshop: In the morning, I gave a technical talk about a current project, which I described as a kind of capstone to two research threads that I have pursued throughout my career: incremental algorithms and dataflow analysis. At the end of the day, I gave a second talk about how I ended up becoming a professor of Computer Science — and how ironic that career outcome is, given that when I graduated from college in June 1977, I had sworn to myself that I would never go near a university graduate program.
In the talk, I went back a bit further than Tim had, and recounted how I was completely at sea after my college graduation, and how my path to working with Tim included:
- A programming job with an Ithaca firm that was making a computer for astrologers;
- A 3-month trip in Europe by backpack, youth-hostel, and Eurail-pass;
- A job back in Ithaca as a sailmaker with Haarstick Sailmakers;
- A cocktail-party conversation between my mother and former Cornell CS professor Dick Conway, who hired me to do maintenance on the PL/C compiler; and
- A presentation by Tim to Conway’s group about how he was looking for help on an idea that he had had: to create an integrated system in which one could create, edit, run, and debug programs. At the end of Tim’s presentation, Conway said, “Tom just joined us; he’s not doing anything yet; so he’s yours.” And that was how we came together to build the Cornell Program Synthesizer. Moreover, because Tim had no students, and was hugely engaged in the project, it meant that I had a full-time private tutor!
After about a year, Tim pointed out, “What you’re doing is not all that different from what the graduate students are doing, but at the end of their projects they’ll have a degree, and you won’t. So why don’t you enroll in the graduate program?” I followed his advice, and soon thereafter I started to work on incremental attribute evaluation, which became the subject of my Cornell Ph.D. thesis. That work was the basis for the Synthesizer Generator, which eventually became GrammaTech’s first product.
In the remainder of my talk, I traced how the work on incremental attribute evaluation influenced all the later work that I did on incremental algorithms and dataflow analysis. I also examined another slice of my career through the lens of frameworks and generators, which included the Synthesizer Generator; TVLA; WPDS++, WALi, and OpenNWA; and TSL. I also said a few words about the work I did on shape analysis with Mooly Sagiv (Tel Aviv Univ.), Reinhard Wilhelm (U. des Saarlandes), and numerous students of all three of us. That work was another major research thread of my career.
Pictures and more details can be found in the slides and notes panes from my talk.
Due to time constraints in Edinburgh, I wasn’t able to include any reflections about the work on machine-code analysis, which represents yet another major research thread of my career, so that is what I want to discuss here — as well as make some observations about the differences in how research is done at a university and a company.
The work on machine-code analysis started in April 2001, when GrammaTech VP of Engineering Paul Anderson wrote a winning SBIR proposal to AFRL-Rome called “Detecting Malicious Code in Firmware.” Contemporaneously, my UW colleagues Somesh Jha, Bart Miller, and I were funded by ONR to do work on information flow in program executables, under a five-year Multidisciplinary University Research Initiative (MURI) on Critical Infrastructure Protection and High Confidence, Adaptable Software (CIP/SW).
Both projects had a common tool-building goal, namely, to create a tool built from three components:
- The IDA disassembler, to be used as a front end, from which we would obtain a basic intermediate representation (IR) that would have a preliminary control-flow graph, information about instructions, registers, some information about the variables on the stack, etc.
- An analyzer that would use the IR to perform pointer analysis, which would be used in constructing the inputs to component (3).
- The CodeSurfer dependence-graph builder, which would perform reaching-definitions analysis (a.k.a. flow-dependence analysis) and control-dependence analysis.
The outcome would be the kind of dependence graph used by the CodeSurfer GUI to navigate in and query the dependence graph. ONR program manager Ralph Wachter didn’t want to fund duplicate work on creating program dependence graphs for machine code, so he provided separate ONR funding for GrammaTech to collaborate with Wisconsin on developing appropriate infrastructure.
Because I had worked previously on both points-to analysis and dependence analysis, we envisioned my role as building component (2). We had blithely imagined that we would be able to harness some standard points-to algorithm, and then use the results to identify the defs and uses associated with each instruction of the machine-code program. However, it turned out that we had completely misjudged what was necessary to solve the problem, and had a major unsolved issue on our hands right out of the starting gate.
In retrospect, the issue was obvious: the conventional decomposition of analyses used in source-code analyzers — namely, (i) points-to analysis, followed by (ii) numeric analysis — is simply not feasible when one is dealing with machine code. To see why, consider the action of loading the value of a stack-allocated local variable into a register, which occurs frequently in compiled code. In x86 code, it can be performed by an instruction like mov eax, [ebp-20]. In this instruction, there is a numeric operation to compute the address of the local variable on the stack — “subtract 20 from the value in the frame-pointer register ebp” — followed by a dereference of that address to move the value of the local variable from the stack to register eax. The instruction carries out both a numeric computation as well as an address dereference. In other words, numeric operations and address (“pointer”) operations are inextricably intermingled!
I had persuaded Gogul Balakrishnan, who had just started in the graduate CS program at Wisconsin, to work with me on the project. Gogul and I spent about fourteen months feeling like we were beating our heads against a brick wall. Gogul implemented multiple techniques from the program-analysis literature, but each time we found that there was some kind of flaw when applied to machine-code analysis.
This period must have been very discouraging for Gogul, because the other graduate students on the project had begun to publish papers (and win best-paper awards). However, later he told me that he had found the work such fun that he decided to change his original plan of getting just a Masters’ degree and stay for a Ph.D.
After fourteen months, Gogul made the observation that all of the problems we were having with conventional methods boiled down to not having certain values available — so shouldn’t we just concentrate on finding out the sets of possible values? As we later stated it, the technical problem was as follows:
Given a (possibly stripped) executable E, identify the procedures, data objects, types, and libraries that it uses, and,
- for each instruction I in E and its libraries,
- for each interprocedural calling context of I, and
- for each machine register and variable V in scope at I,
statically compute an accurate over-approximation to the set of values that V may contain when I executes.
This information can be used to provide answers to such questions as “What could this dereference operation access?” or “What function could be called at this indirect call site?”
After Gogul’s insight about finding values, it felt like all the things I had learned over the preceding 14 or 15 years had been meant for the purpose of creating CodeSurfer/x86. Things immediately fell into place, and Gogul and I proceeded to develop and write about the collection of analyses that went into CodeSurfer/x86: VSA, ASI, ARA, and recency abstraction, as well as some of its applications in device-driver analysis and model checking. (The first paper that we wrote allowed Gogul to catch up with the other UW students on the MURI project; it received the EAPLS best-paper award at ETAPS.)
A few years later, I worked with Junghee Lim to find a better method for creating machine-code analyses than the brute-force approach that Gogul and I had adopted. The TSL system provides a framework for creating correct-by-construction implementations of the state-transformation functions needed in state-space-exploration tools that analyze machine code. (TSL stands for Transformer Specification Language). From a single specification of the concrete semantics of a machine-code instruction set, TSL automatically generates state-transformation functions needed for static analysis, dynamic analysis, symbolic analysis, or any combination of the three. The generated implementations are guaranteed to be mutually consistent, and also to be consistent with an instruction-set emulator that is generated from the same specification of the concrete semantics. TSL is now used in most of GrammaTech’s binary-analysis products and research projects.
Looking back on these experiences from the perspective of fifteen years, it seems to me that they illustrate a major difference in the kind of research that one can carry out at a university and at a company. I don’t think many companies (GrammaTech or others) would have tolerated— or, in the case of small companies, been able to afford — a situation where the major technical direction was unclear for fourteen months. I’m not saying that it was an ideal situation that Gogul and I were in, but it was one where we were not under a threat of termination while we learned about the problem and found leverage on the issues that we were facing. Even though it would be fair to say that we got it completely wrong in the grant proposal, I am proud that we used the funding to ultimately get it right in the developed tool.
In other words, some things take time, from which I conclude that it would be very difficult in most companies to make the kind of advance that CodeSurfer/x86 represented at the time. Moreover, the experience of developing the analyses in CodeSurfer/x86 is my model for the role of university research, which is to build something that could not be built in a company setting.
GrammaTech has benefited from the long-term perspective that I was able to take in my machine-code-analysis research at the University of Wisconsin. At the same time, I have benefited from seeing the real-world problems that GrammaTech is engaged in solving. I have also benefited from having GrammaTech as a technology-transfer vehicle, which has allowed some of the solutions developed in my research group to have a path to adoption in commercial products.
In closing, I want to reiterate my thanks to the people I already thanked in my talk at the workshop: Tim Teitelbaum, Mooly Sagiv, Reinhard Wilhelm, Susan Horwitz, and Fran Wong. I want to add thanks to Paul Anderson for his role in getting me started on machine-code analysis, and to Gogul Balakrishnan and Junghee Lim for our collaborations on CodeSurfer/x86 and TSL, respectively. Finally, I want to thank Ralph Wachter, who has had the kindness to support the work that Tim Teitelbaum and I did at Cornell, several of my research projects at Wisconsin (including the MURI grant for work on machine-code analysis), and numerous projects at GrammaTech.