Note: This is the second installment of a two-part series about our concolic testing tool, Grace. Check out Part 1 for more background on concolic execution.
I’ll Take One of Each
Grace takes a hybrid approach to exploring GUI programs. It generates command line, file, and network inputs with the solver, but GUI actions are produced randomly. These GUI inputs are represented as commands to xdotool, an X11 automation program. The basic idea is to use random X events to trigger additional program functionality that can then be explored concolically.
Nearly all GUI programs can be navigated using keyboard commands: ALT and COMMAND key combinations are often used as shortcuts to access functionality; GUI elements like text fields and buttons can be selected using the tab key. Thus, we decided to focus on keyboard events and ignore the mouse for the time being.
We compiled a list of possible key presses from X11 KeySyms, excluding characters we weren’t interested in yet (such as accented letters). We also provided functionality to assign “weights” that determine how frequently certain key presses appear in the generated stream.
When Grace detects an X program, it adds a random assortment of xdotool commands to every input it generates. When the input is used in a subsequent round of concolic execution, Grace uses xdotool to feed X inputs into the program. The stream of randomized X commands both exercises additional code in the program and exposes paths that are potentially affected by other input sources, allowing concolic execution to continue program exploration.
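As a rough illustration of this process, here is a minimal sketch of weighted random X-event generation. The key names and weights are illustrative only: the real tool draws from the full X11 KeySym list and uses its own weighting scheme. Heavier weights make navigation keys (Tab, Return, ALT accelerators) more common, since those are the keys most likely to reach new GUI functionality.

```python
import random

# Hypothetical weights; Grace's actual KeySym list and weighting differ.
# Higher weight => the key appears more often in the generated stream.
KEYSYM_WEIGHTS = {
    "Tab": 5,       # move focus between widgets
    "Return": 5,    # activate the focused widget
    "space": 3,
    "alt+f": 2,     # ALT accelerators often open menus
    "1": 1,
    "2": 1,
    "plus": 1,
    "equal": 1,
}

def random_xdotool_stream(n, rng=random):
    """Return n xdotool commands, sampling key presses by weight."""
    keys = list(KEYSYM_WEIGHTS)
    weights = [KEYSYM_WEIGHTS[k] for k in keys]
    picks = rng.choices(keys, weights=weights, k=n)
    return [f"xdotool key {k}" for k in picks]

for cmd in random_xdotool_stream(8):
    print(cmd)
```

In practice, each generated command would be executed against the live X display of the program under test; here we only print the command strings.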
Does It Work?
We measure the effectiveness of a test generation tool by the amount of code coverage its inputs provide. There are many coverage metrics used in software testing: statement coverage, branch coverage, and path coverage, to name a few. For our evaluation, we count the number of instructions executed in the main program itself, excluding any shared libraries it's linked with: a program is likely to exercise only a small portion of a shared library, so the relative coverage of a library (e.g., 10% of libcrypt-2.15.so) is not meaningful.
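The metric can be sketched in a few lines. This is a toy model with hypothetical addresses, not Grace's implementation: count the unique executed instruction addresses that fall inside the main binary's text segment, ignoring everything executed in shared libraries, and divide by the binary's total instruction count.

```python
def main_binary_coverage(executed_addrs, text_start, text_end, total_insns):
    """Fraction of the main program's instructions that were executed.

    Addresses outside [text_start, text_end) — e.g., shared-library
    code — are ignored; duplicates count once.
    """
    in_main = {a for a in executed_addrs if text_start <= a < text_end}
    return len(in_main) / total_insns

# Toy trace: 4 unique main-image instructions hit (out of 25 total);
# the last address lies in a shared library and is ignored.
trace = [0x400000, 0x400004, 0x400004, 0x40000C, 0x400010, 0x7F0000001000]
print(main_binary_coverage(trace, 0x400000, 0x401000, 25))  # 0.16
```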
Let’s see how our hybrid approach helps us with xcalc, a fairly simple calculator tool that nevertheless stymies Grace’s ability to do deep exploration due to its heavy reliance on X11 library code.
With a two-hour time limit and no special handling of X, Grace generates inputs that cover about 16% of the base xcalc program. These inputs aren't terribly meaningful and amount to simply launching the program without doing anything. Grace can be pretty good at finding command line options (like xcalc's -stipple and -rpn), but the small amount of code that parses those options is so overshadowed by the massive X libraries that Grace simply doesn't get very far in the time I've allowed it to run.
Things get more interesting when we enable X fuzzing. Some of the many randomly generated streams of X events can trigger additional program functionality. Sometimes you get a combination that performs a calculation of some sort, invoking the arithmetic engine lurking beneath the interface and possibly exposing it to concolic exploration.
Even these simple actions apparently get us much further. Results vary by run, but under the same 2-hour constraint we achieve coverage of up to 41% with X fuzzing enabled, a 156% increase. We also observe that more file inputs are generated, though we haven’t determined whether these files have any material effect on coverage.
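For the record, the relative improvement works out as follows:

```python
baseline, with_x_fuzzing = 0.16, 0.41

# Relative increase over the baseline: (0.41 - 0.16) / 0.16 = 1.5625
increase = (with_x_fuzzing - baseline) / baseline
print(f"{increase:.0%}")
```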
Conclusion
Testing the program manually (systematically trying every button and command line option, though not testing every setting in xcalc's enormous settings file, because I have to get back to work at some point), I was able to achieve up to about 70% coverage. If there's statically linked code that can't be reached at all, the actual effective coverage may be even higher. In any case, that's a lot better than Grace achieved, and it took much less than two hours.
The thing is, manual testing is boring and limited in ingenuity by my meager brain. On the other hand, with no bias about what “normal” program use is, Grace can (and does!) uncover completely unexpected behavior. Plus, concolic testing provides a way of automating the test generation process, allowing testing to occur in parallel with us humans writing more broken code to test.
That said, concolic testing currently scales poorly, and the hybrid approach I've presented here shows that much can be gained from combining two disparate input generation strategies. We were able to obtain decent coverage with these programs without pouring too many resources into making it happen. We may yet approach X input generation symbolically, but this hybrid technique has already yielded great gains for the time invested.