Reverse Architecting Software Binaries
Dr. Denis Gopan and Greg Nelson, GrammaTech, Inc.
One of the central tasks in reverse engineering software is understanding its structural composition. Reverse architecting helps reverse engineers by automatically recovering software design from its implementation. Reverse architecting has straightforward applications in software assurance (does the recovered design match the intended software functionality?), security analysis (does the software contain unexpected, possibly malicious functionality?), and software refactoring and transformation (what portions of the software need to be altered?).
Reverse architecting has been extensively explored and shown to have great value at the level of source code. However, in many important reverse-engineering scenarios, including the analysis of legacy software, COTS components, and malware, the source code is typically not available. Under the DARPA Automated Rapid Certification of Software (ARCOS) program, we developed technology for performing reverse architecting directly on software binaries, in the absence of source code. In this presentation, we will outline the technical approach we pursued and describe our initial experiences applying this technology to real-world targets.
The cornerstone of our approach is Binary Componentization, a mechanism that decomposes a monolithic target binary into a hierarchy of logically connected components: data-structure implementations, libraries, and higher-level functional modules, such as a parser in a compiler or a hardware abstraction layer (HAL) in a firmware image. Binary componentization works bottom-up, finding similarities based on features identified within individual functions. These features include caller-callee relationships, similarity in data access patterns, code proximity, and name similarity. The individual features are combined to measure how strongly the functions in the binary are related to each other; the functions are then grouped into components using clustering algorithms.
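The bottom-up grouping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Func` record, the four scoring functions, the weights, the threshold, and the choice of single-linkage grouping are all assumptions made for illustration, and the sketch produces a flat grouping rather than a full hierarchy.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class Func:
    """Hypothetical per-function record; field names are illustrative."""
    name: str
    address: int
    callees: set = field(default_factory=set)    # names of called functions
    data_refs: set = field(default_factory=set)  # names of accessed globals


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as unrelated."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(f, g, weights, span):
    """Weighted sum of the four feature families named in the text."""
    scores = {
        "calls": 1.0 if g.name in f.callees or f.name in g.callees else 0.0,
        "data": jaccard(f.data_refs, g.data_refs),
        "proximity": 1.0 - abs(f.address - g.address) / span,
        "names": SequenceMatcher(None, f.name, g.name).ratio(),
    }
    return sum(weights[k] * scores[k] for k in scores)


def componentize(funcs, weights, threshold):
    """Single-linkage grouping: merge any pair scoring at least threshold."""
    parent = {f.name: f.name for f in funcs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    addresses = [f.address for f in funcs]
    span = max(max(addresses) - min(addresses), 1)
    for f, g in combinations(funcs, 2):
        if similarity(f, g, weights, span) >= threshold:
            parent[find(f.name)] = find(g.name)

    groups = {}
    for f in funcs:
        groups.setdefault(find(f.name), set()).add(f.name)
    return sorted(groups.values(), key=sorted)


# Toy input: two list helpers and two UART routines (made-up names).
funcs = [
    Func("list_push", 0x1000, set(), {"list_head"}),
    Func("list_pop", 0x1040, set(), {"list_head"}),
    Func("uart_init", 0x8000, {"uart_write"}, {"uart_regs"}),
    Func("uart_write", 0x8080, set(), {"uart_regs"}),
]
weights = {"calls": 0.3, "data": 0.3, "proximity": 0.2, "names": 0.2}
components = componentize(funcs, weights, threshold=0.6)
print(components)
```

With these assumed weights, the two list helpers and the two UART routines end up in separate components. Changing the `weights` dictionary is one simple way to experiment with how much each feature contributes.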
We explored a variety of features, feature weights, and clustering algorithms. Our experiments indicate that different categories of software (e.g., binaries for object-oriented code with extensive class systems; embedded Linux binaries; or bare-metal firmware written in C) require different combinations of features for achieving good results. However, within the same software category, the approach can be fine-tuned to decompose software with high accuracy.
The derived hierarchy of components informs reverse engineers about the structural composition of the binary. It can also be used to bootstrap the adoption of principled development paradigms, such as model-based systems engineering (MBSE), for legacy systems by generating modeling artifacts, such as SysML Block Definition Diagrams (BDDs). Some human assistance is still required, however, to correct analysis inaccuracies and to assign meaningful names to the inferred components.
For some target software, the high-level design may already be available in some form, such as SysML or a design diagram in software documentation. We developed a Design-to-Implementation Mapping mechanism, which automatically maps the implementation structure inferred by the binary componentization to the design description provided by the user. To construct the mapping, the mechanism leverages both structural and semantic similarity between the implementation and design. Structural similarity includes parallel parent-child relationships. Semantic similarity includes similarity of word patterns between string literals (in binary components) and the description of the design module (in the documentation). The resultant mapping allows the user to connect system requirements allocated to the design modules (captured in SysML or other documentation) with the corresponding code that implements them. Conversely, low-level implementation weaknesses and bugs can be traced to the design modules they may impact. Furthermore, the design recovered from the implementation can be contrasted against the intended design using established techniques such as Reflexion Models [RM].
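The semantic half of the mapping can be sketched as a bag-of-words comparison between the string literals of each component and the text describing each design module. This is only an illustrative approximation: the tokenizer, the cosine score, the function names, and the sample data are assumptions, and the structural (parent-child) similarity is not modeled here.

```python
import math
import re
from collections import Counter


def words(text):
    """Bag of lowercased alphabetic tokens with at least three letters."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 3)


def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_components(component_strings, design_descriptions):
    """Map each component to its best-scoring design module, or None."""
    design_bags = {m: words(d) for m, d in design_descriptions.items()}
    mapping = {}
    for comp, literals in component_strings.items():
        bag = words(" ".join(literals))
        best = max(design_bags, key=lambda m: cosine(bag, design_bags[m]))
        # Leave the component unmapped when no words overlap at all.
        mapping[comp] = best if cosine(bag, design_bags[best]) > 0 else None
    return mapping


# Made-up inputs: string literals per inferred component, and module
# descriptions as they might appear in design documentation.
component_strings = {
    "comp_0": ["parse error at line %d", "unexpected token '%s'"],
    "comp_1": ["uart timeout", "serial port %d not ready"],
}
design_descriptions = {
    "Parser": "Reads source text, reports each parse error and unexpected token.",
    "Serial Driver": "Configures the UART serial port and handles timeout conditions.",
}
mapping = map_components(component_strings, design_descriptions)
print(mapping)
```

In this toy case `comp_0` maps to the hypothetical "Parser" module and `comp_1` to "Serial Driver". A mapping of this kind is the input a reflexion-style comparison would need to relate recovered components to intended design modules.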
[RM] Gail C. Murphy, David Notkin, Kevin J. Sullivan: Software Reflexion Models: Bridging the Gap Between Source and High-Level Models. SIGSOFT FSE 1995: 18-28
Distribution Statement “A” (Approved for Public Release, Distribution Unlimited). This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.