Binary Software Composition Analysis with CodeSentry  

Originally published by the High Confidence Software and Systems Conference

High Confidence Software and Systems Conference, Annapolis, MD, May 2022

Authors:

Antonio Flores Montoya

Abstract:

Most modern software systems have significant third-party dependencies, which often contain exploitable vulnerabilities. Once a vulnerability is disclosed, there is a race between malicious actors trying to exploit it and the defenders of critical infrastructure. The recent Log4j vulnerability disclosure is just one of many examples over the last few years. Deployed systems must be continuously scanned for known vulnerabilities and patched before attackers breach them. Keeping track of third-party dependencies, their versions, and their associated vulnerabilities is challenging, but it is also essential for mapping a software system’s supply chain. The task is made harder by the fact that much of today’s software is distributed in binary form, without a comprehensive software bill of materials (SBOM) that enumerates all its dependencies. SBOMs are now the subject of legislation and regulation, e.g., the May 12, 2021 Executive Order on Improving the Nation’s Cybersecurity, and they are expected to be required by compliance directives in the future.

To address this problem, we have developed CodeSentry, a deep binary scanner for identifying the presence of known vulnerable components in binaries. CodeSentry uses a combination of lightweight binary analysis and machine learning to reliably identify third-party components in software and their associated vulnerabilities, providing a comprehensive cybersecurity assessment, and helping cyber defenders prioritize risks.

In this presentation we will discuss two of the analysis techniques that CodeSentry uses to identify software components in binaries. These techniques have to address the fact that compilation options, such as the choice of compiler and optimization flags, introduce substantial variability in the binary code generated from the same source code. They also need to be accurate and lightweight enough to scan modern software deployments containing hundreds of binaries within minutes. The first technique, Strlibid, extracts component signatures from the strings in the binary. Strings provide useful information for identifying binary components: they are relatively easy to extract from binaries, and they are typically not modified by the compilation process. Nonetheless, to use strings as an effective component identification signature, several information retrieval techniques need to be applied to filter and weight them. The second technique, Embedlibid, uses function embeddings as signatures. Embedlibid computes function embeddings from the functions’ assembly listings using Siamese deep neural networks. These Siamese networks are trained to produce similar embeddings for similar functions, so they can be used to identify library functions in binaries. We will discuss some of the challenges of training such neural networks, as well as how function similarity scores can be lifted into software component matches.
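
To make the idea of filtering and weighting strings concrete, the following is a minimal sketch of one common information retrieval approach, inverse document frequency (IDF) weighting. It is not CodeSentry's or Strlibid's actual implementation, which the abstract does not describe. The function names, the component names `libfoo` and `libbar`, and all strings are hypothetical toy data. Strings that occur in every known component (such as a generic error message) receive zero weight, so only distinctive strings contribute to a match.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def extract_strings(data: bytes, min_len: int = 4) -> set[str]:
    """Return printable ASCII runs of at least min_len bytes (like the Unix `strings` tool)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {m.decode("ascii") for m in re.findall(pattern, data)}


def idf_weights(component_strings: dict[str, set[str]]) -> dict[str, float]:
    """Weight each string by log(N / df): strings shared by all components get 0."""
    n = len(component_strings)
    df = Counter(s for strings in component_strings.values() for s in strings)
    return {s: math.log(n / count) for s, count in df.items()}


def score_components(
    binary_strings: set[str],
    component_strings: dict[str, set[str]],
    weights: dict[str, float],
) -> dict[str, float]:
    """Score each component by the weighted fraction of its strings found in the binary."""
    scores = {}
    for name, strings in component_strings.items():
        total = sum(weights[s] for s in strings)
        matched = sum(weights[s] for s in strings & binary_strings)
        scores[name] = matched / total if total > 0 else 0.0
    return scores


# Hypothetical signature database: component name -> strings seen in that component.
components = {
    "libfoo": {"libfoo version 1.2.3", "foo: bad header", "out of memory"},
    "libbar": {"libbar version 4.5", "bar: checksum mismatch", "out of memory"},
}

# Toy "binary": strings separated by non-printable bytes.
blob = b"\x00\x01libfoo version 1.2.3\x00foo: bad header\x00out of memory\x00\x7f\x02"

weights = idf_weights(components)
scores = score_components(extract_strings(blob), components, weights)
for name, score in sorted(scores.items()):
    print(f"{name}: {score:.2f}")
```

In this toy setup the shared string "out of memory" carries no weight, so its presence in the binary does not count as evidence for `libbar`. A real scanner would need a much larger signature database, further filtering of noisy strings, and a decision threshold on the score, all of which are outside the scope of this sketch.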
