THE ANALYSIS OF BINARY FILE SECURITY USING A HIERARCHICAL QUALITY MODEL

by Andrew Lucas Johnson

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

MONTANA STATE UNIVERSITY
Bozeman, Montana

December 2021

©COPYRIGHT by Andrew Lucas Johnson 2022
All Rights Reserved

TABLE OF CONTENTS

1. INTRODUCTION
2. BACKGROUND
   Binary Analysis
   Quality Modeling
   Security Metrics and Vulnerability Management
   The Operational Technology Environment
3. SUPPORTING WORK
   Quamoco
   QATCH
   PIQUE
   Literature Review of Model-Based Binary Security Metrics
4. RESEARCH GOALS
   Motivation
   Goal Question Metric
5. PIQUE-BIN DEVELOPMENT
   Gather Requirements
   Design
   Development
      Utility Function
      Threshold Calculation
      Tools
         CVE-Bin-Tool
         cwe Checker
         Yara-Rules
      Changes to PIQUE
6. EXPLORATORY CASE STUDIES
   Application to Wireshark Binaries
      Results
      Discussion
      Threats to Validity
   Application to Busybox Binaries
      Results
      Discussion
      Threats to Validity
7. MODEL VALIDATION
   Tool Output Sensitivity To Binary Attributes
      cwe checker Output Model
      CVE-Bin-Tool Output Model
      Yara-Rules Output Model
      Discussion
      Threats to Validity
         cwe Checker Model Threats
         CVE-Bin-Tool Model Threats
         Yara-Rules Model Threats
   Sensitivity to Weighting
   Sensitivity to Single Findings
8. THREATS TO VALIDITY
   Internal Validity
   External Validity
   Construct Validity
   Conclusion Validity
9. CONCLUSION
REFERENCES

LIST OF TABLES

5.1 STRIDE Threats and Desired Security Properties
5.2 Comparison Matrix And Final Weights for Quality Aspects
5.3 Weighting of Product Factors to Quality Aspects
6.1 Wireshark Model Application Results
6.2 Busybox Model Application Results
7.1 cwe Checker Output Model Coefficients
7.2 CVE-Bin-Tool Output Model Coefficients
7.3 Yara-Rules Output Model Coefficients
7.4 Output Models Coefficient P-Values

LIST OF FIGURES

2.1 The ISO 25010 Quality Model, from [25]
2.2 Security Metric Classification Method as presented in [39]
2.3 Basic SCADA Communication Topologies as presented in [49]
3.1 The Quamoco Quality Modeling Approach [52]
3.2 A Generic QATCH Model Instance [46]
3.3 An exemplary PIQUE model, from [40]
3.4 The UML Class Diagram Of A Measure Node
3.5 The UML Sequence Diagram Of Evaluating A Measure Node
5.1 An Exemplary CWE Category From CWE-699
5.2 Two simple model evaluations using the default PIQUE utility function
5.3 Two simple model evaluations using an unbounded utility function with bound QA values
6.1 TQI Values for Busybox Versions
7.1 Distributions of Factors
7.2 Size Compared to Other Factors
7.3 Distributions of Tool Outputs
7.4 cwe Checker Output Compared to Factors
7.5 CVE-Bin-Tool Output Compared to Factors
7.6 Yara Rules Output Compared to Factors
7.7 cwe Checker Output Linear Model Diagnostic Plots
7.8 CVE-Bin-Tool Output Poisson Regression Model Diagnostic Plots
7.9 CVE-Bin-Tool Output Logistic Regression Model Residuals vs Leverage Plot
7.10 The Maximum and Minimum Possible Value for Each Assessment, Based Upon Weighting
7.11 Impact On TQI For A Single Finding From Each Diagnostic, Part 1
7.12 Impact On TQI For A Single Finding From Each Diagnostic, Part 2
7.13 Impact On TQI For A Single Finding From Each Diagnostic, Part 3
7.14 Mean Impact On TQI By Category

ABSTRACT

Software security is commanding significant attention from practitioners.
In many organizations, security assessment has been integrated into the software development lifecycle, which allows for continuous monitoring of software weaknesses and vulnerabilities throughout the development process. One often overlooked aspect of the software development lifecycle is its end. Prior to delivering software to customers, many vendors compile source code into a binary and digitally sign it. In binary form, analysis may be done to reveal security flaws that were not present in the original code or that were injected at some point between the code being written and the code being compiled. Our research goal is to improve our ability to assess the security quality of a binary from different stakeholders' perspectives. While many analysis tools exist that identify security flaws, little work has been done to enable the use of multiple tools, which is necessary to identify different types of security flaws. To accomplish our goal, we approach the problem from the perspective of quality modeling. We have designed and developed a software quality model for assessing security quality in binaries (PIQUE-Bin) and operationalized the model using PIQUE, the Platform for Investigative software Quality Understanding and Evaluation. The design of our model is based on the Microsoft STRIDE model and the software development view of the Common Weakness Enumeration (CWE). The model produces a relative and subjective security score for a binary file. An informal literature review reveals a lack of model-based security metrics targeting binary files, which helped motivate this research. To enhance the validity of this work, a sensitivity analysis based on a benchmark repository of 700 binary files was performed. Model output is validated by measuring tool output sensitivity and calibrated against the presence of injected vulnerabilities. We find that our model is able to measure the security quality of binaries relative to the benchmark repository.

INTRODUCTION

It is no secret that cyber threats have been on the rise in recent years. One recent large-scale compromise occurred in 2020 with the compromise of SolarWinds software in the attack referred to as SUNBURST (https://www.solarwinds.com/sa-overview/securityadvisory). Attackers were able to compromise the SolarWinds build process such that any system with the SolarWinds software would have an exploitable vulnerability. Attacks such as these show the need for security at multiple layers, or defense in depth. Software may be compromised at any point in its lifecycle, which means that security assessment must be done throughout the lifecycle, including when it has reached its final destination in the form of a binary.

There is a plethora of different analysis tools available for identifying vulnerabilities, weaknesses, and malware within a binary (for examples, see [44, 13, 11, 36, 45]). There is also evidence to suggest that multiple methods of automated analysis are necessary to cover the many different types of security findings [45, 43, 28]. With the use of multiple tools comes the drawback of a massive amount of varied findings. Because these tools often report false positives or a large number of low-priority true positives, there can be too many findings to feasibly check them all manually. This motivates the scoring, management, and aggregation of security findings. One way to accomplish this is through the lens of software quality modeling. Quality modeling enables developers to track the quality of their project over time.
This assists in the regulation of code quality, allowing developers to set and maintain a quality goal [25]. The ISO 25010 software quality model defines how software quality may be broken up into characteristics and sub-characteristics. Though security is a part of the ISO 25010 software quality model, the definition is abstract and broad, and it does not include availability as a sub-characteristic, which can be a very important security aspect in some settings such as critical infrastructure. We will be expanding on the aspects of security quality presented in the ISO 25010 standard and utilizing a software quality modeling approach to assess security quality in binaries. The Platform for Investigative software Quality Understanding and Evaluation (PIQUE) [40] is used to build an operational quality model.

In this thesis we propose, operationalize, and validate a security quality model for binaries, PIQUE-Bin. First, background is given on security findings, quality modeling, vulnerability scoring, management, prioritization, and operational technology in chapter 2. Supporting quality modeling frameworks are described in chapter 3, and an informal review of similar models is presented. We define our research goals in chapter 4, and then explain the methods and reasoning surrounding the model design, development, and calibration in chapter 5. Finally, several case studies are performed to improve the model in chapter 6 and validate the model in chapter 7.

BACKGROUND

Binary Analysis

Automated binary analysis is done to identify vulnerable or malicious code within a binary. We first review malware analysis of binaries, then move on to vulnerability analysis.

Malware analysis is the analysis of software to identify malicious behavior. Malware analysis has been around for as long as malware itself, since the days when malware was classified into trojan horses, viruses, and worms. Early malware analysis was primarily focused on viruses. Cohen defines the term 'virus' and begins feasibility analysis with respect to virus detection in his 1987 work [12]. Cohen finds that a virus can be made for which the identification process is undecidable [12]. A study in 2003 then found that reliable identification of a mutating computer virus is NP-complete [47]. Clearly, virus detection, and therefore malware detection, is a difficult problem. This led to the need for malware analysis methods that would use either some proxy for virus detection or an approximation algorithm.

Malware analysis methods can be classified as static or dynamic [10], otherwise referred to as signature-based or behavior-based [5]. Many methods blur the lines between these two classifications, resulting in a hybrid method. Early forms of malware detection were focused on static methods such as pattern matching and string identification. One of the simplest of these methods is taking the hash of a binary to compare it with the hashes of known malware binaries. This method is used by popular malware intelligence databases such as VirusTotal (https://www.virustotal.com/) and is a good first step to identify malware. It is a fast way to determine whether a binary is a malware binary that has already been identified; however, it suffers from several faults. Any change to the malware will render this detection method null, as the hash would completely change, and it is only effective for known malware.
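As a minimal illustration of this hash-matching approach (a generic sketch, not code from VirusTotal or any particular scanner), the example below computes the SHA-256 digest of a binary and checks it against a local set of known-malware digests; the class name and the placeholder digest are assumptions made for this example.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Set;

public class HashLookup {
    // Placeholder digest for illustration; a real scanner would consult a
    // threat-intelligence feed containing millions of known-malware hashes.
    private static final Set<String> KNOWN_MALWARE_SHA256 = Set.of(
            "0000000000000000000000000000000000000000000000000000000000000000");

    /** Returns true only if the binary's digest exactly matches a known-malware digest. */
    public static boolean isKnownMalware(Path binary) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(binary));
        return KNOWN_MALWARE_SHA256.contains(HexFormat.of().formatHex(hash));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isKnownMalware(Path.of(args[0])));
    }
}
```

Because any single-byte change to the binary produces a completely different digest, this check fails against even trivially modified malware, which is exactly the fragility described above.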
Static methods tend to share these faults: they excel in identifying known malware, but tend to have difficulty identifying new or obfuscated malware. On the other hand, dynamic analysis of malware is able to identify new malware, but can suffer from lower accuracy and a higher false positive rate [10, 5], as well as greater complexity in the analysis due to the need to execute the binary [6]. When dealing with potentially malicious binaries, execution must be handled with care: this means sandboxing and isolation, as well as avoiding any anti-analysis capabilities malware might have. Both techniques have their strengths and weaknesses, so the use of multiple methods may yield the best results. There are many recent studies that propose methods for analyzing malware through novel feature extraction, new analysis methods, or new machine learning techniques (for example, [17, 37, 30, 38, 51]). Clearly, malware detection techniques are abundant and improving.

Similarly, vulnerability analysis is an evolving landscape in which new techniques are being actively developed and improved. The National Vulnerability Database (NVD, https://nvd.nist.gov/vuln) defines a vulnerability as "A weakness in the computational logic (e.g., code) found in software and hardware components that, when exploited, results in a negative impact to confidentiality, integrity, or availability". Weaknesses are more generic forms of security flaws in software, meaning that vulnerabilities are the result of an instance of a weakness occurring in software.

Automated vulnerability analysis, much like malware analysis, is often separated into static and dynamic analysis. Static analysis techniques analyze a binary without executing the file. This often takes the form of pattern identification, such as in the tool cwe_checker. This tool checks for patterns that often indicate the presence of elements of the Common Weakness Enumeration (CWE), an enumerated catalogue of software weaknesses. To check for integer overflow or wraparound (CWE-190) in a binary, cwe_checker performs the following check (https://github.com/fkie-cad/cwe_checker/blob/master/src/cwe_checker_lib/src/checkers/cwe_190.rs):

"For each call to a function from the CWE190 symbol list we check whether the basic block directly before the call contains a multiplication instruction. If one is found, the call gets flagged as a CWE hit, as there is no overflow check corresponding to the multiplication before the call. The default CWE190 symbol list contains the memory allocation functions *malloc*, *xmalloc*, *calloc* and *realloc*."

This type of analysis typically suffers from a high false positive rate but is fast and does not require as much overhead as dynamic analysis. These methods also excel in identifying known vulnerabilities in binaries, such as those incorporated through third-party libraries. Dynamic analysis involves executing a binary to look for behavior that is indicative of a vulnerability. This type of analysis offers much more semantic information about identified vulnerabilities and typically identifies fewer false positives. However, the requirement of executing the binary can often complicate the analysis process and generally takes longer than static analysis [6]. It has been found that while static and dynamic analysis techniques have their strengths and weaknesses, a mixture of techniques is required to find all kinds of vulnerabilities [6], [23]. It has also been found that static analysis is an effective method for initial analysis to identify vulnerabilities [43].
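To make the flavor of such pattern checks concrete, the sketch below re-implements the spirit of the quoted CWE-190 rule over a hypothetical disassembler representation. This is a simplification for illustration only, not cwe_checker's actual (Rust) implementation: the Instruction and BasicBlock types are assumptions, and the check is collapsed into a single basic block for brevity.

```java
import java.util.List;
import java.util.Set;

public class Cwe190Check {
    // Hypothetical stand-ins for a disassembler's intermediate representation.
    record Instruction(String mnemonic, String callTarget) {}
    record BasicBlock(List<Instruction> instructions) {}

    private static final Set<String> ALLOC_SYMBOLS = Set.of("malloc", "xmalloc", "calloc", "realloc");
    private static final Set<String> MULTIPLY_MNEMONICS = Set.of("mul", "imul");

    /** Flags a block when an allocation call is preceded by a multiplication,
     *  with no attempt to verify an overflow check in between. */
    static boolean flagsCwe190(BasicBlock block) {
        boolean sawMultiplication = false;
        for (Instruction insn : block.instructions()) {
            if (MULTIPLY_MNEMONICS.contains(insn.mnemonic())) {
                sawMultiplication = true;
            } else if ("call".equals(insn.mnemonic())
                    && insn.callTarget() != null
                    && ALLOC_SYMBOLS.contains(insn.callTarget())
                    && sawMultiplication) {
                return true; // possible unchecked size arithmetic feeding an allocation
            }
        }
        return false;
    }
}
```

Because any multiplication ahead of the allocation call is enough to raise a finding, benign size arithmetic is flagged as well, which is one source of the high false-positive rate noted above.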
This motivates the use of multiple analysis tools to identify a wider variety of vulnerabilities.

Quality Modeling

The model proposed in this thesis is built upon quality modeling, so we first discuss the benefits and reasoning behind the use of a quality model. Software quality modeling provides "a systematic approach for modeling quality requirements, analyzing and monitoring quality, and directing quality improvement measures. They thus allow ensuring quality early in the development process" [52]. Therefore, the modeling process enables tracking of quality for purposes of improvement or assessment. Typically, software quality modeling would be employed by an organization seeking to increase its own software quality, but it can also be used by an outside entity seeking to ensure a product is being delivered with a certain level of quality. This would be the most likely use case for PIQUE-Bin, where we are looking at applying a quality model at the latest point in the development lifecycle, without necessarily measuring the product in early stages.

At the lowest level, software is assessed through metrics and findings. Metrics are mathematical formulas backed by empirical evidence to describe the state of a product. The evaluation of a metric provides a measure. One example of a metric is coupling, which attempts to provide a numeric representation of the level of dependence between software modules using data parameters, control parameters, and the number of modules [48]. Findings represent a section of code exhibiting some phenomenon and are evaluated as a count of findings. This could be something that breaks a program or introduces a vulnerability, such as a buffer overflow. An example of a weakness finding, building from the prior section where we described the cwe_checker rule, is CWE-190. Findings can take many forms, but for this work findings will primarily take the form of weaknesses, vulnerabilities, and malicious indicators.

One prominent software quality model is the ISO 25010 standard [25]. This standard breaks quality into characteristics and sub-characteristics, giving a high-level overview of software quality without detailing how to specifically measure characteristics of quality or how they compose overall quality in an operationalized model. The ISO 25010 model is shown in Figure 2.1. At the highest level is software quality. Below quality are the characteristics and, below those, the sub-characteristics of quality.

Further definition of how characteristics should compose overall quality and how characteristics may be operationalized has led to the creation of quality meta-models such as Quamoco [54], which defines a hierarchical structure through which tool findings aggregate to score quality characteristics, which then contribute to the overall quality score. This meta-model was further refined in the Quality Assessment Tool Chain (QATCH) quality meta-model [46], which limits the model to four layers through which tool findings are aggregated. PIQUE, the platform we use to create our model, enables the creation of flexible tree-based models such as Quamoco- and QATCH-based models. Building our model using PIQUE enables us to easily adapt and improve the model when weaknesses of the model are identified [40]. To provide additional context and history, the Quamoco and QATCH quality modeling approaches are briefly summarized in chapter 3.
The PIQUE platform is detailed as well in chapter 3.

Figure 2.1: The ISO 25010 Quality Model, from [25]

Security Metrics and Vulnerability Management

As discussed in previous sections, multiple analysis types and tools are necessary to discover all types of vulnerabilities. This leads to the need to manage the findings from multiple tools. Quite often, these findings may be assigned a score to quickly and automatically give some level of priority. However, tools may differ in how they score findings as well as in what takes priority in their scores. Additionally, the scale of severity and the number of findings may vary greatly. This makes the comparison of different tools' output difficult and creates the need to manage these findings and compare findings amongst tools. We may accomplish this by providing a security metric, managing vulnerabilities, or prioritizing findings.

One difficulty of validating security metrics is that there is no way to know the true overall security quality of a binary, creating an oracle problem for any models seeking to evaluate some security metric [7, 41]. The oracle problem occurs in scenarios in which there is no known output to test output against, thus making validation difficult. Additionally, this means that most security metrics are subjective in some sense.

The Common Weakness Enumeration (CWE, https://cwe.mitre.org/) and Common Vulnerabilities and Exposures (CVE, https://cve.mitre.org/) are two catalogs that capture information pertinent to weaknesses and vulnerabilities. The CVE catalog is an enumeration of specific vulnerabilities in certain products and platforms, while the CWE catalog is an enumeration of general software and hardware weaknesses. This means that individual CVEs are realizations of a CWE. For example, CVE-2021-3031 (https://nvd.nist.gov/vuln/detail/CVE-2021-3031) is an instance of CWE-200 (https://cwe.mitre.org/data/definitions/200.html). CVEs may be scored using the Common Vulnerability Scoring System (CVSS, https://nvd.nist.gov/vuln-metrics/cvss), which gives a numeric score that represents the threat that a particular CVE poses.

Similar to the CVSS, which scores CVEs, the Common Weakness Scoring System (CWSS) is a method to score CWE instances. The main problem with the CWSS for automated scoring is that the information required to use it typically must be found manually. While CVSS also requires this information, the work has been done for many CVEs already and so may be leveraged automatically. Theoretically, there are many ways to score a vulnerability as the CVSS does, but CVSS in particular has been found to be consistent with expert opinion [21] and credible through Bayesian analysis [27]. These findings indicate it may be a good scoring system to consider when building a security quality model. The CVSS gives a score for a single vulnerability, but does not provide insight into how multiple CVEs in a system impact the overall security.

Similarly, there are many methods to manage or prioritize vulnerabilities. Methods such as the ones found in [4, 18, 24, 15] shed light on how vulnerabilities may be effectively prioritized, but do not give an overall idea of how serious the set of security findings within a given binary may be. Security metrics based solely on CVSS scores have not been found to accurately reflect the security of a system in terms of time-to-compromise [22], though the reason for this is not clear.
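For context on what a single CVSS score summarizes, the equations below sketch the CVSS v3.1 base score for the Scope Unchanged case, as published with the specification; AV, AC, PR, and UI are the exploitability metric values, C, I, and A are the impact metric values, and Roundup denotes CVSS's round-up-to-one-decimal function.

$$\begin{aligned}
ISS &= 1 - (1 - C)(1 - I)(1 - A)\\
Impact &= 6.42 \times ISS\\
Exploitability &= 8.22 \times AV \times AC \times PR \times UI\\
BaseScore &= \begin{cases} 0 & Impact \le 0\\ \mathrm{Roundup}\bigl(\min(Impact + Exploitability,\ 10)\bigr) & \text{otherwise} \end{cases}
\end{aligned}$$

Even with this level of structure behind each individual score, the equations say nothing about how several CVEs combine within one system, which is the limitation discussed in this section.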
Additionally, simplistic models such as the weakest link appear to be less promising than more complex models that take into account multiple vulnerabilities [22]. These results should be taken with a grain of salt, as the study that makes these statements had a small sample size.

Surveys have been conducted on security indicators [41] and network-based security models [39]. Mellado et al. [35] compare software design security metrics. Rudolph and Schwarz [41] define the following terms:

• Security Indicator. A security indicator is any observable characteristic that correlates (or is assumed to correlate) with a desired security property. The set of feasible indicator values is assumed to form (at least) a nominal scale.

• Security Measure. A security measure assigns to an object a security indicator value from an ordinal scale according to a well-defined measurement protocol.

• Security Metric. A security metric is a security measure with an associated set of rules for the interpretation of the measured data values.

Let us consider an example for each of these definitions to provide some clarity. A security indicator could be something such as a potentially dangerous system call. A security measure based on this indicator is the count of these indicators for some specific software. A metric, then, would be the comparison of this number of dangerous system calls against other software systems, allowing a relative score to be given. Interpretation of this metric is as simple as understanding that it is relative to other software: a score of 0.5 means that the severity of dangerous system calls in this binary is very similar to other software, whereas a score greater than 0.5 is better (fewer dangerous system calls).

Under these definitions, PIQUE-Bin will be considered a security metric, as it may be considered a set of rules for interpreting the measures which come from the indicators provided by the tools used in our model. Indicators are synonymous with diagnostics in the model, and security measures correspond to measures in the model. It is important to note that we should consider security metrics rather than indicators or measures when looking to holistically assess security in a binary. While indicators and measures provide valuable information pertaining to security, a holistic security perspective must consider multiple analysis types and synthesize the information available, and therefore must be composed of multiple measures and have guidelines for interpretation. Additionally, we should compare our model to other metrics rather than to measures or indicators.

Rudolph and Schwarz [41] conducted a survey of security indicators. Security indicators, as previously defined, are observable characteristics that are assumed to correlate with security. These are raw data rather than metrics to be acted upon. They conclude that indicators (as of 2008) are lacking in several ways. They list several requirements for reliable and meaningful security indicators, measurements, and metrics: each must have a well-defined goal, provide a rationale, be based on objective parameters, be reproducible, be quantitative (with a baseline and goal), provide guidance for interpretation, be constructive, and be applicable to large/complex targets. They also state that a more balanced coverage of the lifecycle is desirable. This includes coverage of early-stage designs and of final products, which for software would be a working project in the form of an application or binary.
Ramos et al. [39] perform a survey of model-based quantitative security metrics at the network level. The authors define a method of classification for security models, as seen in Figure 2.2. This classification method will be used in chapter 3 to categorize models for review. While the authors fail to define their search strategy for identifying these model-based metrics, they present many model-based security metrics at the network level. The authors conclude that the state of model-based network security metrics is still in development and needs much more progress. No current approaches are totally satisfactory, but the value of the models is in assisting the decision-making process rather than perfectly assessing security.

One such model presented in [39] is VEA-bility [50], which uses multiple input types to assess network security in a holistic fashion. The authors consider vulnerability, exploitability, and attackability, with inputs in the form of network topology, attack trees, and CVSS scores of vulnerabilities. The model is successfully able to compare the security of network topologies. This model is similar to the type of metric we are looking for in the literature review of chapter 3; however, VEA-bility aims to assess network topology rather than binary files.

The Operational Technology Environment

Operational Technology (OT) describes computing environments whose purpose is to monitor and control a physical process, typically industrial processes such as those found in manufacturing plants.

Figure 2.2: Security Metric Classification Method as presented in [39]

The OT environment differs from traditional Information Technology (IT) environments in several important ways when considering security. The primary difference between OT and IT environments is their purpose. While the IT environment is focused on information, OT environments are focused on maintaining physical processes. This difference alone puts much more emphasis on securing OT environments, because security compromises can have potentially life-threatening physical consequences. Industrial Control Systems (ICS) are an OT environment of particular focus in recent years. ICSs include Supervisory Control and Data Acquisition (SCADA) systems, Distributed Control Systems (DCS), and Programmable Logic Controllers (PLC) [49]. McLaughlin et al. [34] enumerate the primary differences between IT and OT systems:

• OT systems maintain the integrity of an industrial process
• OT systems require high availability due to their continuous nature
• OT systems often focus heavily on physical processes
• OT systems have limited resources for security
• OT systems require timely response to human reactions and/or physical sensors
• OT systems use proprietary communication protocols to communicate with field devices
• OT systems rarely replace components
• OT systems are composed of distributed and isolated components

In addition to this, the authors paint a picture of the modern ICS cybersecurity landscape in [34]. In recent years, there has been a trend of connecting OT networks to the internet [34]. This enables remote access and control of those systems, which may allow for greater process efficiency and faster response times. However, this has the unintended consequence of exposing these systems to a huge new attack surface. These systems are often legacy systems that cannot be upgraded [49, 34], meaning that they have many known exploitable vulnerabilities. Attacks on ICSs have been increasing in frequency and severity in recent years.
In 2015, a SANS Institute survey found a 7% increase in the number of ICSs that had been attacked, and a 5% increase in ICSs that were potentially attacked [33]. In part, this increase in attacks is likely attributable to a rise in nation-state sponsored or organized cyber attacks/cyber warfare, which rose to represent over 50% of malicious attacks from outsiders in 2019, up from 0.0% in 2017, according to another SANS Institute survey [16]. Although it seems unlikely that none of the cyber attacks in 2017 were attributable to organized crime or nation-states, it is possible that the attacks simply could not be traced to a specific source. Additionally, because these are self-reported sources of attacks, they are to some extent subjective. Another troubling trend in ICS security is that, while in 2017 25% of survey respondents were unable to answer these questions, 43% were unable to answer in 2019. This indicates that there may be an even higher number of undisclosed attacks that are not considered in these numbers.

The National Institute of Standards and Technology (NIST) 800-82 publication [49] gives a comprehensive guide to security for ICS. This publication provides methods for securing ICSs while addressing their unique requirements in comparison to IT systems. NIST 800-82 also provides an overview of common system topologies, typical threats and vulnerabilities, and effective countermeasures to mitigate these risks. Several common topologies covered in the 800-82 may be seen in Figure 2.3.

Figure 2.3: Basic SCADA Communication Topologies as presented in [49]

The Purdue model was developed in the early 1990s and has evolved into a method of identifying best practices for communications of components in ICS and IT networks [55, 32]. The model separates ICS and IT systems into six levels of components:

• level 5: enterprise networks
• level 4: business networks
• level 3: site-wide supervisory
• level 2: local supervisors
• level 1: local controllers
• level 0: field devices

Levels four and five are often combined into "IT networks", and levels one and zero are often combined in modern systems, where field devices are often sophisticated enough to handle local automation of a process [32]. Between levels three and four we see a boundary (demilitarized zone) between IT and OT systems. This model helps clearly define the boundaries and components of systems, allowing air gaps to properly segregate systems.

In a similar vein to the NIST 800-82, Li et al. [31] give additional information, highlighting microgrids as an instance of a DCS. They cover the specific consequences of vulnerabilities being exploited and common cyber vulnerabilities in microgrids. They separate the most common vulnerabilities into those that occur in application software, through communication networks, and on field devices. Primarily, Li et al. highlight the importance of proactivity to mitigate cyber risk and to ensure resiliency and reliability in critical systems.

As an example of the consequences of a successful OT attack, we describe Stuxnet. Stuxnet may be one of the earliest, most successful, and most widely known attacks on OT systems [29]. Stuxnet targeted Iranian uranium facilities to systematically destroy the facilities while feeding false information to the monitoring systems. This worm was able to precisely target the specific Siemens controller in use by the Iranian facility to ensure that the only target that would be destroyed would be the "correct" one.
Stuxnet did this by exploiting the vendor's driver DLL file in use by the SCADA software and the programming software. When it found the correct target, it dropped malicious code onto the controller alongside the legitimate code running the controller. When the time came, the malicious code executed and destroyed the target systems. In the end, one of the actual vulnerabilities that Stuxnet exploited is technically a feature of the controllers: the lack of code signing. This cannot be patched aside from changing the controllers in use by the system. The way to mitigate this attack is to avoid the compromised DLL that enabled the malicious code to be dropped onto the controller [29].

SUPPORTING WORK

PIQUE-Bin is built upon the PIQUE framework [40]. PIQUE is a platform that builds upon two earlier quality modeling approaches, Quamoco and QATCH. In this chapter, we briefly summarize Quamoco, QATCH, and PIQUE as the predecessors that enable the building of PIQUE-Bin. We introduce each model's terminology as its authors define and use it, but for other sections of this thesis it is safe to assume a term carries its PIQUE definition, as PIQUE is the latest and most relevant of the models for PIQUE-Bin.

Quamoco

The Quamoco Project aimed to develop and validate operationalized quality models for software, as well as to "provide the missing connections between generic descriptions of software quality characteristics and specific software analysis and measurement approaches" [52]. Quamoco also gives a meta-model which defines valid Quamoco quality models [54].

Quamoco Model Terms

Quamoco defines a model as depicted in Figure 3.1.

Factors are high-level terms which define some property of an entity. Factors are not directly measurable, but can be approximated through the aggregation of measures.

Quality Aspects are the highest level of factor in a quality meta-model, and directly compose quality. In the ISO 25010 model, these correspond to high-level characteristics such as maintainability.

Product Factors are factors one level below quality aspects, which they compose. In the ISO 25010 model, these would be sub-characteristics of quality such as modularity or reusability.

Measures are concretely defined concepts that are quantifiable by values directly obtained from instruments.

Instruments are tools which obtain values or findings from the software product.

Figure 3.1: The Quamoco Quality Modeling Approach [52]

Main Concepts

The Quamoco quality assessment approach aims to bridge the gap between the ISO 25010 standard, which gives a high-level overview of quality, and metrics output by tools. This is done by defining an 'impact' relationship and a quality meta-model. The impact relationship allows quantitative values from instruments to aggregate through the measures and factors which are impacted by these values. The quality meta-model defines a structure which a Quamoco quality model should adhere to. Defining and adhering to this meta-model enables a common framework of understanding and confidence in these models. Additionally, Quamoco emphasizes the modularity of its quality assessment approach. This enables the high-level factors to remain the same, regardless of how lower-level instruments gather their values. As such, one may use the same model structure for analysis of multiple languages, as the only necessary changes are the instruments themselves.
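As a minimal sketch of the hierarchical aggregation idea shared by Quamoco and its successors (illustrative only, not code from any of these projects), the example below evaluates an inner factor as the weighted sum of its children, so that normalized measure values propagate up to a single quality score; all node names, values, and weights are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QualityNode {
    private final String name;
    private final double leafValue; // normalized measure value in [0, 1]; used only for leaves
    private final Map<QualityNode, Double> children = new LinkedHashMap<>();

    public QualityNode(String name, double leafValue) {
        this.name = name;
        this.leafValue = leafValue;
    }

    public QualityNode(String name) {
        this(name, 0.0);
    }

    public QualityNode addChild(QualityNode child, double weight) {
        children.put(child, weight);
        return this;
    }

    /** A leaf returns its measure value; an inner factor returns the weighted sum of its children. */
    public double evaluate() {
        if (children.isEmpty()) {
            return leafValue;
        }
        return children.entrySet().stream()
                .mapToDouble(e -> e.getValue() * e.getKey().evaluate())
                .sum();
    }

    public static void main(String[] args) {
        QualityNode measureA = new QualityNode("Measure A", 0.8);
        QualityNode measureB = new QualityNode("Measure B", 0.6);
        QualityNode maintainability = new QualityNode("Maintainability")
                .addChild(measureA, 0.5)
                .addChild(measureB, 0.5);
        QualityNode quality = new QualityNode("Quality").addChild(maintainability, 1.0);
        System.out.println(quality.name + " = " + quality.evaluate()); // Quality = 0.7
    }
}
```

The mechanisms Quamoco layers on top of this basic aggregation are described next.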
Mechanisms

Beyond the definition of a tree structure and the aggregation of findings, several additional mechanisms are involved in using an operationalized Quamoco model to assess quality. These mechanisms are normalization of tool output, the use of a utility function, and factor edge weighting.

Normalization occurs in the measure nodes in Quamoco models. A normalization measure, such as lines of code, is identified, and the values of other measures are then divided by this normalization value.

Utility Functions are applied in the measure nodes in Quamoco models. A benchmark data set is used to derive the utility function for each measure, which is then applied when evaluating measures.

Factor Edge Weighting is applied to all product factors and quality aspects as they are aggregated into the next layer of the model. This is done using their relative importance, calculated from human-gathered importance orderings using the rank-order centroid method [8].

Assessment using a Quamoco quality model involves several steps. The measurement step involves running tools against the product under assessment. Then, evaluation occurs, in which measure nodes take on a value according to the tool findings and the previously defined utility function. Aggregation of all factors is the next step, in which factors take on values according to their incoming weighted edges. Finally, interpretation must occur. Quamoco suggests mapping values to a grading scale from A to F [52].

Results

Overall, the Quamoco project contributes greatly to taking quality modeling from an abstract concept, as in the ISO standards, to an operational model. By bridging this gap, the project enables quality modeling and quality modeling research to move forward with the idea of using a model to extend the definitions provided by the standard. To ensure validity of the Quamoco approach, researchers compared how their model ranked five projects against expert judgements of those projects. They found a high and statistically significant correlation between these rankings, indicating that a Quamoco quality model is able to judge quality in a way that is consistent with expert opinion.

Although the Quamoco model is able to assess software effectively, the authors of Quamoco state that in its final state, the operational Quamoco quality model became very large and complicated. It became slow and unresponsive. The authors indicate that they are unsure of the extent to which it may be utilized in industry due to these problems. This motivates the development of QATCH, an extension and simplification of Quamoco quality models.

QATCH

QATCH, the Quality Assessment Tool Chain [46], is an extension of work done in the Quamoco project. QATCH focuses on automating the derivation of quality models that are sensitive and responsive to the subjectivity of stakeholders. Additionally, QATCH approaches modeling with simplicity and transparency as a greater focus, likely due to the results reported in the Quamoco project. QATCH defines its own model structure, terms, and approach; however, we focus primarily on the differences between Quamoco and QATCH, as they largely follow the same approach.

QATCH Model Terms

QATCH defines a quality model in several layers. An example of a generic QATCH model instance is given in Figure 3.2.

Figure 3.2: A Generic QATCH Model Instance [46]

Characteristics are defined as the abstract components of a system. This closely matches the quality aspects in Quamoco models.
Properties are attributes of an object that can be directly measured. This relates to the product factors in Quamoco models.

Measures are concrete, quantified representations of tool results. This definition is very similar to a Quamoco measure.

Unique Mechanisms

QATCH models aim to restrict the complexity of quality models to enable simpler models that are more easily understood. To achieve this aim, QATCH models are restricted by the following rules:

• The model must not be derived from black box methods, such as machine learning techniques.
• The model must have only three layers.
• A property node is quantified by a single measure only.

These design decisions are motivated by issues brought on by the complexity of the Quamoco quality model used in [52], as well as by [19], which argues that comprehensibility and ease of extension outweigh the loss of granularity brought on by complex models such as the model in [52].

Quality is, and always will be, subjective to some extent. One prominent concern of quality modeling, then, is the time required to tune a quality model to a stakeholder's subjectivity. QATCH approaches this by allowing stakeholders to express judgements of quality concepts using intuitive terms that do not require technical knowledge of quality modeling. These judgements are then integrated into the model derivation process using the Analytical Hierarchy Process (AHP). The AHP provides the ability for stakeholders to input their priorities and values to a quality model. The AHP is a method typically used to help decision makers in scenarios where there are several objectives [42]. AHP allows stakeholders to make pairwise comparisons between criteria to derive an order of importance for a decision. In the context of quality modeling, our overall goal is quality, and our criteria are quality aspects. In this way, stakeholders are able to decide what quality aspects are important for overall quality in the context of a quality model, without needing extensive knowledge of quality modeling or model structure. The pairwise comparisons necessary for AHP may be expressed linguistically, meaning that the comparisons may be very intuitive for non-technical stakeholders to express their values and judgements. QATCH also expands upon the use of AHP, presenting a fuzzy AHP that additionally allows stakeholders to express uncertainty about specific comparisons.

Results

The QATCH model underwent the same experiment as the Quamoco validity experiment in [52]. The QATCH model reports perfect correlation with expert opinion, indicating that QATCH is able to perform comparably to (possibly even better than) Quamoco in how it judges quality compared to expert opinion. Additionally, this indicates that the loss of complexity in the model did not result in a loss of correlation with expert opinion.

PIQUE

The Platform for Investigative software Quality Understanding and Evaluation (PIQUE) [40] was created to provide an environment for quality assessment research. The platform focuses on reducing the resources required to build a hierarchical quality model, from conceptualizing to operationalizing a quality model. PIQUE was built with nine design goals:

1. Benchmarking, utility functions, and adaptive edge weighting
2. Default model mechanisms
3. Extension or modification of model mechanisms
4. Models are easy to derive
5. Derived models are easy to operationalize
6. Adding, removing, or modifying tool support is simple
7. Input and output is easy to interact with
8. Facilitate automation and continuous integration
9. Facilitate trustworthy models

Altogether, PIQUE is a platform that enables small-scale teams to build quality models without the resources required for models such as the Quamoco model in [52]. This enables models to be created by single developers, as is the case with the model in this thesis. The models built by PIQUE are flexible, meaning they may adhere to the Quamoco or QATCH meta-models.

PIQUE Model Terms

Most of the definitions for PIQUE follow closely from Quamoco definitions.

Factors are abstract terms for high-level nodes at the top layers of the quality model. Their definition follows from Quamoco. Factors are not directly measurable, but can be approximated through the aggregation of measures. Factors include the Total Quality Index (TQI), quality aspects, and product factors.

TQI is the root node of a PIQUE model. It is a factor, and the value of the TQI can be thought of as the output of the model. The TQI value is the result of the final aggregation of values in the model. Rice [40] states that "in the context of ISO/IEC 25010 [25], TQI represents Software Product Quality, but it can represent whatever concept the researcher desires, such as Security."

Quality Aspects (QA) are the highest level of factor in a quality meta-model, and directly compose the TQI. In the ISO 25010 model, these correspond to high-level characteristics such as maintainability.

Product Factors (PF) are factors that are one level below quality aspects. In the ISO 25010 model, these would be sub-characteristics of quality such as modularity or reusability. Product factors are decomposed into directly measurable concepts represented by measures.

Measures are concretely defined concepts that are quantifiable by values directly obtained from instruments. Measures may be 'negative' or 'positive', where a negative measure is one in which findings negatively impact the TQI.

Diagnostics are a new term introduced by PIQUE. A diagnostic is "a representation of the parts needed for a measure to evaluate" [40]. A diagnostic is evaluated directly from tool output and is tool specific. Including diagnostics allows quality model designers to configure a measure in terms of its parts. Diagnostics also add the ability to configure evaluation methods at another level, adding to the flexibility that is part of the design of PIQUE. Whereas traditional quality modeling only allows the measure to be some function of a tool output, this enables multiple finding types to be considered at the measure level.

Findings are the data object representation of a "hit" from a tool. A finding is instantiated when its associated tool is run on the system under evaluation and identifies some portion of the system which fits a rule defined by the tool.

Exemplary Model

An example of a software quality model built using PIQUE and evaluated on a project is shown in Figure 3.3.

Figure 3.3: An exemplary PIQUE model, from [40]

Below the TQI are the quality aspects, factors that define the characteristics that compose overall quality. Below the quality aspects are the product factors, which are factors that decompose directly into measurable concepts. In the ISO 25010 standard, these are concepts such as capacity or reusability. Below the product factor layer in PIQUE, the ISO 25010 standard does not give any guidance. Measures are concrete definitions that compose product factors.
Measures contain a method for normalizing diagnostic values as well as a utility function, derived from a benchmark repository, which is used when evaluating. Figure 3.4 shows the structure of a measure node in PIQUE. Figure 3.5 shows the sequence diagram of a call to evaluate a measure node. First, a call is made to evaluate the measure. This leads to the Evaluator's Evaluate() function being called. The Evaluator first finds the node value through a specified method; by default, this is the average value of the children nodes. This value is then normalized and used as input for the utility function. The output of the utility function is then returned as the value of the measure node.

Figure 3.4: The UML Class Diagram Of A Measure Node

Figure 3.5: The UML Sequence Diagram Of Evaluating A Measure Node

PIQUE Mechanisms

Utility Functions are utilized by PIQUE models at the measures layer. Utility functions are created through a benchmarking process that begins with the creation of a repository of projects similar to the one to which the model will be applied. The tools of the model are then applied to all the projects in the repository to identify statistics such as the mean and standard deviation for each measure. This provides an estimate of the number of findings that can be expected in the average project, allowing the creation of utility functions. The utility function for a measure takes the measurement of some diagnostic for a specific software project and outputs a value based on how it compares to other software projects in the benchmark repository. In the default case of PIQUE, the utility function is linear interpolation between the minimum and maximum value found within the benchmark repository for each measure.

Utility functions also account for differences in the scale of findings between tools. That is, if one tool typically reports thousands of findings while another reports tens of findings, the output of the utility function for both measures will still be between 0 and 1. The use of utility functions implies that a final quality score will be relative to the benchmark projects. The composition of the benchmark repository has a large impact on the output of the quality model, so the selection of systems for the repository is a potential threat to validity of the model's output. It also greatly impacts how the output should be interpreted: a 0.5 might be good or bad depending upon the quality of the benchmark repository projects.

As well as the makeup of the benchmark repository, stakeholders must consider the weighting of edges when interpreting the results of applying a PIQUE model. AHP is used to configure a model for stakeholders' needs and priorities [42]. This is also done in QATCH models. AHP allows stakeholders to identify how important quality aspects and product factors are in relation to each other, which dictates how they will aggregate to the overall score. This process dictates the weights of edges in the higher levels of PIQUE models. The AHP enables stakeholders to input their priorities and values to a quality model through pairwise comparisons between criteria to derive an order of importance for a decision. In this way, stakeholders are able to decide what quality aspects are important for overall quality in the context of a quality model, without needing extensive knowledge of quality modeling or model structure. This technique allows non-technical stakeholders to impart their values on a model.
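To illustrate how such pairwise comparisons can become edge weights, the sketch below derives priority weights from a comparison matrix using the row geometric mean, one common approximation of AHP priorities; the comparison values are assumptions for this example, and PIQUE's own AHP implementation may differ in detail.

```java
public class AhpWeights {
    /** Derives normalized priority weights from a pairwise comparison matrix,
     *  where entry [i][j] states how much more important criterion i is than j. */
    static double[] weights(double[][] comparisons) {
        int n = comparisons.length;
        double[] w = new double[n];
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            double product = 1.0;
            for (int j = 0; j < n; j++) {
                product *= comparisons[i][j];
            }
            w[i] = Math.pow(product, 1.0 / n); // row geometric mean
            total += w[i];
        }
        for (int i = 0; i < n; i++) {
            w[i] /= total; // normalize so the weights sum to 1
        }
        return w;
    }

    public static void main(String[] args) {
        // Illustrative comparisons for three quality aspects,
        // e.g. Confidentiality vs. Integrity vs. Availability.
        double[][] comparisons = {
                {1.0, 3.0, 5.0},
                {1.0 / 3.0, 1.0, 2.0},
                {1.0 / 5.0, 1.0 / 2.0, 1.0}
        };
        double[] w = weights(comparisons);
        System.out.printf("%.2f %.2f %.2f%n", w[0], w[1], w[2]); // roughly 0.65 0.23 0.12
    }
}
```

A stakeholder who judges the first quality aspect moderately more important than the second and strongly more important than the third would, under this scheme, see that aspect dominate the aggregation.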
While utility functions, benchmark composition, and normalization are important concepts in PIQUE, these concepts are much more oriented towards developers. The AHP is a feature that is aimed primarily towards stakeholders. The combination of these features results in a quality model that is guided by the subjectivity of the developers who created the benchmark repository, utility functions, and normalization methods, and of the stakeholders who determined the weighting of nodes in the aggregation of the model.

Default PIQUE Model Behavior

There are several default behaviors that PIQUE models use when not configured otherwise. For the most part, these are standard practices, such as the weighted average aggregation of quality aspect nodes to produce the TQI score. We define the default behaviors at each level to be clear, briefly noting where PIQUE-Bin deviates from these defaults; in chapter 5 we discuss the reasoning for the deviations.

We start at the lowest level of the model, at the level of findings. A finding is specific to a diagnostic. Findings have a severity of 1, unless specified otherwise. Diagnostics are evaluated as the sums of the severities of their children findings. Measures are evaluated using a utility function with the input being the sum of their children's values. By default, PIQUE implements the utility function using linear interpolation between two values, bounded to [0, 1], as characterized by the piece-wise Equation 3.1:

$$
f(x) = \begin{cases} 0 & x < a \\ \dfrac{x-a}{b-a} & a \le x \le b \\ 1 & x > b \end{cases} \qquad (3.1)
$$

where a, b are referred to as thresholds and are evaluated as the minimum and maximum of the sums of children in the benchmark repository. Negative measures are evaluated as $1 - f(x)$.

As an example, let us assume we have some measure M. Assume we evaluate M on five benchmark projects and receive a set of values S = [1, 2, 3, 3, 4]. Then we set a = min(S) = 1 and b = max(S) = 4. If we evaluate M on the binary under assessment and find a value of 2, we evaluate this as $f(x) = \frac{x-a}{b-a} = \frac{2-1}{4-1} = \frac{1}{3}$, which is the value the measure M takes in our assessment. In PIQUE-Bin, we re-define the utility function and remove its piece-wise component.

All factors (TQI, QAs, PFs) are evaluated as the weighted sum of their children. For PFs, this is the average of their children, as the structure defines what children feed into PFs. However, the layers between PFs and QAs, and between QAs and the TQI, are fully connected. These layers are weighted through the AHP as defined above. In PIQUE-Bin, we use AHP to weight the QAs to the TQI and rely on manual assignment of weights between the PFs and QAs.

Literature Review of Model-Based Binary Security Metrics

Search Strategy

To review the current state of binary assessment, we performed a systematic review of literature with regard to binary security assessment metrics. We first defined our research questions, search strategy, acceptance criteria, quality control strategy, and data extraction strategy. We define what type of security metric we are searching for through the classification method presented in [39]. This classification scheme is shown in Figure 2.2. We look for security metrics that target binaries (software). We do not restrict the object of the metric, so it may be a metric of compliance, economics, or effectiveness. We look for model-based construction of the metric, as that will be similar to our proposed method. A model-based construction will be more holistic, as it will not rely on a single type of measurement for a metric.
We look for automatic or semi-automatic automation level because we seek to use these metrics for easily and effectively assessing binaries without the need for an expert to do analysis. We do not restrict the measurement consistency, so they may be subjective or objective. We do look for any measurement type, quantitative or qualitative, and any measurement moment, static or dynamic. It is important to note that in this review we look specifically for holistic security metrics and not tools that identify specific weaknesses or vulnerabilities. Papers such as [14] and [13] 32 present methods for finding weaknesses in binaries, but do not attempt to holistically assess a binary based on the vulnerabilities found. In this review, we are searching for methods of assessment for binaries that take into consideration multiple weaknesses or vulnerabilities as is necessary when assessing security holistically. With this literature review we look to answer the following question: How well do current metrics assess the security of binaries? This answer will give insight to the gaps present in the binary security assessment area. In chapter 4 we show how this research question fits into our overall research goal. We examine how models are validated, how they perform, how comprehensive they are, and flexible the model is. While model validation is obviously important, it is also important to have a model that is flexible to allow for new security threats to be incorporated, and comprehensive such that all types of security threats can be accounted for. Our search strategy was to use IEEE Xplore to search for papers from journals and conferences within the past 10 years (2011-2021). Our search terms are papers that have (“Binary” OR “Executable”) AND “Security” AND (“Model” OR “Metric”) within the abstract of the paper. We also examine the references of any papers we identify for further investigation. The goal is to identify any papers that could be relevant while limiting the search to more recent papers which may reference any older relevant metrics. One consideration is that some metrics may be missed if they are focused on a type of domain specific binary and the abstract does not specify that the domain specific entity is a binary or executable file. To accept a paper, we ensure that the paper is focused on a method for analyzing a binary file specifically. The method in the paper must not be for generating security findings given a binary, but rather to take security findings and utilize those for a security metric. Initially, we read the papers’ titles to eliminate papers that are clearly not producing some security metric. After the initial pass of reading titles to reduce the number of papers, 33 we read through the remaining papers’ abstracts to determine if the paper fits the criteria and should be accepted. If the specific target of the metric cannot be determined from the abstract, the paper is read until it can be determined if the metric is targeting binaries. Additionally, we will search through the citations of all papers that are accepted to identify any additional papers that should be included. Results After performing the search as defined above, over 400 papers met the search terms on IEEE Xplore, but we have not identified any papers which fit the criteria. 
We intentionally used search terms that would hopefully catch any papers that would be relevant at the cost of an abundance of papers to sort through, but the lack of papers that fit the criteria using these search terms indicates that we have identified a gap in current research on model-based security metrics for binaries. Threats to Validity The results of the literature review indicate a gap in research. However, there are threats to this review that must be acknowledged. First and foremost, the search criteria could have missed relevant papers. It is possible that by using some domain specific terminology, a paper could be left out of this criteria. This could be analysis of firmware update patches that take the form of a binary, or specific programs that are analyzed in binary form but not referred to as such. We attempted to use broad search terms to avoid missing any relevant papers to mitigate this threat, but we cannot entirely eliminate this threat to validity. Additionally, any papers that would not be found by IEEE Xplore are left out of this review. This means that we miss any results that would be found in non-IEEE conferences or journals. The fact that no results were found, however, would still suggest that this topic 34 is largely unexplored. Additionally, no papers that would fit this criteria were identified during preliminary research which included many other conferences and journals. Another threat to validity is the filtering of papers from those found by the search. A paper could have been dismissed when it should not have been. To mitigate this threat, any time a paper was in question of being dismissed or accepted, we would read the paper carefully until it was clear what action should be taken. Despite these threats to validity, the conclusion remains largely the same. Even if several papers were improperly removed from the review or not found by the search, there is still a lack of model based security metrics for binaries. If any methods had been found there would be potential to improve these models; however, with no papers being identified we are confident that this subject is under-researched. 35 RESEARCH GOALS Our research goal is aligned with improving the gaps identified through research. While there are many tools for conducting analysis, there are far fewer methods of interpreting the results of these tools together. We notice a distinct lack of metrics or models that assess security at the binary level, in a holistic fashion, and according to stakeholder interests. To achieve our goal, we developed a security metric based on software quality modeling. Motivation In chapter 2 we described to increasing attack surface and number of attacks occurring in OT environments. Due to this increase in cyber attacks, we need to increase our ability to harden our OT environments to ensure these attacks are mitigated or avoided. While many methods exist to harden these systems, attacks such as SUNBURST1 and Stuxnet [29] have been able to circumvent these defenses. In order to better the security posture in critical infrastructure systems, we must develop additional layers of security that will identify and prevent attacks like these. One currently under-developed area in which both IT and OT systems could be improved is security analysis at the binary level of software, which is supported by the literature review presented earlier. 
Aside from anti-virus (which may or may not be present in OT environments), little is done to ensure that placing and running a binary on a system will not cause additional exploitable vulnerabilities. Therefore, our goal is to improve our ability to assess the security quality of a binary from a stakeholder’s perspective. As discussed in chapter 3 section 3, we have observed a distinct lack of model-based security metrics targeting the binary level of software. Therefore, to accomplish our goal we develop and validate PIQUE-Bin for analysis of the security quality of a binary file. 1https://www.solarwinds.com/sa-overview/securityadvisory 36 We follow Basili’s Goal-Question-Metric approach to our research [9]. Goal Question Metric Following [9], we aim to achieve our goal by answering a set of questions. Furthermore, we aim to answer the questions through specific metrics. These metrics will answer our questions, which in turn will help us accomplish our goal. The specific goals, questions, and metrics of this research are presented below. Goal: Improve our ability to assess the security quality of a binary from a stakeholder’s perspective • Q1 How well do current metrics assess the security of binaries? • Q2 What are the attributes of security in binaries that are inadequately measured by current metrics? • Q3 To what extent does utilizing a security quality model improve the ability to identify vulnerable binaries? • Q4 How can we adapt current metrics for security quality to binaries in the ICS environment? • M1 Systematic review of model-based security metrics at the binary level (answers Q1,Q2) • M2 How security aspects are measured by current methods identified (answers Q2) • M3 Iterative case studies (answers Q3) • M4 Experiment to assess model (answers Q3) • M5 Proposal of ICS-specific model (answers Q4) 37 PIQUE-BIN DEVELOPMENT To achieve the research goal of improving our ability to assess the security quality of a binary from a stakeholder’s perspective, we must either improve upon existing methods or create a method which is better than existing methods. Because no existing methods for assessing security quality holistically in binaries were found in section 3, we must create our own. To this end, we utilize PIQUE to create a security-focused quality model for binaries. In this chapter, we present the methods and reasoning used to develop the PIQUE-Bin model via the PIQUE framework. We present this in the phases of model operationalization: gather requirements, design, and develop. Gather Requirements As part of the process of creating PIQUE-Bin, two objectives must be achieved. First, a collection of relevant binaries for benchmarking must be gathered. These binaries should be similar in function and size to the binary under analysis. We gather these binaries from the Ubuntu 18.04 ‘/bin’ file, Andrew-d’s static-binaries repository1, Mosajjal’s binary-tools repository2, and Kali Linux’s ‘/bin’ file. We chose this to be our benchmark repository due to ease of gathering and similarity of function to the binary under analysis. This results in approximately 700 binaries gathered in total. We then limit the benchmark repository to files above 50 KB and below 10 MB. We limited binaries based on size to remove trivial binaries that will have very few weaknesses and to ensure the tools will run in a reasonable amount of time. 
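As an illustration of the size filter described above, the following sketch selects benchmark candidates between 50 KB and 10 MB from a directory of gathered binaries; the directory path is hypothetical and the actual gathering scripts used for PIQUE-Bin may differ.

from pathlib import Path

MIN_SIZE = 50 * 1024          # 50 KB: drop trivial binaries with few weaknesses
MAX_SIZE = 10 * 1024 * 1024   # 10 MB: keep tool run times reasonable

def select_benchmark_binaries(directory):
    """Return the files in `directory` whose size falls within the accepted range."""
    files = (p for p in Path(directory).iterdir() if p.is_file())
    return [p for p in files if MIN_SIZE <= p.stat().st_size <= MAX_SIZE]

# Hypothetical usage on the gathered binaries:
# benchmark = select_benchmark_binaries("benchmark/binaries")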
There is currently no established method to determine the best way to gather the benchmark repository or identify the ideal 1https://github.com/andrew-d/static-binaries 2https://github.com/mosajjal/binary-tools 38 characteristics of the benchmark repository, which are further investigated in chapter 7. The second step in this phase is to acquire a set of relative rankings of security characteristics in terms of stakeholder concern. Security characteristic rankings are chosen using the researcher as a surrogate stakeholder. This represents a threat to the construct validity of this study because the researcher may make ranking selections that are not reflective of the stakeholders who would utilize the binary. We investigate the impact of stakeholder values in chapter 7. Design The model design phase is the phase in which we construct and populate the tree structure of PIQUE-Bin. This means identifying security aspects and product factors that will be included in the model, as well as populating and weighting the edges of the model from the stakeholder ranking gathered in the requirements phase. Security aspects are chosen with stakeholder interests in mind. The concept of security aspects aligns well with the CIA triad. The classical CIA triad includes confidentiality, integrity, and availability. This paradigm is commonly critiqued for being too simplistic in its view of security aspects, primarily because it lacks the security requirements for accountability and responsibility. We chose a form of the extended CIA triad based on Microsoft’s STRIDE threat model3. Microsoft’s STRIDE model is a threat model for classifying threats. STRIDE is an acronym of Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, and Escalation of privileges. These are all threats that impact the target system in different ways, corresponding to different desired security properties that would be negatively affected. These desired properties are authenticity, integrity, non-repudiation, confidentiality, availability, and authorization respectively. This 3https://www.microsoft.com/security/blog/2007/09/11/stride-chart/ 39 may be seen in Table 5.1. We chose these properties to be the security aspects for PIQUE- Bin to give a more sophisticated view of security aspects than the CIA triad, while still utilizing a well-established, mature paradigm. Threat Desired Security Property Spoofing Authenticity Tampering Integrity Repudiation Non-repudiation Information disclosure Confidentiality Denial of service Availability Escalation of privileges Authorization Table 5.1: STRIDE Threats and Desired Security Properties With STRIDE guiding the choice of quality aspects, we now look to the product factors. Product factors are measurable properties of the binary that can be tied to the security aspects. We chose to utilize the CWE-699 view4 for this layer of the model. CWE-699 is a view of the CWE catalog based on software development. It breaks the catalog into 40 categories of weaknesses, which then have base level weaknesses under them. The product factors in PIQUE-Bin are the category-level CWEs found in CWE-699. As an example of a CWE category found in CWE-699, see CWE-1210 and the CWEs classified under CWE-1210 in Figure 5.1. We chose to use the CWE because it allows us to tie the findings of different tools to the CWE that they are instances of while preserving the impact of a finding to the security aspects that we are interested in. 
In addition, the information for each CWE provided by MITRE guides the stakeholder in how each CWE might impact the quality aspects. We choose to limit the set of CWEs because the whole CWE catalog contains over 1,000 weaknesses, which would make the model cumbersome and overly complicated, with no clear way to aggregate through these weaknesses. The number of comparisons required for the stakeholder to weight the model using the AHP is the number of product factors squared, so reducing the number of nodes greatly helps the ease of using the model. Additionally, we add one 'unknown/other' node which encompasses CWEs not included in CWE-699 as well as findings for which the CWE is unknown. This would be the case for CVEs that have not been mapped to a CWE. Finally, we have one more product factor, 'potential malicious indicators', which is added because CWE-699 lacks anything that would capture these findings.

4https://cwe.mitre.org/data/definitions/699.html

Figure 5.1: An Exemplary CWE Category From CWE-699

After defining the structure of the top two layers, we must now weight the edges from product factors to security aspects and from the security aspects to the overall security score. The pairwise comparisons of aspects and the resulting weights of the aspects to the overall score are shown in Table 5.2. We approached this as if it were an ICS environment where availability is of utmost importance, followed by authorization/authenticity and integrity. For the product factor layer, we manually weighted the connections to the most likely impacts. We chose this approach due to the large number of comparisons that would be required for pairwise comparisons between product factors for each quality aspect. We followed the information provided in the 'technical impact' section of the CWEs as well as the description of the CWE categories. When multiple security aspects were likely to be impacted, we allowed the CWE category to impact multiple security aspects. This weighting can be seen in Table 5.3. The final weights for each quality aspect are calculated as the value associated with each product factor divided by the sum of values, giving a set of weights that sum to one.

Table 5.2: Comparison Matrix And Final Weights for Quality Aspects

Criteria        | Availability | Authenticity | Authorization | Confidentiality | Non-repudiation | Integrity | Final Weight
Availability    | 1     | 3    | 3    | 8 | 8 | 5   | 0.455
Authenticity    | 0.333 | 1    | 1    | 4 | 4 | 2   | 0.179
Authorization   | 0.333 | 1    | 1    | 4 | 4 | 2   | 0.179
Confidentiality | 0.125 | 0.25 | 0.25 | 1 | 1 | 0.5 | 0.048
Non-repudiation | 0.125 | 0.25 | 0.25 | 1 | 1 | 0.5 | 0.048
Integrity       | 0.2   | 0.5  | 0.5  | 2 | 2 | 1   | 0.092

Development

The development phase of PIQUE-Bin is the stage at which we derive the model and calibrate it. Model calibration consists of applying the chosen tools to the set of benchmark binaries to create utility functions. After this stage the model is ready to be evaluated on a binary to assess security quality. This process is guided by multiple exploration-motivated applications of the model, as detailed in chapter 6. These iterative applications of PIQUE-Bin to binaries have led us to the changes we detail in the following subsections.
Table 5.3: Weighting of Product Factors to Quality Aspects

Product Factor (CWE Number) | Availability | Authenticity | Authorization | Confidentiality | Non-repudiation | Integrity
API / Function Errors - (1228) | 1 | 1 | 1 | 1 | 1 | 1
Audit / Logging - (1210) | 0 | 0 | 0 | 0 | 1 | 0
Authentication Errors - (1211) | 0 | 1 | 0 | 0 | 0 | 0
Authorization Errors - (1212) | 0 | 0 | 1 | 0 | 0 | 0
Bad Coding Practices - (1006) | 1 | 1 | 1 | 1 | 1 | 1
Behavioral Problems - (438) | 1 | 0 | 0 | 0 | 0 | 0
Business Logic Errors - (840) | 1 | 1 | 1 | 1 | 1 | 1
Communication Channel Errors - (417) | 0 | 1 | 1 | 1 | 0 | 0
Complexity Issues - (1226) | 1 | 1 | 1 | 1 | 1 | 1
Concurrency Issues - (557) | 1 | 1 | 1 | 1 | 1 | 1
Credentials Management Errors - (255) | 0 | 1 | 0 | 0 | 0 | 0
Cryptographic Issues - (310) | 0 | 1 | 0 | 1 | 0 | 0
Key Management Errors - (320) | 0 | 1 | 0 | 0 | 0 | 0
Data Integrity Issues - (1214) | 0 | 0 | 0 | 0 | 0 | 1
Data Processing Errors - (19) | 1 | 1 | 1 | 1 | 1 | 1
Data Neutralization Issues - (137) | 1 | 1 | 1 | 1 | 1 | 1
Documentation Issues - (1225) | 1 | 1 | 1 | 1 | 1 | 1
File Handling Issues - (1219) | 1 | 1 | 1 | 1 | 1 | 1
Encapsulation Issues - (1227) | 1 | 1 | 1 | 1 | 1 | 1
Error Conditions, Return Values, Status Codes - (389) | 1 | 0 | 0 | 1 | 0 | 0
Expression Issues - (569) | 1 | 1 | 1 | 1 | 1 | 1
Handler Errors - (429) | 1 | 1 | 1 | 1 | 1 | 1
Information Management Errors - (199) | 0 | 0 | 0 | 1 | 0 | 0
Initialization and Cleanup Errors - (452) | 1 | 1 | 1 | 1 | 1 | 1
Data Validation Issues - (1215) | 1 | 1 | 1 | 1 | 1 | 1
Lockout Mechanism Errors - (1216) | 1 | 0 | 0 | 0 | 0 | 0
Memory Buffer Errors - (1218) | 1 | 1 | 1 | 1 | 1 | 1
Numeric Errors - (189) | 1 | 1 | 1 | 1 | 1 | 1
Permission Issues - (275) | 0 | 0 | 1 | 0 | 0 | 0
Pointer Issues - (465) | 1 | 1 | 1 | 1 | 1 | 1
Privilege Issues - (265) | 0 | 0 | 1 | 0 | 0 | 0
Random Number Issues - (1213) | 1 | 1 | 1 | 1 | 1 | 1
Resource Locking Problems - (411) | 1 | 1 | 1 | 1 | 1 | 1
Resource Management Errors - (399) | 1 | 1 | 1 | 1 | 1 | 1
Signal Errors - (387) | 1 | 1 | 1 | 1 | 1 | 1
State Issues - (371) | 1 | 1 | 1 | 1 | 1 | 1
String Errors - (133) | 1 | 1 | 1 | 1 | 1 | 1
Type Errors - (136) | 1 | 0 | 0 | 0 | 0 | 0
User Interface Security Issues - (355) | 0 | 0 | 0 | 1 | 0 | 0
User Session Errors - (1217) | 0 | 1 | 1 | 1 | 0 | 0
Potential Malicious Indicators | 1 | 1 | 1 | 1 | 1 | 1
Unknown-Other | 1 | 1 | 1 | 1 | 1 | 1

Utility Function

By default, PIQUE utilizes linear interpolation between two threshold values, a and b. It limits the value to the range [0, 1] via a piece-wise function. This function is defined as

f(x) = \begin{cases} 0 & x < a \\ 1 & x > b \\ \dfrac{x-a}{b-a} & a \le x \le b. \end{cases} \tag{5.1}

This is a standard utility function in the context of quality modeling. This utility function, however, makes one critical assumption which we must re-assess now that we are primarily focusing on a security context. In models that follow the Quamoco/QATCH/PIQUE paradigm, the output of a utility function is limited to [0, 1] because there is some minimum and maximum utility value, where utility is defined as a value that "quantifies the relative satisfaction of a decision maker concerning the quality of a software product characterised by specific measurable factors" [53]. However, in the context of security measures and factors, we argue that there is no true maximum utility at the measure or factor level, because enough vulnerabilities of the same type (categorized under the same measure/factor) can completely compromise some aspect of security. We argue this because there is never a point where an additional vulnerability should not change the output of the utility function. To see why this would be an issue, consider a case where we have a measure whose utility function is evaluating to 1 (the maximum utility score). Then, consider injecting a new vulnerability that would be categorized under the same measure and then re-analyzing it.
The value of the utility function cannot increase, and therefore we see no change in the output of the model. A change is required to the utility functions of the model in order to allow additional vulnerabilities to always change the utility score. This phenomenon is seen in the analysis done on Wireshark5 in section 6.

5https://www.wireshark.org/

To remedy this issue, we move the bound to [0, 1] from the utility function to the quality aspect evaluation function. In this way, we enable a single measure to cause a quality aspect to drop all the way to 0. Additionally, this change implies that the quality aspects have some maximum or minimum level of utility, which we claim to be true. Complete compromise of a security aspect such as availability is possible; for example, if a trivially exploited vulnerability causes an application to hang indefinitely, availability could be considered fully compromised. However, this should not compromise the total score, because availability has a set amount of weight within the TQI and we seek to preserve that weight to allow the stakeholder's values to remain. Therefore, PIQUE-Bin makes one change to this functionality, which is to remove the piece-wise component of the utility function. The utility function in PIQUE-Bin becomes

f(x) = \frac{x-a}{b-a}. \tag{5.2}

To see what this change in the utility function looks like in a model, consider Figures 5.2 and 5.3. In Figure 5.2 we see two model evaluations utilizing the default bounded utility function. We see that in Figure 5.2a, with a severe finding, the model evaluates to a score of 0.875, and with just a few minor findings, as seen in Figure 5.2b, we see a score of 0.8125, a worse score. We would like to see the high severity finding be more influential in the model. Now consider Figure 5.3, the same analysis done using the unbounded utility function and bounded QA evaluation. We see the desired effect in Figure 5.3a, allowing the high severity impact to push a QA value to 0, therefore impacting the overall score greatly. Additionally, this change does not impact analysis when no measure exceeds the thresholds, which may be observed in Figures 5.2b and 5.3b not changing between the two methods.

Figure 5.2: Two simple model evaluations using the default PIQUE utility function

Threshold Calculation

By default, PIQUE uses the maximum and minimum value found in the benchmark repository for the threshold values. For PIQUE-Bin, we change this calculation method for several reasons. First of all, we want to compare the binary under analysis to the benchmark repository as a whole, rather than just the two binaries in the repository which have the worst and best values for a given measure. Additionally, we find that these thresholds are almost never influenced by adding more projects to the repository unless one is an exceptionally poorly developed project.

Figure 5.3: Two simple model evaluations using an unbounded utility function with bounded QA values

We change the thresholds to be calculated as the mean plus and minus the standard deviation of the values in the benchmark repository. This is similar to using the interquartile range as seen in QATCH and Quamoco models; however, the interquartile range does not work well for benchmarking measures whose findings are rare. If greater than 75 percent of the measure values are 0, then we get [0, 0] for our thresholds, which is not ideal for obvious reasons.
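The following sketch illustrates the threshold derivation just described for a sparsely-found measure, contrasting interquartile-range thresholds with the mean plus and minus standard deviation (the non-negative clamp on the lower threshold is explained immediately below); the benchmark values are illustrative rather than taken from the actual repository.

import numpy as np

def derive_thresholds(benchmark_values):
    """Mean +/- standard deviation thresholds, with the lower bound clamped at zero."""
    values = np.asarray(benchmark_values, dtype=float)
    lower = max(values.mean() - values.std(), 0.0)
    upper = values.mean() + values.std()
    return lower, upper

# A measure with findings in only one of ten benchmark binaries.
sparse_measure = [0] * 9 + [3]

print(np.percentile(sparse_measure, [25, 75]))  # IQR-style thresholds collapse to [0, 0]
print(derive_thresholds(sparse_measure))        # mean +/- std gives roughly (0.0, 1.2)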
For these reasons, we use the mean and standard deviation to calculate the thresholds and limit the lower threshold to be non-negative. We limit the lower threshold to be non-negative because, when analyzing a binary for vulnerabilities, a measure with no findings should evaluate to the best possible value, which would not occur if the lower threshold were negative.

Tools

For PIQUE-Bin, we implement three binary static analysis tools. These tools are cwe checker, cve-bin-tool, and YARA using the Yara-Rules Github repository. Of course, additional tools would improve the model, but the time to implement the tools and integrate them into the model structure is a limiting factor for the number of tools. More tools also make the benchmarking and analysis process take longer. These tools each serve a different purpose in the model, supplying a different source of information with each finding. The use of multiple tools highlights our ability to incorporate many different types of security tools within a model.

CVE-Bin-Tool

The first tool in the model is cve-bin-tool6. This tool was chosen because it is under active development, popular, and able to identify many CVEs that may appear in a binary due to third party library usage. The tool works by searching a binary for patterns known to be associated with third party libraries. The tool then cross-references the library with the NVD to identify known vulnerabilities. As an example, here are the patterns associated with Wireshark:

6https://github.com/intel/cve-bin-tool

CONTAINS_PATTERNS = [
    r"'usermod -a -G wireshark _your_username_' as root.",
    r"Are you a member of the 'wireshark' group\? Try running",
]
FILENAME_PATTERNS = [r"rawshark", r"wireshark"]
VERSION_PATTERNS = [r"Wireshark ([0-9]+\.[0-9]+\.[0-9]+)"]
VENDOR_PRODUCT = [("wireshark", "wireshark")]

The tool is defeated rather easily by obfuscation, so more sophisticated analysis methods may help identify additional CVEs. The CVE findings are then classified under a CWE according to the National Vulnerability Database. The severity for a finding is taken from the CVSS v2 score that the NVD also provides. One final advantage of this tool is that, because it is simple pattern scanning, it should work on any architecture or form of binary, making this an ideal tool for ICS settings. A major drawback to cve-bin-tool is that its applicability is limited. Only a certain set of third party libraries are identified, and we cannot guarantee that the CVEs in a third party library will also be present within the binary that utilizes the library. We also do not know the prevalence of third party libraries within OT environments, which may further limit how useful this tool will be in such environments.

cwe Checker

cwe checker is a suite of checks for CWE instances in a binary7. It currently has the capability to search for 19 different CWEs and is under active development. It can be run on multiple popular architectures, which makes it a good tool to consider in environments where multiple architectures may be in use across a system, such as in an ICS. Each finding from this tool is classified under the specific CWE it is an instance of, and each finding receives the same severity of 1.

7https://github.com/fkie-cad/cwe_checker

The tool works by using Ghidra, a disassembler built by the National Security Agency (NSA), to disassemble a binary. After disassembly, the tool searches through an intermediate representation of the binary code for patterns associated with known vulnerabilities.
As stated in an earlier section, cwe checker searches for CWE190 using the following method:

For each call to a function from the CWE190 symbol list we check whether the basic block directly before the call contains a multiplication instruction. If one is found, the call gets flagged as a CWE hit, as there is no overflow check corresponding to the multiplication before the call. The default CWE190 symbol list contains the memory allocation functions *malloc*, *xmalloc*, *calloc* and *realloc*.

cwe checker is prone to false positives, but will find vulnerabilities that are unknown or new. As such, it is a valuable tool for identifying vulnerabilities inserted throughout the development process. Additionally, Ghidra is able to disassemble many different architectures, with the number of architectures that can be disassembled actively increasing. Also, the tool is open source, so if an important architecture is not yet supported, support for it may be developed so that Ghidra can disassemble that architecture.

Yara-Rules

The final tool is the Yara-Rules repository8, which is a collection of rules for the YARA (Yet Another Ridiculous Acronym) tool. The tool allows the definition of patterns, and logic surrounding the patterns, to search for in a binary or text file. The Yara-Rules repository is a set of rules put together by a group of IT security researchers. The rules are classified under anti-debug/anti-VM, capabilities, cryptography, exploit kits, malicious documents, malware, packers, and malicious emails. All these rules have their own diagnostic and measure, with each matched rule becoming a finding with a severity of 1. These measures are aggregated into the 'Potential Malicious Indicators' product factor.

8https://github.com/Yara-Rules/rules

This tool will allow for identification of significant changes in the capabilities of binaries, as well as identification of malicious indicators within a binary. This tool is also able to run on any architecture or form of binary, which is ideal for utilizing it in ICS environments. As one example of a finding, consider the rule Check_Dlls, an anti-debug/anti-VM rule. This rule defines several strings, then applies a condition over these strings:

rule Check_Dlls
{
    meta:
        Author = "Nick Hoffman"
        Description = "Checks for common sandbox dlls"
        Sample = "de1af0e97e94859d372be7fcf3a5daa5"
    strings:
        $dll1 = "sbiedll.dll" wide nocase ascii fullword
        $dll2 = "dbghelp.dll" wide nocase ascii fullword
        $dll3 = "api_log.dll" wide nocase ascii fullword
        $dll4 = "dir_watch.dll" wide nocase ascii fullword
        $dll5 = "pstorec.dll" wide nocase ascii fullword
        $dll6 = "vmcheck.dll" wide nocase ascii fullword
        $dll7 = "wpespy.dll" wide nocase ascii fullword
    condition:
        2 of them
}

If any two of the defined strings are present, then DLLs commonly associated with anti-debugging or anti-VM behavior are being used in the binary, and that is cause for concern.

Changes to PIQUE

For the most part, PIQUE as a platform provides all the functionality to build a standard operationalized quality model while enabling the implementation of different behavior. In the context of security analysis, however, we found that several changes were required. We have already covered several changes and the respective reasoning: the utility function, threshold calculation, and manual weighting for the PF to QA level of the model. One additional change we have made to PIQUE concerns the thresholds for measures that had no findings in the benchmark repository. These measures get a threshold value of [0, 0].
Ideally, all thresholds would have some value aside from [0, 0]. This is achieved when every diagnostic has at least one finding among the entire benchmark repository. However, when dealing with models that have a lot of diagnostics for which findings are rare, achieving that goal may not be realistic. It is not desirable to force every finding to be a part of the benchmark repository. In the case of [0, 0] thresholds then, we really do not know how the binary under analysis compares to the benchmark repository, so it wouldn’t make sense to say it is better or worse. By default for negative measures, PIQUE assigns a value of 1 to a measure that has no finding in the benchmark repository and no finding in the binary under analysis. This is stating that although there were no findings, PIQUE claims that the binary under analysis is better than the benchmark repository with respect to that measure. Although some value must be assigned, the assignment of 1 for a measure where we have no information and no context may mislead interpretation by giving a score for a binary that indicates it is more secure than it is, simply due to the sparsity of tool findings. Therefore, for any measure with [0, 0] thresholds and no findings for the binary under analysis, we assign a value of 0.5. This essentially means that we do no have enough information for a conclusion so we assume that the binary is equivalent in quality to the benchmark repository. One interesting effect of this is that the more nodes with no findings 52 and [0, 0] threshold values, the lower the maximum possible score of the binary will be. With this change, we can no longer achieve a score of 1 if we have any [0, 0] thresholds. There is still discussion on the impact of this decision and what default behavior is preferable. Either method has its benefits, but one choice must be made. We choose a default value of 0.5 because it tends to give a lower score for binaries, which is preferable for a security metric. In the face of uncertainty, we should not make the score higher for something that may be used in security-critical decisions. Two other options are being actively considered and discussed. The first option is to not include the measures that receive [0, 0] thresholds, and to raise some warning flag when a finding that has [0, 0] thresholds occurs in a binary, but leave the score unchanged. The other option is that we could create theoretical thresholds that would simulate adding a single binary to the repository that has the finding. For example, consider a repository that has 700 binaries. If we have some measure with [0, 0] thresholds but has a finding in the binary under analysis, we would assume that a single binary among the benchmark repository had that finding. Using the mean plus and minus standard deviation (with a lower limit of 0), we would get thresholds of [0, 0.04]. A single finding (with a severity of 1) of this measure would take on a value of −24.5, giving a high impact. 53 EXPLORATORY CASE STUDIES In this section, we apply PIQUE-Bin to several exemplary binaries to discover how it may be improved. These case studies lead us to make many of the design decisions that were described in chapter 5. Application to Wireshark Binaries In this case study, we apply an early version of PIQUE-Bin to two Wireshark binaries. The purpose of our study is to propose design changes to create an improved model with which further evaluations will be done. This version of PIQUE-Bin uses PIQUE default behaviors. 
It bounds the utility function to [0, 1] and uses threshold calculations of [min, max]. The only tool in this version of the model is cve-bin-tool. The binaries under analysis are Wireshark 1.8.1 and Wireshark 3.0.0. Wireshark1 is a popular network protocol analysis tool that allows users to observe what is happening on a network with a great level of detail. Wireshark was chosen as a case study due to the high number of CVEs associated with it. Wireshark is known to be very vulnerable and can be a blacklisted application in high security environments due to enabling privilege escalation attacks on systems. Wireshark 1.8.1 has 109 CVEs contained within it2, making it a good case study for a vulnerable binary. We analyze Wireshark 3.0.0 as well, which has 24 vulnerabilities3, making it a good comparison. We expect to see version 1.8.1 get a low score (less than 0.2), while version 3.0.0 should get a significantly higher score.

1https://www.wireshark.org/
2As of 06/2021, according to NVD using cpe:2.3:a:wireshark:wireshark:1.8.1:*:*:*:*:*:*:*
3As of 06/2021, according to NVD using cpe:2.3:a:wireshark:wireshark:3.0.0:*:*:*:*:*:*:*

In addition to applying the model to the binaries, we also apply the model to the binaries with manually set thresholds: rather than threshold values of [min, max], we use [0, 10] for all thresholds. The purpose of doing this is to get some idea of the impact of the thresholds (and therefore the benchmark repository composition) on the overall score. We expect that the scores of the binaries will be impacted similarly and will maintain their ordering.

Results

The scores for the derived PIQUE-Bin model applied to Wireshark versions 1.8.1 and 3.0.0, with and without manually set thresholds, can be seen in Table 6.1. As expected, the score of version 3.0.0 is greater than that of version 1.8.1 in the derived threshold case, but in the manual threshold case we see that 3.0.0 receives a worse overall score.

Table 6.1: Wireshark Model Application Results

Score           | Derived Thresholds, 1.8.1 | Derived Thresholds, 3.0.0 | Manual Thresholds, 1.8.1 | Manual Thresholds, 3.0.0
Total Score     | 0.7962 | 0.8004 | 0.7690 | 0.6209
Availability    | 0.7884 | 0.7662 | 0.7550 | 0.6200
Authenticity    | 0.8072 | 0.8318 | 0.7414 | 0.5756
Authorization   | 0.8287 | 0.8559 | 0.8108 | 0.6270
Confidentiality | 0.7443 | 0.8502 | 0.6720 | 0.5920
Non-repudiation | 0.8079 | 0.8384 | 0.7878 | 0.5818
Integrity       | 0.7896 | 0.7973 | 0.8571 | 0.7183

Discussion

The application of the model has revealed several flaws that must be addressed before the model will meet expectations. The model failed to give a score in the expected range (<0.2 for Wireshark 1.8.1), and in the manual threshold case Wireshark 3.0.0 scored lower than 1.8.1. The issues noted above are caused by several phenomena. Primarily, the reason is the grouping of CVEs under a single measure, in combination with the inability of a single measure to greatly impact the model. Wireshark 3.0.0 scores higher than 1.8.1 when using derived thresholds, but lower than 1.8.1 when using manual thresholds. The most likely reason for this is the interaction between threshold derivation and CVEs grouped under a single CWE. Wireshark 3.0.0 has fewer CVEs, but the CVEs are more spread out among CWEs, causing less information loss and allowing more CVEs to impact the overall score in comparison to 1.8.1. This reversal of the ordering of scores due to thresholds shows a need for further investigation of the benchmarking process and utility function impact.
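To make the bounding effect concrete, the following sketch contrasts the default bounded utility function with the unbounded variant adopted in PIQUE-Bin for a negative measure whose finding count far exceeds the upper threshold; the thresholds and counts are illustrative, not taken from the Wireshark analysis.

def bounded_utility(x, a, b):
    """Default PIQUE utility: linear interpolation clipped to [0, 1]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def unbounded_utility(x, a, b):
    """PIQUE-Bin utility: the same interpolation without the clipping."""
    return (x - a) / (b - a)

# A negative measure evaluates to 1 - f(x); illustrative thresholds [0, 100].
a, b = 0.0, 100.0
for findings in (150, 400):
    print(findings,
          1 - bounded_utility(findings, a, b),    # both clip to 0.0, so extra findings vanish
          1 - unbounded_utility(findings, a, b))  # -0.5 and -3.0: additional findings still count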
In this analysis, one specific issue is the number of CVEs that are not assigned a CWE. This causes a large number of findings to be classified under the 'unknown-other' category of weaknesses. As a result, this node receives a very large input value and evaluates to 0. This leads to many of the CVEs for Wireshark 1.8.1 not impacting the score, because the measure node associated with the 'unknown-other' CVEs reaches the maximum utility value with only a small set of the total vulnerabilities. The thresholds of the 'unknown-other' node are [0, 96.33] and the value for Wireshark 1.8.1 is 371, meaning that it would evaluate to −2.85 if not for the minimum value of 0. Essentially, two thirds of the findings have no impact on the TQI due to this limit. This significantly reduces the potential impact of this node, which contains 62 CVE findings.

Threats to Validity

We will not be addressing internal or conclusion threats, because we are not asserting any causal relationships, nor are we concluding anything about our model generally in this case study.

External threats include threats to our ability to use this same model on other binaries with other vulnerabilities. The model is highly stakeholder, tool, and benchmark dependent. Our choice of binaries for benchmarking reflects the type of binary we expect to apply the model to; it is possible that this model will not be applicable to binaries dissimilar to the ones we use to benchmark. Additionally, it is certain that there are security findings that are not identified by the tool in the model, which means the model's output is not considering all security threats within a binary. These problems are remedied by adapting the model for different scenarios, as this is the advantage of using PIQUE to build the model.

The primary threat to validity of this model is construct validity. There are many threats that must be addressed as the model is improved. Many of these threats were touched on in the discussion section. There is some risk that we lose too much relevant information through generalization: a CVE provides more information about the impact of the flaw than the CWE that the CVE is categorized under. In addition, the impact of the CVE may be different than that of the CWE it is categorized as. For instance, take some vulnerability that allows anyone to read the username and password of an admin account. Although this is clearly a threat to authorization, we could see the CVE be categorized as a CWE that is related to confidentiality because the weakness itself is an exposure of information. There also may be too many CVEs that are either categorized improperly or not at all. One potential way to mitigate these issues is through the use of machine learning to categorize CVEs under CWEs, as in [1]. The benchmarking process needs to be investigated further to determine that the benchmark binaries are appropriate. Additionally, it may be that we are not adequately measuring security quality by only observing security flaws. Perhaps, to obtain a holistic picture, security measures that have been implemented properly need to be taken into consideration. This is something that should be considered as a possibility when this model is being assessed to determine efficacy, but is outside the scope of this work. Finally, security flaws that have not been documented as a CVE exist, and the current model does not account for these. Additional tools should be implemented in the model to identify non-CVE security findings.
This will be done later by utilizing cwe Checker and YARA. Application to Busybox Binaries For this case study, we investigate the use of PIQUE-Bin to analyze several versions of busybox binaries. Busybox is a set of system utility functions for Linux systems. It was chosen at the request of our research contractors, the Idaho National Laboratory. This version of PIQUE-Bin utilizes all the latest developments in PIQUE inspired by initial applications to Wireshark binaries. Specifically, the changes made in PIQUE-Bin from the Wireshark case study are: • Removed limit to [0, 1] for utility functions • Added limit to [0, 1] for QA evaluation • Thresholds are calculated using the mean plus/minus standard deviation • Added two tools, cwe checker and yara • Changed default evaluation of measures that receive [0, 0] thresholds and no findings to evaluate to 0.5 rather than 1 These changes were made due to the shortcomings of the model identified in the analysis of Wireshark. The goal of this case study is to observe a scenario in which we are able to apply PIQUE-Bin to the same binary across multiple versions. This will give insight on how much 58 a binary is expected to change over time, as well as confirm that the model is working by showing a score that increases over time. We expect to see an increase in score because this binary has patched known vulnerabilities over time. This is as close as we can get to a proxy for the expected relative security - because there are fewer known vulnerabilities in more recent versions of the binary, we expect recent versions to have a better score. If this does not prove to be the case according to the output of PIQUE-Bin, we should investigate and be sure that our findings are accurate. If we do not see a score that increases over time, then we should attribute this to a specific cause/finding and confirm that the decrease in TQI is reasonable. Results The results of the application of PIQUE-Bin to the busybox binaries may be seen in Figure 6.1 and Table 6.2. The values are increasing as we analyze newer versions, indicating (as expected) that the binary has increased in security quality over its development history. The only few points where we do not see a change in TQI is between versions 1.29.0, 1.29.1, 1.29.2, and 1.29.3. This is likely due to the fact that these are all small, consecutive patches to busybox, fixing bugs that are not reported as vulnerabilities. Therefore the differences between the versions of 1.29.x are minimal. This is confirmed by the patch notes of busybox4: 9 September 2018 -- BusyBox 1.29.3 (stable) Bug fix release. 1.29.3 has a fix in libbb for xmalloc_fgets(). 31 July 2018 -- BusyBox 1.29.2 (stable) Bug fix release. 1.29.2 has fixes for fdisk (compat fixes, allow 2TB+ sizes), gzip (FEATURE_GZIP_LEVELS was producing badly-compressed .gz), hexedit (segfault fix). 4https://www.busybox.net/ 59 15 July 2018 -- BusyBox 1.29.1 (stable) Bug fix release. 1.29.1 has fixes for wget (http->https redirect) and sendmail (angle bracket parsing). The patch notes do not mention any known vulnerabilities in the form of CVEs, nor any CWE categories that would be detected by our tools. Overall, these results confirm that the busybox binaries have improved security quality throughout their patch history with respect to the benchmark repository according to PIQUE-Bin’s output. From June 2013 to February 2019, we see a change of 0.25. 
Figure 6.1: TQI Values for Busybox Versions

Table 6.2: Busybox Model Application Results

BusyBox Version | Total Score | Availability | Authenticity | Authorization | Confidentiality | Non-repudiation | Integrity
1.21.1      | 0.355 | 0.357 | 0.351 | 0.367 | 0.375 | 0.330 | 0.330
1.28.4      | 0.470 | 0.476 | 0.456 | 0.476 | 0.480 | 0.460 | 0.460
1.29.0      | 0.601 | 0.610 | 0.574 | 0.599 | 0.599 | 0.607 | 0.607
1.29.1      | 0.601 | 0.610 | 0.574 | 0.599 | 0.599 | 0.607 | 0.607
1.29.2      | 0.601 | 0.610 | 0.574 | 0.599 | 0.599 | 0.607 | 0.607
1.29.3      | 0.601 | 0.610 | 0.574 | 0.599 | 0.599 | 0.607 | 0.607
1.30.0      | 0.605 | 0.614 | 0.578 | 0.603 | 0.602 | 0.611 | 0.611
1.30.1      | 0.607 | 0.617 | 0.580 | 0.606 | 0.604 | 0.614 | 0.614
1.30.1 lite | 0.643 | 0.653 | 0.613 | 0.639 | 0.637 | 0.654 | 0.654

Discussion

The changes made to the PIQUE-Bin model after the application to Wireshark binaries appear to have improved our ability to assess the security quality of binaries, both by changing how we create our utility functions and by allowing measure output to be unbounded, allowing for higher impact. Additionally, we see in this case study what we hope to see: an increasing score over time. We expect this because the older versions of busybox have more known vulnerabilities, indicating that they are less secure. As we mentioned, if the PIQUE-Bin analysis deviated from this trend, we would have reason for concern, as it goes against expectations. This case study shows that PIQUE-Bin is able to successfully analyze a binary as it is updated and patched and is able to reflect changes in security quality through the removal of vulnerabilities.

One potential cause for concern is the lack of change in score for the versions of busybox that are 1.29.x. These versions all receive the same score because they have small bug fixes that are not detected by the tools currently used by PIQUE-Bin. Therefore, this case shows that in order to improve PIQUE-Bin we should add additional tools that would identify the minor bugs fixed in the 1.29.x versions.

Threats to Validity

The threats to validity are similar to the threats to validity noted in the application to the Wireshark binaries. Additional threats include a threat to the construct validity of the study. We assume that the true security score of the binaries we are analyzing is improving, but that is not necessarily the case. One may argue that an unknown or undisclosed vulnerability could be present in newer versions but has not been discovered, detected by tools, or disclosed. Therefore, we may be missing some significant findings.

MODEL VALIDATION

We have designed, developed, and applied the PIQUE-Bin model, but we are still unsure of what the output of PIQUE-Bin indicates and how it should be interpreted. We must build trust in the process that PIQUE-Bin applies to assess a binary and build understanding of what the output means. To this end we first analyze relationships between the attributes of binaries and tool output. This will help build trust in the model by ensuring that we are comparing binaries to the appropriate population of binaries to create a score that may be relied upon as a security assessment metric. To better show why this analysis is important, consider the following analogy. A 20 year old person goes to his/her doctor. The doctor compares the lung capacity of the 20 year old to a population of 90 year old individuals and reaches the conclusion that the younger individual's lungs are average. Of course, a 20 year old should have much better lung capacity than a 90 year old, so this comparison leads the doctor to an incorrect conclusion.
We want to avoid this scenario by ensuring that we are not comparing the equivalent of a ‘20 year old’ binary to a population of ‘90 year old’ binaries, or vice versa. We also investigate the impact of individual vulnerabilities being injected into the binary under analysis. This will allow us to isolate the change in TQI for a finding classified under each diagnostic. This will give much better context for analyzing changes in scores. Currently we do not have any idea whether a change of 0.05 is a large, noteworthy change in PIQUE- Bin’s output, or if it is not significant. Tool Output Sensitivity To Binary Attributes The benchmark repository can have a large impact on the output of any PIQUE model which makes sensitivity analysis important. We conduct sensitivity analysis by looking into what attributes of the gathered binaries correlate with the number of findings they produce. 63 This analysis will allow us to be confident that our overall score is due to differences in security between the benchmark repository and binary under analysis rather than due to differences in other attributes such as size. These attributes may cause the output to be artificially high or low, which may mislead stakeholders when interpreting the score produced by PIQUE-Bin. To gather data, we use the tool Detect-It-Easy1 to identify size, compiler, and whether the binary was statically or dynamically linked. We will also manually categorize the binaries as either ‘System’ or ‘Network’ focused binaries as a proxy for a more specific categorization of domain or purpose. We will then apply the three tools in the model, CVE-Bin-Tool, cwe Checker, and Yara Rules, to the binaries and record the count of findings for each tool. Once we have collected this data, we fit a model to the data for each tool to identify what attributes of the binary are correlated with each tool’s output. Hypotheses Before we begin, we define the research question and hypotheses that we are seeking to answer. Our research question is: What attributes of a binary correlate with changes in the number of findings from a tool? To answer this question, our null hypotheses are: • H10 : For each tool, there is no correlation between the size of the binary under analysis and number of findings • H20 : For each tool, there is no correlation between the method of linking (static vs dynamic) and number of findings 1https://github.com/horsicq/Detect-It-Easy 64 • H30 : For each tool, there is no correlation between the compiler used and number of findings • H40 : For each tool, there is no correlation between the domain (system vs network) and number of findings Finding a significant correlation (p < 0.05) between output from one of the tools and an attribute will lead us to reject the null hypothesis for that attribute. For any rejected null hypothesis, we must be sure to consider that attribute’s presence in the benchmark repository, and the potential impact on the output of PIQUE-Bin. Data Exploration To begin data exploration, we explore the distributions of our factors (size in bytes (continuous), static linking (categorical), compiler (categorical), and domain (categorical)), which can be seen in Figure 7.1. There is a clear imbalance of classes for the categorical variables which could be important to consider as we begin the data analysis phase. Additionally, the compiler variable contains an ‘unknown’ category which contains all the binaries for which the compiler could not be determined. 
Because this category could contain binaries that should fall under a different category, we must consider this a threat to validity and be wary as we interpret results and make conclusions. We also find that our compiler factor contains perfect separation with some of the tool output, causing issues with fitting a model. Perfect separation occurs when the output of 0s and 1s is perfectly separated by a certain predictor, which causes issues when performing logistic regression. In the case of our data, the ‘gcc(Alpine)’ compiler received cve-bin-tool findings for all of the binaries that it compiled. To alleviate this issue we combine the compiler variable in either “GCC” or “Unknown”. 65 Figure 7.1: Distributions of Factors Additionally, we should consider any collinearity within our factors. If collinearity does exist, we must consider this as we choose an appropriate model to consider the effects of each factor on the tool output. After initial investigation, size appears to be the only factor that has collinearity with the other factors. To see the comparison of size and the other factors, as well as the p-value given by Kruskal-Wallis tests, see Figure 7.2. Static compiling and the compiler used both appear to have some association with size, which is shown by the p-value being less than 0.05 for the non-parametric ANOVA alternative, the Kruskal-Wallis test. This finding indicates that whatever model is fit in the analysis phase should account for size when investigating the effect of other factors. Now we investigate each tool’s output compared to the binaries’ factors. 66 Figure 7.2: Size Compared to Other Factors We also investigate the distribution of the tools’ output. This may be seen in Figure 7.3. Apart from cwe checker, our tools are outputting predominantly 0 findings for each binary. This is a very important consideration to move forward with - we must ensure that our model is robust to a large number of 0 values. This means that simple linear regression will not be appropriate as the assumptions of normally distributed errors and equal variance will most likely not be met. One more consideration is that this data is a count of findings per binary and therefore Poisson regression could be appropriate. However, the mean of our data is much smaller than our variance, indicating over-dispersion is occurring. To account for this as well as the large number of 0 findings, we may investigate the use of a zero-inflated model, a quasi-poisson model, or a negative binomial regression model. 67 Figure 7.3: Distributions of Tool Outputs First we investigate cwe Checker, seen in Figure 7.4. Without considering the effect of size (which is known to be collinear with the compiler and static linking), we can see that the Kruskal-Wallis test is giving a p-value of less than 0.05 for compiler and static linking, indicating that these could be factors that are important to consider when building the benchmark repository. CVE-Bin-Tool’s output may be seen compared to binaries’ factors in Figure 7.5. Clearly the extreme number of 0 findings is evident in these plots - any non-zero value is considered an outlier. Modeling this data will require a model that accounts for a large number of zeros 68 Figure 7.4: cwe Checker Output Compared to Factors in the dependent variable, further indicating a zero-inflated model would be appropriate to use. Yara Rules’ output may be seen compared to binaries’ factors in Figures 7.6. 
It appears that size, compiler, and static linking all have some correlation with findings, but this analysis does not account for the effect of collinearity between size and the other factors. Network versus System domain does not appear to have a significant effect on the number of findings we expect. 69 Figure 7.5: CVE-Bin-Tool Output Compared to Factors Data Modeling To reach a conclusion for the hypotheses stated above, we will fit a model to determine what attributes of the binaries are significant in helping to predict the tool output for a binary. This will result in three separate models, one for each tool. There are several caveats we must consider as we determine what model is appropriate for our data. Because there is some collinearity in our factors, we should consider a comprehensive model for each tool that is able to account for effects of factors together rather than separately. Otherwise, we would be able to answer our hypotheses through ANOVA tests. One note before we begin: we do not expect to achieve accurate models. Security is largely influenced by the organizational processes (such as quality assurance) and the specific developer(s) who produce a binary. We also believe that age of a binary could play a large 70 Figure 7.6: Yara Rules Output Compared to Factors part in the security findings from tools such as CVE-Bin-Tool. This is because as a binary ages, additional vulnerabilities may be discovered that are present in older versions of a binary, so logically older binaries are going to tend to have more known vulnerabilities than newer binaries. Therefore, age should be considered as an attribute important in benchmark makeup as well. Unfortunately, we were unable to determine the age of the binaries. cwe checker output model The cwe checker tool output, as seen compared to factors in Figure 7.4, is count data. Typically we would begin this analysis by looking at using a Poisson regression model; however, the cwe checker tool output is large enough that Poisson regression may not be viable. Instead, we begin by attempting to fit a linear regression model. After fitting a linear regression model, we find that our assumptions of equal variance 71 and normally distributed residuals are not met. This leads us to attempt a transformation. Using a log transform on either size or the tool output alone do not yield good results, but transforming both provides a model that meets the assumptions of linear regression. The diagnostic plots for this model may be seen in Figure 7.7. Figure 7.7: cwe Checker Output Linear Model Diagnostic Plots From these plots, we may conclude that there is some evidence against normality and equal variance, but the model should be robust enough to allow for this small amount of evidence against the assumptions. Additionally, the residuals appear to be independent, and there are no outliers with large leverage. Now that we have determined that the assumptions for linear regression are met, we fit the model and begin identifying significant variables for the model. We will fit the model 72 initially with all factors, then remove them one by one, removing the non-significant factor with the smallest coefficient first. Each time a variable is removed, we re-fit the model and determine if another variable should be removed. AIC (Akaike’s Information Criteria) is a general indicator of goodness of fit for models [3, 2]. 
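A minimal sketch of this fitting-and-elimination procedure is shown below, assuming a data frame with one row per benchmark binary and columns named findings, size, and static; the column names and values are illustrative, and the original analysis may have been carried out with different statistical software.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data: cwe_checker finding counts, binary size in bytes,
# and a 0/1 static-linking indicator for a handful of binaries.
binaries = pd.DataFrame({
    "findings": [12, 85, 3, 40, 7, 60],
    "size":     [60_000, 2_500_000, 55_000, 900_000, 120_000, 1_800_000],
    "static":   [0, 1, 0, 1, 0, 1],
})

# Log-transform both the response and size so the linear-model assumptions hold.
full = smf.ols("np.log(findings) ~ np.log(size) + static", data=binaries).fit()
reduced = smf.ols("np.log(findings) ~ np.log(size)", data=binaries).fit()

# Backward elimination keeps a variable only if dropping it does not improve AIC.
print(full.aic, reduced.aic)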
The coefficients of the final model may be seen in Table 7.1. The formula for the model is

\mu\{\log(\widehat{\text{cwe checker}})\} = C_0 + C_1\log(\text{size}) + C_2\,\text{StaticallyCompiled} \quad (7.1)

\mu\{\log(\widehat{\text{cwe checker}})\} = -5.71 + 0.84\log(\text{size}) - 0.38\,\text{StaticallyCompiled} \quad (7.2)

\text{median}\{\widehat{\text{cwe checker}}\} = 0.0033\,\text{size}^{0.043}\,e^{-0.38\,\text{StaticallyCompiled}}, \quad (7.3)

where StaticallyCompiled is 1 if the binary is statically compiled and 0 if not. To interpret these results, we must pay close attention to the log transformation of the response and one explanatory variable. Equation 7.3 showcases the back-transformed model. This model predicts the median output of cwe checker for a binary given the size and whether it was statically compiled. A binary that uses static compilation is expected to have e^{-0.382} = 0.682 times the median number of findings from cwe checker. We expect a binary with 10000 bytes to have 1.48 times the median number of findings compared to a binary with 1 byte. Due to the p-values being less than 0.05 for all coefficients in the model (size and static linking), we may reject the null hypotheses H1₀ and H2₀ for this tool. We cannot reject H3₀ or H4₀ based on this model, as the variables associated with those hypotheses were not found to be significant.

Table 7.1: cwe Checker Output Model Coefficients

                      Estimate    Std. Error   t value    P-value
Intercept             -5.71261    0.51395      -11.115    0.000
Log(size)              0.83639    0.04276       19.559    0.000
Statically Compiled   -0.38200    0.18738       -2.039    0.042

CVE-Bin-Tool Output Model

In the case of CVE-Bin-Tool, the output is largely composed of 0 values. We begin by fitting a simple Poisson regression model, and assess assumptions from that point. Diagnostic plots for this Poisson regression model can be seen in Figure 7.8.

Figure 7.8: CVE-Bin-Tool Output Poisson Regression Model Diagnostic Plots

From Figure 7.8, we can see that several extreme outliers are present in the data. We observe that the residuals are on the scale of the data points themselves, indicating that the model is likely predicting every point as a 0, and we are simply seeing residuals that are equal to the value of each point. We also observe a high amount of overdispersion in the model, likely due to the large number of 0 counts. One method to relax these constraints would be to fit a quasi-Poisson model or a Negative Binomial Model (NBM). Hoef and Boveng [26] suggest that these two models are largely similar aside from certain situations in which they produce largely different estimates of the effects of covariates. However, fitting either of these models does not get around the fact that the data is mostly 0 counts. Therefore we fit a zero-inflated Poisson regression model [20]. The zero-inflated Poisson (ZIP) regression model separates the data into two models: one logistic model that estimates whether a binary will have a 0 count or a count greater than 0, and one Poisson regression model on the non-zero data to estimate what factors influence the size of the count [20]. The initial comparison of AIC between the Poisson regression model and the ZIP model reveals that the ZIP model performs much better - the Poisson regression model has an AIC of 3824.69, while the ZIP model has an AIC of 1087.9.
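The comparison just described can be sketched in R using the pscl package, which provides `zeroinfl()`. The column names and covariate set are assumptions; the AIC values quoted in the text come from the actual analysis, not from this sketch.

```r
# Illustrative sketch: Poisson vs zero-inflated Poisson for CVE-Bin-Tool counts.
library(pscl)   # zeroinfl()

fit_pois <- glm(cve_bin_tool ~ log(size) + static + domain,
                family = poisson, data = binaries)

# ZIP: a count component (left of "|") and a zero component (right of "|").
fit_zip <- zeroinfl(cve_bin_tool ~ log(size) + static + domain |
                      log(size) + static + domain,
                    data = binaries, dist = "poisson")

AIC(fit_pois, fit_zip)   # the ZIP model fits far better on this data
```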
We also fit a zero-inflated negative binomial (ZINB) regression model to compare to the ZIP model. The AIC of the ZINB is found to be 554.77, compared to the ZIP's AIC of 1087.9. Therefore we will fit a ZINB model and follow the same process as we did for the cwe Checker output model: remove the non-significant variable with the smallest coefficient and then reassess, repeating until all variables are significant. Following that process, we find that no variables are significant for predicting the size of the count for non-zero counts of CVE-Bin-Tool findings. This indicates that the only prediction we may do is on whether or not CVE-Bin-Tool will have findings, and not the number of vulnerabilities that we find. This makes intuitive sense - when we import a library, nothing about the binary we import into defines how many vulnerabilities will be in the library that is being utilized. This makes interpretation much easier - we only need to look at the logistic component of the ZINB model, or, to make things simple, fit a logistic model that predicts whether the count will be zero or non-zero. We fit a logistic model for simplicity, and follow the same process that we did to remove non-significant variables for the cwe Checker model. Assumptions for logistic regression require independent response observations, no extreme outliers, and no collinearity in our predictors. We have no extreme outliers, which can be observed in Figure 7.9. Our response observations are independent. Collinearity does occur in our predictors, so there is some evidence against that assumption; however, the collinearity is not so severe that we should need to account for it through a different model.

Figure 7.9: CVE-Bin-Tool Output Logistic Regression Model Residuals vs Leverage Plot

The coefficients for the final logistic regression model are seen in Table 7.2. The final equation becomes

p(\text{CVEBinTool}) = \frac{e^{-6.03 + 0.33\log(\text{size}) - 0.74\,\text{DomainSystem}}}{1 + e^{-6.03 + 0.33\log(\text{size}) - 0.74\,\text{DomainSystem}}}, \quad (7.4)

and the odds and logit forms become

\frac{p(\text{CVEBinTool})}{1 - p(\text{CVEBinTool})} = e^{-6.03 + 0.33\log(\text{size}) - 0.74\,\text{DomainSystem}}, \quad (7.5)

\text{logit}(p(\text{CVEBinTool})) = -6.03 + 0.33\log(\text{size}) - 0.74\,\text{DomainSystem}, \quad (7.6)

where DomainSystem is 1 if the binary's domain is system, and 0 if the binary's domain is network.

Table 7.2: CVE-Bin-Tool Output Model Coefficients

               Estimate   Std. Error   z value   P-value
Intercept      -6.0264    1.6743       -3.599    0.0003
Log(size)       0.3305    0.1334        2.477    0.0132
DomainSystem   -0.7433    0.3441       -2.160    0.0307

The coefficients for this model indicate that for a network-based binary with a log(size) of 0, the odds of having findings from CVE-Bin-Tool are 0.002. For a system-based binary with a log(size) of 0, the odds become 0.001. For a network-based binary with a log(size) of 15 (around 5 MB), there is a 0.163 probability of having findings. This does mean that we will never predict a positive occurrence, indicating either that we are missing critical variables that would fill the gaps in our model or that we may not be able to accurately predict the occurrence of findings from CVE-Bin-Tool. While we do find log(size) and binary domain to be significant in a model for predicting findings from CVE-Bin-Tool, it is difficult to say that any conclusions may be drawn from this model due to the fact that everything will be predicted as having zero findings. This is almost certainly caused by the fact that most of the binaries do not have any findings from CVE-Bin-Tool. This likely requires further investigation in which we look at a wider variety of binaries, but such an investigation is outside the scope of this study.
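A sketch of the ZINB comparison and the simplified logistic model follows. The variable names, the factor levels ("network"/"system"), and the indicator construction are assumptions made for illustration.

```r
# Illustrative sketch: ZINB comparison, then a logistic model on
# "any findings vs none" for CVE-Bin-Tool.
library(pscl)

fit_zinb <- zeroinfl(cve_bin_tool ~ log(size) + static + domain |
                       log(size) + static + domain,
                     data = binaries, dist = "negbin")
AIC(fit_zinb)   # compare against the ZIP model's AIC from the previous sketch

# No covariates are significant for the non-zero counts, so reduce the
# problem to logistic regression on whether any findings occur at all.
binaries$any_cve <- as.integer(binaries$cve_bin_tool > 0)
fit_logit <- glm(any_cve ~ log(size) + domain,
                 family = binomial, data = binaries)
summary(fit_logit)      # compare with Table 7.2

# Predicted probability for a network-domain binary with log(size) = 15,
# mirroring the roughly 0.163 reported in the text.
predict(fit_logit,
        newdata = data.frame(size = exp(15), domain = "network"),
        type = "response")
```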
Yara-Rules Output Model

The Yara Rules data has some resemblance to the CVE-Bin-Tool data, with a large number of 0s. It is not as extreme in the number of 0s, however. The distribution of Yara Rules' output among the factors can be seen in Figure 7.6. We will begin by attempting to fit a basic Poisson regression model, as well as a Poisson rate regression model using log(size) as an offset variable. We find that the Poisson rate regression model does not perform as well as the basic model when comparing by AIC or residual deviance. In the basic Poisson regression model we do not find any extreme outliers, but there does appear to be some overdispersion occurring in the model. Running a dispersion test (from the "AER" R package) on our model confirms that our model is overdispersed. Fitting an NBM allows for the overdispersion to be accounted for. Additionally, the NBM has an AIC of 1167 compared to 1318 for the basic Poisson model, and around half of the residual deviance. We still do not have any extreme outliers and our responses are independent, meaning that we have no assumptions violated for this model. We also fit a ZIP model to determine if it would be a better fit. Using AIC, we do find that this model improves on the NBM by about 10; however, it is a much more complicated model. Therefore we acknowledge that our model could potentially be improved by adding much more complexity, but we will utilize the NBM as a simpler approach to modeling this data. We take the same approach to removing non-significant variables as in the previous two models: by removing the non-significant variable with the smallest coefficient in the model. The NBM's variables can be found in Table 7.3. The equation for this model is

\ln(\widehat{\text{YaraRules}}) = \beta_0 + \beta_1\log(\text{size}) + \beta_2\,\text{SystemDomain} + \beta_3\,\text{StaticallyCompiled} \quad (7.7)

\widehat{\text{YaraRules}} = e^{\beta_0 + \beta_1\log(\text{size}) + \beta_2\,\text{SystemDomain} + \beta_3\,\text{StaticallyCompiled}} \quad (7.8)

\widehat{\text{YaraRules}} = e^{-7.28 + 0.60\log(\text{size}) - 0.45\,\text{SystemDomain} + 0.68\,\text{StaticallyCompiled}}. \quad (7.9)

Table 7.3: Yara-Rules Output Model Coefficients

                      Estimate    Std. Error   z value   P-value
Intercept             -7.28110    0.79228      -9.190    0.00000
log(size)              0.59893    0.06365       9.410    0.00000
SystemDomain          -0.44560    0.16042      -2.778    0.00547
StaticallyCompiled     0.67812    0.25444       2.665    0.00769

For a dynamically compiled, network-based binary with a size of 1, we find that the baseline number of findings from Yara Rules is e^{-7.28} = 0.0007. Statically compiling the binary correlates with an expected number of findings e^{0.678} = 1.97 times higher. A binary of the system domain is correlated with e^{-0.446} = 0.64 times the number of expected findings, meaning that network-based binaries tend to have more findings. Finally, a change of 1 in the log(size) of the binary is correlated with an e^{0.599} = 1.82 times change in the expected number of findings. As an example, consider a network-based, statically compiled binary with a log(size) of 15 (around 5 MB). We would expect 11.0 findings from Yara-Rules. We find log(size), static compilation, and domain to correlate with the number of findings from Yara-Rules. We may therefore reject the null hypotheses H1₀, H2₀, and H4₀, but we may not reject H3₀.
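A minimal R sketch of these Yara-Rules modeling steps follows, assuming the same illustrative column names as before, with `static` treated as a logical indicator for static linking and `domain` as a factor with levels "network" and "system". The dispersion test comes from the AER package mentioned above, and the reduced model mirrors Table 7.3.

```r
# Illustrative sketch: negative binomial model for Yara-Rules counts.
library(MASS)   # glm.nb()
library(AER)    # dispersiontest()

fit_pois <- glm(yara_rules ~ log(size) + static + domain,
                family = poisson, data = binaries)
dispersiontest(fit_pois)   # confirms over-dispersion in the Poisson fit

fit_nb <- glm.nb(yara_rules ~ log(size) + static + domain, data = binaries)
AIC(fit_pois, fit_nb)      # the NBM has the lower AIC
summary(fit_nb)            # compare with Table 7.3

# Expected findings for a statically compiled, network-domain binary with
# log(size) = 15, mirroring the worked example in the text (about 11).
predict(fit_nb,
        newdata = data.frame(size = exp(15), static = TRUE, domain = "network"),
        type = "response")
```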
Conclusion

From the models we have fit, we reject H1₀, H2₀, and H4₀. This indicates that when we are creating a benchmark repository for a PIQUE model for binaries, we must take into consideration the size, domain, and linking strategy used. If we fail to do so, we will receive a score that is biased in some way, meaning that there is a systematic reason for a lower or higher score that is not necessarily due to security differences between binaries. Table 7.4 contains the final variables and their p-values for each model.

Table 7.4: Output Models Coefficient P-Values

                      cwe Checker   CVE-Bin-Tool   Yara-Rules
Compiler              -             -              -
log(size)             0.000         0.000          0.000
Static Compilation    0.042         -              0.008
Domain                -             0.03           0.005

Discussion

When creating a security metric, we must be sure that the metric does not mislead people into believing that a system is more secure than it is in reality. We also need to ensure that there is trust in and understanding of the security metric. Without trust in the output of PIQUE-Bin, it is likely to be ignored by stakeholders and developers alike. Without understanding, the output of PIQUE-Bin may be used to take inappropriate security actions, such as allowing a binary in a system due to a high score (without considering the benchmark makeup), or rejecting a critical patch to a binary due to a lower score from PIQUE-Bin. These tests help to build trust and understanding by showing how the benchmark repository may influence the score. For example, we now know that if the benchmark is composed of large binaries and we analyze a small binary, we will have an artificially high score.

For all tools, size was found to be significant. This makes sense, as a larger binary is more likely to have more libraries in use, more potential capabilities, and more potential lines that could fit a CWE pattern. We found static compilation to correlate with the output of cwe Checker and Yara-Rules. For cwe Checker, this result is somewhat unexpected; however, it makes sense when we think about what the added code in the binary is. This is often going to be common functions, especially functions that interact with the operating system. These types of functions are a higher risk, and likely contain more high-risk code than typical binary code would, causing a higher number of CWEs to be identified. In the same way, the capabilities that are now contained in the binary would likely be found by Yara-Rules, causing additional findings. Domain is found to correlate with the output of CVE-Bin-Tool and Yara-Rules. This is interesting, but likely stems from the common libraries imported for network- or system-based calls and the differences in the vulnerabilities contained in those libraries. Additionally, the network binaries have more capabilities compared to a system-based binary. Altogether, we find that there are several important factors to consider when composing the benchmark repository for PIQUE-Bin. These considerations help to build a reliable security metric that we understand and are confident in.

Threats to Validity

We begin with threats to validity stemming from the data itself, then address any model-specific threats to validity. The binaries we gathered were not sampled randomly, as there is no practical way to collect binaries randomly. This means that any conclusions we make will be limited to apply only to the collection of binaries that we have. However, even though that is the case, most of the trends we identify in these binaries would likely be present in other binaries when we apply some intuition about why we found what we did. Confounding variables that we have not accounted for could be causing us to make conclusions about certain variables that do not truly have an effect on the tools' outputs.
One example of this would be a correlation between the developers who work on system-based binaries as opposed to network-based binaries - perhaps many of these binaries were developed by a small set of developers, and the teams were split based on the domain of the binary. Perhaps the statically compiled binaries we collected all came from the same developer. These are unlikely situations, but not impossible. As we noted earlier, the compiler factor has some oddities. Most binaries do not include information about the compiler used to create them. This leads to us categorizing many binaries as having an unknown compiler. We cannot be sure that the two categories ("GCC" and "Unknown") are mutually exclusive - therefore, we lose some power in the conclusions we may draw about this factor. That being said, the logical reasons that the compiler would cause a systematic difference in tool output are 1) the compiler or compilation process introduces a vulnerability, 2) certain compiler-specific optimizations cause patterns that are flagged as false positives more often, or 3) the compiler is confounded with some other factor such as development environment. None of these mechanisms appears to be supported by our models, in which the compiler was never a significant predictor. Therefore, we may be fairly certain that the compiler is not necessary to consider when building a benchmark repository.

cwe Checker Model Threats

The cwe Checker model appears to fit the data quite well; however, there is some evidence against the assumptions of the linear model we fit. The residuals do not show any severe patterns in the residuals vs. fitted plot in Figure 7.7, but there may be a trend of low predictions having a large negative residual. Additionally, while the residuals are not perfectly normally distributed, we expect that the effect of the lack of normality in the residuals is negligible when we consider how low the p-values are.

CVE-Bin-Tool Model Threats

This model has many threats to validity. The model was eventually fit as a logistic regression model predicting whether a binary would have findings or not. Our model predicts every point in our data set as not having findings. This means that we cannot rely on this model for prediction, and we have low confidence in any conclusions we make using this model. For this reason, we have chosen not to make any conclusions based upon this model. However, any conclusions we would make about correlated factors for tool output are confirmed by the other models we fit.

Yara-Rules Model Threats

One threat to validity is that we chose not to utilize a more complex model in favor of a simpler model that is easier to interpret. Typical statistical guidelines would say that an AIC difference of 5 or more is significant [20]; however, the added complexity of a second model component is not fully accounted for by the AIC.

Sensitivity to Weighting

The model output is sensitive, to a large extent, to the weighting provided by stakeholders. The weighting in the model is done in two layers: at the QA to TQI level, which is done through pairwise comparisons and the AHP, and at the PF to QA layer, which is manually weighted. These weights for PIQUE-Bin have been previously shown in Table 5.2 and Table 5.3. These weights were assigned from the perspective of the researcher, but they could be changed entirely by some other stakeholder. Changes to the weighting from the PF to QA layer can have a drastic impact on the TQI - potentially, it can take the score from close to 1 to 0. This is because the weighting may be modified such that the findings that do occur are given high priority for the QAs that carry the most weight in the TQI. For example, consider applying PIQUE-Bin to some project where there is only a single finding across all diagnostics in the model. This single finding causes a measure to evaluate to 0. As it is weighted now, this would likely have a small impact (as we will see later, likely less than a 0.01 change to the TQI). Now consider that we re-weight the PF to QA layer such that this diagnostic is the only diagnostic that aggregates to eventually impact all QAs but non-repudiation. Then all QAs will evaluate to zero except one, and we receive a score of 0.048 (the current weight from non-repudiation to the TQI).
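The re-weighting scenario above can be reproduced with a few lines of R. The TQI is a weighted sum of the six STRIDE-derived quality aspect scores; the QA scores and most of the weights below are made up for illustration, and only the 0.048 weight on non-repudiation is taken from the text.

```r
# Illustrative sketch: TQI as a weighted sum of quality aspect (QA) scores.
# All values are hypothetical except the 0.048 non-repudiation weight.
qa_weights <- c(confidentiality = 0.200, integrity = 0.220, availability = 0.180,
                authentication = 0.170, authorization = 0.182,
                non_repudiation = 0.048)

# Re-weighted worst case: the single finding drives every QA it reaches to 0,
# leaving only non-repudiation at 1.
qa_scores <- c(confidentiality = 0, integrity = 0, availability = 0,
               authentication = 0, authorization = 0, non_repudiation = 1)

sum(qa_weights * qa_scores)   # 0.048: the weight of the one unaffected QA
```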
We choose not to investigate the weighting from the PF to QA layer with a case study or experiment because the weighting of that layer (in the context of PIQUE-Bin) should be done as objectively as possible and should not change across projects or domains. This is because, regardless of the domain, PFs will impact the same QAs - for example, 'Authentication Errors (CWE-1211)' will always impact authentication and only authentication. We have shown that a change in weighting at this layer may completely change the TQI, so this weighting should be handled with care.

The weighting between the QA layer and the TQI provides some interesting context as well. The TQI is calculated as a weighted sum of the QAs. Therefore, the value of the TQI is limited to be between the minimum and maximum values that the QAs take on. We may also see the weighting at this layer change the TQI from 1 to 0 or vice versa if there are QAs that take on values of 1 and 0. We will investigate the potential impact of weighting at this layer using the case study performed earlier on the Busybox binaries, as detailed in chapter 6. As seen in Table 6.2, we have a set of TQI values and QA scores. Using these values, we can determine the maximum possible change in the TQI based upon the weighting by taking the difference between the minimum and maximum QA value for each assessment. This can be seen visually in Figure 7.10, where each assessment has a point for the minimum and maximum possible value achieved by changing only the weighting for each version of Busybox.

Figure 7.10: The Maximum and Minimum Possible Value for Each Assessment, Based Upon Weighting

To summarize the possible changes in TQI due to weighting, some summary statistics based on the Busybox assessments follow. The smallest maximum possible change in TQI among these assessments is 0.02, which occurs in the assessment of Busybox version 1.28.4. The largest maximum possible change in TQI among these assessments is 0.045, which occurs in version 1.21.1. The mean maximum possible change is 0.036. The sensitivity is a rather small value due to the nature of the vulnerabilities that have been found in these binaries. Many findings have aggregated to impact all of the QA nodes. This is often the case when weakness categories may lead to the arbitrary execution of code. Arbitrary execution of code will compromise all QAs in PIQUE-Bin, causing the QA nodes in the model to tend towards the same value. One way to solve this would be to add a node at the QA level that accounts for the property that is compromised by arbitrary execution of code, allowing the composition of the QAs to be more orthogonal. This could be something such as 'control'. However, this QA would likely always be the most important and would therefore reduce the impact of the other QA nodes by a significant amount.
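Because the TQI is a convex combination of the QA scores, its largest possible swing under re-weighting for a given assessment is simply the spread of the QA scores. A tiny sketch, with made-up QA score vectors standing in for the Table 6.2 values:

```r
# Illustrative sketch: maximum possible change in TQI from re-weighting alone.
# The QA score vectors are hypothetical; the real values appear in Table 6.2.
assessments <- list(
  busybox_example_a = c(0.512, 0.498, 0.505, 0.520, 0.503, 0.530),
  busybox_example_b = c(0.471, 0.466, 0.470, 0.468, 0.490, 0.486)
)
sapply(assessments, function(qa) max(qa) - min(qa))
```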
Another cause of the small maximum possible change is the small number of findings in more recent versions. The reason the binaries do not receive a score close to 1 is the default behavior of measures with [0, 0] thresholds and no findings. These nodes currently evaluate to 0.5 in the model, so the maximum score is very close to the score of busybox version 1.30.1lite.

Sensitivity to Single Findings

Another area of importance is PIQUE-Bin's sensitivity to single vulnerabilities within a binary. This will help define stakeholders' expectations of normal fluctuations in the TQI score. Additionally, this will give greater context to what goes into a score. For instance, does a score of 0.5 indicate many findings or few findings? To identify the impact of a single vulnerability, we perform the following: run PIQUE-Bin on a binary and record the TQI score. Now, for each possible finding from our tools, we re-run PIQUE-Bin on the same binary with that weakness injected into the binary. All weaknesses are injected with a severity of 1. This is standard for findings from cwe checker and yara-rules; however, cve-bin-tool reports every finding with a severity between 0 and 10, derived from the CVSS score of a vulnerability. We then take the difference between the original TQI and the new TQI found after injecting a vulnerability. This gives the impact of each finding on the TQI. The impact of a single finding on the TQI does not change depending on the TQI - that is, a finding that changes the score by 0.01 will change a score of 0.5 and a score of 1 equally, to get 0.49 and 0.99. The only case in which this does not occur is when a QA that the finding aggregates into already evaluates to 0, because the score cannot drop any lower. This is a desirable quality that allows stakeholder priorities to properly impact the TQI.

Results

Following the above procedure gives an impact for each finding, which may be seen split across three figures: Figure 7.11, Figure 7.12, and Figure 7.13. The average impact of each product factor (CWE category) may be seen in Figure 7.14. The average impact is 0.008, the minimum is 0.00002, and the maximum is 0.085. There is one outlying finding which has an impact of 0.52 and has not been included in this analysis. This finding is 'CWE-560 Weakness Diagnostic', which is the cwe Checker finding for 'CWE-560: Use of umask() with chmod-style Argument'.

Figure 7.11: Impact On TQI For A Single Finding From Each Diagnostic, Part 1

Discussion

Overall, we see that the impacts are all positive (every injected finding lowers the TQI), which is a desirable outcome. Some findings have an impact very close to 0, which is caused by model structure, benchmark thresholds, and stakeholder priorities. The diagnostic for CWE-560 has a very high impact. There are two primary reasons for this. The first reason is that in the benchmarking process this finding was incredibly rare, leading to threshold values of [0.0, 0.0406]. This means that even a single severity 1 finding will evaluate to −23.5.
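Assuming the linear-interpolation utility function described in the model development chapter (an assumption here, written with thresholds t₁ and t₂ and allowed to extrapolate below zero), the value of this measure for a single severity 1 finding works out directly; the small difference from the −23.5 quoted above comes from rounding of the reported threshold.

```latex
v(x) = \frac{t_2 - x}{t_2 - t_1}, \qquad
v(1) = \frac{0.0406 - 1}{0.0406 - 0.0} \approx -23.6
```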
The second reason is that this finding aggregates into the 'CWE-1006: Bad Coding Practices' common weakness category, a category that only has two diagnostics from our tools. This causes these diagnostics to aggregate with higher impact due to receiving a higher edge weighting when aggregating from the measure layer into the product factor layer. To exemplify this second reason, consider a PF with three measures - A, B, and C. Then consider a PF with only two measures, X and Y. Consider a finding under measure A such that the value of A is 0 and there are no findings for B or C, giving values of 1. Then we evaluate the associated product factor to $0 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{3} = \tfrac{2}{3}$. Then consider the second scenario with two measures, and let X have a finding such that X evaluates to 0 and Y evaluates to 1. Then we evaluate the associated product factor to $0 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{2} = \tfrac{1}{2}$. The single measure that evaluates to 0 in both scenarios causes a lower score when it aggregates with fewer other nodes.

Figure 7.12: Impact On TQI For A Single Finding From Each Diagnostic, Part 2

These two conditions allow the CWE-560 finding to have a huge impact on our model, which is important to know as a stakeholder or developer interpreting the results of any model application. Sudden and large changes in TQI may be due to severe vulnerabilities or could happen due to a single high-impact finding. One other idea to keep in mind is that this analysis is done with all severity 1 findings; however, cve-bin-tool gives findings with severity between 1 and 10, meaning that a finding could potentially have 10 times the impact we see in Figure 7.11, Figure 7.12, and Figure 7.13. Severity 1 findings are incredibly rare from cve-bin-tool, so the impacts we estimate here significantly underestimate the likely impact of a cve-bin-tool finding.

Figure 7.13: Impact On TQI For A Single Finding From Each Diagnostic, Part 3

These results do hint that many nodes may have close to 0 impact, while others may have very large impacts. Whether this is a desired behavior depends upon the stakeholders who are utilizing the model. However, these impacts are a product of the entire process applied using PIQUE-Bin, and as such, this is to be expected. Findings have little impact due to weighting, model structure, or high frequency within the benchmark repository. The impact of findings may be changed by altering any one of these factors.

Figure 7.14: Mean Impact On TQI By Category

Threats to Validity

Internal Validity

Internal validity represents our ability to claim causation between our explanatory and response variables. We know that the model's change in score is directly caused by the explanatory variables, so there are no threats here.

External Validity

This analysis is done for our specific model structure, with our stakeholder weighting and our benchmark repository. A change in any one of these factors may significantly change the outcome of the experiment, and therefore we may not extend these results to any other models or scenarios. Despite that, we may still use these results to guide interpretation of the model in some cases, and small changes to the model are unlikely to change the results so much that they are useless for interpretation. Additionally, this impact will remain the same for this model regardless of what binary is being analyzed.

Construct Validity

Threats to construct validity would indicate that we are not measuring variables in a way that reflects reality.
The primary threat to construct validity for this sensitivity analysis is that we are measuring the impact of each vulnerability occurring a single time, which may not be representative of the average change in TQI for a model in practice. Perhaps software typically undergoes the addition or removal of dozens of vulnerabilities at a time, in which case we could see a large range of possible changes in TQI, and therefore this analysis would not inform the interpretation of the output. Despite this, we still claim that this sheds light upon the expected change in TQI, as we know that a large or small change may be indicative of several vulnerabilities or a single one and should be investigated further. PIQUE lends itself to easily determining what causes a change in TQI by categorizing weaknesses. In addition, a visualizer for PIQUE is under active development, which will make the investigation of changes in TQI easier.

Conclusion Validity

We have not made any conclusions about the impact of vulnerabilities in general, so there are minimal conclusion threats. However, we should note that the data presented in this analysis is limited to the specific context of PIQUE-Bin, as we mentioned when addressing external validity.

THREATS TO VALIDITY

In addition to threats addressed in specific sections, we note overarching threats to the validity of PIQUE-Bin. The validity in question is that of the model PIQUE-Bin. We address internal, external, construct, and conclusion validity.

Internal Validity

Internal threats are threats to the study's assertions of a causal relationship. One inherent assumption in PIQUE-Bin is that findings have a negative impact on security. In almost all cases this is evident, because vulnerabilities do not improve the security posture of a binary. However, it could be argued that certain findings may not always indicate worse security - for instance, consider the category from the yara-rules tool that represents capabilities of the binary. It is possible that a certain set of capabilities may be worse in terms of security than another set even if it is smaller in number, but we have no way of accounting for this. This would require experts to weigh in on the severity of each finding, which would reduce the autonomy of PIQUE-Bin but allow us to have greater confidence in the model. In a similar way, we assume that the effect of findings is linear - that is, two findings have twice the impact of one finding and four findings have twice the impact of two findings. This may not always be the case. Perhaps findings should be put on some other scale, such as an exponential curve, which assumes that as we get more findings, additional findings should have more severity. This would account for some sort of compounding effect of vulnerabilities enabling other vulnerabilities. However, different scales all come with their own assumptions, and the linear case is the simplest approach.

External Validity

External threats are threats to the study's ability to generalize to other settings and be replicated. We have shown that PIQUE-Bin is able to assess security for busybox binaries over time. Threats to external validity represent threats to our ability to apply PIQUE-Bin to other binaries with similar success. PIQUE-Bin will likely be valid for binaries that are similar to busybox, and we have no reason to suspect that it would fail to assess other binaries.
However, as we discovered in our tool output sensitivity analysis, we must be wary of interpretation with respect to the differences between the binary under analysis and the benchmark repository. Additionally, we found that the cwe checker tool was inconsistent at times and was unable to analyze all binaries for unknown reasons. This discovery resulted in one bug being discovered in the cwe checker tool and one potential bug being discovered in the NSA's Ghidra tool. One threat is that a tool may fail to run on a binary, in which case we cannot rely on the output of PIQUE-Bin for any comparison purposes.

Construct Validity

Construct threats are threats to the study's ability to represent and measure variables in a way that reflects reality. The primary threats to validity are concerned with the process of aggregation that occurs in PIQUE-Bin. We are unsure if the sum of CVSS values of vulnerabilities represents the best way to analyze the impact of a set of vulnerabilities. This approach assumes that two severity 5 vulnerabilities are equivalent to a severity 10 vulnerability. It is not stated in the CVSS user guide that the NVD references (https://www.first.org/cvss/user-guide) whether this approach is valid or not. However, this is one of the simplest approaches, which is why it was chosen. We also see issues in the way that aggregation occurs and the way certain vulnerabilities impact the quality aspects. That is, a vulnerability may be classified under one CWE because that is the weakness it exploits, but it may have different impacts from the typical impacts of that CWE. Additionally, we may see improper impacts when aggregating from a base-level CWE to a CWE category, as in the case of aggregating both CWE-353: 'Missing Support for Integrity Check' and CWE-349: 'Acceptance of Extraneous Untrusted Data With Trusted Data' into CWE-1214: 'Data Integrity Issues'. CWE-353 has impacts on integrity and non-repudiation, while CWE-349 has impacts on access control and integrity. Both of these are aggregated into the same CWE category, where they go on to impact quality aspects in the same way. This is a consequence of aggregation and has been mitigated as much as possible through proper choice of the structure, as well as by attempting to account for this when determining the impact of a category on the quality aspect nodes. We rely on the expertise of the MITRE organization in the way that these CWEs have been structured, as well as on reporters of CVEs to properly and accurately categorize CVEs under the correct CWE. Failure of this process means that PIQUE-Bin will improperly aggregate a finding. This is another reason that interpreters of the output of PIQUE-Bin must be wary of any changes in TQI and investigate the cause. Other threats to construct validity include our decision to include measures which receive [0, 0] thresholds. This is a threat to our ability to measure the impact of these measures, because we do not have information on how common such findings are within our benchmark repository. We acknowledge that the existence of these nodes is not ideal, but leaving them out of the model may miss critical information - ignoring these findings would be detrimental because they can represent some of the most interesting findings due to their rarity. Some findings may be very rare due to how severe they are, such as plain-text passwords.

Conclusion Validity

Conclusion threats are threats to a study's ability to draw conclusions.
We acknowledge that the conclusions we may draw from our work are very limited due to our validation methods and limited ability to randomly sample. We have avoided conclusion threats by staying within scope when making conclusions. However, we are able to achieve our research goal while making minimal conclusions due to the lack of existing research in this space.

CONCLUSION

In this thesis, we present the design, development, and validation of a hierarchical quality model for the analysis of security quality in binaries by utilizing the PIQUE platform. Our research followed the goal question metric paradigm with an overall research goal of improving our ability to assess the security quality of binaries from a stakeholder's perspective. We began by briefly covering relevant topics as well as supporting work. This includes the state of binary analysis and some examples of binary analysis, software quality modeling, vulnerability management, and the operational technology environment. The supporting work includes the quality modeling approaches presented by Quamoco, QATCH, and the framework PIQUE. We also presented an informal literature review on quantitative model-based security metrics for binaries which found no papers, leading to the conclusion that this area of study is largely under-researched. We present our research goal in Basili's Goal-Question-Metric format [9]. This process includes defining several questions, the answers to which will lead us to achieving our goal. We proceed to present and systematically answer these questions through metrics quantified in the following chapters through an informal literature review, an overview of the design and development of PIQUE-Bin, case studies, and experiments for sensitivity analysis.

The design and development of PIQUE-Bin utilizes well-established security resources that enable the model to easily incorporate a variety of tools while drawing on the knowledge base of security research to categorize and aggregate findings. The model utilizes the Microsoft STRIDE model to define the nodes at the highest level, and below that uses the Common Weakness Enumeration software development view (CWE-699) to define the remaining structure. Three tools are integrated with PIQUE-Bin: CVE-Bin-Tool, cwe Checker, and YARA. CVE-Bin-Tool finds known vulnerabilities in third-party libraries by searching for associated strings. cwe Checker disassembles a binary using the NSA's Ghidra tool, then identifies patterns commonly associated with CWEs. YARA is a simple pattern searcher. We use this tool in combination with a repository of security-related patterns that help to identify malicious code within a binary as well as some other concerning categories of findings, such as the capabilities a binary has. These three tools cover a variety of security findings, giving a better picture of security together than any one tool individually.

We apply an early version of the model to two Wireshark binaries for initial exploration of the model's usage. We find unsatisfactory results in this version of the model, and identify how it might be improved before continuing with exemplary applications of the model. We then apply the latest version of PIQUE-Bin to several versions of the Busybox Linux utility binary. We find satisfactory results that indicate the binary has improved in security quality over time as expected. We see that the issues identified in the application of PIQUE-Bin to Wireshark have been resolved and that the model performs satisfactorily.
Finally, we note that there are several versions that do not see a change in score due to a lack of tooling that identifies the changes between those binary versions. As such, the model may be improved through the use of more tools. This comes at the cost of ease of use and the cost of integrating each tool.

Model validation is performed primarily through sensitivity analysis. We perform sensitivity analysis in three different ways. First, we analyze the sensitivity of the tools' output to attributes of the binaries they are run on. Second, we briefly touch on the sensitivity of the model's TQI to the weighting of the lower layers. Then, we perform sensitivity analysis of the TQI to individual findings. We find that the tools' output is correlated with different attributes depending on the tool. We tested this by first gathering attributes of benchmark binaries, including size, domain (network or system), static compilation, and compiler. We then ran each tool on the binaries and found the count of findings for each. We then fit a model for each tool where the response variable was the number of findings and the explanatory variables were the attributes of the binaries. After assessing the data and the assumptions of the models, we found that a different type of statistical model was required for each tool. Each tool also had different significant predictor variables, implying that different attributes must be considered for different tools. We found that all attributes aside from compiler were correlated with at least one tool's output count. This leads to the conclusion that when selecting a benchmark repository, all of these attributes should be considered aside from the compiler. We acknowledge that there are other important variables that are not considered in this study, such as the age of the binary. Because we have different findings for each tool, it is clear that more study of these relationships for additional tools, binaries, and attributes would bring more clarity to this process. However, we do know that it is important to consider the makeup of the benchmark repository as it is put together and as the model is interpreted.

The sensitivity of the model's TQI to weighting is quite large. We showed that in most cases the model's output can be dropped to 0 by changing the weighting of the lower layers, meaning that the weighting must be handled with care. Additionally, we analyzed the impact of weighting at the highest layer, where the TQI score is limited to the range between the lowest-valued quality aspect and the highest-valued quality aspect. The impact of individual findings is also investigated. We performed this analysis by running an initial analysis on a binary, then injecting a vulnerability and re-running the analysis. We do this for every diagnostic in the model. This gives a resulting change in TQI for each finding, allowing us to identify high- and low-impact findings. This also brings context to interpretation and to what changes in TQI should be concerning.

We have designed, developed, and validated a model for the assessment of security quality in binary files. Due to the lack of similar models, we have successfully improved our ability to assess the security quality of a binary without needing to compare our model to others. There will always be additional work to be done in validating this model further, but the applications and validation undertaken to date are very promising.

REFERENCES

[1] Ehsan Aghaei, Waseem Shadid, and Ehab Al-Shaer.
ThreatZoom: CVE2CWE using Hierarchical Neural Network. Technical report. [2] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974. [3] Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected papers of hirotugu akaike, pages 199–213. Springer, 1998. [4] Kenneth Alperin, Allan Wollaber, Dennis Ross, Pierre Trepagnier, and Leslie Leonard. Risk prioritization by leveraging latent vulnerability features in a contested environment. Proceedings of the ACM Conference on Computer and Communications Security, pages 49–57, 2019. [5] Omer Aslan and Refik Samet. A Comprehensive Review on Malware Detection Approaches. IEEE Access, 8:6249–6271, 2020. [6] Andrew Austin and Laurie Williams. One technique is not enough: A comparison of vulnerability discovery techniques. International Symposium on Empirical Software Engineering and Measurement, pages 97–106, 2011. [7] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering, 41(5):507–525, 2015. [8] F. Hutton Barron and Bruce E. Barrett. Decision Quality Using Ranked At- tribute Weights. http://dx.doi.org.proxybz.lib.montana.edu/10.1287/mnsc.42.11.1515, 42(11):1515–1523, nov 1996. [9] Victor R Basili, Gianluigi Caldiera, and H Dieter Rombach. The goal question metric approach. Encyclopedia of Software Engineering, 2:528–532, 1994. [10] Luca Caviglione, Michal Choras, Igino Corona, Artur Janicki, Wojciech Mazurczyk, Marek Pawlicki, and Katarzyna Wasielewska. Tight Arms Race: Overview of Current Malware Threats and Trends in Their Detection. IEEE Access, pages 5371–5396, 2020. [11] Kai Cheng, Qiang Li, Lei Wang, Qian Chen, Yaowen Zheng, Limin Sun, and Zhenkai Liang. DTaint: Detecting the Taint-Style vulnerability in embedded device firmware. Proceedings - 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2018, pages 430–441, 2018. [12] Fred Cohen. Computer viruses. Theory and experiments. Computers and Security, 6(1):22–35, feb 1987. 101 [13] Drew Davidson, Benjamin Moench, Somesh Jha, Thomas Ristenpart, Drew Davidson, and Thomas Ristenpart. FIE on Firmware. Proceedings of the 22nd USENIX Security Symposium, pages 463–478, 2013. [14] Xiaoning Du, Bihuan Chen, Yuekang Li, Jianmin Guo, Yaqin Zhou, Yang Liu, and Yu Jiang. LEOPARD: Identifying Vulnerable Code for Vulnerability Assessment Through Program Metrics. Proceedings - International Conference on Software Engineering, 2019-May:60–71, 2019. [15] Katheryn A Farris, Ankit Shah, George Cybenko, Rajesh Ganesan, and Sushil Jajodia. VULCON: A system for vulnerability prioritization, mitigation, and management. ACM Transactions on Privacy and Security, 21(4), 2018. [16] Barbara Filkins, Doug Wylie, and Jason Dely. 2019 State of OT/ICS Cybersecurity Survey. SANS Institute, (June), 2019. [17] William Fleshman, Edward Raff, Richard Zak, Mark McLean, and Charles Nicholas. Static malware detection and subterfuge: Quantifying the robustness of machine learning and current anti-virus. arXiv, pages 3–12, 2018. [18] Christian Frühwirth and Tomi Männistö. Improving CVSS-based vulnerability prioriti- zation and response with context information. In 2009 3rd International Symposium on Empirical Software Engineering and Measurement, ESEM 2009, pages 535–544, 2009. [19] Ilja Heitlager, Tobias Kuipers, and Joost Visser. A Practical Model for Measuring Maintainability. 
pages 30–39, 2007. [20] Joseph M. Hilbe. Negative Binomial Regression. Cambridge University Press, 2 edition, 2011. [21] Hannes Holm and Khalid Khan Afridi. An expert-based investigation of the Common Vulnerability Scoring System. Computers and Security, 53:18–30, 2015. [22] Hannes Holm, Mathias Ekstedt, and Dennis Andersson. Empirical analysis of system- level vulnerability metrics through actual attacks. IEEE Transactions on Dependable and Secure Computing, 9(6):825–837, 2012. [23] Hannes Holm, Teodor Sommestad, Jonas Almroth, and Mats Persson. A quantitative evaluation of vulnerability scanning. Information Management and Computer Security, 19(4):231–247, 2011. [24] Chien-Cheng Huang, Feng-Yu Lin, Frank Yeong-Sung Lin, and Yeali S Sun. A novel approach to evaluate software vulnerability prioritization. The Journal of Systems and Software, 86:2822–2840, 2013. 102 [25] ISO/IEC 25010:2011 Systems and software engineering: Systems and software Quality Requirements and Evaluation (SQuaRE)—System and software quality models. Stan- dard, International Organization for Standardization, Geneva. [26] Ver Hoef JM and Boveng PL. Quasi-Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology, 88(11):2766–2772, nov 2007. [27] Pontus Johnson, Robert Lagerstrom, Mathias Ekstedt, and Ulrik Franke. Can the common vulnerability scoring system be trusted? A Bayesian analysis. IEEE Transactions on Dependable and Secure Computing, 15(6):1002–1015, 2018. [28] James A. Kupsch, Elisa Heymann, Barton Miller, and Vamshi Basupalli. Bad and good news about using software assurance tools. Software - Practice and Experience, 47(1):143–156, 2017. [29] Ralph Langner. Stuxnet: Dissecting a cyberwarfare weapon. IEEE Security and Privacy, 9:49–51, 2011. [30] Quan Le, Oiśın Boydell, Brian Mac Namee, and Mark Scanlon. Deep learning at the shallow end: Malware classification for non-domain experts. Proceedings of the Digital Forensic Research Conference, DFRWS 2018 USA, 26:S118–S126, 2018. [31] Zhiyi Li, Mohammad Shahidehpour, and Farrokh Aminifar. Cybersecurity in Dis- tributed Power Systems. Proceedings of the IEEE, 105(7):1367–1388, 2017. [32] Stephen Mathezer. Introduction to ICS security Part 2. 2015. [33] Luallen Matthew. Breaches on the Rise in Control Systems: A SANS Survey. SANS Institute, (April):31, 2014. [34] Stephen McLaughlin, Charalambos Konstantinou, Xueyang Wang, Lucas Davi, Ah- mad Reza Sadeghi, Michail Maniatakos, and Ramesh Karri. The Cybersecurity Landscape in Industrial Control Systems. Proceedings of the IEEE, 104(5):1039–1057, 2016. [35] Daniel Mellado, Eduardo Fernández-Medina, and Mario Piattini. A comparison of software design security metrics. ACM International Conference Proceeding Series, (c):236–242, 2010. [36] Thomas Panas and Daniel Quinlan. Techniques for software quality analysis of binaries: Applied to Windows and Linux. DEFECTS 2009 - Proceedings of the 2nd International Workshop on Defects in Large Software Systems, Held in conjunction with the ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2009, (May):6–10, 2009. [37] Edward Raff, Jared Sylvester, and Charles Nicholas. Learning the PE header, malware detection with minimal domain knowledge. arXiv, pages 121–132, 2017. 103 [38] Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean, and Charles Nicholas. An investigation of byte n- gram features for malware classification. 
Journal of Computer Virology and Hacking Techniques, 14(1):1–20, 2018. [39] Alex Ramos, Marcella Lazar, Raimir Holanda Filho, and Joel J.P.C. Rodrigues. Model- Based Quantitative Network Security Metrics: A Survey. IEEE Communications Surveys and Tutorials, 19(4):2704–2734, 2017. [40] David Rice. An extensible, hierarchical architecture for analysis of software quality assurance. Master’s thesis, Montana State University, 12 2020. [41] Manuel Rudolph and Reinhard Schwarz. A critical survey of security indicator approaches. Proceedings - 2012 7th International Conference on Availability, Reliability and Security, ARES 2012, pages 291–300, 2012. [42] Thomas Saaty. Decision making with the analytic hierarchy process. Int. J. Services Sciences Int. J. Services Sciences, 1:83–98, 01 2008. [43] Riccardo Scandariato, James Walden, and Wouter Joosen. Static analysis versus penetration testing: A controlled experiment. 2013 IEEE 24th International Symposium on Software Reliability Engineering, ISSRE 2013, pages 451–460, 2013. [44] Yan Shoshitaishvili, Ruoyu Wang, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. Firmalice - automatic detection of authentication bypass vulnerabilities in binary firmware. Proceedings of the Network and Distributed System Security Symposium, NDSS 2015, (February):8–11, 2015. [45] Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. SOK: (State of) the Art of War: Offensive Techniques in Binary Analysis. Proceedings - 2016 IEEE Symposium on Security and Privacy, SP 2016, pages 138–157, 2016. [46] Miltiadis G. Siavvas, Kyriakos C. Chatzidimitriou, and Andreas L. Symeonidis. QATCH - An adaptive framework for software product quality assessment. Expert Systems with Applications, 86:350–366, 2017. [47] Diomidis Spinellis. Reliable identification of bounded-length viruses is NP-complete. IEEE Transactions on Information Theory, 49(1):280–284, jan 2003. [48] W. P. Stevens, G. J. Myers, and L. L. Constantine. Structured design. IBM Syst. J., 13(2):115–139, June 1974. [49] Keith Stouffer, Victoria Pillitteri, Suzanne Lightman, Marshall Abrams, and Adam Hahn. Guide to Industrial Control Systems (ICS) Security NIST Special Publication 800-82 Revision 2. NIST Special Publication 800-82 rev 2, pages 1–157, 2015. 104 [50] Melanie Tupper and A. Nur Zincir-Heywood. VEA-bility security metric: A network security analysis tool. In ARES 2008 - 3rd International Conference on Availability, Security, and Reliability, Proceedings, pages 950–957, 2008. [51] R. Vinayakumar, Mamoun Alazab, K. P. Soman, Prabaharan Poornachandran, and Sitalakshmi Venkatraman. Robust Intelligent Malware Detection Using Deep Learning. IEEE Access, 7:46717–46738, 2019. [52] Stefan Wagner, Andreas Goeb, Lars Heinemann, Michael Kläs, Constanza Lampasona, Klaus Lochmann, Alois Mayr, Reinhold Plösch, Andreas Seidl, Jonathan Streit, and Adam Trendowicz. Operationalised product quality models and assessment: The Quamoco approach. Information and Software Technology, 62(1):101–123, 2015. [53] Stefan Wagner, Klaus Lochmann, Lars Heinemann, Michael Kläs, Adam Trendowicz, Reinhold Plösch, Andreas Seidl, Andreas Goeb, and Jonathan Streit. The Quamoco Product Quality Modelling and Assessment Approach. Technical report. 
[54] Stefan Wagner, Klaus Lochmann, Sebastian Winter, Florian Deissenboeck, Elmar Juergens, Lars Heinemann, Michael Kläs, Adam Trendowicz, Jens Heidrich, Reinhold Ploesch, Andreas Goeb, Christian Koerner, and Christian Schubert. The Quamoco Quality Meta-Model. [55] Theodore J. Williams. The Purdue enterprise reference architecture. Computers in Industry, 24(2-3):141–158, sep 1994.