This website reflects only the author's view and is his sole responsibility. The European Commission's Research Executive Agency is not responsible for any use that may be made of the information it contains.
Projection of Security Vulnerabilities caused by Exploits in Dependencies (ProSVED) is a Horizon Europe MSCA Postdoctoral Fellowship hosted at the University of Trento, Italy, and led by Carlos E. Budde and Fabio Massacci.
Estimating the amount and severity of security vulnerabilities in code is essential for software quality and control. Purely-empirical approaches lack prognosis versatility, as they are generally applicable to the codebase used for training, with extrapolation capabilities that degrade over time. In turn, traditional formal-modelling approaches (“the other side of the spectrum”) depend on assumptions that do not hold in the field, such as the independence between a codebase and its following version.
ProSVED generates quantitative forecasts about the emergence of security vulnerabilities in (third-party) open-source code, which propagate via software dependencies to threaten entire projects. Its main goal is to introduce theoretical and practical methodologies for the prognosis of vulnerability-propagation in software, that can model the full stack of third-party libraries underlying the codebase of a software project.
Time Dependency Trees: This calls for lightweight representations of the evolution of software in time, and of the acyclic interdependency of codebases in projects that can use hundreds of third-party libraries, where a single vulnerability can bring it all down. ProSVED organises this complexity at the high-abstraction layer of library dependency and evolution, generating a directed acyclic graph that represents the evolution in time of a dependency tree: a Time Dependency Tree. Forecasting analyses can then proceed by harvesting the data available about past vulnerabilities, effectively implementing a time-series study in which the nodes of the Time Dependency Tree DAG are labelled with the quantitative estimates produced.
Time Dependency Trees provide the skeleton on which vulnerability propagation can be analysed. These propagations can be deterministic by code use, or probabilistic by code evolution. An example of deterministic propagation is dependency inclusion: if library a₁ has a dependency d₂, then running a₁ will at some point execute code from d₂, which means that an exploit for d₂ can be used as an exploit for a₁. An example of probabilistic propagation is the persistence of the codebase across source code development: if d₃ is the version of library d released as the successor of d₂, and a vulnerability is found in d₃, there is a non-zero (and typically quite large) probability that d₂ is affected by the same vulnerability.
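The two kinds of propagation can be illustrated with a minimal sketch. It is not ProSVED's implementation: the dictionaries `uses` and `succ`, the function names, and the single persistence probability `p_persist` are all assumptions made for illustration, and the toy graph only mirrors the a₁/d₂/d₃ example above.

```python
from collections import defaultdict

# Hypothetical toy Time Dependency Tree: nodes are (library, version) pairs.
# "uses" edges model dependency inclusion (deterministic propagation).
# "succ" edges link a version to its successor (probabilistic propagation).
uses = {
    ("a", 1): [("d", 2)],
    ("a", 2): [("d", 3)],
}
succ = {
    ("d", 2): ("d", 3),
    ("a", 1): ("a", 2),
}


def dependents_of(node, uses):
    """Return every node that depends on `node`, directly or transitively."""
    reverse = defaultdict(list)
    for user, deps in uses.items():
        for dep in deps:
            reverse[dep].append(user)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for user in reverse[current]:
            if user not in seen:
                seen.add(user)
                stack.append(user)
    return seen


def exposure(vulnerable, uses, succ, p_persist):
    """Map nodes to the probability of being exposed to a vulnerability
    found in `vulnerable`.

    Assumption: each step back to a predecessor version keeps the
    vulnerability with the same probability `p_persist`.
    """
    pred = {new: old for old, new in succ.items()}
    result = {vulnerable: 1.0}
    node, prob = vulnerable, 1.0
    # Probabilistic propagation: walk back through earlier versions.
    while node in pred:
        node, prob = pred[node], prob * p_persist
        result[node] = prob
    # Deterministic propagation: dependents inherit the exposure.
    for source, p in list(result.items()):
        for user in dependents_of(source, uses):
            result[user] = max(result.get(user, 0.0), p)
    return result


risk = exposure(("d", 3), uses, succ, p_persist=0.8)
```

Here a vulnerability found in d₃ exposes a₂ with certainty (it includes d₃), while d₂ and its user a₁ are exposed with the assumed persistence probability.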
Attack Trees: By their acyclic nature, both in the dependency and time dimensions, Time Dependency Trees (TDTs) are close to the standard modelling formalism known as Attack Trees (ATs), and in fact an injection can be made from TDT models to AT models. ATs are used in event-based representations of progressive attacks, much like the propagation of vulnerabilities across a chain of code dependencies, and come with a plethora of efficient algorithms for the quantification of security properties such as the probability or minimum time to attack. The representation of TDTs as ATs leverages these memory- and runtime-optimal algorithms.
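As an illustration of the kind of quantification that ATs support, the sketch below computes the attack probability of a small tree bottom-up. It is a textbook-style computation, not the specific algorithms used by ProSVED: it assumes a tree-shaped AT (no shared subtrees) with statistically independent leaves, and the dictionary encoding, the gate names, and the leaf probabilities are hypothetical.

```python
from math import prod


def attack_probability(node):
    """Bottom-up attack probability of a tree-shaped attack tree.

    Assumptions: leaves are independent and no subtree is shared.
    A leaf is {"p": float}; a gate is {"gate": "AND" | "OR", "children": [...]}.
    """
    if "p" in node:
        return node["p"]
    child_probs = [attack_probability(child) for child in node["children"]]
    if node["gate"] == "AND":
        # All children must succeed.
        return prod(child_probs)
    if node["gate"] == "OR":
        # At least one child must succeed.
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate: {node['gate']!r}")


# Hypothetical tree: the root is attacked via one leaf, or via two leaves jointly.
tree = {
    "gate": "OR",
    "children": [
        {"p": 0.1},
        {"gate": "AND", "children": [{"p": 0.5}, {"p": 0.4}]},
    ],
}

root_probability = attack_probability(tree)  # about 0.28
```

When subtrees are shared, as can happen in DAG-shaped models, this simple recursion is no longer exact and other algorithms are needed.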
TDTs (and ATs) offer optimal representations of codebases and their evolution in time, to allow quantitative studies of the propagation of security vulnerabilities—but they do nothing to effectively quantify these probabilities. For that, ProSVED poses the following broad research question:
How does the probability of finding a security vulnerability in a software library evolve over time?
While time-dependence of exploits and vulnerabilities is agreed upon by the practitioners' community—see e.g. the Temporal Metrics from the CVSS standard—the great majority of research has focused on the detection of vulnerabilities already known in the code. Some past attempts to generate vulnerability forecasts have used time-series machinery: one of the most modern and tangible outcomes is provided by the Vulnerability Forecasting interest group of FIRST, which is periodically updated to reflect yearly and quarterly projections of CVEs: https://github.com/FIRSTdotorg/Vuln4Cast/blob/main/README.md.
Probability of vulnerability as a function of time: In terms of forecasting capabilities, the novelty of ProSVED is the prediction of vulnerabilities for individual codebases, as opposed to the entire universe of CVE entries. Note that “predict” here is a synonym of forecast, i.e. determine occurrence at a future time point, as opposed to the ML interpretation of the term, which could also be understood as “detect”.
A hurdle is that, when considering an individual code base such as the source code of a single library, security vulnerabilities become rare events. This hinders statistical fitting and is commonly combated with data aggregation—cf. the Vulnerability Forecasting approach to work on the entire CVE dataset. To generate more specific forecasts, ProSVED proposes divisions of the learning sets by attributes that are known or suspected to affect security vulnerability occurrence, such as library size, seniority of developers, and functional purpose.
From a singled-out set of libraries, ProSVED measures the time elapsed between the release of the source code and the publication of a CVE for it, fitting statistical models to come up with probability density functions (PDFs) for the publication of a CVE since code release.
This provides individual PDFs for specific types of codebases, that can be linked to the nodes that compose a TDT, by determining which type of library each such node represents. Integrating these functions over time yields pointwise probability values, that represent the likelihood of having a new vulnerability (CVE) released for a codebase that our project is using. Depending on the severity of the vulnerability, or more fine-grained information such as the potential attack vector, this can represent a disruptive event that forces the release of urgent patches. Quantifying these probabilities gives companies concrete estimates of the workload needed in the future, thus facilitating security-related decisions.
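The fit-then-integrate step can be sketched as follows. The text above does not fix a distribution family, so the exponential model here is purely an assumption chosen because its maximum-likelihood fit and its integral have closed forms; the sample data and function names are likewise hypothetical.

```python
import math


def fit_exponential_rate(days_to_cve):
    """Maximum-likelihood rate of an exponential model for the time
    (in days) between code release and CVE publication."""
    return len(days_to_cve) / sum(days_to_cve)


def prob_cve_in_window(rate, start, length):
    """Integrate the fitted exponential PDF over [start, start + length].

    For an exponential distribution this integral equals
    exp(-rate * start) - exp(-rate * (start + length)).
    """
    return math.exp(-rate * start) - math.exp(-rate * (start + length))


# Hypothetical observations for one class of libraries:
# days elapsed from release to first CVE publication.
sample = [120, 400, 35, 800, 210]
rate = fit_exponential_rate(sample)

# Probability of a CVE being published in the first 45 days after release.
p_45_days = prob_cve_in_window(rate, start=0, length=45)
```

A realistic study would compare several distribution families and account for libraries that have no CVE yet; the sketch ignores both aspects.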
ProSVED has also studied analytical (or rather, numerical) compositions of the PDFs to span the multi-dimensional probabilistic space that describes the fluctuation of vulnerability probability as a function of time over dense, non-singular intervals. In layman's terms, one can see the full landscape of “vulnerability probability” up to a chosen future moment in time. While this suffers from the curse of dimensionality, which renders it impractical to visualize all dependencies of a project, it allows singling out a few codebases (e.g. dependencies of main concern, the usual suspects) and studying them in greater detail than via TDT analysis, which can only produce pointwise aggregated results.
Time Dependency Trees were designed as a lightweight graph structure capable of representing the evolution in time of entire dependency trees. Such representations can be used to compute indices at the level of entire projects or even development environments. For instance, an out-degree count in the nodes can determine the presence of pervasive dependencies, whose exploitation poses a threat to large portions of a project across several versions of libraries. Also, measuring the number of versions of (popular) libraries that were affected by published vulnerabilities is a high-level risk indicator of developing code in a specific ecosystem.
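The out-degree index can be sketched with a few lines of code. The edge orientation is an assumption: edges are taken to point from a dependency to the nodes that include it, so that the out-degree of a node counts its direct dependents. The edge list, function names, and threshold are hypothetical.

```python
from collections import Counter

# Hypothetical edges (dependency, dependent); nodes are (library, version) pairs.
edges = [
    (("d", 2), ("a", 1)),
    (("d", 2), ("b", 1)),
    (("d", 3), ("a", 2)),
    (("e", 1), ("b", 1)),
]


def out_degrees(edges):
    """Count, for each node, how many nodes directly include it."""
    return Counter(source for source, _ in edges)


def pervasive_libraries(edges, threshold):
    """Sum out-degrees per library name across all its versions and keep
    the libraries whose total reaches `threshold`."""
    per_library = Counter()
    for (name, _version), count in out_degrees(edges).items():
        per_library[name] += count
    return {name: total for name, total in per_library.items() if total >= threshold}


pervasive = pervasive_libraries(edges, threshold=2)
```

In this toy graph, library d is flagged because its versions are included three times overall, whereas e is included only once.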
When coupled with the PDFs fitted for the probability of vulnerability disclosure as a function of time, this can be used to find the weak spots in the dependency tree at different time points, and even quantify how much risk is posed by every dependency library.
Our studies for the Java/Maven library jira-core offer a concrete example of these capabilities. This library implements the source code for the core of the Jira project, which is one of the most popular bug trackers in the world. The source code of jira-core depends on many other libraries, such as xstream for XML serialisation, and all those codebases are periodically updated.
The problem: From a security perspective, when a new release of jira-core is being prepared, it makes sense to ask whether the newest version of a dependency such as xstream should be factored in or not. The answer is far from trivial: new code could bring in zero-day vulnerabilities, while older code has been around for longer, so it could be the target of better-engineered attacks. Our PDFs offer a quantification of such risks, as the probability of having a new CVE released for the dependency.
What happened: In this example, the high-severity CVE-2021-39139, published in late August 2021, affects xstream version 1.4.17 and earlier, and had been fixed in xstream:1.4.18, released earlier that month. Those specific library instances had been dependencies of jira-core for months, and versions 8.19.0 and 8.19.1 of jira-core, released in August before the vulnerability was disclosed, kept the old version of xstream. Hindsight proved this to be a mistake when CVE-2021-39139 came out: factoring in xstream:1.4.18 would have avoided the vulnerability that was incurred by continuing to use xstream:1.4.17.
The novelty of ProSVED: In that scenario there is no information on when the next CVE will be released, or which libraries (and which versions) it will affect. The TDT+PDF machinery of ProSVED changes that, providing estimated probabilities of the release of a new CVE for each library that a project uses as a dependency. Applied to this example, among the dependencies of jira-core related to xstream we find two libraries that pose a large security risk: mxparser with a 0.0836 chance, and xstream itself with a 0.070 chance, of having a new CVE released within 45 days counting from July 25, 2021. These quantities produced by ProSVED are available to developers before vulnerabilities like CVE-2021-39139 are released. Here they indicate that the risk of facing a new vulnerability is reduced if the developers of jira-core adopt any new version available for mxparser or xstream, which matches what eventually happened in that case.
A social objective of ProSVED is to raise awareness of cybersecurity practices in general, and the importance (and feasibility) of forecasting security vulnerabilities in particular. In this sense, ProSVED has been part of the following scientific and industrial dissemination events:
While ProSVED is driven by Carlos and supervised by Fabio, many more people have influenced its scientific developments and application to existent source code. From that too-long list, we extend our explicit gratitude to the following: