Most importantly: do you want data? We know that building datasets is difficult, error-prone, and time-consuming, so we have decided to share our efforts of the past 4 years. Check our Security Datasets in Trento.
Vulnerable dependencies are a known problem in today’s open-source software ecosystems because FOSS libraries are highly interconnected and developers do not always update their dependencies.
In our recent paper we show how to avoid the over-inflation problem of academic and industrial approaches for reporting vulnerable dependencies in FOSS software, and therefore satisfy the needs of industrial practice for correct allocation of development and audit resources.
To achieve this, we carefully analysed the deployed dependencies, aggregated dependencies by their projects, and distinguished halted dependencies. All this allowed us to obtain a counting method that avoids over-inflation.
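The three steps above (keep only deployed dependencies, aggregate by project, separate halted ones) can be illustrated with a minimal sketch. It is not the implementation used in the paper: the `Dependency` record, the choice of Maven scopes treated as deployed, and the use of the group id as a stand-in for the project are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    group: str        # Maven group id
    artifact: str     # Maven artifact id
    version: str
    scope: str        # Maven scope, e.g. "compile", "runtime", "test"
    vulnerable: bool  # affected by a known vulnerability
    halted: bool      # hypothetical flag: no newer release is expected


# Assumption: only these scopes end up in the deployed artifact.
DEPLOYED_SCOPES = {"compile", "runtime"}


def count_vulnerable(deps):
    """Compare a naive count with a count that avoids over-inflation."""
    raw = sum(1 for d in deps if d.vulnerable)

    # Step 1: ignore vulnerable dependencies that are never deployed.
    deployed = [d for d in deps if d.vulnerable and d.scope in DEPLOYED_SCOPES]

    # Step 2: aggregate by project (assumption: the group id approximates it).
    by_project = defaultdict(list)
    for d in deployed:
        by_project[d.group].append(d)

    # Step 3: a project counts as halted if all its vulnerable artifacts are.
    halted = sum(1 for ds in by_project.values() if all(d.halted for d in ds))

    return {"raw": raw, "projects": len(by_project), "halted": halted}
```

With such a count, two vulnerable artifacts of the same project are reported once, and a vulnerable test-only dependency is not reported at all.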
To understand the industrial impact, we considered the 200 most popular FOSS Java libraries used by SAP in its own software. Our analysis included 10905 distinct GAVs (group, artifact, version) in Maven when considering all the library versions.
We found that about 20% of the dependencies affected by a known vulnerability are not deployed, and therefore do not represent a danger to the analyzed library because they cannot be exploited in practice. Developers of the analyzed libraries are able to fix (and are actually responsible for) 82% of the deployed vulnerable dependencies. The vast majority (81%) of vulnerable dependencies may be fixed by simply updating to a new version, while 1% of the vulnerable dependencies in our sample are halted, and therefore potentially require a costly mitigation strategy.
Our methodology allows software development companies to receive actionable information about their library dependencies, and therefore correctly allocate costly development and audit resources, which are spent inefficiently when measurements are distorted.
Do you want to check if your project actually uses some vulnerable dependencies? Let us know.
Our new paper in the IEEE Transactions on Software Engineering proposes an automated method to determine the code evidence for the presence of vulnerabilities in retro software versions.
Why should you worry about disclosed vulnerabilities? Each time a vulnerability is disclosed in a FOSS component, a software vendor using this component in an application must decide whether to update the FOSS component, patch the application itself, or just do nothing because the vulnerability is not applicable to the older version of the FOSS component in use.
To address this challenge, we propose a screening test: a novel, automatic method based on thin slicing for quickly estimating whether a given vulnerability is present in a consumed FOSS component by looking across its entire repository. We have applied our test suite to large open source projects (e.g., Apache Tomcat, Spring Framework, Jenkins) that are routinely used by large software vendors, scanning thousands of commits and hundreds of thousands of lines of code in a matter of minutes.
Further, we provide insights on the empirical probability that, on the above-mentioned projects, a potentially vulnerable component might not actually be vulnerable after all (e.g. entries in a vulnerability database such as NVD which say that a version is vulnerable when the code is not even there).
A previous paper in the Empirical Software Engineering Journal focussed on Chrome and Firefox (spanning 7,236 vulnerable files and approximately 9,800 vulnerabilities) on the National Vulnerability Database (NVD). We found out that the elimination of spurious vulnerability claims found by our method may change the conclusions of studies on the prevalence of foundational vulnerabilities.
If you are interested in getting the code for the analysis please let us know.
In our paper we investigated publicly available factors (from number of active users to commits, from code size to usage of popular programming languages, etc.) to identify which ones impact three potential effort models: Centralized (the company checks each component and propagates changes to the product groups), Distributed (each product group is in charge of evaluating and fixing its consumed FOSS components), and Hybrid (seldom used components are checked individually by each development team, the rest is centralized).
We use Grounded Theory to extract the factors from a six-month study at the vendor and report the results on a sample of 152 FOSS components used by the vendor.
Our paper in the proceedings of the International Symposium on Empirical Software Engineering and Measurement addresses the limitations of the existing static analysis security testing (SAST) tool benchmarks: lack of vulnerability realism, uncertain ground truth, and a large number of findings unrelated to the analyzed vulnerability.
We propose Delta-Bench, a novel approach for the automatic construction of benchmarks for SAST tools based on differencing vulnerable and fixed versions in Free and Open Source (FOSS) repositories. That is, Delta-Bench allows SAST tools to be automatically evaluated on real-world historical vulnerabilities using only the findings that a tool produced for the analyzed vulnerability.
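To give an idea of what differencing a vulnerable and a fixed version can look like, here is a simplified sketch. It is not the Delta-Bench implementation: it works on two versions of a single file with Python's `difflib`, and the `(line, message)` format for tool findings is a hypothetical one chosen for illustration.

```python
import difflib


def changed_lines(vulnerable_src, fixed_src):
    """Return 1-based line numbers of the vulnerable version touched by the fix."""
    a = vulnerable_src.splitlines()
    b = fixed_src.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    lines = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" mark lines of the vulnerable version
        # that the fix modified or removed.
        if tag in ("replace", "delete"):
            lines.update(range(i1 + 1, i2 + 1))
    return lines


def relevant_findings(findings, vulnerable_src, fixed_src):
    """Keep only the findings that fall on lines changed by the fix.

    `findings` is a list of (line_number, message) tuples (assumed format).
    """
    delta = changed_lines(vulnerable_src, fixed_src)
    return [f for f in findings if f[0] in delta]
```

Under these assumptions, a finding reported on an untouched line is discarded, so the tool is scored only on what it reports for the vulnerability under analysis.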
We applied our approach to test 7 state-of-the-art SAST tools against 70 revisions of four major versions of Apache Tomcat, spanning 62 distinct Common Vulnerabilities and Exposures (CVE) fixes and vulnerable files totalling over 100K lines of code, as the source of ground truth vulnerabilities.
Our most interesting finding: tools perform differently depending on the selected benchmark.
Let us know if you want us to select a SAST tool that suits your needs.
Vulnerability exploitation is, reportedly, a major threat to system and software security. Assessing the risk represented by a vulnerability has therefore been at the center of a long debate. Eventually, the security community widely adopted the Common Vulnerability Scoring System (CVSS for short) as the reference methodology for vulnerability risk assessment. The CVSS is used in reference vulnerability databases such as CERT and NIST's NVD, and is referenced as the de facto standard methodology by national and international standards and best practices for system security (e.g. U.S. Government SCAP Protocol).
Today's baseline is that if you have a vulnerability and its CVSS score is high, you are in trouble and must fix it. But this may not be so realistic…
We are trying to assess to what degree this “baseline” is reasonable to follow: after all, any CIO of any company big enough to care about security will tell you “if this was a perfect world, maybe: but you are crazy if you think I'll fix every vulnerability out there, high CVSS or not”. Surely, CIOs and CEOs care about business continuity on top of business security, and in this respect applying an update can sometimes be riskier than leaving a vulnerability unpatched.
We are going to present a detailed analysis of how CVSS influences (positively and negatively) your patching policy at the beginning of August at BlackHat USA 2013 in Las Vegas, USA. Want to come? See the PDF of the BlackHat presentation or the talk video on YouTube. You can also check out the full paper in ACM TISSEC (now ACM TOPS).
Our research gravitates around the question “are all (high CVSS score) vulnerabilities really interesting for the attacker?” With this question we give up on a general answer that would identify every possible attack vector and end up identifying nothing in particular (see our CCS BADGERS work); on the contrary, we are seeking a general law of macro-security that can cover the great majority of the risk.
We developed a methodology that allows the organisation to:
For example, our methodology enables CIOs and decision makers to make assessments such as “System K has a Z% likelihood of being successfully attacked. If I fix these V vulnerabilities on the system, my risk of being attacked will decrease by X%”.
Additional information on the methodology can be found in our ACM TISSEC article.
To perform our study we collected five databases of vulnerabilities and exploits.
The picture on the right is a Venn diagram of vulnerabilities and CVSS scores in our database. Colours represent High, Medium and Low CVSS scores (Red, Orange and Cyan respectively). Areas are proportional to volumes of vulnerabilities. As one can immediately see, NVD is disproportionately big with respect to any other database. Remember that NVD is the database that, according to the SCAP protocol, contains all the vulnerabilities you should fix, and SYM is the dataset of actually exploited vulnerabilities. Adjusting by software type and year of the vulnerability does not change the overall figure: NVD is full of uninteresting (or at least not high-risk) vulnerabilities, despite what the CVSS score says.
EDB (or the equivalent OSVDB) is often used as the reference dataset for “actually exploited” vulnerabilities. Many researchers have already observed that a vulnerability should represent a higher risk if a public exploit for it exists. Still, EDB intersects SYM for only ~4% of its surface: most publicly available exploits aren't used by attackers! And, most interestingly, more than 75% of SYM is not covered by EDB, which decreases the credibility of the latter as a risk marker for vulnerabilities by a fair amount.
EKITS is the small square at the intersection between SYM and EDB. It features only about 100 vulnerabilities and still, according to Google, it may drive as much as 60% of the overall attacks against final users (see Trends in circumventing web-malware detection (PDF)). EKITS is covered by SYM for 80% of its surface, meaning that if a vulnerability is in the black markets it is, most likely, going to be attacked.
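The coverage figures quoted above are asymmetric overlap ratios between datasets. A minimal sketch of that computation follows; the function name and the set representation are assumptions for illustration, not part of our tooling.

```python
def coverage(dataset, other):
    """Fraction of `dataset` that also appears in `other`.

    Both arguments are sets of vulnerability identifiers. The ratio is not
    symmetric: coverage(a, b) is generally different from coverage(b, a).
    """
    if not dataset:
        raise ValueError("dataset must not be empty")
    return len(dataset & other) / len(dataset)
```

For instance, `coverage(edb, sym)` would give the share of EDB that is also in SYM, while `coverage(sym, edb)` would give the share of SYM that EDB covers.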
Focusing on the CVSS score distributions, a few facts are worth being underlined:
Overall, these results show, in our opinion, that there is much room for improvement in vulnerability metrics and risk assessment. Our contribution is rooted in:
A Vulnerability Discovery Model (VDM) operates on known vulnerability data (or observed data) to estimate the cumulative number of vulnerabilities found and reported in released software. A VDM is a family of functions with some parameters; for example, the linear model (LN) is:
LN(t) = At + B where
t is the time,
A, B are two parameters. These parameters are estimated by fitting the model to observed data.
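As an illustration of fitting LN to observed data, the sketch below estimates A and B by ordinary least squares. The choice of least squares is an assumption made for this example (the text above only says the parameters are fitted to observed data), and the function names are hypothetical.

```python
def fit_linear(ts, ys):
    """Estimate A and B of LN(t) = A*t + B by ordinary least squares.

    ts: observation times (e.g. months since release)
    ys: cumulative number of vulnerabilities observed at each time
    """
    n = len(ts)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (t, y) pairs of equal length")
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    if sxx == 0:
        raise ValueError("observation times must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    a = sxy / sxx
    b = mean_y - a * mean_t
    return a, b


def ln_model(t, a, b):
    """Evaluate the linear model LN(t) = A*t + B."""
    return a * t + b
```

Once A and B are estimated, `ln_model` can be evaluated at a later time t to obtain the predicted cumulative number of vulnerabilities.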
A successful model should not only fit the observed data well, but also be able to predict the future trend of vulnerabilities.
The figure on the right shows the taxonomy of recent VDMs. VDMs fall into two major categories: time-based models and effort-based models. Most state-of-the-art models fall into the first category. Only one model is classified into the second category. The time-based models divide into three further subcategories:
The above models were supported by one or more pieces of empirical evidence from their proponents, except Anderson's. However, there are some concerns about their experiments:
We propose an experimental methodology to systematically assess the performance of a model based on two quantitative metrics: quality and predictability. The methodology includes the following steps:
Step 1 Acquire the data sets: collect different data sets of vulnerabilities with respect to different definitions of vulnerability and different versions of software. The assessment of VDM will be established based on these data sets, so it will cover different definitions of vulnerability. By doing this, we address the vulnerability definition and multi-version software concerns.
Step 2 Fit the VDM on collected data: estimate the parameters of the VDM so that it fits the collected data as closely as possible. The goodness-of-fit of the fitted model can be evaluated by using the chi-square test for goodness-of-fit.
Step 3 Perform goodness-of-fit quality analysis: perform an analysis of the quality of the VDM over the software lifetime. This addresses the brittle goodness-of-fit concern.
Step 4 Perform predictability analysis: perform an analysis on the predictability of VDM.
Step 5 Compare VDM: compare a VDM with others to determine which one is better.
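The chi-square test mentioned in Step 2 compares the observed cumulative counts with the values produced by the fitted model. The sketch below computes only the test statistic; turning it into a p-value requires the chi-square distribution (for example from a statistics library), which is left out here. The function name is hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum((O - E)^2 / E).

    observed: cumulative vulnerability counts from the data set
    expected: values of the fitted VDM at the same time points
    All expected values must be positive.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected values must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

A smaller statistic means the fitted model lies closer to the observed data; the statistic is then compared against the chi-square distribution to decide whether the fit is acceptable.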
We apply the proposed methodology to evaluate state-of-the-art models. They are evaluated in different usage scenarios such as:
Our analysis has revealed that the most appropriate model is the simplest one (LN) when software is young (12 months). In other cases, s-shaped models perform better: the AML model is significantly better for middle-aged software (36 months), but there is no statistically significant difference among s-shaped models when software is old (72 months).
The following is a list of people that have been involved in the project at some point in time.