Datasets on Security Research at the University of Trento

Datasets are difficult to build, yet they are an essential part of scientific research.

Some of them may be just a collection of publicly available raw data (such as the NVD), but there is a huge difference between a website or the archive of a mailing list and an Excel file.

At the Security Group we realize how difficult it is to build a dataset, so we have decided to make ours available to promote access to, and replicability of, our experiments on vulnerability assessment models.

Available Datasets Used in our Research

NVD is the reference database for the population of vulnerabilities. It collects the data from the National Vulnerability Database from NIST (link).
EDB is the reference database for public (proof-of-concept) exploits. It collects the data from the Exploit-DB web site (link).
EKITS is our database of vulnerabilities and exploits traded in the black markets. We have built an update infrastructure that allows us to keep our database well ahead of any publicly available source on such vulnerabilities (e.g. Contagio's Exploit Pack Table). We only share this dataset on the basis of a joint research project.
SYM is a database of vulnerabilities exploited in the wild as reported by Symantec's sensors worldwide. This dataset is a collection of the vulnerability data publicly available through Symantec's Threat Explorer and Attack Signatures websites.
WINE reports volumes of attacks per month between 2009 and 2012. Its integration with our datasets was possible thanks to our collaboration with Symantec's WINE Program. If you want to have access to this dataset you should contact Symantec directly.
FFV collects the vulnerabilities of the Firefox browser. It is the most comprehensive database. It integrates the Mozilla Foundation Security Advisory (MFSA) bulletin, the Mozilla Bugzilla bugtracker and the NVD.
GCV reports the vulnerabilities of the Google Chrome browser extracted from the Chrome Issue Tracker, integrated with the NVD to reconstruct affected versions and checked for consistency with the code distribution (just using the NVD would yield more than 10% bogus foundational vulnerabilities). Another caveat is that this might not include all vulnerabilities of the browser, as some third-party software, such as WebKit, is only partly included.
IEV lists the vulnerabilities for Internet Explorer extracted from the Microsoft Security Bulletin and integrated with the NVD to reconstruct affected versions.
ASV lists the vulnerabilities of the Safari web browser extracted from the Apple Knowledge Base and integrated with the NVD to reconstruct affected versions.
ESEJ is the list of vulnerabilities in Google Chrome and Mozilla Firefox along with ranges of major versions affected by each vulnerability. For each vulnerability, the dataset contains two affected version ranges: (1) vulnerable versions according to the NVD (“version X and all previous versions”); (2) vulnerable versions based on the vulnerable code evidence (identified by our algorithm).
COMPRehension is a dataset collected in a series of controlled experiments on Model Comprehension for Security Risk Assessment.
Delta-Bench collects revisions of Apache Tomcat 6.0 - 8.5 with security fixes of various CVEs.
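
The two affected-version ranges that ESEJ records for each vulnerability can be compared directly. The following is a minimal Python sketch, not the algorithm used to build the dataset: the `VersionRange` class, the `nvd_range` and `unsupported_versions` helpers, and the sample version numbers are all hypothetical, chosen only to illustrate how the NVD rule ("version X and all previous versions") can claim versions that the code evidence does not support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionRange:
    """Inclusive range of major versions (hypothetical representation)."""
    first: int
    last: int

    def versions(self) -> set[int]:
        # Expand the range into the set of major versions it covers.
        return set(range(self.first, self.last + 1))


def nvd_range(last_reported: int, first_release: int = 1) -> VersionRange:
    """NVD-style rule: version X and all previous versions are vulnerable."""
    return VersionRange(first_release, last_reported)


def unsupported_versions(nvd: VersionRange, evidence: VersionRange) -> list[int]:
    """Major versions claimed by the NVD range but not backed by code evidence."""
    return sorted(nvd.versions() - evidence.versions())


# Hypothetical record: the NVD says "version 12 and earlier",
# while the vulnerable code is only found from version 5 onwards.
nvd = nvd_range(last_reported=12)
evidence = VersionRange(first=5, last=12)
print(unsupported_versions(nvd, evidence))  # [1, 2, 3, 4]
```

In this made-up record, versions 1 to 4 are attributed to the vulnerability by the NVD rule alone. Discrepancies of this kind are what the consistency check described for GCV is meant to catch.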

How to Access the Data

Write to us at security-dataREMOVESPAM@disi.unitn.it to see if the data is what you actually want (the email alias will expand to the researchers who worked on the datasets);
Specify the initial purpose for which you would like to use the data (this will go in the formal licence and on the web page with your name attached to it);
We will fill in the licensing agreement (see uncompiled license) with your data, and your head of department (or a tenured full professor of the department) should sign it;
We will return the signed copy of the agreement and the Excel file;
Report to us at security-dataREMOVESPAM@disi.unitn.it the publications based on the data, which should include the citation to our appropriate paper;
That's it. No fee, no painful plodding through websites for web2junk, no junk2data cleaning, etc. Citation and reporting back are the only formal requirements in the license for your research.

Rights for Access

This is the human readable summary of the rights and obligations that the license entails:

You can
1. share these datasets in whatever format with any member of your institution—faculty, administration, students, research associates in the case of universities, and employees in the case of government ministries and research organizations;
2. use these datasets in creative ways for scientific, non-profit, non-commercial use, including publications, under the terms of the agreement.

You cannot
1. post any of these datasets on your website such that it becomes available to non-members of your institution, or make copies that circulate outside of your institution;
2. use them for commercial purposes unless agreed in writing.

You agree to
1. cite the appropriate reference work in all your publications that make use of the datasets or their derivatives;
2. provide us with the information on the publications where you used the data, at security-dataREMOVESPAM@disi.unitn.it, for the purpose of posting it on our web site;
3. refer to us anyone outside your institution who asks you for the data.

You are aware that
1. Other parties may have rights or set licensing obligations in some of the data contained in the datasets. It is up to you to obtain permissions from these parties if needed.
2. The datasets may contain errors or may be unfit for your purposes, and we bear no liability for any problem you might encounter.

Users

Here are the researchers and institutes who have been granted access to our datasets.

1. DAI-Labor (Technische Universität Berlin)

Competence Center Security at DAI-Labor is a security research group. In one of our current public-grant research projects, Auvegos, we develop discrete-event simulation software for performing security analysis of network infrastructures, especially in the context of e-government. To this end, we generate or explicitly model the domain networks to assess, and we associate the nodes in these networks with CPE and CVE information. Based on this, we perform algorithmic computations (Attack Graph Generation, MDP-based risk assessment, …) and evaluate the effectiveness of potential mitigation strategies via simulation runs. The requested datasets would be used to generate input for the aforementioned simulation tool.

2. MITRE Corporation

Investigation on which CVEs are exploited by malicious exploit kits. [Scientist in charge: Aaron Powell]

3. MIT Sloan School of Management (Massachusetts Institute of Technology)

Evaluation of security practice use in relation to how and when vulnerabilities are discovered and resolved in software development projects; evolution of the vulnerability discovery and resolution process over time in software development projects. [Scientists in charge: Stuart Madnick, Michael Siegel, James Houghton]

4. Pierre Trepagnier and James Riordan

Investigating the probability that a given vulnerability will be exploited as a function of (a) its CVSS base score as well as (b) other possible markers which are available at the time the vulnerability is first noted.

5. NCSU (North Carolina State University)

Evaluation of security practice use in relation to how and when vulnerabilities are discovered and resolved in software development projects; evolution of the vulnerability discovery and resolution process over time in software development projects. [Scientists in charge: Laurie Williams, Patrick Morrison, Rahul Pandita]

6. ECNU (East China Normal University)

Research on building vulnerability prediction models and comparing the experimental results with previous studies conducted by DISI Security Research Group [Scientists in charge: Xiangxue Li, Liang He, Limin Yang]

7. GWU (George Washington University)

Dissertation research regarding vulnerability discovery modeling [Scientists in charge: Reuben Johnston, Thomas Mazzuchi].

8. IIIT-Delhi (Indraprastha Institute of Information Technology, Delhi)

Understanding and predicting vulnerabilities by leveraging online contents [Scientists in charge: Baani Leen Kaur Jolly, Tanmoy Chakraborty]

DISI Security Research Group Wiki
