CERN releases 300TB of Large Hadron Collider data into open access
The CMS collaboration at CERN has just released more than 300 terabytes (TB) of high-quality open data. These include more than 100TB of data from proton collisions at 7 TeV (teraelectronvolts), making up half the data collected at the LHC by the CMS detector in 2011. This release follows a previous one from November 2014, which made available around 27TB of research data collected in 2010.
The data comes in two types and is available on the CERN Open Data Portal. The primary datasets are in the same format used by the collaboration to perform research. The derived datasets, by contrast, require far less computing power and can be readily analysed by university or high school students.
CMS is also providing simulated data generated with the same software version that should be used to analyse the primary datasets. Simulations play an important role in particle physics research. The data release is complemented by analysis tools and code examples tailored to the datasets. A virtual machine image based on CernVM, which comes preloaded with the software environment needed to analyse the CMS data, can also be downloaded from the portal.
“Once we’ve exhausted our exploration of the data, we see no reason not to make them available publicly,” says Kati Lassila-Perini, a CMS physicist who leads these data preservation efforts. “The benefits are numerous, from inspiring high school students to the training of the particle physicists of tomorrow. And personally, as CMS’s data preservation coordinator, this is a crucial part of ensuring the long-term availability of our research data.”
The previous release of research data had already demonstrated the scope of open LHC data. A group of theorists at MIT wanted to study the substructure of jets, which are showers of hadron clusters recorded in the CMS detector. Because CMS had not performed this particular research, the theorists got in touch with the CMS scientists for advice on how to proceed. This blossomed into a productive collaboration between the theorists and CMS.
Salvatore Rappoccio, a CMS physicist who worked with the MIT theorists, says, “As scientists, we should take the release of data from publicly funded research very seriously. In addition to showing good stewardship of the funding we have received, it also provides a scientific benefit to our field as a whole. While it is a difficult and daunting task with much left to do, the release of CMS data is a giant step in the right direction.”
Additionally, a CMS physicist in Germany tasked two undergraduates with validating the CMS Open Data by reproducing key plots from some highly cited CMS papers that used data collected in 2010. With some guidance from the physicist, and using openly available documentation about CMS’s analysis software, the students were able to recreate plots that look nearly identical to those from CMS, demonstrating what can be achieved with these data.
“We are very pleased that we can make all these data publicly available,” adds Lassila-Perini. “We look forward to seeing how they are utilized outside our collaboration, for research as well as for building educational tools.”
This is only the latest of several data releases, but it is also by far the largest. A more detailed explanation of the types of data and how they can be accessed is available here.