# New Center Supports Data-Driven Research

With the advanced capabilities of today's computer technologies, researchers can now collect vast amounts of information with unprecedented speed. However, gathering information is only half of scientific discovery; the data must also be analyzed and interpreted. A new center on campus aims to hasten such data-driven discoveries by making expertise and advanced computational tools available to Caltech researchers in many disciplines within the sciences and the humanities.

The new Center for Data-Driven Discovery (CD3), which became operational this fall, is a hub for researchers to apply advanced data exploration and analysis tools to their work in fields such as biology, environmental science, physics, astronomy, chemistry, engineering, and the humanities.

The Caltech center will also complement the resources available at JPL's Center for Data Science and Technology, says George Djorgovski, director of CD3 and professor of astronomy.

"Bringing together the research, technical expertise, and respective disciplines of the two centers to form this joint initiative creates a wonderful synergy that will allow us opportunities to explore and innovate new capabilities in data-driven science for many of our sponsors," adds Daniel Crichton, director of the Center for Data Science and Technology at JPL.

At the core of the Caltech center are staff members who specialize in both computational methodology and various domains of science, such as biology, chemistry, and physics. Faculty-led research groups from each of Caltech's six divisions and JPL will be able to collaborate with center staff to find new ways to get the most from their research data. Resources at CD3 will range from data storage and cataloguing that meet the highest "housekeeping" standards, to custom data-analysis methods that combine statistics with machine learning—the development of algorithms that can "learn" from data. The staff will also help develop new research projects that could benefit from large amounts of existing data.

"The volume, quality, and complexity of data are growing such that the tools that we used to use—on our desktops or even on serious computing machines—10 years ago are no longer adequate. These are not problems that can be solved by just buying a bigger computer or better software; we need to actually invent new methods that allow us to make discoveries from these data sets," says Djorgovski.

Rather than turning to off-the-shelf data-analysis methods, Caltech researchers can now collaborate with CD3 staff to develop new customized computational methods and tools that are specialized for their unique goals. For example, astronomers like Djorgovski can use data-driven computing in the development of new ways to quickly scan large digital sky surveys for rare or interesting targets, such as distant quasars or new kinds of supernova explosions—targets that can be examined more closely with telescopes, such as those at the W. M. Keck Observatory, he says.

Mary Kennedy, the Allen and Lenabelle Davis Professor of Biology and a coleader of CD3, says that the center will serve as a bridge between the laboratory-science and computer-science communities at Caltech. In addition to matching Caltech faculty members with the expertise they need to analyze their data, the center will narrow the gap between those communities by providing educational opportunities for undergraduate and graduate students.

"Scientific development has moved so quickly that the education of most experimental scientists has not included the techniques one needs to synthesize or mine large data sets efficiently," Kennedy says. "Another way to say this is that 'domain' sciences—biology, engineering, astronomy, geology, chemistry, sociology, etc.—have developed in isolation from theoretical computer science and mathematics aimed at analysis of high-dimensional data. The goal of the new center is to provide a link between the two."

Work in Kennedy's laboratory focuses on understanding what takes place at the molecular level in the brain when neuronal synapses are altered to store information during learning. She says that methods and tools developed at the new center will assist her group in creating computer simulations that can help them understand how synapses are regulated by enzymes during learning.

"The ability to simulate molecular mechanisms in detail and then test predictions of the simulations with experiments will revolutionize our understanding of highly interconnected control mechanisms in cells," she says. "To some, this seems like science fiction, but it won't stay fictional for long. Caltech needs to lead in these endeavors."

Assistant Professor of Biology Mitchell Guttman says that the center will also be an asset to groups like his that are trying to make sense out of big sets of genomic data. "Biology is becoming a big-data science: genome sequences are available at an unprecedented pace. Whereas it took more than $1 billion to sequence the first genome, it now costs less than $1,000," he says. "Making sense of all this data is a challenge, but it is the future of biomedical research."

In his own work, Guttman studies the genetic code of lncRNAs, a new class of genes that he discovered largely through computational methods like those available at the new center. "I am excited about the new CD3 center because it represents an opportunity to leverage the best ideas and approaches across disciplines to solve a major challenge in our own research," he says.

But the most valuable findings from the center could be those that stem not from a single project, but from the multidisciplinary collaborations that CD3 will enable, Djorgovski says. "To me, the most interesting outcome is to have successful methodology transfers between different fields—for example, to see if a solution developed in astronomy can be used in biology," he says.

In fact, one such crossover method has already been identified, says Matthew Graham, a computational scientist at the center. "One of the challenges in data-rich science is dealing with very heterogeneous data—data of different types from different instruments," says Graham. "Using the experience and the methods we developed in astronomy for the Virtual Observatory, I worked with biologists to develop a smart data-management system for a collection of expression and gene-integration data for genetic lines in zebrafish. We are now starting a project along similar methodology transfer lines with Professor Barbara Wold's group on RNA genomics."

And, through the discovery of more tools and methods like these, "the center could really develop new projects that bridge the boundaries between different traditional fields through new collaborations," Djorgovski says.