Data Management Day

This is the home page for the University of Notre Dame’s inaugural Data Management Day (April 25, 2011). Here you will find a thorough description of the event. In a sentence, it was a success.

Introduction

Co-sponsored by the Center for Research Computing and the Hesburgh Libraries, Data Management Day was intended to raise awareness of all things research data across the Notre Dame community. This half-day event took place on Monday, April 25 from 1 to 5 p.m. in Room 126 of DeBartolo Hall, and it brought together as many people as possible who deal with research data. The issues discussed included, but were not limited to:

  • copyrights, trademarks, and patents
  • data management plans
  • data modeling and metadata description
  • financial sustainability
  • high performance computing
  • licensing, distribution, and data sharing
  • preservation and curation
  • personal privacy and human subjects
  • scholarly communication
  • security, access control, and authorization
  • sponsored research and funder requirements
  • storage and backup

Presenters

To help get us going and to stimulate our thinking, a number of speakers shared their experience.

In “Science and Engineering needs perspective” Edward Bensman (Civil Engineering & Geological Sciences) described how he quantified the need for research data storage and backup. He noted that people’s storage quotas were increasing at a linear rate while the need for storage was increasing at an exponential rate. In short, he said, “The CRC is not sized for University demand and we need an enterprise solution.” He went on to recommend a number of things, specifically:

  • review quotas and streamline the process for getting more
  • consider greater amounts of collaboration
  • improve campus-wide support for Mac OSX
  • survey constituents for more specific needs

Charles Vardeman (Center for Research Computing), in “Data Management for Molecular Simulations”, outlined the workflow of theoretical chemists and enumerated a number of models for organizing the data their calculations produce. He emphasized the need to give meaning to the data, and to that end he described how SMILES, a line notation for describing molecular structures, was used in conjunction with relational database models to describe content. Vardeman concluded with a brief description of a file system-based indexing scheme that might make the storage and retrieval of information easier.
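
To make the relational approach concrete, here is a minimal sketch of the kind of model Vardeman described: SMILES strings stored alongside calculation metadata in a small relational database. The table and column names (and the sample values) are hypothetical illustrations, not the CRC’s actual schema.

  # A hypothetical sketch (not the CRC's actual schema): pairing SMILES strings
  # with calculation metadata in a relational model, using SQLite.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE molecule (
          molecule_id INTEGER PRIMARY KEY,
          smiles      TEXT NOT NULL,   -- e.g. 'CCO' for ethanol
          name        TEXT
      );
      CREATE TABLE calculation (
          calculation_id INTEGER PRIMARY KEY,
          molecule_id    INTEGER REFERENCES molecule(molecule_id),
          method         TEXT,         -- e.g. 'B3LYP/6-31G*'
          energy         REAL,         -- computed total energy (illustrative)
          output_path    TEXT          -- location of the raw calculation output
      );
  """)

  conn.execute("INSERT INTO molecule VALUES (1, 'CCO', 'ethanol')")
  conn.execute("INSERT INTO calculation VALUES (1, 1, 'B3LYP/6-31G*', -155.0, '/data/runs/ethanol_001.log')")

  # Ask for every calculation run against a given molecule by its SMILES string.
  query = """
      SELECT c.method, c.energy, c.output_path
      FROM calculation c JOIN molecule m ON m.molecule_id = c.molecule_id
      WHERE m.smiles = ?
  """
  for row in conn.execute(query, ("CCO",)):
      print(row)

Queries like the one at the end are exactly what a plain POSIX directory tree cannot answer directly, which is the trade-off Vardeman’s abstract discusses.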

Vardeman’s abstract: Simulation algorithms are enabling scientists to ask interesting questions about molecular systems at an increasingly unmanageable rate from a data perspective. Traditional POSIX directory and file storage models are inadequate to categorize this ever increasing amount of data. Additionally, the tools for managing molecular simulation data must be highly flexible and extensible allowing unforeseen connections in the data to be elucidated. Recently, the Center for Research Computing built a simulation database to categorize data from Gaussian molecular calculations. Our experience of applying traditional database structures to this problem will be discussed highlighting the advantages and disadvantages of using such a strategy to manage molecular data.

Daniel Skendzel (Project Manager for the Digital Asset Strategy Committee) presented an overview of the work of the Digital Asset Management group in “Our Approach to Digital Asset Management”. He began by comparing digital asset management to a storage closet and then showed two different pictures of closets, one messy and the other orderly. He described the University’s digital asset management system as “siloed”, and he envisioned bringing these silos together into a more coherent whole, complete with suites of tools for using the assets more effectively. Skendzel compared and contrasted our strategy with Duke’s (coordinated), Yale’s (enabling), and the University of Michigan’s (integrated), noting the differences in functionality and maturity across all four. I thought his principles for cultural change, something he mentioned at the end, were most interesting:

  • central advocacy
  • faculty needs driven
  • built on standard architecture
  • flexible applications
  • addresses entire life cycle
  • mindful of the cultural element

Skendzel’s abstract: John Affleck-Graves and the Leadership Committee on Operational Excellence commissioned the Digital Asset Strategy Committee in May 2010 to create and administer a master plan to provide structure for managing digital content in the form of multi-media, images, specific document-types and research data. The plan will address a strategy for how we can best approach the lifecycle needs of capturing, managing, distributing and preserving our institutional digital content. This talk will focus on our progress toward a vision to enhance the value of our digital content by integrating our unique organizational culture with digital technologies.

Darren Davis (Associate Vice President for Research and Professor of Political Science) talked about the importance and role of institutional review boards in “Compliance and research data management”. He began by pointing out the long-standing issues of research involving human subjects, noting a decades-old report outlining the challenges. He stated that the University goes well beyond the Federal guidelines, and he said respect for the individual person is the thing the University is most interested in when it comes to these guidelines. When human subjects are involved in any study, he said, it is very important for the subjects to understand what information is being gleaned from them, what compensation they will receive from the process, and that their participation is given willingly. When licensing data from human subject research, confidentiality is an ever-present challenge, and the data needs to be de-identified. Moreover, the licensed data cannot be repurposed. Finally, Davis said he and the Office of Research will help faculty create data management plans, and they look to expand these service offerings accordingly.

Davis’s abstract: Advances in technology have enabled investigators to explore new avenues of research, enhance productivity, and use data in ways unimagined before. However, the application of new technologies has the potential to create unanticipated compliance problems regarding what constitutes human subject research, confidentiality, and consent.

In “From Design to Archiving: Managing Multi-Site, Longitudinal Data in Psychology” Jennifer Burke (Research Assistant Professor of Psychology & Associate Director of the Center for Children and Families) gave an overview of the process she uses to manage her research data. She strongly advocated planning that includes storage, security, back-up, unit of analysis, language, etc. Her data comes in all formats: paper, electronic, audio/video. She designs and builds her data sets, sometimes as simple rows and columns and sometimes as linked relational databases. She is mindful of file naming conventions and the use of labeling conventions (her term for “metadata”). There is a great deal of data entry, data clean-up, and sometimes “back filling”. Finally, everything is documented in codebooks, complete with a CD. She shares her data, when she can, through archives, online, and even through the postal mail. I asked Burke which of the processes was the most difficult or time-consuming, and she said, without a doubt, the data entry.
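
As an illustration of what file naming and labeling conventions can look like in practice, here is a small, purely hypothetical sketch; the naming pattern and codebook fields are invented for the example and are not Burke’s actual scheme.

  # A hypothetical sketch of file naming and codebook conventions; the pattern
  # and fields below are invented for illustration only.
  import csv
  from pathlib import Path

  def data_file_name(study: str, site: str, subject_id: int, wave: int, measure: str) -> str:
      """Build a predictable file name, e.g. 'CCF_siteA_s0042_w1_parentsurvey.csv'."""
      return f"{study}_{site}_s{subject_id:04d}_w{wave}_{measure}.csv"

  print(data_file_name("CCF", "siteA", 42, 1, "parentsurvey"))

  # A minimal codebook: one row per variable, documenting its label and valid values.
  codebook = [
      {"variable": "subject_id", "label": "Participant identifier", "values": "0001-9999"},
      {"variable": "wave",       "label": "Assessment wave",        "values": "1-3"},
  ]
  with Path("codebook.csv").open("w", newline="") as fh:
      writer = csv.DictWriter(fh, fieldnames=["variable", "label", "values"])
      writer.writeheader()
      writer.writerows(codebook)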

Burke’s abstract: This brief talk will summarize the work of the Data Management Center, from consulting on methodological designs to preparing data to be archived. The talk will provide an overview of the types of data that are typical for psychological research and the strategies we have developed to maintain these data safely and efficiently. Processes for data documentation and preparation for long-term archiving will be described.

Up next was Maciej Malawski (Center for Research Computing, University of Notre Dame & AGH University of Science and Technology, Krakow, Poland) and his “Prospects for Executable Papers in Web and Cloud Environments”. Creating research data is one thing; making it available is another. In this presentation Malawski advocated “executable papers”: applications and services embedded in published articles that allow readers to interact with the underlying data. The idea is not brand new and may have been first articulated as early as 1992, when CD-ROMs became readily available. Malawski gave a couple of working examples of executable papers, citing myExperiment and the GridSpace Virtual Laboratory.

Malawski’s abstract: Recent developments in both e-Science and computational technologies such as Web 2.0 and cloud computing call for a novel publishing paradigm. Traditional scientific publications should be supplemented with elements of interactivity, enabling reviewers and readers to reexamine the reported results by executing parts of the software on which such results are based as well as access primary scientific data. We will discuss opportunities brought by recent Web 2.0, Software-as-a-Service, grid and cloud computing developments, and how they can be combined together to make executable papers possible. As example solutions, we will focus on two specific environments: MyExperiment portal for sharing scientific workflows, and GridSpace virtual laboratory which can be used as a prototype executable paper engine.

Patrick Flynn (Professor of Computer Science & Engineering, Concurrent Professor of Electrical Engineering) seemed to have the most experience in the group, and he shared it in a presentation called “You want to do WHAT?: Managing and distributing identifying data without running afoul of your research sponsor, your IRB, or your Office of Counsel”. Flynn and his immediate colleagues have more than 10 years of experience with biometric data. Working with government and non-government grant sponsors, Flynn has been collecting images of people’s irises, their faces, and other data points. The data is meticulously maintained, given back to the sponsors, and then licensed to others. To date, Flynn has about 18 data sets to his credit, and they have been used in a wide variety of subsequent studies. The whole process is challenging, he says: consent forms, metadata accuracy, licensing, institutional review boards. In the end, he advocated that the University cultivate a culture of data stewardship, and he articulated the need for better data management systems across campus.

Flynn’s abstract: This talk will summarize nine years’ experience with collecting biometrics data from consenting human subjects and distributing such data to qualified research groups. Key points visited during the talk will include: Transparency and disclosure; addressing concerns and educating the concerned; deploying infrastructure for the management of terabytes of data; deciding whom to license data to and how to decline requests; how to manage an ongoing data collection/enrollment/distribution workflow.

In “Globus Online: Software-as-a-Service for Research Data Management” Steven Tuecke (Deputy Director, Computation Institute, University of Chicago & Argonne National Laboratory) described the vision for a Dropbox-like service for scientists called Globus Online. By exploiting cloud computing techniques, Tuecke sees a time when researchers can go to a website, answer a few questions, select a few check boxes, and have the information technology for their lab set up almost instantly. Technology components may include blogs, wikis, mailing lists, file systems for storage, databases for information management, indexers/search engines, etc. “Medium and small labs should be doing science, not IT (information technology).” In short, Tuecke advocated Software-as-a-Service (SaaS) for much of research data management.

Tuecke’s abstract: The proliferation of data and technology creates huge opportunities for new discoveries and innovations. But they also create huge challenges, as many researchers lack the IT skills, tools, and resources ($) to leverage these opportunities. We propose to solve this problem by providing missing IT to researchers via a cost-effective Software-as-a-Service (SaaS) platform, which we believe can greatly accelerate discovery and innovation worldwide. In this presentation I will discuss these issues, and demonstrate our initial step down this path with the Globus Online file transfer service.

The final presentation, “Legal issues and research data management”, was given by Timothy Flanagan (Associate General Counsel for the University). Flanagan told the audience it was his responsibility to represent the University and provide legal advice. When it comes to research data management, there are more questions than answers: “A lot of these things are not understood.” He sees his job and the General Counsel’s job as one of balancing obligation with risk.

Summary

Jarek Nabrzyski (Center for Research Computing) and I believe Data Management Day was a success. The event itself was attended by more than sixty-five people, and they seemed to come from all parts of the University. Despite the fact that the presentations were only fifteen minutes long, each of the presenters obviously spent a great deal of time putting their thoughts together. Such effort is greatly appreciated.

The discussion after the presentations was thoughtful and meaningful. Some people believed a larger top-down effort to provide infrastructure support was needed. Others thought the needs were more pressing and that solutions to the infrastructure and policy issues should come up from the grassroots level. Probably a mixture of both is required.

One of the goals of Data Management Day was to raise awareness of all the issues surrounding research data management. The presentations covered many of them:

  • collecting, organizing, and distributing data
  • data management plans
  • digital asset management activities at Notre Dame
  • institutional review boards
  • legal issues surrounding research data management
  • organizing & analyzing data
  • SaaS and data management
  • storage space & infrastructure
  • the use of data after it is created

Data management is happening all across our great university. The formats, storage mechanisms, data modeling, etc. are different from project to project, but they all share a set of core issues that need to be addressed to one degree or another. By bringing together as many people as possible and facilitating discussion among them, we hoped to build understanding across our academe and ultimately to work more efficiently. Data Management Day was one way to accomplish this goal.

What are the next steps? Frankly, we don’t know. All we can say is research data management is not an issue that can be addressed in isolation. Instead, everybody has some of the solution. Talk with your immediate colleagues about the issues, and more importantly, talk with people outside your immediate circle. Our whole is greater than the sum of our parts.
