Data tsunamis and explosions
Posted on October 29, 2010 in Uncategorized by Eric Lease Morgan
Michelle Hudson and I have visited more teaching & research faculty across campus learning about their uses, needs, and wants when it comes to data. As one person put it, we are preparing for the “Data Tsunami”, or as another person put it — the “Data Explosion”. We have learned a few more things:
- Brokering – At least a couple of the people we visited thought libraries ought to play a central role in the brokering of data sets. In their view, libraries would be repositories of data as well as manage the licensing of the data both inside and outside the University community. “Libraries can make it easy for us to find data sets.” The Institute for Quantitative Social Science at Harvard University may be a good model. This clearing house function needs to include services educating people on how to use the data, “cool” interfaces for finding and using the data, and links to bibliographic materials like books and journal articles. “We would also like somebody to help us keep track who uses our data and where it is cited in the literature.”
- Curation – Some people have “archived” original data sets in the form of paper-based surveys. These are stored in file cabinets in basements. Others have elaborate computer systems complete with redundant backups, rsync functionality, and data refreshment protocols. One person alluded to HubZero as a possible tool for these tasks.
- Data origination – Most of the people we have talked to generate their own data either through surveys or scientific equipment. Fewer people, so far, have gotten their data from other people or companies. When it has come from companies, the data has been anonymized and encrypted before it gets here.
- Data types – The formats of the data fall into a couple of categories: 1) binary data such as images, video, & simulation output, and 2) plain text data in the form of spreadsheets (mostly) or sometimes relational databases. “We know that the use of relational databases is the ‘best’ way to organize this information, but none of us want to take the time to learn SQL.”
- Licensing – At least a couple of the people we visited license their data to others. After working with General Counsel, contracts between parties are signed and the data is exchanged. We have yet to see any money changing hands. The licenses are used to protect the University from liability when the data gets used in ways not allowed by the license. A couple of people would like the University (or specifically the library) to handle this sort of paperwork.
- Metadata – There is a wide spectrum in how metadata is applied to the data sets. Some people have no metadata at all. Others maintain multi-volume books filled with “protocols” describing their data and how it is to be collected. One person said, “We spend a lot of our time correcting metadata tabulating what camera was used, when, and by whom… Our reputation rests on the quality of our data (and metadata). We’ve formatted our metadata as CSV files as well as XML files. In order for our data to be characterized as ‘good’ we need an error rate of 1000/1”.
- Sharing – We are learning that the sharing of data is a complicated decision-making process. Many things come into play including but not necessarily limited to: the culture of the subject discipline, patents, the competitive nature of the researcher, intellectual property rights, funding agency requirements, embargoes, and the inclusion of human subjects. Some people are more willing to share than others. So far, no one will share their data until the first paper has been written. They want (need) “publication rights”.
- Size – Everybody believes they have “large” data sets, but the definition of large needs to be qualified. On one hand large may be equated with sizable files. Videos are a good example. On the other hand large may mean many records. Big longitudinal studies complete with many fields per subject are a good example.
We are also learning that no one person or group seems to have a complete handle on the issues surrounding data. Michelle and I certainly don’t. Everybody knows a lot but not everything. Consequently, we are thinking of hosting “Data Day”, a time and place when many of the people who deal with data for teaching and research get together, share war stories, and learn from each other’s experiences. In the end we may understand how to be more efficient and prepared when the “tsunami” is actually upon us.
Off to interview more people… More later.