This posting documents my experiences at the 6th International Data Curation Conference, December 6-8, 2010 in Chicago (Illinois). In a sentence, my understanding of the breadth and depth of data curation was reinforced, and the issues of data curation seem to be very similar to the issues surrounding open access publishing.
Day #1
After a few pre-conference workshops which seemed to be popular, and after the reception the night before, the Conference began in earnest on Tuesday, December 7. The presentations of the day were akin to overviews of data curation, mostly from the people who were data creators.
One of the keynote addresses was entitled “Working the crowd: Lessons from Galaxy Zoo” by Chris Lintott (University of Oxford & Adler Planetarium). In it he described how images of galaxies taken as a part of the Sloan Digital Sky Survey were classified through crowd sourcing techniques: the Galaxy Zoo. Wildly popular for a limited period of time, its success was attributed to convincing people that their task was useful, treating them as collaborators (not subjects), and making sure the work was not considered a waste of time. He called the whole process “citizen science”, and he has recently launched Zooniverse in the same vein.
“Curation centres, curation services: How many are enough?” by Kevin Ashley (Digital Curation Centre) was the second talk, and in a tongue-in-cheek way, he said the answer was three. He went on to outline the whys and wherefores of curation centers. Different players: publishers, governments, and subject centers. Different motivations: institutional value, reuse, presentation of the data behind the graph, obligation, aggregation, and education. Different debates on who should do the work: libraries, archives, computer centers, institutions, disciplines, nations, localities. He summarized by noting how data is “living”, we have a duty to promote it, it is about more than scholarly research, and finally, three centers are not really enough.
Like Lintott, Antony Williams (Royal Society of Chemistry) described a crowd sourcing project in “ChemSpider as a platform for crowd participation”. He began by demonstrating the myriad of ways Viagra has been chemically described on the ‘Net. “Chemical information on the Internet is a mess.” ChemSpider brings together many links from chemistry-related sites and provides a means for editing them in an online environment.
Barend Mons outlined one of the common challenges of metadata, namely, the computer’s need for structured information and most individuals’ lack of desire to create it. In “The curation challenge for the next decade: Digital overlap strategy or collective brains?” Mons advocated the creation of “nano publications” in the form of RDF statements (assertions) as a possible solution. “We need computers to create ‘reasonable’ formats.”
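To make the idea of an assertion concrete, here is a minimal sketch of a single statement expressed as an RDF triple and written out in N-Triples syntax. It is only an illustration, not anything Mons presented: the example.org identifiers, the predicate name, and the helper function are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One assertion: subject, predicate, object (all given as IRIs)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialize an assertion whose three terms are IRIs as one N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# Hypothetical identifiers, used purely for illustration.
assertion = Triple(
    "http://example.org/gene/A",
    "http://example.org/vocab/associatedWith",
    "http://example.org/disease/B",
)

print(to_ntriples(assertion))
```

Because each statement is so small and regular, a program can generate or check it, which is the appeal when people are reluctant to write structured metadata by hand.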
“Idiosyncrasy at scale: Data curation in the humanities” by John Unsworth (University of Illinois at Urbana-Champaign) was the fourth presentation of the day. Unsworth began with an interesting set of statements. “Retrieval is a precondition for use, and normalization is a precondition for retrieval, but humanities’ texts are messy and difficult to normalize.” He went on to enumerate types of textual normalization: spelling, vocabulary, punctuation, “chunking”, mark-up, and metadata. He described MONK as a normalization project. He also mentioned a recent discussion on the Alliance of Digital Humanities Organizations site where humanists debated whether or not texts ought to be marked up prior to analysis. In short, idiosyncrasies abound.
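As a rough illustration of what spelling and punctuation normalization can look like in practice, consider the sketch below. It is not how MONK works. The variant-spelling table and the function name are assumptions made up for the example, and a real project would need a far larger mapping.

```python
import string

# Hypothetical table of variant spellings mapped to modern forms.
VARIANTS = {"loue": "love", "vertue": "virtue", "doe": "do"}


def normalize(text, variants=VARIANTS):
    """Lowercase the text, strip ASCII punctuation, and map variant spellings."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [variants.get(token, token) for token in tokens]


print(normalize("Loue, and doe vertue!"))
```

Even this toy version shows why normalization is hard: every rule is a decision about which idiosyncrasies matter for retrieval and which do not.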
The Best Student Paper Award was won by Youngseek Kim (Syracuse University) for “Education for eScience professionals: Integrating data curation and cyberinfrastructure”. In it he described the use of focus group interviews and an analysis of job postings to articulate the most common skills a person needs to be an “escience professional”. In the end he outlined three sets of skills: 1) the ability to work with data, 2) the ability to collaborate with others, and 3) the ability to work with cyberinfrastructure. The escience professional needs to have domain knowledge, a collaborative nature, and know how to work with computers. “The escience professional needs to have a range of capabilities and play a bridging role between scientists and information professionals.”
After Kim’s presentation there was a discussion surrounding the role of the librarian in data curation. While I do not feel very much came out of the discussion, I was impressed with one person’s comment. “If a university’s research data were closely tied to the institution’s teaching efforts, then much of the angst surrounding data curation would suddenly go away, and a strategic path would become clear.” I thought that comment, especially coming from a United States Government librarian, was quite insightful.
The day’s events were (more or less) summarized by Clifford Lynch (Coalition for Networked Information) with some of the following quotes. “The NSF mandate is the elephant in the room… The NSF plans are not using the language of longevity… The whole thing may be a ‘wonderful experiment’… It might be a good idea for someone to create a list of the existing data plans and their characteristics in order to see which ones play out… Citizen science is not only about analysis but also about data collection.”
Day #2
The second day’s presentations were more practical in nature and seemingly geared for librarians and archivists.
In my opinion, “Managing research data at MIT: Growing the curation community one institution at a time” by MacKenzie Smith (Massachusetts Institute of Technology Libraries) was the best presentation of the conference. In it she described data curation as a “meta-discipline” as defined in Media Ecology by Marshall McLuhan, and where information can be described in terms of format, magnitude, velocity, direction, and access. She articulated how data is tricky once a person travels beyond one’s own silo, and she described curation as being about reproducing data, aggregating data, and re-using data. Specific examples include: finding data, publishing data, preserving data, referencing data, making sense of data, and working with data. Like many of the presenters, she thought data curation was not the purview of any one institution or group, but rather a combination. She compared them to layers of storage, management, linking, discovery, delivery, management, and society. All of these things are done by different groups: researchers, subject disciplines, data centers, libraries & archives, businesses, colleges & universities, and funders. She then presented an interesting set of two case studies comparing & contrasting data curation activities at the University of Chicago and MIT. Finally she described a library’s role as one of providing services and collaboration. In the jargon of Media Ecology, “Libraries are a ‘keystone’ species.”
The Best Paper Award was given to Laura Wynholds (University of California, Los Angeles) for “Linking to scientific data: Identity problems of unruly and poorly bounded digital objects”. In it she pointed out how one particular data set was referenced, made accessible, and formatted in three different ways by three different publications. She went on to outline the challenges of identifying which data set to curate and how.
In “Making digital curation a systematic institutional function” Christopher Prom (University of Illinois at Urbana-Champaign) answered the question, “How can we be more systematic about bringing materials into the archives?” Using time granted via a leave of absence, Prom wrote Practical E-Records which “aims to evaluate software and conceptual models that archivists and records managers might use to identify, preserve, and provide access to electronic records.” He defined trust as an essential component of records management, and outlined the following process that needs to be done in order to build it: assess resources, write program statement, engage records producers, implement policies, implement repository, develop action plans, tailor workflows, and provide access.
James A. J. Wilson (University of Oxford) shared some of his experiences with data curation in “An institutional approach to developing research data management infrastructure”. According to Wilson, the Computing Services center is taking the coordinating role at Oxford when it comes to data curation, but he, like everybody else, emphasized the process is not about a single department or entity. He outlined a number of processes: planning, creation, local storage, documentation, institutional storage, discovery, retrieval, and training. He divided these processes between researchers, computing centers, and libraries. I thought one of the more interesting ideas Wilson described was DaaS (database as a service) where databases are created on demand for researchers to use.
Patricia Hswe (Penn State University) described how she and a team of other people at the University have broken down information silos to create a data repository. Her presentation, “Responding to the call to curate: Digital curation in practice at Penn State University”, outlined the use of microservices in their implementation, and she explained the successes of CurateCamps. She emphasized how the organizational context of the implementation is probably the most difficult part of the work.
Huda Kan (Cornell University) described an application to create, reuse, stage, and share research data in a presentation called “DataStaR: Using the Semantic Web approach for data curation”. The use of RDF was core to the system’s underlying data structure.
Since this was the last session in a particular concurrent track, a discussion followed Kan’s presentation. It revolved around the errors in metadata, and the discussed solutions seemed to fall into three categories: 1) write better documentation and/or descriptions of data, 2) write computer programs to statistically identify errors and then fix them, or 3) have humans do the work. In the end, the solution is probably a combination of all three.
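The second category, programs that statistically identify likely errors, can be sketched in a few lines. The example below is only one simple heuristic and not something presented in the session: it flags values that occur rarely in a metadata field and suggests the closest frequently occurring value. The function name, thresholds, and sample records are illustrative assumptions.

```python
from collections import Counter
from difflib import get_close_matches


def suggest_corrections(values, min_count=2, cutoff=0.8):
    """Map each rare value to the most similar common value, if one is close enough."""
    counts = Counter(values)
    common = [value for value, n in counts.items() if n >= min_count]
    suggestions = {}
    for value, n in counts.items():
        if n < min_count:
            # Rare values that closely resemble a common one are likely typos.
            match = get_close_matches(value, common, n=1, cutoff=cutoff)
            if match:
                suggestions[value] = match[0]
    return suggestions


# Hypothetical contents of an "institution" metadata field.
records = [
    "University of Oxford", "University of Oxford", "Univeristy of Oxford",
    "Cornell University", "Cornell University", "Cornel University",
]

print(suggest_corrections(records))
```

A program like this only proposes fixes. Deciding whether a rare value is an error or a legitimate outlier still falls to better documentation or to a human, which is why a combination of all three approaches seems likely.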
Sometime during the conference I got the idea of creating a word cloud made up of Twitter “tweets” with the conference’s hash tag — idcc10. In a fit of creativity, I wrote the hack upon my return home, and the following illustration is the result:
Wordcloud illustrating the tweets tagged with idcc10
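The hack itself is not reproduced here. For readers curious about the general approach, a minimal sketch of the counting step behind such a word cloud might look like the following, assuming the tweets have already been collected as plain strings. The sample tweets, the stop-word list, and the function name are illustrative assumptions and not the original code.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "at", "rt"}


def word_frequencies(tweets, hashtag="idcc10"):
    """Count words across tweets, skipping stop words and the hash tag itself."""
    counts = Counter()
    for tweet in tweets:
        for word in re.findall(r"[a-z0-9']+", tweet.lower()):
            if word not in STOPWORDS and word != hashtag:
                counts[word] += 1
    return counts


# Made-up tweets standing in for the real ones.
sample = [
    "Data curation is a process #idcc10",
    "Citizen science and data collection #idcc10",
    "RT curation of research data #idcc10",
]

freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

The resulting counts are what a word cloud renderer scales its font sizes by, so the most frequent words end up largest.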
Summary
The Conference was attended by approximately 250 people, apparently a record. The attendees were mostly from the United States (obviously), but it was not uncommon to see people from abroad. The Conference was truly international in scope. I was surprised at the number of people I knew but had not seen for a while because I have not been recently participating in Digital Library Federation-like circles. It was nice to rekindle old acquaintances and make some new ones.
As is to be expected, the presentations outlined apparent successes based on experience. From my perspective, Notre Dame’s experience is just beginning. We ought to learn from this experience, and some of my take-aways include:
- data curation is not the job of any one university department; there are many stakeholders
- data curation is a process involving computer technology, significant resources of all types, and policy; all three are needed to make the process functional
- data curation is a lot like open access publishing but without a lot of the moral high ground