Archive for the ‘Uncategorized’ Category

Data management & curation groups

Posted on March 18, 2011 in Uncategorized

This is a short and incomplete list of universities with data management & curation groups. Each item includes the name of the local group, a link to the group’s home page, a blurb describing the focus of the group, and a sublist of group membership.

Research Data Management Service Group (Cornell)

“The Research Data Management Service Group (RDMSG) aims to: present a coherent set of services to researchers; develop a unified web presence providing general information on data management planning, services available on campus, and standard language that may be used in data management plans in grant proposals; provide a single point of contact that puts researchers in touch with specialized assistance as the need arises. The RDMSG is jointly sponsored by the Senior Vice Provost for Research and the University Librarian, and also has a faculty advisory board.” — http://data.research.cornell.edu/

  • Bill Block – Cornell Institute for Social & Economic Research
  • Dave Lifka – Center for Advanced Computing
  • Dean Krafft – Library
  • Dianne Dietrich – Library
  • Eric Chen – Center for Advanced Computing
  • Gail Steinhart – Library
  • Janet McCue – Library
  • Jim Cordes – Astronomy
  • Stefan Kramer – Cornell Institute for Social & Economic Research

Research Data Management (Oxford)

“The University of Oxford is committed to supporting researchers in appropriate curation and preservation of their research data, and where applicable in accordance with the research funders’ requirements.” — http://www.admin.ox.ac.uk/rdm/

  • Bodleian Library (organization and documentation)
  • Central University Research Ethics Committee (ethical issues)
  • Departmental IT Support (backup and security)
  • Intellectual Property Advisory Group (lab notebooks)
  • Isis Innovation (commercial issues)
  • Legal Services (legal issues)
  • Oxford Digital Library (publication and preservation)
  • Oxford Research Archive (publication and preservation)
  • Research Services (funder policies)
  • Research Technology Services (technical aspects of data management)
  • The Data Library (access and discovery issues)

Research Data Services (University of Wisconsin-Madison)

“Digital curation covers cradle-to-grave data management, including storage, preservation, selection, transfer, description, sharing, access, reuse, and transformations. With the current focus on data sharing and preservation on the part of funding agencies, publishers, and research disciplines, having data management practices in place is more relevant than ever.” — http://researchdata.wisc.edu/

  • Alan Wolf – Madison Digital Media Center
  • Allan Barclay – Health Sciences Library
  • Amanda Werhane – Agriculture & Life Science Library
  • Brad Leege – DoIT Academic Technology
  • Bruce Barton – DoIT Academic Technology
  • Caroline Meikle – Soil Science Department/UW-Extension
  • Cindy Severt – Data & Information Services Center
  • Dorothea Salo – Memorial Library
  • Jan Cheetham – DoIT Academic Technology
  • Keely Merchant – Space Science Library
  • Nancy Wiegand – Geospatial Sciences
  • Rebecca Holz – Health Sciences Library
  • Ryan Schryver – Engineering

Data Management and Publishing (MIT)

“What should be included in a data management plan? Funding agencies, e.g., the National Science Foundation (NSF), may have specific requirements for plan content. Otherwise, there are fundamental data management issues that apply to most disciplines, formats, and projects. And keep in mind that a data management plan will help you to properly manage your data for own use, not only to meet a funder requirement or enable data sharing in the future.” — http://libraries.mit.edu/guides/subjects/data-management/

  • Amy Stout – Library
  • Anne Graham – Library
  • Courtney Crummett – Library
  • Katherine McNeill – Library
  • Lisa Sweeney – Library

Scientific Data Consulting (University of Virginia)

“The SciDaC Group is ready to consult with you on your entire data life cycle, helping you to make the right decisions, so that your scientific research data will continue to be available when you and others need it in the future.” — http://www2.lib.virginia.edu/brown/data/

  • Andrew Sallans – Library
  • Sherry Lake – Library

Managing Your Data (University of Minnesota)

“The University Libraries are here to assist you with research data management issues through best practices, training, and awareness of data preservation issues. This site examines the research data life-cycle and offers tools and solutions for creation, storage, analysis, dissemination, and preservation of your data.” — http://www.lib.umn.edu/datamanagement

  • Amy West – Library
  • Lisa Johnston – Library
  • Meghan Lafferty – Library

Data Management Planning (Penn State)

“Good data management starts with comprehensive and consistent data documentation and should be maintained through the life cycle of the data.” — http://www.libraries.psu.edu/psul/scholar/datamanagement.html

  • Ann Holt – Library
  • Daniel C. Mack – Library
  • Kevin Clair – Library
  • Marcy Bidney – Library
  • Mike Giarlo – Library
  • Nancy Henry – Library
  • Nic Cecchino – Library
  • Patricia Hswe – Library
  • Stephen Woods – Library

Research Cyberinfrastructure (University of California-San Diego)

“Research Cyberinfrastructure offers UC San Diego researchers the computing, network, and human infrastructure needed to create, manage, and share data, and SDSC’s favorable pricing can help researchers meet the new federal requirements for budget proposals.” — http://rci.ucsd.edu/

  • Ardys Kozbial – Data Curation Working Group (Libraries)
  • Dallas Thornton – Cyberinfrastructure Services
  • Ron Joyce – Associate Director, IT Infrastructure
  • Sharon Franks – Office of Research Affairs

Distributed Data Curation Center (Purdue University)

“We investigate and pursue innovative solutions for curation issues of organizing, facilitating access to, archiving for and preserving research data and data sets in complex environments.” — http://d2c2.lib.purdue.edu/

  • Elisa Bertino – Director of the Cyber Center at Discovery Park
  • Gerry McCartney – Information Technology
  • James L. Mullins – Dean of Libraries
  • Jay Akridge – Dean of Agriculture
  • Jeffrey Roberts – Dean of Science
  • Leah Jamieson – Dean of Engineering

6th International Digital Curation Conference

Posted on December 14, 2010 in Uncategorized

This posting documents my experiences at the 6th International Digital Curation Conference, December 6-8, 2010 in Chicago (Illinois). In a sentence, my understanding of the breadth and depth of data curation was reinforced, and the issues of data curation seem to be very similar to the issues surrounding open access publishing.

Day #1

After a few pre-conference workshops, which seemed to be popular, and after the reception the night before, the Conference began in earnest on Tuesday, December 7. The day’s presentations were akin to overviews of data curation, mostly from people who were data creators.

One of the keynote addresses was entitled “Working the crowd: Lessons from Galaxy Zoo” by Chris Lintott (University of Oxford & Adler Planetarium). In it he described how images of galaxies taken as a part of the Sloan Digital Sky Survey were classified through crowdsourcing techniques — the Galaxy Zoo. Wildly popular for a limited period of time, its success was attributed to convincing people their task was useful, treating them as collaborators (not subjects), and not wasting their time. He called the whole process “citizen science”, and he has recently launched Zooniverse in the same vein.

“Curation centres, curation services: How many are enough?” by Kevin Ashley (Digital Curation Centre) was the second talk, and in a tongue-in-cheek way, he said the answer was three. He went on to outline the whys and wherefores of curation centers. Different players: publishers, governments, and subject centers. Different motivations: institutional value, reuse, presentation of the data behind the graph, obligation, aggregation, and education. Different debates on who should do the work: libraries, archives, computer centers, institutions, disciplines, nations, localities. He summarized by noting how data is “living”, we have a duty to promote it, it is about more than scholarly research, and finally, three centers are not really enough.

Like Lintott, Antony Williams (Royal Society of Chemistry) described a crowdsourcing project in “ChemSpider as a platform for crowd participation”. He began by demonstrating the myriad ways Viagra has been chemically described on the ‘Net. “Chemical information on the Internet is a mess.” ChemSpider brings together many links from chemistry-related sites and provides a means for editing them in an online environment.

Barend Mons outlined one of the common challenges of metadata. Namely, the computer’s need for structured information and most individuals’ lack of desire to create it. In “The curation challenge for the next decade: Digital overlap strategy or collective brains?” Mons advocated the creation of “nano publications” in the form of RDF statements — assertions — as a possible solution. “We need computers to create ‘reasonable’ formats.”
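To make the idea concrete, a “nano publication” is, at its heart, a single machine-readable assertion. The following is a minimal sketch of one such assertion, assuming Python’s rdflib library (version 6 or later); the namespace and terms are made up for illustration:

    # a toy "nano publication": one small, machine-readable RDF assertion
    # the namespace and terms below are hypothetical
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")

    g = Graph()
    g.bind("ex", EX)

    # the assertion itself, as a single subject-predicate-object triple
    g.add((EX.malaria, EX.isTransmittedBy, EX.mosquito))

    # serialize the graph so both humans and computers can read it
    print(g.serialize(format="turtle"))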

“Idiosyncrasy at scale: Data curation in the humanities” by John Unsworth (University of Illinois at Urbana-Champaign) was the fourth presentation of the day. Unsworth began with an interesting set of statements. “Retrieval is a precondition for use, and normalization is a precondition for retrieval, but humanities’ texts are messy and difficult to normalize.” He went on to enumerate types of textual normalization: spelling, vocabulary, punctuation, “chunking”, mark-up, and metadata. He described MONK as a normalization project. He also mentioned a recent discussion on the Alliance of Digital Humanities Organizations site where humanists debated whether or not texts ought to be marked up prior to analysis. In short, idiosyncrasies abound.

The Best Student Paper Award was won by Youngseek Kim (Syracuse University) for “Education for eScience professionals: Integrating data curation and cyberinfrastructure”. In it he described the use of focus group interviews and an analysis of job postings to articulate the most common skills a person needs to become an “eScience professional”. In the end he outlined three sets of skills: 1) the ability to work with data, 2) the ability to collaborate with others, and 3) the ability to work with cyberinfrastructure. The eScience professional needs to have domain knowledge, a collaborative nature, and know how to work with computers. “The eScience professional needs to have a range of capabilities and play a bridging role between scientists and information professionals.”

After Kim’s presentation there was a discussion surrounding the role of the librarian in data curation. While I do not feel very much came out of the discussion, I was impressed with one person’s comment. “If a university’s research data were closely tied to the institution’s teaching efforts, then much of the angst surrounding data curation would suddenly go away, and a strategic path would become clear.” I thought that comment, especially coming from a United States Government librarian, was quite insightful.

The day’s events were (more or less) summarized by Clifford Lynch (Coalition for Networked Information) with some of the following quotes. “The NSF mandate is the elephant in the room… The NSF plans are not using the language of longevity… The whole thing may be a ‘wonderful experiment’… It might be a good idea for someone to create a list of the existing data plans and their characteristics in order to see which ones play out… Citizen science is not only about analysis but also about data collection.”

Day #2

The second day’s presentations were more practical in nature and seemingly geared toward librarians and archivists.

In my opinion, “Managing research data at MIT: Growing the curation community one institution at a time” by MacKenzie Smith (Massachusetts Institute of Technology Libraries) was the best presentation of the conference. In it she described data curation as a “meta-discipline” as defined in Media Ecology by Marshall McLuhan, where information can be described in terms of format, magnitude, velocity, direction, and access. She articulated how data gets tricky once a person travels beyond one’s own silo, and she described curation as being about reproducing data, aggregating data, and re-using data. Specific examples include: finding data, publishing data, preserving data, referencing data, making sense of data, and working with data. Like many of the presenters, she thought data curation was not the purview of any one institution or group, but rather a combination. She compared them to layers of storage, management, linking, discovery, delivery, and society. All of these things are done by different groups: researchers, subject disciplines, data centers, libraries & archives, businesses, colleges & universities, and funders. She then presented an interesting pair of case studies comparing & contrasting data curation activities at the University of Chicago and MIT. Finally she described a library’s role as one of providing services and collaboration. In the jargon of Media Ecology, “Libraries are a ‘keystone’ species.”

The Best Paper Award was given to Laura Wynholds (University of California, Los Angeles) for “Linking to scientific data: Identity problems of unruly and poorly bounded digital objects”. In it she pointed out how one particular data set was referenced, made accessible, and formatted in three different ways by three different publications. She went on to outline the challenges of identifying which data set to curate and how.

In “Making digital curation a systematic institutional function” Christopher Prom (University of Illinois at Urbana-Champaign) answered the question, “How can we be more systematic about bringing materials into the archives?” Using time granted via a leave of absence, Prom wrote Practical E-Records which “aims to evaluate software and conceptual models that archivists and records managers might use to identify, preserve, and provide access to electronic records.” He defined trust as an essential component of records management, and outlined the following process for building it: assess resources, write a program statement, engage records producers, implement policies, implement a repository, develop action plans, tailor workflows, and provide access.

James A. J. Wilson (University of Oxford) shared some of his experiences with data curation in “An institutional approach to developing research data management infrastructure”. According to Wilson, the Computing Services center is taking the coordinating role at Oxford when it comes to data curation, but he, like everybody else, emphasized the process is not about a single department or entity. He outlined a number of processes: planning, creation, local storage, documentation, institutional storage, discovery, retrieval, and training. He divided these processes between researchers, computing centers, and libraries. I thought one of the more interesting ideas Wilson described was DaaS (database as a service) where databases are created on demand for researchers to use.

Patricia Hswe (Penn State University) described how she and a team of other people at the University have broken down information silos to create a data repository. Her presentation, “Responding to the call to curate: Digital curation in practice at Penn State University”, outlined the use of microservices in their implementation, and she explained the successes of CurateCamps. She emphasized how the organizational context of the implementation is probably the most difficult part of the work.

Huda Kan (Cornell University) described an application to create, reuse, stage, and share research data in a presentation called “DataStaR: Using the Semantic Web approach for data curation”. The use of RDF was core to the system’s underlying data structure.

Since this was the last session in a particular concurrent track, a discussion followed Kan’s presentation. It revolved around errors in metadata, and the solutions discussed seemed to fall into three categories: 1) write better documentation and/or descriptions of data, 2) write computer programs to statistically identify errors and then fix them, or 3) have humans do the work. In the end, the solution is probably a combination of all three.
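The second category lends itself to a small illustration. Below is a toy sketch, using only Python’s standard library and made-up field values, of how rare variants of frequent values might be flagged as likely typos:

    # a toy sketch of statistically identifying likely metadata errors:
    # values that occur once but closely resemble a frequent value are
    # probably typos; all of the values below are made up for illustration
    from collections import Counter
    from difflib import get_close_matches

    values = ["Nikon D300", "Nikon D300", "Nikon D300", "Nikn D300",
              "Canon EOS 5D", "Canon EOS 5D", "Cannon EOS 5D"]

    counts = Counter(values)
    frequent = [value for value, n in counts.items() if n > 1]

    for value, n in counts.items():
        if n == 1:  # rare values are suspicious
            match = get_close_matches(value, frequent, n=1, cutoff=0.8)
            if match:
                print(f"possible error: {value!r}; did you mean {match[0]!r}?")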

Sometime during the conference I got the idea of creating a word cloud made up of Twitter “tweets” with the conference’s hash tag — idcc10. In a fit of creativity, I wrote the hack upon my return home, and the following illustration is the result:

[Word cloud illustrating the tweets tagged with idcc10]
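For the curious, the heart of such a hack is simply word-frequency counting. Here is a minimal sketch, assuming the tweets have already been harvested into a plain text file (the file name and stop word list are hypothetical):

    # a minimal word cloud hack: count word frequencies in harvested tweets
    import re
    from collections import Counter

    stopwords = {"the", "and", "for", "this", "that", "with", "from", "idcc10"}

    # tweets.txt is assumed to contain the harvested tweets, one per line
    with open("tweets.txt") as handle:
        words = re.findall(r"[a-z']+", handle.read().lower())

    frequencies = Counter(w for w in words if w not in stopwords and len(w) > 2)

    # print the most common words; a real hack would hand these frequencies
    # to an HTML/CSS or image renderer to draw the cloud itself
    for word, count in frequencies.most_common(25):
        print(count, word)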

Summary

The Conference was attended by approximately 250 people, apparently a record. The attendees were mostly from the United States (obviously), but it was not uncommon to see people from abroad. The Conference was truly international in scope. I was surprised at the number of people I knew but had not seen for a while because I have not recently been participating in Digital Library Federation-like circles. It was nice to rekindle old acquaintances and make some new ones.

As to be expected, the presentations outlined apparent successes based on experience. From my perspective, Notre Dame’s experience is just beginning. We ought to learn from this experience, and some of my take-aways include:

  • data curation is not the job of any one university department; there are many stakeholders
  • data curation is a process involving computer technology, significant resources of all types, and policy; all three are needed to make the process functional
  • data curation is a lot like open access publishing but without a lot of the moral high ground

Two more data creator interviews

Posted on December 11, 2010 in Uncategorized

Michelle Hudson and I have had a couple more data creator interviews, and here is a list of themes from them:

  • data types – data types include various delimited text files, narrative texts, geographic information system (GIS) files, images, and videos; the size of the data sets is about 20 GB
  • subject content – the subjects represented by the content include observations of primates and longitudinal studies of families
  • output – the output resulting from these various data sets include scholarly articles and simulations
  • data management – the information is saved on any number of servers located in the Center for Research Computing, under one’s desk, or in a departmental space; some backup and curation happens, but not a lot; there is little if any metadata assigned to the data; migrating data from older versions of software to newer versions is sometimes problematic
  • ongoing dissemination – everybody interviewed believed there needs to be a more formalized method for the ongoing management and dissemination of locally created data; some thought the Libraries ought to play a leadership role; others have considered offering the service to the campus community for a fee

Three data webinars

Posted on December 10, 2010 in Uncategorized

Between Monday, November 8 and Thursday, November 11 I participated in three data webinars — a subset of a larger number of webinars facilitated by the ICPSR, and this posting outlines what I learned from them.

Data Management Plans

The first was called “Data Management Plans” and presented by Katherine McNeill (MIT). She gave the briefest of histories of data sharing and noted the ICPSR has been doing this since 1962. With the advent of the recent National Science Foundation announcement requiring data curation plans, interest in curation has become keen, especially in the sciences. The National Institutes of Health has had a similar mandate for grants over $250,000. Many of these mandates only specify the need for a “what” when it comes to the plan, and not necessarily the “how”. This is slightly different from the United Kingdom’s way of doing things.

After evaluating a number of plans from a number of places, McNeill identified a set of core issues common to many of them:

  • a description of the project and data
  • standards to be applied
  • short-term storage specifications
  • legal and ethical issues
  • access policies and provisions
  • long-term archiving stipulations
  • funder-specific requirements

How do and/or will libraries support data curation? She answered this question by listing a number of possibilities:

  • instituting an interdisciplinary librarian model
  • creating a dedicated data center
  • getting any (all) librarians up to speed
  • having the scholarly communications librarian lead the efforts
  • creating partnerships with other campus departments
  • participating in a national data service
  • getting funder support
  • activities through the local office of research
  • doing more inter-university collaborations
  • providing services through professional societies

Somewhere along the line McNeill advocated reading ICPSR’s “Guidelines for Effective Data Management Plans”, which outlines elements of data plans as well as a number of examples.

America’s Most Wanted

The second webinar was “America’s Most Wanted: Top US Government Data Resources” presented by Lynda Kellam (The University of North Carolina at Greensboro). Kellam is a data librarian, and this session was akin to a bibliographic instruction session where a number of government data sources were described:

  • Data.gov – has a lot of data from the Environmental Protection Agency; works a lot like ICPSR; includes “chatter” around data; includes a “cool” preview function
  • Geospatial One Stop – a geographic information system portal with a lot of metadata; good for tracking down sources with a geographic interface
  • FactFinder – a demographic portal for commerce and census data; will include in the near future a more interactive interface
  • United States Bureau of Labor Statistics – lot o’ labor statistics
  • National Center for Education Statistics – includes demographics for school statistics and provides analysis online
  • DataFerrett – provides you with an applet to download, run, and use to analyze data

Students Analyzing Data

The final webinar I listened to was “Students Analyzing Data in the Large Lecture Class: Active Learning with SDA Online Analysis” by Jim Oberly (University of Wisconsin-Eau Claire). As a historian, Oberly is interested in making history come alive for his students. To do this, he uses ICPSR’s Analyze Data Online service, and this webinar demonstrated how. He began by asking questions about the Civil War such as “For economic reasons, would the institution of slavery have died out naturally, and therefore the Civil War would have been unnecessary?” Second, he identified a data set (New Orleans Slave Sale Sample, 1804-1862) from the ICPSR containing information on the sale of slaves. Finally, he used ICPSR’s online interface to query the data looking for trends in prices. In the end, I believe he was not so sure the War could have been avoided because the prices of slaves seemed unaffected by the political environment. The demonstration was fascinating, and the interface seemingly easy to use.

Summary

Based on these webinars it is an understatement to say the area of data is wide, varied, broad, and deep. Much of Library Land is steeped in books, but in the current environment books are only one of many manifestations of data, information, and knowledge. The profession is still grappling with every aspect of raw data. From its definition to its curation. From its organization to its use. From its politics to its economics.

I especially enjoyed seeing how data is being used online. Such is a growing trend, I believe, and represents an opportunity for the profession. The finding and acquisition of data sets is somewhat of a problem now, but such a thing will become less of a problem later. The bigger problem is learning how to use and understand the data. If the profession were to integrate functions for data’s use and understanding into its systems, then libraries will have a growing responsibility. If the profession only seeks to enable find and access, then the opportunities are limited and short-lived. Find and access are things we know how to do. Use and understanding require an adjustment of our skills, resources, and expertise. Are we up to the challenge?

Data tsunamis and explosions

Posted on October 29, 2010 in Uncategorized

Michelle Hudson and I have visited more teaching & research faculty across campus learning about their uses, needs, and wants when it comes to data. As one person put it, we are preparing for the “Data Tsunami”, or as another person put it — the “Data Explosion”. We have learned a few more things:

  • Brokering – At least a couple of the people we visited thought libraries ought to play a central role in the brokering of data sets. In their view, libraries would be repositories of data as well as manage the licensing of the data both inside and outside the University community. “Libraries can make it easy for us to find data sets.” The Institute for Quantitative Social Science at Harvard University may be a good model. This clearing house function needs to include services educating people on how to use the data, “cool” interfaces for finding and using the data, and links to bibliographic materials like books and journal articles. “We would also like somebody to help us keep track of who uses our data and where it is cited in the literature.”
  • Curation – Some people have “archived” original data sets in the form of paper-based surveys. These things are stored in file cabinets in basements. Others have elaborate computer systems complete with redundant backups, rsync functionality, and data refreshment protocols. One person alluded to HubZero as a possible tool for these tasks.
  • Data origination – Most of the people we have talked to generate their own data either through surveys or scientific equipment. Fewer people, so far, have gotten their data from other people or companies. When it has come from companies, the data has been encrypted and anonymized before it gets here.
  • Data types – The formats of the data fall into a couple of categories: 1) binary data such as images, video, & simulation output, and 2) plain text data in the form of spreadsheets (mostly) or sometimes relational databases. “We know that the use of relational databases is the ‘best’ way to organize this information, but none of us want to take the time to learn SQL.” (A middle path is sketched after this list.)
  • Licensing – At least a couple of the people we visited license their data to others. After working with General Counsel, contracts between parties are signed and the data is exchanged. We have yet to see any money changing hands. The licenses are used to protect the University from liability when the data gets used in ways not allowed by the license. A couple of people would like the University (or, specifically, the library) to handle this sort of paperwork.
  • Metadata – There is a wide spectrum of metadata application against the data sets. Some people have no metadata at all. Others maintain multi-volume books filled with “protocols” describing their data and how it is to be collected. One person said, “We spend a lot of our time correcting metadata tabulating what camera was used, when, and by whom… Our reputation rests on the quality of our data (and metadata). We’ve formatted our metadata as CSV files as well as XML files. In order for our data to be characterized as ‘good’ we need an error rate of 1000/1”.
  • Sharing – We are learning that the sharing of data is a complicated decision-making process. Many things come into play including but not necessarily limited to: the culture of the subject discipline, patents, the competitive nature of the researcher, intellectual property rights, funding agency requirements, embargoes, and the inclusion of human subjects. Some people are more willing to share than others. So far, no one will share their data until the first paper has been written. They want (need) “publication rights”.
  • Size – Everybody believes they have “large” data sets, but the definition of large needs to be qualified. On one hand large may be equated with sizable files. Videos are a good example. On the other hand large may mean many records. Big longitudinal studies complete with many fields per subject are a good example.
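Regarding the SQL comment above, a middle path exists between spreadsheets and relational database expertise. Here is a minimal sketch, using only Python’s standard library, of loading a CSV file into SQLite; the file name and column names are hypothetical:

    # a minimal sketch: move a spreadsheet (CSV) into a relational database
    # (SQLite) with very little SQL; observations.csv is a hypothetical file
    # with "subject" and "measurement" columns
    import csv
    import sqlite3

    connection = sqlite3.connect("observations.db")
    cursor = connection.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS observations (subject TEXT, measurement REAL)"
    )

    with open("observations.csv") as handle:
        rows = [(r["subject"], float(r["measurement"])) for r in csv.DictReader(handle)]

    cursor.executemany("INSERT INTO observations VALUES (?, ?)", rows)
    connection.commit()

    # one line of SQL buys a summary a spreadsheet needs hand-made formulas for
    for row in cursor.execute(
        "SELECT subject, AVG(measurement) FROM observations GROUP BY subject"
    ):
        print(row)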

We are also learning that no one person or group seems to have a complete handle on the issues surrounding data. Michelle and I certainly don’t. Everybody knows a lot but not everything. Consequently, we are thinking of hosting “Data Day” — a time and place when many of the people who deal with data for teaching and research get together, share war stories, and learn from each other’s experience. In the end we may understand how to be more efficient and prepared when the “tsunami” is actually upon us.

Off to interview more people… ‘More later.

David Dickinson and New Testament manuscripts

Posted on October 20, 2010 in Uncategorized

Yesterday David Dickinson came to visit the libraries to share and discuss some of his work regarding optical character recognition of New Testament manuscripts.

David Dickinson is a South Bend resident and Renaissance Man with a multifaceted educational background and vocational history. Along the way he became keenly interested in religion as well as computer programming. On and off for the past five years or so, and working in conjunction with the Center for the Study of New Testament Manuscripts, he has been exploring the possibilities of optical character recognition against New Testament manuscripts. Input very large digitized images of really, really old original New Testament manuscripts. Programmatically examine each man-made mark in the image. Use artificial intelligence computing techniques to determine (or guess) which “letter” the mark represents. Save the resulting transcription to a file. And finally, provide a means for the Biblical scholar to simultaneously compare the image with the resulting transcription and a “canonical” version of a displayed chapter/verse.
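To give a flavor of the “examine each man-made mark” step, here is a toy sketch (emphatically not David’s actual code) that thresholds a tiny, made-up grayscale image and finds its connected blobs of ink:

    # a toy sketch of the mark-finding step: threshold a grayscale image
    # and group dark pixels into connected components ("marks"); the image
    # below is made up, and real manuscript images are vastly larger
    image = [  # 0 = dark ink, 255 = light parchment
        [255,   0, 255, 255,   0],
        [255,   0, 255, 255,   0],
        [255, 255, 255, 255,   0],
    ]

    def find_marks(image, threshold=128):
        """Return lists of (y, x) pixels, one list per connected dark blob."""
        seen, marks = set(), []
        for y, row in enumerate(image):
            for x, pixel in enumerate(row):
                if pixel < threshold and (y, x) not in seen:
                    stack, mark = [(y, x)], []
                    seen.add((y, x))
                    while stack:
                        cy, cx = stack.pop()
                        mark.append((cy, cx))
                        for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                            if (0 <= ny < len(image) and 0 <= nx < len(image[0])
                                    and image[ny][nx] < threshold and (ny, nx) not in seen):
                                seen.add((ny, nx))
                                stack.append((ny, nx))
                    marks.append(mark)
        return marks

    # each mark would then be handed to a classifier that guesses its letter
    print(len(find_marks(image)), "marks found")  # prints: 2 marks found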

David’s goal is not so much to replace the work being done by scholars but rather to save their time. Using statistical techniques, he knows computer programs can work tirelessly to transcribe texts. These transcriptions are then expected to be reviewed by people. The results are then expected to be shared widely thus enabling other scholars to benefit.

David’s presentation was attended by approximately twenty people representing the Libraries, the Center for Social Research, and the Center for Research Computing. After the formal presentation a number of us discussed how David’s technology may or may not be applicable to the learning, teaching, and scholarship being done here at the University. For example, there are a number of Biblical scholars on campus, but many of them seem to focus on the Old Testament as opposed to the New Testament. The technology was deemed interesting, but some people thought it could not replace man-made transcriptions. Others wondered about the degree to which the technology could be applied against manuscripts other than the New Testament. In the end there were more questions than answers.

Next steps? Most of us thought David’s ideas were not dead-ends. Consequently, it was agreed that next steps will include presenting the technology to local scholars in an effort to learn whether or not it is applicable to their needs and the University’s.

Data curation at ECDL 2010

Posted on October 11, 2010 in Uncategorized

At the most recent ECDL conference in Glasgow (Scotland) there was a panel discussion on data curation called “Developing services to support research data management and sharing”. Below are some of the things I learned:

  • My take-away from Sara Jones’s (DCC) remarks was, “There are no incentives for sharing research data”, and when given the opportunity for sharing, data owners react by saying things like, “I’m giving my baby away… I don’t know the best practices… What are my roles and responsibilities?”
  • Veerle Van den Eynden (United Kingdom Data Archive) outlined how she puts together infrastructure, policy, and support (such as workshops) to create successful data archives. “infrastructure + support + policy = data sharing” She enumerated time, attitudes and privacy/confidentiality as the bigger challenges.
  • Robin Rice (EDINA) outlined services similar to Van den Eynden’s but was particularly interested in social science data and its re-use. There is a much longer tradition of sharing social science data, and it is definitely not intended to be a dark archive. She enumerated a similar but different set of barriers to sharing: ownership, freedom from errors, fear of scooping, poor documentation, and lack of rewards.
  • Rob Grim (Tilburg University) was the final panelist. He said, “We want to link publications with data sets as in Economists Online, and we want to provide a number of additional services against the data.” He described a data sharing incentive: “I will only give you my data if you provide me with sets of services against it, such as who is using it as well as where it is being cited.” Grim described the social issues surrounding data sharing as the most important. He compared & contrasted sharing with preservation, and re-use with archiving. “Not only is it important to have the data but it is also important to have the tools that created the data.”

Diddling with data

Posted on September 30, 2010 in Uncategorized

Michelle Hudson and I have begun visiting faculty across campus in an effort to learn about data needs and desires. We’re diddling with data. To date, we have only visited two (and a half) people, but I have learned a few things:

  • Darwin Core – Apparently there is a metadata schema used for describing biological content — Darwin Core. I wonder where they got that name?
  • ease-of-use – At least one faculty member said a University-wide effort to collect and curate scientific data would be a good thing, as long as the overhead of doing so was minimal.
  • evaluation – Some scientists generate data and then work with computer scientists to do the analysis and look for patterns. This seems akin to the relationship between social scientists and statisticians.
  • Google Groups – When it comes to collaboration, Google Groups & friends are used as tools. Consequently, much of their content is saved in Google’s “cloud”.
  • MBLWHOI Library – A faculty member suggested we get in touch with the librarian at the MBLWHOI Library because they are doing similar work, and “That person changed forever my perception of librarians.”
  • notebooks – Many scientists record their actions in notebooks, a la the way science was first conducted. Sometimes these notebooks are physical items, and sometimes they are digital manifestations. The physical items may benefit from digitization. The “born digital” notebooks may benefit from preservation. Everything goes into them. Data. Observations. Cogitations. In this vein, a number of software packages were brought to our attention including: YoGo and CambridgeSoft.
  • scooped – One faculty member thought the idea of sharing data enables “scooping”, the process of stealing another’s ideas. While they advocated sharing, they thought it would be better if it were brought up through the ranks and manifested in undergraduate research. This is what has essentially happened with the local implementation of the Excellent Undergraduate Research.

That is what I’ve learned so far. ‘More later.

Data curation at Purdue

Posted on September 23, 2010 in Uncategorized

On Wednesday, September 22, 2010 a number of us from Notre Dame (Eric Morgan, Julie Arnott, Michelle Hudson, and Rick Johnson) went on a road trip to visit a number of people from Purdue (Christopher Miller, Dean Lingley, Jacob Carlson, Mark Newton, Matthew Riehle, Michael Witt, and Megan Sapp Nelson). Our joint goal was to share with each other our experiences regarding data curation.

After introductions, quite a number of talking points were enumerated (Purdue’s IMLS project, their E-Data Taskforce and data repository prototype, data literacy, data citation, and electronic theses & dissertation data curation). But we spent the majority of our formal meeting talking about their Data Curation Profile toolkit. Consisting of a number of questions, the toolkit is intended to provide the framework for discussions with data creators (researchers, scholars, graduate students, faculty, etc.). “It is a way to engage the faculty… It is generic (modular) and intended to form a baseline for understanding… It is used as a guide to learn about the data and the researcher’s needs.” From the toolkit’s introduction:

A completed Data Curation Profile will contain two types of information about a data set. First, the Profile will contain information about the data set itself, including its current lifecycle, purpose, forms, and perceived value. Second, a Data Curation Profile will contain information regarding a researcher’s needs for the data including how and when the data should be made accessible to others, what documentation and description for the data are needed, and details regarding the need for the preservation of the data.

The Purdue folks are tentatively scheduled to give workshops around the country on the use of the toolkit, and I believe one of those workshops will be at the upcoming International Digital Curation Conference taking place in Chicago (December 6-8).

We also talked about infrastructure — both technical and human. We agreed that the technical infrastructure, while not necessarily trivial, could be created and maintained. On the other hand, the human infrastructure may be more difficult to establish. There are hosts of issues to address, listed here in no priority order: copyright & other legal issues, privacy & anonymity, business models & finances, workflows & true integration of services, and the articulation of roles played by librarians, curators, faculty, etc. “There is a need to build lots of infrastructure, both technical as well as human. The human infrastructure does not scale as well as the technical infrastructure.” Two examples were outlined. One required a couple of years of relationship building, and the other required cultural differences to be bridged.

We then retired to lunch and shared more of our experiences in a less formal atmosphere. We discussed “micro curation services”, data repository bibliographies, and the need to do more collaboration since our two universities have more in common than differences. Ironically, one of the projects being worked on at Purdue involves Notre Dame faculty, but alas, none of us from Notre Dame knew of the project’s specifics.

Yes, the drive was long and the meeting relatively short, but everybody went away feeling like their time was well-spent. “Thank you for hosting us!”

Hello world!

Posted on August 19, 2010 in Uncategorized

Welcome to Notre Dame Blogs. This is your first post. Edit or delete it, then start blogging!