Trip to the Internet Archive, Fort Wayne (Indiana)

Posted on July 18, 2011 in Uncategorized by Eric Lease Morgan

This is the tiniest of travelogues describing a field trip to a branch of the Internet Archive in Fort Wayne (Indiana), July 11, 2011.

Here at the Hesburgh Libraries we are in the midst of a digitization effort affectionately called the Catholic Pamphlets Project. We have been scanning away, but at the rate we are going we will not get done until approximately 2100. Consequently we are considering the outsourcing of the scanning work, and the Internet Archive came to mind. In an effort to learn more about their operation, a number of us went to visit a branch of the Internet Archive, located at the Allen County Public Library in Fort Wayne (Indiana), on Monday, July 11, 2011.

When we arrived we were pleasantly surprised at the appearance of the newly renovated public library. Open. Clean. Lots of people. Apparently the renovation caused quite a stir. “Why spend money on a public library?” The facilities of the Archive, on the other hand, were modest. It is located in the lower level of the building: no windows, cinderblock walls, and just a tiny bit cramped.

Internet Archive
Internet Archive, the movie!
Johnny Appleseed
Pilgrimage to Johnny Appleseed’s grave

We were then given a tour of the facility and learned about the workflow. Books arrive in boxes. Each book is associated with bibliographic metadata usually found in library catalogs. Each is assigned a unique identifier. The book is then scanned in a “Scribe”, a contraption cradling the book in a V-shape while photographing each page. After the books are digitized they are put through a bit of a quality control process making sure there are no missing pages, blurry images, or pictures of thumbs. Once that is complete the book’s images and metadata are electronically sent to the Internet Archive’s “home planet” in San Francisco for post-processing. This is where the various derivatives are made. Finally, the result is indexed and posted to the ‘Net. People are then free to download the materials and do with them just about anything they desire. We have sent a number of our pamphlets to the Fort Wayne facility and you can see the result of the digitization process.
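
As an aside, once an item has been processed and posted, its metadata and derivative files are easy to get at programmatically through the Archive’s public metadata interface at archive.org/metadata. Below is a minimal Python sketch, assuming the requests library and a made-up item identifier; it fetches an item’s record and lists the files available for download.

  #!/usr/bin/env python
  # List the files the Internet Archive makes available for a digitized item.
  # A sketch only; "catholicpamphlet00example" is a made-up identifier.

  import requests

  IDENTIFIER = 'catholicpamphlet00example'   # hypothetical; substitute a real one
  URL = 'https://archive.org/metadata/' + IDENTIFIER

  response = requests.get(URL)
  response.raise_for_status()
  item = response.json()

  # print a few bibliographic fields, then the derivative files
  metadata = item.get('metadata', {})
  print(metadata.get('title', '(no title)'))
  print(metadata.get('creator', '(no creator)'))
  for file in item.get('files', []):
      print('  %s (%s)' % (file.get('name'), file.get('format')))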

From my point of view, working with the Internet Archive sounds like a good idea, especially if one or more of their Scribes comes to live here in the Libraries. All things considered, their services are inexpensive. They have experience. They sincerely seem to have the public good at heart. They sure are more transparent than the Google Books project. Digitization by the Internet Archive may be a bit challenging when it comes to items not in the public domain, but compared to the other considerations I do not think this is a very big deal.

To cap off our visit to Fort Wayne we made a pilgrimage to Johnny Appleseed’s (John Chapman’s) grave. A good time was had by all.

The Catholic Pamphlets Project at the University of Notre Dame

Posted on June 21, 2011 in Uncategorized by Eric Lease Morgan

This posting describes an initiative colloquially called the Catholic Pamphlets Project and outlines the current state of the Project.

Earlier this year the Hesburgh Libraries was awarded a “mini-grant” from University administration to digitize and make accessible a set of Catholic Americana. From the proposal:

The proposed project will enable the Libraries and Archives to apply these [digital humanities computing] techniques to key Catholic resources held by the University. This effort supports and advances the Catholic mission of the University by providing enhanced access to significant Catholic scholarship and facilitating the discovery of new knowledge. We propose to create an online collection of Catholic Americana resources and to develop and deploy an online discovery environment that allows users to search metadata and full-text documents, and provides them with tools to interact with the documents. These web-based tools will support robust keyword searching within and across texts and the ability to analyze texts and detect patterns across documents using tools such as charts, graphs, timelines, etc.

A part of this Catholic Americana collection is a sub-collection of about 5,000 Catholic pamphlets. With titles like The Catholic Factor In Urban Welfare, About Keeping Your Child Healthy, and The Formation Of Scripture the pamphlets provide a rich fodder for research. We are currently in the process of digitizing these pamphlets, thus the name, the Catholic Pamphlets Project.

While the Libraries has digitized things in the past, previous efforts have been neither as holistic nor as large in scope. Because of the volume of materials, in both analog and digital forms, the Catholic Pamphlets Project is one of the larger digitization projects the Libraries has undertaken. Consequently, it involves just about every department: Collection Development, Special Collections, Preservation, Cataloging, Library Systems, and Public Services. To date, as many as twenty different people have been involved, and the number will probably grow.

What are we going to actually do? What objectives are we going to accomplish? The answers to these questions fall into four categories, listed here in no priority order:

  • digitize a set of Catholic Americana – the most obvious objective
  • experiment with digitizing techniques – here we are giving ourselves the opportunity to fail; we’ve never really been here before
  • give interesting opportunities to graduate students – through a stipend, a junior scholar will evaluate the collection, put it into context, and survey the landscape when it comes to the digital humanities
  • facilitate innovative services to readers – this will be the most innovative aspect of the Project because we will be providing a text mining interface to the digitized content

Towards these ends, a number of things have been happening. For example, catalogers have been drawing up new policies. And preservationists have been doing the same. Part-time summer help has been hired. They are located in our Art Slide Library and have digitized just less than two dozen items. As of this afternoon, summer workers in the Engineering Library are lending a scanning hand too. Folks from Collection Development are determining the copyright status of pamphlets. The Libraries is simultaneously building a relationship with the Internet Archive. A number of pamphlets have been sent to them, digitized, and returned. For a day in July a number of us plan on visiting an Internet Archive branch office in Fort Wayne to learn more. Folks from Systems have laid down the infrastructure for the text mining, a couple of text mining orientation sessions have been facilitated, and about two dozen pamphlets are available for exploration.

The Catholic Pamphlets Project is something new for the Hesburgh Libraries, and it is experiencing incremental progress.

Research Data Inventory

Posted on May 19, 2011 in Uncategorized by Eric Lease Morgan

This is the home page for the Research Data Inventory.

If you create or manage research data here at the University, then please complete the 10-question form. It will take you less than two minutes. I promise.

Research data abounds across the University of Notre Dame. It comes in many shapes and sizes, and it comes from many diverse disciplines. In order for the University to support research it behooves us to know what data sets exist and how they are characterized. This — the Research Data Inventory — is one way to accomplish this goal.

The aggregated results of the inventory will help the University understand the breadth & depth of the local research data, set priorities, and allocate resources. The more we know about the data sets on campus, the more likely resources will be allocated to make their management easier.

Complete the inventory, and tell your colleagues about it. Your efforts are sincerely appreciated.

Data Management Day

Posted on April 30, 2011 in Uncategorized by Eric Lease Morgan

This is the home page for the University of Notre Dame’s inaugural Data Management Day (April 25, 2011). Here you will find a thorough description of the event. In a sentence, it was a success.

Data Management Day

Introduction

Co-sponsored by the Center for Research Computing and the Hesburgh Libraries, the purpose of Data Management Day was to raise awareness of all things research data across the Notre Dame community. This half-day event took place on Monday, April 25 from 1–5 o’clock in Room 126 of DeBartolo Hall. It brought together as many people as possible who deal with research data. The issues included but were not limited to:

  • copyrights, trademarks, and patents
  • data management plans
  • data modeling and metadata description
  • financial sustainability
  • high performance computing
  • licensing, distribution, and data sharing
  • preservation and curation
  • personal privacy and human subjects
  • scholarly communication
  • security, access control, and authorization
  • sponsored research and funder requirements
  • storage and backup

Presenters

To help get us going and to stimulate our thinking, a number of speakers shared their experience.

In “Science and Engineering need’s perspective” Edward Bensman (Civil Engineering & Geological Sciences) described how he quantified the need for research data storage & backup. He noted that people’s storage quotas were increasing at a linear rate but the need for storage was increasing at an exponential rate. In short he said, “The CRC is not sized for University demand and we need an enterprise solution.” He went on to recommend a number of things, specifically:

  • review quotas and streamline the process for getting more
  • consider greater amounts of collaboration
  • improve campus-wide support for Mac OSX
  • survey constituents for more specific needs

Charles Vardeman (Center for Research Computing) in “Data Management for Molecular Simulations” outlined the workflow of theoretical chemists and enumerated a number of models for running calculations against them. He emphasized the need to give meaning to the data, and thus a metadata schema called SMILES was employed in conjunction with relational database models to describe the content. Vardeman concluded with a brief description of a file system-based indexing scheme that might make the storage and retrieval of information easier.

Vardeman’s abstract: Simulation algorithms are enabling scientists to ask interesting questions about molecular systems at an increasingly unmanageable rate from a data perspective. Traditional POSIX directory and file storage models are inadequate to categorize this ever increasing amount of data. Additionally, the tools for managing molecular simulation data must be highly flexible and extensible allowing unforeseen connections in the data to be elucidated. Recently, the Center for Research Computing built a simulation database to categorize data from Gaussian molecular calculations. Our experience of applying traditional database structures to this problem will be discussed highlighting the advantages and disadvantages of using such a strategy to manage molecular data.
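
I am no chemist, but to make the idea concrete, here is a minimal sketch of the sort of SMILES-keyed relational storage Vardeman described. It uses Python’s built-in sqlite3 module; the table layout, molecules, and numbers are my own illustration, not the Center’s actual schema.

  #!/usr/bin/env python
  # Store and retrieve molecular calculations keyed by SMILES strings.
  # A sketch only; the schema and values are illustrative, not the CRC's database.

  import sqlite3

  connection = sqlite3.connect(':memory:')
  cursor = connection.cursor()

  cursor.execute('''
      CREATE TABLE calculations (
          id        INTEGER PRIMARY KEY,
          smiles    TEXT NOT NULL,   -- machine-readable description of the molecule
          method    TEXT,            -- e.g., the level of theory used
          energy    REAL,            -- computed result; made-up numbers below
          data_file TEXT             -- path to the raw output on the file system
      )''')

  # two made-up records: water and benzene
  records = [
      ('O',        'B3LYP/6-31G*',  -76.4, '/data/sims/water-001.out'),
      ('c1ccccc1', 'B3LYP/6-31G*', -232.2, '/data/sims/benzene-001.out'),
  ]
  cursor.executemany(
      'INSERT INTO calculations (smiles, method, energy, data_file) VALUES (?, ?, ?, ?)',
      records)

  # retrieve everything computed for benzene
  for row in cursor.execute(
          'SELECT method, energy, data_file FROM calculations WHERE smiles = ?',
          ('c1ccccc1',)):
      print(row)

  connection.close()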

Daniel Skendzel (Project Manager for the Digital Asset Strategy Committee) presented an overview of the work of the Digital Asset Management group in “Our Approach to Digital Asset Management”. He began by comparing digital asset management to a storage closet and then showed two different pictures of closets, one messy and the other orderly. He described the University’s digital asset management system as “siloed”, and he envisioned bringing these silos together into a more coherent whole complete with suites of tools for using the assets more effectively. Skendzel compared & contrasted our strategy to Duke’s (coordinated), Yale’s (enabling), and the University of Michigan’s (integrated) noting the differences in functionality and maturity across all four. I thought his principles for cultural change — something he mentioned at the end — were most interesting:

  • central advocacy
  • faculty needs driven
  • built on standard architecture
  • flexible applications
  • addresses entire life cycle
  • mindful of the cultural element

Skendzel’s abstract: John Affleck-Graves and the Leadership Committee on Operational Excellence commissioned the Digital Asset Strategy Committee in May 2010 to create and administer a master plan to provide structure for managing digital content in the form of multi-media, images, specific document-types and research data. The plan will address a strategy for how we can best approach the lifecycle needs of capturing, managing, distributing and preserving our institutional digital content. This talk will focus on our progress toward a vision to enhance the value of our digital content by integrating our unique organizational culture with digital technologies.

Darren Davis (Associate Vice President for Research and Professor of Political Science) talked about the importance and role of institutional review boards in “Compliance and research data management”. He began by pointing out the long-standing issues of research on human subjects, noting a decades-old report outlining the challenges. He stated how the University goes well beyond the Federal guidelines, and he said respect for the individual person is the thing the University is most interested in when it comes to these guidelines. When human subjects are involved in any study, he said, it is very important for the subjects to understand what information is being gleaned from them, the compensation they will receive from the process, and that their participation is given willingly. When licensing data from human subject research, confidentiality is an ever-present challenge, and the data needs to be de-identified. Moreover, the licensed data cannot be repurposed. Finally, Davis said he and the Office of Research will help faculty create data management plans, and they look to expand these service offerings accordingly.

Davis’s abstract: Advances in technology have enabled investigators to explore new avenues of research, enhance productivity, and use data in ways unimagined before. However, the application of new technologies has the potential to create unanticipated compliance problems regarding what constitutes human subject research, confidentiality, and consent.

In “From Design to Archiving: Managing Multi-Site, Longitudinal Data in Psychology” Jennifer Burke (Research Assistant Professor of Psychology & Associate Director of the Center for Children and Families) gave an overview of the process she uses to manage her research data. She strongly advocated planning that includes storage, security, back-up, unit analysis, language, etc. Her data comes in all formats: paper, electronic, audio/video. She designs and builds her data sets sometimes in rows & columns and sometimes as linked relational databases. She is mindful of file naming conventions and the use of labeling conventions (her term for “metadata”). There is lots of data-entry, data clean-up, and sometimes “back filling”. Finally everything is documented in code books complete with a CD. She shares her data, when she can, through archives, online, and even the postal mail. I asked Burke which of the processes was the most difficult or time-consuming, and she said, without a doubt, the data-entry was the most difficult.

Burke’s abstract: This brief talk will summarize the work of the Data Management Center, from consulting on methodological designs to preparing data to be archived. The talk will provide an overview of the types of data that are typical for psychological research and the strategies we have developed to maintain these data safely and efficiently. Processes for data documentation and preparation for long-term archiving will be described.

Up next was Maciej Malawski (Center for Research Computing, University of Notre Dame & AGH University of Science and Technology, Krakow, Poland) and his “Prospects for Executable Papers in Web and Cloud Environments”. Creating research data is one thing, making it available is another. In this presentation Malawski advocated “executable papers” — applications/services embedded into published articles allowing readers to interact with the underlying data. The idea is not brand new and may have been first articulated as early as 1992 when CD-ROMs became readily available. Malawski gave at least a couple of working examples of executable papers, citing myExperiment and the GridSpace Virtual Laboratory.

Malawski’s abstract: Recent developments in both e-Science and computational technologies such as Web 2.0 and cloud computing call for a novel publishing paradigm. Traditional scientific publications should be supplemented with elements of interactivity, enabling reviewers and readers to reexamine the reported results by executing parts of the software on which such results are based as well as access primary scientific data. We will discuss opportunities brought by recent Web 2.0, Software-as-a-Service, grid and cloud computing developments, and how they can be combined together to make executable papers possible. As example solutions, we will focus on two specific environments: MyExperiment portal for sharing scientific workflows, and GridSpace virtual laboratory which can be used as a prototype executable paper engine.

Patrick Flynn (Professor of Computer Science & Engineering, Concurrent Professor of Electrical Engineering) seemed to have the greatest amount of experience in the group, and he shared it in a presentation called “You want to do WHAT?: Managing and distributing identifying data without running afoul of your research sponsor, your IRB, or your Office of Counsel”. Flynn and his immediate colleagues have more than 10 years of experience with biometric data. Working with government and non-government grant sponsors, Flynn has been collecting images of people’s irises, their faces, and other data points. The data is meticulously maintained, given back to the granters, and then licensed to others. To date Flynn has about 18 data sets to his credit, and they have been used in a wide variety of subsequent studies. The whole process is challenging, he says. Consent forms. Metadata accuracy. Licensing. Institutional review boards. In the end, he advocated the University cultivate a culture of data stewardship and articulated the need for better data management systems across campus.

Flynn’s abstract: This talk will summarize nine years’ experience with collecting biometrics data from consenting human subjects and distributing such data to qualified research groups. Key points visited during the talk will include: Transparency and disclosure; addressing concerns and educating the concerned; deploying infrastructure for the management of terabytes of data; deciding whom to license data to and how to decline requests; how to manage an ongoing data collection/enrollment/distribution workflow.

In “Globus Online: Software-as-a-Service for Research Data Management” Steven Tuecke (Deputy Director, Computation Institute, University of Chicago & Argonne National Laboratory) described the vision for a DropBox-like service for scientists called Globus Online. By exploiting cloud computing techniques, Tuecke sees a time when researchers can go to a website, answer a few questions, select a few check boxes, and have the information technology for their lab set up almost instantly. Technology components may include blogs, wikis, mailing lists, file systems for storage, databases for information management, indexer/search engines, etc. “Medium and small labs should be doing science, not IT (information technology).” In short, Tuecke advocated Software-as-a-Service (SaaS) for much of research data management.

Tuecke’s abstract: The proliferation of data and technology creates huge opportunities for new discoveries and innovations. But they also create huge challenges, as many researchers lack the IT skills, tools, and resources ($) to leverage these opportunities. We propose to solve this problem by providing missing IT to researchers via a cost-effective Software-as-a-Service (SaaS) platform, which we believe can greatly accelerate discovery and innovation worldwide. In this presentation I will discuss these issues, and demonstrate our initial step down this path with the Globus Online file transfer service.

The final presentation was given by Timothy Flanagan (Associate General Counsel for the University), “Legal issues and research data management”. Flanagan told the audience it was his responsibility to represent the University and provide legal advice. When it comes to research data management, there are more questions than answers. “A lot of these things are not understood.” He sees his job and the General Counsel’s job as one of balancing obligation with risk.

Summary

Jarek Nabrzyski (Center for Research Computing) and I believe Data Management Day was a success. The event itself was attended by more than sixty-five people, and they seemed to come from all parts of the University. Despite the fact that the presentations were only fifteen minutes long, each of the presenters obviously spent a great deal of time putting their thoughts together. Such effort is greatly appreciated.

The discussion after the presentations was thoughtful and meaningful. Some people believed a larger top-down effort to provide infrastructure support was needed. Others thought the needs were more pressing and that the solution to the infrastructure and policy issues needed to come up from the grassroots level. Probably a mixture of both is required.

One of the goals of Data Management Day was to raise awareness of all the issues surrounding research data management. The presentations covered many of them:

  • collecting, organizing, and distributing data
  • data management plans
  • digital asset management activities at Notre Dame
  • institutional review boards
  • legal issues surrounding research data management
  • organizing & analyzing data
  • SaaS and data management
  • storage space & infrastructure
  • the use of data after it is created

Data management is happening all across our great university. The formats, storage mechanisms, data modeling, etc. are different from project to project. But they all share a set of core issues that need to be addressed to one degree or another. By bringing together as many people as possible and facilitating discussion among them, the hope was to build understanding across our academe and ultimately work more efficiently. Data Management Day was one way to accomplish this goal.

What are the next steps? Frankly, we don’t know. All we can say is research data management is not an issue that can be addressed in isolation. Instead, everybody has some of the solution. Talk with your immediate colleagues about the issues, and more importantly, talk with people outside your immediate circle. Our whole is greater than the sum of our parts.

Data management & curation groups

Posted on March 18, 2011 in Uncategorized by Eric Lease Morgan

This is a short and incomplete list of universities with data management & curation groups. Each item includes the name of the local group, a link to the group’s home page, a blurb describing the focus of the group, and a sublist of group membership.

Research Data Management Service Group (Cornell)

“The Research Data Management Service Group (RDMSG) aims to: present a coherent set of services to researchers; develop a unified web presence providing general information on data management planning, services available on campus, and standard language that may be used in data management plans in grant proposals; provide a single point of contact that puts researchers in touch with specialized assistance as the need arises. The RDMSG is jointly sponsored by the Senior Vice Provost for Research and the University Librarian, and also has a faculty advisory board.” — http://data.research.cornell.edu/

  • Bill Block – Cornell Institute Social & Economics Research
  • Dave Lifka – Center for Advanced Computing
  • Dean Krafft – Library
  • Dianne Dietrich – Library
  • Eric Chen – Center for Advanced Computing
  • Gail Steinhart – Library
  • Janet McCue – Library
  • Jim Cordes – Astronomy
  • Stefan Kramer – Cornell Institute for Social & Economic Research

Research Data Management (Oxford)

“The University of Oxford is committed to supporting researchers in appropriate curation and preservation of their research data, and where applicable in accordance with the research funders’ requirements.” — http://www.admin.ox.ac.uk/rdm/

  • Bodleian Library (organization and documentation)
  • Central University Research Ethics Committee (ethical issues)
  • Departmental IT Support (backup and security)
  • Intellectual Property Advisory Group (lab notebooks)
  • Isis Innovation (commercial issues)
  • Legal Services (legal issues)
  • Oxford Digital Library (publication and preservation)
  • Oxford Research Archive (publication and preservation)
  • Research Services (funder policies)
  • Research Technology Services (technical aspects of data management)
  • The Data Library (access and discovery issues)

Research Data Services (University of Wisconsin-Madison)

“Digital curation covers cradle-to-grave data management, including storage, preservation, selection, transfer, description, sharing, access, reuse, and transformations. With the current focus on data sharing and preservation on the part of funding agencies, publishers, and research disciplines, having data management practices in place is more relevant than ever.” — http://researchdata.wisc.edu/

  • Alan Wolf – Madison Digital Media Center
  • Allan Barclay – Health Sciences Library
  • Amanda Werhane – Agriculture & Life Science Library
  • Brad Leege – DoIT Academic Technology
  • Bruce Barton – DoIT Academic Technology
  • Caroline Meikle – Soil Science Department/UW-Extension
  • Cindy Severt – Data & Information Services Center
  • Dorothea Salo – Memorial Library
  • Jan Cheetham – DoIT Academic Technology
  • Keely Merchant – Space Science Library
  • Nancy Wiegand – Geospatial Sciences
  • Rebecca Holz – Health Sciences Library
  • Ryan Schryver – Engineering

Data Management and Publishing (MIT)

“What should be included in a data management plan? Funding agencies, e.g., the National Science Foundation (NSF), may have specific requirements for plan content. Otherwise, there are fundamental data management issues that apply to most disciplines, formats, and projects. And keep in mind that a data management plan will help you to properly manage your data for your own use, not only to meet a funder requirement or enable data sharing in the future.” — http://libraries.mit.edu/guides/subjects/data-management/

  • Amy Stout – Library
  • Anne Graham – Library
  • Courtney Crummett – Library
  • Katherine McNeill – Library
  • Lisa Sweeney – Library

Scientific Data Consulting (University of Virginia)

“The SciDaC Group is ready to consult with you on your entire data life cycle, helping you to make the right decisions, so that your scientific research data will continue to be available when you and others need it in the future.” — http://www2.lib.virginia.edu/brown/data/

  • Andrew Sallans – Library
  • Sherry Lake – Library

Managing Your Data (University of Minnesota)

“The University Libraries are here to assist you with research data management issues through best practices, training, and awareness of data preservation issues. This site examines the research data life-cycle and offers tools and solutions for creation, storage, analysis, dissemination, and preservation of your data.” — http://www.lib.umn.edu/datamanagement

  • Amy West – Library
  • Lisa Johnston – Library
  • Meghan Lafferty – Library

Data Management Planning (Penn State)

“Good data management starts with comprehensive and consistent data documentation and should be maintained through the life cycle of the data.” — http://www.libraries.psu.edu/psul/scholar/datamanagement.html

  • Ann Holt – Library
  • Daniel C. Mack – Library
  • Kevin Clair – Library
  • Marcy Bidney – Library
  • Mike Giarlo – Library
  • Nancy Henry – Library
  • Nic Cecchino – Library
  • Patricia Hswe – Library
  • Stephen Woods – Library

Research Cyberinfrastructure (University of California-San Diego)

“Research Cyberinfrastructure offers UC San Diego researchers the computing, network, and human infrastructure needed to create, manage, and share data, and SDSC’s favorable pricing can help researchers meet the new federal requirements for budget proposals.” — http://rci.ucsd.edu/

  • Ardys Kozbial – Data Curation Working Group (Libraries)
  • Dallas Thornton – Cyberinfrastructure Services
  • Ron Joyce – Associate Director, IT Infrastructure
  • Sharon Franks – Office of Research Affairs

Distributed Data Curation Center (Purdue University)

“We investigate and pursue innovative solutions for curation issues of organizing, facilitating access to, archiving for and preserving research data and data sets in complex environments.” — http://d2c2.lib.purdue.edu/

  • Elisa Bertino – Director of the Cyber Center at Discovery Park
  • Gerry McCartney – Information Technology
  • James L. Mullins – Dean of Libraries
  • Jay Akridge – Dean of Agriculture
  • Jeffrey Roberts – Dean of Science
  • Leah Jamieson – Dean of Engineering

6th International Data Curation Conference

Posted on December 14, 2010 in Uncategorized by Eric Lease Morgan

This posting documents my experiences at the 6th International Data Curation Conference, December 6-8, 2010 in Chicago (Illinois). In a sentence, my understanding of the breadth and depth of data curation was reinforced, and the issues of data curation seem to be very similar to the issues surrounding open access publishing.

Day #1

After a few pre-conference workshops which seemed to be popular, and after the reception the night before, the Conference began in earnest on Tuesday, December 7. The presentations of the day were akin to overviews of data curation, mostly from the people who were data creators.

One of the keynote addresses was entitled “Working the crowd: Lessons from Galaxy Zoo” by Chris Lintott (University of Oxford & Adler Planetarium). In it he described how images of galaxies taken as a part of the Sloan Digital Sky Survey were classified through crowd sourcing techniques — the Galaxy Zoo. Wildly popular for a limited period of time, its success was attributed to convincing people their task was useful, treating them as collaborators (not subjects), and not wasting their time. He called the whole process “citizen science”, and he has recently launched Zooniverse in the same vein.

“Curation centres, curation services: How many are enough?” by Kevin Ashley (Digital Curation Centre) was the second talk, and in a tongue-in-cheek way, he said the answer was three. He went on to outline the whys and wherefores of curation centers. Different players: publishers, governments, and subject centers. Different motivations: institutional value, reuse, presentation of the data behind the graph, obligation, aggregation, and education. Different debates on who should do the work: libraries, archives, computer centers, institutions, disciplines, nations, localities. He summarized by noting how data is “living”, we have a duty to promote it, it is about more than scholarly research, and finally, three centers are not really enough.

Like Lintott, Antony Williams (Royal Society of Chemistry) described a crowd sourcing project in “ChemSpider as a platform for crowd participation”. He began by demonstrating the myriad of ways Viagra has been chemically described on the ‘Net. “Chemical information on the Internet is a mess.” ChemSpider brings together many links from chemistry-related sites and provides a means for editing them in an online environment.

Barend Mons outlined one of the common challenges of metadata. Namely, the computer’s need for structured information and most individuals’ lack of desire to create it. In “The curation challenge for the next decade: Digital overlap strategy or collective brains?” Mons advocated the creation of “nano publications” in the form of RDF statements — assertions — as a possible solution. “We need computers to create ‘reasonable’ formats.”
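
To make the idea of a nano publication a bit more concrete, here is a minimal sketch using the Python rdflib library: a single assertion expressed as an RDF triple and serialized as Turtle. The example.org vocabulary is invented for illustration only.

  #!/usr/bin/env python
  # Express a single "nano publication"-style assertion as RDF.
  # A sketch; the example.org vocabulary is made up for illustration.

  from rdflib import Graph, Namespace, URIRef

  EX = Namespace('http://example.org/vocabulary/')

  graph = Graph()
  graph.bind('ex', EX)

  # the assertion itself: aspirin treats headache
  graph.add((URIRef('http://example.org/drug/aspirin'),
             EX.treats,
             URIRef('http://example.org/condition/headache')))

  # serialize the assertion so both humans and machines can read it
  print(graph.serialize(format='turtle'))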

“Idiosyncrasy at scale: Data curation in the humanities” by John Unsworth (University of Illinois at Urbana-Champaign) was the fourth presentation of the day. Unsworth began with an interesting set of statements. “Retrieval is a precondition for use, and normalization is a precondition for retrieval, but humanities’ texts are messy and difficult to normalize.” He went on to enumerate types of textual normalization: spelling, vocabulary, punctuation, “chunking”, mark-up, and metadata. He described MONK as a normalization project. He also mentioned a recent discussion on the Alliance of Digital Humanities Organizations site where humanists debated whether or not texts ought to be marked up prior to analysis. In short, idiosyncrasies abound.

The Best Student Paper Award was won by Youngseek Kim (Syracuse University) for “Education for eScience professionals: Integrating data curation and cyberinfrastructure”. In it he described the use of focus group interviews and an analysis of job postings to articulate the most common skills a person needs to be an “escience professional”. In the end he outlined three sets of skills: 1) the ability to work with data, 2) the ability to collaborate with others, and 3) the ability to work with cyberinfrastructure. The escience professional needs to have domain knowledge, a collaborative nature, and know how to work with computers. “The escience professional needs to have a range of capabilities and play a bridging role between scientists and information professionals.”

After Kim’s presentation there was a discussion surrounding the role of the librarian in data curation. While I do not feel very much came out of the discussion, I was impressed with one person’s comment. “If a university’s research data were closely tied to the institution’s teaching efforts, then much of the angst surrounding data curation would suddenly go away, and a strategic path would become clear.” I thought that comment, especially coming from a United States Government librarian, was quite insightful.

The day’s events were (more or less) summarized by Clifford Lynch (Coalition for Networked Information) with some of the following quotes. “The NSF mandate is the elephant in the room… The NSF plans are not using the language of longevity… The whole thing may be a ‘wonderful experiment’… It might be a good idea for someone to create a list of the existing data plans and their characteristics in order to see which ones play out… Citizen science is not only about analysis but also about data collection.”

Day #2

The second day’s presentations were more practical in nature and seemingly geared for librarians and archivists.

In my opinion, “Managing research data at MIT: Growing the curation community one institution at a time” by MacKenzie Smith (Massachusetts Institute of Technology Libraries) was the best presentation of the conference. In it she described data curation as a “meta-discipline” as defined in Media Ecology by Marshall McLuhan, and where information can be described in terms of format, magnitude, velocity, direction, and access. She articulated how data is tricky once a person travels beyond one’s own silo, and she described curation as being about reproducing data, aggregating data, and re-using data. Specific examples include: finding data, publishing data, preserving data, referencing data, making sense of data, and working with data. Like many of the presenters, she thought data curation was not the purview of any one institution or group, but rather a combination. She compared them to layers of storage, management, linking, discovery, delivery, management, and society. All of these things are done by different groups: researchers, subject disciplines, data centers, libraries & archives, businesses, colleges & universities, and funders. She then presented an interesting set of two case studies comparing & contrasting data curation activities at the University of Chicago and MIT. Finally she described a library’s role as one of providing services and collaboration. In the jargon of Media Ecology, “Libraries are a ‘keystone’ species.”

The Best Paper Award was given to Laura Wynholds (University of California, Los Angeles) for “Linking to scientific data: Identity problems of unruly and poorly bounded digital objects”. In it she pointed out how one particular data set was referenced, accessible, and formatted from three different publications in three different ways. She went on to outline the challenges of identifying which data set to curate and how.

In “Making digital curation a systematic institutional function” Christopher Prom (University of Illinois at Urbana-Champaign) answered the question, “How can we be more systematic about bringing materials into the archives?” Using time granted via a leave of absence, Prom wrote Practical E-Records which “aims to evaluate software and conceptual models that archivists and records managers might use to identify, preserve, and provide access to electronic records.” He defined trust as an essential component of records management, and outlined the following process that needs to be followed in order to build it: assess resources, write a program statement, engage records producers, implement policies, implement a repository, develop action plans, tailor workflows, and provide access.

James A. J. Wilson (University of Oxford) shared some of his experiences with data curation in “An institutional approach to developing research data management infrastructure”. According to Wilson, the Computing Services center is taking the coordinating role at Oxford when it comes to data curation, but he, like everybody else, emphasized the process is not about a single department or entity. He outlined a number of processes: planning, creation, local storage, documentation, institutional storage, discovery, retrieval, and training. He divided these processes between researchers, computing centers, and libraries. I thought one of the more interesting ideas Wilson described was DaaS (database as a service) where databases are created on demand for researchers to use.

Patricia Hswe (Penn State University) described how she and a team of other people at the University have broken down information silos to create a data repository. Her presentation, “Responding to the call to curate: Digital curation in practice at Penn State University” outlined the use of microservices in their implementation, and she explained the successes of CurateCamps. She emphasized how the organizational context of the implementation is probably the most difficult part of the work.

Huda Kan (Cornell University) described an application to create, reuse, stage, and share research data in a presentation called “DataStaR: Using the Semantic Web approach for data curation”. The use of RDF was core to the system’s underlying data structure.

Since this was the last session in a particular concurrent track, a discussion followed Kan’s presentation. It revolved around the errors in metadata, and the discussed solutions seemed to fall into three categories: 1) write better documentation and/or descriptions of data, 2) write computer programs to statistically identify errors and then fix them, or 3) have humans do the work. In the end, the solution is probably a combination of all three.

Sometime during the conference I got the idea of creating a word cloud made up of Twitter “tweets” with the conference’s hash tag — idcc10. In a fit of creativity, I wrote the hack upon my return home, and the following illustration is the result:

Word cloud illustrating the tweets tagged with idcc10
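
For the curious, the hack boiled down to little more than counting words. The following Python sketch is a re-creation of the idea, not the original script; it assumes the tweets have already been saved to a plain text file (tweets.txt), tallies the words, and prints the most frequent ones, which can then be fed to any word cloud generator.

  #!/usr/bin/env python
  # Tally the words found in a file of tweets tagged idcc10.
  # A re-creation of the idea only; tweets.txt is an assumed, pre-harvested file.

  import re
  from collections import Counter

  STOPWORDS = set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'for', 'rt', 'idcc10'])

  counter = Counter()
  with open('tweets.txt') as handle:
      for line in handle:
          for word in re.findall(r'[a-z]+', line.lower()):
              if word not in STOPWORDS and len(word) > 2:
                  counter[word] += 1

  # the most frequent words become the biggest words in the cloud
  for word, frequency in counter.most_common(25):
      print(frequency, word)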

Summary

The Conference was attended by approximately 250 people, apparently a record. The attendees were mostly from the United States (obviously), but it was not uncommon to see people from abroad. The Conference was truly international in scope. I was surprised at the number of people I knew but had not seen for a while because I have not been recently participating in Digital Library Federation-like circles. It was nice to rekindle old acquaintances and make some new ones.

As to be expected, the presentations outlined apparent successes based on experience. From my perspective, Notre Dame’s efforts are just beginning. We ought to learn from the experiences of others, and some of my take-aways include:

  • data curation is not the job of any one university department; there are many stakeholders
  • data curation is a process involving computer technology, significant resources of all types, and policy; all three are needed to make the process functional
  • data curation is a lot like open access publishing but without a lot of the moral high ground

Two more data creator interviews

Posted on December 11, 2010 in Uncategorized by Eric Lease Morgan

Michelle Hudson and I have had a couple more data creator interviews, and here is a list of themes from them:

  • data types – data types include various delimited text files, narrative texts, geographic information system (GIS) files, images, and videos; the size of the data sets is about 20 GB
  • subject content – the subjects represented by the content include observations of primates and longitudinal studies of families
  • output – the output resulting from these various data sets include scholarly articles and simulations
  • data management – the information is saved on any number of servers located in the Center for Research Computing, under one’s desk, or in a departmental space; some backup and curation happens, but not a lot; there is little if any metadata assigned to the data; migrating data from older versions of software to new versions is sometimes problematic
  • ongoing dissemination – everybody interviewed believes there needs to be a more formalized method for the ongoing management and dissemination of locally created data; some thought the Libraries ought to play a leadership role; others have considered offering the service to the campus community for a fee

Three data webinars

Posted on December 10, 2010 in Uncategorized by Eric Lease Morgan

Between Monday, November 8 and Thursday, November 11 I participated in three data webinars — a subset of a larger number of webinars facilitated by the ICPSR, and this posting outlines what I learned from them.

Data Management Plans

The first was called “Data Management Plans” and presented by Katherine McNeill (MIT). She gave the briefest of histories of data sharing and noted the ICPSR has been doing this since 1962. With the advent of the recent National Science Foundation announcement requiring data curation plans, interest in curation has become keen, especially in the sciences. The National Institutes of Health has had a similar mandate for grants over $250,000. Many of these mandates only specify the need for a “what” when it comes to a plan, and not necessarily the “how”. This is slightly different from the United Kingdom’s way of doing things.

After evaluating a number of plans from a number of places, McNeill identified a set of core issues common to many of them:

  • a description of the project and data
  • standards to be applied
  • short-term storage specifications
  • legal and ethical issues
  • access policies and provisions
  • long-term archiving stipulations
  • funder-specific requirements

How do and/or will libraries support data curation? She answered this question by listing a number of possibilities:

  • instituting an interdisciplinary librarian model
  • creating a dedicated data center
  • getting any (all) librarians up to speed
  • having the scholarly communications librarian lead the efforts
  • creating partnerships with other campus departments
  • participating in a national data service
  • getting funder support
  • activities through the local office of research
  • doing more inter-university collaborations
  • providing services through professional societies

Somewhere along the line McNeill advocated reading ICPSR’s “Guidelines for Effective Data Management Plans”, which outlines elements of data plans as well as a number of examples.

America’s Most Wanted

The second webinar was “America’s Most Wanted: Top US Government Data Resources” presented by Lynda Kellam (The University of North Carolina at Greensboro). Kellam is a data librarian, and this session was akin to a bibliographic instruction session where a number of government data sources were described:

  • Data.gov – has a lot of data from the Environmental Protection Agency; works a lot like ICPSR; includes “chatter” around data; includes “cool” preview function
  • Geospatial One Stop – a geographic information system portal with a lot of metadata; good for tracking down sources with a geographic interface
  • FactFinder – a demographic portal for commerce and census data; will include in the near future a more interactive interface
  • United States Bureau of Labor Statistics – lot o’ labor statistics
  • National Center for Education Statistics – includes demographics for school statistics and provides analysis online
  • DataFerrett – provides you with an applet to download, run, and use to analyze data

Students Analyzing Data

The final webinar I listened to was “Students Analyzing Data in the Large Lecture Class: Active Learning with SDA Online Analysis” by Jim Oberly (University of Wisconsin-Eau Claire). As a historian, Oberly is interested in making history come alive for his students. To do this, he uses ICPSR’s Analyze Data Online service, and this webinar demonstrated how. He began by asking questions about the Civil War such as “For economic reasons, would the institution of slavery have died out naturally, and therefore the Civil War would have been unnecessary?” He then identified a data set (New Orleans Slave Sale Sample, 1804-1862) from the ICPSR containing information on the sale of slaves. Finally, he used ICPSR’s online interface to query the data looking for trends in prices. In the end, I believe he was not so sure the War could have been avoided because the prices of slaves seemed unaffected by the political environment. The demonstration was fascinating, and the interface seemed easy to use.

Summary

Based on these webinars it is an understatement to say the area of data is wide, varied, broad, and deep. Much of Library Land is steeped in books, but in the current environment books are only one of many manifestations of data, information, and knowledge. The profession is still grappling with every aspect of raw data. From its definition to its curation. From its organization to its use. From its politics to its economics.

I especially enjoyed seeing how data is being used online. Such use is a growing trend, I believe, and represents an opportunity for the profession. The finding and acquisition of data sets is somewhat of a problem now, but it will become less of a problem later. The bigger problem is learning how to use and understand the data. If the profession were to integrate functions for data’s use and understanding into its systems, then libraries would have a growing responsibility. If the profession only seeks to enable find and access, then the opportunities are limited and short-lived. Find and access are things we know how to do. Use and understanding require an adjustment of our skills, resources, and expertise. Are we up to the challenge?

Data tsunamis and explosions

Posted on October 29, 2010 in Uncategorized by Eric Lease Morgan

Michelle Hudson and I have visited more teaching & research faculty across campus learning about their uses, needs, and wants when it comes to data. As one person put it, we are preparing for the “Data Tsunami”, or as another person put it — the “Data Explosion”. We have learned a few more things:

  • Brokering – At least a couple of the people we visited thought libraries ought to play a central role in the brokering of data sets. In their view, libraries would be repositories of data as well as manage the licensing of the data both inside and outside the University community. “Libraries can make it easy for us to find data sets.” The Institute for Quantitative Social Science at Harvard University may be a good model. This clearinghouse function needs to include services educating people on how to use the data, “cool” interfaces for finding and using the data, and links to bibliographic materials like books and journal articles. “We would also like somebody to help us keep track of who uses our data and where it is cited in the literature.”
  • Curation – Some people have “archived” original data sets in the form of paper-based surveys. These things are stored in file cabinets in basements. Others have elaborate computer systems complete with redundant backups, rsync functionality, and data refreshment protocols. One person alluded to HubZero as a possible tool for these tasks.
  • Data origination – Most of the people we have talked to generate their own data either through surveys or scientific equipment. Fewer people, so far, have gotten their data from other people or companies. When it has come from companies, the data has been encrypted and anonymized before it gets here.
  • Data types – The formats of the data fall into a couple of categories: 1) binary data such as images, video, & simulation output, and 2) plain text data in the form of spreadsheets (mostly) or sometimes relational databases. “We know that the use of relational databases is the ‘best’ way to organize this information, but none of us want to take the time to learn SQL.” (A sketch of what that move might look like appears after this list.)
  • Licensing – At least a couple of the people we visited license their data to others. After working with General Counsel, contracts between parties are signed and the data is exchanged. We have yet to see any money changing hands. The licenses are used to protect the University from liability when the data gets used in ways not allowed by the license. A couple of people would like the University (or specifically the library) to handle this sort of paperwork.
  • Metadata – There is a wide spectrum of metadata application against the data sets. Some people have no metadata at all. Others maintain multi-volume books filled with “protocols” describing their data and how it is to be collected. One person said, “We spend a lot of our time correcting metadata tabulating what camera was used, when, and by whom… Our reputation rests on the quality of our data (and metadata). We’ve formatted our metadata as CSV files as well as XML files. In order for our data to be characterized as ‘good’ we need an error rate of 1000/1”.
  • Sharing – We are learning that the sharing of data is a complicated decision-making process. Many things come into play including but not necessarily limited to: the culture of the subject discipline, patents, the competitive nature of the researcher, intellectual property rights, funding agency requirements, embargoes, and the inclusion of human subjects. Some people are more willing to share than others. So far, no one will share their data until the first paper has been written. They want (need) “publication rights”.
  • Size – Everybody believes they have “large” data sets, but the definition of large needs to be qualified. On one hand large may be equated with sizable files. Videos are a good example. On the other hand large may mean many records. Big longitudinal studies complete with many fields per subject are a good example.
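
As promised above, the jump from a spreadsheet to a relational database does not have to be a big one. Here is a minimal Python sketch, assuming a hypothetical observations.csv file with made-up column names; it loads the rows into SQLite and runs a single SQL query, no database server required.

  #!/usr/bin/env python
  # Load a (hypothetical) spreadsheet of observations into SQLite and query it.
  # A sketch only; observations.csv and its columns are made up for illustration.

  import csv
  import sqlite3

  connection = sqlite3.connect('observations.db')
  cursor = connection.cursor()
  cursor.execute('''
      CREATE TABLE IF NOT EXISTS observations (
          observer TEXT,
          subject  TEXT,
          observed TEXT,   -- date of the observation, as YYYY-MM-DD
          note     TEXT
      )''')

  with open('observations.csv') as handle:
      reader = csv.DictReader(handle)
      for row in reader:
          cursor.execute(
              'INSERT INTO observations (observer, subject, observed, note) VALUES (?, ?, ?, ?)',
              (row['observer'], row['subject'], row['observed'], row['note']))

  connection.commit()

  # a question no single spreadsheet column answers directly
  for row in cursor.execute(
          'SELECT subject, COUNT(*) FROM observations GROUP BY subject ORDER BY COUNT(*) DESC'):
      print(row)

  connection.close()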

We are also learning that no one person or group seems to have a complete handle on the issues surrounding data. Michelle and I certainly don’t. Everybody knows a lot but not everything. Consequently, we are thinking of hosting “Data Day” — a time and place when many of the people who deal with data for teaching and research get together, share war stories, and learn from each other’s experience. In the end we may understand how to be more efficient and prepared when the “tsunami” is actually upon us.

Off to interview more people… ‘More later.

David Dickinson and New Testament manuscripts

Posted on October 20, 2010 in Uncategorized by Eric Lease Morgan

Yesterday David Dickinson came to visit the libraries to share and discuss some of his work regarding optical character recognition of New Testament manuscripts.

David Dickinson is a South Bend resident and Renaissance Man with a multifaceted educational background and vocational history. Along the way he became keenly interested in religion as well as computer programming. On and off for the past five years or so, and working in conjunction with the Center for the Study of New Testament Manuscripts, he has been exploring the possibilities of optical character recognition against New Testament manuscripts. Input very large digitized images of really, really old original New Testament manuscripts. Programmatically examine each man-made mark in the image. Use artificial intelligence computing techniques to determine (or guess) which “letter” the mark represents. Save the resulting transcription to a file. And finally, provide a means for the Biblical scholar to simultaneously compare the image with the resulting transcription and a “canonical” version of a displayed chapter/verse.
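
Put another way, the heart of the process is a classification problem: given the image of a mark, guess the letter. The following is a minimal sketch of that one step, assuming glyphs have already been segmented from the page images and flattened into fixed-length pixel vectors, and assuming the scikit-learn library. It illustrates the idea only; it is not David’s actual software.

  #!/usr/bin/env python
  # Guess which letter a scanned mark represents.
  # A sketch of the classification step only; the tiny 4-pixel "images" are toys.

  from sklearn.neighbors import KNeighborsClassifier

  # training data: pixel vectors of marks whose letters are already known
  known_marks   = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
  known_letters = ['alpha', 'beta', 'alpha', 'beta']

  classifier = KNeighborsClassifier(n_neighbors=1)
  classifier.fit(known_marks, known_letters)

  # a newly scanned, unidentified mark
  unknown_mark = [[0, 1, 1, 1]]
  print(classifier.predict(unknown_mark))   # the program's best guess at the letter

  # the resulting transcription would then be saved and reviewed by a scholar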

David’s goal is not so much to replace the work being done by scholars but rather to save their time. Using statistical techniques, he knows computer programs can work tirelessly to transcribe texts. The transcriptions are then reviewed by people, and the results are shared widely, enabling other scholars to benefit.

David’s presentation was attended by approximately twenty people representing the Libraries, the Center for Social Research, and the Center for Research Computing. After the formal presentation a number of us discussed how David’s technology may or may not be applicable to the learning, teaching, and scholarship being done here at the University. For example, there are a number of Biblical scholars on campus, but many of them seem to focus on the Old Testament as opposed to the New Testament. The technology was deemed interesting, but some people thought it could not replace man-made transcriptions. Others wondered about the degree to which the technology could be applied against manuscripts other than the New Testament. In the end there were more questions than answers.

Next steps? Most of us thought David’s ideas were not dead-ends. Consequently, it was agreed that next steps will include presenting the technology to local scholars in an effort to learn whether or not it is applicable to their needs and the University’s.