Driven By Shared Data: A Travelogue
Posted on June 8, 2013 in Uncategorized by Eric Lease Morgan
Last Friday (May 31, 2013) I attended an interesting symposium at Northwestern University called Driven By Shared Data. This blog posting describes my experiences.
Driven By Shared Data was an OCLC-sponsored event with the purpose of bringing together librarians to discuss “opportunities and operational challenges of turning data into powerful analysis and purposeful action”. At first I thought the symposium was going to be about the curation of “research data”, but I was pleasantly surprised otherwise. The symposium was organized into a number of sections / presentations, each enumerated below:
- Larry Birnbaum (Northwestern University) – Birnbaum’s opening remarks bordered on the topic of artificial intelligence. For a long time he has been interested in the problem of “find more like this one”. To address this problem, he often took initial queries sent to things like Google, syntactically altered the queries, and resubmitted them. Other times he looked at search results, did entity extraction against them, looked for entities occurring less frequently, supplemented the queries with these newly found entities, and repeated the search process. The result was usually a set of “interesting” search results, results that were not identical to the original but rather slightly askew. He also described and demonstrated a recommender service listing books of possible interest based on Twitter tweets. More recently he has been spending his time creating computer-generated narrative texts from sets of numeric data. For example, given the essential statistics from a baseball game, he and his colleagues have been able to generate newspaper stories describing the action of the game: “The game was tied until the bottom of the seventh inning when Bass Ball came to bat. Ball hit a double. Jim Hitter was up next, and blew one out of the park. The final score was three to one. Go home team!” What is the problem he is trying to solve? Rows and columns of data often do not make sense to the general reader. Illustrating the data graphically goes a long way toward describing trends, but not everybody knows how to read graphs. Narrative texts supplement both the original data and graphical illustrations. His technique has been applied to all sorts of domains, from business to medicine. This is interesting because many times people want images rather than words. (“A picture is worth a thousand words.”) Birnbaum, on the other hand, generates a thousand words from pictures as well as data sets. In the words of Birnbaum, “Stories make data meaningful.” Some of his work has been commercialized at a site called Narrative Science. (A toy sketch of this data-to-text idea appears after this list.)
- Deborah Blecic (University of Illinois at Chicago) – Blecic described how some of her collection development processes have changed with the availability of COUNTER statistics. She began by enumerating some of her older data sets: circulation counts, reshelving counts, etc. She then gave an overview of some of the data sets available from COUNTER: number of hits, number of reads, etc. Based on this new information she has determined how she is going to alter her subscription to “the Big Deal” when the time comes for changing it. She also questioned the integrity of COUNTER statistics because they can be ambiguous: “What is a ‘read’? Is a read when a patron looks at the HTML abstract, or is a read when the patron downloads a PDF version of an article? How do the patrons identify items to ‘read’?” She is looking forward to COUNTER 4.
- Dave Green (Northeastern Illinois University) – Green shared with the audience some of the challenges he has faced when dealing with data generated from an ethnography project. More specifically, Green is the Project Director for ERIAL. Through this project a lot of field work was done, and the data created was not necessarily bibliographic in nature. Examples included transcripts of interviews, cognitive maps, photographs, movies & videos, the results of questionnaires, etc. Being an anthropological study, the data was more qualitative than quantitative. After analyzing their data, they learned how students used their libraries’ spaces, and instructors learned how to take better advantage of library services.
- Kim Armstrong (CIC) – On the other hand, Armstrong’s data was wholly bibliographic in nature. She is deeply involved in a project to centrally store older and lesser-used books and journals owned by CIC libraries. It is hard enough to coordinate all the libraries in the project, but trying to figure out who owns what is even more challenging because of evolving and local cataloging practices. While everybody uses the MARC record as a data structure, there is little consistency between libraries in how data gets put into each of the fields/subfields. “The folks at Google have much of our bibliographic data as a part of their Google Books Project, and even they are not able to write a ‘regular expression’ to parse serial holdings… The result is a ‘Frankenrun’ of journals.”
- small group discussion – We then broke up into groups of five or six people. We were tasked with enumerating sets of data we have or would like to have. We were then expected to ask ourselves what we would do with the data once we got it, what challenges the data presents, and what some solutions to those challenges might be. I articulated data sets including information about readers (“patrons” or “users”), information about what is used frequently or infrequently, tabulations of words and phrases from the full text of our collections, contact information of local grant awardees, and finally, the names and contact information of local editors of scholarly publications. As we discussed these data sets and others, the challenges ranged from technical to political. Every solution seemed to be rooted in a desire for more resources (time and money).
- Dorothea Salo (University of Wisconsin – Madison) – The event was brought to a close by Salo, who began by articulating the Three V’s of Big Data: volume, velocity, and variety. Volume alludes to the amount of data. Velocity refers to the frequency with which the data changes. Variety is an account of the data’s consistency, or lack thereof. Good data, she says, is clean, consistent, easy to understand, and computable. She then asked, “Do libraries have ‘big data’?” And her answer was, “Yes and no.” Yes, we have volumes of bibliographic information, but it is neither clean nor easy to understand. The challenges described by Armstrong are perfect examples. She says that our ‘non-computable’ data sets are costing the profession mind share, and we have only a limited amount of time to rectify the problem before somebody else comes up with a solution and bypasses libraries altogether. She also mentioned the power of data aggregation. Examples included OAIster, WorldCat, various union catalogs, and the National Science Digital Library. It did not sound to me as if she thought these efforts were successes. She alluded to the Digital Public Library of America, and because of its explicit policy for metadata use and re-use, she thinks it has potential, but only time will tell. She has a lot of faith in the idea of “linked data”, and frankly, that sounds like a great idea to me as well. What is the way forward? She advocated the creation of “library scaffolding” to increase internal library skills, rather than hiring specific people to do specific tasks and expecting them to solve all the problems.
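As an aside, here is a minimal sketch of the data-to-text idea Birnbaum described in the first presentation above. It is not his system and not Narrative Science’s software; the team names, statistics, and phrasing rules are all assumptions invented purely for illustration. The point is simply that a handful of threshold rules plus templates already yields readable prose; the hard parts are choosing the story’s angle and coping with messy, real-world data.

```python
# A toy, rule-based "data to text" generator in the spirit of Birnbaum's idea.
# Everything here (team names, statistics, phrasing rules) is made up for
# illustration; real systems are far more sophisticated.

def game_story(stats):
    """Turn a tiny dictionary of game statistics into a short narrative."""
    home_runs, away_runs = stats["home_runs"], stats["away_runs"]
    margin = abs(home_runs - away_runs)

    # Choose a verb based on the margin of victory (a simple "angle" rule).
    if margin == 0:
        verb = "tied"
    elif margin >= 5:
        verb = "routed"
    elif margin >= 2:
        verb = "beat"
    else:
        verb = "edged"

    if home_runs >= away_runs:
        winner, loser = stats["home"], stats["away"]
    else:
        winner, loser = stats["away"], stats["home"]

    # Fill in a sentence template, then append a detail if one is available.
    sentences = [
        f"{winner} {verb} {loser}, {max(home_runs, away_runs)} to {min(home_runs, away_runs)}."
    ]
    if stats.get("key_play"):
        sentences.append(f"The turning point came when {stats['key_play']}.")
    return " ".join(sentences)

print(game_story({
    "home": "the Home Team",
    "away": "the Visitors",
    "home_runs": 3,
    "away_runs": 1,
    "key_play": "Jim Hitter blew one out of the park in the seventh",
}))
```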
After the meeting I visited the Northwestern main library and experienced the round rooms where books are shelved. It was interesting to see the ranges radiating from each room’s center. Along the way I autographed my book and visited the university museum, which had on display quite a number of architectural drawings.
Even though the symposium was not about “e-science research data”, I’m very glad I attended. Discussion was lively. The venue was intimate. I met a number of people, and my cognitive side was stimulated. Thank you for the opportunity.