Fun with bibliographic indexes, bibliographic data management software, and Z39.50
Posted on November 15, 2013 in Uncategorized by Eric Lease Morgan
It is not supposed to be this hard.
The problem to solve
A student came into the Center For Digital Scholarship here at Notre Dame. He wanted to do some text analysis against a mass of bibliographic citations from the New York Times dating from 1934 to the present. His corpus consists of more than 1.6 million records. The student said, “I believe the use of words like ‘trade’ and ‘tariff’ have changed over time, and these changes reflect shifts in our economic development policies.” Sounds interesting to me, really.
Solution #1
To do this analysis I needed to download the 1.6 million records in question. No, I wasn’t going to download them in one whole batch, but rather break them up by year. Still, this results in individual data sets totaling thousands and thousands of records. Selecting these records through the Web interface of the bibliographic vendor was tedious. No, it was cruel and unusual punishment. There had to be a better way. Moreover, the vendor said, “Four thousand (4,000) records is the most a person can download at any particular time.”
Solution #2
After a bit of back & forth, a commercial Z39.50 client seemed to be the answer. At the very least there wouldn’t be a whole lot of clicking going on. I obtained a username/password combination. I figured out the correct host name of the remote Z39.50 server. I got the correct database name. I configured my client. Searches worked perfectly. But upon closer inspection, no date information was being parsed from the records. No pagination. The bibliographic citation management software could not create… bibliographic citations. “Is date information being sent? What about pagination and URLs?” More back & forth, and I learned that the bibliographic vendor’s Z39.50 server outputs MARC, and the required data is encoded in the MARC. I went back to tweaking my client’s configuration. Everything was now working, but downloading the citations was very slow. Too slow.
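(For what it’s worth, a quick way to double-check this sort of configuration — sorry, I mean a quick way to verify the host name, port, database name, and record syntax — is an interactive session with yaz-client, Index Data’s command-line Z39.50 client. Using the values from the Appendix, a session might look like the one below. The query is only an example, and results depend on the vendor honoring your IP address for authentication.)

$ yaz-client
Z> open fedsearch.proquest.com:210/hnpnewyorktimes
Z> format usmarc
Z> find @attr 1=1016 "trade or tariff"
Z> show 1+5
Z> quit

If MARC records come back from the show command, then the host, database, and record syntax are sound.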
Solution #3
So I did some more thinking. “I have all the information I need to use a low-level Z39.50 client.” Yaz-client might have been an option, but in the end I wrote my own Perl script. In about twenty-five lines of code I wrote what I needed, and downloads were a factor of 10 faster than with the desktop client. (See the Appendix.) The only drawback was the raw MARC I was saving. I would need to parse it for my student.
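Parsing the raw MARC ought to be straightforward with the MARC::Record family of modules. Below is a minimal sketch using MARC::Batch to turn the saved file into tab-delimited text. Caveat: the fields I pull the date and pagination from (260$c and 300$a) are only guesses; the vendor encodes the citation data somewhere in the MARC, but exactly where would need to be verified against real records.

#!/usr/bin/perl
# marc2tab.pl - convert raw MARC (e.g. nytimes.marc) to tab-delimited text
# usage: ./marc2tab.pl nytimes.marc > nytimes.tsv
use strict;
use warnings;
use MARC::Batch;
# open the file of raw MARC records created by nytimes-search.pl
my $batch = MARC::Batch->new( 'USMARC', shift @ARGV );
while ( my $record = $batch->next ) {
	# title comes from the 245; the date and pagination fields are guesses
	my $title = $record->title || '';
	my $date  = $record->field( '260' ) ? ( $record->field( '260' )->subfield( 'c' ) || '' ) : '';
	my $pages = $record->field( '300' ) ? ( $record->field( '300' )->subfield( 'a' ) || '' ) : '';
	print join( "\t", $title, $date, $pages ), "\n";
}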
Back to the drawing board
Everything was going well, but then I hit the original limit: the record limit. When the bibliographic database vendor said there was a 4,000 record limit, I thought that meant no more than 4,000 records could be downloaded at one time. No, it means that for any given search I can only download the first 4,000 records. Trying to retrieve record 4,001 or greater results in an error. Sure enough, when I request record 4,001 from either my commercial client or my Perl-based client, I get an error. Bummer!
The only thing I can do now is ask the bibliographic vendor for a data dump.
Take-aways
On one hand, I can’t blame the bibliographic vendor too much. For decades the library profession has been trying to teach people to do the most specific, highly accurate, precision/recall searches possible. “Why would anybody want more than a few dozen citations anyway? Four thousand ought to be plenty.” On the other hand, text mining is a legitimate and additional method for dealing with information overload. Four thousand records is just the tip of the iceberg.
I learned a few things:
- many students have very interesting senior projects
- the commercial Z39.50 client works quite well and is well-supported
- many commercial Z39.50 implementations are based on the good work of Index Data
- my bibliographic database vendor does IP-based Z39.50 authentication
I also got an idea — provide my clientele with a “smart” database search interface. Here’s how:
- authenticate a person
- allow the person to select one or more bibliographic databases to search
- allow the person to enter a rudimentary, free text query
- search the selected databases
- harvest the results (of potentially thousands of records)
- do text mining against the results to create timelines, word clouds, author recommendations, etc. (a rough sketch follows this list)
- present the results to the person for analysis
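As a proof of concept for the text mining step, the sketch below counts word frequencies from a file of harvested citations, assumed here to have already been reduced to plain text with one citation per line. Frequencies like these are the raw material of word clouds and timelines.

#!/usr/bin/perl
# frequencies.pl - a first stab at text mining harvested citations
# usage: ./frequencies.pl citations.txt
use strict;
use warnings;
# count the occurrences of each word in the input
my %count;
while ( my $line = <> ) {
	foreach my $word ( split /\W+/, lc $line ) {
		next unless length $word;
		$count{ $word }++;
	}
}
# output words and their frequencies, most frequent first
foreach my $word ( sort { $count{ $b } <=> $count{ $a } } keys %count ) {
	print "$word\t$count{ $word }\n";
}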
Wish me luck!?
Appendix
#!/usr/bin/perl
# nytimes-search.pl - rudimentary z39.50 client to query the NY Times
# Eric Lease Morgan <emorgan@nd.edu>
# November 13, 2013 - first cut; "Happy Birthday, Steve!"
# usage: ./nytimes-search.pl > nytimes.marc
# configure
use constant DB => 'hnpnewyorktimes';
use constant HOST => 'fedsearch.proquest.com';
use constant PORT => 210;
use constant QUERY => '@attr 1=1016 "trade or tariff"';
use constant SYNTAX => 'usmarc';
# require
use strict;
use warnings;
use ZOOM;
# do the work
eval {

	# connect; configure; search
	my $conn = new ZOOM::Connection( HOST, PORT, databaseName => DB );
	$conn->option( preferredRecordSyntax => SYNTAX );
	my $rs = $conn->search_pqf( QUERY );

	# requests > 4000 return errors
	# print $rs->record( 4001 )->raw;

	# retrieve each record; indexes are zero-based, hence size - 1;
	# will still break at record 4,000 because of vendor limitations
	for my $i ( 0 .. $rs->size - 1 ) {
		print STDERR "\tRetrieving record #$i\r";
		print $rs->record( $i )->raw;
	}

};
# report errors
if ( $@ ) { print STDERR "Error ", $@->code, ": ", $@->message, "\n" }
# done
exit;