{"id":388,"date":"2013-04-12T20:15:57","date_gmt":"2013-04-12T15:15:57","guid":{"rendered":"http:\/\/blogs.nd.edu\/emorgan\/?p=388"},"modified":"2014-04-15T20:46:48","modified_gmt":"2014-04-15T15:46:48","slug":"workflow","status":"publish","type":"post","link":"https:\/\/sites.nd.edu\/emorgan\/2013\/04\/workflow\/","title":{"rendered":"Catholic pamphlets workflow"},"content":{"rendered":"<p>\n<div id=\"attachment_394\" style=\"width: 213px\" class=\"wp-caption alignright\"><a href=\"http:\/\/blogs.nd.edu\/emorgan\/files\/2013\/04\/matisse.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-394\" src=\"http:\/\/blogs.nd.edu\/emorgan\/files\/2013\/04\/matisse.jpg\" alt=\"Gratuitous eye candy by Matisse\" width=\"203\" height=\"270\" class=\"size-full wp-image-394\" \/><\/a><p id=\"caption-attachment-394\" class=\"wp-caption-text\">Gratuitous eye candy by Matisse<\/p><\/div>This is an outline of how we here at Notre Dame have been making digitized versions of our Catholic pamphlets available on the Web &#8212; a workflow:\n<\/p>\n<ol>\n<li>Save PDF files to a common file system &#8211; This can be as simple as a shared hard disk or removable media.<\/li>\n<li>Ingest PDF files into Fedora to generate URLs &#8211; The PDF files are saved in Fedora for the long haul.<\/li>\n<li>Create persistent URLs and return a list of system numbers and&#8230; URLs &#8211; Each PDF file is given a PURL for the long haul. Output a delimited file containing system numbers in one column and PURLs in another. 
(Steps #2 and #3 are implemented with a number of Ruby scripts: <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/batch_ingester.rb\" target=\"_blank\">batch_ingester.rb<\/a>, <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/book.rb\" target=\"_blank\">book.rb<\/a>, <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/mint_purl.rb\" target=\"_blank\">mint_purl.rb<\/a>, <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/purl_config.rb\" target=\"_blank\">purl_config.rb<\/a>, <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/purl.rb\" target=\"_blank\">purl.rb<\/a>, <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/repo_object.rb\" target=\"_blank\">repo_object.rb<\/a>.)<\/li>\n<li>Update FileMaker database with URLs for quality assurance purposes &#8211; Use the PURLs from the previous step and update the local database so we can check the digitization process.<\/li>\n<li>Start quality assurance process and cook until done &#8211; Look at each PDF file, making sure it has been digitized correctly and thoroughly. Return poorly digitized items to the digitization process.<\/li>\n<li>Use system numbers to extract MARC records from Aleph &#8211; The file name of each original PDF document should be an Aleph system number. Use the list of numbers to get the associated bibliographic data from the integrated library system.<\/li>\n<li>Edit MARC records to include copyright information and URLs to the PDF files &#8211; Update the bibliographic records using scripts called <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/list-copyright.pl\" target=\"_blank\">list-copyright.pl<\/a> and <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/update-marc.pl\" target=\"_blank\">update-marc.pl<\/a>. 
The first script outputs a list of copyright information that is used as input for the second script, which adds both the copyright information and pointers to the PDF documents.<\/li>\n<li>Duplicate MARC records and edit them to create electronic resource records &#8211; Much of this work is done using MarcEdit.<\/li>\n<li>Put newly edited records into Aleph test &#8211; Ingest the newly created records into a staging area.<\/li>\n<li>Check records for correctness &#8211; Given enough eyes, all bugs are shallow.<\/li>\n<li>Put newly edited records into Aleph production &#8211; Make the newly created records available to the public.<\/li>\n<li>Extract newly created MARC records with new system numbers &#8211; These numbers are needed for the concordance program &#8212; a way to link back from the concordance to the full bibliographic record.<\/li>\n<li>Update concordance database and texts &#8211; Use something like pdftotext to extract the OCR from the scanned PDF documents. Save the text files in a place where the concordance program can find them. Update the concordance&#8217;s database, linking keys to bibliographic information as well as to the locations of the text files. 
All of this is done with a script called <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/extract.pl\" target=\"_blank\">extract.pl<\/a>.<\/li>\n<li>Create Aleph Sequential File to add concordance links &#8211; This script (<a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/catholic-pamphlets-workflow\/marc2aleph.pl\" target=\"_blank\">marc2aleph.pl<\/a>) will output something that can be used to update the bibliographic records with concordance URLs &#8212; an Aleph Sequential File.<\/li>\n<li>Run Sequential File to update MARC records with concordance link &#8211; This updates the bibliographic information accordingly.<\/li>\n<\/ol>\n<p>\nDone, but I&#8217;m sure your mileage will vary.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is an outline of how we here at Notre Dame have been making digitized versions of our Catholic pamphlets available on the Web &#8212; a workflow: Save PDF files to a common file system &#8211; This can be as simple as a shared hard disk or removable media. 
Ingest PDF files into Fedora to [&hellip;]<\/p>\n","protected":false},"author":92,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-388","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/388","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/users\/92"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/comments?post=388"}],"version-history":[{"count":8,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/388\/revisions"}],"predecessor-version":[{"id":399,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/388\/revisions\/399"}],"wp:attachment":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/media?parent=388"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/categories?post=388"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/tags?post=388"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}