Shortest OAI-Harvesting Script?
By jaf, Section Software Posted on Sun Jun 8th, 2003 at 10:52:34 AM EST
Thom Hickey from OCLC's Office of Research posted a 60-line Python script which will pull records from an OAI-PMH compliant repository and place them in a file. So I'm wondering - how small could a fully-compliant harvesting script be? What language would allow for the shortest code? Hmmmm....
See below to read Thom's email to the OAI-Implementers list and to view the code itself.
Thom's email:
I've attached a one page (or at least it was one page until I put our legal
text into it!) Python script that will pull records from a repository and
dump them to a file. In spite of its length it:
o Handles resumption tokens
o Notices OAI errors
o Supports compression
o Respects 503 Retry-After's
It doesn't know much about XML, though, so the file created is just a
collection of the downloaded XML responses, and the only metadata format it
asks for is oai_dc, even though it does ask the repository for the metadata
formats supported. Sets are ignored, but would be fairly easy to add.
I tested it using Python 2.2.2 under Windows 2000 against several
repositories.
It is invoked by:
python harvest.py [repository-address outputfile]
e.g.:
python harvest.py alcme.oclc.org/ndltd/servlet/OAIHandler ndltd.out
If you just run the script without parameters it defaults to the NDLTD
repository (around 39,000 digital thesis and dissertation records).
Anyway, I thought it was interesting to see how much could be done in less
than 60 lines.
--Th
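
Thom notes that sets are ignored but would be fairly easy to add. As a rough sketch of what that might involve, here is one way to build the ListRecords query with an optional set argument. It is written in Python 3 rather than the Python 2 of the original script, and the function names are hypothetical; only the argument names (verb, metadataPrefix, set, from, until, resumptionToken) come from OAI-PMH itself.

```python
from urllib.parse import urlencode


def list_records_query(metadata_prefix, set_spec=None, from_date=None, until_date=None):
    """Build the query string for an initial ListRecords request.

    Hypothetical helper: optional arguments are only included when given.
    """
    params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    if set_spec:
        params.append(("set", set_spec))
    if from_date:
        params.append(("from", from_date))
    if until_date:
        params.append(("until", until_date))
    # urlencode percent-escapes characters such as ':' in set specs.
    return urlencode(params)


def resumption_query(token):
    """Build the follow-up query; resumptionToken is sent without the other arguments."""
    return urlencode([("verb", "ListRecords"), ("resumptionToken", token)])
```

Escaping the values also covers resumption tokens that contain characters such as spaces or slashes, which a plain string substitution would pass through unchanged.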
import sys, urllib2, zlib, time, re
## Copyright (c) 2000-2003 OCLC Online Computer Library Center, Inc. and other
## contributors. All rights reserved. The contents of this file, as updated
## from time to time by the OCLC Office of Research are subject to OCLC
## Research Public License Version 2.0 (the "License"); you may not use this
## file except in compliance with the License. You may obtain a current copy
## of the License at http://purl.oclc.org/oclc/research/ORPL/. Software
## distributed under the License is distributed on an "AS IS" basis, WITHOUT
## WARRANTY OF ANY KIND, either express or implied. See the License for the
## specific language governing rights and limitations under the License. This
## software consists of voluntary contributions made by many individuals on
## behalf of OCLC Research. For more information on OCLC Research, please see
## http://www.oclc.org/research/. This is the Original Code. The Initial
## Developer of the Original Code is Thomas Hickey (mailto:hickey@oclc.org).
## Portions created by OCLC are Copyright (C) 2003. All Rights Reserved.
def getResumptionToken(data):
    mo = re.search('<resumptionToken[^>]*>(.*)</resumptionToken>', data)
    if mo: return mo.group(1)
def getFile(serverString, command, verbose=1):
    remoteAddr = serverString+'?verb=%s'%command
    if verbose: print "getFile '%s'"%remoteAddr
    headers = {'User-Agent': 'OAIHarvester/2.0',
               'Accept': 'text/html',
               'Accept-Encoding': 'compress, deflate'}
    try:
        req = urllib2.Request(remoteAddr, None, headers)
        remoteFile = urllib2.urlopen(req)
        remoteData = remoteFile.read()
        remoteFile.close()
    except urllib2.HTTPError, exValue:
        if exValue.code==503:
            retryWait = int(exValue.hdrs.get("Retry-After", "-1"))
            if retryWait<0: return None
            print 'Waiting %d seconds'%retryWait
            time.sleep(retryWait)
            return getFile(serverString, command, 0)
        print exValue
        return None
    try:
        remoteData = zlib.decompressobj().decompress(remoteData)
    except:
        pass
    mo = re.search('<error *code=\"([^"]*)">(.*)</error>', remoteData)
    if mo:
        print >>sys.stderr,"OAIERROR: code=%s '%s'"%(mo.group(1), mo.group(2))
        sys.exit(1)
    return remoteData
def writeWithLF(ofile, data):
    if not data: return
    ofile.write(data)
    if data[-1]!='\n': ofile.write('\n')
def writeRecords(outFile, serverString, mdformat, sDate=None, eDate=None):
    if not sDate and not eDate:
        verb='ListRecords&metadataPrefix=%s'%(mdformat)
    else:
        verb='ListRecords&metadataPrefix=%s&from=%s&until=%s'%(mdformat, sDate, eDate)
    data = getFile(serverString, verb)
    while data:
        writeWithLF(outFile, data)
        reTok = getResumptionToken(data)
        if not reTok: break
        data = getFile(serverString, "ListRecords&resumptionToken=%s"%reTok)
if __name__=="__main__":
    try: serverName, outName = sys.argv[1:]
    except: serverName, outName = 'alcme.oclc.org/ndltd/servlet/OAIHandler', 'harvest.out'
    serverString = 'http://%s'%serverName
    print "Writing to file %s from archive at %s"%(outName, serverName)
    outFile = file(outName, 'wb')
    writeWithLF(outFile, getFile(serverString, 'Identify'))
    writeWithLF(outFile, getFile(serverString, 'ListMetadataFormats'))
    writeRecords(outFile, serverString, 'oai_dc')
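
The script above finds resumption tokens and OAI errors with regular expressions, which fits Thom's remark that it "doesn't know much about XML". For comparison, here is a minimal sketch of the same two checks done with an XML parser. It is an illustration in Python 3, not part of Thom's script; the function name parse_response is hypothetical, and it assumes responses use the OAI-PMH 2.0 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_response(xml_text):
    """Return (resumption_token, errors) for one OAI-PMH response.

    resumption_token is None when absent or empty (an empty token marks
    the end of a list); errors is a list of (code, message) tuples.
    """
    root = ET.fromstring(xml_text)
    errors = [(e.get("code"), (e.text or "").strip())
              for e in root.iter(OAI_NS + "error")]
    token_el = next(root.iter(OAI_NS + "resumptionToken"), None)
    token = (token_el.text or "").strip() if token_el is not None else ""
    return (token or None), errors
```

A harvesting loop could call this on each response, stop when errors is non-empty, and keep requesting while the token is not None. The trade-off is a few more lines in exchange for not depending on how the repository happens to format its XML.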