/usr/lib/info -- hacker/librarian haven
Shortest OAI-Harvesting Script?

By jaf, Section Software
Posted on Sun Jun 8th, 2003 at 10:52:34 AM EST
Thom Hickey from OCLC's Office of Research posted a 60-line Python script which will pull records from an OAI-PMH compliant repository and place them in a file. So I'm wondering - how small could a fully-compliant harvesting script be? What language would allow for the shortest code? Hmmmm....

See below to read Thom's email to the OAI-Implementers list and to view the code itself.

 

Thom's email:
I've attached a one page (or at least it was one page until I put our legal
text into it!) Python script that will pull records from a repository and
dump them to a file.  In spite of its length it:

  o Handles resumption tokens
  o Notices OAI errors
  o Supports compression
  o Respects 503 Retry-After's
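
Two of the behaviours listed above can be sketched as small pure functions. This is only an illustration in Python 3, not the attached script (which targets Python 2.2); the helper names `resumption_token` and `retry_after_seconds` are hypothetical. It assumes that an empty resumptionToken element marks the end of a list, and it handles only the integer-seconds form of Retry-After, not the HTTP-date form.

```python
import re

def resumption_token(response_text):
    """Return the resumption token found in an OAI-PMH response, or None.

    An empty <resumptionToken> element is treated as "no more pages".
    """
    mo = re.search(r'<resumptionToken[^>]*>([^<]*)</resumptionToken>',
                   response_text)
    if mo and mo.group(1).strip():
        return mo.group(1).strip()
    return None

def retry_after_seconds(headers):
    """Return the wait in seconds requested by a 503 response, or None.

    `headers` is any mapping with a .get() method. Only the integer form
    of Retry-After is understood here (an assumption of this sketch).
    """
    value = headers.get("Retry-After")
    try:
        seconds = int(value)
    except (TypeError, ValueError):
        return None
    return seconds if seconds >= 0 else None
```

A harvester would keep requesting pages while `resumption_token` returns a value, and sleep for `retry_after_seconds` before repeating a request that came back with a 503.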

It doesn't know much about XML, though, so the file created is just a
collection of the downloaded XML responses, and the only metadata format it
asks for is oai_dc, even though it does ask the repository for the metadata
formats supported.  Sets are ignored, but would be fairly easy to add.
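
As a rough idea of what adding set support might involve, the sketch below builds ListRecords request URLs with an optional `set` argument. It is written in Python 3 rather than the Python 2.2 of the attached script, the function names `list_records_url` and `resume_url` are hypothetical, and `http://example.org/oai` in the usage note is a placeholder rather than a real repository.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, until_date=None):
    """Build the first ListRecords request, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return base_url + "?" + urlencode(params)

def resume_url(base_url, token):
    """Build a follow-up request; the token is sent with the verb only."""
    return base_url + "?" + urlencode({"verb": "ListRecords",
                                       "resumptionToken": token})
```

For example, `list_records_url("http://example.org/oai", set_spec="physics:hep")` yields a URL whose query string carries `set=physics%3Ahep`; `urlencode` takes care of escaping characters such as the colon.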

I tested it using Python 2.2.2 under Windows 2000 against several
repositories.

It is invoked by:
  python harvest.py [repository-address outputfile]
e.g.:
  python harvest.py alcme.oclc.org/ndltd/servlet/OAIHandler ndltd.out

If you just run the script without parameters it defaults to the NDLTD
repository (around 39,000 digital thesis and dissertation records).

Anyway, I thought it was interesting to see how much could be done in less
than 60 lines.

--Th


import sys, urllib2, zlib, time, re
## Copyright (c) 2000-2003 OCLC Online Computer Library Center, Inc. and other
## contributors.  All rights reserved.	The contents of this file, as updated
## from time to time by the OCLC Office of Research are subject to OCLC
## Research Public License Version 2.0 (the "License"); you may not use this
## file except in compliance with the License.	You may obtain a current copy
## of the License at http://purl.oclc.org/oclc/research/ORPL/.	Software
## distributed under the License is distributed on an "AS IS" basis, WITHOUT
## WARRANTY OF ANY KIND, either express or implied.  See the License for the
## specific language governing rights and limitations under the License.  This
## software consists of voluntary contributions made by many individuals on
## behalf of OCLC Research.  For more information on OCLC Research, please see
## http://www.oclc.org/research/.  This is the Original Code.  The Initial
## Developer of the Original Code is Thomas Hickey (mailto:hickey@oclc.org).
## Portions created by OCLC are Copyright (C) 2003.  All Rights Reserved.

def getResumptionToken(data):
    mo = re.search('<resumptionToken[^>]*>(.*)</resumptionToken>', data)
    if mo: return mo.group(1)
def getFile(serverString, command, verbose=1):
    remoteAddr = serverString+'?verb=%s'%command
    if verbose: print "getFile '%s'"%remoteAddr
    headers = {'User-Agent': 'OAIHarvester/2.0',
	       'Accept': 'text/html',
	       'Accept-Encoding': 'compress, deflate'}
    try:
	req = urllib2.Request(remoteAddr, None, headers)
	remoteFile = urllib2.urlopen(req)
	remoteData = remoteFile.read()
	remoteFile.close()
    except urllib2.HTTPError, exValue:
        if exValue.code==503:
            retryWait = int(exValue.hdrs.get("Retry-After", "-1"))
            if retryWait<0: return None
            print 'Waiting %d seconds'%retryWait
            time.sleep(retryWait)
            return getFile(serverString, command, 0)
        print exValue
        return None
    try:
        remoteData = zlib.decompressobj().decompress(remoteData)
    except:
        pass
    mo = re.search('<error *code=\"([^"]*)">(.*)</error>', remoteData)
    if mo:
	print >>sys.stderr,"OAIERROR: code=%s '%s'"%(mo.group(1), mo.group(2))
	sys.exit(1)
    return remoteData
def writeWithLF(ofile, data):
    if not data: return
    ofile.write(data)
    if data[-1]!='\n': ofile.write('\n')
def writeRecords(outFile, serverString, mdformat, sDate=None, eDate=None):
    if not sDate and not eDate:
	verb='ListRecords&metadataPrefix=%s'%(mdformat)
    else:
	verb='ListRecords&metadataPrefix=%s&from=%s&until=%s'%(mdformat, sDate, eDate)
    data = getFile(serverString, verb)
    while data:
	writeWithLF(outFile, data)
	reTok = getResumptionToken(data)
	if not reTok: break
	data = getFile(serverString, "ListRecords&resumptionToken=%s"%reTok)
if __name__=="__main__":
    try:    serverName, outName = sys.argv[1:]
    except: serverName, outName = 'alcme.oclc.org/ndltd/servlet/OAIHandler', 'harvest.out'
    serverString = 'http://%s'%serverName
    print "Writing to file %s from archive at %s"%(outName, serverName)
    outFile = file(outName, 'wb')
    writeWithLF(outFile, getFile(serverString, 'Identify'))
    writeWithLF(outFile, getFile(serverString, 'ListMetadataFormats'))
    writeRecords(outFile, serverString, 'oai_dc')
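
The script above finds errors and resumption tokens with regular expressions. For comparison, here is a hedged sketch of an XML-aware reading of a single response, using Python 3's standard `xml.etree.ElementTree`. It is not part of Thom's script; the function name `parse_response` is hypothetical, and it assumes each response is parsed on its own (the output file, being a concatenation of responses, is not one well-formed XML document).

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def parse_response(xml_text):
    """Return (errors, identifiers, token) for one OAI-PMH response.

    errors      -- list of (code, message) pairs from <error> elements
    identifiers -- record identifiers found in <header> elements
    token       -- resumption token, or None if absent or empty
    """
    root = ET.fromstring(xml_text)
    errors = [(e.get("code"), (e.text or "").strip())
              for e in root.findall("oai:error", OAI_NS)]
    identifiers = [e.text for e in
                   root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    tok = root.find(".//oai:resumptionToken", OAI_NS)
    text = (tok.text or "").strip() if tok is not None else ""
    return errors, identifiers, (text or None)
```

The trade-off is a few more lines in exchange for not depending on how a repository happens to format its `error` and `resumptionToken` elements.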
Shortest OAI-Harvesting Script? | 3 comments (3 topical, 0 editorial, 0 pending) | Post A Comment
Smaller still (none / 0) (#1)
by art on Sun Jun 15th, 2003 at 05:56:43 PM EST
(User Info) http://www.uwindsor.ca/library/leddy/people/art

Thom has another version that's even shorter:

import sys, urllib2, zlib, time, re
nDataBytes, nRawBytes, nRecoveries, maxRecoveries = 0, 0, 0, 3
def getFile(serverString, command, verbose=1, sleepTime=0):
global nRecoveries, nDataBytes, nRawBytes
if sleepTime: time.sleep(sleepTime)
remoteAddr = serverString+'?verb=%s'%command
if verbose: print >>sys.stderr, "\r", "getFile '%s'"%remoteAddr,
headers = {'User-Agent': 'OAIHarvester/2.0', 'Accept': 'text/html',
'Accept-Encoding': 'compress, deflate'}
try:
remoteFile = urllib2.urlopen(urllib2.Request(remoteAddr, None, headers))
remoteData = remoteFile.read()
except urllib2.HTTPError, exValue:
if exValue.code==503:
retryWait = int(exValue.hdrs.get("Retry-After", "-1"))
if retryWait<0: return None
print >>sys.stderr, 'Waiting %d seconds'%retryWait
return getFile(serverString, command, 0, retryWait)
print >>sys.stderr, exValue
if nRecoveries<maxRecoveries:
nRecoveries += 1
return getFile(serverString, command, 1, 60)
return
nRawBytes += len(remoteData)
try: remoteData = zlib.decompressobj().decompress(remoteData)
except: pass
nDataBytes += len(remoteData)
mo = re.search('<error *code=\"([^"]*)">(.*)</error>', remoteData)
if mo: print >>sys.stderr,"OAIERROR: code=%s '%s'"%(mo.group(1), mo.group(2))
else: return remoteData
try: serverString, outName=sys.argv[1:]
except:serverString, outName='alcme.oclc.org/ndltd/servlet/OAIHandler', 'harvest.out'
if serverString.find('http://')!=0: serverString = 'http://'+serverString
print >>sys.stderr, "Creating %s from archive %s"%(outName, serverString)
outFile = file(outName, 'wb')
outFile.write(getFile(serverString, 'Identify'))
outFile.write(getFile(serverString, 'ListMetadataFormats'))
data = getFile(serverString, 'ListRecords&metadataPrefix=%s'%'oai_dc')
while data:
outFile.write(data)
mo = re.search('<resumptionToken[^>]*>(.*)</resumptionToken>', data)
if not mo: break
data = getFile(serverString, "ListRecords&resumptionToken=%s"%mo.group(1))
print "\nRead %d bytes (%.2f compression)"%(nDataBytes, float(nDataBytes)/nRawBytes)


Whew!



URL to source (none / 0) (#2)
by ThomasBHickey on Mon Jun 16th, 2003 at 06:41:31 AM EST
(User Info)

The text Art posted doesn't seem to have any indentation. Here's a URL to the latest script: harvestall.py

--Th


