API for the Department of Energy (DOE) documents

Published on Aug 09, 2014

How to use the API for SciTech Connect by the Office of Scientific and Technical Information (OSTI)

I am delighted that the Office of Scientific and Technical Information followed my suggestion and made the plain text documents in SciTech Connect available through an API. The quality of the text documents, most of which were converted from other formats, varies, but the sheer number of freely available documents, 2,580,037 at this point, makes it a very interesting source of scientific documents.

Getting the files from OSTI involves three main steps. First, one has to find the documents one is looking for through SciTech Connect (or its XML service). Second, one has to extract the links from the resulting search result file (either Excel or XML). Third, one has to download the XML metadata files and the plain text files.
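As a rough illustration of the third step, here is a minimal Python sketch that downloads a list of links into a local folder. The helper names `local_name` and `download_all`, the folder name `osti_files`, and the fallback file name are my own assumptions for illustration; they are not part of the OSTI service.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_name(url, fallback="document"):
    """Derive a local file name from the last path segment of a URL."""
    name = os.path.basename(urlparse(url).path)
    # URLs ending in "/" have no last segment, so use the fallback name.
    return name or fallback


def download_all(urls, target_dir="osti_files"):
    """Download every URL into target_dir and return the saved paths."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(target_dir, local_name(url))
        urllib.request.urlretrieve(url, path)  # performs the network request
        saved.append(path)
    return saved
```

Calling `download_all(links)` with the links extracted in the second step would fetch the files one after another. For a large result set it is probably wise to add a short pause between requests.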

Search with SciTech Connect

There are two ways to search. One is to search through the SciTech Connect website and then export the search results as an Excel file, from which one can copy and paste the link column. A more elegant way is to use the new XML service, which is well-documented here.

One search I used was, for example:

http://www.osti.gov/scitech/scitechxmlFullText=solid-state%20OR%20%22solid%20state%22&Title=batter*&SortBy=publication_date&nrows=3000

This query searched all fields for the term batter with a wildcard, the full text for solid-state or solid state, and the title again for batter with a wildcard. The search results were then ordered by date descending, and the maximum output of 3000 rows was requested.

At the time of writing, that search returned 987 documents.
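Rather than typing the percent-encoded URL by hand, the query string can be assembled from plain parameters. The sketch below uses only the parameters visible in the search URL above (`FullText`, `Title`, `SortBy`, `nrows`). The function name `build_search_url` is hypothetical, and the `?` between the service path and the parameters is my assumption about how the service expects the query to be attached.

```python
from urllib.parse import quote, urlencode

# Service address as it appears in the search URL above.
BASE_URL = "http://www.osti.gov/scitech/scitechxml"


def build_search_url(params, base_url=BASE_URL):
    """Return base_url plus the percent-encoded query parameters."""
    # quote_via=quote encodes spaces as %20 (not "+");
    # safe="*" keeps the wildcard character readable.
    query = urlencode(params, safe="*", quote_via=quote)
    return base_url + "?" + query  # assumption: standard "?" separator


search = {
    "FullText": 'solid-state OR "solid state"',
    "Title": "batter*",
    "SortBy": "publication_date",
    "nrows": 3000,
}

url = build_search_url(search)
print(url)
```

Keeping the parameters in a dictionary makes it easy to change a search term or the row limit without touching the encoding.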

cElementTree Python
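A minimal sketch of the link extraction step in Python follows. It uses `xml.etree.ElementTree`, because the separate `cElementTree` module was removed in Python 3.9 and the standard module uses the fast implementation automatically. I do not reproduce the real element names of the OSTI result file here; the sample XML, its tag names, and the function name `extract_links` are invented for illustration. To stay independent of the actual schema, the function simply collects every element text or attribute value that starts with `http://` or `https://`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real OSTI result file uses different tags.
SAMPLE = """<records>
  <record>
    <title>Example A</title>
    <link>http://example.org/a.xml</link>
  </record>
  <record>
    <title>Example B</title>
    <link href="http://example.org/b.txt"/>
  </record>
</records>"""

PREFIXES = ("http://", "https://")


def extract_links(xml_text):
    """Collect all URL-like element texts and attribute values in document order."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        text = (element.text or "").strip()
        if text.startswith(PREFIXES):
            links.append(text)
        for value in element.attrib.values():
            if value.startswith(PREFIXES):
                links.append(value)
    return links


links = extract_links(SAMPLE)
print(links)
```

Once the real tag names are known, it is cleaner to select only those elements, for example with `root.iter("tagname")`, instead of scanning every element.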

Note: sorting does not work.