As the Google Summer of Code 2014 program draws to a close, I will summarize some of my experiences in the following articles. I will start with my insights on obtaining journal articles in a format that can be properly processed (preferably XML or UTF-8 encoded text).
Scientists are used to obtaining journal articles in PDF format. Search services such as PubMed by the National Center for Biotechnology Information and ScienceDirect by Elsevier allow scientists to run keyword searches for journal publications that fit their interests. In my case, I am able to export at most 20 PDFs at a time through ScienceDirect. Once I have obtained a PDF, I have to convert it to a more suitable format so that I can process the text (and images where needed). I tried several tools, ranging from the Python package PDFMiner to Adobe Acrobat, and ran into limitations in both the quality of the output and batch processing. The best tool I was able to find was PDFX, which is built for converting journal article PDFs to formats such as HTML and XML. However, there are still issues with the quality, and PDFX is not open source; when I wrote to the team behind it, they told me there are currently no plans to change that.
Rather than obtaining PDFs and converting them, another approach is to get the underlying XML and plain text from the publishers directly. Many scientists in the field are very interested in such access, and excitement spread when Elsevier announced in January that it would allow scientists to text-mine all documents in ScienceDirect. However, to this day I have not been able to get official access to their API, as there are no agreements yet with governmental organizations like the Lawrence Berkeley National Laboratory through which I could have gotten access. In the UK, a legal change in June was aimed at allowing scientists to text-mine documents for non-commercial purposes. Elsevier has responded to that, and I will look into how to incorporate their data into the search once I have moved to the UK in the fall. Many universities have agreements with Elsevier about the text-mining API, so it is worth asking.
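To illustrate why XML is so much easier to work with than PDF, here is a minimal Python sketch that pulls the title and paragraph text out of an article using only the standard library. The element names (`article`, `title`, `body`, `p`) and the sample document are assumptions made up for this example; real publisher schemas differ and would need their own tag names and namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article XML (not a real publisher schema).
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<article>
  <title>Thermal storage in buildings</title>
  <body>
    <p>Energy use varies by <i>climate</i> zone.</p>
    <p>Measured at 20 °C.</p>
  </body>
</article>
"""


def article_to_text(xml_bytes):
    """Return (title, list of paragraph strings) from UTF-8 encoded XML."""
    root = ET.fromstring(xml_bytes)
    title = (root.findtext("title") or "").strip()
    paragraphs = []
    for p in root.iter("p"):
        # itertext() also yields text nested in inline markup such as <i>;
        # split/join collapses runs of whitespace into single spaces.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return title, paragraphs


if __name__ == "__main__":
    title, paragraphs = article_to_text(SAMPLE_XML.encode("utf-8"))
    print(title)
    for paragraph in paragraphs:
        print(paragraph)
```

Because the structure is explicit in the markup, there is no guessing about columns, headers, or hyphenation as there is with text extracted from a PDF.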
OSTI SciTech API
The Office of Scientific and Technical Information (OSTI) provides SciTech, a search for 350,000 full-text DOE-sponsored reports covering 65 years of energy-related R&D. First as a Research Assistant and then as a Research Associate at the Lawrence Berkeley National Laboratory, I brought forward the idea for an API, and over the past month this API was developed and access to full-text documents was provided. The quality of the text available through the API, most of which is converted, varies, but the wealth of knowledge contained in these documents is exciting. I will release instructions on how to use the API, along with the respective Python code, through the Google Summer of Code project.