Text Mining Collections

Many of the articles in PMC are subject to traditional copyright restrictions and are not available for bulk downloading. However, there are a few collections within PMC where bulk retrieval of files for text mining and other purposes is permitted. License terms may vary by collection or even within a collection.

To download a collection in PMC for text mining, you must use the designated services (usually the PMC FTP service).
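As a rough illustration of retrieval over FTP, the sketch below uses Python's standard `ftplib` module. The host name and directory are assumptions that do not appear in the text above, and the helper names are hypothetical; confirm the actual location and layout against the PMC FTP service documentation before relying on them.

```python
from ftplib import FTP

# Assumed values, not given in the text above; verify them against the
# PMC FTP service documentation.
FTP_HOST = "ftp.ncbi.nlm.nih.gov"
FTP_DIR = "/pub/pmc"


def list_directory(ftp, directory):
    """Return the sorted entry names in `directory` on an open FTP connection."""
    ftp.cwd(directory)
    return sorted(ftp.nlst())


def download_file(ftp, remote_name, local_path):
    """Stream one remote file from the current directory to disk in binary mode."""
    with open(local_path, "wb") as handle:
        ftp.retrbinary(f"RETR {remote_name}", handle.write)


def main():
    """Log in anonymously and print the top-level listing (needs network access)."""
    with FTP(FTP_HOST) as ftp:
        ftp.login()
        for name in list_directory(ftp, FTP_DIR):
            print(name)
```

Calling `main()` prints whatever the assumed directory contains. The license terms of each collection still apply to anything downloaded this way.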

Open Access Subset

The Open Access Subset (OA Subset) is the largest collection of articles available for text mining via PMC. Articles in the OA Subset are still protected by copyright in most cases, but are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditionally copyrighted work.

Within the OA Subset, there is a Commercial Use Collection that includes only OA Subset articles that have a machine-readable “CC BY” or “CC0” license.
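The selection rule for the Commercial Use Collection can be sketched as a simple license filter. The record layout (dictionaries with `id` and `license` keys), the identifiers, and the exact label spellings are hypothetical, chosen only for illustration; the real license metadata in PMC files may be encoded differently.

```python
# Licenses named in the text as qualifying for the Commercial Use Collection.
COMMERCIAL_USE_LICENSES = {"CC BY", "CC0"}


def is_commercial_use(license_label):
    """Return True if the label is exactly CC BY or CC0, ignoring case and padding."""
    return license_label.strip().upper() in COMMERCIAL_USE_LICENSES


def commercial_use_subset(records):
    """Keep only records whose (hypothetical) 'license' field qualifies."""
    return [r for r in records if is_commercial_use(r.get("license", ""))]


# Hypothetical records; the identifiers are made up for illustration.
records = [
    {"id": "article-1", "license": "CC BY"},
    {"id": "article-2", "license": "CC BY-NC"},
    {"id": "article-3", "license": "cc0"},
    {"id": "article-4"},  # no machine-readable license recorded
]

selected = commercial_use_subset(records)
print([r["id"] for r in selected])
```

The match is exact after normalization, so a more restrictive variant such as CC BY-NC is excluded, as is any record without a license value.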

Author Manuscript Collection

The Author Manuscript Collection consists of articles in author manuscript form that have been made available in PMC in compliance with the NIH Public Access Policy or similar policies of other funders. The texts of these manuscripts are available for text mining, and may also be used in ways consistent with the principles of applicable copyright law.

This Collection is distinct from the OA Subset and subject to different terms of use.

Historical OCR Collection

The Historical OCR Collection consists of OCR text from a subset of the journals that participated in NLM's back issue digitization project. With the publisher’s permission, the OCR text files from a few of these journals, spanning nearly two centuries of biomedical research, have been made available for text mining.

The Historical OCR Collection files are available only from the PMC FTP site.

Last updated: Mon, 23 Nov 2015