Open Access Subset

The PMC Open Access Subset (OA Subset) is a part of the total collection of articles in PMC. The articles in the OA Subset are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.

To preview the articles or get a current count of articles in the OA Subset, do a search for open access[filter] in PMC. As of 2015, there were over 1 million articles available in this collection.

Please note the following:

  • The license terms are not identical for all of the articles in this subset. Please refer to the license statement in each article for specific terms of use.
  • The majority of the articles in PMC are subject to traditional copyright restrictions and are not part of this subset.
  • Users are directly and solely responsible for compliance with copyright restrictions and are expected to adhere to the terms and conditions defined by the copyright holder (see the PMC Copyright Notice).
  • The PMC OAI service and the PMC FTP service are the only services that may be used for automated downloading of articles from the OA Subset. Systematic retrieval (bulk downloading) of articles through any other automated process is prohibited.
  • Some licenses restrict commercial use of open access content. If you are accessing the OA Subset for commercial purposes, please limit your use to the Commercial Use Collection.
  • Some journals use the label "open access" for an article that is available free at time of publication, but is still subject to traditional copyright restrictions. Such articles are not part of this subset.

To do automated (or bulk) retrieval of articles from the OA Subset, use one of the following services.

  1. Use the OAI-PMH service to download XML for the full text of articles in this open access subset.
  2. Use the FTP service (see next section) to download a complete set of files — XML, images, PDF, and supplementary data files, if any — for individual articles in this subset.
  3. Use the FTP service to download bulk packages of just XML or extracted text for all the articles in the OA Subset.

The FTP service provides the source files for any article in the OA Subset in two file formats:

.tar.gz – these are archive files that include all of the source material for the article:

  • A .nxml file, which is XML for the full text of the article, encoded in the NLM/JATS DTD.
  • Image files from the article, and graphics for display versions of mathematical equations or chemical schemes.
  • Supplementary data, such as background research data or videos
  • PDF, if available
  • Converted video files, in a number of formats, suitable for streaming on the web. These files have the suffix "-pmcvs_normal" to distinguish them from the original, publisher-supplied files.

.pdf – only the article PDF (the same as that in the .tar.gz file). Note that not every article has a PDF.
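
As a rough illustration of the package layout described above, the following sketch groups the files inside a downloaded .tar.gz package by kind. The category names, the list of image extensions, and the helper names (classify_member, summarize_package) are assumptions made for this example, as is the assumption that "-pmcvs_normal" appears in the file name just before the extension.

```python
import io
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to the file types you encounter.
IMAGE_EXTS = {".jpg", ".jpeg", ".gif", ".png", ".tif", ".tiff"}


def classify_member(name):
    """Return a rough category for one file name inside an article package."""
    path = PurePosixPath(name)
    suffix = path.suffix.lower()
    # Converted, web-streamable videos carry "-pmcvs_normal" in their name.
    if "-pmcvs_normal" in path.name:
        return "converted_video"
    if suffix == ".nxml":
        return "xml"
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTS:
        return "image"
    # Supplementary data and anything else not recognized above.
    return "other"


def summarize_package(fileobj):
    """Group the regular files of a .tar.gz package by category."""
    groups = defaultdict(list)
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                groups[classify_member(member.name)].append(member.name)
    return dict(groups)
```

To inspect a package on disk, open it in binary mode and pass the file object, for example summarize_package(open("PMC555938.tar.gz", "rb")).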

To prevent any one FTP folder from holding thousands of files, these .tar.gz and .pdf files are distributed randomly in a two-level-deep directory structure. There are two ways to locate a specific article on the FTP site:

  • Use one of the file index lists described below.
  • Use the OA web service, which provides an API to locate articles by PMCID, or by an update date range.

Bulk Packages

In addition to the individual article files described above, PMC also makes available gzipped archive files that contain XML only, and others that contain just the extracted full text, for all of the articles in the PMC open access subset. In the extracted full text (.txt) files, the full text is extracted either from the XML files, or, for those articles that don't have XML, from the PDFs.

Users who do not need PDFs, images, or supplementary data can use these files for data mining and other types of processing. Note that these files are quite large (2 to 6 GB). To access the complete OA Subset, you will need both the commercial use and non-commercial use collections. These collections complement each other, rather than duplicating files.

These files are updated once per week, on Saturday.

Non-Commercial Use of OA Subset Articles

If you intend to use articles from the PMC OA Subset only for non-commercial purposes, these are your options:

  • Download any PDF directly from the “oa_pdf” directory. The files “oa_non_comm_use_pdf.csv” and “oa_non_comm_use_pdf.txt” are indices, in .csv and .txt formats, to the contents of this directory.
  • Download complete files for any article from the “oa_package” directory.
  • Use “oa_file_list.txt” or “oa_file_list.csv” to find the specific directory location for an article or to filter articles by license type.
  • Download any bulk XML/txt article package, either “non_comm_use.*.tar.gz” or “comm_use.*.tar.gz”
  • Download XML for the full text of articles using the OAI-PMH service

Commercial Use of OA Subset Articles

Within the OA Subset, there is a Commercial Use Collection that includes only OA Subset articles that have a machine-readable “CC BY” (Creative Commons Attribution Only) or “CC0” (Creative Commons public domain) license.

If you intend to use articles from the PMC OA Subset for commercial purposes, your options include:

  • Use “oa_comm_use_file_list.txt” or “oa_comm_use_file_list.csv” to identify the articles you may download and use.
  • From the “oa_package” directory, download complete files for only the articles listed in “oa_comm_use_file_list.*”
  • If you want to use only article PDFs (and not the XML and graphics files), you must still download the complete article package. Do not download freestanding PDFs from the “oa_pdf” directory and do not use “oa_non_comm_use_pdf.*” to try to identify what files are available to you.
  • Download only those bulk XML/txt article packages that are available for commercial use, i.e., files named “comm_use.*.tar.gz”
  • Download XML for the full text of articles using the OAI-PMH service

Please note that if you are accessing the content for commercial purposes, you may NOT download any of the “non_comm_use.*.tar.gz” bulk packages or any individual article packages for articles that are not included in “oa_comm_use_file_list.*”
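
The commercial and non-commercial rules above can be sketched as two small helpers. This is only an illustration: the function names are hypothetical, and the license strings are assumed to match the values used in the index files (for example "CC BY" and "CC0"). The “oa_comm_use_file_list.*” files remain the authoritative list of what may be used commercially.

```python
# Licenses that define the Commercial Use Collection, per the text above.
COMMERCIAL_LICENSES = {"CC BY", "CC0"}


def in_commercial_use_collection(license_type):
    """Return True if a license value qualifies for commercial use."""
    return license_type.strip() in COMMERCIAL_LICENSES


def allowed_bulk_patterns(commercial):
    """Return the bulk package name patterns a user may download."""
    if commercial:
        # Commercial users are limited to the commercial use packages.
        return ["comm_use.*.tar.gz"]
    # Non-commercial users may download either kind of package.
    return ["comm_use.*.tar.gz", "non_comm_use.*.tar.gz"]
```

A value such as "NO-CC CODE" or "CC BY-NC" fails the check, so articles carrying it would be skipped by a commercial-use pipeline.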

How to Search for Articles by Creative Commons License

Search filters are available in PMC and PubMed for finding articles in the OA Subset with specific Creative Commons (CC) licenses. For descriptions of these licenses, please see the Creative Commons site, About the Licenses. Please note that not all articles in the OA Subset have a CC license.

License type | Filter in PMC | Filter in PubMed
Any CC license | cc license | pmc cc license
CC BY (Attribution) | cc by license | pmc cc by license
CC BY-ND (Attribution, no derivatives) | cc by-nd license | pmc cc by-nd license
CC BY-NC (Attribution, noncommercial) | cc by-nc license | pmc cc by-nc license
CC BY-NC-ND (Attribution, noncommercial, no derivatives) | cc by-nc-nd license | pmc cc by-nc-nd license
CC BY-NC-SA (Attribution, noncommercial, share-alike) | cc by-nc-sa license | pmc cc by-nc-sa license
CC BY-SA (Attribution, share-alike) | cc by-sa license | pmc cc by-sa license
CC0 (Public domain) | cc0 license | pmc cc0 license

These filters are based on license information, which is provided to PMC by publishers and other content providers, as encoded by the machine-readable identifiers in the source XML of each journal article. Please note that, in some cases, there are discrepancies between these machine-readable identifiers and the actual text of the license statements. In February 2013, PMC instituted new rules to help ensure consistency of the tagging of the licenses, which apply to all newly received content.
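
The filter names can be kept in a small lookup table when building search terms in a script. The sketch below is illustrative only: the dictionary keys and the license_filter helper are hypothetical, and appending the [filter] tag is an assumption based on the open access[filter] search shown earlier on this page.

```python
# PMC filter names by license type; PubMed uses the same name with a "pmc " prefix.
LICENSE_FILTERS = {
    "any": "cc license",
    "CC BY": "cc by license",
    "CC BY-ND": "cc by-nd license",
    "CC BY-NC": "cc by-nc license",
    "CC BY-NC-ND": "cc by-nc-nd license",
    "CC BY-NC-SA": "cc by-nc-sa license",
    "CC BY-SA": "cc by-sa license",
    "CC0": "cc0 license",
}


def license_filter(license_type, database="pmc"):
    """Build a search term for one license type in PMC or PubMed."""
    name = LICENSE_FILTERS[license_type]
    if database == "pubmed":
        name = "pmc " + name
    elif database != "pmc":
        raise ValueError("database must be 'pmc' or 'pubmed'")
    # Assumption: license filters take the same [filter] tag as "open access".
    return name + "[filter]"
```

For example, license_filter("CC BY-NC", "pubmed") produces the term for the PubMed variant of the CC BY-NC filter.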

Using the Index Lists

The FTP site includes six index files: oa_file_list.txt, oa_file_list.csv, oa_non_comm_use_pdf.txt, oa_non_comm_use_pdf.csv, oa_comm_use_file_list.csv, and oa_comm_use_file_list.txt. To locate an article on the FTP site, search for its PMC accession number (PMCID) in the appropriate file list. The matching entry will point you to the specific FTP directory and file name for the article.

oa_file_list.txt
This file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt. The first line of the file gives the date and time at which it was last generated. Every subsequent line contains information about one article in PMC. For example:

oa_package/66/8b/PMC555938.tar.gz	BMC Bioinformatics. 2005 Mar 7; 6:44	PMC555938	PMID:15748298	CC BY

This line is divided into five fields, delimited by tab characters. Those fields are:

  • The fully qualified name of the .tar.gz file for an article.
  • The article citation, comprising the journal title abbreviation, publication date, volume, and issue.
  • PMC accession number (PMCID)
  • PubMed ID (PMID)
  • License type*

* The field value for “license type” can be any of the standard Creative Commons license variants (e.g., CC BY; CC BY-NC; CC BY-NC-ND) or “NO-CC CODE”. “NO-CC CODE” appears when the license is missing, has custom terms (i.e., not a Creative Commons license), or is not machine decodable.
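
A minimal sketch of reading this format is shown below. The names IndexEntry, parse_txt_line, and find_entry are hypothetical, and FTP_BASE assumes that the paths in the index are relative to the directory that holds the index file itself.

```python
from dataclasses import dataclass

# Assumption: index paths are relative to the directory containing the index.
FTP_BASE = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/"


@dataclass
class IndexEntry:
    path: str
    citation: str
    pmcid: str
    pmid: str
    license: str

    @property
    def url(self):
        """Full FTP location of the article package."""
        return FTP_BASE + self.path


def parse_txt_line(line):
    """Parse one tab-delimited article line from oa_file_list.txt."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-delimited fields, got %d" % len(fields))
    path, citation, pmcid, pmid, license_type = fields
    # The PMID column carries a "PMID:" prefix in the .txt format.
    if pmid.startswith("PMID:"):
        pmid = pmid[len("PMID:"):]
    return IndexEntry(path, citation, pmcid, pmid, license_type)


def find_entry(lines, pmcid):
    """Return the entry for a PMCID, or None if it is not listed."""
    iterator = iter(lines)
    next(iterator, None)  # skip the first line (generation date and time)
    for line in iterator:
        if not line.strip():
            continue
        entry = parse_txt_line(line)
        if entry.pmcid == pmcid:
            return entry
    return None
```

Because find_entry accepts any iterable of lines, it works with an open file object for a locally saved copy of the index, which avoids loading the whole list into memory.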

oa_file_list.csv
This file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv.

Contents: Fields are the same as above, except separated by commas, and with the addition of a timestamp indicating the last update to the article in PMC. The timestamp appears before the PMID. For example:

oa_package/d2/6d/PMC2137107.tar.gz,Environ Health Perspect. 2007 Dec; 115(12):A580a,PMC2137107,2014-05-16 12:59:15,18087575,CC0
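
The comma-separated format can be read with Python's csv module, as in the sketch below. The names CsvEntry and read_csv_entries are hypothetical, and the timestamp pattern is inferred from the example line above. Rows that do not have six columns or whose timestamp does not parse are skipped; this is a defensive choice for any leading non-article line, not documented behavior.

```python
import csv
from datetime import datetime
from typing import NamedTuple

# Timestamp layout inferred from the example line, e.g. "2014-05-16 12:59:15".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


class CsvEntry(NamedTuple):
    path: str
    citation: str
    pmcid: str
    updated: datetime
    pmid: str
    license: str


def read_csv_entries(lines):
    """Yield one CsvEntry per well-formed article row."""
    for row in csv.reader(lines):
        if len(row) != 6:
            continue  # not an article row
        path, citation, pmcid, stamp, pmid, license_type = row
        try:
            updated = datetime.strptime(stamp, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # e.g. a header row with column names
        yield CsvEntry(path, citation, pmcid, updated, pmid, license_type)
```

With the timestamp parsed into a datetime, selecting articles updated after a given date is a simple comparison, for example [e for e in read_csv_entries(f) if e.updated >= cutoff].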

oa_non_comm_use_pdf.txt
This file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_non_comm_use_pdf.txt.

This has the same format as the oa_file_list.txt file, but it lists only those articles that have PDFs, and it gives the location of the PDF rather than the location of the .tar.gz file.

oa_non_comm_use_pdf.csv
This file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_non_comm_use_pdf.csv.

This is the same as oa_non_comm_use_pdf.txt, except that it uses commas as delimiters and adds a timestamp indicating the last update to the article in PMC.

oa_comm_use_file_list.txt
This file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_comm_use_file_list.txt.

Contents: Fields are the same as in oa_file_list.txt, but the file list is limited to the articles in the Commercial Use Collection (i.e., articles with a machine-readable CC BY or CC0 license) within the OA Subset.

oa_comm_use_file_list.csv
This file can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_comm_use_file_list.csv.

This is the same as oa_comm_use_file_list.txt (i.e. list is limited to articles in the Commercial Use Collection), except separated by commas, and with the addition of a timestamp indicating the last update to the article in PMC. The timestamp appears before the PMID.

Last updated: Thu, 6 July 2017