Nature’s Metadata for Web Pages
Well, we may not be the first, but we wanted to report anyway that Nature has now embedded metadata (HTML meta tags) into all its newly published pages, including full text, abstracts and landing pages (all bar four titles, which are currently being worked on). Metadata coverage extends back through the Nature archives, although depth of coverage varies by title. This conforms to the W3C’s Checkpoint 13.2 in the Web Content Accessibility Guidelines 1.0, which exhorts content publishers to “provide metadata to add semantic information to pages and sites”.
Metadata is provided in both DC and PRISM formats as well as in Google’s own bespoke metadata format. This generally follows the DCMI recommendation “Expressing Dublin Core metadata using HTML/XHTML meta and link elements”, and the earlier RFC 2731 “Encoding Dublin Core Metadata in HTML”. (Note that the schema name is normalized to lowercase.) Some notes:
- The DOI is included in the “dc.identifier” term in URI form, which is the Crossref recommendation for citing DOIs.
- We could also consider adding “prism.doi” to disclose the native DOI form. This requires the PRISM namespace declaration to be bumped to v2.0. We might consider synchronizing this change with our RSS feeds, which are currently pegged at v1.2, although note that the RSS module mod_prism currently applies only to PRISM v1.2.
- We could then also add a “prism.url” term to link back (through the DOI proxy server) to the content site. The namespace issue listed above still holds.
- The “citation_” terms are not anchored in any published namespace, which makes this term set problematic for application reuse. It would be useful to be able to reference a namespace (e.g. “rel="schema.gs" href="..."”) for these terms and to cite them with the corresponding prefix.

The HTML metadata sets from an example landing page are presented below.
If you view the page source you should see something like the text below. (Note that you may have to scroll past whitespace which is emitted by the HTML template generator.)
<pre><link title="schema(DC)" rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
<meta name="dc.publisher" content="Nature Publishing Group" />
<meta name="dc.language" content="en" />
<meta name="dc.rights" content="© 2008 Nature Publishing Group" />
<meta name="dc.title" content="Crystal structure of squid rhodopsin" />
<meta name="dc.creator" content="Midori Murakami" />
<meta name="dc.creator" content="Tsutomu Kouyama" />
<meta name="dc.identifier" content="doi:10.1038/nature06925" />
<link title="schema(PRISM)" rel="schema.prism" href="https://web.archive.org/web/20080516191035/http://www.prismstandard.org//namespaces/1.2/basic/" />
<meta name="prism.copyright" content="© 2008 Nature Publishing Group" />
<meta name="prism.rightsAgent" content="firstname.lastname@example.org" />
<meta name="prism.publicationName" content="Nature" />
<meta name="prism.issn" content="0028-0836" />
<meta name="prism.eIssn" content="1476-4687" />
<meta name="prism.volume" content="453" />
<meta name="prism.number" content="7193" />
<meta name="prism.startingPage" content="363" />
<meta name="prism.endingPage" content="367" />
<meta name="citation_journal_title" content="Nature" />
<meta name="citation_publisher" content="Nature Publishing Group" />
<meta name="citation_authors" content="Midori Murakami, Tsutomu Kouyama" />
<meta name="citation_title" content="Crystal structure of squid rhodopsin" />
<meta name="citation_volume" content="453" />
<meta name="citation_issue" content="7193" />
<meta name="citation_firstpage" content="363" />
<meta name="citation_doi" content="doi:10.1038/nature06925" />
</pre>
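As a rough illustration of the DOI forms discussed in the notes above, the following Python sketch derives the native form (a candidate value for “prism.doi”) and a proxy link (a candidate value for “prism.url”) from the URI form carried in “dc.identifier”. The function names and the proxy base URL `https://doi.org/` are assumptions made for this example; they are not taken from the Nature pages.

```python
from urllib.parse import quote

# Assumption: base URL of the DOI proxy server used to build resolvable links.
DOI_PROXY = "https://doi.org/"

# Prefixes that may precede the native DOI in a metadata value.
_KNOWN_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def native_doi(identifier):
    """Return the native DOI, dropping a 'doi:' scheme or proxy URL prefix if present."""
    value = identifier.strip()
    lowered = value.lower()
    for prefix in _KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            return value[len(prefix):]
    return value


def doi_forms(identifier):
    """Map hypothetical meta term names to the three DOI forms discussed in the notes."""
    doi = native_doi(identifier)
    return {
        "dc.identifier": "doi:" + doi,                # URI form
        "prism.doi": doi,                             # native form
        "prism.url": DOI_PROXY + quote(doi, safe="/"),  # link via the proxy
    }


if __name__ == "__main__":
    for name, content in doi_forms("doi:10.1038/nature06925").items():
        print(f'<meta name="{name}" content="{content}" />')
```

Percent-encoding the DOI before appending it to the proxy URL is a precaution for DOIs that contain reserved characters; the example DOI passes through unchanged.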
While we do not expect search engines to index these terms directly, and no direct SEO is intended, we think there is enough value in applications being able to make use of them. The terms are reasonably accessible to simple scripts, etc. Note that even RFC 2731 (published in 1999) lists a Perl script in Section 9 which allows the metadata name/value pairs to be easily pulled out. Running this over the example page yields the following output:
@|MISSING ELEMENT NAME; text/css
@|MISSING ELEMENT NAME; text/html; charset=iso-8859-1
@|keywords; Nature, science, science news, biology, physics, genetics, astronomy, astrophysics, quantum physics, evolution, evolutionary biology, geophysics, climate change, earth science, materials science, interdisciplinary science, science policy, medicine, systems biology, genomics, transcriptomics, palaeobiology, ecology, molecular biology, cancer, immunology, pharmacology, development, developmental biology, structural biology, biochemistry, bioinformatics, computational biology, nanotechnology, proteomics, metabolomics, biotechnology, drug discovery, environmental science, life, marine biology, medical research, neuroscience, neurobiology, functional genomics, molecular interactions, RNA, DNA, cell cycle, signal transduction, cell signalling.
@|description; Nature is the international weekly journal of science: a magazine style journal that publishes full-length research papers in all disciplines of science, as well as News and Views, reviews, news, features, commentaries, web focuses and more, covering all branches of science and how science impacts upon all aspects of society and life.
@|dc.publisher; Nature Publishing Group
@|dc.rights; &#169; 2008 Nature Publishing Group
@|dc.title; Crystal structure of squid rhodopsin
@|dc.creator; Midori Murakami
@|dc.creator; Tsutomu Kouyama
@|prism.copyright; © 2008 Nature Publishing Group
@|citation_publisher; Nature Publishing Group
@|citation_authors; Midori Murakami, Tsutomu Kouyama
@|citation_title; Crystal structure of squid rhodopsin
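For readers who would rather not run Perl, a similar listing can be approximated with Python's standard library. The sketch below is not the RFC 2731 script: the class and function names are invented for this example, the “MISSING ELEMENT NAME” label simply mimics the output above, and the sample markup is a shortened, hand-typed subset of the example page.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect (name, content) pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "content" not in attributes:
            return
        # Elements such as <meta http-equiv=...> carry no name attribute.
        name = attributes.get("name") or "MISSING ELEMENT NAME"
        self.pairs.append((name, attributes["content"] or ""))


def extract_meta(html_text):
    """Return the list of (name, content) pairs found in html_text."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


# Shortened sample markup, typed by hand for this example.
SAMPLE = """
<link title="schema(DC)" rel="schema.dc" href="http://purl.org/dc/elements/1.1/" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="dc.title" content="Crystal structure of squid rhodopsin" />
<meta name="dc.creator" content="Midori Murakami" />
<meta name="dc.creator" content="Tsutomu Kouyama" />
<meta name="citation_doi" content="doi:10.1038/nature06925" />
"""

if __name__ == "__main__":
    for name, content in extract_meta(SAMPLE):
        print(f"@|{name}; {content}")
```

One difference worth noting is that `html.parser` decodes character references in attribute values, so a copyright sign written as a numeric entity in the page source comes back as the character itself.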