Metadata in PDF: 1. Strategies
Emboldened by my own researches, by the recent handle plugin announcement from CNRI (on which, more in a follow-on post), and by Alexander Griekspoor’s comment to my earlier post, I thought I’d write a more extensive piece about embedding metadata in PDF with a view to the following:
- Discover what other publishers are currently doing
- Stimulate discussions between content providers and/or consumers
- Lay groundwork for Crossref best practice guidelines
Why should Crossref be interested? Well, at minimum to embed the DOI along with the digital asset would seem to be inherently “a good thing”. (And, in fact, this is precisely the approach that CNRI have taken for their plugin demos. I’ll look later at what they actually did and consider whether that is a model that Crossref publishers might usefully follow.)
Why include the DOI as an explicit piece of metadata rather than have it included by virtue of its appearance in a content section? The main reason is that it is then unambiguously accessible. Content sections in PDFs are typically filtered (and sometimes encrypted), whereas metadata is usually plain text and moreover is marked up as to field type.
Another question concerns whether to add in the identifier alone, or to embed a full metadata set. Why not just embed the identifier and visit Crossref for the metadata? This is feasible in some cases although it does involve an extra network trip, requires an application to service the identifier and is obviously not workable in offline contexts. Seems like a “no-brainer” to include a fuller description from the outset. Note that publishers frequently make some of this information available anyway in other metadata delivery channels, e.g. RSS feeds.
There are two (complementary) approaches to embedding document-level metadata in a PDF:
- A - Document Information Dictionary
This is an optional object (a dictionary) referenced from the PDF trailer dictionary. Example:
/Title ( PostScript Language Reference, Third Edition )
/Author ( Adobe Systems Incorporated )
/Creator ( Adobe FrameMaker 5.5.3 for Power Macintosh® )
/Producer ( Acrobat Distiller 3.01 for Power Macintosh )
/CreationDate ( D:19970915110347-08'00' )
/ModDate ( D:19990209153925-08'00' )
- B - (Document) Metadata Stream
This is an optional object (a stream) referenced from the document catalog, itself referenced from the PDF trailer dictionary. Example:
2 0 obj
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<!-- RDF/XML goes here -->
Both approaches usually make the embedded metadata in the PDF available in the clear, whereas content is frequently filtered and sometimes encrypted. (Note that the information dictionary is always in the clear, while the metadata stream can be filtered and so rendered unreadable, although in practice it tends not to be.)
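To illustrate what "in the clear" buys a consumer, here is a minimal sketch that pulls an XMP packet straight out of raw PDF bytes without any PDF parser. The function name `find_xmp_packet` is hypothetical, and the sketch assumes the metadata stream is unfiltered; a filtered stream would simply not be found this way.

```python
import re
from typing import Optional

# The xpacket processing instructions that wrap an XMP packet.
XPACKET_BEGIN = re.compile(rb"<\?xpacket begin=.*?\?>", re.DOTALL)
XPACKET_END = re.compile(rb"<\?xpacket end=.*?\?>", re.DOTALL)


def find_xmp_packet(pdf_bytes: bytes) -> Optional[bytes]:
    """Return the body of the first unfiltered XMP packet, or None.

    Assumption: the metadata stream is stored without a filter, so the
    packet wrapper is visible as plain bytes in the file.
    """
    begin = XPACKET_BEGIN.search(pdf_bytes)
    if begin is None:
        return None
    end = XPACKET_END.search(pdf_bytes, begin.end())
    if end is None:
        return None
    return pdf_bytes[begin.end():end.start()].strip()
```

Typical use would be `find_xmp_packet(open(path, "rb").read())`, with the result handed to an RDF/XML parser.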
Below I examine both approaches and see how they can be used to encode the kind of metadata that scholarly publishers are accustomed to.
A - Document Information Dictionary
Note that keys in the document information dictionary divide equally between the logical document description (non-asterisked keys) and the physical asset description (asterisked keys):
- Title
- Author
- Subject
- Keywords
- Creator*
- Producer*
- CreationDate*
- ModDate*
This is the complete listing of keys in the PDF specification, although foreign keys are allowed (and ignored).
What is missing here is any document identifier and/or any other descriptive metadata. From a Crossref point of view the identifier (the DOI) is a “hook” into the metadata record and so at minimum this could usefully be added. The question then is how? Either the identifier can be squeezed into one of the existing fields (“Title”, “Author”, “Subject”, “Keywords”) or else a new foreign key could be created.
IMO if an existing key is used then I would opt for “Subject” or “Keywords”, and probably the former. If, on the other hand, a new foreign key were to be created I would choose something generic and (in keeping with the other terms) use something like “Identifier” (rather than, say, “DOI”).
Of preference, I think I would go for the latter (“Identifier”), but to make this more robust one could also add the DOI to a known key (e.g. “Subject” or “Keywords”). So, to include metadata for the news article “Cosmology: Ripples of early starlight” published in Nature (Nature 445, 37 (2007); doi:10.1038/445037a), we might include the following terms in the document information dictionary:
/Title ( Cosmology: Ripples of early starlight )
/Author ( Craig J. Hogan )
/Subject ( doi:10.1038/445037a )
/Keywords ( cosmology infrared protogalaxy starlight )
/Identifier ( doi:10.1038/445037a )
/Creator ( ... )
/Producer ( ... )
/CreationDate ( ... )
/ModDate ( ... )
where the “Identifier” entry is a foreign key/value pair.
Note: This (including the DOI in the “Subject” field) is a fix intended to get the DOI listed by Adobe apps which would not otherwise recognize the foreign key “Identifier”.
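On the consuming side, a reader that cannot know which convention a publisher chose might check the foreign key first and then fall back to the standard ones. The following is a minimal sketch of that lookup order. The names `parse_info` and `find_doi` are hypothetical, the regular expression only handles simple literal strings (no nested unescaped parentheses, no octal or `\n`-style escapes), and the DOI pattern is a rough heuristic rather than a formal definition.

```python
import re
from typing import Dict, Optional

# "/Key ( value )" with a simple literal string as the value.
ENTRY = re.compile(r"/(\w+)\s*\(((?:\\.|[^\\()])*)\)")
# Rough DOI heuristic: optional "doi:" prefix, then "10.<registrant>/<suffix>".
DOI = re.compile(r"\b(?:doi:)?(10\.\d{4,9}/\S+)", re.IGNORECASE)


def parse_info(text: str) -> Dict[str, str]:
    """Parse information-dictionary entries from their textual form."""
    entries = {}
    for key, raw in ENTRY.findall(text):
        # Drop the backslash from escaped characters such as \( and \).
        entries[key] = re.sub(r"\\(.)", r"\1", raw).strip()
    return entries


def find_doi(entries: Dict[str, str],
             keys=("Identifier", "Subject", "Keywords")) -> Optional[str]:
    """Return the first DOI found, trying the foreign key before the fallbacks."""
    for key in keys:
        match = DOI.search(entries.get(key, ""))
        if match:
            return match.group(1)
    return None
```

Checking “Identifier” first means the fallback fields are only consulted for files written by tools that dropped or never wrote the foreign key.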
Since it is not really practical to include separate enumerated fields within the information dictionary (although it could be done), one might also consider including a descriptive citation field as a foreign key, e.g., something like:
/Source (Nature 445, 37 \(2007\))
Alternatively, that might better be presented as the “Subject” along with the DOI, which would then limit the number of foreign keys to one (“Identifier”).
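For anyone producing such a dictionary, the entries discussed in this section can be serialized along the following lines. This is only a sketch: the helper names (`pdf_string`, `pdf_date`, `info_dictionary`) are hypothetical, keys are assumed to be plain alphanumeric names, and values are assumed to be ASCII (other text would need PDFDocEncoding or UTF-16BE handling, which is not shown).

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def pdf_string(text: str) -> str:
    """Serialize text as a PDF literal string, escaping \\ ( and )."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def pdf_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as a PDF date string."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("expected a timezone-aware datetime")
    minutes = int(offset.total_seconds() // 60)
    sign = "+" if minutes >= 0 else "-"
    hours, mins = divmod(abs(minutes), 60)
    return f"D:{moment:%Y%m%d%H%M%S}{sign}{hours:02d}'{mins:02d}'"


def info_dictionary(entries: Dict[str, str]) -> str:
    """Render key/value pairs as a PDF dictionary body."""
    lines = [f"/{key} {pdf_string(value)}" for key, value in entries.items()]
    return "<<\n" + "\n".join(lines) + "\n>>"


doi = "doi:10.1038/445037a"
print(info_dictionary({
    "Title": "Cosmology: Ripples of early starlight",
    "Author": "Craig J. Hogan",
    "Subject": doi,             # fallback for tools that ignore foreign keys
    "Keywords": "cosmology infrared protogalaxy starlight",
    "Identifier": doi,          # foreign key
    "Source": "Nature 445, 37 (2007)",  # foreign key, citation string
}))
```

Writing the DOI to both “Subject” and “Identifier” costs a few bytes and keeps the value visible to tools that only display the standard keys.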
B - (Document) Metadata Stream
The metadata stream with its use of XMP packets (wrapping RDF/XML instances) is a much more flexible approach to embedding metadata and allows multiple schemas to be used. As noted in my previous post here on XMP, PDFs with XMP packets mostly use media-specific terms and schemas, although there is also a token showing of DC. From a descriptive metadata point of view we would more likely make use of DC and PRISM for our schemas.
Reprising the example from the previous post (and again using the citation example above), we might include the following terms for a scholarly work (here in RDF/N3 for readability):
dc:title "Cosmology: Ripples of early starlight" ;
dc:identifier "doi:10.1038/445037a" ;
dc:source "Nature 445, 37 (2007)" ;
dc:date "2007-01-04" ;
dc:format "application/pdf" ;
dc:publisher "Nature Publishing Group" ;
dc:language "en" ;
dc:rights "© 2007 Nature Publishing Group" ;
prism:publicationName "Nature" ;
prism:issn "0028-0836" ;
prism:eIssn "1476-4679" ;
prism:publicationDate "2007-01-04" ;
prism:copyright "© 2007 Nature Publishing Group" ;
prism:rightsAgent "email@example.com" ;
prism:volume "445" ;
prism:number "7123" ;
prism:startingPage "37" ;
prism:endingPage "37" ;
prism:section "News and Views" ;
This would look something like the following as an XMP packet within a PDF metadata stream (the RDF now being serialized as RDF/XML):
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<dc:creator>Craig J. Hogan</dc:creator>
<dc:title>Cosmology: Ripples of early starlight</dc:title>
<dc:source>Nature 445, 37 (2007)</dc:source>
<dc:publisher>Nature Publishing Group</dc:publisher>
<dc:rights>© 2007 Nature Publishing Group</dc:rights>
<prism:copyright>© 2007 Nature Publishing Group</prism:copyright>
<prism:section>News and Views</prism:section>
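As a rough illustration of how such a packet might be generated, the sketch below builds flat DC and PRISM properties with Python's standard `xml.etree.ElementTree` and wraps them in the xpacket processing instructions. Several things here are assumptions: the function name `build_xmp_packet` is hypothetical, the PRISM namespace URI is assumed to be the PRISM 2.0 basic namespace, and every property is written as a plain text element to mirror the listing above. Strict XMP expects some DC properties (for example dc:creator and dc:title) to be wrapped in rdf:Seq or rdf:Alt containers, which this sketch does not do.

```python
import xml.etree.ElementTree as ET
from typing import Dict

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    # Assumption: PRISM 2.0 basic namespace; adjust to the version in use.
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def build_xmp_packet(terms: Dict[str, str]) -> str:
    """Wrap 'prefix:name' -> value pairs in an XMP packet (flat properties only)."""
    xmpmeta = ET.Element(f"{{{NS['x']}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{NS['rdf']}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{NS['rdf']}}}Description", {f"{{{NS['rdf']}}}about": ""}
    )
    for qname, value in terms.items():
        prefix, local = qname.split(":", 1)
        ET.SubElement(description, f"{{{NS[prefix]}}}{local}").text = value
    body = ET.tostring(xmpmeta, encoding="unicode")
    # The begin attribute carries U+FEFF; end='w' marks the packet as writable.
    return (
        "<?xpacket begin='\ufeff' id='W5M0MpCehiHzreSzNTczkc9d'?>\n"
        + body
        + "\n<?xpacket end='w'?>"
    )


packet = build_xmp_packet({
    "dc:title": "Cosmology: Ripples of early starlight",
    "dc:identifier": "doi:10.1038/445037a",
    "prism:publicationName": "Nature",
    "prism:volume": "445",
    "prism:startingPage": "37",
})
```

The resulting string would still need to be encoded (typically as UTF-8) and placed in a metadata stream object referenced from the document catalog.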
Some useful references are:
Note a): See Section 10.2, “Metadata” in Ref. 1.
Note b): Ref.  is a fairly brief draft which covers both the Information Dictionary and Metadata Dictionary (XMP) approaches. There is an Adobe-hosted update to this document from June 2002 but that only discusses the XMP approach.