Building the Graph of Medieval Data

Researchers in Classics and Ancient History have achieved a great deal in the Linked Data world, through services like Pelagios, Pleiades, Perseus, Arachne and CLAROS.

Two of their major initiatives have been to create and publish Uniform Resource Identifiers (URIs) for specific entities, and to reuse these URIs across different services. Most of the interlinking, at this stage, centres on the geographical names recorded in the Pleiades gazetteer, via the Pelagios API and graph explorer.

The result has been an increasingly integrated framework for linking across multiple datasets, under the general rubric of the Graph of Ancient World Data. (1)


There is nothing equivalent to this for medieval studies. This makes it very difficult for researchers (like me) who are interested in joining up diverse sources of evidence relating to medieval manuscripts and analyzing the aggregated information.

Here is an initial list of the major elements which will be needed to create a similar kind of Linked Data infrastructure for medieval manuscript research:

  • Identifiers for medieval people, places, and organizations
  • Identifiers for individual manuscripts – mapped to varying ways of citing institutional shelf-marks
  • Identifiers for the texts carried by manuscripts
  • Linkable versions of specialist vocabularies for describing scripts, decoration and illumination, bindings, coats of arms and bookplates

Medieval people are represented, at least to some extent, in existing Linked Data services – especially Wikidata, VIAF, and Library of Congress Names. Peter Abelard, for example, has a Wikidata record with at least twenty other identifiers cross-linked. Extracting these records could form the basis for a “medieval people” service, which could then be augmented from specialist prosopographical sources.

But the other elements are more problematic. I have written previously about the lack of standard identifiers for medieval manuscripts. There are numerous reference books and databases (such as Scriptorium’s index of manuscripts cited) which list and cross-reference institutional shelf-marks. What is missing is a location-independent identifier (URI) service, to which these different sources can be mapped. It’s very encouraging to see that the German national programme for manuscript digitization includes a proposal for assigning unique identifiers to individual manuscripts. (2)

Problems with identifying and naming medieval texts are discussed by Richard Sharpe in his book Titulus: Identifying Medieval Latin Texts: an Evidence-Based Approach (Turnhout: Brepols, 2003). While titles of medieval works do occur in the Library of Congress Names service, for example, there are far more extensive and authoritative lists which could be expressed as Linked Data URIs. The German master plan also makes provision for identifiers for individual texts.

One of the underlying difficulties with developing Linked Data URIs for medieval entities is that many of the relevant source materials are not yet in a digital form which is suitable for reuse in the Linked Data world. Expressing specialist vocabularies and thesauri in the SKOS format, for instance, would be a worthwhile goal. Other reference works are available only in print or as PDF files.
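
As a rough illustration of what a SKOS version of a specialist vocabulary might look like, the sketch below writes a few terms out as Turtle using only the Python standard library. The base URI `http://example.org/scripts/` and the two sample terms are placeholders invented for this example, not entries from any published thesaurus, and the helper names are hypothetical.

```python
# Hypothetical namespace; a real vocabulary would use a domain it controls.
BASE = "http://example.org/scripts/"

def skos_concept(slug, label, broader=None, lang="en"):
    """Return one skos:Concept as a Turtle snippet.

    Labels are assumed not to contain double quotes; a real exporter
    would need to escape them.
    """
    lines = [
        f"<{BASE}{slug}> a skos:Concept ;",
        f'    skos:prefLabel "{label}"@{lang}',
    ]
    if broader is not None:
        # Link the term to its parent term in the hierarchy.
        lines[-1] += " ;"
        lines.append(f"    skos:broader <{BASE}{broader}>")
    lines[-1] += " ."
    return "\n".join(lines)

def skos_document(concepts):
    """Join a prefix declaration and a list of concept tuples into one document."""
    prefix = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
    return "\n\n".join([prefix] + [skos_concept(*c) for c in concepts])

if __name__ == "__main__":
    print(skos_document([
        ("gothic", "Gothic script"),
        ("textualis", "Textualis", "gothic"),
    ]))
```

Once a vocabulary is in this form, each term has its own URI and can be referred to from manuscript descriptions held elsewhere.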

Even where the source materials are in a more easily reusable digital form, they may not be available for copyright reasons. This is notably the case with the various databases from Brill, Brepols and ProQuest (Chadwyck-Healey) – dictionaries, directories, biographical information, texts and so on. These contain large numbers of entries for specific medieval people, places, texts, manuscripts and so on. Their incorporation into a “Graph of Medieval Data”, without infringing the publishers’ rights, would require detailed technical negotiations.

There is plenty of existing activity aimed at creating shareable digital materials derived from medieval manuscripts. This includes numerous initiatives for the transcription and encoding of texts, especially using the TEI (Text Encoding Initiative). There are also many libraries and projects creating digital images of medieval manuscripts, and there is a growing interest in enabling interoperability by sharing these images through the International Image Interoperability Framework (IIIF).

A “Graph of Medieval Data” would sit as a unifying layer above all these digital resources. It would provide a framework for cross-referencing and interlinking between existing services, and a basis for new annotation and navigation services across disparate digital resources.

This type of infrastructure currently appears to be a long way off. I would really like to see the international manuscript research community coming together to work towards a “Graph of Medieval Data” along these lines.

This approach appears to be our best hope of joining up the vast but disparate body of evidence relating to medieval manuscripts. It would be a huge boon for researchers in this field.


(1) Isaksen, Leif; Simon, Rainer; Barker, Elton T. E. and de Soto Cañamares, Pau (2014). “Pelagios and the emerging graph of ancient world data”, in: WebSci ’14: Proceedings of the 2014 ACM conference on Web science, ACM, pp. 197–201.

(2) Fabian, Claudia and Schreiber, Carolin (2014). “Piloting a national programme for the digitization of medieval manuscripts in Germany”, Liber Quarterly 24 (1).

Towards Unique Identifiers for Medieval and Renaissance Manuscripts

At the recent Schoenberg Symposium, I suggested that we need a unique identifying system for medieval and Renaissance manuscripts. We need this for two main reasons: to overcome the difficulties inherent in current identification methods, and to ensure that manuscript information can be incorporated into the world of Linked Data.

Current scholarly practice is to cite manuscripts by their present location, institution and shelf-mark. So the Beowulf manuscript should be cited as London, British Library, Cotton Vitellius A XV and the Codex Sinaiticus as London, British Library, Add. 43725. This approach underlies the manuscript indexes of the journal Scriptorium.

As several people at the Schoenberg Symposium were quick to point out, this approach is full of difficulties:

  • Shelf-marks, even at the same institution, change over time. So, for example, the manuscript now referred to as “BnF Latin 9” was previously “Regius 3570”.
  • The names of institutions change over time. The British Library used to be the British Museum; the Pierpont Morgan Library is now the Morgan Library and Museum.
  • Some institutions do not give their manuscripts unique, citable shelf-marks. Alternatives might include a Dewey Decimal classification number, or a generic shelf location.
  • Manuscripts move between different institutions, even today. A move of this kind renders previous citations obsolete.
  • The format of these kinds of shelf-marks is vulnerable to mis-spellings and to numerous variations and inconsistencies. Is it BL or British Library? Add. or Additional?
  • Even if the shelf-marks are unique and consistent, they may not have stable URL equivalents. The State Library of Victoria’s manuscripts, for example, have “handle” URLs for their digitized versions, but not for their catalogue records.
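
The spelling problem in particular can be made concrete with a small sketch. The function below normalizes a citation of the form "City, Institution, Shelf-mark" (the city is optional) using two lookup tables. Both tables are tiny illustrations built from the examples above, and the function name is hypothetical; a working service would need far larger tables and would still meet cases it cannot resolve.

```python
import re

# Illustrative lookup tables only, based on the variants mentioned above.
INSTITUTION_ALIASES = {
    "bl": "British Library",
    "british library": "British Library",
    "british museum": "British Library",  # former name, per the list above
}
COLLECTION_ALIASES = {
    "add": "Add.",
    "add.": "Add.",
    "additional": "Add.",
}

def normalize_citation(citation):
    """Normalize '[City, ]Institution, Shelf-mark' to one spelling.

    Assumes the shelf-mark itself contains no commas.
    """
    parts = [re.sub(r"\s+", " ", p).strip() for p in citation.split(",")]
    if len(parts) < 2:
        raise ValueError("expected at least an institution and a shelf-mark")
    *prefix, institution, shelfmark = parts
    institution = INSTITUTION_ALIASES.get(institution.lower(), institution)
    tokens = shelfmark.split()
    if tokens:
        # Only the leading collection name is normalized, not the number.
        tokens[0] = COLLECTION_ALIASES.get(tokens[0].lower(), tokens[0])
    return ", ".join(prefix + [institution, " ".join(tokens)])

if __name__ == "__main__":
    print(normalize_citation("BL, Additional 43725"))
```

Normalization of this kind reduces spelling variation, but it does nothing about shelf-marks that change or manuscripts that move, which is why a separate identifier is still needed.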

In the Phillipps project, I am fortunate that the manuscripts have their own system of identifiers, which is not tied to their current institutional location. Sir Thomas Phillipps gave his manuscripts individual numbers, which are widely quoted in library catalogue records and in booksellers’ and dealers’ catalogues. The numbers were usually marked on the manuscripts themselves, and have survived the various changes of ownership since the dispersal of the Phillipps Collection.

For my purposes, the Phillipps numbers appear to be sufficiently unique to serve as identifiers. But even these numbers have their problems:

  • A single manuscript may have more than one Phillipps number. The University of Western Australia’s copy of Virgil’s Aeneid was recorded twice in Phillipps’ catalogue (in error), and therefore has the numbers 988 and 2878.
  • The same Phillipps number may have been assigned to more than one manuscript. This is evident in the hand-written supplementary list of manuscripts 23,838 to 26,365, held in the Grolier Club’s Library, where many titles have been crossed out and the numbers re-used for different manuscripts.
  • The Phillipps number may have been recorded incorrectly in subsequent indexes and catalogues. The British Library’s card index to the provenance of Phillipps manuscripts, for example, ends with manuscript number 74,539, which is a simple transcription error for 24,539.
  • There are numerous Phillipps manuscripts which never received a Phillipps number. His printed catalogue finishes at 23,837; Edward Bond’s handwritten supplementary list finishes at 26,179 in one version and 26,365 in another. Thomas Fitzroy Fenwick continued the numbering up to 38,628, though his list has not survived. Munby estimated up to 60,000 manuscripts in all. Unnumbered Phillipps manuscripts are still advertised for sale through sites like AbeBooks, even today.
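
The first two problems mean that the relationship between Phillipps numbers and manuscripts is many-to-many rather than one-to-one. A minimal sketch of an index that allows for this is given below. The class name and the manuscript identifiers are hypothetical, and the reused number 24000 is an invented example, not a documented case.

```python
from collections import defaultdict

class PhillippsIndex:
    """Many-to-many index between Phillipps numbers and manuscript identifiers."""

    def __init__(self):
        self._by_number = defaultdict(set)
        self._by_manuscript = defaultdict(set)

    def link(self, number, manuscript_id):
        """Record that a Phillipps number was assigned to a manuscript."""
        self._by_number[number].add(manuscript_id)
        self._by_manuscript[manuscript_id].add(number)

    def manuscripts_for(self, number):
        return sorted(self._by_number.get(number, ()))

    def numbers_for(self, manuscript_id):
        return sorted(self._by_manuscript.get(manuscript_id, ()))

    def ambiguous_numbers(self):
        """Numbers that point to more than one manuscript."""
        return sorted(n for n, ms in self._by_number.items() if len(ms) > 1)

if __name__ == "__main__":
    index = PhillippsIndex()
    # The Aeneid recorded twice in the catalogue (identifier is made up).
    index.link(988, "ms:uwa-aeneid")
    index.link(2878, "ms:uwa-aeneid")
    # An invented case of one number reused for two manuscripts.
    index.link(24000, "ms:example-a")
    index.link(24000, "ms:example-b")
    print(index.numbers_for("ms:uwa-aeneid"))
    print(index.ambiguous_numbers())
```

Unnumbered manuscripts simply have no entry on the number side, which is one more reason the manuscript identifier cannot be the Phillipps number itself.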

My proposal is for a unique identifier which conforms to the Uniform Resource Identifier (URI) model used in the world of Linked Data.

Best practice for minting and structuring these URIs is described in the document “Cool URIs for the Semantic Web”, produced by the World Wide Web Consortium (W3C). An example of their implementation is given by Linked Data Finland. Some background information can also be found in Phil Archer’s “Study on Persistent URIs”, prepared for the European Commission in 2012.

  • This kind of identifier does not need to conform to (or incorporate) any current or past shelf-marks.
  • Individual codices would have their own URIs.
  • Multi-volume codices could be given a single URI, with subsidiary URIs for each volume.
  • Fragments which were formerly part of a codex could be treated like this: if an item can be (or has been) catalogued individually by the current institution, then it should have its own URI.
  • Individual documents would have their own URIs.
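
One of the patterns described in “Cool URIs for the Semantic Web” is to give the manuscript itself one URI and to answer requests for it with a 303 redirect to a document about the manuscript, chosen according to what the client asks for. The sketch below shows only that decision logic. The `/id/`, `/page/` and `/data/` path prefixes and the function name are assumptions made for this example, and the Accept header handling is deliberately naive (it ignores quality values).

```python
# Media types treated as a request for machine-readable data.
RDF_TYPES = ("text/turtle", "application/rdf+xml")

def resolve(path, accept="text/html"):
    """Return (status, location) for a request to an identifier URI.

    '/id/<local>' identifies the manuscript itself, so the response is a
    303 redirect to a web page or to a data document that describes it.
    """
    if not path.startswith("/id/"):
        return 404, None
    local_id = path[len("/id/"):]
    wants_rdf = any(media_type in accept for media_type in RDF_TYPES)
    target = "/data/" if wants_rdf else "/page/"
    return 303, target + local_id

if __name__ == "__main__":
    print(resolve("/id/00000001"))
    print(resolve("/id/00000001", accept="text/turtle"))
```

The point of the redirect is that the URI for the physical manuscript stays distinct from the URIs of the web page and the data file that describe it.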

Current catalogue records could be used as a starting-point. Each current catalogue record for a manuscript could be regarded as an entity which needs a URI. A basic initial approach might be as follows:

  • Create a URI for each individual manuscript codex currently held and catalogued in a public collection.
  • Create a URI for each document individually catalogued in a public collection.
  • Map current and past shelf-marks to the URI.
  • Map current and past catalogue records to the URI.
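
These steps can be sketched as a small in-memory registry. This is an illustration under stated assumptions, not a design for the actual service: the namespace `http://example.org/manuscript/`, the sequential numbering scheme, and the class and method names are all invented for the example. The identifier is deliberately opaque and carries no shelf-mark.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional, Tuple

BASE_URI = "http://example.org/manuscript/"  # hypothetical namespace

@dataclass
class ManuscriptRecord:
    uri: str
    institution: str
    shelfmark: str                      # current shelf-mark
    title: Optional[str] = None
    former_shelfmarks: List[Tuple[str, str]] = field(default_factory=list)

class IdentifierRegistry:
    def __init__(self):
        self._counter = count(1)
        self._records = {}
        self._lookup = {}               # (institution, shelfmark) -> URI

    def mint(self, institution, shelfmark, title=None):
        """Create a URI for a catalogued manuscript, or return the existing one."""
        key = (institution, shelfmark)
        if key in self._lookup:
            return self._lookup[key]
        uri = f"{BASE_URI}{next(self._counter):08d}"
        self._records[uri] = ManuscriptRecord(uri, institution, shelfmark, title)
        self._lookup[key] = uri
        return uri

    def add_former_shelfmark(self, uri, institution, shelfmark):
        """Map a past shelf-mark to an existing URI."""
        self._records[uri].former_shelfmarks.append((institution, shelfmark))
        self._lookup[(institution, shelfmark)] = uri

    def resolve(self, institution, shelfmark):
        """Look up the URI for a current or past shelf-mark, if known."""
        return self._lookup.get((institution, shelfmark))

if __name__ == "__main__":
    registry = IdentifierRegistry()
    uri = registry.mint("BnF", "Latin 9")
    registry.add_former_shelfmark(uri, "BnF", "Regius 3570")
    print(registry.resolve("BnF", "Regius 3570"))
```

Mapping catalogue records to the URI would work the same way, with a record identifier or URL in place of the shelf-mark.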

Subsequent use cases would include the following:

  • Manuscripts which are now dispersed or fragmented could be virtually re-united by creating an additional URI for the original manuscript and creating relationships between this URI and the URIs for each current fragment.
  • Previously separate manuscripts which are now combined into a single volume could be virtually dis-bound by creating additional URIs for each former manuscript and creating relationships between these URIs and the URI for the current codex.
  • Information from different sources about the same manuscript could be linked by matching disparate data to the same URI.
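
The first of these use cases, virtual re-unification, amounts to a handful of statements linking each fragment URI to a URI for the original manuscript. A minimal sketch follows, writing the statements as N-Triples. The example URIs are placeholders, the function names are hypothetical, and the choice of the Dublin Core property `dcterms:isPartOf` is an assumption made for illustration; a real service might well prefer a more specific property for "formerly part of".

```python
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

def reunite(original_uri, fragment_uris):
    """Return N-Triples lines stating that each fragment is part of the original."""
    return [
        f"<{fragment}> <{DCTERMS_IS_PART_OF}> <{original_uri}> ."
        for fragment in fragment_uris
    ]

def fragments_of(triples, original_uri):
    """Recover the fragment URIs for one original from lines made by reunite()."""
    suffix = f"<{DCTERMS_IS_PART_OF}> <{original_uri}> ."
    return [
        line.split(" ", 1)[0].strip("<>")
        for line in triples
        if line.endswith(suffix)
    ]

if __name__ == "__main__":
    triples = reunite(
        "http://example.org/manuscript/00000010",
        [
            "http://example.org/manuscript/00000002",
            "http://example.org/manuscript/00000007",
        ],
    )
    print("\n".join(triples))
```

Virtual dis-binding uses the same kind of statement in the other direction: each former manuscript gets its own URI, and the relationship points from it to the URI of the current composite codex.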

I am not proposing a unified central catalogue of manuscripts, in which full descriptions would be normalized to an agreed metadata schema. Instead, an identifier service would provide a crucial structural element which could be used as the basis for future aggregations of data relating to manuscripts. The service would need to incorporate some minimal descriptive information about the manuscript referred to by each URI: a shelf-mark and institution, at the very least, preferably accompanied by a title (conventional or bibliographical).

The technical aspects of this proposal are one issue. Even more crucial, though, are the politics and funding involved in setting up a service to mint, manage and distribute such URIs. In the book world, much of the impetus for ISBNs and ISSNs (and their predecessors) came from the book trade, which could see a clear commercial advantage in unique numbering systems. In the wider world of Linked Data, various URI services for personal names (like VIAF, ISNI and ORCID) have been developed by consortia and co-operatives in the world of libraries and publishing.

A manuscript identifier service, in contrast, has less commercial value. It will take a combination of libraries and researchers – and possibly publishers – to develop, implement and fund such a service. Some of the key benefits and justifications will be:

  • A framework like this is necessary for any global or international integrated system related to manuscripts.
  • It can overcome the fragmentary nature of the many manuscript databases now in existence, and help to link the proliferating collections of digitized manuscripts.
  • There are huge benefits for researchers in being able to find manuscripts – and information about them – much more quickly and reliably, as well as being able to cite manuscripts more effectively and unambiguously in their own research.
  • There are significant benefits for libraries in promoting their manuscripts, building links to scholarship based on their manuscripts, and connecting their manuscripts to other manuscripts held elsewhere.

There are several existing initiatives working towards unique identifiers for manuscripts. [1]

These identifiers have also been adopted by Diktyon, the “digital network for Greek manuscripts”: http://www.diktyon.org/en/identifiers-manuscripts

This identifier does not necessarily equate to a single manuscript codex (or even one manuscript in multiple volumes). The URL http://pinakes.irht.cnrs.fr/notices/fond/id/977 represents three manuscripts owned by the Library Company of Philadelphia, which also have individual catalogue records and identifiers.

The Trismegistos number (TM_id) maps between (1) publication identifiers (especially sigla), (2) collection inventory numbers (i.e. equivalent of shelf-marks) and (3) conventional names like “the Rosetta Stone”.

These numbers are used solely within the context of the Trismegistos database. They are not expressed as Linked Data URIs, though they do have stable URLs.

  • The Europeana digital library aggregates metadata about digitized objects from many European cultural institutions: http://europeana.eu

It includes URIs for each object, created and structured in accordance with the framework of the W3C.

A version of the Europeana Data Model specifically for hand-written manuscripts has been developed by the DM2E (Digitized Manuscripts to Europeana) project (2012-2015).

While Europeana contains records for a significant number of medieval and early modern manuscripts, it is impossible to estimate how many. Its scope is European, not global, and it excludes manuscripts which have not been digitized.

Developing and hosting a manuscript identifier service will require a partnership between interested organizations in Europe and North America. These will need to include library consortia and researchers’ associations. Some possibilities might include CERL, LIBER, the Medieval Academy, the Renaissance Society of America and CARMEN. Specialist publishers like Brepols and Brill could also be involved.

Funding will also have to be raised. Some possible sources might include infrastructure funding programmes like the European Union’s Horizon 2020, and foundations like the Mellon Foundation.

Without such a service, medieval and Renaissance manuscripts are likely to miss out on the benefits to be gained from the world of Linked Data. Databases will remain dispersed and fragmented, digital resources will be difficult to locate, and citations will continue to be inconsistent and confusing. A unique identifier service is the key to linking and joining up all these resources. It will dramatically increase the efficiency, richness and interconnectedness of the manuscript digital ecosystem, to the benefit of researchers and cultural heritage institutions alike.

[1] My thanks to Cillian O’Hogan, Carrie Schroeder and Matthieu Cassin for these suggestions (via Twitter).