Wednesday, 27 June 2012

British Museum and Pelagios

A number of things struck me reading and learning about the Pelagios project.

  • Simplicity – the project has a clear and achievable focus of aligning information about places in the ancient world. By facilitating cooperation between global stores of classical data the project provides a richer and more reliable resource for project partners and the whole community.
  • Sustainability – unlike the many aggregation projects currently competing with each other, Pelagios works with organisations to harmonise rather than assert management and controls. The project is more likely to leave a lasting legacy because it is not just about aligning data but about a common and shared vision. Other projects have much to learn from the Pelagios approach.
  • Tools – by concentrating on a particular and core set of data it provides the building blocks for other initiatives and allows partners to more easily participate. As such it supports a more rapid development of tangible and useful tools currently lacking in many other linked data initiatives.
  • Pro-active – it’s exactly what we all should be doing!
The British Museum currently publishes a beta linked data Endpoint which will shortly be improved and released as a full production system. We already understand the importance of linked data, accessibility and reuse. The ResearchSpace project, which will incorporate the full BM online collection dataset, also seeks to use some of the same principles applied in the Pelagios project although, for the purposes of research, requires more detailed datasets.

As we venture into trying to make sense of larger datasets, applying the Pelagios principles will become ever more important. ResearchSpace tries to bring together the social networking models that are used, for example, for the Digital Classicist Wiki with some of the linked data approaches of Pelagios, and will provide integrated tools that operate across and within those environments. These tools include extensive annotation facilities and again, like Pelagios, the Open Annotation Collaboration guidelines are being used with the potential for further data exchange.

Data from all British Museum departments will be available, including Ancient Egypt and Sudan, Greece and Rome, and Middle East. The BM data has been mapped to the CIDOC-CRM schema and a working version of the schema is currently available. The Pelagios project will concentrate on mapping terms from the Museum’s place name thesaurus (also in RDF format and part of the CRM schema, containing both modern and archaic terms) to Pelagios stable URIs, providing the quickest route for the Museum to join the other Pelagios partners and make BM data more accessible to the community.

We look forward to contributing to Pelagios.

Dominic Oldman
Deputy Head of IS, British Museum
Principal Investigator of ResearchSpace

Tuesday, 26 June 2012

Using the Pelagios data in widgets

Over the last few months I have been developing embeddable widgets for the Pelagios project using the data from Pelagios partners via the Pelagios API. Trying out early versions of these widgets with potential users, it became clear that there was a tension between the data available and the data I needed to make a really good user interface.

As a reminder, a Pelagios-compliant set of data consists of a set of annotations. Each annotation body is a place in the Pleiades gazetteer and each annotation target is a URI identifying something that has an association with that place. Datasets can be grouped into subdatasets in a tree-like structure and partners provide a VoID file containing information about the root dataset.
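To make that structure concrete, here is a minimal Python sketch of how a widget might model annotations and nested datasets in memory. The class and field names (Annotation, Dataset, place_uri, target_uri) are illustrative assumptions and are not taken from the Pelagios API or any partner's data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    place_uri: str   # annotation body: a place in the Pleiades gazetteer
    target_uri: str  # annotation target: URI of the item associated with that place


@dataclass
class Dataset:
    title: str
    annotations: List[Annotation] = field(default_factory=list)
    subsets: List["Dataset"] = field(default_factory=list)  # tree-like subdatasets

    def all_annotations(self) -> List[Annotation]:
        """Collect annotations from this dataset and every nested subdataset."""
        found = list(self.annotations)
        for subset in self.subsets:
            found.extend(subset.all_annotations())
        return found


def targets_by_place(dataset: Dataset) -> Dict[str, List[str]]:
    """Group target URIs by the place they are annotated with."""
    grouped: Dict[str, List[str]] = {}
    for annotation in dataset.all_annotations():
        grouped.setdefault(annotation.place_uri, []).append(annotation.target_uri)
    return grouped
```

Grouping by place URI is the lookup a "show everything annotated with this place" widget would need, regardless of how deeply a partner has nested its subdatasets.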

Below I outline the main issues that have cropped up.

1) Titles for annotation targets 

Let us suppose we want to show all the items that Pelagios Partners have annotated with Corinth. It is not very useful to users to give them a list of URIs of these items, even when they are divided into subdatasets. Initially, as this was all the data I had, I either had to display these URIs or list the items as 'Item 1', 'Item 2', 'Item 3', and so on. Needless to say, neither of these options was popular with users.

As part of Pelagios 2, most partners have added a title to the annotation using dcterms:title, giving more descriptive information about items. However, strictly this should be the title of the annotation, not of the annotated target, a subtle but important difference, and of course some partners treated it as such. In the discussion on the Pelagios mailing list, it was suggested that the target of the annotation be given the title instead, which makes sense to me and will hopefully make its way into the Pelagios Cookbook. In the meantime, I am displaying the title of the annotation target if it exists, and the title of the annotation if it doesn't, as most partners have not updated their data yet.
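The fallback rule is simple enough to sketch in a few lines of Python. The function name and parameters below are hypothetical, and the final fallback to the bare URI is my own assumption for the case where neither title is present.

```python
from typing import Optional


def display_title(target_title: Optional[str],
                  annotation_title: Optional[str],
                  target_uri: str) -> str:
    """Pick the label shown for an item in the widget."""
    # Prefer the title on the annotation target, then the title on the annotation.
    for candidate in (target_title, annotation_title):
        if candidate and candidate.strip():
            return candidate.strip()
    # Assumed last resort: show the URI rather than nothing at all.
    return target_uri
```

Treating whitespace-only titles as missing avoids showing an empty label when a partner has supplied a blank dcterms:title.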

2) Dataset and subdataset titles and divisions 

Should the title of a dataset describe the set of annotations or should it describe the set of annotation targets? The former is strictly correct but will also not make much sense to users who have not heard of Pleiades and are not thinking in terms of annotations. How do we deal with this and what guidance do we give on making dataset and subdataset titles meaningful to users who may not have heard of many of the partners before seeing the widget?

There is also the question of granularity of subdatasets at different levels. Partners are free to structure their datasets and subdatasets how they wish, but this could easily leave us with user interface problems in the widgets. One possible option here might be to give some examples in the Pelagios Cookbook of subdataset structures that have worked well from an information architecture perspective for different types of partners.

3) Should annotation target URIs be URLs of HTML pages about the target?

We would obviously like a way for a user to be able to find out more information about an annotation target. At the moment all we have about an annotation target is the URI and probably a title. There is no reason why that URI should be anything but an identifier, but on the other hand, most partners have been using a URL of a page about the annotation target (e.g. the page in the online museum catalogue) so in the widget I am providing a link to the URI. But for a few partners, the URI might not exist as a URL or might contain RDF which is clearly not very user friendly. So there is an interesting question here of how we can associate the URL of a page about an annotation target with the annotation target in the data while not making things over-complicated.  

4) Meta information about partners used in the widgets

In the widgets, we want to tell users something about each partner. Users may not have heard of the partners in question, or indeed Pelagios, so a list of the Pelagios partners by itself may be fairly meaningless to them.

To give users rather more context, I asked each partner for a logo file and a short 'strapline' describing the type of data that they provide. I collated these manually and assembled them in a JSON file from which the widgets then grab the relevant information. However, it would be nice if this information were supplied in each partner's VoID file instead, so that this JSON file would not need updating for new partners, but how best to do this? The dcterms:description field is too long to use as a strapline for many partners, for example. So this is another thing for the project to think about.

In summary

One of the interesting aspects of building the widgets has been how it has made the data from the Pelagios partners much more transparent and, as a result, several interesting questions have arisen for the project as a whole. Do you have any thoughts on how the project should deal with the issues above? Please do feel welcome to comment below - I would love to hear your ideas and suggestions.

Thursday, 21 June 2012

Improving the Arachne-Pleiades matching

In this blog post we describe how we improved the accuracy of the process by which we aligned Arachne to Pleiades. The fact that the first Arachne-Pleiades matching was strictly string-based brought several problems with it. (See previous posts 1, 2.)

In a place matching process, each usable context can reduce the prospect of making errors, especially when the granularity of data on each side is not the same and one-to-many matchings are possible. Furthermore, every detail used for refining the matching reduces the number of hits you can get. As far as Pelagios compliance is concerned, a simple fact reduces the likelihood of error from the start: Pelagios focuses on ancient world data, in fact primarily material from the ancient Mediterranean. Thus the confusion of Memphis (Tennessee) in the US with Memphis in Egypt is avoided, even when both places are associated with a (or the) "King"... The main effort in the new process was to use more contexts from Arachne in order to make the matching more accurate.

The new matching

To help with matching places, we selected the context that had the most reasonable chance of enhancing the process: the geographical coordinate.

But there are problems: a single point coordinate can only approximate a place, since definitions of places are often far less exact than WGS 84 coordinates, which can be precise to within a few centimetres. In addition, the geo-coordinates on both sides have been chosen rather roughly.

Arachne internally uses three matchings:

  1. exact places in Arachne;
  2. topographical units in Arachne;
  3. the ancient landscapes in which at least those topographical units that are smaller than the landscapes are located. (Basically a topographical unit in Arachne can also be bigger than just one landscape, so their size is very flexible in the Arachne data-model.)
The first and second matchings use the rough coordinates that the Arachne team has found for places.

As the first indicator for linking, we use distance-based matching. This uses the distance between a given Arachne place and each Pleiades place as an indication of whether the two places could match. If the distance lies below a certain threshold, the Pleiades place is then matched by the string-based matching that we described in our earlier blog post.
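The two-step filter described above can be sketched roughly as follows. This is not the Arachne implementation: the Place tuple, the 20 km default threshold and the exact case-insensitive name comparison (standing in for the string-based matching from the earlier post) are all assumptions made for illustration.

```python
import math
from typing import List, NamedTuple


class Place(NamedTuple):
    identifier: str
    name: str
    lat: float  # degrees
    lon: float  # degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def match_candidates(arachne_place: Place,
                     pleiades_places: List[Place],
                     threshold_km: float = 20.0) -> List[Place]:
    """Keep Pleiades places that are both close enough and share the name."""
    matches = []
    for candidate in pleiades_places:
        distance = haversine_km(arachne_place.lat, arachne_place.lon,
                                candidate.lat, candidate.lon)
        if distance >= threshold_km:
            continue  # too far away to be the same place
        # Placeholder for the string-based matching step (assumed exact match).
        if candidate.name.casefold() == arachne_place.name.casefold():
            matches.append(candidate)
    return matches
```

Because the coordinates on both sides are rough, the threshold has to be generous enough to tolerate that imprecision while still separating same-named places that lie far apart.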

Objects and other sources are linked to places and topographies, making it easy to connect them to the Pleiades places.

Where to advance the matching?

After producing a whole lot of links you get into trouble if you have to review them all: it is simply beyond the capacity of a single person to look at each linked entity. So, we need a faster way to review large quantities of links.
Description of the construction of the co-occurrence networks (created with yEd)

The complexity of handling ancient place-names emerges when you look at the following visualizations, created with Gephi, which show the co-occurrence networks of the Arachne-Pleiades matching: the Pleiades entities are shown as nodes; the edges or links between them are created when they are referenced by the same Arachne place. As we have described previously, an Arachne place bears more features than just one place name. It also keeps the country, city, etc. in hierarchical order.

These visualizations reveal just some of the problems you can expect when you use anything other than a 1-to-1 matching.
For example, the co-occurrence links should reveal a match between the city level and the ancient province. The main result of the place matching should also follow a more or less tree-like structure, because there are places that are part of other places or regions. So, one place in a dataset can be matched to two places in the other place-system.
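The construction of the co-occurrence network described above can be sketched in Python. The input format (pairs of Arachne and Pleiades identifiers) and the function name are assumptions for illustration; the actual graphs were built and laid out with the tools named in the text.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def cooccurrence_edges(
    links: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Build co-occurrence edges from (arachne_id, pleiades_id) links.

    Nodes are Pleiades places. An edge joins two Pleiades places whenever
    the same Arachne place links to both; the edge value lists the Arachne
    places responsible for it.
    """
    by_arachne = defaultdict(set)
    for arachne_id, pleiades_id in links:
        by_arachne[arachne_id].add(pleiades_id)

    edges = defaultdict(list)
    for arachne_id, pleiades_ids in by_arachne.items():
        # Sorting gives each undirected edge a single canonical key.
        for first, second in combinations(sorted(pleiades_ids), 2):
            edges[(first, second)].append(arachne_id)
    return dict(edges)
```

An Arachne place that links to only one Pleiades place contributes no edge, so 1-to-1 matches are invisible in this kind of network.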

The old Arachne matching:
Geo-layout overview of the Pleiades places co-occurring with Arachne places using the old matching
Visualization explanation
  • Nodes: Entities from the Pleiades dataset that occur in the linkage
  • Edges: Entities from Arachne which link to at least two Pleiades places. An edge means that two Pleiades places are referenced by the same Arachne place
  • Node Color: Latitude (south-north) blue (south) to grey (center) to green (north)
  • Node Size: Longitude (east-west) from small (west) to big (east)
This overview helps us to find out, on a place matching level, how accurate the place matching is.

This visualization shows that a matching which mainly uses place-names produces a co-occurrence network with long edges. The fact that Arachne places (which are represented by the edges) link a lot of Pleiades places with very different coordinates shows the level of label-based mismatches. This was a clear hint that there were a lot of errors in our matching that should be removed.
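The same check can be done without a picture by measuring each co-occurrence edge and listing the longest ones for review. The sketch below is only an illustration: the 500 km cut-off and the function names are assumptions, not values used in the Arachne matching.

```python
import math
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def great_circle_km(a: Coord, b: Coord) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlambda = math.radians(b[1] - a[1])
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def long_edges(edges: Iterable[Tuple[str, str]],
               coords: Dict[str, Coord],
               max_km: float = 500.0) -> List[Tuple[str, str, float]]:
    """Return edges longer than max_km, longest first, as review candidates."""
    flagged = []
    for first, second in edges:
        distance = great_circle_km(coords[first], coords[second])
        if distance > max_km:
            flagged.append((first, second, distance))
    return sorted(flagged, key=lambda edge: edge[2], reverse=True)
```

A long edge is only a hint, not proof of an error: a city and the province containing it can legitimately be far apart if the province's point coordinate is chosen roughly.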

Bad links, good links

A good link, as shown in the example picture above, would be produced by the linking of the Acropolis in Athens in Arachne to both Achaea (the Roman province) and Athenae in Pleiades. The Acropolis is in Athenae and it is also in Achaea (on another level of granularity). So in the visualisation the Acropolis would produce a link between Achaea and Athenae.

A bad link example would be the linking between Istanbul and "Byzantion?" and "Byzantium". Here we would have the problem that the two Byzantium/Byzantion places (both used in the matching as synonyms) are far away from each other. We could expect the name similarity to have the same origin as in the Memphis example from the introduction. This would also create a very long (very bad) edge from modern Istanbul to India.

The nodes in this visualization use the coordinates from the Pleiades dataset for positioning, which is achieved using a Gephi plugin. Even if they are rough, they can give a hint about the distance between the places they describe. This positioning is independent of the coordinates in Arachne, so it would also be applicable to datasets that do not have geo-data for their own places.

The new Arachne matching:
Geo-layout overview of the Pleiades places co-occurring with Arachne places using the new matching
In this graph there are far fewer long links between places. Many of the co-occurrence links are covered by the nodes.  The picture shown is far clearer than the first graph, because there are far fewer co-occurrence links of the "long" sort. These long links indicate that two places are somehow connected that aren't in a similar location.

One should keep in mind that the fact that there are fewer co-occurrence links does not mean that there are fewer links: there could even be more links that just connect one Arachne place (more 1-1 relations) to one Pleiades place, but they won't be shown in the visualization.  We can say this because of the closed context of the Mediterranean Area! If we were to try to extrapolate these results to the whole world, it wouldn't work, because the links on the borders of the projection would go from one end of the projection to the other, but the places on the border would be close to each other.


The following visualizations depict the complete view of the matchings. In these graphs, we have changed the method of node positioning from coordinate based to force based.

In the following examples, it is important to interpret the size and colour of the nodes. As you can see, the size of the nodes depends on the longitude (east-west axis) of the Pleiades places represented by the node. The latitude (south-north axis) value is shown by the colour of the nodes, which changes from green to grey to blue. For example, a place is green and small because it lies in the north west. Another place is big and blue because it lies in the south east. The places in the central region of ancient Greece are medium sized and grey. (overview)

Here the length of edges or links is not that clear. The force-based layout shows where Pleiades places cluster by referencing the same Arachne Place(s).

The old matching again:

Force-based layout overview of the Pleiades places co-occurring with Arachne places using the old matching
Here we notice a mixture of node sizes and colours within the clusters. This is another indicator of errors in the matching.

The new matching again:
Force-based layout overview of the Pleiades places co-occurring with Arachne places using the new matching

This overview is much clearer: there are a lot of nearly tree-like structures, as we would expect.

An example from the old matching:

Detail that shows the places called Alexandria in the old matching
The obvious Alexandria-disambiguation-example is old hat (there are many Alexandrias in the ancient world!), but it acts as a useful example for showing how to read the visualization. The different Alexandrias are all connected - one placename matches them all - but they differ in their locations, as shown by colour and size. 

What we have learned:

An important lesson has been: don't use modern country names and don't use continent names! This is especially important because the ancient definitions of Asia and Europa (the German spelling of Europe) differ markedly from their modern definitions. These errors produce a very visible effect, as you can see below:

Detail that shows the influence of Europa in co-occurrence of the old matching
Here Europe links to nearly everything, because the Arachne places contain the information that most places in Arachne are in the space of the modern definition of Europe! The connection between Asia Minor and Europe comes from the fact that there are places with the same name in modern Asia and modern Europe, so they will also co-occur.

The way in which the enhanced matching works is shown by the term "Asklepieion". Asklepieion is especially interesting because it denotes a god as well as whole sanctuaries or specific temple complexes that can be inside or outside of larger sanctuaries, while those temple complexes or sanctuaries can in turn be inside or outside of cities. In the old matching, all the "Asklepieion" occurrences were matched together and composed a large co-occurrence cluster:

Cluster that shows the co-occurrence of Asklepieion in the old matching
The new matching separates the Asklepieions from each other and shows how they move in other contexts:
First occurrence of Asklepieion in the new matching with all connected Places highlighted
Second occurrence of Asklepieion in the new matching with all connected Places highlighted
Third occurrence of Asklepieion in the new matching with all connected Places highlighted

What still needs to be done

The new matching does not solve all the problems shown here: for example, there are many different "Asias" that describe different Roman provinces at different times and sizes of expansion. This is not taken into account – we need a much more time-place-interactive Mediterranean Gazetteer. But without the time context, these problems are hard to resolve because the places have nearly the same point coordinates. So there is also the question of which side should enhance and refine its data.

Reinhard Förtsch, Rasmus Krempel
Arachne Database, CoDArchLab University of Cologne

Wednesday, 20 June 2012

Inscriptions of Israel/Palestine - New Partner Introduction

The Inscriptions of Israel/Palestine Project (IIP) is thrilled to be a new Pelagios partner. We are especially looking forward to becoming a part of the rapidly growing linked dataset of materials on the ancient world.

IIP is an internet-accessible database of published inscriptions from Israel/Palestine that date between ca. 500 BCE and 614 CE, roughly corresponding to the Persian, Greek, and Roman periods. The purpose of this database is to provide a tool that will make accessible the approximately 15,000 relevant inscriptions published to date, together with substantial contextual information, including geographical information. As of 2012, the database holds about 1500 inscriptions, with two levels of transcription (diplomatic and edited), translations and detailed metadata and notes. More inscriptions are added regularly. The inscriptions, in Greek, Latin and Hebrew, range from imperial declarations on monumental architecture to notices of donations in synagogues to humble names scratched on ossuaries, and include everything in between. The goals of this project are (1) to collect these inscriptions in one place; (2) allow for this data to be integrated with other contextual information that would open new avenues of scholarly investigation; and (3) to allow for easy access to it.

The project began in 1996 at the Institute for Advanced Digital Technology in the Humanities at the University of Virginia as a prototype under the name "Inscriptions of the Land of Israel". Although that project has been decommissioned, the Document Type Definition developed for it (in SGML), with modifications, continues to anchor the project. The project moved briefly to Indiana University in 1999, where the DTD was converted to XML and a second prototype (also now decommissioned) was produced. The project moved to Brown University in 2002, and soon went into production, under the auspices of the Scholarly Technology Group, now reorganized as the Center for Digital Scholarship.

The texts are extensively marked up by student encoders before they are added to the database. The IIP DTD was designed from the start to treat the inscribed object as primary and to focus on its various attributes, rather than foregrounding the inscribed text in the form of a print publication. IIP encodes inscription metadata in the header of the document (similar to the TEI header), using controlled vocabularies and information structures, recording information such as type of object; date range; locations (present, find, and original); type of inscription; language; etc. In the body of the document, individual <div> sections contain the diplomatic and edited versions of the texts, in their original languages, and an English translation. The source of each is always acknowledged.

The project is under the direction of Professor Michael Satlow, and employs several part-time student encoders, one of whom is designated as the senior encoder. Encoders follow an extensive set of guidelines when they prepare their data. Records are then checked by the senior encoder before being uploaded to the test server, where they are checked by the project director before being uploaded to the production server. Current support for this project comes entirely from Brown University. The Center for Digital Scholarship provides basic technical support, and other divisions within the university (most generously the Office of the Vice President for Research) contribute necessary operating funds.

Currently, IIP is in the process of cross-walking its original schema to EpiDoc, and plans to complete the conversion by the end of summer 2012. We have also been experimenting with timelines and map views of our data. Our partnership with Pelagios has come at an opportune moment, as we have been focussing on improving and using our geographic data. Place names in IIP are encoded using the vocabulary from TIR (Tabula Imperii Romani, Iudaea, Palestina). We began matching places to Pleiades IDs, but did not have a high success rate. As we have already derived latitude and longitude from the Israel plane coordinates in TIR, we hope to match more places, with the help of Pelagios. Once this has been achieved, we will be able to provide Pelagios with a set of Pleiades IDs, as well as the names and URIs of the inscriptions from those places.

When the conversion of the IIP source files into EpiDoc P5 is complete, and we have updated our delivery system to handle them, we can start to provide better linked data hooks. Examples include enriching the markup with RDFa attributes and providing our data in other formats, such as JSON. We would like to draw on the collective LAWDI and Pelagios experience before making those decisions, so as to make our data as useful as possible.

Inscriptions of Israel/Palestine is an ongoing project at Brown University. It has been generously supported by the Center for Digital Scholarship and the Office of the Vice President for Research at Brown University.

Thursday, 14 June 2012

Arachne VoID descriptions

In this blog post we describe how the VoID RDF description of the Arachne-Pleiades linkage works. As a result of the Pelagios compliance work, we are introducing some mechanisms to the data structure of Arachne itself that will mean changes in future iterations. Thus we have chosen the VoID description of our Pleiades linkage to reflect that.

The VoID descriptions

The VoID dataset describes the data that have been matched to Pleiades. We have chosen void:Linkset for the general definition of the interlinkage set between Arachne and Pleiades.

The general interlinkage set (ArachnePleiadesLinkage) divides into two groups: a place matching (Arachne2Pleiades_Places) and an object matching (Arachne2Pleiades). The matching is split for two reasons. Sometime in the near future, Arachne will start using the DAI-gazetteer, where place information will be shared among the different web-resources of the DAI - a Gazetteer, a Web-GIS, Arachne, Zenon, etc. At that point the place component will be "outsourced" from Arachne. The other data set contains all objects that are “inferred” from the place matchings. So it uses the internal linkage between Places and Arachne objects, etc.

These two sets have subsets that combine the results of a matching process at a specified time. We have tried to include this information in the first matching, but without the void data set description this has been a time consuming task, since every previous annotation had a creation time. This problem has been solved by attaching the information about the creation time to the data set. The time related information relating to the creation of a matching is now also reflected in the set hierarchy.

The split between places and other entities in Arachne has been a more complicated task because they were held in one triple space. We have tried to overcome this issue by putting the entities into different .n3 files in the downloadable zip archive. This can now be achieved by using the VoID descriptions.

In short, our approach tries to address four problems:
  1. The data will grow. Neither Arachne nor Pleiades is yet complete at the time of the matching process. Any data that is put into Arachne or Pleiades after the matching process will not show up in it. So, from a future perspective, the matching is going to be incomplete very soon and will have to be undertaken again.
  2. The data themselves will change. For example: if a place gets a more precise coordinate, the matching results will also differ in some way. Here, a versioning of datasets represented in the URI on both sides would be a solution for an “everlasting” matching. 
  3. The matching process will be enhanced, so, for example, the results can be more accurate. 
  4. Keeping old stuff available will be important. If you are using data for your project that is not up-to-date, you can still reference the information by a unique data set and a unique URI of a match. This is essential because places can match one time and will fail to match the other time (depending on Problem 1 or 2).

Prof Dr. Reinhard Förtsch and Rasmus Krempel, Arachne Database, CoDArchLab

Sunday, 10 June 2012

New Partners

We're happy to announce that in the past week Pelagios has gained five new partners :-) As always, there will be future blog posts describing what they do in their own words, as well as how greater connectivity with other ancient world resources benefits those activities. For now however, we're glad to welcome the following institutions and initiatives into the Pelagios Community:

The Graph of Ancient World Data has been greatly enriched by them!

Friday, 8 June 2012

Pelagios at the Linked Ancient World Data Institute (#lawdi)

Last week (31st May - 2 June) the Institute for the Study of the Ancient World, NYU, hosted a workshop on linked data in the ancient world. Pelagios were well represented, as Leif kicked off the invited presentations with a discussion of the difference between the semantic web and linked data, while I brought up the rear with a personal reflection on what the evolving digital world might mean for a Classical Studies researcher or student. (All presentations can be found here.)

Here I’d like to present 5 take-home points:

1. Of the different approaches to tying together resources on the web, linked open data seems the best bet, and not just because that's what Pelagios is doing! Linked open data uses a decentralised model in which participants agree on certain stable identifiers for things (such as places or names) and a way of mapping their data to them. So, for example, Pelagios uses Pleiades identifiers for ancient places and something known as RDF triples for expressing the relationship. We find that, by doing this, authority is diffused through the Pelagios ecosystem, meaning that there is no single point of failure (unless Pleiades fails, and, if that happens, we're all screwed anyway!) and that the extent to which projects annotate their data depends on the extent to which they want to hook into the network. Above all, as Sean Gillies, Pleiades’ head developer, has already emphasised in a previous post, it means doing what works.

2. Ok, but what difference does it make if your data is linked? Well, one great example provided at LAWDI by Andrew Meadows of Nomisma concerned coins. Within the world of linked data, it's now possible to discover, map and analyse not only find-spots (where the coins are found), but also where the same coins were minted and even the mines from which their metals derive. These data provide hitherto unparalleled access into the political and cultural deep structure that underpinned all kinds of interactions in the ancient world.

3. At one level, this kind of work represents a paradigm shift of sorts. The lone humanities scholar could hardly be expected to provide and analyse all these data by him or herself; linked data presupposes cooperation. But there is also a bigger point. If I think about my own experiences in Hestia, GAP and now Pelagios, it’s not only the case that each project has led to further, and more involved, collaboration; at each point new skills or tools have been needed, we have found the person to carry out that work and brought them in on the team. Linking data means, when all is said and done, linking with people. Which is fun!

4. While formal collaborations are not the usual humanistic way of doing things, linking data is what scholars have been doing all the time, as evident in footnoting. But scholarship is not only about referring to some other data of some kind; the best scholars chase up the connections. So, for example, the late, great, Oxford don, Don Fowler, writes (in the chapter “On the Shoulders of Giants” of his book Roman Constructions, Oxford, 2000, p.116):
“Classicists have always been concerned with ‘parallels’ – with what goes after the magic word ‘cf.’... What has not been clear with the traditional citation of parallel passages is what the point of the activity is, how the parallels affect the interpretation of the text.”
With this abbreviation “cf.”, which derives from the Latin conferre, Fowler plays upon its meaning to compare or “to bring together”. Imagine reading a footnote and being able to check the ancient source or modern scholar cited, or find out what other materials (images, documents) relate to the place or person under investigation, simply by clicking on a link. This might be blue- or pie-in-the-sky thinking for present publications, but it will soon be possible in ISAW papers, where individual contributions will be identified down to the paragraph level, meaning that any paragraph can be cited, or tweeted, at will. Reading is going to get a lot more interactive.

5. Finally, this idea of linked open data is a powerful metaphor not only for thinking about our own world (and especially the internet) but also for approaching the ancient world. At the beginning of his enquiry (‘historia’) into why the Greeks and Persians came into conflict, Herodotus describes how he ‘came upon towns of men both small and great alike, for of the places that were once great, most have now become small, while those that were great in my time were small before’ (1.5.3). Like an Odysseus wandering the seas and coming to know the minds of many men (Homer, Odyssey 1.3), Herodotus writes about a world in which a people forcibly relocated to Persia (claim to) ultimately derive from refugees from Troy (the Paeonians, 5.13), where places as far flung as Marseilles and Cyprus are brought together for comparing the meaning of a word (5.9), and where the river Ister (Danube) and Nile frame the Histories’ geography (2.26, 33-34; 4.50, 53). In a world that is linked together in a myriad of different ways, investigations require making myriad uses of connections. Herodotus would have approved.