Wednesday 13 July 2011

Creating the Pleiades to Arachne annotation

Below I describe how the Arachne Places and Topographical Units are used to annotate the Pleiades Places. The output format is OAC RDF, as agreed upon. The annotation process is fully automated and performed by a PHP script.

The Data Sources:

Pleiades+:

The Pleiades+ CSV file, converted from the initial Pleiades Excel file and sorted by Pleiades ID. Sorting by Pleiades ID makes collecting alternative names easier, because all alternative names of a place appear one after another.

http://googleancientplaces.wordpress.com/pleiades/

Link to the sorted version used by the script

Arachne Database:

The Arachne database is a MySQL database with a structure that has evolved over time. It uses the utf8_general_ci collation. The Arachne places have place names that are more or less ordered.

http://arachne.uni-koeln.de

All places contain information about the larger entities they belong to: for every place, the country in which it falls is also recorded. Pleiades Places, on the other hand, are described by identifiers, so a country is described in the same way as a city or a village.

In Arachne, places are defined by a list of name attributes: the specific place (Aufbewahrungsort), the ancient place name (Ort_antik), the city name (Stadt), the country name (Land), and their synonyms (Aufbewahrungsort_Synonym, Stadt_Synonym). The specific place describes a location in the most specific way. It is often a free-text description: an address, a set of directions ("after 300 meters on road XY, on the left side"), a museum, or many other forms of place description. This field is unlikely to match one of the Pleiades labels, so the annotations will often describe the more general place of which an Arachne place forms a part.

For example, the place “Corso d'Italia 35 b, Rome, Italy, Europe” will refer to “Italy” and “Rome” in Pleiades; the address “Corso d'Italia 35 b” itself is much too specific for Pleiades.


The script is focused on precision rather than recall: it tries to produce exact results rather than many results, even if that means missing some "real" matches.


The five steps to annotation

1. Collecting synonyms of Pleiades places

Example:

Pleiades ID : 1094

Names:

Lopadusa Ins.

lopadusa

lampidusa

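To make step 1 concrete, here is a minimal PHP sketch of how the sorted CSV could be grouped into per-place synonym lists. It assumes the Pleiades ID is in the first column and a name in the second; the actual column layout of the Pleiades+ file may differ.

<?php
// Sketch: group the rows of the sorted Pleiades+ CSV by Pleiades ID.
// Assumes column 0 = Pleiades ID and column 1 = place name (an assumption).
function collectSynonyms($csvPath)
{
    $synonyms = array();
    $handle = fopen($csvPath, 'r');
    while (($row = fgetcsv($handle)) !== false) {
        $pleiadesId = $row[0];
        $name = trim($row[1]);
        if ($name !== '') {
            // Because the file is sorted by Pleiades ID, all names of one
            // place appear one after another and land in the same bucket.
            $synonyms[$pleiadesId][] = $name;
        }
    }
    fclose($handle);
    return array_map('array_unique', $synonyms);
}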

2. Query the database for all synonyms

The synonyms collected from the Pleiades+ dataset are rewritten as SQL queries.

The collection of labels:

Lopadusa Ins.

lopadusa

lampidusa

The resulting SQL-queries:

SELECT `PS_OrtID` From `ort` WHERE `Aufbewahrungsort` LIKE 'Lopadusa Ins.' OR `Aufbewahrungsort_Synonym` LIKE 'Lopadusa Ins.' OR `Ort_antik` LIKE 'Lopadusa Ins.' OR `Stadt` LIKE 'Lopadusa Ins.' OR `Stadt_Synonym` LIKE 'Lopadusa Ins.' OR `Land` LIKE 'Lopadusa Ins.';

SELECT `PS_OrtID` From `ort` WHERE `Aufbewahrungsort` LIKE 'lopadusa' OR `Aufbewahrungsort_Synonym` LIKE 'lopadusa' OR `Ort_antik` LIKE 'lopadusa' OR `Stadt` LIKE 'lopadusa' OR `Stadt_Synonym` LIKE 'lopadusa' OR `Land` LIKE 'lopadusa';

SELECT `PS_OrtID` From `ort` WHERE `Aufbewahrungsort` LIKE 'lampidusa' OR `Aufbewahrungsort_Synonym` LIKE 'lampidusa' OR `Ort_antik` LIKE 'lampidusa' OR `Stadt` LIKE 'lampidusa' OR `Stadt_Synonym` LIKE 'lampidusa' OR `Land` LIKE 'lampidusa';

Each SQL statement checks all fields of the Arachne database in which place names are recorded. If one of the names in Pleiades+ matches one of the name fields in Arachne, the key of the record is returned and added to the list of matching IDs for all synonyms.

The synonymous names of a Pleiades dataset are matched with the SQL "LIKE" operator. The "LIKE" comparison is case insensitive and, under the utf8_general_ci collation, also ignores accents, so 'a' matches 'ä', for example. For further information on this issue see:

http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html
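As an illustration of how step 2 could be implemented, here is a sketch using PDO; the actual script may use a different database layer, and the connection parameters are placeholders.

<?php
// Sketch: run the synonym queries against the Arachne `ort` table.
// Connection details are placeholders; the field list matches the queries above.
$fields = array('Aufbewahrungsort', 'Aufbewahrungsort_Synonym', 'Ort_antik',
                'Stadt', 'Stadt_Synonym', 'Land');

$pdo = new PDO('mysql:host=localhost;dbname=arachne;charset=utf8', 'user', 'password');

$conditions = implode(' OR ', array_map(function ($f) { return "`$f` LIKE ?"; }, $fields));
$stmt = $pdo->prepare("SELECT `PS_OrtID` FROM `ort` WHERE $conditions");

$matchingIds = array();
$synonyms = array('Lopadusa Ins.', 'lopadusa', 'lampidusa');
foreach ($synonyms as $name) {
    // LIKE without wildcards behaves like an equality test that is case
    // insensitive and accent insensitive under utf8_general_ci.
    $stmt->execute(array_fill(0, count($fields), $name));
    $matchingIds = array_merge($matchingIds, $stmt->fetchAll(PDO::FETCH_COLUMN));
}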


3. Collect all results and remove duplicates

Result examples: 1001,2001,1001,3001,1001,2001,3001

After removing the duplicates: 1001,2001,3001
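In PHP this step amounts to a single call; a minimal sketch:

<?php
// Sketch: remove duplicate Arachne IDs collected from all synonym queries.
$matchingIds = array(1001, 2001, 1001, 3001, 1001, 2001, 3001);
$uniqueIds = array_values(array_unique($matchingIds)); // array(1001, 2001, 3001)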


4. Convert Internal IDs

The internal Arachne IDs are converted to general Arachne entity IDs.
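How this conversion is done depends on the Arachne schema. As an illustration only, here is a sketch assuming a hypothetical mapping table (arachneentityidentification with columns TableName, ForeignKey and ArachneEntityID; these names are assumptions, not the actual schema).

<?php
// Sketch: map internal ort IDs to general Arachne entity IDs.
// The table and column names below are assumed for illustration only.
$stmt = $pdo->prepare(
    'SELECT `ArachneEntityID` FROM `arachneentityidentification`
     WHERE `TableName` = ? AND `ForeignKey` = ?');

$entityIds = array();
foreach ($uniqueIds as $ortId) {
    $stmt->execute(array('ort', $ortId));
    $entityId = $stmt->fetchColumn();
    if ($entityId !== false) {
        $entityIds[] = $entityId;
    }
}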


5. Convert results to OAC-RDF

The matching datasets in Pleiades and Arachne are then annotated in OAC.

The annotation URI:

A typical annotation URI produced in this workflow looks like this:

"http://arachne.uni-koeln.de/oacAnnotaion/1310384539/Pleiades991319toArachneEntity8003997"

"1310384539" represents a unix-timestamp of the date of creation.

"Pleiades991319toArachneEntity8003997" describes the annotation in a more or less human readable form.

The timestamp and the combination of IDs ensure the uniqueness of the annotation URI. The fact that this URI scheme is human readable is NOT a requirement of a URI and should not be expected in general!

The main purpose of a URI is to be a unique reference to a piece of data. In this case, even if the same annotation is made a second time, it must not have the same URI: the annotated things are the same, but the annotation itself is different. An annotation represents knowledge at a specific time, and the Arachne database is still growing and being corrected, so the results will change over time even if the script that produces them remains the same.
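A sketch of how such a URI could be assembled from the current Unix timestamp and the two IDs (the base URL and path segment follow the examples in this post):

<?php
// Sketch: build the annotation URI from a Unix timestamp and the two IDs.
function annotationUri($pleiadesId, $arachneEntityId, $timestamp)
{
    return 'http://arachne.uni-koeln.de/oacAnnotaion/' . $timestamp
         . '/Pleiades' . $pleiadesId . 'toArachneEntity' . $arachneEntityId;
}

// Example: annotationUri(991319, 8003997, time())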

The Annotation in RDF:

The date is also represented in the "dc:created" triple of the annotation.

<http://arachne.uni-koeln.de/oacAnnotaion/1308218223/Pleiades991319toArachneEntity8003997> dc:created "2011-06-16 11:57:03".

The name of the script is represented in the "dc:creator" triple.

<http://arachne.uni-koeln.de/oacAnnotaion/1310384539/Pleiades991319toArachneEntity8003997> dc:creator "ArachneEntityToPleiadesPlacesScriptv0.2" .

This is just a string, but when conflicting or wrong interpretations turn up, it makes it possible to blame the bad version of the script that produced the results.

It also has the advantage of making old annotations identifiable.

A complete set of triples of an Arachne Pleiades annotation:

<http://arachne.uni-koeln.de/oacAnnotaion/1310384539/Pleiades991319toArachneEntity8003997> rdf:type oac:Annotation .

<http://arachne.uni-koeln.de/oacAnnotaion/1310384539/Pleiades991319toArachneEntity8003997> dc:created "2011-07-11 13:42:19".

<http://arachne.uni-koeln.de/oacAnnotaion/1310384539/Pleiades991319toArachneEntity8003997> dc:creator "ArachneEntityToPleiadesPlacesScriptv0.2" .

<http://arachne.uni-koeln.de/oacAnnotaion/1310384539/Pleiades991319toArachneEntity8003997> oac:hasBody <http://pleiades.stoa.org/places/991319>.

<http://arachne.uni-koeln.de/oacAnnotaion/1310384539/Pleiades991319toArachneEntity8003997> oac:hasTarget <http://arachne.uni-koeln/entity/8003997>.

<http://pleiades.stoa.org/places/991319> rdf:type rdfs:Resource .

<http://pleiades.stoa.org/places/991319> rdf:type oac:Body.
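To show how step 5 could be implemented, here is a sketch that emits this set of triples for one match, reusing the annotationUri() helper sketched above; the prefixes and URI patterns follow the examples in this post.

<?php
// Sketch: write the OAC triples for one Pleiades-to-Arachne match.
function annotationTriples($pleiadesId, $arachneEntityId, $timestamp, $creator)
{
    $annotation = '<' . annotationUri($pleiadesId, $arachneEntityId, $timestamp) . '>';
    $body       = '<http://pleiades.stoa.org/places/' . $pleiadesId . '>';
    $target     = '<http://arachne.uni-koeln/entity/' . $arachneEntityId . '>';
    $created    = date('Y-m-d H:i:s', $timestamp);

    return $annotation . " rdf:type oac:Annotation .\n"
         . $annotation . ' dc:created "' . $created . "\" .\n"
         . $annotation . ' dc:creator "' . $creator . "\" .\n"
         . $annotation . ' oac:hasBody ' . $body . " .\n"
         . $annotation . ' oac:hasTarget ' . $target . " .\n"
         . $body . " rdf:type rdfs:Resource .\n"
         . $body . " rdf:type oac:Body .\n";
}

// Example: annotationTriples(991319, 8003997, time(), 'ArachneEntityToPleiadesPlacesScriptv0.2')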

The annotation of topographical units works in the same way, but it uses a different table and different fields.

Conclusion:

The fact that a Pleiades place is usually more general than the Arachne place could not be expressed in the annotation. This is an unsatisfying trade-off that comes with using a framework as general as OAC: the information expressed in the annotation is very unspecific.

Maybe the next version will offer a way to express this fact as an optional piece of information that can be used or ignored. If you just want a link, you simply ignore the other information. If you are very hungry, you can eat millions of cheap spaghetti just as well as millions of good spaghetti; but the good ones you can also enjoy, if you want to.

A web of linked data could not be typed in by hand. That would be extremely time consuming and would produce different errors than letting the computer handle it: anyone who tries to keep two long numbers in mind will mess something up every once in a while. What I am missing is a simple, controlled way to record machine authorship. This could help to back-check results and to handle messy data. Who is to blame if the data is messy: the data source or the automated process in the background?

Rasmus Krempel, Arachne Database, CoDArchLab University of Cologne

sorted version of pleiades+

PHP script to create annotation

Thursday 7 July 2011

Open Licenses

One of the most important things for any online resource to think about is the license under which it will make its data available. Naturally we're very much into the Open variety here at Pelagios (in fact our work would essentially be meaningless otherwise), but that word can mask a good degree of philosophical (and occasionally bureaucratic) diversity. Creative Commons have done a remarkable job of trying to simplify this process, but it remains a fact that when bringing different datasets together we may be faced with the need to deal with multiple licenses. Pelagios in fact uses four: CC-ZERO, CC-BY, CC-BY-SA, CC-NC-ND. There are two reasons for this, a pragmatic one and a philosophical one.

Pragmatically, we essentially have no choice when working with a sizeable (and growing) consortium. Many of the projects we are working with have been established for years (occasionally decades), and although all subscribe to a policy of open content, they have different needs and histories of licensing. There is simply no way for us to require that they change to a new licensing regime. Philosophically, however, we see this as no bad thing. Indeed, the purpose of Pelagios is not to serve up a new aggregation of content under a new license, and we doubt whether such an imperative could ever be scalable in principle. Rather, we wish to explore complementary ways that resource providers can make their content available that facilitate integration with third parties, but under their own license terms. Resource combinations should always be created with specific goals in mind, not as ad hoc aggregations, and should take into account the licensing requirements as well as the content.

We are aware that in some cases the desired data will not be available under the terms and conditions one might wish, but that is simply the nature of cultural content. Trying to force the issue will only make potential future partners more hesitant to collaborate. We are keen to work with any provider of ancient data, as long as they are truly interested in sharing what they have with the wider world.

Information about all the licenses we use is available at:
http://pelagios-project.blogspot.com/2011/03/pelagios-project-plan-part-4-ipr.html