
Benchmarks

All taxonomies tested and timing data are available in the archive here.

About the ontologies

Because reasoners handle owl:imports statements inconsistently, we run benchmarks only over ontologies contained within a single file. Since much of the available test data makes use of imports, we have “localized” ontologies by parsing them, resolving and parsing all imports, merging the main and imported ontologies together, and reserializing the result, all using the OWL API (version 2.2.1, from the 104 Protege release). The version of the OWL API used generates files with invalid namespaces in a number of cases, and in at least some cases the resulting files still contain owl:imports statements in addition to all the axioms from the imported ontology; these errors were corrected by hand.
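The localization step amounts to collecting the imports closure of an ontology and merging all axioms into one import-free document. The benchmark did this with the OWL API; the Python sketch below only illustrates the idea on a toy in-memory model. The `Ontology` class, the `localize` function, and the `catalog` mapping are hypothetical names introduced for illustration and are not part of the OWL API.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy stand-in for a parsed ontology (hypothetical, not an OWL API type)."""
    iri: str
    axioms: set = field(default_factory=set)
    imports: list = field(default_factory=list)  # IRIs of imported ontologies


def localize(root_iri, catalog):
    """Merge an ontology with its transitive imports into one ontology.

    `catalog` maps IRIs to already-parsed Ontology objects (an assumption;
    a real tool would fetch and parse each imported document).
    """
    merged = set()
    seen = set()
    stack = [root_iri]
    while stack:
        iri = stack.pop()
        if iri in seen:  # guards against cyclic or repeated imports
            continue
        seen.add(iri)
        onto = catalog[iri]
        merged |= onto.axioms
        stack.extend(onto.imports)
    # The result carries every axiom and no remaining import statements.
    return Ontology(iri=root_iri, axioms=merged, imports=[])
```

Tracking visited IRIs keeps the traversal finite even when ontologies import each other, and returning an empty imports list mirrors the manual cleanup of leftover owl:imports statements.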

Gardiner

The “Gardiner Corpus” is derived from the set of ontologies obtained from http://www.cs.man.ac.uk/~horrocks/testing/ and described in the paper “Framework for an Automated Comparison of Description Logic Reasoners” presented at ISWC 2006.

@INPROCEEDINGS{GaTH06b,
  AUTHOR = {Tom Gardiner and Dmitry Tsarkov and Ian Horrocks},
  BOOKTITLE = {Proc.\ of the 5th International Semantic Web Conference (ISWC 2006)},
  DATE-ADDED = {2006-09-12 20:39:38 +0100},
  DATE-MODIFIED = {2007-03-07 20:44:26 +0000},
  PAGES = {654--667},
  PUBLISHER = {Springer},
  SERIES = {Lecture Notes in Computer Science},
  TITLE = {Framework For an Automated Comparison of Description Logic Reasoners},
  VOLUME = {4273},
  YEAR = 2006
}

This collection was downloaded from the above link on 2009-02-13.

The files obtained from http://www.cs.man.ac.uk/~horrocks/testing/ contain owl:imports statements to non-local files; these were localized as described above. A number of files have names which are not valid URIs (they contain # characters); these files were renamed and placed in the gardiner-new directory.

OBO

The OBO corpus consists of the OWL ontologies published as part of the Open Biological Ontologies Foundry; they were downloaded from ftp://ftp.fruitfly.org/pub/obo/obo-all-owl.tar.gz on 2009-02-12. Imports are handled specially in the OWL versions: each “main” ontology contains only axioms, and comes with two associated wrappers, one containing import statements using relative path names to local files, and one containing full URIs to ontologies on the web. We produced single-file versions of all ontologies using the wrapper naming local imports. In cases where the original OBO ontology does not import any other files, the local imports file is empty (and does not even import the main file): in these cases we simply used the main ontology file.

GALEN

We also included a number of versions of the GALEN ontology, a well-known biomedical ontology which has been the subject of reasoner optimization for many years. We test on a number of versions with very different performance characteristics: the full version from the web (which none of the tested reasoners could classify), the “undoctored” version originally studied by Ian Horrocks during development of the FaCT system, a “doctored” version which was modified such that it was classifiable by FaCT, and “not-GALEN”, a relatively easy-to-process recent variant.

Results

We tested four reasoners:

Tests were executed on a 2.2 GHz MacBook Pro. The reasoner process was allocated 1 gigabyte of RAM and given 30 minutes to classify each ontology.
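Limits of this kind could be enforced by launching each classification in its own JVM. The following is a minimal sketch under stated assumptions, not the harness actually used: the `benchmark.jar` entry point and its argument order are hypothetical, while `-Xmx` is the standard JVM option for the maximum heap size.

```python
import subprocess

HEAP_LIMIT = "1g"        # 1 gigabyte of RAM for the reasoner process
TIME_LIMIT_S = 30 * 60   # 30 minutes per ontology


def build_command(reasoner, ontology_path, jar="benchmark.jar"):
    """Build the JVM command line; jar name and argument order are hypothetical."""
    return ["java", f"-Xmx{HEAP_LIMIT}", "-jar", jar, reasoner, ontology_path]


def run_one(reasoner, ontology_path, run=subprocess.run):
    """Run one classification and return 'ok', 'timeout', or 'error'."""
    try:
        result = run(build_command(reasoner, ontology_path), timeout=TIME_LIMIT_S)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "timeout"
    return "ok" if result.returncode == 0 else "error"
```

Running each reasoner in a separate process means a crash or an out-of-memory condition in one run cannot affect the next one.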

For each reasoner and ontology, we parsed the ontology using the OWL API, loaded it into the reasoner, classified it, and wrote the classified taxonomy to disk. Only the times required for loading and classification were measured, and the two were added to produce the published summary results. Raw timings (in milliseconds) are provided in the included *.times files. A failure to classify an ontology is recorded in one of three ways: timeout (the process was killed after thirty minutes and exited relatively gracefully), Exception (the reasoner raised an error, such as an out-of-memory exception), or a blank field for classification time (the process crashed, which occasionally happens because Java was unable to recover from a low-memory situation, because the process could not respond gracefully to a timeout, or because of some unknown error).
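The reporting rule for a single reasoner/ontology pair can be sketched as follows. The layout of the *.times files is not specified here, so this sketch assumes each record has already been split into a loading field and a classification field, both strings; the function name and the "ok" and "crash" labels are hypothetical.

```python
def summarize(load_field, classify_field):
    """Return (outcome, total_ms) for one reasoner/ontology record.

    Assumes the classification field holds a millisecond count, the word
    'timeout', the word 'Exception', or nothing at all (process crashed).
    """
    value = classify_field.strip()
    if value == "":
        return ("crash", None)
    if value.lower() == "timeout":
        return ("timeout", None)
    if value.lower() == "exception":
        return ("exception", None)
    # Published summary time = loading time + classification time.
    return ("ok", int(load_field) + int(value))
```

For example, a record with a loading time of 120 ms and a classification time of 880 ms yields a summary time of 1000 ms, while any of the three failure markers yields no time at all.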

On correctness

The taxonomies generated by the different reasoners occasionally differ, but almost all such differences are simply due to differing API conventions: FaCT++ sometimes makes classes direct parents of owl:Nothing despite the existence of named subclasses, and HermiT inserts the names of unsupported datatypes into the hierarchy.

There were a few cases in which it appears that the version of Pellet tested misses some inferences; these cases were investigated manually, and it does appear that Pellet’s results are incorrect while FaCT++ and HermiT produce the correct results.