sameAs

Interlinking the Web of Data

The Web of Data has many equivalent URIs.

This service helps you to find co-references between different data sets.


About

We are pleased to offer http://sameas.org/, a service that helps you find co-referent URIs.

It sort of does what it says – if you provide a URI, it will give you back URIs that may well be co-referent, should any be known to it.

Using <sameAs>

Apart from the obvious use of the service through the web-form, simple URIs can be provided to do lookups:

The following formats are supported –

application/rdf+xml, text/n3, application/json, text/plain

For example as URIs:

Or, use content negotiation. For example at the command line:

curl -iLH "Accept: application/rdf+xml" "http://sameas.org/?uri=http://dbpedia.org/resource/Tim_Berners-Lee"
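The same content-negotiated lookup can be sketched in Python. This is only an illustration, not an official client: the function name build_lookup_request is hypothetical, and it assumes the service accepts the looked-up URI percent-encoded in the uri query parameter, as is usual for HTTP query strings.

```python
from urllib.parse import urlencode
from urllib.request import Request

SAMEAS_ENDPOINT = "http://sameas.org/"  # service address given in the text


def build_lookup_request(uri, accept="application/rdf+xml"):
    """Build (but do not send) a GET request asking sameAs for co-referent URIs.

    Hypothetical helper for illustration only.
    """
    # Assumption: the looked-up URI may be percent-encoded in the `uri`
    # query parameter; encoding also keeps any '#' in it from being
    # treated as a fragment of the lookup URL.
    query = urlencode({"uri": uri})
    # The Accept header selects the format, as in the curl example.
    return Request(SAMEAS_ENDPOINT + "?" + query, headers={"Accept": accept})


request = build_lookup_request("http://dbpedia.org/resource/Tim_Berners-Lee")
print(request.full_url)
print(request.get_header("Accept"))
```

To actually perform the lookup, the request can be passed to urllib.request.urlopen, which follows redirects by default, much as the -L flag does for curl.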

Mischa Tuffield has commented that he can include the following in his FOAF file to get some nice extra stuff:

<http://mmt.me.uk/foaf.rdf#mischa> rdfs:seeAlso <http://sameas.org/?uri=http://mmt.me.uk/foaf.rdf#mischa> .

Where does this co-reference data come from?

Well, this is rather a long story, as we did not set out to provide this service. But if I tell you the story, you might be able to assess the utility of it in your context.

As part of the RKBExplorer work, we needed to be able to manage co-reference between triplestores (see related publications). We had an existing infrastructure for doing this, the Co-Reference Service (CRS), and we populated these CRSes with the co-reference data we were generating on RKBExplorer.com. As the RKBExplorer application became more sophisticated, we needed co-reference information linking our data to other sites such as DBpedia and http://data.semanticweb.org/. This enabled us to use information such as descriptions from Wikipedia/DBpedia, and the information on conferences and FOAF relationships.

However, long ago we discovered that getting things even slightly wrong can cause serious problems once the "network effect" that we are seeking comes into play. A seemingly trivial problem, such as a source telling us that two different people with the same name are the same person, can result in the network of relationships between the entities related to them being badly misrepresented. Such problems would not arise if the raw data were simply being presented.

So I set out to gather co-referent information from sources I thought were sufficiently accurate for my purposes.

I started with the data we already had, and indeed are still generating. I then went to the Linked Data cloud, and harvested from the RDF dumps and SPARQL endpoints that I deemed to be satisfactory. In addition I approached some people who were not publishing in a form I could easily harvest already, such as David Baxter of Opencyc, and asked them to provide the data to me directly.

I have avoided spidering the web for arbitrary data, and indeed would suggest that other Semantic Web search engines are a much better source for this than I can possibly provide.

The question of which predicates I might have used now arises. There is what I consider a deep irony here. For many years, we have been arguing (not always with great success) that the issue of co-reference is much more complicated than can be captured by a simple predicate such as owl:sameAs. On undertaking this task, I found that there are many predicates coming into existence that address this question. In assembling this site, I have used at least the following:

	<http://www.w3.org/2002/07/owl#sameAs>
	<http://www.rkbexplorer.com/ontologies/coref#coreferenceData>
	<http://umbel.org/umbel/sc/isLike>
	<http://www.w3.org/2004/02/skos/core#exactMatch>
	<http://www.w3.org/2004/02/skos/core#closeMatch>
	<http://open.vocab.org/terms/similarTo>
	<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>

I accepted the idea of co-reference for each of these on a per source basis. The <sameAs> service currently has a single concept of co-reference, and publishes the data it has in a single way, for example using the owl:sameAs predicate.
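To make the idea of a single concept of co-reference concrete, here is a minimal sketch of how statements using several of the predicates above could be folded into bundles and republished with owl:sameAs. It is not a description of how the <sameAs> service is actually implemented: the union-find grouping, the function names build_bundles and as_same_as, the choice of accepted predicates, and the example.org URIs are all assumptions made for illustration.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_EXACT = "http://www.w3.org/2004/02/skos/core#exactMatch"
SKOS_CLOSE = "http://www.w3.org/2004/02/skos/core#closeMatch"

# Assumption: in practice the accepted set would be decided per source.
COREF_PREDICATES = {OWL_SAME_AS, SKOS_EXACT, SKOS_CLOSE}


def build_bundles(triples, accepted=COREF_PREDICATES):
    """Group URIs into bundles, treating every accepted predicate
    as the same co-reference relation (union-find)."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for s, p, o in triples:
        if p in accepted:
            parent[find(s)] = find(o)  # merge the two bundles

    bundles = {}
    for u in list(parent):
        bundles.setdefault(find(u), set()).add(u)
    return list(bundles.values())


def as_same_as(bundle):
    """Publish a bundle as owl:sameAs statements from one member to the rest."""
    canon, *rest = sorted(bundle)
    return [(canon, OWL_SAME_AS, other) for other in rest]


triples = [
    ("http://example.org/a", OWL_SAME_AS, "http://example.org/b"),
    ("http://example.org/b", SKOS_EXACT, "http://example.org/c"),
    # Not a co-reference predicate, so it is ignored:
    ("http://example.org/x", "http://www.w3.org/2000/01/rdf-schema#seeAlso", "http://example.org/y"),
]
for bundle in build_bundles(triples):
    for statement in as_same_as(bundle):
        print(statement)
```

Because bundles are merged transitively, a single incorrect statement in the input is enough to join two otherwise unrelated bundles, which is why the choice of sources matters so much.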

I have to say it does raise the question of why there should be so many vocabularies that mint new URIs for these concepts.

So what sources? Here is a non-exhaustive list of places the data may have come from:

	http://go.bio2rdf.org/
	http://purl.org/hcls/
	http://moustaki.org/
	http://rdf.dmoz.org/
	http://doapstore.org/
	http://dbpedia.org/
	http://rdf.geospecies.org/
	http://www.yr-bcn.es/pmika/
	http://umbel.org/
	http://downloads.dbpedia.org/
	http://www.opencyc.org/
	http://hcls.deri.org/
	http://lingvoj.org/
	http://www.cs.vu.nl/STITCH/rameau/
	http://rkbexplorer.com/
	http://airports.dataincubator.org/
	http://telegraphis.net/
	http://ontologi.es/rail/stations
	http://data.linkedct.org/
	http://discogs.dataincubator.org/
	http://www.bbc.co.uk/music/
	http://linkedgeodata.org/
	http://data.nytimes.com/
	http://bnb.data.bl.uk
	http://d-nb.info
	http://data.bibsys.no
	http://nektar.oszk.hu
	http://dbpedia.org
	http://id.loc.gov
	http://id.ndl.go.jp
	http://stitch.cs.vu.nl
	

Finally, please be aware that the data is changing all the time. As people browse using RKBExplorer, the system examines the results and establishes co-reference as appropriate; thus the results provided by the RKBExplorer are intended to improve as time goes by, and also the <sameAs.org> reflection of that will change.

I hope that helps - I confess that in the early days I was simply getting data I needed, rather than preparing to document it.

Helping us

There is currently no public service to enable arbitrary contribution to the contents of <sameAs>. If you have significant data you would be prepared to give us, then please contact us at the email below. On the other hand, if you have time to help us provide such services, then please feel free to offer your help.

License and Re-use

We believe that Linked Data needs to develop clear, focussed, services that only do one or two things, so that they can be composed and utilised by the more complex services, as well as facilitating re-use. We hope that <sameAs> fits into that category, and that Linked Data application builders will find it an appropriate and useful service for the important task of discovering co-referent URIs.

In addition, by providing formats oriented towards non-Linked Data applications, we hope that the use of Linked Data can be spread even wider. If you really want, there are a number of sameAs logos available.

There are currently roughly 200M URIs, with an average of about 3 URIs per bundle.

CC0
To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.

The information is provided as-is and without any warranty.

Acknowledgements

We acknowledge the partial financial support of:

We thank all the people who have provided this information to us, either specifically or by publishing on the web.

Further info

There are a number of publications about this work.

If you have any queries, suggestions or comments please get in touch with Ian Millard and/or Hugh Glaser.