We need shared Persistent Reproducible Identifiers

28 September 2016

We live in an endlessly connected world of data. Linked data spans datasets, institutions, domains, and continents. It allows us to determine relations in the vast amount of digital information and to connect textual descriptive information, such as provenance information, to media files. But what happens when a media file is shared outside this network of data? How do we get back to the information if all links are lost? In September 2016 the project videorooter.eu (a collaboration between Kennisland and Commons Machinery, funded by the SIDN Fonds) held a meeting in Amsterdam to discuss this issue. Its outcome is a call for shared Persistent Reproducible Identifiers.

Attribution Chain

When you publish a photo on Flickr, the image becomes part of your personal account. This means that you can embed the image on a website, and link to the photo, without losing the information that you are the creator of that image. Using embedded metadata one can even download the photo and still find the link back to you as the creator. But once you convert the file, change the format (from .tiff to .png, for example), strip embedded metadata, or resize the image, this information gets lost. This happens when you publish a photo on Facebook, use TwitPic, etc. Once the link to the original platform is lost, there is little you can do to find information about the owner. There is no way to trace back how to provide proper attribution: the attribution chain is broken.

We need mechanisms to reattach the attribution chain when it is broken. For this, we need shared Persistent Reproducible Identifiers: identifiers that can be recreated from any file that is perceptually identical to the original. Such an identifier survives conversions, resizing of the image and stripping of embedded metadata. Furthermore, we need a way to use identifiers in a uniform manner on different services. Ideally, there is a way to share catalogs of identifiers without sharing content, which maximises usability across platforms and business models. One example would be a publicly accessible method to communicate information and extract knowledge and insight from various services and datasets, based on reproducible identifiers of their content.

Why Persistent Identifiers are not enough

Search engines, registries, archives, museums and other information-intensive industries have loads of information and ways to structure this information internally. To tie datasets to each other and to normalise data, information systems use Persistent Identifiers. Systems like Persistent Uniform Resource Locators (PURLs), Archival Resource Keys (ARKs), Digital Object Identifiers (DOIs), and the Handle System are ways to keep links to metadata or objects alive so that they can be used in new contexts.

Similarly, projects like Wikidata publicly create linked data for Wiki projects by using persistent identifiers (for example, the Wikidata persistent identifier for the concept of persistent identifiers is www.wikidata.org/wiki/Q420330). Others, like Europeana, create identifiers to identify objects in the heritage sector. And Handle.net allows you to register works with identifiers that dereference to web pages at institutions. In all of these cases the identifiers are placed in the metadata of the media object. They cannot be reproduced once the media file ends up in a place where it has no access to that metadata. Persistent Identifiers connect items within different datasets, but once you take connected media files outside the context of this linked data, or when a media file is resized or transcoded, there is no way back in. If there is no link back to a linked data network, it can become nearly impossible to get back to the other descriptive information about the media file, because the identifiers it uses are not reproducible.

Elog.io, whereonthe.net, and Videorooter are projects that try to solve these problems. They are registries that try to fingerprint and provide metadata for arbitrary media files that have been stripped of their metadata, either to determine under what kind of license they are available or to determine where they are used on the Internet.

Perceptual hashes are reproducible identifiers

Persistent Identifiers are only useful up to the point when metadata is stripped from a media object. Solutions like watermarks or other embedded identifiers do not survive this kind of processing either. Reproducible Identifiers, by contrast, make it possible to recognise media outside its original publishing context and independently of the specific file, by using perceptual fingerprints or other techniques.

Perceptual hashes are created by algorithms that try to capture how humans perceive a media file rather than its actual bits and bytes. At least two major perceptual hashing libraries exist: phash.org and blockhash.io.

As these perceptual hashing algorithms encode how an image looks, they can recognise a file after it has been stripped of metadata or even resized and re-encoded into a different format. This makes perceptual hashing algorithms a valuable candidate for reproducible identifiers.
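To make the idea concrete, here is a minimal sketch of a very simple perceptual hash, often called an average hash. It is not the algorithm used by phash.org or blockhash.io; the function names, the 8x8 grid and the synthetic test image are all assumptions chosen for illustration. It shows that a resized copy of an image keeps the same perceptual hash, while a cryptographic hash of the raw pixel data changes completely.

```python
import hashlib


def average_hash(pixels, size=8):
    """Return a size*size-bit integer fingerprint of a grayscale image.

    `pixels` is a list of rows of brightness values (0-255). To keep this
    sketch short, the image dimensions must be multiples of `size`.
    """
    height, width = len(pixels), len(pixels[0])
    block_h, block_w = height // size, width // size
    # Shrink the image to a size x size grid by averaging each block.
    cells = []
    for row in range(size):
        for col in range(size):
            total = 0
            for y in range(row * block_h, (row + 1) * block_h):
                for x in range(col * block_w, (col + 1) * block_w):
                    total += pixels[y][x]
            cells.append(total / (block_h * block_w))
    mean = sum(cells) / len(cells)
    # One bit per cell: is this cell brighter than the image as a whole?
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def upscale(pixels, factor):
    """Nearest-neighbour enlargement, standing in for a resize."""
    return [[value for value in row for _ in range(factor)]
            for row in pixels for _ in range(factor)]


def file_digest(pixels):
    """Cryptographic hash of the raw pixel bytes, for comparison."""
    raw = bytes(value for row in pixels for value in row)
    return hashlib.sha256(raw).hexdigest()


# Synthetic 16x16 test image: a diagonal brightness gradient.
original = [[(x + y) * 8 for x in range(16)] for y in range(16)]
resized = upscale(original, 2)

print(average_hash(original) == average_hash(resized))  # perceptual hash
print(file_digest(original) == file_digest(resized))    # byte-level hash
```

Real images that have been re-encoded will usually not produce bit-for-bit identical hashes, so in practice fingerprints are compared by how many bits differ rather than by strict equality.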

These fingerprints are, however, prone to collisions, meaning two very similar but still distinct files can produce the same hash. To counter these collisions, organisations often combine two algorithms and match both fingerprints to reduce false positives. The individual fingerprints do not need to be 100% accurate; their combined strength gives highly accurate results.
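Matching on two fingerprints at once could look roughly like the sketch below. It assumes that each file has already been hashed by two different algorithms into a pair of integers, and that hashes are compared by Hamming distance (the number of differing bits) against a threshold. The function names, the example hash values and the threshold of 4 bits are illustrative assumptions, not values taken from any particular registry.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(a ^ b).count("1")


def is_match(candidate, registered, max_distance=(4, 4)):
    """Accept a match only if BOTH fingerprints are close enough.

    `candidate` and `registered` are (hash_a, hash_b) pairs produced by
    two different algorithms; `max_distance` holds one illustrative
    threshold per algorithm.
    """
    return all(
        hamming_distance(c, r) <= limit
        for c, r, limit in zip(candidate, registered, max_distance)
    )


# Hypothetical 16-bit fingerprints, kept short for readability.
registered = (0b1111000011110000, 0b1010101010101010)
re_encoded = (0b1111000011110001, 0b1010101010101010)  # one bit off on A
lookalike = (0b1111000011110000, 0b0101010101010101)   # collides on A only

print(is_match(re_encoded, registered))  # True: both hashes are close
print(is_match(lookalike, registered))   # False: hash B disagrees
```

The lookalike collides on the first algorithm but is rejected because the second fingerprint disagrees, which is the false positive the combination is meant to filter out.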

Other combinations of algorithms, or even other algorithms, are not only theoretically possible but a practical reality. There are organisations that use hashes or fingerprints as reproducible identifiers, for example provenance and copyright registries such as ascribe.io and YouTube’s ContentID, which address the need for ownership on the Internet. These organisations can find metadata on media outside of their original publication context. They allow you to register a work and get an identifier back. The way these systems are structured now, the identifier typically only works within their own system, and ContentID does not even share any of its algorithms or reproducible identifiers.

Having reproducible identifiers is therefore not enough: we also need ways to communicate and share identifiers so that they become generally useful. We need Persistent Reproducible Identifiers with public documentation so that they can be implemented freely by anyone. In order to make use of the Persistent Reproducible Identifiers, we need open APIs which allow us to query catalogs for the existence of works. Only then can we successfully reattach the attribution chain by linking the Persistent Reproducible Identifier with identifiers of the original work across different services, establishing relations in the vast amount of digital information in the world.
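As a rough illustration of what querying a catalog for the existence of a work could mean, the sketch below models a catalog that stores only identifiers and descriptive records, never the media itself. Everything in it is hypothetical: the `blockhash:` prefix, the record fields, the example.org URL and the JSON response shape are assumptions made for this example, not an existing API or standard.

```python
import json

# Hypothetical catalog: reproducible identifiers map to descriptive
# records only. No media content is stored or shared.
CATALOG = {
    "blockhash:00ff00ff": [
        {
            "source": "https://example.org/photo/1",
            "creator": "A. Example",
            "license": "CC BY 4.0",
        },
    ],
}


def query(identifier):
    """Return a JSON response stating whether a work is known."""
    records = CATALOG.get(identifier, [])
    return json.dumps(
        {"identifier": identifier, "found": bool(records), "records": records}
    )


# Anyone who can recompute the identifier from a copy of the file can
# ask the catalog about it, even if the copy has lost its metadata.
print(query("blockhash:00ff00ff"))
print(query("blockhash:deadbeef"))
```

Because the lookup key is recomputed from the file itself, different services could answer the same query without exchanging content or agreeing on anything beyond the identifier format.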

Kennisland and Commons Machinery brought together several like-minded organisations around the idea of creating Persistent Reproducible Identifiers. We are currently writing two white papers about the creation of truly open standards for creating and communicating this kind of Persistent Reproducible Identifier, to be presented at the IIIF Working Group meeting in The Hague.

Join us? Contact Maarten Zeinstra at mz@kl.nl.