Why we need open standards for fingerprinting
NOTE: An earlier similar version of this blog post has been published on videorooter.eu.
The Internet is typically a platform for sharing media. However, since most files are shared online with no provenance information, there is no easy way to catalogue these files or trace the rightful owner of (openly licensed) files. Fingerprinting technology makes this possible. Based on this, in Videorooter we develop a whitelist (a list or ‘registry’ of videos shared online under Creative Commons licenses, together with their metadata) that helps to trace back provenance information. Want to know more about our project? Go to videorooter.eu. A fingerprint (a string of characters) is the result of an algorithm that captures identifying aspects of a media file. When a different algorithm is used, a different fingerprint comes out. So if you want to develop a comprehensive catalogue, one and the same algorithm has to be used. Therefore, we need a community that develops clear open standards for fingerprinting technology together.
Fingerprinting is a complex mathematical construct to capture the perceptually identifying aspects of a media file. The resulting series of characters (called a hash) is the outcome of an algorithm applied to a media file. When one particular algorithm is used to identify a video, every video that is perceptually (to the human eye) the same yields the same fingerprint (a perceptual hash). However, when a different algorithm is used, a different series of characters comes out.
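To make the idea concrete, here is a minimal Python sketch. It is not Videorooter's method: the two toy algorithms (an "average hash" and a "difference hash"), the function names and the 4x4 grayscale frame are all assumptions chosen for illustration. It shows that one algorithm gives the same fingerprint for a frame and a brightened copy of it, while a different algorithm gives a different fingerprint for the very same frame.

```python
def bits_to_hex(bits):
    """Pack a string of '0'/'1' characters into hexadecimal characters."""
    width = (len(bits) + 3) // 4
    return format(int(bits, 2), "0{}x".format(width))


def average_hash(pixels):
    """Toy perceptual hash: one bit per pixel, set when it is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return bits_to_hex("".join("1" if value > mean else "0" for value in flat))


def difference_hash(pixels):
    """A different toy algorithm: one bit per horizontal neighbour pair."""
    bits = "".join(
        "1" if left < right else "0"
        for row in pixels
        for left, right in zip(row, row[1:])
    )
    return bits_to_hex(bits)


# Hypothetical 4x4 grayscale frame (values 0-255): dark on the left, bright on the right.
frame = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]

# The same frame, uniformly brightened: different numbers, same picture to the eye.
brighter = [[value + 30 for value in row] for row in frame]

print(average_hash(frame), average_hash(brighter))  # same algorithm, same fingerprint
print(difference_hash(frame))                       # other algorithm, other fingerprint
```

Real perceptual hashes work on scaled-down video frames and are far more robust, but the consequence is the same: fingerprints are only comparable when they come from the same algorithm.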
Why is fingerprinting relevant?
To understand the need for media fingerprinting, imagine a library without a catalogue or order. If you’re curious about a certain book, there would be no way to find it, or even to determine whether it is in the library at all. A simple way to go about creating order in the pile of books is by sorting based on the last name of the author or category of work. The books are then labelled with such information and a code.
However, for online images and videos this isn’t the case. Unlike books in the library, online videos do not always have a clear author. On platforms such as YouTube, Vimeo, etc., videos are easily published without provenance information (information about the origin: the author, title, year in which it was made, license under which it was shared, etc.). Even if it is added, it might be lost when videos are shared across platforms. The very nature of the Internet makes sharing and copying simple, but the resulting copies contribute to the unstructured mess of online media. Additionally, videos can be modified (e.g. resized or reformatted). The modification of files results in different files that are perceptually identical (imagine the same book with a different font and/or cover while the words in the book stay the same), which makes the Internet even messier.
Fingerprinting is like ISBN coding for the Internet, applied after the media has been published. It identifies two perceptually similar files as the same, even if the files have been modified. As the fingerprints can be stored together, this technology makes it possible to catalogue online files. When a whitelist of files is created (we do this in Videorooter for videos published under Creative Commons licenses, which are a good way to share media and culture online), it becomes possible to trace back provenance information. In sum, by developing good fingerprinting algorithms, it becomes possible to develop a card catalogue of all online media files, map their occurrences on the Internet and trace their provenance information.
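As a rough illustration of such a card catalogue, the Python sketch below stores provenance information and known occurrences under each fingerprint. The class names, field names, fingerprint value and example.org URLs are all hypothetical; this is not a description of how the Videorooter whitelist is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    provenance: dict                                  # e.g. author, title, year, license
    occurrences: list = field(default_factory=list)   # URLs where the file was seen


class Catalogue:
    """Toy registry that maps a fingerprint to provenance and occurrences."""

    def __init__(self):
        self._entries = {}

    def register(self, fingerprint, provenance):
        self._entries[fingerprint] = CatalogueEntry(provenance=dict(provenance))

    def record_occurrence(self, fingerprint, url):
        """Note that a known file was seen at url; returns False for unknown files."""
        entry = self._entries.get(fingerprint)
        if entry is None:
            return False
        entry.occurrences.append(url)
        return True

    def lookup(self, fingerprint):
        """Return the provenance information, or None if the file is not listed."""
        entry = self._entries.get(fingerprint)
        return entry.provenance if entry else None


catalogue = Catalogue()
catalogue.register("3333", {"author": "Example Author", "license": "CC BY 4.0"})

# A resized copy found elsewhere yields the same (made-up) fingerprint,
# so its provenance can be traced back through the catalogue.
catalogue.record_occurrence("3333", "https://example.org/copy-of-video")
print(catalogue.lookup("3333"))
```

The lookup only works if every copy is fingerprinted with the same algorithm as the registered original, which is why the choice of algorithm matters.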
We need open standards shared by the commons
If we want to be able to get provenance information on any (openly licensed) video from, for example, a whitelist or card catalogue, we need to see the Internet as one ecosystem in which we use one and the same algorithm to fingerprint the files. Thus far, there is no universal fingerprinting algorithm that is used by the whole community. Some widely used fingerprinting technologies have been developed, such as YouTube’s ContentID. By the sheer size of it, this is the most frequently used perceptual fingerprinting algorithm. However, this is a closed, proprietary algorithm that the wider Internet cannot use.
None of the fingerprinting technologies have been published as open source. This means that everyone who wants to use fingerprinting technology has to reinvent the wheel. There is no clear universal standard that is used over and over again. What we need is a method to collaborate and index videos, to locate videos or identify duplicates and see where videos are used on the Internet.
However, at the same time it is important to acknowledge that there is value in having more than one algorithm. No single algorithm is perfect; each one has different strengths and weaknesses. Given that different algorithms can be used for different purposes, we come to the conclusion that it is important to identify which algorithm is used in a certain instance.
Videorooter works towards open methods of identifying video
Videorooter develops its own methods of video fingerprinting, but we recognise that our method will not capture all use cases of video fingerprinting. We therefore propose, as a first step, to develop an open standard for the fingerprints themselves. We acknowledge that we need a standard that recognises that no single algorithm is the best one to use every time. Therefore, we intend to come up with a way that enables us to identify the algorithm used based on the fingerprint.
To make this clear, the actual fingerprint that comes out of a fingerprinting algorithm varies widely. For example, A3a1c25d9b71a19d412188fa9ee0949a and 71db803472debd08beb63b7e99b72802344358c211f90fc114a85c041a3057a3 are both fingerprints (cryptographic hashes in this case) of the same file but obtained with different methods. There is no way to tell which algorithm is used without encapsulating the method used to obtain the fingerprint in a standard format.
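This can be reproduced with a minimal Python sketch using the standard library's hashlib. The input bytes are a made-up stand-in for a file's contents, not the file behind the two fingerprints quoted above.

```python
import hashlib

# Hypothetical file contents; in practice these would be the bytes read from a file.
data = b"the same file contents"

md5_fingerprint = hashlib.md5(data).hexdigest()
sha256_fingerprint = hashlib.sha256(data).hexdigest()

print(md5_fingerprint)      # 32 hexadecimal characters
print(sha256_fingerprint)   # 64 hexadecimal characters

# Cryptographic hashes only match for files that are identical bit by bit:
# changing a single byte gives a completely different fingerprint.
modified_fingerprint = hashlib.md5(data + b"!").hexdigest()
print(modified_fingerprint == md5_fingerprint)
```

Both outputs are plain strings of hexadecimal characters. Nothing in the strings themselves states which method produced them.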
In our example we need to be able to determine that the first one was made with MD5 and the second with SHA-256. These are two frequently used cryptographic fingerprinting algorithms; they only produce matching fingerprints for files that are the same bit by bit. We need a way to identify the type of fingerprint used. Only then does it become possible to develop a card catalogue to trace back the provenance information of a work. One possible way to do this is to attach the name of the algorithm to the hash, for example: MD5:A3a1c25d9b71a19d412188fa9ee0949a or SHA-256:71db803472debd08beb63b7e99b72802344358c211f90fc114a85c041a3057a3.
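A minimal Python sketch of this prefix convention follows. The function names and the choice of a colon as the separator are assumptions that simply follow the example above; they are not a published Videorooter format.

```python
def tag_fingerprint(algorithm, digest):
    """Attach the algorithm name to a fingerprint, e.g. 'MD5:<digest>'."""
    return "{}:{}".format(algorithm, digest)


def parse_fingerprint(tagged):
    """Split a tagged fingerprint into (algorithm, digest).

    Raises ValueError when the algorithm name or the digest is missing.
    """
    algorithm, separator, digest = tagged.partition(":")
    if not separator or not algorithm or not digest:
        raise ValueError("not a tagged fingerprint: {!r}".format(tagged))
    return algorithm, digest


tagged = tag_fingerprint("MD5", "A3a1c25d9b71a19d412188fa9ee0949a")
algorithm, digest = parse_fingerprint(tagged)
print(algorithm)  # the catalogue now knows which algorithm to use for comparison
```

With the algorithm name carried alongside the hash, a catalogue can hold fingerprints from several algorithms and only compare those made with the same one.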
Do you want to help us strengthen the video commons by standardising perceptual hashes? Contact us at firstname.lastname@example.org.