Identifiers ... not only, but also
(Report on a presentation given by Ingenta's CTO Leigh Dodds at the recent UKSG annual conference. For a report on the presentation given at the same conference by our marketing manager, Charlie Rapple, see our blog.)
Goodbye to dark matter
"Dark matter" accounts for most of the universe's mass - but cannot be seen, and can only be detected based on its influence on the stars in the sky. The world wide web also has dark matter: effectively, anything that doesn't have its own URL 'doesn't exist' - but even so, it influences the content that we see and interact with on the web.
This is beginning to change as we enter what information architect Tom Coates calls "the age of naming" - the "age of pointing at things". Tom's experience shows that the most successful projects are those that give web identifiers and names to all of the key objects, thereby bringing data into the global information environment and enabling it to be enriched with related tools and services.
A paradigm shift: enriching the web
This process is now extending across the web, as individuals, communities, and businesses begin to identify objects of collective interest. Sometimes this is done explicitly, for example through community initiatives like MusicBrainz, which is creating unique identifiers for music, or through organisations like the BBC opening up their catalogues to the web. Sometimes it's a by-product of the ongoing move of content and services onto the web.
In any case, it's important to remember that the content, and the identifiers being applied to it, are part of the global online environment that is the web; so to add actual value, we must avoid the false silos of the offline environment. For example, in the early days of the web, each site reinvented the search wheel by creating and deploying its own search infrastructure. Then along came the search engines. Because each of these documents had an identifier (a URL), the search engines could build large-scale aggregations that provided a single place where a user could search content. Nowadays it's generally no longer necessary to build a search engine, and most small websites don’t: they just drop in a little Google widget and spend their time focusing on their core business.
This same process can be applied at a fine-grained level. Assigning identifiers to smaller content components (for example, assigning a DOI to an article), and exposing these components to the web, allows communities to begin to co-ordinate more effectively, and to streamline how they integrate with one another. In most of the established examples, you only benefit from these efficiencies if you are part of the community in which these particular identifiers and metadata formats are common. What we are now aiming for is an approach to identifying and describing content that is global.
Global identifiers
Three key internet standards have been created in this area. The first is the URI, or Uniform Resource Identifier, which describes the general class of identifiers - essentially, the basic syntax for all forms of web identifiers. The other two standards build on this basis. The URN, or Uniform Resource Name, is purely an identifier: it can be used to uniquely identify content, but not to find or locate it. The URL, with which we’re all familiar, is both an identifier and a locator: it can be used not only to uniquely identify something, but also to point people at the object of interest. The web is built entirely on URLs, so anything that doesn’t have a URL is not part of the web. By extension, any identifiers that aren’t URLs aren’t part of the web.
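The distinction between these three standards can be illustrated with a minimal sketch in Python. It assumes that a "urn" scheme marks a name and that a handful of common schemes (http, https, ftp) mark a locator. The function name classify_identifier and the sample identifiers are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: these schemes are treated as "locators".
LOCATOR_SCHEMES = {"http", "https", "ftp"}


def classify_identifier(identifier: str) -> str:
    """Roughly classify a string as a URN, a URL, another URI, or not a URI."""
    scheme = urlsplit(identifier.strip()).scheme.lower()
    if not scheme:
        # No scheme at all, e.g. a bare ISBN: not part of the web as it stands.
        return "not a URI"
    if scheme == "urn":
        # Identifies something, but gives no way to fetch it.
        return "URN"
    if scheme in LOCATOR_SCHEMES:
        # Identifies something and says where to find it.
        return "URL"
    return "other URI"


if __name__ == "__main__":
    for sample in ("urn:isbn:0306406152",
                   "https://doi.org/10.1000/182",
                   "0306406152"):
        print(sample, "->", classify_identifier(sample))
```

Only the second sample can be clicked on; the bare number in the third has no web syntax at all.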
URLs have gained popularity because they are actionable. If someone hands you an ISBN, what do you do with it? If someone hands you a URL, you click on it. But we need identifiers to be more than actionable: we also need them to be reliable and sustainable.
Stability
The web, by its nature, is a chaotic environment: anyone can put anything online, and just as quickly it can be taken down again. Identifiers, as points of agreement within a community, require support from that community to ensure their stability. For the processes we build on top of them to work effectively, we need guarantees that web identifiers will be reliable and maintained indefinitely, that there is a clear process of assignment, that their meaning won’t change over time, and that what we get when we click on a URL remains consistent.
Openness
To foster innovation, identifiers also need to be open and shareable, freely usable and reusable. They need a basic set of services to facilitate their usage, in particular the ability to look up identifiers based on specific items of metadata.
Existing examples
The obvious first example is the DOI. It ticks nearly all of the boxes: it's supported by a community, and it's easy to use. And while it's not actionable in itself, it can be appended to a base URL to become part of the web.
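As a rough sketch of how a DOI becomes actionable, the Python snippet below appends a DOI to the base URL of the public DOI resolver, https://doi.org/. The function name doi_to_url and the basic sanity check are assumptions made for illustration; the sample DOI, 10.1000/182, is the one assigned to the DOI Handbook.

```python
from urllib.parse import quote

# Base URL of the public DOI resolver.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a clickable URL by appending it to the resolver."""
    doi = doi.strip()
    # Very light sanity check (an assumption of this sketch, not a full
    # validation): DOIs start with "10." and have a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    print(doi_to_url("10.1000/182"))
```

The identifier itself is unchanged; only the prefix turns it from a name into something a browser can follow.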
Unfortunately, not all identifiers follow this general model. The ISBN, for example, is ubiquitous but it's NOT part of the web, because there is no standard actionable form. In fact, if you visit the ISBN website, you’ll be hard pushed to actually find an ISBN, because the database in which they are stored is not open and shareable. ISBNs have nonetheless become actionable in a widespread way ... by appending them to an Amazon URL. As far as the web is concerned, this is now the easiest way to identify book content, and people routinely paste Amazon URLs into emails and blog postings. But the book industry's behind-the-scenes usage of the ISBN is not actionable and not linked in any accepted standard way, and so is not part of the web. This illustrates a key problem of not embracing the web environment: if you don’t do it, someone else will.
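The informal practice described above might be sketched as follows in Python. The check-digit rule is the standard ISBN-10 one (a weighted sum divisible by 11). The URL pattern https://www.amazon.com/dp/ followed by the ISBN-10 is an assumption based on commonly seen links, not a documented standard, and the function names are hypothetical.

```python
# Assumption: commonly seen Amazon link pattern, not an official standard.
AMAZON_BASE = "https://www.amazon.com/dp/"


def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 check digit (weights 10..1, sum divisible by 11)."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    body, check = chars[:9], chars[9]
    if not (body.isascii() and body.isdigit()) or check not in "0123456789X":
        return False
    # "X" in the final position stands for the value 10.
    values = [int(c) for c in body] + [10 if check == "X" else int(check)]
    total = sum(weight * value
                for weight, value in zip(range(10, 0, -1), values))
    return total % 11 == 0


def isbn10_to_amazon_url(isbn: str) -> str:
    """Make an ISBN-10 'actionable' by appending it to an Amazon base URL."""
    if not is_valid_isbn10(isbn):
        raise ValueError(f"not a valid ISBN-10: {isbn!r}")
    return AMAZON_BASE + isbn.replace("-", "").replace(" ", "").upper()


if __name__ == "__main__":
    print(isbn10_to_amazon_url("0-306-40615-2"))
```

The resulting link depends entirely on one retailer's URL scheme, which is the dependency the paragraph above warns about.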
Given the considerable industry interest in identifiers, there are several others to consider: ISAN (audiovisual numbers), ISPI (identifier for parties involved in the creation and production of content), the Ringgold identifier (for institutions), NISO's institutional identifier (still in development), ISTC codes (for exchanges between publishers, agents and librarians) and author identifiers (currently being explored by Thomson and CrossRef, among others). Some of these initiatives present huge opportunities to change the way we do business, to drive innovation, and to create better levels of business integration. Institutional identifiers in particular offer immediate and obvious benefits: with the 2008 renewals process only now beginning to tie up its loose ends, it's clear that there is an opportunity for further automation.
Conclusion
Ultimately, we need to embrace the web and turn it from a vague threat into an opportunity and a strength. And if we don’t start to see some disruptive changes in our workflows, then we’re probably doing something wrong!