The growing problem of Internet “link rot” and best practices for media and online publishers

The Internet is an endlessly rich world of sites, pages and posts — until it all ends with a click and a “404 page not found” error message. While the hyperlink was conceived in the 1960s, it came into its own with the launch of the Web and HTML in 1991, and there’s no doubt that the first broken link soon followed.


On its surface, the problem is simple: A once-working URL is now a goner. The root cause can be any of a half-dozen things, however, and sometimes more: Content could have been renamed, moved or deleted, or an entire site could have evaporated. Across the Web, the content, design and infrastructure of millions of sites are constantly evolving, and while that’s generally good for users and the Web ecosystem as a whole, it’s bad for existing links.
In its own way, the Web is also a very literal-minded creature, and all it takes is a single-character change in a URL to break a link. For example, many sites have stopped using “www,” and even if their content remains the same, the original links may no longer work. The same can occur with the shift from “http:” to “https:”. The rise of CMS platforms such as WordPress and Drupal has led to the fall of static HTML sites, and with each relaunch, untold thousands of links die.
Even if a core URL remains the same, many sites frequently append login information or search terms to URLs, and those are ephemeral. And as the Web has grown, the problem has been complicated by Google and other search engines that crawl the Web and archive — briefly — URLs and pages. Many of those cached copies work, but their long-term stability is open to question.
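To make that fragility concrete, the short Python sketch below checks whether a handful of URLs still resolve and flags silent redirects, such as a dropped “www” or a switch from “http” to “https.” It is an illustration only: the URLs are placeholders, and a production link checker would also need rate limiting, retries and robots.txt courtesy.

```python
import urllib.error
import urllib.request

def check_link(url: str) -> str:
    """Report whether a URL resolves, redirects or is broken (a sketch only)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            final_url = response.geturl()
            if final_url != url:
                # e.g. http -> https, or a dropped/added "www"
                return f"OK, but redirected: {url} -> {final_url}"
            return f"OK: {url}"
    except urllib.error.HTTPError as err:   # 404, 410, 500 and friends
        return f"BROKEN ({err.code}): {url}"
    except urllib.error.URLError as err:    # DNS failure, refused connection
        return f"UNREACHABLE ({err.reason}): {url}"

if __name__ == "__main__":
    # Placeholder URLs for illustration; substitute the links you want to audit.
    for link in ("http://www.example.com/old-page", "https://example.com/"):
        print(check_link(link))
```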
This phenomenon has its own name, “link rot,” and it’s far more than just an occasional annoyance to individual users.
Nerdy but important context
A 2013 study in BMC Bioinformatics looked at the lifespan of links in the scientific literature — a place where link persistence is crucial to public knowledge. The scholars, Jason Hennessey and Steven Xijin Ge of South Dakota State University, analyzed nearly 15,000 links in abstracts from Thomson Reuters’ Web of Science citation index. They found that the median lifespan of Web pages was 9.3 years, and just 62% were archived. Even the websites of major corporations that should know better — including Adobe, IBM, and Intel — can be littered with broken links.
A 2014 Harvard Law School study looks at the legal implications of Internet link decay, and finds reasons for alarm. The authors, Jonathan Zittrain, Kendra Albert and Lawrence Lessig, determined that approximately 50% of the URLs in U.S. Supreme Court opinions no longer link to the original information. They also found that in a selection of legal journals published between 1999 and 2011, more than 70% of the links no longer functioned as intended. The scholars write:
[As] websites evolve, not all third parties will have a sufficient interest in preserving the links that provide backwards compatibility to those who relied upon those links. The author of the cited source may decide the argument in the source was mistaken and take it down. The website owner may decide to abandon one mode of organizing material for another. Or the organization providing the source material may change its views and “update” the original source to reflect its evolving views. In each case, the citing paper is vulnerable to footnotes that no longer support its claims. This vulnerability threatens the integrity of the resulting scholarship.
To address some of these issues, academic journals are adopting the use of digital object identifiers (DOIs), which provide both persistence and traceability. But as Zittrain, Albert and Lessig point out, many people who produce content for the Web are likely to be “indifferent to the problems of posterity.” The scholars’ solution, supported by a broad coalition of university libraries, is perma.cc — the service takes a snapshot of a URL’s content and returns a permanent link (known as a permalink) that users employ rather than the original link.
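As a rough sketch of how that persistence works, the snippet below asks the central doi.org resolver where a DOI currently points. The identifier in the example is a placeholder, not a real DOI.

```python
import urllib.request

def resolve_doi(doi: str) -> str:
    """Follow the doi.org redirect chain and return the current landing URL."""
    request = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.geturl()

# Placeholder DOI for illustration only; a real one resolves to the
# publisher's current copy of the article, wherever that now lives.
print(resolve_doi("10.1234/example-doi"))
```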
Resources exist to preserve a comprehensive history of the Web, including the Internet Archive’s Wayback Machine. This service takes snapshots of entire websites over time, but the pages and data preserved aren’t always consistent or comprehensive, in part because many sites are dynamic — they’re built on the fly, and thus don’t exist in the way that classic HTML pages do — or because they block archiving.
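The Internet Archive exposes a public “availability” endpoint that reports whether a snapshot of a given page exists. The sketch below queries it for a placeholder URL; the field names reflect the endpoint’s documented JSON response.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

def closest_snapshot(url: str) -> Optional[str]:
    """Return the URL of the closest Wayback Machine snapshot, if any."""
    query = urllib.parse.urlencode({"url": url})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint, timeout=10) as response:
        data = json.load(response)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot and snapshot.get("available") else None

print(closest_snapshot("http://www.example.com/"))  # placeholder URL
```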
The Hiberlink project, a collaboration between the University of Edinburgh, the Los Alamos National Laboratory and others, is working to measure “reference rot” in online academic articles and to assess how much Web content has been archived. A related project, Memento, has established a technical standard for accessing online content as it existed in the past.
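In practice, Memento (RFC 7089) works through HTTP “datetime negotiation”: a client sends an Accept-Datetime header to a TimeGate, which redirects to the archived copy closest to the requested moment. The sketch below assumes the Wayback Machine’s TimeGate prefix (https://web.archive.org/web/) and uses a placeholder URL.

```python
import urllib.request

def memento_for(url: str, http_date: str) -> str:
    """Ask a Memento TimeGate for the archived copy closest to http_date."""
    request = urllib.request.Request(
        f"https://web.archive.org/web/{url}",    # assumed TimeGate prefix
        headers={"Accept-Datetime": http_date},  # e.g. "Thu, 31 May 2012 20:35:00 GMT"
        method="HEAD",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        # The TimeGate redirects to the closest archived "memento" of the page.
        return response.geturl()

print(memento_for("http://www.example.com/", "Thu, 31 May 2012 20:35:00 GMT"))
```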
