Portugal: preserving the value of Wikipedia

The free online encyclopedia Wikipedia has become one of the most widely used educational resources in the world. However, the links to the sources behind its articles deteriorate over time. In collaboration with Wikipedia Portugal, FCCN, the national research and education network (NREN) of Portugal and a unit of the Foundation for Science and Technology (FCT), is tackling this problem.

Wikipedia articles often reference external pages with important complementary information. Unfortunately, such pages can become unavailable for various reasons. This problem degrades the quality of Wikipedia as a credible and verifiable source of information.

Arquivo.pt, originally created by the Faculty of Sciences at the University of Lisbon, is today an FCCN service. Appreciating the value of Wikipedia articles, the Arquivo.pt team carried out an experiment. It found that 25 percent of the links in Portuguese Wikipedia articles that point outside the Wikipedia domain were broken.

Content drift

A frequent reason for a link becoming unavailable is management issues at the original website. For instance, the organization behind the site may not prioritize preserving the content. In addition, a link may still lead to available content that is no longer what the Wikipedia article originally intended to reference. This may be because the domain has since been bought by a third party, or even because the page has been repurposed maliciously. This phenomenon is known as content drift.

To counter such issues, Arquivo.pt and Wikipedia Portugal have joined forces.

The aim is to change the references to broken links in Wikipedia articles so that they point to content preserved on Arquivo.pt, keeping the referenced information accessible to Wikipedia users.
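As a rough illustration of what redirecting a broken reference to a web archive can look like, here is a minimal Python sketch. It is not the project's actual tooling. The URL pattern (archive prefix, a 14-digit capture timestamp, then the original URL) is an assumption modelled on the common Wayback-style layout, and the function names archived_url and repair_reference are hypothetical.

```python
from datetime import datetime

# Assumed Wayback-style prefix; the real archive URL layout may differ.
ARCHIVE_PREFIX = "https://arquivo.pt/wayback"


def archived_url(original_url: str, captured_at: datetime) -> str:
    """Build an archive URL for a capture of original_url (hypothetical helper)."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")  # 14-digit capture time
    return f"{ARCHIVE_PREFIX}/{timestamp}/{original_url}"


def repair_reference(url: str, is_broken: bool, captured_at: datetime) -> str:
    """Return the archived address for a broken link, or the link unchanged."""
    return archived_url(url, captured_at) if is_broken else url
```

In this sketch, deciding whether a link is broken and finding a suitable capture date are left to the caller, since the article does not describe how those steps are performed.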

New automatic process

The Portuguese Wikipedia contains more than 1 million articles, and, on average, 140 pages are edited per day. Arquivo.pt extracted 14 million links from the references in all the articles on the Portuguese Wikipedia. Of these links, only 620 pointed to Arquivo.pt and 744,553 to the Internet Archive.
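A tally like the one above can be produced by grouping extracted links by host name. The following Python sketch shows one way to do that. It is only an illustration, not the method used in the project: the host names in ARCHIVE_HOSTS and the names classify_link and count_links are assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed host names for the two archives; adjust to match real data.
ARCHIVE_HOSTS = {
    "arquivo.pt": "arquivo_pt",
    "web.archive.org": "internet_archive",
}


def classify_link(url: str) -> str:
    """Label a link by the archive it points to, or 'other' (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.arquivo.pt like arquivo.pt
    return ARCHIVE_HOSTS.get(host, "other")


def count_links(urls: list[str]) -> Counter:
    """Count how many links fall into each category."""
    return Counter(classify_link(url) for url in urls)
```

Matching on the parsed host rather than on a substring of the URL avoids miscounting ordinary links that merely contain an archive address in their path or query string.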

Further, Arquivo.pt collected all the pages referenced in Portuguese Wikipedia articles, resulting in a new collection containing 12 million files.

The main result of the project is the creation of a new automatic process for extracting and collecting external links cited on Portuguese Wikipedia pages. This process is now part of the collection operation, and an annual compilation of Wikipedia citations is carried out.

Image: screenshot of Arquivo website from explainer video

Published: 02/2024

