News outlets are limiting the Internet Archive’s access to their journalism

TL;DR

Many major U.S. local news publishers are blocking the Internet Archive’s crawling bots, limiting access to their journalism. This move impacts researchers, journalists, and historians relying on web archives for primary sources. The full extent and motivations remain under discussion.

More than 340 local news websites across the United States have begun restricting the Internet Archive’s web crawling bots, according to a recent analysis. The restrictions, driven by concerns over data scraping and intellectual property, threaten the long-term preservation of local journalism and affect the researchers, journalists, and historians who rely on web archives.

Since January 2026, the number of local news sites disallowing Internet Archive bots has increased from 241 to 382, with the majority owned by major publishers such as USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. Many of these sites are blocking specific bots associated with the Internet Archive, including Heritrix and related user agents.
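As a rough illustration of what "disallowing" a crawler can look like in practice, the sketch below checks a robots.txt file for rules that block particular user agents. It assumes the blocking is expressed through robots.txt, which the analysis does not spell out. The sample file, the user-agent tokens (`archive.org_bot`, `ia_archiver`, `SomeOtherBot`), the example URL, and the helper name `blocked_agents` are all assumptions chosen for illustration, not details taken from the sites in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# The user-agent tokens are assumptions, not a list from the analysis.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /admin/
"""


def blocked_agents(robots_txt, agents, url="https://example.com/news/story.html"):
    """Return the agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Illustrative agent names to test against the sample rules.
AGENTS = ["archive.org_bot", "ia_archiver", "SomeOtherBot"]

if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT, AGENTS))
```

A survey like the one described would presumably fetch each site's live robots.txt rather than a hard-coded string. Note that robots.txt is a voluntary convention, so a rule of this kind only affects crawlers that choose to honor it.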

Researchers and journalists emphasize that web archives like the Internet Archive’s Wayback Machine are vital for preserving primary sources of local news, especially as many outlets face financial pressures and decline. Notably, local outlets owned by large corporations are among the most active in restricting access, raising concerns about the future availability of local news history.

The Internet Archive has responded by stating it is in conversation with publishers and has implemented measures to prevent abuse, such as limiting bulk downloads and monitoring bot activity. Mark Graham, director of the Wayback Machine, confirmed the ongoing discussions and emphasized that the Archive’s terms of use restrict use of its collections to research and scholarship.

Why It Matters

This restriction on web archiving poses a significant threat to the preservation of local journalism, which is a critical component of the historical record. Without access to these archives, future researchers, journalists, and citizens may find it difficult to verify past events, track media coverage, or understand local histories. The move also raises broader questions about intellectual property rights, data privacy, and the role of nonprofit archives in maintaining an open and accessible internet.

Background

Since early 2026, news outlets have expressed concerns over AI companies scraping their content for training purposes, prompting some publishers to restrict web crawling. The Internet Archive, a nonprofit organization, has historically preserved vast amounts of online news content, including local journalism, which is increasingly under threat as publishers tighten access controls. Previous debates have centered on copyright and fair use, but recent actions appear driven by fears of data extraction for AI training.

“Blocking the Internet Archive’s web crawlers threatens one of the most effective ways that we capture and store news content for the long term.”

— Edward McCain, journalism librarian at the University of Missouri

“We are in conversation with many publishers and appreciate the opportunity to address their concerns.”

— Mark Graham, director of the Wayback Machine

“This is the same fight that everybody has been having with the Internet Archive since its inception.”

— Meredith Broussard, data journalist and NYU professor

“Without the Internet Archive, my work would be incredibly difficult to do.”

— B.J. Mendelson, journalist and petition signer

What Remains Unclear

It remains unclear whether AI companies have already scraped content from the restricted sites or plan to do so in the future. The full scope of publishers’ motivations and the potential legal or technological responses by the Internet Archive are still evolving. Additionally, the long-term effectiveness of current measures to prevent abuse is uncertain.

What’s Next

The Internet Archive continues discussions with publishers to address concerns and explore technical solutions. Monitoring of site restrictions will persist, and advocacy efforts by journalists and researchers are likely to increase. Future developments may include legal or policy debates over fair use, copyright, and digital preservation rights.

Key Questions

Why are news outlets blocking the Internet Archive?

Many outlets cite concerns over data scraping, intellectual property, and AI training purposes as reasons for restricting web crawling by the Internet Archive’s bots.

Could this affect the availability of local news history?

Yes, limiting access to web archives threatens the preservation of local journalism, which is vital for historical record-keeping and research.

Is the Internet Archive scraping content without permission?

The Internet Archive states it operates within legal and ethical boundaries, using bots designed to respect publisher restrictions and terms of use.

What can journalists and researchers do about this?

They can advocate for open access, participate in petitions, and support policies that balance copyright with the public interest in preserving digital history.

What are the next steps for the Internet Archive?

The organization is engaging in ongoing discussions with publishers, implementing technical safeguards, and monitoring the impact of restrictions to adapt its strategies.

Source: Hacker News
