The Case for Permanent URLs in an Age of Disappearing Content

Why the internet needs a quality-first publishing layer between social media and academic journals


The Disappearing Web

The internet has a memory problem. A 2024 study by the Pew Research Center found that 38% of web pages that existed in 2013 are no longer accessible[1]. Link rot — the gradual decay of URLs pointing to content that has moved or been deleted — affects everything from news articles to Supreme Court citations. The web was designed to be a permanent record of human knowledge, but in practice it functions more like a whiteboard that gets erased every few years.

This matters because the web has become the primary medium for sharing technical knowledge, research findings, and professional insight. When a blog post explaining a critical architectural decision disappears, that knowledge is lost. When a tutorial that thousands of developers relied on goes offline, the community loses a resource that cannot easily be reconstructed.

The problem is not just technical. It reflects a deeper issue with how we think about publishing online. We have optimized for engagement and reach at the expense of permanence and quality.

The Publishing Gap

Today there are broadly two ways to share technical writing on the internet. On one end, social and publishing platforms like LinkedIn, Twitter, and Medium offer massive reach but treat content as disposable. Posts are buried by algorithms within hours. Medium paywalls content arbitrarily. LinkedIn reformats everything into a feed optimized for engagement metrics rather than knowledge transfer.

On the other end, academic journals and formal publications offer permanence through DOIs and institutional archiving, but their processes are slow, expensive, and gatekept by peer review systems that can take months or years[2]. The overhead of formal publication makes it impractical for the kind of technical writing that professionals produce daily — architecture decision records, engineering retrospectives, framework comparisons, and implementation guides.

Between these two extremes is a gap. There is no publishing layer that combines the permanence and credibility signals of academic publishing with the speed and accessibility of social media. This is the gap that needs to be filled.

What a Quality-First Publishing Layer Looks Like

A publishing layer for the modern web needs several properties that current platforms lack: permanent, immutable URLs; transparent quality signals; verifiable authorship; and machine-readable discovery.

These are not novel ideas. Most of them build on established web standards that publishing platforms have abandoned in favor of engagement optimization.

Measuring Quality Without Gatekeeping

Traditional academic publishing uses peer review as a quality signal. This works but creates bottlenecks. Social media uses engagement metrics — likes, shares, comments — which incentivize controversy and clickbait over substance.

A better approach is deterministic quality scoring based on structural analysis of the content itself. Consider what makes technical writing useful:

Signal      | What It Measures                       | Why It Matters
----------- | -------------------------------------- | --------------------------------------------
Structure   | Headings, paragraphs, variety          | Well-organized content is easier to navigate
Substance   | Word count, code blocks, lists, tables | Dense content provides more value per page
Tone        | Professional language, low clickbait   | Credible writing avoids manipulation
Attribution | Links, references, footnotes           | Good work builds on and credits prior work
This kind of scoring is transparent, reproducible, and instant. No waiting for reviewers. No gaming engagement algorithms. The score reflects the structural properties of the writing itself, and readers can see exactly how it was calculated.

The scoring is deliberately imperfect — it measures form rather than truth. But form is a surprisingly good proxy for effort and professionalism. A document with clear headings, external references, code examples, and a bibliography is almost always more useful than a wall of unstructured text.
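As an illustration only, a scorer over these structural signals might look like the following Python sketch. The signal names and weights here are assumptions chosen for readability, not a reference implementation:

```python
import re

def quality_score(markdown: str) -> dict:
    """Score structural signals in a Markdown document (illustrative weights)."""
    lines = markdown.splitlines()
    signals = {
        "headings": sum(1 for l in lines if l.startswith("#")),
        "code_blocks": markdown.count("```") // 2,       # fenced blocks come in pairs
        "list_items": sum(1 for l in lines if l.lstrip().startswith(("- ", "* "))),
        "links": len(re.findall(r"\[[^\]]+\]\([^)]+\)", markdown)),
        "words": len(markdown.split()),
    }
    # Deterministic: the same input always yields the same score.
    score = min(100, signals["headings"] * 5 + signals["code_blocks"] * 10
                + signals["list_items"] * 2 + signals["links"] * 3
                + signals["words"] // 50)
    return {"signals": signals, "score": score}
```

Because the inputs are counts over the text itself, the breakdown can be shown to readers alongside the score, which is what makes the approach transparent rather than algorithmic in the social-media sense.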

Author Verification Through Gravity

Quality scoring addresses the content side, but readers also need to assess the author. Traditional platforms solve this with follower counts and blue checkmarks — signals that correlate more with popularity than expertise.

An alternative is a verification system based on concrete, auditable actions rather than social metrics, with each action unlocking a higher level:

  - Domain verification: prove you control a web domain you publish from
  - Identity verification: prove you are a real, identifiable professional
  - Peer endorsement: earn endorsements from other verified authors

Each level requires a specific, verifiable action. There is no way to buy or game your way to a higher level. Domain verification proves you control infrastructure. Identity verification proves you are a real professional. Peer endorsement proves other professionals trust your work.

This creates a credibility gradient that readers can interpret at a glance, without relying on popularity metrics or centralized editorial decisions.
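Under stated assumptions, such a ladder could be modeled as below. The level names, ordering, and required checks are hypothetical, chosen to illustrate the "auditable action per step" rule:

```python
from enum import IntEnum

class Gravity(IntEnum):
    """Hypothetical verification ladder; names and ordering are illustrative."""
    UNVERIFIED = 0   # account exists, nothing proven
    DOMAIN = 1       # proved control of a web domain (e.g. a DNS record challenge)
    IDENTITY = 2     # proved a real professional identity
    ENDORSED = 3     # endorsed by other verified authors

def can_reach(current: Gravity, target: Gravity, completed_checks: set[str]) -> bool:
    """Each step up requires a concrete, auditable action -- never a popularity metric."""
    required = {
        Gravity.DOMAIN: "dns_challenge",
        Gravity.IDENTITY: "id_check",
        Gravity.ENDORSED: "peer_endorsement",
    }
    steps = [g for g in Gravity if current < g <= target]
    return all(required[g] in completed_checks for g in steps)
```

The point of the model is that there is no numeric input a user can inflate; every transition maps to a check that either happened or did not.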

The Technical Architecture

Building a permanent publishing layer requires careful technical choices. The system needs to be fast, reliable, and resistant to the forces that cause link rot.

The core data model is straightforward:

Document
  ├── id (immutable, globally unique)
  ├── slug (human-readable URL path)
  ├── title, subtitle, authors
  ├── content (Markdown source)
  ├── rendered_html (sanitized output)
  ├── quality_score (deterministic)
  ├── author_gravity (verification level)
  └── versions[] (append-only history)
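As a minimal sketch, the model above maps naturally onto Python dataclasses. The `publish` helper and the exact field types are illustrative assumptions; only the field names come from the tree:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Version:
    content: str        # Markdown source at this point in time
    rendered_html: str  # sanitized output for this version

@dataclass
class Document:
    id: str                      # immutable, globally unique
    slug: str                    # human-readable URL path
    title: str
    authors: list[str]
    quality_score: int = 0       # deterministic, recomputed per version
    versions: list[Version] = field(default_factory=list)

    def publish(self, content: str, rendered_html: str) -> None:
        """Append-only: new versions are added, old ones are never mutated."""
        self.versions.append(Version(content, rendered_html))

    @property
    def content(self) -> str:
        return self.versions[-1].content  # the latest version is current
```

Freezing `Version` and only ever appending to `versions` is what makes the history trustworthy: an old URL can always resolve to the exact bytes it originally served.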

Key architectural decisions include:

  - Immutable, globally unique IDs, so a URL keeps resolving even if the slug changes
  - An append-only version history, so edits never destroy earlier content
  - Markdown as the source of record, with sanitized HTML as a derived artifact
  - Quality scores computed deterministically at publish time, never adjusted by engagement

The rendering pipeline in Python looks like this:

import nh3
from markdown_it import MarkdownIt  # CommonMark renderer

md = MarkdownIt("commonmark")
ALLOWED_TAGS = {"p", "h1", "h2", "h3", "pre", "code", "a", "ul", "ol", "li", "blockquote", "em", "strong"}
ALLOWED_ATTRIBUTES = {"a": {"href"}, "code": {"class"}}  # per-tag attribute allowlist

def render_markdown(content: str) -> str:
    raw_html = md.render(content)
    return nh3.clean(
        raw_html,
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRIBUTES,
        url_schemes={"http", "https", "mailto"},
        link_rel="noopener noreferrer",
    )

Every piece of user-generated HTML passes through the nh3 sanitizer before being stored or served. There is no |safe template bypass, no raw HTML injection point. Security is a property of the architecture, not a checklist item.
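The allowlist principle can be made concrete with nothing but the standard library. This is not how nh3 works internally, just a toy sanitizer showing the same idea: tags outside the allowlist are dropped, and script/style content never reaches the output:

```python
from html.parser import HTMLParser

ALLOWED = {"p", "em", "strong", "code", "pre", "a"}  # illustrative allowlist

class AllowlistSanitizer(HTMLParser):
    """Drop any tag not on the allowlist; drop script/style content entirely."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out: list[str] = []
        self.skip = 0  # depth inside script/style, whose text is discarded

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ALLOWED and not self.skip:
            self.out.append(f"<{tag}>")  # attributes dropped for brevity

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in ALLOWED and not self.skip:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)

def sanitize(html: str) -> str:
    s = AllowlistSanitizer()
    s.feed(html)
    return "".join(s.out)
```

A production system should still use a hardened sanitizer like nh3, which also handles attribute values, URL schemes, and malformed markup; the toy version only illustrates why an allowlist at the architecture level leaves no injection point.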

Discovery and Machine Readability

Permanent URLs are only valuable if content can be found. The discovery layer needs to serve both human readers and machine consumers:

  - For people: full-text search, topic pages, and clean, shareable URLs
  - For crawlers: sitemaps and RSS/Atom feeds that enumerate every document
  - For machines and AI agents: structured metadata (Open Graph, JSON-LD) embedded in every page

This multi-layered discovery approach ensures content is accessible to every type of consumer, from a person clicking a link on LinkedIn to an AI agent searching for technical references.
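One concrete machine-readable layer is embedded JSON-LD. A sketch, assuming schema.org's Article vocabulary and a hypothetical canonical host (`example.org`); the input dictionary shape is an assumption:

```python
import json

def jsonld_for(doc: dict) -> str:
    """Emit schema.org Article metadata so crawlers and AI agents can cite the page."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": doc["title"],
        "author": [{"@type": "Person", "name": n} for n in doc["authors"]],
        "url": f"https://example.org/{doc['slug']}",  # hypothetical canonical host
        "dateModified": doc["updated"],
    }
    return json.dumps(data, indent=2)
```

Embedding this in a `<script type="application/ld+json">` tag gives machine consumers the same permanence guarantees as the human-readable page, at essentially no rendering cost.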

Why This Matters Now

The convergence of three trends makes this kind of publishing infrastructure urgent:

First, AI systems need citable sources. As large language models become primary research tools, they need permanent, machine-readable references to cite. A URL that returns 404 is not a citation — it is a hallucination waiting to happen[3].

Second, professional knowledge is being lost. The average lifespan of a blog post is shrinking as platforms consolidate and shut down. Valuable technical writing from the early web is already gone. The next decade of professional knowledge sharing should not depend on the business models of social media companies.

Third, trust in online content is declining. Readers need better signals than follower counts and engagement metrics to assess credibility. Transparent quality scoring and verifiable author credentials provide those signals without centralized editorial control.

The solution is not another social media platform or another blogging tool. It is a publishing layer — infrastructure that sits beneath applications and above raw hosting, providing permanence, quality signals, and machine readability as a service.

References

  1. Pew Research Center, "When Online Content Disappears," May 2024. The study found that 38% of pages from 2013 were no longer accessible by October 2023.

  2. The average time from submission to publication in academic journals ranges from six months to over two years, depending on the field. Source: Nature, 2023.

  3. The term "hallucination" in AI refers to model outputs that are plausible but factually incorrect, often caused by training on content that is no longer verifiable.