# web-archive

web archiver with MASL bundle mode for ATProto. captures web pages as content-addressed bundles stored on your PDS, with optional IPFS pinning. includes a recursive site crawler for archiving entire websites.

## what it does

- **single mode**: archives a single HTML page with its CID stored as a `systems.witchcraft.archive.capture` ATProto record
- **bundle mode**: archives a page + all subresources (CSS, JS, images, fonts) as a MASL bundle — each resource gets its own content-addressed blob on your PDS
- **site mode** (`site_archive.py`): recursively crawls a website (BFS), archives each page as a bundle, and creates a site manifest linking them all together. internal links are rewritten to point to sibling archive captures.
- **CSS url() scanning**: follows `@import` and `url()` references in stylesheets to capture fonts, background images, etc.
- **IPFS pinning**: optionally pins the HTML to IPFS via a local kubo node
- **PDS blob storage**: uploads all resources as PDS blobs with proper content-type headers

code sketches of several of these steps appear in the appendix at the end of this README.

## usage

```bash
# single page archive
python web_archive.py https://example.com

# bundle mode (page + all subresources)
python web_archive.py https://example.com --bundle

# bundle with resource limit
python web_archive.py https://example.com --bundle --max-resources 50

# skip IPFS pinning
python web_archive.py https://example.com --no-ipfs

# list all archives
python web_archive.py --list

# search archives
python web_archive.py --search "example"

# verify archive integrity
python web_archive.py --verify
```

### site archiver

```bash
# dry-run: show what would be archived
python site_archive.py https://example.com --dry-run

# archive a site (default: depth 2, max 30 pages)
python site_archive.py https://example.com

# customize crawl depth and page limit
python site_archive.py https://example.com --depth 3 --max-pages 50

# list site archives
python site_archive.py --list

# show status of a site archive
python site_archive.py --status
```

## auth

set these environment variables:

```bash
export ATP_PDS_URL=https://your.pds.example.com
export ATP_HANDLE=your.handle
export ATP_PASSWORD=your-app-password
```

## dependencies

```bash
pip install requests beautifulsoup4
```

optional: `ipfs` CLI (kubo) for IPFS pinning

## record types

- `systems.witchcraft.archive.capture` — single page captures
- `systems.witchcraft.archive.bundle` — bundle archives with a MASL-shaped manifest
  - `masl.resources`: path → {src: CID, content-type} (spec-conformant CID strings)
  - `blobs`: path → ATProto blob ref (for content retrieval from the PDS)
  - archive metadata (url, title, capturedAt, etc.) at the top level
  - see the [MASL spec](https://dasl.ing/masl.html) for the manifest format
- `systems.witchcraft.archive.site` — site manifests linking multiple page bundles
  - page list with URLs, titles, depths, and bundle rkeys
  - link map for internal link rewriting between archived pages

a sketch of the bundle record shape appears in the appendix below.

## viewer

archived pages can be viewed with the [archive viewer](https://sites.wisp.place/kira.pds.witchcraft.systems/archive-viewer/) hosted on wisp.place.

## license

MIT
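## appendix: code sketches

the snippets below are illustrative sketches, not excerpts from the scripts: helper names (`pds`, `headers`, `crawl`), file paths, and field layouts are assumptions unless stated otherwise.

authentication presumably uses the standard `com.atproto.server.createSession` XRPC call with the env vars from the auth section, along these lines:

```python
import os
import requests

pds = os.environ["ATP_PDS_URL"]

# exchange handle + app password for a session token
resp = requests.post(
    f"{pds}/xrpc/com.atproto.server.createSession",
    json={
        "identifier": os.environ["ATP_HANDLE"],
        "password": os.environ["ATP_PASSWORD"],
    },
)
resp.raise_for_status()
session = resp.json()

# the access token authenticates every later repo/blob call
headers = {"Authorization": f"Bearer {session['accessJwt']}"}
```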
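a `url()`/`@import` scanner in the spirit of the stylesheet handling described above (the regexes are illustrative, not the script's actual patterns):

```python
import re

# matches url(foo.png), url('foo.png'), url("foo.png")
URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")
# matches @import "foo.css" / @import 'foo.css';
# the @import url(...) form is already caught by URL_RE
IMPORT_RE = re.compile(r"""@import\s+['"]([^'"]+)['"]""")

def css_refs(css_text: str) -> set[str]:
    """collect subresource references from a stylesheet."""
    refs = set(URL_RE.findall(css_text))
    refs.update(IMPORT_RE.findall(css_text))
    # data: URIs are already inline and need no fetching
    return {r for r in refs if not r.startswith("data:")}
```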
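the site crawler's BFS, bounded by `--depth` and `--max-pages`, reduces to something like this (the same-origin filter is an assumption):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start: str, max_depth: int = 2, max_pages: int = 30) -> list[tuple[str, int]]:
    """breadth-first crawl of same-origin pages up to the given limits."""
    origin = urlparse(start).netloc
    seen = {start}
    pages: list[tuple[str, int]] = []
    queue = deque([(start, 0)])
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=30).text
        pages.append((url, depth))  # the real script bundles the page at this point
        if depth >= max_depth:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == origin and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

recording the depth alongside each URL is what lets the site manifest carry per-page depths, as described under record types.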
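each captured resource lands on the PDS via `com.atproto.repo.uploadBlob`, which takes raw bytes plus a `Content-Type` header and returns a blob ref to embed in the record (this sketch reuses `pds` and `headers` from the login sketch):

```python
# upload one captured resource as a PDS blob
with open("index.html", "rb") as f:
    body = f.read()

resp = requests.post(
    f"{pds}/xrpc/com.atproto.repo.uploadBlob",
    headers={**headers, "Content-Type": "text/html"},
    data=body,
)
resp.raise_for_status()

# this ref goes under `blobs` in the bundle record
blob_ref = resp.json()["blob"]
```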
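one way to implement the optional pinning step is to shell out to the local kubo node; `ipfs add` pins by default, `-Q` prints only the final CID, and `--cid-version 1` yields a CIDv1 (whether the script does exactly this is an assumption):

```python
import subprocess

# add (and pin) the captured HTML on the local kubo node
cid = subprocess.run(
    ["ipfs", "add", "-Q", "--cid-version", "1", "index.html"],
    capture_output=True, text=True, check=True,
).stdout.strip()
```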
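putting the record-type fields together, a `systems.witchcraft.archive.bundle` record might be shaped roughly like this. this is a hypothetical example: the CIDs are truncated placeholders, and the layout follows the bullet list above rather than the actual lexicon:

```python
bundle_record = {
    "$type": "systems.witchcraft.archive.bundle",
    # archive metadata at the top level
    "url": "https://example.com",
    "title": "Example Domain",
    "capturedAt": "2024-01-01T00:00:00Z",
    # MASL-shaped manifest: path -> {src: CID, content-type}
    "masl": {
        "resources": {
            "index.html": {"src": "bafkrei…", "content-type": "text/html"},
            "style.css": {"src": "bafkrei…", "content-type": "text/css"},
        }
    },
    # path -> ATProto blob ref, for retrieval from the PDS
    "blobs": {
        "index.html": {
            "$type": "blob",
            "ref": {"$link": "bafkrei…"},
            "mimeType": "text/html",
            "size": 1256,
        }
    },
}
```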