# web-archive

web archiver with MASL bundle mode for ATProto. captures web pages as content-addressed bundles stored on your PDS, with optional IPFS pinning. includes a recursive site crawler for archiving entire websites.

## what it does

- **single mode**: archives a single HTML page with its CID stored as a `systems.witchcraft.archive.capture` ATProto record
- **bundle mode**: archives a page + all subresources (CSS, JS, images, fonts) as a MASL bundle — each resource gets its own content-addressed blob on your PDS
- **site mode** (`site_archive.py`): recursively crawls a website (BFS), archives each page as a bundle, and creates a site manifest linking them all together. internal links are rewritten to point to sibling archive captures.
- **CSS url() scanning**: follows `@import` and `url()` references in stylesheets to capture fonts, background images, etc.
- **IPFS pinning**: optionally pins the HTML to IPFS via a local kubo node
- **PDS blob storage**: uploads all resources as PDS blobs with proper content-type headers

code sketches of several of these steps appear in the appendix at the end of this README.

## usage

```bash
# single page archive
python web_archive.py https://example.com

# bundle mode (page + all subresources)
python web_archive.py https://example.com --bundle

# bundle with resource limit
python web_archive.py https://example.com --bundle --max-resources 50

# skip IPFS pinning
python web_archive.py https://example.com --no-ipfs

# list all archives
python web_archive.py --list

# search archives
python web_archive.py --search "example"

# verify archive integrity
python web_archive.py --verify
```

### site archiver

```bash
# dry-run: show what would be archived
python site_archive.py https://example.com --dry-run

# archive a site (default: depth 2, max 30 pages)
python site_archive.py https://example.com

# customize crawl depth and page limit
python site_archive.py https://example.com --depth 3 --max-pages 50

# list site archives
python site_archive.py --list

# show status of a site archive
python site_archive.py --status
```

## auth

set these environment variables:

```bash
export ATP_PDS_URL=https://your.pds.example.com
export ATP_HANDLE=your.handle
export ATP_PASSWORD=your-app-password
```

## dependencies

```bash
pip install requests beautifulsoup4
```

optional: `ipfs` CLI (kubo) for IPFS pinning

## record types

- `systems.witchcraft.archive.capture` — single page captures
- `systems.witchcraft.archive.bundle` — bundle archives with a MASL-shaped manifest
  - `masl.resources`: path → {src: CID, content-type} (spec-conformant CID strings)
  - `blobs`: path → ATProto blob ref (for content retrieval from the PDS)
  - archive metadata (url, title, capturedAt, etc.) at the top level
  - see the [MASL spec](https://dasl.ing/masl.html) for the manifest format
- `systems.witchcraft.archive.site` — site manifests linking multiple page bundles
  - page list with URLs, titles, depths, and bundle rkeys
  - link map for internal link rewriting between archived pages

a sketch of the bundle record shape appears in the appendix below.

## viewer

archived pages can be viewed with the [archive viewer](https://sites.wisp.place/kira.pds.witchcraft.systems/archive-viewer/) hosted on wisp.place.

## license

MIT
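## appendix: code sketches

the snippets below are illustrative sketches, not excerpts from the scripts: helper names (`pds`, `headers`, `crawl`), file paths, and field layouts are assumptions unless stated otherwise.

authentication presumably uses the standard `com.atproto.server.createSession` XRPC call with the env vars from the auth section, along these lines:

```python
import os
import requests

pds = os.environ["ATP_PDS_URL"]

# exchange handle + app password for a session token
resp = requests.post(
    f"{pds}/xrpc/com.atproto.server.createSession",
    json={
        "identifier": os.environ["ATP_HANDLE"],
        "password": os.environ["ATP_PASSWORD"],
    },
)
resp.raise_for_status()
session = resp.json()

# the access token authenticates every later repo/blob call
headers = {"Authorization": f"Bearer {session['accessJwt']}"}
```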
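a `url()`/`@import` scanner in the spirit of the stylesheet handling described above (the regexes are illustrative, not the script's actual patterns):

```python
import re

# matches url(foo.png), url('foo.png'), url("foo.png")
URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")
# matches @import "foo.css" / @import 'foo.css';
# the @import url(...) form is already caught by URL_RE
IMPORT_RE = re.compile(r"""@import\s+['"]([^'"]+)['"]""")

def css_refs(css_text: str) -> set[str]:
    """collect subresource references from a stylesheet."""
    refs = set(URL_RE.findall(css_text))
    refs.update(IMPORT_RE.findall(css_text))
    # data: URIs are already inline and need no fetching
    return {r for r in refs if not r.startswith("data:")}
```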
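the site crawler's BFS, bounded by `--depth` and `--max-pages`, reduces to something like this (the same-origin filter is an assumption):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start: str, max_depth: int = 2, max_pages: int = 30) -> list[tuple[str, int]]:
    """breadth-first crawl of same-origin pages up to the given limits."""
    origin = urlparse(start).netloc
    seen = {start}
    pages: list[tuple[str, int]] = []
    queue = deque([(start, 0)])
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=30).text
        pages.append((url, depth))  # the real script bundles the page at this point
        if depth >= max_depth:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == origin and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

recording the depth alongside each URL is what lets the site manifest carry per-page depths, as described under record types.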
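each captured resource lands on the PDS via `com.atproto.repo.uploadBlob`, which takes raw bytes plus a `Content-Type` header and returns a blob ref to embed in the record (this sketch reuses `pds` and `headers` from the login sketch):

```python
# upload one captured resource as a PDS blob
with open("index.html", "rb") as f:
    body = f.read()

resp = requests.post(
    f"{pds}/xrpc/com.atproto.repo.uploadBlob",
    headers={**headers, "Content-Type": "text/html"},
    data=body,
)
resp.raise_for_status()

# this ref goes under `blobs` in the bundle record
blob_ref = resp.json()["blob"]
```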
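one way to implement the optional pinning step is to shell out to the local kubo node; `ipfs add` pins by default, `-Q` prints only the final CID, and `--cid-version 1` yields a CIDv1 (whether the script does exactly this is an assumption):

```python
import subprocess

# add (and pin) the captured HTML on the local kubo node
cid = subprocess.run(
    ["ipfs", "add", "-Q", "--cid-version", "1", "index.html"],
    capture_output=True, text=True, check=True,
).stdout.strip()
```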
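putting the record-type fields together, a `systems.witchcraft.archive.bundle` record might be shaped roughly like this. this is a hypothetical example: the CIDs are truncated placeholders, and the layout follows the bullet list above rather than the actual lexicon:

```python
bundle_record = {
    "$type": "systems.witchcraft.archive.bundle",
    # archive metadata at the top level
    "url": "https://example.com",
    "title": "Example Domain",
    "capturedAt": "2024-01-01T00:00:00Z",
    # MASL-shaped manifest: path -> {src: CID, content-type}
    "masl": {
        "resources": {
            "index.html": {"src": "bafkrei…", "content-type": "text/html"},
            "style.css": {"src": "bafkrei…", "content-type": "text/css"},
        }
    },
    # path -> ATProto blob ref, for retrieval from the PDS
    "blobs": {
        "index.html": {
            "$type": "blob",
            "ref": {"$link": "bafkrei…"},
            "mimeType": "text/html",
            "size": 1256,
        }
    },
}
```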