very fast at protocol indexer with flexible filtering, xrpc queries, cursor-backed event stream, and more, built on fjall

Hydrant using large amounts of memory during backfill #13

open opened by makeworld.space

Hydrant is using 37+ gigs of RAM when I'm trying to backfill, causing OOM. I assume this is unexpected.

My settings (running it in Docker):

      HYDRANT_DATABASE_PATH: /data/hydrant.db
      HYDRANT_EPHEMERAL: "true"
      HYDRANT_EPHEMERAL_TTL: "60min"
      HYDRANT_FULL_NETWORK: "true"
[deleted by author]
  • Filesystem: ext4, so no compression
  • 16 CPU cores allocated from Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
  • 40 gigs RAM allocated
  • Storage: 650 gigs of NVMe from TEAM TM8FP6001T
[deleted by author]

After testing from latest: it seems to have made a small difference? Thanks! Let me give some numbers:

  • Previous code peak: 11730 MB at 6 mins
  • New code peak: 10339 MB at 11 mins, but more stable, not just climbing

This still seems higher than I expected, but idk, maybe it's just inherent to the problem space. I can decrease the limit, but I'm curious what you think the expected memory usage should be. Given the blog post and the machine I'm using, I wouldn't expect to have to decrease it.

Peaks at about 12 gigs used, but doesn't seem to just be growing without bound. That might be where the difference is, for the recent commits you made.

[deleted by author]
[deleted by author]

(these numbers collected without a websocket consumer, which is why memory usage is stable)

One part of the code that seems like it would cause intermittent memory issues is that the entire CAR repo is held in memory, rather than streamed. Idk if that's the culprit here but it seems relevant.

The extreme memory ballooning (the 37G+ I initially reported) is only reproducible when I run some other tasks on the same machine, including postgres and some custom code consuming the websocket. I thought this might be disk utilization as you hinted at, but that wasn't it.

The other explanation is there is some sort of memory leak involved with consumers of the websocket, or DID resolvers - as that is what one of the containers is doing.

I investigated this and found that:

  • Stopping backfill kept memory stable at a single value (say, 20 gigs)
  • Stopping the websocket consumer immediately dropped memory down vastly (~1.6 gigs)

So I think there is some sort of memory leak with a slow websocket consumer - something to do with the iterator not letting fjall compact? I'm not familiar with this kind of database though.

Even with 8 backfill workers instead of 64, I still see memory usage hit 19 gigs in 16 mins - when there is a websocket consumer.

Claude seems to have given me a fix. I don't want to give you a low quality PR (per AGENTS.md), but I also don't know Rust, so let me just describe what it does at a high level: instead of holding the websocket catch-up iterator open the entire time, it reads a batch of 128 items, does blocking sends for that batch, then repeats. This way the iterator is not held open by a slow websocket consumer, and compaction can proceed.
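To make the batching idea concrete, here is a minimal sketch of the pattern described above. It is not hydrant's actual code. `read_batch`, `catch_up`, and the `&[u64]` "store" are hypothetical stand-ins, and a bounded `std::sync::mpsc` channel plays the role of the slow websocket consumer. The point is only that each read finishes and releases its iterator before any blocking send happens.

```rust
use std::sync::mpsc::{SendError, SyncSender};

/// Batch size taken from the description above.
const BATCH_SIZE: usize = 128;

/// Hypothetical stand-in for the event store: returns up to `limit` event ids
/// greater than `cursor`. The iterator lives only inside this function, so it
/// is already dropped by the time the caller starts sending.
fn read_batch(store: &[u64], cursor: u64, limit: usize) -> Vec<u64> {
    store
        .iter()
        .copied()
        .filter(|&id| id > cursor)
        .take(limit)
        .collect()
}

/// Catch a consumer up from `cursor` by alternating short reads with blocking
/// sends. Returns the last event id that was sent (or the starting cursor if
/// there was nothing to send).
fn catch_up(
    store: &[u64],
    mut cursor: u64,
    tx: &SyncSender<u64>,
) -> Result<u64, SendError<u64>> {
    loop {
        // Short-lived read: no iterator is held across the sends below.
        let batch = read_batch(store, cursor, BATCH_SIZE);
        if batch.is_empty() {
            return Ok(cursor);
        }
        for id in batch {
            // Blocks while the bounded channel is full (slow consumer).
            tx.send(id)?;
            cursor = id;
        }
    }
}
```

With a channel from `std::sync::mpsc::sync_channel(4)` and a consumer thread draining the receiver, `catch_up(&store, 0, &tx)` delivers every event in order while never holding more than one batch in memory on the sending side.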

This fixed my memory issue so far. No more ballooning, only 7-8 gigs used even with 64 workers.

Update: it eventually OOM'd again, hitting 35G. So this is maybe an improvement but not the whole fix? It takes longer to get to the first OOM now, but it still happens.

Once things got going long enough, it OOM'd multiple times, indicated by breaks on this graph. Note the Y-axis is not memory usage, I don't have that.

[deleted by author]
[deleted by author]
[deleted by author]

so the way you are reproducing this is, running full network backfill, and then streaming the events from the beginning always?

Uh actually I now realize that I am streaming without a cursor at the beginning (since I thought that would mean non-live events), and only apply a cursor if the websocket drops or the consumer restarts.

if you try just calling it without a cursor it should just be live tailing, im wondering if that changes anything

I guess that's what I've been doing, up until hydrant restarts due to OOM, then I start using a cursor.

Let me test with properly doing backfill from cursor=0...

It's been running for 3 hours now without issue! This might be the fix. Thanks. It's using about 3 gigs of mem right now, although of course it varies, up to 8 gigs maybe. I'll update if I have any further issues.

For clarity, I am running your new branch, with cursor=0 at the start, although I have used other cursors since then.
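For anyone landing here later, this is roughly the consumer-side bookkeeping I mean, as a small sketch. `CursorTracker` and its methods are hypothetical names, not part of hydrant. It assumes event ids are increasing integers, and it does not try to answer whether the server treats the cursor as inclusive or exclusive.

```rust
/// Hypothetical consumer-side state: remembers the last event id processed so
/// that every (re)connect can pass an explicit cursor.
#[derive(Debug, Default)]
struct CursorTracker {
    last_seen: Option<u64>,
}

impl CursorTracker {
    /// Call after an event has been fully processed.
    fn record(&mut self, event_id: u64) {
        self.last_seen = Some(event_id);
    }

    /// Cursor to use when connecting: 0 on the very first connect (backfill
    /// from the beginning), otherwise the last event id that was processed.
    fn connect_cursor(&self) -> u64 {
        self.last_seen.unwrap_or(0)
    }
}
```

The difference from what I was doing before is the first connect: it sends cursor 0 explicitly instead of sending no cursor at all, which per the comment above just live tails.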

I can see from my disk usage graph that compaction is happening. If you look at the image I posted earlier, ignoring the OOMs, you can see it only ever increases. In my graph now by contrast, there are clear decreases in disk usage when compaction happens.

Closing this as it seems to just work now. Will re-open if it reappears after the deadlock is fixed. Thanks!

Ah wait I'll close it once the branch is merged, that makes more sense.

[deleted by author]
AT URI
at://did:plc:2j2ounbiyi3ftihronlw5qhj/sh.tangled.repo.issue/3ml2hk4qm4j22