[GH-ISSUE #277] Getting OOM killed with large number of inodes #122

Closed
opened 2026-06-08 11:25:46 +03:00 by zhus · 3 comments
Owner

Originally created by @WesleyAC on GitHub (Nov 30, 2022).
Original GitHub issue: https://github.com/bootandy/dust/issues/277

On a system with 1,500,000 inodes used and 2GB RAM (with ~500MB free), dust is consistently OOM-killed. I'm not sure what the internals of dust look like, but it seems to me that it should be possible to operate with significantly less than that.

It looks like what's happening is that every node name is stored up-front when the directory tree is being walked, and then the summary is only done at the end. I'm not sure how much surgery would be required to improve the situation. One thought is to not store the PathBuf in the Node, and instead only get the name for the node when assembling the final output, by walking the directory tree one more time and matching inode numbers. This would probably want to be behind a --low-memory switch or something like that, since it's a tradeoff — it would take about twice a much time (walking the file tree twice), but use significantly less memory.

Another option, which is less of a tradeoff, is to store just the top section of the path, rather than the absolute path in the Node object, and only assemble absolute paths from the tree of nodes. I haven't looked deeply into how much the assumption of absolute paths is baked-in, so maybe this is a non-starter, though.

Would love for dust to be more useful on my file-filled and ram-constrained server :)

Originally created by @WesleyAC on GitHub (Nov 30, 2022). Original GitHub issue: https://github.com/bootandy/dust/issues/277 On a system with 1,500,000 inodes used and 2GB RAM (with ~500MB free), `dust` is consistently OOM-killed. I'm not sure what the internals of `dust` look like, but it seems to me that it should be possible to operate with significantly less than that. It looks like what's happening is that every node name is stored up-front when the directory tree is being walked, and then the summary is only done at the end. I'm not sure how much surgery would be required to improve the situation. One thought is to not store the `PathBuf` in the `Node`, and instead only get the name for the node when assembling the final output, by walking the directory tree one more time and matching inode numbers. This would probably want to be behind a `--low-memory` switch or something like that, since it's a tradeoff — it would take about twice a much time (walking the file tree twice), but use significantly less memory. Another option, which is less of a tradeoff, is to store just the top section of the path, rather than the absolute path in the `Node` object, and only assemble absolute paths from the tree of nodes. I haven't looked deeply into how much the assumption of absolute paths is baked-in, so maybe this is a non-starter, though. Would love for `dust` to be more useful on my file-filled and ram-constrained server :)
zhus closed this issue 2026-06-08 11:25:47 +03:00
Author
Owner

@bootandy commented on GitHub (Jan 4, 2023):

I'll see what I can do about this.

We had to limit stack size on small machines before: https://github.com/bootandy/dust/issues/256

We keep a summary of all the nodes that we've visited to deal with duplicate inodes so we don't double count them so I want to keep the inode_device
We could try storing just a string instead of a PathBuf as you suggested, but I'm not sure that will make much difference.
We might be able to remove 'depth' from the Node - that might not be used anymore.
We might also be able to remove the 'top' part of the paths.

But I'm not sure the above would allow us to save much memory.

by walking the directory tree one more time

I don't want to walk the tree more than once, it will be too slow. And adding it as an option would cause the code to get too messy.

<!-- gh-comment-id:1371306106 --> @bootandy commented on GitHub (Jan 4, 2023): I'll see what I can do about this. We had to limit stack size on small machines before: https://github.com/bootandy/dust/issues/256 We keep a summary of all the nodes that we've visited to deal with duplicate inodes so we don't double count them so I want to keep the inode_device We could try storing just a string instead of a PathBuf as you suggested, but I'm not sure that will make much difference. We might be able to remove 'depth' from the Node - that might not be used anymore. We might also be able to remove the 'top' part of the paths. But I'm not sure the above would allow us to save much memory. > by walking the directory tree one more time I don't want to walk the tree more than once, it will be too slow. And adding it as an option would cause the code to get too messy.
Author
Owner

@WesleyAC commented on GitHub (Jan 5, 2023):

Thanks! I think that removing the top part of the paths is probably most impactful — running find . | xargs dirname | wc -c as a approximation of how much RAM might be saved (the actual amount would probably be slightly more than this because of alignment, etc) in a directory that I had this problem on gives 108593973, around 108MB, which is a respectable saving.

Another thought is that the code could be more deeply aware of the -d flag — my understanding is that it shouldn't need to keep track of the name at all if the file or directory is more nested than the number given by -d. I could also imagine a new flag to not show sizes of files at all, only directories, which would enable a similar optimization — not storing names of files, only of directories. For me, the output would be similarly useful in most cases — generally I'm using dust to find directories that are large, rather than particular offending files.

For that matter, you could probably omit the filename in all cases, and get names only for the files that need to be displayed by calling statx on each file in the parent directory to check the inode number. That way, instead of re-crawling the whole tree, you just need to loop over the files in a directory for each file in the output (which is probably only few dozen files at most). I don't know if that tradeoff is worth it in the non-ram-constrained case, but it seems like it might be worthwhile.

<!-- gh-comment-id:1371523026 --> @WesleyAC commented on GitHub (Jan 5, 2023): Thanks! I think that removing the top part of the paths is probably most impactful — running `find . | xargs dirname | wc -c` as a approximation of how much RAM might be saved (the actual amount would probably be slightly more than this because of alignment, etc) in a directory that I had this problem on gives `108593973`, around 108MB, which is a respectable saving. Another thought is that the code could be more deeply aware of the `-d` flag — my understanding is that it shouldn't need to keep track of the name at all if the file or directory is more nested than the number given by `-d`. I could also imagine a new flag to not show sizes of files at all, only directories, which would enable a similar optimization — not storing names of files, only of directories. For me, the output would be similarly useful in most cases — generally I'm using `dust` to find directories that are large, rather than particular offending files. For that matter, you could probably omit the filename in all cases, and get names only for the files that need to be displayed by calling `statx` on each file in the parent directory to check the inode number. That way, instead of re-crawling the whole tree, you just need to loop over the files in a directory for each file in the output (which is probably only few dozen files at most). I don't know if that tradeoff is worth it in the non-ram-constrained case, but it seems like it might be worthwhile.
Author
Owner

@bootandy commented on GitHub (Jan 18, 2023):

Shaving the top of the paths of proved to be problematic.

I don't think I can realisticly get the memory requirement down much, sorry.

<!-- gh-comment-id:1386259108 --> @bootandy commented on GitHub (Jan 18, 2023): Shaving the top of the paths of proved to be problematic. I don't think I can realisticly get the memory requirement down much, sorry.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: bootandy/archived-dust#122