[GH-ISSUE #327] apparent-size link limitation? #144

Closed
opened 2026-06-08 11:25:52 +03:00 by zhus · 1 comment
Owner

Originally created by @rbtcollins on GitHub (May 23, 2023).
Original GitHub issue: https://github.com/bootandy/dust/issues/327

Hi, just would like to understand the apparent-size hardlink limitation.

I imagine its something to do with tracking potentially enormous numbers of links, and the possible race conditions (added links, removed links that can drive memory usage)?

OTOH a pretty decent and very easy/fast algorithm is for files with a link count greater than 1, to hash the volume id and inode id together, and store the link_count (u32 will do); then when a file is seen that collides, decrement the link_count, and finally remove the entry when it hits zero. That can be tweaked in a bunch of ways (e.g. a sparse set rather than a hash, do it by-volume to allow not storing the volume id redundantly...)

So you have memory proportional to links whose full members you haven't observed. For file systems with links being deleted, that memory will 'leak' proportional to the number of distinct inodes with unresolved link counts that are deleted. For file systems with new links being made, those new links are counted as whole files, which isn't unreasonable since that can race in any arbitrary order a tool might iterate the fs.

Originally created by @rbtcollins on GitHub (May 23, 2023). Original GitHub issue: https://github.com/bootandy/dust/issues/327 Hi, just would like to understand the apparent-size hardlink limitation. I imagine its something to do with tracking potentially enormous numbers of links, and the possible race conditions (added links, removed links that can drive memory usage)? OTOH a pretty decent and very easy/fast algorithm is for files with a link count greater than 1, to hash the volume id and inode id together, and store the link_count (u32 will do); then when a file is seen that collides, decrement the link_count, and finally remove the entry when it hits zero. That can be tweaked in a bunch of ways (e.g. a sparse set rather than a hash, do it by-volume to allow not storing the volume id redundantly...) So you have memory proportional to links whose full members you haven't observed. For file systems with links being deleted, that memory will 'leak' proportional to the number of distinct inodes with unresolved link counts that are deleted. For file systems with new links being made, those new links are counted as whole files, which isn't unreasonable since that can race in any arbitrary order a tool might iterate the fs.
zhus closed this issue 2026-06-08 11:25:52 +03:00
Author
Owner

@bootandy commented on GitHub (May 24, 2023):

So i don't see this as a limitation and more of a feature.

I assume that if you are using 'apparent-size' you want to know how long the file appears to be. For me this means you shouldn't have to care if it is a symlink or not - you only want to know how long it appears to be. Hence I choose --apparent-size to work like it does.

<!-- gh-comment-id:1560559741 --> @bootandy commented on GitHub (May 24, 2023): So i don't see this as a limitation and more of a feature. I assume that if you are using 'apparent-size' you want to know how long the file appears to be. For me this means you shouldn't have to care if it is a symlink or not - you only want to know how long it appears to be. Hence I choose --apparent-size to work like it does.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: bootandy/archived-dust#144