Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

ext4, doc: fix and improve directory hash tree description

Some of the details about how directory hash trees work were confusing or
outright wrong; this patch should fix those.

A note on dx_tail's dt_reserved member: as far as I can tell, the kernel
never sets this explicitly, so its content is apparently left over from
whatever was there before (for the dx_root I've seen remnants of an
ext4_dir_entry_tail struct from when the dir was not yet a hash dir).

Signed-off-by: Zeno Endemann <zeno.endemann@mailbox.org>
Message-ID: <20250925152435.22749-1-zeno.endemann@mailbox.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Authored by Zeno Endemann and committed by Theodore Ts'o
4b471b73 328a782c

+31 -30
Documentation/filesystems/ext4/directory.rst
@@ -183,10 +183,10 @@
      - det_checksum
      - Directory leaf block checksum.
 
-The leaf directory block checksum is calculated against the FS UUID, the
-directory's inode number, the directory's inode generation number, and
-the entire directory entry block up to (but not including) the fake
-directory entry.
+The leaf directory block checksum is calculated against the FS UUID (or
+the checksum seed, if that feature is enabled for the fs), the directory's
+inode number, the directory's inode generation number, and the entire
+directory entry block up to (but not including) the fake directory entry.
 
 Hash Tree Directories
 ~~~~~~~~~~~~~~~~~~~~~
@@ -196,12 +196,12 @@
 balanced tree keyed off a hash of the directory entry name. If the
 EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a
 hashed btree (htree) to organize and find directory entries. For
-backwards read-only compatibility with ext2, this tree is actually
-hidden inside the directory file, masquerading as “empty” directory data
-blocks! It was stated previously that the end of the linear directory
-entry table was signified with an entry pointing to inode 0; this is
-(ab)used to fool the old linear-scan algorithm into thinking that the
-rest of the directory block is empty so that it moves on.
+backwards read-only compatibility with ext2, interior tree nodes are actually
+hidden inside the directory file, masquerading as “empty” directory entries
+spanning the whole block. It was stated previously that directory entries
+with the inode set to 0 are treated as unused entries; this is (ab)used to
+fool the old linear-scan algorithm into skipping over those blocks containing
+the interior tree node data.
 
 The root of the tree always lives in the first data block of the
 directory. By ext2 custom, the '.' and '..' entries must appear at the
@@ -209,24 +209,24 @@
 ``struct ext4_dir_entry_2`` s and not stored in the tree. The rest of
 the root node contains metadata about the tree and finally a hash->block
 map to find nodes that are lower in the htree. If
-``dx_root.info.indirect_levels`` is non-zero then the htree has two
-levels; the data block pointed to by the root node's map is an interior
-node, which is indexed by a minor hash. Interior nodes in this tree
-contains a zeroed out ``struct ext4_dir_entry_2`` followed by a
-minor_hash->block map to find leafe nodes. Leaf nodes contain a linear
-array of all ``struct ext4_dir_entry_2``; all of these entries
-(presumably) hash to the same value. If there is an overflow, the
-entries simply overflow into the next leaf node, and the
-least-significant bit of the hash (in the interior node map) that gets
-us to this next leaf node is set.
+``dx_root.info.indirect_levels`` is non-zero then the htree has that many
+levels and the blocks pointed to by the root node's map are interior nodes.
+These interior nodes have a zeroed out ``struct ext4_dir_entry_2`` followed by
+a hash->block map to find nodes of the next level. Leaf nodes look like
+classic linear directory blocks, but all of its entries have a hash value
+equal or greater than the indicated hash of the parent node.
 
-To traverse the directory as a htree, the code calculates the hash of
-the desired file name and uses it to find the corresponding block
-number. If the tree is flat, the block is a linear array of directory
-entries that can be searched; otherwise, the minor hash of the file name
-is computed and used against this second block to find the corresponding
-third block number. That third block number will be a linear array of
-directory entries.
+The actual hash value for an entry name is only 31 bits, the least-significant
+bit is set to 0. However, if there is a hash collision between directory
+entries, the least-significant bit may get set to 1 on interior nodes in the
+case where these two (or more) hash-colliding entries do not fit into one leaf
+node and must be split across multiple nodes.
+
+To look up a name in such a htree, the code calculates the hash of the desired
+file name and uses it to find the leaf node with the range of hash values the
+calculated hash falls into (in other words, a lookup works basically the same
+as it would in a B-Tree keyed by the hash value), and possibly also scanning
+the leaf nodes that follow (in tree order) in case of hash collisions.
 
 To traverse the directory as a linear array (such as the old code does),
 the code simply reads every data block in the directory. The blocks used
@@ -319,7 +319,8 @@
    * - 0x24
      - __le32
      - block
-     - The block number (within the directory file) that goes with hash=0.
+     - The block number (within the directory file) that lead to the left-most
+       leaf node, i.e. the leaf containing entries with the lowest hash values.
    * - 0x28
      - struct dx_entry
      - entries[0]
@@ -443,7 +442,7 @@
    * - 0x0
      - u32
      - dt_reserved
-     - Zero.
+     - Unused (but still part of the checksum curiously).
    * - 0x4
      - __le32
      - dt_checksum
@@ -451,4 +450,4 @@
 
 The checksum is calculated against the FS UUID, the htree index header
 (dx_root or dx_node), all of the htree indices (dx_entry) that are in
-use, and the tail block (dx_tail).
+use, and the tail block (dx_tail) with the dt_checksum initially set to 0.
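
The lookup described in the rewritten text can be sketched roughly as follows. This is a simplified illustration in Python, not the kernel's code: the in-memory layout (`index_nodes`, `leaves`), the `depth` parameter (assumed here to be `indirect_levels + 1`) and the function names are all hypothetical, and hash values are taken as given rather than computed with one of ext4's hash functions.

```python
from bisect import bisect_right


def candidate_leaves(index_nodes, block, depth, target):
    """Yield leaf block numbers that may hold `target`, in tree order.

    index_nodes: {block: [(hash, child_block), ...]}, sorted by hash
                 (hypothetical stand-in for the on-disk hash->block maps).
    depth:       index levels at or below `block`; 1 means children are leaves.
    target:      name hash with the least-significant bit cleared.
    """
    entries = index_nodes[block]
    # The first entry covers the lowest hashes of the node, so its hash is
    # not compared. Pick the last entry whose hash is <= target.
    i = bisect_right([h for h, _ in entries[1:]], target)
    while True:
        child = entries[i][1]
        if depth == 1:
            yield child
        else:
            yield from candidate_leaves(index_nodes, child, depth - 1, target)
        i += 1
        if i == len(entries):
            return
        nxt = entries[i][0]
        # A set low bit marks a collision chain that was split across nodes;
        # keep going only while it continues the hash we are looking for.
        if not (nxt & 1 and (nxt & ~1) == target):
            return


def lookup(index_nodes, leaves, root, depth, name, name_hash):
    """Return the leaf block containing `name`, or None if it is absent."""
    target = name_hash & ~1
    for blk in candidate_leaves(index_nodes, root, depth, target):
        if name in leaves[blk]:  # linear scan of the leaf's entries
            return blk
    return None


# Made-up example: names "b", "c" and "d" all hash to 0x20 and did not fit
# into one leaf, so the entry for block 3 carries the hash 0x21.
index_nodes = {0: [(0x00, 1), (0x20, 2), (0x21, 3), (0x40, 4)]}
leaves = {1: ["a"], 2: ["b", "c"], 3: ["d", "e"], 4: ["f"]}

print(list(candidate_leaves(index_nodes, 0, 1, 0x20)))  # [2, 3]
print(lookup(index_nodes, leaves, 0, 1, "d", 0x20))     # 3
```

Because the lookup hash always has its low bit cleared, the search lands on the first leaf of a collision chain and then walks forward, which is why entries with the low bit set are never chosen as the starting point.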