Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits

Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit set can
accelerate the write zeroes operation by placing disk blocks into a
deallocated state, which opportunistically avoids writing zeroes to
media while still guaranteeing that subsequent reads from the specified
block range will return zeroed data. This is a best-effort optimization,
not a mandatory requirement: some devices may partially fall back to
writing physical zeroes due to factors such as misalignment or being
asked to clear a block range smaller than the device's internal
allocation unit. Therefore, the speed of this operation is not
guaranteed.

It is difficult to determine whether a storage device supports the
unmap write zeroes operation; querying
bdev_limits(bdev)->max_write_zeroes_sectors alone is not enough.
Therefore, first add a new hardware queue limit parameter,
max_hw_wzeroes_unmap_sectors, to indicate whether a device supports
this unmap write zeroes operation. Then add two counterpart software
queue limits, max_wzeroes_unmap_sectors and
max_user_wzeroes_unmap_sectors, which allow users to disable this
operation if it is very slow on some special devices.
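
The relationship between the three limits can be modelled in userspace.
The following is a minimal sketch, not kernel code: struct wz_limits and
wz_validate() are hypothetical names, and the logic only mirrors the
check this patch adds to blk_validate_limits() in the diff below.

```c
#include <limits.h>

/* Simplified stand-in for the relevant fields of struct queue_limits. */
struct wz_limits {
	unsigned int max_write_zeroes_sectors;
	unsigned int max_hw_wzeroes_unmap_sectors;   /* set by the driver */
	unsigned int max_user_wzeroes_unmap_sectors; /* set via sysfs */
	unsigned int max_wzeroes_unmap_sectors;      /* derived value */
};

/*
 * Hypothetical helper modelling the new validation step.
 * The hardware limit must be either 0 (unmap write zeroes unsupported)
 * or exactly equal to max_write_zeroes_sectors. The effective limit is
 * the smaller of the hardware and user limits, so a user value of 0
 * disables the operation and UINT_MAX leaves it unrestricted.
 */
static int wz_validate(struct wz_limits *lim)
{
	unsigned int hw = lim->max_hw_wzeroes_unmap_sectors;
	unsigned int user = lim->max_user_wzeroes_unmap_sectors;

	if (hw && hw != lim->max_write_zeroes_sectors)
		return -1; /* the kernel returns -EINVAL here */

	lim->max_wzeroes_unmap_sectors = hw < user ? hw : user;
	return 0;
}
```

In this model a device that reports a hardware limit of 0 always ends up
with an effective limit of 0, regardless of the user setting.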

Finally, for stacked devices, initialize the hardware and user limits
(max_hw_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors) to
UINT_MAX. The operation is enabled only if the stacking driver and all
underlying devices enable it.
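
The stacking rule can be illustrated with a small userspace sketch. It
assumes the behaviour visible in the blk_stack_limits() hunk below (each
limit is reduced with min() against every underlying device, starting
from UINT_MAX); struct wz_unmap_limits and wz_stack() are hypothetical
names used only for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical pair of limits carried by each device in the stack. */
struct wz_unmap_limits {
	unsigned int hw;   /* models max_hw_wzeroes_unmap_sectors */
	unsigned int user; /* models max_user_wzeroes_unmap_sectors */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Fold the limits of n underlying devices into the stacked device.
 * The top-level limits start at UINT_MAX, so any bottom device that
 * reports 0 forces the stacked value to 0.
 */
static struct wz_unmap_limits
wz_stack(const struct wz_unmap_limits *bottoms, size_t n)
{
	struct wz_unmap_limits top = { UINT_MAX, UINT_MAX };
	size_t i;

	for (i = 0; i < n; i++) {
		top.hw = min_u(top.hw, bottoms[i].hw);
		top.user = min_u(top.user, bottoms[i].user);
	}
	return top;
}
```

Starting from UINT_MAX rather than 0 matters here: with min() as the
combining step, a starting value of 0 would disable the operation on
every stacked device.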

Thanks to Martin K. Petersen for optimizing the documentation of the
write_zeroes_unmap sysfs interface.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-2-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

Authored by Zhang Yi and committed by Christian Brauner
0c40d7cb e04c78d8

+87 -2
+33
Documentation/ABI/stable/sysfs-block
···
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RO] This file indicates whether a device supports zeroing data
+		in a specified block range without incurring the cost of
+		physically writing zeroes to the media for each individual
+		block. If this parameter is set to write_zeroes_max_bytes, the
+		device implements a zeroing operation which opportunistically
+		avoids writing zeroes to media while still guaranteeing that
+		subsequent reads from the specified block range will return
+		zeroed data. This operation is a best-effort optimization, a
+		device may fall back to physically writing zeroes to the media
+		due to other factors such as misalignment or being asked to
+		clear a block range smaller than the device's internal
+		allocation unit. If this parameter is set to 0, the device may
+		have to write each logical block media during a zeroing
+		operation.
+
+
+What:		/sys/block/<disk>/queue/write_zeroes_unmap_max_bytes
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RW] While write_zeroes_unmap_max_hw_bytes is the hardware limit
+		for the device, this setting is the software limit. Since the
+		unmap write zeroes operation is a best-effort optimization, some
+		devices may still physically writing zeroes to media. So the
+		speed of this operation is not guaranteed. Writing a value of
+		'0' to this file disables this operation. Otherwise, this
+		parameter should be equal to write_zeroes_unmap_max_hw_bytes.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
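
A userspace tool might consume the new attribute roughly as follows.
This is a sketch under stated assumptions: only the attribute name
write_zeroes_unmap_max_bytes comes from the ABI text above, while
read_queue_attr() and wz_unmap_enabled() are hypothetical helpers, and
the queue directory (for example /sys/block/<disk>/queue) is supplied by
the caller.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success, or -1 if the file cannot be opened or parsed
 * (for instance on a kernel that does not provide the attribute).
 */
static int read_queue_attr(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%llu", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Returns 1 if the effective unmap write zeroes limit is nonzero,
 * 0 if the operation is unsupported or disabled, and -1 on error.
 * A nonzero value is only a hint: the device may still fall back to
 * physically writing zeroes.
 */
static int wz_unmap_enabled(const char *queue_dir)
{
	char path[4096];
	unsigned long long bytes;

	snprintf(path, sizeof(path), "%s/write_zeroes_unmap_max_bytes",
		 queue_dir);
	if (read_queue_attr(path, &bytes) < 0)
		return -1;
	return bytes != 0;
}
```

Treating a missing file as an error rather than as "disabled" lets the
caller tell an older kernel apart from a device that reports 0.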
+18 -2
block/blk-settings.c
···
 	lim->max_sectors = UINT_MAX;
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_hw_wzeroes_unmap_sectors = UINT_MAX;
+	lim->max_user_wzeroes_unmap_sectors = UINT_MAX;
 	lim->max_hw_zone_append_sectors = UINT_MAX;
 	lim->max_user_discard_sectors = UINT_MAX;
 }
···
 	if (!lim->max_segments)
 		lim->max_segments = BLK_MAX_SEGMENTS;
 
+	if (lim->max_hw_wzeroes_unmap_sectors &&
+	    lim->max_hw_wzeroes_unmap_sectors != lim->max_write_zeroes_sectors)
+		return -EINVAL;
+	lim->max_wzeroes_unmap_sectors = min(lim->max_hw_wzeroes_unmap_sectors,
+			lim->max_user_wzeroes_unmap_sectors);
+
 	lim->max_discard_sectors =
 		min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors);
 
···
 {
 	/*
 	 * Most defaults are set by capping the bounds in blk_validate_limits,
-	 * but max_user_discard_sectors is special and needs an explicit
-	 * initialization to the max value here.
+	 * but these limits are special and need an explicit initialization to
+	 * the max value here.
 	 */
 	lim->max_user_discard_sectors = UINT_MAX;
+	lim->max_user_wzeroes_unmap_sectors = UINT_MAX;
 	return blk_validate_limits(lim);
 }
 
···
 	t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_user_wzeroes_unmap_sectors =
+			min(t->max_user_wzeroes_unmap_sectors,
+			    b->max_user_wzeroes_unmap_sectors);
+	t->max_hw_wzeroes_unmap_sectors =
+			min(t->max_hw_wzeroes_unmap_sectors,
+			    b->max_hw_wzeroes_unmap_sectors);
+
 	t->max_hw_zone_append_sectors = min(t->max_hw_zone_append_sectors,
 					b->max_hw_zone_append_sectors);
 
+26
block/blk-sysfs.c
···
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_discard_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_discard_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_write_zeroes_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_wzeroes_unmap_sectors)
+QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_wzeroes_unmap_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_max_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_boundary_sectors)
 QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_zone_append_sectors)
···
 		return -EINVAL;
 
 	lim->max_user_discard_sectors = max_discard_bytes >> SECTOR_SHIFT;
+	return 0;
+}
+
+static int queue_max_wzeroes_unmap_sectors_store(struct gendisk *disk,
+		const char *page, size_t count, struct queue_limits *lim)
+{
+	unsigned long max_zeroes_bytes, max_hw_zeroes_bytes;
+	ssize_t ret;
+
+	ret = queue_var_store(&max_zeroes_bytes, page, count);
+	if (ret < 0)
+		return ret;
+
+	max_hw_zeroes_bytes = lim->max_hw_wzeroes_unmap_sectors << SECTOR_SHIFT;
+	if (max_zeroes_bytes != 0 && max_zeroes_bytes != max_hw_zeroes_bytes)
+		return -EINVAL;
+
+	lim->max_user_wzeroes_unmap_sectors = max_zeroes_bytes >> SECTOR_SHIFT;
 	return 0;
 }
 
···
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_max_hw_wzeroes_unmap_sectors,
+		"write_zeroes_unmap_max_hw_bytes");
+QUEUE_LIM_RW_ENTRY(queue_max_wzeroes_unmap_sectors,
+		"write_zeroes_unmap_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
···
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_max_hw_wzeroes_unmap_sectors_entry.attr,
+	&queue_max_wzeroes_unmap_sectors_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,
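
The acceptance rule of the new store handler can be exercised outside
the kernel. The sketch below is a simplified model and not the kernel
function: wz_unmap_store() is a hypothetical name, and unlike the hunk
above it widens the sector count to unsigned long long before shifting,
which is a choice of this sketch.

```c
#define SECTOR_SHIFT 9 /* 512-byte sectors, as in the kernel */

/*
 * Model of writing a byte count to write_zeroes_unmap_max_bytes.
 * Only two values are accepted: 0 (disable the operation) or exactly
 * the hardware limit expressed in bytes. On success the user limit is
 * updated in sectors; on rejection it is left unchanged.
 */
static int wz_unmap_store(unsigned long long bytes,
			  unsigned int hw_sectors,
			  unsigned int *user_sectors)
{
	unsigned long long hw_bytes =
		(unsigned long long)hw_sectors << SECTOR_SHIFT;

	if (bytes != 0 && bytes != hw_bytes)
		return -1; /* the kernel returns -EINVAL here */

	*user_sectors = (unsigned int)(bytes >> SECTOR_SHIFT);
	return 0;
}
```

Because intermediate values are rejected, the attribute behaves as an
on/off switch rather than as a tunable size.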
+10
include/linux/blkdev.h
···
 	unsigned int		max_user_discard_sectors;
 	unsigned int		max_secure_erase_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_wzeroes_unmap_sectors;
+	unsigned int		max_hw_wzeroes_unmap_sectors;
+	unsigned int		max_user_wzeroes_unmap_sectors;
 	unsigned int		max_hw_zone_append_sectors;
 	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
···
 static inline void blk_queue_disable_write_zeroes(struct request_queue *q)
 {
 	q->limits.max_write_zeroes_sectors = 0;
+	q->limits.max_wzeroes_unmap_sectors = 0;
 }
 
 /*
···
 static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 {
 	return bdev_limits(bdev)->max_write_zeroes_sectors;
+}
+
+static inline unsigned int
+bdev_write_zeroes_unmap_sectors(struct block_device *bdev)
+{
+	return bdev_limits(bdev)->max_wzeroes_unmap_sectors;
 }
 
 static inline bool bdev_nonrot(struct block_device *bdev)