Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

md: allow configuring logical block size

Previously, a raid array used the maximum logical block size (LBS)
of all member disks. Adding a disk with a larger LBS at runtime could
unexpectedly increase the array's LBS, risking corruption of existing
partitions. This can be reproduced by:

```
# LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean

# LBS is 512
cat /sys/block/md0/queue/logical_block_size

# create partition md0p1
parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
lsblk | grep md0p1

# LBS becomes 4096 after adding sdf
mdadm --add -q /dev/md0 /dev/sdf
cat /sys/block/md0/queue/logical_block_size

# partition lost
partprobe /dev/md0
lsblk | grep md0p1
```
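The jump from 512 to 4096 in the reproduction above follows from how the block layer stacks queue limits: the array's logical block size ends up as the maximum LBS among its members. A minimal Python model of that rule (the function name is illustrative, not a kernel API):

```python
def stacked_lbs(member_lbs):
    """Model of block-layer limit stacking: the array's logical
    block size is the maximum LBS among all member disks."""
    return max(member_lbs)

# md0 built from two 512-byte-LBS disks (sdd, sde)
assert stacked_lbs([512, 512]) == 512

# hot-adding a 4096-byte-LBS disk (sdf) silently bumps the array LBS
assert stacked_lbs([512, 512, 4096]) == 4096
```

Partitions created while the array reported 512-byte sectors are then reinterpreted against 4096-byte sectors, which is why the partition table is no longer found.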

Simply rejecting disks with a larger LBS is inflexible. In some
scenarios, only disks with a 512-byte LBS are available at creation
time, but disks with a 4 KiB LBS may be added to the array later.

Making the LBS configurable is the best way to handle this scenario.
After this patch, the raid will:
- store the LBS in the on-disk metadata
- add a read-write sysfs attribute 'mdX/logical_block_size'

Future mdadm versions should support setting the LBS via the metadata
field during RAID creation and via the new sysfs attribute. Although the
kernel allows runtime LBS changes, users should avoid modifying it after
creating partitions or filesystems to prevent compatibility issues.
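The write path enforces these constraints: only 1.x metadata, only before the array is started, and the value must exceed the LBS the queue already imposes. A rough Python model of the checks the new sysfs store performs (the dict fields and errno-style return codes are illustrative stand-ins for the kernel structures):

```python
import errno

def lbs_store(mddev, value):
    """Model of the checks on a write to mdX/logical_block_size.
    Returns 0 on success or a negative errno, mirroring kernel style."""
    if mddev["major_version"] == 0:   # only 1.x metadata supported
        return -errno.EINVAL
    if mddev["running"]:              # must be set before the array starts
        return -errno.EBUSY
    try:
        lbs = int(value)
    except ValueError:
        return -errno.EINVAL
    if lbs <= mddev["queue_lbs"]:     # cannot set at or below the current queue LBS
        return -errno.EINVAL
    mddev["logical_block_size"] = lbs # recorded; written to the superblock later
    return 0

md = {"major_version": 1, "running": False, "queue_lbs": 512,
      "logical_block_size": 0}
assert lbs_store(md, "4096") == 0 and md["logical_block_size"] == 4096
```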

Only 1.x metadata supports a configurable LBS. 0.90 metadata
initializes all fields to default values at auto-detect; supporting it
would require more extensive changes, and no such use case has been
observed.

Note that many RAID paths rely on PAGE_SIZE alignment, including for
metadata I/O. An LBS larger than PAGE_SIZE would cause metadata
read/write failures, so such a configuration is rejected.
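This leaves a narrow validity window. A sketch, assuming the usual block-layer conventions that a logical block size is a power of two of at least 512 bytes (the power-of-two and 512-byte floor are general block-layer rules, not something this patch spells out; PAGE_SIZE is architecture-dependent, 4096 here):

```python
PAGE_SIZE = 4096  # typical value; architecture-dependent in the kernel

def lbs_acceptable(lbs, page_size=PAGE_SIZE):
    """True if an LBS would be accepted: a power of two,
    at least 512 bytes, and no larger than the page size."""
    is_pow2 = lbs >= 512 and (lbs & (lbs - 1)) == 0
    return is_pow2 and lbs <= page_size

assert lbs_acceptable(512)
assert lbs_acceptable(4096)
assert not lbs_acceptable(8192)  # > PAGE_SIZE: metadata I/O would fail
```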

Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>

Authored by Li Nan, committed by Yu Kuai (62ed1b58 9c47127a)

+95 -1
Documentation/admin-guide/md.rst (+10)
```diff
···
   the number of devices in a raid4/5/6, or to support external
   metadata formats which mandate such clipping.
 
+  logical_block_size
+     Configure the array's logical block size in bytes. This attribute
+     is only supported for 1.x metadata. Write the value before starting
+     the array. The final array LBS uses the maximum of this
+     configuration and the LBS of all member devices. Note that the
+     LBS cannot exceed PAGE_SIZE until RAID supports folios.
+     WARNING: arrays created on a new kernel cannot be assembled on an
+     old kernel due to the padding check. Set the module parameter
+     'check_new_feature' to false to bypass it, but data loss may occur.
+
   reshape_position
     This is either ``none`` or a sector number within the devices of
     the array where ``reshape`` is up to. If this is set, the three
```
drivers/md/md-linear.c (+1)
```diff
···
 
 	md_init_stacking_limits(&lim);
 	lim.max_hw_sectors = mddev->chunk_sectors;
+	lim.logical_block_size = mddev->logical_block_size;
 	lim.max_write_zeroes_sectors = mddev->chunk_sectors;
 	lim.max_hw_wzeroes_unmap_sectors = mddev->chunk_sectors;
 	lim.io_min = mddev->chunk_sectors << 9;
```
drivers/md/md.c (+77)
```diff
···
 	mddev->layout = le32_to_cpu(sb->layout);
 	mddev->raid_disks = le32_to_cpu(sb->raid_disks);
 	mddev->dev_sectors = le64_to_cpu(sb->size);
+	mddev->logical_block_size = le32_to_cpu(sb->logical_block_size);
 	mddev->events = ev1;
 	mddev->bitmap_info.offset = 0;
 	mddev->bitmap_info.space = 0;
···
 	sb->chunksize = cpu_to_le32(mddev->chunk_sectors);
 	sb->level = cpu_to_le32(mddev->level);
 	sb->layout = cpu_to_le32(mddev->layout);
+	sb->logical_block_size = cpu_to_le32(mddev->logical_block_size);
 	if (test_bit(FailFast, &rdev->flags))
 		sb->devflags |= FailFast1;
 	else
···
 __ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
        serialize_policy_store);
 
+static int mddev_set_logical_block_size(struct mddev *mddev,
+					unsigned int lbs)
+{
+	int err = 0;
+	struct queue_limits lim;
+
+	if (queue_logical_block_size(mddev->gendisk->queue) >= lbs) {
+		pr_err("%s: Cannot set LBS smaller than mddev LBS %u\n",
+		       mdname(mddev), lbs);
+		return -EINVAL;
+	}
+
+	lim = queue_limits_start_update(mddev->gendisk->queue);
+	lim.logical_block_size = lbs;
+	pr_info("%s: logical_block_size is changed, data may be lost\n",
+		mdname(mddev));
+	err = queue_limits_commit_update(mddev->gendisk->queue, &lim);
+	if (err)
+		return err;
+
+	mddev->logical_block_size = lbs;
+	/* New lbs will be written to superblock after array is running */
+	set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
+	return 0;
+}
+
+static ssize_t
+lbs_show(struct mddev *mddev, char *page)
+{
+	return sprintf(page, "%u\n", mddev->logical_block_size);
+}
+
+static ssize_t
+lbs_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	unsigned int lbs;
+	int err = -EBUSY;
+
+	/* Only 1.x meta supports configurable LBS */
+	if (mddev->major_version == 0)
+		return -EINVAL;
+
+	if (mddev->pers)
+		return -EBUSY;
+
+	err = kstrtouint(buf, 10, &lbs);
+	if (err < 0)
+		return -EINVAL;
+
+	err = mddev_lock(mddev);
+	if (err)
+		goto unlock;
+
+	err = mddev_set_logical_block_size(mddev, lbs);
+
+unlock:
+	mddev_unlock(mddev);
+	return err ?: len;
+}
+
+static struct md_sysfs_entry md_logical_block_size =
+__ATTR(logical_block_size, 0644, lbs_show, lbs_store);
 
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
···
 	&md_consistency_policy.attr,
 	&md_fail_last_dev.attr,
 	&md_serialize_policy.attr,
+	&md_logical_block_size.attr,
 	NULL,
 };
···
 	    !queue_limits_stack_integrity_bdev(lim, rdev->bdev))
 		return -EINVAL;
 	}
+
+	/*
+	 * Before RAID adding folio support, the logical_block_size
+	 * should be smaller than the page size.
+	 */
+	if (lim->logical_block_size > PAGE_SIZE) {
+		pr_err("%s: logical_block_size must not larger than PAGE_SIZE\n",
+		       mdname(mddev));
+		return -EINVAL;
+	}
+	mddev->logical_block_size = lim->logical_block_size;
 
 	return 0;
 }
···
 	mddev->chunk_sectors = 0;
 	mddev->ctime = mddev->utime = 0;
 	mddev->layout = 0;
+	mddev->logical_block_size = 0;
 	mddev->max_disks = 0;
 	mddev->events = 0;
 	mddev->can_decrease_events = 0;
```
drivers/md/md.h (+1)
```diff
···
 	sector_t	array_sectors;	/* exported array size */
 	int		external_size;	/* size managed
 					 * externally */
+	unsigned int	logical_block_size;
 	__u64		events;
 	/* If the last 'event' was simply a clean->dirty transition, and
 	 * we didn't write it to the spares, then it is safe and simple
```
drivers/md/raid0.c (+1)
```diff
···
 	lim.max_hw_sectors = mddev->chunk_sectors;
 	lim.max_write_zeroes_sectors = mddev->chunk_sectors;
 	lim.max_hw_wzeroes_unmap_sectors = mddev->chunk_sectors;
+	lim.logical_block_size = mddev->logical_block_size;
 	lim.io_min = mddev->chunk_sectors << 9;
 	lim.io_opt = lim.io_min * mddev->raid_disks;
 	lim.chunk_sectors = mddev->chunk_sectors;
```
drivers/md/raid1.c (+1)
```diff
···
 	md_init_stacking_limits(&lim);
 	lim.max_write_zeroes_sectors = 0;
 	lim.max_hw_wzeroes_unmap_sectors = 0;
+	lim.logical_block_size = mddev->logical_block_size;
 	lim.features |= BLK_FEAT_ATOMIC_WRITES;
 	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
 	if (err)
```
drivers/md/raid10.c (+1)
```diff
···
 	md_init_stacking_limits(&lim);
 	lim.max_write_zeroes_sectors = 0;
 	lim.max_hw_wzeroes_unmap_sectors = 0;
+	lim.logical_block_size = mddev->logical_block_size;
 	lim.io_min = mddev->chunk_sectors << 9;
 	lim.chunk_sectors = mddev->chunk_sectors;
 	lim.io_opt = lim.io_min * raid10_nr_stripes(conf);
```
drivers/md/raid5.c (+1)
```diff
···
 	stripe = roundup_pow_of_two(data_disks * (mddev->chunk_sectors << 9));
 
 	md_init_stacking_limits(&lim);
+	lim.logical_block_size = mddev->logical_block_size;
 	lim.io_min = mddev->chunk_sectors << 9;
 	lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
 	lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE;
```
include/uapi/linux/raid/md_p.h (+2 -1)
```diff
···
 	__le64	resync_offset;	/* data before this offset (from data_offset) known to be in sync */
 	__le32	sb_csum;	/* checksum up to devs[max_dev] */
 	__le32	max_dev;	/* size of devs[] array to consider */
-	__u8	pad3[64-32];	/* set to 0 when writing */
+	__le32	logical_block_size;	/* same as q->limits->logical_block_size */
+	__u8	pad3[64-36];	/* set to 0 when writing */
 
 	/* device state information. Indexed by dev_number.
 	 * 2 bytes per device
```
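The new field is carved out of the existing pad, so the superblock stays the same size and every later field keeps its offset; what an older kernel notices is that a formerly all-zero pad region is now used, which is the padding check the documentation warning refers to. The arithmetic:

```python
OLD_PAD = 64 - 32   # pad3 before the patch: 32 bytes
NEW_FIELD = 4       # the added __le32 logical_block_size
NEW_PAD = 64 - 36   # pad3 after the patch: 28 bytes

# the new field plus the shrunken pad occupy exactly the old pad,
# so sb_csum, max_dev, and the devs[] array keep their offsets
assert NEW_FIELD + NEW_PAD == OLD_PAD
```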