Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux

+1

.mailmap

@@ -690,6 +690,7 @@
 Vlad Dogaru <ddvlad@gmail.com> <vlad.dogaru@intel.com>
 Vladimir Davydov <vdavydov.dev@gmail.com> <vdavydov@parallels.com>
 Vladimir Davydov <vdavydov.dev@gmail.com> <vdavydov@virtuozzo.com>
+Weiwen Hu <huweiwen@linux.alibaba.com> <sehuww@mail.scut.edu.cn>
 WeiXiong Liao <gmpy.liaowx@gmail.com> <liaoweixiong@allwinnertech.com>
 Wen Gong <quic_wgong@quicinc.com> <wgong@codeaurora.org>
 Wesley Cheng <quic_wcheng@quicinc.com> <wcheng@codeaurora.org>

+53

Documentation/ABI/stable/sysfs-block

@@ -21,6 +21,59 @@
 		device is offset from the internal allocation unit's
 		natural alignment.
 
+What:		/sys/block/<disk>/atomic_write_max_bytes
+Date:		February 2024
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter specifies the maximum atomic write
+		size reported by the device. This parameter is relevant
+		for merging of writes, where a merged atomic write
+		operation must not exceed this number of bytes.
+		This parameter may be greater than the value in
+		atomic_write_unit_max_bytes as
+		atomic_write_unit_max_bytes will be rounded down to a
+		power-of-two and atomic_write_unit_max_bytes may also be
+		limited by some other queue limits, such as max_segments.
+		This parameter - along with atomic_write_unit_min_bytes
+		and atomic_write_unit_max_bytes - will not be larger than
+		max_hw_sectors_kb, but may be larger than max_sectors_kb.
+
+
+What:		/sys/block/<disk>/atomic_write_unit_min_bytes
+Date:		February 2024
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter specifies the smallest block which can
+		be written atomically with an atomic write operation. All
+		atomic write operations must begin at a
+		atomic_write_unit_min boundary and must be multiples of
+		atomic_write_unit_min. This value must be a power-of-two.
+
+
+What:		/sys/block/<disk>/atomic_write_unit_max_bytes
+Date:		February 2024
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter defines the largest block which can be
+		written atomically with an atomic write operation. This
+		value must be a multiple of atomic_write_unit_min and must
+		be a power-of-two. This value will not be larger than
+		atomic_write_max_bytes.
+
+
+What:		/sys/block/<disk>/atomic_write_boundary_bytes
+Date:		February 2024
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] A device may need to internally split an atomic write I/O
+		which straddles a given logical block address boundary. This
+		parameter specifies the size in bytes of the atomic boundary if
+		one is reported by the device. This value must be a
+		power-of-two and at least the size as in
+		atomic_write_unit_max_bytes.
+		Any attempt to merge atomic write I/Os must not result in a
+		merged I/O which crosses this boundary (if any).
+
 
 What:		/sys/block/<disk>/diskseq
 Date:		February 2021
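
The relationships described by these attributes can be sketched in user space. The following is a minimal illustration under stated assumptions, not kernel code: the helper names `rounddown_pow2` and `crosses_boundary` are hypothetical, and a boundary of 0 is taken to mean that the device reports no boundary. It shows why atomic_write_unit_max_bytes can be smaller than atomic_write_max_bytes (rounding down to a power-of-two) and how a write that straddles the atomic boundary can be detected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: largest power of two that is <= v (v must be > 0). */
static uint64_t rounddown_pow2(uint64_t v)
{
	uint64_t p = 1;

	while (p <= v / 2)
		p *= 2;
	return p;
}

/*
 * Hypothetical helper: true if the byte range [offset, offset + len)
 * straddles a multiple of boundary. A boundary of 0 is assumed to mean
 * "no boundary reported".
 */
static bool crosses_boundary(uint64_t offset, uint64_t len, uint64_t boundary)
{
	if (boundary == 0 || len == 0)
		return false;
	return offset / boundary != (offset + len - 1) / boundary;
}
```

For example, a device reporting a maximum of 192 KiB would, after rounding alone, yield a unit maximum of 128 KiB; other queue limits may reduce it further, as the description above notes.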

+3 -46

Documentation/block/data-integrity.rst

··· 153 153 4.2 Block Device 154 154 ---------------- 155 155 156 - Because the format of the protection data is tied to the physical 157 - disk, each block device has been extended with a block integrity 158 - profile (struct blk_integrity). This optional profile is registered 159 - with the block layer using blk_integrity_register(). 160 - 161 - The profile contains callback functions for generating and verifying 162 - the protection data, as well as getting and setting application tags. 163 - The profile also contains a few constants to aid in completing, 164 - merging and splitting the integrity metadata. 156 + Block devices can set up the integrity information in the integrity 157 + sub-struture of the queue_limits structure. 165 158 166 159 Layered block devices will need to pick a profile that's appropriate 167 - for all subdevices. blk_integrity_compare() can help with that. DM 160 + for all subdevices. queue_limits_stack_integrity() can help with that. DM 168 161 and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6 169 162 will require extra work due to the application tag. 170 163 ··· 242 249 It is up to the receiver to process them and verify data 243 250 integrity upon completion. 
244 251 245 - 246 - 5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata 247 - -------------------------------------------------------------------------- 248 - 249 - To enable integrity exchange on a block device the gendisk must be 250 - registered as capable: 251 - 252 - `int blk_integrity_register(gendisk, blk_integrity);` 253 - 254 - The blk_integrity struct is a template and should contain the 255 - following:: 256 - 257 - static struct blk_integrity my_profile = { 258 - .name = "STANDARDSBODY-TYPE-VARIANT-CSUM", 259 - .generate_fn = my_generate_fn, 260 - .verify_fn = my_verify_fn, 261 - .tuple_size = sizeof(struct my_tuple_size), 262 - .tag_size = <tag bytes per hw sector>, 263 - }; 264 - 265 - 'name' is a text string which will be visible in sysfs. This is 266 - part of the userland API so chose it carefully and never change 267 - it. The format is standards body-type-variant. 268 - E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC. 269 - 270 - 'generate_fn' generates appropriate integrity metadata (for WRITE). 271 - 272 - 'verify_fn' verifies that the data buffer matches the integrity 273 - metadata. 274 - 275 - 'tuple_size' must be set to match the size of the integrity 276 - metadata per sector. I.e. 8 for DIF and EPP. 277 - 278 - 'tag_size' must be set to identify how many bytes of tag space 279 - are available per hardware sector. For DIF this is either 2 or 280 - 0 depending on the value of the Control Mode Page ATO bit. 281 252 282 253 ---------------------------------------------------------------------- 283 254

+39 -30

Documentation/block/writeback_cache_control.rst

@@ -46,41 +46,50 @@
 the Forced Unit Access is implemented. The REQ_PREFLUSH and REQ_FUA flags
 may both be set on a single bio.
 
-
-Implementation details for bio based block drivers
---------------------------------------------------------------
-
-These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit
-directly below the submit_bio interface. For remapping drivers the REQ_FUA
-bits need to be propagated to underlying devices, and a global flush needs
-to be implemented for bios with the REQ_PREFLUSH bit set. For real device
-drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits
-on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without
-data can be completed successfully without doing any work. Drivers for
-devices with volatile caches need to implement the support for these
-flags themselves without any help from the block layer.
-
-
-Implementation details for request_fn based block drivers
----------------------------------------------------------
+Feature settings for block drivers
+----------------------------------
 
 For devices that do not support volatile write caches there is no driver
 support required, the block layer completes empty REQ_PREFLUSH requests before
 entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from
-requests that have a payload. For devices with volatile write caches the
-driver needs to tell the block layer that it supports flushing caches by
-doing::
+requests that have a payload.
 
-	blk_queue_write_cache(sdkp->disk->queue, true, false);
+For devices with volatile write caches the driver needs to tell the block layer
+that it supports flushing caches by setting the
 
-and handle empty REQ_OP_FLUSH requests in its prep_fn/request_fn. Note that
-REQ_PREFLUSH requests with a payload are automatically turned into a sequence
-of an empty REQ_OP_FLUSH request followed by the actual write by the block
-layer. For devices that also support the FUA bit the block layer needs
-to be told to pass through the REQ_FUA bit using::
+   BLK_FEAT_WRITE_CACHE
 
-	blk_queue_write_cache(sdkp->disk->queue, true, true);
+flag in the queue_limits feature field. For devices that also support the FUA
+bit the block layer needs to be told to pass on the REQ_FUA bit by also setting
+the
 
-and the driver must handle write requests that have the REQ_FUA bit set
-in prep_fn/request_fn. If the FUA bit is not natively supported the block
-layer turns it into an empty REQ_OP_FLUSH request after the actual write.
+   BLK_FEAT_FUA
+
+flag in the features field of the queue_limits structure.
+
+Implementation details for bio based block drivers
+--------------------------------------------------
+
+For bio based drivers the REQ_PREFLUSH and REQ_FUA bit are simply passed on to
+the driver if the driver sets the BLK_FEAT_WRITE_CACHE flag and the driver
+needs to handle them.
+
+*NOTE*: The REQ_FUA bit also gets passed on when the BLK_FEAT_FUA flags is
+_not_ set. Any bio based driver that sets BLK_FEAT_WRITE_CACHE also needs to
+handle REQ_FUA.
+
+For remapping drivers the REQ_FUA bits need to be propagated to underlying
+devices, and a global flush needs to be implemented for bios with the
+REQ_PREFLUSH bit set.
+
+Implementation details for blk-mq drivers
+-----------------------------------------
+
+When the BLK_FEAT_WRITE_CACHE flag is set, REQ_OP_WRITE | REQ_PREFLUSH requests
+with a payload are automatically turned into a sequence of a REQ_OP_FLUSH
+request followed by the actual write by the block layer.
+
+When the BLK_FEAT_FUA flags is set, the REQ_FUA bit is simply passed on for the
+REQ_OP_WRITE request, else a REQ_OP_FLUSH request is sent by the block layer
+after the completion of the write request for bio submissions with the REQ_FUA
+bit set.
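
The rules in the updated document can be condensed into a small stand-alone model of which flush-related bits a driver sees on a write with a payload. This is only a sketch: the `FEAT_*` and `RQ_*` macros use made-up bit values standing in for the kernel's BLK_FEAT_* and REQ_* definitions, and `flags_seen_by_driver` is a hypothetical name, not a kernel function.

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real definitions differ. */
#define FEAT_WRITE_CACHE (1u << 0)	/* stands in for BLK_FEAT_WRITE_CACHE */
#define FEAT_FUA         (1u << 1)	/* stands in for BLK_FEAT_FUA */
#define RQ_PREFLUSH      (1u << 0)	/* stands in for REQ_PREFLUSH */
#define RQ_FUA           (1u << 1)	/* stands in for REQ_FUA */

/*
 * Return the flush-related flags that reach the driver for a write with a
 * payload, following the rules stated in the documentation text.
 */
static unsigned int flags_seen_by_driver(unsigned int features, bool blk_mq,
					 unsigned int flags)
{
	/* No volatile write cache: the block layer strips both bits. */
	if (!(features & FEAT_WRITE_CACHE))
		return flags & ~(RQ_PREFLUSH | RQ_FUA);

	/* bio based drivers get both bits, even without FEAT_FUA. */
	if (!blk_mq)
		return flags;

	/* blk-mq: the preflush is issued as a separate REQ_OP_FLUSH. */
	flags &= ~RQ_PREFLUSH;

	/* blk-mq without FEAT_FUA: a flush follows the write instead. */
	if (!(features & FEAT_FUA))
		flags &= ~RQ_FUA;
	return flags;
}
```

The asymmetry called out in the *NOTE* is visible here: a bio based driver that sets only the write cache feature still receives the FUA bit, whereas a blk-mq driver in the same configuration does not.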

+14

MAINTAINERS

@@ -3759,6 +3759,20 @@
 F:	kernel/trace/blktrace.c
 F:	lib/sbitmap.c
 
+BLOCK LAYER DEVICE DRIVER API [RUST]
+M:	Andreas Hindborg <a.hindborg@samsung.com>
+R:	Boqun Feng <boqun.feng@gmail.com>
+L:	linux-block@vger.kernel.org
+L:	rust-for-linux@vger.kernel.org
+S:	Supported
+W:	https://rust-for-linux.com
+B:	https://github.com/Rust-for-Linux/linux/issues
+C:	https://rust-for-linux.zulipchat.com/#narrow/stream/Block
+T:	git https://github.com/Rust-for-Linux/linux.git rust-block-next
+F:	drivers/block/rnull.rs
+F:	rust/kernel/block.rs
+F:	rust/kernel/block/
+
 BLOCK2MTD DRIVER
 M:	Joern Engel <joern@lazybastard.org>
 L:	linux-mtd@lists.infradead.org

+2 -1

arch/m68k/emu/nfblock.c

@@ -71,7 +71,7 @@
 		len = bvec.bv_len;
 		len >>= 9;
 		nfhd_read_write(dev->id, 0, dir, sec >> shift, len >> shift,
-				page_to_phys(bvec.bv_page) + bvec.bv_offset);
+				bvec_phys(&bvec));
 		sec += len;
 	}
 	bio_endio(bio);
@@ -98,6 +98,7 @@
 {
 	struct queue_limits lim = {
 		.logical_block_size = bsize,
+		.features = BLK_FEAT_ROTATIONAL,
 	};
 	struct nfhd_device *dev;
 	int dev_id = id - NFHD_DEV_OFFSET;

+20 -33

arch/um/drivers/ubd_kern.c

··· 447 447 return n; 448 448 } 449 449 450 - /* Called without dev->lock held, and only in interrupt context. */ 451 - static void ubd_handler(void) 450 + static void ubd_end_request(struct io_thread_req *io_req) 452 451 { 453 - int n; 454 - int count; 455 - 456 - while(1){ 457 - n = bulk_req_safe_read( 458 - thread_fd, 459 - irq_req_buffer, 460 - &irq_remainder, 461 - &irq_remainder_size, 462 - UBD_REQ_BUFFER_SIZE 463 - ); 464 - if (n < 0) { 465 - if(n == -EAGAIN) 466 - break; 467 - printk(KERN_ERR "spurious interrupt in ubd_handler, " 468 - "err = %d\n", -n); 469 - return; 470 - } 471 - for (count = 0; count < n/sizeof(struct io_thread_req *); count++) { 472 - struct io_thread_req *io_req = (*irq_req_buffer)[count]; 473 - 474 - if ((io_req->error == BLK_STS_NOTSUPP) && (req_op(io_req->req) == REQ_OP_DISCARD)) { 475 - blk_queue_max_discard_sectors(io_req->req->q, 0); 476 - blk_queue_max_write_zeroes_sectors(io_req->req->q, 0); 477 - } 478 - blk_mq_end_request(io_req->req, io_req->error); 479 - kfree(io_req); 480 - } 452 + if (io_req->error == BLK_STS_NOTSUPP) { 453 + if (req_op(io_req->req) == REQ_OP_DISCARD) 454 + blk_queue_disable_discard(io_req->req->q); 455 + else if (req_op(io_req->req) == REQ_OP_WRITE_ZEROES) 456 + blk_queue_disable_write_zeroes(io_req->req->q); 481 457 } 458 + blk_mq_end_request(io_req->req, io_req->error); 459 + kfree(io_req); 482 460 } 483 461 484 462 static irqreturn_t ubd_intr(int irq, void *dev) 485 463 { 486 - ubd_handler(); 464 + int len, i; 465 + 466 + while ((len = bulk_req_safe_read(thread_fd, irq_req_buffer, 467 + &irq_remainder, &irq_remainder_size, 468 + UBD_REQ_BUFFER_SIZE)) >= 0) { 469 + for (i = 0; i < len / sizeof(struct io_thread_req *); i++) 470 + ubd_end_request((*irq_req_buffer)[i]); 471 + } 472 + 473 + if (len < 0 && len != -EAGAIN) 474 + pr_err("spurious interrupt in %s, err = %d\n", __func__, len); 487 475 return IRQ_HANDLED; 488 476 } 489 477 ··· 835 847 struct queue_limits lim = { 836 848 .max_segments = MAX_SG, 
837 849 .seg_boundary_mask = PAGE_SIZE - 1, 850 + .features = BLK_FEAT_WRITE_CACHE, 838 851 }; 839 852 struct gendisk *disk; 840 853 int err = 0; ··· 882 893 goto out_cleanup_tags; 883 894 } 884 895 885 - blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue); 886 - blk_queue_write_cache(disk->queue, true, false); 887 896 disk->major = UBD_MAJOR; 888 897 disk->first_minor = n << UBD_SHIFT; 889 898 disk->minors = 1 << UBD_SHIFT;

+4 -1

arch/xtensa/platforms/iss/simdisk.c

@@ -263,6 +263,9 @@
 static int __init simdisk_setup(struct simdisk *dev, int which,
 		struct proc_dir_entry *procdir)
 {
+	struct queue_limits lim = {
+		.features = BLK_FEAT_ROTATIONAL,
+	};
 	char tmp[2] = { '0' + which, 0 };
 	int err;
 
@@ -271,7 +274,7 @@
 	spin_lock_init(&dev->lock);
 	dev->users = 0;
 
-	dev->gd = blk_alloc_disk(NULL, NUMA_NO_NODE);
+	dev->gd = blk_alloc_disk(&lim, NUMA_NO_NODE);
 	if (IS_ERR(dev->gd)) {
 		err = PTR_ERR(dev->gd);
 		goto out;

+2 -6

block/Kconfig

@@ -62,6 +62,8 @@
 
 config BLK_DEV_INTEGRITY
 	bool "Block layer data integrity support"
+	select CRC_T10DIF
+	select CRC64_ROCKSOFT
 	help
 	  Some storage devices allow extra information to be
 	  stored/retrieved to help protect the data. The block layer
@@ -71,12 +73,6 @@
 	  Say yes here if you have a storage device that provides the
 	  T10/SCSI Data Integrity Field or the T13/ATA External Path
 	  Protection. If in doubt, say N.
-
-config BLK_DEV_INTEGRITY_T10
-	tristate
-	depends on BLK_DEV_INTEGRITY
-	select CRC_T10DIF
-	select CRC64_ROCKSOFT
 
 config BLK_DEV_WRITE_MOUNTED
 	bool "Allow writing to mounted block devices"

+1 -2

block/Makefile

@@ -26,8 +26,7 @@
 bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
 obj-$(CONFIG_IOSCHED_BFQ) += bfq.o
 
-obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o
-obj-$(CONFIG_BLK_DEV_INTEGRITY_T10) += t10-pi.o
+obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI) += blk-mq-pci.o
 obj-$(CONFIG_BLK_MQ_VIRTIO) += blk-mq-virtio.o
 obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o

+30 -11

block/bdev.c

··· 385 385 }; 386 386 387 387 struct super_block *blockdev_superblock __ro_after_init; 388 - struct vfsmount *blockdev_mnt __ro_after_init; 388 + static struct vfsmount *blockdev_mnt __ro_after_init; 389 389 EXPORT_SYMBOL_GPL(blockdev_superblock); 390 390 391 391 void __init bdev_cache_init(void) ··· 1260 1260 } 1261 1261 1262 1262 /* 1263 - * Handle STATX_DIOALIGN for block devices. 1264 - * 1265 - * Note that the inode passed to this is the inode of a block device node file, 1266 - * not the block device's internal inode. Therefore it is *not* valid to use 1267 - * I_BDEV() here; the block device has to be looked up by i_rdev instead. 1263 + * Handle STATX_{DIOALIGN, WRITE_ATOMIC} for block devices. 1268 1264 */ 1269 - void bdev_statx_dioalign(struct inode *inode, struct kstat *stat) 1265 + void bdev_statx(struct path *path, struct kstat *stat, 1266 + u32 request_mask) 1270 1267 { 1268 + struct inode *backing_inode; 1271 1269 struct block_device *bdev; 1272 1270 1273 - bdev = blkdev_get_no_open(inode->i_rdev); 1271 + if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC))) 1272 + return; 1273 + 1274 + backing_inode = d_backing_inode(path->dentry); 1275 + 1276 + /* 1277 + * Note that backing_inode is the inode of a block device node file, 1278 + * not the block device's internal inode. Therefore it is *not* valid 1279 + * to use I_BDEV() here; the block device has to be looked up by i_rdev 1280 + * instead. 
1281 + */ 1282 + bdev = blkdev_get_no_open(backing_inode->i_rdev); 1274 1283 if (!bdev) 1275 1284 return; 1276 1285 1277 - stat->dio_mem_align = bdev_dma_alignment(bdev) + 1; 1278 - stat->dio_offset_align = bdev_logical_block_size(bdev); 1279 - stat->result_mask |= STATX_DIOALIGN; 1286 + if (request_mask & STATX_DIOALIGN) { 1287 + stat->dio_mem_align = bdev_dma_alignment(bdev) + 1; 1288 + stat->dio_offset_align = bdev_logical_block_size(bdev); 1289 + stat->result_mask |= STATX_DIOALIGN; 1290 + } 1291 + 1292 + if (request_mask & STATX_WRITE_ATOMIC && bdev_can_atomic_write(bdev)) { 1293 + struct request_queue *bd_queue = bdev->bd_queue; 1294 + 1295 + generic_fill_statx_atomic_writes(stat, 1296 + queue_atomic_write_unit_min_bytes(bd_queue), 1297 + queue_atomic_write_unit_max_bytes(bd_queue)); 1298 + } 1280 1299 1281 1300 blkdev_put_no_open(bdev); 1282 1301 }

-51

block/bfq-cgroup.c

··· 797 797 */ 798 798 bfq_link_bfqg(bfqd, bfqg); 799 799 __bfq_bic_change_cgroup(bfqd, bic, bfqg); 800 - /* 801 - * Update blkg_path for bfq_log_* functions. We cache this 802 - * path, and update it here, for the following 803 - * reasons. Operations on blkg objects in blk-cgroup are 804 - * protected with the request_queue lock, and not with the 805 - * lock that protects the instances of this scheduler 806 - * (bfqd->lock). This exposes BFQ to the following sort of 807 - * race. 808 - * 809 - * The blkg_lookup performed in bfq_get_queue, protected 810 - * through rcu, may happen to return the address of a copy of 811 - * the original blkg. If this is the case, then the 812 - * bfqg_and_blkg_get performed in bfq_get_queue, to pin down 813 - * the blkg, is useless: it does not prevent blk-cgroup code 814 - * from destroying both the original blkg and all objects 815 - * directly or indirectly referred by the copy of the 816 - * blkg. 817 - * 818 - * On the bright side, destroy operations on a blkg invoke, as 819 - * a first step, hooks of the scheduler associated with the 820 - * blkg. And these hooks are executed with bfqd->lock held for 821 - * BFQ. As a consequence, for any blkg associated with the 822 - * request queue this instance of the scheduler is attached 823 - * to, we are guaranteed that such a blkg is not destroyed, and 824 - * that all the pointers it contains are consistent, while we 825 - * are holding bfqd->lock. A blkg_lookup performed with 826 - * bfqd->lock held then returns a fully consistent blkg, which 827 - * remains consistent until this lock is held. 828 - * 829 - * Thanks to the last fact, and to the fact that: (1) bfqg has 830 - * been obtained through a blkg_lookup in the above 831 - * assignment, and (2) bfqd->lock is being held, here we can 832 - * safely use the policy data for the involved blkg (i.e., the 833 - * field bfqg->pd) to get to the blkg associated with bfqg, 834 - * and then we can safely use any field of blkg. 
After we 835 - * release bfqd->lock, even just getting blkg through this 836 - * bfqg may cause dangling references to be traversed, as 837 - * bfqg->pd may not exist any more. 838 - * 839 - * In view of the above facts, here we cache, in the bfqg, any 840 - * blkg data we may need for this bic, and for its associated 841 - * bfq_queue. As of now, we need to cache only the path of the 842 - * blkg, which is used in the bfq_log_* functions. 843 - * 844 - * Finally, note that bfqg itself needs to be protected from 845 - * destruction on the blkg_free of the original blkg (which 846 - * invokes bfq_pd_free). We use an additional private 847 - * refcounter for bfqg, to let it disappear only after no 848 - * bfq_queue refers to it any longer. 849 - */ 850 - blkg_path(bfqg_to_blkg(bfqg), bfqg->blkg_path, sizeof(bfqg->blkg_path)); 851 800 bic->blkcg_serial_nr = serial_nr; 852 801 } 853 802

+24 -22

block/bfq-iosched.c

··· 5463 5463 } 5464 5464 } 5465 5465 5466 - static void bfq_exit_icq(struct io_cq *icq) 5466 + static void _bfq_exit_icq(struct bfq_io_cq *bic, unsigned int num_actuators) 5467 5467 { 5468 - struct bfq_io_cq *bic = icq_to_bic(icq); 5469 - struct bfq_data *bfqd = bic_to_bfqd(bic); 5470 - unsigned long flags; 5471 - unsigned int act_idx; 5472 - /* 5473 - * If bfqd and thus bfqd->num_actuators is not available any 5474 - * longer, then cycle over all possible per-actuator bfqqs in 5475 - * next loop. We rely on bic being zeroed on creation, and 5476 - * therefore on its unused per-actuator fields being NULL. 5477 - */ 5478 - unsigned int num_actuators = BFQ_MAX_ACTUATORS; 5479 5468 struct bfq_iocq_bfqq_data *bfqq_data = bic->bfqq_data; 5480 - 5481 - /* 5482 - * bfqd is NULL if scheduler already exited, and in that case 5483 - * this is the last time these queues are accessed. 5484 - */ 5485 - if (bfqd) { 5486 - spin_lock_irqsave(&bfqd->lock, flags); 5487 - num_actuators = bfqd->num_actuators; 5488 - } 5469 + unsigned int act_idx; 5489 5470 5490 5471 for (act_idx = 0; act_idx < num_actuators; act_idx++) { 5491 5472 if (bfqq_data[act_idx].stable_merge_bfqq) ··· 5475 5494 bfq_exit_icq_bfqq(bic, true, act_idx); 5476 5495 bfq_exit_icq_bfqq(bic, false, act_idx); 5477 5496 } 5497 + } 5478 5498 5479 - if (bfqd) 5499 + static void bfq_exit_icq(struct io_cq *icq) 5500 + { 5501 + struct bfq_io_cq *bic = icq_to_bic(icq); 5502 + struct bfq_data *bfqd = bic_to_bfqd(bic); 5503 + unsigned long flags; 5504 + 5505 + /* 5506 + * If bfqd and thus bfqd->num_actuators is not available any 5507 + * longer, then cycle over all possible per-actuator bfqqs in 5508 + * next loop. We rely on bic being zeroed on creation, and 5509 + * therefore on its unused per-actuator fields being NULL. 5510 + * 5511 + * bfqd is NULL if scheduler already exited, and in that case 5512 + * this is the last time these queues are accessed. 
5513 + */ 5514 + if (bfqd) { 5515 + spin_lock_irqsave(&bfqd->lock, flags); 5516 + _bfq_exit_icq(bic, bfqd->num_actuators); 5480 5517 spin_unlock_irqrestore(&bfqd->lock, flags); 5518 + } else { 5519 + _bfq_exit_icq(bic, BFQ_MAX_ACTUATORS); 5520 + } 5481 5521 } 5482 5522 5483 5523 /*

-3

block/bfq-iosched.h

@@ -1003,9 +1003,6 @@
 	/* must be the first member */
 	struct blkg_policy_data pd;
 
-	/* cached path for this blkg (see comments in bfq_bic_update_cgroup) */
-	char blkg_path[128];
-
 	/* reference counter (see comments in bfq_bic_update_cgroup) */
 	refcount_t ref;
 

+37 -98

block/bio-integrity.c

··· 76 76 &bip->bip_max_vcnt, gfp_mask); 77 77 if (!bip->bip_vec) 78 78 goto err; 79 - } else { 79 + } else if (nr_vecs) { 80 80 bip->bip_vec = bip->bip_inline_vecs; 81 81 } 82 82 ··· 276 276 277 277 bip->bip_flags |= BIP_INTEGRITY_USER | BIP_COPY_USER; 278 278 bip->bip_iter.bi_sector = seed; 279 + bip->bip_vcnt = nr_vecs; 279 280 return 0; 280 281 free_bip: 281 282 bio_integrity_free(bio); ··· 298 297 bip->bip_flags |= BIP_INTEGRITY_USER; 299 298 bip->bip_iter.bi_sector = seed; 300 299 bip->bip_iter.bi_size = len; 300 + bip->bip_vcnt = nr_vecs; 301 301 return 0; 302 302 } 303 303 ··· 336 334 u32 seed) 337 335 { 338 336 struct request_queue *q = bdev_get_queue(bio->bi_bdev); 339 - unsigned int align = q->dma_pad_mask | queue_dma_alignment(q); 337 + unsigned int align = blk_lim_dma_alignment_and_pad(&q->limits); 340 338 struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages; 341 339 struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec; 342 340 unsigned int direction, nr_bvecs; ··· 399 397 EXPORT_SYMBOL_GPL(bio_integrity_map_user); 400 398 401 399 /** 402 - * bio_integrity_process - Process integrity metadata for a bio 403 - * @bio: bio to generate/verify integrity metadata for 404 - * @proc_iter: iterator to process 405 - * @proc_fn: Pointer to the relevant processing function 406 - */ 407 - static blk_status_t bio_integrity_process(struct bio *bio, 408 - struct bvec_iter *proc_iter, integrity_processing_fn *proc_fn) 409 - { 410 - struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); 411 - struct blk_integrity_iter iter; 412 - struct bvec_iter bviter; 413 - struct bio_vec bv; 414 - struct bio_integrity_payload *bip = bio_integrity(bio); 415 - blk_status_t ret = BLK_STS_OK; 416 - 417 - iter.disk_name = bio->bi_bdev->bd_disk->disk_name; 418 - iter.interval = 1 << bi->interval_exp; 419 - iter.tuple_size = bi->tuple_size; 420 - iter.seed = proc_iter->bi_sector; 421 - iter.prot_buf = bvec_virt(bip->bip_vec); 422 - iter.pi_offset = bi->pi_offset; 
423 - 424 - __bio_for_each_segment(bv, bio, bviter, *proc_iter) { 425 - void *kaddr = bvec_kmap_local(&bv); 426 - 427 - iter.data_buf = kaddr; 428 - iter.data_size = bv.bv_len; 429 - ret = proc_fn(&iter); 430 - kunmap_local(kaddr); 431 - 432 - if (ret) 433 - break; 434 - 435 - } 436 - return ret; 437 - } 438 - 439 - /** 440 400 * bio_integrity_prep - Prepare bio for integrity I/O 441 401 * @bio: bio to prepare 442 402 * ··· 414 450 { 415 451 struct bio_integrity_payload *bip; 416 452 struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); 453 + unsigned int len; 417 454 void *buf; 418 - unsigned long start, end; 419 - unsigned int len, nr_pages; 420 - unsigned int bytes, offset, i; 455 + gfp_t gfp = GFP_NOIO; 421 456 422 457 if (!bi) 423 - return true; 424 - 425 - if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE) 426 458 return true; 427 459 428 460 if (!bio_sectors(bio)) ··· 428 468 if (bio_integrity(bio)) 429 469 return true; 430 470 431 - if (bio_data_dir(bio) == READ) { 432 - if (!bi->profile->verify_fn || 433 - !(bi->flags & BLK_INTEGRITY_VERIFY)) 471 + switch (bio_op(bio)) { 472 + case REQ_OP_READ: 473 + if (bi->flags & BLK_INTEGRITY_NOVERIFY) 434 474 return true; 435 - } else { 436 - if (!bi->profile->generate_fn || 437 - !(bi->flags & BLK_INTEGRITY_GENERATE)) 475 + break; 476 + case REQ_OP_WRITE: 477 + if (bi->flags & BLK_INTEGRITY_NOGENERATE) 438 478 return true; 479 + 480 + /* 481 + * Zero the memory allocated to not leak uninitialized kernel 482 + * memory to disk for non-integrity metadata where nothing else 483 + * initializes the memory. 
484 + */ 485 + if (bi->csum_type == BLK_INTEGRITY_CSUM_NONE) 486 + gfp |= __GFP_ZERO; 487 + break; 488 + default: 489 + return true; 439 490 } 440 491 441 492 /* Allocate kernel buffer for protection data */ 442 493 len = bio_integrity_bytes(bi, bio_sectors(bio)); 443 - buf = kmalloc(len, GFP_NOIO); 494 + buf = kmalloc(len, gfp); 444 495 if (unlikely(buf == NULL)) { 445 - printk(KERN_ERR "could not allocate integrity buffer\n"); 446 496 goto err_end_io; 447 497 } 448 498 449 - end = (((unsigned long) buf) + len + PAGE_SIZE - 1) >> PAGE_SHIFT; 450 - start = ((unsigned long) buf) >> PAGE_SHIFT; 451 - nr_pages = end - start; 452 - 453 - /* Allocate bio integrity payload and integrity vectors */ 454 - bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages); 499 + bip = bio_integrity_alloc(bio, GFP_NOIO, 1); 455 500 if (IS_ERR(bip)) { 456 - printk(KERN_ERR "could not allocate data integrity bioset\n"); 457 501 kfree(buf); 458 502 goto err_end_io; 459 503 } ··· 465 501 bip->bip_flags |= BIP_BLOCK_INTEGRITY; 466 502 bip_set_seed(bip, bio->bi_iter.bi_sector); 467 503 468 - if (bi->flags & BLK_INTEGRITY_IP_CHECKSUM) 504 + if (bi->csum_type == BLK_INTEGRITY_CSUM_IP) 469 505 bip->bip_flags |= BIP_IP_CHECKSUM; 470 506 471 - /* Map it */ 472 - offset = offset_in_page(buf); 473 - for (i = 0; i < nr_pages && len > 0; i++) { 474 - bytes = PAGE_SIZE - offset; 475 - 476 - if (bytes > len) 477 - bytes = len; 478 - 479 - if (bio_integrity_add_page(bio, virt_to_page(buf), 480 - bytes, offset) < bytes) { 481 - printk(KERN_ERR "could not attach integrity payload\n"); 482 - goto err_end_io; 483 - } 484 - 485 - buf += bytes; 486 - len -= bytes; 487 - offset = 0; 507 + if (bio_integrity_add_page(bio, virt_to_page(buf), len, 508 + offset_in_page(buf)) < len) { 509 + printk(KERN_ERR "could not attach integrity payload\n"); 510 + goto err_end_io; 488 511 } 489 512 490 513 /* Auto-generate integrity metadata if this is a write */ 491 - if (bio_data_dir(bio) == WRITE) { 492 - 
bio_integrity_process(bio, &bio->bi_iter, 493 - bi->profile->generate_fn); 494 - } else { 514 + if (bio_data_dir(bio) == WRITE) 515 + blk_integrity_generate(bio); 516 + else 495 517 bip->bio_iter = bio->bi_iter; 496 - } 497 518 return true; 498 519 499 520 err_end_io: ··· 501 552 struct bio_integrity_payload *bip = 502 553 container_of(work, struct bio_integrity_payload, bip_work); 503 554 struct bio *bio = bip->bip_bio; 504 - struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); 505 555 506 - /* 507 - * At the moment verify is called bio's iterator was advanced 508 - * during split and completion, we need to rewind iterator to 509 - * it's original position. 510 - */ 511 - bio->bi_status = bio_integrity_process(bio, &bip->bio_iter, 512 - bi->profile->verify_fn); 556 + blk_integrity_verify(bio); 513 557 bio_integrity_free(bio); 514 558 bio_endio(bio); 515 559 } ··· 524 582 struct bio_integrity_payload *bip = bio_integrity(bio); 525 583 526 584 if (bio_op(bio) == REQ_OP_READ && !bio->bi_status && 527 - (bip->bip_flags & BIP_BLOCK_INTEGRITY) && bi->profile->verify_fn) { 585 + (bip->bip_flags & BIP_BLOCK_INTEGRITY) && bi->csum_type) { 528 586 INIT_WORK(&bip->bip_work, bio_integrity_verify_fn); 529 587 queue_work(kintegrityd_wq, &bip->bip_work); 530 588 return false; ··· 584 642 585 643 BUG_ON(bip_src == NULL); 586 644 587 - bip = bio_integrity_alloc(bio, gfp_mask, bip_src->bip_vcnt); 645 + bip = bio_integrity_alloc(bio, gfp_mask, 0); 588 646 if (IS_ERR(bip)) 589 647 return PTR_ERR(bip); 590 648 591 - memcpy(bip->bip_vec, bip_src->bip_vec, 592 - bip_src->bip_vcnt * sizeof(struct bio_vec)); 593 - 594 - bip->bip_vcnt = bip_src->bip_vcnt; 649 + bip->bip_vec = bip_src->bip_vec; 595 650 bip->bip_iter = bip_src->bip_iter; 596 651 bip->bip_flags = bip_src->bip_flags & ~BIP_BLOCK_INTEGRITY; 597 652

+1 -1

block/bio.c

@@ -953,7 +953,7 @@
 		bool *same_page)
 {
 	unsigned long mask = queue_segment_boundary(q);
-	phys_addr_t addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
+	phys_addr_t addr1 = bvec_phys(bv);
 	phys_addr_t addr2 = page_to_phys(page) + offset + len - 1;
 
 	if ((addr1 | mask) != (addr2 | mask))

-13

block/blk-cgroup.h

@@ -301,19 +301,6 @@
 }
 
 /**
- * blkg_path - format cgroup path of blkg
- * @blkg: blkg of interest
- * @buf: target buffer
- * @buflen: target buffer length
- *
- * Format the path of the cgroup of @blkg into @buf.
- */
-static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
-{
-	return cgroup_path(blkg->blkcg->css.cgroup, buf, buflen);
-}
-
-/**
  * blkg_get - get a blkg reference
  * @blkg: blkg to get
  *

+23 -22

block/blk-core.c

··· 94 94 } 95 95 EXPORT_SYMBOL(blk_queue_flag_clear); 96 96 97 - /** 98 - * blk_queue_flag_test_and_set - atomically test and set a queue flag 99 - * @flag: flag to be set 100 - * @q: request queue 101 - * 102 - * Returns the previous value of @flag - 0 if the flag was not set and 1 if 103 - * the flag was already set. 104 - */ 105 - bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q) 106 - { 107 - return test_and_set_bit(flag, &q->queue_flags); 108 - } 109 - EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set); 110 - 111 97 #define REQ_OP_NAME(name) [REQ_OP_##name] = #name 112 98 static const char *const blk_op_name[] = { 113 99 REQ_OP_NAME(READ), ··· 159 173 160 174 /* Command duration limit device-side timeout */ 161 175 [BLK_STS_DURATION_LIMIT] = { -ETIME, "duration limit exceeded" }, 176 + 177 + [BLK_STS_INVAL] = { -EINVAL, "invalid" }, 162 178 163 179 /* everything else not covered above: */ 164 180 [BLK_STS_IOERR] = { -EIO, "I/O" }, ··· 727 739 __submit_bio_noacct(bio); 728 740 } 729 741 742 + static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q, 743 + struct bio *bio) 744 + { 745 + if (bio->bi_iter.bi_size > queue_atomic_write_unit_max_bytes(q)) 746 + return BLK_STS_INVAL; 747 + 748 + if (bio->bi_iter.bi_size % queue_atomic_write_unit_min_bytes(q)) 749 + return BLK_STS_INVAL; 750 + 751 + return BLK_STS_OK; 752 + } 753 + 730 754 /** 731 755 * submit_bio_noacct - re-submit a bio to the block device layer for I/O 732 756 * @bio: The bio describing the location in memory and on the device. 
···
 		if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE &&
 				 bio_op(bio) != REQ_OP_ZONE_APPEND))
 			goto end_io;
-		if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
+		if (!bdev_write_cache(bdev)) {
 			bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
 			if (!bio_sectors(bio)) {
 				status = BLK_STS_OK;
···
 		}
 	}
 
-	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!(q->limits.features & BLK_FEAT_POLL))
 		bio_clear_polled(bio);
 
 	switch (bio_op(bio)) {
 	case REQ_OP_READ:
 	case REQ_OP_WRITE:
+		if (bio->bi_opf & REQ_ATOMIC) {
+			status = blk_validate_atomic_write_op_size(q, bio);
+			if (status != BLK_STS_OK)
+				goto end_io;
+		}
 		break;
 	case REQ_OP_FLUSH:
 		/*
···
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
 	case REQ_OP_ZONE_FINISH:
-		if (!bdev_is_zoned(bio->bi_bdev))
-			goto not_supported;
-		break;
 	case REQ_OP_ZONE_RESET_ALL:
-		if (!bdev_is_zoned(bio->bi_bdev) || !blk_queue_zone_resetall(q))
+		if (!bdev_is_zoned(bio->bi_bdev))
 			goto not_supported;
 		break;
 	case REQ_OP_DRV_IN:
···
 		return 0;
 
 	q = bdev_get_queue(bdev);
-	if (cookie == BLK_QC_T_NONE ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (cookie == BLK_QC_T_NONE || !(q->limits.features & BLK_FEAT_POLL))
 		return 0;
 
 	blk_flush_plug(current->plug, false);
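
The size check added for REQ_ATOMIC writes in block/blk-core.c can be read in isolation. The following is a minimal user-space sketch of the same two comparisons; `struct atomic_limits` and `atomic_write_size_ok()` are hypothetical stand-ins for the queue limit accessors and are not kernel API. The non-zero assertion is an assumption of this sketch, not something the hunk above states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two queue limits the check consults. */
struct atomic_limits {
	uint32_t unit_min_bytes;	/* assumed non-zero in this sketch */
	uint32_t unit_max_bytes;	/* assumed >= unit_min_bytes */
};

/* Returns true if a write of @size bytes passes both comparisons. */
static bool atomic_write_size_ok(const struct atomic_limits *lim, uint32_t size)
{
	assert(lim->unit_min_bytes != 0);	/* sketch-only guard against % 0 */

	if (size > lim->unit_max_bytes)
		return false;		/* larger than the biggest atomic unit */
	if (size % lim->unit_min_bytes)
		return false;		/* not a multiple of the smallest unit */
	return true;
}
```

With limits of 4096 and 65536 bytes, a 6144-byte write is rejected by the modulo test and a 131072-byte write by the upper bound. Note that this check alone accepts any multiple of the minimum unit up to the maximum.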

+16 -20

block/blk-flush.c

···
 	return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx)->fq;
 }
 
-static unsigned int blk_flush_policy(unsigned long fflags, struct request *rq)
-{
-	unsigned int policy = 0;
-
-	if (blk_rq_sectors(rq))
-		policy |= REQ_FSEQ_DATA;
-
-	if (fflags & (1UL << QUEUE_FLAG_WC)) {
-		if (rq->cmd_flags & REQ_PREFLUSH)
-			policy |= REQ_FSEQ_PREFLUSH;
-		if (!(fflags & (1UL << QUEUE_FLAG_FUA)) &&
-		    (rq->cmd_flags & REQ_FUA))
-			policy |= REQ_FSEQ_POSTFLUSH;
-	}
-	return policy;
-}
-
 static unsigned int blk_flush_cur_seq(struct request *rq)
 {
 	return 1 << ffz(rq->flush.seq);
···
 bool blk_insert_flush(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	unsigned long fflags = q->queue_flags;	/* may change, cache */
-	unsigned int policy = blk_flush_policy(fflags, rq);
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);
+	bool supports_fua = q->limits.features & BLK_FEAT_FUA;
+	unsigned int policy = 0;
 
 	/* FLUSH/FUA request must never be merged */
 	WARN_ON_ONCE(rq->bio != rq->biotail);
+
+	if (blk_rq_sectors(rq))
+		policy |= REQ_FSEQ_DATA;
+
+	/*
+	 * Check which flushes we need to sequence for this operation.
+	 */
+	if (blk_queue_write_cache(q)) {
+		if (rq->cmd_flags & REQ_PREFLUSH)
+			policy |= REQ_FSEQ_PREFLUSH;
+		if ((rq->cmd_flags & REQ_FUA) && !supports_fua)
+			policy |= REQ_FSEQ_POSTFLUSH;
+	}
 
 	/*
 	 * @policy now records what operations need to be done. Adjust
 	 * REQ_PREFLUSH and FUA for the driver.
 	 */
 	rq->cmd_flags &= ~REQ_PREFLUSH;
-	if (!(fflags & (1UL << QUEUE_FLAG_FUA)))
+	if (!supports_fua)
 		rq->cmd_flags &= ~REQ_FUA;
 
 	/*
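
The policy computation that this hunk inlines into blk_insert_flush() is a small truth table. Below is a hedged user-space sketch of it; `struct flush_req`, `flush_policy()` and the `FSEQ_*` bit values are illustrative names chosen here, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sequence bits; the values are an assumption of this sketch. */
enum {
	FSEQ_PREFLUSH  = 1 << 0,
	FSEQ_DATA      = 1 << 1,
	FSEQ_POSTFLUSH = 1 << 2,
};

/* Hypothetical reduction of a request to the fields the policy reads. */
struct flush_req {
	unsigned int sectors;	/* payload size; 0 for an empty flush */
	bool preflush;		/* request carries REQ_PREFLUSH */
	bool fua;		/* request carries REQ_FUA */
};

static unsigned int flush_policy(const struct flush_req *rq,
				 bool write_cache, bool supports_fua)
{
	unsigned int policy = 0;

	if (rq->sectors)
		policy |= FSEQ_DATA;

	/* Flush steps are only sequenced when a volatile write cache exists. */
	if (write_cache) {
		if (rq->preflush)
			policy |= FSEQ_PREFLUSH;
		/* FUA is emulated with a post-flush when not supported. */
		if (rq->fua && !supports_fua)
			policy |= FSEQ_POSTFLUSH;
	}
	return policy;
}
```

A FUA write on a device with a write cache but no native FUA support therefore sequences the data step followed by a post-flush, while the same write on a device without a write cache needs only the data step.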

+65 -163

block/blk-integrity.c

···
 }
 EXPORT_SYMBOL(blk_rq_map_integrity_sg);
 
-/**
- * blk_integrity_compare - Compare integrity profile of two disks
- * @gd1: Disk to compare
- * @gd2: Disk to compare
- *
- * Description: Meta-devices like DM and MD need to verify that all
- * sub-devices use the same integrity format before advertising to
- * upper layers that they can send/receive integrity metadata. This
- * function can be used to check whether two gendisk devices have
- * compatible integrity formats.
- */
-int blk_integrity_compare(struct gendisk *gd1, struct gendisk *gd2)
-{
-	struct blk_integrity *b1 = &gd1->queue->integrity;
-	struct blk_integrity *b2 = &gd2->queue->integrity;
-
-	if (!b1->profile && !b2->profile)
-		return 0;
-
-	if (!b1->profile || !b2->profile)
-		return -1;
-
-	if (b1->interval_exp != b2->interval_exp) {
-		pr_err("%s: %s/%s protection interval %u != %u\n",
-		       __func__, gd1->disk_name, gd2->disk_name,
-		       1 << b1->interval_exp, 1 << b2->interval_exp);
-		return -1;
-	}
-
-	if (b1->tuple_size != b2->tuple_size) {
-		pr_err("%s: %s/%s tuple sz %u != %u\n", __func__,
-		       gd1->disk_name, gd2->disk_name,
-		       b1->tuple_size, b2->tuple_size);
-		return -1;
-	}
-
-	if (b1->tag_size && b2->tag_size && (b1->tag_size != b2->tag_size)) {
-		pr_err("%s: %s/%s tag sz %u != %u\n", __func__,
-		       gd1->disk_name, gd2->disk_name,
-		       b1->tag_size, b2->tag_size);
-		return -1;
-	}
-
-	if (b1->profile != b2->profile) {
-		pr_err("%s: %s/%s type %s != %s\n", __func__,
-		       gd1->disk_name, gd2->disk_name,
-		       b1->profile->name, b2->profile->name);
-		return -1;
-	}
-
-	return 0;
-}
-EXPORT_SYMBOL(blk_integrity_compare);
-
 bool blk_integrity_merge_rq(struct request_queue *q, struct request *req,
 			    struct request *next)
 {
···
 
 static inline struct blk_integrity *dev_to_bi(struct device *dev)
 {
-	return &dev_to_disk(dev)->queue->integrity;
+	return &dev_to_disk(dev)->queue->limits.integrity;
+}
+
+const char *blk_integrity_profile_name(struct blk_integrity *bi)
+{
+	switch (bi->csum_type) {
+	case BLK_INTEGRITY_CSUM_IP:
+		if (bi->flags & BLK_INTEGRITY_REF_TAG)
+			return "T10-DIF-TYPE1-IP";
+		return "T10-DIF-TYPE3-IP";
+	case BLK_INTEGRITY_CSUM_CRC:
+		if (bi->flags & BLK_INTEGRITY_REF_TAG)
+			return "T10-DIF-TYPE1-CRC";
+		return "T10-DIF-TYPE3-CRC";
+	case BLK_INTEGRITY_CSUM_CRC64:
+		if (bi->flags & BLK_INTEGRITY_REF_TAG)
+			return "EXT-DIF-TYPE1-CRC64";
+		return "EXT-DIF-TYPE3-CRC64";
+	case BLK_INTEGRITY_CSUM_NONE:
+		break;
+	}
+
+	return "nop";
+}
+EXPORT_SYMBOL_GPL(blk_integrity_profile_name);
+
+static ssize_t flag_store(struct device *dev, const char *page, size_t count,
+		unsigned char flag)
+{
+	struct request_queue *q = dev_to_disk(dev)->queue;
+	struct queue_limits lim;
+	unsigned long val;
+	int err;
+
+	err = kstrtoul(page, 10, &val);
+	if (err)
+		return err;
+
+	/* note that the flags are inverted vs the values in the sysfs files */
+	lim = queue_limits_start_update(q);
+	if (val)
+		lim.integrity.flags &= ~flag;
+	else
+		lim.integrity.flags |= flag;
+
+	blk_mq_freeze_queue(q);
+	err = queue_limits_commit_update(q, &lim);
+	blk_mq_unfreeze_queue(q);
+	if (err)
+		return err;
+	return count;
+}
+
+static ssize_t flag_show(struct device *dev, char *page, unsigned char flag)
+{
+	struct blk_integrity *bi = dev_to_bi(dev);
+
+	return sysfs_emit(page, "%d\n", !(bi->flags & flag));
 }
 
 static ssize_t format_show(struct device *dev, struct device_attribute *attr,
···
 {
 	struct blk_integrity *bi = dev_to_bi(dev);
 
-	if (bi->profile && bi->profile->name)
-		return sysfs_emit(page, "%s\n", bi->profile->name);
-	return sysfs_emit(page, "none\n");
+	if (!bi->tuple_size)
+		return sysfs_emit(page, "none\n");
+	return sysfs_emit(page, "%s\n", blk_integrity_profile_name(bi));
 }
 
 static ssize_t tag_size_show(struct device *dev, struct device_attribute *attr,
···
 				 struct device_attribute *attr,
 				 const char *page, size_t count)
 {
-	struct blk_integrity *bi = dev_to_bi(dev);
-	char *p = (char *) page;
-	unsigned long val = simple_strtoul(p, &p, 10);
-
-	if (val)
-		bi->flags |= BLK_INTEGRITY_VERIFY;
-	else
-		bi->flags &= ~BLK_INTEGRITY_VERIFY;
-
-	return count;
+	return flag_store(dev, page, count, BLK_INTEGRITY_NOVERIFY);
 }
 
 static ssize_t read_verify_show(struct device *dev,
 				struct device_attribute *attr, char *page)
 {
-	struct blk_integrity *bi = dev_to_bi(dev);
-
-	return sysfs_emit(page, "%d\n", !!(bi->flags & BLK_INTEGRITY_VERIFY));
+	return flag_show(dev, page, BLK_INTEGRITY_NOVERIFY);
 }
 
 static ssize_t write_generate_store(struct device *dev,
 				    struct device_attribute *attr,
 				    const char *page, size_t count)
 {
-	struct blk_integrity *bi = dev_to_bi(dev);
-
-	char *p = (char *) page;
-	unsigned long val = simple_strtoul(p, &p, 10);
-
-	if (val)
-		bi->flags |= BLK_INTEGRITY_GENERATE;
-	else
-		bi->flags &= ~BLK_INTEGRITY_GENERATE;
-
-	return count;
+	return flag_store(dev, page, count, BLK_INTEGRITY_NOGENERATE);
 }
 
 static ssize_t write_generate_show(struct device *dev,
 				   struct device_attribute *attr, char *page)
 {
-	struct blk_integrity *bi = dev_to_bi(dev);
-
-	return sysfs_emit(page, "%d\n", !!(bi->flags & BLK_INTEGRITY_GENERATE));
+	return flag_show(dev, page, BLK_INTEGRITY_NOGENERATE);
 }
 
 static ssize_t device_is_integrity_capable_show(struct device *dev,
···
 	.name = "integrity",
 	.attrs = integrity_attrs,
 };
-
-static blk_status_t blk_integrity_nop_fn(struct blk_integrity_iter *iter)
-{
-	return BLK_STS_OK;
-}
-
-static void blk_integrity_nop_prepare(struct request *rq)
-{
-}
-
-static void blk_integrity_nop_complete(struct request *rq,
-		unsigned int nr_bytes)
-{
-}
-
-static const struct blk_integrity_profile nop_profile = {
-	.name = "nop",
-	.generate_fn = blk_integrity_nop_fn,
-	.verify_fn = blk_integrity_nop_fn,
-	.prepare_fn = blk_integrity_nop_prepare,
-	.complete_fn = blk_integrity_nop_complete,
-};
-
-/**
- * blk_integrity_register - Register a gendisk as being integrity-capable
- * @disk: struct gendisk pointer to make integrity-aware
- * @template: block integrity profile to register
- *
- * Description: When a device needs to advertise itself as being able to
- * send/receive integrity metadata it must use this function to register
- * the capability with the block layer. The template is a blk_integrity
- * struct with values appropriate for the underlying hardware. See
- * Documentation/block/data-integrity.rst.
- */
-void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
-{
-	struct blk_integrity *bi = &disk->queue->integrity;
-
-	bi->flags = BLK_INTEGRITY_VERIFY | BLK_INTEGRITY_GENERATE |
-		template->flags;
-	bi->interval_exp = template->interval_exp ? :
-		ilog2(queue_logical_block_size(disk->queue));
-	bi->profile = template->profile ? template->profile : &nop_profile;
-	bi->tuple_size = template->tuple_size;
-	bi->tag_size = template->tag_size;
-	bi->pi_offset = template->pi_offset;
-
-	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, disk->queue);
-
-#ifdef CONFIG_BLK_INLINE_ENCRYPTION
-	if (disk->queue->crypto_profile) {
-		pr_warn("blk-integrity: Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n");
-		disk->queue->crypto_profile = NULL;
-	}
-#endif
-}
-EXPORT_SYMBOL(blk_integrity_register);
-
-/**
- * blk_integrity_unregister - Unregister block integrity profile
- * @disk: disk whose integrity profile to unregister
- *
- * Description: This function unregisters the integrity capability from
- * a block device.
- */
-void blk_integrity_unregister(struct gendisk *disk)
-{
-	struct blk_integrity *bi = &disk->queue->integrity;
-
-	if (!bi->profile)
-		return;
-
-	/* ensure all bios are off the integrity workqueue */
-	blk_flush_integrity();
-	blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, disk->queue);
-	memset(bi, 0, sizeof(*bi));
-}
-EXPORT_SYMBOL(blk_integrity_unregister);
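
The new flag_store()/flag_show() helpers in block/blk-integrity.c keep the sysfs semantics (1 means enabled) while the stored bits become "NO..." flags, so the mapping is inverted. A small sketch of just that inversion follows, with the queue-limits update and sysfs plumbing left out; `apply_sysfs_value()`, `sysfs_value()` and the bit values are hypothetical names for illustration only.

```c
#include <assert.h>

/* Illustrative bit positions; assumed for this sketch. */
enum {
	NOVERIFY   = 1 << 0,
	NOGENERATE = 1 << 1,
};

/* Writing a non-zero sysfs value clears the "NO..." flag, zero sets it. */
static unsigned char apply_sysfs_value(unsigned char flags, unsigned char flag,
				       unsigned long val)
{
	if (val)
		return (unsigned char)(flags & ~flag);
	return (unsigned char)(flags | flag);
}

/* Reading reports 1 when the "NO..." flag is clear. */
static int sysfs_value(unsigned char flags, unsigned char flag)
{
	return !(flags & flag);
}
```

With all flags clear, both attributes read back as 1, which matches the old default where verify and generate were enabled at registration time.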

+120 -90

block/blk-lib.c

···
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
 
-static int __blkdev_issue_write_zeroes(struct block_device *bdev,
+static sector_t bio_write_zeroes_limit(struct block_device *bdev)
+{
+	sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+
+	return min(bdev_write_zeroes_sectors(bdev),
+		(UINT_MAX >> SECTOR_SHIFT) & ~bs_mask);
+}
+
+static void __blkdev_issue_write_zeroes(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned flags)
 {
-	struct bio *bio = *biop;
-	unsigned int max_sectors;
-
-	if (bdev_read_only(bdev))
-		return -EPERM;
-
-	/* Ensure that max_sectors doesn't overflow bi_size */
-	max_sectors = bdev_write_zeroes_sectors(bdev);
-
-	if (max_sectors == 0)
-		return -EOPNOTSUPP;
-
 	while (nr_sects) {
-		unsigned int len = min_t(sector_t, nr_sects, max_sectors);
+		unsigned int len = min_t(sector_t, nr_sects,
+				bio_write_zeroes_limit(bdev));
+		struct bio *bio;
 
-		bio = blk_next_bio(bio, bdev, 0, REQ_OP_WRITE_ZEROES, gfp_mask);
+		if ((flags & BLKDEV_ZERO_KILLABLE) &&
+		    fatal_signal_pending(current))
+			break;
+
+		bio = bio_alloc(bdev, 0, REQ_OP_WRITE_ZEROES, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		if (flags & BLKDEV_ZERO_NOUNMAP)
 			bio->bi_opf |= REQ_NOUNMAP;
 
 		bio->bi_iter.bi_size = len << SECTOR_SHIFT;
+		*biop = bio_chain_and_submit(*biop, bio);
+
 		nr_sects -= len;
 		sector += len;
 		cond_resched();
 	}
+}
 
-	*biop = bio;
-	return 0;
+static int blkdev_issue_write_zeroes(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp, unsigned flags)
+{
+	struct bio *bio = NULL;
+	struct blk_plug plug;
+	int ret = 0;
+
+	blk_start_plug(&plug);
+	__blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp, &bio, flags);
+	if (bio) {
+		if ((flags & BLKDEV_ZERO_KILLABLE) &&
+		    fatal_signal_pending(current)) {
+			bio_await_chain(bio);
+			blk_finish_plug(&plug);
+			return -EINTR;
+		}
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+	blk_finish_plug(&plug);
+
+	/*
+	 * For some devices there is no non-destructive way to verify whether
+	 * WRITE ZEROES is actually supported. These will clear the capability
+	 * on an I/O error, in which case we'll turn any error into
+	 * "not supported" here.
+	 */
+	if (ret && !bdev_write_zeroes_sectors(bdev))
+		return -EOPNOTSUPP;
+	return ret;
 }
 
 /*
···
 	return min(pages, (sector_t)BIO_MAX_VECS);
 }
 
-static int __blkdev_issue_zero_pages(struct block_device *bdev,
+static void __blkdev_issue_zero_pages(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
-		struct bio **biop)
+		struct bio **biop, unsigned int flags)
 {
-	struct bio *bio = *biop;
-	int bi_size = 0;
-	unsigned int sz;
+	while (nr_sects) {
+		unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
+		struct bio *bio;
 
-	if (bdev_read_only(bdev))
-		return -EPERM;
-
-	while (nr_sects != 0) {
-		bio = blk_next_bio(bio, bdev, __blkdev_sectors_to_bio_pages(nr_sects),
-				   REQ_OP_WRITE, gfp_mask);
+		bio = bio_alloc(bdev, nr_vecs, REQ_OP_WRITE, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 
-		while (nr_sects != 0) {
-			sz = min((sector_t) PAGE_SIZE, nr_sects << 9);
-			bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0);
-			nr_sects -= bi_size >> 9;
-			sector += bi_size >> 9;
-			if (bi_size < sz)
+		if ((flags & BLKDEV_ZERO_KILLABLE) &&
+		    fatal_signal_pending(current))
+			break;
+
+		do {
+			unsigned int len, added;
+
+			len = min_t(sector_t,
+				PAGE_SIZE, nr_sects << SECTOR_SHIFT);
+			added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
+			if (added < len)
 				break;
-		}
+			nr_sects -= added >> SECTOR_SHIFT;
+			sector += added >> SECTOR_SHIFT;
+		} while (nr_sects);
+
+		*biop = bio_chain_and_submit(*biop, bio);
 		cond_resched();
 	}
+}
 
-	*biop = bio;
-	return 0;
+static int blkdev_issue_zero_pages(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp, unsigned flags)
+{
+	struct bio *bio = NULL;
+	struct blk_plug plug;
+	int ret = 0;
+
+	if (flags & BLKDEV_ZERO_NOFALLBACK)
+		return -EOPNOTSUPP;
+
+	blk_start_plug(&plug);
+	__blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp, &bio, flags);
+	if (bio) {
+		if ((flags & BLKDEV_ZERO_KILLABLE) &&
+		    fatal_signal_pending(current)) {
+			bio_await_chain(bio);
+			blk_finish_plug(&plug);
+			return -EINTR;
+		}
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+	blk_finish_plug(&plug);
+
+	return ret;
 }
 
 /**
···
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
 		unsigned flags)
 {
-	int ret;
-	sector_t bs_mask;
+	if (bdev_read_only(bdev))
+		return -EPERM;
 
-	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
-	if ((sector | nr_sects) & bs_mask)
-		return -EINVAL;
-
-	ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
-			biop, flags);
-	if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK))
-		return ret;
-
-	return __blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask,
-					 biop);
+	if (bdev_write_zeroes_sectors(bdev)) {
+		__blkdev_issue_write_zeroes(bdev, sector, nr_sects,
+				gfp_mask, biop, flags);
+	} else {
+		if (flags & BLKDEV_ZERO_NOFALLBACK)
+			return -EOPNOTSUPP;
+		__blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask,
+				biop, flags);
+	}
+	return 0;
 }
 EXPORT_SYMBOL(__blkdev_issue_zeroout);
 
···
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
 {
-	int ret = 0;
-	sector_t bs_mask;
-	struct bio *bio;
-	struct blk_plug plug;
-	bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev);
+	int ret;
 
-	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
-	if ((sector | nr_sects) & bs_mask)
+	if ((sector | nr_sects) & ((bdev_logical_block_size(bdev) >> 9) - 1))
 		return -EINVAL;
+	if (bdev_read_only(bdev))
+		return -EPERM;
 
-retry:
-	bio = NULL;
-	blk_start_plug(&plug);
-	if (try_write_zeroes) {
-		ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects,
-						  gfp_mask, &bio, flags);
-	} else if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
-		ret = __blkdev_issue_zero_pages(bdev, sector, nr_sects,
-						gfp_mask, &bio);
-	} else {
-		/* No zeroing offload support */
-		ret = -EOPNOTSUPP;
-	}
-	if (ret == 0 && bio) {
-		ret = submit_bio_wait(bio);
-		bio_put(bio);
-	}
-	blk_finish_plug(&plug);
-	if (ret && try_write_zeroes) {
-		if (!(flags & BLKDEV_ZERO_NOFALLBACK)) {
-			try_write_zeroes = false;
-			goto retry;
-		}
-		if (!bdev_write_zeroes_sectors(bdev)) {
-			/*
-			 * Zeroing offload support was indicated, but the
-			 * device reported ILLEGAL REQUEST (for some devices
-			 * there is no non-destructive way to verify whether
-			 * WRITE ZEROES is actually supported).
-			 */
-			ret = -EOPNOTSUPP;
-		}
+	if (bdev_write_zeroes_sectors(bdev)) {
+		ret = blkdev_issue_write_zeroes(bdev, sector, nr_sects,
+				gfp_mask, flags);
+		if (ret != -EOPNOTSUPP)
+			return ret;
 	}
 
-	return ret;
+	return blkdev_issue_zero_pages(bdev, sector, nr_sects, gfp_mask, flags);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
 
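
The per-bio cap introduced by bio_write_zeroes_limit() keeps `len << SECTOR_SHIFT` within the unsigned int used for bi_size and rounds the cap down to a logical-block multiple. The sketch below reproduces that arithmetic in user space; `write_zeroes_limit()` and `count_write_zeroes_bios()` are hypothetical helpers, and the zero-limit guard is an addition of this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

typedef uint64_t sector_t;

/*
 * Largest number of sectors one WRITE ZEROES bio may carry: the device
 * limit, capped so the byte count fits an unsigned int and aligned down
 * to the logical block size (assumed to be a power of two >= 512).
 */
static sector_t write_zeroes_limit(unsigned int logical_block_size,
				   sector_t max_write_zeroes_sectors)
{
	sector_t bs_mask = (logical_block_size >> SECTOR_SHIFT) - 1;
	sector_t cap = (sector_t)(UINT_MAX >> SECTOR_SHIFT) & ~bs_mask;

	return max_write_zeroes_sectors < cap ? max_write_zeroes_sectors : cap;
}

/* Counts how many bios the chunking loop would issue for @nr_sects. */
static unsigned int count_write_zeroes_bios(sector_t nr_sects, sector_t limit)
{
	unsigned int nr = 0;

	if (!limit)		/* sketch-only guard: avoid looping forever */
		return 0;

	while (nr_sects) {
		sector_t len = nr_sects < limit ? nr_sects : limit;

		nr_sects -= len;
		nr++;
	}
	return nr;
}
```

For example, zeroing 10000 sectors with a limit of 4096 sectors takes three bios: two full ones and a final one of 1808 sectors.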

+1 -1

block/blk-map.c

···
 			const struct iov_iter *iter, gfp_t gfp_mask)
 {
 	bool copy = false, map_bvec = false;
-	unsigned long align = q->dma_pad_mask | queue_dma_alignment(q);
+	unsigned long align = blk_lim_dma_alignment_and_pad(&q->limits);
 	struct bio *bio = NULL;
 	struct iov_iter i;
 	int ret = -EINVAL;

+70 -22

block/blk-merge.c

···
 	return bio_split(bio, lim->max_write_zeroes_sectors, GFP_NOIO, bs);
 }
 
+static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim,
+						bool is_atomic)
+{
+	/*
+	 * chunk_sectors must be a multiple of atomic_write_boundary_sectors if
+	 * both non-zero.
+	 */
+	if (is_atomic && lim->atomic_write_boundary_sectors)
+		return lim->atomic_write_boundary_sectors;
+
+	return lim->chunk_sectors;
+}
+
 /*
  * Return the maximum number of sectors from the start of a bio that may be
  * submitted as a single request to a block device. If enough sectors remain,
···
 {
 	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
 	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
-	unsigned max_sectors = lim->max_sectors, start, end;
+	bool is_atomic = bio->bi_opf & REQ_ATOMIC;
+	unsigned boundary_sectors = blk_boundary_sectors(lim, is_atomic);
+	unsigned max_sectors, start, end;
 
-	if (lim->chunk_sectors) {
+	/*
+	 * We ignore lim->max_sectors for atomic writes because it may less
+	 * than the actual bio size, which we cannot tolerate.
+	 */
+	if (is_atomic)
+		max_sectors = lim->atomic_write_max_sectors;
+	else
+		max_sectors = lim->max_sectors;
+
+	if (boundary_sectors) {
 		max_sectors = min(max_sectors,
-			blk_chunk_sectors_left(bio->bi_iter.bi_sector,
-					       lim->chunk_sectors));
+			blk_boundary_sectors_left(bio->bi_iter.bi_sector,
+					      boundary_sectors));
 	}
 
 	start = bio->bi_iter.bi_sector & (pbs - 1);
···
 /**
  * get_max_segment_size() - maximum number of bytes to add as a single segment
  * @lim: Request queue limits.
- * @start_page: See below.
- * @offset: Offset from @start_page where to add a segment.
+ * @paddr: address of the range to add
+ * @len: maximum length available to add at @paddr
  *
- * Returns the maximum number of bytes that can be added as a single segment.
+ * Returns the maximum number of bytes of the range starting at @paddr that can
+ * be added to a single segment.
  */
 static inline unsigned get_max_segment_size(const struct queue_limits *lim,
-		struct page *start_page, unsigned long offset)
+		phys_addr_t paddr, unsigned int len)
 {
-	unsigned long mask = lim->seg_boundary_mask;
-
-	offset = mask & (page_to_phys(start_page) + offset);
-
 	/*
 	 * Prevent an overflow if mask = ULONG_MAX and offset = 0 by adding 1
 	 * after having calculated the minimum.
 	 */
-	return min(mask - offset, (unsigned long)lim->max_segment_size - 1) + 1;
+	return min_t(unsigned long, len,
+		min(lim->seg_boundary_mask - (lim->seg_boundary_mask & paddr),
+		    (unsigned long)lim->max_segment_size - 1) + 1);
 }
 
 /**
···
 	unsigned seg_size = 0;
 
 	while (len && *nsegs < max_segs) {
-		seg_size = get_max_segment_size(lim, bv->bv_page,
-						bv->bv_offset + total_len);
-		seg_size = min(seg_size, len);
+		seg_size = get_max_segment_size(lim, bvec_phys(bv) + total_len, len);
 
 		(*nsegs)++;
 		total_len += seg_size;
···
 	*segs = nsegs;
 	return NULL;
 split:
+	if (bio->bi_opf & REQ_ATOMIC) {
+		bio->bi_status = BLK_STS_INVAL;
+		bio_endio(bio);
+		return ERR_PTR(-EINVAL);
+	}
 	/*
 	 * We can't sanely support splitting for a REQ_NOWAIT bio. End it
 	 * with EAGAIN if splitting is required and return an error pointer.
···
 
 	while (nbytes > 0) {
 		unsigned offset = bvec->bv_offset + total;
-		unsigned len = min(get_max_segment_size(&q->limits,
-				   bvec->bv_page, offset), nbytes);
+		unsigned len = get_max_segment_size(&q->limits,
+				bvec_phys(bvec) + total, nbytes);
 		struct page *page = bvec->bv_page;
 
 		/*
···
 						  sector_t offset)
 {
 	struct request_queue *q = rq->q;
-	unsigned int max_sectors;
+	struct queue_limits *lim = &q->limits;
+	unsigned int max_sectors, boundary_sectors;
+	bool is_atomic = rq->cmd_flags & REQ_ATOMIC;
 
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
-	max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
-	if (!q->limits.chunk_sectors ||
+	boundary_sectors = blk_boundary_sectors(lim, is_atomic);
+	max_sectors = blk_queue_get_max_sectors(rq);
+
+	if (!boundary_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
 		return max_sectors;
 	return min(max_sectors,
-		   blk_chunk_sectors_left(offset, q->limits.chunk_sectors));
+		   blk_boundary_sectors_left(offset, boundary_sectors));
 }
 
 static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
···
 	return ELEVATOR_NO_MERGE;
 }
 
+static bool blk_atomic_write_mergeable_rq_bio(struct request *rq,
+					      struct bio *bio)
+{
+	return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC);
+}
+
+static bool blk_atomic_write_mergeable_rqs(struct request *rq,
+					   struct request *next)
+{
+	return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
+}
+
 /*
  * For non-mq, this has to be called with the request spinlock acquired.
  * For mq with scheduling, the appropriate queue wide lock should be held.
···
 		return NULL;
 
 	if (req->ioprio != next->ioprio)
+		return NULL;
+
+	if (!blk_atomic_write_mergeable_rqs(req, next))
 		return NULL;
 
 	/*
···
 		return false;
 
 	if (rq->ioprio != bio_prio(bio))
+		return false;
+
+	if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
 		return false;
 
 	return true;
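
The reworked get_max_segment_size() takes a physical address and a length and keeps the "subtract one, take the minimum, add one" trick so that a boundary mask of ULONG_MAX does not overflow. The following user-space sketch mirrors that arithmetic; `segment_bytes_at()` and its parameter names are hypothetical, and physical addresses are modelled as `unsigned long` for simplicity.

```c
#include <assert.h>
#include <limits.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Bytes of the range [paddr, paddr + len) that fit in one segment, given a
 * segment boundary mask and a maximum segment size (assumed non-zero).
 */
static unsigned int segment_bytes_at(unsigned long boundary_mask,
				     unsigned int max_seg_size,
				     unsigned long paddr, unsigned int len)
{
	unsigned long offset = boundary_mask & paddr;
	/* Add 1 only after the minimum so ULONG_MAX - 0 cannot wrap. */
	unsigned long room = min_ul(boundary_mask - offset,
				    (unsigned long)max_seg_size - 1) + 1;

	return (unsigned int)min_ul(len, room);
}
```

With a 4 KiB boundary (mask 0xFFF) and an address 0xF00 bytes into the window, only 256 bytes remain before the boundary, so an 8192-byte range contributes 256 bytes to the current segment.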

-13

block/blk-mq-debugfs.c

···
 	QUEUE_FLAG_NAME(NOMERGES),
 	QUEUE_FLAG_NAME(SAME_COMP),
 	QUEUE_FLAG_NAME(FAIL_IO),
-	QUEUE_FLAG_NAME(NONROT),
-	QUEUE_FLAG_NAME(IO_STAT),
 	QUEUE_FLAG_NAME(NOXMERGES),
-	QUEUE_FLAG_NAME(ADD_RANDOM),
-	QUEUE_FLAG_NAME(SYNCHRONOUS),
 	QUEUE_FLAG_NAME(SAME_FORCE),
 	QUEUE_FLAG_NAME(INIT_DONE),
-	QUEUE_FLAG_NAME(STABLE_WRITES),
-	QUEUE_FLAG_NAME(POLL),
-	QUEUE_FLAG_NAME(WC),
-	QUEUE_FLAG_NAME(FUA),
-	QUEUE_FLAG_NAME(DAX),
 	QUEUE_FLAG_NAME(STATS),
 	QUEUE_FLAG_NAME(REGISTERED),
 	QUEUE_FLAG_NAME(QUIESCED),
-	QUEUE_FLAG_NAME(PCI_P2PDMA),
-	QUEUE_FLAG_NAME(ZONE_RESETALL),
 	QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
 	QUEUE_FLAG_NAME(HCTX_ACTIVE),
-	QUEUE_FLAG_NAME(NOWAIT),
 	QUEUE_FLAG_NAME(SQ_SCHED),
-	QUEUE_FLAG_NAME(SKIP_TAGSET_QUIESCE),
 };
 #undef QUEUE_FLAG_NAME
 

+57 -32

block/blk-mq.c

··· 448 448 if (data->cmd_flags & REQ_NOWAIT) 449 449 data->flags |= BLK_MQ_REQ_NOWAIT; 450 450 451 + retry: 452 + data->ctx = blk_mq_get_ctx(q); 453 + data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx); 454 + 451 455 if (q->elevator) { 452 456 /* 453 457 * All requests use scheduler tags when an I/O scheduler is ··· 473 469 if (ops->limit_depth) 474 470 ops->limit_depth(data->cmd_flags, data); 475 471 } 476 - } 477 - 478 - retry: 479 - data->ctx = blk_mq_get_ctx(q); 480 - data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx); 481 - if (!(data->rq_flags & RQF_SCHED_TAGS)) 472 + } else { 482 473 blk_mq_tag_busy(data->hctx); 474 + } 483 475 484 476 if (data->flags & BLK_MQ_REQ_RESERVED) 485 477 data->rq_flags |= RQF_RESV; ··· 804 804 if (!bio) 805 805 return; 806 806 807 - #ifdef CONFIG_BLK_DEV_INTEGRITY 808 807 if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ) 809 - req->q->integrity.profile->complete_fn(req, total_bytes); 810 - #endif 808 + blk_integrity_complete(req, total_bytes); 811 809 812 810 /* 813 811 * Upper layers may call blk_crypto_evict_key() anytime after the last ··· 873 875 if (!req->bio) 874 876 return false; 875 877 876 - #ifdef CONFIG_BLK_DEV_INTEGRITY 877 878 if (blk_integrity_rq(req) && req_op(req) == REQ_OP_READ && 878 879 error == BLK_STS_OK) 879 - req->q->integrity.profile->complete_fn(req, nr_bytes); 880 - #endif 880 + blk_integrity_complete(req, nr_bytes); 881 881 882 882 /* 883 883 * Upper layers may call blk_crypto_evict_key() anytime after the last ··· 1260 1264 WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT); 1261 1265 rq->mq_hctx->tags->rqs[rq->tag] = rq; 1262 1266 1263 - #ifdef CONFIG_BLK_DEV_INTEGRITY 1264 1267 if (blk_integrity_rq(rq) && req_op(rq) == REQ_OP_WRITE) 1265 - q->integrity.profile->prepare_fn(rq); 1266 - #endif 1268 + blk_integrity_prepare(rq); 1269 + 1267 1270 if (rq->bio && rq->bio->bi_opf & REQ_POLLED) 1268 1271 WRITE_ONCE(rq->bio->bi_cookie, rq->mq_hctx->queue_num); 1269 1272 } ··· 2909 2914 
INIT_LIST_HEAD(&rq->queuelist); 2910 2915 } 2911 2916 2917 + static bool bio_unaligned(const struct bio *bio, struct request_queue *q) 2918 + { 2919 + unsigned int bs_mask = queue_logical_block_size(q) - 1; 2920 + 2921 + /* .bi_sector of any zero sized bio need to be initialized */ 2922 + if ((bio->bi_iter.bi_size & bs_mask) || 2923 + ((bio->bi_iter.bi_sector << SECTOR_SHIFT) & bs_mask)) 2924 + return true; 2925 + return false; 2926 + } 2927 + 2912 2928 /** 2913 2929 * blk_mq_submit_bio - Create and send a request to block device. 2914 2930 * @bio: Bio pointer. ··· 2970 2964 if (!rq) { 2971 2965 if (unlikely(bio_queue_enter(bio))) 2972 2966 return; 2967 + } 2968 + 2969 + /* 2970 + * Device reconfiguration may change logical block size, so alignment 2971 + * check has to be done with queue usage counter held 2972 + */ 2973 + if (unlikely(bio_unaligned(bio, q))) { 2974 + bio_io_error(bio); 2975 + goto queue_exit; 2973 2976 } 2974 2977 2975 2978 if (unlikely(bio_may_exceed_limits(bio, &q->limits))) { ··· 3056 3041 blk_status_t blk_insert_cloned_request(struct request *rq) 3057 3042 { 3058 3043 struct request_queue *q = rq->q; 3059 - unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq)); 3044 + unsigned int max_sectors = blk_queue_get_max_sectors(rq); 3060 3045 unsigned int max_segments = blk_rq_get_max_segments(rq); 3061 3046 blk_status_t ret; 3062 3047 ··· 4129 4114 blk_mq_sysfs_deinit(q); 4130 4115 } 4131 4116 4117 + static bool blk_mq_can_poll(struct blk_mq_tag_set *set) 4118 + { 4119 + return set->nr_maps > HCTX_TYPE_POLL && 4120 + set->map[HCTX_TYPE_POLL].nr_queues; 4121 + } 4122 + 4132 4123 struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set, 4133 4124 struct queue_limits *lim, void *queuedata) 4134 4125 { ··· 4142 4121 struct request_queue *q; 4143 4122 int ret; 4144 4123 4145 - q = blk_alloc_queue(lim ? 
lim : &default_lim, set->numa_node); 4124 + if (!lim) 4125 + lim = &default_lim; 4126 + lim->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT; 4127 + if (blk_mq_can_poll(set)) 4128 + lim->features |= BLK_FEAT_POLL; 4129 + 4130 + q = blk_alloc_queue(lim, set->numa_node); 4146 4131 if (IS_ERR(q)) 4147 4132 return q; 4148 4133 q->queuedata = queuedata; ··· 4301 4274 mutex_unlock(&q->sysfs_lock); 4302 4275 } 4303 4276 4304 - static void blk_mq_update_poll_flag(struct request_queue *q) 4305 - { 4306 - struct blk_mq_tag_set *set = q->tag_set; 4307 - 4308 - if (set->nr_maps > HCTX_TYPE_POLL && 4309 - set->map[HCTX_TYPE_POLL].nr_queues) 4310 - blk_queue_flag_set(QUEUE_FLAG_POLL, q); 4311 - else 4312 - blk_queue_flag_clear(QUEUE_FLAG_POLL, q); 4313 - } 4314 - 4315 4277 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, 4316 4278 struct request_queue *q) 4317 4279 { ··· 4328 4312 q->tag_set = set; 4329 4313 4330 4314 q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT; 4331 - blk_mq_update_poll_flag(q); 4332 4315 4333 4316 INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work); 4334 4317 INIT_LIST_HEAD(&q->flush_list); ··· 4651 4636 int ret; 4652 4637 unsigned long i; 4653 4638 4639 + if (WARN_ON_ONCE(!q->mq_freeze_depth)) 4640 + return -EINVAL; 4641 + 4654 4642 if (!set) 4655 4643 return -EINVAL; 4656 4644 4657 4645 if (q->nr_requests == nr) 4658 4646 return 0; 4659 4647 4660 - blk_mq_freeze_queue(q); 4661 4648 blk_mq_quiesce_queue(q); 4662 4649 4663 4650 ret = 0; ··· 4693 4676 } 4694 4677 4695 4678 blk_mq_unquiesce_queue(q); 4696 - blk_mq_unfreeze_queue(q); 4697 4679 4698 4680 return ret; 4699 4681 } ··· 4814 4798 fallback: 4815 4799 blk_mq_update_queue_map(set); 4816 4800 list_for_each_entry(q, &set->tag_list, tag_set_list) { 4801 + struct queue_limits lim; 4802 + 4817 4803 blk_mq_realloc_hw_ctxs(set, q); 4818 - blk_mq_update_poll_flag(q); 4804 + 4819 4805 if (q->nr_hw_queues != set->nr_hw_queues) { 4820 4806 int i = prev_nr_hw_queues; 4821 4807 ··· 4829 4811 
set->nr_hw_queues = prev_nr_hw_queues; 4830 4812 goto fallback; 4831 4813 } 4814 + lim = queue_limits_start_update(q); 4815 + if (blk_mq_can_poll(set)) 4816 + lim.features |= BLK_FEAT_POLL; 4817 + else 4818 + lim.features &= ~BLK_FEAT_POLL; 4819 + if (queue_limits_commit_update(q, &lim) < 0) 4820 + pr_warn("updating the poll flag failed\n"); 4832 4821 blk_mq_map_swqueue(q); 4833 4822 } 4834 4823
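The alignment test that bio_unaligned() adds in the blk-mq.c hunk above can be modelled outside the kernel. The following is a minimal userspace sketch, not the kernel code: io_is_unaligned() and its parameters are hypothetical names, and the logical block size is assumed to be a power of two (the mask trick depends on it).

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors, matching the kernel's definition */

/*
 * Userspace model of the bio_unaligned() check. The function name and
 * parameters are hypothetical; logical_block_size is assumed to be a
 * power of two so that (size - 1) works as a bit mask.
 */
static bool io_is_unaligned(uint64_t start_sector, uint32_t size_bytes,
                            uint32_t logical_block_size)
{
    uint32_t bs_mask = logical_block_size - 1;

    /* The length must be a whole number of logical blocks. */
    if (size_bytes & bs_mask)
        return true;
    /* The starting byte offset must sit on a logical block boundary. */
    if ((start_sector << SECTOR_SHIFT) & bs_mask)
        return true;
    return false;
}
```

With a 4096-byte logical block size, an I/O starting at sector 8 (byte 4096) with length 4096 passes, while one starting at sector 1 (byte 512) is rejected. In the patch the same test runs only after the queue usage counter is held, because a device reconfiguration may change the logical block size.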

+239 -308

block/blk-settings.c

··· 6 6 #include <linux/module.h> 7 7 #include <linux/init.h> 8 8 #include <linux/bio.h> 9 - #include <linux/blkdev.h> 9 + #include <linux/blk-integrity.h> 10 10 #include <linux/pagemap.h> 11 11 #include <linux/backing-dev-defs.h> 12 12 #include <linux/gcd.h> ··· 55 55 } 56 56 EXPORT_SYMBOL(blk_set_stacking_limits); 57 57 58 - static void blk_apply_bdi_limits(struct backing_dev_info *bdi, 58 + void blk_apply_bdi_limits(struct backing_dev_info *bdi, 59 59 struct queue_limits *lim) 60 60 { 61 61 /* ··· 68 68 69 69 static int blk_validate_zoned_limits(struct queue_limits *lim) 70 70 { 71 - if (!lim->zoned) { 71 + if (!(lim->features & BLK_FEAT_ZONED)) { 72 72 if (WARN_ON_ONCE(lim->max_open_zones) || 73 73 WARN_ON_ONCE(lim->max_active_zones) || 74 74 WARN_ON_ONCE(lim->zone_write_granularity) || ··· 78 78 } 79 79 80 80 if (WARN_ON_ONCE(!IS_ENABLED(CONFIG_BLK_DEV_ZONED))) 81 + return -EINVAL; 82 + 83 + /* 84 + * Given that active zones include open zones, the maximum number of 85 + * open zones cannot be larger than the maximum number of active zones. 
86 + */ 87 + if (lim->max_active_zones && 88 + lim->max_open_zones > lim->max_active_zones) 81 89 return -EINVAL; 82 90 83 91 if (lim->zone_write_granularity < lim->logical_block_size) ··· 105 97 return 0; 106 98 } 107 99 100 + static int blk_validate_integrity_limits(struct queue_limits *lim) 101 + { 102 + struct blk_integrity *bi = &lim->integrity; 103 + 104 + if (!bi->tuple_size) { 105 + if (bi->csum_type != BLK_INTEGRITY_CSUM_NONE || 106 + bi->tag_size || ((bi->flags & BLK_INTEGRITY_REF_TAG))) { 107 + pr_warn("invalid PI settings.\n"); 108 + return -EINVAL; 109 + } 110 + return 0; 111 + } 112 + 113 + if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) { 114 + pr_warn("integrity support disabled.\n"); 115 + return -EINVAL; 116 + } 117 + 118 + if (bi->csum_type == BLK_INTEGRITY_CSUM_NONE && 119 + (bi->flags & BLK_INTEGRITY_REF_TAG)) { 120 + pr_warn("ref tag not support without checksum.\n"); 121 + return -EINVAL; 122 + } 123 + 124 + if (!bi->interval_exp) 125 + bi->interval_exp = ilog2(lim->logical_block_size); 126 + 127 + return 0; 128 + } 129 + 130 + /* 131 + * Returns max guaranteed bytes which we can fit in a bio. 132 + * 133 + * We request that an atomic_write is ITER_UBUF iov_iter (so a single vector), 134 + * so we assume that we can fit in at least PAGE_SIZE in a segment, apart from 135 + * the first and last segments. 
136 + */ 137 + static unsigned int blk_queue_max_guaranteed_bio(struct queue_limits *lim) 138 + { 139 + unsigned int max_segments = min(BIO_MAX_VECS, lim->max_segments); 140 + unsigned int length; 141 + 142 + length = min(max_segments, 2) * lim->logical_block_size; 143 + if (max_segments > 2) 144 + length += (max_segments - 2) * PAGE_SIZE; 145 + 146 + return length; 147 + } 148 + 149 + static void blk_atomic_writes_update_limits(struct queue_limits *lim) 150 + { 151 + unsigned int unit_limit = min(lim->max_hw_sectors << SECTOR_SHIFT, 152 + blk_queue_max_guaranteed_bio(lim)); 153 + 154 + unit_limit = rounddown_pow_of_two(unit_limit); 155 + 156 + lim->atomic_write_max_sectors = 157 + min(lim->atomic_write_hw_max >> SECTOR_SHIFT, 158 + lim->max_hw_sectors); 159 + lim->atomic_write_unit_min = 160 + min(lim->atomic_write_hw_unit_min, unit_limit); 161 + lim->atomic_write_unit_max = 162 + min(lim->atomic_write_hw_unit_max, unit_limit); 163 + lim->atomic_write_boundary_sectors = 164 + lim->atomic_write_hw_boundary >> SECTOR_SHIFT; 165 + } 166 + 167 + static void blk_validate_atomic_write_limits(struct queue_limits *lim) 168 + { 169 + unsigned int boundary_sectors; 170 + 171 + if (!lim->atomic_write_hw_max) 172 + goto unsupported; 173 + 174 + boundary_sectors = lim->atomic_write_hw_boundary >> SECTOR_SHIFT; 175 + 176 + if (boundary_sectors) { 177 + /* 178 + * A feature of boundary support is that it disallows bios to 179 + * be merged which would result in a merged request which 180 + * crosses either a chunk sector or atomic write HW boundary, 181 + * even though chunk sectors may be just set for performance. 182 + * For simplicity, disallow atomic writes for a chunk sector 183 + * which is non-zero and smaller than atomic write HW boundary. 184 + * Furthermore, chunk sectors must be a multiple of atomic 185 + * write HW boundary. Otherwise boundary support becomes 186 + * complicated. 
187 + * Devices which do not conform to these rules can be dealt 188 + * with if and when they show up. 189 + */ 190 + if (WARN_ON_ONCE(lim->chunk_sectors % boundary_sectors)) 191 + goto unsupported; 192 + 193 + /* 194 + * The boundary size just needs to be a multiple of unit_max 195 + * (and not necessarily a power-of-2), so this following check 196 + * could be relaxed in future. 197 + * Furthermore, if needed, unit_max could even be reduced so 198 + * that it is compliant with a !power-of-2 boundary. 199 + */ 200 + if (!is_power_of_2(boundary_sectors)) 201 + goto unsupported; 202 + } 203 + 204 + blk_atomic_writes_update_limits(lim); 205 + return; 206 + 207 + unsupported: 208 + lim->atomic_write_max_sectors = 0; 209 + lim->atomic_write_boundary_sectors = 0; 210 + lim->atomic_write_unit_min = 0; 211 + lim->atomic_write_unit_max = 0; 212 + } 213 + 108 214 /* 109 215 * Check that the limits in lim are valid, initialize defaults for unset 110 216 * values, and cap values based on others where needed. 
··· 227 105 { 228 106 unsigned int max_hw_sectors; 229 107 unsigned int logical_block_sectors; 108 + int err; 230 109 231 110 /* 232 111 * Unless otherwise specified, default to 512 byte logical blocks and a ··· 235 112 */ 236 113 if (!lim->logical_block_size) 237 114 lim->logical_block_size = SECTOR_SIZE; 115 + else if (blk_validate_block_size(lim->logical_block_size)) { 116 + pr_warn("Invalid logical block size (%d)\n", lim->logical_block_size); 117 + return -EINVAL; 118 + } 238 119 if (lim->physical_block_size < lim->logical_block_size) 239 120 lim->physical_block_size = lim->logical_block_size; 240 121 ··· 280 153 if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE) 281 154 return -EINVAL; 282 155 lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors); 156 + } else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) { 157 + lim->max_sectors = 158 + min(max_hw_sectors, lim->io_opt >> SECTOR_SHIFT); 159 + } else if (lim->io_min > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) { 160 + lim->max_sectors = 161 + min(max_hw_sectors, lim->io_min >> SECTOR_SHIFT); 283 162 } else { 284 163 lim->max_sectors = min(max_hw_sectors, BLK_DEF_MAX_SECTORS_CAP); 285 164 } ··· 353 220 354 221 if (lim->alignment_offset) { 355 222 lim->alignment_offset &= (lim->physical_block_size - 1); 356 - lim->misaligned = 0; 223 + lim->flags &= ~BLK_FLAG_MISALIGNED; 357 224 } 358 225 226 + if (!(lim->features & BLK_FEAT_WRITE_CACHE)) 227 + lim->features &= ~BLK_FEAT_FUA; 228 + 229 + blk_validate_atomic_write_limits(lim); 230 + 231 + err = blk_validate_integrity_limits(lim); 232 + if (err) 233 + return err; 359 234 return blk_validate_zoned_limits(lim); 360 235 } 361 236 ··· 395 254 */ 396 255 int queue_limits_commit_update(struct request_queue *q, 397 256 struct queue_limits *lim) 398 - __releases(q->limits_lock) 399 257 { 400 - int error = blk_validate_limits(lim); 258 + int error; 401 259 402 - if (!error) { 403 - q->limits = *lim; 404 - if (q->disk) 405 - 
blk_apply_bdi_limits(q->disk->bdi, lim); 260 + error = blk_validate_limits(lim); 261 + if (error) 262 + goto out_unlock; 263 + 264 + #ifdef CONFIG_BLK_INLINE_ENCRYPTION 265 + if (q->crypto_profile && lim->integrity.tag_size) { 266 + pr_warn("blk-integrity: Integrity and hardware inline encryption are not supported together.\n"); 267 + error = -EINVAL; 268 + goto out_unlock; 406 269 } 270 + #endif 271 + 272 + q->limits = *lim; 273 + if (q->disk) 274 + blk_apply_bdi_limits(q->disk->bdi, lim); 275 + out_unlock: 407 276 mutex_unlock(&q->limits_lock); 408 277 return error; 409 278 } ··· 438 287 EXPORT_SYMBOL_GPL(queue_limits_set); 439 288 440 289 /** 441 - * blk_queue_chunk_sectors - set size of the chunk for this queue 442 - * @q: the request queue for the device 443 - * @chunk_sectors: chunk sectors in the usual 512b unit 444 - * 445 - * Description: 446 - * If a driver doesn't want IOs to cross a given chunk size, it can set 447 - * this limit and prevent merging across chunks. Note that the block layer 448 - * must accept a page worth of data at any offset. So if the crossing of 449 - * chunks is a hard limitation in the driver, it must still be prepared 450 - * to split single page bios. 
451 - **/ 452 - void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors) 453 - { 454 - q->limits.chunk_sectors = chunk_sectors; 455 - } 456 - EXPORT_SYMBOL(blk_queue_chunk_sectors); 457 - 458 - /** 459 - * blk_queue_max_discard_sectors - set max sectors for a single discard 460 - * @q: the request queue for the device 461 - * @max_discard_sectors: maximum number of sectors to discard 462 - **/ 463 - void blk_queue_max_discard_sectors(struct request_queue *q, 464 - unsigned int max_discard_sectors) 465 - { 466 - struct queue_limits *lim = &q->limits; 467 - 468 - lim->max_hw_discard_sectors = max_discard_sectors; 469 - lim->max_discard_sectors = 470 - min(max_discard_sectors, lim->max_user_discard_sectors); 471 - } 472 - EXPORT_SYMBOL(blk_queue_max_discard_sectors); 473 - 474 - /** 475 - * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase 476 - * @q: the request queue for the device 477 - * @max_sectors: maximum number of sectors to secure_erase 478 - **/ 479 - void blk_queue_max_secure_erase_sectors(struct request_queue *q, 480 - unsigned int max_sectors) 481 - { 482 - q->limits.max_secure_erase_sectors = max_sectors; 483 - } 484 - EXPORT_SYMBOL(blk_queue_max_secure_erase_sectors); 485 - 486 - /** 487 - * blk_queue_max_write_zeroes_sectors - set max sectors for a single 488 - * write zeroes 489 - * @q: the request queue for the device 490 - * @max_write_zeroes_sectors: maximum number of sectors to write per command 491 - **/ 492 - void blk_queue_max_write_zeroes_sectors(struct request_queue *q, 493 - unsigned int max_write_zeroes_sectors) 494 - { 495 - q->limits.max_write_zeroes_sectors = max_write_zeroes_sectors; 496 - } 497 - EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors); 498 - 499 - /** 500 - * blk_queue_max_zone_append_sectors - set max sectors for a single zone append 501 - * @q: the request queue for the device 502 - * @max_zone_append_sectors: maximum number of sectors to write per command 503 - * 504 - * 
Sets the maximum number of sectors allowed for zone append commands. If 505 - * Specifying 0 for @max_zone_append_sectors indicates that the queue does 506 - * not natively support zone append operations and that the block layer must 507 - * emulate these operations using regular writes. 508 - **/ 509 - void blk_queue_max_zone_append_sectors(struct request_queue *q, 510 - unsigned int max_zone_append_sectors) 511 - { 512 - unsigned int max_sectors = 0; 513 - 514 - if (WARN_ON(!blk_queue_is_zoned(q))) 515 - return; 516 - 517 - if (max_zone_append_sectors) { 518 - max_sectors = min(q->limits.max_hw_sectors, 519 - max_zone_append_sectors); 520 - max_sectors = min(q->limits.chunk_sectors, max_sectors); 521 - 522 - /* 523 - * Signal eventual driver bugs resulting in the max_zone_append 524 - * sectors limit being 0 due to the chunk_sectors limit (zone 525 - * size) not set or the max_hw_sectors limit not set. 526 - */ 527 - WARN_ON_ONCE(!max_sectors); 528 - } 529 - 530 - q->limits.max_zone_append_sectors = max_sectors; 531 - } 532 - EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors); 533 - 534 - /** 535 - * blk_queue_logical_block_size - set logical block size for the queue 536 - * @q: the request queue for the device 537 - * @size: the logical block size, in bytes 538 - * 539 - * Description: 540 - * This should be set to the lowest possible block size that the 541 - * storage device can address. The default of 512 covers most 542 - * hardware. 
543 - **/ 544 - void blk_queue_logical_block_size(struct request_queue *q, unsigned int size) 545 - { 546 - struct queue_limits *limits = &q->limits; 547 - 548 - limits->logical_block_size = size; 549 - 550 - if (limits->discard_granularity < limits->logical_block_size) 551 - limits->discard_granularity = limits->logical_block_size; 552 - 553 - if (limits->physical_block_size < size) 554 - limits->physical_block_size = size; 555 - 556 - if (limits->io_min < limits->physical_block_size) 557 - limits->io_min = limits->physical_block_size; 558 - 559 - limits->max_hw_sectors = 560 - round_down(limits->max_hw_sectors, size >> SECTOR_SHIFT); 561 - limits->max_sectors = 562 - round_down(limits->max_sectors, size >> SECTOR_SHIFT); 563 - } 564 - EXPORT_SYMBOL(blk_queue_logical_block_size); 565 - 566 - /** 567 - * blk_queue_physical_block_size - set physical block size for the queue 568 - * @q: the request queue for the device 569 - * @size: the physical block size, in bytes 570 - * 571 - * Description: 572 - * This should be set to the lowest possible sector size that the 573 - * hardware can operate on without reverting to read-modify-write 574 - * operations. 
575 - */ 576 - void blk_queue_physical_block_size(struct request_queue *q, unsigned int size) 577 - { 578 - q->limits.physical_block_size = size; 579 - 580 - if (q->limits.physical_block_size < q->limits.logical_block_size) 581 - q->limits.physical_block_size = q->limits.logical_block_size; 582 - 583 - if (q->limits.discard_granularity < q->limits.physical_block_size) 584 - q->limits.discard_granularity = q->limits.physical_block_size; 585 - 586 - if (q->limits.io_min < q->limits.physical_block_size) 587 - q->limits.io_min = q->limits.physical_block_size; 588 - } 589 - EXPORT_SYMBOL(blk_queue_physical_block_size); 590 - 591 - /** 592 - * blk_queue_zone_write_granularity - set zone write granularity for the queue 593 - * @q: the request queue for the zoned device 594 - * @size: the zone write granularity size, in bytes 595 - * 596 - * Description: 597 - * This should be set to the lowest possible size allowing to write in 598 - * sequential zones of a zoned block device. 599 - */ 600 - void blk_queue_zone_write_granularity(struct request_queue *q, 601 - unsigned int size) 602 - { 603 - if (WARN_ON_ONCE(!blk_queue_is_zoned(q))) 604 - return; 605 - 606 - q->limits.zone_write_granularity = size; 607 - 608 - if (q->limits.zone_write_granularity < q->limits.logical_block_size) 609 - q->limits.zone_write_granularity = q->limits.logical_block_size; 610 - } 611 - EXPORT_SYMBOL_GPL(blk_queue_zone_write_granularity); 612 - 613 - /** 614 - * blk_queue_alignment_offset - set physical block alignment offset 615 - * @q: the request queue for the device 616 - * @offset: alignment offset in bytes 617 - * 618 - * Description: 619 - * Some devices are naturally misaligned to compensate for things like 620 - * the legacy DOS partition table 63-sector offset. Low-level drivers 621 - * should call this function for devices whose first sector is not 622 - * naturally aligned. 
623 - */ 624 - void blk_queue_alignment_offset(struct request_queue *q, unsigned int offset) 625 - { 626 - q->limits.alignment_offset = 627 - offset & (q->limits.physical_block_size - 1); 628 - q->limits.misaligned = 0; 629 - } 630 - EXPORT_SYMBOL(blk_queue_alignment_offset); 631 - 632 - void disk_update_readahead(struct gendisk *disk) 633 - { 634 - blk_apply_bdi_limits(disk->bdi, &disk->queue->limits); 635 - } 636 - EXPORT_SYMBOL_GPL(disk_update_readahead); 637 - 638 - /** 639 290 * blk_limits_io_min - set minimum request size for a device 640 291 * @limits: the queue limits 641 292 * @min: smallest I/O size in bytes ··· 459 506 limits->io_min = limits->physical_block_size; 460 507 } 461 508 EXPORT_SYMBOL(blk_limits_io_min); 462 - 463 - /** 464 - * blk_queue_io_min - set minimum request size for the queue 465 - * @q: the request queue for the device 466 - * @min: smallest I/O size in bytes 467 - * 468 - * Description: 469 - * Storage devices may report a granularity or preferred minimum I/O 470 - * size which is the smallest request the device can perform without 471 - * incurring a performance penalty. For disk drives this is often the 472 - * physical block size. For RAID arrays it is often the stripe chunk 473 - * size. A properly aligned multiple of minimum_io_size is the 474 - * preferred request size for workloads where a high number of I/O 475 - * operations is desired. 476 - */ 477 - void blk_queue_io_min(struct request_queue *q, unsigned int min) 478 - { 479 - blk_limits_io_min(&q->limits, min); 480 - } 481 - EXPORT_SYMBOL(blk_queue_io_min); 482 509 483 510 /** 484 511 * blk_limits_io_opt - set optimal request size for a device ··· 547 614 { 548 615 unsigned int top, bottom, alignment, ret = 0; 549 616 617 + t->features |= (b->features & BLK_FEAT_INHERIT_MASK); 618 + 619 + /* 620 + * BLK_FEAT_NOWAIT and BLK_FEAT_POLL need to be supported both by the 621 + * stacking driver and all underlying devices. 
The stacking driver sets 622 + * the flags before stacking the limits, and this will clear the flags 623 + * if any of the underlying devices does not support it. 624 + */ 625 + if (!(b->features & BLK_FEAT_NOWAIT)) 626 + t->features &= ~BLK_FEAT_NOWAIT; 627 + if (!(b->features & BLK_FEAT_POLL)) 628 + t->features &= ~BLK_FEAT_POLL; 629 + 630 + t->flags |= (b->flags & BLK_FLAG_MISALIGNED); 631 + 550 632 t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors); 551 633 t->max_user_sectors = min_not_zero(t->max_user_sectors, 552 634 b->max_user_sectors); ··· 571 623 b->max_write_zeroes_sectors); 572 624 t->max_zone_append_sectors = min(queue_limits_max_zone_append_sectors(t), 573 625 queue_limits_max_zone_append_sectors(b)); 574 - t->bounce = max(t->bounce, b->bounce); 575 626 576 627 t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask, 577 628 b->seg_boundary_mask); ··· 586 639 t->max_segment_size = min_not_zero(t->max_segment_size, 587 640 b->max_segment_size); 588 641 589 - t->misaligned |= b->misaligned; 590 - 591 642 alignment = queue_limit_alignment_offset(b, start); 592 643 593 644 /* Bottom device has different alignment. Check that it is ··· 599 654 600 655 /* Verify that top and bottom intervals line up */ 601 656 if (max(top, bottom) % min(top, bottom)) { 602 - t->misaligned = 1; 657 + t->flags |= BLK_FLAG_MISALIGNED; 603 658 ret = -1; 604 659 } 605 660 } ··· 621 676 /* Physical block size a multiple of the logical block size? */ 622 677 if (t->physical_block_size & (t->logical_block_size - 1)) { 623 678 t->physical_block_size = t->logical_block_size; 624 - t->misaligned = 1; 679 + t->flags |= BLK_FLAG_MISALIGNED; 625 680 ret = -1; 626 681 } 627 682 628 683 /* Minimum I/O a multiple of the physical block size? 
*/ 629 684 if (t->io_min & (t->physical_block_size - 1)) { 630 685 t->io_min = t->physical_block_size; 631 - t->misaligned = 1; 686 + t->flags |= BLK_FLAG_MISALIGNED; 632 687 ret = -1; 633 688 } 634 689 635 690 /* Optimal I/O a multiple of the physical block size? */ 636 691 if (t->io_opt & (t->physical_block_size - 1)) { 637 692 t->io_opt = 0; 638 - t->misaligned = 1; 693 + t->flags |= BLK_FLAG_MISALIGNED; 639 694 ret = -1; 640 695 } 641 696 642 697 /* chunk_sectors a multiple of the physical block size? */ 643 698 if ((t->chunk_sectors << 9) & (t->physical_block_size - 1)) { 644 699 t->chunk_sectors = 0; 645 - t->misaligned = 1; 700 + t->flags |= BLK_FLAG_MISALIGNED; 646 701 ret = -1; 647 702 } 648 - 649 - t->raid_partial_stripes_expensive = 650 - max(t->raid_partial_stripes_expensive, 651 - b->raid_partial_stripes_expensive); 652 703 653 704 /* Find lowest common alignment_offset */ 654 705 t->alignment_offset = lcm_not_zero(t->alignment_offset, alignment) ··· 652 711 653 712 /* Verify that new alignment_offset is on a logical block boundary */ 654 713 if (t->alignment_offset & (t->logical_block_size - 1)) { 655 - t->misaligned = 1; 714 + t->flags |= BLK_FLAG_MISALIGNED; 656 715 ret = -1; 657 716 } 658 717 ··· 663 722 /* Discard alignment and granularity */ 664 723 if (b->discard_granularity) { 665 724 alignment = queue_limit_discard_alignment(b, start); 666 - 667 - if (t->discard_granularity != 0 && 668 - t->discard_alignment != alignment) { 669 - top = t->discard_granularity + t->discard_alignment; 670 - bottom = b->discard_granularity + alignment; 671 - 672 - /* Verify that top and bottom intervals line up */ 673 - if ((max(top, bottom) % min(top, bottom)) != 0) 674 - t->discard_misaligned = 1; 675 - } 676 725 677 726 t->max_discard_sectors = min_not_zero(t->max_discard_sectors, 678 727 b->max_discard_sectors); ··· 677 746 b->max_secure_erase_sectors); 678 747 t->zone_write_granularity = max(t->zone_write_granularity, 679 748 b->zone_write_granularity); 680 - 
t->zoned = max(t->zoned, b->zoned); 681 - if (!t->zoned) { 749 + if (!(t->features & BLK_FEAT_ZONED)) { 682 750 t->zone_write_granularity = 0; 683 751 t->max_zone_append_sectors = 0; 684 752 } ··· 711 781 EXPORT_SYMBOL_GPL(queue_limits_stack_bdev); 712 782 713 783 /** 714 - * blk_queue_update_dma_pad - update pad mask 715 - * @q: the request queue for the device 716 - * @mask: pad mask 784 + * queue_limits_stack_integrity - stack integrity profile 785 + * @t: target queue limits 786 + * @b: base queue limits 717 787 * 718 - * Update dma pad mask. 788 + * Check if the integrity profile in the @b can be stacked into the 789 + * target @t. Stacking is possible if either: 719 790 * 720 - * Appending pad buffer to a request modifies the last entry of a 721 - * scatter list such that it includes the pad buffer. 722 - **/ 723 - void blk_queue_update_dma_pad(struct request_queue *q, unsigned int mask) 791 + * a) does not have any integrity information stacked into it yet 792 + * b) the integrity profile in @b is identical to the one in @t 793 + * 794 + * If @b can be stacked into @t, return %true. Else return %false and clear the 795 + * integrity information in @t. 
796 + */ 797 + bool queue_limits_stack_integrity(struct queue_limits *t, 798 + struct queue_limits *b) 724 799 { 725 - if (mask > q->dma_pad_mask) 726 - q->dma_pad_mask = mask; 800 + struct blk_integrity *ti = &t->integrity; 801 + struct blk_integrity *bi = &b->integrity; 802 + 803 + if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) 804 + return true; 805 + 806 + if (!ti->tuple_size) { 807 + /* inherit the settings from the first underlying device */ 808 + if (!(ti->flags & BLK_INTEGRITY_STACKED)) { 809 + ti->flags = BLK_INTEGRITY_DEVICE_CAPABLE | 810 + (bi->flags & BLK_INTEGRITY_REF_TAG); 811 + ti->csum_type = bi->csum_type; 812 + ti->tuple_size = bi->tuple_size; 813 + ti->pi_offset = bi->pi_offset; 814 + ti->interval_exp = bi->interval_exp; 815 + ti->tag_size = bi->tag_size; 816 + goto done; 817 + } 818 + if (!bi->tuple_size) 819 + goto done; 820 + } 821 + 822 + if (ti->tuple_size != bi->tuple_size) 823 + goto incompatible; 824 + if (ti->interval_exp != bi->interval_exp) 825 + goto incompatible; 826 + if (ti->tag_size != bi->tag_size) 827 + goto incompatible; 828 + if (ti->csum_type != bi->csum_type) 829 + goto incompatible; 830 + if ((ti->flags & BLK_INTEGRITY_REF_TAG) != 831 + (bi->flags & BLK_INTEGRITY_REF_TAG)) 832 + goto incompatible; 833 + 834 + done: 835 + ti->flags |= BLK_INTEGRITY_STACKED; 836 + return true; 837 + 838 + incompatible: 839 + memset(ti, 0, sizeof(*ti)); 840 + return false; 727 841 } 728 - EXPORT_SYMBOL(blk_queue_update_dma_pad); 842 + EXPORT_SYMBOL_GPL(queue_limits_stack_integrity); 729 843 730 844 /** 731 845 * blk_set_queue_depth - tell the block layer about the device queue depth ··· 784 810 } 785 811 EXPORT_SYMBOL(blk_set_queue_depth); 786 812 787 - /** 788 - * blk_queue_write_cache - configure queue's write cache 789 - * @q: the request queue for the device 790 - * @wc: write back cache on or off 791 - * @fua: device supports FUA writes, if true 792 - * 793 - * Tell the block layer about the write cache of @q. 
794 - */ 795 - void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua) 796 - { 797 - if (wc) { 798 - blk_queue_flag_set(QUEUE_FLAG_HW_WC, q); 799 - blk_queue_flag_set(QUEUE_FLAG_WC, q); 800 - } else { 801 - blk_queue_flag_clear(QUEUE_FLAG_HW_WC, q); 802 - blk_queue_flag_clear(QUEUE_FLAG_WC, q); 803 - } 804 - if (fua) 805 - blk_queue_flag_set(QUEUE_FLAG_FUA, q); 806 - else 807 - blk_queue_flag_clear(QUEUE_FLAG_FUA, q); 808 - } 809 - EXPORT_SYMBOL_GPL(blk_queue_write_cache); 810 - 811 - /** 812 - * disk_set_zoned - inidicate a zoned device 813 - * @disk: gendisk to configure 814 - */ 815 - void disk_set_zoned(struct gendisk *disk) 816 - { 817 - struct request_queue *q = disk->queue; 818 - 819 - WARN_ON_ONCE(!IS_ENABLED(CONFIG_BLK_DEV_ZONED)); 820 - 821 - /* 822 - * Set the zone write granularity to the device logical block 823 - * size by default. The driver can change this value if needed. 824 - */ 825 - q->limits.zoned = true; 826 - blk_queue_zone_write_granularity(q, queue_logical_block_size(q)); 827 - } 828 - EXPORT_SYMBOL_GPL(disk_set_zoned); 829 - 830 813 int bdev_alignment_offset(struct block_device *bdev) 831 814 { 832 815 struct request_queue *q = bdev_get_queue(bdev); 833 816 834 - if (q->limits.misaligned) 817 + if (q->limits.flags & BLK_FLAG_MISALIGNED) 835 818 return -1; 836 819 if (bdev_is_partition(bdev)) 837 820 return queue_limit_alignment_offset(&q->limits,
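The arithmetic in blk_queue_max_guaranteed_bio() and blk_atomic_writes_update_limits() from the blk-settings.c hunks above can be followed with a small userspace sketch. This is an illustration under stated assumptions, not the kernel implementation: struct model_limits, MODEL_PAGE_SIZE (4096 here, though the real PAGE_SIZE depends on the architecture), MODEL_BIO_MAX_VECS and the helper names are all hypothetical, and rounddown_pow2() returns 0 for an input of 0, a case the kernel helper leaves undefined.

```c
#define SECTOR_SHIFT 9
#define MODEL_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */
#define MODEL_BIO_MAX_VECS 256u /* assumption: stands in for BIO_MAX_VECS */

/* Hypothetical subset of struct queue_limits used by this sketch. */
struct model_limits {
    unsigned int max_segments;
    unsigned int logical_block_size;
    unsigned int max_hw_sectors;
    unsigned int atomic_write_hw_unit_min;
    unsigned int atomic_write_hw_unit_max;
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Largest power of two <= v; returns 0 for v == 0 (sketch-only choice). */
static unsigned int rounddown_pow2(unsigned int v)
{
    unsigned int p = 1;

    if (!v)
        return 0;
    while (p <= v / 2)
        p <<= 1;
    return p;
}

/*
 * First and last segments are only guaranteed one logical block each;
 * every segment in between is assumed to hold a full page.
 */
static unsigned int max_guaranteed_bio(const struct model_limits *lim)
{
    unsigned int max_segments = min_uint(MODEL_BIO_MAX_VECS, lim->max_segments);
    unsigned int length = min_uint(max_segments, 2) * lim->logical_block_size;

    if (max_segments > 2)
        length += (max_segments - 2) * MODEL_PAGE_SIZE;
    return length;
}

/* Cap the hardware atomic write units by a power-of-two software limit. */
static void atomic_unit_limits(const struct model_limits *lim,
                               unsigned int *unit_min, unsigned int *unit_max)
{
    unsigned int unit_limit = min_uint(lim->max_hw_sectors << SECTOR_SHIFT,
                                       max_guaranteed_bio(lim));

    unit_limit = rounddown_pow2(unit_limit);
    *unit_min = min_uint(lim->atomic_write_hw_unit_min, unit_limit);
    *unit_max = min_uint(lim->atomic_write_hw_unit_max, unit_limit);
}
```

For example, with max_segments = 4, a 512-byte logical block size and max_hw_sectors = 2048, the guaranteed bio size is 2 * 512 + 2 * 4096 = 9216 bytes, which rounds down to 8192. A device advertising a 64 KiB hardware unit maximum would therefore report an atomic write unit maximum of 8192 bytes, which shows how a queue limit such as max_segments can reduce the advertised unit.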

+192 -242

block/blk-sysfs.c

··· 22 22 23 23 struct queue_sysfs_entry { 24 24 struct attribute attr; 25 - ssize_t (*show)(struct request_queue *, char *); 26 - ssize_t (*store)(struct request_queue *, const char *, size_t); 25 + ssize_t (*show)(struct gendisk *disk, char *page); 26 + ssize_t (*store)(struct gendisk *disk, const char *page, size_t count); 27 27 }; 28 28 29 29 static ssize_t ··· 47 47 return count; 48 48 } 49 49 50 - static ssize_t queue_requests_show(struct request_queue *q, char *page) 50 + static ssize_t queue_requests_show(struct gendisk *disk, char *page) 51 51 { 52 - return queue_var_show(q->nr_requests, page); 52 + return queue_var_show(disk->queue->nr_requests, page); 53 53 } 54 54 55 55 static ssize_t 56 - queue_requests_store(struct request_queue *q, const char *page, size_t count) 56 + queue_requests_store(struct gendisk *disk, const char *page, size_t count) 57 57 { 58 58 unsigned long nr; 59 59 int ret, err; 60 60 61 - if (!queue_is_mq(q)) 61 + if (!queue_is_mq(disk->queue)) 62 62 return -EINVAL; 63 63 64 64 ret = queue_var_store(&nr, page, count); ··· 68 68 if (nr < BLKDEV_MIN_RQ) 69 69 nr = BLKDEV_MIN_RQ; 70 70 71 - err = blk_mq_update_nr_requests(q, nr); 71 + err = blk_mq_update_nr_requests(disk->queue, nr); 72 72 if (err) 73 73 return err; 74 74 75 75 return ret; 76 76 } 77 77 78 - static ssize_t queue_ra_show(struct request_queue *q, char *page) 78 + static ssize_t queue_ra_show(struct gendisk *disk, char *page) 79 79 { 80 - unsigned long ra_kb; 81 - 82 - if (!q->disk) 83 - return -EINVAL; 84 - ra_kb = q->disk->bdi->ra_pages << (PAGE_SHIFT - 10); 85 - return queue_var_show(ra_kb, page); 80 + return queue_var_show(disk->bdi->ra_pages << (PAGE_SHIFT - 10), page); 86 81 } 87 82 88 83 static ssize_t 89 - queue_ra_store(struct request_queue *q, const char *page, size_t count) 84 + queue_ra_store(struct gendisk *disk, const char *page, size_t count) 90 85 { 91 86 unsigned long ra_kb; 92 87 ssize_t ret; 93 88 94 - if (!q->disk) 95 - return -EINVAL; 96 89 ret = 
queue_var_store(&ra_kb, page, count); 97 90 if (ret < 0) 98 91 return ret; 99 - q->disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10); 92 + disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10); 100 93 return ret; 101 94 } 102 95 103 - static ssize_t queue_max_sectors_show(struct request_queue *q, char *page) 104 - { 105 - int max_sectors_kb = queue_max_sectors(q) >> 1; 106 - 107 - return queue_var_show(max_sectors_kb, page); 96 + #define QUEUE_SYSFS_LIMIT_SHOW(_field) \ 97 + static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \ 98 + { \ 99 + return queue_var_show(disk->queue->limits._field, page); \ 108 100 } 109 101 110 - static ssize_t queue_max_segments_show(struct request_queue *q, char *page) 111 - { 112 - return queue_var_show(queue_max_segments(q), page); 102 + QUEUE_SYSFS_LIMIT_SHOW(max_segments) 103 + QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments) 104 + QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments) 105 + QUEUE_SYSFS_LIMIT_SHOW(max_segment_size) 106 + QUEUE_SYSFS_LIMIT_SHOW(logical_block_size) 107 + QUEUE_SYSFS_LIMIT_SHOW(physical_block_size) 108 + QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors) 109 + QUEUE_SYSFS_LIMIT_SHOW(io_min) 110 + QUEUE_SYSFS_LIMIT_SHOW(io_opt) 111 + QUEUE_SYSFS_LIMIT_SHOW(discard_granularity) 112 + QUEUE_SYSFS_LIMIT_SHOW(zone_write_granularity) 113 + QUEUE_SYSFS_LIMIT_SHOW(virt_boundary_mask) 114 + QUEUE_SYSFS_LIMIT_SHOW(dma_alignment) 115 + QUEUE_SYSFS_LIMIT_SHOW(max_open_zones) 116 + QUEUE_SYSFS_LIMIT_SHOW(max_active_zones) 117 + QUEUE_SYSFS_LIMIT_SHOW(atomic_write_unit_min) 118 + QUEUE_SYSFS_LIMIT_SHOW(atomic_write_unit_max) 119 + 120 + #define QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(_field) \ 121 + static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \ 122 + { \ 123 + return sprintf(page, "%llu\n", \ 124 + (unsigned long long)disk->queue->limits._field << \ 125 + SECTOR_SHIFT); \ 113 126 } 114 127 115 - static ssize_t queue_max_discard_segments_show(struct request_queue *q, 116 - char *page) 117 - { 118 - 
return queue_var_show(queue_max_discard_segments(q), page); 128 + QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_discard_sectors) 129 + QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_hw_discard_sectors) 130 + QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(max_write_zeroes_sectors) 131 + QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_max_sectors) 132 + QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES(atomic_write_boundary_sectors) 133 + 134 + #define QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_KB(_field) \ 135 + static ssize_t queue_##_field##_show(struct gendisk *disk, char *page) \ 136 + { \ 137 + return queue_var_show(disk->queue->limits._field >> 1, page); \ 119 138 } 120 139 121 - static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page) 122 - { 123 - return queue_var_show(q->limits.max_integrity_segments, page); 140 + QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_KB(max_sectors) 141 + QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_KB(max_hw_sectors) 142 + 143 + #define QUEUE_SYSFS_SHOW_CONST(_name, _val) \ 144 + static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \ 145 + { \ 146 + return sprintf(page, "%d\n", _val); \ 124 147 } 125 148 126 - static ssize_t queue_max_segment_size_show(struct request_queue *q, char *page) 127 - { 128 - return queue_var_show(queue_max_segment_size(q), page); 129 - } 149 + /* deprecated fields */ 150 + QUEUE_SYSFS_SHOW_CONST(discard_zeroes_data, 0) 151 + QUEUE_SYSFS_SHOW_CONST(write_same_max, 0) 152 + QUEUE_SYSFS_SHOW_CONST(poll_delay, -1) 130 153 131 - static ssize_t queue_logical_block_size_show(struct request_queue *q, char *page) 132 - { 133 - return queue_var_show(queue_logical_block_size(q), page); 134 - } 135 - 136 - static ssize_t queue_physical_block_size_show(struct request_queue *q, char *page) 137 - { 138 - return queue_var_show(queue_physical_block_size(q), page); 139 - } 140 - 141 - static ssize_t queue_chunk_sectors_show(struct request_queue *q, char *page) 142 - { 143 - return queue_var_show(q->limits.chunk_sectors, 
page); 144 - } 145 - 146 - static ssize_t queue_io_min_show(struct request_queue *q, char *page) 147 - { 148 - return queue_var_show(queue_io_min(q), page); 149 - } 150 - 151 - static ssize_t queue_io_opt_show(struct request_queue *q, char *page) 152 - { 153 - return queue_var_show(queue_io_opt(q), page); 154 - } 155 - 156 - static ssize_t queue_discard_granularity_show(struct request_queue *q, char *page) 157 - { 158 - return queue_var_show(q->limits.discard_granularity, page); 159 - } 160 - 161 - static ssize_t queue_discard_max_hw_show(struct request_queue *q, char *page) 162 - { 163 - 164 - return sprintf(page, "%llu\n", 165 - (unsigned long long)q->limits.max_hw_discard_sectors << 9); 166 - } 167 - 168 - static ssize_t queue_discard_max_show(struct request_queue *q, char *page) 169 - { 170 - return sprintf(page, "%llu\n", 171 - (unsigned long long)q->limits.max_discard_sectors << 9); 172 - } 173 - 174 - static ssize_t queue_discard_max_store(struct request_queue *q, 175 - const char *page, size_t count) 154 + static ssize_t queue_max_discard_sectors_store(struct gendisk *disk, 155 + const char *page, size_t count) 176 156 { 177 157 unsigned long max_discard_bytes; 178 158 struct queue_limits lim; ··· 163 183 if (ret < 0) 164 184 return ret; 165 185 166 - if (max_discard_bytes & (q->limits.discard_granularity - 1)) 186 + if (max_discard_bytes & (disk->queue->limits.discard_granularity - 1)) 167 187 return -EINVAL; 168 188 169 189 if ((max_discard_bytes >> SECTOR_SHIFT) > UINT_MAX) 170 190 return -EINVAL; 171 191 172 - blk_mq_freeze_queue(q); 173 - lim = queue_limits_start_update(q); 192 + lim = queue_limits_start_update(disk->queue); 174 193 lim.max_user_discard_sectors = max_discard_bytes >> SECTOR_SHIFT; 175 - err = queue_limits_commit_update(q, &lim); 176 - blk_mq_unfreeze_queue(q); 177 - 194 + err = queue_limits_commit_update(disk->queue, &lim); 178 195 if (err) 179 196 return err; 180 197 return ret; 181 198 } 182 199 183 - static ssize_t 
queue_discard_zeroes_data_show(struct request_queue *q, char *page) 184 - { 185 - return queue_var_show(0, page); 186 - } 187 - 188 - static ssize_t queue_write_same_max_show(struct request_queue *q, char *page) 189 - { 190 - return queue_var_show(0, page); 191 - } 192 - 193 - static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page) 200 + /* 201 + * For zone append queue_max_zone_append_sectors does not just return the 202 + * underlying queue limits, but actually contains a calculation. Because of 203 + * that we can't simply use QUEUE_SYSFS_LIMIT_SHOW_SECTORS_TO_BYTES here. 204 + */ 205 + static ssize_t queue_zone_append_max_show(struct gendisk *disk, char *page) 194 206 { 195 207 return sprintf(page, "%llu\n", 196 - (unsigned long long)q->limits.max_write_zeroes_sectors << 9); 197 - } 198 - 199 - static ssize_t queue_zone_write_granularity_show(struct request_queue *q, 200 - char *page) 201 - { 202 - return queue_var_show(queue_zone_write_granularity(q), page); 203 - } 204 - 205 - static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page) 206 - { 207 - unsigned long long max_sectors = queue_max_zone_append_sectors(q); 208 - 209 - return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT); 208 + (u64)queue_max_zone_append_sectors(disk->queue) << 209 + SECTOR_SHIFT); 210 210 } 211 211 212 212 static ssize_t 213 - queue_max_sectors_store(struct request_queue *q, const char *page, size_t count) 213 + queue_max_sectors_store(struct gendisk *disk, const char *page, size_t count) 214 214 { 215 215 unsigned long max_sectors_kb; 216 216 struct queue_limits lim; ··· 201 241 if (ret < 0) 202 242 return ret; 203 243 204 - blk_mq_freeze_queue(q); 205 - lim = queue_limits_start_update(q); 244 + lim = queue_limits_start_update(disk->queue); 206 245 lim.max_user_sectors = max_sectors_kb << 1; 207 - err = queue_limits_commit_update(q, &lim); 208 - blk_mq_unfreeze_queue(q); 246 + err = queue_limits_commit_update(disk->queue, &lim); 209 
247 if (err) 210 248 return err; 211 249 return ret; 212 250 } 213 251 214 - static ssize_t queue_max_hw_sectors_show(struct request_queue *q, char *page) 252 + static ssize_t queue_feature_store(struct gendisk *disk, const char *page, 253 + size_t count, blk_features_t feature) 215 254 { 216 - int max_hw_sectors_kb = queue_max_hw_sectors(q) >> 1; 255 + struct queue_limits lim; 256 + unsigned long val; 257 + ssize_t ret; 217 258 218 - return queue_var_show(max_hw_sectors_kb, page); 259 + ret = queue_var_store(&val, page, count); 260 + if (ret < 0) 261 + return ret; 262 + 263 + lim = queue_limits_start_update(disk->queue); 264 + if (val) 265 + lim.features |= feature; 266 + else 267 + lim.features &= ~feature; 268 + ret = queue_limits_commit_update(disk->queue, &lim); 269 + if (ret) 270 + return ret; 271 + return count; 219 272 } 220 273 221 - static ssize_t queue_virt_boundary_mask_show(struct request_queue *q, char *page) 222 - { 223 - return queue_var_show(q->limits.virt_boundary_mask, page); 224 - } 225 - 226 - static ssize_t queue_dma_alignment_show(struct request_queue *q, char *page) 227 - { 228 - return queue_var_show(queue_dma_alignment(q), page); 229 - } 230 - 231 - #define QUEUE_SYSFS_BIT_FNS(name, flag, neg) \ 232 - static ssize_t \ 233 - queue_##name##_show(struct request_queue *q, char *page) \ 274 + #define QUEUE_SYSFS_FEATURE(_name, _feature) \ 275 + static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \ 234 276 { \ 235 - int bit; \ 236 - bit = test_bit(QUEUE_FLAG_##flag, &q->queue_flags); \ 237 - return queue_var_show(neg ? 
!bit : bit, page); \ 277 + return sprintf(page, "%u\n", \ 278 + !!(disk->queue->limits.features & _feature)); \ 238 279 } \ 239 - static ssize_t \ 240 - queue_##name##_store(struct request_queue *q, const char *page, size_t count) \ 280 + static ssize_t queue_##_name##_store(struct gendisk *disk, \ 281 + const char *page, size_t count) \ 241 282 { \ 242 - unsigned long val; \ 243 - ssize_t ret; \ 244 - ret = queue_var_store(&val, page, count); \ 245 - if (ret < 0) \ 246 - return ret; \ 247 - if (neg) \ 248 - val = !val; \ 249 - \ 250 - if (val) \ 251 - blk_queue_flag_set(QUEUE_FLAG_##flag, q); \ 252 - else \ 253 - blk_queue_flag_clear(QUEUE_FLAG_##flag, q); \ 254 - return ret; \ 283 + return queue_feature_store(disk, page, count, _feature); \ 255 284 } 256 285 257 - QUEUE_SYSFS_BIT_FNS(nonrot, NONROT, 1); 258 - QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0); 259 - QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0); 260 - QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0); 261 - #undef QUEUE_SYSFS_BIT_FNS 286 + QUEUE_SYSFS_FEATURE(rotational, BLK_FEAT_ROTATIONAL) 287 + QUEUE_SYSFS_FEATURE(add_random, BLK_FEAT_ADD_RANDOM) 288 + QUEUE_SYSFS_FEATURE(iostats, BLK_FEAT_IO_STAT) 289 + QUEUE_SYSFS_FEATURE(stable_writes, BLK_FEAT_STABLE_WRITES); 262 290 263 - static ssize_t queue_zoned_show(struct request_queue *q, char *page) 291 + #define QUEUE_SYSFS_FEATURE_SHOW(_name, _feature) \ 292 + static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \ 293 + { \ 294 + return sprintf(page, "%u\n", \ 295 + !!(disk->queue->limits.features & _feature)); \ 296 + } 297 + 298 + QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL); 299 + QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA); 300 + QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX); 301 + 302 + static ssize_t queue_zoned_show(struct gendisk *disk, char *page) 264 303 { 265 - if (blk_queue_is_zoned(q)) 304 + if (blk_queue_is_zoned(disk->queue)) 266 305 return sprintf(page, "host-managed\n"); 267 306 return sprintf(page, "none\n"); 268 307 } 269 
308 270 - static ssize_t queue_nr_zones_show(struct request_queue *q, char *page) 309 + static ssize_t queue_nr_zones_show(struct gendisk *disk, char *page) 271 310 { 272 - return queue_var_show(disk_nr_zones(q->disk), page); 311 + return queue_var_show(disk_nr_zones(disk), page); 273 312 } 274 313 275 - static ssize_t queue_max_open_zones_show(struct request_queue *q, char *page) 314 + static ssize_t queue_nomerges_show(struct gendisk *disk, char *page) 276 315 { 277 - return queue_var_show(bdev_max_open_zones(q->disk->part0), page); 316 + return queue_var_show((blk_queue_nomerges(disk->queue) << 1) | 317 + blk_queue_noxmerges(disk->queue), page); 278 318 } 279 319 280 - static ssize_t queue_max_active_zones_show(struct request_queue *q, char *page) 281 - { 282 - return queue_var_show(bdev_max_active_zones(q->disk->part0), page); 283 - } 284 - 285 - static ssize_t queue_nomerges_show(struct request_queue *q, char *page) 286 - { 287 - return queue_var_show((blk_queue_nomerges(q) << 1) | 288 - blk_queue_noxmerges(q), page); 289 - } 290 - 291 - static ssize_t queue_nomerges_store(struct request_queue *q, const char *page, 320 + static ssize_t queue_nomerges_store(struct gendisk *disk, const char *page, 292 321 size_t count) 293 322 { 294 323 unsigned long nm; ··· 286 337 if (ret < 0) 287 338 return ret; 288 339 289 - blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q); 290 - blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, q); 340 + blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, disk->queue); 341 + blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, disk->queue); 291 342 if (nm == 2) 292 - blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q); 343 + blk_queue_flag_set(QUEUE_FLAG_NOMERGES, disk->queue); 293 344 else if (nm) 294 - blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q); 345 + blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, disk->queue); 295 346 296 347 return ret; 297 348 } 298 349 299 - static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page) 350 + static ssize_t 
queue_rq_affinity_show(struct gendisk *disk, char *page) 300 351 { 301 - bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags); 302 - bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags); 352 + bool set = test_bit(QUEUE_FLAG_SAME_COMP, &disk->queue->queue_flags); 353 + bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &disk->queue->queue_flags); 303 354 304 355 return queue_var_show(set << force, page); 305 356 } 306 357 307 358 static ssize_t 308 - queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count) 359 + queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count) 309 360 { 310 361 ssize_t ret = -EINVAL; 311 362 #ifdef CONFIG_SMP 363 + struct request_queue *q = disk->queue; 312 364 unsigned long val; 313 365 314 366 ret = queue_var_store(&val, page, count); ··· 330 380 return ret; 331 381 } 332 382 333 - static ssize_t queue_poll_delay_show(struct request_queue *q, char *page) 334 - { 335 - return sprintf(page, "%d\n", -1); 336 - } 337 - 338 - static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page, 383 + static ssize_t queue_poll_delay_store(struct gendisk *disk, const char *page, 339 384 size_t count) 340 385 { 341 386 return count; 342 387 } 343 388 344 - static ssize_t queue_poll_show(struct request_queue *q, char *page) 345 - { 346 - return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page); 347 - } 348 - 349 - static ssize_t queue_poll_store(struct request_queue *q, const char *page, 389 + static ssize_t queue_poll_store(struct gendisk *disk, const char *page, 350 390 size_t count) 351 391 { 352 - if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags)) 392 + if (!(disk->queue->limits.features & BLK_FEAT_POLL)) 353 393 return -EINVAL; 354 394 pr_info_ratelimited("writes to the poll attribute are ignored.\n"); 355 395 pr_info_ratelimited("please use driver specific parameters instead.\n"); 356 396 return count; 357 397 } 358 398 359 - static ssize_t 
queue_io_timeout_show(struct request_queue *q, char *page) 399 + static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page) 360 400 { 361 - return sprintf(page, "%u\n", jiffies_to_msecs(q->rq_timeout)); 401 + return sprintf(page, "%u\n", jiffies_to_msecs(disk->queue->rq_timeout)); 362 402 } 363 403 364 - static ssize_t queue_io_timeout_store(struct request_queue *q, const char *page, 404 + static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page, 365 405 size_t count) 366 406 { 367 407 unsigned int val; ··· 361 421 if (err || val == 0) 362 422 return -EINVAL; 363 423 364 - blk_queue_rq_timeout(q, msecs_to_jiffies(val)); 424 + blk_queue_rq_timeout(disk->queue, msecs_to_jiffies(val)); 365 425 366 426 return count; 367 427 } 368 428 369 - static ssize_t queue_wc_show(struct request_queue *q, char *page) 429 + static ssize_t queue_wc_show(struct gendisk *disk, char *page) 370 430 { 371 - if (test_bit(QUEUE_FLAG_WC, &q->queue_flags)) 431 + if (blk_queue_write_cache(disk->queue)) 372 432 return sprintf(page, "write back\n"); 373 - 374 433 return sprintf(page, "write through\n"); 375 434 } 376 435 377 - static ssize_t queue_wc_store(struct request_queue *q, const char *page, 436 + static ssize_t queue_wc_store(struct gendisk *disk, const char *page, 378 437 size_t count) 379 438 { 439 + struct queue_limits lim; 440 + bool disable; 441 + int err; 442 + 380 443 if (!strncmp(page, "write back", 10)) { 381 - if (!test_bit(QUEUE_FLAG_HW_WC, &q->queue_flags)) 382 - return -EINVAL; 383 - blk_queue_flag_set(QUEUE_FLAG_WC, q); 444 + disable = false; 384 445 } else if (!strncmp(page, "write through", 13) || 385 - !strncmp(page, "none", 4)) { 386 - blk_queue_flag_clear(QUEUE_FLAG_WC, q); 446 + !strncmp(page, "none", 4)) { 447 + disable = true; 387 448 } else { 388 449 return -EINVAL; 389 450 } 390 451 452 + lim = queue_limits_start_update(disk->queue); 453 + if (disable) 454 + lim.flags |= BLK_FLAG_WRITE_CACHE_DISABLED; 455 + else 456 + lim.flags &= 
~BLK_FLAG_WRITE_CACHE_DISABLED; 457 + err = queue_limits_commit_update(disk->queue, &lim); 458 + if (err) 459 + return err; 391 460 return count; 392 - } 393 - 394 - static ssize_t queue_fua_show(struct request_queue *q, char *page) 395 - { 396 - return sprintf(page, "%u\n", test_bit(QUEUE_FLAG_FUA, &q->queue_flags)); 397 - } 398 - 399 - static ssize_t queue_dax_show(struct request_queue *q, char *page) 400 - { 401 - return queue_var_show(blk_queue_dax(q), page); 402 461 } 403 462 404 463 #define QUEUE_RO_ENTRY(_prefix, _name) \ ··· 430 491 431 492 QUEUE_RO_ENTRY(queue_max_discard_segments, "max_discard_segments"); 432 493 QUEUE_RO_ENTRY(queue_discard_granularity, "discard_granularity"); 433 - QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes"); 434 - QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes"); 494 + QUEUE_RO_ENTRY(queue_max_hw_discard_sectors, "discard_max_hw_bytes"); 495 + QUEUE_RW_ENTRY(queue_max_discard_sectors, "discard_max_bytes"); 435 496 QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data"); 436 497 498 + QUEUE_RO_ENTRY(queue_atomic_write_max_sectors, "atomic_write_max_bytes"); 499 + QUEUE_RO_ENTRY(queue_atomic_write_boundary_sectors, 500 + "atomic_write_boundary_bytes"); 501 + QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes"); 502 + QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes"); 503 + 437 504 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes"); 438 - QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes"); 505 + QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes"); 439 506 QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes"); 440 507 QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity"); 441 508 ··· 467 522 .show = queue_logical_block_size_show, 468 523 }; 469 524 470 - QUEUE_RW_ENTRY(queue_nonrot, "rotational"); 525 + QUEUE_RW_ENTRY(queue_rotational, "rotational"); 471 526 QUEUE_RW_ENTRY(queue_iostats, 
"iostats"); 472 - QUEUE_RW_ENTRY(queue_random, "add_random"); 527 + QUEUE_RW_ENTRY(queue_add_random, "add_random"); 473 528 QUEUE_RW_ENTRY(queue_stable_writes, "stable_writes"); 474 529 475 530 #ifdef CONFIG_BLK_WBT ··· 486 541 return 0; 487 542 } 488 543 489 - static ssize_t queue_wb_lat_show(struct request_queue *q, char *page) 544 + static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page) 490 545 { 491 - if (!wbt_rq_qos(q)) 546 + if (!wbt_rq_qos(disk->queue)) 492 547 return -EINVAL; 493 548 494 - if (wbt_disabled(q)) 549 + if (wbt_disabled(disk->queue)) 495 550 return sprintf(page, "0\n"); 496 551 497 - return sprintf(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000)); 552 + return sprintf(page, "%llu\n", 553 + div_u64(wbt_get_min_lat(disk->queue), 1000)); 498 554 } 499 555 500 - static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page, 556 + static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page, 501 557 size_t count) 502 558 { 559 + struct request_queue *q = disk->queue; 503 560 struct rq_qos *rqos; 504 561 ssize_t ret; 505 562 s64 val; ··· 514 567 515 568 rqos = wbt_rq_qos(q); 516 569 if (!rqos) { 517 - ret = wbt_init(q->disk); 570 + ret = wbt_init(disk); 518 571 if (ret) 519 572 return ret; 520 573 } ··· 532 585 * ends up either enabling or disabling wbt completely. We can't 533 586 * have IO inflight if that happens. 
534 587 */ 535 - blk_mq_freeze_queue(q); 536 588 blk_mq_quiesce_queue(q); 537 589 538 590 wbt_set_min_lat(q, val); 539 591 540 592 blk_mq_unquiesce_queue(q); 541 - blk_mq_unfreeze_queue(q); 542 593 543 594 return count; 544 595 } ··· 560 615 &queue_io_min_entry.attr, 561 616 &queue_io_opt_entry.attr, 562 617 &queue_discard_granularity_entry.attr, 563 - &queue_discard_max_entry.attr, 564 - &queue_discard_max_hw_entry.attr, 618 + &queue_max_discard_sectors_entry.attr, 619 + &queue_max_hw_discard_sectors_entry.attr, 565 620 &queue_discard_zeroes_data_entry.attr, 621 + &queue_atomic_write_max_sectors_entry.attr, 622 + &queue_atomic_write_boundary_sectors_entry.attr, 623 + &queue_atomic_write_unit_min_entry.attr, 624 + &queue_atomic_write_unit_max_entry.attr, 566 625 &queue_write_same_max_entry.attr, 567 - &queue_write_zeroes_max_entry.attr, 626 + &queue_max_write_zeroes_sectors_entry.attr, 568 627 &queue_zone_append_max_entry.attr, 569 628 &queue_zone_write_granularity_entry.attr, 570 - &queue_nonrot_entry.attr, 629 + &queue_rotational_entry.attr, 571 630 &queue_zoned_entry.attr, 572 631 &queue_nr_zones_entry.attr, 573 632 &queue_max_open_zones_entry.attr, ··· 579 630 &queue_nomerges_entry.attr, 580 631 &queue_iostats_entry.attr, 581 632 &queue_stable_writes_entry.attr, 582 - &queue_random_entry.attr, 633 + &queue_add_random_entry.attr, 583 634 &queue_poll_entry.attr, 584 635 &queue_wc_entry.attr, 585 636 &queue_fua_entry.attr, ··· 648 699 { 649 700 struct queue_sysfs_entry *entry = to_queue(attr); 650 701 struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj); 651 - struct request_queue *q = disk->queue; 652 702 ssize_t res; 653 703 654 704 if (!entry->show) 655 705 return -EIO; 656 - mutex_lock(&q->sysfs_lock); 657 - res = entry->show(q, page); 658 - mutex_unlock(&q->sysfs_lock); 706 + mutex_lock(&disk->queue->sysfs_lock); 707 + res = entry->show(disk, page); 708 + mutex_unlock(&disk->queue->sysfs_lock); 659 709 return res; 660 710 } 661 711 ··· 670 
722 if (!entry->store) 671 723 return -EIO; 672 724 725 + blk_mq_freeze_queue(q); 673 726 mutex_lock(&q->sysfs_lock); 674 - res = entry->store(q, page, length); 727 + res = entry->store(disk, page, length); 675 728 mutex_unlock(&q->sysfs_lock); 729 + blk_mq_unfreeze_queue(q); 676 730 return res; 677 731 } 678 732
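The show and store helpers in this file mostly reduce to a few arithmetic and bit operations. The user-space sketch below mirrors three of them: the sector-to-bytes and sector-to-KiB conversions, the value reported by the nomerges attribute, and the set/clear step in queue_feature_store(). All `sketch_` names are hypothetical, and the feature bit values used with them are arbitrary examples rather than the kernel's BLK_FEAT_* encoding.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors */

/* Byte view of a limit stored in sectors (the *_bytes attributes). */
static uint64_t sketch_sectors_to_bytes(uint32_t sectors)
{
	/* Widen before shifting so large limits do not overflow 32 bits. */
	return (uint64_t)sectors << SECTOR_SHIFT;
}

/* KiB view of a limit stored in sectors (the *_kb attributes). */
static uint32_t sketch_sectors_to_kb(uint32_t sectors)
{
	return sectors >> 1; /* two 512-byte sectors per KiB */
}

/* "nomerges" value: 2 = never merge, 1 = no extended merges, 0 = merge. */
static unsigned int sketch_nomerges_value(int nomerges, int noxmerges)
{
	return ((unsigned int)!!nomerges << 1) | (unsigned int)!!noxmerges;
}

/* Set or clear one feature bit depending on the value the user wrote. */
static uint32_t sketch_feature_store(uint32_t features, uint32_t feature,
				     unsigned long val)
{
	if (val)
		features |= feature;
	else
		features &= ~feature;
	return features;
}
```

In the kernel the updated feature word is not written back directly: it lives in a copy of the queue limits that is passed to queue_limits_commit_update().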

+3

block/blk-throttle.c

···

	/* Calc approx time to dispatch */
	jiffy_wait = jiffy_elapsed_rnd - jiffy_elapsed;
+
+	/* make sure at least one io can be dispatched after waiting */
+	jiffy_wait = max(jiffy_wait, HZ / iops_limit + 1);
	return jiffy_wait;
 }

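The added clamp puts a floor under the computed wait. One I/O at iops_limit per second takes HZ / iops_limit jiffies, and the extra jiffy covers the truncation of the integer division. A minimal user-space sketch of that arithmetic follows; `sketch_clamp_iops_wait` is a hypothetical name, and HZ is passed as a parameter because it is a kernel build-time constant.

```c
#include <assert.h>

/*
 * Return the wait time in jiffies after applying the floor from the hunk
 * above: never less than HZ / iops_limit + 1.
 */
static unsigned long sketch_clamp_iops_wait(unsigned long jiffy_wait,
					    unsigned int hz,
					    unsigned int iops_limit)
{
	unsigned long min_wait;

	assert(iops_limit > 0); /* the division below needs a non-zero limit */
	min_wait = hz / iops_limit + 1;
	return jiffy_wait > min_wait ? jiffy_wait : min_wait;
}
```

With an assumed HZ of 1000 and a limit of 3 IOPS, a computed wait of 0 becomes 334 jiffies, while a wait that is already longer is left unchanged.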

+11 -11

block/blk-wbt.c

···
 enum wbt_flags {
	WBT_TRACKED		= 1,	/* write, tracked for throttling */
	WBT_READ		= 2,	/* read */
-	WBT_KSWAPD		= 4,	/* write, from kswapd */
+	WBT_SWAP		= 4,	/* write, from swap_writepage() */
	WBT_DISCARD		= 8,	/* discard */

	WBT_NR_BITS		= 4,	/* number of bits */
···

 enum {
	WBT_RWQ_BG		= 0,
-	WBT_RWQ_KSWAPD,
+	WBT_RWQ_SWAP,
	WBT_RWQ_DISCARD,
	WBT_NUM_RWQ,
 };
···
 static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
					   enum wbt_flags wb_acct)
 {
-	if (wb_acct & WBT_KSWAPD)
-		return &rwb->rq_wait[WBT_RWQ_KSWAPD];
+	if (wb_acct & WBT_SWAP)
+		return &rwb->rq_wait[WBT_RWQ_SWAP];
	else if (wb_acct & WBT_DISCARD)
		return &rwb->rq_wait[WBT_RWQ_DISCARD];

···
	 */
	if (wb_acct & WBT_DISCARD)
		limit = rwb->wb_background;
-	else if (test_bit(QUEUE_FLAG_WC, &rwb->rqos.disk->queue->queue_flags) &&
-		 !wb_recent_wait(rwb))
+	else if (blk_queue_write_cache(rwb->rqos.disk->queue) &&
+		 !wb_recent_wait(rwb))
		limit = 0;
	else
		limit = rwb->wb_normal;
···
		time_before(now, rwb->last_comp + HZ / 10);
 }

-#define REQ_HIPRIO	(REQ_SYNC | REQ_META | REQ_PRIO)
+#define REQ_HIPRIO	(REQ_SYNC | REQ_META | REQ_PRIO | REQ_SWAP)

 static inline unsigned int get_limit(struct rq_wb *rwb, blk_opf_t opf)
 {
···

	/*
	 * At this point we know it's a buffered write. If this is
-	 * kswapd trying to free memory, or REQ_SYNC is set, then
+	 * swap trying to free memory, or REQ_SYNC is set, then
	 * it's WB_SYNC_ALL writeback, and we'll use the max limit for
	 * that. If the write is marked as a background write, then use
	 * the idle limit, or go to normal if we haven't had competing
	 * IO for a bit.
	 */
-	if ((opf & REQ_HIPRIO) || wb_recent_wait(rwb) || current_is_kswapd())
+	if ((opf & REQ_HIPRIO) || wb_recent_wait(rwb))
		limit = rwb->rq_depth.max_depth;
	else if ((opf & REQ_BACKGROUND) || close_io(rwb)) {
		/*
···
	if (bio_op(bio) == REQ_OP_READ) {
		flags = WBT_READ;
	} else if (wbt_should_throttle(bio)) {
-		if (current_is_kswapd())
-			flags |= WBT_KSWAPD;
+		if (bio->bi_opf & REQ_SWAP)
+			flags |= WBT_SWAP;
		if (bio_op(bio) == REQ_OP_DISCARD)
			flags |= WBT_DISCARD;
		flags |= WBT_TRACKED;
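The effect of these hunks is that swap writes are identified by a flag on the bio (REQ_SWAP) rather than by asking whether the submitting task is kswapd. The sketch below reproduces the flag and wait-queue selection in user space. `struct sketch_bio` and its boolean fields are hypothetical stand-ins for the kernel's bio and REQ_* bits, and the final fallback to the background queue is assumed from the enum, since that return statement lies outside the hunk shown.

```c
#include <stdbool.h>

enum wbt_flags {
	WBT_TRACKED = 1, /* write, tracked for throttling */
	WBT_READ    = 2, /* read */
	WBT_SWAP    = 4, /* write, from swap */
	WBT_DISCARD = 8, /* discard */
};

enum { WBT_RWQ_BG = 0, WBT_RWQ_SWAP, WBT_RWQ_DISCARD, WBT_NUM_RWQ };

/* Hypothetical, simplified view of the bio properties the code tests. */
struct sketch_bio {
	bool is_read;
	bool is_discard;
	bool is_swap;      /* stands in for bio->bi_opf & REQ_SWAP */
	bool throttleable; /* stands in for wbt_should_throttle(bio) */
};

/* Derive the accounting flags for a bio, as in the last hunk above. */
static unsigned int sketch_bio_to_wbt_flags(const struct sketch_bio *bio)
{
	unsigned int flags = 0;

	if (bio->is_read) {
		flags = WBT_READ;
	} else if (bio->throttleable) {
		if (bio->is_swap)
			flags |= WBT_SWAP;
		if (bio->is_discard)
			flags |= WBT_DISCARD;
		flags |= WBT_TRACKED;
	}
	return flags;
}

/* Pick the wait-queue index; swap takes precedence over discard. */
static int sketch_rq_wait_index(unsigned int wb_acct)
{
	if (wb_acct & WBT_SWAP)
		return WBT_RWQ_SWAP;
	else if (wb_acct & WBT_DISCARD)
		return WBT_RWQ_DISCARD;
	return WBT_RWQ_BG; /* assumed fallback, not visible in the hunk */
}
```

Because the decision follows the bio, a swap write submitted from any task context lands in the swap wait queue in this model.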

+21 -105

block/blk-zoned.c

···
 EXPORT_SYMBOL_GPL(blk_zone_cond_str);

 /**
- * bdev_nr_zones - Get number of zones
- * @bdev:	Target device
- *
- * Return the total number of zones of a zoned block device. For a block
- * device without zone capabilities, the number of zones is always 0.
- */
-unsigned int bdev_nr_zones(struct block_device *bdev)
-{
-	sector_t zone_sectors = bdev_zone_sectors(bdev);
-
-	if (!bdev_is_zoned(bdev))
-		return 0;
-	return (bdev_nr_sectors(bdev) + zone_sectors - 1) >>
-		ilog2(zone_sectors);
-}
-EXPORT_SYMBOL_GPL(bdev_nr_zones);
-
-/**
  * blkdev_report_zones - Get zones information
  * @bdev:	Target block device
  * @sector:	Sector from which to report zones
···
 }
 EXPORT_SYMBOL_GPL(blkdev_report_zones);

-static inline unsigned long *blk_alloc_zone_bitmap(int node,
-						   unsigned int nr_zones)
-{
-	return kcalloc_node(BITS_TO_LONGS(nr_zones), sizeof(unsigned long),
-			    GFP_NOIO, node);
-}
-
-static int blk_zone_need_reset_cb(struct blk_zone *zone, unsigned int idx,
-				  void *data)
-{
-	/*
-	 * For an all-zones reset, ignore conventional, empty, read-only
-	 * and offline zones.
-	 */
-	switch (zone->cond) {
-	case BLK_ZONE_COND_NOT_WP:
-	case BLK_ZONE_COND_EMPTY:
-	case BLK_ZONE_COND_READONLY:
-	case BLK_ZONE_COND_OFFLINE:
-		return 0;
-	default:
-		set_bit(idx, (unsigned long *)data);
-		return 0;
-	}
-}
-
-static int blkdev_zone_reset_all_emulated(struct block_device *bdev)
-{
-	struct gendisk *disk = bdev->bd_disk;
-	sector_t capacity = bdev_nr_sectors(bdev);
-	sector_t zone_sectors = bdev_zone_sectors(bdev);
-	unsigned long *need_reset;
-	struct bio *bio = NULL;
-	sector_t sector = 0;
-	int ret;
-
-	need_reset = blk_alloc_zone_bitmap(disk->queue->node, disk->nr_zones);
-	if (!need_reset)
-		return -ENOMEM;
-
-	ret = disk->fops->report_zones(disk, 0, disk->nr_zones,
-				       blk_zone_need_reset_cb, need_reset);
-	if (ret < 0)
-		goto out_free_need_reset;
-
-	ret = 0;
-	while (sector < capacity) {
-		if (!test_bit(disk_zone_no(disk, sector), need_reset)) {
-			sector += zone_sectors;
-			continue;
-		}
-
-		bio = blk_next_bio(bio, bdev, 0, REQ_OP_ZONE_RESET | REQ_SYNC,
-				   GFP_KERNEL);
-		bio->bi_iter.bi_sector = sector;
-		sector += zone_sectors;
-
-		/* This may take a while, so be nice to others */
-		cond_resched();
-	}
-
-	if (bio) {
-		ret = submit_bio_wait(bio);
-		bio_put(bio);
-	}
-
-out_free_need_reset:
-	kfree(need_reset);
-	return ret;
-}
-
 static int blkdev_zone_reset_all(struct block_device *bdev)
 {
	struct bio bio;
···
 int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
		      sector_t sector, sector_t nr_sectors)
 {
-	struct request_queue *q = bdev_get_queue(bdev);
	sector_t zone_sectors = bdev_zone_sectors(bdev);
	sector_t capacity = bdev_nr_sectors(bdev);
	sector_t end_sector = sector + nr_sectors;
···
		return -EINVAL;

	/*
-	 * In the case of a zone reset operation over all zones,
-	 * REQ_OP_ZONE_RESET_ALL can be used with devices supporting this
-	 * command. For other devices, we emulate this command behavior by
-	 * identifying the zones needing a reset.
+	 * In the case of a zone reset operation over all zones, use
+	 * REQ_OP_ZONE_RESET_ALL.
	 */
-	if (op == REQ_OP_ZONE_RESET && sector == 0 && nr_sectors == capacity) {
-		if (!blk_queue_zone_resetall(q))
-			return blkdev_zone_reset_all_emulated(bdev);
+	if (op == REQ_OP_ZONE_RESET && sector == 0 && nr_sectors == capacity)
		return blkdev_zone_reset_all(bdev);
-	}

	while (sector < end_sector) {
		bio = blk_next_bio(bio, bdev, 0, op | REQ_SYNC, GFP_KERNEL);
···
	mempool_destroy(disk->zone_wplugs_pool);
	disk->zone_wplugs_pool = NULL;

-	kfree(disk->conv_zones_bitmap);
+	bitmap_free(disk->conv_zones_bitmap);
	disk->conv_zones_bitmap = NULL;
	disk->zone_capacity = 0;
	disk->last_zone_capacity = 0;
···
		return -ENODEV;
	}

+	lim = queue_limits_start_update(q);
+
+	/*
+	 * Some devices can advertize zone resource limits that are larger than
+	 * the number of sequential zones of the zoned block device, e.g. a
+	 * small ZNS namespace. For such case, assume that the zoned device has
+	 * no zone resource limits.
+	 */
+	nr_seq_zones = disk->nr_zones - nr_conv_zones;
+	if (lim.max_open_zones >= nr_seq_zones)
+		lim.max_open_zones = 0;
+	if (lim.max_active_zones >= nr_seq_zones)
+		lim.max_active_zones = 0;
+
	if (!disk->zone_wplugs_pool)
-		return 0;
+		goto commit;

	/*
	 * If the device has no limit on the maximum number of open and active
···
	 * dynamic zone write plug allocation when simultaneously writing to
	 * more zones than the size of the mempool.
	 */
-	lim = queue_limits_start_update(q);
-
-	nr_seq_zones = disk->nr_zones - nr_conv_zones;
	pool_size = max(lim.max_open_zones, lim.max_active_zones);
	if (!pool_size)
		pool_size = min(BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, nr_seq_zones);
···
			lim.max_open_zones = 0;
	}

+commit:
	return queue_limits_commit_update(q, &lim);
 }

···
				   struct blk_revalidate_zone_args *args)
 {
	struct gendisk *disk = args->disk;
-	struct request_queue *q = disk->queue;

	if (zone->capacity != zone->len) {
		pr_warn("%s: Invalid conventional zone capacity\n",
···

	if (!args->conv_zones_bitmap) {
		args->conv_zones_bitmap =
-			blk_alloc_zone_bitmap(q->node, args->nr_zones);
+			bitmap_zalloc(args->nr_zones, GFP_NOIO);
		if (!args->conv_zones_bitmap)
			return -ENOMEM;
	}
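The new block in the zone-resource update treats an advertised limit that is at least as large as the number of sequential zones as "no limit" (0), and it now does so before the early exit for disks without a write-plug pool. A minimal user-space sketch of that clamp follows; `struct sketch_zone_limits` and the function name are hypothetical simplifications of struct queue_limits and the surrounding kernel function.

```c
/* Only the two fields the clamp touches. */
struct sketch_zone_limits {
	unsigned int max_open_zones;
	unsigned int max_active_zones;
};

/*
 * A limit that can never be reached (>= number of sequential zones) carries
 * no information, so it is reset to 0, which means "no limit".
 */
static void sketch_clamp_zone_limits(struct sketch_zone_limits *lim,
				     unsigned int nr_zones,
				     unsigned int nr_conv_zones)
{
	unsigned int nr_seq_zones = nr_zones - nr_conv_zones;

	if (lim->max_open_zones >= nr_seq_zones)
		lim->max_open_zones = 0;
	if (lim->max_active_zones >= nr_seq_zones)
		lim->max_active_zones = 0;
}
```

For example, a device with 16 zones, 4 of them conventional, and limits of 8 open and 12 active zones keeps the open limit and drops the active limit.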

+17 -5

block/blk.h

···
		struct bio_vec *vec1, struct bio_vec *vec2)
 {
	unsigned long mask = queue_segment_boundary(q);
-	phys_addr_t addr1 = page_to_phys(vec1->bv_page) + vec1->bv_offset;
-	phys_addr_t addr2 = page_to_phys(vec2->bv_page) + vec2->bv_offset;
+	phys_addr_t addr1 = bvec_phys(vec1);
+	phys_addr_t addr2 = bvec_phys(vec2);

	/*
	 * Merging adjacent physical pages may not work correctly under KMSAN
···
	return queue_max_segments(rq->q);
 }

-static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     enum req_op op)
+static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
 {
+	struct request_queue *q = rq->q;
+	enum req_op op = req_op(rq);
+
	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
		return min(q->limits.max_discard_sectors,
			   UINT_MAX >> SECTOR_SHIFT);

	if (unlikely(op == REQ_OP_WRITE_ZEROES))
		return q->limits.max_write_zeroes_sectors;
+
+	if (rq->cmd_flags & REQ_ATOMIC)
+		return q->limits.atomic_write_max_sectors;

	return q->limits.max_sectors;
 }
···
 enum elv_merge blk_try_merge(struct request *rq, struct bio *bio);

 int blk_set_default_limits(struct queue_limits *lim);
+void blk_apply_bdi_limits(struct backing_dev_info *bdi,
+		struct queue_limits *lim);
 int blk_dev_init(void);

 /*
···
 static inline bool blk_queue_may_bounce(struct request_queue *q)
 {
	return IS_ENABLED(CONFIG_BOUNCE) &&
-		q->limits.bounce == BLK_BOUNCE_HIGH &&
+		(q->limits.features & BLK_FEAT_BOUNCE_HIGH) &&
		max_low_pfn >= max_pfn;
 }

···
 int bdev_open(struct block_device *bdev, blk_mode_t mode, void *holder,
		const struct blk_holder_ops *hops, struct file *bdev_file);
 int bdev_permission(dev_t dev, blk_mode_t mode, void *holder);
+
+void blk_integrity_generate(struct bio *bio);
+void blk_integrity_verify(struct bio *bio);
+void blk_integrity_prepare(struct request *rq);
+void blk_integrity_complete(struct request *rq, unsigned int nr_bytes);

 #endif /* BLK_INTERNAL_H */
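blk_queue_get_max_sectors() now takes the request itself so that it can look at the command flags as well as the operation. The order of the checks matters: discard and write-zeroes limits are chosen first, then the atomic write limit, then the general one. The sketch below models that selection in user space; the `sketch_` types are hypothetical simplifications of enum req_op and struct queue_limits, and the boolean `atomic` stands in for the REQ_ATOMIC test.

```c
#include <limits.h>
#include <stdbool.h>

#define SECTOR_SHIFT 9

/* Simplified stand-in for enum req_op. */
enum sketch_op {
	SKETCH_WRITE,
	SKETCH_DISCARD,
	SKETCH_SECURE_ERASE,
	SKETCH_WRITE_ZEROES,
};

/* Only the limits consulted by the selection logic. */
struct sketch_limits {
	unsigned int max_sectors;
	unsigned int max_discard_sectors;
	unsigned int max_write_zeroes_sectors;
	unsigned int atomic_write_max_sectors;
};

static unsigned int sketch_max_sectors(const struct sketch_limits *lim,
				       enum sketch_op op, bool atomic)
{
	if (op == SKETCH_DISCARD || op == SKETCH_SECURE_ERASE) {
		/* Keep the byte count representable in an unsigned int. */
		unsigned int cap = UINT_MAX >> SECTOR_SHIFT;

		return lim->max_discard_sectors < cap ?
			lim->max_discard_sectors : cap;
	}

	if (op == SKETCH_WRITE_ZEROES)
		return lim->max_write_zeroes_sectors;

	if (atomic)
		return lim->atomic_write_max_sectors;

	return lim->max_sectors;
}
```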

+5 -4

block/elevator.c

···
	return ret;
 }

-ssize_t elv_iosched_store(struct request_queue *q, const char *buf,
+ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
			   size_t count)
 {
	char elevator_name[ELV_NAME_MAX];
	int ret;

-	if (!elv_support_iosched(q))
+	if (!elv_support_iosched(disk->queue))
		return count;

	strscpy(elevator_name, buf, sizeof(elevator_name));
-	ret = elevator_change(q, strstrip(elevator_name));
+	ret = elevator_change(disk->queue, strstrip(elevator_name));
	if (!ret)
		return count;
	return ret;
 }

-ssize_t elv_iosched_show(struct request_queue *q, char *name)
+ssize_t elv_iosched_show(struct gendisk *disk, char *name)
 {
+	struct request_queue *q = disk->queue;
	struct elevator_queue *eq = q->elevator;
	struct elevator_type *cur = NULL, *e;
	int len = 0;

+2 -2

block/elevator.h

···
 /*
  * io scheduler sysfs switching
  */
-extern ssize_t elv_iosched_show(struct request_queue *, char *);
-extern ssize_t elv_iosched_store(struct request_queue *, const char *, size_t);
+ssize_t elv_iosched_show(struct gendisk *disk, char *page);
+ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);

 extern bool elv_bio_merge_ok(struct request *, struct bio *);
 extern struct elevator_queue *elevator_alloc(struct request_queue *,

+20 -5

block/fops.c

···
 	return opf;
 }
 
-static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
-			      struct iov_iter *iter)
+static bool blkdev_dio_invalid(struct block_device *bdev, loff_t pos,
+				struct iov_iter *iter, bool is_atomic)
 {
+	if (is_atomic && !generic_atomic_write_valid(iter, pos))
+		return true;
+
 	return pos & (bdev_logical_block_size(bdev) - 1) ||
 		!bdev_iter_is_aligned(bdev, iter);
 }
···
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
 	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		bio.bi_opf |= REQ_ATOMIC;
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
···
 		task_io_account_write(bio->bi_iter.bi_size);
 	}
 
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		bio->bi_opf |= REQ_ATOMIC;
+
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		bio->bi_opf |= REQ_NOWAIT;
 
···
 static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+	bool is_atomic = iocb->ki_flags & IOCB_ATOMIC;
 	unsigned int nr_pages;
 
 	if (!iov_iter_count(iter))
 		return 0;
 
-	if (blkdev_dio_unaligned(bdev, iocb->ki_pos, iter))
+	if (blkdev_dio_invalid(bdev, iocb->ki_pos, iter, is_atomic))
 		return -EINVAL;
 
 	nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
···
 			return __blkdev_direct_IO_simple(iocb, iter, bdev,
 							 nr_pages);
 		return __blkdev_direct_IO_async(iocb, iter, bdev, nr_pages);
+	} else if (is_atomic) {
+		return -EINVAL;
 	}
 	return __blkdev_direct_IO(iocb, iter, bdev, bio_max_segs(nr_pages));
 }
···
 	struct block_device *bdev = I_BDEV(inode);
 	loff_t isize = i_size_read(inode);
 
-	iomap->bdev = bdev;
-	iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
 	if (offset >= isize)
 		return -EIO;
+
+	iomap->bdev = bdev;
+	iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
 	iomap->type = IOMAP_MAPPED;
 	iomap->addr = iomap->offset;
 	iomap->length = isize - iomap->offset;
···
 	bdev = blkdev_get_no_open(inode->i_rdev);
 	if (!bdev)
 		return -ENXIO;
+
+	if (bdev_can_atomic_write(bdev) && filp->f_flags & O_DIRECT)
+		filp->f_mode |= FMODE_CAN_ATOMIC_WRITE;
 
 	ret = bdev_open(bdev, mode, filp->private_data, NULL, filp);
 	if (ret)
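
The block/fops.c change rejects an atomic direct write before any bio is built. The following is a rough userspace model of that gate, not kernel code. The body of `generic_atomic_write_valid()` is not part of this diff, so the two atomic rules used here (length is a power of two, position is a multiple of the length) are assumptions, and the names `is_pow2` and `atomic_dio_invalid` are hypothetical. The buffer-alignment test done by `bdev_iter_is_aligned()` is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when v is a non-zero power of two. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical userspace model of blkdev_dio_invalid(); not kernel code.
 * Assumed atomic rules: len is a power of two and pos is a multiple of len.
 * The buffer-alignment test (bdev_iter_is_aligned) is not modeled.
 */
static bool atomic_dio_invalid(uint64_t pos, size_t len,
			       unsigned int logical_block_size, bool is_atomic)
{
	assert(is_pow2(logical_block_size));

	if (is_atomic && (!is_pow2(len) || (pos & (len - 1)) != 0))
		return true;

	/* Same mask test as the diff: pos must sit on a logical block. */
	return (pos & (logical_block_size - 1)) != 0;
}
```

Under these assumptions a 4096-byte atomic write at offset 8192 passes, while an 8192-byte atomic write at offset 4096 is rejected because the offset is not a multiple of the length.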

+1 -1

block/genhd.c

···
 		disk->part0->bd_dev = MKDEV(disk->major, disk->first_minor);
 	}
 
-	disk_update_readahead(disk);
+	blk_apply_bdi_limits(disk->bdi, &disk->queue->limits);
 	disk_add_events(disk);
 	set_bit(GD_ADDED, &disk->state);
 	return 0;

+1 -1

block/ioctl.c

···
 		goto fail;
 
 	err = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
-				   BLKDEV_ZERO_NOUNMAP);
+				   BLKDEV_ZERO_NOUNMAP | BLKDEV_ZERO_KILLABLE);
 
 fail:
 	filemap_invalidate_unlock(bdev->bd_mapping);

+17 -3

block/mq-deadline.c

···
 }
 
 /*
+ * 'depth' is a number in the range 1..INT_MAX representing a number of
+ * requests. Scale it with a factor (1 << bt->sb.shift) / q->nr_requests since
+ * 1..(1 << bt->sb.shift) is the range expected by sbitmap_get_shallow().
+ * Values larger than q->nr_requests have the same effect as q->nr_requests.
+ */
+static int dd_to_word_depth(struct blk_mq_hw_ctx *hctx, unsigned int qdepth)
+{
+	struct sbitmap_queue *bt = &hctx->sched_tags->bitmap_tags;
+	const unsigned int nrr = hctx->queue->nr_requests;
+
+	return ((qdepth << bt->sb.shift) + nrr - 1) / nrr;
+}
+
+/*
  * Called by __blk_mq_alloc_request(). The shallow_depth value set by this
  * function is used by __blk_mq_get_tag().
  */
···
 	 * Throttle asynchronous requests and writes such that these requests
 	 * do not block the allocation of synchronous requests.
 	 */
-	data->shallow_depth = dd->async_depth;
+	data->shallow_depth = dd_to_word_depth(data->hctx, dd->async_depth);
 }
 
 /* Called by blk_mq_update_nr_requests(). */
···
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct blk_mq_tags *tags = hctx->sched_tags;
 
-	dd->async_depth = max(1UL, 3 * q->nr_requests / 4);
+	dd->async_depth = q->nr_requests;
 
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, dd->async_depth);
+	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
 }
 
 /* Called by blk_mq_init_hctx() and blk_mq_init_sched(). */
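
The scaling in `dd_to_word_depth()` is a ceiling division, which is easy to check outside the kernel. The sketch below reproduces only the arithmetic. The name `to_word_depth` is hypothetical, `shift` stands in for `bt->sb.shift`, and the values in the note that follows are illustrative, not taken from a real device.

```c
#include <assert.h>

/*
 * Standalone model of dd_to_word_depth(): scale a request count into the
 * 1..(1 << shift) range by computing ceil((qdepth << shift) / nr_requests).
 * 'shift' stands in for bt->sb.shift; values here are illustrative only.
 */
static unsigned int to_word_depth(unsigned int qdepth, unsigned int shift,
				  unsigned int nr_requests)
{
	assert(nr_requests > 0);
	return ((qdepth << shift) + nr_requests - 1) / nr_requests;
}
```

With `shift = 6` (64 bits per sbitmap word) and `nr_requests = 256`, a depth of 192 requests scales to 48 bits per word and a depth of 256 scales to the full 64. Rounding up means a depth of 1 still maps to 1 rather than 0.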

+138 -160

block/t10-pi.c

···
 #include <linux/module.h>
 #include <net/checksum.h>
 #include <asm/unaligned.h>
+#include "blk.h"
 
-typedef __be16 (csum_fn) (__be16, void *, unsigned int);
+struct blk_integrity_iter {
+	void			*prot_buf;
+	void			*data_buf;
+	sector_t		seed;
+	unsigned int		data_size;
+	unsigned short		interval;
+	const char		*disk_name;
+};
 
-static __be16 t10_pi_crc_fn(__be16 crc, void *data, unsigned int len)
+static __be16 t10_pi_csum(__be16 csum, void *data, unsigned int len,
+		unsigned char csum_type)
 {
-	return cpu_to_be16(crc_t10dif_update(be16_to_cpu(crc), data, len));
-}
-
-static __be16 t10_pi_ip_fn(__be16 csum, void *data, unsigned int len)
-{
-	return (__force __be16)ip_compute_csum(data, len);
+	if (csum_type == BLK_INTEGRITY_CSUM_IP)
+		return (__force __be16)ip_compute_csum(data, len);
+	return cpu_to_be16(crc_t10dif_update(be16_to_cpu(csum), data, len));
 }
 
 /*
···
  * 16 bit app tag, 32 bit reference tag. Type 3 does not define the ref
  * tag.
  */
-static blk_status_t t10_pi_generate(struct blk_integrity_iter *iter,
-		csum_fn *fn, enum t10_dif_type type)
+static void t10_pi_generate(struct blk_integrity_iter *iter,
+		struct blk_integrity *bi)
 {
-	u8 offset = iter->pi_offset;
+	u8 offset = bi->pi_offset;
 	unsigned int i;
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
 		struct t10_pi_tuple *pi = iter->prot_buf + offset;
 
-		pi->guard_tag = fn(0, iter->data_buf, iter->interval);
+		pi->guard_tag = t10_pi_csum(0, iter->data_buf, iter->interval,
+				bi->csum_type);
 		if (offset)
-			pi->guard_tag = fn(pi->guard_tag, iter->prot_buf,
-					   offset);
+			pi->guard_tag = t10_pi_csum(pi->guard_tag,
+					iter->prot_buf, offset, bi->csum_type);
 		pi->app_tag = 0;
 
-		if (type == T10_PI_TYPE1_PROTECTION)
+		if (bi->flags & BLK_INTEGRITY_REF_TAG)
 			pi->ref_tag = cpu_to_be32(lower_32_bits(iter->seed));
 		else
 			pi->ref_tag = 0;
 
 		iter->data_buf += iter->interval;
-		iter->prot_buf += iter->tuple_size;
+		iter->prot_buf += bi->tuple_size;
 		iter->seed++;
 	}
-
-	return BLK_STS_OK;
 }
 
 static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
-		csum_fn *fn, enum t10_dif_type type)
+		struct blk_integrity *bi)
 {
-	u8 offset = iter->pi_offset;
+	u8 offset = bi->pi_offset;
 	unsigned int i;
-
-	BUG_ON(type == T10_PI_TYPE0_PROTECTION);
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
 		struct t10_pi_tuple *pi = iter->prot_buf + offset;
 		__be16 csum;
 
-		if (type == T10_PI_TYPE1_PROTECTION ||
-		    type == T10_PI_TYPE2_PROTECTION) {
+		if (bi->flags & BLK_INTEGRITY_REF_TAG) {
 			if (pi->app_tag == T10_PI_APP_ESCAPE)
 				goto next;
 
···
 				       iter->seed, be32_to_cpu(pi->ref_tag));
 				return BLK_STS_PROTECTION;
 			}
-		} else if (type == T10_PI_TYPE3_PROTECTION) {
+		} else {
 			if (pi->app_tag == T10_PI_APP_ESCAPE &&
 			    pi->ref_tag == T10_PI_REF_ESCAPE)
 				goto next;
 		}
 
-		csum = fn(0, iter->data_buf, iter->interval);
+		csum = t10_pi_csum(0, iter->data_buf, iter->interval,
+				bi->csum_type);
 		if (offset)
-			csum = fn(csum, iter->prot_buf, offset);
+			csum = t10_pi_csum(csum, iter->prot_buf, offset,
+					bi->csum_type);
 
 		if (pi->guard_tag != csum) {
 			pr_err("%s: guard tag error at sector %llu " \
···
 
 next:
 		iter->data_buf += iter->interval;
-		iter->prot_buf += iter->tuple_size;
+		iter->prot_buf += bi->tuple_size;
 		iter->seed++;
 	}
 
 	return BLK_STS_OK;
-}
-
-static blk_status_t t10_pi_type1_generate_crc(struct blk_integrity_iter *iter)
-{
-	return t10_pi_generate(iter, t10_pi_crc_fn, T10_PI_TYPE1_PROTECTION);
-}
-
-static blk_status_t t10_pi_type1_generate_ip(struct blk_integrity_iter *iter)
-{
-	return t10_pi_generate(iter, t10_pi_ip_fn, T10_PI_TYPE1_PROTECTION);
-}
-
-static blk_status_t t10_pi_type1_verify_crc(struct blk_integrity_iter *iter)
-{
-	return t10_pi_verify(iter, t10_pi_crc_fn, T10_PI_TYPE1_PROTECTION);
-}
-
-static blk_status_t t10_pi_type1_verify_ip(struct blk_integrity_iter *iter)
-{
-	return t10_pi_verify(iter, t10_pi_ip_fn, T10_PI_TYPE1_PROTECTION);
 }
 
 /**
···
  */
 static void t10_pi_type1_prepare(struct request *rq)
 {
-	struct blk_integrity *bi = &rq->q->integrity;
+	struct blk_integrity *bi = &rq->q->limits.integrity;
 	const int tuple_sz = bi->tuple_size;
 	u32 ref_tag = t10_pi_ref_tag(rq);
 	u8 offset = bi->pi_offset;
···
  */
 static void t10_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 {
-	struct blk_integrity *bi = &rq->q->integrity;
+	struct blk_integrity *bi = &rq->q->limits.integrity;
 	unsigned intervals = nr_bytes >> bi->interval_exp;
 	const int tuple_sz = bi->tuple_size;
 	u32 ref_tag = t10_pi_ref_tag(rq);
···
 	}
 }
 
-static blk_status_t t10_pi_type3_generate_crc(struct blk_integrity_iter *iter)
-{
-	return t10_pi_generate(iter, t10_pi_crc_fn, T10_PI_TYPE3_PROTECTION);
-}
-
-static blk_status_t t10_pi_type3_generate_ip(struct blk_integrity_iter *iter)
-{
-	return t10_pi_generate(iter, t10_pi_ip_fn, T10_PI_TYPE3_PROTECTION);
-}
-
-static blk_status_t t10_pi_type3_verify_crc(struct blk_integrity_iter *iter)
-{
-	return t10_pi_verify(iter, t10_pi_crc_fn, T10_PI_TYPE3_PROTECTION);
-}
-
-static blk_status_t t10_pi_type3_verify_ip(struct blk_integrity_iter *iter)
-{
-	return t10_pi_verify(iter, t10_pi_ip_fn, T10_PI_TYPE3_PROTECTION);
-}
-
-/* Type 3 does not have a reference tag so no remapping is required. */
-static void t10_pi_type3_prepare(struct request *rq)
-{
-}
-
-/* Type 3 does not have a reference tag so no remapping is required. */
-static void t10_pi_type3_complete(struct request *rq, unsigned int nr_bytes)
-{
-}
-
-const struct blk_integrity_profile t10_pi_type1_crc = {
-	.name			= "T10-DIF-TYPE1-CRC",
-	.generate_fn		= t10_pi_type1_generate_crc,
-	.verify_fn		= t10_pi_type1_verify_crc,
-	.prepare_fn		= t10_pi_type1_prepare,
-	.complete_fn		= t10_pi_type1_complete,
-};
-EXPORT_SYMBOL(t10_pi_type1_crc);
-
-const struct blk_integrity_profile t10_pi_type1_ip = {
-	.name			= "T10-DIF-TYPE1-IP",
-	.generate_fn		= t10_pi_type1_generate_ip,
-	.verify_fn		= t10_pi_type1_verify_ip,
-	.prepare_fn		= t10_pi_type1_prepare,
-	.complete_fn		= t10_pi_type1_complete,
-};
-EXPORT_SYMBOL(t10_pi_type1_ip);
-
-const struct blk_integrity_profile t10_pi_type3_crc = {
-	.name			= "T10-DIF-TYPE3-CRC",
-	.generate_fn		= t10_pi_type3_generate_crc,
-	.verify_fn		= t10_pi_type3_verify_crc,
-	.prepare_fn		= t10_pi_type3_prepare,
-	.complete_fn		= t10_pi_type3_complete,
-};
-EXPORT_SYMBOL(t10_pi_type3_crc);
-
-const struct blk_integrity_profile t10_pi_type3_ip = {
-	.name			= "T10-DIF-TYPE3-IP",
-	.generate_fn		= t10_pi_type3_generate_ip,
-	.verify_fn		= t10_pi_type3_verify_ip,
-	.prepare_fn		= t10_pi_type3_prepare,
-	.complete_fn		= t10_pi_type3_complete,
-};
-EXPORT_SYMBOL(t10_pi_type3_ip);
-
 static __be64 ext_pi_crc64(u64 crc, void *data, unsigned int len)
 {
 	return cpu_to_be64(crc64_rocksoft_update(crc, data, len));
 }
 
-static blk_status_t ext_pi_crc64_generate(struct blk_integrity_iter *iter,
-					enum t10_dif_type type)
+static void ext_pi_crc64_generate(struct blk_integrity_iter *iter,
+		struct blk_integrity *bi)
 {
-	u8 offset = iter->pi_offset;
+	u8 offset = bi->pi_offset;
 	unsigned int i;
 
 	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
···
 					iter->prot_buf, offset);
 		pi->app_tag = 0;
 
-		if (type == T10_PI_TYPE1_PROTECTION)
+		if (bi->flags & BLK_INTEGRITY_REF_TAG)
 			put_unaligned_be48(iter->seed, pi->ref_tag);
 		else
 			put_unaligned_be48(0ULL, pi->ref_tag);
 
 		iter->data_buf += iter->interval;
-		iter->prot_buf += iter->tuple_size;
+		iter->prot_buf += bi->tuple_size;
 		iter->seed++;
 	}
-
-	return BLK_STS_OK;
 }
 
 static bool ext_pi_ref_escape(u8 *ref_tag)
···
 }
 
 static blk_status_t ext_pi_crc64_verify(struct blk_integrity_iter *iter,
-				      enum t10_dif_type type)
+		struct blk_integrity *bi)
 {
-	u8 offset = iter->pi_offset;
+	u8 offset = bi->pi_offset;
 	unsigned int i;
 
 	for (i = 0; i < iter->data_size; i += iter->interval) {
···
 		u64 ref, seed;
 		__be64 csum;
 
-		if (type == T10_PI_TYPE1_PROTECTION) {
+		if (bi->flags & BLK_INTEGRITY_REF_TAG) {
 			if (pi->app_tag == T10_PI_APP_ESCAPE)
 				goto next;
 
···
 					iter->disk_name, seed, ref);
 				return BLK_STS_PROTECTION;
 			}
-		} else if (type == T10_PI_TYPE3_PROTECTION) {
+		} else {
 			if (pi->app_tag == T10_PI_APP_ESCAPE &&
 			    ext_pi_ref_escape(pi->ref_tag))
 				goto next;
···
 
 next:
 		iter->data_buf += iter->interval;
-		iter->prot_buf += iter->tuple_size;
+		iter->prot_buf += bi->tuple_size;
 		iter->seed++;
 	}
 
 	return BLK_STS_OK;
 }
 
-static blk_status_t ext_pi_type1_verify_crc64(struct blk_integrity_iter *iter)
-{
-	return ext_pi_crc64_verify(iter, T10_PI_TYPE1_PROTECTION);
-}
-
-static blk_status_t ext_pi_type1_generate_crc64(struct blk_integrity_iter *iter)
-{
-	return ext_pi_crc64_generate(iter, T10_PI_TYPE1_PROTECTION);
-}
-
 static void ext_pi_type1_prepare(struct request *rq)
 {
-	struct blk_integrity *bi = &rq->q->integrity;
+	struct blk_integrity *bi = &rq->q->limits.integrity;
 	const int tuple_sz = bi->tuple_size;
 	u64 ref_tag = ext_pi_ref_tag(rq);
 	u8 offset = bi->pi_offset;
···
 
 static void ext_pi_type1_complete(struct request *rq, unsigned int nr_bytes)
 {
-	struct blk_integrity *bi = &rq->q->integrity;
+	struct blk_integrity *bi = &rq->q->limits.integrity;
 	unsigned intervals = nr_bytes >> bi->interval_exp;
 	const int tuple_sz = bi->tuple_size;
 	u64 ref_tag = ext_pi_ref_tag(rq);
···
 	}
 }
 
-static blk_status_t ext_pi_type3_verify_crc64(struct blk_integrity_iter *iter)
+void blk_integrity_generate(struct bio *bio)
 {
-	return ext_pi_crc64_verify(iter, T10_PI_TYPE3_PROTECTION);
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+	struct blk_integrity_iter iter;
+	struct bvec_iter bviter;
+	struct bio_vec bv;
+
+	iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
+	iter.interval = 1 << bi->interval_exp;
+	iter.seed = bio->bi_iter.bi_sector;
+	iter.prot_buf = bvec_virt(bip->bip_vec);
+	bio_for_each_segment(bv, bio, bviter) {
+		void *kaddr = bvec_kmap_local(&bv);
+
+		iter.data_buf = kaddr;
+		iter.data_size = bv.bv_len;
+		switch (bi->csum_type) {
+		case BLK_INTEGRITY_CSUM_CRC64:
+			ext_pi_crc64_generate(&iter, bi);
+			break;
+		case BLK_INTEGRITY_CSUM_CRC:
+		case BLK_INTEGRITY_CSUM_IP:
+			t10_pi_generate(&iter, bi);
+			break;
+		default:
+			break;
+		}
+		kunmap_local(kaddr);
+	}
 }
 
-static blk_status_t ext_pi_type3_generate_crc64(struct blk_integrity_iter *iter)
+void blk_integrity_verify(struct bio *bio)
 {
-	return ext_pi_crc64_generate(iter, T10_PI_TYPE3_PROTECTION);
+	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+	struct bio_integrity_payload *bip = bio_integrity(bio);
+	struct blk_integrity_iter iter;
+	struct bvec_iter bviter;
+	struct bio_vec bv;
+
+	/*
+	 * At the moment verify is called bi_iter has been advanced during split
+	 * and completion, so use the copy created during submission here.
+	 */
+	iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
+	iter.interval = 1 << bi->interval_exp;
+	iter.seed = bip->bio_iter.bi_sector;
+	iter.prot_buf = bvec_virt(bip->bip_vec);
+	__bio_for_each_segment(bv, bio, bviter, bip->bio_iter) {
+		void *kaddr = bvec_kmap_local(&bv);
+		blk_status_t ret = BLK_STS_OK;
+
+		iter.data_buf = kaddr;
+		iter.data_size = bv.bv_len;
+		switch (bi->csum_type) {
+		case BLK_INTEGRITY_CSUM_CRC64:
+			ret = ext_pi_crc64_verify(&iter, bi);
+			break;
+		case BLK_INTEGRITY_CSUM_CRC:
+		case BLK_INTEGRITY_CSUM_IP:
+			ret = t10_pi_verify(&iter, bi);
+			break;
+		default:
+			break;
+		}
+		kunmap_local(kaddr);
+
+		if (ret) {
+			bio->bi_status = ret;
+			return;
+		}
+	}
 }
 
-const struct blk_integrity_profile ext_pi_type1_crc64 = {
-	.name			= "EXT-DIF-TYPE1-CRC64",
-	.generate_fn		= ext_pi_type1_generate_crc64,
-	.verify_fn		= ext_pi_type1_verify_crc64,
-	.prepare_fn		= ext_pi_type1_prepare,
-	.complete_fn		= ext_pi_type1_complete,
-};
-EXPORT_SYMBOL_GPL(ext_pi_type1_crc64);
+void blk_integrity_prepare(struct request *rq)
+{
+	struct blk_integrity *bi = &rq->q->limits.integrity;
 
-const struct blk_integrity_profile ext_pi_type3_crc64 = {
-	.name			= "EXT-DIF-TYPE3-CRC64",
-	.generate_fn		= ext_pi_type3_generate_crc64,
-	.verify_fn		= ext_pi_type3_verify_crc64,
-	.prepare_fn		= t10_pi_type3_prepare,
-	.complete_fn		= t10_pi_type3_complete,
-};
-EXPORT_SYMBOL_GPL(ext_pi_type3_crc64);
+	if (!(bi->flags & BLK_INTEGRITY_REF_TAG))
+		return;
+
+	if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC64)
+		ext_pi_type1_prepare(rq);
+	else
+		t10_pi_type1_prepare(rq);
+}
+
+void blk_integrity_complete(struct request *rq, unsigned int nr_bytes)
+{
+	struct blk_integrity *bi = &rq->q->limits.integrity;
+
+	if (!(bi->flags & BLK_INTEGRITY_REF_TAG))
+		return;
+
+	if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC64)
+		ext_pi_type1_complete(rq, nr_bytes);
+	else
+		t10_pi_type1_complete(rq, nr_bytes);
+}
 
 MODULE_DESCRIPTION("T10 Protection Information module");
 MODULE_LICENSE("GPL");
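
For readers who want to see what the CRC branch of `t10_pi_generate()` produces, here is a small userspace sketch, assuming `pi_offset` is 0 and the CRC checksum type is in use. It replaces the kernel's `crc_t10dif_update()` with a slow bitwise CRC-16/T10-DIF (polynomial 0x8BB7, initial value 0). The names `crc16_t10dif` and `pi_fill_tuple` are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/T10-DIF: polynomial 0x8BB7, MSB first, no final XOR. */
static uint16_t crc16_t10dif(uint16_t crc, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/*
 * Hypothetical helper: fill one 8-byte tuple (guard, app, ref; big-endian)
 * for a single protection interval, mirroring t10_pi_generate() when
 * pi_offset is 0. has_ref_tag models the BLK_INTEGRITY_REF_TAG flag.
 */
static void pi_fill_tuple(uint8_t tuple[8], const uint8_t *data,
			  size_t interval, uint64_t seed, int has_ref_tag)
{
	uint16_t guard = crc16_t10dif(0, data, interval);
	uint32_t ref = has_ref_tag ? (uint32_t)seed : 0; /* lower 32 bits */

	tuple[0] = (uint8_t)(guard >> 8);
	tuple[1] = (uint8_t)guard;
	tuple[2] = 0;			/* app_tag is always written as 0 */
	tuple[3] = 0;
	tuple[4] = (uint8_t)(ref >> 24);
	tuple[5] = (uint8_t)(ref >> 16);
	tuple[6] = (uint8_t)(ref >> 8);
	tuple[7] = (uint8_t)ref;
}
```

Because the CRC starts at 0 and has no final XOR, passing the previous result back in as `crc` continues the checksum over a second buffer. That is the property the diff uses when it extends the guard tag over `prot_buf` for a non-zero `offset`.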

+1 -2

drivers/ata/libata-scsi.c

···
 int ata_scsi_dev_config(struct scsi_device *sdev, struct queue_limits *lim,
 		struct ata_device *dev)
 {
-	struct request_queue *q = sdev->request_queue;
 	int depth = 1;
 
 	if (!ata_id_has_unload(dev->id))
···
 		sdev->sector_size = ATA_SECT_SIZE;
 
 		/* set DMA padding */
-		blk_queue_update_dma_pad(q, ATA_DMA_PAD_SZ - 1);
+		lim->dma_pad_mask = ATA_DMA_PAD_SZ - 1;
 
 		/* make room for appending the drain */
 		lim->max_segments--;

+2 -2

drivers/ata/pata_macio.c

···
 	/* OHare has issues with non cache aligned DMA on some chipsets */
 	if (priv->kind == controller_ohare) {
 		lim->dma_alignment = 31;
-		blk_queue_update_dma_pad(sdev->request_queue, 31);
+		lim->dma_pad_mask = 31;
 
 		/* Tell the world about it */
 		ata_dev_info(dev, "OHare alignment limits applied\n");
···
 	if (priv->kind == controller_sh_ata6 || priv->kind == controller_k2_ata6) {
 		/* Allright these are bad, apply restrictions */
 		lim->dma_alignment = 15;
-		blk_queue_update_dma_pad(sdev->request_queue, 15);
+		lim->dma_pad_mask = 15;
 
 		/* We enable MWI and hack cache line size directly here, this
 		 * is specific to this chipset and not normal values, we happen

+9

drivers/block/Kconfig

···
 	  This is the virtual block driver for virtio. It can be used with
 	  QEMU based VMMs (like KVM or Xen). Say Y or M.
 
+config BLK_DEV_RUST_NULL
+	tristate "Rust null block driver (Experimental)"
+	depends on RUST
+	help
+	  This is the Rust implementation of the null block driver. For now it
+	  is only a minimal stub.
+
+	  If unsure, say N.
+
 config BLK_DEV_RBD
 	tristate "Rados block device (RBD)"
 	depends on INET && BLOCK

+3

drivers/block/Makefile

···
 # needed for trace events
 ccflags-y += -I$(src)
 
+obj-$(CONFIG_BLK_DEV_RUST_NULL) += rnull_mod.o
+rnull_mod-y := rnull.o
+
 obj-$(CONFIG_MAC_FLOPPY) += swim3.o
 obj-$(CONFIG_BLK_DEV_SWIM) += swim_mod.o
 obj-$(CONFIG_BLK_DEV_FD) += floppy.o

+5 -1

drivers/block/amiflop.c

···
 static unsigned long int fd_def_df0 = FD_DD_3;     /* default for df0 if it doesn't identify */
 
 module_param(fd_def_df0, ulong, 0);
+MODULE_DESCRIPTION("Amiga floppy driver");
 MODULE_LICENSE("GPL");
 
 /*
···
 
 static int fd_alloc_disk(int drive, int system)
 {
+	struct queue_limits lim = {
+		.features		= BLK_FEAT_ROTATIONAL,
+	};
 	struct gendisk *disk;
 	int err;
 
-	disk = blk_mq_alloc_disk(&unit[drive].tag_set, NULL, NULL);
+	disk = blk_mq_alloc_disk(&unit[drive].tag_set, &lim, NULL);
 	if (IS_ERR(disk))
 		return PTR_ERR(disk);
 

+1

drivers/block/aoe/aoeblk.c

···
 	struct queue_limits lim = {
 		.max_hw_sectors		= aoe_maxsectors,
 		.io_opt			= SZ_2M,
+		.features		= BLK_FEAT_ROTATIONAL,
 	};
 	ulong flags;
 	int late = 0;

+5 -1

drivers/block/ataflop.c

···
 
 static int ataflop_alloc_disk(unsigned int drive, unsigned int type)
 {
+	struct queue_limits lim = {
+		.features		= BLK_FEAT_ROTATIONAL,
+	};
 	struct gendisk *disk;
 
-	disk = blk_mq_alloc_disk(&unit[drive].tag_set, NULL, NULL);
+	disk = blk_mq_alloc_disk(&unit[drive].tag_set, &lim, NULL);
 	if (IS_ERR(disk))
 		return PTR_ERR(disk);
 
···
 module_init(atari_floppy_init)
 module_exit(atari_floppy_exit)
 
+MODULE_DESCRIPTION("Atari floppy driver");
 MODULE_LICENSE("GPL");

+3 -4

drivers/block/brd.c

···
 module_param(max_part, int, 0444);
 MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
 
+MODULE_DESCRIPTION("Ram backed block device driver");
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
···
 		.max_hw_discard_sectors	= UINT_MAX,
 		.max_discard_segments	= 1,
 		.discard_granularity	= PAGE_SIZE,
+		.features		= BLK_FEAT_SYNCHRONOUS |
+					  BLK_FEAT_NOWAIT,
 	};
 
 	list_for_each_entry(brd, &brd_devices, brd_list)
···
 	strscpy(disk->disk_name, buf, DISK_NAME_LEN);
 	set_capacity(disk, rd_size * 2);
 
-	/* Tell the block layer that this is not a rotational device */
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
-	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);
-	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue);
 	err = add_disk(disk);
 	if (err)
 		goto out_cleanup_disk;

+3 -3

drivers/block/drbd/drbd_main.c

···
 		 * connect.
 		 */
 		.max_hw_sectors		= DRBD_MAX_BIO_SIZE_SAFE >> 8,
+		.features		= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
+					  BLK_FEAT_ROTATIONAL |
+					  BLK_FEAT_STABLE_WRITES,
 	};
 
 	device = minor_to_device(minor);
···
 	disk->flags |= GENHD_FL_NO_PART;
 	sprintf(disk->disk_name, "drbd%d", minor);
 	disk->private_data = device;
-
-	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, disk->queue);
-	blk_queue_write_cache(disk->queue, true, true);
 
 	device->md_io.page = alloc_page(GFP_KERNEL);
 	if (!device->md_io.page)

+3 -1

drivers/block/floppy.c

···
 static int floppy_alloc_disk(unsigned int drive, unsigned int type)
 {
 	struct queue_limits lim = {
-		.max_hw_sectors = 64,
+		.max_hw_sectors		= 64,
+		.features		= BLK_FEAT_ROTATIONAL,
 	};
 	struct gendisk *disk;
 
···
 module_param(FLOPPY_IRQ, int, 0);
 module_param(FLOPPY_DMA, int, 0);
 MODULE_AUTHOR("Alain L. Knaff");
+MODULE_DESCRIPTION("Normal floppy disk support");
 MODULE_LICENSE("GPL");
 
 /* This doesn't actually get used other than for module information */

+78 -106

drivers/block/loop.c

···
 	if (lo->lo_state == Lo_bound)
 		blk_mq_freeze_queue(lo->lo_queue);
 	lo->use_dio = use_dio;
-	if (use_dio) {
-		blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, lo->lo_queue);
+	if (use_dio)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
-	} else {
-		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, lo->lo_queue);
+	else
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
-	}
 	if (lo->lo_state == Lo_bound)
 		blk_mq_unfreeze_queue(lo->lo_queue);
 }
···
 	return loop_free_idle_workers(lo, false);
 }
 
-static void loop_update_rotational(struct loop_device *lo)
-{
-	struct file *file = lo->lo_backing_file;
-	struct inode *file_inode = file->f_mapping->host;
-	struct block_device *file_bdev = file_inode->i_sb->s_bdev;
-	struct request_queue *q = lo->lo_queue;
-	bool nonrot = true;
-
-	/* not all filesystems (e.g. tmpfs) have a sb->s_bdev */
-	if (file_bdev)
-		nonrot = bdev_nonrot(file_bdev);
-
-	if (nonrot)
-		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-	else
-		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
-}
-
 /**
  * loop_set_status_from_info - configure device from loop_info
  * @lo: struct loop_device to configure
···
 	return 0;
 }
 
-static int loop_reconfigure_limits(struct loop_device *lo, unsigned short bsize,
-		bool update_discard_settings)
+static unsigned short loop_default_blocksize(struct loop_device *lo,
+		struct block_device *backing_bdev)
 {
+	/* In case of direct I/O, match underlying block size */
+	if ((lo->lo_backing_file->f_flags & O_DIRECT) && backing_bdev)
+		return bdev_logical_block_size(backing_bdev);
+	return SECTOR_SIZE;
+}
+
+static int loop_reconfigure_limits(struct loop_device *lo, unsigned short bsize)
+{
+	struct file *file = lo->lo_backing_file;
+	struct inode *inode = file->f_mapping->host;
+	struct block_device *backing_bdev = NULL;
 	struct queue_limits lim;
+
+	if (S_ISBLK(inode->i_mode))
+		backing_bdev = I_BDEV(inode);
+	else if (inode->i_sb->s_bdev)
+		backing_bdev = inode->i_sb->s_bdev;
+
+	if (!bsize)
+		bsize = loop_default_blocksize(lo, backing_bdev);
 
 	lim = queue_limits_start_update(lo->lo_queue);
 	lim.logical_block_size = bsize;
 	lim.physical_block_size = bsize;
 	lim.io_min = bsize;
-	if (update_discard_settings)
-		loop_config_discard(lo, &lim);
+	lim.features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_ROTATIONAL);
+	if (file->f_op->fsync && !(lo->lo_flags & LO_FLAGS_READ_ONLY))
+		lim.features |= BLK_FEAT_WRITE_CACHE;
+	if (backing_bdev && !bdev_nonrot(backing_bdev))
+		lim.features |= BLK_FEAT_ROTATIONAL;
+	loop_config_discard(lo, &lim);
 	return queue_limits_commit_update(lo->lo_queue, &lim);
 }
 
···
 			  const struct loop_config *config)
 {
 	struct file *file = fget(config->fd);
-	struct inode *inode;
 	struct address_space *mapping;
 	int error;
 	loff_t size;
 	bool partscan;
-	unsigned short bsize;
 	bool is_loop;
 
 	if (!file)
···
 		goto out_unlock;
 
 	mapping = file->f_mapping;
-	inode = mapping->host;
 
 	if ((config->info.lo_flags & ~LOOP_CONFIGURE_SETTABLE_FLAGS) != 0) {
 		error = -EINVAL;
 		goto out_unlock;
-	}
-
-	if (config->block_size) {
-		error = blk_validate_block_size(config->block_size);
-		if (error)
-			goto out_unlock;
 	}
 
 	error = loop_set_status_from_info(lo, &config->info);
···
 	lo->old_gfp_mask = mapping_gfp_mask(mapping);
 	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
 
-	if (!(lo->lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
-		blk_queue_write_cache(lo->lo_queue, true, false);
-
-	if (config->block_size)
-		bsize = config->block_size;
-	else if ((lo->lo_backing_file->f_flags & O_DIRECT) && inode->i_sb->s_bdev)
-		/* In case of direct I/O, match underlying block size */
-		bsize = bdev_logical_block_size(inode->i_sb->s_bdev);
-	else
-		bsize = 512;
-
-	error = loop_reconfigure_limits(lo, bsize, true);
-	if (WARN_ON_ONCE(error))
+	error = loop_reconfigure_limits(lo, config->block_size);
+	if (error)
 		goto out_unlock;
 
-	loop_update_rotational(lo);
 	loop_update_dio(lo);
 	loop_sysfs_init(lo);
 
···
 	return error;
 }
 
-static void __loop_clr_fd(struct loop_device *lo, bool release)
+static void __loop_clr_fd(struct loop_device *lo)
 {
+	struct queue_limits lim;
 	struct file *filp;
 	gfp_t gfp = lo->old_gfp_mask;
-
-	if (test_bit(QUEUE_FLAG_WC, &lo->lo_queue->queue_flags))
-		blk_queue_write_cache(lo->lo_queue, false, false);
-
-	/*
-	 * Freeze the request queue when unbinding on a live file descriptor and
-	 * thus an open device.  When called from ->release we are guaranteed
-	 * that there is no I/O in progress already.
-	 */
-	if (!release)
-		blk_mq_freeze_queue(lo->lo_queue);
 
 	spin_lock_irq(&lo->lo_lock);
 	filp = lo->lo_backing_file;
···
 	lo->lo_offset = 0;
 	lo->lo_sizelimit = 0;
 	memset(lo->lo_file_name, 0, LO_NAME_SIZE);
-	loop_reconfigure_limits(lo, 512, false);
+
+	/* reset the block size to the default */
+	lim = queue_limits_start_update(lo->lo_queue);
+	lim.logical_block_size = SECTOR_SIZE;
+	lim.physical_block_size = SECTOR_SIZE;
+	lim.io_min = SECTOR_SIZE;
+	queue_limits_commit_update(lo->lo_queue, &lim);
+
 	invalidate_disk(lo->lo_disk);
 	loop_sysfs_exit(lo);
 	/* let user-space know about this change */
···
 	mapping_set_gfp_mask(filp->f_mapping, gfp);
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
-	if (!release)
-		blk_mq_unfreeze_queue(lo->lo_queue);
 
 	disk_force_media_change(lo->lo_disk);
 
···
 		 * must be at least one and it can only become zero when the
 		 * current holder is released.
 		 */
-		if (!release)
-			mutex_lock(&lo->lo_disk->open_mutex);
 		err = bdev_disk_changed(lo->lo_disk, false);
-		if (!release)
-			mutex_unlock(&lo->lo_disk->open_mutex);
 		if (err)
 			pr_warn("%s: partition scan of loop%d failed (rc=%d)\n",
 				__func__, lo->lo_number, err);
···
 		return -ENXIO;
 	}
 	/*
-	 * If we've explicitly asked to tear down the loop device,
-	 * and it has an elevated reference count, set it for auto-teardown when
-	 * the last reference goes away. This stops $!~#$@ udev from
-	 * preventing teardown because it decided that it needs to run blkid on
-	 * the loopback device whenever they appear. xfstests is notorious for
-	 * failing tests because blkid via udev races with a losetup
-	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
-	 * command to fail with EBUSY.
+	 * Mark the device for removing the backing device on last close.
+	 * If we are the only opener, also switch the state to roundown here to
+	 * prevent new openers from coming in.
 	 */
-	if (disk_openers(lo->lo_disk) > 1) {
-		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
-		loop_global_unlock(lo, true);
-		return 0;
-	}
-	lo->lo_state = Lo_rundown;
+
+	lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
+	if (disk_openers(lo->lo_disk) == 1)
+		lo->lo_state = Lo_rundown;
 	loop_global_unlock(lo, true);
 
-	__loop_clr_fd(lo, false);
 	return 0;
 }
 
···
 	if (lo->lo_state != Lo_bound)
 		return -ENXIO;
 
-	err = blk_validate_block_size(arg);
-	if (err)
-		return err;
-
 	if (lo->lo_queue->limits.logical_block_size == arg)
 		return 0;
 
···
 	invalidate_bdev(lo->lo_device);
 
 	blk_mq_freeze_queue(lo->lo_queue);
-	err = loop_reconfigure_limits(lo, arg, false);
+	err = loop_reconfigure_limits(lo, arg);
 	loop_update_dio(lo);
 	blk_mq_unfreeze_queue(lo->lo_queue);
 
···
 }
 #endif
 
+static int lo_open(struct gendisk *disk, blk_mode_t mode)
+{
+	struct loop_device *lo = disk->private_data;
+	int err;
+
+	err = mutex_lock_killable(&lo->lo_mutex);
+	if (err)
+		return err;
+
+	if (lo->lo_state == Lo_deleting || lo->lo_state == Lo_rundown)
+		err = -ENXIO;
+	mutex_unlock(&lo->lo_mutex);
+	return err;
+}
+
 static void lo_release(struct gendisk *disk)
 {
 	struct loop_device *lo = disk->private_data;
+	bool need_clear = false;
 
 	if (disk_openers(disk) > 0)
 		return;
+	/*
+	 * Clear the backing device information if this is the last close of
+	 * a device that's been marked for auto clear, or on which LOOP_CLR_FD
+	 * has been called.
+	 */
 
 	mutex_lock(&lo->lo_mutex);
-	if (lo->lo_state == Lo_bound && (lo->lo_flags & LO_FLAGS_AUTOCLEAR)) {
+	if (lo->lo_state == Lo_bound && (lo->lo_flags & LO_FLAGS_AUTOCLEAR))
 		lo->lo_state = Lo_rundown;
-		mutex_unlock(&lo->lo_mutex);
-		/*
-		 * In autoclear mode, stop the loop thread
-		 * and remove configuration after last close.
-		 */
-		__loop_clr_fd(lo, true);
-		return;
-	}
+
+	need_clear = (lo->lo_state == Lo_rundown);
 	mutex_unlock(&lo->lo_mutex);
+
+	if (need_clear)
+		__loop_clr_fd(lo);
 }
 
 static void lo_free_disk(struct gendisk *disk)
···
 
 static const struct block_device_operations lo_fops = {
 	.owner =	THIS_MODULE,
+	.open =		lo_open,
 	.release =	lo_release,
 	.ioctl =	lo_ioctl,
 #ifdef CONFIG_COMPAT
···
 device_param_cb(hw_queue_depth, &loop_hw_qdepth_param_ops, &hw_queue_depth, 0444);
 MODULE_PARM_DESC(hw_queue_depth, "Queue depth for each hardware queue. Default: " __stringify(LOOP_DEFAULT_HW_Q_DEPTH));
 
+MODULE_DESCRIPTION("Loopback device support");
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(LOOP_MAJOR);
 
···
 		goto out_cleanup_tags;
 	}
 	lo->lo_queue = lo->lo_disk->queue;
-
-	/*
-	 * By default, we do buffer IO, so it doesn't make sense to enable
-	 * merge because the I/O submitted to backing file is handled page by
-	 * page. For directio mode, merge does help to dispatch bigger request
-	 * to underlayer disk. We will enable merge once directio is enabled.
-	 */
-	blk_queue_flag_set(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 
 	/*
 	 * Disable partition scanning by default. The in-kernel partition

-2

drivers/block/mtip32xx/mtip32xx.c

···
 		goto start_service_thread;
 
 	/* Set device limits. */
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, dd->queue);
-	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, dd->queue);
 	dma_set_max_seg_size(&dd->pdev->dev, 0x400000);
 
 	/* Set the capacity of the device in 512 byte sectors. */

-2

drivers/block/n64cart.c

···
 	set_capacity(disk, size >> SECTOR_SHIFT);
 	set_disk_ro(disk, 1);
 
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
-
 	err = add_disk(disk);
 	if (err)
 		goto out_cleanup_disk;

+10 -16

drivers/block/nbd.c

···
 		lim.max_hw_discard_sectors = UINT_MAX;
 	else
 		lim.max_hw_discard_sectors = 0;
+	if (!(nbd->config->flags & NBD_FLAG_SEND_FLUSH)) {
+		lim.features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA);
+	} else if (nbd->config->flags & NBD_FLAG_SEND_FUA) {
+		lim.features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
+	} else {
+		lim.features |= BLK_FEAT_WRITE_CACHE;
+		lim.features &= ~BLK_FEAT_FUA;
+	}
 	lim.logical_block_size = blksize;
 	lim.physical_block_size = blksize;
 	error = queue_limits_commit_update(nbd->disk->queue, &lim);
···
 
 static void nbd_parse_flags(struct nbd_device *nbd)
 {
-	struct nbd_config *config = nbd->config;
-	if (config->flags & NBD_FLAG_READ_ONLY)
+	if (nbd->config->flags & NBD_FLAG_READ_ONLY)
 		set_disk_ro(nbd->disk, true);
 	else
 		set_disk_ro(nbd->disk, false);
-	if (config->flags & NBD_FLAG_SEND_FLUSH) {
-		if (config->flags & NBD_FLAG_SEND_FUA)
-			blk_queue_write_cache(nbd->disk->queue, true, true);
-		else
-			blk_queue_write_cache(nbd->disk->queue, true, false);
-	}
-	else
-		blk_queue_write_cache(nbd->disk->queue, false, false);
 }
 
 static void send_disconnects(struct nbd_device *nbd)
···
 {
 	struct queue_limits lim = {
 		.max_hw_sectors = 65536,
-		.max_user_sectors = 256,
+		.io_opt = 256 << SECTOR_SHIFT,
 		.max_segments = USHRT_MAX,
 		.max_segment_size = UINT_MAX,
 	};
···
 		err = -ENOMEM;
 		goto out_err_disk;
 	}
-
-	/*
-	 * Tell the block layer that we are not a rotational device
-	 */
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
 
 	mutex_init(&nbd->config_lock);
 	refcount_set(&nbd->config_refs, 0);
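
The nbd.c hunk above replaces the nested blk_queue_write_cache() calls with a three-way update of lim.features. A minimal userspace sketch of that mapping follows; it is not kernel code. The MODEL_* names and their bit values are stand-ins chosen for illustration (the kernel and uapi headers are authoritative), and model_nbd_cache_features() is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in bit values for illustration only; see the kernel headers for the real ones. */
#define MODEL_NBD_FLAG_SEND_FLUSH (1u << 2)
#define MODEL_NBD_FLAG_SEND_FUA   (1u << 3)
#define MODEL_FEAT_WRITE_CACHE    (1u << 0)
#define MODEL_FEAT_FUA            (1u << 1)

/*
 * Hypothetical helper mirroring the branch structure of the hunk:
 *  - no FLUSH support -> clear both write-cache and FUA
 *  - FLUSH and FUA    -> set both
 *  - FLUSH only       -> set write-cache, clear FUA
 * All other feature bits are passed through unchanged.
 */
static uint32_t model_nbd_cache_features(uint32_t features, uint32_t nbd_flags)
{
	if (!(nbd_flags & MODEL_NBD_FLAG_SEND_FLUSH)) {
		features &= ~(MODEL_FEAT_WRITE_CACHE | MODEL_FEAT_FUA);
	} else if (nbd_flags & MODEL_NBD_FLAG_SEND_FUA) {
		features |= MODEL_FEAT_WRITE_CACHE | MODEL_FEAT_FUA;
	} else {
		features |= MODEL_FEAT_WRITE_CACHE;
		features &= ~MODEL_FEAT_FUA;
	}
	return features;
}
```

One consequence visible in the sketch: a server that advertises FUA without FLUSH ends up with neither feature bit set, which matches the behaviour of the removed nbd_parse_flags() code.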

+17 -12

drivers/block/null_blk/main.c

···
 	NULL_IRQ_TIMER = 2,
 };
 
-static bool g_virt_boundary = false;
+static bool g_virt_boundary;
 module_param_named(virt_boundary, g_virt_boundary, bool, 0444);
 MODULE_PARM_DESC(virt_boundary, "Require a virtual boundary for the device. Default: False");
 
···
 
 static bool g_fua = true;
 module_param_named(fua, g_fua, bool, 0444);
-MODULE_PARM_DESC(zoned, "Enable/disable FUA support when cache_size is used. Default: true");
+MODULE_PARM_DESC(fua, "Enable/disable FUA support when cache_size is used. Default: true");
 
 static unsigned int g_mbps;
 module_param_named(mbps, g_mbps, uint, 0444);
···
 module_param_named(zone_append_max_sectors, g_zone_append_max_sectors, int, 0444);
 MODULE_PARM_DESC(zone_append_max_sectors,
 		 "Maximum size of a zone append command (in 512B sectors). Specify 0 for zone append emulation");
+
+static bool g_zone_full;
+module_param_named(zone_full, g_zone_full, bool, S_IRUGO);
+MODULE_PARM_DESC(zone_full, "Initialize the sequential write required zones of a zoned device to be full. Default: false");
 
 static struct nullb_device *null_alloc_dev(void);
 static void null_free_dev(struct nullb_device *dev);
···
 NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
 NULLB_DEVICE_ATTR(zone_append_max_sectors, uint, NULL);
+NULLB_DEVICE_ATTR(zone_full, bool, NULL);
 NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
 NULLB_DEVICE_ATTR(no_sched, bool, NULL);
 NULLB_DEVICE_ATTR(shared_tags, bool, NULL);
···
 	&nullb_device_attr_zone_append_max_sectors,
 	&nullb_device_attr_zone_readonly,
 	&nullb_device_attr_zone_offline,
+	&nullb_device_attr_zone_full,
 	&nullb_device_attr_virt_boundary,
 	&nullb_device_attr_no_sched,
 	&nullb_device_attr_shared_tags,
···
 			"shared_tags,size,submit_queues,use_per_node_hctx,"
 			"virt_boundary,zoned,zone_capacity,zone_max_active,"
 			"zone_max_open,zone_nr_conv,zone_offline,zone_readonly,"
-			"zone_size,zone_append_max_sectors\n");
+			"zone_size,zone_append_max_sectors,zone_full\n");
 }
 
 CONFIGFS_ATTR_RO(memb_group_, features);
···
 	dev->zone_max_open = g_zone_max_open;
 	dev->zone_max_active = g_zone_max_active;
 	dev->zone_append_max_sectors = g_zone_append_max_sectors;
+	dev->zone_full = g_zone_full;
 	dev->virt_boundary = g_virt_boundary;
 	dev->no_sched = g_no_sched;
 	dev->shared_tags = g_shared_tags;
···
 		dev->queue_mode = NULL_Q_MQ;
 	}
 
-	if (blk_validate_block_size(dev->blocksize))
-		return -EINVAL;
-
 	if (dev->use_per_node_hctx) {
 		if (dev->submit_queues != nr_online_nodes)
 			dev->submit_queues = nr_online_nodes;
···
 			goto out_cleanup_tags;
 	}
 
+	if (dev->cache_size > 0) {
+		set_bit(NULLB_DEV_FL_CACHE, &nullb->dev->flags);
+		lim.features |= BLK_FEAT_WRITE_CACHE;
+		if (dev->fua)
+			lim.features |= BLK_FEAT_FUA;
+	}
+
 	nullb->disk = blk_mq_alloc_disk(nullb->tag_set, &lim, nullb);
 	if (IS_ERR(nullb->disk)) {
 		rv = PTR_ERR(nullb->disk);
···
 		nullb_setup_bwtimer(nullb);
 	}
 
-	if (dev->cache_size > 0) {
-		set_bit(NULLB_DEV_FL_CACHE, &nullb->dev->flags);
-		blk_queue_write_cache(nullb->q, true, dev->fua);
-	}
-
 	nullb->q->queuedata = nullb;
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, nullb->q);
 
 	rv = ida_alloc(&nullb_indexes, GFP_KERNEL);
 	if (rv < 0)

+1

drivers/block/null_blk/null_blk.h

···
 	bool memory_backed; /* if data is stored in memory */
 	bool discard; /* if support discard */
 	bool zoned; /* if device is zoned */
+	bool zone_full; /* Initialize zones to be full */
 	bool virt_boundary; /* virtual boundary on/off for the device */
 	bool no_sched; /* no IO scheduler for the device */
 	bool shared_tags; /* share tag set between devices for blk-mq */

+9 -6

drivers/block/null_blk/zoned.c

···
 		zone = &dev->zones[i];
 
 		null_init_zone_lock(dev, zone);
-		zone->start = zone->wp = sector;
+		zone->start = sector;
 		if (zone->start + dev->zone_size_sects > dev_capacity_sects)
 			zone->len = dev_capacity_sects - zone->start;
 		else
···
 		zone->capacity =
 			min_t(sector_t, zone->len, zone_capacity_sects);
 		zone->type = BLK_ZONE_TYPE_SEQWRITE_REQ;
-		zone->cond = BLK_ZONE_COND_EMPTY;
+		if (dev->zone_full) {
+			zone->cond = BLK_ZONE_COND_FULL;
+			zone->wp = zone->start + zone->capacity;
+		} else{
+			zone->cond = BLK_ZONE_COND_EMPTY;
+			zone->wp = zone->start;
+		}
 
 		sector += dev->zone_size_sects;
 	}
 
-	lim->zoned = true;
+	lim->features |= BLK_FEAT_ZONED;
 	lim->chunk_sectors = dev->zone_size_sects;
 	lim->max_zone_append_sectors = dev->zone_append_max_sectors;
 	lim->max_open_zones = dev->zone_max_open;
···
 {
 	struct request_queue *q = nullb->q;
 	struct gendisk *disk = nullb->disk;
-
-	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
-	disk->nr_zones = bdev_nr_zones(disk->part0);
 
 	pr_info("%s: using %s zone append\n",
 		disk->disk_name,
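
The zoned.c hunk above makes the initial write pointer depend on the new zone_full option. The following is a small userspace sketch of that per-zone initialisation, assuming simplified stand-in types: struct model_zone, enum model_zone_cond and model_init_seq_zone() are hypothetical names for illustration and do not exist in null_blk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum model_zone_cond { MODEL_COND_EMPTY, MODEL_COND_FULL };

struct model_zone {
	uint64_t start;		/* first sector of the zone */
	uint64_t len;		/* zone length in sectors, clipped at device end */
	uint64_t capacity;	/* writable sectors, never more than len */
	uint64_t wp;		/* write pointer */
	enum model_zone_cond cond;
};

/*
 * Hypothetical helper following the hunk: a full zone starts with its
 * write pointer at start + capacity, an empty zone with wp == start.
 */
static void model_init_seq_zone(struct model_zone *z, uint64_t sector,
				uint64_t zone_size, uint64_t zone_capacity,
				uint64_t dev_capacity, bool zone_full)
{
	assert(sector < dev_capacity);

	z->start = sector;
	if (z->start + zone_size > dev_capacity)
		z->len = dev_capacity - z->start;
	else
		z->len = zone_size;
	z->capacity = z->len < zone_capacity ? z->len : zone_capacity;

	if (zone_full) {
		z->cond = MODEL_COND_FULL;
		z->wp = z->start + z->capacity;
	} else {
		z->cond = MODEL_COND_EMPTY;
		z->wp = z->start;
	}
}
```

Note that the full-zone write pointer is derived from the zone capacity rather than the zone length, so a truncated last zone or a capacity smaller than the zone size is still reported consistently.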

+1

drivers/block/pktcdvd.c

···
 	struct queue_limits lim = {
 		.max_hw_sectors = PACKET_MAX_SECTORS,
 		.logical_block_size = CD_FRAMESIZE,
+		.features = BLK_FEAT_ROTATIONAL,
 	};
 	int idx;
 	int ret = -ENOMEM;

+2 -6

drivers/block/ps3disk.c

···
 		.max_segments = -1,
 		.max_segment_size = dev->bounce_size,
 		.dma_alignment = dev->blk_size - 1,
+		.features = BLK_FEAT_WRITE_CACHE |
+			    BLK_FEAT_ROTATIONAL,
 	};
-
-	struct request_queue *queue;
 	struct gendisk *gendisk;
 
 	if (dev->blk_size < 512) {
···
 		error = PTR_ERR(gendisk);
 		goto fail_free_tag_set;
 	}
-
-	queue = gendisk->queue;
-
-	blk_queue_write_cache(queue, true, false);
 
 	priv->gendisk = gendisk;
 	gendisk->major = ps3disk_major;

+4 -11

drivers/block/rbd.c

···
 static int rbd_init_disk(struct rbd_device *rbd_dev)
 {
 	struct gendisk *disk;
-	struct request_queue *q;
 	unsigned int objset_bytes =
 	    rbd_dev->layout.object_size * rbd_dev->layout.stripe_count;
 	struct queue_limits lim = {
 		.max_hw_sectors = objset_bytes >> SECTOR_SHIFT,
-		.max_user_sectors = objset_bytes >> SECTOR_SHIFT,
+		.io_opt = objset_bytes,
 		.io_min = rbd_dev->opts->alloc_size,
-		.io_opt = rbd_dev->opts->alloc_size,
 		.max_segments = USHRT_MAX,
 		.max_segment_size = UINT_MAX,
 	};
···
 		lim.max_write_zeroes_sectors = objset_bytes >> SECTOR_SHIFT;
 	}
 
+	if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
+		lim.features |= BLK_FEAT_STABLE_WRITES;
+
 	disk = blk_mq_alloc_disk(&rbd_dev->tag_set, &lim, rbd_dev);
 	if (IS_ERR(disk)) {
 		err = PTR_ERR(disk);
 		goto out_tag_set;
 	}
-	q = disk->queue;
 
 	snprintf(disk->disk_name, sizeof(disk->disk_name), RBD_DRV_NAME "%d",
 		 rbd_dev->dev_id);
···
 	disk->minors = RBD_MINORS_PER_MAJOR;
 	disk->fops = &rbd_bd_ops;
 	disk->private_data = rbd_dev;
-
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-	/* QUEUE_FLAG_ADD_RANDOM is off by default for blk-mq */
-
-	if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
-		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q);
-
 	rbd_dev->disk = disk;
 
 	return 0;

+1 -1

drivers/block/rnbd/rnbd-clt-sysfs.c

···
 	}
 }
 
-static struct kobj_type rnbd_dev_ktype = {
+static const struct kobj_type rnbd_dev_ktype = {
 	.sysfs_ops = &kobj_sysfs_ops,
 	.default_groups = rnbd_dev_groups,
 };

+6 -10

drivers/block/rnbd/rnbd-clt.c

···
 	if (dev->access_mode == RNBD_ACCESS_RO)
 		set_disk_ro(dev->gd, true);
 
-	/*
-	 * Network device does not need rotational
-	 */
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, dev->queue);
 	err = add_disk(dev->gd);
 	if (err)
 		put_disk(dev->gd);
···
 			le32_to_cpu(rsp->max_discard_sectors);
 	}
 
+	if (rsp->cache_policy & RNBD_WRITEBACK) {
+		lim.features |= BLK_FEAT_WRITE_CACHE;
+		if (rsp->cache_policy & RNBD_FUA)
+			lim.features |= BLK_FEAT_FUA;
+	}
+
 	dev->gd = blk_mq_alloc_disk(&dev->sess->tag_set, &lim, dev);
 	if (IS_ERR(dev->gd))
 		return PTR_ERR(dev->gd);
 	dev->queue = dev->gd->queue;
 	rnbd_init_mq_hw_queues(dev);
-
-	blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, dev->queue);
-	blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, dev->queue);
-	blk_queue_write_cache(dev->queue,
-			      !!(rsp->cache_policy & RNBD_WRITEBACK),
-			      !!(rsp->cache_policy & RNBD_FUA));
 
 	return rnbd_clt_setup_gen_disk(dev, rsp, idx);
 }

+2 -2

drivers/block/rnbd/rnbd-srv-sysfs.c

···
 	kfree(dev);
 }
 
-static struct kobj_type dev_ktype = {
+static const struct kobj_type dev_ktype = {
 	.sysfs_ops = &kobj_sysfs_ops,
 	.release = rnbd_srv_dev_release
 };
···
 	rnbd_destroy_sess_dev(sess_dev, sess_dev->keep_id);
 }
 
-static struct kobj_type rnbd_srv_sess_dev_ktype = {
+static const struct kobj_type rnbd_srv_sess_dev_ktype = {
 	.sysfs_ops = &kobj_sysfs_ops,
 	.release = rnbd_srv_sess_dev_release,
 };

+73

drivers/block/rnull.rs

···
+// SPDX-License-Identifier: GPL-2.0
+
+//! This is a Rust implementation of the C null block driver.
+//!
+//! Supported features:
+//!
+//! - blk-mq interface
+//! - direct completion
+//! - block size 4k
+//!
+//! The driver is not configurable.
+
+use kernel::{
+    alloc::flags,
+    block::mq::{
+        self,
+        gen_disk::{self, GenDisk},
+        Operations, TagSet,
+    },
+    error::Result,
+    new_mutex, pr_info,
+    prelude::*,
+    sync::{Arc, Mutex},
+    types::ARef,
+};
+
+module! {
+    type: NullBlkModule,
+    name: "rnull_mod",
+    author: "Andreas Hindborg",
+    license: "GPL v2",
+}
+
+struct NullBlkModule {
+    _disk: Pin<Box<Mutex<GenDisk<NullBlkDevice>>>>,
+}
+
+impl kernel::Module for NullBlkModule {
+    fn init(_module: &'static ThisModule) -> Result<Self> {
+        pr_info!("Rust null_blk loaded\n");
+        let tagset = Arc::pin_init(TagSet::new(1, 256, 1), flags::GFP_KERNEL)?;
+
+        let disk = gen_disk::GenDiskBuilder::new()
+            .capacity_sectors(4096 << 11)
+            .logical_block_size(4096)?
+            .physical_block_size(4096)?
+            .rotational(false)
+            .build(format_args!("rnullb{}", 0), tagset)?;
+
+        let disk = Box::pin_init(new_mutex!(disk, "nullb:disk"), flags::GFP_KERNEL)?;
+
+        Ok(Self { _disk: disk })
+    }
+}
+
+struct NullBlkDevice;
+
+#[vtable]
+impl Operations for NullBlkDevice {
+    #[inline(always)]
+    fn queue_rq(rq: ARef<mq::Request<Self>>, _is_last: bool) -> Result {
+        mq::Request::end_ok(rq)
+            .map_err(|_e| kernel::error::code::EIO)
+            // We take no refcounts on the request, so we expect to be able to
+            // end the request. The request reference must be unique at this
+            // point, and so `end_ok` cannot fail.
+            .expect("Fatal error - expected to be able to end request");
+
+        Ok(())
+    }
+
+    fn commit_rqs() {}
+}

+1

drivers/block/sunvdc.c

···
 		.seg_boundary_mask = PAGE_SIZE - 1,
 		.max_segment_size = PAGE_SIZE,
 		.max_segments = port->ring_cookies,
+		.features = BLK_FEAT_ROTATIONAL,
 	};
 	struct request_queue *q;
 	struct gendisk *g;

+4 -1

drivers/block/swim.c

···
 
 static int swim_floppy_init(struct swim_priv *swd)
 {
+	struct queue_limits lim = {
+		.features = BLK_FEAT_ROTATIONAL,
+	};
 	int err;
 	int drive;
 	struct swim __iomem *base = swd->base;
···
 			goto exit_put_disks;
 
 		swd->unit[drive].disk =
-			blk_mq_alloc_disk(&swd->unit[drive].tag_set, NULL,
+			blk_mq_alloc_disk(&swd->unit[drive].tag_set, &lim,
 					  &swd->unit[drive]);
 		if (IS_ERR(swd->unit[drive].disk)) {
 			blk_mq_free_tag_set(&swd->unit[drive].tag_set);

+4 -1

drivers/block/swim3.c

···
 static int swim3_attach(struct macio_dev *mdev,
 			const struct of_device_id *match)
 {
+	struct queue_limits lim = {
+		.features = BLK_FEAT_ROTATIONAL,
+	};
 	struct floppy_state *fs;
 	struct gendisk *disk;
 	int rc;
···
 	if (rc)
 		goto out_unregister;
 
-	disk = blk_mq_alloc_disk(&fs->tag_set, NULL, fs);
+	disk = blk_mq_alloc_disk(&fs->tag_set, &lim, fs);
 	if (IS_ERR(disk)) {
 		rc = PTR_ERR(disk);
 		goto out_free_tag_set;

+11 -11

drivers/block/ublk_drv.c

···
 
 static void ublk_dev_param_zoned_apply(struct ublk_device *ub)
 {
-	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, ub->ub_disk->queue);
-
 	ub->ub_disk->nr_zones = ublk_get_nr_zones(ub);
 }
 
···
 
 static void ublk_dev_param_basic_apply(struct ublk_device *ub)
 {
-	struct request_queue *q = ub->ub_disk->queue;
 	const struct ublk_param_basic *p = &ub->params.basic;
-
-	blk_queue_write_cache(q, p->attrs & UBLK_ATTR_VOLATILE_CACHE,
-			      p->attrs & UBLK_ATTR_FUA);
-	if (p->attrs & UBLK_ATTR_ROTATIONAL)
-		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
-	else
-		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
 
 	if (p->attrs & UBLK_ATTR_READ_ONLY)
 		set_disk_ro(ub->ub_disk, true);
···
 		if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED))
 			return -EOPNOTSUPP;
 
-		lim.zoned = true;
+		lim.features |= BLK_FEAT_ZONED;
 		lim.max_active_zones = p->max_active_zones;
 		lim.max_open_zones = p->max_open_zones;
 		lim.max_zone_append_sectors = p->max_zone_append_sectors;
 	}
+
+	if (ub->params.basic.attrs & UBLK_ATTR_VOLATILE_CACHE) {
+		lim.features |= BLK_FEAT_WRITE_CACHE;
+		if (ub->params.basic.attrs & UBLK_ATTR_FUA)
+			lim.features |= BLK_FEAT_FUA;
+	}
+
+	if (ub->params.basic.attrs & UBLK_ATTR_ROTATIONAL)
+		lim.features |= BLK_FEAT_ROTATIONAL;
 
 	if (wait_for_completion_interruptible(&ub->completion) != 0)
 		return -EINTR;
···
 MODULE_PARM_DESC(ublks_max, "max number of ublk devices allowed to add(default: 64)");
 
 MODULE_AUTHOR("Ming Lei <ming.lei@redhat.com>");
+MODULE_DESCRIPTION("Userspace block device");
 MODULE_LICENSE("GPL");

+32 -36

drivers/block/virtio_blk.c

···
 
 	dev_dbg(&vdev->dev, "probing host-managed zoned device\n");
 
-	lim->zoned = true;
+	lim->features |= BLK_FEAT_ZONED;
 
 	virtio_cread(vdev, struct virtio_blk_config,
 		     zoned.max_open_zones, &v);
···
 	return writeback;
 }
 
-static void virtblk_update_cache_mode(struct virtio_device *vdev)
-{
-	u8 writeback = virtblk_get_cache_mode(vdev);
-	struct virtio_blk *vblk = vdev->priv;
-
-	blk_queue_write_cache(vblk->disk->queue, writeback, false);
-}
-
 static const char *const virtblk_cache_types[] = {
 	"write through", "write back"
 };
···
 	struct gendisk *disk = dev_to_disk(dev);
 	struct virtio_blk *vblk = disk->private_data;
 	struct virtio_device *vdev = vblk->vdev;
+	struct queue_limits lim;
 	int i;
 
 	BUG_ON(!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_CONFIG_WCE));
···
 		return i;
 
 	virtio_cwrite8(vdev, offsetof(struct virtio_blk_config, wce), i);
-	virtblk_update_cache_mode(vdev);
+
+	lim = queue_limits_start_update(disk->queue);
+	if (virtblk_get_cache_mode(vdev))
+		lim.features |= BLK_FEAT_WRITE_CACHE;
+	else
+		lim.features &= ~BLK_FEAT_WRITE_CACHE;
+	blk_mq_freeze_queue(disk->queue);
+	i = queue_limits_commit_update(disk->queue, &lim);
+	blk_mq_unfreeze_queue(disk->queue);
+	if (i)
+		return i;
 	return count;
 }
 
···
 		struct queue_limits *lim)
 {
 	struct virtio_device *vdev = vblk->vdev;
-	u32 v, blk_size, max_size, sg_elems, opt_io_size;
+	u32 v, max_size, sg_elems, opt_io_size;
 	u32 max_discard_segs = 0;
 	u32 discard_granularity = 0;
 	u16 min_io_size;
···
 	lim->max_segment_size = max_size;
 
 	/* Host can optionally specify the block size of the device */
-	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_BLK_SIZE,
+	virtio_cread_feature(vdev, VIRTIO_BLK_F_BLK_SIZE,
 				   struct virtio_blk_config, blk_size,
-				   &blk_size);
-	if (!err) {
-		err = blk_validate_block_size(blk_size);
-		if (err) {
-			dev_err(&vdev->dev,
-				"virtio_blk: invalid block size: 0x%x\n",
-				blk_size);
-			return err;
-		}
-
-		lim->logical_block_size = blk_size;
-	} else
-		blk_size = lim->logical_block_size;
+				   &lim->logical_block_size);
 
 	/* Use topology information if available */
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_TOPOLOGY,
 				   struct virtio_blk_config, physical_block_exp,
 				   &physical_block_exp);
 	if (!err && physical_block_exp)
-		lim->physical_block_size = blk_size * (1 << physical_block_exp);
+		lim->physical_block_size =
+			lim->logical_block_size * (1 << physical_block_exp);
 
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_TOPOLOGY,
 				   struct virtio_blk_config, alignment_offset,
 				   &alignment_offset);
 	if (!err && alignment_offset)
-		lim->alignment_offset = blk_size * alignment_offset;
+		lim->alignment_offset =
+			lim->logical_block_size * alignment_offset;
 
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_TOPOLOGY,
 				   struct virtio_blk_config, min_io_size,
 				   &min_io_size);
 	if (!err && min_io_size)
-		lim->io_min = blk_size * min_io_size;
+		lim->io_min = lim->logical_block_size * min_io_size;
 
 	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_TOPOLOGY,
 				   struct virtio_blk_config, opt_io_size,
 				   &opt_io_size);
 	if (!err && opt_io_size)
-		lim->io_opt = blk_size * opt_io_size;
+		lim->io_opt = lim->logical_block_size * opt_io_size;
 
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
 		virtio_cread(vdev, struct virtio_blk_config,
···
 			lim->discard_granularity =
 				discard_granularity << SECTOR_SHIFT;
 		else
-			lim->discard_granularity = blk_size;
+			lim->discard_granularity = lim->logical_block_size;
 	}
 
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_ZONED)) {
···
 static int virtblk_probe(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk;
-	struct queue_limits lim = { };
+	struct queue_limits lim = {
+		.features = BLK_FEAT_ROTATIONAL,
+		.logical_block_size = SECTOR_SIZE,
+	};
 	int err, index;
 	unsigned int queue_depth;
 
···
 	if (err)
 		goto out_free_tags;
 
+	if (virtblk_get_cache_mode(vdev))
+		lim.features |= BLK_FEAT_WRITE_CACHE;
+
 	vblk->disk = blk_mq_alloc_disk(&vblk->tag_set, &lim, vblk);
 	if (IS_ERR(vblk->disk)) {
 		err = PTR_ERR(vblk->disk);
···
 	vblk->disk->fops = &virtblk_fops;
 	vblk->index = index;
 
-	/* configure queue flush support */
-	virtblk_update_cache_mode(vdev);
-
 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
 		set_disk_ro(vblk->disk, 1);
···
 	 * All steps that follow use the VQs therefore they need to be
 	 * placed after the virtio_device_ready() call above.
 	 */
-	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && lim.zoned) {
-		blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, vblk->disk->queue);
+	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
+	    (lim.features & BLK_FEAT_ZONED)) {
 		err = blk_revalidate_disk_zones(vblk->disk);
 		if (err)
 			goto out_cleanup_disk;
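
In the virtio_blk.c hunks above, the topology limits are now scaled by lim->logical_block_size directly instead of a local blk_size copy. The arithmetic can be sketched in userspace as follows. struct model_limits, struct model_topology and model_apply_topology() are hypothetical names for illustration only, and the single have_topology flag stands in for the per-field virtio_cread_feature() error checks in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_limits {
	uint32_t logical_block_size;
	uint32_t physical_block_size;
	uint32_t alignment_offset;
	uint32_t io_min;
	uint32_t io_opt;
};

/* Topology values as a device might report them, in logical blocks. */
struct model_topology {
	uint8_t  physical_block_exp;	/* log2(physical / logical) */
	uint8_t  alignment_offset;
	uint16_t min_io_size;
	uint32_t opt_io_size;
};

/*
 * Hypothetical helper: every non-zero topology field is scaled by the
 * logical block size; zero fields leave the existing limit untouched.
 */
static void model_apply_topology(struct model_limits *lim, bool have_topology,
				 const struct model_topology *t)
{
	if (!have_topology)
		return;
	if (t->physical_block_exp)
		lim->physical_block_size =
			lim->logical_block_size * (1u << t->physical_block_exp);
	if (t->alignment_offset)
		lim->alignment_offset =
			lim->logical_block_size * t->alignment_offset;
	if (t->min_io_size)
		lim->io_min = lim->logical_block_size * t->min_io_size;
	if (t->opt_io_size)
		lim->io_opt = lim->logical_block_size * t->opt_io_size;
}
```

For example, a device with 512-byte logical blocks reporting physical_block_exp = 3 ends up with a 4096-byte physical block size under this model.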

+1

drivers/block/xen-blkback/blkback.c

···
 
 module_exit(xen_blkif_fini);
 
+MODULE_DESCRIPTION("Virtual block device back-end driver");
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_ALIAS("xen-backend:vbd");

+36 -37

drivers/block/xen-blkfront.c

···
 			 * A barrier request a superset of FUA, so we can
 			 * implement it the same way. (It's also a FLUSH+FUA,
 			 * since it is guaranteed ordered WRT previous writes.)
+			 *
+			 * Note that can end up here with a FUA write and the
+			 * flags cleared. This happens when the flag was
+			 * run-time disabled after a failing I/O, and we'll
+			 * simplify submit it as a normal write.
 			 */
 			if (info->feature_flush && info->feature_fua)
 				ring_req->operation =
···
 			else if (info->feature_flush)
 				ring_req->operation =
 					BLKIF_OP_FLUSH_DISKCACHE;
-			else
-				ring_req->operation = 0;
 		}
 		ring_req->u.rw.nr_segments = num_grant;
 		if (unlikely(require_extra_req)) {
···
 		notify_remote_via_irq(rinfo->irq);
 }
 
-static inline bool blkif_request_flush_invalid(struct request *req,
-					       struct blkfront_info *info)
-{
-	return (blk_rq_is_passthrough(req) ||
-		((req_op(req) == REQ_OP_FLUSH) &&
-		 !info->feature_flush) ||
-		((req->cmd_flags & REQ_FUA) &&
-		 !info->feature_fua));
-}
-
 static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 			  const struct blk_mq_queue_data *qd)
 {
···
 	rinfo = get_rinfo(info, qid);
 	blk_mq_start_request(qd->rq);
 	spin_lock_irqsave(&rinfo->ring_lock, flags);
+
+	/*
+	 * Check if the backend actually supports flushes.
+	 *
+	 * While the block layer won't send us flushes if we don't claim to
+	 * support them, the Xen protocol allows the backend to revoke support
+	 * at any time. That is of course a really bad idea and dangerous, but
+	 * has been allowed for 10+ years. In that case we simply clear the
+	 * flags, and directly return here for an empty flush and ignore the
+	 * FUA flag later on.
+	 */
+	if (unlikely(req_op(qd->rq) == REQ_OP_FLUSH && !info->feature_flush))
+		goto complete;
+
 	if (RING_FULL(&rinfo->ring))
 		goto out_busy;
-
-	if (blkif_request_flush_invalid(qd->rq, rinfo->dev_info))
-		goto out_err;
-
 	if (blkif_queue_request(qd->rq, rinfo))
 		goto out_busy;
 
···
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
 	return BLK_STS_OK;
 
-out_err:
-	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_IOERR;
-
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
 	return BLK_STS_DEV_RESOURCE;
+complete:
+	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+	blk_mq_end_request(qd->rq, BLK_STS_OK);
+	return BLK_STS_OK;
 }
 
 static void blkif_complete_rq(struct request *rq)
···
 		lim->discard_alignment = info->discard_alignment;
 		if (info->feature_secdiscard)
 			lim->max_secure_erase_sectors = UINT_MAX;
+	}
+
+	if (info->feature_flush) {
+		lim->features |= BLK_FEAT_WRITE_CACHE;
+		if (info->feature_fua)
+			lim->features |= BLK_FEAT_FUA;
 	}
 
 	/* Hard sector size and max sectors impersonate the equiv. hardware. */
···
 
 static void xlvbd_flush(struct blkfront_info *info)
 {
-	blk_queue_write_cache(info->rq, info->feature_flush ? true : false,
-			      info->feature_fua ? true : false);
 	pr_info("blkfront: %s: %s %s %s %s %s %s %s\n",
 		info->gd->disk_name, flush_info(info),
 		"persistent grants:", info->feature_persistent ?
···
 }
 
 static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
-		struct blkfront_info *info, u16 sector_size,
-		unsigned int physical_sector_size)
+		struct blkfront_info *info)
 {
 	struct queue_limits lim = {};
 	struct gendisk *gd;
···
 		err = PTR_ERR(gd);
 		goto out_free_tag_set;
 	}
-	blk_queue_flag_set(QUEUE_FLAG_VIRT, gd->queue);
 
 	strcpy(gd->disk_name, DEV_NAME);
 	ptr = encode_disk_name(gd->disk_name + sizeof(DEV_NAME) - 1, offset);
···
 
 	info->rq = gd->queue;
 	info->gd = gd;
-	info->sector_size = sector_size;
-	info->physical_sector_size = physical_sector_size;
 
 	xlvbd_flush(info);
 
···
 				blkif_req(req)->error = BLK_STS_NOTSUPP;
 				info->feature_discard = 0;
 				info->feature_secdiscard = 0;
-				blk_queue_max_discard_sectors(rq, 0);
-				blk_queue_max_secure_erase_sectors(rq, 0);
+				blk_queue_disable_discard(rq);
+				blk_queue_disable_secure_erase(rq);
 			}
 			break;
 		case BLKIF_OP_FLUSH_DISKCACHE:
···
 				blkif_req(req)->error = BLK_STS_OK;
 				info->feature_fua = 0;
 				info->feature_flush = 0;
-				xlvbd_flush(info);
 			}
 			fallthrough;
 		case BLKIF_OP_READ:
···
 static void blkfront_connect(struct blkfront_info *info)
 {
 	unsigned long long sectors;
-	unsigned long sector_size;
-	unsigned int physical_sector_size;
 	int err, i;
 	struct blkfront_ring_info *rinfo;
 
···
 	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
 			    "sectors", "%llu", &sectors,
 			    "info", "%u", &info->vdisk_info,
-			    "sector-size", "%lu", &sector_size,
+			    "sector-size", "%lu", &info->sector_size,
 			    NULL);
 	if (err) {
 		xenbus_dev_fatal(info->xbdev, err,
···
 	 * provide this. Assume physical sector size to be the same as
 	 * sector_size in that case.
 	 */
-	physical_sector_size = xenbus_read_unsigned(info->xbdev->otherend,
+	info->physical_sector_size = xenbus_read_unsigned(info->xbdev->otherend,
 						    "physical-sector-size",
-						    sector_size);
+						    info->sector_size);
 	blkfront_gather_backend_features(info);
 	for_each_rinfo(info, rinfo, i) {
 		err = blkfront_setup_indirect(rinfo);
···
 		}
 	}
 
-	err = xlvbd_alloc_gendisk(sectors, info, sector_size,
-				  physical_sector_size);
+	err = xlvbd_alloc_gendisk(sectors, info);
 	if (err) {
 		xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
 				 info->xbdev->otherend);

+1

drivers/block/z2ram.c

··· 409 409 410 410 module_init(z2_init); 411 411 module_exit(z2_exit); 412 + MODULE_DESCRIPTION("Amiga Zorro II ramdisk driver"); 412 413 MODULE_LICENSE("GPL");

+2 -4

drivers/block/zram/zram_drv.c

··· 2208 2208 #if ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE 2209 2209 .max_write_zeroes_sectors = UINT_MAX, 2210 2210 #endif 2211 + .features = BLK_FEAT_STABLE_WRITES | 2212 + BLK_FEAT_SYNCHRONOUS, 2211 2213 }; 2212 2214 struct zram *zram; 2213 2215 int ret, device_id; ··· 2247 2245 2248 2246 /* Actual capacity set using sysfs (/sys/block/zram<id>/disksize */ 2249 2247 set_capacity(zram->disk, 0); 2250 - /* zram devices sort of resembles non-rotational disks */ 2251 - blk_queue_flag_set(QUEUE_FLAG_NONROT, zram->disk->queue); 2252 - blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, zram->disk->queue); 2253 - blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue); 2254 2248 ret = device_add_disk(NULL, zram->disk, zram_disk_groups); 2255 2249 if (ret) 2256 2250 goto out_cleanup_disk;
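
Note on the hunk above: the zram change shows a pattern that recurs throughout this merge. Capabilities that were set with blk_queue_flag_set() after the disk was allocated are instead declared in the `.features` field of `struct queue_limits` before allocation. The following is a minimal userspace sketch of that pattern, not kernel code. The `sketch_` types and the numeric bit values are assumptions for illustration; only the flag names are taken from the diff. The dropped QUEUE_FLAG_NONROT has no replacement in the zram hunk, while gdrom gains BLK_FEAT_ROTATIONAL, which suggests that "rotational" is now the opt-in property.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the real ones are defined in the kernel headers. */
enum sketch_feat {
	SKETCH_FEAT_ROTATIONAL    = 1u << 0,
	SKETCH_FEAT_STABLE_WRITES = 1u << 1,
	SKETCH_FEAT_SYNCHRONOUS   = 1u << 2,
};

/* Stand-in for struct queue_limits, reduced to the one field of interest. */
struct sketch_queue_limits {
	unsigned int features;
};

/* Declare the capabilities up front, as the zram initializer now does. */
static struct sketch_queue_limits sketch_zram_limits(void)
{
	struct sketch_queue_limits lim = {
		.features = SKETCH_FEAT_STABLE_WRITES |
			    SKETCH_FEAT_SYNCHRONOUS,
	};

	return lim;
}

/* A device is rotational only if it explicitly asked for the flag. */
static bool sketch_is_rotational(const struct sketch_queue_limits *lim)
{
	assert(lim != NULL);
	return (lim->features & SKETCH_FEAT_ROTATIONAL) != 0;
}
```

Because the limits are a plain value filled in before the queue exists, a driver in this model has no window in which the queue is live but its flags are not yet set.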

+1

drivers/cdrom/cdrom.c

··· 3708 3708 3709 3709 module_init(cdrom_init); 3710 3710 module_exit(cdrom_exit); 3711 + MODULE_DESCRIPTION("Uniform CD-ROM driver"); 3711 3712 MODULE_LICENSE("GPL");

+1

drivers/cdrom/gdrom.c

··· 744 744 .max_segments = 1, 745 745 /* set a large max size to get most from DMA */ 746 746 .max_segment_size = 0x40000, 747 + .features = BLK_FEAT_ROTATIONAL, 747 748 }; 748 749 int err; 749 750

+3 -10

drivers/md/bcache/super.c

··· 897 897 sector_t sectors, struct block_device *cached_bdev, 898 898 const struct block_device_operations *ops) 899 899 { 900 - struct request_queue *q; 901 900 const size_t max_stripes = min_t(size_t, INT_MAX, 902 901 SIZE_MAX / sizeof(atomic_t)); 903 902 struct queue_limits lim = { ··· 908 909 .io_min = block_size, 909 910 .logical_block_size = block_size, 910 911 .physical_block_size = block_size, 912 + .features = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA, 911 913 }; 912 914 uint64_t n; 913 915 int idx; ··· 974 974 d->disk->minors = BCACHE_MINORS; 975 975 d->disk->fops = ops; 976 976 d->disk->private_data = d; 977 - 978 - q = d->disk->queue; 979 - 980 - blk_queue_flag_set(QUEUE_FLAG_NONROT, d->disk->queue); 981 - 982 - blk_queue_write_cache(q, true, true); 983 - 984 977 return 0; 985 978 986 979 out_bioset_exit: ··· 1416 1423 } 1417 1424 1418 1425 if (bdev_io_opt(dc->bdev)) 1419 - dc->partial_stripes_expensive = 1420 - q->limits.raid_partial_stripes_expensive; 1426 + dc->partial_stripes_expensive = !!(q->limits.features & 1427 + BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE); 1421 1428 1422 1429 ret = bcache_device_init(&dc->disk, block_size, 1423 1430 bdev_nr_sectors(dc->bdev) - dc->sb.data_offset,

-1

drivers/md/dm-cache-target.c

··· 3403 3403 limits->max_hw_discard_sectors = origin_limits->max_hw_discard_sectors; 3404 3404 limits->discard_granularity = origin_limits->discard_granularity; 3405 3405 limits->discard_alignment = origin_limits->discard_alignment; 3406 - limits->discard_misaligned = origin_limits->discard_misaligned; 3407 3406 } 3408 3407 3409 3408 static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)

-1

drivers/md/dm-clone-target.c

··· 2059 2059 limits->max_hw_discard_sectors = dest_limits->max_hw_discard_sectors; 2060 2060 limits->discard_granularity = dest_limits->discard_granularity; 2061 2061 limits->discard_alignment = dest_limits->discard_alignment; 2062 - limits->discard_misaligned = dest_limits->discard_misaligned; 2063 2062 limits->max_discard_segments = dest_limits->max_discard_segments; 2064 2063 } 2065 2064

-1

drivers/md/dm-core.h

··· 206 206 207 207 bool integrity_supported:1; 208 208 bool singleton:1; 209 - unsigned integrity_added:1; 210 209 211 210 /* 212 211 * Indicates the rw permissions for the new logical device. This

+2 -2

drivers/md/dm-crypt.c

··· 1176 1176 struct blk_integrity *bi = blk_get_integrity(cc->dev->bdev->bd_disk); 1177 1177 struct mapped_device *md = dm_table_get_md(ti->table); 1178 1178 1179 - /* From now we require underlying device with our integrity profile */ 1180 - if (!bi || strcasecmp(bi->profile->name, "DM-DIF-EXT-TAG")) { 1179 + /* We require an underlying device with non-PI metadata */ 1180 + if (!bi || bi->csum_type != BLK_INTEGRITY_CSUM_NONE) { 1181 1181 ti->error = "Integrity profile not supported."; 1182 1182 return -EINVAL; 1183 1183 }

+11 -36

drivers/md/dm-integrity.c

··· 350 350 #define DEBUG_bytes(bytes, len, msg, ...) do { } while (0) 351 351 #endif 352 352 353 - static void dm_integrity_prepare(struct request *rq) 354 - { 355 - } 356 - 357 - static void dm_integrity_complete(struct request *rq, unsigned int nr_bytes) 358 - { 359 - } 360 - 361 - /* 362 - * DM Integrity profile, protection is performed layer above (dm-crypt) 363 - */ 364 - static const struct blk_integrity_profile dm_integrity_profile = { 365 - .name = "DM-DIF-EXT-TAG", 366 - .generate_fn = NULL, 367 - .verify_fn = NULL, 368 - .prepare_fn = dm_integrity_prepare, 369 - .complete_fn = dm_integrity_complete, 370 - }; 371 - 372 353 static void dm_integrity_map_continue(struct dm_integrity_io *dio, bool from_map); 373 354 static void integrity_bio_wait(struct work_struct *w); 374 355 static void dm_integrity_dtr(struct dm_target *ti); ··· 3475 3494 limits->dma_alignment = limits->logical_block_size - 1; 3476 3495 limits->discard_granularity = ic->sectors_per_block << SECTOR_SHIFT; 3477 3496 } 3497 + 3498 + if (!ic->internal_hash) { 3499 + struct blk_integrity *bi = &limits->integrity; 3500 + 3501 + memset(bi, 0, sizeof(*bi)); 3502 + bi->tuple_size = ic->tag_size; 3503 + bi->tag_size = bi->tuple_size; 3504 + bi->interval_exp = 3505 + ic->sb->log2_sectors_per_block + SECTOR_SHIFT; 3506 + } 3507 + 3478 3508 limits->max_integrity_segments = USHRT_MAX; 3479 3509 } 3480 3510 ··· 3640 3648 sb_set_version(ic); 3641 3649 3642 3650 return 0; 3643 - } 3644 - 3645 - static void dm_integrity_set(struct dm_target *ti, struct dm_integrity_c *ic) 3646 - { 3647 - struct gendisk *disk = dm_disk(dm_table_get_md(ti->table)); 3648 - struct blk_integrity bi; 3649 - 3650 - memset(&bi, 0, sizeof(bi)); 3651 - bi.profile = &dm_integrity_profile; 3652 - bi.tuple_size = ic->tag_size; 3653 - bi.tag_size = bi.tuple_size; 3654 - bi.interval_exp = ic->sb->log2_sectors_per_block + SECTOR_SHIFT; 3655 - 3656 - blk_integrity_register(disk, &bi); 3657 3651 } 3658 3652 3659 3653 static void 
dm_integrity_free_page_list(struct page_list *pl) ··· 4626 4648 goto bad; 4627 4649 } 4628 4650 } 4629 - 4630 - if (!ic->internal_hash) 4631 - dm_integrity_set(ti, ic); 4632 4651 4633 4652 ti->num_flush_bios = 1; 4634 4653 ti->flush_supported = true;
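
Note on the hunk above: dm-integrity now fills `limits->integrity` directly instead of registering a profile, and sets `interval_exp` to `log2_sectors_per_block + SECTOR_SHIFT`. The arithmetic can be sketched in userspace as below. This assumes that `interval_exp` is the base-2 logarithm of the protection interval in bytes, which is what the added `SECTOR_SHIFT` implies; the `sketch_` names are hypothetical.

```c
#include <assert.h>

/* 512-byte sectors, matching the kernel's SECTOR_SHIFT of 9. */
#define SKETCH_SECTOR_SHIFT 9

/* log2 of the integrity interval in bytes for a given block size. */
static unsigned int sketch_interval_exp(unsigned int log2_sectors_per_block)
{
	/* Keep the later shift well inside the width of unsigned long. */
	assert(log2_sectors_per_block < 23);
	return log2_sectors_per_block + SKETCH_SECTOR_SHIFT;
}

/* The interval itself, in bytes. */
static unsigned long sketch_interval_bytes(unsigned int log2_sectors_per_block)
{
	return 1UL << sketch_interval_exp(log2_sectors_per_block);
}
```

For example, a block of 8 sectors (log2 value 3) gives an exponent of 12, that is, one integrity tuple per 4096 bytes.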

+1 -1

drivers/md/dm-raid.c

··· 3542 3542 recovery = rs->md.recovery; 3543 3543 state = decipher_sync_action(mddev, recovery); 3544 3544 progress = rs_get_progress(rs, recovery, state, resync_max_sectors); 3545 - resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ? 3545 + resync_mismatches = mddev->last_sync_action == ACTION_CHECK ? 3546 3546 atomic64_read(&mddev->resync_mismatches) : 0; 3547 3547 3548 3548 /* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */

+73 -278

drivers/md/dm-table.c

··· 425 425 q->limits.logical_block_size, 426 426 q->limits.alignment_offset, 427 427 (unsigned long long) start << SECTOR_SHIFT); 428 + 429 + /* 430 + * Only stack the integrity profile if the target doesn't have native 431 + * integrity support. 432 + */ 433 + if (!dm_target_has_integrity(ti->type)) 434 + queue_limits_stack_integrity_bdev(limits, bdev); 428 435 return 0; 429 436 } 430 437 ··· 579 572 return 0; 580 573 } 581 574 575 + static void dm_set_stacking_limits(struct queue_limits *limits) 576 + { 577 + blk_set_stacking_limits(limits); 578 + limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL; 579 + } 580 + 582 581 /* 583 582 * Impose necessary and sufficient conditions on a devices's table such 584 583 * that any incoming bio which respects its logical_block_size can be ··· 623 610 for (i = 0; i < t->num_targets; i++) { 624 611 ti = dm_table_get_target(t, i); 625 612 626 - blk_set_stacking_limits(&ti_limits); 613 + dm_set_stacking_limits(&ti_limits); 627 614 628 615 /* combine all target devices' limits */ 629 616 if (ti->type->iterate_devices) ··· 714 701 } 715 702 t->immutable_target_type = ti->type; 716 703 } 717 - 718 - if (dm_target_has_integrity(ti->type)) 719 - t->integrity_added = 1; 720 704 721 705 ti->table = t; 722 706 ti->begin = start; ··· 1024 1014 return __table_type_request_based(dm_table_get_type(t)); 1025 1015 } 1026 1016 1027 - static bool dm_table_supports_poll(struct dm_table *t); 1028 - 1029 1017 static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *md) 1030 1018 { 1031 1019 enum dm_queue_mode type = dm_table_get_type(t); 1032 1020 unsigned int per_io_data_size = 0, front_pad, io_front_pad; 1033 1021 unsigned int min_pool_size = 0, pool_size; 1034 1022 struct dm_md_mempools *pools; 1023 + unsigned int bioset_flags = 0; 1035 1024 1036 1025 if (unlikely(type == DM_TYPE_NONE)) { 1037 1026 DMERR("no table type is set, can't allocate mempools"); ··· 1047 1038 goto init_bs; 1048 1039 } 1049 1040 
1041 + if (md->queue->limits.features & BLK_FEAT_POLL) 1042 + bioset_flags |= BIOSET_PERCPU_CACHE; 1043 + 1050 1044 for (unsigned int i = 0; i < t->num_targets; i++) { 1051 1045 struct dm_target *ti = dm_table_get_target(t, i); 1052 1046 ··· 1062 1050 1063 1051 io_front_pad = roundup(per_io_data_size, 1064 1052 __alignof__(struct dm_io)) + DM_IO_BIO_OFFSET; 1065 - if (bioset_init(&pools->io_bs, pool_size, io_front_pad, 1066 - dm_table_supports_poll(t) ? BIOSET_PERCPU_CACHE : 0)) 1053 + if (bioset_init(&pools->io_bs, pool_size, io_front_pad, bioset_flags)) 1067 1054 goto out_free_pools; 1068 1055 if (t->integrity_supported && 1069 1056 bioset_integrity_create(&pools->io_bs, pool_size)) ··· 1128 1117 r = setup_indexes(t); 1129 1118 1130 1119 return r; 1131 - } 1132 - 1133 - static bool integrity_profile_exists(struct gendisk *disk) 1134 - { 1135 - return !!blk_get_integrity(disk); 1136 - } 1137 - 1138 - /* 1139 - * Get a disk whose integrity profile reflects the table's profile. 1140 - * Returns NULL if integrity support was inconsistent or unavailable. 
1141 - */ 1142 - static struct gendisk *dm_table_get_integrity_disk(struct dm_table *t) 1143 - { 1144 - struct list_head *devices = dm_table_get_devices(t); 1145 - struct dm_dev_internal *dd = NULL; 1146 - struct gendisk *prev_disk = NULL, *template_disk = NULL; 1147 - 1148 - for (unsigned int i = 0; i < t->num_targets; i++) { 1149 - struct dm_target *ti = dm_table_get_target(t, i); 1150 - 1151 - if (!dm_target_passes_integrity(ti->type)) 1152 - goto no_integrity; 1153 - } 1154 - 1155 - list_for_each_entry(dd, devices, list) { 1156 - template_disk = dd->dm_dev->bdev->bd_disk; 1157 - if (!integrity_profile_exists(template_disk)) 1158 - goto no_integrity; 1159 - else if (prev_disk && 1160 - blk_integrity_compare(prev_disk, template_disk) < 0) 1161 - goto no_integrity; 1162 - prev_disk = template_disk; 1163 - } 1164 - 1165 - return template_disk; 1166 - 1167 - no_integrity: 1168 - if (prev_disk) 1169 - DMWARN("%s: integrity not set: %s and %s profile mismatch", 1170 - dm_device_name(t->md), 1171 - prev_disk->disk_name, 1172 - template_disk->disk_name); 1173 - return NULL; 1174 - } 1175 - 1176 - /* 1177 - * Register the mapped device for blk_integrity support if the 1178 - * underlying devices have an integrity profile. But all devices may 1179 - * not have matching profiles (checking all devices isn't reliable 1180 - * during table load because this table may use other DM device(s) which 1181 - * must be resumed before they will have an initialized integity 1182 - * profile). Consequently, stacked DM devices force a 2 stage integrity 1183 - * profile validation: First pass during table load, final pass during 1184 - * resume. 1185 - */ 1186 - static int dm_table_register_integrity(struct dm_table *t) 1187 - { 1188 - struct mapped_device *md = t->md; 1189 - struct gendisk *template_disk = NULL; 1190 - 1191 - /* If target handles integrity itself do not register it here. 
*/ 1192 - if (t->integrity_added) 1193 - return 0; 1194 - 1195 - template_disk = dm_table_get_integrity_disk(t); 1196 - if (!template_disk) 1197 - return 0; 1198 - 1199 - if (!integrity_profile_exists(dm_disk(md))) { 1200 - t->integrity_supported = true; 1201 - /* 1202 - * Register integrity profile during table load; we can do 1203 - * this because the final profile must match during resume. 1204 - */ 1205 - blk_integrity_register(dm_disk(md), 1206 - blk_get_integrity(template_disk)); 1207 - return 0; 1208 - } 1209 - 1210 - /* 1211 - * If DM device already has an initialized integrity 1212 - * profile the new profile should not conflict. 1213 - */ 1214 - if (blk_integrity_compare(dm_disk(md), template_disk) < 0) { 1215 - DMERR("%s: conflict with existing integrity profile: %s profile mismatch", 1216 - dm_device_name(t->md), 1217 - template_disk->disk_name); 1218 - return 1; 1219 - } 1220 - 1221 - /* Preserve existing integrity profile */ 1222 - t->integrity_supported = true; 1223 - return 0; 1224 1120 } 1225 1121 1226 1122 #ifdef CONFIG_BLK_INLINE_ENCRYPTION ··· 1341 1423 return r; 1342 1424 } 1343 1425 1344 - r = dm_table_register_integrity(t); 1345 - if (r) { 1346 - DMERR("could not register integrity profile."); 1347 - return r; 1348 - } 1349 - 1350 1426 r = dm_table_construct_crypto_profile(t); 1351 1427 if (r) { 1352 1428 DMERR("could not construct crypto profile."); ··· 1405 1493 return &t->targets[(KEYS_PER_NODE * n) + k]; 1406 1494 } 1407 1495 1408 - static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev, 1409 - sector_t start, sector_t len, void *data) 1410 - { 1411 - struct request_queue *q = bdev_get_queue(dev->bdev); 1412 - 1413 - return !test_bit(QUEUE_FLAG_POLL, &q->queue_flags); 1414 - } 1415 - 1416 1496 /* 1417 1497 * type->iterate_devices() should be called when the sanity check needs to 1418 1498 * iterate and check all underlying data devices. 
iterate_devices() will ··· 1450 1546 (*num_devices)++; 1451 1547 1452 1548 return 0; 1453 - } 1454 - 1455 - static bool dm_table_supports_poll(struct dm_table *t) 1456 - { 1457 - for (unsigned int i = 0; i < t->num_targets; i++) { 1458 - struct dm_target *ti = dm_table_get_target(t, i); 1459 - 1460 - if (!ti->type->iterate_devices || 1461 - ti->type->iterate_devices(ti, device_not_poll_capable, NULL)) 1462 - return false; 1463 - } 1464 - 1465 - return true; 1466 1549 } 1467 1550 1468 1551 /* ··· 1577 1686 unsigned int zone_sectors = 0; 1578 1687 bool zoned = false; 1579 1688 1580 - blk_set_stacking_limits(limits); 1689 + dm_set_stacking_limits(limits); 1690 + 1691 + t->integrity_supported = true; 1692 + for (unsigned int i = 0; i < t->num_targets; i++) { 1693 + struct dm_target *ti = dm_table_get_target(t, i); 1694 + 1695 + if (!dm_target_passes_integrity(ti->type)) 1696 + t->integrity_supported = false; 1697 + } 1581 1698 1582 1699 for (unsigned int i = 0; i < t->num_targets; i++) { 1583 1700 struct dm_target *ti = dm_table_get_target(t, i); 1584 1701 1585 - blk_set_stacking_limits(&ti_limits); 1702 + dm_set_stacking_limits(&ti_limits); 1586 1703 1587 1704 if (!ti->type->iterate_devices) { 1588 1705 /* Set I/O hints portion of queue limits */ ··· 1605 1706 ti->type->iterate_devices(ti, dm_set_device_limits, 1606 1707 &ti_limits); 1607 1708 1608 - if (!zoned && ti_limits.zoned) { 1709 + if (!zoned && (ti_limits.features & BLK_FEAT_ZONED)) { 1609 1710 /* 1610 1711 * After stacking all limits, validate all devices 1611 1712 * in table support this zoned model and zone sectors. 
1612 1713 */ 1613 - zoned = ti_limits.zoned; 1714 + zoned = (ti_limits.features & BLK_FEAT_ZONED); 1614 1715 zone_sectors = ti_limits.chunk_sectors; 1615 1716 } 1616 1717 ··· 1637 1738 dm_device_name(t->md), 1638 1739 (unsigned long long) ti->begin, 1639 1740 (unsigned long long) ti->len); 1741 + 1742 + if (t->integrity_supported || 1743 + dm_target_has_integrity(ti->type)) { 1744 + if (!queue_limits_stack_integrity(limits, &ti_limits)) { 1745 + DMWARN("%s: adding target device (start sect %llu len %llu) " 1746 + "disabled integrity support due to incompatibility", 1747 + dm_device_name(t->md), 1748 + (unsigned long long) ti->begin, 1749 + (unsigned long long) ti->len); 1750 + t->integrity_supported = false; 1751 + } 1752 + } 1640 1753 } 1641 1754 1642 1755 /* ··· 1658 1747 * zoned model on host-managed zoned block devices. 1659 1748 * BUT... 1660 1749 */ 1661 - if (limits->zoned) { 1750 + if (limits->features & BLK_FEAT_ZONED) { 1662 1751 /* 1663 1752 * ...IF the above limits stacking determined a zoned model 1664 1753 * validate that all of the table's devices conform to it. 1665 1754 */ 1666 - zoned = limits->zoned; 1755 + zoned = limits->features & BLK_FEAT_ZONED; 1667 1756 zone_sectors = limits->chunk_sectors; 1668 1757 } 1669 1758 if (validate_hardware_zoned(t, zoned, zone_sectors)) ··· 1673 1762 } 1674 1763 1675 1764 /* 1676 - * Verify that all devices have an integrity profile that matches the 1677 - * DM device's registered integrity profile. If the profiles don't 1678 - * match then unregister the DM device's integrity profile. 1765 + * Check if a target requires flush support even if none of the underlying 1766 + * devices need it (e.g. to persist target-specific metadata). 
1679 1767 */ 1680 - static void dm_table_verify_integrity(struct dm_table *t) 1768 + static bool dm_table_supports_flush(struct dm_table *t) 1681 1769 { 1682 - struct gendisk *template_disk = NULL; 1683 - 1684 - if (t->integrity_added) 1685 - return; 1686 - 1687 - if (t->integrity_supported) { 1688 - /* 1689 - * Verify that the original integrity profile 1690 - * matches all the devices in this table. 1691 - */ 1692 - template_disk = dm_table_get_integrity_disk(t); 1693 - if (template_disk && 1694 - blk_integrity_compare(dm_disk(t->md), template_disk) >= 0) 1695 - return; 1696 - } 1697 - 1698 - if (integrity_profile_exists(dm_disk(t->md))) { 1699 - DMWARN("%s: unable to establish an integrity profile", 1700 - dm_device_name(t->md)); 1701 - blk_integrity_unregister(dm_disk(t->md)); 1702 - } 1703 - } 1704 - 1705 - static int device_flush_capable(struct dm_target *ti, struct dm_dev *dev, 1706 - sector_t start, sector_t len, void *data) 1707 - { 1708 - unsigned long flush = (unsigned long) data; 1709 - struct request_queue *q = bdev_get_queue(dev->bdev); 1710 - 1711 - return (q->queue_flags & flush); 1712 - } 1713 - 1714 - static bool dm_table_supports_flush(struct dm_table *t, unsigned long flush) 1715 - { 1716 - /* 1717 - * Require at least one underlying device to support flushes. 1718 - * t->devices includes internal dm devices such as mirror logs 1719 - * so we need to use iterate_devices here, which targets 1720 - * supporting flushes must provide. 
1721 - */ 1722 1770 for (unsigned int i = 0; i < t->num_targets; i++) { 1723 1771 struct dm_target *ti = dm_table_get_target(t, i); 1724 1772 1725 - if (!ti->num_flush_bios) 1726 - continue; 1727 - 1728 - if (ti->flush_supported) 1729 - return true; 1730 - 1731 - if (ti->type->iterate_devices && 1732 - ti->type->iterate_devices(ti, device_flush_capable, (void *) flush)) 1773 + if (ti->num_flush_bios && ti->flush_supported) 1733 1774 return true; 1734 1775 } 1735 1776 ··· 1700 1837 if (dax_write_cache_enabled(dax_dev)) 1701 1838 return true; 1702 1839 return false; 1703 - } 1704 - 1705 - static int device_is_rotational(struct dm_target *ti, struct dm_dev *dev, 1706 - sector_t start, sector_t len, void *data) 1707 - { 1708 - return !bdev_nonrot(dev->bdev); 1709 - } 1710 - 1711 - static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev, 1712 - sector_t start, sector_t len, void *data) 1713 - { 1714 - struct request_queue *q = bdev_get_queue(dev->bdev); 1715 - 1716 - return !blk_queue_add_random(q); 1717 1840 } 1718 1841 1719 1842 static int device_not_write_zeroes_capable(struct dm_target *ti, struct dm_dev *dev, ··· 1726 1877 return true; 1727 1878 } 1728 1879 1729 - static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev *dev, 1730 - sector_t start, sector_t len, void *data) 1731 - { 1732 - return !bdev_nowait(dev->bdev); 1733 - } 1734 - 1735 1880 static bool dm_table_supports_nowait(struct dm_table *t) 1736 1881 { 1737 1882 for (unsigned int i = 0; i < t->num_targets; i++) { 1738 1883 struct dm_target *ti = dm_table_get_target(t, i); 1739 1884 1740 1885 if (!dm_target_supports_nowait(ti->type)) 1741 - return false; 1742 - 1743 - if (!ti->type->iterate_devices || 1744 - ti->type->iterate_devices(ti, device_not_nowait_capable, NULL)) 1745 1886 return false; 1746 1887 } 1747 1888 ··· 1789 1950 return true; 1790 1951 } 1791 1952 1792 - static int device_requires_stable_pages(struct dm_target *ti, 1793 - struct dm_dev *dev, sector_t 
start, 1794 - sector_t len, void *data) 1795 - { 1796 - return bdev_stable_writes(dev->bdev); 1797 - } 1798 - 1799 1953 int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, 1800 1954 struct queue_limits *limits) 1801 1955 { 1802 - bool wc = false, fua = false; 1803 1956 int r; 1804 1957 1805 - if (dm_table_supports_nowait(t)) 1806 - blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q); 1807 - else 1808 - blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, q); 1958 + if (!dm_table_supports_nowait(t)) 1959 + limits->features &= ~BLK_FEAT_NOWAIT; 1960 + 1961 + /* 1962 + * The current polling impementation does not support request based 1963 + * stacking. 1964 + */ 1965 + if (!__table_type_bio_based(t->type)) 1966 + limits->features &= ~BLK_FEAT_POLL; 1809 1967 1810 1968 if (!dm_table_supports_discards(t)) { 1811 1969 limits->max_hw_discard_sectors = 0; 1812 1970 limits->discard_granularity = 0; 1813 1971 limits->discard_alignment = 0; 1814 - limits->discard_misaligned = 0; 1815 1972 } 1816 1973 1817 1974 if (!dm_table_supports_write_zeroes(t)) ··· 1816 1981 if (!dm_table_supports_secure_erase(t)) 1817 1982 limits->max_secure_erase_sectors = 0; 1818 1983 1819 - if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_WC))) { 1820 - wc = true; 1821 - if (dm_table_supports_flush(t, (1UL << QUEUE_FLAG_FUA))) 1822 - fua = true; 1823 - } 1824 - blk_queue_write_cache(q, wc, fua); 1984 + if (dm_table_supports_flush(t)) 1985 + limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA; 1825 1986 1826 1987 if (dm_table_supports_dax(t, device_not_dax_capable)) { 1827 - blk_queue_flag_set(QUEUE_FLAG_DAX, q); 1988 + limits->features |= BLK_FEAT_DAX; 1828 1989 if (dm_table_supports_dax(t, device_not_dax_synchronous_capable)) 1829 1990 set_dax_synchronous(t->md->dax_dev); 1830 1991 } else 1831 - blk_queue_flag_clear(QUEUE_FLAG_DAX, q); 1992 + limits->features &= ~BLK_FEAT_DAX; 1832 1993 1833 1994 if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL)) 1834 1995 
dax_write_cache(t->md->dax_dev, true); 1835 1996 1836 - /* Ensure that all underlying devices are non-rotational. */ 1837 - if (dm_table_any_dev_attr(t, device_is_rotational, NULL)) 1838 - blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); 1839 - else 1840 - blk_queue_flag_set(QUEUE_FLAG_NONROT, q); 1841 - 1842 - dm_table_verify_integrity(t); 1843 - 1844 - /* 1845 - * Some devices don't use blk_integrity but still want stable pages 1846 - * because they do their own checksumming. 1847 - * If any underlying device requires stable pages, a table must require 1848 - * them as well. Only targets that support iterate_devices are considered: 1849 - * don't want error, zero, etc to require stable pages. 1850 - */ 1851 - if (dm_table_any_dev_attr(t, device_requires_stable_pages, NULL)) 1852 - blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q); 1853 - else 1854 - blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q); 1855 - 1856 - /* 1857 - * Determine whether or not this queue's I/O timings contribute 1858 - * to the entropy pool, Only request-based targets use this. 1859 - * Clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not 1860 - * have it set. 1861 - */ 1862 - if (blk_queue_add_random(q) && 1863 - dm_table_any_dev_attr(t, device_is_not_random, NULL)) 1864 - blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q); 1865 - 1866 - /* 1867 - * For a zoned target, setup the zones related queue attributes 1868 - * and resources necessary for zone append emulation if necessary. 1869 - */ 1870 - if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && limits->zoned) { 1997 + /* For a zoned table, setup the zone related queue attributes. 
*/ 1998 + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && 1999 + (limits->features & BLK_FEAT_ZONED)) { 1871 2000 r = dm_set_zones_restrictions(t, q, limits); 1872 2001 if (r) 1873 2002 return r; ··· 1841 2042 if (r) 1842 2043 return r; 1843 2044 1844 - dm_update_crypto_profile(q, t); 1845 - 1846 2045 /* 1847 - * Check for request-based device is left to 1848 - * dm_mq_init_request_queue()->blk_mq_init_allocated_queue(). 1849 - * 1850 - * For bio-based device, only set QUEUE_FLAG_POLL when all 1851 - * underlying devices supporting polling. 2046 + * Now that the limits are set, check the zones mapped by the table 2047 + * and setup the resources for zone append emulation if necessary. 1852 2048 */ 1853 - if (__table_type_bio_based(t->type)) { 1854 - if (dm_table_supports_poll(t)) 1855 - blk_queue_flag_set(QUEUE_FLAG_POLL, q); 1856 - else 1857 - blk_queue_flag_clear(QUEUE_FLAG_POLL, q); 2049 + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && 2050 + (limits->features & BLK_FEAT_ZONED)) { 2051 + r = dm_revalidate_zones(t, q); 2052 + if (r) 2053 + return r; 1858 2054 } 1859 2055 2056 + dm_update_crypto_profile(q, t); 1860 2057 return 0; 1861 2058 } 1862 2059
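
Note on the dm-table.c changes above: the per-device callbacks for nowait, poll, rotational and stable-writes are gone. Instead, dm_set_stacking_limits() starts from an optimistic feature set and dm_table_set_restrictions() clears or adds bits based on table-level properties. The sketch below models only that last step in userspace. The bit values, the `sketch_table` struct and its field names are assumptions for illustration, and the per-device stacking that happens in between is not part of this hunk and is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; only the names mirror the diff. */
enum {
	SKETCH_FEAT_WRITE_CACHE = 1u << 0,
	SKETCH_FEAT_FUA         = 1u << 1,
	SKETCH_FEAT_IO_STAT     = 1u << 2,
	SKETCH_FEAT_NOWAIT      = 1u << 3,
	SKETCH_FEAT_POLL        = 1u << 4,
};

/* Table-level facts that the real code derives from the targets. */
struct sketch_table {
	bool all_targets_nowait;  /* cf. dm_table_supports_nowait() */
	bool bio_based;           /* cf. __table_type_bio_based() */
	bool target_needs_flush;  /* cf. dm_table_supports_flush() */
};

/* Optimistic starting point, as in dm_set_stacking_limits(). */
static unsigned int sketch_stacking_features(void)
{
	return SKETCH_FEAT_IO_STAT | SKETCH_FEAT_NOWAIT | SKETCH_FEAT_POLL;
}

/* Narrow or extend the feature set, as in dm_table_set_restrictions(). */
static unsigned int sketch_set_restrictions(unsigned int features,
					    const struct sketch_table *t)
{
	assert(t != NULL);

	if (!t->all_targets_nowait)
		features &= ~SKETCH_FEAT_NOWAIT;
	/* Polling is kept only for bio-based tables. */
	if (!t->bio_based)
		features &= ~SKETCH_FEAT_POLL;
	/* A target may need flushes even if no underlying device does. */
	if (t->target_needs_flush)
		features |= SKETCH_FEAT_WRITE_CACHE | SKETCH_FEAT_FUA;

	return features;
}
```

In this model a request-based table always loses the poll bit, while write-cache and FUA are only ever added, never cleared, by the table-level check.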

+207 -47

drivers/md/dm-zone.c

··· 13 13 14 14 #define DM_MSG_PREFIX "zone" 15 15 16 - #define DM_ZONE_INVALID_WP_OFST UINT_MAX 17 - 18 16 /* 19 17 * For internal zone reports bypassing the top BIO submission path. 20 18 */ ··· 144 146 } 145 147 146 148 /* 147 - * Count conventional zones of a mapped zoned device. If the device 148 - * only has conventional zones, do not expose it as zoned. 149 - */ 150 - static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx, 151 - void *data) 152 - { 153 - unsigned int *nr_conv_zones = data; 154 - 155 - if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) 156 - (*nr_conv_zones)++; 157 - 158 - return 0; 159 - } 160 - 161 - /* 162 149 * Revalidate the zones of a mapped device to initialize resource necessary 163 150 * for zone append emulation. Note that we cannot simply use the block layer 164 151 * blk_revalidate_disk_zones() function here as the mapped device is suspended 165 152 * (this is called from __bind() context). 166 153 */ 167 - static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t) 154 + int dm_revalidate_zones(struct dm_table *t, struct request_queue *q) 168 155 { 156 + struct mapped_device *md = t->md; 169 157 struct gendisk *disk = md->disk; 170 158 int ret; 171 159 160 + if (!get_capacity(disk)) 161 + return 0; 162 + 172 163 /* Revalidate only if something changed. */ 173 - if (!disk->nr_zones || disk->nr_zones != md->nr_zones) 164 + if (!disk->nr_zones || disk->nr_zones != md->nr_zones) { 165 + DMINFO("%s using %s zone append", 166 + disk->disk_name, 167 + queue_emulates_zone_append(q) ? 
"emulated" : "native"); 174 168 md->nr_zones = 0; 169 + } 175 170 176 171 if (md->nr_zones) 177 172 return 0; ··· 211 220 return true; 212 221 } 213 222 223 + struct dm_device_zone_count { 224 + sector_t start; 225 + sector_t len; 226 + unsigned int total_nr_seq_zones; 227 + unsigned int target_nr_seq_zones; 228 + }; 229 + 230 + /* 231 + * Count the total number of and the number of mapped sequential zones of a 232 + * target zoned device. 233 + */ 234 + static int dm_device_count_zones_cb(struct blk_zone *zone, 235 + unsigned int idx, void *data) 236 + { 237 + struct dm_device_zone_count *zc = data; 238 + 239 + if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) { 240 + zc->total_nr_seq_zones++; 241 + if (zone->start >= zc->start && 242 + zone->start < zc->start + zc->len) 243 + zc->target_nr_seq_zones++; 244 + } 245 + 246 + return 0; 247 + } 248 + 249 + static int dm_device_count_zones(struct dm_dev *dev, 250 + struct dm_device_zone_count *zc) 251 + { 252 + int ret; 253 + 254 + ret = blkdev_report_zones(dev->bdev, 0, BLK_ALL_ZONES, 255 + dm_device_count_zones_cb, zc); 256 + if (ret < 0) 257 + return ret; 258 + if (!ret) 259 + return -EIO; 260 + return 0; 261 + } 262 + 263 + struct dm_zone_resource_limits { 264 + unsigned int mapped_nr_seq_zones; 265 + struct queue_limits *lim; 266 + bool reliable_limits; 267 + }; 268 + 269 + static int device_get_zone_resource_limits(struct dm_target *ti, 270 + struct dm_dev *dev, sector_t start, 271 + sector_t len, void *data) 272 + { 273 + struct dm_zone_resource_limits *zlim = data; 274 + struct gendisk *disk = dev->bdev->bd_disk; 275 + unsigned int max_open_zones, max_active_zones; 276 + int ret; 277 + struct dm_device_zone_count zc = { 278 + .start = start, 279 + .len = len, 280 + }; 281 + 282 + /* 283 + * If the target is not the whole device, the device zone resources may 284 + * be shared between different targets. 
Check this by counting the 285 + * number of mapped sequential zones: if this number is smaller than the 286 + * total number of sequential zones of the target device, then resource 287 + * sharing may happen and the zone limits will not be reliable. 288 + */ 289 + ret = dm_device_count_zones(dev, &zc); 290 + if (ret) { 291 + DMERR("Count %s zones failed %d", disk->disk_name, ret); 292 + return ret; 293 + } 294 + 295 + /* 296 + * If the target does not map any sequential zones, then we do not need 297 + * any zone resource limits. 298 + */ 299 + if (!zc.target_nr_seq_zones) 300 + return 0; 301 + 302 + /* 303 + * If the target does not map all sequential zones, the limits 304 + * will not be reliable and we cannot use REQ_OP_ZONE_RESET_ALL. 305 + */ 306 + if (zc.target_nr_seq_zones < zc.total_nr_seq_zones) { 307 + zlim->reliable_limits = false; 308 + ti->zone_reset_all_supported = false; 309 + } 310 + 311 + /* 312 + * If the target maps less sequential zones than the limit values, then 313 + * we do not have limits for this target. 314 + */ 315 + max_active_zones = disk->queue->limits.max_active_zones; 316 + if (max_active_zones >= zc.target_nr_seq_zones) 317 + max_active_zones = 0; 318 + zlim->lim->max_active_zones = 319 + min_not_zero(max_active_zones, zlim->lim->max_active_zones); 320 + 321 + max_open_zones = disk->queue->limits.max_open_zones; 322 + if (max_open_zones >= zc.target_nr_seq_zones) 323 + max_open_zones = 0; 324 + zlim->lim->max_open_zones = 325 + min_not_zero(max_open_zones, zlim->lim->max_open_zones); 326 + 327 + /* 328 + * Also count the total number of sequential zones for the mapped 329 + * device so that when we are done inspecting all its targets, we are 330 + * able to check if the mapped device actually has any sequential zones. 
331 + */ 332 + zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones; 333 + 334 + return 0; 335 + } 336 + 214 337 int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q, 215 338 struct queue_limits *lim) 216 339 { 217 340 struct mapped_device *md = t->md; 218 341 struct gendisk *disk = md->disk; 219 - unsigned int nr_conv_zones = 0; 220 - int ret; 342 + struct dm_zone_resource_limits zlim = { 343 + .reliable_limits = true, 344 + .lim = lim, 345 + }; 221 346 222 347 /* 223 348 * Check if zone append is natively supported, and if not, set the ··· 347 240 lim->max_zone_append_sectors = 0; 348 241 } 349 242 350 - if (!get_capacity(md->disk)) 351 - return 0; 352 - 353 243 /* 354 - * Count conventional zones to check that the mapped device will indeed 355 - * have sequential write required zones. 244 + * Determine the max open and max active zone limits for the mapped 245 + * device by inspecting the zone resource limits and the zones mapped 246 + * by each target. 356 247 */ 357 - md->zone_revalidate_map = t; 358 - ret = dm_blk_report_zones(disk, 0, UINT_MAX, 359 - dm_check_zoned_cb, &nr_conv_zones); 360 - md->zone_revalidate_map = NULL; 361 - if (ret < 0) { 362 - DMERR("Check zoned failed %d", ret); 363 - return ret; 248 + for (unsigned int i = 0; i < t->num_targets; i++) { 249 + struct dm_target *ti = dm_table_get_target(t, i); 250 + 251 + /* 252 + * Assume that the target can accept REQ_OP_ZONE_RESET_ALL. 253 + * device_get_zone_resource_limits() may adjust this if one of 254 + * the device used by the target does not have all its 255 + * sequential write required zones mapped. 
256 + */ 257 + ti->zone_reset_all_supported = true; 258 + 259 + if (!ti->type->iterate_devices || 260 + ti->type->iterate_devices(ti, 261 + device_get_zone_resource_limits, &zlim)) { 262 + DMERR("Could not determine %s zone resource limits", 263 + disk->disk_name); 264 + return -ENODEV; 265 + } 364 266 } 365 267 366 268 /* 367 - * If we only have conventional zones, expose the mapped device as 368 - * a regular device. 269 + * If we only have conventional zones mapped, expose the mapped device 270 + + as a regular device. 369 271 */ 370 - if (nr_conv_zones >= ret) { 272 + if (!zlim.mapped_nr_seq_zones) { 371 273 lim->max_open_zones = 0; 372 274 lim->max_active_zones = 0; 373 - lim->zoned = false; 275 + lim->max_zone_append_sectors = 0; 276 + lim->zone_write_granularity = 0; 277 + lim->chunk_sectors = 0; 278 + lim->features &= ~BLK_FEAT_ZONED; 374 279 clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags); 280 + md->nr_zones = 0; 375 281 disk->nr_zones = 0; 376 282 return 0; 377 283 } 378 284 379 - if (!md->disk->nr_zones) { 380 - DMINFO("%s using %s zone append", 381 - md->disk->disk_name, 382 - queue_emulates_zone_append(q) ? "emulated" : "native"); 383 - } 285 + /* 286 + * Warn once (when the capacity is not yet set) if the mapped device is 287 + * partially using zone resources of the target devices as that leads to 288 + * unreliable limits, i.e. if another mapped device uses the same 289 + * underlying devices, we cannot enforce zone limits to guarantee that 290 + * writing will not lead to errors. Note that we really should return 291 + * an error for such case but there is no easy way to find out if 292 + * another mapped device uses the same underlying zoned devices. 
293 + */ 294 + if (!get_capacity(disk) && !zlim.reliable_limits) 295 + DMWARN("%s zone resource limits may be unreliable", 296 + disk->disk_name); 384 297 385 - ret = dm_revalidate_zones(md, t); 386 - if (ret < 0) 387 - return ret; 388 - 389 - if (!static_key_enabled(&zoned_enabled.key)) 298 + if (lim->features & BLK_FEAT_ZONED && 299 + !static_key_enabled(&zoned_enabled.key)) 390 300 static_branch_enable(&zoned_enabled); 391 301 return 0; 392 302 } ··· 429 305 } 430 306 431 307 return; 308 + } 309 + 310 + static int dm_zone_need_reset_cb(struct blk_zone *zone, unsigned int idx, 311 + void *data) 312 + { 313 + /* 314 + * For an all-zones reset, ignore conventional, empty, read-only 315 + * and offline zones. 316 + */ 317 + switch (zone->cond) { 318 + case BLK_ZONE_COND_NOT_WP: 319 + case BLK_ZONE_COND_EMPTY: 320 + case BLK_ZONE_COND_READONLY: 321 + case BLK_ZONE_COND_OFFLINE: 322 + return 0; 323 + default: 324 + set_bit(idx, (unsigned long *)data); 325 + return 0; 326 + } 327 + } 328 + 329 + int dm_zone_get_reset_bitmap(struct mapped_device *md, struct dm_table *t, 330 + sector_t sector, unsigned int nr_zones, 331 + unsigned long *need_reset) 332 + { 333 + int ret; 334 + 335 + ret = dm_blk_do_report_zones(md, t, sector, nr_zones, 336 + dm_zone_need_reset_cb, need_reset); 337 + if (ret != nr_zones) { 338 + DMERR("Get %s zone reset bitmap failed\n", 339 + md->disk->disk_name); 340 + return -EIO; 341 + } 342 + 343 + return 0; 432 344 }
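The limit stacking performed by device_get_zone_resource_limits() in the dm-zone.c changes above can be sketched in userspace C. This is a minimal sketch under stated assumptions, not the kernel code: min_not_zero_u() is a stand-in for the kernel's min_not_zero() macro, and stack_zone_limit() is a hypothetical helper name. It shows one device limit being folded into the running limit of the mapped device, with 0 meaning "no limit".

```c
/* Userspace stand-in (assumption) for the kernel's min_not_zero() macro. */
static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * Hypothetical helper: fold one device's max open/active zone limit into
 * the running limit "cur" of the mapped device. A device limit that is not
 * smaller than the number of sequential zones mapped by the target can
 * never be reached, so it is treated as 0 ("no limit") before stacking.
 */
static unsigned int stack_zone_limit(unsigned int cur,
				     unsigned int dev_limit,
				     unsigned int target_nr_seq_zones)
{
	if (dev_limit >= target_nr_seq_zones)
		dev_limit = 0;
	return min_not_zero_u(dev_limit, cur);
}
```

Starting from cur = 0, each underlying device can only tighten the result, and a device that imposes no effective limit leaves it unchanged.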

+1 -1

drivers/md/dm-zoned-target.c

···
 	limits->max_sectors = chunk_sectors;
 
 	/* We are exposing a drive-managed zoned block device */
-	limits->zoned = false;
+	limits->features &= ~BLK_FEAT_ZONED;
 }
 
 /*
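The hunk above replaces a boolean field with a bit in a feature mask. A minimal sketch of that pattern, assuming made-up flag values (FEAT_WRITE_CACHE and FEAT_ZONED are illustrative names, not the kernel's BLK_FEAT_* definitions) and a hypothetical limits_sketch structure:

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; the kernel defines its own. */
#define FEAT_WRITE_CACHE	(1u << 0)
#define FEAT_ZONED		(1u << 1)

/* Hypothetical stand-in for the queue limits structure. */
struct limits_sketch {
	unsigned int features;
};

/* Clear the zoned feature without disturbing any other feature bit. */
static void clear_zoned(struct limits_sketch *lim)
{
	lim->features &= ~FEAT_ZONED;
}

/* Report whether the zoned feature bit is currently set. */
static bool is_zoned(const struct limits_sketch *lim)
{
	return (lim->features & FEAT_ZONED) != 0;
}
```

Clearing with `&= ~FLAG` leaves unrelated features untouched, which a plain assignment to the mask would not.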

+147 -27

drivers/md/dm.c

··· 1188 1188 return len; 1189 1189 return min_t(sector_t, len, 1190 1190 min(max_sectors ? : queue_max_sectors(ti->table->md->queue), 1191 - blk_chunk_sectors_left(target_offset, max_granularity))); 1191 + blk_boundary_sectors_left(target_offset, max_granularity))); 1192 1192 } 1193 1193 1194 1194 static inline sector_t max_io_len(struct dm_target *ti, sector_t sector) ··· 1598 1598 1599 1599 static bool is_abnormal_io(struct bio *bio) 1600 1600 { 1601 - enum req_op op = bio_op(bio); 1602 - 1603 - if (op != REQ_OP_READ && op != REQ_OP_WRITE && op != REQ_OP_FLUSH) { 1604 - switch (op) { 1605 - case REQ_OP_DISCARD: 1606 - case REQ_OP_SECURE_ERASE: 1607 - case REQ_OP_WRITE_ZEROES: 1608 - return true; 1609 - default: 1610 - break; 1611 - } 1601 + switch (bio_op(bio)) { 1602 + case REQ_OP_READ: 1603 + case REQ_OP_WRITE: 1604 + case REQ_OP_FLUSH: 1605 + return false; 1606 + case REQ_OP_DISCARD: 1607 + case REQ_OP_SECURE_ERASE: 1608 + case REQ_OP_WRITE_ZEROES: 1609 + case REQ_OP_ZONE_RESET_ALL: 1610 + return true; 1611 + default: 1612 + return false; 1612 1613 } 1613 - 1614 - return false; 1615 1614 } 1616 1615 1617 1616 static blk_status_t __process_abnormal_io(struct clone_info *ci, ··· 1775 1776 { 1776 1777 return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0); 1777 1778 } 1779 + 1780 + static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci, 1781 + struct dm_target *ti) 1782 + { 1783 + struct bio_list blist = BIO_EMPTY_LIST; 1784 + struct mapped_device *md = ci->io->md; 1785 + unsigned int zone_sectors = md->disk->queue->limits.chunk_sectors; 1786 + unsigned long *need_reset; 1787 + unsigned int i, nr_zones, nr_reset; 1788 + unsigned int num_bios = 0; 1789 + blk_status_t sts = BLK_STS_OK; 1790 + sector_t sector = ti->begin; 1791 + struct bio *clone; 1792 + int ret; 1793 + 1794 + nr_zones = ti->len >> ilog2(zone_sectors); 1795 + need_reset = bitmap_zalloc(nr_zones, GFP_NOIO); 1796 + if (!need_reset) 1797 + return BLK_STS_RESOURCE; 1798 + 1799 
+ ret = dm_zone_get_reset_bitmap(md, ci->map, ti->begin, 1800 + nr_zones, need_reset); 1801 + if (ret) { 1802 + sts = BLK_STS_IOERR; 1803 + goto free_bitmap; 1804 + } 1805 + 1806 + /* If we have no zone to reset, we are done. */ 1807 + nr_reset = bitmap_weight(need_reset, nr_zones); 1808 + if (!nr_reset) 1809 + goto free_bitmap; 1810 + 1811 + atomic_add(nr_zones, &ci->io->io_count); 1812 + 1813 + for (i = 0; i < nr_zones; i++) { 1814 + 1815 + if (!test_bit(i, need_reset)) { 1816 + sector += zone_sectors; 1817 + continue; 1818 + } 1819 + 1820 + if (bio_list_empty(&blist)) { 1821 + /* This may take a while, so be nice to others */ 1822 + if (num_bios) 1823 + cond_resched(); 1824 + 1825 + /* 1826 + * We may need to reset thousands of zones, so let's 1827 + * not go crazy with the clone allocation. 1828 + */ 1829 + alloc_multiple_bios(&blist, ci, ti, min(nr_reset, 32), 1830 + NULL, GFP_NOIO); 1831 + } 1832 + 1833 + /* Get a clone and change it to a regular reset operation. */ 1834 + clone = bio_list_pop(&blist); 1835 + clone->bi_opf &= ~REQ_OP_MASK; 1836 + clone->bi_opf |= REQ_OP_ZONE_RESET | REQ_SYNC; 1837 + clone->bi_iter.bi_sector = sector; 1838 + clone->bi_iter.bi_size = 0; 1839 + __map_bio(clone); 1840 + 1841 + sector += zone_sectors; 1842 + num_bios++; 1843 + nr_reset--; 1844 + } 1845 + 1846 + WARN_ON_ONCE(!bio_list_empty(&blist)); 1847 + atomic_sub(nr_zones - num_bios, &ci->io->io_count); 1848 + ci->sector_count = 0; 1849 + 1850 + free_bitmap: 1851 + bitmap_free(need_reset); 1852 + 1853 + return sts; 1854 + } 1855 + 1856 + static void __send_zone_reset_all_native(struct clone_info *ci, 1857 + struct dm_target *ti) 1858 + { 1859 + unsigned int bios; 1860 + 1861 + atomic_add(1, &ci->io->io_count); 1862 + bios = __send_duplicate_bios(ci, ti, 1, NULL, GFP_NOIO); 1863 + atomic_sub(1 - bios, &ci->io->io_count); 1864 + 1865 + ci->sector_count = 0; 1866 + } 1867 + 1868 + static blk_status_t __send_zone_reset_all(struct clone_info *ci) 1869 + { 1870 + struct dm_table *t 
= ci->map; 1871 + blk_status_t sts = BLK_STS_OK; 1872 + 1873 + for (unsigned int i = 0; i < t->num_targets; i++) { 1874 + struct dm_target *ti = dm_table_get_target(t, i); 1875 + 1876 + if (ti->zone_reset_all_supported) { 1877 + __send_zone_reset_all_native(ci, ti); 1878 + continue; 1879 + } 1880 + 1881 + sts = __send_zone_reset_all_emulated(ci, ti); 1882 + if (sts != BLK_STS_OK) 1883 + break; 1884 + } 1885 + 1886 + /* Release the reference that alloc_io() took for submission. */ 1887 + atomic_sub(1, &ci->io->io_count); 1888 + 1889 + return sts; 1890 + } 1891 + 1778 1892 #else 1779 1893 static inline bool dm_zone_bio_needs_split(struct mapped_device *md, 1780 1894 struct bio *bio) ··· 1897 1785 static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio) 1898 1786 { 1899 1787 return false; 1788 + } 1789 + static blk_status_t __send_zone_reset_all(struct clone_info *ci) 1790 + { 1791 + return BLK_STS_NOTSUPP; 1900 1792 } 1901 1793 #endif 1902 1794 ··· 1915 1799 blk_status_t error = BLK_STS_OK; 1916 1800 bool is_abnormal, need_split; 1917 1801 1918 - need_split = is_abnormal = is_abnormal_io(bio); 1919 - if (static_branch_unlikely(&zoned_enabled)) 1920 - need_split = is_abnormal || dm_zone_bio_needs_split(md, bio); 1802 + is_abnormal = is_abnormal_io(bio); 1803 + if (static_branch_unlikely(&zoned_enabled)) { 1804 + /* Special case REQ_OP_ZONE_RESET_ALL as it cannot be split. 
*/ 1805 + need_split = (bio_op(bio) != REQ_OP_ZONE_RESET_ALL) && 1806 + (is_abnormal || dm_zone_bio_needs_split(md, bio)); 1807 + } else { 1808 + need_split = is_abnormal; 1809 + } 1921 1810 1922 1811 if (unlikely(need_split)) { 1923 1812 /* ··· 1960 1839 if (bio->bi_opf & REQ_PREFLUSH) { 1961 1840 __send_empty_flush(&ci); 1962 1841 /* dm_io_complete submits any data associated with flush */ 1842 + goto out; 1843 + } 1844 + 1845 + if (static_branch_unlikely(&zoned_enabled) && 1846 + (bio_op(bio) == REQ_OP_ZONE_RESET_ALL)) { 1847 + error = __send_zone_reset_all(&ci); 1963 1848 goto out; 1964 1849 } 1965 1850 ··· 2513 2386 struct table_device *td; 2514 2387 int r; 2515 2388 2516 - switch (type) { 2517 - case DM_TYPE_REQUEST_BASED: 2389 + WARN_ON_ONCE(type == DM_TYPE_NONE); 2390 + 2391 + if (type == DM_TYPE_REQUEST_BASED) { 2518 2392 md->disk->fops = &dm_rq_blk_dops; 2519 2393 r = dm_mq_init_request_queue(md, t); 2520 2394 if (r) { 2521 2395 DMERR("Cannot initialize queue for request-based dm mapped device"); 2522 2396 return r; 2523 2397 } 2524 - break; 2525 - case DM_TYPE_BIO_BASED: 2526 - case DM_TYPE_DAX_BIO_BASED: 2527 - blk_queue_flag_set(QUEUE_FLAG_IO_STAT, md->queue); 2528 - break; 2529 - case DM_TYPE_NONE: 2530 - WARN_ON_ONCE(true); 2531 - break; 2532 2398 } 2533 2399 2534 2400 r = dm_calculate_queue_limits(t, &limits);
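The emulated reset-all path in the dm.c changes above first records which zones need a reset and then issues one regular zone reset per recorded zone. The selection step can be sketched in userspace C as follows. This is a sketch under assumptions: the zone_cond_sketch values and both helper names are hypothetical, and the real code uses a bitmap plus cloned bios rather than an output array.

```c
#include <stddef.h>

/* Hypothetical zone conditions, for illustration only. */
enum zone_cond_sketch {
	ZC_NOT_WP,	/* conventional zone */
	ZC_EMPTY,
	ZC_OPEN,
	ZC_FULL,
	ZC_READONLY,
	ZC_OFFLINE,
};

/* Conventional, empty, read-only and offline zones are skipped. */
static int zone_needs_reset(enum zone_cond_sketch cond)
{
	switch (cond) {
	case ZC_NOT_WP:
	case ZC_EMPTY:
	case ZC_READONLY:
	case ZC_OFFLINE:
		return 0;
	default:
		return 1;
	}
}

/*
 * Walk nr_zones zones of zone_sectors sectors each, starting at "start",
 * and store the start sector of every zone that needs a reset in "out"
 * (which must have room for nr_zones entries). Returns the number of
 * sectors stored.
 */
static size_t collect_resets(const enum zone_cond_sketch *conds,
			     size_t nr_zones,
			     unsigned long long start,
			     unsigned long long zone_sectors,
			     unsigned long long *out)
{
	unsigned long long sector = start;
	size_t n = 0;

	for (size_t i = 0; i < nr_zones; i++, sector += zone_sectors) {
		if (zone_needs_reset(conds[i]))
			out[n++] = sector;
	}
	return n;
}
```

The sector advances for every zone, including skipped ones, which mirrors how the loop in the diff adds zone_sectors before continuing.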

+4

drivers/md/dm.h

···
  */
 int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 			      struct queue_limits *lim);
+int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
 void dm_zone_endio(struct dm_io *io, struct bio *clone);
 #ifdef CONFIG_BLK_DEV_ZONED
 int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
 bool dm_is_zone_write(struct mapped_device *md, struct bio *bio);
 int dm_zone_map_bio(struct dm_target_io *io);
+int dm_zone_get_reset_bitmap(struct mapped_device *md, struct dm_table *t,
+			     sector_t sector, unsigned int nr_zones,
+			     unsigned long *need_reset);
 #else
 #define dm_blk_report_zones	NULL
 static inline bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)

+3 -3

drivers/md/md-bitmap.c

···
 	struct block_device *bdev;
 	struct mddev *mddev = bitmap->mddev;
 	struct bitmap_storage *store = &bitmap->storage;
+	unsigned int bitmap_limit = (bitmap->storage.file_pages - pg_index) <<
+		PAGE_SHIFT;
 	loff_t sboff, offset = mddev->bitmap_info.offset;
 	sector_t ps = pg_index * PAGE_SIZE / SECTOR_SIZE;
 	unsigned int size = PAGE_SIZE;
···
 		if (size == 0)
 			/* bitmap runs in to data */
 			return -EINVAL;
-	} else {
-		/* DATA METADATA BITMAP - no problems */
 	}
 
-	md_super_write(mddev, rdev, sboff + ps, (int) size, page);
+	md_super_write(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), page);
 	return 0;
 }
 
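The md-bitmap.c change above clamps the write length to bitmap_limit, the number of bytes left in the bitmap from page pg_index onward. The arithmetic can be sketched as below, presumably so that a write never extends past the pages that remain. The 4 KiB page size and the helper name clamp_bitmap_write() are assumptions made for this illustration only.

```c
/* Assumed page size for the sketch (4 KiB); the kernel uses PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT	12u

/*
 * Hypothetical helper: limit a write of "size" bytes that starts at bitmap
 * page "pg_index" to the bytes remaining in a bitmap of "file_pages" pages.
 * The caller is assumed to guarantee pg_index < file_pages.
 */
static unsigned int clamp_bitmap_write(unsigned int size,
				       unsigned long file_pages,
				       unsigned long pg_index)
{
	unsigned int limit =
		(unsigned int)((file_pages - pg_index) << SKETCH_PAGE_SHIFT);

	return size < limit ? size : limit;
}
```

For a four-page bitmap, a two-page write starting at the last page (index 3) is cut to one page, while a write that already fits is left unchanged.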

+1 -1

drivers/md/md-cluster.c

···
 	return err;
 }
 
-static struct md_cluster_operations cluster_ops = {
+static const struct md_cluster_operations cluster_ops = {
 	.join = join,
 	.leave = leave,
 	.slot_number = slot_number,

+332 -308

drivers/md/md.c

··· 69 69 #include "md-bitmap.h" 70 70 #include "md-cluster.h" 71 71 72 + static const char *action_name[NR_SYNC_ACTIONS] = { 73 + [ACTION_RESYNC] = "resync", 74 + [ACTION_RECOVER] = "recover", 75 + [ACTION_CHECK] = "check", 76 + [ACTION_REPAIR] = "repair", 77 + [ACTION_RESHAPE] = "reshape", 78 + [ACTION_FROZEN] = "frozen", 79 + [ACTION_IDLE] = "idle", 80 + }; 81 + 72 82 /* pers_list is a list of registered personalities protected by pers_lock. */ 73 83 static LIST_HEAD(pers_list); 74 84 static DEFINE_SPINLOCK(pers_lock); 75 85 76 86 static const struct kobj_type md_ktype; 77 87 78 - struct md_cluster_operations *md_cluster_ops; 88 + const struct md_cluster_operations *md_cluster_ops; 79 89 EXPORT_SYMBOL(md_cluster_ops); 80 90 static struct module *md_cluster_mod; 81 91 ··· 489 479 */ 490 480 WRITE_ONCE(mddev->suspended, mddev->suspended + 1); 491 481 492 - del_timer_sync(&mddev->safemode_timer); 493 482 /* restrict memory reclaim I/O during raid array is suspend */ 494 483 mddev->noio_flag = memalloc_noio_save(); 495 484 ··· 559 550 560 551 rdev_dec_pending(rdev, mddev); 561 552 562 - if (atomic_dec_and_test(&mddev->flush_pending)) { 563 - /* The pair is percpu_ref_get() from md_flush_request() */ 564 - percpu_ref_put(&mddev->active_io); 565 - 553 + if (atomic_dec_and_test(&mddev->flush_pending)) 566 554 /* The pre-request flush has finished */ 567 555 queue_work(md_wq, &mddev->flush_work); 568 - } 569 556 } 570 557 571 558 static void md_submit_flush_data(struct work_struct *ws); ··· 592 587 rcu_read_lock(); 593 588 } 594 589 rcu_read_unlock(); 595 - if (atomic_dec_and_test(&mddev->flush_pending)) { 596 - /* The pair is percpu_ref_get() from md_flush_request() */ 597 - percpu_ref_put(&mddev->active_io); 598 - 590 + if (atomic_dec_and_test(&mddev->flush_pending)) 599 591 queue_work(md_wq, &mddev->flush_work); 600 - } 601 592 } 602 593 603 594 static void md_submit_flush_data(struct work_struct *ws) ··· 618 617 bio_endio(bio); 619 618 } else { 620 619 bio->bi_opf 
&= ~REQ_PREFLUSH; 621 - md_handle_request(mddev, bio); 620 + 621 + /* 622 + * make_requst() will never return error here, it only 623 + * returns error in raid5_make_request() by dm-raid. 624 + * Since dm always splits data and flush operation into 625 + * two separate io, io size of flush submitted by dm 626 + * always is 0, make_request() will not be called here. 627 + */ 628 + if (WARN_ON_ONCE(!mddev->pers->make_request(mddev, bio))) 629 + bio_io_error(bio); 622 630 } 631 + 632 + /* The pair is percpu_ref_get() from md_flush_request() */ 633 + percpu_ref_put(&mddev->active_io); 623 634 } 624 635 625 636 /* ··· 667 654 WARN_ON(percpu_ref_is_zero(&mddev->active_io)); 668 655 percpu_ref_get(&mddev->active_io); 669 656 mddev->flush_bio = bio; 670 - bio = NULL; 671 - } 672 - spin_unlock_irq(&mddev->lock); 673 - 674 - if (!bio) { 657 + spin_unlock_irq(&mddev->lock); 675 658 INIT_WORK(&mddev->flush_work, submit_flushes); 676 659 queue_work(md_wq, &mddev->flush_work); 677 - } else { 678 - /* flush was performed for some other bio while we waited. */ 679 - if (bio->bi_iter.bi_size == 0) 680 - /* an empty barrier - all done */ 681 - bio_endio(bio); 682 - else { 683 - bio->bi_opf &= ~REQ_PREFLUSH; 684 - return false; 685 - } 660 + return true; 686 661 } 687 - return true; 662 + 663 + /* flush was performed for some other bio while we waited. 
*/ 664 + spin_unlock_irq(&mddev->lock); 665 + if (bio->bi_iter.bi_size == 0) { 666 + /* pure flush without data - all done */ 667 + bio_endio(bio); 668 + return true; 669 + } 670 + 671 + bio->bi_opf &= ~REQ_PREFLUSH; 672 + return false; 688 673 } 689 674 EXPORT_SYMBOL(md_flush_request); 690 675 ··· 753 742 754 743 mutex_init(&mddev->open_mutex); 755 744 mutex_init(&mddev->reconfig_mutex); 756 - mutex_init(&mddev->sync_mutex); 757 745 mutex_init(&mddev->suspend_mutex); 758 746 mutex_init(&mddev->bitmap_info.mutex); 759 747 INIT_LIST_HEAD(&mddev->disks); ··· 768 758 init_waitqueue_head(&mddev->recovery_wait); 769 759 mddev->reshape_position = MaxSector; 770 760 mddev->reshape_backwards = 0; 771 - mddev->last_sync_action = "none"; 761 + mddev->last_sync_action = ACTION_IDLE; 772 762 mddev->resync_min = 0; 773 763 mddev->resync_max = MaxSector; 774 764 mddev->level = LEVEL_NONE; ··· 2420 2410 */ 2421 2411 int md_integrity_register(struct mddev *mddev) 2422 2412 { 2423 - struct md_rdev *rdev, *reference = NULL; 2424 - 2425 2413 if (list_empty(&mddev->disks)) 2426 2414 return 0; /* nothing to do */ 2427 - if (mddev_is_dm(mddev) || blk_get_integrity(mddev->gendisk)) 2428 - return 0; /* shouldn't register, or already is */ 2429 - rdev_for_each(rdev, mddev) { 2430 - /* skip spares and non-functional disks */ 2431 - if (test_bit(Faulty, &rdev->flags)) 2432 - continue; 2433 - if (rdev->raid_disk < 0) 2434 - continue; 2435 - if (!reference) { 2436 - /* Use the first rdev as the reference */ 2437 - reference = rdev; 2438 - continue; 2439 - } 2440 - /* does this rdev's profile match the reference profile? */ 2441 - if (blk_integrity_compare(reference->bdev->bd_disk, 2442 - rdev->bdev->bd_disk) < 0) 2443 - return -EINVAL; 2444 - } 2445 - if (!reference || !bdev_get_integrity(reference->bdev)) 2446 - return 0; 2447 - /* 2448 - * All component devices are integrity capable and have matching 2449 - * profiles, register the common profile for the md device. 
2450 - */ 2451 - blk_integrity_register(mddev->gendisk, 2452 - bdev_get_integrity(reference->bdev)); 2415 + if (mddev_is_dm(mddev) || !blk_get_integrity(mddev->gendisk)) 2416 + return 0; /* shouldn't register */ 2453 2417 2454 2418 pr_debug("md: data integrity enabled on %s\n", mdname(mddev)); 2455 2419 if (bioset_integrity_create(&mddev->bio_set, BIO_POOL_SIZE) || ··· 2442 2458 return 0; 2443 2459 } 2444 2460 EXPORT_SYMBOL(md_integrity_register); 2445 - 2446 - /* 2447 - * Attempt to add an rdev, but only if it is consistent with the current 2448 - * integrity profile 2449 - */ 2450 - int md_integrity_add_rdev(struct md_rdev *rdev, struct mddev *mddev) 2451 - { 2452 - struct blk_integrity *bi_mddev; 2453 - 2454 - if (mddev_is_dm(mddev)) 2455 - return 0; 2456 - 2457 - bi_mddev = blk_get_integrity(mddev->gendisk); 2458 - 2459 - if (!bi_mddev) /* nothing to do */ 2460 - return 0; 2461 - 2462 - if (blk_integrity_compare(mddev->gendisk, rdev->bdev->bd_disk) != 0) { 2463 - pr_err("%s: incompatible integrity profile for %pg\n", 2464 - mdname(mddev), rdev->bdev); 2465 - return -ENXIO; 2466 - } 2467 - 2468 - return 0; 2469 - } 2470 - EXPORT_SYMBOL(md_integrity_add_rdev); 2471 2461 2472 2462 static bool rdev_read_only(struct md_rdev *rdev) 2473 2463 { ··· 4825 4867 static struct md_sysfs_entry md_metadata = 4826 4868 __ATTR_PREALLOC(metadata_version, S_IRUGO|S_IWUSR, metadata_show, metadata_store); 4827 4869 4870 + enum sync_action md_sync_action(struct mddev *mddev) 4871 + { 4872 + unsigned long recovery = mddev->recovery; 4873 + 4874 + /* 4875 + * frozen has the highest priority, means running sync_thread will be 4876 + * stopped immediately, and no new sync_thread can start. 4877 + */ 4878 + if (test_bit(MD_RECOVERY_FROZEN, &recovery)) 4879 + return ACTION_FROZEN; 4880 + 4881 + /* 4882 + * read-only array can't register sync_thread, and it can only 4883 + * add/remove spares. 
4884 + */ 4885 + if (!md_is_rdwr(mddev)) 4886 + return ACTION_IDLE; 4887 + 4888 + /* 4889 + * idle means no sync_thread is running, and no new sync_thread is 4890 + * requested. 4891 + */ 4892 + if (!test_bit(MD_RECOVERY_RUNNING, &recovery) && 4893 + !test_bit(MD_RECOVERY_NEEDED, &recovery)) 4894 + return ACTION_IDLE; 4895 + 4896 + if (test_bit(MD_RECOVERY_RESHAPE, &recovery) || 4897 + mddev->reshape_position != MaxSector) 4898 + return ACTION_RESHAPE; 4899 + 4900 + if (test_bit(MD_RECOVERY_RECOVER, &recovery)) 4901 + return ACTION_RECOVER; 4902 + 4903 + if (test_bit(MD_RECOVERY_SYNC, &recovery)) { 4904 + /* 4905 + * MD_RECOVERY_CHECK must be paired with 4906 + * MD_RECOVERY_REQUESTED. 4907 + */ 4908 + if (test_bit(MD_RECOVERY_CHECK, &recovery)) 4909 + return ACTION_CHECK; 4910 + if (test_bit(MD_RECOVERY_REQUESTED, &recovery)) 4911 + return ACTION_REPAIR; 4912 + return ACTION_RESYNC; 4913 + } 4914 + 4915 + /* 4916 + * MD_RECOVERY_NEEDED or MD_RECOVERY_RUNNING is set, however, no 4917 + * sync_action is specified. 
4918 + */ 4919 + return ACTION_IDLE; 4920 + } 4921 + 4922 + enum sync_action md_sync_action_by_name(const char *page) 4923 + { 4924 + enum sync_action action; 4925 + 4926 + for (action = 0; action < NR_SYNC_ACTIONS; ++action) { 4927 + if (cmd_match(page, action_name[action])) 4928 + return action; 4929 + } 4930 + 4931 + return NR_SYNC_ACTIONS; 4932 + } 4933 + 4934 + const char *md_sync_action_name(enum sync_action action) 4935 + { 4936 + return action_name[action]; 4937 + } 4938 + 4828 4939 static ssize_t 4829 4940 action_show(struct mddev *mddev, char *page) 4830 4941 { 4831 - char *type = "idle"; 4832 - unsigned long recovery = mddev->recovery; 4833 - if (test_bit(MD_RECOVERY_FROZEN, &recovery)) 4834 - type = "frozen"; 4835 - else if (test_bit(MD_RECOVERY_RUNNING, &recovery) || 4836 - (md_is_rdwr(mddev) && test_bit(MD_RECOVERY_NEEDED, &recovery))) { 4837 - if (test_bit(MD_RECOVERY_RESHAPE, &recovery)) 4838 - type = "reshape"; 4839 - else if (test_bit(MD_RECOVERY_SYNC, &recovery)) { 4840 - if (!test_bit(MD_RECOVERY_REQUESTED, &recovery)) 4841 - type = "resync"; 4842 - else if (test_bit(MD_RECOVERY_CHECK, &recovery)) 4843 - type = "check"; 4844 - else 4845 - type = "repair"; 4846 - } else if (test_bit(MD_RECOVERY_RECOVER, &recovery)) 4847 - type = "recover"; 4848 - else if (mddev->reshape_position != MaxSector) 4849 - type = "reshape"; 4850 - } 4851 - return sprintf(page, "%s\n", type); 4942 + enum sync_action action = md_sync_action(mddev); 4943 + 4944 + return sprintf(page, "%s\n", md_sync_action_name(action)); 4852 4945 } 4853 4946 4854 4947 /** ··· 4908 4899 * @locked: if set, reconfig_mutex will still be held after this function 4909 4900 * return; if not set, reconfig_mutex will be released after this 4910 4901 * function return. 4911 - * @check_seq: if set, only wait for curent running sync_thread to stop, noted 4912 - * that new sync_thread can still start. 
4913 4902 */ 4914 - static void stop_sync_thread(struct mddev *mddev, bool locked, bool check_seq) 4903 + static void stop_sync_thread(struct mddev *mddev, bool locked) 4915 4904 { 4916 - int sync_seq; 4917 - 4918 - if (check_seq) 4919 - sync_seq = atomic_read(&mddev->sync_seq); 4905 + int sync_seq = atomic_read(&mddev->sync_seq); 4920 4906 4921 4907 if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { 4922 4908 if (!locked) ··· 4932 4928 4933 4929 wait_event(resync_wait, 4934 4930 !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || 4935 - (check_seq && sync_seq != atomic_read(&mddev->sync_seq))); 4931 + (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery) && 4932 + sync_seq != atomic_read(&mddev->sync_seq))); 4936 4933 4937 4934 if (locked) 4938 4935 mddev_lock_nointr(mddev); ··· 4944 4939 lockdep_assert_held(&mddev->reconfig_mutex); 4945 4940 4946 4941 clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 4947 - stop_sync_thread(mddev, true, true); 4942 + stop_sync_thread(mddev, true); 4948 4943 } 4949 4944 EXPORT_SYMBOL_GPL(md_idle_sync_thread); 4950 4945 ··· 4953 4948 lockdep_assert_held(&mddev->reconfig_mutex); 4954 4949 4955 4950 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 4956 - stop_sync_thread(mddev, true, false); 4951 + stop_sync_thread(mddev, true); 4957 4952 } 4958 4953 EXPORT_SYMBOL_GPL(md_frozen_sync_thread); 4959 4954 ··· 4968 4963 } 4969 4964 EXPORT_SYMBOL_GPL(md_unfrozen_sync_thread); 4970 4965 4971 - static void idle_sync_thread(struct mddev *mddev) 4966 + static int mddev_start_reshape(struct mddev *mddev) 4972 4967 { 4973 - mutex_lock(&mddev->sync_mutex); 4974 - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 4968 + int ret; 4975 4969 4976 - if (mddev_lock(mddev)) { 4977 - mutex_unlock(&mddev->sync_mutex); 4978 - return; 4970 + if (mddev->pers->start_reshape == NULL) 4971 + return -EINVAL; 4972 + 4973 + if (mddev->reshape_position == MaxSector || 4974 + mddev->pers->check_reshape == NULL || 4975 + mddev->pers->check_reshape(mddev)) { 4976 + 
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 4977 + ret = mddev->pers->start_reshape(mddev); 4978 + if (ret) 4979 + return ret; 4980 + } else { 4981 + /* 4982 + * If reshape is still in progress, and md_check_recovery() can 4983 + * continue to reshape, don't restart reshape because data can 4984 + * be corrupted for raid456. 4985 + */ 4986 + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 4979 4987 } 4980 4988 4981 - stop_sync_thread(mddev, false, true); 4982 - mutex_unlock(&mddev->sync_mutex); 4983 - } 4984 - 4985 - static void frozen_sync_thread(struct mddev *mddev) 4986 - { 4987 - mutex_lock(&mddev->sync_mutex); 4988 - set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 4989 - 4990 - if (mddev_lock(mddev)) { 4991 - mutex_unlock(&mddev->sync_mutex); 4992 - return; 4993 - } 4994 - 4995 - stop_sync_thread(mddev, false, false); 4996 - mutex_unlock(&mddev->sync_mutex); 4989 + sysfs_notify_dirent_safe(mddev->sysfs_degraded); 4990 + return 0; 4997 4991 } 4998 4992 4999 4993 static ssize_t 5000 4994 action_store(struct mddev *mddev, const char *page, size_t len) 5001 4995 { 4996 + int ret; 4997 + enum sync_action action; 4998 + 5002 4999 if (!mddev->pers || !mddev->pers->sync_request) 5003 5000 return -EINVAL; 5004 5001 5002 + retry: 5003 + if (work_busy(&mddev->sync_work)) 5004 + flush_work(&mddev->sync_work); 5005 5005 5006 - if (cmd_match(page, "idle")) 5007 - idle_sync_thread(mddev); 5008 - else if (cmd_match(page, "frozen")) 5009 - frozen_sync_thread(mddev); 5010 - else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 5011 - return -EBUSY; 5012 - else if (cmd_match(page, "resync")) 5013 - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5014 - else if (cmd_match(page, "recover")) { 5015 - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5016 - set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 5017 - } else if (cmd_match(page, "reshape")) { 5018 - int err; 5019 - if (mddev->pers->start_reshape == NULL) 5020 - return -EINVAL; 5021 - err = mddev_lock(mddev); 5022 - 
if (!err) { 5023 - if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { 5024 - err = -EBUSY; 5025 - } else if (mddev->reshape_position == MaxSector || 5026 - mddev->pers->check_reshape == NULL || 5027 - mddev->pers->check_reshape(mddev)) { 5028 - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5029 - err = mddev->pers->start_reshape(mddev); 5030 - } else { 5031 - /* 5032 - * If reshape is still in progress, and 5033 - * md_check_recovery() can continue to reshape, 5034 - * don't restart reshape because data can be 5035 - * corrupted for raid456. 5036 - */ 5037 - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5038 - } 5039 - mddev_unlock(mddev); 5040 - } 5041 - if (err) 5042 - return err; 5043 - sysfs_notify_dirent_safe(mddev->sysfs_degraded); 5044 - } else { 5045 - if (cmd_match(page, "check")) 5046 - set_bit(MD_RECOVERY_CHECK, &mddev->recovery); 5047 - else if (!cmd_match(page, "repair")) 5048 - return -EINVAL; 5049 - clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5050 - set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); 5051 - set_bit(MD_RECOVERY_SYNC, &mddev->recovery); 5006 + ret = mddev_lock(mddev); 5007 + if (ret) 5008 + return ret; 5009 + 5010 + if (work_busy(&mddev->sync_work)) { 5011 + mddev_unlock(mddev); 5012 + goto retry; 5052 5013 } 5014 + 5015 + action = md_sync_action_by_name(page); 5016 + 5017 + /* TODO: mdadm rely on "idle" to start sync_thread. 
*/ 5018 + if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { 5019 + switch (action) { 5020 + case ACTION_FROZEN: 5021 + md_frozen_sync_thread(mddev); 5022 + ret = len; 5023 + goto out; 5024 + case ACTION_IDLE: 5025 + md_idle_sync_thread(mddev); 5026 + break; 5027 + case ACTION_RESHAPE: 5028 + case ACTION_RECOVER: 5029 + case ACTION_CHECK: 5030 + case ACTION_REPAIR: 5031 + case ACTION_RESYNC: 5032 + ret = -EBUSY; 5033 + goto out; 5034 + default: 5035 + ret = -EINVAL; 5036 + goto out; 5037 + } 5038 + } else { 5039 + switch (action) { 5040 + case ACTION_FROZEN: 5041 + set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5042 + ret = len; 5043 + goto out; 5044 + case ACTION_RESHAPE: 5045 + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5046 + ret = mddev_start_reshape(mddev); 5047 + if (ret) 5048 + goto out; 5049 + break; 5050 + case ACTION_RECOVER: 5051 + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5052 + set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 5053 + break; 5054 + case ACTION_CHECK: 5055 + set_bit(MD_RECOVERY_CHECK, &mddev->recovery); 5056 + fallthrough; 5057 + case ACTION_REPAIR: 5058 + set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); 5059 + set_bit(MD_RECOVERY_SYNC, &mddev->recovery); 5060 + fallthrough; 5061 + case ACTION_RESYNC: 5062 + case ACTION_IDLE: 5063 + clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5064 + break; 5065 + default: 5066 + ret = -EINVAL; 5067 + goto out; 5068 + } 5069 + } 5070 + 5053 5071 if (mddev->ro == MD_AUTO_READ) { 5054 5072 /* A write to sync_action is enough to justify 5055 5073 * canceling read-auto mode 5056 5074 */ 5057 - flush_work(&mddev->sync_work); 5058 5075 mddev->ro = MD_RDWR; 5059 5076 md_wakeup_thread(mddev->sync_thread); 5060 5077 } 5078 + 5061 5079 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 5062 5080 md_wakeup_thread(mddev->thread); 5063 5081 sysfs_notify_dirent_safe(mddev->sysfs_action); 5064 - return len; 5082 + ret = len; 5083 + 5084 + out: 5085 + mddev_unlock(mddev); 5086 + return ret; 5065 5087 } 
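The rewritten action_store() above dispatches on an enum obtained from md_sync_action_by_name(), which returns NR_SYNC_ACTIONS when no name matches. A minimal userspace sketch of that table lookup follows. It rests on assumptions: the sync_action_sketch enum, act_name table and both function names are illustrative, and name_match() is a simplified stand-in for the kernel's cmd_match() that accepts an exact name with one optional trailing newline.

```c
#include <string.h>

/* Illustrative mirror of the action list; NR_ACTS doubles as "not found". */
enum sync_action_sketch {
	ACT_RESYNC,
	ACT_RECOVER,
	ACT_CHECK,
	ACT_REPAIR,
	ACT_RESHAPE,
	ACT_FROZEN,
	ACT_IDLE,
	NR_ACTS,
};

static const char *const act_name[NR_ACTS] = {
	[ACT_RESYNC]	= "resync",
	[ACT_RECOVER]	= "recover",
	[ACT_CHECK]	= "check",
	[ACT_REPAIR]	= "repair",
	[ACT_RESHAPE]	= "reshape",
	[ACT_FROZEN]	= "frozen",
	[ACT_IDLE]	= "idle",
};

/* Simplified stand-in for cmd_match(): exact name, optional final '\n'. */
static int name_match(const char *input, const char *name)
{
	size_t n = strlen(name);

	if (strncmp(input, name, n) != 0)
		return 0;
	return input[n] == '\0' || (input[n] == '\n' && input[n + 1] == '\0');
}

/* Return the matching action, or NR_ACTS if the input names no action. */
static enum sync_action_sketch action_by_name(const char *input)
{
	for (int a = 0; a < NR_ACTS; a++) {
		if (name_match(input, act_name[a]))
			return (enum sync_action_sketch)a;
	}
	return NR_ACTS;
}
```

Using the element count as the "not found" value lets a caller route unknown input to the default branch of a switch, as the diff does when it returns -EINVAL.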
5066 5088 5067 5089 static struct md_sysfs_entry md_scan_mode = ··· 5097 5065 static ssize_t 5098 5066 last_sync_action_show(struct mddev *mddev, char *page) 5099 5067 { 5100 - return sprintf(page, "%s\n", mddev->last_sync_action); 5068 + return sprintf(page, "%s\n", 5069 + md_sync_action_name(mddev->last_sync_action)); 5101 5070 } 5102 5071 5103 5072 static struct md_sysfs_entry md_last_scan_mode = __ATTR_RO(last_sync_action); ··· 5788 5755 int mdp_major = 0; 5789 5756 5790 5757 /* stack the limit for all rdevs into lim */ 5791 - void mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim) 5758 + int mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim, 5759 + unsigned int flags) 5792 5760 { 5793 5761 struct md_rdev *rdev; 5794 5762 5795 5763 rdev_for_each(rdev, mddev) { 5796 5764 queue_limits_stack_bdev(lim, rdev->bdev, rdev->data_offset, 5797 5765 mddev->gendisk->disk_name); 5766 + if ((flags & MDDEV_STACK_INTEGRITY) && 5767 + !queue_limits_stack_integrity_bdev(lim, rdev->bdev)) 5768 + return -EINVAL; 5798 5769 } 5770 + 5771 + return 0; 5799 5772 } 5800 5773 EXPORT_SYMBOL_GPL(mddev_stack_rdev_limits); 5801 5774 ··· 5816 5777 lim = queue_limits_start_update(mddev->gendisk->queue); 5817 5778 queue_limits_stack_bdev(&lim, rdev->bdev, rdev->data_offset, 5818 5779 mddev->gendisk->disk_name); 5780 + 5781 + if (!queue_limits_stack_integrity_bdev(&lim, rdev->bdev)) { 5782 + pr_err("%s: incompatible integrity profile for %pg\n", 5783 + mdname(mddev), rdev->bdev); 5784 + queue_limits_cancel_update(mddev->gendisk->queue); 5785 + return -ENXIO; 5786 + } 5787 + 5819 5788 return queue_limits_commit_update(mddev->gendisk->queue, &lim); 5820 5789 } 5821 5790 EXPORT_SYMBOL_GPL(mddev_stack_new_rdev); ··· 5853 5806 kobject_put(&mddev->kobj); 5854 5807 } 5855 5808 5809 + void md_init_stacking_limits(struct queue_limits *lim) 5810 + { 5811 + blk_set_stacking_limits(lim); 5812 + lim->features = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA | 5813 + 
BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT; 5814 + } 5815 + EXPORT_SYMBOL_GPL(md_init_stacking_limits); 5816 + 5856 5817 struct mddev *md_alloc(dev_t dev, char *name) 5857 5818 { 5858 5819 /* ··· 5878 5823 int partitioned; 5879 5824 int shift; 5880 5825 int unit; 5881 - int error ; 5826 + int error; 5882 5827 5883 5828 /* 5884 5829 * Wait for any previous instance of this device to be completely ··· 5936 5881 disk->fops = &md_fops; 5937 5882 disk->private_data = mddev; 5938 5883 5939 - blk_queue_write_cache(disk->queue, true, true); 5940 5884 disk->events |= DISK_EVENT_MEDIA_CHANGE; 5941 5885 mddev->gendisk = disk; 5942 5886 error = add_disk(disk); ··· 6239 6185 } 6240 6186 } 6241 6187 6242 - if (!mddev_is_dm(mddev)) { 6243 - struct request_queue *q = mddev->gendisk->queue; 6244 - bool nonrot = true; 6245 - 6246 - rdev_for_each(rdev, mddev) { 6247 - if (rdev->raid_disk >= 0 && !bdev_nonrot(rdev->bdev)) { 6248 - nonrot = false; 6249 - break; 6250 - } 6251 - } 6252 - if (mddev->degraded) 6253 - nonrot = false; 6254 - if (nonrot) 6255 - blk_queue_flag_set(QUEUE_FLAG_NONROT, q); 6256 - else 6257 - blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); 6258 - blk_queue_flag_set(QUEUE_FLAG_IO_STAT, q); 6259 - 6260 - /* Set the NOWAIT flags if all underlying devices support it */ 6261 - if (nowait) 6262 - blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q); 6263 - } 6264 6188 if (pers->sync_request) { 6265 6189 if (mddev->kobj.sd && 6266 6190 sysfs_create_group(&mddev->kobj, &md_redundancy_group)) ··· 6469 6437 { 6470 6438 mddev_lock_nointr(mddev); 6471 6439 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 6472 - stop_sync_thread(mddev, true, false); 6440 + stop_sync_thread(mddev, true); 6473 6441 __md_stop_writes(mddev); 6474 6442 mddev_unlock(mddev); 6475 6443 } ··· 6537 6505 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 6538 6506 } 6539 6507 6540 - stop_sync_thread(mddev, false, false); 6508 + stop_sync_thread(mddev, false); 6541 6509 wait_event(mddev->sb_wait, 6542 6510 
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)); 6543 6511 mddev_lock_nointr(mddev); ··· 6583 6551 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 6584 6552 } 6585 6553 6586 - stop_sync_thread(mddev, true, false); 6554 + stop_sync_thread(mddev, true); 6587 6555 6588 6556 if (mddev->sysfs_active || 6589 6557 test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) { ··· 7198 7166 if (!mddev->thread) 7199 7167 md_update_sb(mddev, 1); 7200 7168 /* 7201 - * If the new disk does not support REQ_NOWAIT, 7202 - * disable on the whole MD. 7203 - */ 7204 - if (!bdev_nowait(rdev->bdev)) { 7205 - pr_info("%s: Disabling nowait because %pg does not support nowait\n", 7206 - mdname(mddev), rdev->bdev); 7207 - blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, mddev->gendisk->queue); 7208 - } 7209 - /* 7210 7169 * Kick recovery, maybe this spare has to be added to the 7211 7170 * array immediately. 7212 7171 */ ··· 7765 7742 return get_bitmap_file(mddev, argp); 7766 7743 } 7767 7744 7768 - if (cmd == HOT_REMOVE_DISK) 7769 - /* need to ensure recovery thread has run */ 7770 - wait_event_interruptible_timeout(mddev->sb_wait, 7771 - !test_bit(MD_RECOVERY_NEEDED, 7772 - &mddev->recovery), 7773 - msecs_to_jiffies(5000)); 7774 7745 if (cmd == STOP_ARRAY || cmd == STOP_ARRAY_RO) { 7775 7746 /* Need to flush page cache, and ensure no-one else opens 7776 7747 * and writes ··· 8537 8520 } 8538 8521 EXPORT_SYMBOL(unregister_md_personality); 8539 8522 8540 - int register_md_cluster_operations(struct md_cluster_operations *ops, 8523 + int register_md_cluster_operations(const struct md_cluster_operations *ops, 8541 8524 struct module *module) 8542 8525 { 8543 8526 int ret = 0; ··· 8658 8641 * A return value of 'false' means that the write wasn't recorded 8659 8642 * and cannot proceed as the array is being suspend. 
8660 8643 */ 8661 - bool md_write_start(struct mddev *mddev, struct bio *bi) 8644 + void md_write_start(struct mddev *mddev, struct bio *bi) 8662 8645 { 8663 8646 int did_change = 0; 8664 8647 8665 8648 if (bio_data_dir(bi) != WRITE) 8666 - return true; 8649 + return; 8667 8650 8668 8651 BUG_ON(mddev->ro == MD_RDONLY); 8669 8652 if (mddev->ro == MD_AUTO_READ) { ··· 8696 8679 if (did_change) 8697 8680 sysfs_notify_dirent_safe(mddev->sysfs_state); 8698 8681 if (!mddev->has_superblocks) 8699 - return true; 8682 + return; 8700 8683 wait_event(mddev->sb_wait, 8701 - !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) || 8702 - is_md_suspended(mddev)); 8703 - if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { 8704 - percpu_ref_put(&mddev->writes_pending); 8705 - return false; 8706 - } 8707 - return true; 8684 + !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)); 8708 8685 } 8709 8686 EXPORT_SYMBOL(md_write_start); 8710 8687 ··· 8846 8835 } 8847 8836 EXPORT_SYMBOL_GPL(md_allow_write); 8848 8837 8838 + static sector_t md_sync_max_sectors(struct mddev *mddev, 8839 + enum sync_action action) 8840 + { 8841 + switch (action) { 8842 + case ACTION_RESYNC: 8843 + case ACTION_CHECK: 8844 + case ACTION_REPAIR: 8845 + atomic64_set(&mddev->resync_mismatches, 0); 8846 + fallthrough; 8847 + case ACTION_RESHAPE: 8848 + return mddev->resync_max_sectors; 8849 + case ACTION_RECOVER: 8850 + return mddev->dev_sectors; 8851 + default: 8852 + return 0; 8853 + } 8854 + } 8855 + 8856 + static sector_t md_sync_position(struct mddev *mddev, enum sync_action action) 8857 + { 8858 + sector_t start = 0; 8859 + struct md_rdev *rdev; 8860 + 8861 + switch (action) { 8862 + case ACTION_CHECK: 8863 + case ACTION_REPAIR: 8864 + return mddev->resync_min; 8865 + case ACTION_RESYNC: 8866 + if (!mddev->bitmap) 8867 + return mddev->recovery_cp; 8868 + return 0; 8869 + case ACTION_RESHAPE: 8870 + /* 8871 + * If the original node aborts reshaping then we continue the 8872 + * reshaping, so set again to avoid 
restart reshape from the 8873 + * first beginning 8874 + */ 8875 + if (mddev_is_clustered(mddev) && 8876 + mddev->reshape_position != MaxSector) 8877 + return mddev->reshape_position; 8878 + return 0; 8879 + case ACTION_RECOVER: 8880 + start = MaxSector; 8881 + rcu_read_lock(); 8882 + rdev_for_each_rcu(rdev, mddev) 8883 + if (rdev->raid_disk >= 0 && 8884 + !test_bit(Journal, &rdev->flags) && 8885 + !test_bit(Faulty, &rdev->flags) && 8886 + !test_bit(In_sync, &rdev->flags) && 8887 + rdev->recovery_offset < start) 8888 + start = rdev->recovery_offset; 8889 + rcu_read_unlock(); 8890 + 8891 + /* If there is a bitmap, we need to make sure all 8892 + * writes that started before we added a spare 8893 + * complete before we start doing a recovery. 8894 + * Otherwise the write might complete and (via 8895 + * bitmap_endwrite) set a bit in the bitmap after the 8896 + * recovery has checked that bit and skipped that 8897 + * region. 8898 + */ 8899 + if (mddev->bitmap) { 8900 + mddev->pers->quiesce(mddev, 1); 8901 + mddev->pers->quiesce(mddev, 0); 8902 + } 8903 + return start; 8904 + default: 8905 + return MaxSector; 8906 + } 8907 + } 8908 + 8849 8909 #define SYNC_MARKS 10 8850 8910 #define SYNC_MARK_STEP (3*HZ) 8851 8911 #define UPDATE_FREQUENCY (5*60*HZ) ··· 8933 8851 sector_t last_check; 8934 8852 int skipped = 0; 8935 8853 struct md_rdev *rdev; 8936 - char *desc, *action = NULL; 8854 + enum sync_action action; 8855 + const char *desc; 8937 8856 struct blk_plug plug; 8938 8857 int ret; 8939 8858 ··· 8965 8882 goto skip; 8966 8883 } 8967 8884 8968 - if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { 8969 - if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) { 8970 - desc = "data-check"; 8971 - action = "check"; 8972 - } else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { 8973 - desc = "requested-resync"; 8974 - action = "repair"; 8975 - } else 8976 - desc = "resync"; 8977 - } else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) 8978 - desc = "reshape"; 8979 - 
else 8980 - desc = "recovery"; 8981 - 8982 - mddev->last_sync_action = action ?: desc; 8885 + action = md_sync_action(mddev); 8886 + desc = md_sync_action_name(action); 8887 + mddev->last_sync_action = action; 8983 8888 8984 8889 /* 8985 8890 * Before starting a resync we must have set curr_resync to ··· 9035 8964 spin_unlock(&all_mddevs_lock); 9036 8965 } while (mddev->curr_resync < MD_RESYNC_DELAYED); 9037 8966 9038 - j = 0; 9039 - if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { 9040 - /* resync follows the size requested by the personality, 9041 - * which defaults to physical size, but can be virtual size 9042 - */ 9043 - max_sectors = mddev->resync_max_sectors; 9044 - atomic64_set(&mddev->resync_mismatches, 0); 9045 - /* we don't use the checkpoint if there's a bitmap */ 9046 - if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) 9047 - j = mddev->resync_min; 9048 - else if (!mddev->bitmap) 9049 - j = mddev->recovery_cp; 9050 - 9051 - } else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) { 9052 - max_sectors = mddev->resync_max_sectors; 9053 - /* 9054 - * If the original node aborts reshaping then we continue the 9055 - * reshaping, so set j again to avoid restart reshape from the 9056 - * first beginning 9057 - */ 9058 - if (mddev_is_clustered(mddev) && 9059 - mddev->reshape_position != MaxSector) 9060 - j = mddev->reshape_position; 9061 - } else { 9062 - /* recovery follows the physical size of devices */ 9063 - max_sectors = mddev->dev_sectors; 9064 - j = MaxSector; 9065 - rcu_read_lock(); 9066 - rdev_for_each_rcu(rdev, mddev) 9067 - if (rdev->raid_disk >= 0 && 9068 - !test_bit(Journal, &rdev->flags) && 9069 - !test_bit(Faulty, &rdev->flags) && 9070 - !test_bit(In_sync, &rdev->flags) && 9071 - rdev->recovery_offset < j) 9072 - j = rdev->recovery_offset; 9073 - rcu_read_unlock(); 9074 - 9075 - /* If there is a bitmap, we need to make sure all 9076 - * writes that started before we added a spare 9077 - * complete before we start doing a recovery. 
9078 - * Otherwise the write might complete and (via 9079 - * bitmap_endwrite) set a bit in the bitmap after the 9080 - * recovery has checked that bit and skipped that 9081 - * region. 9082 - */ 9083 - if (mddev->bitmap) { 9084 - mddev->pers->quiesce(mddev, 1); 9085 - mddev->pers->quiesce(mddev, 0); 9086 - } 9087 - } 8967 + max_sectors = md_sync_max_sectors(mddev, action); 8968 + j = md_sync_position(mddev, action); 9088 8969 9089 8970 pr_info("md: %s of RAID array %s\n", desc, mdname(mddev)); 9090 8971 pr_debug("md: minimum _guaranteed_ speed: %d KB/sec/disk.\n", speed_min(mddev)); ··· 9118 9095 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 9119 9096 break; 9120 9097 9121 - sectors = mddev->pers->sync_request(mddev, j, &skipped); 9098 + sectors = mddev->pers->sync_request(mddev, j, max_sectors, 9099 + &skipped); 9122 9100 if (sectors == 0) { 9123 9101 set_bit(MD_RECOVERY_INTR, &mddev->recovery); 9124 9102 break; ··· 9209 9185 mddev->curr_resync_completed = mddev->curr_resync; 9210 9186 sysfs_notify_dirent_safe(mddev->sysfs_completed); 9211 9187 } 9212 - mddev->pers->sync_request(mddev, max_sectors, &skipped); 9188 + mddev->pers->sync_request(mddev, max_sectors, max_sectors, &skipped); 9213 9189 9214 9190 if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && 9215 9191 mddev->curr_resync > MD_RESYNC_ACTIVE) {
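The hunks above replace the `char *last_sync_action` string with an `enum sync_action` and print it through `md_sync_action_name()`. A minimal userspace sketch of such a name table follows. The `sketch_*` names are hypothetical, and the strings are an assumption, since the real table sits outside the hunks shown here.

```c
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the enum added to md.h (same ordering). */
enum sync_action {
	ACTION_RESYNC,
	ACTION_RECOVER,
	ACTION_CHECK,
	ACTION_REPAIR,
	ACTION_RESHAPE,
	ACTION_FROZEN,
	ACTION_IDLE,
	NR_SYNC_ACTIONS,
};

/* Assumed strings: the kernel's own table is not part of the diff above. */
static const char *const sketch_action_name[NR_SYNC_ACTIONS] = {
	[ACTION_RESYNC]  = "resync",
	[ACTION_RECOVER] = "recover",
	[ACTION_CHECK]   = "check",
	[ACTION_REPAIR]  = "repair",
	[ACTION_RESHAPE] = "reshape",
	[ACTION_FROZEN]  = "frozen",
	[ACTION_IDLE]    = "idle",
};

/* Hypothetical analogue of md_sync_action_name(). */
const char *sketch_sync_action_name(enum sync_action action)
{
	if ((unsigned int)action >= NR_SYNC_ACTIONS)
		return "unknown";
	return sketch_action_name[action];
}

/*
 * Hypothetical analogue of md_sync_action_by_name().
 * Returns NR_SYNC_ACTIONS when nothing matches.
 */
enum sync_action sketch_sync_action_by_name(const char *page)
{
	size_t i;

	for (i = 0; i < NR_SYNC_ACTIONS; i++) {
		size_t n = strlen(sketch_action_name[i]);

		/* sysfs writes usually end in '\n': accept "name" and "name\n" */
		if (strncmp(page, sketch_action_name[i], n) == 0 &&
		    (page[n] == '\0' || (page[n] == '\n' && page[n + 1] == '\0')))
			return (enum sync_action)i;
	}
	return NR_SYNC_ACTIONS;
}
```

Storing the enum and translating only at the sysfs boundary lets the sync thread `switch` on the action (as `md_sync_max_sectors()` and `md_sync_position()` do) without comparing strings.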

+109 -27

drivers/md/md.h

··· 34 34 */ 35 35 #define MD_FAILFAST (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT) 36 36 37 + /* Status of sync thread. */ 38 + enum sync_action { 39 + /* 40 + * Represent by MD_RECOVERY_SYNC, start when: 41 + * 1) after assemble, sync data from first rdev to other copies, this 42 + * must be done first before other sync actions and will only execute 43 + * once; 44 + * 2) resize the array(notice that this is not reshape), sync data for 45 + * the new range; 46 + */ 47 + ACTION_RESYNC, 48 + /* 49 + * Represent by MD_RECOVERY_RECOVER, start when: 50 + * 1) for new replacement, sync data based on the replace rdev or 51 + * available copies from other rdev; 52 + * 2) for new member disk while the array is degraded, sync data from 53 + * other rdev; 54 + * 3) reassemble after power failure or re-add a hot removed rdev, sync 55 + * data from first rdev to other copies based on bitmap; 56 + */ 57 + ACTION_RECOVER, 58 + /* 59 + * Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED | 60 + * MD_RECOVERY_CHECK, start when user echo "check" to sysfs api 61 + * sync_action, used to check if data copies from differenct rdev are 62 + * the same. The number of mismatch sectors will be exported to user 63 + * by sysfs api mismatch_cnt; 64 + */ 65 + ACTION_CHECK, 66 + /* 67 + * Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED, start when 68 + * user echo "repair" to sysfs api sync_action, usually paired with 69 + * ACTION_CHECK, used to force syncing data once user found that there 70 + * are inconsistent data, 71 + */ 72 + ACTION_REPAIR, 73 + /* 74 + * Represent by MD_RECOVERY_RESHAPE, start when new member disk is added 75 + * to the conf, notice that this is different from spares or 76 + * replacement; 77 + */ 78 + ACTION_RESHAPE, 79 + /* 80 + * Represent by MD_RECOVERY_FROZEN, can be set by sysfs api sync_action 81 + * or internal usage like setting the array read-only, will forbid above 82 + * actions. 
83 + */ 84 + ACTION_FROZEN, 85 + /* 86 + * All above actions don't match. 87 + */ 88 + ACTION_IDLE, 89 + NR_SYNC_ACTIONS, 90 + }; 91 + 37 92 /* 38 93 * The struct embedded in rdev is used to serialize IO. 39 94 */ ··· 426 371 struct md_thread __rcu *thread; /* management thread */ 427 372 struct md_thread __rcu *sync_thread; /* doing resync or reconstruct */ 428 373 429 - /* 'last_sync_action' is initialized to "none". It is set when a 430 - * sync operation (i.e "data-check", "requested-resync", "resync", 431 - * "recovery", or "reshape") is started. It holds this value even 374 + /* 375 + * Set when a sync operation is started. It holds this value even 432 376 * when the sync thread is "frozen" (interrupted) or "idle" (stopped 433 - * or finished). It is overwritten when a new sync operation is begun. 377 + * or finished). It is overwritten when a new sync operation is begun. 434 378 */ 435 - char *last_sync_action; 379 + enum sync_action last_sync_action; 436 380 sector_t curr_resync; /* last block scheduled */ 437 381 /* As resync requests can complete out of order, we cannot easily track 438 382 * how much resync has been completed. So we occasionally pause until ··· 594 540 */ 595 541 struct list_head deleting; 596 542 597 - /* Used to synchronize idle and frozen for action_store() */ 598 - struct mutex sync_mutex; 599 543 /* The sequence number for sync thread */ 600 544 atomic_t sync_seq; 601 545 ··· 603 551 }; 604 552 605 553 enum recovery_flags { 554 + /* flags for sync thread running status */ 555 + 606 556 /* 607 - * If neither SYNC or RESHAPE are set, then it is a recovery. 557 + * set when one of sync action is set and new sync thread need to be 558 + * registered, or just add/remove spares from conf. 608 559 */ 609 - MD_RECOVERY_RUNNING, /* a thread is running, or about to be started */ 610 - MD_RECOVERY_SYNC, /* actually doing a resync, not a recovery */ 611 - MD_RECOVERY_RECOVER, /* doing recovery, or need to try it. 
*/ 612 - MD_RECOVERY_INTR, /* resync needs to be aborted for some reason */ 613 - MD_RECOVERY_DONE, /* thread is done and is waiting to be reaped */ 614 - MD_RECOVERY_NEEDED, /* we might need to start a resync/recover */ 615 - MD_RECOVERY_REQUESTED, /* user-space has requested a sync (used with SYNC) */ 616 - MD_RECOVERY_CHECK, /* user-space request for check-only, no repair */ 617 - MD_RECOVERY_RESHAPE, /* A reshape is happening */ 618 - MD_RECOVERY_FROZEN, /* User request to abort, and not restart, any action */ 619 - MD_RECOVERY_ERROR, /* sync-action interrupted because io-error */ 620 - MD_RECOVERY_WAIT, /* waiting for pers->start() to finish */ 621 - MD_RESYNCING_REMOTE, /* remote node is running resync thread */ 560 + MD_RECOVERY_NEEDED, 561 + /* sync thread is running, or about to be started */ 562 + MD_RECOVERY_RUNNING, 563 + /* sync thread needs to be aborted for some reason */ 564 + MD_RECOVERY_INTR, 565 + /* sync thread is done and is waiting to be unregistered */ 566 + MD_RECOVERY_DONE, 567 + /* running sync thread must abort immediately, and not restart */ 568 + MD_RECOVERY_FROZEN, 569 + /* waiting for pers->start() to finish */ 570 + MD_RECOVERY_WAIT, 571 + /* interrupted because io-error */ 572 + MD_RECOVERY_ERROR, 573 + 574 + /* flags determines sync action, see details in enum sync_action */ 575 + 576 + /* if just this flag is set, action is resync. */ 577 + MD_RECOVERY_SYNC, 578 + /* 579 + * paired with MD_RECOVERY_SYNC, if MD_RECOVERY_CHECK is not set, 580 + * action is repair, means user requested resync. 581 + */ 582 + MD_RECOVERY_REQUESTED, 583 + /* 584 + * paired with MD_RECOVERY_SYNC and MD_RECOVERY_REQUESTED, action is 585 + * check. 
586 + */ 587 + MD_RECOVERY_CHECK, 588 + /* recovery, or need to try it */ 589 + MD_RECOVERY_RECOVER, 590 + /* reshape */ 591 + MD_RECOVERY_RESHAPE, 592 + /* remote node is running resync thread */ 593 + MD_RESYNCING_REMOTE, 622 594 }; 623 595 624 596 enum md_ro_state { ··· 729 653 int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); 730 654 int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev); 731 655 int (*spare_active) (struct mddev *mddev); 732 - sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped); 656 + sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, 657 + sector_t max_sector, int *skipped); 733 658 int (*resize) (struct mddev *mddev, sector_t sectors); 734 659 sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks); 735 660 int (*check_reshape) (struct mddev *mddev); ··· 849 772 850 773 extern int register_md_personality(struct md_personality *p); 851 774 extern int unregister_md_personality(struct md_personality *p); 852 - extern int register_md_cluster_operations(struct md_cluster_operations *ops, 775 + extern int register_md_cluster_operations(const struct md_cluster_operations *ops, 853 776 struct module *module); 854 777 extern int unregister_md_cluster_operations(void); 855 778 extern int md_setup_cluster(struct mddev *mddev, int nodes); ··· 862 785 extern void md_wakeup_thread(struct md_thread __rcu *thread); 863 786 extern void md_check_recovery(struct mddev *mddev); 864 787 extern void md_reap_sync_thread(struct mddev *mddev); 865 - extern bool md_write_start(struct mddev *mddev, struct bio *bi); 788 + extern enum sync_action md_sync_action(struct mddev *mddev); 789 + extern enum sync_action md_sync_action_by_name(const char *page); 790 + extern const char *md_sync_action_name(enum sync_action action); 791 + extern void md_write_start(struct mddev *mddev, struct bio *bi); 866 792 extern void md_write_inc(struct mddev *mddev, struct bio *bi); 867 793 extern void 
md_write_end(struct mddev *mddev); 868 794 extern void md_done_sync(struct mddev *mddev, int blocks, int ok); ··· 889 809 extern void md_set_array_sectors(struct mddev *mddev, sector_t array_sectors); 890 810 extern int md_check_no_bitmap(struct mddev *mddev); 891 811 extern int md_integrity_register(struct mddev *mddev); 892 - extern int md_integrity_add_rdev(struct md_rdev *rdev, struct mddev *mddev); 893 812 extern int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale); 894 813 895 814 extern int mddev_init(struct mddev *mddev); 896 815 extern void mddev_destroy(struct mddev *mddev); 816 + void md_init_stacking_limits(struct queue_limits *lim); 897 817 struct mddev *md_alloc(dev_t dev, char *name); 898 818 void mddev_put(struct mddev *mddev); 899 819 extern int md_run(struct mddev *mddev); ··· 932 852 } 933 853 } 934 854 935 - extern struct md_cluster_operations *md_cluster_ops; 855 + extern const struct md_cluster_operations *md_cluster_ops; 936 856 static inline int mddev_is_clustered(struct mddev *mddev) 937 857 { 938 858 return mddev->cluster_info && mddev->bitmap_info.nodes > 1; ··· 988 908 int md_set_array_info(struct mddev *mddev, struct mdu_array_info_s *info); 989 909 int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info); 990 910 int do_md_run(struct mddev *mddev); 991 - void mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim); 911 + #define MDDEV_STACK_INTEGRITY (1u << 0) 912 + int mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim, 913 + unsigned int flags); 992 914 int mddev_stack_new_rdev(struct mddev *mddev, struct md_rdev *rdev); 993 915 void mddev_update_io_opt(struct mddev *mddev, unsigned int nr_stripes); 994 916
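The comments on `enum sync_action` and `enum recovery_flags` describe which flag combinations select which action. A rough userspace sketch of that decoding follows. The `SKETCH_*` bit masks and `sketch_sync_action()` are hypothetical, the precedence among frozen, reshape and recover is an assumption, and the running/needed checks that the kernel's `md_sync_action()` may perform are left out.

```c
/* Userspace stand-in for the enum added to md.h. */
enum sync_action {
	ACTION_RESYNC,
	ACTION_RECOVER,
	ACTION_CHECK,
	ACTION_REPAIR,
	ACTION_RESHAPE,
	ACTION_FROZEN,
	ACTION_IDLE,
	NR_SYNC_ACTIONS,
};

/* Hypothetical bit masks; the kernel uses bit numbers with test_bit(). */
enum sketch_recovery_flag {
	SKETCH_RECOVERY_SYNC      = 1 << 0,
	SKETCH_RECOVERY_REQUESTED = 1 << 1,
	SKETCH_RECOVERY_CHECK     = 1 << 2,
	SKETCH_RECOVERY_RECOVER   = 1 << 3,
	SKETCH_RECOVERY_RESHAPE   = 1 << 4,
	SKETCH_RECOVERY_FROZEN    = 1 << 5,
};

/* Map a flag word to an action, following the md.h comments. */
enum sync_action sketch_sync_action(unsigned int flags)
{
	if (flags & SKETCH_RECOVERY_FROZEN)
		return ACTION_FROZEN;		/* forbids every other action */
	if (flags & SKETCH_RECOVERY_RESHAPE)
		return ACTION_RESHAPE;
	if (flags & SKETCH_RECOVERY_RECOVER)
		return ACTION_RECOVER;
	if (flags & SKETCH_RECOVERY_SYNC) {
		if (flags & SKETCH_RECOVERY_CHECK)
			return ACTION_CHECK;	/* SYNC | REQUESTED | CHECK */
		if (flags & SKETCH_RECOVERY_REQUESTED)
			return ACTION_REPAIR;	/* SYNC | REQUESTED */
		return ACTION_RESYNC;		/* SYNC alone */
	}
	return ACTION_IDLE;			/* nothing above matched */
}
```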

+12 -18

drivers/md/raid0.c

@@ -365,30 +365,30 @@
 	return array_sectors;
 }

-static void free_conf(struct mddev *mddev, struct r0conf *conf)
+static void raid0_free(struct mddev *mddev, void *priv)
 {
+	struct r0conf *conf = priv;
+
 	kfree(conf->strip_zone);
 	kfree(conf->devlist);
 	kfree(conf);
 }

-static void raid0_free(struct mddev *mddev, void *priv)
-{
-	struct r0conf *conf = priv;
-
-	free_conf(mddev, conf);
-}
-
 static int raid0_set_limits(struct mddev *mddev)
 {
 	struct queue_limits lim;
+	int err;

-	blk_set_stacking_limits(&lim);
+	md_init_stacking_limits(&lim);
 	lim.max_hw_sectors = mddev->chunk_sectors;
 	lim.max_write_zeroes_sectors = mddev->chunk_sectors;
 	lim.io_min = mddev->chunk_sectors << 9;
 	lim.io_opt = lim.io_min * mddev->raid_disks;
-	mddev_stack_rdev_limits(mddev, &lim);
+	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
+	if (err) {
+		queue_limits_cancel_update(mddev->gendisk->queue);
+		return err;
+	}
 	return queue_limits_set(mddev->gendisk->queue, &lim);
 }

@@ -415,7 +415,7 @@
 	if (!mddev_is_dm(mddev)) {
 		ret = raid0_set_limits(mddev);
 		if (ret)
-			goto out_free_conf;
+			return ret;
 	}

 	/* calculate array device size */
@@ -427,13 +427,7 @@

 	dump_zones(mddev);

-	ret = md_integrity_register(mddev);
-	if (ret)
-		goto out_free_conf;
-	return 0;
-out_free_conf:
-	free_conf(mddev, conf);
-	return ret;
+	return md_integrity_register(mddev);
 }

 /*
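In this series `mddev_stack_rdev_limits()` gains a `flags` argument and returns an error when `MDDEV_STACK_INTEGRITY` is set and a member device's integrity profile cannot be stacked. The control flow can be sketched in userspace as follows. Every `sketch_*` name is hypothetical, and the compatibility rule used here (equal tuple size) is only an illustrative assumption, not the block layer's actual check.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_STACK_INTEGRITY (1u << 0)	/* models MDDEV_STACK_INTEGRITY */

/* Hypothetical, heavily reduced views of a member device and the stacked limits. */
struct sketch_dev {
	unsigned int max_sectors;
	unsigned int tuple_size;	/* stand-in for an integrity profile */
};

struct sketch_limits {
	unsigned int max_sectors;
	unsigned int tuple_size;
	int profile_set;
};

/* Assumed rule: the first device defines the profile, later ones must match it. */
static int sketch_stack_integrity(struct sketch_limits *lim,
				  const struct sketch_dev *dev)
{
	if (!lim->profile_set) {
		lim->tuple_size = dev->tuple_size;
		lim->profile_set = 1;
		return 1;
	}
	return lim->tuple_size == dev->tuple_size;
}

/* Stack every device into lim; fail early if integrity stacking is requested and fails. */
int sketch_stack_limits(struct sketch_limits *lim,
			const struct sketch_dev *devs, size_t n,
			unsigned int flags)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* ordinary limits take the most restrictive value */
		if (devs[i].max_sectors < lim->max_sectors)
			lim->max_sectors = devs[i].max_sectors;

		if ((flags & SKETCH_STACK_INTEGRITY) &&
		    !sketch_stack_integrity(lim, &devs[i]))
			return -EINVAL;
	}
	return 0;
}
```

Passing `0` for `flags`, as the raid5 hunk does, skips the integrity step entirely, so the same helper serves callers that handle integrity elsewhere.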

+13 -21

drivers/md/raid1.c

@@ -1687,8 +1687,7 @@
 	if (bio_data_dir(bio) == READ)
 		raid1_read_request(mddev, bio, sectors, NULL);
 	else {
-		if (!md_write_start(mddev,bio))
-			return false;
+		md_write_start(mddev,bio);
 		raid1_write_request(mddev, bio, sectors);
 	}
 	return true;
@@ -1905,9 +1906,6 @@

 	if (mddev->recovery_disabled == conf->recovery_disabled)
 		return -EBUSY;
-
-	if (md_integrity_add_rdev(rdev, mddev))
-		return -ENXIO;

 	if (rdev->raid_disk >= 0)
 		first = last = rdev->raid_disk;
@@ -2753,12 +2757,12 @@
  */

 static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
-				   int *skipped)
+				   sector_t max_sector, int *skipped)
 {
 	struct r1conf *conf = mddev->private;
 	struct r1bio *r1_bio;
 	struct bio *bio;
-	sector_t max_sector, nr_sectors;
+	sector_t nr_sectors;
 	int disk = -1;
 	int i;
 	int wonly = -1;
@@ -2774,7 +2778,6 @@
 		if (init_resync(conf))
 			return 0;

-	max_sector = mddev->dev_sectors;
 	if (sector_nr >= max_sector) {
 		/* If we aborted, we need to abort the
 		 * sync on the 'current' bitmap chunk (there will
@@ -3192,14 +3197,18 @@
 static int raid1_set_limits(struct mddev *mddev)
 {
 	struct queue_limits lim;
+	int err;

-	blk_set_stacking_limits(&lim);
+	md_init_stacking_limits(&lim);
 	lim.max_write_zeroes_sectors = 0;
-	mddev_stack_rdev_limits(mddev, &lim);
+	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
+	if (err) {
+		queue_limits_cancel_update(mddev->gendisk->queue);
+		return err;
+	}
 	return queue_limits_set(mddev->gendisk->queue, &lim);
 }

-static void raid1_free(struct mddev *mddev, void *priv);
 static int raid1_run(struct mddev *mddev)
 {
 	struct r1conf *conf;
@@ -3237,7 +3238,7 @@
 	if (!mddev_is_dm(mddev)) {
 		ret = raid1_set_limits(mddev);
 		if (ret)
-			goto abort;
+			return ret;
 	}

 	mddev->degraded = 0;
@@ -3251,8 +3252,7 @@
 	 */
 	if (conf->raid_disks - mddev->degraded < 1) {
 		md_unregister_thread(mddev, &conf->thread);
-		ret = -EINVAL;
-		goto abort;
+		return -EINVAL;
 	}

 	if (conf->raid_disks - mddev->degraded == 1)
@@ -3275,14 +3277,8 @@
 	md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));

 	ret = md_integrity_register(mddev);
-	if (ret) {
+	if (ret)
 		md_unregister_thread(mddev, &mddev->thread);
-		goto abort;
-	}
-	return 0;
-
-abort:
-	raid1_free(mddev, conf);
 	return ret;
 }

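The `sync_request` callback now receives `max_sector` from its caller instead of deriving it from `mddev` in each personality. A small userspace sketch of that calling convention follows. The `sketch_*` names and the 128-sector step are hypothetical, and the loop is far simpler than the real `md_do_sync()`.

```c
typedef unsigned long long sector_t;

/* Callback shape after the change: the limit is an explicit argument. */
typedef sector_t (*sketch_sync_request_fn)(sector_t sector_nr,
					   sector_t max_sector, int *skipped);

/* Toy personality: handles at most 128 sectors per call, never past max_sector. */
sector_t sketch_sync_request(sector_t sector_nr, sector_t max_sector,
			     int *skipped)
{
	sector_t left;

	*skipped = 0;
	if (sector_nr >= max_sector)
		return 0;		/* nothing left to do */

	left = max_sector - sector_nr;
	return left < 128 ? left : 128;
}

/*
 * Toy caller: picks the limit once and hands it to every call.
 * Returns how many calls made progress.
 */
unsigned int sketch_do_sync(sketch_sync_request_fn fn, sector_t start,
			    sector_t max_sectors)
{
	sector_t j = start;
	unsigned int calls = 0;
	int skipped;

	while (j < max_sectors) {
		sector_t sectors = fn(j, max_sectors, &skipped);

		if (sectors == 0)
			break;		/* callback could not make progress */
		j += sectors;
		calls++;
	}
	return calls;
}
```

With the limit computed in one place, a personality can no longer disagree with the caller about where a resync, recovery or reshape ends.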

+10 -13

drivers/md/raid10.c

@@ -1836,8 +1836,7 @@
 	    && md_flush_request(mddev, bio))
 		return true;

-	if (!md_write_start(mddev, bio))
-		return false;
+	md_write_start(mddev, bio);

 	if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
 		if (!raid10_handle_discard(mddev, bio))
@@ -2081,9 +2082,6 @@
 		return -EBUSY;
 	if (rdev->saved_raid_disk < 0 && !_enough(conf, 1, -1))
 		return -EINVAL;
-
-	if (md_integrity_add_rdev(rdev, mddev))
-		return -ENXIO;

 	if (rdev->raid_disk >= 0)
 		first = last = rdev->raid_disk;
@@ -3136,12 +3140,12 @@
  */

 static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
-				    int *skipped)
+				    sector_t max_sector, int *skipped)
 {
 	struct r10conf *conf = mddev->private;
 	struct r10bio *r10_bio;
 	struct bio *biolist = NULL, *bio;
-	sector_t max_sector, nr_sectors;
+	sector_t nr_sectors;
 	int i;
 	int max_sync;
 	sector_t sync_blocks;
@@ -3171,10 +3175,6 @@
 		return 0;

  skipped:
-	max_sector = mddev->dev_sectors;
-	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
-	    test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
-		max_sector = mddev->resync_max_sectors;
 	if (sector_nr >= max_sector) {
 		conf->cluster_sync_low = 0;
 		conf->cluster_sync_high = 0;
@@ -3972,12 +3980,17 @@
 {
 	struct r10conf *conf = mddev->private;
 	struct queue_limits lim;
+	int err;

-	blk_set_stacking_limits(&lim);
+	md_init_stacking_limits(&lim);
 	lim.max_write_zeroes_sectors = 0;
 	lim.io_min = mddev->chunk_sectors << 9;
 	lim.io_opt = lim.io_min * raid10_nr_stripes(conf);
-	mddev_stack_rdev_limits(mddev, &lim);
+	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
+	if (err) {
+		queue_limits_cancel_update(mddev->gendisk->queue);
+		return err;
+	}
 	return queue_limits_set(mddev->gendisk->queue, &lim);
 }

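The limit setup derives `io_min` and `io_opt` from the chunk size: `chunk_sectors << 9` converts 512-byte sectors to bytes, and the optimal I/O size is that value times the number of stripes. As a worked example, here is a hypothetical userspace helper (`sketch_raid_io_hints` is not a kernel function):

```c
#include <stdint.h>

/* Hypothetical pair of I/O hints, in bytes. */
struct sketch_io_hints {
	uint32_t io_min;
	uint32_t io_opt;
};

/* Mirrors the arithmetic in the *_set_limits() hunks above. */
struct sketch_io_hints sketch_raid_io_hints(uint32_t chunk_sectors,
					    uint32_t nr_stripes)
{
	struct sketch_io_hints h;

	h.io_min = chunk_sectors << 9;		/* 512-byte sectors to bytes */
	h.io_opt = h.io_min * nr_stripes;	/* one full stripe */
	return h;
}
```

For a 512 KiB chunk (`chunk_sectors = 1024`) and 4 stripes this gives `io_min = 524288` and `io_opt = 2097152` bytes.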

+68 -46

drivers/md/raid5.c

··· 155 155 return slot; 156 156 } 157 157 158 - static void print_raid5_conf (struct r5conf *conf); 158 + static void print_raid5_conf(struct r5conf *conf); 159 159 160 160 static int stripe_operations_active(struct stripe_head *sh) 161 161 { ··· 5899 5899 return ret; 5900 5900 } 5901 5901 5902 + enum reshape_loc { 5903 + LOC_NO_RESHAPE, 5904 + LOC_AHEAD_OF_RESHAPE, 5905 + LOC_INSIDE_RESHAPE, 5906 + LOC_BEHIND_RESHAPE, 5907 + }; 5908 + 5909 + static enum reshape_loc get_reshape_loc(struct mddev *mddev, 5910 + struct r5conf *conf, sector_t logical_sector) 5911 + { 5912 + sector_t reshape_progress, reshape_safe; 5913 + /* 5914 + * Spinlock is needed as reshape_progress may be 5915 + * 64bit on a 32bit platform, and so it might be 5916 + * possible to see a half-updated value 5917 + * Of course reshape_progress could change after 5918 + * the lock is dropped, so once we get a reference 5919 + * to the stripe that we think it is, we will have 5920 + * to check again. 5921 + */ 5922 + spin_lock_irq(&conf->device_lock); 5923 + reshape_progress = conf->reshape_progress; 5924 + reshape_safe = conf->reshape_safe; 5925 + spin_unlock_irq(&conf->device_lock); 5926 + if (reshape_progress == MaxSector) 5927 + return LOC_NO_RESHAPE; 5928 + if (ahead_of_reshape(mddev, logical_sector, reshape_progress)) 5929 + return LOC_AHEAD_OF_RESHAPE; 5930 + if (ahead_of_reshape(mddev, logical_sector, reshape_safe)) 5931 + return LOC_INSIDE_RESHAPE; 5932 + return LOC_BEHIND_RESHAPE; 5933 + } 5934 + 5902 5935 static enum stripe_result make_stripe_request(struct mddev *mddev, 5903 5936 struct r5conf *conf, struct stripe_request_ctx *ctx, 5904 5937 sector_t logical_sector, struct bio *bi) ··· 5946 5913 seq = read_seqcount_begin(&conf->gen_lock); 5947 5914 5948 5915 if (unlikely(conf->reshape_progress != MaxSector)) { 5949 - /* 5950 - * Spinlock is needed as reshape_progress may be 5951 - * 64bit on a 32bit platform, and so it might be 5952 - * possible to see a half-updated value 5953 - * Of 
course reshape_progress could change after 5954 - * the lock is dropped, so once we get a reference 5955 - * to the stripe that we think it is, we will have 5956 - * to check again. 5957 - */ 5958 - spin_lock_irq(&conf->device_lock); 5959 - if (ahead_of_reshape(mddev, logical_sector, 5960 - conf->reshape_progress)) { 5961 - previous = 1; 5962 - } else { 5963 - if (ahead_of_reshape(mddev, logical_sector, 5964 - conf->reshape_safe)) { 5965 - spin_unlock_irq(&conf->device_lock); 5966 - ret = STRIPE_SCHEDULE_AND_RETRY; 5967 - goto out; 5968 - } 5916 + enum reshape_loc loc = get_reshape_loc(mddev, conf, 5917 + logical_sector); 5918 + if (loc == LOC_INSIDE_RESHAPE) { 5919 + ret = STRIPE_SCHEDULE_AND_RETRY; 5920 + goto out; 5969 5921 } 5970 - spin_unlock_irq(&conf->device_lock); 5922 + if (loc == LOC_AHEAD_OF_RESHAPE) 5923 + previous = 1; 5971 5924 } 5972 5925 5973 5926 new_sector = raid5_compute_sector(conf, logical_sector, previous, ··· 6097 6078 ctx.do_flush = bi->bi_opf & REQ_PREFLUSH; 6098 6079 } 6099 6080 6100 - if (!md_write_start(mddev, bi)) 6101 - return false; 6081 + md_write_start(mddev, bi); 6102 6082 /* 6103 6083 * If array is degraded, better not do chunk aligned read because 6104 6084 * later we might have to read it again in order to reconstruct ··· 6131 6113 /* Bail out if conflicts with reshape and REQ_NOWAIT is set */ 6132 6114 if ((bi->bi_opf & REQ_NOWAIT) && 6133 6115 (conf->reshape_progress != MaxSector) && 6134 - !ahead_of_reshape(mddev, logical_sector, conf->reshape_progress) && 6135 - ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) { 6116 + get_reshape_loc(mddev, conf, logical_sector) == LOC_INSIDE_RESHAPE) { 6136 6117 bio_wouldblock_error(bi); 6137 6118 if (rw == WRITE) 6138 6119 md_write_end(mddev); ··· 6272 6255 safepos = conf->reshape_safe; 6273 6256 sector_div(safepos, data_disks); 6274 6257 if (mddev->reshape_backwards) { 6275 - BUG_ON(writepos < reshape_sectors); 6258 + if (WARN_ON(writepos < reshape_sectors)) 6259 + return 
MaxSector; 6260 + 6276 6261 writepos -= reshape_sectors; 6277 6262 readpos += reshape_sectors; 6278 6263 safepos += reshape_sectors; ··· 6292 6273 * to set 'stripe_addr' which is where we will write to. 6293 6274 */ 6294 6275 if (mddev->reshape_backwards) { 6295 - BUG_ON(conf->reshape_progress == 0); 6276 + if (WARN_ON(conf->reshape_progress == 0)) 6277 + return MaxSector; 6278 + 6296 6279 stripe_addr = writepos; 6297 - BUG_ON((mddev->dev_sectors & 6298 - ~((sector_t)reshape_sectors - 1)) 6299 - - reshape_sectors - stripe_addr 6300 - != sector_nr); 6280 + if (WARN_ON((mddev->dev_sectors & 6281 + ~((sector_t)reshape_sectors - 1)) - 6282 + reshape_sectors - stripe_addr != sector_nr)) 6283 + return MaxSector; 6301 6284 } else { 6302 - BUG_ON(writepos != sector_nr + reshape_sectors); 6285 + if (WARN_ON(writepos != sector_nr + reshape_sectors)) 6286 + return MaxSector; 6287 + 6303 6288 stripe_addr = sector_nr; 6304 6289 } 6305 6290 ··· 6481 6458 } 6482 6459 6483 6460 static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_nr, 6484 - int *skipped) 6461 + sector_t max_sector, int *skipped) 6485 6462 { 6486 6463 struct r5conf *conf = mddev->private; 6487 6464 struct stripe_head *sh; 6488 - sector_t max_sector = mddev->dev_sectors; 6489 6465 sector_t sync_blocks; 6490 6466 int still_degraded = 0; 6491 6467 int i; ··· 7104 7082 err = -ENODEV; 7105 7083 else if (new != conf->skip_copy) { 7106 7084 struct request_queue *q = mddev->gendisk->queue; 7085 + struct queue_limits lim = queue_limits_start_update(q); 7107 7086 7108 7087 conf->skip_copy = new; 7109 7088 if (new) 7110 - blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, q); 7089 + lim.features |= BLK_FEAT_STABLE_WRITES; 7111 7090 else 7112 - blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q); 7091 + lim.features &= ~BLK_FEAT_STABLE_WRITES; 7092 + err = queue_limits_commit_update(q, &lim); 7113 7093 } 7114 7094 mddev_unlock_and_resume(mddev); 7115 7095 return err ?: len; ··· 7586 7562 if 
(test_bit(Replacement, &rdev->flags)) { 7587 7563 if (disk->replacement) 7588 7564 goto abort; 7589 - RCU_INIT_POINTER(disk->replacement, rdev); 7565 + disk->replacement = rdev; 7590 7566 } else { 7591 7567 if (disk->rdev) 7592 7568 goto abort; 7593 - RCU_INIT_POINTER(disk->rdev, rdev); 7569 + disk->rdev = rdev; 7594 7570 } 7595 7571 7596 7572 if (test_bit(In_sync, &rdev->flags)) { ··· 7726 7702 */ 7727 7703 stripe = roundup_pow_of_two(data_disks * (mddev->chunk_sectors << 9)); 7728 7704 7729 - blk_set_stacking_limits(&lim); 7705 + md_init_stacking_limits(&lim); 7730 7706 lim.io_min = mddev->chunk_sectors << 9; 7731 7707 lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded); 7732 - lim.raid_partial_stripes_expensive = 1; 7708 + lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE; 7733 7709 lim.discard_granularity = stripe; 7734 7710 lim.max_write_zeroes_sectors = 0; 7735 - mddev_stack_rdev_limits(mddev, &lim); 7711 + mddev_stack_rdev_limits(mddev, &lim, 0); 7736 7712 rdev_for_each(rdev, mddev) 7737 7713 queue_limits_stack_bdev(&lim, rdev->bdev, rdev->new_data_offset, 7738 7714 mddev->gendisk->disk_name); ··· 8072 8048 seq_printf (seq, "]"); 8073 8049 } 8074 8050 8075 - static void print_raid5_conf (struct r5conf *conf) 8051 + static void print_raid5_conf(struct r5conf *conf) 8076 8052 { 8077 8053 struct md_rdev *rdev; 8078 8054 int i; ··· 8086 8062 conf->raid_disks, 8087 8063 conf->raid_disks - conf->mddev->degraded); 8088 8064 8089 - rcu_read_lock(); 8090 8065 for (i = 0; i < conf->raid_disks; i++) { 8091 - rdev = rcu_dereference(conf->disks[i].rdev); 8066 + rdev = conf->disks[i].rdev; 8092 8067 if (rdev) 8093 8068 pr_debug(" disk %d, o:%d, dev:%pg\n", 8094 8069 i, !test_bit(Faulty, &rdev->flags), 8095 8070 rdev->bdev); 8096 8071 } 8097 - rcu_read_unlock(); 8098 8072 } 8099 8073 8100 8074 static int raid5_spare_active(struct mddev *mddev)
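`get_reshape_loc()` folds the two `ahead_of_reshape()` comparisons into one classification. A userspace sketch of that classification follows. The locking is omitted, the `sketch_*` names are hypothetical, and the body of `sketch_ahead_of_reshape()` is an assumption about what the kernel helper does, since it is not part of the hunks shown.

```c
#include <stdbool.h>

typedef unsigned long long sector_t;
#define SKETCH_MAX_SECTOR (~(sector_t)0)	/* stand-in for MaxSector */

enum reshape_loc {
	LOC_NO_RESHAPE,
	LOC_AHEAD_OF_RESHAPE,
	LOC_INSIDE_RESHAPE,
	LOC_BEHIND_RESHAPE,
};

/* Assumed semantics: "ahead" depends on the direction of the reshape. */
static bool sketch_ahead_of_reshape(bool backwards, sector_t sector,
				    sector_t reshape_sector)
{
	return backwards ? sector < reshape_sector : sector >= reshape_sector;
}

/* Classify logical_sector against a snapshot of progress and safe. */
enum reshape_loc sketch_get_reshape_loc(bool backwards, sector_t progress,
					sector_t safe, sector_t logical_sector)
{
	if (progress == SKETCH_MAX_SECTOR)
		return LOC_NO_RESHAPE;
	if (sketch_ahead_of_reshape(backwards, logical_sector, progress))
		return LOC_AHEAD_OF_RESHAPE;	/* still in the old layout */
	if (sketch_ahead_of_reshape(backwards, logical_sector, safe))
		return LOC_INSIDE_RESHAPE;	/* between safe and progress */
	return LOC_BEHIND_RESHAPE;		/* already in the new layout */
}
```

In the diff, `LOC_INSIDE_RESHAPE` is the case that makes `make_stripe_request()` return `STRIPE_SCHEDULE_AND_RETRY` and makes a `REQ_NOWAIT` bio fail with `bio_wouldblock_error()`.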

+19 -23

drivers/mmc/core/block.c

··· 2466 2466 struct mmc_blk_data *md; 2467 2467 int devidx, ret; 2468 2468 char cap_str[10]; 2469 - bool cache_enabled = false; 2470 - bool fua_enabled = false; 2469 + unsigned int features = 0; 2471 2470 2472 2471 devidx = ida_alloc_max(&mmc_blk_ida, max_devices - 1, GFP_KERNEL); 2473 2472 if (devidx < 0) { ··· 2498 2499 */ 2499 2500 md->read_only = mmc_blk_readonly(card); 2500 2501 2501 - md->disk = mmc_init_queue(&md->queue, card); 2502 + if (mmc_host_cmd23(card->host)) { 2503 + if ((mmc_card_mmc(card) && 2504 + card->csd.mmca_vsn >= CSD_SPEC_VER_3) || 2505 + (mmc_card_sd(card) && 2506 + card->scr.cmds & SD_SCR_CMD23_SUPPORT)) 2507 + md->flags |= MMC_BLK_CMD23; 2508 + } 2509 + 2510 + if (md->flags & MMC_BLK_CMD23 && 2511 + ((card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN) || 2512 + card->ext_csd.rel_sectors)) { 2513 + md->flags |= MMC_BLK_REL_WR; 2514 + features |= (BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA); 2515 + } else if (mmc_cache_enabled(card->host)) { 2516 + features |= BLK_FEAT_WRITE_CACHE; 2517 + } 2518 + 2519 + md->disk = mmc_init_queue(&md->queue, card, features); 2502 2520 if (IS_ERR(md->disk)) { 2503 2521 ret = PTR_ERR(md->disk); 2504 2522 goto err_kfree; ··· 2554 2538 "mmcblk%u%s", card->host->index, subname ? 
subname : ""); 2555 2539 2556 2540 set_capacity(md->disk, size); 2557 - 2558 - if (mmc_host_cmd23(card->host)) { 2559 - if ((mmc_card_mmc(card) && 2560 - card->csd.mmca_vsn >= CSD_SPEC_VER_3) || 2561 - (mmc_card_sd(card) && 2562 - card->scr.cmds & SD_SCR_CMD23_SUPPORT)) 2563 - md->flags |= MMC_BLK_CMD23; 2564 - } 2565 - 2566 - if (md->flags & MMC_BLK_CMD23 && 2567 - ((card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN) || 2568 - card->ext_csd.rel_sectors)) { 2569 - md->flags |= MMC_BLK_REL_WR; 2570 - fua_enabled = true; 2571 - cache_enabled = true; 2572 - } 2573 - if (mmc_cache_enabled(card->host)) 2574 - cache_enabled = true; 2575 - 2576 - blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled); 2577 2541 2578 2542 string_get_size((u64)size, 512, STRING_UNITS_2, 2579 2543 cap_str, sizeof(cap_str));
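The hunks above move the write-cache/FUA decision ahead of queue allocation and fold two booleans into one feature mask. A minimal userspace sketch, assuming made-up bit values (`SKETCH_FEAT_*` are stand-ins, not the kernel's `BLK_FEAT_*` definitions), suggests the old and new branches agree for every input:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for BLK_FEAT_WRITE_CACHE / BLK_FEAT_FUA. */
#define SKETCH_FEAT_WRITE_CACHE (1u << 0)
#define SKETCH_FEAT_FUA         (1u << 1)

/* Old shape: two booleans, later handed to blk_queue_write_cache(). */
static unsigned int sketch_old_features(bool rel_wr, bool cache_on)
{
	bool cache_enabled = false, fua_enabled = false;

	if (rel_wr) {
		fua_enabled = true;
		cache_enabled = true;
	}
	if (cache_on)
		cache_enabled = true;

	return (cache_enabled ? SKETCH_FEAT_WRITE_CACHE : 0) |
	       (fua_enabled ? SKETCH_FEAT_FUA : 0);
}

/* New shape: one mask, built before the disk is allocated. */
static unsigned int sketch_new_features(bool rel_wr, bool cache_on)
{
	unsigned int features = 0;

	if (rel_wr)
		features |= SKETCH_FEAT_WRITE_CACHE | SKETCH_FEAT_FUA;
	else if (cache_on)
		features |= SKETCH_FEAT_WRITE_CACHE;

	return features;
}

/* Counts the input combinations on which the two shapes disagree. */
static int sketch_count_mismatches(void)
{
	int bad = 0;

	for (int r = 0; r < 2; r++)
		for (int c = 0; c < 2; c++)
			if (sketch_old_features(r, c) != sketch_new_features(r, c))
				bad++;
	return bad;
}
```

Here `rel_wr` stands for the combined CMD23 plus reliable-write condition and `cache_on` for `mmc_cache_enabled()`. The `else if` is safe because the reliable-write branch already sets the cache bit.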

+11 -9

drivers/mmc/core/queue.c

··· 344 344 }; 345 345 346 346 static struct gendisk *mmc_alloc_disk(struct mmc_queue *mq, 347 - struct mmc_card *card) 347 + struct mmc_card *card, unsigned int features) 348 348 { 349 349 struct mmc_host *host = card->host; 350 - struct queue_limits lim = { }; 350 + struct queue_limits lim = { 351 + .features = features, 352 + }; 351 353 struct gendisk *disk; 352 354 353 355 if (mmc_can_erase(card)) ··· 378 376 lim.max_segments = host->max_segs; 379 377 } 380 378 379 + if (mmc_host_is_spi(host) && host->use_spi_crc) 380 + lim.features |= BLK_FEAT_STABLE_WRITES; 381 + 381 382 disk = blk_mq_alloc_disk(&mq->tag_set, &lim, mq); 382 383 if (IS_ERR(disk)) 383 384 return disk; 384 385 mq->queue = disk->queue; 385 386 386 - if (mmc_host_is_spi(host) && host->use_spi_crc) 387 - blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, mq->queue); 388 387 blk_queue_rq_timeout(mq->queue, 60 * HZ); 389 - 390 - blk_queue_flag_set(QUEUE_FLAG_NONROT, mq->queue); 391 - blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, mq->queue); 392 388 393 389 dma_set_max_seg_size(mmc_dev(host), queue_max_segment_size(mq->queue)); 394 390 ··· 413 413 * mmc_init_queue - initialise a queue structure. 414 414 * @mq: mmc queue 415 415 * @card: mmc card to attach this queue 416 + * @features: block layer features (BLK_FEAT_*) 416 417 * 417 418 * Initialise a MMC card request queue. 418 419 */ 419 - struct gendisk *mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card) 420 + struct gendisk *mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, 421 + unsigned int features) 420 422 { 421 423 struct mmc_host *host = card->host; 422 424 struct gendisk *disk; ··· 462 460 return ERR_PTR(ret); 463 461 464 462 465 - disk = mmc_alloc_disk(mq, card); 463 + disk = mmc_alloc_disk(mq, card, features); 466 464 if (IS_ERR(disk)) 467 465 blk_mq_free_tag_set(&mq->tag_set); 468 466 return disk;

+2 -1

drivers/mmc/core/queue.h

··· 94 94
 	struct work_struct complete_work;
 };
 
-struct gendisk *mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card);
+struct gendisk *mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card,
+		unsigned int features);
 extern void mmc_cleanup_queue(struct mmc_queue *);
 extern void mmc_queue_suspend(struct mmc_queue *);
 extern void mmc_queue_resume(struct mmc_queue *);

+2 -7

drivers/mtd/mtd_blkdevs.c

··· 336 336
 	lim.logical_block_size = tr->blksize;
 	if (tr->discard)
 		lim.max_hw_discard_sectors = UINT_MAX;
+	if (tr->flush)
+		lim.features |= BLK_FEAT_WRITE_CACHE;
 
 	/* Create gendisk */
 	gd = blk_mq_alloc_disk(new->tag_set, &lim, new);
··· 374 372
 	/* Create the request queue */
 	spin_lock_init(&new->queue_lock);
 	INIT_LIST_HEAD(&new->rq_list);
-
-	if (tr->flush)
-		blk_queue_write_cache(new->rq, true, false);
-
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, new->rq);
-	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, new->rq);
-
 	gd->queue = new->rq;
 
 	if (new->readonly)

+6 -11

drivers/nvdimm/btt.c

··· 1501 1501
 		.logical_block_size = btt->sector_size,
 		.max_hw_sectors = UINT_MAX,
 		.max_integrity_segments = 1,
+		.features = BLK_FEAT_SYNCHRONOUS,
 	};
 	int rc;
+
+	if (btt_meta_size(btt) && IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) {
+		lim.integrity.tuple_size = btt_meta_size(btt);
+		lim.integrity.tag_size = btt_meta_size(btt);
+	}
 
 	btt->btt_disk = blk_alloc_disk(&lim, NUMA_NO_NODE);
 	if (IS_ERR(btt->btt_disk))
··· 1518 1512
 	btt->btt_disk->first_minor = 0;
 	btt->btt_disk->fops = &btt_fops;
 	btt->btt_disk->private_data = btt;
-
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, btt->btt_disk->queue);
-	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, btt->btt_disk->queue);
-
-	if (btt_meta_size(btt) && IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) {
-		struct blk_integrity bi = {
-			.tuple_size = btt_meta_size(btt),
-			.tag_size = btt_meta_size(btt),
-		};
-		blk_integrity_register(btt->btt_disk, &bi);
-	}
 
 	set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);
 	rc = device_add_disk(&btt->nd_btt->dev, btt->btt_disk, NULL);

+6 -8

drivers/nvdimm/pmem.c

··· 455 455
 		.logical_block_size = pmem_sector_size(ndns),
 		.physical_block_size = PAGE_SIZE,
 		.max_hw_sectors = UINT_MAX,
+		.features = BLK_FEAT_WRITE_CACHE |
+			BLK_FEAT_SYNCHRONOUS,
 	};
 	int nid = dev_to_node(dev), fua;
 	struct resource *res = &nsio->res;
··· 465 463
 	struct dax_device *dax_dev;
 	struct nd_pfn_sb *pfn_sb;
 	struct pmem_device *pmem;
-	struct request_queue *q;
 	struct gendisk *disk;
 	void *addr;
 	int rc;
··· 496 495
 		dev_warn(dev, "unable to guarantee persistence of writes\n");
 		fua = 0;
 	}
+	if (fua)
+		lim.features |= BLK_FEAT_FUA;
+	if (is_nd_pfn(dev))
+		lim.features |= BLK_FEAT_DAX;
 
 	if (!devm_request_mem_region(dev, res->start, resource_size(res),
 				dev_name(&ndns->dev))) {
··· 510 505
 	disk = blk_alloc_disk(&lim, nid);
 	if (IS_ERR(disk))
 		return PTR_ERR(disk);
-	q = disk->queue;
 
 	pmem->disk = disk;
 	pmem->pgmap.owner = pmem;
··· 546 542
 		goto out;
 	}
 	pmem->virt_addr = addr;
-
-	blk_queue_write_cache(q, true, fua);
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
-	if (pmem->pfn_flags & PFN_MAP)
-		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
 
 	disk->fops = &pmem_fops;
 	disk->private_data = pmem;

-1

drivers/nvme/host/Kconfig

··· 1 1
 # SPDX-License-Identifier: GPL-2.0-only
 config NVME_CORE
 	tristate
-	select BLK_DEV_INTEGRITY_T10 if BLK_DEV_INTEGRITY
 
 config BLK_DEV_NVME
 	tristate "NVM Express block device"

+27 -5

drivers/nvme/host/apple.c

··· 1388 1388 mempool_destroy(data); 1389 1389 } 1390 1390 1391 - static int apple_nvme_probe(struct platform_device *pdev) 1391 + static struct apple_nvme *apple_nvme_alloc(struct platform_device *pdev) 1392 1392 { 1393 1393 struct device *dev = &pdev->dev; 1394 1394 struct apple_nvme *anv; ··· 1396 1396 1397 1397 anv = devm_kzalloc(dev, sizeof(*anv), GFP_KERNEL); 1398 1398 if (!anv) 1399 - return -ENOMEM; 1399 + return ERR_PTR(-ENOMEM); 1400 1400 1401 1401 anv->dev = get_device(dev); 1402 1402 anv->adminq.is_adminq = true; ··· 1516 1516 goto put_dev; 1517 1517 } 1518 1518 1519 + return anv; 1520 + put_dev: 1521 + put_device(anv->dev); 1522 + return ERR_PTR(ret); 1523 + } 1524 + 1525 + static int apple_nvme_probe(struct platform_device *pdev) 1526 + { 1527 + struct apple_nvme *anv; 1528 + int ret; 1529 + 1530 + anv = apple_nvme_alloc(pdev); 1531 + if (IS_ERR(anv)) 1532 + return PTR_ERR(anv); 1533 + 1534 + ret = nvme_add_ctrl(&anv->ctrl); 1535 + if (ret) 1536 + goto out_put_ctrl; 1537 + 1519 1538 anv->ctrl.admin_q = blk_mq_alloc_queue(&anv->admin_tagset, NULL, NULL); 1520 1539 if (IS_ERR(anv->ctrl.admin_q)) { 1521 1540 ret = -ENOMEM; 1522 - goto put_dev; 1541 + anv->ctrl.admin_q = NULL; 1542 + goto out_uninit_ctrl; 1523 1543 } 1524 1544 1525 1545 nvme_reset_ctrl(&anv->ctrl); ··· 1547 1527 1548 1528 return 0; 1549 1529 1550 - put_dev: 1551 - put_device(anv->dev); 1530 + out_uninit_ctrl: 1531 + nvme_uninit_ctrl(&anv->ctrl); 1532 + out_put_ctrl: 1533 + nvme_put_ctrl(&anv->ctrl); 1552 1534 return ret; 1553 1535 } 1554 1536
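The probe path above is split into an allocation step that returns an error pointer and an add step that takes an extra reference, with two unwind labels. A minimal userspace sketch of that shape follows. Every `sketch_*` name is hypothetical, the error-pointer helpers only imitate the kernel's `ERR_PTR()`/`IS_ERR()`/`PTR_ERR()`, and the assumption that the uninit step drops the reference taken by the add step is inferred from the comments in this merge rather than shown in it:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace imitation of the kernel error-pointer helpers (assumption). */
#define SKETCH_MAX_ERRNO 4095
static inline void *sketch_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long sketch_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_ctrl { int refs; };

static int sketch_live;	/* controllers allocated and not yet freed */

/* Allocation step: returns a controller holding one reference, or an error pointer. */
static struct sketch_ctrl *sketch_alloc(int fail)
{
	struct sketch_ctrl *c;

	if (fail)
		return sketch_err_ptr(-ENOMEM);
	c = calloc(1, sizeof(*c));
	if (!c)
		return sketch_err_ptr(-ENOMEM);
	c->refs = 1;
	sketch_live++;
	return c;
}

/* Drops one reference and frees the controller when none remain. */
static void sketch_put(struct sketch_ctrl *c)
{
	if (--c->refs == 0) {
		free(c);
		sketch_live--;
	}
}

/* Add step: on success the controller holds one extra reference. */
static int sketch_add(struct sketch_ctrl *c, int fail)
{
	if (fail)
		return -EBUSY;
	c->refs++;
	return 0;
}

/* Uninit step: assumed to drop the reference taken by sketch_add(). */
static void sketch_uninit(struct sketch_ctrl *c)
{
	sketch_put(c);
}

static int sketch_probe(int fail_alloc, int fail_add, int fail_queue,
			struct sketch_ctrl **out)
{
	struct sketch_ctrl *c = sketch_alloc(fail_alloc);
	int ret;

	if (sketch_is_err(c))
		return (int)sketch_ptr_err(c);

	ret = sketch_add(c, fail_add);
	if (ret)
		goto out_put_ctrl;	/* only the initial reference exists */

	if (fail_queue) {
		ret = -ENOMEM;
		goto out_uninit_ctrl;	/* both references exist */
	}

	*out = c;
	return 0;

out_uninit_ctrl:
	sketch_uninit(c);
out_put_ctrl:
	sketch_put(c);
	return ret;
}
```

Each failure point unwinds exactly the references taken so far, which is why the labels are ordered from the latest stage to the earliest.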

+1 -1

drivers/nvme/host/constants.c

··· 173 173
 
 const char *nvme_get_error_status_str(u16 status)
 {
-	status &= 0x7ff;
+	status &= NVME_SCT_SC_MASK;
 	if (status < ARRAY_SIZE(nvme_statuses) && nvme_statuses[status])
 		return nvme_statuses[status];
 	return "Unknown";
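This merge replaces open-coded status arithmetic (`& 0x7ff`, `>> 8 & 7`, `& 0xff`, CRD `>> 11`) with named masks. The sketch below decodes a status word the same way in userspace. The `SKETCH_*` names are hypothetical; the SC, SCT and CRD masks are inferred from the expressions being replaced, while the MORE and DNR bit positions are an assumption not visible in this diff:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SC_MASK     0x00ffu	/* status code, from "& 0xff" */
#define SKETCH_SCT_MASK    0x0700u	/* status code type, from ">> 8 & 7" */
#define SKETCH_SCT_SC_MASK (SKETCH_SCT_MASK | SKETCH_SC_MASK)	/* 0x7ff */
#define SKETCH_STATUS_CRD  0x1800u	/* retry delay index, shifted by 11 */
#define SKETCH_STATUS_MORE 0x2000u	/* assumed bit position */
#define SKETCH_STATUS_DNR  0x4000u	/* assumed bit position */

struct sketch_status {
	uint8_t sc;	/* status code */
	uint8_t sct;	/* status code type */
	uint8_t crd;	/* command retry delay index, 0..3 */
	bool more;
	bool dnr;	/* do not retry */
};

/* Splits a 16-bit status word into its named fields. */
static struct sketch_status sketch_decode(uint16_t status)
{
	struct sketch_status s = {
		.sc   = status & SKETCH_SC_MASK,
		.sct  = (status & SKETCH_SCT_MASK) >> 8,
		.crd  = (status & SKETCH_STATUS_CRD) >> 11,
		.more = status & SKETCH_STATUS_MORE,
		.dnr  = status & SKETCH_STATUS_DNR,
	};

	return s;
}

/* Table-lookup key used by the string helper: SCT and SC together. */
static uint16_t sketch_lookup_key(uint16_t status)
{
	return status & SKETCH_SCT_SC_MASK;
}
```

Keeping SCT and SC together in one key mirrors how the hunk above masks the status before indexing the string table, so flag bits such as DNR never affect the lookup.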

+191 -95

drivers/nvme/host/core.c

··· 110 110 EXPORT_SYMBOL_GPL(nvme_delete_wq); 111 111 112 112 static LIST_HEAD(nvme_subsystems); 113 - static DEFINE_MUTEX(nvme_subsystems_lock); 113 + DEFINE_MUTEX(nvme_subsystems_lock); 114 114 115 115 static DEFINE_IDA(nvme_instance_ida); 116 116 static dev_t nvme_ctrl_base_chr_devt; ··· 261 261 262 262 static blk_status_t nvme_error_status(u16 status) 263 263 { 264 - switch (status & 0x7ff) { 264 + switch (status & NVME_SCT_SC_MASK) { 265 265 case NVME_SC_SUCCESS: 266 266 return BLK_STS_OK; 267 267 case NVME_SC_CAP_EXCEEDED: ··· 307 307 u16 crd; 308 308 309 309 /* The mask and shift result must be <= 3 */ 310 - crd = (nvme_req(req)->status & NVME_SC_CRD) >> 11; 310 + crd = (nvme_req(req)->status & NVME_STATUS_CRD) >> 11; 311 311 if (crd) 312 312 delay = nvme_req(req)->ctrl->crdt[crd - 1] * 100; 313 313 ··· 329 329 nvme_sect_to_lba(ns->head, blk_rq_pos(req)), 330 330 blk_rq_bytes(req) >> ns->head->lba_shift, 331 331 nvme_get_error_status_str(nr->status), 332 - nr->status >> 8 & 7, /* Status Code Type */ 333 - nr->status & 0xff, /* Status Code */ 334 - nr->status & NVME_SC_MORE ? "MORE " : "", 335 - nr->status & NVME_SC_DNR ? "DNR " : ""); 332 + NVME_SCT(nr->status), /* Status Code Type */ 333 + nr->status & NVME_SC_MASK, /* Status Code */ 334 + nr->status & NVME_STATUS_MORE ? "MORE " : "", 335 + nr->status & NVME_STATUS_DNR ? "DNR " : ""); 336 336 return; 337 337 } 338 338 ··· 341 341 nvme_get_admin_opcode_str(nr->cmd->common.opcode), 342 342 nr->cmd->common.opcode, 343 343 nvme_get_error_status_str(nr->status), 344 - nr->status >> 8 & 7, /* Status Code Type */ 345 - nr->status & 0xff, /* Status Code */ 346 - nr->status & NVME_SC_MORE ? "MORE " : "", 347 - nr->status & NVME_SC_DNR ? "DNR " : ""); 344 + NVME_SCT(nr->status), /* Status Code Type */ 345 + nr->status & NVME_SC_MASK, /* Status Code */ 346 + nr->status & NVME_STATUS_MORE ? "MORE " : "", 347 + nr->status & NVME_STATUS_DNR ? 
"DNR " : ""); 348 348 } 349 349 350 350 static void nvme_log_err_passthru(struct request *req) ··· 359 359 nvme_get_admin_opcode_str(nr->cmd->common.opcode), 360 360 nr->cmd->common.opcode, 361 361 nvme_get_error_status_str(nr->status), 362 - nr->status >> 8 & 7, /* Status Code Type */ 363 - nr->status & 0xff, /* Status Code */ 364 - nr->status & NVME_SC_MORE ? "MORE " : "", 365 - nr->status & NVME_SC_DNR ? "DNR " : "", 362 + NVME_SCT(nr->status), /* Status Code Type */ 363 + nr->status & NVME_SC_MASK, /* Status Code */ 364 + nr->status & NVME_STATUS_MORE ? "MORE " : "", 365 + nr->status & NVME_STATUS_DNR ? "DNR " : "", 366 366 nr->cmd->common.cdw10, 367 367 nr->cmd->common.cdw11, 368 368 nr->cmd->common.cdw12, ··· 384 384 return COMPLETE; 385 385 386 386 if (blk_noretry_request(req) || 387 - (nvme_req(req)->status & NVME_SC_DNR) || 387 + (nvme_req(req)->status & NVME_STATUS_DNR) || 388 388 nvme_req(req)->retries >= nvme_max_retries) 389 389 return COMPLETE; 390 390 391 - if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED) 391 + if ((nvme_req(req)->status & NVME_SCT_SC_MASK) == NVME_SC_AUTH_REQUIRED) 392 392 return AUTHENTICATE; 393 393 394 394 if (req->cmd_flags & REQ_NVME_MPATH) { ··· 927 927 return BLK_STS_OK; 928 928 } 929 929 930 + /* 931 + * NVMe does not support a dedicated command to issue an atomic write. A write 932 + * which does adhere to the device atomic limits will silently be executed 933 + * non-atomically. The request issuer should ensure that the write is within 934 + * the queue atomic writes limits, but just validate this in case it is not. 
935 + */ 936 + static bool nvme_valid_atomic_write(struct request *req) 937 + { 938 + struct request_queue *q = req->q; 939 + u32 boundary_bytes = queue_atomic_write_boundary_bytes(q); 940 + 941 + if (blk_rq_bytes(req) > queue_atomic_write_unit_max_bytes(q)) 942 + return false; 943 + 944 + if (boundary_bytes) { 945 + u64 mask = boundary_bytes - 1, imask = ~mask; 946 + u64 start = blk_rq_pos(req) << SECTOR_SHIFT; 947 + u64 end = start + blk_rq_bytes(req) - 1; 948 + 949 + /* If greater then must be crossing a boundary */ 950 + if (blk_rq_bytes(req) > boundary_bytes) 951 + return false; 952 + 953 + if ((start & imask) != (end & imask)) 954 + return false; 955 + } 956 + 957 + return true; 958 + } 959 + 930 960 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, 931 961 struct request *req, struct nvme_command *cmnd, 932 962 enum nvme_opcode op) ··· 971 941 972 942 if (req->cmd_flags & REQ_RAHEAD) 973 943 dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH; 944 + 945 + if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req)) 946 + return BLK_STS_INVAL; 974 947 975 948 cmnd->rw.opcode = op; 976 949 cmnd->rw.flags = 0; ··· 1257 1224 1258 1225 /* 1259 1226 * Recommended frequency for KATO commands per NVMe 1.4 section 7.12.1: 1260 - * 1227 + * 1261 1228 * The host should send Keep Alive commands at half of the Keep Alive Timeout 1262 1229 * accounting for transport roundtrip times [..]. 
1263 1230 */ ··· 1757 1724 return 0; 1758 1725 } 1759 1726 1760 - static bool nvme_init_integrity(struct gendisk *disk, struct nvme_ns_head *head) 1727 + static bool nvme_init_integrity(struct gendisk *disk, struct nvme_ns_head *head, 1728 + struct queue_limits *lim) 1761 1729 { 1762 - struct blk_integrity integrity = { }; 1730 + struct blk_integrity *bi = &lim->integrity; 1763 1731 1764 - blk_integrity_unregister(disk); 1732 + memset(bi, 0, sizeof(*bi)); 1765 1733 1766 1734 if (!head->ms) 1767 1735 return true; ··· 1779 1745 case NVME_NS_DPS_PI_TYPE3: 1780 1746 switch (head->guard_type) { 1781 1747 case NVME_NVM_NS_16B_GUARD: 1782 - integrity.profile = &t10_pi_type3_crc; 1783 - integrity.tag_size = sizeof(u16) + sizeof(u32); 1784 - integrity.flags |= BLK_INTEGRITY_DEVICE_CAPABLE; 1748 + bi->csum_type = BLK_INTEGRITY_CSUM_CRC; 1749 + bi->tag_size = sizeof(u16) + sizeof(u32); 1750 + bi->flags |= BLK_INTEGRITY_DEVICE_CAPABLE; 1785 1751 break; 1786 1752 case NVME_NVM_NS_64B_GUARD: 1787 - integrity.profile = &ext_pi_type3_crc64; 1788 - integrity.tag_size = sizeof(u16) + 6; 1789 - integrity.flags |= BLK_INTEGRITY_DEVICE_CAPABLE; 1753 + bi->csum_type = BLK_INTEGRITY_CSUM_CRC64; 1754 + bi->tag_size = sizeof(u16) + 6; 1755 + bi->flags |= BLK_INTEGRITY_DEVICE_CAPABLE; 1790 1756 break; 1791 1757 default: 1792 - integrity.profile = NULL; 1793 1758 break; 1794 1759 } 1795 1760 break; ··· 1796 1763 case NVME_NS_DPS_PI_TYPE2: 1797 1764 switch (head->guard_type) { 1798 1765 case NVME_NVM_NS_16B_GUARD: 1799 - integrity.profile = &t10_pi_type1_crc; 1800 - integrity.tag_size = sizeof(u16); 1801 - integrity.flags |= BLK_INTEGRITY_DEVICE_CAPABLE; 1766 + bi->csum_type = BLK_INTEGRITY_CSUM_CRC; 1767 + bi->tag_size = sizeof(u16); 1768 + bi->flags |= BLK_INTEGRITY_DEVICE_CAPABLE | 1769 + BLK_INTEGRITY_REF_TAG; 1802 1770 break; 1803 1771 case NVME_NVM_NS_64B_GUARD: 1804 - integrity.profile = &ext_pi_type1_crc64; 1805 - integrity.tag_size = sizeof(u16); 1806 - integrity.flags |= 
BLK_INTEGRITY_DEVICE_CAPABLE; 1772 + bi->csum_type = BLK_INTEGRITY_CSUM_CRC64; 1773 + bi->tag_size = sizeof(u16); 1774 + bi->flags |= BLK_INTEGRITY_DEVICE_CAPABLE | 1775 + BLK_INTEGRITY_REF_TAG; 1807 1776 break; 1808 1777 default: 1809 - integrity.profile = NULL; 1810 1778 break; 1811 1779 } 1812 1780 break; 1813 1781 default: 1814 - integrity.profile = NULL; 1815 1782 break; 1816 1783 } 1817 1784 1818 - integrity.tuple_size = head->ms; 1819 - integrity.pi_offset = head->pi_offset; 1820 - blk_integrity_register(disk, &integrity); 1785 + bi->tuple_size = head->ms; 1786 + bi->pi_offset = head->pi_offset; 1821 1787 return true; 1822 1788 } 1823 1789 ··· 1954 1922 } 1955 1923 } 1956 1924 1925 + 1926 + static void nvme_update_atomic_write_disk_info(struct nvme_ns *ns, 1927 + struct nvme_id_ns *id, struct queue_limits *lim, 1928 + u32 bs, u32 atomic_bs) 1929 + { 1930 + unsigned int boundary = 0; 1931 + 1932 + if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) { 1933 + if (le16_to_cpu(id->nabspf)) 1934 + boundary = (le16_to_cpu(id->nabspf) + 1) * bs; 1935 + } 1936 + lim->atomic_write_hw_max = atomic_bs; 1937 + lim->atomic_write_hw_boundary = boundary; 1938 + lim->atomic_write_hw_unit_min = bs; 1939 + lim->atomic_write_hw_unit_max = rounddown_pow_of_two(atomic_bs); 1940 + } 1941 + 1957 1942 static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl) 1958 1943 { 1959 1944 return ctrl->max_hw_sectors / (NVME_CTRL_PAGE_SIZE >> SECTOR_SHIFT) + 1; ··· 2017 1968 atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs; 2018 1969 else 2019 1970 atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs; 1971 + 1972 + nvme_update_atomic_write_disk_info(ns, id, lim, bs, atomic_bs); 2020 1973 } 2021 1974 2022 1975 if (id->nsfeat & NVME_NS_FEAT_IO_OPT) { 2023 1976 /* NPWG = Namespace Preferred Write Granularity */ 2024 1977 phys_bs = bs * (1 + le16_to_cpu(id->npwg)); 2025 1978 /* NOWS = Namespace Optimal Write Size */ 2026 - io_opt = bs * (1 + le16_to_cpu(id->nows)); 1979 + if (id->nows) 1980 + io_opt = bs 
* (1 + le16_to_cpu(id->nows)); 2027 1981 } 2028 1982 2029 1983 /* ··· 2110 2058 static int nvme_update_ns_info_block(struct nvme_ns *ns, 2111 2059 struct nvme_ns_info *info) 2112 2060 { 2113 - bool vwc = ns->ctrl->vwc & NVME_CTRL_VWC_PRESENT; 2114 2061 struct queue_limits lim; 2115 2062 struct nvme_id_ns_nvm *nvm = NULL; 2116 2063 struct nvme_zone_info zi = {}; ··· 2158 2107 if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && 2159 2108 ns->head->ids.csi == NVME_CSI_ZNS) 2160 2109 nvme_update_zone_info(ns, &lim, &zi); 2161 - ret = queue_limits_commit_update(ns->disk->queue, &lim); 2162 - if (ret) { 2163 - blk_mq_unfreeze_queue(ns->disk->queue); 2164 - goto out; 2165 - } 2110 + 2111 + if (ns->ctrl->vwc & NVME_CTRL_VWC_PRESENT) 2112 + lim.features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA; 2113 + else 2114 + lim.features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA); 2166 2115 2167 2116 /* 2168 2117 * Register a metadata profile for PI, or the plain non-integrity NVMe ··· 2170 2119 * I/O to namespaces with metadata except when the namespace supports 2171 2120 * PI, as it can strip/insert in that case. 
2172 2121 */ 2173 - if (!nvme_init_integrity(ns->disk, ns->head)) 2122 + if (!nvme_init_integrity(ns->disk, ns->head, &lim)) 2174 2123 capacity = 0; 2124 + 2125 + ret = queue_limits_commit_update(ns->disk->queue, &lim); 2126 + if (ret) { 2127 + blk_mq_unfreeze_queue(ns->disk->queue); 2128 + goto out; 2129 + } 2175 2130 2176 2131 set_capacity_and_notify(ns->disk, capacity); 2177 2132 ··· 2190 2133 if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3))) 2191 2134 ns->head->features |= NVME_NS_DEAC; 2192 2135 set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info)); 2193 - blk_queue_write_cache(ns->disk->queue, vwc, vwc); 2194 2136 set_bit(NVME_NS_READY, &ns->flags); 2195 2137 blk_mq_unfreeze_queue(ns->disk->queue); 2196 2138 ··· 2249 2193 struct queue_limits lim; 2250 2194 2251 2195 blk_mq_freeze_queue(ns->head->disk->queue); 2252 - if (unsupported) 2253 - ns->head->disk->flags |= GENHD_FL_HIDDEN; 2254 - else 2255 - nvme_init_integrity(ns->head->disk, ns->head); 2256 - set_capacity_and_notify(ns->head->disk, get_capacity(ns->disk)); 2257 - set_disk_ro(ns->head->disk, nvme_ns_is_readonly(ns, info)); 2258 - nvme_mpath_revalidate_paths(ns); 2259 - 2260 2196 /* 2261 2197 * queue_limits mixes values that are the hardware limitations 2262 2198 * for bio splitting with what is the device configuration. 
··· 2271 2223 lim.io_opt = ns_lim->io_opt; 2272 2224 queue_limits_stack_bdev(&lim, ns->disk->part0, 0, 2273 2225 ns->head->disk->disk_name); 2226 + if (unsupported) 2227 + ns->head->disk->flags |= GENHD_FL_HIDDEN; 2228 + else 2229 + nvme_init_integrity(ns->head->disk, ns->head, &lim); 2274 2230 ret = queue_limits_commit_update(ns->head->disk->queue, &lim); 2231 + 2232 + set_capacity_and_notify(ns->head->disk, get_capacity(ns->disk)); 2233 + set_disk_ro(ns->head->disk, nvme_ns_is_readonly(ns, info)); 2234 + nvme_mpath_revalidate_paths(ns); 2235 + 2275 2236 blk_mq_unfreeze_queue(ns->head->disk->queue); 2276 2237 } 2277 2238 2278 2239 return ret; 2240 + } 2241 + 2242 + int nvme_ns_get_unique_id(struct nvme_ns *ns, u8 id[16], 2243 + enum blk_unique_id type) 2244 + { 2245 + struct nvme_ns_ids *ids = &ns->head->ids; 2246 + 2247 + if (type != BLK_UID_EUI64) 2248 + return -EINVAL; 2249 + 2250 + if (memchr_inv(ids->nguid, 0, sizeof(ids->nguid))) { 2251 + memcpy(id, &ids->nguid, sizeof(ids->nguid)); 2252 + return sizeof(ids->nguid); 2253 + } 2254 + if (memchr_inv(ids->eui64, 0, sizeof(ids->eui64))) { 2255 + memcpy(id, &ids->eui64, sizeof(ids->eui64)); 2256 + return sizeof(ids->eui64); 2257 + } 2258 + 2259 + return -EINVAL; 2260 + } 2261 + 2262 + static int nvme_get_unique_id(struct gendisk *disk, u8 id[16], 2263 + enum blk_unique_id type) 2264 + { 2265 + return nvme_ns_get_unique_id(disk->private_data, id, type); 2279 2266 } 2280 2267 2281 2268 #ifdef CONFIG_BLK_SED_OPAL ··· 2368 2285 .open = nvme_open, 2369 2286 .release = nvme_release, 2370 2287 .getgeo = nvme_getgeo, 2288 + .get_unique_id = nvme_get_unique_id, 2371 2289 .report_zones = nvme_report_zones, 2372 2290 .pr_ops = &nvme_pr_ops, 2373 2291 }; ··· 3805 3721 3806 3722 static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info) 3807 3723 { 3724 + struct queue_limits lim = { }; 3808 3725 struct nvme_ns *ns; 3809 3726 struct gendisk *disk; 3810 3727 int node = ctrl->numa_node; ··· 3814 3729 if (!ns) 
3815 3730 return; 3816 3731 3817 - disk = blk_mq_alloc_disk(ctrl->tagset, NULL, ns); 3732 + if (ctrl->opts && ctrl->opts->data_digest) 3733 + lim.features |= BLK_FEAT_STABLE_WRITES; 3734 + if (ctrl->ops->supports_pci_p2pdma && 3735 + ctrl->ops->supports_pci_p2pdma(ctrl)) 3736 + lim.features |= BLK_FEAT_PCI_P2PDMA; 3737 + 3738 + disk = blk_mq_alloc_disk(ctrl->tagset, &lim, ns); 3818 3739 if (IS_ERR(disk)) 3819 3740 goto out_free_ns; 3820 3741 disk->fops = &nvme_bdev_ops; ··· 3828 3737 3829 3738 ns->disk = disk; 3830 3739 ns->queue = disk->queue; 3831 - 3832 - if (ctrl->opts && ctrl->opts->data_digest) 3833 - blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue); 3834 - 3835 - blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue); 3836 - if (ctrl->ops->supports_pci_p2pdma && 3837 - ctrl->ops->supports_pci_p2pdma(ctrl)) 3838 - blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue); 3839 - 3840 3740 ns->ctrl = ctrl; 3841 3741 kref_init(&ns->kref); 3842 3742 ··· 3969 3887 3970 3888 static void nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info) 3971 3889 { 3972 - int ret = NVME_SC_INVALID_NS | NVME_SC_DNR; 3890 + int ret = NVME_SC_INVALID_NS | NVME_STATUS_DNR; 3973 3891 3974 3892 if (!nvme_ns_ids_equal(&ns->head->ids, &info->ids)) { 3975 3893 dev_err(ns->ctrl->device, ··· 3985 3903 * 3986 3904 * TODO: we should probably schedule a delayed retry here. 3987 3905 */ 3988 - if (ret > 0 && (ret & NVME_SC_DNR)) 3906 + if (ret > 0 && (ret & NVME_STATUS_DNR)) 3989 3907 nvme_ns_remove(ns); 3990 3908 } 3991 3909 ··· 4177 4095 * they report) but don't actually support it. 
4178 4096 */ 4179 4097 ret = nvme_scan_ns_list(ctrl); 4180 - if (ret > 0 && ret & NVME_SC_DNR) 4098 + if (ret > 0 && ret & NVME_STATUS_DNR) 4181 4099 nvme_scan_ns_sequential(ctrl); 4182 4100 } 4183 4101 mutex_unlock(&ctrl->scan_lock); ··· 4571 4489 return ret; 4572 4490 4573 4491 if (ctrl->ops->flags & NVME_F_FABRICS) { 4574 - ctrl->connect_q = blk_mq_alloc_queue(set, NULL, NULL); 4492 + struct queue_limits lim = { 4493 + .features = BLK_FEAT_SKIP_TAGSET_QUIESCE, 4494 + }; 4495 + 4496 + ctrl->connect_q = blk_mq_alloc_queue(set, &lim, NULL); 4575 4497 if (IS_ERR(ctrl->connect_q)) { 4576 4498 ret = PTR_ERR(ctrl->connect_q); 4577 4499 goto out_free_tag_set; 4578 4500 } 4579 - blk_queue_flag_set(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, 4580 - ctrl->connect_q); 4581 4501 } 4582 4502 4583 4503 ctrl->tagset = set; ··· 4697 4613 * Initialize a NVMe controller structures. This needs to be called during 4698 4614 * earliest initialization so that we have the initialized structured around 4699 4615 * during probing. 4616 + * 4617 + * On success, the caller must use the nvme_put_ctrl() to release this when 4618 + * needed, which also invokes the ops->free_ctrl() callback. 
4700 4619 */ 4701 4620 int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev, 4702 4621 const struct nvme_ctrl_ops *ops, unsigned long quirks) ··· 4748 4661 goto out; 4749 4662 ctrl->instance = ret; 4750 4663 4664 + ret = nvme_auth_init_ctrl(ctrl); 4665 + if (ret) 4666 + goto out_release_instance; 4667 + 4668 + nvme_mpath_init_ctrl(ctrl); 4669 + 4751 4670 device_initialize(&ctrl->ctrl_device); 4752 4671 ctrl->device = &ctrl->ctrl_device; 4753 4672 ctrl->device->devt = MKDEV(MAJOR(nvme_ctrl_base_chr_devt), ··· 4766 4673 ctrl->device->groups = nvme_dev_attr_groups; 4767 4674 ctrl->device->release = nvme_free_ctrl; 4768 4675 dev_set_drvdata(ctrl->device, ctrl); 4676 + 4677 + return ret; 4678 + 4679 + out_release_instance: 4680 + ida_free(&nvme_instance_ida, ctrl->instance); 4681 + out: 4682 + if (ctrl->discard_page) 4683 + __free_page(ctrl->discard_page); 4684 + cleanup_srcu_struct(&ctrl->srcu); 4685 + return ret; 4686 + } 4687 + EXPORT_SYMBOL_GPL(nvme_init_ctrl); 4688 + 4689 + /* 4690 + * On success, returns with an elevated controller reference and caller must 4691 + * use nvme_uninit_ctrl() to properly free resources associated with the ctrl. 4692 + */ 4693 + int nvme_add_ctrl(struct nvme_ctrl *ctrl) 4694 + { 4695 + int ret; 4696 + 4769 4697 ret = dev_set_name(ctrl->device, "nvme%d", ctrl->instance); 4770 4698 if (ret) 4771 - goto out_release_instance; 4699 + return ret; 4772 4700 4773 - nvme_get_ctrl(ctrl); 4774 4701 cdev_init(&ctrl->cdev, &nvme_dev_fops); 4775 - ctrl->cdev.owner = ops->module; 4702 + ctrl->cdev.owner = ctrl->ops->module; 4776 4703 ret = cdev_device_add(&ctrl->cdev, ctrl->device); 4777 4704 if (ret) 4778 - goto out_free_name; 4705 + return ret; 4779 4706 4780 4707 /* 4781 4708 * Initialize latency tolerance controls. 
The sysfs files won't ··· 4806 4693 min(default_ps_max_latency_us, (unsigned long)S32_MAX)); 4807 4694 4808 4695 nvme_fault_inject_init(&ctrl->fault_inject, dev_name(ctrl->device)); 4809 - nvme_mpath_init_ctrl(ctrl); 4810 - ret = nvme_auth_init_ctrl(ctrl); 4811 - if (ret) 4812 - goto out_free_cdev; 4696 + nvme_get_ctrl(ctrl); 4813 4697 4814 4698 return 0; 4815 - out_free_cdev: 4816 - nvme_fault_inject_fini(&ctrl->fault_inject); 4817 - dev_pm_qos_hide_latency_tolerance(ctrl->device); 4818 - cdev_device_del(&ctrl->cdev, ctrl->device); 4819 - out_free_name: 4820 - nvme_put_ctrl(ctrl); 4821 - kfree_const(ctrl->device->kobj.name); 4822 - out_release_instance: 4823 - ida_free(&nvme_instance_ida, ctrl->instance); 4824 - out: 4825 - if (ctrl->discard_page) 4826 - __free_page(ctrl->discard_page); 4827 - cleanup_srcu_struct(&ctrl->srcu); 4828 - return ret; 4829 4699 } 4830 - EXPORT_SYMBOL_GPL(nvme_init_ctrl); 4700 + EXPORT_SYMBOL_GPL(nvme_add_ctrl); 4831 4701 4832 4702 /* let I/O to all namespaces fail in preparation for surprise removal */ 4833 4703 void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl)
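The atomic-write pieces of this file derive a power-of-two unit maximum from the device's atomic size and reject writes that are too large or that straddle a boundary. A minimal userspace sketch of both steps follows. The `sketch_*` names are hypothetical, byte offsets are used instead of sectors, and the boundary is assumed to be a power of two and the length nonzero:

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest power of two not exceeding v; returns 0 for v == 0 by choice. */
static uint32_t sketch_rounddown_pow2(uint32_t v)
{
	uint32_t p = 1;

	if (v == 0)
		return 0;
	while (p <= v / 2)
		p *= 2;
	return p;
}

/*
 * Mirrors the shape of the validity check added in this merge:
 *  - the write must not exceed the atomic unit maximum;
 *  - with a boundary set, the first and last byte must fall in the
 *    same boundary-sized window.
 * start and len are in bytes; len is assumed to be nonzero.
 */
static bool sketch_valid_atomic_write(uint64_t start, uint32_t len,
				      uint32_t unit_max, uint32_t boundary)
{
	if (len > unit_max)
		return false;

	if (boundary) {
		uint64_t mask = (uint64_t)boundary - 1, imask = ~mask;
		uint64_t end = start + len - 1;

		/* Longer than one window: must cross a boundary. */
		if (len > boundary)
			return false;

		if ((start & imask) != (end & imask))
			return false;
	}

	return true;
}
```

For instance, with a 64 KiB boundary an 8 KiB write starting at 60 KiB is rejected because its last byte lands in the next window, while the same write starting at 64 KiB passes.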

+20 -5

drivers/nvme/host/fabrics.c

··· 187 187 if (unlikely(ret != 0)) 188 188 dev_err(ctrl->device, 189 189 "Property Get error: %d, offset %#x\n", 190 - ret > 0 ? ret & ~NVME_SC_DNR : ret, off); 190 + ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off); 191 191 192 192 return ret; 193 193 } ··· 233 233 if (unlikely(ret != 0)) 234 234 dev_err(ctrl->device, 235 235 "Property Get error: %d, offset %#x\n", 236 - ret > 0 ? ret & ~NVME_SC_DNR : ret, off); 236 + ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off); 237 237 return ret; 238 238 } 239 239 EXPORT_SYMBOL_GPL(nvmf_reg_read64); ··· 275 275 if (unlikely(ret)) 276 276 dev_err(ctrl->device, 277 277 "Property Set error: %d, offset %#x\n", 278 - ret > 0 ? ret & ~NVME_SC_DNR : ret, off); 278 + ret > 0 ? ret & ~NVME_STATUS_DNR : ret, off); 279 279 return ret; 280 280 } 281 281 EXPORT_SYMBOL_GPL(nvmf_reg_write32); 282 + 283 + int nvmf_subsystem_reset(struct nvme_ctrl *ctrl) 284 + { 285 + int ret; 286 + 287 + if (!nvme_wait_reset(ctrl)) 288 + return -EBUSY; 289 + 290 + ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, NVME_SUBSYS_RESET); 291 + if (ret) 292 + return ret; 293 + 294 + return nvme_try_sched_reset(ctrl); 295 + } 296 + EXPORT_SYMBOL_GPL(nvmf_subsystem_reset); 282 297 283 298 /** 284 299 * nvmf_log_connect_error() - Error-parsing-diagnostic print out function for ··· 310 295 int errval, int offset, struct nvme_command *cmd, 311 296 struct nvmf_connect_data *data) 312 297 { 313 - int err_sctype = errval & ~NVME_SC_DNR; 298 + int err_sctype = errval & ~NVME_STATUS_DNR; 314 299 315 300 if (errval < 0) { 316 301 dev_err(ctrl->device, ··· 588 573 */ 589 574 bool nvmf_should_reconnect(struct nvme_ctrl *ctrl, int status) 590 575 { 591 - if (status > 0 && (status & NVME_SC_DNR)) 576 + if (status > 0 && (status & NVME_STATUS_DNR)) 592 577 return false; 593 578 594 579 if (status == -EKEYREJECTED)

+1

drivers/nvme/host/fabrics.h

··· 217 217
 int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val);
 int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val);
+int nvmf_subsystem_reset(struct nvme_ctrl *ctrl);
 int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl);
 int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid);
 int nvmf_register_transport(struct nvmf_transport_ops *ops);

+1 -1

drivers/nvme/host/fault_inject.c

··· 75 75
 		/* inject status code and DNR bit */
 		status = fault_inject->status;
 		if (fault_inject->dont_retry)
-			status |= NVME_SC_DNR;
+			status |= NVME_STATUS_DNR;
 		nvme_req(req)->status = status;
 	}
 }

+36 -19

drivers/nvme/host/fc.c

··· 3132 3132 if (ctrl->ctrl.icdoff) { 3133 3133 dev_err(ctrl->ctrl.device, "icdoff %d is not supported!\n", 3134 3134 ctrl->ctrl.icdoff); 3135 - ret = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 3135 + ret = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 3136 3136 goto out_stop_keep_alive; 3137 3137 } 3138 3138 ··· 3140 3140 if (!nvme_ctrl_sgl_supported(&ctrl->ctrl)) { 3141 3141 dev_err(ctrl->ctrl.device, 3142 3142 "Mandatory sgls are not supported!\n"); 3143 - ret = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 3143 + ret = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 3144 3144 goto out_stop_keep_alive; 3145 3145 } 3146 3146 ··· 3325 3325 queue_delayed_work(nvme_wq, &ctrl->connect_work, recon_delay); 3326 3326 } else { 3327 3327 if (portptr->port_state == FC_OBJSTATE_ONLINE) { 3328 - if (status > 0 && (status & NVME_SC_DNR)) 3328 + if (status > 0 && (status & NVME_STATUS_DNR)) 3329 3329 dev_warn(ctrl->ctrl.device, 3330 3330 "NVME-FC{%d}: reconnect failure\n", 3331 3331 ctrl->cnum); ··· 3382 3382 .reg_read32 = nvmf_reg_read32, 3383 3383 .reg_read64 = nvmf_reg_read64, 3384 3384 .reg_write32 = nvmf_reg_write32, 3385 + .subsystem_reset = nvmf_subsystem_reset, 3385 3386 .free_ctrl = nvme_fc_free_ctrl, 3386 3387 .submit_async_event = nvme_fc_submit_async_event, 3387 3388 .delete_ctrl = nvme_fc_delete_ctrl, ··· 3445 3444 return found; 3446 3445 } 3447 3446 3448 - static struct nvme_ctrl * 3449 - nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts, 3447 + static struct nvme_fc_ctrl * 3448 + nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts, 3450 3449 struct nvme_fc_lport *lport, struct nvme_fc_rport *rport) 3451 3450 { 3452 3451 struct nvme_fc_ctrl *ctrl; 3453 - unsigned long flags; 3454 3452 int ret, idx, ctrl_loss_tmo; 3455 3453 3456 3454 if (!(rport->remoteport.port_role & ··· 3538 3538 if (lport->dev) 3539 3539 ctrl->ctrl.numa_node = dev_to_node(lport->dev); 3540 3540 3541 - /* at this point, teardown path changes to ref counting on nvme ctrl */ 3541 + 
return ctrl; 3542 + 3543 + out_free_queues: 3544 + kfree(ctrl->queues); 3545 + out_free_ida: 3546 + put_device(ctrl->dev); 3547 + ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum); 3548 + out_free_ctrl: 3549 + kfree(ctrl); 3550 + out_fail: 3551 + /* exit via here doesn't follow ctlr ref points */ 3552 + return ERR_PTR(ret); 3553 + } 3554 + 3555 + static struct nvme_ctrl * 3556 + nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts, 3557 + struct nvme_fc_lport *lport, struct nvme_fc_rport *rport) 3558 + { 3559 + struct nvme_fc_ctrl *ctrl; 3560 + unsigned long flags; 3561 + int ret; 3562 + 3563 + ctrl = nvme_fc_alloc_ctrl(dev, opts, lport, rport); 3564 + if (IS_ERR(ctrl)) 3565 + return ERR_CAST(ctrl); 3566 + 3567 + ret = nvme_add_ctrl(&ctrl->ctrl); 3568 + if (ret) 3569 + goto out_put_ctrl; 3542 3570 3543 3571 ret = nvme_alloc_admin_tag_set(&ctrl->ctrl, &ctrl->admin_tag_set, 3544 3572 &nvme_fc_admin_mq_ops, ··· 3612 3584 /* initiate nvme ctrl ref counting teardown */ 3613 3585 nvme_uninit_ctrl(&ctrl->ctrl); 3614 3586 3587 + out_put_ctrl: 3615 3588 /* Remove core ctrl ref. */ 3616 3589 nvme_put_ctrl(&ctrl->ctrl); 3617 3590 ··· 3626 3597 nvme_fc_rport_get(rport); 3627 3598 3628 3599 return ERR_PTR(-EIO); 3629 - 3630 - out_free_queues: 3631 - kfree(ctrl->queues); 3632 - out_free_ida: 3633 - put_device(ctrl->dev); 3634 - ida_free(&nvme_fc_ctrl_cnt, ctrl->cnum); 3635 - out_free_ctrl: 3636 - kfree(ctrl); 3637 - out_fail: 3638 - /* exit via here doesn't follow ctlr ref points */ 3639 - return ERR_PTR(ret); 3640 3600 } 3641 - 3642 3601 3643 3602 struct nvmet_fc_traddr { 3644 3603 u64 nn;
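The fc.c hunks above split controller creation into an allocation stage (`nvme_fc_alloc_ctrl`) that unwinds with plain frees, and an init stage (`nvme_fc_init_ctrl`) that, once `nvme_add_ctrl` has been attempted, unwinds by dropping a reference. The following is a minimal userspace sketch of that two-stage shape, not the kernel code: every `sketch_*` name, the `fail_add` knob, and the integer refcount are hypothetical stand-ins, and `ERR_PTR` is replaced by a NULL return plus an `err` out-parameter.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a transport controller. */
struct sketch_ctrl {
	int refs;       /* models the core controller reference */
	int registered; /* set once the "add" step has succeeded */
	int *queues;
};

/* Stage 1: allocations only; failures unwind with free(), no refcounting. */
static struct sketch_ctrl *sketch_alloc_ctrl(size_t nr_queues, int *err)
{
	struct sketch_ctrl *ctrl;

	assert(nr_queues > 0 && err != NULL);

	ctrl = calloc(1, sizeof(*ctrl));
	if (!ctrl) {
		*err = -ENOMEM;
		return NULL;
	}
	ctrl->queues = calloc(nr_queues, sizeof(*ctrl->queues));
	if (!ctrl->queues) {
		*err = -ENOMEM;
		goto out_free_ctrl;
	}
	ctrl->refs = 1;
	return ctrl;

out_free_ctrl:
	free(ctrl);
	return NULL;
}

/* Dropping the last reference releases everything stage 1 set up. */
static void sketch_put_ctrl(struct sketch_ctrl *ctrl)
{
	assert(ctrl->refs > 0);
	if (--ctrl->refs == 0) {
		free(ctrl->queues);
		free(ctrl);
	}
}

/* Stage 2: from here on, teardown goes through the reference count. */
static struct sketch_ctrl *sketch_create_ctrl(size_t nr_queues, int fail_add,
					      int *err)
{
	struct sketch_ctrl *ctrl = sketch_alloc_ctrl(nr_queues, err);

	if (!ctrl)
		return NULL;

	if (fail_add) { /* models a failing "add" step */
		*err = -EIO;
		goto out_put_ctrl;
	}
	ctrl->registered = 1;
	return ctrl;

out_put_ctrl:
	sketch_put_ctrl(ctrl);
	return NULL;
}
```

The point of the split, as far as the hunks show, is that each stage has one consistent unwinding rule, so the error labels no longer mix direct frees with reference drops in a single function.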

+113 -31

drivers/nvme/host/multipath.c

··· 17 17 static const char *nvme_iopolicy_names[] = { 18 18 [NVME_IOPOLICY_NUMA] = "numa", 19 19 [NVME_IOPOLICY_RR] = "round-robin", 20 + [NVME_IOPOLICY_QD] = "queue-depth", 20 21 }; 21 22 22 23 static int iopolicy = NVME_IOPOLICY_NUMA; ··· 30 29 iopolicy = NVME_IOPOLICY_NUMA; 31 30 else if (!strncmp(val, "round-robin", 11)) 32 31 iopolicy = NVME_IOPOLICY_RR; 32 + else if (!strncmp(val, "queue-depth", 11)) 33 + iopolicy = NVME_IOPOLICY_QD; 33 34 else 34 35 return -EINVAL; 35 36 ··· 46 43 module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy, 47 44 &iopolicy, 0644); 48 45 MODULE_PARM_DESC(iopolicy, 49 - "Default multipath I/O policy; 'numa' (default) or 'round-robin'"); 46 + "Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'"); 50 47 51 48 void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys) 52 49 { ··· 86 83 void nvme_failover_req(struct request *req) 87 84 { 88 85 struct nvme_ns *ns = req->q->queuedata; 89 - u16 status = nvme_req(req)->status & 0x7ff; 86 + u16 status = nvme_req(req)->status & NVME_SCT_SC_MASK; 90 87 unsigned long flags; 91 88 struct bio *bio; 92 89 ··· 131 128 struct nvme_ns *ns = rq->q->queuedata; 132 129 struct gendisk *disk = ns->head->disk; 133 130 131 + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) { 132 + atomic_inc(&ns->ctrl->nr_active); 133 + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE; 134 + } 135 + 134 136 if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq)) 135 137 return; 136 138 ··· 148 140 void nvme_mpath_end_request(struct request *rq) 149 141 { 150 142 struct nvme_ns *ns = rq->q->queuedata; 143 + 144 + if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE) 145 + atomic_dec_if_positive(&ns->ctrl->nr_active); 151 146 152 147 if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS)) 153 148 return; ··· 302 291 return list_first_or_null_rcu(&head->list, struct nvme_ns, siblings); 303 292 } 304 293 305 - static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head 
*head, 306 - int node, struct nvme_ns *old) 294 + static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head) 307 295 { 308 296 struct nvme_ns *ns, *found = NULL; 297 + int node = numa_node_id(); 298 + struct nvme_ns *old = srcu_dereference(head->current_path[node], 299 + &head->srcu); 300 + 301 + if (unlikely(!old)) 302 + return __nvme_find_path(head, node); 309 303 310 304 if (list_is_singular(&head->list)) { 311 305 if (nvme_path_is_disabled(old)) ··· 350 334 return found; 351 335 } 352 336 337 + static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head) 338 + { 339 + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns; 340 + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX; 341 + unsigned int depth; 342 + 343 + list_for_each_entry_rcu(ns, &head->list, siblings) { 344 + if (nvme_path_is_disabled(ns)) 345 + continue; 346 + 347 + depth = atomic_read(&ns->ctrl->nr_active); 348 + 349 + switch (ns->ana_state) { 350 + case NVME_ANA_OPTIMIZED: 351 + if (depth < min_depth_opt) { 352 + min_depth_opt = depth; 353 + best_opt = ns; 354 + } 355 + break; 356 + case NVME_ANA_NONOPTIMIZED: 357 + if (depth < min_depth_nonopt) { 358 + min_depth_nonopt = depth; 359 + best_nonopt = ns; 360 + } 361 + break; 362 + default: 363 + break; 364 + } 365 + 366 + if (min_depth_opt == 0) 367 + return best_opt; 368 + } 369 + 370 + return best_opt ? 
best_opt : best_nonopt; 371 + } 372 + 353 373 static inline bool nvme_path_is_optimized(struct nvme_ns *ns) 354 374 { 355 375 return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE && 356 376 ns->ana_state == NVME_ANA_OPTIMIZED; 357 377 } 358 378 359 - inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) 379 + static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head) 360 380 { 361 381 int node = numa_node_id(); 362 382 struct nvme_ns *ns; ··· 400 348 ns = srcu_dereference(head->current_path[node], &head->srcu); 401 349 if (unlikely(!ns)) 402 350 return __nvme_find_path(head, node); 403 - 404 - if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR) 405 - return nvme_round_robin_path(head, node, ns); 406 351 if (unlikely(!nvme_path_is_optimized(ns))) 407 352 return __nvme_find_path(head, node); 408 353 return ns; 354 + } 355 + 356 + inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) 357 + { 358 + switch (READ_ONCE(head->subsys->iopolicy)) { 359 + case NVME_IOPOLICY_QD: 360 + return nvme_queue_depth_path(head); 361 + case NVME_IOPOLICY_RR: 362 + return nvme_round_robin_path(head); 363 + default: 364 + return nvme_numa_path(head); 365 + } 409 366 } 410 367 411 368 static bool nvme_available_path(struct nvme_ns_head *head) ··· 488 427 nvme_put_ns_head(disk->private_data); 489 428 } 490 429 430 + static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16], 431 + enum blk_unique_id type) 432 + { 433 + struct nvme_ns_head *head = disk->private_data; 434 + struct nvme_ns *ns; 435 + int srcu_idx, ret = -EWOULDBLOCK; 436 + 437 + srcu_idx = srcu_read_lock(&head->srcu); 438 + ns = nvme_find_path(head); 439 + if (ns) 440 + ret = nvme_ns_get_unique_id(ns, id, type); 441 + srcu_read_unlock(&head->srcu, srcu_idx); 442 + return ret; 443 + } 444 + 491 445 #ifdef CONFIG_BLK_DEV_ZONED 492 446 static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector, 493 447 unsigned int nr_zones, report_zones_cb cb, void *data) ··· 530 454 
.ioctl = nvme_ns_head_ioctl, 531 455 .compat_ioctl = blkdev_compat_ptr_ioctl, 532 456 .getgeo = nvme_getgeo, 457 + .get_unique_id = nvme_ns_head_get_unique_id, 533 458 .report_zones = nvme_ns_head_report_zones, 534 459 .pr_ops = &nvme_pr_ops, 535 460 }; ··· 598 521 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) 599 522 { 600 523 struct queue_limits lim; 601 - bool vwc = false; 602 524 603 525 mutex_init(&head->lock); 604 526 bio_list_init(&head->requeue_list); ··· 615 539 616 540 blk_set_stacking_limits(&lim); 617 541 lim.dma_alignment = 3; 542 + lim.features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL; 618 543 if (head->ids.csi != NVME_CSI_ZNS) 619 544 lim.max_zone_append_sectors = 0; 620 545 ··· 626 549 head->disk->private_data = head; 627 550 sprintf(head->disk->disk_name, "nvme%dn%d", 628 551 ctrl->subsys->instance, head->instance); 629 - 630 - blk_queue_flag_set(QUEUE_FLAG_NONROT, head->disk->queue); 631 - blk_queue_flag_set(QUEUE_FLAG_NOWAIT, head->disk->queue); 632 - blk_queue_flag_set(QUEUE_FLAG_IO_STAT, head->disk->queue); 633 - /* 634 - * This assumes all controllers that refer to a namespace either 635 - * support poll queues or not. That is not a strict guarantee, 636 - * but if the assumption is wrong the effect is only suboptimal 637 - * performance but not correctness problem. 
638 - */ 639 - if (ctrl->tagset->nr_maps > HCTX_TYPE_POLL && 640 - ctrl->tagset->map[HCTX_TYPE_POLL].nr_queues) 641 - blk_queue_flag_set(QUEUE_FLAG_POLL, head->disk->queue); 642 - 643 - /* we need to propagate up the VMC settings */ 644 - if (ctrl->vwc & NVME_CTRL_VWC_PRESENT) 645 - vwc = true; 646 - blk_queue_write_cache(head->disk->queue, vwc, vwc); 647 552 return 0; 648 553 } 649 554 ··· 862 803 nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]); 863 804 } 864 805 806 + static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, 807 + int iopolicy) 808 + { 809 + struct nvme_ctrl *ctrl; 810 + int old_iopolicy = READ_ONCE(subsys->iopolicy); 811 + 812 + if (old_iopolicy == iopolicy) 813 + return; 814 + 815 + WRITE_ONCE(subsys->iopolicy, iopolicy); 816 + 817 + /* iopolicy changes clear the mpath by design */ 818 + mutex_lock(&nvme_subsystems_lock); 819 + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) 820 + nvme_mpath_clear_ctrl_paths(ctrl); 821 + mutex_unlock(&nvme_subsystems_lock); 822 + 823 + pr_notice("subsysnqn %s iopolicy changed from %s to %s\n", 824 + subsys->subnqn, 825 + nvme_iopolicy_names[old_iopolicy], 826 + nvme_iopolicy_names[iopolicy]); 827 + } 828 + 865 829 static ssize_t nvme_subsys_iopolicy_store(struct device *dev, 866 830 struct device_attribute *attr, const char *buf, size_t count) 867 831 { ··· 894 812 895 813 for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) { 896 814 if (sysfs_streq(buf, nvme_iopolicy_names[i])) { 897 - WRITE_ONCE(subsys->iopolicy, i); 815 + nvme_subsys_iopolicy_update(subsys, i); 898 816 return count; 899 817 } 900 818 } ··· 957 875 nvme_mpath_set_live(ns); 958 876 } 959 877 960 - if (blk_queue_stable_writes(ns->queue) && ns->head->disk) 961 - blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, 962 - ns->head->disk->queue); 963 878 #ifdef CONFIG_BLK_DEV_ZONED 964 879 if (blk_queue_is_zoned(ns->queue) && ns->head->disk) 965 880 ns->head->disk->nr_zones = ns->disk->nr_zones; ··· 1001 922 if (!multipath || 
!ctrl->subsys || 1002 923 !(ctrl->subsys->cmic & NVME_CTRL_CMIC_ANA)) 1003 924 return 0; 925 + 926 + /* initialize this in the identify path to cover controller resets */ 927 + atomic_set(&ctrl->nr_active, 0); 1004 928 1005 929 if (!ctrl->max_namespaces || 1006 930 ctrl->max_namespaces > le32_to_cpu(id->nn)) {
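The `nvme_queue_depth_path` selector added above picks the enabled path with the fewest in-flight requests, preferring ANA-optimized paths and falling back to non-optimized ones. Below is a minimal userspace sketch of that selection loop under simplifying assumptions: the `sketch_*` types are hypothetical, paths live in a plain array rather than an RCU list, and `nr_active` is an ordinary integer instead of an `atomic_t`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ANA states used by the selector. */
enum sketch_ana_state {
	SKETCH_ANA_OPTIMIZED,
	SKETCH_ANA_NONOPTIMIZED,
	SKETCH_ANA_INACCESSIBLE,
};

struct sketch_path {
	enum sketch_ana_state ana_state;
	int disabled;           /* models nvme_path_is_disabled() */
	unsigned int nr_active; /* models the per-controller in-flight count */
};

/* Returns the index of the chosen path, or -1 if no path is usable. */
static int sketch_queue_depth_path(const struct sketch_path *paths, size_t n)
{
	int best_opt = -1, best_nonopt = -1;
	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;

	assert(paths != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		unsigned int depth;

		if (paths[i].disabled)
			continue;

		depth = paths[i].nr_active;
		switch (paths[i].ana_state) {
		case SKETCH_ANA_OPTIMIZED:
			if (depth < min_depth_opt) {
				min_depth_opt = depth;
				best_opt = (int)i;
			}
			break;
		case SKETCH_ANA_NONOPTIMIZED:
			if (depth < min_depth_nonopt) {
				min_depth_nonopt = depth;
				best_nonopt = (int)i;
			}
			break;
		default:
			break;
		}

		/* An idle optimized path cannot be beaten; stop early. */
		if (min_depth_opt == 0)
			return best_opt;
	}

	return best_opt >= 0 ? best_opt : best_nonopt;
}
```

In the diff itself the counter is raised in `nvme_mpath_start_request` and lowered in `nvme_mpath_end_request`, and only while the queue-depth policy is selected; the sketch leaves that accounting to the caller.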

+14 -14

drivers/nvme/host/nvme.h

··· 49 49 extern struct workqueue_struct *nvme_wq; 50 50 extern struct workqueue_struct *nvme_reset_wq; 51 51 extern struct workqueue_struct *nvme_delete_wq; 52 + extern struct mutex nvme_subsystems_lock; 52 53 53 54 /* 54 55 * List of workarounds for devices that required behavior not specified in ··· 196 195 NVME_REQ_CANCELLED = (1 << 0), 197 196 NVME_REQ_USERCMD = (1 << 1), 198 197 NVME_MPATH_IO_STATS = (1 << 2), 198 + NVME_MPATH_CNT_ACTIVE = (1 << 3), 199 199 }; 200 200 201 201 static inline struct nvme_request *nvme_req(struct request *req) ··· 362 360 size_t ana_log_size; 363 361 struct timer_list anatt_timer; 364 362 struct work_struct ana_work; 363 + atomic_t nr_active; 365 364 #endif 366 365 367 366 #ifdef CONFIG_NVME_HOST_AUTH ··· 411 408 enum nvme_iopolicy { 412 409 NVME_IOPOLICY_NUMA, 413 410 NVME_IOPOLICY_RR, 411 + NVME_IOPOLICY_QD, 414 412 }; 415 413 416 414 struct nvme_subsystem { ··· 555 551 int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val); 556 552 void (*free_ctrl)(struct nvme_ctrl *ctrl); 557 553 void (*submit_async_event)(struct nvme_ctrl *ctrl); 554 + int (*subsystem_reset)(struct nvme_ctrl *ctrl); 558 555 void (*delete_ctrl)(struct nvme_ctrl *ctrl); 559 556 void (*stop_ctrl)(struct nvme_ctrl *ctrl); 560 557 int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size); ··· 654 649 655 650 static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl) 656 651 { 657 - int ret; 658 - 659 - if (!ctrl->subsystem) 652 + if (!ctrl->subsystem || !ctrl->ops->subsystem_reset) 660 653 return -ENOTTY; 661 - if (!nvme_wait_reset(ctrl)) 662 - return -EBUSY; 663 - 664 - ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65); 665 - if (ret) 666 - return ret; 667 - 668 - return nvme_try_sched_reset(ctrl); 654 + return ctrl->ops->subsystem_reset(ctrl); 669 655 } 670 656 671 657 /* ··· 685 689 686 690 static inline bool nvme_is_ana_error(u16 status) 687 691 { 688 - switch (status & 0x7ff) { 692 + switch (status & NVME_SCT_SC_MASK) { 689 693 
case NVME_SC_ANA_TRANSITION: 690 694 case NVME_SC_ANA_INACCESSIBLE: 691 695 case NVME_SC_ANA_PERSISTENT_LOSS: ··· 698 702 static inline bool nvme_is_path_error(u16 status) 699 703 { 700 704 /* check for a status code type of 'path related status' */ 701 - return (status & 0x700) == 0x300; 705 + return (status & NVME_SCT_MASK) == NVME_SCT_PATH; 702 706 } 703 707 704 708 /* ··· 788 792 int nvme_enable_ctrl(struct nvme_ctrl *ctrl); 789 793 int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev, 790 794 const struct nvme_ctrl_ops *ops, unsigned long quirks); 795 + int nvme_add_ctrl(struct nvme_ctrl *ctrl); 791 796 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl); 792 797 void nvme_start_ctrl(struct nvme_ctrl *ctrl); 793 798 void nvme_stop_ctrl(struct nvme_ctrl *ctrl); ··· 874 877 NVME_SUBMIT_NOWAIT = (__force nvme_submit_flags_t)(1 << 1), 875 878 /* Set BLK_MQ_REQ_RESERVED when allocating request */ 876 879 NVME_SUBMIT_RESERVED = (__force nvme_submit_flags_t)(1 << 2), 877 - /* Retry command when NVME_SC_DNR is not set in the result */ 880 + /* Retry command when NVME_STATUS_DNR is not set in the result */ 878 881 NVME_SUBMIT_RETRY = (__force nvme_submit_flags_t)(1 << 3), 879 882 }; 880 883 ··· 1058 1061 return false; 1059 1062 } 1060 1063 #endif /* CONFIG_NVME_MULTIPATH */ 1064 + 1065 + int nvme_ns_get_unique_id(struct nvme_ns *ns, u8 id[16], 1066 + enum blk_unique_id type); 1061 1067 1062 1068 struct nvme_zone_info { 1063 1069 u64 zone_size;
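The nvme.h hunks above replace the literals `0x7ff`, `0x700` and `0x300` with `NVME_SCT_SC_MASK`, `NVME_SCT_MASK` and `NVME_SCT_PATH`, and rename the do-not-retry flag to `NVME_STATUS_DNR`. The sketch below shows how such masks separate the status code from the flag bits. The three mask values are taken from the literals the diff replaces; the `0x4000` value for the DNR bit is an assumption not shown in this diff, and all `SKETCH_*` and `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the literals that the diff replaces. */
#define SKETCH_SCT_MASK    0x0700u /* status code type bits */
#define SKETCH_SCT_SC_MASK 0x07ffu /* status code type + status code */
#define SKETCH_SCT_PATH    0x0300u /* "path related status" type */

/* Assumed value for the do-not-retry flag; not visible in this diff. */
#define SKETCH_STATUS_DNR  0x4000u

/* Combine a type+code value with the optional DNR flag. */
static uint16_t sketch_make_status(uint16_t sct_sc, bool dnr)
{
	assert((sct_sc & ~SKETCH_SCT_SC_MASK) == 0);
	return (uint16_t)(sct_sc | (dnr ? SKETCH_STATUS_DNR : 0u));
}

/* Strip flag bits, leaving only the type and code for a switch(). */
static uint16_t sketch_status_code(uint16_t status)
{
	return (uint16_t)(status & SKETCH_SCT_SC_MASK);
}

/* Mirrors the shape of nvme_is_path_error() in the hunk above. */
static bool sketch_is_path_error(uint16_t status)
{
	return (status & SKETCH_SCT_MASK) == SKETCH_SCT_PATH;
}

/* A command is a retry candidate only when DNR is clear. */
static bool sketch_may_retry(uint16_t status)
{
	return !(status & SKETCH_STATUS_DNR);
}
```

Naming the masks makes the distinction explicit: the `NVME_SC_*` constants are codes to compare against after masking, while DNR is a flag that is OR-ed on top, which appears to be the motivation for the `NVME_SC_DNR` to `NVME_STATUS_DNR` rename throughout this merge.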

+44 -3

drivers/nvme/host/pci.c

··· 826 826 struct nvme_command *cmnd) 827 827 { 828 828 struct nvme_iod *iod = blk_mq_rq_to_pdu(req); 829 + struct bio_vec bv = rq_integrity_vec(req); 829 830 830 - iod->meta_dma = dma_map_bvec(dev->dev, rq_integrity_vec(req), 831 - rq_dma_dir(req), 0); 831 + iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0); 832 832 if (dma_mapping_error(dev->dev, iod->meta_dma)) 833 833 return BLK_STS_IOERR; 834 834 cmnd->rw.metadata = cpu_to_le64(iod->meta_dma); ··· 967 967 struct nvme_iod *iod = blk_mq_rq_to_pdu(req); 968 968 969 969 dma_unmap_page(dev->dev, iod->meta_dma, 970 - rq_integrity_vec(req)->bv_len, rq_dma_dir(req)); 970 + rq_integrity_vec(req).bv_len, rq_dma_dir(req)); 971 971 } 972 972 973 973 if (blk_rq_nr_phys_segments(req)) ··· 1141 1141 nvme_sq_copy_cmd(nvmeq, &c); 1142 1142 nvme_write_sq_db(nvmeq, true); 1143 1143 spin_unlock(&nvmeq->sq_lock); 1144 + } 1145 + 1146 + static int nvme_pci_subsystem_reset(struct nvme_ctrl *ctrl) 1147 + { 1148 + struct nvme_dev *dev = to_nvme_dev(ctrl); 1149 + int ret = 0; 1150 + 1151 + /* 1152 + * Taking the shutdown_lock ensures the BAR mapping is not being 1153 + * altered by reset_work. Holding this lock before the RESETTING state 1154 + * change, if successful, also ensures nvme_remove won't be able to 1155 + * proceed to iounmap until we're done. 1156 + */ 1157 + mutex_lock(&dev->shutdown_lock); 1158 + if (!dev->bar_mapped_size) { 1159 + ret = -ENODEV; 1160 + goto unlock; 1161 + } 1162 + 1163 + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)) { 1164 + ret = -EBUSY; 1165 + goto unlock; 1166 + } 1167 + 1168 + writel(NVME_SUBSYS_RESET, dev->bar + NVME_REG_NSSR); 1169 + nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE); 1170 + 1171 + /* 1172 + * Read controller status to flush the previous write and trigger a 1173 + * pcie read error. 
1174 + */ 1175 + readl(dev->bar + NVME_REG_CSTS); 1176 + unlock: 1177 + mutex_unlock(&dev->shutdown_lock); 1178 + return ret; 1144 1179 } 1145 1180 1146 1181 static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id) ··· 2894 2859 .reg_read64 = nvme_pci_reg_read64, 2895 2860 .free_ctrl = nvme_pci_free_ctrl, 2896 2861 .submit_async_event = nvme_pci_submit_async_event, 2862 + .subsystem_reset = nvme_pci_subsystem_reset, 2897 2863 .get_address = nvme_pci_get_address, 2898 2864 .print_device_info = nvme_pci_print_device_info, 2899 2865 .supports_pci_p2pdma = nvme_pci_supports_pci_p2pdma, ··· 3051 3015 if (IS_ERR(dev)) 3052 3016 return PTR_ERR(dev); 3053 3017 3018 + result = nvme_add_ctrl(&dev->ctrl); 3019 + if (result) 3020 + goto out_put_ctrl; 3021 + 3054 3022 result = nvme_dev_map(dev); 3055 3023 if (result) 3056 3024 goto out_uninit_ctrl; ··· 3141 3101 nvme_dev_unmap(dev); 3142 3102 out_uninit_ctrl: 3143 3103 nvme_uninit_ctrl(&dev->ctrl); 3104 + out_put_ctrl: 3144 3105 nvme_put_ctrl(&dev->ctrl); 3145 3106 return result; 3146 3107 }
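With this merge, `nvme_reset_subsystem` in nvme.h becomes a thin dispatcher: it returns `-ENOTTY` when the controller has no subsystem support or the transport provides no `subsystem_reset` callback, and otherwise delegates to the transport. The PCI callback above adds its own checks (`-ENODEV` when the BAR is unmapped, `-EBUSY` when the state change to RESETTING fails). A minimal userspace sketch of that optional-callback pattern follows; all `sketch_*` names are hypothetical, and the mutex and register access are reduced to plain fields.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_ctrl;

/* Ops table with an optional callback, as in struct nvme_ctrl_ops. */
struct sketch_ctrl_ops {
	int (*subsystem_reset)(struct sketch_ctrl *ctrl);
};

struct sketch_ctrl {
	const struct sketch_ctrl_ops *ops;
	int subsystem;              /* nonzero if subsystem reset is supported */
	int bar_mapped;             /* models dev->bar_mapped_size != 0 */
	int resetting;              /* models a failed change to RESETTING */
	unsigned int resets_issued; /* counts simulated register writes */
};

/* Transport-specific reset: validate state, then "write the register". */
static int sketch_pci_subsystem_reset(struct sketch_ctrl *ctrl)
{
	if (!ctrl->bar_mapped)
		return -ENODEV;
	if (ctrl->resetting)
		return -EBUSY;
	ctrl->resets_issued++;
	return 0;
}

/* Generic entry point: a missing callback means "not supported". */
static int sketch_reset_subsystem(struct sketch_ctrl *ctrl)
{
	assert(ctrl != NULL && ctrl->ops != NULL);

	if (!ctrl->subsystem || !ctrl->ops->subsystem_reset)
		return -ENOTTY;
	return ctrl->ops->subsystem_reset(ctrl);
}

static const struct sketch_ctrl_ops sketch_pci_ops = {
	.subsystem_reset = sketch_pci_subsystem_reset,
};

/* A transport that leaves the callback unset. */
static const struct sketch_ctrl_ops sketch_no_reset_ops = {
	.subsystem_reset = NULL,
};
```

Moving the register write behind a callback lets each transport decide what must be held or checked before touching the device, which the old inline `reg_write32` call in nvme.h could not express.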

+5 -5

drivers/nvme/host/pr.c

···
 	return nvme_submit_sync_cmd(ns->queue, c, data, data_len);
 }
 
-static int nvme_sc_to_pr_err(int nvme_sc)
+static int nvme_status_to_pr_err(int status)
 {
-	if (nvme_is_path_error(nvme_sc))
+	if (nvme_is_path_error(status))
 		return PR_STS_PATH_FAILED;
 
-	switch (nvme_sc & 0x7ff) {
+	switch (status & NVME_SCT_SC_MASK) {
 	case NVME_SC_SUCCESS:
 		return PR_STS_SUCCESS;
 	case NVME_SC_RESERVATION_CONFLICT:
···
 	if (ret < 0)
 		return ret;
 
-	return nvme_sc_to_pr_err(ret);
+	return nvme_status_to_pr_err(ret);
 }
 
 static int nvme_pr_register(struct block_device *bdev, u64 old,
···
 	if (ret < 0)
 		return ret;
 
-	return nvme_sc_to_pr_err(ret);
+	return nvme_status_to_pr_err(ret);
 }
 
 static int nvme_pr_read_keys(struct block_device *bdev,

+27 -7

drivers/nvme/host/rdma.c

···
 	.reg_read32 = nvmf_reg_read32,
 	.reg_read64 = nvmf_reg_read64,
 	.reg_write32 = nvmf_reg_write32,
+	.subsystem_reset = nvmf_subsystem_reset,
 	.free_ctrl = nvme_rdma_free_ctrl,
 	.submit_async_event = nvme_rdma_submit_async_event,
 	.delete_ctrl = nvme_rdma_delete_ctrl,
···
 	return found;
 }
 
-static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
+static struct nvme_rdma_ctrl *nvme_rdma_alloc_ctrl(struct device *dev,
 		struct nvmf_ctrl_options *opts)
 {
 	struct nvme_rdma_ctrl *ctrl;
 	int ret;
-	bool changed;
 
 	ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
 	if (!ctrl)
···
 	if (ret)
 		goto out_kfree_queues;
 
+	return ctrl;
+
+out_kfree_queues:
+	kfree(ctrl->queues);
+out_free_ctrl:
+	kfree(ctrl);
+	return ERR_PTR(ret);
+}
+
+static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
+		struct nvmf_ctrl_options *opts)
+{
+	struct nvme_rdma_ctrl *ctrl;
+	bool changed;
+	int ret;
+
+	ctrl = nvme_rdma_alloc_ctrl(dev, opts);
+	if (IS_ERR(ctrl))
+		return ERR_CAST(ctrl);
+
+	ret = nvme_add_ctrl(&ctrl->ctrl);
+	if (ret)
+		goto out_put_ctrl;
+
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING);
 	WARN_ON_ONCE(!changed);
 
···
 
 out_uninit_ctrl:
 	nvme_uninit_ctrl(&ctrl->ctrl);
+out_put_ctrl:
 	nvme_put_ctrl(&ctrl->ctrl);
 	if (ret > 0)
 		ret = -EIO;
-	return ERR_PTR(ret);
-out_kfree_queues:
-	kfree(ctrl->queues);
-out_free_ctrl:
-	kfree(ctrl);
 	return ERR_PTR(ret);
 }
 

+25 -6

drivers/nvme/host/tcp.c

···
 	.reg_read32 = nvmf_reg_read32,
 	.reg_read64 = nvmf_reg_read64,
 	.reg_write32 = nvmf_reg_write32,
+	.subsystem_reset = nvmf_subsystem_reset,
 	.free_ctrl = nvme_tcp_free_ctrl,
 	.submit_async_event = nvme_tcp_submit_async_event,
 	.delete_ctrl = nvme_tcp_delete_ctrl,
···
 	return found;
 }
 
-static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
+static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
 		struct nvmf_ctrl_options *opts)
 {
 	struct nvme_tcp_ctrl *ctrl;
···
 	if (ret)
 		goto out_kfree_queues;
 
+	return ctrl;
+out_kfree_queues:
+	kfree(ctrl->queues);
+out_free_ctrl:
+	kfree(ctrl);
+	return ERR_PTR(ret);
+}
+
+static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
+		struct nvmf_ctrl_options *opts)
+{
+	struct nvme_tcp_ctrl *ctrl;
+	int ret;
+
+	ctrl = nvme_tcp_alloc_ctrl(dev, opts);
+	if (IS_ERR(ctrl))
+		return ERR_CAST(ctrl);
+
+	ret = nvme_add_ctrl(&ctrl->ctrl);
+	if (ret)
+		goto out_put_ctrl;
+
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
 		WARN_ON_ONCE(1);
 		ret = -EINTR;
···
 
 out_uninit_ctrl:
 	nvme_uninit_ctrl(&ctrl->ctrl);
+out_put_ctrl:
 	nvme_put_ctrl(&ctrl->ctrl);
 	if (ret > 0)
 		ret = -EIO;
-	return ERR_PTR(ret);
-out_kfree_queues:
-	kfree(ctrl->queues);
-out_free_ctrl:
-	kfree(ctrl);
 	return ERR_PTR(ret);
 }
 

+1 -2

drivers/nvme/host/zns.c

···
 void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
 		struct nvme_zone_info *zi)
 {
-	lim->zoned = 1;
+	lim->features |= BLK_FEAT_ZONED;
 	lim->max_open_zones = zi->max_open_zones;
 	lim->max_active_zones = zi->max_active_zones;
 	lim->max_zone_append_sectors = ns->ctrl->max_zone_append;
 	lim->chunk_sectors = ns->head->zsze =
 		nvme_lba_to_sect(ns->head, zi->zone_size);
-	blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, ns->queue);
 }
 
 static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,

+9 -1

drivers/nvme/target/Kconfig

···
 	depends on CONFIGFS_FS
 	select NVME_KEYRING if NVME_TARGET_TCP_TLS
 	select KEYS if NVME_TARGET_TCP_TLS
-	select BLK_DEV_INTEGRITY_T10 if BLK_DEV_INTEGRITY
 	select SGL_ALLOC
 	help
 	  This enabled target side support for the NVMe protocol, that is
···
 
 	  To configure the NVMe target you probably want to use the nvmetcli
 	  tool from http://git.infradead.org/users/hch/nvmetcli.git.
+
+config NVME_TARGET_DEBUGFS
+	bool "NVMe Target debugfs support"
+	depends on NVME_TARGET
+	help
+	  This enables debugfs support to display the connected controllers
+	  to each subsystem
+
+	  If unsure, say N.
 
 config NVME_TARGET_PASSTHRU
 	bool "NVMe Target Passthrough support"

+1

drivers/nvme/target/Makefile

···
 
 nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \
 	discovery.o io-cmd-file.o io-cmd-bdev.o
+nvmet-$(CONFIG_NVME_TARGET_DEBUGFS) += debugfs.o
 nvmet-$(CONFIG_NVME_TARGET_PASSTHRU) += passthru.o
 nvmet-$(CONFIG_BLK_DEV_ZONED) += zns.o
 nvmet-$(CONFIG_NVME_TARGET_AUTH) += fabrics-cmd-auth.o auth.o

+12 -12

drivers/nvme/target/admin-cmd.c

··· 344 344 pr_debug("unhandled lid %d on qid %d\n", 345 345 req->cmd->get_log_page.lid, req->sq->qid); 346 346 req->error_loc = offsetof(struct nvme_get_log_page_command, lid); 347 - nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_SC_DNR); 347 + nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_STATUS_DNR); 348 348 } 349 349 350 350 static void nvmet_execute_identify_ctrl(struct nvmet_req *req) ··· 496 496 497 497 if (le32_to_cpu(req->cmd->identify.nsid) == NVME_NSID_ALL) { 498 498 req->error_loc = offsetof(struct nvme_identify, nsid); 499 - status = NVME_SC_INVALID_NS | NVME_SC_DNR; 499 + status = NVME_SC_INVALID_NS | NVME_STATUS_DNR; 500 500 goto out; 501 501 } 502 502 ··· 662 662 663 663 if (sg_zero_buffer(req->sg, req->sg_cnt, NVME_IDENTIFY_DATA_SIZE - off, 664 664 off) != NVME_IDENTIFY_DATA_SIZE - off) 665 - status = NVME_SC_INTERNAL | NVME_SC_DNR; 665 + status = NVME_SC_INTERNAL | NVME_STATUS_DNR; 666 666 667 667 out: 668 668 nvmet_req_complete(req, status); ··· 724 724 pr_debug("unhandled identify cns %d on qid %d\n", 725 725 req->cmd->identify.cns, req->sq->qid); 726 726 req->error_loc = offsetof(struct nvme_identify, cns); 727 - nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_SC_DNR); 727 + nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_STATUS_DNR); 728 728 } 729 729 730 730 /* ··· 807 807 808 808 if (val32 & ~mask) { 809 809 req->error_loc = offsetof(struct nvme_common_command, cdw11); 810 - return NVME_SC_INVALID_FIELD | NVME_SC_DNR; 810 + return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 811 811 } 812 812 813 813 WRITE_ONCE(req->sq->ctrl->aen_enabled, val32); ··· 833 833 ncqr = (cdw11 >> 16) & 0xffff; 834 834 nsqr = cdw11 & 0xffff; 835 835 if (ncqr == 0xffff || nsqr == 0xffff) { 836 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 836 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 837 837 break; 838 838 } 839 839 nvmet_set_result(req, ··· 846 846 status = nvmet_set_feat_async_event(req, NVMET_AEN_CFG_ALL); 847 847 break; 848 848 
case NVME_FEAT_HOST_ID: 849 - status = NVME_SC_CMD_SEQ_ERROR | NVME_SC_DNR; 849 + status = NVME_SC_CMD_SEQ_ERROR | NVME_STATUS_DNR; 850 850 break; 851 851 case NVME_FEAT_WRITE_PROTECT: 852 852 status = nvmet_set_feat_write_protect(req); 853 853 break; 854 854 default: 855 855 req->error_loc = offsetof(struct nvme_common_command, cdw10); 856 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 856 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 857 857 break; 858 858 } 859 859 ··· 939 939 if (!(req->cmd->common.cdw11 & cpu_to_le32(1 << 0))) { 940 940 req->error_loc = 941 941 offsetof(struct nvme_common_command, cdw11); 942 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 942 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 943 943 break; 944 944 } 945 945 ··· 952 952 default: 953 953 req->error_loc = 954 954 offsetof(struct nvme_common_command, cdw10); 955 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 955 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 956 956 break; 957 957 } 958 958 ··· 969 969 mutex_lock(&ctrl->lock); 970 970 if (ctrl->nr_async_event_cmds >= NVMET_ASYNC_EVENTS) { 971 971 mutex_unlock(&ctrl->lock); 972 - nvmet_req_complete(req, NVME_SC_ASYNC_LIMIT | NVME_SC_DNR); 972 + nvmet_req_complete(req, NVME_SC_ASYNC_LIMIT | NVME_STATUS_DNR); 973 973 return; 974 974 } 975 975 ctrl->async_event_cmds[ctrl->nr_async_event_cmds++] = req; ··· 1006 1006 if (nvme_is_fabrics(cmd)) 1007 1007 return nvmet_parse_fabrics_admin_cmd(req); 1008 1008 if (unlikely(!nvmet_check_auth_status(req))) 1009 - return NVME_SC_AUTH_REQUIRED | NVME_SC_DNR; 1009 + return NVME_SC_AUTH_REQUIRED | NVME_STATUS_DNR; 1010 1010 if (nvmet_is_disc_subsys(nvmet_req_subsys(req))) 1011 1011 return nvmet_parse_discovery_cmd(req); 1012 1012

+8 -6

drivers/nvme/target/auth.c

···
 				req->sq->dhchap_c1,
 				challenge, shash_len);
 		if (ret)
-			goto out_free_response;
+			goto out_free_challenge;
 	}
 
 	pr_debug("ctrl %d qid %d host response seq %u transaction %d\n",
···
 			GFP_KERNEL);
 	if (!shash) {
 		ret = -ENOMEM;
-		goto out_free_response;
+		goto out_free_challenge;
 	}
 	shash->tfm = shash_tfm;
 	ret = crypto_shash_init(shash);
···
 		goto out;
 	ret = crypto_shash_final(shash, response);
 out:
+	kfree(shash);
+out_free_challenge:
 	if (challenge != req->sq->dhchap_c1)
 		kfree(challenge);
-	kfree(shash);
 out_free_response:
 	nvme_auth_free_key(transformed_key);
 out_free_tfm:
···
 				req->sq->dhchap_c2,
 				challenge, shash_len);
 		if (ret)
-			goto out_free_response;
+			goto out_free_challenge;
 	}
 
 	shash = kzalloc(sizeof(*shash) + crypto_shash_descsize(shash_tfm),
 			GFP_KERNEL);
 	if (!shash) {
 		ret = -ENOMEM;
-		goto out_free_response;
+		goto out_free_challenge;
 	}
 	shash->tfm = shash_tfm;
 
···
 		goto out;
 	ret = crypto_shash_final(shash, response);
 out:
+	kfree(shash);
+out_free_challenge:
 	if (challenge != req->sq->dhchap_c2)
 		kfree(challenge);
-	kfree(shash);
 out_free_response:
 	nvme_auth_free_key(transformed_key);
 out_free_tfm:
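The auth.c hunks above add an `out_free_challenge` label and move `kfree(shash)` ahead of it, so that a failure after the challenge buffer exists but before the hash descriptor is allocated still releases the challenge. The sketch below illustrates that label ordering in userspace C under stated assumptions: the `sketch_*` helpers, the `fail_at` knob and the live-allocation counter are hypothetical and exist only to make leaks observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical allocation counter so a test can detect leaks. */
static int sketch_live_allocs;

static void *sketch_alloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		sketch_live_allocs++;
	return p;
}

static void sketch_free(void *p)
{
	if (!p)
		return;
	assert(sketch_live_allocs > 0);
	sketch_live_allocs--;
	free(p);
}

/*
 * Resources are acquired in the order key, challenge, shash and released
 * in reverse. fail_at == 1 fails before shash exists, fail_at == 2 fails
 * after it; both must leave no allocation behind.
 */
static int sketch_host_response(int fail_at)
{
	void *key, *challenge, *shash;
	int ret = 0;

	key = sketch_alloc(16);
	if (!key)
		return -ENOMEM;

	challenge = sketch_alloc(16);
	if (!challenge) {
		ret = -ENOMEM;
		goto out_free_key;
	}

	if (fail_at == 1) {
		ret = -EINVAL;
		/* Jumping to out_free_key here would leak the challenge. */
		goto out_free_challenge;
	}

	shash = sketch_alloc(16);
	if (!shash) {
		ret = -ENOMEM;
		goto out_free_challenge;
	}

	if (fail_at == 2)
		ret = -EINVAL;

	sketch_free(shash);
out_free_challenge:
	sketch_free(challenge);
out_free_key:
	sketch_free(key);
	return ret;
}
```

Each label releases exactly the resources acquired before the point that jumps to it, which is why the fix both introduces the new label and reorders the frees rather than only retargeting the `goto` statements.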

+52 -24

drivers/nvme/target/core.c

··· 16 16 #include "trace.h" 17 17 18 18 #include "nvmet.h" 19 + #include "debugfs.h" 19 20 20 21 struct kmem_cache *nvmet_bvec_cache; 21 22 struct workqueue_struct *buffered_io_wq; ··· 56 55 return NVME_SC_SUCCESS; 57 56 case -ENOSPC: 58 57 req->error_loc = offsetof(struct nvme_rw_command, length); 59 - return NVME_SC_CAP_EXCEEDED | NVME_SC_DNR; 58 + return NVME_SC_CAP_EXCEEDED | NVME_STATUS_DNR; 60 59 case -EREMOTEIO: 61 60 req->error_loc = offsetof(struct nvme_rw_command, slba); 62 - return NVME_SC_LBA_RANGE | NVME_SC_DNR; 61 + return NVME_SC_LBA_RANGE | NVME_STATUS_DNR; 63 62 case -EOPNOTSUPP: 64 63 req->error_loc = offsetof(struct nvme_common_command, opcode); 65 64 switch (req->cmd->common.opcode) { 66 65 case nvme_cmd_dsm: 67 66 case nvme_cmd_write_zeroes: 68 - return NVME_SC_ONCS_NOT_SUPPORTED | NVME_SC_DNR; 67 + return NVME_SC_ONCS_NOT_SUPPORTED | NVME_STATUS_DNR; 69 68 default: 70 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 69 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 71 70 } 72 71 break; 73 72 case -ENODATA: ··· 77 76 fallthrough; 78 77 default: 79 78 req->error_loc = offsetof(struct nvme_common_command, opcode); 80 - return NVME_SC_INTERNAL | NVME_SC_DNR; 79 + return NVME_SC_INTERNAL | NVME_STATUS_DNR; 81 80 } 82 81 } 83 82 ··· 87 86 req->sq->qid); 88 87 89 88 req->error_loc = offsetof(struct nvme_common_command, opcode); 90 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 89 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 91 90 } 92 91 93 92 static struct nvmet_subsys *nvmet_find_get_subsys(struct nvmet_port *port, ··· 98 97 { 99 98 if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len) { 100 99 req->error_loc = offsetof(struct nvme_common_command, dptr); 101 - return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR; 100 + return NVME_SC_SGL_INVALID_DATA | NVME_STATUS_DNR; 102 101 } 103 102 return 0; 104 103 } ··· 107 106 { 108 107 if (sg_pcopy_to_buffer(req->sg, req->sg_cnt, buf, len, off) != len) { 109 108 req->error_loc = 
offsetof(struct nvme_common_command, dptr); 110 - return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR; 109 + return NVME_SC_SGL_INVALID_DATA | NVME_STATUS_DNR; 111 110 } 112 111 return 0; 113 112 } ··· 116 115 { 117 116 if (sg_zero_buffer(req->sg, req->sg_cnt, len, off) != len) { 118 117 req->error_loc = offsetof(struct nvme_common_command, dptr); 119 - return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR; 118 + return NVME_SC_SGL_INVALID_DATA | NVME_STATUS_DNR; 120 119 } 121 120 return 0; 122 121 } ··· 146 145 while (ctrl->nr_async_event_cmds) { 147 146 req = ctrl->async_event_cmds[--ctrl->nr_async_event_cmds]; 148 147 mutex_unlock(&ctrl->lock); 149 - nvmet_req_complete(req, NVME_SC_INTERNAL | NVME_SC_DNR); 148 + nvmet_req_complete(req, NVME_SC_INTERNAL | NVME_STATUS_DNR); 150 149 mutex_lock(&ctrl->lock); 151 150 } 152 151 mutex_unlock(&ctrl->lock); ··· 445 444 req->error_loc = offsetof(struct nvme_common_command, nsid); 446 445 if (nvmet_subsys_nsid_exists(subsys, nsid)) 447 446 return NVME_SC_INTERNAL_PATH_ERROR; 448 - return NVME_SC_INVALID_NS | NVME_SC_DNR; 447 + return NVME_SC_INVALID_NS | NVME_STATUS_DNR; 449 448 } 450 449 451 450 percpu_ref_get(&req->ns->ref); ··· 905 904 return nvmet_parse_fabrics_io_cmd(req); 906 905 907 906 if (unlikely(!nvmet_check_auth_status(req))) 908 - return NVME_SC_AUTH_REQUIRED | NVME_SC_DNR; 907 + return NVME_SC_AUTH_REQUIRED | NVME_STATUS_DNR; 909 908 910 909 ret = nvmet_check_ctrl_status(req); 911 910 if (unlikely(ret)) ··· 968 967 /* no support for fused commands yet */ 969 968 if (unlikely(flags & (NVME_CMD_FUSE_FIRST | NVME_CMD_FUSE_SECOND))) { 970 969 req->error_loc = offsetof(struct nvme_common_command, flags); 971 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 970 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 972 971 goto fail; 973 972 } 974 973 ··· 979 978 */ 980 979 if (unlikely((flags & NVME_CMD_SGL_ALL) != NVME_CMD_SGL_METABUF)) { 981 980 req->error_loc = offsetof(struct nvme_common_command, flags); 982 - status = 
NVME_SC_INVALID_FIELD | NVME_SC_DNR; 981 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 983 982 goto fail; 984 983 } 985 984 ··· 997 996 trace_nvmet_req_init(req, req->cmd); 998 997 999 998 if (unlikely(!percpu_ref_tryget_live(&sq->ref))) { 1000 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 999 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 1001 1000 goto fail; 1002 1001 } 1003 1002 ··· 1024 1023 { 1025 1024 if (unlikely(len != req->transfer_len)) { 1026 1025 req->error_loc = offsetof(struct nvme_common_command, dptr); 1027 - nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR); 1026 + nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_STATUS_DNR); 1028 1027 return false; 1029 1028 } 1030 1029 ··· 1036 1035 { 1037 1036 if (unlikely(data_len > req->transfer_len)) { 1038 1037 req->error_loc = offsetof(struct nvme_common_command, dptr); 1039 - nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR); 1038 + nvmet_req_complete(req, NVME_SC_SGL_INVALID_DATA | NVME_STATUS_DNR); 1040 1039 return false; 1041 1040 } 1042 1041 ··· 1305 1304 if (unlikely(!(req->sq->ctrl->cc & NVME_CC_ENABLE))) { 1306 1305 pr_err("got cmd %d while CC.EN == 0 on qid = %d\n", 1307 1306 req->cmd->common.opcode, req->sq->qid); 1308 - return NVME_SC_CMD_SEQ_ERROR | NVME_SC_DNR; 1307 + return NVME_SC_CMD_SEQ_ERROR | NVME_STATUS_DNR; 1309 1308 } 1310 1309 1311 1310 if (unlikely(!(req->sq->ctrl->csts & NVME_CSTS_RDY))) { 1312 1311 pr_err("got cmd %d while CSTS.RDY == 0 on qid = %d\n", 1313 1312 req->cmd->common.opcode, req->sq->qid); 1314 - return NVME_SC_CMD_SEQ_ERROR | NVME_SC_DNR; 1313 + return NVME_SC_CMD_SEQ_ERROR | NVME_STATUS_DNR; 1315 1314 } 1316 1315 1317 1316 if (unlikely(!nvmet_check_auth_status(req))) { 1318 1317 pr_warn("qid %d not authenticated\n", req->sq->qid); 1319 - return NVME_SC_AUTH_REQUIRED | NVME_SC_DNR; 1318 + return NVME_SC_AUTH_REQUIRED | NVME_STATUS_DNR; 1320 1319 } 1321 1320 return 0; 1322 1321 } ··· 1390 1389 int ret; 1391 1390 u16 
status; 1392 1391 1393 - status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; 1392 + status = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 1394 1393 subsys = nvmet_find_get_subsys(req->port, subsysnqn); 1395 1394 if (!subsys) { 1396 1395 pr_warn("connect request for invalid subsystem %s!\n", ··· 1406 1405 hostnqn, subsysnqn); 1407 1406 req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(hostnqn); 1408 1407 up_read(&nvmet_config_sem); 1409 - status = NVME_SC_CONNECT_INVALID_HOST | NVME_SC_DNR; 1408 + status = NVME_SC_CONNECT_INVALID_HOST | NVME_STATUS_DNR; 1410 1409 req->error_loc = offsetof(struct nvme_common_command, dptr); 1411 1410 goto out_put_subsystem; 1412 1411 } ··· 1457 1456 subsys->cntlid_min, subsys->cntlid_max, 1458 1457 GFP_KERNEL); 1459 1458 if (ret < 0) { 1460 - status = NVME_SC_CONNECT_CTRL_BUSY | NVME_SC_DNR; 1459 + status = NVME_SC_CONNECT_CTRL_BUSY | NVME_STATUS_DNR; 1461 1460 goto out_free_sqs; 1462 1461 } 1463 1462 ctrl->cntlid = ret; ··· 1480 1479 mutex_lock(&subsys->lock); 1481 1480 list_add_tail(&ctrl->subsys_entry, &subsys->ctrls); 1482 1481 nvmet_setup_p2p_ns_map(ctrl, req); 1482 + nvmet_debugfs_ctrl_setup(ctrl); 1483 1483 mutex_unlock(&subsys->lock); 1484 1484 1485 1485 *ctrlp = ctrl; ··· 1515 1513 1516 1514 nvmet_destroy_auth(ctrl); 1517 1515 1516 + nvmet_debugfs_ctrl_free(ctrl); 1517 + 1518 1518 ida_free(&cntlid_ida, ctrl->cntlid); 1519 1519 1520 1520 nvmet_async_events_free(ctrl); ··· 1542 1538 mutex_unlock(&ctrl->lock); 1543 1539 } 1544 1540 EXPORT_SYMBOL_GPL(nvmet_ctrl_fatal_error); 1541 + 1542 + ssize_t nvmet_ctrl_host_traddr(struct nvmet_ctrl *ctrl, 1543 + char *traddr, size_t traddr_len) 1544 + { 1545 + if (!ctrl->ops->host_traddr) 1546 + return -EOPNOTSUPP; 1547 + return ctrl->ops->host_traddr(ctrl, traddr, traddr_len); 1548 + } 1545 1549 1546 1550 static struct nvmet_subsys *nvmet_find_get_subsys(struct nvmet_port *port, 1547 1551 const char *subsysnqn) ··· 1645 1633 INIT_LIST_HEAD(&subsys->ctrls); 1646 1634 
INIT_LIST_HEAD(&subsys->hosts); 1647 1635 1636 + ret = nvmet_debugfs_subsys_setup(subsys); 1637 + if (ret) 1638 + goto free_subsysnqn; 1639 + 1648 1640 return subsys; 1649 1641 1642 + free_subsysnqn: 1643 + kfree(subsys->subsysnqn); 1650 1644 free_fr: 1651 1645 kfree(subsys->firmware_rev); 1652 1646 free_mn: ··· 1668 1650 container_of(ref, struct nvmet_subsys, ref); 1669 1651 1670 1652 WARN_ON_ONCE(!xa_empty(&subsys->namespaces)); 1653 + 1654 + nvmet_debugfs_subsys_free(subsys); 1671 1655 1672 1656 xa_destroy(&subsys->namespaces); 1673 1657 nvmet_passthru_subsys_free(subsys); ··· 1725 1705 if (error) 1726 1706 goto out_free_nvmet_work_queue; 1727 1707 1728 - error = nvmet_init_configfs(); 1708 + error = nvmet_init_debugfs(); 1729 1709 if (error) 1730 1710 goto out_exit_discovery; 1711 + 1712 + error = nvmet_init_configfs(); 1713 + if (error) 1714 + goto out_exit_debugfs; 1715 + 1731 1716 return 0; 1732 1717 1718 + out_exit_debugfs: 1719 + nvmet_exit_debugfs(); 1733 1720 out_exit_discovery: 1734 1721 nvmet_exit_discovery(); 1735 1722 out_free_nvmet_work_queue: ··· 1753 1726 static void __exit nvmet_exit(void) 1754 1727 { 1755 1728 nvmet_exit_configfs(); 1729 + nvmet_exit_debugfs(); 1756 1730 nvmet_exit_discovery(); 1757 1731 ida_destroy(&cntlid_ida); 1758 1732 destroy_workqueue(nvmet_wq);
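Note on the NVME_SC_DNR to NVME_STATUS_DNR rename in this file: the change is mechanical, and it appears to reflect that "Do Not Retry" is a flag bit in the completion status word rather than a status code of its own. The following is a minimal userspace sketch of the errno-to-status mapping in the first hunks, not the kernel implementation. All SKETCH_* names and their numeric values are assumptions made for illustration, and the per-opcode distinction for -EOPNOTSUPP is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef EREMOTEIO
#define EREMOTEIO 121 /* Linux value; defined here only for non-Linux hosts */
#endif

/* Illustrative values only (assumptions, not taken from the diff). */
#define SKETCH_SC_SUCCESS        0x0000u
#define SKETCH_SC_INVALID_OPCODE 0x0001u
#define SKETCH_SC_INTERNAL       0x0006u
#define SKETCH_SC_LBA_RANGE      0x0080u
#define SKETCH_SC_CAP_EXCEEDED   0x0081u
#define SKETCH_STATUS_DNR        0x4000u /* flag bit, not a status code */

/* Map a negative errno to a status word, setting DNR on permanent errors. */
static uint16_t sketch_errno_to_status(int err)
{
	switch (err) {
	case 0:
		return SKETCH_SC_SUCCESS;
	case -ENOSPC:
		return SKETCH_SC_CAP_EXCEEDED | SKETCH_STATUS_DNR;
	case -EREMOTEIO:
		return SKETCH_SC_LBA_RANGE | SKETCH_STATUS_DNR;
	case -EOPNOTSUPP:
		/* Simplified: the diff also distinguishes DSM and Write Zeroes. */
		return SKETCH_SC_INVALID_OPCODE | SKETCH_STATUS_DNR;
	default:
		return SKETCH_SC_INTERNAL | SKETCH_STATUS_DNR;
	}
}

/* A host may retry only when the DNR flag is clear. */
static int sketch_status_is_retryable(uint16_t status)
{
	return !(status & SKETCH_STATUS_DNR);
}

/* Strip the DNR flag to recover the bare status code. */
static uint16_t sketch_status_code(uint16_t status)
{
	return (uint16_t)(status & ~SKETCH_STATUS_DNR);
}
```

Keeping the flag separate from the code is what lets a caller test retryability with a single mask, independent of which error was reported.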

+202

drivers/nvme/target/debugfs.c

···
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DebugFS interface for the NVMe target.
+ * Copyright (c) 2022-2024 Shadow
+ * Copyright (c) 2024 SUSE LLC
+ */
+
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+
+#include "nvmet.h"
+#include "debugfs.h"
+
+struct dentry *nvmet_debugfs;
+
+#define NVMET_DEBUGFS_ATTR(field) \
+	static int field##_open(struct inode *inode, struct file *file) \
+	{ return single_open(file, field##_show, inode->i_private); } \
+	\
+	static const struct file_operations field##_fops = { \
+		.open = field##_open, \
+		.read = seq_read, \
+		.release = single_release, \
+	}
+
+#define NVMET_DEBUGFS_RW_ATTR(field) \
+	static int field##_open(struct inode *inode, struct file *file) \
+	{ return single_open(file, field##_show, inode->i_private); } \
+	\
+	static const struct file_operations field##_fops = { \
+		.open = field##_open, \
+		.read = seq_read, \
+		.write = field##_write, \
+		.release = single_release, \
+	}
+
+static int nvmet_ctrl_hostnqn_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+
+	seq_puts(m, ctrl->hostnqn);
+	return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_hostnqn);
+
+static int nvmet_ctrl_kato_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+
+	seq_printf(m, "%d\n", ctrl->kato);
+	return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_kato);
+
+static int nvmet_ctrl_port_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+
+	seq_printf(m, "%d\n", le16_to_cpu(ctrl->port->disc_addr.portid));
+	return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_port);
+
+static const char *const csts_state_names[] = {
+	[NVME_CSTS_RDY] = "ready",
+	[NVME_CSTS_CFS] = "fatal",
+	[NVME_CSTS_NSSRO] = "reset",
+	[NVME_CSTS_SHST_OCCUR] = "shutdown",
+	[NVME_CSTS_SHST_CMPLT] = "completed",
+	[NVME_CSTS_PP] = "paused",
+};
+
+static int nvmet_ctrl_state_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+	bool sep = false;
+	int i;
+
+	for (i = 0; i < 7; i++) {
+		int state = BIT(i);
+
+		if (!(ctrl->csts & state))
+			continue;
+		if (sep)
+			seq_puts(m, "|");
+		sep = true;
+		if (csts_state_names[state])
+			seq_puts(m, csts_state_names[state]);
+		else
+			seq_printf(m, "%d", state);
+	}
+	if (sep)
+		seq_printf(m, "\n");
+	return 0;
+}
+
+static ssize_t nvmet_ctrl_state_write(struct file *file, const char __user *buf,
+				      size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct nvmet_ctrl *ctrl = m->private;
+	char reset[16];
+
+	if (count >= sizeof(reset))
+		return -EINVAL;
+	if (copy_from_user(reset, buf, count))
+		return -EFAULT;
+	if (!memcmp(reset, "fatal", 5))
+		nvmet_ctrl_fatal_error(ctrl);
+	else
+		return -EINVAL;
+	return count;
+}
+NVMET_DEBUGFS_RW_ATTR(nvmet_ctrl_state);
+
+static int nvmet_ctrl_host_traddr_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+	ssize_t size;
+	char buf[NVMF_TRADDR_SIZE + 1];
+
+	size = nvmet_ctrl_host_traddr(ctrl, buf, NVMF_TRADDR_SIZE);
+	if (size < 0) {
+		buf[0] = '\0';
+		size = 0;
+	}
+	buf[size] = '\0';
+	seq_printf(m, "%s\n", buf);
+	return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_host_traddr);
+
+int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
+{
+	char name[32];
+	struct dentry *parent = ctrl->subsys->debugfs_dir;
+	int ret;
+
+	if (!parent)
+		return -ENODEV;
+	snprintf(name, sizeof(name), "ctrl%d", ctrl->cntlid);
+	ctrl->debugfs_dir = debugfs_create_dir(name, parent);
+	if (IS_ERR(ctrl->debugfs_dir)) {
+		ret = PTR_ERR(ctrl->debugfs_dir);
+		ctrl->debugfs_dir = NULL;
+		return ret;
+	}
+	debugfs_create_file("port", S_IRUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_port_fops);
+	debugfs_create_file("hostnqn", S_IRUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_hostnqn_fops);
+	debugfs_create_file("kato", S_IRUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_kato_fops);
+	debugfs_create_file("state", S_IRUSR | S_IWUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_state_fops);
+	debugfs_create_file("host_traddr", S_IRUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_host_traddr_fops);
+	return 0;
+}
+
+void nvmet_debugfs_ctrl_free(struct nvmet_ctrl *ctrl)
+{
+	debugfs_remove_recursive(ctrl->debugfs_dir);
+}
+
+int nvmet_debugfs_subsys_setup(struct nvmet_subsys *subsys)
+{
+	int ret = 0;
+
+	subsys->debugfs_dir = debugfs_create_dir(subsys->subsysnqn,
+						 nvmet_debugfs);
+	if (IS_ERR(subsys->debugfs_dir)) {
+		ret = PTR_ERR(subsys->debugfs_dir);
+		subsys->debugfs_dir = NULL;
+	}
+	return ret;
+}
+
+void nvmet_debugfs_subsys_free(struct nvmet_subsys *subsys)
+{
+	debugfs_remove_recursive(subsys->debugfs_dir);
+}
+
+int __init nvmet_init_debugfs(void)
+{
+	struct dentry *parent;
+
+	parent = debugfs_create_dir("nvmet", NULL);
+	if (IS_ERR(parent)) {
+		pr_warn("%s: failed to create debugfs directory\n", "nvmet");
+		return PTR_ERR(parent);
+	}
+	nvmet_debugfs = parent;
+	return 0;
+}
+
+void nvmet_exit_debugfs(void)
+{
+	debugfs_remove_recursive(nvmet_debugfs);
+}
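Note on the "state" attribute: nvmet_ctrl_state_show() walks the low bits of the controller status value and prints the name of each set bit, joined by "|", falling back to the decimal bit value when no name is known. A minimal userspace sketch of that formatting follows. The SKETCH_* bit positions are assumptions for illustration, and unlike the kernel code the sketch looks names up by bit mask instead of indexing a table by bit value, writes into a caller buffer, and omits the trailing newline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical name table; bit positions are assumed, not taken from the diff. */
struct sketch_flag_name {
	unsigned int bit;
	const char *name;
};

static const struct sketch_flag_name sketch_names[] = {
	{ 1u << 0, "ready" },
	{ 1u << 1, "fatal" },
	{ 1u << 2, "shutdown" },
	{ 1u << 3, "completed" },
	{ 1u << 4, "reset" },
	{ 1u << 5, "paused" },
};

/* Return the name for a single bit, or NULL if the bit has no name. */
static const char *sketch_lookup(unsigned int bit)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_names) / sizeof(sketch_names[0]); i++)
		if (sketch_names[i].bit == bit)
			return sketch_names[i].name;
	return NULL;
}

/*
 * Format the set bits of csts as "name|name|...". Returns the string
 * length, or -1 if the buffer is too small.
 */
static int sketch_format_csts(unsigned int csts, char *out, size_t len)
{
	size_t used = 0;
	int sep = 0;
	int i;

	if (len == 0)
		return -1;
	out[0] = '\0';
	for (i = 0; i < 7; i++) {
		unsigned int bit = 1u << i;
		const char *name;
		int n;

		if (!(csts & bit))
			continue;
		name = sketch_lookup(bit);
		if (name)
			n = snprintf(out + used, len - used, "%s%s",
				     sep ? "|" : "", name);
		else
			n = snprintf(out + used, len - used, "%s%u",
				     sep ? "|" : "", bit);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		sep = 1;
	}
	return (int)used;
}
```

With these assumed bit positions, a status value of 0x3 formats as "ready|fatal", and a value with only bit 6 set formats as "64".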

+42

drivers/nvme/target/debugfs.h

···
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * DebugFS interface for the NVMe target.
+ * Copyright (c) 2022-2024 Shadow
+ * Copyright (c) 2024 SUSE LLC
+ */
+#ifndef NVMET_DEBUGFS_H
+#define NVMET_DEBUGFS_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_NVME_TARGET_DEBUGFS
+int nvmet_debugfs_subsys_setup(struct nvmet_subsys *subsys);
+void nvmet_debugfs_subsys_free(struct nvmet_subsys *subsys);
+int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl);
+void nvmet_debugfs_ctrl_free(struct nvmet_ctrl *ctrl);
+
+int __init nvmet_init_debugfs(void);
+void nvmet_exit_debugfs(void);
+#else
+static inline int nvmet_debugfs_subsys_setup(struct nvmet_subsys *subsys)
+{
+	return 0;
+}
+static inline void nvmet_debugfs_subsys_free(struct nvmet_subsys *subsys){}
+
+static inline int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
+{
+	return 0;
+}
+static inline void nvmet_debugfs_ctrl_free(struct nvmet_ctrl *ctrl) {}
+
+static inline int __init nvmet_init_debugfs(void)
+{
+	return 0;
+}
+
+static inline void nvmet_exit_debugfs(void) {}
+
+#endif
+
+#endif /* NVMET_DEBUGFS_H */

+7 -7

drivers/nvme/target/discovery.c

··· 179 179 if (req->cmd->get_log_page.lid != NVME_LOG_DISC) { 180 180 req->error_loc = 181 181 offsetof(struct nvme_get_log_page_command, lid); 182 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 182 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 183 183 goto out; 184 184 } 185 185 ··· 187 187 if (offset & 0x3) { 188 188 req->error_loc = 189 189 offsetof(struct nvme_get_log_page_command, lpo); 190 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 190 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 191 191 goto out; 192 192 } 193 193 ··· 256 256 257 257 if (req->cmd->identify.cns != NVME_ID_CNS_CTRL) { 258 258 req->error_loc = offsetof(struct nvme_identify, cns); 259 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 259 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 260 260 goto out; 261 261 } 262 262 ··· 320 320 default: 321 321 req->error_loc = 322 322 offsetof(struct nvme_common_command, cdw10); 323 - stat = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 323 + stat = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 324 324 break; 325 325 } 326 326 ··· 345 345 default: 346 346 req->error_loc = 347 347 offsetof(struct nvme_common_command, cdw10); 348 - stat = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 348 + stat = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 349 349 break; 350 350 } 351 351 ··· 361 361 cmd->common.opcode); 362 362 req->error_loc = 363 363 offsetof(struct nvme_common_command, opcode); 364 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 364 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 365 365 } 366 366 367 367 switch (cmd->common.opcode) { ··· 386 386 default: 387 387 pr_debug("unhandled cmd %d\n", cmd->common.opcode); 388 388 req->error_loc = offsetof(struct nvme_common_command, opcode); 389 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 389 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 390 390 } 391 391 392 392 }

+8 -8

drivers/nvme/target/fabrics-cmd-auth.c

··· 189 189 u8 dhchap_status; 190 190 191 191 if (req->cmd->auth_send.secp != NVME_AUTH_DHCHAP_PROTOCOL_IDENTIFIER) { 192 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 192 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 193 193 req->error_loc = 194 194 offsetof(struct nvmf_auth_send_command, secp); 195 195 goto done; 196 196 } 197 197 if (req->cmd->auth_send.spsp0 != 0x01) { 198 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 198 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 199 199 req->error_loc = 200 200 offsetof(struct nvmf_auth_send_command, spsp0); 201 201 goto done; 202 202 } 203 203 if (req->cmd->auth_send.spsp1 != 0x01) { 204 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 204 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 205 205 req->error_loc = 206 206 offsetof(struct nvmf_auth_send_command, spsp1); 207 207 goto done; 208 208 } 209 209 tl = le32_to_cpu(req->cmd->auth_send.tl); 210 210 if (!tl) { 211 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 211 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 212 212 req->error_loc = 213 213 offsetof(struct nvmf_auth_send_command, tl); 214 214 goto done; ··· 437 437 u16 status = 0; 438 438 439 439 if (req->cmd->auth_receive.secp != NVME_AUTH_DHCHAP_PROTOCOL_IDENTIFIER) { 440 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 440 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 441 441 req->error_loc = 442 442 offsetof(struct nvmf_auth_receive_command, secp); 443 443 goto done; 444 444 } 445 445 if (req->cmd->auth_receive.spsp0 != 0x01) { 446 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 446 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 447 447 req->error_loc = 448 448 offsetof(struct nvmf_auth_receive_command, spsp0); 449 449 goto done; 450 450 } 451 451 if (req->cmd->auth_receive.spsp1 != 0x01) { 452 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 452 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 453 453 req->error_loc = 454 454 offsetof(struct nvmf_auth_receive_command, spsp1); 
455 455 goto done; 456 456 } 457 457 al = le32_to_cpu(req->cmd->auth_receive.al); 458 458 if (!al) { 459 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 459 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 460 460 req->error_loc = 461 461 offsetof(struct nvmf_auth_receive_command, al); 462 462 goto done;

+18 -18

drivers/nvme/target/fabrics-cmd.c

··· 18 18 if (req->cmd->prop_set.attrib & 1) { 19 19 req->error_loc = 20 20 offsetof(struct nvmf_property_set_command, attrib); 21 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 21 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 22 22 goto out; 23 23 } 24 24 ··· 29 29 default: 30 30 req->error_loc = 31 31 offsetof(struct nvmf_property_set_command, offset); 32 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 32 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 33 33 } 34 34 out: 35 35 nvmet_req_complete(req, status); ··· 50 50 val = ctrl->cap; 51 51 break; 52 52 default: 53 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 53 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 54 54 break; 55 55 } 56 56 } else { ··· 65 65 val = ctrl->csts; 66 66 break; 67 67 default: 68 - status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; 68 + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR; 69 69 break; 70 70 } 71 71 } ··· 105 105 pr_debug("received unknown capsule type 0x%x\n", 106 106 cmd->fabrics.fctype); 107 107 req->error_loc = offsetof(struct nvmf_common_command, fctype); 108 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 108 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 109 109 } 110 110 111 111 return 0; ··· 128 128 pr_debug("received unknown capsule type 0x%x\n", 129 129 cmd->fabrics.fctype); 130 130 req->error_loc = offsetof(struct nvmf_common_command, fctype); 131 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 131 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 132 132 } 133 133 134 134 return 0; ··· 147 147 pr_warn("queue size zero!\n"); 148 148 req->error_loc = offsetof(struct nvmf_connect_command, sqsize); 149 149 req->cqe->result.u32 = IPO_IATTR_CONNECT_SQE(sqsize); 150 - ret = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; 150 + ret = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 151 151 goto err; 152 152 } 153 153 154 154 if (ctrl->sqs[qid] != NULL) { 155 155 pr_warn("qid %u has already been created\n", qid); 156 156 req->error_loc = offsetof(struct 
nvmf_connect_command, qid); 157 - return NVME_SC_CMD_SEQ_ERROR | NVME_SC_DNR; 157 + return NVME_SC_CMD_SEQ_ERROR | NVME_STATUS_DNR; 158 158 } 159 159 160 160 /* for fabrics, this value applies to only the I/O Submission Queues */ ··· 163 163 sqsize, mqes, ctrl->cntlid); 164 164 req->error_loc = offsetof(struct nvmf_connect_command, sqsize); 165 165 req->cqe->result.u32 = IPO_IATTR_CONNECT_SQE(sqsize); 166 - return NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; 166 + return NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 167 167 } 168 168 169 169 old = cmpxchg(&req->sq->ctrl, NULL, ctrl); 170 170 if (old) { 171 171 pr_warn("queue already connected!\n"); 172 172 req->error_loc = offsetof(struct nvmf_connect_command, opcode); 173 - return NVME_SC_CONNECT_CTRL_BUSY | NVME_SC_DNR; 173 + return NVME_SC_CONNECT_CTRL_BUSY | NVME_STATUS_DNR; 174 174 } 175 175 176 176 /* note: convert queue size from 0's-based value to 1's-based value */ ··· 230 230 pr_warn("invalid connect version (%d).\n", 231 231 le16_to_cpu(c->recfmt)); 232 232 req->error_loc = offsetof(struct nvmf_connect_command, recfmt); 233 - status = NVME_SC_CONNECT_FORMAT | NVME_SC_DNR; 233 + status = NVME_SC_CONNECT_FORMAT | NVME_STATUS_DNR; 234 234 goto out; 235 235 } 236 236 237 237 if (unlikely(d->cntlid != cpu_to_le16(0xffff))) { 238 238 pr_warn("connect attempt for invalid controller ID %#x\n", 239 239 d->cntlid); 240 - status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; 240 + status = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 241 241 req->cqe->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid); 242 242 goto out; 243 243 } ··· 257 257 dhchap_status); 258 258 nvmet_ctrl_put(ctrl); 259 259 if (dhchap_status == NVME_AUTH_DHCHAP_FAILURE_FAILED) 260 - status = (NVME_SC_CONNECT_INVALID_HOST | NVME_SC_DNR); 260 + status = (NVME_SC_CONNECT_INVALID_HOST | NVME_STATUS_DNR); 261 261 else 262 262 status = NVME_SC_INTERNAL; 263 263 goto out; ··· 305 305 if (c->recfmt != 0) { 306 306 pr_warn("invalid connect version 
(%d).\n", 307 307 le16_to_cpu(c->recfmt)); 308 - status = NVME_SC_CONNECT_FORMAT | NVME_SC_DNR; 308 + status = NVME_SC_CONNECT_FORMAT | NVME_STATUS_DNR; 309 309 goto out; 310 310 } 311 311 ··· 314 314 ctrl = nvmet_ctrl_find_get(d->subsysnqn, d->hostnqn, 315 315 le16_to_cpu(d->cntlid), req); 316 316 if (!ctrl) { 317 - status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; 317 + status = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 318 318 goto out; 319 319 } 320 320 321 321 if (unlikely(qid > ctrl->subsys->max_qid)) { 322 322 pr_warn("invalid queue id (%d)\n", qid); 323 - status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR; 323 + status = NVME_SC_CONNECT_INVALID_PARAM | NVME_STATUS_DNR; 324 324 req->cqe->result.u32 = IPO_IATTR_CONNECT_SQE(qid); 325 325 goto out_ctrl_put; 326 326 } ··· 350 350 pr_debug("invalid command 0x%x on unconnected queue.\n", 351 351 cmd->fabrics.opcode); 352 352 req->error_loc = offsetof(struct nvme_common_command, opcode); 353 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 353 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 354 354 } 355 355 if (cmd->fabrics.fctype != nvme_fabrics_type_connect) { 356 356 pr_debug("invalid capsule type 0x%x on unconnected queue.\n", 357 357 cmd->fabrics.fctype); 358 358 req->error_loc = offsetof(struct nvmf_common_command, fctype); 359 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 359 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 360 360 } 361 361 362 362 if (cmd->connect.qid == 0)

+33

drivers/nvme/target/fc.c

··· 2934 2934 tgtport->ops->discovery_event(&tgtport->fc_target_port); 2935 2935 } 2936 2936 2937 + static ssize_t 2938 + nvmet_fc_host_traddr(struct nvmet_ctrl *ctrl, 2939 + char *traddr, size_t traddr_size) 2940 + { 2941 + struct nvmet_sq *sq = ctrl->sqs[0]; 2942 + struct nvmet_fc_tgt_queue *queue = 2943 + container_of(sq, struct nvmet_fc_tgt_queue, nvme_sq); 2944 + struct nvmet_fc_tgtport *tgtport = queue->assoc ? queue->assoc->tgtport : NULL; 2945 + struct nvmet_fc_hostport *hostport = queue->assoc ? queue->assoc->hostport : NULL; 2946 + u64 wwnn, wwpn; 2947 + ssize_t ret = 0; 2948 + 2949 + if (!tgtport || !nvmet_fc_tgtport_get(tgtport)) 2950 + return -ENODEV; 2951 + if (!hostport || !nvmet_fc_hostport_get(hostport)) { 2952 + ret = -ENODEV; 2953 + goto out_put; 2954 + } 2955 + 2956 + if (tgtport->ops->host_traddr) { 2957 + ret = tgtport->ops->host_traddr(hostport->hosthandle, &wwnn, &wwpn); 2958 + if (ret) 2959 + goto out_put_host; 2960 + ret = snprintf(traddr, traddr_size, "nn-0x%llx:pn-0x%llx", wwnn, wwpn); 2961 + } 2962 + out_put_host: 2963 + nvmet_fc_hostport_put(hostport); 2964 + out_put: 2965 + nvmet_fc_tgtport_put(tgtport); 2966 + return ret; 2967 + } 2968 + 2937 2969 static const struct nvmet_fabrics_ops nvmet_fc_tgt_fcp_ops = { 2938 2970 .owner = THIS_MODULE, 2939 2971 .type = NVMF_TRTYPE_FC, ··· 2975 2943 .queue_response = nvmet_fc_fcp_nvme_cmd_done, 2976 2944 .delete_ctrl = nvmet_fc_delete_ctrl, 2977 2945 .discovery_chg = nvmet_fc_discovery_chg, 2946 + .host_traddr = nvmet_fc_host_traddr, 2978 2947 }; 2979 2948 2980 2949 static int __init nvmet_fc_init_module(void)

+11

drivers/nvme/target/fcloop.c

··· 492 492 /* host handle ignored for now */ 493 493 } 494 494 495 + static int 496 + fcloop_t2h_host_traddr(void *hosthandle, u64 *wwnn, u64 *wwpn) 497 + { 498 + struct fcloop_rport *rport = hosthandle; 499 + 500 + *wwnn = rport->lport->localport->node_name; 501 + *wwpn = rport->lport->localport->port_name; 502 + return 0; 503 + } 504 + 495 505 /* 496 506 * Simulate reception of RSCN and converting it to a initiator transport 497 507 * call to rescan a remote port. ··· 1084 1074 .ls_req = fcloop_t2h_ls_req, 1085 1075 .ls_abort = fcloop_t2h_ls_abort, 1086 1076 .host_release = fcloop_t2h_host_release, 1077 + .host_traddr = fcloop_t2h_host_traddr, 1087 1078 .max_hw_queues = FCLOOP_HW_QUEUES, 1088 1079 .max_sgl_segments = FCLOOP_SGL_SEGS, 1089 1080 .max_dif_sgl_segments = FCLOOP_SGL_SEGS,
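Note on the FC host_traddr path: the FC transport asks the lower-level driver for the host's node and port names through an optional callback (fcloop supplies fcloop_t2h_host_traddr) and formats them as "nn-0x...:pn-0x...". The sketch below is a userspace illustration under stated assumptions: struct sketch_ops, sketch_fixed_names and the WWN values are hypothetical, the reference counting on the target and host ports is omitted, and the two layers in the diff (the generic -EOPNOTSUPP check in core.c and the FC formatting in fc.c) are collapsed into one function.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a transport template with an optional callback. */
struct sketch_ops {
	int (*host_traddr)(void *hosthandle, uint64_t *wwnn, uint64_t *wwpn);
};

/* Hypothetical driver callback that reports fixed, made-up names. */
static int sketch_fixed_names(void *hosthandle, uint64_t *wwnn, uint64_t *wwpn)
{
	(void)hosthandle;
	*wwnn = 0x20000090fa000001ULL;
	*wwpn = 0x10000090fa000001ULL;
	return 0;
}

/*
 * Format the host address as "nn-0x<wwnn>:pn-0x<wwpn>". Returns the
 * snprintf() result, or a negative errno if the callback is missing
 * or fails.
 */
static long sketch_host_traddr(const struct sketch_ops *ops, void *hosthandle,
			       char *traddr, size_t len)
{
	uint64_t wwnn, wwpn;
	int ret;

	if (!ops->host_traddr)
		return -EOPNOTSUPP;
	ret = ops->host_traddr(hosthandle, &wwnn, &wwpn);
	if (ret)
		return ret;
	return snprintf(traddr, len, "nn-0x%llx:pn-0x%llx",
			(unsigned long long)wwnn, (unsigned long long)wwpn);
}
```

Making the callback optional lets drivers that cannot report a host address leave the field unset, in which case the caller sees an error instead of a fabricated address.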

+15 -13

drivers/nvme/target/io-cmd-bdev.c

··· 61 61 { 62 62 struct blk_integrity *bi = bdev_get_integrity(ns->bdev); 63 63 64 - if (bi) { 64 + if (!bi) 65 + return; 66 + 67 + if (bi->csum_type == BLK_INTEGRITY_CSUM_CRC) { 65 68 ns->metadata_size = bi->tuple_size; 66 - if (bi->profile == &t10_pi_type1_crc) 69 + if (bi->flags & BLK_INTEGRITY_REF_TAG) 67 70 ns->pi_type = NVME_NS_DPS_PI_TYPE1; 68 - else if (bi->profile == &t10_pi_type3_crc) 69 - ns->pi_type = NVME_NS_DPS_PI_TYPE3; 70 71 else 71 - /* Unsupported metadata type */ 72 - ns->metadata_size = 0; 72 + ns->pi_type = NVME_NS_DPS_PI_TYPE3; 73 + } else { 74 + ns->metadata_size = 0; 73 75 } 74 76 } 75 77 ··· 104 102 105 103 ns->pi_type = 0; 106 104 ns->metadata_size = 0; 107 - if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY_T10)) 105 + if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) 108 106 nvmet_bdev_ns_enable_integrity(ns); 109 107 110 108 if (bdev_is_zoned(ns->bdev)) { ··· 137 135 */ 138 136 switch (blk_sts) { 139 137 case BLK_STS_NOSPC: 140 - status = NVME_SC_CAP_EXCEEDED | NVME_SC_DNR; 138 + status = NVME_SC_CAP_EXCEEDED | NVME_STATUS_DNR; 141 139 req->error_loc = offsetof(struct nvme_rw_command, length); 142 140 break; 143 141 case BLK_STS_TARGET: 144 - status = NVME_SC_LBA_RANGE | NVME_SC_DNR; 142 + status = NVME_SC_LBA_RANGE | NVME_STATUS_DNR; 145 143 req->error_loc = offsetof(struct nvme_rw_command, slba); 146 144 break; 147 145 case BLK_STS_NOTSUPP: ··· 149 147 switch (req->cmd->common.opcode) { 150 148 case nvme_cmd_dsm: 151 149 case nvme_cmd_write_zeroes: 152 - status = NVME_SC_ONCS_NOT_SUPPORTED | NVME_SC_DNR; 150 + status = NVME_SC_ONCS_NOT_SUPPORTED | NVME_STATUS_DNR; 153 151 break; 154 152 default: 155 - status = NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 153 + status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 156 154 } 157 155 break; 158 156 case BLK_STS_MEDIUM: ··· 161 159 break; 162 160 case BLK_STS_IOERR: 163 161 default: 164 - status = NVME_SC_INTERNAL | NVME_SC_DNR; 162 + status = NVME_SC_INTERNAL | NVME_STATUS_DNR; 165 163 req->error_loc = 
offsetof(struct nvme_common_command, opcode); 166 164 } 167 165 ··· 358 356 return 0; 359 357 360 358 if (blkdev_issue_flush(req->ns->bdev)) 361 - return NVME_SC_INTERNAL | NVME_SC_DNR; 359 + return NVME_SC_INTERNAL | NVME_STATUS_DNR; 362 360 return 0; 363 361 } 364 362
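Note on the integrity hunk in this file: protection information type is now derived from the checksum type and a reference-tag flag instead of comparing profile pointers. CRC with a reference tag maps to Type 1, CRC without one maps to Type 3, and any other checksum type is treated as unsupported by clearing the metadata size. A minimal userspace sketch of that decision follows; every sketch_* and SKETCH_* name is a hypothetical stand-in for the kernel types, and the enum values are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the block layer and nvmet definitions. */
enum sketch_csum { SKETCH_CSUM_NONE, SKETCH_CSUM_IP, SKETCH_CSUM_CRC };
enum sketch_pi { SKETCH_PI_NONE = 0, SKETCH_PI_TYPE1 = 1, SKETCH_PI_TYPE3 = 3 };

#define SKETCH_FLAG_REF_TAG (1u << 0)

struct sketch_integrity {
	enum sketch_csum csum_type;
	unsigned int flags;
	unsigned int tuple_size;
};

struct sketch_ns {
	unsigned int metadata_size;
	enum sketch_pi pi_type;
};

/* Mirror of the new control flow: early return, then CRC vs. everything else. */
static void sketch_enable_integrity(struct sketch_ns *ns,
				    const struct sketch_integrity *bi)
{
	if (!bi)
		return;

	if (bi->csum_type == SKETCH_CSUM_CRC) {
		ns->metadata_size = bi->tuple_size;
		if (bi->flags & SKETCH_FLAG_REF_TAG)
			ns->pi_type = SKETCH_PI_TYPE1;
		else
			ns->pi_type = SKETCH_PI_TYPE3;
	} else {
		/* Unsupported metadata type */
		ns->metadata_size = 0;
	}
}
```

As in the diff, the caller is expected to have reset pi_type and metadata_size to zero before calling, so the unsupported branch only needs to clear the size.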

+5

drivers/nvme/target/loop.c

··· 555 555 goto out; 556 556 } 557 557 558 + ret = nvme_add_ctrl(&ctrl->ctrl); 559 + if (ret) 560 + goto out_put_ctrl; 561 + 558 562 if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) 559 563 WARN_ON_ONCE(1); 560 564 ··· 615 611 kfree(ctrl->queues); 616 612 out_uninit_ctrl: 617 613 nvme_uninit_ctrl(&ctrl->ctrl); 614 + out_put_ctrl: 618 615 nvme_put_ctrl(&ctrl->ctrl); 619 616 out: 620 617 if (ret > 0)

+10 -2

drivers/nvme/target/nvmet.h

··· 230 230 231 231 struct device *p2p_client; 232 232 struct radix_tree_root p2p_ns_map; 233 - 233 + #ifdef CONFIG_NVME_TARGET_DEBUGFS 234 + struct dentry *debugfs_dir; 235 + #endif 234 236 spinlock_t error_lock; 235 237 u64 err_counter; 236 238 struct nvme_error_slot slots[NVMET_ERROR_LOG_SLOTS]; ··· 264 262 265 263 struct list_head hosts; 266 264 bool allow_any_host; 267 - 265 + #ifdef CONFIG_NVME_TARGET_DEBUGFS 266 + struct dentry *debugfs_dir; 267 + #endif 268 268 u16 max_qid; 269 269 270 270 u64 ver; ··· 354 350 void (*delete_ctrl)(struct nvmet_ctrl *ctrl); 355 351 void (*disc_traddr)(struct nvmet_req *req, 356 352 struct nvmet_port *port, char *traddr); 353 + ssize_t (*host_traddr)(struct nvmet_ctrl *ctrl, 354 + char *traddr, size_t traddr_len); 357 355 u16 (*install_queue)(struct nvmet_sq *nvme_sq); 358 356 void (*discovery_chg)(struct nvmet_port *port); 359 357 u8 (*get_mdts)(const struct nvmet_ctrl *ctrl); ··· 504 498 struct nvmet_req *req); 505 499 void nvmet_ctrl_put(struct nvmet_ctrl *ctrl); 506 500 u16 nvmet_check_ctrl_status(struct nvmet_req *req); 501 + ssize_t nvmet_ctrl_host_traddr(struct nvmet_ctrl *ctrl, 502 + char *traddr, size_t traddr_len); 507 503 508 504 struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, 509 505 enum nvme_subsys_type type);

+5 -5

drivers/nvme/target/passthru.c

··· 306 306 ns = nvme_find_get_ns(ctrl, nsid); 307 307 if (unlikely(!ns)) { 308 308 pr_err("failed to get passthru ns nsid:%u\n", nsid); 309 - status = NVME_SC_INVALID_NS | NVME_SC_DNR; 309 + status = NVME_SC_INVALID_NS | NVME_STATUS_DNR; 310 310 goto out; 311 311 } 312 312 ··· 426 426 * emulated in the future if regular targets grow support for 427 427 * this feature. 428 428 */ 429 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 429 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 430 430 } 431 431 432 432 return nvmet_setup_passthru_command(req); ··· 478 478 case NVME_FEAT_RESV_PERSIST: 479 479 /* No reservations, see nvmet_parse_passthru_io_cmd() */ 480 480 default: 481 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 481 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 482 482 } 483 483 } 484 484 ··· 546 546 req->p.use_workqueue = true; 547 547 return NVME_SC_SUCCESS; 548 548 } 549 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 549 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 550 550 case NVME_ID_CNS_NS: 551 551 req->execute = nvmet_passthru_execute_cmd; 552 552 req->p.use_workqueue = true; ··· 558 558 req->p.use_workqueue = true; 559 559 return NVME_SC_SUCCESS; 560 560 } 561 - return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; 561 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 562 562 default: 563 563 return nvmet_setup_passthru_command(req); 564 564 }

+17 -5

drivers/nvme/target/rdma.c

···
 	if (!nvme_is_write(rsp->req.cmd)) {
 		rsp->req.error_loc =
 			offsetof(struct nvme_common_command, opcode);
-		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}
 
 	if (off + len > rsp->queue->dev->inline_data_size) {
 		pr_err("invalid inline data offset!\n");
-		return NVME_SC_SGL_INVALID_OFFSET | NVME_SC_DNR;
+		return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
 	}
 
 	/* no data command? */
···
 			pr_err("invalid SGL subtype: %#x\n", sgl->type);
 			rsp->req.error_loc =
 				offsetof(struct nvme_common_command, dptr);
-			return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+			return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 		}
 	case NVME_KEY_SGL_FMT_DATA_DESC:
 		switch (sgl->type & 0xf) {
···
 			pr_err("invalid SGL subtype: %#x\n", sgl->type);
 			rsp->req.error_loc =
 				offsetof(struct nvme_common_command, dptr);
-			return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+			return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 		}
 	default:
 		pr_err("invalid SGL type: %#x\n", sgl->type);
 		rsp->req.error_loc = offsetof(struct nvme_common_command, dptr);
-		return NVME_SC_SGL_INVALID_TYPE | NVME_SC_DNR;
+		return NVME_SC_SGL_INVALID_TYPE | NVME_STATUS_DNR;
 	}
 }
 
···
 	}
 }
 
+static ssize_t nvmet_rdma_host_port_addr(struct nvmet_ctrl *ctrl,
+		char *traddr, size_t traddr_len)
+{
+	struct nvmet_sq *nvme_sq = ctrl->sqs[0];
+	struct nvmet_rdma_queue *queue =
+		container_of(nvme_sq, struct nvmet_rdma_queue, nvme_sq);
+
+	return snprintf(traddr, traddr_len, "%pISc",
+			(struct sockaddr *)&queue->cm_id->route.addr.dst_addr);
+}
+
 static u8 nvmet_rdma_get_mdts(const struct nvmet_ctrl *ctrl)
 {
 	if (ctrl->pi_support)
···
 	.queue_response = nvmet_rdma_queue_response,
 	.delete_ctrl = nvmet_rdma_delete_ctrl,
 	.disc_traddr = nvmet_rdma_disc_port_addr,
+	.host_traddr = nvmet_rdma_host_port_addr,
 	.get_mdts = nvmet_rdma_get_mdts,
 	.get_max_queue_size = nvmet_rdma_get_max_queue_size,
 };

+16 -2

drivers/nvme/target/tcp.c

···
 	if (sgl->type == ((NVME_SGL_FMT_DATA_DESC << 4) |
 			  NVME_SGL_FMT_OFFSET)) {
 		if (!nvme_is_write(cmd->req.cmd))
-			return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+			return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 
 		if (len > cmd->req.port->inline_data_size)
-			return NVME_SC_SGL_INVALID_OFFSET | NVME_SC_DNR;
+			return NVME_SC_SGL_INVALID_OFFSET | NVME_STATUS_DNR;
 		cmd->pdu_len = len;
 	}
 	cmd->req.transfer_len += len;
···
 	}
 }
 
+static ssize_t nvmet_tcp_host_port_addr(struct nvmet_ctrl *ctrl,
+		char *traddr, size_t traddr_len)
+{
+	struct nvmet_sq *sq = ctrl->sqs[0];
+	struct nvmet_tcp_queue *queue =
+		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
+
+	if (queue->sockaddr_peer.ss_family == AF_UNSPEC)
+		return -EINVAL;
+	return snprintf(traddr, traddr_len, "%pISc",
+			(struct sockaddr *)&queue->sockaddr_peer);
+}
+
 static const struct nvmet_fabrics_ops nvmet_tcp_ops = {
 	.owner = THIS_MODULE,
 	.type = NVMF_TRTYPE_TCP,
···
 	.delete_ctrl = nvmet_tcp_delete_ctrl,
 	.install_queue = nvmet_tcp_install_queue,
 	.disc_traddr = nvmet_tcp_disc_port_addr,
+	.host_traddr = nvmet_tcp_host_port_addr,
 };
 
 static int __init nvmet_tcp_init(void)
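
The new host_traddr callback formats the peer address of the admin queue's connection and rejects a queue whose peer address was never recorded. As a rough illustration only, the same idea can be sketched in userspace C. The function name format_peer_addr is hypothetical, and inet_ntop() stands in for the kernel's "%pISc" format specifier, which is assumed here to print the address without a port.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical userspace analogue of nvmet_tcp_host_port_addr(): format the
 * peer address stored in a sockaddr_storage, refusing an unset (AF_UNSPEC)
 * or otherwise unsupported family. Returns the string length on success or
 * a negative errno value on failure.
 */
static ssize_t format_peer_addr(const struct sockaddr_storage *peer,
				char *buf, size_t buf_len)
{
	char text[INET6_ADDRSTRLEN];
	const void *raw;

	assert(peer && buf);

	switch (peer->ss_family) {
	case AF_INET:
		raw = &((const struct sockaddr_in *)peer)->sin_addr;
		break;
	case AF_INET6:
		raw = &((const struct sockaddr_in6 *)peer)->sin6_addr;
		break;
	default:
		/* Covers AF_UNSPEC: no peer address has been recorded. */
		return -EINVAL;
	}

	if (!inet_ntop(peer->ss_family, raw, text, sizeof(text)))
		return -errno;

	/* Like the kernel helper, report how many characters were needed. */
	return snprintf(buf, buf_len, "%s", text);
}
```

A caller would pass the stored peer address and a buffer, then treat a negative return as "no host address available" rather than as a fatal error.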

+15 -15

drivers/nvme/target/zns.c

···
 
 	if (le32_to_cpu(req->cmd->identify.nsid) == NVME_NSID_ALL) {
 		req->error_loc = offsetof(struct nvme_identify, nsid);
-		status = NVME_SC_INVALID_NS | NVME_SC_DNR;
+		status = NVME_SC_INVALID_NS | NVME_STATUS_DNR;
 		goto out;
 	}
 
···
 	}
 
 	if (!bdev_is_zoned(req->ns->bdev)) {
-		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 		req->error_loc = offsetof(struct nvme_identify, nsid);
 		goto out;
 	}
···
 
 	if (sect >= get_capacity(req->ns->bdev->bd_disk)) {
 		req->error_loc = offsetof(struct nvme_zone_mgmt_recv_cmd, slba);
-		return NVME_SC_LBA_RANGE | NVME_SC_DNR;
+		return NVME_SC_LBA_RANGE | NVME_STATUS_DNR;
 	}
 
 	if (out_bufsize < sizeof(struct nvme_zone_report)) {
 		req->error_loc = offsetof(struct nvme_zone_mgmt_recv_cmd, numd);
-		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}
 
 	if (req->cmd->zmr.zra != NVME_ZRA_ZONE_REPORT) {
 		req->error_loc = offsetof(struct nvme_zone_mgmt_recv_cmd, zra);
-		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}
 
 	switch (req->cmd->zmr.pr) {
···
 		break;
 	default:
 		req->error_loc = offsetof(struct nvme_zone_mgmt_recv_cmd, pr);
-		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}
 
 	switch (req->cmd->zmr.zrasf) {
···
 	default:
 		req->error_loc =
 			offsetof(struct nvme_zone_mgmt_recv_cmd, zrasf);
-		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}
 
 	return NVME_SC_SUCCESS;
···
 		return NVME_SC_SUCCESS;
 	case -EINVAL:
 	case -EIO:
-		return NVME_SC_ZONE_INVALID_TRANSITION | NVME_SC_DNR;
+		return NVME_SC_ZONE_INVALID_TRANSITION | NVME_STATUS_DNR;
 	default:
 		return NVME_SC_INTERNAL;
 	}
···
 	default:
 		/* this is needed to quiet compiler warning */
 		req->error_loc = offsetof(struct nvme_zone_mgmt_send_cmd, zsa);
-		return NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		return NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 	}
 
 	return NVME_SC_SUCCESS;
···
 
 	if (op == REQ_OP_LAST) {
 		req->error_loc = offsetof(struct nvme_zone_mgmt_send_cmd, zsa);
-		status = NVME_SC_ZONE_INVALID_TRANSITION | NVME_SC_DNR;
+		status = NVME_SC_ZONE_INVALID_TRANSITION | NVME_STATUS_DNR;
 		goto out;
 	}
 
···
 
 	if (sect >= get_capacity(bdev->bd_disk)) {
 		req->error_loc = offsetof(struct nvme_zone_mgmt_send_cmd, slba);
-		status = NVME_SC_LBA_RANGE | NVME_SC_DNR;
+		status = NVME_SC_LBA_RANGE | NVME_STATUS_DNR;
 		goto out;
 	}
 
 	if (sect & (zone_sectors - 1)) {
 		req->error_loc = offsetof(struct nvme_zone_mgmt_send_cmd, slba);
-		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 		goto out;
 	}
 
···
 
 	if (sect >= get_capacity(req->ns->bdev->bd_disk)) {
 		req->error_loc = offsetof(struct nvme_rw_command, slba);
-		status = NVME_SC_LBA_RANGE | NVME_SC_DNR;
+		status = NVME_SC_LBA_RANGE | NVME_STATUS_DNR;
 		goto out;
 	}
 
 	if (sect & (bdev_zone_sectors(req->ns->bdev) - 1)) {
 		req->error_loc = offsetof(struct nvme_rw_command, slba);
-		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+		status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
 		goto out;
 	}
 
···
 	}
 
 	if (total_len != nvmet_rw_data_len(req)) {
-		status = NVME_SC_INTERNAL | NVME_SC_DNR;
+		status = NVME_SC_INTERNAL | NVME_STATUS_DNR;
 		goto out_put_bio;
 	}
 
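
Every change in this file swaps NVME_SC_DNR for NVME_STATUS_DNR, which reflects that "do not retry" is a flag ORed onto a status code rather than a status code of its own. The sketch below illustrates that split with the same start-LBA checks the hunks above perform. All EX_* names are hypothetical, and the numeric values are assumptions believed to match include/linux/nvme.h, so verify them against the header before relying on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; values are assumptions, not copied from the diff. */
enum {
	EX_SC_SUCCESS       = 0x0,
	EX_SC_INVALID_FIELD = 0x2,
	EX_SC_LBA_RANGE     = 0x80,
	EX_SC_MASK          = 0x7ff,	/* status code type + status code */
	EX_STATUS_DNR       = 0x4000,	/* flag bit: do not retry */
};

/* Strip the flag bits and return only the status code portion. */
static inline uint16_t ex_status_code(uint16_t status)
{
	return status & EX_SC_MASK;
}

/* A failed command may be retried only if the DNR flag is clear. */
static inline bool ex_status_may_retry(uint16_t status)
{
	return status != EX_SC_SUCCESS && !(status & EX_STATUS_DNR);
}

/*
 * Mirror of the zone start-LBA validation: the sector must lie inside the
 * device and be aligned to the (power-of-two) zone size.
 */
static uint16_t ex_check_zone_start(uint64_t sect, uint64_t capacity,
				    uint64_t zone_sectors)
{
	assert(zone_sectors && !(zone_sectors & (zone_sectors - 1)));

	if (sect >= capacity)
		return EX_SC_LBA_RANGE | EX_STATUS_DNR;
	if (sect & (zone_sectors - 1))
		return EX_SC_INVALID_FIELD | EX_STATUS_DNR;
	return EX_SC_SUCCESS;
}
```

Keeping the flag in a separate namespace makes it harder to compare a returned status against a bare code without first masking the flags off.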

-1

drivers/s390/block/dasd_genhd.c

···
 		blk_mq_free_tag_set(&block->tag_set);
 		return PTR_ERR(gdp);
 	}
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, gdp->queue);
 
 	/* Initialize gendisk structure. */
 	gdp->major = DASD_MAJOR;

+1 -1

drivers/s390/block/dcssblk.c

···
 {
 	struct queue_limits lim = {
 		.logical_block_size = 4096,
+		.features = BLK_FEAT_DAX,
 	};
 	int rc, i, j, num_of_segments;
 	struct dcssblk_dev_info *dev_info;
···
 	dev_info->gd->fops = &dcssblk_devops;
 	dev_info->gd->private_data = dev_info;
 	dev_info->gd->flags |= GENHD_FL_NO_PART;
-	blk_queue_flag_set(QUEUE_FLAG_DAX, dev_info->gd->queue);
 
 	seg_byte_size = (dev_info->end - dev_info->start + 1);
 	set_capacity(dev_info->gd, seg_byte_size >> 9); // size in sectors

-5

drivers/s390/block/scm_blk.c

···
 		.logical_block_size = 1 << 12,
 	};
 	unsigned int devindex;
-	struct request_queue *rq;
 	int len, ret;
 
 	lim.max_segments = min(scmdev->nr_max_block,
···
 		ret = PTR_ERR(bdev->gendisk);
 		goto out_tag;
 	}
-	rq = bdev->rq = bdev->gendisk->queue;
-	blk_queue_flag_set(QUEUE_FLAG_NONROT, rq);
-	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, rq);
-
 	bdev->gendisk->private_data = scmdev;
 	bdev->gendisk->fops = &scm_blk_devops;
 	bdev->gendisk->major = scm_major;

-1

drivers/scsi/Kconfig

···
 config BLK_DEV_SD
 	tristate "SCSI disk support"
 	depends on SCSI
-	select BLK_DEV_INTEGRITY_T10 if BLK_DEV_INTEGRITY
 	help
 	  If you want to use SCSI hard disks, Fibre Channel disks,
 	  Serial ATA (SATA) or Parallel ATA (PATA) hard disks,

+4 -4

drivers/scsi/iscsi_tcp.c

···
 	return 0;
 }
 
-static int iscsi_sw_tcp_slave_configure(struct scsi_device *sdev)
+static int iscsi_sw_tcp_device_configure(struct scsi_device *sdev,
+		struct queue_limits *lim)
 {
 	struct iscsi_sw_tcp_host *tcp_sw_host = iscsi_host_priv(sdev->host);
 	struct iscsi_session *session = tcp_sw_host->session;
 	struct iscsi_conn *conn = session->leadconn;
 
 	if (conn->datadgst_en)
-		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES,
-				   sdev->request_queue);
+		lim->features |= BLK_FEAT_STABLE_WRITES;
 	return 0;
 }
 
···
 	.eh_device_reset_handler= iscsi_eh_device_reset,
 	.eh_target_reset_handler = iscsi_eh_recover_target,
 	.dma_boundary = PAGE_SIZE - 1,
-	.slave_configure = iscsi_sw_tcp_slave_configure,
+	.device_configure = iscsi_sw_tcp_device_configure,
 	.proc_name = "iscsi_tcp",
 	.this_id = -1,
 	.track_queue_depth = 1,
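
The iSCSI change follows the pattern used throughout this merge: rather than setting a flag on a live request queue, the driver sets a feature bit in the queue_limits structure handed to its configure callback. A minimal sketch of that pattern is shown below. The struct layouts, the bit positions, and every ex_* or EX_* name are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical feature bits; the real BLK_FEAT_* values may differ. */
#define EX_FEAT_DAX           (1u << 0)
#define EX_FEAT_STABLE_WRITES (1u << 1)

/* Cut-down stand-in for struct queue_limits. */
struct ex_queue_limits {
	unsigned int features;
	unsigned int logical_block_size;
};

/* Cut-down stand-in for the iSCSI connection state. */
struct ex_conn {
	bool datadgst_en;	/* data digests enabled on this connection */
};

/*
 * Sketch of a device_configure-style callback: it only edits the limits it
 * is given, and the caller is responsible for applying them to the queue.
 */
static int ex_device_configure(const struct ex_conn *conn,
			       struct ex_queue_limits *lim)
{
	assert(conn && lim);

	/* Pages must not change under I/O while a digest is computed. */
	if (conn->datadgst_en)
		lim->features |= EX_FEAT_STABLE_WRITES;
	return 0;
}
```

Features that are known up front can be set in the initializer instead, which is what the dcssblk hunk does with `.features = BLK_FEAT_DAX`.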

+11

drivers/scsi/lpfc/lpfc_nvmet.c

···
 	atomic_inc(&lpfc_nvmet->xmt_ls_abort);
 }
 
+static int
+lpfc_nvmet_host_traddr(void *hosthandle, u64 *wwnn, u64 *wwpn)
+{
+	struct lpfc_nodelist *ndlp = hosthandle;
+
+	*wwnn = wwn_to_u64(ndlp->nlp_nodename.u.wwn);
+	*wwpn = wwn_to_u64(ndlp->nlp_portname.u.wwn);
+	return 0;
+}
+
 static void
 lpfc_nvmet_host_release(void *hosthandle)
 {
···
 	.ls_req = lpfc_nvmet_ls_req,
 	.ls_abort = lpfc_nvmet_ls_abort,
 	.host_release = lpfc_nvmet_host_release,
+	.host_traddr = lpfc_nvmet_host_traddr,
 
 	.max_hw_queues = 1,
 	.max_sgl_segments = LPFC_NVMET_DEFAULT_SEGS,
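
For Fibre Channel the host address is reported as a node name and port name pair, each converted from an 8-byte World Wide Name to a 64-bit integer. The sketch below shows one plausible form of that conversion, assuming the bytes are stored most significant first. The name ex_wwn_to_u64 is hypothetical and is not the kernel's wwn_to_u64() helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical conversion of an 8-byte WWN, stored most significant byte
 * first, into a host-order 64-bit integer.
 */
static uint64_t ex_wwn_to_u64(const uint8_t wwn[8])
{
	uint64_t value = 0;

	assert(wwn);

	/* Shift in one byte at a time so host endianness does not matter. */
	for (size_t i = 0; i < 8; i++)
		value = (value << 8) | wwn[i];
	return value;
}
```

Building the value byte by byte avoids both unaligned loads and an explicit byte swap, at the cost of a short loop.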

-2

drivers/scsi/megaraid/megaraid_sas_base.c

···
 
 	lim->max_hw_sectors = max_io_size / 512;
 	lim->virt_boundary_mask = mr_nvme_pg_size - 1;
-
-	blk_queue_flag_set(QUEUE_FLAG_NOMERGES, sdev->request_queue);
 }
 
 /*

-6

drivers/scsi/mpt3sas/mpt3sas_scsih.c

···
 		pcie_device_put(pcie_device);
 		spin_unlock_irqrestore(&ioc->pcie_device_lock, flags);
 		mpt3sas_scsih_change_queue_depth(sdev, qdepth);
-		/* Enable QUEUE_FLAG_NOMERGES flag, so that IOs won't be
-		 ** merged and can eliminate holes created during merging
-		 ** operation.
-		 **/
-		blk_queue_flag_set(QUEUE_FLAG_NOMERGES,
-				sdev->request_queue);
 		lim->virt_boundary_mask = ioc->page_size - 1;
 		return 0;
 	}

+454 -134

drivers/scsi/scsi_debug.c

··· 69 69 70 70 /* Additional Sense Code (ASC) */ 71 71 #define NO_ADDITIONAL_SENSE 0x0 72 + #define OVERLAP_ATOMIC_COMMAND_ASC 0x0 73 + #define OVERLAP_ATOMIC_COMMAND_ASCQ 0x23 72 74 #define LOGICAL_UNIT_NOT_READY 0x4 73 75 #define LOGICAL_UNIT_COMMUNICATION_FAILURE 0x8 74 76 #define UNRECOVERED_READ_ERR 0x11 ··· 105 103 #define READ_BOUNDARY_ASCQ 0x7 106 104 #define ATTEMPT_ACCESS_GAP 0x9 107 105 #define INSUFF_ZONE_ASCQ 0xe 106 + /* see drivers/scsi/sense_codes.h */ 108 107 109 108 /* Additional Sense Code Qualifier (ASCQ) */ 110 109 #define ACK_NAK_TO 0x3 ··· 155 152 #define DEF_VIRTUAL_GB 0 156 153 #define DEF_VPD_USE_HOSTNO 1 157 154 #define DEF_WRITESAME_LENGTH 0xFFFF 155 + #define DEF_ATOMIC_WR 0 156 + #define DEF_ATOMIC_WR_MAX_LENGTH 8192 157 + #define DEF_ATOMIC_WR_ALIGN 2 158 + #define DEF_ATOMIC_WR_GRAN 2 159 + #define DEF_ATOMIC_WR_MAX_LENGTH_BNDRY (DEF_ATOMIC_WR_MAX_LENGTH) 160 + #define DEF_ATOMIC_WR_MAX_BNDRY 128 158 161 #define DEF_STRICT 0 159 162 #define DEF_STATISTICS false 160 163 #define DEF_SUBMIT_QUEUES 1 ··· 383 374 384 375 /* There is an xarray of pointers to this struct's objects, one per host */ 385 376 struct sdeb_store_info { 386 - rwlock_t macc_lck; /* for atomic media access on this store */ 377 + rwlock_t macc_data_lck; /* for media data access on this store */ 378 + rwlock_t macc_meta_lck; /* for atomic media meta access on this store */ 379 + rwlock_t macc_sector_lck; /* per-sector media data access on this store */ 387 380 u8 *storep; /* user data storage (ram) */ 388 381 struct t10_pi_tuple *dif_storep; /* protection info */ 389 382 void *map_storep; /* provisioning map */ ··· 409 398 enum sdeb_defer_type defer_t; 410 399 }; 411 400 401 + struct sdebug_device_access_info { 402 + bool atomic_write; 403 + u64 lba; 404 + u32 num; 405 + struct scsi_cmnd *self; 406 + }; 407 + 412 408 struct sdebug_queued_cmd { 413 409 /* corresponding bit set in in_use_bm[] in owning struct sdebug_queue 414 410 * instance indicates this slot is in 
use. 415 411 */ 416 412 struct sdebug_defer sd_dp; 417 413 struct scsi_cmnd *scmd; 414 + struct sdebug_device_access_info *i; 418 415 }; 419 416 420 417 struct sdebug_scsi_cmd { ··· 482 463 SDEB_I_PRE_FETCH = 29, /* 10, 16 */ 483 464 SDEB_I_ZONE_OUT = 30, /* 0x94+SA; includes no data xfer */ 484 465 SDEB_I_ZONE_IN = 31, /* 0x95+SA; all have data-in */ 485 - SDEB_I_LAST_ELEM_P1 = 32, /* keep this last (previous + 1) */ 466 + SDEB_I_ATOMIC_WRITE_16 = 32, 467 + SDEB_I_LAST_ELEM_P1 = 33, /* keep this last (previous + 1) */ 486 468 }; 487 469 488 470 ··· 517 497 0, 0, 0, SDEB_I_VERIFY, 518 498 SDEB_I_PRE_FETCH, SDEB_I_SYNC_CACHE, 0, SDEB_I_WRITE_SAME, 519 499 SDEB_I_ZONE_OUT, SDEB_I_ZONE_IN, 0, 0, 520 - 0, 0, 0, 0, 0, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16, 500 + 0, 0, 0, 0, 501 + SDEB_I_ATOMIC_WRITE_16, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16, 521 502 /* 0xa0; 0xa0->0xbf: 12 byte cdbs */ 522 503 SDEB_I_REPORT_LUNS, SDEB_I_ATA_PT, 0, SDEB_I_MAINT_IN, 523 504 SDEB_I_MAINT_OUT, 0, 0, 0, ··· 568 547 static int resp_sync_cache(struct scsi_cmnd *, struct sdebug_dev_info *); 569 548 static int resp_pre_fetch(struct scsi_cmnd *, struct sdebug_dev_info *); 570 549 static int resp_report_zones(struct scsi_cmnd *, struct sdebug_dev_info *); 550 + static int resp_atomic_write(struct scsi_cmnd *, struct sdebug_dev_info *); 571 551 static int resp_open_zone(struct scsi_cmnd *, struct sdebug_dev_info *); 572 552 static int resp_close_zone(struct scsi_cmnd *, struct sdebug_dev_info *); 573 553 static int resp_finish_zone(struct scsi_cmnd *, struct sdebug_dev_info *); ··· 810 788 resp_report_zones, zone_in_iarr, /* ZONE_IN(16), REPORT ZONES) */ 811 789 {16, 0x0 /* SA */, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 812 790 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xbf, 0xc7} }, 791 + /* 31 */ 792 + {0, 0x0, 0x0, F_D_OUT | FF_MEDIA_IO, 793 + resp_atomic_write, NULL, /* ATOMIC WRITE 16 */ 794 + {16, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 795 + 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 
0xff, 0xff} }, 813 796 /* sentinel */ 814 797 {0xff, 0, 0, 0, NULL, NULL, /* terminating element */ 815 798 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} }, ··· 862 835 static unsigned int sdebug_unmap_max_blocks = DEF_UNMAP_MAX_BLOCKS; 863 836 static unsigned int sdebug_unmap_max_desc = DEF_UNMAP_MAX_DESC; 864 837 static unsigned int sdebug_write_same_length = DEF_WRITESAME_LENGTH; 838 + static unsigned int sdebug_atomic_wr = DEF_ATOMIC_WR; 839 + static unsigned int sdebug_atomic_wr_max_length = DEF_ATOMIC_WR_MAX_LENGTH; 840 + static unsigned int sdebug_atomic_wr_align = DEF_ATOMIC_WR_ALIGN; 841 + static unsigned int sdebug_atomic_wr_gran = DEF_ATOMIC_WR_GRAN; 842 + static unsigned int sdebug_atomic_wr_max_length_bndry = 843 + DEF_ATOMIC_WR_MAX_LENGTH_BNDRY; 844 + static unsigned int sdebug_atomic_wr_max_bndry = DEF_ATOMIC_WR_MAX_BNDRY; 865 845 static int sdebug_uuid_ctl = DEF_UUID_CTL; 866 846 static bool sdebug_random = DEF_RANDOM; 867 847 static bool sdebug_per_host_store = DEF_PER_HOST_STORE; ··· 1224 1190 { 1225 1191 return 0 == sdebug_fake_rw && 1226 1192 (sdebug_lbpu || sdebug_lbpws || sdebug_lbpws10); 1193 + } 1194 + 1195 + static inline bool scsi_debug_atomic_write(void) 1196 + { 1197 + return sdebug_fake_rw == 0 && sdebug_atomic_wr; 1227 1198 } 1228 1199 1229 1200 static void *lba2fake_store(struct sdeb_store_info *sip, ··· 1857 1818 1858 1819 /* Maximum WRITE SAME Length */ 1859 1820 put_unaligned_be64(sdebug_write_same_length, &arr[32]); 1821 + 1822 + if (sdebug_atomic_wr) { 1823 + put_unaligned_be32(sdebug_atomic_wr_max_length, &arr[40]); 1824 + put_unaligned_be32(sdebug_atomic_wr_align, &arr[44]); 1825 + put_unaligned_be32(sdebug_atomic_wr_gran, &arr[48]); 1826 + put_unaligned_be32(sdebug_atomic_wr_max_length_bndry, &arr[52]); 1827 + put_unaligned_be32(sdebug_atomic_wr_max_bndry, &arr[56]); 1828 + } 1860 1829 1861 1830 return 0x3c; /* Mandatory page length for Logical Block Provisioning */ 1862 1831 } ··· 3428 3381 return xa_load(per_store_ap, 
devip->sdbg_host->si_idx); 3429 3382 } 3430 3383 3384 + static inline void 3385 + sdeb_read_lock(rwlock_t *lock) 3386 + { 3387 + if (sdebug_no_rwlock) 3388 + __acquire(lock); 3389 + else 3390 + read_lock(lock); 3391 + } 3392 + 3393 + static inline void 3394 + sdeb_read_unlock(rwlock_t *lock) 3395 + { 3396 + if (sdebug_no_rwlock) 3397 + __release(lock); 3398 + else 3399 + read_unlock(lock); 3400 + } 3401 + 3402 + static inline void 3403 + sdeb_write_lock(rwlock_t *lock) 3404 + { 3405 + if (sdebug_no_rwlock) 3406 + __acquire(lock); 3407 + else 3408 + write_lock(lock); 3409 + } 3410 + 3411 + static inline void 3412 + sdeb_write_unlock(rwlock_t *lock) 3413 + { 3414 + if (sdebug_no_rwlock) 3415 + __release(lock); 3416 + else 3417 + write_unlock(lock); 3418 + } 3419 + 3420 + static inline void 3421 + sdeb_data_read_lock(struct sdeb_store_info *sip) 3422 + { 3423 + BUG_ON(!sip); 3424 + 3425 + sdeb_read_lock(&sip->macc_data_lck); 3426 + } 3427 + 3428 + static inline void 3429 + sdeb_data_read_unlock(struct sdeb_store_info *sip) 3430 + { 3431 + BUG_ON(!sip); 3432 + 3433 + sdeb_read_unlock(&sip->macc_data_lck); 3434 + } 3435 + 3436 + static inline void 3437 + sdeb_data_write_lock(struct sdeb_store_info *sip) 3438 + { 3439 + BUG_ON(!sip); 3440 + 3441 + sdeb_write_lock(&sip->macc_data_lck); 3442 + } 3443 + 3444 + static inline void 3445 + sdeb_data_write_unlock(struct sdeb_store_info *sip) 3446 + { 3447 + BUG_ON(!sip); 3448 + 3449 + sdeb_write_unlock(&sip->macc_data_lck); 3450 + } 3451 + 3452 + static inline void 3453 + sdeb_data_sector_read_lock(struct sdeb_store_info *sip) 3454 + { 3455 + BUG_ON(!sip); 3456 + 3457 + sdeb_read_lock(&sip->macc_sector_lck); 3458 + } 3459 + 3460 + static inline void 3461 + sdeb_data_sector_read_unlock(struct sdeb_store_info *sip) 3462 + { 3463 + BUG_ON(!sip); 3464 + 3465 + sdeb_read_unlock(&sip->macc_sector_lck); 3466 + } 3467 + 3468 + static inline void 3469 + sdeb_data_sector_write_lock(struct sdeb_store_info *sip) 3470 + { 3471 + 
BUG_ON(!sip); 3472 + 3473 + sdeb_write_lock(&sip->macc_sector_lck); 3474 + } 3475 + 3476 + static inline void 3477 + sdeb_data_sector_write_unlock(struct sdeb_store_info *sip) 3478 + { 3479 + BUG_ON(!sip); 3480 + 3481 + sdeb_write_unlock(&sip->macc_sector_lck); 3482 + } 3483 + 3484 + /* 3485 + * Atomic locking: 3486 + * We simplify the atomic model to allow only 1x atomic write and many non- 3487 + * atomic reads or writes for all LBAs. 3488 + 3489 + * A RW lock has a similar bahaviour: 3490 + * Only 1x writer and many readers. 3491 + 3492 + * So use a RW lock for per-device read and write locking: 3493 + * An atomic access grabs the lock as a writer and non-atomic grabs the lock 3494 + * as a reader. 3495 + */ 3496 + 3497 + static inline void 3498 + sdeb_data_lock(struct sdeb_store_info *sip, bool atomic) 3499 + { 3500 + if (atomic) 3501 + sdeb_data_write_lock(sip); 3502 + else 3503 + sdeb_data_read_lock(sip); 3504 + } 3505 + 3506 + static inline void 3507 + sdeb_data_unlock(struct sdeb_store_info *sip, bool atomic) 3508 + { 3509 + if (atomic) 3510 + sdeb_data_write_unlock(sip); 3511 + else 3512 + sdeb_data_read_unlock(sip); 3513 + } 3514 + 3515 + /* Allow many reads but only 1x write per sector */ 3516 + static inline void 3517 + sdeb_data_sector_lock(struct sdeb_store_info *sip, bool do_write) 3518 + { 3519 + if (do_write) 3520 + sdeb_data_sector_write_lock(sip); 3521 + else 3522 + sdeb_data_sector_read_lock(sip); 3523 + } 3524 + 3525 + static inline void 3526 + sdeb_data_sector_unlock(struct sdeb_store_info *sip, bool do_write) 3527 + { 3528 + if (do_write) 3529 + sdeb_data_sector_write_unlock(sip); 3530 + else 3531 + sdeb_data_sector_read_unlock(sip); 3532 + } 3533 + 3534 + static inline void 3535 + sdeb_meta_read_lock(struct sdeb_store_info *sip) 3536 + { 3537 + if (sdebug_no_rwlock) { 3538 + if (sip) 3539 + __acquire(&sip->macc_meta_lck); 3540 + else 3541 + __acquire(&sdeb_fake_rw_lck); 3542 + } else { 3543 + if (sip) 3544 + read_lock(&sip->macc_meta_lck); 
3545 + else 3546 + read_lock(&sdeb_fake_rw_lck); 3547 + } 3548 + } 3549 + 3550 + static inline void 3551 + sdeb_meta_read_unlock(struct sdeb_store_info *sip) 3552 + { 3553 + if (sdebug_no_rwlock) { 3554 + if (sip) 3555 + __release(&sip->macc_meta_lck); 3556 + else 3557 + __release(&sdeb_fake_rw_lck); 3558 + } else { 3559 + if (sip) 3560 + read_unlock(&sip->macc_meta_lck); 3561 + else 3562 + read_unlock(&sdeb_fake_rw_lck); 3563 + } 3564 + } 3565 + 3566 + static inline void 3567 + sdeb_meta_write_lock(struct sdeb_store_info *sip) 3568 + { 3569 + if (sdebug_no_rwlock) { 3570 + if (sip) 3571 + __acquire(&sip->macc_meta_lck); 3572 + else 3573 + __acquire(&sdeb_fake_rw_lck); 3574 + } else { 3575 + if (sip) 3576 + write_lock(&sip->macc_meta_lck); 3577 + else 3578 + write_lock(&sdeb_fake_rw_lck); 3579 + } 3580 + } 3581 + 3582 + static inline void 3583 + sdeb_meta_write_unlock(struct sdeb_store_info *sip) 3584 + { 3585 + if (sdebug_no_rwlock) { 3586 + if (sip) 3587 + __release(&sip->macc_meta_lck); 3588 + else 3589 + __release(&sdeb_fake_rw_lck); 3590 + } else { 3591 + if (sip) 3592 + write_unlock(&sip->macc_meta_lck); 3593 + else 3594 + write_unlock(&sdeb_fake_rw_lck); 3595 + } 3596 + } 3597 + 3431 3598 /* Returns number of bytes copied or -1 if error. */ 3432 3599 static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp, 3433 - u32 sg_skip, u64 lba, u32 num, bool do_write, 3434 - u8 group_number) 3600 + u32 sg_skip, u64 lba, u32 num, u8 group_number, 3601 + bool do_write, bool atomic) 3435 3602 { 3436 3603 int ret; 3437 - u64 block, rest = 0; 3604 + u64 block; 3438 3605 enum dma_data_direction dir; 3439 3606 struct scsi_data_buffer *sdb = &scp->sdb; 3440 3607 u8 *fsp; 3608 + int i; 3609 + 3610 + /* 3611 + * Even though reads are inherently atomic (in this driver), we expect 3612 + * the atomic flag only for writes. 
3613 + */ 3614 + if (!do_write && atomic) 3615 + return -1; 3441 3616 3442 3617 if (do_write) { 3443 3618 dir = DMA_TO_DEVICE; ··· 3679 3410 fsp = sip->storep; 3680 3411 3681 3412 block = do_div(lba, sdebug_store_sectors); 3682 - if (block + num > sdebug_store_sectors) 3683 - rest = block + num - sdebug_store_sectors; 3684 3413 3685 - ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents, 3414 + /* Only allow 1x atomic write or multiple non-atomic writes at any given time */ 3415 + sdeb_data_lock(sip, atomic); 3416 + for (i = 0; i < num; i++) { 3417 + /* We shouldn't need to lock for atomic writes, but do it anyway */ 3418 + sdeb_data_sector_lock(sip, do_write); 3419 + ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents, 3686 3420 fsp + (block * sdebug_sector_size), 3687 - (num - rest) * sdebug_sector_size, sg_skip, do_write); 3688 - if (ret != (num - rest) * sdebug_sector_size) 3689 - return ret; 3690 - 3691 - if (rest) { 3692 - ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents, 3693 - fsp, rest * sdebug_sector_size, 3694 - sg_skip + ((num - rest) * sdebug_sector_size), 3695 - do_write); 3421 + sdebug_sector_size, sg_skip, do_write); 3422 + sdeb_data_sector_unlock(sip, do_write); 3423 + if (ret != sdebug_sector_size) { 3424 + ret += (i * sdebug_sector_size); 3425 + break; 3426 + } 3427 + sg_skip += sdebug_sector_size; 3428 + if (++block >= sdebug_store_sectors) 3429 + block = 0; 3696 3430 } 3431 + ret = num * sdebug_sector_size; 3432 + sdeb_data_unlock(sip, atomic); 3697 3433 3698 3434 return ret; 3699 3435 } ··· 3874 3600 return ret; 3875 3601 } 3876 3602 3877 - static inline void 3878 - sdeb_read_lock(struct sdeb_store_info *sip) 3879 - { 3880 - if (sdebug_no_rwlock) { 3881 - if (sip) 3882 - __acquire(&sip->macc_lck); 3883 - else 3884 - __acquire(&sdeb_fake_rw_lck); 3885 - } else { 3886 - if (sip) 3887 - read_lock(&sip->macc_lck); 3888 - else 3889 - read_lock(&sdeb_fake_rw_lck); 3890 - } 3891 - } 3892 - 3893 - static inline void 3894 - 
sdeb_read_unlock(struct sdeb_store_info *sip) 3895 - { 3896 - if (sdebug_no_rwlock) { 3897 - if (sip) 3898 - __release(&sip->macc_lck); 3899 - else 3900 - __release(&sdeb_fake_rw_lck); 3901 - } else { 3902 - if (sip) 3903 - read_unlock(&sip->macc_lck); 3904 - else 3905 - read_unlock(&sdeb_fake_rw_lck); 3906 - } 3907 - } 3908 - 3909 - static inline void 3910 - sdeb_write_lock(struct sdeb_store_info *sip) 3911 - { 3912 - if (sdebug_no_rwlock) { 3913 - if (sip) 3914 - __acquire(&sip->macc_lck); 3915 - else 3916 - __acquire(&sdeb_fake_rw_lck); 3917 - } else { 3918 - if (sip) 3919 - write_lock(&sip->macc_lck); 3920 - else 3921 - write_lock(&sdeb_fake_rw_lck); 3922 - } 3923 - } 3924 - 3925 - static inline void 3926 - sdeb_write_unlock(struct sdeb_store_info *sip) 3927 - { 3928 - if (sdebug_no_rwlock) { 3929 - if (sip) 3930 - __release(&sip->macc_lck); 3931 - else 3932 - __release(&sdeb_fake_rw_lck); 3933 - } else { 3934 - if (sip) 3935 - write_unlock(&sip->macc_lck); 3936 - else 3937 - write_unlock(&sdeb_fake_rw_lck); 3938 - } 3939 - } 3940 - 3941 3603 static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip) 3942 3604 { 3943 3605 bool check_prot; ··· 3883 3673 u64 lba; 3884 3674 struct sdeb_store_info *sip = devip2sip(devip, true); 3885 3675 u8 *cmd = scp->cmnd; 3676 + bool meta_data_locked = false; 3886 3677 3887 3678 switch (cmd[0]) { 3888 3679 case READ_16: ··· 3942 3731 atomic_set(&sdeb_inject_pending, 0); 3943 3732 } 3944 3733 3734 + /* 3735 + * When checking device access params, for reads we only check data 3736 + * versus what is set at init time, so no need to lock. 
3737 + */ 3945 3738 ret = check_device_access_params(scp, lba, num, false); 3946 3739 if (ret) 3947 3740 return ret; ··· 3965 3750 return check_condition_result; 3966 3751 } 3967 3752 3968 - sdeb_read_lock(sip); 3753 + if (sdebug_dev_is_zoned(devip) || 3754 + (sdebug_dix && scsi_prot_sg_count(scp))) { 3755 + sdeb_meta_read_lock(sip); 3756 + meta_data_locked = true; 3757 + } 3969 3758 3970 3759 /* DIX + T10 DIF */ 3971 3760 if (unlikely(sdebug_dix && scsi_prot_sg_count(scp))) { 3972 3761 switch (prot_verify_read(scp, lba, num, ei_lba)) { 3973 3762 case 1: /* Guard tag error */ 3974 3763 if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */ 3975 - sdeb_read_unlock(sip); 3764 + sdeb_meta_read_unlock(sip); 3976 3765 mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1); 3977 3766 return check_condition_result; 3978 3767 } else if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) { 3979 - sdeb_read_unlock(sip); 3768 + sdeb_meta_read_unlock(sip); 3980 3769 mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1); 3981 3770 return illegal_condition_result; 3982 3771 } 3983 3772 break; 3984 3773 case 3: /* Reference tag error */ 3985 3774 if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */ 3986 - sdeb_read_unlock(sip); 3775 + sdeb_meta_read_unlock(sip); 3987 3776 mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3); 3988 3777 return check_condition_result; 3989 3778 } else if (scp->prot_flags & SCSI_PROT_REF_CHECK) { 3990 - sdeb_read_unlock(sip); 3779 + sdeb_meta_read_unlock(sip); 3991 3780 mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3); 3992 3781 return illegal_condition_result; 3993 3782 } ··· 3999 3780 } 4000 3781 } 4001 3782 4002 - ret = do_device_access(sip, scp, 0, lba, num, false, 0); 4003 - sdeb_read_unlock(sip); 3783 + ret = do_device_access(sip, scp, 0, lba, num, 0, false, false); 3784 + if (meta_data_locked) 3785 + sdeb_meta_read_unlock(sip); 4004 3786 if (unlikely(ret == -1)) 4005 3787 return DID_ERROR << 16; 4006 3788 ··· 4191 3971 u64 lba; 4192 3972 struct sdeb_store_info *sip = devip2sip(devip, true); 
4193 3973 u8 *cmd = scp->cmnd; 3974 + bool meta_data_locked = false; 4194 3975 4195 3976 switch (cmd[0]) { 4196 3977 case WRITE_16: ··· 4250 4029 "to DIF device\n"); 4251 4030 } 4252 4031 4253 - sdeb_write_lock(sip); 4032 + if (sdebug_dev_is_zoned(devip) || 4033 + (sdebug_dix && scsi_prot_sg_count(scp)) || 4034 + scsi_debug_lbp()) { 4035 + sdeb_meta_write_lock(sip); 4036 + meta_data_locked = true; 4037 + } 4038 + 4254 4039 ret = check_device_access_params(scp, lba, num, true); 4255 4040 if (ret) { 4256 - sdeb_write_unlock(sip); 4041 + if (meta_data_locked) 4042 + sdeb_meta_write_unlock(sip); 4257 4043 return ret; 4258 4044 } 4259 4045 ··· 4269 4041 switch (prot_verify_write(scp, lba, num, ei_lba)) { 4270 4042 case 1: /* Guard tag error */ 4271 4043 if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) { 4272 - sdeb_write_unlock(sip); 4044 + sdeb_meta_write_unlock(sip); 4273 4045 mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1); 4274 4046 return illegal_condition_result; 4275 4047 } else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */ 4276 - sdeb_write_unlock(sip); 4048 + sdeb_meta_write_unlock(sip); 4277 4049 mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1); 4278 4050 return check_condition_result; 4279 4051 } 4280 4052 break; 4281 4053 case 3: /* Reference tag error */ 4282 4054 if (scp->prot_flags & SCSI_PROT_REF_CHECK) { 4283 - sdeb_write_unlock(sip); 4055 + sdeb_meta_write_unlock(sip); 4284 4056 mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3); 4285 4057 return illegal_condition_result; 4286 4058 } else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */ 4287 - sdeb_write_unlock(sip); 4059 + sdeb_meta_write_unlock(sip); 4288 4060 mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3); 4289 4061 return check_condition_result; 4290 4062 } ··· 4292 4064 } 4293 4065 } 4294 4066 4295 - ret = do_device_access(sip, scp, 0, lba, num, true, group); 4067 + ret = do_device_access(sip, scp, 0, lba, num, group, true, false); 4296 4068 if (unlikely(scsi_debug_lbp())) 4297 4069 map_region(sip, 
lba, num); 4070 + 4298 4071 /* If ZBC zone then bump its write pointer */ 4299 4072 if (sdebug_dev_is_zoned(devip)) 4300 4073 zbc_inc_wp(devip, lba, num); 4301 - sdeb_write_unlock(sip); 4074 + if (meta_data_locked) 4075 + sdeb_meta_write_unlock(sip); 4076 + 4302 4077 if (unlikely(-1 == ret)) 4303 4078 return DID_ERROR << 16; 4304 4079 else if (unlikely(sdebug_verbose && ··· 4411 4180 goto err_out; 4412 4181 } 4413 4182 4414 - sdeb_write_lock(sip); 4183 + /* Just keep it simple and always lock for now */ 4184 + sdeb_meta_write_lock(sip); 4415 4185 sg_off = lbdof_blen; 4416 4186 /* Spec says Buffer xfer Length field in number of LBs in dout */ 4417 4187 cum_lb = 0; ··· 4455 4223 } 4456 4224 } 4457 4225 4458 - ret = do_device_access(sip, scp, sg_off, lba, num, true, group); 4226 + /* 4227 + * Write ranges atomically to keep as close to pre-atomic 4228 + * writes behaviour as possible. 4229 + */ 4230 + ret = do_device_access(sip, scp, sg_off, lba, num, group, true, true); 4459 4231 /* If ZBC zone then bump its write pointer */ 4460 4232 if (sdebug_dev_is_zoned(devip)) 4461 4233 zbc_inc_wp(devip, lba, num); ··· 4498 4262 } 4499 4263 ret = 0; 4500 4264 err_out_unlock: 4501 - sdeb_write_unlock(sip); 4265 + sdeb_meta_write_unlock(sip); 4502 4266 err_out: 4503 4267 kfree(lrdp); 4504 4268 return ret; ··· 4517 4281 scp->device->hostdata, true); 4518 4282 u8 *fs1p; 4519 4283 u8 *fsp; 4284 + bool meta_data_locked = false; 4520 4285 4521 - sdeb_write_lock(sip); 4286 + if (sdebug_dev_is_zoned(devip) || scsi_debug_lbp()) { 4287 + sdeb_meta_write_lock(sip); 4288 + meta_data_locked = true; 4289 + } 4522 4290 4523 4291 ret = check_device_access_params(scp, lba, num, true); 4524 - if (ret) { 4525 - sdeb_write_unlock(sip); 4526 - return ret; 4527 - } 4292 + if (ret) 4293 + goto out; 4528 4294 4529 4295 if (unmap && scsi_debug_lbp()) { 4530 4296 unmap_region(sip, lba, num); ··· 4537 4299 /* if ndob then zero 1 logical block, else fetch 1 logical block */ 4538 4300 fsp = sip->storep; 
4539 4301 fs1p = fsp + (block * lb_size); 4302 + sdeb_data_write_lock(sip); 4540 4303 if (ndob) { 4541 4304 memset(fs1p, 0, lb_size); 4542 4305 ret = 0; ··· 4545 4306 ret = fetch_to_dev_buffer(scp, fs1p, lb_size); 4546 4307 4547 4308 if (-1 == ret) { 4548 - sdeb_write_unlock(sip); 4549 - return DID_ERROR << 16; 4309 + ret = DID_ERROR << 16; 4310 + goto out; 4550 4311 } else if (sdebug_verbose && !ndob && (ret < lb_size)) 4551 4312 sdev_printk(KERN_INFO, scp->device, 4552 4313 "%s: %s: lb size=%u, IO sent=%d bytes\n", ··· 4563 4324 /* If ZBC zone then bump its write pointer */ 4564 4325 if (sdebug_dev_is_zoned(devip)) 4565 4326 zbc_inc_wp(devip, lba, num); 4327 + sdeb_data_write_unlock(sip); 4328 + ret = 0; 4566 4329 out: 4567 - sdeb_write_unlock(sip); 4568 - 4569 - return 0; 4330 + if (meta_data_locked) 4331 + sdeb_meta_write_unlock(sip); 4332 + return ret; 4570 4333 } 4571 4334 4572 4335 static int resp_write_same_10(struct scsi_cmnd *scp, ··· 4711 4470 return check_condition_result; 4712 4471 } 4713 4472 4714 - sdeb_write_lock(sip); 4715 - 4716 4473 ret = do_dout_fetch(scp, dnum, arr); 4717 4474 if (ret == -1) { 4718 4475 retval = DID_ERROR << 16; 4719 - goto cleanup; 4476 + goto cleanup_free; 4720 4477 } else if (sdebug_verbose && (ret < (dnum * lb_size))) 4721 4478 sdev_printk(KERN_INFO, scp->device, "%s: compare_write: cdb " 4722 4479 "indicated=%u, IO sent=%d bytes\n", my_name, 4723 4480 dnum * lb_size, ret); 4481 + 4482 + sdeb_data_write_lock(sip); 4483 + sdeb_meta_write_lock(sip); 4724 4484 if (!comp_write_worker(sip, lba, num, arr, false)) { 4725 4485 mk_sense_buffer(scp, MISCOMPARE, MISCOMPARE_VERIFY_ASC, 0); 4726 4486 retval = check_condition_result; 4727 - goto cleanup; 4487 + goto cleanup_unlock; 4728 4488 } 4489 + 4490 + /* Cover sip->map_storep (which map_region()) sets with data lock */ 4729 4491 if (scsi_debug_lbp()) 4730 4492 map_region(sip, lba, num); 4731 - cleanup: 4732 - sdeb_write_unlock(sip); 4493 + cleanup_unlock: 4494 + 
sdeb_meta_write_unlock(sip); 4495 + sdeb_data_write_unlock(sip); 4496 + cleanup_free: 4733 4497 kfree(arr); 4734 4498 return retval; 4735 4499 } ··· 4778 4532 4779 4533 desc = (void *)&buf[8]; 4780 4534 4781 - sdeb_write_lock(sip); 4535 + sdeb_meta_write_lock(sip); 4782 4536 4783 4537 for (i = 0 ; i < descriptors ; i++) { 4784 4538 unsigned long long lba = get_unaligned_be64(&desc[i].lba); ··· 4794 4548 ret = 0; 4795 4549 4796 4550 out: 4797 - sdeb_write_unlock(sip); 4551 + sdeb_meta_write_unlock(sip); 4798 4552 kfree(buf); 4799 4553 4800 4554 return ret; ··· 4952 4706 rest = block + nblks - sdebug_store_sectors; 4953 4707 4954 4708 /* Try to bring the PRE-FETCH range into CPU's cache */ 4955 - sdeb_read_lock(sip); 4709 + sdeb_data_read_lock(sip); 4956 4710 prefetch_range(fsp + (sdebug_sector_size * block), 4957 4711 (nblks - rest) * sdebug_sector_size); 4958 4712 if (rest) 4959 4713 prefetch_range(fsp, rest * sdebug_sector_size); 4960 - sdeb_read_unlock(sip); 4714 + 4715 + sdeb_data_read_unlock(sip); 4961 4716 fini: 4962 4717 if (cmd[1] & 0x2) 4963 4718 res = SDEG_RES_IMMED_MASK; ··· 5117 4870 return check_condition_result; 5118 4871 } 5119 4872 /* Not changing store, so only need read access */ 5120 - sdeb_read_lock(sip); 4873 + sdeb_data_read_lock(sip); 5121 4874 5122 4875 ret = do_dout_fetch(scp, a_num, arr); 5123 4876 if (ret == -1) { ··· 5139 4892 goto cleanup; 5140 4893 } 5141 4894 cleanup: 5142 - sdeb_read_unlock(sip); 4895 + sdeb_data_read_unlock(sip); 5143 4896 kfree(arr); 5144 4897 return ret; 5145 4898 } ··· 5185 4938 return check_condition_result; 5186 4939 } 5187 4940 5188 - sdeb_read_lock(sip); 4941 + sdeb_meta_read_lock(sip); 5189 4942 5190 4943 desc = arr + 64; 5191 4944 for (lba = zs_lba; lba < sdebug_capacity; ··· 5283 5036 ret = fill_from_dev_buffer(scp, arr, min_t(u32, alloc_len, rep_len)); 5284 5037 5285 5038 fini: 5286 - sdeb_read_unlock(sip); 5039 + sdeb_meta_read_unlock(sip); 5287 5040 kfree(arr); 5288 5041 return ret; 5042 + } 5043 + 5044 
+ static int resp_atomic_write(struct scsi_cmnd *scp, 5045 + struct sdebug_dev_info *devip) 5046 + { 5047 + struct sdeb_store_info *sip; 5048 + u8 *cmd = scp->cmnd; 5049 + u16 boundary, len; 5050 + u64 lba, lba_tmp; 5051 + int ret; 5052 + 5053 + if (!scsi_debug_atomic_write()) { 5054 + mk_sense_invalid_opcode(scp); 5055 + return check_condition_result; 5056 + } 5057 + 5058 + sip = devip2sip(devip, true); 5059 + 5060 + lba = get_unaligned_be64(cmd + 2); 5061 + boundary = get_unaligned_be16(cmd + 10); 5062 + len = get_unaligned_be16(cmd + 12); 5063 + 5064 + lba_tmp = lba; 5065 + if (sdebug_atomic_wr_align && 5066 + do_div(lba_tmp, sdebug_atomic_wr_align)) { 5067 + /* Does not meet alignment requirement */ 5068 + mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0); 5069 + return check_condition_result; 5070 + } 5071 + 5072 + if (sdebug_atomic_wr_gran && len % sdebug_atomic_wr_gran) { 5073 + /* Does not meet alignment requirement */ 5074 + mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0); 5075 + return check_condition_result; 5076 + } 5077 + 5078 + if (boundary > 0) { 5079 + if (boundary > sdebug_atomic_wr_max_bndry) { 5080 + mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1); 5081 + return check_condition_result; 5082 + } 5083 + 5084 + if (len > sdebug_atomic_wr_max_length_bndry) { 5085 + mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1); 5086 + return check_condition_result; 5087 + } 5088 + } else { 5089 + if (len > sdebug_atomic_wr_max_length) { 5090 + mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1); 5091 + return check_condition_result; 5092 + } 5093 + } 5094 + 5095 + ret = do_device_access(sip, scp, 0, lba, len, 0, true, true); 5096 + if (unlikely(ret == -1)) 5097 + return DID_ERROR << 16; 5098 + if (unlikely(ret != len * sdebug_sector_size)) 5099 + return DID_ERROR << 16; 5100 + return 0; 5289 5101 } 5290 5102 5291 5103 /* Logic transplanted from tcmu-runner, file_zbc.c */ ··· 5373 5067 mk_sense_invalid_opcode(scp); 5374 5068 return 
check_condition_result; 5375 5069 } 5376 - 5377 - sdeb_write_lock(sip); 5070 + sdeb_meta_write_lock(sip); 5378 5071 5379 5072 if (all) { 5380 5073 /* Check if all closed zones can be open */ ··· 5422 5117 5423 5118 zbc_open_zone(devip, zsp, true); 5424 5119 fini: 5425 - sdeb_write_unlock(sip); 5120 + sdeb_meta_write_unlock(sip); 5426 5121 return res; 5427 5122 } 5428 5123 ··· 5449 5144 return check_condition_result; 5450 5145 } 5451 5146 5452 - sdeb_write_lock(sip); 5147 + sdeb_meta_write_lock(sip); 5453 5148 5454 5149 if (all) { 5455 5150 zbc_close_all(devip); ··· 5478 5173 5479 5174 zbc_close_zone(devip, zsp); 5480 5175 fini: 5481 - sdeb_write_unlock(sip); 5176 + sdeb_meta_write_unlock(sip); 5482 5177 return res; 5483 5178 } 5484 5179 ··· 5521 5216 return check_condition_result; 5522 5217 } 5523 5218 5524 - sdeb_write_lock(sip); 5219 + sdeb_meta_write_lock(sip); 5525 5220 5526 5221 if (all) { 5527 5222 zbc_finish_all(devip); ··· 5550 5245 5551 5246 zbc_finish_zone(devip, zsp, true); 5552 5247 fini: 5553 - sdeb_write_unlock(sip); 5248 + sdeb_meta_write_unlock(sip); 5554 5249 return res; 5555 5250 } 5556 5251 ··· 5601 5296 return check_condition_result; 5602 5297 } 5603 5298 5604 - sdeb_write_lock(sip); 5299 + sdeb_meta_write_lock(sip); 5605 5300 5606 5301 if (all) { 5607 5302 zbc_rwp_all(devip); ··· 5629 5324 5630 5325 zbc_rwp_zone(devip, zsp); 5631 5326 fini: 5632 - sdeb_write_unlock(sip); 5327 + sdeb_meta_write_unlock(sip); 5633 5328 return res; 5634 5329 } 5635 5330 ··· 6593 6288 module_param_named(lbpu, sdebug_lbpu, int, S_IRUGO); 6594 6289 module_param_named(lbpws, sdebug_lbpws, int, S_IRUGO); 6595 6290 module_param_named(lbpws10, sdebug_lbpws10, int, S_IRUGO); 6291 + module_param_named(atomic_wr, sdebug_atomic_wr, int, S_IRUGO); 6596 6292 module_param_named(lowest_aligned, sdebug_lowest_aligned, int, S_IRUGO); 6597 6293 module_param_named(lun_format, sdebug_lun_am_i, int, S_IRUGO | S_IWUSR); 6598 6294 module_param_named(max_luns, sdebug_max_luns, int, 
S_IRUGO | S_IWUSR); ··· 6628 6322 module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO); 6629 6323 module_param_named(unmap_max_blocks, sdebug_unmap_max_blocks, int, S_IRUGO); 6630 6324 module_param_named(unmap_max_desc, sdebug_unmap_max_desc, int, S_IRUGO); 6325 + module_param_named(atomic_wr_max_length, sdebug_atomic_wr_max_length, int, S_IRUGO); 6326 + module_param_named(atomic_wr_align, sdebug_atomic_wr_align, int, S_IRUGO); 6327 + module_param_named(atomic_wr_gran, sdebug_atomic_wr_gran, int, S_IRUGO); 6328 + module_param_named(atomic_wr_max_length_bndry, sdebug_atomic_wr_max_length_bndry, int, S_IRUGO); 6329 + module_param_named(atomic_wr_max_bndry, sdebug_atomic_wr_max_bndry, int, S_IRUGO); 6631 6330 module_param_named(uuid_ctl, sdebug_uuid_ctl, int, S_IRUGO); 6632 6331 module_param_named(virtual_gb, sdebug_virtual_gb, int, S_IRUGO | S_IWUSR); 6633 6332 module_param_named(vpd_use_hostno, sdebug_vpd_use_hostno, int, ··· 6676 6365 MODULE_PARM_DESC(lbpu, "enable LBP, support UNMAP command (def=0)"); 6677 6366 MODULE_PARM_DESC(lbpws, "enable LBP, support WRITE SAME(16) with UNMAP bit (def=0)"); 6678 6367 MODULE_PARM_DESC(lbpws10, "enable LBP, support WRITE SAME(10) with UNMAP bit (def=0)"); 6368 + MODULE_PARM_DESC(atomic_write, "enable ATOMIC WRITE support, support WRITE ATOMIC(16) (def=0)"); 6679 6369 MODULE_PARM_DESC(lowest_aligned, "lowest aligned lba (def=0)"); 6680 6370 MODULE_PARM_DESC(lun_format, "LUN format: 0->peripheral (def); 1 --> flat address method"); 6681 6371 MODULE_PARM_DESC(max_luns, "number of LUNs per target to simulate(def=1)"); ··· 6708 6396 MODULE_PARM_DESC(unmap_granularity, "thin provisioning granularity in blocks (def=1)"); 6709 6397 MODULE_PARM_DESC(unmap_max_blocks, "max # of blocks can be unmapped in one cmd (def=0xffffffff)"); 6710 6398 MODULE_PARM_DESC(unmap_max_desc, "max # of ranges that can be unmapped in one cmd (def=256)"); 6399 + MODULE_PARM_DESC(atomic_wr_max_length, "max # of blocks can be atomically 
written in one cmd (def=8192)"); 6400 + MODULE_PARM_DESC(atomic_wr_align, "minimum alignment of atomic write in blocks (def=2)"); 6401 + MODULE_PARM_DESC(atomic_wr_gran, "minimum granularity of atomic write in blocks (def=2)"); 6402 + MODULE_PARM_DESC(atomic_wr_max_length_bndry, "max # of blocks can be atomically written in one cmd with boundary set (def=8192)"); 6403 + MODULE_PARM_DESC(atomic_wr_max_bndry, "max # boundaries per atomic write (def=128)"); 6711 6404 MODULE_PARM_DESC(uuid_ctl, 6712 6405 "1->use uuid for lu name, 0->don't, 2->all use same (def=0)"); 6713 6406 MODULE_PARM_DESC(virtual_gb, "virtual gigabyte (GiB) size (def=0 -> use dev_size_mb)"); ··· 7884 7567 return -EINVAL; 7885 7568 } 7886 7569 } 7570 + 7887 7571 xa_init_flags(per_store_ap, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ); 7888 7572 if (want_store) { 7889 7573 idx = sdebug_add_store(); ··· 8092 7774 map_region(sip, 0, 2); 8093 7775 } 8094 7776 8095 - rwlock_init(&sip->macc_lck); 7777 + rwlock_init(&sip->macc_data_lck); 7778 + rwlock_init(&sip->macc_meta_lck); 7779 + rwlock_init(&sip->macc_sector_lck); 8096 7780 return (int)n_idx; 8097 7781 err: 8098 7782 sdebug_erase_store((int)n_idx, sip);

+4 -5

drivers/scsi/scsi_lib.c

··· 631 631 if (blk_update_request(req, error, bytes)) 632 632 return true; 633 633 634 - // XXX: 635 - if (blk_queue_add_random(q)) 634 + if (q->limits.features & BLK_FEAT_ADD_RANDOM) 636 635 add_disk_randomness(req->q->disk); 637 636 638 637 WARN_ON_ONCE(!blk_rq_is_passthrough(req) && ··· 1139 1140 */ 1140 1141 count = __blk_rq_map_sg(rq->q, rq, cmd->sdb.table.sgl, &last_sg); 1141 1142 1142 - if (blk_rq_bytes(rq) & rq->q->dma_pad_mask) { 1143 + if (blk_rq_bytes(rq) & rq->q->limits.dma_pad_mask) { 1143 1144 unsigned int pad_len = 1144 - (rq->q->dma_pad_mask & ~blk_rq_bytes(rq)) + 1; 1145 + (rq->q->limits.dma_pad_mask & ~blk_rq_bytes(rq)) + 1; 1145 1146 1146 1147 last_sg->length += pad_len; 1147 1148 cmd->extra_len += pad_len; ··· 1986 1987 shost->dma_alignment, dma_get_cache_alignment() - 1); 1987 1988 1988 1989 if (shost->no_highmem) 1989 - lim->bounce = BLK_BOUNCE_HIGH; 1990 + lim->features |= BLK_FEAT_BOUNCE_HIGH; 1990 1991 1991 1992 dma_set_seg_boundary(dev, shost->dma_boundary); 1992 1993 dma_set_max_seg_size(dev, shost->max_segment_size);
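The padding arithmetic in the second hunk above, which now reads the mask from `q->limits.dma_pad_mask`, can be checked in isolation. This is a sketch under one stated assumption: the mask has the form 2^n - 1, as a pad mask normally does. `dma_pad_len()` is a hypothetical helper name, not a kernel function.

```c
#include <assert.h>

/*
 * Hypothetical user-space helper: number of bytes to append so that
 * 'bytes' becomes a multiple of (pad_mask + 1).
 * Assumption: pad_mask is of the form 2^n - 1.
 */
static unsigned int dma_pad_len(unsigned int bytes, unsigned int pad_mask)
{
	assert((pad_mask & (pad_mask + 1)) == 0);

	/* Already aligned: the driver skips padding in this case. */
	if (!(bytes & pad_mask))
		return 0;

	/* Same expression as in the hunk: (mask & ~bytes) + 1. */
	return (pad_mask & ~bytes) + 1;
}
```

With a mask of 3, a 5-byte request gets 3 bytes of padding and a 7-byte request gets 1, so both end on the next 4-byte boundary.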

+22

drivers/scsi/scsi_trace.c

··· 326 326 } 327 327 328 328 static const char * 329 + scsi_trace_atomic_write16_out(struct trace_seq *p, unsigned char *cdb, int len) 330 + { 331 + const char *ret = trace_seq_buffer_ptr(p); 332 + unsigned int boundary_size; 333 + unsigned int nr_blocks; 334 + sector_t lba; 335 + 336 + lba = get_unaligned_be64(&cdb[2]); 337 + boundary_size = get_unaligned_be16(&cdb[10]); 338 + nr_blocks = get_unaligned_be16(&cdb[12]); 339 + 340 + trace_seq_printf(p, "lba=%llu txlen=%u boundary_size=%u", 341 + lba, nr_blocks, boundary_size); 342 + 343 + trace_seq_putc(p, 0); 344 + 345 + return ret; 346 + } 347 + 348 + static const char * 329 349 scsi_trace_varlen(struct trace_seq *p, unsigned char *cdb, int len) 330 350 { 331 351 switch (SERVICE_ACTION32(cdb)) { ··· 405 385 return scsi_trace_zbc_in(p, cdb, len); 406 386 case ZBC_OUT: 407 387 return scsi_trace_zbc_out(p, cdb, len); 388 + case WRITE_ATOMIC_16: 389 + return scsi_trace_atomic_write16_out(p, cdb, len); 408 390 default: 409 391 return scsi_trace_misc(p, cdb, len); 410 392 }

+227 -142

drivers/scsi/sd.c

··· 102 102 103 103 #define SD_MINORS 16 104 104 105 - static void sd_config_discard(struct scsi_disk *, unsigned int); 106 - static void sd_config_write_same(struct scsi_disk *); 105 + static void sd_config_discard(struct scsi_disk *sdkp, struct queue_limits *lim, 106 + unsigned int mode); 107 + static void sd_config_write_same(struct scsi_disk *sdkp, 108 + struct queue_limits *lim); 107 109 static int sd_revalidate_disk(struct gendisk *); 108 110 static void sd_unlock_native_capacity(struct gendisk *disk); 109 111 static void sd_shutdown(struct device *); 110 - static void sd_read_capacity(struct scsi_disk *sdkp, unsigned char *buffer); 111 112 static void scsi_disk_release(struct device *cdev); 112 113 113 114 static DEFINE_IDA(sd_index_ida); ··· 121 120 "write back, no read (daft)" 122 121 }; 123 122 124 - static void sd_set_flush_flag(struct scsi_disk *sdkp) 123 + static void sd_set_flush_flag(struct scsi_disk *sdkp, 124 + struct queue_limits *lim) 125 125 { 126 - bool wc = false, fua = false; 127 - 128 126 if (sdkp->WCE) { 129 - wc = true; 127 + lim->features |= BLK_FEAT_WRITE_CACHE; 130 128 if (sdkp->DPOFUA) 131 - fua = true; 129 + lim->features |= BLK_FEAT_FUA; 130 + else 131 + lim->features &= ~BLK_FEAT_FUA; 132 + } else { 133 + lim->features &= ~(BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA); 132 134 } 133 - 134 - blk_queue_write_cache(sdkp->disk->queue, wc, fua); 135 135 } 136 136 137 137 static ssize_t ··· 170 168 wce = (ct & 0x02) && !sdkp->write_prot ? 
1 : 0; 171 169 172 170 if (sdkp->cache_override) { 171 + struct queue_limits lim; 172 + 173 173 sdkp->WCE = wce; 174 174 sdkp->RCD = rcd; 175 - sd_set_flush_flag(sdkp); 175 + 176 + lim = queue_limits_start_update(sdkp->disk->queue); 177 + sd_set_flush_flag(sdkp, &lim); 178 + blk_mq_freeze_queue(sdkp->disk->queue); 179 + ret = queue_limits_commit_update(sdkp->disk->queue, &lim); 180 + blk_mq_unfreeze_queue(sdkp->disk->queue); 181 + if (ret) 182 + return ret; 176 183 return count; 177 184 } 178 185 ··· 468 457 { 469 458 struct scsi_disk *sdkp = to_scsi_disk(dev); 470 459 struct scsi_device *sdp = sdkp->device; 471 - int mode; 460 + struct queue_limits lim; 461 + int mode, err; 472 462 473 463 if (!capable(CAP_SYS_ADMIN)) 474 464 return -EACCES; 475 - 476 - if (sd_is_zoned(sdkp)) { 477 - sd_config_discard(sdkp, SD_LBP_DISABLE); 478 - return count; 479 - } 480 465 481 466 if (sdp->type != TYPE_DISK) 482 467 return -EINVAL; ··· 481 474 if (mode < 0) 482 475 return -EINVAL; 483 476 484 - sd_config_discard(sdkp, mode); 485 - 477 + lim = queue_limits_start_update(sdkp->disk->queue); 478 + sd_config_discard(sdkp, &lim, mode); 479 + blk_mq_freeze_queue(sdkp->disk->queue); 480 + err = queue_limits_commit_update(sdkp->disk->queue, &lim); 481 + blk_mq_unfreeze_queue(sdkp->disk->queue); 482 + if (err) 483 + return err; 486 484 return count; 487 485 } 488 486 static DEVICE_ATTR_RW(provisioning_mode); ··· 570 558 { 571 559 struct scsi_disk *sdkp = to_scsi_disk(dev); 572 560 struct scsi_device *sdp = sdkp->device; 561 + struct queue_limits lim; 573 562 unsigned long max; 574 563 int err; 575 564 ··· 592 579 sdkp->max_ws_blocks = max; 593 580 } 594 581 595 - sd_config_write_same(sdkp); 596 - 582 + lim = queue_limits_start_update(sdkp->disk->queue); 583 + sd_config_write_same(sdkp, &lim); 584 + blk_mq_freeze_queue(sdkp->disk->queue); 585 + err = queue_limits_commit_update(sdkp->disk->queue, &lim); 586 + blk_mq_unfreeze_queue(sdkp->disk->queue); 587 + if (err) 588 + return err; 597 
589 return count; 598 590 } 599 591 static DEVICE_ATTR_RW(max_write_same_blocks); ··· 841 823 return protect; 842 824 } 843 825 844 - static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode) 826 + static void sd_disable_discard(struct scsi_disk *sdkp) 845 827 { 846 - struct request_queue *q = sdkp->disk->queue; 828 + sdkp->provisioning_mode = SD_LBP_DISABLE; 829 + blk_queue_disable_discard(sdkp->disk->queue); 830 + } 831 + 832 + static void sd_config_discard(struct scsi_disk *sdkp, struct queue_limits *lim, 833 + unsigned int mode) 834 + { 847 835 unsigned int logical_block_size = sdkp->device->sector_size; 848 836 unsigned int max_blocks = 0; 849 837 850 - q->limits.discard_alignment = 851 - sdkp->unmap_alignment * logical_block_size; 852 - q->limits.discard_granularity = 853 - max(sdkp->physical_block_size, 854 - sdkp->unmap_granularity * logical_block_size); 838 + lim->discard_alignment = sdkp->unmap_alignment * logical_block_size; 839 + lim->discard_granularity = max(sdkp->physical_block_size, 840 + sdkp->unmap_granularity * logical_block_size); 855 841 sdkp->provisioning_mode = mode; 856 842 857 843 switch (mode) { 858 844 859 845 case SD_LBP_FULL: 860 846 case SD_LBP_DISABLE: 861 - blk_queue_max_discard_sectors(q, 0); 862 - return; 847 + break; 863 848 864 849 case SD_LBP_UNMAP: 865 850 max_blocks = min_not_zero(sdkp->max_unmap_blocks, ··· 893 872 break; 894 873 } 895 874 896 - blk_queue_max_discard_sectors(q, max_blocks * (logical_block_size >> 9)); 875 + lim->max_hw_discard_sectors = max_blocks * 876 + (logical_block_size >> SECTOR_SHIFT); 897 877 } 898 878 899 879 static void *sd_set_special_bvec(struct request *rq, unsigned int data_len) ··· 938 916 rq->timeout = SD_TIMEOUT; 939 917 940 918 return scsi_alloc_sgtables(cmd); 919 + } 920 + 921 + static void sd_config_atomic(struct scsi_disk *sdkp, struct queue_limits *lim) 922 + { 923 + unsigned int logical_block_size = sdkp->device->sector_size, 924 + physical_block_size_sectors, max_atomic, 
unit_min, unit_max; 925 + 926 + if ((!sdkp->max_atomic && !sdkp->max_atomic_with_boundary) || 927 + sdkp->protection_type == T10_PI_TYPE2_PROTECTION) 928 + return; 929 + 930 + physical_block_size_sectors = sdkp->physical_block_size / 931 + sdkp->device->sector_size; 932 + 933 + unit_min = rounddown_pow_of_two(sdkp->atomic_granularity ? 934 + sdkp->atomic_granularity : 935 + physical_block_size_sectors); 936 + 937 + /* 938 + * Only use atomic boundary when we have the odd scenario of 939 + * sdkp->max_atomic == 0, which the spec does permit. 940 + */ 941 + if (sdkp->max_atomic) { 942 + max_atomic = sdkp->max_atomic; 943 + unit_max = rounddown_pow_of_two(sdkp->max_atomic); 944 + sdkp->use_atomic_write_boundary = 0; 945 + } else { 946 + max_atomic = sdkp->max_atomic_with_boundary; 947 + unit_max = rounddown_pow_of_two(sdkp->max_atomic_boundary); 948 + sdkp->use_atomic_write_boundary = 1; 949 + } 950 + 951 + /* 952 + * Ensure compliance with granularity and alignment. For now, keep it 953 + * simple and just don't support atomic writes for values mismatched 954 + * with max_{boundary}atomic, physical block size, and 955 + * atomic_granularity itself. 956 + * 957 + * We're really being distrustful by checking unit_max also... 
958 + */ 959 + if (sdkp->atomic_granularity > 1) { 960 + if (unit_min > 1 && unit_min % sdkp->atomic_granularity) 961 + return; 962 + if (unit_max > 1 && unit_max % sdkp->atomic_granularity) 963 + return; 964 + } 965 + 966 + if (sdkp->atomic_alignment > 1) { 967 + if (unit_min > 1 && unit_min % sdkp->atomic_alignment) 968 + return; 969 + if (unit_max > 1 && unit_max % sdkp->atomic_alignment) 970 + return; 971 + } 972 + 973 + lim->atomic_write_hw_max = max_atomic * logical_block_size; 974 + lim->atomic_write_hw_boundary = 0; 975 + lim->atomic_write_hw_unit_min = unit_min * logical_block_size; 976 + lim->atomic_write_hw_unit_max = unit_max * logical_block_size; 941 977 } 942 978 943 979 static blk_status_t sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd, ··· 1080 1000 return sd_setup_write_same10_cmnd(cmd, false); 1081 1001 } 1082 1002 1083 - static void sd_config_write_same(struct scsi_disk *sdkp) 1003 + static void sd_disable_write_same(struct scsi_disk *sdkp) 1084 1004 { 1085 - struct request_queue *q = sdkp->disk->queue; 1005 + sdkp->device->no_write_same = 1; 1006 + sdkp->max_ws_blocks = 0; 1007 + blk_queue_disable_write_zeroes(sdkp->disk->queue); 1008 + } 1009 + 1010 + static void sd_config_write_same(struct scsi_disk *sdkp, 1011 + struct queue_limits *lim) 1012 + { 1086 1013 unsigned int logical_block_size = sdkp->device->sector_size; 1087 1014 1088 1015 if (sdkp->device->no_write_same) { ··· 1143 1056 } 1144 1057 1145 1058 out: 1146 - blk_queue_max_write_zeroes_sectors(q, sdkp->max_ws_blocks * 1147 - (logical_block_size >> 9)); 1059 + lim->max_write_zeroes_sectors = 1060 + sdkp->max_ws_blocks * (logical_block_size >> SECTOR_SHIFT); 1148 1061 } 1149 1062 1150 1063 static blk_status_t sd_setup_flush_cmnd(struct scsi_cmnd *cmd) ··· 1296 1209 return (hint - IOPRIO_HINT_DEV_DURATION_LIMIT_1) + 1; 1297 1210 } 1298 1211 1212 + static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd, 1213 + sector_t lba, unsigned int nr_blocks, 1214 + bool boundary, unsigned 
char flags) 1215 + { 1216 + cmd->cmd_len = 16; 1217 + cmd->cmnd[0] = WRITE_ATOMIC_16; 1218 + cmd->cmnd[1] = flags; 1219 + put_unaligned_be64(lba, &cmd->cmnd[2]); 1220 + put_unaligned_be16(nr_blocks, &cmd->cmnd[12]); 1221 + if (boundary) 1222 + put_unaligned_be16(nr_blocks, &cmd->cmnd[10]); 1223 + else 1224 + put_unaligned_be16(0, &cmd->cmnd[10]); 1225 + put_unaligned_be16(nr_blocks, &cmd->cmnd[12]); 1226 + cmd->cmnd[14] = 0; 1227 + cmd->cmnd[15] = 0; 1228 + 1229 + return BLK_STS_OK; 1230 + } 1231 + 1299 1232 static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd) 1300 1233 { 1301 1234 struct request *rq = scsi_cmd_to_rq(cmd); ··· 1381 1274 if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) { 1382 1275 ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks, 1383 1276 protect | fua, dld); 1277 + } else if (rq->cmd_flags & REQ_ATOMIC && write) { 1278 + ret = sd_setup_atomic_cmnd(cmd, lba, nr_blocks, 1279 + sdkp->use_atomic_write_boundary, 1280 + protect | fua); 1384 1281 } else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) { 1385 1282 ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks, 1386 1283 protect | fua, dld); ··· 2358 2247 case 0x24: /* INVALID FIELD IN CDB */ 2359 2248 switch (SCpnt->cmnd[0]) { 2360 2249 case UNMAP: 2361 - sd_config_discard(sdkp, SD_LBP_DISABLE); 2250 + sd_disable_discard(sdkp); 2362 2251 break; 2363 2252 case WRITE_SAME_16: 2364 2253 case WRITE_SAME: 2365 2254 if (SCpnt->cmnd[1] & 8) { /* UNMAP */ 2366 - sd_config_discard(sdkp, SD_LBP_DISABLE); 2255 + sd_disable_discard(sdkp); 2367 2256 } else { 2368 - sdkp->device->no_write_same = 1; 2369 - sd_config_write_same(sdkp); 2257 + sd_disable_write_same(sdkp); 2370 2258 req->rq_flags |= RQF_QUIET; 2371 2259 } 2372 2260 break; ··· 2377 2267 } 2378 2268 2379 2269 out: 2380 - if (sd_is_zoned(sdkp)) 2270 + if (sdkp->device->type == TYPE_ZBC) 2381 2271 good_bytes = sd_zbc_complete(SCpnt, good_bytes, &sshdr); 2382 2272 2383 2273 SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, 
SCpnt, ··· 2571 2461 return 0; 2572 2462 } 2573 2463 2574 - static void sd_config_protection(struct scsi_disk *sdkp) 2464 + static void sd_config_protection(struct scsi_disk *sdkp, 2465 + struct queue_limits *lim) 2575 2466 { 2576 2467 struct scsi_device *sdp = sdkp->device; 2577 2468 2578 - sd_dif_config_host(sdkp); 2469 + if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY)) 2470 + sd_dif_config_host(sdkp, lim); 2579 2471 2580 2472 if (!sdkp->protection_type) 2581 2473 return; ··· 2626 2514 #define READ_CAPACITY_RETRIES_ON_RESET 10 2627 2515 2628 2516 static int read_capacity_16(struct scsi_disk *sdkp, struct scsi_device *sdp, 2629 - unsigned char *buffer) 2517 + struct queue_limits *lim, unsigned char *buffer) 2630 2518 { 2631 2519 unsigned char cmd[16]; 2632 2520 struct scsi_sense_hdr sshdr; ··· 2700 2588 2701 2589 /* Lowest aligned logical block */ 2702 2590 alignment = ((buffer[14] & 0x3f) << 8 | buffer[15]) * sector_size; 2703 - blk_queue_alignment_offset(sdp->request_queue, alignment); 2591 + lim->alignment_offset = alignment; 2704 2592 if (alignment && sdkp->first_scan) 2705 2593 sd_printk(KERN_NOTICE, sdkp, 2706 2594 "physical block alignment offset: %u\n", alignment); ··· 2711 2599 if (buffer[14] & 0x40) /* LBPRZ */ 2712 2600 sdkp->lbprz = 1; 2713 2601 2714 - sd_config_discard(sdkp, SD_LBP_WS16); 2602 + sd_config_discard(sdkp, lim, SD_LBP_WS16); 2715 2603 } 2716 2604 2717 2605 sdkp->capacity = lba + 1; ··· 2814 2702 * read disk capacity 2815 2703 */ 2816 2704 static void 2817 - sd_read_capacity(struct scsi_disk *sdkp, unsigned char *buffer) 2705 + sd_read_capacity(struct scsi_disk *sdkp, struct queue_limits *lim, 2706 + unsigned char *buffer) 2818 2707 { 2819 2708 int sector_size; 2820 2709 struct scsi_device *sdp = sdkp->device; 2821 2710 2822 2711 if (sd_try_rc16_first(sdp)) { 2823 - sector_size = read_capacity_16(sdkp, sdp, buffer); 2712 + sector_size = read_capacity_16(sdkp, sdp, lim, buffer); 2824 2713 if (sector_size == -EOVERFLOW) 2825 2714 goto got_data; 
2826 2715 if (sector_size == -ENODEV) ··· 2841 2728 int old_sector_size = sector_size; 2842 2729 sd_printk(KERN_NOTICE, sdkp, "Very big device. " 2843 2730 "Trying to use READ CAPACITY(16).\n"); 2844 - sector_size = read_capacity_16(sdkp, sdp, buffer); 2731 + sector_size = read_capacity_16(sdkp, sdp, lim, buffer); 2845 2732 if (sector_size < 0) { 2846 2733 sd_printk(KERN_NOTICE, sdkp, 2847 2734 "Using 0xffffffff as device size\n"); ··· 2900 2787 */ 2901 2788 sector_size = 512; 2902 2789 } 2903 - blk_queue_logical_block_size(sdp->request_queue, sector_size); 2904 - blk_queue_physical_block_size(sdp->request_queue, 2905 - sdkp->physical_block_size); 2790 + lim->logical_block_size = sector_size; 2791 + lim->physical_block_size = sdkp->physical_block_size; 2906 2792 sdkp->device->sector_size = sector_size; 2907 2793 2908 2794 if (sdkp->capacity > 0xffffffff) ··· 3307 3195 return; 3308 3196 } 3309 3197 3310 - /** 3311 - * sd_read_block_limits - Query disk device for preferred I/O sizes. 3312 - * @sdkp: disk to query 3198 + static unsigned int sd_discard_mode(struct scsi_disk *sdkp) 3199 + { 3200 + if (!sdkp->lbpvpd) { 3201 + /* LBP VPD page not provided */ 3202 + if (sdkp->max_unmap_blocks) 3203 + return SD_LBP_UNMAP; 3204 + return SD_LBP_WS16; 3205 + } 3206 + 3207 + /* LBP VPD page tells us what to use */ 3208 + if (sdkp->lbpu && sdkp->max_unmap_blocks) 3209 + return SD_LBP_UNMAP; 3210 + if (sdkp->lbpws) 3211 + return SD_LBP_WS16; 3212 + if (sdkp->lbpws10) 3213 + return SD_LBP_WS10; 3214 + return SD_LBP_DISABLE; 3215 + } 3216 + 3217 + /* 3218 + * Query disk device for preferred I/O sizes. 
3313 3219 */ 3314 - static void sd_read_block_limits(struct scsi_disk *sdkp) 3220 + static void sd_read_block_limits(struct scsi_disk *sdkp, 3221 + struct queue_limits *lim) 3315 3222 { 3316 3223 struct scsi_vpd *vpd; 3317 3224 ··· 3350 3219 sdkp->max_ws_blocks = (u32)get_unaligned_be64(&vpd->data[36]); 3351 3220 3352 3221 if (!sdkp->lbpme) 3353 - goto out; 3222 + goto config_atomic; 3354 3223 3355 3224 lba_count = get_unaligned_be32(&vpd->data[20]); 3356 3225 desc_count = get_unaligned_be32(&vpd->data[24]); ··· 3364 3233 sdkp->unmap_alignment = 3365 3234 get_unaligned_be32(&vpd->data[32]) & ~(1 << 31); 3366 3235 3367 - if (!sdkp->lbpvpd) { /* LBP VPD page not provided */ 3236 + sd_config_discard(sdkp, lim, sd_discard_mode(sdkp)); 3368 3237 3369 - if (sdkp->max_unmap_blocks) 3370 - sd_config_discard(sdkp, SD_LBP_UNMAP); 3371 - else 3372 - sd_config_discard(sdkp, SD_LBP_WS16); 3238 + config_atomic: 3239 + sdkp->max_atomic = get_unaligned_be32(&vpd->data[44]); 3240 + sdkp->atomic_alignment = get_unaligned_be32(&vpd->data[48]); 3241 + sdkp->atomic_granularity = get_unaligned_be32(&vpd->data[52]); 3242 + sdkp->max_atomic_with_boundary = get_unaligned_be32(&vpd->data[56]); 3243 + sdkp->max_atomic_boundary = get_unaligned_be32(&vpd->data[60]); 3373 3244 3374 - } else { /* LBP VPD page tells us what to use */ 3375 - if (sdkp->lbpu && sdkp->max_unmap_blocks) 3376 - sd_config_discard(sdkp, SD_LBP_UNMAP); 3377 - else if (sdkp->lbpws) 3378 - sd_config_discard(sdkp, SD_LBP_WS16); 3379 - else if (sdkp->lbpws10) 3380 - sd_config_discard(sdkp, SD_LBP_WS10); 3381 - else 3382 - sd_config_discard(sdkp, SD_LBP_DISABLE); 3383 - } 3245 + sd_config_atomic(sdkp, lim); 3384 3246 } 3385 3247 3386 3248 out: ··· 3392 3268 rcu_read_unlock(); 3393 3269 } 3394 3270 3395 - /** 3396 - * sd_read_block_characteristics - Query block dev. 
characteristics 3397 - * @sdkp: disk to query 3398 - */ 3399 - static void sd_read_block_characteristics(struct scsi_disk *sdkp) 3271 + /* Query block device characteristics */ 3272 + static void sd_read_block_characteristics(struct scsi_disk *sdkp, 3273 + struct queue_limits *lim) 3400 3274 { 3401 - struct request_queue *q = sdkp->disk->queue; 3402 3275 struct scsi_vpd *vpd; 3403 3276 u16 rot; 3404 3277 ··· 3411 3290 sdkp->zoned = (vpd->data[8] >> 4) & 3; 3412 3291 rcu_read_unlock(); 3413 3292 3414 - if (rot == 1) { 3415 - blk_queue_flag_set(QUEUE_FLAG_NONROT, q); 3416 - blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q); 3417 - } 3418 - 3419 - 3420 - #ifdef CONFIG_BLK_DEV_ZONED /* sd_probe rejects ZBD devices early otherwise */ 3421 - if (sdkp->device->type == TYPE_ZBC) { 3422 - /* 3423 - * Host-managed. 3424 - */ 3425 - disk_set_zoned(sdkp->disk); 3426 - 3427 - /* 3428 - * Per ZBC and ZAC specifications, writes in sequential write 3429 - * required zones of host-managed devices must be aligned to 3430 - * the device physical block size. 3431 - */ 3432 - blk_queue_zone_write_granularity(q, sdkp->physical_block_size); 3433 - } else { 3434 - /* 3435 - * Host-aware devices are treated as conventional. 
3436 - */ 3437 - WARN_ON_ONCE(blk_queue_is_zoned(q)); 3438 - } 3439 - #endif /* CONFIG_BLK_DEV_ZONED */ 3293 + if (rot == 1) 3294 + lim->features &= ~(BLK_FEAT_ROTATIONAL | BLK_FEAT_ADD_RANDOM); 3440 3295 3441 3296 if (!sdkp->first_scan) 3442 3297 return; 3443 3298 3444 - if (blk_queue_is_zoned(q)) 3299 + if (sdkp->device->type == TYPE_ZBC) 3445 3300 sd_printk(KERN_NOTICE, sdkp, "Host-managed zoned block device\n"); 3446 3301 else if (sdkp->zoned == 1) 3447 3302 sd_printk(KERN_NOTICE, sdkp, "Host-aware SMR disk used as regular disk\n"); ··· 3698 3601 { 3699 3602 struct scsi_disk *sdkp = scsi_disk(disk); 3700 3603 struct scsi_device *sdp = sdkp->device; 3701 - struct request_queue *q = sdkp->disk->queue; 3702 3604 sector_t old_capacity = sdkp->capacity; 3605 + struct queue_limits lim; 3703 3606 unsigned char *buffer; 3704 - unsigned int dev_max, rw_max; 3607 + unsigned int dev_max; 3608 + int err; 3705 3609 3706 3610 SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp, 3707 3611 "sd_revalidate_disk\n")); ··· 3723 3625 3724 3626 sd_spinup_disk(sdkp); 3725 3627 3628 + lim = queue_limits_start_update(sdkp->disk->queue); 3629 + 3726 3630 /* 3727 3631 * Without media there is no reason to ask; moreover, some devices 3728 3632 * react badly if we do. 3729 3633 */ 3730 3634 if (sdkp->media_present) { 3731 - sd_read_capacity(sdkp, buffer); 3635 + sd_read_capacity(sdkp, &lim, buffer); 3732 3636 /* 3733 3637 * Some USB/UAS devices return generic values for mode pages 3734 3638 * until the media has been accessed. Trigger a READ operation ··· 3744 3644 * cause this to be updated correctly and any device which 3745 3645 * doesn't support it should be treated as rotational. 
3746 3646 */ 3747 - blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); 3748 - blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q); 3647 + lim.features |= (BLK_FEAT_ROTATIONAL | BLK_FEAT_ADD_RANDOM); 3749 3648 3750 3649 if (scsi_device_supports_vpd(sdp)) { 3751 3650 sd_read_block_provisioning(sdkp); 3752 - sd_read_block_limits(sdkp); 3651 + sd_read_block_limits(sdkp, &lim); 3753 3652 sd_read_block_limits_ext(sdkp); 3754 - sd_read_block_characteristics(sdkp); 3755 - sd_zbc_read_zones(sdkp, buffer); 3653 + sd_read_block_characteristics(sdkp, &lim); 3654 + sd_zbc_read_zones(sdkp, &lim, buffer); 3756 3655 sd_read_cpr(sdkp); 3757 3656 } 3758 3657 ··· 3763 3664 sd_read_app_tag_own(sdkp, buffer); 3764 3665 sd_read_write_same(sdkp, buffer); 3765 3666 sd_read_security(sdkp, buffer); 3766 - sd_config_protection(sdkp); 3667 + sd_config_protection(sdkp, &lim); 3767 3668 } 3768 3669 3769 3670 /* 3770 3671 * We now have all cache related info, determine how we deal 3771 3672 * with flush requests. 3772 3673 */ 3773 - sd_set_flush_flag(sdkp); 3674 + sd_set_flush_flag(sdkp, &lim); 3774 3675 3775 3676 /* Initial block count limit based on CDB TRANSFER LENGTH field size. */ 3776 3677 dev_max = sdp->use_16_for_rw ? SD_MAX_XFER_BLOCKS : SD_DEF_XFER_BLOCKS; 3777 3678 3778 3679 /* Some devices report a maximum block count for READ/WRITE requests. 
*/ 3779 3680 dev_max = min_not_zero(dev_max, sdkp->max_xfer_blocks); 3780 - q->limits.max_dev_sectors = logical_to_sectors(sdp, dev_max); 3681 + lim.max_dev_sectors = logical_to_sectors(sdp, dev_max); 3781 3682 3782 3683 if (sd_validate_min_xfer_size(sdkp)) 3783 - blk_queue_io_min(sdkp->disk->queue, 3784 - logical_to_bytes(sdp, sdkp->min_xfer_blocks)); 3684 + lim.io_min = logical_to_bytes(sdp, sdkp->min_xfer_blocks); 3785 3685 else 3786 - blk_queue_io_min(sdkp->disk->queue, 0); 3787 - 3788 - if (sd_validate_opt_xfer_size(sdkp, dev_max)) { 3789 - q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks); 3790 - rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks); 3791 - } else { 3792 - q->limits.io_opt = 0; 3793 - rw_max = min_not_zero(logical_to_sectors(sdp, dev_max), 3794 - (sector_t)BLK_DEF_MAX_SECTORS_CAP); 3795 - } 3686 + lim.io_min = 0; 3796 3687 3797 3688 /* 3798 3689 * Limit default to SCSI host optimal sector limit if set. There may be 3799 3690 * an impact on performance for when the size of a request exceeds this 3800 3691 * host limit. 3801 3692 */ 3802 - rw_max = min_not_zero(rw_max, sdp->host->opt_sectors); 3803 - 3804 - /* Do not exceed controller limit */ 3805 - rw_max = min(rw_max, queue_max_hw_sectors(q)); 3806 - 3807 - /* 3808 - * Only update max_sectors if previously unset or if the current value 3809 - * exceeds the capabilities of the hardware. 
3810 - */ 3811 - if (sdkp->first_scan || 3812 - q->limits.max_sectors > q->limits.max_dev_sectors || 3813 - q->limits.max_sectors > q->limits.max_hw_sectors) { 3814 - q->limits.max_sectors = rw_max; 3815 - q->limits.max_user_sectors = rw_max; 3693 + lim.io_opt = sdp->host->opt_sectors << SECTOR_SHIFT; 3694 + if (sd_validate_opt_xfer_size(sdkp, dev_max)) { 3695 + lim.io_opt = min_not_zero(lim.io_opt, 3696 + logical_to_bytes(sdp, sdkp->opt_xfer_blocks)); 3816 3697 } 3817 3698 3818 3699 sdkp->first_scan = 0; 3819 3700 3820 3701 set_capacity_and_notify(disk, logical_to_sectors(sdp, sdkp->capacity)); 3821 - sd_config_write_same(sdkp); 3702 + sd_config_write_same(sdkp, &lim); 3822 3703 kfree(buffer); 3704 + 3705 + blk_mq_freeze_queue(sdkp->disk->queue); 3706 + err = queue_limits_commit_update(sdkp->disk->queue, &lim); 3707 + blk_mq_unfreeze_queue(sdkp->disk->queue); 3708 + if (err) 3709 + return err; 3823 3710 3824 3711 /* 3825 3712 * For a zoned drive, revalidating the zones can be done only once
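Note on the io_opt hunk above: the new code seeds `lim.io_opt` from the host's optimal sector count and then, if the device-reported optimal transfer size passes validation, takes the smaller non-zero value of the two. A minimal userspace sketch of that combination follows. The names `min_not_zero_u32` and `io_opt_sketch` are illustrative stand-ins, not kernel symbols, and the validation step (`sd_validate_opt_xfer_size()`) is reduced to a boolean argument.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors, matching the kernel constant */

/* Stand-in for the kernel's min_not_zero(): the smaller value, ignoring zeros. */
static uint32_t min_not_zero_u32(uint32_t a, uint32_t b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * host_opt_sectors: host optimal limit in 512-byte sectors (0 means unset).
 * dev_opt_bytes:    device-reported optimal transfer size, already in bytes.
 * dev_opt_valid:    outcome of the validation step, which is not modelled here.
 */
static uint32_t io_opt_sketch(uint32_t host_opt_sectors, uint32_t dev_opt_bytes,
			      int dev_opt_valid)
{
	uint32_t io_opt;

	/* Guard the shift below against overflow in this 32-bit model. */
	assert(host_opt_sectors <= (UINT32_MAX >> SECTOR_SHIFT));

	io_opt = host_opt_sectors << SECTOR_SHIFT;
	if (dev_opt_valid)
		io_opt = min_not_zero_u32(io_opt, dev_opt_bytes);
	return io_opt;
}
```

With a host limit of 128 sectors (65536 bytes) and a valid device hint of 32768 bytes, the sketch yields 32768. With no host limit, the device hint is used unchanged.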

+13 -18

drivers/scsi/sd.h

··· 115 115 u32 max_unmap_blocks; 116 116 u32 unmap_granularity; 117 117 u32 unmap_alignment; 118 + 119 + u32 max_atomic; 120 + u32 atomic_alignment; 121 + u32 atomic_granularity; 122 + u32 max_atomic_with_boundary; 123 + u32 max_atomic_boundary; 124 + 118 125 u32 index; 119 126 unsigned int physical_block_size; 120 127 unsigned int max_medium_access_timeouts; ··· 155 148 unsigned security : 1; 156 149 unsigned ignore_medium_access_errors : 1; 157 150 unsigned rscs : 1; /* reduced stream control support */ 151 + unsigned use_atomic_write_boundary : 1; 158 152 }; 159 153 #define to_scsi_disk(obj) container_of(obj, struct scsi_disk, disk_dev) 160 154 ··· 228 220 return sector >> (ilog2(sdev->sector_size) - 9); 229 221 } 230 222 231 - #ifdef CONFIG_BLK_DEV_INTEGRITY 232 - 233 - extern void sd_dif_config_host(struct scsi_disk *); 234 - 235 - #else /* CONFIG_BLK_DEV_INTEGRITY */ 236 - 237 - static inline void sd_dif_config_host(struct scsi_disk *disk) 238 - { 239 - } 240 - 241 - #endif /* CONFIG_BLK_DEV_INTEGRITY */ 242 - 243 - static inline int sd_is_zoned(struct scsi_disk *sdkp) 244 - { 245 - return sdkp->zoned == 1 || sdkp->device->type == TYPE_ZBC; 246 - } 223 + void sd_dif_config_host(struct scsi_disk *sdkp, struct queue_limits *lim); 247 224 248 225 #ifdef CONFIG_BLK_DEV_ZONED 249 226 250 - int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE]); 227 + int sd_zbc_read_zones(struct scsi_disk *sdkp, struct queue_limits *lim, 228 + u8 buf[SD_BUF_SIZE]); 251 229 int sd_zbc_revalidate_zones(struct scsi_disk *sdkp); 252 230 blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd, 253 231 unsigned char op, bool all); ··· 244 250 245 251 #else /* CONFIG_BLK_DEV_ZONED */ 246 252 247 - static inline int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE]) 253 + static inline int sd_zbc_read_zones(struct scsi_disk *sdkp, 254 + struct queue_limits *lim, u8 buf[SD_BUF_SIZE]) 248 255 { 249 256 return 0; 250 257 }

+17 -28

drivers/scsi/sd_dif.c

··· 24 24 /* 25 25 * Configure exchange of protection information between OS and HBA. 26 26 */ 27 - void sd_dif_config_host(struct scsi_disk *sdkp) 27 + void sd_dif_config_host(struct scsi_disk *sdkp, struct queue_limits *lim) 28 28 { 29 29 struct scsi_device *sdp = sdkp->device; 30 - struct gendisk *disk = sdkp->disk; 31 30 u8 type = sdkp->protection_type; 32 - struct blk_integrity bi; 31 + struct blk_integrity *bi = &lim->integrity; 33 32 int dif, dix; 33 + 34 + memset(bi, 0, sizeof(*bi)); 34 35 35 36 dif = scsi_host_dif_capable(sdp->host, type); 36 37 dix = scsi_host_dix_capable(sdp->host, type); ··· 40 39 dif = 0; dix = 1; 41 40 } 42 41 43 - if (!dix) { 44 - blk_integrity_unregister(disk); 42 + if (!dix) 45 43 return; 46 - } 47 - 48 - memset(&bi, 0, sizeof(bi)); 49 44 50 45 /* Enable DMA of protection information */ 51 - if (scsi_host_get_guard(sdkp->device->host) & SHOST_DIX_GUARD_IP) { 52 - if (type == T10_PI_TYPE3_PROTECTION) 53 - bi.profile = &t10_pi_type3_ip; 54 - else 55 - bi.profile = &t10_pi_type1_ip; 46 + if (scsi_host_get_guard(sdkp->device->host) & SHOST_DIX_GUARD_IP) 47 + bi->csum_type = BLK_INTEGRITY_CSUM_IP; 48 + else 49 + bi->csum_type = BLK_INTEGRITY_CSUM_CRC; 56 50 57 - bi.flags |= BLK_INTEGRITY_IP_CHECKSUM; 58 - } else 59 - if (type == T10_PI_TYPE3_PROTECTION) 60 - bi.profile = &t10_pi_type3_crc; 61 - else 62 - bi.profile = &t10_pi_type1_crc; 51 + if (type != T10_PI_TYPE3_PROTECTION) 52 + bi->flags |= BLK_INTEGRITY_REF_TAG; 63 53 64 - bi.tuple_size = sizeof(struct t10_pi_tuple); 54 + bi->tuple_size = sizeof(struct t10_pi_tuple); 65 55 66 56 if (dif && type) { 67 - bi.flags |= BLK_INTEGRITY_DEVICE_CAPABLE; 57 + bi->flags |= BLK_INTEGRITY_DEVICE_CAPABLE; 68 58 69 59 if (!sdkp->ATO) 70 - goto out; 60 + return; 71 61 72 62 if (type == T10_PI_TYPE3_PROTECTION) 73 - bi.tag_size = sizeof(u16) + sizeof(u32); 63 + bi->tag_size = sizeof(u16) + sizeof(u32); 74 64 else 75 - bi.tag_size = sizeof(u16); 65 + bi->tag_size = sizeof(u16); 76 66 } 77 67 78 68 
sd_first_printk(KERN_NOTICE, sdkp, 79 69 "Enabling DIX %s, application tag size %u bytes\n", 80 - bi.profile->name, bi.tag_size); 81 - out: 82 - blk_integrity_register(disk, &bi); 70 + blk_integrity_profile_name(bi), bi->tag_size); 83 71 } 84 -
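Note on the sd_dif.c hunk above: the four `t10_pi_type{1,3}_{ip,crc}` profiles are replaced by a checksum type plus a reference-tag flag. The sketch below models only the decisions visible in the hunk. The struct, the constant names and `dif_config_sketch` are hypothetical userspace stand-ins. The flag bit positions and checksum values mirror the enums shown later in this diff, the 8-byte tuple size assumes the usual layout of `struct t10_pi_tuple`, and the context lines hidden by the hunk boundary (including the case that forces `dif = 0; dix = 1`) are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

enum csum_sketch { CSUM_NONE = 0, CSUM_IP = 1, CSUM_CRC = 2 };

#define FLAG_DEVICE_CAPABLE (1u << 2)
#define FLAG_REF_TAG        (1u << 3)
#define PI_TYPE3            3u /* assumed value of T10_PI_TYPE3_PROTECTION */

struct integrity_sketch {
	unsigned char flags;
	enum csum_sketch csum_type;
	unsigned char tuple_size;
	unsigned char tag_size;
};

static struct integrity_sketch dif_config_sketch(bool dix, bool dif,
						 unsigned int type,
						 bool guard_ip, bool ato)
{
	struct integrity_sketch bi = { 0 }; /* mirrors the memset() at the top */

	assert(type <= PI_TYPE3);

	if (!dix)
		return bi; /* no DIX: the limits keep an all-zero profile */

	bi.csum_type = guard_ip ? CSUM_IP : CSUM_CRC;
	if (type != PI_TYPE3)
		bi.flags |= FLAG_REF_TAG; /* Type 1/2 carry a checked ref tag */
	bi.tuple_size = 8; /* assumed sizeof(struct t10_pi_tuple) */

	if (dif && type) {
		bi.flags |= FLAG_DEVICE_CAPABLE;
		if (!ato)
			return bi;
		/* sizeof(u16) + sizeof(u32) for Type 3, sizeof(u16) otherwise */
		bi.tag_size = (type == PI_TYPE3) ? 6 : 2;
	}
	return bi;
}
```

One visible consequence of the new layout: when DIX is unavailable the function simply returns with a zeroed profile, where the old code called `blk_integrity_unregister()`.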

+26 -26

drivers/scsi/sd_zbc.c

··· 232 232 int zone_idx = 0; 233 233 int ret; 234 234 235 - if (!sd_is_zoned(sdkp)) 235 + if (sdkp->device->type != TYPE_ZBC) 236 236 /* Not a zoned device */ 237 237 return -EOPNOTSUPP; 238 238 ··· 300 300 struct scsi_disk *sdkp = scsi_disk(rq->q->disk); 301 301 sector_t sector = blk_rq_pos(rq); 302 302 303 - if (!sd_is_zoned(sdkp)) 303 + if (sdkp->device->type != TYPE_ZBC) 304 304 /* Not a zoned device */ 305 305 return BLK_STS_IOERR; 306 306 ··· 521 521 522 522 static void sd_zbc_print_zones(struct scsi_disk *sdkp) 523 523 { 524 - if (!sd_is_zoned(sdkp) || !sdkp->capacity) 524 + if (sdkp->device->type != TYPE_ZBC || !sdkp->capacity) 525 525 return; 526 526 527 527 if (sdkp->capacity & (sdkp->zone_info.zone_blocks - 1)) ··· 565 565 sdkp->zone_info.zone_blocks = zone_blocks; 566 566 sdkp->zone_info.nr_zones = nr_zones; 567 567 568 - blk_queue_chunk_sectors(q, 569 - logical_to_sectors(sdkp->device, zone_blocks)); 570 - 571 - /* Enable block layer zone append emulation */ 572 - blk_queue_max_zone_append_sectors(q, 0); 573 - 574 568 flags = memalloc_noio_save(); 575 569 ret = blk_revalidate_disk_zones(disk); 576 570 memalloc_noio_restore(flags); ··· 582 588 /** 583 589 * sd_zbc_read_zones - Read zone information and update the request queue 584 590 * @sdkp: SCSI disk pointer. 591 + * @lim: queue limits to read into 585 592 * @buf: 512 byte buffer used for storing SCSI command output. 586 593 * 587 594 * Read zone information and update the request queue zone characteristics and 588 595 * also the zoned device information in *sdkp. Called by sd_revalidate_disk() 589 596 * before the gendisk capacity has been set. 
590 597 */ 591 - int sd_zbc_read_zones(struct scsi_disk *sdkp, u8 buf[SD_BUF_SIZE]) 598 + int sd_zbc_read_zones(struct scsi_disk *sdkp, struct queue_limits *lim, 599 + u8 buf[SD_BUF_SIZE]) 592 600 { 593 - struct gendisk *disk = sdkp->disk; 594 - struct request_queue *q = disk->queue; 595 601 unsigned int nr_zones; 596 602 u32 zone_blocks = 0; 597 603 int ret; 598 604 599 - if (!sd_is_zoned(sdkp)) { 600 - /* 601 - * Device managed or normal SCSI disk, no special handling 602 - * required. 603 - */ 605 + if (sdkp->device->type != TYPE_ZBC) 604 606 return 0; 605 - } 607 + 608 + lim->features |= BLK_FEAT_ZONED; 609 + 610 + /* 611 + * Per ZBC and ZAC specifications, writes in sequential write required 612 + * zones of host-managed devices must be aligned to the device physical 613 + * block size. 614 + */ 615 + lim->zone_write_granularity = sdkp->physical_block_size; 606 616 607 617 /* READ16/WRITE16/SYNC16 is mandatory for ZBC devices */ 608 618 sdkp->device->use_16_for_rw = 1; ··· 623 625 if (ret != 0) 624 626 goto err; 625 627 626 - /* The drive satisfies the kernel restrictions: set it up */ 627 - blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q); 628 - if (sdkp->zones_max_open == U32_MAX) 629 - disk_set_max_open_zones(disk, 0); 630 - else 631 - disk_set_max_open_zones(disk, sdkp->zones_max_open); 632 - disk_set_max_active_zones(disk, 0); 633 628 nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks); 634 - 635 629 sdkp->early_zone_info.nr_zones = nr_zones; 636 630 sdkp->early_zone_info.zone_blocks = zone_blocks; 631 + 632 + /* The drive satisfies the kernel restrictions: set it up */ 633 + if (sdkp->zones_max_open == U32_MAX) 634 + lim->max_open_zones = 0; 635 + else 636 + lim->max_open_zones = sdkp->zones_max_open; 637 + lim->max_active_zones = 0; 638 + lim->chunk_sectors = logical_to_sectors(sdkp->device, zone_blocks); 639 + /* Enable block layer zone append emulation */ 640 + lim->max_zone_append_sectors = 0; 637 641 638 642 return 0; 639 643
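Note on the sd_zbc.c hunk above: the zone count is the capacity rounded up to a whole number of zones, and a reported `zones_max_open` of `U32_MAX` is stored as 0 (no limit). The sketch below reproduces that arithmetic in userspace. It assumes `zone_blocks` is a power of two, which the shift by `ilog2(zone_blocks)` in the hunk relies on. The function names are illustrative, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 for a non-zero value, standing in for the kernel's ilog2(). */
static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Capacity and zone size are both expressed in logical blocks. */
static uint64_t nr_zones_sketch(uint64_t capacity, uint32_t zone_blocks)
{
	uint64_t rounded;

	/* The mask trick below is only valid for power-of-two zone sizes. */
	assert(zone_blocks != 0 && (zone_blocks & (zone_blocks - 1)) == 0);

	rounded = (capacity + zone_blocks - 1) & ~((uint64_t)zone_blocks - 1);
	return rounded >> ilog2_u32(zone_blocks);
}

/* U32_MAX from the device means "no limit", which the limits encode as 0. */
static uint32_t max_open_zones_sketch(uint32_t zones_max_open)
{
	return zones_max_open == UINT32_MAX ? 0 : zones_max_open;
}
```

A capacity that is not a multiple of the zone size therefore counts the trailing partial zone as one extra zone.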

+25 -17

drivers/scsi/sr.c

··· 111 111 static int sr_open(struct cdrom_device_info *, int); 112 112 static void sr_release(struct cdrom_device_info *); 113 113 114 - static void get_sectorsize(struct scsi_cd *); 114 + static int get_sectorsize(struct scsi_cd *); 115 115 static int get_capabilities(struct scsi_cd *); 116 116 117 117 static unsigned int sr_check_events(struct cdrom_device_info *cdi, ··· 473 473 return BLK_STS_IOERR; 474 474 } 475 475 476 - static void sr_revalidate_disk(struct scsi_cd *cd) 476 + static int sr_revalidate_disk(struct scsi_cd *cd) 477 477 { 478 478 struct scsi_sense_hdr sshdr; 479 479 480 480 /* if the unit is not ready, nothing more to do */ 481 481 if (scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, &sshdr)) 482 - return; 482 + return 0; 483 483 sr_cd_check(&cd->cdi); 484 - get_sectorsize(cd); 484 + return get_sectorsize(cd); 485 485 } 486 486 487 487 static int sr_block_open(struct gendisk *disk, blk_mode_t mode) ··· 494 494 return -ENXIO; 495 495 496 496 scsi_autopm_get_device(sdev); 497 - if (disk_check_media_change(disk)) 498 - sr_revalidate_disk(cd); 497 + if (disk_check_media_change(disk)) { 498 + ret = sr_revalidate_disk(cd); 499 + if (ret) 500 + goto out; 501 + } 499 502 500 503 mutex_lock(&cd->lock); 501 504 ret = cdrom_open(&cd->cdi, mode); 502 505 mutex_unlock(&cd->lock); 503 - 506 + out: 504 507 scsi_autopm_put_device(sdev); 505 508 if (ret) 506 509 scsi_device_put(cd->device); ··· 688 685 blk_pm_runtime_init(sdev->request_queue, dev); 689 686 690 687 dev_set_drvdata(dev, cd); 691 - sr_revalidate_disk(cd); 688 + error = sr_revalidate_disk(cd); 689 + if (error) 690 + goto unregister_cdrom; 692 691 693 692 error = device_add_disk(&sdev->sdev_gendev, disk, NULL); 694 693 if (error) ··· 719 714 } 720 715 721 716 722 - static void get_sectorsize(struct scsi_cd *cd) 717 + static int get_sectorsize(struct scsi_cd *cd) 723 718 { 719 + struct request_queue *q = cd->device->request_queue; 724 720 static const u8 cmd[10] = { READ_CAPACITY }; 725 721 
unsigned char buffer[8] = { }; 726 - int the_result; 722 + struct queue_limits lim; 723 + int err; 727 724 int sector_size; 728 - struct request_queue *queue; 729 725 struct scsi_failure failure_defs[] = { 730 726 { 731 727 .result = SCMD_FAILURE_RESULT_ANY, ··· 742 736 }; 743 737 744 738 /* Do the command and wait.. */ 745 - the_result = scsi_execute_cmd(cd->device, cmd, REQ_OP_DRV_IN, buffer, 739 + err = scsi_execute_cmd(cd->device, cmd, REQ_OP_DRV_IN, buffer, 746 740 sizeof(buffer), SR_TIMEOUT, MAX_RETRIES, 747 741 &exec_args); 748 - if (the_result) { 742 + if (err) { 749 743 cd->capacity = 0x1fffff; 750 744 sector_size = 2048; /* A guess, just in case */ 751 745 } else { ··· 795 789 set_capacity(cd->disk, cd->capacity); 796 790 } 797 791 798 - queue = cd->device->request_queue; 799 - blk_queue_logical_block_size(queue, sector_size); 800 - 801 - return; 792 + lim = queue_limits_start_update(q); 793 + lim.logical_block_size = sector_size; 794 + blk_mq_freeze_queue(q); 795 + err = queue_limits_commit_update(q, &lim); 796 + blk_mq_unfreeze_queue(q); 797 + return err; 802 798 } 803 799 804 800 static int get_capabilities(struct scsi_cd *cd)
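Note on the sr.c hunk above: `get_sectorsize()` now takes a copy of the queue limits, edits the copy, and commits it, and the commit can fail, which is why the function and its callers gained return values. The sketch below models that copy, validate, commit shape in userspace. All names ending in `_sketch` are hypothetical. The validation rule is borrowed from `blk_validate_block_size()` as shown in the blkdev.h part of this diff, with a page size of 4096 assumed. The locking done by the real start/commit pair and the queue freeze around the commit are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096ul /* assumption: 4 KiB pages */

struct limits_sketch {
	unsigned int logical_block_size;
};

struct queue_sketch {
	struct limits_sketch limits;
};

static int validate_block_size_sketch(unsigned long bsize)
{
	if (bsize < 512 || bsize > PAGE_SIZE_SKETCH || (bsize & (bsize - 1)))
		return -EINVAL;
	return 0;
}

/* Hand the caller a private copy of the current limits to edit. */
static struct limits_sketch limits_start_update_sketch(struct queue_sketch *q)
{
	assert(q != NULL);
	return q->limits;
}

/* Publish the edited copy only if it validates; otherwise leave q untouched. */
static int limits_commit_update_sketch(struct queue_sketch *q,
				       const struct limits_sketch *lim)
{
	int err;

	assert(q != NULL && lim != NULL);
	err = validate_block_size_sketch(lim->logical_block_size);
	if (err)
		return err;
	q->limits = *lim;
	return 0;
}
```

The point of the pattern is that a rejected update leaves the previously committed limits in place, so a caller such as `sr_block_open()` can bail out with the error instead of running with a half-applied configuration.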

+28 -25

drivers/target/target_core_iblock.c

··· 148 148 dev->dev_attrib.is_nonrot = 1; 149 149 150 150 bi = bdev_get_integrity(bd); 151 - if (bi) { 152 - struct bio_set *bs = &ib_dev->ibd_bio_set; 151 + if (!bi) 152 + return 0; 153 153 154 - if (!strcmp(bi->profile->name, "T10-DIF-TYPE3-IP") || 155 - !strcmp(bi->profile->name, "T10-DIF-TYPE1-IP")) { 156 - pr_err("IBLOCK export of blk_integrity: %s not" 157 - " supported\n", bi->profile->name); 158 - ret = -ENOSYS; 159 - goto out_blkdev_put; 160 - } 161 - 162 - if (!strcmp(bi->profile->name, "T10-DIF-TYPE3-CRC")) { 163 - dev->dev_attrib.pi_prot_type = TARGET_DIF_TYPE3_PROT; 164 - } else if (!strcmp(bi->profile->name, "T10-DIF-TYPE1-CRC")) { 154 + switch (bi->csum_type) { 155 + case BLK_INTEGRITY_CSUM_IP: 156 + pr_err("IBLOCK export of blk_integrity: %s not supported\n", 157 + blk_integrity_profile_name(bi)); 158 + ret = -ENOSYS; 159 + goto out_blkdev_put; 160 + case BLK_INTEGRITY_CSUM_CRC: 161 + if (bi->flags & BLK_INTEGRITY_REF_TAG) 165 162 dev->dev_attrib.pi_prot_type = TARGET_DIF_TYPE1_PROT; 166 - } 167 - 168 - if (dev->dev_attrib.pi_prot_type) { 169 - if (bioset_integrity_create(bs, IBLOCK_BIO_POOL_SIZE) < 0) { 170 - pr_err("Unable to allocate bioset for PI\n"); 171 - ret = -ENOMEM; 172 - goto out_blkdev_put; 173 - } 174 - pr_debug("IBLOCK setup BIP bs->bio_integrity_pool: %p\n", 175 - &bs->bio_integrity_pool); 176 - } 177 - dev->dev_attrib.hw_pi_prot_type = dev->dev_attrib.pi_prot_type; 163 + else 164 + dev->dev_attrib.pi_prot_type = TARGET_DIF_TYPE3_PROT; 165 + break; 166 + default: 167 + break; 178 168 } 179 169 170 + if (dev->dev_attrib.pi_prot_type) { 171 + struct bio_set *bs = &ib_dev->ibd_bio_set; 172 + 173 + if (bioset_integrity_create(bs, IBLOCK_BIO_POOL_SIZE) < 0) { 174 + pr_err("Unable to allocate bioset for PI\n"); 175 + ret = -ENOMEM; 176 + goto out_blkdev_put; 177 + } 178 + pr_debug("IBLOCK setup BIP bs->bio_integrity_pool: %p\n", 179 + &bs->bio_integrity_pool); 180 + } 181 + 182 + dev->dev_attrib.hw_pi_prot_type = 
dev->dev_attrib.pi_prot_type; 180 183 return 0; 181 184 182 185 out_blkdev_put:
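Note on the target_core_iblock.c hunk above: the string comparisons against profile names are replaced by a switch on `csum_type`, with the reference-tag flag distinguishing Type 1 from Type 3. A small sketch of that decision table follows. The enum and function names are hypothetical stand-ins, the numeric protection-type values are placeholders rather than the target core's constants, and starting from "no protection" is an assumption, since the initial value of `pi_prot_type` lies outside the hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum csum_sketch { CSUM_NONE = 0, CSUM_IP = 1, CSUM_CRC = 2, CSUM_CRC64 = 3 };
enum prot_sketch { PROT_NONE = 0, PROT_TYPE1 = 1, PROT_TYPE3 = 3 }; /* placeholder values */

#define FLAG_REF_TAG (1u << 3)

/* Returns 0 and stores the exported protection type, or -ENOSYS for IP checksums. */
static int iblock_prot_type_sketch(enum csum_sketch csum, unsigned int flags,
				   enum prot_sketch *out)
{
	assert(out != NULL);
	*out = PROT_NONE; /* assumed starting value */

	switch (csum) {
	case CSUM_IP:
		return -ENOSYS; /* IP-checksum profiles are not exported */
	case CSUM_CRC:
		*out = (flags & FLAG_REF_TAG) ? PROT_TYPE1 : PROT_TYPE3;
		break;
	default:
		break; /* no integrity or unhandled checksum: leave unprotected */
	}
	return 0;
}
```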

+6 -4

drivers/ufs/core/ufshcd.c

··· 5193 5193 } 5194 5194 5195 5195 /** 5196 - * ufshcd_slave_configure - adjust SCSI device configurations 5196 + * ufshcd_device_configure - adjust SCSI device configurations 5197 5197 * @sdev: pointer to SCSI device 5198 + * @lim: queue limits 5198 5199 * 5199 5200 * Return: 0 (success). 5200 5201 */ 5201 - static int ufshcd_slave_configure(struct scsi_device *sdev) 5202 + static int ufshcd_device_configure(struct scsi_device *sdev, 5203 + struct queue_limits *lim) 5202 5204 { 5203 5205 struct ufs_hba *hba = shost_priv(sdev->host); 5204 5206 struct request_queue *q = sdev->request_queue; 5205 5207 5206 - blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1); 5208 + lim->dma_pad_mask = PRDT_DATA_BYTE_COUNT_PAD - 1; 5207 5209 5208 5210 /* 5209 5211 * Block runtime-pm until all consumers are added. ··· 8912 8910 .queuecommand = ufshcd_queuecommand, 8913 8911 .mq_poll = ufshcd_poll, 8914 8912 .slave_alloc = ufshcd_slave_alloc, 8915 - .slave_configure = ufshcd_slave_configure, 8913 + .device_configure = ufshcd_device_configure, 8916 8914 .slave_destroy = ufshcd_slave_destroy, 8917 8915 .change_queue_depth = ufshcd_change_queue_depth, 8918 8916 .eh_abort_handler = ufshcd_abort,

+4 -4

fs/aio.c

···
1516 1516 	iocb_put(iocb);
1517 1517 }
1518 1518
1519 - static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
1519 + static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
1520 1520 {
1521 1521 	int ret;
1522 1522
···
1542 1542 	} else
1543 1543 		req->ki_ioprio = get_current_ioprio();
1544 1544
1545 - 	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
1545 + 	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags, rw_type);
1546 1546 	if (unlikely(ret))
1547 1547 		return ret;
1548 1548
···
1594 1594 	struct file *file;
1595 1595 	int ret;
1596 1596
1597 - 	ret = aio_prep_rw(req, iocb);
1597 + 	ret = aio_prep_rw(req, iocb, READ);
1598 1598 	if (ret)
1599 1599 		return ret;
1600 1600 	file = req->ki_filp;
···
1621 1621 	struct file *file;
1622 1622 	int ret;
1623 1623
1624 - 	ret = aio_prep_rw(req, iocb);
1624 + 	ret = aio_prep_rw(req, iocb, WRITE);
1625 1625 	if (ret)
1626 1626 		return ret;
1627 1627 	file = req->ki_filp;

+1 -1

fs/btrfs/ioctl.c

···
4627 4627 		goto out_iov;
4628 4628
4629 4629 	init_sync_kiocb(&kiocb, file);
4630 - 	ret = kiocb_set_rw_flags(&kiocb, 0);
4630 + 	ret = kiocb_set_rw_flags(&kiocb, 0, WRITE);
4631 4631 	if (ret)
4632 4632 		goto out_iov;
4633 4633 	kiocb.ki_pos = pos;

+17 -1

fs/read_write.c

···
730 730 	ssize_t ret;
731 731
732 732 	init_sync_kiocb(&kiocb, filp);
733 - 	ret = kiocb_set_rw_flags(&kiocb, flags);
733 + 	ret = kiocb_set_rw_flags(&kiocb, flags, type);
734 734 	if (ret)
735 735 		return ret;
736 736 	kiocb.ki_pos = (ppos ? *ppos : 0);
···
1735 1735 		return -EBADF;
1736 1736
1737 1737 	return 0;
1738 + }
1739 +
1740 + bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos)
1741 + {
1742 + 	size_t len = iov_iter_count(iter);
1743 +
1744 + 	if (!iter_is_ubuf(iter))
1745 + 		return false;
1746 +
1747 + 	if (!is_power_of_2(len))
1748 + 		return false;
1749 +
1750 + 	if (!IS_ALIGNED(pos, len))
1751 + 		return false;
1752 +
1753 + 	return true;
1738 1754 }
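Note on `generic_atomic_write_valid()` above: the helper accepts an atomic write only if it comes from a single user buffer, its length is a power of two, and the file offset is naturally aligned to that length. The userspace sketch below restates the three checks. `atomic_write_valid_sketch` and its `single_buffer` argument are illustrative stand-ins for the `iov_iter` based kernel code, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool is_power_of_2_sz(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * pos:           file offset of the write (non-negative).
 * len:           total length of the write in bytes.
 * single_buffer: stands in for iter_is_ubuf(), i.e. one contiguous user buffer.
 */
static bool atomic_write_valid_sketch(int64_t pos, size_t len, bool single_buffer)
{
	assert(pos >= 0);

	if (!single_buffer)
		return false;
	if (!is_power_of_2_sz(len))
		return false;
	/* Natural alignment: the offset must be a multiple of the length. */
	if (((uint64_t)pos & ((uint64_t)len - 1)) != 0)
		return false;
	return true;
}
```

For example, a 4096-byte write at offset 4096 passes, while the same write at offset 2048, or a 3072-byte write at offset 0, is rejected. Device-specific bounds such as the minimum and maximum atomic write unit are checked elsewhere and are not part of this helper.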

+41 -7

fs/stat.c

··· 90 90 EXPORT_SYMBOL(generic_fill_statx_attr); 91 91 92 92 /** 93 + * generic_fill_statx_atomic_writes - Fill in atomic writes statx attributes 94 + * @stat: Where to fill in the attribute flags 95 + * @unit_min: Minimum supported atomic write length in bytes 96 + * @unit_max: Maximum supported atomic write length in bytes 97 + * 98 + * Fill in the STATX{_ATTR}_WRITE_ATOMIC flags in the kstat structure from 99 + * atomic write unit_min and unit_max values. 100 + */ 101 + void generic_fill_statx_atomic_writes(struct kstat *stat, 102 + unsigned int unit_min, 103 + unsigned int unit_max) 104 + { 105 + /* Confirm that the request type is known */ 106 + stat->result_mask |= STATX_WRITE_ATOMIC; 107 + 108 + /* Confirm that the file attribute type is known */ 109 + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; 110 + 111 + if (unit_min) { 112 + stat->atomic_write_unit_min = unit_min; 113 + stat->atomic_write_unit_max = unit_max; 114 + /* Initially only allow 1x segment */ 115 + stat->atomic_write_segments_max = 1; 116 + 117 + /* Confirm atomic writes are actually supported */ 118 + stat->attributes |= STATX_ATTR_WRITE_ATOMIC; 119 + } 120 + } 121 + EXPORT_SYMBOL_GPL(generic_fill_statx_atomic_writes); 122 + 123 + /** 93 124 * vfs_getattr_nosec - getattr without security checks 94 125 * @path: file to get attributes from 95 126 * @stat: structure to return attributes in ··· 262 231 stat->attributes |= STATX_ATTR_MOUNT_ROOT; 263 232 stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT; 264 233 265 - /* Handle STATX_DIOALIGN for block devices. */ 266 - if (request_mask & STATX_DIOALIGN) { 267 - struct inode *inode = d_backing_inode(path->dentry); 268 - 269 - if (S_ISBLK(inode->i_mode)) 270 - bdev_statx_dioalign(inode, stat); 271 - } 234 + /* 235 + * If this is a block device inode, override the filesystem 236 + * attributes with the block device specific parameters that need to be 237 + * obtained from the bdev backing inode. 
238 + */ 239 + if (S_ISBLK(stat->mode)) 240 + bdev_statx(path, stat, request_mask); 272 241 273 242 return error; 274 243 } ··· 701 670 tmp.stx_dio_mem_align = stat->dio_mem_align; 702 671 tmp.stx_dio_offset_align = stat->dio_offset_align; 703 672 tmp.stx_subvol = stat->subvol; 673 + tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min; 674 + tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max; 675 + tmp.stx_atomic_write_segments_max = stat->atomic_write_segments_max; 704 676 705 677 return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0; 706 678 }
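Note on `generic_fill_statx_atomic_writes()` above: the helper always advertises that the kernel understands the atomic-write request and attribute bits, but it sets the attribute itself, and the unit sizes, only when `unit_min` is non-zero. The sketch below models that distinction with a cut-down structure. The struct, the function name and the two flag values are placeholders chosen for this illustration and do not match the UAPI constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values for illustration only. */
#define SKETCH_STATX_WRITE_ATOMIC      (1u << 0)
#define SKETCH_STATX_ATTR_WRITE_ATOMIC (1u << 1)

struct kstat_sketch {
	uint32_t result_mask;
	uint64_t attributes;
	uint64_t attributes_mask;
	uint32_t atomic_write_unit_min;
	uint32_t atomic_write_unit_max;
	uint32_t atomic_write_segments_max;
};

static void fill_atomic_writes_sketch(struct kstat_sketch *st,
				      unsigned int unit_min,
				      unsigned int unit_max)
{
	assert(st != NULL);

	/* The request type and the attribute bit are known in every case. */
	st->result_mask |= SKETCH_STATX_WRITE_ATOMIC;
	st->attributes_mask |= SKETCH_STATX_ATTR_WRITE_ATOMIC;

	if (unit_min) {
		st->atomic_write_unit_min = unit_min;
		st->atomic_write_unit_max = unit_max;
		st->atomic_write_segments_max = 1; /* single segment for now */
		/* Only here is atomic write support actually reported. */
		st->attributes |= SKETCH_STATX_ATTR_WRITE_ATOMIC;
	}
}
```

A caller can therefore tell "attribute understood but not supported" (bit set in the mask only) apart from "supported" (bit set in both the mask and the attributes).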

+29 -58

include/linux/blk-integrity.h

··· 7 7 struct request; 8 8 9 9 enum blk_integrity_flags { 10 - BLK_INTEGRITY_VERIFY = 1 << 0, 11 - BLK_INTEGRITY_GENERATE = 1 << 1, 10 + BLK_INTEGRITY_NOVERIFY = 1 << 0, 11 + BLK_INTEGRITY_NOGENERATE = 1 << 1, 12 12 BLK_INTEGRITY_DEVICE_CAPABLE = 1 << 2, 13 - BLK_INTEGRITY_IP_CHECKSUM = 1 << 3, 13 + BLK_INTEGRITY_REF_TAG = 1 << 3, 14 + BLK_INTEGRITY_STACKED = 1 << 4, 14 15 }; 15 16 16 - struct blk_integrity_iter { 17 - void *prot_buf; 18 - void *data_buf; 19 - sector_t seed; 20 - unsigned int data_size; 21 - unsigned short interval; 22 - unsigned char tuple_size; 23 - unsigned char pi_offset; 24 - const char *disk_name; 25 - }; 26 - 27 - typedef blk_status_t (integrity_processing_fn) (struct blk_integrity_iter *); 28 - typedef void (integrity_prepare_fn) (struct request *); 29 - typedef void (integrity_complete_fn) (struct request *, unsigned int); 30 - 31 - struct blk_integrity_profile { 32 - integrity_processing_fn *generate_fn; 33 - integrity_processing_fn *verify_fn; 34 - integrity_prepare_fn *prepare_fn; 35 - integrity_complete_fn *complete_fn; 36 - const char *name; 37 - }; 17 + const char *blk_integrity_profile_name(struct blk_integrity *bi); 18 + bool queue_limits_stack_integrity(struct queue_limits *t, 19 + struct queue_limits *b); 20 + static inline bool queue_limits_stack_integrity_bdev(struct queue_limits *t, 21 + struct block_device *bdev) 22 + { 23 + return queue_limits_stack_integrity(t, &bdev->bd_disk->queue->limits); 24 + } 38 25 39 26 #ifdef CONFIG_BLK_DEV_INTEGRITY 40 - void blk_integrity_register(struct gendisk *, struct blk_integrity *); 41 - void blk_integrity_unregister(struct gendisk *); 42 - int blk_integrity_compare(struct gendisk *, struct gendisk *); 43 27 int blk_rq_map_integrity_sg(struct request_queue *, struct bio *, 44 28 struct scatterlist *); 45 29 int blk_rq_count_integrity_sg(struct request_queue *, struct bio *); 46 30 31 + static inline bool 32 + blk_integrity_queue_supports_integrity(struct request_queue *q) 33 + { 34 + 
return q->limits.integrity.tuple_size; 35 + } 36 + 47 37 static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk) 48 38 { 49 - struct blk_integrity *bi = &disk->queue->integrity; 50 - 51 - if (!bi->profile) 39 + if (!blk_integrity_queue_supports_integrity(disk->queue)) 52 40 return NULL; 53 - 54 - return bi; 41 + return &disk->queue->limits.integrity; 55 42 } 56 43 57 44 static inline struct blk_integrity * 58 45 bdev_get_integrity(struct block_device *bdev) 59 46 { 60 47 return blk_get_integrity(bdev->bd_disk); 61 - } 62 - 63 - static inline bool 64 - blk_integrity_queue_supports_integrity(struct request_queue *q) 65 - { 66 - return q->integrity.profile; 67 48 } 68 49 69 50 static inline unsigned short ··· 81 100 } 82 101 83 102 /* 84 - * Return the first bvec that contains integrity data. Only drivers that are 85 - * limited to a single integrity segment should use this helper. 103 + * Return the current bvec that contains the integrity data. bip_iter may be 104 + * advanced to iterate over the integrity data. 
86 105 */ 87 - static inline struct bio_vec *rq_integrity_vec(struct request *rq) 106 + static inline struct bio_vec rq_integrity_vec(struct request *rq) 88 107 { 89 - if (WARN_ON_ONCE(queue_max_integrity_segments(rq->q) > 1)) 90 - return NULL; 91 - return rq->bio->bi_integrity->bip_vec; 108 + return mp_bvec_iter_bvec(rq->bio->bi_integrity->bip_vec, 109 + rq->bio->bi_integrity->bip_iter); 92 110 } 93 111 #else /* CONFIG_BLK_DEV_INTEGRITY */ 94 112 static inline int blk_rq_count_integrity_sg(struct request_queue *q, ··· 114 134 { 115 135 return false; 116 136 } 117 - static inline int blk_integrity_compare(struct gendisk *a, struct gendisk *b) 118 - { 119 - return 0; 120 - } 121 - static inline void blk_integrity_register(struct gendisk *d, 122 - struct blk_integrity *b) 123 - { 124 - } 125 - static inline void blk_integrity_unregister(struct gendisk *d) 126 - { 127 - } 128 137 static inline unsigned short 129 138 queue_max_integrity_segments(const struct request_queue *q) 130 139 { ··· 136 167 return 0; 137 168 } 138 169 139 - static inline struct bio_vec *rq_integrity_vec(struct request *rq) 170 + static inline struct bio_vec rq_integrity_vec(struct request *rq) 140 171 { 141 - return NULL; 172 + /* the optimizer will remove all calls to this function */ 173 + return (struct bio_vec){ }; 142 174 } 143 175 #endif /* CONFIG_BLK_DEV_INTEGRITY */ 176 + 144 177 #endif /* _LINUX_BLK_INTEGRITY_H */

+7 -1

include/linux/blk_types.h

··· 162 162 */ 163 163 #define BLK_STS_DURATION_LIMIT ((__force blk_status_t)17) 164 164 165 + /* 166 + * Invalid size or alignment. 167 + */ 168 + #define BLK_STS_INVAL ((__force blk_status_t)19) 169 + 165 170 /** 166 171 * blk_path_error - returns true if error may be path related 167 172 * @error: status the request was completed with ··· 375 370 __REQ_SWAP, /* swap I/O */ 376 371 __REQ_DRV, /* for driver use */ 377 372 __REQ_FS_PRIVATE, /* for file system (submitter) use */ 378 - 373 + __REQ_ATOMIC, /* for atomic write operations */ 379 374 /* 380 375 * Command specific flags, keep last: 381 376 */ ··· 407 402 #define REQ_SWAP (__force blk_opf_t)(1ULL << __REQ_SWAP) 408 403 #define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV) 409 404 #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE) 405 + #define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC) 410 406 411 407 #define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP) 412 408

+212 -136

include/linux/blkdev.h

··· 105 105 struct disk_events; 106 106 struct badblocks; 107 107 108 + enum blk_integrity_checksum { 109 + BLK_INTEGRITY_CSUM_NONE = 0, 110 + BLK_INTEGRITY_CSUM_IP = 1, 111 + BLK_INTEGRITY_CSUM_CRC = 2, 112 + BLK_INTEGRITY_CSUM_CRC64 = 3, 113 + } __packed ; 114 + 108 115 struct blk_integrity { 109 - const struct blk_integrity_profile *profile; 110 116 unsigned char flags; 117 + enum blk_integrity_checksum csum_type; 111 118 unsigned char tuple_size; 112 119 unsigned char pi_offset; 113 120 unsigned char interval_exp; ··· 268 261 return MKDEV(disk->major, disk->first_minor); 269 262 } 270 263 264 + /* blk_validate_limits() validates bsize, so drivers don't usually need to */ 271 265 static inline int blk_validate_block_size(unsigned long bsize) 272 266 { 273 267 if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize)) ··· 283 275 return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT; 284 276 } 285 277 278 + /* flags set by the driver in queue_limits.features */ 279 + typedef unsigned int __bitwise blk_features_t; 280 + 281 + /* supports a volatile write cache */ 282 + #define BLK_FEAT_WRITE_CACHE ((__force blk_features_t)(1u << 0)) 283 + 284 + /* supports passing on the FUA bit */ 285 + #define BLK_FEAT_FUA ((__force blk_features_t)(1u << 1)) 286 + 287 + /* rotational device (hard drive or floppy) */ 288 + #define BLK_FEAT_ROTATIONAL ((__force blk_features_t)(1u << 2)) 289 + 290 + /* contributes to the random number pool */ 291 + #define BLK_FEAT_ADD_RANDOM ((__force blk_features_t)(1u << 3)) 292 + 293 + /* do disk/partitions IO accounting */ 294 + #define BLK_FEAT_IO_STAT ((__force blk_features_t)(1u << 4)) 295 + 296 + /* don't modify data until writeback is done */ 297 + #define BLK_FEAT_STABLE_WRITES ((__force blk_features_t)(1u << 5)) 298 + 299 + /* always completes in submit context */ 300 + #define BLK_FEAT_SYNCHRONOUS ((__force blk_features_t)(1u << 6)) 301 + 302 + /* supports REQ_NOWAIT */ 303 + #define BLK_FEAT_NOWAIT ((__force blk_features_t)(1u << 7)) 
304 + 305 + /* supports DAX */ 306 + #define BLK_FEAT_DAX ((__force blk_features_t)(1u << 8)) 307 + 308 + /* supports I/O polling */ 309 + #define BLK_FEAT_POLL ((__force blk_features_t)(1u << 9)) 310 + 311 + /* is a zoned device */ 312 + #define BLK_FEAT_ZONED ((__force blk_features_t)(1u << 10)) 313 + 314 + /* supports PCI(e) p2p requests */ 315 + #define BLK_FEAT_PCI_P2PDMA ((__force blk_features_t)(1u << 12)) 316 + 317 + /* skip this queue in blk_mq_(un)quiesce_tagset */ 318 + #define BLK_FEAT_SKIP_TAGSET_QUIESCE ((__force blk_features_t)(1u << 13)) 319 + 320 + /* bounce all highmem pages */ 321 + #define BLK_FEAT_BOUNCE_HIGH ((__force blk_features_t)(1u << 14)) 322 + 323 + /* undocumented magic for bcache */ 324 + #define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \ 325 + ((__force blk_features_t)(1u << 15)) 326 + 286 327 /* 287 - * BLK_BOUNCE_NONE: never bounce (default) 288 - * BLK_BOUNCE_HIGH: bounce all highmem pages 328 + * Flags automatically inherited when stacking limits. 289 329 */ 290 - enum blk_bounce { 291 - BLK_BOUNCE_NONE, 292 - BLK_BOUNCE_HIGH, 293 - }; 330 + #define BLK_FEAT_INHERIT_MASK \ 331 + (BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA | BLK_FEAT_ROTATIONAL | \ 332 + BLK_FEAT_STABLE_WRITES | BLK_FEAT_ZONED | BLK_FEAT_BOUNCE_HIGH | \ 333 + BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE) 334 + 335 + /* internal flags in queue_limits.flags */ 336 + typedef unsigned int __bitwise blk_flags_t; 337 + 338 + /* do not send FLUSH/FUA commands despite advertising a write cache */ 339 + #define BLK_FLAG_WRITE_CACHE_DISABLED ((__force blk_flags_t)(1u << 0)) 340 + 341 + /* I/O topology is misaligned */ 342 + #define BLK_FLAG_MISALIGNED ((__force blk_flags_t)(1u << 1)) 294 343 295 344 struct queue_limits { 296 - enum blk_bounce bounce; 345 + blk_features_t features; 346 + blk_flags_t flags; 297 347 unsigned long seg_boundary_mask; 298 348 unsigned long virt_boundary_mask; 299 349 ··· 376 310 unsigned int discard_alignment; 377 311 unsigned int zone_write_granularity; 378 
312 313 + /* atomic write limits */ 314 + unsigned int atomic_write_hw_max; 315 + unsigned int atomic_write_max_sectors; 316 + unsigned int atomic_write_hw_boundary; 317 + unsigned int atomic_write_boundary_sectors; 318 + unsigned int atomic_write_hw_unit_min; 319 + unsigned int atomic_write_unit_min; 320 + unsigned int atomic_write_hw_unit_max; 321 + unsigned int atomic_write_unit_max; 322 + 379 323 unsigned short max_segments; 380 324 unsigned short max_integrity_segments; 381 325 unsigned short max_discard_segments; 382 326 383 - unsigned char misaligned; 384 - unsigned char discard_misaligned; 385 - unsigned char raid_partial_stripes_expensive; 386 - bool zoned; 387 327 unsigned int max_open_zones; 388 328 unsigned int max_active_zones; 389 329 ··· 399 327 * due to possible offsets. 400 328 */ 401 329 unsigned int dma_alignment; 330 + unsigned int dma_pad_mask; 331 + 332 + struct blk_integrity integrity; 402 333 }; 403 334 404 335 typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx, 405 336 void *data); 406 - 407 - void disk_set_zoned(struct gendisk *disk); 408 337 409 338 #define BLK_ALL_ZONES ((unsigned int)-1) 410 339 int blkdev_report_zones(struct block_device *bdev, sector_t sector, ··· 487 414 488 415 struct queue_limits limits; 489 416 490 - #ifdef CONFIG_BLK_DEV_INTEGRITY 491 - struct blk_integrity integrity; 492 - #endif /* CONFIG_BLK_DEV_INTEGRITY */ 493 - 494 417 #ifdef CONFIG_PM 495 418 struct device *dev; 496 419 enum rpm_status rpm_status; ··· 507 438 * ioctx. 
508 439 */ 509 440 int id; 510 - 511 - unsigned int dma_pad_mask; 512 441 513 442 /* 514 443 * queue settings ··· 593 526 #define QUEUE_FLAG_NOMERGES 3 /* disable merge attempts */ 594 527 #define QUEUE_FLAG_SAME_COMP 4 /* complete on same CPU-group */ 595 528 #define QUEUE_FLAG_FAIL_IO 5 /* fake timeout */ 596 - #define QUEUE_FLAG_NONROT 6 /* non-rotational device (SSD) */ 597 - #define QUEUE_FLAG_VIRT QUEUE_FLAG_NONROT /* paravirt device */ 598 - #define QUEUE_FLAG_IO_STAT 7 /* do disk/partitions IO accounting */ 599 529 #define QUEUE_FLAG_NOXMERGES 9 /* No extended merges */ 600 - #define QUEUE_FLAG_ADD_RANDOM 10 /* Contributes to random pool */ 601 - #define QUEUE_FLAG_SYNCHRONOUS 11 /* always completes in submit context */ 602 530 #define QUEUE_FLAG_SAME_FORCE 12 /* force complete on same CPU */ 603 - #define QUEUE_FLAG_HW_WC 13 /* Write back caching supported */ 604 531 #define QUEUE_FLAG_INIT_DONE 14 /* queue is initialized */ 605 - #define QUEUE_FLAG_STABLE_WRITES 15 /* don't modify blks until WB is done */ 606 - #define QUEUE_FLAG_POLL 16 /* IO polling enabled if set */ 607 - #define QUEUE_FLAG_WC 17 /* Write back caching */ 608 - #define QUEUE_FLAG_FUA 18 /* device supports FUA writes */ 609 - #define QUEUE_FLAG_DAX 19 /* device supports DAX */ 610 532 #define QUEUE_FLAG_STATS 20 /* track IO start and completion times */ 611 533 #define QUEUE_FLAG_REGISTERED 22 /* queue has been registered to a disk */ 612 534 #define QUEUE_FLAG_QUIESCED 24 /* queue has been quiesced */ 613 - #define QUEUE_FLAG_PCI_P2PDMA 25 /* device supports PCI p2p requests */ 614 - #define QUEUE_FLAG_ZONE_RESETALL 26 /* supports Zone Reset All */ 615 535 #define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */ 616 536 #define QUEUE_FLAG_HCTX_ACTIVE 28 /* at least one blk-mq hctx is active */ 617 - #define QUEUE_FLAG_NOWAIT 29 /* device supports NOWAIT */ 618 537 #define QUEUE_FLAG_SQ_SCHED 30 /* single queue style io dispatch */ 619 - #define QUEUE_FLAG_SKIP_TAGSET_QUIESCE 
31 /* quiesce_tagset skip the queue*/ 620 538 621 - #define QUEUE_FLAG_MQ_DEFAULT ((1UL << QUEUE_FLAG_IO_STAT) | \ 622 - (1UL << QUEUE_FLAG_SAME_COMP) | \ 623 - (1UL << QUEUE_FLAG_NOWAIT)) 539 + #define QUEUE_FLAG_MQ_DEFAULT (1UL << QUEUE_FLAG_SAME_COMP) 624 540 625 541 void blk_queue_flag_set(unsigned int flag, struct request_queue *q); 626 542 void blk_queue_flag_clear(unsigned int flag, struct request_queue *q); 627 - bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q); 628 543 629 544 #define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags) 630 545 #define blk_queue_dying(q) test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags) ··· 614 565 #define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags) 615 566 #define blk_queue_noxmerges(q) \ 616 567 test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags) 617 - #define blk_queue_nonrot(q) test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags) 618 - #define blk_queue_stable_writes(q) \ 619 - test_bit(QUEUE_FLAG_STABLE_WRITES, &(q)->queue_flags) 620 - #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags) 621 - #define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags) 622 - #define blk_queue_zone_resetall(q) \ 623 - test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags) 624 - #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags) 625 - #define blk_queue_pci_p2pdma(q) \ 626 - test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags) 568 + #define blk_queue_nonrot(q) (!((q)->limits.features & BLK_FEAT_ROTATIONAL)) 569 + #define blk_queue_io_stat(q) ((q)->limits.features & BLK_FEAT_IO_STAT) 570 + #define blk_queue_dax(q) ((q)->limits.features & BLK_FEAT_DAX) 571 + #define blk_queue_pci_p2pdma(q) ((q)->limits.features & BLK_FEAT_PCI_P2PDMA) 627 572 #ifdef CONFIG_BLK_RQ_ALLOC_TIME 628 573 #define blk_queue_rq_alloc_time(q) \ 629 574 test_bit(QUEUE_FLAG_RQ_ALLOC_TIME, &(q)->queue_flags) ··· 633 590 #define blk_queue_registered(q) 
test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags) 634 591 #define blk_queue_sq_sched(q) test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags) 635 592 #define blk_queue_skip_tagset_quiesce(q) \ 636 - test_bit(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, &(q)->queue_flags) 593 + ((q)->limits.features & BLK_FEAT_SKIP_TAGSET_QUIESCE) 637 594 638 595 extern void blk_set_pm_only(struct request_queue *q); 639 596 extern void blk_clear_pm_only(struct request_queue *q); ··· 663 620 664 621 static inline bool blk_queue_is_zoned(struct request_queue *q) 665 622 { 666 - return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && q->limits.zoned; 623 + return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && 624 + (q->limits.features & BLK_FEAT_ZONED); 667 625 } 668 626 669 627 #ifdef CONFIG_BLK_DEV_ZONED 670 - unsigned int bdev_nr_zones(struct block_device *bdev); 671 - 672 628 static inline unsigned int disk_nr_zones(struct gendisk *disk) 673 629 { 674 - return blk_queue_is_zoned(disk->queue) ? disk->nr_zones : 0; 630 + return disk->nr_zones; 675 631 } 632 + bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs); 633 + #else /* CONFIG_BLK_DEV_ZONED */ 634 + static inline unsigned int disk_nr_zones(struct gendisk *disk) 635 + { 636 + return 0; 637 + } 638 + static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs) 639 + { 640 + return false; 641 + } 642 + #endif /* CONFIG_BLK_DEV_ZONED */ 676 643 677 644 static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector) 678 645 { ··· 691 638 return sector >> ilog2(disk->queue->limits.chunk_sectors); 692 639 } 693 640 694 - static inline void disk_set_max_open_zones(struct gendisk *disk, 695 - unsigned int max_open_zones) 641 + static inline unsigned int bdev_nr_zones(struct block_device *bdev) 696 642 { 697 - disk->queue->limits.max_open_zones = max_open_zones; 698 - } 699 - 700 - static inline void disk_set_max_active_zones(struct gendisk *disk, 701 - unsigned int max_active_zones) 702 - { 703 - disk->queue->limits.max_active_zones = 
max_active_zones; 643 + return disk_nr_zones(bdev->bd_disk); 704 644 } 705 645 706 646 static inline unsigned int bdev_max_open_zones(struct block_device *bdev) ··· 705 659 { 706 660 return bdev->bd_disk->queue->limits.max_active_zones; 707 661 } 708 - 709 - bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs); 710 - #else /* CONFIG_BLK_DEV_ZONED */ 711 - static inline unsigned int bdev_nr_zones(struct block_device *bdev) 712 - { 713 - return 0; 714 - } 715 - 716 - static inline unsigned int disk_nr_zones(struct gendisk *disk) 717 - { 718 - return 0; 719 - } 720 - static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector) 721 - { 722 - return 0; 723 - } 724 - static inline unsigned int bdev_max_open_zones(struct block_device *bdev) 725 - { 726 - return 0; 727 - } 728 - 729 - static inline unsigned int bdev_max_active_zones(struct block_device *bdev) 730 - { 731 - return 0; 732 - } 733 - static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs) 734 - { 735 - return false; 736 - } 737 - #endif /* CONFIG_BLK_DEV_ZONED */ 738 662 739 663 static inline unsigned int blk_queue_depth(struct request_queue *q) 740 664 { ··· 896 880 } 897 881 898 882 /* 899 - * Return how much of the chunk is left to be used for I/O at a given offset. 883 + * Return how much within the boundary is left to be used for I/O at a given 884 + * offset. 
900 885 */ 901 - static inline unsigned int blk_chunk_sectors_left(sector_t offset, 902 - unsigned int chunk_sectors) 886 + static inline unsigned int blk_boundary_sectors_left(sector_t offset, 887 + unsigned int boundary_sectors) 903 888 { 904 - if (unlikely(!is_power_of_2(chunk_sectors))) 905 - return chunk_sectors - sector_div(offset, chunk_sectors); 906 - return chunk_sectors - (offset & (chunk_sectors - 1)); 889 + if (unlikely(!is_power_of_2(boundary_sectors))) 890 + return boundary_sectors - sector_div(offset, boundary_sectors); 891 + return boundary_sectors - (offset & (boundary_sectors - 1)); 907 892 } 908 893 909 894 /** ··· 921 904 */ 922 905 static inline struct queue_limits 923 906 queue_limits_start_update(struct request_queue *q) 924 - __acquires(q->limits_lock) 925 907 { 926 908 mutex_lock(&q->limits_lock); 927 909 return q->limits; ··· 943 927 } 944 928 945 929 /* 930 + * These helpers are for drivers that have sloppy feature negotiation and might 931 + * have to disable DISCARD, WRITE_ZEROES or SECURE_DISCARD from the I/O 932 + * completion handler when the device returned an indicator that the respective 933 + * feature is not actually supported. They are racy and the driver needs to 934 + * cope with that. Try to avoid this scheme if you can. 
935 + */ 936 + static inline void blk_queue_disable_discard(struct request_queue *q) 937 + { 938 + q->limits.max_discard_sectors = 0; 939 + } 940 + 941 + static inline void blk_queue_disable_secure_erase(struct request_queue *q) 942 + { 943 + q->limits.max_secure_erase_sectors = 0; 944 + } 945 + 946 + static inline void blk_queue_disable_write_zeroes(struct request_queue *q) 947 + { 948 + q->limits.max_write_zeroes_sectors = 0; 949 + } 950 + 951 + /* 946 952 * Access functions for manipulating queue properties 947 953 */ 948 - extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int); 949 - void blk_queue_max_secure_erase_sectors(struct request_queue *q, 950 - unsigned int max_sectors); 951 - extern void blk_queue_max_discard_sectors(struct request_queue *q, 952 - unsigned int max_discard_sectors); 953 - extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q, 954 - unsigned int max_write_same_sectors); 955 - extern void blk_queue_logical_block_size(struct request_queue *, unsigned int); 956 - extern void blk_queue_max_zone_append_sectors(struct request_queue *q, 957 - unsigned int max_zone_append_sectors); 958 - extern void blk_queue_physical_block_size(struct request_queue *, unsigned int); 959 - void blk_queue_zone_write_granularity(struct request_queue *q, 960 - unsigned int size); 961 - extern void blk_queue_alignment_offset(struct request_queue *q, 962 - unsigned int alignment); 963 - void disk_update_readahead(struct gendisk *disk); 964 954 extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min); 965 - extern void blk_queue_io_min(struct request_queue *q, unsigned int min); 966 955 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt); 967 956 extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth); 968 957 extern void blk_set_stacking_limits(struct queue_limits *lim); ··· 975 954 sector_t offset); 976 955 void queue_limits_stack_bdev(struct queue_limits *t, 
struct block_device *bdev, 977 956 sector_t offset, const char *pfx); 978 - extern void blk_queue_update_dma_pad(struct request_queue *, unsigned int); 979 957 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); 980 - extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua); 981 958 982 959 struct blk_independent_access_ranges * 983 960 disk_alloc_independent_access_ranges(struct gendisk *disk, int nr_ia_ranges); ··· 1096 1077 1097 1078 #define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */ 1098 1079 #define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */ 1080 + #define BLKDEV_ZERO_KILLABLE (1 << 2) /* interruptible by fatal signals */ 1099 1081 1100 1082 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, 1101 1083 sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, ··· 1224 1204 1225 1205 static inline unsigned queue_logical_block_size(const struct request_queue *q) 1226 1206 { 1227 - int retval = 512; 1228 - 1229 - if (q && q->limits.logical_block_size) 1230 - retval = q->limits.logical_block_size; 1231 - 1232 - return retval; 1207 + return q->limits.logical_block_size; 1233 1208 } 1234 1209 1235 1210 static inline unsigned int bdev_logical_block_size(struct block_device *bdev) ··· 1310 1295 1311 1296 static inline bool bdev_synchronous(struct block_device *bdev) 1312 1297 { 1313 - return test_bit(QUEUE_FLAG_SYNCHRONOUS, 1314 - &bdev_get_queue(bdev)->queue_flags); 1298 + return bdev->bd_disk->queue->limits.features & BLK_FEAT_SYNCHRONOUS; 1315 1299 } 1316 1300 1317 1301 static inline bool bdev_stable_writes(struct block_device *bdev) 1318 1302 { 1319 - return test_bit(QUEUE_FLAG_STABLE_WRITES, 1320 - &bdev_get_queue(bdev)->queue_flags); 1303 + struct request_queue *q = bdev_get_queue(bdev); 1304 + 1305 + if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && 1306 + q->limits.integrity.csum_type != BLK_INTEGRITY_CSUM_NONE) 1307 + return true; 1308 + return q->limits.features & 
BLK_FEAT_STABLE_WRITES; 1309 + } 1310 + 1311 + static inline bool blk_queue_write_cache(struct request_queue *q) 1312 + { 1313 + return (q->limits.features & BLK_FEAT_WRITE_CACHE) && 1314 + !(q->limits.flags & BLK_FLAG_WRITE_CACHE_DISABLED); 1321 1315 } 1322 1316 1323 1317 static inline bool bdev_write_cache(struct block_device *bdev) 1324 1318 { 1325 - return test_bit(QUEUE_FLAG_WC, &bdev_get_queue(bdev)->queue_flags); 1319 + return blk_queue_write_cache(bdev_get_queue(bdev)); 1326 1320 } 1327 1321 1328 1322 static inline bool bdev_fua(struct block_device *bdev) 1329 1323 { 1330 - return test_bit(QUEUE_FLAG_FUA, &bdev_get_queue(bdev)->queue_flags); 1324 + return bdev_get_queue(bdev)->limits.features & BLK_FEAT_FUA; 1331 1325 } 1332 1326 1333 1327 static inline bool bdev_nowait(struct block_device *bdev) 1334 1328 { 1335 - return test_bit(QUEUE_FLAG_NOWAIT, &bdev_get_queue(bdev)->queue_flags); 1329 + return bdev->bd_disk->queue->limits.features & BLK_FEAT_NOWAIT; 1336 1330 } 1337 1331 1338 1332 static inline bool bdev_is_zoned(struct block_device *bdev) ··· 1383 1359 1384 1360 static inline int queue_dma_alignment(const struct request_queue *q) 1385 1361 { 1386 - return q ? 
q->limits.dma_alignment : 511; 1362 + return q->limits.dma_alignment; 1363 + } 1364 + 1365 + static inline unsigned int 1366 + queue_atomic_write_unit_max_bytes(const struct request_queue *q) 1367 + { 1368 + return q->limits.atomic_write_unit_max; 1369 + } 1370 + 1371 + static inline unsigned int 1372 + queue_atomic_write_unit_min_bytes(const struct request_queue *q) 1373 + { 1374 + return q->limits.atomic_write_unit_min; 1375 + } 1376 + 1377 + static inline unsigned int 1378 + queue_atomic_write_boundary_bytes(const struct request_queue *q) 1379 + { 1380 + return q->limits.atomic_write_boundary_sectors << SECTOR_SHIFT; 1381 + } 1382 + 1383 + static inline unsigned int 1384 + queue_atomic_write_max_bytes(const struct request_queue *q) 1385 + { 1386 + return q->limits.atomic_write_max_sectors << SECTOR_SHIFT; 1387 1387 } 1388 1388 1389 1389 static inline unsigned int bdev_dma_alignment(struct block_device *bdev) ··· 1422 1374 bdev_logical_block_size(bdev) - 1); 1423 1375 } 1424 1376 1377 + static inline int blk_lim_dma_alignment_and_pad(struct queue_limits *lim) 1378 + { 1379 + return lim->dma_alignment | lim->dma_pad_mask; 1380 + } 1381 + 1425 1382 static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr, 1426 1383 unsigned int len) 1427 1384 { 1428 - unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask; 1385 + unsigned int alignment = blk_lim_dma_alignment_and_pad(&q->limits); 1386 + 1429 1387 return !(addr & alignment) && !(len & alignment); 1430 1388 } 1431 1389 ··· 1617 1563 int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend); 1618 1564 int sync_blockdev_nowait(struct block_device *bdev); 1619 1565 void sync_bdevs(bool wait); 1620 - void bdev_statx_dioalign(struct inode *inode, struct kstat *stat); 1566 + void bdev_statx(struct path *, struct kstat *, u32); 1621 1567 void printk_all_partitions(void); 1622 1568 int __init early_lookup_bdev(const char *pathname, dev_t *dev); 1623 1569 #else ··· 1635 1581 
static inline void sync_bdevs(bool wait) 1636 1582 { 1637 1583 } 1638 - static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat) 1584 + static inline void bdev_statx(struct path *path, struct kstat *stat, 1585 + u32 request_mask) 1639 1586 { 1640 1587 } 1641 1588 static inline void printk_all_partitions(void) ··· 1657 1602 bool need_ts; 1658 1603 void (*complete)(struct io_comp_batch *); 1659 1604 }; 1605 + 1606 + static inline bool bdev_can_atomic_write(struct block_device *bdev) 1607 + { 1608 + struct request_queue *bd_queue = bdev->bd_queue; 1609 + struct queue_limits *limits = &bd_queue->limits; 1610 + 1611 + if (!limits->atomic_write_unit_min) 1612 + return false; 1613 + 1614 + if (bdev_is_partition(bdev)) { 1615 + sector_t bd_start_sect = bdev->bd_start_sect; 1616 + unsigned int alignment = 1617 + max(limits->atomic_write_unit_min, 1618 + limits->atomic_write_hw_boundary); 1619 + 1620 + if (!IS_ALIGNED(bd_start_sect, alignment >> SECTOR_SHIFT)) 1621 + return false; 1622 + } 1623 + 1624 + return true; 1625 + } 1660 1626 1661 1627 #define DEFINE_IO_COMP_BATCH(name) struct io_comp_batch name = { } 1662 1628
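
The renamed blk_boundary_sectors_left() helper in the blkdev.h hunk above can be modelled in userspace. The following is a minimal sketch, not the kernel implementation: is_power_of_2() is re-implemented locally, a plain % stands in for sector_div(), and the function name boundary_sectors_left is an illustrative assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's is_power_of_2(). */
static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Sectors left before the next boundary at sector `offset`.
 * Follows the logic of blk_boundary_sectors_left(): a mask is used for
 * power-of-two boundaries, a remainder otherwise.
 */
static unsigned int boundary_sectors_left(uint64_t offset,
					  unsigned int boundary_sectors)
{
	/* Like the kernel helper, this expects a non-zero boundary. */
	assert(boundary_sectors != 0);

	if (!is_power_of_2(boundary_sectors))
		return boundary_sectors -
		       (unsigned int)(offset % boundary_sectors);
	return boundary_sectors -
	       (unsigned int)(offset & (boundary_sectors - 1));
}
```

An offset that already sits on a boundary yields the full boundary size rather than zero, which is what lets callers use the result directly as an upper bound on the next I/O.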

+14

include/linux/bvec.h

···
280 280 	return page_address(bvec->bv_page) + bvec->bv_offset;
281 281 }
282 282
283 + /**
284 +  * bvec_phys - return the physical address for a bvec
285 +  * @bvec: bvec to return the physical address for
286 +  */
287 + static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
288 + {
289 + 	/*
290 + 	 * Note this open codes page_to_phys because page_to_phys is defined in
291 + 	 * <asm/io.h>, which we don't want to pull in here. If it ever moves to
292 + 	 * a sensible place we should start using it.
293 + 	 */
294 + 	return PFN_PHYS(page_to_pfn(bvec->bv_page)) + bvec->bv_offset;
295 + }
296 +
283 297 #endif /* __LINUX_BVEC_H */
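
The arithmetic behind bvec_phys() can be illustrated outside the kernel. This is a sketch under stated assumptions: struct model_bvec and MODEL_PAGE_SHIFT are hypothetical stand-ins, the page is identified directly by its page frame number instead of a struct page, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct bio_vec; the page is named by its PFN. */
struct model_bvec {
	uint64_t pfn;		/* page frame number of the backing page */
	uint32_t bv_offset;	/* byte offset into that page */
};

/* Physical address = (PFN << PAGE_SHIFT) + offset, as PFN_PHYS() computes. */
static uint64_t model_bvec_phys(const struct model_bvec *bvec)
{
	assert(bvec != NULL);
	return (bvec->pfn << MODEL_PAGE_SHIFT) + bvec->bv_offset;
}
```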

+7

include/linux/device-mapper.h

···
358 358 	bool discards_supported:1;
359 359
360 360 	/*
361 + 	 * Automatically set by dm-core if this target supports
362 + 	 * REQ_OP_ZONE_RESET_ALL. Otherwise, this operation will be emulated
363 + 	 * using REQ_OP_ZONE_RESET. Target drivers must not set this manually.
364 + 	 */
365 + 	bool zone_reset_all_supported:1;
366 +
367 + 	/*
361 368 	 * Set if this target requires that discards be split on
362 369 	 * 'max_discard_sectors' boundaries.
363 370 	 */

+18 -2

include/linux/fs.h

··· 125 125 #define FMODE_EXEC ((__force fmode_t)(1 << 5)) 126 126 /* File writes are restricted (block device specific) */ 127 127 #define FMODE_WRITE_RESTRICTED ((__force fmode_t)(1 << 6)) 128 + /* File supports atomic writes */ 129 + #define FMODE_CAN_ATOMIC_WRITE ((__force fmode_t)(1 << 7)) 128 130 129 - /* FMODE_* bits 7 to 8 */ 131 + /* FMODE_* bit 8 */ 130 132 131 133 /* 32bit hashes as llseek() offset (for directories) */ 132 134 #define FMODE_32BITHASH ((__force fmode_t)(1 << 9)) ··· 319 317 #define IOCB_SYNC (__force int) RWF_SYNC 320 318 #define IOCB_NOWAIT (__force int) RWF_NOWAIT 321 319 #define IOCB_APPEND (__force int) RWF_APPEND 320 + #define IOCB_ATOMIC (__force int) RWF_ATOMIC 322 321 323 322 /* non-RWF related bits - start at 16 */ 324 323 #define IOCB_EVENTFD (1 << 16) ··· 354 351 { IOCB_SYNC, "SYNC" }, \ 355 352 { IOCB_NOWAIT, "NOWAIT" }, \ 356 353 { IOCB_APPEND, "APPEND" }, \ 354 + { IOCB_ATOMIC, "ATOMIC"}, \ 357 355 { IOCB_EVENTFD, "EVENTFD"}, \ 358 356 { IOCB_DIRECT, "DIRECT" }, \ 359 357 { IOCB_WRITE, "WRITE" }, \ ··· 3258 3254 extern void kfree_link(void *); 3259 3255 void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *); 3260 3256 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat); 3257 + void generic_fill_statx_atomic_writes(struct kstat *stat, 3258 + unsigned int unit_min, 3259 + unsigned int unit_max); 3261 3260 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int); 3262 3261 extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int); 3263 3262 void __inode_add_bytes(struct inode *inode, loff_t bytes); ··· 3437 3430 return res; 3438 3431 } 3439 3432 3440 - static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) 3433 + static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags, 3434 + int rw_type) 3441 3435 { 3442 3436 int kiocb_flags = 0; 3443 3437 ··· 3456 3448 if (!(ki->ki_filp->f_mode & FMODE_NOWAIT)) 3457 3449 return 
-EOPNOTSUPP; 3458 3450 kiocb_flags |= IOCB_NOIO; 3451 + } 3452 + if (flags & RWF_ATOMIC) { 3453 + if (rw_type != WRITE) 3454 + return -EOPNOTSUPP; 3455 + if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE)) 3456 + return -EOPNOTSUPP; 3459 3457 } 3460 3458 kiocb_flags |= (__force int) (flags & RWF_SUPPORTED); 3461 3459 if (flags & RWF_SYNC) ··· 3656 3642 3657 3643 return !c; 3658 3644 } 3645 + 3646 + bool generic_atomic_write_valid(struct iov_iter *iter, loff_t pos); 3659 3647 3660 3648 #endif /* _LINUX_FS_H */
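
The new RWF_ATOMIC branch in kiocb_set_rw_flags() amounts to two gates. The sketch below models only that branch in userspace. The MODEL_* constants copy the values shown in this diff (RWF_ATOMIC is 0x40, FMODE_CAN_ATOMIC_WRITE is bit 7), READ/WRITE are assumed to be 0 and 1 as in the kernel, and model_check_rwf_atomic is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_READ			0
#define MODEL_WRITE			1
#define MODEL_RWF_ATOMIC		0x00000040u
#define MODEL_FMODE_CAN_ATOMIC_WRITE	(1u << 7)

/*
 * Returns 0 if the flag combination is acceptable, -EOPNOTSUPP otherwise.
 * Only the RWF_ATOMIC checks are modelled; all other flags are ignored.
 */
static int model_check_rwf_atomic(unsigned int rwf_flags, unsigned int f_mode,
				  int rw_type)
{
	assert(rw_type == MODEL_READ || rw_type == MODEL_WRITE);

	if (!(rwf_flags & MODEL_RWF_ATOMIC))
		return 0;
	/* Atomic semantics are only defined for writes. */
	if (rw_type != MODEL_WRITE)
		return -EOPNOTSUPP;
	/* The file must have opted in to atomic writes. */
	if (!(f_mode & MODEL_FMODE_CAN_ATOMIC_WRITE))
		return -EOPNOTSUPP;
	return 0;
}
```

This also shows why the function gained an rw_type parameter: without it, the helper could not tell a read from a write and so could not reject RWF_ATOMIC on reads.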

+4

include/linux/nvme-fc-driver.h

···
920 920  *       further references to hosthandle.
921 921  *       Entrypoint is Mandatory if the lldd calls nvmet_fc_invalidate_host().
922 922  *
923 +  * @host_traddr: called by the transport to retrieve the node name and
924 +  *       port name of the host port address.
925 +  *
923 926  * @max_hw_queues:  indicates the maximum number of hw queues the LLDD
924 927  *       supports for cpu affinitization.
925 928  *       Value is Mandatory. Must be at least 1.
···
978 975 	void (*ls_abort)(struct nvmet_fc_target_port *targetport,
979 976 				void *hosthandle, struct nvmefc_ls_req *lsreq);
980 977 	void (*host_release)(void *hosthandle);
978 + 	int (*host_traddr)(void *hosthandle, u64 *wwnn, u64 *wwpn);
981 979
982 980 	u32	max_hw_queues;
983 981 	u16	max_sgl_segments;

+16 -3

include/linux/nvme.h

··· 25 25 26 26 #define NVME_NSID_ALL 0xffffffff 27 27 28 + /* Special NSSR value, 'NVMe' */ 29 + #define NVME_SUBSYS_RESET 0x4E564D65 30 + 28 31 enum nvme_subsys_type { 29 32 /* Referral to another discovery type target subsystem */ 30 33 NVME_NQN_DISC = 1, ··· 1851 1848 /* 1852 1849 * Generic Command Status: 1853 1850 */ 1851 + NVME_SCT_GENERIC = 0x0, 1854 1852 NVME_SC_SUCCESS = 0x0, 1855 1853 NVME_SC_INVALID_OPCODE = 0x1, 1856 1854 NVME_SC_INVALID_FIELD = 0x2, ··· 1899 1895 /* 1900 1896 * Command Specific Status: 1901 1897 */ 1898 + NVME_SCT_COMMAND_SPECIFIC = 0x100, 1902 1899 NVME_SC_CQ_INVALID = 0x100, 1903 1900 NVME_SC_QID_INVALID = 0x101, 1904 1901 NVME_SC_QUEUE_SIZE = 0x102, ··· 1973 1968 /* 1974 1969 * Media and Data Integrity Errors: 1975 1970 */ 1971 + NVME_SCT_MEDIA_ERROR = 0x200, 1976 1972 NVME_SC_WRITE_FAULT = 0x280, 1977 1973 NVME_SC_READ_ERROR = 0x281, 1978 1974 NVME_SC_GUARD_CHECK = 0x282, ··· 1986 1980 /* 1987 1981 * Path-related Errors: 1988 1982 */ 1983 + NVME_SCT_PATH = 0x300, 1989 1984 NVME_SC_INTERNAL_PATH_ERROR = 0x300, 1990 1985 NVME_SC_ANA_PERSISTENT_LOSS = 0x301, 1991 1986 NVME_SC_ANA_INACCESSIBLE = 0x302, ··· 1995 1988 NVME_SC_HOST_PATH_ERROR = 0x370, 1996 1989 NVME_SC_HOST_ABORTED_CMD = 0x371, 1997 1990 1998 - NVME_SC_CRD = 0x1800, 1999 - NVME_SC_MORE = 0x2000, 2000 - NVME_SC_DNR = 0x4000, 1991 + NVME_SC_MASK = 0x00ff, /* Status Code */ 1992 + NVME_SCT_MASK = 0x0700, /* Status Code Type */ 1993 + NVME_SCT_SC_MASK = NVME_SCT_MASK | NVME_SC_MASK, 1994 + 1995 + NVME_STATUS_CRD = 0x1800, /* Command Retry Delayed */ 1996 + NVME_STATUS_MORE = 0x2000, 1997 + NVME_STATUS_DNR = 0x4000, /* Do Not Retry */ 2001 1998 }; 1999 + 2000 + #define NVME_SCT(status) ((status) >> 8 & 7) 2002 2001 2003 2002 struct nvme_completion { 2004 2003 /*
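
The new masks and the NVME_SCT() macro split a status value into its fields. The sketch below copies the constants from the hunk above into a standalone program. struct decoded_status and decode_nvme_status are hypothetical names introduced for illustration, and the CRD shift of 11 is an assumption derived from the 0x1800 mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
	NVME_SC_MASK		= 0x00ff,	/* Status Code */
	NVME_SCT_MASK		= 0x0700,	/* Status Code Type */
	NVME_SCT_SC_MASK	= NVME_SCT_MASK | NVME_SC_MASK,
	NVME_STATUS_CRD		= 0x1800,	/* Command Retry Delayed */
	NVME_STATUS_MORE	= 0x2000,
	NVME_STATUS_DNR		= 0x4000,	/* Do Not Retry */
};

#define NVME_SCT(status) ((status) >> 8 & 7)

static_assert(NVME_SCT_SC_MASK == 0x07ff, "SCT and SC masks must not overlap");

/* Hypothetical container for the decoded fields. */
struct decoded_status {
	unsigned int sct;	/* status code type, 0..7 */
	unsigned int sc;	/* status code within the type */
	unsigned int crd;	/* command retry delay index, 0..3 */
	bool more;
	bool dnr;
};

static struct decoded_status decode_nvme_status(uint16_t status)
{
	struct decoded_status d = {
		.sct  = NVME_SCT(status),
		.sc   = status & NVME_SC_MASK,
		.crd  = (status & NVME_STATUS_CRD) >> 11,
		.more = (status & NVME_STATUS_MORE) != 0,
		.dnr  = (status & NVME_STATUS_DNR) != 0,
	};
	return d;
}
```

Comparing `status & NVME_SCT_SC_MASK` against an NVME_SC_* constant ignores the CRD, MORE and DNR bits, which appears to be the motivation for renaming those three from NVME_SC_* to NVME_STATUS_*.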

+3

include/linux/stat.h

···
54 54 	u32		dio_offset_align;
55 55 	u64		change_cookie;
56 56 	u64		subvol;
57 + 	u32		atomic_write_unit_min;
58 + 	u32		atomic_write_unit_max;
59 + 	u32		atomic_write_segments_max;
57 60 };
58 61
59 62 /* These definitions are internal to the kernel for now. Mainly used by nfsd. */

+6 -16

include/linux/t10-pi.h

··· 41 41 { 42 42 unsigned int shift = ilog2(queue_logical_block_size(rq->q)); 43 43 44 - #ifdef CONFIG_BLK_DEV_INTEGRITY 45 - if (rq->q->integrity.interval_exp) 46 - shift = rq->q->integrity.interval_exp; 47 - #endif 44 + if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && 45 + rq->q->limits.integrity.interval_exp) 46 + shift = rq->q->limits.integrity.interval_exp; 48 47 return blk_rq_pos(rq) >> (shift - SECTOR_SHIFT) & 0xffffffff; 49 48 } 50 - 51 - extern const struct blk_integrity_profile t10_pi_type1_crc; 52 - extern const struct blk_integrity_profile t10_pi_type1_ip; 53 - extern const struct blk_integrity_profile t10_pi_type3_crc; 54 - extern const struct blk_integrity_profile t10_pi_type3_ip; 55 49 56 50 struct crc64_pi_tuple { 57 51 __be64 guard_tag; ··· 66 72 { 67 73 unsigned int shift = ilog2(queue_logical_block_size(rq->q)); 68 74 69 - #ifdef CONFIG_BLK_DEV_INTEGRITY 70 - if (rq->q->integrity.interval_exp) 71 - shift = rq->q->integrity.interval_exp; 72 - #endif 75 + if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && 76 + rq->q->limits.integrity.interval_exp) 77 + shift = rq->q->limits.integrity.interval_exp; 73 78 return lower_48_bits(blk_rq_pos(rq) >> (shift - SECTOR_SHIFT)); 74 79 } 75 - 76 - extern const struct blk_integrity_profile ext_pi_type1_crc64; 77 - extern const struct blk_integrity_profile ext_pi_type3_crc64; 78 80 79 81 #endif
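
The reference-tag computation in t10_pi_ref_tag() converts a 512-byte sector position into protection-interval units. Here is a standalone sketch of that arithmetic. model_ilog2 and model_t10_pi_ref_tag are hypothetical names, and the request and queue structures are replaced by plain parameters.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Integer log2 for a non-zero value; stands in for the kernel's ilog2(). */
static unsigned int model_ilog2(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Lower 32 bits of the position expressed in protection intervals.
 * interval_exp == 0 means "use the logical block size", mirroring the
 * fallback in t10_pi_ref_tag().
 */
static uint32_t model_t10_pi_ref_tag(uint64_t sector,
				     unsigned int logical_block_size,
				     unsigned int interval_exp)
{
	unsigned int shift = model_ilog2(logical_block_size);

	assert(logical_block_size >= 512);
	if (interval_exp)
		shift = interval_exp;
	return (uint32_t)((sector >> (shift - MODEL_SECTOR_SHIFT)) & 0xffffffff);
}
```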

+1

include/scsi/scsi_proto.h

···
120 120 #define WRITE_SAME_16		0x93
121 121 #define ZBC_OUT			0x94
122 122 #define ZBC_IN			0x95
123 + #define WRITE_ATOMIC_16		0x9c
123 124 #define SERVICE_ACTION_BIDIRECTIONAL	0x9d
124 125 #define SERVICE_ACTION_IN_16	0x9e
125 126 #define SERVICE_ACTION_OUT_16	0x9f

+32 -9

include/trace/events/block.h

··· 9 9 #include <linux/blkdev.h> 10 10 #include <linux/buffer_head.h> 11 11 #include <linux/tracepoint.h> 12 + #include <uapi/linux/ioprio.h> 12 13 13 14 #define RWBS_LEN 8 15 + 16 + #define IOPRIO_CLASS_STRINGS \ 17 + { IOPRIO_CLASS_NONE, "none" }, \ 18 + { IOPRIO_CLASS_RT, "rt" }, \ 19 + { IOPRIO_CLASS_BE, "be" }, \ 20 + { IOPRIO_CLASS_IDLE, "idle" }, \ 21 + { IOPRIO_CLASS_INVALID, "invalid"} 14 22 15 23 #ifdef CONFIG_BUFFER_HEAD 16 24 DECLARE_EVENT_CLASS(block_buffer, ··· 90 82 __field( dev_t, dev ) 91 83 __field( sector_t, sector ) 92 84 __field( unsigned int, nr_sector ) 85 + __field( unsigned short, ioprio ) 93 86 __array( char, rwbs, RWBS_LEN ) 94 87 __dynamic_array( char, cmd, 1 ) 95 88 ), ··· 99 90 __entry->dev = rq->q->disk ? disk_devt(rq->q->disk) : 0; 100 91 __entry->sector = blk_rq_trace_sector(rq); 101 92 __entry->nr_sector = blk_rq_trace_nr_sectors(rq); 93 + __entry->ioprio = rq->ioprio; 102 94 103 95 blk_fill_rwbs(__entry->rwbs, rq->cmd_flags); 104 96 __get_str(cmd)[0] = '\0'; 105 97 ), 106 98 107 - TP_printk("%d,%d %s (%s) %llu + %u [%d]", 99 + TP_printk("%d,%d %s (%s) %llu + %u %s,%u,%u [%d]", 108 100 MAJOR(__entry->dev), MINOR(__entry->dev), 109 101 __entry->rwbs, __get_str(cmd), 110 - (unsigned long long)__entry->sector, 111 - __entry->nr_sector, 0) 102 + (unsigned long long)__entry->sector, __entry->nr_sector, 103 + __print_symbolic(IOPRIO_PRIO_CLASS(__entry->ioprio), 104 + IOPRIO_CLASS_STRINGS), 105 + IOPRIO_PRIO_HINT(__entry->ioprio), 106 + IOPRIO_PRIO_LEVEL(__entry->ioprio), 0) 112 107 ); 113 108 114 109 DECLARE_EVENT_CLASS(block_rq_completion, ··· 126 113 __field( sector_t, sector ) 127 114 __field( unsigned int, nr_sector ) 128 115 __field( int , error ) 116 + __field( unsigned short, ioprio ) 129 117 __array( char, rwbs, RWBS_LEN ) 130 118 __dynamic_array( char, cmd, 1 ) 131 119 ), ··· 136 122 __entry->sector = blk_rq_pos(rq); 137 123 __entry->nr_sector = nr_bytes >> 9; 138 124 __entry->error = blk_status_to_errno(error); 125 + 
__entry->ioprio = rq->ioprio; 139 126 140 127 blk_fill_rwbs(__entry->rwbs, rq->cmd_flags); 141 128 __get_str(cmd)[0] = '\0'; 142 129 ), 143 130 144 - TP_printk("%d,%d %s (%s) %llu + %u [%d]", 131 + TP_printk("%d,%d %s (%s) %llu + %u %s,%u,%u [%d]", 145 132 MAJOR(__entry->dev), MINOR(__entry->dev), 146 133 __entry->rwbs, __get_str(cmd), 147 - (unsigned long long)__entry->sector, 148 - __entry->nr_sector, __entry->error) 134 + (unsigned long long)__entry->sector, __entry->nr_sector, 135 + __print_symbolic(IOPRIO_PRIO_CLASS(__entry->ioprio), 136 + IOPRIO_CLASS_STRINGS), 137 + IOPRIO_PRIO_HINT(__entry->ioprio), 138 + IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->error) 149 139 ); 150 140 151 141 /** ··· 198 180 __field( sector_t, sector ) 199 181 __field( unsigned int, nr_sector ) 200 182 __field( unsigned int, bytes ) 183 + __field( unsigned short, ioprio ) 201 184 __array( char, rwbs, RWBS_LEN ) 202 185 __array( char, comm, TASK_COMM_LEN ) 203 186 __dynamic_array( char, cmd, 1 ) ··· 209 190 __entry->sector = blk_rq_trace_sector(rq); 210 191 __entry->nr_sector = blk_rq_trace_nr_sectors(rq); 211 192 __entry->bytes = blk_rq_bytes(rq); 193 + __entry->ioprio = rq->ioprio; 212 194 213 195 blk_fill_rwbs(__entry->rwbs, rq->cmd_flags); 214 196 __get_str(cmd)[0] = '\0'; 215 197 memcpy(__entry->comm, current->comm, TASK_COMM_LEN); 216 198 ), 217 199 218 - TP_printk("%d,%d %s %u (%s) %llu + %u [%s]", 200 + TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]", 219 201 MAJOR(__entry->dev), MINOR(__entry->dev), 220 202 __entry->rwbs, __entry->bytes, __get_str(cmd), 221 - (unsigned long long)__entry->sector, 222 - __entry->nr_sector, __entry->comm) 203 + (unsigned long long)__entry->sector, __entry->nr_sector, 204 + __print_symbolic(IOPRIO_PRIO_CLASS(__entry->ioprio), 205 + IOPRIO_CLASS_STRINGS), 206 + IOPRIO_PRIO_HINT(__entry->ioprio), 207 + IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm) 223 208 ); 224 209 225 210 /**
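
The extra "%s,%u,%u" in the trace format prints the I/O priority as class, hint, level. The sketch below reproduces that decoding in userspace. The shift and mask values are assumptions taken from the layout in the uapi ioprio header (3 class bits, 10 hint bits, 3 level bits), and the function names are hypothetical. Unlike __print_symbolic(), unknown classes are printed as "unknown".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout of a 16-bit ioprio value: class[15:13] hint[12:3] level[2:0] */
#define MODEL_IOPRIO_CLASS_SHIFT	13
#define MODEL_IOPRIO_CLASS_MASK		0x7u
#define MODEL_IOPRIO_HINT_SHIFT		3
#define MODEL_IOPRIO_HINT_MASK		0x3ffu
#define MODEL_IOPRIO_LEVEL_MASK		0x7u

static const char *model_ioprio_class_name(unsigned int cls)
{
	switch (cls) {
	case 0: return "none";
	case 1: return "rt";
	case 2: return "be";
	case 3: return "idle";
	case 7: return "invalid";
	default: return "unknown";
	}
}

/* Formats ioprio the way the new TP_printk() fields do: "class,hint,level". */
static int model_format_ioprio(char *buf, size_t len, unsigned short ioprio)
{
	unsigned int cls = (ioprio >> MODEL_IOPRIO_CLASS_SHIFT) &
			   MODEL_IOPRIO_CLASS_MASK;
	unsigned int hint = (ioprio >> MODEL_IOPRIO_HINT_SHIFT) &
			    MODEL_IOPRIO_HINT_MASK;
	unsigned int level = ioprio & MODEL_IOPRIO_LEVEL_MASK;

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%s,%u,%u",
			model_ioprio_class_name(cls), hint, level);
}
```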

+1

include/trace/events/scsi.h

···
102 102 		scsi_opcode_name(WRITE_32),		\
103 103 		scsi_opcode_name(WRITE_SAME_32),	\
104 104 		scsi_opcode_name(ATA_16),		\
105 + 		scsi_opcode_name(WRITE_ATOMIC_16),	\
105 106 		scsi_opcode_name(ATA_12))
106 107
107 108 #define scsi_hostbyte_name(result)	{ result, #result }

+4 -1

include/uapi/linux/fs.h

···
329 329 /* per-IO negation of O_APPEND */
330 330 #define RWF_NOAPPEND	((__force __kernel_rwf_t)0x00000020)
331 331
332 + /* Atomic Write */
333 + #define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000040)
334 +
332 335 /* mask of flags supported by the kernel */
333 336 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
334 - 			 RWF_APPEND | RWF_NOAPPEND)
337 + 			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC)
335 338
336 339 /* Pagemap ioctl */
337 340 #define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
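
An application would normally validate a request before passing RWF_ATOMIC to pwritev2(). The sketch below shows one possible pre-flight check. It is not code from this series, and it assumes the commonly documented userspace rules (length is a power of two between the reported unit_min and unit_max, and the offset is naturally aligned to the length). atomic_write_params_ok is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool model_is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Returns true if a write of `len` bytes at `offset` looks eligible for
 * RWF_ATOMIC, given the unit_min/unit_max values reported for the file.
 * A unit_min of 0 is treated as "atomic writes not supported".
 */
static bool atomic_write_params_ok(uint64_t offset, uint64_t len,
				   uint32_t unit_min, uint32_t unit_max)
{
	assert(unit_min <= unit_max);

	if (unit_min == 0)
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	if (!model_is_pow2(len))
		return false;
	/* Natural alignment: the offset must be a multiple of the length. */
	return (offset & (len - 1)) == 0;
}
```

The unit_min and unit_max arguments would come from the stx_atomic_write_unit_min and stx_atomic_write_unit_max fields that this series adds to struct statx. The kernel remains the final authority and may still reject a request that passes this check.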

+9 -1

include/uapi/linux/stat.h

···
128 128 	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
129 129 	/* 0xa0 */
130 130 	__u64	stx_subvol;	/* Subvolume identifier */
131 - 	__u64	__spare3[11];	/* Spare space for future expansion */
131 + 	__u32	stx_atomic_write_unit_min;	/* Min atomic write unit in bytes */
132 + 	__u32	stx_atomic_write_unit_max;	/* Max atomic write unit in bytes */
133 + 	/* 0xb0 */
134 + 	__u32	stx_atomic_write_segments_max;	/* Max atomic write segment count */
135 + 	__u32	__spare1[1];
136 + 	/* 0xb8 */
137 + 	__u64	__spare3[9];	/* Spare space for future expansion */
132 138 	/* 0x100 */
133 139 };
134 140
···
163 157 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
164 158 #define STATX_MNT_ID_UNIQUE	0x00004000U	/* Want/got extended stx_mount_id */
165 159 #define STATX_SUBVOL		0x00008000U	/* Want/got stx_subvol */
160 + #define STATX_WRITE_ATOMIC	0x00010000U	/* Want/got atomic_write_* fields */
166 161
167 162 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
168 163
···
199 192 #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
200 193 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
201 194 #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
195 + #define STATX_ATTR_WRITE_ATOMIC	0x00400000 /* File supports atomic write operations */
202 196
203 197
204 198 #endif /* _UAPI_LINUX_STAT_H */
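
The struct statx change has to keep the structure at exactly 0x100 bytes: two spare __u64 slots (16 bytes) are replaced by three __u32 fields plus one __u32 of padding. The sketch below checks that arithmetic with a model of the tail of the structure. struct model_statx_tail is hypothetical, and everything before stx_subvol is collapsed into an opaque array sized from the 0xa0 offset comment in the hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of struct statx from offset 0xa0 onward; earlier fields are opaque. */
struct model_statx_tail {
	uint64_t head[0xa0 / 8];		/* 0x00: fields not modelled here */
	uint64_t stx_subvol;			/* 0xa0 */
	uint32_t stx_atomic_write_unit_min;	/* 0xa8 */
	uint32_t stx_atomic_write_unit_max;	/* 0xac */
	uint32_t stx_atomic_write_segments_max;	/* 0xb0 */
	uint32_t spare1[1];			/* 0xb4 */
	uint64_t spare3[9];			/* 0xb8 */
};						/* 0x100 */

/* The old tail was stx_subvol followed by 11 spare u64s: 8 + 88 bytes. */
static_assert(sizeof(struct model_statx_tail) == 0x100,
	      "struct size must stay at 0x100 bytes");
static_assert(offsetof(struct model_statx_tail, spare3) == 0xb8,
	      "spare3 must start at the documented 0xb8 offset");
```

The spare1 element keeps the following array of 64-bit spares 8-byte aligned without relying on compiler-inserted padding.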

+4 -5

io_uring/rw.c

··· 772 772 S_ISBLK(file_inode(req->file)->i_mode); 773 773 } 774 774 775 - static int io_rw_init_file(struct io_kiocb *req, fmode_t mode) 775 + static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) 776 776 { 777 777 struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); 778 778 struct kiocb *kiocb = &rw->kiocb; ··· 787 787 req->flags |= io_file_get_flags(file); 788 788 789 789 kiocb->ki_flags = file->f_iocb_flags; 790 - ret = kiocb_set_rw_flags(kiocb, rw->flags); 790 + ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type); 791 791 if (unlikely(ret)) 792 792 return ret; 793 793 kiocb->ki_flags |= IOCB_ALLOC_CACHE; ··· 832 832 if (unlikely(ret < 0)) 833 833 return ret; 834 834 } 835 - 836 - ret = io_rw_init_file(req, FMODE_READ); 835 + ret = io_rw_init_file(req, FMODE_READ, READ); 837 836 if (unlikely(ret)) 838 837 return ret; 839 838 req->cqe.res = iov_iter_count(&io->iter); ··· 1012 1013 ssize_t ret, ret2; 1013 1014 loff_t *ppos; 1014 1015 1015 - ret = io_rw_init_file(req, FMODE_WRITE); 1016 + ret = io_rw_init_file(req, FMODE_WRITE, WRITE); 1016 1017 if (unlikely(ret)) 1017 1018 return ret; 1018 1019 req->cqe.res = iov_iter_count(&io->iter);

+5

rust/bindings/bindings_helper.h

··· 7 7 */ 8 8 9 9 #include <kunit/test.h> 10 + #include <linux/blk_types.h> 11 + #include <linux/blk-mq.h> 12 + #include <linux/blkdev.h> 10 13 #include <linux/errname.h> 11 14 #include <linux/ethtool.h> 12 15 #include <linux/jiffies.h> ··· 23 20 24 21 /* `bindgen` gets confused at certain things. */ 25 22 const size_t RUST_CONST_HELPER_ARCH_SLAB_MINALIGN = ARCH_SLAB_MINALIGN; 23 + const size_t RUST_CONST_HELPER_PAGE_SIZE = PAGE_SIZE; 26 24 const gfp_t RUST_CONST_HELPER_GFP_ATOMIC = GFP_ATOMIC; 27 25 const gfp_t RUST_CONST_HELPER_GFP_KERNEL = GFP_KERNEL; 28 26 const gfp_t RUST_CONST_HELPER_GFP_KERNEL_ACCOUNT = GFP_KERNEL_ACCOUNT; 29 27 const gfp_t RUST_CONST_HELPER_GFP_NOWAIT = GFP_NOWAIT; 30 28 const gfp_t RUST_CONST_HELPER___GFP_ZERO = __GFP_ZERO; 29 + const blk_features_t RUST_CONST_HELPER_BLK_FEAT_ROTATIONAL = BLK_FEAT_ROTATIONAL;

+16

rust/helpers.c

···
186 186 	__alignof__(size_t) == __alignof__(uintptr_t),
187 187 	"Rust code expects C `size_t` to match Rust `usize`"
188 188 );
189 +
190 + // This will soon be moved to a separate file, so no need to merge with above.
191 + #include <linux/blk-mq.h>
192 + #include <linux/blkdev.h>
193 +
194 + void *rust_helper_blk_mq_rq_to_pdu(struct request *rq)
195 + {
196 + 	return blk_mq_rq_to_pdu(rq);
197 + }
198 + EXPORT_SYMBOL_GPL(rust_helper_blk_mq_rq_to_pdu);
199 +
200 + struct request *rust_helper_blk_mq_rq_from_pdu(void *pdu)
201 + {
202 + 	return blk_mq_rq_from_pdu(pdu);
203 + }
204 + EXPORT_SYMBOL_GPL(rust_helper_blk_mq_rq_from_pdu);

+5

rust/kernel/block.rs

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + //! Types for working with the block layer. 4 + 5 + pub mod mq;

+98

rust/kernel/block/mq.rs

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + //! This module provides types for implementing block drivers that interface with the 4 + //! blk-mq subsystem. 5 + //! 6 + //! To implement a block device driver, a Rust module must do the following: 7 + //! 8 + //! - Implement [`Operations`] for a type `T`. 9 + //! - Create a [`TagSet<T>`]. 10 + //! - Create a [`GenDisk<T>`], via the [`GenDiskBuilder`]. 11 + //! - Add the disk to the system by calling [`GenDiskBuilder::build`] passing in 12 + //! the `TagSet` reference. 13 + //! 14 + //! The types available in this module that have direct C counterparts are: 15 + //! 16 + //! - The [`TagSet`] type that abstracts the C type `struct blk_mq_tag_set`. 17 + //! - The [`GenDisk`] type that abstracts the C type `struct gendisk`. 18 + //! - The [`Request`] type that abstracts the C type `struct request`. 19 + //! 20 + //! The kernel will interface with the block device driver by calling the method 21 + //! implementations of the `Operations` trait. 22 + //! 23 + //! IO requests are passed to the driver as [`kernel::types::ARef<Request>`] 24 + //! instances. The `Request` type is a wrapper around the C `struct request`. 25 + //! The driver must mark end of processing by calling one of the 26 + //! `Request::end` methods. Failure to do so can lead to deadlock or timeout 27 + //! errors. Please note that the C function `blk_mq_start_request` is implicitly 28 + //! called when the request is queued with the driver. 29 + //! 30 + //! The `TagSet` is responsible for creating and maintaining a mapping between 31 + //! `Request`s and integer ids as well as carrying a pointer to the vtable 32 + //! generated by `Operations`. This mapping is useful for associating 33 + //! completions from hardware with the correct `Request` instance. The `TagSet` 34 + //! determines the maximum queue depth by setting the number of `Request` 35 + //! instances available to the driver, and it determines the number of queues to 36 + //!
instantiate for the driver. If possible, a driver should allocate one queue 37 + //! per core, to keep queue data local to a core. 38 + //! 39 + //! One `TagSet` instance can be shared between multiple `GenDisk` instances. 40 + //! This can be useful when implementing drivers where one piece of hardware 41 + //! with one set of IO resources are represented to the user as multiple disks. 42 + //! 43 + //! One significant difference between block device drivers implemented with 44 + //! these Rust abstractions and drivers implemented in C, is that the Rust 45 + //! drivers have to own a reference count on the `Request` type when the IO is 46 + //! in flight. This is to ensure that the C `struct request` instances backing 47 + //! the Rust `Request` instances are live while the Rust driver holds a 48 + //! reference to the `Request`. In addition, the conversion of an integer tag to 49 + //! a `Request` via the `TagSet` would not be sound without this bookkeeping. 50 + //! 51 + //! [`GenDisk`]: gen_disk::GenDisk 52 + //! [`GenDisk<T>`]: gen_disk::GenDisk 53 + //! [`GenDiskBuilder`]: gen_disk::GenDiskBuilder 54 + //! [`GenDiskBuilder::build`]: gen_disk::GenDiskBuilder::build 55 + //! 56 + //! # Example 57 + //! 58 + //! ```rust 59 + //! use kernel::{ 60 + //! alloc::flags, 61 + //! block::mq::*, 62 + //! new_mutex, 63 + //! prelude::*, 64 + //! sync::{Arc, Mutex}, 65 + //! types::{ARef, ForeignOwnable}, 66 + //! }; 67 + //! 68 + //! struct MyBlkDevice; 69 + //! 70 + //! #[vtable] 71 + //! impl Operations for MyBlkDevice { 72 + //! 73 + //! fn queue_rq(rq: ARef<Request<Self>>, _is_last: bool) -> Result { 74 + //! Request::end_ok(rq); 75 + //! Ok(()) 76 + //! } 77 + //! 78 + //! fn commit_rqs() {} 79 + //! } 80 + //! 81 + //! let tagset: Arc<TagSet<MyBlkDevice>> = 82 + //! Arc::pin_init(TagSet::new(1, 256, 1), flags::GFP_KERNEL)?; 83 + //! let mut disk = gen_disk::GenDiskBuilder::new() 84 + //! .capacity_sectors(4096) 85 + //! 
.build(format_args!("myblk"), tagset)?; 86 + //! 87 + //! # Ok::<(), kernel::error::Error>(()) 88 + //! ``` 89 + 90 + pub mod gen_disk; 91 + mod operations; 92 + mod raw_writer; 93 + mod request; 94 + mod tag_set; 95 + 96 + pub use operations::Operations; 97 + pub use request::Request; 98 + pub use tag_set::TagSet;

+198

rust/kernel/block/mq/gen_disk.rs

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + //! Generic disk abstraction. 4 + //! 5 + //! C header: [`include/linux/blkdev.h`](srctree/include/linux/blkdev.h) 6 + //! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h) 7 + 8 + use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet}; 9 + use crate::error; 10 + use crate::{bindings, error::from_err_ptr, error::Result, sync::Arc}; 11 + use core::fmt::{self, Write}; 12 + 13 + /// A builder for [`GenDisk`]. 14 + /// 15 + /// Use this struct to configure and add a new [`GenDisk`] to the VFS. 16 + pub struct GenDiskBuilder { 17 + rotational: bool, 18 + logical_block_size: u32, 19 + physical_block_size: u32, 20 + capacity_sectors: u64, 21 + } 22 + 23 + impl Default for GenDiskBuilder { 24 + fn default() -> Self { 25 + Self { 26 + rotational: false, 27 + logical_block_size: bindings::PAGE_SIZE as u32, 28 + physical_block_size: bindings::PAGE_SIZE as u32, 29 + capacity_sectors: 0, 30 + } 31 + } 32 + } 33 + 34 + impl GenDiskBuilder { 35 + /// Create a new instance. 36 + pub fn new() -> Self { 37 + Self::default() 38 + } 39 + 40 + /// Set the rotational media attribute for the device to be built. 41 + pub fn rotational(mut self, rotational: bool) -> Self { 42 + self.rotational = rotational; 43 + self 44 + } 45 + 46 + /// Validate block size by verifying that it is between 512 and `PAGE_SIZE`, 47 + /// and that it is a power of two. 48 + fn validate_block_size(size: u32) -> Result<()> { 49 + if !(512..=bindings::PAGE_SIZE as u32).contains(&size) || !size.is_power_of_two() { 50 + Err(error::code::EINVAL) 51 + } else { 52 + Ok(()) 53 + } 54 + } 55 + 56 + /// Set the logical block size of the device to be built. 57 + /// 58 + /// This method will check that block size is a power of two and between 512 59 + /// and `PAGE_SIZE`. If not, an error is returned and the block size is not set. 60 + /// 61 + /// This is the smallest unit the storage device can address. It is 62 + /// typically 4096 bytes.
63 + pub fn logical_block_size(mut self, block_size: u32) -> Result<Self> { 64 + Self::validate_block_size(block_size)?; 65 + self.logical_block_size = block_size; 66 + Ok(self) 67 + } 68 + 69 + /// Set the physical block size of the device to be built. 70 + /// 71 + /// This method will check that block size is a power of two and between 512 72 + /// and `PAGE_SIZE`. If not, an error is returned and the block size is not set. 73 + /// 74 + /// This is the smallest unit a physical storage device can write 75 + /// atomically. It is usually the same as the logical block size but may be 76 + /// bigger. One example is SATA drives with 4096 byte physical block size 77 + /// that expose a 512 byte logical block size to the operating system. 78 + pub fn physical_block_size(mut self, block_size: u32) -> Result<Self> { 79 + Self::validate_block_size(block_size)?; 80 + self.physical_block_size = block_size; 81 + Ok(self) 82 + } 83 + 84 + /// Set the capacity of the device to be built, in sectors (512 bytes). 85 + pub fn capacity_sectors(mut self, capacity: u64) -> Self { 86 + self.capacity_sectors = capacity; 87 + self 88 + } 89 + 90 + /// Build a new `GenDisk` and add it to the VFS. 91 + pub fn build<T: Operations>( 92 + self, 93 + name: fmt::Arguments<'_>, 94 + tagset: Arc<TagSet<T>>, 95 + ) -> Result<GenDisk<T>> { 96 + let lock_class_key = crate::sync::LockClassKey::new(); 97 + 98 + // SAFETY: `bindings::queue_limits` contains only fields that are valid when zeroed.
99 + let mut lim: bindings::queue_limits = unsafe { core::mem::zeroed() }; 100 + 101 + lim.logical_block_size = self.logical_block_size; 102 + lim.physical_block_size = self.physical_block_size; 103 + if self.rotational { 104 + lim.features = bindings::BLK_FEAT_ROTATIONAL; 105 + } 106 + 107 + // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set 108 + let gendisk = from_err_ptr(unsafe { 109 + bindings::__blk_mq_alloc_disk( 110 + tagset.raw_tag_set(), 111 + &mut lim, 112 + core::ptr::null_mut(), 113 + lock_class_key.as_ptr(), 114 + ) 115 + })?; 116 + 117 + const TABLE: bindings::block_device_operations = bindings::block_device_operations { 118 + submit_bio: None, 119 + open: None, 120 + release: None, 121 + ioctl: None, 122 + compat_ioctl: None, 123 + check_events: None, 124 + unlock_native_capacity: None, 125 + getgeo: None, 126 + set_read_only: None, 127 + swap_slot_free_notify: None, 128 + report_zones: None, 129 + devnode: None, 130 + alternative_gpt_sector: None, 131 + get_unique_id: None, 132 + // TODO: Set to THIS_MODULE. Waiting for const_refs_to_static feature to 133 + // be merged (unstable in rustc 1.78 which is staged for linux 6.10) 134 + // https://github.com/rust-lang/rust/issues/119618 135 + owner: core::ptr::null_mut(), 136 + pr_ops: core::ptr::null_mut(), 137 + free_disk: None, 138 + poll_bio: None, 139 + }; 140 + 141 + // SAFETY: `gendisk` is a valid pointer as we initialized it above 142 + unsafe { (*gendisk).fops = &TABLE }; 143 + 144 + let mut raw_writer = RawWriter::from_array( 145 + // SAFETY: `gendisk` points to a valid and initialized instance. We 146 + // have exclusive access, since the disk is not added to the VFS 147 + // yet. 148 + unsafe { &mut (*gendisk).disk_name }, 149 + )?; 150 + raw_writer.write_fmt(name)?; 151 + raw_writer.write_char('\0')?; 152 + 153 + // SAFETY: `gendisk` points to a valid and initialized instance of 154 + // `struct gendisk`. 
`set_capacity` takes a lock to synchronize this 155 + // operation, so we will not race. 156 + unsafe { bindings::set_capacity(gendisk, self.capacity_sectors) }; 157 + 158 + crate::error::to_result( 159 + // SAFETY: `gendisk` points to a valid and initialized instance of 160 + // `struct gendisk`. 161 + unsafe { 162 + bindings::device_add_disk(core::ptr::null_mut(), gendisk, core::ptr::null_mut()) 163 + }, 164 + )?; 165 + 166 + // INVARIANT: `gendisk` was initialized above. 167 + // INVARIANT: `gendisk` was added to the VFS via `device_add_disk` above. 168 + Ok(GenDisk { 169 + _tagset: tagset, 170 + gendisk, 171 + }) 172 + } 173 + } 174 + 175 + /// A generic block device. 176 + /// 177 + /// # Invariants 178 + /// 179 + /// - `gendisk` must always point to an initialized and valid `struct gendisk`. 180 + /// - `gendisk` was added to the VFS through a call to 181 + /// `bindings::device_add_disk`. 182 + pub struct GenDisk<T: Operations> { 183 + _tagset: Arc<TagSet<T>>, 184 + gendisk: *mut bindings::gendisk, 185 + } 186 + 187 + // SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a 188 + // `TagSet` It is safe to send this to other threads as long as T is Send. 189 + unsafe impl<T: Operations + Send> Send for GenDisk<T> {} 190 + 191 + impl<T: Operations> Drop for GenDisk<T> { 192 + fn drop(&mut self) { 193 + // SAFETY: By type invariant, `self.gendisk` points to a valid and 194 + // initialized instance of `struct gendisk`, and it was previously added 195 + // to the VFS. 196 + unsafe { bindings::del_gendisk(self.gendisk) }; 197 + } 198 + }
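The block size rule enforced by `validate_block_size` above can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code: `ASSUMED_PAGE_SIZE` and `block_size_is_valid` are hypothetical names, and the 4096 value is an assumption standing in for `bindings::PAGE_SIZE`, which varies by architecture.

```rust
/// Assumed page size for this sketch only. The kernel code uses
/// `bindings::PAGE_SIZE`, whose value depends on the architecture.
const ASSUMED_PAGE_SIZE: u32 = 4096;

/// Returns `true` if `size` is a power of two within
/// `512..=ASSUMED_PAGE_SIZE`, mirroring the check in
/// `GenDiskBuilder::validate_block_size`.
fn block_size_is_valid(size: u32) -> bool {
    (512..=ASSUMED_PAGE_SIZE).contains(&size) && size.is_power_of_two()
}
```

Under this assumption, 512 and 4096 are accepted, while 256 (too small), 1536 (not a power of two) and 8192 (larger than the assumed page size) are rejected.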

+245

rust/kernel/block/mq/operations.rs

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + //! This module provides an interface for blk-mq drivers to implement. 4 + //! 5 + //! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h) 6 + 7 + use crate::{ 8 + bindings, 9 + block::mq::request::RequestDataWrapper, 10 + block::mq::Request, 11 + error::{from_result, Result}, 12 + types::ARef, 13 + }; 14 + use core::{marker::PhantomData, sync::atomic::AtomicU64, sync::atomic::Ordering}; 15 + 16 + /// Implement this trait to interface blk-mq as block devices. 17 + /// 18 + /// To implement a block device driver, implement this trait as described in the 19 + /// [module level documentation]. The kernel will use the implementation of the 20 + /// functions defined in this trait to interface a block device driver. Note: 21 + /// There is no need for an exit_request() implementation, because the `drop` 22 + /// implementation of the [`Request`] type will be invoked automatically by 23 + /// the C/Rust glue logic. 24 + /// 25 + /// [module level documentation]: kernel::block::mq 26 + #[macros::vtable] 27 + pub trait Operations: Sized { 28 + /// Called by the kernel to queue a request with the driver. If `is_last` is 29 + /// `false`, the driver is allowed to defer committing the request. 30 + fn queue_rq(rq: ARef<Request<Self>>, is_last: bool) -> Result; 31 + 32 + /// Called by the kernel to indicate that queued requests should be submitted. 33 + fn commit_rqs(); 34 + 35 + /// Called by the kernel to poll the device for completed requests. Only 36 + /// used for poll queues. 37 + fn poll() -> bool { 38 + crate::build_error(crate::error::VTABLE_DEFAULT_ERROR) 39 + } 40 + } 41 + 42 + /// A vtable for blk-mq to interact with a block device driver. 43 + /// 44 + /// A `bindings::blk_mq_ops` vtable is constructed from pointers to the `extern 45 + /// "C"` functions of this struct, exposed through the `OperationsVTable::VTABLE`.
46 + /// 47 + /// For general documentation of these methods, see the kernel source 48 + /// documentation related to `struct blk_mq_ops` in 49 + /// [`include/linux/blk-mq.h`]. 50 + /// 51 + /// [`include/linux/blk-mq.h`]: srctree/include/linux/blk-mq.h 52 + pub(crate) struct OperationsVTable<T: Operations>(PhantomData<T>); 53 + 54 + impl<T: Operations> OperationsVTable<T> { 55 + /// This function is called by the C kernel. A pointer to this function is 56 + /// installed in the `blk_mq_ops` vtable for the driver. 57 + /// 58 + /// # Safety 59 + /// 60 + /// - The caller of this function must ensure that the pointee of `bd` is 61 + /// valid for reads for the duration of this function. 62 + /// - This function must be called for an initialized and live `hctx`. That 63 + /// is, `Self::init_hctx_callback` was called and 64 + /// `Self::exit_hctx_callback()` was not yet called. 65 + /// - `(*bd).rq` must point to an initialized and live `bindings::request`. 66 + /// That is, `Self::init_request_callback` was called but 67 + /// `Self::exit_request_callback` was not yet called for the request. 68 + /// - `(*bd).rq` must be owned by the driver. That is, the block layer must 69 + /// promise to not access the request until the driver calls 70 + /// `bindings::blk_mq_end_request` for the request. 71 + unsafe extern "C" fn queue_rq_callback( 72 + _hctx: *mut bindings::blk_mq_hw_ctx, 73 + bd: *const bindings::blk_mq_queue_data, 74 + ) -> bindings::blk_status_t { 75 + // SAFETY: `bd.rq` is valid as required by the safety requirement for 76 + // this function. 77 + let request = unsafe { &*(*bd).rq.cast::<Request<T>>() }; 78 + 79 + // One refcount for the ARef, one for being in flight 80 + request.wrapper_ref().refcount().store(2, Ordering::Relaxed); 81 + 82 + // SAFETY: 83 + // - We own a refcount that we took above. We pass that to `ARef`.
84 + // - By the safety requirements of this function, `request` is a valid 85 + // `struct request` and the private data is properly initialized. 86 + // - `rq` will be alive until `blk_mq_end_request` is called and is 87 + // reference counted by `ARef` until then. 88 + let rq = unsafe { Request::aref_from_raw((*bd).rq) }; 89 + 90 + // SAFETY: We have exclusive access and we just set the refcount above. 91 + unsafe { Request::start_unchecked(&rq) }; 92 + 93 + let ret = T::queue_rq( 94 + rq, 95 + // SAFETY: `bd` is valid as required by the safety requirement for 96 + // this function. 97 + unsafe { (*bd).last }, 98 + ); 99 + 100 + if let Err(e) = ret { 101 + e.to_blk_status() 102 + } else { 103 + bindings::BLK_STS_OK as _ 104 + } 105 + } 106 + 107 + /// This function is called by the C kernel. A pointer to this function is 108 + /// installed in the `blk_mq_ops` vtable for the driver. 109 + /// 110 + /// # Safety 111 + /// 112 + /// This function may only be called by blk-mq C infrastructure. 113 + unsafe extern "C" fn commit_rqs_callback(_hctx: *mut bindings::blk_mq_hw_ctx) { 114 + T::commit_rqs() 115 + } 116 + 117 + /// This function is called by the C kernel. It is not currently 118 + /// implemented, and there is no way to exercise this code path. 119 + /// 120 + /// # Safety 121 + /// 122 + /// This function may only be called by blk-mq C infrastructure. 123 + unsafe extern "C" fn complete_callback(_rq: *mut bindings::request) {} 124 + 125 + /// This function is called by the C kernel. A pointer to this function is 126 + /// installed in the `blk_mq_ops` vtable for the driver. 127 + /// 128 + /// # Safety 129 + /// 130 + /// This function may only be called by blk-mq C infrastructure. 131 + unsafe extern "C" fn poll_callback( 132 + _hctx: *mut bindings::blk_mq_hw_ctx, 133 + _iob: *mut bindings::io_comp_batch, 134 + ) -> core::ffi::c_int { 135 + T::poll().into() 136 + } 137 + 138 + /// This function is called by the C kernel. 
A pointer to this function is 139 + /// installed in the `blk_mq_ops` vtable for the driver. 140 + /// 141 + /// # Safety 142 + /// 143 + /// This function may only be called by blk-mq C infrastructure. This 144 + /// function may only be called once before `exit_hctx_callback` is called 145 + /// for the same context. 146 + unsafe extern "C" fn init_hctx_callback( 147 + _hctx: *mut bindings::blk_mq_hw_ctx, 148 + _tagset_data: *mut core::ffi::c_void, 149 + _hctx_idx: core::ffi::c_uint, 150 + ) -> core::ffi::c_int { 151 + from_result(|| Ok(0)) 152 + } 153 + 154 + /// This function is called by the C kernel. A pointer to this function is 155 + /// installed in the `blk_mq_ops` vtable for the driver. 156 + /// 157 + /// # Safety 158 + /// 159 + /// This function may only be called by blk-mq C infrastructure. 160 + unsafe extern "C" fn exit_hctx_callback( 161 + _hctx: *mut bindings::blk_mq_hw_ctx, 162 + _hctx_idx: core::ffi::c_uint, 163 + ) { 164 + } 165 + 166 + /// This function is called by the C kernel. A pointer to this function is 167 + /// installed in the `blk_mq_ops` vtable for the driver. 168 + /// 169 + /// # Safety 170 + /// 171 + /// - This function may only be called by blk-mq C infrastructure. 172 + /// - `_set` must point to an initialized `TagSet<T>`. 173 + /// - `rq` must point to an initialized `bindings::request`. 174 + /// - The allocation pointed to by `rq` must be at the size of `Request` 175 + /// plus the size of `RequestDataWrapper`. 176 + unsafe extern "C" fn init_request_callback( 177 + _set: *mut bindings::blk_mq_tag_set, 178 + rq: *mut bindings::request, 179 + _hctx_idx: core::ffi::c_uint, 180 + _numa_node: core::ffi::c_uint, 181 + ) -> core::ffi::c_int { 182 + from_result(|| { 183 + // SAFETY: By the safety requirements of this function, `rq` points 184 + // to a valid allocation. 
185 + let pdu = unsafe { Request::wrapper_ptr(rq.cast::<Request<T>>()) }; 186 + 187 + // SAFETY: The refcount field is allocated but not initialized, so 188 + // it is valid for writes. 189 + unsafe { RequestDataWrapper::refcount_ptr(pdu.as_ptr()).write(AtomicU64::new(0)) }; 190 + 191 + Ok(0) 192 + }) 193 + } 194 + 195 + /// This function is called by the C kernel. A pointer to this function is 196 + /// installed in the `blk_mq_ops` vtable for the driver. 197 + /// 198 + /// # Safety 199 + /// 200 + /// - This function may only be called by blk-mq C infrastructure. 201 + /// - `_set` must point to an initialized `TagSet<T>`. 202 + /// - `rq` must point to an initialized and valid `Request`. 203 + unsafe extern "C" fn exit_request_callback( 204 + _set: *mut bindings::blk_mq_tag_set, 205 + rq: *mut bindings::request, 206 + _hctx_idx: core::ffi::c_uint, 207 + ) { 208 + // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory 209 + // for the request data. 210 + let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) }.cast::<RequestDataWrapper>(); 211 + 212 + // SAFETY: `pdu` is valid for read and write and is properly initialised. 
213 + unsafe { core::ptr::drop_in_place(pdu) }; 214 + } 215 + 216 + const VTABLE: bindings::blk_mq_ops = bindings::blk_mq_ops { 217 + queue_rq: Some(Self::queue_rq_callback), 218 + queue_rqs: None, 219 + commit_rqs: Some(Self::commit_rqs_callback), 220 + get_budget: None, 221 + put_budget: None, 222 + set_rq_budget_token: None, 223 + get_rq_budget_token: None, 224 + timeout: None, 225 + poll: if T::HAS_POLL { 226 + Some(Self::poll_callback) 227 + } else { 228 + None 229 + }, 230 + complete: Some(Self::complete_callback), 231 + init_hctx: Some(Self::init_hctx_callback), 232 + exit_hctx: Some(Self::exit_hctx_callback), 233 + init_request: Some(Self::init_request_callback), 234 + exit_request: Some(Self::exit_request_callback), 235 + cleanup_rq: None, 236 + busy: None, 237 + map_queues: None, 238 + #[cfg(CONFIG_BLK_DEBUG_FS)] 239 + show_rq: None, 240 + }; 241 + 242 + pub(crate) const fn build() -> &'static bindings::blk_mq_ops { 243 + &Self::VTABLE 244 + } 245 + }
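The `VTABLE` constant above installs the `poll` callback only when the driver overrides `poll`, using the `HAS_POLL` flag. The pattern can be sketched in plain Rust as follows. This is an illustration under stated assumptions, not the kernel API: `OpsTable`, `NoPoll` and `WithPoll` are hypothetical names, the trait is a simplified stand-in, and `HAS_POLL` is written by hand here whereas in the kernel it is generated by the `#[vtable]` macro.

```rust
use std::marker::PhantomData;

/// Simplified stand-in for the kernel's `Operations` trait.
/// `HAS_POLL` is set manually in this sketch.
trait Operations {
    const HAS_POLL: bool = false;

    fn poll() -> bool {
        false
    }
}

/// Hypothetical one-entry table standing in for `bindings::blk_mq_ops`.
struct OpsTable {
    poll: Option<fn() -> bool>,
}

struct OperationsVTable<T: Operations>(PhantomData<T>);

impl<T: Operations> OperationsVTable<T> {
    /// Built at compile time; the `poll` slot stays empty unless the
    /// implementer declares that it provides `poll`.
    const VTABLE: OpsTable = OpsTable {
        poll: if T::HAS_POLL {
            Some(T::poll as fn() -> bool)
        } else {
            None
        },
    };
}

/// A driver type that keeps the default `poll`.
struct NoPoll;
impl Operations for NoPoll {}

/// A driver type that provides its own `poll`.
struct WithPoll;
impl Operations for WithPoll {
    const HAS_POLL: bool = true;

    fn poll() -> bool {
        true
    }
}
```

Leaving the slot as `None` lets the C side see a null function pointer, so the block layer can tell that the driver does not support polling rather than calling a default that cannot do anything useful.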

+55

rust/kernel/block/mq/raw_writer.rs

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + use core::fmt::{self, Write}; 4 + 5 + use crate::error::Result; 6 + use crate::prelude::EINVAL; 7 + 8 + /// A mutable reference to a byte buffer where a string can be written into. 9 + /// 10 + /// # Invariants 11 + /// 12 + /// `buffer` is always null terminated. 13 + pub(crate) struct RawWriter<'a> { 14 + buffer: &'a mut [u8], 15 + pos: usize, 16 + } 17 + 18 + impl<'a> RawWriter<'a> { 19 + /// Create a new `RawWriter` instance. 20 + fn new(buffer: &'a mut [u8]) -> Result<RawWriter<'a>> { 21 + *(buffer.last_mut().ok_or(EINVAL)?) = 0; 22 + 23 + // INVARIANT: We null terminated the buffer above. 24 + Ok(Self { buffer, pos: 0 }) 25 + } 26 + 27 + pub(crate) fn from_array<const N: usize>( 28 + a: &'a mut [core::ffi::c_char; N], 29 + ) -> Result<RawWriter<'a>> { 30 + Self::new( 31 + // SAFETY: the buffer of `a` is valid for read and write as `u8` for 32 + // at least `N` bytes. 33 + unsafe { core::slice::from_raw_parts_mut(a.as_mut_ptr().cast::<u8>(), N) }, 34 + ) 35 + } 36 + } 37 + 38 + impl Write for RawWriter<'_> { 39 + fn write_str(&mut self, s: &str) -> fmt::Result { 40 + let bytes = s.as_bytes(); 41 + let len = bytes.len(); 42 + 43 + // We do not want to overwrite our null terminator 44 + if self.pos + len > self.buffer.len() - 1 { 45 + return Err(fmt::Error); 46 + } 47 + 48 + // INVARIANT: We are not overwriting the last byte 49 + self.buffer[self.pos..self.pos + len].copy_from_slice(bytes); 50 + 51 + self.pos += len; 52 + 53 + Ok(()) 54 + } 55 + }
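The behaviour of `RawWriter` can be reproduced in userspace to see how the reserved terminator byte limits the name length. This is a sketch assuming `std` in place of `core` and `Option` in place of the kernel `Result`; `NulTerminatedWriter` and `write_disk_name` are hypothetical names, the latter imitating the two writes done in `GenDiskBuilder::build`.

```rust
use std::fmt::{self, Write};

/// Userspace stand-in for `RawWriter`: writes into a fixed byte buffer
/// and never touches the last byte, which is pre-set to NUL.
struct NulTerminatedWriter<'a> {
    buffer: &'a mut [u8],
    pos: usize,
}

impl<'a> NulTerminatedWriter<'a> {
    /// Returns `None` for an empty buffer, which has no room for the
    /// terminator.
    fn new(buffer: &'a mut [u8]) -> Option<Self> {
        *buffer.last_mut()? = 0;
        Some(Self { buffer, pos: 0 })
    }
}

impl Write for NulTerminatedWriter<'_> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let bytes = s.as_bytes();
        // Refuse any write that would reach the reserved last byte.
        if self.pos + bytes.len() > self.buffer.len() - 1 {
            return Err(fmt::Error);
        }
        self.buffer[self.pos..self.pos + bytes.len()].copy_from_slice(bytes);
        self.pos += bytes.len();
        Ok(())
    }
}

/// Formats `name` into `buf` and then appends an explicit NUL, the same
/// sequence of writes used for `disk_name` in `GenDiskBuilder::build`.
fn write_disk_name(buf: &mut [u8], name: &str) -> fmt::Result {
    let mut writer = NulTerminatedWriter::new(buf).ok_or(fmt::Error)?;
    writer.write_fmt(format_args!("{}", name))?;
    writer.write_char('\0')
}
```

Because the explicit NUL is itself written through `write_str`, a buffer of `N` bytes accepts names of at most `N - 2` bytes in this sketch: one byte for the explicit NUL and one for the reserved last byte.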

+253

rust/kernel/block/mq/request.rs

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + //! This module provides a wrapper for the C `struct request` type. 4 + //! 5 + //! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h) 6 + 7 + use crate::{ 8 + bindings, 9 + block::mq::Operations, 10 + error::Result, 11 + types::{ARef, AlwaysRefCounted, Opaque}, 12 + }; 13 + use core::{ 14 + marker::PhantomData, 15 + ptr::{addr_of_mut, NonNull}, 16 + sync::atomic::{AtomicU64, Ordering}, 17 + }; 18 + 19 + /// A wrapper around a blk-mq `struct request`. This represents an IO request. 20 + /// 21 + /// # Implementation details 22 + /// 23 + /// There are four states for a request that the Rust bindings care about: 24 + /// 25 + /// A) Request is owned by block layer (refcount 0) 26 + /// B) Request is owned by driver but with zero `ARef`s in existence 27 + /// (refcount 1) 28 + /// C) Request is owned by driver with exactly one `ARef` in existence 29 + /// (refcount 2) 30 + /// D) Request is owned by driver with more than one `ARef` in existence 31 + /// (refcount > 2) 32 + /// 33 + /// 34 + /// We need to track A and B to ensure we fail tag to request conversions for 35 + /// requests that are not owned by the driver. 36 + /// 37 + /// We need to track C and D to ensure that it is safe to end the request and hand 38 + /// back ownership to the block layer. 39 + /// 40 + /// The states are tracked through the private `refcount` field of 41 + /// `RequestDataWrapper`. This structure lives in the private data area of the C 42 + /// `struct request`. 43 + /// 44 + /// # Invariants 45 + /// 46 + /// * `self.0` is a valid `struct request` created by the C portion of the kernel. 47 + /// * The private data area associated with this request must be an initialized 48 + /// and valid `RequestDataWrapper<T>`. 49 + /// * `self` is reference counted by atomic modification of 50 + /// self.wrapper_ref().refcount(). 
51 + /// 52 + #[repr(transparent)] 53 + pub struct Request<T: Operations>(Opaque<bindings::request>, PhantomData<T>); 54 + 55 + impl<T: Operations> Request<T> { 56 + /// Create an `ARef<Request>` from a `struct request` pointer. 57 + /// 58 + /// # Safety 59 + /// 60 + /// * The caller must own a refcount on `ptr` that is transferred to the 61 + /// returned `ARef`. 62 + /// * The type invariants for `Request` must hold for the pointee of `ptr`. 63 + pub(crate) unsafe fn aref_from_raw(ptr: *mut bindings::request) -> ARef<Self> { 64 + // INVARIANT: By the safety requirements of this function, invariants are upheld. 65 + // SAFETY: By the safety requirement of this function, we own a 66 + // reference count that we can pass to `ARef`. 67 + unsafe { ARef::from_raw(NonNull::new_unchecked(ptr as *const Self as *mut Self)) } 68 + } 69 + 70 + /// Notify the block layer that a request is going to be processed now. 71 + /// 72 + /// The block layer uses this hook to do proper initializations such as 73 + /// starting the timeout timer. It is a requirement that block device 74 + /// drivers call this function when starting to process a request. 75 + /// 76 + /// # Safety 77 + /// 78 + /// The caller must have exclusive ownership of `self`, that is 79 + /// `self.wrapper_ref().refcount() == 2`. 80 + pub(crate) unsafe fn start_unchecked(this: &ARef<Self>) { 81 + // SAFETY: By type invariant, `self.0` is a valid `struct request` and 82 + // we have exclusive access. 83 + unsafe { bindings::blk_mq_start_request(this.0.get()) }; 84 + } 85 + 86 + /// Try to take exclusive ownership of `this` by dropping the refcount to 0. 87 + /// This fails if `this` is not the only `ARef` pointing to the underlying 88 + /// `Request`. 89 + /// 90 + /// If the operation is successful, `Ok` is returned with a pointer to the 91 + /// C `struct request`. If the operation fails, `this` is returned in the 92 + /// `Err` variant. 
93 + fn try_set_end(this: ARef<Self>) -> Result<*mut bindings::request, ARef<Self>> { 94 + // We can race with `TagSet::tag_to_rq` 95 + if let Err(_old) = this.wrapper_ref().refcount().compare_exchange( 96 + 2, 97 + 0, 98 + Ordering::Relaxed, 99 + Ordering::Relaxed, 100 + ) { 101 + return Err(this); 102 + } 103 + 104 + let request_ptr = this.0.get(); 105 + core::mem::forget(this); 106 + 107 + Ok(request_ptr) 108 + } 109 + 110 + /// Notify the block layer that the request has been completed without errors. 111 + /// 112 + /// This function will return `Err` if `this` is not the only `ARef` 113 + /// referencing the request. 114 + pub fn end_ok(this: ARef<Self>) -> Result<(), ARef<Self>> { 115 + let request_ptr = Self::try_set_end(this)?; 116 + 117 + // SAFETY: By type invariant, `this.0` was a valid `struct request`. The 118 + // success of the call to `try_set_end` guarantees that there are no 119 + // `ARef`s pointing to this request. Therefore it is safe to hand it 120 + // back to the block layer. 121 + unsafe { bindings::blk_mq_end_request(request_ptr, bindings::BLK_STS_OK as _) }; 122 + 123 + Ok(()) 124 + } 125 + 126 + /// Return a pointer to the `RequestDataWrapper` stored in the private area 127 + /// of the request structure. 128 + /// 129 + /// # Safety 130 + /// 131 + /// - `this` must point to a valid allocation of size at least size of 132 + /// `Self` plus size of `RequestDataWrapper`. 133 + pub(crate) unsafe fn wrapper_ptr(this: *mut Self) -> NonNull<RequestDataWrapper> { 134 + let request_ptr = this.cast::<bindings::request>(); 135 + // SAFETY: By safety requirements for this function, `this` is a 136 + // valid allocation. 137 + let wrapper_ptr = 138 + unsafe { bindings::blk_mq_rq_to_pdu(request_ptr).cast::<RequestDataWrapper>() }; 139 + // SAFETY: By C API contract, wrapper_ptr points to a valid allocation 140 + // and is not null. 
141 + unsafe { NonNull::new_unchecked(wrapper_ptr) } 142 + } 143 + 144 + /// Return a reference to the `RequestDataWrapper` stored in the private 145 + /// area of the request structure. 146 + pub(crate) fn wrapper_ref(&self) -> &RequestDataWrapper { 147 + // SAFETY: By type invariant, `self.0` is a valid allocation. Further, 148 + // the private data associated with this request is initialized and 149 + // valid. The existence of `&self` guarantees that the private data is 150 + // valid as a shared reference. 151 + unsafe { Self::wrapper_ptr(self as *const Self as *mut Self).as_ref() } 152 + } 153 + } 154 + 155 + /// A wrapper around data stored in the private area of the C `struct request`. 156 + pub(crate) struct RequestDataWrapper { 157 + /// The Rust request refcount has the following states: 158 + /// 159 + /// - 0: The request is owned by C block layer. 160 + /// - 1: The request is owned by Rust abstractions but there are no ARef references to it. 161 + /// - 2+: There are `ARef` references to the request. 162 + refcount: AtomicU64, 163 + } 164 + 165 + impl RequestDataWrapper { 166 + /// Return a reference to the refcount of the request that is embedding 167 + /// `self`. 168 + pub(crate) fn refcount(&self) -> &AtomicU64 { 169 + &self.refcount 170 + } 171 + 172 + /// Return a pointer to the refcount of the request that is embedding the 173 + /// pointee of `this`. 174 + /// 175 + /// # Safety 176 + /// 177 + /// - `this` must point to a live allocation of at least the size of `Self`. 178 + pub(crate) unsafe fn refcount_ptr(this: *mut Self) -> *mut AtomicU64 { 179 + // SAFETY: Because of the safety requirements of this function, the 180 + // field projection is safe. 181 + unsafe { addr_of_mut!((*this).refcount) } 182 + } 183 + } 184 + 185 + // SAFETY: Exclusive access is thread-safe for `Request`. `Request` has no `&mut 186 + // self` methods and `&self` methods that mutate `self` are internally 187 + // synchronized. 
188 + unsafe impl<T: Operations> Send for Request<T> {} 189 + 190 + // SAFETY: Shared access is thread-safe for `Request`. `&self` methods that 191 + // mutate `self` are internally synchronized. 192 + unsafe impl<T: Operations> Sync for Request<T> {} 193 + 194 + /// Store the result of `op(target.load())` in target, returning new value of 195 + /// target. 196 + fn atomic_relaxed_op_return(target: &AtomicU64, op: impl Fn(u64) -> u64) -> u64 { 197 + let old = target.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |x| Some(op(x))); 198 + 199 + // SAFETY: Because the operation passed to `fetch_update` above always 200 + // returns `Some`, `old` will always be `Ok`. 201 + let old = unsafe { old.unwrap_unchecked() }; 202 + 203 + op(old) 204 + } 205 + 206 + /// Store the result of `op(target.load())` in `target` if `target.load() != 207 + /// pred`, returning true if the target was updated. 208 + fn atomic_relaxed_op_unless(target: &AtomicU64, op: impl Fn(u64) -> u64, pred: u64) -> bool { 209 + target 210 + .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |x| { 211 + if x == pred { 212 + None 213 + } else { 214 + Some(op(x)) 215 + } 216 + }) 217 + .is_ok() 218 + } 219 + 220 + // SAFETY: All instances of `Request<T>` are reference counted. This 221 + // implementation of `AlwaysRefCounted` ensures that increments to the ref count 222 + // keep the object alive in memory at least until a matching reference count 223 + // decrement is executed.
224 + unsafe impl<T: Operations> AlwaysRefCounted for Request<T> { 225 + fn inc_ref(&self) { 226 + let refcount = &self.wrapper_ref().refcount(); 227 + 228 + #[cfg_attr(not(CONFIG_DEBUG_MISC), allow(unused_variables))] 229 + let updated = atomic_relaxed_op_unless(refcount, |x| x + 1, 0); 230 + 231 + #[cfg(CONFIG_DEBUG_MISC)] 232 + if !updated { 233 + panic!("Request refcount zero on clone") 234 + } 235 + } 236 + 237 + unsafe fn dec_ref(obj: core::ptr::NonNull<Self>) { 238 + // SAFETY: The type invariants of `ARef` guarantee that `obj` is valid 239 + // for read. 240 + let wrapper_ptr = unsafe { Self::wrapper_ptr(obj.as_ptr()).as_ptr() }; 241 + // SAFETY: The type invariant of `Request` guarantees that the private 242 + // data area is initialized and valid. 243 + let refcount = unsafe { &*RequestDataWrapper::refcount_ptr(wrapper_ptr) }; 244 + 245 + #[cfg_attr(not(CONFIG_DEBUG_MISC), allow(unused_variables))] 246 + let new_refcount = atomic_relaxed_op_return(refcount, |x| x - 1); 247 + 248 + #[cfg(CONFIG_DEBUG_MISC)] 249 + if new_refcount == 0 { 250 + panic!("Request reached refcount zero in Rust abstractions"); 251 + } 252 + } 253 + }
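The refcount states documented on `RequestDataWrapper` (0 = owned by C, 1 = owned by Rust with no `ARef`, 2+ = `ARef` references exist) and the two `fetch_update` helpers can be exercised outside the kernel. The following is a minimal userspace sketch, assuming `std` atomics instead of the kernel's imports and a checked `expect` instead of `unwrap_unchecked`. `ToyRequest` and its methods are hypothetical names used only for illustration; they are not part of the patch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Store `op(current)` in `target` and return the new value.
fn atomic_relaxed_op_return(target: &AtomicU64, op: impl Fn(u64) -> u64) -> u64 {
    // The closure always returns `Some`, so `fetch_update` cannot fail here.
    let old = target
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |x| Some(op(x)))
        .expect("closure always returns Some");
    // `fetch_update` hands back the previous value; re-apply `op` to get the new one.
    op(old)
}

/// Store `op(current)` in `target` unless `current == pred`.
/// Returns `true` if the value was updated.
fn atomic_relaxed_op_unless(target: &AtomicU64, op: impl Fn(u64) -> u64, pred: u64) -> bool {
    target
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |x| {
            if x == pred {
                None
            } else {
                Some(op(x))
            }
        })
        .is_ok()
}

/// Hypothetical stand-in for a request carrying the Rust-side refcount.
struct ToyRequest {
    refcount: AtomicU64,
}

impl ToyRequest {
    /// State 0: the request is owned by the C block layer.
    fn owned_by_c() -> Self {
        ToyRequest { refcount: AtomicU64::new(0) }
    }

    /// State 1: the request is owned by the Rust abstractions.
    fn owned_by_rust() -> Self {
        ToyRequest { refcount: AtomicU64::new(1) }
    }

    /// Mirrors `inc_ref`: refuses to increment a refcount of zero.
    fn try_inc_ref(&self) -> bool {
        atomic_relaxed_op_unless(&self.refcount, |x| x + 1, 0)
    }

    /// Mirrors `dec_ref`: returns the refcount after the decrement.
    fn dec_ref(&self) -> u64 {
        atomic_relaxed_op_return(&self.refcount, |x| x - 1)
    }
}
```

In this sketch an increment on a request in state 0 is rejected and leaves the counter unchanged, which corresponds to the condition the patch reports with a panic only when `CONFIG_DEBUG_MISC` is enabled.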

+86

rust/kernel/block/mq/tag_set.rs

@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
+//!
+//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
+
+use core::pin::Pin;
+
+use crate::{
+    bindings,
+    block::mq::{operations::OperationsVTable, request::RequestDataWrapper, Operations},
+    error,
+    prelude::PinInit,
+    try_pin_init,
+    types::Opaque,
+};
+use core::{convert::TryInto, marker::PhantomData};
+use macros::{pin_data, pinned_drop};
+
+/// A wrapper for the C `struct blk_mq_tag_set`.
+///
+/// `struct blk_mq_tag_set` contains a `struct list_head` and so must be pinned.
+///
+/// # Invariants
+///
+/// - `inner` is initialized and valid.
+#[pin_data(PinnedDrop)]
+#[repr(transparent)]
+pub struct TagSet<T: Operations> {
+    #[pin]
+    inner: Opaque<bindings::blk_mq_tag_set>,
+    _p: PhantomData<T>,
+}
+
+impl<T: Operations> TagSet<T> {
+    /// Try to create a new tag set
+    pub fn new(
+        nr_hw_queues: u32,
+        num_tags: u32,
+        num_maps: u32,
+    ) -> impl PinInit<Self, error::Error> {
+        // SAFETY: `blk_mq_tag_set` only contains integers and pointers, which
+        // all are allowed to be 0.
+        let tag_set: bindings::blk_mq_tag_set = unsafe { core::mem::zeroed() };
+        let tag_set = core::mem::size_of::<RequestDataWrapper>()
+            .try_into()
+            .map(|cmd_size| {
+                bindings::blk_mq_tag_set {
+                    ops: OperationsVTable::<T>::build(),
+                    nr_hw_queues,
+                    timeout: 0, // 0 means default which is 30Hz in C
+                    numa_node: bindings::NUMA_NO_NODE,
+                    queue_depth: num_tags,
+                    cmd_size,
+                    flags: bindings::BLK_MQ_F_SHOULD_MERGE,
+                    driver_data: core::ptr::null_mut::<core::ffi::c_void>(),
+                    nr_maps: num_maps,
+                    ..tag_set
+                }
+            });
+
+        try_pin_init!(TagSet {
+            inner <- PinInit::<_, error::Error>::pin_chain(Opaque::new(tag_set?), |tag_set| {
+                // SAFETY: we do not move out of `tag_set`.
+                let tag_set = unsafe { Pin::get_unchecked_mut(tag_set) };
+                // SAFETY: `tag_set` is a reference to an initialized `blk_mq_tag_set`.
+                error::to_result( unsafe { bindings::blk_mq_alloc_tag_set(tag_set.get())})
+            }),
+            _p: PhantomData,
+        })
+    }
+
+    /// Return the pointer to the wrapped `struct blk_mq_tag_set`
+    pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
+        self.inner.get()
+    }
+}
+
+#[pinned_drop]
+impl<T: Operations> PinnedDrop for TagSet<T> {
+    fn drop(self: Pin<&mut Self>) {
+        // SAFETY: By type invariant `inner` is valid and has been properly
+        // initialized during construction.
+        unsafe { bindings::blk_mq_free_tag_set(self.inner.get()) };
+    }
+}
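`TagSet::new` builds its C struct in three steps: start from an all-zero value, convert `size_of::<RequestDataWrapper>()` into the narrower `cmd_size` field with a fallible `try_into`, and fill the remaining fields with struct update syntax (`..tag_set`). The sketch below reproduces only that pattern in userspace Rust. `ToyTagSet`, `ToyPrivateData`, and `build_tag_set` are hypothetical stand-ins, `Default` replaces `core::mem::zeroed()`, and no kernel bindings or pin-initialization are involved.

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

/// Hypothetical, reduced stand-in for the C `struct blk_mq_tag_set`.
#[derive(Debug, Default, Clone, PartialEq)]
struct ToyTagSet {
    nr_hw_queues: u32,
    queue_depth: u32,
    cmd_size: u32,
    nr_maps: u32,
    timeout: u32,
    reserved_tags: u32,
}

/// Hypothetical stand-in for the per-request private data.
#[allow(dead_code)]
struct ToyPrivateData {
    refcount: u64,
}

/// Build a tag set description, failing if the private data size does not
/// fit into the 32-bit `cmd_size` field.
fn build_tag_set(
    nr_hw_queues: u32,
    num_tags: u32,
    num_maps: u32,
) -> Result<ToyTagSet, TryFromIntError> {
    // All-zero baseline; the patch uses `core::mem::zeroed()` for the C struct.
    let zeroed = ToyTagSet::default();

    // `usize` -> `u32` can fail in principle, so the conversion is checked.
    let cmd_size = u32::try_from(std::mem::size_of::<ToyPrivateData>())?;

    Ok(ToyTagSet {
        nr_hw_queues,
        queue_depth: num_tags,
        cmd_size,
        nr_maps: num_maps,
        // Every field not named above keeps its zero value.
        ..zeroed
    })
}
```

Struct update syntax keeps the initializer short when the C struct has many fields that should stay zero, and it means newly added fields in the C definition are zeroed without touching the Rust code.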

+6

rust/kernel/error.rs

@@ -126,6 +126,12 @@
         self.0
     }

+    #[cfg(CONFIG_BLOCK)]
+    pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
+        // SAFETY: `self.0` is a valid error due to its invariant.
+        unsafe { bindings::errno_to_blk_status(self.0) }
+    }
+
     /// Returns the error encoded as a pointer.
     #[allow(dead_code)]
     pub(crate) fn to_ptr<T>(self) -> *mut T {
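`to_blk_status` forwards the stored negative errno to the C helper `errno_to_blk_status`. The sketch below illustrates the kind of translation that helper performs, assuming it is a table lookup that falls back to a generic I/O error for unknown codes. `ToyBlkStatus`, `TOY_TABLE`, and `errno_to_toy_status` are hypothetical names, the table covers only a few entries, and the enum deliberately does not model the numeric `blk_status_t` values.

```rust
/// Hypothetical, partial stand-in for the kernel's block status codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyBlkStatus {
    Ok,
    NotSupp,
    Timeout,
    NoSpc,
    Resource,
    IoErr,
}

// Linux errno values (positive form, as on common architectures).
const EIO: i32 = 5;
const ENOMEM: i32 = 12;
const ENOSPC: i32 = 28;
const EOPNOTSUPP: i32 = 95;
const ETIMEDOUT: i32 = 110;

/// Illustrative subset of an errno-to-status table. Kernel code passes
/// errno values in negated form, so the keys are negative.
const TOY_TABLE: [(i32, ToyBlkStatus); 6] = [
    (0, ToyBlkStatus::Ok),
    (-EOPNOTSUPP, ToyBlkStatus::NotSupp),
    (-ETIMEDOUT, ToyBlkStatus::Timeout),
    (-ENOSPC, ToyBlkStatus::NoSpc),
    (-ENOMEM, ToyBlkStatus::Resource),
    (-EIO, ToyBlkStatus::IoErr),
];

/// Look up `errno` in the table; anything unknown becomes a generic I/O error.
fn errno_to_toy_status(errno: i32) -> ToyBlkStatus {
    TOY_TABLE
        .iter()
        .find(|entry| entry.0 == errno)
        .map(|entry| entry.1)
        .unwrap_or(ToyBlkStatus::IoErr)
}
```

A driver written against the Rust abstractions would return an `Error` and let the abstraction call `to_blk_status`; the lookup itself stays on the C side.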

+2

rust/kernel/lib.rs

@@ -27,6 +27,8 @@
 extern crate self as kernel;

 pub mod alloc;
+#[cfg(CONFIG_BLOCK)]
+pub mod block;
 mod build_assert;
 pub mod error;
 pub mod init;
