Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

+4 -10

Documentation/ABI/stable/sysfs-block

···
 Contact:	linux-block@vger.kernel.org
 Description:
 		[RW] This controls how many requests may be allocated in the
-		block layer for read or write requests. Note that the total
-		allocated number may be twice this amount, since it applies only
-		to reads or writes (not the accumulated sum).
-
-		To avoid priority inversion through request starvation, a
-		request queue maintains a separate request pool per each cgroup
-		when CONFIG_BLK_CGROUP is enabled, and this parameter applies to
-		each such per-block-cgroup request pool. IOW, if there are N
-		block cgroups, each request queue may have up to N request
-		pools, each independently regulated by nr_requests.
+		block layer. Noted this value only represents the quantity for a
+		single blk_mq_tags instance. The actual number for the entire
+		device depends on the hardware queue count, whether elevator is
+		enabled, and whether tags are shared.


 What:		/sys/block/<disk>/queue/nr_zones

+61 -25

Documentation/admin-guide/md.rst

···
     active-idle
         like active, but no writes have been seen for a while (safe_mode_delay).

+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
+
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+
+     bitmap
+       Resync assisted by a write-intent bitmap.
+
+     journal
+       For raid4/5/6, journal device is used to log transactions and replay
+       after unclean shutdown.
+
+     ppl
+       For raid5 only, Partial Parity Log is used to close the write hole and
+       eliminate resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL.
+
+  uuid
+     This indicates the UUID of the array in the following format:
+     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
+
+  bitmap_type
+     [RW] When read, this file will display the current and available
+     bitmap for this array. The currently active bitmap will be enclosed
+     in [] brackets. Writing an bitmap name or ID to this file will switch
+     control of this array to that new bitmap. Note that writing a new
+     bitmap for created array is forbidden.
+
+     none
+         No bitmap
+     bitmap
+         The default internal bitmap
+     llbitmap
+         The lockless internal bitmap
+
+  If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or
+  llbitmap/xxx will be created after md device KOBJ_CHANGE event.
+
+  If bitmap_type is bitmap, then the md device will also contain:
+
   bitmap/location
      This indicates where the write-intent bitmap for the array is
      stored.
···
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.

-  consistency_policy
-     This indicates how the array maintains consistency in case of unexpected
-     shutdown. It can be:
+  If bitmap_type is llbitmap, then the md device will also contain:

-     none
-       Array has no redundancy information, e.g. raid0, linear.
+  llbitmap/bits
+     This is read-only, show status of bitmap bits, the number of each
+     value.

-     resync
-       Full resync is performed and all redundancy is regenerated when the
-       array is started after unclean shutdown.
+  llbitmap/metadata
+     This is read-only, show bitmap metadata, include chunksize, chunkshift,
+     chunks, offset and daemon_sleep.

-     bitmap
-       Resync assisted by a write-intent bitmap.
+  llbitmap/daemon_sleep
+     This is read-write, time in seconds that daemon function will be
+     triggered to clear dirty bits.

-     journal
-       For raid4/5/6, journal device is used to log transactions and replay
-       after unclean shutdown.
-
-     ppl
-       For raid5 only, Partial Parity Log is used to close the write hole and
-       eliminate resync.
-
-     The accepted values when writing to this file are ``ppl`` and ``resync``,
-     used to enable and disable PPL.
-
-  uuid
-     This indicates the UUID of the array in the following format:
-     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
-
+  llbitmap/barrier_idle
+     This is read-write, time in seconds that page barrier will be idled,
+     means dirty bits in the page will be cleared.

 As component devices are added to an md array, they appear in the ``md``
 directory as new directories named::

+1 -1

Documentation/filesystems/locking.rst

···
 	int (*direct_access) (struct block_device *, sector_t, void **,
 				unsigned long *);
 	void (*unlock_native_capacity) (struct gendisk *);
-	int (*getgeo)(struct block_device *, struct hd_geometry *);
+	int (*getgeo)(struct gendisk *, struct hd_geometry *);
 	void (*swap_slot_free_notify) (struct block_device *, unsigned long);

 locking rules:

+4 -4

Documentation/scsi/scsi_mid_low_api.rst

···

 /**
  * scsi_bios_ptable - return copy of block device's partition table
- * @dev: pointer to block device
+ * @dev: pointer to gendisk
  *
  * Returns pointer to partition table, or NULL for failure
  *
···
  *
  * Defined in: drivers/scsi/scsicam.c
  **/
-unsigned char *scsi_bios_ptable(struct block_device *dev)
+unsigned char *scsi_bios_ptable(struct gendisk *dev)


 /**
···
  * bios_param - fetch head, sector, cylinder info for a disk
  * @sdev: pointer to scsi device context (defined in
  * include/scsi/scsi_device.h)
- * @bdev: pointer to block device context (defined in fs.h)
+ * @disk: pointer to gendisk (defined in blkdev.h)
  * @capacity: device size (in 512 byte sectors)
  * @params: three element array to place output:
  * params[0] number of heads (max 255)
···
  *
  * Optionally defined in: LLD
  **/
-int bios_param(struct scsi_device * sdev, struct block_device *bdev,
+int bios_param(struct scsi_device * sdev, struct gendisk *disk,
 		sector_t capacity, int params[3])


+1 -1

MAINTAINERS

···
 B:	https://github.com/Rust-for-Linux/linux/issues
 C:	https://rust-for-linux.zulipchat.com/#narrow/stream/Block
 T:	git https://github.com/Rust-for-Linux/linux.git rust-block-next
-F:	drivers/block/rnull.rs
+F:	drivers/block/rnull/
 F:	rust/kernel/block.rs
 F:	rust/kernel/block/


-19

arch/alpha/include/asm/floppy.h

···
 #define N_FDC 2
 #define N_DRIVE 8

-/*
- * Most Alphas have no problems with floppy DMA crossing 64k borders,
- * except for certain ones, like XL and RUFFIAN.
- *
- * However, the test is simple and fast, and this *is* floppy, after all,
- * so we do it for all platforms, just to make sure.
- *
- * This is advantageous in other circumstances as well, as in moving
- * about the PCI DMA windows and forcing the floppy to start doing
- * scatter-gather when it never had before, and there *is* a problem
- * on that platform... ;-}
- */
-
-static inline unsigned long CROSS_64KB(void *a, unsigned long s)
-{
-	unsigned long p = (unsigned long)a;
-	return ((p + s - 1) ^ p) & ~0xffffUL;
-}
-
 #define EXTRA_FLOPPY_PARAMS

 #endif /* __ASM_ALPHA_FLOPPY_H */

-2

arch/arm/include/asm/floppy.h

···
 #define N_FDC 1
 #define N_DRIVE 4

-#define CROSS_64KB(a,s) (0)
-
 /*
  * This allows people to reverse the order of
  * fd0 and fd1, in case their hardware is

+2 -2

arch/m68k/emu/nfblock.c

···
 	bio_endio(bio);
 }

-static int nfhd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
+static int nfhd_getgeo(struct gendisk *disk, struct hd_geometry *geo)
 {
-	struct nfhd_device *dev = bdev->bd_disk->private_data;
+	struct nfhd_device *dev = disk->private_data;

 	geo->cylinders = dev->blocks >> (6 - dev->bshift);
 	geo->heads = 4;
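
Note on the hunk above: the ->getgeo callback now receives the gendisk directly, so a driver no longer has to go through bdev->bd_disk to reach its private data. The following is a minimal user-space sketch of that difference, not kernel code. The struct definitions are stand-ins that model only the fields touched here, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel types; only the fields used below are modelled. */
struct gendisk { void *private_data; };
struct block_device { struct gendisk *bd_disk; };
struct hd_geometry { unsigned char heads; unsigned short cylinders; };

/* Hypothetical driver state with the two fields the nfhd hunk reads. */
struct demo_device { unsigned long blocks; unsigned int bshift; };

/* Old calling convention: reach the driver data through bdev->bd_disk. */
int demo_getgeo_old(struct block_device *bdev, struct hd_geometry *geo)
{
	struct demo_device *dev = bdev->bd_disk->private_data;

	assert(dev != NULL);
	geo->cylinders = (unsigned short)(dev->blocks >> (6 - dev->bshift));
	geo->heads = 4;
	return 0;
}

/* New calling convention: the gendisk is passed in directly. */
int demo_getgeo(struct gendisk *disk, struct hd_geometry *geo)
{
	struct demo_device *dev = disk->private_data;

	assert(dev != NULL);
	geo->cylinders = (unsigned short)(dev->blocks >> (6 - dev->bshift));
	geo->heads = 4;
	return 0;
}
```

Both variants fill in the same geometry; only the way the driver data is reached changes.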

-4

arch/m68k/include/asm/floppy.h

···

 #define fd_free_dma() /* nothing */

-/* No 64k boundary crossing problems on Q40 - no DMA at all */
-#define CROSS_64KB(a,s) (0)
-
 #define DMA_MODE_READ 0x44 /* i386 look-alike */
 #define DMA_MODE_WRITE 0x48
-

 static int m68k_floppy_init(void)
 {

-15

arch/mips/include/asm/floppy.h

···
 #define N_FDC 1 /* do you *really* want a second controller? */
 #define N_DRIVE 8

-/*
- * The DMA channel used by the floppy controller cannot access data at
- * addresses >= 16MB
- *
- * Went back to the 1MB limit, as some people had problems with the floppy
- * driver otherwise. It doesn't matter much for performance anyway, as most
- * floppy accesses go through the track buffer.
- *
- * On MIPSes using vdma, this actually means that *all* transfers go thru
- * the * track buffer since 0x1000000 is always smaller than KSEG0/1.
- * Actually this needs to be a bit more complicated since the so much different
- * hardware available with MIPS CPUs ...
- */
-#define CROSS_64KB(a, s) ((unsigned long)(a)/K_64 != ((unsigned long)(a) + (s) - 1) / K_64)
-
 #define EXTRA_FLOPPY_PARAMS

 #include <floppy.h>

+4 -7

arch/parisc/include/asm/floppy.h

···
 #ifndef __ASM_PARISC_FLOPPY_H
 #define __ASM_PARISC_FLOPPY_H

+#include <linux/sizes.h>
 #include <linux/vmalloc.h>
-

 /*
  * The DMA channel used by the floppy controller cannot access data at
···
  * floppy accesses go through the track buffer.
  */
 #define _CROSS_64KB(a,s,vdma) \
-(!(vdma) && ((unsigned long)(a)/K_64 != ((unsigned long)(a) + (s) - 1) / K_64))
-
-#define CROSS_64KB(a,s) _CROSS_64KB(a,s,use_virtual_dma & 1)
-
+	(!(vdma) && \
+	 ((unsigned long)(a) / SZ_64K != ((unsigned long)(a) + (s) - 1) / SZ_64K))

 #define SW fd_routine[use_virtual_dma&1]
 #define CSW fd_routine[can_use_virtual_dma & 1]
-

 #define fd_inb(base, reg) readb((base) + (reg))
 #define fd_outb(value, base, reg) writeb(value, (base) + (reg))
···
 static int hard_dma_setup(char *addr, unsigned long size, int mode, int io)
 {
 #ifdef FLOPPY_SANITY_CHECK
-	if (CROSS_64KB(addr, size)) {
+	if (_CROSS_64KB(addr, size, use_virtual_dma & 1)) {
 		printk("DMA crossing 64-K boundary %p-%p\n", addr, addr+size);
 		return -1;
 	}

-5

arch/powerpc/include/asm/floppy.h

···
 #define N_FDC 2 /* Don't change this! */
 #define N_DRIVE 8

-/*
- * The PowerPC has no problems with floppy DMA crossing 64k borders.
- */
-#define CROSS_64KB(a,s) (0)
-
 #define EXTRA_FLOPPY_PARAMS

 #endif /* __KERNEL__ */

-3

arch/sparc/include/asm/floppy_32.h

···
 #define N_FDC 1
 #define N_DRIVE 8

-/* No 64k boundary crossing problems on the Sparc. */
-#define CROSS_64KB(a,s) (0)
-
 /* Routines unique to each controller type on a Sun. */
 static void sun_set_dor(unsigned char value, int fdc_82077)
 {

-3

arch/sparc/include/asm/floppy_64.h

···
 #define N_FDC 1
 #define N_DRIVE 8

-/* No 64k boundary crossing problems on the Sparc. */
-#define CROSS_64KB(a,s) (0)
-
 static unsigned char sun_82077_fd_inb(unsigned long base, unsigned int reg)
 {
 	udelay(5);

+3 -3

arch/um/drivers/ubd_kern.c

···

 static int ubd_ioctl(struct block_device *bdev, blk_mode_t mode,
 		     unsigned int cmd, unsigned long arg);
-static int ubd_getgeo(struct block_device *bdev, struct hd_geometry *geo);
+static int ubd_getgeo(struct gendisk *disk, struct hd_geometry *geo);

 #define MAX_DEV (16)

···
 	return res;
 }

-static int ubd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
+static int ubd_getgeo(struct gendisk *disk, struct hd_geometry *geo)
 {
-	struct ubd *ubd_dev = bdev->bd_disk->private_data;
+	struct ubd *ubd_dev = disk->private_data;

 	geo->heads = 128;
 	geo->sectors = 32;

+3 -5

arch/x86/include/asm/floppy.h

···
 #ifndef _ASM_X86_FLOPPY_H
 #define _ASM_X86_FLOPPY_H

+#include <linux/sizes.h>
 #include <linux/vmalloc.h>

 /*
···
  */
 #define _CROSS_64KB(a, s, vdma) \
 	(!(vdma) && \
-	 ((unsigned long)(a)/K_64 != ((unsigned long)(a) + (s) - 1) / K_64))
-
-#define CROSS_64KB(a, s) _CROSS_64KB(a, s, use_virtual_dma & 1)
-
+	 ((unsigned long)(a) / SZ_64K != ((unsigned long)(a) + (s) - 1) / SZ_64K))

 #define SW fd_routine[use_virtual_dma & 1]
 #define CSW fd_routine[can_use_virtual_dma & 1]
···
 static int hard_dma_setup(char *addr, unsigned long size, int mode, int io)
 {
 #ifdef FLOPPY_SANITY_CHECK
-	if (CROSS_64KB(addr, size)) {
+	if (_CROSS_64KB(addr, size, use_virtual_dma & 1)) {
 		printk("DMA crossing 64-K boundary %p-%p\n", addr, addr+size);
 		return -1;
 	}
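
Note on the _CROSS_64KB() check above: a buffer crosses a 64 KiB boundary when its first and last bytes fall into different 64 KiB windows. The sketch below is a user-space illustration, not kernel code. It shows the division form kept on x86 and parisc next to the XOR form of the removed alpha helper, which should agree for any non-empty buffer. The function names are hypothetical, and SZ_64K is redefined locally with the value the kernel header uses.

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_64K 0x10000UL	/* local copy of the value from <linux/sizes.h> */

/* Division form, as in the x86 and parisc macro. */
bool crosses_64k_div(unsigned long addr, unsigned long size)
{
	assert(size > 0);
	return addr / SZ_64K != (addr + size - 1) / SZ_64K;
}

/* XOR form, as in the removed alpha helper: any differing bit above bit 15. */
bool crosses_64k_xor(unsigned long addr, unsigned long size)
{
	assert(size > 0);
	return (((addr + size - 1) ^ addr) & ~0xffffUL) != 0;
}

/* The vdma argument short-circuits the check, mirroring !(vdma) && ... */
bool cross_64kb(unsigned long addr, unsigned long size, bool vdma)
{
	return !vdma && crosses_64k_div(addr, size);
}
```

For example, a 16-byte buffer at 0xFFF0 ends at 0xFFFF and stays inside one window, while a 17-byte buffer at the same address spills into the next one.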

+5 -17

block/bfq-iosched.c

···
  * See the comments on bfq_limit_depth for the purpose of
  * the depths set in the function. Return minimum shallow depth we'll use.
  */
-static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
+static void bfq_depth_updated(struct request_queue *q)
 {
-	unsigned int nr_requests = bfqd->queue->nr_requests;
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	unsigned int nr_requests = q->nr_requests;

 	/*
 	 * In-word depths if no bfq_queue is being weight-raised:
···
 	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
 	/* no more than ~37% of tags for sync writes (~20% extra tags) */
 	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
-}

-static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
-{
-	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-	struct blk_mq_tags *tags = hctx->sched_tags;
-
-	bfq_update_depths(bfqd, &tags->bitmap_tags);
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
-}
-
-static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
-{
-	bfq_depth_updated(hctx);
-	return 0;
+	blk_mq_set_min_shallow_depth(q, 1);
 }

 static void bfq_exit_queue(struct elevator_queue *e)
···
 		goto out_free;
 	bfq_init_root_group(bfqd->root_group, bfqd);
 	bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group);
+	bfq_depth_updated(q);

 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
···
 	.request_merged = bfq_request_merged,
 	.has_work = bfq_has_work,
 	.depth_updated = bfq_depth_updated,
-	.init_hctx = bfq_init_hctx,
 	.init_sched = bfq_init_queue,
 	.exit_sched = bfq_exit_queue,
 },
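
Note on the depth arithmetic above: the two async_depths entries visible in this hunk are 3/16 and 6/16 of nr_requests, computed with a multiply and a right shift by 4, and clamped so that they never drop below 1. A small user-space sketch of that arithmetic follows. The demo_* function names are hypothetical, and only the two expressions shown in the hunk are reproduced.

```c
#include <assert.h>

/* Clamp to at least 1, like max(x, 1U) in the hunk. */
static unsigned int demo_at_least_one(unsigned int v)
{
	return v > 1U ? v : 1U;
}

/* (nr_requests * 3) >> 4, i.e. 3/16 of the tags. */
unsigned int demo_depth_3_16(unsigned int nr_requests)
{
	assert(nr_requests > 0);
	return demo_at_least_one((nr_requests * 3) >> 4);
}

/* (nr_requests * 6) >> 4, i.e. 6/16 (about 37%) of the tags. */
unsigned int demo_depth_6_16(unsigned int nr_requests)
{
	assert(nr_requests > 0);
	return demo_at_least_one((nr_requests * 6) >> 4);
}
```

With nr_requests = 256 these evaluate to 48 and 96. With very small values such as 2, both shifts produce 0 and the clamp raises the result to 1.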

+19 -6

block/bio-integrity.c

···
 }

 static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages,
-				    int nr_vecs, ssize_t bytes, ssize_t offset)
+				    int nr_vecs, ssize_t bytes, ssize_t offset,
+				    bool *is_p2p)
 {
 	unsigned int nr_bvecs = 0;
 	int i, j;
···
 			bytes -= next;
 		}

+		if (is_pci_p2pdma_page(pages[i]))
+			*is_p2p = true;
+
 		bvec_set_page(&bvec[nr_bvecs], pages[i], size, offset);
 		offset = 0;
 		nr_bvecs++;
···
 int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
-	unsigned int align = blk_lim_dma_alignment_and_pad(&q->limits);
 	struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
 	struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
+	iov_iter_extraction_t extraction_flags = 0;
 	size_t offset, bytes = iter->count;
+	bool copy, is_p2p = false;
 	unsigned int nr_bvecs;
 	int ret, nr_vecs;
-	bool copy;

 	if (bio_integrity(bio))
 		return -EINVAL;
···
 		pages = NULL;
 	}

-	copy = !iov_iter_is_aligned(iter, align, align);
-	ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset);
+	copy = iov_iter_alignment(iter) &
+			blk_lim_dma_alignment_and_pad(&q->limits);
+
+	if (blk_queue_pci_p2pdma(q))
+		extraction_flags |= ITER_ALLOW_P2PDMA;
+
+	ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs,
+					extraction_flags, &offset);
 	if (unlikely(ret < 0))
 		goto free_bvec;

-	nr_bvecs = bvec_from_pages(bvec, pages, nr_vecs, bytes, offset);
+	nr_bvecs = bvec_from_pages(bvec, pages, nr_vecs, bytes, offset,
+				   &is_p2p);
 	if (pages != stack_pages)
 		kvfree(pages);
 	if (nr_bvecs > queue_max_integrity_segments(q))
 		copy = true;
+	if (is_p2p)
+		bio->bi_opf |= REQ_NOMERGE;

 	if (copy)
 		ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes);
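
Note on the new copy decision above: the user buffer is bounced through a copy whenever any bit of its combined alignment overlaps the queue's alignment mask. The sketch below illustrates that test in user space. It assumes that the alignment of an iterator is the bitwise OR of every segment's base address and length, which is how iov_iter_alignment() is understood to behave for plain iovecs, and that the queue limit is a low-bits mask such as 511. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one user segment (base address and length). */
struct demo_seg {
	unsigned long base;
	unsigned long len;
};

/* OR together every base and length: a set low bit means "misaligned somewhere". */
unsigned long demo_segs_alignment(const struct demo_seg *segs, size_t n)
{
	unsigned long res = 0;
	size_t i;

	assert(segs != NULL || n == 0);
	for (i = 0; i < n; i++)
		res |= segs[i].base | segs[i].len;
	return res;
}

/* Copy is needed if any accumulated bit falls inside the alignment mask. */
bool demo_needs_copy(const struct demo_seg *segs, size_t n, unsigned long mask)
{
	return (demo_segs_alignment(segs, n) & mask) != 0;
}
```

A single segment that starts 4 bytes past a 512-byte boundary is enough to force the copy path with a mask of 511, while the same buffer passes with a mask of 3.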

+50 -28

block/bio.c

···
 	bio->bi_private = NULL;
 #ifdef CONFIG_BLK_CGROUP
 	bio->bi_blkg = NULL;
-	bio->bi_issue.value = 0;
+	bio->issue_time_ns = 0;
 	if (bdev)
 		bio_associate_blkg(bio);
 #ifdef CONFIG_BLK_CGROUP_IOCOST
···
 	cache->nr--;
 	put_cpu();

-	bio_init(bio, bdev, nr_vecs ? bio->bi_inline_vecs : NULL, nr_vecs, opf);
+	if (nr_vecs)
+		bio_init_inline(bio, bdev, nr_vecs, opf);
+	else
+		bio_init(bio, bdev, NULL, nr_vecs, opf);
 	bio->bi_pool = bs;
 	return bio;
 }
···

 		bio_init(bio, bdev, bvl, nr_vecs, opf);
 	} else if (nr_vecs) {
-		bio_init(bio, bdev, bio->bi_inline_vecs, BIO_INLINE_VECS, opf);
+		bio_init_inline(bio, bdev, BIO_INLINE_VECS, opf);
 	} else {
 		bio_init(bio, bdev, NULL, 0, opf);
 	}
···

 	if (nr_vecs > BIO_MAX_INLINE_VECS)
 		return NULL;
-	return kmalloc(struct_size(bio, bi_inline_vecs, nr_vecs), gfp_mask);
+	return kmalloc(sizeof(*bio) + nr_vecs * sizeof(struct bio_vec),
+			gfp_mask);
 }
 EXPORT_SYMBOL(bio_kmalloc);

···
 	WARN_ON_ONCE(bio_full(bio, len));

 	if (is_pci_p2pdma_page(page))
-		bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
+		bio->bi_opf |= REQ_NOMERGE;

 	bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, off);
 	bio->bi_iter.bi_size += len;
···
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
 		extraction_flags |= ITER_ALLOW_P2PDMA;

-	/*
-	 * Each segment in the iov is required to be a block size multiple.
-	 * However, we may not be able to get the entire segment if it spans
-	 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
-	 * result to ensure the bio's total size is correct. The remainder of
-	 * the iov data will be picked up in the next bio iteration.
-	 */
 	size = iov_iter_extract_pages(iter, &pages,
 				      UINT_MAX - bio->bi_iter.bi_size,
 				      nr_pages, extraction_flags, &offset);
···
 		return size ? size : -EFAULT;

 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
-
-	if (bio->bi_bdev) {
-		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
-		iov_iter_revert(iter, trim);
-		size -= trim;
-	}
-
-	if (unlikely(!size)) {
-		ret = -EFAULT;
-		goto out;
-	}
-
 	for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
 		struct page *page = pages[i];
 		struct folio *folio = page_folio(page);
···
 	return ret;
 }

+/*
+ * Aligns the bio size to the len_align_mask, releasing excessive bio vecs that
+ * __bio_iov_iter_get_pages may have inserted, and reverts the trimmed length
+ * for the next iteration.
+ */
+static int bio_iov_iter_align_down(struct bio *bio, struct iov_iter *iter,
+				   unsigned len_align_mask)
+{
+	size_t nbytes = bio->bi_iter.bi_size & len_align_mask;
+
+	if (!nbytes)
+		return 0;
+
+	iov_iter_revert(iter, nbytes);
+	bio->bi_iter.bi_size -= nbytes;
+	do {
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+		if (nbytes < bv->bv_len) {
+			bv->bv_len -= nbytes;
+			break;
+		}
+
+		bio_release_page(bio, bv->bv_page);
+		bio->bi_vcnt--;
+		nbytes -= bv->bv_len;
+	} while (nbytes);
+
+	if (!bio->bi_vcnt)
+		return -EFAULT;
+	return 0;
+}
+
 /**
- * bio_iov_iter_get_pages - add user or kernel pages to a bio
+ * bio_iov_iter_get_pages_aligned - add user or kernel pages to a bio
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be added
+ * @len_align_mask: the mask to align the total size to, 0 for any length
  *
  * This takes either an iterator pointing to user memory, or one pointing to
  * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
···
  * MM encounters an error pinning the requested pages, it stops. Error
  * is returned only if 0 pages could be pinned.
  */
-int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
+int bio_iov_iter_get_pages_aligned(struct bio *bio, struct iov_iter *iter,
+				   unsigned len_align_mask)
 {
 	int ret = 0;

···
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));

-	return bio->bi_vcnt ? 0 : ret;
+	if (bio->bi_vcnt)
+		return bio_iov_iter_align_down(bio, iter, len_align_mask);
+	return ret;
 }
-EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
+EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages_aligned);

 static void submit_bio_wait_endio(struct bio *bio)
 {
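
Note on bio_iov_iter_align_down() above: the helper trims the bio from the tail until its total size is a multiple of the alignment, shortening the last vector when the excess fits inside it and dropping whole vectors otherwise. The following is a user-space sketch of the same loop on a plain array of segment lengths. Page release and iterator revert are left out, the trimmed byte count is returned through an out parameter instead, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a segment length and a bio-like container. */
struct demo_seg { size_t len; };
struct demo_bio {
	struct demo_seg *vec;	/* segments, in order */
	unsigned int vcnt;	/* number of valid segments */
	size_t size;		/* sum of all segment lengths */
};

/*
 * Trim the tail so that size becomes a multiple of (len_align_mask + 1).
 * Returns 0 on success, -1 if no segment is left. *trimmed receives the
 * number of bytes that would be handed back to the iterator.
 */
int demo_align_down(struct demo_bio *bio, size_t len_align_mask, size_t *trimmed)
{
	size_t nbytes = bio->size & len_align_mask;

	assert(trimmed != NULL);
	*trimmed = nbytes;
	if (!nbytes)
		return 0;

	bio->size -= nbytes;
	do {
		struct demo_seg *s = &bio->vec[bio->vcnt - 1];

		if (nbytes < s->len) {
			s->len -= nbytes;	/* excess fits in the last segment */
			break;
		}
		bio->vcnt--;			/* drop the whole last segment */
		nbytes -= s->len;
	} while (nbytes);

	return bio->vcnt ? 0 : -1;
}
```

With segments of 4096, 4096 and 300 bytes and a mask of 511, the 300-byte tail is dropped entirely and 8192 bytes remain. With 4096 and 1000 bytes, the last segment is shortened to 512 bytes.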

+8 -21

block/blk-cgroup.c

···
 	return task_css(current, io_cgrp_id);
 }

-static bool blkcg_policy_enabled(struct request_queue *q,
-				 const struct blkcg_policy *pol)
-{
-	return pol && test_bit(pol->plid, q->blkcg_pols);
-}
-
 static void blkg_free_workfn(struct work_struct *work)
 {
 	struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
···
 	disk = ctx->bdev->bd_disk;
 	q = disk->queue;

-	/*
-	 * blkcg_deactivate_policy() requires queue to be frozen, we can grab
-	 * q_usage_counter to prevent concurrent with blkcg_deactivate_policy().
-	 */
-	ret = blk_queue_enter(q, 0);
-	if (ret)
-		goto fail;
-
+	/* Prevent concurrent with blkcg_deactivate_policy() */
+	mutex_lock(&q->blkcg_mutex);
 	spin_lock_irq(&q->queue_lock);

 	if (!blkcg_policy_enabled(q, pol)) {
···
 		/* Drop locks to do new blkg allocation with GFP_KERNEL. */
 		spin_unlock_irq(&q->queue_lock);

-		new_blkg = blkg_alloc(pos, disk, GFP_KERNEL);
+		new_blkg = blkg_alloc(pos, disk, GFP_NOIO);
 		if (unlikely(!new_blkg)) {
 			ret = -ENOMEM;
-			goto fail_exit_queue;
+			goto fail_exit;
 		}

 		if (radix_tree_preload(GFP_KERNEL)) {
 			blkg_free(new_blkg);
 			ret = -ENOMEM;
-			goto fail_exit_queue;
+			goto fail_exit;
 		}

 		spin_lock_irq(&q->queue_lock);
···
 			goto success;
 	}
 success:
-	blk_queue_exit(q);
+	mutex_unlock(&q->blkcg_mutex);
 	ctx->blkg = blkg;
 	return 0;

···
 	radix_tree_preload_end();
 fail_unlock:
 	spin_unlock_irq(&q->queue_lock);
-fail_exit_queue:
-	blk_queue_exit(q);
-fail:
+fail_exit:
+	mutex_unlock(&q->blkcg_mutex);
 	/*
 	 * If queue was bypassing, we should retry. Do so after a
 	 * short msleep(). It isn't strictly necessary but queue

+6 -6

block/blk-cgroup.h

···
 		if (((d_blkg) = blkg_lookup(css_to_blkcg(pos_css), \
 					    (p_blkg)->q)))

-static inline void blkcg_bio_issue_init(struct bio *bio)
-{
-	bio_issue_init(&bio->bi_issue, bio_sectors(bio));
-}
-
 static inline void blkcg_use_delay(struct blkcg_gq *blkg)
 {
 	if (WARN_ON_ONCE(atomic_read(&blkg->use_delay) < 0))
···
 		bio_issue_as_root_blkg(rq->bio) == bio_issue_as_root_blkg(bio);
 }

+static inline bool blkcg_policy_enabled(struct request_queue *q,
+					const struct blkcg_policy *pol)
+{
+	return pol && test_bit(pol->plid, q->blkcg_pols);
+}
+
 void blk_cgroup_bio_start(struct bio *bio);
 void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta);
 #else /* CONFIG_BLK_CGROUP */
···
 static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
 static inline void blkg_get(struct blkcg_gq *blkg) { }
 static inline void blkg_put(struct blkcg_gq *blkg) { }
-static inline void blkcg_bio_issue_init(struct bio *bio) { }
 static inline void blk_cgroup_bio_start(struct bio *bio) { }
 static inline bool blk_cgroup_mergeable(struct request *rq, struct bio *bio) { return true; }


+11 -8

block/blk-core.c

···
 	}
 }

-static noinline int should_fail_bio(struct bio *bio)
+int should_fail_bio(struct bio *bio)
 {
 	if (should_fail_request(bdev_whole(bio->bi_bdev), bio->bi_iter.bi_size))
 		return -EIO;
···
 	current->bio_list = NULL;
 }

-void submit_bio_noacct_nocheck(struct bio *bio)
+void submit_bio_noacct_nocheck(struct bio *bio, bool split)
 {
 	blk_cgroup_bio_start(bio);
-	blkcg_bio_issue_init(bio);

 	if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) {
 		trace_block_bio_queue(bio);
···
 	 * to collect a list of requests submited by a ->submit_bio method while
 	 * it is active, and then process them after it returned.
 	 */
-	if (current->bio_list)
-		bio_list_add(&current->bio_list[0], bio);
-	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
+	if (current->bio_list) {
+		if (split)
+			bio_list_add_head(&current->bio_list[0], bio);
+		else
+			bio_list_add(&current->bio_list[0], bio);
+	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
 		__submit_bio_noacct_mq(bio);
-	else
+	} else {
 		__submit_bio_noacct(bio);
+	}
 }

 static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
···

 	if (blk_throtl_bio(bio))
 		return;
-	submit_bio_noacct_nocheck(bio);
+	submit_bio_noacct_nocheck(bio, false);
 	return;

 not_supported:
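
Note on the new split argument above: when a bio is submitted from inside a ->submit_bio handler it is parked on current->bio_list, and a bio flagged as coming from a split is now inserted at the head of that list instead of the tail, so it is picked up before anything queued earlier. The sketch below shows the ordering effect on a hypothetical singly linked list in user space. The demo_* names are illustrative and are not the kernel's bio_list helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a bio and a head/tail list of bios. */
struct demo_bio { int id; struct demo_bio *next; };
struct demo_list { struct demo_bio *head; struct demo_bio *tail; };

/* Append at the tail: processed after everything already queued. */
void demo_list_add(struct demo_list *l, struct demo_bio *b)
{
	assert(l != NULL && b != NULL);
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

/* Insert at the head: processed before everything already queued. */
void demo_list_add_head(struct demo_list *l, struct demo_bio *b)
{
	assert(l != NULL && b != NULL);
	b->next = l->head;
	l->head = b;
	if (!l->tail)
		l->tail = b;
}

/* Remove and return the first entry, or NULL when the list is empty. */
struct demo_bio *demo_list_pop(struct demo_list *l)
{
	struct demo_bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
		b->next = NULL;
	}
	return b;
}
```

Queuing bios 1 and 2 at the tail and then bio 3 at the head yields the processing order 3, 1, 2.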

+7 -12

block/blk-crypto-fallback.c

···
 	bio = bio_kmalloc(nr_segs, GFP_NOIO);
 	if (!bio)
 		return NULL;
-	bio_init(bio, bio_src->bi_bdev, bio->bi_inline_vecs, nr_segs,
-		 bio_src->bi_opf);
+	bio_init_inline(bio, bio_src->bi_bdev, nr_segs, bio_src->bi_opf);
 	if (bio_flagged(bio_src, BIO_REMAPPED))
 		bio_set_flag(bio, BIO_REMAPPED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
···
 		if (++i == BIO_MAX_VECS)
 			break;
 	}
-	if (num_sectors < bio_sectors(bio)) {
-		struct bio *split_bio;

-		split_bio = bio_split(bio, num_sectors, GFP_NOIO,
-				      &crypto_bio_split);
-		if (IS_ERR(split_bio)) {
-			bio->bi_status = BLK_STS_RESOURCE;
+	if (num_sectors < bio_sectors(bio)) {
+		bio = bio_submit_split_bioset(bio, num_sectors,
+					      &crypto_bio_split);
+		if (!bio)
 			return false;
-		}
-		bio_chain(split_bio, bio);
-		submit_bio_noacct(bio);
-		*bio_ptr = split_bio;
+
+		*bio_ptr = bio;
 	}

 	return true;

-58

block/blk-integrity.c

···
 			   NULL);
 }

-/**
- * blk_rq_map_integrity_sg - Map integrity metadata into a scatterlist
- * @rq: request to map
- * @sglist: target scatterlist
- *
- * Description: Map the integrity vectors in request into a
- * scatterlist. The scatterlist must be big enough to hold all
- * elements. I.e. sized using blk_rq_count_integrity_sg() or
- * rq->nr_integrity_segments.
- */
-int blk_rq_map_integrity_sg(struct request *rq, struct scatterlist *sglist)
-{
-	struct bio_vec iv, ivprv = { NULL };
-	struct request_queue *q = rq->q;
-	struct scatterlist *sg = NULL;
-	struct bio *bio = rq->bio;
-	unsigned int segments = 0;
-	struct bvec_iter iter;
-	int prev = 0;
-
-	bio_for_each_integrity_vec(iv, bio, iter) {
-		if (prev) {
-			if (!biovec_phys_mergeable(q, &ivprv, &iv))
-				goto new_segment;
-			if (sg->length + iv.bv_len > queue_max_segment_size(q))
-				goto new_segment;
-
-			sg->length += iv.bv_len;
-		} else {
-new_segment:
-			if (!sg)
-				sg = sglist;
-			else {
-				sg_unmark_end(sg);
-				sg = sg_next(sg);
-			}
-
-			sg_set_page(sg, iv.bv_page, iv.bv_len, iv.bv_offset);
-			segments++;
-		}
-
-		prev = 1;
-		ivprv = iv;
-	}
-
-	if (sg)
-		sg_mark_end(sg);
-
-	/*
-	 * Something must have been wrong if the figured number of segment
-	 * is bigger than number of req's physical integrity segments
-	 */
-	BUG_ON(segments > rq->nr_integrity_segments);
-	BUG_ON(segments > queue_max_integrity_segments(q));
-	return segments;
-}
-EXPORT_SYMBOL(blk_rq_map_integrity_sg);
-
 int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf,
 			      ssize_t bytes)
 {

+8 -11

block/blk-iolatency.c

···
 	mod_timer(&blkiolat->timer, jiffies + HZ);
 }

-static void iolatency_record_time(struct iolatency_grp *iolat,
-				  struct bio_issue *issue, u64 now,
-				  bool issue_as_root)
+static void iolatency_record_time(struct iolatency_grp *iolat, u64 start,
+				  u64 now, bool issue_as_root)
 {
-	u64 start = bio_issue_time(issue);
 	u64 req_time;
-
-	/*
-	 * Have to do this so we are truncated to the correct time that our
-	 * issue is truncated to.
-	 */
-	now = __bio_issue_time(now);

 	if (now <= start)
 		return;
···
 	 * submitted, so do not account for it.
 	 */
 	if (iolat->min_lat_nsec && bio->bi_status != BLK_STS_AGAIN) {
-		iolatency_record_time(iolat, &bio->bi_issue, now,
+		iolatency_record_time(iolat, bio->issue_time_ns, now,
 				      issue_as_root);
 		window_start = atomic64_read(&iolat->window_start);
 		if (now > window_start &&
···
 	 */
 	enabled = atomic_read(&blkiolat->enable_cnt);
 	if (enabled != blkiolat->enabled) {
+		struct request_queue *q = blkiolat->rqos.disk->queue;
 		unsigned int memflags;

 		memflags = blk_mq_freeze_queue(blkiolat->rqos.disk->queue);
 		blkiolat->enabled = enabled;
+		if (enabled)
+			blk_queue_flag_set(QUEUE_FLAG_BIO_ISSUE_TIME, q);
+		else
+			blk_queue_flag_clear(QUEUE_FLAG_BIO_ISSUE_TIME, q);
 		blk_mq_unfreeze_queue(blkiolat->rqos.disk->queue, memflags);
 	}
 }

+7 -6

block/blk-map.c

··· 157 157 bio = bio_kmalloc(nr_pages, gfp_mask); 158 158 if (!bio) 159 159 goto out_bmd; 160 - bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, req_op(rq)); 160 + bio_init_inline(bio, NULL, nr_pages, req_op(rq)); 161 161 162 162 if (map_data) { 163 163 nr_pages = 1U << map_data->page_order; ··· 253 253 static struct bio *blk_rq_map_bio_alloc(struct request *rq, 254 254 unsigned int nr_vecs, gfp_t gfp_mask) 255 255 { 256 + struct block_device *bdev = rq->q->disk ? rq->q->disk->part0 : NULL; 256 257 struct bio *bio; 257 258 258 259 if (rq->cmd_flags & REQ_ALLOC_CACHE && (nr_vecs <= BIO_INLINE_VECS)) { 259 - bio = bio_alloc_bioset(NULL, nr_vecs, rq->cmd_flags, gfp_mask, 260 + bio = bio_alloc_bioset(bdev, nr_vecs, rq->cmd_flags, gfp_mask, 260 261 &fs_bio_set); 261 262 if (!bio) 262 263 return NULL; ··· 265 264 bio = bio_kmalloc(nr_vecs, gfp_mask); 266 265 if (!bio) 267 266 return NULL; 268 - bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq)); 267 + bio_init_inline(bio, bdev, nr_vecs, req_op(rq)); 269 268 } 270 269 return bio; 271 270 } ··· 327 326 bio = bio_kmalloc(nr_vecs, gfp_mask); 328 327 if (!bio) 329 328 return ERR_PTR(-ENOMEM); 330 - bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, op); 329 + bio_init_inline(bio, NULL, nr_vecs, op); 331 330 if (is_vmalloc_addr(data)) { 332 331 bio->bi_private = data; 333 332 if (!bio_add_vmalloc(bio, data, len)) { ··· 393 392 bio = bio_kmalloc(nr_pages, gfp_mask); 394 393 if (!bio) 395 394 return ERR_PTR(-ENOMEM); 396 - bio_init(bio, NULL, bio->bi_inline_vecs, nr_pages, op); 395 + bio_init_inline(bio, NULL, nr_pages, op); 397 396 398 397 while (len) { 399 398 struct page *page; ··· 444 443 int ret; 445 444 446 445 /* check that the data layout matches the hardware restrictions */ 447 - ret = bio_split_rw_at(bio, lim, &nr_segs, max_bytes); 446 + ret = bio_split_io_at(bio, lim, &nr_segs, max_bytes, 0); 448 447 if (ret) { 449 448 /* if we would have to split the bio, copy instead */ 450 449 if (ret > 0)

+61 -24

block/blk-merge.c

··· 104 104 return round_down(UINT_MAX, lim->logical_block_size) >> SECTOR_SHIFT; 105 105 } 106 106 107 + /* 108 + * bio_submit_split_bioset - Submit a bio, splitting it at a designated sector 109 + * @bio: the original bio to be submitted and split 110 + * @split_sectors: the sector count at which to split 111 + * @bs: the bio set used for allocating the new split bio 112 + * 113 + * The original bio is modified to contain the remaining sectors and submitted. 114 + * The caller is responsible for submitting the returned bio. 115 + * 116 + * If succeed, the newly allocated bio representing the initial part will be 117 + * returned, on failure NULL will be returned and original bio will fail. 118 + */ 119 + struct bio *bio_submit_split_bioset(struct bio *bio, unsigned int split_sectors, 120 + struct bio_set *bs) 121 + { 122 + struct bio *split = bio_split(bio, split_sectors, GFP_NOIO, bs); 123 + 124 + if (IS_ERR(split)) { 125 + bio->bi_status = errno_to_blk_status(PTR_ERR(split)); 126 + bio_endio(bio); 127 + return NULL; 128 + } 129 + 130 + bio_chain(split, bio); 131 + trace_block_split(split, bio->bi_iter.bi_sector); 132 + WARN_ON_ONCE(bio_zone_write_plugging(bio)); 133 + 134 + if (should_fail_bio(bio)) 135 + bio_io_error(bio); 136 + else if (!blk_throtl_bio(bio)) 137 + submit_bio_noacct_nocheck(bio, true); 138 + 139 + return split; 140 + } 141 + EXPORT_SYMBOL_GPL(bio_submit_split_bioset); 142 + 107 143 static struct bio *bio_submit_split(struct bio *bio, int split_sectors) 108 144 { 109 - if (unlikely(split_sectors < 0)) 110 - goto error; 145 + if (unlikely(split_sectors < 0)) { 146 + bio->bi_status = errno_to_blk_status(split_sectors); 147 + bio_endio(bio); 148 + return NULL; 149 + } 111 150 112 151 if (split_sectors) { 113 - struct bio *split; 114 - 115 - split = bio_split(bio, split_sectors, GFP_NOIO, 152 + bio = bio_submit_split_bioset(bio, split_sectors, 116 153 &bio->bi_bdev->bd_disk->bio_split); 117 - if (IS_ERR(split)) { 118 - split_sectors = 
PTR_ERR(split); 119 - goto error; 120 - } 121 - split->bi_opf |= REQ_NOMERGE; 122 - blkcg_bio_issue_init(split); 123 - bio_chain(split, bio); 124 - trace_block_split(split, bio->bi_iter.bi_sector); 125 - WARN_ON_ONCE(bio_zone_write_plugging(bio)); 126 - submit_bio_noacct(bio); 127 - return split; 154 + if (bio) 155 + bio->bi_opf |= REQ_NOMERGE; 128 156 } 129 157 130 158 return bio; 131 - error: 132 - bio->bi_status = errno_to_blk_status(split_sectors); 133 - bio_endio(bio); 134 - return NULL; 135 159 } 136 160 137 161 struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim, ··· 303 279 } 304 280 305 281 /** 306 - * bio_split_rw_at - check if and where to split a read/write bio 282 + * bio_split_io_at - check if and where to split a bio 307 283 * @bio: [in] bio to be split 308 284 * @lim: [in] queue limits to split based on 309 285 * @segs: [out] number of segments in the bio with the first half of the sectors 310 286 * @max_bytes: [in] maximum number of bytes per bio 287 + * @len_align_mask: [in] length alignment mask for each vector 311 288 * 312 289 * Find out if @bio needs to be split to fit the queue limits in @lim and a 313 290 * maximum size of @max_bytes. Returns a negative error number if @bio can't be 314 291 * split, 0 if the bio doesn't have to be split, or a positive sector offset if 315 292 * @bio needs to be split. 
316 293 */ 317 - int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim, 318 - unsigned *segs, unsigned max_bytes) 294 + int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, 295 + unsigned *segs, unsigned max_bytes, unsigned len_align_mask) 319 296 { 320 297 struct bio_vec bv, bvprv, *bvprvp = NULL; 321 298 struct bvec_iter iter; 322 299 unsigned nsegs = 0, bytes = 0; 323 300 324 301 bio_for_each_bvec(bv, bio, iter) { 302 + if (bv.bv_offset & lim->dma_alignment || 303 + bv.bv_len & len_align_mask) 304 + return -EINVAL; 305 + 325 306 /* 326 307 * If the queue doesn't support SG gaps and adding this 327 308 * offset would create a gap, disallow it. ··· 368 339 * Individual bvecs might not be logical block aligned. Round down the 369 340 * split size so that each bio is properly block size aligned, even if 370 341 * we do not use the full hardware limits. 342 + * 343 + * It is possible to submit a bio that can't be split into a valid io: 344 + * there may either be too many discontiguous vectors for the max 345 + * segments limit, or contain virtual boundary gaps without having a 346 + * valid block sized split. A zero byte result means one of those 347 + * conditions occurred. 371 348 */ 372 349 bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim)); 350 + if (!bytes) 351 + return -EINVAL; 373 352 374 353 /* 375 354 * Bio splitting may cause subtle trouble such as hang when doing sync ··· 387 350 bio_clear_polled(bio); 388 351 return bytes >> SECTOR_SHIFT; 389 352 } 390 - EXPORT_SYMBOL_GPL(bio_split_rw_at); 353 + EXPORT_SYMBOL_GPL(bio_split_io_at); 391 354 392 355 struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim, 393 356 unsigned *nr_segs)
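
Review note: the return contract of bio_split_io_at() (negative errno, 0 for "no split needed", or a positive sector offset) together with the two new -EINVAL paths can be sketched as a small userspace model. This is not the kernel implementation. The names `struct vec`, `struct limits` and `model_split_io_at` are hypothetical, every vector is assumed to count as exactly one segment, and SG-gap and page-boundary handling are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SECTOR_SHIFT 9

/* Hypothetical stand-ins for struct bio_vec and struct queue_limits. */
struct vec {
	unsigned offset;
	unsigned len;
};

struct limits {
	unsigned dma_alignment;   /* alignment mask, e.g. 511 */
	unsigned max_segments;
	unsigned split_alignment; /* power of two, in bytes */
};

/*
 * Simplified model: returns -EINVAL for a vector that violates the
 * alignment masks or when no block-aligned split exists, 0 when
 * everything fits, or the split point in 512-byte sectors.
 */
int model_split_io_at(const struct vec *v, size_t n, const struct limits *lim,
		      unsigned max_bytes, unsigned len_align_mask)
{
	unsigned nsegs = 0, bytes = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* New in this diff: reject misaligned offsets and lengths. */
		if ((v[i].offset & lim->dma_alignment) ||
		    (v[i].len & len_align_mask))
			return -EINVAL;

		if (nsegs == lim->max_segments)
			goto split;
		if (v[i].len > max_bytes - bytes) {
			/* Only part of this vector fits under max_bytes. */
			bytes = max_bytes;
			goto split;
		}
		bytes += v[i].len;
		nsegs++;
	}
	return 0;

split:
	/* Round down to the split alignment, as ALIGN_DOWN() does. */
	bytes &= ~(lim->split_alignment - 1);
	if (!bytes)
		return -EINVAL; /* new in this diff: no valid split point */
	return bytes >> SECTOR_SHIFT;
}
```

With a one-segment limit and two 256-byte vectors, the model rounds 256 bytes down to zero against a 512-byte split alignment and returns -EINVAL, which is the "zero byte result" case the added comment describes.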

+1

block/blk-mq-debugfs.c

··· 96 96 QUEUE_FLAG_NAME(DISABLE_WBT_DEF), 97 97 QUEUE_FLAG_NAME(NO_ELV_SWITCH), 98 98 QUEUE_FLAG_NAME(QOS_ENABLED), 99 + QUEUE_FLAG_NAME(BIO_ISSUE_TIME), 99 100 }; 100 101 #undef QUEUE_FLAG_NAME 101 102

+218 -64

block/blk-mq-dma.c

··· 2 2 /* 3 3 * Copyright (C) 2025 Christoph Hellwig 4 4 */ 5 + #include <linux/blk-integrity.h> 5 6 #include <linux/blk-mq-dma.h> 6 7 #include "blk.h" 7 8 ··· 11 10 u32 len; 12 11 }; 13 12 14 - static bool blk_map_iter_next(struct request *req, struct req_iterator *iter, 13 + static bool __blk_map_iter_next(struct blk_map_iter *iter) 14 + { 15 + if (iter->iter.bi_size) 16 + return true; 17 + if (!iter->bio || !iter->bio->bi_next) 18 + return false; 19 + 20 + iter->bio = iter->bio->bi_next; 21 + if (iter->is_integrity) { 22 + iter->iter = bio_integrity(iter->bio)->bip_iter; 23 + iter->bvecs = bio_integrity(iter->bio)->bip_vec; 24 + } else { 25 + iter->iter = iter->bio->bi_iter; 26 + iter->bvecs = iter->bio->bi_io_vec; 27 + } 28 + return true; 29 + } 30 + 31 + static bool blk_map_iter_next(struct request *req, struct blk_map_iter *iter, 15 32 struct phys_vec *vec) 16 33 { 17 34 unsigned int max_size; 18 35 struct bio_vec bv; 19 36 20 - if (req->rq_flags & RQF_SPECIAL_PAYLOAD) { 21 - if (!iter->bio) 22 - return false; 23 - vec->paddr = bvec_phys(&req->special_vec); 24 - vec->len = req->special_vec.bv_len; 25 - iter->bio = NULL; 26 - return true; 27 - } 28 - 29 37 if (!iter->iter.bi_size) 30 38 return false; 31 39 32 - bv = mp_bvec_iter_bvec(iter->bio->bi_io_vec, iter->iter); 40 + bv = mp_bvec_iter_bvec(iter->bvecs, iter->iter); 33 41 vec->paddr = bvec_phys(&bv); 34 42 max_size = get_max_segment_size(&req->q->limits, vec->paddr, UINT_MAX); 35 43 bv.bv_len = min(bv.bv_len, max_size); 36 - bio_advance_iter_single(iter->bio, &iter->iter, bv.bv_len); 44 + bvec_iter_advance_single(iter->bvecs, &iter->iter, bv.bv_len); 37 45 38 46 /* 39 47 * If we are entirely done with this bi_io_vec entry, check if the next ··· 52 42 while (!iter->iter.bi_size || !iter->iter.bi_bvec_done) { 53 43 struct bio_vec next; 54 44 55 - if (!iter->iter.bi_size) { 56 - if (!iter->bio->bi_next) 57 - break; 58 - iter->bio = iter->bio->bi_next; 59 - iter->iter = iter->bio->bi_iter; 60 - } 45 + if 
(!__blk_map_iter_next(iter)) 46 + break; 61 47 62 - next = mp_bvec_iter_bvec(iter->bio->bi_io_vec, iter->iter); 48 + next = mp_bvec_iter_bvec(iter->bvecs, iter->iter); 63 49 if (bv.bv_len + next.bv_len > max_size || 64 50 !biovec_phys_mergeable(req->q, &bv, &next)) 65 51 break; 66 52 67 53 bv.bv_len += next.bv_len; 68 - bio_advance_iter_single(iter->bio, &iter->iter, next.bv_len); 54 + bvec_iter_advance_single(iter->bvecs, &iter->iter, next.bv_len); 69 55 } 70 56 71 57 vec->len = bv.bv_len; ··· 131 125 return true; 132 126 } 133 127 128 + static inline void blk_rq_map_iter_init(struct request *rq, 129 + struct blk_map_iter *iter) 130 + { 131 + struct bio *bio = rq->bio; 132 + 133 + if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) { 134 + *iter = (struct blk_map_iter) { 135 + .bvecs = &rq->special_vec, 136 + .iter = { 137 + .bi_size = rq->special_vec.bv_len, 138 + } 139 + }; 140 + } else if (bio) { 141 + *iter = (struct blk_map_iter) { 142 + .bio = bio, 143 + .bvecs = bio->bi_io_vec, 144 + .iter = bio->bi_iter, 145 + }; 146 + } else { 147 + /* the internal flush request may not have bio attached */ 148 + *iter = (struct blk_map_iter) {}; 149 + } 150 + } 151 + 152 + static bool blk_dma_map_iter_start(struct request *req, struct device *dma_dev, 153 + struct dma_iova_state *state, struct blk_dma_iter *iter, 154 + unsigned int total_len) 155 + { 156 + struct phys_vec vec; 157 + 158 + memset(&iter->p2pdma, 0, sizeof(iter->p2pdma)); 159 + iter->status = BLK_STS_OK; 160 + 161 + /* 162 + * Grab the first segment ASAP because we'll need it to check for P2P 163 + * transfers. 
164 + */ 165 + if (!blk_map_iter_next(req, &iter->iter, &vec)) 166 + return false; 167 + 168 + switch (pci_p2pdma_state(&iter->p2pdma, dma_dev, 169 + phys_to_page(vec.paddr))) { 170 + case PCI_P2PDMA_MAP_BUS_ADDR: 171 + if (iter->iter.is_integrity) 172 + bio_integrity(req->bio)->bip_flags |= BIP_P2P_DMA; 173 + else 174 + req->cmd_flags |= REQ_P2PDMA; 175 + return blk_dma_map_bus(iter, &vec); 176 + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: 177 + /* 178 + * P2P transfers through the host bridge are treated the 179 + * same as non-P2P transfers below and during unmap. 180 + */ 181 + case PCI_P2PDMA_MAP_NONE: 182 + break; 183 + default: 184 + iter->status = BLK_STS_INVAL; 185 + return false; 186 + } 187 + 188 + if (blk_can_dma_map_iova(req, dma_dev) && 189 + dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len)) 190 + return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec); 191 + return blk_dma_map_direct(req, dma_dev, iter, &vec); 192 + } 193 + 134 194 /** 135 195 * blk_rq_dma_map_iter_start - map the first DMA segment for a request 136 196 * @req: request to map ··· 222 150 bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev, 223 151 struct dma_iova_state *state, struct blk_dma_iter *iter) 224 152 { 225 - unsigned int total_len = blk_rq_payload_bytes(req); 226 - struct phys_vec vec; 227 - 228 - iter->iter.bio = req->bio; 229 - iter->iter.iter = req->bio->bi_iter; 230 - memset(&iter->p2pdma, 0, sizeof(iter->p2pdma)); 231 - iter->status = BLK_STS_OK; 232 - 233 - /* 234 - * Grab the first segment ASAP because we'll need it to check for P2P 235 - * transfers. 
236 - */ 237 - if (!blk_map_iter_next(req, &iter->iter, &vec)) 238 - return false; 239 - 240 - if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA)) { 241 - switch (pci_p2pdma_state(&iter->p2pdma, dma_dev, 242 - phys_to_page(vec.paddr))) { 243 - case PCI_P2PDMA_MAP_BUS_ADDR: 244 - return blk_dma_map_bus(iter, &vec); 245 - case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: 246 - /* 247 - * P2P transfers through the host bridge are treated the 248 - * same as non-P2P transfers below and during unmap. 249 - */ 250 - req->cmd_flags &= ~REQ_P2PDMA; 251 - break; 252 - default: 253 - iter->status = BLK_STS_INVAL; 254 - return false; 255 - } 256 - } 257 - 258 - if (blk_can_dma_map_iova(req, dma_dev) && 259 - dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len)) 260 - return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec); 261 - return blk_dma_map_direct(req, dma_dev, iter, &vec); 153 + blk_rq_map_iter_init(req, &iter->iter); 154 + return blk_dma_map_iter_start(req, dma_dev, state, iter, 155 + blk_rq_payload_bytes(req)); 262 156 } 263 157 EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start); 264 158 ··· 284 246 int __blk_rq_map_sg(struct request *rq, struct scatterlist *sglist, 285 247 struct scatterlist **last_sg) 286 248 { 287 - struct req_iterator iter = { 288 - .bio = rq->bio, 289 - }; 249 + struct blk_map_iter iter; 290 250 struct phys_vec vec; 291 251 int nsegs = 0; 292 252 293 - /* the internal flush request may not have bio attached */ 294 - if (iter.bio) 295 - iter.iter = iter.bio->bi_iter; 296 - 253 + blk_rq_map_iter_init(rq, &iter); 297 254 while (blk_map_iter_next(rq, &iter, &vec)) { 298 255 *last_sg = blk_next_sg(last_sg, sglist); 299 256 sg_set_page(*last_sg, phys_to_page(vec.paddr), vec.len, ··· 308 275 return nsegs; 309 276 } 310 277 EXPORT_SYMBOL(__blk_rq_map_sg); 278 + 279 + #ifdef CONFIG_BLK_DEV_INTEGRITY 280 + /** 281 + * blk_rq_integrity_dma_map_iter_start - map the first integrity DMA segment 282 + * for a request 283 + * @req: request to map 284 + * 
@dma_dev: device to map to 285 + * @state: DMA IOVA state 286 + * @iter: block layer DMA iterator 287 + * 288 + * Start DMA mapping @req integrity data to @dma_dev. @state and @iter are 289 + * provided by the caller and don't need to be initialized. @state needs to be 290 + * stored for use at unmap time, @iter is only needed at map time. 291 + * 292 + * Returns %false if there is no segment to map, including due to an error, or 293 + * %true if it did map a segment. 294 + * 295 + * If a segment was mapped, the DMA address for it is returned in @iter.addr 296 + * and the length in @iter.len. If no segment was mapped the status code is 297 + * returned in @iter.status. 298 + * 299 + * The caller can call blk_rq_dma_map_coalesce() to check if further segments 300 + * need to be mapped after this, or go straight to blk_rq_dma_map_iter_next() 301 + * to try to map the following segments. 302 + */ 303 + bool blk_rq_integrity_dma_map_iter_start(struct request *req, 304 + struct device *dma_dev, struct dma_iova_state *state, 305 + struct blk_dma_iter *iter) 306 + { 307 + unsigned len = bio_integrity_bytes(&req->q->limits.integrity, 308 + blk_rq_sectors(req)); 309 + struct bio *bio = req->bio; 310 + 311 + iter->iter = (struct blk_map_iter) { 312 + .bio = bio, 313 + .iter = bio_integrity(bio)->bip_iter, 314 + .bvecs = bio_integrity(bio)->bip_vec, 315 + .is_integrity = true, 316 + }; 317 + return blk_dma_map_iter_start(req, dma_dev, state, iter, len); 318 + } 319 + EXPORT_SYMBOL_GPL(blk_rq_integrity_dma_map_iter_start); 320 + 321 + /** 322 + * blk_rq_integrity_dma_map_iter_next - map the next integrity DMA segment for 323 + * a request 324 + * @req: request to map 325 + * @dma_dev: device to map to 326 + * @state: DMA IOVA state 327 + * @iter: block layer DMA iterator 328 + * 329 + * Iterate to the next integrity mapping after a previous call to 330 + * blk_rq_integrity_dma_map_iter_start(). See there for a detailed description 331 + * of the arguments.
332 + * 333 + * Returns %false if there is no segment to map, including due to an error, or 334 + * %true if it did map a segment. 335 + * 336 + * If a segment was mapped, the DMA address for it is returned in @iter.addr and 337 + * the length in @iter.len. If no segment was mapped the status code is 338 + * returned in @iter.status. 339 + */ 340 + bool blk_rq_integrity_dma_map_iter_next(struct request *req, 341 + struct device *dma_dev, struct blk_dma_iter *iter) 342 + { 343 + struct phys_vec vec; 344 + 345 + if (!blk_map_iter_next(req, &iter->iter, &vec)) 346 + return false; 347 + 348 + if (iter->p2pdma.map == PCI_P2PDMA_MAP_BUS_ADDR) 349 + return blk_dma_map_bus(iter, &vec); 350 + return blk_dma_map_direct(req, dma_dev, iter, &vec); 351 + } 352 + EXPORT_SYMBOL_GPL(blk_rq_integrity_dma_map_iter_next); 353 + 354 + /** 355 + * blk_rq_map_integrity_sg - Map integrity metadata into a scatterlist 356 + * @rq: request to map 357 + * @sglist: target scatterlist 358 + * 359 + * Description: Map the integrity vectors in request into a 360 + * scatterlist. The scatterlist must be big enough to hold all 361 + * elements. I.e. sized using blk_rq_count_integrity_sg() or 362 + * rq->nr_integrity_segments. 
363 + */ 364 + int blk_rq_map_integrity_sg(struct request *rq, struct scatterlist *sglist) 365 + { 366 + struct request_queue *q = rq->q; 367 + struct scatterlist *sg = NULL; 368 + struct bio *bio = rq->bio; 369 + unsigned int segments = 0; 370 + struct phys_vec vec; 371 + 372 + struct blk_map_iter iter = { 373 + .bio = bio, 374 + .iter = bio_integrity(bio)->bip_iter, 375 + .bvecs = bio_integrity(bio)->bip_vec, 376 + .is_integrity = true, 377 + }; 378 + 379 + while (blk_map_iter_next(rq, &iter, &vec)) { 380 + sg = blk_next_sg(&sg, sglist); 381 + sg_set_page(sg, phys_to_page(vec.paddr), vec.len, 382 + offset_in_page(vec.paddr)); 383 + segments++; 384 + } 385 + 386 + if (sg) 387 + sg_mark_end(sg); 388 + 389 + /* 390 + * Something must have been wrong if the figured number of segment 391 + * is bigger than number of req's physical integrity segments 392 + */ 393 + BUG_ON(segments > rq->nr_integrity_segments); 394 + BUG_ON(segments > queue_max_integrity_segments(q)); 395 + return segments; 396 + } 397 + EXPORT_SYMBOL(blk_rq_map_integrity_sg); 398 + #endif
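
Review note: blk_map_iter_next() now serves both data and integrity vectors, capping each segment at the maximum segment size and merging the following vectors while they stay mergeable. A minimal userspace sketch of that coalescing idea follows. `struct pvec` and `model_map_segments` are hypothetical names, "mergeable" is reduced to plain physical contiguity (the kernel's biovec_phys_mergeable() checks more than that), every input length is assumed to be non-zero, and `out` is assumed to be large enough.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct phys_vec. */
struct pvec {
	unsigned long long paddr;
	unsigned len;
};

/*
 * Simplified model: emit segments of at most max_size bytes, merging a
 * fully consumed vector with the vectors after it while they are
 * physically contiguous and the merged length stays within max_size.
 * Returns the number of segments written to out.
 */
size_t model_map_segments(const struct pvec *in, size_t n, unsigned max_size,
			  struct pvec *out)
{
	size_t i = 0, nout = 0;
	unsigned done = 0; /* bytes already consumed from in[i] */

	while (i < n) {
		struct pvec cur = { in[i].paddr + done, in[i].len - done };

		if (cur.len > max_size) {
			/* Partially consume this vector; do not merge. */
			cur.len = max_size;
			done += max_size;
		} else {
			done = 0;
			i++;
			while (i < n &&
			       in[i].paddr == cur.paddr + cur.len &&
			       cur.len + in[i].len <= max_size) {
				cur.len += in[i].len;
				i++;
			}
		}
		out[nout++] = cur;
	}
	return nout;
}
```

Because blk_rq_map_integrity_sg() is rebuilt on the same iterator, the integrity scatterlist should follow the same capping and merging rules as the data path rather than a separate hand-rolled loop.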

+5 -9

block/blk-mq-sched.c

··· 454 454 } 455 455 456 456 struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, 457 - unsigned int nr_hw_queues) 457 + unsigned int nr_hw_queues, unsigned int nr_requests) 458 458 { 459 459 unsigned int nr_tags; 460 460 int i; ··· 470 470 nr_tags * sizeof(struct blk_mq_tags *), gfp); 471 471 if (!et) 472 472 return NULL; 473 - /* 474 - * Default to double of smaller one between hw queue_depth and 475 - * 128, since we don't split into sync/async like the old code 476 - * did. Additionally, this is a per-hw queue depth. 477 - */ 478 - et->nr_requests = 2 * min_t(unsigned int, set->queue_depth, 479 - BLKDEV_DEFAULT_RQ); 473 + 474 + et->nr_requests = nr_requests; 480 475 et->nr_hw_queues = nr_hw_queues; 481 476 482 477 if (blk_mq_is_shared_tags(set->flags)) { ··· 516 521 * concurrently. 517 522 */ 518 523 if (q->elevator) { 519 - et = blk_mq_alloc_sched_tags(set, nr_hw_queues); 524 + et = blk_mq_alloc_sched_tags(set, nr_hw_queues, 525 + blk_mq_default_nr_requests(set)); 520 526 if (!et) 521 527 goto out_unwind; 522 528 if (xa_insert(et_table, q->id, et, gfp))
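
Review note: the removed comment states the old default as double the smaller of the hardware queue depth and 128. The arithmetic is sketched below for reference. `model_default_nr_requests` is a hypothetical name, and whether blk_mq_default_nr_requests(), whose body is not part of this hunk, computes exactly this value is an assumption.

```c
#include <assert.h>

#define BLKDEV_DEFAULT_RQ 128

/*
 * Model of the formula in the removed comment:
 * 2 * min(hw queue depth, 128), applied per hardware queue.
 */
unsigned int model_default_nr_requests(unsigned int queue_depth)
{
	unsigned int base = queue_depth < BLKDEV_DEFAULT_RQ ?
			    queue_depth : BLKDEV_DEFAULT_RQ;

	return 2 * base;
}
```

Under this formula a device with a queue depth of 32 gets 64 scheduler requests per hardware queue, and anything at or above a depth of 128 is capped at 256.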

+12 -1

block/blk-mq-sched.h

··· 24 24 void blk_mq_sched_free_rqs(struct request_queue *q); 25 25 26 26 struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, 27 - unsigned int nr_hw_queues); 27 + unsigned int nr_hw_queues, unsigned int nr_requests); 28 28 int blk_mq_alloc_sched_tags_batch(struct xarray *et_table, 29 29 struct blk_mq_tag_set *set, unsigned int nr_hw_queues); 30 30 void blk_mq_free_sched_tags(struct elevator_tags *et, ··· 90 90 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx) 91 91 { 92 92 return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state); 93 + } 94 + 95 + static inline void blk_mq_set_min_shallow_depth(struct request_queue *q, 96 + unsigned int depth) 97 + { 98 + struct blk_mq_hw_ctx *hctx; 99 + unsigned long i; 100 + 101 + queue_for_each_hw_ctx(q, hctx, i) 102 + sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags, 103 + depth); 93 104 } 94 105 95 106 #endif

+4 -3

block/blk-mq-sysfs.c

··· 34 34 struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, 35 35 kobj); 36 36 37 - blk_free_flush_queue(hctx->fq); 38 37 sbitmap_free(&hctx->ctx_map); 39 38 free_cpumask_var(hctx->cpumask); 40 39 kfree(hctx->ctxs); ··· 149 150 return; 150 151 151 152 hctx_for_each_ctx(hctx, ctx, i) 152 - kobject_del(&ctx->kobj); 153 + if (ctx->kobj.state_in_sysfs) 154 + kobject_del(&ctx->kobj); 153 155 154 - kobject_del(&hctx->kobj); 156 + if (hctx->kobj.state_in_sysfs) 157 + kobject_del(&hctx->kobj); 155 158 } 156 159 157 160 static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)

+54 -74

block/blk-mq-tag.c

··· 8 8 */ 9 9 #include <linux/kernel.h> 10 10 #include <linux/module.h> 11 + #include <linux/slab.h> 12 + #include <linux/mm.h> 13 + #include <linux/kmemleak.h> 11 14 12 15 #include <linux/delay.h> 13 16 #include "blk.h" ··· 256 253 unsigned int bitnr) 257 254 { 258 255 struct request *rq; 259 - unsigned long flags; 260 256 261 - spin_lock_irqsave(&tags->lock, flags); 262 257 rq = tags->rqs[bitnr]; 263 258 if (!rq || rq->tag != bitnr || !req_ref_inc_not_zero(rq)) 264 259 rq = NULL; 265 - spin_unlock_irqrestore(&tags->lock, flags); 266 260 return rq; 267 261 } 268 262 ··· 297 297 /** 298 298 * bt_for_each - iterate over the requests associated with a hardware queue 299 299 * @hctx: Hardware queue to examine. 300 - * @q: Request queue to examine. 300 + * @q: Request queue @hctx is associated with (@hctx->queue). 301 301 * @bt: sbitmap to examine. This is either the breserved_tags member 302 302 * or the bitmap_tags member of struct blk_mq_tags. 303 303 * @fn: Pointer to the function that will be called for each request 304 304 * associated with @hctx that has been assigned a driver tag. 305 - * @fn will be called as follows: @fn(@hctx, rq, @data, @reserved) 306 - * where rq is a pointer to a request. Return true to continue 307 - * iterating tags, false to stop. 308 - * @data: Will be passed as third argument to @fn. 305 + * @fn will be called as follows: @fn(rq, @data) where rq is a 306 + * pointer to a request. Return %true to continue iterating tags; 307 + * %false to stop. 308 + * @data: Will be passed as second argument to @fn. 309 309 * @reserved: Indicates whether @bt is the breserved_tags member or the 310 310 * bitmap_tags member of struct blk_mq_tags. 311 311 */ ··· 371 371 * @bt: sbitmap to examine. This is either the breserved_tags member 372 372 * or the bitmap_tags member of struct blk_mq_tags. 373 373 * @fn: Pointer to the function that will be called for each started 374 - * request. 
@fn will be called as follows: @fn(rq, @data, 375 - * @reserved) where rq is a pointer to a request. Return true 376 - * to continue iterating tags, false to stop. 374 + * request. @fn will be called as follows: @fn(rq, @data) where rq 375 + * is a pointer to a request. Return %true to continue iterating 376 + * tags; %false to stop. 377 377 * @data: Will be passed as second argument to @fn. 378 378 * @flags: BT_TAG_ITER_* 379 379 */ ··· 406 406 * blk_mq_all_tag_iter - iterate over all requests in a tag map 407 407 * @tags: Tag map to iterate over. 408 408 * @fn: Pointer to the function that will be called for each 409 - * request. @fn will be called as follows: @fn(rq, @priv, 410 - * reserved) where rq is a pointer to a request. 'reserved' 411 - * indicates whether or not @rq is a reserved request. Return 412 - * true to continue iterating tags, false to stop. 409 + * request. @fn will be called as follows: @fn(rq, @priv) where rq 410 + * is a pointer to a request. Return %true to continue iterating 411 + * tags; %false to stop. 413 412 * @priv: Will be passed as second argument to @fn. 414 413 * 415 414 * Caller has to pass the tag map from which requests are allocated. ··· 423 424 * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set 424 425 * @tagset: Tag set to iterate over. 425 426 * @fn: Pointer to the function that will be called for each started 426 - * request. @fn will be called as follows: @fn(rq, @priv, 427 - * reserved) where rq is a pointer to a request. 'reserved' 428 - * indicates whether or not @rq is a reserved request. Return 429 - * true to continue iterating tags, false to stop. 427 + * request. @fn will be called as follows: @fn(rq, @priv) where 428 + * rq is a pointer to a request. Return true to continue iterating 429 + * tags, false to stop. 430 430 * @priv: Will be passed as second argument to @fn. 
431 431 * 432 432 * We grab one request reference before calling @fn and release it after ··· 435 437 busy_tag_iter_fn *fn, void *priv) 436 438 { 437 439 unsigned int flags = tagset->flags; 438 - int i, nr_tags; 440 + int i, nr_tags, srcu_idx; 441 + 442 + srcu_idx = srcu_read_lock(&tagset->tags_srcu); 439 443 440 444 nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues; 441 445 ··· 446 446 __blk_mq_all_tag_iter(tagset->tags[i], fn, priv, 447 447 BT_TAG_ITER_STARTED); 448 448 } 449 + srcu_read_unlock(&tagset->tags_srcu, srcu_idx); 449 450 } 450 451 EXPORT_SYMBOL(blk_mq_tagset_busy_iter); 451 452 ··· 484 483 * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag 485 484 * @q: Request queue to examine. 486 485 * @fn: Pointer to the function that will be called for each request 487 - * on @q. @fn will be called as follows: @fn(hctx, rq, @priv, 488 - * reserved) where rq is a pointer to a request and hctx points 489 - * to the hardware queue associated with the request. 'reserved' 490 - * indicates whether or not @rq is a reserved request. 491 - * @priv: Will be passed as third argument to @fn. 486 + * on @q. @fn will be called as follows: @fn(rq, @priv) where rq 487 + * is a pointer to a request and hctx points to the hardware queue 488 + * associated with the request. 489 + * @priv: Will be passed as second argument to @fn. 492 490 * 493 491 * Note: if @q->tag_set is shared with other request queues then @fn will be 494 492 * called for all requests on all queues that share that tag set and not only ··· 496 496 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_tag_iter_fn *fn, 497 497 void *priv) 498 498 { 499 + int srcu_idx; 500 + 499 501 /* 500 502 * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and hctx_table 501 503 * while the queue is frozen. 
So we can use q_usage_counter to avoid ··· 506 504 if (!percpu_ref_tryget(&q->q_usage_counter)) 507 505 return; 508 506 507 + srcu_idx = srcu_read_lock(&q->tag_set->tags_srcu); 509 508 if (blk_mq_is_shared_tags(q->tag_set->flags)) { 510 509 struct blk_mq_tags *tags = q->tag_set->shared_tags; 511 510 struct sbitmap_queue *bresv = &tags->breserved_tags; ··· 536 533 bt_for_each(hctx, q, btags, fn, priv, false); 537 534 } 538 535 } 536 + srcu_read_unlock(&q->tag_set->tags_srcu, srcu_idx); 539 537 blk_queue_exit(q); 540 538 } 541 539 ··· 566 562 tags->nr_tags = total_tags; 567 563 tags->nr_reserved_tags = reserved_tags; 568 564 spin_lock_init(&tags->lock); 565 + INIT_LIST_HEAD(&tags->page_list); 566 + 569 567 if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node)) 570 568 goto out_free_tags; 571 569 if (bt_alloc(&tags->breserved_tags, reserved_tags, round_robin, node)) ··· 582 576 return NULL; 583 577 } 584 578 585 - void blk_mq_free_tags(struct blk_mq_tags *tags) 579 + static void blk_mq_free_tags_callback(struct rcu_head *head) 586 580 { 587 - sbitmap_queue_free(&tags->bitmap_tags); 588 - sbitmap_queue_free(&tags->breserved_tags); 581 + struct blk_mq_tags *tags = container_of(head, struct blk_mq_tags, 582 + rcu_head); 583 + struct page *page; 584 + 585 + while (!list_empty(&tags->page_list)) { 586 + page = list_first_entry(&tags->page_list, struct page, lru); 587 + list_del_init(&page->lru); 588 + /* 589 + * Remove kmemleak object previously allocated in 590 + * blk_mq_alloc_rqs(). 
591 + */ 592 + kmemleak_free(page_address(page)); 593 + __free_pages(page, page->private); 594 + } 589 595 kfree(tags); 590 596 } 591 597 592 - int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx, 593 - struct blk_mq_tags **tagsptr, unsigned int tdepth, 594 - bool can_grow) 598 + void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags) 595 599 { 596 - struct blk_mq_tags *tags = *tagsptr; 600 + sbitmap_queue_free(&tags->bitmap_tags); 601 + sbitmap_queue_free(&tags->breserved_tags); 597 602 598 - if (tdepth <= tags->nr_reserved_tags) 599 - return -EINVAL; 600 - 601 - /* 602 - * If we are allowed to grow beyond the original size, allocate 603 - * a new set of tags before freeing the old one. 604 - */ 605 - if (tdepth > tags->nr_tags) { 606 - struct blk_mq_tag_set *set = hctx->queue->tag_set; 607 - struct blk_mq_tags *new; 608 - 609 - if (!can_grow) 610 - return -EINVAL; 611 - 612 - /* 613 - * We need some sort of upper limit, set it high enough that 614 - * no valid use cases should require more. 615 - */ 616 - if (tdepth > MAX_SCHED_RQ) 617 - return -EINVAL; 618 - 619 - /* 620 - * Only the sbitmap needs resizing since we allocated the max 621 - * initially. 622 - */ 623 - if (blk_mq_is_shared_tags(set->flags)) 624 - return 0; 625 - 626 - new = blk_mq_alloc_map_and_rqs(set, hctx->queue_num, tdepth); 627 - if (!new) 628 - return -ENOMEM; 629 - 630 - blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num); 631 - *tagsptr = new; 632 - } else { 633 - /* 634 - * Don't need (or can't) update reserved tags here, they 635 - * remain static and should never need resizing. 
636 - */ 637 - sbitmap_queue_resize(&tags->bitmap_tags, 638 - tdepth - tags->nr_reserved_tags); 603 + /* if tags pages is not allocated yet, free tags directly */ 604 + if (list_empty(&tags->page_list)) { 605 + kfree(tags); 606 + return; 639 607 } 640 608 641 - return 0; 609 + call_srcu(&set->tags_srcu, &tags->rcu_head, blk_mq_free_tags_callback); 642 610 } 643 611 644 612 void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set, unsigned int size)

+92 -83

block/blk-mq.c

··· 396 396 #endif 397 397 } 398 398 399 + static inline void blk_mq_bio_issue_init(struct request_queue *q, 400 + struct bio *bio) 401 + { 402 + #ifdef CONFIG_BLK_CGROUP 403 + if (test_bit(QUEUE_FLAG_BIO_ISSUE_TIME, &q->queue_flags)) 404 + bio->issue_time_ns = blk_time_get_ns(); 405 + #endif 406 + } 407 + 399 408 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, 400 409 struct blk_mq_tags *tags, unsigned int tag) 401 410 { ··· 3177 3168 if (!bio_integrity_prep(bio)) 3178 3169 goto queue_exit; 3179 3170 3171 + blk_mq_bio_issue_init(q, bio); 3180 3172 if (blk_mq_attempt_bio_merge(q, bio, nr_segs)) 3181 3173 goto queue_exit; 3182 3174 ··· 3425 3415 struct blk_mq_tags *tags) 3426 3416 { 3427 3417 struct page *page; 3428 - unsigned long flags; 3429 3418 3430 3419 /* 3431 3420 * There is no need to clear mapping if driver tags is not initialized ··· 3448 3439 } 3449 3440 } 3450 3441 } 3451 - 3452 - /* 3453 - * Wait until all pending iteration is done. 3454 - * 3455 - * Request reference is cleared and it is guaranteed to be observed 3456 - * after the ->lock is released. 3457 - */ 3458 - spin_lock_irqsave(&drv_tags->lock, flags); 3459 - spin_unlock_irqrestore(&drv_tags->lock, flags); 3460 3442 } 3461 3443 3462 3444 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, 3463 3445 unsigned int hctx_idx) 3464 3446 { 3465 3447 struct blk_mq_tags *drv_tags; 3466 - struct page *page; 3467 3448 3468 3449 if (list_empty(&tags->page_list)) 3469 3450 return; ··· 3477 3478 } 3478 3479 3479 3480 blk_mq_clear_rq_mapping(drv_tags, tags); 3480 - 3481 - while (!list_empty(&tags->page_list)) { 3482 - page = list_first_entry(&tags->page_list, struct page, lru); 3483 - list_del_init(&page->lru); 3484 - /* 3485 - * Remove kmemleak object previously allocated in 3486 - * blk_mq_alloc_rqs(). 
3487 - */ 3488 - kmemleak_free(page_address(page)); 3489 - __free_pages(page, page->private); 3490 - } 3481 + /* 3482 + * Free request pages in SRCU callback, which is called from 3483 + * blk_mq_free_tags(). 3484 + */ 3491 3485 } 3492 3486 3493 - void blk_mq_free_rq_map(struct blk_mq_tags *tags) 3487 + void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags) 3494 3488 { 3495 3489 kfree(tags->rqs); 3496 3490 tags->rqs = NULL; 3497 3491 kfree(tags->static_rqs); 3498 3492 tags->static_rqs = NULL; 3499 3493 3500 - blk_mq_free_tags(tags); 3494 + blk_mq_free_tags(set, tags); 3501 3495 } 3502 3496 3503 3497 static enum hctx_type hctx_idx_to_type(struct blk_mq_tag_set *set, ··· 3552 3560 err_free_rqs: 3553 3561 kfree(tags->rqs); 3554 3562 err_free_tags: 3555 - blk_mq_free_tags(tags); 3563 + blk_mq_free_tags(set, tags); 3556 3564 return NULL; 3557 3565 } 3558 3566 ··· 3581 3589 3582 3590 if (node == NUMA_NO_NODE) 3583 3591 node = set->numa_node; 3584 - 3585 - INIT_LIST_HEAD(&tags->page_list); 3586 3592 3587 3593 /* 3588 3594 * rq_size is the size of the request plus driver payload, rounded ··· 3668 3678 struct rq_iter_data data = { 3669 3679 .hctx = hctx, 3670 3680 }; 3681 + int srcu_idx; 3671 3682 3683 + srcu_idx = srcu_read_lock(&hctx->queue->tag_set->tags_srcu); 3672 3684 blk_mq_all_tag_iter(tags, blk_mq_has_request, &data); 3685 + srcu_read_unlock(&hctx->queue->tag_set->tags_srcu, srcu_idx); 3686 + 3673 3687 return data.has_rq; 3674 3688 } 3675 3689 ··· 3893 3899 unsigned int queue_depth, struct request *flush_rq) 3894 3900 { 3895 3901 int i; 3896 - unsigned long flags; 3897 3902 3898 3903 /* The hw queue may not be mapped yet */ 3899 3904 if (!tags) ··· 3902 3909 3903 3910 for (i = 0; i < queue_depth; i++) 3904 3911 cmpxchg(&tags->rqs[i], flush_rq, NULL); 3912 + } 3905 3913 3906 - /* 3907 - * Wait until all pending iteration is done. 
3908 - * 3909 - * Request reference is cleared and it is guaranteed to be observed 3910 - * after the ->lock is released. 3911 - */ 3912 - spin_lock_irqsave(&tags->lock, flags); 3913 - spin_unlock_irqrestore(&tags->lock, flags); 3914 + static void blk_free_flush_queue_callback(struct rcu_head *head) 3915 + { 3916 + struct blk_flush_queue *fq = 3917 + container_of(head, struct blk_flush_queue, rcu_head); 3918 + 3919 + blk_free_flush_queue(fq); 3914 3920 } 3915 3921 3916 3922 /* hctx->ctxs will be freed in queue's release handler */ ··· 3930 3938 3931 3939 if (set->ops->exit_hctx) 3932 3940 set->ops->exit_hctx(hctx, hctx_idx); 3941 + 3942 + call_srcu(&set->tags_srcu, &hctx->fq->rcu_head, 3943 + blk_free_flush_queue_callback); 3944 + hctx->fq = NULL; 3933 3945 3934 3946 xa_erase(&q->hctx_table, hctx_idx); 3935 3947 ··· 3960 3964 struct blk_mq_tag_set *set, 3961 3965 struct blk_mq_hw_ctx *hctx, unsigned hctx_idx) 3962 3966 { 3967 + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY; 3968 + 3969 + hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp); 3970 + if (!hctx->fq) 3971 + goto fail; 3972 + 3963 3973 hctx->queue_num = hctx_idx; 3964 3974 3965 3975 hctx->tags = set->tags[hctx_idx]; 3966 3976 3967 3977 if (set->ops->init_hctx && 3968 3978 set->ops->init_hctx(hctx, set->driver_data, hctx_idx)) 3969 - goto fail; 3979 + goto fail_free_fq; 3970 3980 3971 3981 if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, 3972 3982 hctx->numa_node)) ··· 3989 3987 exit_hctx: 3990 3988 if (set->ops->exit_hctx) 3991 3989 set->ops->exit_hctx(hctx, hctx_idx); 3990 + fail_free_fq: 3991 + blk_free_flush_queue(hctx->fq); 3992 + hctx->fq = NULL; 3992 3993 fail: 3993 3994 return -1; 3994 3995 } ··· 4043 4038 init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake); 4044 4039 INIT_LIST_HEAD(&hctx->dispatch_wait.entry); 4045 4040 4046 - hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp); 4047 - if (!hctx->fq) 4048 - goto free_bitmap; 
4049 - 4050 4041 blk_mq_hctx_kobj_init(hctx); 4051 4042 4052 4043 return hctx; 4053 4044 4054 - free_bitmap: 4055 - sbitmap_free(&hctx->ctx_map); 4056 4045 free_ctxs: 4057 4046 kfree(hctx->ctxs); 4058 4047 free_cpumask: ··· 4100 4101 4101 4102 ret = blk_mq_alloc_rqs(set, tags, hctx_idx, depth); 4102 4103 if (ret) { 4103 - blk_mq_free_rq_map(tags); 4104 + blk_mq_free_rq_map(set, tags); 4104 4105 return NULL; 4105 4106 } 4106 4107 ··· 4128 4129 { 4129 4130 if (tags) { 4130 4131 blk_mq_free_rqs(set, tags, hctx_idx); 4131 - blk_mq_free_rq_map(tags); 4132 + blk_mq_free_rq_map(set, tags); 4132 4133 } 4133 4134 } 4134 4135 ··· 4827 4828 if (ret) 4828 4829 goto out_free_srcu; 4829 4830 } 4831 + ret = init_srcu_struct(&set->tags_srcu); 4832 + if (ret) 4833 + goto out_cleanup_srcu; 4830 4834 4831 4835 init_rwsem(&set->update_nr_hwq_lock); 4832 4836 ··· 4838 4836 sizeof(struct blk_mq_tags *), GFP_KERNEL, 4839 4837 set->numa_node); 4840 4838 if (!set->tags) 4841 - goto out_cleanup_srcu; 4839 + goto out_cleanup_tags_srcu; 4842 4840 4843 4841 for (i = 0; i < set->nr_maps; i++) { 4844 4842 set->map[i].mq_map = kcalloc_node(nr_cpu_ids, ··· 4867 4865 } 4868 4866 kfree(set->tags); 4869 4867 set->tags = NULL; 4868 + out_cleanup_tags_srcu: 4869 + cleanup_srcu_struct(&set->tags_srcu); 4870 4870 out_cleanup_srcu: 4871 4871 if (set->flags & BLK_MQ_F_BLOCKING) 4872 4872 cleanup_srcu_struct(set->srcu); ··· 4914 4910 4915 4911 kfree(set->tags); 4916 4912 set->tags = NULL; 4913 + 4914 + srcu_barrier(&set->tags_srcu); 4915 + cleanup_srcu_struct(&set->tags_srcu); 4917 4916 if (set->flags & BLK_MQ_F_BLOCKING) { 4918 4917 cleanup_srcu_struct(set->srcu); 4919 4918 kfree(set->srcu); ··· 4924 4917 } 4925 4918 EXPORT_SYMBOL(blk_mq_free_tag_set); 4926 4919 4927 - int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr) 4920 + struct elevator_tags *blk_mq_update_nr_requests(struct request_queue *q, 4921 + struct elevator_tags *et, 4922 + unsigned int nr) 4928 4923 { 4929 4924 struct 
blk_mq_tag_set *set = q->tag_set; 4925 + struct elevator_tags *old_et = NULL; 4930 4926 struct blk_mq_hw_ctx *hctx; 4931 - int ret; 4932 4927 unsigned long i; 4933 - 4934 - if (WARN_ON_ONCE(!q->mq_freeze_depth)) 4935 - return -EINVAL; 4936 - 4937 - if (!set) 4938 - return -EINVAL; 4939 - 4940 - if (q->nr_requests == nr) 4941 - return 0; 4942 4928 4943 4929 blk_mq_quiesce_queue(q); 4944 4930 4945 - ret = 0; 4946 - queue_for_each_hw_ctx(q, hctx, i) { 4947 - if (!hctx->tags) 4948 - continue; 4931 + if (blk_mq_is_shared_tags(set->flags)) { 4949 4932 /* 4950 - * If we're using an MQ scheduler, just update the scheduler 4951 - * queue depth. This is similar to what the old code would do. 4933 + * Shared tags, for sched tags, we allocate max initially hence 4934 + * tags can't grow, see blk_mq_alloc_sched_tags(). 4952 4935 */ 4953 - if (hctx->sched_tags) { 4954 - ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags, 4955 - nr, true); 4956 - } else { 4957 - ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr, 4958 - false); 4936 + if (q->elevator) 4937 + blk_mq_tag_update_sched_shared_tags(q); 4938 + else 4939 + blk_mq_tag_resize_shared_tags(set, nr); 4940 + } else if (!q->elevator) { 4941 + /* 4942 + * Non-shared hardware tags, nr is already checked from 4943 + * queue_requests_store() and tags can't grow. 
4944 + */ 4945 + queue_for_each_hw_ctx(q, hctx, i) { 4946 + if (!hctx->tags) 4947 + continue; 4948 + sbitmap_queue_resize(&hctx->tags->bitmap_tags, 4949 + nr - hctx->tags->nr_reserved_tags); 4959 4950 } 4960 - if (ret) 4961 - break; 4962 - if (q->elevator && q->elevator->type->ops.depth_updated) 4963 - q->elevator->type->ops.depth_updated(hctx); 4964 - } 4965 - if (!ret) { 4966 - q->nr_requests = nr; 4967 - if (blk_mq_is_shared_tags(set->flags)) { 4968 - if (q->elevator) 4969 - blk_mq_tag_update_sched_shared_tags(q); 4970 - else 4971 - blk_mq_tag_resize_shared_tags(set, nr); 4951 + } else if (nr <= q->elevator->et->nr_requests) { 4952 + /* Non-shared sched tags, and tags don't grow. */ 4953 + queue_for_each_hw_ctx(q, hctx, i) { 4954 + if (!hctx->sched_tags) 4955 + continue; 4956 + sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags, 4957 + nr - hctx->sched_tags->nr_reserved_tags); 4972 4958 } 4959 + } else { 4960 + /* Non-shared sched tags, and tags grow */ 4961 + queue_for_each_hw_ctx(q, hctx, i) 4962 + hctx->sched_tags = et->tags[i]; 4963 + old_et = q->elevator->et; 4964 + q->elevator->et = et; 4973 4965 } 4966 + 4967 + q->nr_requests = nr; 4968 + if (q->elevator && q->elevator->type->ops.depth_updated) 4969 + q->elevator->type->ops.depth_updated(q); 4974 4970 4975 4971 blk_mq_unquiesce_queue(q); 4976 - 4977 - return ret; 4972 + return old_et; 4978 4973 } 4979 4974 4980 4975 /*
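The rewritten blk_mq_update_nr_requests() above picks one of several resize strategies depending on whether tags are shared, whether an elevator is attached, and whether the scheduler tags need to grow. A minimal user-space sketch of that decision follows. The enum, the function name, and the plain boolean/integer parameters are hypothetical stand-ins for the kernel state (blk_mq_is_shared_tags(), q->elevator, q->elevator->et->nr_requests); this is not kernel code.

```c
#include <stdbool.h>

/* Hypothetical labels for the branches taken in the new function. */
enum nr_requests_action {
	UPDATE_SHARED_SCHED_TAGS,	/* shared tags, elevator attached */
	RESIZE_SHARED_HW_TAGS,		/* shared tags, no elevator */
	RESIZE_HW_TAGS,			/* per-hctx hardware tags, resized in place */
	RESIZE_SCHED_TAGS,		/* per-hctx sched tags, no growth needed */
	INSTALL_NEW_SCHED_TAGS,		/* sched tags grow: swap in preallocated tags */
};

/*
 * Mirror the if/else chain of the diff. cur_sched_nr stands in for
 * q->elevator->et->nr_requests and is only consulted when an elevator
 * is attached and tags are not shared.
 */
static enum nr_requests_action
classify_nr_requests_update(bool shared_tags, bool has_elevator,
			    unsigned int nr, unsigned int cur_sched_nr)
{
	if (shared_tags)
		return has_elevator ? UPDATE_SHARED_SCHED_TAGS
				    : RESIZE_SHARED_HW_TAGS;
	if (!has_elevator)
		return RESIZE_HW_TAGS;
	if (nr <= cur_sched_nr)
		return RESIZE_SCHED_TAGS;
	return INSTALL_NEW_SCHED_TAGS;
}
```

Only the last case uses the preallocated elevator_tags passed in by the caller, which is why the function now returns the old elevator_tags for the caller to free.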

+17 -5

block/blk-mq.h

··· 6 6 #include "blk-stat.h" 7 7 8 8 struct blk_mq_tag_set; 9 + struct elevator_tags; 9 10 10 11 struct blk_mq_ctxs { 11 12 struct kobject kobj; ··· 46 45 int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob, 47 46 unsigned int flags); 48 47 void blk_mq_exit_queue(struct request_queue *q); 49 - int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr); 48 + struct elevator_tags *blk_mq_update_nr_requests(struct request_queue *q, 49 + struct elevator_tags *tags, 50 + unsigned int nr); 50 51 void blk_mq_wake_waiters(struct request_queue *q); 51 52 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *, 52 53 bool); ··· 62 59 */ 63 60 void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, 64 61 unsigned int hctx_idx); 65 - void blk_mq_free_rq_map(struct blk_mq_tags *tags); 62 + void blk_mq_free_rq_map(struct blk_mq_tag_set *set, struct blk_mq_tags *tags); 66 63 struct blk_mq_tags *blk_mq_alloc_map_and_rqs(struct blk_mq_tag_set *set, 67 64 unsigned int hctx_idx, unsigned int depth); 68 65 void blk_mq_free_map_and_rqs(struct blk_mq_tag_set *set, ··· 110 107 struct blk_mq_ctx *ctx) 111 108 { 112 109 return ctx->hctxs[blk_mq_get_hctx_type(opf)]; 110 + } 111 + 112 + /* 113 + * Default to double of smaller one between hw queue_depth and 114 + * 128, since we don't split into sync/async like the old code 115 + * did. Additionally, this is a per-hw queue depth. 
116 + */ 117 + static inline unsigned int blk_mq_default_nr_requests( 118 + struct blk_mq_tag_set *set) 119 + { 120 + return 2 * min_t(unsigned int, set->queue_depth, BLKDEV_DEFAULT_RQ); 113 121 } 114 122 115 123 /* ··· 176 162 177 163 struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, 178 164 unsigned int reserved_tags, unsigned int flags, int node); 179 - void blk_mq_free_tags(struct blk_mq_tags *tags); 165 + void blk_mq_free_tags(struct blk_mq_tag_set *set, struct blk_mq_tags *tags); 180 166 181 167 unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data); 182 168 unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags, ··· 184 170 void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx, 185 171 unsigned int tag); 186 172 void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags); 187 - int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx, 188 - struct blk_mq_tags **tags, unsigned int depth, bool can_grow); 189 173 void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set, 190 174 unsigned int size); 191 175 void blk_mq_tag_update_sched_shared_tags(struct request_queue *q);
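As a quick illustration of the new blk_mq_default_nr_requests() helper, the sketch below reproduces its arithmetic in user space. The function name is hypothetical, and the value 128 for BLKDEV_DEFAULT_RQ is an assumption taken from the "128" mentioned in the helper's comment rather than from a definition shown in this diff.

```c
/* Assumed value; the comment in the diff refers to "128". */
#define BLKDEV_DEFAULT_RQ 128U

/*
 * Default request count per hardware queue: twice the smaller of the
 * hardware queue depth and BLKDEV_DEFAULT_RQ.
 */
static unsigned int default_nr_requests(unsigned int queue_depth)
{
	unsigned int smaller = queue_depth < BLKDEV_DEFAULT_RQ ?
			       queue_depth : BLKDEV_DEFAULT_RQ;

	return 2 * smaller;
}
```

Under that assumption, a device with a hardware queue depth of 32 would default to 64 requests, while any depth of 128 or more would be capped at 256.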

+39 -45

block/blk-settings.c

··· 56 56 lim->max_user_wzeroes_unmap_sectors = UINT_MAX; 57 57 lim->max_hw_zone_append_sectors = UINT_MAX; 58 58 lim->max_user_discard_sectors = UINT_MAX; 59 + lim->atomic_write_hw_max = UINT_MAX; 59 60 } 60 61 EXPORT_SYMBOL(blk_set_stacking_limits); 61 62 ··· 224 223 lim->atomic_write_hw_boundary >> SECTOR_SHIFT; 225 224 } 226 225 226 + /* 227 + * Test whether any boundary is aligned with any chunk size. Stacked 228 + * devices store any stripe size in t->chunk_sectors. 229 + */ 230 + static bool blk_valid_atomic_writes_boundary(unsigned int chunk_sectors, 231 + unsigned int boundary_sectors) 232 + { 233 + if (!chunk_sectors || !boundary_sectors) 234 + return true; 235 + 236 + if (boundary_sectors > chunk_sectors && 237 + boundary_sectors % chunk_sectors) 238 + return false; 239 + 240 + if (chunk_sectors > boundary_sectors && 241 + chunk_sectors % boundary_sectors) 242 + return false; 243 + 244 + return true; 245 + } 246 + 227 247 static void blk_validate_atomic_write_limits(struct queue_limits *lim) 228 248 { 229 249 unsigned int boundary_sectors; ··· 252 230 lim->atomic_write_hw_max >> SECTOR_SHIFT; 253 231 254 232 if (!(lim->features & BLK_FEAT_ATOMIC_WRITES)) 233 + goto unsupported; 234 + 235 + /* UINT_MAX indicates stacked limits in initial state */ 236 + if (lim->atomic_write_hw_max == UINT_MAX) 255 237 goto unsupported; 256 238 257 239 if (!lim->atomic_write_hw_max) ··· 285 259 if (WARN_ON_ONCE(lim->atomic_write_hw_max > 286 260 lim->atomic_write_hw_boundary)) 287 261 goto unsupported; 288 - /* 289 - * A feature of boundary support is that it disallows bios to 290 - * be merged which would result in a merged request which 291 - * crosses either a chunk sector or atomic write HW boundary, 292 - * even though chunk sectors may be just set for performance. 293 - * For simplicity, disallow atomic writes for a chunk sector 294 - * which is non-zero and smaller than atomic write HW boundary. 
295 - * Furthermore, chunk sectors must be a multiple of atomic 296 - * write HW boundary. Otherwise boundary support becomes 297 - * complicated. 298 - * Devices which do not conform to these rules can be dealt 299 - * with if and when they show up. 300 - */ 301 - if (WARN_ON_ONCE(lim->chunk_sectors % boundary_sectors)) 262 + 263 + if (WARN_ON_ONCE(!blk_valid_atomic_writes_boundary( 264 + lim->chunk_sectors, boundary_sectors))) 302 265 goto unsupported; 303 266 304 267 /* ··· 654 639 return true; 655 640 } 656 641 657 - /* Check for valid boundary of first bottom device */ 658 - static bool blk_stack_atomic_writes_boundary_head(struct queue_limits *t, 659 - struct queue_limits *b) 660 - { 661 - /* 662 - * Ensure atomic write boundary is aligned with chunk sectors. Stacked 663 - * devices store chunk sectors in t->io_min. 664 - */ 665 - if (b->atomic_write_hw_boundary > t->io_min && 666 - b->atomic_write_hw_boundary % t->io_min) 667 - return false; 668 - if (t->io_min > b->atomic_write_hw_boundary && 669 - t->io_min % b->atomic_write_hw_boundary) 670 - return false; 671 - 672 - t->atomic_write_hw_boundary = b->atomic_write_hw_boundary; 673 - return true; 674 - } 675 - 676 642 static void blk_stack_atomic_writes_chunk_sectors(struct queue_limits *t) 677 643 { 678 644 unsigned int chunk_bytes; ··· 691 695 static bool blk_stack_atomic_writes_head(struct queue_limits *t, 692 696 struct queue_limits *b) 693 697 { 694 - if (b->atomic_write_hw_boundary && 695 - !blk_stack_atomic_writes_boundary_head(t, b)) 698 + if (!blk_valid_atomic_writes_boundary(t->chunk_sectors, 699 + b->atomic_write_hw_boundary >> SECTOR_SHIFT)) 696 700 return false; 697 701 698 702 t->atomic_write_hw_unit_max = b->atomic_write_hw_unit_max; 699 703 t->atomic_write_hw_unit_min = b->atomic_write_hw_unit_min; 700 704 t->atomic_write_hw_max = b->atomic_write_hw_max; 705 + t->atomic_write_hw_boundary = b->atomic_write_hw_boundary; 701 706 return true; 702 707 } 703 708 ··· 714 717 if 
(!blk_atomic_write_start_sect_aligned(start, b)) 715 718 goto unsupported; 716 719 717 - /* 718 - * If atomic_write_hw_max is set, we have already stacked 1x bottom 719 - * device, so check for compliance. 720 - */ 721 - if (t->atomic_write_hw_max) { 720 + /* UINT_MAX indicates no stacking of bottom devices yet */ 721 + if (t->atomic_write_hw_max == UINT_MAX) { 722 + if (!blk_stack_atomic_writes_head(t, b)) 723 + goto unsupported; 724 + } else { 722 725 if (!blk_stack_atomic_writes_tail(t, b)) 723 726 goto unsupported; 724 - return; 725 727 } 726 - 727 - if (!blk_stack_atomic_writes_head(t, b)) 728 - goto unsupported; 729 728 blk_stack_atomic_writes_chunk_sectors(t); 730 729 return; 731 730 ··· 756 763 int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, 757 764 sector_t start) 758 765 { 759 - unsigned int top, bottom, alignment, ret = 0; 766 + unsigned int top, bottom, alignment; 767 + int ret = 0; 760 768 761 769 t->features |= (b->features & BLK_FEAT_INHERIT_MASK); 762 770
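The alignment rule introduced by blk_valid_atomic_writes_boundary() can be tried outside the kernel. The sketch below copies the logic from the diff into a standalone function (the shortened name is hypothetical): two non-zero sizes are accepted only when the larger is a whole multiple of the smaller.

```c
#include <stdbool.h>

/*
 * User-space copy of the check in the diff. Both arguments are in
 * sectors; zero means "not set" and never conflicts.
 */
static bool valid_atomic_writes_boundary(unsigned int chunk_sectors,
					 unsigned int boundary_sectors)
{
	if (!chunk_sectors || !boundary_sectors)
		return true;

	/* Boundary larger than chunk: must be a multiple of the chunk. */
	if (boundary_sectors > chunk_sectors &&
	    boundary_sectors % chunk_sectors)
		return false;

	/* Chunk larger than boundary: must be a multiple of the boundary. */
	if (chunk_sectors > boundary_sectors &&
	    chunk_sectors % boundary_sectors)
		return false;

	return true;
}
```

For example, a 128-sector chunk with a 32-sector boundary passes, while a 96-sector chunk with a 64-sector boundary is rejected because neither value divides the other.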

+54 -16

block/blk-sysfs.c

··· 64 64 static ssize_t 65 65 queue_requests_store(struct gendisk *disk, const char *page, size_t count) 66 66 { 67 - unsigned long nr; 68 - int ret, err; 69 - unsigned int memflags; 70 67 struct request_queue *q = disk->queue; 71 - 72 - if (!queue_is_mq(q)) 73 - return -EINVAL; 68 + struct blk_mq_tag_set *set = q->tag_set; 69 + struct elevator_tags *et = NULL; 70 + unsigned int memflags; 71 + unsigned long nr; 72 + int ret; 74 73 75 74 ret = queue_var_store(&nr, page, count); 76 75 if (ret < 0) 77 76 return ret; 78 77 79 - memflags = blk_mq_freeze_queue(q); 80 - mutex_lock(&q->elevator_lock); 78 + /* 79 + * Serialize updating nr_requests with concurrent queue_requests_store() 80 + * and switching elevator. 81 + */ 82 + down_write(&set->update_nr_hwq_lock); 83 + 84 + if (nr == q->nr_requests) 85 + goto unlock; 86 + 81 87 if (nr < BLKDEV_MIN_RQ) 82 88 nr = BLKDEV_MIN_RQ; 83 89 84 - err = blk_mq_update_nr_requests(disk->queue, nr); 85 - if (err) 86 - ret = err; 90 + /* 91 + * Switching elevator is protected by update_nr_hwq_lock: 92 + * - read lock is held from elevator sysfs attribute; 93 + * - write lock is held from updating nr_hw_queues; 94 + * Hence it's safe to access q->elevator here with write lock held. 95 + */ 96 + if (nr <= set->reserved_tags || 97 + (q->elevator && nr > MAX_SCHED_RQ) || 98 + (!q->elevator && nr > set->queue_depth)) { 99 + ret = -EINVAL; 100 + goto unlock; 101 + } 102 + 103 + if (!blk_mq_is_shared_tags(set->flags) && q->elevator && 104 + nr > q->elevator->et->nr_requests) { 105 + /* 106 + * Tags will grow, allocate memory before freezing queue to 107 + * prevent deadlock. 
108 + */ 109 + et = blk_mq_alloc_sched_tags(set, q->nr_hw_queues, nr); 110 + if (!et) { 111 + ret = -ENOMEM; 112 + goto unlock; 113 + } 114 + } 115 + 116 + memflags = blk_mq_freeze_queue(q); 117 + mutex_lock(&q->elevator_lock); 118 + et = blk_mq_update_nr_requests(q, et, nr); 87 119 mutex_unlock(&q->elevator_lock); 88 120 blk_mq_unfreeze_queue(q, memflags); 121 + 122 + if (et) 123 + blk_mq_free_sched_tags(et, set); 124 + 125 + unlock: 126 + up_write(&set->update_nr_hwq_lock); 89 127 return ret; 90 128 } 91 129 ··· 658 620 if (val < -1) 659 621 return -EINVAL; 660 622 623 + /* 624 + * Ensure that the queue is idled, in case the latency update 625 + * ends up either enabling or disabling wbt completely. We can't 626 + * have IO inflight if that happens. 627 + */ 661 628 memflags = blk_mq_freeze_queue(q); 662 629 663 630 rqos = wbt_rq_qos(q); ··· 681 638 if (wbt_get_min_lat(q) == val) 682 639 goto out; 683 640 684 - /* 685 - * Ensure that the queue is idled, in case the latency update 686 - * ends up either enabling or disabling wbt completely. We can't 687 - * have IO inflight if that happens. 688 - */ 689 641 blk_mq_quiesce_queue(q); 690 642 691 643 mutex_lock(&disk->rqos_state_mutex);
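The range check that queue_requests_store() now performs before freezing the queue can be sketched as a pure function. The function name and parameters below are hypothetical; max_sched_rq stands in for MAX_SCHED_RQ, whose value is not shown in this diff, and the earlier clamp to BLKDEV_MIN_RQ is left out.

```c
#include <stdbool.h>

/*
 * Return true when nr would be accepted by the check in the diff:
 * it must exceed the reserved tags, and stay within the scheduler
 * limit (elevator attached) or the hardware queue depth (no elevator).
 */
static bool nr_requests_in_range(unsigned long nr,
				 unsigned int reserved_tags,
				 unsigned int queue_depth,
				 bool has_elevator,
				 unsigned long max_sched_rq)
{
	if (nr <= reserved_tags)
		return false;
	if (has_elevator)
		return nr <= max_sched_rq;
	return nr <= queue_depth;
}
```

In the kernel function a failed check results in -EINVAL, and the check happens with update_nr_hwq_lock held for writing so that q->elevator cannot change underneath it.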

+7 -8

block/blk-throttle.c

··· 1224 1224 if (!bio_list_empty(&bio_list_on_stack)) { 1225 1225 blk_start_plug(&plug); 1226 1226 while ((bio = bio_list_pop(&bio_list_on_stack))) 1227 - submit_bio_noacct_nocheck(bio); 1227 + submit_bio_noacct_nocheck(bio, false); 1228 1228 blk_finish_plug(&plug); 1229 1229 } 1230 1230 } ··· 1327 1327 INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn); 1328 1328 throtl_service_queue_init(&td->service_queue); 1329 1329 1330 - /* 1331 - * Freeze queue before activating policy, to synchronize with IO path, 1332 - * which is protected by 'q_usage_counter'. 1333 - */ 1334 1330 memflags = blk_mq_freeze_queue(disk->queue); 1335 1331 blk_mq_quiesce_queue(disk->queue); 1336 1332 1337 1333 q->td = td; 1338 1334 td->queue = q; 1339 1335 1340 - /* activate policy */ 1336 + /* activate policy, blk_throtl_activated() will return true */ 1341 1337 ret = blkcg_activate_policy(disk, &blkcg_policy_throtl); 1342 1338 if (ret) { 1343 1339 q->td = NULL; ··· 1842 1846 { 1843 1847 struct request_queue *q = disk->queue; 1844 1848 1845 - if (!blk_throtl_activated(q)) 1849 + /* 1850 + * blkg_destroy_all() already deactivate throtl policy, just check and 1851 + * free throtl data. 1852 + */ 1853 + if (!q->td) 1846 1854 return; 1847 1855 1848 1856 timer_delete_sync(&q->td->service_queue.pending_timer); 1849 1857 throtl_shutdown_wq(q); 1850 - blkcg_deactivate_policy(disk, &blkcg_policy_throtl); 1851 1858 kfree(q->td); 1852 1859 } 1853 1860

+11 -7

block/blk-throttle.h

··· 156 156 157 157 static inline bool blk_throtl_activated(struct request_queue *q) 158 158 { 159 - return q->td != NULL; 159 + /* 160 + * q->td guarantees that the blk-throttle module is already loaded, 161 + * and the plid of blk-throttle is assigned. 162 + * blkcg_policy_enabled() guarantees that the policy is activated 163 + * in the request_queue. 164 + */ 165 + return q->td != NULL && blkcg_policy_enabled(q, &blkcg_policy_throtl); 160 166 } 161 167 162 168 static inline bool blk_should_throtl(struct bio *bio) ··· 170 164 struct throtl_grp *tg; 171 165 int rw = bio_data_dir(bio); 172 166 173 - /* 174 - * This is called under bio_queue_enter(), and it's synchronized with 175 - * the activation of blk-throtl, which is protected by 176 - * blk_mq_freeze_queue(). 177 - */ 178 167 if (!blk_throtl_activated(bio->bi_bdev->bd_queue)) 179 168 return false; 180 169 ··· 195 194 196 195 static inline bool blk_throtl_bio(struct bio *bio) 197 196 { 198 - 197 + /* 198 + * block throttling takes effect if the policy is activated 199 + * in the bio's request_queue. 200 + */ 199 201 if (!blk_should_throtl(bio)) 200 202 return false; 201 203

+3 -43

block/blk.h

··· 41 41 struct list_head flush_queue[2]; 42 42 unsigned long flush_data_in_flight; 43 43 struct request *flush_rq; 44 + struct rcu_head rcu_head; 44 45 }; 45 46 46 47 bool is_flush_rq(struct request *req); ··· 55 54 bool __blk_freeze_queue_start(struct request_queue *q, 56 55 struct task_struct *owner); 57 56 int __bio_queue_enter(struct request_queue *q, struct bio *bio); 58 - void submit_bio_noacct_nocheck(struct bio *bio); 57 + void submit_bio_noacct_nocheck(struct bio *bio, bool split); 59 58 void bio_await_chain(struct bio *bio); 60 59 61 60 static inline bool blk_try_enter_queue(struct request_queue *q, bool pm) ··· 616 615 int disk_register_independent_access_ranges(struct gendisk *disk); 617 616 void disk_unregister_independent_access_ranges(struct gendisk *disk); 618 617 618 + int should_fail_bio(struct bio *bio); 619 619 #ifdef CONFIG_FAIL_MAKE_REQUEST 620 620 bool should_fail_request(struct block_device *part, unsigned int bytes); 621 621 #else /* CONFIG_FAIL_MAKE_REQUEST */ ··· 680 678 static inline ktime_t blk_time_get(void) 681 679 { 682 680 return ns_to_ktime(blk_time_get_ns()); 683 - } 684 - 685 - /* 686 - * From most significant bit: 687 - * 1 bit: reserved for other usage, see below 688 - * 12 bits: original size of bio 689 - * 51 bits: issue time of bio 690 - */ 691 - #define BIO_ISSUE_RES_BITS 1 692 - #define BIO_ISSUE_SIZE_BITS 12 693 - #define BIO_ISSUE_RES_SHIFT (64 - BIO_ISSUE_RES_BITS) 694 - #define BIO_ISSUE_SIZE_SHIFT (BIO_ISSUE_RES_SHIFT - BIO_ISSUE_SIZE_BITS) 695 - #define BIO_ISSUE_TIME_MASK ((1ULL << BIO_ISSUE_SIZE_SHIFT) - 1) 696 - #define BIO_ISSUE_SIZE_MASK \ 697 - (((1ULL << BIO_ISSUE_SIZE_BITS) - 1) << BIO_ISSUE_SIZE_SHIFT) 698 - #define BIO_ISSUE_RES_MASK (~((1ULL << BIO_ISSUE_RES_SHIFT) - 1)) 699 - 700 - /* Reserved bit for blk-throtl */ 701 - #define BIO_ISSUE_THROTL_SKIP_LATENCY (1ULL << 63) 702 - 703 - static inline u64 __bio_issue_time(u64 time) 704 - { 705 - return time & BIO_ISSUE_TIME_MASK; 706 - } 707 - 708 - static 
inline u64 bio_issue_time(struct bio_issue *issue) 709 - { 710 - return __bio_issue_time(issue->value); 711 - } 712 - 713 - static inline sector_t bio_issue_size(struct bio_issue *issue) 714 - { 715 - return ((issue->value & BIO_ISSUE_SIZE_MASK) >> BIO_ISSUE_SIZE_SHIFT); 716 - } 717 - 718 - static inline void bio_issue_init(struct bio_issue *issue, 719 - sector_t size) 720 - { 721 - size &= (1ULL << BIO_ISSUE_SIZE_BITS) - 1; 722 - issue->value = ((issue->value & BIO_ISSUE_RES_MASK) | 723 - (blk_time_get_ns() & BIO_ISSUE_TIME_MASK) | 724 - ((u64)size << BIO_ISSUE_SIZE_SHIFT)); 725 681 } 726 682 727 683 void bdev_release(struct file *bdev_file);

+2 -1

block/elevator.c

··· 669 669 lockdep_assert_held(&set->update_nr_hwq_lock); 670 670 671 671 if (strncmp(ctx->name, "none", 4)) { 672 - ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues); 672 + ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues, 673 + blk_mq_default_nr_requests(set)); 673 674 if (!ctx->et) 674 675 return -ENOMEM; 675 676 }

+1 -1

block/elevator.h

··· 37 37 void (*exit_sched)(struct elevator_queue *); 38 38 int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int); 39 39 void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int); 40 - void (*depth_updated)(struct blk_mq_hw_ctx *); 40 + void (*depth_updated)(struct request_queue *); 41 41 42 42 bool (*allow_merge)(struct request_queue *, struct request *, struct bio *); 43 43 bool (*bio_merge)(struct request_queue *, struct bio *, unsigned int);

+5 -5

block/fops.c

··· 39 39 static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb, 40 40 struct iov_iter *iter) 41 41 { 42 - return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) || 43 - !bdev_iter_is_aligned(bdev, iter); 42 + return (iocb->ki_pos | iov_iter_count(iter)) & 43 + (bdev_logical_block_size(bdev) - 1); 44 44 } 45 45 46 46 #define DIO_INLINE_BIO_VECS 4 ··· 78 78 if (iocb->ki_flags & IOCB_ATOMIC) 79 79 bio.bi_opf |= REQ_ATOMIC; 80 80 81 - ret = bio_iov_iter_get_pages(&bio, iter); 81 + ret = bio_iov_iter_get_bdev_pages(&bio, iter, bdev); 82 82 if (unlikely(ret)) 83 83 goto out; 84 84 ret = bio.bi_iter.bi_size; ··· 212 212 bio->bi_end_io = blkdev_bio_end_io; 213 213 bio->bi_ioprio = iocb->ki_ioprio; 214 214 215 - ret = bio_iov_iter_get_pages(bio, iter); 215 + ret = bio_iov_iter_get_bdev_pages(bio, iter, bdev); 216 216 if (unlikely(ret)) { 217 217 bio->bi_status = BLK_STS_IOERR; 218 218 bio_endio(bio); ··· 348 348 */ 349 349 bio_iov_bvec_set(bio, iter); 350 350 } else { 351 - ret = bio_iov_iter_get_pages(bio, iter); 351 + ret = bio_iov_iter_get_bdev_pages(bio, iter, bdev); 352 352 if (unlikely(ret)) 353 353 goto out_bio_put; 354 354 }
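The new blkdev_dio_invalid() folds the file position and the transfer length into a single mask test. A minimal user-space sketch of that bit trick is shown below; the function name and the plain integer parameters are hypothetical replacements for iocb->ki_pos, iov_iter_count(iter), and bdev_logical_block_size(bdev), and the logical block size is assumed to be a power of two. It covers only the position and length test visible in the new lines.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * OR-ing position and length keeps every low bit set in either value,
 * so one mask with (lbs - 1) detects misalignment in both at once.
 * lbs must be a power of two for the mask to be valid.
 */
static bool dio_invalid(uint64_t pos, size_t count, unsigned int lbs)
{
	return ((pos | count) & (lbs - 1)) != 0;
}
```

With a 512-byte logical block size, a 8192-byte transfer at offset 4096 passes, while a 100-byte transfer at the same offset is flagged as invalid.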

+2 -2

block/ioctl.c

··· 481 481 */ 482 482 memset(&geo, 0, sizeof(geo)); 483 483 geo.start = get_start_sect(bdev); 484 - ret = disk->fops->getgeo(bdev, &geo); 484 + ret = disk->fops->getgeo(disk, &geo); 485 485 if (ret) 486 486 return ret; 487 487 if (copy_to_user(argp, &geo, sizeof(geo))) ··· 515 515 * want to override it. 516 516 */ 517 517 geo.start = get_start_sect(bdev); 518 - ret = disk->fops->getgeo(bdev, &geo); 518 + ret = disk->fops->getgeo(disk, &geo); 519 519 if (ret) 520 520 return ret; 521 521

+9 -10

block/kyber-iosched.c

··· 399 399 return ERR_PTR(ret); 400 400 } 401 401 402 + static void kyber_depth_updated(struct request_queue *q) 403 + { 404 + struct kyber_queue_data *kqd = q->elevator->elevator_data; 405 + 406 + kqd->async_depth = q->nr_requests * KYBER_ASYNC_PERCENT / 100U; 407 + blk_mq_set_min_shallow_depth(q, kqd->async_depth); 408 + } 409 + 402 410 static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq) 403 411 { 404 412 struct kyber_queue_data *kqd; ··· 421 413 422 414 eq->elevator_data = kqd; 423 415 q->elevator = eq; 416 + kyber_depth_updated(q); 424 417 425 418 return 0; 426 419 } ··· 447 438 spin_lock_init(&kcq->lock); 448 439 for (i = 0; i < KYBER_NUM_DOMAINS; i++) 449 440 INIT_LIST_HEAD(&kcq->rq_list[i]); 450 - } 451 - 452 - static void kyber_depth_updated(struct blk_mq_hw_ctx *hctx) 453 - { 454 - struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data; 455 - struct blk_mq_tags *tags = hctx->sched_tags; 456 - 457 - kqd->async_depth = hctx->queue->nr_requests * KYBER_ASYNC_PERCENT / 100U; 458 - sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, kqd->async_depth); 459 441 } 460 442 461 443 static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx) ··· 493 493 khd->batching = 0; 494 494 495 495 hctx->sched_data = khd; 496 - kyber_depth_updated(hctx); 497 496 498 497 return 0; 499 498
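The async depth computed by the queue-wide kyber_depth_updated() is a simple percentage of nr_requests using integer arithmetic. The sketch below shows that calculation in user space; the function name is hypothetical and the percentage is passed as a parameter because the value of KYBER_ASYNC_PERCENT is not shown in this diff.

```c
/*
 * Integer percentage of the request count, truncating toward zero,
 * as in: q->nr_requests * KYBER_ASYNC_PERCENT / 100U
 */
static unsigned int kyber_async_depth(unsigned int nr_requests,
				      unsigned int async_percent)
{
	return nr_requests * async_percent / 100U;
}
```

Assuming a percentage of 75 purely for illustration, 256 requests would give an async depth of 192, and 10 requests would give 7 because of the truncating division.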

+3 -17

block/mq-deadline.c

··· 136 136 struct rb_node *node = per_prio->sort_list[data_dir].rb_node; 137 137 struct request *rq, *res = NULL; 138 138 139 - if (!node) 140 - return NULL; 141 - 142 - rq = rb_entry_rq(node); 143 139 while (node) { 144 140 rq = rb_entry_rq(node); 145 141 if (blk_rq_pos(rq) >= pos) { ··· 503 507 } 504 508 505 509 /* Called by blk_mq_update_nr_requests(). */ 506 - static void dd_depth_updated(struct blk_mq_hw_ctx *hctx) 510 + static void dd_depth_updated(struct request_queue *q) 507 511 { 508 - struct request_queue *q = hctx->queue; 509 512 struct deadline_data *dd = q->elevator->elevator_data; 510 - struct blk_mq_tags *tags = hctx->sched_tags; 511 513 512 514 dd->async_depth = q->nr_requests; 513 - 514 - sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1); 515 - } 516 - 517 - /* Called by blk_mq_init_hctx() and blk_mq_init_sched(). */ 518 - static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx) 519 - { 520 - dd_depth_updated(hctx); 521 - return 0; 515 + blk_mq_set_min_shallow_depth(q, 1); 522 516 } 523 517 524 518 static void dd_exit_sched(struct elevator_queue *e) ··· 573 587 blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q); 574 588 575 589 q->elevator = eq; 590 + dd_depth_updated(q); 576 591 return 0; 577 592 } 578 593 ··· 1035 1048 .has_work = dd_has_work, 1036 1049 .init_sched = dd_init_sched, 1037 1050 .exit_sched = dd_exit_sched, 1038 - .init_hctx = dd_init_hctx, 1039 1051 }, 1040 1052 1041 1053 #ifdef CONFIG_BLK_DEBUG_FS

+1 -1

block/partitions/ibm.c

··· 358 358 goto out_nolab; 359 359 /* set start if not filled by getgeo function e.g. virtblk */ 360 360 geo->start = get_start_sect(bdev); 361 - if (disk->fops->getgeo(bdev, geo)) 361 + if (disk->fops->getgeo(disk, geo)) 362 362 goto out_freeall; 363 363 if (!fn || fn(disk, info)) { 364 364 kfree(info);

+2 -2

drivers/ata/libata-scsi.c

··· 351 351 /** 352 352 * ata_std_bios_param - generic bios head/sector/cylinder calculator used by sd. 353 353 * @sdev: SCSI device for which BIOS geometry is to be determined 354 - * @bdev: block device associated with @sdev 354 + * @unused: gendisk associated with @sdev 355 355 * @capacity: capacity of SCSI device 356 356 * @geom: location to which geometry will be output 357 357 * ··· 366 366 * RETURNS: 367 367 * Zero. 368 368 */ 369 - int ata_std_bios_param(struct scsi_device *sdev, struct block_device *bdev, 369 + int ata_std_bios_param(struct scsi_device *sdev, struct gendisk *unused, 370 370 sector_t capacity, int geom[]) 371 371 { 372 372 geom[0] = 255;

+1 -9

drivers/block/Kconfig

··· 17 17 if BLK_DEV 18 18 19 19 source "drivers/block/null_blk/Kconfig" 20 + source "drivers/block/rnull/Kconfig" 20 21 21 22 config BLK_DEV_FD 22 23 tristate "Normal floppy disk support" ··· 311 310 help 312 311 This is the virtual block driver for virtio. It can be used with 313 312 QEMU based VMMs (like KVM or Xen). Say Y or M. 314 - 315 - config BLK_DEV_RUST_NULL 316 - tristate "Rust null block driver (Experimental)" 317 - depends on RUST 318 - help 319 - This is the Rust implementation of the null block driver. For now it 320 - is only a minimal stub. 321 - 322 - If unsure, say N. 323 313 324 314 config BLK_DEV_RBD 325 315 tristate "Rados block device (RBD)"

+1 -3

drivers/block/Makefile

··· 9 9 # needed for trace events 10 10 ccflags-y += -I$(src) 11 11 12 - obj-$(CONFIG_BLK_DEV_RUST_NULL) += rnull_mod.o 13 - rnull_mod-y := rnull.o 14 - 15 12 obj-$(CONFIG_MAC_FLOPPY) += swim3.o 16 13 obj-$(CONFIG_BLK_DEV_SWIM) += swim_mod.o 17 14 obj-$(CONFIG_BLK_DEV_FD) += floppy.o ··· 35 38 obj-$(CONFIG_BLK_DEV_RNBD) += rnbd/ 36 39 37 40 obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk/ 41 + obj-$(CONFIG_BLK_DEV_RUST_NULL) += rnull/ 38 42 39 43 obj-$(CONFIG_BLK_DEV_UBLK) += ublk_drv.o 40 44 obj-$(CONFIG_BLK_DEV_ZONED_LOOP) += zloop.o

+5 -5

drivers/block/amiflop.c

··· 1523 1523 return BLK_STS_OK; 1524 1524 } 1525 1525 1526 - static int fd_getgeo(struct block_device *bdev, struct hd_geometry *geo) 1526 + static int fd_getgeo(struct gendisk *disk, struct hd_geometry *geo) 1527 1527 { 1528 - int drive = MINOR(bdev->bd_dev) & 3; 1528 + struct amiga_floppy_struct *p = disk->private_data; 1529 1529 1530 - geo->heads = unit[drive].type->heads; 1531 - geo->sectors = unit[drive].dtype->sects * unit[drive].type->sect_mult; 1532 - geo->cylinders = unit[drive].type->tracks; 1530 + geo->heads = p->type->heads; 1531 + geo->sectors = p->dtype->sects * p->type->sect_mult; 1532 + geo->cylinders = p->type->tracks; 1533 1533 return 0; 1534 1534 } 1535 1535
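
The getgeo conversions in this series follow one pattern: the callback receives the gendisk directly and reads driver state from its private_data, rather than deriving a drive index from a block_device. A minimal userspace sketch of that shape follows. The struct layouts and the demo_ names are stand-ins invented for illustration; they are not the kernel's definitions.

```c
#include <stddef.h>

/* Userspace stand-ins; the kernel's real structures have many more fields. */
struct hd_geometry {
    unsigned char heads;
    unsigned char sectors;
    unsigned short cylinders;
    unsigned long start;
};

struct gendisk {
    void *private_data; /* driver-owned per-disk state */
};

/* Hypothetical per-drive state, loosely modelled on the fields used above. */
struct demo_floppy {
    unsigned char heads;
    unsigned char sects;
    unsigned char sect_mult;
    unsigned short tracks;
};

/* New-style callback: takes the disk itself, no block_device indirection. */
int demo_getgeo(struct gendisk *disk, struct hd_geometry *geo)
{
    const struct demo_floppy *p = disk->private_data;

    geo->heads = p->heads;
    geo->sectors = (unsigned char)(p->sects * p->sect_mult);
    geo->cylinders = p->tracks;
    return 0;
}
```

With this shape the callback no longer depends on how minor numbers map to drives; it only needs whatever the driver stored in private_data when the disk was set up.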

+2 -2

drivers/block/aoe/aoeblk.c

··· 269 269 } 270 270 271 271 static int 272 - aoeblk_getgeo(struct block_device *bdev, struct hd_geometry *geo) 272 + aoeblk_getgeo(struct gendisk *disk, struct hd_geometry *geo) 273 273 { 274 - struct aoedev *d = bdev->bd_disk->private_data; 274 + struct aoedev *d = disk->private_data; 275 275 276 276 if ((d->flags & DEVFL_UP) == 0) { 277 277 printk(KERN_ERR "aoe: disk not up\n");

+1 -1

drivers/block/aoe/aoemain.c

··· 44 44 { 45 45 int ret; 46 46 47 - aoe_wq = alloc_workqueue("aoe_wq", 0, 0); 47 + aoe_wq = alloc_workqueue("aoe_wq", WQ_PERCPU, 0); 48 48 if (!aoe_wq) 49 49 return -ENOMEM; 50 50

+48 -27

drivers/block/brd.c

··· 44 44 }; 45 45 46 46 /* 47 - * Look up and return a brd's page for a given sector. 47 + * Look up and return a brd's page with reference grabbed for a given sector. 48 48 */ 49 49 static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector) 50 50 { 51 - return xa_load(&brd->brd_pages, sector >> PAGE_SECTORS_SHIFT); 51 + struct page *page; 52 + XA_STATE(xas, &brd->brd_pages, sector >> PAGE_SECTORS_SHIFT); 53 + 54 + rcu_read_lock(); 55 + repeat: 56 + page = xas_load(&xas); 57 + if (xas_retry(&xas, page)) { 58 + xas_reset(&xas); 59 + goto repeat; 60 + } 61 + 62 + if (!page) 63 + goto out; 64 + 65 + if (!get_page_unless_zero(page)) { 66 + xas_reset(&xas); 67 + goto repeat; 68 + } 69 + 70 + if (unlikely(page != xas_reload(&xas))) { 71 + put_page(page); 72 + xas_reset(&xas); 73 + goto repeat; 74 + } 75 + out: 76 + rcu_read_unlock(); 77 + 78 + return page; 52 79 } 53 80 54 81 /* 55 82 * Insert a new page for a given sector, if one does not already exist. 83 + * The returned page will grab reference. 56 84 */ 57 85 static struct page *brd_insert_page(struct brd_device *brd, sector_t sector, 58 86 blk_opf_t opf) 59 - __releases(rcu) 60 - __acquires(rcu) 61 87 { 62 88 gfp_t gfp = (opf & REQ_NOWAIT) ? 
GFP_NOWAIT : GFP_NOIO; 63 89 struct page *page, *ret; 64 90 65 - rcu_read_unlock(); 66 91 page = alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM); 67 - if (!page) { 68 - rcu_read_lock(); 92 + if (!page) 69 93 return ERR_PTR(-ENOMEM); 70 - } 71 94 72 95 xa_lock(&brd->brd_pages); 73 96 ret = __xa_cmpxchg(&brd->brd_pages, sector >> PAGE_SECTORS_SHIFT, NULL, 74 97 page, gfp); 75 - rcu_read_lock(); 76 - if (ret) { 98 + if (!ret) { 99 + brd->brd_nr_pages++; 100 + get_page(page); 77 101 xa_unlock(&brd->brd_pages); 78 - __free_page(page); 79 - if (xa_is_err(ret)) 80 - return ERR_PTR(xa_err(ret)); 102 + return page; 103 + } 104 + 105 + if (!xa_is_err(ret)) { 106 + get_page(ret); 107 + xa_unlock(&brd->brd_pages); 108 + put_page(page); 81 109 return ret; 82 110 } 83 - brd->brd_nr_pages++; 111 + 84 112 xa_unlock(&brd->brd_pages); 85 - return page; 113 + put_page(page); 114 + return ERR_PTR(xa_err(ret)); 86 115 } 87 116 88 117 /* ··· 124 95 pgoff_t idx; 125 96 126 97 xa_for_each(&brd->brd_pages, idx, page) { 127 - __free_page(page); 98 + put_page(page); 128 99 cond_resched(); 129 100 } 130 101 ··· 146 117 147 118 bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset); 148 119 149 - rcu_read_lock(); 150 120 page = brd_lookup_page(brd, sector); 151 121 if (!page && op_is_write(opf)) { 152 122 page = brd_insert_page(brd, sector, opf); ··· 163 135 memset(kaddr, 0, bv.bv_len); 164 136 } 165 137 kunmap_local(kaddr); 166 - rcu_read_unlock(); 167 138 168 139 bio_advance_iter_single(bio, &bio->bi_iter, bv.bv_len); 140 + if (page) 141 + put_page(page); 169 142 return true; 170 143 171 144 out_error: 172 - rcu_read_unlock(); 173 145 if (PTR_ERR(page) == -ENOMEM && (opf & REQ_NOWAIT)) 174 146 bio_wouldblock_error(bio); 175 147 else 176 148 bio_io_error(bio); 177 149 return false; 178 - } 179 - 180 - static void brd_free_one_page(struct rcu_head *head) 181 - { 182 - struct page *page = container_of(head, struct page, rcu_head); 183 - 184 - __free_page(page); 185 150 } 186 151 187 152 static void 
brd_do_discard(struct brd_device *brd, sector_t sector, u32 size) ··· 191 170 while (aligned_sector < aligned_end && aligned_sector < rd_size * 2) { 192 171 page = __xa_erase(&brd->brd_pages, aligned_sector >> PAGE_SECTORS_SHIFT); 193 172 if (page) { 194 - call_rcu(&page->rcu_head, brd_free_one_page); 173 + put_page(page); 195 174 brd->brd_nr_pages--; 196 175 } 197 176 aligned_sector += PAGE_SECTORS;
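
The reworked brd_lookup_page() relies on taking a reference only when the page is still live, then re-checking the slot. The "increment unless zero" half of that idea can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the kernel's get_page_unless_zero(); struct demo_page and the demo_ helpers are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for a page. */
struct demo_page {
    atomic_int refcount;
};

/* Take a reference only if the object is still live (count > 0). */
bool demo_get_unless_zero(struct demo_page *page)
{
    int old = atomic_load(&page->refcount);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when the caller released the last one. */
bool demo_put(struct demo_page *page)
{
    return atomic_fetch_sub(&page->refcount, 1) == 1;
}
```

A reader that loses the race with a concurrent free sees the count at zero and retries the lookup instead of touching a page that is going away, which is why the diff can drop the RCU-deferred free in the discard path.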

+32 -37

drivers/block/floppy.c

··· 163 163 164 164 /* do print messages for unexpected interrupts */ 165 165 static int print_unex = 1; 166 - #include <linux/module.h> 167 - #include <linux/sched.h> 168 - #include <linux/fs.h> 169 - #include <linux/kernel.h> 170 - #include <linux/timer.h> 171 - #include <linux/workqueue.h> 172 - #include <linux/fdreg.h> 173 - #include <linux/fd.h> 174 - #include <linux/hdreg.h> 175 - #include <linux/errno.h> 176 - #include <linux/slab.h> 177 - #include <linux/mm.h> 178 - #include <linux/bio.h> 179 - #include <linux/string.h> 180 - #include <linux/jiffies.h> 181 - #include <linux/fcntl.h> 182 - #include <linux/delay.h> 183 - #include <linux/mc146818rtc.h> /* CMOS defines */ 184 - #include <linux/ioport.h> 185 - #include <linux/interrupt.h> 186 - #include <linux/init.h> 187 - #include <linux/major.h> 188 - #include <linux/platform_device.h> 189 - #include <linux/mod_devicetable.h> 190 - #include <linux/mutex.h> 191 - #include <linux/io.h> 192 - #include <linux/uaccess.h> 193 166 #include <linux/async.h> 167 + #include <linux/bio.h> 194 168 #include <linux/compat.h> 169 + #include <linux/delay.h> 170 + #include <linux/errno.h> 171 + #include <linux/fcntl.h> 172 + #include <linux/fd.h> 173 + #include <linux/fdreg.h> 174 + #include <linux/fs.h> 175 + #include <linux/hdreg.h> 176 + #include <linux/init.h> 177 + #include <linux/interrupt.h> 178 + #include <linux/io.h> 179 + #include <linux/ioport.h> 180 + #include <linux/jiffies.h> 181 + #include <linux/kernel.h> 182 + #include <linux/major.h> 183 + #include <linux/mc146818rtc.h> /* CMOS defines */ 184 + #include <linux/mm.h> 185 + #include <linux/mod_devicetable.h> 186 + #include <linux/module.h> 187 + #include <linux/mutex.h> 188 + #include <linux/platform_device.h> 189 + #include <linux/sched.h> 190 + #include <linux/slab.h> 191 + #include <linux/string.h> 192 + #include <linux/timer.h> 193 + #include <linux/uaccess.h> 194 + #include <linux/workqueue.h> 195 195 196 196 /* 197 197 * PS/2 floppies have much slower 
step rates than regular floppies. ··· 232 232 static unsigned short virtual_dma_port = 0x3f0; 233 233 irqreturn_t floppy_interrupt(int irq, void *dev_id); 234 234 static int set_dor(int fdc, char mask, char data); 235 - 236 - #define K_64 0x10000 /* 64KB */ 237 235 238 236 /* the following is the mask of allowed drives. By default units 2 and 239 237 * 3 of both floppy controllers are disabled, because switching on the ··· 3090 3092 *rcmd = NULL; 3091 3093 3092 3094 loop: 3093 - ptr = kmalloc(sizeof(struct floppy_raw_cmd), GFP_KERNEL); 3094 - if (!ptr) 3095 - return -ENOMEM; 3095 + ptr = memdup_user(param, sizeof(*ptr)); 3096 + if (IS_ERR(ptr)) 3097 + return PTR_ERR(ptr); 3096 3098 *rcmd = ptr; 3097 - ret = copy_from_user(ptr, param, sizeof(*ptr)); 3098 3099 ptr->next = NULL; 3099 3100 ptr->buffer_length = 0; 3100 3101 ptr->kernel_data = NULL; 3101 - if (ret) 3102 - return -EFAULT; 3103 3102 param += sizeof(struct floppy_raw_cmd); 3104 3103 if (ptr->cmd_count > FD_RAW_CMD_FULLSIZE) 3105 3104 return -EINVAL; ··· 3358 3363 return 0; 3359 3364 } 3360 3365 3361 - static int fd_getgeo(struct block_device *bdev, struct hd_geometry *geo) 3366 + static int fd_getgeo(struct gendisk *disk, struct hd_geometry *geo) 3362 3367 { 3363 - int drive = (long)bdev->bd_disk->private_data; 3368 + int drive = (long)disk->private_data; 3364 3369 int type = ITYPE(drive_state[drive].fd_device); 3365 3370 struct floppy_struct *g; 3366 3371 int ret;
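
The raw-command hunk above replaces a kmalloc() plus copy_from_user() pair with memdup_user(), which reports failure through an error-encoded pointer checked with IS_ERR() and PTR_ERR(). A rough userspace sketch of that error-pointer convention follows. The demo_ helpers are hypothetical and there is no user/kernel boundary here, so only the allocation-failure path is modelled.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed to mirror the kernel's convention: the top 4095 addresses are errors. */
#define DEMO_MAX_ERRNO 4095

void *demo_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

long demo_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

int demo_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Allocate and copy in one step; failure comes back as an encoded errno. */
void *demo_memdup(const void *src, size_t len)
{
    void *p = malloc(len);

    if (!p)
        return demo_err_ptr(-ENOMEM);
    memcpy(p, src, len);
    return p;
}
```

Folding allocation and copy into one call leaves a single failure check, which is what lets the diff remove the separate ret variable and the late -EFAULT return.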

+3 -3

drivers/block/mtip32xx/mtip32xx.c

··· 3148 3148 * that each partition is also 4KB aligned. Non-aligned partitions adversely 3149 3149 * affects performance. 3150 3150 * 3151 - * @dev Pointer to the block_device strucutre. 3151 + * @disk Pointer to the gendisk structure. 3152 3152 * @geo Pointer to a hd_geometry structure. 3153 3153 * 3154 3154 * return value 3155 3155 * 0 Operation completed successfully. 3156 3156 * -ENOTTY An error occurred while reading the drive capacity. 3157 3157 */ 3158 - static int mtip_block_getgeo(struct block_device *dev, 3158 + static int mtip_block_getgeo(struct gendisk *disk, 3159 3159 struct hd_geometry *geo) 3160 3160 { 3161 - struct driver_data *dd = dev->bd_disk->private_data; 3161 + struct driver_data *dd = disk->private_data; 3162 3162 sector_t capacity; 3163 3163 3164 3164 if (!dd)

+9 -1

drivers/block/nbd.c

··· 311 311 if (args) { 312 312 INIT_WORK(&args->work, nbd_dead_link_work); 313 313 args->index = nbd->index; 314 - queue_work(system_wq, &args->work); 314 + queue_work(system_percpu_wq, &args->work); 315 315 } 316 316 } 317 317 if (!nsock->dead) { ··· 1216 1216 sock = sockfd_lookup(fd, err); 1217 1217 if (!sock) 1218 1218 return NULL; 1219 + 1220 + if (!sk_is_tcp(sock->sk) && 1221 + !sk_is_stream_unix(sock->sk)) { 1222 + dev_err(disk_to_dev(nbd->disk), "Unsupported socket: should be TCP or UNIX.\n"); 1223 + *err = -EINVAL; 1224 + sockfd_put(sock); 1225 + return NULL; 1226 + } 1219 1227 1220 1228 if (sock->ops->shutdown == sock_no_shutdown) { 1221 1229 dev_err(disk_to_dev(nbd->disk), "Unsupported socket: shutdown callout must be supported.\n");

+1 -1

drivers/block/null_blk/main.c

··· 223 223 224 224 static unsigned long g_cache_size; 225 225 module_param_named(cache_size, g_cache_size, ulong, 0444); 226 - MODULE_PARM_DESC(mbps, "Cache size in MiB for memory-backed device. Default: 0 (none)"); 226 + MODULE_PARM_DESC(cache_size, "Cache size in MiB for memory-backed device. Default: 0 (none)"); 227 227 228 228 static bool g_fua = true; 229 229 module_param_named(fua, g_fua, bool, 0444);

+1 -1

drivers/block/rbd.c

··· 7389 7389 * The number of active work items is limited by the number of 7390 7390 * rbd devices * queue depth, so leave @max_active at default. 7391 7391 */ 7392 - rbd_wq = alloc_workqueue(RBD_DRV_NAME, WQ_MEM_RECLAIM, 0); 7392 + rbd_wq = alloc_workqueue(RBD_DRV_NAME, WQ_MEM_RECLAIM | WQ_PERCPU, 0); 7393 7393 if (!rbd_wq) { 7394 7394 rc = -ENOMEM; 7395 7395 goto err_out_slab;

+3 -3

drivers/block/rnbd/rnbd-clt.c

··· 942 942 rnbd_clt_put_dev(dev); 943 943 } 944 944 945 - static int rnbd_client_getgeo(struct block_device *block_device, 945 + static int rnbd_client_getgeo(struct gendisk *disk, 946 946 struct hd_geometry *geo) 947 947 { 948 948 u64 size; 949 - struct rnbd_clt_dev *dev = block_device->bd_disk->private_data; 949 + struct rnbd_clt_dev *dev = disk->private_data; 950 950 struct queue_limits *limit = &dev->queue->limits; 951 951 952 952 size = dev->size * (limit->logical_block_size / SECTOR_SIZE); ··· 1809 1809 unregister_blkdev(rnbd_client_major, "rnbd"); 1810 1810 return err; 1811 1811 } 1812 - rnbd_clt_wq = alloc_workqueue("rnbd_clt_wq", 0, 0); 1812 + rnbd_clt_wq = alloc_workqueue("rnbd_clt_wq", WQ_PERCPU, 0); 1813 1813 if (!rnbd_clt_wq) { 1814 1814 pr_err("Failed to load module, alloc_workqueue failed.\n"); 1815 1815 rnbd_clt_destroy_sysfs_files();

-80

drivers/block/rnull.rs

··· 1 - // SPDX-License-Identifier: GPL-2.0 2 - 3 - //! This is a Rust implementation of the C null block driver. 4 - //! 5 - //! Supported features: 6 - //! 7 - //! - blk-mq interface 8 - //! - direct completion 9 - //! - block size 4k 10 - //! 11 - //! The driver is not configurable. 12 - 13 - use kernel::{ 14 - alloc::flags, 15 - block::mq::{ 16 - self, 17 - gen_disk::{self, GenDisk}, 18 - Operations, TagSet, 19 - }, 20 - error::Result, 21 - new_mutex, pr_info, 22 - prelude::*, 23 - sync::{Arc, Mutex}, 24 - types::ARef, 25 - }; 26 - 27 - module! { 28 - type: NullBlkModule, 29 - name: "rnull_mod", 30 - authors: ["Andreas Hindborg"], 31 - description: "Rust implementation of the C null block driver", 32 - license: "GPL v2", 33 - } 34 - 35 - #[pin_data] 36 - struct NullBlkModule { 37 - #[pin] 38 - _disk: Mutex<GenDisk<NullBlkDevice>>, 39 - } 40 - 41 - impl kernel::InPlaceModule for NullBlkModule { 42 - fn init(_module: &'static ThisModule) -> impl PinInit<Self, Error> { 43 - pr_info!("Rust null_blk loaded\n"); 44 - 45 - // Use a immediately-called closure as a stable `try` block 46 - let disk = /* try */ (|| { 47 - let tagset = Arc::pin_init(TagSet::new(1, 256, 1), flags::GFP_KERNEL)?; 48 - 49 - gen_disk::GenDiskBuilder::new() 50 - .capacity_sectors(4096 << 11) 51 - .logical_block_size(4096)? 52 - .physical_block_size(4096)? 53 - .rotational(false) 54 - .build(fmt!("rnullb{}", 0), tagset) 55 - })(); 56 - 57 - try_pin_init!(Self { 58 - _disk <- new_mutex!(disk?, "nullb:disk"), 59 - }) 60 - } 61 - } 62 - 63 - struct NullBlkDevice; 64 - 65 - #[vtable] 66 - impl Operations for NullBlkDevice { 67 - #[inline(always)] 68 - fn queue_rq(rq: ARef<mq::Request<Self>>, _is_last: bool) -> Result { 69 - mq::Request::end_ok(rq) 70 - .map_err(|_e| kernel::error::code::EIO) 71 - // We take no refcounts on the request, so we expect to be able to 72 - // end the request. The request reference must be unique at this 73 - // point, and so `end_ok` cannot fail. 
74 - .expect("Fatal error - expected to be able to end request"); 75 - 76 - Ok(()) 77 - } 78 - 79 - fn commit_rqs() {} 80 - }

+13

drivers/block/rnull/Kconfig

··· 1 + # SPDX-License-Identifier: GPL-2.0 2 + # 3 + # Rust null block device driver configuration 4 + 5 + config BLK_DEV_RUST_NULL 6 + tristate "Rust null block driver (Experimental)" 7 + depends on RUST && CONFIGFS_FS 8 + help 9 + This is the Rust implementation of the null block driver. Like 10 + the C version, the driver allows the user to create virtual block 11 + devices that can be configured via various configuration options. 12 + 13 + If unsure, say N.

+3

drivers/block/rnull/Makefile

··· 1 + 2 + obj-$(CONFIG_BLK_DEV_RUST_NULL) += rnull_mod.o 3 + rnull_mod-y := rnull.o

+262

drivers/block/rnull/configfs.rs

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + use super::{NullBlkDevice, THIS_MODULE}; 4 + use core::fmt::{Display, Write}; 5 + use kernel::{ 6 + block::mq::gen_disk::{GenDisk, GenDiskBuilder}, 7 + c_str, 8 + configfs::{self, AttributeOperations}, 9 + configfs_attrs, new_mutex, 10 + page::PAGE_SIZE, 11 + prelude::*, 12 + str::{kstrtobool_bytes, CString}, 13 + sync::Mutex, 14 + }; 15 + use pin_init::PinInit; 16 + 17 + pub(crate) fn subsystem() -> impl PinInit<kernel::configfs::Subsystem<Config>, Error> { 18 + let item_type = configfs_attrs! { 19 + container: configfs::Subsystem<Config>, 20 + data: Config, 21 + child: DeviceConfig, 22 + attributes: [ 23 + features: 0, 24 + ], 25 + }; 26 + 27 + kernel::configfs::Subsystem::new(c_str!("rnull"), item_type, try_pin_init!(Config {})) 28 + } 29 + 30 + #[pin_data] 31 + pub(crate) struct Config {} 32 + 33 + #[vtable] 34 + impl AttributeOperations<0> for Config { 35 + type Data = Config; 36 + 37 + fn show(_this: &Config, page: &mut [u8; PAGE_SIZE]) -> Result<usize> { 38 + let mut writer = kernel::str::Formatter::new(page); 39 + writer.write_str("blocksize,size,rotational,irqmode\n")?; 40 + Ok(writer.bytes_written()) 41 + } 42 + } 43 + 44 + #[vtable] 45 + impl configfs::GroupOperations for Config { 46 + type Child = DeviceConfig; 47 + 48 + fn make_group( 49 + &self, 50 + name: &CStr, 51 + ) -> Result<impl PinInit<configfs::Group<DeviceConfig>, Error>> { 52 + let item_type = configfs_attrs! 
{ 53 + container: configfs::Group<DeviceConfig>, 54 + data: DeviceConfig, 55 + attributes: [ 56 + // Named for compatibility with C null_blk 57 + power: 0, 58 + blocksize: 1, 59 + rotational: 2, 60 + size: 3, 61 + irqmode: 4, 62 + ], 63 + }; 64 + 65 + Ok(configfs::Group::new( 66 + name.try_into()?, 67 + item_type, 68 + // TODO: cannot coerce new_mutex!() to impl PinInit<_, Error>, so put mutex inside 69 + try_pin_init!( DeviceConfig { 70 + data <- new_mutex!(DeviceConfigInner { 71 + powered: false, 72 + block_size: 4096, 73 + rotational: false, 74 + disk: None, 75 + capacity_mib: 4096, 76 + irq_mode: IRQMode::None, 77 + name: name.try_into()?, 78 + }), 79 + }), 80 + )) 81 + } 82 + } 83 + 84 + #[derive(Debug, Clone, Copy)] 85 + pub(crate) enum IRQMode { 86 + None, 87 + Soft, 88 + } 89 + 90 + impl TryFrom<u8> for IRQMode { 91 + type Error = kernel::error::Error; 92 + 93 + fn try_from(value: u8) -> Result<Self> { 94 + match value { 95 + 0 => Ok(Self::None), 96 + 1 => Ok(Self::Soft), 97 + _ => Err(EINVAL), 98 + } 99 + } 100 + } 101 + 102 + impl Display for IRQMode { 103 + fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { 104 + match self { 105 + Self::None => f.write_str("0")?, 106 + Self::Soft => f.write_str("1")?, 107 + } 108 + Ok(()) 109 + } 110 + } 111 + 112 + #[pin_data] 113 + pub(crate) struct DeviceConfig { 114 + #[pin] 115 + data: Mutex<DeviceConfigInner>, 116 + } 117 + 118 + #[pin_data] 119 + struct DeviceConfigInner { 120 + powered: bool, 121 + name: CString, 122 + block_size: u32, 123 + rotational: bool, 124 + capacity_mib: u64, 125 + irq_mode: IRQMode, 126 + disk: Option<GenDisk<NullBlkDevice>>, 127 + } 128 + 129 + #[vtable] 130 + impl configfs::AttributeOperations<0> for DeviceConfig { 131 + type Data = DeviceConfig; 132 + 133 + fn show(this: &DeviceConfig, page: &mut [u8; PAGE_SIZE]) -> Result<usize> { 134 + let mut writer = kernel::str::Formatter::new(page); 135 + 136 + if this.data.lock().powered { 137 + writer.write_str("1\n")?; 
138 + } else { 139 + writer.write_str("0\n")?; 140 + } 141 + 142 + Ok(writer.bytes_written()) 143 + } 144 + 145 + fn store(this: &DeviceConfig, page: &[u8]) -> Result { 146 + let power_op = kstrtobool_bytes(page)?; 147 + let mut guard = this.data.lock(); 148 + 149 + if !guard.powered && power_op { 150 + guard.disk = Some(NullBlkDevice::new( 151 + &guard.name, 152 + guard.block_size, 153 + guard.rotational, 154 + guard.capacity_mib, 155 + guard.irq_mode, 156 + )?); 157 + guard.powered = true; 158 + } else if guard.powered && !power_op { 159 + drop(guard.disk.take()); 160 + guard.powered = false; 161 + } 162 + 163 + Ok(()) 164 + } 165 + } 166 + 167 + #[vtable] 168 + impl configfs::AttributeOperations<1> for DeviceConfig { 169 + type Data = DeviceConfig; 170 + 171 + fn show(this: &DeviceConfig, page: &mut [u8; PAGE_SIZE]) -> Result<usize> { 172 + let mut writer = kernel::str::Formatter::new(page); 173 + writer.write_fmt(fmt!("{}\n", this.data.lock().block_size))?; 174 + Ok(writer.bytes_written()) 175 + } 176 + 177 + fn store(this: &DeviceConfig, page: &[u8]) -> Result { 178 + if this.data.lock().powered { 179 + return Err(EBUSY); 180 + } 181 + 182 + let text = core::str::from_utf8(page)?.trim(); 183 + let value = text.parse::<u32>().map_err(|_| EINVAL)?; 184 + 185 + GenDiskBuilder::validate_block_size(value)?; 186 + this.data.lock().block_size = value; 187 + Ok(()) 188 + } 189 + } 190 + 191 + #[vtable] 192 + impl configfs::AttributeOperations<2> for DeviceConfig { 193 + type Data = DeviceConfig; 194 + 195 + fn show(this: &DeviceConfig, page: &mut [u8; PAGE_SIZE]) -> Result<usize> { 196 + let mut writer = kernel::str::Formatter::new(page); 197 + 198 + if this.data.lock().rotational { 199 + writer.write_str("1\n")?; 200 + } else { 201 + writer.write_str("0\n")?; 202 + } 203 + 204 + Ok(writer.bytes_written()) 205 + } 206 + 207 + fn store(this: &DeviceConfig, page: &[u8]) -> Result { 208 + if this.data.lock().powered { 209 + return Err(EBUSY); 210 + } 211 + 212 + 
this.data.lock().rotational = kstrtobool_bytes(page)?; 213 + 214 + Ok(()) 215 + } 216 + } 217 + 218 + #[vtable] 219 + impl configfs::AttributeOperations<3> for DeviceConfig { 220 + type Data = DeviceConfig; 221 + 222 + fn show(this: &DeviceConfig, page: &mut [u8; PAGE_SIZE]) -> Result<usize> { 223 + let mut writer = kernel::str::Formatter::new(page); 224 + writer.write_fmt(fmt!("{}\n", this.data.lock().capacity_mib))?; 225 + Ok(writer.bytes_written()) 226 + } 227 + 228 + fn store(this: &DeviceConfig, page: &[u8]) -> Result { 229 + if this.data.lock().powered { 230 + return Err(EBUSY); 231 + } 232 + 233 + let text = core::str::from_utf8(page)?.trim(); 234 + let value = text.parse::<u64>().map_err(|_| EINVAL)?; 235 + 236 + this.data.lock().capacity_mib = value; 237 + Ok(()) 238 + } 239 + } 240 + 241 + #[vtable] 242 + impl configfs::AttributeOperations<4> for DeviceConfig { 243 + type Data = DeviceConfig; 244 + 245 + fn show(this: &DeviceConfig, page: &mut [u8; PAGE_SIZE]) -> Result<usize> { 246 + let mut writer = kernel::str::Formatter::new(page); 247 + writer.write_fmt(fmt!("{}\n", this.data.lock().irq_mode))?; 248 + Ok(writer.bytes_written()) 249 + } 250 + 251 + fn store(this: &DeviceConfig, page: &[u8]) -> Result { 252 + if this.data.lock().powered { 253 + return Err(EBUSY); 254 + } 255 + 256 + let text = core::str::from_utf8(page)?.trim(); 257 + let value = text.parse::<u8>().map_err(|_| EINVAL)?; 258 + 259 + this.data.lock().irq_mode = IRQMode::try_from(value)?; 260 + Ok(()) 261 + } 262 + }

+104

drivers/block/rnull/rnull.rs

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + 3 + //! This is a Rust implementation of the C null block driver. 4 + 5 + mod configfs; 6 + 7 + use configfs::IRQMode; 8 + use kernel::{ 9 + block::{ 10 + self, 11 + mq::{ 12 + self, 13 + gen_disk::{self, GenDisk}, 14 + Operations, TagSet, 15 + }, 16 + }, 17 + error::Result, 18 + pr_info, 19 + prelude::*, 20 + sync::Arc, 21 + types::ARef, 22 + }; 23 + use pin_init::PinInit; 24 + 25 + module! { 26 + type: NullBlkModule, 27 + name: "rnull_mod", 28 + authors: ["Andreas Hindborg"], 29 + description: "Rust implementation of the C null block driver", 30 + license: "GPL v2", 31 + } 32 + 33 + #[pin_data] 34 + struct NullBlkModule { 35 + #[pin] 36 + configfs_subsystem: kernel::configfs::Subsystem<configfs::Config>, 37 + } 38 + 39 + impl kernel::InPlaceModule for NullBlkModule { 40 + fn init(_module: &'static ThisModule) -> impl PinInit<Self, Error> { 41 + pr_info!("Rust null_blk loaded\n"); 42 + 43 + try_pin_init!(Self { 44 + configfs_subsystem <- configfs::subsystem(), 45 + }) 46 + } 47 + } 48 + 49 + struct NullBlkDevice; 50 + 51 + impl NullBlkDevice { 52 + fn new( 53 + name: &CStr, 54 + block_size: u32, 55 + rotational: bool, 56 + capacity_mib: u64, 57 + irq_mode: IRQMode, 58 + ) -> Result<GenDisk<Self>> { 59 + let tagset = Arc::pin_init(TagSet::new(1, 256, 1), GFP_KERNEL)?; 60 + 61 + let queue_data = Box::new(QueueData { irq_mode }, GFP_KERNEL)?; 62 + 63 + gen_disk::GenDiskBuilder::new() 64 + .capacity_sectors(capacity_mib << (20 - block::SECTOR_SHIFT)) 65 + .logical_block_size(block_size)? 66 + .physical_block_size(block_size)? 
67 + .rotational(rotational) 68 + .build(fmt!("{}", name.to_str()?), tagset, queue_data) 69 + } 70 + } 71 + 72 + struct QueueData { 73 + irq_mode: IRQMode, 74 + } 75 + 76 + #[vtable] 77 + impl Operations for NullBlkDevice { 78 + type QueueData = KBox<QueueData>; 79 + 80 + #[inline(always)] 81 + fn queue_rq(queue_data: &QueueData, rq: ARef<mq::Request<Self>>, _is_last: bool) -> Result { 82 + match queue_data.irq_mode { 83 + IRQMode::None => mq::Request::end_ok(rq) 84 + .map_err(|_e| kernel::error::code::EIO) 85 + // We take no refcounts on the request, so we expect to be able to 86 + // end the request. The request reference must be unique at this 87 + // point, and so `end_ok` cannot fail. 88 + .expect("Fatal error - expected to be able to end request"), 89 + IRQMode::Soft => mq::Request::complete(rq), 90 + } 91 + Ok(()) 92 + } 93 + 94 + fn commit_rqs(_queue_data: &QueueData) {} 95 + 96 + fn complete(rq: ARef<mq::Request<Self>>) { 97 + mq::Request::end_ok(rq) 98 + .map_err(|_e| kernel::error::code::EIO) 99 + // We take no refcounts on the request, so we expect to be able to 100 + // end the request. The request reference must be unique at this 101 + // point, and so `end_ok` cannot fail. 102 + .expect("Fatal error - expected to be able to end request"); 103 + } 104 + }
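
The capacity expression in NullBlkDevice::new() converts MiB to 512-byte sectors with a shift, and its default of 4096 MiB reproduces the hard-coded 4096 << 11 from the removed driver. The arithmetic is sketched below in C, assuming a sector shift of 9. DEMO_SECTOR_SHIFT and demo_mib_to_sectors are illustrative names, not kernel symbols.

```c
#include <stdint.h>

/* Assumed to match the block layer's 512-byte sector (2^9 bytes). */
#define DEMO_SECTOR_SHIFT 9

/* 1 MiB = 2^20 bytes, so one MiB holds 2^(20 - 9) = 2048 sectors. */
uint64_t demo_mib_to_sectors(uint64_t mib)
{
    return mib << (20 - DEMO_SECTOR_SHIFT);
}
```

Under that assumption the default configuration yields 4096 * 2048 = 8388608 sectors, i.e. a 4 GiB device, the same size the old fixed-function driver exposed.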

+3 -4

drivers/block/sunvdc.c

··· 119 119 return vio_dring_avail(dr, VDC_TX_RING_SIZE); 120 120 } 121 121 122 - static int vdc_getgeo(struct block_device *bdev, struct hd_geometry *geo) 122 + static int vdc_getgeo(struct gendisk *disk, struct hd_geometry *geo) 123 123 { 124 - struct gendisk *disk = bdev->bd_disk; 125 124 sector_t nsect = get_capacity(disk); 126 125 sector_t cylinders = nsect; 127 126 ··· 1188 1189 } 1189 1190 1190 1191 if (port->ldc_timeout) 1191 - mod_delayed_work(system_wq, &port->ldc_reset_timer_work, 1192 + mod_delayed_work(system_percpu_wq, &port->ldc_reset_timer_work, 1192 1193 round_jiffies(jiffies + HZ * port->ldc_timeout)); 1193 1194 mod_timer(&port->vio.timer, round_jiffies(jiffies + HZ)); 1194 1195 return; ··· 1216 1217 { 1217 1218 int err; 1218 1219 1219 - sunvdc_wq = alloc_workqueue("sunvdc", 0, 0); 1220 + sunvdc_wq = alloc_workqueue("sunvdc", WQ_PERCPU, 0); 1220 1221 if (!sunvdc_wq) 1221 1222 return -ENOMEM; 1222 1223

+2 -2

drivers/block/swim.c

··· 711 711 return -ENOTTY; 712 712 } 713 713 714 - static int floppy_getgeo(struct block_device *bdev, struct hd_geometry *geo) 714 + static int floppy_getgeo(struct gendisk *disk, struct hd_geometry *geo) 715 715 { 716 - struct floppy_state *fs = bdev->bd_disk->private_data; 716 + struct floppy_state *fs = disk->private_data; 717 717 struct floppy_struct *g; 718 718 int ret; 719 719

+121 -115

drivers/block/ublk_drv.c

··· 201 201 bool force_abort; 202 202 bool canceling; 203 203 bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */ 204 - unsigned short nr_io_ready; /* how many ios setup */ 205 204 spinlock_t cancel_lock; 206 205 struct ublk_device *dev; 207 206 struct ublk_io ios[]; ··· 233 234 struct ublk_params params; 234 235 235 236 struct completion completion; 236 - unsigned int nr_queues_ready; 237 + u32 nr_io_ready; 237 238 bool unprivileged_daemons; 238 239 struct mutex cancel_mutex; 239 240 bool canceling; ··· 251 252 static void ublk_stop_dev_unlocked(struct ublk_device *ub); 252 253 static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq); 253 254 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub, 254 - const struct ublk_queue *ubq, struct ublk_io *io, 255 - size_t offset); 255 + u16 q_id, u16 tag, struct ublk_io *io, size_t offset); 256 256 static inline unsigned int ublk_req_build_flags(struct request *req); 257 257 258 258 static inline struct ublksrv_io_desc * ··· 530 532 531 533 #endif 532 534 533 - static inline void __ublk_complete_rq(struct request *req); 535 + static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io, 536 + bool need_map); 534 537 535 538 static dev_t ublk_chr_devt; 536 539 static const struct class ublk_chr_class = { ··· 663 664 return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY; 664 665 } 665 666 667 + static inline bool ublk_dev_support_zero_copy(const struct ublk_device *ub) 668 + { 669 + return ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY; 670 + } 671 + 666 672 static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq) 667 673 { 668 674 return ubq->flags & UBLK_F_AUTO_BUF_REG; 675 + } 676 + 677 + static inline bool ublk_dev_support_auto_buf_reg(const struct ublk_device *ub) 678 + { 679 + return ub->dev_info.flags & UBLK_F_AUTO_BUF_REG; 669 680 } 670 681 671 682 static inline bool ublk_support_user_copy(const struct ublk_queue *ubq) ··· 683 674 return 
ubq->flags & UBLK_F_USER_COPY; 684 675 } 685 676 677 + static inline bool ublk_dev_support_user_copy(const struct ublk_device *ub) 678 + { 679 + return ub->dev_info.flags & UBLK_F_USER_COPY; 680 + } 681 + 686 682 static inline bool ublk_need_map_io(const struct ublk_queue *ubq) 687 683 { 688 684 return !ublk_support_user_copy(ubq) && !ublk_support_zero_copy(ubq) && 689 685 !ublk_support_auto_buf_reg(ubq); 686 + } 687 + 688 + static inline bool ublk_dev_need_map_io(const struct ublk_device *ub) 689 + { 690 + return !ublk_dev_support_user_copy(ub) && 691 + !ublk_dev_support_zero_copy(ub) && 692 + !ublk_dev_support_auto_buf_reg(ub); 690 693 } 691 694 692 695 static inline bool ublk_need_req_ref(const struct ublk_queue *ubq) ··· 718 697 ublk_support_auto_buf_reg(ubq); 719 698 } 720 699 700 + static inline bool ublk_dev_need_req_ref(const struct ublk_device *ub) 701 + { 702 + return ublk_dev_support_user_copy(ub) || 703 + ublk_dev_support_zero_copy(ub) || 704 + ublk_dev_support_auto_buf_reg(ub); 705 + } 706 + 721 707 static inline void ublk_init_req_ref(const struct ublk_queue *ubq, 722 708 struct ublk_io *io) 723 709 { ··· 739 711 740 712 static inline void ublk_put_req_ref(struct ublk_io *io, struct request *req) 741 713 { 742 - if (refcount_dec_and_test(&io->ref)) 743 - __ublk_complete_rq(req); 714 + if (!refcount_dec_and_test(&io->ref)) 715 + return; 716 + 717 + /* ublk_need_map_io() and ublk_need_req_ref() are mutually exclusive */ 718 + __ublk_complete_rq(req, io, false); 744 719 } 745 720 746 721 static inline bool ublk_sub_req_ref(struct ublk_io *io) ··· 757 726 static inline bool ublk_need_get_data(const struct ublk_queue *ubq) 758 727 { 759 728 return ubq->flags & UBLK_F_NEED_GET_DATA; 729 + } 730 + 731 + static inline bool ublk_dev_need_get_data(const struct ublk_device *ub) 732 + { 733 + return ub->dev_info.flags & UBLK_F_NEED_GET_DATA; 760 734 } 761 735 762 736 /* Called in slow path only, keep it noinline for trace purpose */ ··· 800 764 return 
round_up(depth * sizeof(struct ublksrv_io_desc), PAGE_SIZE); 801 765 } 802 766 803 - static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id) 767 + static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub) 804 768 { 805 - struct ublk_queue *ubq = ublk_get_queue(ub, q_id); 806 - 807 - return __ublk_queue_cmd_buf_size(ubq->q_depth); 769 + return __ublk_queue_cmd_buf_size(ub->dev_info.queue_depth); 808 770 } 809 771 810 772 static int ublk_max_cmd_buf_size(void) ··· 1053 1019 return rq_bytes; 1054 1020 } 1055 1021 1056 - static int ublk_unmap_io(const struct ublk_queue *ubq, 1022 + static int ublk_unmap_io(bool need_map, 1057 1023 const struct request *req, 1058 1024 const struct ublk_io *io) 1059 1025 { 1060 1026 const unsigned int rq_bytes = blk_rq_bytes(req); 1061 1027 1062 - if (!ublk_need_map_io(ubq)) 1028 + if (!need_map) 1063 1029 return rq_bytes; 1064 1030 1065 1031 if (ublk_need_unmap_req(req)) { ··· 1106 1072 { 1107 1073 struct ublksrv_io_desc *iod = ublk_get_iod(ubq, req->tag); 1108 1074 struct ublk_io *io = &ubq->ios[req->tag]; 1109 - enum req_op op = req_op(req); 1110 1075 u32 ublk_op; 1111 - 1112 - if (!ublk_queue_is_zoned(ubq) && 1113 - (op_is_zone_mgmt(op) || op == REQ_OP_ZONE_APPEND)) 1114 - return BLK_STS_IOERR; 1115 1076 1116 1077 switch (req_op(req)) { 1117 1078 case REQ_OP_READ: ··· 1146 1117 } 1147 1118 1148 1119 /* todo: handle partial completion */ 1149 - static inline void __ublk_complete_rq(struct request *req) 1120 + static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io, 1121 + bool need_map) 1150 1122 { 1151 - struct ublk_queue *ubq = req->mq_hctx->driver_data; 1152 - struct ublk_io *io = &ubq->ios[req->tag]; 1153 1123 unsigned int unmapped_bytes; 1154 1124 blk_status_t res = BLK_STS_OK; 1155 1125 ··· 1172 1144 goto exit; 1173 1145 1174 1146 /* for READ request, writing data in iod->addr to rq buffers */ 1175 - unmapped_bytes = ublk_unmap_io(ubq, req, io); 1147 + unmapped_bytes = 
ublk_unmap_io(need_map, req, io); 1176 1148 1177 1149 /* 1178 1150 * Extremely impossible since we got data filled in just before ··· 1528 1500 { 1529 1501 int i; 1530 1502 1531 - /* All old ioucmds have to be completed */ 1532 - ubq->nr_io_ready = 0; 1533 - 1534 1503 for (i = 0; i < ubq->q_depth; i++) { 1535 1504 struct ublk_io *io = &ubq->ios[i]; 1536 1505 ··· 1576 1551 1577 1552 /* set to NULL, otherwise new tasks cannot mmap io_cmd_buf */ 1578 1553 ub->mm = NULL; 1579 - ub->nr_queues_ready = 0; 1554 + ub->nr_io_ready = 0; 1580 1555 ub->unprivileged_daemons = false; 1581 1556 ub->ublksrv_tgid = -1; 1582 1557 } ··· 1800 1775 __func__, q_id, current->pid, vma->vm_start, 1801 1776 phys_off, (unsigned long)sz); 1802 1777 1803 - if (sz != ublk_queue_cmd_buf_size(ub, q_id)) 1778 + if (sz != ublk_queue_cmd_buf_size(ub)) 1804 1779 return -EINVAL; 1805 1780 1806 1781 pfn = virt_to_phys(ublk_queue_cmd_buf(ub, q_id)) >> PAGE_SHIFT; 1807 1782 return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); 1808 1783 } 1809 1784 1810 - static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io, 1785 + static void __ublk_fail_req(struct ublk_device *ub, struct ublk_io *io, 1811 1786 struct request *req) 1812 1787 { 1813 1788 WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE); 1814 1789 1815 - if (ublk_nosrv_should_reissue_outstanding(ubq->dev)) 1790 + if (ublk_nosrv_should_reissue_outstanding(ub)) 1816 1791 blk_mq_requeue_request(req, false); 1817 1792 else { 1818 1793 io->res = -EIO; 1819 - __ublk_complete_rq(req); 1794 + __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub)); 1820 1795 } 1821 1796 } 1822 1797 ··· 1836 1811 struct ublk_io *io = &ubq->ios[i]; 1837 1812 1838 1813 if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV) 1839 - __ublk_fail_req(ubq, io, io->req); 1814 + __ublk_fail_req(ub, io, io->req); 1840 1815 } 1841 1816 } 1842 1817 ··· 1941 1916 ublk_cancel_cmd(ubq, pdu->tag, issue_flags); 1942 1917 } 1943 1918 1944 - static inline bool 
ublk_queue_ready(struct ublk_queue *ubq) 1919 + static inline bool ublk_dev_ready(const struct ublk_device *ub) 1945 1920 { 1946 - return ubq->nr_io_ready == ubq->q_depth; 1921 + u32 total = (u32)ub->dev_info.nr_hw_queues * ub->dev_info.queue_depth; 1922 + 1923 + return ub->nr_io_ready == total; 1947 1924 } 1948 1925 1949 1926 static void ublk_cancel_queue(struct ublk_queue *ubq) ··· 2069 2042 } 2070 2043 2071 2044 /* device can only be started after all IOs are ready */ 2072 - static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq) 2045 + static void ublk_mark_io_ready(struct ublk_device *ub) 2073 2046 __must_hold(&ub->mutex) 2074 2047 { 2075 - ubq->nr_io_ready++; 2076 - if (ublk_queue_ready(ubq)) 2077 - ub->nr_queues_ready++; 2078 2048 if (!ub->unprivileged_daemons && !capable(CAP_SYS_ADMIN)) 2079 2049 ub->unprivileged_daemons = true; 2080 2050 2081 - if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues) { 2051 + ub->nr_io_ready++; 2052 + if (ublk_dev_ready(ub)) { 2082 2053 /* now we are ready for handling ublk io request */ 2083 2054 ublk_reset_io_flags(ub); 2084 2055 complete_all(&ub->completion); ··· 2147 2122 } 2148 2123 2149 2124 static inline int 2150 - ublk_config_io_buf(const struct ublk_queue *ubq, struct ublk_io *io, 2125 + ublk_config_io_buf(const struct ublk_device *ub, struct ublk_io *io, 2151 2126 struct io_uring_cmd *cmd, unsigned long buf_addr, 2152 2127 u16 *buf_idx) 2153 2128 { 2154 - if (ublk_support_auto_buf_reg(ubq)) 2129 + if (ublk_dev_support_auto_buf_reg(ub)) 2155 2130 return ublk_handle_auto_buf_reg(io, cmd, buf_idx); 2156 2131 2157 2132 io->addr = buf_addr; ··· 2190 2165 } 2191 2166 2192 2167 static int ublk_register_io_buf(struct io_uring_cmd *cmd, 2193 - const struct ublk_queue *ubq, 2168 + struct ublk_device *ub, 2169 + u16 q_id, u16 tag, 2194 2170 struct ublk_io *io, 2195 2171 unsigned int index, unsigned int issue_flags) 2196 2172 { 2197 - struct ublk_device *ub = cmd->file->private_data; 2198 2173 struct 
request *req; 2199 2174 int ret; 2200 2175 2201 - if (!ublk_support_zero_copy(ubq)) 2176 + if (!ublk_dev_support_zero_copy(ub)) 2202 2177 return -EINVAL; 2203 2178 2204 - req = __ublk_check_and_get_req(ub, ubq, io, 0); 2179 + req = __ublk_check_and_get_req(ub, q_id, tag, io, 0); 2205 2180 if (!req) 2206 2181 return -EINVAL; 2207 2182 ··· 2217 2192 2218 2193 static int 2219 2194 ublk_daemon_register_io_buf(struct io_uring_cmd *cmd, 2220 - const struct ublk_queue *ubq, struct ublk_io *io, 2195 + struct ublk_device *ub, 2196 + u16 q_id, u16 tag, struct ublk_io *io, 2221 2197 unsigned index, unsigned issue_flags) 2222 2198 { 2223 2199 unsigned new_registered_buffers; ··· 2231 2205 */ 2232 2206 new_registered_buffers = io->task_registered_buffers + 1; 2233 2207 if (unlikely(new_registered_buffers >= UBLK_REFCOUNT_INIT)) 2234 - return ublk_register_io_buf(cmd, ubq, io, index, issue_flags); 2208 + return ublk_register_io_buf(cmd, ub, q_id, tag, io, index, 2209 + issue_flags); 2235 2210 2236 - if (!ublk_support_zero_copy(ubq) || !ublk_rq_has_data(req)) 2211 + if (!ublk_dev_support_zero_copy(ub) || !ublk_rq_has_data(req)) 2237 2212 return -EINVAL; 2238 2213 2239 2214 ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index, ··· 2256 2229 return io_buffer_unregister_bvec(cmd, index, issue_flags); 2257 2230 } 2258 2231 2259 - static int ublk_check_fetch_buf(const struct ublk_queue *ubq, __u64 buf_addr) 2232 + static int ublk_check_fetch_buf(const struct ublk_device *ub, __u64 buf_addr) 2260 2233 { 2261 - if (ublk_need_map_io(ubq)) { 2234 + if (ublk_dev_need_map_io(ub)) { 2262 2235 /* 2263 2236 * FETCH_RQ has to provide IO buffer if NEED GET 2264 2237 * DATA is not enabled 2265 2238 */ 2266 - if (!buf_addr && !ublk_need_get_data(ubq)) 2239 + if (!buf_addr && !ublk_dev_need_get_data(ub)) 2267 2240 return -EINVAL; 2268 2241 } else if (buf_addr) { 2269 2242 /* User copy requires addr to be unset */ ··· 2272 2245 return 0; 2273 2246 } 2274 2247 2275 - static int 
ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq, 2248 + static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub, 2276 2249 struct ublk_io *io, __u64 buf_addr) 2277 2250 { 2278 - struct ublk_device *ub = ubq->dev; 2279 2251 int ret = 0; 2280 2252 2281 2253 /* ··· 2283 2257 * FETCH, so it is fine even for IO_URING_F_NONBLOCK. 2284 2258 */ 2285 2259 mutex_lock(&ub->mutex); 2286 - /* UBLK_IO_FETCH_REQ is only allowed before queue is setup */ 2287 - if (ublk_queue_ready(ubq)) { 2260 + /* UBLK_IO_FETCH_REQ is only allowed before dev is setup */ 2261 + if (ublk_dev_ready(ub)) { 2288 2262 ret = -EBUSY; 2289 2263 goto out; 2290 2264 } ··· 2298 2272 WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV); 2299 2273 2300 2274 ublk_fill_io_cmd(io, cmd); 2301 - ret = ublk_config_io_buf(ubq, io, cmd, buf_addr, NULL); 2275 + ret = ublk_config_io_buf(ub, io, cmd, buf_addr, NULL); 2302 2276 if (ret) 2303 2277 goto out; 2304 2278 2305 2279 WRITE_ONCE(io->task, get_task_struct(current)); 2306 - ublk_mark_io_ready(ub, ubq); 2280 + ublk_mark_io_ready(ub); 2307 2281 out: 2308 2282 mutex_unlock(&ub->mutex); 2309 2283 return ret; 2310 2284 } 2311 2285 2312 - static int ublk_check_commit_and_fetch(const struct ublk_queue *ubq, 2286 + static int ublk_check_commit_and_fetch(const struct ublk_device *ub, 2313 2287 struct ublk_io *io, __u64 buf_addr) 2314 2288 { 2315 2289 struct request *req = io->req; 2316 2290 2317 - if (ublk_need_map_io(ubq)) { 2291 + if (ublk_dev_need_map_io(ub)) { 2318 2292 /* 2319 2293 * COMMIT_AND_FETCH_REQ has to provide IO buffer if 2320 2294 * NEED GET DATA is not enabled or it is Read IO. 
2321 2295 */ 2322 - if (!buf_addr && (!ublk_need_get_data(ubq) || 2296 + if (!buf_addr && (!ublk_dev_need_get_data(ub) || 2323 2297 req_op(req) == REQ_OP_READ)) 2324 2298 return -EINVAL; 2325 2299 } else if (req_op(req) != REQ_OP_ZONE_APPEND && buf_addr) { ··· 2333 2307 return 0; 2334 2308 } 2335 2309 2336 - static bool ublk_need_complete_req(const struct ublk_queue *ubq, 2310 + static bool ublk_need_complete_req(const struct ublk_device *ub, 2337 2311 struct ublk_io *io) 2338 2312 { 2339 - if (ublk_need_req_ref(ubq)) 2313 + if (ublk_dev_need_req_ref(ub)) 2340 2314 return ublk_sub_req_ref(io); 2341 2315 return true; 2342 2316 } ··· 2359 2333 return ublk_start_io(ubq, req, io); 2360 2334 } 2361 2335 2362 - static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd, 2363 - unsigned int issue_flags, 2364 - const struct ublksrv_io_cmd *ub_cmd) 2336 + static int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd, 2337 + unsigned int issue_flags) 2365 2338 { 2339 + /* May point to userspace-mapped memory */ 2340 + const struct ublksrv_io_cmd *ub_src = io_uring_sqe_cmd(cmd->sqe); 2366 2341 u16 buf_idx = UBLK_INVALID_BUF_IDX; 2367 2342 struct ublk_device *ub = cmd->file->private_data; 2368 2343 struct ublk_queue *ubq; 2369 2344 struct ublk_io *io; 2370 2345 u32 cmd_op = cmd->cmd_op; 2371 - unsigned tag = ub_cmd->tag; 2346 + u16 q_id = READ_ONCE(ub_src->q_id); 2347 + u16 tag = READ_ONCE(ub_src->tag); 2348 + s32 result = READ_ONCE(ub_src->result); 2349 + u64 addr = READ_ONCE(ub_src->addr); /* unioned with zone_append_lba */ 2372 2350 struct request *req; 2373 2351 int ret; 2374 2352 bool compl; 2375 2353 2354 + WARN_ON_ONCE(issue_flags & IO_URING_F_UNLOCKED); 2355 + 2376 2356 pr_devel("%s: received: cmd op %d queue %d tag %d result %d\n", 2377 - __func__, cmd->cmd_op, ub_cmd->q_id, tag, 2378 - ub_cmd->result); 2357 + __func__, cmd->cmd_op, q_id, tag, result); 2379 2358 2380 2359 ret = ublk_check_cmd_op(cmd_op); 2381 2360 if (ret) ··· 2391 2360 * so no need to validate the q_id, 
tag, or task 2392 2361 */ 2393 2362 if (_IOC_NR(cmd_op) == UBLK_IO_UNREGISTER_IO_BUF) 2394 - return ublk_unregister_io_buf(cmd, ub, ub_cmd->addr, 2395 - issue_flags); 2363 + return ublk_unregister_io_buf(cmd, ub, addr, issue_flags); 2396 2364 2397 2365 ret = -EINVAL; 2398 - if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues) 2366 + if (q_id >= ub->dev_info.nr_hw_queues) 2399 2367 goto out; 2400 2368 2401 - ubq = ublk_get_queue(ub, ub_cmd->q_id); 2369 + ubq = ublk_get_queue(ub, q_id); 2402 2370 2403 - if (tag >= ubq->q_depth) 2371 + if (tag >= ub->dev_info.queue_depth) 2404 2372 goto out; 2405 2373 2406 2374 io = &ubq->ios[tag]; 2407 2375 /* UBLK_IO_FETCH_REQ can be handled on any task, which sets io->task */ 2408 2376 if (unlikely(_IOC_NR(cmd_op) == UBLK_IO_FETCH_REQ)) { 2409 - ret = ublk_check_fetch_buf(ubq, ub_cmd->addr); 2377 + ret = ublk_check_fetch_buf(ub, addr); 2410 2378 if (ret) 2411 2379 goto out; 2412 - ret = ublk_fetch(cmd, ubq, io, ub_cmd->addr); 2380 + ret = ublk_fetch(cmd, ub, io, addr); 2413 2381 if (ret) 2414 2382 goto out; 2415 2383 ··· 2422 2392 * so can be handled on any task 2423 2393 */ 2424 2394 if (_IOC_NR(cmd_op) == UBLK_IO_REGISTER_IO_BUF) 2425 - return ublk_register_io_buf(cmd, ubq, io, ub_cmd->addr, 2426 - issue_flags); 2395 + return ublk_register_io_buf(cmd, ub, q_id, tag, io, 2396 + addr, issue_flags); 2427 2397 2428 2398 goto out; 2429 2399 } ··· 2444 2414 2445 2415 switch (_IOC_NR(cmd_op)) { 2446 2416 case UBLK_IO_REGISTER_IO_BUF: 2447 - return ublk_daemon_register_io_buf(cmd, ubq, io, ub_cmd->addr, 2417 + return ublk_daemon_register_io_buf(cmd, ub, q_id, tag, io, addr, 2448 2418 issue_flags); 2449 2419 case UBLK_IO_COMMIT_AND_FETCH_REQ: 2450 - ret = ublk_check_commit_and_fetch(ubq, io, ub_cmd->addr); 2420 + ret = ublk_check_commit_and_fetch(ub, io, addr); 2451 2421 if (ret) 2452 2422 goto out; 2453 - io->res = ub_cmd->result; 2423 + io->res = result; 2454 2424 req = ublk_fill_io_cmd(io, cmd); 2455 - ret = ublk_config_io_buf(ubq, io, cmd, 
ub_cmd->addr, &buf_idx); 2456 - compl = ublk_need_complete_req(ubq, io); 2425 + ret = ublk_config_io_buf(ub, io, cmd, addr, &buf_idx); 2426 + compl = ublk_need_complete_req(ub, io); 2457 2427 2458 2428 /* can't touch 'ublk_io' any more */ 2459 2429 if (buf_idx != UBLK_INVALID_BUF_IDX) 2460 2430 io_buffer_unregister_bvec(cmd, buf_idx, issue_flags); 2461 2431 if (req_op(req) == REQ_OP_ZONE_APPEND) 2462 - req->__sector = ub_cmd->zone_append_lba; 2432 + req->__sector = addr; 2463 2433 if (compl) 2464 - __ublk_complete_rq(req); 2434 + __ublk_complete_rq(req, io, ublk_dev_need_map_io(ub)); 2465 2435 2466 2436 if (ret) 2467 2437 goto out; ··· 2473 2443 * request 2474 2444 */ 2475 2445 req = ublk_fill_io_cmd(io, cmd); 2476 - ret = ublk_config_io_buf(ubq, io, cmd, ub_cmd->addr, NULL); 2446 + ret = ublk_config_io_buf(ub, io, cmd, addr, NULL); 2477 2447 WARN_ON_ONCE(ret); 2478 2448 if (likely(ublk_get_data(ubq, io, req))) { 2479 2449 __ublk_prep_compl_io_cmd(io, req); ··· 2493 2463 } 2494 2464 2495 2465 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub, 2496 - const struct ublk_queue *ubq, struct ublk_io *io, size_t offset) 2466 + u16 q_id, u16 tag, struct ublk_io *io, size_t offset) 2497 2467 { 2498 - unsigned tag = io - ubq->ios; 2499 2468 struct request *req; 2500 2469 2501 2470 /* 2502 2471 * can't use io->req in case of concurrent UBLK_IO_COMMIT_AND_FETCH_REQ, 2503 2472 * which would overwrite it with io->cmd 2504 2473 */ 2505 - req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag); 2474 + req = blk_mq_tag_to_rq(ub->tag_set.tags[q_id], tag); 2506 2475 if (!req) 2507 2476 return NULL; 2508 2477 ··· 2521 2492 fail_put: 2522 2493 ublk_put_req_ref(io, req); 2523 2494 return NULL; 2524 - } 2525 - 2526 - static inline int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd, 2527 - unsigned int issue_flags) 2528 - { 2529 - /* 2530 - * Not necessary for async retry, but let's keep it simple and always 2531 - * copy the values to avoid any potential 
reuse. 2532 - */ 2533 - const struct ublksrv_io_cmd *ub_src = io_uring_sqe_cmd(cmd->sqe); 2534 - const struct ublksrv_io_cmd ub_cmd = { 2535 - .q_id = READ_ONCE(ub_src->q_id), 2536 - .tag = READ_ONCE(ub_src->tag), 2537 - .result = READ_ONCE(ub_src->result), 2538 - .addr = READ_ONCE(ub_src->addr) 2539 - }; 2540 - 2541 - WARN_ON_ONCE(issue_flags & IO_URING_F_UNLOCKED); 2542 - 2543 - return __ublk_ch_uring_cmd(cmd, issue_flags, &ub_cmd); 2544 2495 } 2545 2496 2546 2497 static void ublk_ch_uring_cmd_cb(struct io_uring_cmd *cmd, ··· 2592 2583 return ERR_PTR(-EINVAL); 2593 2584 2594 2585 ubq = ublk_get_queue(ub, q_id); 2595 - if (!ubq) 2596 - return ERR_PTR(-EINVAL); 2597 - 2598 - if (!ublk_support_user_copy(ubq)) 2586 + if (!ublk_dev_support_user_copy(ub)) 2599 2587 return ERR_PTR(-EACCES); 2600 2588 2601 - if (tag >= ubq->q_depth) 2589 + if (tag >= ub->dev_info.queue_depth) 2602 2590 return ERR_PTR(-EINVAL); 2603 2591 2604 2592 *io = &ubq->ios[tag]; 2605 - req = __ublk_check_and_get_req(ub, ubq, *io, buf_off); 2593 + req = __ublk_check_and_get_req(ub, q_id, tag, *io, buf_off); 2606 2594 if (!req) 2607 2595 return ERR_PTR(-EINVAL); 2608 2596 ··· 2662 2656 2663 2657 static void ublk_deinit_queue(struct ublk_device *ub, int q_id) 2664 2658 { 2665 - int size = ublk_queue_cmd_buf_size(ub, q_id); 2659 + int size = ublk_queue_cmd_buf_size(ub); 2666 2660 struct ublk_queue *ubq = ublk_get_queue(ub, q_id); 2667 2661 int i; 2668 2662 ··· 2689 2683 ubq->flags = ub->dev_info.flags; 2690 2684 ubq->q_id = q_id; 2691 2685 ubq->q_depth = ub->dev_info.queue_depth; 2692 - size = ublk_queue_cmd_buf_size(ub, q_id); 2686 + size = ublk_queue_cmd_buf_size(ub); 2693 2687 2694 2688 ptr = (void *) __get_free_pages(gfp_flags, get_order(size)); 2695 2689 if (!ptr)
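The ublk hunks above replace per-queue readiness counting with a single device-wide counter: the device is ready once `nr_io_ready` equals `nr_hw_queues * queue_depth`. A minimal userspace sketch of that accounting follows. `struct dev_model`, `total_ios()`, `dev_ready()`, `mark_io_ready()` and `fetches_until_ready()` are hypothetical stand-ins for illustration, not the driver's code; only the arithmetic and the widening cast mirror `ublk_dev_ready()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the three ublk_device fields used for readiness. */
struct dev_model {
	uint16_t nr_hw_queues;
	uint16_t queue_depth;
	uint32_t nr_io_ready;	/* FETCH commands accepted so far, device-wide */
};

/*
 * Widen before multiplying: two 16-bit values would otherwise be promoted
 * to int, and 65535 * 65535 does not fit in a 32-bit signed int.
 */
static uint32_t total_ios(uint16_t nr_hw_queues, uint16_t queue_depth)
{
	return (uint32_t)nr_hw_queues * queue_depth;
}

static bool dev_ready(const struct dev_model *d)
{
	return d->nr_io_ready == total_ios(d->nr_hw_queues, d->queue_depth);
}

/* Account one FETCH; returns true if this one made the device ready. */
static bool mark_io_ready(struct dev_model *d)
{
	assert(!dev_ready(d));	/* the driver rejects FETCH with -EBUSY here */
	d->nr_io_ready++;
	return dev_ready(d);
}

/* Count how many FETCH commands are needed before the device is ready. */
static uint32_t fetches_until_ready(uint16_t nr_hw_queues, uint16_t queue_depth)
{
	struct dev_model d = { nr_hw_queues, queue_depth, 0 };
	uint32_t n = 0;

	while (!dev_ready(&d)) {
		mark_io_ready(&d);
		n++;
	}
	return n;
}
```

With 2 queues of depth 4, the model needs 8 fetches in total, regardless of which queue each fetch targets. That is the practical effect of dropping the per-queue `nr_io_ready` and `nr_queues_ready` fields.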

+4 -4

drivers/block/virtio_blk.c

···
 }
 
 /* We provide getgeo only to please some old bootloader/partitioning tools */
-static int virtblk_getgeo(struct block_device *bd, struct hd_geometry *geo)
+static int virtblk_getgeo(struct gendisk *disk, struct hd_geometry *geo)
 {
-	struct virtio_blk *vblk = bd->bd_disk->private_data;
+	struct virtio_blk *vblk = disk->private_data;
 	int ret = 0;
 
 	mutex_lock(&vblk->vdev_mutex);
···
 		/* some standard values, similar to sd */
 		geo->heads = 1 << 6;
 		geo->sectors = 1 << 5;
-		geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+		geo->cylinders = get_capacity(disk) >> 11;
 	}
 out:
 	mutex_unlock(&vblk->vdev_mutex);
···
 {
 	int error;
 
-	virtblk_wq = alloc_workqueue("virtio-blk", 0, 0);
+	virtblk_wq = alloc_workqueue("virtio-blk", WQ_PERCPU, 0);
 	if (!virtblk_wq)
 		return -ENOMEM;
 
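The fallback geometry in `virtblk_getgeo()` shifts the capacity right by 11 because 64 heads times 32 sectors per track gives 2048 = 1 << 11 sectors per cylinder. A small sketch of that arithmetic, assuming the capacity is expressed in 512-byte sectors; `fallback_cylinders()` is a hypothetical helper, not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cylinder count for the 64-head, 32-sector fallback. */
static uint64_t fallback_cylinders(uint64_t capacity_sectors)
{
	const unsigned int heads = 1u << 6;	/* 64 */
	const unsigned int sectors = 1u << 5;	/* 32 */

	/* 64 * 32 = 2048 = 1 << 11, so this equals capacity_sectors >> 11 */
	return capacity_sectors / (heads * sectors);
}
```

For example, a 1 GiB disk has 2097152 sectors of 512 bytes, which the sketch maps to 1024 cylinders. The real `struct hd_geometry` stores cylinders in a narrower field, so large disks are truncated there; the sketch ignores that.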

+2 -2

drivers/block/xen-blkfront.c

···
 	schedule_work(&rinfo->work);
 }
 
-static int blkif_getgeo(struct block_device *bd, struct hd_geometry *hg)
+static int blkif_getgeo(struct gendisk *disk, struct hd_geometry *hg)
 {
 	/* We don't have real geometry info, but let's at least return
 	   values consistent with the size of the device */
-	sector_t nsect = get_capacity(bd->bd_disk);
+	sector_t nsect = get_capacity(disk);
 	sector_t cylinders = nsect;
 
 	hg->heads = 0xff;

+1 -1

drivers/block/zram/zram_drv.c

···
 	work.entry = entry;
 
 	INIT_WORK_ONSTACK(&work.work, zram_sync_read);
-	queue_work(system_unbound_wq, &work.work);
+	queue_work(system_dfl_wq, &work.work);
 	flush_work(&work.work);
 	destroy_work_on_stack(&work.work);
 

+29

drivers/md/Kconfig

···
 
 	  If unsure, say N.
 
+config MD_BITMAP
+	bool "MD RAID bitmap support"
+	default y
+	depends on BLK_DEV_MD
+	help
+	  If you say Y here, support for the write intent bitmap will be
+	  enabled. The bitmap can be used to optimize resync speed after power
+	  failure or readding a disk, limiting it to recorded dirty sectors in
+	  bitmap.
+
+	  This feature can be added to existing MD array or MD array can be
+	  created with bitmap via mdadm(8).
+
+	  If unsure, say Y.
+
+config MD_LLBITMAP
+	bool "MD RAID lockless bitmap support"
+	depends on BLK_DEV_MD
+	help
+	  If you say Y here, support for the lockless write intent bitmap will
+	  be enabled.
+
+	  Note, this is an experimental feature.
+
+	  If unsure, say N.
+
 config MD_AUTODETECT
 	bool "Autodetect RAID arrays during kernel boot"
 	depends on BLK_DEV_MD=y
···
 config MD_BITMAP_FILE
 	bool "MD bitmap file support (deprecated)"
 	default y
+	depends on MD_BITMAP
 	help
 	  If you say Y here, support for write intent bitmaps in files on an
 	  external file system is enabled. This is an alternative to the internal
···
 
 config MD_CLUSTER
 	tristate "Cluster Support for MD"
+	select MD_BITMAP
 	depends on BLK_DEV_MD
 	depends on DLM
 	default n
···
 	select MD_RAID1
 	select MD_RAID10
 	select MD_RAID456
+	select MD_BITMAP
 	select BLK_DEV_MD
 	help
 	  A dm target that supports RAID1, RAID10, RAID4, RAID5 and RAID6 mappings

+3 -1

drivers/md/Makefile

···
 dm-verity-y	+= dm-verity-target.o
 dm-zoned-y	+= dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o
 
-md-mod-y	+= md.o md-bitmap.o
+md-mod-y	+= md.o
+md-mod-$(CONFIG_MD_BITMAP)	+= md-bitmap.o
+md-mod-$(CONFIG_MD_LLBITMAP)	+= md-llbitmap.o
 raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 linear-y	+= md-linear.o
 

+1 -2

drivers/md/bcache/debug.c

···
 	check = bio_kmalloc(nr_segs, GFP_NOIO);
 	if (!check)
 		return;
-	bio_init(check, bio->bi_bdev, check->bi_inline_vecs, nr_segs,
-		 REQ_OP_READ);
+	bio_init_inline(check, bio->bi_bdev, nr_segs, REQ_OP_READ);
 	check->bi_iter.bi_sector = bio->bi_iter.bi_sector;
 	check->bi_iter.bi_size = bio->bi_iter.bi_size;
 

+1 -2

drivers/md/bcache/io.c

···
 	struct bbio *b = mempool_alloc(&c->bio_meta, GFP_NOIO);
 	struct bio *bio = &b->bio;
 
-	bio_init(bio, NULL, bio->bi_inline_vecs,
-		 meta_bucket_pages(&c->cache->sb), 0);
+	bio_init_inline(bio, NULL, meta_bucket_pages(&c->cache->sb), 0);
 
 	return bio;
 }

+1 -1

drivers/md/bcache/journal.c

···
 
 	atomic_set(&ja->discard_in_flight, DISCARD_IN_FLIGHT);
 
-	bio_init(bio, ca->bdev, bio->bi_inline_vecs, 1, REQ_OP_DISCARD);
+	bio_init_inline(bio, ca->bdev, 1, REQ_OP_DISCARD);
 	bio->bi_iter.bi_sector = bucket_to_sector(ca->set,
 					ca->sb.d[ja->discard_idx]);
 	bio->bi_iter.bi_size = bucket_bytes(ca);

+4 -4

drivers/md/bcache/movinggc.c

···
 {
 	struct bio *bio = &io->bio.bio;
 
-	bio_init(bio, NULL, bio->bi_inline_vecs,
+	bio_init_inline(bio, NULL,
 		 DIV_ROUND_UP(KEY_SIZE(&io->w->key), PAGE_SECTORS), 0);
 	bio_get(bio);
 	bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
···
 			continue;
 		}
 
-		io = kzalloc(struct_size(io, bio.bio.bi_inline_vecs,
-					 DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS)),
-			     GFP_KERNEL);
+		io = kzalloc(sizeof(*io) + sizeof(struct bio_vec) *
+			     DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
+			     GFP_KERNEL);
 		if (!io)
 			goto err;
 

+1 -1

drivers/md/bcache/super.c

···
 	__module_get(THIS_MODULE);
 	kobject_init(&ca->kobj, &bch_cache_ktype);
 
-	bio_init(&ca->journal.bio, NULL, ca->journal.bio.bi_inline_vecs, 8, 0);
+	bio_init_inline(&ca->journal.bio, NULL, 8, 0);
 
 	/*
 	 * When the cache disk is first registered, ca->sb.njournal_buckets

+4 -4

drivers/md/bcache/writeback.c

···
 	struct dirty_io *io = w->private;
 	struct bio *bio = &io->bio;
 
-	bio_init(bio, NULL, bio->bi_inline_vecs,
+	bio_init_inline(bio, NULL,
 		 DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS), 0);
 	if (!io->dc->writeback_percent)
 		bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
···
 		for (i = 0; i < nk; i++) {
 			w = keys[i];
 
-			io = kzalloc(struct_size(io, bio.bi_inline_vecs,
-						 DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS)),
-				     GFP_KERNEL);
+			io = kzalloc(sizeof(*io) + sizeof(struct bio_vec) *
+				     DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
+				     GFP_KERNEL);
 			if (!io)
 				goto err;
 
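The bcache hunks swap `struct_size()` for an open-coded `sizeof(*io) + sizeof(struct bio_vec) * n`, which fits the `bi_inline_vecs` member no longer being named directly. A userspace sketch of that allocation pattern follows, assuming the vectors sit immediately after the structure. `struct vec_model`, `struct io_model`, `io_alloc_size()` and `io_alloc()` are hypothetical names. The explicit overflow check is an addition of this sketch: `struct_size()` saturates on overflow, whereas the open-coded sum does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct bio_vec and a bio-embedding container. */
struct vec_model {
	void *page;
	unsigned int len;
	unsigned int offset;
};

struct io_model {
	int header;
	size_t max_vecs;
	struct vec_model *vecs;	/* points just past the structure */
};

/* Open-coded size: header plus nr_vecs trailing vectors. */
static size_t io_alloc_size(size_t nr_vecs)
{
	return sizeof(struct io_model) + sizeof(struct vec_model) * nr_vecs;
}

static struct io_model *io_alloc(size_t nr_vecs)
{
	struct io_model *io;

	/* Reject counts whose byte size would wrap around size_t. */
	if (nr_vecs > (SIZE_MAX - sizeof(*io)) / sizeof(struct vec_model))
		return NULL;

	io = calloc(1, io_alloc_size(nr_vecs));
	if (!io)
		return NULL;

	io->vecs = (struct vec_model *)(io + 1);	/* the inline vectors */
	io->max_vecs = nr_vecs;
	return io;
}
```

In the diff the vector count is bounded by `DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS)`. That bound is presumably why the open-coded form is acceptable there, though the diff itself does not say so.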

+1 -1

drivers/md/dm-bufio.c

···
 		use_dmio(b, op, sector, n_sectors, offset, ioprio);
 		return;
 	}
-	bio_init(bio, b->c->bdev, bio->bi_inline_vecs, 1, op);
+	bio_init_inline(bio, b->c->bdev, 1, op);
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = bio_complete;
 	bio->bi_private = b;

+1 -1

drivers/md/dm-flakey.c

···
 	if (!clone)
 		return NULL;
 
-	bio_init(clone, fc->dev->bdev, clone->bi_inline_vecs, nr_iovecs, bio->bi_opf);
+	bio_init_inline(clone, fc->dev->bdev, nr_iovecs, bio->bi_opf);
 
 	clone->bi_iter.bi_sector = flakey_map_sector(ti, bio->bi_iter.bi_sector);
 	clone->bi_private = bio;

+11 -7

drivers/md/dm-raid.c

···
 	    !test_and_set_bit(RT_FLAG_RS_BITMAP_LOADED, &rs->runtime_flags)) {
 		struct mddev *mddev = &rs->md;
 
-		r = mddev->bitmap_ops->load(mddev);
-		if (r)
-			DMERR("Failed to load bitmap");
+		if (md_bitmap_enabled(mddev, false)) {
+			r = mddev->bitmap_ops->load(mddev);
+			if (r)
+				DMERR("Failed to load bitmap");
+		}
 	}
 
 	return r;
···
 	     mddev->bitmap_info.chunksize != to_bytes(rs->requested_bitmap_chunk_sectors)))) {
 		int chunksize = to_bytes(rs->requested_bitmap_chunk_sectors) ?: mddev->bitmap_info.chunksize;
 
-		r = mddev->bitmap_ops->resize(mddev, mddev->dev_sectors,
-					      chunksize, false);
-		if (r)
-			DMERR("Failed to resize bitmap");
+		if (md_bitmap_enabled(mddev, false)) {
+			r = mddev->bitmap_ops->resize(mddev, mddev->dev_sectors,
+						      chunksize);
+			if (r)
+				DMERR("Failed to resize bitmap");
+		}
 	}
 
 	/* Check for any resize/reshape on @rs and adjust/initiate */

+1 -1

drivers/md/dm-vdo/vio.c

···
 		return VDO_SUCCESS;
 
 	bio->bi_ioprio = 0;
-	bio->bi_io_vec = bio->bi_inline_vecs;
+	bio->bi_io_vec = bio_inline_vecs(bio);
 	bio->bi_max_vecs = vio->block_count + 1;
 	if (VDO_ASSERT(size <= vio_size, "specified size %d is not greater than allocated %d",
 		       size, vio_size) != VDO_SUCCESS)

+2 -2

drivers/md/dm.c

···
 	dm_deferred_remove();
 }
 
-static int dm_blk_getgeo(struct block_device *bdev, struct hd_geometry *geo)
+static int dm_blk_getgeo(struct gendisk *disk, struct hd_geometry *geo)
 {
-	struct mapped_device *md = bdev->bd_disk->private_data;
+	struct mapped_device *md = disk->private_data;
 
 	return dm_get_geometry(md, geo);
 }

+44 -45

drivers/md/md-bitmap.c

··· 34 34 #include "md-bitmap.h" 35 35 #include "md-cluster.h" 36 36 37 - #define BITMAP_MAJOR_LO 3 38 - /* version 4 insists the bitmap is in little-endian order 39 - * with version 3, it is host-endian which is non-portable 40 - * Version 5 is currently set only for clustered devices 41 - */ 42 - #define BITMAP_MAJOR_HI 4 43 - #define BITMAP_MAJOR_CLUSTERED 5 44 - #define BITMAP_MAJOR_HOSTENDIAN 3 45 - 46 37 /* 47 38 * in-memory bitmap: 48 39 * ··· 215 224 int cluster_slot; 216 225 }; 217 226 227 + static struct workqueue_struct *md_bitmap_wq; 228 + 218 229 static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks, 219 230 int chunksize, bool init); 220 231 ··· 225 232 return bitmap->mddev ? mdname(bitmap->mddev) : "mdX"; 226 233 } 227 234 228 - static bool __bitmap_enabled(struct bitmap *bitmap) 235 + static bool bitmap_enabled(void *data, bool flush) 229 236 { 230 - return bitmap->storage.filemap && 231 - !test_bit(BITMAP_STALE, &bitmap->flags); 232 - } 237 + struct bitmap *bitmap = data; 233 238 234 - static bool bitmap_enabled(struct mddev *mddev) 235 - { 236 - struct bitmap *bitmap = mddev->bitmap; 239 + if (!flush) 240 + return true; 237 241 238 - if (!bitmap) 239 - return false; 240 - 241 - return __bitmap_enabled(bitmap); 242 + /* 243 + * If caller want to flush bitmap pages to underlying disks, check if 244 + * there are cached pages in filemap. 
245 + */ 246 + return !test_bit(BITMAP_STALE, &bitmap->flags) && 247 + bitmap->storage.filemap != NULL; 242 248 } 243 249 244 250 /* ··· 476 484 return -EINVAL; 477 485 } 478 486 479 - md_super_write(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), page); 487 + md_write_metadata(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), 488 + page, 0); 480 489 return 0; 481 490 } 482 491 ··· 1237 1244 int dirty, need_write; 1238 1245 int writing = 0; 1239 1246 1240 - if (!__bitmap_enabled(bitmap)) 1247 + if (!bitmap_enabled(bitmap, true)) 1241 1248 return; 1242 1249 1243 1250 /* look at each page to see if there are any set bits that need to be ··· 1781 1788 sector_t *blocks, bool degraded) 1782 1789 { 1783 1790 bitmap_counter_t *bmc; 1784 - bool rv; 1791 + bool rv = false; 1785 1792 1786 - if (bitmap == NULL) {/* FIXME or bitmap set as 'failed' */ 1787 - *blocks = 1024; 1788 - return true; /* always resync if no bitmap */ 1789 - } 1790 1793 spin_lock_irq(&bitmap->counts.lock); 1791 - 1792 - rv = false; 1793 1794 bmc = md_bitmap_get_counter(&bitmap->counts, offset, blocks, 0); 1794 1795 if (bmc) { 1795 1796 /* locked */ ··· 1832 1845 bitmap_counter_t *bmc; 1833 1846 unsigned long flags; 1834 1847 1835 - if (bitmap == NULL) { 1836 - *blocks = 1024; 1837 - return; 1838 - } 1839 1848 spin_lock_irqsave(&bitmap->counts.lock, flags); 1840 1849 bmc = md_bitmap_get_counter(&bitmap->counts, offset, blocks, 0); 1841 1850 if (bmc == NULL) ··· 2043 2060 struct bitmap *bitmap = mddev->bitmap; 2044 2061 int bw; 2045 2062 2046 - if (!bitmap) 2047 - return; 2048 - 2049 2063 atomic_inc(&bitmap->behind_writes); 2050 2064 bw = atomic_read(&bitmap->behind_writes); 2051 2065 if (bw > bitmap->behind_writes_used) ··· 2055 2075 static void bitmap_end_behind_write(struct mddev *mddev) 2056 2076 { 2057 2077 struct bitmap *bitmap = mddev->bitmap; 2058 - 2059 - if (!bitmap) 2060 - return; 2061 2078 2062 2079 if (atomic_dec_and_test(&bitmap->behind_writes)) 2063 2080 
wake_up(&bitmap->behind_wait); ··· 2570 2593 return ret; 2571 2594 } 2572 2595 2573 - static int bitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize, 2574 - bool init) 2596 + static int bitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize) 2575 2597 { 2576 2598 struct bitmap *bitmap = mddev->bitmap; 2577 2599 2578 2600 if (!bitmap) 2579 2601 return 0; 2580 2602 2581 - return __bitmap_resize(bitmap, blocks, chunksize, init); 2603 + return __bitmap_resize(bitmap, blocks, chunksize, false); 2582 2604 } 2583 2605 2584 2606 static ssize_t ··· 2966 2990 &max_backlog_used.attr, 2967 2991 NULL 2968 2992 }; 2969 - const struct attribute_group md_bitmap_group = { 2993 + 2994 + static struct attribute_group md_bitmap_group = { 2970 2995 .name = "bitmap", 2971 2996 .attrs = md_bitmap_attrs, 2972 2997 }; 2973 2998 2974 2999 static struct bitmap_operations bitmap_ops = { 3000 + .head = { 3001 + .type = MD_BITMAP, 3002 + .id = ID_BITMAP, 3003 + .name = "bitmap", 3004 + }, 3005 + 2975 3006 .enabled = bitmap_enabled, 2976 3007 .create = bitmap_create, 2977 3008 .resize = bitmap_resize, ··· 2996 3013 2997 3014 .start_write = bitmap_start_write, 2998 3015 .end_write = bitmap_end_write, 3016 + .start_discard = bitmap_start_write, 3017 + .end_discard = bitmap_end_write, 3018 + 2999 3019 .start_sync = bitmap_start_sync, 3000 3020 .end_sync = bitmap_end_sync, 3001 3021 .cond_end_sync = bitmap_cond_end_sync, ··· 3012 3026 .copy_from_slot = bitmap_copy_from_slot, 3013 3027 .set_pages = bitmap_set_pages, 3014 3028 .free = md_bitmap_free, 3029 + 3030 + .group = &md_bitmap_group, 3015 3031 }; 3016 3032 3017 - void mddev_set_bitmap_ops(struct mddev *mddev) 3033 + int md_bitmap_init(void) 3018 3034 { 3019 - mddev->bitmap_ops = &bitmap_ops; 3035 + md_bitmap_wq = alloc_workqueue("md_bitmap", WQ_MEM_RECLAIM | WQ_UNBOUND, 3036 + 0); 3037 + if (!md_bitmap_wq) 3038 + return -ENOMEM; 3039 + 3040 + return register_md_submodule(&bitmap_ops.head); 3041 + } 3042 + 3043 + void 
md_bitmap_exit(void) 3044 + { 3045 + destroy_workqueue(md_bitmap_wq); 3046 + unregister_md_submodule(&bitmap_ops.head); 3020 3047 }

+98 -9

drivers/md/md-bitmap.h

··· 9 9 10 10 #define BITMAP_MAGIC 0x6d746962 11 11 12 + /* 13 + * version 3 is host-endian order, this is deprecated and not used for new 14 + * array 15 + */ 16 + #define BITMAP_MAJOR_LO 3 17 + #define BITMAP_MAJOR_HOSTENDIAN 3 18 + /* version 4 is little-endian order, the default value */ 19 + #define BITMAP_MAJOR_HI 4 20 + /* version 5 is only used for cluster */ 21 + #define BITMAP_MAJOR_CLUSTERED 5 22 + /* version 6 is only used for lockless bitmap */ 23 + #define BITMAP_MAJOR_LOCKLESS 6 24 + 12 25 /* use these for bitmap->flags and bitmap->sb->state bit-fields */ 13 26 enum bitmap_state { 14 - BITMAP_STALE = 1, /* the bitmap file is out of date or had -EIO */ 27 + BITMAP_STALE = 1, /* the bitmap file is out of date or had -EIO */ 15 28 BITMAP_WRITE_ERROR = 2, /* A write error has occurred */ 29 + BITMAP_FIRST_USE = 3, /* llbitmap is just created */ 30 + BITMAP_CLEAN = 4, /* llbitmap is created with assume_clean */ 31 + BITMAP_DAEMON_BUSY = 5, /* llbitmap daemon is not finished after daemon_sleep */ 16 32 BITMAP_HOSTENDIAN =15, 17 33 }; 18 34 ··· 77 61 struct file *file; 78 62 }; 79 63 64 + typedef void (md_bitmap_fn)(struct mddev *mddev, sector_t offset, 65 + unsigned long sectors); 66 + 80 67 struct bitmap_operations { 81 - bool (*enabled)(struct mddev *mddev); 68 + struct md_submodule_head head; 69 + 70 + bool (*enabled)(void *data, bool flush); 82 71 int (*create)(struct mddev *mddev); 83 - int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize, 84 - bool init); 72 + int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize); 85 73 86 74 int (*load)(struct mddev *mddev); 87 75 void (*destroy)(struct mddev *mddev); ··· 100 80 void (*end_behind_write)(struct mddev *mddev); 101 81 void (*wait_behind_writes)(struct mddev *mddev); 102 82 103 - void (*start_write)(struct mddev *mddev, sector_t offset, 104 - unsigned long sectors); 105 - void (*end_write)(struct mddev *mddev, sector_t offset, 106 - unsigned long sectors); 83 + md_bitmap_fn 
*start_write;
+	md_bitmap_fn *end_write;
+	md_bitmap_fn *start_discard;
+	md_bitmap_fn *end_discard;
+
+	sector_t (*skip_sync_blocks)(struct mddev *mddev, sector_t offset);
+	bool (*blocks_synced)(struct mddev *mddev, sector_t offset);
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
@@ -124,9 +101,75 @@
 			  sector_t *hi, bool clear_bits);
 	void (*set_pages)(void *data, unsigned long pages);
 	void (*free)(void *data);
+
+	struct attribute_group *group;
 };
 
 /* the bitmap API */
-void mddev_set_bitmap_ops(struct mddev *mddev);
+static inline bool md_bitmap_registered(struct mddev *mddev)
+{
+	return mddev->bitmap_ops != NULL;
+}
+
+static inline bool md_bitmap_enabled(struct mddev *mddev, bool flush)
+{
+	/* bitmap_ops must be registered before creating bitmap. */
+	if (!md_bitmap_registered(mddev))
+		return false;
+
+	if (!mddev->bitmap)
+		return false;
+
+	return mddev->bitmap_ops->enabled(mddev->bitmap, flush);
+}
+
+static inline bool md_bitmap_start_sync(struct mddev *mddev, sector_t offset,
+					sector_t *blocks, bool degraded)
+{
+	/* always resync if no bitmap */
+	if (!md_bitmap_enabled(mddev, false)) {
+		*blocks = 1024;
+		return true;
+	}
+
+	return mddev->bitmap_ops->start_sync(mddev, offset, blocks, degraded);
+}
+
+static inline void md_bitmap_end_sync(struct mddev *mddev, sector_t offset,
+				      sector_t *blocks)
+{
+	if (!md_bitmap_enabled(mddev, false)) {
+		*blocks = 1024;
+		return;
+	}
+
+	mddev->bitmap_ops->end_sync(mddev, offset, blocks);
+}
+
+#ifdef CONFIG_MD_BITMAP
+int md_bitmap_init(void);
+void md_bitmap_exit(void);
+#else
+static inline int md_bitmap_init(void)
+{
+	return 0;
+}
+static inline void md_bitmap_exit(void)
+{
+}
+#endif
+
+#ifdef CONFIG_MD_LLBITMAP
+int md_llbitmap_init(void);
+void md_llbitmap_exit(void);
+#else
+static inline int md_llbitmap_init(void)
+{
+	return 0;
+}
+static inline void md_llbitmap_exit(void)
+{
+}
+#endif
 
 #endif
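A minimal userspace sketch of the guard-then-dispatch pattern used by the inline wrappers at the end of md-bitmap.h: a call is forwarded to the backend only when both a bitmap_ops table and a bitmap object are present, and otherwise falls back to "always resync, 1024 blocks at a time". The names struct demo_dev, struct demo_bitmap_ops and the toy_* backend are hypothetical stand-ins for illustration, not kernel types, and the toy backend's return values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct mddev or bitmap_operations. */
typedef unsigned long long sector_t;

struct demo_dev;

struct demo_bitmap_ops {
	bool (*enabled)(void *bitmap, bool flush);
	bool (*start_sync)(struct demo_dev *dev, sector_t offset,
			   sector_t *blocks, bool degraded);
};

struct demo_dev {
	const struct demo_bitmap_ops *bitmap_ops;	/* NULL: no backend registered */
	void *bitmap;					/* NULL: no bitmap created */
};

/* Guard: both the ops table and the bitmap must exist before dispatching. */
static bool demo_bitmap_enabled(struct demo_dev *dev, bool flush)
{
	if (!dev->bitmap_ops || !dev->bitmap)
		return false;

	return dev->bitmap_ops->enabled(dev->bitmap, flush);
}

/* Without a usable bitmap, report "needs sync" for a fixed 1024-block window. */
static bool demo_bitmap_start_sync(struct demo_dev *dev, sector_t offset,
				   sector_t *blocks, bool degraded)
{
	if (!demo_bitmap_enabled(dev, false)) {
		*blocks = 1024;
		return true;
	}

	return dev->bitmap_ops->start_sync(dev, offset, blocks, degraded);
}

/* Toy backend: always enabled, claims the next 128 blocks are already in sync. */
static bool toy_enabled(void *bitmap, bool flush)
{
	(void)bitmap;
	(void)flush;
	return true;
}

static bool toy_start_sync(struct demo_dev *dev, sector_t offset,
			   sector_t *blocks, bool degraded)
{
	(void)dev;
	(void)offset;
	(void)degraded;
	*blocks = 128;
	return false;
}

static const struct demo_bitmap_ops toy_ops = {
	.enabled	= toy_enabled,
	.start_sync	= toy_start_sync,
};
```

With this shape, callers never need to test for a missing backend themselves, which appears to be what lets the bitmap implementations become optional build-time components in this series.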

+1 -1

drivers/md/md-cluster.c

@@ -630,7 +630,7 @@
 		if (le64_to_cpu(msg->high) != mddev->pers->size(mddev, 0, 0))
 			ret = mddev->bitmap_ops->resize(mddev,
 							le64_to_cpu(msg->high),
-							0, false);
+							0);
 		break;
 	default:
 		ret = -1;

+3 -11

drivers/md/md-linear.c

@@ -257,18 +257,10 @@
 
 	if (unlikely(bio_end_sector(bio) > end_sector)) {
 		/* This bio crosses a device boundary, so we have to split it */
-		struct bio *split = bio_split(bio, end_sector - bio_sector,
-					      GFP_NOIO, &mddev->bio_set);
-
-		if (IS_ERR(split)) {
-			bio->bi_status = errno_to_blk_status(PTR_ERR(split));
-			bio_endio(bio);
+		bio = bio_submit_split_bioset(bio, end_sector - bio_sector,
+					      &mddev->bio_set);
+		if (!bio)
 			return true;
-		}
-
-		bio_chain(split, bio);
-		submit_bio_noacct(bio);
-		bio = split;
 	}
 
 	md_account_bio(mddev, &bio);
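The md-linear hunk above splits a bio that runs past the end of a member device: the front piece is kept for the current device and the remainder is handed back to the block layer. The sector arithmetic behind that split can be sketched in userspace as follows. struct demo_io and demo_split_at_boundary() are hypothetical names for illustration only, and the sketch assumes the request starts before end_sector; it models none of the bio chaining or error handling.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long sector_t;

/* Hypothetical stand-in for a request: a start sector and a length in sectors. */
struct demo_io {
	sector_t sector;
	sector_t nr_sectors;
};

/*
 * If @io extends past @end_sector, trim @io to the part before the boundary,
 * store the remainder in @rest and return true. Otherwise leave @io untouched
 * and return false. Assumes io->sector < end_sector.
 */
static bool demo_split_at_boundary(struct demo_io *io, sector_t end_sector,
				   struct demo_io *rest)
{
	sector_t io_end = io->sector + io->nr_sectors;

	if (io_end <= end_sector)
		return false;

	rest->sector = end_sector;
	rest->nr_sectors = io_end - end_sector;
	io->nr_sectors = end_sector - io->sector;
	return true;
}
```

For a request covering sectors 100 to 149 on a device that ends at sector 120, the front piece is 20 sectors long (`end_sector - bio_sector` in the hunk) and the remainder starts at sector 120 with 30 sectors.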

+1626

drivers/md/md-llbitmap.c

@@ -0,0 +1,1626 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/timer.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/file.h>
+#include <linux/seq_file.h>
+#include <trace/events/block.h>
+
+#include "md.h"
+#include "md-bitmap.h"
+
+/*
+ * #### Background
+ *
+ * Redundant data is used to enhance data fault tolerance, and the storage
+ * methods for redundant data vary depending on the RAID levels. And it's
+ * important to maintain the consistency of redundant data.
+ *
+ * Bitmap is used to record which data blocks have been synchronized and which
+ * ones need to be resynchronized or recovered. Each bit in the bitmap
+ * represents a segment of data in the array. When a bit is set, it indicates
+ * that the multiple redundant copies of that data segment may not be
+ * consistent. Data synchronization can be performed based on the bitmap after
+ * power failure or readding a disk. If there is no bitmap, a full disk
+ * synchronization is required.
+ *
+ * #### Key Features
+ *
+ * - IO fastpath is lockless, if user issues lots of write IO to the same
+ * bitmap bit in a short time, only the first write has additional overhead
+ * to update bitmap bit, no additional overhead for the following writes;
+ * - support only resync or recover written data, means in the case creating
+ * new array or replacing with a new disk, there is no need to do a full disk
+ * resync/recovery;
+ *
+ * #### Key Concept
+ *
+ * ##### State Machine
+ *
+ * Each bit is one byte, contain 6 different states, see llbitmap_state. And
+ * there are total 8 different actions, see llbitmap_action, can change state:
+ *
+ * llbitmap state machine: transitions between states
+ *
+ * |           | Startwrite | Startsync | Endsync | Abortsync|
+ * | --------- | ---------- | --------- | ------- | -------  |
+ * | Unwritten | Dirty      | x         | x       | x        |
+ * | Clean     | Dirty      | x         | x       | x        |
+ * | Dirty     | x          | x         | x       | x        |
+ * | NeedSync  | x          | Syncing   | x       | x        |
+ * | Syncing   | x          | Syncing   | Dirty   | NeedSync |
+ *
+ * |           | Reload   | Daemon | Discard   | Stale     |
+ * | --------- | -------- | ------ | --------- | --------- |
+ * | Unwritten | x        | x      | x         | x         |
+ * | Clean     | x        | x      | Unwritten | NeedSync  |
+ * | Dirty     | NeedSync | Clean  | Unwritten | NeedSync  |
+ * | NeedSync  | x        | x      | Unwritten | x         |
+ * | Syncing   | NeedSync | x      | Unwritten | NeedSync  |
+ *
+ * Typical scenarios:
+ *
+ * 1) Create new array
+ * All bits will be set to Unwritten by default, if --assume-clean is set,
+ * all bits will be set to Clean instead.
+ *
+ * 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and
+ * rely on xor data
+ *
+ * 2.1) write new data to raid1/raid10:
+ * Unwritten --StartWrite--> Dirty
+ *
+ * 2.2) write new data to raid456:
+ * Unwritten --StartWrite--> NeedSync
+ *
+ * Because the initial recover for raid456 is skipped, the xor data is not built
+ * yet, the bit must be set to NeedSync first and after lazy initial recover is
+ * finished, the bit will finally set to Dirty(see 5.1 and 5.4);
+ *
+ * 2.3) cover write
+ * Clean --StartWrite--> Dirty
+ *
+ * 3) daemon, if the array is not degraded:
+ * Dirty --Daemon--> Clean
+ *
+ * 4) discard
+ * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten
+ *
+ * 5) resync and recover
+ *
+ * 5.1) common process
+ * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean
+ *
+ * 5.2) resync after power failure
+ * Dirty --Reload--> NeedSync
+ *
+ * 5.3) recover while replacing with a new disk
+ * By default, the old bitmap framework will recover all data, and llbitmap
+ * implements this by a new helper, see llbitmap_skip_sync_blocks:
+ *
+ * skip recover for bits other than dirty or clean;
+ *
+ * 5.4) lazy initial recover for raid5:
+ * By default, the old bitmap framework will only allow new recover when there
+ * are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is added
+ * to perform raid456 lazy recover for set bits(from 2.2).
+ *
+ * 6. special handling for degraded array:
+ *
+ * - Dirty bits will never be cleared, daemon will just do nothing, so that if
+ * a disk is readded, Clean bits can be skipped with recovery;
+ * - Dirty bits will convert to Syncing from start write, to do data recovery
+ * for new added disks;
+ * - New write will convert bits to NeedSync directly;
+ *
+ * ##### Bitmap IO
+ *
+ * ##### Chunksize
+ *
+ * The default bitmap size is 128k, incluing 1k bitmap super block, and
+ * the default size of segment of data in the array each bit(chunksize) is 64k,
+ * and chunksize will adjust to twice the old size each time if the total number
+ * bits is not less than 127k.(see llbitmap_init)
+ *
+ * ##### READ
+ *
+ * While creating bitmap, all pages will be allocated and read for llbitmap,
+ * there won't be read afterwards
+ *
+ * ##### WRITE
+ *
+ * WRITE IO is divided into logical_block_size of the array, the dirty state
+ * of each block is tracked independently, for example:
+ *
+ * each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit;
+ *
+ * | page0 | page1 | ... | page 31 |
+ * |       |
+ * |        \-----------------------\
+ * |                                |
+ * | block0 | block1 | ... | block 8|
+ * |        |
+ * |         \-----------------\
+ * |                            |
+ * | bit0 | bit1 | ... | bit511 |
+ *
+ * From IO path, if one bit is changed to Dirty or NeedSync, the corresponding
+ * subpage will be marked dirty, such block must write first before the IO is
+ * issued. This behaviour will affect IO performance, to reduce the impact, if
+ * multiple bits are changed in the same block in a short time, all bits in this
+ * block will be changed to Dirty/NeedSync, so that there won't be any overhead
+ * until daemon clears dirty bits.
+ *
+ * ##### Dirty Bits synchronization
+ *
+ * IO fast path will set bits to dirty, and those dirty bits will be cleared
+ * by daemon after IO is done. llbitmap_page_ctl is used to synchronize between
+ * IO path and daemon;
+ *
+ * IO path:
+ * 1) try to grab a reference, if succeed, set expire time after 5s and return;
+ * 2) if failed to grab a reference, wait for daemon to finish clearing dirty
+ * bits;
+ *
+ * Daemon (Daemon will be woken up every daemon_sleep seconds):
+ * For each page:
+ * 1) check if page expired, if not skip this page; for expired page:
+ * 2) suspend the page and wait for inflight write IO to be done;
+ * 3) change dirty page to clean;
+ * 4) resume the page;
+ */
+
+#define BITMAP_DATA_OFFSET 1024
+
+/* 64k is the max IO size of sync IO for raid1/raid10 */
+#define MIN_CHUNK_SIZE (64 * 2)
+
+/* By default, daemon will be woken up every 30s */
+#define DEFAULT_DAEMON_SLEEP 30
+
+/*
+ * Dirtied bits that have not been accessed for more than 5s will be cleared
+ * by daemon.
+ */
+#define DEFAULT_BARRIER_IDLE 5
+
+enum llbitmap_state {
+	/* No valid data, init state after assemble the array */
+	BitUnwritten = 0,
+	/* data is consistent */
+	BitClean,
+	/* data will be consistent after IO is done, set directly for writes */
+	BitDirty,
+	/*
+	 * data need to be resynchronized:
+	 * 1) set directly for writes if array is degraded, prevent full disk
+	 * synchronization after readding a disk;
+	 * 2) reassemble the array after power failure, and dirty bits are
+	 * found after reloading the bitmap;
+	 * 3) set for first write for raid5, to build initial xor data lazily
+	 */
+	BitNeedSync,
+	/* data is synchronizing */
+	BitSyncing,
+	BitStateCount,
+	BitNone = 0xff,
+};
+
+enum llbitmap_action {
+	/* User write new data, this is the only action from IO fast path */
+	BitmapActionStartwrite = 0,
+	/* Start recovery */
+	BitmapActionStartsync,
+	/* Finish recovery */
+	BitmapActionEndsync,
+	/* Failed recovery */
+	BitmapActionAbortsync,
+	/* Reassemble the array */
+	BitmapActionReload,
+	/* Daemon thread is trying to clear dirty bits */
+	BitmapActionDaemon,
+	/* Data is deleted */
+	BitmapActionDiscard,
+	/*
+	 * Bitmap is stale, mark all bits in addition to BitUnwritten to
+	 * BitNeedSync.
+	 */
+	BitmapActionStale,
+	BitmapActionCount,
+	/* Init state is BitUnwritten */
+	BitmapActionInit,
+};
+
+enum llbitmap_page_state {
+	LLPageFlush = 0,
+	LLPageDirty,
+};
+
+struct llbitmap_page_ctl {
+	char *state;
+	struct page *page;
+	unsigned long expire;
+	unsigned long flags;
+	wait_queue_head_t wait;
+	struct percpu_ref active;
+	/* Per block size dirty state, maximum 64k page / 1 sector = 128 */
+	unsigned long dirty[];
+};
+
+struct llbitmap {
+	struct mddev *mddev;
+	struct llbitmap_page_ctl **pctl;
+
+	unsigned int nr_pages;
+	unsigned int io_size;
+	unsigned int blocks_per_page;
+
+	/* shift of one chunk */
+	unsigned long chunkshift;
+	/* size of one chunk in sector */
+	unsigned long chunksize;
+	/* total number of chunks */
+	unsigned long chunks;
+	unsigned long last_end_sync;
+	/*
+	 * time in seconds that dirty bits will be cleared if the page is not
+	 * accessed.
+	 */
+	unsigned long barrier_idle;
+	/* fires on first BitDirty state */
+	struct timer_list pending_timer;
+	struct work_struct daemon_work;
+
+	unsigned long flags;
+	__u64 events_cleared;
+
+	/* for slow disks */
+	atomic_t behind_writes;
+	wait_queue_head_t behind_wait;
+};
+
+struct llbitmap_unplug_work {
+	struct work_struct work;
+	struct llbitmap *llbitmap;
+	struct completion *done;
+};
+
+static struct workqueue_struct *md_llbitmap_io_wq;
+static struct workqueue_struct *md_llbitmap_unplug_wq;
+
+static char state_machine[BitStateCount][BitmapActionCount] = {
+	[BitUnwritten] = {
+		[BitmapActionStartwrite]	= BitDirty,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitNone,
+		[BitmapActionStale]		= BitNone,
+	},
+	[BitClean] = {
+		[BitmapActionStartwrite]	= BitDirty,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+	[BitDirty] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitNone,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNeedSync,
+		[BitmapActionDaemon]		= BitClean,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+	[BitNeedSync] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitSyncing,
+		[BitmapActionEndsync]		= BitNone,
+		[BitmapActionAbortsync]		= BitNone,
+		[BitmapActionReload]		= BitNone,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNone,
+	},
+	[BitSyncing] = {
+		[BitmapActionStartwrite]	= BitNone,
+		[BitmapActionStartsync]		= BitSyncing,
+		[BitmapActionEndsync]		= BitDirty,
+		[BitmapActionAbortsync]		= BitNeedSync,
+		[BitmapActionReload]		= BitNeedSync,
+		[BitmapActionDaemon]		= BitNone,
+		[BitmapActionDiscard]		= BitUnwritten,
+		[BitmapActionStale]		= BitNeedSync,
+	},
+};
+
+static void __llbitmap_flush(struct mddev *mddev);
+
+static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, loff_t pos)
+{
+	unsigned int idx;
+	unsigned int offset;
+
+	pos += BITMAP_DATA_OFFSET;
+	idx = pos >> PAGE_SHIFT;
+	offset = offset_in_page(pos);
+
+	return llbitmap->pctl[idx]->state[offset];
+}
+
+/* set all the bits in the subpage as dirty */
+static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
+				       struct llbitmap_page_ctl *pctl,
+				       unsigned int block)
+{
+	bool level_456 = raid_is_456(llbitmap->mddev);
+	unsigned int io_size = llbitmap->io_size;
+	int pos;
+
+	for (pos = block * io_size; pos < (block + 1) * io_size; pos++) {
+		switch (pctl->state[pos]) {
+		case BitUnwritten:
+			pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
+			break;
+		case BitClean:
+			pctl->state[pos] = BitDirty;
+			break;
+		};
+	}
+}
+
+static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int idx,
+				    int offset)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+	unsigned int io_size = llbitmap->io_size;
+	int block = offset / io_size;
+	int pos;
+
+	if (!test_bit(LLPageDirty, &pctl->flags))
+		set_bit(LLPageDirty, &pctl->flags);
+
+	/*
+	 * For degraded array, dirty bits will never be cleared, and we must
+	 * resync all the dirty bits, hence skip infect new dirty bits to
+	 * prevent resync unnecessary data.
+	 */
+	if (llbitmap->mddev->degraded) {
+		set_bit(block, pctl->dirty);
+		return;
+	}
+
+	/*
+	 * The subpage usually contains a total of 512 bits. If any single bit
+	 * within the subpage is marked as dirty, the entire sector will be
+	 * written. To avoid impacting write performance, when multiple bits
+	 * within the same sector are modified within llbitmap->barrier_idle,
+	 * all bits in the sector will be collectively marked as dirty at once.
+	 */
+	if (test_and_set_bit(block, pctl->dirty)) {
+		llbitmap_infect_dirty_bits(llbitmap, pctl, block);
+		return;
+	}
+
+	for (pos = block * io_size; pos < (block + 1) * io_size; pos++) {
+		if (pos == offset)
+			continue;
+		if (pctl->state[pos] == BitDirty ||
+		    pctl->state[pos] == BitNeedSync) {
+			llbitmap_infect_dirty_bits(llbitmap, pctl, block);
+			return;
+		}
+	}
+}
+
+static void llbitmap_write(struct llbitmap *llbitmap, enum llbitmap_state state,
+			   loff_t pos)
+{
+	unsigned int idx;
+	unsigned int bit;
+
+	pos += BITMAP_DATA_OFFSET;
+	idx = pos >> PAGE_SHIFT;
+	bit = offset_in_page(pos);
+
+	llbitmap->pctl[idx]->state[bit] = state;
+	if (state == BitDirty || state == BitNeedSync)
+		llbitmap_set_page_dirty(llbitmap, idx, bit);
+}
+
+static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	struct page *page = NULL;
+	struct md_rdev *rdev;
+
+	if (llbitmap->pctl && llbitmap->pctl[idx])
+		page = llbitmap->pctl[idx]->page;
+	if (page)
+		return page;
+
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return ERR_PTR(-ENOMEM);
+
+	rdev_for_each(rdev, mddev) {
+		sector_t sector;
+
+		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+			continue;
+
+		sector = mddev->bitmap_info.offset +
+			 (idx << PAGE_SECTORS_SHIFT);
+
+		if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
+				 true))
+			return page;
+
+		md_error(mddev, rdev);
+	}
+
+	__free_page(page);
+	return ERR_PTR(-EIO);
+}
+
+static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
+{
+	struct page *page = llbitmap->pctl[idx]->page;
+	struct mddev *mddev = llbitmap->mddev;
+	struct md_rdev *rdev;
+	int block;
+
+	for (block = 0; block < llbitmap->blocks_per_page; block++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+
+		if (!test_and_clear_bit(block, pctl->dirty))
+			continue;
+
+		rdev_for_each(rdev, mddev) {
+			sector_t sector;
+			sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
+
+			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+				continue;
+
+			sector = mddev->bitmap_info.offset + rdev->sb_start +
+				 (idx << PAGE_SECTORS_SHIFT) +
+				 block * bit_sector;
+			md_write_metadata(mddev, rdev, sector,
+					  llbitmap->io_size, page,
+					  block * llbitmap->io_size);
+		}
+	}
+}
+
+static void active_release(struct percpu_ref *ref)
+{
+	struct llbitmap_page_ctl *pctl =
+		container_of(ref, struct llbitmap_page_ctl, active);
+
+	wake_up(&pctl->wait);
+}
+
+static void llbitmap_free_pages(struct llbitmap *llbitmap)
+{
+	int i;
+
+	if (!llbitmap->pctl)
+		return;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		if (!pctl || !pctl->page)
+			break;
+
+		__free_page(pctl->page);
+		percpu_ref_exit(&pctl->active);
+	}
+
+	kfree(llbitmap->pctl[0]);
+	kfree(llbitmap->pctl);
+	llbitmap->pctl = NULL;
+}
+
+static int llbitmap_cache_pages(struct llbitmap *llbitmap)
+{
+	struct llbitmap_page_ctl *pctl;
+	unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks +
+					     BITMAP_DATA_OFFSET, PAGE_SIZE);
+	unsigned int size = struct_size(pctl, dirty, BITS_TO_LONGS(
+						llbitmap->blocks_per_page));
+	int i;
+
+	llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!llbitmap->pctl)
+		return -ENOMEM;
+
+	size = round_up(size, cache_line_size());
+	pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
+	if (!pctl) {
+		kfree(llbitmap->pctl);
+		return -ENOMEM;
+	}
+
+	llbitmap->nr_pages = nr_pages;
+
+	for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
+		struct page *page = llbitmap_read_page(llbitmap, i);
+
+		llbitmap->pctl[i] = pctl;
+
+		if (IS_ERR(page)) {
+			llbitmap_free_pages(llbitmap);
+			return PTR_ERR(page);
+		}
+
+		if (percpu_ref_init(&pctl->active, active_release,
+				    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
+			__free_page(page);
+			llbitmap_free_pages(llbitmap);
+			return -ENOMEM;
+		}
+
+		pctl->page = page;
+		pctl->state = page_address(page);
+		init_waitqueue_head(&pctl->wait);
+	}
+
+	return 0;
+}
+
+static void llbitmap_init_state(struct llbitmap *llbitmap)
+{
+	enum llbitmap_state state = BitUnwritten;
+	unsigned long i;
+
+	if (test_and_clear_bit(BITMAP_CLEAN, &llbitmap->flags))
+		state = BitClean;
+
+	for (i = 0; i < llbitmap->chunks; i++)
+		llbitmap_write(llbitmap, state, i);
+}
+
+/* The return value is only used from resync, where @start == @end. */
+static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
+						  unsigned long start,
+						  unsigned long end,
+						  enum llbitmap_action action)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	enum llbitmap_state state = BitNone;
+	bool level_456 = raid_is_456(llbitmap->mddev);
+	bool need_resync = false;
+	bool need_recovery = false;
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+		return BitNone;
+
+	if (action == BitmapActionInit) {
+		llbitmap_init_state(llbitmap);
+		return BitNone;
+	}
+
+	while (start <= end) {
+		enum llbitmap_state c = llbitmap_read(llbitmap, start);
+
+		if (c < 0 || c >= BitStateCount) {
+			pr_err("%s: invalid bit %lu state %d action %d, forcing resync\n",
+			       __func__, start, c, action);
+			state = BitNeedSync;
+			goto write_bitmap;
+		}
+
+		if (c == BitNeedSync)
+			need_resync = !mddev->degraded;
+
+		state = state_machine[c][action];
+
+write_bitmap:
+		if (unlikely(mddev->degraded)) {
+			/* For degraded array, mark new data as need sync. */
+			if (state == BitDirty &&
+			    action == BitmapActionStartwrite)
+				state = BitNeedSync;
+			/*
+			 * For degraded array, resync dirty data as well, noted
+			 * if array is still degraded after resync is done, all
+			 * new data will still be dirty until array is clean.
+			 */
+			else if (c == BitDirty &&
+				 action == BitmapActionStartsync)
+				state = BitSyncing;
+		} else if (c == BitUnwritten && state == BitDirty &&
+			   action == BitmapActionStartwrite && level_456) {
+			/* Delay raid456 initial recovery to first write. */
+			state = BitNeedSync;
+		}
+
+		if (state == BitNone) {
+			start++;
+			continue;
+		}
+
+		llbitmap_write(llbitmap, state, start);
+
+		if (state == BitNeedSync)
+			need_resync = !mddev->degraded;
+		else if (state == BitDirty &&
+			 !timer_pending(&llbitmap->pending_timer))
+			mod_timer(&llbitmap->pending_timer,
+				  jiffies + mddev->bitmap_info.daemon_sleep * HZ);
+
+		start++;
+	}
+
+	if (need_resync && level_456)
+		need_recovery = true;
+
+	if (need_recovery) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		set_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	} else if (need_resync) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	}
+
+	return state;
+}
+
+static void llbitmap_raise_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+retry:
+	if (likely(percpu_ref_tryget_live(&pctl->active))) {
+		WRITE_ONCE(pctl->expire, jiffies + llbitmap->barrier_idle * HZ);
+		return;
+	}
+
+	wait_event(pctl->wait, !percpu_ref_is_dying(&pctl->active));
+	goto retry;
+}
+
+static void llbitmap_release_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	percpu_ref_put(&pctl->active);
+}
+
+static int llbitmap_suspend_timeout(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	percpu_ref_kill(&pctl->active);
+
+	if (!wait_event_timeout(pctl->wait, percpu_ref_is_zero(&pctl->active),
+			llbitmap->mddev->bitmap_info.daemon_sleep * HZ))
+		return -ETIMEDOUT;
+
+	return 0;
+}
+
+static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	pctl->expire = LONG_MAX;
+	percpu_ref_resurrect(&pctl->active);
+	wake_up(&pctl->wait);
+}
+
+static int llbitmap_check_support(struct mddev *mddev)
+{
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		pr_notice("md/llbitmap: %s: array with journal cannot have bitmap\n",
+			  mdname(mddev));
+		return -EBUSY;
+	}
+
+	if (mddev->bitmap_info.space == 0) {
+		if (mddev->bitmap_info.default_space == 0) {
+			pr_notice("md/llbitmap: %s: no space for bitmap\n",
+				  mdname(mddev));
+			return -ENOSPC;
+		}
+	}
+
+	if (!mddev->persistent) {
+		pr_notice("md/llbitmap: %s: array must be persistent\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev->bitmap_info.file) {
+		pr_notice("md/llbitmap: %s: doesn't support bitmap file\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev->bitmap_info.external) {
+		pr_notice("md/llbitmap: %s: doesn't support external metadata\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev_is_dm(mddev)) {
+		pr_notice("md/llbitmap: %s: doesn't support dm-raid\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int llbitmap_init(struct llbitmap *llbitmap)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	sector_t blocks = mddev->resync_max_sectors;
+	unsigned long chunksize = MIN_CHUNK_SIZE;
+	unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);
+	unsigned long space = mddev->bitmap_info.space << SECTOR_SHIFT;
+	int ret;
+
+	while (chunks > space) {
+		chunksize = chunksize << 1;
+		chunks = DIV_ROUND_UP_SECTOR_T(blocks, chunksize);
+	}
+
+	llbitmap->barrier_idle = DEFAULT_BARRIER_IDLE;
+	llbitmap->chunkshift = ffz(~chunksize);
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = chunks;
+	mddev->bitmap_info.daemon_sleep = DEFAULT_DAEMON_SLEEP;
+
+	ret = llbitmap_cache_pages(llbitmap);
+	if (ret)
+		return ret;
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1,
+			       BitmapActionInit);
+	/* flush initial llbitmap to disk */
+	__llbitmap_flush(mddev);
+
+	return 0;
+}
+
+static int llbitmap_read_sb(struct llbitmap *llbitmap)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	unsigned long daemon_sleep;
+	unsigned long chunksize;
+	unsigned long events;
+	struct page *sb_page;
+	bitmap_super_t *sb;
+	int ret = -EINVAL;
+
+	if (!mddev->bitmap_info.offset) {
+		pr_err("md/llbitmap: %s: no super block found", mdname(mddev));
+		return -EINVAL;
+	}
+
+	sb_page = llbitmap_read_page(llbitmap, 0);
+	if (IS_ERR(sb_page)) {
+		pr_err("md/llbitmap: %s: read super block failed",
+		       mdname(mddev));
+		return -EIO;
+	}
+
+	sb = kmap_local_page(sb_page);
+	if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
+		pr_err("md/llbitmap: %s: invalid super block magic number",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (sb->version != cpu_to_le32(BITMAP_MAJOR_LOCKLESS)) {
+		pr_err("md/llbitmap: %s: invalid super block version",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (memcmp(sb->uuid, mddev->uuid, 16)) {
+		pr_err("md/llbitmap: %s: bitmap superblock UUID mismatch\n",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (mddev->bitmap_info.space == 0) {
+		int room = le32_to_cpu(sb->sectors_reserved);
+
+		if (room)
+			mddev->bitmap_info.space = room;
+		else
+			mddev->bitmap_info.space = mddev->bitmap_info.default_space;
+	}
+	llbitmap->flags = le32_to_cpu(sb->state);
+	if (test_and_clear_bit(BITMAP_FIRST_USE, &llbitmap->flags)) {
+		ret = llbitmap_init(llbitmap);
+		goto out_put_page;
+	}
+
+	chunksize = le32_to_cpu(sb->chunksize);
+	if (!is_power_of_2(chunksize)) {
+		pr_err("md/llbitmap: %s: chunksize not a power of 2",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (chunksize < DIV_ROUND_UP_SECTOR_T(mddev->resync_max_sectors,
+					      mddev->bitmap_info.space << SECTOR_SHIFT)) {
+		pr_err("md/llbitmap: %s: chunksize too small %lu < %llu / %lu",
+		       mdname(mddev), chunksize, mddev->resync_max_sectors,
+		       mddev->bitmap_info.space);
+		goto out_put_page;
+	}
+
+	daemon_sleep = le32_to_cpu(sb->daemon_sleep);
+	if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ) {
+		pr_err("md/llbitmap: %s: daemon sleep %lu period out of range",
+		       mdname(mddev), daemon_sleep);
+		goto out_put_page;
+	}
+
+	events = le64_to_cpu(sb->events);
+	if (events < mddev->events) {
+		pr_warn("md/llbitmap :%s: bitmap file is out of date (%lu < %llu) -- forcing full recovery",
+			mdname(mddev), events, mddev->events);
+		set_bit(BITMAP_STALE, &llbitmap->flags);
+	}
+
+	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+	mddev->bitmap_info.chunksize = chunksize;
+	mddev->bitmap_info.daemon_sleep = daemon_sleep;
+
+	llbitmap->barrier_idle = DEFAULT_BARRIER_IDLE;
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = DIV_ROUND_UP_SECTOR_T(mddev->resync_max_sectors, chunksize);
+	llbitmap->chunkshift = ffz(~chunksize);
+	ret = llbitmap_cache_pages(llbitmap);
+
+out_put_page:
+	__free_page(sb_page);
+	kunmap_local(sb);
+	return ret;
+}
+
+static void llbitmap_pending_timer_fn(struct timer_list *pending_timer)
+{
+	struct llbitmap *llbitmap =
+		container_of(pending_timer, struct llbitmap, pending_timer);
+
+	if (work_busy(&llbitmap->daemon_work)) {
+		pr_warn("md/llbitmap: %s daemon_work not finished in %lu seconds\n",
+			mdname(llbitmap->mddev),
+			llbitmap->mddev->bitmap_info.daemon_sleep);
+		set_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags);
+		return;
+	}
+
+	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+}
+
+static void md_llbitmap_daemon_fn(struct work_struct *work)
+{
+	struct llbitmap *llbitmap =
+		container_of(work, struct llbitmap, daemon_work);
+	unsigned long start;
+	unsigned long end;
+	bool restart;
+	int idx;
+
+	if (llbitmap->mddev->degraded)
+		return;
+retry:
+	start = 0;
+	end = min(llbitmap->chunks, PAGE_SIZE - BITMAP_DATA_OFFSET) - 1;
+	restart = false;
+
+	for (idx = 0; idx < llbitmap->nr_pages; idx++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+
+		if (idx > 0) {
+			start = end + 1;
+			end = min(end + PAGE_SIZE, llbitmap->chunks - 1);
+		}
+
+		if (!test_bit(LLPageFlush, &pctl->flags) &&
+		    time_before(jiffies, pctl->expire)) {
+			restart = true;
+			continue;
+		}
+
+		if (llbitmap_suspend_timeout(llbitmap, idx) < 0) {
+			pr_warn("md/llbitmap: %s: %s waiting for page %d timeout\n",
+				mdname(llbitmap->mddev), __func__, idx);
+			continue;
+		}
+
+		llbitmap_state_machine(llbitmap, start, end, BitmapActionDaemon);
+		llbitmap_resume(llbitmap, idx);
+	}
+
+	/*
+	 * If the daemon took a long time to finish, retry to prevent missing
+	 * clearing dirty bits.
+	 */
+	if (test_and_clear_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags))
+		goto retry;
+
+	/* If some page is dirty but not expired, setup timer again */
+	if (restart)
+		mod_timer(&llbitmap->pending_timer,
+			  jiffies + llbitmap->mddev->bitmap_info.daemon_sleep * HZ);
+}
+
+static int llbitmap_create(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap;
+	int ret;
+
+	ret = llbitmap_check_support(mddev);
+	if (ret)
+		return ret;
+
+	llbitmap = kzalloc(sizeof(*llbitmap), GFP_KERNEL);
+	if (!llbitmap)
+		return -ENOMEM;
+
+	llbitmap->mddev = mddev;
+	llbitmap->io_size = bdev_logical_block_size(mddev->gendisk->part0);
+	llbitmap->blocks_per_page = PAGE_SIZE / llbitmap->io_size;
+
+	timer_setup(&llbitmap->pending_timer, llbitmap_pending_timer_fn, 0);
+	INIT_WORK(&llbitmap->daemon_work, md_llbitmap_daemon_fn);
+	atomic_set(&llbitmap->behind_writes, 0);
+	init_waitqueue_head(&llbitmap->behind_wait);
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	mddev->bitmap = llbitmap;
+	ret = llbitmap_read_sb(llbitmap);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+	if (ret) {
+		kfree(llbitmap);
+		mddev->bitmap = NULL;
+	}
+
+	return ret;
+}
+
+static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long chunks;
+
+	if (chunksize == 0)
+		chunksize = llbitmap->chunksize;
+
+	/* If there is enough space, leave the chunksize unchanged. */
+	chunks = DIV_ROUND_UP_SECTOR_T(blocks, chunksize);
+	while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
+		chunksize = chunksize << 1;
+		chunks = DIV_ROUND_UP_SECTOR_T(blocks, chunksize);
+	}
+
+	llbitmap->chunkshift = ffz(~chunksize);
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = chunks;
+
+	return 0;
+}
+
+static int llbitmap_load(struct mddev *mddev)
+{
+	enum llbitmap_action action = BitmapActionReload;
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
+		action = BitmapActionStale;
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
+	return 0;
+}
+
+static void llbitmap_destroy(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (!llbitmap)
+		return;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+
+	timer_delete_sync(&llbitmap->pending_timer);
+	flush_workqueue(md_llbitmap_io_wq);
+	flush_workqueue(md_llbitmap_unplug_wq);
+
+	mddev->bitmap = NULL;
+	llbitmap_free_pages(llbitmap);
+	kfree(llbitmap);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+}
+
+static void llbitmap_start_write(struct mddev *mddev, sector_t offset,
+				 unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = offset >> llbitmap->chunkshift;
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	llbitmap_state_machine(llbitmap, start, end, BitmapActionStartwrite);
+
+	while (page_start <= page_end) {
+		llbitmap_raise_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void
llbitmap_end_write(struct mddev *mddev, sector_t offset, 1079 + unsigned long sectors) 1080 + { 1081 + struct llbitmap *llbitmap = mddev->bitmap; 1082 + unsigned long start = offset >> llbitmap->chunkshift; 1083 + unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift; 1084 + int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT; 1085 + int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT; 1086 + 1087 + while (page_start <= page_end) { 1088 + llbitmap_release_barrier(llbitmap, page_start); 1089 + page_start++; 1090 + } 1091 + } 1092 + 1093 + static void llbitmap_start_discard(struct mddev *mddev, sector_t offset, 1094 + unsigned long sectors) 1095 + { 1096 + struct llbitmap *llbitmap = mddev->bitmap; 1097 + unsigned long start = DIV_ROUND_UP_SECTOR_T(offset, llbitmap->chunksize); 1098 + unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift; 1099 + int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT; 1100 + int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT; 1101 + 1102 + llbitmap_state_machine(llbitmap, start, end, BitmapActionDiscard); 1103 + 1104 + while (page_start <= page_end) { 1105 + llbitmap_raise_barrier(llbitmap, page_start); 1106 + page_start++; 1107 + } 1108 + } 1109 + 1110 + static void llbitmap_end_discard(struct mddev *mddev, sector_t offset, 1111 + unsigned long sectors) 1112 + { 1113 + struct llbitmap *llbitmap = mddev->bitmap; 1114 + unsigned long start = DIV_ROUND_UP_SECTOR_T(offset, llbitmap->chunksize); 1115 + unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift; 1116 + int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT; 1117 + int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT; 1118 + 1119 + while (page_start <= page_end) { 1120 + llbitmap_release_barrier(llbitmap, page_start); 1121 + page_start++; 1122 + } 1123 + } 1124 + 1125 + static void llbitmap_unplug_fn(struct work_struct *work) 1126 + { 1127 + struct llbitmap_unplug_work *unplug_work = 1128 + container_of(work, 
struct llbitmap_unplug_work, work); 1129 + struct llbitmap *llbitmap = unplug_work->llbitmap; 1130 + struct blk_plug plug; 1131 + int i; 1132 + 1133 + blk_start_plug(&plug); 1134 + 1135 + for (i = 0; i < llbitmap->nr_pages; i++) { 1136 + if (!test_bit(LLPageDirty, &llbitmap->pctl[i]->flags) || 1137 + !test_and_clear_bit(LLPageDirty, &llbitmap->pctl[i]->flags)) 1138 + continue; 1139 + 1140 + llbitmap_write_page(llbitmap, i); 1141 + } 1142 + 1143 + blk_finish_plug(&plug); 1144 + md_super_wait(llbitmap->mddev); 1145 + complete(unplug_work->done); 1146 + } 1147 + 1148 + static bool llbitmap_dirty(struct llbitmap *llbitmap) 1149 + { 1150 + int i; 1151 + 1152 + for (i = 0; i < llbitmap->nr_pages; i++) 1153 + if (test_bit(LLPageDirty, &llbitmap->pctl[i]->flags)) 1154 + return true; 1155 + 1156 + return false; 1157 + } 1158 + 1159 + static void llbitmap_unplug(struct mddev *mddev, bool sync) 1160 + { 1161 + DECLARE_COMPLETION_ONSTACK(done); 1162 + struct llbitmap *llbitmap = mddev->bitmap; 1163 + struct llbitmap_unplug_work unplug_work = { 1164 + .llbitmap = llbitmap, 1165 + .done = &done, 1166 + }; 1167 + 1168 + if (!llbitmap_dirty(llbitmap)) 1169 + return; 1170 + 1171 + /* 1172 + * Issue new bitmap IO under submit_bio() context will deadlock: 1173 + * - the bio will wait for bitmap bio to be done, before it can be 1174 + * issued; 1175 + * - bitmap bio will be added to current->bio_list and wait for this 1176 + * bio to be issued; 1177 + */ 1178 + INIT_WORK_ONSTACK(&unplug_work.work, llbitmap_unplug_fn); 1179 + queue_work(md_llbitmap_unplug_wq, &unplug_work.work); 1180 + wait_for_completion(&done); 1181 + destroy_work_on_stack(&unplug_work.work); 1182 + } 1183 + 1184 + /* 1185 + * Force to write all bitmap pages to disk, called when stopping the array, or 1186 + * every daemon_sleep seconds when sync_thread is running. 
1187 + */ 1188 + static void __llbitmap_flush(struct mddev *mddev) 1189 + { 1190 + struct llbitmap *llbitmap = mddev->bitmap; 1191 + struct blk_plug plug; 1192 + int i; 1193 + 1194 + blk_start_plug(&plug); 1195 + for (i = 0; i < llbitmap->nr_pages; i++) { 1196 + struct llbitmap_page_ctl *pctl = llbitmap->pctl[i]; 1197 + 1198 + /* mark all blocks as dirty */ 1199 + set_bit(LLPageDirty, &pctl->flags); 1200 + bitmap_fill(pctl->dirty, llbitmap->blocks_per_page); 1201 + llbitmap_write_page(llbitmap, i); 1202 + } 1203 + blk_finish_plug(&plug); 1204 + md_super_wait(llbitmap->mddev); 1205 + } 1206 + 1207 + static void llbitmap_flush(struct mddev *mddev) 1208 + { 1209 + struct llbitmap *llbitmap = mddev->bitmap; 1210 + int i; 1211 + 1212 + for (i = 0; i < llbitmap->nr_pages; i++) 1213 + set_bit(LLPageFlush, &llbitmap->pctl[i]->flags); 1214 + 1215 + timer_delete_sync(&llbitmap->pending_timer); 1216 + queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work); 1217 + flush_work(&llbitmap->daemon_work); 1218 + 1219 + __llbitmap_flush(mddev); 1220 + } 1221 + 1222 + /* This is used for raid5 lazy initial recovery */ 1223 + static bool llbitmap_blocks_synced(struct mddev *mddev, sector_t offset) 1224 + { 1225 + struct llbitmap *llbitmap = mddev->bitmap; 1226 + unsigned long p = offset >> llbitmap->chunkshift; 1227 + enum llbitmap_state c = llbitmap_read(llbitmap, p); 1228 + 1229 + return c == BitClean || c == BitDirty; 1230 + } 1231 + 1232 + static sector_t llbitmap_skip_sync_blocks(struct mddev *mddev, sector_t offset) 1233 + { 1234 + struct llbitmap *llbitmap = mddev->bitmap; 1235 + unsigned long p = offset >> llbitmap->chunkshift; 1236 + int blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1)); 1237 + enum llbitmap_state c = llbitmap_read(llbitmap, p); 1238 + 1239 + /* always skip unwritten blocks */ 1240 + if (c == BitUnwritten) 1241 + return blocks; 1242 + 1243 + /* For degraded array, don't skip */ 1244 + if (mddev->degraded) 1245 + return 0; 1246 + 1247 + /* For 
resync also skip clean/dirty blocks */ 1248 + if ((c == BitClean || c == BitDirty) && 1249 + test_bit(MD_RECOVERY_SYNC, &mddev->recovery) && 1250 + !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) 1251 + return blocks; 1252 + 1253 + return 0; 1254 + } 1255 + 1256 + static bool llbitmap_start_sync(struct mddev *mddev, sector_t offset, 1257 + sector_t *blocks, bool degraded) 1258 + { 1259 + struct llbitmap *llbitmap = mddev->bitmap; 1260 + unsigned long p = offset >> llbitmap->chunkshift; 1261 + 1262 + /* 1263 + * Handle one bit at a time, this is much simpler. And it doesn't matter 1264 + * if md_do_sync() loop more times. 1265 + */ 1266 + *blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1)); 1267 + return llbitmap_state_machine(llbitmap, p, p, 1268 + BitmapActionStartsync) == BitSyncing; 1269 + } 1270 + 1271 + /* Something is wrong, sync_thread stop at @offset */ 1272 + static void llbitmap_end_sync(struct mddev *mddev, sector_t offset, 1273 + sector_t *blocks) 1274 + { 1275 + struct llbitmap *llbitmap = mddev->bitmap; 1276 + unsigned long p = offset >> llbitmap->chunkshift; 1277 + 1278 + *blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1)); 1279 + llbitmap_state_machine(llbitmap, p, llbitmap->chunks - 1, 1280 + BitmapActionAbortsync); 1281 + } 1282 + 1283 + /* A full sync_thread is finished */ 1284 + static void llbitmap_close_sync(struct mddev *mddev) 1285 + { 1286 + struct llbitmap *llbitmap = mddev->bitmap; 1287 + int i; 1288 + 1289 + for (i = 0; i < llbitmap->nr_pages; i++) { 1290 + struct llbitmap_page_ctl *pctl = llbitmap->pctl[i]; 1291 + 1292 + /* let daemon_fn clear dirty bits immediately */ 1293 + WRITE_ONCE(pctl->expire, jiffies); 1294 + } 1295 + 1296 + llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, 1297 + BitmapActionEndsync); 1298 + } 1299 + 1300 + /* 1301 + * sync_thread have reached @sector, update metadata every daemon_sleep seconds, 1302 + * just in case sync_thread have to restart after power failure. 
1303 + */ 1304 + static void llbitmap_cond_end_sync(struct mddev *mddev, sector_t sector, 1305 + bool force) 1306 + { 1307 + struct llbitmap *llbitmap = mddev->bitmap; 1308 + 1309 + if (sector == 0) { 1310 + llbitmap->last_end_sync = jiffies; 1311 + return; 1312 + } 1313 + 1314 + if (time_before(jiffies, llbitmap->last_end_sync + 1315 + HZ * mddev->bitmap_info.daemon_sleep)) 1316 + return; 1317 + 1318 + wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active)); 1319 + 1320 + mddev->curr_resync_completed = sector; 1321 + set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags); 1322 + llbitmap_state_machine(llbitmap, 0, sector >> llbitmap->chunkshift, 1323 + BitmapActionEndsync); 1324 + __llbitmap_flush(mddev); 1325 + 1326 + llbitmap->last_end_sync = jiffies; 1327 + sysfs_notify_dirent_safe(mddev->sysfs_completed); 1328 + } 1329 + 1330 + static bool llbitmap_enabled(void *data, bool flush) 1331 + { 1332 + struct llbitmap *llbitmap = data; 1333 + 1334 + return llbitmap && !test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags); 1335 + } 1336 + 1337 + static void llbitmap_dirty_bits(struct mddev *mddev, unsigned long s, 1338 + unsigned long e) 1339 + { 1340 + llbitmap_state_machine(mddev->bitmap, s, e, BitmapActionStartwrite); 1341 + } 1342 + 1343 + static void llbitmap_write_sb(struct llbitmap *llbitmap) 1344 + { 1345 + int nr_blocks = DIV_ROUND_UP(BITMAP_DATA_OFFSET, llbitmap->io_size); 1346 + 1347 + bitmap_fill(llbitmap->pctl[0]->dirty, nr_blocks); 1348 + llbitmap_write_page(llbitmap, 0); 1349 + md_super_wait(llbitmap->mddev); 1350 + } 1351 + 1352 + static void llbitmap_update_sb(void *data) 1353 + { 1354 + struct llbitmap *llbitmap = data; 1355 + struct mddev *mddev = llbitmap->mddev; 1356 + struct page *sb_page; 1357 + bitmap_super_t *sb; 1358 + 1359 + if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags)) 1360 + return; 1361 + 1362 + sb_page = llbitmap_read_page(llbitmap, 0); 1363 + if (IS_ERR(sb_page)) { 1364 + pr_err("%s: %s: read super block failed", __func__, 1365 + 
mdname(mddev)); 1366 + set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags); 1367 + return; 1368 + } 1369 + 1370 + if (mddev->events < llbitmap->events_cleared) 1371 + llbitmap->events_cleared = mddev->events; 1372 + 1373 + sb = kmap_local_page(sb_page); 1374 + sb->events = cpu_to_le64(mddev->events); 1375 + sb->state = cpu_to_le32(llbitmap->flags); 1376 + sb->chunksize = cpu_to_le32(llbitmap->chunksize); 1377 + sb->sync_size = cpu_to_le64(mddev->resync_max_sectors); 1378 + sb->events_cleared = cpu_to_le64(llbitmap->events_cleared); 1379 + sb->sectors_reserved = cpu_to_le32(mddev->bitmap_info.space); 1380 + sb->daemon_sleep = cpu_to_le32(mddev->bitmap_info.daemon_sleep); 1381 + 1382 + kunmap_local(sb); 1383 + llbitmap_write_sb(llbitmap); 1384 + } 1385 + 1386 + static int llbitmap_get_stats(void *data, struct md_bitmap_stats *stats) 1387 + { 1388 + struct llbitmap *llbitmap = data; 1389 + 1390 + memset(stats, 0, sizeof(*stats)); 1391 + 1392 + stats->missing_pages = 0; 1393 + stats->pages = llbitmap->nr_pages; 1394 + stats->file_pages = llbitmap->nr_pages; 1395 + 1396 + stats->behind_writes = atomic_read(&llbitmap->behind_writes); 1397 + stats->behind_wait = wq_has_sleeper(&llbitmap->behind_wait); 1398 + stats->events_cleared = llbitmap->events_cleared; 1399 + 1400 + return 0; 1401 + } 1402 + 1403 + /* just flag all pages as needing to be written */ 1404 + static void llbitmap_write_all(struct mddev *mddev) 1405 + { 1406 + int i; 1407 + struct llbitmap *llbitmap = mddev->bitmap; 1408 + 1409 + for (i = 0; i < llbitmap->nr_pages; i++) { 1410 + struct llbitmap_page_ctl *pctl = llbitmap->pctl[i]; 1411 + 1412 + set_bit(LLPageDirty, &pctl->flags); 1413 + bitmap_fill(pctl->dirty, llbitmap->blocks_per_page); 1414 + } 1415 + } 1416 + 1417 + static void llbitmap_start_behind_write(struct mddev *mddev) 1418 + { 1419 + struct llbitmap *llbitmap = mddev->bitmap; 1420 + 1421 + atomic_inc(&llbitmap->behind_writes); 1422 + } 1423 + 1424 + static void llbitmap_end_behind_write(struct mddev 
*mddev) 1425 + { 1426 + struct llbitmap *llbitmap = mddev->bitmap; 1427 + 1428 + if (atomic_dec_and_test(&llbitmap->behind_writes)) 1429 + wake_up(&llbitmap->behind_wait); 1430 + } 1431 + 1432 + static void llbitmap_wait_behind_writes(struct mddev *mddev) 1433 + { 1434 + struct llbitmap *llbitmap = mddev->bitmap; 1435 + 1436 + if (!llbitmap) 1437 + return; 1438 + 1439 + wait_event(llbitmap->behind_wait, 1440 + atomic_read(&llbitmap->behind_writes) == 0); 1441 + 1442 + } 1443 + 1444 + static ssize_t bits_show(struct mddev *mddev, char *page) 1445 + { 1446 + struct llbitmap *llbitmap; 1447 + int bits[BitStateCount] = {0}; 1448 + loff_t start = 0; 1449 + 1450 + mutex_lock(&mddev->bitmap_info.mutex); 1451 + llbitmap = mddev->bitmap; 1452 + if (!llbitmap || !llbitmap->pctl) { 1453 + mutex_unlock(&mddev->bitmap_info.mutex); 1454 + return sprintf(page, "no bitmap\n"); 1455 + } 1456 + 1457 + if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags)) { 1458 + mutex_unlock(&mddev->bitmap_info.mutex); 1459 + return sprintf(page, "bitmap io error\n"); 1460 + } 1461 + 1462 + while (start < llbitmap->chunks) { 1463 + enum llbitmap_state c = llbitmap_read(llbitmap, start); 1464 + 1465 + if (c < 0 || c >= BitStateCount) 1466 + pr_err("%s: invalid bit %llu state %d\n", 1467 + __func__, start, c); 1468 + else 1469 + bits[c]++; 1470 + start++; 1471 + } 1472 + 1473 + mutex_unlock(&mddev->bitmap_info.mutex); 1474 + return sprintf(page, "unwritten %d\nclean %d\ndirty %d\nneed sync %d\nsyncing %d\n", 1475 + bits[BitUnwritten], bits[BitClean], bits[BitDirty], 1476 + bits[BitNeedSync], bits[BitSyncing]); 1477 + } 1478 + 1479 + static struct md_sysfs_entry llbitmap_bits = __ATTR_RO(bits); 1480 + 1481 + static ssize_t metadata_show(struct mddev *mddev, char *page) 1482 + { 1483 + struct llbitmap *llbitmap; 1484 + ssize_t ret; 1485 + 1486 + mutex_lock(&mddev->bitmap_info.mutex); 1487 + llbitmap = mddev->bitmap; 1488 + if (!llbitmap) { 1489 + mutex_unlock(&mddev->bitmap_info.mutex); 1490 + return 
sprintf(page, "no bitmap\n"); 1491 + } 1492 + 1493 + ret = sprintf(page, "chunksize %lu\nchunkshift %lu\nchunks %lu\noffset %llu\ndaemon_sleep %lu\n", 1494 + llbitmap->chunksize, llbitmap->chunkshift, 1495 + llbitmap->chunks, mddev->bitmap_info.offset, 1496 + llbitmap->mddev->bitmap_info.daemon_sleep); 1497 + mutex_unlock(&mddev->bitmap_info.mutex); 1498 + 1499 + return ret; 1500 + } 1501 + 1502 + static struct md_sysfs_entry llbitmap_metadata = __ATTR_RO(metadata); 1503 + 1504 + static ssize_t 1505 + daemon_sleep_show(struct mddev *mddev, char *page) 1506 + { 1507 + return sprintf(page, "%lu\n", mddev->bitmap_info.daemon_sleep); 1508 + } 1509 + 1510 + static ssize_t 1511 + daemon_sleep_store(struct mddev *mddev, const char *buf, size_t len) 1512 + { 1513 + unsigned long timeout; 1514 + int rv = kstrtoul(buf, 10, &timeout); 1515 + 1516 + if (rv) 1517 + return rv; 1518 + 1519 + mddev->bitmap_info.daemon_sleep = timeout; 1520 + return len; 1521 + } 1522 + 1523 + static struct md_sysfs_entry llbitmap_daemon_sleep = __ATTR_RW(daemon_sleep); 1524 + 1525 + static ssize_t 1526 + barrier_idle_show(struct mddev *mddev, char *page) 1527 + { 1528 + struct llbitmap *llbitmap = mddev->bitmap; 1529 + 1530 + return sprintf(page, "%lu\n", llbitmap->barrier_idle); 1531 + } 1532 + 1533 + static ssize_t 1534 + barrier_idle_store(struct mddev *mddev, const char *buf, size_t len) 1535 + { 1536 + struct llbitmap *llbitmap = mddev->bitmap; 1537 + unsigned long timeout; 1538 + int rv = kstrtoul(buf, 10, &timeout); 1539 + 1540 + if (rv) 1541 + return rv; 1542 + 1543 + llbitmap->barrier_idle = timeout; 1544 + return len; 1545 + } 1546 + 1547 + static struct md_sysfs_entry llbitmap_barrier_idle = __ATTR_RW(barrier_idle); 1548 + 1549 + static struct attribute *md_llbitmap_attrs[] = { 1550 + &llbitmap_bits.attr, 1551 + &llbitmap_metadata.attr, 1552 + &llbitmap_daemon_sleep.attr, 1553 + &llbitmap_barrier_idle.attr, 1554 + NULL 1555 + }; 1556 + 1557 + static struct attribute_group 
md_llbitmap_group = { 1558 + .name = "llbitmap", 1559 + .attrs = md_llbitmap_attrs, 1560 + }; 1561 + 1562 + static struct bitmap_operations llbitmap_ops = { 1563 + .head = { 1564 + .type = MD_BITMAP, 1565 + .id = ID_LLBITMAP, 1566 + .name = "llbitmap", 1567 + }, 1568 + 1569 + .enabled = llbitmap_enabled, 1570 + .create = llbitmap_create, 1571 + .resize = llbitmap_resize, 1572 + .load = llbitmap_load, 1573 + .destroy = llbitmap_destroy, 1574 + 1575 + .start_write = llbitmap_start_write, 1576 + .end_write = llbitmap_end_write, 1577 + .start_discard = llbitmap_start_discard, 1578 + .end_discard = llbitmap_end_discard, 1579 + .unplug = llbitmap_unplug, 1580 + .flush = llbitmap_flush, 1581 + 1582 + .start_behind_write = llbitmap_start_behind_write, 1583 + .end_behind_write = llbitmap_end_behind_write, 1584 + .wait_behind_writes = llbitmap_wait_behind_writes, 1585 + 1586 + .blocks_synced = llbitmap_blocks_synced, 1587 + .skip_sync_blocks = llbitmap_skip_sync_blocks, 1588 + .start_sync = llbitmap_start_sync, 1589 + .end_sync = llbitmap_end_sync, 1590 + .close_sync = llbitmap_close_sync, 1591 + .cond_end_sync = llbitmap_cond_end_sync, 1592 + 1593 + .update_sb = llbitmap_update_sb, 1594 + .get_stats = llbitmap_get_stats, 1595 + .dirty_bits = llbitmap_dirty_bits, 1596 + .write_all = llbitmap_write_all, 1597 + 1598 + .group = &md_llbitmap_group, 1599 + }; 1600 + 1601 + int md_llbitmap_init(void) 1602 + { 1603 + md_llbitmap_io_wq = alloc_workqueue("md_llbitmap_io", 1604 + WQ_MEM_RECLAIM | WQ_UNBOUND, 0); 1605 + if (!md_llbitmap_io_wq) 1606 + return -ENOMEM; 1607 + 1608 + md_llbitmap_unplug_wq = alloc_workqueue("md_llbitmap_unplug", 1609 + WQ_MEM_RECLAIM | WQ_UNBOUND, 0); 1610 + if (!md_llbitmap_unplug_wq) { 1611 + destroy_workqueue(md_llbitmap_io_wq); 1612 + md_llbitmap_io_wq = NULL; 1613 + return -ENOMEM; 1614 + } 1615 + 1616 + return register_md_submodule(&llbitmap_ops.head); 1617 + } 1618 + 1619 + void md_llbitmap_exit(void) 1620 + { 1621 + 
destroy_workqueue(md_llbitmap_io_wq); 1622 + md_llbitmap_io_wq = NULL; 1623 + destroy_workqueue(md_llbitmap_unplug_wq); 1624 + md_llbitmap_unplug_wq = NULL; 1625 + unregister_md_submodule(&llbitmap_ops.head); 1626 + }
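
Reviewer's sketch (not part of the patch): the chunk arithmetic in llbitmap_start_write(), llbitmap_start_discard() and llbitmap_resize() can be tried out in userspace C. This is a minimal illustration under stated assumptions: chunk_shift(), write_range(), discard_range() and fit_chunksize() are hypothetical helper names, the loop-based shift stands in for the kernel's ffz(~chunksize), and max_chunks stands in for the limit the patch derives from bitmap_info.space.

```c
#include <assert.h>

typedef unsigned long long sector_t;

struct chunk_range {
	unsigned long start;
	unsigned long end;
};

/* log2 of a power-of-two chunk size (stand-in for ffz(~chunksize)). */
unsigned int chunk_shift(unsigned long chunksize)
{
	unsigned int shift = 0;

	assert(chunksize && !(chunksize & (chunksize - 1)));
	while ((1UL << shift) < chunksize)
		shift++;
	return shift;
}

/* Writes cover every chunk that the sector range touches. */
struct chunk_range write_range(sector_t offset, unsigned long sectors,
			       unsigned long chunksize)
{
	unsigned int shift = chunk_shift(chunksize);
	struct chunk_range r;

	r.start = (unsigned long)(offset >> shift);
	r.end = (unsigned long)((offset + sectors - 1) >> shift);
	return r;
}

/*
 * Discards round the first chunk up, so a chunk that is only partly
 * covered at its head is skipped; the last chunk is computed exactly
 * as for writes.
 */
struct chunk_range discard_range(sector_t offset, unsigned long sectors,
				 unsigned long chunksize)
{
	unsigned int shift = chunk_shift(chunksize);
	struct chunk_range r;

	r.start = (unsigned long)((offset + chunksize - 1) / chunksize);
	r.end = (unsigned long)((offset + sectors - 1) >> shift);
	return r;
}

/*
 * Double the chunk size until the chunk count fits in max_chunks.
 * Assumes max_chunks >= 1 and blocks >= 1.
 */
unsigned long fit_chunksize(sector_t blocks, unsigned long chunksize,
			    unsigned long max_chunks)
{
	while ((blocks + chunksize - 1) / chunksize > max_chunks)
		chunksize <<= 1;
	return chunksize;
}
```

With 1024-sector chunks, a 100-sector request at sector 1000 maps to chunks 0..1 as a write but only to chunk 1 as a discard. In this sketch a discard that stays inside one chunk can produce start > end, which is an empty range.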

+310 -72

drivers/md/md.c

··· 94 94
  * workqueue whith reconfig_mutex grabbed.
  */
 static struct workqueue_struct *md_misc_wq;
-struct workqueue_struct *md_bitmap_wq;

 static int remove_and_add_spares(struct mddev *mddev,
 				 struct md_rdev *this);
··· 676 677

 static void no_op(struct percpu_ref *r) {}

+static bool mddev_set_bitmap_ops(struct mddev *mddev)
+{
+	struct bitmap_operations *old = mddev->bitmap_ops;
+	struct md_submodule_head *head;
+
+	if (mddev->bitmap_id == ID_BITMAP_NONE ||
+	    (old && old->head.id == mddev->bitmap_id))
+		return true;
+
+	xa_lock(&md_submodule);
+	head = xa_load(&md_submodule, mddev->bitmap_id);
+
+	if (!head) {
+		pr_warn("md: can't find bitmap id %d\n", mddev->bitmap_id);
+		goto err;
+	}
+
+	if (head->type != MD_BITMAP) {
+		pr_warn("md: invalid bitmap id %d\n", mddev->bitmap_id);
+		goto err;
+	}
+
+	mddev->bitmap_ops = (void *)head;
+	xa_unlock(&md_submodule);
+
+	if (!mddev_is_dm(mddev) && mddev->bitmap_ops->group) {
+		if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group))
+			pr_warn("md: cannot register extra bitmap attributes for %s\n",
+				mdname(mddev));
+		else
+			/*
+			 * Inform user with KOBJ_CHANGE about new bitmap
+			 * attributes.
+			 */
+			kobject_uevent(&mddev->kobj, KOBJ_CHANGE);
+	}
+	return true;
+
+err:
+	xa_unlock(&md_submodule);
+	return false;
+}
+
+static void mddev_clear_bitmap_ops(struct mddev *mddev)
+{
+	if (!mddev_is_dm(mddev) && mddev->bitmap_ops &&
+	    mddev->bitmap_ops->group)
+		sysfs_remove_group(&mddev->kobj, mddev->bitmap_ops->group);
+
+	mddev->bitmap_ops = NULL;
+}
+
 int mddev_init(struct mddev *mddev)
 {
+	if (!IS_ENABLED(CONFIG_MD_BITMAP))
+		mddev->bitmap_id = ID_BITMAP_NONE;
+	else
+		mddev->bitmap_id = ID_BITMAP;

 	if (percpu_ref_init(&mddev->active_io, active_io_release,
 			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL))
··· 768 713
 	mddev->resync_min = 0;
 	mddev->resync_max = MaxSector;
 	mddev->level = LEVEL_NONE;
-	mddev_set_bitmap_ops(mddev);

 	INIT_WORK(&mddev->sync_work, md_start_sync);
 	INIT_WORK(&mddev->del_work, mddev_delayed_delete);
··· 1074 1020
 		wake_up(&mddev->sb_wait);
 }

-void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
-		   sector_t sector, int size, struct page *page)
+/**
+ * md_write_metadata - write metadata to underlying disk, including
+ * array superblock, badblocks, bitmap superblock and bitmap bits.
+ * @mddev:	the array to write
+ * @rdev:	the underlying disk to write
+ * @sector:	the offset to @rdev
+ * @size:	the length of the metadata
+ * @page:	the metadata
+ * @offset:	the offset to @page
+ *
+ * Write @size bytes of @page start from @offset, to @sector of @rdev, Increment
+ * mddev->pending_writes before returning, and decrement it on completion,
+ * waking up sb_wait. Caller must call md_super_wait() after issuing io to all
+ * rdev. If an error occurred, md_error() will be called, and the @rdev will be
+ * kicked out from @mddev.
+ */
+void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
+		       sector_t sector, int size, struct page *page,
+		       unsigned int offset)
 {
-	/* write first size bytes of page to sector of rdev
-	 * Increment mddev->pending_writes before returning
-	 * and decrement it on completion, waking up sb_wait
-	 * if zero is reached.
-	 * If an error occurred, call md_error
-	 */
 	struct bio *bio;

 	if (!page)
··· 1111 1046
 	atomic_inc(&rdev->nr_pending);

 	bio->bi_iter.bi_sector = sector;
-	__bio_add_page(bio, page, size, 0);
+	__bio_add_page(bio, page, size, offset);
 	bio->bi_private = rdev;
 	bio->bi_end_io = super_written;

··· 1421 1356
 	struct md_bitmap_stats stats;
 	int err;

+	if (!md_bitmap_enabled(mddev, false))
+		return 0;
+
 	err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
 	if (err)
 		return 0;
··· 1721 1653
 	if ((u64)num_sectors >= (2ULL << 32) && rdev->mddev->level >= 1)
 		num_sectors = (sector_t)(2ULL << 32) - 2;
 	do {
-		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
-			       rdev->sb_page);
+		md_write_metadata(rdev->mddev, rdev, rdev->sb_start,
+				  rdev->sb_size, rdev->sb_page, 0);
 	} while (md_super_wait(rdev->mddev) < 0);
 	return num_sectors;
 }
··· 2370 2302
 	sb->super_offset = cpu_to_le64(rdev->sb_start);
 	sb->sb_csum = calc_sb_1_csum(sb);
 	do {
-		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
-			       rdev->sb_page);
+		md_write_metadata(rdev->mddev, rdev, rdev->sb_start,
+				  rdev->sb_size, rdev->sb_page, 0);
 	} while (md_super_wait(rdev->mddev) < 0);
 	return num_sectors;

··· 2381 2313
 super_1_allow_new_offset(struct md_rdev *rdev,
 			 unsigned long long new_offset)
 {
+	struct mddev *mddev = rdev->mddev;
+
 	/* All necessary checks on new >= old have been done */
 	if (new_offset >= rdev->data_offset)
 		return 1;

 	/* with 1.0 metadata, there is no metadata to tread on
 	 * so we can always move back */
-	if (rdev->mddev->minor_version == 0)
+	if (mddev->minor_version == 0)
 		return 1;

 	/* otherwise we must be sure not to step on
··· 2401 2331
 	if (rdev->sb_start + (32+4)*2 > new_offset)
 		return 0;

-	if (!rdev->mddev->bitmap_info.file) {
-		struct mddev *mddev = rdev->mddev;
+	if (md_bitmap_registered(mddev) && !mddev->bitmap_info.file) {
 		struct md_bitmap_stats stats;
 		int err;

··· 2873 2804

 	mddev_add_trace_msg(mddev, "md md_update_sb");
 rewrite:
-	mddev->bitmap_ops->update_sb(mddev->bitmap);
+	if (md_bitmap_enabled(mddev, false))
+		mddev->bitmap_ops->update_sb(mddev->bitmap);
 	rdev_for_each(rdev, mddev) {
 		if (rdev->sb_loaded != 1)
 			continue; /* no noise on spare devices */

 		if (!test_bit(Faulty, &rdev->flags)) {
-			md_super_write(mddev,rdev,
-				       rdev->sb_start, rdev->sb_size,
-				       rdev->sb_page);
+			md_write_metadata(mddev, rdev, rdev->sb_start,
+					  rdev->sb_size, rdev->sb_page, 0);
 			pr_debug("md: (write) %pg's sb offset: %llu\n",
 				 rdev->bdev,
 				 (unsigned long long)rdev->sb_start);
 			rdev->sb_events = mddev->events;
 			if (rdev->badblocks.size) {
-				md_super_write(mddev, rdev,
-					       rdev->badblocks.sector,
-					       rdev->badblocks.size << 9,
-					       rdev->bb_page);
+				md_write_metadata(mddev, rdev,
+						  rdev->badblocks.sector,
+						  rdev->badblocks.size << 9,
+						  rdev->bb_page, 0);
 				rdev->badblocks.size = 0;
 			}

··· 4219 4150
 __ATTR(new_level, 0664, new_level_show, new_level_store);

 static ssize_t
+bitmap_type_show(struct mddev *mddev, char *page)
+{
+	struct md_submodule_head *head;
+	unsigned long i;
+	ssize_t len = 0;
+
+	if (mddev->bitmap_id == ID_BITMAP_NONE)
+		len += sprintf(page + len, "[none] ");
+	else
+		len += sprintf(page + len, "none ");
+
+	xa_lock(&md_submodule);
+	xa_for_each(&md_submodule, i, head) {
+		if (head->type != MD_BITMAP)
+			continue;
+
+		if (mddev->bitmap_id == head->id)
+			len += sprintf(page + len, "[%s] ", head->name);
+		else
+			len += sprintf(page + len, "%s ", head->name);
+	}
+	xa_unlock(&md_submodule);
+
+	len += sprintf(page + len, "\n");
+	return len;
+}
+
+static ssize_t
+bitmap_type_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	struct md_submodule_head *head;
+	enum md_submodule_id id;
+	unsigned long i;
+	int err = 0;
+
+	xa_lock(&md_submodule);
+
+	if (mddev->bitmap_ops) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	if (cmd_match(buf, "none")) {
+		mddev->bitmap_id = ID_BITMAP_NONE;
+		goto out;
+	}
+
+	xa_for_each(&md_submodule, i, head) {
+		if (head->type == MD_BITMAP && cmd_match(buf, head->name)) {
+			mddev->bitmap_id = head->id;
+			goto out;
+		}
+	}
+
+	err = kstrtoint(buf, 10, &id);
+	if (err)
+		goto out;
+
+	if (id == ID_BITMAP_NONE) {
+		mddev->bitmap_id = id;
+		goto out;
+	}
+
+	head = xa_load(&md_submodule, id);
+	if (head && head->type == MD_BITMAP) {
+		mddev->bitmap_id = id;
+		goto out;
+	}
+
+	err = -ENOENT;
+
+out:
+	xa_unlock(&md_submodule);
+	return err ? err : len;
+}
+
+static struct md_sysfs_entry md_bitmap_type =
+__ATTR(bitmap_type, 0664, bitmap_type_show, bitmap_type_store);
+
+static ssize_t
 layout_show(struct mddev *mddev, char *page)
 {
 	/* just a number, not meaningful for all levels */
··· 4828 4679
 	char *end;
 	unsigned long chunk, end_chunk;
 	int err;
+
+	if (!md_bitmap_enabled(mddev, false))
+		return len;

 	err = mddev_lock(mddev);
 	if (err)
··· 5904 5752
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
 	&md_new_level.attr,
+	&md_bitmap_type.attr,
 	&md_layout.attr,
 	&md_raid_disks.attr,
 	&md_uuid.attr,
··· 5954 5801

 static const struct attribute_group *md_attr_groups[] = {
 	&md_default_group,
-	&md_bitmap_group,
 	NULL,
 };

··· 6285 6133

 static int start_dirty_degraded;

+static int md_bitmap_create(struct mddev *mddev)
+{
+	if (mddev->bitmap_id == ID_BITMAP_NONE)
+		return -EINVAL;
+
+	if (!mddev_set_bitmap_ops(mddev))
+		return -ENOENT;
+
+	return mddev->bitmap_ops->create(mddev);
+}
+
+static void md_bitmap_destroy(struct mddev *mddev)
+{
+	if (!md_bitmap_registered(mddev))
+		return;
+
+	mddev->bitmap_ops->destroy(mddev);
+	mddev_clear_bitmap_ops(mddev);
+}
+
 int md_run(struct mddev *mddev)
 {
 	int err;
··· 6471 6299
 	}
 	if (err == 0 && pers->sync_request &&
 	    (mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
-		err = mddev->bitmap_ops->create(mddev);
+		err = md_bitmap_create(mddev);
 		if (err)
 			pr_warn("%s: failed to create bitmap (%d)\n",
 				mdname(mddev), err);
··· 6544 6372
 		pers->free(mddev, mddev->private);
 	mddev->private = NULL;
 	put_pers(pers);
-	mddev->bitmap_ops->destroy(mddev);
+	md_bitmap_destroy(mddev);
 abort:
 	bioset_exit(&mddev->io_clone_set);
 exit_sync_set:
··· 6564 6392
 	if (err)
 		goto out;

-	err = mddev->bitmap_ops->load(mddev);
-	if (err) {
-		mddev->bitmap_ops->destroy(mddev);
-		goto out;
+	if (md_bitmap_registered(mddev)) {
+		err = mddev->bitmap_ops->load(mddev);
+		if (err) {
+			md_bitmap_destroy(mddev);
+			goto out;
+		}
 	}

 	if (mddev_is_clustered(mddev))
··· 6720 6546
 		mddev->pers->quiesce(mddev, 0);
 	}

-	mddev->bitmap_ops->flush(mddev);
+	if (md_bitmap_enabled(mddev, true))
+		mddev->bitmap_ops->flush(mddev);

 	if (md_is_rdwr(mddev) &&
 	    ((!mddev->in_sync && !mddev_is_clustered(mddev)) ||
··· 6748 6573

 static void mddev_detach(struct mddev *mddev)
 {
-	mddev->bitmap_ops->wait_behind_writes(mddev);
+	if (md_bitmap_enabled(mddev, false))
+		mddev->bitmap_ops->wait_behind_writes(mddev);
 	if (mddev->pers && mddev->pers->quiesce && !is_md_suspended(mddev)) {
 		mddev->pers->quiesce(mddev, 1);
 		mddev->pers->quiesce(mddev, 0);
··· 6765 6589
 {
 	struct md_personality *pers = mddev->pers;

-	mddev->bitmap_ops->destroy(mddev);
+	md_bitmap_destroy(mddev);
 	mddev_detach(mddev);
 	spin_lock(&mddev->lock);
 	mddev->pers = NULL;
··· 7483 7307
 {
 	int err = 0;

+	if (!md_bitmap_registered(mddev))
+		return -EINVAL;
+
 	if (mddev->pers) {
 		if (!mddev->pers->quiesce || !mddev->thread)
 			return -EBUSY;
··· 7542 7363
 	err = 0;
 	if (mddev->pers) {
 		if (fd >= 0) {
-			err = mddev->bitmap_ops->create(mddev);
+			err = md_bitmap_create(mddev);
 			if (!err)
 				err = mddev->bitmap_ops->load(mddev);

 			if (err) {
-				mddev->bitmap_ops->destroy(mddev);
+				md_bitmap_destroy(mddev);
 				fd = -1;
 			}
 		} else if (fd < 0) {
-			mddev->bitmap_ops->destroy(mddev);
+			md_bitmap_destroy(mddev);
 		}
 	}

··· 7858 7679
 				mddev->bitmap_info.default_offset;
 			mddev->bitmap_info.space =
 				mddev->bitmap_info.default_space;
-			rv = mddev->bitmap_ops->create(mddev);
+			rv = md_bitmap_create(mddev);
 			if (!rv)
 				rv = mddev->bitmap_ops->load(mddev);

 			if (rv)
-				mddev->bitmap_ops->destroy(mddev);
+				md_bitmap_destroy(mddev);
 		} else {
 			struct md_bitmap_stats stats;

··· 7889 7710
 				put_cluster_ops(mddev);
 				mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY;
 			}
-			mddev->bitmap_ops->destroy(mddev);
+			md_bitmap_destroy(mddev);
 			mddev->bitmap_info.offset = 0;
 		}
 	}
··· 7926 7747
  * 4 sectors (with a BIG number of cylinders...). This drives
  * dosfs just mad... ;-)
  */
-static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
+static int md_getgeo(struct gendisk *disk, struct hd_geometry *geo)
 {
-	struct mddev *mddev = bdev->bd_disk->private_data;
+	struct mddev *mddev = disk->private_data;

 	geo->heads = 2;
 	geo->sectors = 4;
··· 8670 8491
 	unsigned long chunk_kb;
 	int err;

+	if (!md_bitmap_enabled(mddev, false))
+		return;
+
 	err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
 	if (err)
 		return;
··· 9055 8873
 static void md_bitmap_start(struct mddev *mddev,
 			    struct md_io_clone *md_io_clone)
 {
+	md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ?
8877 + mddev->bitmap_ops->start_discard : 8878 + mddev->bitmap_ops->start_write; 8879 + 9058 8880 if (mddev->pers->bitmap_sector) 9059 8881 mddev->pers->bitmap_sector(mddev, &md_io_clone->offset, 9060 8882 &md_io_clone->sectors); 9061 8883 9062 - mddev->bitmap_ops->start_write(mddev, md_io_clone->offset, 9063 - md_io_clone->sectors); 8884 + fn(mddev, md_io_clone->offset, md_io_clone->sectors); 9064 8885 } 9065 8886 9066 8887 static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone) 9067 8888 { 9068 - mddev->bitmap_ops->end_write(mddev, md_io_clone->offset, 9069 - md_io_clone->sectors); 8889 + md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ? 8890 + mddev->bitmap_ops->end_discard : 8891 + mddev->bitmap_ops->end_write; 8892 + 8893 + fn(mddev, md_io_clone->offset, md_io_clone->sectors); 9070 8894 } 9071 8895 9072 8896 static void md_end_clone_io(struct bio *bio) ··· 9081 8893 struct bio *orig_bio = md_io_clone->orig_bio; 9082 8894 struct mddev *mddev = md_io_clone->mddev; 9083 8895 9084 - if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap) 8896 + if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false)) 9085 8897 md_bitmap_end(mddev, md_io_clone); 9086 8898 9087 8899 if (bio->bi_status && !orig_bio->bi_status) ··· 9108 8920 if (blk_queue_io_stat(bdev->bd_disk->queue)) 9109 8921 md_io_clone->start_time = bio_start_io_acct(*bio); 9110 8922 9111 - if (bio_data_dir(*bio) == WRITE && mddev->bitmap) { 8923 + if (bio_data_dir(*bio) == WRITE && md_bitmap_enabled(mddev, false)) { 9112 8924 md_io_clone->offset = (*bio)->bi_iter.bi_sector; 9113 8925 md_io_clone->sectors = bio_sectors(*bio); 8926 + md_io_clone->rw = op_stat_group(bio_op(*bio)); 9114 8927 md_bitmap_start(mddev, md_io_clone); 9115 8928 } 9116 8929 ··· 9133 8944 struct bio *orig_bio = md_io_clone->orig_bio; 9134 8945 struct mddev *mddev = md_io_clone->mddev; 9135 8946 9136 - if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap) 8947 + if (bio_data_dir(orig_bio) == 
WRITE && md_bitmap_enabled(mddev, false)) 9137 8948 md_bitmap_end(mddev, md_io_clone); 9138 8949 9139 8950 if (bio->bi_status && !orig_bio->bi_status) ··· 9199 9010 } 9200 9011 } 9201 9012 9013 + /* 9014 + * If lazy recovery is requested and all rdevs are in sync, select the rdev with 9015 + * the higest index to perfore recovery to build initial xor data, this is the 9016 + * same as old bitmap. 9017 + */ 9018 + static bool mddev_select_lazy_recover_rdev(struct mddev *mddev) 9019 + { 9020 + struct md_rdev *recover_rdev = NULL; 9021 + struct md_rdev *rdev; 9022 + bool ret = false; 9023 + 9024 + rcu_read_lock(); 9025 + rdev_for_each_rcu(rdev, mddev) { 9026 + if (rdev->raid_disk < 0) 9027 + continue; 9028 + 9029 + if (test_bit(Faulty, &rdev->flags) || 9030 + !test_bit(In_sync, &rdev->flags)) 9031 + break; 9032 + 9033 + if (!recover_rdev || recover_rdev->raid_disk < rdev->raid_disk) 9034 + recover_rdev = rdev; 9035 + } 9036 + 9037 + if (recover_rdev) { 9038 + clear_bit(In_sync, &recover_rdev->flags); 9039 + ret = true; 9040 + } 9041 + 9042 + rcu_read_unlock(); 9043 + return ret; 9044 + } 9045 + 9202 9046 static sector_t md_sync_position(struct mddev *mddev, enum sync_action action) 9203 9047 { 9204 9048 sector_t start = 0; ··· 9263 9041 start = rdev->recovery_offset; 9264 9042 rcu_read_unlock(); 9265 9043 9044 + /* 9045 + * If there are no spares, and raid456 lazy initial recover is 9046 + * requested. 9047 + */ 9048 + if (test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery) && 9049 + start == MaxSector && mddev_select_lazy_recover_rdev(mddev)) 9050 + start = 0; 9051 + 9266 9052 /* If there is a bitmap, we need to make sure all 9267 9053 * writes that started before we added a spare 9268 9054 * complete before we start doing a recovery. 
··· 9291 9061 9292 9062 static bool sync_io_within_limit(struct mddev *mddev) 9293 9063 { 9294 - int io_sectors; 9295 - 9296 9064 /* 9297 9065 * For raid456, sync IO is stripe(4k) per IO, for other levels, it's 9298 9066 * RESYNC_PAGES(64k) per IO. 9299 9067 */ 9300 - if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6) 9301 - io_sectors = 8; 9302 - else 9303 - io_sectors = 128; 9304 - 9305 9068 return atomic_read(&mddev->recovery_active) < 9306 - io_sectors * sync_io_depth(mddev); 9069 + (raid_is_456(mddev) ? 8 : 128) * sync_io_depth(mddev); 9307 9070 } 9308 9071 9309 9072 #define SYNC_MARKS 10 ··· 9506 9283 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 9507 9284 break; 9508 9285 9286 + if (mddev->bitmap_ops && mddev->bitmap_ops->skip_sync_blocks) { 9287 + sectors = mddev->bitmap_ops->skip_sync_blocks(mddev, j); 9288 + if (sectors) 9289 + goto update; 9290 + } 9291 + 9509 9292 sectors = mddev->pers->sync_request(mddev, j, max_sectors, 9510 9293 &skipped); 9511 9294 if (sectors == 0) { ··· 9527 9298 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 9528 9299 break; 9529 9300 9301 + update: 9530 9302 j += sectors; 9531 9303 if (j > max_sectors) 9532 9304 /* when skipping, extra large numbers can be returned. */ ··· 9837 9607 9838 9608 set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); 9839 9609 clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9610 + clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery); 9840 9611 return true; 9841 9612 } 9842 9613 ··· 9846 9615 remove_spares(mddev, NULL); 9847 9616 set_bit(MD_RECOVERY_SYNC, &mddev->recovery); 9848 9617 clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9618 + clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery); 9849 9619 return true; 9850 9620 } 9851 9621 ··· 9856 9624 * re-add. 
9857 9625 */ 9858 9626 *spares = remove_and_add_spares(mddev, NULL); 9859 - if (*spares) { 9627 + if (*spares || test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery)) { 9860 9628 clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); 9861 9629 clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); 9862 9630 clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); ··· 9914 9682 * We are adding a device or devices to an array which has the bitmap 9915 9683 * stored on all devices. So make sure all bitmap pages get written. 9916 9684 */ 9917 - if (spares) 9685 + if (spares && md_bitmap_enabled(mddev, true)) 9918 9686 mddev->bitmap_ops->write_all(mddev); 9919 9687 9920 9688 name = test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ? ··· 10002 9770 */ 10003 9771 void md_check_recovery(struct mddev *mddev) 10004 9772 { 10005 - if (mddev->bitmap) 9773 + if (md_bitmap_enabled(mddev, false) && mddev->bitmap_ops->daemon_work) 10006 9774 mddev->bitmap_ops->daemon_work(mddev); 10007 9775 10008 9776 if (signal_pending(current)) { ··· 10069 9837 } 10070 9838 10071 9839 clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9840 + clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery); 10072 9841 clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 10073 9842 clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags); 10074 9843 ··· 10180 9947 clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); 10181 9948 clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); 10182 9949 clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); 9950 + clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery); 10183 9951 /* 10184 9952 * We call mddev->cluster_ops->update_size here because sync_size could 10185 9953 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared, ··· 10328 10094 10329 10095 static int __init md_init(void) 10330 10096 { 10331 - int ret = -ENOMEM; 10097 + int ret = md_bitmap_init(); 10332 10098 10099 + if (ret) 10100 + return ret; 10101 + 10102 + ret = md_llbitmap_init(); 10103 + if (ret) 10104 + goto err_bitmap; 10105 + 10106 
+ ret = -ENOMEM; 10333 10107 md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0); 10334 10108 if (!md_wq) 10335 10109 goto err_wq; ··· 10345 10103 md_misc_wq = alloc_workqueue("md_misc", 0, 0); 10346 10104 if (!md_misc_wq) 10347 10105 goto err_misc_wq; 10348 - 10349 - md_bitmap_wq = alloc_workqueue("md_bitmap", WQ_MEM_RECLAIM | WQ_UNBOUND, 10350 - 0); 10351 - if (!md_bitmap_wq) 10352 - goto err_bitmap_wq; 10353 10106 10354 10107 ret = __register_blkdev(MD_MAJOR, "md", md_probe); 10355 10108 if (ret < 0) ··· 10364 10127 err_mdp: 10365 10128 unregister_blkdev(MD_MAJOR, "md"); 10366 10129 err_md: 10367 - destroy_workqueue(md_bitmap_wq); 10368 - err_bitmap_wq: 10369 10130 destroy_workqueue(md_misc_wq); 10370 10131 err_misc_wq: 10371 10132 destroy_workqueue(md_wq); 10372 10133 err_wq: 10134 + md_llbitmap_exit(); 10135 + err_bitmap: 10136 + md_bitmap_exit(); 10373 10137 return ret; 10374 10138 } 10375 10139 ··· 10388 10150 ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size)); 10389 10151 if (ret) 10390 10152 pr_info("md-cluster: resize failed\n"); 10391 - else 10153 + else if (md_bitmap_enabled(mddev, false)) 10392 10154 mddev->bitmap_ops->update_sb(mddev->bitmap); 10393 10155 } 10394 10156 ··· 10676 10438 spin_unlock(&all_mddevs_lock); 10677 10439 10678 10440 destroy_workqueue(md_misc_wq); 10679 - destroy_workqueue(md_bitmap_wq); 10680 10441 destroy_workqueue(md_wq); 10442 + md_bitmap_exit(); 10681 10443 } 10682 10444 10683 10445 subsys_initcall(md_init);
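The `bitmap_type` attribute added in this file lists `none` plus every registered bitmap implementation and puts the active one in brackets; writing a name selects that implementation. The following is a minimal user-space sketch of that behaviour, not the kernel code: it assumes a fixed table in place of the `md_submodule` xarray, and `toy_bitmap_entry`, `toy_append`, `toy_bitmap_type_show`, `toy_bitmap_type_lookup` and the `TOY_ID_*` values are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical IDs; the real values come from enum md_submodule_id. */
enum { TOY_ID_BITMAP = 1, TOY_ID_LLBITMAP = 2, TOY_ID_BITMAP_NONE = 3 };

struct toy_bitmap_entry {
	int id;
	const char *name;
};

/* Append "name " or "[name] " at offset len; no-op once the buffer is full. */
static size_t toy_append(char *buf, size_t size, size_t len,
			 const char *name, int active)
{
	int n;

	if (len >= size)
		return len;
	if (active)
		n = snprintf(buf + len, size - len, "[%s] ", name);
	else
		n = snprintf(buf + len, size - len, "%s ", name);
	return n < 0 ? len : len + (size_t)n;
}

/* Model of the show side: list "none" plus every entry, bracket the active one. */
static size_t toy_bitmap_type_show(char *buf, size_t size, int active_id,
				   const struct toy_bitmap_entry *tbl, size_t n)
{
	size_t len, i;

	assert(buf != NULL && size > 0);
	len = toy_append(buf, size, 0, "none", active_id == TOY_ID_BITMAP_NONE);
	for (i = 0; i < n; i++)
		len = toy_append(buf, size, len, tbl[i].name,
				 tbl[i].id == active_id);
	if (len + 1 < size) {
		buf[len++] = '\n';
		buf[len] = '\0';
	}
	return len;
}

/*
 * Model of the store side's name lookup. Returns -1 where the kernel code
 * would go on to try a numeric ID and finally return -ENOENT.
 */
static int toy_bitmap_type_lookup(const char *name,
				  const struct toy_bitmap_entry *tbl, size_t n)
{
	size_t i;

	if (strcmp(name, "none") == 0)
		return TOY_ID_BITMAP_NONE;
	for (i = 0; i < n; i++)
		if (strcmp(name, tbl[i].name) == 0)
			return tbl[i].id;
	return -1;
}
```

With a table holding `bitmap` and `llbitmap` and `llbitmap` active, `toy_bitmap_type_show()` fills the buffer with `none bitmap [llbitmap] ` followed by a newline. The sketch leaves out the xarray locking and the `-EBUSY` check on `mddev->bitmap_ops` that the real store callback performs.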

+17 -7

drivers/md/md.h

···
 enum md_submodule_type {
 	MD_PERSONALITY = 0,
 	MD_CLUSTER,
-	MD_BITMAP, /* TODO */
+	MD_BITMAP,
 };
 
 enum md_submodule_id {
···
 	ID_RAID6 = 6,
 	ID_RAID10 = 10,
 	ID_CLUSTER,
-	ID_BITMAP, /* TODO */
-	ID_LLBITMAP, /* TODO */
+	ID_BITMAP,
+	ID_LLBITMAP,
+	ID_BITMAP_NONE,
 };
 
 struct md_submodule_head {
···
 	struct percpu_ref writes_pending;
 	int sync_checkers; /* # of threads checking writes_pending */
 
+	enum md_submodule_id bitmap_id;
 	void *bitmap; /* the bitmap for the device */
 	struct bitmap_operations *bitmap_ops;
 	struct {
···
 	MD_RECOVERY_RESHAPE,
 	/* remote node is running resync thread */
 	MD_RESYNCING_REMOTE,
+	/* raid456 lazy initial recover */
+	MD_RECOVERY_LAZY_RECOVER,
 };
 
 enum md_ro_state {
···
 	ssize_t (*show)(struct mddev *, char *);
 	ssize_t (*store)(struct mddev *, const char *, size_t);
 };
-extern const struct attribute_group md_bitmap_group;
 
 static inline struct kernfs_node *sysfs_get_dirent_safe(struct kernfs_node *sd, char *name)
 {
···
 	unsigned long start_time;
 	sector_t offset;
 	unsigned long sectors;
+	enum stat_group rw;
 	struct bio bio_clone;
 };
 
···
 void md_free_cloned_bio(struct bio *bio);
 
 extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
-extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
-			   sector_t sector, int size, struct page *page);
+void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
+		       sector_t sector, int size, struct page *page,
+		       unsigned int offset);
 extern int md_super_wait(struct mddev *mddev);
 extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 		struct page *page, blk_opf_t opf, bool metadata_op);
···
 struct mdu_disk_info_s;
 
 extern int mdp_major;
-extern struct workqueue_struct *md_bitmap_wq;
 void md_autostart_arrays(int part);
 int md_set_array_info(struct mddev *mddev, struct mdu_array_info_s *info);
 int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info);
···
 static inline bool mddev_is_dm(struct mddev *mddev)
 {
 	return !mddev->gendisk;
+}
+
+static inline bool raid_is_456(struct mddev *mddev)
+{
+	return mddev->level == ID_RAID4 || mddev->level == ID_RAID5 ||
+	       mddev->level == ID_RAID6;
 }
 
 static inline void mddev_trace_remap(struct mddev *mddev, struct bio *bio,
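The `raid_is_456()` helper added to this header replaces the open-coded level check in `sync_io_within_limit()` in md.c, where sync I/O is counted as 8 sectors (4k) per unit for raid4/5/6 and 128 sectors (64k) otherwise. A minimal user-space sketch of that arithmetic follows. It assumes `ID_RAID4` and `ID_RAID5` equal 4 and 5, matching the literals in the removed code, and the plain `int` parameters and `sync_io_limit_sectors()` are hypothetical stand-ins for `struct mddev` and `sync_io_depth()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Level IDs; 4, 5 and 6 mirror the literals used by the removed code. */
enum { ID_RAID4 = 4, ID_RAID5 = 5, ID_RAID6 = 6, ID_RAID10 = 10 };

static bool raid_is_456(int level)
{
	return level == ID_RAID4 || level == ID_RAID5 || level == ID_RAID6;
}

/*
 * Sectors of sync I/O allowed in flight: 8 sectors per unit for raid4/5/6,
 * 128 sectors per unit for every other level, scaled by the I/O depth.
 */
static int sync_io_limit_sectors(int level, int depth)
{
	assert(depth >= 0);
	return (raid_is_456(level) ? 8 : 128) * depth;
}

/* True while the number of active sync sectors is still below the limit. */
static bool sync_io_within_limit(int level, int depth, int active_sectors)
{
	return active_sectors < sync_io_limit_sectors(level, depth);
}
```

For a depth of 32 this gives a window of 256 sectors on raid5 and 4096 sectors on raid1.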

+9 -21

drivers/md/raid0.c

···
 	zone = find_zone(conf, &start);
 
 	if (bio_end_sector(bio) > zone->zone_end) {
-		struct bio *split = bio_split(bio,
-			zone->zone_end - bio->bi_iter.bi_sector, GFP_NOIO,
-			&mddev->bio_set);
-
-		if (IS_ERR(split)) {
-			bio->bi_status = errno_to_blk_status(PTR_ERR(split));
-			bio_endio(bio);
+		bio = bio_submit_split_bioset(bio,
+				zone->zone_end - bio->bi_iter.bi_sector,
+				&mddev->bio_set);
+		if (!bio)
 			return;
-		}
-		bio_chain(split, bio);
-		submit_bio_noacct(bio);
-		bio = split;
+
 		end = zone->zone_end;
-	} else
+	} else {
 		end = bio_end_sector(bio);
+	}
 
 	orig_end = end;
 	if (zone != conf->strip_zone)
···
 		 : sector_div(sector, chunk_sects));
 
 	if (sectors < bio_sectors(bio)) {
-		struct bio *split = bio_split(bio, sectors, GFP_NOIO,
+		bio = bio_submit_split_bioset(bio, sectors,
 					      &mddev->bio_set);
-
-		if (IS_ERR(split)) {
-			bio->bi_status = errno_to_blk_status(PTR_ERR(split));
-			bio_endio(bio);
+		if (!bio)
 			return true;
-		}
-		bio_chain(split, bio);
-		raid0_map_submit_bio(mddev, bio);
-		bio = split;
 	}
 
 	raid0_map_submit_bio(mddev, bio);
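Both hunks above replace an open-coded split, chain and submit sequence with a single call that either returns the front piece to keep working on or returns NULL after the original bio has already been completed with an error. The sketch below models only that calling convention in user space. `toy_bio` and `toy_submit_split()` are hypothetical, the caller supplies storage for the front piece instead of a bioset, and passing NULL for it stands in for an allocation failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio: a run of sectors plus bookkeeping. */
struct toy_bio {
	unsigned long long sector;	/* first sector */
	unsigned int sectors;		/* length in sectors */
	int status;			/* 0 = OK, negative = error */
	int submitted;			/* set once handed to the lower layer */
	int ended;			/* set once completed */
};

/*
 * Carve the first split_sectors off bio into front, mark the shrunken
 * remainder as submitted, and return front for the caller to continue
 * with. On failure the original bio is completed with an error and NULL
 * is returned, so the caller only has to bail out.
 */
static struct toy_bio *toy_submit_split(struct toy_bio *bio,
					unsigned int split_sectors,
					struct toy_bio *front)
{
	assert(bio != NULL);
	if (front == NULL || split_sectors == 0 ||
	    split_sectors >= bio->sectors) {
		bio->status = -1;
		bio->ended = 1;
		return NULL;
	}

	front->sector = bio->sector;
	front->sectors = split_sectors;
	front->status = 0;
	front->submitted = 0;
	front->ended = 0;

	/* The original bio now describes only the remaining sectors. */
	bio->sector += split_sectors;
	bio->sectors -= split_sectors;
	bio->submitted = 1;
	return front;
}
```

The design point this illustrates is that error completion moves into the helper, which is why the callers in the diff shrink to a NULL check.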

+1 -1

drivers/md/raid1-10.c

···
 	 * If bitmap is not enabled, it's safe to submit the io directly, and
 	 * this can get optimal performance.
 	 */
-	if (!mddev->bitmap_ops->enabled(mddev)) {
+	if (!md_bitmap_enabled(mddev, true)) {
 		raid1_submit_write(bio);
 		return true;
 	}

+61 -58

drivers/md/raid1.c

··· 167 167 bio = bio_kmalloc(RESYNC_PAGES, gfp_flags); 168 168 if (!bio) 169 169 goto out_free_bio; 170 - bio_init(bio, NULL, bio->bi_inline_vecs, RESYNC_PAGES, 0); 170 + bio_init_inline(bio, NULL, RESYNC_PAGES, 0); 171 171 r1_bio->bios[j] = bio; 172 172 } 173 173 /* ··· 1317 1317 struct raid1_info *mirror; 1318 1318 struct bio *read_bio; 1319 1319 int max_sectors; 1320 - int rdisk, error; 1320 + int rdisk; 1321 1321 bool r1bio_existed = !!r1_bio; 1322 1322 1323 1323 /* ··· 1366 1366 (unsigned long long)r1_bio->sector, 1367 1367 mirror->rdev->bdev); 1368 1368 1369 - if (test_bit(WriteMostly, &mirror->rdev->flags)) { 1369 + if (test_bit(WriteMostly, &mirror->rdev->flags) && 1370 + md_bitmap_enabled(mddev, false)) { 1370 1371 /* 1371 1372 * Reading from a write-mostly device must take care not to 1372 1373 * over-take any writes that are 'behind' ··· 1377 1376 } 1378 1377 1379 1378 if (max_sectors < bio_sectors(bio)) { 1380 - struct bio *split = bio_split(bio, max_sectors, 1381 - gfp, &conf->bio_split); 1382 - 1383 - if (IS_ERR(split)) { 1384 - error = PTR_ERR(split); 1379 + bio = bio_submit_split_bioset(bio, max_sectors, 1380 + &conf->bio_split); 1381 + if (!bio) { 1382 + set_bit(R1BIO_Returned, &r1_bio->state); 1385 1383 goto err_handle; 1386 1384 } 1387 - bio_chain(split, bio); 1388 - submit_bio_noacct(bio); 1389 - bio = split; 1385 + 1390 1386 r1_bio->master_bio = bio; 1391 1387 r1_bio->sectors = max_sectors; 1392 1388 } ··· 1411 1413 1412 1414 err_handle: 1413 1415 atomic_dec(&mirror->rdev->nr_pending); 1414 - bio->bi_status = errno_to_blk_status(error); 1415 - set_bit(R1BIO_Uptodate, &r1_bio->state); 1416 1416 raid_end_bio_io(r1_bio); 1417 1417 } 1418 1418 ··· 1448 1452 return true; 1449 1453 } 1450 1454 1455 + static void raid1_start_write_behind(struct mddev *mddev, struct r1bio *r1_bio, 1456 + struct bio *bio) 1457 + { 1458 + unsigned long max_write_behind = mddev->bitmap_info.max_write_behind; 1459 + struct md_bitmap_stats stats; 1460 + int err; 1461 + 
1462 + /* behind write rely on bitmap, see bitmap_operations */ 1463 + if (!md_bitmap_enabled(mddev, false)) 1464 + return; 1465 + 1466 + err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats); 1467 + if (err) 1468 + return; 1469 + 1470 + /* Don't do behind IO if reader is waiting, or there are too many. */ 1471 + if (!stats.behind_wait && stats.behind_writes < max_write_behind) 1472 + alloc_behind_master_bio(r1_bio, bio); 1473 + 1474 + if (test_bit(R1BIO_BehindIO, &r1_bio->state)) 1475 + mddev->bitmap_ops->start_behind_write(mddev); 1476 + 1477 + } 1478 + 1451 1479 static void raid1_write_request(struct mddev *mddev, struct bio *bio, 1452 1480 int max_write_sectors) 1453 1481 { 1454 1482 struct r1conf *conf = mddev->private; 1455 1483 struct r1bio *r1_bio; 1456 - int i, disks, k, error; 1484 + int i, disks, k; 1457 1485 unsigned long flags; 1458 1486 int first_clone; 1459 1487 int max_sectors; ··· 1581 1561 * complexity of supporting that is not worth 1582 1562 * the benefit. 1583 1563 */ 1584 - if (bio->bi_opf & REQ_ATOMIC) { 1585 - error = -EIO; 1564 + if (bio->bi_opf & REQ_ATOMIC) 1586 1565 goto err_handle; 1587 - } 1588 1566 1589 1567 good_sectors = first_bad - r1_bio->sector; 1590 1568 if (good_sectors < max_sectors) ··· 1602 1584 max_sectors = min_t(int, max_sectors, 1603 1585 BIO_MAX_VECS * (PAGE_SIZE >> 9)); 1604 1586 if (max_sectors < bio_sectors(bio)) { 1605 - struct bio *split = bio_split(bio, max_sectors, 1606 - GFP_NOIO, &conf->bio_split); 1607 - 1608 - if (IS_ERR(split)) { 1609 - error = PTR_ERR(split); 1587 + bio = bio_submit_split_bioset(bio, max_sectors, 1588 + &conf->bio_split); 1589 + if (!bio) { 1590 + set_bit(R1BIO_Returned, &r1_bio->state); 1610 1591 goto err_handle; 1611 1592 } 1612 - bio_chain(split, bio); 1613 - submit_bio_noacct(bio); 1614 - bio = split; 1593 + 1615 1594 r1_bio->master_bio = bio; 1616 1595 r1_bio->sectors = max_sectors; 1617 1596 } ··· 1627 1612 continue; 1628 1613 1629 1614 if (first_clone) { 1630 - unsigned long 
max_write_behind = 1631 - mddev->bitmap_info.max_write_behind; 1632 - struct md_bitmap_stats stats; 1633 - int err; 1634 - 1635 - /* do behind I/O ? 1636 - * Not if there are too many, or cannot 1637 - * allocate memory, or a reader on WriteMostly 1638 - * is waiting for behind writes to flush */ 1639 - err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats); 1640 - if (!err && write_behind && !stats.behind_wait && 1641 - stats.behind_writes < max_write_behind) 1642 - alloc_behind_master_bio(r1_bio, bio); 1643 - 1644 - if (test_bit(R1BIO_BehindIO, &r1_bio->state)) 1645 - mddev->bitmap_ops->start_behind_write(mddev); 1615 + if (write_behind) 1616 + raid1_start_write_behind(mddev, r1_bio, bio); 1646 1617 first_clone = 0; 1647 1618 } 1648 1619 ··· 1684 1683 } 1685 1684 } 1686 1685 1687 - bio->bi_status = errno_to_blk_status(error); 1688 - set_bit(R1BIO_Uptodate, &r1_bio->state); 1689 1686 raid_end_bio_io(r1_bio); 1690 1687 } 1691 1688 ··· 2056 2057 2057 2058 /* make sure these bits don't get cleared. */ 2058 2059 do { 2059 - mddev->bitmap_ops->end_sync(mddev, s, &sync_blocks); 2060 + md_bitmap_end_sync(mddev, s, &sync_blocks); 2060 2061 s += sync_blocks; 2061 2062 sectors_to_go -= sync_blocks; 2062 2063 } while (sectors_to_go > 0); ··· 2803 2804 * We can find the current addess in mddev->curr_resync 2804 2805 */ 2805 2806 if (mddev->curr_resync < max_sector) /* aborted */ 2806 - mddev->bitmap_ops->end_sync(mddev, mddev->curr_resync, 2807 - &sync_blocks); 2807 + md_bitmap_end_sync(mddev, mddev->curr_resync, 2808 + &sync_blocks); 2808 2809 else /* completed sync */ 2809 2810 conf->fullsync = 0; 2810 2811 2811 - mddev->bitmap_ops->close_sync(mddev); 2812 + if (md_bitmap_enabled(mddev, false)) 2813 + mddev->bitmap_ops->close_sync(mddev); 2812 2814 close_sync(conf); 2813 2815 2814 2816 if (mddev_is_clustered(mddev)) { ··· 2829 2829 /* before building a request, check if we can skip these blocks.. 
2830 2830 * This call the bitmap_start_sync doesn't actually record anything 2831 2831 */ 2832 - if (!mddev->bitmap_ops->start_sync(mddev, sector_nr, &sync_blocks, true) && 2832 + if (!md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, true) && 2833 2833 !conf->fullsync && !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { 2834 2834 /* We can skip this block, and probably several more */ 2835 2835 *skipped = 1; ··· 2846 2846 /* we are incrementing sector_nr below. To be safe, we check against 2847 2847 * sector_nr + two times RESYNC_SECTORS 2848 2848 */ 2849 - 2850 - mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, 2851 - mddev_is_clustered(mddev) && 2852 - (sector_nr + 2 * RESYNC_SECTORS > conf->cluster_sync_high)); 2849 + if (md_bitmap_enabled(mddev, false)) 2850 + mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, 2851 + mddev_is_clustered(mddev) && 2852 + (sector_nr + 2 * RESYNC_SECTORS > 2853 + conf->cluster_sync_high)); 2853 2854 2854 2855 if (raise_barrier(conf, sector_nr)) 2855 2856 return 0; ··· 3005 3004 if (len == 0) 3006 3005 break; 3007 3006 if (sync_blocks == 0) { 3008 - if (!mddev->bitmap_ops->start_sync(mddev, sector_nr, 3009 - &sync_blocks, still_degraded) && 3007 + if (!md_bitmap_start_sync(mddev, sector_nr, 3008 + &sync_blocks, still_degraded) && 3010 3009 !conf->fullsync && 3011 3010 !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) 3012 3011 break; ··· 3326 3325 * worth it. 
3327 3326 */ 3328 3327 sector_t newsize = raid1_size(mddev, sectors, 0); 3329 - int ret; 3330 3328 3331 3329 if (mddev->external_size && 3332 3330 mddev->array_sectors > newsize) 3333 3331 return -EINVAL; 3334 3332 3335 - ret = mddev->bitmap_ops->resize(mddev, newsize, 0, false); 3336 - if (ret) 3337 - return ret; 3333 + if (md_bitmap_enabled(mddev, false)) { 3334 + int ret = mddev->bitmap_ops->resize(mddev, newsize, 0); 3335 + 3336 + if (ret) 3337 + return ret; 3338 + } 3338 3339 3339 3340 md_set_array_sectors(mddev, newsize); 3340 3341 if (sectors > mddev->dev_sectors &&
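The new `raid1_start_write_behind()` helper in this file gathers the write-behind decision into one place: skip it when no bitmap is enabled, and otherwise allow it only if no reader is waiting and the number of outstanding behind writes is below `max_write_behind`. A minimal sketch of that predicate follows. `toy_bitmap_stats` and `may_start_write_behind()` are hypothetical names, and the sketch does not model a failing `get_stats()` call or a failed `alloc_behind_master_bio()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Subset of the bitmap statistics consulted on the write path. */
struct toy_bitmap_stats {
	unsigned long behind_writes;	/* behind writes currently in flight */
	bool behind_wait;		/* a reader is waiting for them to drain */
};

/*
 * Write-behind depends on the bitmap, so it is refused outright when the
 * bitmap is not enabled. Otherwise it is allowed only when no reader is
 * waiting and the in-flight count is below the configured maximum.
 */
static bool may_start_write_behind(bool bitmap_enabled,
				   const struct toy_bitmap_stats *stats,
				   unsigned long max_write_behind)
{
	if (!bitmap_enabled)
		return false;
	assert(stats != NULL);
	return !stats->behind_wait && stats->behind_writes < max_write_behind;
}
```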

+3 -1

drivers/md/raid1.h

···
  * any write was successful. Otherwise we call when
  * any write-behind write succeeds, otherwise we call
  * with failure when last write completes (and all failed).
- * Record that bi_end_io was called with this flag...
+ *
+ * And for bio_split errors, record that bi_end_io was called
+ * with this flag...
  */
 	R1BIO_Returned,
 /* If a write for this request means we can clear some

+51 -56

drivers/md/raid10.c

··· 163 163 bio = bio_kmalloc(RESYNC_PAGES, gfp_flags); 164 164 if (!bio) 165 165 goto out_free_bio; 166 - bio_init(bio, NULL, bio->bi_inline_vecs, RESYNC_PAGES, 0); 166 + bio_init_inline(bio, NULL, RESYNC_PAGES, 0); 167 167 r10_bio->devs[j].bio = bio; 168 168 if (!conf->have_replacement) 169 169 continue; 170 170 bio = bio_kmalloc(RESYNC_PAGES, gfp_flags); 171 171 if (!bio) 172 172 goto out_free_bio; 173 - bio_init(bio, NULL, bio->bi_inline_vecs, RESYNC_PAGES, 0); 173 + bio_init_inline(bio, NULL, RESYNC_PAGES, 0); 174 174 r10_bio->devs[j].repl_bio = bio; 175 175 } 176 176 /* ··· 322 322 struct bio *bio = r10_bio->master_bio; 323 323 struct r10conf *conf = r10_bio->mddev->private; 324 324 325 - if (!test_bit(R10BIO_Uptodate, &r10_bio->state)) 326 - bio->bi_status = BLK_STS_IOERR; 325 + if (!test_and_set_bit(R10BIO_Returned, &r10_bio->state)) { 326 + if (!test_bit(R10BIO_Uptodate, &r10_bio->state)) 327 + bio->bi_status = BLK_STS_IOERR; 328 + bio_endio(bio); 329 + } 327 330 328 - bio_endio(bio); 329 331 /* 330 332 * Wake up any possible resync thread that waits for the device 331 333 * to go idle. 
··· 1156 1154 int slot = r10_bio->read_slot; 1157 1155 struct md_rdev *err_rdev = NULL; 1158 1156 gfp_t gfp = GFP_NOIO; 1159 - int error; 1160 1157 1161 1158 if (slot >= 0 && r10_bio->devs[slot].rdev) { 1162 1159 /* ··· 1204 1203 rdev->bdev, 1205 1204 (unsigned long long)r10_bio->sector); 1206 1205 if (max_sectors < bio_sectors(bio)) { 1207 - struct bio *split = bio_split(bio, max_sectors, 1208 - gfp, &conf->bio_split); 1209 - if (IS_ERR(split)) { 1210 - error = PTR_ERR(split); 1206 + allow_barrier(conf); 1207 + bio = bio_submit_split_bioset(bio, max_sectors, 1208 + &conf->bio_split); 1209 + wait_barrier(conf, false); 1210 + if (!bio) { 1211 + set_bit(R10BIO_Returned, &r10_bio->state); 1211 1212 goto err_handle; 1212 1213 } 1213 - bio_chain(split, bio); 1214 - allow_barrier(conf); 1215 - submit_bio_noacct(bio); 1216 - wait_barrier(conf, false); 1217 - bio = split; 1214 + 1218 1215 r10_bio->master_bio = bio; 1219 1216 r10_bio->sectors = max_sectors; 1220 1217 } ··· 1240 1241 return; 1241 1242 err_handle: 1242 1243 atomic_dec(&rdev->nr_pending); 1243 - bio->bi_status = errno_to_blk_status(error); 1244 - set_bit(R10BIO_Uptodate, &r10_bio->state); 1245 1244 raid_end_bio_io(r10_bio); 1246 1245 } 1247 1246 ··· 1348 1351 int i, k; 1349 1352 sector_t sectors; 1350 1353 int max_sectors; 1351 - int error; 1352 1354 1353 1355 if ((mddev_is_clustered(mddev) && 1354 1356 mddev->cluster_ops->area_resyncing(mddev, WRITE, ··· 1461 1465 * complexity of supporting that is not worth 1462 1466 * the benefit. 
1463 1467 */ 1464 - if (bio->bi_opf & REQ_ATOMIC) { 1465 - error = -EIO; 1468 + if (bio->bi_opf & REQ_ATOMIC) 1466 1469 goto err_handle; 1467 - } 1468 1470 1469 1471 good_sectors = first_bad - dev_sector; 1470 1472 if (good_sectors < max_sectors) ··· 1483 1489 r10_bio->sectors = max_sectors; 1484 1490 1485 1491 if (r10_bio->sectors < bio_sectors(bio)) { 1486 - struct bio *split = bio_split(bio, r10_bio->sectors, 1487 - GFP_NOIO, &conf->bio_split); 1488 - if (IS_ERR(split)) { 1489 - error = PTR_ERR(split); 1492 + allow_barrier(conf); 1493 + bio = bio_submit_split_bioset(bio, r10_bio->sectors, 1494 + &conf->bio_split); 1495 + wait_barrier(conf, false); 1496 + if (!bio) { 1497 + set_bit(R10BIO_Returned, &r10_bio->state); 1490 1498 goto err_handle; 1491 1499 } 1492 - bio_chain(split, bio); 1493 - allow_barrier(conf); 1494 - submit_bio_noacct(bio); 1495 - wait_barrier(conf, false); 1496 - bio = split; 1500 + 1497 1501 r10_bio->master_bio = bio; 1498 1502 } 1499 1503 ··· 1523 1531 } 1524 1532 } 1525 1533 1526 - bio->bi_status = errno_to_blk_status(error); 1527 - set_bit(R10BIO_Uptodate, &r10_bio->state); 1528 1534 raid_end_bio_io(r10_bio); 1529 1535 } 1530 1536 ··· 1669 1679 bio_endio(bio); 1670 1680 return 0; 1671 1681 } 1682 + 1672 1683 bio_chain(split, bio); 1684 + trace_block_split(split, bio->bi_iter.bi_sector); 1673 1685 allow_barrier(conf); 1674 1686 /* Resend the fist split part */ 1675 1687 submit_bio_noacct(split); ··· 1686 1694 bio_endio(bio); 1687 1695 return 0; 1688 1696 } 1697 + 1689 1698 bio_chain(split, bio); 1699 + trace_block_split(split, bio->bi_iter.bi_sector); 1690 1700 allow_barrier(conf); 1691 1701 /* Resend the second split part */ 1692 1702 submit_bio_noacct(bio); ··· 3215 3221 3216 3222 if (mddev->curr_resync < max_sector) { /* aborted */ 3217 3223 if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) 3218 - mddev->bitmap_ops->end_sync(mddev, 3219 - mddev->curr_resync, 3220 - &sync_blocks); 3224 + md_bitmap_end_sync(mddev, mddev->curr_resync, 3225 + 
&sync_blocks); 3221 3226 else for (i = 0; i < conf->geo.raid_disks; i++) { 3222 3227 sector_t sect = 3223 3228 raid10_find_virt(conf, mddev->curr_resync, i); 3224 3229 3225 - mddev->bitmap_ops->end_sync(mddev, sect, 3226 - &sync_blocks); 3230 + md_bitmap_end_sync(mddev, sect, &sync_blocks); 3227 3231 } 3228 3232 } else { 3229 3233 /* completed sync */ ··· 3241 3249 } 3242 3250 conf->fullsync = 0; 3243 3251 } 3244 - mddev->bitmap_ops->close_sync(mddev); 3252 + if (md_bitmap_enabled(mddev, false)) 3253 + mddev->bitmap_ops->close_sync(mddev); 3245 3254 close_sync(conf); 3246 3255 *skipped = 1; 3247 3256 return sectors_skipped; ··· 3344 3351 * we only need to recover the block if it is set in 3345 3352 * the bitmap 3346 3353 */ 3347 - must_sync = mddev->bitmap_ops->start_sync(mddev, sect, 3348 - &sync_blocks, 3349 - true); 3354 + must_sync = md_bitmap_start_sync(mddev, sect, 3355 + &sync_blocks, true); 3350 3356 if (sync_blocks < max_sync) 3351 3357 max_sync = sync_blocks; 3352 3358 if (!must_sync && ··· 3388 3396 } 3389 3397 } 3390 3398 3391 - must_sync = mddev->bitmap_ops->start_sync(mddev, sect, 3392 - &sync_blocks, still_degraded); 3393 - 3399 + md_bitmap_start_sync(mddev, sect, &sync_blocks, 3400 + still_degraded); 3394 3401 any_working = 0; 3395 3402 for (j=0; j<conf->copies;j++) { 3396 3403 int k; ··· 3561 3570 * safety reason, which ensures curr_resync_completed is 3562 3571 * updated in bitmap_cond_end_sync. 
3563 3572 */ 3564 - mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, 3573 + if (md_bitmap_enabled(mddev, false)) 3574 + mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, 3565 3575 mddev_is_clustered(mddev) && 3566 3576 (sector_nr + 2 * RESYNC_SECTORS > conf->cluster_sync_high)); 3567 3577 3568 - if (!mddev->bitmap_ops->start_sync(mddev, sector_nr, 3569 - &sync_blocks, 3570 - mddev->degraded) && 3578 + if (!md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, 3579 + mddev->degraded) && 3571 3580 !conf->fullsync && !test_bit(MD_RECOVERY_REQUESTED, 3572 3581 &mddev->recovery)) { 3573 3582 /* We can skip this block */ ··· 4217 4226 */ 4218 4227 struct r10conf *conf = mddev->private; 4219 4228 sector_t oldsize, size; 4220 - int ret; 4221 4229 4222 4230 if (mddev->reshape_position != MaxSector) 4223 4231 return -EBUSY; ··· 4230 4240 mddev->array_sectors > size) 4231 4241 return -EINVAL; 4232 4242 4233 - ret = mddev->bitmap_ops->resize(mddev, size, 0, false); 4234 - if (ret) 4235 - return ret; 4243 + if (md_bitmap_enabled(mddev, false)) { 4244 + int ret = mddev->bitmap_ops->resize(mddev, size, 0); 4245 + 4246 + if (ret) 4247 + return ret; 4248 + } 4236 4249 4237 4250 md_set_array_sectors(mddev, size); 4238 4251 if (sectors > mddev->dev_sectors && ··· 4501 4508 oldsize = raid10_size(mddev, 0, 0); 4502 4509 newsize = raid10_size(mddev, 0, conf->geo.raid_disks); 4503 4510 4504 - if (!mddev_is_clustered(mddev)) { 4505 - ret = mddev->bitmap_ops->resize(mddev, newsize, 0, false); 4511 + if (!mddev_is_clustered(mddev) && 4512 + md_bitmap_enabled(mddev, false)) { 4513 + ret = mddev->bitmap_ops->resize(mddev, newsize, 0); 4506 4514 if (ret) 4507 4515 goto abort; 4508 4516 else ··· 4525 4531 MD_FEATURE_RESHAPE_ACTIVE)) || (oldsize == newsize)) 4526 4532 goto out; 4527 4533 4528 - ret = mddev->bitmap_ops->resize(mddev, newsize, 0, false); 4534 + /* cluster can't be setup without bitmap */ 4535 + ret = mddev->bitmap_ops->resize(mddev, newsize, 0); 4529 4536 if (ret) 4530 4537 goto 
abort; 4531 4538 4532 4539 ret = mddev->cluster_ops->resize_bitmaps(mddev, newsize, oldsize); 4533 4540 if (ret) { 4534 - mddev->bitmap_ops->resize(mddev, oldsize, 0, false); 4541 + mddev->bitmap_ops->resize(mddev, oldsize, 0); 4535 4542 goto abort; 4536 4543 } 4537 4544 }

+2

drivers/md/raid10.h

··· 165 165 * so that raid10d knows what to do with them. 166 166 */ 167 167 R10BIO_ReadError, 168 + /* For bio_split errors, record that bi_end_io was called. */ 169 + R10BIO_Returned, 168 170 /* If a write for this request means we can clear some 169 171 * known-bad-block records, we set this flag. 170 172 */

+47 -27

drivers/md/raid5.c

··· 4097 4097 int disks) 4098 4098 { 4099 4099 int rmw = 0, rcw = 0, i; 4100 - sector_t resync_offset = conf->mddev->resync_offset; 4100 + struct mddev *mddev = conf->mddev; 4101 + sector_t resync_offset = mddev->resync_offset; 4101 4102 4102 4103 /* Check whether resync is now happening or should start. 4103 4104 * If yes, then the array is dirty (after unclean shutdown or ··· 4117 4116 pr_debug("force RCW rmw_level=%u, resync_offset=%llu sh->sector=%llu\n", 4118 4117 conf->rmw_level, (unsigned long long)resync_offset, 4119 4118 (unsigned long long)sh->sector); 4119 + } else if (mddev->bitmap_ops && mddev->bitmap_ops->blocks_synced && 4120 + !mddev->bitmap_ops->blocks_synced(mddev, sh->sector)) { 4121 + /* The initial recover is not done, must read everything */ 4122 + rcw = 1; rmw = 2; 4123 + pr_debug("force RCW by lazy recovery, sh->sector=%llu\n", 4124 + sh->sector); 4120 4125 } else for (i = disks; i--; ) { 4121 4126 /* would I have to read this buffer for read_modify_write */ 4122 4127 struct r5dev *dev = &sh->dev[i]; ··· 4155 4148 set_bit(STRIPE_HANDLE, &sh->state); 4156 4149 if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_PREFER_RMW)) && rmw > 0) { 4157 4150 /* prefer read-modify-write, but need to get some data */ 4158 - mddev_add_trace_msg(conf->mddev, "raid5 rmw %llu %d", 4151 + mddev_add_trace_msg(mddev, "raid5 rmw %llu %d", 4159 4152 sh->sector, rmw); 4160 4153 4161 4154 for (i = disks; i--; ) { ··· 4234 4227 set_bit(STRIPE_DELAYED, &sh->state); 4235 4228 } 4236 4229 } 4237 - if (rcw && !mddev_is_dm(conf->mddev)) 4238 - blk_add_trace_msg(conf->mddev->gendisk->queue, 4230 + if (rcw && !mddev_is_dm(mddev)) 4231 + blk_add_trace_msg(mddev->gendisk->queue, 4239 4232 "raid5 rcw %llu %d %d %d", 4240 4233 (unsigned long long)sh->sector, rcw, qread, 4241 4234 test_bit(STRIPE_DELAYED, &sh->state)); ··· 4705 4698 } 4706 4699 } else if (test_bit(In_sync, &rdev->flags)) 4707 4700 set_bit(R5_Insync, &dev->flags); 4708 - else if (sh->sector + 
RAID5_STRIPE_SECTORS(conf) <= rdev->recovery_offset) 4709 - /* in sync if before recovery_offset */ 4710 - set_bit(R5_Insync, &dev->flags); 4711 - else if (test_bit(R5_UPTODATE, &dev->flags) && 4701 + else if (sh->sector + RAID5_STRIPE_SECTORS(conf) <= 4702 + rdev->recovery_offset) { 4703 + /* 4704 + * in sync if: 4705 + * - normal IO, or 4706 + * - resync IO that is not lazy recovery 4707 + * 4708 + * For lazy recovery, we have to mark the rdev without 4709 + * In_sync as failed, to build initial xor data. 4710 + */ 4711 + if (!test_bit(STRIPE_SYNCING, &sh->state) || 4712 + !test_bit(MD_RECOVERY_LAZY_RECOVER, 4713 + &conf->mddev->recovery)) 4714 + set_bit(R5_Insync, &dev->flags); 4715 + } else if (test_bit(R5_UPTODATE, &dev->flags) && 4712 4716 test_bit(R5_Expanded, &dev->flags)) 4713 4717 /* If we've reshaped into here, we assume it is Insync. 4714 4718 * We will shortly update recovery_offset to make ··· 5486 5468 5487 5469 static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio) 5488 5470 { 5489 - struct bio *split; 5490 5471 sector_t sector = raid_bio->bi_iter.bi_sector; 5491 5472 unsigned chunk_sects = mddev->chunk_sectors; 5492 5473 unsigned sectors = chunk_sects - (sector & (chunk_sects-1)); 5493 5474 5494 5475 if (sectors < bio_sectors(raid_bio)) { 5495 5476 struct r5conf *conf = mddev->private; 5496 - split = bio_split(raid_bio, sectors, GFP_NOIO, &conf->bio_split); 5497 - bio_chain(split, raid_bio); 5498 - submit_bio_noacct(raid_bio); 5499 - raid_bio = split; 5477 + 5478 + raid_bio = bio_submit_split_bioset(raid_bio, sectors, 5479 + &conf->bio_split); 5480 + if (!raid_bio) 5481 + return NULL; 5500 5482 } 5501 5483 5502 5484 if (!raid5_read_one_chunk(mddev, raid_bio)) ··· 6510 6492 } 6511 6493 6512 6494 if (mddev->curr_resync < max_sector) /* aborted */ 6513 - mddev->bitmap_ops->end_sync(mddev, mddev->curr_resync, 6514 - &sync_blocks); 6495 + md_bitmap_end_sync(mddev, mddev->curr_resync, 6496 + &sync_blocks); 6515 6497 else /* 
completed sync */ 6516 6498 conf->fullsync = 0; 6517 - mddev->bitmap_ops->close_sync(mddev); 6499 + if (md_bitmap_enabled(mddev, false)) 6500 + mddev->bitmap_ops->close_sync(mddev); 6518 6501 6519 6502 return 0; 6520 6503 } ··· 6544 6525 } 6545 6526 if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) && 6546 6527 !conf->fullsync && 6547 - !mddev->bitmap_ops->start_sync(mddev, sector_nr, &sync_blocks, 6548 - true) && 6528 + !md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, true) && 6549 6529 sync_blocks >= RAID5_STRIPE_SECTORS(conf)) { 6550 6530 /* we can skip this block, and probably more */ 6551 6531 do_div(sync_blocks, RAID5_STRIPE_SECTORS(conf)); ··· 6553 6535 return sync_blocks * RAID5_STRIPE_SECTORS(conf); 6554 6536 } 6555 6537 6556 - mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, false); 6538 + if (md_bitmap_enabled(mddev, false)) 6539 + mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, false); 6557 6540 6558 6541 sh = raid5_get_active_stripe(conf, NULL, sector_nr, 6559 6542 R5_GAS_NOBLOCK); ··· 6576 6557 still_degraded = true; 6577 6558 } 6578 6559 6579 - mddev->bitmap_ops->start_sync(mddev, sector_nr, &sync_blocks, 6580 - still_degraded); 6581 - 6560 + md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, still_degraded); 6582 6561 set_bit(STRIPE_SYNC_REQUESTED, &sh->state); 6583 6562 set_bit(STRIPE_HANDLE, &sh->state); 6584 6563 ··· 6780 6763 /* Now is a good time to flush some bitmap updates */ 6781 6764 conf->seq_flush++; 6782 6765 spin_unlock_irq(&conf->device_lock); 6783 - mddev->bitmap_ops->unplug(mddev, true); 6766 + if (md_bitmap_enabled(mddev, true)) 6767 + mddev->bitmap_ops->unplug(mddev, true); 6784 6768 spin_lock_irq(&conf->device_lock); 6785 6769 conf->seq_write = conf->seq_flush; 6786 6770 activate_bit_delay(conf, conf->temp_inactive_list); ··· 8331 8313 */ 8332 8314 sector_t newsize; 8333 8315 struct r5conf *conf = mddev->private; 8334 - int ret; 8335 8316 8336 8317 if (raid5_has_log(conf) || raid5_has_ppl(conf)) 8337 8318 return 
-EINVAL; ··· 8340 8323 mddev->array_sectors > newsize) 8341 8324 return -EINVAL; 8342 8325 8343 - ret = mddev->bitmap_ops->resize(mddev, sectors, 0, false); 8344 - if (ret) 8345 - return ret; 8326 + if (md_bitmap_enabled(mddev, false)) { 8327 + int ret = mddev->bitmap_ops->resize(mddev, sectors, 0); 8328 + 8329 + if (ret) 8330 + return ret; 8331 + } 8346 8332 8347 8333 md_set_array_sectors(mddev, newsize); 8348 8334 if (sectors > mddev->dev_sectors &&
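
The raid5.c and raid10.c hunks above repeat one pattern: direct calls through `mddev->bitmap_ops` are either wrapped in `md_bitmap_enabled()` or replaced by helpers such as `md_bitmap_start_sync()`. A minimal userspace sketch of that guard pattern follows. Every `fake_*` name is hypothetical, and the fallback behaviour (report "resync needed" with a fixed block count when no bitmap is active) is an assumption inferred from how the callers use the return value, not something this diff shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fake_mddev;

/* Hypothetical stand-in for the bitmap operations table. */
struct fake_bitmap_ops {
	bool (*enabled)(const struct fake_mddev *mddev);
	bool (*start_sync)(struct fake_mddev *mddev, uint64_t offset,
			   uint64_t *blocks, bool degraded);
};

struct fake_mddev {
	const struct fake_bitmap_ops *bitmap_ops; /* NULL when no bitmap */
};

/* Assumed fallback block count; chosen for illustration only. */
#define FAKE_NO_BITMAP_SYNC_BLOCKS 1024u

/* True only if an ops table exists and it reports an active bitmap. */
static bool fake_bitmap_enabled(const struct fake_mddev *mddev)
{
	assert(mddev != NULL);
	return mddev->bitmap_ops && mddev->bitmap_ops->enabled &&
	       mddev->bitmap_ops->enabled(mddev);
}

/*
 * Guarded wrapper: callers no longer dereference bitmap_ops themselves.
 * Without a bitmap nothing can be skipped, so report "needs sync".
 */
static bool fake_bitmap_start_sync(struct fake_mddev *mddev, uint64_t offset,
				   uint64_t *blocks, bool degraded)
{
	if (!fake_bitmap_enabled(mddev)) {
		*blocks = FAKE_NO_BITMAP_SYNC_BLOCKS;
		return true;
	}
	return mddev->bitmap_ops->start_sync(mddev, offset, blocks, degraded);
}

/* Demo ops: bitmap is active and reports every range as already clean. */
static bool fake_demo_enabled(const struct fake_mddev *mddev)
{
	(void)mddev;
	return true;
}

static bool fake_demo_start_sync(struct fake_mddev *mddev, uint64_t offset,
				 uint64_t *blocks, bool degraded)
{
	(void)mddev; (void)offset; (void)degraded;
	*blocks = 8;
	return false; /* range is clean, caller may skip it */
}

static const struct fake_bitmap_ops fake_demo_ops = {
	.enabled = fake_demo_enabled,
	.start_sync = fake_demo_start_sync,
};
```

With this shape, a caller such as the sync loop can keep its `!start_sync(...)` skip test unchanged whether or not a bitmap is configured.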

+2 -2

drivers/memstick/core/ms_block.c

··· 1953 1953 msb->card = NULL; 1954 1954 } 1955 1955 1956 - static int msb_bd_getgeo(struct block_device *bdev, 1956 + static int msb_bd_getgeo(struct gendisk *disk, 1957 1957 struct hd_geometry *geo) 1958 1958 { 1959 - struct msb_data *msb = bdev->bd_disk->private_data; 1959 + struct msb_data *msb = disk->private_data; 1960 1960 *geo = msb->geometry; 1961 1961 return 0; 1962 1962 }

+2 -2

drivers/memstick/core/mspro_block.c

··· 189 189 kfree(msb); 190 190 } 191 191 192 - static int mspro_block_bd_getgeo(struct block_device *bdev, 192 + static int mspro_block_bd_getgeo(struct gendisk *disk, 193 193 struct hd_geometry *geo) 194 194 { 195 - struct mspro_block_data *msb = bdev->bd_disk->private_data; 195 + struct mspro_block_data *msb = disk->private_data; 196 196 197 197 geo->heads = msb->heads; 198 198 geo->sectors = msb->sectors_per_track;

+1 -1

drivers/message/fusion/mptscsih.c

··· 2074 2074 * This is anyones guess quite frankly. 2075 2075 */ 2076 2076 int 2077 - mptscsih_bios_param(struct scsi_device * sdev, struct block_device *bdev, 2077 + mptscsih_bios_param(struct scsi_device * sdev, struct gendisk *unused, 2078 2078 sector_t capacity, int geom[]) 2079 2079 { 2080 2080 int heads;

+1 -1

drivers/message/fusion/mptscsih.h

··· 123 123 extern int mptscsih_dev_reset(struct scsi_cmnd * SCpnt); 124 124 extern int mptscsih_bus_reset(struct scsi_cmnd * SCpnt); 125 125 extern int mptscsih_host_reset(struct scsi_cmnd *SCpnt); 126 - extern int mptscsih_bios_param(struct scsi_device * sdev, struct block_device *bdev, sector_t capacity, int geom[]); 126 + extern int mptscsih_bios_param(struct scsi_device * sdev, struct gendisk *unused, sector_t capacity, int geom[]); 127 127 extern int mptscsih_io_done(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *r); 128 128 extern int mptscsih_taskmgmt_complete(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *r); 129 129 extern int mptscsih_scandv_complete(MPT_ADAPTER *ioc, MPT_FRAME_HDR *mf, MPT_FRAME_HDR *r);

+2 -2

drivers/mmc/core/block.c

··· 439 439 } 440 440 441 441 static int 442 - mmc_blk_getgeo(struct block_device *bdev, struct hd_geometry *geo) 442 + mmc_blk_getgeo(struct gendisk *disk, struct hd_geometry *geo) 443 443 { 444 - geo->cylinders = get_capacity(bdev->bd_disk) / (4 * 16); 444 + geo->cylinders = get_capacity(disk) / (4 * 16); 445 445 geo->heads = 4; 446 446 geo->sectors = 16; 447 447 return 0;

+2 -2

drivers/mtd/mtd_blkdevs.c

··· 246 246 blktrans_dev_put(dev); 247 247 } 248 248 249 - static int blktrans_getgeo(struct block_device *bdev, struct hd_geometry *geo) 249 + static int blktrans_getgeo(struct gendisk *disk, struct hd_geometry *geo) 250 250 { 251 - struct mtd_blktrans_dev *dev = bdev->bd_disk->private_data; 251 + struct mtd_blktrans_dev *dev = disk->private_data; 252 252 int ret = -ENXIO; 253 253 254 254 mutex_lock(&dev->lock);

+2 -2

drivers/mtd/ubi/block.c

··· 282 282 mutex_unlock(&dev->dev_mutex); 283 283 } 284 284 285 - static int ubiblock_getgeo(struct block_device *bdev, struct hd_geometry *geo) 285 + static int ubiblock_getgeo(struct gendisk *disk, struct hd_geometry *geo) 286 286 { 287 287 /* Some tools might require this information */ 288 288 geo->heads = 1; 289 289 geo->cylinders = 1; 290 - geo->sectors = get_capacity(bdev->bd_disk); 290 + geo->sectors = get_capacity(disk); 291 291 geo->start = 0; 292 292 return 0; 293 293 }

+2 -2

drivers/nvdimm/btt.c

··· 1478 1478 bio_endio(bio); 1479 1479 } 1480 1480 1481 - static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo) 1481 + static int btt_getgeo(struct gendisk *disk, struct hd_geometry *geo) 1482 1482 { 1483 1483 /* some standard values */ 1484 1484 geo->heads = 1 << 6; 1485 1485 geo->sectors = 1 << 5; 1486 - geo->cylinders = get_capacity(bd->bd_disk) >> 11; 1486 + geo->cylinders = get_capacity(disk) >> 11; 1487 1487 return 0; 1488 1488 } 1489 1489
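
The `getgeo` hunks above all make the same mechanical change: the callback now receives the `struct gendisk` directly instead of reaching it through `bdev->bd_disk`. The userspace sketch below shows the before and after shape using the "standard values" from the btt and nvme variants (64 heads, 32 sectors per track, so one cylinder is 2048 = 1 << 11 sectors). All `fake_*` types are hypothetical stand-ins, not the kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct gendisk / block_device / hd_geometry. */
struct fake_gendisk {
	uint64_t capacity; /* in 512-byte sectors */
};

struct fake_block_device {
	struct fake_gendisk *bd_disk;
};

struct fake_geometry {
	uint8_t heads;
	uint8_t sectors;
	uint16_t cylinders; /* narrow on purpose: large disks truncate here */
};

/* New-style callback: takes the disk directly. */
static int new_style_getgeo(struct fake_gendisk *disk,
			    struct fake_geometry *geo)
{
	assert(disk != NULL && geo != NULL);
	geo->heads = 1 << 6;   /* 64 */
	geo->sectors = 1 << 5; /* 32 */
	/* 64 heads * 32 sectors = 2048 sectors per cylinder */
	geo->cylinders = (uint16_t)(disk->capacity >> 11);
	return 0;
}

/* Old-style callback: same result, one extra pointer hop. */
static int old_style_getgeo(struct fake_block_device *bdev,
			    struct fake_geometry *geo)
{
	assert(bdev != NULL);
	return new_style_getgeo(bdev->bd_disk, geo);
}
```

For a 1 GiB disk (2097152 sectors) both variants report 1024 cylinders, which suggests the conversion is behaviour-preserving for drivers that only used `bdev` to find the disk.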

+66 -20

drivers/nvme/common/auth.c

··· 684 684 EXPORT_SYMBOL_GPL(nvme_auth_generate_digest); 685 685 686 686 /** 687 + * hkdf_expand_label - HKDF-Expand-Label (RFC 8846 section 7.1) 688 + * @hmac_tfm: hash context keyed with pseudorandom key 689 + * @label: ASCII label without "tls13 " prefix 690 + * @labellen: length of @label 691 + * @context: context bytes 692 + * @contextlen: length of @context 693 + * @okm: output keying material 694 + * @okmlen: length of @okm 695 + * 696 + * Build the TLS 1.3 HkdfLabel structure and invoke hkdf_expand(). 697 + * 698 + * Returns 0 on success with output keying material stored in @okm, 699 + * or a negative errno value otherwise. 700 + */ 701 + static int hkdf_expand_label(struct crypto_shash *hmac_tfm, 702 + const u8 *label, unsigned int labellen, 703 + const u8 *context, unsigned int contextlen, 704 + u8 *okm, unsigned int okmlen) 705 + { 706 + int err; 707 + u8 *info; 708 + unsigned int infolen; 709 + const char *tls13_prefix = "tls13 "; 710 + unsigned int prefixlen = strlen(tls13_prefix); 711 + 712 + if (WARN_ON(labellen > (255 - prefixlen))) 713 + return -EINVAL; 714 + if (WARN_ON(contextlen > 255)) 715 + return -EINVAL; 716 + 717 + infolen = 2 + (1 + prefixlen + labellen) + (1 + contextlen); 718 + info = kzalloc(infolen, GFP_KERNEL); 719 + if (!info) 720 + return -ENOMEM; 721 + 722 + /* HkdfLabel.Length */ 723 + put_unaligned_be16(okmlen, info); 724 + 725 + /* HkdfLabel.Label */ 726 + info[2] = prefixlen + labellen; 727 + memcpy(info + 3, tls13_prefix, prefixlen); 728 + memcpy(info + 3 + prefixlen, label, labellen); 729 + 730 + /* HkdfLabel.Context */ 731 + info[3 + prefixlen + labellen] = contextlen; 732 + memcpy(info + 4 + prefixlen + labellen, context, contextlen); 733 + 734 + err = hkdf_expand(hmac_tfm, info, infolen, okm, okmlen); 735 + kfree_sensitive(info); 736 + return err; 737 + } 738 + 739 + /** 687 740 * nvme_auth_derive_tls_psk - Derive TLS PSK 688 741 * @hmac_id: Hash function identifier 689 742 * @psk: generated input PSK ··· 768 715 { 769 
716 struct crypto_shash *hmac_tfm; 770 717 const char *hmac_name; 771 - const char *psk_prefix = "tls13 nvme-tls-psk"; 718 + const char *label = "nvme-tls-psk"; 772 719 static const char default_salt[HKDF_MAX_HASHLEN]; 773 - size_t info_len, prk_len; 774 - char *info; 720 + size_t prk_len; 721 + const char *ctx; 775 722 unsigned char *prk, *tls_key; 776 723 int ret; 777 724 ··· 811 758 if (ret) 812 759 goto out_free_prk; 813 760 814 - /* 815 - * 2 additional bytes for the length field from HDKF-Expand-Label, 816 - * 2 additional bytes for the HMAC ID, and one byte for the space 817 - * separator. 818 - */ 819 - info_len = strlen(psk_digest) + strlen(psk_prefix) + 5; 820 - info = kzalloc(info_len + 1, GFP_KERNEL); 821 - if (!info) { 761 + ctx = kasprintf(GFP_KERNEL, "%02d %s", hmac_id, psk_digest); 762 + if (!ctx) { 822 763 ret = -ENOMEM; 823 764 goto out_free_prk; 824 765 } 825 766 826 - put_unaligned_be16(psk_len, info); 827 - memcpy(info + 2, psk_prefix, strlen(psk_prefix)); 828 - sprintf(info + 2 + strlen(psk_prefix), "%02d %s", hmac_id, psk_digest); 829 - 830 767 tls_key = kzalloc(psk_len, GFP_KERNEL); 831 768 if (!tls_key) { 832 769 ret = -ENOMEM; 833 - goto out_free_info; 770 + goto out_free_ctx; 834 771 } 835 - ret = hkdf_expand(hmac_tfm, info, info_len, tls_key, psk_len); 772 + ret = hkdf_expand_label(hmac_tfm, 773 + label, strlen(label), 774 + ctx, strlen(ctx), 775 + tls_key, psk_len); 836 776 if (ret) { 837 777 kfree(tls_key); 838 - goto out_free_info; 778 + goto out_free_ctx; 839 779 } 840 780 *ret_psk = tls_key; 841 781 842 - out_free_info: 843 - kfree(info); 782 + out_free_ctx: 783 + kfree(ctx); 844 784 out_free_prk: 845 785 kfree(prk); 846 786 out_free_shash:
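
The new `hkdf_expand_label()` above assembles the TLS 1.3 `HkdfLabel` structure before calling `hkdf_expand()`. As far as I can tell, that construction is defined in RFC 8446 section 7.1; the "RFC 8846" in the kernel-doc comment looks like a digit transposition. The userspace sketch below reproduces only the byte layout so it can be inspected without a crypto API. `build_hkdf_label` is a hypothetical name, and it uses plain `calloc`/`free` where the kernel code uses `kzalloc`/`kfree_sensitive`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build the HkdfLabel byte string:
 *   uint16 length (big endian) | uint8 label_len | "tls13 " + label |
 *   uint8 context_len | context
 * Returns a calloc'd buffer (caller frees) and stores its size in
 * *infolen, or returns NULL on invalid lengths or allocation failure.
 */
static uint8_t *build_hkdf_label(const uint8_t *label, size_t labellen,
				 const uint8_t *context, size_t contextlen,
				 uint16_t okmlen, size_t *infolen)
{
	static const char prefix[] = "tls13 ";
	const size_t prefixlen = sizeof(prefix) - 1;
	uint8_t *info, *p;

	assert(infolen != NULL);

	/* Both length fields are a single byte on the wire. */
	if (labellen > 255 - prefixlen || contextlen > 255)
		return NULL;

	*infolen = 2 + (1 + prefixlen + labellen) + (1 + contextlen);
	info = calloc(1, *infolen);
	if (!info)
		return NULL;

	p = info;
	*p++ = (uint8_t)(okmlen >> 8);           /* HkdfLabel.length, high byte */
	*p++ = (uint8_t)(okmlen & 0xff);         /* HkdfLabel.length, low byte */
	*p++ = (uint8_t)(prefixlen + labellen);  /* HkdfLabel.label length */
	memcpy(p, prefix, prefixlen);
	p += prefixlen;
	if (labellen)
		memcpy(p, label, labellen);
	p += labellen;
	*p++ = (uint8_t)contextlen;              /* HkdfLabel.context length */
	if (contextlen)
		memcpy(p, context, contextlen);

	return info;
}
```

The removed code wrote the two-byte length and then the text `"tls13 nvme-tls-psk"` followed by the context string, with no one-byte length prefixes. The new helper adds those prefixes, which is what the `info[2]` and `info[3 + prefixlen + labellen]` assignments in the diff do.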

+3 -2

drivers/nvme/host/auth.c

··· 331 331 } else { 332 332 memset(chap->c2, 0, chap->hash_len); 333 333 } 334 - if (ctrl->opts->concat) 334 + if (ctrl->opts->concat) { 335 335 chap->s2 = 0; 336 - else 336 + chap->bi_directional = false; 337 + } else 337 338 chap->s2 = nvme_auth_get_seqnum(); 338 339 data->seqnum = cpu_to_le32(chap->s2); 339 340 if (chap->host_key_len) {

+17 -6

drivers/nvme/host/core.c

··· 1807 1807 nvme_ns_release(disk->private_data); 1808 1808 } 1809 1809 1810 - int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo) 1810 + int nvme_getgeo(struct gendisk *disk, struct hd_geometry *geo) 1811 1811 { 1812 1812 /* some standard values */ 1813 1813 geo->heads = 1 << 6; 1814 1814 geo->sectors = 1 << 5; 1815 - geo->cylinders = get_capacity(bdev->bd_disk) >> 11; 1815 + geo->cylinders = get_capacity(disk) >> 11; 1816 1816 return 0; 1817 1817 } 1818 1818 ··· 3167 3167 return ctrl->cntrltype == NVME_CTRL_ADMIN; 3168 3168 } 3169 3169 3170 + static inline bool nvme_is_io_ctrl(struct nvme_ctrl *ctrl) 3171 + { 3172 + return !nvme_discovery_ctrl(ctrl) && !nvme_admin_ctrl(ctrl); 3173 + } 3174 + 3170 3175 static bool nvme_validate_cntlid(struct nvme_subsystem *subsys, 3171 3176 struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) 3172 3177 { ··· 3374 3369 else 3375 3370 ctrl->max_zeroes_sectors = 0; 3376 3371 3377 - if (ctrl->subsys->subtype != NVME_NQN_NVME || 3372 + if (!nvme_is_io_ctrl(ctrl) || 3378 3373 !nvme_id_cns_ok(ctrl, NVME_ID_CNS_CS_CTRL) || 3379 3374 test_bit(NVME_CTRL_SKIP_ID_CNS_CS, &ctrl->flags)) 3380 3375 return 0; ··· 3496 3491 return -EINVAL; 3497 3492 } 3498 3493 3499 - if (!nvme_discovery_ctrl(ctrl) && ctrl->ioccsz < 4) { 3494 + if (nvme_is_io_ctrl(ctrl) && ctrl->ioccsz < 4) { 3500 3495 dev_err(ctrl->device, 3501 3496 "I/O queue command capsule supported size %d < 4\n", 3502 3497 ctrl->ioccsz); 3503 3498 return -EINVAL; 3504 3499 } 3505 3500 3506 - if (!nvme_discovery_ctrl(ctrl) && ctrl->iorcsz < 1) { 3501 + if (nvme_is_io_ctrl(ctrl) && ctrl->iorcsz < 1) { 3507 3502 dev_err(ctrl->device, 3508 3503 "I/O queue response capsule supported size %d < 1\n", 3509 3504 ctrl->iorcsz); ··· 4995 4990 * checking that they started once before, hence are reconnecting back. 
4996 4991 */ 4997 4992 if (test_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags) && 4998 - nvme_discovery_ctrl(ctrl)) 4993 + nvme_discovery_ctrl(ctrl)) { 4994 + if (!ctrl->kato) { 4995 + nvme_stop_keep_alive(ctrl); 4996 + ctrl->kato = NVME_DEFAULT_KATO; 4997 + nvme_start_keep_alive(ctrl); 4998 + } 4999 4999 nvme_change_uevent(ctrl, "NVME_EVENT=rediscover"); 5000 + } 5000 5001 5001 5002 if (ctrl->queue_count > 1) { 5002 5003 nvme_queue_scan(ctrl);
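
The core.c hunk above introduces `nvme_is_io_ctrl()` and uses it so that the I/O capsule size checks apply only to I/O controllers, not to discovery or administrative ones. A small sketch of that predicate and the checks it gates is below. The `fake_*` enum and struct are hypothetical; in particular, the real `nvme_discovery_ctrl()` is not shown in this diff, so how a controller is classified as "discovery" here is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical controller classification for illustration. */
enum fake_ctrl_type {
	FAKE_CTRL_IO,
	FAKE_CTRL_DISC,
	FAKE_CTRL_ADMIN,
};

struct fake_ctrl {
	enum fake_ctrl_type type;
	uint32_t ioccsz; /* I/O queue command capsule supported size */
	uint32_t iorcsz; /* I/O queue response capsule supported size */
};

/* An I/O controller is whatever is neither discovery nor admin. */
static bool fake_is_io_ctrl(const struct fake_ctrl *ctrl)
{
	assert(ctrl != NULL);
	return ctrl->type != FAKE_CTRL_DISC && ctrl->type != FAKE_CTRL_ADMIN;
}

/* Capsule size limits are only meaningful for I/O controllers. */
static int fake_check_capsule_sizes(const struct fake_ctrl *ctrl)
{
	if (fake_is_io_ctrl(ctrl) && ctrl->ioccsz < 4)
		return -EINVAL;
	if (fake_is_io_ctrl(ctrl) && ctrl->iorcsz < 1)
		return -EINVAL;
	return 0;
}
```

Before the change the checks were skipped only for discovery controllers, so an administrative controller reporting zero capsule sizes would have been rejected; the sketch accepts it.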

+8 -2

drivers/nvme/host/fc.c

··· 3032 3032 3033 3033 ++ctrl->ctrl.nr_reconnects; 3034 3034 3035 - if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE) 3035 + spin_lock_irqsave(&ctrl->rport->lock, flags); 3036 + if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE) { 3037 + spin_unlock_irqrestore(&ctrl->rport->lock, flags); 3036 3038 return -ENODEV; 3039 + } 3037 3040 3038 - if (nvme_fc_ctlr_active_on_rport(ctrl)) 3041 + if (nvme_fc_ctlr_active_on_rport(ctrl)) { 3042 + spin_unlock_irqrestore(&ctrl->rport->lock, flags); 3039 3043 return -ENOTUNIQ; 3044 + } 3045 + spin_unlock_irqrestore(&ctrl->rport->lock, flags); 3040 3046 3041 3047 dev_info(ctrl->ctrl.device, 3042 3048 "NVME-FC{%d}: create association : host wwpn 0x%016llx "

-5

drivers/nvme/host/ioctl.c

··· 142 142 ret = blk_rq_map_user_io(req, NULL, nvme_to_user_ptr(ubuffer), 143 143 bufflen, GFP_KERNEL, flags & NVME_IOCTL_VEC, 0, 144 144 0, rq_data_dir(req)); 145 - 146 145 if (ret) 147 146 return ret; 148 - 149 - bio = req->bio; 150 - if (bdev) 151 - bio_set_dev(bio, bdev); 152 147 153 148 if (has_metadata) { 154 149 ret = blk_rq_integrity_map_user(req, meta_buffer, meta_len);

+1 -1

drivers/nvme/host/nvme.h

··· 936 936 unsigned int issue_flags); 937 937 int nvme_identify_ns(struct nvme_ctrl *ctrl, unsigned nsid, 938 938 struct nvme_id_ns **id); 939 - int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo); 939 + int nvme_getgeo(struct gendisk *disk, struct hd_geometry *geo); 940 940 int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags); 941 941 942 942 extern const struct attribute_group *nvme_ns_attr_groups[];

+97 -91

drivers/nvme/host/pci.c

··· 172 172 u32 last_ps; 173 173 bool hmb; 174 174 struct sg_table *hmb_sgt; 175 - 176 175 mempool_t *dmavec_mempool; 177 - mempool_t *iod_meta_mempool; 178 176 179 177 /* shadow doorbell buffer support: */ 180 178 __le32 *dbbuf_dbs; ··· 259 261 260 262 /* single segment dma mapping */ 261 263 IOD_SINGLE_SEGMENT = 1U << 2, 264 + 265 + /* Metadata using non-coalesced MPTR */ 266 + IOD_SINGLE_META_SEGMENT = 1U << 5, 262 267 }; 263 268 264 269 struct nvme_dma_vec { ··· 285 284 unsigned int nr_dma_vecs; 286 285 287 286 dma_addr_t meta_dma; 288 - struct sg_table meta_sgt; 287 + unsigned int meta_total_len; 288 + struct dma_iova_state meta_dma_state; 289 289 struct nvme_sgl_desc *meta_descriptor; 290 290 }; 291 291 ··· 643 641 return nvmeq->descriptor_pools.large; 644 642 } 645 643 644 + static inline bool nvme_pci_cmd_use_meta_sgl(struct nvme_command *cmd) 645 + { 646 + return (cmd->common.flags & NVME_CMD_SGL_ALL) == NVME_CMD_SGL_METASEG; 647 + } 648 + 646 649 static inline bool nvme_pci_cmd_use_sgl(struct nvme_command *cmd) 647 650 { 648 651 return cmd->common.flags & ··· 697 690 mempool_free(iod->dma_vecs, nvmeq->dev->dmavec_mempool); 698 691 } 699 692 700 - static void nvme_free_sgls(struct request *req) 693 + static void nvme_free_sgls(struct request *req, struct nvme_sgl_desc *sge, 694 + struct nvme_sgl_desc *sg_list) 701 695 { 702 - struct nvme_iod *iod = blk_mq_rq_to_pdu(req); 703 696 struct nvme_queue *nvmeq = req->mq_hctx->driver_data; 704 - struct device *dma_dev = nvmeq->dev->dev; 705 - dma_addr_t sqe_dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr); 706 - unsigned int sqe_dma_len = le32_to_cpu(iod->cmd.common.dptr.sgl.length); 707 - struct nvme_sgl_desc *sg_list = iod->descriptors[0]; 708 697 enum dma_data_direction dir = rq_dma_dir(req); 698 + unsigned int len = le32_to_cpu(sge->length); 699 + struct device *dma_dev = nvmeq->dev->dev; 700 + unsigned int i; 709 701 710 - if (iod->nr_descriptors) { 711 - unsigned int nr_entries = sqe_dma_len / 
sizeof(*sg_list), i; 712 - 713 - for (i = 0; i < nr_entries; i++) 714 - dma_unmap_page(dma_dev, le64_to_cpu(sg_list[i].addr), 715 - le32_to_cpu(sg_list[i].length), dir); 716 - } else { 717 - dma_unmap_page(dma_dev, sqe_dma_addr, sqe_dma_len, dir); 702 + if (sge->type == (NVME_SGL_FMT_DATA_DESC << 4)) { 703 + dma_unmap_page(dma_dev, le64_to_cpu(sge->addr), len, dir); 704 + return; 718 705 } 706 + 707 + for (i = 0; i < len / sizeof(*sg_list); i++) 708 + dma_unmap_page(dma_dev, le64_to_cpu(sg_list[i].addr), 709 + le32_to_cpu(sg_list[i].length), dir); 710 + } 711 + 712 + static void nvme_unmap_metadata(struct request *req) 713 + { 714 + struct nvme_queue *nvmeq = req->mq_hctx->driver_data; 715 + enum dma_data_direction dir = rq_dma_dir(req); 716 + struct nvme_iod *iod = blk_mq_rq_to_pdu(req); 717 + struct device *dma_dev = nvmeq->dev->dev; 718 + struct nvme_sgl_desc *sge = iod->meta_descriptor; 719 + 720 + if (iod->flags & IOD_SINGLE_META_SEGMENT) { 721 + dma_unmap_page(dma_dev, iod->meta_dma, 722 + rq_integrity_vec(req).bv_len, 723 + rq_dma_dir(req)); 724 + return; 725 + } 726 + 727 + if (!blk_rq_integrity_dma_unmap(req, dma_dev, &iod->meta_dma_state, 728 + iod->meta_total_len)) { 729 + if (nvme_pci_cmd_use_meta_sgl(&iod->cmd)) 730 + nvme_free_sgls(req, sge, &sge[1]); 731 + else 732 + dma_unmap_page(dma_dev, iod->meta_dma, 733 + iod->meta_total_len, dir); 734 + } 735 + 736 + if (iod->meta_descriptor) 737 + dma_pool_free(nvmeq->descriptor_pools.small, 738 + iod->meta_descriptor, iod->meta_dma); 719 739 } 720 740 721 741 static void nvme_unmap_data(struct request *req) ··· 761 727 762 728 if (!blk_rq_dma_unmap(req, dma_dev, &iod->dma_state, iod->total_len)) { 763 729 if (nvme_pci_cmd_use_sgl(&iod->cmd)) 764 - nvme_free_sgls(req); 730 + nvme_free_sgls(req, iod->descriptors[0], 731 + &iod->cmd.common.dptr.sgl); 765 732 else 766 733 nvme_free_prps(req); 767 734 } ··· 1042 1007 return nvme_pci_setup_data_prp(req, &iter); 1043 1008 } 1044 1009 1045 - static void 
nvme_pci_sgl_set_data_sg(struct nvme_sgl_desc *sge, 1046 - struct scatterlist *sg) 1047 - { 1048 - sge->addr = cpu_to_le64(sg_dma_address(sg)); 1049 - sge->length = cpu_to_le32(sg_dma_len(sg)); 1050 - sge->type = NVME_SGL_FMT_DATA_DESC << 4; 1051 - } 1052 - 1053 1010 static blk_status_t nvme_pci_setup_meta_sgls(struct request *req) 1054 1011 { 1055 1012 struct nvme_queue *nvmeq = req->mq_hctx->driver_data; 1056 - struct nvme_dev *dev = nvmeq->dev; 1013 + unsigned int entries = req->nr_integrity_segments; 1057 1014 struct nvme_iod *iod = blk_mq_rq_to_pdu(req); 1015 + struct nvme_dev *dev = nvmeq->dev; 1058 1016 struct nvme_sgl_desc *sg_list; 1059 - struct scatterlist *sgl, *sg; 1060 - unsigned int entries; 1017 + struct blk_dma_iter iter; 1061 1018 dma_addr_t sgl_dma; 1062 - int rc, i; 1019 + int i = 0; 1063 1020 1064 - iod->meta_sgt.sgl = mempool_alloc(dev->iod_meta_mempool, GFP_ATOMIC); 1065 - if (!iod->meta_sgt.sgl) 1066 - return BLK_STS_RESOURCE; 1021 + if (!blk_rq_integrity_dma_map_iter_start(req, dev->dev, 1022 + &iod->meta_dma_state, &iter)) 1023 + return iter.status; 1067 1024 1068 - sg_init_table(iod->meta_sgt.sgl, req->nr_integrity_segments); 1069 - iod->meta_sgt.orig_nents = blk_rq_map_integrity_sg(req, 1070 - iod->meta_sgt.sgl); 1071 - if (!iod->meta_sgt.orig_nents) 1072 - goto out_free_sg; 1025 + if (blk_rq_dma_map_coalesce(&iod->meta_dma_state)) 1026 + entries = 1; 1073 1027 1074 - rc = dma_map_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 1075 - DMA_ATTR_NO_WARN); 1076 - if (rc) 1077 - goto out_free_sg; 1028 + /* 1029 + * The NVMe MPTR descriptor has an implicit length that the host and 1030 + * device must agree on to avoid data/memory corruption. We trust the 1031 + * kernel allocated correctly based on the format's parameters, so use 1032 + * the more efficient MPTR to avoid extra dma pool allocations for the 1033 + * SGL indirection. 
1034 + * 1035 + * But for user commands, we don't necessarily know what they do, so 1036 + * the driver can't validate the metadata buffer size. The SGL 1037 + * descriptor provides an explicit length, so we're relying on that 1038 + * mechanism to catch any misunderstandings between the application and 1039 + * device. 1040 + */ 1041 + if (entries == 1 && !(nvme_req(req)->flags & NVME_REQ_USERCMD)) { 1042 + iod->cmd.common.metadata = cpu_to_le64(iter.addr); 1043 + iod->meta_total_len = iter.len; 1044 + iod->meta_dma = iter.addr; 1045 + iod->meta_descriptor = NULL; 1046 + return BLK_STS_OK; 1047 + } 1078 1048 1079 1049 sg_list = dma_pool_alloc(nvmeq->descriptor_pools.small, GFP_ATOMIC, 1080 1050 &sgl_dma); 1081 1051 if (!sg_list) 1082 - goto out_unmap_sg; 1052 + return BLK_STS_RESOURCE; 1083 1053 1084 - entries = iod->meta_sgt.nents; 1085 1054 iod->meta_descriptor = sg_list; 1086 1055 iod->meta_dma = sgl_dma; 1087 - 1088 1056 iod->cmd.common.flags = NVME_CMD_SGL_METASEG; 1089 1057 iod->cmd.common.metadata = cpu_to_le64(sgl_dma); 1090 - 1091 - sgl = iod->meta_sgt.sgl; 1092 1058 if (entries == 1) { 1093 - nvme_pci_sgl_set_data_sg(sg_list, sgl); 1059 + iod->meta_total_len = iter.len; 1060 + nvme_pci_sgl_set_data(sg_list, &iter); 1094 1061 return BLK_STS_OK; 1095 1062 } 1096 1063 1097 1064 sgl_dma += sizeof(*sg_list); 1098 - nvme_pci_sgl_set_seg(sg_list, sgl_dma, entries); 1099 - for_each_sg(sgl, sg, entries, i) 1100 - nvme_pci_sgl_set_data_sg(&sg_list[i + 1], sg); 1065 + do { 1066 + nvme_pci_sgl_set_data(&sg_list[++i], &iter); 1067 + iod->meta_total_len += iter.len; 1068 + } while (blk_rq_integrity_dma_map_iter_next(req, dev->dev, &iter)); 1101 1069 1102 - return BLK_STS_OK; 1103 - 1104 - out_unmap_sg: 1105 - dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0); 1106 - out_free_sg: 1107 - mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool); 1108 - return BLK_STS_RESOURCE; 1070 + nvme_pci_sgl_set_seg(sg_list, sgl_dma, i); 1071 + if (unlikely(iter.status)) 
1072 + nvme_unmap_metadata(req); 1073 + return iter.status; 1109 1074 } 1110 1075 1111 1076 static blk_status_t nvme_pci_setup_meta_mptr(struct request *req) ··· 1118 1083 if (dma_mapping_error(nvmeq->dev->dev, iod->meta_dma)) 1119 1084 return BLK_STS_IOERR; 1120 1085 iod->cmd.common.metadata = cpu_to_le64(iod->meta_dma); 1086 + iod->flags |= IOD_SINGLE_META_SEGMENT; 1121 1087 return BLK_STS_OK; 1122 1088 } 1123 1089 ··· 1140 1104 iod->flags = 0; 1141 1105 iod->nr_descriptors = 0; 1142 1106 iod->total_len = 0; 1143 - iod->meta_sgt.nents = 0; 1107 + iod->meta_total_len = 0; 1144 1108 1145 1109 ret = nvme_setup_cmd(req->q->queuedata, req); 1146 1110 if (ret) ··· 1249 1213 if (nvmeq) 1250 1214 nvme_submit_cmds(nvmeq, &submit_list); 1251 1215 *rqlist = requeue_list; 1252 - } 1253 - 1254 - static __always_inline void nvme_unmap_metadata(struct request *req) 1255 - { 1256 - struct nvme_iod *iod = blk_mq_rq_to_pdu(req); 1257 - struct nvme_queue *nvmeq = req->mq_hctx->driver_data; 1258 - struct nvme_dev *dev = nvmeq->dev; 1259 - 1260 - if (!iod->meta_sgt.nents) { 1261 - dma_unmap_page(dev->dev, iod->meta_dma, 1262 - rq_integrity_vec(req).bv_len, 1263 - rq_dma_dir(req)); 1264 - return; 1265 - } 1266 - 1267 - dma_pool_free(nvmeq->descriptor_pools.small, iod->meta_descriptor, 1268 - iod->meta_dma); 1269 - dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0); 1270 - mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool); 1271 1216 } 1272 1217 1273 1218 static __always_inline void nvme_pci_unmap_rq(struct request *req) ··· 3056 3039 3057 3040 static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev) 3058 3041 { 3059 - size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1); 3060 3042 size_t alloc_size = sizeof(struct nvme_dma_vec) * NVME_MAX_SEGS; 3061 3043 3062 3044 dev->dmavec_mempool = mempool_create_node(1, ··· 3064 3048 dev_to_node(dev->dev)); 3065 3049 if (!dev->dmavec_mempool) 3066 3050 return -ENOMEM; 3067 - 3068 - dev->iod_meta_mempool = 
mempool_create_node(1, 3069 - mempool_kmalloc, mempool_kfree, 3070 - (void *)meta_size, GFP_KERNEL, 3071 - dev_to_node(dev->dev)); 3072 - if (!dev->iod_meta_mempool) 3073 - goto free; 3074 3051 return 0; 3075 - free: 3076 - mempool_destroy(dev->dmavec_mempool); 3077 - return -ENOMEM; 3078 3052 } 3079 3053 3080 3054 static void nvme_free_tagset(struct nvme_dev *dev) ··· 3330 3324 * Exclude Samsung 990 Evo from NVME_QUIRK_SIMPLE_SUSPEND 3331 3325 * because of high power consumption (> 2 Watt) in s2idle 3332 3326 * sleep. Only some boards with Intel CPU are affected. 3327 + * (Note for testing: Samsung 990 Evo Plus has same PCI ID) 3333 3328 */ 3334 3329 if (dmi_match(DMI_BOARD_NAME, "DN50Z-140HC-YD") || 3335 3330 dmi_match(DMI_BOARD_NAME, "GMxPXxx") || 3336 3331 dmi_match(DMI_BOARD_NAME, "GXxMRXx") || 3332 + dmi_match(DMI_BOARD_NAME, "NS5X_NS7XAU") || 3337 3333 dmi_match(DMI_BOARD_NAME, "PH4PG31") || 3338 3334 dmi_match(DMI_BOARD_NAME, "PH4PRX1_PH6PRX1") || 3339 3335 dmi_match(DMI_BOARD_NAME, "PH6PG01_PH6PG71")) ··· 3516 3508 nvme_free_queues(dev, 0); 3517 3509 out_release_iod_mempool: 3518 3510 mempool_destroy(dev->dmavec_mempool); 3519 - mempool_destroy(dev->iod_meta_mempool); 3520 3511 out_dev_unmap: 3521 3512 nvme_dev_unmap(dev); 3522 3513 out_uninit_ctrl: ··· 3579 3572 nvme_dbbuf_dma_free(dev); 3580 3573 nvme_free_queues(dev, 0); 3581 3574 mempool_destroy(dev->dmavec_mempool); 3582 - mempool_destroy(dev->iod_meta_mempool); 3583 3575 nvme_release_descriptor_pools(dev); 3584 3576 nvme_dev_unmap(dev); 3585 3577 nvme_uninit_ctrl(&dev->ctrl);
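
The comment added to `nvme_pci_setup_meta_sgls()` above explains when the driver uses a bare MPTR and when it uses an SGL descriptor for metadata. The decision can be summarized as a small pure function. This is a sketch of the branch structure visible in the hunk, with hypothetical `fake_*` names; it does no DMA mapping and omits the error handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three outcomes visible in the hunk. */
enum fake_meta_desc {
	FAKE_META_MPTR,        /* metadata pointer, implicit length */
	FAKE_META_SGL_DATA,    /* one SGL data descriptor, explicit length */
	FAKE_META_SGL_SEGMENT, /* segment descriptor plus data descriptors */
};

/*
 * entries:   number of integrity segments in the request (>= 1)
 * coalesced: the DMA layer merged everything into one IOVA range
 * user_cmd:  the request is a passthrough command from userspace
 */
static enum fake_meta_desc fake_pick_meta_desc(unsigned int entries,
					       bool coalesced, bool user_cmd)
{
	assert(entries >= 1);

	if (coalesced)
		entries = 1;

	/* Kernel-built commands are trusted to match the format's length. */
	if (entries == 1 && !user_cmd)
		return FAKE_META_MPTR;

	/* User commands keep an explicit length so a mismatch is caught. */
	if (entries == 1)
		return FAKE_META_SGL_DATA;

	return FAKE_META_SGL_SEGMENT;
}
```

The MPTR path needs no descriptor allocation from the DMA pool, which is the efficiency argument the comment makes; the SGL paths trade that allocation for an explicit length.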

+3

drivers/nvme/host/tcp.c

··· 2250 2250 if (error) 2251 2251 goto out_cleanup_tagset; 2252 2252 2253 + if (ctrl->opts->concat && !ctrl->tls_pskid) 2254 + return 0; 2255 + 2253 2256 error = nvme_enable_ctrl(ctrl); 2254 2257 if (error) 2255 2258 goto out_stop_queue;

+6 -9

drivers/nvme/target/core.c

··· 513 513 return 0; 514 514 } 515 515 516 - /* 517 - * Note: ctrl->subsys->lock should be held when calling this function 518 - */ 519 516 static void nvmet_p2pmem_ns_add_p2p(struct nvmet_ctrl *ctrl, 520 517 struct nvmet_ns *ns) 521 518 { 522 519 struct device *clients[2]; 523 520 struct pci_dev *p2p_dev; 524 521 int ret; 522 + 523 + lockdep_assert_held(&ctrl->subsys->lock); 525 524 526 525 if (!ctrl->p2p_client || !ns->use_p2pmem) 527 526 return; ··· 1538 1539 return false; 1539 1540 } 1540 1541 1541 - /* 1542 - * Note: ctrl->subsys->lock should be held when calling this function 1543 - */ 1544 1542 static void nvmet_setup_p2p_ns_map(struct nvmet_ctrl *ctrl, 1545 1543 struct device *p2p_client) 1546 1544 { 1547 1545 struct nvmet_ns *ns; 1548 1546 unsigned long idx; 1547 + 1548 + lockdep_assert_held(&ctrl->subsys->lock); 1549 1549 1550 1550 if (!p2p_client) 1551 1551 return; ··· 1555 1557 nvmet_p2pmem_ns_add_p2p(ctrl, ns); 1556 1558 } 1557 1559 1558 - /* 1559 - * Note: ctrl->subsys->lock should be held when calling this function 1560 - */ 1561 1560 static void nvmet_release_p2p_ns_map(struct nvmet_ctrl *ctrl) 1562 1561 { 1563 1562 struct radix_tree_iter iter; 1564 1563 void __rcu **slot; 1564 + 1565 + lockdep_assert_held(&ctrl->subsys->lock); 1565 1566 1566 1567 radix_tree_for_each_slot(slot, &ctrl->p2p_ns_map, &iter, 0) 1567 1568 pci_dev_put(radix_tree_deref_slot(slot));

+18 -17

drivers/nvme/target/fc.c

··· 54 54 int ls_error; 55 55 struct list_head lsreq_list; /* tgtport->ls_req_list */ 56 56 bool req_queued; 57 + 58 + struct work_struct put_work; 57 59 }; 58 60 59 61 ··· 113 111 struct nvmet_fc_port_entry *pe; 114 112 struct kref ref; 115 113 u32 max_sg_cnt; 116 - 117 - struct work_struct put_work; 118 114 }; 119 115 120 116 struct nvmet_fc_port_entry { ··· 235 235 static void nvmet_fc_tgt_q_put(struct nvmet_fc_tgt_queue *queue); 236 236 static int nvmet_fc_tgt_q_get(struct nvmet_fc_tgt_queue *queue); 237 237 static void nvmet_fc_tgtport_put(struct nvmet_fc_tgtport *tgtport); 238 - static void nvmet_fc_put_tgtport_work(struct work_struct *work) 238 + static void nvmet_fc_put_lsop_work(struct work_struct *work) 239 239 { 240 - struct nvmet_fc_tgtport *tgtport = 241 - container_of(work, struct nvmet_fc_tgtport, put_work); 240 + struct nvmet_fc_ls_req_op *lsop = 241 + container_of(work, struct nvmet_fc_ls_req_op, put_work); 242 242 243 - nvmet_fc_tgtport_put(tgtport); 243 + nvmet_fc_tgtport_put(lsop->tgtport); 244 + kfree(lsop); 244 245 } 245 246 static int nvmet_fc_tgtport_get(struct nvmet_fc_tgtport *tgtport); 246 247 static void nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport, ··· 368 367 DMA_BIDIRECTIONAL); 369 368 370 369 out_putwork: 371 - queue_work(nvmet_wq, &tgtport->put_work); 370 + queue_work(nvmet_wq, &lsop->put_work); 372 371 } 373 372 374 373 static int ··· 389 388 lsreq->done = done; 390 389 lsop->req_queued = false; 391 390 INIT_LIST_HEAD(&lsop->lsreq_list); 391 + INIT_WORK(&lsop->put_work, nvmet_fc_put_lsop_work); 392 392 393 393 lsreq->rqstdma = fc_dma_map_single(tgtport->dev, lsreq->rqstaddr, 394 394 lsreq->rqstlen + lsreq->rsplen, ··· 449 447 __nvmet_fc_finish_ls_req(lsop); 450 448 451 449 /* fc-nvme target doesn't care about success or failure of cmd */ 452 - 453 - kfree(lsop); 454 450 } 455 451 456 452 /* ··· 1075 1075 static void 1076 1076 nvmet_fc_schedule_delete_assoc(struct nvmet_fc_tgt_assoc *assoc) 1077 1077 { 1078 + int 
terminating; 1079 + 1080 + terminating = atomic_xchg(&assoc->terminating, 1); 1081 + 1082 + /* if already terminating, do nothing */ 1083 + if (terminating) 1084 + return; 1085 + 1078 1086 nvmet_fc_tgtport_get(assoc->tgtport); 1079 1087 if (!queue_work(nvmet_wq, &assoc->del_work)) 1080 1088 nvmet_fc_tgtport_put(assoc->tgtport); ··· 1210 1202 { 1211 1203 struct nvmet_fc_tgtport *tgtport = assoc->tgtport; 1212 1204 unsigned long flags; 1213 - int i, terminating; 1214 - 1215 - terminating = atomic_xchg(&assoc->terminating, 1); 1216 - 1217 - /* if already terminating, do nothing */ 1218 - if (terminating) 1219 - return; 1205 + int i; 1220 1206 1221 1207 spin_lock_irqsave(&tgtport->lock, flags); 1222 1208 list_del_rcu(&assoc->a_list); ··· 1412 1410 kref_init(&newrec->ref); 1413 1411 ida_init(&newrec->assoc_cnt); 1414 1412 newrec->max_sg_cnt = template->max_sgl_segments; 1415 - INIT_WORK(&newrec->put_work, nvmet_fc_put_tgtport_work); 1416 1413 1417 1414 ret = nvmet_fc_alloc_ls_iodlist(newrec); 1418 1415 if (ret) {
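Note on the nvmet_fc_schedule_delete_assoc() hunk above: the atomic_xchg() guard on assoc->terminating now sits on the scheduling side, so only the first caller takes the tgtport reference and queues del_work. A minimal userspace sketch of that one-shot guard follows, using C11 atomics in place of the kernel's atomic_t. The struct and function names are hypothetical and exist only to illustrate the pattern; this is not the driver code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the association; not the kernel struct. */
struct assoc_sketch {
	atomic_int terminating;	/* 0 until the first delete request */
	int work_queued;	/* how many times "work" was queued */
};

static void assoc_sketch_init(struct assoc_sketch *assoc)
{
	atomic_init(&assoc->terminating, 0);
	assoc->work_queued = 0;
}

/*
 * Returns true only for the caller that wins the exchange; every
 * later caller sees the old value 1 and backs off without queueing.
 */
static bool schedule_delete_once(struct assoc_sketch *assoc)
{
	if (atomic_exchange(&assoc->terminating, 1))
		return false;	/* already terminating, do nothing */

	assoc->work_queued++;	/* stands in for queue_work() */
	return true;
}
```

Putting the exchange before the queueing step means a repeated delete request returns early and never touches the reference count or the work item.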

+5 -3

drivers/nvme/target/fcloop.c

··· 496 496 if (!targetport) { 497 497 /* 498 498 * The target port is gone. The target doesn't expect any 499 - * response anymore and the ->done call is not valid 500 - * because the resources have been freed by 501 - * nvmet_fc_free_pending_reqs. 499 + * response anymore and thus lsreq can't be accessed anymore. 502 500 * 503 501 * We end up here from delete association exchange: 504 502 * nvmet_fc_xmt_disconnect_assoc sends an async request. 503 + * 504 + * Return success because this is what LLDDs do; silently 505 + * drop the response. 505 506 */ 507 + lsrsp->done(lsrsp); 506 508 kmem_cache_free(lsreq_cache, tls_req); 507 509 return 0; 508 510 }

+16 -8

drivers/s390/block/dasd.c

··· 334 334 lim.max_dev_sectors = device->discipline->max_sectors(block); 335 335 lim.max_hw_sectors = lim.max_dev_sectors; 336 336 lim.logical_block_size = block->bp_block; 337 + /* 338 + * Adjust dma_alignment to match block_size - 1 339 + * to ensure proper buffer alignment checks in the block layer. 340 + */ 341 + lim.dma_alignment = lim.logical_block_size - 1; 337 342 338 343 if (device->discipline->has_discard) { 339 344 unsigned int max_bytes; ··· 3119 3114 PTR_ERR(cqr) == -ENOMEM || 3120 3115 PTR_ERR(cqr) == -EAGAIN) { 3121 3116 rc = BLK_STS_RESOURCE; 3122 - goto out; 3117 + } else if (PTR_ERR(cqr) == -EINVAL) { 3118 + rc = BLK_STS_INVAL; 3119 + } else { 3120 + DBF_DEV_EVENT(DBF_ERR, basedev, 3121 + "CCW creation failed (rc=%ld) on request %p", 3122 + PTR_ERR(cqr), req); 3123 + rc = BLK_STS_IOERR; 3123 3124 } 3124 - DBF_DEV_EVENT(DBF_ERR, basedev, 3125 - "CCW creation failed (rc=%ld) on request %p", 3126 - PTR_ERR(cqr), req); 3127 - rc = BLK_STS_IOERR; 3128 3125 goto out; 3129 3126 } 3130 3127 /* ··· 3324 3317 /* 3325 3318 * Return disk geometry. 3326 3319 */ 3327 - static int dasd_getgeo(struct block_device *bdev, struct hd_geometry *geo) 3320 + static int dasd_getgeo(struct gendisk *disk, struct hd_geometry *geo) 3328 3321 { 3329 3322 struct dasd_device *base; 3330 3323 3331 - base = dasd_device_from_gendisk(bdev->bd_disk); 3324 + base = dasd_device_from_gendisk(disk); 3332 3325 if (!base) 3333 3326 return -ENODEV; 3334 3327 ··· 3338 3331 return -EINVAL; 3339 3332 } 3340 3333 base->discipline->fill_geometry(base->block, geo); 3341 - geo->start = get_start_sect(bdev) >> base->block->s2b_shift; 3334 + // geo->start is left unchanged by the above 3335 + geo->start >>= base->block->s2b_shift; 3342 3336 dasd_put_device(base); 3343 3337 return 0; 3344 3338 }
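Note on the dma_alignment hunk above: the limit is stored as a mask (block size minus one), so for a power-of-two block size an address is aligned when it shares no bits with the mask. A small sketch of that arithmetic, assuming a power-of-two logical block size; the helper names are hypothetical and are not block-layer functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask form of an alignment requirement; assumes a power-of-two size. */
static inline unsigned int alignment_mask_sketch(unsigned int logical_block_size)
{
	return logical_block_size - 1;
}

/* An address passes the check when none of the mask bits are set. */
static inline bool buffer_is_aligned_sketch(uintptr_t addr, unsigned int mask)
{
	return (addr & mask) == 0;
}
```

With a 4096-byte block the mask is 4095, so a buffer at 0x2200 fails the check, while the same address passes against a 512-byte block (mask 511).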

+1 -1

drivers/scsi/3w-9xxx.c

··· 1695 1695 } /* End twa_reset_sequence() */ 1696 1696 1697 1697 /* This funciton returns unit geometry in cylinders/heads/sectors */ 1698 - static int twa_scsi_biosparam(struct scsi_device *sdev, struct block_device *bdev, sector_t capacity, int geom[]) 1698 + static int twa_scsi_biosparam(struct scsi_device *sdev, struct gendisk *unused, sector_t capacity, int geom[]) 1699 1699 { 1700 1700 int heads, sectors, cylinders; 1701 1701

+1 -1

drivers/scsi/3w-sas.c

··· 1404 1404 } /* End twl_reset_device_extension() */ 1405 1405 1406 1406 /* This funciton returns unit geometry in cylinders/heads/sectors */ 1407 - static int twl_scsi_biosparam(struct scsi_device *sdev, struct block_device *bdev, sector_t capacity, int geom[]) 1407 + static int twl_scsi_biosparam(struct scsi_device *sdev, struct gendisk *unused, sector_t capacity, int geom[]) 1408 1408 { 1409 1409 int heads, sectors; 1410 1410

+1 -1

drivers/scsi/3w-xxxx.c

··· 1340 1340 } /* End tw_reset_device_extension() */ 1341 1341 1342 1342 /* This funciton returns unit geometry in cylinders/heads/sectors */ 1343 - static int tw_scsi_biosparam(struct scsi_device *sdev, struct block_device *bdev, 1343 + static int tw_scsi_biosparam(struct scsi_device *sdev, struct gendisk *unused, 1344 1344 sector_t capacity, int geom[]) 1345 1345 { 1346 1346 int heads, sectors, cylinders;

+2 -2

drivers/scsi/BusLogic.c

··· 3240 3240 the BIOS, and a warning may be displayed. 3241 3241 */ 3242 3242 3243 - static int blogic_diskparam(struct scsi_device *sdev, struct block_device *dev, 3243 + static int blogic_diskparam(struct scsi_device *sdev, struct gendisk *disk, 3244 3244 sector_t capacity, int *params) 3245 3245 { 3246 3246 struct blogic_adapter *adapter = ··· 3261 3261 diskparam->sectors = 32; 3262 3262 } 3263 3263 diskparam->cylinders = (unsigned long) capacity / (diskparam->heads * diskparam->sectors); 3264 - buf = scsi_bios_ptable(dev); 3264 + buf = scsi_bios_ptable(disk); 3265 3265 if (buf == NULL) 3266 3266 return 0; 3267 3267 /*

+1 -1

drivers/scsi/BusLogic.h

··· 1273 1273 1274 1274 static const char *blogic_drvr_info(struct Scsi_Host *); 1275 1275 static int blogic_qcmd(struct Scsi_Host *h, struct scsi_cmnd *); 1276 - static int blogic_diskparam(struct scsi_device *, struct block_device *, sector_t, int *); 1276 + static int blogic_diskparam(struct scsi_device *, struct gendisk *, sector_t, int *); 1277 1277 static int blogic_sdev_configure(struct scsi_device *, 1278 1278 struct queue_limits *lim); 1279 1279 static void blogic_qcompleted_ccb(struct blogic_ccb *);

+3 -3

drivers/scsi/aacraid/linit.c

··· 273 273 /** 274 274 * aac_biosparm - return BIOS parameters for disk 275 275 * @sdev: The scsi device corresponding to the disk 276 - * @bdev: the block device corresponding to the disk 276 + * @disk: the gendisk corresponding to the disk 277 277 * @capacity: the sector capacity of the disk 278 278 * @geom: geometry block to fill in 279 279 * ··· 292 292 * be displayed. 293 293 */ 294 294 295 - static int aac_biosparm(struct scsi_device *sdev, struct block_device *bdev, 295 + static int aac_biosparm(struct scsi_device *sdev, struct gendisk *disk, 296 296 sector_t capacity, int *geom) 297 297 { 298 298 struct diskparm *param = (struct diskparm *)geom; ··· 324 324 * entry whose end_head matches one of the standard geometry 325 325 * translations ( 64/32, 128/32, 255/63 ). 326 326 */ 327 - buf = scsi_bios_ptable(bdev); 327 + buf = scsi_bios_ptable(disk); 328 328 if (!buf) 329 329 return 0; 330 330 if (*(__le16 *)(buf + 0x40) == cpu_to_le16(MSDOS_LABEL_MAGIC)) {

+1 -1

drivers/scsi/advansys.c

··· 7096 7096 * ip[2]: cylinders 7097 7097 */ 7098 7098 static int 7099 - advansys_biosparam(struct scsi_device *sdev, struct block_device *bdev, 7099 + advansys_biosparam(struct scsi_device *sdev, struct gendisk *unused, 7100 7100 sector_t capacity, int ip[]) 7101 7101 { 7102 7102 struct asc_board *boardp = shost_priv(sdev->host);

+2 -2

drivers/scsi/aha152x.c

··· 1246 1246 * Return the "logical geometry" 1247 1247 * 1248 1248 */ 1249 - static int aha152x_biosparam(struct scsi_device *sdev, struct block_device *bdev, 1249 + static int aha152x_biosparam(struct scsi_device *sdev, struct gendisk *disk, 1250 1250 sector_t capacity, int *info_array) 1251 1251 { 1252 1252 struct Scsi_Host *shpnt = sdev->host; ··· 1261 1261 int info[3]; 1262 1262 1263 1263 /* try to figure out the geometry from the partition table */ 1264 - if (scsicam_bios_param(bdev, capacity, info) < 0 || 1264 + if (scsicam_bios_param(disk, capacity, info) < 0 || 1265 1265 !((info[0] == 64 && info[1] == 32) || (info[0] == 255 && info[1] == 63))) { 1266 1266 if (EXT_TRANS) { 1267 1267 printk(KERN_NOTICE

+1 -1

drivers/scsi/aha1542.c

··· 992 992 } 993 993 994 994 static int aha1542_biosparam(struct scsi_device *sdev, 995 - struct block_device *bdev, sector_t capacity, int geom[]) 995 + struct gendisk *unused, sector_t capacity, int geom[]) 996 996 { 997 997 struct aha1542_hostdata *aha1542 = shost_priv(sdev->host); 998 998

+1 -1

drivers/scsi/aha1740.c

··· 510 510 } 511 511 512 512 static int aha1740_biosparam(struct scsi_device *sdev, 513 - struct block_device *dev, 513 + struct gendisk *unused, 514 514 sector_t capacity, int* ip) 515 515 { 516 516 int size = capacity;

+2 -2

drivers/scsi/aic7xxx/aic79xx_osm.c

··· 720 720 * Return the disk geometry for the given SCSI device. 721 721 */ 722 722 static int 723 - ahd_linux_biosparam(struct scsi_device *sdev, struct block_device *bdev, 723 + ahd_linux_biosparam(struct scsi_device *sdev, struct gendisk *disk, 724 724 sector_t capacity, int geom[]) 725 725 { 726 726 int heads; ··· 731 731 732 732 ahd = *((struct ahd_softc **)sdev->host->hostdata); 733 733 734 - if (scsi_partsize(bdev, capacity, geom)) 734 + if (scsi_partsize(disk, capacity, geom)) 735 735 return 0; 736 736 737 737 heads = 64;

+2 -2

drivers/scsi/aic7xxx/aic7xxx_osm.c

··· 683 683 * Return the disk geometry for the given SCSI device. 684 684 */ 685 685 static int 686 - ahc_linux_biosparam(struct scsi_device *sdev, struct block_device *bdev, 686 + ahc_linux_biosparam(struct scsi_device *sdev, struct gendisk *disk, 687 687 sector_t capacity, int geom[]) 688 688 { 689 689 int heads; ··· 696 696 ahc = *((struct ahc_softc **)sdev->host->hostdata); 697 697 channel = sdev_channel(sdev); 698 698 699 - if (scsi_partsize(bdev, capacity, geom)) 699 + if (scsi_partsize(disk, capacity, geom)) 700 700 return 0; 701 701 702 702 heads = 64;

+3 -3

drivers/scsi/arcmsr/arcmsr_hba.c

··· 112 112 static int arcmsr_abort(struct scsi_cmnd *); 113 113 static int arcmsr_bus_reset(struct scsi_cmnd *); 114 114 static int arcmsr_bios_param(struct scsi_device *sdev, 115 - struct block_device *bdev, sector_t capacity, int *info); 115 + struct gendisk *disk, sector_t capacity, int *info); 116 116 static int arcmsr_queue_command(struct Scsi_Host *h, struct scsi_cmnd *cmd); 117 117 static int arcmsr_probe(struct pci_dev *pdev, 118 118 const struct pci_device_id *id); ··· 377 377 } 378 378 379 379 static int arcmsr_bios_param(struct scsi_device *sdev, 380 - struct block_device *bdev, sector_t capacity, int *geom) 380 + struct gendisk *disk, sector_t capacity, int *geom) 381 381 { 382 382 int heads, sectors, cylinders, total_capacity; 383 383 384 - if (scsi_partsize(bdev, capacity, geom)) 384 + if (scsi_partsize(disk, capacity, geom)) 385 385 return 0; 386 386 387 387 total_capacity = capacity;

+1 -1

drivers/scsi/atp870u.c

··· 1692 1692 } 1693 1693 1694 1694 1695 - static int atp870u_biosparam(struct scsi_device *disk, struct block_device *dev, 1695 + static int atp870u_biosparam(struct scsi_device *disk, struct gendisk *unused, 1696 1696 sector_t capacity, int *ip) 1697 1697 { 1698 1698 int heads, sectors, cylinders;

+2 -2

drivers/scsi/fdomain.c

··· 469 469 } 470 470 471 471 static int fdomain_biosparam(struct scsi_device *sdev, 472 - struct block_device *bdev, sector_t capacity, 472 + struct gendisk *disk, sector_t capacity, 473 473 int geom[]) 474 474 { 475 - unsigned char *p = scsi_bios_ptable(bdev); 475 + unsigned char *p = scsi_bios_ptable(disk); 476 476 477 477 if (p && p[65] == 0xaa && p[64] == 0x55 /* Partition table valid */ 478 478 && p[4]) { /* Partition type */

+1 -1

drivers/scsi/imm.c

··· 954 954 * be done in sd.c. Even if it gets fixed there, this will still 955 955 * work. 956 956 */ 957 - static int imm_biosparam(struct scsi_device *sdev, struct block_device *dev, 957 + static int imm_biosparam(struct scsi_device *sdev, struct gendisk *unused, 958 958 sector_t capacity, int ip[]) 959 959 { 960 960 ip[0] = 0x40;

+2 -2

drivers/scsi/initio.c

··· 2645 2645 /** 2646 2646 * i91u_biosparam - return the "logical geometry 2647 2647 * @sdev: SCSI device 2648 - * @dev: Matching block device 2648 + * @unused: Matching gendisk 2649 2649 * @capacity: Sector size of drive 2650 2650 * @info_array: Return space for BIOS geometry 2651 2651 * ··· 2655 2655 * FIXME: limited to 2^32 sector devices. 2656 2656 */ 2657 2657 2658 - static int i91u_biosparam(struct scsi_device *sdev, struct block_device *dev, 2658 + static int i91u_biosparam(struct scsi_device *sdev, struct gendisk *unused, 2659 2659 sector_t capacity, int *info_array) 2660 2660 { 2661 2661 struct initio_host *host; /* Point to Host adapter control block */

+4 -4

drivers/scsi/ipr.c

··· 4644 4644 4645 4645 /** 4646 4646 * ipr_biosparam - Return the HSC mapping 4647 - * @sdev: scsi device struct 4648 - * @block_device: block device pointer 4647 + * @sdev: scsi device struct 4648 + * @unused: gendisk pointer 4649 4649 * @capacity: capacity of the device 4650 - * @parm: Array containing returned HSC values. 4650 + * @parm: Array containing returned HSC values. 4651 4651 * 4652 4652 * This function generates the HSC parms that fdisk uses. 4653 4653 * We want to make sure we return something that places partitions ··· 4657 4657 * 0 on success 4658 4658 **/ 4659 4659 static int ipr_biosparam(struct scsi_device *sdev, 4660 - struct block_device *block_device, 4660 + struct gendisk *unused, 4661 4661 sector_t capacity, int *parm) 4662 4662 { 4663 4663 int heads, sectors;

+1 -1

drivers/scsi/ips.c

··· 1123 1123 /* Set bios geometry for the controller */ 1124 1124 /* */ 1125 1125 /****************************************************************************/ 1126 - static int ips_biosparam(struct scsi_device *sdev, struct block_device *bdev, 1126 + static int ips_biosparam(struct scsi_device *sdev, struct gendisk *unused, 1127 1127 sector_t capacity, int geom[]) 1128 1128 { 1129 1129 ips_ha_t *ha = (ips_ha_t *) sdev->host->hostdata;

+1 -1

drivers/scsi/ips.h

··· 398 398 /* 399 399 * Scsi_Host Template 400 400 */ 401 - static int ips_biosparam(struct scsi_device *sdev, struct block_device *bdev, 401 + static int ips_biosparam(struct scsi_device *sdev, struct gendisk *unused, 402 402 sector_t capacity, int geom[]); 403 403 static int ips_sdev_configure(struct scsi_device *SDptr, 404 404 struct queue_limits *lim);

+1 -1

drivers/scsi/libsas/sas_scsi_host.c

··· 845 845 EXPORT_SYMBOL_GPL(sas_change_queue_depth); 846 846 847 847 int sas_bios_param(struct scsi_device *scsi_dev, 848 - struct block_device *bdev, 848 + struct gendisk *unused, 849 849 sector_t capacity, int *hsc) 850 850 { 851 851 hsc[0] = 255;

+2 -2

drivers/scsi/megaraid.c

··· 2780 2780 * Return the disk geometry for a particular disk 2781 2781 */ 2782 2782 static int 2783 - megaraid_biosparam(struct scsi_device *sdev, struct block_device *bdev, 2783 + megaraid_biosparam(struct scsi_device *sdev, struct gendisk *disk, 2784 2784 sector_t capacity, int geom[]) 2785 2785 { 2786 2786 adapter_t *adapter; ··· 2813 2813 geom[2] = cylinders; 2814 2814 } 2815 2815 else { 2816 - if (scsi_partsize(bdev, capacity, geom)) 2816 + if (scsi_partsize(disk, capacity, geom)) 2817 2817 return 0; 2818 2818 2819 2819 dev_info(&adapter->dev->dev,

+1 -1

drivers/scsi/megaraid.h

··· 975 975 static int megaraid_abort(struct scsi_cmnd *); 976 976 static int megaraid_reset(struct scsi_cmnd *); 977 977 static int megaraid_abort_and_reset(adapter_t *, struct scsi_cmnd *, int); 978 - static int megaraid_biosparam(struct scsi_device *, struct block_device *, 978 + static int megaraid_biosparam(struct scsi_device *, struct gendisk *, 979 979 sector_t, int []); 980 980 981 981 static int mega_build_sglist (adapter_t *adapter, scb_t *scb,

+2 -2

drivers/scsi/megaraid/megaraid_sas_base.c

··· 3137 3137 /** 3138 3138 * megasas_bios_param - Returns disk geometry for a disk 3139 3139 * @sdev: device handle 3140 - * @bdev: block device 3140 + * @unused: gendisk 3141 3141 * @capacity: drive capacity 3142 3142 * @geom: geometry parameters 3143 3143 */ 3144 3144 static int 3145 - megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev, 3145 + megasas_bios_param(struct scsi_device *sdev, struct gendisk *unused, 3146 3146 sector_t capacity, int geom[]) 3147 3147 { 3148 3148 int heads;

+2 -2

drivers/scsi/mpi3mr/mpi3mr_os.c

··· 4031 4031 /** 4032 4032 * mpi3mr_bios_param - BIOS param callback 4033 4033 * @sdev: SCSI device reference 4034 - * @bdev: Block device reference 4034 + * @unused: gendisk reference 4035 4035 * @capacity: Capacity in logical sectors 4036 4036 * @params: Parameter array 4037 4037 * ··· 4040 4040 * Return: 0 always 4041 4041 */ 4042 4042 static int mpi3mr_bios_param(struct scsi_device *sdev, 4043 - struct block_device *bdev, sector_t capacity, int params[]) 4043 + struct gendisk *unused, sector_t capacity, int params[]) 4044 4044 { 4045 4045 int heads; 4046 4046 int sectors;

+2 -2

drivers/scsi/mpt3sas/mpt3sas_scsih.c

··· 2754 2754 /** 2755 2755 * scsih_bios_param - fetch head, sector, cylinder info for a disk 2756 2756 * @sdev: scsi device struct 2757 - * @bdev: pointer to block device context 2757 + * @unused: pointer to gendisk 2758 2758 * @capacity: device size (in 512 byte sectors) 2759 2759 * @params: three element array to place output: 2760 2760 * params[0] number of heads (max 255) ··· 2762 2762 * params[2] number of cylinders 2763 2763 */ 2764 2764 static int 2765 - scsih_bios_param(struct scsi_device *sdev, struct block_device *bdev, 2765 + scsih_bios_param(struct scsi_device *sdev, struct gendisk *unused, 2766 2766 sector_t capacity, int params[]) 2767 2767 { 2768 2768 int heads;

+1 -1

drivers/scsi/mvumi.c

··· 2142 2142 } 2143 2143 2144 2144 static int 2145 - mvumi_bios_param(struct scsi_device *sdev, struct block_device *bdev, 2145 + mvumi_bios_param(struct scsi_device *sdev, struct gendisk *unused, 2146 2146 sector_t capacity, int geom[]) 2147 2147 { 2148 2148 int heads, sectors;

+1 -1

drivers/scsi/myrb.c

··· 1745 1745 kfree(sdev->hostdata); 1746 1746 } 1747 1747 1748 - static int myrb_biosparam(struct scsi_device *sdev, struct block_device *bdev, 1748 + static int myrb_biosparam(struct scsi_device *sdev, struct gendisk *unused, 1749 1749 sector_t capacity, int geom[]) 1750 1750 { 1751 1751 struct myrb_hba *cb = shost_priv(sdev->host);

+1 -1

drivers/scsi/pcmcia/sym53c500_cs.c

··· 597 597 598 598 static int 599 599 SYM53C500_biosparm(struct scsi_device *disk, 600 - struct block_device *dev, 600 + struct gendisk *unused, 601 601 sector_t capacity, int *info_array) 602 602 { 603 603 int size;

+1 -1

drivers/scsi/ppa.c

··· 845 845 * be done in sd.c. Even if it gets fixed there, this will still 846 846 * work. 847 847 */ 848 - static int ppa_biosparam(struct scsi_device *sdev, struct block_device *dev, 848 + static int ppa_biosparam(struct scsi_device *sdev, struct gendisk *unused, 849 849 sector_t capacity, int ip[]) 850 850 { 851 851 ip[0] = 0x40;

+1 -1

drivers/scsi/qla1280.c

··· 1023 1023 } 1024 1024 1025 1025 static int 1026 - qla1280_biosparam(struct scsi_device *sdev, struct block_device *bdev, 1026 + qla1280_biosparam(struct scsi_device *sdev, struct gendisk *unused, 1027 1027 sector_t capacity, int geom[]) 1028 1028 { 1029 1029 int heads, sectors, cylinders;

+1 -1

drivers/scsi/qlogicfas408.c

··· 492 492 * Return bios parameters 493 493 */ 494 494 495 - int qlogicfas408_biosparam(struct scsi_device *disk, struct block_device *dev, 495 + int qlogicfas408_biosparam(struct scsi_device *disk, struct gendisk *unused, 496 496 sector_t capacity, int ip[]) 497 497 { 498 498 /* This should mimic the DOS Qlogic driver's behavior exactly */

+1 -1

drivers/scsi/qlogicfas408.h

··· 106 106 irqreturn_t qlogicfas408_ihandl(int irq, void *dev_id); 107 107 int qlogicfas408_queuecommand(struct Scsi_Host *h, struct scsi_cmnd * cmd); 108 108 int qlogicfas408_biosparam(struct scsi_device * disk, 109 - struct block_device *dev, 109 + struct gendisk *unused, 110 110 sector_t capacity, int ip[]); 111 111 int qlogicfas408_abort(struct scsi_cmnd * cmd); 112 112 extern int qlogicfas408_host_reset(struct scsi_cmnd *cmd);

+8 -8

drivers/scsi/scsicam.c

··· 30 30 * starting at offset %0x1be. 31 31 * Returns: partition table in kmalloc(GFP_KERNEL) memory, or NULL on error. 32 32 */ 33 - unsigned char *scsi_bios_ptable(struct block_device *dev) 33 + unsigned char *scsi_bios_ptable(struct gendisk *dev) 34 34 { 35 - struct address_space *mapping = bdev_whole(dev)->bd_mapping; 35 + struct address_space *mapping = dev->part0->bd_mapping; 36 36 unsigned char *res = NULL; 37 37 struct folio *folio; 38 38 ··· 48 48 49 49 /** 50 50 * scsi_partsize - Parse cylinders/heads/sectors from PC partition table 51 - * @bdev: block device to parse 51 + * @disk: gendisk of the disk to parse 52 52 * @capacity: size of the disk in sectors 53 53 * @geom: output in form of [hds, cylinders, sectors] 54 54 * ··· 57 57 * 58 58 * Returns: %false on failure, %true on success. 59 59 */ 60 - bool scsi_partsize(struct block_device *bdev, sector_t capacity, int geom[3]) 60 + bool scsi_partsize(struct gendisk *disk, sector_t capacity, int geom[3]) 61 61 { 62 62 int cyl, ext_cyl, end_head, end_cyl, end_sector; 63 63 unsigned int logical_end, physical_end, ext_physical_end; ··· 65 65 void *buf; 66 66 int ret = false; 67 67 68 - buf = scsi_bios_ptable(bdev); 68 + buf = scsi_bios_ptable(disk); 69 69 if (!buf) 70 70 return false; 71 71 ··· 205 205 206 206 /** 207 207 * scsicam_bios_param - Determine geometry of a disk in cylinders/heads/sectors. 208 - * @bdev: which device 208 + * @disk: which device 209 209 * @capacity: size of the disk in sectors 210 210 * @ip: return value: ip[0]=heads, ip[1]=sectors, ip[2]=cylinders 211 211 * ··· 215 215 * 216 216 * Returns : -1 on failure, 0 on success. 
217 217 */ 218 - int scsicam_bios_param(struct block_device *bdev, sector_t capacity, int *ip) 218 + int scsicam_bios_param(struct gendisk *disk, sector_t capacity, int *ip) 219 219 { 220 220 u64 capacity64 = capacity; /* Suppress gcc warning */ 221 221 int ret = 0; 222 222 223 223 /* try to infer mapping from partition table */ 224 - if (scsi_partsize(bdev, capacity, ip)) 224 + if (scsi_partsize(disk, capacity, ip)) 225 225 return 0; 226 226 227 227 if (capacity64 < (1ULL << 32)) {

+4 -4

drivers/scsi/sd.c

··· 1599 1599 scsi_device_put(sdev); 1600 1600 } 1601 1601 1602 - static int sd_getgeo(struct block_device *bdev, struct hd_geometry *geo) 1602 + static int sd_getgeo(struct gendisk *disk, struct hd_geometry *geo) 1603 1603 { 1604 - struct scsi_disk *sdkp = scsi_disk(bdev->bd_disk); 1604 + struct scsi_disk *sdkp = scsi_disk(disk); 1605 1605 struct scsi_device *sdp = sdkp->device; 1606 1606 struct Scsi_Host *host = sdp->host; 1607 1607 sector_t capacity = logical_to_sectors(sdp, sdkp->capacity); ··· 1614 1614 1615 1615 /* override with calculated, extended default, or driver values */ 1616 1616 if (host->hostt->bios_param) 1617 - host->hostt->bios_param(sdp, bdev, capacity, diskinfo); 1617 + host->hostt->bios_param(sdp, disk, capacity, diskinfo); 1618 1618 else 1619 - scsicam_bios_param(bdev, capacity, diskinfo); 1619 + scsicam_bios_param(disk, capacity, diskinfo); 1620 1620 1621 1621 geo->heads = diskinfo[0]; 1622 1622 geo->sectors = diskinfo[1];

+1 -1

drivers/scsi/stex.c

··· 1457 1457 } 1458 1458 1459 1459 static int stex_biosparam(struct scsi_device *sdev, 1460 - struct block_device *bdev, sector_t capacity, int geom[]) 1460 + struct gendisk *unused, sector_t capacity, int geom[]) 1461 1461 { 1462 1462 int heads = 255, sectors = 63; 1463 1463

+1 -1

drivers/scsi/storvsc_drv.c

··· 1615 1615 return 0; 1616 1616 } 1617 1617 1618 - static int storvsc_get_chs(struct scsi_device *sdev, struct block_device * bdev, 1618 + static int storvsc_get_chs(struct scsi_device *sdev, struct gendisk *unused, 1619 1619 sector_t capacity, int *info) 1620 1620 { 1621 1621 sector_t nsect = capacity;

+1 -1

drivers/scsi/wd719x.c

··· 544 544 return wd719x_chip_init(wd) == 0 ? SUCCESS : FAILED; 545 545 } 546 546 547 - static int wd719x_biosparam(struct scsi_device *sdev, struct block_device *bdev, 547 + static int wd719x_biosparam(struct scsi_device *sdev, struct gendisk *unused, 548 548 sector_t capacity, int geom[]) 549 549 { 550 550 if (capacity >= 0x200000) {

+1 -1

drivers/target/target_core_pscsi.c

··· 861 861 bio = bio_kmalloc(nr_vecs, GFP_KERNEL); 862 862 if (!bio) 863 863 goto fail; 864 - bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, 864 + bio_init_inline(bio, NULL, nr_vecs, 865 865 rw ? REQ_OP_WRITE : REQ_OP_READ); 866 866 bio->bi_end_io = pscsi_bi_endio; 867 867

+2 -3

fs/iomap/direct-io.c

··· 337 337 u64 copied = 0; 338 338 size_t orig_count; 339 339 340 - if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) || 341 - !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter)) 340 + if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1)) 342 341 return -EINVAL; 343 342 344 343 if (dio->flags & IOMAP_DIO_WRITE) { ··· 433 434 bio->bi_private = dio; 434 435 bio->bi_end_io = iomap_dio_bio_end_io; 435 436 436 - ret = bio_iov_iter_get_pages(bio, dio->submit.iter); 437 + ret = bio_iov_iter_get_bdev_pages(bio, dio->submit.iter, iomap->bdev); 437 438 if (unlikely(ret)) { 438 439 /* 439 440 * We have to stop part way through an IO. We must fall

+1 -1

fs/squashfs/block.c

··· 231 231 bio = bio_kmalloc(page_count, GFP_NOIO); 232 232 if (!bio) 233 233 return -ENOMEM; 234 - bio_init(bio, sb->s_bdev, bio->bi_inline_vecs, page_count, REQ_OP_READ); 234 + bio_init_inline(bio, sb->s_bdev, page_count, REQ_OP_READ); 235 235 bio->bi_iter.bi_sector = block * (msblk->devblksize >> SECTOR_SHIFT); 236 236 237 237 for (i = 0; i < page_count; ++i) {

+1

include/linux/bio-integrity.h

··· 13 13 BIP_CHECK_GUARD = 1 << 5, /* guard check */ 14 14 BIP_CHECK_REFTAG = 1 << 6, /* reftag check */ 15 15 BIP_CHECK_APPTAG = 1 << 7, /* apptag check */ 16 + BIP_P2P_DMA = 1 << 8, /* using P2P address */ 16 17 }; 17 18 18 19 struct bio_integrity_payload {

+15 -3

include/linux/bio.h

··· 322 322 void bio_trim(struct bio *bio, sector_t offset, sector_t size); 323 323 extern struct bio *bio_split(struct bio *bio, int sectors, 324 324 gfp_t gfp, struct bio_set *bs); 325 - int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim, 326 - unsigned *segs, unsigned max_bytes); 325 + int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, 326 + unsigned *segs, unsigned max_bytes, unsigned len_align); 327 327 328 328 /** 329 329 * bio_next_split - get next @sectors from a bio, splitting if necessary ··· 405 405 406 406 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table, 407 407 unsigned short max_vecs, blk_opf_t opf); 408 + static inline void bio_init_inline(struct bio *bio, struct block_device *bdev, 409 + unsigned short max_vecs, blk_opf_t opf) 410 + { 411 + bio_init(bio, bdev, bio_inline_vecs(bio), max_vecs, opf); 412 + } 408 413 extern void bio_uninit(struct bio *); 409 414 void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf); 410 415 void bio_chain(struct bio *, struct bio *); ··· 446 441 int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data, 447 442 size_t len, enum req_op op); 448 443 449 - int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter); 444 + int bio_iov_iter_get_pages_aligned(struct bio *bio, struct iov_iter *iter, 445 + unsigned len_align_mask); 446 + 447 + static inline int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) 448 + { 449 + return bio_iov_iter_get_pages_aligned(bio, iter, 0); 450 + } 451 + 450 452 void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter); 451 453 void __bio_release_pages(struct bio *bio, bool mark_dirty); 452 454 extern void bio_set_pages_dirty(struct bio *bio);
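Note on bio_init_inline() above: it is a thin wrapper that hands the bio's own trailing vector array to bio_init(), so callers such as the pscsi and squashfs hunks no longer name bi_inline_vecs directly. The sketch below models that shape in userspace with a flexible array member. All types and functions are hypothetical stand-ins, and malloc() merely plays the role of bio_kmalloc().

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct bio_vec and struct bio. */
struct vec_sketch {
	void *page;
	unsigned int len;
};

struct bio_sketch {
	struct vec_sketch *table;	/* vector table in use */
	unsigned short max_vecs;
	struct vec_sketch inline_vecs[];	/* storage trailing the struct */
};

/* General initializer: the caller chooses the vector table. */
static void bio_sketch_init(struct bio_sketch *bio, struct vec_sketch *table,
			    unsigned short max_vecs)
{
	bio->table = table;
	bio->max_vecs = max_vecs;
}

/* Convenience wrapper: always use the storage that follows the struct. */
static inline void bio_sketch_init_inline(struct bio_sketch *bio,
					  unsigned short max_vecs)
{
	bio_sketch_init(bio, bio->inline_vecs, max_vecs);
}

/* Allocate the struct plus room for nr_vecs inline vectors. */
static struct bio_sketch *bio_sketch_alloc(unsigned short nr_vecs)
{
	return malloc(sizeof(struct bio_sketch) +
		      nr_vecs * sizeof(struct vec_sketch));
}
```

The wrapper keeps the layout detail in one place, which is presumably why the call sites in this merge switch to it.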

+32

include/linux/blk-integrity.h

··· 4 4 5 5 #include <linux/blk-mq.h> 6 6 #include <linux/bio-integrity.h> 7 + #include <linux/blk-mq-dma.h> 7 8 8 9 struct request; 9 10 ··· 27 26 28 27 #ifdef CONFIG_BLK_DEV_INTEGRITY 29 28 int blk_rq_map_integrity_sg(struct request *, struct scatterlist *); 29 + 30 + static inline bool blk_rq_integrity_dma_unmap(struct request *req, 31 + struct device *dma_dev, struct dma_iova_state *state, 32 + size_t mapped_len) 33 + { 34 + return blk_dma_unmap(req, dma_dev, state, mapped_len, 35 + bio_integrity(req->bio)->bip_flags & BIP_P2P_DMA); 36 + } 37 + 30 38 int blk_rq_count_integrity_sg(struct request_queue *, struct bio *); 31 39 int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf, 32 40 ssize_t bytes); 33 41 int blk_get_meta_cap(struct block_device *bdev, unsigned int cmd, 34 42 struct logical_block_metadata_cap __user *argp); 43 + bool blk_rq_integrity_dma_map_iter_start(struct request *req, 44 + struct device *dma_dev, struct dma_iova_state *state, 45 + struct blk_dma_iter *iter); 46 + bool blk_rq_integrity_dma_map_iter_next(struct request *req, 47 + struct device *dma_dev, struct blk_dma_iter *iter); 35 48 36 49 static inline bool 37 50 blk_integrity_queue_supports_integrity(struct request_queue *q) ··· 124 109 { 125 110 return 0; 126 111 } 112 + static inline bool blk_rq_integrity_dma_unmap(struct request *req, 113 + struct device *dma_dev, struct dma_iova_state *state, 114 + size_t mapped_len) 115 + { 116 + return false; 117 + } 127 118 static inline int blk_rq_integrity_map_user(struct request *rq, 128 119 void __user *ubuf, 129 120 ssize_t bytes) 130 121 { 131 122 return -EINVAL; 123 + } 124 + static inline bool blk_rq_integrity_dma_map_iter_start(struct request *req, 125 + struct device *dma_dev, struct dma_iova_state *state, 126 + struct blk_dma_iter *iter) 127 + { 128 + return false; 129 + } 130 + static inline bool blk_rq_integrity_dma_map_iter_next(struct request *req, 131 + struct device *dma_dev, struct blk_dma_iter *iter) 132 + { 133 + 
return false; 132 134 } 133 135 static inline struct blk_integrity *bdev_get_integrity(struct block_device *b) 134 136 {

+20 -5

include/linux/blk-mq-dma.h

··· 5 5 #include <linux/blk-mq.h> 6 6 #include <linux/pci-p2pdma.h> 7 7 8 + struct blk_map_iter { 9 + struct bvec_iter iter; 10 + struct bio *bio; 11 + struct bio_vec *bvecs; 12 + bool is_integrity; 13 + }; 14 + 8 15 struct blk_dma_iter { 9 16 /* Output address range for this iteration */ 10 17 dma_addr_t addr; ··· 21 14 blk_status_t status; 22 15 23 16 /* Internal to blk_rq_dma_map_iter_* */ 24 - struct req_iterator iter; 17 + struct blk_map_iter iter; 25 18 struct pci_p2pdma_map_state p2pdma; 26 19 }; 27 20 ··· 43 36 } 44 37 45 38 /** 46 - * blk_rq_dma_unmap - try to DMA unmap a request 39 + * blk_dma_unmap - try to DMA unmap a request 47 40 * @req: request to unmap 48 41 * @dma_dev: device to unmap from 49 42 * @state: DMA IOVA state 50 43 * @mapped_len: number of bytes to unmap 44 + * @is_p2p: true if mapped with PCI_P2PDMA_MAP_BUS_ADDR 51 45 * 52 46 * Returns %false if the callers need to manually unmap every DMA segment 53 47 * mapped using @iter or %true if no work is left to be done. 54 48 */ 55 - static inline bool blk_rq_dma_unmap(struct request *req, struct device *dma_dev, 56 - struct dma_iova_state *state, size_t mapped_len) 49 + static inline bool blk_dma_unmap(struct request *req, struct device *dma_dev, 50 + struct dma_iova_state *state, size_t mapped_len, bool is_p2p) 57 51 { 58 - if (req->cmd_flags & REQ_P2PDMA) 52 + if (is_p2p) 59 53 return true; 60 54 61 55 if (dma_use_iova(state)) { ··· 66 58 } 67 59 68 60 return !dma_need_unmap(dma_dev); 61 + } 62 + 63 + static inline bool blk_rq_dma_unmap(struct request *req, struct device *dma_dev, 64 + struct dma_iova_state *state, size_t mapped_len) 65 + { 66 + return blk_dma_unmap(req, dma_dev, state, mapped_len, 67 + req->cmd_flags & REQ_P2PDMA); 69 68 } 70 69 71 70 #endif /* BLK_MQ_DMA_H */

+4

include/linux/blk-mq.h

··· 507 507 * request_queue.tag_set_list. 508 508 * @srcu: Use as lock when type of the request queue is blocking 509 509 * (BLK_MQ_F_BLOCKING). 510 + * @tags_srcu: SRCU used to defer freeing of tags page_list to prevent 511 + * use-after-free when iterating tags. 510 512 * @update_nr_hwq_lock: 511 513 * Synchronize updating nr_hw_queues with add/del disk & 512 514 * switching elevator. ··· 533 531 struct mutex tag_list_lock; 534 532 struct list_head tag_list; 535 533 struct srcu_struct *srcu; 534 + struct srcu_struct tags_srcu; 536 535 537 536 struct rw_semaphore update_nr_hwq_lock; 538 537 }; ··· 770 767 * request pool 771 768 */ 772 769 spinlock_t lock; 770 + struct rcu_head rcu_head; 773 771 }; 774 772 775 773 static inline struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags,

+7 -12

include/linux/blk_types.h

··· 198 198 return true; 199 199 } 200 200 201 - struct bio_issue { 202 - u64 value; 203 - }; 204 - 205 201 typedef __u32 __bitwise blk_opf_t; 206 202 207 203 typedef unsigned int blk_qc_t; ··· 238 242 * on release of the bio. 239 243 */ 240 244 struct blkcg_gq *bi_blkg; 241 - struct bio_issue bi_issue; 245 + /* Time that this bio was issued. */ 246 + u64 issue_time_ns; 242 247 #ifdef CONFIG_BLK_CGROUP_IOCOST 243 248 u64 bi_iocost_cost; 244 249 #endif ··· 266 269 struct bio_vec *bi_io_vec; /* the actual vec list */ 267 270 268 271 struct bio_set *bi_pool; 269 - 270 - /* 271 - * We can inline a number of vecs at the end of the bio, to avoid 272 - * double allocations for a small number of bio_vecs. This member 273 - * MUST obviously be kept at the very end of the bio. 274 - */ 275 - struct bio_vec bi_inline_vecs[]; 276 272 }; 277 273 278 274 #define BIO_RESET_BYTES offsetof(struct bio, bi_max_vecs) 279 275 #define BIO_MAX_SECTORS (UINT_MAX >> SECTOR_SHIFT) 276 + 277 + static inline struct bio_vec *bio_inline_vecs(struct bio *bio) 278 + { 279 + return (struct bio_vec *)(bio + 1); 280 + } 280 281 281 282 /* 282 283 * bio flags
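The `bio_inline_vecs()` helper above replaces the `bi_inline_vecs[]` flexible array member with pointer arithmetic: the inline vectors are assumed to sit directly after the structure in the same allocation. A minimal userspace sketch of that layout follows. `struct demo_vec`, `struct demo_bio`, and `demo_bio_alloc()` are hypothetical stand-ins, not the kernel types or allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct bio_vec and struct bio. */
struct demo_vec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

struct demo_bio {
	unsigned short max_vecs;
	struct demo_vec *io_vec;	/* points at the inline vectors here */
};

/* Same idea as bio_inline_vecs(): the vectors start right after the struct. */
static inline struct demo_vec *demo_inline_vecs(struct demo_bio *bio)
{
	return (struct demo_vec *)(bio + 1);
}

/* One zeroed allocation holds the header plus nr_vecs trailing vectors. */
static struct demo_bio *demo_bio_alloc(unsigned short nr_vecs)
{
	struct demo_bio *bio;

	bio = calloc(1, sizeof(*bio) + nr_vecs * sizeof(struct demo_vec));
	if (!bio)
		return NULL;
	bio->max_vecs = nr_vecs;
	bio->io_vec = demo_inline_vecs(bio);
	return bio;
}
```

The cast is only valid when the caller really allocated room behind the structure, which is presumably why the kernel wraps it in a named helper instead of leaving `bio + 1` at each call site.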

+18 -8

include/linux/blkdev.h

··· 657 657 QUEUE_FLAG_DISABLE_WBT_DEF, /* for sched to disable/enable wbt */ 658 658 QUEUE_FLAG_NO_ELV_SWITCH, /* can't switch elevator any more */ 659 659 QUEUE_FLAG_QOS_ENABLED, /* qos is enabled */ 660 + QUEUE_FLAG_BIO_ISSUE_TIME, /* record bio->issue_time_ns */ 660 661 QUEUE_FLAG_MAX 661 662 }; 662 663 ··· 1000 999 extern void blk_unregister_queue(struct gendisk *disk); 1001 1000 void submit_bio_noacct(struct bio *bio); 1002 1001 struct bio *bio_split_to_limits(struct bio *bio); 1002 + struct bio *bio_submit_split_bioset(struct bio *bio, unsigned int split_sectors, 1003 + struct bio_set *bs); 1003 1004 1004 1005 extern int blk_lld_busy(struct request_queue *q); 1005 1006 extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags); ··· 1593 1590 return queue_dma_alignment(bdev_get_queue(bdev)); 1594 1591 } 1595 1592 1596 - static inline bool bdev_iter_is_aligned(struct block_device *bdev, 1597 - struct iov_iter *iter) 1598 - { 1599 - return iov_iter_is_aligned(iter, bdev_dma_alignment(bdev), 1600 - bdev_logical_block_size(bdev) - 1); 1601 - } 1602 - 1603 1593 static inline unsigned int 1604 1594 blk_lim_dma_alignment_and_pad(struct queue_limits *lim) 1605 1595 { ··· 1656 1660 unsigned int (*check_events) (struct gendisk *disk, 1657 1661 unsigned int clearing); 1658 1662 void (*unlock_native_capacity) (struct gendisk *); 1659 - int (*getgeo)(struct block_device *, struct hd_geometry *); 1663 + int (*getgeo)(struct gendisk *, struct hd_geometry *); 1660 1664 int (*set_read_only)(struct block_device *bdev, bool ro); 1661 1665 void (*free_disk)(struct gendisk *disk); 1662 1666 /* this callback is with swap_lock and sometimes page table lock held */ ··· 1864 1868 if (!bdev_can_atomic_write(bdev)) 1865 1869 return 0; 1866 1870 return queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev)); 1871 + } 1872 + 1873 + static inline int bio_split_rw_at(struct bio *bio, 1874 + const struct queue_limits *lim, 1875 + unsigned *segs, unsigned max_bytes) 1876 + 
{ 1877 + return bio_split_io_at(bio, lim, segs, max_bytes, lim->dma_alignment); 1878 + } 1879 + 1880 + static inline int bio_iov_iter_get_bdev_pages(struct bio *bio, 1881 + struct iov_iter *iter, struct block_device *bdev) 1882 + { 1883 + return bio_iov_iter_get_pages_aligned(bio, iter, 1884 + bdev_logical_block_size(bdev) - 1); 1867 1885 } 1868 1886 1869 1887 #define DEFINE_IO_COMP_BATCH(name) struct io_comp_batch name = { }

+1 -1

include/linux/libata.h

··· 1203 1203 extern u64 ata_qc_get_active(struct ata_port *ap); 1204 1204 extern void ata_scsi_simulate(struct ata_device *dev, struct scsi_cmnd *cmd); 1205 1205 extern int ata_std_bios_param(struct scsi_device *sdev, 1206 - struct block_device *bdev, 1206 + struct gendisk *unused, 1207 1207 sector_t capacity, int geom[]); 1208 1208 extern void ata_scsi_unlock_native_capacity(struct scsi_device *sdev); 1209 1209 extern int ata_scsi_sdev_init(struct scsi_device *sdev);

-2

include/linux/uio.h

··· 286 286 #endif 287 287 288 288 size_t iov_iter_zero(size_t bytes, struct iov_iter *); 289 - bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask, 290 - unsigned len_mask); 291 289 unsigned long iov_iter_alignment(const struct iov_iter *i); 292 290 unsigned long iov_iter_gap_alignment(const struct iov_iter *i); 293 291 void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov,

+1 -1

include/scsi/libsas.h

··· 685 685 extern int sas_target_alloc(struct scsi_target *); 686 686 int sas_sdev_configure(struct scsi_device *dev, struct queue_limits *lim); 687 687 extern int sas_change_queue_depth(struct scsi_device *, int new_depth); 688 - extern int sas_bios_param(struct scsi_device *, struct block_device *, 688 + extern int sas_bios_param(struct scsi_device *, struct gendisk *, 689 689 sector_t capacity, int *hsc); 690 690 int sas_execute_internal_abort_single(struct domain_device *device, 691 691 u16 tag, unsigned int qid,

+1 -1

include/scsi/scsi_host.h

··· 318 318 * 319 319 * Status: OPTIONAL 320 320 */ 321 - int (* bios_param)(struct scsi_device *, struct block_device *, 321 + int (* bios_param)(struct scsi_device *, struct gendisk *, 322 322 sector_t, int []); 323 323 324 324 /*

+4 -3

include/scsi/scsicam.h

··· 13 13 14 14 #ifndef SCSICAM_H 15 15 #define SCSICAM_H 16 - int scsicam_bios_param(struct block_device *bdev, sector_t capacity, int *ip); 17 - bool scsi_partsize(struct block_device *bdev, sector_t capacity, int geom[3]); 18 - unsigned char *scsi_bios_ptable(struct block_device *bdev); 16 + struct gendisk; 17 + int scsicam_bios_param(struct gendisk *disk, sector_t capacity, int *ip); 18 + bool scsi_partsize(struct gendisk *disk, sector_t capacity, int geom[3]); 19 + unsigned char *scsi_bios_ptable(struct gendisk *disk); 19 20 #endif /* def SCSICAM_H */

-95

lib/iov_iter.c

··· 784 784 } 785 785 EXPORT_SYMBOL(iov_iter_discard); 786 786 787 - static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask, 788 - unsigned len_mask) 789 - { 790 - const struct iovec *iov = iter_iov(i); 791 - size_t size = i->count; 792 - size_t skip = i->iov_offset; 793 - 794 - do { 795 - size_t len = iov->iov_len - skip; 796 - 797 - if (len > size) 798 - len = size; 799 - if (len & len_mask) 800 - return false; 801 - if ((unsigned long)(iov->iov_base + skip) & addr_mask) 802 - return false; 803 - 804 - iov++; 805 - size -= len; 806 - skip = 0; 807 - } while (size); 808 - 809 - return true; 810 - } 811 - 812 - static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask, 813 - unsigned len_mask) 814 - { 815 - const struct bio_vec *bvec = i->bvec; 816 - unsigned skip = i->iov_offset; 817 - size_t size = i->count; 818 - 819 - do { 820 - size_t len = bvec->bv_len - skip; 821 - 822 - if (len > size) 823 - len = size; 824 - if (len & len_mask) 825 - return false; 826 - if ((unsigned long)(bvec->bv_offset + skip) & addr_mask) 827 - return false; 828 - 829 - bvec++; 830 - size -= len; 831 - skip = 0; 832 - } while (size); 833 - 834 - return true; 835 - } 836 - 837 - /** 838 - * iov_iter_is_aligned() - Check if the addresses and lengths of each segments 839 - * are aligned to the parameters. 
840 - * 841 - * @i: &struct iov_iter to restore 842 - * @addr_mask: bit mask to check against the iov element's addresses 843 - * @len_mask: bit mask to check against the iov element's lengths 844 - * 845 - * Return: false if any addresses or lengths intersect with the provided masks 846 - */ 847 - bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask, 848 - unsigned len_mask) 849 - { 850 - if (likely(iter_is_ubuf(i))) { 851 - if (i->count & len_mask) 852 - return false; 853 - if ((unsigned long)(i->ubuf + i->iov_offset) & addr_mask) 854 - return false; 855 - return true; 856 - } 857 - 858 - if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) 859 - return iov_iter_aligned_iovec(i, addr_mask, len_mask); 860 - 861 - if (iov_iter_is_bvec(i)) 862 - return iov_iter_aligned_bvec(i, addr_mask, len_mask); 863 - 864 - /* With both xarray and folioq types, we're dealing with whole folios. */ 865 - if (iov_iter_is_xarray(i)) { 866 - if (i->count & len_mask) 867 - return false; 868 - if ((i->xarray_start + i->iov_offset) & addr_mask) 869 - return false; 870 - } 871 - if (iov_iter_is_folioq(i)) { 872 - if (i->count & len_mask) 873 - return false; 874 - if (i->iov_offset & addr_mask) 875 - return false; 876 - } 877 - 878 - return true; 879 - } 880 - EXPORT_SYMBOL_GPL(iov_iter_is_aligned); 881 - 882 787 static unsigned long iov_iter_alignment_iovec(const struct iov_iter *i) 883 788 { 884 789 const struct iovec *iov = iter_iov(i);
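For reference, the check that the removed `iov_iter_aligned_iovec()` performed can be sketched in userspace as below. This is a simplified analogue under stated assumptions: `struct demo_seg` and `demo_segs_are_aligned()` are hypothetical names, and the iterator offset and remaining byte count that the kernel helper tracked are left out, so every segment is checked in full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment descriptor, standing in for struct iovec. */
struct demo_seg {
	void *base;
	size_t len;
};

/*
 * Returns false if any segment address intersects addr_mask or any
 * segment length intersects len_mask, true otherwise.
 */
static bool demo_segs_are_aligned(const struct demo_seg *segs, size_t nr_segs,
				  unsigned int addr_mask, unsigned int len_mask)
{
	assert(segs != NULL || nr_segs == 0);

	for (size_t i = 0; i < nr_segs; i++) {
		if (segs[i].len & len_mask)
			return false;
		if ((uintptr_t)segs[i].base & addr_mask)
			return false;
	}
	return true;
}
```

A mask is the alignment minus one, so a 512-byte requirement is passed as `511`, in the same way the diff passes `bdev_logical_block_size(bdev) - 1`.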

+13

rust/kernel/block.rs

··· 3 3 //! Types for working with the block layer. 4 4 5 5 pub mod mq; 6 + 7 + /// Bit mask for masking out [`SECTOR_SIZE`]. 8 + pub const SECTOR_MASK: u32 = bindings::SECTOR_MASK; 9 + 10 + /// Sectors are size `1 << SECTOR_SHIFT`. 11 + pub const SECTOR_SHIFT: u32 = bindings::SECTOR_SHIFT; 12 + 13 + /// Size of a sector. 14 + pub const SECTOR_SIZE: u32 = bindings::SECTOR_SIZE; 15 + 16 + /// The difference between the size of a page and the size of a sector, 17 + /// expressed as a power of two. 18 + pub const PAGE_SECTORS_SHIFT: u32 = bindings::PAGE_SECTORS_SHIFT;
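The four constants re-exported above are related by simple shifts. The C sketch below shows how they are typically combined. The `DEMO_*` values assume 512-byte sectors and 4 KiB pages; the real values come from the kernel bindings and the page size depends on the architecture. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration: 512-byte sectors, 4 KiB pages. */
#define DEMO_SECTOR_SHIFT	9
#define DEMO_SECTOR_SIZE	(1u << DEMO_SECTOR_SHIFT)
#define DEMO_SECTOR_MASK	(DEMO_SECTOR_SIZE - 1)
#define DEMO_PAGE_SHIFT		12
#define DEMO_PAGE_SECTORS_SHIFT	(DEMO_PAGE_SHIFT - DEMO_SECTOR_SHIFT)

/* Byte count to whole sectors (rounds down). */
static inline uint64_t demo_bytes_to_sectors(uint64_t bytes)
{
	return bytes >> DEMO_SECTOR_SHIFT;
}

/* Sector count to bytes. */
static inline uint64_t demo_sectors_to_bytes(uint64_t sectors)
{
	return sectors << DEMO_SECTOR_SHIFT;
}

/* Page count to sectors. */
static inline uint64_t demo_pages_to_sectors(uint64_t pages)
{
	return pages << DEMO_PAGE_SECTORS_SHIFT;
}

/* True if a byte offset or length falls on a sector boundary. */
static inline bool demo_is_sector_aligned(uint64_t bytes)
{
	return (bytes & DEMO_SECTOR_MASK) == 0;
}
```

Under these assumptions, the `capacity_sectors(4096)` call in the `mq.rs` doc example corresponds to a 2 MiB disk.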

+10 -4

rust/kernel/block/mq.rs

··· 69 69 //! 70 70 //! #[vtable] 71 71 //! impl Operations for MyBlkDevice { 72 + //! type QueueData = (); 72 73 //! 73 - //! fn queue_rq(rq: ARef<Request<Self>>, _is_last: bool) -> Result { 74 + //! fn queue_rq(_queue_data: (), rq: ARef<Request<Self>>, _is_last: bool) -> Result { 74 75 //! Request::end_ok(rq); 75 76 //! Ok(()) 76 77 //! } 77 78 //! 78 - //! fn commit_rqs() {} 79 + //! fn commit_rqs(_queue_data: ()) {} 80 + //! 81 + //! fn complete(rq: ARef<Request<Self>>) { 82 + //! Request::end_ok(rq) 83 + //! .map_err(|_e| kernel::error::code::EIO) 84 + //! .expect("Fatal error - expected to be able to end request"); 85 + //! } 79 86 //! } 80 87 //! 81 88 //! let tagset: Arc<TagSet<MyBlkDevice>> = 82 89 //! Arc::pin_init(TagSet::new(1, 256, 1), flags::GFP_KERNEL)?; 83 90 //! let mut disk = gen_disk::GenDiskBuilder::new() 84 91 //! .capacity_sectors(4096) 85 - //! .build(fmt!("myblk"), tagset)?; 92 + //! .build(fmt!("myblk"), tagset, ())?; 86 93 //! 87 94 //! # Ok::<(), kernel::error::Error>(()) 88 95 //! ``` 89 96 90 97 pub mod gen_disk; 91 98 mod operations; 92 - mod raw_writer; 93 99 mod request; 94 100 mod tag_set; 95 101

+43 -13

rust/kernel/block/mq/gen_disk.rs

··· 5 5 //! C header: [`include/linux/blkdev.h`](srctree/include/linux/blkdev.h) 6 6 //! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h) 7 7 8 - use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet}; 9 - use crate::fmt::{self, Write}; 10 - use crate::{bindings, error::from_err_ptr, error::Result, sync::Arc}; 11 - use crate::{error, static_lock_class}; 8 + use crate::{ 9 + bindings, 10 + block::mq::{Operations, TagSet}, 11 + error::{self, from_err_ptr, Result}, 12 + fmt::{self, Write}, 13 + prelude::*, 14 + static_lock_class, 15 + str::NullTerminatedFormatter, 16 + sync::Arc, 17 + types::{ForeignOwnable, ScopeGuard}, 18 + }; 12 19 13 20 /// A builder for [`GenDisk`]. 14 21 /// ··· 52 45 53 46 /// Validate block size by verifying that it is between 512 and `PAGE_SIZE`, 54 47 /// and that it is a power of two. 55 - fn validate_block_size(size: u32) -> Result { 48 + pub fn validate_block_size(size: u32) -> Result { 56 49 if !(512..=bindings::PAGE_SIZE as u32).contains(&size) || !size.is_power_of_two() { 57 50 Err(error::code::EINVAL) 58 51 } else { ··· 99 92 self, 100 93 name: fmt::Arguments<'_>, 101 94 tagset: Arc<TagSet<T>>, 95 + queue_data: T::QueueData, 102 96 ) -> Result<GenDisk<T>> { 97 + let data = queue_data.into_foreign(); 98 + let recover_data = ScopeGuard::new(|| { 99 + // SAFETY: T::QueueData was created by the call to `into_foreign()` above 100 + drop(unsafe { T::QueueData::from_foreign(data) }); 101 + }); 102 + 103 103 // SAFETY: `bindings::queue_limits` contain only fields that are valid when zeroed. 
104 104 let mut lim: bindings::queue_limits = unsafe { core::mem::zeroed() }; 105 105 ··· 121 107 bindings::__blk_mq_alloc_disk( 122 108 tagset.raw_tag_set(), 123 109 &mut lim, 124 - core::ptr::null_mut(), 110 + data, 125 111 static_lock_class!().as_ptr(), 126 112 ) 127 113 })?; ··· 153 139 // SAFETY: `gendisk` is a valid pointer as we initialized it above 154 140 unsafe { (*gendisk).fops = &TABLE }; 155 141 156 - let mut raw_writer = RawWriter::from_array( 142 + let mut writer = NullTerminatedFormatter::new( 157 143 // SAFETY: `gendisk` points to a valid and initialized instance. We 158 144 // have exclusive access, since the disk is not added to the VFS 159 145 // yet. 160 146 unsafe { &mut (*gendisk).disk_name }, 161 - )?; 162 - raw_writer.write_fmt(name)?; 163 - raw_writer.write_char('\0')?; 147 + ) 148 + .ok_or(EINVAL)?; 149 + writer.write_fmt(name)?; 164 150 165 151 // SAFETY: `gendisk` points to a valid and initialized instance of 166 152 // `struct gendisk`. `set_capacity` takes a lock to synchronize this ··· 175 161 }, 176 162 )?; 177 163 164 + recover_data.dismiss(); 165 + 178 166 // INVARIANT: `gendisk` was initialized above. 179 167 // INVARIANT: `gendisk` was added to the VFS via `device_add_disk` above. 168 + // INVARIANT: `gendisk.queue.queue_data` is set to `data` in the call to 169 + // `__blk_mq_alloc_disk` above. 180 170 Ok(GenDisk { 181 171 _tagset: tagset, 182 172 gendisk, ··· 192 174 /// 193 175 /// # Invariants 194 176 /// 195 - /// - `gendisk` must always point to an initialized and valid `struct gendisk`. 196 - /// - `gendisk` was added to the VFS through a call to 197 - /// `bindings::device_add_disk`. 177 + /// - `gendisk` must always point to an initialized and valid `struct gendisk`. 178 + /// - `gendisk` was added to the VFS through a call to 179 + /// `bindings::device_add_disk`. 180 + /// - `self.gendisk.queue.queuedata` is initialized by a call to `ForeignOwnable::into_foreign`. 
198 181 pub struct GenDisk<T: Operations> { 199 182 _tagset: Arc<TagSet<T>>, 200 183 gendisk: *mut bindings::gendisk, ··· 207 188 208 189 impl<T: Operations> Drop for GenDisk<T> { 209 190 fn drop(&mut self) { 191 + // SAFETY: By type invariant of `Self`, `self.gendisk` points to a valid 192 + // and initialized instance of `struct gendisk`, and, `queuedata` was 193 + // initialized with the result of a call to 194 + // `ForeignOwnable::into_foreign`. 195 + let queue_data = unsafe { (*(*self.gendisk).queue).queuedata }; 196 + 210 197 // SAFETY: By type invariant, `self.gendisk` points to a valid and 211 198 // initialized instance of `struct gendisk`, and it was previously added 212 199 // to the VFS. 213 200 unsafe { bindings::del_gendisk(self.gendisk) }; 201 + 202 + // SAFETY: `queue.queuedata` was created by `GenDiskBuilder::build` with 203 + // a call to `ForeignOwnable::into_foreign` to create `queuedata`. 204 + // `ForeignOwnable::from_foreign` is only called here. 205 + drop(unsafe { T::QueueData::from_foreign(queue_data) }); 214 206 } 215 207 }

+52 -13

rust/kernel/block/mq/operations.rs

··· 6 6 7 7 use crate::{ 8 8 bindings, 9 - block::mq::request::RequestDataWrapper, 10 - block::mq::Request, 9 + block::mq::{request::RequestDataWrapper, Request}, 11 10 error::{from_result, Result}, 12 11 prelude::*, 13 12 sync::Refcount, 14 - types::ARef, 13 + types::{ARef, ForeignOwnable}, 15 14 }; 16 15 use core::marker::PhantomData; 16 + 17 + type ForeignBorrowed<'a, T> = <T as ForeignOwnable>::Borrowed<'a>; 17 18 18 19 /// Implement this trait to interface blk-mq as block devices. 19 20 /// ··· 28 27 /// [module level documentation]: kernel::block::mq 29 28 #[macros::vtable] 30 29 pub trait Operations: Sized { 30 + /// Data associated with the `struct request_queue` that is allocated for 31 + /// the `GenDisk` associated with this `Operations` implementation. 32 + type QueueData: ForeignOwnable; 33 + 31 34 /// Called by the kernel to queue a request with the driver. If `is_last` is 32 35 /// `false`, the driver is allowed to defer committing the request. 33 - fn queue_rq(rq: ARef<Request<Self>>, is_last: bool) -> Result; 36 + fn queue_rq( 37 + queue_data: ForeignBorrowed<'_, Self::QueueData>, 38 + rq: ARef<Request<Self>>, 39 + is_last: bool, 40 + ) -> Result; 34 41 35 42 /// Called by the kernel to indicate that queued requests should be submitted. 36 - fn commit_rqs(); 43 + fn commit_rqs(queue_data: ForeignBorrowed<'_, Self::QueueData>); 44 + 45 + /// Called by the kernel when the request is completed. 46 + fn complete(rq: ARef<Request<Self>>); 37 47 38 48 /// Called by the kernel to poll the device for completed requests. Only 39 49 /// used for poll queues. ··· 83 71 /// promise to not access the request until the driver calls 84 72 /// `bindings::blk_mq_end_request` for the request. 
85 73 unsafe extern "C" fn queue_rq_callback( 86 - _hctx: *mut bindings::blk_mq_hw_ctx, 74 + hctx: *mut bindings::blk_mq_hw_ctx, 87 75 bd: *const bindings::blk_mq_queue_data, 88 76 ) -> bindings::blk_status_t { 89 77 // SAFETY: `bd.rq` is valid as required by the safety requirement for ··· 101 89 // reference counted by `ARef` until then. 102 90 let rq = unsafe { Request::aref_from_raw((*bd).rq) }; 103 91 92 + // SAFETY: `hctx` is valid as required by this function. 93 + let queue_data = unsafe { (*(*hctx).queue).queuedata }; 94 + 95 + // SAFETY: `queue.queuedata` was created by `GenDiskBuilder::build` with 96 + // a call to `ForeignOwnable::into_foreign` to create `queuedata`. 97 + // `ForeignOwnable::from_foreign` is only called when the tagset is 98 + // dropped, which happens after we are dropped. 99 + let queue_data = unsafe { T::QueueData::borrow(queue_data) }; 100 + 104 101 // SAFETY: We have exclusive access and we just set the refcount above. 105 102 unsafe { Request::start_unchecked(&rq) }; 106 103 107 104 let ret = T::queue_rq( 105 + queue_data, 108 106 rq, 109 107 // SAFETY: `bd` is valid as required by the safety requirement for 110 108 // this function. ··· 133 111 /// 134 112 /// # Safety 135 113 /// 136 - /// This function may only be called by blk-mq C infrastructure. 137 - unsafe extern "C" fn commit_rqs_callback(_hctx: *mut bindings::blk_mq_hw_ctx) { 138 - T::commit_rqs() 114 + /// This function may only be called by blk-mq C infrastructure. The caller 115 + /// must ensure that `hctx` is valid. 116 + unsafe extern "C" fn commit_rqs_callback(hctx: *mut bindings::blk_mq_hw_ctx) { 117 + // SAFETY: `hctx` is valid as required by this function. 118 + let queue_data = unsafe { (*(*hctx).queue).queuedata }; 119 + 120 + // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a 121 + // call to `ForeignOwnable::into_foreign()` to create `queuedata`. 
122 + // `ForeignOwnable::from_foreign()` is only called when the tagset is 123 + // dropped, which happens after we are dropped. 124 + let queue_data = unsafe { T::QueueData::borrow(queue_data) }; 125 + T::commit_rqs(queue_data) 139 126 } 140 127 141 - /// This function is called by the C kernel. It is not currently 142 - /// implemented, and there is no way to exercise this code path. 128 + /// This function is called by the C kernel. A pointer to this function is 129 + /// installed in the `blk_mq_ops` vtable for the driver. 143 130 /// 144 131 /// # Safety 145 132 /// 146 - /// This function may only be called by blk-mq C infrastructure. 147 - unsafe extern "C" fn complete_callback(_rq: *mut bindings::request) {} 133 + /// This function may only be called by blk-mq C infrastructure. `rq` must 134 + /// point to a valid request that has been marked as completed. The pointee 135 + /// of `rq` must be valid for write for the duration of this function. 136 + unsafe extern "C" fn complete_callback(rq: *mut bindings::request) { 137 + // SAFETY: This function can only be dispatched through 138 + // `Request::complete`. We leaked a refcount then which we pick back up 139 + // now. 140 + let aref = unsafe { Request::aref_from_raw(rq) }; 141 + T::complete(aref); 142 + } 148 143 149 144 /// This function is called by the C kernel. A pointer to this function is 150 145 /// installed in the `blk_mq_ops` vtable for the driver.

-54

rust/kernel/block/mq/raw_writer.rs

··· 1 - // SPDX-License-Identifier: GPL-2.0 2 - 3 - use crate::error::Result; 4 - use crate::fmt::{self, Write}; 5 - use crate::prelude::EINVAL; 6 - 7 - /// A mutable reference to a byte buffer where a string can be written into. 8 - /// 9 - /// # Invariants 10 - /// 11 - /// `buffer` is always null terminated. 12 - pub(crate) struct RawWriter<'a> { 13 - buffer: &'a mut [u8], 14 - pos: usize, 15 - } 16 - 17 - impl<'a> RawWriter<'a> { 18 - /// Create a new `RawWriter` instance. 19 - fn new(buffer: &'a mut [u8]) -> Result<RawWriter<'a>> { 20 - *(buffer.last_mut().ok_or(EINVAL)?) = 0; 21 - 22 - // INVARIANT: We null terminated the buffer above. 23 - Ok(Self { buffer, pos: 0 }) 24 - } 25 - 26 - pub(crate) fn from_array<const N: usize>( 27 - a: &'a mut [crate::ffi::c_char; N], 28 - ) -> Result<RawWriter<'a>> { 29 - Self::new( 30 - // SAFETY: the buffer of `a` is valid for read and write as `u8` for 31 - // at least `N` bytes. 32 - unsafe { core::slice::from_raw_parts_mut(a.as_mut_ptr().cast::<u8>(), N) }, 33 - ) 34 - } 35 - } 36 - 37 - impl Write for RawWriter<'_> { 38 - fn write_str(&mut self, s: &str) -> fmt::Result { 39 - let bytes = s.as_bytes(); 40 - let len = bytes.len(); 41 - 42 - // We do not want to overwrite our null terminator 43 - if self.pos + len > self.buffer.len() - 1 { 44 - return Err(fmt::Error); 45 - } 46 - 47 - // INVARIANT: We are not overwriting the last byte 48 - self.buffer[self.pos..self.pos + len].copy_from_slice(bytes); 49 - 50 - self.pos += len; 51 - 52 - Ok(()) 53 - } 54 - }

+19 -2

rust/kernel/block/mq/request.rs

··· 53 53 /// [`struct request`]: srctree/include/linux/blk-mq.h 54 54 /// 55 55 #[repr(transparent)] 56 - pub struct Request<T: Operations>(Opaque<bindings::request>, PhantomData<T>); 56 + pub struct Request<T>(Opaque<bindings::request>, PhantomData<T>); 57 57 58 58 impl<T: Operations> Request<T> { 59 59 /// Create an [`ARef<Request>`] from a [`struct request`] pointer. ··· 138 138 Ok(()) 139 139 } 140 140 141 + /// Complete the request by scheduling `Operations::complete` for 142 + /// execution. 143 + /// 144 + /// The function may be scheduled locally, via SoftIRQ or remotely via IPI. 145 + /// See `blk_mq_complete_request_remote` in [`blk-mq.c`] for details. 146 + /// 147 + /// [`blk-mq.c`]: srctree/block/blk-mq.c 148 + pub fn complete(this: ARef<Self>) { 149 + let ptr = ARef::into_raw(this).cast::<bindings::request>().as_ptr(); 150 + // SAFETY: By type invariant, `self.0` is a valid `struct request` 151 + if !unsafe { bindings::blk_mq_complete_request_remote(ptr) } { 152 + // SAFETY: We released a refcount above that we can reclaim here. 153 + let this = unsafe { Request::aref_from_raw(ptr) }; 154 + T::complete(this); 155 + } 156 + } 157 + 141 158 /// Return a pointer to the [`RequestDataWrapper`] stored in the private area 142 159 /// of the request structure. 143 160 /// ··· 168 151 // valid allocation. 169 152 let wrapper_ptr = 170 153 unsafe { bindings::blk_mq_rq_to_pdu(request_ptr).cast::<RequestDataWrapper>() }; 171 - // SAFETY: By C API contract, wrapper_ptr points to a valid allocation 154 + // SAFETY: By C API contract, `wrapper_ptr` points to a valid allocation 172 155 // and is not null. 173 156 unsafe { NonNull::new_unchecked(wrapper_ptr) } 174 157 }

+2

rust/kernel/configfs.rs

··· 1039 1039 }; 1040 1040 1041 1041 } 1042 + 1043 + pub use crate::configfs_attrs;

+150 -12

rust/kernel/str.rs

··· 2 2 3 3 //! String representations. 4 4 5 - use crate::alloc::{flags::*, AllocError, KVec}; 6 - use crate::fmt::{self, Write}; 7 - use core::ops::{self, Deref, DerefMut, Index}; 8 - 9 - use crate::prelude::*; 5 + use crate::{ 6 + alloc::{flags::*, AllocError, KVec}, 7 + error::{to_result, Result}, 8 + fmt::{self, Write}, 9 + prelude::*, 10 + }; 11 + use core::{ 12 + marker::PhantomData, 13 + ops::{self, Deref, DerefMut, Index}, 14 + }; 10 15 11 16 /// Byte string without UTF-8 validity guarantee. 12 17 #[repr(transparent)] ··· 737 732 /// 738 733 /// The memory region between `pos` (inclusive) and `end` (exclusive) is valid for writes if `pos` 739 734 /// is less than `end`. 740 - pub(crate) struct RawFormatter { 735 + pub struct RawFormatter { 741 736 // Use `usize` to use `saturating_*` functions. 742 737 beg: usize, 743 738 pos: usize, ··· 795 790 } 796 791 797 792 /// Returns the number of bytes written to the formatter. 798 - pub(crate) fn bytes_written(&self) -> usize { 793 + pub fn bytes_written(&self) -> usize { 799 794 self.pos - self.beg 800 795 } 801 796 } ··· 829 824 /// Allows formatting of [`fmt::Arguments`] into a raw buffer. 830 825 /// 831 826 /// Fails if callers attempt to write more than will fit in the buffer. 832 - pub(crate) struct Formatter(RawFormatter); 827 + pub struct Formatter<'a>(RawFormatter, PhantomData<&'a mut ()>); 833 828 834 - impl Formatter { 829 + impl Formatter<'_> { 835 830 /// Creates a new instance of [`Formatter`] with the given buffer. 836 831 /// 837 832 /// # Safety ··· 840 835 /// for the lifetime of the returned [`Formatter`]. 841 836 pub(crate) unsafe fn from_buffer(buf: *mut u8, len: usize) -> Self { 842 837 // SAFETY: The safety requirements of this function satisfy those of the callee. 843 - Self(unsafe { RawFormatter::from_buffer(buf, len) }) 838 + Self(unsafe { RawFormatter::from_buffer(buf, len) }, PhantomData) 839 + } 840 + 841 + /// Create a new [`Self`] instance. 
842 + pub fn new(buffer: &mut [u8]) -> Self { 843 + // SAFETY: `buffer` is valid for writes for the entire length for 844 + // the lifetime of `Self`. 845 + unsafe { Formatter::from_buffer(buffer.as_mut_ptr(), buffer.len()) } 844 846 } 845 847 } 846 848 847 - impl Deref for Formatter { 849 + impl Deref for Formatter<'_> { 848 850 type Target = RawFormatter; 849 851 850 852 fn deref(&self) -> &Self::Target { ··· 859 847 } 860 848 } 861 849 862 - impl fmt::Write for Formatter { 850 + impl fmt::Write for Formatter<'_> { 863 851 fn write_str(&mut self, s: &str) -> fmt::Result { 864 852 self.0.write_str(s)?; 865 853 ··· 870 858 Ok(()) 871 859 } 872 860 } 861 + } 862 + 863 + /// A mutable reference to a byte buffer where a string can be written into. 864 + /// 865 + /// The buffer will be automatically null terminated after the last written character. 866 + /// 867 + /// # Invariants 868 + /// 869 + /// * The first byte of `buffer` is always zero. 870 + /// * The length of `buffer` is at least 1. 871 + pub(crate) struct NullTerminatedFormatter<'a> { 872 + buffer: &'a mut [u8], 873 + } 874 + 875 + impl<'a> NullTerminatedFormatter<'a> { 876 + /// Create a new [`Self`] instance. 877 + pub(crate) fn new(buffer: &'a mut [u8]) -> Option<NullTerminatedFormatter<'a>> { 878 + *(buffer.first_mut()?) = 0; 879 + 880 + // INVARIANT: 881 + // - We wrote zero to the first byte above. 882 + // - If buffer was not at least length 1, `buffer.first_mut()` would return None. 883 + Some(Self { buffer }) 884 + } 885 + } 886 + 887 + impl Write for NullTerminatedFormatter<'_> { 888 + fn write_str(&mut self, s: &str) -> fmt::Result { 889 + let bytes = s.as_bytes(); 890 + let len = bytes.len(); 891 + 892 + // We want space for a zero. By type invariant, buffer length is always at least 1, so no 893 + // underflow. 
894 + if len > self.buffer.len() - 1 { 895 + return Err(fmt::Error); 896 + } 897 + 898 + let buffer = core::mem::take(&mut self.buffer); 899 + // We break the zero start invariant for a short while. 900 + buffer[..len].copy_from_slice(bytes); 901 + // INVARIANT: We checked above that buffer will have size at least 1 after this assignment. 902 + self.buffer = &mut buffer[len..]; 903 + 904 + // INVARIANT: We write zero to the first byte of the buffer. 905 + self.buffer[0] = 0; 906 + 907 + Ok(()) 908 + } 909 + } 910 + 911 + /// # Safety 912 + /// 913 + /// - `string` must point to a null terminated string that is valid for read. 914 + unsafe fn kstrtobool_raw(string: *const u8) -> Result<bool> { 915 + let mut result: bool = false; 916 + 917 + // SAFETY: 918 + // - By function safety requirement, `string` is a valid null-terminated string. 919 + // - `result` is a valid `bool` that we own. 920 + to_result(unsafe { bindings::kstrtobool(string, &mut result) })?; 921 + Ok(result) 922 + } 923 + 924 + /// Convert common user inputs into boolean values using the kernel's `kstrtobool` function. 925 + /// 926 + /// This routine returns `Ok(bool)` if the first character is one of 'YyTt1NnFf0', or 927 + /// \[oO\]\[NnFf\] for "on" and "off". Otherwise it will return `Err(EINVAL)`. 
928 + /// 929 + /// # Examples 930 + /// 931 + /// ``` 932 + /// # use kernel::{c_str, str::kstrtobool}; 933 + /// 934 + /// // Lowercase 935 + /// assert_eq!(kstrtobool(c_str!("true")), Ok(true)); 936 + /// assert_eq!(kstrtobool(c_str!("tr")), Ok(true)); 937 + /// assert_eq!(kstrtobool(c_str!("t")), Ok(true)); 938 + /// assert_eq!(kstrtobool(c_str!("twrong")), Ok(true)); 939 + /// assert_eq!(kstrtobool(c_str!("false")), Ok(false)); 940 + /// assert_eq!(kstrtobool(c_str!("f")), Ok(false)); 941 + /// assert_eq!(kstrtobool(c_str!("yes")), Ok(true)); 942 + /// assert_eq!(kstrtobool(c_str!("no")), Ok(false)); 943 + /// assert_eq!(kstrtobool(c_str!("on")), Ok(true)); 944 + /// assert_eq!(kstrtobool(c_str!("off")), Ok(false)); 945 + /// 946 + /// // Camel case 947 + /// assert_eq!(kstrtobool(c_str!("True")), Ok(true)); 948 + /// assert_eq!(kstrtobool(c_str!("False")), Ok(false)); 949 + /// assert_eq!(kstrtobool(c_str!("Yes")), Ok(true)); 950 + /// assert_eq!(kstrtobool(c_str!("No")), Ok(false)); 951 + /// assert_eq!(kstrtobool(c_str!("On")), Ok(true)); 952 + /// assert_eq!(kstrtobool(c_str!("Off")), Ok(false)); 953 + /// 954 + /// // All caps 955 + /// assert_eq!(kstrtobool(c_str!("TRUE")), Ok(true)); 956 + /// assert_eq!(kstrtobool(c_str!("FALSE")), Ok(false)); 957 + /// assert_eq!(kstrtobool(c_str!("YES")), Ok(true)); 958 + /// assert_eq!(kstrtobool(c_str!("NO")), Ok(false)); 959 + /// assert_eq!(kstrtobool(c_str!("ON")), Ok(true)); 960 + /// assert_eq!(kstrtobool(c_str!("OFF")), Ok(false)); 961 + /// 962 + /// // Numeric 963 + /// assert_eq!(kstrtobool(c_str!("1")), Ok(true)); 964 + /// assert_eq!(kstrtobool(c_str!("0")), Ok(false)); 965 + /// 966 + /// // Invalid input 967 + /// assert_eq!(kstrtobool(c_str!("invalid")), Err(EINVAL)); 968 + /// assert_eq!(kstrtobool(c_str!("2")), Err(EINVAL)); 969 + /// ``` 970 + pub fn kstrtobool(string: &CStr) -> Result<bool> { 971 + // SAFETY: 972 + // - The pointer returned by `CStr::as_char_ptr` is guaranteed to be 973 + // null 
terminated. 974 + // - `string` is live and thus the string is valid for read. 975 + unsafe { kstrtobool_raw(string.as_char_ptr()) } 976 + } 977 + 978 + /// Convert `&[u8]` to `bool` by deferring to [`kernel::str::kstrtobool`]. 979 + /// 980 + /// Only considers at most the first two bytes of `bytes`. 981 + pub fn kstrtobool_bytes(bytes: &[u8]) -> Result<bool> { 982 + // `kstrtobool` only considers the first two bytes of the input. 983 + let stack_string = [*bytes.first().unwrap_or(&0), *bytes.get(1).unwrap_or(&0), 0]; 984 + // SAFETY: `stack_string` is null terminated and it is live on the stack so 985 + // it is valid for read. 986 + unsafe { kstrtobool_raw(stack_string.as_ptr()) } 873 987 } 874 988 875 989 /// An owned string that is guaranteed to have exactly one `NUL` byte, which is at the end.
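Note: the doc examples above ("twrong" parses as true, "2" is rejected) and the three-byte `stack_string` both follow from the parser looking at no more than the first two bytes. The following is a minimal userspace sketch of that rule in C, not the kernel implementation. The names `parse_bool_sketch` and `parse_bool_bytes_sketch` are hypothetical, and the sketch covers only the spellings exercised in the examples above; the in-kernel function may accept more.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the parsing rule: only s[0] is inspected,
 * plus s[1] when s[0] is 'o'/'O' (to tell "on" from "off").
 * Returns 0 on success and -EINVAL otherwise.
 */
static int parse_bool_sketch(const char *s, bool *res)
{
	if (!s)
		return -EINVAL;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		default:
			break;
		}
		break;
	default:
		break;
	}
	return -EINVAL;
}

/*
 * Same idea as `stack_string` in the hunk above: copy at most two bytes
 * into a small buffer, pad with NUL, and hand it to the string parser.
 * `bytes` is not dereferenced when `len` is 0.
 */
static int parse_bool_bytes_sketch(const unsigned char *bytes, size_t len,
				   bool *res)
{
	char buf[3] = {
		len > 0 ? (char)bytes[0] : '\0',
		len > 1 ? (char)bytes[1] : '\0',
		'\0',
	};

	return parse_bool_sketch(buf, res);
}

/* Small usage example mirroring a few of the doc-test cases. */
void parse_bool_examples(void)
{
	bool v = false;

	assert(parse_bool_sketch("twrong", &v) == 0 && v);
	assert(parse_bool_sketch("Off", &v) == 0 && !v);
	assert(parse_bool_sketch("invalid", &v) == -EINVAL);
}
```

Because the buffer is always NUL-padded to three bytes, a byte slice that is not NUL-terminated (or is empty) can still be passed to a C-string parser without reading out of bounds.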

+1 -1

samples/rust/rust_configfs.rs

··· 5 5 use kernel::alloc::flags; 6 6 use kernel::c_str; 7 7 use kernel::configfs; 8 - use kernel::configfs_attrs; 8 + use kernel::configfs::configfs_attrs; 9 9 use kernel::new_mutex; 10 10 use kernel::page::PAGE_SIZE; 11 11 use kernel::prelude::*;

+1

tools/testing/selftests/ublk/Makefile

··· 20 20 TEST_PROGS += test_generic_10.sh 21 21 TEST_PROGS += test_generic_11.sh 22 22 TEST_PROGS += test_generic_12.sh 23 + TEST_PROGS += test_generic_13.sh 23 24 24 25 TEST_PROGS += test_null_01.sh 25 26 TEST_PROGS += test_null_02.sh

+17 -15

tools/testing/selftests/ublk/kublk.c

··· 1384 1384 static int cmd_dev_get_features(void) 1385 1385 { 1386 1386 #define const_ilog2(x) (63 - __builtin_clzll(x)) 1387 + #define FEAT_NAME(f) [const_ilog2(f)] = #f 1387 1388 static const char *feat_map[] = { 1388 - [const_ilog2(UBLK_F_SUPPORT_ZERO_COPY)] = "ZERO_COPY", 1389 - [const_ilog2(UBLK_F_URING_CMD_COMP_IN_TASK)] = "COMP_IN_TASK", 1390 - [const_ilog2(UBLK_F_NEED_GET_DATA)] = "GET_DATA", 1391 - [const_ilog2(UBLK_F_USER_RECOVERY)] = "USER_RECOVERY", 1392 - [const_ilog2(UBLK_F_USER_RECOVERY_REISSUE)] = "RECOVERY_REISSUE", 1393 - [const_ilog2(UBLK_F_UNPRIVILEGED_DEV)] = "UNPRIVILEGED_DEV", 1394 - [const_ilog2(UBLK_F_CMD_IOCTL_ENCODE)] = "CMD_IOCTL_ENCODE", 1395 - [const_ilog2(UBLK_F_USER_COPY)] = "USER_COPY", 1396 - [const_ilog2(UBLK_F_ZONED)] = "ZONED", 1397 - [const_ilog2(UBLK_F_USER_RECOVERY_FAIL_IO)] = "RECOVERY_FAIL_IO", 1398 - [const_ilog2(UBLK_F_UPDATE_SIZE)] = "UPDATE_SIZE", 1399 - [const_ilog2(UBLK_F_AUTO_BUF_REG)] = "AUTO_BUF_REG", 1400 - [const_ilog2(UBLK_F_QUIESCE)] = "QUIESCE", 1401 - [const_ilog2(UBLK_F_PER_IO_DAEMON)] = "PER_IO_DAEMON", 1389 + FEAT_NAME(UBLK_F_SUPPORT_ZERO_COPY), 1390 + FEAT_NAME(UBLK_F_URING_CMD_COMP_IN_TASK), 1391 + FEAT_NAME(UBLK_F_NEED_GET_DATA), 1392 + FEAT_NAME(UBLK_F_USER_RECOVERY), 1393 + FEAT_NAME(UBLK_F_USER_RECOVERY_REISSUE), 1394 + FEAT_NAME(UBLK_F_UNPRIVILEGED_DEV), 1395 + FEAT_NAME(UBLK_F_CMD_IOCTL_ENCODE), 1396 + FEAT_NAME(UBLK_F_USER_COPY), 1397 + FEAT_NAME(UBLK_F_ZONED), 1398 + FEAT_NAME(UBLK_F_USER_RECOVERY_FAIL_IO), 1399 + FEAT_NAME(UBLK_F_UPDATE_SIZE), 1400 + FEAT_NAME(UBLK_F_AUTO_BUF_REG), 1401 + FEAT_NAME(UBLK_F_QUIESCE), 1402 + FEAT_NAME(UBLK_F_PER_IO_DAEMON), 1403 + FEAT_NAME(UBLK_F_BUF_REG_OFF_DAEMON), 1402 1404 }; 1403 1405 struct ublk_dev *dev; 1404 1406 __u64 features = 0; ··· 1427 1425 feat = feat_map[i]; 1428 1426 else 1429 1427 feat = "unknown"; 1430 - printf("\t%-20s: 0x%llx\n", feat, 1ULL << i); 1428 + printf("0x%-16llx: %s\n", 1ULL << i, feat); 1431 1429 } 1432 1430 } 1433 1431
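Note: the `FEAT_NAME` macro in the hunk above combines a designated array initializer with the preprocessor's `#` stringification, so each table slot is indexed by the flag's bit position and named after the flag itself. The sketch below shows the same pattern in isolation. The `DEMO_F_*` flags and the `demo_*` helpers are hypothetical, not real `UBLK_F_*` values, and it assumes GCC or Clang, since `__builtin_clzll` must fold to a constant inside the designator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical feature bits for illustration only. */
#define DEMO_F_ALPHA (1ULL << 0)
#define DEMO_F_BETA  (1ULL << 1)
#define DEMO_F_DELTA (1ULL << 3)	/* bit 2 is deliberately left unnamed */

/* Bit index of a single-bit constant (GCC/Clang builtin). */
#define const_ilog2(x) (63 - __builtin_clzll(x))
/* Slot = bit index of the flag, value = the flag's own name as a string. */
#define FEAT_NAME(f) [const_ilog2(f)] = #f

static const char *demo_feat_map[] = {
	FEAT_NAME(DEMO_F_ALPHA),
	FEAT_NAME(DEMO_F_BETA),
	FEAT_NAME(DEMO_F_DELTA),
};

/* Look up a bit's name; unnamed or out-of-range bits report "unknown". */
static const char *demo_feat_name(unsigned int bit)
{
	size_t n = sizeof(demo_feat_map) / sizeof(demo_feat_map[0]);

	if (bit < n && demo_feat_map[bit])
		return demo_feat_map[bit];
	return "unknown";
}

/* Print every set bit in `features`, value first, as in the new format. */
void demo_print_features(unsigned long long features)
{
	for (unsigned int i = 0; i < 64; i++) {
		if (features & (1ULL << i))
			printf("0x%-16llx: %s\n", 1ULL << i, demo_feat_name(i));
	}
}

/* Small usage example: a named bit and a gap in the table. */
void demo_feat_check(void)
{
	assert(strcmp(demo_feat_name(1), "DEMO_F_BETA") == 0);
	assert(strcmp(demo_feat_name(2), "unknown") == 0);
}
```

Slots skipped by the designators are zero-initialized, so a flag missing from the table shows up as "unknown". That is the condition the new `test_generic_13.sh` greps for.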

+4

tools/testing/selftests/ublk/test_generic_01.sh

··· 10 10 exit "$UBLK_SKIP_CODE" 11 11 fi 12 12 13 + if ! _have_program fio; then 14 + exit "$UBLK_SKIP_CODE" 15 + fi 16 + 13 17 _prep_test "null" "sequential io order" 14 18 15 19 dev_id=$(_add_ublk_dev -t null)

+4

tools/testing/selftests/ublk/test_generic_02.sh

··· 10 10 exit "$UBLK_SKIP_CODE" 11 11 fi 12 12 13 + if ! _have_program fio; then 14 + exit "$UBLK_SKIP_CODE" 15 + fi 16 + 13 17 _prep_test "null" "sequential io order for MQ" 14 18 15 19 dev_id=$(_add_ublk_dev -t null -q 2)

+4

tools/testing/selftests/ublk/test_generic_12.sh

··· 10 10 exit "$UBLK_SKIP_CODE" 11 11 fi 12 12 13 + if ! _have_program fio; then 14 + exit "$UBLK_SKIP_CODE" 15 + fi 16 + 13 17 _prep_test "null" "do imbalanced load, it should be balanced over I/O threads" 14 18 15 19 NTHREADS=6

+20

tools/testing/selftests/ublk/test_generic_13.sh

··· 1 + #!/bin/bash 2 + # SPDX-License-Identifier: GPL-2.0 3 + 4 + . "$(cd "$(dirname "$0")" && pwd)"/test_common.sh 5 + 6 + TID="generic_13" 7 + ERR_CODE=0 8 + 9 + _prep_test "null" "check that feature list is complete" 10 + 11 + if ${UBLK_PROG} features | grep -q unknown; then 12 + echo "# unknown feature detected!" 13 + echo "# did you add a feature and forget to update feat_map in kublk?" 14 + echo "# this failure is expected if running an older test suite against" 15 + echo "# a newer kernel with new features added" 16 + ERR_CODE=255 17 + fi 18 + 19 + _cleanup_test "null" 20 + _show_result $TID $ERR_CODE

+4

tools/testing/selftests/ublk/test_null_01.sh

··· 6 6 TID="null_01" 7 7 ERR_CODE=0 8 8 9 + if ! _have_program fio; then 10 + exit "$UBLK_SKIP_CODE" 11 + fi 12 + 9 13 _prep_test "null" "basic IO test" 10 14 11 15 dev_id=$(_add_ublk_dev -t null)

+4

tools/testing/selftests/ublk/test_null_02.sh

··· 6 6 TID="null_02" 7 7 ERR_CODE=0 8 8 9 + if ! _have_program fio; then 10 + exit "$UBLK_SKIP_CODE" 11 + fi 12 + 9 13 _prep_test "null" "basic IO test with zero copy" 10 14 11 15 dev_id=$(_add_ublk_dev -t null -z)

+4

tools/testing/selftests/ublk/test_stress_05.sh

··· 5 5 TID="stress_05" 6 6 ERR_CODE=0 7 7 8 + if ! _have_program fio; then 9 + exit "$UBLK_SKIP_CODE" 10 + fi 11 + 8 12 run_io_and_remove() 9 13 { 10 14 local size=$1
