Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block

+6 -6

Documentation/block/bfq-iosched.txt

···
 CPUs, here are, first, the limits of BFQ for three different CPUs, on,
 respectively, an average laptop, an old desktop, and a cheap embedded
 system, in case full hierarchical support is enabled (i.e.,
-CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not
+CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
 set (Section 4-2):
 - Intel i7-4850HQ: 400 KIOPS
 - AMD A8-3850: 250 KIOPS
 - ARM CortexTM-A53 Octa-core: 80 KIOPS
 
-If CONFIG_DEBUG_BLK_CGROUP is set (and of course full hierarchical
+If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
 support is enabled), then the sustainable throughput with BFQ
 decreases, because all blkio.bfq* statistics are created and updated
 (Section 4-2). For BFQ, this leads to the following maximum
···
 
 As for cgroups-v1 (blkio controller), the exact set of stat files
 created, and kept up-to-date by bfq, depends on whether
-CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
+CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
 the stat files documented in
 Documentation/cgroup-v1/blkio-controller.rst. If, instead,
-CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
+CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files
 blkio.bfq.io_service_bytes
 blkio.bfq.io_service_bytes_recursive
 blkio.bfq.io_serviced
 blkio.bfq.io_serviced_recursive
 
-The value of CONFIG_DEBUG_BLK_CGROUP greatly influences the maximum
+The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
 throughput sustainable with bfq, because updating the blkio.bfq.*
 stats is rather costly, especially for some of the stats enabled by
-CONFIG_DEBUG_BLK_CGROUP.
+CONFIG_BFQ_CGROUP_DEBUG.
 
 Parameters to set
 -----------------

-1

Documentation/block/biodoc.txt

···
 	struct bvec_iter bi_iter; /* current index into bio_vec array */
 
 	unsigned int bi_size; /* total size in bytes */
-	unsigned short bi_phys_segments; /* segments after physaddr coalesce*/
 	unsigned short bi_hw_segments; /* segments after DMA remapping */
 	unsigned int bi_max; /* max bio_vecs we can hold
 				used as index into pool */

+43 -21

Documentation/block/queue-sysfs.txt

···
 This file allows to turn off the disk entropy contribution. Default
 value of this file is '1'(on).
 
+chunk_sectors (RO)
+------------------
+This has different meaning depending on the type of the block device.
+For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
+of the RAID volume stripe segment. For a zoned block device, either host-aware
+or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
+of the device, with the eventual exception of the last zone of the device which
+may be smaller.
+
 dax (RO)
 --------
 This file indicates whether the device supports Direct Access (DAX),
···
 large discards are issued, setting this value lower will make Linux issue
 smaller discards and potentially help reduce latencies induced by large
 discard operations.
+
+discard_zeroes_data (RO)
+------------------------
+Obsolete. Always zero.
+
+fua (RO)
+--------
+Whether or not the block driver supports the FUA flag for write requests.
+FUA stands for Force Unit Access. If the FUA flag is set that means that
+write requests must bypass the volatile cache of the storage device.
 
 hw_sector_size (RO)
 -------------------
···
 -----------------------
 This is the logical block size of the device, in bytes.
 
+max_discard_segments (RO)
+-------------------------
+The maximum number of DMA scatter/gather entries in a discard request.
+
 max_hw_sectors_kb (RO)
 ----------------------
 This is the maximum number of kilobytes supported in a single data transfer.
 
 max_integrity_segments (RO)
 ---------------------------
-When read, this file shows the max limit of integrity segments as
-set by block layer which a hardware controller can handle.
+Maximum number of elements in a DMA scatter/gather list with integrity
+data that will be submitted by the block layer core to the associated
+block driver.
 
 max_sectors_kb (RW)
 -------------------
···
 
 max_segments (RO)
 -----------------
-Maximum number of segments of the device.
+Maximum number of elements in a DMA scatter/gather list that is submitted
+to the associated block driver.
 
 max_segment_size (RO)
 ---------------------
-Maximum segment size of the device.
+Maximum size in bytes of a single element in a DMA scatter/gather list.
 
 minimum_io_size (RO)
 --------------------
···
 per-block-cgroup request pool. IOW, if there are N block cgroups,
 each request queue may have up to N request pools, each independently
 regulated by nr_requests.
+
+nr_zones (RO)
+-------------
+For zoned block devices (zoned attribute indicating "host-managed" or
+"host-aware"), this indicates the total number of zones of the device.
+This is always 0 for regular block devices.
 
 optimal_io_size (RO)
 --------------------
···
 command. A value of '0' means write-same is not supported by this
 device.
 
-wb_lat_usec (RW)
-----------------
+wbt_lat_usec (RW)
+-----------------
 If the device is registered for writeback throttling, then this file shows
 the target minimum read latency. If this latency is exceeded in a given
 window of time (see wb_window_usec), then the writeback throttling will start
···
 have more smooth throughput, but higher CPU overhead. This exists only when
 CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
 
+write_zeroes_max_bytes (RO)
+---------------------------
+For block drivers that support REQ_OP_WRITE_ZEROES, the maximum number of
+bytes that can be zeroed at once. The value 0 means that REQ_OP_WRITE_ZEROES
+is not supported.
+
 zoned (RO)
 ----------
 This indicates if the device is a zoned block device and the zone model of the
···
 "drive-managed" zone model. However, since drive-managed zoned block devices
 do not support zone commands, they will be treated as regular block devices
 and zoned will report "none".
-
-nr_zones (RO)
--------------
-For zoned block devices (zoned attribute indicating "host-managed" or
-"host-aware"), this indicates the total number of zones of the device.
-This is always 0 for regular block devices.
-
-chunk_sectors (RO)
-------------------
-This has different meaning depending on the type of the block device.
-For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
-of the RAID volume stripe segment. For a zoned block device, either host-aware
-or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
-of the device, with the eventual exception of the last zone of the device which
-may be smaller.
 
 Jens Axboe <jens.axboe@oracle.com>, February 2009
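The attributes documented in this hunk are plain text files under /sys/block/<dev>/queue/. As a minimal user-space sketch of how a tool might read one of the numeric ones (for example nr_zones or chunk_sectors), assuming the usual sysfs layout: the helper names queue_attr_path() and read_queue_attr() are hypothetical and are not part of the kernel or of this commit.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Build "/sys/block/<dev>/queue/<attr>" into buf.
 * Hypothetical helper; returns 0 on success, -1 if buf is too small.
 */
static int queue_attr_path(char *buf, size_t size,
			   const char *dev, const char *attr)
{
	int n = snprintf(buf, size, "/sys/block/%s/queue/%s", dev, attr);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Read one unsigned decimal value from an attribute file.
 * Hypothetical helper; returns 0 on success, -1 if the file cannot be
 * opened or does not start with a number (e.g. the "zoned" attribute,
 * which reports a string such as "none").
 */
static int read_queue_attr(const char *path, unsigned long long *out)
{
	char line[64];
	char *end;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	*out = strtoull(line, &end, 10);
	if (errno || end == line)
		return -1;
	return 0;
}
```

A caller would first build the path with queue_attr_path(buf, sizeof(buf), "sda", "nr_zones") and then pass buf to read_queue_attr(). An attribute that does not exist on the running kernel simply makes the read fail, so callers should treat -1 as "not available" rather than as an error in the device.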

+6 -6

Documentation/cgroup-v1/blkio-controller.rst

···
 CONFIG_BLK_CGROUP
 	- Block IO controller.
 
-CONFIG_DEBUG_BLK_CGROUP
+CONFIG_BFQ_CGROUP_DEBUG
 	- Debug help. Right now some additional stats file show up in cgroup
 	  if this option is enabled.
 
···
 	  write, sync or async.
 
 - blkio.avg_queue_size
-	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
 	  The average queue size for this cgroup over the entire time of this
 	  cgroup's existence. Queue size samples are taken each time one of the
 	  queues of this cgroup gets a timeslice.
 
 - blkio.group_wait_time
-	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
 	  This is the amount of time the cgroup had to wait since it became busy
 	  (i.e., went from 0 to 1 request queued) to get a timeslice for one of
 	  its queues. This is different from the io_wait_time which is the
···
 	  got a timeslice and will not include the current delta.
 
 - blkio.empty_time
-	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
 	  This is the amount of time a cgroup spends without any pending
 	  requests when not being served, i.e., it does not include any time
 	  spent idling for one of the queues of the cgroup. This is in
···
 	  time it had a pending request and will not include the current delta.
 
 - blkio.idle_time
-	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
+	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
 	  This is the amount of time spent by the IO scheduler idling for a
 	  given cgroup in anticipation of a better request than the existing ones
 	  from other queues/cgroups. This is in nanoseconds. If this is read
···
 	  the current delta.
 
 - blkio.dequeue
-	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
+	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
 	  gives the statistics about how many a times a group was dequeued
 	  from service tree of the device. First two fields specify the major
 	  and minor number of the device and third field specifies the number

+56

Documentation/fault-injection/nvme-fault-injection.txt

···
 cpu_startup_entry+0x6f/0x80
 start_secondary+0x187/0x1e0
 secondary_startup_64+0xa5/0xb0
+
+Example 3: Inject an error into the 10th admin command
+------------------------------------------------------
+
+echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability
+echo 10 > /sys/kernel/debug/nvme0/fault_inject/space
+echo 1 > /sys/kernel/debug/nvme0/fault_inject/times
+nvme reset /dev/nvme0
+
+Expected Result:
+
+After NVMe controller reset, the reinitialization may or may not succeed.
+It depends on which admin command is actually forced to fail.
+
+Message from dmesg:
+
+nvme nvme0: resetting controller
+FAULT_INJECTION: forcing a failure.
+name fault_inject, interval 1, probability 100, space 1, times 1
+CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0-rc2+ #2
+Hardware name: MSI MS-7A45/B150M MORTAR ARCTIC (MS-7A45), BIOS 1.50 04/25/2017
+Call Trace:
+<IRQ>
+dump_stack+0x63/0x85
+should_fail+0x14a/0x170
+nvme_should_fail+0x38/0x80 [nvme_core]
+nvme_irq+0x129/0x280 [nvme]
+? blk_mq_end_request+0xb3/0x120
+__handle_irq_event_percpu+0x84/0x1a0
+handle_irq_event_percpu+0x32/0x80
+handle_irq_event+0x3b/0x60
+handle_edge_irq+0x7f/0x1a0
+handle_irq+0x20/0x30
+do_IRQ+0x4e/0xe0
+common_interrupt+0xf/0xf
+</IRQ>
+RIP: 0010:cpuidle_enter_state+0xc5/0x460
+Code: ff e8 8f 5f 86 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 69 03 00 00 31 ff e8 62 aa 8c ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 37 03 00 00 4c 8b 45 d0 4c 2b 45 b8 48 ba cf f7 53
+RSP: 0018:ffffffff88c03dd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
+RAX: ffff9dac25a2ac80 RBX: ffffffff88d53760 RCX: 000000000000001f
+RDX: 0000000000000000 RSI: 000000002d958403 RDI: 0000000000000000
+RBP: ffffffff88c03e18 R08: fffffff75e35ffb7 R09: 00000a49a56c0b48
+R10: ffffffff88c03da0 R11: 0000000000001b0c R12: ffff9dac25a34d00
+R13: 0000000000000006 R14: 0000000000000006 R15: ffffffff88d53760
+cpuidle_enter+0x2e/0x40
+call_cpuidle+0x23/0x40
+do_idle+0x201/0x280
+cpu_startup_entry+0x1d/0x20
+rest_init+0xaa/0xb0
+arch_call_rest_init+0xe/0x1b
+start_kernel+0x51c/0x53b
+x86_64_start_reservations+0x24/0x26
+x86_64_start_kernel+0x74/0x77
+secondary_startup_64+0xa4/0xb0
+nvme nvme0: Could not set queue count (16385)
+nvme nvme0: IO queues not created

+7

block/Kconfig.iosched

···
 	Enable hierarchical scheduling in BFQ, using the blkio
 	(cgroups-v1) or io (cgroups-v2) controller.
 
+config BFQ_CGROUP_DEBUG
+	bool "BFQ IO controller debugging"
+	depends on BFQ_GROUP_IOSCHED
+	---help---
+	Enable some debugging help. Currently it exports additional stat
+	files in a cgroup which can be useful for debugging.
+
 endmenu
 
 endif

+153 -59

block/bfq-cgroup.c

···
 
 #include "bfq-iosched.h"
 
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
+static int bfq_stat_init(struct bfq_stat *stat, gfp_t gfp)
+{
+	int ret;
+
+	ret = percpu_counter_init(&stat->cpu_cnt, 0, gfp);
+	if (ret)
+		return ret;
+
+	atomic64_set(&stat->aux_cnt, 0);
+	return 0;
+}
+
+static void bfq_stat_exit(struct bfq_stat *stat)
+{
+	percpu_counter_destroy(&stat->cpu_cnt);
+}
+
+/**
+ * bfq_stat_add - add a value to a bfq_stat
+ * @stat: target bfq_stat
+ * @val: value to add
+ *
+ * Add @val to @stat. The caller must ensure that IRQ on the same CPU
+ * don't re-enter this function for the same counter.
+ */
+static inline void bfq_stat_add(struct bfq_stat *stat, uint64_t val)
+{
+	percpu_counter_add_batch(&stat->cpu_cnt, val, BLKG_STAT_CPU_BATCH);
+}
+
+/**
+ * bfq_stat_read - read the current value of a bfq_stat
+ * @stat: bfq_stat to read
+ */
+static inline uint64_t bfq_stat_read(struct bfq_stat *stat)
+{
+	return percpu_counter_sum_positive(&stat->cpu_cnt);
+}
+
+/**
+ * bfq_stat_reset - reset a bfq_stat
+ * @stat: bfq_stat to reset
+ */
+static inline void bfq_stat_reset(struct bfq_stat *stat)
+{
+	percpu_counter_set(&stat->cpu_cnt, 0);
+	atomic64_set(&stat->aux_cnt, 0);
+}
+
+/**
+ * bfq_stat_add_aux - add a bfq_stat into another's aux count
+ * @to: the destination bfq_stat
+ * @from: the source
+ *
+ * Add @from's count including the aux one to @to's aux count.
+ */
+static inline void bfq_stat_add_aux(struct bfq_stat *to,
+				     struct bfq_stat *from)
+{
+	atomic64_add(bfq_stat_read(from) + atomic64_read(&from->aux_cnt),
+		     &to->aux_cnt);
+}
+
+/**
+ * blkg_prfill_stat - prfill callback for bfq_stat
+ * @sf: seq_file to print to
+ * @pd: policy private data of interest
+ * @off: offset to the bfq_stat in @pd
+ *
+ * prfill callback for printing a bfq_stat.
+ */
+static u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd,
+		int off)
+{
+	return __blkg_prfill_u64(sf, pd, bfq_stat_read((void *)pd + off));
+}
 
 /* bfqg stats flags */
 enum bfqg_stats_flags {
···
 
 	now = ktime_get_ns();
 	if (now > stats->start_group_wait_time)
-		blkg_stat_add(&stats->group_wait_time,
+		bfq_stat_add(&stats->group_wait_time,
 			      now - stats->start_group_wait_time);
 	bfqg_stats_clear_waiting(stats);
 }
···
 
 	now = ktime_get_ns();
 	if (now > stats->start_empty_time)
-		blkg_stat_add(&stats->empty_time,
+		bfq_stat_add(&stats->empty_time,
 			      now - stats->start_empty_time);
 	bfqg_stats_clear_empty(stats);
 }
 
 void bfqg_stats_update_dequeue(struct bfq_group *bfqg)
 {
-	blkg_stat_add(&bfqg->stats.dequeue, 1);
+	bfq_stat_add(&bfqg->stats.dequeue, 1);
 }
 
 void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg)
···
 		u64 now = ktime_get_ns();
 
 		if (now > stats->start_idle_time)
-			blkg_stat_add(&stats->idle_time,
+			bfq_stat_add(&stats->idle_time,
 				      now - stats->start_idle_time);
 		bfqg_stats_clear_idling(stats);
 	}
···
 {
 	struct bfqg_stats *stats = &bfqg->stats;
 
-	blkg_stat_add(&stats->avg_queue_size_sum,
+	bfq_stat_add(&stats->avg_queue_size_sum,
 		      blkg_rwstat_total(&stats->queued));
-	blkg_stat_add(&stats->avg_queue_size_samples, 1);
+	bfq_stat_add(&stats->avg_queue_size_samples, 1);
 	bfqg_stats_update_group_wait_time(stats);
 }
···
 				io_start_time_ns - start_time_ns);
 }
 
-#else /* CONFIG_BFQ_GROUP_IOSCHED && CONFIG_DEBUG_BLK_CGROUP */
+#else /* CONFIG_BFQ_CGROUP_DEBUG */
 
 void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq,
 			      unsigned int op) { }
···
 void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) { }
 void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) { }
 
-#endif /* CONFIG_BFQ_GROUP_IOSCHED && CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
 
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 
···
 /* @stats = 0 */
 static void bfqg_stats_reset(struct bfqg_stats *stats)
 {
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
 	/* queued stats shouldn't be cleared */
 	blkg_rwstat_reset(&stats->merged);
 	blkg_rwstat_reset(&stats->service_time);
 	blkg_rwstat_reset(&stats->wait_time);
-	blkg_stat_reset(&stats->time);
-	blkg_stat_reset(&stats->avg_queue_size_sum);
-	blkg_stat_reset(&stats->avg_queue_size_samples);
-	blkg_stat_reset(&stats->dequeue);
-	blkg_stat_reset(&stats->group_wait_time);
-	blkg_stat_reset(&stats->idle_time);
-	blkg_stat_reset(&stats->empty_time);
+	bfq_stat_reset(&stats->time);
+	bfq_stat_reset(&stats->avg_queue_size_sum);
+	bfq_stat_reset(&stats->avg_queue_size_samples);
+	bfq_stat_reset(&stats->dequeue);
+	bfq_stat_reset(&stats->group_wait_time);
+	bfq_stat_reset(&stats->idle_time);
+	bfq_stat_reset(&stats->empty_time);
 #endif
 }
 
···
 	if (!to || !from)
 		return;
 
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
 	/* queued stats shouldn't be cleared */
 	blkg_rwstat_add_aux(&to->merged, &from->merged);
 	blkg_rwstat_add_aux(&to->service_time, &from->service_time);
 	blkg_rwstat_add_aux(&to->wait_time, &from->wait_time);
-	blkg_stat_add_aux(&from->time, &from->time);
-	blkg_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
-	blkg_stat_add_aux(&to->avg_queue_size_samples,
+	bfq_stat_add_aux(&from->time, &from->time);
+	bfq_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
+	bfq_stat_add_aux(&to->avg_queue_size_samples,
 			  &from->avg_queue_size_samples);
-	blkg_stat_add_aux(&to->dequeue, &from->dequeue);
-	blkg_stat_add_aux(&to->group_wait_time, &from->group_wait_time);
-	blkg_stat_add_aux(&to->idle_time, &from->idle_time);
-	blkg_stat_add_aux(&to->empty_time, &from->empty_time);
+	bfq_stat_add_aux(&to->dequeue, &from->dequeue);
+	bfq_stat_add_aux(&to->group_wait_time, &from->group_wait_time);
+	bfq_stat_add_aux(&to->idle_time, &from->idle_time);
+	bfq_stat_add_aux(&to->empty_time, &from->empty_time);
 #endif
 }
 
···
 
 static void bfqg_stats_exit(struct bfqg_stats *stats)
 {
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
 	blkg_rwstat_exit(&stats->merged);
 	blkg_rwstat_exit(&stats->service_time);
 	blkg_rwstat_exit(&stats->wait_time);
 	blkg_rwstat_exit(&stats->queued);
-	blkg_stat_exit(&stats->time);
-	blkg_stat_exit(&stats->avg_queue_size_sum);
-	blkg_stat_exit(&stats->avg_queue_size_samples);
-	blkg_stat_exit(&stats->dequeue);
-	blkg_stat_exit(&stats->group_wait_time);
-	blkg_stat_exit(&stats->idle_time);
-	blkg_stat_exit(&stats->empty_time);
+	bfq_stat_exit(&stats->time);
+	bfq_stat_exit(&stats->avg_queue_size_sum);
+	bfq_stat_exit(&stats->avg_queue_size_samples);
+	bfq_stat_exit(&stats->dequeue);
+	bfq_stat_exit(&stats->group_wait_time);
+	bfq_stat_exit(&stats->idle_time);
+	bfq_stat_exit(&stats->empty_time);
 #endif
 }
 
 static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp)
 {
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
 	if (blkg_rwstat_init(&stats->merged, gfp) ||
 	    blkg_rwstat_init(&stats->service_time, gfp) ||
 	    blkg_rwstat_init(&stats->wait_time, gfp) ||
 	    blkg_rwstat_init(&stats->queued, gfp) ||
-	    blkg_stat_init(&stats->time, gfp) ||
-	    blkg_stat_init(&stats->avg_queue_size_sum, gfp) ||
-	    blkg_stat_init(&stats->avg_queue_size_samples, gfp) ||
-	    blkg_stat_init(&stats->dequeue, gfp) ||
-	    blkg_stat_init(&stats->group_wait_time, gfp) ||
-	    blkg_stat_init(&stats->idle_time, gfp) ||
-	    blkg_stat_init(&stats->empty_time, gfp)) {
+	    bfq_stat_init(&stats->time, gfp) ||
+	    bfq_stat_init(&stats->avg_queue_size_sum, gfp) ||
+	    bfq_stat_init(&stats->avg_queue_size_samples, gfp) ||
+	    bfq_stat_init(&stats->dequeue, gfp) ||
+	    bfq_stat_init(&stats->group_wait_time, gfp) ||
+	    bfq_stat_init(&stats->idle_time, gfp) ||
+	    bfq_stat_init(&stats->empty_time, gfp)) {
 		bfqg_stats_exit(stats);
 		return -ENOMEM;
 	}
···
 	return ret ?: nbytes;
 }
 
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
 static int bfqg_print_stat(struct seq_file *sf, void *v)
 {
 	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_stat,
···
 static u64 bfqg_prfill_stat_recursive(struct seq_file *sf,
 				      struct blkg_policy_data *pd, int off)
 {
-	u64 sum = blkg_stat_recursive_sum(pd_to_blkg(pd),
-					  &blkcg_policy_bfq, off);
+	struct blkcg_gq *blkg = pd_to_blkg(pd);
+	struct blkcg_gq *pos_blkg;
+	struct cgroup_subsys_state *pos_css;
+	u64 sum = 0;
+
+	lockdep_assert_held(&blkg->q->queue_lock);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) {
+		struct bfq_stat *stat;
+
+		if (!pos_blkg->online)
+			continue;
+
+		stat = (void *)blkg_to_pd(pos_blkg, &blkcg_policy_bfq) + off;
+		sum += bfq_stat_read(stat) + atomic64_read(&stat->aux_cnt);
+	}
+	rcu_read_unlock();
+
 	return __blkg_prfill_u64(sf, pd, sum);
 }
 
 static u64 bfqg_prfill_rwstat_recursive(struct seq_file *sf,
 					struct blkg_policy_data *pd, int off)
 {
-	struct blkg_rwstat sum = blkg_rwstat_recursive_sum(pd_to_blkg(pd),
-							   &blkcg_policy_bfq,
-							   off);
+	struct blkg_rwstat_sample sum;
+
+	blkg_rwstat_recursive_sum(pd_to_blkg(pd), &blkcg_policy_bfq, off, &sum);
 	return __blkg_prfill_rwstat(sf, pd, &sum);
 }
 
···
 static u64 bfqg_prfill_sectors_recursive(struct seq_file *sf,
 					 struct blkg_policy_data *pd, int off)
 {
-	struct blkg_rwstat tmp = blkg_rwstat_recursive_sum(pd->blkg, NULL,
-					offsetof(struct blkcg_gq, stat_bytes));
-	u64 sum = atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) +
-		atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]);
+	struct blkg_rwstat_sample tmp;
 
-	return __blkg_prfill_u64(sf, pd, sum >> 9);
+	blkg_rwstat_recursive_sum(pd->blkg, NULL,
+			offsetof(struct blkcg_gq, stat_bytes), &tmp);
+
+	return __blkg_prfill_u64(sf, pd,
+		(tmp.cnt[BLKG_RWSTAT_READ] + tmp.cnt[BLKG_RWSTAT_WRITE]) >> 9);
 }
 
 static int bfqg_print_stat_sectors_recursive(struct seq_file *sf, void *v)
···
 				      struct blkg_policy_data *pd, int off)
 {
 	struct bfq_group *bfqg = pd_to_bfqg(pd);
-	u64 samples = blkg_stat_read(&bfqg->stats.avg_queue_size_samples);
+	u64 samples = bfq_stat_read(&bfqg->stats.avg_queue_size_samples);
 	u64 v = 0;
 
 	if (samples) {
-		v = blkg_stat_read(&bfqg->stats.avg_queue_size_sum);
+		v = bfq_stat_read(&bfqg->stats.avg_queue_size_sum);
 		v = div64_u64(v, samples);
 	}
 	__blkg_prfill_u64(sf, pd, v);
···
 			  0, false);
 	return 0;
 }
-#endif /* CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
 
 struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
 {
···
 		.private = (unsigned long)&blkcg_policy_bfq,
 		.seq_show = blkg_print_stat_ios,
 	},
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
 	{
 		.name = "bfq.time",
 		.private = offsetof(struct bfq_group, stats.time),
···
 		.private = offsetof(struct bfq_group, stats.queued),
 		.seq_show = bfqg_print_rwstat,
 	},
-#endif /* CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
 
 	/* the same statistics which cover the bfqg and its descendants */
 	{
···
 		.private = (unsigned long)&blkcg_policy_bfq,
 		.seq_show = blkg_print_stat_ios_recursive,
 	},
-#ifdef CONFIG_DEBUG_BLK_CGROUP
+#ifdef CONFIG_BFQ_CGROUP_DEBUG
 	{
 		.name = "bfq.time_recursive",
 		.private = offsetof(struct bfq_group, stats.time),
···
 		.private = offsetof(struct bfq_group, stats.dequeue),
 		.seq_show = bfqg_print_stat,
 	},
-#endif /* CONFIG_DEBUG_BLK_CGROUP */
+#endif /* CONFIG_BFQ_CGROUP_DEBUG */
 	{ } /* terminate */
 };
 
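The bfq_stat helpers introduced in this file pair a per-CPU counter (cpu_cnt) with an auxiliary count (aux_cnt) that absorbs the totals of other counters, and the recursive prfill callback sums both over every online descendant. The following is a simplified user-space model of that bookkeeping, not the kernel code: all sketch_* names and SKETCH_NR_CPUS are hypothetical, a plain array stands in for struct percpu_counter, a plain integer stands in for atomic64_t, and batching and locking are left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4	/* hypothetical; the kernel uses real per-CPU storage */

struct sketch_stat {
	int64_t cpu_cnt[SKETCH_NR_CPUS];	/* stands in for struct percpu_counter */
	uint64_t aux_cnt;			/* stands in for atomic64_t */
};

/* Zero both the per-CPU slots and the aux count, as bfq_stat_reset() does. */
static void sketch_stat_reset(struct sketch_stat *stat)
{
	for (size_t i = 0; i < SKETCH_NR_CPUS; i++)
		stat->cpu_cnt[i] = 0;
	stat->aux_cnt = 0;
}

/* Add @val to the slot of the CPU the caller is (pretending to be) running on. */
static void sketch_stat_add(struct sketch_stat *stat, unsigned int cpu,
			    uint64_t val)
{
	stat->cpu_cnt[cpu % SKETCH_NR_CPUS] += (int64_t)val;
}

/* Sum all slots and clamp a negative total to 0, like percpu_counter_sum_positive(). */
static uint64_t sketch_stat_read(const struct sketch_stat *stat)
{
	int64_t sum = 0;

	for (size_t i = 0; i < SKETCH_NR_CPUS; i++)
		sum += stat->cpu_cnt[i];
	return sum < 0 ? 0 : (uint64_t)sum;
}

/* Fold @from's count, including its own aux count, into @to's aux count. */
static void sketch_stat_add_aux(struct sketch_stat *to,
				const struct sketch_stat *from)
{
	to->aux_cnt += sketch_stat_read(from) + from->aux_cnt;
}

/* Sum count plus aux count over the entries flagged online, skipping the rest. */
static uint64_t sketch_stat_recursive_sum(const struct sketch_stat *stats,
					  const int *online, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		if (!online[i])
			continue;
		sum += sketch_stat_read(&stats[i]) + stats[i].aux_cnt;
	}
	return sum;
}
```

The point of the aux count in this model is that a group going away can hand its totals to another counter with sketch_stat_add_aux(), so a later recursive sum that skips the offline entry still accounts for its past activity.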

+671 -296

block/bfq-iosched.c

··· 157 157 BFQ_BFQQ_FNS(coop); 158 158 BFQ_BFQQ_FNS(split_coop); 159 159 BFQ_BFQQ_FNS(softrt_update); 160 + BFQ_BFQQ_FNS(has_waker); 160 161 #undef BFQ_BFQQ_FNS \ 161 162 162 163 /* Expiration time of sync (0) and async (1) requests, in ns. */ ··· 1428 1427 * mechanism may be re-designed in such a way to make it possible to 1429 1428 * know whether preemption is needed without needing to update service 1430 1429 * trees). In addition, queue preemptions almost always cause random 1431 - * I/O, and thus loss of throughput. Because of these facts, the next 1432 - * function adopts the following simple scheme to avoid both costly 1433 - * operations and too frequent preemptions: it requests the expiration 1434 - * of the in-service queue (unconditionally) only for queues that need 1435 - * to recover a hole, or that either are weight-raised or deserve to 1436 - * be weight-raised. 1430 + * I/O, which may in turn cause loss of throughput. Finally, there may 1431 + * even be no in-service queue when the next function is invoked (so, 1432 + * no queue to compare timestamps with). Because of these facts, the 1433 + * next function adopts the following simple scheme to avoid costly 1434 + * operations, too frequent preemptions and too many dependencies on 1435 + * the state of the scheduler: it requests the expiration of the 1436 + * in-service queue (unconditionally) only for queues that need to 1437 + * recover a hole. Then it delegates to other parts of the code the 1438 + * responsibility of handling the above case 2. 
1437 1439 */ 1438 1440 static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd, 1439 1441 struct bfq_queue *bfqq, 1440 - bool arrived_in_time, 1441 - bool wr_or_deserves_wr) 1442 + bool arrived_in_time) 1442 1443 { 1443 1444 struct bfq_entity *entity = &bfqq->entity; 1444 1445 ··· 1495 1492 entity->budget = max_t(unsigned long, bfqq->max_budget, 1496 1493 bfq_serv_to_charge(bfqq->next_rq, bfqq)); 1497 1494 bfq_clear_bfqq_non_blocking_wait_rq(bfqq); 1498 - return wr_or_deserves_wr; 1495 + return false; 1499 1496 } 1500 1497 1501 1498 /* ··· 1613 1610 bfqd->bfq_wr_min_idle_time); 1614 1611 } 1615 1612 1613 + 1614 + /* 1615 + * Return true if bfqq is in a higher priority class, or has a higher 1616 + * weight than the in-service queue. 1617 + */ 1618 + static bool bfq_bfqq_higher_class_or_weight(struct bfq_queue *bfqq, 1619 + struct bfq_queue *in_serv_bfqq) 1620 + { 1621 + int bfqq_weight, in_serv_weight; 1622 + 1623 + if (bfqq->ioprio_class < in_serv_bfqq->ioprio_class) 1624 + return true; 1625 + 1626 + if (in_serv_bfqq->entity.parent == bfqq->entity.parent) { 1627 + bfqq_weight = bfqq->entity.weight; 1628 + in_serv_weight = in_serv_bfqq->entity.weight; 1629 + } else { 1630 + if (bfqq->entity.parent) 1631 + bfqq_weight = bfqq->entity.parent->weight; 1632 + else 1633 + bfqq_weight = bfqq->entity.weight; 1634 + if (in_serv_bfqq->entity.parent) 1635 + in_serv_weight = in_serv_bfqq->entity.parent->weight; 1636 + else 1637 + in_serv_weight = in_serv_bfqq->entity.weight; 1638 + } 1639 + 1640 + return bfqq_weight > in_serv_weight; 1641 + } 1642 + 1616 1643 static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd, 1617 1644 struct bfq_queue *bfqq, 1618 1645 int old_wr_coeff, ··· 1687 1654 */ 1688 1655 bfqq_wants_to_preempt = 1689 1656 bfq_bfqq_update_budg_for_activation(bfqd, bfqq, 1690 - arrived_in_time, 1691 - wr_or_deserves_wr); 1657 + arrived_in_time); 1692 1658 1693 1659 /* 1694 1660 * If bfqq happened to be activated in a burst, but has been ··· 
1752 1720 1753 1721 /* 1754 1722 * Expire in-service queue only if preemption may be needed 1755 - * for guarantees. In this respect, the function 1756 - * next_queue_may_preempt just checks a simple, necessary 1757 - * condition, and not a sufficient condition based on 1758 - * timestamps. In fact, for the latter condition to be 1759 - * evaluated, timestamps would need first to be updated, and 1760 - * this operation is quite costly (see the comments on the 1761 - * function bfq_bfqq_update_budg_for_activation). 1723 + * for guarantees. In particular, we care only about two 1724 + * cases. The first is that bfqq has to recover a service 1725 + * hole, as explained in the comments on 1726 + * bfq_bfqq_update_budg_for_activation(), i.e., that 1727 + * bfqq_wants_to_preempt is true. However, if bfqq does not 1728 + * carry time-critical I/O, then bfqq's bandwidth is less 1729 + * important than that of queues that carry time-critical I/O. 1730 + * So, as a further constraint, we consider this case only if 1731 + * bfqq is at least as weight-raised, i.e., at least as time 1732 + * critical, as the in-service queue. 1733 + * 1734 + * The second case is that bfqq is in a higher priority class, 1735 + * or has a higher weight than the in-service queue. If this 1736 + * condition does not hold, we don't care because, even if 1737 + * bfqq does not start to be served immediately, the resulting 1738 + * delay for bfqq's I/O is however lower or much lower than 1739 + * the ideal completion time to be guaranteed to bfqq's I/O. 1740 + * 1741 + * In both cases, preemption is needed only if, according to 1742 + * the timestamps of both bfqq and of the in-service queue, 1743 + * bfqq actually is the next queue to serve. So, to reduce 1744 + * useless preemptions, the return value of 1745 + * next_queue_may_preempt() is considered in the next compound 1746 + * condition too. 
Yet next_queue_may_preempt() just checks a 1747 + * simple, necessary condition for bfqq to be the next queue 1748 + * to serve. In fact, to evaluate a sufficient condition, the 1749 + * timestamps of the in-service queue would need to be 1750 + * updated, and this operation is quite costly (see the 1751 + * comments on bfq_bfqq_update_budg_for_activation()). 1762 1752 */ 1763 - if (bfqd->in_service_queue && bfqq_wants_to_preempt && 1764 - bfqd->in_service_queue->wr_coeff < bfqq->wr_coeff && 1753 + if (bfqd->in_service_queue && 1754 + ((bfqq_wants_to_preempt && 1755 + bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) || 1756 + bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue)) && 1765 1757 next_queue_may_preempt(bfqd)) 1766 1758 bfq_bfqq_expire(bfqd, bfqd->in_service_queue, 1767 1759 false, BFQQE_PREEMPTED); 1760 + } 1761 + 1762 + static void bfq_reset_inject_limit(struct bfq_data *bfqd, 1763 + struct bfq_queue *bfqq) 1764 + { 1765 + /* invalidate baseline total service time */ 1766 + bfqq->last_serv_time_ns = 0; 1767 + 1768 + /* 1769 + * Reset pointer in case we are waiting for 1770 + * some request completion. 1771 + */ 1772 + bfqd->waited_rq = NULL; 1773 + 1774 + /* 1775 + * If bfqq has a short think time, then start by setting the 1776 + * inject limit to 0 prudentially, because the service time of 1777 + * an injected I/O request may be higher than the think time 1778 + * of bfqq, and therefore, if one request was injected when 1779 + * bfqq remains empty, this injected request might delay the 1780 + * service of the next I/O request for bfqq significantly. In 1781 + * case bfqq can actually tolerate some injection, then the 1782 + * adaptive update will however raise the limit soon. This 1783 + * lucky circumstance holds exactly because bfqq has a short 1784 + * think time, and thus, after remaining empty, is likely to 1785 + * get new I/O enqueued---and then completed---before being 1786 + * expired. 
            	   This is the very pattern that gives the
     1787 + 	 * limit-update algorithm the chance to measure the effect of
     1788 + 	 * injection on request service times, and then to update the
     1789 + 	 * limit accordingly.
     1790 + 	 *
     1791 + 	 * However, in the following special case, the inject limit is
     1792 + 	 * left to 1 even if the think time is short: bfqq's I/O is
     1793 + 	 * synchronized with that of some other queue, i.e., bfqq may
     1794 + 	 * receive new I/O only after the I/O of the other queue is
     1795 + 	 * completed. Keeping the inject limit to 1 allows the
     1796 + 	 * blocking I/O to be served while bfqq is in service. And
     1797 + 	 * this is very convenient both for bfqq and for overall
     1798 + 	 * throughput, as explained in detail in the comments in
     1799 + 	 * bfq_update_has_short_ttime().
     1800 + 	 *
     1801 + 	 * On the opposite end, if bfqq has a long think time, then
     1802 + 	 * start directly by 1, because:
     1803 + 	 * a) on the bright side, keeping at most one request in
     1804 + 	 * service in the drive is unlikely to cause any harm to the
     1805 + 	 * latency of bfqq's requests, as the service time of a single
     1806 + 	 * request is likely to be lower than the think time of bfqq;
     1807 + 	 * b) on the downside, after becoming empty, bfqq is likely to
     1808 + 	 * expire before getting its next request. With this request
     1809 + 	 * arrival pattern, it is very hard to sample total service
     1810 + 	 * times and update the inject limit accordingly (see comments
     1811 + 	 * on bfq_update_inject_limit()). So the limit is likely to be
     1812 + 	 * never, or at least seldom, updated. As a consequence, by
     1813 + 	 * setting the limit to 1, we avoid that no injection ever
     1814 + 	 * occurs with bfqq. On the downside, this proactive step
     1815 + 	 * further reduces chances to actually compute the baseline
     1816 + 	 * total service time. Thus it reduces chances to execute the
     1817 + 	 * limit-update algorithm and possibly raise the limit to more
     1818 + 	 * than 1.
     1819 + 	 */
     1820 + 	if (bfq_bfqq_has_short_ttime(bfqq))
     1821 + 		bfqq->inject_limit = 0;
     1822 + 	else
     1823 + 		bfqq->inject_limit = 1;
     1824 +
     1825 + 	bfqq->decrease_time_jif = jiffies;
1768 1826   }
1769 1827
1770 1828   static void bfq_add_request(struct request *rq)
···
1871 1749
1872 1750   	if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) {
1873 1751   		/*
     1752 + 		 * Detect whether bfqq's I/O seems synchronized with
     1753 + 		 * that of some other queue, i.e., whether bfqq, after
     1754 + 		 * remaining empty, happens to receive new I/O only
     1755 + 		 * right after some I/O request of the other queue has
     1756 + 		 * been completed. We call waker queue the other
     1757 + 		 * queue, and we assume, for simplicity, that bfqq may
     1758 + 		 * have at most one waker queue.
     1759 + 		 *
     1760 + 		 * A remarkable throughput boost can be reached by
     1761 + 		 * unconditionally injecting the I/O of the waker
     1762 + 		 * queue, every time a new bfq_dispatch_request
     1763 + 		 * happens to be invoked while I/O is being plugged
     1764 + 		 * for bfqq. In addition to boosting throughput, this
     1765 + 		 * unblocks bfqq's I/O, thereby improving bandwidth
     1766 + 		 * and latency for bfqq. Note that these same results
     1767 + 		 * may be achieved with the general injection
     1768 + 		 * mechanism, but less effectively. For details on
     1769 + 		 * this aspect, see the comments on the choice of the
     1770 + 		 * queue for injection in bfq_select_queue().
     1771 + 		 *
     1772 + 		 * Turning back to the detection of a waker queue, a
     1773 + 		 * queue Q is deemed as a waker queue for bfqq if, for
     1774 + 		 * two consecutive times, bfqq happens to become non
     1775 + 		 * empty right after a request of Q has been
     1776 + 		 * completed. In particular, on the first time, Q is
     1777 + 		 * tentatively set as a candidate waker queue, while
     1778 + 		 * on the second time, the flag
     1779 + 		 * bfq_bfqq_has_waker(bfqq) is set to confirm that Q
     1780 + 		 * is a waker queue for bfqq.
            		   These detection steps
     1781 + 		 * are performed only if bfqq has a long think time,
     1782 + 		 * so as to make it more likely that bfqq's I/O is
     1783 + 		 * actually being blocked by a synchronization. This
     1784 + 		 * last filter, plus the above two-times requirement,
     1785 + 		 * make false positives less likely.
     1786 + 		 *
     1787 + 		 * NOTE
     1788 + 		 *
     1789 + 		 * The sooner a waker queue is detected, the sooner
     1790 + 		 * throughput can be boosted by injecting I/O from the
     1791 + 		 * waker queue. Fortunately, detection is likely to be
     1792 + 		 * actually fast, for the following reasons. While
     1793 + 		 * blocked by synchronization, bfqq has a long think
     1794 + 		 * time. This implies that bfqq's inject limit is at
     1795 + 		 * least equal to 1 (see the comments in
     1796 + 		 * bfq_update_inject_limit()). So, thanks to
     1797 + 		 * injection, the waker queue is likely to be served
     1798 + 		 * during the very first I/O-plugging time interval
     1799 + 		 * for bfqq. This triggers the first step of the
     1800 + 		 * detection mechanism. Thanks again to injection, the
     1801 + 		 * candidate waker queue is then likely to be
     1802 + 		 * confirmed no later than during the next
     1803 + 		 * I/O-plugging interval for bfqq.
     1804 + 		 */
     1805 + 		if (!bfq_bfqq_has_short_ttime(bfqq) &&
     1806 + 		    ktime_get_ns() - bfqd->last_completion <
     1807 + 		    200 * NSEC_PER_USEC) {
     1808 + 			if (bfqd->last_completed_rq_bfqq != bfqq &&
     1809 + 			    bfqd->last_completed_rq_bfqq !=
     1810 + 			    bfqq->waker_bfqq) {
     1811 + 				/*
     1812 + 				 * First synchronization detected with
     1813 + 				 * a candidate waker queue, or with a
     1814 + 				 * different candidate waker queue
     1815 + 				 * from the current one.
     1816 + 				 */
     1817 + 				bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq;
     1818 +
     1819 + 				/*
     1820 + 				 * If the waker queue disappears, then
     1821 + 				 * bfqq->waker_bfqq must be reset. To
     1822 + 				 * this goal, we maintain in each
     1823 + 				 * waker queue a list, woken_list, of
     1824 + 				 * all the queues that reference the
     1825 + 				 * waker queue through their
     1826 + 				 * waker_bfqq pointer.
            				   When the waker
     1827 + 				 * queue exits, the waker_bfqq pointer
     1828 + 				 * of all the queues in the woken_list
     1829 + 				 * is reset.
     1830 + 				 *
     1831 + 				 * In addition, if bfqq is already in
     1832 + 				 * the woken_list of a waker queue,
     1833 + 				 * then, before being inserted into
     1834 + 				 * the woken_list of a new waker
     1835 + 				 * queue, bfqq must be removed from
     1836 + 				 * the woken_list of the old waker
     1837 + 				 * queue.
     1838 + 				 */
     1839 + 				if (!hlist_unhashed(&bfqq->woken_list_node))
     1840 + 					hlist_del_init(&bfqq->woken_list_node);
     1841 + 				hlist_add_head(&bfqq->woken_list_node,
     1842 + 				    &bfqd->last_completed_rq_bfqq->woken_list);
     1843 +
     1844 + 				bfq_clear_bfqq_has_waker(bfqq);
     1845 + 			} else if (bfqd->last_completed_rq_bfqq ==
     1846 + 				   bfqq->waker_bfqq &&
     1847 + 				   !bfq_bfqq_has_waker(bfqq)) {
     1848 + 				/*
     1849 + 				 * synchronization with waker_bfqq
     1850 + 				 * seen for the second time
     1851 + 				 */
     1852 + 				bfq_mark_bfqq_has_waker(bfqq);
     1853 + 			}
     1854 + 		}
     1855 +
     1856 + 		/*
1874 1857   		 * Periodically reset inject limit, to make sure that
1875 1858   		 * the latter eventually drops in case workload
1876 1859   		 * changes, see step (3) in the comments on
1877 1860   		 * bfq_update_inject_limit().
1878 1861   		 */
1879 1862   		if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
1880      - 					     msecs_to_jiffies(1000))) {
1881      - 			/* invalidate baseline total service time */
1882      - 			bfqq->last_serv_time_ns = 0;
1883      -
1884      - 			/*
1885      - 			 * Reset pointer in case we are waiting for
1886      - 			 * some request completion.
1887      - 			 */
1888      - 			bfqd->waited_rq = NULL;
1889      -
1890      - 			/*
1891      - 			 * If bfqq has a short think time, then start
1892      - 			 * by setting the inject limit to 0
1893      - 			 * prudentially, because the service time of
1894      - 			 * an injected I/O request may be higher than
1895      - 			 * the think time of bfqq, and therefore, if
1896      - 			 * one request was injected when bfqq remains
1897      - 			 * empty, this injected request might delay
1898      - 			 * the service of the next I/O request for
1899      - 			 * bfqq significantly.
            			   In case bfqq can
1900      - 			 * actually tolerate some injection, then the
1901      - 			 * adaptive update will however raise the
1902      - 			 * limit soon. This lucky circumstance holds
1903      - 			 * exactly because bfqq has a short think
1904      - 			 * time, and thus, after remaining empty, is
1905      - 			 * likely to get new I/O enqueued---and then
1906      - 			 * completed---before being expired. This is
1907      - 			 * the very pattern that gives the
1908      - 			 * limit-update algorithm the chance to
1909      - 			 * measure the effect of injection on request
1910      - 			 * service times, and then to update the limit
1911      - 			 * accordingly.
1912      - 			 *
1913      - 			 * On the opposite end, if bfqq has a long
1914      - 			 * think time, then start directly by 1,
1915      - 			 * because:
1916      - 			 * a) on the bright side, keeping at most one
1917      - 			 * request in service in the drive is unlikely
1918      - 			 * to cause any harm to the latency of bfqq's
1919      - 			 * requests, as the service time of a single
1920      - 			 * request is likely to be lower than the
1921      - 			 * think time of bfqq;
1922      - 			 * b) on the downside, after becoming empty,
1923      - 			 * bfqq is likely to expire before getting its
1924      - 			 * next request. With this request arrival
1925      - 			 * pattern, it is very hard to sample total
1926      - 			 * service times and update the inject limit
1927      - 			 * accordingly (see comments on
1928      - 			 * bfq_update_inject_limit()). So the limit is
1929      - 			 * likely to be never, or at least seldom,
1930      - 			 * updated. As a consequence, by setting the
1931      - 			 * limit to 1, we avoid that no injection ever
1932      - 			 * occurs with bfqq. On the downside, this
1933      - 			 * proactive step further reduces chances to
1934      - 			 * actually compute the baseline total service
1935      - 			 * time. Thus it reduces chances to execute the
1936      - 			 * limit-update algorithm and possibly raise the
1937      - 			 * limit to more than 1.
1938      - 			 */
1939      - 			if (bfq_bfqq_has_short_ttime(bfqq))
1940      - 				bfqq->inject_limit = 0;
1941      - 			else
1942      - 				bfqq->inject_limit = 1;
1943      - 			bfqq->decrease_time_jif = jiffies;
1944      - 		}
     1863 + 					     msecs_to_jiffies(1000)))
     1864 + 			bfq_reset_inject_limit(bfqd, bfqq);
1945 1865
1946 1866   		/*
1947 1867   		 * The following conditions must hold to setup a new
···
2191 2027
2192 2028   }
2193 2029
2194      - static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
     2030 + static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
     2031 + 		unsigned int nr_segs)
2195 2032   {
2196 2033   	struct request_queue *q = hctx->queue;
2197 2034   	struct bfq_data *bfqd = q->elevator->elevator_data;
···
2215 2050   		bfqd->bio_bfqq = NULL;
2216 2051   	bfqd->bio_bic = bic;
2217 2052
2218      - 	ret = blk_mq_sched_try_merge(q, bio, &free);
     2053 + 	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
2219 2054
2220 2055   	if (free)
2221 2056   		blk_mq_free_request(free);
···
2678 2513   		 * to enjoy weight raising if split soon.
2679 2514   		 */
2680 2515   		bic->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
     2516 + 		bic->saved_wr_start_at_switch_to_srt = bfq_smallest_from_now();
2681 2517   		bic->saved_wr_cur_max_time = bfq_wr_duration(bfqq->bfqd);
2682 2518   		bic->saved_last_wr_start_finish = jiffies;
2683 2519   	} else {
···
3211 3045   	bfq_remove_request(q, rq);
3212 3046   }
3213 3047
3214      - static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
     3048 + /*
     3049 +  * There is a case where idling does not have to be performed for
     3050 +  * throughput concerns, but to preserve the throughput share of
     3051 +  * the process associated with bfqq.
     3052 +  *
     3053 +  * To introduce this case, we can note that allowing the drive
     3054 +  * to enqueue more than one request at a time, and hence
     3055 +  * delegating de facto final scheduling decisions to the
     3056 +  * drive's internal scheduler, entails loss of control on the
     3057 +  * actual request service order.
               In particular, the critical
     3058 +  * situation is when requests from different processes happen
     3059 +  * to be present, at the same time, in the internal queue(s)
     3060 +  * of the drive. In such a situation, the drive, by deciding
     3061 +  * the service order of the internally-queued requests, does
     3062 +  * determine also the actual throughput distribution among
     3063 +  * these processes. But the drive typically has no notion or
     3064 +  * concern about per-process throughput distribution, and
     3065 +  * makes its decisions only on a per-request basis. Therefore,
     3066 +  * the service distribution enforced by the drive's internal
     3067 +  * scheduler is likely to coincide with the desired throughput
     3068 +  * distribution only in a completely symmetric, or favorably
     3069 +  * skewed scenario where:
     3070 +  * (i-a) each of these processes must get the same throughput as
     3071 +  * the others,
     3072 +  * (i-b) in case (i-a) does not hold, it holds that the process
     3073 +  * associated with bfqq must receive a lower or equal
     3074 +  * throughput than any of the other processes;
     3075 +  * (ii) the I/O of each process has the same properties, in
     3076 +  * terms of locality (sequential or random), direction
     3077 +  * (reads or writes), request sizes, greediness
     3078 +  * (from I/O-bound to sporadic), and so on;
     3079 +
     3080 +  * In fact, in such a scenario, the drive tends to treat the requests
     3081 +  * of each process in about the same way as the requests of the
     3082 +  * others, and thus to provide each of these processes with about the
     3083 +  * same throughput. This is exactly the desired throughput
     3084 +  * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
     3085 +  * even more convenient distribution for (the process associated with)
     3086 +  * bfqq.
     3087 +  *
     3088 +  * In contrast, in any asymmetric or unfavorable scenario, device
     3089 +  * idling (I/O-dispatch plugging) is certainly needed to guarantee
     3090 +  * that bfqq receives its assigned fraction of the device throughput
     3091 +  * (see [1] for details).
     3092 +  *
     3093 +  * The problem is that idling may significantly reduce throughput with
     3094 +  * certain combinations of types of I/O and devices. An important
     3095 +  * example is sync random I/O on flash storage with command
     3096 +  * queueing. So, unless bfqq falls in cases where idling also boosts
     3097 +  * throughput, it is important to check conditions (i-a), i(-b) and
     3098 +  * (ii) accurately, so as to avoid idling when not strictly needed for
     3099 +  * service guarantees.
     3100 +  *
     3101 +  * Unfortunately, it is extremely difficult to thoroughly check
     3102 +  * condition (ii). And, in case there are active groups, it becomes
     3103 +  * very difficult to check conditions (i-a) and (i-b) too. In fact,
     3104 +  * if there are active groups, then, for conditions (i-a) or (i-b) to
     3105 +  * become false 'indirectly', it is enough that an active group
     3106 +  * contains more active processes or sub-groups than some other active
     3107 +  * group. More precisely, for conditions (i-a) or (i-b) to become
     3108 +  * false because of such a group, it is not even necessary that the
     3109 +  * group is (still) active: it is sufficient that, even if the group
     3110 +  * has become inactive, some of its descendant processes still have
     3111 +  * some request already dispatched but still waiting for
     3112 +  * completion. In fact, requests have still to be guaranteed their
     3113 +  * share of the throughput even after being dispatched.
               In this
     3114 +  * respect, it is easy to show that, if a group frequently becomes
     3115 +  * inactive while still having in-flight requests, and if, when this
     3116 +  * happens, the group is not considered in the calculation of whether
     3117 +  * the scenario is asymmetric, then the group may fail to be
     3118 +  * guaranteed its fair share of the throughput (basically because
     3119 +  * idling may not be performed for the descendant processes of the
     3120 +  * group, but it had to be). We address this issue with the following
     3121 +  * bi-modal behavior, implemented in the function
     3122 +  * bfq_asymmetric_scenario().
     3123 +  *
     3124 +  * If there are groups with requests waiting for completion
     3125 +  * (as commented above, some of these groups may even be
     3126 +  * already inactive), then the scenario is tagged as
     3127 +  * asymmetric, conservatively, without checking any of the
     3128 +  * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
     3129 +  * This behavior matches also the fact that groups are created
     3130 +  * exactly if controlling I/O is a primary concern (to
     3131 +  * preserve bandwidth and latency guarantees).
     3132 +  *
     3133 +  * On the opposite end, if there are no groups with requests waiting
     3134 +  * for completion, then only conditions (i-a) and (i-b) are actually
     3135 +  * controlled, i.e., provided that conditions (i-a) or (i-b) holds,
     3136 +  * idling is not performed, regardless of whether condition (ii)
     3137 +  * holds. In other words, only if conditions (i-a) and (i-b) do not
     3138 +  * hold, then idling is allowed, and the device tends to be prevented
     3139 +  * from queueing many requests, possibly of several processes. Since
     3140 +  * there are no groups with requests waiting for completion, then, to
     3141 +  * control conditions (i-a) and (i-b) it is enough to check just
     3142 +  * whether all the queues with requests waiting for completion also
     3143 +  * have the same weight.
     3144 +  *
     3145 +  * Not checking condition (ii) evidently exposes bfqq to the
     3146 +  * risk of getting less throughput than its fair share.
     3147 +  * However, for queues with the same weight, a further
     3148 +  * mechanism, preemption, mitigates or even eliminates this
     3149 +  * problem. And it does so without consequences on overall
     3150 +  * throughput. This mechanism and its benefits are explained
     3151 +  * in the next three paragraphs.
     3152 +  *
     3153 +  * Even if a queue, say Q, is expired when it remains idle, Q
     3154 +  * can still preempt the new in-service queue if the next
     3155 +  * request of Q arrives soon (see the comments on
     3156 +  * bfq_bfqq_update_budg_for_activation). If all queues and
     3157 +  * groups have the same weight, this form of preemption,
     3158 +  * combined with the hole-recovery heuristic described in the
     3159 +  * comments on function bfq_bfqq_update_budg_for_activation,
     3160 +  * are enough to preserve a correct bandwidth distribution in
     3161 +  * the mid term, even without idling. In fact, even if not
     3162 +  * idling allows the internal queues of the device to contain
     3163 +  * many requests, and thus to reorder requests, we can rather
     3164 +  * safely assume that the internal scheduler still preserves a
     3165 +  * minimum of mid-term fairness.
     3166 +  *
     3167 +  * More precisely, this preemption-based, idleless approach
     3168 +  * provides fairness in terms of IOPS, and not sectors per
     3169 +  * second. This can be seen with a simple example. Suppose
     3170 +  * that there are two queues with the same weight, but that
     3171 +  * the first queue receives requests of 8 sectors, while the
     3172 +  * second queue receives requests of 1024 sectors. In
     3173 +  * addition, suppose that each of the two queues contains at
     3174 +  * most one request at a time, which implies that each queue
     3175 +  * always remains idle after it is served. Finally, after
     3176 +  * remaining idle, each queue receives very quickly a new
     3177 +  * request.
               It follows that the two queues are served
     3178 +  * alternatively, preempting each other if needed. This
     3179 +  * implies that, although both queues have the same weight,
     3180 +  * the queue with large requests receives a service that is
     3181 +  * 1024/8 times as high as the service received by the other
     3182 +  * queue.
     3183 +  *
     3184 +  * The motivation for using preemption instead of idling (for
     3185 +  * queues with the same weight) is that, by not idling,
     3186 +  * service guarantees are preserved (completely or at least in
     3187 +  * part) without minimally sacrificing throughput. And, if
     3188 +  * there is no active group, then the primary expectation for
     3189 +  * this device is probably a high throughput.
     3190 +  *
     3191 +  * We are now left only with explaining the additional
     3192 +  * compound condition that is checked below for deciding
     3193 +  * whether the scenario is asymmetric. To explain this
     3194 +  * compound condition, we need to add that the function
     3195 +  * bfq_asymmetric_scenario checks the weights of only
     3196 +  * non-weight-raised queues, for efficiency reasons (see
     3197 +  * comments on bfq_weights_tree_add()). Then the fact that
     3198 +  * bfqq is weight-raised is checked explicitly here. More
     3199 +  * precisely, the compound condition below takes into account
     3200 +  * also the fact that, even if bfqq is being weight-raised,
     3201 +  * the scenario is still symmetric if all queues with requests
     3202 +  * waiting for completion happen to be
     3203 +  * weight-raised. Actually, we should be even more precise
     3204 +  * here, and differentiate between interactive weight raising
     3205 +  * and soft real-time weight raising.
     3206 +  *
     3207 +  * As a side note, it is worth considering that the above
     3208 +  * device-idling countermeasures may however fail in the
     3209 +  * following unlucky scenario: if idling is (correctly)
     3210 +  * disabled in a time period during which all symmetry
     3211 +  * sub-conditions hold, and hence the device is allowed to
     3212 +  * enqueue many requests, but at some later point in time some
     3213 +  * sub-condition stops to hold, then it may become impossible
     3214 +  * to let requests be served in the desired order until all
     3215 +  * the requests already queued in the device have been served.
     3216 +  */
     3217 + static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
     3218 + 						 struct bfq_queue *bfqq)
     3219 + {
     3220 + 	return (bfqq->wr_coeff > 1 &&
     3221 + 		bfqd->wr_busy_queues <
     3222 + 		bfq_tot_busy_queues(bfqd)) ||
     3223 + 		bfq_asymmetric_scenario(bfqd, bfqq);
     3224 + }
     3225 +
     3226 + static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
     3227 + 			      enum bfqq_expiration reason)
3215 3228   {
3216 3229   	/*
3217 3230   	 * If this bfqq is shared between multiple processes, check
···
3401 3056   	if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
3402 3057   		bfq_mark_bfqq_split_coop(bfqq);
3403 3058
3404      - 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
     3059 + 	/*
     3060 + 	 * Consider queues with a higher finish virtual time than
     3061 + 	 * bfqq. If idling_needed_for_service_guarantees(bfqq) returns
     3062 + 	 * true, then bfqq's bandwidth would be violated if an
     3063 + 	 * uncontrolled amount of I/O from these queues were
     3064 + 	 * dispatched while bfqq is waiting for its new I/O to
     3065 + 	 * arrive. This is exactly what may happen if this is a forced
     3066 + 	 * expiration caused by a preemption attempt, and if bfqq is
     3067 + 	 * not re-scheduled. To prevent this from happening, re-queue
     3068 + 	 * bfqq if it needs I/O-dispatch plugging, even if it is
     3069 + 	 * empty. By doing so, bfqq is granted to be served before the
     3070 + 	 * above queues (provided that bfqq is of course eligible).
     3071 + 	 */
     3072 + 	if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
     3073 + 	    !(reason == BFQQE_PREEMPTED &&
     3074 + 	      idling_needed_for_service_guarantees(bfqd, bfqq))) {
3405 3075   		if (bfqq->dispatched == 0)
3406 3076   			/*
3407 3077   			 * Overloading budget_timeout field to store
···
3433 3073   		 * Resort priority tree of potential close cooperators.
3434 3074   		 * See comments on bfq_pos_tree_add_move() for the unlikely().
3435 3075   		 */
3436      - 		if (unlikely(!bfqd->nonrot_with_queueing))
     3076 + 		if (unlikely(!bfqd->nonrot_with_queueing &&
     3077 + 			     !RB_EMPTY_ROOT(&bfqq->sort_list)))
3437 3078   			bfq_pos_tree_add_move(bfqd, bfqq);
3438 3079   	}
3439 3080
···
3935 3574   	 * reason.
3936 3575   	 */
3937 3576   	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
3938      - 	if (__bfq_bfqq_expire(bfqd, bfqq))
     3577 + 	if (__bfq_bfqq_expire(bfqd, bfqq, reason))
3939 3578   		/* bfqq is gone, no more actions on it */
3940 3579   		return;
3941 3580
···
4079 3718   	 */
4080 3719   	return idling_boosts_thr &&
4081 3720   		bfqd->wr_busy_queues == 0;
4082      - }
4083      -
4084      - /*
4085      -  * There is a case where idling does not have to be performed for
4086      -  * throughput concerns, but to preserve the throughput share of
4087      -  * the process associated with bfqq.
4088      -  *
4089      -  * To introduce this case, we can note that allowing the drive
4090      -  * to enqueue more than one request at a time, and hence
4091      -  * delegating de facto final scheduling decisions to the
4092      -  * drive's internal scheduler, entails loss of control on the
4093      -  * actual request service order. In particular, the critical
4094      -  * situation is when requests from different processes happen
4095      -  * to be present, at the same time, in the internal queue(s)
4096      -  * of the drive. In such a situation, the drive, by deciding
4097      -  * the service order of the internally-queued requests, does
4098      -  * determine also the actual throughput distribution among
4099      -  * these processes.
               But the drive typically has no notion or
4100      -  * concern about per-process throughput distribution, and
4101      -  * makes its decisions only on a per-request basis. Therefore,
4102      -  * the service distribution enforced by the drive's internal
4103      -  * scheduler is likely to coincide with the desired throughput
4104      -  * distribution only in a completely symmetric, or favorably
4105      -  * skewed scenario where:
4106      -  * (i-a) each of these processes must get the same throughput as
4107      -  * the others,
4108      -  * (i-b) in case (i-a) does not hold, it holds that the process
4109      -  * associated with bfqq must receive a lower or equal
4110      -  * throughput than any of the other processes;
4111      -  * (ii) the I/O of each process has the same properties, in
4112      -  * terms of locality (sequential or random), direction
4113      -  * (reads or writes), request sizes, greediness
4114      -  * (from I/O-bound to sporadic), and so on;
4115      -
4116      -  * In fact, in such a scenario, the drive tends to treat the requests
4117      -  * of each process in about the same way as the requests of the
4118      -  * others, and thus to provide each of these processes with about the
4119      -  * same throughput. This is exactly the desired throughput
4120      -  * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
4121      -  * even more convenient distribution for (the process associated with)
4122      -  * bfqq.
4123      -  *
4124      -  * In contrast, in any asymmetric or unfavorable scenario, device
4125      -  * idling (I/O-dispatch plugging) is certainly needed to guarantee
4126      -  * that bfqq receives its assigned fraction of the device throughput
4127      -  * (see [1] for details).
4128      -  *
4129      -  * The problem is that idling may significantly reduce throughput with
4130      -  * certain combinations of types of I/O and devices. An important
4131      -  * example is sync random I/O on flash storage with command
4132      -  * queueing.
               So, unless bfqq falls in cases where idling also boosts
4133      -  * throughput, it is important to check conditions (i-a), i(-b) and
4134      -  * (ii) accurately, so as to avoid idling when not strictly needed for
4135      -  * service guarantees.
4136      -  *
4137      -  * Unfortunately, it is extremely difficult to thoroughly check
4138      -  * condition (ii). And, in case there are active groups, it becomes
4139      -  * very difficult to check conditions (i-a) and (i-b) too. In fact,
4140      -  * if there are active groups, then, for conditions (i-a) or (i-b) to
4141      -  * become false 'indirectly', it is enough that an active group
4142      -  * contains more active processes or sub-groups than some other active
4143      -  * group. More precisely, for conditions (i-a) or (i-b) to become
4144      -  * false because of such a group, it is not even necessary that the
4145      -  * group is (still) active: it is sufficient that, even if the group
4146      -  * has become inactive, some of its descendant processes still have
4147      -  * some request already dispatched but still waiting for
4148      -  * completion. In fact, requests have still to be guaranteed their
4149      -  * share of the throughput even after being dispatched. In this
4150      -  * respect, it is easy to show that, if a group frequently becomes
4151      -  * inactive while still having in-flight requests, and if, when this
4152      -  * happens, the group is not considered in the calculation of whether
4153      -  * the scenario is asymmetric, then the group may fail to be
4154      -  * guaranteed its fair share of the throughput (basically because
4155      -  * idling may not be performed for the descendant processes of the
4156      -  * group, but it had to be). We address this issue with the following
4157      -  * bi-modal behavior, implemented in the function
4158      -  * bfq_asymmetric_scenario().
4159      -  *
4160      -  * If there are groups with requests waiting for completion
4161      -  * (as commented above, some of these groups may even be
4162      -  * already inactive), then the scenario is tagged as
4163      -  * asymmetric, conservatively, without checking any of the
4164      -  * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
4165      -  * This behavior matches also the fact that groups are created
4166      -  * exactly if controlling I/O is a primary concern (to
4167      -  * preserve bandwidth and latency guarantees).
4168      -  *
4169      -  * On the opposite end, if there are no groups with requests waiting
4170      -  * for completion, then only conditions (i-a) and (i-b) are actually
4171      -  * controlled, i.e., provided that conditions (i-a) or (i-b) holds,
4172      -  * idling is not performed, regardless of whether condition (ii)
4173      -  * holds. In other words, only if conditions (i-a) and (i-b) do not
4174      -  * hold, then idling is allowed, and the device tends to be prevented
4175      -  * from queueing many requests, possibly of several processes. Since
4176      -  * there are no groups with requests waiting for completion, then, to
4177      -  * control conditions (i-a) and (i-b) it is enough to check just
4178      -  * whether all the queues with requests waiting for completion also
4179      -  * have the same weight.
4180      -  *
4181      -  * Not checking condition (ii) evidently exposes bfqq to the
4182      -  * risk of getting less throughput than its fair share.
4183      -  * However, for queues with the same weight, a further
4184      -  * mechanism, preemption, mitigates or even eliminates this
4185      -  * problem. And it does so without consequences on overall
4186      -  * throughput. This mechanism and its benefits are explained
4187      -  * in the next three paragraphs.
4188      -  *
4189      -  * Even if a queue, say Q, is expired when it remains idle, Q
4190      -  * can still preempt the new in-service queue if the next
4191      -  * request of Q arrives soon (see the comments on
4192      -  * bfq_bfqq_update_budg_for_activation).
               If all queues and
4193      -  * groups have the same weight, this form of preemption,
4194      -  * combined with the hole-recovery heuristic described in the
4195      -  * comments on function bfq_bfqq_update_budg_for_activation,
4196      -  * are enough to preserve a correct bandwidth distribution in
4197      -  * the mid term, even without idling. In fact, even if not
4198      -  * idling allows the internal queues of the device to contain
4199      -  * many requests, and thus to reorder requests, we can rather
4200      -  * safely assume that the internal scheduler still preserves a
4201      -  * minimum of mid-term fairness.
4202      -  *
4203      -  * More precisely, this preemption-based, idleless approach
4204      -  * provides fairness in terms of IOPS, and not sectors per
4205      -  * second. This can be seen with a simple example. Suppose
4206      -  * that there are two queues with the same weight, but that
4207      -  * the first queue receives requests of 8 sectors, while the
4208      -  * second queue receives requests of 1024 sectors. In
4209      -  * addition, suppose that each of the two queues contains at
4210      -  * most one request at a time, which implies that each queue
4211      -  * always remains idle after it is served. Finally, after
4212      -  * remaining idle, each queue receives very quickly a new
4213      -  * request. It follows that the two queues are served
4214      -  * alternatively, preempting each other if needed. This
4215      -  * implies that, although both queues have the same weight,
4216      -  * the queue with large requests receives a service that is
4217      -  * 1024/8 times as high as the service received by the other
4218      -  * queue.
4219      -  *
4220      -  * The motivation for using preemption instead of idling (for
4221      -  * queues with the same weight) is that, by not idling,
4222      -  * service guarantees are preserved (completely or at least in
4223      -  * part) without minimally sacrificing throughput. And, if
4224      -  * there is no active group, then the primary expectation for
4225      -  * this device is probably a high throughput.
4226      -  *
4227      -  * We are now left only with explaining the additional
4228      -  * compound condition that is checked below for deciding
4229      -  * whether the scenario is asymmetric. To explain this
4230      -  * compound condition, we need to add that the function
4231      -  * bfq_asymmetric_scenario checks the weights of only
4232      -  * non-weight-raised queues, for efficiency reasons (see
4233      -  * comments on bfq_weights_tree_add()). Then the fact that
4234      -  * bfqq is weight-raised is checked explicitly here. More
4235      -  * precisely, the compound condition below takes into account
4236      -  * also the fact that, even if bfqq is being weight-raised,
4237      -  * the scenario is still symmetric if all queues with requests
4238      -  * waiting for completion happen to be
4239      -  * weight-raised. Actually, we should be even more precise
4240      -  * here, and differentiate between interactive weight raising
4241      -  * and soft real-time weight raising.
4242      -  *
4243      -  * As a side note, it is worth considering that the above
4244      -  * device-idling countermeasures may however fail in the
4245      -  * following unlucky scenario: if idling is (correctly)
4246      -  * disabled in a time period during which all symmetry
4247      -  * sub-conditions hold, and hence the device is allowed to
4248      -  * enqueue many requests, but at some later point in time some
4249      -  * sub-condition stops to hold, then it may become impossible
4250      -  * to let requests be served in the desired order until all
4251      -  * the requests already queued in the device have been served.
4252 - */ 4253 - static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd, 4254 - struct bfq_queue *bfqq) 4255 - { 4256 - return (bfqq->wr_coeff > 1 && 4257 - bfqd->wr_busy_queues < 4258 - bfq_tot_busy_queues(bfqd)) || 4259 - bfq_asymmetric_scenario(bfqd, bfqq); 4260 3721 } 4261 3722 4262 3723 /* ··· 4339 4156 (bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) { 4340 4157 struct bfq_queue *async_bfqq = 4341 4158 bfqq->bic && bfqq->bic->bfqq[0] && 4342 - bfq_bfqq_busy(bfqq->bic->bfqq[0]) ? 4159 + bfq_bfqq_busy(bfqq->bic->bfqq[0]) && 4160 + bfqq->bic->bfqq[0]->next_rq ? 4343 4161 bfqq->bic->bfqq[0] : NULL; 4344 4162 4345 4163 /* 4346 - * If the process associated with bfqq has also async 4347 - * I/O pending, then inject it 4348 - * unconditionally. Injecting I/O from the same 4349 - * process can cause no harm to the process. On the 4350 - * contrary, it can only increase bandwidth and reduce 4351 - * latency for the process. 4164 + * The next three mutually-exclusive ifs decide 4165 + * whether to try injection, and choose the queue to 4166 + * pick an I/O request from. 4167 + * 4168 + * The first if checks whether the process associated 4169 + * with bfqq has also async I/O pending. If so, it 4170 + * injects such I/O unconditionally. Injecting async 4171 + * I/O from the same process can cause no harm to the 4172 + * process. On the contrary, it can only increase 4173 + * bandwidth and reduce latency for the process. 4174 + * 4175 + * The second if checks whether there happens to be a 4176 + * non-empty waker queue for bfqq, i.e., a queue whose 4177 + * I/O needs to be completed for bfqq to receive new 4178 + * I/O. This happens, e.g., if bfqq is associated with 4179 + * a process that does some sync. A sync generates 4180 + * extra blocking I/O, which must be completed before 4181 + * the process associated with bfqq can go on with its 4182 + * I/O. 
If the I/O of the waker queue is not served, 4183 + * then bfqq remains empty, and no I/O is dispatched, 4184 + * until the idle timeout fires for bfqq. This is 4185 + * likely to result in lower bandwidth and higher 4186 + * latencies for bfqq, and in a severe loss of total 4187 + * throughput. The best action to take is therefore to 4188 + * serve the waker queue as soon as possible. So do it 4189 + * (without relying on the third alternative below for 4190 + * eventually serving waker_bfqq's I/O; see the last 4191 + * paragraph for further details). This systematic 4192 + * injection of I/O from the waker queue does not 4193 + * cause any delay to bfqq's I/O. On the contrary, 4194 + * next bfqq's I/O is brought forward dramatically, 4195 + * for it is not blocked for milliseconds. 4196 + * 4197 + * The third if checks whether bfqq is a queue for 4198 + * which it is better to avoid injection. It is so if 4199 + * bfqq delivers more throughput when served without 4200 + * any further I/O from other queues in the middle, or 4201 + * if the service times of bfqq's I/O requests both 4202 + * count more than overall throughput, and may be 4203 + * easily increased by injection (this happens if bfqq 4204 + * has a short think time). If none of these 4205 + * conditions holds, then a candidate queue for 4206 + * injection is looked for through 4207 + * bfq_choose_bfqq_for_injection(). Note that the 4208 + * latter may return NULL (for example if the inject 4209 + * limit for bfqq is currently 0). 4210 + * 4211 + * NOTE: motivation for the second alternative 4212 + * 4213 + * Thanks to the way the inject limit is updated in 4214 + * bfq_update_has_short_ttime(), it is rather likely 4215 + * that, if I/O is being plugged for bfqq and the 4216 + * waker queue has pending I/O requests that are 4217 + * blocking bfqq's I/O, then the third alternative 4218 + * above lets the waker queue get served before the 4219 + * I/O-plugging timeout fires. 
So one may deem the 4220 + * second alternative superfluous. It is not, because 4221 + * the third alternative may be way less effective in 4222 + * case of a synchronization. For two main 4223 + * reasons. First, throughput may be low because the 4224 + * inject limit may be too low to guarantee the same 4225 + * amount of injected I/O, from the waker queue or 4226 + * other queues, that the second alternative 4227 + * guarantees (the second alternative unconditionally 4228 + * injects a pending I/O request of the waker queue 4229 + * for each bfq_dispatch_request()). Second, with the 4230 + * third alternative, the duration of the plugging, 4231 + * i.e., the time before bfqq finally receives new I/O, 4232 + * may not be minimized, because the waker queue may 4233 + * happen to be served only after other queues. 4352 4234 */ 4353 4235 if (async_bfqq && 4354 4236 icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic && 4355 4237 bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <= 4356 4238 bfq_bfqq_budget_left(async_bfqq)) 4357 4239 bfqq = bfqq->bic->bfqq[0]; 4240 + else if (bfq_bfqq_has_waker(bfqq) && 4241 + bfq_bfqq_busy(bfqq->waker_bfqq) && 4242 + bfqq->next_rq && 4243 + bfq_serv_to_charge(bfqq->waker_bfqq->next_rq, 4244 + bfqq->waker_bfqq) <= 4245 + bfq_bfqq_budget_left(bfqq->waker_bfqq) 4246 + ) 4247 + bfqq = bfqq->waker_bfqq; 4358 4248 else if (!idling_boosts_thr_without_issues(bfqd, bfqq) && 4359 4249 (bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 || 4360 4250 !bfq_bfqq_has_short_ttime(bfqq))) ··· 4659 4403 return rq; 4660 4404 } 4661 4405 4662 - #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP) 4406 + #ifdef CONFIG_BFQ_CGROUP_DEBUG 4663 4407 static void bfq_update_dispatch_stats(struct request_queue *q, 4664 4408 struct request *rq, 4665 4409 struct bfq_queue *in_serv_queue, ··· 4709 4453 struct request *rq, 4710 4454 struct bfq_queue *in_serv_queue, 4711 4455 bool idle_timer_disabled) {} 4712 - #endif 4456 + #endif /* 
CONFIG_BFQ_CGROUP_DEBUG */ 4713 4457 4714 4458 static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx) 4715 4459 { ··· 4816 4560 4817 4561 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq) 4818 4562 { 4563 + struct bfq_queue *item; 4564 + struct hlist_node *n; 4565 + 4819 4566 if (bfqq == bfqd->in_service_queue) { 4820 - __bfq_bfqq_expire(bfqd, bfqq); 4567 + __bfq_bfqq_expire(bfqd, bfqq, BFQQE_BUDGET_TIMEOUT); 4821 4568 bfq_schedule_dispatch(bfqd); 4822 4569 } 4823 4570 4824 4571 bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, bfqq->ref); 4825 4572 4826 4573 bfq_put_cooperator(bfqq); 4574 + 4575 + /* remove bfqq from woken list */ 4576 + if (!hlist_unhashed(&bfqq->woken_list_node)) 4577 + hlist_del_init(&bfqq->woken_list_node); 4578 + 4579 + /* reset waker for all queues in woken list */ 4580 + hlist_for_each_entry_safe(item, n, &bfqq->woken_list, 4581 + woken_list_node) { 4582 + item->waker_bfqq = NULL; 4583 + bfq_clear_bfqq_has_waker(item); 4584 + hlist_del_init(&item->woken_list_node); 4585 + } 4827 4586 4828 4587 bfq_put_queue(bfqq); /* release process reference */ 4829 4588 } ··· 4855 4584 unsigned long flags; 4856 4585 4857 4586 spin_lock_irqsave(&bfqd->lock, flags); 4587 + bfqq->bic = NULL; 4858 4588 bfq_exit_bfqq(bfqd, bfqq); 4859 4589 bic_set_bfqq(bic, NULL, is_sync); 4860 4590 spin_unlock_irqrestore(&bfqd->lock, flags); ··· 4959 4687 RB_CLEAR_NODE(&bfqq->entity.rb_node); 4960 4688 INIT_LIST_HEAD(&bfqq->fifo); 4961 4689 INIT_HLIST_NODE(&bfqq->burst_list_node); 4690 + INIT_HLIST_NODE(&bfqq->woken_list_node); 4691 + INIT_HLIST_HEAD(&bfqq->woken_list); 4962 4692 4963 4693 bfqq->ref = 0; 4964 4694 bfqq->bfqd = bfqd; ··· 5128 4854 struct bfq_queue *bfqq, 5129 4855 struct bfq_io_cq *bic) 5130 4856 { 5131 - bool has_short_ttime = true; 4857 + bool has_short_ttime = true, state_changed; 5132 4858 5133 4859 /* 5134 4860 * No need to update has_short_ttime if bfqq is async or in ··· 5153 4879 bfqq->ttime.ttime_mean > 
bfqd->bfq_slice_idle)) 5154 4880 has_short_ttime = false; 5155 4881 5156 - bfq_log_bfqq(bfqd, bfqq, "update_has_short_ttime: has_short_ttime %d", 5157 - has_short_ttime); 4882 + state_changed = has_short_ttime != bfq_bfqq_has_short_ttime(bfqq); 5158 4883 5159 4884 if (has_short_ttime) 5160 4885 bfq_mark_bfqq_has_short_ttime(bfqq); 5161 4886 else 5162 4887 bfq_clear_bfqq_has_short_ttime(bfqq); 4888 + 4889 + /* 4890 + * Until the base value for the total service time gets 4891 + * finally computed for bfqq, the inject limit does depend on 4892 + * the think-time state (short|long). In particular, the limit 4893 + * is 0 or 1 if the think time is deemed, respectively, as 4894 + * short or long (details in the comments in 4895 + * bfq_update_inject_limit()). Accordingly, the next 4896 + * instructions reset the inject limit if the think-time state 4897 + * has changed and the above base value is still to be 4898 + * computed. 4899 + * 4900 + * However, the reset is performed only if more than 100 ms 4901 + * have elapsed since the last update of the inject limit, or 4902 + * (inclusive) if the change is from short to long think 4903 + * time. The reason for this waiting is as follows. 4904 + * 4905 + * bfqq may have a long think time because of a 4906 + * synchronization with some other queue, i.e., because the 4907 + * I/O of some other queue may need to be completed for bfqq 4908 + * to receive new I/O. Details in the comments on the choice 4909 + * of the queue for injection in bfq_select_queue(). 4910 + * 4911 + * As stressed in those comments, if such a synchronization is 4912 + * actually in place, then, without injection on bfqq, the 4913 + * blocking I/O cannot happen to served while bfqq is in 4914 + * service. As a consequence, if bfqq is granted 4915 + * I/O-dispatch-plugging, then bfqq remains empty, and no I/O 4916 + * is dispatched, until the idle timeout fires. 
This is likely 4917 + * to result in lower bandwidth and higher latencies for bfqq, 4918 + * and in a severe loss of total throughput. 4919 + * 4920 + * On the opposite end, a non-zero inject limit may allow the 4921 + * I/O that blocks bfqq to be executed soon, and therefore 4922 + * bfqq to receive new I/O soon. 4923 + * 4924 + * But, if the blocking gets actually eliminated, then the 4925 + * next think-time sample for bfqq may be very low. This in 4926 + * turn may cause bfqq's think time to be deemed 4927 + * short. Without the 100 ms barrier, this new state change 4928 + * would cause the body of the next if to be executed 4929 + * immediately. But this would set to 0 the inject 4930 + * limit. Without injection, the blocking I/O would cause the 4931 + * think time of bfqq to become long again, and therefore the 4932 + * inject limit to be raised again, and so on. The only effect 4933 + * of such a steady oscillation between the two think-time 4934 + * states would be to prevent effective injection on bfqq. 4935 + * 4936 + * In contrast, if the inject limit is not reset during such a 4937 + * long time interval as 100 ms, then the number of short 4938 + * think time samples can grow significantly before the reset 4939 + * is performed. As a consequence, the think time state can 4940 + * become stable before the reset. Therefore there will be no 4941 + * state change when the 100 ms elapse, and no reset of the 4942 + * inject limit. The inject limit remains steadily equal to 1 4943 + * both during and after the 100 ms. So injection can be 4944 + * performed at all times, and throughput gets boosted. 4945 + * 4946 + * An inject limit equal to 1 is however in conflict, in 4947 + * general, with the fact that the think time of bfqq is 4948 + * short, because injection may be likely to delay bfqq's I/O 4949 + * (as explained in the comments in 4950 + * bfq_update_inject_limit()). 
But this does not happen in 4951 + * this special case, because bfqq's low think time is due to 4952 + * an effective handling of a synchronization, through 4953 + * injection. In this special case, bfqq's I/O does not get 4954 + * delayed by injection; on the contrary, bfqq's I/O is 4955 + * brought forward, because it is not blocked for 4956 + * milliseconds. 4957 + * 4958 + * In addition, serving the blocking I/O much sooner, and much 4959 + * more frequently than once per I/O-plugging timeout, makes 4960 + * it much quicker to detect a waker queue (the concept of 4961 + * waker queue is defined in the comments in 4962 + * bfq_add_request()). This makes it possible to start sooner 4963 + * to boost throughput more effectively, by injecting the I/O 4964 + * of the waker queue unconditionally on every 4965 + * bfq_dispatch_request(). 4966 + * 4967 + * One last, important benefit of not resetting the inject 4968 + * limit before 100 ms is that, during this time interval, the 4969 + * base value for the total service time is likely to get 4970 + * finally computed for bfqq, freeing the inject limit from 4971 + * its relation with the think time. 
4972 + */ 4973 + if (state_changed && bfqq->last_serv_time_ns == 0 && 4974 + (time_is_before_eq_jiffies(bfqq->decrease_time_jif + 4975 + msecs_to_jiffies(100)) || 4976 + !has_short_ttime)) 4977 + bfq_reset_inject_limit(bfqd, bfqq); 5163 4978 } 5164 4979 5165 4980 /* ··· 5258 4895 static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq, 5259 4896 struct request *rq) 5260 4897 { 5261 - struct bfq_io_cq *bic = RQ_BIC(rq); 5262 - 5263 4898 if (rq->cmd_flags & REQ_META) 5264 4899 bfqq->meta_pending++; 5265 - 5266 - bfq_update_io_thinktime(bfqd, bfqq); 5267 - bfq_update_has_short_ttime(bfqd, bfqq, bic); 5268 - bfq_update_io_seektime(bfqd, bfqq, rq); 5269 - 5270 - bfq_log_bfqq(bfqd, bfqq, 5271 - "rq_enqueued: has_short_ttime=%d (seeky %d)", 5272 - bfq_bfqq_has_short_ttime(bfqq), BFQQ_SEEKY(bfqq)); 5273 4900 5274 4901 bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq); 5275 4902 ··· 5348 4995 bfqq = new_bfqq; 5349 4996 } 5350 4997 4998 + bfq_update_io_thinktime(bfqd, bfqq); 4999 + bfq_update_has_short_ttime(bfqd, bfqq, RQ_BIC(rq)); 5000 + bfq_update_io_seektime(bfqd, bfqq, rq); 5001 + 5351 5002 waiting = bfqq && bfq_bfqq_wait_request(bfqq); 5352 5003 bfq_add_request(rq); 5353 5004 idle_timer_disabled = waiting && !bfq_bfqq_wait_request(bfqq); ··· 5364 5007 return idle_timer_disabled; 5365 5008 } 5366 5009 5367 - #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP) 5010 + #ifdef CONFIG_BFQ_CGROUP_DEBUG 5368 5011 static void bfq_update_insert_stats(struct request_queue *q, 5369 5012 struct bfq_queue *bfqq, 5370 5013 bool idle_timer_disabled, ··· 5394 5037 struct bfq_queue *bfqq, 5395 5038 bool idle_timer_disabled, 5396 5039 unsigned int cmd_flags) {} 5397 - #endif 5040 + #endif /* CONFIG_BFQ_CGROUP_DEBUG */ 5398 5041 5399 5042 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, 5400 5043 bool at_head) ··· 5557 5200 1UL<<(BFQ_RATE_SHIFT - 10)) 5558 5201 bfq_update_rate_reset(bfqd, NULL); 5559 5202 
bfqd->last_completion = now_ns; 5203 + bfqd->last_completed_rq_bfqq = bfqq; 5560 5204 5561 5205 /* 5562 5206 * If we are waiting to discover whether the request pattern ··· 5755 5397 * total service time, and there seem to be the right 5756 5398 * conditions to do it, or we can lower the last base value 5757 5399 * computed. 5400 + * 5401 + * NOTE: (bfqd->rq_in_driver == 1) means that there is no I/O 5402 + * request in flight, because this function is in the code 5403 + * path that handles the completion of a request of bfqq, and, 5404 + * in particular, this function is executed before 5405 + * bfqd->rq_in_driver is decremented in such a code path. 5758 5406 */ 5759 - if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 0) || 5407 + if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 1) || 5760 5408 tot_time_ns < bfqq->last_serv_time_ns) { 5761 5409 bfqq->last_serv_time_ns = tot_time_ns; 5762 5410 /* ··· 5770 5406 * start trying injection. 5771 5407 */ 5772 5408 bfqq->inject_limit = max_t(unsigned int, 1, old_limit); 5773 - } 5409 + } else if (!bfqd->rqs_injected && bfqd->rq_in_driver == 1) 5410 + /* 5411 + * No I/O injected and no request still in service in 5412 + * the drive: these are the exact conditions for 5413 + * computing the base value of the total service time 5414 + * for bfqq. So let's update this value, because it is 5415 + * rather variable. For example, it varies if the size 5416 + * or the spatial locality of the I/O requests in bfqq 5417 + * change. 5418 + */ 5419 + bfqq->last_serv_time_ns = tot_time_ns; 5420 + 5774 5421 5775 5422 /* update complete, not waiting for any request completion any longer */ 5776 5423 bfqd->waited_rq = NULL;
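
Note (illustrative, not part of the patch): the long comment added to bfq_update_has_short_ttime() boils down to a small predicate that decides when the inject limit is reset. The sketch below models that predicate in user space, assuming millisecond timestamps in place of jiffies. The names struct model_queue, last_limit_update_ms, RESET_BARRIER_MS and should_reset_inject_limit() are hypothetical stand-ins, not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the few bfq_queue fields involved. */
struct model_queue {
	bool has_short_ttime;          /* think-time state before the update */
	uint64_t last_serv_time_ns;    /* 0 until the base service time is known */
	uint64_t last_limit_update_ms; /* models decrease_time_jif, in ms */
};

/* The 100 ms barrier discussed in the comment. */
#define RESET_BARRIER_MS 100

/*
 * Returns true if the inject limit should be reset, following the
 * condition in the patch: the think-time state changed, the base
 * service time is still unknown, and either the barrier has elapsed
 * or the change is from short to long think time.
 */
static bool should_reset_inject_limit(const struct model_queue *q,
				      bool new_short_ttime, uint64_t now_ms)
{
	bool state_changed = new_short_ttime != q->has_short_ttime;
	bool barrier_elapsed =
		now_ms >= q->last_limit_update_ms + RESET_BARRIER_MS;

	return state_changed && q->last_serv_time_ns == 0 &&
	       (barrier_elapsed || !new_short_ttime);
}
```

In this model a long-to-short transition inside the barrier leaves the limit untouched, which is what lets the think-time state settle instead of oscillating, while a short-to-long transition resets the limit immediately.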

+38 -10

block/bfq-iosched.h

··· 357 357 358 358 /* max service rate measured so far */ 359 359 u32 max_service_rate; 360 + 361 + /* 362 + * Pointer to the waker queue for this queue, i.e., to the 363 + * queue Q such that this queue happens to get new I/O right 364 + * after some I/O request of Q is completed. For details, see 365 + * the comments on the choice of the queue for injection in 366 + * bfq_select_queue(). 367 + */ 368 + struct bfq_queue *waker_bfqq; 369 + /* node for woken_list, see below */ 370 + struct hlist_node woken_list_node; 371 + /* 372 + * Head of the list of the woken queues for this queue, i.e., 373 + * of the list of the queues for which this queue is a waker 374 + * queue. This list is used to reset the waker_bfqq pointer in 375 + * the woken queues when this queue exits. 376 + */ 377 + struct hlist_head woken_list; 360 378 }; 361 379 362 380 /** ··· 550 532 551 533 /* time of last request completion (ns) */ 552 534 u64 last_completion; 535 + 536 + /* bfqq owning the last completed rq */ 537 + struct bfq_queue *last_completed_rq_bfqq; 553 538 554 539 /* time of last transition from empty to non-empty (ns) */ 555 540 u64 last_empty_occupied_ns; ··· 764 743 * update 765 744 */ 766 745 BFQQF_coop, /* bfqq is shared */ 767 - BFQQF_split_coop /* shared bfqq will be split */ 746 + BFQQF_split_coop, /* shared bfqq will be split */ 747 + BFQQF_has_waker /* bfqq has a waker queue */ 768 748 }; 769 749 770 750 #define BFQ_BFQQ_FNS(name) \ ··· 785 763 BFQ_BFQQ_FNS(coop); 786 764 BFQ_BFQQ_FNS(split_coop); 787 765 BFQ_BFQQ_FNS(softrt_update); 766 + BFQ_BFQQ_FNS(has_waker); 788 767 #undef BFQ_BFQQ_FNS 789 768 790 769 /* Expiration reasons. 
*/ ··· 800 777 BFQQE_PREEMPTED /* preemption in progress */ 801 778 }; 802 779 780 + struct bfq_stat { 781 + struct percpu_counter cpu_cnt; 782 + atomic64_t aux_cnt; 783 + }; 784 + 803 785 struct bfqg_stats { 804 - #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP) 786 + #ifdef CONFIG_BFQ_CGROUP_DEBUG 805 787 /* number of ios merged */ 806 788 struct blkg_rwstat merged; 807 789 /* total time spent on device in ns, may not be accurate w/ queueing */ ··· 816 788 /* number of IOs queued up */ 817 789 struct blkg_rwstat queued; 818 790 /* total disk time and nr sectors dispatched by this group */ 819 - struct blkg_stat time; 791 + struct bfq_stat time; 820 792 /* sum of number of ios queued across all samples */ 821 - struct blkg_stat avg_queue_size_sum; 793 + struct bfq_stat avg_queue_size_sum; 822 794 /* count of samples taken for average */ 823 - struct blkg_stat avg_queue_size_samples; 795 + struct bfq_stat avg_queue_size_samples; 824 796 /* how many times this group has been removed from service tree */ 825 - struct blkg_stat dequeue; 797 + struct bfq_stat dequeue; 826 798 /* total time spent waiting for it to be assigned a timeslice. */ 827 - struct blkg_stat group_wait_time; 799 + struct bfq_stat group_wait_time; 828 800 /* time spent idling for this blkcg_gq */ 829 - struct blkg_stat idle_time; 801 + struct bfq_stat idle_time; 830 802 /* total time with empty current active q with other requests queued */ 831 - struct blkg_stat empty_time; 803 + struct bfq_stat empty_time; 832 804 /* fields after this shouldn't be cleared on stat reset */ 833 805 u64 start_group_wait_time; 834 806 u64 start_idle_time; 835 807 u64 start_empty_time; 836 808 uint16_t flags; 837 - #endif /* CONFIG_BFQ_GROUP_IOSCHED && CONFIG_DEBUG_BLK_CGROUP */ 809 + #endif /* CONFIG_BFQ_CGROUP_DEBUG */ 838 810 }; 839 811 840 812 #ifdef CONFIG_BFQ_GROUP_IOSCHED
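
Note (illustrative, not part of the patch): the new waker_bfqq, woken_list_node and woken_list fields form a back-reference structure, so that a queue can clear the waker pointers of the queues it wakes when it exits. The sketch below models that bookkeeping in user space under simplifying assumptions: it uses a singly linked list instead of the kernel hlist, and it also clears the exiting queue's own waker pointer. All names (struct model_queue, set_waker(), queue_exit(), woken_list_unlink()) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical, simplified queue: a singly linked list replaces hlist. */
struct model_queue {
	struct model_queue *waker;      /* models waker_bfqq */
	struct model_queue *woken_head; /* models woken_list */
	struct model_queue *woken_next; /* models woken_list_node */
};

/* Remove q from the woken list of its waker, if it has one. */
static void woken_list_unlink(struct model_queue *q)
{
	struct model_queue **pp;

	if (!q->waker)
		return;
	for (pp = &q->waker->woken_head; *pp; pp = &(*pp)->woken_next) {
		if (*pp == q) {
			*pp = q->woken_next;
			break;
		}
	}
	q->woken_next = NULL;
	q->waker = NULL;
}

/* Record waker as the waker of q and link q at the head of its list. */
static void set_waker(struct model_queue *q, struct model_queue *waker)
{
	woken_list_unlink(q);
	q->waker = waker;
	q->woken_next = waker->woken_head;
	waker->woken_head = q;
}

/* Mirror of the two teardown steps added to bfq_exit_bfqq(). */
static void queue_exit(struct model_queue *q)
{
	struct model_queue *item, *next;

	/* remove q from the woken list it belongs to */
	woken_list_unlink(q);

	/* reset the waker of every queue that q used to wake */
	for (item = q->woken_head; item; item = next) {
		next = item->woken_next;
		item->waker = NULL;
		item->woken_next = NULL;
	}
	q->woken_head = NULL;
}
```

The point of the list is that no woken queue is left holding a dangling waker pointer once the waker is gone; the kernel version achieves the same with hlist_for_each_entry_safe() and hlist_del_init().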

+20 -76

block/bio.c

··· 558 558 } 559 559 EXPORT_SYMBOL(bio_put); 560 560 561 - int bio_phys_segments(struct request_queue *q, struct bio *bio) 562 - { 563 - if (unlikely(!bio_flagged(bio, BIO_SEG_VALID))) 564 - blk_recount_segments(q, bio); 565 - 566 - return bio->bi_phys_segments; 567 - } 568 - 569 561 /** 570 562 * __bio_clone_fast - clone a bio that shares the original bio's biovec 571 563 * @bio: destination bio ··· 723 731 } 724 732 } 725 733 726 - if (bio_full(bio)) 734 + if (bio_full(bio, len)) 727 735 return 0; 728 736 729 - if (bio->bi_phys_segments >= queue_max_segments(q)) 737 + if (bio->bi_vcnt >= queue_max_segments(q)) 730 738 return 0; 731 739 732 740 bvec = &bio->bi_io_vec[bio->bi_vcnt]; ··· 736 744 bio->bi_vcnt++; 737 745 done: 738 746 bio->bi_iter.bi_size += len; 739 - bio->bi_phys_segments = bio->bi_vcnt; 740 - bio_set_flag(bio, BIO_SEG_VALID); 741 747 return len; 742 748 } 743 749 ··· 797 807 struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt]; 798 808 799 809 WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)); 800 - WARN_ON_ONCE(bio_full(bio)); 810 + WARN_ON_ONCE(bio_full(bio, len)); 801 811 802 812 bv->bv_page = page; 803 813 bv->bv_offset = off; ··· 824 834 bool same_page = false; 825 835 826 836 if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) { 827 - if (bio_full(bio)) 837 + if (bio_full(bio, len)) 828 838 return 0; 829 839 __bio_add_page(bio, page, len, offset); 830 840 } ··· 832 842 } 833 843 EXPORT_SYMBOL(bio_add_page); 834 844 835 - static void bio_get_pages(struct bio *bio) 845 + void bio_release_pages(struct bio *bio, bool mark_dirty) 836 846 { 837 847 struct bvec_iter_all iter_all; 838 848 struct bio_vec *bvec; 839 849 840 - bio_for_each_segment_all(bvec, bio, iter_all) 841 - get_page(bvec->bv_page); 842 - } 850 + if (bio_flagged(bio, BIO_NO_PAGE_REF)) 851 + return; 843 852 844 - static void bio_release_pages(struct bio *bio) 845 - { 846 - struct bvec_iter_all iter_all; 847 - struct bio_vec *bvec; 848 - 849 - bio_for_each_segment_all(bvec, bio, 
iter_all) 853 + bio_for_each_segment_all(bvec, bio, iter_all) { 854 + if (mark_dirty && !PageCompound(bvec->bv_page)) 855 + set_page_dirty_lock(bvec->bv_page); 850 856 put_page(bvec->bv_page); 857 + } 851 858 } 852 859 853 860 static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) ··· 909 922 if (same_page) 910 923 put_page(page); 911 924 } else { 912 - if (WARN_ON_ONCE(bio_full(bio))) 925 + if (WARN_ON_ONCE(bio_full(bio, len))) 913 926 return -EINVAL; 914 927 __bio_add_page(bio, page, len, offset); 915 928 } ··· 953 966 ret = __bio_iov_bvec_add_pages(bio, iter); 954 967 else 955 968 ret = __bio_iov_iter_get_pages(bio, iter); 956 - } while (!ret && iov_iter_count(iter) && !bio_full(bio)); 969 + } while (!ret && iov_iter_count(iter) && !bio_full(bio, 0)); 957 970 958 - if (iov_iter_bvec_no_ref(iter)) 971 + if (is_bvec) 959 972 bio_set_flag(bio, BIO_NO_PAGE_REF); 960 - else if (is_bvec) 961 - bio_get_pages(bio); 962 - 963 973 return bio->bi_vcnt ? 0 : ret; 964 974 } 965 975 ··· 1108 1124 if (data->nr_segs > UIO_MAXIOV) 1109 1125 return NULL; 1110 1126 1111 - bmd = kmalloc(sizeof(struct bio_map_data) + 1112 - sizeof(struct iovec) * data->nr_segs, gfp_mask); 1127 + bmd = kmalloc(struct_size(bmd, iov, data->nr_segs), gfp_mask); 1113 1128 if (!bmd) 1114 1129 return NULL; 1115 1130 memcpy(bmd->iov, data->iov, sizeof(struct iovec) * data->nr_segs); ··· 1354 1371 int j; 1355 1372 struct bio *bio; 1356 1373 int ret; 1357 - struct bio_vec *bvec; 1358 - struct bvec_iter_all iter_all; 1359 1374 1360 1375 if (!iov_iter_count(iter)) 1361 1376 return ERR_PTR(-EINVAL); ··· 1420 1439 return bio; 1421 1440 1422 1441 out_unmap: 1423 - bio_for_each_segment_all(bvec, bio, iter_all) { 1424 - put_page(bvec->bv_page); 1425 - } 1442 + bio_release_pages(bio, false); 1426 1443 bio_put(bio); 1427 1444 return ERR_PTR(ret); 1428 - } 1429 - 1430 - static void __bio_unmap_user(struct bio *bio) 1431 - { 1432 - struct bio_vec *bvec; 1433 - struct bvec_iter_all iter_all; 1434 - 
1435 - /* 1436 - * make sure we dirty pages we wrote to 1437 - */ 1438 - bio_for_each_segment_all(bvec, bio, iter_all) { 1439 - if (bio_data_dir(bio) == READ) 1440 - set_page_dirty_lock(bvec->bv_page); 1441 - 1442 - put_page(bvec->bv_page); 1443 - } 1444 - 1445 - bio_put(bio); 1446 1445 } 1447 1446 1448 1447 /** ··· 1436 1475 */ 1437 1476 void bio_unmap_user(struct bio *bio) 1438 1477 { 1439 - __bio_unmap_user(bio); 1478 + bio_release_pages(bio, bio_data_dir(bio) == READ); 1479 + bio_put(bio); 1440 1480 bio_put(bio); 1441 1481 } 1442 1482 ··· 1657 1695 while ((bio = next) != NULL) { 1658 1696 next = bio->bi_private; 1659 1697 1660 - bio_set_pages_dirty(bio); 1661 - if (!bio_flagged(bio, BIO_NO_PAGE_REF)) 1662 - bio_release_pages(bio); 1698 + bio_release_pages(bio, true); 1663 1699 bio_put(bio); 1664 1700 } 1665 1701 } ··· 1673 1713 goto defer; 1674 1714 } 1675 1715 1676 - if (!bio_flagged(bio, BIO_NO_PAGE_REF)) 1677 - bio_release_pages(bio); 1716 + bio_release_pages(bio, false); 1678 1717 bio_put(bio); 1679 1718 return; 1680 1719 defer: ··· 1733 1774 part_stat_unlock(); 1734 1775 } 1735 1776 EXPORT_SYMBOL(generic_end_io_acct); 1736 - 1737 - #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1738 - void bio_flush_dcache_pages(struct bio *bi) 1739 - { 1740 - struct bio_vec bvec; 1741 - struct bvec_iter iter; 1742 - 1743 - bio_for_each_segment(bvec, bi, iter) 1744 - flush_dcache_page(bvec.bv_page); 1745 - } 1746 - EXPORT_SYMBOL(bio_flush_dcache_pages); 1747 - #endif 1748 1777 1749 1778 static inline bool bio_remaining_done(struct bio *bio) 1750 1779 { ··· 1861 1914 if (offset == 0 && size == bio->bi_iter.bi_size) 1862 1915 return; 1863 1916 1864 - bio_clear_flag(bio, BIO_SEG_VALID); 1865 - 1866 1917 bio_advance(bio, offset << 9); 1867 - 1868 1918 bio->bi_iter.bi_size = size; 1869 1919 1870 1920 if (bio_integrity(bio))
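
Note (illustrative, not part of the patch): in block/bio.c the patch folds the separate "dirty the pages" and "drop the page references" steps into a single bio_release_pages(bio, mark_dirty) helper that returns early when BIO_NO_PAGE_REF is set. The sketch below models that control flow in user space. struct model_page, struct model_bio and model_release_pages() are hypothetical stand-ins; the refcount decrement and the dirty flag only imitate put_page() and set_page_dirty_lock().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct model_page {
	int refcount;
	bool dirty;
	bool compound; /* models PageCompound() */
};

/* Hypothetical stand-in for struct bio with a tiny fixed page array. */
struct model_bio {
	bool no_page_ref; /* models the BIO_NO_PAGE_REF flag */
	size_t vcnt;
	struct model_page *pages[4];
};

/*
 * Same shape as the new bio_release_pages(): do nothing if the bio
 * holds no page references, otherwise optionally dirty each
 * non-compound page and then drop one reference per page.
 */
static void model_release_pages(struct model_bio *bio, bool mark_dirty)
{
	size_t i;

	if (bio->no_page_ref)
		return;

	for (i = 0; i < bio->vcnt; i++) {
		struct model_page *page = bio->pages[i];

		if (mark_dirty && !page->compound)
			page->dirty = true;
		page->refcount--;
	}
}
```

With the flag check inside the helper, callers such as the dirty-page completion paths no longer need their own BIO_NO_PAGE_REF test, which is why those checks disappear from the call sites in the diff.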

+41 -98

block/blk-cgroup.c

··· 79 79 80 80 blkg_rwstat_exit(&blkg->stat_ios); 81 81 blkg_rwstat_exit(&blkg->stat_bytes); 82 + percpu_ref_exit(&blkg->refcnt); 82 83 kfree(blkg); 83 84 } 84 85 85 86 static void __blkg_release(struct rcu_head *rcu) 86 87 { 87 88 struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head); 88 - 89 - percpu_ref_exit(&blkg->refcnt); 90 89 91 90 /* release the blkcg and parent blkg refs this blkg has been holding */ 92 91 css_put(&blkg->blkcg->css); ··· 130 131 blkg = kzalloc_node(sizeof(*blkg), gfp_mask, q->node); 131 132 if (!blkg) 132 133 return NULL; 134 + 135 + if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask)) 136 + goto err_free; 133 137 134 138 if (blkg_rwstat_init(&blkg->stat_bytes, gfp_mask) || 135 139 blkg_rwstat_init(&blkg->stat_ios, gfp_mask)) ··· 246 244 blkg_get(blkg->parent); 247 245 } 248 246 249 - ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0, 250 - GFP_NOWAIT | __GFP_NOWARN); 251 - if (ret) 252 - goto err_cancel_ref; 253 - 254 247 /* invoke per-policy init */ 255 248 for (i = 0; i < BLKCG_MAX_POLS; i++) { 256 249 struct blkcg_policy *pol = blkcg_policy[i]; ··· 278 281 blkg_put(blkg); 279 282 return ERR_PTR(ret); 280 283 281 - err_cancel_ref: 282 - percpu_ref_exit(&blkg->refcnt); 283 284 err_put_congested: 284 285 wb_congested_put(wb_congested); 285 286 err_put_css: ··· 544 549 * Print @rwstat to @sf for the device assocaited with @pd. 
545 550 */ 546 551 u64 __blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd, 547 - const struct blkg_rwstat *rwstat) 552 + const struct blkg_rwstat_sample *rwstat) 548 553 { 549 554 static const char *rwstr[] = { 550 555 [BLKG_RWSTAT_READ] = "Read", ··· 562 567 563 568 for (i = 0; i < BLKG_RWSTAT_NR; i++) 564 569 seq_printf(sf, "%s %s %llu\n", dname, rwstr[i], 565 - (unsigned long long)atomic64_read(&rwstat->aux_cnt[i])); 570 + rwstat->cnt[i]); 566 571 567 - v = atomic64_read(&rwstat->aux_cnt[BLKG_RWSTAT_READ]) + 568 - atomic64_read(&rwstat->aux_cnt[BLKG_RWSTAT_WRITE]) + 569 - atomic64_read(&rwstat->aux_cnt[BLKG_RWSTAT_DISCARD]); 570 - seq_printf(sf, "%s Total %llu\n", dname, (unsigned long long)v); 572 + v = rwstat->cnt[BLKG_RWSTAT_READ] + 573 + rwstat->cnt[BLKG_RWSTAT_WRITE] + 574 + rwstat->cnt[BLKG_RWSTAT_DISCARD]; 575 + seq_printf(sf, "%s Total %llu\n", dname, v); 571 576 return v; 572 577 } 573 578 EXPORT_SYMBOL_GPL(__blkg_prfill_rwstat); 574 - 575 - /** 576 - * blkg_prfill_stat - prfill callback for blkg_stat 577 - * @sf: seq_file to print to 578 - * @pd: policy private data of interest 579 - * @off: offset to the blkg_stat in @pd 580 - * 581 - * prfill callback for printing a blkg_stat. 
582 - */ 583 - u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off) 584 - { 585 - return __blkg_prfill_u64(sf, pd, blkg_stat_read((void *)pd + off)); 586 - } 587 - EXPORT_SYMBOL_GPL(blkg_prfill_stat); 588 579 589 580 /** 590 581 * blkg_prfill_rwstat - prfill callback for blkg_rwstat ··· 583 602 u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd, 584 603 int off) 585 604 { 586 - struct blkg_rwstat rwstat = blkg_rwstat_read((void *)pd + off); 605 + struct blkg_rwstat_sample rwstat = { }; 587 606 607 + blkg_rwstat_read((void *)pd + off, &rwstat); 588 608 return __blkg_prfill_rwstat(sf, pd, &rwstat); 589 609 } 590 610 EXPORT_SYMBOL_GPL(blkg_prfill_rwstat); ··· 593 611 static u64 blkg_prfill_rwstat_field(struct seq_file *sf, 594 612 struct blkg_policy_data *pd, int off) 595 613 { 596 - struct blkg_rwstat rwstat = blkg_rwstat_read((void *)pd->blkg + off); 614 + struct blkg_rwstat_sample rwstat = { }; 597 615 616 + blkg_rwstat_read((void *)pd->blkg + off, &rwstat); 598 617 return __blkg_prfill_rwstat(sf, pd, &rwstat); 599 618 } 600 619 ··· 637 654 struct blkg_policy_data *pd, 638 655 int off) 639 656 { 640 - struct blkg_rwstat rwstat = blkg_rwstat_recursive_sum(pd->blkg, 641 - NULL, off); 657 + struct blkg_rwstat_sample rwstat; 658 + 659 + blkg_rwstat_recursive_sum(pd->blkg, NULL, off, &rwstat); 642 660 return __blkg_prfill_rwstat(sf, pd, &rwstat); 643 661 } 644 662 ··· 674 690 EXPORT_SYMBOL_GPL(blkg_print_stat_ios_recursive); 675 691 676 692 /** 677 - * blkg_stat_recursive_sum - collect hierarchical blkg_stat 678 - * @blkg: blkg of interest 679 - * @pol: blkcg_policy which contains the blkg_stat 680 - * @off: offset to the blkg_stat in blkg_policy_data or @blkg 681 - * 682 - * Collect the blkg_stat specified by @blkg, @pol and @off and all its 683 - * online descendants and their aux counts. The caller must be holding the 684 - * queue lock for online tests. 
685 - * 686 - * If @pol is NULL, blkg_stat is at @off bytes into @blkg; otherwise, it is 687 - * at @off bytes into @blkg's blkg_policy_data of the policy. 688 - */ 689 - u64 blkg_stat_recursive_sum(struct blkcg_gq *blkg, 690 - struct blkcg_policy *pol, int off) 691 - { 692 - struct blkcg_gq *pos_blkg; 693 - struct cgroup_subsys_state *pos_css; 694 - u64 sum = 0; 695 - 696 - lockdep_assert_held(&blkg->q->queue_lock); 697 - 698 - rcu_read_lock(); 699 - blkg_for_each_descendant_pre(pos_blkg, pos_css, blkg) { 700 - struct blkg_stat *stat; 701 - 702 - if (!pos_blkg->online) 703 - continue; 704 - 705 - if (pol) 706 - stat = (void *)blkg_to_pd(pos_blkg, pol) + off; 707 - else 708 - stat = (void *)blkg + off; 709 - 710 - sum += blkg_stat_read(stat) + atomic64_read(&stat->aux_cnt); 711 - } 712 - rcu_read_unlock(); 713 - 714 - return sum; 715 - } 716 - EXPORT_SYMBOL_GPL(blkg_stat_recursive_sum); 717 - 718 - /** 719 693 * blkg_rwstat_recursive_sum - collect hierarchical blkg_rwstat 720 694 * @blkg: blkg of interest 721 695 * @pol: blkcg_policy which contains the blkg_rwstat 722 696 * @off: offset to the blkg_rwstat in blkg_policy_data or @blkg 697 + * @sum: blkg_rwstat_sample structure containing the results 723 698 * 724 699 * Collect the blkg_rwstat specified by @blkg, @pol and @off and all its 725 700 * online descendants and their aux counts. The caller must be holding the ··· 687 744 * If @pol is NULL, blkg_rwstat is at @off bytes into @blkg; otherwise, it 688 745 * is at @off bytes into @blkg's blkg_policy_data of the policy. 
689 746 */ 690 - struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, 691 - struct blkcg_policy *pol, int off) 747 + void blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, struct blkcg_policy *pol, 748 + int off, struct blkg_rwstat_sample *sum) 692 749 { 693 750 struct blkcg_gq *pos_blkg; 694 751 struct cgroup_subsys_state *pos_css; 695 - struct blkg_rwstat sum = { }; 696 - int i; 752 + unsigned int i; 697 753 698 754 lockdep_assert_held(&blkg->q->queue_lock); 699 755 ··· 709 767 rwstat = (void *)pos_blkg + off; 710 768 711 769 for (i = 0; i < BLKG_RWSTAT_NR; i++) 712 - atomic64_add(atomic64_read(&rwstat->aux_cnt[i]) + 713 - percpu_counter_sum_positive(&rwstat->cpu_cnt[i]), 714 - &sum.aux_cnt[i]); 770 + sum->cnt[i] = blkg_rwstat_read_counter(rwstat, i); 715 771 } 716 772 rcu_read_unlock(); 717 - 718 - return sum; 719 773 } 720 774 EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum); 721 775 ··· 877 939 hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) { 878 940 const char *dname; 879 941 char *buf; 880 - struct blkg_rwstat rwstat; 942 + struct blkg_rwstat_sample rwstat; 881 943 u64 rbytes, wbytes, rios, wios, dbytes, dios; 882 944 size_t size = seq_get_buf(sf, &buf), off = 0; 883 945 int i; ··· 897 959 898 960 spin_lock_irq(&blkg->q->queue_lock); 899 961 900 - rwstat = blkg_rwstat_recursive_sum(blkg, NULL, 901 - offsetof(struct blkcg_gq, stat_bytes)); 902 - rbytes = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_READ]); 903 - wbytes = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_WRITE]); 904 - dbytes = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_DISCARD]); 962 + blkg_rwstat_recursive_sum(blkg, NULL, 963 + offsetof(struct blkcg_gq, stat_bytes), &rwstat); 964 + rbytes = rwstat.cnt[BLKG_RWSTAT_READ]; 965 + wbytes = rwstat.cnt[BLKG_RWSTAT_WRITE]; 966 + dbytes = rwstat.cnt[BLKG_RWSTAT_DISCARD]; 905 967 906 - rwstat = blkg_rwstat_recursive_sum(blkg, NULL, 907 - offsetof(struct blkcg_gq, stat_ios)); 908 - rios = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_READ]); 
909 - wios = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_WRITE]); 910 - dios = atomic64_read(&rwstat.aux_cnt[BLKG_RWSTAT_DISCARD]); 968 + blkg_rwstat_recursive_sum(blkg, NULL, 969 + offsetof(struct blkcg_gq, stat_ios), &rwstat); 970 + rios = rwstat.cnt[BLKG_RWSTAT_READ]; 971 + wios = rwstat.cnt[BLKG_RWSTAT_WRITE]; 972 + dios = rwstat.cnt[BLKG_RWSTAT_DISCARD]; 911 973 912 974 spin_unlock_irq(&blkg->q->queue_lock); 913 975 ··· 944 1006 } 945 1007 next: 946 1008 if (has_stats) { 947 - off += scnprintf(buf+off, size-off, "\n"); 948 - seq_commit(sf, off); 1009 + if (off < size - 1) { 1010 + off += scnprintf(buf+off, size-off, "\n"); 1011 + seq_commit(sf, off); 1012 + } else { 1013 + seq_commit(sf, -1); 1014 + } 949 1015 } 950 1016 } 951 1017 ··· 1333 1391 1334 1392 spin_lock_irq(&q->queue_lock); 1335 1393 1336 - list_for_each_entry(blkg, &q->blkg_list, q_node) { 1394 + /* blkg_list is pushed at the head, reverse walk to init parents first */ 1395 + list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) { 1337 1396 struct blkg_policy_data *pd; 1338 1397 1339 1398 if (blkg->pd[pol->plid])
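The blk-cgroup.c hunks above switch blkg_rwstat_recursive_sum() from returning a struct of atomics to filling a caller-provided struct blkg_rwstat_sample of plain counters. A minimal user-space sketch of that out-parameter pattern follows. All demo_* names are hypothetical, the group list is a flat array rather than a cgroup hierarchy, and the sketch zeroes the sample and accumulates with +=, which is an assumption about what a "sum" should do rather than a copy of the hunk (the hunk assigns with =).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for BLKG_RWSTAT_READ/WRITE/DISCARD/NR. */
enum { DEMO_READ, DEMO_WRITE, DEMO_DISCARD, DEMO_NR };

/* Plain snapshot of the counters, analogous to struct blkg_rwstat_sample. */
struct demo_sample {
	uint64_t cnt[DEMO_NR];
};

/* Hypothetical group: an online flag plus one counter per direction. */
struct demo_group {
	bool online;
	uint64_t counters[DEMO_NR];
};

/*
 * Fill *sum with the per-direction totals of all online groups.
 * The caller owns the sample, so nothing is returned by value.
 */
void demo_recursive_sum(const struct demo_group *groups, size_t n,
			struct demo_sample *sum)
{
	size_t g;
	unsigned int i;

	/* Assumption of this sketch: start from zero on every call. */
	for (i = 0; i < DEMO_NR; i++)
		sum->cnt[i] = 0;

	for (g = 0; g < n; g++) {
		if (!groups[g].online)
			continue;	/* offline groups are skipped */

		for (i = 0; i < DEMO_NR; i++)
			sum->cnt[i] += groups[g].counters[i];
	}
}
```

A caller then reads sum.cnt[DEMO_READ] directly, much as the updated print path reads rwstat.cnt[BLKG_RWSTAT_READ] instead of calling atomic64_read().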

+63 -48

block/blk-core.c

··· 120 120 } 121 121 EXPORT_SYMBOL(blk_rq_init); 122 122 123 + #define REQ_OP_NAME(name) [REQ_OP_##name] = #name 124 + static const char *const blk_op_name[] = { 125 + REQ_OP_NAME(READ), 126 + REQ_OP_NAME(WRITE), 127 + REQ_OP_NAME(FLUSH), 128 + REQ_OP_NAME(DISCARD), 129 + REQ_OP_NAME(SECURE_ERASE), 130 + REQ_OP_NAME(ZONE_RESET), 131 + REQ_OP_NAME(WRITE_SAME), 132 + REQ_OP_NAME(WRITE_ZEROES), 133 + REQ_OP_NAME(SCSI_IN), 134 + REQ_OP_NAME(SCSI_OUT), 135 + REQ_OP_NAME(DRV_IN), 136 + REQ_OP_NAME(DRV_OUT), 137 + }; 138 + #undef REQ_OP_NAME 139 + 140 + /** 141 + * blk_op_str - Return string XXX in the REQ_OP_XXX. 142 + * @op: REQ_OP_XXX. 143 + * 144 + * Description: Centralize block layer function to convert REQ_OP_XXX into 145 + * string format. Useful in the debugging and tracing bio or request. For 146 + * invalid REQ_OP_XXX it returns string "UNKNOWN". 147 + */ 148 + inline const char *blk_op_str(unsigned int op) 149 + { 150 + const char *op_str = "UNKNOWN"; 151 + 152 + if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op]) 153 + op_str = blk_op_name[op]; 154 + 155 + return op_str; 156 + } 157 + EXPORT_SYMBOL_GPL(blk_op_str); 158 + 123 159 static const struct { 124 160 int errno; 125 161 const char *name; ··· 203 167 } 204 168 EXPORT_SYMBOL_GPL(blk_status_to_errno); 205 169 206 - static void print_req_error(struct request *req, blk_status_t status) 170 + static void print_req_error(struct request *req, blk_status_t status, 171 + const char *caller) 207 172 { 208 173 int idx = (__force int)status; 209 174 210 175 if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors))) 211 176 return; 212 177 213 - printk_ratelimited(KERN_ERR "%s: %s error, dev %s, sector %llu flags %x\n", 214 - __func__, blk_errors[idx].name, 215 - req->rq_disk ? 
req->rq_disk->disk_name : "?", 216 - (unsigned long long)blk_rq_pos(req), 217 - req->cmd_flags); 178 + printk_ratelimited(KERN_ERR 179 + "%s: %s error, dev %s, sector %llu op 0x%x:(%s) flags 0x%x " 180 + "phys_seg %u prio class %u\n", 181 + caller, blk_errors[idx].name, 182 + req->rq_disk ? req->rq_disk->disk_name : "?", 183 + blk_rq_pos(req), req_op(req), blk_op_str(req_op(req)), 184 + req->cmd_flags & ~REQ_OP_MASK, 185 + req->nr_phys_segments, 186 + IOPRIO_PRIO_CLASS(req->ioprio)); 218 187 } 219 188 220 189 static void req_bio_endio(struct request *rq, struct bio *bio, ··· 591 550 } 592 551 EXPORT_SYMBOL(blk_put_request); 593 552 594 - bool bio_attempt_back_merge(struct request_queue *q, struct request *req, 595 - struct bio *bio) 553 + bool bio_attempt_back_merge(struct request *req, struct bio *bio, 554 + unsigned int nr_segs) 596 555 { 597 556 const int ff = bio->bi_opf & REQ_FAILFAST_MASK; 598 557 599 - if (!ll_back_merge_fn(q, req, bio)) 558 + if (!ll_back_merge_fn(req, bio, nr_segs)) 600 559 return false; 601 560 602 - trace_block_bio_backmerge(q, req, bio); 561 + trace_block_bio_backmerge(req->q, req, bio); 603 562 604 563 if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) 605 564 blk_rq_set_mixed_merge(req); ··· 612 571 return true; 613 572 } 614 573 615 - bool bio_attempt_front_merge(struct request_queue *q, struct request *req, 616 - struct bio *bio) 574 + bool bio_attempt_front_merge(struct request *req, struct bio *bio, 575 + unsigned int nr_segs) 617 576 { 618 577 const int ff = bio->bi_opf & REQ_FAILFAST_MASK; 619 578 620 - if (!ll_front_merge_fn(q, req, bio)) 579 + if (!ll_front_merge_fn(req, bio, nr_segs)) 621 580 return false; 622 581 623 - trace_block_bio_frontmerge(q, req, bio); 582 + trace_block_bio_frontmerge(req->q, req, bio); 624 583 625 584 if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) 626 585 blk_rq_set_mixed_merge(req); ··· 662 621 * blk_attempt_plug_merge - try to merge with %current's plugged list 663 622 * @q: request_queue new bio is 
being queued at 664 623 * @bio: new bio being queued 624 + * @nr_segs: number of segments in @bio 665 625 * @same_queue_rq: pointer to &struct request that gets filled in when 666 626 * another request associated with @q is found on the plug list 667 627 * (optional, may be %NULL) ··· 681 639 * Caller must ensure !blk_queue_nomerges(q) beforehand. 682 640 */ 683 641 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, 684 - struct request **same_queue_rq) 642 + unsigned int nr_segs, struct request **same_queue_rq) 685 643 { 686 644 struct blk_plug *plug; 687 645 struct request *rq; ··· 710 668 711 669 switch (blk_try_merge(rq, bio)) { 712 670 case ELEVATOR_BACK_MERGE: 713 - merged = bio_attempt_back_merge(q, rq, bio); 671 + merged = bio_attempt_back_merge(rq, bio, nr_segs); 714 672 break; 715 673 case ELEVATOR_FRONT_MERGE: 716 - merged = bio_attempt_front_merge(q, rq, bio); 674 + merged = bio_attempt_front_merge(rq, bio, nr_segs); 717 675 break; 718 676 case ELEVATOR_DISCARD_MERGE: 719 677 merged = bio_attempt_discard_merge(q, rq, bio); ··· 728 686 729 687 return false; 730 688 } 731 - 732 - void blk_init_request_from_bio(struct request *req, struct bio *bio) 733 - { 734 - if (bio->bi_opf & REQ_RAHEAD) 735 - req->cmd_flags |= REQ_FAILFAST_MASK; 736 - 737 - req->__sector = bio->bi_iter.bi_sector; 738 - req->ioprio = bio_prio(bio); 739 - req->write_hint = bio->bi_write_hint; 740 - blk_rq_bio_prep(req->q, req, bio); 741 - } 742 - EXPORT_SYMBOL_GPL(blk_init_request_from_bio); 743 689 744 690 static void handle_bad_sector(struct bio *bio, sector_t maxsector) 745 691 { ··· 1193 1163 * Recalculate it to check the request correctly on this queue's 1194 1164 * limitation. 1195 1165 */ 1196 - blk_recalc_rq_segments(rq); 1166 + rq->nr_phys_segments = blk_recalc_rq_segments(rq); 1197 1167 if (rq->nr_phys_segments > queue_max_segments(q)) { 1198 1168 printk(KERN_ERR "%s: over max segments limit. 
(%hu > %hu)\n", 1199 1169 __func__, rq->nr_phys_segments, queue_max_segments(q)); ··· 1378 1348 * 1379 1349 * This special helper function is only for request stacking drivers 1380 1350 * (e.g. request-based dm) so that they can handle partial completion. 1381 - * Actual device drivers should use blk_end_request instead. 1351 + * Actual device drivers should use blk_mq_end_request instead. 1382 1352 * 1383 1353 * Passing the result of blk_rq_bytes() as @nr_bytes guarantees 1384 1354 * %false return from this function. ··· 1403 1373 1404 1374 if (unlikely(error && !blk_rq_is_passthrough(req) && 1405 1375 !(req->rq_flags & RQF_QUIET))) 1406 - print_req_error(req, error); 1376 + print_req_error(req, error, __func__); 1407 1377 1408 1378 blk_account_io_completion(req, nr_bytes); 1409 1379 ··· 1462 1432 } 1463 1433 1464 1434 /* recalculate the number of segments */ 1465 - blk_recalc_rq_segments(req); 1435 + req->nr_phys_segments = blk_recalc_rq_segments(req); 1466 1436 } 1467 1437 1468 1438 return true; 1469 1439 } 1470 1440 EXPORT_SYMBOL_GPL(blk_update_request); 1471 - 1472 - void blk_rq_bio_prep(struct request_queue *q, struct request *rq, 1473 - struct bio *bio) 1474 - { 1475 - if (bio_has_data(bio)) 1476 - rq->nr_phys_segments = bio_phys_segments(q, bio); 1477 - else if (bio_op(bio) == REQ_OP_DISCARD) 1478 - rq->nr_phys_segments = 1; 1479 - 1480 - rq->__data_len = bio->bi_iter.bi_size; 1481 - rq->bio = rq->biotail = bio; 1482 - 1483 - if (bio->bi_disk) 1484 - rq->rq_disk = bio->bi_disk; 1485 - } 1486 1441 1487 1442 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1488 1443 /**
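The blk_op_str() helper added in blk-core.c above combines a designated-initializer table with a bounds check and a NULL check. The sketch below reproduces that lookup in user space. The DEMO_OP_* values are hypothetical and chosen to leave a hole at index 4; the kernel's real REQ_OP_* numbering is not shown in this diff.

```c
#include <stddef.h>

/* Hypothetical opcode values for illustration only. */
enum demo_op {
	DEMO_OP_READ = 0,
	DEMO_OP_WRITE = 1,
	DEMO_OP_FLUSH = 2,
	DEMO_OP_DISCARD = 3,
	/* 4 is deliberately unassigned to create a hole in the table */
	DEMO_OP_SECURE_ERASE = 5,
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Index the table by opcode; slots that are never named stay NULL. */
#define DEMO_OP_NAME(name) [DEMO_OP_##name] = #name
static const char *const demo_op_name[] = {
	DEMO_OP_NAME(READ),
	DEMO_OP_NAME(WRITE),
	DEMO_OP_NAME(FLUSH),
	DEMO_OP_NAME(DISCARD),
	DEMO_OP_NAME(SECURE_ERASE),
};
#undef DEMO_OP_NAME

/* Return the name for op, or "UNKNOWN" for holes and out-of-range values. */
const char *demo_op_str(unsigned int op)
{
	const char *op_str = "UNKNOWN";

	if (op < DEMO_ARRAY_SIZE(demo_op_name) && demo_op_name[op])
		op_str = demo_op_name[op];

	return op_str;
}
```

Both checks matter: the size check rejects values past the last named opcode, and the NULL check covers gaps inside the table.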

+17 -34

block/blk-iolatency.c

··· 618 618 619 619 inflight = atomic_dec_return(&rqw->inflight); 620 620 WARN_ON_ONCE(inflight < 0); 621 - if (iolat->min_lat_nsec == 0) 622 - goto next; 623 - iolatency_record_time(iolat, &bio->bi_issue, now, 624 - issue_as_root); 625 - window_start = atomic64_read(&iolat->window_start); 626 - if (now > window_start && 627 - (now - window_start) >= iolat->cur_win_nsec) { 628 - if (atomic64_cmpxchg(&iolat->window_start, 629 - window_start, now) == window_start) 630 - iolatency_check_latencies(iolat, now); 621 + /* 622 + * If bi_status is BLK_STS_AGAIN, the bio wasn't actually 623 + * submitted, so do not account for it. 624 + */ 625 + if (iolat->min_lat_nsec && bio->bi_status != BLK_STS_AGAIN) { 626 + iolatency_record_time(iolat, &bio->bi_issue, now, 627 + issue_as_root); 628 + window_start = atomic64_read(&iolat->window_start); 629 + if (now > window_start && 630 + (now - window_start) >= iolat->cur_win_nsec) { 631 + if (atomic64_cmpxchg(&iolat->window_start, 632 + window_start, now) == window_start) 633 + iolatency_check_latencies(iolat, now); 634 + } 631 635 } 632 - next: 633 636 wake_up(&rqw->wait); 634 - blkg = blkg->parent; 635 - } 636 - } 637 - 638 - static void blkcg_iolatency_cleanup(struct rq_qos *rqos, struct bio *bio) 639 - { 640 - struct blkcg_gq *blkg; 641 - 642 - blkg = bio->bi_blkg; 643 - while (blkg && blkg->parent) { 644 - struct rq_wait *rqw; 645 - struct iolatency_grp *iolat; 646 - 647 - iolat = blkg_to_lat(blkg); 648 - if (!iolat) 649 - goto next; 650 - 651 - rqw = &iolat->rq_wait; 652 - atomic_dec(&rqw->inflight); 653 - wake_up(&rqw->wait); 654 - next: 655 637 blkg = blkg->parent; 656 638 } 657 639 } ··· 649 667 650 668 static struct rq_qos_ops blkcg_iolatency_ops = { 651 669 .throttle = blkcg_iolatency_throttle, 652 - .cleanup = blkcg_iolatency_cleanup, 653 670 .done_bio = blkcg_iolatency_done_bio, 654 671 .exit = blkcg_iolatency_exit, 655 672 }; ··· 759 778 760 779 if (!oldval && val) 761 780 return 1; 762 - if (oldval && !val) 781 + if 
(oldval && !val) { 782 + blkcg_clear_delay(blkg); 763 783 return -1; 784 + } 764 785 return 0; 765 786 } 766 787
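The window check kept inside the new if block in blkcg_iolatency_done_bio() relies on a compare-and-exchange so that only one completer closes an expired window. The following is a rough user-space analogue using C11 atomics. The struct and function names are hypothetical, and the kernel's atomic64_cmpxchg() is replaced here by atomic_compare_exchange_strong().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the window fields of struct iolatency_grp. */
struct demo_window {
	_Atomic uint64_t window_start;	/* start of the current window, in ns */
	uint64_t cur_win_nsec;		/* window length, in ns */
};

/*
 * Return true if this caller advanced the window to 'now' and should
 * therefore run the latency check; false if the window has not expired
 * or another caller won the race.
 */
bool demo_window_try_roll(struct demo_window *w, uint64_t now)
{
	uint64_t start = atomic_load(&w->window_start);

	if (now > start && (now - start) >= w->cur_win_nsec) {
		/* Only the caller whose exchange succeeds closes the window. */
		return atomic_compare_exchange_strong(&w->window_start,
						      &start, now);
	}

	return false;
}
```

In the hunk above this check now runs only when min_lat_nsec is nonzero and bi_status is not BLK_STS_AGAIN, while the wake_up() and the walk to blkg->parent happen unconditionally.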

+8 -2

block/blk-map.c

···
 int blk_rq_append_bio(struct request *rq, struct bio **bio)
 {
 	struct bio *orig_bio = *bio;
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	unsigned int nr_segs = 0;

 	blk_queue_bounce(rq->q, bio);

+	bio_for_each_bvec(bv, *bio, iter)
+		nr_segs++;
+
 	if (!rq->bio) {
-		blk_rq_bio_prep(rq->q, rq, *bio);
+		blk_rq_bio_prep(rq, *bio, nr_segs);
 	} else {
-		if (!ll_back_merge_fn(rq->q, rq, *bio)) {
+		if (!ll_back_merge_fn(rq, *bio, nr_segs)) {
 			if (orig_bio != *bio) {
 				bio_put(*bio);
 				*bio = orig_bio;

+37 -75

block/blk-merge.c

··· 105 105 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q, 106 106 struct bio *bio, struct bio_set *bs, unsigned *nsegs) 107 107 { 108 - *nsegs = 1; 108 + *nsegs = 0; 109 109 110 110 if (!q->limits.max_write_zeroes_sectors) 111 111 return NULL; ··· 202 202 struct bio_vec bv, bvprv, *bvprvp = NULL; 203 203 struct bvec_iter iter; 204 204 unsigned nsegs = 0, sectors = 0; 205 - bool do_split = true; 206 - struct bio *new = NULL; 207 205 const unsigned max_sectors = get_max_io_size(q, bio); 208 206 const unsigned max_segs = queue_max_segments(q); 209 207 ··· 243 245 } 244 246 } 245 247 246 - do_split = false; 248 + *segs = nsegs; 249 + return NULL; 247 250 split: 248 251 *segs = nsegs; 249 - 250 - if (do_split) { 251 - new = bio_split(bio, sectors, GFP_NOIO, bs); 252 - if (new) 253 - bio = new; 254 - } 255 - 256 - return do_split ? new : NULL; 252 + return bio_split(bio, sectors, GFP_NOIO, bs); 257 253 } 258 254 259 - void blk_queue_split(struct request_queue *q, struct bio **bio) 255 + void __blk_queue_split(struct request_queue *q, struct bio **bio, 256 + unsigned int *nr_segs) 260 257 { 261 - struct bio *split, *res; 262 - unsigned nsegs; 258 + struct bio *split; 263 259 264 260 switch (bio_op(*bio)) { 265 261 case REQ_OP_DISCARD: 266 262 case REQ_OP_SECURE_ERASE: 267 - split = blk_bio_discard_split(q, *bio, &q->bio_split, &nsegs); 263 + split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs); 268 264 break; 269 265 case REQ_OP_WRITE_ZEROES: 270 - split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split, &nsegs); 266 + split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split, 267 + nr_segs); 271 268 break; 272 269 case REQ_OP_WRITE_SAME: 273 - split = blk_bio_write_same_split(q, *bio, &q->bio_split, &nsegs); 270 + split = blk_bio_write_same_split(q, *bio, &q->bio_split, 271 + nr_segs); 274 272 break; 275 273 default: 276 - split = blk_bio_segment_split(q, *bio, &q->bio_split, &nsegs); 274 + split = blk_bio_segment_split(q, *bio, 
&q->bio_split, nr_segs); 277 275 break; 278 276 } 279 - 280 - /* physical segments can be figured out during splitting */ 281 - res = split ? split : *bio; 282 - res->bi_phys_segments = nsegs; 283 - bio_set_flag(res, BIO_SEG_VALID); 284 277 285 278 if (split) { 286 279 /* there isn't chance to merge the splitted bio */ ··· 293 304 *bio = split; 294 305 } 295 306 } 307 + 308 + void blk_queue_split(struct request_queue *q, struct bio **bio) 309 + { 310 + unsigned int nr_segs; 311 + 312 + __blk_queue_split(q, bio, &nr_segs); 313 + } 296 314 EXPORT_SYMBOL(blk_queue_split); 297 315 298 - static unsigned int __blk_recalc_rq_segments(struct request_queue *q, 299 - struct bio *bio) 316 + unsigned int blk_recalc_rq_segments(struct request *rq) 300 317 { 301 318 unsigned int nr_phys_segs = 0; 302 - struct bvec_iter iter; 319 + struct req_iterator iter; 303 320 struct bio_vec bv; 304 321 305 - if (!bio) 322 + if (!rq->bio) 306 323 return 0; 307 324 308 - switch (bio_op(bio)) { 325 + switch (bio_op(rq->bio)) { 309 326 case REQ_OP_DISCARD: 310 327 case REQ_OP_SECURE_ERASE: 311 328 case REQ_OP_WRITE_ZEROES: ··· 320 325 return 1; 321 326 } 322 327 323 - for_each_bio(bio) { 324 - bio_for_each_bvec(bv, bio, iter) 325 - bvec_split_segs(q, &bv, &nr_phys_segs, NULL, UINT_MAX); 326 - } 327 - 328 + rq_for_each_bvec(bv, rq, iter) 329 + bvec_split_segs(rq->q, &bv, &nr_phys_segs, NULL, UINT_MAX); 328 330 return nr_phys_segs; 329 - } 330 - 331 - void blk_recalc_rq_segments(struct request *rq) 332 - { 333 - rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio); 334 - } 335 - 336 - void blk_recount_segments(struct request_queue *q, struct bio *bio) 337 - { 338 - struct bio *nxt = bio->bi_next; 339 - 340 - bio->bi_next = NULL; 341 - bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio); 342 - bio->bi_next = nxt; 343 - 344 - bio_set_flag(bio, BIO_SEG_VALID); 345 331 } 346 332 347 333 static inline struct scatterlist *blk_next_sg(struct scatterlist **sg, ··· 495 519 } 496 520 
EXPORT_SYMBOL(blk_rq_map_sg); 497 521 498 - static inline int ll_new_hw_segment(struct request_queue *q, 499 - struct request *req, 500 - struct bio *bio) 522 + static inline int ll_new_hw_segment(struct request *req, struct bio *bio, 523 + unsigned int nr_phys_segs) 501 524 { 502 - int nr_phys_segs = bio_phys_segments(q, bio); 503 - 504 - if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q)) 525 + if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(req->q)) 505 526 goto no_merge; 506 527 507 - if (blk_integrity_merge_bio(q, req, bio) == false) 528 + if (blk_integrity_merge_bio(req->q, req, bio) == false) 508 529 goto no_merge; 509 530 510 531 /* ··· 512 539 return 1; 513 540 514 541 no_merge: 515 - req_set_nomerge(q, req); 542 + req_set_nomerge(req->q, req); 516 543 return 0; 517 544 } 518 545 519 - int ll_back_merge_fn(struct request_queue *q, struct request *req, 520 - struct bio *bio) 546 + int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs) 521 547 { 522 548 if (req_gap_back_merge(req, bio)) 523 549 return 0; ··· 525 553 return 0; 526 554 if (blk_rq_sectors(req) + bio_sectors(bio) > 527 555 blk_rq_get_max_sectors(req, blk_rq_pos(req))) { 528 - req_set_nomerge(q, req); 556 + req_set_nomerge(req->q, req); 529 557 return 0; 530 558 } 531 - if (!bio_flagged(req->biotail, BIO_SEG_VALID)) 532 - blk_recount_segments(q, req->biotail); 533 - if (!bio_flagged(bio, BIO_SEG_VALID)) 534 - blk_recount_segments(q, bio); 535 559 536 - return ll_new_hw_segment(q, req, bio); 560 + return ll_new_hw_segment(req, bio, nr_segs); 537 561 } 538 562 539 - int ll_front_merge_fn(struct request_queue *q, struct request *req, 540 - struct bio *bio) 563 + int ll_front_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs) 541 564 { 542 - 543 565 if (req_gap_front_merge(req, bio)) 544 566 return 0; 545 567 if (blk_integrity_rq(req) && ··· 541 575 return 0; 542 576 if (blk_rq_sectors(req) + bio_sectors(bio) > 543 577 
blk_rq_get_max_sectors(req, bio->bi_iter.bi_sector)) { 544 - req_set_nomerge(q, req); 578 + req_set_nomerge(req->q, req); 545 579 return 0; 546 580 } 547 - if (!bio_flagged(bio, BIO_SEG_VALID)) 548 - blk_recount_segments(q, bio); 549 - if (!bio_flagged(req->bio, BIO_SEG_VALID)) 550 - blk_recount_segments(q, req->bio); 551 581 552 - return ll_new_hw_segment(q, req, bio); 582 + return ll_new_hw_segment(req, bio, nr_segs); 553 583 } 554 584 555 585 static bool req_attempt_discard_merge(struct request_queue *q, struct request *req,
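Two patterns in the blk-merge.c hunks above can be sketched together: blk_bio_segment_split() now stores the segment count and returns on each path instead of carrying a do_split flag, and blk_queue_split() becomes a thin wrapper around __blk_queue_split() that discards the count. The user-space sketch below is an assumption-laden simplification. It treats every vector as exactly one segment, returns a split index instead of a new bio, and assumes max_segs is greater than zero.

```c
#include <stddef.h>

/*
 * Count segments up to max_segs and report the count through *nr_segs.
 * Returns 0 when everything fits (no split needed), otherwise the number
 * of vectors that stay in the front part.
 */
size_t demo_split_point_segs(size_t n_vecs, unsigned int max_segs,
			     unsigned int *nr_segs)
{
	unsigned int nsegs = 0;
	size_t i;

	for (i = 0; i < n_vecs; i++) {
		if (nsegs == max_segs)
			goto split;
		nsegs++;
	}

	*nr_segs = nsegs;
	return 0;		/* fits: nothing to split off */
split:
	*nr_segs = nsegs;
	return i;		/* front part keeps the first i vectors */
}

/* Wrapper for callers that do not need the segment count. */
size_t demo_split_point(size_t n_vecs, unsigned int max_segs)
{
	unsigned int nr_segs;

	return demo_split_point_segs(n_vecs, max_segs, &nr_segs);
}
```

Callers that go on to merge or build a request would use the first form and pass nr_segs along, which mirrors how the merge helpers in this diff now take nr_segs instead of recounting.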

+13 -29

block/blk-mq-debugfs.c

··· 17 17 static void print_stat(struct seq_file *m, struct blk_rq_stat *stat) 18 18 { 19 19 if (stat->nr_samples) { 20 - seq_printf(m, "samples=%d, mean=%lld, min=%llu, max=%llu", 20 + seq_printf(m, "samples=%d, mean=%llu, min=%llu, max=%llu", 21 21 stat->nr_samples, stat->mean, stat->min, stat->max); 22 22 } else { 23 23 seq_puts(m, "samples=0"); ··· 29 29 struct request_queue *q = data; 30 30 int bucket; 31 31 32 - for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS/2; bucket++) { 33 - seq_printf(m, "read (%d Bytes): ", 1 << (9+bucket)); 34 - print_stat(m, &q->poll_stat[2*bucket]); 32 + for (bucket = 0; bucket < (BLK_MQ_POLL_STATS_BKTS / 2); bucket++) { 33 + seq_printf(m, "read (%d Bytes): ", 1 << (9 + bucket)); 34 + print_stat(m, &q->poll_stat[2 * bucket]); 35 35 seq_puts(m, "\n"); 36 36 37 - seq_printf(m, "write (%d Bytes): ", 1 << (9+bucket)); 38 - print_stat(m, &q->poll_stat[2*bucket+1]); 37 + seq_printf(m, "write (%d Bytes): ", 1 << (9 + bucket)); 38 + print_stat(m, &q->poll_stat[2 * bucket + 1]); 39 39 seq_puts(m, "\n"); 40 40 } 41 41 return 0; ··· 261 261 return 0; 262 262 } 263 263 264 - #define REQ_OP_NAME(name) [REQ_OP_##name] = #name 265 - static const char *const op_name[] = { 266 - REQ_OP_NAME(READ), 267 - REQ_OP_NAME(WRITE), 268 - REQ_OP_NAME(FLUSH), 269 - REQ_OP_NAME(DISCARD), 270 - REQ_OP_NAME(SECURE_ERASE), 271 - REQ_OP_NAME(ZONE_RESET), 272 - REQ_OP_NAME(WRITE_SAME), 273 - REQ_OP_NAME(WRITE_ZEROES), 274 - REQ_OP_NAME(SCSI_IN), 275 - REQ_OP_NAME(SCSI_OUT), 276 - REQ_OP_NAME(DRV_IN), 277 - REQ_OP_NAME(DRV_OUT), 278 - }; 279 - #undef REQ_OP_NAME 280 - 281 264 #define CMD_FLAG_NAME(name) [__REQ_##name] = #name 282 265 static const char *const cmd_flag_name[] = { 283 266 CMD_FLAG_NAME(FAILFAST_DEV), ··· 324 341 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq) 325 342 { 326 343 const struct blk_mq_ops *const mq_ops = rq->q->mq_ops; 327 - const unsigned int op = rq->cmd_flags & REQ_OP_MASK; 344 + const unsigned int op = req_op(rq); 
345 + const char *op_str = blk_op_str(op); 328 346 329 347 seq_printf(m, "%p {.op=", rq); 330 - if (op < ARRAY_SIZE(op_name) && op_name[op]) 331 - seq_printf(m, "%s", op_name[op]); 348 + if (strcmp(op_str, "UNKNOWN") == 0) 349 + seq_printf(m, "%u", op); 332 350 else 333 - seq_printf(m, "%d", op); 351 + seq_printf(m, "%s", op_str); 334 352 seq_puts(m, ", .cmd_flags="); 335 353 blk_flags_show(m, rq->cmd_flags & ~REQ_OP_MASK, cmd_flag_name, 336 354 ARRAY_SIZE(cmd_flag_name)); ··· 763 779 764 780 if (attr->show) 765 781 return single_release(inode, file); 766 - else 767 - return seq_release(inode, file); 782 + 783 + return seq_release(inode, file); 768 784 } 769 785 770 786 static const struct file_operations blk_mq_debugfs_fops = {
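After this change, __blk_mq_debugfs_rq_show() decides between the symbolic and the numeric form by comparing the string returned by blk_op_str() against "UNKNOWN". A small sketch of that caller-side fallback follows. demo_format_op() is a hypothetical helper that writes into a buffer instead of a seq_file, and it assumes the lookup function uses exactly "UNKNOWN" as its sentinel, as the kernel-doc for blk_op_str() in this diff states.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format an operation for display: the symbolic name when the lookup
 * succeeded, the raw number when it returned the "UNKNOWN" sentinel.
 * Returns the snprintf() result.
 */
int demo_format_op(char *buf, size_t len, unsigned int op,
		   const char *op_str)
{
	if (strcmp(op_str, "UNKNOWN") == 0)
		return snprintf(buf, len, "%u", op);

	return snprintf(buf, len, "%s", op_str);
}
```

The trade-off of a string sentinel is that the caller and the lookup must agree on its spelling, whereas the removed code tested the table entry directly.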

+16 -15

block/blk-mq-sched.c

··· 224 224 } 225 225 226 226 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, 227 - struct request **merged_request) 227 + unsigned int nr_segs, struct request **merged_request) 228 228 { 229 229 struct request *rq; 230 230 ··· 232 232 case ELEVATOR_BACK_MERGE: 233 233 if (!blk_mq_sched_allow_merge(q, rq, bio)) 234 234 return false; 235 - if (!bio_attempt_back_merge(q, rq, bio)) 235 + if (!bio_attempt_back_merge(rq, bio, nr_segs)) 236 236 return false; 237 237 *merged_request = attempt_back_merge(q, rq); 238 238 if (!*merged_request) ··· 241 241 case ELEVATOR_FRONT_MERGE: 242 242 if (!blk_mq_sched_allow_merge(q, rq, bio)) 243 243 return false; 244 - if (!bio_attempt_front_merge(q, rq, bio)) 244 + if (!bio_attempt_front_merge(rq, bio, nr_segs)) 245 245 return false; 246 246 *merged_request = attempt_front_merge(q, rq); 247 247 if (!*merged_request) ··· 260 260 * of them. 261 261 */ 262 262 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list, 263 - struct bio *bio) 263 + struct bio *bio, unsigned int nr_segs) 264 264 { 265 265 struct request *rq; 266 266 int checked = 8; ··· 277 277 switch (blk_try_merge(rq, bio)) { 278 278 case ELEVATOR_BACK_MERGE: 279 279 if (blk_mq_sched_allow_merge(q, rq, bio)) 280 - merged = bio_attempt_back_merge(q, rq, bio); 280 + merged = bio_attempt_back_merge(rq, bio, 281 + nr_segs); 281 282 break; 282 283 case ELEVATOR_FRONT_MERGE: 283 284 if (blk_mq_sched_allow_merge(q, rq, bio)) 284 - merged = bio_attempt_front_merge(q, rq, bio); 285 + merged = bio_attempt_front_merge(rq, bio, 286 + nr_segs); 285 287 break; 286 288 case ELEVATOR_DISCARD_MERGE: 287 289 merged = bio_attempt_discard_merge(q, rq, bio); ··· 306 304 */ 307 305 static bool blk_mq_attempt_merge(struct request_queue *q, 308 306 struct blk_mq_hw_ctx *hctx, 309 - struct blk_mq_ctx *ctx, struct bio *bio) 307 + struct blk_mq_ctx *ctx, struct bio *bio, 308 + unsigned int nr_segs) 310 309 { 311 310 enum hctx_type type = hctx->type; 312 311 
313 312 lockdep_assert_held(&ctx->lock); 314 313 315 - if (blk_mq_bio_list_merge(q, &ctx->rq_lists[type], bio)) { 314 + if (blk_mq_bio_list_merge(q, &ctx->rq_lists[type], bio, nr_segs)) { 316 315 ctx->rq_merged++; 317 316 return true; 318 317 } ··· 321 318 return false; 322 319 } 323 320 324 - bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio) 321 + bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio, 322 + unsigned int nr_segs) 325 323 { 326 324 struct elevator_queue *e = q->elevator; 327 325 struct blk_mq_ctx *ctx = blk_mq_get_ctx(q); ··· 330 326 bool ret = false; 331 327 enum hctx_type type; 332 328 333 - if (e && e->type->ops.bio_merge) { 334 - blk_mq_put_ctx(ctx); 335 - return e->type->ops.bio_merge(hctx, bio); 336 - } 329 + if (e && e->type->ops.bio_merge) 330 + return e->type->ops.bio_merge(hctx, bio, nr_segs); 337 331 338 332 type = hctx->type; 339 333 if ((hctx->flags & BLK_MQ_F_SHOULD_MERGE) && 340 334 !list_empty_careful(&ctx->rq_lists[type])) { 341 335 /* default per sw-queue merge */ 342 336 spin_lock(&ctx->lock); 343 - ret = blk_mq_attempt_merge(q, hctx, ctx, bio); 337 + ret = blk_mq_attempt_merge(q, hctx, ctx, bio, nr_segs); 344 338 spin_unlock(&ctx->lock); 345 339 } 346 340 347 - blk_mq_put_ctx(ctx); 348 341 return ret; 349 342 } 350 343

+6 -4

block/blk-mq-sched.h

···

 void blk_mq_sched_request_inserted(struct request *rq);
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-				struct request **merged_request);
-bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
+		unsigned int nr_segs, struct request **merged_request);
+bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
+		unsigned int nr_segs);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
 void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx);
 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
···
 void blk_mq_sched_free_requests(struct request_queue *q);

 static inline bool
-blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
+blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
+		unsigned int nr_segs)
 {
 	if (blk_queue_nomerges(q) || !bio_mergeable(bio))
 		return false;

-	return __blk_mq_sched_bio_merge(q, bio);
+	return __blk_mq_sched_bio_merge(q, bio, nr_segs);
 }

 static inline bool

-8

block/blk-mq-tag.c

···
 	struct sbq_wait_state *ws;
 	DEFINE_SBQ_WAIT(wait);
 	unsigned int tag_offset;
-	bool drop_ctx;
 	int tag;

 	if (data->flags & BLK_MQ_REQ_RESERVED) {
···
 		return BLK_MQ_TAG_FAIL;

 	ws = bt_wait_ptr(bt, data->hctx);
-	drop_ctx = data->ctx == NULL;
 	do {
 		struct sbitmap_queue *bt_prev;

···
 		tag = __blk_mq_get_tag(data, bt);
 		if (tag != -1)
 			break;
-
-		if (data->ctx)
-			blk_mq_put_ctx(data->ctx);

 		bt_prev = bt;
 		io_schedule();
···

 		ws = bt_wait_ptr(bt, data->hctx);
 	} while (1);
-
-	if (drop_ctx && data->ctx)
-		blk_mq_put_ctx(data->ctx);

 	sbitmap_finish_wait(bt, ws, &wait);


+17 -27

block/blk-mq.c

··· 355 355 struct elevator_queue *e = q->elevator; 356 356 struct request *rq; 357 357 unsigned int tag; 358 - bool put_ctx_on_error = false; 358 + bool clear_ctx_on_error = false; 359 359 360 360 blk_queue_enter_live(q); 361 361 data->q = q; 362 362 if (likely(!data->ctx)) { 363 363 data->ctx = blk_mq_get_ctx(q); 364 - put_ctx_on_error = true; 364 + clear_ctx_on_error = true; 365 365 } 366 366 if (likely(!data->hctx)) 367 367 data->hctx = blk_mq_map_queue(q, data->cmd_flags, ··· 387 387 388 388 tag = blk_mq_get_tag(data); 389 389 if (tag == BLK_MQ_TAG_FAIL) { 390 - if (put_ctx_on_error) { 391 - blk_mq_put_ctx(data->ctx); 390 + if (clear_ctx_on_error) 392 391 data->ctx = NULL; 393 - } 394 392 blk_queue_exit(q); 395 393 return NULL; 396 394 } ··· 424 426 425 427 if (!rq) 426 428 return ERR_PTR(-EWOULDBLOCK); 427 - 428 - blk_mq_put_ctx(alloc_data.ctx); 429 429 430 430 rq->__data_len = 0; 431 431 rq->__sector = (sector_t) -1; ··· 1760 1764 } 1761 1765 } 1762 1766 1763 - static void blk_mq_bio_to_request(struct request *rq, struct bio *bio) 1767 + static void blk_mq_bio_to_request(struct request *rq, struct bio *bio, 1768 + unsigned int nr_segs) 1764 1769 { 1765 - blk_init_request_from_bio(rq, bio); 1770 + if (bio->bi_opf & REQ_RAHEAD) 1771 + rq->cmd_flags |= REQ_FAILFAST_MASK; 1772 + 1773 + rq->__sector = bio->bi_iter.bi_sector; 1774 + rq->write_hint = bio->bi_write_hint; 1775 + blk_rq_bio_prep(rq, bio, nr_segs); 1766 1776 1767 1777 blk_account_io_start(rq, true); 1768 1778 } ··· 1938 1936 struct request *rq; 1939 1937 struct blk_plug *plug; 1940 1938 struct request *same_queue_rq = NULL; 1939 + unsigned int nr_segs; 1941 1940 blk_qc_t cookie; 1942 1941 1943 1942 blk_queue_bounce(q, &bio); 1944 - 1945 - blk_queue_split(q, &bio); 1943 + __blk_queue_split(q, &bio, &nr_segs); 1946 1944 1947 1945 if (!bio_integrity_prep(bio)) 1948 1946 return BLK_QC_T_NONE; 1949 1947 1950 1948 if (!is_flush_fua && !blk_queue_nomerges(q) && 1951 - blk_attempt_plug_merge(q, bio, 
&same_queue_rq)) 1949 + blk_attempt_plug_merge(q, bio, nr_segs, &same_queue_rq)) 1952 1950 return BLK_QC_T_NONE; 1953 1951 1954 - if (blk_mq_sched_bio_merge(q, bio)) 1952 + if (blk_mq_sched_bio_merge(q, bio, nr_segs)) 1955 1953 return BLK_QC_T_NONE; 1956 1954 1957 1955 rq_qos_throttle(q, bio); ··· 1971 1969 1972 1970 cookie = request_to_qc_t(data.hctx, rq); 1973 1971 1972 + blk_mq_bio_to_request(rq, bio, nr_segs); 1973 + 1974 1974 plug = current->plug; 1975 1975 if (unlikely(is_flush_fua)) { 1976 - blk_mq_put_ctx(data.ctx); 1977 - blk_mq_bio_to_request(rq, bio); 1978 - 1979 1976 /* bypass scheduler for flush rq */ 1980 1977 blk_insert_flush(rq); 1981 1978 blk_mq_run_hw_queue(data.hctx, true); ··· 1985 1984 */ 1986 1985 unsigned int request_count = plug->rq_count; 1987 1986 struct request *last = NULL; 1988 - 1989 - blk_mq_put_ctx(data.ctx); 1990 - blk_mq_bio_to_request(rq, bio); 1991 1987 1992 1988 if (!request_count) 1993 1989 trace_block_plug(q); ··· 1999 2001 2000 2002 blk_add_rq_to_plug(plug, rq); 2001 2003 } else if (plug && !blk_queue_nomerges(q)) { 2002 - blk_mq_bio_to_request(rq, bio); 2003 - 2004 2004 /* 2005 2005 * We do limited plugging. If the bio can be merged, do that. 2006 2006 * Otherwise the existing request in the plug list will be ··· 2015 2019 blk_add_rq_to_plug(plug, rq); 2016 2020 trace_block_plug(q); 2017 2021 2018 - blk_mq_put_ctx(data.ctx); 2019 - 2020 2022 if (same_queue_rq) { 2021 2023 data.hctx = same_queue_rq->mq_hctx; 2022 2024 trace_block_unplug(q, 1, true); ··· 2023 2029 } 2024 2030 } else if ((q->nr_hw_queues > 1 && is_sync) || (!q->elevator && 2025 2031 !data.hctx->dispatch_busy)) { 2026 - blk_mq_put_ctx(data.ctx); 2027 - blk_mq_bio_to_request(rq, bio); 2028 2032 blk_mq_try_issue_directly(data.hctx, rq, &cookie); 2029 2033 } else { 2030 - blk_mq_put_ctx(data.ctx); 2031 - blk_mq_bio_to_request(rq, bio); 2032 2034 blk_mq_sched_insert_request(rq, false, true, true); 2033 2035 } 2034 2036

+1 -6

block/blk-mq.h

···
  */
 static inline struct blk_mq_ctx *blk_mq_get_ctx(struct request_queue *q)
 {
-	return __blk_mq_get_ctx(q, get_cpu());
-}
-
-static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
-{
-	put_cpu();
+	return __blk_mq_get_ctx(q, raw_smp_processor_id());
 }

 struct blk_mq_alloc_data {

+24 -12

block/blk.h

··· 51 51 int node, int cmd_size, gfp_t flags); 52 52 void blk_free_flush_queue(struct blk_flush_queue *q); 53 53 54 - void blk_rq_bio_prep(struct request_queue *q, struct request *rq, 55 - struct bio *bio); 56 54 void blk_freeze_queue(struct request_queue *q); 57 55 58 56 static inline void blk_queue_enter_live(struct request_queue *q) ··· 97 99 if (!queue_virt_boundary(q)) 98 100 return false; 99 101 return __bvec_gap_to_prev(q, bprv, offset); 102 + } 103 + 104 + static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio, 105 + unsigned int nr_segs) 106 + { 107 + rq->nr_phys_segments = nr_segs; 108 + rq->__data_len = bio->bi_iter.bi_size; 109 + rq->bio = rq->biotail = bio; 110 + rq->ioprio = bio_prio(bio); 111 + 112 + if (bio->bi_disk) 113 + rq->rq_disk = bio->bi_disk; 100 114 } 101 115 102 116 #ifdef CONFIG_BLK_DEV_INTEGRITY ··· 164 154 unsigned long blk_rq_timeout(unsigned long timeout); 165 155 void blk_add_timer(struct request *req); 166 156 167 - bool bio_attempt_front_merge(struct request_queue *q, struct request *req, 168 - struct bio *bio); 169 - bool bio_attempt_back_merge(struct request_queue *q, struct request *req, 170 - struct bio *bio); 157 + bool bio_attempt_front_merge(struct request *req, struct bio *bio, 158 + unsigned int nr_segs); 159 + bool bio_attempt_back_merge(struct request *req, struct bio *bio, 160 + unsigned int nr_segs); 171 161 bool bio_attempt_discard_merge(struct request_queue *q, struct request *req, 172 162 struct bio *bio); 173 163 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, 174 - struct request **same_queue_rq); 164 + unsigned int nr_segs, struct request **same_queue_rq); 175 165 176 166 void blk_account_io_start(struct request *req, bool new_io); 177 167 void blk_account_io_completion(struct request *req, unsigned int bytes); ··· 212 202 } 213 203 #endif 214 204 215 - int ll_back_merge_fn(struct request_queue *q, struct request *req, 216 - struct bio *bio); 217 - int 
ll_front_merge_fn(struct request_queue *q, struct request *req, 218 - struct bio *bio); 205 + void __blk_queue_split(struct request_queue *q, struct bio **bio, 206 + unsigned int *nr_segs); 207 + int ll_back_merge_fn(struct request *req, struct bio *bio, 208 + unsigned int nr_segs); 209 + int ll_front_merge_fn(struct request *req, struct bio *bio, 210 + unsigned int nr_segs); 219 211 struct request *attempt_back_merge(struct request_queue *q, struct request *rq); 220 212 struct request *attempt_front_merge(struct request_queue *q, struct request *rq); 221 213 int blk_attempt_req_merge(struct request_queue *q, struct request *rq, 222 214 struct request *next); 223 - void blk_recalc_rq_segments(struct request *rq); 215 + unsigned int blk_recalc_rq_segments(struct request *rq); 224 216 void blk_rq_set_mixed_merge(struct request *rq); 225 217 bool blk_rq_merge_ok(struct request *rq, struct bio *bio); 226 218 enum elv_merge blk_try_merge(struct request *rq, struct bio *bio);

+2 -3

block/genhd.c

···
1281 1281   	struct disk_part_tbl *new_ptbl;
1282 1282   	int len = old_ptbl ? old_ptbl->len : 0;
1283 1283   	int i, target;
1284      - 	size_t size;
1285 1284
1286 1285   	/*
1287 1286   	 * check for int overflow, since we can get here from blkpg_ioctl()
···
1297 1298   	if (target <= len)
1298 1299   		return 0;
1299 1300
1300      - 	size = sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0]);
1301      - 	new_ptbl = kzalloc_node(size, GFP_KERNEL, disk->node_id);
     1301 + 	new_ptbl = kzalloc_node(struct_size(new_ptbl, part, target), GFP_KERNEL,
     1302 + 				disk->node_id);
1302 1303   	if (!new_ptbl)
1303 1304   		return -ENOMEM;
1304 1305
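The genhd.c hunk above swaps an open-coded `sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0])` for `struct_size()`. The following is a minimal userspace sketch of that idea, not the kernel's implementation: `struct ptbl_demo`, `flex_size()` and `ptbl_demo_size()` are hypothetical names, and the sketch assumes the point of the helper is to saturate at `SIZE_MAX` when the arithmetic would wrap, so that the allocation fails instead of returning an undersized buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a struct that ends in a flexible array member. */
struct ptbl_demo {
	int len;
	void *part[];
};

/*
 * Illustrative overflow-checked size computation: base + count * elem.
 * Returns SIZE_MAX when the result would not fit in a size_t.
 */
static size_t flex_size(size_t base, size_t elem, size_t count)
{
	assert(elem != 0);
	if (count > (SIZE_MAX - base) / elem)
		return SIZE_MAX;
	return base + count * elem;
}

/* Size of a ptbl_demo with `count` trailing part[] entries. */
static size_t ptbl_demo_size(size_t count)
{
	return flex_size(sizeof(struct ptbl_demo), sizeof(void *), count);
}
```

With a small count the result is the same as the open-coded expression; with an absurd count the function returns `SIZE_MAX`, which no allocator can satisfy.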

+3 -3

block/kyber-iosched.c

···
562 562   	}
563 563   }
564 564
565     - static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
    565 + static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
    566 + 		unsigned int nr_segs)
566 567   {
567 568   	struct kyber_hctx_data *khd = hctx->sched_data;
568 569   	struct blk_mq_ctx *ctx = blk_mq_get_ctx(hctx->queue);
···
573 572   	bool merged;
574 573
575 574   	spin_lock(&kcq->lock);
576     - 	merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio);
    575 + 	merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio, nr_segs);
577 576   	spin_unlock(&kcq->lock);
578     - 	blk_mq_put_ctx(ctx);
579 577
580 578   	return merged;
581 579   }

+3 -2

block/mq-deadline.c

···
469 469   	return ELEVATOR_NO_MERGE;
470 470   }
471 471
472     - static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
    472 + static bool dd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
    473 + 		unsigned int nr_segs)
473 474   {
474 475   	struct request_queue *q = hctx->queue;
475 476   	struct deadline_data *dd = q->elevator->elevator_data;
···
478 477   	bool ret;
479 478
480 479   	spin_lock(&dd->lock);
481     - 	ret = blk_mq_sched_try_merge(q, bio, &free);
    480 + 	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
482 481   	spin_unlock(&dd->lock);
483 482
484 483   	if (free)

+16

block/opal_proto.h

··· 98 98 OPAL_ENTERPRISE_BANDMASTER0_UID, 99 99 OPAL_ENTERPRISE_ERASEMASTER_UID, 100 100 /* tables */ 101 + OPAL_TABLE_TABLE, 101 102 OPAL_LOCKINGRANGE_GLOBAL, 102 103 OPAL_LOCKINGRANGE_ACE_RDLOCKED, 103 104 OPAL_LOCKINGRANGE_ACE_WRLOCKED, ··· 153 152 OPAL_STARTCOLUMN = 0x03, 154 153 OPAL_ENDCOLUMN = 0x04, 155 154 OPAL_VALUES = 0x01, 155 + /* table table */ 156 + OPAL_TABLE_UID = 0x00, 157 + OPAL_TABLE_NAME = 0x01, 158 + OPAL_TABLE_COMMON = 0x02, 159 + OPAL_TABLE_TEMPLATE = 0x03, 160 + OPAL_TABLE_KIND = 0x04, 161 + OPAL_TABLE_COLUMN = 0x05, 162 + OPAL_TABLE_COLUMNS = 0x06, 163 + OPAL_TABLE_ROWS = 0x07, 164 + OPAL_TABLE_ROWS_FREE = 0x08, 165 + OPAL_TABLE_ROW_BYTES = 0x09, 166 + OPAL_TABLE_LASTID = 0x0A, 167 + OPAL_TABLE_MIN = 0x0B, 168 + OPAL_TABLE_MAX = 0x0C, 169 + 156 170 /* authority table */ 157 171 OPAL_PIN = 0x03, 158 172 /* locking tokens */

+186 -11

block/sed-opal.c

··· 26 26 #define IO_BUFFER_LENGTH 2048 27 27 #define MAX_TOKS 64 28 28 29 + /* Number of bytes needed by cmd_finalize. */ 30 + #define CMD_FINALIZE_BYTES_NEEDED 7 31 + 29 32 struct opal_step { 30 33 int (*fn)(struct opal_dev *dev, void *data); 31 34 void *data; ··· 130 127 131 128 /* tables */ 132 129 130 + [OPAL_TABLE_TABLE] 131 + { 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01 }, 133 132 [OPAL_LOCKINGRANGE_GLOBAL] = 134 133 { 0x00, 0x00, 0x08, 0x02, 0x00, 0x00, 0x00, 0x01 }, 135 134 [OPAL_LOCKINGRANGE_ACE_RDLOCKED] = ··· 528 523 return execute_step(dev, &discovery0_step, 0); 529 524 } 530 525 526 + static size_t remaining_size(struct opal_dev *cmd) 527 + { 528 + return IO_BUFFER_LENGTH - cmd->pos; 529 + } 530 + 531 531 static bool can_add(int *err, struct opal_dev *cmd, size_t len) 532 532 { 533 533 if (*err) 534 534 return false; 535 535 536 - if (len > IO_BUFFER_LENGTH || cmd->pos > IO_BUFFER_LENGTH - len) { 536 + if (remaining_size(cmd) < len) { 537 537 pr_debug("Error adding %zu bytes: end of buffer.\n", len); 538 538 *err = -ERANGE; 539 539 return false; ··· 684 674 struct opal_header *hdr; 685 675 int err = 0; 686 676 687 - /* close the parameter list opened from cmd_start */ 677 + /* 678 + * Close the parameter list opened from cmd_start. 679 + * The number of bytes added must be equal to 680 + * CMD_FINALIZE_BYTES_NEEDED. 
681 + */ 688 682 add_token_u8(&err, cmd, OPAL_ENDLIST); 689 683 690 684 add_token_u8(&err, cmd, OPAL_ENDOFDATA); ··· 1133 1119 return finalize_and_send(dev, parse_and_check_status); 1134 1120 } 1135 1121 1122 + /* 1123 + * see TCG SAS 5.3.2.3 for a description of the available columns 1124 + * 1125 + * the result is provided in dev->resp->tok[4] 1126 + */ 1127 + static int generic_get_table_info(struct opal_dev *dev, enum opal_uid table, 1128 + u64 column) 1129 + { 1130 + u8 uid[OPAL_UID_LENGTH]; 1131 + const unsigned int half = OPAL_UID_LENGTH/2; 1132 + 1133 + /* sed-opal UIDs can be split in two halves: 1134 + * first: actual table index 1135 + * second: relative index in the table 1136 + * so we have to get the first half of the OPAL_TABLE_TABLE and use the 1137 + * first part of the target table as relative index into that table 1138 + */ 1139 + memcpy(uid, opaluid[OPAL_TABLE_TABLE], half); 1140 + memcpy(uid+half, opaluid[table], half); 1141 + 1142 + return generic_get_column(dev, uid, column); 1143 + } 1144 + 1136 1145 static int gen_key(struct opal_dev *dev, void *data) 1137 1146 { 1138 1147 u8 uid[OPAL_UID_LENGTH]; ··· 1344 1307 break; 1345 1308 case OPAL_ADMIN1_UID: 1346 1309 case OPAL_SID_UID: 1310 + case OPAL_PSID_UID: 1347 1311 add_token_u8(&err, dev, OPAL_STARTNAME); 1348 1312 add_token_u8(&err, dev, 0); /* HostChallenge */ 1349 1313 add_token_bytestring(&err, dev, key, key_len); ··· 1403 1365 return start_generic_opal_session(dev, OPAL_ADMIN1_UID, 1404 1366 OPAL_LOCKINGSP_UID, 1405 1367 key->key, key->key_len); 1368 + } 1369 + 1370 + static int start_PSID_opal_session(struct opal_dev *dev, void *data) 1371 + { 1372 + const struct opal_key *okey = data; 1373 + 1374 + return start_generic_opal_session(dev, OPAL_PSID_UID, 1375 + OPAL_ADMINSP_UID, 1376 + okey->key, 1377 + okey->key_len); 1406 1378 } 1407 1379 1408 1380 static int start_auth_opal_session(struct opal_dev *dev, void *data) ··· 1571 1523 } 1572 1524 1573 1525 return finalize_and_send(dev, 
parse_and_check_status); 1526 + } 1527 + 1528 + static int write_shadow_mbr(struct opal_dev *dev, void *data) 1529 + { 1530 + struct opal_shadow_mbr *shadow = data; 1531 + const u8 __user *src; 1532 + u8 *dst; 1533 + size_t off = 0; 1534 + u64 len; 1535 + int err = 0; 1536 + 1537 + /* do we fit in the available shadow mbr space? */ 1538 + err = generic_get_table_info(dev, OPAL_MBR, OPAL_TABLE_ROWS); 1539 + if (err) { 1540 + pr_debug("MBR: could not get shadow size\n"); 1541 + return err; 1542 + } 1543 + 1544 + len = response_get_u64(&dev->parsed, 4); 1545 + if (shadow->size > len || shadow->offset > len - shadow->size) { 1546 + pr_debug("MBR: does not fit in shadow (%llu vs. %llu)\n", 1547 + shadow->offset + shadow->size, len); 1548 + return -ENOSPC; 1549 + } 1550 + 1551 + /* do the actual transmission(s) */ 1552 + src = (u8 __user *)(uintptr_t)shadow->data; 1553 + while (off < shadow->size) { 1554 + err = cmd_start(dev, opaluid[OPAL_MBR], opalmethod[OPAL_SET]); 1555 + add_token_u8(&err, dev, OPAL_STARTNAME); 1556 + add_token_u8(&err, dev, OPAL_WHERE); 1557 + add_token_u64(&err, dev, shadow->offset + off); 1558 + add_token_u8(&err, dev, OPAL_ENDNAME); 1559 + 1560 + add_token_u8(&err, dev, OPAL_STARTNAME); 1561 + add_token_u8(&err, dev, OPAL_VALUES); 1562 + 1563 + /* 1564 + * The bytestring header is either 1 or 2 bytes, so assume 2. 1565 + * There also needs to be enough space to accommodate the 1566 + * trailing OPAL_ENDNAME (1 byte) and tokens added by 1567 + * cmd_finalize. 
1568 + */ 1569 + len = min(remaining_size(dev) - (2+1+CMD_FINALIZE_BYTES_NEEDED), 1570 + (size_t)(shadow->size - off)); 1571 + pr_debug("MBR: write bytes %zu+%llu/%llu\n", 1572 + off, len, shadow->size); 1573 + 1574 + dst = add_bytestring_header(&err, dev, len); 1575 + if (!dst) 1576 + break; 1577 + if (copy_from_user(dst, src + off, len)) 1578 + err = -EFAULT; 1579 + dev->pos += len; 1580 + 1581 + add_token_u8(&err, dev, OPAL_ENDNAME); 1582 + if (err) 1583 + break; 1584 + 1585 + err = finalize_and_send(dev, parse_and_check_status); 1586 + if (err) 1587 + break; 1588 + 1589 + off += len; 1590 + } 1591 + return err; 1574 1592 } 1575 1593 1576 1594 static int generic_pw_cmd(u8 *key, size_t key_len, u8 *cpin_uid, ··· 2092 1978 return ret; 2093 1979 } 2094 1980 1981 + static int opal_set_mbr_done(struct opal_dev *dev, 1982 + struct opal_mbr_done *mbr_done) 1983 + { 1984 + u8 mbr_done_tf = mbr_done->done_flag == OPAL_MBR_DONE ? 1985 + OPAL_TRUE : OPAL_FALSE; 1986 + 1987 + const struct opal_step mbr_steps[] = { 1988 + { start_admin1LSP_opal_session, &mbr_done->key }, 1989 + { set_mbr_done, &mbr_done_tf }, 1990 + { end_opal_session, } 1991 + }; 1992 + int ret; 1993 + 1994 + if (mbr_done->done_flag != OPAL_MBR_DONE && 1995 + mbr_done->done_flag != OPAL_MBR_NOT_DONE) 1996 + return -EINVAL; 1997 + 1998 + mutex_lock(&dev->dev_lock); 1999 + setup_opal_dev(dev); 2000 + ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); 2001 + mutex_unlock(&dev->dev_lock); 2002 + return ret; 2003 + } 2004 + 2005 + static int opal_write_shadow_mbr(struct opal_dev *dev, 2006 + struct opal_shadow_mbr *info) 2007 + { 2008 + const struct opal_step mbr_steps[] = { 2009 + { start_admin1LSP_opal_session, &info->key }, 2010 + { write_shadow_mbr, info }, 2011 + { end_opal_session, } 2012 + }; 2013 + int ret; 2014 + 2015 + if (info->size == 0) 2016 + return 0; 2017 + 2018 + mutex_lock(&dev->dev_lock); 2019 + setup_opal_dev(dev); 2020 + ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); 
2021 + mutex_unlock(&dev->dev_lock); 2022 + return ret; 2023 + } 2024 + 2095 2025 static int opal_save(struct opal_dev *dev, struct opal_lock_unlock *lk_unlk) 2096 2026 { 2097 2027 struct opal_suspend_data *suspend; ··· 2188 2030 return ret; 2189 2031 } 2190 2032 2191 - static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal) 2033 + static int opal_reverttper(struct opal_dev *dev, struct opal_key *opal, bool psid) 2192 2034 { 2035 + /* controller will terminate session */ 2193 2036 const struct opal_step revert_steps[] = { 2194 2037 { start_SIDASP_opal_session, opal }, 2195 - { revert_tper, } /* controller will terminate session */ 2038 + { revert_tper, } 2196 2039 }; 2040 + const struct opal_step psid_revert_steps[] = { 2041 + { start_PSID_opal_session, opal }, 2042 + { revert_tper, } 2043 + }; 2044 + 2197 2045 int ret; 2198 2046 2199 2047 mutex_lock(&dev->dev_lock); 2200 2048 setup_opal_dev(dev); 2201 - ret = execute_steps(dev, revert_steps, ARRAY_SIZE(revert_steps)); 2049 + if (psid) 2050 + ret = execute_steps(dev, psid_revert_steps, 2051 + ARRAY_SIZE(psid_revert_steps)); 2052 + else 2053 + ret = execute_steps(dev, revert_steps, 2054 + ARRAY_SIZE(revert_steps)); 2202 2055 mutex_unlock(&dev->dev_lock); 2203 2056 2204 2057 /* ··· 2261 2092 { 2262 2093 int ret; 2263 2094 2264 - if (lk_unlk->session.who < OPAL_ADMIN1 || 2265 - lk_unlk->session.who > OPAL_USER9) 2095 + if (lk_unlk->session.who > OPAL_USER9) 2266 2096 return -EINVAL; 2267 2097 2268 2098 mutex_lock(&dev->dev_lock); ··· 2339 2171 }; 2340 2172 int ret; 2341 2173 2342 - if (opal_pw->session.who < OPAL_ADMIN1 || 2343 - opal_pw->session.who > OPAL_USER9 || 2344 - opal_pw->new_user_pw.who < OPAL_ADMIN1 || 2174 + if (opal_pw->session.who > OPAL_USER9 || 2345 2175 opal_pw->new_user_pw.who > OPAL_USER9) 2346 2176 return -EINVAL; 2347 2177 ··· 2446 2280 ret = opal_activate_user(dev, p); 2447 2281 break; 2448 2282 case IOC_OPAL_REVERT_TPR: 2449 - ret = opal_reverttper(dev, p); 2283 + ret = 
opal_reverttper(dev, p, false); 2450 2284 break; 2451 2285 case IOC_OPAL_LR_SETUP: 2452 2286 ret = opal_setup_locking_range(dev, p); ··· 2457 2291 case IOC_OPAL_ENABLE_DISABLE_MBR: 2458 2292 ret = opal_enable_disable_shadow_mbr(dev, p); 2459 2293 break; 2294 + case IOC_OPAL_MBR_DONE: 2295 + ret = opal_set_mbr_done(dev, p); 2296 + break; 2297 + case IOC_OPAL_WRITE_SHADOW_MBR: 2298 + ret = opal_write_shadow_mbr(dev, p); 2299 + break; 2460 2300 case IOC_OPAL_ERASE_LR: 2461 2301 ret = opal_erase_locking_range(dev, p); 2462 2302 break; 2463 2303 case IOC_OPAL_SECURE_ERASE_LR: 2464 2304 ret = opal_secure_erase_locking_range(dev, p); 2305 + break; 2306 + case IOC_OPAL_PSID_REVERT_TPR: 2307 + ret = opal_reverttper(dev, p, true); 2465 2308 break; 2466 2309 default: 2467 2310 break;
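Three pieces of arithmetic in the sed-opal.c diff above are easy to misread in flattened form: the UID built by `generic_get_table_info()`, the overflow-safe fit test in `write_shadow_mbr()`, and the per-command chunk size. The sketch below restates them in userspace C. All `demo_`/`DEMO_` names and the helper functions are hypothetical. The two UID byte arrays are copied from the `opaluid[]` entries shown in the diff. The model that every command starts with the same number of bytes already used (`pos`) is an assumption of this example, and the early return when the reserve does not fit is this example's own guard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BUFFER_LENGTH 2048	/* mirrors IO_BUFFER_LENGTH in the diff */
#define DEMO_FINALIZE_BYTES 7	/* mirrors CMD_FINALIZE_BYTES_NEEDED */
#define DEMO_UID_LENGTH 8	/* assumed, matches the 8-byte UIDs in the diff */

/* UID bytes copied from the opaluid[] entries shown in the diff. */
static const uint8_t demo_table_table_uid[DEMO_UID_LENGTH] =
	{ 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01 };
static const uint8_t demo_lr_global_uid[DEMO_UID_LENGTH] =
	{ 0x00, 0x00, 0x08, 0x02, 0x00, 0x00, 0x00, 0x01 };

/* First half of the table-table UID, then first half of the target UID. */
static uint8_t *compose_table_info_uid(uint8_t *out, const uint8_t *table_table,
				       const uint8_t *target)
{
	const size_t half = DEMO_UID_LENGTH / 2;

	memcpy(out, table_table, half);
	memcpy(out + half, target, half);
	return out;
}

/* Overflow-safe "does [offset, offset + size) fit in capacity?" test. */
static int range_fits(uint64_t offset, uint64_t size, uint64_t capacity)
{
	return !(size > capacity || offset > capacity - size);
}

/*
 * Largest payload for one command when `pos` bytes are already used:
 * reserve 2 bytes for the bytestring header, 1 byte for the closing
 * token and the finalize bytes, then cap at what is left to send.
 */
static size_t chunk_len(size_t pos, uint64_t total, uint64_t done)
{
	const size_t reserve = 2 + 1 + DEMO_FINALIZE_BYTES;
	size_t room;

	assert(pos <= DEMO_BUFFER_LENGTH && done <= total);
	if (DEMO_BUFFER_LENGTH - pos <= reserve)
		return 0;	/* guard added by this sketch */
	room = DEMO_BUFFER_LENGTH - pos - reserve;
	return total - done < room ? (size_t)(total - done) : room;
}

/* Number of commands needed to send `total` bytes, 0 if none can be sent. */
static unsigned int count_chunks(size_t pos, uint64_t total)
{
	uint64_t done = 0;
	unsigned int n = 0;

	while (done < total) {
		size_t len = chunk_len(pos, total, done);

		if (!len)
			return 0;
		done += len;
		n++;
	}
	return n;
}
```

Writing the fit test as `offset > capacity - size` rather than `offset + size > capacity` keeps it correct when `offset + size` would wrap around, which matters because both values come from userspace.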

+2 -62

drivers/block/drbd/drbd_debugfs.c

··· 465 465 void drbd_debugfs_resource_add(struct drbd_resource *resource) 466 466 { 467 467 struct dentry *dentry; 468 - if (!drbd_debugfs_resources) 469 - return; 470 468 471 469 dentry = debugfs_create_dir(resource->name, drbd_debugfs_resources); 472 - if (IS_ERR_OR_NULL(dentry)) 473 - goto fail; 474 470 resource->debugfs_res = dentry; 475 471 476 472 dentry = debugfs_create_dir("volumes", resource->debugfs_res); 477 - if (IS_ERR_OR_NULL(dentry)) 478 - goto fail; 479 473 resource->debugfs_res_volumes = dentry; 480 474 481 475 dentry = debugfs_create_dir("connections", resource->debugfs_res); 482 - if (IS_ERR_OR_NULL(dentry)) 483 - goto fail; 484 476 resource->debugfs_res_connections = dentry; 485 477 486 478 dentry = debugfs_create_file("in_flight_summary", 0440, 487 479 resource->debugfs_res, resource, 488 480 &in_flight_summary_fops); 489 - if (IS_ERR_OR_NULL(dentry)) 490 - goto fail; 491 481 resource->debugfs_res_in_flight_summary = dentry; 492 - return; 493 - 494 - fail: 495 - drbd_debugfs_resource_cleanup(resource); 496 - drbd_err(resource, "failed to create debugfs dentry\n"); 497 482 } 498 483 499 484 static void drbd_debugfs_remove(struct dentry **dp) ··· 621 636 { 622 637 struct dentry *conns_dir = connection->resource->debugfs_res_connections; 623 638 struct dentry *dentry; 624 - if (!conns_dir) 625 - return; 626 639 627 640 /* Once we enable mutliple peers, 628 641 * these connections will have descriptive names. 629 642 * For now, it is just the one connection to the (only) "peer". 
*/ 630 643 dentry = debugfs_create_dir("peer", conns_dir); 631 - if (IS_ERR_OR_NULL(dentry)) 632 - goto fail; 633 644 connection->debugfs_conn = dentry; 634 645 635 646 dentry = debugfs_create_file("callback_history", 0440, 636 647 connection->debugfs_conn, connection, 637 648 &connection_callback_history_fops); 638 - if (IS_ERR_OR_NULL(dentry)) 639 - goto fail; 640 649 connection->debugfs_conn_callback_history = dentry; 641 650 642 651 dentry = debugfs_create_file("oldest_requests", 0440, 643 652 connection->debugfs_conn, connection, 644 653 &connection_oldest_requests_fops); 645 - if (IS_ERR_OR_NULL(dentry)) 646 - goto fail; 647 654 connection->debugfs_conn_oldest_requests = dentry; 648 - return; 649 - 650 - fail: 651 - drbd_debugfs_connection_cleanup(connection); 652 - drbd_err(connection, "failed to create debugfs dentry\n"); 653 655 } 654 656 655 657 void drbd_debugfs_connection_cleanup(struct drbd_connection *connection) ··· 781 809 782 810 snprintf(vnr_buf, sizeof(vnr_buf), "%u", device->vnr); 783 811 dentry = debugfs_create_dir(vnr_buf, vols_dir); 784 - if (IS_ERR_OR_NULL(dentry)) 785 - goto fail; 786 812 device->debugfs_vol = dentry; 787 813 788 814 snprintf(minor_buf, sizeof(minor_buf), "%u", device->minor); ··· 789 819 if (!slink_name) 790 820 goto fail; 791 821 dentry = debugfs_create_symlink(minor_buf, drbd_debugfs_minors, slink_name); 822 + device->debugfs_minor = dentry; 792 823 kfree(slink_name); 793 824 slink_name = NULL; 794 - if (IS_ERR_OR_NULL(dentry)) 795 - goto fail; 796 - device->debugfs_minor = dentry; 797 825 798 826 #define DCF(name) do { \ 799 827 dentry = debugfs_create_file(#name, 0440, \ 800 828 device->debugfs_vol, device, \ 801 829 &device_ ## name ## _fops); \ 802 - if (IS_ERR_OR_NULL(dentry)) \ 803 - goto fail; \ 804 830 device->debugfs_vol_ ## name = dentry; \ 805 831 } while (0) 806 832 ··· 830 864 struct dentry *dentry; 831 865 char vnr_buf[8]; 832 866 833 - if (!conn_dir) 834 - return; 835 - 836 867 snprintf(vnr_buf, 
sizeof(vnr_buf), "%u", peer_device->device->vnr); 837 868 dentry = debugfs_create_dir(vnr_buf, conn_dir); 838 - if (IS_ERR_OR_NULL(dentry)) 839 - goto fail; 840 869 peer_device->debugfs_peer_dev = dentry; 841 - return; 842 - 843 - fail: 844 - drbd_debugfs_peer_device_cleanup(peer_device); 845 - drbd_err(peer_device, "failed to create debugfs entries\n"); 846 870 } 847 871 848 872 void drbd_debugfs_peer_device_cleanup(struct drbd_peer_device *peer_device) ··· 873 917 drbd_debugfs_remove(&drbd_debugfs_root); 874 918 } 875 919 876 - int __init drbd_debugfs_init(void) 920 + void __init drbd_debugfs_init(void) 877 921 { 878 922 struct dentry *dentry; 879 923 880 924 dentry = debugfs_create_dir("drbd", NULL); 881 - if (IS_ERR_OR_NULL(dentry)) 882 - goto fail; 883 925 drbd_debugfs_root = dentry; 884 926 885 927 dentry = debugfs_create_file("version", 0444, drbd_debugfs_root, NULL, &drbd_version_fops); 886 - if (IS_ERR_OR_NULL(dentry)) 887 - goto fail; 888 928 drbd_debugfs_version = dentry; 889 929 890 930 dentry = debugfs_create_dir("resources", drbd_debugfs_root); 891 - if (IS_ERR_OR_NULL(dentry)) 892 - goto fail; 893 931 drbd_debugfs_resources = dentry; 894 932 895 933 dentry = debugfs_create_dir("minors", drbd_debugfs_root); 896 - if (IS_ERR_OR_NULL(dentry)) 897 - goto fail; 898 934 drbd_debugfs_minors = dentry; 899 - return 0; 900 - 901 - fail: 902 - drbd_debugfs_cleanup(); 903 - if (dentry) 904 - return PTR_ERR(dentry); 905 - else 906 - return -EINVAL; 907 935 }

+2 -2

drivers/block/drbd/drbd_debugfs.h

···
 6  6   #include "drbd_int.h"
 7  7
 8  8   #ifdef CONFIG_DEBUG_FS
 9    - int __init drbd_debugfs_init(void);
    9 + void __init drbd_debugfs_init(void);
10 10   void drbd_debugfs_cleanup(void);
11 11
12 12   void drbd_debugfs_resource_add(struct drbd_resource *resource);
···
22 22   void drbd_debugfs_peer_device_cleanup(struct drbd_peer_device *peer_device);
23 23   #else
24 24
25    - static inline int __init drbd_debugfs_init(void) { return -ENODEV; }
   25 + static inline void __init drbd_debugfs_init(void) { }
26 26   static inline void drbd_debugfs_cleanup(void) { }
27 27
28 28   static inline void drbd_debugfs_resource_add(struct drbd_resource *resource) { }

+1 -2

drivers/block/drbd/drbd_main.c

···
3009 3009   	spin_lock_init(&retry.lock);
3010 3010   	INIT_LIST_HEAD(&retry.writes);
3011 3011
3012      - 	if (drbd_debugfs_init())
3013      - 		pr_notice("failed to initialize debugfs -- will not be available\n");
     3012 + 	drbd_debugfs_init();
3014 3013
3015 3014   	pr_info("initialized. "
3016 3015   	       "Version: " REL_VERSION " (api:%d/proto:%d-%d)\n",

+1 -1

drivers/block/floppy.c

···
3900 3900   	if (!UDP->cmos)
3901 3901   		UDP->cmos = FLOPPY0_TYPE;
3902 3902   	drive = 1;
3903      - 	if (!UDP->cmos && FLOPPY1_TYPE)
     3903 + 	if (!UDP->cmos)
3904 3904   		UDP->cmos = FLOPPY1_TYPE;
3905 3905
3906 3906   	/* FIXME: additional physical CMOS drive detection should go here */

+4 -12

drivers/block/loop.c

··· 264 264 return ret; 265 265 } 266 266 267 - static inline void loop_iov_iter_bvec(struct iov_iter *i, 268 - unsigned int direction, const struct bio_vec *bvec, 269 - unsigned long nr_segs, size_t count) 270 - { 271 - iov_iter_bvec(i, direction, bvec, nr_segs, count); 272 - i->type |= ITER_BVEC_FLAG_NO_REF; 273 - } 274 - 275 267 static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos) 276 268 { 277 269 struct iov_iter i; 278 270 ssize_t bw; 279 271 280 - loop_iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len); 272 + iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len); 281 273 282 274 file_start_write(file); 283 275 bw = vfs_iter_write(file, &i, ppos, 0); ··· 347 355 ssize_t len; 348 356 349 357 rq_for_each_segment(bvec, rq, iter) { 350 - loop_iov_iter_bvec(&i, READ, &bvec, 1, bvec.bv_len); 358 + iov_iter_bvec(&i, READ, &bvec, 1, bvec.bv_len); 351 359 len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0); 352 360 if (len < 0) 353 361 return len; ··· 388 396 b.bv_offset = 0; 389 397 b.bv_len = bvec.bv_len; 390 398 391 - loop_iov_iter_bvec(&i, READ, &b, 1, b.bv_len); 399 + iov_iter_bvec(&i, READ, &b, 1, b.bv_len); 392 400 len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0); 393 401 if (len < 0) { 394 402 ret = len; ··· 555 563 } 556 564 atomic_set(&cmd->ref, 2); 557 565 558 - loop_iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq)); 566 + iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq)); 559 567 iter.iov_offset = offset; 560 568 561 569 cmd->iocb.ki_pos = pos;

-5

drivers/block/mtip32xx/mtip32xx.c

··· 1577 1577 ATA_SECT_SIZE * xfer_sz); 1578 1578 return -ENOMEM; 1579 1579 } 1580 - memset(buf, 0, ATA_SECT_SIZE * xfer_sz); 1581 1580 } 1582 1581 1583 1582 /* Build the FIS. */ ··· 2775 2776 &port->block1_dma, GFP_KERNEL); 2776 2777 if (!port->block1) 2777 2778 return -ENOMEM; 2778 - memset(port->block1, 0, BLOCK_DMA_ALLOC_SZ); 2779 2779 2780 2780 /* Allocate dma memory for command list */ 2781 2781 port->command_list = ··· 2787 2789 port->block1_dma = 0; 2788 2790 return -ENOMEM; 2789 2791 } 2790 - memset(port->command_list, 0, AHCI_CMD_TBL_SZ); 2791 2792 2792 2793 /* Setup all pointers into first DMA region */ 2793 2794 port->rxfis = port->block1 + AHCI_RX_FIS_OFFSET; ··· 3525 3528 &cmd->command_dma, GFP_KERNEL); 3526 3529 if (!cmd->command) 3527 3530 return -ENOMEM; 3528 - 3529 - memset(cmd->command, 0, CMD_DMA_ALLOC_SZ); 3530 3531 3531 3532 sg_init_table(cmd->sg, MTIP_MAX_SG); 3532 3533 return 0;

+7 -7

drivers/block/null_blk_main.c

··· 327 327 set_bit(NULLB_DEV_FL_CONFIGURED, &dev->flags); 328 328 dev->power = newp; 329 329 } else if (dev->power && !newp) { 330 - mutex_lock(&lock); 331 - dev->power = newp; 332 - null_del_dev(dev->nullb); 333 - mutex_unlock(&lock); 334 - clear_bit(NULLB_DEV_FL_UP, &dev->flags); 330 + if (test_and_clear_bit(NULLB_DEV_FL_UP, &dev->flags)) { 331 + mutex_lock(&lock); 332 + dev->power = newp; 333 + null_del_dev(dev->nullb); 334 + mutex_unlock(&lock); 335 + } 335 336 clear_bit(NULLB_DEV_FL_CONFIGURED, &dev->flags); 336 337 } 337 338 ··· 1198 1197 if (!cmd->error && dev->zoned) { 1199 1198 sector_t sector; 1200 1199 unsigned int nr_sectors; 1201 - int op; 1200 + enum req_opf op; 1202 1201 1203 1202 if (dev->queue_mode == NULL_Q_BIO) { 1204 1203 op = bio_op(cmd->bio); ··· 1489 1488 if (!nullb->queues) 1490 1489 return -ENOMEM; 1491 1490 1492 - nullb->nr_queues = 0; 1493 1491 nullb->queue_depth = nullb->dev->hw_queue_depth; 1494 1492 1495 1493 return 0;
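The first null_blk hunk above moves the device teardown under `test_and_clear_bit(NULLB_DEV_FL_UP, ...)`, so the deletion path runs only for the caller that actually observed the UP bit set. A minimal userspace sketch of that "clear the flag and act only if it was set" pattern follows, using C11 atomics in place of the kernel bitops. `struct demo_dev`, `DEMO_FL_UP` and all `demo_` functions are hypothetical; the teardown is reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_FL_UP 0	/* hypothetical bit index, stands in for NULLB_DEV_FL_UP */

struct demo_dev {
	atomic_ulong flags;
	int teardown_calls;
};

/* Atomically clear bit `nr` and report whether it was set before. */
static bool demo_test_and_clear_bit(unsigned int nr, atomic_ulong *flags)
{
	unsigned long mask = 1UL << nr;

	assert(nr < sizeof(unsigned long) * 8);
	return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}

/* Tear down only if this caller is the one that cleared the UP bit. */
static void demo_power_off(struct demo_dev *dev)
{
	if (demo_test_and_clear_bit(DEMO_FL_UP, &dev->flags))
		dev->teardown_calls++;
}

/* Power off twice in a row and return how many teardowns actually ran. */
static int demo_double_power_off(void)
{
	struct demo_dev dev;

	atomic_init(&dev.flags, 1UL << DEMO_FL_UP);
	dev.teardown_calls = 0;
	demo_power_off(&dev);
	demo_power_off(&dev);
	return dev.teardown_calls;
}
```

The second `demo_power_off()` call finds the bit already clear and skips the teardown, which is the behaviour the guarded block in the diff relies on.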

-1

drivers/block/skd_main.c

···
2694 2694   		     (FIT_QCMD_ALIGN - 1),
2695 2695   		     "not aligned: msg_buf %p mb_dma_address %pad\n",
2696 2696   		     skmsg->msg_buf, &skmsg->mb_dma_address);
2697      - 		memset(skmsg->msg_buf, 0, SKD_N_FITMSG_BYTES);
2698 2697   	}
2699 2698
2700 2699   err_out:

+1 -1

drivers/lightnvm/core.c

···
478 478    */
479 479   static int nvm_remove_tgt(struct nvm_ioctl_remove *remove)
480 480   {
481     - 	struct nvm_target *t;
    481 + 	struct nvm_target *t = NULL;
482 482   	struct nvm_dev *dev;
483 483
484 484   	down_read(&nvm_lock);

+9 -7

drivers/lightnvm/pblk-core.c

···
323 323   void pblk_bio_free_pages(struct pblk *pblk, struct bio *bio, int off,
324 324   			 int nr_pages)
325 325   {
326     - 	struct bio_vec bv;
327     - 	int i;
    326 + 	struct bio_vec *bv;
    327 + 	struct page *page;
    328 + 	int i, e, nbv = 0;
328 329
329     - 	WARN_ON(off + nr_pages != bio->bi_vcnt);
330     -
331     - 	for (i = off; i < nr_pages + off; i++) {
332     - 		bv = bio->bi_io_vec[i];
333     - 		mempool_free(bv.bv_page, &pblk->page_bio_pool);
    330 + 	for (i = 0; i < bio->bi_vcnt; i++) {
    331 + 		bv = &bio->bi_io_vec[i];
    332 + 		page = bv->bv_page;
    333 + 		for (e = 0; e < bv->bv_len; e += PBLK_EXPOSED_PAGE_SIZE, nbv++)
    334 + 			if (nbv >= off)
    335 + 				mempool_free(page++, &pblk->page_bio_pool);
334 336   	}
335 337   }
336 338
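The rewritten `pblk_bio_free_pages()` above no longer assumes one page per bio vector entry: it walks each entry in page-sized steps and skips the first `off` pages overall. The following sketch models only that counting logic in userspace. `struct demo_bvec`, `pages_freed()` and `DEMO_PAGE_SIZE` are hypothetical, the 4096-byte page size is an assumption standing in for `PBLK_EXPOSED_PAGE_SIZE`, and freeing a page is replaced by incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed value, stands in for PBLK_EXPOSED_PAGE_SIZE */

struct demo_bvec {
	unsigned int len;	/* bytes covered by this vector entry */
};

/*
 * Count how many pages the new loop would free: every page-sized step
 * of every entry is numbered in order, and steps before `off` are skipped.
 */
static int pages_freed(const struct demo_bvec *vec, int cnt, int off)
{
	unsigned int e;
	int i, nbv = 0, freed = 0;

	assert(vec != NULL && cnt >= 0 && off >= 0);
	for (i = 0; i < cnt; i++)
		for (e = 0; e < vec[i].len; e += DEMO_PAGE_SIZE, nbv++)
			if (nbv >= off)
				freed++;
	return freed;
}
```

For entries of 4096, 8192 and 4096 bytes the loop numbers four pages; with `off` set to 1 it frees the last three, regardless of how the pages are grouped into entries.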

+9

drivers/md/bcache/alloc.c

··· 393 393 struct bucket *b; 394 394 long r; 395 395 396 + 397 + /* No allocation if CACHE_SET_IO_DISABLE bit is set */ 398 + if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &ca->set->flags))) 399 + return -1; 400 + 396 401 /* fastpath */ 397 402 if (fifo_pop(&ca->free[RESERVE_NONE], r) || 398 403 fifo_pop(&ca->free[reserve], r)) ··· 488 483 struct bkey *k, int n, bool wait) 489 484 { 490 485 int i; 486 + 487 + /* No allocation if CACHE_SET_IO_DISABLE bit is set */ 488 + if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags))) 489 + return -1; 491 490 492 491 lockdep_assert_held(&c->bucket_lock); 493 492 BUG_ON(!n || n > c->caches_loaded || n > MAX_CACHES_PER_SET);

+2 -4

drivers/md/bcache/bcache.h

··· 705 705 atomic_long_t writeback_keys_failed; 706 706 707 707 atomic_long_t reclaim; 708 + atomic_long_t reclaimed_journal_buckets; 708 709 atomic_long_t flush_write; 709 - atomic_long_t retry_flush_write; 710 710 711 711 enum { 712 712 ON_ERROR_UNREGISTER, ··· 726 726 727 727 #define BUCKET_HASH_BITS 12 728 728 struct hlist_head bucket_hash[1 << BUCKET_HASH_BITS]; 729 - 730 - DECLARE_HEAP(struct btree *, flush_btree); 731 729 }; 732 730 733 731 struct bbio { ··· 1004 1006 int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, 1005 1007 uint8_t *set_uuid); 1006 1008 void bch_cached_dev_detach(struct cached_dev *dc); 1007 - void bch_cached_dev_run(struct cached_dev *dc); 1009 + int bch_cached_dev_run(struct cached_dev *dc); 1008 1010 void bcache_device_stop(struct bcache_device *d); 1009 1011 1010 1012 void bch_cache_set_unregister(struct cache_set *c);

+19 -42

drivers/md/bcache/bset.c

··· 347 347 void bch_btree_keys_init(struct btree_keys *b, const struct btree_keys_ops *ops, 348 348 bool *expensive_debug_checks) 349 349 { 350 - unsigned int i; 351 - 352 350 b->ops = ops; 353 351 b->expensive_debug_checks = expensive_debug_checks; 354 352 b->nsets = 0; 355 353 b->last_set_unwritten = 0; 356 354 357 - /* XXX: shouldn't be needed */ 358 - for (i = 0; i < MAX_BSETS; i++) 359 - b->set[i].size = 0; 360 355 /* 361 - * Second loop starts at 1 because b->keys[0]->data is the memory we 362 - * allocated 356 + * struct btree_keys in embedded in struct btree, and struct 357 + * bset_tree is embedded into struct btree_keys. They are all 358 + * initialized as 0 by kzalloc() in mca_bucket_alloc(), and 359 + * b->set[0].data is allocated in bch_btree_keys_alloc(), so we 360 + * don't have to initiate b->set[].size and b->set[].data here 361 + * any more. 363 362 */ 364 - for (i = 1; i < MAX_BSETS; i++) 365 - b->set[i].data = NULL; 366 363 } 367 364 EXPORT_SYMBOL(bch_btree_keys_init); 368 365 ··· 967 970 unsigned int inorder, j, n = 1; 968 971 969 972 do { 970 - /* 971 - * A bit trick here. 972 - * If p < t->size, (int)(p - t->size) is a minus value and 973 - * the most significant bit is set, right shifting 31 bits 974 - * gets 1. If p >= t->size, the most significant bit is 975 - * not set, right shifting 31 bits gets 0. 976 - * So the following 2 lines equals to 977 - * if (p >= t->size) 978 - * p = 0; 979 - * but a branch instruction is avoided. 980 - */ 981 973 unsigned int p = n << 4; 982 974 983 - p &= ((int) (p - t->size)) >> 31; 984 - 985 - prefetch(&t->tree[p]); 975 + if (p < t->size) 976 + prefetch(&t->tree[p]); 986 977 987 978 j = n; 988 979 f = &t->tree[j]; 989 980 990 - /* 991 - * Similar bit trick, use subtract operation to avoid a branch 992 - * instruction. 993 - * 994 - * n = (f->mantissa > bfloat_mantissa()) 995 - * ? 
j * 2 996 - * : j * 2 + 1; 997 - * 998 - * We need to subtract 1 from f->mantissa for the sign bit trick 999 - * to work - that's done in make_bfloat() 1000 - */ 1001 - if (likely(f->exponent != 127)) 1002 - n = j * 2 + (((unsigned int) 1003 - (f->mantissa - 1004 - bfloat_mantissa(search, f))) >> 31); 1005 - else 1006 - n = (bkey_cmp(tree_to_bkey(t, j), search) > 0) 1007 - ? j * 2 1008 - : j * 2 + 1; 981 + if (likely(f->exponent != 127)) { 982 + if (f->mantissa >= bfloat_mantissa(search, f)) 983 + n = j * 2; 984 + else 985 + n = j * 2 + 1; 986 + } else { 987 + if (bkey_cmp(tree_to_bkey(t, j), search) > 0) 988 + n = j * 2; 989 + else 990 + n = j * 2 + 1; 991 + } 1009 992 } while (n < t->size); 1010 993 1011 994 inorder = to_inorder(j, t);
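The bset.c hunk above replaces a branchless sign-bit selection of the next tree node with a plain comparison. The sketch below puts the two forms side by side so their equivalence can be checked. `next_child_bit_trick()` and `next_child_branch()` are hypothetical names, the mantissas are passed as plain `uint32_t` values rather than bit-fields, and the equivalence is only claimed under the stated assumption that both values are below 2^31.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Removed form: the top bit of the 32-bit difference is 1 exactly when
 * node_m < search_m, provided both values are below 2^31.
 */
static unsigned int next_child_bit_trick(unsigned int j, uint32_t node_m,
					 uint32_t search_m)
{
	assert(node_m < 0x80000000u && search_m < 0x80000000u);
	return j * 2 + ((uint32_t)(node_m - search_m) >> 31);
}

/* Added form: the same decision written as an explicit comparison. */
static unsigned int next_child_branch(unsigned int j, uint32_t node_m,
				      uint32_t search_m)
{
	if (node_m >= search_m)
		return j * 2;
	return j * 2 + 1;
}
```

Both functions pick child `j * 2` when the node's mantissa is greater than or equal to the search mantissa and `j * 2 + 1` otherwise; the second form simply leaves the choice of branch or conditional move to the compiler.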

+47 -6

drivers/md/bcache/btree.c

··· 35 35 #include <linux/rcupdate.h> 36 36 #include <linux/sched/clock.h> 37 37 #include <linux/rculist.h> 38 - 38 + #include <linux/delay.h> 39 39 #include <trace/events/bcache.h> 40 40 41 41 /* ··· 613 613 static struct btree *mca_bucket_alloc(struct cache_set *c, 614 614 struct bkey *k, gfp_t gfp) 615 615 { 616 + /* 617 + * kzalloc() is necessary here for initialization, 618 + * see code comments in bch_btree_keys_init(). 619 + */ 616 620 struct btree *b = kzalloc(sizeof(struct btree), gfp); 617 621 618 622 if (!b) ··· 659 655 up(&b->io_mutex); 660 656 } 661 657 658 + retry: 659 + /* 660 + * BTREE_NODE_dirty might be cleared in btree_flush_btree() by 661 + * __bch_btree_node_write(). To avoid an extra flush, acquire 662 + * b->write_lock before checking BTREE_NODE_dirty bit. 663 + */ 662 664 mutex_lock(&b->write_lock); 665 + /* 666 + * If this btree node is selected in btree_flush_write() by journal 667 + * code, delay and retry until the node is flushed by journal code 668 + * and BTREE_NODE_journal_flush bit cleared by btree_flush_write(). 669 + */ 670 + if (btree_node_journal_flush(b)) { 671 + pr_debug("bnode %p is flushing by journal, retry", b); 672 + mutex_unlock(&b->write_lock); 673 + udelay(1); 674 + goto retry; 675 + } 676 + 663 677 if (btree_node_dirty(b)) 664 678 __bch_btree_node_write(b, &cl); 665 679 mutex_unlock(&b->write_lock); ··· 800 778 while (!list_empty(&c->btree_cache)) { 801 779 b = list_first_entry(&c->btree_cache, struct btree, list); 802 780 803 - if (btree_node_dirty(b)) 781 + /* 782 + * This function is called by cache_set_free(), no I/O 783 + * request on cache now, it is unnecessary to acquire 784 + * b->write_lock before clearing BTREE_NODE_dirty anymore. 
785 + */ 786 + if (btree_node_dirty(b)) { 804 787 btree_complete_write(b, btree_current_write(b)); 805 - clear_bit(BTREE_NODE_dirty, &b->flags); 806 - 788 + clear_bit(BTREE_NODE_dirty, &b->flags); 789 + } 807 790 mca_data_free(b); 808 791 } 809 792 ··· 1094 1067 1095 1068 BUG_ON(b == b->c->root); 1096 1069 1070 + retry: 1097 1071 mutex_lock(&b->write_lock); 1072 + /* 1073 + * If the btree node is selected and flushing in btree_flush_write(), 1074 + * delay and retry until the BTREE_NODE_journal_flush bit cleared, 1075 + * then it is safe to free the btree node here. Otherwise this btree 1076 + * node will be in race condition. 1077 + */ 1078 + if (btree_node_journal_flush(b)) { 1079 + mutex_unlock(&b->write_lock); 1080 + pr_debug("bnode %p journal_flush set, retry", b); 1081 + udelay(1); 1082 + goto retry; 1083 + } 1098 1084 1099 - if (btree_node_dirty(b)) 1085 + if (btree_node_dirty(b)) { 1100 1086 btree_complete_write(b, btree_current_write(b)); 1101 - clear_bit(BTREE_NODE_dirty, &b->flags); 1087 + clear_bit(BTREE_NODE_dirty, &b->flags); 1088 + } 1102 1089 1103 1090 mutex_unlock(&b->write_lock); 1104 1091
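Both new `retry:` blocks in btree.c follow the same shape: take the node lock, check the journal_flush bit, and if it is set, drop the lock, wait briefly and try again. The following is a user-space sketch of that shape only, under loud assumptions: `struct node`, the `atomic_flag` spinlock standing in for `b->write_lock`, and the `backoff` callback standing in for `udelay(1)` are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct node {
	atomic_flag lock;		/* stands in for b->write_lock */
	atomic_bool journal_flush;	/* stands in for BTREE_NODE_journal_flush */
	bool dirty;
	unsigned int writes;		/* counts simulated node writes */
};

static void node_lock(struct node *b)
{
	while (atomic_flag_test_and_set(&b->lock))
		;			/* spin until the lock is free */
}

static void node_unlock(struct node *b)
{
	atomic_flag_clear(&b->lock);
}

/* Test stand-in for the delay: pretend the journal code just finished. */
static void simulate_journal_finish(struct node *b)
{
	atomic_store(&b->journal_flush, false);
}

/*
 * Write the node if it is dirty, but never while the journal code has
 * claimed it. Returns how many times the lock had to be dropped and retaken.
 */
static unsigned int node_write_sync(struct node *b,
				    void (*backoff)(struct node *))
{
	unsigned int retries = 0;

	for (;;) {
		node_lock(b);
		if (!atomic_load(&b->journal_flush))
			break;		/* keep the lock and proceed */
		node_unlock(b);
		backoff(b);		/* the patch uses udelay(1) here */
		retries++;
	}

	if (b->dirty) {
		b->writes++;
		b->dirty = false;
	}
	node_unlock(b);
	return retries;
}
```

Dropping the lock before waiting matters: the journal side needs the same lock to finish its write and clear the bit, so waiting while holding it would never make progress.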

+2

drivers/md/bcache/btree.h

···
 	BTREE_NODE_io_error,
 	BTREE_NODE_dirty,
 	BTREE_NODE_write_idx,
+	BTREE_NODE_journal_flush,
 };
 
 BTREE_FLAG(io_error);
 BTREE_FLAG(dirty);
 BTREE_FLAG(write_idx);
+BTREE_FLAG(journal_flush);
 
 static inline struct btree_write *btree_current_write(struct btree *b)
 {
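Adding one enum entry plus one `BTREE_FLAG()` line is enough to get the `btree_node_journal_flush()` and `set_btree_node_journal_flush()` helpers used elsewhere in this series. A rough user-space sketch of such an accessor-generating macro follows; the names `NODE_FLAG`, `struct node` and the non-atomic bit operations are assumptions for illustration, whereas the kernel macro works on `b->flags` with atomic bit helpers.

```c
#include <stdbool.h>

enum node_flags {
	NODE_io_error,
	NODE_dirty,
	NODE_write_idx,
	NODE_journal_flush,
};

struct node {
	unsigned long flags;
};

/* Generate a test accessor and a setter for one flag bit. */
#define NODE_FLAG(flag)						\
static inline bool node_ ## flag(const struct node *b)		\
{								\
	return (b->flags & (1UL << NODE_ ## flag)) != 0;	\
}								\
static inline void set_node_ ## flag(struct node *b)		\
{								\
	b->flags |= 1UL << NODE_ ## flag;			\
}

NODE_FLAG(dirty)
NODE_FLAG(journal_flush)
```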

+12

drivers/md/bcache/io.c

···
 
 	WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
 
+	/*
+	 * Read-ahead requests on a degrading and recovering md raid
+	 * (e.g. raid6) device might be failured immediately by md
+	 * raid code, which is not a real hardware media failure. So
+	 * we shouldn't count failed REQ_RAHEAD bio to dc->io_errors.
+	 */
+	if (bio->bi_opf & REQ_RAHEAD) {
+		pr_warn_ratelimited("%s: Read-ahead I/O failed on backing device, ignore",
+				    dc->backing_dev_name);
+		return;
+	}
+
 	errors = atomic_add_return(1, &dc->io_errors);
 	if (errors < dc->error_limit)
 		pr_err("%s: IO error on backing device, unrecoverable",
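The effect of the io.c change is that a failed read-ahead request no longer moves the backing device toward its error limit. A small sketch of that accounting rule follows, assuming a hypothetical flag bit `REQ_RAHEAD_SKETCH` and a plain counter in place of the kernel's `atomic_t`.

```c
#include <stdbool.h>

#define REQ_RAHEAD_SKETCH (1u << 0)	/* hypothetical stand-in for REQ_RAHEAD */

struct backing_dev {
	unsigned int io_errors;
	unsigned int error_limit;
};

/*
 * Account one failed I/O. Read-ahead failures are ignored entirely.
 * Returns true once the error limit has been reached, i.e. when the
 * caller should stop the device.
 */
static bool count_backing_io_error(struct backing_dev *dc, unsigned int opf)
{
	if (opf & REQ_RAHEAD_SKETCH)
		return false;

	dc->io_errors++;
	return dc->io_errors >= dc->error_limit;
}
```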

+103 -44

drivers/md/bcache/journal.c

··· 100 100 101 101 blocks = set_blocks(j, block_bytes(ca->set)); 102 102 103 + /* 104 + * Nodes in 'list' are in linear increasing order of 105 + * i->j.seq, the node on head has the smallest (oldest) 106 + * journal seq, the node on tail has the biggest 107 + * (latest) journal seq. 108 + */ 109 + 110 + /* 111 + * Check from the oldest jset for last_seq. If 112 + * i->j.seq < j->last_seq, it means the oldest jset 113 + * in list is expired and useless, remove it from 114 + * this list. Otherwise, j is a condidate jset for 115 + * further following checks. 116 + */ 103 117 while (!list_empty(list)) { 104 118 i = list_first_entry(list, 105 119 struct journal_replay, list); ··· 123 109 kfree(i); 124 110 } 125 111 112 + /* iterate list in reverse order (from latest jset) */ 126 113 list_for_each_entry_reverse(i, list, list) { 127 114 if (j->seq == i->j.seq) 128 115 goto next_set; 129 116 117 + /* 118 + * if j->seq is less than any i->j.last_seq 119 + * in list, j is an expired and useless jset. 120 + */ 130 121 if (j->seq < i->j.last_seq) 131 122 goto next_set; 132 123 124 + /* 125 + * 'where' points to first jset in list which 126 + * is elder then j. 
127 + */ 133 128 if (j->seq > i->j.seq) { 134 129 where = &i->list; 135 130 goto add; ··· 152 129 if (!i) 153 130 return -ENOMEM; 154 131 memcpy(&i->j, j, bytes); 132 + /* Add to the location after 'where' points to */ 155 133 list_add(&i->list, where); 156 134 ret = 1; 157 135 158 - ja->seq[bucket_index] = j->seq; 136 + if (j->seq > ja->seq[bucket_index]) 137 + ja->seq[bucket_index] = j->seq; 159 138 next_set: 160 139 offset += blocks * ca->sb.block_size; 161 140 len -= blocks * ca->sb.block_size; ··· 293 268 struct journal_replay, 294 269 list)->j.seq; 295 270 296 - return ret; 271 + return 0; 297 272 #undef read_bucket 298 273 } 299 274 ··· 416 391 } 417 392 418 393 /* Journalling */ 419 - #define journal_max_cmp(l, r) \ 420 - (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \ 421 - fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal)) 422 - #define journal_min_cmp(l, r) \ 423 - (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \ 424 - fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal)) 425 394 426 395 static void btree_flush_write(struct cache_set *c) 427 396 { 428 - /* 429 - * Try to find the btree node with that references the oldest journal 430 - * entry, best is our current candidate and is locked if non NULL: 431 - */ 432 - struct btree *b; 433 - int i; 397 + struct btree *b, *t, *btree_nodes[BTREE_FLUSH_NR]; 398 + unsigned int i, n; 399 + 400 + if (c->journal.btree_flushing) 401 + return; 402 + 403 + spin_lock(&c->journal.flush_write_lock); 404 + if (c->journal.btree_flushing) { 405 + spin_unlock(&c->journal.flush_write_lock); 406 + return; 407 + } 408 + c->journal.btree_flushing = true; 409 + spin_unlock(&c->journal.flush_write_lock); 434 410 435 411 atomic_long_inc(&c->flush_write); 412 + memset(btree_nodes, 0, sizeof(btree_nodes)); 413 + n = 0; 436 414 437 - retry: 438 - spin_lock(&c->journal.lock); 439 - if (heap_empty(&c->flush_btree)) { 440 - for_each_cached_btree(b, c, i) 441 - if 
(btree_current_write(b)->journal) { 442 - if (!heap_full(&c->flush_btree)) 443 - heap_add(&c->flush_btree, b, 444 - journal_max_cmp); 445 - else if (journal_max_cmp(b, 446 - heap_peek(&c->flush_btree))) { 447 - c->flush_btree.data[0] = b; 448 - heap_sift(&c->flush_btree, 0, 449 - journal_max_cmp); 450 - } 451 - } 415 + mutex_lock(&c->bucket_lock); 416 + list_for_each_entry_safe_reverse(b, t, &c->btree_cache, list) { 417 + if (btree_node_journal_flush(b)) 418 + pr_err("BUG: flush_write bit should not be set here!"); 452 419 453 - for (i = c->flush_btree.used / 2 - 1; i >= 0; --i) 454 - heap_sift(&c->flush_btree, i, journal_min_cmp); 455 - } 456 - 457 - b = NULL; 458 - heap_pop(&c->flush_btree, b, journal_min_cmp); 459 - spin_unlock(&c->journal.lock); 460 - 461 - if (b) { 462 420 mutex_lock(&b->write_lock); 421 + 422 + if (!btree_node_dirty(b)) { 423 + mutex_unlock(&b->write_lock); 424 + continue; 425 + } 426 + 463 427 if (!btree_current_write(b)->journal) { 464 428 mutex_unlock(&b->write_lock); 465 - /* We raced */ 466 - atomic_long_inc(&c->retry_flush_write); 467 - goto retry; 429 + continue; 430 + } 431 + 432 + set_btree_node_journal_flush(b); 433 + 434 + mutex_unlock(&b->write_lock); 435 + 436 + btree_nodes[n++] = b; 437 + if (n == BTREE_FLUSH_NR) 438 + break; 439 + } 440 + mutex_unlock(&c->bucket_lock); 441 + 442 + for (i = 0; i < n; i++) { 443 + b = btree_nodes[i]; 444 + if (!b) { 445 + pr_err("BUG: btree_nodes[%d] is NULL", i); 446 + continue; 447 + } 448 + 449 + /* safe to check without holding b->write_lock */ 450 + if (!btree_node_journal_flush(b)) { 451 + pr_err("BUG: bnode %p: journal_flush bit cleaned", b); 452 + continue; 453 + } 454 + 455 + mutex_lock(&b->write_lock); 456 + if (!btree_current_write(b)->journal) { 457 + clear_bit(BTREE_NODE_journal_flush, &b->flags); 458 + mutex_unlock(&b->write_lock); 459 + pr_debug("bnode %p: written by others", b); 460 + continue; 461 + } 462 + 463 + if (!btree_node_dirty(b)) { 464 + 
clear_bit(BTREE_NODE_journal_flush, &b->flags); 465 + mutex_unlock(&b->write_lock); 466 + pr_debug("bnode %p: dirty bit cleaned by others", b); 467 + continue; 468 468 } 469 469 470 470 __bch_btree_node_write(b, NULL); 471 + clear_bit(BTREE_NODE_journal_flush, &b->flags); 471 472 mutex_unlock(&b->write_lock); 472 473 } 474 + 475 + spin_lock(&c->journal.flush_write_lock); 476 + c->journal.btree_flushing = false; 477 + spin_unlock(&c->journal.flush_write_lock); 473 478 } 474 479 475 480 #define last_seq(j) ((j)->seq - fifo_used(&(j)->pin) + 1) ··· 614 559 k->ptr[n++] = MAKE_PTR(0, 615 560 bucket_to_sector(c, ca->sb.d[ja->cur_idx]), 616 561 ca->sb.nr_this_dev); 562 + atomic_long_inc(&c->reclaimed_journal_buckets); 617 563 } 618 564 619 565 if (n) { ··· 867 811 struct journal_write *w; 868 812 atomic_t *ret; 869 813 814 + /* No journaling if CACHE_SET_IO_DISABLE set already */ 815 + if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags))) 816 + return NULL; 817 + 870 818 if (!CACHE_SYNC(&c->sb)) 871 819 return NULL; 872 820 ··· 915 855 free_pages((unsigned long) c->journal.w[1].data, JSET_BITS); 916 856 free_pages((unsigned long) c->journal.w[0].data, JSET_BITS); 917 857 free_fifo(&c->journal.pin); 918 - free_heap(&c->flush_btree); 919 858 } 920 859 921 860 int bch_journal_alloc(struct cache_set *c) ··· 922 863 struct journal *j = &c->journal; 923 864 924 865 spin_lock_init(&j->lock); 866 + spin_lock_init(&j->flush_write_lock); 925 867 INIT_DELAYED_WORK(&j->work, journal_write_work); 926 868 927 869 c->journal_delay_ms = 100; ··· 930 870 j->w[0].c = c; 931 871 j->w[1].c = c; 932 872 933 - if (!(init_heap(&c->flush_btree, 128, GFP_KERNEL)) || 934 - !(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) || 873 + if (!(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) || 935 874 !(j->w[0].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)) || 936 875 !(j->w[1].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS))) 937 876 return -ENOMEM;
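The rewritten btree_flush_write() drops the heap and instead walks the btree cache list from its tail, claiming at most BTREE_FLUSH_NR dirty nodes that still pin a journal entry, then writes them in a second pass. The selection pass can be sketched roughly as below; the array in place of the kernel list, the `has_journal_pin` field and the function name are assumptions, and all locking is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define FLUSH_NR 8	/* mirrors BTREE_FLUSH_NR from journal.h */

struct node {
	bool dirty;
	bool has_journal_pin;	/* stands in for btree_current_write(b)->journal */
	bool journal_flush;
};

/*
 * Scan the cache from the tail towards the head and claim up to FLUSH_NR
 * nodes that are dirty and still reference a journal entry.
 * Returns the number of nodes stored in picked[].
 */
static size_t pick_nodes_to_flush(struct node *cache, size_t nr,
				  struct node *picked[FLUSH_NR])
{
	size_t i = nr, n = 0;

	while (i > 0 && n < FLUSH_NR) {
		struct node *b = &cache[--i];

		if (!b->dirty || !b->has_journal_pin)
			continue;

		b->journal_flush = true;	/* claim it for the journal code */
		picked[n++] = b;
	}

	return n;
}
```

Marking the node before releasing it is what lets the retry loops in btree.c notice the claim and back off until the journal code has written the node and cleared the bit.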

+4

drivers/md/bcache/journal.h

···
 /* Embedded in struct cache_set */
 struct journal {
 	spinlock_t lock;
+	spinlock_t flush_write_lock;
+	bool btree_flushing;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist wait;
 	struct closure io;
···
 	struct bio bio;
 	struct bio_vec bv[8];
 };
+
+#define BTREE_FLUSH_NR 8
 
 #define journal_pin_cmp(c, l, r) \
 	(fifo_idx(&(c)->journal.pin, (l)) > fifo_idx(&(c)->journal.pin, (r)))

+183 -44

drivers/md/bcache/super.c

··· 40 40 41 41 static struct kobject *bcache_kobj; 42 42 struct mutex bch_register_lock; 43 + bool bcache_is_reboot; 43 44 LIST_HEAD(bch_cache_sets); 44 45 static LIST_HEAD(uncached_devices); 45 46 ··· 49 48 static wait_queue_head_t unregister_wait; 50 49 struct workqueue_struct *bcache_wq; 51 50 struct workqueue_struct *bch_journal_wq; 51 + 52 52 53 53 #define BTREE_MAX_PAGES (256 * 1024 / PAGE_SIZE) 54 54 /* limitation of partitions number on single bcache device */ ··· 199 197 static void write_bdev_super_endio(struct bio *bio) 200 198 { 201 199 struct cached_dev *dc = bio->bi_private; 202 - /* XXX: error checking */ 200 + 201 + if (bio->bi_status) 202 + bch_count_backing_io_errors(dc, bio); 203 203 204 204 closure_put(&dc->sb_write); 205 205 } ··· 695 691 { 696 692 unsigned int i; 697 693 struct cache *ca; 694 + int ret; 698 695 699 696 for_each_cache(ca, d->c, i) 700 697 bd_link_disk_holder(ca->bdev, d->disk); ··· 703 698 snprintf(d->name, BCACHEDEVNAME_SIZE, 704 699 "%s%u", name, d->id); 705 700 706 - WARN(sysfs_create_link(&d->kobj, &c->kobj, "cache") || 707 - sysfs_create_link(&c->kobj, &d->kobj, d->name), 708 - "Couldn't create device <-> cache set symlinks"); 701 + ret = sysfs_create_link(&d->kobj, &c->kobj, "cache"); 702 + if (ret < 0) 703 + pr_err("Couldn't create device -> cache set symlink"); 704 + 705 + ret = sysfs_create_link(&c->kobj, &d->kobj, d->name); 706 + if (ret < 0) 707 + pr_err("Couldn't create cache set -> device symlink"); 709 708 710 709 clear_bit(BCACHE_DEV_UNLINK_DONE, &d->flags); 711 710 } ··· 917 908 } 918 909 919 910 920 - void bch_cached_dev_run(struct cached_dev *dc) 911 + int bch_cached_dev_run(struct cached_dev *dc) 921 912 { 922 913 struct bcache_device *d = &dc->disk; 923 914 char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL); ··· 928 919 NULL, 929 920 }; 930 921 922 + if (dc->io_disable) { 923 + pr_err("I/O disabled on cached dev %s", 924 + dc->backing_dev_name); 925 + return -EIO; 926 + } 927 + 931 928 if 
(atomic_xchg(&dc->running, 1)) { 932 929 kfree(env[1]); 933 930 kfree(env[2]); 934 931 kfree(buf); 935 - return; 932 + pr_info("cached dev %s is running already", 933 + dc->backing_dev_name); 934 + return -EBUSY; 936 935 } 937 936 938 937 if (!d->c && ··· 966 949 kfree(buf); 967 950 968 951 if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") || 969 - sysfs_create_link(&disk_to_dev(d->disk)->kobj, &d->kobj, "bcache")) 970 - pr_debug("error creating sysfs link"); 952 + sysfs_create_link(&disk_to_dev(d->disk)->kobj, 953 + &d->kobj, "bcache")) { 954 + pr_err("Couldn't create bcache dev <-> disk sysfs symlinks"); 955 + return -ENOMEM; 956 + } 971 957 972 958 dc->status_update_thread = kthread_run(cached_dev_status_update, 973 959 dc, "bcache_status_update"); ··· 979 959 "continue to run without monitoring backing " 980 960 "device status"); 981 961 } 962 + 963 + return 0; 982 964 } 983 965 984 966 /* ··· 1018 996 BUG_ON(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)); 1019 997 BUG_ON(refcount_read(&dc->count)); 1020 998 1021 - mutex_lock(&bch_register_lock); 1022 999 1023 1000 if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)) 1024 1001 cancel_writeback_rate_update_dwork(dc); ··· 1032 1011 1033 1012 bch_write_bdev_super(dc, &cl); 1034 1013 closure_sync(&cl); 1014 + 1015 + mutex_lock(&bch_register_lock); 1035 1016 1036 1017 calc_cached_dev_sectors(dc->disk.c); 1037 1018 bcache_device_detach(&dc->disk); ··· 1077 1054 uint32_t rtime = cpu_to_le32((u32)ktime_get_real_seconds()); 1078 1055 struct uuid_entry *u; 1079 1056 struct cached_dev *exist_dc, *t; 1057 + int ret = 0; 1080 1058 1081 1059 if ((set_uuid && memcmp(set_uuid, c->sb.set_uuid, 16)) || 1082 1060 (!set_uuid && memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16))) ··· 1177 1153 down_write(&dc->writeback_lock); 1178 1154 if (bch_cached_dev_writeback_start(dc)) { 1179 1155 up_write(&dc->writeback_lock); 1156 + pr_err("Couldn't start writeback facilities for %s", 1157 + 
dc->disk.disk->disk_name); 1180 1158 return -ENOMEM; 1181 1159 } 1182 1160 ··· 1189 1163 1190 1164 bch_sectors_dirty_init(&dc->disk); 1191 1165 1192 - bch_cached_dev_run(dc); 1166 + ret = bch_cached_dev_run(dc); 1167 + if (ret && (ret != -EBUSY)) { 1168 + up_write(&dc->writeback_lock); 1169 + /* 1170 + * bch_register_lock is held, bcache_device_stop() is not 1171 + * able to be directly called. The kthread and kworker 1172 + * created previously in bch_cached_dev_writeback_start() 1173 + * have to be stopped manually here. 1174 + */ 1175 + kthread_stop(dc->writeback_thread); 1176 + cancel_writeback_rate_update_dwork(dc); 1177 + pr_err("Couldn't run cached device %s", 1178 + dc->backing_dev_name); 1179 + return ret; 1180 + } 1181 + 1193 1182 bcache_device_link(&dc->disk, c, "bdev"); 1194 1183 atomic_inc(&c->attached_dev_nr); 1195 1184 ··· 1231 1190 { 1232 1191 struct cached_dev *dc = container_of(cl, struct cached_dev, disk.cl); 1233 1192 1234 - mutex_lock(&bch_register_lock); 1235 - 1236 1193 if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)) 1237 1194 cancel_writeback_rate_update_dwork(dc); 1238 1195 1239 1196 if (!IS_ERR_OR_NULL(dc->writeback_thread)) 1240 1197 kthread_stop(dc->writeback_thread); 1241 - if (dc->writeback_write_wq) 1242 - destroy_workqueue(dc->writeback_write_wq); 1243 1198 if (!IS_ERR_OR_NULL(dc->status_update_thread)) 1244 1199 kthread_stop(dc->status_update_thread); 1200 + 1201 + mutex_lock(&bch_register_lock); 1245 1202 1246 1203 if (atomic_read(&dc->running)) 1247 1204 bd_unlink_disk_holder(dc->bdev, dc->disk.disk); ··· 1329 1290 { 1330 1291 const char *err = "cannot allocate memory"; 1331 1292 struct cache_set *c; 1293 + int ret = -ENOMEM; 1332 1294 1333 1295 bdevname(bdev, dc->backing_dev_name); 1334 1296 memcpy(&dc->sb, sb, sizeof(struct cache_sb)); ··· 1359 1319 bch_cached_dev_attach(dc, c, NULL); 1360 1320 1361 1321 if (BDEV_STATE(&dc->sb) == BDEV_STATE_NONE || 1362 - BDEV_STATE(&dc->sb) == BDEV_STATE_STALE) 1363 - 
bch_cached_dev_run(dc); 1322 + BDEV_STATE(&dc->sb) == BDEV_STATE_STALE) { 1323 + err = "failed to run cached device"; 1324 + ret = bch_cached_dev_run(dc); 1325 + if (ret) 1326 + goto err; 1327 + } 1364 1328 1365 1329 return 0; 1366 1330 err: 1367 1331 pr_notice("error %s: %s", dc->backing_dev_name, err); 1368 1332 bcache_device_stop(&dc->disk); 1369 - return -EIO; 1333 + return ret; 1370 1334 } 1371 1335 1372 1336 /* Flash only volumes */ ··· 1481 1437 1482 1438 bool bch_cached_dev_error(struct cached_dev *dc) 1483 1439 { 1484 - struct cache_set *c; 1485 - 1486 1440 if (!dc || test_bit(BCACHE_DEV_CLOSING, &dc->disk.flags)) 1487 1441 return false; 1488 1442 ··· 1490 1448 1491 1449 pr_err("stop %s: too many IO errors on backing device %s\n", 1492 1450 dc->disk.disk->disk_name, dc->backing_dev_name); 1493 - 1494 - /* 1495 - * If the cached device is still attached to a cache set, 1496 - * even dc->io_disable is true and no more I/O requests 1497 - * accepted, cache device internal I/O (writeback scan or 1498 - * garbage collection) may still prevent bcache device from 1499 - * being stopped. So here CACHE_SET_IO_DISABLE should be 1500 - * set to c->flags too, to make the internal I/O to cache 1501 - * device rejected and stopped immediately. 1502 - * If c is NULL, that means the bcache device is not attached 1503 - * to any cache set, then no CACHE_SET_IO_DISABLE bit to set. 
1504 - */ 1505 - c = dc->disk.c; 1506 - if (c && test_and_set_bit(CACHE_SET_IO_DISABLE, &c->flags)) 1507 - pr_info("CACHE_SET_IO_DISABLE already set"); 1508 1451 1509 1452 bcache_device_stop(&dc->disk); 1510 1453 return true; ··· 1591 1564 kobject_put(&c->internal); 1592 1565 kobject_del(&c->kobj); 1593 1566 1594 - if (c->gc_thread) 1567 + if (!IS_ERR_OR_NULL(c->gc_thread)) 1595 1568 kthread_stop(c->gc_thread); 1596 1569 1597 1570 if (!IS_ERR_OR_NULL(c->root)) 1598 1571 list_add(&c->root->list, &c->btree_cache); 1599 1572 1600 - /* Should skip this if we're unregistering because of an error */ 1601 - list_for_each_entry(b, &c->btree_cache, list) { 1602 - mutex_lock(&b->write_lock); 1603 - if (btree_node_dirty(b)) 1604 - __bch_btree_node_write(b, NULL); 1605 - mutex_unlock(&b->write_lock); 1606 - } 1573 + /* 1574 + * Avoid flushing cached nodes if cache set is retiring 1575 + * due to too many I/O errors detected. 1576 + */ 1577 + if (!test_bit(CACHE_SET_IO_DISABLE, &c->flags)) 1578 + list_for_each_entry(b, &c->btree_cache, list) { 1579 + mutex_lock(&b->write_lock); 1580 + if (btree_node_dirty(b)) 1581 + __bch_btree_node_write(b, NULL); 1582 + mutex_unlock(&b->write_lock); 1583 + } 1607 1584 1608 1585 for_each_cache(ca, c, i) 1609 1586 if (ca->alloc_thread) ··· 1880 1849 if (bch_btree_check(c)) 1881 1850 goto err; 1882 1851 1852 + /* 1853 + * bch_btree_check() may occupy too much system memory which 1854 + * has negative effects to user space application (e.g. data 1855 + * base) performance. Shrink the mca cache memory proactively 1856 + * here to avoid competing memory with user space workloads.. 
1857 + */ 1858 + if (!c->shrinker_disabled) { 1859 + struct shrink_control sc; 1860 + 1861 + sc.gfp_mask = GFP_KERNEL; 1862 + sc.nr_to_scan = c->btree_cache_used * c->btree_pages; 1863 + /* first run to clear b->accessed tag */ 1864 + c->shrink.scan_objects(&c->shrink, &sc); 1865 + /* second run to reap non-accessed nodes */ 1866 + c->shrink.scan_objects(&c->shrink, &sc); 1867 + } 1868 + 1883 1869 bch_journal_mark(c, &journal); 1884 1870 bch_initial_gc_finish(c); 1885 1871 pr_debug("btree_check() done"); ··· 2005 1957 } 2006 1958 2007 1959 closure_sync(&cl); 2008 - /* XXX: test this, it's broken */ 1960 + 2009 1961 bch_cache_set_error(c, "%s", err); 2010 1962 2011 1963 return -EIO; ··· 2299 2251 2300 2252 static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, 2301 2253 const char *buffer, size_t size); 2254 + static ssize_t bch_pending_bdevs_cleanup(struct kobject *k, 2255 + struct kobj_attribute *attr, 2256 + const char *buffer, size_t size); 2302 2257 2303 2258 kobj_attribute_write(register, register_bcache); 2304 2259 kobj_attribute_write(register_quiet, register_bcache); 2260 + kobj_attribute_write(pendings_cleanup, bch_pending_bdevs_cleanup); 2305 2261 2306 2262 static bool bch_is_open_backing(struct block_device *bdev) 2307 2263 { ··· 2351 2299 struct page *sb_page = NULL; 2352 2300 2353 2301 if (!try_module_get(THIS_MODULE)) 2302 + return -EBUSY; 2303 + 2304 + /* For latest state of bcache_is_reboot */ 2305 + smp_mb(); 2306 + if (bcache_is_reboot) 2354 2307 return -EBUSY; 2355 2308 2356 2309 path = kstrndup(buffer, size, GFP_KERNEL); ··· 2435 2378 goto out; 2436 2379 } 2437 2380 2381 + 2382 + struct pdev { 2383 + struct list_head list; 2384 + struct cached_dev *dc; 2385 + }; 2386 + 2387 + static ssize_t bch_pending_bdevs_cleanup(struct kobject *k, 2388 + struct kobj_attribute *attr, 2389 + const char *buffer, 2390 + size_t size) 2391 + { 2392 + LIST_HEAD(pending_devs); 2393 + ssize_t ret = size; 2394 + struct cached_dev *dc, *tdc; 
2395 + struct pdev *pdev, *tpdev; 2396 + struct cache_set *c, *tc; 2397 + 2398 + mutex_lock(&bch_register_lock); 2399 + list_for_each_entry_safe(dc, tdc, &uncached_devices, list) { 2400 + pdev = kmalloc(sizeof(struct pdev), GFP_KERNEL); 2401 + if (!pdev) 2402 + break; 2403 + pdev->dc = dc; 2404 + list_add(&pdev->list, &pending_devs); 2405 + } 2406 + 2407 + list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) { 2408 + list_for_each_entry_safe(c, tc, &bch_cache_sets, list) { 2409 + char *pdev_set_uuid = pdev->dc->sb.set_uuid; 2410 + char *set_uuid = c->sb.uuid; 2411 + 2412 + if (!memcmp(pdev_set_uuid, set_uuid, 16)) { 2413 + list_del(&pdev->list); 2414 + kfree(pdev); 2415 + break; 2416 + } 2417 + } 2418 + } 2419 + mutex_unlock(&bch_register_lock); 2420 + 2421 + list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) { 2422 + pr_info("delete pdev %p", pdev); 2423 + list_del(&pdev->list); 2424 + bcache_device_stop(&pdev->dc->disk); 2425 + kfree(pdev); 2426 + } 2427 + 2428 + return ret; 2429 + } 2430 + 2438 2431 static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x) 2439 2432 { 2433 + if (bcache_is_reboot) 2434 + return NOTIFY_DONE; 2435 + 2440 2436 if (code == SYS_DOWN || 2441 2437 code == SYS_HALT || 2442 2438 code == SYS_POWER_OFF) { ··· 2502 2392 2503 2393 mutex_lock(&bch_register_lock); 2504 2394 2395 + if (bcache_is_reboot) 2396 + goto out; 2397 + 2398 + /* New registration is rejected since now */ 2399 + bcache_is_reboot = true; 2400 + /* 2401 + * Make registering caller (if there is) on other CPU 2402 + * core know bcache_is_reboot set to true earlier 2403 + */ 2404 + smp_mb(); 2405 + 2505 2406 if (list_empty(&bch_cache_sets) && 2506 2407 list_empty(&uncached_devices)) 2507 2408 goto out; 2508 2409 2410 + mutex_unlock(&bch_register_lock); 2411 + 2509 2412 pr_info("Stopping all devices:"); 2510 2413 2414 + /* 2415 + * The reason bch_register_lock is not held to call 2416 + * bch_cache_set_stop() and bcache_device_stop() is to 
2417 + * avoid potential deadlock during reboot, because cache 2418 + * set or bcache device stopping process will acqurie 2419 + * bch_register_lock too. 2420 + * 2421 + * We are safe here because bcache_is_reboot sets to 2422 + * true already, register_bcache() will reject new 2423 + * registration now. bcache_is_reboot also makes sure 2424 + * bcache_reboot() won't be re-entered on by other thread, 2425 + * so there is no race in following list iteration by 2426 + * list_for_each_entry_safe(). 2427 + */ 2511 2428 list_for_each_entry_safe(c, tc, &bch_cache_sets, list) 2512 2429 bch_cache_set_stop(c); 2513 2430 2514 2431 list_for_each_entry_safe(dc, tdc, &uncached_devices, list) 2515 2432 bcache_device_stop(&dc->disk); 2516 2433 2517 - mutex_unlock(&bch_register_lock); 2518 2434 2519 2435 /* 2520 2436 * Give an early chance for other kthreads and ··· 2632 2496 static const struct attribute *files[] = { 2633 2497 &ksysfs_register.attr, 2634 2498 &ksysfs_register_quiet.attr, 2499 + &ksysfs_pendings_cleanup.attr, 2635 2500 NULL 2636 2501 }; 2637 2502 ··· 2667 2530 2668 2531 bch_debug_init(); 2669 2532 closure_debug_init(); 2533 + 2534 + bcache_is_reboot = false; 2670 2535 2671 2536 return 0; 2672 2537 err:
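With bch_cached_dev_run() now returning an int, callers can tell "refused because I/O is disabled" from "already running", and the attach path treats only the former as fatal. A condensed sketch of that contract follows; the struct and function names are hypothetical, and everything the real function does after the checks (uevents, sysfs links, the status thread) is left out.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct cached_dev_sketch {
	bool io_disable;
	atomic_int running;
};

/* Start the device once; report why it could not be started. */
static int cached_dev_run_sketch(struct cached_dev_sketch *dc)
{
	if (dc->io_disable)
		return -EIO;

	/* atomic_exchange returns the previous value: nonzero means running */
	if (atomic_exchange(&dc->running, 1))
		return -EBUSY;

	return 0;
}

/* The attach path tolerates -EBUSY but propagates every other error. */
static int cached_dev_attach_sketch(struct cached_dev_sketch *dc)
{
	int ret = cached_dev_run_sketch(dc);

	if (ret && ret != -EBUSY)
		return ret;

	return 0;
}
```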

+46 -21

drivers/md/bcache/sysfs.c

··· 16 16 #include <linux/sort.h> 17 17 #include <linux/sched/clock.h> 18 18 19 + extern bool bcache_is_reboot; 20 + 19 21 /* Default is 0 ("writethrough") */ 20 22 static const char * const bch_cache_modes[] = { 21 23 "writethrough", 22 24 "writeback", 23 25 "writearound", 24 - "none", 25 - NULL 26 + "none" 26 27 }; 27 28 28 29 /* Default is 0 ("auto") */ 29 30 static const char * const bch_stop_on_failure_modes[] = { 30 31 "auto", 31 - "always", 32 - NULL 32 + "always" 33 33 }; 34 34 35 35 static const char * const cache_replacement_policies[] = { 36 36 "lru", 37 37 "fifo", 38 - "random", 39 - NULL 38 + "random" 40 39 }; 41 40 42 41 static const char * const error_actions[] = { 43 42 "unregister", 44 - "panic", 45 - NULL 43 + "panic" 46 44 }; 47 45 48 46 write_attribute(attach); ··· 82 84 read_attribute(state); 83 85 read_attribute(cache_read_races); 84 86 read_attribute(reclaim); 87 + read_attribute(reclaimed_journal_buckets); 85 88 read_attribute(flush_write); 86 - read_attribute(retry_flush_write); 87 89 read_attribute(writeback_keys_done); 88 90 read_attribute(writeback_keys_failed); 89 91 read_attribute(io_errors); ··· 178 180 var_print(writeback_percent); 179 181 sysfs_hprint(writeback_rate, 180 182 wb ? 
atomic_long_read(&dc->writeback_rate.rate) << 9 : 0); 181 - sysfs_hprint(io_errors, atomic_read(&dc->io_errors)); 183 + sysfs_printf(io_errors, "%i", atomic_read(&dc->io_errors)); 182 184 sysfs_printf(io_error_limit, "%i", dc->error_limit); 183 185 sysfs_printf(io_disable, "%i", dc->io_disable); 184 186 var_print(writeback_rate_update_seconds); ··· 269 271 struct cache_set *c; 270 272 struct kobj_uevent_env *env; 271 273 274 + /* no user space access if system is rebooting */ 275 + if (bcache_is_reboot) 276 + return -EBUSY; 277 + 272 278 #define d_strtoul(var) sysfs_strtoul(var, dc->var) 273 279 #define d_strtoul_nonzero(var) sysfs_strtoul_clamp(var, dc->var, 1, INT_MAX) 274 280 #define d_strtoi_h(var) sysfs_hatoi(var, dc->var) ··· 331 329 bch_cache_accounting_clear(&dc->accounting); 332 330 333 331 if (attr == &sysfs_running && 334 - strtoul_or_return(buf)) 335 - bch_cached_dev_run(dc); 332 + strtoul_or_return(buf)) { 333 + v = bch_cached_dev_run(dc); 334 + if (v) 335 + return v; 336 + } 336 337 337 338 if (attr == &sysfs_cache_mode) { 338 - v = __sysfs_match_string(bch_cache_modes, -1, buf); 339 + v = sysfs_match_string(bch_cache_modes, buf); 339 340 if (v < 0) 340 341 return v; 341 342 ··· 349 344 } 350 345 351 346 if (attr == &sysfs_stop_when_cache_set_failed) { 352 - v = __sysfs_match_string(bch_stop_on_failure_modes, -1, buf); 347 + v = sysfs_match_string(bch_stop_on_failure_modes, buf); 353 348 if (v < 0) 354 349 return v; 355 350 ··· 413 408 struct cached_dev *dc = container_of(kobj, struct cached_dev, 414 409 disk.kobj); 415 410 411 + /* no user space access if system is rebooting */ 412 + if (bcache_is_reboot) 413 + return -EBUSY; 414 + 416 415 mutex_lock(&bch_register_lock); 417 416 size = __cached_dev_store(kobj, attr, buf, size); 418 417 ··· 473 464 &sysfs_writeback_rate_p_term_inverse, 474 465 &sysfs_writeback_rate_minimum, 475 466 &sysfs_writeback_rate_debug, 476 - &sysfs_errors, 467 + &sysfs_io_errors, 477 468 &sysfs_io_error_limit, 478 469 
&sysfs_io_disable, 479 470 &sysfs_dirty_data, ··· 519 510 struct bcache_device *d = container_of(kobj, struct bcache_device, 520 511 kobj); 521 512 struct uuid_entry *u = &d->c->uuids[d->id]; 513 + 514 + /* no user space access if system is rebooting */ 515 + if (bcache_is_reboot) 516 + return -EBUSY; 522 517 523 518 sysfs_strtoul(data_csum, d->data_csum); 524 519 ··· 706 693 sysfs_print(reclaim, 707 694 atomic_long_read(&c->reclaim)); 708 695 696 + sysfs_print(reclaimed_journal_buckets, 697 + atomic_long_read(&c->reclaimed_journal_buckets)); 698 + 709 699 sysfs_print(flush_write, 710 700 atomic_long_read(&c->flush_write)); 711 - 712 - sysfs_print(retry_flush_write, 713 - atomic_long_read(&c->retry_flush_write)); 714 701 715 702 sysfs_print(writeback_keys_done, 716 703 atomic_long_read(&c->writeback_keys_done)); ··· 758 745 { 759 746 struct cache_set *c = container_of(kobj, struct cache_set, kobj); 760 747 ssize_t v; 748 + 749 + /* no user space access if system is rebooting */ 750 + if (bcache_is_reboot) 751 + return -EBUSY; 761 752 762 753 if (attr == &sysfs_unregister) 763 754 bch_cache_set_unregister(c); ··· 816 799 0, UINT_MAX); 817 800 818 801 if (attr == &sysfs_errors) { 819 - v = __sysfs_match_string(error_actions, -1, buf); 802 + v = sysfs_match_string(error_actions, buf); 820 803 if (v < 0) 821 804 return v; 822 805 ··· 882 865 { 883 866 struct cache_set *c = container_of(kobj, struct cache_set, internal); 884 867 868 + /* no user space access if system is rebooting */ 869 + if (bcache_is_reboot) 870 + return -EBUSY; 871 + 885 872 return bch_cache_set_store(&c->kobj, attr, buf, size); 886 873 } 887 874 ··· 935 914 &sysfs_bset_tree_stats, 936 915 &sysfs_cache_read_races, 937 916 &sysfs_reclaim, 917 + &sysfs_reclaimed_journal_buckets, 938 918 &sysfs_flush_write, 939 - &sysfs_retry_flush_write, 940 919 &sysfs_writeback_keys_done, 941 920 &sysfs_writeback_keys_failed, 942 921 ··· 1071 1050 struct cache *ca = container_of(kobj, struct cache, kobj); 1072 1051 
ssize_t v; 1073 1052 1053 + /* no user space access if system is rebooting */ 1054 + if (bcache_is_reboot) 1055 + return -EBUSY; 1056 + 1074 1057 if (attr == &sysfs_discard) { 1075 1058 bool v = strtoul_or_return(buf); 1076 1059 ··· 1088 1063 } 1089 1064 1090 1065 if (attr == &sysfs_cache_replacement_policy) { 1091 - v = __sysfs_match_string(cache_replacement_policies, -1, buf); 1066 + v = sysfs_match_string(cache_replacement_policies, buf); 1092 1067 if (v < 0) 1093 1068 return v; 1094 1069
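The sysfs.c hunks drop the trailing NULL from each string table and switch from `__sysfs_match_string(array, -1, buf)` to `sysfs_match_string(array, buf)`, which, as far as I know, passes the array length via ARRAY_SIZE instead of relying on a terminator. A user-space approximation of length-bounded matching follows; `match_mode` is a hypothetical helper, and it only mimics the kernel's tolerance for a trailing newline by comparing up to the first newline.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const char * const cache_modes[] = {
	"writethrough",
	"writeback",
	"writearound",
	"none"
};

/*
 * Return the index of s in modes[0..n-1], ignoring anything from the
 * first newline onwards, or -EINVAL if there is no exact match.
 */
static int match_mode(const char * const *modes, size_t n, const char *s)
{
	size_t len = strcspn(s, "\n");
	size_t i;

	for (i = 0; i < n; i++)
		if (strlen(modes[i]) == len && !strncmp(modes[i], s, len))
			return (int)i;

	return -EINVAL;
}
```

Because the bound comes from the array itself, forgetting a terminator can no longer cause the scan to run past the end of the table.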

-2

drivers/md/bcache/util.h

···
 
 #define heap_full(h) ((h)->used == (h)->size)
 
-#define heap_empty(h) ((h)->used == 0)
-
 #define DECLARE_FIFO(type, name) \
 	struct { \
 		size_t front, back, size, mask; \

+8

drivers/md/bcache/writeback.c

···
 static bool set_at_max_writeback_rate(struct cache_set *c,
 				       struct cached_dev *dc)
 {
+	/* Don't set max writeback rate if gc is running */
+	if (!c->gc_mark_valid)
+		return false;
 	/*
 	 * Idle_counter is increased everytime when update_writeback_rate() is
 	 * called. If all backing devices attached to the same cache set have
···
 		}
 	}
 
+	if (dc->writeback_write_wq) {
+		flush_workqueue(dc->writeback_write_wq);
+		destroy_workqueue(dc->writeback_write_wq);
+	}
 	cached_dev_put(dc);
 	wait_for_kthread_stop();
 
···
 					      "bcache_writeback");
 	if (IS_ERR(dc->writeback_thread)) {
 		cached_dev_put(dc);
+		destroy_workqueue(dc->writeback_write_wq);
 		return PTR_ERR(dc->writeback_thread);
 	}
 	dc->writeback_running = true;

+20

drivers/md/md-bitmap.c

···
 		return;
 
 	md_bitmap_wait_behind_writes(mddev);
+	mempool_destroy(mddev->wb_info_pool);
+	mddev->wb_info_pool = NULL;
 
 	mutex_lock(&mddev->bitmap_info.mutex);
 	spin_lock(&mddev->lock);
···
 	sector_t start = 0;
 	sector_t sector = 0;
 	struct bitmap *bitmap = mddev->bitmap;
+	struct md_rdev *rdev;
 
 	if (!bitmap)
 		goto out;
+
+	rdev_for_each(rdev, mddev)
+		mddev_create_wb_pool(mddev, rdev, true);
 
 	if (mddev_is_clustered(mddev))
 		md_cluster_ops->load_bitmaps(mddev, mddev->bitmap_info.nodes);
···
 backlog_store(struct mddev *mddev, const char *buf, size_t len)
 {
 	unsigned long backlog;
+	unsigned long old_mwb = mddev->bitmap_info.max_write_behind;
 	int rv = kstrtoul(buf, 10, &backlog);
 	if (rv)
 		return rv;
 	if (backlog > COUNTER_MAX)
 		return -EINVAL;
 	mddev->bitmap_info.max_write_behind = backlog;
+	if (!backlog && mddev->wb_info_pool) {
+		/* wb_info_pool is not needed if backlog is zero */
+		mempool_destroy(mddev->wb_info_pool);
+		mddev->wb_info_pool = NULL;
+	} else if (backlog && !mddev->wb_info_pool) {
+		/* wb_info_pool is needed since backlog is not zero */
+		struct md_rdev *rdev;
+
+		rdev_for_each(rdev, mddev)
+			mddev_create_wb_pool(mddev, rdev, false);
+	}
+	if (old_mwb != backlog)
+		md_bitmap_update_sb(mddev->bitmap);
 	return len;
 }
 
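The backlog_store() hunk ties the lifetime of wb_info_pool to the write-behind backlog: a zero backlog frees the pool, a nonzero backlog creates it if missing, and the superblock is rewritten only when the value actually changed. A simplified sketch of that state handling follows, with loud assumptions: `malloc`/`free` stand in for the mempool, `sb_updates` stands in for md_bitmap_update_sb(), and the real per-rdev conditions (write-mostly flag, multi-queue device) are ignored.

```c
#include <stdlib.h>

struct mddev_sketch {
	unsigned long max_write_behind;
	void *wb_info_pool;		/* stands in for the mempool */
	unsigned int sb_updates;	/* counts simulated superblock writes */
};

/* Apply a new backlog value and keep the pool in step with it. */
static void set_backlog(struct mddev_sketch *m, unsigned long backlog)
{
	unsigned long old = m->max_write_behind;

	m->max_write_behind = backlog;

	if (!backlog && m->wb_info_pool) {
		/* write-behind disabled: the pool is no longer needed */
		free(m->wb_info_pool);
		m->wb_info_pool = NULL;
	} else if (backlog && !m->wb_info_pool) {
		/* write-behind enabled: make sure a pool exists */
		m->wb_info_pool = malloc(64);
	}

	if (old != backlog)
		m->sb_updates++;
}
```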

+113 -16

drivers/md/md.c

··· 37 37 38 38 */ 39 39 40 + #include <linux/sched/mm.h> 40 41 #include <linux/sched/signal.h> 41 42 #include <linux/kthread.h> 42 43 #include <linux/blkdev.h> ··· 123 122 { 124 123 return mddev->sync_speed_max ? 125 124 mddev->sync_speed_max : sysctl_speed_limit_max; 125 + } 126 + 127 + static int rdev_init_wb(struct md_rdev *rdev) 128 + { 129 + if (rdev->bdev->bd_queue->nr_hw_queues == 1) 130 + return 0; 131 + 132 + spin_lock_init(&rdev->wb_list_lock); 133 + INIT_LIST_HEAD(&rdev->wb_list); 134 + init_waitqueue_head(&rdev->wb_io_wait); 135 + set_bit(WBCollisionCheck, &rdev->flags); 136 + 137 + return 1; 138 + } 139 + 140 + /* 141 + * Create wb_info_pool if rdev is the first multi-queue device flaged 142 + * with writemostly, also write-behind mode is enabled. 143 + */ 144 + void mddev_create_wb_pool(struct mddev *mddev, struct md_rdev *rdev, 145 + bool is_suspend) 146 + { 147 + if (mddev->bitmap_info.max_write_behind == 0) 148 + return; 149 + 150 + if (!test_bit(WriteMostly, &rdev->flags) || !rdev_init_wb(rdev)) 151 + return; 152 + 153 + if (mddev->wb_info_pool == NULL) { 154 + unsigned int noio_flag; 155 + 156 + if (!is_suspend) 157 + mddev_suspend(mddev); 158 + noio_flag = memalloc_noio_save(); 159 + mddev->wb_info_pool = mempool_create_kmalloc_pool(NR_WB_INFOS, 160 + sizeof(struct wb_info)); 161 + memalloc_noio_restore(noio_flag); 162 + if (!mddev->wb_info_pool) 163 + pr_err("can't alloc memory pool for writemostly\n"); 164 + if (!is_suspend) 165 + mddev_resume(mddev); 166 + } 167 + } 168 + EXPORT_SYMBOL_GPL(mddev_create_wb_pool); 169 + 170 + /* 171 + * destroy wb_info_pool if rdev is the last device flaged with WBCollisionCheck. 172 + */ 173 + static void mddev_destroy_wb_pool(struct mddev *mddev, struct md_rdev *rdev) 174 + { 175 + if (!test_and_clear_bit(WBCollisionCheck, &rdev->flags)) 176 + return; 177 + 178 + if (mddev->wb_info_pool) { 179 + struct md_rdev *temp; 180 + int num = 0; 181 + 182 + /* 183 + * Check if other rdevs need wb_info_pool. 
184 + */ 185 + rdev_for_each(temp, mddev) 186 + if (temp != rdev && 187 + test_bit(WBCollisionCheck, &temp->flags)) 188 + num++; 189 + if (!num) { 190 + mddev_suspend(rdev->mddev); 191 + mempool_destroy(mddev->wb_info_pool); 192 + mddev->wb_info_pool = NULL; 193 + mddev_resume(rdev->mddev); 194 + } 195 + } 126 196 } 127 197 128 198 static struct ctl_table_header *raid_table_header; ··· 2282 2210 rdev->mddev = mddev; 2283 2211 pr_debug("md: bind<%s>\n", b); 2284 2212 2213 + if (mddev->raid_disks) 2214 + mddev_create_wb_pool(mddev, rdev, false); 2215 + 2285 2216 if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b))) 2286 2217 goto fail; 2287 2218 ··· 2321 2246 bd_unlink_disk_holder(rdev->bdev, rdev->mddev->gendisk); 2322 2247 list_del_rcu(&rdev->same_set); 2323 2248 pr_debug("md: unbind<%s>\n", bdevname(rdev->bdev,b)); 2249 + mddev_destroy_wb_pool(rdev->mddev, rdev); 2324 2250 rdev->mddev = NULL; 2325 2251 sysfs_remove_link(&rdev->kobj, "block"); 2326 2252 sysfs_put(rdev->sysfs_state); ··· 2834 2758 } 2835 2759 } else if (cmd_match(buf, "writemostly")) { 2836 2760 set_bit(WriteMostly, &rdev->flags); 2761 + mddev_create_wb_pool(rdev->mddev, rdev, false); 2837 2762 err = 0; 2838 2763 } else if (cmd_match(buf, "-writemostly")) { 2764 + mddev_destroy_wb_pool(rdev->mddev, rdev); 2839 2765 clear_bit(WriteMostly, &rdev->flags); 2840 2766 err = 0; 2841 2767 } else if (cmd_match(buf, "blocked")) { ··· 3434 3356 if (!entry->show) 3435 3357 return -EIO; 3436 3358 if (!rdev->mddev) 3437 - return -EBUSY; 3359 + return -ENODEV; 3438 3360 return entry->show(rdev, page); 3439 3361 } 3440 3362 ··· 5666 5588 mddev->bitmap = bitmap; 5667 5589 5668 5590 } 5669 - if (err) { 5670 - mddev_detach(mddev); 5671 - if (mddev->private) 5672 - pers->free(mddev, mddev->private); 5673 - mddev->private = NULL; 5674 - module_put(pers->owner); 5675 - md_bitmap_destroy(mddev); 5676 - goto abort; 5591 + if (err) 5592 + goto bitmap_abort; 5593 + 5594 + if (mddev->bitmap_info.max_write_behind > 
0) { 5595 + bool creat_pool = false; 5596 + 5597 + rdev_for_each(rdev, mddev) { 5598 + if (test_bit(WriteMostly, &rdev->flags) && 5599 + rdev_init_wb(rdev)) 5600 + creat_pool = true; 5601 + } 5602 + if (creat_pool && mddev->wb_info_pool == NULL) { 5603 + mddev->wb_info_pool = 5604 + mempool_create_kmalloc_pool(NR_WB_INFOS, 5605 + sizeof(struct wb_info)); 5606 + if (!mddev->wb_info_pool) { 5607 + err = -ENOMEM; 5608 + goto bitmap_abort; 5609 + } 5610 + } 5677 5611 } 5612 + 5678 5613 if (mddev->queue) { 5679 5614 bool nonrot = true; 5680 5615 ··· 5730 5639 spin_unlock(&mddev->lock); 5731 5640 rdev_for_each(rdev, mddev) 5732 5641 if (rdev->raid_disk >= 0) 5733 - if (sysfs_link_rdev(mddev, rdev)) 5734 - /* failure here is OK */; 5642 + sysfs_link_rdev(mddev, rdev); /* failure here is OK */ 5735 5643 5736 5644 if (mddev->degraded && !mddev->ro) 5737 5645 /* This ensures that recovering status is reported immediately ··· 5748 5658 sysfs_notify(&mddev->kobj, NULL, "degraded"); 5749 5659 return 0; 5750 5660 5661 + bitmap_abort: 5662 + mddev_detach(mddev); 5663 + if (mddev->private) 5664 + pers->free(mddev, mddev->private); 5665 + mddev->private = NULL; 5666 + module_put(pers->owner); 5667 + md_bitmap_destroy(mddev); 5751 5668 abort: 5752 5669 bioset_exit(&mddev->bio_set); 5753 5670 bioset_exit(&mddev->sync_set); ··· 5923 5826 mddev->in_sync = 1; 5924 5827 md_update_sb(mddev, 1); 5925 5828 } 5829 + mempool_destroy(mddev->wb_info_pool); 5830 + mddev->wb_info_pool = NULL; 5926 5831 } 5927 5832 5928 5833 void md_stop_writes(struct mddev *mddev) ··· 8297 8198 { 8298 8199 struct mddev *mddev = thread->mddev; 8299 8200 struct mddev *mddev2; 8300 - unsigned int currspeed = 0, 8301 - window; 8201 + unsigned int currspeed = 0, window; 8302 8202 sector_t max_sectors,j, io_sectors, recovery_done; 8303 8203 unsigned long mark[SYNC_MARKS]; 8304 8204 unsigned long update_time; ··· 8354 8256 * 0 == not engaged in resync at all 8355 8257 * 2 == checking that there is no conflict with 
another sync 8356 8258 * 1 == like 2, but have yielded to allow conflicting resync to 8357 - * commense 8259 + * commence 8358 8260 * other == active in resync - this many blocks 8359 8261 * 8360 8262 * Before starting a resync we must have set curr_resync to ··· 8485 8387 /* 8486 8388 * Tune reconstruction: 8487 8389 */ 8488 - window = 32*(PAGE_SIZE/512); 8390 + window = 32 * (PAGE_SIZE / 512); 8489 8391 pr_debug("md: using %dk window, over a total of %lluk.\n", 8490 8392 window/2, (unsigned long long)max_sectors/2); 8491 8393 ··· 9298 9200 * perform resync with the new activated disk */ 9299 9201 set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 9300 9202 md_wakeup_thread(mddev->thread); 9301 - 9302 9203 } 9303 9204 /* device faulty 9304 9205 * We just want to do the minimum to mark the disk

+23

drivers/md/md.h

··· 109 109 * for reporting to userspace and storing 110 110 * in superblock. 111 111 */ 112 + 113 + /* 114 + * The members for check collision of write behind IOs. 115 + */ 116 + struct list_head wb_list; 117 + spinlock_t wb_list_lock; 118 + wait_queue_head_t wb_io_wait; 119 + 112 120 struct work_struct del_work; /* used for delayed sysfs removal */ 113 121 114 122 struct kernfs_node *sysfs_state; /* handle for 'state' ··· 201 193 * it didn't fail, so don't use FailFast 202 194 * any more for metadata 203 195 */ 196 + WBCollisionCheck, /* 197 + * multiqueue device should check if there 198 + * is collision between write behind bios. 199 + */ 204 200 }; 205 201 206 202 static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors, ··· 255 243 MD_SB_CHANGE_CLEAN, /* transition to or from 'clean' */ 256 244 MD_SB_CHANGE_PENDING, /* switch from 'clean' to 'active' in progress */ 257 245 MD_SB_NEED_REWRITE, /* metadata write needs to be repeated */ 246 + }; 247 + 248 + #define NR_WB_INFOS 8 249 + /* record current range of write behind IOs */ 250 + struct wb_info { 251 + sector_t lo; 252 + sector_t hi; 253 + struct list_head list; 258 254 }; 259 255 260 256 struct mddev { ··· 481 461 */ 482 462 struct work_struct flush_work; 483 463 struct work_struct event_work; /* used by dm to report failure event */ 464 + mempool_t *wb_info_pool; 484 465 void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); 485 466 struct md_cluster_info *cluster_info; 486 467 unsigned int good_device_nr; /* good device num within cluster raid */ ··· 730 709 extern void md_reload_sb(struct mddev *mddev, int raid_disk); 731 710 extern void md_update_sb(struct mddev *mddev, int force); 732 711 extern void md_kick_rdev_from_array(struct md_rdev * rdev); 712 + extern void mddev_create_wb_pool(struct mddev *mddev, struct md_rdev *rdev, 713 + bool is_suspend); 733 714 struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr); 734 715 struct md_rdev *md_find_rdev_rcu(struct 
mddev *mddev, dev_t dev); 735 716

+30

drivers/md/raid1-10.c

···
 #define RESYNC_BLOCK_SIZE (64*1024)
 #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)

+/*
+ * Number of guaranteed raid bios in case of extreme VM load:
+ */
+#define NR_RAID_BIOS 256
+
+/* when we get a read error on a read-only array, we redirect to another
+ * device without failing the first device, or trying to over-write to
+ * correct the read error. To keep track of bad blocks on a per-bio
+ * level, we store IO_BLOCKED in the appropriate 'bios' pointer
+ */
+#define IO_BLOCKED ((struct bio *)1)
+/* When we successfully write to a known bad-block, we need to remove the
+ * bad-block marking which must be done from process context. So we record
+ * the success by setting devs[n].bio to IO_MADE_GOOD
+ */
+#define IO_MADE_GOOD ((struct bio *)2)
+
+#define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
+
+/* When there are this many requests queue to be written by
+ * the raid thread, we become 'congested' to provide back-pressure
+ * for writeback.
+ */
+static int max_queued_requests = 1024;
+
 /* for managing resync I/O pages */
 struct resync_pages {
 	void *raid_bio;
 	struct page *pages[RESYNC_PAGES];
 };
+
+static void rbio_pool_free(void *rbio, void *data)
+{
+	kfree(rbio);
+}

 static inline int resync_alloc_pages(struct resync_pages *rp,
 				     gfp_t gfp_flags)

+76 -43

drivers/md/raid1.c

··· 42 42 (1L << MD_HAS_PPL) | \ 43 43 (1L << MD_HAS_MULTIPLE_PPLS)) 44 44 45 - /* 46 - * Number of guaranteed r1bios in case of extreme VM load: 47 - */ 48 - #define NR_RAID1_BIOS 256 49 - 50 - /* when we get a read error on a read-only array, we redirect to another 51 - * device without failing the first device, or trying to over-write to 52 - * correct the read error. To keep track of bad blocks on a per-bio 53 - * level, we store IO_BLOCKED in the appropriate 'bios' pointer 54 - */ 55 - #define IO_BLOCKED ((struct bio *)1) 56 - /* When we successfully write to a known bad-block, we need to remove the 57 - * bad-block marking which must be done from process context. So we record 58 - * the success by setting devs[n].bio to IO_MADE_GOOD 59 - */ 60 - #define IO_MADE_GOOD ((struct bio *)2) 61 - 62 - #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2) 63 - 64 - /* When there are this many requests queue to be written by 65 - * the raid1 thread, we become 'congested' to provide back-pressure 66 - * for writeback. 
67 - */ 68 - static int max_queued_requests = 1024; 69 - 70 45 static void allow_barrier(struct r1conf *conf, sector_t sector_nr); 71 46 static void lower_barrier(struct r1conf *conf, sector_t sector_nr); 72 47 ··· 49 74 do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0) 50 75 51 76 #include "raid1-10.c" 77 + 78 + static int check_and_add_wb(struct md_rdev *rdev, sector_t lo, sector_t hi) 79 + { 80 + struct wb_info *wi, *temp_wi; 81 + unsigned long flags; 82 + int ret = 0; 83 + struct mddev *mddev = rdev->mddev; 84 + 85 + wi = mempool_alloc(mddev->wb_info_pool, GFP_NOIO); 86 + 87 + spin_lock_irqsave(&rdev->wb_list_lock, flags); 88 + list_for_each_entry(temp_wi, &rdev->wb_list, list) { 89 + /* collision happened */ 90 + if (hi > temp_wi->lo && lo < temp_wi->hi) { 91 + ret = -EBUSY; 92 + break; 93 + } 94 + } 95 + 96 + if (!ret) { 97 + wi->lo = lo; 98 + wi->hi = hi; 99 + list_add(&wi->list, &rdev->wb_list); 100 + } else 101 + mempool_free(wi, mddev->wb_info_pool); 102 + spin_unlock_irqrestore(&rdev->wb_list_lock, flags); 103 + 104 + return ret; 105 + } 106 + 107 + static void remove_wb(struct md_rdev *rdev, sector_t lo, sector_t hi) 108 + { 109 + struct wb_info *wi; 110 + unsigned long flags; 111 + int found = 0; 112 + struct mddev *mddev = rdev->mddev; 113 + 114 + spin_lock_irqsave(&rdev->wb_list_lock, flags); 115 + list_for_each_entry(wi, &rdev->wb_list, list) 116 + if (hi == wi->hi && lo == wi->lo) { 117 + list_del(&wi->list); 118 + mempool_free(wi, mddev->wb_info_pool); 119 + found = 1; 120 + break; 121 + } 122 + 123 + if (!found) 124 + WARN(1, "The write behind IO is not recorded\n"); 125 + spin_unlock_irqrestore(&rdev->wb_list_lock, flags); 126 + wake_up(&rdev->wb_io_wait); 127 + } 52 128 53 129 /* 54 130 * for resync bio, r1bio pointer can be retrieved from the per-bio ··· 117 91 118 92 /* allocate a r1bio with room for raid_disks entries in the bios array */ 119 93 return kzalloc(size, gfp_flags); 120 - } 121 - 122 - static 
void r1bio_pool_free(void *r1_bio, void *data) 123 - { 124 - kfree(r1_bio); 125 94 } 126 95 127 96 #define RESYNC_DEPTH 32 ··· 194 173 kfree(rps); 195 174 196 175 out_free_r1bio: 197 - r1bio_pool_free(r1_bio, data); 176 + rbio_pool_free(r1_bio, data); 198 177 return NULL; 199 178 } 200 179 ··· 214 193 /* resync pages array stored in the 1st bio's .bi_private */ 215 194 kfree(rp); 216 195 217 - r1bio_pool_free(r1bio, data); 196 + rbio_pool_free(r1bio, data); 218 197 } 219 198 220 199 static void put_all_bios(struct r1conf *conf, struct r1bio *r1_bio) ··· 497 476 } 498 477 499 478 if (behind) { 479 + if (test_bit(WBCollisionCheck, &rdev->flags)) { 480 + sector_t lo = r1_bio->sector; 481 + sector_t hi = r1_bio->sector + r1_bio->sectors; 482 + 483 + remove_wb(rdev, lo, hi); 484 + } 500 485 if (test_bit(WriteMostly, &rdev->flags)) 501 486 atomic_dec(&r1_bio->behind_remaining); 502 487 ··· 1476 1449 if (!r1_bio->bios[i]) 1477 1450 continue; 1478 1451 1479 - 1480 1452 if (first_clone) { 1481 1453 /* do behind I/O ? 
1482 1454 * Not if there are too many, or cannot ··· 1500 1474 mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set); 1501 1475 1502 1476 if (r1_bio->behind_master_bio) { 1503 - if (test_bit(WriteMostly, &conf->mirrors[i].rdev->flags)) 1477 + struct md_rdev *rdev = conf->mirrors[i].rdev; 1478 + 1479 + if (test_bit(WBCollisionCheck, &rdev->flags)) { 1480 + sector_t lo = r1_bio->sector; 1481 + sector_t hi = r1_bio->sector + r1_bio->sectors; 1482 + 1483 + wait_event(rdev->wb_io_wait, 1484 + check_and_add_wb(rdev, lo, hi) == 0); 1485 + } 1486 + if (test_bit(WriteMostly, &rdev->flags)) 1504 1487 atomic_inc(&r1_bio->behind_remaining); 1505 1488 } 1506 1489 ··· 1764 1729 first = last = rdev->saved_raid_disk; 1765 1730 1766 1731 for (mirror = first; mirror <= last; mirror++) { 1767 - p = conf->mirrors+mirror; 1732 + p = conf->mirrors + mirror; 1768 1733 if (!p->rdev) { 1769 - 1770 1734 if (mddev->gendisk) 1771 1735 disk_stack_limits(mddev->gendisk, rdev->bdev, 1772 1736 rdev->data_offset << 9); ··· 2922 2888 if (read_targets == 1) 2923 2889 bio->bi_opf &= ~MD_FAILFAST; 2924 2890 generic_make_request(bio); 2925 - 2926 2891 } 2927 2892 return nr_sectors; 2928 2893 } ··· 2980 2947 if (!conf->poolinfo) 2981 2948 goto abort; 2982 2949 conf->poolinfo->raid_disks = mddev->raid_disks * 2; 2983 - err = mempool_init(&conf->r1bio_pool, NR_RAID1_BIOS, r1bio_pool_alloc, 2984 - r1bio_pool_free, conf->poolinfo); 2950 + err = mempool_init(&conf->r1bio_pool, NR_RAID_BIOS, r1bio_pool_alloc, 2951 + rbio_pool_free, conf->poolinfo); 2985 2952 if (err) 2986 2953 goto abort; 2987 2954 ··· 3122 3089 } 3123 3090 3124 3091 mddev->degraded = 0; 3125 - for (i=0; i < conf->raid_disks; i++) 3092 + for (i = 0; i < conf->raid_disks; i++) 3126 3093 if (conf->mirrors[i].rdev == NULL || 3127 3094 !test_bit(In_sync, &conf->mirrors[i].rdev->flags) || 3128 3095 test_bit(Faulty, &conf->mirrors[i].rdev->flags)) ··· 3157 3124 mddev->queue); 3158 3125 } 3159 3126 3160 - ret = md_integrity_register(mddev); 3127 + 
ret = md_integrity_register(mddev); 3161 3128 if (ret) { 3162 3129 md_unregister_thread(&mddev->thread); 3163 3130 raid1_free(mddev, conf); ··· 3265 3232 newpoolinfo->mddev = mddev; 3266 3233 newpoolinfo->raid_disks = raid_disks * 2; 3267 3234 3268 - ret = mempool_init(&newpool, NR_RAID1_BIOS, r1bio_pool_alloc, 3269 - r1bio_pool_free, newpoolinfo); 3235 + ret = mempool_init(&newpool, NR_RAID_BIOS, r1bio_pool_alloc, 3236 + rbio_pool_free, newpoolinfo); 3270 3237 if (ret) { 3271 3238 kfree(newpoolinfo); 3272 3239 return ret;
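
The collision test in check_and_add_wb() above treats each write-behind I/O as a half-open sector range [lo, hi) and refuses a new range that overlaps one already recorded for the device. A minimal user-space sketch of that bookkeeping follows. The names `struct wb_range`, `struct wb_set`, `ranges_collide`, `wb_try_add` and `wb_remove` are hypothetical, and a fixed array stands in for the kernel list, mempool and spinlock, so this only illustrates the comparison, not the driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct wb_info: a half-open range [lo, hi). */
struct wb_range {
	uint64_t lo;
	uint64_t hi;
};

#define MAX_RANGES 8	/* illustrative capacity, same value as NR_WB_INFOS */

/* Hypothetical stand-in for the per-rdev wb_list. */
struct wb_set {
	struct wb_range r[MAX_RANGES];
	size_t n;
};

/* Same comparison as in the diff: hi > other->lo && lo < other->hi. */
static bool ranges_collide(uint64_t lo, uint64_t hi, const struct wb_range *o)
{
	return hi > o->lo && lo < o->hi;
}

/* Records [lo, hi) and returns 0, or returns -1 on collision or full set. */
static int wb_try_add(struct wb_set *s, uint64_t lo, uint64_t hi)
{
	assert(lo < hi);
	for (size_t i = 0; i < s->n; i++)
		if (ranges_collide(lo, hi, &s->r[i]))
			return -1;
	if (s->n == MAX_RANGES)
		return -1;
	s->r[s->n].lo = lo;
	s->r[s->n].hi = hi;
	s->n++;
	return 0;
}

/* Removes an exact match, as remove_wb() does; returns 0 if it was found. */
static int wb_remove(struct wb_set *s, uint64_t lo, uint64_t hi)
{
	for (size_t i = 0; i < s->n; i++) {
		if (s->r[i].lo == lo && s->r[i].hi == hi) {
			s->r[i] = s->r[s->n - 1];	/* fill the hole with the last entry */
			s->n--;
			return 0;
		}
	}
	return -1;
}
```

Because the ranges are half-open, two writes that merely touch (one ending at sector 8, the next starting at sector 8) do not count as a collision. In the kernel code a colliding writer sleeps in wait_event() until remove_wb() wakes it, which this sketch leaves out.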

+38 -48

drivers/md/raid10.c

··· 64 64 * [B A] [D C] [B A] [E C D] 65 65 */ 66 66 67 - /* 68 - * Number of guaranteed r10bios in case of extreme VM load: 69 - */ 70 - #define NR_RAID10_BIOS 256 71 - 72 - /* when we get a read error on a read-only array, we redirect to another 73 - * device without failing the first device, or trying to over-write to 74 - * correct the read error. To keep track of bad blocks on a per-bio 75 - * level, we store IO_BLOCKED in the appropriate 'bios' pointer 76 - */ 77 - #define IO_BLOCKED ((struct bio *)1) 78 - /* When we successfully write to a known bad-block, we need to remove the 79 - * bad-block marking which must be done from process context. So we record 80 - * the success by setting devs[n].bio to IO_MADE_GOOD 81 - */ 82 - #define IO_MADE_GOOD ((struct bio *)2) 83 - 84 - #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2) 85 - 86 - /* When there are this many requests queued to be written by 87 - * the raid10 thread, we become 'congested' to provide back-pressure 88 - * for writeback. 
89 - */ 90 - static int max_queued_requests = 1024; 91 - 92 67 static void allow_barrier(struct r10conf *conf); 93 68 static void lower_barrier(struct r10conf *conf); 94 69 static int _enough(struct r10conf *conf, int previous, int ignore); ··· 96 121 /* allocate a r10bio with room for raid_disks entries in the 97 122 * bios array */ 98 123 return kzalloc(size, gfp_flags); 99 - } 100 - 101 - static void r10bio_pool_free(void *r10_bio, void *data) 102 - { 103 - kfree(r10_bio); 104 124 } 105 125 106 126 #define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9) ··· 203 233 } 204 234 kfree(rps); 205 235 out_free_r10bio: 206 - r10bio_pool_free(r10_bio, conf); 236 + rbio_pool_free(r10_bio, conf); 207 237 return NULL; 208 238 } 209 239 ··· 231 261 /* resync pages array stored in the 1st bio's .bi_private */ 232 262 kfree(rp); 233 263 234 - r10bio_pool_free(r10bio, conf); 264 + rbio_pool_free(r10bio, conf); 235 265 } 236 266 237 267 static void put_all_bios(struct r10conf *conf, struct r10bio *r10_bio) ··· 707 737 int sectors = r10_bio->sectors; 708 738 int best_good_sectors; 709 739 sector_t new_distance, best_dist; 710 - struct md_rdev *best_rdev, *rdev = NULL; 740 + struct md_rdev *best_dist_rdev, *best_pending_rdev, *rdev = NULL; 711 741 int do_balance; 712 - int best_slot; 742 + int best_dist_slot, best_pending_slot; 743 + bool has_nonrot_disk = false; 744 + unsigned int min_pending; 713 745 struct geom *geo = &conf->geo; 714 746 715 747 raid10_find_phys(conf, r10_bio); 716 748 rcu_read_lock(); 717 - best_slot = -1; 718 - best_rdev = NULL; 749 + best_dist_slot = -1; 750 + min_pending = UINT_MAX; 751 + best_dist_rdev = NULL; 752 + best_pending_rdev = NULL; 719 753 best_dist = MaxSector; 720 754 best_good_sectors = 0; 721 755 do_balance = 1; ··· 741 767 sector_t first_bad; 742 768 int bad_sectors; 743 769 sector_t dev_sector; 770 + unsigned int pending; 771 + bool nonrot; 744 772 745 773 if (r10_bio->devs[slot].bio == IO_BLOCKED) 746 774 continue; ··· 779 803 first_bad - 
dev_sector; 780 804 if (good_sectors > best_good_sectors) { 781 805 best_good_sectors = good_sectors; 782 - best_slot = slot; 783 - best_rdev = rdev; 806 + best_dist_slot = slot; 807 + best_dist_rdev = rdev; 784 808 } 785 809 if (!do_balance) 786 810 /* Must read from here */ ··· 793 817 if (!do_balance) 794 818 break; 795 819 796 - if (best_slot >= 0) 820 + nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev)); 821 + has_nonrot_disk |= nonrot; 822 + pending = atomic_read(&rdev->nr_pending); 823 + if (min_pending > pending && nonrot) { 824 + min_pending = pending; 825 + best_pending_slot = slot; 826 + best_pending_rdev = rdev; 827 + } 828 + 829 + if (best_dist_slot >= 0) 797 830 /* At least 2 disks to choose from so failfast is OK */ 798 831 set_bit(R10BIO_FailFast, &r10_bio->state); 799 832 /* This optimisation is debatable, and completely destroys 800 833 * sequential read speed for 'far copies' arrays. So only 801 834 * keep it for 'near' arrays, and review those later. 802 835 */ 803 - if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending)) 836 + if (geo->near_copies > 1 && !pending) 804 837 new_distance = 0; 805 838 806 839 /* for far > 1 always use the lowest address */ ··· 818 833 else 819 834 new_distance = abs(r10_bio->devs[slot].addr - 820 835 conf->mirrors[disk].head_position); 836 + 821 837 if (new_distance < best_dist) { 822 838 best_dist = new_distance; 823 - best_slot = slot; 824 - best_rdev = rdev; 839 + best_dist_slot = slot; 840 + best_dist_rdev = rdev; 825 841 } 826 842 } 827 843 if (slot >= conf->copies) { 828 - slot = best_slot; 829 - rdev = best_rdev; 844 + if (has_nonrot_disk) { 845 + slot = best_pending_slot; 846 + rdev = best_pending_rdev; 847 + } else { 848 + slot = best_dist_slot; 849 + rdev = best_dist_rdev; 850 + } 830 851 } 831 852 832 853 if (slot >= 0) { ··· 3666 3675 3667 3676 conf->geo = geo; 3668 3677 conf->copies = copies; 3669 - err = mempool_init(&conf->r10bio_pool, NR_RAID10_BIOS, r10bio_pool_alloc, 3670 - 
r10bio_pool_free, conf); 3678 + err = mempool_init(&conf->r10bio_pool, NR_RAID_BIOS, r10bio_pool_alloc, 3679 + rbio_pool_free, conf); 3671 3680 if (err) 3672 3681 goto out; 3673 3682 ··· 4771 4780 int idx = 0; 4772 4781 struct page **pages; 4773 4782 4774 - r10b = kmalloc(sizeof(*r10b) + 4775 - sizeof(struct r10dev) * conf->copies, GFP_NOIO); 4783 + r10b = kmalloc(struct_size(r10b, devs, conf->copies), GFP_NOIO); 4776 4784 if (!r10b) { 4777 4785 set_bit(MD_RECOVERY_INTR, &mddev->recovery); 4778 4786 return -ENOMEM;
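
The read_balance() change above tracks two candidates per request: the mirror with the shortest head distance, and the non-rotational mirror with the fewest pending I/Os. The second one wins whenever any non-rotational disk is present. A simplified sketch of that selection follows. `struct mirror` and `pick_mirror` are hypothetical names, and the sketch deliberately ignores bad blocks, failfast, and the near/far-copies special cases handled in the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-mirror snapshot; not a kernel structure. */
struct mirror {
	bool nonrot;		/* non-rotational device (e.g. SSD) */
	unsigned int pending;	/* I/Os currently in flight */
	uint64_t distance;	/* seek distance from the last head position */
};

/* Returns the index of the chosen mirror, or -1 if there are none. */
static int pick_mirror(const struct mirror *m, size_t n)
{
	int best_dist_slot = -1, best_pending_slot = -1;
	uint64_t best_dist = 0;
	unsigned int min_pending = 0;
	bool has_nonrot = false;

	assert(m != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		has_nonrot |= m[i].nonrot;

		/* Candidate 1: least-loaded non-rotational mirror. */
		if (m[i].nonrot &&
		    (best_pending_slot < 0 || m[i].pending < min_pending)) {
			min_pending = m[i].pending;
			best_pending_slot = (int)i;
		}

		/* Candidate 2: mirror with the shortest seek distance. */
		if (best_dist_slot < 0 || m[i].distance < best_dist) {
			best_dist = m[i].distance;
			best_dist_slot = (int)i;
		}
	}

	return has_nonrot ? best_pending_slot : best_dist_slot;
}
```

The design choice this illustrates: seek distance is a poor predictor of latency on devices without a moving head, so queue depth is used there instead, while all-rotational arrays keep the existing distance-based behaviour.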

+9 -3

drivers/md/raid5.c

···
 		rcu_read_unlock();
 		raid_bio->bi_next = (void*)rdev;
 		bio_set_dev(align_bi, rdev->bdev);
-		bio_clear_flag(align_bi, BIO_SEG_VALID);

 		if (is_badblock(rdev, align_bi->bi_iter.bi_sector,
 				bio_sectors(align_bi),
···
 static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 {
 	struct r5conf *conf = mddev->private;
-	int err = -EEXIST;
+	int ret, err = -EEXIST;
 	int disk;
 	struct disk_info *p;
 	int first = 0;
···
 		 * The array is in readonly mode if journal is missing, so no
 		 * write requests running. We should be safe
 		 */
-		log_init(conf, rdev, false);
+		ret = log_init(conf, rdev, false);
+		if (ret)
+			return ret;
+
+		ret = r5l_start(conf->log);
+		if (ret)
+			return ret;
+
 		return 0;
 	}
 	if (mddev->recovery_disabled == conf->recovery_disabled)

+34 -11

drivers/nvme/host/core.c

··· 1113 1113 return id; 1114 1114 } 1115 1115 1116 - static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11, 1117 - void *buffer, size_t buflen, u32 *result) 1116 + static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid, 1117 + unsigned int dword11, void *buffer, size_t buflen, u32 *result) 1118 1118 { 1119 1119 struct nvme_command c; 1120 1120 union nvme_result res; 1121 1121 int ret; 1122 1122 1123 1123 memset(&c, 0, sizeof(c)); 1124 - c.features.opcode = nvme_admin_set_features; 1124 + c.features.opcode = op; 1125 1125 c.features.fid = cpu_to_le32(fid); 1126 1126 c.features.dword11 = cpu_to_le32(dword11); 1127 1127 ··· 1131 1131 *result = le32_to_cpu(res.u32); 1132 1132 return ret; 1133 1133 } 1134 + 1135 + int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid, 1136 + unsigned int dword11, void *buffer, size_t buflen, 1137 + u32 *result) 1138 + { 1139 + return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer, 1140 + buflen, result); 1141 + } 1142 + EXPORT_SYMBOL_GPL(nvme_set_features); 1143 + 1144 + int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid, 1145 + unsigned int dword11, void *buffer, size_t buflen, 1146 + u32 *result) 1147 + { 1148 + return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer, 1149 + buflen, result); 1150 + } 1151 + EXPORT_SYMBOL_GPL(nvme_get_features); 1134 1152 1135 1153 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count) 1136 1154 { ··· 3336 3318 device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups); 3337 3319 3338 3320 nvme_mpath_add_disk(ns, id); 3339 - nvme_fault_inject_init(ns); 3321 + nvme_fault_inject_init(&ns->fault_inject, ns->disk->disk_name); 3340 3322 kfree(id); 3341 3323 3342 3324 return 0; ··· 3361 3343 if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags)) 3362 3344 return; 3363 3345 3364 - nvme_fault_inject_fini(ns); 3346 + nvme_fault_inject_fini(&ns->fault_inject); 3347 + 3348 + 
mutex_lock(&ns->ctrl->subsys->lock); 3349 + list_del_rcu(&ns->siblings); 3350 + mutex_unlock(&ns->ctrl->subsys->lock); 3351 + synchronize_rcu(); /* guarantee not available in head->list */ 3352 + nvme_mpath_clear_current_path(ns); 3353 + synchronize_srcu(&ns->head->srcu); /* wait for concurrent submissions */ 3354 + 3365 3355 if (ns->disk && ns->disk->flags & GENHD_FL_UP) { 3366 3356 del_gendisk(ns->disk); 3367 3357 blk_cleanup_queue(ns->queue); ··· 3377 3351 blk_integrity_unregister(ns->disk); 3378 3352 } 3379 3353 3380 - mutex_lock(&ns->ctrl->subsys->lock); 3381 - list_del_rcu(&ns->siblings); 3382 - nvme_mpath_clear_current_path(ns); 3383 - mutex_unlock(&ns->ctrl->subsys->lock); 3384 - 3385 3354 down_write(&ns->ctrl->namespaces_rwsem); 3386 3355 list_del_init(&ns->list); 3387 3356 up_write(&ns->ctrl->namespaces_rwsem); 3388 3357 3389 - synchronize_srcu(&ns->head->srcu); 3390 3358 nvme_mpath_check_last_path(ns); 3391 3359 nvme_put_ns(ns); 3392 3360 } ··· 3722 3702 3723 3703 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl) 3724 3704 { 3705 + nvme_fault_inject_fini(&ctrl->fault_inject); 3725 3706 dev_pm_qos_hide_latency_tolerance(ctrl->device); 3726 3707 cdev_device_del(&ctrl->cdev, ctrl->device); 3727 3708 } ··· 3817 3796 ctrl->device->power.set_latency_tolerance = nvme_set_latency_tolerance; 3818 3797 dev_pm_qos_update_user_latency_tolerance(ctrl->device, 3819 3798 min(default_ps_max_latency_us, (unsigned long)S32_MAX)); 3799 + 3800 + nvme_fault_inject_init(&ctrl->fault_inject, dev_name(ctrl->device)); 3820 3801 3821 3802 return 0; 3822 3803 out_free_name:

+1 -1

drivers/nvme/host/fabrics.c

···
 	switch (ctrl->state) {
 	case NVME_CTRL_NEW:
 	case NVME_CTRL_CONNECTING:
-		if (req->cmd->common.opcode == nvme_fabrics_command &&
+		if (nvme_is_fabrics(req->cmd) &&
 		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
 			return true;
 		break;

+22 -19

drivers/nvme/host/fault_inject.c

··· 15 15 static char *fail_request; 16 16 module_param(fail_request, charp, 0000); 17 17 18 - void nvme_fault_inject_init(struct nvme_ns *ns) 18 + void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj, 19 + const char *dev_name) 19 20 { 20 21 struct dentry *dir, *parent; 21 - char *name = ns->disk->disk_name; 22 - struct nvme_fault_inject *fault_inj = &ns->fault_inject; 23 22 struct fault_attr *attr = &fault_inj->attr; 24 23 25 24 /* set default fault injection attribute */ ··· 26 27 setup_fault_attr(&fail_default_attr, fail_request); 27 28 28 29 /* create debugfs directory and attribute */ 29 - parent = debugfs_create_dir(name, NULL); 30 + parent = debugfs_create_dir(dev_name, NULL); 30 31 if (!parent) { 31 - pr_warn("%s: failed to create debugfs directory\n", name); 32 + pr_warn("%s: failed to create debugfs directory\n", dev_name); 32 33 return; 33 34 } 34 35 35 36 *attr = fail_default_attr; 36 37 dir = fault_create_debugfs_attr("fault_inject", parent, attr); 37 38 if (IS_ERR(dir)) { 38 - pr_warn("%s: failed to create debugfs attr\n", name); 39 + pr_warn("%s: failed to create debugfs attr\n", dev_name); 39 40 debugfs_remove_recursive(parent); 40 41 return; 41 42 } 42 - ns->fault_inject.parent = parent; 43 + fault_inj->parent = parent; 43 44 44 45 /* create debugfs for status code and dont_retry */ 45 46 fault_inj->status = NVME_SC_INVALID_OPCODE; ··· 48 49 debugfs_create_bool("dont_retry", 0600, dir, &fault_inj->dont_retry); 49 50 } 50 51 51 - void nvme_fault_inject_fini(struct nvme_ns *ns) 52 + void nvme_fault_inject_fini(struct nvme_fault_inject *fault_inject) 52 53 { 53 54 /* remove debugfs directories */ 54 - debugfs_remove_recursive(ns->fault_inject.parent); 55 + debugfs_remove_recursive(fault_inject->parent); 55 56 } 56 57 57 58 void nvme_should_fail(struct request *req) 58 59 { 59 60 struct gendisk *disk = req->rq_disk; 60 - struct nvme_ns *ns = NULL; 61 + struct nvme_fault_inject *fault_inject = NULL; 61 62 u16 status; 62 63 63 - /* 64 - * 
make sure this request is coming from a valid namespace 65 - */ 66 - if (!disk) 67 - return; 64 + if (disk) { 65 + struct nvme_ns *ns = disk->private_data; 68 66 69 - ns = disk->private_data; 70 - if (ns && should_fail(&ns->fault_inject.attr, 1)) { 67 + if (ns) 68 + fault_inject = &ns->fault_inject; 69 + else 70 + WARN_ONCE(1, "No namespace found for request\n"); 71 + } else { 72 + fault_inject = &nvme_req(req)->ctrl->fault_inject; 73 + } 74 + 75 + if (fault_inject && should_fail(&fault_inject->attr, 1)) { 71 76 /* inject status code and DNR bit */ 72 - status = ns->fault_inject.status; 73 - if (ns->fault_inject.dont_retry) 77 + status = fault_inject->status; 78 + if (fault_inject->dont_retry) 74 79 status |= NVME_SC_DNR; 75 80 nvme_req(req)->status = status; 76 81 }
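
With this change nvme_should_fail() no longer bails out when a request has no disk. It picks the namespace's fault-injection context for I/O requests and falls back to the controller's context for requests without a disk, such as admin commands. A small user-space sketch of that selection follows. All type and function names here are hypothetical, the `armed` flag stands in for the kernel's should_fail() decision, and the `SC_DNR` value is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SC_DNR 0x4000u	/* illustrative "do not retry" status bit */

/* Hypothetical stand-in for struct nvme_fault_inject. */
struct fault_inject {
	bool armed;		/* stands in for should_fail() returning true */
	bool dont_retry;	/* set the DNR bit on injected failures */
	uint16_t status;	/* status code to inject */
};

struct ns   { struct fault_inject fi; };
struct ctrl { struct fault_inject fi; };

/* Hypothetical request: has_disk mirrors "req->rq_disk != NULL". */
struct req {
	bool has_disk;
	struct ns *ns;
	struct ctrl *ctrl;
	uint16_t status;
};

/* Namespace context for I/O requests, controller context otherwise. */
static struct fault_inject *select_fault_inject(struct req *rq)
{
	if (rq->has_disk)
		return rq->ns ? &rq->ns->fi : NULL;	/* the kernel warns once here */
	return &rq->ctrl->fi;
}

/* Injects the configured status (plus DNR if requested) when armed. */
static void maybe_fail(struct req *rq)
{
	struct fault_inject *fi = select_fault_inject(rq);

	if (fi && fi->armed) {
		uint16_t status = fi->status;

		if (fi->dont_retry)
			status |= SC_DNR;
		rq->status = status;
	}
}
```

Passing the context explicitly, rather than a namespace pointer, is what lets the same init, fini and injection paths serve both namespaces and controllers in the diff above.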

+6

drivers/nvme/host/fc.c

···
 	if (nvme_fc_ctlr_active_on_rport(ctrl))
 		return -ENOTUNIQ;

+	dev_info(ctrl->ctrl.device,
+		"NVME-FC{%d}: create association : host wwpn 0x%016llx "
+		" rport wwpn 0x%016llx: NQN \"%s\"\n",
+		ctrl->cnum, ctrl->lport->localport.port_name,
+		ctrl->rport->remoteport.port_name, ctrl->ctrl.opts->subsysnqn);
+
 	/*
 	 * Create the admin queue
 	 */

+1 -1

drivers/nvme/host/lightnvm.c

···
 	rq->cmd_flags &= ~REQ_FAILFAST_DRIVER;

 	if (rqd->bio)
-		blk_init_request_from_bio(rq, rqd->bio);
+		blk_rq_append_bio(rq, &rqd->bio);
 	else
 		rq->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);


+27 -15

drivers/nvme/host/nvme.h

··· 146 146 NVME_CTRL_DEAD, 147 147 }; 148 148 149 + struct nvme_fault_inject { 150 + #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS 151 + struct fault_attr attr; 152 + struct dentry *parent; 153 + bool dont_retry; /* DNR, do not retry */ 154 + u16 status; /* status code */ 155 + #endif 156 + }; 157 + 149 158 struct nvme_ctrl { 150 159 bool comp_seen; 151 160 enum nvme_ctrl_state state; ··· 256 247 257 248 struct page *discard_page; 258 249 unsigned long discard_page_busy; 250 + 251 + struct nvme_fault_inject fault_inject; 259 252 }; 260 253 261 254 enum nvme_iopolicy { ··· 324 313 #endif 325 314 }; 326 315 327 - #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS 328 - struct nvme_fault_inject { 329 - struct fault_attr attr; 330 - struct dentry *parent; 331 - bool dont_retry; /* DNR, do not retry */ 332 - u16 status; /* status code */ 333 - }; 334 - #endif 335 - 336 316 struct nvme_ns { 337 317 struct list_head list; 338 318 ··· 351 349 #define NVME_NS_ANA_PENDING 2 352 350 u16 noiob; 353 351 354 - #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS 355 352 struct nvme_fault_inject fault_inject; 356 - #endif 357 353 358 354 }; 359 355 ··· 372 372 }; 373 373 374 374 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS 375 - void nvme_fault_inject_init(struct nvme_ns *ns); 376 - void nvme_fault_inject_fini(struct nvme_ns *ns); 375 + void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj, 376 + const char *dev_name); 377 + void nvme_fault_inject_fini(struct nvme_fault_inject *fault_inject); 377 378 void nvme_should_fail(struct request *req); 378 379 #else 379 - static inline void nvme_fault_inject_init(struct nvme_ns *ns) {} 380 - static inline void nvme_fault_inject_fini(struct nvme_ns *ns) {} 380 + static inline void nvme_fault_inject_init(struct nvme_fault_inject *fault_inj, 381 + const char *dev_name) 382 + { 383 + } 384 + static inline void nvme_fault_inject_fini(struct nvme_fault_inject *fault_inj) 385 + { 386 + } 381 387 static inline void nvme_should_fail(struct request *req) {} 382 388 #endif 383 
389 ··· 465 459 union nvme_result *result, void *buffer, unsigned bufflen, 466 460 unsigned timeout, int qid, int at_head, 467 461 blk_mq_req_flags_t flags, bool poll); 462 + int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid, 463 + unsigned int dword11, void *buffer, size_t buflen, 464 + u32 *result); 465 + int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid, 466 + unsigned int dword11, void *buffer, size_t buflen, 467 + u32 *result); 468 468 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count); 469 469 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl); 470 470 int nvme_reset_ctrl(struct nvme_ctrl *ctrl);

+111 -32

drivers/nvme/host/pci.c

··· 18 18 #include <linux/mutex.h> 19 19 #include <linux/once.h> 20 20 #include <linux/pci.h> 21 + #include <linux/suspend.h> 21 22 #include <linux/t10-pi.h> 22 23 #include <linux/types.h> 23 24 #include <linux/io-64-nonatomic-lo-hi.h> ··· 68 67 module_param_cb(io_queue_depth, &io_queue_depth_ops, &io_queue_depth, 0644); 69 68 MODULE_PARM_DESC(io_queue_depth, "set io queue depth, should >= 2"); 70 69 71 - static int queue_count_set(const char *val, const struct kernel_param *kp); 72 - static const struct kernel_param_ops queue_count_ops = { 73 - .set = queue_count_set, 74 - .get = param_get_int, 75 - }; 76 - 77 70 static int write_queues; 78 - module_param_cb(write_queues, &queue_count_ops, &write_queues, 0644); 71 + module_param(write_queues, int, 0644); 79 72 MODULE_PARM_DESC(write_queues, 80 73 "Number of queues to use for writes. If not set, reads and writes " 81 74 "will share a queue set."); 82 75 83 - static int poll_queues = 0; 84 - module_param_cb(poll_queues, &queue_count_ops, &poll_queues, 0644); 76 + static int poll_queues; 77 + module_param(poll_queues, int, 0644); 85 78 MODULE_PARM_DESC(poll_queues, "Number of queues to use for polled IO."); 86 79 87 80 struct nvme_dev; ··· 111 116 u32 cmbsz; 112 117 u32 cmbloc; 113 118 struct nvme_ctrl ctrl; 119 + u32 last_ps; 114 120 115 121 mempool_t *iod_mempool; 116 122 ··· 136 140 ret = kstrtoint(val, 10, &n); 137 141 if (ret != 0 || n < 2) 138 142 return -EINVAL; 139 - 140 - return param_set_int(val, kp); 141 - } 142 - 143 - static int queue_count_set(const char *val, const struct kernel_param *kp) 144 - { 145 - int n, ret; 146 - 147 - ret = kstrtoint(val, 10, &n); 148 - if (ret) 149 - return ret; 150 - if (n > num_possible_cpus()) 151 - n = num_possible_cpus(); 152 143 153 144 return param_set_int(val, kp); 154 145 } ··· 2051 2068 .priv = dev, 2052 2069 }; 2053 2070 unsigned int irq_queues, this_p_queues; 2071 + unsigned int nr_cpus = num_possible_cpus(); 2054 2072 2055 2073 /* 2056 2074 * Poll queues don't 
need interrupts, but we need at least one IO ··· 2062 2078 this_p_queues = nr_io_queues - 1; 2063 2079 irq_queues = 1; 2064 2080 } else { 2065 - irq_queues = nr_io_queues - this_p_queues + 1; 2081 + if (nr_cpus < nr_io_queues - this_p_queues) 2082 + irq_queues = nr_cpus + 1; 2083 + else 2084 + irq_queues = nr_io_queues - this_p_queues + 1; 2066 2085 } 2067 2086 dev->io_queues[HCTX_TYPE_POLL] = this_p_queues; 2068 2087 ··· 2451 2464 kfree(dev); 2452 2465 } 2453 2466 2454 - static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status) 2467 + static void nvme_remove_dead_ctrl(struct nvme_dev *dev) 2455 2468 { 2456 - dev_warn(dev->ctrl.device, "Removing after probe failure status: %d\n", status); 2457 - 2458 2469 nvme_get_ctrl(&dev->ctrl); 2459 2470 nvme_dev_disable(dev, false); 2460 2471 nvme_kill_queues(&dev->ctrl); ··· 2465 2480 struct nvme_dev *dev = 2466 2481 container_of(work, struct nvme_dev, ctrl.reset_work); 2467 2482 bool was_suspend = !!(dev->ctrl.ctrl_config & NVME_CC_SHN_NORMAL); 2468 - int result = -ENODEV; 2483 + int result; 2469 2484 enum nvme_ctrl_state new_state = NVME_CTRL_LIVE; 2470 2485 2471 - if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) 2486 + if (WARN_ON(dev->ctrl.state != NVME_CTRL_RESETTING)) { 2487 + result = -ENODEV; 2472 2488 goto out; 2489 + } 2473 2490 2474 2491 /* 2475 2492 * If we're called to reset a live controller first shut it down before ··· 2515 2528 if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING)) { 2516 2529 dev_warn(dev->ctrl.device, 2517 2530 "failed to mark controller CONNECTING\n"); 2531 + result = -EBUSY; 2518 2532 goto out; 2519 2533 } 2520 2534 ··· 2576 2588 if (!nvme_change_ctrl_state(&dev->ctrl, new_state)) { 2577 2589 dev_warn(dev->ctrl.device, 2578 2590 "failed to mark controller state %d\n", new_state); 2591 + result = -ENODEV; 2579 2592 goto out; 2580 2593 } 2581 2594 ··· 2586 2597 out_unlock: 2587 2598 mutex_unlock(&dev->shutdown_lock); 2588 2599 out: 2589 - nvme_remove_dead_ctrl(dev, 
result); 2600 + if (result) 2601 + dev_warn(dev->ctrl.device, 2602 + "Removing after probe failure status: %d\n", result); 2603 + nvme_remove_dead_ctrl(dev); 2590 2604 } 2591 2605 2592 2606 static void nvme_remove_dead_ctrl_work(struct work_struct *work) ··· 2827 2835 } 2828 2836 2829 2837 #ifdef CONFIG_PM_SLEEP 2838 + static int nvme_get_power_state(struct nvme_ctrl *ctrl, u32 *ps) 2839 + { 2840 + return nvme_get_features(ctrl, NVME_FEAT_POWER_MGMT, 0, NULL, 0, ps); 2841 + } 2842 + 2843 + static int nvme_set_power_state(struct nvme_ctrl *ctrl, u32 ps) 2844 + { 2845 + return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL); 2846 + } 2847 + 2848 + static int nvme_resume(struct device *dev) 2849 + { 2850 + struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev)); 2851 + struct nvme_ctrl *ctrl = &ndev->ctrl; 2852 + 2853 + if (pm_resume_via_firmware() || !ctrl->npss || 2854 + nvme_set_power_state(ctrl, ndev->last_ps) != 0) 2855 + nvme_reset_ctrl(ctrl); 2856 + return 0; 2857 + } 2858 + 2830 2859 static int nvme_suspend(struct device *dev) 2831 2860 { 2832 2861 struct pci_dev *pdev = to_pci_dev(dev); 2833 2862 struct nvme_dev *ndev = pci_get_drvdata(pdev); 2863 + struct nvme_ctrl *ctrl = &ndev->ctrl; 2864 + int ret = -EBUSY; 2865 + 2866 + /* 2867 + * The platform does not remove power for a kernel managed suspend so 2868 + * use host managed nvme power settings for lowest idle power if 2869 + * possible. This should have quicker resume latency than a full device 2870 + * shutdown. But if the firmware is involved after the suspend or the 2871 + * device does not support any non-default power states, shut down the 2872 + * device fully. 
2873 + */ 2874 + if (pm_suspend_via_firmware() || !ctrl->npss) { 2875 + nvme_dev_disable(ndev, true); 2876 + return 0; 2877 + } 2878 + 2879 + nvme_start_freeze(ctrl); 2880 + nvme_wait_freeze(ctrl); 2881 + nvme_sync_queues(ctrl); 2882 + 2883 + if (ctrl->state != NVME_CTRL_LIVE && 2884 + ctrl->state != NVME_CTRL_ADMIN_ONLY) 2885 + goto unfreeze; 2886 + 2887 + ndev->last_ps = 0; 2888 + ret = nvme_get_power_state(ctrl, &ndev->last_ps); 2889 + if (ret < 0) 2890 + goto unfreeze; 2891 + 2892 + ret = nvme_set_power_state(ctrl, ctrl->npss); 2893 + if (ret < 0) 2894 + goto unfreeze; 2895 + 2896 + if (ret) { 2897 + /* 2898 + * Clearing npss forces a controller reset on resume. The 2899 + * correct value will be resdicovered then. 2900 + */ 2901 + nvme_dev_disable(ndev, true); 2902 + ctrl->npss = 0; 2903 + ret = 0; 2904 + goto unfreeze; 2905 + } 2906 + /* 2907 + * A saved state prevents pci pm from generically controlling the 2908 + * device's power. If we're using protocol specific settings, we don't 2909 + * want pci interfering. 
2910 + */ 2911 + pci_save_state(pdev); 2912 + unfreeze: 2913 + nvme_unfreeze(ctrl); 2914 + return ret; 2915 + } 2916 + 2917 + static int nvme_simple_suspend(struct device *dev) 2918 + { 2919 + struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev)); 2834 2920 2835 2921 nvme_dev_disable(ndev, true); 2836 2922 return 0; 2837 2923 } 2838 2924 2839 - static int nvme_resume(struct device *dev) 2925 + static int nvme_simple_resume(struct device *dev) 2840 2926 { 2841 2927 struct pci_dev *pdev = to_pci_dev(dev); 2842 2928 struct nvme_dev *ndev = pci_get_drvdata(pdev); ··· 2922 2852 nvme_reset_ctrl(&ndev->ctrl); 2923 2853 return 0; 2924 2854 } 2925 - #endif 2926 2855 2927 - static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume); 2856 + const struct dev_pm_ops nvme_dev_pm_ops = { 2857 + .suspend = nvme_suspend, 2858 + .resume = nvme_resume, 2859 + .freeze = nvme_simple_suspend, 2860 + .thaw = nvme_simple_resume, 2861 + .poweroff = nvme_simple_suspend, 2862 + .restore = nvme_simple_resume, 2863 + }; 2864 + #endif /* CONFIG_PM_SLEEP */ 2928 2865 2929 2866 static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev, 2930 2867 pci_channel_state_t state) ··· 3036 2959 .probe = nvme_probe, 3037 2960 .remove = nvme_remove, 3038 2961 .shutdown = nvme_shutdown, 2962 + #ifdef CONFIG_PM_SLEEP 3039 2963 .driver = { 3040 2964 .pm = &nvme_dev_pm_ops, 3041 2965 }, 2966 + #endif 3042 2967 .sriov_configure = pci_sriov_configure_simple, 3043 2968 .err_handler = &nvme_err_handler, 3044 2969 };
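
The suspend/resume policy added to pci.c above can be read as two small decisions. The following is a minimal userspace sketch of that policy only, not the driver code: `struct ctrl_model`, `enum pm_action`, `model_suspend()` and `model_resume()` are hypothetical names introduced for illustration, and the `via_firmware` flag stands in for `pm_suspend_via_firmware()` / `pm_resume_via_firmware()`. It assumes the result of setting the power state is passed in as a plain status value, whereas the driver only issues that command once the first two checks have passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one controller field the policy reads. */
struct ctrl_model {
	unsigned int npss;	/* number of non-default power states */
};

enum pm_action {
	PM_FULL_SHUTDOWN,	/* like nvme_dev_disable(ndev, true) */
	PM_LOW_POWER_STATE,	/* like nvme_set_power_state(ctrl, npss) */
	PM_RESET_CTRL,		/* like nvme_reset_ctrl(ctrl) */
	PM_RESTORE_POWER_STATE,	/* saved power state was restored */
};

/* Suspend: shut down fully if firmware takes over or no power states exist. */
static enum pm_action model_suspend(const struct ctrl_model *ctrl,
				    bool via_firmware)
{
	assert(ctrl != NULL);
	if (via_firmware || ctrl->npss == 0)
		return PM_FULL_SHUTDOWN;
	return PM_LOW_POWER_STATE;
}

/*
 * Resume: reset the controller unless the saved power state could be
 * restored. set_ps_status models the return value of the set-features
 * call (0 means success).
 */
static enum pm_action model_resume(const struct ctrl_model *ctrl,
				   bool via_firmware, int set_ps_status)
{
	assert(ctrl != NULL);
	if (via_firmware || ctrl->npss == 0 || set_ps_status != 0)
		return PM_RESET_CTRL;
	return PM_RESTORE_POWER_STATE;
}
```

Note how clearing `npss` in the suspend error path of the diff feeds into the resume check: with `npss == 0`, resume always takes the reset branch.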

+63 -1

drivers/nvme/host/trace.c

··· 135 135 } 136 136 } 137 137 138 + static const char *nvme_trace_fabrics_property_set(struct trace_seq *p, u8 *spc) 139 + { 140 + const char *ret = trace_seq_buffer_ptr(p); 141 + u8 attrib = spc[0]; 142 + u32 ofst = get_unaligned_le32(spc + 4); 143 + u64 value = get_unaligned_le64(spc + 8); 144 + 145 + trace_seq_printf(p, "attrib=%u, ofst=0x%x, value=0x%llx", 146 + attrib, ofst, value); 147 + trace_seq_putc(p, 0); 148 + return ret; 149 + } 150 + 151 + static const char *nvme_trace_fabrics_connect(struct trace_seq *p, u8 *spc) 152 + { 153 + const char *ret = trace_seq_buffer_ptr(p); 154 + u16 recfmt = get_unaligned_le16(spc); 155 + u16 qid = get_unaligned_le16(spc + 2); 156 + u16 sqsize = get_unaligned_le16(spc + 4); 157 + u8 cattr = spc[6]; 158 + u32 kato = get_unaligned_le32(spc + 8); 159 + 160 + trace_seq_printf(p, "recfmt=%u, qid=%u, sqsize=%u, cattr=%u, kato=%u", 161 + recfmt, qid, sqsize, cattr, kato); 162 + trace_seq_putc(p, 0); 163 + return ret; 164 + } 165 + 166 + static const char *nvme_trace_fabrics_property_get(struct trace_seq *p, u8 *spc) 167 + { 168 + const char *ret = trace_seq_buffer_ptr(p); 169 + u8 attrib = spc[0]; 170 + u32 ofst = get_unaligned_le32(spc + 4); 171 + 172 + trace_seq_printf(p, "attrib=%u, ofst=0x%x", attrib, ofst); 173 + trace_seq_putc(p, 0); 174 + return ret; 175 + } 176 + 177 + static const char *nvme_trace_fabrics_common(struct trace_seq *p, u8 *spc) 178 + { 179 + const char *ret = trace_seq_buffer_ptr(p); 180 + 181 + trace_seq_printf(p, "spcecific=%*ph", 24, spc); 182 + trace_seq_putc(p, 0); 183 + return ret; 184 + } 185 + 186 + const char *nvme_trace_parse_fabrics_cmd(struct trace_seq *p, 187 + u8 fctype, u8 *spc) 188 + { 189 + switch (fctype) { 190 + case nvme_fabrics_type_property_set: 191 + return nvme_trace_fabrics_property_set(p, spc); 192 + case nvme_fabrics_type_connect: 193 + return nvme_trace_fabrics_connect(p, spc); 194 + case nvme_fabrics_type_property_get: 195 + return nvme_trace_fabrics_property_get(p, spc); 196 
+ default: 197 + return nvme_trace_fabrics_common(p, spc); 198 + } 199 + } 200 + 138 201 const char *nvme_trace_disk_name(struct trace_seq *p, char *name) 139 202 { 140 203 const char *ret = trace_seq_buffer_ptr(p); ··· 208 145 209 146 return ret; 210 147 } 211 - EXPORT_SYMBOL_GPL(nvme_trace_disk_name); 212 148 213 149 EXPORT_TRACEPOINT_SYMBOL_GPL(nvme_sq);
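
The fabrics decoders added above all follow the same pattern: read little-endian fields at fixed byte offsets from the command-specific area and print them. The sketch below replays the Connect decoder in userspace, using the offsets shown in `nvme_trace_fabrics_connect()`. The helpers `le16_at()` and `le32_at()` are hypothetical stand-ins for the kernel's `get_unaligned_le16()` and `get_unaligned_le32()`, and `struct connect_fields`, `decode_connect()` and `print_connect()` are illustrative names that do not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins for the kernel's unaligned LE readers. */
static uint16_t le16_at(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32_at(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

struct connect_fields {
	uint16_t recfmt;
	uint16_t qid;
	uint16_t sqsize;
	uint8_t cattr;
	uint32_t kato;
};

/* Same offsets as the diff: 0, 2, 4, 6 and 8 bytes into spc. */
static struct connect_fields decode_connect(const uint8_t *spc)
{
	struct connect_fields f;

	assert(spc != NULL);
	f.recfmt = le16_at(spc);
	f.qid = le16_at(spc + 2);
	f.sqsize = le16_at(spc + 4);
	f.cattr = spc[6];
	f.kato = le32_at(spc + 8);
	return f;
}

/* Mirrors the format string used by the tracepoint helper. */
static void print_connect(const struct connect_fields *f)
{
	printf("recfmt=%u, qid=%u, sqsize=%u, cattr=%u, kato=%u\n",
	       (unsigned int)f->recfmt, (unsigned int)f->qid,
	       (unsigned int)f->sqsize, (unsigned int)f->cattr,
	       (unsigned int)f->kato);
}
```

Reading byte by byte keeps the decode independent of host endianness and alignment, which is the reason the kernel code uses the unaligned accessors rather than casting the buffer.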

+15 -51

drivers/nvme/host/trace.h

··· 16 16 17 17 #include "nvme.h" 18 18 19 - #define nvme_admin_opcode_name(opcode) { opcode, #opcode } 20 - #define show_admin_opcode_name(val) \ 21 - __print_symbolic(val, \ 22 - nvme_admin_opcode_name(nvme_admin_delete_sq), \ 23 - nvme_admin_opcode_name(nvme_admin_create_sq), \ 24 - nvme_admin_opcode_name(nvme_admin_get_log_page), \ 25 - nvme_admin_opcode_name(nvme_admin_delete_cq), \ 26 - nvme_admin_opcode_name(nvme_admin_create_cq), \ 27 - nvme_admin_opcode_name(nvme_admin_identify), \ 28 - nvme_admin_opcode_name(nvme_admin_abort_cmd), \ 29 - nvme_admin_opcode_name(nvme_admin_set_features), \ 30 - nvme_admin_opcode_name(nvme_admin_get_features), \ 31 - nvme_admin_opcode_name(nvme_admin_async_event), \ 32 - nvme_admin_opcode_name(nvme_admin_ns_mgmt), \ 33 - nvme_admin_opcode_name(nvme_admin_activate_fw), \ 34 - nvme_admin_opcode_name(nvme_admin_download_fw), \ 35 - nvme_admin_opcode_name(nvme_admin_ns_attach), \ 36 - nvme_admin_opcode_name(nvme_admin_keep_alive), \ 37 - nvme_admin_opcode_name(nvme_admin_directive_send), \ 38 - nvme_admin_opcode_name(nvme_admin_directive_recv), \ 39 - nvme_admin_opcode_name(nvme_admin_dbbuf), \ 40 - nvme_admin_opcode_name(nvme_admin_format_nvm), \ 41 - nvme_admin_opcode_name(nvme_admin_security_send), \ 42 - nvme_admin_opcode_name(nvme_admin_security_recv), \ 43 - nvme_admin_opcode_name(nvme_admin_sanitize_nvm)) 44 - 45 - #define nvme_opcode_name(opcode) { opcode, #opcode } 46 - #define show_nvm_opcode_name(val) \ 47 - __print_symbolic(val, \ 48 - nvme_opcode_name(nvme_cmd_flush), \ 49 - nvme_opcode_name(nvme_cmd_write), \ 50 - nvme_opcode_name(nvme_cmd_read), \ 51 - nvme_opcode_name(nvme_cmd_write_uncor), \ 52 - nvme_opcode_name(nvme_cmd_compare), \ 53 - nvme_opcode_name(nvme_cmd_write_zeroes), \ 54 - nvme_opcode_name(nvme_cmd_dsm), \ 55 - nvme_opcode_name(nvme_cmd_resv_register), \ 56 - nvme_opcode_name(nvme_cmd_resv_report), \ 57 - nvme_opcode_name(nvme_cmd_resv_acquire), \ 58 - nvme_opcode_name(nvme_cmd_resv_release)) 59 - 
60 - #define show_opcode_name(qid, opcode) \ 61 - (qid ? show_nvm_opcode_name(opcode) : show_admin_opcode_name(opcode)) 62 - 63 19 const char *nvme_trace_parse_admin_cmd(struct trace_seq *p, u8 opcode, 64 20 u8 *cdw10); 65 21 const char *nvme_trace_parse_nvm_cmd(struct trace_seq *p, u8 opcode, 66 22 u8 *cdw10); 23 + const char *nvme_trace_parse_fabrics_cmd(struct trace_seq *p, u8 fctype, 24 + u8 *spc); 67 25 68 - #define parse_nvme_cmd(qid, opcode, cdw10) \ 69 - (qid ? \ 70 - nvme_trace_parse_nvm_cmd(p, opcode, cdw10) : \ 71 - nvme_trace_parse_admin_cmd(p, opcode, cdw10)) 26 + #define parse_nvme_cmd(qid, opcode, fctype, cdw10) \ 27 + ((opcode) == nvme_fabrics_command ? \ 28 + nvme_trace_parse_fabrics_cmd(p, fctype, cdw10) : \ 29 + ((qid) ? \ 30 + nvme_trace_parse_nvm_cmd(p, opcode, cdw10) : \ 31 + nvme_trace_parse_admin_cmd(p, opcode, cdw10))) 72 32 73 33 const char *nvme_trace_disk_name(struct trace_seq *p, char *name); 74 34 #define __print_disk_name(name) \ ··· 53 93 __field(int, qid) 54 94 __field(u8, opcode) 55 95 __field(u8, flags) 96 + __field(u8, fctype) 56 97 __field(u16, cid) 57 98 __field(u32, nsid) 58 99 __field(u64, metadata) ··· 67 106 __entry->cid = cmd->common.command_id; 68 107 __entry->nsid = le32_to_cpu(cmd->common.nsid); 69 108 __entry->metadata = le64_to_cpu(cmd->common.metadata); 109 + __entry->fctype = cmd->fabrics.fctype; 70 110 __assign_disk_name(__entry->disk, req->rq_disk); 71 111 memcpy(__entry->cdw10, &cmd->common.cdw10, 72 112 sizeof(__entry->cdw10)); ··· 76 114 __entry->ctrl_id, __print_disk_name(__entry->disk), 77 115 __entry->qid, __entry->cid, __entry->nsid, 78 116 __entry->flags, __entry->metadata, 79 - show_opcode_name(__entry->qid, __entry->opcode), 80 - parse_nvme_cmd(__entry->qid, __entry->opcode, __entry->cdw10)) 117 + show_opcode_name(__entry->qid, __entry->opcode, 118 + __entry->fctype), 119 + parse_nvme_cmd(__entry->qid, __entry->opcode, 120 + __entry->fctype, __entry->cdw10)) 81 121 ); 82 122 83 123 
TRACE_EVENT(nvme_complete_rq, ··· 105 141 __entry->status = nvme_req(req)->status; 106 142 __assign_disk_name(__entry->disk, req->rq_disk); 107 143 ), 108 - TP_printk("nvme%d: %sqid=%d, cmdid=%u, res=%llu, retries=%u, flags=0x%x, status=%u", 144 + TP_printk("nvme%d: %sqid=%d, cmdid=%u, res=%#llx, retries=%u, flags=0x%x, status=%#x", 109 145 __entry->ctrl_id, __print_disk_name(__entry->disk), 110 146 __entry->qid, __entry->cid, __entry->result, 111 147 __entry->retries, __entry->flags, __entry->status)
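
The reworked `parse_nvme_cmd()` macro above checks for the fabrics opcode before it looks at the queue ID, so a fabrics command is decoded the same way on admin and I/O queues. A minimal sketch of that selection order follows. `enum decoder` and `pick_decoder()` are hypothetical names, and `FABRICS_OPCODE` is assumed to match the kernel's `nvme_fabrics_command` value of 0x7f.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to equal nvme_fabrics_command from linux/nvme.h. */
#define FABRICS_OPCODE 0x7f

enum decoder {
	DECODE_FABRICS,	/* nvme_trace_parse_fabrics_cmd() */
	DECODE_NVM,	/* nvme_trace_parse_nvm_cmd() */
	DECODE_ADMIN,	/* nvme_trace_parse_admin_cmd() */
};

/* Fabrics opcode wins first; otherwise qid 0 is admin, non-zero is I/O. */
static enum decoder pick_decoder(int qid, uint8_t opcode)
{
	assert(qid >= 0);
	if (opcode == FABRICS_OPCODE)
		return DECODE_FABRICS;
	return qid ? DECODE_NVM : DECODE_ADMIN;
}
```

With the old two-way macro, a fabrics command would have gone to the admin or NVM decoder depending only on the queue it arrived on.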

+3

drivers/nvme/target/Makefile

··· 1 1 # SPDX-License-Identifier: GPL-2.0 2 2 3 + ccflags-y += -I$(src) 4 + 3 5 obj-$(CONFIG_NVME_TARGET) += nvmet.o 4 6 obj-$(CONFIG_NVME_TARGET_LOOP) += nvme-loop.o 5 7 obj-$(CONFIG_NVME_TARGET_RDMA) += nvmet-rdma.o ··· 16 14 nvmet-fc-y += fc.o 17 15 nvme-fcloop-y += fcloop.o 18 16 nvmet-tcp-y += tcp.o 17 + nvmet-$(CONFIG_TRACING) += trace.o

+11 -1

drivers/nvme/target/core.c

··· 10 10 #include <linux/pci-p2pdma.h> 11 11 #include <linux/scatterlist.h> 12 12 13 + #define CREATE_TRACE_POINTS 14 + #include "trace.h" 15 + 13 16 #include "nvmet.h" 14 17 15 18 struct workqueue_struct *buffered_io_wq; ··· 314 311 port->inline_data_size = 0; 315 312 316 313 port->enabled = true; 314 + port->tr_ops = ops; 317 315 return 0; 318 316 } 319 317 ··· 325 321 lockdep_assert_held(&nvmet_config_sem); 326 322 327 323 port->enabled = false; 324 + port->tr_ops = NULL; 328 325 329 326 ops = nvmet_transports[port->disc_addr.trtype]; 330 327 ops->remove_port(port); ··· 694 689 695 690 if (unlikely(status)) 696 691 nvmet_set_error(req, status); 692 + 693 + trace_nvmet_req_complete(req); 694 + 697 695 if (req->ns) 698 696 nvmet_put_namespace(req->ns); 699 697 req->ops->queue_response(req); ··· 856 848 req->error_loc = NVMET_NO_ERROR_LOC; 857 849 req->error_slba = 0; 858 850 851 + trace_nvmet_req_init(req, req->cmd); 852 + 859 853 /* no support for fused commands yet */ 860 854 if (unlikely(flags & (NVME_CMD_FUSE_FIRST | NVME_CMD_FUSE_SECOND))) { 861 855 req->error_loc = offsetof(struct nvme_common_command, flags); ··· 881 871 status = nvmet_parse_connect_cmd(req); 882 872 else if (likely(req->sq->qid != 0)) 883 873 status = nvmet_parse_io_cmd(req); 884 - else if (req->cmd->common.opcode == nvme_fabrics_command) 874 + else if (nvme_is_fabrics(req->cmd)) 885 875 status = nvmet_parse_fabrics_cmd(req); 886 876 else if (req->sq->ctrl->subsys->type == NVME_NQN_DISC) 887 877 status = nvmet_parse_discovery_cmd(req);

+4

drivers/nvme/target/discovery.c

··· 41 41 __nvmet_disc_changed(port, ctrl); 42 42 } 43 43 mutex_unlock(&nvmet_disc_subsys->lock); 44 + 45 + /* If transport can signal change, notify transport */ 46 + if (port->tr_ops && port->tr_ops->discovery_chg) 47 + port->tr_ops->discovery_chg(port); 44 48 } 45 49 46 50 static void __nvmet_subsys_disc_changed(struct nvmet_port *port,

+1 -1

drivers/nvme/target/fabrics-cmd.c

··· 268 268 { 269 269 struct nvme_command *cmd = req->cmd; 270 270 271 - if (cmd->common.opcode != nvme_fabrics_command) { 271 + if (!nvme_is_fabrics(cmd)) { 272 272 pr_err("invalid command 0x%x on unconnected queue.\n", 273 273 cmd->fabrics.opcode); 274 274 req->error_loc = offsetof(struct nvme_common_command, opcode);

+12 -1

drivers/nvme/target/fc.c

··· 1806 1806 */ 1807 1807 rspcnt = atomic_inc_return(&fod->queue->zrspcnt); 1808 1808 if (!(rspcnt % fod->queue->ersp_ratio) || 1809 - sqe->opcode == nvme_fabrics_command || 1809 + nvme_is_fabrics((struct nvme_command *) sqe) || 1810 1810 xfr_length != fod->req.transfer_len || 1811 1811 (le16_to_cpu(cqe->status) & 0xFFFE) || cqewd[0] || cqewd[1] || 1812 1812 (sqe->flags & (NVME_CMD_FUSE_FIRST | NVME_CMD_FUSE_SECOND)) || ··· 2549 2549 kfree(pe); 2550 2550 } 2551 2551 2552 + static void 2553 + nvmet_fc_discovery_chg(struct nvmet_port *port) 2554 + { 2555 + struct nvmet_fc_port_entry *pe = port->priv; 2556 + struct nvmet_fc_tgtport *tgtport = pe->tgtport; 2557 + 2558 + if (tgtport && tgtport->ops->discovery_event) 2559 + tgtport->ops->discovery_event(&tgtport->fc_target_port); 2560 + } 2561 + 2552 2562 static const struct nvmet_fabrics_ops nvmet_fc_tgt_fcp_ops = { 2553 2563 .owner = THIS_MODULE, 2554 2564 .type = NVMF_TRTYPE_FC, ··· 2567 2557 .remove_port = nvmet_fc_remove_port, 2568 2558 .queue_response = nvmet_fc_fcp_nvme_cmd_done, 2569 2559 .delete_ctrl = nvmet_fc_delete_ctrl, 2560 + .discovery_chg = nvmet_fc_discovery_chg, 2570 2561 }; 2571 2562 2572 2563 static int __init nvmet_fc_init_module(void)
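
The new `discovery_chg` hook is optional: the core checks both the ops pointer and the callback before calling it, and the FC transport applies the same check to its own `discovery_event` hook. The guarded-callback pattern can be sketched in userspace as below. All names here (`struct transport_ops`, `struct port`, `notify_discovery_change()`, `count_notification()`) are hypothetical simplifications rather than the nvmet types.

```c
#include <assert.h>
#include <stddef.h>

struct port;

/* Hypothetical, reduced version of an ops table with one optional hook. */
struct transport_ops {
	void (*discovery_chg)(struct port *port);
};

struct port {
	const struct transport_ops *tr_ops;	/* NULL while the port is disabled */
	int notified;				/* counts hook invocations */
};

/* Call the hook only if the port is enabled and the transport provides it. */
static void notify_discovery_change(struct port *port)
{
	assert(port != NULL);
	if (port->tr_ops && port->tr_ops->discovery_chg)
		port->tr_ops->discovery_chg(port);
}

/* Example hook: record that the transport was told about the change. */
static void count_notification(struct port *port)
{
	port->notified++;
}

static const struct transport_ops ops_with_hook = {
	.discovery_chg = count_notification,
};

static const struct transport_ops ops_without_hook = {
	.discovery_chg = NULL,
};
```

Because the callback is checked at the call site, transports that do not set `discovery_chg` need no changes, which matches how the diff only touches fc.c and fcloop.c.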

+37

drivers/nvme/target/fcloop.c

··· 231 231 int status; 232 232 }; 233 233 234 + struct fcloop_rscn { 235 + struct fcloop_tport *tport; 236 + struct work_struct work; 237 + }; 238 + 234 239 enum { 235 240 INI_IO_START = 0, 236 241 INI_IO_ACTIVE = 1, ··· 351 346 schedule_work(&tls_req->work); 352 347 353 348 return 0; 349 + } 350 + 351 + /* 352 + * Simulate reception of RSCN and converting it to a initiator transport 353 + * call to rescan a remote port. 354 + */ 355 + static void 356 + fcloop_tgt_rscn_work(struct work_struct *work) 357 + { 358 + struct fcloop_rscn *tgt_rscn = 359 + container_of(work, struct fcloop_rscn, work); 360 + struct fcloop_tport *tport = tgt_rscn->tport; 361 + 362 + if (tport->remoteport) 363 + nvme_fc_rescan_remoteport(tport->remoteport); 364 + kfree(tgt_rscn); 365 + } 366 + 367 + static void 368 + fcloop_tgt_discovery_evt(struct nvmet_fc_target_port *tgtport) 369 + { 370 + struct fcloop_rscn *tgt_rscn; 371 + 372 + tgt_rscn = kzalloc(sizeof(*tgt_rscn), GFP_KERNEL); 373 + if (!tgt_rscn) 374 + return; 375 + 376 + tgt_rscn->tport = tgtport->private; 377 + INIT_WORK(&tgt_rscn->work, fcloop_tgt_rscn_work); 378 + 379 + schedule_work(&tgt_rscn->work); 354 380 } 355 381 356 382 static void ··· 875 839 .fcp_op = fcloop_fcp_op, 876 840 .fcp_abort = fcloop_tgt_fcp_abort, 877 841 .fcp_req_release = fcloop_fcp_req_release, 842 + .discovery_event = fcloop_tgt_discovery_evt, 878 843 .max_hw_queues = FCLOOP_HW_QUEUES, 879 844 .max_sgl_segments = FCLOOP_SGL_SEGS, 880 845 .max_dif_sgl_segments = FCLOOP_SGL_SEGS,

+2

drivers/nvme/target/nvmet.h

··· 140 140 void *priv; 141 141 bool enabled; 142 142 int inline_data_size; 143 + const struct nvmet_fabrics_ops *tr_ops; 143 144 }; 144 145 145 146 static inline struct nvmet_port *to_nvmet_port(struct config_item *item) ··· 278 277 void (*disc_traddr)(struct nvmet_req *req, 279 278 struct nvmet_port *port, char *traddr); 280 279 u16 (*install_queue)(struct nvmet_sq *nvme_sq); 280 + void (*discovery_chg)(struct nvmet_port *port); 281 281 }; 282 282 283 283 #define NVMET_MAX_INLINE_BIOVEC 8

+201

drivers/nvme/target/trace.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + /* 3 + * NVM Express target device driver tracepoints 4 + * Copyright (c) 2018 Johannes Thumshirn, SUSE Linux GmbH 5 + */ 6 + 7 + #include <asm/unaligned.h> 8 + #include "trace.h" 9 + 10 + static const char *nvmet_trace_admin_identify(struct trace_seq *p, u8 *cdw10) 11 + { 12 + const char *ret = trace_seq_buffer_ptr(p); 13 + u8 cns = cdw10[0]; 14 + u16 ctrlid = get_unaligned_le16(cdw10 + 2); 15 + 16 + trace_seq_printf(p, "cns=%u, ctrlid=%u", cns, ctrlid); 17 + trace_seq_putc(p, 0); 18 + 19 + return ret; 20 + } 21 + 22 + static const char *nvmet_trace_admin_get_features(struct trace_seq *p, 23 + u8 *cdw10) 24 + { 25 + const char *ret = trace_seq_buffer_ptr(p); 26 + u8 fid = cdw10[0]; 27 + u8 sel = cdw10[1] & 0x7; 28 + u32 cdw11 = get_unaligned_le32(cdw10 + 4); 29 + 30 + trace_seq_printf(p, "fid=0x%x sel=0x%x cdw11=0x%x", fid, sel, cdw11); 31 + trace_seq_putc(p, 0); 32 + 33 + return ret; 34 + } 35 + 36 + static const char *nvmet_trace_read_write(struct trace_seq *p, u8 *cdw10) 37 + { 38 + const char *ret = trace_seq_buffer_ptr(p); 39 + u64 slba = get_unaligned_le64(cdw10); 40 + u16 length = get_unaligned_le16(cdw10 + 8); 41 + u16 control = get_unaligned_le16(cdw10 + 10); 42 + u32 dsmgmt = get_unaligned_le32(cdw10 + 12); 43 + u32 reftag = get_unaligned_le32(cdw10 + 16); 44 + 45 + trace_seq_printf(p, 46 + "slba=%llu, len=%u, ctrl=0x%x, dsmgmt=%u, reftag=%u", 47 + slba, length, control, dsmgmt, reftag); 48 + trace_seq_putc(p, 0); 49 + 50 + return ret; 51 + } 52 + 53 + static const char *nvmet_trace_dsm(struct trace_seq *p, u8 *cdw10) 54 + { 55 + const char *ret = trace_seq_buffer_ptr(p); 56 + 57 + trace_seq_printf(p, "nr=%u, attributes=%u", 58 + get_unaligned_le32(cdw10), 59 + get_unaligned_le32(cdw10 + 4)); 60 + trace_seq_putc(p, 0); 61 + 62 + return ret; 63 + } 64 + 65 + static const char *nvmet_trace_common(struct trace_seq *p, u8 *cdw10) 66 + { 67 + const char *ret = trace_seq_buffer_ptr(p); 68 + 69 + 
trace_seq_printf(p, "cdw10=%*ph", 24, cdw10); 70 + trace_seq_putc(p, 0); 71 + 72 + return ret; 73 + } 74 + 75 + const char *nvmet_trace_parse_admin_cmd(struct trace_seq *p, 76 + u8 opcode, u8 *cdw10) 77 + { 78 + switch (opcode) { 79 + case nvme_admin_identify: 80 + return nvmet_trace_admin_identify(p, cdw10); 81 + case nvme_admin_get_features: 82 + return nvmet_trace_admin_get_features(p, cdw10); 83 + default: 84 + return nvmet_trace_common(p, cdw10); 85 + } 86 + } 87 + 88 + const char *nvmet_trace_parse_nvm_cmd(struct trace_seq *p, 89 + u8 opcode, u8 *cdw10) 90 + { 91 + switch (opcode) { 92 + case nvme_cmd_read: 93 + case nvme_cmd_write: 94 + case nvme_cmd_write_zeroes: 95 + return nvmet_trace_read_write(p, cdw10); 96 + case nvme_cmd_dsm: 97 + return nvmet_trace_dsm(p, cdw10); 98 + default: 99 + return nvmet_trace_common(p, cdw10); 100 + } 101 + } 102 + 103 + static const char *nvmet_trace_fabrics_property_set(struct trace_seq *p, 104 + u8 *spc) 105 + { 106 + const char *ret = trace_seq_buffer_ptr(p); 107 + u8 attrib = spc[0]; 108 + u32 ofst = get_unaligned_le32(spc + 4); 109 + u64 value = get_unaligned_le64(spc + 8); 110 + 111 + trace_seq_printf(p, "attrib=%u, ofst=0x%x, value=0x%llx", 112 + attrib, ofst, value); 113 + trace_seq_putc(p, 0); 114 + return ret; 115 + } 116 + 117 + static const char *nvmet_trace_fabrics_connect(struct trace_seq *p, 118 + u8 *spc) 119 + { 120 + const char *ret = trace_seq_buffer_ptr(p); 121 + u16 recfmt = get_unaligned_le16(spc); 122 + u16 qid = get_unaligned_le16(spc + 2); 123 + u16 sqsize = get_unaligned_le16(spc + 4); 124 + u8 cattr = spc[6]; 125 + u32 kato = get_unaligned_le32(spc + 8); 126 + 127 + trace_seq_printf(p, "recfmt=%u, qid=%u, sqsize=%u, cattr=%u, kato=%u", 128 + recfmt, qid, sqsize, cattr, kato); 129 + trace_seq_putc(p, 0); 130 + return ret; 131 + } 132 + 133 + static const char *nvmet_trace_fabrics_property_get(struct trace_seq *p, 134 + u8 *spc) 135 + { 136 + const char *ret = trace_seq_buffer_ptr(p); 137 + u8 attrib 
= spc[0]; 138 + u32 ofst = get_unaligned_le32(spc + 4); 139 + 140 + trace_seq_printf(p, "attrib=%u, ofst=0x%x", attrib, ofst); 141 + trace_seq_putc(p, 0); 142 + return ret; 143 + } 144 + 145 + static const char *nvmet_trace_fabrics_common(struct trace_seq *p, u8 *spc) 146 + { 147 + const char *ret = trace_seq_buffer_ptr(p); 148 + 149 + trace_seq_printf(p, "spcecific=%*ph", 24, spc); 150 + trace_seq_putc(p, 0); 151 + return ret; 152 + } 153 + 154 + const char *nvmet_trace_parse_fabrics_cmd(struct trace_seq *p, 155 + u8 fctype, u8 *spc) 156 + { 157 + switch (fctype) { 158 + case nvme_fabrics_type_property_set: 159 + return nvmet_trace_fabrics_property_set(p, spc); 160 + case nvme_fabrics_type_connect: 161 + return nvmet_trace_fabrics_connect(p, spc); 162 + case nvme_fabrics_type_property_get: 163 + return nvmet_trace_fabrics_property_get(p, spc); 164 + default: 165 + return nvmet_trace_fabrics_common(p, spc); 166 + } 167 + } 168 + 169 + const char *nvmet_trace_disk_name(struct trace_seq *p, char *name) 170 + { 171 + const char *ret = trace_seq_buffer_ptr(p); 172 + 173 + if (*name) 174 + trace_seq_printf(p, "disk=%s, ", name); 175 + trace_seq_putc(p, 0); 176 + 177 + return ret; 178 + } 179 + 180 + const char *nvmet_trace_ctrl_name(struct trace_seq *p, struct nvmet_ctrl *ctrl) 181 + { 182 + const char *ret = trace_seq_buffer_ptr(p); 183 + 184 + /* 185 + * XXX: We don't know the controller instance before executing the 186 + * connect command itself because the connect command for the admin 187 + * queue will not provide the cntlid which will be allocated in this 188 + * command. In case of io queues, the controller instance will be 189 + * mapped by the extra data of the connect command. 190 + * If we can know the extra data of the connect command in this stage, 191 + * we can update this print statement later. 
192 + */ 193 + if (ctrl) 194 + trace_seq_printf(p, "%d", ctrl->cntlid); 195 + else 196 + trace_seq_printf(p, "_"); 197 + trace_seq_putc(p, 0); 198 + 199 + return ret; 200 + } 201 +

+141

drivers/nvme/target/trace.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + /* 3 + * NVM Express target device driver tracepoints 4 + * Copyright (c) 2018 Johannes Thumshirn, SUSE Linux GmbH 5 + * 6 + * This is entirely based on drivers/nvme/host/trace.h 7 + */ 8 + 9 + #undef TRACE_SYSTEM 10 + #define TRACE_SYSTEM nvmet 11 + 12 + #if !defined(_TRACE_NVMET_H) || defined(TRACE_HEADER_MULTI_READ) 13 + #define _TRACE_NVMET_H 14 + 15 + #include <linux/nvme.h> 16 + #include <linux/tracepoint.h> 17 + #include <linux/trace_seq.h> 18 + 19 + #include "nvmet.h" 20 + 21 + const char *nvmet_trace_parse_admin_cmd(struct trace_seq *p, u8 opcode, 22 + u8 *cdw10); 23 + const char *nvmet_trace_parse_nvm_cmd(struct trace_seq *p, u8 opcode, 24 + u8 *cdw10); 25 + const char *nvmet_trace_parse_fabrics_cmd(struct trace_seq *p, u8 fctype, 26 + u8 *spc); 27 + 28 + #define parse_nvme_cmd(qid, opcode, fctype, cdw10) \ 29 + ((opcode) == nvme_fabrics_command ? \ 30 + nvmet_trace_parse_fabrics_cmd(p, fctype, cdw10) : \ 31 + (qid ? \ 32 + nvmet_trace_parse_nvm_cmd(p, opcode, cdw10) : \ 33 + nvmet_trace_parse_admin_cmd(p, opcode, cdw10))) 34 + 35 + const char *nvmet_trace_ctrl_name(struct trace_seq *p, struct nvmet_ctrl *ctrl); 36 + #define __print_ctrl_name(ctrl) \ 37 + nvmet_trace_ctrl_name(p, ctrl) 38 + 39 + const char *nvmet_trace_disk_name(struct trace_seq *p, char *name); 40 + #define __print_disk_name(name) \ 41 + nvmet_trace_disk_name(p, name) 42 + 43 + #ifndef TRACE_HEADER_MULTI_READ 44 + static inline struct nvmet_ctrl *nvmet_req_to_ctrl(struct nvmet_req *req) 45 + { 46 + return req->sq->ctrl; 47 + } 48 + 49 + static inline void __assign_disk_name(char *name, struct nvmet_req *req, 50 + bool init) 51 + { 52 + struct nvmet_ctrl *ctrl = nvmet_req_to_ctrl(req); 53 + struct nvmet_ns *ns; 54 + 55 + if ((init && req->sq->qid) || (!init && req->cq->qid)) { 56 + ns = nvmet_find_namespace(ctrl, req->cmd->rw.nsid); 57 + strncpy(name, ns->device_path, DISK_NAME_LEN); 58 + return; 59 + } 60 + 61 + memset(name, 0, 
DISK_NAME_LEN); 62 + } 63 + #endif 64 + 65 + TRACE_EVENT(nvmet_req_init, 66 + TP_PROTO(struct nvmet_req *req, struct nvme_command *cmd), 67 + TP_ARGS(req, cmd), 68 + TP_STRUCT__entry( 69 + __field(struct nvme_command *, cmd) 70 + __field(struct nvmet_ctrl *, ctrl) 71 + __array(char, disk, DISK_NAME_LEN) 72 + __field(int, qid) 73 + __field(u16, cid) 74 + __field(u8, opcode) 75 + __field(u8, fctype) 76 + __field(u8, flags) 77 + __field(u32, nsid) 78 + __field(u64, metadata) 79 + __array(u8, cdw10, 24) 80 + ), 81 + TP_fast_assign( 82 + __entry->cmd = cmd; 83 + __entry->ctrl = nvmet_req_to_ctrl(req); 84 + __assign_disk_name(__entry->disk, req, true); 85 + __entry->qid = req->sq->qid; 86 + __entry->cid = cmd->common.command_id; 87 + __entry->opcode = cmd->common.opcode; 88 + __entry->fctype = cmd->fabrics.fctype; 89 + __entry->flags = cmd->common.flags; 90 + __entry->nsid = le32_to_cpu(cmd->common.nsid); 91 + __entry->metadata = le64_to_cpu(cmd->common.metadata); 92 + memcpy(__entry->cdw10, &cmd->common.cdw10, 93 + sizeof(__entry->cdw10)); 94 + ), 95 + TP_printk("nvmet%s: %sqid=%d, cmdid=%u, nsid=%u, flags=%#x, " 96 + "meta=%#llx, cmd=(%s, %s)", 97 + __print_ctrl_name(__entry->ctrl), 98 + __print_disk_name(__entry->disk), 99 + __entry->qid, __entry->cid, __entry->nsid, 100 + __entry->flags, __entry->metadata, 101 + show_opcode_name(__entry->qid, __entry->opcode, 102 + __entry->fctype), 103 + parse_nvme_cmd(__entry->qid, __entry->opcode, 104 + __entry->fctype, __entry->cdw10)) 105 + ); 106 + 107 + TRACE_EVENT(nvmet_req_complete, 108 + TP_PROTO(struct nvmet_req *req), 109 + TP_ARGS(req), 110 + TP_STRUCT__entry( 111 + __field(struct nvmet_ctrl *, ctrl) 112 + __array(char, disk, DISK_NAME_LEN) 113 + __field(int, qid) 114 + __field(int, cid) 115 + __field(u64, result) 116 + __field(u16, status) 117 + ), 118 + TP_fast_assign( 119 + __entry->ctrl = nvmet_req_to_ctrl(req); 120 + __entry->qid = req->cq->qid; 121 + __entry->cid = req->cqe->command_id; 122 + __entry->result = 
le64_to_cpu(req->cqe->result.u64); 123 + __entry->status = le16_to_cpu(req->cqe->status) >> 1; 124 + __assign_disk_name(__entry->disk, req, false); 125 + ), 126 + TP_printk("nvmet%s: %sqid=%d, cmdid=%u, res=%#llx, status=%#x", 127 + __print_ctrl_name(__entry->ctrl), 128 + __print_disk_name(__entry->disk), 129 + __entry->qid, __entry->cid, __entry->result, __entry->status) 130 + 131 + ); 132 + 133 + #endif /* _TRACE_NVMET_H */ 134 + 135 + #undef TRACE_INCLUDE_PATH 136 + #define TRACE_INCLUDE_PATH . 137 + #undef TRACE_INCLUDE_FILE 138 + #define TRACE_INCLUDE_FILE trace 139 + 140 + /* This part must be outside protection */ 141 + #include <trace/define_trace.h>

+2

drivers/scsi/lpfc/lpfc.h

···
 	uint32_t elsXmitADISC;
 	uint32_t elsXmitLOGO;
 	uint32_t elsXmitSCR;
+	uint32_t elsXmitRSCN;
 	uint32_t elsXmitRNID;
 	uint32_t elsXmitFARP;
 	uint32_t elsXmitFARPR;
···
 	uint32_t cfg_use_msi;
 	uint32_t cfg_auto_imax;
 	uint32_t cfg_fcp_imax;
+	uint32_t cfg_force_rscn;
 	uint32_t cfg_cq_poll_threshold;
 	uint32_t cfg_cq_max_proc_limit;
 	uint32_t cfg_fcp_cpu_map;

+60

drivers/scsi/lpfc/lpfc_attr.c

···
 		lpfc_request_firmware_upgrade_store);

 /**
+ * lpfc_force_rscn_store
+ *
+ * @dev: class device that is converted into a Scsi_host.
+ * @attr: device attribute, not used.
+ * @buf: unused string
+ * @count: unused variable.
+ *
+ * Description:
+ * Force the switch to send a RSCN to all other NPorts in our zone
+ * If we are direct connect pt2pt, build the RSCN command ourself
+ * and send to the other NPort. Not supported for private loop.
+ *
+ * Returns:
+ * 0 - on success
+ * -EIO - if command is not sent
+ **/
+static ssize_t
+lpfc_force_rscn_store(struct device *dev, struct device_attribute *attr,
+		      const char *buf, size_t count)
+{
+	struct Scsi_Host *shost = class_to_shost(dev);
+	struct lpfc_vport *vport = (struct lpfc_vport *)shost->hostdata;
+	int i;
+
+	i = lpfc_issue_els_rscn(vport, 0);
+	if (i)
+		return -EIO;
+	return strlen(buf);
+}
+
+/*
+ * lpfc_force_rscn: Force an RSCN to be sent to all remote NPorts
+ * connected to the HBA.
+ *
+ * Value range is any ascii value
+ */
+static int lpfc_force_rscn;
+module_param(lpfc_force_rscn, int, 0644);
+MODULE_PARM_DESC(lpfc_force_rscn,
+		 "Force an RSCN to be sent to all remote NPorts");
+lpfc_param_show(force_rscn)
+
+/**
+ * lpfc_force_rscn_init - Force an RSCN to be sent to all remote NPorts
+ * @phba: lpfc_hba pointer.
+ * @val: unused value.
+ *
+ * Returns:
+ * zero if val saved.
+ **/
+static int
+lpfc_force_rscn_init(struct lpfc_hba *phba, int val)
+{
+	return 0;
+}
+static DEVICE_ATTR_RW(lpfc_force_rscn);
+
+/**
  * lpfc_fcp_imax_store
  *
  * @dev: class device that is converted into a Scsi_host.
···
 	&dev_attr_lpfc_nvme_oas,
 	&dev_attr_lpfc_nvme_embed_cmd,
 	&dev_attr_lpfc_fcp_imax,
+	&dev_attr_lpfc_force_rscn,
 	&dev_attr_lpfc_cq_poll_threshold,
 	&dev_attr_lpfc_cq_max_proc_limit,
 	&dev_attr_lpfc_fcp_cpu_map,
···
 	lpfc_nvme_oas_init(phba, lpfc_nvme_oas);
 	lpfc_nvme_embed_cmd_init(phba, lpfc_nvme_embed_cmd);
 	lpfc_fcp_imax_init(phba, lpfc_fcp_imax);
+	lpfc_force_rscn_init(phba, lpfc_force_rscn);
 	lpfc_cq_poll_threshold_init(phba, lpfc_cq_poll_threshold);
 	lpfc_cq_max_proc_limit_init(phba, lpfc_cq_max_proc_limit);
 	lpfc_fcp_cpu_map_init(phba, lpfc_fcp_cpu_map);

+4

drivers/scsi/lpfc/lpfc_crtn.h

···
 int lpfc_issue_els_logo(struct lpfc_vport *, struct lpfc_nodelist *, uint8_t);
 int lpfc_issue_els_npiv_logo(struct lpfc_vport *, struct lpfc_nodelist *);
 int lpfc_issue_els_scr(struct lpfc_vport *, uint32_t, uint8_t);
+int lpfc_issue_els_rscn(struct lpfc_vport *vport, uint8_t retry);
 int lpfc_issue_fabric_reglogin(struct lpfc_vport *);
 int lpfc_els_free_iocb(struct lpfc_hba *, struct lpfc_iocbq *);
 int lpfc_ct_free_iocb(struct lpfc_hba *, struct lpfc_iocbq *);
···
 struct lpfc_nodelist *lpfc_findnode_did(struct lpfc_vport *, uint32_t);
 struct lpfc_nodelist *lpfc_findnode_wwpn(struct lpfc_vport *,
 					 struct lpfc_name *);
+struct lpfc_nodelist *lpfc_findnode_mapped(struct lpfc_vport *vport);

 int lpfc_sli_issue_mbox_wait(struct lpfc_hba *, LPFC_MBOXQ_t *, uint32_t);

···
 int lpfc_check_fwlog_support(struct lpfc_hba *phba);

 /* NVME interfaces. */
+void lpfc_nvme_rescan_port(struct lpfc_vport *vport,
+			   struct lpfc_nodelist *ndlp);
 void lpfc_nvme_unregister_port(struct lpfc_vport *vport,
 			struct lpfc_nodelist *ndlp);
 int lpfc_nvme_register_port(struct lpfc_vport *vport,

+127

drivers/scsi/lpfc/lpfc_els.c

···
 #include <scsi/scsi_device.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_transport_fc.h>
+#include <uapi/scsi/fc/fc_fs.h>
+#include <uapi/scsi/fc/fc_els.h>

 #include "lpfc_hw4.h"
 #include "lpfc_hw.h"
···
 	 */
 	if (!(vport->fc_flag & FC_PT2PT))
 		lpfc_nlp_put(ndlp);
+	return 0;
+}
+
+/**
+ * lpfc_issue_els_rscn - Issue an RSCN to the Fabric Controller (Fabric)
+ * or the other nport (pt2pt).
+ * @vport: pointer to a host virtual N_Port data structure.
+ * @retry: number of retries to the command IOCB.
+ *
+ * This routine issues a RSCN to the Fabric Controller (DID 0xFFFFFD)
+ * when connected to a fabric, or to the remote port when connected
+ * in point-to-point mode. When sent to the Fabric Controller, it will
+ * replay the RSCN to registered recipients.
+ *
+ * Note that, in lpfc_prep_els_iocb() routine, the reference count of ndlp
+ * will be incremented by 1 for holding the ndlp and the reference to ndlp
+ * will be stored into the context1 field of the IOCB for the completion
+ * callback function to the RSCN ELS command.
+ *
+ * Return code
+ * 0 - Successfully issued RSCN command
+ * 1 - Failed to issue RSCN command
+ **/
+int
+lpfc_issue_els_rscn(struct lpfc_vport *vport, uint8_t retry)
+{
+	struct lpfc_hba *phba = vport->phba;
+	struct lpfc_iocbq *elsiocb;
+	struct lpfc_nodelist *ndlp;
+	struct {
+		struct fc_els_rscn rscn;
+		struct fc_els_rscn_page portid;
+	} *event;
+	uint32_t nportid;
+	uint16_t cmdsize = sizeof(*event);
+
+	/* Not supported for private loop */
+	if (phba->fc_topology == LPFC_TOPOLOGY_LOOP &&
+	    !(vport->fc_flag & FC_PUBLIC_LOOP))
+		return 1;
+
+	if (vport->fc_flag & FC_PT2PT) {
+		/* find any mapped nport - that would be the other nport */
+		ndlp = lpfc_findnode_mapped(vport);
+		if (!ndlp)
+			return 1;
+	} else {
+		nportid = FC_FID_FCTRL;
+		/* find the fabric controller node */
+		ndlp = lpfc_findnode_did(vport, nportid);
+		if (!ndlp) {
+			/* if one didn't exist, make one */
+			ndlp = lpfc_nlp_init(vport, nportid);
+			if (!ndlp)
+				return 1;
+			lpfc_enqueue_node(vport, ndlp);
+		} else if (!NLP_CHK_NODE_ACT(ndlp)) {
+			ndlp = lpfc_enable_node(vport, ndlp,
+						NLP_STE_UNUSED_NODE);
+			if (!ndlp)
+				return 1;
+		}
+	}
+
+	elsiocb = lpfc_prep_els_iocb(vport, 1, cmdsize, retry, ndlp,
+				     ndlp->nlp_DID, ELS_CMD_RSCN_XMT);
+
+	if (!elsiocb) {
+		/* This will trigger the release of the node just
+		 * allocated
+		 */
+		lpfc_nlp_put(ndlp);
+		return 1;
+	}
+
+	event = ((struct lpfc_dmabuf *)elsiocb->context2)->virt;
+
+	event->rscn.rscn_cmd = ELS_RSCN;
+	event->rscn.rscn_page_len = sizeof(struct fc_els_rscn_page);
+	event->rscn.rscn_plen = cpu_to_be16(cmdsize);
+
+	nportid = vport->fc_myDID;
+	/* appears that page flags must be 0 for fabric to broadcast RSCN */
+	event->portid.rscn_page_flags = 0;
+	event->portid.rscn_fid[0] = (nportid & 0x00FF0000) >> 16;
+	event->portid.rscn_fid[1] = (nportid & 0x0000FF00) >> 8;
+	event->portid.rscn_fid[2] = nportid & 0x000000FF;
+
+	lpfc_debugfs_disc_trc(vport, LPFC_DISC_TRC_ELS_CMD,
+			      "Issue RSCN: did:x%x",
+			      ndlp->nlp_DID, 0, 0);
+
+	phba->fc_stat.elsXmitRSCN++;
+	elsiocb->iocb_cmpl = lpfc_cmpl_els_cmd;
+	if (lpfc_sli_issue_iocb(phba, LPFC_ELS_RING, elsiocb, 0) ==
+	    IOCB_ERROR) {
+		/* The additional lpfc_nlp_put will cause the following
+		 * lpfc_els_free_iocb routine to trigger the release of
+		 * the node.
+		 */
+		lpfc_nlp_put(ndlp);
+		lpfc_els_free_iocb(phba, elsiocb);
+		return 1;
+	}
+	/* This will cause the callback-function lpfc_cmpl_els_cmd to
+	 * trigger the release of node.
+	 */
+	if (!(vport->fc_flag & FC_PT2PT))
+		lpfc_nlp_put(ndlp);
+
 	return 0;
 }

···
 			continue;
 		}

+		if (ndlp->nlp_fc4_type & NLP_FC4_NVME)
+			lpfc_nvme_rescan_port(vport, ndlp);

 		lpfc_disc_state_machine(vport, ndlp, NULL,
 					NLP_EVT_DEVICE_RECOVERY);
···
 	for (i = 0; i < payload_len/sizeof(uint32_t); i++)
 		fc_host_post_event(shost, fc_get_event_number(),
 			FCH_EVT_RSCN, lp[i]);
+
+	/* Check if RSCN is coming from a direct-connected remote NPort */
+	if (vport->fc_flag & FC_PT2PT) {
+		/* If so, just ACC it, no other action needed for now */
+		lpfc_printf_vlog(vport, KERN_INFO, LOG_ELS,
+				 "2024 pt2pt RSCN %08x Data: x%x x%x\n",
+				 *lp, vport->fc_flag, payload_len);
+		lpfc_els_rsp_acc(vport, ELS_CMD_ACC, cmdiocb, ndlp, NULL);
+
+		if (ndlp->nlp_fc4_type & NLP_FC4_NVME)
+			lpfc_nvme_rescan_port(vport, ndlp);
+		return 0;
+	}

 	/* If we are about to begin discovery, just ACC the RSCN.
 	 * Discovery processing will satisfy it.
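Reviewer sketch (not part of the commit): the RSCN page built in lpfc_issue_els_rscn() stores the 24-bit N_Port ID as three bytes, most significant byte first. The userspace model below reproduces only that packing. The struct `rscn_page_sketch` and both helpers are hypothetical names for illustration and are not the kernel's `struct fc_els_rscn_page`.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 4-byte RSCN page; not the kernel struct. */
struct rscn_page_sketch {
	uint8_t flags;
	uint8_t fid[3];	/* 24-bit port ID, most significant byte first */
};

/* Split a port ID into three bytes using the same masks as the hunk above. */
static void rscn_page_set_fid(struct rscn_page_sketch *page, uint32_t nportid)
{
	page->flags = 0;
	page->fid[0] = (nportid & 0x00FF0000) >> 16;
	page->fid[1] = (nportid & 0x0000FF00) >> 8;
	page->fid[2] = nportid & 0x000000FF;
}

/* Reassemble the 24-bit port ID, as a receiver of the page would. */
static uint32_t rscn_page_get_fid(const struct rscn_page_sketch *page)
{
	return ((uint32_t)page->fid[0] << 16) |
	       ((uint32_t)page->fid[1] << 8) |
	       (uint32_t)page->fid[2];
}
```

Because the masks discard bits 24 to 31, any value above 0xFFFFFF is silently truncated to its low 24 bits; the round trip is lossless only for valid 24-bit IDs.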

+35

drivers/scsi/lpfc/lpfc_hbadisc.c

···
 }

 struct lpfc_nodelist *
+lpfc_findnode_mapped(struct lpfc_vport *vport)
+{
+	struct Scsi_Host *shost = lpfc_shost_from_vport(vport);
+	struct lpfc_nodelist *ndlp;
+	uint32_t data1;
+	unsigned long iflags;
+
+	spin_lock_irqsave(shost->host_lock, iflags);
+
+	list_for_each_entry(ndlp, &vport->fc_nodes, nlp_listp) {
+		if (ndlp->nlp_state == NLP_STE_UNMAPPED_NODE ||
+		    ndlp->nlp_state == NLP_STE_MAPPED_NODE) {
+			data1 = (((uint32_t)ndlp->nlp_state << 24) |
+				 ((uint32_t)ndlp->nlp_xri << 16) |
+				 ((uint32_t)ndlp->nlp_type << 8) |
+				 ((uint32_t)ndlp->nlp_rpi & 0xff));
+			spin_unlock_irqrestore(shost->host_lock, iflags);
+			lpfc_printf_vlog(vport, KERN_INFO, LOG_NODE,
+					 "2025 FIND node DID "
+					 "Data: x%p x%x x%x x%x %p\n",
+					 ndlp, ndlp->nlp_DID,
+					 ndlp->nlp_flag, data1,
+					 ndlp->active_rrqs_xri_bitmap);
+			return ndlp;
+		}
+	}
+	spin_unlock_irqrestore(shost->host_lock, iflags);
+
+	/* FIND node did <did> NOT FOUND */
+	lpfc_printf_vlog(vport, KERN_INFO, LOG_NODE,
+			 "2026 FIND mapped did NOT FOUND.\n");
+	return NULL;
+}
+
+struct lpfc_nodelist *
 lpfc_setup_disc_node(struct lpfc_vport *vport, uint32_t did)
 {
 	struct Scsi_Host *shost = lpfc_shost_from_vport(vport);

+2

drivers/scsi/lpfc/lpfc_hw.h

···
 #define ELS_CMD_RPL 0x57000000
 #define ELS_CMD_FAN 0x60000000
 #define ELS_CMD_RSCN 0x61040000
+#define ELS_CMD_RSCN_XMT 0x61040008
 #define ELS_CMD_SCR 0x62000000
 #define ELS_CMD_RNID 0x78000000
 #define ELS_CMD_LIRR 0x7A000000
···
 #define ELS_CMD_RPL 0x57
 #define ELS_CMD_FAN 0x60
 #define ELS_CMD_RSCN 0x0461
+#define ELS_CMD_RSCN_XMT 0x08000461
 #define ELS_CMD_SCR 0x62
 #define ELS_CMD_RNID 0x78
 #define ELS_CMD_LIRR 0x7A
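Reviewer sketch (not part of the commit): the two ELS_CMD_RSCN_XMT values added above are byte-swapped forms of the same 32-bit word, which suggests the two hunks sit in per-endianness sections of the header; the diff does not show the enclosing preprocessor conditionals, so that reading is an assumption. The helper `swap32_sketch` and the two macro names below are hypothetical and exist only to check the relationship.

```c
#include <stdint.h>

/* Values copied from the two hunks above; the macro names are illustrative. */
#define RSCN_XMT_FIRST_HUNK  0x61040008u
#define RSCN_XMT_SECOND_HUNK 0x08000461u

/* Portable 32-bit byte swap, written out so it needs no compiler builtin. */
static uint32_t swap32_sketch(uint32_t v)
{
	return ((v & 0x000000FFu) << 24) |
	       ((v & 0x0000FF00u) << 8) |
	       ((v & 0x00FF0000u) >> 8) |
	       ((v & 0xFF000000u) >> 24);
}
```

The same relationship holds for the existing ELS_CMD_RSCN pair (0x61040000 and 0x0461). The only difference in the new constant is the low 16 bits, 0x0008, which happens to equal the 8-byte payload size used by lpfc_issue_els_rscn().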

+44

drivers/scsi/lpfc/lpfc_nvme.c

···
 #endif
 }

+/**
+ * lpfc_nvme_rescan_port - Check to see if we should rescan this remoteport
+ *
+ * If the ndlp represents an NVME Target, that we are logged into,
+ * ping the NVME FC Transport layer to initiate a device rescan
+ * on this remote NPort.
+ */
+void
+lpfc_nvme_rescan_port(struct lpfc_vport *vport, struct lpfc_nodelist *ndlp)
+{
+#if (IS_ENABLED(CONFIG_NVME_FC))
+	struct lpfc_nvme_rport *rport;
+	struct nvme_fc_remote_port *remoteport;
+
+	rport = ndlp->nrport;
+
+	lpfc_printf_vlog(vport, KERN_INFO, LOG_NVME_DISC,
+			 "6170 Rescan NPort DID x%06x type x%x "
+			 "state x%x rport %p\n",
+			 ndlp->nlp_DID, ndlp->nlp_type, ndlp->nlp_state, rport);
+	if (!rport)
+		goto input_err;
+	remoteport = rport->remoteport;
+	if (!remoteport)
+		goto input_err;
+
+	/* Only rescan if we are an NVME target in the MAPPED state */
+	if (remoteport->port_role & FC_PORT_ROLE_NVME_DISCOVERY &&
+	    ndlp->nlp_state == NLP_STE_MAPPED_NODE) {
+		nvme_fc_rescan_remoteport(remoteport);
+
+		lpfc_printf_vlog(vport, KERN_ERR, LOG_NVME_DISC,
+				 "6172 NVME rescanned DID x%06x "
+				 "port_state x%x\n",
+				 ndlp->nlp_DID, remoteport->port_state);
+	}
+	return;
+input_err:
+	lpfc_printf_vlog(vport, KERN_ERR, LOG_NVME_DISC,
+			 "6169 State error: lport %p, rport%p FCID x%06x\n",
+			 vport->localport, ndlp->rport, ndlp->nlp_DID);
+#endif
+}
+
 /* lpfc_nvme_unregister_port - unbind the DID and port_role from this rport.
  *
  * There is no notion of Devloss or rport recovery from the current

+17

drivers/scsi/lpfc/lpfc_nvmet.c

···
 	spin_unlock_irqrestore(&ctxp->ctxlock, iflag);
 }

+static void
+lpfc_nvmet_discovery_event(struct nvmet_fc_target_port *tgtport)
+{
+	struct lpfc_nvmet_tgtport *tgtp;
+	struct lpfc_hba *phba;
+	uint32_t rc;
+
+	tgtp = tgtport->private;
+	phba = tgtp->phba;
+
+	rc = lpfc_issue_els_rscn(phba->pport, 0);
+	lpfc_printf_log(phba, KERN_ERR, LOG_NVME,
+			"6420 NVMET subsystem change: Notification %s\n",
+			(rc) ? "Failed" : "Sent");
+}
+
 static struct nvmet_fc_target_template lpfc_tgttemplate = {
 	.targetport_delete = lpfc_nvmet_targetport_delete,
 	.xmt_ls_rsp = lpfc_nvmet_xmt_ls_rsp,
···
 	.fcp_abort = lpfc_nvmet_xmt_fcp_abort,
 	.fcp_req_release = lpfc_nvmet_xmt_fcp_release,
 	.defer_rcv = lpfc_nvmet_defer_rcv,
+	.discovery_event = lpfc_nvmet_discovery_event,

 	.max_hw_queues = 1,
 	.max_sgl_segments = LPFC_NVMET_DEFAULT_SEGS,

+1

drivers/scsi/lpfc/lpfc_sli.c

···
 		if (if_type >= LPFC_SLI_INTF_IF_TYPE_2) {
 			if (pcmd && (*pcmd == ELS_CMD_FLOGI ||
 				*pcmd == ELS_CMD_SCR ||
+				*pcmd == ELS_CMD_RSCN_XMT ||
 				*pcmd == ELS_CMD_FDISC ||
 				*pcmd == ELS_CMD_LOGO ||
 				*pcmd == ELS_CMD_PLOGI)) {

+3 -16

fs/block_dev.c

···
 {
 	struct file *file = iocb->ki_filp;
 	struct block_device *bdev = I_BDEV(bdev_file_inode(file));
-	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs, *bvec;
+	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
 	loff_t pos = iocb->ki_pos;
 	bool should_dirty = false;
 	struct bio bio;
 	ssize_t ret;
 	blk_qc_t qc;
-	struct bvec_iter_all iter_all;

 	if ((pos | iov_iter_alignment(iter)) &
 	    (bdev_logical_block_size(bdev) - 1))
···
 	}
 	__set_current_state(TASK_RUNNING);

-	bio_for_each_segment_all(bvec, &bio, iter_all) {
-		if (should_dirty && !PageCompound(bvec->bv_page))
-			set_page_dirty_lock(bvec->bv_page);
-		if (!bio_flagged(&bio, BIO_NO_PAGE_REF))
-			put_page(bvec->bv_page);
-	}
-
+	bio_release_pages(&bio, should_dirty);
 	if (unlikely(bio.bi_status))
 		ret = blk_status_to_errno(bio.bi_status);

···
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
 	} else {
-		if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
-			struct bvec_iter_all iter_all;
-			struct bio_vec *bvec;
-
-			bio_for_each_segment_all(bvec, bio, iter_all)
-				put_page(bvec->bv_page);
-		}
+		bio_release_pages(bio, false);
 		bio_put(bio);
 	}
 }

+3 -12

fs/direct-io.c

···
  */
 static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
 {
-	struct bio_vec *bvec;
 	blk_status_t err = bio->bi_status;
+	bool should_dirty = dio->op == REQ_OP_READ && dio->should_dirty;

 	if (err) {
 		if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
···
 			dio->io_error = -EIO;
 	}

-	if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
+	if (dio->is_async && should_dirty) {
 		bio_check_pages_dirty(bio);	/* transfers ownership */
 	} else {
-		struct bvec_iter_all iter_all;
-
-		bio_for_each_segment_all(bvec, bio, iter_all) {
-			struct page *page = bvec->bv_page;
-
-			if (dio->op == REQ_OP_READ && !PageCompound(page) &&
-					dio->should_dirty)
-				set_page_dirty_lock(page);
-			put_page(page);
-		}
+		bio_release_pages(bio, should_dirty);
 		bio_put(bio);
 	}
 	return err;

+7 -1

fs/fs-writeback.c

···
 void wbc_account_io(struct writeback_control *wbc, struct page *page,
 		    size_t bytes)
 {
+	struct cgroup_subsys_state *css;
 	int id;

 	/*
···
 	if (!wbc->wb)
 		return;

-	id = mem_cgroup_css_from_page(page)->id;
+	css = mem_cgroup_css_from_page(page);
+	/* dead cgroups shouldn't contribute to inode ownership arbitration */
+	if (!(css->flags & CSS_ONLINE))
+		return;
+
+	id = css->id;

 	if (id == wbc->wb_id) {
 		wbc->wb_bytes += bytes;

-3

fs/io_uring.c

···
 	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
 	if (offset)
 		iov_iter_advance(iter, offset);
-
-	/* don't drop a reference to these pages */
-	iter->type |= ITER_BVEC_FLAG_NO_REF;
 	return 0;
 }


+2 -8

fs/iomap.c

···
 	if (iop)
 		atomic_inc(&iop->read_count);

-	if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
+	if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
 		gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
 		int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;

···
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
 	} else {
-		if (!bio_flagged(bio, BIO_NO_PAGE_REF)) {
-			struct bvec_iter_all iter_all;
-			struct bio_vec *bvec;
-
-			bio_for_each_segment_all(bvec, bio, iter_all)
-				put_page(bvec->bv_page);
-		}
+		bio_release_pages(bio, false);
 		bio_put(bio);
 	}
 }

+1 -1

fs/xfs/xfs_aops.c

···
 		atomic_inc(&iop->write_count);

 	if (!merged) {
-		if (bio_full(wpc->ioend->io_bio))
+		if (bio_full(wpc->ioend->io_bio, len))
 			xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
 		bio_add_page(wpc->ioend->io_bio, page, len, poff);
 	}

+17 -14

include/linux/bio.h

···
 	return NULL;
 }

-static inline bool bio_full(struct bio *bio)
+/**
+ * bio_full - check if the bio is full
+ * @bio: bio to check
+ * @len: length of one segment to be added
+ *
+ * Return true if @bio is full and one segment with @len bytes can't be
+ * added to the bio, otherwise return false
+ */
+static inline bool bio_full(struct bio *bio, unsigned len)
 {
-	return bio->bi_vcnt >= bio->bi_max_vecs;
+	if (bio->bi_vcnt >= bio->bi_max_vecs)
+		return true;
+
+	if (bio->bi_iter.bi_size > UINT_MAX - len)
+		return true;
+
+	return false;
 }

 static inline bool bio_next_segment(const struct bio *bio,
···
 }

 struct request_queue;
-extern int bio_phys_segments(struct request_queue *, struct bio *);

 extern int submit_bio_wait(struct bio *bio);
 extern void bio_advance(struct bio *, unsigned);
···
 void __bio_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
+void bio_release_pages(struct bio *bio, bool mark_dirty);
 struct rq_map_data;
 extern struct bio *bio_map_user_iov(struct request_queue *,
 				     struct iov_iter *, gfp_t);
···
 void generic_end_io_acct(struct request_queue *q, int op,
 			 struct hd_struct *part,
 			 unsigned long start_time);
-
-#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
-# error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform"
-#endif
-#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
-extern void bio_flush_dcache_pages(struct bio *bi);
-#else
-static inline void bio_flush_dcache_pages(struct bio *bi)
-{
-}
-#endif

 extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 			       struct bio *src, struct bvec_iter *src_iter);
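Reviewer sketch (not part of the commit): the new bio_full() adds a second condition so that appending @len bytes cannot wrap the 32-bit byte count. The comparison is written as `size > UINT_MAX - len` rather than `size + len > UINT_MAX`, because the latter would itself overflow in unsigned arithmetic. The model below uses a hypothetical `struct bio_sketch` holding only the three fields the check reads; it is not the kernel's `struct bio`.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct bio. */
struct bio_sketch {
	unsigned short vcnt;		/* segments currently in use */
	unsigned short max_vecs;	/* segment capacity */
	unsigned int size;		/* bytes currently queued */
};

/* Same two conditions as the patched bio_full(), on the reduced struct. */
static bool bio_full_sketch(const struct bio_sketch *bio, unsigned int len)
{
	/* No free segment slot left. */
	if (bio->vcnt >= bio->max_vecs)
		return true;

	/* Adding len bytes would exceed UINT_MAX; the test itself cannot wrap. */
	if (bio->size > UINT_MAX - len)
		return true;

	return false;
}
```

A bio holding exactly `UINT_MAX - len` bytes is still reported as not full, since adding `len` lands on `UINT_MAX` without wrapping.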

+22 -84

include/linux/blk-cgroup.h

···

 /*
  * blkg_[rw]stat->aux_cnt is excluded for local stats but included for
- * recursive. Used to carry stats of dead children, and, for blkg_rwstat,
- * to carry result values from read and sum operations.
+ * recursive. Used to carry stats of dead children.
  */
-struct blkg_stat {
-	struct percpu_counter cpu_cnt;
-	atomic64_t aux_cnt;
-};
-
 struct blkg_rwstat {
 	struct percpu_counter cpu_cnt[BLKG_RWSTAT_NR];
 	atomic64_t aux_cnt[BLKG_RWSTAT_NR];
+};
+
+struct blkg_rwstat_sample {
+	u64 cnt[BLKG_RWSTAT_NR];
 };

 /*
···
 void blkcg_deactivate_policy(struct request_queue *q,
 			     const struct blkcg_policy *pol);

+static inline u64 blkg_rwstat_read_counter(struct blkg_rwstat *rwstat,
+		unsigned int idx)
+{
+	return atomic64_read(&rwstat->aux_cnt[idx]) +
+		percpu_counter_sum_positive(&rwstat->cpu_cnt[idx]);
+}
+
 const char *blkg_dev_name(struct blkcg_gq *blkg);
 void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
 		       u64 (*prfill)(struct seq_file *,
···
 		       bool show_total);
 u64 __blkg_prfill_u64(struct seq_file *sf, struct blkg_policy_data *pd, u64 v);
 u64 __blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
-			 const struct blkg_rwstat *rwstat);
-u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
+			 const struct blkg_rwstat_sample *rwstat);
 u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 		       int off);
 int blkg_print_stat_bytes(struct seq_file *sf, void *v);
···
 int blkg_print_stat_bytes_recursive(struct seq_file *sf, void *v);
 int blkg_print_stat_ios_recursive(struct seq_file *sf, void *v);

-u64 blkg_stat_recursive_sum(struct blkcg_gq *blkg,
-			    struct blkcg_policy *pol, int off);
-struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg,
-					     struct blkcg_policy *pol, int off);
+void blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, struct blkcg_policy *pol,
+		int off, struct blkg_rwstat_sample *sum);

 struct blkg_conf_ctx {
 	struct gendisk *disk;
···
 		if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css), \
 					      (p_blkg)->q, false)))

-static inline int blkg_stat_init(struct blkg_stat *stat, gfp_t gfp)
-{
-	int ret;
-
-	ret = percpu_counter_init(&stat->cpu_cnt, 0, gfp);
-	if (ret)
-		return ret;
-
-	atomic64_set(&stat->aux_cnt, 0);
-	return 0;
-}
-
-static inline void blkg_stat_exit(struct blkg_stat *stat)
-{
-	percpu_counter_destroy(&stat->cpu_cnt);
-}
-
-/**
- * blkg_stat_add - add a value to a blkg_stat
- * @stat: target blkg_stat
- * @val: value to add
- *
- * Add @val to @stat. The caller must ensure that IRQ on the same CPU
- * don't re-enter this function for the same counter.
- */
-static inline void blkg_stat_add(struct blkg_stat *stat, uint64_t val)
-{
-	percpu_counter_add_batch(&stat->cpu_cnt, val, BLKG_STAT_CPU_BATCH);
-}
-
-/**
- * blkg_stat_read - read the current value of a blkg_stat
- * @stat: blkg_stat to read
- */
-static inline uint64_t blkg_stat_read(struct blkg_stat *stat)
-{
-	return percpu_counter_sum_positive(&stat->cpu_cnt);
-}
-
-/**
- * blkg_stat_reset - reset a blkg_stat
- * @stat: blkg_stat to reset
- */
-static inline void blkg_stat_reset(struct blkg_stat *stat)
-{
-	percpu_counter_set(&stat->cpu_cnt, 0);
-	atomic64_set(&stat->aux_cnt, 0);
-}
-
-/**
- * blkg_stat_add_aux - add a blkg_stat into another's aux count
- * @to: the destination blkg_stat
- * @from: the source
- *
- * Add @from's count including the aux one to @to's aux count.
- */
-static inline void blkg_stat_add_aux(struct blkg_stat *to,
-				     struct blkg_stat *from)
-{
-	atomic64_add(blkg_stat_read(from) + atomic64_read(&from->aux_cnt),
-		     &to->aux_cnt);
-}
-
 static inline int blkg_rwstat_init(struct blkg_rwstat *rwstat, gfp_t gfp)
 {
 	int i, ret;
···
  *
  * Read the current snapshot of @rwstat and return it in the aux counts.
  */
-static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
+static inline void blkg_rwstat_read(struct blkg_rwstat *rwstat,
+		struct blkg_rwstat_sample *result)
 {
-	struct blkg_rwstat result;
 	int i;

 	for (i = 0; i < BLKG_RWSTAT_NR; i++)
-		atomic64_set(&result.aux_cnt[i],
-			     percpu_counter_sum_positive(&rwstat->cpu_cnt[i]));
-	return result;
+		result->cnt[i] =
+			percpu_counter_sum_positive(&rwstat->cpu_cnt[i]);
 }

 /**
···
  */
 static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
 {
-	struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
+	struct blkg_rwstat_sample tmp = { };

-	return atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) +
-		atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]);
+	blkg_rwstat_read(rwstat, &tmp);
+	return tmp.cnt[BLKG_RWSTAT_READ] + tmp.cnt[BLKG_RWSTAT_WRITE];
 }

 /**
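Reviewer sketch (not part of the commit): the hunks above stop returning a whole `struct blkg_rwstat` by value (percpu counters plus atomics) and instead fill a plain array of u64 through an out-parameter. The userspace model below shows only that calling pattern. All `*_sketch` names and the two-entry enum are hypothetical; the kernel's counters are percpu and atomic, and BLKG_RWSTAT_NR has more entries than shown here.

```c
#include <stdint.h>

/* Reduced, hypothetical index set; the kernel enum is larger. */
enum { RW_READ_SKETCH, RW_WRITE_SKETCH, RW_NR_SKETCH };

/* Stand-in for the live counters (percpu_counter/atomic64_t in the kernel). */
struct rwstat_sketch {
	uint64_t live[RW_NR_SKETCH];
};

/* Plain snapshot type, playing the role of struct blkg_rwstat_sample. */
struct rwstat_sample_sketch {
	uint64_t cnt[RW_NR_SKETCH];
};

/* Copy the current values into a caller-provided snapshot. */
static void rwstat_read_sketch(const struct rwstat_sketch *rwstat,
			       struct rwstat_sample_sketch *result)
{
	int i;

	for (i = 0; i < RW_NR_SKETCH; i++)
		result->cnt[i] = rwstat->live[i];
}

/* Sum reads and writes from a snapshot, like the patched blkg_rwstat_total(). */
static uint64_t rwstat_total_sketch(const struct rwstat_sketch *rwstat)
{
	struct rwstat_sample_sketch tmp = { { 0 } };

	rwstat_read_sketch(rwstat, &tmp);
	return tmp.cnt[RW_READ_SKETCH] + tmp.cnt[RW_WRITE_SKETCH];
}
```

The snapshot type carries no synchronization state, so it can live on the stack and be passed around freely, which is the property the removed comment about using aux_cnt "to carry result values" was working around.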

+1 -1

include/linux/blk-mq.h

··· 306 306 bool blk_mq_complete_request(struct request *rq); 307 307 void blk_mq_complete_request_sync(struct request *rq); 308 308 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list, 309 - struct bio *bio); 309 + struct bio *bio, unsigned int nr_segs); 310 310 bool blk_mq_queue_stopped(struct request_queue *q); 311 311 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx); 312 312 void blk_mq_start_hw_queue(struct blk_mq_hw_ctx *hctx);

-6

include/linux/blk_types.h

··· 154 154 blk_status_t bi_status; 155 155 u8 bi_partno; 156 156 157 - /* Number of segments in this BIO after 158 - * physical address coalescing is performed. 159 - */ 160 - unsigned int bi_phys_segments; 161 - 162 157 struct bvec_iter bi_iter; 163 158 164 159 atomic_t __bi_remaining; ··· 205 210 */ 206 211 enum { 207 212 BIO_NO_PAGE_REF, /* don't put release vec pages */ 208 - BIO_SEG_VALID, /* bi_phys_segments valid */ 209 213 BIO_CLONED, /* doesn't own data */ 210 214 BIO_BOUNCED, /* bio is a bounce bio */ 211 215 BIO_USER_MAPPED, /* contains user pages */

+4 -15

include/linux/blkdev.h

···
 	unsigned int cmd_flags;		/* op and common flags */
 	req_flags_t rq_flags;

+	int tag;
 	int internal_tag;

 	/* the following two fields are internal, NEVER access directly */
 	unsigned int __data_len;	/* total data len */
-	int tag;
 	sector_t __sector;		/* sector cursor */

 	struct bio *bio;
···
 extern blk_qc_t generic_make_request(struct bio *bio);
 extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
-extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
 extern struct request *blk_get_request(struct request_queue *, unsigned int op,
 				       blk_mq_req_flags_t flags);
···
 				     struct request *rq);
 extern int blk_rq_append_bio(struct request *rq, struct bio **bio);
 extern void blk_queue_split(struct request_queue *, struct bio **);
-extern void blk_recount_segments(struct request_queue *, struct bio *);
 extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
 extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,
 			      unsigned int, void __user *);
···
 			   struct request *, int);
 extern void blk_execute_rq_nowait(struct request_queue *, struct gendisk *,
 				  struct request *, int, rq_end_io_fn *);
+
+/* Helper to convert REQ_OP_XXX to its string format XXX */
+extern const char *blk_op_str(unsigned int op);

 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
···
  *
  * blk_update_request() completes given number of bytes and updates
  * the request without completing it.
- *
- * blk_end_request() and friends. __blk_end_request() must be called
- * with the request queue spinlock acquired.
- *
- * Several drivers define their own end_request and call
- * blk_end_request() for parts of the original function.
- * This prevents code duplication in drivers.
  */
 extern bool blk_update_request(struct request *rq, blk_status_t error,
 			       unsigned int nr_bytes);
-extern void blk_end_request_all(struct request *rq, blk_status_t error);
-extern bool __blk_end_request(struct request *rq, blk_status_t error,
-			      unsigned int nr_bytes);
-extern void __blk_end_request_all(struct request *rq, blk_status_t error);
-extern bool __blk_end_request_cur(struct request *rq, blk_status_t error);

 extern void __blk_complete_request(struct request *);
 extern void blk_abort_request(struct request *);

+1 -1

include/linux/elevator.h

···
 	void (*depth_updated)(struct blk_mq_hw_ctx *);

 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
-	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
+	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *, unsigned int);
 	int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
 	void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
 	void (*requests_merged)(struct request_queue *, struct request *, struct request *);

+6

include/linux/nvme-fc-driver.h

···
  * nvmefc_tgt_fcp_req.
  * Entrypoint is Optional.
  *
+ * @discovery_event: Called by the transport to generate an RSCN
+ * change notifications to NVME initiators. The RSCN notifications
+ * should cause the initiator to rescan the discovery controller
+ * on the targetport.
+ *
  * @max_hw_queues: indicates the maximum number of hw queues the LLDD
  * supports for cpu affinitization.
  * Value is Mandatory. Must be at least 1.
···
 				struct nvmefc_tgt_fcp_req *fcpreq);
 	void (*defer_rcv)(struct nvmet_fc_target_port *tgtport,
 				struct nvmefc_tgt_fcp_req *fcpreq);
+	void (*discovery_event)(struct nvmet_fc_target_port *tgtport);

 	u32 max_hw_queues;
 	u16 max_sgl_segments;

+65 -1

include/linux/nvme.h

··· 562 562 nvme_cmd_resv_release = 0x15, 563 563 }; 564 564 565 + #define nvme_opcode_name(opcode) { opcode, #opcode } 566 + #define show_nvm_opcode_name(val) \ 567 + __print_symbolic(val, \ 568 + nvme_opcode_name(nvme_cmd_flush), \ 569 + nvme_opcode_name(nvme_cmd_write), \ 570 + nvme_opcode_name(nvme_cmd_read), \ 571 + nvme_opcode_name(nvme_cmd_write_uncor), \ 572 + nvme_opcode_name(nvme_cmd_compare), \ 573 + nvme_opcode_name(nvme_cmd_write_zeroes), \ 574 + nvme_opcode_name(nvme_cmd_dsm), \ 575 + nvme_opcode_name(nvme_cmd_resv_register), \ 576 + nvme_opcode_name(nvme_cmd_resv_report), \ 577 + nvme_opcode_name(nvme_cmd_resv_acquire), \ 578 + nvme_opcode_name(nvme_cmd_resv_release)) 579 + 580 + 565 581 /* 566 582 * Descriptor subtype - lower 4 bits of nvme_(keyed_)sgl_desc identifier 567 583 * ··· 810 794 nvme_admin_sanitize_nvm = 0x84, 811 795 }; 812 796 797 + #define nvme_admin_opcode_name(opcode) { opcode, #opcode } 798 + #define show_admin_opcode_name(val) \ 799 + __print_symbolic(val, \ 800 + nvme_admin_opcode_name(nvme_admin_delete_sq), \ 801 + nvme_admin_opcode_name(nvme_admin_create_sq), \ 802 + nvme_admin_opcode_name(nvme_admin_get_log_page), \ 803 + nvme_admin_opcode_name(nvme_admin_delete_cq), \ 804 + nvme_admin_opcode_name(nvme_admin_create_cq), \ 805 + nvme_admin_opcode_name(nvme_admin_identify), \ 806 + nvme_admin_opcode_name(nvme_admin_abort_cmd), \ 807 + nvme_admin_opcode_name(nvme_admin_set_features), \ 808 + nvme_admin_opcode_name(nvme_admin_get_features), \ 809 + nvme_admin_opcode_name(nvme_admin_async_event), \ 810 + nvme_admin_opcode_name(nvme_admin_ns_mgmt), \ 811 + nvme_admin_opcode_name(nvme_admin_activate_fw), \ 812 + nvme_admin_opcode_name(nvme_admin_download_fw), \ 813 + nvme_admin_opcode_name(nvme_admin_ns_attach), \ 814 + nvme_admin_opcode_name(nvme_admin_keep_alive), \ 815 + nvme_admin_opcode_name(nvme_admin_directive_send), \ 816 + nvme_admin_opcode_name(nvme_admin_directive_recv), \ 817 + nvme_admin_opcode_name(nvme_admin_dbbuf), \ 
818 + nvme_admin_opcode_name(nvme_admin_format_nvm), \ 819 + nvme_admin_opcode_name(nvme_admin_security_send), \ 820 + nvme_admin_opcode_name(nvme_admin_security_recv), \ 821 + nvme_admin_opcode_name(nvme_admin_sanitize_nvm)) 822 + 813 823 enum { 814 824 NVME_QUEUE_PHYS_CONTIG = (1 << 0), 815 825 NVME_CQ_IRQ_ENABLED = (1 << 1), ··· 1050 1008 nvme_fabrics_type_property_get = 0x04, 1051 1009 }; 1052 1010 1011 + #define nvme_fabrics_type_name(type) { type, #type } 1012 + #define show_fabrics_type_name(type) \ 1013 + __print_symbolic(type, \ 1014 + nvme_fabrics_type_name(nvme_fabrics_type_property_set), \ 1015 + nvme_fabrics_type_name(nvme_fabrics_type_connect), \ 1016 + nvme_fabrics_type_name(nvme_fabrics_type_property_get)) 1017 + 1018 + /* 1019 + * If not fabrics command, fctype will be ignored. 1020 + */ 1021 + #define show_opcode_name(qid, opcode, fctype) \ 1022 + ((opcode) == nvme_fabrics_command ? \ 1023 + show_fabrics_type_name(fctype) : \ 1024 + ((qid) ? \ 1025 + show_nvm_opcode_name(opcode) : \ 1026 + show_admin_opcode_name(opcode))) 1027 + 1053 1028 struct nvmf_common_command { 1054 1029 __u8 opcode; 1055 1030 __u8 resv1; ··· 1224 1165 }; 1225 1166 }; 1226 1167 1168 + static inline bool nvme_is_fabrics(struct nvme_command *cmd) 1169 + { 1170 + return cmd->common.opcode == nvme_fabrics_command; 1171 + } 1172 + 1227 1173 struct nvme_error_slot { 1228 1174 __le64 error_count; 1229 1175 __le16 sqid; ··· 1250 1186 * 1251 1187 * Why can't we simply have a Fabrics In and Fabrics out command? 1252 1188 */ 1253 - if (unlikely(cmd->common.opcode == nvme_fabrics_command)) 1189 + if (unlikely(nvme_is_fabrics(cmd))) 1254 1190 return cmd->fabrics.fctype & 1; 1255 1191 return cmd->common.opcode & 1; 1256 1192 }
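
Note on the include/linux/nvme.h hunk above: `show_opcode_name(qid, opcode, fctype)` picks one of three symbol tables. Fabrics commands are named by `fctype`; otherwise a non-zero `qid` selects the NVM (I/O) opcode table and `qid == 0` selects the admin table. The following is a minimal user-space sketch of that selection only. `struct sym`, `SYM`, `sym_lookup` and `opcode_name_sketch` are hypothetical names, the tables hold just a few entries, and the real macros expand to `__print_symbolic()` inside trace events rather than to a function like this.

```c
#include <stddef.h>

/* A few values copied from include/linux/nvme.h (not the full set). */
enum { nvme_cmd_flush = 0x00, nvme_cmd_write = 0x01, nvme_cmd_read = 0x02 };
enum { nvme_admin_delete_sq = 0x00, nvme_admin_create_sq = 0x01,
       nvme_admin_identify = 0x06 };
enum { nvme_fabrics_type_property_set = 0x00,
       nvme_fabrics_type_connect = 0x01,
       nvme_fabrics_type_property_get = 0x04 };
enum { nvme_fabrics_command = 0x7f };

/* Hypothetical { value, name } pair, like the entries the patch builds. */
struct sym { unsigned int val; const char *name; };
#define SYM(x) { x, #x }
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

static const struct sym nvm_syms[] = {
	SYM(nvme_cmd_flush), SYM(nvme_cmd_write), SYM(nvme_cmd_read),
};
static const struct sym admin_syms[] = {
	SYM(nvme_admin_delete_sq), SYM(nvme_admin_create_sq),
	SYM(nvme_admin_identify),
};
static const struct sym fabrics_syms[] = {
	SYM(nvme_fabrics_type_property_set), SYM(nvme_fabrics_type_connect),
	SYM(nvme_fabrics_type_property_get),
};

static const char *sym_lookup(const struct sym *tbl, size_t n, unsigned int val)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].val == val)
			return tbl[i].name;
	/* Simplification: the tracing helper treats unmatched values differently. */
	return "unknown";
}

/* Same three-way choice as show_opcode_name(); fctype is ignored unless fabrics. */
static const char *opcode_name_sketch(unsigned int qid, unsigned int opcode,
				      unsigned int fctype)
{
	if (opcode == nvme_fabrics_command)
		return sym_lookup(fabrics_syms, ARRAY_LEN(fabrics_syms), fctype);
	if (qid)
		return sym_lookup(nvm_syms, ARRAY_LEN(nvm_syms), opcode);
	return sym_lookup(admin_syms, ARRAY_LEN(admin_syms), opcode);
}
```

The same numeric opcode maps to different names depending on the queue: 0x02 is `nvme_cmd_read` on an I/O queue but `nvme_admin_get_log_page` on the admin queue, which is why the queue id is part of the lookup.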

+3

include/linux/sed-opal.h

··· 39 39 case IOC_OPAL_ENABLE_DISABLE_MBR: 40 40 case IOC_OPAL_ERASE_LR: 41 41 case IOC_OPAL_SECURE_ERASE_LR: 42 + case IOC_OPAL_PSID_REVERT_TPR: 43 + case IOC_OPAL_MBR_DONE: 44 + case IOC_OPAL_WRITE_SHADOW_MBR: 42 45 return true; 43 46 } 44 47 return false;

+1 -9

include/linux/uio.h

··· 19 19 }; 20 20 21 21 enum iter_type { 22 - /* set if ITER_BVEC doesn't hold a bv_page ref */ 23 - ITER_BVEC_FLAG_NO_REF = 2, 24 - 25 22 /* iter types */ 26 23 ITER_IOVEC = 4, 27 24 ITER_KVEC = 8, ··· 53 56 54 57 static inline enum iter_type iov_iter_type(const struct iov_iter *i) 55 58 { 56 - return i->type & ~(READ | WRITE | ITER_BVEC_FLAG_NO_REF); 59 + return i->type & ~(READ | WRITE); 57 60 } 58 61 59 62 static inline bool iter_is_iovec(const struct iov_iter *i) ··· 84 87 static inline unsigned char iov_iter_rw(const struct iov_iter *i) 85 88 { 86 89 return i->type & (READ | WRITE); 87 - } 88 - 89 - static inline bool iov_iter_bvec_no_ref(const struct iov_iter *i) 90 - { 91 - return (i->type & ITER_BVEC_FLAG_NO_REF) != 0; 92 90 } 93 91 94 92 /*
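
Note on the include/linux/uio.h hunk above: with `ITER_BVEC_FLAG_NO_REF` gone, the `type` field packs only the direction bit and the iterator type, so `iov_iter_type()` just masks off `READ | WRITE`. A minimal sketch of that packing, assuming the kernel's usual `READ == 0` and `WRITE == 1`; `struct iov_iter_sketch`, `iter_type_of` and `iter_rw_of` are hypothetical stand-ins, not kernel names.

```c
/* Direction values as in the kernel; iter types as listed in the hunk. */
enum { READ = 0, WRITE = 1 };
enum iter_type { ITER_IOVEC = 4, ITER_KVEC = 8 };

/* Hypothetical cut-down iov_iter carrying only the packed type field. */
struct iov_iter_sketch {
	unsigned int type;
};

/* Mirrors iov_iter_type(): strip the direction bit, keep the type. */
static enum iter_type iter_type_of(const struct iov_iter_sketch *i)
{
	return i->type & ~(READ | WRITE);
}

/* Mirrors iov_iter_rw(): keep only the direction bit. */
static unsigned char iter_rw_of(const struct iov_iter_sketch *i)
{
	return i->type & (READ | WRITE);
}
```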

+1 -10

include/trace/events/f2fs.h

··· 76 76 #define show_bio_type(op,op_flags) show_bio_op(op), \ 77 77 show_bio_op_flags(op_flags) 78 78 79 - #define show_bio_op(op) \ 80 - __print_symbolic(op, \ 81 - { REQ_OP_READ, "READ" }, \ 82 - { REQ_OP_WRITE, "WRITE" }, \ 83 - { REQ_OP_FLUSH, "FLUSH" }, \ 84 - { REQ_OP_DISCARD, "DISCARD" }, \ 85 - { REQ_OP_SECURE_ERASE, "SECURE_ERASE" }, \ 86 - { REQ_OP_ZONE_RESET, "ZONE_RESET" }, \ 87 - { REQ_OP_WRITE_SAME, "WRITE_SAME" }, \ 88 - { REQ_OP_WRITE_ZEROES, "WRITE_ZEROES" }) 79 + #define show_bio_op(op) blk_op_str(op) 89 80 90 81 #define show_bio_op_flags(flags) \ 91 82 __print_flags(F2FS_BIO_FLAG_MASK(flags), "|", \

+21

include/uapi/linux/sed-opal.h

··· 20 20 OPAL_MBR_DISABLE = 0x01, 21 21 }; 22 22 23 + enum opal_mbr_done_flag { 24 + OPAL_MBR_NOT_DONE = 0x0, 25 + OPAL_MBR_DONE = 0x01 26 + }; 27 + 23 28 enum opal_user { 24 29 OPAL_ADMIN1 = 0x0, 25 30 OPAL_USER1 = 0x01, ··· 100 95 __u8 __align[7]; 101 96 }; 102 97 98 + struct opal_mbr_done { 99 + struct opal_key key; 100 + __u8 done_flag; 101 + __u8 __align[7]; 102 + }; 103 + 104 + struct opal_shadow_mbr { 105 + struct opal_key key; 106 + const __u64 data; 107 + __u64 offset; 108 + __u64 size; 109 + }; 110 + 103 111 #define IOC_OPAL_SAVE _IOW('p', 220, struct opal_lock_unlock) 104 112 #define IOC_OPAL_LOCK_UNLOCK _IOW('p', 221, struct opal_lock_unlock) 105 113 #define IOC_OPAL_TAKE_OWNERSHIP _IOW('p', 222, struct opal_key) ··· 125 107 #define IOC_OPAL_ENABLE_DISABLE_MBR _IOW('p', 229, struct opal_mbr_data) 126 108 #define IOC_OPAL_ERASE_LR _IOW('p', 230, struct opal_session_info) 127 109 #define IOC_OPAL_SECURE_ERASE_LR _IOW('p', 231, struct opal_session_info) 110 + #define IOC_OPAL_PSID_REVERT_TPR _IOW('p', 232, struct opal_key) 111 + #define IOC_OPAL_MBR_DONE _IOW('p', 233, struct opal_mbr_done) 112 + #define IOC_OPAL_WRITE_SHADOW_MBR _IOW('p', 234, struct opal_shadow_mbr) 128 113 129 114 #endif /* _UAPI_SED_OPAL_H */

-8

init/Kconfig

··· 852 852 853 853 See Documentation/cgroup-v1/blkio-controller.rst for more information. 854 854 855 - config DEBUG_BLK_CGROUP 856 - bool "IO controller debugging" 857 - depends on BLK_CGROUP 858 - default n 859 - ---help--- 860 - Enable some debugging help. Currently it exports additional stat 861 - files in a cgroup which can be useful for debugging. 862 - 863 855 config CGROUP_WRITEBACK 864 856 bool 865 857 depends on MEMCG && BLK_CGROUP

+1

kernel/cgroup/cgroup.c

··· 4226 4226 4227 4227 return NULL; 4228 4228 } 4229 + EXPORT_SYMBOL_GPL(css_next_descendant_pre); 4229 4230 4230 4231 /** 4231 4232 * css_rightmost_descendant - return the rightmost descendant of a css

+3 -7

lib/sbitmap.c

··· 26 26 /* 27 27 * First get a stable cleared mask, setting the old mask to 0. 28 28 */ 29 - do { 30 - mask = sb->map[index].cleared; 31 - } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask); 29 + mask = xchg(&sb->map[index].cleared, 0); 32 30 33 31 /* 34 32 * Now clear the masked bits in our free word ··· 514 516 struct sbq_wait_state *ws = &sbq->ws[wake_index]; 515 517 516 518 if (waitqueue_active(&ws->wait)) { 517 - int o = atomic_read(&sbq->wake_index); 518 - 519 - if (wake_index != o) 520 - atomic_cmpxchg(&sbq->wake_index, o, wake_index); 519 + if (wake_index != atomic_read(&sbq->wake_index)) 520 + atomic_set(&sbq->wake_index, wake_index); 521 521 return ws; 522 522 } 523 523
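
Note on the first lib/sbitmap.c hunk above: the removed loop read the `cleared` mask and retried a compare-and-swap until it managed to store 0, while the replacement does the same "fetch old value, store 0" step with a single atomic exchange. The sketch below shows the two patterns side by side in user space with C11 atomics instead of the kernel's `cmpxchg()`/`xchg()`. The global `cleared` and both function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for sb->map[index].cleared. */
static _Atomic unsigned long cleared;

/* Old pattern: load the mask, then retry compare-and-swap until it sticks. */
static unsigned long take_mask_cmpxchg(void)
{
	unsigned long mask = atomic_load(&cleared);

	/* On failure, mask is refreshed with the current value and we retry. */
	while (!atomic_compare_exchange_weak(&cleared, &mask, 0UL))
		;
	return mask;
}

/* New pattern: one atomic exchange returns the old mask and stores 0. */
static unsigned long take_mask_xchg(void)
{
	return atomic_exchange(&cleared, 0UL);
}
```

Both functions return the bits that were set and leave the word at 0; the exchange version simply has no retry loop.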
