Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block

+97

Documentation/admin-guide/cgroup-v2.rst

··· 1469 1469 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 1470 1470 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 1471 1471 1472 + io.cost.qos 1473 + A read-write nested-keyed file which exists only on the root 1474 + cgroup. 1475 + 1476 + This file configures the Quality of Service of the IO cost 1477 + model based controller (CONFIG_BLK_CGROUP_IOCOST) which 1478 + currently implements "io.weight" proportional control. Lines 1479 + are keyed by $MAJ:$MIN device numbers and not ordered. The 1480 + line for a given device is populated on the first write for 1481 + the device on "io.cost.qos" or "io.cost.model". The following 1482 + nested keys are defined. 1483 + 1484 + ====== ===================================== 1485 + enable Weight-based control enable 1486 + ctrl "auto" or "user" 1487 + rpct Read latency percentile [0, 100] 1488 + rlat Read latency threshold 1489 + wpct Write latency percentile [0, 100] 1490 + wlat Write latency threshold 1491 + min Minimum scaling percentage [1, 10000] 1492 + max Maximum scaling percentage [1, 10000] 1493 + ====== ===================================== 1494 + 1495 + The controller is disabled by default and can be enabled by 1496 + setting "enable" to 1. "rpct" and "wpct" parameters default 1497 + to zero and the controller uses internal device saturation 1498 + state to adjust the overall IO rate between "min" and "max". 1499 + 1500 + When a better control quality is needed, latency QoS 1501 + parameters can be configured. For example:: 1502 + 1503 + 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0 1504 + 1505 + shows that on sdb, the controller is enabled, will consider 1506 + the device saturated if the 95th percentile of read completion 1507 + latencies is above 75ms or write 150ms, and adjust the overall 1508 + IO issue rate between 50% and 150% accordingly. 
1509 + 1510 + The lower the saturation point, the better the latency QoS at 1511 + the cost of aggregate bandwidth. The narrower the allowed 1512 + adjustment range between "min" and "max", the more conformant 1513 + to the cost model the IO behavior. Note that the IO issue 1514 + base rate may be far off from 100% and setting "min" and "max" 1515 + blindly can lead to a significant loss of device capacity or 1516 + control quality. "min" and "max" are useful for regulating 1517 + devices which show wide temporary behavior changes - e.g. a 1518 + ssd which accepts writes at the line speed for a while and 1519 + then completely stalls for multiple seconds. 1520 + 1521 + When "ctrl" is "auto", the parameters are controlled by the 1522 + kernel and may change automatically. Setting "ctrl" to "user" 1523 + or setting any of the percentile and latency parameters puts 1524 + it into "user" mode and disables the automatic changes. The 1525 + automatic mode can be restored by setting "ctrl" to "auto". 1526 + 1527 + io.cost.model 1528 + A read-write nested-keyed file which exists only on the root 1529 + cgroup. 1530 + 1531 + This file configures the cost model of the IO cost model based 1532 + controller (CONFIG_BLK_CGROUP_IOCOST) which currently 1533 + implements "io.weight" proportional control. Lines are keyed 1534 + by $MAJ:$MIN device numbers and not ordered. The line for a 1535 + given device is populated on the first write for the device on 1536 + "io.cost.qos" or "io.cost.model". The following nested keys 1537 + are defined. 1538 + 1539 + ===== ================================ 1540 + ctrl "auto" or "user" 1541 + model The cost model in use - "linear" 1542 + ===== ================================ 1543 + 1544 + When "ctrl" is "auto", the kernel may change all parameters 1545 + dynamically. When "ctrl" is set to "user" or any other 1546 + parameters are written to, "ctrl" becomes "user" and the 1547 + automatic changes are disabled. 
1548 + 1549 + When "model" is "linear", the following model parameters are 1550 + defined. 1551 + 1552 + ============= ======================================== 1553 + [r|w]bps The maximum sequential IO throughput 1554 + [r|w]seqiops The maximum 4k sequential IOs per second 1555 + [r|w]randiops The maximum 4k random IOs per second 1556 + ============= ======================================== 1557 + 1558 + From the above, the builtin linear model determines the base 1559 + costs of a sequential and random IO and the cost coefficient 1560 + for the IO size. While simple, this model can cover most 1561 + common device classes acceptably. 1562 + 1563 + The IO cost model isn't expected to be accurate in absolute 1564 + sense and is scaled to the device behavior dynamically. 1565 + 1566 + If needed, tools/cgroup/iocost_coef_gen.py can be used to 1567 + generate device-specific coefficients. 1568 + 1472 1569 io.weight 1473 1570 A read-write flat-keyed file which exists on non-root cgroups. 1474 1571 The default is "default 100".
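To make the "io.cost.qos" line format documented above concrete, here is a minimal userspace sketch in C that parses one such line and applies the saturation rule from the example (p95 read latency above 75ms, or write above 150ms). The type `struct qos_line` and the functions `parse_qos_line()` and `qos_is_saturated()` are hypothetical names invented for illustration, not kernel interfaces. The sketch also assumes the latency thresholds are in microseconds, which is inferred from the documentation equating `rlat=75000` with 75ms.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace mirror of one io.cost.qos line; not a kernel type. */
struct qos_line {
	unsigned int dev_major, dev_minor;
	int enable;
	char ctrl[8];			/* "auto" or "user" */
	double rpct, wpct;		/* latency percentiles, [0, 100] */
	unsigned long rlat_us, wlat_us;	/* thresholds, assumed microseconds */
	double min_pct, max_pct;	/* scaling bounds, [1, 10000] */
};

/* Parse "MAJ:MIN key=val ..." into @out. Returns 0 on success, -1 on error. */
static int parse_qos_line(const char *line, struct qos_line *out)
{
	char buf[256], *tok;
	int off = 0;

	assert(line && out);
	memset(out, 0, sizeof(*out));

	/* %n records how many characters the MAJ:MIN prefix consumed. */
	if (sscanf(line, "%u:%u%n", &out->dev_major, &out->dev_minor, &off) != 2)
		return -1;
	if (strlen(line + off) >= sizeof(buf))
		return -1;
	strcpy(buf, line + off);

	for (tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
		char *val = strchr(tok, '=');

		if (!val)
			return -1;
		*val++ = '\0';

		if (!strcmp(tok, "enable"))
			out->enable = atoi(val);
		else if (!strcmp(tok, "ctrl"))
			snprintf(out->ctrl, sizeof(out->ctrl), "%s", val);
		else if (!strcmp(tok, "rpct"))
			out->rpct = strtod(val, NULL);
		else if (!strcmp(tok, "rlat"))
			out->rlat_us = strtoul(val, NULL, 10);
		else if (!strcmp(tok, "wpct"))
			out->wpct = strtod(val, NULL);
		else if (!strcmp(tok, "wlat"))
			out->wlat_us = strtoul(val, NULL, 10);
		else if (!strcmp(tok, "min"))
			out->min_pct = strtod(val, NULL);
		else if (!strcmp(tok, "max"))
			out->max_pct = strtod(val, NULL);
		else
			return -1;	/* unknown key */
	}
	return 0;
}

/*
 * Saturation rule from the documented example: the device counts as
 * saturated when the measured percentile latency exceeds either threshold.
 * The caller is assumed to supply latencies at the rpct/wpct percentiles.
 */
static int qos_is_saturated(const struct qos_line *q,
			    unsigned long read_lat_us,
			    unsigned long write_lat_us)
{
	return read_lat_us > q->rlat_us || write_lat_us > q->wlat_us;
}
```

Calling `parse_qos_line()` on the documented example line should fill in device 8:16 with `rlat_us` of 75000 and `wlat_us` of 150000. A tool that wants to change the configuration would write a line of the same shape back to the file on the root cgroup.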

-6

Documentation/admin-guide/kernel-parameters.txt

··· 1201 1201 See comment before function elanfreq_setup() in 1202 1202 arch/x86/kernel/cpu/cpufreq/elanfreq.c. 1203 1203 1204 - elevator= [IOSCHED] 1205 - Format: { "mq-deadline" | "kyber" | "bfq" } 1206 - See Documentation/block/deadline-iosched.rst, 1207 - Documentation/block/kyber-iosched.rst and 1208 - Documentation/block/bfq-iosched.rst for details. 1209 - 1210 1204 elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390] 1211 1205 Specifies physical address of start of kernel core 1212 1206 image elf header and optionally the size. Generally

+3 -5

Documentation/admin-guide/kernel-per-CPU-kthreads.rst

··· 274 274 (based on an earlier one from Gilad Ben-Yossef) that 275 275 reduces or even eliminates vmstat overhead for some 276 276 workloads at https://lkml.org/lkml/2013/9/4/379. 277 - e. Boot with "elevator=noop" to avoid workqueue use by 278 - the block layer. 279 - f. If running on high-end powerpc servers, build with 277 + e. If running on high-end powerpc servers, build with 280 278 CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS 281 279 daemon from running on each CPU every second or so. 282 280 (This will require editing Kconfig files and will defeat ··· 282 284 due to the rtas_event_scan() function. 283 285 WARNING: Please check your CPU specifications to 284 286 make sure that this is safe on your particular system. 285 - g. If running on Cell Processor, build your kernel with 287 + f. If running on Cell Processor, build your kernel with 286 288 CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from 287 289 spu_gov_work(). 288 290 WARNING: Please check your CPU specifications to 289 291 make sure that this is safe on your particular system. 290 - h. If running on PowerMAC, build your kernel with 292 + g. If running on PowerMAC, build your kernel with 291 293 CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, 292 294 avoiding OS jitter from rackmeter_do_timer(). 293 295

+18 -15

Documentation/block/null_blk.rst

··· 1 + .. SPDX-License-Identifier: GPL-2.0 2 + 1 3 ======================== 2 4 Null block device driver 3 5 ======================== 4 6 5 - 1. Overview 6 - =========== 7 + Overview 8 + ======== 7 9 8 - The null block device (/dev/nullb*) is used for benchmarking the various 10 + The null block device (``/dev/nullb*``) is used for benchmarking the various 9 11 block-layer implementations. It emulates a block device of X gigabytes in size. 10 - The following instances are possible: 11 - 12 - Single-queue block-layer 13 - 14 - - Request-based. 15 - - Single submission queue per device. 16 - - Implements IO scheduling algorithms (CFQ, Deadline, noop). 12 + It does not execute any read/write operation, just marks them as complete in 13 + the request queue. The following instances are possible: 17 14 18 15 Multi-queue block-layer 19 16 ··· 24 27 25 28 All of them have a completion queue for each core in the system. 26 29 27 - 2. Module parameters applicable for all instances 28 - ================================================= 30 + Module parameters 31 + ================= 29 32 30 33 queue_mode=[0-2]: Default: 2-Multi-queue 31 34 Selects which block-layer the module should instantiate with. 32 35 33 36 = ============ 34 37 0 Bio-based 35 - 1 Single-queue 38 + 1 Single-queue (deprecated) 36 39 2 Multi-queue 37 40 = ============ 38 41 ··· 64 67 completion_nsec=[ns]: Default: 10,000ns 65 68 Combined with irqmode=2 (timer). The time each completion event must wait. 66 69 67 - submit_queues=[1..nr_cpus]: 70 + submit_queues=[1..nr_cpus]: Default: 1 68 71 The number of submission queues attached to the device driver. If unset, it 69 72 defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module 70 73 parameter is 1. ··· 72 75 hw_queue_depth=[0..qdepth]: Default: 64 73 76 The hardware queue depth of the device. 
74 77 75 - III: Multi-queue specific parameters 78 + Multi-queue specific parameters 79 + ------------------------------- 76 80 77 81 use_per_node_hctx=[0/1]: Default: 0 82 + Number of hardware context queues. 78 83 79 84 = ===================================================================== 80 85 0 The number of submit queues are set to the value of the submit_queues ··· 86 87 = ===================================================================== 87 88 88 89 no_sched=[0/1]: Default: 0 90 + Enable/disable the io scheduler. 89 91 90 92 = ====================================== 91 93 0 nullb* use default blk-mq io scheduler ··· 94 94 = ====================================== 95 95 96 96 blocking=[0/1]: Default: 0 97 + Blocking behavior of the request queue. 97 98 98 99 = =============================================================== 99 100 0 Register as a non-blocking blk-mq driver device. ··· 104 103 = =============================================================== 105 104 106 105 shared_tags=[0/1]: Default: 0 106 + Sharing tags between devices. 107 107 108 108 = ================================================================ 109 109 0 Tag set is not shared. ··· 113 111 = ================================================================ 114 112 115 113 zoned=[0/1]: Default: 0 114 + Device is a random-access or a zoned block device. 116 115 117 116 = ====================================================================== 118 117 0 Block device is exposed as a random-access block device.

-4

Documentation/block/switching-sched.rst

··· 2 2 Switching Scheduler 3 3 =================== 4 4 5 - To choose IO schedulers at boot time, use the argument 'elevator=deadline'. 6 - 'noop' and 'cfq' (the default) are also available. IO schedulers are assigned 7 - globally at boot time only presently. 8 - 9 5 Each io queue has a set of io scheduler tunables associated with it. These 10 6 tunables control how the io scheduler works. You can find these entries 11 7 in::

+13

block/Kconfig

··· 26 26 27 27 if BLOCK 28 28 29 + config BLK_RQ_ALLOC_TIME 30 + bool 31 + 29 32 config BLK_SCSI_REQUEST 30 33 bool 31 34 ··· 134 131 target than the victimized group. 135 132 136 133 Note, this is an experimental interface and could be changed someday. 134 + 135 + config BLK_CGROUP_IOCOST 136 + bool "Enable support for cost model based cgroup IO controller" 137 + depends on BLK_CGROUP=y 138 + select BLK_RQ_ALLOC_TIME 139 + ---help--- 140 + Enabling this option enables the .weight interface for cost 141 + model based proportional IO control. The IO controller 142 + distributes IO capacity between different groups based on 143 + their share of the overall weight distribution. 137 144 138 145 config BLK_WBT_MQ 139 146 bool "Multiqueue writeback throttling"

+1

block/Makefile

··· 18 18 obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o 19 19 obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o 20 20 obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o 21 + obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o 21 22 obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o 22 23 obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o 23 24 bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o

+117 -39

block/bfq-cgroup.c

··· 501 501 kfree(cpd_to_bfqgd(cpd)); 502 502 } 503 503 504 - static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node) 504 + static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, struct request_queue *q, 505 + struct blkcg *blkcg) 505 506 { 506 507 struct bfq_group *bfqg; 507 508 508 - bfqg = kzalloc_node(sizeof(*bfqg), gfp, node); 509 + bfqg = kzalloc_node(sizeof(*bfqg), gfp, q->node); 509 510 if (!bfqg) 510 511 return NULL; 511 512 ··· 905 904 bfq_end_wr_async_queues(bfqd, bfqd->root_group); 906 905 } 907 906 908 - static int bfq_io_show_weight(struct seq_file *sf, void *v) 907 + static int bfq_io_show_weight_legacy(struct seq_file *sf, void *v) 909 908 { 910 909 struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); 911 910 struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); ··· 917 916 seq_printf(sf, "%u\n", val); 918 917 919 918 return 0; 919 + } 920 + 921 + static u64 bfqg_prfill_weight_device(struct seq_file *sf, 922 + struct blkg_policy_data *pd, int off) 923 + { 924 + struct bfq_group *bfqg = pd_to_bfqg(pd); 925 + 926 + if (!bfqg->entity.dev_weight) 927 + return 0; 928 + return __blkg_prfill_u64(sf, pd, bfqg->entity.dev_weight); 929 + } 930 + 931 + static int bfq_io_show_weight(struct seq_file *sf, void *v) 932 + { 933 + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); 934 + struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); 935 + 936 + seq_printf(sf, "default %u\n", bfqgd->weight); 937 + blkcg_print_blkgs(sf, blkcg, bfqg_prfill_weight_device, 938 + &blkcg_policy_bfq, 0, false); 939 + return 0; 940 + } 941 + 942 + static void bfq_group_set_weight(struct bfq_group *bfqg, u64 weight, u64 dev_weight) 943 + { 944 + weight = dev_weight ?: weight; 945 + 946 + bfqg->entity.dev_weight = dev_weight; 947 + /* 948 + * Setting the prio_changed flag of the entity 949 + * to 1 with new_weight == weight would re-set 950 + * the value of the weight to its ioprio mapping. 951 + * Set the flag only if necessary. 
952 + */ 953 + if ((unsigned short)weight != bfqg->entity.new_weight) { 954 + bfqg->entity.new_weight = (unsigned short)weight; 955 + /* 956 + * Make sure that the above new value has been 957 + * stored in bfqg->entity.new_weight before 958 + * setting the prio_changed flag. In fact, 959 + * this flag may be read asynchronously (in 960 + * critical sections protected by a different 961 + * lock than that held here), and finding this 962 + * flag set may cause the execution of the code 963 + * for updating parameters whose value may 964 + * depend also on bfqg->entity.new_weight (in 965 + * __bfq_entity_update_weight_prio). 966 + * This barrier makes sure that the new value 967 + * of bfqg->entity.new_weight is correctly 968 + * seen in that code. 969 + */ 970 + smp_wmb(); 971 + bfqg->entity.prio_changed = 1; 972 + } 920 973 } 921 974 922 975 static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css, ··· 991 936 hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) { 992 937 struct bfq_group *bfqg = blkg_to_bfqg(blkg); 993 938 994 - if (!bfqg) 995 - continue; 996 - /* 997 - * Setting the prio_changed flag of the entity 998 - * to 1 with new_weight == weight would re-set 999 - * the value of the weight to its ioprio mapping. 1000 - * Set the flag only if necessary. 1001 - */ 1002 - if ((unsigned short)val != bfqg->entity.new_weight) { 1003 - bfqg->entity.new_weight = (unsigned short)val; 1004 - /* 1005 - * Make sure that the above new value has been 1006 - * stored in bfqg->entity.new_weight before 1007 - * setting the prio_changed flag. In fact, 1008 - * this flag may be read asynchronously (in 1009 - * critical sections protected by a different 1010 - * lock than that held here), and finding this 1011 - * flag set may cause the execution of the code 1012 - * for updating parameters whose value may 1013 - * depend also on bfqg->entity.new_weight (in 1014 - * __bfq_entity_update_weight_prio). 
1015 - * This barrier makes sure that the new value 1016 - * of bfqg->entity.new_weight is correctly 1017 - * seen in that code. 1018 - */ 1019 - smp_wmb(); 1020 - bfqg->entity.prio_changed = 1; 1021 - } 939 + if (bfqg) 940 + bfq_group_set_weight(bfqg, val, 0); 1022 941 } 1023 942 spin_unlock_irq(&blkcg->lock); 1024 943 1025 944 return ret; 1026 945 } 1027 946 947 + static ssize_t bfq_io_set_device_weight(struct kernfs_open_file *of, 948 + char *buf, size_t nbytes, 949 + loff_t off) 950 + { 951 + int ret; 952 + struct blkg_conf_ctx ctx; 953 + struct blkcg *blkcg = css_to_blkcg(of_css(of)); 954 + struct bfq_group *bfqg; 955 + u64 v; 956 + 957 + ret = blkg_conf_prep(blkcg, &blkcg_policy_bfq, buf, &ctx); 958 + if (ret) 959 + return ret; 960 + 961 + if (sscanf(ctx.body, "%llu", &v) == 1) { 962 + /* require "default" on dfl */ 963 + ret = -ERANGE; 964 + if (!v) 965 + goto out; 966 + } else if (!strcmp(strim(ctx.body), "default")) { 967 + v = 0; 968 + } else { 969 + ret = -EINVAL; 970 + goto out; 971 + } 972 + 973 + bfqg = blkg_to_bfqg(ctx.blkg); 974 + 975 + ret = -ERANGE; 976 + if (!v || (v >= BFQ_MIN_WEIGHT && v <= BFQ_MAX_WEIGHT)) { 977 + bfq_group_set_weight(bfqg, bfqg->entity.weight, v); 978 + ret = 0; 979 + } 980 + out: 981 + blkg_conf_finish(&ctx); 982 + return ret ?: nbytes; 983 + } 984 + 1028 985 static ssize_t bfq_io_set_weight(struct kernfs_open_file *of, 1029 986 char *buf, size_t nbytes, 1030 987 loff_t off) 1031 988 { 1032 - u64 weight; 1033 - /* First unsigned long found in the file is used */ 1034 - int ret = kstrtoull(strim(buf), 0, &weight); 989 + char *endp; 990 + int ret; 991 + u64 v; 1035 992 1036 - if (ret) 1037 - return ret; 993 + buf = strim(buf); 1038 994 1039 - ret = bfq_io_set_weight_legacy(of_css(of), NULL, weight); 1040 - return ret ?: nbytes; 995 + /* "WEIGHT" or "default WEIGHT" sets the default weight */ 996 + v = simple_strtoull(buf, &endp, 0); 997 + if (*endp == '\0' || sscanf(buf, "default %llu", &v) == 1) { 998 + ret = 
bfq_io_set_weight_legacy(of_css(of), NULL, v); 999 + return ret ?: nbytes; 1000 + } 1001 + 1002 + return bfq_io_set_device_weight(of, buf, nbytes, off); 1041 1003 } 1042 1004 1043 1005 #ifdef CONFIG_BFQ_CGROUP_DEBUG ··· 1213 1141 { 1214 1142 .name = "bfq.weight", 1215 1143 .flags = CFTYPE_NOT_ON_ROOT, 1216 - .seq_show = bfq_io_show_weight, 1144 + .seq_show = bfq_io_show_weight_legacy, 1217 1145 .write_u64 = bfq_io_set_weight_legacy, 1146 + }, 1147 + { 1148 + .name = "bfq.weight_device", 1149 + .flags = CFTYPE_NOT_ON_ROOT, 1150 + .seq_show = bfq_io_show_weight, 1151 + .write = bfq_io_set_weight, 1218 1152 }, 1219 1153 1220 1154 /* statistics, covers only the tasks in the bfqg */
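The new `bfq_io_set_weight()` above accepts three spellings: a bare `WEIGHT` or `default WEIGHT` sets the cgroup-wide default, `MAJ:MIN WEIGHT` sets a per-device weight, and `MAJ:MIN default` clears the per-device override. The following userspace C sketch mirrors only that dispatch so the accepted inputs are easy to see. `classify_weight_write()`, `struct weight_req`, and the `SKETCH_*` bounds are hypothetical: the real limits are `BFQ_MIN_WEIGHT` and `BFQ_MAX_WEIGHT`, whose values this diff does not show, so 1 and 1000 are assumed here. Unlike the kernel code, the sketch rejects an empty string and does not range-check the default weight.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed bounds for illustration only; see BFQ_MIN_WEIGHT/BFQ_MAX_WEIGHT. */
#define SKETCH_MIN_WEIGHT 1ULL
#define SKETCH_MAX_WEIGHT 1000ULL

enum weight_kind {
	WEIGHT_DEFAULT,		/* "WEIGHT" or "default WEIGHT" */
	WEIGHT_DEVICE,		/* "MAJ:MIN WEIGHT" */
	WEIGHT_DEVICE_RESET,	/* "MAJ:MIN default" */
	WEIGHT_INVALID,
};

struct weight_req {
	unsigned int dev_major, dev_minor;
	unsigned long long weight;
};

/* @buf is assumed to be already trimmed, as strim() does in the kernel. */
static enum weight_kind classify_weight_write(const char *buf,
					      struct weight_req *req)
{
	const char *body;
	char *endp;
	unsigned long long v;
	int off = 0;

	assert(buf && req);
	memset(req, 0, sizeof(*req));

	/* A whole-string number or "default N" sets the default weight. */
	v = strtoull(buf, &endp, 0);
	if ((endp != buf && *endp == '\0') ||
	    sscanf(buf, "default %llu", &v) == 1) {
		req->weight = v;
		return WEIGHT_DEFAULT;
	}

	/* Otherwise the input must start with a MAJ:MIN device prefix. */
	if (sscanf(buf, "%u:%u%n", &req->dev_major, &req->dev_minor, &off) != 2)
		return WEIGHT_INVALID;
	body = buf + off;
	if (!isspace((unsigned char)*body))
		return WEIGHT_INVALID;
	while (isspace((unsigned char)*body))
		body++;

	if (sscanf(body, "%llu", &v) == 1) {
		/* A device weight of 0 is rejected; "default" must be spelled out. */
		if (v < SKETCH_MIN_WEIGHT || v > SKETCH_MAX_WEIGHT)
			return WEIGHT_INVALID;
		req->weight = v;
		return WEIGHT_DEVICE;
	}
	if (!strcmp(body, "default"))
		return WEIGHT_DEVICE_RESET;
	return WEIGHT_INVALID;
}
```

In the kernel the per-device branch goes through `blkg_conf_prep()`, which also resolves the device and takes the needed locks; none of that is modeled here.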

+3

block/bfq-iosched.h

··· 168 168 /* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */ 169 169 int budget; 170 170 171 + /* device weight, if non-zero, it overrides the default weight of 172 + * bfq_group_data */ 173 + int dev_weight; 171 174 /* weight of the queue */ 172 175 int weight; 173 176 /* next weight if a change is in progress */

+2

block/bfq-wf2q.c

··· 744 744 } 745 745 #endif 746 746 747 + /* Matches the smp_wmb() in bfq_group_set_weight. */ 748 + smp_rmb(); 747 749 old_st->wsum -= entity->weight; 748 750 749 751 if (entity->new_weight != entity->orig_weight) {

+24 -36

block/bio.c

··· 646 646 return true; 647 647 } 648 648 649 - /* 650 - * Check if the @page can be added to the current segment(@bv), and make 651 - * sure to call it only if page_is_mergeable(@bv, @page) is true 652 - */ 653 - static bool can_add_page_to_seg(struct request_queue *q, 654 - struct bio_vec *bv, struct page *page, unsigned len, 655 - unsigned offset) 649 + static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio, 650 + struct page *page, unsigned len, unsigned offset, 651 + bool *same_page) 656 652 { 653 + struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; 657 654 unsigned long mask = queue_segment_boundary(q); 658 655 phys_addr_t addr1 = page_to_phys(bv->bv_page) + bv->bv_offset; 659 656 phys_addr_t addr2 = page_to_phys(page) + offset + len - 1; 660 657 661 658 if ((addr1 | mask) != (addr2 | mask)) 662 659 return false; 663 - 664 660 if (bv->bv_len + len > queue_max_segment_size(q)) 665 661 return false; 666 - 667 - return true; 662 + return __bio_try_merge_page(bio, page, len, offset, same_page); 668 663 } 669 664 670 665 /** ··· 669 674 * @page: page to add 670 675 * @len: vec entry length 671 676 * @offset: vec entry offset 672 - * @put_same_page: put the page if it is same with last added page 677 + * @same_page: return if the merge happen inside the same page 673 678 * 674 679 * Attempt to add a page to the bio_vec maplist. 
This can fail for a 675 680 * number of reasons, such as the bio being full or target block device ··· 680 685 */ 681 686 static int __bio_add_pc_page(struct request_queue *q, struct bio *bio, 682 687 struct page *page, unsigned int len, unsigned int offset, 683 - bool put_same_page) 688 + bool *same_page) 684 689 { 685 690 struct bio_vec *bvec; 686 - bool same_page = false; 687 691 688 692 /* 689 693 * cloned bio must not modify vec list ··· 694 700 return 0; 695 701 696 702 if (bio->bi_vcnt > 0) { 697 - bvec = &bio->bi_io_vec[bio->bi_vcnt - 1]; 698 - 699 - if (page == bvec->bv_page && 700 - offset == bvec->bv_offset + bvec->bv_len) { 701 - if (put_same_page) 702 - put_page(page); 703 - bvec->bv_len += len; 704 - goto done; 705 - } 703 + if (bio_try_merge_pc_page(q, bio, page, len, offset, same_page)) 704 + return len; 706 705 707 706 /* 708 - * If the queue doesn't support SG gaps and adding this 709 - * offset would create a gap, disallow it. 707 + * If the queue doesn't support SG gaps and adding this segment 708 + * would create a gap, disallow it. 
710 709 */ 710 + bvec = &bio->bi_io_vec[bio->bi_vcnt - 1]; 711 711 if (bvec_gap_to_prev(q, bvec, offset)) 712 712 return 0; 713 - 714 - if (page_is_mergeable(bvec, page, len, offset, &same_page) && 715 - can_add_page_to_seg(q, bvec, page, len, offset)) { 716 - bvec->bv_len += len; 717 - goto done; 718 - } 719 713 } 720 714 721 715 if (bio_full(bio, len)) ··· 717 735 bvec->bv_len = len; 718 736 bvec->bv_offset = offset; 719 737 bio->bi_vcnt++; 720 - done: 721 738 bio->bi_iter.bi_size += len; 722 739 return len; 723 740 } ··· 724 743 int bio_add_pc_page(struct request_queue *q, struct bio *bio, 725 744 struct page *page, unsigned int len, unsigned int offset) 726 745 { 727 - return __bio_add_pc_page(q, bio, page, len, offset, false); 746 + bool same_page = false; 747 + return __bio_add_pc_page(q, bio, page, len, offset, &same_page); 728 748 } 729 749 EXPORT_SYMBOL(bio_add_pc_page); 730 750 ··· 788 806 789 807 bio->bi_iter.bi_size += len; 790 808 bio->bi_vcnt++; 809 + 810 + if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page))) 811 + bio_set_flag(bio, BIO_WORKINGSET); 791 812 } 792 813 EXPORT_SYMBOL_GPL(__bio_add_page); 793 814 ··· 1369 1384 for (j = 0; j < npages; j++) { 1370 1385 struct page *page = pages[j]; 1371 1386 unsigned int n = PAGE_SIZE - offs; 1387 + bool same_page = false; 1372 1388 1373 1389 if (n > bytes) 1374 1390 n = bytes; 1375 1391 1376 1392 if (!__bio_add_pc_page(q, bio, page, n, offs, 1377 - true)) 1393 + &same_page)) { 1394 + if (same_page) 1395 + put_page(page); 1378 1396 break; 1397 + } 1379 1398 1380 1399 added += n; 1381 1400 bytes -= n; ··· 1510 1521 bio->bi_end_io = bio_map_kern_endio; 1511 1522 return bio; 1512 1523 } 1513 - EXPORT_SYMBOL(bio_map_kern); 1514 1524 1515 1525 static void bio_copy_kern_endio(struct bio *bio) 1516 1526 { ··· 1830 1842 * @bio, and updates @bio to represent the remaining sectors. 
1831 1843 * 1832 1844 * Unless this is a discard request the newly allocated bio will point 1833 - * to @bio's bi_io_vec; it is the caller's responsibility to ensure that 1834 - * @bio is not freed before the split. 1845 + * to @bio's bi_io_vec. It is the caller's responsibility to ensure that 1846 + * neither @bio nor @bs are freed before the split bio. 1835 1847 */ 1836 1848 struct bio *bio_split(struct bio *bio, int sectors, 1837 1849 gfp_t gfp, struct bio_set *bs)
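The two checks that `bio_try_merge_pc_page()` performs before delegating to `__bio_try_merge_page()` are plain address arithmetic, so they can be tried outside the kernel. The sketch below is a userspace approximation: `segment_can_grow()` is a hypothetical name, physical addresses are passed in directly instead of being derived with `page_to_phys()`, and the boundary mask and maximum segment size are caller-supplied values standing in for `queue_segment_boundary()` and `queue_max_segment_size()`. The contiguity test done by `__bio_try_merge_page()` is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * seg_start:     physical address of the first byte of the existing segment
 * seg_len:       current length of that segment in bytes
 * new_start:     physical address of the first byte to append
 * new_len:       number of bytes to append (must be non-zero)
 * boundary_mask: segment boundary mask, e.g. 0xffff for 64 KiB windows
 * max_seg_size:  largest segment size the queue accepts, in bytes
 */
static bool segment_can_grow(uint64_t seg_start, uint32_t seg_len,
			     uint64_t new_start, uint32_t new_len,
			     uint64_t boundary_mask, uint32_t max_seg_size)
{
	uint64_t new_end;

	assert(new_len > 0);
	new_end = new_start + new_len - 1;

	/*
	 * OR-ing with the mask maps every address to the last byte of its
	 * boundary window; different results mean the merged segment would
	 * straddle a boundary.
	 */
	if ((seg_start | boundary_mask) != (new_end | boundary_mask))
		return false;

	/* The merged segment must not exceed the per-segment size limit. */
	if ((uint64_t)seg_len + new_len > max_seg_size)
		return false;

	return true;
}
```

With a 64 KiB boundary (`0xffff`), a segment starting at `0x10000` can take another page at `0x11000`, while a segment starting at `0x1f000` cannot absorb a page at `0x20000` because the end address falls in the next window.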

+50 -23

block/blk-cgroup.c

··· 175 175 continue; 176 176 177 177 /* alloc per-policy data and attach it to blkg */ 178 - pd = pol->pd_alloc_fn(gfp_mask, q->node); 178 + pd = pol->pd_alloc_fn(gfp_mask, q, blkcg); 179 179 if (!pd) 180 180 goto err_free; 181 181 ··· 755 755 756 756 /** 757 757 * blkg_conf_prep - parse and prepare for per-blkg config update 758 + * @inputp: input string pointer 759 + * 760 + * Parse the device node prefix part, MAJ:MIN, of per-blkg config update 761 + * from @input and get and return the matching gendisk. *@inputp is 762 + * updated to point past the device node prefix. Returns an ERR_PTR() 763 + * value on error. 764 + * 765 + * Use this function iff blkg_conf_prep() can't be used for some reason. 766 + */ 767 + struct gendisk *blkcg_conf_get_disk(char **inputp) 768 + { 769 + char *input = *inputp; 770 + unsigned int major, minor; 771 + struct gendisk *disk; 772 + int key_len, part; 773 + 774 + if (sscanf(input, "%u:%u%n", &major, &minor, &key_len) != 2) 775 + return ERR_PTR(-EINVAL); 776 + 777 + input += key_len; 778 + if (!isspace(*input)) 779 + return ERR_PTR(-EINVAL); 780 + input = skip_spaces(input); 781 + 782 + disk = get_gendisk(MKDEV(major, minor), &part); 783 + if (!disk) 784 + return ERR_PTR(-ENODEV); 785 + if (part) { 786 + put_disk_and_module(disk); 787 + return ERR_PTR(-ENODEV); 788 + } 789 + 790 + *inputp = input; 791 + return disk; 792 + } 793 + 794 + /** 795 + * blkg_conf_prep - parse and prepare for per-blkg config update 758 796 * @blkcg: target block cgroup 759 797 * @pol: target policy 760 798 * @input: input string ··· 810 772 struct gendisk *disk; 811 773 struct request_queue *q; 812 774 struct blkcg_gq *blkg; 813 - unsigned int major, minor; 814 - int key_len, part, ret; 815 - char *body; 775 + int ret; 816 776 817 - if (sscanf(input, "%u:%u%n", &major, &minor, &key_len) != 2) 818 - return -EINVAL; 819 - 820 - body = input + key_len; 821 - if (!isspace(*body)) 822 - return -EINVAL; 823 - body = skip_spaces(body); 824 - 825 - disk = 
get_gendisk(MKDEV(major, minor), &part); 826 - if (!disk) 827 - return -ENODEV; 828 - if (part) { 829 - ret = -ENODEV; 830 - goto fail; 831 - } 777 + disk = blkcg_conf_get_disk(&input); 778 + if (IS_ERR(disk)) 779 + return PTR_ERR(disk); 832 780 833 781 q = disk->queue; 834 782 ··· 880 856 success: 881 857 ctx->disk = disk; 882 858 ctx->blkg = blkg; 883 - ctx->body = body; 859 + ctx->body = input; 884 860 return 0; 885 861 886 862 fail_unlock: ··· 900 876 } 901 877 return ret; 902 878 } 879 + EXPORT_SYMBOL_GPL(blkg_conf_prep); 903 880 904 881 /** 905 882 * blkg_conf_finish - finish up per-blkg config update ··· 916 891 rcu_read_unlock(); 917 892 put_disk_and_module(ctx->disk); 918 893 } 894 + EXPORT_SYMBOL_GPL(blkg_conf_finish); 919 895 920 896 static int blkcg_print_stat(struct seq_file *sf, void *v) 921 897 { ··· 1372 1346 blk_mq_freeze_queue(q); 1373 1347 pd_prealloc: 1374 1348 if (!pd_prealloc) { 1375 - pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q->node); 1349 + pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q, &blkcg_root); 1376 1350 if (!pd_prealloc) { 1377 1351 ret = -ENOMEM; 1378 1352 goto out_bypass_end; ··· 1388 1362 if (blkg->pd[pol->plid]) 1389 1363 continue; 1390 1364 1391 - pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q->node); 1365 + pd = pol->pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, q, &blkcg_root); 1392 1366 if (!pd) 1393 1367 swap(pd, pd_prealloc); 1394 1368 if (!pd) { ··· 1501 1475 blkcg->cpd[pol->plid] = cpd; 1502 1476 cpd->blkcg = blkcg; 1503 1477 cpd->plid = pol->plid; 1504 - pol->cpd_init_fn(cpd); 1478 + if (pol->cpd_init_fn) 1479 + pol->cpd_init_fn(cpd); 1505 1480 } 1506 1481 } 1507 1482

+34 -3

block/blk-core.c

··· 36 36 #include <linux/blk-cgroup.h> 37 37 #include <linux/debugfs.h> 38 38 #include <linux/bpf.h> 39 + #include <linux/psi.h> 39 40 40 41 #define CREATE_TRACE_POINTS 41 42 #include <trace/events/block.h> ··· 130 129 REQ_OP_NAME(DISCARD), 131 130 REQ_OP_NAME(SECURE_ERASE), 132 131 REQ_OP_NAME(ZONE_RESET), 132 + REQ_OP_NAME(ZONE_RESET_ALL), 133 133 REQ_OP_NAME(WRITE_SAME), 134 134 REQ_OP_NAME(WRITE_ZEROES), 135 135 REQ_OP_NAME(SCSI_IN), ··· 346 344 347 345 /* 348 346 * Drain all requests queued before DYING marking. Set DEAD flag to 349 - * prevent that q->request_fn() gets invoked after draining finished. 347 + * prevent that blk_mq_run_hw_queues() accesses the hardware queues 348 + * after draining finished. 350 349 */ 351 350 blk_freeze_queue(q); 352 351 ··· 482 479 if (!q) 483 480 return NULL; 484 481 485 - INIT_LIST_HEAD(&q->queue_head); 486 482 q->last_merge = NULL; 487 483 488 484 q->id = ida_simple_get(&blk_queue_ida, 0, 0, gfp_mask); ··· 520 518 mutex_init(&q->blk_trace_mutex); 521 519 #endif 522 520 mutex_init(&q->sysfs_lock); 521 + mutex_init(&q->sysfs_dir_lock); 523 522 spin_lock_init(&q->queue_lock); 524 523 525 524 init_waitqueue_head(&q->mq_freeze_wq); ··· 604 601 return false; 605 602 606 603 trace_block_bio_backmerge(req->q, req, bio); 604 + rq_qos_merge(req->q, req, bio); 607 605 608 606 if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) 609 607 blk_rq_set_mixed_merge(req); ··· 626 622 return false; 627 623 628 624 trace_block_bio_frontmerge(req->q, req, bio); 625 + rq_qos_merge(req->q, req, bio); 629 626 630 627 if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) 631 628 blk_rq_set_mixed_merge(req); ··· 651 646 if (blk_rq_sectors(req) + bio_sectors(bio) > 652 647 blk_rq_get_max_sectors(req, blk_rq_pos(req))) 653 648 goto no_merge; 649 + 650 + rq_qos_merge(q, req, bio); 654 651 655 652 req->biotail->bi_next = bio; 656 653 req->biotail = bio; ··· 938 931 if (!blk_queue_is_zoned(q)) 939 932 goto not_supported; 940 933 break; 934 + case 
REQ_OP_ZONE_RESET_ALL: 935 + if (!blk_queue_is_zoned(q) || !blk_queue_zone_resetall(q)) 936 + goto not_supported; 937 + break; 941 938 case REQ_OP_WRITE_ZEROES: 942 939 if (!q->limits.max_write_zeroes_sectors) 943 940 goto not_supported; ··· 1139 1128 */ 1140 1129 blk_qc_t submit_bio(struct bio *bio) 1141 1130 { 1131 + bool workingset_read = false; 1132 + unsigned long pflags; 1133 + blk_qc_t ret; 1134 + 1142 1135 if (blkcg_punt_bio_submit(bio)) 1143 1136 return BLK_QC_T_NONE; 1144 1137 ··· 1161 1146 if (op_is_write(bio_op(bio))) { 1162 1147 count_vm_events(PGPGOUT, count); 1163 1148 } else { 1149 + if (bio_flagged(bio, BIO_WORKINGSET)) 1150 + workingset_read = true; 1164 1151 task_io_account_read(bio->bi_iter.bi_size); 1165 1152 count_vm_events(PGPGIN, count); 1166 1153 } ··· 1177 1160 } 1178 1161 } 1179 1162 1180 - return generic_make_request(bio); 1163 + /* 1164 + * If we're reading data that is part of the userspace 1165 + * workingset, count submission time as memory stall. When the 1166 + * device is congested, or the submitting cgroup IO-throttled, 1167 + * submission can be a significant part of overall IO time. 1168 + */ 1169 + if (workingset_read) 1170 + psi_memstall_enter(&pflags); 1171 + 1172 + ret = generic_make_request(bio); 1173 + 1174 + if (workingset_read) 1175 + psi_memstall_leave(&pflags); 1176 + 1177 + return ret; 1181 1178 } 1182 1179 EXPORT_SYMBOL(submit_bio); 1183 1180

+2457

block/blk-iocost.c

··· 1 + /* SPDX-License-Identifier: GPL-2.0 2 + * 3 + * IO cost model based controller. 4 + * 5 + * Copyright (C) 2019 Tejun Heo <tj@kernel.org> 6 + * Copyright (C) 2019 Andy Newell <newella@fb.com> 7 + * Copyright (C) 2019 Facebook 8 + * 9 + * One challenge of controlling IO resources is the lack of trivially 10 + * observable cost metric. This is distinguished from CPU and memory where 11 + * wallclock time and the number of bytes can serve as accurate enough 12 + * approximations. 13 + * 14 + * Bandwidth and iops are the most commonly used metrics for IO devices but 15 + * depending on the type and specifics of the device, different IO patterns 16 + * easily lead to multiple orders of magnitude variations rendering them 17 + * useless for the purpose of IO capacity distribution. While on-device 18 + * time, with a lot of crutches, could serve as a useful approximation for 19 + * non-queued rotational devices, this is no longer viable with modern 20 + * devices, even the rotational ones. 21 + * 22 + * While there is no cost metric we can trivially observe, it isn't a 23 + * complete mystery. For example, on a rotational device, seek cost 24 + * dominates while a contiguous transfer contributes a smaller amount 25 + * proportional to the size. If we can characterize at least the relative 26 + * costs of these different types of IOs, it should be possible to 27 + * implement a reasonable work-conserving proportional IO resource 28 + * distribution. 29 + * 30 + * 1. IO Cost Model 31 + * 32 + * IO cost model estimates the cost of an IO given its basic parameters and 33 + * history (e.g. the end sector of the last IO). The cost is measured in 34 + * device time. If a given IO is estimated to cost 10ms, the device should 35 + * be able to process ~100 of those IOs in a second. 36 + * 37 + * Currently, there's only one builtin cost model - linear. Each IO is 38 + * classified as sequential or random and given a base cost accordingly. 
39 + * On top of that, a size cost proportional to the length of the IO is 40 + * added. While simple, this model captures the operational 41 + * characteristics of a wide variety of devices well enough. Default 42 + * parameters for several different classes of devices are provided and the 43 + * parameters can be configured from userspace via 44 + * /sys/fs/cgroup/io.cost.model. 45 + * 46 + * If needed, tools/cgroup/iocost_coef_gen.py can be used to generate 47 + * device-specific coefficients. 48 + * 49 + * If needed, tools/cgroup/iocost_coef_gen.py can be used to generate 50 + * device-specific coefficients. 51 + * 52 + * 2. Control Strategy 53 + * 54 + * The device virtual time (vtime) is used as the primary control metric. 55 + * The control strategy is composed of the following three parts. 56 + * 57 + * 2-1. Vtime Distribution 58 + * 59 + * When a cgroup becomes active in terms of IOs, its hierarchical share is 60 + * calculated. Please consider the following hierarchy where the numbers 61 + * inside parentheses denote the configured weights. 62 + * 63 + * root 64 + * / \ 65 + * A (w:100) B (w:300) 66 + * / \ 67 + * A0 (w:100) A1 (w:100) 68 + * 69 + * If B is idle and only A0 and A1 are actively issuing IOs, as the two are 70 + * of equal weight, each gets 50% share. If then B starts issuing IOs, B 71 + * gets 300/(100+300) or 75% share, and A0 and A1 equally split the rest, 72 + * 12.5% each. The distribution mechanism only cares about these flattened 73 + * shares. They're called hweights (hierarchical weights) and always add 74 + * up to 1 (HWEIGHT_WHOLE). 75 + * 76 + * A given cgroup's vtime runs slower in inverse proportion to its hweight. 77 + * For example, with 12.5% weight, A0's time runs 8 times slower (100/12.5) 78 + * against the device vtime - an IO which takes 10ms on the underlying 79 + * device is considered to take 80ms on A0. 80 + * 81 + * This constitutes the basis of IO capacity distribution.
Each cgroup's 82 + * vtime is running at a rate determined by its hweight. A cgroup tracks 83 + * the vtime consumed by past IOs and can issue a new IO iff doing so 84 + * wouldn't outrun the current device vtime. Otherwise, the IO is 85 + * suspended until the vtime has progressed enough to cover it. 86 + * 87 + * 2-2. Vrate Adjustment 88 + * 89 + * It's unrealistic to expect the cost model to be perfect. There are too 90 + * many devices and even on the same device the overall performance 91 + * fluctuates depending on numerous factors such as IO mixture and device 92 + * internal garbage collection. The controller needs to adapt dynamically. 93 + * 94 + * This is achieved by adjusting the overall IO rate according to how busy 95 + * the device is. If the device becomes overloaded, we're sending down too 96 + * many IOs and should generally slow down. If there are waiting issuers 97 + * but the device isn't saturated, we're issuing too few and should 98 + * generally speed up. 99 + * 100 + * To slow down, we lower the vrate - the rate at which the device vtime 101 + * passes compared to the wall clock. For example, if the vtime is running 102 + * at the vrate of 75%, all cgroups added up would only be able to issue 103 + * 750ms worth of IOs per second, and vice-versa for speeding up. 104 + * 105 + * Device business is determined using two criteria - rq wait and 106 + * completion latencies. 107 + * 108 + * When a device gets saturated, the on-device and then the request queues 109 + * fill up and a bio which is ready to be issued has to wait for a request 110 + * to become available. When this delay becomes noticeable, it's a clear 111 + * indication that the device is saturated and we lower the vrate. This 112 + * saturation signal is fairly conservative as it only triggers when both 113 + * hardware and software queues are filled up, and is used as the default 114 + * busy signal. 
115 + * 116 + * As devices can have deep queues and be unfair in how the queued commands 117 + * are executed, solely depending on rq wait may not result in satisfactory 118 + * control quality. For a better control quality, completion latency QoS 119 + * parameters can be configured so that the device is considered saturated 120 + * if N'th percentile completion latency rises above the set point. 121 + * 122 + * The completion latency requirements are a function of both the 123 + * underlying device characteristics and the desired IO latency quality of 124 + * service. There is an inherent trade-off - the tighter the latency QoS, 125 + * the higher the bandwidth loss. Latency QoS is disabled by default 126 + * and can be set through /sys/fs/cgroup/io.cost.qos. 127 + * 128 + * 2-3. Work Conservation 129 + * 130 + * Imagine two cgroups A and B with equal weights. A is issuing a small IO 131 + * periodically while B is sending out enough parallel IOs to saturate the 132 + * device on its own. Let's say A's usage amounts to 100ms worth of IO 133 + * cost per second, i.e., 10% of the device capacity. The naive 134 + * distribution of half and half would lead to 60% utilization of the 135 + * device, a significant reduction in the total amount of work done 136 + * compared to free-for-all competition. This is too high a cost to pay 137 + * for IO control. 138 + * 139 + * To conserve the total amount of work done, we keep track of how much 140 + * each active cgroup is actually using and yield part of its weight if 141 + * there are other cgroups which can make use of it. In the above case, 142 + * A's weight will be lowered so that it hovers above the actual usage and 143 + * B would be able to use the rest. 144 + * 145 + * As we don't want to penalize a cgroup for donating its weight, the 146 + * surplus weight adjustment factors in a margin and has an immediate 147 + * snapback mechanism in case the cgroup needs more IO vtime for itself.
148 + * 149 + * Note that adjusting down surplus weights has the same effects as 150 + * accelerating vtime for other cgroups and work conservation can also be 151 + * implemented by adjusting vrate dynamically. However, squaring who can 152 + * donate and should take back how much requires hweight propagations 153 + * anyway making it easier to implement and understand as a separate 154 + * mechanism. 155 + * 156 + * 3. Monitoring 157 + * 158 + * Instead of debugfs or other clumsy monitoring mechanisms, this 159 + * controller uses a drgn based monitoring script - 160 + * tools/cgroup/iocost_monitor.py. For details on drgn, please see 161 + * https://github.com/osandov/drgn. The output looks like the following. 162 + * 163 + * sdb RUN per=300ms cur_per=234.218:v203.695 busy= +1 vrate= 62.12% 164 + * active weight hweight% inflt% dbt delay usages% 165 + * test/a * 50/ 50 33.33/ 33.33 27.65 2 0*041 033:033:033 166 + * test/b * 100/ 100 66.67/ 66.67 17.56 0 0*000 066:079:077 167 + * 168 + * - per : Timer period 169 + * - cur_per : Internal wall and device vtime clock 170 + * - vrate : Device virtual time rate against wall clock 171 + * - weight : Surplus-adjusted and configured weights 172 + * - hweight : Surplus-adjusted and configured hierarchical weights 173 + * - inflt : The percentage of in-flight IO cost at the end of last period 174 + * - del_ms : Deferred issuer delay induction level and duration 175 + * - usages : Usage history 176 + */ 177 + 178 + #include <linux/kernel.h> 179 + #include <linux/module.h> 180 + #include <linux/timer.h> 181 + #include <linux/time64.h> 182 + #include <linux/parser.h> 183 + #include <linux/sched/signal.h> 184 + #include <linux/blk-cgroup.h> 185 + #include "blk-rq-qos.h" 186 + #include "blk-stat.h" 187 + #include "blk-wbt.h" 188 + 189 + #ifdef CONFIG_TRACEPOINTS 190 + 191 + /* copied from TRACE_CGROUP_PATH, see cgroup-internal.h */ 192 + #define TRACE_IOCG_PATH_LEN 1024 193 + static DEFINE_SPINLOCK(trace_iocg_path_lock); 194 +
static char trace_iocg_path[TRACE_IOCG_PATH_LEN]; 195 + 196 + #define TRACE_IOCG_PATH(type, iocg, ...) \ 197 + do { \ 198 + unsigned long flags; \ 199 + if (trace_iocost_##type##_enabled()) { \ 200 + spin_lock_irqsave(&trace_iocg_path_lock, flags); \ 201 + cgroup_path(iocg_to_blkg(iocg)->blkcg->css.cgroup, \ 202 + trace_iocg_path, TRACE_IOCG_PATH_LEN); \ 203 + trace_iocost_##type(iocg, trace_iocg_path, \ 204 + ##__VA_ARGS__); \ 205 + spin_unlock_irqrestore(&trace_iocg_path_lock, flags); \ 206 + } \ 207 + } while (0) 208 + 209 + #else /* CONFIG_TRACE_POINTS */ 210 + #define TRACE_IOCG_PATH(type, iocg, ...) do { } while (0) 211 + #endif /* CONFIG_TRACE_POINTS */ 212 + 213 + enum { 214 + MILLION = 1000000, 215 + 216 + /* timer period is calculated from latency requirements, bound it */ 217 + MIN_PERIOD = USEC_PER_MSEC, 218 + MAX_PERIOD = USEC_PER_SEC, 219 + 220 + /* 221 + * A cgroup's vtime can run 50% behind the device vtime, which 222 + * serves as its IO credit buffer. Surplus weight adjustment is 223 + * immediately canceled if the vtime margin runs below 10%. 224 + */ 225 + MARGIN_PCT = 50, 226 + INUSE_MARGIN_PCT = 10, 227 + 228 + /* Have some play in waitq timer operations */ 229 + WAITQ_TIMER_MARGIN_PCT = 5, 230 + 231 + /* 232 + * vtime can wrap well within a reasonable uptime when vrate is 233 + * consistently raised. Don't trust recorded cgroup vtime if the 234 + * period counter indicates that it's older than 5mins. 235 + */ 236 + VTIME_VALID_DUR = 300 * USEC_PER_SEC, 237 + 238 + /* 239 + * Remember the past three non-zero usages and use the max for 240 + * surplus calculation. Three slots guarantee that we remember one 241 + * full period usage from the last active stretch even after 242 + * partial deactivation and re-activation periods. Don't start 243 + * giving away weight before collecting two data points to prevent 244 + * hweight adjustments based on one partial activation period. 
245 + */ 246 + NR_USAGE_SLOTS = 3, 247 + MIN_VALID_USAGES = 2, 248 + 249 + /* 1/64k is granular enough and can easily be handled w/ u32 */ 250 + HWEIGHT_WHOLE = 1 << 16, 251 + 252 + /* 253 + * As vtime is used to calculate the cost of each IO, it needs to 254 + * be fairly high precision. For example, it should be able to 255 + * represent the cost of a single page worth of discard with 256 + * sufficient accuracy. At the same time, it should be able to 257 + * represent reasonably long enough durations to be useful and 258 + * convenient during operation. 259 + * 260 + * 1s worth of vtime is 2^37. This gives us both sub-nanosecond 261 + * granularity and days of wrap-around time even at extreme vrates. 262 + */ 263 + VTIME_PER_SEC_SHIFT = 37, 264 + VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT, 265 + VTIME_PER_USEC = VTIME_PER_SEC / USEC_PER_SEC, 266 + 267 + /* bound vrate adjustments within two orders of magnitude */ 268 + VRATE_MIN_PPM = 10000, /* 1% */ 269 + VRATE_MAX_PPM = 100000000, /* 10000% */ 270 + 271 + VRATE_MIN = VTIME_PER_USEC * VRATE_MIN_PPM / MILLION, 272 + VRATE_CLAMP_ADJ_PCT = 4, 273 + 274 + /* if IOs end up waiting for requests, issue less */ 275 + RQ_WAIT_BUSY_PCT = 5, 276 + 277 + /* unbusy hysteresis */ 278 + UNBUSY_THR_PCT = 75, 279 + 280 + /* don't let cmds which take a very long time pin lagging for too long */ 281 + MAX_LAGGING_PERIODS = 10, 282 + 283 + /* 284 + * If usage% * 1.25 + 2% is lower than hweight% by more than 3%, 285 + * donate the surplus. 286 + */ 287 + SURPLUS_SCALE_PCT = 125, /* * 125% */ 288 + SURPLUS_SCALE_ABS = HWEIGHT_WHOLE / 50, /* + 2% */ 289 + SURPLUS_MIN_ADJ_DELTA = HWEIGHT_WHOLE / 33, /* 3% */ 290 + 291 + /* switch iff the conditions are met for longer than this */ 292 + AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC, 293 + 294 + /* 295 + * Count IO size in 4k pages. The 12bit shift helps keeping 296 + * size-proportional components of cost calculation in closer 297 + * numbers of digits to per-IO cost components.
298 + */ 299 + IOC_PAGE_SHIFT = 12, 300 + IOC_PAGE_SIZE = 1 << IOC_PAGE_SHIFT, 301 + IOC_SECT_TO_PAGE_SHIFT = IOC_PAGE_SHIFT - SECTOR_SHIFT, 302 + 303 + /* if apart further than 16M, consider randio for linear model */ 304 + LCOEF_RANDIO_PAGES = 4096, 305 + }; 306 + 307 + enum ioc_running { 308 + IOC_IDLE, 309 + IOC_RUNNING, 310 + IOC_STOP, 311 + }; 312 + 313 + /* io.cost.qos controls including per-dev enable of the whole controller */ 314 + enum { 315 + QOS_ENABLE, 316 + QOS_CTRL, 317 + NR_QOS_CTRL_PARAMS, 318 + }; 319 + 320 + /* io.cost.qos params */ 321 + enum { 322 + QOS_RPPM, 323 + QOS_RLAT, 324 + QOS_WPPM, 325 + QOS_WLAT, 326 + QOS_MIN, 327 + QOS_MAX, 328 + NR_QOS_PARAMS, 329 + }; 330 + 331 + /* io.cost.model controls */ 332 + enum { 333 + COST_CTRL, 334 + COST_MODEL, 335 + NR_COST_CTRL_PARAMS, 336 + }; 337 + 338 + /* builtin linear cost model coefficients */ 339 + enum { 340 + I_LCOEF_RBPS, 341 + I_LCOEF_RSEQIOPS, 342 + I_LCOEF_RRANDIOPS, 343 + I_LCOEF_WBPS, 344 + I_LCOEF_WSEQIOPS, 345 + I_LCOEF_WRANDIOPS, 346 + NR_I_LCOEFS, 347 + }; 348 + 349 + enum { 350 + LCOEF_RPAGE, 351 + LCOEF_RSEQIO, 352 + LCOEF_RRANDIO, 353 + LCOEF_WPAGE, 354 + LCOEF_WSEQIO, 355 + LCOEF_WRANDIO, 356 + NR_LCOEFS, 357 + }; 358 + 359 + enum { 360 + AUTOP_INVALID, 361 + AUTOP_HDD, 362 + AUTOP_SSD_QD1, 363 + AUTOP_SSD_DFL, 364 + AUTOP_SSD_FAST, 365 + }; 366 + 367 + struct ioc_gq; 368 + 369 + struct ioc_params { 370 + u32 qos[NR_QOS_PARAMS]; 371 + u64 i_lcoefs[NR_I_LCOEFS]; 372 + u64 lcoefs[NR_LCOEFS]; 373 + u32 too_fast_vrate_pct; 374 + u32 too_slow_vrate_pct; 375 + }; 376 + 377 + struct ioc_missed { 378 + u32 nr_met; 379 + u32 nr_missed; 380 + u32 last_met; 381 + u32 last_missed; 382 + }; 383 + 384 + struct ioc_pcpu_stat { 385 + struct ioc_missed missed[2]; 386 + 387 + u64 rq_wait_ns; 388 + u64 last_rq_wait_ns; 389 + }; 390 + 391 + /* per device */ 392 + struct ioc { 393 + struct rq_qos rqos; 394 + 395 + bool enabled; 396 + 397 + struct ioc_params params; 398 + u32 period_us; 399 + u32 
margin_us; 400 + u64 vrate_min; 401 + u64 vrate_max; 402 + 403 + spinlock_t lock; 404 + struct timer_list timer; 405 + struct list_head active_iocgs; /* active cgroups */ 406 + struct ioc_pcpu_stat __percpu *pcpu_stat; 407 + 408 + enum ioc_running running; 409 + atomic64_t vtime_rate; 410 + 411 + seqcount_t period_seqcount; 412 + u32 period_at; /* wallclock starttime */ 413 + u64 period_at_vtime; /* vtime starttime */ 414 + 415 + atomic64_t cur_period; /* inc'd each period */ 416 + int busy_level; /* saturation history */ 417 + 418 + u64 inuse_margin_vtime; 419 + bool weights_updated; 420 + atomic_t hweight_gen; /* for lazy hweights */ 421 + 422 + u64 autop_too_fast_at; 423 + u64 autop_too_slow_at; 424 + int autop_idx; 425 + bool user_qos_params:1; 426 + bool user_cost_model:1; 427 + }; 428 + 429 + /* per device-cgroup pair */ 430 + struct ioc_gq { 431 + struct blkg_policy_data pd; 432 + struct ioc *ioc; 433 + 434 + /* 435 + * An iocg can get its weight from two sources - an explicit 436 + * per-device-cgroup configuration or the default weight of the 437 + * cgroup. `cfg_weight` is the explicit per-device-cgroup 438 + * configuration. `weight` is the effective weight considering both 439 + * sources. 440 + * 441 + * When an idle cgroup becomes active its `active` goes from 0 to 442 + * `weight`. `inuse` is the surplus adjusted active weight. 443 + * `active` and `inuse` are used to calculate `hweight_active` and 444 + * `hweight_inuse`. 445 + * 446 + * `last_inuse` remembers `inuse` while an iocg is idle to persist 447 + * surplus adjustments. 448 + */ 449 + u32 cfg_weight; 450 + u32 weight; 451 + u32 active; 452 + u32 inuse; 453 + u32 last_inuse; 454 + 455 + sector_t cursor; /* to detect randio */ 456 + 457 + /* 458 + * `vtime` is this iocg's vtime cursor which progresses as IOs are 459 + * issued. If lagging behind device vtime, the delta represents 460 + * the currently available IO budget. If running ahead, the 461 + * overage.
462 + * 463 + * `vtime_done` is the same but progressed on completion rather 464 + * than issue. The delta behind `vtime` represents the cost of 465 + * currently in-flight IOs. 466 + * 467 + * `last_vtime` is used to remember `vtime` at the end of the last 468 + * period to calculate utilization. 469 + */ 470 + atomic64_t vtime; 471 + atomic64_t done_vtime; 472 + atomic64_t abs_vdebt; 473 + u64 last_vtime; 474 + 475 + /* 476 + * The period this iocg was last active in. Used for deactivation 477 + * and invalidating `vtime`. 478 + */ 479 + atomic64_t active_period; 480 + struct list_head active_list; 481 + 482 + /* see __propagate_active_weight() and current_hweight() for details */ 483 + u64 child_active_sum; 484 + u64 child_inuse_sum; 485 + int hweight_gen; 486 + u32 hweight_active; 487 + u32 hweight_inuse; 488 + bool has_surplus; 489 + 490 + struct wait_queue_head waitq; 491 + struct hrtimer waitq_timer; 492 + struct hrtimer delay_timer; 493 + 494 + /* usage is recorded as fractions of HWEIGHT_WHOLE */ 495 + int usage_idx; 496 + u32 usages[NR_USAGE_SLOTS]; 497 + 498 + /* this iocg's depth in the hierarchy and ancestors including self */ 499 + int level; 500 + struct ioc_gq *ancestors[]; 501 + }; 502 + 503 + /* per cgroup */ 504 + struct ioc_cgrp { 505 + struct blkcg_policy_data cpd; 506 + unsigned int dfl_weight; 507 + }; 508 + 509 + struct ioc_now { 510 + u64 now_ns; 511 + u32 now; 512 + u64 vnow; 513 + u64 vrate; 514 + }; 515 + 516 + struct iocg_wait { 517 + struct wait_queue_entry wait; 518 + struct bio *bio; 519 + u64 abs_cost; 520 + bool committed; 521 + }; 522 + 523 + struct iocg_wake_ctx { 524 + struct ioc_gq *iocg; 525 + u32 hw_inuse; 526 + s64 vbudget; 527 + }; 528 + 529 + static const struct ioc_params autop[] = { 530 + [AUTOP_HDD] = { 531 + .qos = { 532 + [QOS_RLAT] = 50000, /* 50ms */ 533 + [QOS_WLAT] = 50000, 534 + [QOS_MIN] = VRATE_MIN_PPM, 535 + [QOS_MAX] = VRATE_MAX_PPM, 536 + }, 537 + .i_lcoefs = { 538 + [I_LCOEF_RBPS] = 174019176, 539 + 
[I_LCOEF_RSEQIOPS] = 41708, 540 + [I_LCOEF_RRANDIOPS] = 370, 541 + [I_LCOEF_WBPS] = 178075866, 542 + [I_LCOEF_WSEQIOPS] = 42705, 543 + [I_LCOEF_WRANDIOPS] = 378, 544 + }, 545 + }, 546 + [AUTOP_SSD_QD1] = { 547 + .qos = { 548 + [QOS_RLAT] = 25000, /* 25ms */ 549 + [QOS_WLAT] = 25000, 550 + [QOS_MIN] = VRATE_MIN_PPM, 551 + [QOS_MAX] = VRATE_MAX_PPM, 552 + }, 553 + .i_lcoefs = { 554 + [I_LCOEF_RBPS] = 245855193, 555 + [I_LCOEF_RSEQIOPS] = 61575, 556 + [I_LCOEF_RRANDIOPS] = 6946, 557 + [I_LCOEF_WBPS] = 141365009, 558 + [I_LCOEF_WSEQIOPS] = 33716, 559 + [I_LCOEF_WRANDIOPS] = 26796, 560 + }, 561 + }, 562 + [AUTOP_SSD_DFL] = { 563 + .qos = { 564 + [QOS_RLAT] = 25000, /* 25ms */ 565 + [QOS_WLAT] = 25000, 566 + [QOS_MIN] = VRATE_MIN_PPM, 567 + [QOS_MAX] = VRATE_MAX_PPM, 568 + }, 569 + .i_lcoefs = { 570 + [I_LCOEF_RBPS] = 488636629, 571 + [I_LCOEF_RSEQIOPS] = 8932, 572 + [I_LCOEF_RRANDIOPS] = 8518, 573 + [I_LCOEF_WBPS] = 427891549, 574 + [I_LCOEF_WSEQIOPS] = 28755, 575 + [I_LCOEF_WRANDIOPS] = 21940, 576 + }, 577 + .too_fast_vrate_pct = 500, 578 + }, 579 + [AUTOP_SSD_FAST] = { 580 + .qos = { 581 + [QOS_RLAT] = 5000, /* 5ms */ 582 + [QOS_WLAT] = 5000, 583 + [QOS_MIN] = VRATE_MIN_PPM, 584 + [QOS_MAX] = VRATE_MAX_PPM, 585 + }, 586 + .i_lcoefs = { 587 + [I_LCOEF_RBPS] = 3102524156LLU, 588 + [I_LCOEF_RSEQIOPS] = 724816, 589 + [I_LCOEF_RRANDIOPS] = 778122, 590 + [I_LCOEF_WBPS] = 1742780862LLU, 591 + [I_LCOEF_WSEQIOPS] = 425702, 592 + [I_LCOEF_WRANDIOPS] = 443193, 593 + }, 594 + .too_slow_vrate_pct = 10, 595 + }, 596 + }; 597 + 598 + /* 599 + * vrate adjust percentages indexed by ioc->busy_level. We adjust up on 600 + * vtime credit shortage and down on device saturation. 
601 + */ 602 + static u32 vrate_adj_pct[] = 603 + { 0, 0, 0, 0, 604 + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 605 + 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 606 + 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 16 }; 607 + 608 + static struct blkcg_policy blkcg_policy_iocost; 609 + 610 + /* accessors and helpers */ 611 + static struct ioc *rqos_to_ioc(struct rq_qos *rqos) 612 + { 613 + return container_of(rqos, struct ioc, rqos); 614 + } 615 + 616 + static struct ioc *q_to_ioc(struct request_queue *q) 617 + { 618 + return rqos_to_ioc(rq_qos_id(q, RQ_QOS_COST)); 619 + } 620 + 621 + static const char *q_name(struct request_queue *q) 622 + { 623 + if (test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) 624 + return kobject_name(q->kobj.parent); 625 + else 626 + return "<unknown>"; 627 + } 628 + 629 + static const char __maybe_unused *ioc_name(struct ioc *ioc) 630 + { 631 + return q_name(ioc->rqos.q); 632 + } 633 + 634 + static struct ioc_gq *pd_to_iocg(struct blkg_policy_data *pd) 635 + { 636 + return pd ? container_of(pd, struct ioc_gq, pd) : NULL; 637 + } 638 + 639 + static struct ioc_gq *blkg_to_iocg(struct blkcg_gq *blkg) 640 + { 641 + return pd_to_iocg(blkg_to_pd(blkg, &blkcg_policy_iocost)); 642 + } 643 + 644 + static struct blkcg_gq *iocg_to_blkg(struct ioc_gq *iocg) 645 + { 646 + return pd_to_blkg(&iocg->pd); 647 + } 648 + 649 + static struct ioc_cgrp *blkcg_to_iocc(struct blkcg *blkcg) 650 + { 651 + return container_of(blkcg_to_cpd(blkcg, &blkcg_policy_iocost), 652 + struct ioc_cgrp, cpd); 653 + } 654 + 655 + /* 656 + * Scale @abs_cost to the inverse of @hw_inuse. The lower the hierarchical 657 + * weight, the more expensive each IO. Must round up. 658 + */ 659 + static u64 abs_cost_to_cost(u64 abs_cost, u32 hw_inuse) 660 + { 661 + return DIV64_U64_ROUND_UP(abs_cost * HWEIGHT_WHOLE, hw_inuse); 662 + } 663 + 664 + /* 665 + * The inverse of abs_cost_to_cost(). Must round up. 
666 + */ 667 + static u64 cost_to_abs_cost(u64 cost, u32 hw_inuse) 668 + { 669 + return DIV64_U64_ROUND_UP(cost * hw_inuse, HWEIGHT_WHOLE); 670 + } 671 + 672 + static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio, u64 cost) 673 + { 674 + bio->bi_iocost_cost = cost; 675 + atomic64_add(cost, &iocg->vtime); 676 + } 677 + 678 + #define CREATE_TRACE_POINTS 679 + #include <trace/events/iocost.h> 680 + 681 + /* latency Qos params changed, update period_us and all the dependent params */ 682 + static void ioc_refresh_period_us(struct ioc *ioc) 683 + { 684 + u32 ppm, lat, multi, period_us; 685 + 686 + lockdep_assert_held(&ioc->lock); 687 + 688 + /* pick the higher latency target */ 689 + if (ioc->params.qos[QOS_RLAT] >= ioc->params.qos[QOS_WLAT]) { 690 + ppm = ioc->params.qos[QOS_RPPM]; 691 + lat = ioc->params.qos[QOS_RLAT]; 692 + } else { 693 + ppm = ioc->params.qos[QOS_WPPM]; 694 + lat = ioc->params.qos[QOS_WLAT]; 695 + } 696 + 697 + /* 698 + * We want the period to be long enough to contain a healthy number 699 + * of IOs while short enough for granular control. Define it as a 700 + * multiple of the latency target. Ideally, the multiplier should 701 + * be scaled according to the percentile so that it would nominally 702 + * contain a certain number of requests. Let's be simpler and 703 + * scale it linearly so that it's 2x >= pct(90) and 10x at pct(50). 
704 + */ 705 + if (ppm) 706 + multi = max_t(u32, (MILLION - ppm) / 50000, 2); 707 + else 708 + multi = 2; 709 + period_us = multi * lat; 710 + period_us = clamp_t(u32, period_us, MIN_PERIOD, MAX_PERIOD); 711 + 712 + /* calculate dependent params */ 713 + ioc->period_us = period_us; 714 + ioc->margin_us = period_us * MARGIN_PCT / 100; 715 + ioc->inuse_margin_vtime = DIV64_U64_ROUND_UP( 716 + period_us * VTIME_PER_USEC * INUSE_MARGIN_PCT, 100); 717 + } 718 + 719 + static int ioc_autop_idx(struct ioc *ioc) 720 + { 721 + int idx = ioc->autop_idx; 722 + const struct ioc_params *p = &autop[idx]; 723 + u32 vrate_pct; 724 + u64 now_ns; 725 + 726 + /* rotational? */ 727 + if (!blk_queue_nonrot(ioc->rqos.q)) 728 + return AUTOP_HDD; 729 + 730 + /* handle SATA SSDs w/ broken NCQ */ 731 + if (blk_queue_depth(ioc->rqos.q) == 1) 732 + return AUTOP_SSD_QD1; 733 + 734 + /* use one of the normal ssd sets */ 735 + if (idx < AUTOP_SSD_DFL) 736 + return AUTOP_SSD_DFL; 737 + 738 + /* if user is overriding anything, maintain what was there */ 739 + if (ioc->user_qos_params || ioc->user_cost_model) 740 + return idx; 741 + 742 + /* step up/down based on the vrate */ 743 + vrate_pct = div64_u64(atomic64_read(&ioc->vtime_rate) * 100, 744 + VTIME_PER_USEC); 745 + now_ns = ktime_get_ns(); 746 + 747 + if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) { 748 + if (!ioc->autop_too_fast_at) 749 + ioc->autop_too_fast_at = now_ns; 750 + if (now_ns - ioc->autop_too_fast_at >= AUTOP_CYCLE_NSEC) 751 + return idx + 1; 752 + } else { 753 + ioc->autop_too_fast_at = 0; 754 + } 755 + 756 + if (p->too_slow_vrate_pct && p->too_slow_vrate_pct >= vrate_pct) { 757 + if (!ioc->autop_too_slow_at) 758 + ioc->autop_too_slow_at = now_ns; 759 + if (now_ns - ioc->autop_too_slow_at >= AUTOP_CYCLE_NSEC) 760 + return idx - 1; 761 + } else { 762 + ioc->autop_too_slow_at = 0; 763 + } 764 + 765 + return idx; 766 + } 767 + 768 + /* 769 + * Take the followings as input 770 + * 771 + * @bps maximum sequential 
throughput 772 + * @seqiops maximum sequential 4k iops 773 + * @randiops maximum random 4k iops 774 + * 775 + * and calculate the linear model cost coefficients. 776 + * 777 + * *@page per-page cost 1s / (@bps / 4096) 778 + * *@seqio base cost of a seq IO max((1s / @seqiops) - *@page, 0) 779 + * *@randio base cost of a rand IO max((1s / @randiops) - *@page, 0) 780 + */ 781 + static void calc_lcoefs(u64 bps, u64 seqiops, u64 randiops, 782 + u64 *page, u64 *seqio, u64 *randio) 783 + { 784 + u64 v; 785 + 786 + *page = *seqio = *randio = 0; 787 + 788 + if (bps) 789 + *page = DIV64_U64_ROUND_UP(VTIME_PER_SEC, 790 + DIV_ROUND_UP_ULL(bps, IOC_PAGE_SIZE)); 791 + 792 + if (seqiops) { 793 + v = DIV64_U64_ROUND_UP(VTIME_PER_SEC, seqiops); 794 + if (v > *page) 795 + *seqio = v - *page; 796 + } 797 + 798 + if (randiops) { 799 + v = DIV64_U64_ROUND_UP(VTIME_PER_SEC, randiops); 800 + if (v > *page) 801 + *randio = v - *page; 802 + } 803 + } 804 + 805 + static void ioc_refresh_lcoefs(struct ioc *ioc) 806 + { 807 + u64 *u = ioc->params.i_lcoefs; 808 + u64 *c = ioc->params.lcoefs; 809 + 810 + calc_lcoefs(u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS], 811 + &c[LCOEF_RPAGE], &c[LCOEF_RSEQIO], &c[LCOEF_RRANDIO]); 812 + calc_lcoefs(u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS], 813 + &c[LCOEF_WPAGE], &c[LCOEF_WSEQIO], &c[LCOEF_WRANDIO]); 814 + } 815 + 816 + static bool ioc_refresh_params(struct ioc *ioc, bool force) 817 + { 818 + const struct ioc_params *p; 819 + int idx; 820 + 821 + lockdep_assert_held(&ioc->lock); 822 + 823 + idx = ioc_autop_idx(ioc); 824 + p = &autop[idx]; 825 + 826 + if (idx == ioc->autop_idx && !force) 827 + return false; 828 + 829 + if (idx != ioc->autop_idx) 830 + atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC); 831 + 832 + ioc->autop_idx = idx; 833 + ioc->autop_too_fast_at = 0; 834 + ioc->autop_too_slow_at = 0; 835 + 836 + if (!ioc->user_qos_params) 837 + memcpy(ioc->params.qos, p->qos, sizeof(p->qos)); 838 + if (!ioc->user_cost_model)
839 + memcpy(ioc->params.i_lcoefs, p->i_lcoefs, sizeof(p->i_lcoefs)); 840 + 841 + ioc_refresh_period_us(ioc); 842 + ioc_refresh_lcoefs(ioc); 843 + 844 + ioc->vrate_min = DIV64_U64_ROUND_UP((u64)ioc->params.qos[QOS_MIN] * 845 + VTIME_PER_USEC, MILLION); 846 + ioc->vrate_max = div64_u64((u64)ioc->params.qos[QOS_MAX] * 847 + VTIME_PER_USEC, MILLION); 848 + 849 + return true; 850 + } 851 + 852 + /* take a snapshot of the current [v]time and vrate */ 853 + static void ioc_now(struct ioc *ioc, struct ioc_now *now) 854 + { 855 + unsigned seq; 856 + 857 + now->now_ns = ktime_get(); 858 + now->now = ktime_to_us(now->now_ns); 859 + now->vrate = atomic64_read(&ioc->vtime_rate); 860 + 861 + /* 862 + * The current vtime is 863 + * 864 + * vtime at period start + (wallclock time since the start) * vrate 865 + * 866 + * As a consistent snapshot of `period_at_vtime` and `period_at` is 867 + * needed, they're seqcount protected. 868 + */ 869 + do { 870 + seq = read_seqcount_begin(&ioc->period_seqcount); 871 + now->vnow = ioc->period_at_vtime + 872 + (now->now - ioc->period_at) * now->vrate; 873 + } while (read_seqcount_retry(&ioc->period_seqcount, seq)); 874 + } 875 + 876 + static void ioc_start_period(struct ioc *ioc, struct ioc_now *now) 877 + { 878 + lockdep_assert_held(&ioc->lock); 879 + WARN_ON_ONCE(ioc->running != IOC_RUNNING); 880 + 881 + write_seqcount_begin(&ioc->period_seqcount); 882 + ioc->period_at = now->now; 883 + ioc->period_at_vtime = now->vnow; 884 + write_seqcount_end(&ioc->period_seqcount); 885 + 886 + ioc->timer.expires = jiffies + usecs_to_jiffies(ioc->period_us); 887 + add_timer(&ioc->timer); 888 + } 889 + 890 + /* 891 + * Update @iocg's `active` and `inuse` to @active and @inuse, update level 892 + * weight sums and propagate upwards accordingly. 
893 + */ 894 + static void __propagate_active_weight(struct ioc_gq *iocg, u32 active, u32 inuse) 895 + { 896 + struct ioc *ioc = iocg->ioc; 897 + int lvl; 898 + 899 + lockdep_assert_held(&ioc->lock); 900 + 901 + inuse = min(active, inuse); 902 + 903 + for (lvl = iocg->level - 1; lvl >= 0; lvl--) { 904 + struct ioc_gq *parent = iocg->ancestors[lvl]; 905 + struct ioc_gq *child = iocg->ancestors[lvl + 1]; 906 + u32 parent_active = 0, parent_inuse = 0; 907 + 908 + /* update the level sums */ 909 + parent->child_active_sum += (s32)(active - child->active); 910 + parent->child_inuse_sum += (s32)(inuse - child->inuse); 911 + /* apply the updates */ 912 + child->active = active; 913 + child->inuse = inuse; 914 + 915 + /* 916 + * The delta between inuse and active sums indicates that 917 + * that much weight is being given away. Parent's inuse 918 + * and active should reflect the ratio. 919 + */ 920 + if (parent->child_active_sum) { 921 + parent_active = parent->weight; 922 + parent_inuse = DIV64_U64_ROUND_UP( 923 + parent_active * parent->child_inuse_sum, 924 + parent->child_active_sum); 925 + } 926 + 927 + /* do we need to keep walking up?
*/ 928 + if (parent_active == parent->active && 929 + parent_inuse == parent->inuse) 930 + break; 931 + 932 + active = parent_active; 933 + inuse = parent_inuse; 934 + } 935 + 936 + ioc->weights_updated = true; 937 + } 938 + 939 + static void commit_active_weights(struct ioc *ioc) 940 + { 941 + lockdep_assert_held(&ioc->lock); 942 + 943 + if (ioc->weights_updated) { 944 + /* paired with rmb in current_hweight(), see there */ 945 + smp_wmb(); 946 + atomic_inc(&ioc->hweight_gen); 947 + ioc->weights_updated = false; 948 + } 949 + } 950 + 951 + static void propagate_active_weight(struct ioc_gq *iocg, u32 active, u32 inuse) 952 + { 953 + __propagate_active_weight(iocg, active, inuse); 954 + commit_active_weights(iocg->ioc); 955 + } 956 + 957 + static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep) 958 + { 959 + struct ioc *ioc = iocg->ioc; 960 + int lvl; 961 + u32 hwa, hwi; 962 + int ioc_gen; 963 + 964 + /* hot path - if uptodate, use cached */ 965 + ioc_gen = atomic_read(&ioc->hweight_gen); 966 + if (ioc_gen == iocg->hweight_gen) 967 + goto out; 968 + 969 + /* 970 + * Paired with wmb in commit_active_weights(). If we saw the 971 + * updated hweight_gen, all the weight updates from 972 + * __propagate_active_weight() are visible too. 973 + * 974 + * We can race with weight updates during calculation and get it 975 + * wrong. However, hweight_gen would have changed and a future 976 + * reader will recalculate and we're guaranteed to discard the 977 + * wrong result soon. 
978 + */ 979 + smp_rmb(); 980 + 981 + hwa = hwi = HWEIGHT_WHOLE; 982 + for (lvl = 0; lvl <= iocg->level - 1; lvl++) { 983 + struct ioc_gq *parent = iocg->ancestors[lvl]; 984 + struct ioc_gq *child = iocg->ancestors[lvl + 1]; 985 + u32 active_sum = READ_ONCE(parent->child_active_sum); 986 + u32 inuse_sum = READ_ONCE(parent->child_inuse_sum); 987 + u32 active = READ_ONCE(child->active); 988 + u32 inuse = READ_ONCE(child->inuse); 989 + 990 + /* we can race with deactivations and either may read as zero */ 991 + if (!active_sum || !inuse_sum) 992 + continue; 993 + 994 + active_sum = max(active, active_sum); 995 + hwa = hwa * active / active_sum; /* max 16bits * 10000 */ 996 + 997 + inuse_sum = max(inuse, inuse_sum); 998 + hwi = hwi * inuse / inuse_sum; /* max 16bits * 10000 */ 999 + } 1000 + 1001 + iocg->hweight_active = max_t(u32, hwa, 1); 1002 + iocg->hweight_inuse = max_t(u32, hwi, 1); 1003 + iocg->hweight_gen = ioc_gen; 1004 + out: 1005 + if (hw_activep) 1006 + *hw_activep = iocg->hweight_active; 1007 + if (hw_inusep) 1008 + *hw_inusep = iocg->hweight_inuse; 1009 + } 1010 + 1011 + static void weight_updated(struct ioc_gq *iocg) 1012 + { 1013 + struct ioc *ioc = iocg->ioc; 1014 + struct blkcg_gq *blkg = iocg_to_blkg(iocg); 1015 + struct ioc_cgrp *iocc = blkcg_to_iocc(blkg->blkcg); 1016 + u32 weight; 1017 + 1018 + lockdep_assert_held(&ioc->lock); 1019 + 1020 + weight = iocg->cfg_weight ?: iocc->dfl_weight; 1021 + if (weight != iocg->weight && iocg->active) 1022 + propagate_active_weight(iocg, weight, 1023 + DIV64_U64_ROUND_UP(iocg->inuse * weight, iocg->weight)); 1024 + iocg->weight = weight; 1025 + } 1026 + 1027 + static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now) 1028 + { 1029 + struct ioc *ioc = iocg->ioc; 1030 + u64 last_period, cur_period, max_period_delta; 1031 + u64 vtime, vmargin, vmin; 1032 + int i; 1033 + 1034 + /* 1035 + * If seem to be already active, just update the stamp to tell the 1036 + * timer that we're still active. 
We don't mind occasional races. 1037 + */ 1038 + if (!list_empty(&iocg->active_list)) { 1039 + ioc_now(ioc, now); 1040 + cur_period = atomic64_read(&ioc->cur_period); 1041 + if (atomic64_read(&iocg->active_period) != cur_period) 1042 + atomic64_set(&iocg->active_period, cur_period); 1043 + return true; 1044 + } 1045 + 1046 + /* racy check on internal node IOs, treat as root level IOs */ 1047 + if (iocg->child_active_sum) 1048 + return false; 1049 + 1050 + spin_lock_irq(&ioc->lock); 1051 + 1052 + ioc_now(ioc, now); 1053 + 1054 + /* update period */ 1055 + cur_period = atomic64_read(&ioc->cur_period); 1056 + last_period = atomic64_read(&iocg->active_period); 1057 + atomic64_set(&iocg->active_period, cur_period); 1058 + 1059 + /* already activated or breaking leaf-only constraint? */ 1060 + for (i = iocg->level; i > 0; i--) 1061 + if (!list_empty(&iocg->active_list)) 1062 + goto fail_unlock; 1063 + if (iocg->child_active_sum) 1064 + goto fail_unlock; 1065 + 1066 + /* 1067 + * vtime may wrap when vrate is raised substantially due to 1068 + * underestimated IO costs. Look at the period and ignore its 1069 + * vtime if the iocg has been idle for too long. Also, cap the 1070 + * budget it can start with to the margin. 1071 + */ 1072 + max_period_delta = DIV64_U64_ROUND_UP(VTIME_VALID_DUR, ioc->period_us); 1073 + vtime = atomic64_read(&iocg->vtime); 1074 + vmargin = ioc->margin_us * now->vrate; 1075 + vmin = now->vnow - vmargin; 1076 + 1077 + if (last_period + max_period_delta < cur_period || 1078 + time_before64(vtime, vmin)) { 1079 + atomic64_add(vmin - vtime, &iocg->vtime); 1080 + atomic64_add(vmin - vtime, &iocg->done_vtime); 1081 + vtime = vmin; 1082 + } 1083 + 1084 + /* 1085 + * Activate, propagate weight and start period timer if not 1086 + * running. Reset hweight_gen to avoid accidental match from 1087 + * wrapping. 
1088 + */ 1089 + iocg->hweight_gen = atomic_read(&ioc->hweight_gen) - 1; 1090 + list_add(&iocg->active_list, &ioc->active_iocgs); 1091 + propagate_active_weight(iocg, iocg->weight, 1092 + iocg->last_inuse ?: iocg->weight); 1093 + 1094 + TRACE_IOCG_PATH(iocg_activate, iocg, now, 1095 + last_period, cur_period, vtime); 1096 + 1097 + iocg->last_vtime = vtime; 1098 + 1099 + if (ioc->running == IOC_IDLE) { 1100 + ioc->running = IOC_RUNNING; 1101 + ioc_start_period(ioc, now); 1102 + } 1103 + 1104 + spin_unlock_irq(&ioc->lock); 1105 + return true; 1106 + 1107 + fail_unlock: 1108 + spin_unlock_irq(&ioc->lock); 1109 + return false; 1110 + } 1111 + 1112 + static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode, 1113 + int flags, void *key) 1114 + { 1115 + struct iocg_wait *wait = container_of(wq_entry, struct iocg_wait, wait); 1116 + struct iocg_wake_ctx *ctx = (struct iocg_wake_ctx *)key; 1117 + u64 cost = abs_cost_to_cost(wait->abs_cost, ctx->hw_inuse); 1118 + 1119 + ctx->vbudget -= cost; 1120 + 1121 + if (ctx->vbudget < 0) 1122 + return -1; 1123 + 1124 + iocg_commit_bio(ctx->iocg, wait->bio, cost); 1125 + 1126 + /* 1127 + * autoremove_wake_function() removes the wait entry only when it 1128 + * actually changed the task state. We want the wait always 1129 + * removed. Remove explicitly and use default_wake_function(). 
1130 + */ 1131 + list_del_init(&wq_entry->entry); 1132 + wait->committed = true; 1133 + 1134 + default_wake_function(wq_entry, mode, flags, key); 1135 + return 0; 1136 + } 1137 + 1138 + static void iocg_kick_waitq(struct ioc_gq *iocg, struct ioc_now *now) 1139 + { 1140 + struct ioc *ioc = iocg->ioc; 1141 + struct iocg_wake_ctx ctx = { .iocg = iocg }; 1142 + u64 margin_ns = (u64)(ioc->period_us * 1143 + WAITQ_TIMER_MARGIN_PCT / 100) * NSEC_PER_USEC; 1144 + u64 abs_vdebt, vdebt, vshortage, expires, oexpires; 1145 + s64 vbudget; 1146 + u32 hw_inuse; 1147 + 1148 + lockdep_assert_held(&iocg->waitq.lock); 1149 + 1150 + current_hweight(iocg, NULL, &hw_inuse); 1151 + vbudget = now->vnow - atomic64_read(&iocg->vtime); 1152 + 1153 + /* pay off debt */ 1154 + abs_vdebt = atomic64_read(&iocg->abs_vdebt); 1155 + vdebt = abs_cost_to_cost(abs_vdebt, hw_inuse); 1156 + if (vdebt && vbudget > 0) { 1157 + u64 delta = min_t(u64, vbudget, vdebt); 1158 + u64 abs_delta = min(cost_to_abs_cost(delta, hw_inuse), 1159 + abs_vdebt); 1160 + 1161 + atomic64_add(delta, &iocg->vtime); 1162 + atomic64_add(delta, &iocg->done_vtime); 1163 + atomic64_sub(abs_delta, &iocg->abs_vdebt); 1164 + if (WARN_ON_ONCE(atomic64_read(&iocg->abs_vdebt) < 0)) 1165 + atomic64_set(&iocg->abs_vdebt, 0); 1166 + } 1167 + 1168 + /* 1169 + * Wake up the ones which are due and see how much vtime we'll need 1170 + * for the next one. 
1171 + */ 1172 + ctx.hw_inuse = hw_inuse; 1173 + ctx.vbudget = vbudget - vdebt; 1174 + __wake_up_locked_key(&iocg->waitq, TASK_NORMAL, &ctx); 1175 + if (!waitqueue_active(&iocg->waitq)) 1176 + return; 1177 + if (WARN_ON_ONCE(ctx.vbudget >= 0)) 1178 + return; 1179 + 1180 + /* determine next wakeup, add a quarter margin to guarantee chunking */ 1181 + vshortage = -ctx.vbudget; 1182 + expires = now->now_ns + 1183 + DIV64_U64_ROUND_UP(vshortage, now->vrate) * NSEC_PER_USEC; 1184 + expires += margin_ns / 4; 1185 + 1186 + /* if already active and close enough, don't bother */ 1187 + oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->waitq_timer)); 1188 + if (hrtimer_is_queued(&iocg->waitq_timer) && 1189 + abs(oexpires - expires) <= margin_ns / 4) 1190 + return; 1191 + 1192 + hrtimer_start_range_ns(&iocg->waitq_timer, ns_to_ktime(expires), 1193 + margin_ns / 4, HRTIMER_MODE_ABS); 1194 + } 1195 + 1196 + static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer) 1197 + { 1198 + struct ioc_gq *iocg = container_of(timer, struct ioc_gq, waitq_timer); 1199 + struct ioc_now now; 1200 + unsigned long flags; 1201 + 1202 + ioc_now(iocg->ioc, &now); 1203 + 1204 + spin_lock_irqsave(&iocg->waitq.lock, flags); 1205 + iocg_kick_waitq(iocg, &now); 1206 + spin_unlock_irqrestore(&iocg->waitq.lock, flags); 1207 + 1208 + return HRTIMER_NORESTART; 1209 + } 1210 + 1211 + static void iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now, u64 cost) 1212 + { 1213 + struct ioc *ioc = iocg->ioc; 1214 + struct blkcg_gq *blkg = iocg_to_blkg(iocg); 1215 + u64 vtime = atomic64_read(&iocg->vtime); 1216 + u64 vmargin = ioc->margin_us * now->vrate; 1217 + u64 margin_ns = ioc->margin_us * NSEC_PER_USEC; 1218 + u64 expires, oexpires; 1219 + u32 hw_inuse; 1220 + 1221 + /* debt-adjust vtime */ 1222 + current_hweight(iocg, NULL, &hw_inuse); 1223 + vtime += abs_cost_to_cost(atomic64_read(&iocg->abs_vdebt), hw_inuse); 1224 + 1225 + /* clear or maintain depending on the overage */ 1226 + if 
(time_before_eq64(vtime, now->vnow)) { 1227 + blkcg_clear_delay(blkg); 1228 + return; 1229 + } 1230 + if (!atomic_read(&blkg->use_delay) && 1231 + time_before_eq64(vtime, now->vnow + vmargin)) 1232 + return; 1233 + 1234 + /* use delay */ 1235 + if (cost) { 1236 + u64 cost_ns = DIV64_U64_ROUND_UP(cost * NSEC_PER_USEC, 1237 + now->vrate); 1238 + blkcg_add_delay(blkg, now->now_ns, cost_ns); 1239 + } 1240 + blkcg_use_delay(blkg); 1241 + 1242 + expires = now->now_ns + DIV64_U64_ROUND_UP(vtime - now->vnow, 1243 + now->vrate) * NSEC_PER_USEC; 1244 + 1245 + /* if already active and close enough, don't bother */ 1246 + oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->delay_timer)); 1247 + if (hrtimer_is_queued(&iocg->delay_timer) && 1248 + abs(oexpires - expires) <= margin_ns / 4) 1249 + return; 1250 + 1251 + hrtimer_start_range_ns(&iocg->delay_timer, ns_to_ktime(expires), 1252 + margin_ns / 4, HRTIMER_MODE_ABS); 1253 + } 1254 + 1255 + static enum hrtimer_restart iocg_delay_timer_fn(struct hrtimer *timer) 1256 + { 1257 + struct ioc_gq *iocg = container_of(timer, struct ioc_gq, delay_timer); 1258 + struct ioc_now now; 1259 + 1260 + ioc_now(iocg->ioc, &now); 1261 + iocg_kick_delay(iocg, &now, 0); 1262 + 1263 + return HRTIMER_NORESTART; 1264 + } 1265 + 1266 + static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p) 1267 + { 1268 + u32 nr_met[2] = { }; 1269 + u32 nr_missed[2] = { }; 1270 + u64 rq_wait_ns = 0; 1271 + int cpu, rw; 1272 + 1273 + for_each_online_cpu(cpu) { 1274 + struct ioc_pcpu_stat *stat = per_cpu_ptr(ioc->pcpu_stat, cpu); 1275 + u64 this_rq_wait_ns; 1276 + 1277 + for (rw = READ; rw <= WRITE; rw++) { 1278 + u32 this_met = READ_ONCE(stat->missed[rw].nr_met); 1279 + u32 this_missed = READ_ONCE(stat->missed[rw].nr_missed); 1280 + 1281 + nr_met[rw] += this_met - stat->missed[rw].last_met; 1282 + nr_missed[rw] += this_missed - stat->missed[rw].last_missed; 1283 + stat->missed[rw].last_met = this_met; 1284 + stat->missed[rw].last_missed = 
this_missed; 1285 + } 1286 + 1287 + this_rq_wait_ns = READ_ONCE(stat->rq_wait_ns); 1288 + rq_wait_ns += this_rq_wait_ns - stat->last_rq_wait_ns; 1289 + stat->last_rq_wait_ns = this_rq_wait_ns; 1290 + } 1291 + 1292 + for (rw = READ; rw <= WRITE; rw++) { 1293 + if (nr_met[rw] + nr_missed[rw]) 1294 + missed_ppm_ar[rw] = 1295 + DIV64_U64_ROUND_UP((u64)nr_missed[rw] * MILLION, 1296 + nr_met[rw] + nr_missed[rw]); 1297 + else 1298 + missed_ppm_ar[rw] = 0; 1299 + } 1300 + 1301 + *rq_wait_pct_p = div64_u64(rq_wait_ns * 100, 1302 + ioc->period_us * NSEC_PER_USEC); 1303 + } 1304 + 1305 + /* was iocg idle this period? */ 1306 + static bool iocg_is_idle(struct ioc_gq *iocg) 1307 + { 1308 + struct ioc *ioc = iocg->ioc; 1309 + 1310 + /* did something get issued this period? */ 1311 + if (atomic64_read(&iocg->active_period) == 1312 + atomic64_read(&ioc->cur_period)) 1313 + return false; 1314 + 1315 + /* is something in flight? */ 1316 + if (atomic64_read(&iocg->done_vtime) < atomic64_read(&iocg->vtime)) 1317 + return false; 1318 + 1319 + return true; 1320 + } 1321 + 1322 + /* returns usage with margin added if surplus is large enough */ 1323 + static u32 surplus_adjusted_hweight_inuse(u32 usage, u32 hw_inuse) 1324 + { 1325 + /* add margin */ 1326 + usage = DIV_ROUND_UP(usage * SURPLUS_SCALE_PCT, 100); 1327 + usage += SURPLUS_SCALE_ABS; 1328 + 1329 + /* don't bother if the surplus is too small */ 1330 + if (usage + SURPLUS_MIN_ADJ_DELTA > hw_inuse) 1331 + return 0; 1332 + 1333 + return usage; 1334 + } 1335 + 1336 + static void ioc_timer_fn(struct timer_list *timer) 1337 + { 1338 + struct ioc *ioc = container_of(timer, struct ioc, timer); 1339 + struct ioc_gq *iocg, *tiocg; 1340 + struct ioc_now now; 1341 + int nr_surpluses = 0, nr_shortages = 0, nr_lagging = 0; 1342 + u32 ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM]; 1343 + u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM]; 1344 + u32 missed_ppm[2], rq_wait_pct; 1345 + u64 period_vtime; 1346 + int i; 1347 + 1348 + /* how were 
the latencies during the period? */ 1349 + ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct); 1350 + 1351 + /* take care of active iocgs */ 1352 + spin_lock_irq(&ioc->lock); 1353 + 1354 + ioc_now(ioc, &now); 1355 + 1356 + period_vtime = now.vnow - ioc->period_at_vtime; 1357 + if (WARN_ON_ONCE(!period_vtime)) { 1358 + spin_unlock_irq(&ioc->lock); 1359 + return; 1360 + } 1361 + 1362 + /* 1363 + * Waiters determine the sleep durations based on the vrate they 1364 + * saw at the time of sleep. If vrate has increased, some waiters 1365 + * could be sleeping for too long. Wake up tardy waiters which 1366 + * should have woken up in the last period and expire idle iocgs. 1367 + */ 1368 + list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) { 1369 + if (!waitqueue_active(&iocg->waitq) && 1370 + !atomic64_read(&iocg->abs_vdebt) && !iocg_is_idle(iocg)) 1371 + continue; 1372 + 1373 + spin_lock(&iocg->waitq.lock); 1374 + 1375 + if (waitqueue_active(&iocg->waitq) || 1376 + atomic64_read(&iocg->abs_vdebt)) { 1377 + /* might be oversleeping vtime / hweight changes, kick */ 1378 + iocg_kick_waitq(iocg, &now); 1379 + iocg_kick_delay(iocg, &now, 0); 1380 + } else if (iocg_is_idle(iocg)) { 1381 + /* no waiter and idle, deactivate */ 1382 + iocg->last_inuse = iocg->inuse; 1383 + __propagate_active_weight(iocg, 0, 0); 1384 + list_del_init(&iocg->active_list); 1385 + } 1386 + 1387 + spin_unlock(&iocg->waitq.lock); 1388 + } 1389 + commit_active_weights(ioc); 1390 + 1391 + /* calc usages and see whether some weights need to be moved around */ 1392 + list_for_each_entry(iocg, &ioc->active_iocgs, active_list) { 1393 + u64 vdone, vtime, vusage, vmargin, vmin; 1394 + u32 hw_active, hw_inuse, usage; 1395 + 1396 + /* 1397 + * Collect unused and wind vtime closer to vnow to prevent 1398 + * iocgs from accumulating a large amount of budget. 
1399 + */ 1400 + vdone = atomic64_read(&iocg->done_vtime); 1401 + vtime = atomic64_read(&iocg->vtime); 1402 + current_hweight(iocg, &hw_active, &hw_inuse); 1403 + 1404 + /* 1405 + * Latency QoS detection doesn't account for IOs which are 1406 + * in-flight for longer than a period. Detect them by 1407 + * comparing vdone against period start. If lagging behind 1408 + * IOs from past periods, don't increase vrate. 1409 + */ 1410 + if (!atomic_read(&iocg_to_blkg(iocg)->use_delay) && 1411 + time_after64(vtime, vdone) && 1412 + time_after64(vtime, now.vnow - 1413 + MAX_LAGGING_PERIODS * period_vtime) && 1414 + time_before64(vdone, now.vnow - period_vtime)) 1415 + nr_lagging++; 1416 + 1417 + if (waitqueue_active(&iocg->waitq)) 1418 + vusage = now.vnow - iocg->last_vtime; 1419 + else if (time_before64(iocg->last_vtime, vtime)) 1420 + vusage = vtime - iocg->last_vtime; 1421 + else 1422 + vusage = 0; 1423 + 1424 + iocg->last_vtime += vusage; 1425 + /* 1426 + * Factor in in-flight vtime into vusage to avoid 1427 + * high-latency completions appearing as idle. This should 1428 + * be done after the above ->last_vtime adjustment. 
1429 + */ 1430 + vusage = max(vusage, vtime - vdone); 1431 + 1432 + /* calculate hweight based usage ratio and record */ 1433 + if (vusage) { 1434 + usage = DIV64_U64_ROUND_UP(vusage * hw_inuse, 1435 + period_vtime); 1436 + iocg->usage_idx = (iocg->usage_idx + 1) % NR_USAGE_SLOTS; 1437 + iocg->usages[iocg->usage_idx] = usage; 1438 + } else { 1439 + usage = 0; 1440 + } 1441 + 1442 + /* see whether there's surplus vtime */ 1443 + vmargin = ioc->margin_us * now.vrate; 1444 + vmin = now.vnow - vmargin; 1445 + 1446 + iocg->has_surplus = false; 1447 + 1448 + if (!waitqueue_active(&iocg->waitq) && 1449 + time_before64(vtime, vmin)) { 1450 + u64 delta = vmin - vtime; 1451 + 1452 + /* throw away surplus vtime */ 1453 + atomic64_add(delta, &iocg->vtime); 1454 + atomic64_add(delta, &iocg->done_vtime); 1455 + iocg->last_vtime += delta; 1456 + /* if usage is sufficiently low, maybe it can donate */ 1457 + if (surplus_adjusted_hweight_inuse(usage, hw_inuse)) { 1458 + iocg->has_surplus = true; 1459 + nr_surpluses++; 1460 + } 1461 + } else if (hw_inuse < hw_active) { 1462 + u32 new_hwi, new_inuse; 1463 + 1464 + /* was donating but might need to take back some */ 1465 + if (waitqueue_active(&iocg->waitq)) { 1466 + new_hwi = hw_active; 1467 + } else { 1468 + new_hwi = max(hw_inuse, 1469 + usage * SURPLUS_SCALE_PCT / 100 + 1470 + SURPLUS_SCALE_ABS); 1471 + } 1472 + 1473 + new_inuse = div64_u64((u64)iocg->inuse * new_hwi, 1474 + hw_inuse); 1475 + new_inuse = clamp_t(u32, new_inuse, 1, iocg->active); 1476 + 1477 + if (new_inuse > iocg->inuse) { 1478 + TRACE_IOCG_PATH(inuse_takeback, iocg, &now, 1479 + iocg->inuse, new_inuse, 1480 + hw_inuse, new_hwi); 1481 + __propagate_active_weight(iocg, iocg->weight, 1482 + new_inuse); 1483 + } 1484 + } else { 1485 + /* genuinely out of vtime */ 1486 + nr_shortages++; 1487 + } 1488 + } 1489 + 1490 + if (!nr_shortages || !nr_surpluses) 1491 + goto skip_surplus_transfers; 1492 + 1493 + /* there are both shortages and surpluses, transfer surpluses */ 
1494 + list_for_each_entry(iocg, &ioc->active_iocgs, active_list) { 1495 + u32 usage, hw_active, hw_inuse, new_hwi, new_inuse; 1496 + int nr_valid = 0; 1497 + 1498 + if (!iocg->has_surplus) 1499 + continue; 1500 + 1501 + /* base the decision on max historical usage */ 1502 + for (i = 0, usage = 0; i < NR_USAGE_SLOTS; i++) { 1503 + if (iocg->usages[i]) { 1504 + usage = max(usage, iocg->usages[i]); 1505 + nr_valid++; 1506 + } 1507 + } 1508 + if (nr_valid < MIN_VALID_USAGES) 1509 + continue; 1510 + 1511 + current_hweight(iocg, &hw_active, &hw_inuse); 1512 + new_hwi = surplus_adjusted_hweight_inuse(usage, hw_inuse); 1513 + if (!new_hwi) 1514 + continue; 1515 + 1516 + new_inuse = DIV64_U64_ROUND_UP((u64)iocg->inuse * new_hwi, 1517 + hw_inuse); 1518 + if (new_inuse < iocg->inuse) { 1519 + TRACE_IOCG_PATH(inuse_giveaway, iocg, &now, 1520 + iocg->inuse, new_inuse, 1521 + hw_inuse, new_hwi); 1522 + __propagate_active_weight(iocg, iocg->weight, new_inuse); 1523 + } 1524 + } 1525 + skip_surplus_transfers: 1526 + commit_active_weights(ioc); 1527 + 1528 + /* 1529 + * If q is getting clogged or we're missing too much, we're issuing 1530 + * too much IO and should lower vtime rate. If we're not missing 1531 + * and experiencing shortages but not surpluses, we're too stingy 1532 + * and should increase vtime rate. 
1533 + */ 1534 + if (rq_wait_pct > RQ_WAIT_BUSY_PCT || 1535 + missed_ppm[READ] > ppm_rthr || 1536 + missed_ppm[WRITE] > ppm_wthr) { 1537 + ioc->busy_level = max(ioc->busy_level, 0); 1538 + ioc->busy_level++; 1539 + } else if (nr_lagging) { 1540 + ioc->busy_level = max(ioc->busy_level, 0); 1541 + } else if (nr_shortages && !nr_surpluses && 1542 + rq_wait_pct <= RQ_WAIT_BUSY_PCT * UNBUSY_THR_PCT / 100 && 1543 + missed_ppm[READ] <= ppm_rthr * UNBUSY_THR_PCT / 100 && 1544 + missed_ppm[WRITE] <= ppm_wthr * UNBUSY_THR_PCT / 100) { 1545 + ioc->busy_level = min(ioc->busy_level, 0); 1546 + ioc->busy_level--; 1547 + } else { 1548 + ioc->busy_level = 0; 1549 + } 1550 + 1551 + ioc->busy_level = clamp(ioc->busy_level, -1000, 1000); 1552 + 1553 + if (ioc->busy_level) { 1554 + u64 vrate = atomic64_read(&ioc->vtime_rate); 1555 + u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max; 1556 + 1557 + /* rq_wait signal is always reliable, ignore user vrate_min */ 1558 + if (rq_wait_pct > RQ_WAIT_BUSY_PCT) 1559 + vrate_min = VRATE_MIN; 1560 + 1561 + /* 1562 + * If vrate is out of bounds, apply clamp gradually as the 1563 + * bounds can change abruptly. Otherwise, apply busy_level 1564 + * based adjustment. 
1565 + */ 1566 + if (vrate < vrate_min) { 1567 + vrate = div64_u64(vrate * (100 + VRATE_CLAMP_ADJ_PCT), 1568 + 100); 1569 + vrate = min(vrate, vrate_min); 1570 + } else if (vrate > vrate_max) { 1571 + vrate = div64_u64(vrate * (100 - VRATE_CLAMP_ADJ_PCT), 1572 + 100); 1573 + vrate = max(vrate, vrate_max); 1574 + } else { 1575 + int idx = min_t(int, abs(ioc->busy_level), 1576 + ARRAY_SIZE(vrate_adj_pct) - 1); 1577 + u32 adj_pct = vrate_adj_pct[idx]; 1578 + 1579 + if (ioc->busy_level > 0) 1580 + adj_pct = 100 - adj_pct; 1581 + else 1582 + adj_pct = 100 + adj_pct; 1583 + 1584 + vrate = clamp(DIV64_U64_ROUND_UP(vrate * adj_pct, 100), 1585 + vrate_min, vrate_max); 1586 + } 1587 + 1588 + trace_iocost_ioc_vrate_adj(ioc, vrate, &missed_ppm, rq_wait_pct, 1589 + nr_lagging, nr_shortages, 1590 + nr_surpluses); 1591 + 1592 + atomic64_set(&ioc->vtime_rate, vrate); 1593 + ioc->inuse_margin_vtime = DIV64_U64_ROUND_UP( 1594 + ioc->period_us * vrate * INUSE_MARGIN_PCT, 100); 1595 + } 1596 + 1597 + ioc_refresh_params(ioc, false); 1598 + 1599 + /* 1600 + * This period is done. Move onto the next one. If nothing's 1601 + * going on with the device, stop the timer. 
1602 + */ 1603 + atomic64_inc(&ioc->cur_period); 1604 + 1605 + if (ioc->running != IOC_STOP) { 1606 + if (!list_empty(&ioc->active_iocgs)) { 1607 + ioc_start_period(ioc, &now); 1608 + } else { 1609 + ioc->busy_level = 0; 1610 + ioc->running = IOC_IDLE; 1611 + } 1612 + } 1613 + 1614 + spin_unlock_irq(&ioc->lock); 1615 + } 1616 + 1617 + static void calc_vtime_cost_builtin(struct bio *bio, struct ioc_gq *iocg, 1618 + bool is_merge, u64 *costp) 1619 + { 1620 + struct ioc *ioc = iocg->ioc; 1621 + u64 coef_seqio, coef_randio, coef_page; 1622 + u64 pages = max_t(u64, bio_sectors(bio) >> IOC_SECT_TO_PAGE_SHIFT, 1); 1623 + u64 seek_pages = 0; 1624 + u64 cost = 0; 1625 + 1626 + switch (bio_op(bio)) { 1627 + case REQ_OP_READ: 1628 + coef_seqio = ioc->params.lcoefs[LCOEF_RSEQIO]; 1629 + coef_randio = ioc->params.lcoefs[LCOEF_RRANDIO]; 1630 + coef_page = ioc->params.lcoefs[LCOEF_RPAGE]; 1631 + break; 1632 + case REQ_OP_WRITE: 1633 + coef_seqio = ioc->params.lcoefs[LCOEF_WSEQIO]; 1634 + coef_randio = ioc->params.lcoefs[LCOEF_WRANDIO]; 1635 + coef_page = ioc->params.lcoefs[LCOEF_WPAGE]; 1636 + break; 1637 + default: 1638 + goto out; 1639 + } 1640 + 1641 + if (iocg->cursor) { 1642 + seek_pages = abs(bio->bi_iter.bi_sector - iocg->cursor); 1643 + seek_pages >>= IOC_SECT_TO_PAGE_SHIFT; 1644 + } 1645 + 1646 + if (!is_merge) { 1647 + if (seek_pages > LCOEF_RANDIO_PAGES) { 1648 + cost += coef_randio; 1649 + } else { 1650 + cost += coef_seqio; 1651 + } 1652 + } 1653 + cost += pages * coef_page; 1654 + out: 1655 + *costp = cost; 1656 + } 1657 + 1658 + static u64 calc_vtime_cost(struct bio *bio, struct ioc_gq *iocg, bool is_merge) 1659 + { 1660 + u64 cost; 1661 + 1662 + calc_vtime_cost_builtin(bio, iocg, is_merge, &cost); 1663 + return cost; 1664 + } 1665 + 1666 + static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio) 1667 + { 1668 + struct blkcg_gq *blkg = bio->bi_blkg; 1669 + struct ioc *ioc = rqos_to_ioc(rqos); 1670 + struct ioc_gq *iocg = blkg_to_iocg(blkg); 1671 + struct 
ioc_now now; 1672 + struct iocg_wait wait; 1673 + u32 hw_active, hw_inuse; 1674 + u64 abs_cost, cost, vtime; 1675 + 1676 + /* bypass IOs if disabled or for root cgroup */ 1677 + if (!ioc->enabled || !iocg->level) 1678 + return; 1679 + 1680 + /* always activate so that even 0 cost IOs get protected to some level */ 1681 + if (!iocg_activate(iocg, &now)) 1682 + return; 1683 + 1684 + /* calculate the absolute vtime cost */ 1685 + abs_cost = calc_vtime_cost(bio, iocg, false); 1686 + if (!abs_cost) 1687 + return; 1688 + 1689 + iocg->cursor = bio_end_sector(bio); 1690 + 1691 + vtime = atomic64_read(&iocg->vtime); 1692 + current_hweight(iocg, &hw_active, &hw_inuse); 1693 + 1694 + if (hw_inuse < hw_active && 1695 + time_after_eq64(vtime + ioc->inuse_margin_vtime, now.vnow)) { 1696 + TRACE_IOCG_PATH(inuse_reset, iocg, &now, 1697 + iocg->inuse, iocg->weight, hw_inuse, hw_active); 1698 + spin_lock_irq(&ioc->lock); 1699 + propagate_active_weight(iocg, iocg->weight, iocg->weight); 1700 + spin_unlock_irq(&ioc->lock); 1701 + current_hweight(iocg, &hw_active, &hw_inuse); 1702 + } 1703 + 1704 + cost = abs_cost_to_cost(abs_cost, hw_inuse); 1705 + 1706 + /* 1707 + * If no one's waiting and within budget, issue right away. The 1708 + * tests are racy but the races aren't systemic - we only miss once 1709 + * in a while which is fine. 1710 + */ 1711 + if (!waitqueue_active(&iocg->waitq) && 1712 + !atomic64_read(&iocg->abs_vdebt) && 1713 + time_before_eq64(vtime + cost, now.vnow)) { 1714 + iocg_commit_bio(iocg, bio, cost); 1715 + return; 1716 + } 1717 + 1718 + /* 1719 + * We're over budget. If @bio has to be issued regardless, 1720 + * remember the abs_cost instead of advancing vtime. 1721 + * iocg_kick_waitq() will pay off the debt before waking more IOs. 1722 + * This way, the debt is continuously paid off each period with the 1723 + * actual budget available to the cgroup. 
If we just wound vtime, 1724 + * we would incorrectly use the current hw_inuse for the entire 1725 + * amount which, for example, can lead to the cgroup staying 1726 + * blocked for a long time even with substantially raised hw_inuse. 1727 + */ 1728 + if (bio_issue_as_root_blkg(bio) || fatal_signal_pending(current)) { 1729 + atomic64_add(abs_cost, &iocg->abs_vdebt); 1730 + iocg_kick_delay(iocg, &now, cost); 1731 + return; 1732 + } 1733 + 1734 + /* 1735 + * Append self to the waitq and schedule the wakeup timer if we're 1736 + * the first waiter. The timer duration is calculated based on the 1737 + * current vrate. vtime and hweight changes can make it too short 1738 + * or too long. Each wait entry records the absolute cost it's 1739 + * waiting for to allow re-evaluation using a custom wait entry. 1740 + * 1741 + * If too short, the timer simply reschedules itself. If too long, 1742 + * the period timer will notice and trigger wakeups. 1743 + * 1744 + * All waiters are on iocg->waitq and the wait states are 1745 + * synchronized using waitq.lock. 1746 + */ 1747 + spin_lock_irq(&iocg->waitq.lock); 1748 + 1749 + /* 1750 + * We activated above but w/o any synchronization. Deactivation is 1751 + * synchronized with waitq.lock and we won't get deactivated as 1752 + * long as we're waiting, so we're good if we're activated here. 1753 + * In the unlikely case that we are deactivated, just issue the IO. 
1754 + */ 1755 + if (unlikely(list_empty(&iocg->active_list))) { 1756 + spin_unlock_irq(&iocg->waitq.lock); 1757 + iocg_commit_bio(iocg, bio, cost); 1758 + return; 1759 + } 1760 + 1761 + init_waitqueue_func_entry(&wait.wait, iocg_wake_fn); 1762 + wait.wait.private = current; 1763 + wait.bio = bio; 1764 + wait.abs_cost = abs_cost; 1765 + wait.committed = false; /* will be set true by waker */ 1766 + 1767 + __add_wait_queue_entry_tail(&iocg->waitq, &wait.wait); 1768 + iocg_kick_waitq(iocg, &now); 1769 + 1770 + spin_unlock_irq(&iocg->waitq.lock); 1771 + 1772 + while (true) { 1773 + set_current_state(TASK_UNINTERRUPTIBLE); 1774 + if (wait.committed) 1775 + break; 1776 + io_schedule(); 1777 + } 1778 + 1779 + /* waker already committed us, proceed */ 1780 + finish_wait(&iocg->waitq, &wait.wait); 1781 + } 1782 + 1783 + static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq, 1784 + struct bio *bio) 1785 + { 1786 + struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg); 1787 + struct ioc *ioc = iocg->ioc; 1788 + sector_t bio_end = bio_end_sector(bio); 1789 + struct ioc_now now; 1790 + u32 hw_inuse; 1791 + u64 abs_cost, cost; 1792 + 1793 + /* bypass if disabled or for root cgroup */ 1794 + if (!ioc->enabled || !iocg->level) 1795 + return; 1796 + 1797 + abs_cost = calc_vtime_cost(bio, iocg, true); 1798 + if (!abs_cost) 1799 + return; 1800 + 1801 + ioc_now(ioc, &now); 1802 + current_hweight(iocg, NULL, &hw_inuse); 1803 + cost = abs_cost_to_cost(abs_cost, hw_inuse); 1804 + 1805 + /* update cursor if backmerging into the request at the cursor */ 1806 + if (blk_rq_pos(rq) < bio_end && 1807 + blk_rq_pos(rq) + blk_rq_sectors(rq) == iocg->cursor) 1808 + iocg->cursor = bio_end; 1809 + 1810 + /* 1811 + * Charge if there's enough vtime budget and the existing request 1812 + * has cost assigned. Otherwise, account it as debt. See debt 1813 + * handling in ioc_rqos_throttle() for details. 
1814 + */ 1815 + if (rq->bio && rq->bio->bi_iocost_cost && 1816 + time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) 1817 + iocg_commit_bio(iocg, bio, cost); 1818 + else 1819 + atomic64_add(abs_cost, &iocg->abs_vdebt); 1820 + } 1821 + 1822 + static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio) 1823 + { 1824 + struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg); 1825 + 1826 + if (iocg && bio->bi_iocost_cost) 1827 + atomic64_add(bio->bi_iocost_cost, &iocg->done_vtime); 1828 + } 1829 + 1830 + static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq) 1831 + { 1832 + struct ioc *ioc = rqos_to_ioc(rqos); 1833 + u64 on_q_ns, rq_wait_ns; 1834 + int pidx, rw; 1835 + 1836 + if (!ioc->enabled || !rq->alloc_time_ns || !rq->start_time_ns) 1837 + return; 1838 + 1839 + switch (req_op(rq) & REQ_OP_MASK) { 1840 + case REQ_OP_READ: 1841 + pidx = QOS_RLAT; 1842 + rw = READ; 1843 + break; 1844 + case REQ_OP_WRITE: 1845 + pidx = QOS_WLAT; 1846 + rw = WRITE; 1847 + break; 1848 + default: 1849 + return; 1850 + } 1851 + 1852 + on_q_ns = ktime_get_ns() - rq->alloc_time_ns; 1853 + rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns; 1854 + 1855 + if (on_q_ns <= ioc->params.qos[pidx] * NSEC_PER_USEC) 1856 + this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_met); 1857 + else 1858 + this_cpu_inc(ioc->pcpu_stat->missed[rw].nr_missed); 1859 + 1860 + this_cpu_add(ioc->pcpu_stat->rq_wait_ns, rq_wait_ns); 1861 + } 1862 + 1863 + static void ioc_rqos_queue_depth_changed(struct rq_qos *rqos) 1864 + { 1865 + struct ioc *ioc = rqos_to_ioc(rqos); 1866 + 1867 + spin_lock_irq(&ioc->lock); 1868 + ioc_refresh_params(ioc, false); 1869 + spin_unlock_irq(&ioc->lock); 1870 + } 1871 + 1872 + static void ioc_rqos_exit(struct rq_qos *rqos) 1873 + { 1874 + struct ioc *ioc = rqos_to_ioc(rqos); 1875 + 1876 + blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost); 1877 + 1878 + spin_lock_irq(&ioc->lock); 1879 + ioc->running = IOC_STOP; 1880 + spin_unlock_irq(&ioc->lock); 1881 + 1882 + 
del_timer_sync(&ioc->timer); 1883 + free_percpu(ioc->pcpu_stat); 1884 + kfree(ioc); 1885 + } 1886 + 1887 + static struct rq_qos_ops ioc_rqos_ops = { 1888 + .throttle = ioc_rqos_throttle, 1889 + .merge = ioc_rqos_merge, 1890 + .done_bio = ioc_rqos_done_bio, 1891 + .done = ioc_rqos_done, 1892 + .queue_depth_changed = ioc_rqos_queue_depth_changed, 1893 + .exit = ioc_rqos_exit, 1894 + }; 1895 + 1896 + static int blk_iocost_init(struct request_queue *q) 1897 + { 1898 + struct ioc *ioc; 1899 + struct rq_qos *rqos; 1900 + int ret; 1901 + 1902 + ioc = kzalloc(sizeof(*ioc), GFP_KERNEL); 1903 + if (!ioc) 1904 + return -ENOMEM; 1905 + 1906 + ioc->pcpu_stat = alloc_percpu(struct ioc_pcpu_stat); 1907 + if (!ioc->pcpu_stat) { 1908 + kfree(ioc); 1909 + return -ENOMEM; 1910 + } 1911 + 1912 + rqos = &ioc->rqos; 1913 + rqos->id = RQ_QOS_COST; 1914 + rqos->ops = &ioc_rqos_ops; 1915 + rqos->q = q; 1916 + 1917 + spin_lock_init(&ioc->lock); 1918 + timer_setup(&ioc->timer, ioc_timer_fn, 0); 1919 + INIT_LIST_HEAD(&ioc->active_iocgs); 1920 + 1921 + ioc->running = IOC_IDLE; 1922 + atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC); 1923 + seqcount_init(&ioc->period_seqcount); 1924 + ioc->period_at = ktime_to_us(ktime_get()); 1925 + atomic64_set(&ioc->cur_period, 0); 1926 + atomic_set(&ioc->hweight_gen, 0); 1927 + 1928 + spin_lock_irq(&ioc->lock); 1929 + ioc->autop_idx = AUTOP_INVALID; 1930 + ioc_refresh_params(ioc, true); 1931 + spin_unlock_irq(&ioc->lock); 1932 + 1933 + rq_qos_add(q, rqos); 1934 + ret = blkcg_activate_policy(q, &blkcg_policy_iocost); 1935 + if (ret) { 1936 + rq_qos_del(q, rqos); 1937 + free_percpu(ioc->pcpu_stat); 1938 + kfree(ioc); 1939 + return ret; 1940 + } 1941 + return 0; 1942 + } 1943 + 1944 + static struct blkcg_policy_data *ioc_cpd_alloc(gfp_t gfp) 1945 + { 1946 + struct ioc_cgrp *iocc; 1947 + 1948 + iocc = kzalloc(sizeof(struct ioc_cgrp), gfp); 1949 + if (!iocc) 1950 + return NULL; 1951 + 1952 + iocc->dfl_weight = CGROUP_WEIGHT_DFL; 1953 + return &iocc->cpd; 1954 + } 
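/*
 * Illustration only, not part of this patch: a minimal userspace sketch of
 * the linear cost model that calc_vtime_cost_builtin() applies above. A
 * non-merged IO pays one fixed cost (sequential or random, chosen by seek
 * distance from the cursor) plus a per-page cost; a merged IO pays only the
 * per-page part. The struct, the function name, the coefficient values and
 * both #define values are assumptions made for this example; the real
 * coefficients live in ioc->params.lcoefs[].
 */
```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's per-direction coefficients. */
struct lcoefs {
	uint64_t seqio;		/* fixed cost of a sequential IO */
	uint64_t randio;	/* fixed cost of a random IO */
	uint64_t page;		/* cost per page transferred */
};

#define SECT_TO_PAGE_SHIFT	3	/* assumed: 512-byte sectors, 4k pages */
#define RANDIO_PAGES		4096	/* assumed seek threshold, in pages */

/*
 * Absolute cost of an IO starting at @sector and spanning @nr_sectors,
 * given the end sector of the previous IO in @cursor (0 means "unknown").
 */
uint64_t sketch_abs_cost(const struct lcoefs *c, uint64_t sector,
			 uint64_t nr_sectors, uint64_t cursor, bool is_merge)
{
	uint64_t pages = nr_sectors >> SECT_TO_PAGE_SHIFT;
	uint64_t seek_pages = 0;
	uint64_t cost = 0;

	/* every IO is charged for at least one page */
	if (!pages)
		pages = 1;

	/* seek distance is only known once a cursor has been recorded */
	if (cursor) {
		uint64_t dist = sector > cursor ? sector - cursor
						: cursor - sector;
		seek_pages = dist >> SECT_TO_PAGE_SHIFT;
	}

	/* merges ride on an existing request and skip the fixed cost */
	if (!is_merge)
		cost += seek_pages > RANDIO_PAGES ? c->randio : c->seqio;

	cost += pages * c->page;
	return cost;
}
```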
1955 + 1956 + static void ioc_cpd_free(struct blkcg_policy_data *cpd) 1957 + { 1958 + kfree(container_of(cpd, struct ioc_cgrp, cpd)); 1959 + } 1960 + 1961 + static struct blkg_policy_data *ioc_pd_alloc(gfp_t gfp, struct request_queue *q, 1962 + struct blkcg *blkcg) 1963 + { 1964 + int levels = blkcg->css.cgroup->level + 1; 1965 + struct ioc_gq *iocg; 1966 + 1967 + iocg = kzalloc_node(sizeof(*iocg) + levels * sizeof(iocg->ancestors[0]), 1968 + gfp, q->node); 1969 + if (!iocg) 1970 + return NULL; 1971 + 1972 + return &iocg->pd; 1973 + } 1974 + 1975 + static void ioc_pd_init(struct blkg_policy_data *pd) 1976 + { 1977 + struct ioc_gq *iocg = pd_to_iocg(pd); 1978 + struct blkcg_gq *blkg = pd_to_blkg(&iocg->pd); 1979 + struct ioc *ioc = q_to_ioc(blkg->q); 1980 + struct ioc_now now; 1981 + struct blkcg_gq *tblkg; 1982 + unsigned long flags; 1983 + 1984 + ioc_now(ioc, &now); 1985 + 1986 + iocg->ioc = ioc; 1987 + atomic64_set(&iocg->vtime, now.vnow); 1988 + atomic64_set(&iocg->done_vtime, now.vnow); 1989 + atomic64_set(&iocg->abs_vdebt, 0); 1990 + atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period)); 1991 + INIT_LIST_HEAD(&iocg->active_list); 1992 + iocg->hweight_active = HWEIGHT_WHOLE; 1993 + iocg->hweight_inuse = HWEIGHT_WHOLE; 1994 + 1995 + init_waitqueue_head(&iocg->waitq); 1996 + hrtimer_init(&iocg->waitq_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); 1997 + iocg->waitq_timer.function = iocg_waitq_timer_fn; 1998 + hrtimer_init(&iocg->delay_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS); 1999 + iocg->delay_timer.function = iocg_delay_timer_fn; 2000 + 2001 + iocg->level = blkg->blkcg->css.cgroup->level; 2002 + 2003 + for (tblkg = blkg; tblkg; tblkg = tblkg->parent) { 2004 + struct ioc_gq *tiocg = blkg_to_iocg(tblkg); 2005 + iocg->ancestors[tiocg->level] = tiocg; 2006 + } 2007 + 2008 + spin_lock_irqsave(&ioc->lock, flags); 2009 + weight_updated(iocg); 2010 + spin_unlock_irqrestore(&ioc->lock, flags); 2011 + } 2012 + 2013 + static void ioc_pd_free(struct 
blkg_policy_data *pd) 2014 + { 2015 + struct ioc_gq *iocg = pd_to_iocg(pd); 2016 + struct ioc *ioc = iocg->ioc; 2017 + 2018 + if (ioc) { 2019 + spin_lock(&ioc->lock); 2020 + if (!list_empty(&iocg->active_list)) { 2021 + propagate_active_weight(iocg, 0, 0); 2022 + list_del_init(&iocg->active_list); 2023 + } 2024 + spin_unlock(&ioc->lock); 2025 + 2026 + hrtimer_cancel(&iocg->waitq_timer); 2027 + hrtimer_cancel(&iocg->delay_timer); 2028 + } 2029 + kfree(iocg); 2030 + } 2031 + 2032 + static u64 ioc_weight_prfill(struct seq_file *sf, struct blkg_policy_data *pd, 2033 + int off) 2034 + { 2035 + const char *dname = blkg_dev_name(pd->blkg); 2036 + struct ioc_gq *iocg = pd_to_iocg(pd); 2037 + 2038 + if (dname && iocg->cfg_weight) 2039 + seq_printf(sf, "%s %u\n", dname, iocg->cfg_weight); 2040 + return 0; 2041 + } 2042 + 2043 + 2044 + static int ioc_weight_show(struct seq_file *sf, void *v) 2045 + { 2046 + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); 2047 + struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg); 2048 + 2049 + seq_printf(sf, "default %u\n", iocc->dfl_weight); 2050 + blkcg_print_blkgs(sf, blkcg, ioc_weight_prfill, 2051 + &blkcg_policy_iocost, seq_cft(sf)->private, false); 2052 + return 0; 2053 + } 2054 + 2055 + static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf, 2056 + size_t nbytes, loff_t off) 2057 + { 2058 + struct blkcg *blkcg = css_to_blkcg(of_css(of)); 2059 + struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg); 2060 + struct blkg_conf_ctx ctx; 2061 + struct ioc_gq *iocg; 2062 + u32 v; 2063 + int ret; 2064 + 2065 + if (!strchr(buf, ':')) { 2066 + struct blkcg_gq *blkg; 2067 + 2068 + if (!sscanf(buf, "default %u", &v) && !sscanf(buf, "%u", &v)) 2069 + return -EINVAL; 2070 + 2071 + if (v < CGROUP_WEIGHT_MIN || v > CGROUP_WEIGHT_MAX) 2072 + return -EINVAL; 2073 + 2074 + spin_lock(&blkcg->lock); 2075 + iocc->dfl_weight = v; 2076 + hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) { 2077 + struct ioc_gq *iocg = blkg_to_iocg(blkg); 2078 + 2079 + 
if (iocg) { 2080 + spin_lock_irq(&iocg->ioc->lock); 2081 + weight_updated(iocg); 2082 + spin_unlock_irq(&iocg->ioc->lock); 2083 + } 2084 + } 2085 + spin_unlock(&blkcg->lock); 2086 + 2087 + return nbytes; 2088 + } 2089 + 2090 + ret = blkg_conf_prep(blkcg, &blkcg_policy_iocost, buf, &ctx); 2091 + if (ret) 2092 + return ret; 2093 + 2094 + iocg = blkg_to_iocg(ctx.blkg); 2095 + 2096 + if (!strncmp(ctx.body, "default", 7)) { 2097 + v = 0; 2098 + } else { 2099 + if (!sscanf(ctx.body, "%u", &v)) 2100 + goto einval; 2101 + if (v < CGROUP_WEIGHT_MIN || v > CGROUP_WEIGHT_MAX) 2102 + goto einval; 2103 + } 2104 + 2105 + spin_lock_irq(&iocg->ioc->lock); 2106 + iocg->cfg_weight = v; 2107 + weight_updated(iocg); 2108 + spin_unlock_irq(&iocg->ioc->lock); 2109 + 2110 + blkg_conf_finish(&ctx); 2111 + return nbytes; 2112 + 2113 + einval: 2114 + blkg_conf_finish(&ctx); 2115 + return -EINVAL; 2116 + } 2117 + 2118 + static u64 ioc_qos_prfill(struct seq_file *sf, struct blkg_policy_data *pd, 2119 + int off) 2120 + { 2121 + const char *dname = blkg_dev_name(pd->blkg); 2122 + struct ioc *ioc = pd_to_iocg(pd)->ioc; 2123 + 2124 + if (!dname) 2125 + return 0; 2126 + 2127 + seq_printf(sf, "%s enable=%d ctrl=%s rpct=%u.%02u rlat=%u wpct=%u.%02u wlat=%u min=%u.%02u max=%u.%02u\n", 2128 + dname, ioc->enabled, ioc->user_qos_params ? 
"user" : "auto", 2129 + ioc->params.qos[QOS_RPPM] / 10000, 2130 + ioc->params.qos[QOS_RPPM] % 10000 / 100, 2131 + ioc->params.qos[QOS_RLAT], 2132 + ioc->params.qos[QOS_WPPM] / 10000, 2133 + ioc->params.qos[QOS_WPPM] % 10000 / 100, 2134 + ioc->params.qos[QOS_WLAT], 2135 + ioc->params.qos[QOS_MIN] / 10000, 2136 + ioc->params.qos[QOS_MIN] % 10000 / 100, 2137 + ioc->params.qos[QOS_MAX] / 10000, 2138 + ioc->params.qos[QOS_MAX] % 10000 / 100); 2139 + return 0; 2140 + } 2141 + 2142 + static int ioc_qos_show(struct seq_file *sf, void *v) 2143 + { 2144 + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); 2145 + 2146 + blkcg_print_blkgs(sf, blkcg, ioc_qos_prfill, 2147 + &blkcg_policy_iocost, seq_cft(sf)->private, false); 2148 + return 0; 2149 + } 2150 + 2151 + static const match_table_t qos_ctrl_tokens = { 2152 + { QOS_ENABLE, "enable=%u" }, 2153 + { QOS_CTRL, "ctrl=%s" }, 2154 + { NR_QOS_CTRL_PARAMS, NULL }, 2155 + }; 2156 + 2157 + static const match_table_t qos_tokens = { 2158 + { QOS_RPPM, "rpct=%s" }, 2159 + { QOS_RLAT, "rlat=%u" }, 2160 + { QOS_WPPM, "wpct=%s" }, 2161 + { QOS_WLAT, "wlat=%u" }, 2162 + { QOS_MIN, "min=%s" }, 2163 + { QOS_MAX, "max=%s" }, 2164 + { NR_QOS_PARAMS, NULL }, 2165 + }; 2166 + 2167 + static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input, 2168 + size_t nbytes, loff_t off) 2169 + { 2170 + struct gendisk *disk; 2171 + struct ioc *ioc; 2172 + u32 qos[NR_QOS_PARAMS]; 2173 + bool enable, user; 2174 + char *p; 2175 + int ret; 2176 + 2177 + disk = blkcg_conf_get_disk(&input); 2178 + if (IS_ERR(disk)) 2179 + return PTR_ERR(disk); 2180 + 2181 + ioc = q_to_ioc(disk->queue); 2182 + if (!ioc) { 2183 + ret = blk_iocost_init(disk->queue); 2184 + if (ret) 2185 + goto err; 2186 + ioc = q_to_ioc(disk->queue); 2187 + } 2188 + 2189 + spin_lock_irq(&ioc->lock); 2190 + memcpy(qos, ioc->params.qos, sizeof(qos)); 2191 + enable = ioc->enabled; 2192 + user = ioc->user_qos_params; 2193 + spin_unlock_irq(&ioc->lock); 2194 + 2195 + while ((p = strsep(&input, " 
\t\n"))) { 2196 + substring_t args[MAX_OPT_ARGS]; 2197 + char buf[32]; 2198 + int tok; 2199 + s64 v; 2200 + 2201 + if (!*p) 2202 + continue; 2203 + 2204 + switch (match_token(p, qos_ctrl_tokens, args)) { 2205 + case QOS_ENABLE: 2206 + match_u64(&args[0], &v); 2207 + enable = v; 2208 + continue; 2209 + case QOS_CTRL: 2210 + match_strlcpy(buf, &args[0], sizeof(buf)); 2211 + if (!strcmp(buf, "auto")) 2212 + user = false; 2213 + else if (!strcmp(buf, "user")) 2214 + user = true; 2215 + else 2216 + goto einval; 2217 + continue; 2218 + } 2219 + 2220 + tok = match_token(p, qos_tokens, args); 2221 + switch (tok) { 2222 + case QOS_RPPM: 2223 + case QOS_WPPM: 2224 + if (match_strlcpy(buf, &args[0], sizeof(buf)) >= 2225 + sizeof(buf)) 2226 + goto einval; 2227 + if (cgroup_parse_float(buf, 2, &v)) 2228 + goto einval; 2229 + if (v < 0 || v > 10000) 2230 + goto einval; 2231 + qos[tok] = v * 100; 2232 + break; 2233 + case QOS_RLAT: 2234 + case QOS_WLAT: 2235 + if (match_u64(&args[0], &v)) 2236 + goto einval; 2237 + qos[tok] = v; 2238 + break; 2239 + case QOS_MIN: 2240 + case QOS_MAX: 2241 + if (match_strlcpy(buf, &args[0], sizeof(buf)) >= 2242 + sizeof(buf)) 2243 + goto einval; 2244 + if (cgroup_parse_float(buf, 2, &v)) 2245 + goto einval; 2246 + if (v < 0) 2247 + goto einval; 2248 + qos[tok] = clamp_t(s64, v * 100, 2249 + VRATE_MIN_PPM, VRATE_MAX_PPM); 2250 + break; 2251 + default: 2252 + goto einval; 2253 + } 2254 + user = true; 2255 + } 2256 + 2257 + if (qos[QOS_MIN] > qos[QOS_MAX]) 2258 + goto einval; 2259 + 2260 + spin_lock_irq(&ioc->lock); 2261 + 2262 + if (enable) { 2263 + blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, ioc->rqos.q); 2264 + ioc->enabled = true; 2265 + } else { 2266 + blk_queue_flag_clear(QUEUE_FLAG_RQ_ALLOC_TIME, ioc->rqos.q); 2267 + ioc->enabled = false; 2268 + } 2269 + 2270 + if (user) { 2271 + memcpy(ioc->params.qos, qos, sizeof(qos)); 2272 + ioc->user_qos_params = true; 2273 + } else { 2274 + ioc->user_qos_params = false; 2275 + } 2276 + 2277 + 
ioc_refresh_params(ioc, true); 2278 + spin_unlock_irq(&ioc->lock); 2279 + 2280 + put_disk_and_module(disk); 2281 + return nbytes; 2282 + einval: 2283 + ret = -EINVAL; 2284 + err: 2285 + put_disk_and_module(disk); 2286 + return ret; 2287 + } 2288 + 2289 + static u64 ioc_cost_model_prfill(struct seq_file *sf, 2290 + struct blkg_policy_data *pd, int off) 2291 + { 2292 + const char *dname = blkg_dev_name(pd->blkg); 2293 + struct ioc *ioc = pd_to_iocg(pd)->ioc; 2294 + u64 *u = ioc->params.i_lcoefs; 2295 + 2296 + if (!dname) 2297 + return 0; 2298 + 2299 + seq_printf(sf, "%s ctrl=%s model=linear " 2300 + "rbps=%llu rseqiops=%llu rrandiops=%llu " 2301 + "wbps=%llu wseqiops=%llu wrandiops=%llu\n", 2302 + dname, ioc->user_cost_model ? "user" : "auto", 2303 + u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS], 2304 + u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS]); 2305 + return 0; 2306 + } 2307 + 2308 + static int ioc_cost_model_show(struct seq_file *sf, void *v) 2309 + { 2310 + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); 2311 + 2312 + blkcg_print_blkgs(sf, blkcg, ioc_cost_model_prfill, 2313 + &blkcg_policy_iocost, seq_cft(sf)->private, false); 2314 + return 0; 2315 + } 2316 + 2317 + static const match_table_t cost_ctrl_tokens = { 2318 + { COST_CTRL, "ctrl=%s" }, 2319 + { COST_MODEL, "model=%s" }, 2320 + { NR_COST_CTRL_PARAMS, NULL }, 2321 + }; 2322 + 2323 + static const match_table_t i_lcoef_tokens = { 2324 + { I_LCOEF_RBPS, "rbps=%u" }, 2325 + { I_LCOEF_RSEQIOPS, "rseqiops=%u" }, 2326 + { I_LCOEF_RRANDIOPS, "rrandiops=%u" }, 2327 + { I_LCOEF_WBPS, "wbps=%u" }, 2328 + { I_LCOEF_WSEQIOPS, "wseqiops=%u" }, 2329 + { I_LCOEF_WRANDIOPS, "wrandiops=%u" }, 2330 + { NR_I_LCOEFS, NULL }, 2331 + }; 2332 + 2333 + static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input, 2334 + size_t nbytes, loff_t off) 2335 + { 2336 + struct gendisk *disk; 2337 + struct ioc *ioc; 2338 + u64 u[NR_I_LCOEFS]; 2339 + bool user; 2340 + char *p; 2341 + int ret; 
2342 + 2343 + disk = blkcg_conf_get_disk(&input); 2344 + if (IS_ERR(disk)) 2345 + return PTR_ERR(disk); 2346 + 2347 + ioc = q_to_ioc(disk->queue); 2348 + if (!ioc) { 2349 + ret = blk_iocost_init(disk->queue); 2350 + if (ret) 2351 + goto err; 2352 + ioc = q_to_ioc(disk->queue); 2353 + } 2354 + 2355 + spin_lock_irq(&ioc->lock); 2356 + memcpy(u, ioc->params.i_lcoefs, sizeof(u)); 2357 + user = ioc->user_cost_model; 2358 + spin_unlock_irq(&ioc->lock); 2359 + 2360 + while ((p = strsep(&input, " \t\n"))) { 2361 + substring_t args[MAX_OPT_ARGS]; 2362 + char buf[32]; 2363 + int tok; 2364 + u64 v; 2365 + 2366 + if (!*p) 2367 + continue; 2368 + 2369 + switch (match_token(p, cost_ctrl_tokens, args)) { 2370 + case COST_CTRL: 2371 + match_strlcpy(buf, &args[0], sizeof(buf)); 2372 + if (!strcmp(buf, "auto")) 2373 + user = false; 2374 + else if (!strcmp(buf, "user")) 2375 + user = true; 2376 + else 2377 + goto einval; 2378 + continue; 2379 + case COST_MODEL: 2380 + match_strlcpy(buf, &args[0], sizeof(buf)); 2381 + if (strcmp(buf, "linear")) 2382 + goto einval; 2383 + continue; 2384 + } 2385 + 2386 + tok = match_token(p, i_lcoef_tokens, args); 2387 + if (tok == NR_I_LCOEFS) 2388 + goto einval; 2389 + if (match_u64(&args[0], &v)) 2390 + goto einval; 2391 + u[tok] = v; 2392 + user = true; 2393 + } 2394 + 2395 + spin_lock_irq(&ioc->lock); 2396 + if (user) { 2397 + memcpy(ioc->params.i_lcoefs, u, sizeof(u)); 2398 + ioc->user_cost_model = true; 2399 + } else { 2400 + ioc->user_cost_model = false; 2401 + } 2402 + ioc_refresh_params(ioc, true); 2403 + spin_unlock_irq(&ioc->lock); 2404 + 2405 + put_disk_and_module(disk); 2406 + return nbytes; 2407 + 2408 + einval: 2409 + ret = -EINVAL; 2410 + err: 2411 + put_disk_and_module(disk); 2412 + return ret; 2413 + } 2414 + 2415 + static struct cftype ioc_files[] = { 2416 + { 2417 + .name = "weight", 2418 + .flags = CFTYPE_NOT_ON_ROOT, 2419 + .seq_show = ioc_weight_show, 2420 + .write = ioc_weight_write, 2421 + }, 2422 + { 2423 + .name = 
"cost.qos", 2424 + .flags = CFTYPE_ONLY_ON_ROOT, 2425 + .seq_show = ioc_qos_show, 2426 + .write = ioc_qos_write, 2427 + }, 2428 + { 2429 + .name = "cost.model", 2430 + .flags = CFTYPE_ONLY_ON_ROOT, 2431 + .seq_show = ioc_cost_model_show, 2432 + .write = ioc_cost_model_write, 2433 + }, 2434 + {} 2435 + }; 2436 + 2437 + static struct blkcg_policy blkcg_policy_iocost = { 2438 + .dfl_cftypes = ioc_files, 2439 + .cpd_alloc_fn = ioc_cpd_alloc, 2440 + .cpd_free_fn = ioc_cpd_free, 2441 + .pd_alloc_fn = ioc_pd_alloc, 2442 + .pd_init_fn = ioc_pd_init, 2443 + .pd_free_fn = ioc_pd_free, 2444 + }; 2445 + 2446 + static int __init ioc_init(void) 2447 + { 2448 + return blkcg_policy_register(&blkcg_policy_iocost); 2449 + } 2450 + 2451 + static void __exit ioc_exit(void) 2452 + { 2453 + return blkcg_policy_unregister(&blkcg_policy_iocost); 2454 + } 2455 + 2456 + module_init(ioc_init); 2457 + module_exit(ioc_exit);

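The percentage handling in ioc_qos_write() and ioc_qos_prfill() above can be easier to follow as a round trip. The following is a user-space sketch only: parse_hundredths() is a hypothetical stand-in for cgroup_parse_float(buf, 2, &v), and the two SKETCH_VRATE_* bounds are assumptions taken from the documented [1, 10000] percent range, not from the hunk itself.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Assumed clamp bounds: 1% and 10000%, expressed in the stored unit. */
#define SKETCH_VRATE_MIN_PPM 10000LL
#define SKETCH_VRATE_MAX_PPM 100000000LL

/*
 * Hypothetical stand-in for cgroup_parse_float(buf, 2, &v): parses "95",
 * "95.5" or "95.25" into hundredths (9500, 9550, 9525). Returns 0 on success.
 */
static int parse_hundredths(const char *s, long long *out)
{
	long long whole = 0, frac = 0;
	int digits = 0;

	if (!isdigit((unsigned char)*s))
		return -1;
	while (isdigit((unsigned char)*s)) {
		whole = whole * 10 + (*s++ - '0');
		if (whole > 100000000LL)	/* keep the sketch overflow-free */
			return -1;
	}
	if (*s == '.') {
		s++;
		while (isdigit((unsigned char)*s) && digits < 2) {
			frac = frac * 10 + (*s++ - '0');
			digits++;
		}
	}
	if (*s != '\0')		/* trailing junk or a third fractional digit */
		return -1;
	while (digits++ < 2)	/* "95.5" means 95.50 */
		frac *= 10;
	*out = whole * 100 + frac;
	return 0;
}

/* rpct/wpct: a percentage in [0, 100], stored as hundredths times 100. */
static int pct_to_ppm(const char *s, unsigned int *ppm)
{
	long long v;

	if (parse_hundredths(s, &v) || v < 0 || v > 10000)
		return -1;
	*ppm = (unsigned int)(v * 100);
	return 0;
}

/* min/max: same scaling, but clamped instead of range-checked. */
static int scaling_to_ppm(const char *s, unsigned int *ppm)
{
	long long v;

	if (parse_hundredths(s, &v) || v < 0)
		return -1;
	v *= 100;
	if (v < SKETCH_VRATE_MIN_PPM)
		v = SKETCH_VRATE_MIN_PPM;
	if (v > SKETCH_VRATE_MAX_PPM)
		v = SKETCH_VRATE_MAX_PPM;
	*ppm = (unsigned int)v;
	return 0;
}

/* Same arithmetic as the seq_printf() arguments in ioc_qos_prfill(). */
static void format_ppm(unsigned int ppm, char *buf, size_t len)
{
	assert(buf && len > 0);
	snprintf(buf, len, "%u.%02u", ppm / 10000, ppm % 10000 / 100);
}
```

With these helpers, "95.00" is stored as 950000 and printed back as "95.00", while a "min" of "0.5" is raised to the assumed lower bound and reads back as "1.00".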
+5 -3

block/blk-iolatency.c

···
 	return -ENOMEM;

 	rqos = &blkiolat->rqos;
-	rqos->id = RQ_QOS_CGROUP;
+	rqos->id = RQ_QOS_LATENCY;
 	rqos->ops = &blkcg_iolatency_ops;
 	rqos->q = q;

···
 }


-static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp, int node)
+static struct blkg_policy_data *iolatency_pd_alloc(gfp_t gfp,
+						   struct request_queue *q,
+						   struct blkcg *blkcg)
 {
 	struct iolatency_grp *iolat;

-	iolat = kzalloc_node(sizeof(*iolat), gfp, node);
+	iolat = kzalloc_node(sizeof(*iolat), gfp, q->node);
 	if (!iolat)
 		return NULL;
 	iolat->stats = __alloc_percpu_gfp(sizeof(struct latency_stat),

+102 -49

block/blk-merge.c

··· 132 132 return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs); 133 133 } 134 134 135 + /* 136 + * Return the maximum number of sectors from the start of a bio that may be 137 + * submitted as a single request to a block device. If enough sectors remain, 138 + * align the end to the physical block size. Otherwise align the end to the 139 + * logical block size. This approach minimizes the number of non-aligned 140 + * requests that are submitted to a block device if the start of a bio is not 141 + * aligned to a physical block boundary. 142 + */ 135 143 static inline unsigned get_max_io_size(struct request_queue *q, 136 144 struct bio *bio) 137 145 { 138 146 unsigned sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector); 139 - unsigned mask = queue_logical_block_size(q) - 1; 147 + unsigned max_sectors = sectors; 148 + unsigned pbs = queue_physical_block_size(q) >> SECTOR_SHIFT; 149 + unsigned lbs = queue_logical_block_size(q) >> SECTOR_SHIFT; 150 + unsigned start_offset = bio->bi_iter.bi_sector & (pbs - 1); 140 151 141 - /* aligned to logical block size */ 142 - sectors &= ~(mask >> 9); 152 + max_sectors += start_offset; 153 + max_sectors &= ~(pbs - 1); 154 + if (max_sectors > start_offset) 155 + return max_sectors - start_offset; 143 156 144 - return sectors; 157 + return sectors & (lbs - 1); 145 158 } 146 159 147 - static unsigned get_max_segment_size(struct request_queue *q, 160 + static unsigned get_max_segment_size(const struct request_queue *q, 148 161 unsigned offset) 149 162 { 150 163 unsigned long mask = queue_segment_boundary(q); ··· 170 157 queue_max_segment_size(q)); 171 158 } 172 159 173 - /* 174 - * Split the bvec @bv into segments, and update all kinds of 175 - * variables. 
160 + /** 161 + * bvec_split_segs - verify whether or not a bvec should be split in the middle 162 + * @q: [in] request queue associated with the bio associated with @bv 163 + * @bv: [in] bvec to examine 164 + * @nsegs: [in,out] Number of segments in the bio being built. Incremented 165 + * by the number of segments from @bv that may be appended to that 166 + * bio without exceeding @max_segs 167 + * @sectors: [in,out] Number of sectors in the bio being built. Incremented 168 + * by the number of sectors from @bv that may be appended to that 169 + * bio without exceeding @max_sectors 170 + * @max_segs: [in] upper bound for *@nsegs 171 + * @max_sectors: [in] upper bound for *@sectors 172 + * 173 + * When splitting a bio, it can happen that a bvec is encountered that is too 174 + * big to fit in a single segment and hence that it has to be split in the 175 + * middle. This function verifies whether or not that should happen. The value 176 + * %true is returned if and only if appending the entire @bv to a bio with 177 + * *@nsegs segments and *@sectors sectors would make that bio unacceptable for 178 + * the block driver. 176 179 */ 177 - static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv, 178 - unsigned *nsegs, unsigned *sectors, unsigned max_segs) 180 + static bool bvec_split_segs(const struct request_queue *q, 181 + const struct bio_vec *bv, unsigned *nsegs, 182 + unsigned *sectors, unsigned max_segs, 183 + unsigned max_sectors) 179 184 { 180 - unsigned len = bv->bv_len; 185 + unsigned max_len = (min(max_sectors, UINT_MAX >> 9) - *sectors) << 9; 186 + unsigned len = min(bv->bv_len, max_len); 181 187 unsigned total_len = 0; 182 - unsigned new_nsegs = 0, seg_size = 0; 188 + unsigned seg_size = 0; 183 189 184 - /* 185 - * Multi-page bvec may be too big to hold in one segment, so the 186 - * current bvec has to be splitted as multiple segments. 
187 - */ 188 - while (len && new_nsegs + *nsegs < max_segs) { 190 + while (len && *nsegs < max_segs) { 189 191 seg_size = get_max_segment_size(q, bv->bv_offset + total_len); 190 192 seg_size = min(seg_size, len); 191 193 192 - new_nsegs++; 194 + (*nsegs)++; 193 195 total_len += seg_size; 194 196 len -= seg_size; 195 197 ··· 212 184 break; 213 185 } 214 186 215 - if (new_nsegs) { 216 - *nsegs += new_nsegs; 217 - if (sectors) 218 - *sectors += total_len >> 9; 219 - } 187 + *sectors += total_len >> 9; 220 188 221 - /* split in the middle of the bvec if len != 0 */ 222 - return !!len; 189 + /* tell the caller to split the bvec if it is too big to fit */ 190 + return len > 0 || bv->bv_len > max_len; 223 191 } 224 192 193 + /** 194 + * blk_bio_segment_split - split a bio in two bios 195 + * @q: [in] request queue pointer 196 + * @bio: [in] bio to be split 197 + * @bs: [in] bio set to allocate the clone from 198 + * @segs: [out] number of segments in the bio with the first half of the sectors 199 + * 200 + * Clone @bio, update the bi_iter of the clone to represent the first sectors 201 + * of @bio and update @bio->bi_iter to represent the remaining sectors. The 202 + * following is guaranteed for the cloned bio: 203 + * - That it has at most get_max_io_size(@q, @bio) sectors. 204 + * - That it has at most queue_max_segments(@q) segments. 205 + * 206 + * Except for discard requests the cloned bio will point at the bi_io_vec of 207 + * the original bio. It is the responsibility of the caller to ensure that the 208 + * original bio is not freed before the cloned bio. The caller is also 209 + * responsible for ensuring that @bs is only destroyed after processing of the 210 + * split bio has finished. 
211 + */ 225 212 static struct bio *blk_bio_segment_split(struct request_queue *q, 226 213 struct bio *bio, 227 214 struct bio_set *bs, ··· 256 213 if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset)) 257 214 goto split; 258 215 259 - if (sectors + (bv.bv_len >> 9) > max_sectors) { 260 - /* 261 - * Consider this a new segment if we're splitting in 262 - * the middle of this vector. 263 - */ 264 - if (nsegs < max_segs && 265 - sectors < max_sectors) { 266 - /* split in the middle of bvec */ 267 - bv.bv_len = (max_sectors - sectors) << 9; 268 - bvec_split_segs(q, &bv, &nsegs, 269 - &sectors, max_segs); 270 - } 216 + if (nsegs < max_segs && 217 + sectors + (bv.bv_len >> 9) <= max_sectors && 218 + bv.bv_offset + bv.bv_len <= PAGE_SIZE) { 219 + nsegs++; 220 + sectors += bv.bv_len >> 9; 221 + } else if (bvec_split_segs(q, &bv, &nsegs, &sectors, max_segs, 222 + max_sectors)) { 271 223 goto split; 272 224 } 273 - 274 - if (nsegs == max_segs) 275 - goto split; 276 225 277 226 bvprv = bv; 278 227 bvprvp = &bvprv; 279 - 280 - if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) { 281 - nsegs++; 282 - sectors += bv.bv_len >> 9; 283 - } else if (bvec_split_segs(q, &bv, &nsegs, &sectors, 284 - max_segs)) { 285 - goto split; 286 - } 287 228 } 288 229 289 230 *segs = nsegs; ··· 277 250 return bio_split(bio, sectors, GFP_NOIO, bs); 278 251 } 279 252 253 + /** 254 + * __blk_queue_split - split a bio and submit the second half 255 + * @q: [in] request queue pointer 256 + * @bio: [in, out] bio to be split 257 + * @nr_segs: [out] number of segments in the first bio 258 + * 259 + * Split a bio into two bios, chain the two bios, submit the second half and 260 + * store a pointer to the first half in *@bio. If the second bio is still too 261 + * big it will be split by a recursive call to this function. 
Since this 262 + * function may allocate a new bio from @q->bio_split, it is the responsibility 263 + * of the caller to ensure that @q is only released after processing of the 264 + * split bio has finished. 265 + */ 280 266 void __blk_queue_split(struct request_queue *q, struct bio **bio, 281 267 unsigned int *nr_segs) 282 268 { ··· 334 294 } 335 295 } 336 296 297 + /** 298 + * blk_queue_split - split a bio and submit the second half 299 + * @q: [in] request queue pointer 300 + * @bio: [in, out] bio to be split 301 + * 302 + * Split a bio into two bios, chains the two bios, submit the second half and 303 + * store a pointer to the first half in *@bio. Since this function may allocate 304 + * a new bio from @q->bio_split, it is the responsibility of the caller to 305 + * ensure that @q is only released after processing of the split bio has 306 + * finished. 307 + */ 337 308 void blk_queue_split(struct request_queue *q, struct bio **bio) 338 309 { 339 310 unsigned int nr_segs; ··· 356 305 unsigned int blk_recalc_rq_segments(struct request *rq) 357 306 { 358 307 unsigned int nr_phys_segs = 0; 308 + unsigned int nr_sectors = 0; 359 309 struct req_iterator iter; 360 310 struct bio_vec bv; 361 311 ··· 373 321 } 374 322 375 323 rq_for_each_bvec(bv, rq, iter) 376 - bvec_split_segs(rq->q, &bv, &nr_phys_segs, NULL, UINT_MAX); 324 + bvec_split_segs(rq->q, &bv, &nr_phys_segs, &nr_sectors, 325 + UINT_MAX, UINT_MAX); 377 326 return nr_phys_segs; 378 327 } 379 328
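The alignment rule described in the new get_max_io_size() comment can be tried out in user space. This is a sketch under stated assumptions: all sizes are in 512-byte sectors, pbs and lbs are powers of two, and the `limit` parameter stands in for blk_max_size_offset(q, sector). Note one deliberate difference: the hunk as shown ends with `return sectors & (lbs - 1);`, whereas the sketch rounds down with `~(lbs - 1)`, which is my reading of "align the end to the logical block size" rather than something the diff confirms.

```c
#include <assert.h>

/*
 * Model of get_max_io_size(): how many sectors, starting at start_sector,
 * may go into one request so that the request ends on a physical block
 * boundary when possible, and on a logical block boundary otherwise.
 */
static unsigned max_io_sectors(unsigned long long start_sector, unsigned limit,
			       unsigned pbs, unsigned lbs)
{
	unsigned start_offset = (unsigned)(start_sector & (pbs - 1));
	unsigned max_sectors = limit + start_offset;

	/* round the end of the I/O down to a physical block boundary */
	max_sectors &= ~(pbs - 1);
	if (max_sectors > start_offset)
		return max_sectors - start_offset;

	/* too short to reach a physical boundary: fall back to logical */
	return limit & ~(lbs - 1);
}
```

For a device with 4 KiB physical blocks (pbs = 8) and 512-byte logical blocks (lbs = 1), a bio starting at sector 3 with a 256-sector limit yields 253 sectors, so the request ends exactly at sector 256.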

+22 -7

block/blk-mq-cpumap.c

···
 #include "blk.h"
 #include "blk-mq.h"

-static int cpu_to_queue_index(struct blk_mq_queue_map *qmap,
-			      unsigned int nr_queues, const int cpu)
+static int queue_index(struct blk_mq_queue_map *qmap,
+		       unsigned int nr_queues, const int q)
 {
-	return qmap->queue_offset + (cpu % nr_queues);
+	return qmap->queue_offset + (q % nr_queues);
 }

 static int get_first_sibling(unsigned int cpu)
···
 {
 	unsigned int *map = qmap->mq_map;
 	unsigned int nr_queues = qmap->nr_queues;
-	unsigned int cpu, first_sibling;
+	unsigned int cpu, first_sibling, q = 0;
+
+	for_each_possible_cpu(cpu)
+		map[cpu] = -1;
+
+	/*
+	 * Spread queues among present CPUs first for minimizing
+	 * count of dead queues which are mapped by all un-present CPUs
+	 */
+	for_each_present_cpu(cpu) {
+		if (q >= nr_queues)
+			break;
+		map[cpu] = queue_index(qmap, nr_queues, q++);
+	}

 	for_each_possible_cpu(cpu) {
+		if (map[cpu] != -1)
+			continue;
 		/*
 		 * First do sequential mapping between CPUs and queues.
 		 * In case we still have CPUs to map, and we have some number of
 		 * threads per cores then map sibling threads to the same queue
 		 * for performance optimizations.
 		 */
-		if (cpu < nr_queues) {
-			map[cpu] = cpu_to_queue_index(qmap, nr_queues, cpu);
+		if (q < nr_queues) {
+			map[cpu] = queue_index(qmap, nr_queues, q++);
 		} else {
 			first_sibling = get_first_sibling(cpu);
 			if (first_sibling == cpu)
-				map[cpu] = cpu_to_queue_index(qmap, nr_queues, cpu);
+				map[cpu] = queue_index(qmap, nr_queues, q++);
 			else
 				map[cpu] = map[first_sibling];
 		}
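The effect of the two-pass mapping in the blk-mq-cpumap.c hunk is easiest to see on a small machine. The sketch below is a user-space model, not kernel code: the present[] and first_sibling[] arrays are hypothetical inputs replacing for_each_present_cpu() and get_first_sibling(), queue_offset is taken as 0, and first_sibling[cpu] <= cpu is assumed so that a sibling is always mapped before it is read.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_QUEUE (-1)

/* Model of the mapping loop: present CPUs get queues first. */
static void map_queues(int *map, const bool *present, const int *first_sibling,
		       int nr_cpus, int nr_queues)
{
	int cpu, q = 0;

	assert(nr_queues > 0);

	for (cpu = 0; cpu < nr_cpus; cpu++)
		map[cpu] = NO_QUEUE;

	/* pass 1: hand out queues to present CPUs until queues run out */
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (!present[cpu])
			continue;
		if (q >= nr_queues)
			break;
		map[cpu] = q++ % nr_queues;
	}

	/* pass 2: every CPU that is still unmapped */
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (map[cpu] != NO_QUEUE)
			continue;
		if (q < nr_queues || first_sibling[cpu] == cpu)
			map[cpu] = q++ % nr_queues;
		else
			map[cpu] = map[first_sibling[cpu]];
	}
}
```

With four possible CPUs, two queues and only CPUs 0 and 2 present, the model yields the map {0, 0, 1, 1}, so each queue is served by one present CPU. The pre-patch rule `cpu < nr_queues` would have given queue 1 to CPUs 1 and 3 only, neither of which is present.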

+6 -17

block/blk-mq-sysfs.c

···
 	struct blk_mq_hw_ctx *hctx;
 	int i;

-	lockdep_assert_held(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_dir_lock);

 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_unregister_hctx(hctx);
···
 	int ret, i;

 	WARN_ON_ONCE(!q->kobj.parent);
-	lockdep_assert_held(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_dir_lock);

 	ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
 	if (ret < 0)
···
 	return ret;
 }

-int blk_mq_register_dev(struct device *dev, struct request_queue *q)
-{
-	int ret;
-
-	mutex_lock(&q->sysfs_lock);
-	ret = __blk_mq_register_dev(dev, q);
-	mutex_unlock(&q->sysfs_lock);
-
-	return ret;
-}
-
 void blk_mq_sysfs_unregister(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	int i;

-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->sysfs_dir_lock);
 	if (!q->mq_sysfs_init_done)
 		goto unlock;

···
 		blk_mq_unregister_hctx(hctx);

 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);
 }

 int blk_mq_sysfs_register(struct request_queue *q)
···
 	struct blk_mq_hw_ctx *hctx;
 	int i, ret = 0;

-	mutex_lock(&q->sysfs_lock);
+	mutex_lock(&q->sysfs_dir_lock);
 	if (!q->mq_sysfs_init_done)
 		goto unlock;

···
 	}

 unlock:
-	mutex_unlock(&q->sysfs_lock);
+	mutex_unlock(&q->sysfs_dir_lock);

 	return ret;
 }

+32

block/blk-mq-tag.c

···
 #include <linux/module.h>

 #include <linux/blk-mq.h>
+#include <linux/delay.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
···
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
+
+static bool blk_mq_tagset_count_completed_rqs(struct request *rq,
+		void *data, bool reserved)
+{
+	unsigned *count = data;
+
+	if (blk_mq_request_completed(rq))
+		(*count)++;
+	return true;
+}
+
+/**
+ * blk_mq_tagset_wait_completed_request - wait until all completed req's
+ * complete funtion is run
+ * @tagset:	Tag set to drain completed request
+ *
+ * Note: This function has to be run after all IO queues are shutdown
+ */
+void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
+{
+	while (true) {
+		unsigned count = 0;
+
+		blk_mq_tagset_busy_iter(tagset,
+				blk_mq_tagset_count_completed_rqs, &count);
+		if (!count)
+			break;
+		msleep(5);
+	}
+}
+EXPORT_SYMBOL(blk_mq_tagset_wait_completed_request);

 /**
  * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag

+35 -34

block/blk-mq.c

··· 44 44 45 45 static int blk_mq_poll_stats_bkt(const struct request *rq) 46 46 { 47 - int ddir, bytes, bucket; 47 + int ddir, sectors, bucket; 48 48 49 49 ddir = rq_data_dir(rq); 50 - bytes = blk_rq_bytes(rq); 50 + sectors = blk_rq_stats_sectors(rq); 51 51 52 - bucket = ddir + 2*(ilog2(bytes) - 9); 52 + bucket = ddir + 2 * ilog2(sectors); 53 53 54 54 if (bucket < 0) 55 55 return -1; ··· 282 282 EXPORT_SYMBOL(blk_mq_can_queue); 283 283 284 284 /* 285 - * Only need start/end time stamping if we have stats enabled, or using 286 - * an IO scheduler. 285 + * Only need start/end time stamping if we have iostat or 286 + * blk stats enabled, or using an IO scheduler. 287 287 */ 288 288 static inline bool blk_mq_need_time_stamp(struct request *rq) 289 289 { 290 - return (rq->rq_flags & RQF_IO_STAT) || rq->q->elevator; 290 + return (rq->rq_flags & (RQF_IO_STAT | RQF_STATS)) || rq->q->elevator; 291 291 } 292 292 293 293 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, 294 - unsigned int tag, unsigned int op) 294 + unsigned int tag, unsigned int op, u64 alloc_time_ns) 295 295 { 296 296 struct blk_mq_tags *tags = blk_mq_tags_from_data(data); 297 297 struct request *rq = tags->static_rqs[tag]; ··· 325 325 RB_CLEAR_NODE(&rq->rb_node); 326 326 rq->rq_disk = NULL; 327 327 rq->part = NULL; 328 + #ifdef CONFIG_BLK_RQ_ALLOC_TIME 329 + rq->alloc_time_ns = alloc_time_ns; 330 + #endif 328 331 if (blk_mq_need_time_stamp(rq)) 329 332 rq->start_time_ns = ktime_get_ns(); 330 333 else 331 334 rq->start_time_ns = 0; 332 335 rq->io_start_time_ns = 0; 336 + rq->stats_sectors = 0; 333 337 rq->nr_phys_segments = 0; 334 338 #if defined(CONFIG_BLK_DEV_INTEGRITY) 335 339 rq->nr_integrity_segments = 0; ··· 360 356 struct request *rq; 361 357 unsigned int tag; 362 358 bool clear_ctx_on_error = false; 359 + u64 alloc_time_ns = 0; 363 360 364 361 blk_queue_enter_live(q); 362 + 363 + /* alloc_time includes depth and tag waits */ 364 + if (blk_queue_rq_alloc_time(q)) 365 + 
alloc_time_ns = ktime_get_ns(); 366 + 365 367 data->q = q; 366 368 if (likely(!data->ctx)) { 367 369 data->ctx = blk_mq_get_ctx(q); ··· 403 393 return NULL; 404 394 } 405 395 406 - rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags); 396 + rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns); 407 397 if (!op_is_flush(data->cmd_flags)) { 408 398 rq->elv.icq = NULL; 409 399 if (e && e->type->ops.prepare_request) { ··· 662 652 } 663 653 EXPORT_SYMBOL(blk_mq_complete_request); 664 654 665 - void blk_mq_complete_request_sync(struct request *rq) 666 - { 667 - WRITE_ONCE(rq->state, MQ_RQ_COMPLETE); 668 - rq->q->mq_ops->complete(rq); 669 - } 670 - EXPORT_SYMBOL_GPL(blk_mq_complete_request_sync); 671 - 672 655 int blk_mq_request_started(struct request *rq) 673 656 { 674 657 return blk_mq_rq_state(rq) != MQ_RQ_IDLE; 675 658 } 676 659 EXPORT_SYMBOL_GPL(blk_mq_request_started); 660 + 661 + int blk_mq_request_completed(struct request *rq) 662 + { 663 + return blk_mq_rq_state(rq) == MQ_RQ_COMPLETE; 664 + } 665 + EXPORT_SYMBOL_GPL(blk_mq_request_completed); 677 666 678 667 void blk_mq_start_request(struct request *rq) 679 668 { ··· 682 673 683 674 if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) { 684 675 rq->io_start_time_ns = ktime_get_ns(); 685 - #ifdef CONFIG_BLK_DEV_THROTTLING_LOW 686 - rq->throtl_size = blk_rq_sectors(rq); 687 - #endif 676 + rq->stats_sectors = blk_rq_sectors(rq); 688 677 rq->rq_flags |= RQF_STATS; 689 678 rq_qos_issue(q, rq); 690 679 } ··· 2460 2453 struct blk_mq_ctx *ctx; 2461 2454 struct blk_mq_tag_set *set = q->tag_set; 2462 2455 2463 - /* 2464 - * Avoid others reading imcomplete hctx->cpumask through sysfs 2465 - */ 2466 - mutex_lock(&q->sysfs_lock); 2467 - 2468 2456 queue_for_each_hw_ctx(q, hctx, i) { 2469 2457 cpumask_clear(hctx->cpumask); 2470 2458 hctx->nr_ctx = 0; ··· 2519 2517 ctx->hctxs[j] = blk_mq_map_queue_type(q, 2520 2518 HCTX_TYPE_DEFAULT, i); 2521 2519 } 2522 - 2523 - mutex_unlock(&q->sysfs_lock); 2524 2520 2525 2521 
queue_for_each_hw_ctx(q, hctx, i) { 2526 2522 /* ··· 2688 2688 if (!uninit_q) 2689 2689 return ERR_PTR(-ENOMEM); 2690 2690 2691 - q = blk_mq_init_allocated_queue(set, uninit_q); 2691 + /* 2692 + * Initialize the queue without an elevator. device_add_disk() will do 2693 + * the initialization. 2694 + */ 2695 + q = blk_mq_init_allocated_queue(set, uninit_q, false); 2692 2696 if (IS_ERR(q)) 2693 2697 blk_cleanup_queue(uninit_q); 2694 2698 ··· 2843 2839 } 2844 2840 2845 2841 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, 2846 - struct request_queue *q) 2842 + struct request_queue *q, 2843 + bool elevator_init) 2847 2844 { 2848 2845 /* mark the queue as mq asap */ 2849 2846 q->mq_ops = set->ops; ··· 2906 2901 blk_mq_add_queue_tag_set(set, q); 2907 2902 blk_mq_map_swqueue(q); 2908 2903 2909 - if (!(set->flags & BLK_MQ_F_NO_SCHED)) { 2910 - int ret; 2911 - 2912 - ret = elevator_init_mq(q); 2913 - if (ret) 2914 - return ERR_PTR(ret); 2915 - } 2904 + if (elevator_init) 2905 + elevator_init_mq(q); 2916 2906 2917 2907 return q; 2918 2908 2919 2909 err_hctxs: 2920 2910 kfree(q->queue_hw_ctx); 2911 + q->nr_hw_queues = 0; 2921 2912 err_sys_init: 2922 2913 blk_mq_sysfs_deinit(q); 2923 2914 err_poll:
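The blk_mq_poll_stats_bkt() change at the top of this file swaps a byte count for a sector count, and the two bucket formulas agree whenever bytes equals sectors << 9. A small user-space check of that claim, assuming ddir is 0 for reads and 1 for writes and using ilog2_u32() as a plain-C stand-in for the kernel's ilog2():

```c
#include <assert.h>

/* floor(log2(v)) for v > 0; returns -1 for v == 0 in this sketch. */
static int ilog2_u32(unsigned int v)
{
	int log = -1;

	while (v) {
		v >>= 1;
		log++;
	}
	return log;
}

/* Old formula: bucket derived from the request size in bytes. */
static int bucket_from_bytes(int ddir, unsigned int bytes)
{
	return ddir + 2 * (ilog2_u32(bytes) - 9);
}

/* New formula: bucket derived from the request size in 512-byte sectors. */
static int bucket_from_sectors(int ddir, unsigned int sectors)
{
	return ddir + 2 * ilog2_u32(sectors);
}
```

A 4 KiB write (8 sectors) lands in bucket 7 under both formulas. In this sketch a zero-sector request produces a negative bucket, which corresponds to the `bucket < 0` check kept in the hunk.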

+7 -5

block/blk-pm.c

···
  */
 void blk_set_runtime_active(struct request_queue *q)
 {
-	spin_lock_irq(&q->queue_lock);
-	q->rpm_status = RPM_ACTIVE;
-	pm_runtime_mark_last_busy(q->dev);
-	pm_request_autosuspend(q->dev);
-	spin_unlock_irq(&q->queue_lock);
+	if (q->dev) {
+		spin_lock_irq(&q->queue_lock);
+		q->rpm_status = RPM_ACTIVE;
+		pm_runtime_mark_last_busy(q->dev);
+		pm_request_autosuspend(q->dev);
+		spin_unlock_irq(&q->queue_lock);
+	}
 }
 EXPORT_SYMBOL(blk_set_runtime_active);

+18

block/blk-rq-qos.c

···
 	} while (rqos);
 }

+void __rq_qos_merge(struct rq_qos *rqos, struct request *rq, struct bio *bio)
+{
+	do {
+		if (rqos->ops->merge)
+			rqos->ops->merge(rqos, rq, bio);
+		rqos = rqos->next;
+	} while (rqos);
+}
+
 void __rq_qos_done_bio(struct rq_qos *rqos, struct bio *bio)
 {
 	do {
 		if (rqos->ops->done_bio)
 			rqos->ops->done_bio(rqos, bio);
+		rqos = rqos->next;
+	} while (rqos);
+}
+
+void __rq_qos_queue_depth_changed(struct rq_qos *rqos)
+{
+	do {
+		if (rqos->ops->queue_depth_changed)
+			rqos->ops->queue_depth_changed(rqos);
 		rqos = rqos->next;
 	} while (rqos);
 }
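
The helpers in this file share one shape: walk the singly linked chain of rq_qos policies and call a hook only on the policies that provide it. The following is a minimal userspace sketch of that pattern, not kernel code. The names `policy`, `policy_ops`, `notified` and the two `ops_*` tables are hypothetical stand-ins introduced only for illustration.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins for struct rq_qos_ops / struct rq_qos. */
struct policy;

struct policy_ops {
	/* Optional hook: NULL when a policy does not care about depth. */
	void (*queue_depth_changed)(struct policy *p);
};

struct policy {
	const struct policy_ops *ops;
	struct policy *next;	/* next policy in the chain, or NULL */
	int notified;		/* times the hook ran (illustration only) */
};

static void count_notification(struct policy *p)
{
	p->notified++;
}

static const struct policy_ops ops_with_hook = {
	.queue_depth_changed = count_notification,
};

static const struct policy_ops ops_without_hook = {
	.queue_depth_changed = NULL,
};

/* Same do/while walk as the diff: the caller guarantees head != NULL. */
static void chain_queue_depth_changed(struct policy *head)
{
	do {
		if (head->ops->queue_depth_changed)
			head->ops->queue_depth_changed(head);
		head = head->next;
	} while (head);
}

/* Same role as the inline wrapper: skip the walk when no policy exists. */
static void notify_queue_depth_changed(struct policy *head)
{
	if (head)
		chain_queue_depth_changed(head);
}
```

Because the hook is optional, a caller can notify every attached policy without knowing which ones react to the event.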

+24 -4

block/blk-rq-qos.h

···

 enum rq_qos_id {
 	RQ_QOS_WBT,
-	RQ_QOS_CGROUP,
+	RQ_QOS_LATENCY,
+	RQ_QOS_COST,
 };

 struct rq_wait {
···
 struct rq_qos_ops {
 	void (*throttle)(struct rq_qos *, struct bio *);
 	void (*track)(struct rq_qos *, struct request *, struct bio *);
+	void (*merge)(struct rq_qos *, struct request *, struct bio *);
 	void (*issue)(struct rq_qos *, struct request *);
 	void (*requeue)(struct rq_qos *, struct request *);
 	void (*done)(struct rq_qos *, struct request *);
 	void (*done_bio)(struct rq_qos *, struct bio *);
 	void (*cleanup)(struct rq_qos *, struct bio *);
+	void (*queue_depth_changed)(struct rq_qos *);
 	void (*exit)(struct rq_qos *);
 	const struct blk_mq_debugfs_attr *debugfs_attrs;
 };
···

 static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
 {
-	return rq_qos_id(q, RQ_QOS_CGROUP);
+	return rq_qos_id(q, RQ_QOS_LATENCY);
 }

 static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
···
 	switch (id) {
 	case RQ_QOS_WBT:
 		return "wbt";
-	case RQ_QOS_CGROUP:
-		return "cgroup";
+	case RQ_QOS_LATENCY:
+		return "latency";
+	case RQ_QOS_COST:
+		return "cost";
 	}
 	return "unknown";
 }
···
 void __rq_qos_requeue(struct rq_qos *rqos, struct request *rq);
 void __rq_qos_throttle(struct rq_qos *rqos, struct bio *bio);
 void __rq_qos_track(struct rq_qos *rqos, struct request *rq, struct bio *bio);
+void __rq_qos_merge(struct rq_qos *rqos, struct request *rq, struct bio *bio);
 void __rq_qos_done_bio(struct rq_qos *rqos, struct bio *bio);
+void __rq_qos_queue_depth_changed(struct rq_qos *rqos);

 static inline void rq_qos_cleanup(struct request_queue *q, struct bio *bio)
 {
···
 {
 	if (q->rq_qos)
 		__rq_qos_track(q->rq_qos, rq, bio);
+}
+
+static inline void rq_qos_merge(struct request_queue *q, struct request *rq,
+				struct bio *bio)
+{
+	if (q->rq_qos)
+		__rq_qos_merge(q->rq_qos, rq, bio);
+}
+
+static inline void rq_qos_queue_depth_changed(struct request_queue *q)
+{
+	if (q->rq_qos)
+		__rq_qos_queue_depth_changed(q->rq_qos);
 }

 void rq_qos_exit(struct request_queue *);
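
The header change splits the former "cgroup" policy slot into separate "latency" and "cost" slots and extends the id-to-name mapping to match. A small userspace sketch of that mapping follows. The `qos_id` enum and `qos_id_to_name()` are hypothetical names that only mirror the shape of the kernel's `rq_qos_id` and `rq_qos_id_to_name()`.

```c
/* Hypothetical mirror of enum rq_qos_id after this change. */
enum qos_id {
	QOS_WBT,
	QOS_LATENCY,	/* previously the single "cgroup" slot */
	QOS_COST,	/* new slot for the cost-based controller */
};

/* Map a policy id to a short name; unknown ids fall through. */
static const char *qos_id_to_name(enum qos_id id)
{
	switch (id) {
	case QOS_WBT:
		return "wbt";
	case QOS_LATENCY:
		return "latency";
	case QOS_COST:
		return "cost";
	}
	return "unknown";
}
```

Keeping the fallback outside the `switch` lets the compiler warn about unhandled enumerators while still returning a defined string for out-of-range values.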

+17 -1

block/blk-settings.c

···
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
-	wbt_set_queue_depth(q, depth);
+	rq_qos_queue_depth_changed(q);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);

···
 	wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
+
+/**
+ * blk_queue_required_elevator_features - Set a queue required elevator features
+ * @q: the request queue for the target device
+ * @features: Required elevator features OR'ed together
+ *
+ * Tell the block layer that for the device controlled through @q, only the
+ * only elevators that can be used are those that implement at least the set of
+ * features specified by @features.
+ */
+void blk_queue_required_elevator_features(struct request_queue *q,
+					  unsigned int features)
+{
+	q->required_elevator_features = features;
+}
+EXPORT_SYMBOL_GPL(blk_queue_required_elevator_features);

 static int __init blk_settings_init(void)
 {

+31 -19

block/blk-sysfs.c

··· 941 941 int ret; 942 942 struct device *dev = disk_to_dev(disk); 943 943 struct request_queue *q = disk->queue; 944 + bool has_elevator = false; 944 945 945 946 if (WARN_ON(!q)) 946 947 return -ENXIO; 947 948 948 - WARN_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags), 949 + WARN_ONCE(blk_queue_registered(q), 949 950 "%s is registering an already registered queue\n", 950 951 kobject_name(&dev->kobj)); 951 - blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); 952 952 953 953 /* 954 954 * SCSI probing may synchronously create and destroy a lot of ··· 968 968 if (ret) 969 969 return ret; 970 970 971 - /* Prevent changes through sysfs until registration is completed. */ 972 - mutex_lock(&q->sysfs_lock); 971 + mutex_lock(&q->sysfs_dir_lock); 973 972 974 973 ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue"); 975 974 if (ret < 0) { ··· 989 990 blk_mq_debugfs_register(q); 990 991 } 991 992 992 - kobject_uevent(&q->kobj, KOBJ_ADD); 993 - 994 - wbt_enable_default(q); 995 - 996 - blk_throtl_register_queue(q); 997 - 993 + /* 994 + * The flag of QUEUE_FLAG_REGISTERED isn't set yet, so elevator 995 + * switch won't happen at all. 
996 + */ 998 997 if (q->elevator) { 999 - ret = elv_register_queue(q); 998 + ret = elv_register_queue(q, false); 1000 999 if (ret) { 1001 - mutex_unlock(&q->sysfs_lock); 1002 - kobject_uevent(&q->kobj, KOBJ_REMOVE); 1000 + mutex_unlock(&q->sysfs_dir_lock); 1003 1001 kobject_del(&q->kobj); 1004 1002 blk_trace_remove_sysfs(dev); 1005 1003 kobject_put(&dev->kobj); 1006 1004 return ret; 1007 1005 } 1006 + has_elevator = true; 1008 1007 } 1008 + 1009 + mutex_lock(&q->sysfs_lock); 1010 + blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); 1011 + wbt_enable_default(q); 1012 + blk_throtl_register_queue(q); 1013 + 1014 + /* Now everything is ready and send out KOBJ_ADD uevent */ 1015 + kobject_uevent(&q->kobj, KOBJ_ADD); 1016 + if (has_elevator) 1017 + kobject_uevent(&q->elevator->kobj, KOBJ_ADD); 1018 + mutex_unlock(&q->sysfs_lock); 1019 + 1009 1020 ret = 0; 1010 1021 unlock: 1011 - mutex_unlock(&q->sysfs_lock); 1022 + mutex_unlock(&q->sysfs_dir_lock); 1012 1023 return ret; 1013 1024 } 1014 1025 EXPORT_SYMBOL_GPL(blk_register_queue); ··· 1038 1029 return; 1039 1030 1040 1031 /* Return early if disk->queue was never registered. */ 1041 - if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) 1032 + if (!blk_queue_registered(q)) 1042 1033 return; 1043 1034 1044 1035 /* ··· 1047 1038 * concurrent elv_iosched_store() calls. 1048 1039 */ 1049 1040 mutex_lock(&q->sysfs_lock); 1050 - 1051 1041 blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q); 1042 + mutex_unlock(&q->sysfs_lock); 1052 1043 1044 + mutex_lock(&q->sysfs_dir_lock); 1053 1045 /* 1054 1046 * Remove the sysfs attributes before unregistering the queue data 1055 1047 * structures that can be modified through sysfs. 
1056 1048 */ 1057 1049 if (queue_is_mq(q)) 1058 1050 blk_mq_unregister_dev(disk_to_dev(disk), q); 1059 - mutex_unlock(&q->sysfs_lock); 1060 1051 1061 1052 kobject_uevent(&q->kobj, KOBJ_REMOVE); 1062 1053 kobject_del(&q->kobj); 1063 1054 blk_trace_remove_sysfs(disk_to_dev(disk)); 1064 1055 1065 - mutex_lock(&q->sysfs_lock); 1056 + /* 1057 + * q->kobj has been removed, so it is safe to check if elevator 1058 + * exists without holding q->sysfs_lock. 1059 + */ 1066 1060 if (q->elevator) 1067 1061 elv_unregister_queue(q); 1068 - mutex_unlock(&q->sysfs_lock); 1062 + mutex_unlock(&q->sysfs_dir_lock); 1069 1063 1070 1064 kobject_put(&disk_to_dev(disk)->kobj); 1071 1065 }

+6 -3

block/blk-throttle.c

···
 	timer_setup(&sq->pending_timer, throtl_pending_timer_fn, 0);
 }

-static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
+static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp,
+						struct request_queue *q,
+						struct blkcg *blkcg)
 {
 	struct throtl_grp *tg;
 	int rw;

-	tg = kzalloc_node(sizeof(*tg), gfp, node);
+	tg = kzalloc_node(sizeof(*tg), gfp, q->node);
 	if (!tg)
 		return NULL;

···
 	struct request_queue *q = rq->q;
 	struct throtl_data *td = q->td;

-	throtl_track_latency(td, rq->throtl_size, req_op(rq), time_ns >> 10);
+	throtl_track_latency(td, blk_rq_stats_sectors(rq), req_op(rq),
+			     time_ns >> 10);
 }

 void blk_throtl_bio_endio(struct bio *bio)

+9 -11

block/blk-wbt.c

···
 	}
 }

-void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
-{
-	struct rq_qos *rqos = wbt_rq_qos(q);
-	if (rqos) {
-		RQWB(rqos)->rq_depth.queue_depth = depth;
-		__wbt_update_limits(RQWB(rqos));
-	}
-}
-
 void wbt_set_write_cache(struct request_queue *q, bool write_cache_on)
 {
 	struct rq_qos *rqos = wbt_rq_qos(q);
···
 		return;

 	/* Queue not registered? Maybe shutting down... */
-	if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
+	if (!blk_queue_registered(q))
 		return;

 	if (queue_is_mq(q) && IS_ENABLED(CONFIG_BLK_WBT_MQ))
···

 	/* don't account */
 	return -1;
+}
+
+static void wbt_queue_depth_changed(struct rq_qos *rqos)
+{
+	RQWB(rqos)->rq_depth.queue_depth = blk_queue_depth(rqos->q);
+	__wbt_update_limits(RQWB(rqos));
 }

 static void wbt_exit(struct rq_qos *rqos)
···
 	.requeue = wbt_requeue,
 	.done = wbt_done,
 	.cleanup = wbt_cleanup,
+	.queue_depth_changed = wbt_queue_depth_changed,
 	.exit = wbt_exit,
 #ifdef CONFIG_BLK_DEBUG_FS
 	.debugfs_attrs = wbt_debugfs_attrs,
···

 	rwb->min_lat_nsec = wbt_default_latency_nsec(q);

-	wbt_set_queue_depth(q, blk_queue_depth(q));
+	wbt_queue_depth_changed(&rwb->rqos);
 	wbt_set_write_cache(q, test_bit(QUEUE_FLAG_WC, &q->queue_flags));

 	return 0;

-4

block/blk-wbt.h

···
 u64 wbt_get_min_lat(struct request_queue *q);
 void wbt_set_min_lat(struct request_queue *q, u64 val);

-void wbt_set_queue_depth(struct request_queue *, unsigned int);
 void wbt_set_write_cache(struct request_queue *, bool);

 u64 wbt_default_latency_nsec(struct request_queue *);
···
 {
 }
 static inline void wbt_enable_default(struct request_queue *q)
-{
-}
-static inline void wbt_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 }
 static inline void wbt_set_write_cache(struct request_queue *q, bool wc)

+39

block/blk-zoned.c

···
 }
 EXPORT_SYMBOL_GPL(blkdev_report_zones);

+/*
+ * Special case of zone reset operation to reset all zones in one command,
+ * useful for applications like mkfs.
+ */
+static int __blkdev_reset_all_zones(struct block_device *bdev, gfp_t gfp_mask)
+{
+	struct bio *bio = bio_alloc(gfp_mask, 0);
+	int ret;
+
+	/* across the zones operations, don't need any sectors */
+	bio_set_dev(bio, bdev);
+	bio_set_op_attrs(bio, REQ_OP_ZONE_RESET_ALL, 0);
+
+	ret = submit_bio_wait(bio);
+	bio_put(bio);
+
+	return ret;
+}
+
+static inline bool blkdev_allow_reset_all_zones(struct block_device *bdev,
+						sector_t nr_sectors)
+{
+	if (!blk_queue_zone_resetall(bdev_get_queue(bdev)))
+		return false;
+
+	if (nr_sectors != part_nr_sects_read(bdev->bd_part))
+		return false;
+	/*
+	 * REQ_OP_ZONE_RESET_ALL can be executed only if the block device is
+	 * the entire disk, that is, if the blocks device start offset is 0 and
+	 * its capacity is the same as the entire disk.
+	 */
+	return get_start_sect(bdev) == 0 &&
+	       part_nr_sects_read(bdev->bd_part) == get_capacity(bdev->bd_disk);
+}
+
 /**
  * blkdev_reset_zones - Reset zones write pointer
  * @bdev: Target block device
···
 	if (!nr_sectors || end_sector > bdev->bd_part->nr_sects)
 		/* Out of range */
 		return -EINVAL;
+
+	if (blkdev_allow_reset_all_zones(bdev, nr_sectors))
+		return __blkdev_reset_all_zones(bdev, gfp_mask);

 	/* Check alignment (handle eventual smaller last zone) */
 	zone_sectors = blk_queue_zone_sectors(q);
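
The eligibility test for the single-command reset reduces to three conditions: the queue advertises reset-all support, the request covers the whole block device, and the block device is the whole disk rather than a partition. The following userspace sketch models only that decision. `struct bdev_view` and its fields are hypothetical; they flatten the values the kernel obtains through `blk_queue_zone_resetall()`, `get_start_sect()`, `part_nr_sects_read()` and `get_capacity()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened view of the values the kernel check consults. */
struct bdev_view {
	bool     queue_supports_reset_all;	/* models blk_queue_zone_resetall() */
	uint64_t start_sector;			/* models get_start_sect() */
	uint64_t nr_sectors;			/* models part_nr_sects_read() */
	uint64_t disk_capacity;			/* models get_capacity() */
};

/* Return true only when one reset-all command can replace per-zone resets. */
static bool allow_reset_all_zones(const struct bdev_view *b,
				  uint64_t nr_sectors)
{
	/* The device must advertise the reset-all operation. */
	if (!b->queue_supports_reset_all)
		return false;

	/* The request must cover every sector of this block device. */
	if (nr_sectors != b->nr_sectors)
		return false;

	/* The block device must be the entire disk, not a partition. */
	return b->start_sector == 0 && b->nr_sectors == b->disk_capacity;
}
```

When any condition fails, the caller falls through to the existing per-zone path, so the fast path is purely an optimization.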

+2 -2

block/blk.h

···

 void blk_insert_flush(struct request *rq);

-int elevator_init_mq(struct request_queue *q);
+void elevator_init_mq(struct request_queue *q);
 int elevator_switch_mq(struct request_queue *q,
 			      struct elevator_type *new_e);
 void __elevator_exit(struct request_queue *, struct elevator_queue *);
-int elv_register_queue(struct request_queue *q);
+int elv_register_queue(struct request_queue *q, bool uevent);
 void elv_unregister_queue(struct request_queue *q);

 static inline void elevator_exit(struct request_queue *q,

+157 -60

block/elevator.c

··· 83 83 } 84 84 EXPORT_SYMBOL(elv_bio_merge_ok); 85 85 86 - static bool elevator_match(const struct elevator_type *e, const char *name) 86 + static inline bool elv_support_features(unsigned int elv_features, 87 + unsigned int required_features) 87 88 { 89 + return (required_features & elv_features) == required_features; 90 + } 91 + 92 + /** 93 + * elevator_match - Test an elevator name and features 94 + * @e: Scheduler to test 95 + * @name: Elevator name to test 96 + * @required_features: Features that the elevator must provide 97 + * 98 + * Return true is the elevator @e name matches @name and if @e provides all the 99 + * the feratures spcified by @required_features. 100 + */ 101 + static bool elevator_match(const struct elevator_type *e, const char *name, 102 + unsigned int required_features) 103 + { 104 + if (!elv_support_features(e->elevator_features, required_features)) 105 + return false; 88 106 if (!strcmp(e->elevator_name, name)) 89 107 return true; 90 108 if (e->elevator_alias && !strcmp(e->elevator_alias, name)) ··· 111 93 return false; 112 94 } 113 95 114 - /* 115 - * Return scheduler with name 'name' 96 + /** 97 + * elevator_find - Find an elevator 98 + * @name: Name of the elevator to find 99 + * @required_features: Features that the elevator must provide 100 + * 101 + * Return the first registered scheduler with name @name and supporting the 102 + * features @required_features and NULL otherwise. 
116 103 */ 117 - static struct elevator_type *elevator_find(const char *name) 104 + static struct elevator_type *elevator_find(const char *name, 105 + unsigned int required_features) 118 106 { 119 107 struct elevator_type *e; 120 108 121 109 list_for_each_entry(e, &elv_list, list) { 122 - if (elevator_match(e, name)) 110 + if (elevator_match(e, name, required_features)) 123 111 return e; 124 112 } 125 113 ··· 144 120 145 121 spin_lock(&elv_list_lock); 146 122 147 - e = elevator_find(name); 123 + e = elevator_find(name, q->required_elevator_features); 148 124 if (!e && try_loading) { 149 125 spin_unlock(&elv_list_lock); 150 126 request_module("%s-iosched", name); 151 127 spin_lock(&elv_list_lock); 152 - e = elevator_find(name); 128 + e = elevator_find(name, q->required_elevator_features); 153 129 } 154 130 155 131 if (e && !try_module_get(e->elevator_owner)) ··· 158 134 spin_unlock(&elv_list_lock); 159 135 return e; 160 136 } 161 - 162 - static char chosen_elevator[ELV_NAME_MAX]; 163 - 164 - static int __init elevator_setup(char *str) 165 - { 166 - /* 167 - * Be backwards-compatible with previous kernels, so users 168 - * won't get the wrong elevator. 169 - */ 170 - strncpy(chosen_elevator, str, sizeof(chosen_elevator) - 1); 171 - return 1; 172 - } 173 - 174 - __setup("elevator=", elevator_setup); 175 137 176 138 static struct kobj_type elv_ktype; 177 139 ··· 480 470 .release = elevator_release, 481 471 }; 482 472 483 - int elv_register_queue(struct request_queue *q) 473 + /* 474 + * elv_register_queue is called from either blk_register_queue or 475 + * elevator_switch, elevator switch is prevented from being happen 476 + * in the two paths, so it is safe to not hold q->sysfs_lock. 
477 + */ 478 + int elv_register_queue(struct request_queue *q, bool uevent) 484 479 { 485 480 struct elevator_queue *e = q->elevator; 486 481 int error; 487 - 488 - lockdep_assert_held(&q->sysfs_lock); 489 482 490 483 error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched"); 491 484 if (!error) { ··· 500 487 attr++; 501 488 } 502 489 } 503 - kobject_uevent(&e->kobj, KOBJ_ADD); 490 + if (uevent) 491 + kobject_uevent(&e->kobj, KOBJ_ADD); 492 + 493 + mutex_lock(&q->sysfs_lock); 504 494 e->registered = 1; 495 + mutex_unlock(&q->sysfs_lock); 505 496 } 506 497 return error; 507 498 } 508 499 500 + /* 501 + * elv_unregister_queue is called from either blk_unregister_queue or 502 + * elevator_switch, elevator switch is prevented from being happen 503 + * in the two paths, so it is safe to not hold q->sysfs_lock. 504 + */ 509 505 void elv_unregister_queue(struct request_queue *q) 510 506 { 511 - lockdep_assert_held(&q->sysfs_lock); 512 - 513 507 if (q) { 514 508 struct elevator_queue *e = q->elevator; 515 509 516 510 kobject_uevent(&e->kobj, KOBJ_REMOVE); 517 511 kobject_del(&e->kobj); 512 + 513 + mutex_lock(&q->sysfs_lock); 518 514 e->registered = 0; 519 515 /* Re-enable throttling in case elevator disabled it */ 520 516 wbt_enable_default(q); 517 + mutex_unlock(&q->sysfs_lock); 521 518 } 522 519 } 523 520 ··· 549 526 550 527 /* register, don't allow duplicate names */ 551 528 spin_lock(&elv_list_lock); 552 - if (elevator_find(e->elevator_name)) { 529 + if (elevator_find(e->elevator_name, 0)) { 553 530 spin_unlock(&elv_list_lock); 554 531 kmem_cache_destroy(e->icq_cache); 555 532 return -EBUSY; ··· 590 567 lockdep_assert_held(&q->sysfs_lock); 591 568 592 569 if (q->elevator) { 593 - if (q->elevator->registered) 570 + if (q->elevator->registered) { 571 + mutex_unlock(&q->sysfs_lock); 572 + 573 + /* 574 + * Concurrent elevator switch can't happen becasue 575 + * sysfs write is always exclusively on same file. 
576 + * 577 + * Also the elevator queue won't be freed after 578 + * sysfs_lock is released becasue kobject_del() in 579 + * blk_unregister_queue() waits for completion of 580 + * .store & .show on its attributes. 581 + */ 594 582 elv_unregister_queue(q); 583 + 584 + mutex_lock(&q->sysfs_lock); 585 + } 595 586 ioc_clear_queue(q); 596 587 elevator_exit(q, q->elevator); 588 + 589 + /* 590 + * sysfs_lock may be dropped, so re-check if queue is 591 + * unregistered. If yes, don't switch to new elevator 592 + * any more 593 + */ 594 + if (!blk_queue_registered(q)) 595 + return 0; 597 596 } 598 597 599 598 ret = blk_mq_init_sched(q, new_e); ··· 623 578 goto out; 624 579 625 580 if (new_e) { 626 - ret = elv_register_queue(q); 581 + mutex_unlock(&q->sysfs_lock); 582 + 583 + ret = elv_register_queue(q, true); 584 + 585 + mutex_lock(&q->sysfs_lock); 627 586 if (ret) { 628 587 elevator_exit(q, q->elevator); 629 588 goto out; ··· 643 594 return ret; 644 595 } 645 596 597 + static inline bool elv_support_iosched(struct request_queue *q) 598 + { 599 + if (q->tag_set && (q->tag_set->flags & BLK_MQ_F_NO_SCHED)) 600 + return false; 601 + return true; 602 + } 603 + 646 604 /* 647 - * For blk-mq devices, we default to using mq-deadline, if available, for single 648 - * queue devices. If deadline isn't available OR we have multiple queues, 649 - * default to "none". 605 + * For single queue devices, default to using mq-deadline. If we have multiple 606 + * queues or mq-deadline is not available, default to "none". 650 607 */ 651 - int elevator_init_mq(struct request_queue *q) 608 + static struct elevator_type *elevator_get_default(struct request_queue *q) 609 + { 610 + if (q->nr_hw_queues != 1) 611 + return NULL; 612 + 613 + return elevator_get(q, "mq-deadline", false); 614 + } 615 + 616 + /* 617 + * Get the first elevator providing the features required by the request queue. 618 + * Default to "none" if no matching elevator is found. 
619 + */ 620 + static struct elevator_type *elevator_get_by_features(struct request_queue *q) 621 + { 622 + struct elevator_type *e, *found = NULL; 623 + 624 + spin_lock(&elv_list_lock); 625 + 626 + list_for_each_entry(e, &elv_list, list) { 627 + if (elv_support_features(e->elevator_features, 628 + q->required_elevator_features)) { 629 + found = e; 630 + break; 631 + } 632 + } 633 + 634 + if (found && !try_module_get(found->elevator_owner)) 635 + found = NULL; 636 + 637 + spin_unlock(&elv_list_lock); 638 + return found; 639 + } 640 + 641 + /* 642 + * For a device queue that has no required features, use the default elevator 643 + * settings. Otherwise, use the first elevator available matching the required 644 + * features. If no suitable elevator is find or if the chosen elevator 645 + * initialization fails, fall back to the "none" elevator (no elevator). 646 + */ 647 + void elevator_init_mq(struct request_queue *q) 652 648 { 653 649 struct elevator_type *e; 654 - int err = 0; 650 + int err; 655 651 656 - if (q->nr_hw_queues != 1) 657 - return 0; 652 + if (!elv_support_iosched(q)) 653 + return; 658 654 659 - /* 660 - * q->sysfs_lock must be held to provide mutual exclusion between 661 - * elevator_switch() and here. 
662 - */ 663 - mutex_lock(&q->sysfs_lock); 655 + WARN_ON_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)); 656 + 664 657 if (unlikely(q->elevator)) 665 - goto out_unlock; 658 + return; 666 659 667 - e = elevator_get(q, "mq-deadline", false); 660 + if (!q->required_elevator_features) 661 + e = elevator_get_default(q); 662 + else 663 + e = elevator_get_by_features(q); 668 664 if (!e) 669 - goto out_unlock; 665 + return; 666 + 667 + blk_mq_freeze_queue(q); 668 + blk_mq_quiesce_queue(q); 670 669 671 670 err = blk_mq_init_sched(q, e); 672 - if (err) 671 + 672 + blk_mq_unquiesce_queue(q); 673 + blk_mq_unfreeze_queue(q); 674 + 675 + if (err) { 676 + pr_warn("\"%s\" elevator initialization failed, " 677 + "falling back to \"none\"\n", e->elevator_name); 673 678 elevator_put(e); 674 - out_unlock: 675 - mutex_unlock(&q->sysfs_lock); 676 - return err; 679 + } 677 680 } 678 681 679 682 ··· 761 660 struct elevator_type *e; 762 661 763 662 /* Make sure queue is not in the middle of being removed */ 764 - if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) 663 + if (!blk_queue_registered(q)) 765 664 return -ENOENT; 766 665 767 666 /* ··· 778 677 if (!e) 779 678 return -EINVAL; 780 679 781 - if (q->elevator && elevator_match(q->elevator->type, elevator_name)) { 680 + if (q->elevator && 681 + elevator_match(q->elevator->type, elevator_name, 0)) { 782 682 elevator_put(e); 783 683 return 0; 784 684 } 785 685 786 686 return elevator_switch(q, e); 787 - } 788 - 789 - static inline bool elv_support_iosched(struct request_queue *q) 790 - { 791 - if (q->tag_set && (q->tag_set->flags & BLK_MQ_F_NO_SCHED)) 792 - return false; 793 - return true; 794 687 } 795 688 796 689 ssize_t elv_iosched_store(struct request_queue *q, const char *name, ··· 819 724 820 725 spin_lock(&elv_list_lock); 821 726 list_for_each_entry(__e, &elv_list, list) { 822 - if (elv && elevator_match(elv, __e->elevator_name)) { 727 + if (elv && elevator_match(elv, __e->elevator_name, 0)) { 823 728 len += 
sprintf(name+len, "[%s] ", elv->elevator_name); 824 729 continue; 825 730 } 826 - if (elv_support_iosched(q)) 731 + if (elv_support_iosched(q) && 732 + elevator_match(__e, __e->elevator_name, 733 + q->required_elevator_features)) 827 734 len += sprintf(name+len, "%s ", __e->elevator_name); 828 735 } 829 736 spin_unlock(&elv_list_lock);
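
The elevator lookup in this file now takes a feature mask: a scheduler matches only if it provides every required feature bit, and only then is its name or alias compared. A minimal userspace sketch of that rule follows. `struct sched_type`, `sched_match()`, `sched_find()` and both `FEAT_*` bit values are hypothetical and chosen for illustration; they are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature bits; the kernel's real values may differ. */
#define FEAT_ZBD_SEQ_WRITE	(1u << 0)
#define FEAT_OTHER		(1u << 1)

/* Hypothetical stand-in for the fields of struct elevator_type used here. */
struct sched_type {
	const char *name;
	const char *alias;	/* may be NULL */
	unsigned int features;
};

/* True when every bit in 'required' is also set in 'have'. */
static bool support_features(unsigned int have, unsigned int required)
{
	return (required & have) == required;
}

/* Feature check first, then name or alias comparison. */
static bool sched_match(const struct sched_type *e, const char *name,
			unsigned int required)
{
	if (!support_features(e->features, required))
		return false;
	if (!strcmp(e->name, name))
		return true;
	if (e->alias && !strcmp(e->alias, name))
		return true;
	return false;
}

/* Return the first matching entry in the table, or NULL if none matches. */
static const struct sched_type *sched_find(const struct sched_type *tbl,
					   size_t n, const char *name,
					   unsigned int required)
{
	for (size_t i = 0; i < n; i++) {
		if (sched_match(&tbl[i], name, required))
			return &tbl[i];
	}
	return NULL;
}
```

Passing a required mask of 0 matches on name alone, which is how the diff keeps duplicate-name detection and the "currently active" comparison independent of features.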

+9

block/genhd.c

···
 	dev_t devt;
 	int retval;

+	/*
+	 * The disk queue should now be all set with enough information about
+	 * the device for the elevator code to pick an adequate default
+	 * elevator if one is needed, that is, for devices requesting queue
+	 * registration.
+	 */
+	if (register_queue)
+		elevator_init_mq(disk->queue);
+
 	/* minors == 0 indicates to use ext devt from part0 and should
 	 * be accompanied with EXT_DEVT flag. Make sure all
 	 * parameters make sense.

+10 -10

block/mq-deadline.c

···
  * hardware queue, but we may return a request that is for a
  * different hardware queue. This is because mq-deadline has shared
  * state for all hardware queues, in terms of sorting, FIFOs, etc.
- *
- * For a zoned block device, __dd_dispatch_request() may return NULL
- * if all the queued write requests are directed at zones that are already
- * locked due to on-going write requests. In this case, make sure to mark
- * the queue as needing a restart to ensure that the queue is run again
- * and the pending writes dispatched once the target zones for the ongoing
- * write requests are unlocked in dd_finish_request().
  */
 static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
···

 	spin_lock(&dd->lock);
 	rq = __dd_dispatch_request(dd);
-	if (!rq && blk_queue_is_zoned(hctx->queue) &&
-	    !list_empty(&dd->fifo_list[WRITE]))
-		blk_mq_sched_mark_restart_hctx(hctx);
 	spin_unlock(&dd->lock);

 	return rq;
···
  * spinlock so that the zone is never unlocked while deadline_fifo_request()
  * or deadline_next_request() are executing. This function is called for
  * all requests, whether or not these requests complete successfully.
+ *
+ * For a zoned block device, __dd_dispatch_request() may have stopped
+ * dispatching requests if all the queued requests are write requests directed
+ * at zones that are already locked due to on-going write requests. To ensure
+ * write request dispatch progress in this case, mark the queue as needing a
+ * restart to ensure that the queue is run again after completion of the
+ * request and zones being unlocked.
  */
 static void dd_finish_request(struct request *rq)
 {
···

 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
+		if (!list_empty(&dd->fifo_list[WRITE]))
+			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
 }
···
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "mq-deadline",
 	.elevator_alias = "deadline",
+	.elevator_features = ELEVATOR_F_ZBD_SEQ_WRITE,
 	.elevator_owner = THIS_MODULE,
 };
 MODULE_ALIAS("mq-deadline-iosched");

+1 -4

block/opal_proto.h

··· 119 119 OPAL_UID_HEXFF, 120 120 }; 121 121 122 - #define OPAL_METHOD_LENGTH 8 123 - 124 122 /* Enum for indexing the OPALMETHOD array */ 125 123 enum opal_method { 126 124 OPAL_PROPERTIES, ··· 165 167 OPAL_TABLE_LASTID = 0x0A, 166 168 OPAL_TABLE_MIN = 0x0B, 167 169 OPAL_TABLE_MAX = 0x0C, 168 - 169 170 /* authority table */ 170 171 OPAL_PIN = 0x03, 171 172 /* locking tokens */ ··· 179 182 OPAL_LIFECYCLE = 0x06, 180 183 /* locking info table */ 181 184 OPAL_MAXRANGES = 0x04, 182 - /* mbr control */ 185 + /* mbr control */ 183 186 OPAL_MBRENABLE = 0x01, 184 187 OPAL_MBRDONE = 0x02, 185 188 /* properties */

+41 -8

block/sed-opal.c

··· 129 129 { 0x00, 0x00, 0x00, 0x09, 0x00, 0x00, 0x84, 0x01 }, 130 130 131 131 /* tables */ 132 - 133 132 [OPAL_TABLE_TABLE] 134 133 { 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01 }, 135 134 [OPAL_LOCKINGRANGE_GLOBAL] = ··· 151 152 { 0x00, 0x00, 0x08, 0x01, 0x00, 0x00, 0x00, 0x00 }, 152 153 153 154 /* C_PIN_TABLE object ID's */ 154 - 155 155 [OPAL_C_PIN_MSID] = 156 156 { 0x00, 0x00, 0x00, 0x0B, 0x00, 0x00, 0x84, 0x02}, 157 157 [OPAL_C_PIN_SID] = ··· 159 161 { 0x00, 0x00, 0x00, 0x0B, 0x00, 0x01, 0x00, 0x01}, 160 162 161 163 /* half UID's (only first 4 bytes used) */ 162 - 163 164 [OPAL_HALF_UID_AUTHORITY_OBJ_REF] = 164 165 { 0x00, 0x00, 0x0C, 0x05, 0xff, 0xff, 0xff, 0xff }, 165 166 [OPAL_HALF_UID_BOOLEAN_ACE] = ··· 514 517 ret = opal_recv_cmd(dev); 515 518 if (ret) 516 519 return ret; 520 + 517 521 return opal_discovery0_end(dev); 518 522 } 519 523 ··· 523 525 const struct opal_step discovery0_step = { 524 526 opal_discovery0, 525 527 }; 528 + 526 529 return execute_step(dev, &discovery0_step, 0); 527 530 } 528 531 ··· 550 551 { 551 552 if (!can_add(err, cmd, 1)) 552 553 return; 554 + 553 555 cmd->cmd[cmd->pos++] = tok; 554 556 } 555 557 ··· 577 577 header0 |= bytestring ? MEDIUM_ATOM_BYTESTRING : 0; 578 578 header0 |= has_sign ? 
MEDIUM_ATOM_SIGNED : 0; 579 579 header0 |= (len >> 8) & MEDIUM_ATOM_LEN_MASK; 580 + 580 581 cmd->cmd[cmd->pos++] = header0; 581 582 cmd->cmd[cmd->pos++] = len; 582 583 } ··· 650 649 651 650 if (lr == 0) 652 651 return 0; 652 + 653 653 buffer[5] = LOCKING_RANGE_NON_GLOBAL; 654 654 buffer[7] = lr; 655 655 ··· 905 903 num_entries++; 906 904 } 907 905 908 - if (num_entries == 0) { 909 - pr_debug("Couldn't parse response.\n"); 910 - return -EINVAL; 911 - } 912 906 resp->num = num_entries; 913 907 914 908 return 0; ··· 943 945 } 944 946 945 947 *store = tok->pos + skip; 948 + 946 949 return tok->len - skip; 947 950 } 948 951 ··· 1061 1062 1062 1063 dev->hsn = hsn; 1063 1064 dev->tsn = tsn; 1065 + 1064 1066 return 0; 1065 1067 } 1066 1068 ··· 1084 1084 { 1085 1085 dev->hsn = 0; 1086 1086 dev->tsn = 0; 1087 + 1087 1088 return parse_and_check_status(dev); 1088 1089 } 1089 1090 ··· 1173 1172 return err; 1174 1173 1175 1174 } 1175 + 1176 1176 return finalize_and_send(dev, parse_and_check_status); 1177 1177 } 1178 1178 ··· 1186 1184 error = parse_and_check_status(dev); 1187 1185 if (error) 1188 1186 return error; 1187 + 1189 1188 keylen = response_get_string(&dev->parsed, 4, &activekey); 1190 1189 if (!activekey) { 1191 1190 pr_debug("%s: Couldn't extract the Activekey from the response\n", 1192 1191 __func__); 1193 1192 return OPAL_INVAL_PARAM; 1194 1193 } 1194 + 1195 1195 dev->prev_data = kmemdup(activekey, keylen, GFP_KERNEL); 1196 1196 1197 1197 if (!dev->prev_data) ··· 1255 1251 1256 1252 add_token_u8(&err, dev, OPAL_ENDLIST); 1257 1253 add_token_u8(&err, dev, OPAL_ENDNAME); 1254 + 1258 1255 return err; 1259 1256 } 1260 1257 ··· 1268 1263 0, 0); 1269 1264 if (err) 1270 1265 pr_debug("Failed to create enable global lr command\n"); 1266 + 1271 1267 return err; 1272 1268 } 1273 1269 ··· 1319 1313 if (err) { 1320 1314 pr_debug("Error building Setup Locking range command.\n"); 1321 1315 return err; 1322 - 1323 1316 } 1324 1317 1325 1318 return finalize_and_send(dev, 
parse_and_check_status); ··· 1398 1393 kfree(key); 1399 1394 dev->prev_data = NULL; 1400 1395 } 1396 + 1401 1397 return ret; 1402 1398 } 1403 1399 ··· 1524 1518 pr_debug("Error building Erase Locking Range Command.\n"); 1525 1519 return err; 1526 1520 } 1521 + 1527 1522 return finalize_and_send(dev, parse_and_check_status); 1528 1523 } 1529 1524 ··· 1643 1636 1644 1637 off += len; 1645 1638 } 1639 + 1646 1640 return err; 1647 1641 } 1648 1642 ··· 1824 1816 pr_debug("Error building SET command.\n"); 1825 1817 return err; 1826 1818 } 1819 + 1827 1820 return finalize_and_send(dev, parse_and_check_status); 1828 1821 } 1829 1822 ··· 1866 1857 pr_debug("Error building SET command.\n"); 1867 1858 return ret; 1868 1859 } 1860 + 1869 1861 return finalize_and_send(dev, parse_and_check_status); 1870 1862 } 1871 1863 ··· 1967 1957 1968 1958 if (err < 0) 1969 1959 return err; 1960 + 1970 1961 return finalize_and_send(dev, end_session_cont); 1971 1962 } 1972 1963 ··· 1976 1965 const struct opal_step error_end_session = { 1977 1966 end_opal_session, 1978 1967 }; 1968 + 1979 1969 return execute_step(dev, &error_end_session, 0); 1980 1970 } 1981 1971 ··· 1996 1984 ret = opal_discovery0_step(dev); 1997 1985 dev->supported = !ret; 1998 1986 mutex_unlock(&dev->dev_lock); 1987 + 1999 1988 return ret; 2000 1989 } 2001 1990 ··· 2017 2004 { 2018 2005 if (!dev) 2019 2006 return; 2007 + 2020 2008 clean_opal_dev(dev); 2021 2009 kfree(dev); 2022 2010 } ··· 2040 2026 kfree(dev); 2041 2027 return NULL; 2042 2028 } 2029 + 2043 2030 return dev; 2044 2031 } 2045 2032 EXPORT_SYMBOL(init_opal_dev); ··· 2060 2045 setup_opal_dev(dev); 2061 2046 ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps)); 2062 2047 mutex_unlock(&dev->dev_lock); 2048 + 2063 2049 return ret; 2064 2050 } 2065 2051 ··· 2078 2062 setup_opal_dev(dev); 2079 2063 ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps)); 2080 2064 mutex_unlock(&dev->dev_lock); 2065 + 2081 2066 return ret; 2082 2067 } 2083 2068 ··· 2106 
2089 setup_opal_dev(dev); 2107 2090 ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); 2108 2091 mutex_unlock(&dev->dev_lock); 2092 + 2109 2093 return ret; 2110 2094 } 2111 2095 ··· 2131 2113 setup_opal_dev(dev); 2132 2114 ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); 2133 2115 mutex_unlock(&dev->dev_lock); 2116 + 2134 2117 return ret; 2135 2118 } 2136 2119 ··· 2152 2133 setup_opal_dev(dev); 2153 2134 ret = execute_steps(dev, mbr_steps, ARRAY_SIZE(mbr_steps)); 2154 2135 mutex_unlock(&dev->dev_lock); 2136 + 2155 2137 return ret; 2156 2138 } 2157 2139 ··· 2171 2151 setup_opal_dev(dev); 2172 2152 add_suspend_info(dev, suspend); 2173 2153 mutex_unlock(&dev->dev_lock); 2154 + 2174 2155 return 0; 2175 2156 } 2176 2157 ··· 2190 2169 pr_debug("Locking state was not RO or RW\n"); 2191 2170 return -EINVAL; 2192 2171 } 2172 + 2193 2173 if (lk_unlk->session.who < OPAL_USER1 || 2194 2174 lk_unlk->session.who > OPAL_USER9) { 2195 2175 pr_debug("Authority was not within the range of users: %d\n", 2196 2176 lk_unlk->session.who); 2197 2177 return -EINVAL; 2198 2178 } 2179 + 2199 2180 if (lk_unlk->session.sum) { 2200 2181 pr_debug("%s not supported in sum. 
Use setup locking range\n", 2201 2182 __func__); ··· 2208 2185 setup_opal_dev(dev); 2209 2186 ret = execute_steps(dev, steps, ARRAY_SIZE(steps)); 2210 2187 mutex_unlock(&dev->dev_lock); 2188 + 2211 2189 return ret; 2212 2190 } 2213 2191 ··· 2291 2267 mutex_lock(&dev->dev_lock); 2292 2268 ret = __opal_lock_unlock(dev, lk_unlk); 2293 2269 mutex_unlock(&dev->dev_lock); 2270 + 2294 2271 return ret; 2295 2272 } 2296 2273 ··· 2314 2289 setup_opal_dev(dev); 2315 2290 ret = execute_steps(dev, owner_steps, ARRAY_SIZE(owner_steps)); 2316 2291 mutex_unlock(&dev->dev_lock); 2292 + 2317 2293 return ret; 2318 2294 } 2319 2295 ··· 2336 2310 setup_opal_dev(dev); 2337 2311 ret = execute_steps(dev, active_steps, ARRAY_SIZE(active_steps)); 2338 2312 mutex_unlock(&dev->dev_lock); 2313 + 2339 2314 return ret; 2340 2315 } 2341 2316 ··· 2354 2327 setup_opal_dev(dev); 2355 2328 ret = execute_steps(dev, lr_steps, ARRAY_SIZE(lr_steps)); 2356 2329 mutex_unlock(&dev->dev_lock); 2330 + 2357 2331 return ret; 2358 2332 } 2359 2333 ··· 2375 2347 setup_opal_dev(dev); 2376 2348 ret = execute_steps(dev, pw_steps, ARRAY_SIZE(pw_steps)); 2377 2349 mutex_unlock(&dev->dev_lock); 2350 + 2378 2351 return ret; 2379 2352 } 2380 2353 ··· 2400 2371 setup_opal_dev(dev); 2401 2372 ret = execute_steps(dev, act_steps, ARRAY_SIZE(act_steps)); 2402 2373 mutex_unlock(&dev->dev_lock); 2374 + 2403 2375 return ret; 2404 2376 } 2405 2377 ··· 2412 2382 2413 2383 if (!dev) 2414 2384 return false; 2385 + 2415 2386 if (!dev->supported) 2416 2387 return false; 2417 2388 ··· 2430 2399 suspend->unlk.session.sum); 2431 2400 was_failure = true; 2432 2401 } 2402 + 2433 2403 if (dev->mbr_enabled) { 2434 2404 ret = __opal_set_mbr_done(dev, &suspend->unlk.session.opal_key); 2435 2405 if (ret) ··· 2438 2406 } 2439 2407 } 2440 2408 mutex_unlock(&dev->dev_lock); 2409 + 2441 2410 return was_failure; 2442 2411 } 2443 2412 EXPORT_SYMBOL(opal_unlock_from_suspend);

+2 -2

drivers/block/floppy.c

···
 	v.native_format = UDP->native_format;
 	mutex_unlock(&floppy_mutex);

-	if (copy_from_user(arg, &v, sizeof(struct compat_floppy_drive_params)))
+	if (copy_to_user(arg, &v, sizeof(struct compat_floppy_drive_params)))
 		return -EFAULT;
 	return 0;
 }
···
 	v.bufblocks = UDRS->bufblocks;
 	mutex_unlock(&floppy_mutex);

-	if (copy_from_user(arg, &v, sizeof(struct compat_floppy_drive_struct)))
+	if (copy_to_user(arg, &v, sizeof(struct compat_floppy_drive_struct)))
 		return -EFAULT;
 	return 0;
 Eintr:
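The floppy.c change swaps the copy direction: both compat helpers fill a kernel-side struct `v` and hand it back to the caller, so the data has to flow out to user space. The following is a minimal user-space sketch of that shape, not kernel code. `mock_copy_to_user`, `demo_params` and `demo_get_params` are hypothetical names; the mock only mirrors the real helper's convention of returning the number of bytes that could not be copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a compat drive-params struct. */
struct demo_params {
	int native_format;
	int cmos;
};

/*
 * Mock of copy_to_user(): copies n bytes from a kernel-side source
 * into a "user" buffer and returns the number of bytes NOT copied
 * (0 on success). A NULL destination stands in for a faulting address.
 */
static unsigned long mock_copy_to_user(void *to, const void *from, size_t n)
{
	assert(from != NULL);	/* the kernel-side source is always valid here */
	if (!to)
		return n;
	memcpy(to, from, n);
	return 0;
}

/* Shape of the fixed helper: fill v in "kernel" memory, then copy it OUT. */
static int demo_get_params(void *user_arg)
{
	struct demo_params v;

	memset(&v, 0, sizeof(v));
	v.native_format = 7;
	v.cmos = 4;

	if (mock_copy_to_user(user_arg, &v, sizeof(v)))
		return -14;	/* -EFAULT */
	return 0;
}
```

With the direction reversed, the helper would overwrite `v` with whatever the caller passed in and the caller would never see the drive parameters.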

+1

drivers/block/loop.c

···
 	case LOOP_SET_FD:
 	case LOOP_CHANGE_FD:
 	case LOOP_SET_BLOCK_SIZE:
+	case LOOP_SET_DIRECT_IO:
 		err = lo_ioctl(bdev, mode, cmd, arg);
 		break;
 	default:

+80 -47

drivers/block/nbd.c

··· 108 108 struct nbd_config *config; 109 109 struct mutex config_lock; 110 110 struct gendisk *disk; 111 + struct workqueue_struct *recv_workq; 111 112 112 113 struct list_head list; 113 114 struct task_struct *task_recv; ··· 122 121 struct mutex lock; 123 122 int index; 124 123 int cookie; 124 + int retries; 125 125 blk_status_t status; 126 126 unsigned long flags; 127 127 u32 cmd_cookie; ··· 140 138 141 139 static unsigned int nbds_max = 16; 142 140 static int max_part = 16; 143 - static struct workqueue_struct *recv_workqueue; 144 141 static int part_shift; 145 142 146 143 static int nbd_dev_dbg_init(struct nbd_device *nbd); ··· 345 344 dev_warn(disk_to_dev(nbd->disk), "shutting down sockets\n"); 346 345 } 347 346 347 + static u32 req_to_nbd_cmd_type(struct request *req) 348 + { 349 + switch (req_op(req)) { 350 + case REQ_OP_DISCARD: 351 + return NBD_CMD_TRIM; 352 + case REQ_OP_FLUSH: 353 + return NBD_CMD_FLUSH; 354 + case REQ_OP_WRITE: 355 + return NBD_CMD_WRITE; 356 + case REQ_OP_READ: 357 + return NBD_CMD_READ; 358 + default: 359 + return U32_MAX; 360 + } 361 + } 362 + 348 363 static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, 349 364 bool reserved) 350 365 { ··· 374 357 } 375 358 config = nbd->config; 376 359 377 - if (!mutex_trylock(&cmd->lock)) 360 + if (!mutex_trylock(&cmd->lock)) { 361 + nbd_config_put(nbd); 378 362 return BLK_EH_RESET_TIMER; 363 + } 379 364 380 365 if (config->num_connections > 1) { 381 366 dev_err_ratelimited(nbd_to_dev(nbd), ··· 408 389 nbd_config_put(nbd); 409 390 return BLK_EH_DONE; 410 391 } 411 - } else { 412 - dev_err_ratelimited(nbd_to_dev(nbd), 413 - "Connection timed out\n"); 414 392 } 393 + 394 + if (!nbd->tag_set.timeout) { 395 + /* 396 + * Userspace sets timeout=0 to disable socket disconnection, 397 + * so just warn and reset the timer. 398 + */ 399 + cmd->retries++; 400 + dev_info(nbd_to_dev(nbd), "Possible stuck request %p: control (%s@%llu,%uB). 
Runtime %u seconds\n", 401 + req, nbdcmd_to_ascii(req_to_nbd_cmd_type(req)), 402 + (unsigned long long)blk_rq_pos(req) << 9, 403 + blk_rq_bytes(req), (req->timeout / HZ) * cmd->retries); 404 + 405 + mutex_unlock(&cmd->lock); 406 + nbd_config_put(nbd); 407 + return BLK_EH_RESET_TIMER; 408 + } 409 + 410 + dev_err_ratelimited(nbd_to_dev(nbd), "Connection timed out\n"); 415 411 set_bit(NBD_TIMEDOUT, &config->runtime_flags); 416 412 cmd->status = BLK_STS_IOERR; 417 413 mutex_unlock(&cmd->lock); ··· 514 480 515 481 iov_iter_kvec(&from, WRITE, &iov, 1, sizeof(request)); 516 482 517 - switch (req_op(req)) { 518 - case REQ_OP_DISCARD: 519 - type = NBD_CMD_TRIM; 520 - break; 521 - case REQ_OP_FLUSH: 522 - type = NBD_CMD_FLUSH; 523 - break; 524 - case REQ_OP_WRITE: 525 - type = NBD_CMD_WRITE; 526 - break; 527 - case REQ_OP_READ: 528 - type = NBD_CMD_READ; 529 - break; 530 - default: 483 + type = req_to_nbd_cmd_type(req); 484 + if (type == U32_MAX) 531 485 return -EIO; 532 - } 533 486 534 487 if (rq_data_dir(req) == WRITE && 535 488 (config->flags & NBD_FLAG_READ_ONLY)) { ··· 547 526 } 548 527 cmd->index = index; 549 528 cmd->cookie = nsock->cookie; 529 + cmd->retries = 0; 550 530 request.type = htonl(type | nbd_cmd_flags); 551 531 if (type != NBD_CMD_FLUSH) { 552 532 request.from = cpu_to_be64((u64)blk_rq_pos(req) << 9); ··· 1058 1036 /* We take the tx_mutex in an error path in the recv_work, so we 1059 1037 * need to queue_work outside of the tx_mutex. 
1060 1038 */ 1061 - queue_work(recv_workqueue, &args->work); 1039 + queue_work(nbd->recv_workq, &args->work); 1062 1040 1063 1041 atomic_inc(&config->live_connections); 1064 1042 wake_up(&config->conn_wait); ··· 1159 1137 kfree(nbd->config); 1160 1138 nbd->config = NULL; 1161 1139 1140 + if (nbd->recv_workq) 1141 + destroy_workqueue(nbd->recv_workq); 1142 + nbd->recv_workq = NULL; 1143 + 1162 1144 nbd->tag_set.timeout = 0; 1163 1145 nbd->disk->queue->limits.discard_granularity = 0; 1164 1146 nbd->disk->queue->limits.discard_alignment = 0; ··· 1189 1163 !(config->flags & NBD_FLAG_CAN_MULTI_CONN)) { 1190 1164 dev_err(disk_to_dev(nbd->disk), "server does not support multiple connections per device.\n"); 1191 1165 return -EINVAL; 1166 + } 1167 + 1168 + nbd->recv_workq = alloc_workqueue("knbd%d-recv", 1169 + WQ_MEM_RECLAIM | WQ_HIGHPRI | 1170 + WQ_UNBOUND, 0, nbd->index); 1171 + if (!nbd->recv_workq) { 1172 + dev_err(disk_to_dev(nbd->disk), "Could not allocate knbd recv work queue.\n"); 1173 + return -ENOMEM; 1192 1174 } 1193 1175 1194 1176 blk_mq_update_nr_hw_queues(&nbd->tag_set, config->num_connections); ··· 1229 1195 INIT_WORK(&args->work, recv_work); 1230 1196 args->nbd = nbd; 1231 1197 args->index = i; 1232 - queue_work(recv_workqueue, &args->work); 1198 + queue_work(nbd->recv_workq, &args->work); 1233 1199 } 1234 1200 nbd_size_update(nbd); 1235 1201 return error; ··· 1249 1215 mutex_unlock(&nbd->config_lock); 1250 1216 ret = wait_event_interruptible(config->recv_wq, 1251 1217 atomic_read(&config->recv_threads) == 0); 1252 - if (ret) 1218 + if (ret) { 1253 1219 sock_shutdown(nbd); 1220 + flush_workqueue(nbd->recv_workq); 1221 + } 1254 1222 mutex_lock(&nbd->config_lock); 1255 1223 nbd_bdev_reset(bdev); 1256 1224 /* user requested, ignore socket errors */ ··· 1280 1244 blksize > PAGE_SIZE) 1281 1245 return false; 1282 1246 return true; 1247 + } 1248 + 1249 + static void nbd_set_cmd_timeout(struct nbd_device *nbd, u64 timeout) 1250 + { 1251 + nbd->tag_set.timeout = 
timeout * HZ; 1252 + if (timeout) 1253 + blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ); 1283 1254 } 1284 1255 1285 1256 /* Must be called with config_lock held */ ··· 1319 1276 nbd_size_set(nbd, config->blksize, arg); 1320 1277 return 0; 1321 1278 case NBD_SET_TIMEOUT: 1322 - if (arg) { 1323 - nbd->tag_set.timeout = arg * HZ; 1324 - blk_queue_rq_timeout(nbd->disk->queue, arg * HZ); 1325 - } 1279 + nbd_set_cmd_timeout(nbd, arg); 1326 1280 return 0; 1327 1281 1328 1282 case NBD_SET_FLAGS: ··· 1839 1799 if (ret) 1840 1800 goto out; 1841 1801 1842 - if (info->attrs[NBD_ATTR_TIMEOUT]) { 1843 - u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]); 1844 - nbd->tag_set.timeout = timeout * HZ; 1845 - blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ); 1846 - } 1802 + if (info->attrs[NBD_ATTR_TIMEOUT]) 1803 + nbd_set_cmd_timeout(nbd, 1804 + nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT])); 1847 1805 if (info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]) { 1848 1806 config->dead_conn_timeout = 1849 1807 nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]); ··· 1913 1875 nbd_disconnect(nbd); 1914 1876 nbd_clear_sock(nbd); 1915 1877 mutex_unlock(&nbd->config_lock); 1878 + /* 1879 + * Make sure recv thread has finished, so it does not drop the last 1880 + * config ref and try to destroy the workqueue from inside the work 1881 + * queue. 
1882 + */ 1883 + flush_workqueue(nbd->recv_workq); 1916 1884 if (test_and_clear_bit(NBD_HAS_CONFIG_REF, 1917 1885 &nbd->config->runtime_flags)) 1918 1886 nbd_config_put(nbd); ··· 2015 1971 if (ret) 2016 1972 goto out; 2017 1973 2018 - if (info->attrs[NBD_ATTR_TIMEOUT]) { 2019 - u64 timeout = nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT]); 2020 - nbd->tag_set.timeout = timeout * HZ; 2021 - blk_queue_rq_timeout(nbd->disk->queue, timeout * HZ); 2022 - } 1974 + if (info->attrs[NBD_ATTR_TIMEOUT]) 1975 + nbd_set_cmd_timeout(nbd, 1976 + nla_get_u64(info->attrs[NBD_ATTR_TIMEOUT])); 2023 1977 if (info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]) { 2024 1978 config->dead_conn_timeout = 2025 1979 nla_get_u64(info->attrs[NBD_ATTR_DEAD_CONN_TIMEOUT]); ··· 2303 2261 2304 2262 if (nbds_max > 1UL << (MINORBITS - part_shift)) 2305 2263 return -EINVAL; 2306 - recv_workqueue = alloc_workqueue("knbd-recv", 2307 - WQ_MEM_RECLAIM | WQ_HIGHPRI | 2308 - WQ_UNBOUND, 0); 2309 - if (!recv_workqueue) 2310 - return -ENOMEM; 2311 2264 2312 - if (register_blkdev(NBD_MAJOR, "nbd")) { 2313 - destroy_workqueue(recv_workqueue); 2265 + if (register_blkdev(NBD_MAJOR, "nbd")) 2314 2266 return -EIO; 2315 - } 2316 2267 2317 2268 if (genl_register_family(&nbd_genl_family)) { 2318 2269 unregister_blkdev(NBD_MAJOR, "nbd"); 2319 - destroy_workqueue(recv_workqueue); 2320 2270 return -EINVAL; 2321 2271 } 2322 2272 nbd_dbg_init(); ··· 2350 2316 2351 2317 idr_destroy(&nbd_index_idr); 2352 2318 genl_unregister_family(&nbd_genl_family); 2353 - destroy_workqueue(recv_workqueue); 2354 2319 unregister_blkdev(NBD_MAJOR, "nbd"); 2355 2320 } 2356 2321
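The nbd.c timeout changes can be read as two small rules: `nbd_set_cmd_timeout()` always records the user's value but only reprograms the request timer when it is non-zero, and the timeout handler treats a recorded value of zero as "never drop the socket", logging the accumulated runtime and re-arming the timer instead. Below is a simplified user-space sketch of those two rules only. All `demo_*` names and `DEMO_HZ` are assumptions for illustration, and the multi-connection retry path of the real handler is left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HZ 250	/* assumed tick rate, for illustration only */

enum demo_eh_return { DEMO_EH_DONE, DEMO_EH_RESET_TIMER };

struct demo_nbd {
	unsigned long tag_set_timeout;	/* 0 means "never drop the socket" */
	unsigned long queue_rq_timeout;	/* per-request timer, in ticks */
};

struct demo_cmd {
	int retries;
};

/* Mirrors the shape of nbd_set_cmd_timeout() in the diff. */
static void demo_set_cmd_timeout(struct demo_nbd *nbd, unsigned long secs)
{
	nbd->tag_set_timeout = secs * DEMO_HZ;
	if (secs)
		nbd->queue_rq_timeout = secs * DEMO_HZ;
}

/*
 * Simplified timeout handler: with a recorded timeout of 0 the request
 * is kept alive and only its accumulated runtime is reported.
 */
static enum demo_eh_return demo_xmit_timeout(const struct demo_nbd *nbd,
					     struct demo_cmd *cmd,
					     unsigned long *runtime_secs)
{
	assert(runtime_secs != NULL);

	if (!nbd->tag_set_timeout) {
		cmd->retries++;
		*runtime_secs = (nbd->queue_rq_timeout / DEMO_HZ) * cmd->retries;
		return DEMO_EH_RESET_TIMER;
	}

	/* The real driver marks the command failed and shuts the socket down. */
	*runtime_secs = nbd->queue_rq_timeout / DEMO_HZ;
	return DEMO_EH_DONE;
}
```

Resetting `retries` to 0 when a command is sent, as the diff does in the send path, keeps the reported runtime tied to the current attempt.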

+11 -7

drivers/block/null_blk.h

···
 #ifndef __BLK_NULL_BLK_H
 #define __BLK_NULL_BLK_H

+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/blkdev.h>
 #include <linux/slab.h>
 #include <linux/blk-mq.h>
···
 void null_zone_exit(struct nullb_device *dev);
 int null_zone_report(struct gendisk *disk, sector_t sector,
 		     struct blk_zone *zones, unsigned int *nr_zones);
-void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
-			unsigned int nr_sectors);
-void null_zone_reset(struct nullb_cmd *cmd, sector_t sector);
+blk_status_t null_handle_zoned(struct nullb_cmd *cmd,
+				enum req_opf op, sector_t sector,
+				sector_t nr_sectors);
 #else
 static inline int null_zone_init(struct nullb_device *dev)
 {
-	pr_err("null_blk: CONFIG_BLK_DEV_ZONED not enabled\n");
+	pr_err("CONFIG_BLK_DEV_ZONED not enabled\n");
 	return -EINVAL;
 }
 static inline void null_zone_exit(struct nullb_device *dev) {}
···
 {
 	return -EOPNOTSUPP;
 }
-static inline void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
-				   unsigned int nr_sectors)
+static inline blk_status_t null_handle_zoned(struct nullb_cmd *cmd,
+					     enum req_opf op, sector_t sector,
+					     sector_t nr_sectors)
 {
+	return BLK_STS_NOTSUPP;
 }
-static inline void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) {}
 #endif /* CONFIG_BLK_DEV_ZONED */
 #endif /* __NULL_BLK_H */
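The `pr_fmt` definition added to null_blk.h is what allows the hand-written "null_blk: " prefixes to be dropped from the individual messages: the prefix is glued onto every format string at compile time through string-literal concatenation. A rough user-space sketch of the mechanism follows. `DEMO_MODNAME`, `demo_pr_err` and `demo_has_prefix` are hypothetical stand-ins (the kernel's `pr_err()` goes through `printk()` and is not reproduced here).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_MODNAME "null_blk"	/* stand-in for KBUILD_MODNAME */

/* Same shape as the definition added to null_blk.h. */
#undef pr_fmt
#define pr_fmt(fmt) DEMO_MODNAME ": " fmt

/*
 * Hypothetical user-space stand-in for pr_err(): formats into buf.
 * Plain C99 variadic macro, so at least one argument after fmt is
 * required (the kernel macros do not have this restriction).
 */
#define demo_pr_err(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), __VA_ARGS__)

/* Returns 1 if the formatted message carries the module prefix. */
static int demo_has_prefix(const char *msg)
{
	assert(msg != NULL);
	return strncmp(msg, DEMO_MODNAME ": ", strlen(DEMO_MODNAME ": ")) == 0;
}
```

For example, `demo_pr_err(buf, sizeof(buf), "defaults block size to %lu\n", 4096UL)` writes a message that starts with the module name even though the call site never mentions it.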

+94 -85

drivers/block/null_blk_main.c

··· 141 141 module_param_named(bs, g_bs, int, 0444); 142 142 MODULE_PARM_DESC(bs, "Block size (in bytes)"); 143 143 144 - static int nr_devices = 1; 145 - module_param(nr_devices, int, 0444); 144 + static unsigned int nr_devices = 1; 145 + module_param(nr_devices, uint, 0444); 146 146 MODULE_PARM_DESC(nr_devices, "Number of devices to register"); 147 147 148 148 static bool g_blocking; ··· 1133 1133 blk_mq_start_stopped_hw_queues(q, true); 1134 1134 } 1135 1135 1136 - static blk_status_t null_handle_cmd(struct nullb_cmd *cmd) 1136 + static inline blk_status_t null_handle_throttled(struct nullb_cmd *cmd) 1137 1137 { 1138 1138 struct nullb_device *dev = cmd->nq->dev; 1139 1139 struct nullb *nullb = dev->nullb; 1140 - int err = 0; 1140 + blk_status_t sts = BLK_STS_OK; 1141 + struct request *rq = cmd->rq; 1141 1142 1142 - if (test_bit(NULLB_DEV_FL_THROTTLED, &dev->flags)) { 1143 - struct request *rq = cmd->rq; 1143 + if (!hrtimer_active(&nullb->bw_timer)) 1144 + hrtimer_restart(&nullb->bw_timer); 1144 1145 1145 - if (!hrtimer_active(&nullb->bw_timer)) 1146 - hrtimer_restart(&nullb->bw_timer); 1147 - 1148 - if (atomic_long_sub_return(blk_rq_bytes(rq), 1149 - &nullb->cur_bytes) < 0) { 1150 - null_stop_queue(nullb); 1151 - /* race with timer */ 1152 - if (atomic_long_read(&nullb->cur_bytes) > 0) 1153 - null_restart_queue_async(nullb); 1154 - /* requeue request */ 1155 - return BLK_STS_DEV_RESOURCE; 1156 - } 1146 + if (atomic_long_sub_return(blk_rq_bytes(rq), &nullb->cur_bytes) < 0) { 1147 + null_stop_queue(nullb); 1148 + /* race with timer */ 1149 + if (atomic_long_read(&nullb->cur_bytes) > 0) 1150 + null_restart_queue_async(nullb); 1151 + /* requeue request */ 1152 + sts = BLK_STS_DEV_RESOURCE; 1157 1153 } 1154 + return sts; 1155 + } 1158 1156 1159 - if (nullb->dev->badblocks.shift != -1) { 1160 - int bad_sectors; 1161 - sector_t sector, size, first_bad; 1162 - bool is_flush = true; 1157 + static inline blk_status_t null_handle_badblocks(struct nullb_cmd *cmd, 1158 + 
sector_t sector, 1159 + sector_t nr_sectors) 1160 + { 1161 + struct badblocks *bb = &cmd->nq->dev->badblocks; 1162 + sector_t first_bad; 1163 + int bad_sectors; 1163 1164 1164 - if (dev->queue_mode == NULL_Q_BIO && 1165 - bio_op(cmd->bio) != REQ_OP_FLUSH) { 1166 - is_flush = false; 1167 - sector = cmd->bio->bi_iter.bi_sector; 1168 - size = bio_sectors(cmd->bio); 1169 - } 1170 - if (dev->queue_mode != NULL_Q_BIO && 1171 - req_op(cmd->rq) != REQ_OP_FLUSH) { 1172 - is_flush = false; 1173 - sector = blk_rq_pos(cmd->rq); 1174 - size = blk_rq_sectors(cmd->rq); 1175 - } 1176 - if (!is_flush && badblocks_check(&nullb->dev->badblocks, sector, 1177 - size, &first_bad, &bad_sectors)) { 1178 - cmd->error = BLK_STS_IOERR; 1179 - goto out; 1180 - } 1181 - } 1165 + if (badblocks_check(bb, sector, nr_sectors, &first_bad, &bad_sectors)) 1166 + return BLK_STS_IOERR; 1182 1167 1183 - if (dev->memory_backed) { 1184 - if (dev->queue_mode == NULL_Q_BIO) { 1185 - if (bio_op(cmd->bio) == REQ_OP_FLUSH) 1186 - err = null_handle_flush(nullb); 1187 - else 1188 - err = null_handle_bio(cmd); 1189 - } else { 1190 - if (req_op(cmd->rq) == REQ_OP_FLUSH) 1191 - err = null_handle_flush(nullb); 1192 - else 1193 - err = null_handle_rq(cmd); 1194 - } 1195 - } 1196 - cmd->error = errno_to_blk_status(err); 1168 + return BLK_STS_OK; 1169 + } 1197 1170 1198 - if (!cmd->error && dev->zoned) { 1199 - sector_t sector; 1200 - unsigned int nr_sectors; 1201 - enum req_opf op; 1171 + static inline blk_status_t null_handle_memory_backed(struct nullb_cmd *cmd, 1172 + enum req_opf op) 1173 + { 1174 + struct nullb_device *dev = cmd->nq->dev; 1175 + int err; 1202 1176 1203 - if (dev->queue_mode == NULL_Q_BIO) { 1204 - op = bio_op(cmd->bio); 1205 - sector = cmd->bio->bi_iter.bi_sector; 1206 - nr_sectors = cmd->bio->bi_iter.bi_size >> 9; 1207 - } else { 1208 - op = req_op(cmd->rq); 1209 - sector = blk_rq_pos(cmd->rq); 1210 - nr_sectors = blk_rq_sectors(cmd->rq); 1211 - } 1177 + if (dev->queue_mode == NULL_Q_BIO) 1178 + 
err = null_handle_bio(cmd); 1179 + else 1180 + err = null_handle_rq(cmd); 1212 1181 1213 - if (op == REQ_OP_WRITE) 1214 - null_zone_write(cmd, sector, nr_sectors); 1215 - else if (op == REQ_OP_ZONE_RESET) 1216 - null_zone_reset(cmd, sector); 1217 - } 1218 - out: 1182 + return errno_to_blk_status(err); 1183 + } 1184 + 1185 + static inline void nullb_complete_cmd(struct nullb_cmd *cmd) 1186 + { 1219 1187 /* Complete IO by inline, softirq or timer */ 1220 - switch (dev->irqmode) { 1188 + switch (cmd->nq->dev->irqmode) { 1221 1189 case NULL_IRQ_SOFTIRQ: 1222 - switch (dev->queue_mode) { 1190 + switch (cmd->nq->dev->queue_mode) { 1223 1191 case NULL_Q_MQ: 1224 1192 blk_mq_complete_request(cmd->rq); 1225 1193 break; ··· 1206 1238 null_cmd_end_timer(cmd); 1207 1239 break; 1208 1240 } 1241 + } 1242 + 1243 + static blk_status_t null_handle_cmd(struct nullb_cmd *cmd, sector_t sector, 1244 + sector_t nr_sectors, enum req_opf op) 1245 + { 1246 + struct nullb_device *dev = cmd->nq->dev; 1247 + struct nullb *nullb = dev->nullb; 1248 + blk_status_t sts; 1249 + 1250 + if (test_bit(NULLB_DEV_FL_THROTTLED, &dev->flags)) { 1251 + sts = null_handle_throttled(cmd); 1252 + if (sts != BLK_STS_OK) 1253 + return sts; 1254 + } 1255 + 1256 + if (op == REQ_OP_FLUSH) { 1257 + cmd->error = errno_to_blk_status(null_handle_flush(nullb)); 1258 + goto out; 1259 + } 1260 + 1261 + if (nullb->dev->badblocks.shift != -1) { 1262 + cmd->error = null_handle_badblocks(cmd, sector, nr_sectors); 1263 + if (cmd->error != BLK_STS_OK) 1264 + goto out; 1265 + } 1266 + 1267 + if (dev->memory_backed) 1268 + cmd->error = null_handle_memory_backed(cmd, op); 1269 + 1270 + if (!cmd->error && dev->zoned) 1271 + cmd->error = null_handle_zoned(cmd, op, sector, nr_sectors); 1272 + 1273 + out: 1274 + nullb_complete_cmd(cmd); 1209 1275 return BLK_STS_OK; 1210 1276 } 1211 1277 ··· 1282 1280 1283 1281 static blk_qc_t null_queue_bio(struct request_queue *q, struct bio *bio) 1284 1282 { 1283 + sector_t sector = 
bio->bi_iter.bi_sector; 1284 + sector_t nr_sectors = bio_sectors(bio); 1285 1285 struct nullb *nullb = q->queuedata; 1286 1286 struct nullb_queue *nq = nullb_to_queue(nullb); 1287 1287 struct nullb_cmd *cmd; ··· 1291 1287 cmd = alloc_cmd(nq, 1); 1292 1288 cmd->bio = bio; 1293 1289 1294 - null_handle_cmd(cmd); 1290 + null_handle_cmd(cmd, sector, nr_sectors, bio_op(bio)); 1295 1291 return BLK_QC_T_NONE; 1296 1292 } 1297 1293 ··· 1315 1311 1316 1312 static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res) 1317 1313 { 1318 - pr_info("null: rq %p timed out\n", rq); 1314 + pr_info("rq %p timed out\n", rq); 1319 1315 blk_mq_complete_request(rq); 1320 1316 return BLK_EH_DONE; 1321 1317 } ··· 1325 1321 { 1326 1322 struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq); 1327 1323 struct nullb_queue *nq = hctx->driver_data; 1324 + sector_t nr_sectors = blk_rq_sectors(bd->rq); 1325 + sector_t sector = blk_rq_pos(bd->rq); 1328 1326 1329 1327 might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING); 1330 1328 ··· 1355 1349 if (should_timeout_request(bd->rq)) 1356 1350 return BLK_STS_OK; 1357 1351 1358 - return null_handle_cmd(cmd); 1352 + return null_handle_cmd(cmd, sector, nr_sectors, req_op(bd->rq)); 1359 1353 } 1360 1354 1361 1355 static const struct blk_mq_ops null_mq_ops = { ··· 1694 1688 1695 1689 blk_queue_chunk_sectors(nullb->q, dev->zone_size_sects); 1696 1690 nullb->q->limits.zoned = BLK_ZONED_HM; 1691 + blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, nullb->q); 1692 + blk_queue_required_elevator_features(nullb->q, 1693 + ELEVATOR_F_ZBD_SEQ_WRITE); 1697 1694 } 1698 1695 1699 1696 nullb->q->queuedata = nullb; ··· 1748 1739 struct nullb_device *dev; 1749 1740 1750 1741 if (g_bs > PAGE_SIZE) { 1751 - pr_warn("null_blk: invalid block size\n"); 1752 - pr_warn("null_blk: defaults block size to %lu\n", PAGE_SIZE); 1742 + pr_warn("invalid block size\n"); 1743 + pr_warn("defaults block size to %lu\n", PAGE_SIZE); 1753 1744 g_bs = PAGE_SIZE; 1754 1745 } 1755 1746 1756 1747 if 
(!is_power_of_2(g_zone_size)) { 1757 - pr_err("null_blk: zone_size must be power-of-two\n"); 1748 + pr_err("zone_size must be power-of-two\n"); 1758 1749 return -EINVAL; 1759 1750 } 1760 1751 1761 1752 if (g_home_node != NUMA_NO_NODE && g_home_node >= nr_online_nodes) { 1762 - pr_err("null_blk: invalid home_node value\n"); 1753 + pr_err("invalid home_node value\n"); 1763 1754 g_home_node = NUMA_NO_NODE; 1764 1755 } 1765 1756 1766 1757 if (g_queue_mode == NULL_Q_RQ) { 1767 - pr_err("null_blk: legacy IO path no longer available\n"); 1758 + pr_err("legacy IO path no longer available\n"); 1768 1759 return -EINVAL; 1769 1760 } 1770 1761 if (g_queue_mode == NULL_Q_MQ && g_use_per_node_hctx) { 1771 1762 if (g_submit_queues != nr_online_nodes) { 1772 - pr_warn("null_blk: submit_queues param is set to %u.\n", 1763 + pr_warn("submit_queues param is set to %u.\n", 1773 1764 nr_online_nodes); 1774 1765 g_submit_queues = nr_online_nodes; 1775 1766 } ··· 1812 1803 } 1813 1804 } 1814 1805 1815 - pr_info("null: module loaded\n"); 1806 + pr_info("module loaded\n"); 1816 1807 return 0; 1817 1808 1818 1809 err_dev:
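The null_blk_main.c refactor turns `null_handle_cmd()` into a short pipeline: throttle first (and requeue without completing if the budget is exhausted), short-circuit flushes, then run the bad-block check, and complete the command with whatever status was recorded. The sketch below models only that control flow in user space. Every `demo_*` name is an assumption, the budget and bad-range bookkeeping are simplified, and the memory-backed and zoned stages are reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_op { DEMO_READ, DEMO_WRITE, DEMO_FLUSH };
enum demo_sts { DEMO_OK = 0, DEMO_IOERR, DEMO_DEV_RESOURCE };

struct demo_dev {
	bool throttled;
	long budget_bytes;			/* remaining bandwidth budget */
	bool has_badblocks;
	unsigned long long bad_start, bad_len;	/* one bad range, in sectors */
};

struct demo_cmd {
	enum demo_sts error;
	bool completed;
};

/* Charge the request against the budget; ask for a requeue when it runs out. */
static enum demo_sts demo_throttle(struct demo_dev *dev, long bytes)
{
	dev->budget_bytes -= bytes;
	return dev->budget_bytes < 0 ? DEMO_DEV_RESOURCE : DEMO_OK;
}

/* Fail the request if [sector, sector + nr) overlaps the bad range. */
static enum demo_sts demo_badblocks(const struct demo_dev *dev,
				    unsigned long long sector,
				    unsigned long long nr)
{
	bool overlap = sector < dev->bad_start + dev->bad_len &&
		       dev->bad_start < sector + nr;

	return overlap ? DEMO_IOERR : DEMO_OK;
}

static enum demo_sts demo_handle_cmd(struct demo_dev *dev, struct demo_cmd *cmd,
				     unsigned long long sector,
				     unsigned long long nr, enum demo_op op)
{
	assert(dev != NULL && cmd != NULL);

	if (dev->throttled) {
		enum demo_sts sts = demo_throttle(dev, (long)(nr * 512));

		if (sts != DEMO_OK)
			return sts;	/* requeue: the command is NOT completed */
	}

	cmd->error = DEMO_OK;
	if (op == DEMO_FLUSH)
		goto out;		/* a flush carries no data range */

	if (dev->has_badblocks) {
		cmd->error = demo_badblocks(dev, sector, nr);
		if (cmd->error != DEMO_OK)
			goto out;
	}

	/* The memory-backed and zoned stages would run here. */
out:
	cmd->completed = true;
	return DEMO_OK;
}
```

Passing `sector`, `nr_sectors` and `op` in from the caller, as the diff does, means each stage no longer has to distinguish between bio mode and request mode.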

+43 -16

drivers/block/null_blk_zoned.c

···
 	unsigned int i;

 	if (!is_power_of_2(dev->zone_size)) {
-		pr_err("null_blk: zone_size must be power-of-two\n");
+		pr_err("zone_size must be power-of-two\n");
 		return -EINVAL;
 	}

···

 	if (dev->zone_nr_conv >= dev->nr_zones) {
 		dev->zone_nr_conv = dev->nr_zones - 1;
-		pr_info("null_blk: changed the number of conventional zones to %u",
+		pr_info("changed the number of conventional zones to %u",
 			dev->zone_nr_conv);
 	}

···
 	return 0;
 }

-void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
+static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
 		     unsigned int nr_sectors)
 {
 	struct nullb_device *dev = cmd->nq->dev;
···
 	case BLK_ZONE_COND_FULL:
 		/* Cannot write to a full zone */
 		cmd->error = BLK_STS_IOERR;
-		break;
+		return BLK_STS_IOERR;
 	case BLK_ZONE_COND_EMPTY:
 	case BLK_ZONE_COND_IMP_OPEN:
 		/* Writes must be at the write pointer position */
-		if (sector != zone->wp) {
-			cmd->error = BLK_STS_IOERR;
-			break;
-		}
+		if (sector != zone->wp)
+			return BLK_STS_IOERR;

 		if (zone->cond == BLK_ZONE_COND_EMPTY)
 			zone->cond = BLK_ZONE_COND_IMP_OPEN;
···
 		break;
 	default:
 		/* Invalid zone condition */
-		cmd->error = BLK_STS_IOERR;
-		break;
+		return BLK_STS_IOERR;
 	}
+	return BLK_STS_OK;
 }

-void null_zone_reset(struct nullb_cmd *cmd, sector_t sector)
+static blk_status_t null_zone_reset(struct nullb_cmd *cmd, sector_t sector)
 {
 	struct nullb_device *dev = cmd->nq->dev;
 	unsigned int zno = null_zone_no(dev, sector);
 	struct blk_zone *zone = &dev->zones[zno];
+	size_t i;

-	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL) {
-		cmd->error = BLK_STS_IOERR;
-		return;
+	switch (req_op(cmd->rq)) {
+	case REQ_OP_ZONE_RESET_ALL:
+		for (i = 0; i < dev->nr_zones; i++) {
+			if (zone[i].type == BLK_ZONE_TYPE_CONVENTIONAL)
+				continue;
+			zone[i].cond = BLK_ZONE_COND_EMPTY;
+			zone[i].wp = zone[i].start;
+		}
+		break;
+	case REQ_OP_ZONE_RESET:
+		if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
+			return BLK_STS_IOERR;
+
+		zone->cond = BLK_ZONE_COND_EMPTY;
+		zone->wp = zone->start;
+		break;
+	default:
+		cmd->error = BLK_STS_NOTSUPP;
+		break;
 	}
+	return BLK_STS_OK;
+}

-	zone->cond = BLK_ZONE_COND_EMPTY;
-	zone->wp = zone->start;
+blk_status_t null_handle_zoned(struct nullb_cmd *cmd, enum req_opf op,
+			       sector_t sector, sector_t nr_sectors)
+{
+	switch (op) {
+	case REQ_OP_WRITE:
+		return null_zone_write(cmd, sector, nr_sectors);
+	case REQ_OP_ZONE_RESET:
+	case REQ_OP_ZONE_RESET_ALL:
+		return null_zone_reset(cmd, sector);
+	default:
+		return BLK_STS_OK;
+	}
 }
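The zoned changes in null_blk_zoned.c come down to a small state machine per zone: writes to a sequential zone must land exactly on the write pointer, a reset rewinds the pointer and empties the zone, and "reset all" applies that to every zone except conventional ones. The following user-space model is a sketch under stated assumptions. The `demo_*` types are hypothetical, writes to conventional zones are not modelled, and the write-pointer advance inside `demo_zone_write` is an assumption because that part of the function sits in a gap between hunks.

```c
#include <assert.h>
#include <stddef.h>

enum demo_zone_type { DEMO_CONVENTIONAL, DEMO_SEQ_REQ };
enum demo_zone_cond { DEMO_EMPTY, DEMO_IMP_OPEN, DEMO_FULL };
enum demo_sts { DEMO_OK = 0, DEMO_IOERR };

struct demo_zone {
	enum demo_zone_type type;
	enum demo_zone_cond cond;
	unsigned long long start, len, wp;	/* all in sectors */
};

/* Write to a sequential zone: only allowed at the write pointer. */
static enum demo_sts demo_zone_write(struct demo_zone *z,
				     unsigned long long sector,
				     unsigned long long nr)
{
	assert(z != NULL);

	switch (z->cond) {
	case DEMO_EMPTY:
	case DEMO_IMP_OPEN:
		if (sector != z->wp)
			return DEMO_IOERR;
		z->cond = DEMO_IMP_OPEN;
		z->wp += nr;		/* assumed: not visible in the diff */
		if (z->wp == z->start + z->len)
			z->cond = DEMO_FULL;
		return DEMO_OK;
	default:
		return DEMO_IOERR;	/* full or invalid condition */
	}
}

/* Reset one zone; conventional zones cannot be reset. */
static enum demo_sts demo_zone_reset(struct demo_zone *z)
{
	if (z->type == DEMO_CONVENTIONAL)
		return DEMO_IOERR;
	z->cond = DEMO_EMPTY;
	z->wp = z->start;
	return DEMO_OK;
}

/* Reset every zone, skipping conventional ones, starting at zone 0. */
static void demo_zone_reset_all(struct demo_zone *zones, size_t nr_zones)
{
	for (size_t i = 0; i < nr_zones; i++) {
		if (zones[i].type == DEMO_CONVENTIONAL)
			continue;
		zones[i].cond = DEMO_EMPTY;
		zones[i].wp = zones[i].start;
	}
}
```

Returning a status from each helper, instead of writing `cmd->error` as a side effect, is what lets a single dispatcher such as `null_handle_zoned()` forward the result to its caller.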

+7 -5

drivers/block/paride/pcd.c

···
 		disk->queue = blk_mq_init_sq_queue(&cd->tag_set, &pcd_mq_ops,
 						   1, BLK_MQ_F_SHOULD_MERGE);
 		if (IS_ERR(disk->queue)) {
-			put_disk(disk);
 			disk->queue = NULL;
+			put_disk(disk);
 			continue;
 		}

···
 	k = 0;
 	if (pcd_drive_count == 0) { /* nothing spec'd - so autoprobe for 1 */
 		cd = pcd;
-		if (pi_init(cd->pi, 1, -1, -1, -1, -1, -1, pcd_buffer,
-			    PI_PCD, verbose, cd->name)) {
-			if (!pcd_probe(cd, -1, id) && cd->disk) {
+		if (cd->disk && pi_init(cd->pi, 1, -1, -1, -1, -1, -1,
+			    pcd_buffer, PI_PCD, verbose, cd->name)) {
+			if (!pcd_probe(cd, -1, id)) {
 				cd->present = 1;
 				k++;
 			} else
···
 			int *conf = *drives[unit];
 			if (!conf[D_PRT])
 				continue;
+			if (!cd->disk)
+				continue;
 			if (!pi_init(cd->pi, 0, conf[D_PRT], conf[D_MOD],
 				     conf[D_UNI], conf[D_PRO], conf[D_DLY],
 				     pcd_buffer, PI_PCD, verbose, cd->name))
 				continue;
-			if (!pcd_probe(cd, conf[D_SLV], id) && cd->disk) {
+			if (!pcd_probe(cd, conf[D_SLV], id)) {
 				cd->present = 1;
 				k++;
 			} else

+1 -1

drivers/block/paride/pf.c

···
 		disk->queue = blk_mq_init_sq_queue(&pf->tag_set, &pf_mq_ops,
 							1, BLK_MQ_F_SHOULD_MERGE);
 		if (IS_ERR(disk->queue)) {
-			put_disk(disk);
 			disk->queue = NULL;
+			put_disk(disk);
 			continue;
 		}

+63 -34

drivers/lightnvm/core.c

··· 4 4 * Initial release: Matias Bjorling <m@bjorling.me> 5 5 */ 6 6 7 + #define pr_fmt(fmt) "nvm: " fmt 8 + 7 9 #include <linux/list.h> 8 10 #include <linux/types.h> 9 11 #include <linux/sem.h> ··· 76 74 77 75 for (i = lun_begin; i <= lun_end; i++) { 78 76 if (test_and_set_bit(i, dev->lun_map)) { 79 - pr_err("nvm: lun %d already allocated\n", i); 77 + pr_err("lun %d already allocated\n", i); 80 78 goto err; 81 79 } 82 80 } ··· 266 264 int lun_end) 267 265 { 268 266 if (lun_begin > lun_end || lun_end >= geo->all_luns) { 269 - pr_err("nvm: lun out of bound (%u:%u > %u)\n", 267 + pr_err("lun out of bound (%u:%u > %u)\n", 270 268 lun_begin, lun_end, geo->all_luns - 1); 271 269 return -EINVAL; 272 270 } ··· 299 297 if (e->op == 0xFFFF) { 300 298 e->op = NVM_TARGET_DEFAULT_OP; 301 299 } else if (e->op < NVM_TARGET_MIN_OP || e->op > NVM_TARGET_MAX_OP) { 302 - pr_err("nvm: invalid over provisioning value\n"); 300 + pr_err("invalid over provisioning value\n"); 303 301 return -EINVAL; 304 302 } 305 303 ··· 336 334 e = create->conf.e; 337 335 break; 338 336 default: 339 - pr_err("nvm: config type not valid\n"); 337 + pr_err("config type not valid\n"); 340 338 return -EINVAL; 341 339 } 342 340 343 341 tt = nvm_find_target_type(create->tgttype); 344 342 if (!tt) { 345 - pr_err("nvm: target type %s not found\n", create->tgttype); 343 + pr_err("target type %s not found\n", create->tgttype); 346 344 return -EINVAL; 347 345 } 348 346 349 347 if ((tt->flags & NVM_TGT_F_HOST_L2P) != (dev->geo.dom & NVM_RSP_L2P)) { 350 - pr_err("nvm: device is incompatible with target L2P type.\n"); 348 + pr_err("device is incompatible with target L2P type.\n"); 351 349 return -EINVAL; 352 350 } 353 351 354 352 if (nvm_target_exists(create->tgtname)) { 355 - pr_err("nvm: target name already exists (%s)\n", 353 + pr_err("target name already exists (%s)\n", 356 354 create->tgtname); 357 355 return -EINVAL; 358 356 } ··· 369 367 370 368 tgt_dev = nvm_create_tgt_dev(dev, e.lun_begin, e.lun_end, e.op); 
371 369 if (!tgt_dev) { 372 - pr_err("nvm: could not create target device\n"); 370 + pr_err("could not create target device\n"); 373 371 ret = -ENOMEM; 374 372 goto err_t; 375 373 } ··· 495 493 } 496 494 up_read(&nvm_lock); 497 495 498 - if (!t) 496 + if (!t) { 497 + pr_err("failed to remove target %s\n", 498 + remove->tgtname); 499 499 return 1; 500 + } 500 501 501 502 __nvm_remove_target(t, true); 502 503 kref_put(&dev->ref, nvm_free); ··· 691 686 rqd->nr_ppas = nr_ppas; 692 687 rqd->ppa_list = nvm_dev_dma_alloc(dev, GFP_KERNEL, &rqd->dma_ppa_list); 693 688 if (!rqd->ppa_list) { 694 - pr_err("nvm: failed to allocate dma memory\n"); 689 + pr_err("failed to allocate dma memory\n"); 695 690 return -ENOMEM; 696 691 } 697 692 ··· 736 731 return flags; 737 732 } 738 733 739 - int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd) 734 + int nvm_submit_io(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd, void *buf) 740 735 { 741 736 struct nvm_dev *dev = tgt_dev->parent; 742 737 int ret; ··· 750 745 rqd->flags = nvm_set_flags(&tgt_dev->geo, rqd); 751 746 752 747 /* In case of error, fail with right address format */ 753 - ret = dev->ops->submit_io(dev, rqd); 748 + ret = dev->ops->submit_io(dev, rqd, buf); 754 749 if (ret) 755 750 nvm_rq_dev_to_tgt(tgt_dev, rqd); 756 751 return ret; 757 752 } 758 753 EXPORT_SYMBOL(nvm_submit_io); 759 754 760 - int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd) 755 + static void nvm_sync_end_io(struct nvm_rq *rqd) 756 + { 757 + struct completion *waiting = rqd->private; 758 + 759 + complete(waiting); 760 + } 761 + 762 + static int nvm_submit_io_wait(struct nvm_dev *dev, struct nvm_rq *rqd, 763 + void *buf) 764 + { 765 + DECLARE_COMPLETION_ONSTACK(wait); 766 + int ret = 0; 767 + 768 + rqd->end_io = nvm_sync_end_io; 769 + rqd->private = &wait; 770 + 771 + ret = dev->ops->submit_io(dev, rqd, buf); 772 + if (ret) 773 + return ret; 774 + 775 + wait_for_completion_io(&wait); 776 + 777 + return 0; 778 + } 779 + 780 
+ int nvm_submit_io_sync(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd, 781 + void *buf) 761 782 { 762 783 struct nvm_dev *dev = tgt_dev->parent; 763 784 int ret; 764 785 765 - if (!dev->ops->submit_io_sync) 786 + if (!dev->ops->submit_io) 766 787 return -ENODEV; 767 788 768 789 nvm_rq_tgt_to_dev(tgt_dev, rqd); ··· 796 765 rqd->dev = tgt_dev; 797 766 rqd->flags = nvm_set_flags(&tgt_dev->geo, rqd); 798 767 799 - /* In case of error, fail with right address format */ 800 - ret = dev->ops->submit_io_sync(dev, rqd); 801 - nvm_rq_dev_to_tgt(tgt_dev, rqd); 768 + ret = nvm_submit_io_wait(dev, rqd, buf); 802 769 803 770 return ret; 804 771 } ··· 817 788 818 789 static int nvm_submit_io_sync_raw(struct nvm_dev *dev, struct nvm_rq *rqd) 819 790 { 820 - if (!dev->ops->submit_io_sync) 791 + if (!dev->ops->submit_io) 821 792 return -ENODEV; 822 793 794 + rqd->dev = NULL; 823 795 rqd->flags = nvm_set_flags(&dev->geo, rqd); 824 796 825 - return dev->ops->submit_io_sync(dev, rqd); 797 + return nvm_submit_io_wait(dev, rqd, NULL); 826 798 } 827 799 828 800 static int nvm_bb_chunk_sense(struct nvm_dev *dev, struct ppa_addr ppa) ··· 1078 1048 return 0; 1079 1049 1080 1050 if (nr_ppas > NVM_MAX_VLBA) { 1081 - pr_err("nvm: unable to update all blocks atomically\n"); 1051 + pr_err("unable to update all blocks atomically\n"); 1082 1052 return -EINVAL; 1083 1053 } 1084 1054 ··· 1141 1111 int ret = -EINVAL; 1142 1112 1143 1113 if (dev->ops->identity(dev)) { 1144 - pr_err("nvm: device could not be identified\n"); 1114 + pr_err("device could not be identified\n"); 1145 1115 goto err; 1146 1116 } 1147 1117 1148 - pr_debug("nvm: ver:%u.%u nvm_vendor:%x\n", 1149 - geo->major_ver_id, geo->minor_ver_id, 1150 - geo->vmnt); 1118 + pr_debug("ver:%u.%u nvm_vendor:%x\n", geo->major_ver_id, 1119 + geo->minor_ver_id, geo->vmnt); 1151 1120 1152 1121 ret = nvm_core_init(dev); 1153 1122 if (ret) { 1154 - pr_err("nvm: could not initialize core structures.\n"); 1123 + pr_err("could not initialize core 
structures.\n"); 1155 1124 goto err; 1156 1125 } 1157 1126 1158 - pr_info("nvm: registered %s [%u/%u/%u/%u/%u]\n", 1127 + pr_info("registered %s [%u/%u/%u/%u/%u]\n", 1159 1128 dev->name, dev->geo.ws_min, dev->geo.ws_opt, 1160 1129 dev->geo.num_chk, dev->geo.all_luns, 1161 1130 dev->geo.num_ch); 1162 1131 return 0; 1163 1132 err: 1164 - pr_err("nvm: failed to initialize nvm\n"); 1133 + pr_err("failed to initialize nvm\n"); 1165 1134 return ret; 1166 1135 } 1167 1136 ··· 1198 1169 dev->dma_pool = dev->ops->create_dma_pool(dev, "ppalist", 1199 1170 exp_pool_size); 1200 1171 if (!dev->dma_pool) { 1201 - pr_err("nvm: could not create dma pool\n"); 1172 + pr_err("could not create dma pool\n"); 1202 1173 kref_put(&dev->ref, nvm_free); 1203 1174 return -ENOMEM; 1204 1175 } ··· 1243 1214 up_write(&nvm_lock); 1244 1215 1245 1216 if (!dev) { 1246 - pr_err("nvm: device not found\n"); 1217 + pr_err("device not found\n"); 1247 1218 return -EINVAL; 1248 1219 } 1249 1220 ··· 1317 1288 i++; 1318 1289 1319 1290 if (i > 31) { 1320 - pr_err("nvm: max 31 devices can be reported.\n"); 1291 + pr_err("max 31 devices can be reported.\n"); 1321 1292 break; 1322 1293 } 1323 1294 } ··· 1344 1315 1345 1316 if (create.conf.type == NVM_CONFIG_TYPE_EXTENDED && 1346 1317 create.conf.e.rsv != 0) { 1347 - pr_err("nvm: reserved config field in use\n"); 1318 + pr_err("reserved config field in use\n"); 1348 1319 return -EINVAL; 1349 1320 } 1350 1321 ··· 1360 1331 flags &= ~NVM_TARGET_FACTORY; 1361 1332 1362 1333 if (flags) { 1363 - pr_err("nvm: flag not supported\n"); 1334 + pr_err("flag not supported\n"); 1364 1335 return -EINVAL; 1365 1336 } 1366 1337 } ··· 1378 1349 remove.tgtname[DISK_NAME_LEN - 1] = '\0'; 1379 1350 1380 1351 if (remove.flags != 0) { 1381 - pr_err("nvm: no flags supported\n"); 1352 + pr_err("no flags supported\n"); 1382 1353 return -EINVAL; 1383 1354 } 1384 1355 ··· 1394 1365 return -EFAULT; 1395 1366 1396 1367 if (init.flags != 0) { 1397 - pr_err("nvm: no flags supported\n"); 1368 
+ pr_err("no flags supported\n"); 1398 1369 return -EINVAL; 1399 1370 } 1400 1371
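The `nvm_submit_io_wait()` helper added above builds a synchronous call out of the asynchronous `submit_io` hook: it points `end_io` at a callback that completes an on-stack completion, submits, and sleeps until the callback fires. The following is a minimal user-space sketch of that pattern, not the kernel code. It assumes pthreads in place of the kernel completion API, and `struct request`, `submit_async()`, `device_thread()` and `submit_and_wait()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for the kernel's struct completion (assumption). */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void destroy_completion(struct completion *c)
{
	pthread_cond_destroy(&c->cond);
	pthread_mutex_destroy(&c->lock);
}

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* Hypothetical request, loosely modelled on struct nvm_rq. */
struct request {
	void (*end_io)(struct request *rq);	/* completion callback */
	void *private;				/* callback context */
	int error;				/* filled in by the fake device */
	pthread_t worker;			/* sketch only: the fake device */
};

/* Same role as nvm_sync_end_io(): wake the submitter. */
static void sync_end_io(struct request *rq)
{
	complete(rq->private);
}

/* Fake device: finishes the request on another thread. */
static void *device_thread(void *arg)
{
	struct request *rq = arg;

	assert(rq->end_io != NULL);
	rq->error = 0;
	rq->end_io(rq);
	return NULL;
}

/* Hypothetical asynchronous submit; nonzero means nothing was queued. */
static int submit_async(struct request *rq)
{
	return pthread_create(&rq->worker, NULL, device_thread, rq);
}

/* Same shape as nvm_submit_io_wait(): async submit, then block. */
static int submit_and_wait(struct request *rq)
{
	struct completion wait;
	int ret;

	init_completion(&wait);
	rq->end_io = sync_end_io;
	rq->private = &wait;

	ret = submit_async(rq);
	if (!ret) {
		wait_for_completion(&wait);
		pthread_join(rq->worker, NULL);	/* sketch-only teardown */
	}
	/* On submit failure the callback never runs, so we must not wait. */
	destroy_completion(&wait);
	return ret;
}
```

The early return on a failed submit matters for the same reason as in the diff: if the request was never queued, `end_io` is never called and waiting would block forever. The `pthread_join()` exists only so the sketch tears down its helper thread cleanly; the kernel helper has no equivalent step.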

+13 -103

drivers/lightnvm/pblk-core.c

··· 507 507 pblk->sec_per_write = sec_per_write; 508 508 } 509 509 510 - int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd) 510 + int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd, void *buf) 511 511 { 512 512 struct nvm_tgt_dev *dev = pblk->dev; 513 513 ··· 518 518 return NVM_IO_ERR; 519 519 #endif 520 520 521 - return nvm_submit_io(dev, rqd); 521 + return nvm_submit_io(dev, rqd, buf); 522 522 } 523 523 524 524 void pblk_check_chunk_state_update(struct pblk *pblk, struct nvm_rq *rqd) ··· 541 541 } 542 542 } 543 543 544 - int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd) 544 + int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd, void *buf) 545 545 { 546 546 struct nvm_tgt_dev *dev = pblk->dev; 547 547 int ret; ··· 553 553 return NVM_IO_ERR; 554 554 #endif 555 555 556 - ret = nvm_submit_io_sync(dev, rqd); 556 + ret = nvm_submit_io_sync(dev, rqd, buf); 557 557 558 558 if (trace_pblk_chunk_state_enabled() && !ret && 559 559 rqd->opcode == NVM_OP_PWRITE) ··· 562 562 return ret; 563 563 } 564 564 565 - int pblk_submit_io_sync_sem(struct pblk *pblk, struct nvm_rq *rqd) 565 + static int pblk_submit_io_sync_sem(struct pblk *pblk, struct nvm_rq *rqd, 566 + void *buf) 566 567 { 567 568 struct ppa_addr *ppa_list = nvm_rq_to_ppa_list(rqd); 568 569 int ret; 569 570 570 571 pblk_down_chunk(pblk, ppa_list[0]); 571 - ret = pblk_submit_io_sync(pblk, rqd); 572 + ret = pblk_submit_io_sync(pblk, rqd, buf); 572 573 pblk_up_chunk(pblk, ppa_list[0]); 573 574 574 575 return ret; 575 - } 576 - 577 - static void pblk_bio_map_addr_endio(struct bio *bio) 578 - { 579 - bio_put(bio); 580 - } 581 - 582 - struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data, 583 - unsigned int nr_secs, unsigned int len, 584 - int alloc_type, gfp_t gfp_mask) 585 - { 586 - struct nvm_tgt_dev *dev = pblk->dev; 587 - void *kaddr = data; 588 - struct page *page; 589 - struct bio *bio; 590 - int i, ret; 591 - 592 - if (alloc_type == PBLK_KMALLOC_META) 593 - return 
bio_map_kern(dev->q, kaddr, len, gfp_mask); 594 - 595 - bio = bio_kmalloc(gfp_mask, nr_secs); 596 - if (!bio) 597 - return ERR_PTR(-ENOMEM); 598 - 599 - for (i = 0; i < nr_secs; i++) { 600 - page = vmalloc_to_page(kaddr); 601 - if (!page) { 602 - pblk_err(pblk, "could not map vmalloc bio\n"); 603 - bio_put(bio); 604 - bio = ERR_PTR(-ENOMEM); 605 - goto out; 606 - } 607 - 608 - ret = bio_add_pc_page(dev->q, bio, page, PAGE_SIZE, 0); 609 - if (ret != PAGE_SIZE) { 610 - pblk_err(pblk, "could not add page to bio\n"); 611 - bio_put(bio); 612 - bio = ERR_PTR(-ENOMEM); 613 - goto out; 614 - } 615 - 616 - kaddr += PAGE_SIZE; 617 - } 618 - 619 - bio->bi_end_io = pblk_bio_map_addr_endio; 620 - out: 621 - return bio; 622 576 } 623 577 624 578 int pblk_calc_secs(struct pblk *pblk, unsigned long secs_avail, ··· 676 722 677 723 int pblk_line_smeta_read(struct pblk *pblk, struct pblk_line *line) 678 724 { 679 - struct nvm_tgt_dev *dev = pblk->dev; 680 725 struct pblk_line_meta *lm = &pblk->lm; 681 - struct bio *bio; 682 726 struct ppa_addr *ppa_list; 683 727 struct nvm_rq rqd; 684 728 u64 paddr = pblk_line_smeta_start(pblk, line); ··· 688 736 if (ret) 689 737 return ret; 690 738 691 - bio = bio_map_kern(dev->q, line->smeta, lm->smeta_len, GFP_KERNEL); 692 - if (IS_ERR(bio)) { 693 - ret = PTR_ERR(bio); 694 - goto clear_rqd; 695 - } 696 - 697 - bio->bi_iter.bi_sector = 0; /* internal bio */ 698 - bio_set_op_attrs(bio, REQ_OP_READ, 0); 699 - 700 - rqd.bio = bio; 701 739 rqd.opcode = NVM_OP_PREAD; 702 740 rqd.nr_ppas = lm->smeta_sec; 703 741 rqd.is_seq = 1; ··· 696 754 for (i = 0; i < lm->smeta_sec; i++, paddr++) 697 755 ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line->id); 698 756 699 - ret = pblk_submit_io_sync(pblk, &rqd); 757 + ret = pblk_submit_io_sync(pblk, &rqd, line->smeta); 700 758 if (ret) { 701 759 pblk_err(pblk, "smeta I/O submission failed: %d\n", ret); 702 - bio_put(bio); 703 760 goto clear_rqd; 704 761 } 705 762 ··· 717 776 static int pblk_line_smeta_write(struct pblk 
*pblk, struct pblk_line *line, 718 777 u64 paddr) 719 778 { 720 - struct nvm_tgt_dev *dev = pblk->dev; 721 779 struct pblk_line_meta *lm = &pblk->lm; 722 - struct bio *bio; 723 780 struct ppa_addr *ppa_list; 724 781 struct nvm_rq rqd; 725 782 __le64 *lba_list = emeta_to_lbas(pblk, line->emeta->buf); ··· 730 791 if (ret) 731 792 return ret; 732 793 733 - bio = bio_map_kern(dev->q, line->smeta, lm->smeta_len, GFP_KERNEL); 734 - if (IS_ERR(bio)) { 735 - ret = PTR_ERR(bio); 736 - goto clear_rqd; 737 - } 738 - 739 - bio->bi_iter.bi_sector = 0; /* internal bio */ 740 - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); 741 - 742 - rqd.bio = bio; 743 794 rqd.opcode = NVM_OP_PWRITE; 744 795 rqd.nr_ppas = lm->smeta_sec; 745 796 rqd.is_seq = 1; ··· 743 814 meta->lba = lba_list[paddr] = addr_empty; 744 815 } 745 816 746 - ret = pblk_submit_io_sync_sem(pblk, &rqd); 817 + ret = pblk_submit_io_sync_sem(pblk, &rqd, line->smeta); 747 818 if (ret) { 748 819 pblk_err(pblk, "smeta I/O submission failed: %d\n", ret); 749 - bio_put(bio); 750 820 goto clear_rqd; 751 821 } 752 822 ··· 766 838 { 767 839 struct nvm_tgt_dev *dev = pblk->dev; 768 840 struct nvm_geo *geo = &dev->geo; 769 - struct pblk_line_mgmt *l_mg = &pblk->l_mg; 770 841 struct pblk_line_meta *lm = &pblk->lm; 771 842 void *ppa_list_buf, *meta_list; 772 - struct bio *bio; 773 843 struct ppa_addr *ppa_list; 774 844 struct nvm_rq rqd; 775 845 u64 paddr = line->emeta_ssec; ··· 793 867 rq_ppas = pblk_calc_secs(pblk, left_ppas, 0, false); 794 868 rq_len = rq_ppas * geo->csecs; 795 869 796 - bio = pblk_bio_map_addr(pblk, emeta_buf, rq_ppas, rq_len, 797 - l_mg->emeta_alloc_type, GFP_KERNEL); 798 - if (IS_ERR(bio)) { 799 - ret = PTR_ERR(bio); 800 - goto free_rqd_dma; 801 - } 802 - 803 - bio->bi_iter.bi_sector = 0; /* internal bio */ 804 - bio_set_op_attrs(bio, REQ_OP_READ, 0); 805 - 806 - rqd.bio = bio; 807 870 rqd.meta_list = meta_list; 808 871 rqd.ppa_list = ppa_list_buf; 809 872 rqd.dma_meta_list = dma_meta_list; ··· 811 896 while 
(test_bit(pos, line->blk_bitmap)) { 812 897 paddr += min; 813 898 if (pblk_boundary_paddr_checks(pblk, paddr)) { 814 - bio_put(bio); 815 899 ret = -EINTR; 816 900 goto free_rqd_dma; 817 901 } ··· 820 906 } 821 907 822 908 if (pblk_boundary_paddr_checks(pblk, paddr + min)) { 823 - bio_put(bio); 824 909 ret = -EINTR; 825 910 goto free_rqd_dma; 826 911 } ··· 828 915 ppa_list[i] = addr_to_gen_ppa(pblk, paddr, line_id); 829 916 } 830 917 831 - ret = pblk_submit_io_sync(pblk, &rqd); 918 + ret = pblk_submit_io_sync(pblk, &rqd, emeta_buf); 832 919 if (ret) { 833 920 pblk_err(pblk, "emeta I/O submission failed: %d\n", ret); 834 - bio_put(bio); 835 921 goto free_rqd_dma; 836 922 } 837 923 ··· 875 963 /* The write thread schedules erases so that it minimizes disturbances 876 964 * with writes. Thus, there is no need to take the LUN semaphore. 877 965 */ 878 - ret = pblk_submit_io_sync(pblk, &rqd); 966 + ret = pblk_submit_io_sync(pblk, &rqd, NULL); 879 967 rqd.private = pblk; 880 968 __pblk_end_io_erase(pblk, &rqd); 881 969 ··· 1704 1792 /* The write thread schedules erases so that it minimizes disturbances 1705 1793 * with writes. Thus, there is no need to take the LUN semaphore. 
1706 1794 */ 1707 - err = pblk_submit_io(pblk, rqd); 1795 + err = pblk_submit_io(pblk, rqd, NULL); 1708 1796 if (err) { 1709 1797 struct nvm_tgt_dev *dev = pblk->dev; 1710 1798 struct nvm_geo *geo = &dev->geo; ··· 1835 1923 static void pblk_save_lba_list(struct pblk *pblk, struct pblk_line *line) 1836 1924 { 1837 1925 struct pblk_line_meta *lm = &pblk->lm; 1838 - struct pblk_line_mgmt *l_mg = &pblk->l_mg; 1839 1926 unsigned int lba_list_size = lm->emeta_len[2]; 1840 1927 struct pblk_w_err_gc *w_err_gc = line->w_err_gc; 1841 1928 struct pblk_emeta *emeta = line->emeta; 1842 1929 1843 - w_err_gc->lba_list = pblk_malloc(lba_list_size, 1844 - l_mg->emeta_alloc_type, GFP_KERNEL); 1930 + w_err_gc->lba_list = kvmalloc(lba_list_size, GFP_KERNEL); 1845 1931 memcpy(w_err_gc->lba_list, emeta_to_lbas(pblk, emeta->buf), 1846 1932 lba_list_size); 1847 1933 }

+8 -11

drivers/lightnvm/pblk-gc.c

··· 132 132 struct pblk_line *line) 133 133 { 134 134 struct line_emeta *emeta_buf; 135 - struct pblk_line_mgmt *l_mg = &pblk->l_mg; 136 135 struct pblk_line_meta *lm = &pblk->lm; 137 136 unsigned int lba_list_size = lm->emeta_len[2]; 138 137 __le64 *lba_list; 139 138 int ret; 140 139 141 - emeta_buf = pblk_malloc(lm->emeta_len[0], 142 - l_mg->emeta_alloc_type, GFP_KERNEL); 140 + emeta_buf = kvmalloc(lm->emeta_len[0], GFP_KERNEL); 143 141 if (!emeta_buf) 144 142 return NULL; 145 143 ··· 145 147 if (ret) { 146 148 pblk_err(pblk, "line %d read emeta failed (%d)\n", 147 149 line->id, ret); 148 - pblk_mfree(emeta_buf, l_mg->emeta_alloc_type); 150 + kvfree(emeta_buf); 149 151 return NULL; 150 152 } 151 153 ··· 159 161 if (ret) { 160 162 pblk_err(pblk, "inconsistent emeta (line %d)\n", 161 163 line->id); 162 - pblk_mfree(emeta_buf, l_mg->emeta_alloc_type); 164 + kvfree(emeta_buf); 163 165 return NULL; 164 166 } 165 167 166 - lba_list = pblk_malloc(lba_list_size, 167 - l_mg->emeta_alloc_type, GFP_KERNEL); 168 + lba_list = kvmalloc(lba_list_size, GFP_KERNEL); 169 + 168 170 if (lba_list) 169 171 memcpy(lba_list, emeta_to_lbas(pblk, emeta_buf), lba_list_size); 170 172 171 - pblk_mfree(emeta_buf, l_mg->emeta_alloc_type); 173 + kvfree(emeta_buf); 172 174 173 175 return lba_list; 174 176 } ··· 179 181 ws); 180 182 struct pblk *pblk = line_ws->pblk; 181 183 struct pblk_line *line = line_ws->line; 182 - struct pblk_line_mgmt *l_mg = &pblk->l_mg; 183 184 struct pblk_line_meta *lm = &pblk->lm; 184 185 struct nvm_tgt_dev *dev = pblk->dev; 185 186 struct nvm_geo *geo = &dev->geo; ··· 269 272 goto next_rq; 270 273 271 274 out: 272 - pblk_mfree(lba_list, l_mg->emeta_alloc_type); 275 + kvfree(lba_list); 273 276 kfree(line_ws); 274 277 kfree(invalid_bitmap); 275 278 ··· 283 286 fail_free_gc_rq: 284 287 kfree(gc_rq); 285 288 fail_free_lba_list: 286 - pblk_mfree(lba_list, l_mg->emeta_alloc_type); 289 + kvfree(lba_list); 287 290 fail_free_invalid_bitmap: 288 291 kfree(invalid_bitmap); 289 
292 fail_free_ws:

+10 -28

drivers/lightnvm/pblk-init.c

···

 	for (i = 0; i < PBLK_DATA_LINES; i++) {
 		kfree(l_mg->sline_meta[i]);
-		pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
+		kvfree(l_mg->eline_meta[i]->buf);
 		kfree(l_mg->eline_meta[i]);
 	}

···
 	kfree(line->erase_bitmap);
 	kfree(line->chks);

-	pblk_mfree(w_err_gc->lba_list, l_mg->emeta_alloc_type);
+	kvfree(w_err_gc->lba_list);
 	kfree(w_err_gc);
 }

···
 		if (!emeta)
 			goto fail_free_emeta;

-		if (lm->emeta_len[0] > KMALLOC_MAX_CACHE_SIZE) {
-			l_mg->emeta_alloc_type = PBLK_VMALLOC_META;
-
-			emeta->buf = vmalloc(lm->emeta_len[0]);
-			if (!emeta->buf) {
-				kfree(emeta);
-				goto fail_free_emeta;
-			}
-
-			emeta->nr_entries = lm->emeta_sec[0];
-			l_mg->eline_meta[i] = emeta;
-		} else {
-			l_mg->emeta_alloc_type = PBLK_KMALLOC_META;
-
-			emeta->buf = kmalloc(lm->emeta_len[0], GFP_KERNEL);
-			if (!emeta->buf) {
-				kfree(emeta);
-				goto fail_free_emeta;
-			}
-
-			emeta->nr_entries = lm->emeta_sec[0];
-			l_mg->eline_meta[i] = emeta;
+		emeta->buf = kvmalloc(lm->emeta_len[0], GFP_KERNEL);
+		if (!emeta->buf) {
+			kfree(emeta);
+			goto fail_free_emeta;
 		}
+
+		emeta->nr_entries = lm->emeta_sec[0];
+		l_mg->eline_meta[i] = emeta;
 	}

 	for (i = 0; i < l_mg->nr_lines; i++)
···

 fail_free_emeta:
 	while (--i >= 0) {
-		if (l_mg->emeta_alloc_type == PBLK_VMALLOC_META)
-			vfree(l_mg->eline_meta[i]->buf);
-		else
-			kfree(l_mg->eline_meta[i]->buf);
+		kvfree(l_mg->eline_meta[i]->buf);
 		kfree(l_mg->eline_meta[i]);
 	}

+3 -23

drivers/lightnvm/pblk-read.c

··· 342 342 bio_put(int_bio); 343 343 int_bio = bio_clone_fast(bio, GFP_KERNEL, &pblk_bio_set); 344 344 goto split_retry; 345 - } else if (pblk_submit_io(pblk, rqd)) { 345 + } else if (pblk_submit_io(pblk, rqd, NULL)) { 346 346 /* Submitting IO to drive failed, let's report an error */ 347 347 rqd->error = -ENODEV; 348 348 pblk_end_io_read(rqd); ··· 417 417 418 418 int pblk_submit_read_gc(struct pblk *pblk, struct pblk_gc_rq *gc_rq) 419 419 { 420 - struct nvm_tgt_dev *dev = pblk->dev; 421 - struct nvm_geo *geo = &dev->geo; 422 - struct bio *bio; 423 420 struct nvm_rq rqd; 424 - int data_len; 425 421 int ret = NVM_IO_OK; 426 422 427 423 memset(&rqd, 0, sizeof(struct nvm_rq)); ··· 442 446 if (!(gc_rq->secs_to_gc)) 443 447 goto out; 444 448 445 - data_len = (gc_rq->secs_to_gc) * geo->csecs; 446 - bio = pblk_bio_map_addr(pblk, gc_rq->data, gc_rq->secs_to_gc, data_len, 447 - PBLK_VMALLOC_META, GFP_KERNEL); 448 - if (IS_ERR(bio)) { 449 - pblk_err(pblk, "could not allocate GC bio (%lu)\n", 450 - PTR_ERR(bio)); 451 - ret = PTR_ERR(bio); 452 - goto err_free_dma; 453 - } 454 - 455 - bio->bi_iter.bi_sector = 0; /* internal bio */ 456 - bio_set_op_attrs(bio, REQ_OP_READ, 0); 457 - 458 449 rqd.opcode = NVM_OP_PREAD; 459 450 rqd.nr_ppas = gc_rq->secs_to_gc; 460 - rqd.bio = bio; 461 451 462 - if (pblk_submit_io_sync(pblk, &rqd)) { 452 + if (pblk_submit_io_sync(pblk, &rqd, gc_rq->data)) { 463 453 ret = -EIO; 464 - goto err_free_bio; 454 + goto err_free_dma; 465 455 } 466 456 467 457 pblk_read_check_rand(pblk, &rqd, gc_rq->lba_list, gc_rq->nr_secs); ··· 471 489 pblk_free_rqd_meta(pblk, &rqd); 472 490 return ret; 473 491 474 - err_free_bio: 475 - bio_put(bio); 476 492 err_free_dma: 477 493 pblk_free_rqd_meta(pblk, &rqd); 478 494 return ret;

+6 -36

drivers/lightnvm/pblk-recovery.c

··· 178 178 void *meta_list; 179 179 struct pblk_pad_rq *pad_rq; 180 180 struct nvm_rq *rqd; 181 - struct bio *bio; 182 181 struct ppa_addr *ppa_list; 183 182 void *data; 184 183 __le64 *lba_list = emeta_to_lbas(pblk, line->emeta->buf); 185 184 u64 w_ptr = line->cur_sec; 186 - int left_line_ppas, rq_ppas, rq_len; 185 + int left_line_ppas, rq_ppas; 187 186 int i, j; 188 187 int ret = 0; 189 188 ··· 211 212 goto fail_complete; 212 213 } 213 214 214 - rq_len = rq_ppas * geo->csecs; 215 - 216 - bio = pblk_bio_map_addr(pblk, data, rq_ppas, rq_len, 217 - PBLK_VMALLOC_META, GFP_KERNEL); 218 - if (IS_ERR(bio)) { 219 - ret = PTR_ERR(bio); 220 - goto fail_complete; 221 - } 222 - 223 - bio->bi_iter.bi_sector = 0; /* internal bio */ 224 - bio_set_op_attrs(bio, REQ_OP_WRITE, 0); 225 - 226 215 rqd = pblk_alloc_rqd(pblk, PBLK_WRITE_INT); 227 216 228 217 ret = pblk_alloc_rqd_meta(pblk, rqd); 229 218 if (ret) { 230 219 pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT); 231 - bio_put(bio); 232 220 goto fail_complete; 233 221 } 234 222 235 - rqd->bio = bio; 223 + rqd->bio = NULL; 236 224 rqd->opcode = NVM_OP_PWRITE; 237 225 rqd->is_seq = 1; 238 226 rqd->nr_ppas = rq_ppas; ··· 261 275 kref_get(&pad_rq->ref); 262 276 pblk_down_chunk(pblk, ppa_list[0]); 263 277 264 - ret = pblk_submit_io(pblk, rqd); 278 + ret = pblk_submit_io(pblk, rqd, data); 265 279 if (ret) { 266 280 pblk_err(pblk, "I/O submission failed: %d\n", ret); 267 281 pblk_up_chunk(pblk, ppa_list[0]); 268 282 kref_put(&pad_rq->ref, pblk_recov_complete); 269 283 pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT); 270 - bio_put(bio); 271 284 goto fail_complete; 272 285 } 273 286 ··· 360 375 struct ppa_addr *ppa_list; 361 376 void *meta_list; 362 377 struct nvm_rq *rqd; 363 - struct bio *bio; 364 378 void *data; 365 379 dma_addr_t dma_ppa_list, dma_meta_list; 366 380 __le64 *lba_list; 367 381 u64 paddr = pblk_line_smeta_start(pblk, line) + lm->smeta_sec; 368 382 bool padded = false; 369 - int rq_ppas, rq_len; 383 + int rq_ppas; 370 384 int i, j; 
371 385 int ret; 372 386 u64 left_ppas = pblk_sec_in_open_line(pblk, line) - lm->smeta_sec; ··· 388 404 rq_ppas = pblk_calc_secs(pblk, left_ppas, 0, false); 389 405 if (!rq_ppas) 390 406 rq_ppas = pblk->min_write_pgs; 391 - rq_len = rq_ppas * geo->csecs; 392 407 393 408 retry_rq: 394 - bio = bio_map_kern(dev->q, data, rq_len, GFP_KERNEL); 395 - if (IS_ERR(bio)) 396 - return PTR_ERR(bio); 397 - 398 - bio->bi_iter.bi_sector = 0; /* internal bio */ 399 - bio_set_op_attrs(bio, REQ_OP_READ, 0); 400 - bio_get(bio); 401 - 402 - rqd->bio = bio; 409 + rqd->bio = NULL; 403 410 rqd->opcode = NVM_OP_PREAD; 404 411 rqd->meta_list = meta_list; 405 412 rqd->nr_ppas = rq_ppas; ··· 420 445 addr_to_gen_ppa(pblk, paddr + j, line->id); 421 446 } 422 447 423 - ret = pblk_submit_io_sync(pblk, rqd); 448 + ret = pblk_submit_io_sync(pblk, rqd, data); 424 449 if (ret) { 425 450 pblk_err(pblk, "I/O submission failed: %d\n", ret); 426 - bio_put(bio); 427 451 return ret; 428 452 } 429 453 ··· 434 460 435 461 if (padded) { 436 462 pblk_log_read_err(pblk, rqd); 437 - bio_put(bio); 438 463 return -EINTR; 439 464 } 440 465 441 466 pad_distance = pblk_pad_distance(pblk, line); 442 467 ret = pblk_recov_pad_line(pblk, line, pad_distance); 443 468 if (ret) { 444 - bio_put(bio); 445 469 return ret; 446 470 } 447 471 448 472 padded = true; 449 - bio_put(bio); 450 473 goto retry_rq; 451 474 } 452 475 453 476 pblk_get_packed_meta(pblk, rqd); 454 - bio_put(bio); 455 477 456 478 for (i = 0; i < rqd->nr_ppas; i++) { 457 479 struct pblk_sec_meta *meta = pblk_get_meta(pblk, meta_list, i);

+3 -17

drivers/lightnvm/pblk-write.c

···
 	struct pblk_emeta *emeta = meta_line->emeta;
 	struct ppa_addr *ppa_list;
 	struct pblk_g_ctx *m_ctx;
-	struct bio *bio;
 	struct nvm_rq *rqd;
 	void *data;
 	u64 paddr;
···
 	rq_len = rq_ppas * geo->csecs;
 	data = ((void *)emeta->buf) + emeta->mem;

-	bio = pblk_bio_map_addr(pblk, data, rq_ppas, rq_len,
-					l_mg->emeta_alloc_type, GFP_KERNEL);
-	if (IS_ERR(bio)) {
-		pblk_err(pblk, "failed to map emeta io");
-		ret = PTR_ERR(bio);
-		goto fail_free_rqd;
-	}
-	bio->bi_iter.bi_sector = 0; /* internal bio */
-	bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
-	rqd->bio = bio;
-
 	ret = pblk_alloc_w_rq(pblk, rqd, rq_ppas, pblk_end_io_write_meta);
 	if (ret)
-		goto fail_free_bio;
+		goto fail_free_rqd;

 	ppa_list = nvm_rq_to_ppa_list(rqd);
 	for (i = 0; i < rqd->nr_ppas; ) {
···

 	pblk_down_chunk(pblk, ppa_list[0]);

-	ret = pblk_submit_io(pblk, rqd);
+	ret = pblk_submit_io(pblk, rqd, data);
 	if (ret) {
 		pblk_err(pblk, "emeta I/O submission failed: %d\n", ret);
 		goto fail_rollback;
···
 	pblk_dealloc_page(pblk, meta_line, rq_ppas);
 	list_add(&meta_line->list, &meta_line->list);
 	spin_unlock(&l_mg->close_lock);
-fail_free_bio:
-	bio_put(bio);
 fail_free_rqd:
 	pblk_free_rqd(pblk, rqd, PBLK_WRITE_INT);
 	return ret;
···
 	meta_line = pblk_should_submit_meta_io(pblk, rqd);

 	/* Submit data write for current data line */
-	err = pblk_submit_io(pblk, rqd);
+	err = pblk_submit_io(pblk, rqd, NULL);
 	if (err) {
 		pblk_err(pblk, "data I/O submission failed: %d\n", err);
 		return NVM_IO_ERR;

+2 -29

drivers/lightnvm/pblk.h

···
 #define PBLK_DATA_LINES 4

 enum {
-	PBLK_KMALLOC_META = 1,
-	PBLK_VMALLOC_META = 2,
-};
-
-enum {
 	PBLK_EMETA_TYPE_HEADER = 1,	/* struct line_emeta first sector */
 	PBLK_EMETA_TYPE_LLBA = 2,	/* lba list - type: __le64 */
 	PBLK_EMETA_TYPE_VSC = 3,	/* vsc list - type: __le32 */
···
 	struct list_head emeta_list;	/* Lines queued to schedule emeta */

 	__le32 *vsc_list;		/* Valid sector counts for all lines */
-
-	/* Metadata allocation type: VMALLOC | KMALLOC */
-	int emeta_alloc_type;

 	/* Pre-allocated metadata for data lines */
 	struct pblk_smeta *sline_meta[PBLK_DATA_LINES];
···
 		struct ppa_addr ppa);
 void pblk_log_write_err(struct pblk *pblk, struct nvm_rq *rqd);
 void pblk_log_read_err(struct pblk *pblk, struct nvm_rq *rqd);
-int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd);
-int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd);
-int pblk_submit_io_sync_sem(struct pblk *pblk, struct nvm_rq *rqd);
+int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd, void *buf);
+int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq *rqd, void *buf);
 int pblk_submit_meta_io(struct pblk *pblk, struct pblk_line *meta_line);
 void pblk_check_chunk_state_update(struct pblk *pblk, struct nvm_rq *rqd);
-struct bio *pblk_bio_map_addr(struct pblk *pblk, void *data,
-			      unsigned int nr_secs, unsigned int len,
-			      int alloc_type, gfp_t gfp_mask);
 struct pblk_line *pblk_line_get(struct pblk *pblk);
 struct pblk_line *pblk_line_get_first_data(struct pblk *pblk);
 struct pblk_line *pblk_line_replace_data(struct pblk *pblk);
···
  */
 int pblk_sysfs_init(struct gendisk *tdisk);
 void pblk_sysfs_exit(struct gendisk *tdisk);
-
-static inline void *pblk_malloc(size_t size, int type, gfp_t flags)
-{
-	if (type == PBLK_KMALLOC_META)
-		return kmalloc(size, flags);
-	return vmalloc(size);
-}
-
-static inline void pblk_mfree(void *ptr, int type)
-{
-	if (type == PBLK_KMALLOC_META)
-		kfree(ptr);
-	else
-		vfree(ptr);
-}

 static inline struct nvm_rq *nvm_rq_from_c_ctx(void *c_ctx)
 {

+8 -2

drivers/md/bcache/closure.c

··· 105 105 106 106 static void closure_sync_fn(struct closure *cl) 107 107 { 108 - cl->s->done = 1; 109 - wake_up_process(cl->s->task); 108 + struct closure_syncer *s = cl->s; 109 + struct task_struct *p; 110 + 111 + rcu_read_lock(); 112 + p = READ_ONCE(s->task); 113 + s->done = 1; 114 + wake_up_process(p); 115 + rcu_read_unlock(); 110 116 } 111 117 112 118 void __sched __closure_sync(struct closure *cl)

+2 -3

drivers/md/bcache/debug.c

···
 	while (size) {
 		struct keybuf_key *w;
 		unsigned int bytes = min(i->bytes, size);
-		int err = copy_to_user(buf, i->buf, bytes);

-		if (err)
-			return err;
+		if (copy_to_user(buf, i->buf, bytes))
+			return -EFAULT;

 		ret += bytes;
 		buf += bytes;
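The hunk above matters because `copy_to_user()` reports failure as the number of bytes it could not copy, not as a negative errno. Returning that value from a read handler would look like a short successful read. Below is a small user-space sketch of the corrected convention. `copy_out()` and `dump_read()` are hypothetical stand-ins written for this illustration, and the `writable` parameter is an assumption used to simulate a partially faulting destination.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): copies at most `writable`
 * bytes and returns how many bytes were NOT copied (0 on full success).
 */
static size_t copy_out(char *dst, size_t writable, const char *src, size_t n)
{
	size_t ok = n < writable ? n : writable;

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, ok);
	return n - ok;
}

/* Read-style handler: bytes copied on success, -EFAULT on any shortfall. */
static long dump_read(char *dst, size_t writable, const char *src, size_t n)
{
	if (copy_out(dst, writable, src, n))
		return -EFAULT;	/* never leak the positive remainder */

	return (long)n;
}
```

With this shape a caller can distinguish "3 bytes read" from "copy failed", which the pre-patch `return err;` could not express.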

+1

drivers/md/bcache/sysfs.c

···

 static int __bch_cache_cmp(const void *l, const void *r)
 {
+	cond_resched();
 	return *((uint16_t *)r) - *((uint16_t *)l);
 }

+2 -1

drivers/md/dm-rq.c

···
 		ret = dm_dispatch_clone_request(clone, rq);
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_rq_unprep_clone(clone);
+			blk_mq_cleanup_rq(clone);
 			tio->ti->type->release_clone_rq(clone, &tio->info);
 			tio->clone = NULL;
 			return DM_MAPIO_REQUEUE;
···
 	if (err)
 		goto out_kfree_tag_set;

-	q = blk_mq_init_allocated_queue(md->tag_set, md->queue);
+	q = blk_mq_init_allocated_queue(md->tag_set, md->queue, true);
 	if (IS_ERR(q)) {
 		err = PTR_ERR(q);
 		goto out_tag_set;

+5

drivers/md/md-linear.c

···
 		     bio_sector < start_sector))
 		goto out_of_bounds;

+	if (unlikely(is_mddev_broken(tmp_dev->rdev, "linear"))) {
+		bio_io_error(bio);
+		return true;
+	}
+
 	if (unlikely(bio_end_sector(bio) > end_sector)) {
 		/* This bio crosses a device boundary, so we have to split it */
 		struct bio *split = bio_split(bio, end_sector - bio_sector,

+80 -16

drivers/md/md.c

··· 376 376 struct mddev *mddev = q->queuedata; 377 377 unsigned int sectors; 378 378 379 + if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) { 380 + bio_io_error(bio); 381 + return BLK_QC_T_NONE; 382 + } 383 + 379 384 blk_queue_split(q, &bio); 380 385 381 386 if (mddev == NULL || mddev->pers == NULL) { ··· 1237 1232 mddev->new_layout = mddev->layout; 1238 1233 mddev->new_chunk_sectors = mddev->chunk_sectors; 1239 1234 } 1235 + if (mddev->level == 0) 1236 + mddev->layout = -1; 1240 1237 1241 1238 if (sb->state & (1<<MD_SB_CLEAN)) 1242 1239 mddev->recovery_cp = MaxSector; ··· 1654 1647 rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset; 1655 1648 } 1656 1649 1650 + if ((le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT) && 1651 + sb->level != 0) 1652 + return -EINVAL; 1653 + 1657 1654 if (!refdev) { 1658 1655 ret = 1; 1659 1656 } else { ··· 1768 1757 mddev->new_chunk_sectors = mddev->chunk_sectors; 1769 1758 } 1770 1759 1760 + if (mddev->level == 0 && 1761 + !(le32_to_cpu(sb->feature_map) & MD_FEATURE_RAID0_LAYOUT)) 1762 + mddev->layout = -1; 1763 + 1771 1764 if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL) 1772 1765 set_bit(MD_HAS_JOURNAL, &mddev->flags); 1773 1766 ··· 1841 1826 if (!(le32_to_cpu(sb->feature_map) & 1842 1827 MD_FEATURE_RECOVERY_BITMAP)) 1843 1828 rdev->saved_raid_disk = -1; 1844 - } else 1845 - set_bit(In_sync, &rdev->flags); 1829 + } else { 1830 + /* 1831 + * If the array is FROZEN, then the device can't 1832 + * be in_sync with rest of array. 
1833 + */ 1834 + if (!test_bit(MD_RECOVERY_FROZEN, 1835 + &mddev->recovery)) 1836 + set_bit(In_sync, &rdev->flags); 1837 + } 1846 1838 rdev->raid_disk = role; 1847 1839 break; 1848 1840 } ··· 3686 3664 return -EINVAL; 3687 3665 if (decimals < 0) 3688 3666 decimals = 0; 3689 - while (decimals < scale) { 3690 - result *= 10; 3691 - decimals ++; 3692 - } 3693 - *res = result; 3667 + *res = result * int_pow(10, scale - decimals); 3694 3668 return 0; 3695 3669 } 3696 3670 ··· 4173 4155 * active-idle 4174 4156 * like active, but no writes have been seen for a while (100msec). 4175 4157 * 4158 + * broken 4159 + * RAID0/LINEAR-only: same as clean, but array is missing a member. 4160 + * It's useful because RAID0/LINEAR mounted-arrays aren't stopped 4161 + * when a member is gone, so this state will at least alert the 4162 + * user that something is wrong. 4176 4163 */ 4177 4164 enum array_state { clear, inactive, suspended, readonly, read_auto, clean, active, 4178 - write_pending, active_idle, bad_word}; 4165 + write_pending, active_idle, broken, bad_word}; 4179 4166 static char *array_states[] = { 4180 4167 "clear", "inactive", "suspended", "readonly", "read-auto", "clean", "active", 4181 - "write-pending", "active-idle", NULL }; 4168 + "write-pending", "active-idle", "broken", NULL }; 4182 4169 4183 4170 static int match_word(const char *word, char **list) 4184 4171 { ··· 4199 4176 { 4200 4177 enum array_state st = inactive; 4201 4178 4202 - if (mddev->pers) 4179 + if (mddev->pers && !test_bit(MD_NOT_READY, &mddev->flags)) { 4203 4180 switch(mddev->ro) { 4204 4181 case 1: 4205 4182 st = readonly; ··· 4219 4196 st = active; 4220 4197 spin_unlock(&mddev->lock); 4221 4198 } 4222 - else { 4199 + 4200 + if (test_bit(MD_BROKEN, &mddev->flags) && st == clean) 4201 + st = broken; 4202 + } else { 4223 4203 if (list_empty(&mddev->disks) && 4224 4204 mddev->raid_disks == 0 && 4225 4205 mddev->dev_sectors == 0) ··· 4336 4310 break; 4337 4311 case write_pending: 4338 4312 case 
active_idle: 4313 + case broken: 4339 4314 /* these cannot be set */ 4340 4315 break; 4341 4316 } ··· 5209 5182 __ATTR(consistency_policy, S_IRUGO | S_IWUSR, consistency_policy_show, 5210 5183 consistency_policy_store); 5211 5184 5185 + static ssize_t fail_last_dev_show(struct mddev *mddev, char *page) 5186 + { 5187 + return sprintf(page, "%d\n", mddev->fail_last_dev); 5188 + } 5189 + 5190 + /* 5191 + * Setting fail_last_dev to true to allow last device to be forcibly removed 5192 + * from RAID1/RAID10. 5193 + */ 5194 + static ssize_t 5195 + fail_last_dev_store(struct mddev *mddev, const char *buf, size_t len) 5196 + { 5197 + int ret; 5198 + bool value; 5199 + 5200 + ret = kstrtobool(buf, &value); 5201 + if (ret) 5202 + return ret; 5203 + 5204 + if (value != mddev->fail_last_dev) 5205 + mddev->fail_last_dev = value; 5206 + 5207 + return len; 5208 + } 5209 + static struct md_sysfs_entry md_fail_last_dev = 5210 + __ATTR(fail_last_dev, S_IRUGO | S_IWUSR, fail_last_dev_show, 5211 + fail_last_dev_store); 5212 + 5212 5213 static struct attribute *md_default_attrs[] = { 5213 5214 &md_level.attr, 5214 5215 &md_layout.attr, ··· 5253 5198 &md_array_size.attr, 5254 5199 &max_corr_read_errors.attr, 5255 5200 &md_consistency_policy.attr, 5201 + &md_fail_last_dev.attr, 5256 5202 NULL, 5257 5203 }; 5258 5204 ··· 5800 5744 md_update_sb(mddev, 0); 5801 5745 5802 5746 md_new_event(mddev); 5803 - sysfs_notify_dirent_safe(mddev->sysfs_state); 5804 - sysfs_notify_dirent_safe(mddev->sysfs_action); 5805 - sysfs_notify(&mddev->kobj, NULL, "degraded"); 5806 5747 return 0; 5807 5748 5808 5749 bitmap_abort: ··· 5820 5767 { 5821 5768 int err; 5822 5769 5770 + set_bit(MD_NOT_READY, &mddev->flags); 5823 5771 err = md_run(mddev); 5824 5772 if (err) 5825 5773 goto out; ··· 5841 5787 5842 5788 set_capacity(mddev->gendisk, mddev->array_sectors); 5843 5789 revalidate_disk(mddev->gendisk); 5790 + clear_bit(MD_NOT_READY, &mddev->flags); 5844 5791 mddev->changed = 1; 5845 5792 
kobject_uevent(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE); 5793 + sysfs_notify_dirent_safe(mddev->sysfs_state); 5794 + sysfs_notify_dirent_safe(mddev->sysfs_action); 5795 + sysfs_notify(&mddev->kobj, NULL, "degraded"); 5846 5796 out: 5797 + clear_bit(MD_NOT_READY, &mddev->flags); 5847 5798 return err; 5848 5799 } 5849 5800 ··· 6908 6849 mddev->external = 0; 6909 6850 6910 6851 mddev->layout = info->layout; 6852 + if (mddev->level == 0) 6853 + /* Cannot trust RAID0 layout info here */ 6854 + mddev->layout = -1; 6911 6855 mddev->chunk_sectors = info->chunk_size >> 9; 6912 6856 6913 6857 if (mddev->persistent) { ··· 8962 8900 8963 8901 if (mddev_trylock(mddev)) { 8964 8902 int spares = 0; 8903 + bool try_set_sync = mddev->safemode != 0; 8965 8904 8966 8905 if (!mddev->external && mddev->safemode == 1) 8967 8906 mddev->safemode = 0; ··· 9008 8945 } 9009 8946 } 9010 8947 9011 - if (!mddev->external && !mddev->in_sync) { 8948 + if (try_set_sync && !mddev->external && !mddev->in_sync) { 9012 8949 spin_lock(&mddev->lock); 9013 8950 set_in_sync(mddev); 9014 8951 spin_unlock(&mddev->lock); ··· 9106 9043 /* resync has finished, collect result */ 9107 9044 md_unregister_thread(&mddev->sync_thread); 9108 9045 if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) && 9109 - !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { 9046 + !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) && 9047 + mddev->degraded != mddev->raid_disks) { 9110 9048 /* success...*/ 9111 9049 /* activate any spares */ 9112 9050 if (mddev->pers->spare_active(mddev)) {
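One of the md.c hunks above replaces the "multiply by 10 until `decimals` reaches `scale`" loop with a single `int_pow(10, scale - decimals)` multiplication. The sketch below is a user-space illustration that the two forms agree. The `int_pow()` here is a local stand-in using exponentiation by squaring, written on the assumption that it behaves like the kernel helper of the same name; `scale_by_loop()` and `scale_by_pow()` are hypothetical wrappers, and the sketch assumes `0 <= decimals <= scale`, as the surrounding parser appears to guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's int_pow(): base^exp by squaring. */
static uint64_t int_pow(uint64_t base, unsigned int exp)
{
	uint64_t result = 1;

	while (exp) {
		if (exp & 1)
			result *= base;
		exp >>= 1;
		base *= base;
	}
	return result;
}

/* Pre-patch form: one multiplication per missing decimal digit. */
static uint64_t scale_by_loop(uint64_t result, int decimals, int scale)
{
	while (decimals < scale) {
		result *= 10;
		decimals++;
	}
	return result;
}

/* Post-patch form: a single multiplication by 10^(scale - decimals). */
static uint64_t scale_by_pow(uint64_t result, int decimals, int scale)
{
	assert(decimals >= 0 && decimals <= scale);
	return result * int_pow(10, (unsigned int)(scale - decimals));
}
```

For example, parsing "1.5" with a scale of 3 leaves `result = 15` and `decimals = 1`, and both forms produce 1500.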

+20

drivers/md/md.h

@@ -248,6 +248,12 @@
 	MD_UPDATING_SB,	/* md_check_recovery is updating the metadata
 			 * without explicitly holding reconfig_mutex.
 			 */
+	MD_NOT_READY,	/* do_md_run() is active, so 'array_state'
+			 * must not report that array is ready yet
+			 */
+	MD_BROKEN,	/* This is used in RAID-0/LINEAR only, to stop
+			 * I/O in case an array member is gone/failed.
+			 */
 };
 
 enum mddev_sb_flags {
@@ -493,6 +487,7 @@
 	unsigned int good_device_nr;	/* good device num within cluster raid */
 
 	bool has_superblocks:1;
+	bool fail_last_dev:1;
 };
 
 enum recovery_flags {
@@ -741,6 +734,19 @@
 			bool is_suspend);
 struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
 struct md_rdev *md_find_rdev_rcu(struct mddev *mddev, dev_t dev);
+
+static inline bool is_mddev_broken(struct md_rdev *rdev, const char *md_type)
+{
+	int flags = rdev->bdev->bd_disk->flags;
+
+	if (!(flags & GENHD_FL_UP)) {
+		if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags))
+			pr_warn("md: %s: %s array has a missing/failed member\n",
+				mdname(rdev->mddev), md_type);
+		return true;
+	}
+	return false;
+}
 
 static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
 {
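The is_mddev_broken() helper in the md.h hunk above uses test_and_set_bit() so that the warning is printed only the first time a missing member is seen, while every caller still gets "true". The following is a minimal userspace sketch of that warn-once pattern, not kernel code: struct toy_array, toy_member_is_broken() and the warnings counter are hypothetical names, and C11 atomic_exchange() stands in for the kernel's test_and_set_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the MD_BROKEN bit in mddev->flags. */
struct toy_array {
	atomic_bool broken;	/* models MD_BROKEN */
	int warnings;		/* number of times the warning was logged */
};

/*
 * Returns true when the member device is down. The warning is emitted
 * only on the transition from "not broken" to "broken", because
 * atomic_exchange() returns the previous value of the flag.
 */
static bool toy_member_is_broken(struct toy_array *arr, bool member_up)
{
	assert(arr != NULL);

	if (member_up)
		return false;

	if (!atomic_exchange(&arr->broken, true)) {
		arr->warnings++;
		fprintf(stderr, "toy: array has a missing/failed member\n");
	}
	return true;
}
```

Calling toy_member_is_broken() repeatedly with member_up set to false keeps returning true but logs once, which is the behaviour the I/O path in raid0.c relies on when it fails each bio with bio_io_error().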

+40 -1

drivers/md/raid0.c

··· 19 19 #include "raid0.h" 20 20 #include "raid5.h" 21 21 22 + static int default_layout = 0; 23 + module_param(default_layout, int, 0644); 24 + 22 25 #define UNSUPPORTED_MDDEV_FLAGS \ 23 26 ((1L << MD_HAS_JOURNAL) | \ 24 27 (1L << MD_JOURNAL_CLEAN) | \ ··· 142 139 } 143 140 pr_debug("md/raid0:%s: FINAL %d zones\n", 144 141 mdname(mddev), conf->nr_strip_zones); 142 + 143 + if (conf->nr_strip_zones == 1) { 144 + conf->layout = RAID0_ORIG_LAYOUT; 145 + } else if (mddev->layout == RAID0_ORIG_LAYOUT || 146 + mddev->layout == RAID0_ALT_MULTIZONE_LAYOUT) { 147 + conf->layout = mddev->layout; 148 + } else if (default_layout == RAID0_ORIG_LAYOUT || 149 + default_layout == RAID0_ALT_MULTIZONE_LAYOUT) { 150 + conf->layout = default_layout; 151 + } else { 152 + pr_err("md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting\n", 153 + mdname(mddev)); 154 + pr_err("md/raid0: please set raid.default_layout to 1 or 2\n"); 155 + err = -ENOTSUPP; 156 + goto abort; 157 + } 145 158 /* 146 159 * now since we have the hard sector sizes, we can make sure 147 160 * chunk size is a multiple of that sector size ··· 566 547 567 548 static bool raid0_make_request(struct mddev *mddev, struct bio *bio) 568 549 { 550 + struct r0conf *conf = mddev->private; 569 551 struct strip_zone *zone; 570 552 struct md_rdev *tmp_dev; 571 553 sector_t bio_sector; 572 554 sector_t sector; 555 + sector_t orig_sector; 573 556 unsigned chunk_sects; 574 557 unsigned sectors; 575 558 ··· 605 584 bio = split; 606 585 } 607 586 587 + orig_sector = sector; 608 588 zone = find_zone(mddev->private, &sector); 609 - tmp_dev = map_sector(mddev, zone, sector, &sector); 589 + switch (conf->layout) { 590 + case RAID0_ORIG_LAYOUT: 591 + tmp_dev = map_sector(mddev, zone, orig_sector, &sector); 592 + break; 593 + case RAID0_ALT_MULTIZONE_LAYOUT: 594 + tmp_dev = map_sector(mddev, zone, sector, &sector); 595 + break; 596 + default: 597 + WARN("md/raid0:%s: Invalid layout\n", mdname(mddev)); 598 + 
bio_io_error(bio); 599 + return true; 600 + } 601 + 602 + if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) { 603 + bio_io_error(bio); 604 + return true; 605 + } 606 + 610 607 bio_set_dev(bio, tmp_dev->bdev); 611 608 bio->bi_iter.bi_sector = sector + zone->dev_start + 612 609 tmp_dev->data_offset;
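The layout selection added to raid0.c above can be read as a three-step fallback: single-zone arrays always use the original layout, otherwise a valid layout recorded for the array wins, otherwise the default_layout module parameter is used, and assembly fails if none of these applies. A minimal userspace sketch of that decision follows. The function name toy_pick_raid0_layout() is hypothetical, and -EOPNOTSUPP is substituted for the kernel-internal -ENOTSUPP, which userspace errno.h does not define.

```c
#include <assert.h>
#include <errno.h>

/* Same values as the enum added in drivers/md/raid0.h. */
enum r0layout {
	RAID0_ORIG_LAYOUT = 1,
	RAID0_ALT_MULTIZONE_LAYOUT = 2,
};

static int toy_layout_is_valid(int layout)
{
	return layout == RAID0_ORIG_LAYOUT ||
	       layout == RAID0_ALT_MULTIZONE_LAYOUT;
}

/*
 * Returns the layout to use, or a negative errno when a multi-zone
 * array has no usable layout information.
 * array_layout models mddev->layout; default_layout models the
 * module parameter of the same name.
 */
static int toy_pick_raid0_layout(int nr_strip_zones, int array_layout,
				 int default_layout)
{
	assert(nr_strip_zones >= 1);

	if (nr_strip_zones == 1)
		return RAID0_ORIG_LAYOUT;	/* layouts are identical */
	if (toy_layout_is_valid(array_layout))
		return array_layout;
	if (toy_layout_is_valid(default_layout))
		return default_layout;
	return -EOPNOTSUPP;	/* assumption: stands in for -ENOTSUPP */
}
```

With the md.c hunk setting mddev->layout to -1 for RAID0 arrays created through the legacy ioctl path, a multi-zone array reaches the third step, so an unset default_layout (0) makes assembly fail rather than guess.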

+14

drivers/md/raid0.h

@@ -8,11 +8,25 @@
 	int nb_dev;	/* # of devices attached to the zone */
 };
 
+/* Linux 3.14 (20d0189b101) made an unintended change to
+ * the RAID0 layout for multi-zone arrays (where devices aren't all
+ * the same size.
+ * RAID0_ORIG_LAYOUT restores the original layout
+ * RAID0_ALT_MULTIZONE_LAYOUT uses the altered layout
+ * The layouts are identical when there is only one zone (all
+ * devices the same size).
+ */
+
+enum r0layout {
+	RAID0_ORIG_LAYOUT = 1,
+	RAID0_ALT_MULTIZONE_LAYOUT = 2,
+};
 struct r0conf {
 	struct strip_zone *strip_zone;
 	struct md_rdev **devlist;	/* lists of rdevs, pointed to
 					 * by strip_zone->dev */
 	int nr_strip_zones;
+	enum r0layout layout;
 };
 
 #endif

+51 -38

drivers/md/raid1.c

··· 447 447 /* We never try FailFast to WriteMostly devices */ 448 448 !test_bit(WriteMostly, &rdev->flags)) { 449 449 md_error(r1_bio->mddev, rdev); 450 - if (!test_bit(Faulty, &rdev->flags)) 451 - /* This is the only remaining device, 452 - * We need to retry the write without 453 - * FailFast 454 - */ 455 - set_bit(R1BIO_WriteError, &r1_bio->state); 456 - else { 457 - /* Finished with this branch */ 458 - r1_bio->bios[mirror] = NULL; 459 - to_put = bio; 460 - } 461 - } else 450 + } 451 + 452 + /* 453 + * When the device is faulty, it is not necessary to 454 + * handle write error. 455 + * For failfast, this is the only remaining device, 456 + * We need to retry the write without FailFast. 457 + */ 458 + if (!test_bit(Faulty, &rdev->flags)) 462 459 set_bit(R1BIO_WriteError, &r1_bio->state); 460 + else { 461 + /* Finished with this branch */ 462 + r1_bio->bios[mirror] = NULL; 463 + to_put = bio; 464 + } 463 465 } else { 464 466 /* 465 467 * Set R1BIO_Uptodate in our master bio, so that we ··· 874 872 * backgroup IO calls must call raise_barrier. Once that returns 875 873 * there is no normal IO happeing. It must arrange to call 876 874 * lower_barrier when the particular background IO completes. 875 + * 876 + * If resync/recovery is interrupted, returns -EINTR; 877 + * Otherwise, returns 0. 877 878 */ 878 - static sector_t raise_barrier(struct r1conf *conf, sector_t sector_nr) 879 + static int raise_barrier(struct r1conf *conf, sector_t sector_nr) 879 880 { 880 881 int idx = sector_to_idx(sector_nr); 881 882 ··· 1617 1612 1618 1613 /* 1619 1614 * If it is not operational, then we have already marked it as dead 1620 - * else if it is the last working disks, ignore the error, let the 1621 - * next level up know. 1615 + * else if it is the last working disks with "fail_last_dev == false", 1616 + * ignore the error, let the next level up know. 
1622 1617 * else mark the drive as failed 1623 1618 */ 1624 1619 spin_lock_irqsave(&conf->device_lock, flags); 1625 - if (test_bit(In_sync, &rdev->flags) 1620 + if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev 1626 1621 && (conf->raid_disks - mddev->degraded) == 1) { 1627 1622 /* 1628 1623 * Don't fail the drive, act as though we were just a ··· 1906 1901 } while (sectors_to_go > 0); 1907 1902 } 1908 1903 1904 + static void put_sync_write_buf(struct r1bio *r1_bio, int uptodate) 1905 + { 1906 + if (atomic_dec_and_test(&r1_bio->remaining)) { 1907 + struct mddev *mddev = r1_bio->mddev; 1908 + int s = r1_bio->sectors; 1909 + 1910 + if (test_bit(R1BIO_MadeGood, &r1_bio->state) || 1911 + test_bit(R1BIO_WriteError, &r1_bio->state)) 1912 + reschedule_retry(r1_bio); 1913 + else { 1914 + put_buf(r1_bio); 1915 + md_done_sync(mddev, s, uptodate); 1916 + } 1917 + } 1918 + } 1919 + 1909 1920 static void end_sync_write(struct bio *bio) 1910 1921 { 1911 1922 int uptodate = !bio->bi_status; ··· 1948 1927 ) 1949 1928 set_bit(R1BIO_MadeGood, &r1_bio->state); 1950 1929 1951 - if (atomic_dec_and_test(&r1_bio->remaining)) { 1952 - int s = r1_bio->sectors; 1953 - if (test_bit(R1BIO_MadeGood, &r1_bio->state) || 1954 - test_bit(R1BIO_WriteError, &r1_bio->state)) 1955 - reschedule_retry(r1_bio); 1956 - else { 1957 - put_buf(r1_bio); 1958 - md_done_sync(mddev, s, uptodate); 1959 - } 1960 - } 1930 + put_sync_write_buf(r1_bio, uptodate); 1961 1931 } 1962 1932 1963 1933 static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector, ··· 2231 2219 generic_make_request(wbio); 2232 2220 } 2233 2221 2234 - if (atomic_dec_and_test(&r1_bio->remaining)) { 2235 - /* if we're here, all write(s) have completed, so clean up */ 2236 - int s = r1_bio->sectors; 2237 - if (test_bit(R1BIO_MadeGood, &r1_bio->state) || 2238 - test_bit(R1BIO_WriteError, &r1_bio->state)) 2239 - reschedule_retry(r1_bio); 2240 - else { 2241 - put_buf(r1_bio); 2242 - md_done_sync(mddev, s, 1); 2243 - } 2244 - } 2222 + 
put_sync_write_buf(r1_bio, 1); 2245 2223 } 2246 2224 2247 2225 /* ··· 3129 3127 !test_bit(In_sync, &conf->mirrors[i].rdev->flags) || 3130 3128 test_bit(Faulty, &conf->mirrors[i].rdev->flags)) 3131 3129 mddev->degraded++; 3130 + /* 3131 + * RAID1 needs at least one disk in active 3132 + */ 3133 + if (conf->raid_disks - mddev->degraded < 1) { 3134 + ret = -EINVAL; 3135 + goto abort; 3136 + } 3132 3137 3133 3138 if (conf->raid_disks - mddev->degraded == 1) 3134 3139 mddev->recovery_cp = MaxSector; ··· 3169 3160 ret = md_integrity_register(mddev); 3170 3161 if (ret) { 3171 3162 md_unregister_thread(&mddev->thread); 3172 - raid1_free(mddev, conf); 3163 + goto abort; 3173 3164 } 3165 + return 0; 3166 + 3167 + abort: 3168 + raid1_free(mddev, conf); 3174 3169 return ret; 3175 3170 } 3176 3171
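The error-handler hunk above changes when RAID1 refuses to mark a device as failed: the "last working device" exception now applies only while fail_last_dev is false. A minimal sketch of that predicate, with a hypothetical function name and plain parameters in place of the rdev and mddev fields, might look like this. It deliberately ignores the "already marked dead" case mentioned in the comment.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when the device should be marked Faulty.
 * in_sync models test_bit(In_sync, &rdev->flags); raid_disks and
 * degraded model conf->raid_disks and mddev->degraded.
 */
static bool toy_should_fail_device(bool in_sync, bool fail_last_dev,
				   int raid_disks, int degraded)
{
	assert(raid_disks >= degraded);

	bool last_working = in_sync && (raid_disks - degraded) == 1;

	/* Keep the last working device unless the admin opted in. */
	if (last_working && !fail_last_dev)
		return false;
	return true;
}
```

For a two-disk mirror with one member already gone, the remaining member is kept by default and is only failed when fail_last_dev has been set through the new sysfs attribute.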

+17 -15

drivers/md/raid10.c

··· 465 465 if (test_bit(FailFast, &rdev->flags) && 466 466 (bio->bi_opf & MD_FAILFAST)) { 467 467 md_error(rdev->mddev, rdev); 468 - if (!test_bit(Faulty, &rdev->flags)) 469 - /* This is the only remaining device, 470 - * We need to retry the write without 471 - * FailFast 472 - */ 473 - set_bit(R10BIO_WriteError, &r10_bio->state); 474 - else { 475 - r10_bio->devs[slot].bio = NULL; 476 - to_put = bio; 477 - dec_rdev = 1; 478 - } 479 - } else 468 + } 469 + 470 + /* 471 + * When the device is faulty, it is not necessary to 472 + * handle write error. 473 + * For failfast, this is the only remaining device, 474 + * We need to retry the write without FailFast. 475 + */ 476 + if (!test_bit(Faulty, &rdev->flags)) 480 477 set_bit(R10BIO_WriteError, &r10_bio->state); 478 + else { 479 + r10_bio->devs[slot].bio = NULL; 480 + to_put = bio; 481 + dec_rdev = 1; 482 + } 481 483 } 482 484 } else { 483 485 /* ··· 1640 1638 1641 1639 /* 1642 1640 * If it is not operational, then we have already marked it as dead 1643 - * else if it is the last working disks, ignore the error, let the 1644 - * next level up know. 1641 + * else if it is the last working disks with "fail_last_dev == false", 1642 + * ignore the error, let the next level up know. 1645 1643 * else mark the drive as failed 1646 1644 */ 1647 1645 spin_lock_irqsave(&conf->device_lock, flags); 1648 - if (test_bit(In_sync, &rdev->flags) 1646 + if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev 1649 1647 && !enough(conf, rdev->raid_disk)) { 1650 1648 /* 1651 1649 * Don't fail the drive, just return an IO error.

+18 -9

drivers/md/raid5.c

··· 2526 2526 int set_bad = 0; 2527 2527 2528 2528 clear_bit(R5_UPTODATE, &sh->dev[i].flags); 2529 - atomic_inc(&rdev->read_errors); 2529 + if (!(bi->bi_status == BLK_STS_PROTECTION)) 2530 + atomic_inc(&rdev->read_errors); 2530 2531 if (test_bit(R5_ReadRepl, &sh->dev[i].flags)) 2531 2532 pr_warn_ratelimited( 2532 2533 "md/raid:%s: read error on replacement device (sector %llu on %s).\n", ··· 2550 2549 (unsigned long long)s, 2551 2550 bdn); 2552 2551 } else if (atomic_read(&rdev->read_errors) 2553 - > conf->max_nr_stripes) 2554 - pr_warn("md/raid:%s: Too many read errors, failing device %s.\n", 2555 - mdname(conf->mddev), bdn); 2556 - else 2552 + > conf->max_nr_stripes) { 2553 + if (!test_bit(Faulty, &rdev->flags)) { 2554 + pr_warn("md/raid:%s: %d read_errors > %d stripes\n", 2555 + mdname(conf->mddev), 2556 + atomic_read(&rdev->read_errors), 2557 + conf->max_nr_stripes); 2558 + pr_warn("md/raid:%s: Too many read errors, failing device %s.\n", 2559 + mdname(conf->mddev), bdn); 2560 + } 2561 + } else 2557 2562 retry = 1; 2558 2563 if (set_bad && test_bit(In_sync, &rdev->flags) 2559 2564 && !test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) 2560 2565 retry = 1; 2561 2566 if (retry) 2562 - if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) { 2567 + if (sh->qd_idx >= 0 && sh->pd_idx == i) 2568 + set_bit(R5_ReadError, &sh->dev[i].flags); 2569 + else if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) { 2563 2570 set_bit(R5_ReadError, &sh->dev[i].flags); 2564 2571 clear_bit(R5_ReadNoMerge, &sh->dev[i].flags); 2565 2572 } else ··· 4621 4612 (1 << STRIPE_FULL_WRITE) | 4622 4613 (1 << STRIPE_BIOFILL_RUN) | 4623 4614 (1 << STRIPE_COMPUTE_RUN) | 4624 - (1 << STRIPE_OPS_REQ_PENDING) | 4625 4615 (1 << STRIPE_DISCARD) | 4626 4616 (1 << STRIPE_BATCH_READY) | 4627 4617 (1 << STRIPE_BATCH_ERR) | ··· 5499 5491 return; 5500 5492 5501 5493 logical_sector = bi->bi_iter.bi_sector & ~((sector_t)STRIPE_SECTORS-1); 5502 - last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size>>9); 5494 + 
last_sector = bio_end_sector(bi); 5503 5495 5504 5496 bi->bi_next = NULL; 5505 5497 ··· 5726 5718 do_flush = false; 5727 5719 } 5728 5720 5729 - set_bit(STRIPE_HANDLE, &sh->state); 5721 + if (!sh->batch_head) 5722 + set_bit(STRIPE_HANDLE, &sh->state); 5730 5723 clear_bit(STRIPE_DELAYED, &sh->state); 5731 5724 if ((!sh->batch_head || sh == sh->batch_head) && 5732 5725 (bi->bi_opf & REQ_SYNC) &&

+1 -4

drivers/md/raid5.h

@@ -357,7 +357,6 @@
 	STRIPE_FULL_WRITE,	/* all blocks are set to be overwritten */
 	STRIPE_BIOFILL_RUN,
 	STRIPE_COMPUTE_RUN,
-	STRIPE_OPS_REQ_PENDING,
 	STRIPE_ON_UNPLUG_LIST,
 	STRIPE_DISCARD,
 	STRIPE_ON_RELEASE_LIST,
@@ -492,9 +493,7 @@
  */
 static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
 {
-	int sectors = bio_sectors(bio);
-
-	if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS)
+	if (bio_end_sector(bio) < sector + STRIPE_SECTORS)
 		return bio->bi_next;
 	else
 		return NULL;
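The r5_next_bio() change above is behaviour-preserving: bio_end_sector() is the start sector plus the bio length in 512-byte sectors, which is what the removed lines computed by hand. A small userspace model makes the comparison concrete. Everything prefixed toy_ or TOY_ is hypothetical, and the stripe size of 8 sectors is an assumption corresponding to a 4 KiB stripe unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_sector_t;

#define TOY_STRIPE_SECTORS 8	/* assumption: 4 KiB stripe unit */

/* Only the fields the helper needs. */
struct toy_bio {
	toy_sector_t bi_sector;	/* first sector of the bio */
	uint32_t bi_size;	/* remaining length in bytes */
	struct toy_bio *bi_next;
};

/* Models bio_end_sector(): start sector plus length in sectors. */
static toy_sector_t toy_bio_end_sector(const struct toy_bio *bio)
{
	return bio->bi_sector + (bio->bi_size >> 9);
}

/*
 * Models r5_next_bio(): follow the chain only while the current bio
 * ends before the end of the stripe unit that starts at 'sector'.
 */
static struct toy_bio *toy_next_bio(struct toy_bio *bio, toy_sector_t sector)
{
	assert(bio != NULL);

	if (toy_bio_end_sector(bio) < sector + TOY_STRIPE_SECTORS)
		return bio->bi_next;
	return NULL;
}
```

A 2 KiB bio starting at sector 0 ends at sector 4, so the chain continues; a 4 KiB bio starting at sector 0 ends exactly at sector 8 and terminates the walk.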

+1

drivers/nvme/host/Kconfig

@@ -64,6 +64,7 @@
 	depends on INET
 	depends on BLK_DEV_NVME
 	select NVME_FABRICS
+	select CRYPTO_CRC32C
 	help
 	  This provides support for the NVMe over Fabrics protocol using
 	  the TCP transport. This allows you to use remote block devices

+140 -61

drivers/nvme/host/core.c

··· 22 22 #include <linux/pm_qos.h> 23 23 #include <asm/unaligned.h> 24 24 25 - #define CREATE_TRACE_POINTS 26 - #include "trace.h" 27 - 28 25 #include "nvme.h" 29 26 #include "fabrics.h" 27 + 28 + #define CREATE_TRACE_POINTS 29 + #include "trace.h" 30 30 31 31 #define NVME_MINORS (1U << MINORBITS) 32 32 ··· 81 81 struct workqueue_struct *nvme_delete_wq; 82 82 EXPORT_SYMBOL_GPL(nvme_delete_wq); 83 83 84 - static DEFINE_IDA(nvme_subsystems_ida); 85 84 static LIST_HEAD(nvme_subsystems); 86 85 static DEFINE_MUTEX(nvme_subsystems_lock); 87 86 ··· 196 197 return ns->pi_type && ns->ms == sizeof(struct t10_pi_tuple); 197 198 } 198 199 199 - static blk_status_t nvme_error_status(struct request *req) 200 + static blk_status_t nvme_error_status(u16 status) 200 201 { 201 - switch (nvme_req(req)->status & 0x7ff) { 202 + switch (status & 0x7ff) { 202 203 case NVME_SC_SUCCESS: 203 204 return BLK_STS_OK; 204 205 case NVME_SC_CAP_EXCEEDED: ··· 225 226 return BLK_STS_PROTECTION; 226 227 case NVME_SC_RESERVATION_CONFLICT: 227 228 return BLK_STS_NEXUS; 229 + case NVME_SC_HOST_PATH_ERROR: 230 + return BLK_STS_TRANSPORT; 228 231 default: 229 232 return BLK_STS_IOERR; 230 233 } ··· 261 260 262 261 void nvme_complete_rq(struct request *req) 263 262 { 264 - blk_status_t status = nvme_error_status(req); 263 + blk_status_t status = nvme_error_status(nvme_req(req)->status); 265 264 266 265 trace_nvme_complete_rq(req); 267 266 ··· 280 279 return; 281 280 } 282 281 } 282 + 283 + nvme_trace_bio_complete(req, status); 283 284 blk_mq_end_request(req, status); 284 285 } 285 286 EXPORT_SYMBOL_GPL(nvme_complete_rq); ··· 291 288 dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device, 292 289 "Cancelling I/O %d", req->tag); 293 290 294 - nvme_req(req)->status = NVME_SC_ABORT_REQ; 295 - blk_mq_complete_request_sync(req); 291 + /* don't abort one completed request */ 292 + if (blk_mq_request_completed(req)) 293 + return true; 294 + 295 + nvme_req(req)->status = NVME_SC_HOST_PATH_ERROR; 296 + 
blk_mq_complete_request(req); 296 297 return true; 297 298 } 298 299 EXPORT_SYMBOL_GPL(nvme_cancel_request); ··· 1095 1088 NVME_IDENTIFY_DATA_SIZE); 1096 1089 } 1097 1090 1098 - static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl, 1099 - unsigned nsid) 1091 + static int nvme_identify_ns(struct nvme_ctrl *ctrl, 1092 + unsigned nsid, struct nvme_id_ns **id) 1100 1093 { 1101 - struct nvme_id_ns *id; 1102 1094 struct nvme_command c = { }; 1103 1095 int error; 1104 1096 ··· 1106 1100 c.identify.nsid = cpu_to_le32(nsid); 1107 1101 c.identify.cns = NVME_ID_CNS_NS; 1108 1102 1109 - id = kmalloc(sizeof(*id), GFP_KERNEL); 1110 - if (!id) 1111 - return NULL; 1103 + *id = kmalloc(sizeof(**id), GFP_KERNEL); 1104 + if (!*id) 1105 + return -ENOMEM; 1112 1106 1113 - error = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id)); 1107 + error = nvme_submit_sync_cmd(ctrl->admin_q, &c, *id, sizeof(**id)); 1114 1108 if (error) { 1115 1109 dev_warn(ctrl->device, "Identify namespace failed (%d)\n", error); 1116 - kfree(id); 1117 - return NULL; 1110 + kfree(*id); 1118 1111 } 1119 1112 1120 - return id; 1113 + return error; 1121 1114 } 1122 1115 1123 1116 static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid, ··· 1185 1180 EXPORT_SYMBOL_GPL(nvme_set_queue_count); 1186 1181 1187 1182 #define NVME_AEN_SUPPORTED \ 1188 - (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | NVME_AEN_CFG_ANA_CHANGE) 1183 + (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \ 1184 + NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE) 1189 1185 1190 1186 static void nvme_enable_aen(struct nvme_ctrl *ctrl) 1191 1187 { ··· 1201 1195 if (status) 1202 1196 dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n", 1203 1197 supported_aens); 1198 + 1199 + queue_work(nvme_wq, &ctrl->async_event_work); 1204 1200 } 1205 1201 1206 1202 static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) ··· 1602 1594 blk_queue_max_write_zeroes_sectors(disk->queue, max_sectors); 1603 1595 
} 1604 1596 1605 - static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, 1597 + static int nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid, 1606 1598 struct nvme_id_ns *id, struct nvme_ns_ids *ids) 1607 1599 { 1600 + int ret = 0; 1601 + 1608 1602 memset(ids, 0, sizeof(*ids)); 1609 1603 1610 1604 if (ctrl->vs >= NVME_VS(1, 1, 0)) ··· 1617 1607 /* Don't treat error as fatal we potentially 1618 1608 * already have a NGUID or EUI-64 1619 1609 */ 1620 - if (nvme_identify_ns_descs(ctrl, nsid, ids)) 1610 + ret = nvme_identify_ns_descs(ctrl, nsid, ids); 1611 + if (ret) 1621 1612 dev_warn(ctrl->device, 1622 - "%s: Identify Descriptors failed\n", __func__); 1613 + "Identify Descriptors failed (%d)\n", ret); 1623 1614 } 1615 + return ret; 1624 1616 } 1625 1617 1626 1618 static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids) ··· 1750 1738 return -ENODEV; 1751 1739 } 1752 1740 1753 - id = nvme_identify_ns(ctrl, ns->head->ns_id); 1754 - if (!id) 1755 - return -ENODEV; 1741 + ret = nvme_identify_ns(ctrl, ns->head->ns_id, &id); 1742 + if (ret) 1743 + goto out; 1756 1744 1757 1745 if (id->ncap == 0) { 1758 1746 ret = -ENODEV; 1759 - goto out; 1747 + goto free_id; 1760 1748 } 1761 1749 1762 1750 __nvme_revalidate_disk(disk, id); 1763 - nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids); 1751 + ret = nvme_report_ns_ids(ctrl, ns->head->ns_id, id, &ids); 1752 + if (ret) 1753 + goto free_id; 1754 + 1764 1755 if (!nvme_ns_ids_equal(&ns->head->ids, &ids)) { 1765 1756 dev_err(ctrl->device, 1766 1757 "identifiers changed for nsid %d\n", ns->head->ns_id); 1767 1758 ret = -ENODEV; 1768 1759 } 1769 1760 1770 - out: 1761 + free_id: 1771 1762 kfree(id); 1763 + out: 1764 + /* 1765 + * Only fail the function if we got a fatal error back from the 1766 + * device, otherwise ignore the error and just move on. 
1767 + */ 1768 + if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR))) 1769 + ret = 0; 1770 + else if (ret > 0) 1771 + ret = blk_status_to_errno(nvme_error_status(ret)); 1772 1772 return ret; 1773 1773 } 1774 1774 ··· 1976 1952 * bits', but doing so may cause the device to complete commands to the 1977 1953 * admin queue ... and we don't know what memory that might be pointing at! 1978 1954 */ 1979 - int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap) 1955 + int nvme_disable_ctrl(struct nvme_ctrl *ctrl) 1980 1956 { 1981 1957 int ret; 1982 1958 ··· 1990 1966 if (ctrl->quirks & NVME_QUIRK_DELAY_BEFORE_CHK_RDY) 1991 1967 msleep(NVME_QUIRK_DELAY_AMOUNT); 1992 1968 1993 - return nvme_wait_ready(ctrl, cap, false); 1969 + return nvme_wait_ready(ctrl, ctrl->cap, false); 1994 1970 } 1995 1971 EXPORT_SYMBOL_GPL(nvme_disable_ctrl); 1996 1972 1997 - int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap) 1973 + int nvme_enable_ctrl(struct nvme_ctrl *ctrl) 1998 1974 { 1999 1975 /* 2000 1976 * Default to a 4K page size, with the intention to update this 2001 1977 * path in the future to accomodate architectures with differing 2002 1978 * kernel and IO page sizes. 
2003 1979 */ 2004 - unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12, page_shift = 12; 1980 + unsigned dev_page_min, page_shift = 12; 2005 1981 int ret; 1982 + 1983 + ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap); 1984 + if (ret) { 1985 + dev_err(ctrl->device, "Reading CAP failed (%d)\n", ret); 1986 + return ret; 1987 + } 1988 + dev_page_min = NVME_CAP_MPSMIN(ctrl->cap) + 12; 2006 1989 2007 1990 if (page_shift < dev_page_min) { 2008 1991 dev_err(ctrl->device, ··· 2029 1998 ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config); 2030 1999 if (ret) 2031 2000 return ret; 2032 - return nvme_wait_ready(ctrl, cap, true); 2001 + return nvme_wait_ready(ctrl, ctrl->cap, true); 2033 2002 } 2034 2003 EXPORT_SYMBOL_GPL(nvme_enable_ctrl); 2035 2004 ··· 2363 2332 struct nvme_subsystem *subsys = 2364 2333 container_of(dev, struct nvme_subsystem, dev); 2365 2334 2366 - ida_simple_remove(&nvme_subsystems_ida, subsys->instance); 2335 + if (subsys->instance >= 0) 2336 + ida_simple_remove(&nvme_instance_ida, subsys->instance); 2367 2337 kfree(subsys); 2368 2338 } 2369 2339 ··· 2392 2360 struct nvme_subsystem *subsys; 2393 2361 2394 2362 lockdep_assert_held(&nvme_subsystems_lock); 2363 + 2364 + /* 2365 + * Fail matches for discovery subsystems. This results 2366 + * in each discovery controller bound to a unique subsystem. 2367 + * This avoids issues with validating controller values 2368 + * that can only be true when there is a single unique subsystem. 2369 + * There may be multiple and completely independent entities 2370 + * that provide discovery controllers. 
2371 + */ 2372 + if (!strcmp(subsysnqn, NVME_DISC_SUBSYS_NAME)) 2373 + return NULL; 2395 2374 2396 2375 list_for_each_entry(subsys, &nvme_subsystems, entry) { 2397 2376 if (strcmp(subsys->subnqn, subsysnqn)) ··· 2504 2461 subsys = kzalloc(sizeof(*subsys), GFP_KERNEL); 2505 2462 if (!subsys) 2506 2463 return -ENOMEM; 2507 - ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL); 2508 - if (ret < 0) { 2509 - kfree(subsys); 2510 - return ret; 2511 - } 2512 - subsys->instance = ret; 2464 + 2465 + subsys->instance = -1; 2513 2466 mutex_init(&subsys->lock); 2514 2467 kref_init(&subsys->ref); 2515 2468 INIT_LIST_HEAD(&subsys->ctrls); ··· 2524 2485 subsys->dev.class = nvme_subsys_class; 2525 2486 subsys->dev.release = nvme_release_subsystem; 2526 2487 subsys->dev.groups = nvme_subsys_attrs_groups; 2527 - dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance); 2488 + dev_set_name(&subsys->dev, "nvme-subsys%d", ctrl->instance); 2528 2489 device_initialize(&subsys->dev); 2529 2490 2530 2491 mutex_lock(&nvme_subsystems_lock); ··· 2556 2517 goto out_put_subsystem; 2557 2518 } 2558 2519 2520 + if (!found) 2521 + subsys->instance = ctrl->instance; 2559 2522 ctrl->subsys = subsys; 2560 2523 list_add_tail(&ctrl->subsys_entry, &subsys->ctrls); 2561 2524 mutex_unlock(&nvme_subsystems_lock); ··· 2615 2574 int nvme_init_identify(struct nvme_ctrl *ctrl) 2616 2575 { 2617 2576 struct nvme_id_ctrl *id; 2618 - u64 cap; 2619 2577 int ret, page_shift; 2620 2578 u32 max_hw_sectors; 2621 2579 bool prev_apst_enabled; ··· 2624 2584 dev_err(ctrl->device, "Reading VS failed (%d)\n", ret); 2625 2585 return ret; 2626 2586 } 2627 - 2628 - ret = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &cap); 2629 - if (ret) { 2630 - dev_err(ctrl->device, "Reading CAP failed (%d)\n", ret); 2631 - return ret; 2632 - } 2633 - page_shift = NVME_CAP_MPSMIN(cap) + 12; 2587 + page_shift = NVME_CAP_MPSMIN(ctrl->cap) + 12; 2588 + ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize); 2634 2589 2635 
2590 if (ctrl->vs >= NVME_VS(1, 1, 0)) 2636 - ctrl->subsystem = NVME_CAP_NSSRC(cap); 2591 + ctrl->subsystem = NVME_CAP_NSSRC(ctrl->cap); 2637 2592 2638 2593 ret = nvme_identify_ctrl(ctrl, &id); 2639 2594 if (ret) { ··· 3219 3184 head->ns_id = nsid; 3220 3185 kref_init(&head->ref); 3221 3186 3222 - nvme_report_ns_ids(ctrl, nsid, id, &head->ids); 3187 + ret = nvme_report_ns_ids(ctrl, nsid, id, &head->ids); 3188 + if (ret) 3189 + goto out_cleanup_srcu; 3223 3190 3224 3191 ret = __nvme_check_ids(ctrl->subsys, head); 3225 3192 if (ret) { ··· 3246 3209 out_free_head: 3247 3210 kfree(head); 3248 3211 out: 3212 + if (ret > 0) 3213 + ret = blk_status_to_errno(nvme_error_status(ret)); 3249 3214 return ERR_PTR(ret); 3250 3215 } 3251 3216 ··· 3271 3232 } else { 3272 3233 struct nvme_ns_ids ids; 3273 3234 3274 - nvme_report_ns_ids(ctrl, nsid, id, &ids); 3235 + ret = nvme_report_ns_ids(ctrl, nsid, id, &ids); 3236 + if (ret) 3237 + goto out_unlock; 3238 + 3275 3239 if (!nvme_ns_ids_equal(&head->ids, &ids)) { 3276 3240 dev_err(ctrl->device, 3277 3241 "IDs don't match for shared namespace %d\n", ··· 3289 3247 3290 3248 out_unlock: 3291 3249 mutex_unlock(&ctrl->subsys->lock); 3250 + if (ret > 0) 3251 + ret = blk_status_to_errno(nvme_error_status(ret)); 3292 3252 return ret; 3293 3253 } 3294 3254 ··· 3382 3338 blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift); 3383 3339 nvme_set_queue_limits(ctrl, ns->queue); 3384 3340 3385 - id = nvme_identify_ns(ctrl, nsid); 3386 - if (!id) { 3387 - ret = -EIO; 3341 + ret = nvme_identify_ns(ctrl, nsid, &id); 3342 + if (ret) 3388 3343 goto out_free_queue; 3389 - } 3390 3344 3391 3345 if (id->ncap == 0) { 3392 3346 ret = -EINVAL; ··· 3446 3404 blk_cleanup_queue(ns->queue); 3447 3405 out_free_ns: 3448 3406 kfree(ns); 3407 + if (ret > 0) 3408 + ret = blk_status_to_errno(nvme_error_status(ret)); 3449 3409 return ret; 3450 3410 } 3451 3411 ··· 3661 3617 } 3662 3618 EXPORT_SYMBOL_GPL(nvme_remove_namespaces); 3663 3619 3620 + static int 
nvme_class_uevent(struct device *dev, struct kobj_uevent_env *env) 3621 + { 3622 + struct nvme_ctrl *ctrl = 3623 + container_of(dev, struct nvme_ctrl, ctrl_device); 3624 + struct nvmf_ctrl_options *opts = ctrl->opts; 3625 + int ret; 3626 + 3627 + ret = add_uevent_var(env, "NVME_TRTYPE=%s", ctrl->ops->name); 3628 + if (ret) 3629 + return ret; 3630 + 3631 + if (opts) { 3632 + ret = add_uevent_var(env, "NVME_TRADDR=%s", opts->traddr); 3633 + if (ret) 3634 + return ret; 3635 + 3636 + ret = add_uevent_var(env, "NVME_TRSVCID=%s", 3637 + opts->trsvcid ?: "none"); 3638 + if (ret) 3639 + return ret; 3640 + 3641 + ret = add_uevent_var(env, "NVME_HOST_TRADDR=%s", 3642 + opts->host_traddr ?: "none"); 3643 + } 3644 + return ret; 3645 + } 3646 + 3664 3647 static void nvme_aen_uevent(struct nvme_ctrl *ctrl) 3665 3648 { 3666 3649 char *envp[2] = { NULL, NULL }; ··· 3794 3723 queue_work(nvme_wq, &ctrl->ana_work); 3795 3724 break; 3796 3725 #endif 3726 + case NVME_AER_NOTICE_DISC_CHANGED: 3727 + ctrl->aen_result = result; 3728 + break; 3797 3729 default: 3798 3730 dev_warn(ctrl->device, "async event result %08x\n", result); 3799 3731 } ··· 3843 3769 if (ctrl->kato) 3844 3770 nvme_start_keep_alive(ctrl); 3845 3771 3772 + nvme_enable_aen(ctrl); 3773 + 3846 3774 if (ctrl->queue_count > 1) { 3847 3775 nvme_queue_scan(ctrl); 3848 - nvme_enable_aen(ctrl); 3849 - queue_work(nvme_wq, &ctrl->async_event_work); 3850 3776 nvme_start_queues(ctrl); 3851 3777 } 3852 3778 } ··· 3866 3792 container_of(dev, struct nvme_ctrl, ctrl_device); 3867 3793 struct nvme_subsystem *subsys = ctrl->subsys; 3868 3794 3869 - ida_simple_remove(&nvme_instance_ida, ctrl->instance); 3795 + if (subsys && ctrl->instance != subsys->instance) 3796 + ida_simple_remove(&nvme_instance_ida, ctrl->instance); 3797 + 3870 3798 kfree(ctrl->effects); 3871 3799 nvme_mpath_uninit(ctrl); 3872 3800 __free_page(ctrl->discard_page); ··· 4068 3992 list_for_each_entry(ns, &ctrl->namespaces, list) 4069 3993 blk_sync_queue(ns->queue); 4070 
3994 up_read(&ctrl->namespaces_rwsem); 3995 + 3996 + if (ctrl->admin_q) 3997 + blk_sync_queue(ctrl->admin_q); 4071 3998 } 4072 3999 EXPORT_SYMBOL_GPL(nvme_sync_queues); 4073 4000 ··· 4129 4050 result = PTR_ERR(nvme_class); 4130 4051 goto unregister_chrdev; 4131 4052 } 4053 + nvme_class->dev_uevent = nvme_class_uevent; 4132 4054 4133 4055 nvme_subsys_class = class_create(THIS_MODULE, "nvme-subsystem"); 4134 4056 if (IS_ERR(nvme_subsys_class)) { ··· 4154 4074 4155 4075 static void __exit nvme_core_exit(void) 4156 4076 { 4157 - ida_destroy(&nvme_subsystems_ida); 4158 4077 class_destroy(nvme_subsys_class); 4159 4078 class_destroy(nvme_class); 4160 4079 unregister_chrdev_region(nvme_chr_devt, NVME_MINORS);
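The new tail of nvme_revalidate_disk() in the core.c diff above separates fatal from non-fatal failures: -ENOMEM and NVMe statuses without the "do not retry" bit are swallowed, other positive NVMe statuses are translated to an errno, and negative errnos pass through. A minimal userspace sketch of that filter follows. The names are hypothetical, TOY_SC_DNR assumes the value 0x4000 for NVME_SC_DNR, and the translation helper returns -EIO for every status, which is a simplification of blk_status_to_errno(nvme_error_status(ret)).

```c
#include <assert.h>
#include <errno.h>

#define TOY_SC_DNR 0x4000	/* assumption: value of NVME_SC_DNR */

/* Simplified stand-in for blk_status_to_errno(nvme_error_status()). */
static int toy_status_to_errno(int status)
{
	assert(status > 0);
	return -EIO;
}

/*
 * ret < 0: kernel-style errno, ret > 0: NVMe status code, 0: success.
 * Returns the value the revalidate path would hand to its caller.
 */
static int toy_filter_revalidate_error(int ret)
{
	/* Transient problems: keep the namespace and move on. */
	if (ret == -ENOMEM || (ret > 0 && !(ret & TOY_SC_DNR)))
		return 0;
	/* Device said "do not retry": report it as an errno. */
	if (ret > 0)
		return toy_status_to_errno(ret);
	return ret;
}
```

The effect is that a retryable failure during revalidation does not cause the namespace to be dropped, while a status carrying the DNR bit or a hard errno such as -ENODEV still does.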

+24 -14

drivers/nvme/host/fabrics.c

··· 150 150 cmd.prop_get.fctype = nvme_fabrics_type_property_get; 151 151 cmd.prop_get.offset = cpu_to_le32(off); 152 152 153 - ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, &res, NULL, 0, 0, 153 + ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0, 0, 154 154 NVME_QID_ANY, 0, 0, false); 155 155 156 156 if (ret >= 0) ··· 197 197 cmd.prop_get.attrib = 1; 198 198 cmd.prop_get.offset = cpu_to_le32(off); 199 199 200 - ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, &res, NULL, 0, 0, 200 + ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, NULL, 0, 0, 201 201 NVME_QID_ANY, 0, 0, false); 202 202 203 203 if (ret >= 0) ··· 243 243 cmd.prop_set.offset = cpu_to_le32(off); 244 244 cmd.prop_set.value = cpu_to_le64(val); 245 245 246 - ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, NULL, NULL, 0, 0, 246 + ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, NULL, NULL, 0, 0, 247 247 NVME_QID_ANY, 0, 0, false); 248 248 if (unlikely(ret)) 249 249 dev_err(ctrl->device, ··· 381 381 * Set keep-alive timeout in seconds granularity (ms * 1000) 382 382 * and add a grace period for controller kato enforcement 383 383 */ 384 - cmd.connect.kato = ctrl->opts->discovery_nqn ? 0 : 385 - cpu_to_le32((ctrl->kato + NVME_KATO_GRACE) * 1000); 384 + cmd.connect.kato = ctrl->kato ? 
385 + cpu_to_le32((ctrl->kato + NVME_KATO_GRACE) * 1000) : 0; 386 386 387 387 if (ctrl->opts->disable_sqflow) 388 388 cmd.connect.cattr |= NVME_CONNECT_DISABLE_SQFLOW; ··· 396 396 strncpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE); 397 397 strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE); 398 398 399 - ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, &res, 399 + ret = __nvme_submit_sync_cmd(ctrl->fabrics_q, &cmd, &res, 400 400 data, sizeof(*data), 0, NVME_QID_ANY, 1, 401 401 BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT, false); 402 402 if (ret) { ··· 611 611 { NVMF_OPT_DATA_DIGEST, "data_digest" }, 612 612 { NVMF_OPT_NR_WRITE_QUEUES, "nr_write_queues=%d" }, 613 613 { NVMF_OPT_NR_POLL_QUEUES, "nr_poll_queues=%d" }, 614 + { NVMF_OPT_TOS, "tos=%d" }, 614 615 { NVMF_OPT_ERR, NULL } 615 616 }; 616 617 ··· 633 632 opts->duplicate_connect = false; 634 633 opts->hdr_digest = false; 635 634 opts->data_digest = false; 635 + opts->tos = -1; /* < 0 == use transport default */ 636 636 637 637 options = o = kstrdup(buf, GFP_KERNEL); 638 638 if (!options) ··· 740 738 pr_warn("keep_alive_tmo 0 won't execute keep alives!!!\n"); 741 739 } 742 740 opts->kato = token; 743 - 744 - if (opts->discovery_nqn && opts->kato) { 745 - pr_err("Discovery controllers cannot accept KATO != 0\n"); 746 - ret = -EINVAL; 747 - goto out; 748 - } 749 - 750 741 break; 751 742 case NVMF_OPT_CTRL_LOSS_TMO: 752 743 if (match_int(args, &token)) { ··· 851 856 } 852 857 opts->nr_poll_queues = token; 853 858 break; 859 + case NVMF_OPT_TOS: 860 + if (match_int(args, &token)) { 861 + ret = -EINVAL; 862 + goto out; 863 + } 864 + if (token < 0) { 865 + pr_err("Invalid type of service %d\n", token); 866 + ret = -EINVAL; 867 + goto out; 868 + } 869 + if (token > 255) { 870 + pr_warn("Clamping type of service to 255\n"); 871 + token = 255; 872 + } 873 + opts->tos = token; 874 + break; 854 875 default: 855 876 pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", 856 877 p); 
···
876 865     }
877 866
878 867     if (opts->discovery_nqn) {
879 -         opts->kato = 0;
880 868         opts->nr_io_queues = 0;
881 869         opts->nr_write_queues = 0;
882 870         opts->nr_poll_queues = 0;
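Reviewer note on the new "tos=%d" option: the validation in the NVMF_OPT_TOS case can be sketched in user space as below. This is only an illustration under stated assumptions. sketch_parse_tos() and SKETCH_EINVAL are hypothetical names, and strtol() stands in for the kernel's match_int(). Only the three outcomes (reject negatives, clamp above 255, otherwise accept) are taken from the hunk.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's EINVAL. */
#define SKETCH_EINVAL 22

/*
 * Validate a "tos=" value the way the NVMF_OPT_TOS case does:
 * negative -> error, above 255 -> clamp with a warning, else accept.
 * Returns 0 on success and stores the result in *tos; on error *tos
 * is left untouched (the driver default is -1, "use transport default").
 */
int sketch_parse_tos(const char *arg, int *tos)
{
	char *end;
	long token = strtol(arg, &end, 10);

	if (end == arg || *end != '\0')
		return -SKETCH_EINVAL;	/* not an integer */
	if (token < 0) {
		fprintf(stderr, "Invalid type of service %ld\n", token);
		return -SKETCH_EINVAL;
	}
	if (token > 255) {
		fprintf(stderr, "Clamping type of service to 255\n");
		token = 255;
	}
	*tos = (int)token;
	return 0;
}
```

A caller would initialise the field to -1 first, so that a connect string without "tos=" leaves the transport default in place.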

+3

drivers/nvme/host/fabrics.h

···
55 55     NVMF_OPT_DATA_DIGEST = 1 << 16,
56 56     NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
57 57     NVMF_OPT_NR_POLL_QUEUES = 1 << 18,
58 +     NVMF_OPT_TOS = 1 << 19,
58 59 };
59 60
60 61 /**
···
88 87  * @data_digest: generate/verify data digest (TCP)
89 88  * @nr_write_queues: number of queues for write I/O
90 89  * @nr_poll_queues: number of queues for polling I/O
90 +  * @tos: type of service
91 91  */
92 92 struct nvmf_ctrl_options {
93 93     unsigned mask;
···
110 108     bool data_digest;
111 109     unsigned int nr_write_queues;
112 110     unsigned int nr_poll_queues;
111 +     int tos;
113 112 };
114 113
115 114 /*

+47 -26

drivers/nvme/host/fc.c

··· 1608 1608 sizeof(op->rsp_iu), DMA_FROM_DEVICE); 1609 1609 1610 1610 if (opstate == FCPOP_STATE_ABORTED) 1611 - status = cpu_to_le16(NVME_SC_ABORT_REQ << 1); 1612 - else if (freq->status) 1613 - status = cpu_to_le16(NVME_SC_INTERNAL << 1); 1611 + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); 1612 + else if (freq->status) { 1613 + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); 1614 + dev_info(ctrl->ctrl.device, 1615 + "NVME-FC{%d}: io failed due to lldd error %d\n", 1616 + ctrl->cnum, freq->status); 1617 + } 1614 1618 1615 1619 /* 1616 1620 * For the linux implementation, if we have an unsuccesful ··· 1641 1637 * no payload in the CQE by the transport. 1642 1638 */ 1643 1639 if (freq->transferred_length != 1644 - be32_to_cpu(op->cmd_iu.data_len)) { 1645 - status = cpu_to_le16(NVME_SC_INTERNAL << 1); 1640 + be32_to_cpu(op->cmd_iu.data_len)) { 1641 + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); 1642 + dev_info(ctrl->ctrl.device, 1643 + "NVME-FC{%d}: io failed due to bad transfer " 1644 + "length: %d vs expected %d\n", 1645 + ctrl->cnum, freq->transferred_length, 1646 + be32_to_cpu(op->cmd_iu.data_len)); 1646 1647 goto done; 1647 1648 } 1648 1649 result.u64 = 0; ··· 1664 1655 freq->transferred_length || 1665 1656 op->rsp_iu.status_code || 1666 1657 sqe->common.command_id != cqe->command_id)) { 1667 - status = cpu_to_le16(NVME_SC_INTERNAL << 1); 1658 + status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); 1659 + dev_info(ctrl->ctrl.device, 1660 + "NVME-FC{%d}: io failed due to bad NVMe_ERSP: " 1661 + "iu len %d, xfr len %d vs %d, status code " 1662 + "%d, cmdid %d vs %d\n", 1663 + ctrl->cnum, be16_to_cpu(op->rsp_iu.iu_len), 1664 + be32_to_cpu(op->rsp_iu.xfrd_len), 1665 + freq->transferred_length, 1666 + op->rsp_iu.status_code, 1667 + sqe->common.command_id, 1668 + cqe->command_id); 1668 1669 goto done; 1669 1670 } 1670 1671 result = cqe->result; ··· 1682 1663 break; 1683 1664 1684 1665 default: 1685 - status = cpu_to_le16(NVME_SC_INTERNAL << 1); 1666 
+ status = cpu_to_le16(NVME_SC_HOST_PATH_ERROR << 1); 1667 + dev_info(ctrl->ctrl.device, 1668 + "NVME-FC{%d}: io failed due to odd NVMe_xRSP iu " 1669 + "len %d\n", 1670 + ctrl->cnum, freq->rcv_rsplen); 1686 1671 goto done; 1687 1672 } 1688 1673 ··· 2029 2006 2030 2007 blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 2031 2008 blk_cleanup_queue(ctrl->ctrl.admin_q); 2009 + blk_cleanup_queue(ctrl->ctrl.fabrics_q); 2032 2010 blk_mq_free_tag_set(&ctrl->admin_tag_set); 2033 2011 2034 2012 kfree(ctrl->queues); ··· 2131 2107 struct nvme_fc_fcp_op *op) 2132 2108 { 2133 2109 struct nvmefc_fcp_req *freq = &op->fcp_req; 2134 - enum dma_data_direction dir; 2135 2110 int ret; 2136 2111 2137 2112 freq->sg_cnt = 0; ··· 2147 2124 2148 2125 op->nents = blk_rq_map_sg(rq->q, rq, freq->sg_table.sgl); 2149 2126 WARN_ON(op->nents > blk_rq_nr_phys_segments(rq)); 2150 - dir = (rq_data_dir(rq) == WRITE) ? DMA_TO_DEVICE : DMA_FROM_DEVICE; 2151 2127 freq->sg_cnt = fc_dma_map_sg(ctrl->lport->dev, freq->sg_table.sgl, 2152 - op->nents, dir); 2128 + op->nents, rq_dma_dir(rq)); 2153 2129 if (unlikely(freq->sg_cnt <= 0)) { 2154 2130 sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE); 2155 2131 freq->sg_cnt = 0; ··· 2171 2149 return; 2172 2150 2173 2151 fc_dma_unmap_sg(ctrl->lport->dev, freq->sg_table.sgl, op->nents, 2174 - ((rq_data_dir(rq) == WRITE) ? 
2175 - DMA_TO_DEVICE : DMA_FROM_DEVICE)); 2152 + rq_dma_dir(rq)); 2176 2153 2177 2154 nvme_cleanup_cmd(rq); 2178 2155 ··· 2654 2633 if (ret) 2655 2634 goto out_delete_hw_queue; 2656 2635 2657 - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 2658 - 2659 2636 ret = nvmf_connect_admin_queue(&ctrl->ctrl); 2660 2637 if (ret) 2661 2638 goto out_disconnect_admin_queue; ··· 2667 2648 * prior connection values 2668 2649 */ 2669 2650 2670 - ret = nvmf_reg_read64(&ctrl->ctrl, NVME_REG_CAP, &ctrl->ctrl.cap); 2671 - if (ret) { 2672 - dev_err(ctrl->ctrl.device, 2673 - "prop_get NVME_REG_CAP failed\n"); 2674 - goto out_disconnect_admin_queue; 2675 - } 2676 - 2677 - ctrl->ctrl.sqsize = 2678 - min_t(int, NVME_CAP_MQES(ctrl->ctrl.cap), ctrl->ctrl.sqsize); 2679 - 2680 - ret = nvme_enable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap); 2651 + ret = nvme_enable_ctrl(&ctrl->ctrl); 2681 2652 if (ret) 2682 2653 goto out_disconnect_admin_queue; 2683 2654 2684 2655 ctrl->ctrl.max_hw_sectors = 2685 2656 (ctrl->lport->ops->max_sgl_segments - 1) << (PAGE_SHIFT - 9); 2657 + 2658 + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 2686 2659 2687 2660 ret = nvme_init_identify(&ctrl->ctrl); 2688 2661 if (ret) ··· 2785 2774 nvme_stop_queues(&ctrl->ctrl); 2786 2775 blk_mq_tagset_busy_iter(&ctrl->tag_set, 2787 2776 nvme_fc_terminate_exchange, &ctrl->ctrl); 2777 + blk_mq_tagset_wait_completed_request(&ctrl->tag_set); 2788 2778 } 2789 2779 2790 2780 /* ··· 2808 2796 blk_mq_quiesce_queue(ctrl->ctrl.admin_q); 2809 2797 blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, 2810 2798 nvme_fc_terminate_exchange, &ctrl->ctrl); 2799 + blk_mq_tagset_wait_completed_request(&ctrl->admin_tag_set); 2811 2800 2812 2801 /* kill the aens as they are a separate path */ 2813 2802 nvme_fc_abort_aen_ops(ctrl); ··· 3122 3109 goto out_free_queues; 3123 3110 ctrl->ctrl.admin_tagset = &ctrl->admin_tag_set; 3124 3111 3112 + ctrl->ctrl.fabrics_q = blk_mq_init_queue(&ctrl->admin_tag_set); 3113 + if (IS_ERR(ctrl->ctrl.fabrics_q)) { 3114 + ret = 
PTR_ERR(ctrl->ctrl.fabrics_q); 3115 + goto out_free_admin_tag_set; 3116 + } 3117 + 3125 3118 ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set); 3126 3119 if (IS_ERR(ctrl->ctrl.admin_q)) { 3127 3120 ret = PTR_ERR(ctrl->ctrl.admin_q); 3128 - goto out_free_admin_tag_set; 3121 + goto out_cleanup_fabrics_q; 3129 3122 } 3130 3123 3131 3124 /* ··· 3203 3184 3204 3185 out_cleanup_admin_q: 3205 3186 blk_cleanup_queue(ctrl->ctrl.admin_q); 3187 + out_cleanup_fabrics_q: 3188 + blk_cleanup_queue(ctrl->ctrl.fabrics_q); 3206 3189 out_free_admin_tag_set: 3207 3190 blk_mq_free_tag_set(&ctrl->admin_tag_set); 3208 3191 out_free_queues:
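Reviewer note on the fc.c error path: fabrics_q is now allocated between the admin tag set and admin_q, and the new out_cleanup_fabrics_q label unwinds it in reverse order. A minimal user-space sketch of that ordering follows. The struct, the function names and the fail_at parameter are hypothetical, and malloc()/free() stand in for blk_mq_init_queue()/blk_cleanup_queue().

```c
#include <stdlib.h>

/* Hypothetical controller: two queues sharing one admin tag set. */
struct sketch_ctrl {
	void *tag_set;
	void *fabrics_q;
	void *admin_q;
};

/* fail_at selects which allocation step (1..3) fails; 0 means none. */
int sketch_init_admin(struct sketch_ctrl *c, int fail_at)
{
	c->tag_set = (fail_at == 1) ? NULL : malloc(1);
	if (!c->tag_set)
		goto out;

	c->fabrics_q = (fail_at == 2) ? NULL : malloc(1);
	if (!c->fabrics_q)
		goto out_free_tag_set;

	c->admin_q = (fail_at == 3) ? NULL : malloc(1);
	if (!c->admin_q)
		goto out_cleanup_fabrics_q;	/* was out_free_admin_tag_set before */

	return 0;

out_cleanup_fabrics_q:
	free(c->fabrics_q);
	c->fabrics_q = NULL;
out_free_tag_set:
	free(c->tag_set);
	c->tag_set = NULL;
out:
	return -1;
}

/* Normal teardown: admin_q, then fabrics_q, then the tag set. */
void sketch_destroy_admin(struct sketch_ctrl *c)
{
	free(c->admin_q);
	free(c->fabrics_q);
	free(c->tag_set);
	c->admin_q = c->fabrics_q = c->tag_set = NULL;
}
```

The point of the label ordering is that a failure while creating admin_q must release fabrics_q before the tag set both queues were built on.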

+15 -30

drivers/nvme/host/lightnvm.c

··· 667 667 return rq; 668 668 } 669 669 670 - static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd) 670 + static int nvme_nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd, 671 + void *buf) 671 672 { 673 + struct nvm_geo *geo = &dev->geo; 672 674 struct request_queue *q = dev->q; 673 675 struct nvme_nvm_command *cmd; 674 676 struct request *rq; 677 + int ret; 675 678 676 679 cmd = kzalloc(sizeof(struct nvme_nvm_command), GFP_KERNEL); 677 680 if (!cmd) ··· 682 679 683 680 rq = nvme_nvm_alloc_request(q, rqd, cmd); 684 681 if (IS_ERR(rq)) { 685 - kfree(cmd); 686 - return PTR_ERR(rq); 682 + ret = PTR_ERR(rq); 683 + goto err_free_cmd; 684 + } 685 + 686 + if (buf) { 687 + ret = blk_rq_map_kern(q, rq, buf, geo->csecs * rqd->nr_ppas, 688 + GFP_KERNEL); 689 + if (ret) 690 + goto err_free_cmd; 687 691 } 688 692 689 693 rq->end_io_data = rqd; ··· 698 688 blk_execute_rq_nowait(q, NULL, rq, 0, nvme_nvm_end_io); 699 689 700 690 return 0; 701 - } 702 691 703 - static int nvme_nvm_submit_io_sync(struct nvm_dev *dev, struct nvm_rq *rqd) 704 - { 705 - struct request_queue *q = dev->q; 706 - struct request *rq; 707 - struct nvme_nvm_command cmd; 708 - int ret = 0; 709 - 710 - memset(&cmd, 0, sizeof(struct nvme_nvm_command)); 711 - 712 - rq = nvme_nvm_alloc_request(q, rqd, &cmd); 713 - if (IS_ERR(rq)) 714 - return PTR_ERR(rq); 715 - 716 - /* I/Os can fail and the error is signaled through rqd. Callers must 717 - * handle the error accordingly. 
718 - */ 719 - blk_execute_rq(q, NULL, rq, 0); 720 - if (nvme_req(rq)->flags & NVME_REQ_CANCELLED) 721 - ret = -EINTR; 722 - 723 - rqd->ppa_status = le64_to_cpu(nvme_req(rq)->result.u64); 724 - rqd->error = nvme_req(rq)->status; 725 - 726 - blk_mq_free_request(rq); 727 - 692 + err_free_cmd: 693 + kfree(cmd); 728 694 return ret; 729 695 } 730 696 ··· 740 754 .get_chk_meta = nvme_nvm_get_chk_meta, 741 755 742 756 .submit_io = nvme_nvm_submit_io, 743 - .submit_io_sync = nvme_nvm_submit_io_sync, 744 757 745 758 .create_dma_pool = nvme_nvm_create_dma_pool, 746 759 .destroy_dma_pool = nvme_nvm_destroy_dma_pool,

+5 -3

drivers/nvme/host/multipath.c

···
509 509
510 510     down_write(&ctrl->namespaces_rwsem);
511 511     list_for_each_entry(ns, &ctrl->namespaces, list) {
512 -         if (ns->head->ns_id != le32_to_cpu(desc->nsids[n]))
512 +         unsigned nsid = le32_to_cpu(desc->nsids[n]);
513 +
514 +         if (ns->head->ns_id < nsid)
513 515             continue;
514 -         nvme_update_ns_ana_state(desc, ns);
516 +         if (ns->head->ns_id == nsid)
517 +             nvme_update_ns_ana_state(desc, ns);
515 518         if (++n == nr_nsids)
516 519             break;
517 520     }
518 521     up_write(&ctrl->namespaces_rwsem);
519 -     WARN_ON_ONCE(n < nr_nsids);
520 522     return 0;
521 523 }
522 524
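Reviewer note on the multipath.c hunk: the loop now tolerates descriptor NSIDs that have no matching namespace on this controller, which is presumably why the WARN_ON_ONCE was dropped. The sketch below mirrors the updated loop over plain arrays. Both arrays and the function name are hypothetical, both inputs are assumed to be sorted in ascending order, and a counter stands in for nvme_update_ns_ana_state().

```c
#include <stddef.h>

/*
 * ns_ids:     namespace IDs attached to the controller (ascending).
 * desc_nsids: NSIDs listed in one ANA group descriptor (ascending).
 * Returns how many namespaces would have their ANA state updated.
 */
size_t sketch_update_ana(const unsigned *ns_ids, size_t nr_ns,
			 const unsigned *desc_nsids, size_t nr_nsids)
{
	size_t n = 0, updated = 0;

	if (!nr_nsids)
		return 0;

	for (size_t i = 0; i < nr_ns; i++) {
		unsigned nsid = desc_nsids[n];

		if (ns_ids[i] < nsid)
			continue;	/* namespace is not in this group */
		if (ns_ids[i] == nsid)
			updated++;	/* driver: nvme_update_ns_ana_state() */
		if (++n == nr_nsids)
			break;		/* descriptor fully consumed */
	}
	return updated;
}
```

With namespaces {1, 2} and a descriptor listing {1, 2, 7}, the loop ends with n < nr_nsids, which is the situation the removed warning used to flag. Note that in this form a descriptor NSID smaller than the current namespace ID advances n without comparing that namespace against the next NSID.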

+34 -2

drivers/nvme/host/nvme.h

··· 16 16 #include <linux/fault-inject.h> 17 17 #include <linux/rcupdate.h> 18 18 19 + #include <trace/events/block.h> 20 + 19 21 extern unsigned int nvme_io_timeout; 20 22 #define NVME_IO_TIMEOUT (nvme_io_timeout * HZ) 21 23 ··· 99 97 * Force simple suspend/resume path. 100 98 */ 101 99 NVME_QUIRK_SIMPLE_SUSPEND = (1 << 10), 100 + 101 + /* 102 + * Use only one interrupt vector for all queues 103 + */ 104 + NVME_QUIRK_SINGLE_VECTOR = (1 << 11), 105 + 106 + /* 107 + * Use non-standard 128 bytes SQEs. 108 + */ 109 + NVME_QUIRK_128_BYTES_SQES = (1 << 12), 110 + 111 + /* 112 + * Prevent tag overlap between queues 113 + */ 114 + NVME_QUIRK_SHARED_TAGS = (1 << 13), 102 115 }; 103 116 104 117 /* ··· 186 169 const struct nvme_ctrl_ops *ops; 187 170 struct request_queue *admin_q; 188 171 struct request_queue *connect_q; 172 + struct request_queue *fabrics_q; 189 173 struct device *dev; 190 174 int instance; 191 175 int numa_node; ··· 449 431 bool nvme_cancel_request(struct request *req, void *data, bool reserved); 450 432 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl, 451 433 enum nvme_ctrl_state new_state); 452 - int nvme_disable_ctrl(struct nvme_ctrl *ctrl, u64 cap); 453 - int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap); 434 + int nvme_disable_ctrl(struct nvme_ctrl *ctrl); 435 + int nvme_enable_ctrl(struct nvme_ctrl *ctrl); 454 436 int nvme_shutdown_ctrl(struct nvme_ctrl *ctrl); 455 437 int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev, 456 438 const struct nvme_ctrl_ops *ops, unsigned long quirks); ··· 538 520 kblockd_schedule_work(&head->requeue_work); 539 521 } 540 522 523 + static inline void nvme_trace_bio_complete(struct request *req, 524 + blk_status_t status) 525 + { 526 + struct nvme_ns *ns = req->q->queuedata; 527 + 528 + if (req->cmd_flags & REQ_NVME_MPATH) 529 + trace_block_bio_complete(ns->head->disk->queue, 530 + req->bio, status); 531 + } 532 + 541 533 extern struct device_attribute dev_attr_ana_grpid; 542 534 extern struct 
device_attribute dev_attr_ana_state; 543 535 extern struct device_attribute subsys_attr_iopolicy; ··· 593 565 { 594 566 } 595 567 static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) 568 + { 569 + } 570 + static inline void nvme_trace_bio_complete(struct request *req, 571 + blk_status_t status) 596 572 { 597 573 } 598 574 static inline int nvme_mpath_init(struct nvme_ctrl *ctrl,
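Reviewer note on the three new quirk bits: the sketch below summarises, in one place, how the pci.c changes in this merge consume them. The bit values are copied from the nvme.h hunk. Everything else is an assumption made for illustration: the sketch_ names, the limits struct, the default SQE shift of 6 (64-byte entries) and the admin queue depth of 32 (taken from the "first 32 tags" comment).

```c
/* Bit values copied from the nvme.h hunk. */
enum sketch_quirk {
	SKETCH_QUIRK_SINGLE_VECTOR  = 1 << 11,
	SKETCH_QUIRK_128_BYTES_SQES = 1 << 12,
	SKETCH_QUIRK_SHARED_TAGS    = 1 << 13,
};

#define SKETCH_AQ_DEPTH     32	/* assumed admin queue depth */
#define SKETCH_DEFAULT_SQES 6	/* assumed default: 1 << 6 = 64-byte SQEs */

/* Hypothetical summary of the knobs the quirks change. */
struct sketch_limits {
	unsigned irq_queues;	/* interrupt vectors requested */
	unsigned nr_io_queues;	/* I/O queues requested */
	unsigned io_sqes;	/* log2 of the I/O SQ entry size */
	unsigned reserved_tags;	/* I/O tags kept free for admin use */
};

struct sketch_limits sketch_apply_quirks(unsigned long quirks,
					 unsigned irq_queues,
					 unsigned nr_io_queues)
{
	struct sketch_limits l = {
		irq_queues, nr_io_queues, SKETCH_DEFAULT_SQES, 0
	};

	if (quirks & SKETCH_QUIRK_SINGLE_VECTOR)
		l.irq_queues = 1;	/* all queues share the first vector */
	if (quirks & SKETCH_QUIRK_128_BYTES_SQES)
		l.io_sqes = 7;		/* 1 << 7 = 128-byte entries */
	if (quirks & SKETCH_QUIRK_SHARED_TAGS) {
		l.nr_io_queues = 1;	/* tags are shared with the admin queue */
		l.reserved_tags = SKETCH_AQ_DEPTH;
	}
	return l;
}
```

The PCI ID table entry for Apple device 0x2005 sets all three bits at once, so that device ends up with one vector, one I/O queue, 128-byte SQEs and 32 reserved tags.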

+78 -24

drivers/nvme/host/pci.c

··· 28 28 #include "trace.h" 29 29 #include "nvme.h" 30 30 31 - #define SQ_SIZE(depth) (depth * sizeof(struct nvme_command)) 32 - #define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion)) 31 + #define SQ_SIZE(q) ((q)->q_depth << (q)->sqes) 32 + #define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion)) 33 33 34 34 #define SGES_PER_PAGE (PAGE_SIZE / sizeof(struct nvme_sgl_desc)) 35 35 ··· 100 100 unsigned io_queues[HCTX_MAX_TYPES]; 101 101 unsigned int num_vecs; 102 102 int q_depth; 103 + int io_sqes; 103 104 u32 db_stride; 104 105 void __iomem *bar; 105 106 unsigned long bar_mapped_size; ··· 163 162 struct nvme_queue { 164 163 struct nvme_dev *dev; 165 164 spinlock_t sq_lock; 166 - struct nvme_command *sq_cmds; 165 + void *sq_cmds; 167 166 /* only used for poll queues: */ 168 167 spinlock_t cq_poll_lock ____cacheline_aligned_in_smp; 169 168 volatile struct nvme_completion *cqes; ··· 179 178 u16 last_cq_head; 180 179 u16 qid; 181 180 u8 cq_phase; 181 + u8 sqes; 182 182 unsigned long flags; 183 183 #define NVMEQ_ENABLED 0 184 184 #define NVMEQ_SQ_CMB 1 ··· 490 488 bool write_sq) 491 489 { 492 490 spin_lock(&nvmeq->sq_lock); 493 - memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd)); 491 + memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes), 492 + cmd, sizeof(*cmd)); 494 493 if (++nvmeq->sq_tail == nvmeq->q_depth) 495 494 nvmeq->sq_tail = 0; 496 495 nvme_write_sq_db(nvmeq, write_sq); ··· 537 534 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req) 538 535 { 539 536 struct nvme_iod *iod = blk_mq_rq_to_pdu(req); 540 - enum dma_data_direction dma_dir = rq_data_dir(req) ? 
541 - DMA_TO_DEVICE : DMA_FROM_DEVICE; 542 537 const int last_prp = dev->ctrl.page_size / sizeof(__le64) - 1; 543 538 dma_addr_t dma_addr = iod->first_dma, next_dma_addr; 544 539 int i; 545 540 546 541 if (iod->dma_len) { 547 - dma_unmap_page(dev->dev, dma_addr, iod->dma_len, dma_dir); 542 + dma_unmap_page(dev->dev, dma_addr, iod->dma_len, 543 + rq_dma_dir(req)); 548 544 return; 549 545 } 550 546 ··· 1346 1344 1347 1345 static void nvme_free_queue(struct nvme_queue *nvmeq) 1348 1346 { 1349 - dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq->q_depth), 1347 + dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq), 1350 1348 (void *)nvmeq->cqes, nvmeq->cq_dma_addr); 1351 1349 if (!nvmeq->sq_cmds) 1352 1350 return; 1353 1351 1354 1352 if (test_and_clear_bit(NVMEQ_SQ_CMB, &nvmeq->flags)) { 1355 1353 pci_free_p2pmem(to_pci_dev(nvmeq->dev->dev), 1356 - nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth)); 1354 + nvmeq->sq_cmds, SQ_SIZE(nvmeq)); 1357 1355 } else { 1358 - dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq->q_depth), 1356 + dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq), 1359 1357 nvmeq->sq_cmds, nvmeq->sq_dma_addr); 1360 1358 } 1361 1359 } ··· 1405 1403 if (shutdown) 1406 1404 nvme_shutdown_ctrl(&dev->ctrl); 1407 1405 else 1408 - nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap); 1406 + nvme_disable_ctrl(&dev->ctrl); 1409 1407 1410 1408 nvme_poll_irqdisable(nvmeq, -1); 1411 1409 } ··· 1435 1433 } 1436 1434 1437 1435 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq, 1438 - int qid, int depth) 1436 + int qid) 1439 1437 { 1440 1438 struct pci_dev *pdev = to_pci_dev(dev->dev); 1441 1439 1442 1440 if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) { 1443 - nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth)); 1441 + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(nvmeq)); 1444 1442 if (nvmeq->sq_cmds) { 1445 1443 nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev, 1446 1444 nvmeq->sq_cmds); ··· 1449 1447 return 0; 1450 1448 } 1451 1449 1452 - 
pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(depth)); 1450 + pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(nvmeq)); 1453 1451 } 1454 1452 } 1455 1453 1456 - nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth), 1454 + nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(nvmeq), 1457 1455 &nvmeq->sq_dma_addr, GFP_KERNEL); 1458 1456 if (!nvmeq->sq_cmds) 1459 1457 return -ENOMEM; ··· 1467 1465 if (dev->ctrl.queue_count > qid) 1468 1466 return 0; 1469 1467 1470 - nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(depth), 1468 + nvmeq->sqes = qid ? dev->io_sqes : NVME_ADM_SQES; 1469 + nvmeq->q_depth = depth; 1470 + nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(nvmeq), 1471 1471 &nvmeq->cq_dma_addr, GFP_KERNEL); 1472 1472 if (!nvmeq->cqes) 1473 1473 goto free_nvmeq; 1474 1474 1475 - if (nvme_alloc_sq_cmds(dev, nvmeq, qid, depth)) 1475 + if (nvme_alloc_sq_cmds(dev, nvmeq, qid)) 1476 1476 goto free_cqdma; 1477 1477 1478 1478 nvmeq->dev = dev; ··· 1483 1479 nvmeq->cq_head = 0; 1484 1480 nvmeq->cq_phase = 1; 1485 1481 nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride]; 1486 - nvmeq->q_depth = depth; 1487 1482 nvmeq->qid = qid; 1488 1483 dev->ctrl.queue_count++; 1489 1484 1490 1485 return 0; 1491 1486 1492 1487 free_cqdma: 1493 - dma_free_coherent(dev->dev, CQ_SIZE(depth), (void *)nvmeq->cqes, 1494 - nvmeq->cq_dma_addr); 1488 + dma_free_coherent(dev->dev, CQ_SIZE(nvmeq), (void *)nvmeq->cqes, 1489 + nvmeq->cq_dma_addr); 1495 1490 free_nvmeq: 1496 1491 return -ENOMEM; 1497 1492 } ··· 1518 1515 nvmeq->cq_head = 0; 1519 1516 nvmeq->cq_phase = 1; 1520 1517 nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride]; 1521 - memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth)); 1518 + memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq)); 1522 1519 nvme_dbbuf_init(dev, nvmeq, qid); 1523 1520 dev->online_queues++; 1524 1521 wmb(); /* ensure the first interrupt sees the initialization */ ··· 1555 1552 nvme_init_queue(nvmeq, qid); 1556 1553 1557 1554 if (!polled) { 1558 - 
nvmeq->cq_vector = vector; 1559 1555 result = queue_request_irq(nvmeq); 1560 1556 if (result < 0) 1561 1557 goto release_sq; ··· 1681 1679 (readl(dev->bar + NVME_REG_CSTS) & NVME_CSTS_NSSRO)) 1682 1680 writel(NVME_CSTS_NSSRO, dev->bar + NVME_REG_CSTS); 1683 1681 1684 - result = nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap); 1682 + result = nvme_disable_ctrl(&dev->ctrl); 1685 1683 if (result < 0) 1686 1684 return result; 1687 1685 ··· 1697 1695 lo_hi_writeq(nvmeq->sq_dma_addr, dev->bar + NVME_REG_ASQ); 1698 1696 lo_hi_writeq(nvmeq->cq_dma_addr, dev->bar + NVME_REG_ACQ); 1699 1697 1700 - result = nvme_enable_ctrl(&dev->ctrl, dev->ctrl.cap); 1698 + result = nvme_enable_ctrl(&dev->ctrl); 1701 1699 if (result) 1702 1700 return result; 1703 1701 ··· 2079 2077 dev->io_queues[HCTX_TYPE_DEFAULT] = 1; 2080 2078 dev->io_queues[HCTX_TYPE_READ] = 0; 2081 2079 2080 + /* 2081 + * Some Apple controllers require all queues to use the 2082 + * first vector. 2083 + */ 2084 + if (dev->ctrl.quirks & NVME_QUIRK_SINGLE_VECTOR) 2085 + irq_queues = 1; 2086 + 2082 2087 return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues, 2083 2088 PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd); 2084 2089 } ··· 2104 2095 unsigned long size; 2105 2096 2106 2097 nr_io_queues = max_io_queues(); 2098 + 2099 + /* 2100 + * If tags are shared with admin queue (Apple bug), then 2101 + * make sure we only use one IO queue. 2102 + */ 2103 + if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) 2104 + nr_io_queues = 1; 2105 + 2107 2106 result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues); 2108 2107 if (result < 0) 2109 2108 return result; ··· 2282 2265 dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE; 2283 2266 dev->tagset.driver_data = dev; 2284 2267 2268 + /* 2269 + * Some Apple controllers requires tags to be unique 2270 + * across admin and IO queue, so reserve the first 32 2271 + * tags of the IO queue. 
2272 + */ 2273 + if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) 2274 + dev->tagset.reserved_tags = NVME_AQ_DEPTH; 2275 + 2285 2276 ret = blk_mq_alloc_tag_set(&dev->tagset); 2286 2277 if (ret) { 2287 2278 dev_warn(dev->ctrl.device, ··· 2339 2314 2340 2315 dev->q_depth = min_t(int, NVME_CAP_MQES(dev->ctrl.cap) + 1, 2341 2316 io_queue_depth); 2317 + dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */ 2342 2318 dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap); 2343 2319 dev->dbs = dev->bar + 4096; 2320 + 2321 + /* 2322 + * Some Apple controllers require a non-standard SQE size. 2323 + * Interestingly they also seem to ignore the CC:IOSQES register 2324 + * so we don't bother updating it here. 2325 + */ 2326 + if (dev->ctrl.quirks & NVME_QUIRK_128_BYTES_SQES) 2327 + dev->io_sqes = 7; 2328 + else 2329 + dev->io_sqes = NVME_NVM_IOSQES; 2344 2330 2345 2331 /* 2346 2332 * Temporary fix for the Apple controller found in the MacBook8,1 and ··· 2369 2333 dev_err(dev->ctrl.device, "detected PM1725 NVMe controller, " 2370 2334 "set queue depth=%u\n", dev->q_depth); 2371 2335 } 2336 + 2337 + /* 2338 + * Controllers with the shared tags quirk need the IO queue to be 2339 + * big enough so that we get 32 tags for the admin queue 2340 + */ 2341 + if ((dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) && 2342 + (dev->q_depth < (NVME_AQ_DEPTH + 2))) { 2343 + dev->q_depth = NVME_AQ_DEPTH + 2; 2344 + dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n", 2345 + dev->q_depth); 2346 + } 2347 + 2372 2348 2373 2349 nvme_map_cmb(dev); 2374 2350 ··· 2449 2401 2450 2402 blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl); 2451 2403 blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl); 2404 + blk_mq_tagset_wait_completed_request(&dev->tagset); 2405 + blk_mq_tagset_wait_completed_request(&dev->admin_tagset); 2452 2406 2453 2407 /* 2454 2408 * The driver will not be starting up queues again if shutting down so ··· 3091 3041 { 
PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) }, 3092 3042 { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) }, 3093 3043 { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) }, 3044 + { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2005), 3045 + .driver_data = NVME_QUIRK_SINGLE_VECTOR | 3046 + NVME_QUIRK_128_BYTES_SQES | 3047 + NVME_QUIRK_SHARED_TAGS }, 3094 3048 { 0, } 3095 3049 }; 3096 3050 MODULE_DEVICE_TABLE(pci, nvme_id_table);
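Reviewer note on the SQ_SIZE() change: the submission queue is now addressed as a byte buffer with a per-queue shift (sqes) instead of an array of struct nvme_command. The arithmetic can be sketched as follows. The struct and function names are hypothetical, and the admin queue depth of 32 is an assumption taken from the "32 tags" comments in the hunk.

```c
#include <stddef.h>

#define SKETCH_AQ_DEPTH 32	/* assumed admin queue depth */

struct sketch_queue {
	unsigned q_depth;	/* number of entries in the queue */
	unsigned sqes;		/* log2 of SQ entry size: 6 -> 64 B, 7 -> 128 B */
};

/* Mirrors SQ_SIZE(q): total bytes to allocate for the submission queue. */
size_t sketch_sq_size(const struct sketch_queue *q)
{
	return (size_t)q->q_depth << q->sqes;
}

/* Byte offset of slot 'tail', as used by the memcpy in the submit path. */
size_t sketch_sq_offset(const struct sketch_queue *q, unsigned tail)
{
	return (size_t)tail << q->sqes;
}

/* Shared-tags quirk: the I/O queue must hold at least AQ_DEPTH + 2 entries. */
unsigned sketch_clamp_depth(unsigned q_depth, int shared_tags)
{
	if (shared_tags && q_depth < SKETCH_AQ_DEPTH + 2)
		return SKETCH_AQ_DEPTH + 2;
	return q_depth;
}
```

With sqes = 7 each slot is 128 bytes wide, while the submit path still copies only sizeof(*cmd) bytes into it, so the remainder of each slot is left as allocated.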

+34 -27

drivers/nvme/host/rdma.c

··· 757 757 { 758 758 if (remove) { 759 759 blk_cleanup_queue(ctrl->ctrl.admin_q); 760 + blk_cleanup_queue(ctrl->ctrl.fabrics_q); 760 761 blk_mq_free_tag_set(ctrl->ctrl.admin_tagset); 761 762 } 762 763 if (ctrl->async_event_sqe.data) { ··· 799 798 goto out_free_async_qe; 800 799 } 801 800 801 + ctrl->ctrl.fabrics_q = blk_mq_init_queue(&ctrl->admin_tag_set); 802 + if (IS_ERR(ctrl->ctrl.fabrics_q)) { 803 + error = PTR_ERR(ctrl->ctrl.fabrics_q); 804 + goto out_free_tagset; 805 + } 806 + 802 807 ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set); 803 808 if (IS_ERR(ctrl->ctrl.admin_q)) { 804 809 error = PTR_ERR(ctrl->ctrl.admin_q); 805 - goto out_free_tagset; 810 + goto out_cleanup_fabrics_q; 806 811 } 807 812 } 808 813 ··· 816 809 if (error) 817 810 goto out_cleanup_queue; 818 811 819 - error = ctrl->ctrl.ops->reg_read64(&ctrl->ctrl, NVME_REG_CAP, 820 - &ctrl->ctrl.cap); 821 - if (error) { 822 - dev_err(ctrl->ctrl.device, 823 - "prop_get NVME_REG_CAP failed\n"); 824 - goto out_stop_queue; 825 - } 826 - 827 - ctrl->ctrl.sqsize = 828 - min_t(int, NVME_CAP_MQES(ctrl->ctrl.cap), ctrl->ctrl.sqsize); 829 - 830 - error = nvme_enable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap); 812 + error = nvme_enable_ctrl(&ctrl->ctrl); 831 813 if (error) 832 814 goto out_stop_queue; 833 815 834 816 ctrl->ctrl.max_hw_sectors = 835 817 (ctrl->max_fr_pages - 1) << (ilog2(SZ_4K) - 9); 818 + 819 + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 836 820 837 821 error = nvme_init_identify(&ctrl->ctrl); 838 822 if (error) ··· 836 838 out_cleanup_queue: 837 839 if (new) 838 840 blk_cleanup_queue(ctrl->ctrl.admin_q); 841 + out_cleanup_fabrics_q: 842 + if (new) 843 + blk_cleanup_queue(ctrl->ctrl.fabrics_q); 839 844 out_free_tagset: 840 845 if (new) 841 846 blk_mq_free_tag_set(ctrl->ctrl.admin_tagset); ··· 908 907 { 909 908 blk_mq_quiesce_queue(ctrl->ctrl.admin_q); 910 909 nvme_rdma_stop_queue(&ctrl->queues[0]); 911 - if (ctrl->ctrl.admin_tagset) 910 + if (ctrl->ctrl.admin_tagset) { 912 911 
blk_mq_tagset_busy_iter(ctrl->ctrl.admin_tagset, 913 912 nvme_cancel_request, &ctrl->ctrl); 914 - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 913 + blk_mq_tagset_wait_completed_request(ctrl->ctrl.admin_tagset); 914 + } 915 + if (remove) 916 + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 915 917 nvme_rdma_destroy_admin_queue(ctrl, remove); 916 918 } 917 919 ··· 924 920 if (ctrl->ctrl.queue_count > 1) { 925 921 nvme_stop_queues(&ctrl->ctrl); 926 922 nvme_rdma_stop_io_queues(ctrl); 927 - if (ctrl->ctrl.tagset) 923 + if (ctrl->ctrl.tagset) { 928 924 blk_mq_tagset_busy_iter(ctrl->ctrl.tagset, 929 925 nvme_cancel_request, &ctrl->ctrl); 926 + blk_mq_tagset_wait_completed_request(ctrl->ctrl.tagset); 927 + } 930 928 if (remove) 931 929 nvme_start_queues(&ctrl->ctrl); 932 930 nvme_rdma_destroy_io_queues(ctrl, remove); ··· 1065 1059 nvme_rdma_teardown_io_queues(ctrl, false); 1066 1060 nvme_start_queues(&ctrl->ctrl); 1067 1061 nvme_rdma_teardown_admin_queue(ctrl, false); 1062 + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 1068 1063 1069 1064 if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) { 1070 1065 /* state change failure is ok if we're in DELETING state */ ··· 1152 1145 req->mr = NULL; 1153 1146 } 1154 1147 1155 - ib_dma_unmap_sg(ibdev, req->sg_table.sgl, 1156 - req->nents, rq_data_dir(rq) == 1157 - WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE); 1148 + ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq)); 1158 1149 1159 1150 nvme_cleanup_cmd(rq); 1160 1151 sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE); ··· 1278 1273 req->nents = blk_rq_map_sg(rq->q, rq, req->sg_table.sgl); 1279 1274 1280 1275 count = ib_dma_map_sg(ibdev, req->sg_table.sgl, req->nents, 1281 - rq_data_dir(rq) == WRITE ? 
DMA_TO_DEVICE : DMA_FROM_DEVICE); 1276 + rq_dma_dir(rq)); 1282 1277 if (unlikely(count <= 0)) { 1283 1278 ret = -EIO; 1284 1279 goto out_free_table; ··· 1307 1302 return 0; 1308 1303 1309 1304 out_unmap_sg: 1310 - ib_dma_unmap_sg(ibdev, req->sg_table.sgl, 1311 - req->nents, rq_data_dir(rq) == 1312 - WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE); 1305 + ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_dma_dir(rq)); 1313 1306 out_free_table: 1314 1307 sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE); 1315 1308 return ret; ··· 1550 1547 1551 1548 static int nvme_rdma_addr_resolved(struct nvme_rdma_queue *queue) 1552 1549 { 1550 + struct nvme_ctrl *ctrl = &queue->ctrl->ctrl; 1553 1551 int ret; 1554 1552 1555 1553 ret = nvme_rdma_create_queue_ib(queue); 1556 1554 if (ret) 1557 1555 return ret; 1558 1556 1557 + if (ctrl->opts->tos >= 0) 1558 + rdma_set_service_type(queue->cm_id, ctrl->opts->tos); 1559 1559 ret = rdma_resolve_route(queue->cm_id, NVME_RDMA_CONNECT_TIMEOUT_MS); 1560 1560 if (ret) { 1561 - dev_err(queue->ctrl->ctrl.device, 1562 - "rdma_resolve_route failed (%d).\n", 1561 + dev_err(ctrl->device, "rdma_resolve_route failed (%d).\n", 1563 1562 queue->cm_error); 1564 1563 goto out_destroy_queue; 1565 1564 } ··· 1874 1869 cancel_delayed_work_sync(&ctrl->reconnect_work); 1875 1870 1876 1871 nvme_rdma_teardown_io_queues(ctrl, shutdown); 1872 + blk_mq_quiesce_queue(ctrl->ctrl.admin_q); 1877 1873 if (shutdown) 1878 1874 nvme_shutdown_ctrl(&ctrl->ctrl); 1879 1875 else 1880 - nvme_disable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap); 1876 + nvme_disable_ctrl(&ctrl->ctrl); 1881 1877 nvme_rdma_teardown_admin_queue(ctrl, shutdown); 1882 1878 } 1883 1879 ··· 2057 2051 .required_opts = NVMF_OPT_TRADDR, 2058 2052 .allowed_opts = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY | 2059 2053 NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO | 2060 - NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES, 2054 + NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES | 2055 + NVMF_OPT_TOS, 2061 2056 
.create_ctrl = nvme_rdma_create_ctrl, 2062 2057 }; 2063 2058
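Reviewer note on rq_dma_dir(): fc.c, pci.c and rdma.c each open-coded the same "write means to-device, otherwise from-device" ternary, and this merge replaces all of them with one helper. A user-space sketch of that mapping is below. The enums, the request struct and the function name are hypothetical stand-ins for illustration, not the kernel definitions.

```c
/* Hypothetical stand-ins for the block layer and DMA API types. */
enum sketch_data_dir { SKETCH_READ = 0, SKETCH_WRITE = 1 };
enum sketch_dma_dir { SKETCH_DMA_TO_DEVICE = 1, SKETCH_DMA_FROM_DEVICE = 2 };

struct sketch_request {
	enum sketch_data_dir dir;
};

/* One place for the mapping that each transport used to repeat. */
enum sketch_dma_dir sketch_rq_dma_dir(const struct sketch_request *rq)
{
	return rq->dir == SKETCH_WRITE ? SKETCH_DMA_TO_DEVICE
				       : SKETCH_DMA_FROM_DEVICE;
}
```

Using the same helper for map and unmap also removes the risk of the two call sites disagreeing about the direction.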

+101 -43

drivers/nvme/host/tcp.c

··· 13 13 #include <net/tcp.h> 14 14 #include <linux/blk-mq.h> 15 15 #include <crypto/hash.h> 16 + #include <net/busy_poll.h> 16 17 17 18 #include "nvme.h" 18 19 #include "fabrics.h" ··· 73 72 int pdu_offset; 74 73 size_t data_remaining; 75 74 size_t ddgst_remaining; 75 + unsigned int nr_cqe; 76 76 77 77 /* send state */ 78 78 struct nvme_tcp_request *request; ··· 440 438 } 441 439 442 440 nvme_end_request(rq, cqe->status, cqe->result); 441 + queue->nr_cqe++; 443 442 444 443 return 0; 445 444 } ··· 611 608 612 609 switch (hdr->type) { 613 610 case nvme_tcp_c2h_data: 614 - ret = nvme_tcp_handle_c2h_data(queue, (void *)queue->pdu); 615 - break; 611 + return nvme_tcp_handle_c2h_data(queue, (void *)queue->pdu); 616 612 case nvme_tcp_rsp: 617 613 nvme_tcp_init_recv_ctx(queue); 618 - ret = nvme_tcp_handle_comp(queue, (void *)queue->pdu); 619 - break; 614 + return nvme_tcp_handle_comp(queue, (void *)queue->pdu); 620 615 case nvme_tcp_r2t: 621 616 nvme_tcp_init_recv_ctx(queue); 622 - ret = nvme_tcp_handle_r2t(queue, (void *)queue->pdu); 623 - break; 617 + return nvme_tcp_handle_r2t(queue, (void *)queue->pdu); 624 618 default: 625 619 dev_err(queue->ctrl->ctrl.device, 626 620 "unsupported pdu type (%d)\n", hdr->type); 627 621 return -EINVAL; 628 622 } 629 - 630 - return ret; 631 623 } 632 624 633 625 static inline void nvme_tcp_end_request(struct request *rq, u16 status) ··· 699 701 nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); 700 702 queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; 701 703 } else { 702 - if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) 704 + if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { 703 705 nvme_tcp_end_request(rq, NVME_SC_SUCCESS); 706 + queue->nr_cqe++; 707 + } 704 708 nvme_tcp_init_recv_ctx(queue); 705 709 } 706 710 } ··· 742 742 pdu->command_id); 743 743 744 744 nvme_tcp_end_request(rq, NVME_SC_SUCCESS); 745 + queue->nr_cqe++; 745 746 } 746 747 747 748 nvme_tcp_init_recv_ctx(queue); ··· 842 841 843 842 static void 
nvme_tcp_fail_request(struct nvme_tcp_request *req) 844 843 { 845 - nvme_tcp_end_request(blk_mq_rq_from_pdu(req), NVME_SC_DATA_XFER_ERROR); 844 + nvme_tcp_end_request(blk_mq_rq_from_pdu(req), NVME_SC_HOST_PATH_ERROR); 846 845 } 847 846 848 847 static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) ··· 1024 1023 1025 1024 static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue) 1026 1025 { 1027 - struct sock *sk = queue->sock->sk; 1026 + struct socket *sock = queue->sock; 1027 + struct sock *sk = sock->sk; 1028 1028 read_descriptor_t rd_desc; 1029 1029 int consumed; 1030 1030 1031 1031 rd_desc.arg.data = queue; 1032 1032 rd_desc.count = 1; 1033 1033 lock_sock(sk); 1034 - consumed = tcp_read_sock(sk, &rd_desc, nvme_tcp_recv_skb); 1034 + queue->nr_cqe = 0; 1035 + consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb); 1035 1036 release_sock(sk); 1036 1037 return consumed; 1037 1038 } ··· 1258 1255 queue->queue_size = queue_size; 1259 1256 1260 1257 if (qid > 0) 1261 - queue->cmnd_capsule_len = ctrl->ctrl.ioccsz * 16; 1258 + queue->cmnd_capsule_len = nctrl->ioccsz * 16; 1262 1259 else 1263 1260 queue->cmnd_capsule_len = sizeof(struct nvme_command) + 1264 1261 NVME_TCP_ADMIN_CCSZ; ··· 1266 1263 ret = sock_create(ctrl->addr.ss_family, SOCK_STREAM, 1267 1264 IPPROTO_TCP, &queue->sock); 1268 1265 if (ret) { 1269 - dev_err(ctrl->ctrl.device, 1266 + dev_err(nctrl->device, 1270 1267 "failed to create socket: %d\n", ret); 1271 1268 return ret; 1272 1269 } ··· 1276 1273 ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_SYNCNT, 1277 1274 (char *)&opt, sizeof(opt)); 1278 1275 if (ret) { 1279 - dev_err(ctrl->ctrl.device, 1276 + dev_err(nctrl->device, 1280 1277 "failed to set TCP_SYNCNT sock opt %d\n", ret); 1281 1278 goto err_sock; 1282 1279 } ··· 1286 1283 ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, 1287 1284 TCP_NODELAY, (char *)&opt, sizeof(opt)); 1288 1285 if (ret) { 1289 - dev_err(ctrl->ctrl.device, 1286 + dev_err(nctrl->device, 1290 1287 
"failed to set TCP_NODELAY sock opt %d\n", ret); 1291 1288 goto err_sock; 1292 1289 } ··· 1299 1296 ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_LINGER, 1300 1297 (char *)&sol, sizeof(sol)); 1301 1298 if (ret) { 1302 - dev_err(ctrl->ctrl.device, 1299 + dev_err(nctrl->device, 1303 1300 "failed to set SO_LINGER sock opt %d\n", ret); 1304 1301 goto err_sock; 1302 + } 1303 + 1304 + /* Set socket type of service */ 1305 + if (nctrl->opts->tos >= 0) { 1306 + opt = nctrl->opts->tos; 1307 + ret = kernel_setsockopt(queue->sock, SOL_IP, IP_TOS, 1308 + (char *)&opt, sizeof(opt)); 1309 + if (ret) { 1310 + dev_err(nctrl->device, 1311 + "failed to set IP_TOS sock opt %d\n", ret); 1312 + goto err_sock; 1313 + } 1305 1314 } 1306 1315 1307 1316 queue->sock->sk->sk_allocation = GFP_ATOMIC; ··· 1329 1314 queue->pdu_offset = 0; 1330 1315 sk_set_memalloc(queue->sock->sk); 1331 1316 1332 - if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR) { 1317 + if (nctrl->opts->mask & NVMF_OPT_HOST_TRADDR) { 1333 1318 ret = kernel_bind(queue->sock, (struct sockaddr *)&ctrl->src_addr, 1334 1319 sizeof(ctrl->src_addr)); 1335 1320 if (ret) { 1336 - dev_err(ctrl->ctrl.device, 1321 + dev_err(nctrl->device, 1337 1322 "failed to bind queue %d socket %d\n", 1338 1323 qid, ret); 1339 1324 goto err_sock; ··· 1345 1330 if (queue->hdr_digest || queue->data_digest) { 1346 1331 ret = nvme_tcp_alloc_crypto(queue); 1347 1332 if (ret) { 1348 - dev_err(ctrl->ctrl.device, 1333 + dev_err(nctrl->device, 1349 1334 "failed to allocate queue %d crypto\n", qid); 1350 1335 goto err_sock; 1351 1336 } ··· 1359 1344 goto err_crypto; 1360 1345 } 1361 1346 1362 - dev_dbg(ctrl->ctrl.device, "connecting queue %d\n", 1347 + dev_dbg(nctrl->device, "connecting queue %d\n", 1363 1348 nvme_tcp_queue_id(queue)); 1364 1349 1365 1350 ret = kernel_connect(queue->sock, (struct sockaddr *)&ctrl->addr, 1366 1351 sizeof(ctrl->addr), 0); 1367 1352 if (ret) { 1368 - dev_err(ctrl->ctrl.device, 1353 + dev_err(nctrl->device, 1369 1354 "failed 
to connect socket: %d\n", ret); 1370 1355 goto err_rcv_pdu; 1371 1356 } ··· 1386 1371 queue->sock->sk->sk_data_ready = nvme_tcp_data_ready; 1387 1372 queue->sock->sk->sk_state_change = nvme_tcp_state_change; 1388 1373 queue->sock->sk->sk_write_space = nvme_tcp_write_space; 1374 + queue->sock->sk->sk_ll_usec = 1; 1389 1375 write_unlock_bh(&queue->sock->sk->sk_callback_lock); 1390 1376 1391 1377 return 0; ··· 1485 1469 set->driver_data = ctrl; 1486 1470 set->nr_hw_queues = nctrl->queue_count - 1; 1487 1471 set->timeout = NVME_IO_TIMEOUT; 1488 - set->nr_maps = 2 /* default + read */; 1472 + set->nr_maps = nctrl->opts->nr_poll_queues ? HCTX_MAX_TYPES : 2; 1489 1473 } 1490 1474 1491 1475 ret = blk_mq_alloc_tag_set(set); ··· 1584 1568 1585 1569 nr_io_queues = min(ctrl->opts->nr_io_queues, num_online_cpus()); 1586 1570 nr_io_queues += min(ctrl->opts->nr_write_queues, num_online_cpus()); 1571 + nr_io_queues += min(ctrl->opts->nr_poll_queues, num_online_cpus()); 1587 1572 1588 1573 return nr_io_queues; 1589 1574 } ··· 1615 1598 ctrl->io_queues[HCTX_TYPE_DEFAULT] = 1616 1599 min(opts->nr_io_queues, nr_io_queues); 1617 1600 nr_io_queues -= ctrl->io_queues[HCTX_TYPE_DEFAULT]; 1601 + } 1602 + 1603 + if (opts->nr_poll_queues && nr_io_queues) { 1604 + /* map dedicated poll queues only if we have queues left */ 1605 + ctrl->io_queues[HCTX_TYPE_POLL] = 1606 + min(opts->nr_poll_queues, nr_io_queues); 1618 1607 } 1619 1608 } 1620 1609 ··· 1703 1680 nvme_tcp_stop_queue(ctrl, 0); 1704 1681 if (remove) { 1705 1682 blk_cleanup_queue(ctrl->admin_q); 1683 + blk_cleanup_queue(ctrl->fabrics_q); 1706 1684 blk_mq_free_tag_set(ctrl->admin_tagset); 1707 1685 } 1708 1686 nvme_tcp_free_admin_queue(ctrl); ··· 1724 1700 goto out_free_queue; 1725 1701 } 1726 1702 1703 + ctrl->fabrics_q = blk_mq_init_queue(ctrl->admin_tagset); 1704 + if (IS_ERR(ctrl->fabrics_q)) { 1705 + error = PTR_ERR(ctrl->fabrics_q); 1706 + goto out_free_tagset; 1707 + } 1708 + 1727 1709 ctrl->admin_q = 
blk_mq_init_queue(ctrl->admin_tagset); 1728 1710 if (IS_ERR(ctrl->admin_q)) { 1729 1711 error = PTR_ERR(ctrl->admin_q); 1730 - goto out_free_tagset; 1712 + goto out_cleanup_fabrics_q; 1731 1713 } 1732 1714 } 1733 1715 ··· 1741 1711 if (error) 1742 1712 goto out_cleanup_queue; 1743 1713 1744 - error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap); 1745 - if (error) { 1746 - dev_err(ctrl->device, 1747 - "prop_get NVME_REG_CAP failed\n"); 1748 - goto out_stop_queue; 1749 - } 1750 - 1751 - ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize); 1752 - 1753 - error = nvme_enable_ctrl(ctrl, ctrl->cap); 1714 + error = nvme_enable_ctrl(ctrl); 1754 1715 if (error) 1755 1716 goto out_stop_queue; 1717 + 1718 + blk_mq_unquiesce_queue(ctrl->admin_q); 1756 1719 1757 1720 error = nvme_init_identify(ctrl); 1758 1721 if (error) ··· 1758 1735 out_cleanup_queue: 1759 1736 if (new) 1760 1737 blk_cleanup_queue(ctrl->admin_q); 1738 + out_cleanup_fabrics_q: 1739 + if (new) 1740 + blk_cleanup_queue(ctrl->fabrics_q); 1761 1741 out_free_tagset: 1762 1742 if (new) 1763 1743 blk_mq_free_tag_set(ctrl->admin_tagset); ··· 1774 1748 { 1775 1749 blk_mq_quiesce_queue(ctrl->admin_q); 1776 1750 nvme_tcp_stop_queue(ctrl, 0); 1777 - if (ctrl->admin_tagset) 1751 + if (ctrl->admin_tagset) { 1778 1752 blk_mq_tagset_busy_iter(ctrl->admin_tagset, 1779 1753 nvme_cancel_request, ctrl); 1780 - blk_mq_unquiesce_queue(ctrl->admin_q); 1754 + blk_mq_tagset_wait_completed_request(ctrl->admin_tagset); 1755 + } 1756 + if (remove) 1757 + blk_mq_unquiesce_queue(ctrl->admin_q); 1781 1758 nvme_tcp_destroy_admin_queue(ctrl, remove); 1782 1759 } 1783 1760 ··· 1791 1762 return; 1792 1763 nvme_stop_queues(ctrl); 1793 1764 nvme_tcp_stop_io_queues(ctrl); 1794 - if (ctrl->tagset) 1765 + if (ctrl->tagset) { 1795 1766 blk_mq_tagset_busy_iter(ctrl->tagset, 1796 1767 nvme_cancel_request, ctrl); 1768 + blk_mq_tagset_wait_completed_request(ctrl->tagset); 1769 + } 1797 1770 if (remove) 1798 1771 
nvme_start_queues(ctrl); 1799 1772 nvme_tcp_destroy_io_queues(ctrl, remove); ··· 1824 1793 static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new) 1825 1794 { 1826 1795 struct nvmf_ctrl_options *opts = ctrl->opts; 1827 - int ret = -EINVAL; 1796 + int ret; 1828 1797 1829 1798 ret = nvme_tcp_configure_admin_queue(ctrl, new); 1830 1799 if (ret) ··· 1907 1876 /* unquiesce to fail fast pending requests */ 1908 1877 nvme_start_queues(ctrl); 1909 1878 nvme_tcp_teardown_admin_queue(ctrl, false); 1879 + blk_mq_unquiesce_queue(ctrl->admin_q); 1910 1880 1911 1881 if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) { 1912 1882 /* state change failure is ok if we're in DELETING state */ ··· 1924 1892 cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work); 1925 1893 1926 1894 nvme_tcp_teardown_io_queues(ctrl, shutdown); 1895 + blk_mq_quiesce_queue(ctrl->admin_q); 1927 1896 if (shutdown) 1928 1897 nvme_shutdown_ctrl(ctrl); 1929 1898 else 1930 - nvme_disable_ctrl(ctrl, ctrl->cap); 1899 + nvme_disable_ctrl(ctrl); 1931 1900 nvme_tcp_teardown_admin_queue(ctrl, shutdown); 1932 1901 } 1933 1902 ··· 2184 2151 blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]); 2185 2152 blk_mq_map_queues(&set->map[HCTX_TYPE_READ]); 2186 2153 2154 + if (opts->nr_poll_queues && ctrl->io_queues[HCTX_TYPE_POLL]) { 2155 + /* map dedicated poll queues only if we have queues left */ 2156 + set->map[HCTX_TYPE_POLL].nr_queues = 2157 + ctrl->io_queues[HCTX_TYPE_POLL]; 2158 + set->map[HCTX_TYPE_POLL].queue_offset = 2159 + ctrl->io_queues[HCTX_TYPE_DEFAULT] + 2160 + ctrl->io_queues[HCTX_TYPE_READ]; 2161 + blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]); 2162 + } 2163 + 2187 2164 dev_info(ctrl->ctrl.device, 2188 - "mapped %d/%d default/read queues.\n", 2165 + "mapped %d/%d/%d default/read/poll queues.\n", 2189 2166 ctrl->io_queues[HCTX_TYPE_DEFAULT], 2190 - ctrl->io_queues[HCTX_TYPE_READ]); 2167 + ctrl->io_queues[HCTX_TYPE_READ], 2168 + ctrl->io_queues[HCTX_TYPE_POLL]); 2191 2169 2192 2170 return 0; 2171 
+ } 2172 + 2173 + static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx) 2174 + { 2175 + struct nvme_tcp_queue *queue = hctx->driver_data; 2176 + struct sock *sk = queue->sock->sk; 2177 + 2178 + if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue)) 2179 + sk_busy_loop(sk, true); 2180 + nvme_tcp_try_recv(queue); 2181 + return queue->nr_cqe; 2193 2182 } 2194 2183 2195 2184 static struct blk_mq_ops nvme_tcp_mq_ops = { ··· 2222 2167 .init_hctx = nvme_tcp_init_hctx, 2223 2168 .timeout = nvme_tcp_timeout, 2224 2169 .map_queues = nvme_tcp_map_queues, 2170 + .poll = nvme_tcp_poll, 2225 2171 }; 2226 2172 2227 2173 static struct blk_mq_ops nvme_tcp_admin_mq_ops = { ··· 2276 2220 2277 2221 INIT_LIST_HEAD(&ctrl->list); 2278 2222 ctrl->ctrl.opts = opts; 2279 - ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues + 1; 2223 + ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues + 2224 + opts->nr_poll_queues + 1; 2280 2225 ctrl->ctrl.sqsize = opts->queue_size - 1; 2281 2226 ctrl->ctrl.kato = opts->kato; 2282 2227 ··· 2371 2314 .allowed_opts = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY | 2372 2315 NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO | 2373 2316 NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST | 2374 - NVMF_OPT_NR_WRITE_QUEUES, 2317 + NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES | 2318 + NVMF_OPT_TOS, 2375 2319 .create_ctrl = nvme_tcp_create_ctrl, 2376 2320 }; 2377 2321
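
The queue accounting added to nvme-tcp above can be sketched in user space as follows. This is a minimal sketch, not the driver code: `struct queue_opts`, `sketch_nr_io_queues()` and `sketch_poll_queue_offset()` are hypothetical names, and the online CPU count is passed in as a plain parameter instead of coming from `num_online_cpus()`.

```c
/* Hypothetical stand-in for the nr_*_queues fields of struct nvmf_ctrl_options. */
struct queue_opts {
	unsigned int nr_io_queues;
	unsigned int nr_write_queues;
	unsigned int nr_poll_queues;
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Each queue class is clamped to the online CPU count on its own and the
 * results are summed, mirroring the three min() lines in the hunk above.
 */
static unsigned int sketch_nr_io_queues(const struct queue_opts *opts,
					unsigned int online_cpus)
{
	unsigned int nr = 0;

	nr += min_uint(opts->nr_io_queues, online_cpus);
	nr += min_uint(opts->nr_write_queues, online_cpus);
	nr += min_uint(opts->nr_poll_queues, online_cpus);
	return nr;
}

/*
 * Poll queues are laid out after the default and read queues, so the poll
 * map's queue_offset is the sum of those two counts.
 */
static unsigned int sketch_poll_queue_offset(unsigned int nr_default,
					     unsigned int nr_read)
{
	return nr_default + nr_read;
}
```

With 4 I/O, 2 write and 8 poll queues requested on a 4-CPU host, the sketch yields 4 + 2 + 4 = 10 I/O queues; the admin queue is counted separately via the `+ 1` in `queue_count`.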

+18

drivers/nvme/host/trace.c

··· 86 86 return ret; 87 87 } 88 88 89 + static const char *nvme_trace_get_lba_status(struct trace_seq *p, 90 + u8 *cdw10) 91 + { 92 + const char *ret = trace_seq_buffer_ptr(p); 93 + u64 slba = get_unaligned_le64(cdw10); 94 + u32 mndw = get_unaligned_le32(cdw10 + 8); 95 + u16 rl = get_unaligned_le16(cdw10 + 12); 96 + u8 atype = cdw10[15]; 97 + 98 + trace_seq_printf(p, "slba=0x%llx, mndw=0x%x, rl=0x%x, atype=%u", 99 + slba, mndw, rl, atype); 100 + trace_seq_putc(p, 0); 101 + 102 + return ret; 103 + } 104 + 89 105 static const char *nvme_trace_read_write(struct trace_seq *p, u8 *cdw10) 90 106 { 91 107 const char *ret = trace_seq_buffer_ptr(p); ··· 157 141 return nvme_trace_admin_identify(p, cdw10); 158 142 case nvme_admin_get_features: 159 143 return nvme_trace_admin_get_features(p, cdw10); 144 + case nvme_admin_get_lba_status: 145 + return nvme_trace_get_lba_status(p, cdw10); 160 146 default: 161 147 return nvme_trace_common(p, cdw10); 162 148 }
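
The field layout that `nvme_trace_get_lba_status()` reads out of `cdw10` can be illustrated with a portable user-space decoder. This is a sketch under stated assumptions: `struct lba_status_fields`, `read_le()` and `decode_lba_status()` are hypothetical helpers that stand in for the kernel's `get_unaligned_le*()` accessors, and the caller is assumed to pass at least 16 bytes.

```c
#include <stdint.h>

/* Hypothetical container for the four fields the trace helper prints. */
struct lba_status_fields {
	uint64_t slba;   /* bytes 0..7, little endian */
	uint32_t mndw;   /* bytes 8..11, little endian */
	uint16_t rl;     /* bytes 12..13, little endian */
	uint8_t  atype;  /* byte 15 */
};

/* Assemble n little-endian bytes starting at p, most significant byte first. */
static uint64_t read_le(const uint8_t *p, unsigned int n)
{
	uint64_t v = 0;

	while (n--)
		v = (v << 8) | p[n];
	return v;
}

/* cdw10 must point to at least 16 readable bytes. */
static struct lba_status_fields decode_lba_status(const uint8_t *cdw10)
{
	struct lba_status_fields f;

	f.slba  = read_le(cdw10, 8);
	f.mndw  = (uint32_t)read_le(cdw10 + 8, 4);
	f.rl    = (uint16_t)read_le(cdw10 + 12, 2);
	f.atype = cdw10[15];
	return f;
}
```

Byte 14 is not read by the trace helper in the diff, so the sketch skips it as well.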

+11 -11

drivers/nvme/target/admin-cmd.c

··· 37 37 static void nvmet_execute_get_log_page_error(struct nvmet_req *req) 38 38 { 39 39 struct nvmet_ctrl *ctrl = req->sq->ctrl; 40 - u16 status = NVME_SC_SUCCESS; 41 40 unsigned long flags; 42 41 off_t offset = 0; 43 42 u64 slot; ··· 46 47 slot = ctrl->err_counter % NVMET_ERROR_LOG_SLOTS; 47 48 48 49 for (i = 0; i < NVMET_ERROR_LOG_SLOTS; i++) { 49 - status = nvmet_copy_to_sgl(req, offset, &ctrl->slots[slot], 50 - sizeof(struct nvme_error_slot)); 51 - if (status) 50 + if (nvmet_copy_to_sgl(req, offset, &ctrl->slots[slot], 51 + sizeof(struct nvme_error_slot))) 52 52 break; 53 53 54 54 if (slot == 0) ··· 57 59 offset += sizeof(struct nvme_error_slot); 58 60 } 59 61 spin_unlock_irqrestore(&ctrl->error_lock, flags); 60 - nvmet_req_complete(req, status); 62 + nvmet_req_complete(req, 0); 61 63 } 62 64 63 65 static u16 nvmet_get_smart_log_nsid(struct nvmet_req *req, ··· 79 81 goto out; 80 82 81 83 host_reads = part_stat_read(ns->bdev->bd_part, ios[READ]); 82 - data_units_read = part_stat_read(ns->bdev->bd_part, sectors[READ]); 84 + data_units_read = DIV_ROUND_UP(part_stat_read(ns->bdev->bd_part, 85 + sectors[READ]), 1000); 83 86 host_writes = part_stat_read(ns->bdev->bd_part, ios[WRITE]); 84 - data_units_written = part_stat_read(ns->bdev->bd_part, sectors[WRITE]); 87 + data_units_written = DIV_ROUND_UP(part_stat_read(ns->bdev->bd_part, 88 + sectors[WRITE]), 1000); 85 89 86 90 put_unaligned_le64(host_reads, &slog->host_reads[0]); 87 91 put_unaligned_le64(data_units_read, &slog->data_units_read[0]); ··· 111 111 if (!ns->bdev) 112 112 continue; 113 113 host_reads += part_stat_read(ns->bdev->bd_part, ios[READ]); 114 - data_units_read += 115 - part_stat_read(ns->bdev->bd_part, sectors[READ]); 114 + data_units_read += DIV_ROUND_UP( 115 + part_stat_read(ns->bdev->bd_part, sectors[READ]), 1000); 116 116 host_writes += part_stat_read(ns->bdev->bd_part, ios[WRITE]); 117 - data_units_written += 118 - part_stat_read(ns->bdev->bd_part, sectors[WRITE]); 117 + data_units_written += 
DIV_ROUND_UP( 118 + part_stat_read(ns->bdev->bd_part, sectors[WRITE]), 1000); 119 119 120 120 } 121 121 rcu_read_unlock();
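
The SMART log change above reports data units in thousands of 512-byte sectors, rounded up, instead of raw sector counts. A minimal sketch of that arithmetic, with the hypothetical helpers `div_round_up_u64()` and `sectors_to_data_units()` standing in for the kernel's `DIV_ROUND_UP()` usage:

```c
#include <stdint.h>

/* Same rounding as DIV_ROUND_UP(): (n + d - 1) / d. Assumes n + d - 1 does not overflow. */
static uint64_t div_round_up_u64(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/* One data unit covers 1000 sectors of 512 bytes; any remainder counts as a full unit. */
static uint64_t sectors_to_data_units(uint64_t sectors)
{
	return div_round_up_u64(sectors, 1000);
}
```

For example, 1000 sectors map to 1 data unit and 1001 sectors map to 2. Note that the all-namespaces path in the diff rounds each namespace up before summing, so its total can exceed the rounded-up value of the combined sector count.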

+1 -3

drivers/nvme/target/discovery.c

··· 381 381 { 382 382 nvmet_disc_subsys = 383 383 nvmet_subsys_alloc(NVME_DISC_SUBSYS_NAME, NVME_NQN_DISC); 384 - if (IS_ERR(nvmet_disc_subsys)) 385 - return PTR_ERR(nvmet_disc_subsys); 386 - return 0; 384 + return PTR_ERR_OR_ZERO(nvmet_disc_subsys); 387 385 } 388 386 389 387 void nvmet_exit_discovery(void)
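
The `PTR_ERR_OR_ZERO()` conversion above relies on the kernel convention of encoding small negative errno values in the top of the pointer range. The following is a user-space model of that convention, not the kernel header: the `model_*` names and `MODEL_MAX_ERRNO` are hypothetical, and the value 4095 is taken from memory of the kernel's `MAX_ERRNO`.

```c
#include <stdint.h>

/* Assumed to match the kernel's MAX_ERRNO; treat as illustrative. */
#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* A pointer is an error if it falls in the last MODEL_MAX_ERRNO addresses. */
static int model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Collapse "if (IS_ERR(p)) return PTR_ERR(p); return 0;" into one call. */
static long model_ptr_err_or_zero(const void *p)
{
	return model_is_err(p) ? (long)(intptr_t)p : 0;
}
```

The three removed lines in `nvmet_init_discovery()` and the single replacement line are equivalent under this model: an error pointer yields its errno, any other pointer yields 0.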

+16 -14

drivers/nvme/target/loop.c

··· 253 253 clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags); 254 254 nvmet_sq_destroy(&ctrl->queues[0].nvme_sq); 255 255 blk_cleanup_queue(ctrl->ctrl.admin_q); 256 + blk_cleanup_queue(ctrl->ctrl.fabrics_q); 256 257 blk_mq_free_tag_set(&ctrl->admin_tag_set); 257 258 } 258 259 ··· 358 357 goto out_free_sq; 359 358 ctrl->ctrl.admin_tagset = &ctrl->admin_tag_set; 360 359 360 + ctrl->ctrl.fabrics_q = blk_mq_init_queue(&ctrl->admin_tag_set); 361 + if (IS_ERR(ctrl->ctrl.fabrics_q)) { 362 + error = PTR_ERR(ctrl->ctrl.fabrics_q); 363 + goto out_free_tagset; 364 + } 365 + 361 366 ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set); 362 367 if (IS_ERR(ctrl->ctrl.admin_q)) { 363 368 error = PTR_ERR(ctrl->ctrl.admin_q); 364 - goto out_free_tagset; 369 + goto out_cleanup_fabrics_q; 365 370 } 366 371 367 372 error = nvmf_connect_admin_queue(&ctrl->ctrl); ··· 376 369 377 370 set_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags); 378 371 379 - error = nvmf_reg_read64(&ctrl->ctrl, NVME_REG_CAP, &ctrl->ctrl.cap); 380 - if (error) { 381 - dev_err(ctrl->ctrl.device, 382 - "prop_get NVME_REG_CAP failed\n"); 383 - goto out_cleanup_queue; 384 - } 385 - 386 - ctrl->ctrl.sqsize = 387 - min_t(int, NVME_CAP_MQES(ctrl->ctrl.cap), ctrl->ctrl.sqsize); 388 - 389 - error = nvme_enable_ctrl(&ctrl->ctrl, ctrl->ctrl.cap); 372 + error = nvme_enable_ctrl(&ctrl->ctrl); 390 373 if (error) 391 374 goto out_cleanup_queue; 392 375 393 376 ctrl->ctrl.max_hw_sectors = 394 377 (NVME_LOOP_MAX_SEGMENTS - 1) << (PAGE_SHIFT - 9); 378 + 379 + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 395 380 396 381 error = nvme_init_identify(&ctrl->ctrl); 397 382 if (error) ··· 393 394 394 395 out_cleanup_queue: 395 396 blk_cleanup_queue(ctrl->ctrl.admin_q); 397 + out_cleanup_fabrics_q: 398 + blk_cleanup_queue(ctrl->ctrl.fabrics_q); 396 399 out_free_tagset: 397 400 blk_mq_free_tag_set(&ctrl->admin_tag_set); 398 401 out_free_sq: ··· 408 407 nvme_stop_queues(&ctrl->ctrl); 409 408 blk_mq_tagset_busy_iter(&ctrl->tag_set, 
410 409 nvme_cancel_request, &ctrl->ctrl); 410 + blk_mq_tagset_wait_completed_request(&ctrl->tag_set); 411 411 nvme_loop_destroy_io_queues(ctrl); 412 412 } 413 413 414 + blk_mq_quiesce_queue(ctrl->ctrl.admin_q); 414 415 if (ctrl->ctrl.state == NVME_CTRL_LIVE) 415 416 nvme_shutdown_ctrl(&ctrl->ctrl); 416 417 417 - blk_mq_quiesce_queue(ctrl->ctrl.admin_q); 418 418 blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, 419 419 nvme_cancel_request, &ctrl->ctrl); 420 - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); 420 + blk_mq_tagset_wait_completed_request(&ctrl->admin_tag_set); 421 421 nvme_loop_destroy_admin_queue(ctrl); 422 422 } 423 423

+20 -4

drivers/nvme/target/tcp.c

··· 348 348 349 349 return 0; 350 350 err: 351 - sgl_free(cmd->req.sg); 351 + if (cmd->req.sg_cnt) 352 + sgl_free(cmd->req.sg); 352 353 return NVME_SC_INTERNAL; 353 354 } 354 355 ··· 554 553 555 554 if (queue->nvme_sq.sqhd_disabled) { 556 555 kfree(cmd->iov); 557 - sgl_free(cmd->req.sg); 556 + if (cmd->req.sg_cnt) 557 + sgl_free(cmd->req.sg); 558 558 } 559 559 560 560 return 1; ··· 586 584 return -EAGAIN; 587 585 588 586 kfree(cmd->iov); 589 - sgl_free(cmd->req.sg); 587 + if (cmd->req.sg_cnt) 588 + sgl_free(cmd->req.sg); 590 589 cmd->queue->snd_cmd = NULL; 591 590 nvmet_tcp_put_cmd(cmd); 592 591 return 1; ··· 1309 1306 { 1310 1307 nvmet_req_uninit(&cmd->req); 1311 1308 nvmet_tcp_unmap_pdu_iovec(cmd); 1312 - sgl_free(cmd->req.sg); 1309 + kfree(cmd->iov); 1310 + if (cmd->req.sg_cnt) 1311 + sgl_free(cmd->req.sg); 1313 1312 } 1314 1313 1315 1314 static void nvmet_tcp_uninit_data_in_cmds(struct nvmet_tcp_queue *queue) ··· 1415 1410 static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue) 1416 1411 { 1417 1412 struct socket *sock = queue->sock; 1413 + struct inet_sock *inet = inet_sk(sock->sk); 1418 1414 struct linger sol = { .l_onoff = 1, .l_linger = 0 }; 1419 1415 int ret; 1420 1416 ··· 1438 1432 (char *)&sol, sizeof(sol)); 1439 1433 if (ret) 1440 1434 return ret; 1435 + 1436 + /* Set socket type of service */ 1437 + if (inet->rcv_tos > 0) { 1438 + int tos = inet->rcv_tos; 1439 + 1440 + ret = kernel_setsockopt(sock, SOL_IP, IP_TOS, 1441 + (char *)&tos, sizeof(tos)); 1442 + if (ret) 1443 + return ret; 1444 + } 1441 1445 1442 1446 write_lock_bh(&sock->sk->sk_callback_lock); 1443 1447 sock->sk->sk_user_data = queue;

+18

drivers/nvme/target/trace.c

··· 33 33 return ret; 34 34 } 35 35 36 + static const char *nvmet_trace_get_lba_status(struct trace_seq *p, 37 + u8 *cdw10) 38 + { 39 + const char *ret = trace_seq_buffer_ptr(p); 40 + u64 slba = get_unaligned_le64(cdw10); 41 + u32 mndw = get_unaligned_le32(cdw10 + 8); 42 + u16 rl = get_unaligned_le16(cdw10 + 12); 43 + u8 atype = cdw10[15]; 44 + 45 + trace_seq_printf(p, "slba=0x%llx, mndw=0x%x, rl=0x%x, atype=%u", 46 + slba, mndw, rl, atype); 47 + trace_seq_putc(p, 0); 48 + 49 + return ret; 50 + } 51 + 36 52 static const char *nvmet_trace_read_write(struct trace_seq *p, u8 *cdw10) 37 53 { 38 54 const char *ret = trace_seq_buffer_ptr(p); ··· 96 80 return nvmet_trace_admin_identify(p, cdw10); 97 81 case nvme_admin_get_features: 98 82 return nvmet_trace_admin_get_features(p, cdw10); 83 + case nvme_admin_get_lba_status: 84 + return nvmet_trace_get_lba_status(p, cdw10); 99 85 default: 100 86 return nvmet_trace_common(p, cdw10); 101 87 }

+13

drivers/scsi/scsi_lib.c

··· 1089 1089 cmd->retries = 0; 1090 1090 } 1091 1091 1092 + /* 1093 + * Only called when the request isn't completed by SCSI, and not freed by 1094 + * SCSI 1095 + */ 1096 + static void scsi_cleanup_rq(struct request *rq) 1097 + { 1098 + if (rq->rq_flags & RQF_DONTPREP) { 1099 + scsi_mq_uninit_cmd(blk_mq_rq_to_pdu(rq)); 1100 + rq->rq_flags &= ~RQF_DONTPREP; 1101 + } 1102 + } 1103 + 1092 1104 /* Add a command to the list used by the aacraid and dpt_i2o drivers */ 1093 1105 void scsi_add_cmd_to_list(struct scsi_cmnd *cmd) 1094 1106 { ··· 1833 1821 .init_request = scsi_mq_init_request, 1834 1822 .exit_request = scsi_mq_exit_request, 1835 1823 .initialize_rq_fn = scsi_initialize_rq, 1824 + .cleanup_rq = scsi_cleanup_rq, 1836 1825 .busy = scsi_mq_lld_busy, 1837 1826 .map_queues = scsi_map_queues, 1838 1827 };

+1 -2

drivers/scsi/scsi_pm.c

··· 94 94 if (!err && scsi_is_sdev_device(dev)) { 95 95 struct scsi_device *sdev = to_scsi_device(dev); 96 96 97 - if (sdev->request_queue->dev) 98 - blk_set_runtime_active(sdev->request_queue); 97 + blk_set_runtime_active(sdev->request_queue); 99 98 } 100 99 } 101 100

+4 -1

drivers/scsi/sd.c

··· 1293 1293 case REQ_OP_WRITE: 1294 1294 return sd_setup_read_write_cmnd(cmd); 1295 1295 case REQ_OP_ZONE_RESET: 1296 - return sd_zbc_setup_reset_cmnd(cmd); 1296 + return sd_zbc_setup_reset_cmnd(cmd, false); 1297 + case REQ_OP_ZONE_RESET_ALL: 1298 + return sd_zbc_setup_reset_cmnd(cmd, true); 1297 1299 default: 1298 1300 WARN_ON_ONCE(1); 1299 1301 return BLK_STS_NOTSUPP; ··· 1961 1959 case REQ_OP_WRITE_ZEROES: 1962 1960 case REQ_OP_WRITE_SAME: 1963 1961 case REQ_OP_ZONE_RESET: 1962 + case REQ_OP_ZONE_RESET_ALL: 1964 1963 if (!result) { 1965 1964 good_bytes = blk_rq_bytes(req); 1966 1965 scsi_set_resid(SCpnt, 0);

+3 -2

drivers/scsi/sd.h

··· 209 209 210 210 extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer); 211 211 extern void sd_zbc_print_zones(struct scsi_disk *sdkp); 212 - extern blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd); 212 + extern blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd, bool all); 213 213 extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes, 214 214 struct scsi_sense_hdr *sshdr); 215 215 extern int sd_zbc_report_zones(struct gendisk *disk, sector_t sector, ··· 225 225 226 226 static inline void sd_zbc_print_zones(struct scsi_disk *sdkp) {} 227 227 228 - static inline blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd) 228 + static inline blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd, 229 + bool all) 229 230 { 230 231 return BLK_STS_TARGET; 231 232 }

+10 -2

drivers/scsi/sd_zbc.c

··· 209 209 /** 210 210 * sd_zbc_setup_reset_cmnd - Prepare a RESET WRITE POINTER scsi command. 211 211 * @cmd: the command to setup 212 + * @all: Reset all zones control. 212 213 * 213 214 * Called from sd_init_command() for a REQ_OP_ZONE_RESET request. 214 215 */ 215 - blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd) 216 + blk_status_t sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd, bool all) 216 217 { 217 218 struct request *rq = cmd->request; 218 219 struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); ··· 235 234 memset(cmd->cmnd, 0, cmd->cmd_len); 236 235 cmd->cmnd[0] = ZBC_OUT; 237 236 cmd->cmnd[1] = ZO_RESET_WRITE_POINTER; 238 - put_unaligned_be64(block, &cmd->cmnd[2]); 237 + if (all) 238 + cmd->cmnd[14] = 0x1; 239 + else 240 + put_unaligned_be64(block, &cmd->cmnd[2]); 239 241 240 242 rq->timeout = SD_TIMEOUT; 241 243 cmd->sc_data_direction = DMA_NONE; ··· 265 261 266 262 switch (req_op(rq)) { 267 263 case REQ_OP_ZONE_RESET: 264 + case REQ_OP_ZONE_RESET_ALL: 268 265 269 266 if (result && 270 267 sshdr->sense_key == ILLEGAL_REQUEST && ··· 492 487 /* The drive satisfies the kernel restrictions: set it up */ 493 488 blk_queue_chunk_sectors(sdkp->disk->queue, 494 489 logical_to_sectors(sdkp->device, zone_blocks)); 490 + blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, sdkp->disk->queue); 491 + blk_queue_required_elevator_features(sdkp->disk->queue, 492 + ELEVATOR_F_ZBD_SEQ_WRITE); 495 493 nr_zones = round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks); 496 494 497 495 /* READ16/WRITE16 is mandatory for ZBC disks */
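
The CDB layout that `sd_zbc_setup_reset_cmnd()` produces for the two reset variants can be sketched outside the kernel like this. The byte positions follow the diff above; the opcode and service-action values (`0x94`, `0x04`) and the 16-byte length are assumptions recalled from the SCSI headers, and all `SKETCH_*` / `build_reset_wp_cdb()` names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assumed values of ZBC_OUT and ZO_RESET_WRITE_POINTER; verify against scsi_proto.h. */
#define SKETCH_ZBC_OUT                0x94
#define SKETCH_ZO_RESET_WRITE_POINTER 0x04
#define SKETCH_CDB_LEN                16

/*
 * Build a RESET WRITE POINTER CDB. With all != 0 only the ALL bit in byte 14
 * is set and the zone ID stays zero; otherwise the zone start LBA is stored
 * big endian in bytes 2..9.
 */
static void build_reset_wp_cdb(uint8_t cdb[SKETCH_CDB_LEN], uint64_t lba, int all)
{
	int i;

	memset(cdb, 0, SKETCH_CDB_LEN);
	cdb[0] = SKETCH_ZBC_OUT;
	cdb[1] = SKETCH_ZO_RESET_WRITE_POINTER;
	if (all) {
		cdb[14] = 0x1;
	} else {
		for (i = 0; i < 8; i++)
			cdb[2 + i] = (uint8_t)(lba >> (56 - 8 * i));
	}
}
```

Keeping one builder with a boolean selector, as the patch does, lets `REQ_OP_ZONE_RESET` and `REQ_OP_ZONE_RESET_ALL` share the timeout, direction and sense handling that follow.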

+127 -47

fs/fs-writeback.c

··· 36 36 */ 37 37 #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_SHIFT - 10)) 38 38 39 - struct wb_completion { 40 - atomic_t cnt; 41 - }; 42 - 43 39 /* 44 40 * Passed into wb_writeback(), essentially a subset of writeback_control 45 41 */ ··· 55 59 struct list_head list; /* pending work list */ 56 60 struct wb_completion *done; /* set if the caller waits */ 57 61 }; 58 - 59 - /* 60 - * If one wants to wait for one or more wb_writeback_works, each work's 61 - * ->done should be set to a wb_completion defined using the following 62 - * macro. Once all work items are issued with wb_queue_work(), the caller 63 - * can wait for the completion of all using wb_wait_for_completion(). Work 64 - * items which are waited upon aren't freed automatically on completion. 65 - */ 66 - #define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \ 67 - struct wb_completion cmpl = { \ 68 - .cnt = ATOMIC_INIT(1), \ 69 - } 70 - 71 62 72 63 /* 73 64 * If an inode is constantly having its pages dirtied, but then the ··· 165 182 if (work->auto_free) 166 183 kfree(work); 167 184 if (done && atomic_dec_and_test(&done->cnt)) 168 - wake_up_all(&wb->bdi->wb_waitq); 185 + wake_up_all(done->waitq); 169 186 } 170 187 171 188 static void wb_queue_work(struct bdi_writeback *wb, ··· 189 206 190 207 /** 191 208 * wb_wait_for_completion - wait for completion of bdi_writeback_works 192 - * @bdi: bdi work items were issued to 193 209 * @done: target wb_completion 194 210 * 195 211 * Wait for one or more work items issued to @bdi with their ->done field 196 - * set to @done, which should have been defined with 197 - * DEFINE_WB_COMPLETION_ONSTACK(). This function returns after all such 198 - * work items are completed. Work items which are waited upon aren't freed 212 + * set to @done, which should have been initialized with 213 + * DEFINE_WB_COMPLETION(). This function returns after all such work items 214 + * are completed. Work items which are waited upon aren't freed 199 215 * automatically on completion. 
200 216 */ 201 - static void wb_wait_for_completion(struct backing_dev_info *bdi, 202 - struct wb_completion *done) 217 + void wb_wait_for_completion(struct wb_completion *done) 203 218 { 204 219 atomic_dec(&done->cnt); /* put down the initial count */ 205 - wait_event(bdi->wb_waitq, !atomic_read(&done->cnt)); 220 + wait_event(*done->waitq, !atomic_read(&done->cnt)); 206 221 } 207 222 208 223 #ifdef CONFIG_CGROUP_WRITEBACK 209 224 210 - /* parameters for foreign inode detection, see wb_detach_inode() */ 225 + /* 226 + * Parameters for foreign inode detection, see wbc_detach_inode() to see 227 + * how they're used. 228 + * 229 + * These paramters are inherently heuristical as the detection target 230 + * itself is fuzzy. All we want to do is detaching an inode from the 231 + * current owner if it's being written to by some other cgroups too much. 232 + * 233 + * The current cgroup writeback is built on the assumption that multiple 234 + * cgroups writing to the same inode concurrently is very rare and a mode 235 + * of operation which isn't well supported. As such, the goal is not 236 + * taking too long when a different cgroup takes over an inode while 237 + * avoiding too aggressive flip-flops from occasional foreign writes. 238 + * 239 + * We record, very roughly, 2s worth of IO time history and if more than 240 + * half of that is foreign, trigger the switch. The recording is quantized 241 + * to 16 slots. To avoid tiny writes from swinging the decision too much, 242 + * writes smaller than 1/8 of avg size are ignored. 
243 + */ 211 244 #define WB_FRN_TIME_SHIFT 13 /* 1s = 2^13, upto 8 secs w/ 16bit */ 212 245 #define WB_FRN_TIME_AVG_SHIFT 3 /* avg = avg * 7/8 + new * 1/8 */ 213 - #define WB_FRN_TIME_CUT_DIV 2 /* ignore rounds < avg / 2 */ 246 + #define WB_FRN_TIME_CUT_DIV 8 /* ignore rounds < avg / 8 */ 214 247 #define WB_FRN_TIME_PERIOD (2 * (1 << WB_FRN_TIME_SHIFT)) /* 2s */ 215 248 216 249 #define WB_FRN_HIST_SLOTS 16 /* inode->i_wb_frn_history is 16bit */ ··· 236 237 /* if foreign slots >= 8, switch */ 237 238 #define WB_FRN_HIST_MAX_SLOTS (WB_FRN_HIST_THR_SLOTS / 2 + 1) 238 239 /* one round can affect upto 5 slots */ 240 + #define WB_FRN_MAX_IN_FLIGHT 1024 /* don't queue too many concurrently */ 239 241 240 242 static atomic_t isw_nr_in_flight = ATOMIC_INIT(0); 241 243 static struct workqueue_struct *isw_wq; ··· 389 389 if (unlikely(inode->i_state & I_FREEING)) 390 390 goto skip_switch; 391 391 392 + trace_inode_switch_wbs(inode, old_wb, new_wb); 393 + 392 394 /* 393 395 * Count and transfer stats. Note that PAGECACHE_TAG_DIRTY points 394 396 * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to ··· 491 489 if (inode->i_state & I_WB_SWITCH) 492 490 return; 493 491 494 - /* 495 - * Avoid starting new switches while sync_inodes_sb() is in 496 - * progress. Otherwise, if the down_write protected issue path 497 - * blocks heavily, we might end up starting a large number of 498 - * switches which will block on the rwsem. 
499 - */ 500 - if (!down_read_trylock(&bdi->wb_switch_rwsem)) 492 + /* avoid queueing a new switch if too many are already in flight */ 493 + if (atomic_read(&isw_nr_in_flight) > WB_FRN_MAX_IN_FLIGHT) 501 494 return; 502 495 503 496 isw = kzalloc(sizeof(*isw), GFP_ATOMIC); 504 497 if (!isw) 505 - goto out_unlock; 498 + return; 506 499 507 500 /* find and pin the new wb */ 508 501 rcu_read_lock(); ··· 531 534 call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn); 532 535 533 536 atomic_inc(&isw_nr_in_flight); 534 - 535 - goto out_unlock; 537 + return; 536 538 537 539 out_free: 538 540 if (isw->new_wb) 539 541 wb_put(isw->new_wb); 540 542 kfree(isw); 541 - out_unlock: 542 - up_read(&bdi->wb_switch_rwsem); 543 543 } 544 544 545 545 /** ··· 674 680 history <<= slots; 675 681 if (wbc->wb_id != max_id) 676 682 history |= (1U << slots) - 1; 683 + 684 + if (history) 685 + trace_inode_foreign_history(inode, wbc, history); 677 686 678 687 /* 679 688 * Switch if the current wb isn't the consistent winner. 
··· 840 843 restart: 841 844 rcu_read_lock(); 842 845 list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) { 843 - DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done); 846 + DEFINE_WB_COMPLETION(fallback_work_done, bdi); 844 847 struct wb_writeback_work fallback_work; 845 848 struct wb_writeback_work *work; 846 849 long nr_pages; ··· 887 890 last_wb = wb; 888 891 889 892 rcu_read_unlock(); 890 - wb_wait_for_completion(bdi, &fallback_work_done); 893 + wb_wait_for_completion(&fallback_work_done); 891 894 goto restart; 892 895 } 893 896 rcu_read_unlock(); 894 897 895 898 if (last_wb) 896 899 wb_put(last_wb); 900 + } 901 + 902 + /** 903 + * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs 904 + * @bdi_id: target bdi id 905 + * @memcg_id: target memcg css id 906 + * @nr_pages: number of pages to write, 0 for best-effort dirty flushing 907 + * @reason: reason why some writeback work initiated 908 + * @done: target wb_completion 909 + * 910 + * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id 911 + * with the specified parameters. 912 + */ 913 + int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr, 914 + enum wb_reason reason, struct wb_completion *done) 915 + { 916 + struct backing_dev_info *bdi; 917 + struct cgroup_subsys_state *memcg_css; 918 + struct bdi_writeback *wb; 919 + struct wb_writeback_work *work; 920 + int ret; 921 + 922 + /* lookup bdi and memcg */ 923 + bdi = bdi_get_by_id(bdi_id); 924 + if (!bdi) 925 + return -ENOENT; 926 + 927 + rcu_read_lock(); 928 + memcg_css = css_from_id(memcg_id, &memory_cgrp_subsys); 929 + if (memcg_css && !css_tryget(memcg_css)) 930 + memcg_css = NULL; 931 + rcu_read_unlock(); 932 + if (!memcg_css) { 933 + ret = -ENOENT; 934 + goto out_bdi_put; 935 + } 936 + 937 + /* 938 + * And find the associated wb. If the wb isn't there already 939 + * there's nothing to flush, don't create one. 
940 + */ 941 + wb = wb_get_lookup(bdi, memcg_css); 942 + if (!wb) { 943 + ret = -ENOENT; 944 + goto out_css_put; 945 + } 946 + 947 + /* 948 + * If @nr is zero, the caller is attempting to write out most of 949 + * the currently dirty pages. Let's take the current dirty page 950 + * count and inflate it by 25% which should be large enough to 951 + * flush out most dirty pages while avoiding getting livelocked by 952 + * concurrent dirtiers. 953 + */ 954 + if (!nr) { 955 + unsigned long filepages, headroom, dirty, writeback; 956 + 957 + mem_cgroup_wb_stats(wb, &filepages, &headroom, &dirty, 958 + &writeback); 959 + nr = dirty * 10 / 8; 960 + } 961 + 962 + /* issue the writeback work */ 963 + work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN); 964 + if (work) { 965 + work->nr_pages = nr; 966 + work->sync_mode = WB_SYNC_NONE; 967 + work->range_cyclic = 1; 968 + work->reason = reason; 969 + work->done = done; 970 + work->auto_free = 1; 971 + wb_queue_work(wb, work); 972 + ret = 0; 973 + } else { 974 + ret = -ENOMEM; 975 + } 976 + 977 + wb_put(wb); 978 + out_css_put: 979 + css_put(memcg_css); 980 + out_bdi_put: 981 + bdi_put(bdi); 982 + return ret; 897 983 } 898 984 899 985 /** ··· 2442 2362 static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, 2443 2363 enum wb_reason reason, bool skip_if_busy) 2444 2364 { 2445 - DEFINE_WB_COMPLETION_ONSTACK(done); 2365 + struct backing_dev_info *bdi = sb->s_bdi; 2366 + DEFINE_WB_COMPLETION(done, bdi); 2446 2367 struct wb_writeback_work work = { 2447 2368 .sb = sb, 2448 2369 .sync_mode = WB_SYNC_NONE, ··· 2452 2371 .nr_pages = nr, 2453 2372 .reason = reason, 2454 2373 }; 2455 - struct backing_dev_info *bdi = sb->s_bdi; 2456 2374 2457 2375 if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) 2458 2376 return; 2459 2377 WARN_ON(!rwsem_is_locked(&sb->s_umount)); 2460 2378 2461 2379 bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); 2462 - wb_wait_for_completion(bdi, &done); 2380 + 
wb_wait_for_completion(&done); 2463 2381 } 2464 2382 2465 2383 /** ··· 2520 2440 */ 2521 2441 void sync_inodes_sb(struct super_block *sb) 2522 2442 { 2523 - DEFINE_WB_COMPLETION_ONSTACK(done); 2443 + struct backing_dev_info *bdi = sb->s_bdi; 2444 + DEFINE_WB_COMPLETION(done, bdi); 2524 2445 struct wb_writeback_work work = { 2525 2446 .sb = sb, 2526 2447 .sync_mode = WB_SYNC_ALL, ··· 2531 2450 .reason = WB_REASON_SYNC, 2532 2451 .for_sync = 1, 2533 2452 }; 2534 - struct backing_dev_info *bdi = sb->s_bdi; 2535 2453 2536 2454 /* 2537 2455 * Can't skip on !bdi_has_dirty() because we should wait for !dirty ··· 2544 2464 /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ 2545 2465 bdi_down_write_wb_switch_rwsem(bdi); 2546 2466 bdi_split_work_to_wbs(bdi, &work, false); 2547 - wb_wait_for_completion(bdi, &done); 2467 + wb_wait_for_completion(&done); 2548 2468 bdi_up_write_wb_switch_rwsem(bdi); 2549 2469 2550 2470 wait_sb_inodes(sb);
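
The counting scheme behind `struct wb_completion` can be modelled with C11 atomics. This is a simplified sketch, assuming that each queued work item increments the counter (as `wb_queue_work()` is recalled to do) and leaving out the wait queue entirely; the `sketch_*` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of wb_completion without the waitq pointer added in this series. */
struct sketch_completion {
	atomic_int cnt;
};

/* The counter starts at 1 so it cannot hit zero while work is still being issued. */
static void sketch_completion_init(struct sketch_completion *c)
{
	atomic_init(&c->cnt, 1);
}

/* Called once per work item that points its ->done at this completion. */
static void sketch_work_queued(struct sketch_completion *c)
{
	atomic_fetch_add(&c->cnt, 1);
}

/* Returns true when this was the last reference, i.e. the waiter should be woken. */
static bool sketch_work_finished(struct sketch_completion *c)
{
	return atomic_fetch_sub(&c->cnt, 1) == 1;
}

/* Waiter side: put down the initial count; true means nothing is pending. */
static bool sketch_put_initial(struct sketch_completion *c)
{
	return atomic_fetch_sub(&c->cnt, 1) == 1;
}
```

In the patched kernel code the waiter then sleeps on `*done->waitq` until the counter reads zero, which is why the completion now carries its own wait queue pointer and `wb_wait_for_completion()` no longer needs a `bdi` argument.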

+23

include/linux/backing-dev-defs.h

··· 63 63 * so it has a mismatch name. 64 64 */ 65 65 WB_REASON_FORKER_THREAD, 66 + WB_REASON_FOREIGN_FLUSH, 66 67 67 68 WB_REASON_MAX, 68 69 }; 70 + 71 + struct wb_completion { 72 + atomic_t cnt; 73 + wait_queue_head_t *waitq; 74 + }; 75 + 76 + #define __WB_COMPLETION_INIT(_waitq) \ 77 + (struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) } 78 + 79 + /* 80 + * If one wants to wait for one or more wb_writeback_works, each work's 81 + * ->done should be set to a wb_completion defined using the following 82 + * macro. Once all work items are issued with wb_queue_work(), the caller 83 + * can wait for the completion of all using wb_wait_for_completion(). Work 84 + * items which are waited upon aren't freed automatically on completion. 85 + */ 86 + #define WB_COMPLETION_INIT(bdi) __WB_COMPLETION_INIT(&(bdi)->wb_waitq) 87 + 88 + #define DEFINE_WB_COMPLETION(cmpl, bdi) \ 89 + struct wb_completion cmpl = WB_COMPLETION_INIT(bdi) 69 90 70 91 /* 71 92 * For cgroup writeback, multiple wb's may map to the same blkcg. Those ··· 186 165 }; 187 166 188 167 struct backing_dev_info { 168 + u64 id; 169 + struct rb_node rb_node; /* keyed by ->id */ 189 170 struct list_head bdi_list; 190 171 unsigned long ra_pages; /* max readahead in PAGE_SIZE units */ 191 172 unsigned long io_pages; /* max allowed IO size */

+5

include/linux/backing-dev.h

··· 24 24 return bdi; 25 25 } 26 26 27 + struct backing_dev_info *bdi_get_by_id(u64 id); 27 28 void bdi_put(struct backing_dev_info *bdi); 28 29 29 30 __printf(2, 3) ··· 44 43 void wb_start_background_writeback(struct bdi_writeback *wb); 45 44 void wb_workfn(struct work_struct *work); 46 45 void wb_wakeup_delayed(struct bdi_writeback *wb); 46 + 47 + void wb_wait_for_completion(struct wb_completion *done); 47 48 48 49 extern spinlock_t bdi_lock; 49 50 extern struct list_head bdi_list; ··· 230 227 struct bdi_writeback_congested * 231 228 wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp); 232 229 void wb_congested_put(struct bdi_writeback_congested *congested); 230 + struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, 231 + struct cgroup_subsys_state *memcg_css); 233 232 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, 234 233 struct cgroup_subsys_state *memcg_css, 235 234 gfp_t gfp);

+4 -2

include/linux/blk-cgroup.h

··· 149 149 typedef void (blkcg_pol_init_cpd_fn)(struct blkcg_policy_data *cpd); 150 150 typedef void (blkcg_pol_free_cpd_fn)(struct blkcg_policy_data *cpd); 151 151 typedef void (blkcg_pol_bind_cpd_fn)(struct blkcg_policy_data *cpd); 152 - typedef struct blkg_policy_data *(blkcg_pol_alloc_pd_fn)(gfp_t gfp, int node); 152 + typedef struct blkg_policy_data *(blkcg_pol_alloc_pd_fn)(gfp_t gfp, 153 + struct request_queue *q, struct blkcg *blkcg); 153 154 typedef void (blkcg_pol_init_pd_fn)(struct blkg_policy_data *pd); 154 155 typedef void (blkcg_pol_online_pd_fn)(struct blkg_policy_data *pd); 155 156 typedef void (blkcg_pol_offline_pd_fn)(struct blkg_policy_data *pd); ··· 234 233 char *body; 235 234 }; 236 235 236 + struct gendisk *blkcg_conf_get_disk(char **inputp); 237 237 int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, 238 238 char *input, struct blkg_conf_ctx *ctx); 239 239 void blkg_conf_finish(struct blkg_conf_ctx *ctx); ··· 377 375 * @q: request_queue of interest 378 376 * 379 377 * Lookup blkg for the @blkcg - @q pair. This function should be called 380 - * under RCU read loc. 378 + * under RCU read lock. 381 379 */ 382 380 static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, 383 381 struct request_queue *q)

+17 -3

include/linux/blk-mq.h

··· 140 140 typedef int (map_queues_fn)(struct blk_mq_tag_set *set); 141 141 typedef bool (busy_fn)(struct request_queue *); 142 142 typedef void (complete_fn)(struct request *); 143 + typedef void (cleanup_rq_fn)(struct request *); 143 144 144 145 145 146 struct blk_mq_ops { ··· 202 201 void (*initialize_rq_fn)(struct request *rq); 203 202 204 203 /* 204 + * Called before freeing one request which isn't completed yet, 205 + * and usually for freeing the driver private data 206 + */ 207 + cleanup_rq_fn *cleanup_rq; 208 + 209 + /* 205 210 * If set, returns whether or not this queue currently is busy 206 211 */ 207 212 busy_fn *busy; ··· 248 241 249 242 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *); 250 243 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, 251 - struct request_queue *q); 244 + struct request_queue *q, 245 + bool elevator_init); 252 246 struct request_queue *blk_mq_init_sq_queue(struct blk_mq_tag_set *set, 253 247 const struct blk_mq_ops *ops, 254 248 unsigned int queue_depth, 255 249 unsigned int set_flags); 256 - int blk_mq_register_dev(struct device *, struct request_queue *); 257 250 void blk_mq_unregister_dev(struct device *, struct request_queue *); 258 251 259 252 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set); ··· 303 296 304 297 305 298 int blk_mq_request_started(struct request *rq); 299 + int blk_mq_request_completed(struct request *rq); 306 300 void blk_mq_start_request(struct request *rq); 307 301 void blk_mq_end_request(struct request *rq, blk_status_t error); 308 302 void __blk_mq_end_request(struct request *rq, blk_status_t error); ··· 312 304 void blk_mq_kick_requeue_list(struct request_queue *q); 313 305 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs); 314 306 bool blk_mq_complete_request(struct request *rq); 315 - void blk_mq_complete_request_sync(struct request *rq); 316 307 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head 
*list, 317 308 struct bio *bio, unsigned int nr_segs); 318 309 bool blk_mq_queue_stopped(struct request_queue *q); ··· 328 321 void blk_mq_run_hw_queues(struct request_queue *q, bool async); 329 322 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, 330 323 busy_tag_iter_fn *fn, void *priv); 324 + void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset); 331 325 void blk_mq_freeze_queue(struct request_queue *q); 332 326 void blk_mq_unfreeze_queue(struct request_queue *q); 333 327 void blk_freeze_queue_start(struct request_queue *q); ··· 372 364 373 365 return rq->internal_tag | (hctx->queue_num << BLK_QC_T_SHIFT) | 374 366 BLK_QC_T_INTERNAL; 367 + } 368 + 369 + static inline void blk_mq_cleanup_rq(struct request *rq) 370 + { 371 + if (rq->q->mq_ops->cleanup_rq) 372 + rq->q->mq_ops->cleanup_rq(rq); 375 373 } 376 374 377 375 #endif
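Note on the hunk above: blk_mq_cleanup_rq() calls the new cleanup_rq hook only when the driver provides one. The sketch below models that optional-callback pattern in plain userspace C. All type and function names here (mq_ops_model, request_model, driver_cleanup) are hypothetical stand-ins, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

struct request_model;

/* Hypothetical stand-in for the relevant part of struct blk_mq_ops. */
struct mq_ops_model {
	void (*cleanup_rq)(struct request_model *rq);	/* optional hook */
};

struct request_model {
	const struct mq_ops_model *ops;
	int private_freed;	/* set by the driver hook in this model */
};

/* Same shape as blk_mq_cleanup_rq(): call the hook only if it is set. */
static void cleanup_rq_model(struct request_model *rq)
{
	assert(rq->ops != NULL);
	if (rq->ops->cleanup_rq)
		rq->ops->cleanup_rq(rq);
}

/* Example driver hook: pretend to release driver private data. */
static void driver_cleanup(struct request_model *rq)
{
	rq->private_freed = 1;
}

/* Returns 1 if the hook ran, 0 if the driver did not install one. */
static int cleanup_demo(int with_hook)
{
	struct mq_ops_model ops = {
		.cleanup_rq = with_hook ? driver_cleanup : NULL,
	};
	struct request_model rq = { .ops = &ops, .private_freed = 0 };

	cleanup_rq_model(&rq);
	return rq.private_freed;
}
```

Drivers that keep no per-request private data can leave the hook unset and pay only for the NULL check.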

+6

include/linux/blk_types.h

··· 169 169 */ 170 170 struct blkcg_gq *bi_blkg; 171 171 struct bio_issue bi_issue; 172 + #ifdef CONFIG_BLK_CGROUP_IOCOST 173 + u64 bi_iocost_cost; 174 + #endif 172 175 #endif 173 176 union { 174 177 #if defined(CONFIG_BLK_DEV_INTEGRITY) ··· 212 209 BIO_BOUNCED, /* bio is a bounce bio */ 213 210 BIO_USER_MAPPED, /* contains user pages */ 214 211 BIO_NULL_MAPPED, /* contains invalid user pages */ 212 + BIO_WORKINGSET, /* contains userspace workingset pages */ 215 213 BIO_QUIET, /* Make BIO Quiet */ 216 214 BIO_CHAIN, /* chained bio, ->bi_remaining in effect */ 217 215 BIO_REFFED, /* bio has elevated ->bi_cnt */ ··· 286 282 REQ_OP_ZONE_RESET = 6, 287 283 /* write the same sector many times */ 288 284 REQ_OP_WRITE_SAME = 7, 285 + /* reset all the zone present on the device */ 286 + REQ_OP_ZONE_RESET_ALL = 8, 289 287 /* write the zero filled sector many times */ 290 288 REQ_OP_WRITE_ZEROES = 9, 291 289

+49 -24

include/linux/blkdev.h

··· 194 194 195 195 struct gendisk *rq_disk; 196 196 struct hd_struct *part; 197 - /* Time that I/O was submitted to the kernel. */ 197 + #ifdef CONFIG_BLK_RQ_ALLOC_TIME 198 + /* Time that the first bio started allocating this request. */ 199 + u64 alloc_time_ns; 200 + #endif 201 + /* Time that this request was allocated for this IO. */ 198 202 u64 start_time_ns; 199 203 /* Time that I/O was submitted to the device. */ 200 204 u64 io_start_time_ns; ··· 206 202 #ifdef CONFIG_BLK_WBT 207 203 unsigned short wbt_flags; 208 204 #endif 209 - #ifdef CONFIG_BLK_DEV_THROTTLING_LOW 210 - unsigned short throtl_size; 211 - #endif 205 + /* 206 + * rq sectors used for blk stats. It has the same value 207 + * with blk_rq_sectors(rq), except that it never be zeroed 208 + * by completion. 209 + */ 210 + unsigned short stats_sectors; 212 211 213 212 /* 214 213 * Number of scatter-gather DMA addr+len pairs after ··· 398 391 #endif /* CONFIG_BLK_DEV_ZONED */ 399 392 400 393 struct request_queue { 401 - /* 402 - * Together with queue_head for cacheline sharing 403 - */ 404 - struct list_head queue_head; 405 394 struct request *last_merge; 406 395 struct elevator_queue *elevator; 407 396 ··· 499 496 500 497 struct queue_limits limits; 501 498 499 + unsigned int required_elevator_features; 500 + 502 501 #ifdef CONFIG_BLK_DEV_ZONED 503 502 /* 504 503 * Zoned block device information for request dispatch control. 
··· 544 539 struct delayed_work requeue_work; 545 540 546 541 struct mutex sysfs_lock; 542 + struct mutex sysfs_dir_lock; 547 543 548 544 /* 549 545 * for reusing dead hctx instance in case of updating ··· 617 611 #define QUEUE_FLAG_SCSI_PASSTHROUGH 23 /* queue supports SCSI commands */ 618 612 #define QUEUE_FLAG_QUIESCED 24 /* queue has been quiesced */ 619 613 #define QUEUE_FLAG_PCI_P2PDMA 25 /* device supports PCI p2p requests */ 614 + #define QUEUE_FLAG_ZONE_RESETALL 26 /* supports Zone Reset All */ 615 + #define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */ 620 616 621 617 #define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \ 622 618 (1 << QUEUE_FLAG_SAME_COMP)) ··· 638 630 #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags) 639 631 #define blk_queue_add_random(q) test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags) 640 632 #define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags) 633 + #define blk_queue_zone_resetall(q) \ 634 + test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags) 641 635 #define blk_queue_secure_erase(q) \ 642 636 (test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags)) 643 637 #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags) ··· 647 637 test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags) 648 638 #define blk_queue_pci_p2pdma(q) \ 649 639 test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags) 640 + #ifdef CONFIG_BLK_RQ_ALLOC_TIME 641 + #define blk_queue_rq_alloc_time(q) \ 642 + test_bit(QUEUE_FLAG_RQ_ALLOC_TIME, &(q)->queue_flags) 643 + #else 644 + #define blk_queue_rq_alloc_time(q) false 645 + #endif 650 646 651 647 #define blk_noretry_request(rq) \ 652 648 ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \ ··· 660 644 #define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags) 661 645 #define blk_queue_pm_only(q) atomic_read(&(q)->pm_only) 662 646 #define blk_queue_fua(q) test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags) 647 + #define 
blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags) 663 648 664 649 extern void blk_set_pm_only(struct request_queue *q); 665 650 extern void blk_clear_pm_only(struct request_queue *q); ··· 920 903 * blk_rq_err_bytes() : bytes left till the next error boundary 921 904 * blk_rq_sectors() : sectors left in the entire request 922 905 * blk_rq_cur_sectors() : sectors left in the current segment 906 + * blk_rq_stats_sectors() : sectors of the entire request used for stats 923 907 */ 924 908 static inline sector_t blk_rq_pos(const struct request *rq) 925 909 { ··· 947 929 static inline unsigned int blk_rq_cur_sectors(const struct request *rq) 948 930 { 949 931 return blk_rq_cur_bytes(rq) >> SECTOR_SHIFT; 932 + } 933 + 934 + static inline unsigned int blk_rq_stats_sectors(const struct request *rq) 935 + { 936 + return rq->stats_sectors; 950 937 } 951 938 952 939 #ifdef CONFIG_BLK_DEV_ZONED ··· 1108 1085 extern void blk_queue_update_dma_alignment(struct request_queue *, int); 1109 1086 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); 1110 1087 extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua); 1088 + extern void blk_queue_required_elevator_features(struct request_queue *q, 1089 + unsigned int features); 1111 1090 1112 1091 /* 1113 1092 * Number of physical segments as sent to the device. 
··· 1257 1232 BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL, 1258 1233 }; 1259 1234 1260 - static inline unsigned long queue_segment_boundary(struct request_queue *q) 1235 + static inline unsigned long queue_segment_boundary(const struct request_queue *q) 1261 1236 { 1262 1237 return q->limits.seg_boundary_mask; 1263 1238 } 1264 1239 1265 - static inline unsigned long queue_virt_boundary(struct request_queue *q) 1240 + static inline unsigned long queue_virt_boundary(const struct request_queue *q) 1266 1241 { 1267 1242 return q->limits.virt_boundary_mask; 1268 1243 } 1269 1244 1270 - static inline unsigned int queue_max_sectors(struct request_queue *q) 1245 + static inline unsigned int queue_max_sectors(const struct request_queue *q) 1271 1246 { 1272 1247 return q->limits.max_sectors; 1273 1248 } 1274 1249 1275 - static inline unsigned int queue_max_hw_sectors(struct request_queue *q) 1250 + static inline unsigned int queue_max_hw_sectors(const struct request_queue *q) 1276 1251 { 1277 1252 return q->limits.max_hw_sectors; 1278 1253 } 1279 1254 1280 - static inline unsigned short queue_max_segments(struct request_queue *q) 1255 + static inline unsigned short queue_max_segments(const struct request_queue *q) 1281 1256 { 1282 1257 return q->limits.max_segments; 1283 1258 } 1284 1259 1285 - static inline unsigned short queue_max_discard_segments(struct request_queue *q) 1260 + static inline unsigned short queue_max_discard_segments(const struct request_queue *q) 1286 1261 { 1287 1262 return q->limits.max_discard_segments; 1288 1263 } 1289 1264 1290 - static inline unsigned int queue_max_segment_size(struct request_queue *q) 1265 + static inline unsigned int queue_max_segment_size(const struct request_queue *q) 1291 1266 { 1292 1267 return q->limits.max_segment_size; 1293 1268 } 1294 1269 1295 - static inline unsigned short queue_logical_block_size(struct request_queue *q) 1270 + static inline unsigned short queue_logical_block_size(const struct request_queue *q) 1296 1271 { 
1297 1272 int retval = 512; 1298 1273 ··· 1307 1282 return queue_logical_block_size(bdev_get_queue(bdev)); 1308 1283 } 1309 1284 1310 - static inline unsigned int queue_physical_block_size(struct request_queue *q) 1285 + static inline unsigned int queue_physical_block_size(const struct request_queue *q) 1311 1286 { 1312 1287 return q->limits.physical_block_size; 1313 1288 } ··· 1317 1292 return queue_physical_block_size(bdev_get_queue(bdev)); 1318 1293 } 1319 1294 1320 - static inline unsigned int queue_io_min(struct request_queue *q) 1295 + static inline unsigned int queue_io_min(const struct request_queue *q) 1321 1296 { 1322 1297 return q->limits.io_min; 1323 1298 } ··· 1327 1302 return queue_io_min(bdev_get_queue(bdev)); 1328 1303 } 1329 1304 1330 - static inline unsigned int queue_io_opt(struct request_queue *q) 1305 + static inline unsigned int queue_io_opt(const struct request_queue *q) 1331 1306 { 1332 1307 return q->limits.io_opt; 1333 1308 } ··· 1337 1312 return queue_io_opt(bdev_get_queue(bdev)); 1338 1313 } 1339 1314 1340 - static inline int queue_alignment_offset(struct request_queue *q) 1315 + static inline int queue_alignment_offset(const struct request_queue *q) 1341 1316 { 1342 1317 if (q->limits.misaligned) 1343 1318 return -1; ··· 1367 1342 return q->limits.alignment_offset; 1368 1343 } 1369 1344 1370 - static inline int queue_discard_alignment(struct request_queue *q) 1345 + static inline int queue_discard_alignment(const struct request_queue *q) 1371 1346 { 1372 1347 if (q->limits.discard_misaligned) 1373 1348 return -1; ··· 1457 1432 return 0; 1458 1433 } 1459 1434 1460 - static inline int queue_dma_alignment(struct request_queue *q) 1435 + static inline int queue_dma_alignment(const struct request_queue *q) 1461 1436 { 1462 1437 return q ? 
q->dma_alignment : 511; 1463 1438 } ··· 1568 1543 } 1569 1544 1570 1545 static inline unsigned short 1571 - queue_max_integrity_segments(struct request_queue *q) 1546 + queue_max_integrity_segments(const struct request_queue *q) 1572 1547 { 1573 1548 return q->limits.max_integrity_segments; 1574 1549 } ··· 1651 1626 unsigned int segs) 1652 1627 { 1653 1628 } 1654 - static inline unsigned short queue_max_integrity_segments(struct request_queue *q) 1629 + static inline unsigned short queue_max_integrity_segments(const struct request_queue *q) 1655 1630 { 1656 1631 return 0; 1657 1632 }
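Note on the hunk above: the two new queue flags are tested with test_bit(), and blk_queue_rq_alloc_time() collapses to the constant false when CONFIG_BLK_RQ_ALLOC_TIME is off. The sketch below reproduces that shape in userspace. queue_model, queue_flag_test and MODEL_BLK_RQ_ALLOC_TIME are hypothetical names standing in for the request queue, test_bit() and the Kconfig symbol.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_FLAG_ZONE_RESETALL 26	/* values taken from the hunk */
#define QUEUE_FLAG_RQ_ALLOC_TIME 27

/* Hypothetical stand-in for the flags word of struct request_queue. */
struct queue_model {
	unsigned long queue_flags;
};

/* Simplified, non-atomic replacement for test_bit(). */
static bool queue_flag_test(const struct queue_model *q, unsigned int bit)
{
	assert(bit < 8 * sizeof(q->queue_flags));
	return (q->queue_flags >> bit) & 1UL;
}

/* With the option off, the check is a compile-time constant. */
#ifdef MODEL_BLK_RQ_ALLOC_TIME
#define model_queue_rq_alloc_time(q) \
	queue_flag_test((q), QUEUE_FLAG_RQ_ALLOC_TIME)
#else
#define model_queue_rq_alloc_time(q) false
#endif

/* Bit 1 of the result: Zone Reset All supported; bit 0: alloc time recorded. */
static int queue_flag_demo(void)
{
	struct queue_model q = {
		.queue_flags = 1UL << QUEUE_FLAG_ZONE_RESETALL,
	};
	int supports_reset_all = queue_flag_test(&q, QUEUE_FLAG_ZONE_RESETALL);
	int records_alloc_time = model_queue_rq_alloc_time(&q);

	return supports_reset_all * 2 + records_alloc_time;
}
```

Because the disabled variant is a constant, callers can write the check unconditionally and let the compiler discard the dead branch.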

+8

include/linux/elevator.h

··· 76 76 struct elv_fs_entry *elevator_attrs; 77 77 const char *elevator_name; 78 78 const char *elevator_alias; 79 + const unsigned int elevator_features; 79 80 struct module *elevator_owner; 80 81 #ifdef CONFIG_BLK_DEBUG_FS 81 82 const struct blk_mq_debugfs_attr *queue_debugfs_attrs; ··· 165 164 166 165 #define rq_entry_fifo(ptr) list_entry((ptr), struct request, queuelist) 167 166 #define rq_fifo_clear(rq) list_del_init(&(rq)->queuelist) 167 + 168 + /* 169 + * Elevator features. 170 + */ 171 + 172 + /* Supports zoned block devices sequential write constraint */ 173 + #define ELEVATOR_F_ZBD_SEQ_WRITE (1U << 0) 168 174 169 175 #endif /* CONFIG_BLOCK */ 170 176 #endif
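Note on the hunk above: elevator_features on the scheduler side pairs with required_elevator_features on the queue side (see the blkdev.h hunk). How the two are compared is not shown in these headers. The sketch below assumes the natural subset test, where a scheduler qualifies only if it advertises every required bit. elevator_supports and MODEL_F_OTHER are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* From the hunk: supports the zoned sequential write constraint. */
#define ELEVATOR_F_ZBD_SEQ_WRITE (1U << 0)
/* Hypothetical second feature bit, for illustration only. */
#define MODEL_F_OTHER (1U << 1)

/* Assumed rule: every required bit must be present in the elevator's set. */
static bool elevator_supports(unsigned int elevator_features,
			      unsigned int required_features)
{
	return (elevator_features & required_features) == required_features;
}

/* Returns 1 when all the expected cases hold. */
static int elevator_features_demo(void)
{
	/* A queue with no requirements accepts any scheduler. */
	assert(elevator_supports(0, 0));
	/* A zoned queue rejects a scheduler lacking the feature. */
	assert(!elevator_supports(MODEL_F_OTHER, ELEVATOR_F_ZBD_SEQ_WRITE));
	/* Extra features on the scheduler side do not hurt. */
	assert(elevator_supports(ELEVATOR_F_ZBD_SEQ_WRITE | MODEL_F_OTHER,
				 ELEVATOR_F_ZBD_SEQ_WRITE));
	return 1;
}
```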

+3 -5

include/linux/lightnvm.h

··· 88 88 typedef int (nvm_op_set_bb_fn)(struct nvm_dev *, struct ppa_addr *, int, int); 89 89 typedef int (nvm_get_chk_meta_fn)(struct nvm_dev *, sector_t, int, 90 90 struct nvm_chk_meta *); 91 - typedef int (nvm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *); 92 - typedef int (nvm_submit_io_sync_fn)(struct nvm_dev *, struct nvm_rq *); 91 + typedef int (nvm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *, void *); 93 92 typedef void *(nvm_create_dma_pool_fn)(struct nvm_dev *, char *, int); 94 93 typedef void (nvm_destroy_dma_pool_fn)(void *); 95 94 typedef void *(nvm_dev_dma_alloc_fn)(struct nvm_dev *, void *, gfp_t, ··· 103 104 nvm_get_chk_meta_fn *get_chk_meta; 104 105 105 106 nvm_submit_io_fn *submit_io; 106 - nvm_submit_io_sync_fn *submit_io_sync; 107 107 108 108 nvm_create_dma_pool_fn *create_dma_pool; 109 109 nvm_destroy_dma_pool_fn *destroy_dma_pool; ··· 680 682 int, struct nvm_chk_meta *); 681 683 extern int nvm_set_chunk_meta(struct nvm_tgt_dev *, struct ppa_addr *, 682 684 int, int); 683 - extern int nvm_submit_io(struct nvm_tgt_dev *, struct nvm_rq *); 684 - extern int nvm_submit_io_sync(struct nvm_tgt_dev *, struct nvm_rq *); 685 + extern int nvm_submit_io(struct nvm_tgt_dev *, struct nvm_rq *, void *); 686 + extern int nvm_submit_io_sync(struct nvm_tgt_dev *, struct nvm_rq *, void *); 685 687 extern void nvm_end_io(struct nvm_rq *); 686 688 687 689 #else /* CONFIG_NVM */

+39

include/linux/memcontrol.h

··· 184 184 #endif 185 185 186 186 /* 187 + * Remember four most recent foreign writebacks with dirty pages in this 188 + * cgroup. Inode sharing is expected to be uncommon and, even if we miss 189 + * one in a given round, we're likely to catch it later if it keeps 190 + * foreign-dirtying, so a fairly low count should be enough. 191 + * 192 + * See mem_cgroup_track_foreign_dirty_slowpath() for details. 193 + */ 194 + #define MEMCG_CGWB_FRN_CNT 4 195 + 196 + struct memcg_cgwb_frn { 197 + u64 bdi_id; /* bdi->id of the foreign inode */ 198 + int memcg_id; /* memcg->css.id of foreign inode */ 199 + u64 at; /* jiffies_64 at the time of dirtying */ 200 + struct wb_completion done; /* tracks in-flight foreign writebacks */ 201 + }; 202 + 203 + /* 187 204 * The memory controller data structure. The memory controller controls both 188 205 * page cache and RSS per cgroup. We would eventually like to provide 189 206 * statistics based on the statistics developed by Rik Van Riel for clock-pro, ··· 324 307 #ifdef CONFIG_CGROUP_WRITEBACK 325 308 struct list_head cgwb_list; 326 309 struct wb_domain cgwb_domain; 310 + struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT]; 327 311 #endif 328 312 329 313 /* List of events which userspace want to receive */ ··· 1255 1237 unsigned long *pheadroom, unsigned long *pdirty, 1256 1238 unsigned long *pwriteback); 1257 1239 1240 + void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, 1241 + struct bdi_writeback *wb); 1242 + 1243 + static inline void mem_cgroup_track_foreign_dirty(struct page *page, 1244 + struct bdi_writeback *wb) 1245 + { 1246 + if (unlikely(&page->mem_cgroup->css != wb->memcg_css)) 1247 + mem_cgroup_track_foreign_dirty_slowpath(page, wb); 1248 + } 1249 + 1250 + void mem_cgroup_flush_foreign(struct bdi_writeback *wb); 1251 + 1258 1252 #else /* CONFIG_CGROUP_WRITEBACK */ 1259 1253 1260 1254 static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) ··· 1279 1249 unsigned long *pheadroom, 1280 
1250 unsigned long *pdirty, 1281 1251 unsigned long *pwriteback) 1252 + { 1253 + } 1254 + 1255 + static inline void mem_cgroup_track_foreign_dirty(struct page *page, 1256 + struct bdi_writeback *wb) 1257 + { 1258 + } 1259 + 1260 + static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb) 1282 1261 { 1283 1262 } 1284 1263
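Note on the hunk above: the comment says each memcg remembers the four most recent foreign writebacks. The following is a simplified userspace sketch of such a small "most recent N" table, assuming a match-or-replace-oldest policy. frn_model and frn_track are hypothetical names. The real slow path may apply further conditions (the `done` field suggests in-flight writebacks are taken into account), and this model ignores them.

```c
#include <assert.h>
#include <stdint.h>

#define MEMCG_CGWB_FRN_CNT 4	/* value taken from the hunk */

/* Hypothetical stand-in for struct memcg_cgwb_frn, without ->done. */
struct frn_model {
	uint64_t bdi_id;
	int memcg_id;
	uint64_t at;	/* time of the most recent foreign dirtying */
};

/*
 * Record (bdi_id, memcg_id) at time @now. A matching slot is refreshed,
 * otherwise the slot with the smallest timestamp is overwritten.
 * Returns the index of the slot used. Note that an all-zero slot would
 * match the pair (0, 0); the demo below avoids that pair.
 */
static int frn_track(struct frn_model *frn, uint64_t bdi_id, int memcg_id,
		     uint64_t now)
{
	int i, oldest = 0;

	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
		if (frn[i].bdi_id == bdi_id && frn[i].memcg_id == memcg_id) {
			frn[i].at = now;
			return i;
		}
		if (frn[i].at < frn[oldest].at)
			oldest = i;
	}

	frn[oldest].bdi_id = bdi_id;
	frn[oldest].memcg_id = memcg_id;
	frn[oldest].at = now;
	return oldest;
}

/* Fill the table, refresh the first entry, then force one eviction. */
static int frn_demo(void)
{
	struct frn_model frn[MEMCG_CGWB_FRN_CNT] = { { 0, 0, 0 } };

	assert(frn_track(frn, 1, 10, 100) == 0);
	assert(frn_track(frn, 1, 11, 101) == 1);
	assert(frn_track(frn, 1, 12, 102) == 2);
	assert(frn_track(frn, 1, 13, 103) == 3);
	assert(frn_track(frn, 1, 10, 104) == 0);	/* refresh, no eviction */

	/* The new pair replaces slot 1, which now holds the oldest stamp. */
	return frn_track(frn, 1, 14, 105);
}
```

A table this small keeps the per-memcg cost fixed, and as the source comment argues, a foreign dirtier that is missed in one round is likely to be caught in a later one.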

+4 -1

include/linux/nvme.h

··· 140 140 * Submission and Completion Queue Entry Sizes for the NVM command set. 141 141 * (In bytes and specified as a power of two (2^n)). 142 142 */ 143 + #define NVME_ADM_SQES 6 143 144 #define NVME_NVM_IOSQES 6 144 145 #define NVME_NVM_IOCQES 4 145 146 ··· 815 814 nvme_admin_security_send = 0x81, 816 815 nvme_admin_security_recv = 0x82, 817 816 nvme_admin_sanitize_nvm = 0x84, 817 + nvme_admin_get_lba_status = 0x86, 818 818 }; 819 819 820 820 #define nvme_admin_opcode_name(opcode) { opcode, #opcode } ··· 842 840 nvme_admin_opcode_name(nvme_admin_format_nvm), \ 843 841 nvme_admin_opcode_name(nvme_admin_security_send), \ 844 842 nvme_admin_opcode_name(nvme_admin_security_recv), \ 845 - nvme_admin_opcode_name(nvme_admin_sanitize_nvm)) 843 + nvme_admin_opcode_name(nvme_admin_sanitize_nvm), \ 844 + nvme_admin_opcode_name(nvme_admin_get_lba_status)) 846 845 847 846 enum { 848 847 NVME_QUEUE_PHYS_CONTIG = (1 << 0),
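Note on the hunk above: the comment states that these entry sizes are given in bytes as a power of two (2^n). A short worked example of that arithmetic follows. nvme_entry_bytes is a hypothetical helper, not part of the header.

```c
#include <assert.h>

/* Exponents as defined in the hunk. */
#define NVME_ADM_SQES   6
#define NVME_NVM_IOSQES 6
#define NVME_NVM_IOCQES 4

/* Hypothetical helper: convert a log2 entry size to bytes. */
static unsigned int nvme_entry_bytes(unsigned int log2_size)
{
	assert(log2_size < 32);
	return 1U << log2_size;
}
```

With these values, the admin and I/O submission queue entries come out at 2^6 = 64 bytes and the I/O completion queue entries at 2^4 = 16 bytes.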

+2

include/linux/writeback.h

··· 217 217 void wbc_detach_inode(struct writeback_control *wbc); 218 218 void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page, 219 219 size_t bytes); 220 + int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages, 221 + enum wb_reason reason, struct wb_completion *done); 220 222 void cgroup_writeback_umount(void); 221 223 222 224 /**

+178

include/trace/events/iocost.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #undef TRACE_SYSTEM 3 + #define TRACE_SYSTEM iocost 4 + 5 + struct ioc; 6 + struct ioc_now; 7 + struct ioc_gq; 8 + 9 + #if !defined(_TRACE_BLK_IOCOST_H) || defined(TRACE_HEADER_MULTI_READ) 10 + #define _TRACE_BLK_IOCOST_H 11 + 12 + #include <linux/tracepoint.h> 13 + 14 + TRACE_EVENT(iocost_iocg_activate, 15 + 16 + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, 17 + u64 last_period, u64 cur_period, u64 vtime), 18 + 19 + TP_ARGS(iocg, path, now, last_period, cur_period, vtime), 20 + 21 + TP_STRUCT__entry ( 22 + __string(devname, ioc_name(iocg->ioc)) 23 + __string(cgroup, path) 24 + __field(u64, now) 25 + __field(u64, vnow) 26 + __field(u64, vrate) 27 + __field(u64, last_period) 28 + __field(u64, cur_period) 29 + __field(u64, last_vtime) 30 + __field(u64, vtime) 31 + __field(u32, weight) 32 + __field(u32, inuse) 33 + __field(u64, hweight_active) 34 + __field(u64, hweight_inuse) 35 + ), 36 + 37 + TP_fast_assign( 38 + __assign_str(devname, ioc_name(iocg->ioc)); 39 + __assign_str(cgroup, path); 40 + __entry->now = now->now; 41 + __entry->vnow = now->vnow; 42 + __entry->vrate = now->vrate; 43 + __entry->last_period = last_period; 44 + __entry->cur_period = cur_period; 45 + __entry->last_vtime = iocg->last_vtime; 46 + __entry->vtime = vtime; 47 + __entry->weight = iocg->weight; 48 + __entry->inuse = iocg->inuse; 49 + __entry->hweight_active = iocg->hweight_active; 50 + __entry->hweight_inuse = iocg->hweight_inuse; 51 + ), 52 + 53 + TP_printk("[%s:%s] now=%llu:%llu vrate=%llu " 54 + "period=%llu->%llu vtime=%llu->%llu " 55 + "weight=%u/%u hweight=%llu/%llu", 56 + __get_str(devname), __get_str(cgroup), 57 + __entry->now, __entry->vnow, __entry->vrate, 58 + __entry->last_period, __entry->cur_period, 59 + __entry->last_vtime, __entry->vtime, 60 + __entry->inuse, __entry->weight, 61 + __entry->hweight_inuse, __entry->hweight_active 62 + ) 63 + ); 64 + 65 + DECLARE_EVENT_CLASS(iocg_inuse_update, 66 + 67 + 
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, 68 + u32 old_inuse, u32 new_inuse, 69 + u64 old_hw_inuse, u64 new_hw_inuse), 70 + 71 + TP_ARGS(iocg, path, now, old_inuse, new_inuse, 72 + old_hw_inuse, new_hw_inuse), 73 + 74 + TP_STRUCT__entry ( 75 + __string(devname, ioc_name(iocg->ioc)) 76 + __string(cgroup, path) 77 + __field(u64, now) 78 + __field(u32, old_inuse) 79 + __field(u32, new_inuse) 80 + __field(u64, old_hweight_inuse) 81 + __field(u64, new_hweight_inuse) 82 + ), 83 + 84 + TP_fast_assign( 85 + __assign_str(devname, ioc_name(iocg->ioc)); 86 + __assign_str(cgroup, path); 87 + __entry->now = now->now; 88 + __entry->old_inuse = old_inuse; 89 + __entry->new_inuse = new_inuse; 90 + __entry->old_hweight_inuse = old_hw_inuse; 91 + __entry->new_hweight_inuse = new_hw_inuse; 92 + ), 93 + 94 + TP_printk("[%s:%s] now=%llu inuse=%u->%u hw_inuse=%llu->%llu", 95 + __get_str(devname), __get_str(cgroup), __entry->now, 96 + __entry->old_inuse, __entry->new_inuse, 97 + __entry->old_hweight_inuse, __entry->new_hweight_inuse 98 + ) 99 + ); 100 + 101 + DEFINE_EVENT(iocg_inuse_update, iocost_inuse_takeback, 102 + 103 + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, 104 + u32 old_inuse, u32 new_inuse, 105 + u64 old_hw_inuse, u64 new_hw_inuse), 106 + 107 + TP_ARGS(iocg, path, now, old_inuse, new_inuse, 108 + old_hw_inuse, new_hw_inuse) 109 + ); 110 + 111 + DEFINE_EVENT(iocg_inuse_update, iocost_inuse_giveaway, 112 + 113 + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, 114 + u32 old_inuse, u32 new_inuse, 115 + u64 old_hw_inuse, u64 new_hw_inuse), 116 + 117 + TP_ARGS(iocg, path, now, old_inuse, new_inuse, 118 + old_hw_inuse, new_hw_inuse) 119 + ); 120 + 121 + DEFINE_EVENT(iocg_inuse_update, iocost_inuse_reset, 122 + 123 + TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now, 124 + u32 old_inuse, u32 new_inuse, 125 + u64 old_hw_inuse, u64 new_hw_inuse), 126 + 127 + TP_ARGS(iocg, path, now, old_inuse, 
new_inuse, 128 + old_hw_inuse, new_hw_inuse) 129 + ); 130 + 131 + TRACE_EVENT(iocost_ioc_vrate_adj, 132 + 133 + TP_PROTO(struct ioc *ioc, u64 new_vrate, u32 (*missed_ppm)[2], 134 + u32 rq_wait_pct, int nr_lagging, int nr_shortages, 135 + int nr_surpluses), 136 + 137 + TP_ARGS(ioc, new_vrate, missed_ppm, rq_wait_pct, nr_lagging, nr_shortages, 138 + nr_surpluses), 139 + 140 + TP_STRUCT__entry ( 141 + __string(devname, ioc_name(ioc)) 142 + __field(u64, old_vrate) 143 + __field(u64, new_vrate) 144 + __field(int, busy_level) 145 + __field(u32, read_missed_ppm) 146 + __field(u32, write_missed_ppm) 147 + __field(u32, rq_wait_pct) 148 + __field(int, nr_lagging) 149 + __field(int, nr_shortages) 150 + __field(int, nr_surpluses) 151 + ), 152 + 153 + TP_fast_assign( 154 + __assign_str(devname, ioc_name(ioc)); 155 + __entry->old_vrate = atomic64_read(&ioc->vtime_rate);; 156 + __entry->new_vrate = new_vrate; 157 + __entry->busy_level = ioc->busy_level; 158 + __entry->read_missed_ppm = (*missed_ppm)[READ]; 159 + __entry->write_missed_ppm = (*missed_ppm)[WRITE]; 160 + __entry->rq_wait_pct = rq_wait_pct; 161 + __entry->nr_lagging = nr_lagging; 162 + __entry->nr_shortages = nr_shortages; 163 + __entry->nr_surpluses = nr_surpluses; 164 + ), 165 + 166 + TP_printk("[%s] vrate=%llu->%llu busy=%d missed_ppm=%u:%u rq_wait_pct=%u lagging=%d shortages=%d surpluses=%d", 167 + __get_str(devname), __entry->old_vrate, __entry->new_vrate, 168 + __entry->busy_level, 169 + __entry->read_missed_ppm, __entry->write_missed_ppm, 170 + __entry->rq_wait_pct, __entry->nr_lagging, __entry->nr_shortages, 171 + __entry->nr_surpluses 172 + ) 173 + ); 174 + 175 + #endif /* _TRACE_BLK_IOCOST_H */ 176 + 177 + /* This part must be outside protection */ 178 + #include <trace/define_trace.h>

+126

include/trace/events/writeback.h

··· 176 176 #endif /* CONFIG_CGROUP_WRITEBACK */ 177 177 #endif /* CREATE_TRACE_POINTS */ 178 178 179 + #ifdef CONFIG_CGROUP_WRITEBACK 180 + TRACE_EVENT(inode_foreign_history, 181 + 182 + TP_PROTO(struct inode *inode, struct writeback_control *wbc, 183 + unsigned int history), 184 + 185 + TP_ARGS(inode, wbc, history), 186 + 187 + TP_STRUCT__entry( 188 + __array(char, name, 32) 189 + __field(unsigned long, ino) 190 + __field(unsigned int, cgroup_ino) 191 + __field(unsigned int, history) 192 + ), 193 + 194 + TP_fast_assign( 195 + strncpy(__entry->name, dev_name(inode_to_bdi(inode)->dev), 32); 196 + __entry->ino = inode->i_ino; 197 + __entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc); 198 + __entry->history = history; 199 + ), 200 + 201 + TP_printk("bdi %s: ino=%lu cgroup_ino=%u history=0x%x", 202 + __entry->name, 203 + __entry->ino, 204 + __entry->cgroup_ino, 205 + __entry->history 206 + ) 207 + ); 208 + 209 + TRACE_EVENT(inode_switch_wbs, 210 + 211 + TP_PROTO(struct inode *inode, struct bdi_writeback *old_wb, 212 + struct bdi_writeback *new_wb), 213 + 214 + TP_ARGS(inode, old_wb, new_wb), 215 + 216 + TP_STRUCT__entry( 217 + __array(char, name, 32) 218 + __field(unsigned long, ino) 219 + __field(unsigned int, old_cgroup_ino) 220 + __field(unsigned int, new_cgroup_ino) 221 + ), 222 + 223 + TP_fast_assign( 224 + strncpy(__entry->name, dev_name(old_wb->bdi->dev), 32); 225 + __entry->ino = inode->i_ino; 226 + __entry->old_cgroup_ino = __trace_wb_assign_cgroup(old_wb); 227 + __entry->new_cgroup_ino = __trace_wb_assign_cgroup(new_wb); 228 + ), 229 + 230 + TP_printk("bdi %s: ino=%lu old_cgroup_ino=%u new_cgroup_ino=%u", 231 + __entry->name, 232 + __entry->ino, 233 + __entry->old_cgroup_ino, 234 + __entry->new_cgroup_ino 235 + ) 236 + ); 237 + 238 + TRACE_EVENT(track_foreign_dirty, 239 + 240 + TP_PROTO(struct page *page, struct bdi_writeback *wb), 241 + 242 + TP_ARGS(page, wb), 243 + 244 + TP_STRUCT__entry( 245 + __array(char, name, 32) 246 + __field(u64, bdi_id) 247 + 
__field(unsigned long, ino) 248 + __field(unsigned int, memcg_id) 249 + __field(unsigned int, cgroup_ino) 250 + __field(unsigned int, page_cgroup_ino) 251 + ), 252 + 253 + TP_fast_assign( 254 + struct address_space *mapping = page_mapping(page); 255 + struct inode *inode = mapping ? mapping->host : NULL; 256 + 257 + strncpy(__entry->name, dev_name(wb->bdi->dev), 32); 258 + __entry->bdi_id = wb->bdi->id; 259 + __entry->ino = inode ? inode->i_ino : 0; 260 + __entry->memcg_id = wb->memcg_css->id; 261 + __entry->cgroup_ino = __trace_wb_assign_cgroup(wb); 262 + __entry->page_cgroup_ino = page->mem_cgroup->css.cgroup->kn->id.ino; 263 + ), 264 + 265 + TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%u page_cgroup_ino=%u", 266 + __entry->name, 267 + __entry->bdi_id, 268 + __entry->ino, 269 + __entry->memcg_id, 270 + __entry->cgroup_ino, 271 + __entry->page_cgroup_ino 272 + ) 273 + ); 274 + 275 + TRACE_EVENT(flush_foreign, 276 + 277 + TP_PROTO(struct bdi_writeback *wb, unsigned int frn_bdi_id, 278 + unsigned int frn_memcg_id), 279 + 280 + TP_ARGS(wb, frn_bdi_id, frn_memcg_id), 281 + 282 + TP_STRUCT__entry( 283 + __array(char, name, 32) 284 + __field(unsigned int, cgroup_ino) 285 + __field(unsigned int, frn_bdi_id) 286 + __field(unsigned int, frn_memcg_id) 287 + ), 288 + 289 + TP_fast_assign( 290 + strncpy(__entry->name, dev_name(wb->bdi->dev), 32); 291 + __entry->cgroup_ino = __trace_wb_assign_cgroup(wb); 292 + __entry->frn_bdi_id = frn_bdi_id; 293 + __entry->frn_memcg_id = frn_memcg_id; 294 + ), 295 + 296 + TP_printk("bdi %s: cgroup_ino=%u frn_bdi_id=%u frn_memcg_id=%u", 297 + __entry->name, 298 + __entry->cgroup_ino, 299 + __entry->frn_bdi_id, 300 + __entry->frn_memcg_id 301 + ) 302 + ); 303 + #endif 304 + 179 305 DECLARE_EVENT_CLASS(writeback_write_inode_template, 180 306 181 307 TP_PROTO(struct inode *inode, struct writeback_control *wbc),

+2

include/uapi/linux/raid/md_p.h

··· 329 329 #define MD_FEATURE_JOURNAL 512 /* support write cache */ 330 330 #define MD_FEATURE_PPL 1024 /* support PPL */ 331 331 #define MD_FEATURE_MULTIPLE_PPLS 2048 /* support for multiple PPLs */ 332 + #define MD_FEATURE_RAID0_LAYOUT 4096 /* layout is meaningful for RAID0 */ 332 333 #define MD_FEATURE_ALL (MD_FEATURE_BITMAP_OFFSET \ 333 334 |MD_FEATURE_RECOVERY_OFFSET \ 334 335 |MD_FEATURE_RESHAPE_ACTIVE \ ··· 342 341 |MD_FEATURE_JOURNAL \ 343 342 |MD_FEATURE_PPL \ 344 343 |MD_FEATURE_MULTIPLE_PPLS \ 344 + |MD_FEATURE_RAID0_LAYOUT \ 345 345 ) 346 346 347 347 struct r5l_payload_header {
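Note on the hunk above: adding MD_FEATURE_RAID0_LAYOUT to MD_FEATURE_ALL matters wherever a feature map is checked against the set of known bits. The sketch below shows that kind of check, assuming a "reject unknown bits" rule. md_unknown_features and the MODEL_KNOWN_* masks are hypothetical, and the masks cover only the four flags visible in the hunk, not the full MD_FEATURE_ALL list.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the hunk. */
#define MD_FEATURE_JOURNAL		512
#define MD_FEATURE_PPL			1024
#define MD_FEATURE_MULTIPLE_PPLS	2048
#define MD_FEATURE_RAID0_LAYOUT		4096

/* Hypothetical partial masks: before and after the new flag is added. */
#define MODEL_KNOWN_OLD (MD_FEATURE_JOURNAL | MD_FEATURE_PPL | \
			 MD_FEATURE_MULTIPLE_PPLS)
#define MODEL_KNOWN_NEW (MODEL_KNOWN_OLD | MD_FEATURE_RAID0_LAYOUT)

/* Returns the bits of @feature_map that are not in @known. */
static uint32_t md_unknown_features(uint32_t feature_map, uint32_t known)
{
	return feature_map & ~known;
}

/* Returns 1 when the old mask rejects the new flag and the new one accepts it. */
static int md_feature_demo(void)
{
	uint32_t map = MD_FEATURE_PPL | MD_FEATURE_RAID0_LAYOUT;

	return md_unknown_features(map, MODEL_KNOWN_OLD) ==
		       MD_FEATURE_RAID0_LAYOUT &&
	       md_unknown_features(map, MODEL_KNOWN_NEW) == 0;
}
```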

+7 -5

lib/sg_split.c

··· 176 176 * The order of these 3 calls is important and should be kept. 177 177 */ 178 178 sg_split_phys(splitters, nb_splits); 179 - ret = sg_calculate_split(in, in_mapped_nents, nb_splits, skip, 180 - split_sizes, splitters, true); 181 - if (ret < 0) 182 - goto err; 183 - sg_split_mapped(splitters, nb_splits); 179 + if (in_mapped_nents) { 180 + ret = sg_calculate_split(in, in_mapped_nents, nb_splits, skip, 181 + split_sizes, splitters, true); 182 + if (ret < 0) 183 + goto err; 184 + sg_split_mapped(splitters, nb_splits); 185 + } 184 186 185 187 for (i = 0; i < nb_splits; i++) { 186 188 out[i] = splitters[i].out_sg;

+100 -20

mm/backing-dev.c

··· 1 1 // SPDX-License-Identifier: GPL-2.0-only 2 2 3 3 #include <linux/wait.h> 4 + #include <linux/rbtree.h> 4 5 #include <linux/backing-dev.h> 5 6 #include <linux/kthread.h> 6 7 #include <linux/freezer.h> ··· 23 22 static struct class *bdi_class; 24 23 25 24 /* 26 - * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side 27 - * locking. 25 + * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU 26 + * reader side locking. 28 27 */ 29 28 DEFINE_SPINLOCK(bdi_lock); 29 + static u64 bdi_id_cursor; 30 + static struct rb_root bdi_tree = RB_ROOT; 30 31 LIST_HEAD(bdi_list); 31 32 32 33 /* bdi_wq serves all asynchronous writeback tasks */ ··· 618 615 } 619 616 620 617 /** 621 - * wb_get_create - get wb for a given memcg, create if necessary 618 + * wb_get_lookup - get wb for a given memcg 622 619 * @bdi: target bdi 623 620 * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) 624 - * @gfp: allocation mask to use 625 621 * 626 - * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to 627 - * create one. The returned wb has its refcount incremented. 622 + * Try to get the wb for @memcg_css on @bdi. The returned wb has its 623 + * refcount incremented. 628 624 * 629 625 * This function uses css_get() on @memcg_css and thus expects its refcnt 630 626 * to be positive on invocation. IOW, rcu_read_lock() protection on ··· 640 638 * each lookup. On mismatch, the existing wb is discarded and a new one is 641 639 * created. 
642 640 */ 641 + struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, 642 + struct cgroup_subsys_state *memcg_css) 643 + { 644 + struct bdi_writeback *wb; 645 + 646 + if (!memcg_css->parent) 647 + return &bdi->wb; 648 + 649 + rcu_read_lock(); 650 + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); 651 + if (wb) { 652 + struct cgroup_subsys_state *blkcg_css; 653 + 654 + /* see whether the blkcg association has changed */ 655 + blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys); 656 + if (unlikely(wb->blkcg_css != blkcg_css || !wb_tryget(wb))) 657 + wb = NULL; 658 + css_put(blkcg_css); 659 + } 660 + rcu_read_unlock(); 661 + 662 + return wb; 663 + } 664 + 665 + /** 666 + * wb_get_create - get wb for a given memcg, create if necessary 667 + * @bdi: target bdi 668 + * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) 669 + * @gfp: allocation mask to use 670 + * 671 + * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to 672 + * create one. See wb_get_lookup() for more details. 
673 + */ 643 674 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, 644 675 struct cgroup_subsys_state *memcg_css, 645 676 gfp_t gfp) ··· 685 650 return &bdi->wb; 686 651 687 652 do { 688 - rcu_read_lock(); 689 - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); 690 - if (wb) { 691 - struct cgroup_subsys_state *blkcg_css; 692 - 693 - /* see whether the blkcg association has changed */ 694 - blkcg_css = cgroup_get_e_css(memcg_css->cgroup, 695 - &io_cgrp_subsys); 696 - if (unlikely(wb->blkcg_css != blkcg_css || 697 - !wb_tryget(wb))) 698 - wb = NULL; 699 - css_put(blkcg_css); 700 - } 701 - rcu_read_unlock(); 653 + wb = wb_get_lookup(bdi, memcg_css); 702 654 } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); 703 655 704 656 return wb; ··· 881 859 } 882 860 EXPORT_SYMBOL(bdi_alloc_node); 883 861 862 + static struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp) 863 + { 864 + struct rb_node **p = &bdi_tree.rb_node; 865 + struct rb_node *parent = NULL; 866 + struct backing_dev_info *bdi; 867 + 868 + lockdep_assert_held(&bdi_lock); 869 + 870 + while (*p) { 871 + parent = *p; 872 + bdi = rb_entry(parent, struct backing_dev_info, rb_node); 873 + 874 + if (bdi->id > id) 875 + p = &(*p)->rb_left; 876 + else if (bdi->id < id) 877 + p = &(*p)->rb_right; 878 + else 879 + break; 880 + } 881 + 882 + if (parentp) 883 + *parentp = parent; 884 + return p; 885 + } 886 + 887 + /** 888 + * bdi_get_by_id - lookup and get bdi from its id 889 + * @id: bdi id to lookup 890 + * 891 + * Find bdi matching @id and get it. Returns NULL if the matching bdi 892 + * doesn't exist or is already unregistered. 
893 + */ 894 + struct backing_dev_info *bdi_get_by_id(u64 id) 895 + { 896 + struct backing_dev_info *bdi = NULL; 897 + struct rb_node **p; 898 + 899 + spin_lock_bh(&bdi_lock); 900 + p = bdi_lookup_rb_node(id, NULL); 901 + if (*p) { 902 + bdi = rb_entry(*p, struct backing_dev_info, rb_node); 903 + bdi_get(bdi); 904 + } 905 + spin_unlock_bh(&bdi_lock); 906 + 907 + return bdi; 908 + } 909 + 884 910 int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) 885 911 { 886 912 struct device *dev; 913 + struct rb_node *parent, **p; 887 914 888 915 if (bdi->dev) /* The driver needs to use separate queues per device */ 889 916 return 0; ··· 948 877 set_bit(WB_registered, &bdi->wb.state); 949 878 950 879 spin_lock_bh(&bdi_lock); 880 + 881 + bdi->id = ++bdi_id_cursor; 882 + 883 + p = bdi_lookup_rb_node(bdi->id, &parent); 884 + rb_link_node(&bdi->rb_node, parent, p); 885 + rb_insert_color(&bdi->rb_node, &bdi_tree); 886 + 951 887 list_add_tail_rcu(&bdi->bdi_list, &bdi_list); 888 + 952 889 spin_unlock_bh(&bdi_lock); 953 890 954 891 trace_writeback_bdi_register(bdi); ··· 997 918 static void bdi_remove_from_list(struct backing_dev_info *bdi) 998 919 { 999 920 spin_lock_bh(&bdi_lock); 921 + rb_erase(&bdi->rb_node, &bdi_tree); 1000 922 list_del_rcu(&bdi->bdi_list); 1001 923 spin_unlock_bh(&bdi_lock); 1002 924
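The ID-keyed lookup added to mm/backing-dev.c above can be sketched in Python. This is a minimal illustration and not the kernel code: it uses a plain unbalanced binary search tree where the kernel uses a red-black tree, it has no locking or reference counting, and the names `IdTree`, `Node`, `register` and `get_by_id` are made up for the example. It shows the two ideas the diff relies on: IDs come from a monotonically increasing cursor and are never reused, and one lookup helper serves both search and insertion by reporting where a missing key would be linked.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


class IdTree:
    """Unbalanced BST keyed by id; a simplified stand-in for the rbtree."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None
        self.cursor = 0  # plays the role of bdi_id_cursor: ids only grow

    def _lookup(self, key: int) -> Tuple[Optional[Node], Optional[Node], str]:
        """Return (match, parent, side); match is None when key is absent."""
        parent, side, cur = None, "", self.root
        while cur is not None and cur.key != key:
            parent = cur
            side = "left" if cur.key > key else "right"
            cur = getattr(cur, side)
        return cur, parent, side

    def register(self) -> Node:
        """Allocate the next id and link a node at the slot _lookup found."""
        self.cursor += 1
        node = Node(self.cursor)
        _, parent, side = self._lookup(node.key)
        if parent is None:
            self.root = node
        else:
            setattr(parent, side, node)
        return node

    def get_by_id(self, key: int) -> Optional[Node]:
        """Return the node for key, or None if no such id is registered."""
        return self._lookup(key)[0]
```

Because the keys are handed out in increasing order, this unbalanced sketch degenerates into a right-leaning chain; the rebalancing done by `rb_insert_color()` in the real code is what keeps lookups logarithmic.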

+139

mm/memcontrol.c

··· 87 87 #define do_swap_account 0 88 88 #endif 89 89 90 + #ifdef CONFIG_CGROUP_WRITEBACK 91 + static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); 92 + #endif 93 + 90 94 /* Whether legacy memory+swap accounting is active */ 91 95 static bool do_memsw_account(void) 92 96 { ··· 4176 4172 4177 4173 #ifdef CONFIG_CGROUP_WRITEBACK 4178 4174 4175 + #include <trace/events/writeback.h> 4176 + 4179 4177 static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) 4180 4178 { 4181 4179 return wb_domain_init(&memcg->cgwb_domain, gfp); ··· 4258 4252 4259 4253 *pheadroom = min(*pheadroom, ceiling - min(ceiling, used)); 4260 4254 memcg = parent; 4255 + } 4256 + } 4257 + 4258 + /* 4259 + * Foreign dirty flushing 4260 + * 4261 + * There's an inherent mismatch between memcg and writeback. The former 4262 + * tracks ownership per-page while the latter per-inode. This was a 4263 + * deliberate design decision because honoring per-page ownership in the 4264 + * writeback path is complicated, may lead to higher CPU and IO overheads 4265 + * and deemed unnecessary given that write-sharing an inode across 4266 + * different cgroups isn't a common use-case. 4267 + * 4268 + * Combined with inode majority-writer ownership switching, this works well 4269 + * enough in most cases but there are some pathological cases. For 4270 + * example, let's say there are two cgroups A and B which keep writing to 4271 + * different but confined parts of the same inode. B owns the inode and 4272 + * A's memory is limited far below B's. A's dirty ratio can rise enough to 4273 + * trigger balance_dirty_pages() sleeps but B's can be low enough to avoid 4274 + * triggering background writeback. A will be slowed down without a way to 4275 + * make writeback of the dirty pages happen. 
4276 + * 4277 + * Conditions like the above can lead to a cgroup getting repeatedly and 4278 + * severely throttled after making some progress after each 4279 + * dirty_expire_interval while the underlying IO device is almost 4280 + * completely idle. 4281 + * 4282 + * Solving this problem completely requires matching the ownership tracking 4283 + * granularities between memcg and writeback in either direction. However, 4284 + * the more egregious behaviors can be avoided by simply remembering the 4285 + * most recent foreign dirtying events and initiating remote flushes on 4286 + * them when local writeback isn't enough to keep the memory clean enough. 4287 + * 4288 + * The following two functions implement such mechanism. When a foreign 4289 + * page - a page whose memcg and writeback ownerships don't match - is 4290 + * dirtied, mem_cgroup_track_foreign_dirty() records the inode owning 4291 + * bdi_writeback on the page owning memcg. When balance_dirty_pages() 4292 + * decides that the memcg needs to sleep due to high dirty ratio, it calls 4293 + * mem_cgroup_flush_foreign() which queues writeback on the recorded 4294 + * foreign bdi_writebacks which haven't expired. Both the numbers of 4295 + * recorded bdi_writebacks and concurrent in-flight foreign writebacks are 4296 + * limited to MEMCG_CGWB_FRN_CNT. 4297 + * 4298 + * The mechanism only remembers IDs and doesn't hold any object references. 4299 + * As being wrong occasionally doesn't matter, updates and accesses to the 4300 + * records are lockless and racy. 4301 + */ 4302 + void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, 4303 + struct bdi_writeback *wb) 4304 + { 4305 + struct mem_cgroup *memcg = page->mem_cgroup; 4306 + struct memcg_cgwb_frn *frn; 4307 + u64 now = get_jiffies_64(); 4308 + u64 oldest_at = now; 4309 + int oldest = -1; 4310 + int i; 4311 + 4312 + trace_track_foreign_dirty(page, wb); 4313 + 4314 + /* 4315 + * Pick the slot to use. 
If there is already a slot for @wb, keep 4316 + * using it. If not replace the oldest one which isn't being 4317 + * written out. 4318 + */ 4319 + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { 4320 + frn = &memcg->cgwb_frn[i]; 4321 + if (frn->bdi_id == wb->bdi->id && 4322 + frn->memcg_id == wb->memcg_css->id) 4323 + break; 4324 + if (time_before64(frn->at, oldest_at) && 4325 + atomic_read(&frn->done.cnt) == 1) { 4326 + oldest = i; 4327 + oldest_at = frn->at; 4328 + } 4329 + } 4330 + 4331 + if (i < MEMCG_CGWB_FRN_CNT) { 4332 + /* 4333 + * Re-using an existing one. Update timestamp lazily to 4334 + * avoid making the cacheline hot. We want them to be 4335 + * reasonably up-to-date and significantly shorter than 4336 + * dirty_expire_interval as that's what expires the record. 4337 + * Use the shorter of 1s and dirty_expire_interval / 8. 4338 + */ 4339 + unsigned long update_intv = 4340 + min_t(unsigned long, HZ, 4341 + msecs_to_jiffies(dirty_expire_interval * 10) / 8); 4342 + 4343 + if (time_before64(frn->at, now - update_intv)) 4344 + frn->at = now; 4345 + } else if (oldest >= 0) { 4346 + /* replace the oldest free one */ 4347 + frn = &memcg->cgwb_frn[oldest]; 4348 + frn->bdi_id = wb->bdi->id; 4349 + frn->memcg_id = wb->memcg_css->id; 4350 + frn->at = now; 4351 + } 4352 + } 4353 + 4354 + /* issue foreign writeback flushes for recorded foreign dirtying events */ 4355 + void mem_cgroup_flush_foreign(struct bdi_writeback *wb) 4356 + { 4357 + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); 4358 + unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10); 4359 + u64 now = jiffies_64; 4360 + int i; 4361 + 4362 + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { 4363 + struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i]; 4364 + 4365 + /* 4366 + * If the record is older than dirty_expire_interval, 4367 + * writeback on it has already started. No need to kick it 4368 + * off again. Also, don't start a new one if there's 4369 + * already one in flight. 
4370 + */ 4371 + if (time_after64(frn->at, now - intv) && 4372 + atomic_read(&frn->done.cnt) == 1) { 4373 + frn->at = 0; 4374 + trace_flush_foreign(wb, frn->bdi_id, frn->memcg_id); 4375 + cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, 0, 4376 + WB_REASON_FOREIGN_FLUSH, 4377 + &frn->done); 4378 + } 4261 4379 } 4262 4380 } 4263 4381 ··· 4907 4777 struct mem_cgroup *memcg; 4908 4778 unsigned int size; 4909 4779 int node; 4780 + int __maybe_unused i; 4910 4781 4911 4782 size = sizeof(struct mem_cgroup); 4912 4783 size += nr_node_ids * sizeof(struct mem_cgroup_per_node *); ··· 4951 4820 #endif 4952 4821 #ifdef CONFIG_CGROUP_WRITEBACK 4953 4822 INIT_LIST_HEAD(&memcg->cgwb_list); 4823 + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) 4824 + memcg->cgwb_frn[i].done = 4825 + __WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq); 4954 4826 #endif 4955 4827 idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); 4956 4828 return memcg; ··· 5083 4949 static void mem_cgroup_css_free(struct cgroup_subsys_state *css) 5084 4950 { 5085 4951 struct mem_cgroup *memcg = mem_cgroup_from_css(css); 4952 + int __maybe_unused i; 5086 4953 4954 + #ifdef CONFIG_CGROUP_WRITEBACK 4955 + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) 4956 + wb_wait_for_completion(&memcg->cgwb_frn[i].done); 4957 + #endif 5087 4958 if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) 5088 4959 static_branch_dec(&memcg_sockets_enabled_key); 5089 4960
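The slot selection and flush rules described in the mm/memcontrol.c comment above can be sketched as a small Python model. It is an illustration under stated assumptions, not the kernel implementation: `FRN_CNT = 4` is an assumed value (the diff names MEMCG_CGWB_FRN_CNT but does not show its definition), timestamps are plain integers with no wraparound handling, the `in_flight` flag stands in for the `done.cnt` check, and setting it inside `flush` is an assumption about what queuing the writeback does. `Record`, `track`, `flush` and `kick` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List

FRN_CNT = 4  # assumed value; the diff only names MEMCG_CGWB_FRN_CNT


@dataclass
class Record:
    bdi_id: int = 0
    memcg_id: int = 0
    at: int = 0              # time of the last recorded foreign dirtying
    in_flight: bool = False  # stand-in for "done.cnt != 1"


def track(slots: List[Record], bdi_id: int, memcg_id: int,
          now: int, update_intv: int) -> None:
    """Record a foreign dirtying event for (bdi_id, memcg_id)."""
    oldest, oldest_at = None, now
    for rec in slots:
        if (rec.bdi_id, rec.memcg_id) == (bdi_id, memcg_id):
            # Existing slot: refresh the timestamp only when it is stale,
            # so that frequent dirtying does not keep rewriting the record.
            if rec.at < now - update_intv:
                rec.at = now
            return
        if rec.at < oldest_at and not rec.in_flight:
            oldest, oldest_at = rec, rec.at
    if oldest is not None:
        # No match: take over the oldest slot that is not being written out.
        oldest.bdi_id, oldest.memcg_id, oldest.at = bdi_id, memcg_id, now


def flush(slots: List[Record], now: int, expire_intv: int,
          kick: Callable[[int, int], None]) -> int:
    """Kick writeback for unexpired, idle records; return how many."""
    kicked = 0
    for rec in slots:
        if rec.at > now - expire_intv and not rec.in_flight:
            rec.at = 0            # consumed; a new dirtying must re-arm it
            rec.in_flight = True  # assumption: queuing marks it busy
            kick(rec.bdi_id, rec.memcg_id)
            kicked += 1
    return kicked
```

A record older than the expiry interval is skipped because regular writeback would already have picked that inode up, which is why `flush` only looks at recent entries.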

+4

mm/page-writeback.c

··· 1667 1667 if (unlikely(!writeback_in_progress(wb))) 1668 1668 wb_start_background_writeback(wb); 1669 1669 1670 + mem_cgroup_flush_foreign(wb); 1671 + 1670 1672 /* 1671 1673 * Calculate global domain's pos_ratio and select the 1672 1674 * global dtc by default. ··· 2429 2427 task_io_account_write(PAGE_SIZE); 2430 2428 current->nr_dirtied++; 2431 2429 this_cpu_inc(bdp_ratelimits); 2430 + 2431 + mem_cgroup_track_foreign_dirty(page, wb); 2432 2432 } 2433 2433 } 2434 2434

+178

tools/cgroup/iocost_coef_gen.py

··· 1 + #!/usr/bin/env python3 2 + # 3 + # Copyright (C) 2019 Tejun Heo <tj@kernel.org> 4 + # Copyright (C) 2019 Andy Newell <newella@fb.com> 5 + # Copyright (C) 2019 Facebook 6 + 7 + desc = """ 8 + Generate linear IO cost model coefficients used by the blk-iocost 9 + controller. If the target raw testdev is specified, destructive tests 10 + are performed against the whole device; otherwise, on 11 + ./iocost-coef-fio.testfile. The result can be written directly to 12 + /sys/fs/cgroup/io.cost.model. 13 + 14 + On high performance devices, --numjobs > 1 is needed to achieve 15 + saturation. 16 + 17 + See Documentation/admin-guide/cgroup-v2.rst and block/blk-iocost.c 18 + for more details. 19 + """ 20 + 21 + import argparse 22 + import re 23 + import json 24 + import glob 25 + import os 26 + import sys 27 + import atexit 28 + import shutil 29 + import tempfile 30 + import subprocess 31 + 32 + parser = argparse.ArgumentParser(description=desc, 33 + formatter_class=argparse.RawTextHelpFormatter) 34 + parser.add_argument('--testdev', metavar='DEV', 35 + help='Raw block device to use for testing, ignores --testfile-size') 36 + parser.add_argument('--testfile-size-gb', type=float, metavar='GIGABYTES', default=16, 37 + help='Testfile size in gigabytes (default: %(default)s)') 38 + parser.add_argument('--duration', type=int, metavar='SECONDS', default=120, 39 + help='Individual test run duration in seconds (default: %(default)s)') 40 + parser.add_argument('--seqio-block-mb', metavar='MEGABYTES', type=int, default=128, 41 + help='Sequential test block size in megabytes (default: %(default)s)') 42 + parser.add_argument('--seq-depth', type=int, metavar='DEPTH', default=64, 43 + help='Sequential test queue depth (default: %(default)s)') 44 + parser.add_argument('--rand-depth', type=int, metavar='DEPTH', default=64, 45 + help='Random test queue depth (default: %(default)s)') 46 + parser.add_argument('--numjobs', type=int, metavar='JOBS', default=1, 47 + help='Number of parallel 
fio jobs to run (default: %(default)s)') 48 + parser.add_argument('--quiet', action='store_true') 49 + parser.add_argument('--verbose', action='store_true') 50 + 51 + def info(msg): 52 + if not args.quiet: 53 + print(msg) 54 + 55 + def dbg(msg): 56 + if args.verbose and not args.quiet: 57 + print(msg) 58 + 59 + # determine ('DEVNAME', 'MAJ:MIN') for @path 60 + def dir_to_dev(path): 61 + # find the block device the current directory is on 62 + devname = subprocess.run(f'findmnt -nvo SOURCE -T{path}', 63 + stdout=subprocess.PIPE, shell=True).stdout 64 + devname = os.path.basename(devname).decode('utf-8').strip() 65 + 66 + # partition -> whole device 67 + parents = glob.glob('/sys/block/*/' + devname) 68 + if len(parents): 69 + devname = os.path.basename(os.path.dirname(parents[0])) 70 + rdev = os.stat(f'/dev/{devname}').st_rdev 71 + return (devname, f'{os.major(rdev)}:{os.minor(rdev)}') 72 + 73 + def create_testfile(path, size): 74 + global args 75 + 76 + if os.path.isfile(path) and os.stat(path).st_size == size: 77 + return 78 + 79 + info(f'Creating testfile {path}') 80 + subprocess.check_call(f'rm -f {path}', shell=True) 81 + subprocess.check_call(f'touch {path}', shell=True) 82 + subprocess.call(f'chattr +C {path}', shell=True) 83 + subprocess.check_call( 84 + f'pv -s {size} -pr /dev/urandom {"-q" if args.quiet else ""} | ' 85 + f'dd of={path} count={size} ' 86 + f'iflag=count_bytes,fullblock oflag=direct bs=16M status=none', 87 + shell=True) 88 + 89 + def run_fio(testfile, duration, iotype, iodepth, blocksize, jobs): 90 + global args 91 + 92 + eta = 'never' if args.quiet else 'always' 93 + outfile = tempfile.NamedTemporaryFile() 94 + cmd = (f'fio --direct=1 --ioengine=libaio --name=coef ' 95 + f'--filename={testfile} --runtime={round(duration)} ' 96 + f'--readwrite={iotype} --iodepth={iodepth} --blocksize={blocksize} ' 97 + f'--eta={eta} --output-format json --output={outfile.name} ' 98 + f'--time_based --numjobs={jobs}') 99 + if args.verbose: 100 + dbg(f'Running 
{cmd}') 101 + subprocess.check_call(cmd, shell=True) 102 + with open(outfile.name, 'r') as f: 103 + d = json.loads(f.read()) 104 + return sum(j['read']['bw_bytes'] + j['write']['bw_bytes'] for j in d['jobs']) 105 + 106 + def restore_elevator_nomerges(): 107 + global elevator_path, nomerges_path, elevator, nomerges 108 + 109 + info(f'Restoring elevator to {elevator} and nomerges to {nomerges}') 110 + with open(elevator_path, 'w') as f: 111 + f.write(elevator) 112 + with open(nomerges_path, 'w') as f: 113 + f.write(nomerges) 114 + 115 + 116 + args = parser.parse_args() 117 + 118 + missing = False 119 + for cmd in [ 'findmnt', 'pv', 'dd', 'fio' ]: 120 + if not shutil.which(cmd): 121 + print(f'Required command "{cmd}" is missing', file=sys.stderr) 122 + missing = True 123 + if missing: 124 + sys.exit(1) 125 + 126 + if args.testdev: 127 + devname = os.path.basename(args.testdev) 128 + rdev = os.stat(f'/dev/{devname}').st_rdev 129 + devno = f'{os.major(rdev)}:{os.minor(rdev)}' 130 + testfile = f'/dev/{devname}' 131 + info(f'Test target: {devname}({devno})') 132 + else: 133 + devname, devno = dir_to_dev('.') 134 + testfile = 'iocost-coef-fio.testfile' 135 + testfile_size = int(args.testfile_size_gb * 2 ** 30) 136 + create_testfile(testfile, testfile_size) 137 + info(f'Test target: {testfile} on {devname}({devno})') 138 + 139 + elevator_path = f'/sys/block/{devname}/queue/scheduler' 140 + nomerges_path = f'/sys/block/{devname}/queue/nomerges' 141 + 142 + with open(elevator_path, 'r') as f: 143 + elevator = re.sub(r'.*\[(.*)\].*', r'\1', f.read().strip()) 144 + with open(nomerges_path, 'r') as f: 145 + nomerges = f.read().strip() 146 + 147 + info(f'Temporarily disabling elevator and merges') 148 + atexit.register(restore_elevator_nomerges) 149 + with open(elevator_path, 'w') as f: 150 + f.write('none') 151 + with open(nomerges_path, 'w') as f: 152 + f.write('1') 153 + 154 + info('Determining rbps...') 155 + rbps = run_fio(testfile, args.duration, 'read', 156 + 1, 
args.seqio_block_mb * (2 ** 20), args.numjobs) 157 + info(f'\nrbps={rbps}, determining rseqiops...') 158 + rseqiops = round(run_fio(testfile, args.duration, 'read', 159 + args.seq_depth, 4096, args.numjobs) / 4096) 160 + info(f'\nrseqiops={rseqiops}, determining rrandiops...') 161 + rrandiops = round(run_fio(testfile, args.duration, 'randread', 162 + args.rand_depth, 4096, args.numjobs) / 4096) 163 + info(f'\nrrandiops={rrandiops}, determining wbps...') 164 + wbps = run_fio(testfile, args.duration, 'write', 165 + 1, args.seqio_block_mb * (2 ** 20), args.numjobs) 166 + info(f'\nwbps={wbps}, determining wseqiops...') 167 + wseqiops = round(run_fio(testfile, args.duration, 'write', 168 + args.seq_depth, 4096, args.numjobs) / 4096) 169 + info(f'\nwseqiops={wseqiops}, determining wrandiops...') 170 + wrandiops = round(run_fio(testfile, args.duration, 'randwrite', 171 + args.rand_depth, 4096, args.numjobs) / 4096) 172 + info(f'\nwrandiops={wrandiops}') 173 + restore_elevator_nomerges() 174 + atexit.unregister(restore_elevator_nomerges) 175 + info('') 176 + 177 + print(f'{devno} rbps={rbps} rseqiops={rseqiops} rrandiops={rrandiops} ' 178 + f'wbps={wbps} wseqiops={wseqiops} wrandiops={wrandiops}')
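The arithmetic in iocost_coef_gen.py above is small enough to isolate. The sketch below restates it with hypothetical helper names (`total_bw`, `iops_from_bw`, `model_line`): bandwidth is summed over all fio jobs, IOPS is derived by dividing the measured bandwidth by the 4096-byte block size, and the six coefficients are formatted into one device-keyed line. The sample dictionary is made up for illustration and is not real fio output.

```python
def total_bw(fio_result: dict) -> int:
    """Sum read and write bandwidth (bytes/s) over all jobs in fio's JSON."""
    return sum(j["read"]["bw_bytes"] + j["write"]["bw_bytes"]
               for j in fio_result["jobs"])


def iops_from_bw(bw_bytes: float, block_size: int = 4096) -> int:
    """Convert bandwidth to IOPS for a fixed block size."""
    return round(bw_bytes / block_size)


def model_line(devno: str, rbps: int, rseq_bw: int, rrand_bw: int,
               wbps: int, wseq_bw: int, wrand_bw: int) -> str:
    """Format the coefficients the way the script prints them."""
    return (f"{devno} rbps={rbps} rseqiops={iops_from_bw(rseq_bw)} "
            f"rrandiops={iops_from_bw(rrand_bw)} "
            f"wbps={wbps} wseqiops={iops_from_bw(wseq_bw)} "
            f"wrandiops={iops_from_bw(wrand_bw)}")


# Made-up two-job result, shaped like the fields the script reads.
sample = {"jobs": [{"read": {"bw_bytes": 4096}, "write": {"bw_bytes": 0}},
                   {"read": {"bw_bytes": 4096}, "write": {"bw_bytes": 0}}]}
line = model_line("8:16", 1000, total_bw(sample), 4096, 500, 8192, 4096)
```

Summing read and write bandwidth works for the script because each fio run is purely a read test or purely a write test, so one of the two terms is zero.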

+277

tools/cgroup/iocost_monitor.py

··· 1 + #!/usr/bin/env drgn 2 + # 3 + # Copyright (C) 2019 Tejun Heo <tj@kernel.org> 4 + # Copyright (C) 2019 Facebook 5 + 6 + desc = """ 7 + This is a drgn script to monitor the blk-iocost cgroup controller. 8 + See the comment at the top of block/blk-iocost.c for more details. 9 + For drgn, visit https://github.com/osandov/drgn. 10 + """ 11 + 12 + import sys 13 + import re 14 + import time 15 + import json 16 + import math 17 + 18 + import drgn 19 + from drgn import container_of 20 + from drgn.helpers.linux.list import list_for_each_entry,list_empty 21 + from drgn.helpers.linux.radixtree import radix_tree_for_each,radix_tree_lookup 22 + 23 + import argparse 24 + parser = argparse.ArgumentParser(description=desc, 25 + formatter_class=argparse.RawTextHelpFormatter) 26 + parser.add_argument('devname', metavar='DEV', 27 + help='Target block device name (e.g. sda)') 28 + parser.add_argument('--cgroup', action='append', metavar='REGEX', 29 + help='Regex for target cgroups, ') 30 + parser.add_argument('--interval', '-i', metavar='SECONDS', type=float, default=1, 31 + help='Monitoring interval in seconds') 32 + parser.add_argument('--json', action='store_true', 33 + help='Output in json') 34 + args = parser.parse_args() 35 + 36 + def err(s): 37 + print(s, file=sys.stderr, flush=True) 38 + sys.exit(1) 39 + 40 + try: 41 + blkcg_root = prog['blkcg_root'] 42 + plid = prog['blkcg_policy_iocost'].plid.value_() 43 + except: 44 + err('The kernel does not have iocost enabled') 45 + 46 + IOC_RUNNING = prog['IOC_RUNNING'].value_() 47 + NR_USAGE_SLOTS = prog['NR_USAGE_SLOTS'].value_() 48 + HWEIGHT_WHOLE = prog['HWEIGHT_WHOLE'].value_() 49 + VTIME_PER_SEC = prog['VTIME_PER_SEC'].value_() 50 + VTIME_PER_USEC = prog['VTIME_PER_USEC'].value_() 51 + AUTOP_SSD_FAST = prog['AUTOP_SSD_FAST'].value_() 52 + AUTOP_SSD_DFL = prog['AUTOP_SSD_DFL'].value_() 53 + AUTOP_SSD_QD1 = prog['AUTOP_SSD_QD1'].value_() 54 + AUTOP_HDD = prog['AUTOP_HDD'].value_() 55 + 56 + autop_names = { 57 + 
AUTOP_SSD_FAST: 'ssd_fast', 58 + AUTOP_SSD_DFL: 'ssd_dfl', 59 + AUTOP_SSD_QD1: 'ssd_qd1', 60 + AUTOP_HDD: 'hdd', 61 + } 62 + 63 + class BlkgIterator: 64 + def blkcg_name(blkcg): 65 + return blkcg.css.cgroup.kn.name.string_().decode('utf-8') 66 + 67 + def walk(self, blkcg, q_id, parent_path): 68 + if not self.include_dying and \ 69 + not (blkcg.css.flags.value_() & prog['CSS_ONLINE'].value_()): 70 + return 71 + 72 + name = BlkgIterator.blkcg_name(blkcg) 73 + path = parent_path + '/' + name if parent_path else name 74 + blkg = drgn.Object(prog, 'struct blkcg_gq', 75 + address=radix_tree_lookup(blkcg.blkg_tree, q_id)) 76 + if not blkg.address_: 77 + return 78 + 79 + self.blkgs.append((path if path else '/', blkg)) 80 + 81 + for c in list_for_each_entry('struct blkcg', 82 + blkcg.css.children.address_of_(), 'css.sibling'): 83 + self.walk(c, q_id, path) 84 + 85 + def __init__(self, root_blkcg, q_id, include_dying=False): 86 + self.include_dying = include_dying 87 + self.blkgs = [] 88 + self.walk(root_blkcg, q_id, '') 89 + 90 + def __iter__(self): 91 + return iter(self.blkgs) 92 + 93 + class IocStat: 94 + def __init__(self, ioc): 95 + global autop_names 96 + 97 + self.enabled = ioc.enabled.value_() 98 + self.running = ioc.running.value_() == IOC_RUNNING 99 + self.period_ms = ioc.period_us.value_() / 1_000 100 + self.period_at = ioc.period_at.value_() / 1_000_000 101 + self.vperiod_at = ioc.period_at_vtime.value_() / VTIME_PER_SEC 102 + self.vrate_pct = ioc.vtime_rate.counter.value_() * 100 / VTIME_PER_USEC 103 + self.busy_level = ioc.busy_level.value_() 104 + self.autop_idx = ioc.autop_idx.value_() 105 + self.user_cost_model = ioc.user_cost_model.value_() 106 + self.user_qos_params = ioc.user_qos_params.value_() 107 + 108 + if self.autop_idx in autop_names: 109 + self.autop_name = autop_names[self.autop_idx] 110 + else: 111 + self.autop_name = '?' 
112 + 113 + def dict(self, now): 114 + return { 'device' : devname, 115 + 'timestamp' : str(now), 116 + 'enabled' : str(int(self.enabled)), 117 + 'running' : str(int(self.running)), 118 + 'period_ms' : str(self.period_ms), 119 + 'period_at' : str(self.period_at), 120 + 'period_vtime_at' : str(self.vperiod_at), 121 + 'busy_level' : str(self.busy_level), 122 + 'vrate_pct' : str(self.vrate_pct), } 123 + 124 + def table_preamble_str(self): 125 + state = ('RUN' if self.running else 'IDLE') if self.enabled else 'OFF' 126 + output = f'{devname} {state:4} ' \ 127 + f'per={self.period_ms}ms ' \ 128 + f'cur_per={self.period_at:.3f}:v{self.vperiod_at:.3f} ' \ 129 + f'busy={self.busy_level:+3} ' \ 130 + f'vrate={self.vrate_pct:6.2f}% ' \ 131 + f'params={self.autop_name}' 132 + if self.user_cost_model or self.user_qos_params: 133 + output += f'({"C" if self.user_cost_model else ""}{"Q" if self.user_qos_params else ""})' 134 + return output 135 + 136 + def table_header_str(self): 137 + return f'{"":25} active {"weight":>9} {"hweight%":>13} {"inflt%":>6} ' \ 138 + f'{"dbt":>3} {"delay":>6} {"usages%"}' 139 + 140 + class IocgStat: 141 + def __init__(self, iocg): 142 + ioc = iocg.ioc 143 + blkg = iocg.pd.blkg 144 + 145 + self.is_active = not list_empty(iocg.active_list.address_of_()) 146 + self.weight = iocg.weight.value_() 147 + self.active = iocg.active.value_() 148 + self.inuse = iocg.inuse.value_() 149 + self.hwa_pct = iocg.hweight_active.value_() * 100 / HWEIGHT_WHOLE 150 + self.hwi_pct = iocg.hweight_inuse.value_() * 100 / HWEIGHT_WHOLE 151 + self.address = iocg.value_() 152 + 153 + vdone = iocg.done_vtime.counter.value_() 154 + vtime = iocg.vtime.counter.value_() 155 + vrate = ioc.vtime_rate.counter.value_() 156 + period_vtime = ioc.period_us.value_() * vrate 157 + if period_vtime: 158 + self.inflight_pct = (vtime - vdone) * 100 / period_vtime 159 + else: 160 + self.inflight_pct = 0 161 + 162 + self.debt_ms = iocg.abs_vdebt.counter.value_() / VTIME_PER_USEC / 1000 163 + 
self.use_delay = blkg.use_delay.counter.value_() 164 + self.delay_ms = blkg.delay_nsec.counter.value_() / 1_000_000 165 + 166 + usage_idx = iocg.usage_idx.value_() 167 + self.usages = [] 168 + self.usage = 0 169 + for i in range(NR_USAGE_SLOTS): 170 + usage = iocg.usages[(usage_idx + i) % NR_USAGE_SLOTS].value_() 171 + upct = usage * 100 / HWEIGHT_WHOLE 172 + self.usages.append(upct) 173 + self.usage = max(self.usage, upct) 174 + 175 + def dict(self, now, path): 176 + out = { 'cgroup' : path, 177 + 'timestamp' : str(now), 178 + 'is_active' : str(int(self.is_active)), 179 + 'weight' : str(self.weight), 180 + 'weight_active' : str(self.active), 181 + 'weight_inuse' : str(self.inuse), 182 + 'hweight_active_pct' : str(self.hwa_pct), 183 + 'hweight_inuse_pct' : str(self.hwi_pct), 184 + 'inflight_pct' : str(self.inflight_pct), 185 + 'debt_ms' : str(self.debt_ms), 186 + 'use_delay' : str(self.use_delay), 187 + 'delay_ms' : str(self.delay_ms), 188 + 'usage_pct' : str(self.usage), 189 + 'address' : str(hex(self.address)) } 190 + for i in range(len(self.usages)): 191 + out[f'usage_pct_{i}'] = str(self.usages[i]) 192 + return out 193 + 194 + def table_row_str(self, path): 195 + out = f'{path[-28:]:28} ' \ 196 + f'{"*" if self.is_active else " "} ' \ 197 + f'{self.inuse:5}/{self.active:5} ' \ 198 + f'{self.hwi_pct:6.2f}/{self.hwa_pct:6.2f} ' \ 199 + f'{self.inflight_pct:6.2f} ' \ 200 + f'{min(math.ceil(self.debt_ms), 999):3} ' \ 201 + f'{min(self.use_delay, 99):2}*'\ 202 + f'{min(math.ceil(self.delay_ms), 999):03} ' 203 + for u in self.usages: 204 + out += f'{min(round(u), 999):03d}:' 205 + out = out.rstrip(':') 206 + return out 207 + 208 + # handle args 209 + table_fmt = not args.json 210 + interval = args.interval 211 + devname = args.devname 212 + 213 + if args.json: 214 + table_fmt = False 215 + 216 + re_str = None 217 + if args.cgroup: 218 + for r in args.cgroup: 219 + if re_str is None: 220 + re_str = r 221 + else: 222 + re_str += '|' + r 223 + 224 + filter_re = 
re.compile(re_str) if re_str else None 225 + 226 + # Locate the roots 227 + q_id = None 228 + root_iocg = None 229 + ioc = None 230 + 231 + for i, ptr in radix_tree_for_each(blkcg_root.blkg_tree): 232 + blkg = drgn.Object(prog, 'struct blkcg_gq', address=ptr) 233 + try: 234 + if devname == blkg.q.kobj.parent.name.string_().decode('utf-8'): 235 + q_id = blkg.q.id.value_() 236 + if blkg.pd[plid]: 237 + root_iocg = container_of(blkg.pd[plid], 'struct ioc_gq', 'pd') 238 + ioc = root_iocg.ioc 239 + break 240 + except: 241 + pass 242 + 243 + if ioc is None: 244 + err(f'Could not find ioc for {devname}'); 245 + 246 + # Keep printing 247 + while True: 248 + now = time.time() 249 + iocstat = IocStat(ioc) 250 + output = '' 251 + 252 + if table_fmt: 253 + output += '\n' + iocstat.table_preamble_str() 254 + output += '\n' + iocstat.table_header_str() 255 + else: 256 + output += json.dumps(iocstat.dict(now)) 257 + 258 + for path, blkg in BlkgIterator(blkcg_root, q_id): 259 + if filter_re and not filter_re.match(path): 260 + continue 261 + if not blkg.pd[plid]: 262 + continue 263 + 264 + iocg = container_of(blkg.pd[plid], 'struct ioc_gq', 'pd') 265 + iocg_stat = IocgStat(iocg) 266 + 267 + if not filter_re and not iocg_stat.is_active: 268 + continue 269 + 270 + if table_fmt: 271 + output += '\n' + iocg_stat.table_row_str(path) 272 + else: 273 + output += '\n' + json.dumps(iocg_stat.dict(now, path)) 274 + 275 + print(output) 276 + sys.stdout.flush() 277 + time.sleep(interval)
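The cgroup selection logic in iocost_monitor.py above can be summarised as a short sketch. The helper names `build_filter` and `selected` are hypothetical, and the check for a missing policy data pointer (`blkg.pd[plid]`) is left out. Repeated `--cgroup` options are joined with `|` into one pattern; when a filter is given, a cgroup is shown if its path matches, whether or not it is active, and without a filter only active cgroups are shown.

```python
import re
from typing import List, Optional, Pattern


def build_filter(patterns: Optional[List[str]]) -> Optional[Pattern[str]]:
    """Join the --cgroup regexes into one alternation, or None if absent."""
    if not patterns:
        return None
    return re.compile("|".join(patterns))


def selected(path: str, filter_re: Optional[Pattern[str]],
             is_active: bool) -> bool:
    """Decide whether a cgroup row is printed."""
    if filter_re is not None:
        # re.match anchors at the start of the path, not anywhere in it.
        return filter_re.match(path) is not None
    return is_active
```

Since `match` only anchors at the beginning, a pattern such as `system` selects `system.slice` but not `user.slice/system`; a leading `.*` is needed to match deeper path components.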
