block, bfq: inject I/O to underutilized actuators

The main service scheme of BFQ for sync I/O is to serve one sync
bfq_queue at a time, for a while. In particular, BFQ enforces this
scheme when it deems it necessary to boost throughput or to
preserve service guarantees. Unfortunately, while BFQ enforces
this scheme, only one actuator at a time gets served, because each
bfq_queue contains I/O for only one actuator. The other actuators
may remain underutilized.

Actually, BFQ may serve (inject) extra I/O, taken from other
bfq_queues, in parallel with that of the in-service queue. This
injection mechanism could also serve as a basis for addressing the
above actuator-underutilization problem. Yet BFQ does not take the
actuator load into account when choosing which queue to pick extra
I/O from. In addition, BFQ can inject extra I/O only when the
in-service queue is temporarily empty.

In view of these facts, this commit extends the injection
mechanism so that it:
(1) also takes the actuator load into account;
(2) checks that load on each dispatch, and injects I/O for an
underutilized actuator, if there is one and there is I/O for it.

To perform the check in (2), this commit introduces a load
threshold, currently set to 4. A linear scan of each actuator is
performed, until an actuator is found for which the following two
conditions hold: the load of the actuator is below the threshold,
and there is at least one non-in-service queue that contains I/O
for that actuator. If such a pair (actuator, queue) is found, then
the head request of that queue is returned for dispatch, instead
of the head request of the in-service queue.
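
The scan described above can be sketched as a small userspace model.
This is only an illustration that follows the wording of this message,
not the exact kernel control flow: the model_* types, fields and
function names are hypothetical simplifications, and only the threshold
value (4) and the two conditions come from the text.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ACTUATORS 8	/* hypothetical cap for this model */

/* Hypothetical, simplified stand-in for a bfq_queue. */
struct model_queue {
	int actuator_idx;	/* actuator targeted by this queue's I/O */
	bool has_io;		/* true if the queue has a pending request */
	bool in_service;	/* true if this is the in-service queue */
};

/* Hypothetical, simplified stand-in for the scheduler state. */
struct model_sched {
	int num_actuators;
	unsigned int load_threshold;		/* 4 in this commit */
	int rq_in_driver[MODEL_MAX_ACTUATORS];	/* in-flight requests per actuator */
	const struct model_queue *queues;
	size_t num_queues;
};

/*
 * Linear scan over the actuators: return the first non-in-service
 * queue holding I/O for an actuator whose load is below the
 * threshold, or NULL if no such (actuator, queue) pair exists.
 */
static const struct model_queue *
model_find_queue_for_underused_actuator(const struct model_sched *s)
{
	for (int i = 0; i < s->num_actuators; i++) {
		/* Condition 1: the actuator load is below the threshold. */
		if ((unsigned int)s->rq_in_driver[i] >= s->load_threshold)
			continue;

		/* Condition 2: some other queue has I/O for this actuator. */
		for (size_t q = 0; q < s->num_queues; q++) {
			const struct model_queue *mq = &s->queues[q];

			if (mq->actuator_idx == i && mq->has_io &&
			    !mq->in_service)
				return mq;
		}
	}

	return NULL;
}
```

In this model, a non-NULL result means that the head request of the
returned queue would be dispatched in place of the head request of the
in-service queue.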

We have set the threshold, empirically, to the minimum possible
value for which an actuator is fully utilized, or close to being
fully utilized. By doing so, injected I/O 'steals' as few
drive-queue slots as possible from the in-service queue. This
reduces as much as possible the probability that the service of
I/O from the in-service bfq_queue gets delayed because of slot
exhaustion, i.e., because all the slots of the drive queue are
filled with I/O injected from other queues (NCQ provides for 32
slots).
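
As a back-of-the-envelope illustration of the slot argument, the helper
below computes how many drive-queue slots remain for the in-service
queue. It rests on a simplifying assumption that is not stated in this
message: every other actuator holds exactly threshold injected requests,
and no other injection path is active. The function name and parameters
are hypothetical.

```c
/*
 * Hypothetical helper: slots left for the in-service queue when each
 * of other_actuators holds threshold injected requests in a drive
 * queue of queue_depth slots. Returns 0 if injected I/O alone would
 * fill the drive queue.
 */
static unsigned int
slots_left_for_in_service(unsigned int queue_depth,
			  unsigned int threshold,
			  unsigned int other_actuators)
{
	unsigned int injected = threshold * other_actuators;

	return injected >= queue_depth ? 0 : queue_depth - injected;
}
```

Under this assumption, with 32 NCQ slots, a threshold of 4 and one other
actuator, 28 slots remain for the in-service queue. A larger threshold
shrinks that margin linearly.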

This new mechanism also counters actuator underutilization in the
case of asymmetric configurations of bfq_queues, namely when few
bfq_queues contain I/O for some actuators and many bfq_queues
contain I/O for other actuators, or when the bfq_queues containing
I/O for some actuators have lower weights than the other
bfq_queues.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

authored by Davide Zini and committed by Jens Axboe 3 years ago (2d31c684, 4fdb3b9f)

+139 -40

4 changed files:

block/bfq-cgroup.c
block/bfq-iosched.c
block/bfq-iosched.h
block/bfq-wf2q.c

+1 -1

block/bfq-cgroup.c

···
 		bfq_activate_bfqq(bfqd, bfqq);
 	}

-	if (!bfqd->in_service_queue && !bfqd->rq_in_driver)
+	if (!bfqd->in_service_queue && !bfqd->tot_rq_in_driver)
 		bfq_schedule_dispatch(bfqd);
 	/* release extra ref taken above, bfqq may happen to be freed now */
 	bfq_put_queue(bfqq);

+101 -35

block/bfq-iosched.c

···
 	 * elapsed.
 	 */
 	if (bfqq == bfqd->in_service_queue &&
-	    (bfqd->rq_in_driver == 0 ||
+	    (bfqd->tot_rq_in_driver == 0 ||
 	     (bfqq->last_serv_time_ns > 0 &&
-	      bfqd->rqs_injected && bfqd->rq_in_driver > 0)) &&
+	      bfqd->rqs_injected && bfqd->tot_rq_in_driver > 0)) &&
 	    time_is_before_eq_jiffies(bfqq->decrease_time_jif +
 				      msecs_to_jiffies(10))) {
 		bfqd->last_empty_occupied_ns = ktime_get_ns();
···
 		 * will be set in case injection is performed
 		 * on bfqq before rq is completed).
 		 */
-		if (bfqd->rq_in_driver == 0)
+		if (bfqd->tot_rq_in_driver == 0)
 			bfqd->rqs_injected = false;
 	}
 }
···
 static void bfq_end_wr(struct bfq_data *bfqd)
 {
 	struct bfq_queue *bfqq;
+	int i;

 	spin_lock_irq(&bfqd->lock);

-	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
-		bfq_bfqq_end_wr(bfqq);
+	for (i = 0; i < bfqd->num_actuators; i++) {
+		list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
+			bfq_bfqq_end_wr(bfqq);
+	}
 	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
 		bfq_bfqq_end_wr(bfqq);
 	bfq_end_wr_async(bfqd);
···
 	 * - start a new observation interval with this dispatch
 	 */
 	if (now_ns - bfqd->last_dispatch > 100*NSEC_PER_MSEC &&
-	    bfqd->rq_in_driver == 0)
+	    bfqd->tot_rq_in_driver == 0)
 		goto update_rate_and_reset;

 	/* Update sampling information */
 	bfqd->peak_rate_samples++;

-	if ((bfqd->rq_in_driver > 0 ||
+	if ((bfqd->tot_rq_in_driver > 0 ||
 		now_ns - bfqd->last_completion < BFQ_MIN_TT)
 	    && !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))
 		bfqd->sequential_samples++;
···
 		return false;

 	return (bfqq->wr_coeff > 1 &&
-		(bfqd->wr_busy_queues <
-		 tot_busy_queues ||
-		 bfqd->rq_in_driver >=
-		 bfqq->dispatched + 4)) ||
+		(bfqd->wr_busy_queues < tot_busy_queues ||
+		 bfqd->tot_rq_in_driver >= bfqq->dispatched + 4)) ||
 		bfq_asymmetric_scenario(bfqd, bfqq) ||
 		tot_busy_queues == 1;
 }
···
 {
 	struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue;
 	unsigned int limit = in_serv_bfqq->inject_limit;
+	int i;
+
 	/*
 	 * If
 	 * - bfqq is not weight-raised and therefore does not carry
···
 	    )
 		limit = 1;

-	if (bfqd->rq_in_driver >= limit)
+	if (bfqd->tot_rq_in_driver >= limit)
 		return NULL;

 	/*
···
 	 * (and re-added only if it gets new requests, but then it
 	 * is assigned again enough budget for its new backlog).
 	 */
-	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
-		if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
-		    (in_serv_always_inject || bfqq->wr_coeff > 1) &&
-		    bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
-		    bfq_bfqq_budget_left(bfqq)) {
+	for (i = 0; i < bfqd->num_actuators; i++) {
+		list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
+			if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
+				(in_serv_always_inject || bfqq->wr_coeff > 1) &&
+				bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
+				bfq_bfqq_budget_left(bfqq)) {
 			/*
 			 * Allow for only one large in-flight request
 			 * on non-rotational devices, for the
···
 			else
 				limit = in_serv_bfqq->inject_limit;

-			if (bfqd->rq_in_driver < limit) {
+			if (bfqd->tot_rq_in_driver < limit) {
 				bfqd->rqs_injected = true;
 				return bfqq;
 			}
 		}
+	}

 	return NULL;
 }
+
+static struct bfq_queue *
+bfq_find_active_bfqq_for_actuator(struct bfq_data *bfqd, int idx)
+{
+	struct bfq_queue *bfqq;
+
+	if (bfqd->in_service_queue &&
+	    bfqd->in_service_queue->actuator_idx == idx)
+		return bfqd->in_service_queue;
+
+	list_for_each_entry(bfqq, &bfqd->active_list[idx], bfqq_list) {
+		if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
+			bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
+				bfq_bfqq_budget_left(bfqq)) {
+			return bfqq;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * Perform a linear scan of each actuator, until an actuator is found
+ * for which the following two conditions hold: the load of the
+ * actuator is below the threshold (see comments on actuator_load_threshold
+ * for details), and there is a queue that contains I/O for that
+ * actuator. On success, return that queue.
+ */
+static struct bfq_queue *
+bfq_find_bfqq_for_underused_actuator(struct bfq_data *bfqd)
+{
+	int i;
+
+	for (i = 0 ; i < bfqd->num_actuators; i++) {
+		if (bfqd->rq_in_driver[i] < bfqd->actuator_load_threshold) {
+			struct bfq_queue *bfqq =
+				bfq_find_active_bfqq_for_actuator(bfqd, i);
+
+			if (bfqq)
+				return bfqq;
+		}
+	}
+
+	return NULL;
+}
+

 /*
  * Select a queue for service. If we have a current queue in service,
···
  */
 static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
 {
-	struct bfq_queue *bfqq;
+	struct bfq_queue *bfqq, *inject_bfqq;
 	struct request *next_rq;
 	enum bfqq_expiration reason = BFQQE_BUDGET_TIMEOUT;

···
 		goto expire;

 check_queue:
+	/*
+	 * If some actuator is underutilized, but the in-service
+	 * queue does not contain I/O for that actuator, then try to
+	 * inject I/O for that actuator.
+	 */
+	inject_bfqq = bfq_find_bfqq_for_underused_actuator(bfqd);
+	if (inject_bfqq && inject_bfqq != bfqq)
+		return inject_bfqq;
+
 	/*
 	 * This loop is rarely executed more than once. Even when it
 	 * happens, it is much more convenient to re-execute this loop
···

 		/*
 		 * We exploit the bfq_finish_requeue_request hook to
-		 * decrement rq_in_driver, but
+		 * decrement tot_rq_in_driver, but
 		 * bfq_finish_requeue_request will not be invoked on
 		 * this request. So, to avoid unbalance, just start
-		 * this request, without incrementing rq_in_driver. As
-		 * a negative consequence, rq_in_driver is deceptively
+		 * this request, without incrementing tot_rq_in_driver. As
+		 * a negative consequence, tot_rq_in_driver is deceptively
 		 * lower than it should be while this request is in
 		 * service. This may cause bfq_schedule_dispatch to be
 		 * invoked uselessly.
···
 		 * bfq_finish_requeue_request hook, if defined, is
 		 * probably invoked also on this request. So, by
 		 * exploiting this hook, we could 1) increment
-		 * rq_in_driver here, and 2) decrement it in
+		 * tot_rq_in_driver here, and 2) decrement it in
 		 * bfq_finish_requeue_request. Such a solution would
 		 * let the value of the counter be always accurate,
 		 * but it would entail using an extra interface
···
 	 * Of course, serving one request at a time may cause loss of
 	 * throughput.
 	 */
-	if (bfqd->strict_guarantees && bfqd->rq_in_driver > 0)
+	if (bfqd->strict_guarantees && bfqd->tot_rq_in_driver > 0)
 		goto exit;

 	bfqq = bfq_select_queue(bfqd);
···

 	if (rq) {
 inc_in_driver_start_rq:
-		bfqd->rq_in_driver++;
+		bfqd->rq_in_driver[bfqq->actuator_idx]++;
+		bfqd->tot_rq_in_driver++;
 start_rq:
 		rq->rq_flags |= RQF_STARTED;
 	}
···
 	struct bfq_queue *bfqq = bfqd->in_service_queue;

 	bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
-				       bfqd->rq_in_driver);
+				       bfqd->tot_rq_in_driver);

 	if (bfqd->hw_tag == 1)
 		return;
···
 	 * sum is not exact, as it's not taking into account deactivated
 	 * requests.
 	 */
-	if (bfqd->rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
+	if (bfqd->tot_rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
 		return;

 	/*
···
 	if (bfqq && bfq_bfqq_has_short_ttime(bfqq) &&
 	    bfqq->dispatched + bfqq->queued[0] + bfqq->queued[1] <
 	    BFQ_HW_QUEUE_THRESHOLD &&
-	    bfqd->rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
+	    bfqd->tot_rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
 		return;

 	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
···

 	bfq_update_hw_tag(bfqd);

-	bfqd->rq_in_driver--;
+	bfqd->rq_in_driver[bfqq->actuator_idx]--;
+	bfqd->tot_rq_in_driver--;
 	bfqq->dispatched--;

 	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
···
 				BFQQE_NO_MORE_REQUESTS);
 	}

-	if (!bfqd->rq_in_driver)
+	if (!bfqd->tot_rq_in_driver)
 		bfq_schedule_dispatch(bfqd);
 }

···
 	 * conditions to do it, or we can lower the last base value
 	 * computed.
 	 *
-	 * NOTE: (bfqd->rq_in_driver == 1) means that there is no I/O
+	 * NOTE: (bfqd->tot_rq_in_driver == 1) means that there is no I/O
 	 * request in flight, because this function is in the code
 	 * path that handles the completion of a request of bfqq, and,
 	 * in particular, this function is executed before
-	 * bfqd->rq_in_driver is decremented in such a code path.
+	 * bfqd->tot_rq_in_driver is decremented in such a code path.
 	 */
-	if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 1) ||
+	if ((bfqq->last_serv_time_ns == 0 && bfqd->tot_rq_in_driver == 1) ||
 	    tot_time_ns < bfqq->last_serv_time_ns) {
 		if (bfqq->last_serv_time_ns == 0) {
 			/*
···
 			bfqq->inject_limit = max_t(unsigned int, 1, old_limit);
 		}
 		bfqq->last_serv_time_ns = tot_time_ns;
-	} else if (!bfqd->rqs_injected && bfqd->rq_in_driver == 1)
+	} else if (!bfqd->rqs_injected && bfqd->tot_rq_in_driver == 1)
 		/*
 		 * No I/O injected and no request still in service in
 		 * the drive: these are the exact conditions for
···
 	bfqd->num_groups_with_pending_reqs = 0;
 #endif

-	INIT_LIST_HEAD(&bfqd->active_list);
+	INIT_LIST_HEAD(&bfqd->active_list[0]);
+	INIT_LIST_HEAD(&bfqd->active_list[1]);
 	INIT_LIST_HEAD(&bfqd->idle_list);
 	INIT_HLIST_HEAD(&bfqd->burst_list);

···
 	bfqd->rate_dur_prod = ref_rate[blk_queue_nonrot(bfqd->queue)] *
 		ref_wr_duration[blk_queue_nonrot(bfqd->queue)];
 	bfqd->peak_rate = ref_rate[blk_queue_nonrot(bfqd->queue)] * 2 / 3;
+
+	/* see comments on the definition of next field inside bfq_data */
+	bfqd->actuator_load_threshold = 4;

 	spin_lock_init(&bfqd->lock);


+36 -3

block/bfq-iosched.h

···
 	/* number of queued requests */
 	int queued;
 	/* number of requests dispatched and waiting for completion */
-	int rq_in_driver;
+	int tot_rq_in_driver;
+	/*
+	 * number of requests dispatched and waiting for completion
+	 * for each actuator
+	 */
+	int rq_in_driver[BFQ_MAX_ACTUATORS];

 	/* true if the device is non rotational and performs queueing */
 	bool nonrot_with_queueing;
···
 	/* maximum budget allotted to a bfq_queue before rescheduling */
 	int bfq_max_budget;

-	/* list of all the bfq_queues active on the device */
-	struct list_head active_list;
+	/*
+	 * List of all the bfq_queues active for a specific actuator
+	 * on the device. Keeping active queues separate on a
+	 * per-actuator basis helps implementing per-actuator
+	 * injection more efficiently.
+	 */
+	struct list_head active_list[BFQ_MAX_ACTUATORS];
 	/* list of all the bfq_queues idle on the device */
 	struct list_head idle_list;

···
 	sector_t sector[BFQ_MAX_ACTUATORS];
 	sector_t nr_sectors[BFQ_MAX_ACTUATORS];
 	struct blk_independent_access_range ia_ranges[BFQ_MAX_ACTUATORS];
+
+	/*
+	 * If the number of I/O requests queued in the device for a
+	 * given actuator is below next threshold, then the actuator
+	 * is deemed as underutilized. If this condition is found to
+	 * hold for some actuator upon a dispatch, but (i) the
+	 * in-service queue does not contain I/O for that actuator,
+	 * while (ii) some other queue does contain I/O for that
+	 * actuator, then the head I/O request of the latter queue is
+	 * returned (injected), instead of the head request of the
+	 * currently in-service queue.
+	 *
+	 * We set the threshold, empirically, to the minimum possible
+	 * value for which an actuator is fully utilized, or close to
+	 * be fully utilized. By doing so, injected I/O 'steals' as
+	 * few drive-queue slots as possibile to the in-service
+	 * queue. This reduces as much as possible the probability
+	 * that the service of I/O from the in-service bfq_queue gets
+	 * delayed because of slot exhaustion, i.e., because all the
+	 * slots of the drive queue are filled with I/O injected from
+	 * other queues (NCQ provides for 32 slots).
+	 */
+	unsigned int actuator_load_threshold;
 };

 enum bfqq_state_flags {
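
The bookkeeping implied by the two counters in this header (a
per-actuator rq_in_driver[] array plus the aggregate tot_rq_in_driver)
can be modelled in userspace as below. This is a sketch, not kernel
code: the cnt_* names and CNT_MAX_ACTUATORS are hypothetical, and the
only property taken from the patch is that both counters are updated
together on dispatch and on completion.

```c
#define CNT_MAX_ACTUATORS 8	/* hypothetical; stands in for BFQ_MAX_ACTUATORS */

/* Hypothetical model of the two in-flight counters kept in bfq_data. */
struct cnt_model {
	int tot_rq_in_driver;			/* all in-flight requests */
	int rq_in_driver[CNT_MAX_ACTUATORS];	/* in-flight requests per actuator */
};

/* A request for actuator_idx is handed to the drive. */
static void cnt_dispatch(struct cnt_model *c, int actuator_idx)
{
	c->rq_in_driver[actuator_idx]++;
	c->tot_rq_in_driver++;
}

/* A request for actuator_idx completes. */
static void cnt_complete(struct cnt_model *c, int actuator_idx)
{
	c->rq_in_driver[actuator_idx]--;
	c->tot_rq_in_driver--;
}

/* Sum of the per-actuator counters; should always match the total. */
static int cnt_sum(const struct cnt_model *c, int num_actuators)
{
	int sum = 0;

	for (int i = 0; i < num_actuators; i++)
		sum += c->rq_in_driver[i];

	return sum;
}
```

Keeping the aggregate as a separate field lets the many existing checks
on the total number of in-flight requests stay O(1), while the
per-actuator array serves the new threshold comparison.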

+1 -1

block/bfq-wf2q.c

···
 	bfq_update_active_tree(node);

 	if (bfqq)
-		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list[bfqq->actuator_idx]);

 	bfq_inc_active_entities(entity);
 }
