Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'for-3.11/core' of git://git.kernel.dk/linux-block

Pull core block IO updates from Jens Axboe:
"Here are the core IO block bits for 3.11. It contains:

- A tweak to the reserved tag logic from Jan, for weirdo devices with
just 3 free tags. But for those it improves things substantially
for random writes.

- Periodic writeback fix from Jan. Marked for stable as well.

- Fix for a race condition in IO scheduler switching from Jianpeng.

- The hierarchical blk-cgroup support from Tejun. This is the grunt
of the series.

- blk-throttle fix from Vivek.

Just a note that I'm in the middle of a relocation, whole family is
flying out tomorrow. Hence I will be away the remainder of this week,
but back at work again on Monday the 15th. CC'ing Tejun, since any
potential "surprises" will most likely be from the blk-cgroup work.
But it's been brewing for a while and sitting in my tree and
linux-next for a long time, so should be solid."

* 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
elevator: Fix a race in elevator switching
block: Reserve only one queue tag for sync IO if only 3 tags are available
writeback: Fix periodic writeback after fs mount
blk-throttle: implement proper hierarchy support
blk-throttle: implement throtl_grp->has_rules[]
blk-throttle: Account for child group's start time in parent while bio climbs up
blk-throttle: add throtl_qnode for dispatch fairness
blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
blk-throttle: make blk_throtl_bio() ready for hierarchy
blk-throttle: make blk_throtl_drain() ready for hierarchy
blk-throttle: dispatch from throtl_pending_timer_fn()
blk-throttle: implement dispatch looping
blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
blk-throttle: add throtl_service_queue->parent_sq
blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
blk-throttle: move bio_lists[] and friends to throtl_service_queue
...

+910 -439
+15 -14
Documentation/cgroups/blkio-controller.txt
···

 Hierarchical Cgroups
 ====================
-- Currently only CFQ supports hierarchical groups. For throttling,
-  cgroup interface does allow creation of hierarchical cgroups and
-  internally it treats them as flat hierarchy.

-If somebody created a hierarchy like as follows.
+Both CFQ and throttling implement hierarchy support; however,
+throttling's hierarchy support is enabled iff "sane_behavior" is
+enabled from cgroup side, which currently is a development option and
+not publicly available.
+
+If somebody created a hierarchy like as follows.

         root
         /  \
···
          |
        test3

-CFQ will handle the hierarchy correctly but and throttling will
-practically treat all groups at same level. For details on CFQ
-hierarchy support, refer to Documentation/block/cfq-iosched.txt.
-Throttling will treat the hierarchy as if it looks like the
-following.
+CFQ by default and throttling with "sane_behavior" will handle the
+hierarchy correctly. For details on CFQ hierarchy support, refer to
+Documentation/block/cfq-iosched.txt. For throttling, all limits apply
+to the whole subtree while all statistics are local to the IOs
+directly generated by tasks in that cgroup.
+
+Throttling without "sane_behavior" enabled from cgroup side will
+practically treat all groups at same level as if it looks like the
+following.

               pivot
             / /  \ \
        root test1 test2 test3
-
-Nesting cgroups, while allowed, isn't officially supported and blkio
-genereates warning when cgroups nest. Once throttling implements
-hierarchy support, hierarchy will be supported and the warning will
-be removed.

 Various user visible config options
 ===================================
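One way to read the documented semantics ("all limits apply to the whole subtree") is that a group's achievable rate is capped by the tightest limit found on the group itself or any of its ancestors, while the flat mode only looks at the group's own limit. The sketch below illustrates that reading only. It is not how blk-throttle is implemented (the patches in this merge dispatch bios upward level by level), and `struct grp`, `NO_LIMIT` and `effective_read_bps()` are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "no limit configured". */
#define NO_LIMIT UINT64_MAX

/* Hypothetical, simplified group: a parent link and one read limit. */
struct grp {
	const struct grp *parent;
	uint64_t read_bps;
};

/*
 * Return the tightest read limit that applies to @g.
 * In hierarchical mode every ancestor's limit also caps @g;
 * in flat mode only @g's own limit is considered.
 */
static uint64_t effective_read_bps(const struct grp *g, bool hierarchical)
{
	uint64_t limit = g->read_bps;

	if (!hierarchical)
		return limit;

	for (g = g->parent; g; g = g->parent)
		if (g->read_bps < limit)
			limit = g->read_bps;

	return limit;
}
```

With a 16M limit on root and a 32M limit on test3 (nested under test1), the hierarchical reading caps test3 at 16M, whereas the flat reading leaves it at 32M.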
+40 -65
block/blk-cgroup.c
···

 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];

-static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
-				      struct request_queue *q, bool update_hint);
-
-/**
- * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
- * @d_blkg: loop cursor pointing to the current descendant
- * @pos_cgrp: used for iteration
- * @p_blkg: target blkg to walk descendants of
- *
- * Walk @c_blkg through the descendants of @p_blkg. Must be used with RCU
- * read locked. If called under either blkcg or queue lock, the iteration
- * is guaranteed to include all and only online blkgs. The caller may
- * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip
- * subtree.
- */
-#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
-	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
-		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp),	\
-					      (p_blkg)->q, false)))
-
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
 {
···
 	if (!blkg)
 		return;

-	for (i = 0; i < BLKCG_MAX_POLS; i++) {
-		struct blkcg_policy *pol = blkcg_policy[i];
-		struct blkg_policy_data *pd = blkg->pd[i];
-
-		if (!pd)
-			continue;
-
-		if (pol && pol->pd_exit_fn)
-			pol->pd_exit_fn(blkg);
-
-		kfree(pd);
-	}
+	for (i = 0; i < BLKCG_MAX_POLS; i++)
+		kfree(blkg->pd[i]);

 	blk_exit_rl(&blkg->rl);
 	kfree(blkg);
···
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
 		pd->plid = i;
-
-		/* invoke per-policy init */
-		if (pol->pd_init_fn)
-			pol->pd_init_fn(blkg);
 	}

 	return blkg;
···
  * @q's bypass state. If @update_hint is %true, the caller should be
  * holding @q->queue_lock and lookup hint is updated on success.
  */
-static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
-				      struct request_queue *q, bool update_hint)
+struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
+			       bool update_hint)
 {
 	struct blkcg_gq *blkg;

···
 	}
 	blkg = new_blkg;

-	/* link parent and insert */
+	/* link parent */
 	if (blkcg_parent(blkcg)) {
 		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
 		if (WARN_ON_ONCE(!blkg->parent)) {
-			blkg = ERR_PTR(-EINVAL);
+			ret = -EINVAL;
 			goto err_put_css;
 		}
 		blkg_get(blkg->parent);
 	}

+	/* invoke per-policy init */
+	for (i = 0; i < BLKCG_MAX_POLS; i++) {
+		struct blkcg_policy *pol = blkcg_policy[i];
+
+		if (blkg->pd[i] && pol->pd_init_fn)
+			pol->pd_init_fn(blkg);
+	}
+
+	/* insert */
 	spin_lock(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
 	if (likely(!ret)) {
···
 	q->root_rl.blkg = NULL;
 }

-static void blkg_rcu_free(struct rcu_head *rcu_head)
+/*
+ * A group is RCU protected, but having an rcu lock does not mean that one
+ * can access all the fields of blkg and assume these are valid. For
+ * example, don't try to follow throtl_data and request queue links.
+ *
+ * Having a reference to blkg under an rcu allows accesses to only values
+ * local to groups like group stats and group rate limits.
+ */
+void __blkg_release_rcu(struct rcu_head *rcu_head)
 {
-	blkg_free(container_of(rcu_head, struct blkcg_gq, rcu_head));
-}
+	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
+	int i;

-void __blkg_release(struct blkcg_gq *blkg)
-{
+	/* tell policies that this one is being freed */
+	for (i = 0; i < BLKCG_MAX_POLS; i++) {
+		struct blkcg_policy *pol = blkcg_policy[i];
+
+		if (blkg->pd[i] && pol->pd_exit_fn)
+			pol->pd_exit_fn(blkg);
+	}
+
 	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
-	if (blkg->parent)
+	if (blkg->parent) {
+		spin_lock_irq(blkg->q->queue_lock);
 		blkg_put(blkg->parent);
+		spin_unlock_irq(blkg->q->queue_lock);
+	}

-	/*
-	 * A group is freed in rcu manner. But having an rcu lock does not
-	 * mean that one can access all the fields of blkg and assume these
-	 * are valid. For example, don't try to follow throtl_data and
-	 * request queue links.
-	 *
-	 * Having a reference to blkg under an rcu allows acess to only
-	 * values local to groups like group stats and group rate limits
-	 */
-	call_rcu(&blkg->rcu_head, blkg_rcu_free);
+	blkg_free(blkg);
 }
-EXPORT_SYMBOL_GPL(__blkg_release);
+EXPORT_SYMBOL_GPL(__blkg_release_rcu);

 /*
  * The next function used by blk_queue_for_each_rl(). It's a bit tricky
···
 	.subsys_id = blkio_subsys_id,
 	.base_cftypes = blkcg_files,
 	.module = THIS_MODULE,
-
-	/*
-	 * blkio subsystem is utterly broken in terms of hierarchy support.
-	 * It treats all cgroups equally regardless of where they're
-	 * located in the hierarchy - all cgroups are treated as if they're
-	 * right below the root. Fix it and remove the following.
-	 */
-	.broken_hierarchy = true,
 };
 EXPORT_SYMBOL_GPL(blkio_subsys);

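The release path changed by this file can be summarised as follows: when the last reference is dropped, nothing is torn down immediately. The whole release (policy exit, dropping the parent reference, freeing) runs later from a deferred callback. The userspace sketch below mimics that ordering under loud assumptions: `struct node`, `node_put()`, `node_release()` and `grace_period_end()` are hypothetical names, and the singly linked `deferred` list is only a stand-in for RCU's callback machinery, not real RCU and not the kernel code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified object: a refcount plus a pinned parent. */
struct node {
	int refcnt;
	struct node *parent;     /* this node holds one reference on it */
	struct node *next_defer; /* link in the deferred-release list */
};

static struct node *deferred; /* stand-in for pending RCU callbacks */
static int nr_freed;          /* counts completed releases */

static struct node *node_create(struct node *parent)
{
	struct node *n = calloc(1, sizeof(*n));

	if (!n)
		return NULL;
	n->refcnt = 1;           /* creator's reference */
	n->parent = parent;
	if (parent)
		parent->refcnt++; /* child pins its parent */
	return n;
}

/* Dropping the last reference only queues the node for later release. */
static void node_put(struct node *n)
{
	if (--n->refcnt == 0) {
		n->next_defer = deferred;
		deferred = n;
	}
}

/* Deferred release: drop the parent reference, then free. */
static void node_release(struct node *n)
{
	if (n->parent)
		node_put(n->parent); /* may queue the parent in turn */
	free(n);
	nr_freed++;
}

/* Stand-in for "a grace period has elapsed": run queued releases. */
static void grace_period_end(void)
{
	while (deferred) {
		struct node *n = deferred;

		deferred = n->next_defer;
		node_release(n);
	}
}
```

In this sketch a child keeps its parent alive until the child's own deferred release has run, which mirrors the reason the diff moves the parent `blkg_put()` into the RCU callback.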
+36 -2
block/blk-cgroup.h
···
 	blkg->refcnt++;
 }

-void __blkg_release(struct blkcg_gq *blkg);
+void __blkg_release_rcu(struct rcu_head *rcu);

 /**
  * blkg_put - put a blkg reference
···
 	lockdep_assert_held(blkg->q->queue_lock);
 	WARN_ON_ONCE(blkg->refcnt <= 0);
 	if (!--blkg->refcnt)
-		__blkg_release(blkg);
+		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
 }
+
+struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
+			       bool update_hint);
+
+/**
+ * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_cgrp: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Walk @c_blkg through the descendants of @p_blkg. Must be used with RCU
+ * read locked. If called under either blkcg or queue lock, the iteration
+ * is guaranteed to include all and only online blkgs. The caller may
+ * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip
+ * subtree.
+ */
+#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
+	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
+		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp),	\
+					      (p_blkg)->q, false)))
+
+/**
+ * blkg_for_each_descendant_post - post-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_cgrp: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Similar to blkg_for_each_descendant_pre() but performs post-order
+ * traversal instead. Synchronization rules are the same.
+ */
+#define blkg_for_each_descendant_post(d_blkg, pos_cgrp, p_blkg)	\
+	cgroup_for_each_descendant_post((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
+		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp),	\
+					      (p_blkg)->q, false)))

 /**
  * blk_get_rl - get request_list to use
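The two iterator macros differ only in visiting order: pre-order reaches a group before its children, post-order reaches it after them. The sketch below shows the difference on a plain first-child/next-sibling tree. It does not use the cgroup iterators. `struct tnode`, `walk_pre()` and `walk_post()` are hypothetical names, and the choice to visit descendants only (not the starting node) is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical tree node using first-child / next-sibling links. */
struct tnode {
	int id;
	struct tnode *first_child;
	struct tnode *next_sibling;
};

/*
 * Append the ids of @parent's descendants to @out in pre-order,
 * starting at index @n, and return the new count.
 * The caller must size @out to hold every descendant.
 */
static size_t walk_pre(const struct tnode *parent, int *out, size_t n)
{
	for (const struct tnode *c = parent->first_child; c; c = c->next_sibling) {
		out[n++] = c->id;        /* visit the node first... */
		n = walk_pre(c, out, n); /* ...then its subtree */
	}
	return n;
}

/* Same walk in post-order: a node is recorded after its subtree. */
static size_t walk_post(const struct tnode *parent, int *out, size_t n)
{
	for (const struct tnode *c = parent->first_child; c; c = c->next_sibling) {
		n = walk_post(c, out, n);
		out[n++] = c->id;
	}
	return n;
}
```

For the example hierarchy from the documentation (root with children test1 and test2, test3 under test1), the pre-order walk yields test1, test3, test2 and the post-order walk yields test3, test1, test2. Post-order is the natural fit when children must be handled before their parent, for example during teardown.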
+9 -2
block/blk-tag.c
···
 	 */
 	max_depth = bqt->max_depth;
 	if (!rq_is_sync(rq) && max_depth > 1) {
-		max_depth -= 2;
-		if (!max_depth)
+		switch (max_depth) {
+		case 2:
 			max_depth = 1;
+			break;
+		case 3:
+			max_depth = 2;
+			break;
+		default:
+			max_depth -= 2;
+		}
 		if (q->in_flight[BLK_RW_ASYNC] > max_depth)
 			return 1;
 	}
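The effect of this hunk is easiest to see as a pure function from queue depth to the number of in-flight async requests allowed before the tag allocator backs off. The sketch below restates the new `switch` in isolation. `async_depth_limit()` and `async_must_wait()` are hypothetical helper names, not kernel functions, and the early return for depths of 1 or less reflects the `max_depth > 1` guard in the diff.

```c
/*
 * Hypothetical helper: how many async requests may be in flight
 * for a queue with @max_depth tags. Depths of 0 or 1 are returned
 * unchanged because the kernel code skips the check for them.
 */
static unsigned int async_depth_limit(unsigned int max_depth)
{
	if (max_depth <= 1)
		return max_depth;

	switch (max_depth) {
	case 2:
		return 1;
	case 3:
		return 2; /* previously 3 - 2 = 1, i.e. two tags reserved */
	default:
		return max_depth - 2;
	}
}

/* Nonzero when an async request should not get a tag right now. */
static int async_must_wait(unsigned int max_depth, unsigned int in_flight_async)
{
	if (max_depth <= 1)
		return 0;
	return in_flight_async > async_depth_limit(max_depth);
}
```

For a device with 3 tags the old code reserved two of them for sync IO, leaving a limit of one for async requests. The new code reserves only one, which matches the commit title "Reserve only one queue tag for sync IO if only 3 tags are available".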
+747 -323
block/blk-throttle.c
··· 25 25 26 26 /* A workqueue to queue throttle related work */ 27 27 static struct workqueue_struct *kthrotld_workqueue; 28 - static void throtl_schedule_delayed_work(struct throtl_data *td, 29 - unsigned long delay); 30 28 31 - struct throtl_rb_root { 32 - struct rb_root rb; 33 - struct rb_node *left; 34 - unsigned int count; 35 - unsigned long min_disptime; 29 + /* 30 + * To implement hierarchical throttling, throtl_grps form a tree and bios 31 + * are dispatched upwards level by level until they reach the top and get 32 + * issued. When dispatching bios from the children and local group at each 33 + * level, if the bios are dispatched into a single bio_list, there's a risk 34 + * of a local or child group which can queue many bios at once filling up 35 + * the list starving others. 36 + * 37 + * To avoid such starvation, dispatched bios are queued separately 38 + * according to where they came from. When they are again dispatched to 39 + * the parent, they're popped in round-robin order so that no single source 40 + * hogs the dispatch window. 41 + * 42 + * throtl_qnode is used to keep the queued bios separated by their sources. 43 + * Bios are queued to throtl_qnode which in turn is queued to 44 + * throtl_service_queue and then dispatched in round-robin order. 45 + * 46 + * It's also used to track the reference counts on blkg's. A qnode always 47 + * belongs to a throtl_grp and gets queued on itself or the parent, so 48 + * incrementing the reference of the associated throtl_grp when a qnode is 49 + * queued and decrementing when dequeued is enough to keep the whole blkg 50 + * tree pinned while bios are in flight. 
51 + */ 52 + struct throtl_qnode { 53 + struct list_head node; /* service_queue->queued[] */ 54 + struct bio_list bios; /* queued bios */ 55 + struct throtl_grp *tg; /* tg this qnode belongs to */ 36 56 }; 37 57 38 - #define THROTL_RB_ROOT (struct throtl_rb_root) { .rb = RB_ROOT, .left = NULL, \ 39 - .count = 0, .min_disptime = 0} 58 + struct throtl_service_queue { 59 + struct throtl_service_queue *parent_sq; /* the parent service_queue */ 60 + 61 + /* 62 + * Bios queued directly to this service_queue or dispatched from 63 + * children throtl_grp's. 64 + */ 65 + struct list_head queued[2]; /* throtl_qnode [READ/WRITE] */ 66 + unsigned int nr_queued[2]; /* number of queued bios */ 67 + 68 + /* 69 + * RB tree of active children throtl_grp's, which are sorted by 70 + * their ->disptime. 71 + */ 72 + struct rb_root pending_tree; /* RB tree of active tgs */ 73 + struct rb_node *first_pending; /* first node in the tree */ 74 + unsigned int nr_pending; /* # queued in the tree */ 75 + unsigned long first_pending_disptime; /* disptime of the first tg */ 76 + struct timer_list pending_timer; /* fires on first_pending_disptime */ 77 + }; 78 + 79 + enum tg_state_flags { 80 + THROTL_TG_PENDING = 1 << 0, /* on parent's pending tree */ 81 + THROTL_TG_WAS_EMPTY = 1 << 1, /* bio_lists[] became non-empty */ 82 + }; 40 83 41 84 #define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node) 42 85 ··· 95 52 /* must be the first member */ 96 53 struct blkg_policy_data pd; 97 54 98 - /* active throtl group service_tree member */ 55 + /* active throtl group service_queue member */ 99 56 struct rb_node rb_node; 57 + 58 + /* throtl_data this group belongs to */ 59 + struct throtl_data *td; 60 + 61 + /* this group's service queue */ 62 + struct throtl_service_queue service_queue; 63 + 64 + /* 65 + * qnode_on_self is used when bios are directly queued to this 66 + * throtl_grp so that local bios compete fairly with bios 67 + * dispatched from children. 
qnode_on_parent is used when bios are 68 + * dispatched from this throtl_grp into its parent and will compete 69 + * with the sibling qnode_on_parents and the parent's 70 + * qnode_on_self. 71 + */ 72 + struct throtl_qnode qnode_on_self[2]; 73 + struct throtl_qnode qnode_on_parent[2]; 100 74 101 75 /* 102 76 * Dispatch time in jiffies. This is the estimated time when group ··· 124 64 125 65 unsigned int flags; 126 66 127 - /* Two lists for READ and WRITE */ 128 - struct bio_list bio_lists[2]; 129 - 130 - /* Number of queued bios on READ and WRITE lists */ 131 - unsigned int nr_queued[2]; 67 + /* are there any throtl rules between this group and td? */ 68 + bool has_rules[2]; 132 69 133 70 /* bytes per second rate limits */ 134 71 uint64_t bps[2]; ··· 142 85 unsigned long slice_start[2]; 143 86 unsigned long slice_end[2]; 144 87 145 - /* Some throttle limits got updated for the group */ 146 - int limits_changed; 147 - 148 88 /* Per cpu stats pointer */ 149 89 struct tg_stats_cpu __percpu *stats_cpu; 150 90 ··· 152 98 struct throtl_data 153 99 { 154 100 /* service tree for active throtl groups */ 155 - struct throtl_rb_root tg_service_tree; 101 + struct throtl_service_queue service_queue; 156 102 157 103 struct request_queue *queue; 158 104 ··· 165 111 unsigned int nr_undestroyed_grps; 166 112 167 113 /* Work for dispatching throttled bios */ 168 - struct delayed_work throtl_work; 169 - 170 - int limits_changed; 114 + struct work_struct dispatch_work; 171 115 }; 172 116 173 117 /* list and work item to allocate percpu group stats */ ··· 174 122 175 123 static void tg_stats_alloc_fn(struct work_struct *); 176 124 static DECLARE_DELAYED_WORK(tg_stats_alloc_work, tg_stats_alloc_fn); 125 + 126 + static void throtl_pending_timer_fn(unsigned long arg); 177 127 178 128 static inline struct throtl_grp *pd_to_tg(struct blkg_policy_data *pd) 179 129 { ··· 197 143 return blkg_to_tg(td->queue->root_blkg); 198 144 } 199 145 200 - enum tg_state_flags { 201 - THROTL_TG_FLAG_on_rr = 
0, /* on round-robin busy list */ 202 - }; 203 - 204 - #define THROTL_TG_FNS(name) \ 205 - static inline void throtl_mark_tg_##name(struct throtl_grp *tg) \ 206 - { \ 207 - (tg)->flags |= (1 << THROTL_TG_FLAG_##name); \ 208 - } \ 209 - static inline void throtl_clear_tg_##name(struct throtl_grp *tg) \ 210 - { \ 211 - (tg)->flags &= ~(1 << THROTL_TG_FLAG_##name); \ 212 - } \ 213 - static inline int throtl_tg_##name(const struct throtl_grp *tg) \ 214 - { \ 215 - return ((tg)->flags & (1 << THROTL_TG_FLAG_##name)) != 0; \ 216 - } 217 - 218 - THROTL_TG_FNS(on_rr); 219 - 220 - #define throtl_log_tg(td, tg, fmt, args...) do { \ 221 - char __pbuf[128]; \ 222 - \ 223 - blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf)); \ 224 - blk_add_trace_msg((td)->queue, "throtl %s " fmt, __pbuf, ##args); \ 225 - } while (0) 226 - 227 - #define throtl_log(td, fmt, args...) \ 228 - blk_add_trace_msg((td)->queue, "throtl " fmt, ##args) 229 - 230 - static inline unsigned int total_nr_queued(struct throtl_data *td) 146 + /** 147 + * sq_to_tg - return the throl_grp the specified service queue belongs to 148 + * @sq: the throtl_service_queue of interest 149 + * 150 + * Return the throtl_grp @sq belongs to. If @sq is the top-level one 151 + * embedded in throtl_data, %NULL is returned. 152 + */ 153 + static struct throtl_grp *sq_to_tg(struct throtl_service_queue *sq) 231 154 { 232 - return td->nr_queued[0] + td->nr_queued[1]; 155 + if (sq && sq->parent_sq) 156 + return container_of(sq, struct throtl_grp, service_queue); 157 + else 158 + return NULL; 233 159 } 160 + 161 + /** 162 + * sq_to_td - return throtl_data the specified service queue belongs to 163 + * @sq: the throtl_service_queue of interest 164 + * 165 + * A service_queue can be embeded in either a throtl_grp or throtl_data. 166 + * Determine the associated throtl_data accordingly and return it. 
167 + */ 168 + static struct throtl_data *sq_to_td(struct throtl_service_queue *sq) 169 + { 170 + struct throtl_grp *tg = sq_to_tg(sq); 171 + 172 + if (tg) 173 + return tg->td; 174 + else 175 + return container_of(sq, struct throtl_data, service_queue); 176 + } 177 + 178 + /** 179 + * throtl_log - log debug message via blktrace 180 + * @sq: the service_queue being reported 181 + * @fmt: printf format string 182 + * @args: printf args 183 + * 184 + * The messages are prefixed with "throtl BLKG_NAME" if @sq belongs to a 185 + * throtl_grp; otherwise, just "throtl". 186 + * 187 + * TODO: this should be made a function and name formatting should happen 188 + * after testing whether blktrace is enabled. 189 + */ 190 + #define throtl_log(sq, fmt, args...) do { \ 191 + struct throtl_grp *__tg = sq_to_tg((sq)); \ 192 + struct throtl_data *__td = sq_to_td((sq)); \ 193 + \ 194 + (void)__td; \ 195 + if ((__tg)) { \ 196 + char __pbuf[128]; \ 197 + \ 198 + blkg_path(tg_to_blkg(__tg), __pbuf, sizeof(__pbuf)); \ 199 + blk_add_trace_msg(__td->queue, "throtl %s " fmt, __pbuf, ##args); \ 200 + } else { \ 201 + blk_add_trace_msg(__td->queue, "throtl " fmt, ##args); \ 202 + } \ 203 + } while (0) 234 204 235 205 /* 236 206 * Worker for allocating per cpu stat for tgs. This is scheduled on the ··· 293 215 goto alloc_stats; 294 216 } 295 217 218 + static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg) 219 + { 220 + INIT_LIST_HEAD(&qn->node); 221 + bio_list_init(&qn->bios); 222 + qn->tg = tg; 223 + } 224 + 225 + /** 226 + * throtl_qnode_add_bio - add a bio to a throtl_qnode and activate it 227 + * @bio: bio being added 228 + * @qn: qnode to add bio to 229 + * @queued: the service_queue->queued[] list @qn belongs to 230 + * 231 + * Add @bio to @qn and put @qn on @queued if it's not already on. 232 + * @qn->tg's reference count is bumped when @qn is activated. See the 233 + * comment on top of throtl_qnode definition for details. 
234 + */ 235 + static void throtl_qnode_add_bio(struct bio *bio, struct throtl_qnode *qn, 236 + struct list_head *queued) 237 + { 238 + bio_list_add(&qn->bios, bio); 239 + if (list_empty(&qn->node)) { 240 + list_add_tail(&qn->node, queued); 241 + blkg_get(tg_to_blkg(qn->tg)); 242 + } 243 + } 244 + 245 + /** 246 + * throtl_peek_queued - peek the first bio on a qnode list 247 + * @queued: the qnode list to peek 248 + */ 249 + static struct bio *throtl_peek_queued(struct list_head *queued) 250 + { 251 + struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node); 252 + struct bio *bio; 253 + 254 + if (list_empty(queued)) 255 + return NULL; 256 + 257 + bio = bio_list_peek(&qn->bios); 258 + WARN_ON_ONCE(!bio); 259 + return bio; 260 + } 261 + 262 + /** 263 + * throtl_pop_queued - pop the first bio form a qnode list 264 + * @queued: the qnode list to pop a bio from 265 + * @tg_to_put: optional out argument for throtl_grp to put 266 + * 267 + * Pop the first bio from the qnode list @queued. After popping, the first 268 + * qnode is removed from @queued if empty or moved to the end of @queued so 269 + * that the popping order is round-robin. 270 + * 271 + * When the first qnode is removed, its associated throtl_grp should be put 272 + * too. If @tg_to_put is NULL, this function automatically puts it; 273 + * otherwise, *@tg_to_put is set to the throtl_grp to put and the caller is 274 + * responsible for putting it. 
275 + */ 276 + static struct bio *throtl_pop_queued(struct list_head *queued, 277 + struct throtl_grp **tg_to_put) 278 + { 279 + struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node); 280 + struct bio *bio; 281 + 282 + if (list_empty(queued)) 283 + return NULL; 284 + 285 + bio = bio_list_pop(&qn->bios); 286 + WARN_ON_ONCE(!bio); 287 + 288 + if (bio_list_empty(&qn->bios)) { 289 + list_del_init(&qn->node); 290 + if (tg_to_put) 291 + *tg_to_put = qn->tg; 292 + else 293 + blkg_put(tg_to_blkg(qn->tg)); 294 + } else { 295 + list_move_tail(&qn->node, queued); 296 + } 297 + 298 + return bio; 299 + } 300 + 301 + /* init a service_queue, assumes the caller zeroed it */ 302 + static void throtl_service_queue_init(struct throtl_service_queue *sq, 303 + struct throtl_service_queue *parent_sq) 304 + { 305 + INIT_LIST_HEAD(&sq->queued[0]); 306 + INIT_LIST_HEAD(&sq->queued[1]); 307 + sq->pending_tree = RB_ROOT; 308 + sq->parent_sq = parent_sq; 309 + setup_timer(&sq->pending_timer, throtl_pending_timer_fn, 310 + (unsigned long)sq); 311 + } 312 + 313 + static void throtl_service_queue_exit(struct throtl_service_queue *sq) 314 + { 315 + del_timer_sync(&sq->pending_timer); 316 + } 317 + 296 318 static void throtl_pd_init(struct blkcg_gq *blkg) 297 319 { 298 320 struct throtl_grp *tg = blkg_to_tg(blkg); 321 + struct throtl_data *td = blkg->q->td; 322 + struct throtl_service_queue *parent_sq; 299 323 unsigned long flags; 324 + int rw; 325 + 326 + /* 327 + * If sane_hierarchy is enabled, we switch to properly hierarchical 328 + * behavior where limits on a given throtl_grp are applied to the 329 + * whole subtree rather than just the group itself. e.g. If 16M 330 + * read_bps limit is set on the root group, the whole system can't 331 + * exceed 16M for the device. 
332 + * 333 + * If sane_hierarchy is not enabled, the broken flat hierarchy 334 + * behavior is retained where all throtl_grps are treated as if 335 + * they're all separate root groups right below throtl_data. 336 + * Limits of a group don't interact with limits of other groups 337 + * regardless of the position of the group in the hierarchy. 338 + */ 339 + parent_sq = &td->service_queue; 340 + 341 + if (cgroup_sane_behavior(blkg->blkcg->css.cgroup) && blkg->parent) 342 + parent_sq = &blkg_to_tg(blkg->parent)->service_queue; 343 + 344 + throtl_service_queue_init(&tg->service_queue, parent_sq); 345 + 346 + for (rw = READ; rw <= WRITE; rw++) { 347 + throtl_qnode_init(&tg->qnode_on_self[rw], tg); 348 + throtl_qnode_init(&tg->qnode_on_parent[rw], tg); 349 + } 300 350 301 351 RB_CLEAR_NODE(&tg->rb_node); 302 - bio_list_init(&tg->bio_lists[0]); 303 - bio_list_init(&tg->bio_lists[1]); 304 - tg->limits_changed = false; 352 + tg->td = td; 305 353 306 354 tg->bps[READ] = -1; 307 355 tg->bps[WRITE] = -1; ··· 445 241 spin_unlock_irqrestore(&tg_stats_alloc_lock, flags); 446 242 } 447 243 244 + /* 245 + * Set has_rules[] if @tg or any of its parents have limits configured. 246 + * This doesn't require walking up to the top of the hierarchy as the 247 + * parent's has_rules[] is guaranteed to be correct. 248 + */ 249 + static void tg_update_has_rules(struct throtl_grp *tg) 250 + { 251 + struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq); 252 + int rw; 253 + 254 + for (rw = READ; rw <= WRITE; rw++) 255 + tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) || 256 + (tg->bps[rw] != -1 || tg->iops[rw] != -1); 257 + } 258 + 259 + static void throtl_pd_online(struct blkcg_gq *blkg) 260 + { 261 + /* 262 + * We don't want new groups to escape the limits of its ancestors. 263 + * Update has_rules[] after a new group is brought online. 
264 + */ 265 + tg_update_has_rules(blkg_to_tg(blkg)); 266 + } 267 + 448 268 static void throtl_pd_exit(struct blkcg_gq *blkg) 449 269 { 450 270 struct throtl_grp *tg = blkg_to_tg(blkg); ··· 479 251 spin_unlock_irqrestore(&tg_stats_alloc_lock, flags); 480 252 481 253 free_percpu(tg->stats_cpu); 254 + 255 + throtl_service_queue_exit(&tg->service_queue); 482 256 } 483 257 484 258 static void throtl_pd_reset_stats(struct blkcg_gq *blkg) ··· 539 309 return tg; 540 310 } 541 311 542 - static struct throtl_grp *throtl_rb_first(struct throtl_rb_root *root) 312 + static struct throtl_grp * 313 + throtl_rb_first(struct throtl_service_queue *parent_sq) 543 314 { 544 315 /* Service tree is empty */ 545 - if (!root->count) 316 + if (!parent_sq->nr_pending) 546 317 return NULL; 547 318 548 - if (!root->left) 549 - root->left = rb_first(&root->rb); 319 + if (!parent_sq->first_pending) 320 + parent_sq->first_pending = rb_first(&parent_sq->pending_tree); 550 321 551 - if (root->left) 552 - return rb_entry_tg(root->left); 322 + if (parent_sq->first_pending) 323 + return rb_entry_tg(parent_sq->first_pending); 553 324 554 325 return NULL; 555 326 } ··· 561 330 RB_CLEAR_NODE(n); 562 331 } 563 332 564 - static void throtl_rb_erase(struct rb_node *n, struct throtl_rb_root *root) 333 + static void throtl_rb_erase(struct rb_node *n, 334 + struct throtl_service_queue *parent_sq) 565 335 { 566 - if (root->left == n) 567 - root->left = NULL; 568 - rb_erase_init(n, &root->rb); 569 - --root->count; 336 + if (parent_sq->first_pending == n) 337 + parent_sq->first_pending = NULL; 338 + rb_erase_init(n, &parent_sq->pending_tree); 339 + --parent_sq->nr_pending; 570 340 } 571 341 572 - static void update_min_dispatch_time(struct throtl_rb_root *st) 342 + static void update_min_dispatch_time(struct throtl_service_queue *parent_sq) 573 343 { 574 344 struct throtl_grp *tg; 575 345 576 - tg = throtl_rb_first(st); 346 + tg = throtl_rb_first(parent_sq); 577 347 if (!tg) 578 348 return; 579 349 580 - 
st->min_disptime = tg->disptime;
+	parent_sq->first_pending_disptime = tg->disptime;
 }
 
-static void
-tg_service_tree_add(struct throtl_rb_root *st, struct throtl_grp *tg)
+static void tg_service_queue_add(struct throtl_grp *tg)
 {
-	struct rb_node **node = &st->rb.rb_node;
+	struct throtl_service_queue *parent_sq = tg->service_queue.parent_sq;
+	struct rb_node **node = &parent_sq->pending_tree.rb_node;
 	struct rb_node *parent = NULL;
 	struct throtl_grp *__tg;
 	unsigned long key = tg->disptime;
···
 	}
 
 	if (left)
-		st->left = &tg->rb_node;
+		parent_sq->first_pending = &tg->rb_node;
 
 	rb_link_node(&tg->rb_node, parent, node);
-	rb_insert_color(&tg->rb_node, &st->rb);
+	rb_insert_color(&tg->rb_node, &parent_sq->pending_tree);
 }
 
-static void __throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
+static void __throtl_enqueue_tg(struct throtl_grp *tg)
 {
-	struct throtl_rb_root *st = &td->tg_service_tree;
-
-	tg_service_tree_add(st, tg);
-	throtl_mark_tg_on_rr(tg);
-	st->count++;
+	tg_service_queue_add(tg);
+	tg->flags |= THROTL_TG_PENDING;
+	tg->service_queue.parent_sq->nr_pending++;
 }
 
-static void throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
+static void throtl_enqueue_tg(struct throtl_grp *tg)
 {
-	if (!throtl_tg_on_rr(tg))
-		__throtl_enqueue_tg(td, tg);
+	if (!(tg->flags & THROTL_TG_PENDING))
+		__throtl_enqueue_tg(tg);
 }
 
-static void __throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg)
+static void __throtl_dequeue_tg(struct throtl_grp *tg)
 {
-	throtl_rb_erase(&tg->rb_node, &td->tg_service_tree);
-	throtl_clear_tg_on_rr(tg);
+	throtl_rb_erase(&tg->rb_node, tg->service_queue.parent_sq);
+	tg->flags &= ~THROTL_TG_PENDING;
 }
 
-static void throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg)
+static void throtl_dequeue_tg(struct throtl_grp *tg)
 {
-	if (throtl_tg_on_rr(tg))
-		__throtl_dequeue_tg(td, tg);
+	if (tg->flags & THROTL_TG_PENDING)
+		__throtl_dequeue_tg(tg);
 }
 
-static void throtl_schedule_next_dispatch(struct throtl_data *td)
+/* Call with queue lock held */
+static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
+					  unsigned long expires)
 {
-	struct throtl_rb_root *st = &td->tg_service_tree;
+	mod_timer(&sq->pending_timer, expires);
+	throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
+		   expires - jiffies, jiffies);
+}
+
+/**
+ * throtl_schedule_next_dispatch - schedule the next dispatch cycle
+ * @sq: the service_queue to schedule dispatch for
+ * @force: force scheduling
+ *
+ * Arm @sq->pending_timer so that the next dispatch cycle starts on the
+ * dispatch time of the first pending child. Returns %true if either timer
+ * is armed or there's no pending child left. %false if the current
+ * dispatch window is still open and the caller should continue
+ * dispatching.
+ *
+ * If @force is %true, the dispatch timer is always scheduled and this
+ * function is guaranteed to return %true. This is to be used when the
+ * caller can't dispatch itself and needs to invoke pending_timer
+ * unconditionally. Note that forced scheduling is likely to induce short
+ * delay before dispatch starts even if @sq->first_pending_disptime is not
+ * in the future and thus shouldn't be used in hot paths.
+ */
+static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
+					  bool force)
+{
+	/* any pending children left? */
+	if (!sq->nr_pending)
+		return true;
+
+	update_min_dispatch_time(sq);
+
+	/* is the next dispatch time in the future? */
+	if (force || time_after(sq->first_pending_disptime, jiffies)) {
+		throtl_schedule_pending_timer(sq, sq->first_pending_disptime);
+		return true;
+	}
+
+	/* tell the caller to continue dispatching */
+	return false;
+}
+
+static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
+		bool rw, unsigned long start)
+{
+	tg->bytes_disp[rw] = 0;
+	tg->io_disp[rw] = 0;
 
 	/*
-	 * If there are more bios pending, schedule more work.
+	 * Previous slice has expired. We must have trimmed it after last
+	 * bio dispatch. That means since start of last slice, we never used
+	 * that bandwidth. Do try to make use of that bandwidth while giving
+	 * credit.
 	 */
-	if (!total_nr_queued(td))
-		return;
+	if (time_after_eq(start, tg->slice_start[rw]))
+		tg->slice_start[rw] = start;
 
-	BUG_ON(!st->count);
-
-	update_min_dispatch_time(st);
-
-	if (time_before_eq(st->min_disptime, jiffies))
-		throtl_schedule_delayed_work(td, 0);
-	else
-		throtl_schedule_delayed_work(td, (st->min_disptime - jiffies));
+	tg->slice_end[rw] = jiffies + throtl_slice;
+	throtl_log(&tg->service_queue,
+		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
+		   tg->slice_end[rw], jiffies);
 }
 
-static inline void
-throtl_start_new_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
+static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 {
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
 	tg->slice_start[rw] = jiffies;
 	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log_tg(td, tg, "[%c] new slice start=%lu end=%lu jiffies=%lu",
-			rw == READ ? 'R' : 'W', tg->slice_start[rw],
-			tg->slice_end[rw], jiffies);
+	throtl_log(&tg->service_queue,
+		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
+		   tg->slice_end[rw], jiffies);
 }
 
-static inline void throtl_set_slice_end(struct throtl_data *td,
-		struct throtl_grp *tg, bool rw, unsigned long jiffy_end)
+static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
+					unsigned long jiffy_end)
 {
 	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
 }
 
-static inline void throtl_extend_slice(struct throtl_data *td,
-		struct throtl_grp *tg, bool rw, unsigned long jiffy_end)
+static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
+				       unsigned long jiffy_end)
 {
 	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-	throtl_log_tg(td, tg, "[%c] extend slice start=%lu end=%lu jiffies=%lu",
-			rw == READ ? 'R' : 'W', tg->slice_start[rw],
-			tg->slice_end[rw], jiffies);
+	throtl_log(&tg->service_queue,
+		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
+		   tg->slice_end[rw], jiffies);
 }
 
 /* Determine if previously allocated or extended slice is complete or not */
-static bool
-throtl_slice_used(struct throtl_data *td, struct throtl_grp *tg, bool rw)
+static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
 {
 	if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
 		return 0;
···
 }
 
 /* Trim the used slices and adjust slice start accordingly */
-static inline void
-throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
+static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 {
 	unsigned long nr_slices, time_elapsed, io_trim;
 	u64 bytes_trim, tmp;
···
 	 * renewed. Don't try to trim the slice if slice is used. A new
 	 * slice will start when appropriate.
 	 */
-	if (throtl_slice_used(td, tg, rw))
+	if (throtl_slice_used(tg, rw))
 		return;
 
 	/*
···
 	 * is bad because it does not allow new slice to start.
 	 */
 
-	throtl_set_slice_end(td, tg, rw, jiffies + throtl_slice);
+	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
 
 	time_elapsed = jiffies - tg->slice_start[rw];
 
···
 
 	tg->slice_start[rw] += nr_slices * throtl_slice;
 
-	throtl_log_tg(td, tg, "[%c] trim slice nr=%lu bytes=%llu io=%lu"
-			" start=%lu end=%lu jiffies=%lu",
-			rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
-			tg->slice_start[rw], tg->slice_end[rw], jiffies);
+	throtl_log(&tg->service_queue,
+		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
+		   tg->slice_start[rw], tg->slice_end[rw], jiffies);
 }
 
-static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
-		struct bio *bio, unsigned long *wait)
+static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
+				  unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
 	unsigned int io_allowed;
···
 	return 0;
 }
 
-static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
-		struct bio *bio, unsigned long *wait)
+static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
+				 unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
 	u64 bytes_allowed, extra_bytes, tmp;
···
 	return 0;
 }
 
-static bool tg_no_rule_group(struct throtl_grp *tg, bool rw) {
-	if (tg->bps[rw] == -1 && tg->iops[rw] == -1)
-		return 1;
-	return 0;
-}
-
 /*
  * Returns whether one can dispatch a bio or not. Also returns approx number
  * of jiffies to wait before this bio is with-in IO rate and can be dispatched
  */
-static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
-		struct bio *bio, unsigned long *wait)
+static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
+			    unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
 	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
···
 	 * this function with a different bio if there are other bios
 	 * queued.
 	 */
-	BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
+	BUG_ON(tg->service_queue.nr_queued[rw] &&
+	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
 	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
···
 	 * existing slice to make sure it is at least throtl_slice interval
 	 * long since now.
 	 */
-	if (throtl_slice_used(td, tg, rw))
-		throtl_start_new_slice(td, tg, rw);
+	if (throtl_slice_used(tg, rw))
+		throtl_start_new_slice(tg, rw);
 	else {
 		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
-			throtl_extend_slice(td, tg, rw, jiffies + throtl_slice);
+			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
 	}
 
-	if (tg_with_in_bps_limit(td, tg, bio, &bps_wait)
-	    && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) {
+	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
+	    tg_with_in_iops_limit(tg, bio, &iops_wait)) {
 		if (wait)
 			*wait = 0;
 		return 1;
···
 		*wait = max_wait;
 
 	if (time_before(tg->slice_end[rw], jiffies + max_wait))
-		throtl_extend_slice(td, tg, rw, jiffies + max_wait);
+		throtl_extend_slice(tg, rw, jiffies + max_wait);
 
 	return 0;
 }
···
 	tg->bytes_disp[rw] += bio->bi_size;
 	tg->io_disp[rw]++;
 
-	throtl_update_dispatch_stats(tg_to_blkg(tg), bio->bi_size, bio->bi_rw);
+	/*
+	 * REQ_THROTTLED is used to prevent the same bio to be throttled
+	 * more than once as a throttled bio will go through blk-throtl the
+	 * second time when it eventually gets issued. Set it when a bio
+	 * is being charged to a tg.
+	 *
+	 * Dispatch stats aren't recursive and each @bio should only be
+	 * accounted by the @tg it was originally associated with. Let's
+	 * update the stats when setting REQ_THROTTLED for the first time
+	 * which is guaranteed to be for the @bio's original tg.
+	 */
+	if (!(bio->bi_rw & REQ_THROTTLED)) {
+		bio->bi_rw |= REQ_THROTTLED;
+		throtl_update_dispatch_stats(tg_to_blkg(tg), bio->bi_size,
+					     bio->bi_rw);
+	}
 }
 
-static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
-			      struct bio *bio)
+/**
+ * throtl_add_bio_tg - add a bio to the specified throtl_grp
+ * @bio: bio to add
+ * @qn: qnode to use
+ * @tg: the target throtl_grp
+ *
+ * Add @bio to @tg's service_queue using @qn. If @qn is not specified,
+ * tg->qnode_on_self[] is used.
+ */
+static void throtl_add_bio_tg(struct bio *bio, struct throtl_qnode *qn,
+			      struct throtl_grp *tg)
 {
+	struct throtl_service_queue *sq = &tg->service_queue;
 	bool rw = bio_data_dir(bio);
 
-	bio_list_add(&tg->bio_lists[rw], bio);
-	/* Take a bio reference on tg */
-	blkg_get(tg_to_blkg(tg));
-	tg->nr_queued[rw]++;
-	td->nr_queued[rw]++;
-	throtl_enqueue_tg(td, tg);
+	if (!qn)
+		qn = &tg->qnode_on_self[rw];
+
+	/*
+	 * If @tg doesn't currently have any bios queued in the same
+	 * direction, queueing @bio can change when @tg should be
+	 * dispatched. Mark that @tg was empty. This is automatically
+	 * cleaered on the next tg_update_disptime().
+	 */
+	if (!sq->nr_queued[rw])
+		tg->flags |= THROTL_TG_WAS_EMPTY;
+
+	throtl_qnode_add_bio(bio, qn, &sq->queued[rw]);
+
+	sq->nr_queued[rw]++;
+	throtl_enqueue_tg(tg);
 }
 
-static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
+static void tg_update_disptime(struct throtl_grp *tg)
 {
+	struct throtl_service_queue *sq = &tg->service_queue;
 	unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
 	struct bio *bio;
 
-	if ((bio = bio_list_peek(&tg->bio_lists[READ])))
-		tg_may_dispatch(td, tg, bio, &read_wait);
+	if ((bio = throtl_peek_queued(&sq->queued[READ])))
+		tg_may_dispatch(tg, bio, &read_wait);
 
-	if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
-		tg_may_dispatch(td, tg, bio, &write_wait);
+	if ((bio = throtl_peek_queued(&sq->queued[WRITE])))
+		tg_may_dispatch(tg, bio, &write_wait);
 
 	min_wait = min(read_wait, write_wait);
 	disptime = jiffies + min_wait;
 
 	/* Update dispatch time */
-	throtl_dequeue_tg(td, tg);
+	throtl_dequeue_tg(tg);
 	tg->disptime = disptime;
-	throtl_enqueue_tg(td, tg);
+	throtl_enqueue_tg(tg);
+
+	/* see throtl_add_bio_tg() */
+	tg->flags &= ~THROTL_TG_WAS_EMPTY;
 }
 
-static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
-				bool rw, struct bio_list *bl)
+static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
+					struct throtl_grp *parent_tg, bool rw)
 {
+	if (throtl_slice_used(parent_tg, rw)) {
+		throtl_start_new_slice_with_credit(parent_tg, rw,
+				child_tg->slice_start[rw]);
+	}
+
+}
+
+static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+	struct throtl_service_queue *parent_sq = sq->parent_sq;
+	struct throtl_grp *parent_tg = sq_to_tg(parent_sq);
+	struct throtl_grp *tg_to_put = NULL;
 	struct bio *bio;
 
-	bio = bio_list_pop(&tg->bio_lists[rw]);
-	tg->nr_queued[rw]--;
-	/* Drop bio reference on blkg */
-	blkg_put(tg_to_blkg(tg));
-
-	BUG_ON(td->nr_queued[rw] <= 0);
-	td->nr_queued[rw]--;
+	/*
+	 * @bio is being transferred from @tg to @parent_sq. Popping a bio
+	 * from @tg may put its reference and @parent_sq might end up
+	 * getting released prematurely. Remember the tg to put and put it
+	 * after @bio is transferred to @parent_sq.
+	 */
+	bio = throtl_pop_queued(&sq->queued[rw], &tg_to_put);
+	sq->nr_queued[rw]--;
 
 	throtl_charge_bio(tg, bio);
-	bio_list_add(bl, bio);
-	bio->bi_rw |= REQ_THROTTLED;
 
-	throtl_trim_slice(td, tg, rw);
+	/*
+	 * If our parent is another tg, we just need to transfer @bio to
+	 * the parent using throtl_add_bio_tg(). If our parent is
+	 * @td->service_queue, @bio is ready to be issued. Put it on its
+	 * bio_lists[] and decrease total number queued. The caller is
+	 * responsible for issuing these bios.
+	 */
+	if (parent_tg) {
+		throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg);
+		start_parent_slice_with_credit(tg, parent_tg, rw);
+	} else {
+		throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw],
+				     &parent_sq->queued[rw]);
+		BUG_ON(tg->td->nr_queued[rw] <= 0);
+		tg->td->nr_queued[rw]--;
+	}
+
+	throtl_trim_slice(tg, rw);
+
+	if (tg_to_put)
+		blkg_put(tg_to_blkg(tg_to_put));
 }
 
-static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
-				struct bio_list *bl)
+static int throtl_dispatch_tg(struct throtl_grp *tg)
 {
+	struct throtl_service_queue *sq = &tg->service_queue;
 	unsigned int nr_reads = 0, nr_writes = 0;
 	unsigned int max_nr_reads = throtl_grp_quantum*3/4;
 	unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
···
 
 	/* Try to dispatch 75% READS and 25% WRITES */
 
-	while ((bio = bio_list_peek(&tg->bio_lists[READ]))
-		&& tg_may_dispatch(td, tg, bio, NULL)) {
+	while ((bio = throtl_peek_queued(&sq->queued[READ])) &&
+	       tg_may_dispatch(tg, bio, NULL)) {
 
-		tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
+		tg_dispatch_one_bio(tg, bio_data_dir(bio));
 		nr_reads++;
 
 		if (nr_reads >= max_nr_reads)
 			break;
 	}
 
-	while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
-		&& tg_may_dispatch(td, tg, bio, NULL)) {
+	while ((bio = throtl_peek_queued(&sq->queued[WRITE])) &&
+	       tg_may_dispatch(tg, bio, NULL)) {
 
-		tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
+		tg_dispatch_one_bio(tg, bio_data_dir(bio));
 		nr_writes++;
 
 		if (nr_writes >= max_nr_writes)
···
 	return nr_reads + nr_writes;
 }
 
-static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
+static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 {
 	unsigned int nr_disp = 0;
-	struct throtl_grp *tg;
-	struct throtl_rb_root *st = &td->tg_service_tree;
 
 	while (1) {
-		tg = throtl_rb_first(st);
+		struct throtl_grp *tg = throtl_rb_first(parent_sq);
+		struct throtl_service_queue *sq = &tg->service_queue;
 
 		if (!tg)
 			break;
···
 		if (time_before(jiffies, tg->disptime))
 			break;
 
-		throtl_dequeue_tg(td, tg);
+		throtl_dequeue_tg(tg);
 
-		nr_disp += throtl_dispatch_tg(td, tg, bl);
+		nr_disp += throtl_dispatch_tg(tg);
 
-		if (tg->nr_queued[0] || tg->nr_queued[1]) {
-			tg_update_disptime(td, tg);
-			throtl_enqueue_tg(td, tg);
-		}
+		if (sq->nr_queued[0] || sq->nr_queued[1])
+			tg_update_disptime(tg);
 
 		if (nr_disp >= throtl_quantum)
 			break;
···
 	return nr_disp;
 }
 
-static void throtl_process_limit_change(struct throtl_data *td)
+/**
+ * throtl_pending_timer_fn - timer function for service_queue->pending_timer
+ * @arg: the throtl_service_queue being serviced
+ *
+ * This timer is armed when a child throtl_grp with active bio's become
+ * pending and queued on the service_queue's pending_tree and expires when
+ * the first child throtl_grp should be dispatched. This function
+ * dispatches bio's from the children throtl_grps to the parent
+ * service_queue.
+ *
+ * If the parent's parent is another throtl_grp, dispatching is propagated
+ * by either arming its pending_timer or repeating dispatch directly. If
+ * the top-level service_tree is reached, throtl_data->dispatch_work is
+ * kicked so that the ready bio's are issued.
+ */
+static void throtl_pending_timer_fn(unsigned long arg)
 {
+	struct throtl_service_queue *sq = (void *)arg;
+	struct throtl_grp *tg = sq_to_tg(sq);
+	struct throtl_data *td = sq_to_td(sq);
 	struct request_queue *q = td->queue;
-	struct blkcg_gq *blkg, *n;
+	struct throtl_service_queue *parent_sq;
+	bool dispatched;
+	int ret;
 
-	if (!td->limits_changed)
-		return;
+	spin_lock_irq(q->queue_lock);
+again:
+	parent_sq = sq->parent_sq;
+	dispatched = false;
 
-	xchg(&td->limits_changed, false);
+	while (true) {
+		throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
+			   sq->nr_queued[READ] + sq->nr_queued[WRITE],
+			   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
-	throtl_log(td, "limits changed");
+		ret = throtl_select_dispatch(sq);
+		if (ret) {
+			throtl_log(sq, "bios disp=%u", ret);
+			dispatched = true;
+		}
 
-	list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
-		struct throtl_grp *tg = blkg_to_tg(blkg);
+		if (throtl_schedule_next_dispatch(sq, false))
+			break;
 
-		if (!tg->limits_changed)
-			continue;
-
-		if (!xchg(&tg->limits_changed, false))
-			continue;
-
-		throtl_log_tg(td, tg, "limit change rbps=%llu wbps=%llu"
-			" riops=%u wiops=%u", tg->bps[READ], tg->bps[WRITE],
-			tg->iops[READ], tg->iops[WRITE]);
-
-		/*
-		 * Restart the slices for both READ and WRITES. It
-		 * might happen that a group's limit are dropped
-		 * suddenly and we don't want to account recently
-		 * dispatched IO with new low rate
-		 */
-		throtl_start_new_slice(td, tg, 0);
-		throtl_start_new_slice(td, tg, 1);
-
-		if (throtl_tg_on_rr(tg))
-			tg_update_disptime(td, tg);
+		/* this dispatch windows is still open, relax and repeat */
+		spin_unlock_irq(q->queue_lock);
+		cpu_relax();
+		spin_lock_irq(q->queue_lock);
 	}
+
+	if (!dispatched)
+		goto out_unlock;
+
+	if (parent_sq) {
+		/* @parent_sq is another throl_grp, propagate dispatch */
+		if (tg->flags & THROTL_TG_WAS_EMPTY) {
+			tg_update_disptime(tg);
+			if (!throtl_schedule_next_dispatch(parent_sq, false)) {
+				/* window is already open, repeat dispatching */
+				sq = parent_sq;
+				tg = sq_to_tg(sq);
+				goto again;
+			}
+		}
+	} else {
+		/* reached the top-level, queue issueing */
+		queue_work(kthrotld_workqueue, &td->dispatch_work);
+	}
+out_unlock:
+	spin_unlock_irq(q->queue_lock);
 }
 
-/* Dispatch throttled bios. Should be called without queue lock held. */
-static int throtl_dispatch(struct request_queue *q)
+/**
+ * blk_throtl_dispatch_work_fn - work function for throtl_data->dispatch_work
+ * @work: work item being executed
+ *
+ * This function is queued for execution when bio's reach the bio_lists[]
+ * of throtl_data->service_queue. Those bio's are ready and issued by this
+ * function.
+ */
+void blk_throtl_dispatch_work_fn(struct work_struct *work)
 {
-	struct throtl_data *td = q->td;
-	unsigned int nr_disp = 0;
+	struct throtl_data *td = container_of(work, struct throtl_data,
+					      dispatch_work);
+	struct throtl_service_queue *td_sq = &td->service_queue;
+	struct request_queue *q = td->queue;
 	struct bio_list bio_list_on_stack;
 	struct bio *bio;
 	struct blk_plug plug;
-
-	spin_lock_irq(q->queue_lock);
-
-	throtl_process_limit_change(td);
-
-	if (!total_nr_queued(td))
-		goto out;
+	int rw;
 
 	bio_list_init(&bio_list_on_stack);
 
-	throtl_log(td, "dispatch nr_queued=%u read=%u write=%u",
-			total_nr_queued(td), td->nr_queued[READ],
-			td->nr_queued[WRITE]);
-
-	nr_disp = throtl_select_dispatch(td, &bio_list_on_stack);
-
-	if (nr_disp)
-		throtl_log(td, "bios disp=%u", nr_disp);
-
-	throtl_schedule_next_dispatch(td);
-out:
+	spin_lock_irq(q->queue_lock);
+	for (rw = READ; rw <= WRITE; rw++)
+		while ((bio = throtl_pop_queued(&td_sq->queued[rw], NULL)))
+			bio_list_add(&bio_list_on_stack, bio);
 	spin_unlock_irq(q->queue_lock);
 
-	/*
-	 * If we dispatched some requests, unplug the queue to make sure
-	 * immediate dispatch
-	 */
-	if (nr_disp) {
+	if (!bio_list_empty(&bio_list_on_stack)) {
 		blk_start_plug(&plug);
 		while((bio = bio_list_pop(&bio_list_on_stack)))
 			generic_make_request(bio);
 		blk_finish_plug(&plug);
-	}
-	return nr_disp;
-}
-
-void blk_throtl_work(struct work_struct *work)
-{
-	struct throtl_data *td = container_of(work, struct throtl_data,
-					      throtl_work.work);
-	struct request_queue *q = td->queue;
-
-	throtl_dispatch(q);
-}
-
-/* Call with queue lock held */
-static void
-throtl_schedule_delayed_work(struct throtl_data *td, unsigned long delay)
-{
-
-	struct delayed_work *dwork = &td->throtl_work;
-
-	/* schedule work if limits changed even if no bio is queued */
-	if (total_nr_queued(td) || td->limits_changed) {
-		mod_delayed_work(kthrotld_workqueue, dwork, delay);
-		throtl_log(td, "schedule work. delay=%lu jiffies=%lu",
-				delay, jiffies);
 	}
 }
 
···
 	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
-	struct throtl_data *td;
+	struct throtl_service_queue *sq;
+	struct blkcg_gq *blkg;
+	struct cgroup *pos_cgrp;
 	int ret;
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
···
 		return ret;
 
 	tg = blkg_to_tg(ctx.blkg);
-	td = ctx.blkg->q->td;
+	sq = &tg->service_queue;
 
 	if (!ctx.v)
 		ctx.v = -1;
···
 	else
 		*(unsigned int *)((void *)tg + cft->private) = ctx.v;
 
-	/* XXX: we don't need the following deferred processing */
-	xchg(&tg->limits_changed, true);
-	xchg(&td->limits_changed, true);
-	throtl_schedule_delayed_work(td, 0);
+	throtl_log(&tg->service_queue,
+		   "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
+		   tg->bps[READ], tg->bps[WRITE],
+		   tg->iops[READ], tg->iops[WRITE]);
+
+	/*
+	 * Update has_rules[] flags for the updated tg's subtree. A tg is
+	 * considered to have rules if either the tg itself or any of its
+	 * ancestors has rules. This identifies groups without any
+	 * restrictions in the whole hierarchy and allows them to bypass
+	 * blk-throttle.
+	 */
+	tg_update_has_rules(tg);
+	blkg_for_each_descendant_pre(blkg, pos_cgrp, ctx.blkg)
+		tg_update_has_rules(blkg_to_tg(blkg));
+
+	/*
+	 * We're already holding queue_lock and know @tg is valid. Let's
+	 * apply the new config directly.
+	 *
+	 * Restart the slices for both READ and WRITES. It might happen
+	 * that a group's limit are dropped suddenly and we don't want to
+	 * account recently dispatched IO with new low rate.
+	 */
+	throtl_start_new_slice(tg, 0);
+	throtl_start_new_slice(tg, 1);
+
+	if (tg->flags & THROTL_TG_PENDING) {
+		tg_update_disptime(tg);
+		throtl_schedule_next_dispatch(sq->parent_sq, true);
+	}
 
 	blkg_conf_finish(&ctx);
 	return 0;
···
 {
 	struct throtl_data *td = q->td;
 
-	cancel_delayed_work_sync(&td->throtl_work);
+	cancel_work_sync(&td->dispatch_work);
 }
 
 static struct blkcg_policy blkcg_policy_throtl = {
···
 	.cftypes		= throtl_files,
 
 	.pd_init_fn		= throtl_pd_init,
+	.pd_online_fn		= throtl_pd_online,
 	.pd_exit_fn		= throtl_pd_exit,
 	.pd_reset_stats_fn	= throtl_pd_reset_stats,
 };
···
 bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 {
 	struct throtl_data *td = q->td;
+	struct throtl_qnode *qn = NULL;
 	struct throtl_grp *tg;
-	bool rw = bio_data_dir(bio), update_disptime = true;
+	struct throtl_service_queue *sq;
+	bool rw = bio_data_dir(bio);
 	struct blkcg *blkcg;
 	bool throttled = false;
 
-	if (bio->bi_rw & REQ_THROTTLED) {
-		bio->bi_rw &= ~REQ_THROTTLED;
+	/* see throtl_charge_bio() */
+	if (bio->bi_rw & REQ_THROTTLED)
 		goto out;
-	}
 
 	/*
 	 * A throtl_grp pointer retrieved under rcu can be used to access
···
 	blkcg = bio_blkcg(bio);
 	tg = throtl_lookup_tg(td, blkcg);
 	if (tg) {
-		if (tg_no_rule_group(tg, rw)) {
+		if (!tg->has_rules[rw]) {
 			throtl_update_dispatch_stats(tg_to_blkg(tg),
 					bio->bi_size, bio->bi_rw);
 			goto out_unlock_rcu;
···
 	if (unlikely(!tg))
 		goto out_unlock;
 
-	if (tg->nr_queued[rw]) {
-		/*
-		 * There is already another bio queued in same dir. No
-		 * need to update dispatch time.
-		 */
-		update_disptime = false;
-		goto queue_bio;
+	sq = &tg->service_queue;
 
-	}
+	while (true) {
+		/* throtl is FIFO - if bios are already queued, should queue */
+		if (sq->nr_queued[rw])
+			break;
 
-	/* Bio is with-in rate limit of group */
-	if (tg_may_dispatch(td, tg, bio, NULL)) {
+		/* if above limits, break to queue */
+		if (!tg_may_dispatch(tg, bio, NULL))
+			break;
+
+		/* within limits, let's charge and dispatch directly */
 		throtl_charge_bio(tg, bio);
 
 		/*
···
 		 *
 		 * So keep on trimming slice even if bio is not queued.
 		 */
-		throtl_trim_slice(td, tg, rw);
-		goto out_unlock;
+		throtl_trim_slice(tg, rw);
+
+		/*
+		 * @bio passed through this layer without being throttled.
+		 * Climb up the ladder. If we''re already at the top, it
+		 * can be executed directly.
+		 */
+		qn = &tg->qnode_on_parent[rw];
+		sq = sq->parent_sq;
+		tg = sq_to_tg(sq);
+		if (!tg)
+			goto out_unlock;
 	}
 
-queue_bio:
-	throtl_log_tg(td, tg, "[%c] bio. bdisp=%llu sz=%u bps=%llu"
-			" iodisp=%u iops=%u queued=%d/%d",
-			rw == READ ? 'R' : 'W',
-			tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
-			tg->io_disp[rw], tg->iops[rw],
-			tg->nr_queued[READ], tg->nr_queued[WRITE]);
+	/* out-of-limit, queue to @tg */
+	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
+		   rw == READ ? 'R' : 'W',
+		   tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
+		   tg->io_disp[rw], tg->iops[rw],
+		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-	throtl_add_bio_tg(q->td, tg, bio);
+	tg->td->nr_queued[rw]++;
+	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
 
-	if (update_disptime) {
-		tg_update_disptime(td, tg);
-		throtl_schedule_next_dispatch(td);
+	/*
+	 * Update @tg's dispatch time and force schedule dispatch if @tg
+	 * was empty before @bio. The forced scheduling isn't likely to
+	 * cause undue delay as @bio is likely to be dispatched directly if
+	 * its @tg's disptime is not in the future.
+	 */
+	if (tg->flags & THROTL_TG_WAS_EMPTY) {
+		tg_update_disptime(tg);
+		throtl_schedule_next_dispatch(tg->service_queue.parent_sq, true);
 	}
 
 out_unlock:
···
 out_unlock_rcu:
 	rcu_read_unlock();
 out:
+	/*
+	 * As multiple blk-throtls may stack in the same issue path, we
+	 * don't want bios to leave with the flag set. Clear the flag if
+	 * being issued.
+	 */
+	if (!throttled)
+		bio->bi_rw &= ~REQ_THROTTLED;
 	return throttled;
+}
+
+/*
+ * Dispatch all bios from all children tg's queued on @parent_sq. On
+ * return, @parent_sq is guaranteed to not have any active children tg's
+ * and all bios from previously active tg's are on @parent_sq->bio_lists[].
+ */
+static void tg_drain_bios(struct throtl_service_queue *parent_sq)
+{
+	struct throtl_grp *tg;
+
+	while ((tg = throtl_rb_first(parent_sq))) {
+		struct throtl_service_queue *sq = &tg->service_queue;
+		struct bio *bio;
+
+		throtl_dequeue_tg(tg);
+
+		while ((bio = throtl_peek_queued(&sq->queued[READ])))
+			tg_dispatch_one_bio(tg, bio_data_dir(bio));
+		while ((bio = throtl_peek_queued(&sq->queued[WRITE])))
+			tg_dispatch_one_bio(tg, bio_data_dir(bio));
+	}
 }
 
 /**
···
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
 	struct throtl_data *td = q->td;
-	struct throtl_rb_root *st = &td->tg_service_tree;
-	struct throtl_grp *tg;
-	struct bio_list bl;
+	struct blkcg_gq *blkg;
+	struct cgroup *pos_cgrp;
 	struct bio *bio;
+	int rw;
 
 	queue_lockdep_assert_held(q);
+	rcu_read_lock();
 
-	bio_list_init(&bl);
+	/*
+	 * Drain each tg while doing post-order walk on the blkg tree, so
+	 * that all bios are propagated to td->service_queue. It'd be
+	 * better to walk service_queue tree directly but blkg walk is
+	 * easier.
+	 */
+	blkg_for_each_descendant_post(blkg, pos_cgrp, td->queue->root_blkg)
+		tg_drain_bios(&blkg_to_tg(blkg)->service_queue);
 
-	while ((tg = throtl_rb_first(st))) {
-		throtl_dequeue_tg(td, tg);
+	tg_drain_bios(&td_root_tg(td)->service_queue);
 
-		while ((bio = bio_list_peek(&tg->bio_lists[READ])))
-			tg_dispatch_one_bio(td, tg, bio_data_dir(bio), &bl);
-		while ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
-			tg_dispatch_one_bio(td, tg, bio_data_dir(bio), &bl);
-	}
+	/* finally, transfer bios from top-level tg's into the td */
+	tg_drain_bios(&td->service_queue);
+
+	rcu_read_unlock();
 	spin_unlock_irq(q->queue_lock);
 
-	while ((bio = bio_list_pop(&bl)))
-		generic_make_request(bio);
+	/* all bios now should be in td->service_queue, issue them */
+	for (rw = READ; rw <= WRITE; rw++)
+		while ((bio = throtl_pop_queued(&td->service_queue.queued[rw],
+						NULL)))
+			generic_make_request(bio);
 
 	spin_lock_irq(q->queue_lock);
 }
···
 	if (!td)
 		return -ENOMEM;
 
-	td->tg_service_tree = THROTL_RB_ROOT;
-	td->limits_changed = false;
-	INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);
+	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
+	throtl_service_queue_init(&td->service_queue, NULL);
 
 	q->td = td;
 	td->queue = q;
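The new loop in blk_throtl_bio() charges a bio at every level of the hierarchy that admits it and queues it at the first level that does not. A minimal userspace sketch of that control flow follows. It is not kernel code: `struct group_model`, its `budget` field, and `climb_model()` are hypothetical stand-ins for `struct throtl_grp`, the slice accounting behind tg_may_dispatch(), and the `while (true)` loop above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct throtl_grp. */
struct group_model {
	struct group_model *parent;	/* NULL for a top-level group */
	unsigned int budget;		/* bytes still allowed in this slice */
	unsigned int nr_queued;		/* bios already waiting at this level */
	unsigned int charged;		/* bytes charged to this level so far */
};

/*
 * Walk from @g towards the root. At each level the bio is either
 * charged and passed up, or it stops and is queued there. Returns the
 * group the bio was queued on, or NULL if it cleared every level and
 * can be issued directly.
 */
static struct group_model *climb_model(struct group_model *g,
				       unsigned int size)
{
	while (g) {
		/* FIFO: never overtake bios that are already queued */
		if (g->nr_queued)
			break;

		/* over this level's limit: queue here */
		if (size > g->budget)
			break;

		/* within limits: charge this level and climb */
		g->budget -= size;
		g->charged += size;
		g = g->parent;
	}

	if (g)
		g->nr_queued++;
	return g;
}
```

In this model, as in the diff, a bio that stops partway up has already been charged to every level below the one it queues on.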
+15 -4
block/cfq-iosched.c
···
 	kfree(cfqd);
 }

-static int cfq_init_queue(struct request_queue *q)
+static int cfq_init_queue(struct request_queue *q, struct elevator_type *e)
 {
 	struct cfq_data *cfqd;
 	struct blkcg_gq *blkg __maybe_unused;
 	int i, ret;
+	struct elevator_queue *eq;

-	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (!cfqd)
+	eq = elevator_alloc(q, e);
+	if (!eq)
 		return -ENOMEM;

+	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (!cfqd) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = cfqd;
+
 	cfqd->queue = q;
-	q->elevator->elevator_data = cfqd;
+	spin_lock_irq(q->queue_lock);
+	q->elevator = eq;
+	spin_unlock_irq(q->queue_lock);

 	/* Init root service tree */
 	cfqd->grp_service_tree = CFQ_RB_ROOT;
···

 out_free:
 	kfree(cfqd);
+	kobject_put(&eq->kobj);
 	return ret;
 }

+13 -3
block/deadline-iosched.c
···
 /*
  * initialize elevator private data (deadline_data).
  */
-static int deadline_init_queue(struct request_queue *q)
+static int deadline_init_queue(struct request_queue *q, struct elevator_type *e)
 {
 	struct deadline_data *dd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;

 	dd = kmalloc_node(sizeof(*dd), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (!dd)
+	if (!dd) {
+		kobject_put(&eq->kobj);
 		return -ENOMEM;
+	}
+	eq->elevator_data = dd;

 	INIT_LIST_HEAD(&dd->fifo_list[READ]);
 	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
···
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;

-	q->elevator->elevator_data = dd;
+	spin_lock_irq(q->queue_lock);
+	q->elevator = eq;
+	spin_unlock_irq(q->queue_lock);
 	return 0;
 }

+5 -20
block/elevator.c
···

 static struct kobj_type elv_ktype;

-static struct elevator_queue *elevator_alloc(struct request_queue *q,
+struct elevator_queue *elevator_alloc(struct request_queue *q,
 				  struct elevator_type *e)
 {
 	struct elevator_queue *eq;
···
 	elevator_put(e);
 	return NULL;
 }
+EXPORT_SYMBOL(elevator_alloc);

 static void elevator_release(struct kobject *kobj)
 {
···
 		}
 	}

-	q->elevator = elevator_alloc(q, e);
-	if (!q->elevator)
-		return -ENOMEM;
-
-	err = e->ops.elevator_init_fn(q);
-	if (err) {
-		kobject_put(&q->elevator->kobj);
-		return err;
-	}
-
+	err = e->ops.elevator_init_fn(q, e);
 	return 0;
 }
 EXPORT_SYMBOL(elevator_init);
···
 	spin_unlock_irq(q->queue_lock);

 	/* allocate, init and register new elevator */
-	err = -ENOMEM;
-	q->elevator = elevator_alloc(q, new_e);
-	if (!q->elevator)
+	err = new_e->ops.elevator_init_fn(q, new_e);
+	if (err)
 		goto fail_init;
-
-	err = new_e->ops.elevator_init_fn(q);
-	if (err) {
-		kobject_put(&q->elevator->kobj);
-		goto fail_init;
-	}

 	if (registered) {
 		err = elv_register_queue(q);
+15 -4
block/noop-iosched.c
···
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }

-static int noop_init_queue(struct request_queue *q)
+static int noop_init_queue(struct request_queue *q, struct elevator_type *e)
 {
 	struct noop_data *nd;
+	struct elevator_queue *eq;

-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
+	eq = elevator_alloc(q, e);
+	if (!eq)
 		return -ENOMEM;

+	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
+	if (!nd) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = nd;
+
 	INIT_LIST_HEAD(&nd->queue);
-	q->elevator->elevator_data = nd;
+
+	spin_lock_irq(q->queue_lock);
+	q->elevator = eq;
+	spin_unlock_irq(q->queue_lock);
 	return 0;
 }

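Note: the cfq, deadline and noop init functions all follow the same shape after this change: allocate the elevator queue, allocate the private data, undo the first allocation if the second fails, and only then publish the pointer on the queue while holding the queue lock. A minimal userspace sketch of that shape, under stated assumptions, is given here. `toy_queue`, `toy_elevator` and `toy_init_queue()` are hypothetical names, and a pthread mutex stands in for `q->queue_lock`; none of this is kernel API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct toy_elevator {
	void *data;			/* scheduler-private data */
};

struct toy_queue {
	pthread_mutex_t lock;		/* plays the role of q->queue_lock */
	struct toy_elevator *elevator;	/* plays the role of q->elevator */
};

/*
 * Build the elevator and its private data completely, and only then
 * publish the pointer under the lock. On failure the queue is untouched.
 * Returns 0 on success, -1 on allocation failure.
 */
static int toy_init_queue(struct toy_queue *q, size_t data_size)
{
	struct toy_elevator *eq;

	assert(q != NULL);

	eq = calloc(1, sizeof(*eq));
	if (!eq)
		return -1;

	eq->data = calloc(1, data_size);
	if (!eq->data) {
		free(eq);		/* undo the first allocation */
		return -1;
	}

	pthread_mutex_lock(&q->lock);
	q->elevator = eq;		/* publish the finished object */
	pthread_mutex_unlock(&q->lock);
	return 0;
}
```

The point of the ordering is that a reader who takes the lock sees either the old pointer or a fully initialized new one, never an elevator whose private data is still unset.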
+8 -1
fs/block_dev.c
···
 				struct backing_dev_info *dst)
 {
 	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+	bool wakeup_bdi = false;

 	if (unlikely(dst == old))		/* deadlock avoidance */
 		return;
 	bdi_lock_two(&old->wb, &dst->wb);
 	spin_lock(&inode->i_lock);
 	inode->i_data.backing_dev_info = dst;
-	if (inode->i_state & I_DIRTY)
+	if (inode->i_state & I_DIRTY) {
+		if (bdi_cap_writeback_dirty(dst) && !wb_has_dirty_io(&dst->wb))
+			wakeup_bdi = true;
 		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+	}
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&old->wb.list_lock);
 	spin_unlock(&dst->wb.list_lock);
+
+	if (wakeup_bdi)
+		bdi_wakeup_thread_delayed(dst);
 }

 /* Kill _all_ buffers and pagecache , dirty or not.. */
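Note: the hunk above decides whether a wakeup is needed while the locks are held, but performs the wakeup only after they are released. A wakeup is needed only when the destination can write back dirty data and had no dirty I/O before this inode was moved. A toy sketch of that decide-then-act split follows. `toy_bdi` and `toy_move_dirty_inode()` are hypothetical, the locked region is marked by comments only, and a counter stands in for `bdi_wakeup_thread_delayed()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a backing device; not the kernel's struct. */
struct toy_bdi {
	bool can_writeback;	/* models bdi_cap_writeback_dirty() */
	int dirty_inodes;	/* non-zero models wb_has_dirty_io() */
	int wakeups;		/* counts simulated flusher wakeups */
};

/* Move one dirty inode to @dst; returns true if a wakeup was issued. */
static bool toy_move_dirty_inode(struct toy_bdi *dst)
{
	bool wakeup = false;

	assert(dst != NULL);

	/* --- locks would be taken here --- */
	if (dst->can_writeback && dst->dirty_inodes == 0)
		wakeup = true;	/* first dirty inode on an idle device */
	dst->dirty_inodes++;
	/* --- locks would be dropped here --- */

	if (wakeup)
		dst->wakeups++;	/* act only after the locks are released */
	return wakeup;
}
```

In this sketch only the first move onto an idle device triggers a wakeup; later moves find dirty I/O already present and skip it.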
+2
include/linux/cgroup.h
···
 	 *
 	 * - memcg: use_hierarchy is on by default and the cgroup file for
 	 *   the flag is not created.
+	 *
+	 * - blkcg: blk-throttle becomes properly hierarchical.
 	 */
 	CGRP_ROOT_SANE_BEHAVIOR	= (1 << 0),

+5 -1
include/linux/elevator.h
···
 #ifdef CONFIG_BLOCK

 struct io_cq;
+struct elevator_type;

 typedef int (elevator_merge_fn) (struct request_queue *, struct request **,
 				 struct bio *);
···
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);

-typedef int (elevator_init_fn) (struct request_queue *);
+typedef int (elevator_init_fn) (struct request_queue *,
+				struct elevator_type *e);
 typedef void (elevator_exit_fn) (struct elevator_queue *);

 struct elevator_ops
···
 extern void elevator_exit(struct elevator_queue *);
 extern int elevator_change(struct request_queue *, const char *);
 extern bool elv_rq_merge_ok(struct request *, struct bio *);
+extern struct elevator_queue *elevator_alloc(struct request_queue *,
+					struct elevator_type *);

 /*
  * Helper functions.