Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

ubifs: Queue up space reservation tasks if retrying many times

Recently we catched ENOSPC returned by make_reservation() while doing
fsstress on UBIFS, we got following information when it occurred (See
details in Link):

UBIFS error (ubi0:0 pid 3640152): make_reservation [ubifs]: cannot
reserve 112 bytes in jhead 2, error -28
CPU: 2 PID: 3640152 Comm: kworker/u16:2 Tainted: G B W
Hardware name: Hisilicon PhosphorHi1230 EMU (DT)
Workqueue: writeback wb_workfn (flush-ubifs_0_0)
Call trace:
dump_stack+0x114/0x198
make_reservation+0x564/0x610 [ubifs]
ubifs_jnl_write_data+0x328/0x48c [ubifs]
do_writepage+0x2a8/0x3e4 [ubifs]
ubifs_writepage+0x16c/0x374 [ubifs]
generic_writepages+0xb4/0x114
do_writepages+0xcc/0x11c
writeback_sb_inodes+0x2d0/0x564
wb_writeback+0x20c/0x2b4
wb_workfn+0x404/0x510
process_one_work+0x304/0x4ac
worker_thread+0x31c/0x4e4
kthread+0x23c/0x290
Budgeting info: data budget sum 17576, total budget sum 17768
budg_data_growth 4144, budg_dd_growth 13432, budg_idx_growth 192
min_idx_lebs 13, old_idx_sz 988640, uncommitted_idx 0
page_budget 4144, inode_budget 160, dent_budget 312
nospace 0, nospace_rp 0
dark_wm 8192, dead_wm 4096, max_idx_node_sz 192
freeable_cnt 0, calc_idx_sz 988640, idx_gc_cnt 0
dirty_pg_cnt 4, dirty_zn_cnt 0, clean_zn_cnt 4811
gc_lnum 21, ihead_lnum 14
jhead 0 (GC) LEB 16
jhead 1 (base) LEB 34
jhead 2 (data) LEB 23
bud LEB 16
bud LEB 23
bud LEB 34
old bud LEB 33
old bud LEB 31
old bud LEB 15
commit state 4
Budgeting predictions:
available: 33832, outstanding 17576, free 15356
(pid 3640152) start dumping LEB properties
(pid 3640152) Lprops statistics: empty_lebs 3, idx_lebs 11
taken_empty_lebs 1, total_free 1253376, total_dirty 2445736
total_used 3438712, total_dark 65536, total_dead 17248
LEB 15 free 0 dirty 248000 used 5952 (taken)
LEB 16 free 110592 dirty 896 used 142464 (taken, jhead 0 (GC))
LEB 21 free 253952 dirty 0 used 0 (taken, GC LEB)
LEB 23 free 0 dirty 248104 used 5848 (taken, jhead 2 (data))
LEB 29 free 253952 dirty 0 used 0 (empty)
LEB 33 free 0 dirty 253952 used 0 (taken)
LEB 34 free 217088 dirty 36544 used 320 (taken, jhead 1 (base))
LEB 37 free 253952 dirty 0 used 0 (empty)
OTHERS: index lebs, zero-available non-index lebs

According to the budget algorithm, there are 5 LEBs reserved for budget:
three journal heads(16,23,34), 1 GC LEB(21) and 1 deletion LEB(can be
used in make_reservation()). There are 2 empty LEBs used for index nodes,
which is calculated as min_idx_lebs - idx_lebs = 2. In theory, LEB 15
and 33 should be reclaimed as free state after committing, but it is now
in taken state. After looking the realization of reserve_space(), there's
a possible situation:

LEB 15: free 2000 dirty 248000 used 3952 (jhead 2)
LEB 23: free 2000 dirty 248104 used 3848 (bud, taken)
LEB 33: free 2000 dirty 251952 used 0 (bud, taken)

wb_workfn wb_workfn_2
do_writepage // write 3000 bytes
ubifs_jnl_write_data
make_reservation
reserve_space
ubifs_garbage_collect
ubifs_find_dirty_leb // ret ENOSPC, dirty LEBs are taken
nospc_retries++ // 1
ubifs_run_commit
do_commit

LEB 15: free 2000 dirty 248000 used 3952 (jhead 2)
LEB 23: free 2000 dirty 248104 used 3848 (dirty)
LEB 33: free 2000 dirty 251952 used 0 (dirty)

do_writepage // write 2000 bytes for 3 times
ubifs_jnl_write_data
// grabs 15\23\33

LEB 15: free 0 dirty 248000 used 5952 (bud, taken)
LEB 23: free 0 dirty 248104 used 5848 (jhead 2)
LEB 33: free 0 dirty 253952 used 0 (bud, taken)

reserve_space
ubifs_garbage_collect
ubifs_find_dirty_leb // ret ENOSPC, dirty LEBs are taken
if (nospc_retries++ < 2) // false
ubifs_ro_mode !

Fetch a reproducer in Link.

The dirty LEBs could be grabbed by other threads, which fails finding dirty
LEBs of GC in current thread, so make_reservation() could try many times to
invoke GC&&committing, but current realization limits the times of retrying
as 'nospc_retries'(twice).
Fix it by adding a wait queue, start queuing up space reservation tasks
when someone task has retried gc + commit for many times. Then there is
only one task making space reservation at any time, and it can always make
success under the premise of correct budgeting.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218164
Fixes: 1e51764a3c2a ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>

authored by

Zhihao Cheng and committed by
Richard Weinberger
556c19f5 6379b44c

+155 -22
+149 -22
fs/ubifs/journal.c
··· 293 293 } 294 294 295 295 /** 296 + * __queue_and_wait - queue a task and wait until the task is waked up. 297 + * @c: UBIFS file-system description object 298 + * 299 + * This function adds current task in queue and waits until the task is waked 300 + * up. This function should be called with @c->reserve_space_wq locked. 301 + */ 302 + static void __queue_and_wait(struct ubifs_info *c) 303 + { 304 + DEFINE_WAIT(wait); 305 + 306 + __add_wait_queue_entry_tail_exclusive(&c->reserve_space_wq, &wait); 307 + set_current_state(TASK_UNINTERRUPTIBLE); 308 + spin_unlock(&c->reserve_space_wq.lock); 309 + 310 + schedule(); 311 + finish_wait(&c->reserve_space_wq, &wait); 312 + } 313 + 314 + /** 315 + * wait_for_reservation - try queuing current task to wait until waked up. 316 + * @c: UBIFS file-system description object 317 + * 318 + * This function queues current task to wait until waked up, if queuing is 319 + * started(@c->need_wait_space is not %0). Returns %true if current task is 320 + * added in queue, otherwise %false is returned. 321 + */ 322 + static bool wait_for_reservation(struct ubifs_info *c) 323 + { 324 + if (likely(atomic_read(&c->need_wait_space) == 0)) 325 + /* Quick path to check whether queuing is started. */ 326 + return false; 327 + 328 + spin_lock(&c->reserve_space_wq.lock); 329 + if (atomic_read(&c->need_wait_space) == 0) { 330 + /* Queuing is not started, don't queue current task. */ 331 + spin_unlock(&c->reserve_space_wq.lock); 332 + return false; 333 + } 334 + 335 + __queue_and_wait(c); 336 + return true; 337 + } 338 + 339 + /** 340 + * wake_up_reservation - wake up first task in queue or stop queuing. 341 + * @c: UBIFS file-system description object 342 + * 343 + * This function wakes up the first task in queue if it exists, or stops 344 + * queuing if no tasks in queue. 345 + */ 346 + static void wake_up_reservation(struct ubifs_info *c) 347 + { 348 + spin_lock(&c->reserve_space_wq.lock); 349 + if (waitqueue_active(&c->reserve_space_wq)) 350 + wake_up_locked(&c->reserve_space_wq); 351 + else 352 + /* 353 + * Compared with wait_for_reservation(), set @c->need_wait_space 354 + * under the protection of wait queue lock, which can avoid that 355 + * @c->need_wait_space is set to 0 after new task queued. 356 + */ 357 + atomic_set(&c->need_wait_space, 0); 358 + spin_unlock(&c->reserve_space_wq.lock); 359 + } 360 + 361 + /** 362 + * wake_up_reservation - add current task in queue or start queuing. 363 + * @c: UBIFS file-system description object 364 + * 365 + * This function starts queuing if queuing is not started, otherwise adds 366 + * current task in queue. 367 + */ 368 + static void add_or_start_queue(struct ubifs_info *c) 369 + { 370 + spin_lock(&c->reserve_space_wq.lock); 371 + if (atomic_cmpxchg(&c->need_wait_space, 0, 1) == 0) { 372 + /* Starts queuing, task can go on directly. */ 373 + spin_unlock(&c->reserve_space_wq.lock); 374 + return; 375 + } 376 + 377 + /* 378 + * There are at least two tasks have retried more than 32 times 379 + * at certain point, first task has started queuing, just queue 380 + * the left tasks. 381 + */ 382 + __queue_and_wait(c); 383 + } 384 + 385 + /** 296 386 * make_reservation - reserve journal space. 297 387 * @c: UBIFS file-system description object 298 388 * @jhead: journal head ··· 401 311 static int make_reservation(struct ubifs_info *c, int jhead, int len) 402 312 { 403 313 int err, cmt_retries = 0, nospc_retries = 0; 314 + bool blocked = wait_for_reservation(c); 404 315 405 316 again: 406 317 down_read(&c->commit_sem); 407 318 err = reserve_space(c, jhead, len); 408 - if (!err) 319 + if (!err) { 409 320 /* c->commit_sem will get released via finish_reservation(). */ 410 - return 0; 321 + goto out_wake_up; 322 + } 411 323 up_read(&c->commit_sem); 412 324 413 325 if (err == -ENOSPC) { 414 326 /* 415 327 * GC could not make any progress. We should try to commit 416 - * once because it could make some dirty space and GC would 417 - * make progress, so make the error -EAGAIN so that the below 328 + * because it could make some dirty space and GC would make 329 + * progress, so make the error -EAGAIN so that the below 418 330 * will commit and re-try. 419 331 */ 420 - if (nospc_retries++ < 2) { 421 - dbg_jnl("no space, retry"); 422 - err = -EAGAIN; 423 - } 424 - 425 - /* 426 - * This means that the budgeting is incorrect. We always have 427 - * to be able to write to the media, because all operations are 428 - * budgeted. Deletions are not budgeted, though, but we reserve 429 - * an extra LEB for them. 430 - */ 332 + nospc_retries++; 333 + dbg_jnl("no space, retry"); 334 + err = -EAGAIN; 431 335 } 432 336 433 337 if (err != -EAGAIN) ··· 433 349 */ 434 350 if (cmt_retries > 128) { 435 351 /* 436 - * This should not happen unless the journal size limitations 437 - * are too tough. 352 + * This should not happen unless: 353 + * 1. The journal size limitations are too tough. 354 + * 2. The budgeting is incorrect. We always have to be able to 355 + * write to the media, because all operations are budgeted. 356 + * Deletions are not budgeted, though, but we reserve an 357 + * extra LEB for them. 438 358 */ 439 - ubifs_err(c, "stuck in space allocation"); 359 + ubifs_err(c, "stuck in space allocation, nospc_retries %d", 360 + nospc_retries); 440 361 err = -ENOSPC; 441 362 goto out; 442 - } else if (cmt_retries > 32) 443 - ubifs_warn(c, "too many space allocation re-tries (%d)", 444 - cmt_retries); 363 + } else if (cmt_retries > 32) { 364 + /* 365 + * It's almost impossible to happen, unless there are many tasks 366 + * making reservation concurrently and someone task has retried 367 + * gc + commit for many times, generated available space during 368 + * this period are grabbed by other tasks. 369 + * But if it happens, start queuing up all tasks that will make 370 + * space reservation, then there is only one task making space 371 + * reservation at any time, and it can always make success under 372 + * the premise of correct budgeting. 373 + */ 374 + ubifs_warn(c, "too many space allocation cmt_retries (%d) " 375 + "nospc_retries (%d), start queuing tasks", 376 + cmt_retries, nospc_retries); 377 + 378 + if (!blocked) { 379 + blocked = true; 380 + add_or_start_queue(c); 381 + } 382 + } 445 383 446 384 dbg_jnl("-EAGAIN, commit and retry (retried %d times)", 447 385 cmt_retries); ··· 471 365 472 366 err = ubifs_run_commit(c); 473 367 if (err) 474 - return err; 368 + goto out_wake_up; 475 369 goto again; 476 370 477 371 out: ··· 485 379 ubifs_dump_lprops(c); 486 380 cmt_retries = dbg_check_lprops(c); 487 381 up_write(&c->commit_sem); 382 + } 383 + out_wake_up: 384 + if (blocked) { 385 + /* 386 + * Only tasks that have ever started queuing or ever been queued 387 + * can wake up other queued tasks, which can make sure that 388 + * there is only one task waked up to make space reservation. 389 + * For example: 390 + * task A task B task C 391 + * make_reservation make_reservation 392 + * reserve_space // 0 393 + * wake_up_reservation 394 + * atomic_cmpxchg // 0, start queuing 395 + * reserve_space 396 + * wait_for_reservation 397 + * __queue_and_wait 398 + * add_wait_queue 399 + * if (blocked) // false 400 + * // So that task C won't be waked up to race with task B 401 + */ 402 + wake_up_reservation(c); 488 403 } 489 404 return err; 490 405 }
+2
fs/ubifs/super.c
··· 2151 2151 mutex_init(&c->bu_mutex); 2152 2152 mutex_init(&c->write_reserve_mutex); 2153 2153 init_waitqueue_head(&c->cmt_wq); 2154 + init_waitqueue_head(&c->reserve_space_wq); 2155 + atomic_set(&c->need_wait_space, 0); 2154 2156 c->buds = RB_ROOT; 2155 2157 c->old_idx = RB_ROOT; 2156 2158 c->size_tree = RB_ROOT;
+4
fs/ubifs/ubifs.h
··· 1047 1047 * @bg_bud_bytes: number of bud bytes when background commit is initiated 1048 1048 * @old_buds: buds to be released after commit ends 1049 1049 * @max_bud_cnt: maximum number of buds 1050 + * @need_wait_space: Non %0 means space reservation tasks need to wait in queue 1051 + * @reserve_space_wq: wait queue to sleep on if @need_wait_space is not %0 1050 1052 * 1051 1053 * @commit_sem: synchronizes committer with other processes 1052 1054 * @cmt_state: commit state ··· 1307 1305 long long bg_bud_bytes; 1308 1306 struct list_head old_buds; 1309 1307 int max_bud_cnt; 1308 + atomic_t need_wait_space; 1309 + wait_queue_head_t reserve_space_wq; 1310 1310 1311 1311 struct rw_semaphore commit_sem; 1312 1312 int cmt_state;