Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fixes from Dave Chinner:
"This update fixes a warning in the new pagecache_isize_extended() and
updates some related comments, another fix for zero-range
misbehaviour, and an unfortunately large set of fixes for regressions
in the bulkstat code.

The bulkstat fixes are large but necessary. I wouldn't normally push
such a rework for a -rcX update, but right now xfsdump can silently
create incomplete dumps on 3.17 and it's possible that even xfsrestore
won't notice that the dumps were incomplete. Hence we need to get
this update into 3.17-stable kernels ASAP.

In more detail, the refactoring work I committed in 3.17 has exposed a
major hole in our QA coverage. With both xfsdump (the major user of
bulkstat) and xfsrestore silently ignoring missing files in the
dump/restore process, incomplete dumps were going unnoticed if they
were being triggered. Many of the dump/restore filesets were so small
that they didn't even have a chance of triggering the loop iteration
bugs we introduced in 3.17, so we didn't exercise the code
sufficiently, either.

We have already taken steps to improve QA coverage in xfstests to
avoid this happening again, and I've done a lot of manual verification
of dump/restore on very large data sets (tens of millions of inodes)
over the past week to verify this patch set results in bulkstat behaving
the same way as it does on 3.16.

Unfortunately, the fixes are not exactly simple - in tracking down the
problem, historic API warts were discovered (e.g. xfsdump has been
working around a 20 year old bug in the bulkstat API for the past 10
years) and so that complicated the process of diagnosing and fixing
the problems. i.e. we had to fix bugs in the code as well as
discover and re-introduce the userspace visible API bugs that we
unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly.

Summary:

- removes incorrect warnings about i_mutex locking in
pagecache_isize_extended() and updates comments to match the expected locking
- another zero-range bug fix for stray file size updates
- a bunch of fixes for regressions in the bulkstat code introduced in
3.17"

* tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
xfs: track bulkstat progress by agino
xfs: bulkstat error handling is broken
xfs: bulkstat main loop logic is a mess
xfs: bulkstat chunk-formatter has issues
xfs: bulkstat chunk formatting cursor is broken
xfs: bulkstat btree walk doesn't terminate
mm: Fix comment before truncate_setsize()
xfs: rework zero range to prevent invalid i_size updates
mm: Remove false WARN_ON from pagecache_isize_extended()
xfs: Check error during inode btree iteration in xfs_bulkstat()
xfs: bulkstat doesn't release AGI buffer on error

+148 -202
+20 -52
fs/xfs/xfs_bmap_util.c
··· 1338 1338
 		goto out;
 	}
 
-
+/*
+ * Preallocate and zero a range of a file. This mechanism has the allocation
+ * semantics of fallocate and in addition converts data in the range to zeroes.
+ */
 int
 xfs_zero_file_space(
 	struct xfs_inode	*ip,
··· 1349 1346
 	xfs_off_t		len)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	uint			granularity;
-	xfs_off_t		start_boundary;
-	xfs_off_t		end_boundary;
+	uint			blksize;
 	int			error;
 
 	trace_xfs_zero_file_space(ip);
 
-	granularity = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
+	blksize = 1 << mp->m_sb.sb_blocklog;
 
 	/*
-	 * Round the range of extents we are going to convert inwards. If the
-	 * offset is aligned, then it doesn't get changed so we zero from the
-	 * start of the block offset points to.
+	 * Punch a hole and prealloc the range. We use hole punch rather than
+	 * unwritten extent conversion for two reasons:
+	 *
+	 * 1.) Hole punch handles partial block zeroing for us.
+	 *
+	 * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
+	 * by virtue of the hole punch.
 	 */
-	start_boundary = round_up(offset, granularity);
-	end_boundary = round_down(offset + len, granularity);
+	error = xfs_free_file_space(ip, offset, len);
+	if (error)
+		goto out;
 
-	ASSERT(start_boundary >= offset);
-	ASSERT(end_boundary <= offset + len);
-
-	if (start_boundary < end_boundary - 1) {
-		/*
-		 * Writeback the range to ensure any inode size updates due to
-		 * appending writes make it to disk (otherwise we could just
-		 * punch out the delalloc blocks).
-		 */
-		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-				start_boundary, end_boundary - 1);
-		if (error)
-			goto out;
-		truncate_pagecache_range(VFS_I(ip), start_boundary,
-					 end_boundary - 1);
-
-		/* convert the blocks */
-		error = xfs_alloc_file_space(ip, start_boundary,
-					end_boundary - start_boundary - 1,
-					XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT);
-		if (error)
-			goto out;
-
-		/* We've handled the interior of the range, now for the edges */
-		if (start_boundary != offset) {
-			error = xfs_iozero(ip, offset, start_boundary - offset);
-			if (error)
-				goto out;
-		}
-
-		if (end_boundary != offset + len)
-			error = xfs_iozero(ip, end_boundary,
-					   offset + len - end_boundary);
-
-	} else {
-		/*
-		 * It's either a sub-granularity range or the range spanned lies
-		 * partially across two adjacent blocks.
-		 */
-		error = xfs_iozero(ip, offset, len);
-	}
-
+	error = xfs_alloc_file_space(ip, round_down(offset, blksize),
+				     round_up(offset + len, blksize) -
+				     round_down(offset, blksize),
+				     XFS_BMAPI_PREALLOC);
 out:
 	return error;

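The reworked xfs_zero_file_space() above punches out exactly the requested byte range and then preallocates a range widened outwards to filesystem block boundaries. The rounding arithmetic can be sketched in userspace roughly as follows. This is an illustrative model only: align_down(), align_up(), struct prealloc_range and zero_range_prealloc() are hypothetical names standing in for the kernel's round_down()/round_up() usage, and a power-of-two block size is assumed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-ins for the kernel's round_down() and
 * round_up() macros. Both assume `align` is a power of two.
 */
static uint64_t align_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Block-aligned range handed to the preallocation step (illustrative). */
struct prealloc_range {
	uint64_t start;	/* first byte, rounded down to a block boundary */
	uint64_t len;	/* length in bytes, a whole number of blocks */
};

/*
 * Widen [offset, offset + len) outwards to block boundaries, mirroring the
 * arguments the patched code passes to xfs_alloc_file_space().
 */
static struct prealloc_range
zero_range_prealloc(uint64_t offset, uint64_t len, uint64_t blksize)
{
	struct prealloc_range r;

	/* The bit-mask rounding above is only valid for powers of two. */
	assert(blksize != 0 && (blksize & (blksize - 1)) == 0);

	r.start = align_down(offset, blksize);
	r.len = align_up(offset + len, blksize) - r.start;
	return r;
}
```

For example, with a 4096-byte block size, zeroing 3000 bytes at offset 5000 yields a preallocation of the single block covering [4096, 8192). The old code rounded inwards and zeroed the edges separately; rounding outwards is safe here only because the preceding hole punch has already zeroed the partial blocks.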
+125 -131
fs/xfs/xfs_itable.c
··· 236 236
 	XFS_WANT_CORRUPTED_RETURN(stat == 1);
 
 	/* Check if the record contains the inode in request */
-	if (irec->ir_startino + XFS_INODES_PER_CHUNK <= agino)
-		return -EINVAL;
+	if (irec->ir_startino + XFS_INODES_PER_CHUNK <= agino) {
+		*icount = 0;
+		return 0;
+	}
 
 	idx = agino - irec->ir_startino + 1;
 	if (idx < XFS_INODES_PER_CHUNK &&
··· 264 262
 
 #define XFS_BULKSTAT_UBLEFT(ubleft)	((ubleft) >= statstruct_size)
 
+struct xfs_bulkstat_agichunk {
+	char		__user **ac_ubuffer;/* pointer into user's buffer */
+	int		ac_ubleft;	/* bytes left in user's buffer */
+	int		ac_ubelem;	/* spaces used in user's buffer */
+};
+
 /*
  * Process inodes in chunk with a pointer to a formatter function
  * that will iget the inode and fill in the appropriate structure.
  */
-int
+static int
 xfs_bulkstat_ag_ichunk(
 	struct xfs_mount		*mp,
 	xfs_agnumber_t			agno,
 	struct xfs_inobt_rec_incore	*irbp,
 	bulkstat_one_pf			formatter,
 	size_t				statstruct_size,
-	struct xfs_bulkstat_agichunk	*acp)
+	struct xfs_bulkstat_agichunk	*acp,
+	xfs_agino_t			*last_agino)
 {
-	xfs_ino_t			lastino = acp->ac_lastino;
 	char				__user **ubufp = acp->ac_ubuffer;
-	int				ubleft = acp->ac_ubleft;
-	int				ubelem = acp->ac_ubelem;
-	int				chunkidx, clustidx;
+	int				chunkidx;
 	int				error = 0;
-	xfs_agino_t			agino;
+	xfs_agino_t			agino = irbp->ir_startino;
 
-	for (agino = irbp->ir_startino, chunkidx = clustidx = 0;
-	     XFS_BULKSTAT_UBLEFT(ubleft) &&
-	     irbp->ir_freecount < XFS_INODES_PER_CHUNK;
-	     chunkidx++, clustidx++, agino++) {
-		int		fmterror;	/* bulkstat formatter result */
+	for (chunkidx = 0; chunkidx < XFS_INODES_PER_CHUNK;
+	     chunkidx++, agino++) {
+		int		fmterror;
 		int		ubused;
-		xfs_ino_t	ino = XFS_AGINO_TO_INO(mp, agno, agino);
 
-		ASSERT(chunkidx < XFS_INODES_PER_CHUNK);
+		/* inode won't fit in buffer, we are done */
+		if (acp->ac_ubleft < statstruct_size)
+			break;
 
 		/* Skip if this inode is free */
-		if (XFS_INOBT_MASK(chunkidx) & irbp->ir_free) {
-			lastino = ino;
+		if (XFS_INOBT_MASK(chunkidx) & irbp->ir_free)
 			continue;
-		}
-
-		/*
-		 * Count used inodes as free so we can tell when the
-		 * chunk is used up.
-		 */
-		irbp->ir_freecount++;
 
 		/* Get the inode and fill in a single buffer */
 		ubused = statstruct_size;
-		error = formatter(mp, ino, *ubufp, ubleft, &ubused, &fmterror);
-		if (fmterror == BULKSTAT_RV_NOTHING) {
-			if (error && error != -ENOENT && error != -EINVAL) {
-				ubleft = 0;
-				break;
-			}
-			lastino = ino;
-			continue;
-		}
-		if (fmterror == BULKSTAT_RV_GIVEUP) {
-			ubleft = 0;
+		error = formatter(mp, XFS_AGINO_TO_INO(mp, agno, agino),
+				  *ubufp, acp->ac_ubleft, &ubused, &fmterror);
+
+		if (fmterror == BULKSTAT_RV_GIVEUP ||
+		    (error && error != -ENOENT && error != -EINVAL)) {
+			acp->ac_ubleft = 0;
 			ASSERT(error);
 			break;
 		}
-		if (*ubufp)
-			*ubufp += ubused;
-		ubleft -= ubused;
-		ubelem++;
-		lastino = ino;
+
+		/* be careful not to leak error if at end of chunk */
+		if (fmterror == BULKSTAT_RV_NOTHING || error) {
+			error = 0;
+			continue;
+		}
+
+		*ubufp += ubused;
+		acp->ac_ubleft -= ubused;
+		acp->ac_ubelem++;
 	}
 
-	acp->ac_lastino = lastino;
-	acp->ac_ubleft = ubleft;
-	acp->ac_ubelem = ubelem;
+	/*
+	 * Post-update *last_agino. At this point, agino will always point one
+	 * inode past the last inode we processed successfully. Hence we
+	 * substract that inode when setting the *last_agino cursor so that we
+	 * return the correct cookie to userspace. On the next bulkstat call,
+	 * the inode under the lastino cookie will be skipped as we have already
+	 * processed it here.
+	 */
+	*last_agino = agino - 1;
 
 	return error;
 }
··· 356 353
 	xfs_agino_t		agino;	/* inode # in allocation group */
 	xfs_agnumber_t		agno;	/* allocation group number */
 	xfs_btree_cur_t		*cur;	/* btree cursor for ialloc btree */
-	int			end_of_ag; /* set if we've seen the ag end */
-	int			error;	/* error code */
-	int			fmterror;/* bulkstat formatter result */
-	int			i;	/* loop index */
-	int			icount;	/* count of inodes good in irbuf */
 	size_t			irbsize; /* size of irec buffer in bytes */
-	xfs_ino_t		ino;	/* inode number (filesystem) */
-	xfs_inobt_rec_incore_t	*irbp;	/* current irec buffer pointer */
 	xfs_inobt_rec_incore_t	*irbuf;	/* start of irec buffer */
-	xfs_inobt_rec_incore_t	*irbufend; /* end of good irec buffer entries */
-	xfs_ino_t		lastino; /* last inode number returned */
 	int			nirbuf;	/* size of irbuf */
-	int			rval;	/* return value error code */
-	int			tmp;	/* result value from btree calls */
 	int			ubcount; /* size of user's buffer */
-	int			ubleft;	/* bytes left in user's buffer */
-	char			__user *ubufp;	/* pointer into user's buffer */
-	int			ubelem;	/* spaces used in user's buffer */
+	struct xfs_bulkstat_agichunk ac;
+	int			error = 0;
 
 	/*
 	 * Get the last inode value, see if there's nothing to do.
 	 */
-	ino = (xfs_ino_t)*lastinop;
-	lastino = ino;
-	agno = XFS_INO_TO_AGNO(mp, ino);
-	agino = XFS_INO_TO_AGINO(mp, ino);
+	agno = XFS_INO_TO_AGNO(mp, *lastinop);
+	agino = XFS_INO_TO_AGINO(mp, *lastinop);
 	if (agno >= mp->m_sb.sb_agcount ||
-	    ino != XFS_AGINO_TO_INO(mp, agno, agino)) {
+	    *lastinop != XFS_AGINO_TO_INO(mp, agno, agino)) {
 		*done = 1;
 		*ubcountp = 0;
 		return 0;
 	}
 
 	ubcount = *ubcountp; /* statstruct's */
-	ubleft = ubcount * statstruct_size; /* bytes */
-	*ubcountp = ubelem = 0;
+	ac.ac_ubuffer = &ubuffer;
+	ac.ac_ubleft = ubcount * statstruct_size; /* bytes */;
+	ac.ac_ubelem = 0;
+
+	*ubcountp = 0;
 	*done = 0;
-	fmterror = 0;
-	ubufp = ubuffer;
+
 	irbuf = kmem_zalloc_greedy(&irbsize, PAGE_SIZE, PAGE_SIZE * 4);
 	if (!irbuf)
 		return -ENOMEM;
··· 393 402
 	 * Loop over the allocation groups, starting from the last
 	 * inode returned; 0 means start of the allocation group.
 	 */
-	rval = 0;
-	while (XFS_BULKSTAT_UBLEFT(ubleft) && agno < mp->m_sb.sb_agcount) {
-		cond_resched();
+	while (agno < mp->m_sb.sb_agcount) {
+		struct xfs_inobt_rec_incore	*irbp = irbuf;
+		struct xfs_inobt_rec_incore	*irbufend = irbuf + nirbuf;
+		bool				end_of_ag = false;
+		int				icount = 0;
+		int				stat;
+
 		error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
 		if (error)
 			break;
··· 409 414
 		 */
 		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
 					    XFS_BTNUM_INO);
-		irbp = irbuf;
-		irbufend = irbuf + nirbuf;
-		end_of_ag = 0;
-		icount = 0;
 		if (agino > 0) {
 			/*
 			 * In the middle of an allocation group, we need to get
··· 418 427
 
 			error = xfs_bulkstat_grab_ichunk(cur, agino, &icount, &r);
 			if (error)
-				break;
+				goto del_cursor;
 			if (icount) {
 				irbp->ir_startino = r.ir_startino;
 				irbp->ir_freecount = r.ir_freecount;
 				irbp->ir_free = r.ir_free;
 				irbp++;
-				agino = r.ir_startino + XFS_INODES_PER_CHUNK;
 			}
 			/* Increment to the next record */
-			error = xfs_btree_increment(cur, 0, &tmp);
+			error = xfs_btree_increment(cur, 0, &stat);
 		} else {
 			/* Start of ag. Lookup the first inode chunk */
-			error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &tmp);
+			error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &stat);
 		}
-		if (error)
-			break;
+		if (error || stat == 0) {
+			end_of_ag = true;
+			goto del_cursor;
+		}
 
 		/*
 		 * Loop through inode btree records in this ag,
··· 443 451
 		while (irbp < irbufend && icount < ubcount) {
 			struct xfs_inobt_rec_incore	r;
 
-			error = xfs_inobt_get_rec(cur, &r, &i);
-			if (error || i == 0) {
-				end_of_ag = 1;
-				break;
+			error = xfs_inobt_get_rec(cur, &r, &stat);
+			if (error || stat == 0) {
+				end_of_ag = true;
+				goto del_cursor;
 			}
 
 			/*
··· 461 469
 				irbp++;
 				icount += XFS_INODES_PER_CHUNK - r.ir_freecount;
 			}
-			/*
-			 * Set agino to after this chunk and bump the cursor.
-			 */
-			agino = r.ir_startino + XFS_INODES_PER_CHUNK;
-			error = xfs_btree_increment(cur, 0, &tmp);
+			error = xfs_btree_increment(cur, 0, &stat);
+			if (error || stat == 0) {
+				end_of_ag = true;
+				goto del_cursor;
+			}
 			cond_resched();
 		}
+
 		/*
-		 * Drop the btree buffers and the agi buffer.
-		 * We can't hold any of the locks these represent
-		 * when calling iget.
+		 * Drop the btree buffers and the agi buffer as we can't hold any
+		 * of the locks these represent when calling iget. If there is a
+		 * pending error, then we are done.
 		 */
+del_cursor:
 		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
 		xfs_buf_relse(agbp);
+		if (error)
+			break;
 		/*
-		 * Now format all the good inodes into the user's buffer.
+		 * Now format all the good inodes into the user's buffer. The
+		 * call to xfs_bulkstat_ag_ichunk() sets up the agino pointer
+		 * for the next loop iteration.
 		 */
 		irbufend = irbp;
 		for (irbp = irbuf;
-		     irbp < irbufend && XFS_BULKSTAT_UBLEFT(ubleft); irbp++) {
-			struct xfs_bulkstat_agichunk ac;
-
-			ac.ac_lastino = lastino;
-			ac.ac_ubuffer = &ubuffer;
-			ac.ac_ubleft = ubleft;
-			ac.ac_ubelem = ubelem;
+		     irbp < irbufend && ac.ac_ubleft >= statstruct_size;
+		     irbp++) {
 			error = xfs_bulkstat_ag_ichunk(mp, agno, irbp,
-					formatter, statstruct_size, &ac);
+					formatter, statstruct_size, &ac,
+					&agino);
 			if (error)
-				rval = error;
-
-			lastino = ac.ac_lastino;
-			ubleft = ac.ac_ubleft;
-			ubelem = ac.ac_ubelem;
+				break;
 
 			cond_resched();
 		}
+
 		/*
-		 * Set up for the next loop iteration.
+		 * If we've run out of space or had a formatting error, we
+		 * are now done
 		 */
-		if (XFS_BULKSTAT_UBLEFT(ubleft)) {
-			if (end_of_ag) {
-				agno++;
-				agino = 0;
-			} else
-				agino = XFS_INO_TO_AGINO(mp, lastino);
-		} else
+		if (ac.ac_ubleft < statstruct_size || error)
 			break;
+
+		if (end_of_ag) {
+			agno++;
+			agino = 0;
+		}
 	}
 	/*
 	 * Done, we're either out of filesystem or space to put the data.
 	 */
 	kmem_free(irbuf);
-	*ubcountp = ubelem;
-	/*
-	 * Found some inodes, return them now and return the error next time.
-	 */
-	if (ubelem)
-		rval = 0;
-	if (agno >= mp->m_sb.sb_agcount) {
-		/*
-		 * If we ran out of filesystem, mark lastino as off
-		 * the end of the filesystem, so the next call
-		 * will return immediately.
-		 */
-		*lastinop = (xfs_ino_t)XFS_AGINO_TO_INO(mp, agno, 0);
-		*done = 1;
-	} else
-		*lastinop = (xfs_ino_t)lastino;
+	*ubcountp = ac.ac_ubelem;
 
-	return rval;
+	/*
+	 * We found some inodes, so clear the error status and return them.
+	 * The lastino pointer will point directly at the inode that triggered
+	 * any error that occurred, so on the next call the error will be
+	 * triggered again and propagated to userspace as there will be no
+	 * formatted inodes in the buffer.
+	 */
+	if (ac.ac_ubelem)
+		error = 0;
+
+	/*
+	 * If we ran out of filesystem, lastino will point off the end of
+	 * the filesystem so the next call will return immediately.
+	 */
+	*lastinop = XFS_AGINO_TO_INO(mp, agno, agino);
+	if (agno >= mp->m_sb.sb_agcount)
+		*done = 1;
+
+	return error;
 }
 
 int
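The cursor handling in the reworked xfs_bulkstat_ag_ichunk() above is easier to follow in isolation. Below is a simplified userspace model of it, not the kernel code: walk_chunk() and mask_processed() are hypothetical names, a plain array of inode numbers stands in for the user's stat buffer, and formatter errors are left out entirely. It shows only how the "last inode processed" cookie is derived as agino - 1 and how a later call can skip everything at or below that cookie by treating those inodes as free.

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_CHUNK 64	/* mirrors XFS_INODES_PER_CHUNK */

/*
 * Copy allocated inode numbers from one chunk into out[] until the chunk is
 * exhausted or the buffer is full. A set bit in free_mask marks a free
 * inode. On return, *last_agino is the last inode the walk visited.
 */
static void
walk_chunk(uint32_t startino, uint64_t free_mask,
	   uint32_t *out, int capacity, int *used, uint32_t *last_agino)
{
	uint32_t agino = startino;
	int idx;

	for (idx = 0; idx < INODES_PER_CHUNK; idx++, agino++) {
		/* next inode won't fit in the buffer, we are done */
		if (*used >= capacity)
			break;
		/* skip free inodes */
		if (free_mask & (1ULL << idx))
			continue;
		out[(*used)++] = agino;
	}

	/* agino is one past the last inode visited, so step back by one */
	*last_agino = agino - 1;
}

/*
 * When resuming inside a chunk, mark every inode up to and including the
 * cookie as free so that the next walk skips it.
 */
static uint64_t
mask_processed(uint32_t startino, uint64_t free_mask, uint32_t last_agino)
{
	uint32_t n;

	assert(last_agino >= startino &&
	       last_agino < startino + INODES_PER_CHUNK);

	n = last_agino - startino + 1;	/* inodes already visited */
	if (n >= INODES_PER_CHUNK)
		return ~0ULL;		/* avoid an undefined 64-bit shift */
	return free_mask | ((1ULL << n) - 1);
}
```

With a fully allocated chunk starting at inode 64 and room for three entries, the walk emits 64, 65 and 66 and leaves the cookie at 66. Feeding that cookie through mask_processed() makes the next walk start at inode 67, which matches the "inode under the lastino cookie will be skipped" behaviour described in the kernel comment.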
-16
fs/xfs/xfs_itable.h
··· 30 30
 			       int		*ubused,
 			       int		*stat);
 
-struct xfs_bulkstat_agichunk {
-	xfs_ino_t	ac_lastino;	/* last inode returned */
-	char		__user **ac_ubuffer;/* pointer into user's buffer */
-	int		ac_ubleft;	/* bytes left in user's buffer */
-	int		ac_ubelem;	/* spaces used in user's buffer */
-};
-
-int
-xfs_bulkstat_ag_ichunk(
-	struct xfs_mount		*mp,
-	xfs_agnumber_t			agno,
-	struct xfs_inobt_rec_incore	*irbp,
-	bulkstat_one_pf			formatter,
-	size_t				statstruct_size,
-	struct xfs_bulkstat_agichunk	*acp);
-
 /*
  * Values for stat return value.
  */
+3 -3
mm/truncate.c
··· 715 715
  * necessary) to @newsize. It will be typically be called from the filesystem's
  * setattr function when ATTR_SIZE is passed in.
  *
- * Must be called with inode_mutex held and before all filesystem specific
- * block truncation has been performed.
+ * Must be called with a lock serializing truncates and writes (generally
+ * i_mutex but e.g. xfs uses a different lock) and before all filesystem
+ * specific block truncation has been performed.
  */
 void truncate_setsize(struct inode *inode, loff_t newsize)
 {
··· 756 755
 	struct page *page;
 	pgoff_t index;
 
-	WARN_ON(!mutex_is_locked(&inode->i_mutex));
 	WARN_ON(to > inode->i_size);
 
 	if (from >= to || bsize == PAGE_CACHE_SIZE)