Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

fs: new inode i_state corruption fix

There was a report of a data corruption
http://lkml.org/lkml/2008/11/14/121. There is a script included to
reproduce the problem.

During testing, I encountered a number of strange things with ext3, so I
tried ext2 to attempt to reduce complexity of the problem. I found that
fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
cleared, even though instrumentation showed that unlock_new_inode had
already been called for that inode. This points to memory scribble, or
synchronisation problme.

i_state of I_NEW inodes is not protected by inode_lock because other
processes are not supposed to touch them until I_LOCK (and I_NEW) is
cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
i_state revealed that generic_sync_sb_inodes is picking up new inodes from
the inode lists and passing them to __writeback_single_inode without
waiting for I_NEW. Subsequently modifying i_state causes corruption. In
my case it would look like this:

CPU0 CPU1
unlock_new_inode() __sync_single_inode()
reg <- inode->i_state
reg -> reg & ~(I_LOCK|I_NEW) reg <- inode->i_state
reg -> inode->i_state reg -> reg | I_SYNC
reg -> inode->i_state

Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.

Fix for this is rather than wait for I_NEW inodes, just skip over them:
inodes concurrently being created are not subject to data integrity
operations, and should not significantly contribute to dirty memory
either.

After this change, I'm unable to reproduce any of the added warnings or
hangs after ~1hour of running. Previously, the new warnings would start
immediately and hang would happen in under 5 minutes.

I'm also testing on ext3 now, and so far no problems there either. I
don't know whether this fixes the problem reported above, but it fixes a
real problem for me.

Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by

Nick Piggin and committed by
Linus Torvalds
7ef0d737 f272b7bc

+15 -1
+8 -1
fs/fs-writeback.c
··· 274 274 int ret; 275 275 276 276 BUG_ON(inode->i_state & I_SYNC); 277 + WARN_ON(inode->i_state & I_NEW); 277 278 278 279 /* Set I_SYNC, reset I_DIRTY */ 279 280 dirty = inode->i_state & I_DIRTY; ··· 299 298 } 300 299 301 300 spin_lock(&inode_lock); 301 + WARN_ON(inode->i_state & I_NEW); 302 302 inode->i_state &= ~I_SYNC; 303 303 if (!(inode->i_state & I_FREEING)) { 304 304 if (!(inode->i_state & I_DIRTY) && ··· 472 470 break; 473 471 } 474 472 473 + if (inode->i_state & I_NEW) { 474 + requeue_io(inode); 475 + continue; 476 + } 477 + 475 478 if (wbc->nonblocking && bdi_write_congested(bdi)) { 476 479 wbc->encountered_congestion = 1; 477 480 if (!sb_is_blkdev_sb(sb)) ··· 538 531 list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { 539 532 struct address_space *mapping; 540 533 541 - if (inode->i_state & (I_FREEING|I_WILL_FREE)) 534 + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) 542 535 continue; 543 536 mapping = inode->i_mapping; 544 537 if (mapping->nrpages == 0)
+7
fs/inode.c
··· 359 359 invalidate_inode_buffers(inode); 360 360 if (!atomic_read(&inode->i_count)) { 361 361 list_move(&inode->i_list, dispose); 362 + WARN_ON(inode->i_state & I_NEW); 362 363 inode->i_state |= I_FREEING; 363 364 count++; 364 365 continue; ··· 461 460 continue; 462 461 } 463 462 list_move(&inode->i_list, &freeable); 463 + WARN_ON(inode->i_state & I_NEW); 464 464 inode->i_state |= I_FREEING; 465 465 nr_pruned++; 466 466 } ··· 658 656 * just created it (so there can be no old holders 659 657 * that haven't tested I_LOCK). 660 658 */ 659 + WARN_ON((inode->i_state & (I_LOCK|I_NEW)) != (I_LOCK|I_NEW)); 661 660 inode->i_state &= ~(I_LOCK|I_NEW); 662 661 wake_up_inode(inode); 663 662 } ··· 1148 1145 1149 1146 list_del_init(&inode->i_list); 1150 1147 list_del_init(&inode->i_sb_list); 1148 + WARN_ON(inode->i_state & I_NEW); 1151 1149 inode->i_state |= I_FREEING; 1152 1150 inodes_stat.nr_inodes--; 1153 1151 spin_unlock(&inode_lock); ··· 1190 1186 spin_unlock(&inode_lock); 1191 1187 return; 1192 1188 } 1189 + WARN_ON(inode->i_state & I_NEW); 1193 1190 inode->i_state |= I_WILL_FREE; 1194 1191 spin_unlock(&inode_lock); 1195 1192 write_inode_now(inode, 1); 1196 1193 spin_lock(&inode_lock); 1194 + WARN_ON(inode->i_state & I_NEW); 1197 1195 inode->i_state &= ~I_WILL_FREE; 1198 1196 inodes_stat.nr_unused--; 1199 1197 hlist_del_init(&inode->i_hash); 1200 1198 } 1201 1199 list_del_init(&inode->i_list); 1202 1200 list_del_init(&inode->i_sb_list); 1201 + WARN_ON(inode->i_state & I_NEW); 1203 1202 inode->i_state |= I_FREEING; 1204 1203 inodes_stat.nr_inodes--; 1205 1204 spin_unlock(&inode_lock);