btrfs: fix racy bitfield write in btrfs_clear_space_info_full()

From the memory-barriers.txt document regarding memory barrier ordering
guarantees:

(*) These guarantees do not apply to bitfields, because compilers often
generate code to modify these using non-atomic read-modify-write
sequences. Do not attempt to use bitfields to synchronize parallel
algorithms.

(*) Even in cases where bitfields are protected by locks, all fields
in a given bitfield must be protected by one lock. If two fields
in a given bitfield are protected by different locks, the compiler's
non-atomic read-modify-write sequences can cause an update to one
field to corrupt the value of an adjacent field.

btrfs_space_info has a bitfield sharing an underlying word consisting of
the fields full, chunk_alloc, and flush:

struct btrfs_space_info {
struct btrfs_fs_info * fs_info; /* 0 8 */
struct btrfs_space_info * parent; /* 8 8 */
...
int clamp; /* 172 4 */
unsigned int full:1; /* 176: 0 4 */
unsigned int chunk_alloc:1; /* 176: 1 4 */
unsigned int flush:1; /* 176: 2 4 */
...

Therefore, to be safe from parallel read-modify-writes losing a write to
one of the bitfield members protected by a lock, all writes to all the
bitfields must use the lock. They almost universally do, except for
btrfs_clear_space_info_full() which iterates over the space_infos and
writes out found->full = 0 without a lock.

Imagine that we have one thread completing a transaction in which we
finished deleting a block_group and are thus calling
btrfs_clear_space_info_full() while simultaneously the data reclaim
ticket infrastructure is running do_async_reclaim_data_space():

T1 T2
btrfs_commit_transaction
btrfs_clear_space_info_full
data_sinfo->full = 0
READ: full:0, chunk_alloc:0, flush:1
do_async_reclaim_data_space(data_sinfo)
spin_lock(&space_info->lock);
if(list_empty(tickets))
space_info->flush = 0;
READ: full: 0, chunk_alloc:0, flush:1
MOD/WRITE: full: 0, chunk_alloc:0, flush:0
spin_unlock(&space_info->lock);
return;
MOD/WRITE: full:0, chunk_alloc:0, flush:1

and now data_sinfo->flush is 1 but the reclaim worker has exited. This
breaks the invariant that flush is 0 iff there is no work queued or
running. Once this invariant is violated, future allocations that go
into __reserve_bytes() will add tickets to space_info->tickets but will
see space_info->flush is set to 1 and not queue the work. After this,
they will block forever on the resulting ticket, as it is now impossible
to kick the worker again.

I also confirmed by looking at the assembly of the affected kernel that
it is doing RMW operations. For example, to set the flush (3rd) bit to 0,
the assembly is:
andb $0xfb,0x60(%rbx)
and similarly for setting the full (1st) bit to 0:
andb $0xfe,-0x20(%rax)

So I think this is really a bug on practical systems. I have observed
a number of systems in this exact state, but am currently unable to
reproduce it.

Rather than leaving this footgun lying around for the future, take
advantage of the fact that there is room in the struct anyway, and that
it is already quite large and simply change the three bitfield members to
bools. This avoids writes to space_info->full having any effect on
writes to space_info->flush, regardless of locking.

Fixes: 957780eb2788 ("Btrfs: introduce ticketed enospc infrastructure")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

authored by

Boris Burkov and committed by

David Sterba 6 months ago 38e81871 745483ea

+17 -17

3 changed files

expand all

btrfs

block-group.c

space-info.c

space-info.h

+3 -3

fs/btrfs/block-group.c

··· 4215 4215 mutex_unlock(&fs_info->chunk_mutex); 4216 4216 } else { 4217 4217 /* Proceed with allocation */ 4218 - space_info->chunk_alloc = 1; 4218 + space_info->chunk_alloc = true; 4219 4219 wait_for_alloc = false; 4220 4220 spin_unlock(&space_info->lock); 4221 4221 } ··· 4264 4264 spin_lock(&space_info->lock); 4265 4265 if (ret < 0) { 4266 4266 if (ret == -ENOSPC) 4267 - space_info->full = 1; 4267 + space_info->full = true; 4268 4268 else 4269 4269 goto out; 4270 4270 } else { ··· 4274 4274 4275 4275 space_info->force_alloc = CHUNK_ALLOC_NO_FORCE; 4276 4276 out: 4277 - space_info->chunk_alloc = 0; 4277 + space_info->chunk_alloc = false; 4278 4278 spin_unlock(&space_info->lock); 4279 4279 mutex_unlock(&fs_info->chunk_mutex); 4280 4280

+11 -11

fs/btrfs/space-info.c

··· 192 192 struct btrfs_space_info *found; 193 193 194 194 list_for_each_entry(found, head, list) 195 - found->full = 0; 195 + found->full = false; 196 196 } 197 197 198 198 /* ··· 372 372 space_info->bytes_readonly += block_group->bytes_super; 373 373 btrfs_space_info_update_bytes_zone_unusable(space_info, block_group->zone_unusable); 374 374 if (block_group->length > 0) 375 - space_info->full = 0; 375 + space_info->full = false; 376 376 btrfs_try_granting_tickets(info, space_info); 377 377 spin_unlock(&space_info->lock); 378 378 ··· 1146 1146 spin_lock(&space_info->lock); 1147 1147 to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info); 1148 1148 if (!to_reclaim) { 1149 - space_info->flush = 0; 1149 + space_info->flush = false; 1150 1150 spin_unlock(&space_info->lock); 1151 1151 return; 1152 1152 } ··· 1158 1158 flush_space(fs_info, space_info, to_reclaim, flush_state, false); 1159 1159 spin_lock(&space_info->lock); 1160 1160 if (list_empty(&space_info->tickets)) { 1161 - space_info->flush = 0; 1161 + space_info->flush = false; 1162 1162 spin_unlock(&space_info->lock); 1163 1163 return; 1164 1164 } ··· 1201 1201 flush_state = FLUSH_DELAYED_ITEMS_NR; 1202 1202 commit_cycles--; 1203 1203 } else { 1204 - space_info->flush = 0; 1204 + space_info->flush = false; 1205 1205 } 1206 1206 } else { 1207 1207 flush_state = FLUSH_DELAYED_ITEMS_NR; ··· 1383 1383 1384 1384 spin_lock(&space_info->lock); 1385 1385 if (list_empty(&space_info->tickets)) { 1386 - space_info->flush = 0; 1386 + space_info->flush = false; 1387 1387 spin_unlock(&space_info->lock); 1388 1388 return; 1389 1389 } ··· 1394 1394 flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE, false); 1395 1395 spin_lock(&space_info->lock); 1396 1396 if (list_empty(&space_info->tickets)) { 1397 - space_info->flush = 0; 1397 + space_info->flush = false; 1398 1398 spin_unlock(&space_info->lock); 1399 1399 return; 1400 1400 } ··· 1411 1411 data_flush_states[flush_state], false); 1412 1412 spin_lock(&space_info->lock); 1413 1413 if (list_empty(&space_info->tickets)) { 1414 - space_info->flush = 0; 1414 + space_info->flush = false; 1415 1415 spin_unlock(&space_info->lock); 1416 1416 return; 1417 1417 } ··· 1428 1428 if (maybe_fail_all_tickets(fs_info, space_info)) 1429 1429 flush_state = 0; 1430 1430 else 1431 - space_info->flush = 0; 1431 + space_info->flush = false; 1432 1432 } else { 1433 1433 flush_state = 0; 1434 1434 } ··· 1444 1444 1445 1445 aborted_fs: 1446 1446 maybe_fail_all_tickets(fs_info, space_info); 1447 - space_info->flush = 0; 1447 + space_info->flush = false; 1448 1448 spin_unlock(&space_info->lock); 1449 1449 } 1450 1450 ··· 1825 1825 */ 1826 1826 maybe_clamp_preempt(fs_info, space_info); 1827 1827 1828 - space_info->flush = 1; 1828 + space_info->flush = true; 1829 1829 trace_btrfs_trigger_flush(fs_info, 1830 1830 space_info->flags, 1831 1831 orig_bytes, flush,

+3 -3

fs/btrfs/space-info.h

··· 142 142 flushing. The value is >> clamp, so turns 143 143 out to be a 2^clamp divisor. */ 144 144 145 - unsigned int full:1; /* indicates that we cannot allocate any more 145 + bool full; /* indicates that we cannot allocate any more 146 146 chunks for this space */ 147 - unsigned int chunk_alloc:1; /* set if we are allocating a chunk */ 147 + bool chunk_alloc; /* set if we are allocating a chunk */ 148 148 149 - unsigned int flush:1; /* set if we are trying to make space */ 149 + bool flush; /* set if we are trying to make space */ 150 150 151 151 unsigned int force_alloc; /* set if we need to force a chunk 152 152 alloc for this space */

Configure Feed

Configure Feed