Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs

Pull bcachefs updates from Kent Overstreet:
"Lots of scalability work, another big on-disk format change. On-disk
format version goes from 1.13 to 1.20.

Like 6.11, this is another big and expensive automatic/required on
disk format upgrade. This is planned to be the last big on disk format
upgrade before the experimental label comes off. There will be one
more minor on disk format update for a few things that couldn't make
this release.

Headline improvements:

- Self healing work:

Allocator and reflink now run the exact same check/repair code that
fsck does at runtime, where applicable.

The long term goal here is to remove inconsistent() errors (that
cause us to go emergency read only) by lifting fsck code up to
normal runtime paths; we should only go emergency read-only if we
detect an inconsistency that was due to a runtime bug - or truly
catastrophic damage (corrupted btree roots/interior nodes).

- Reflink repair no longer deletes reflink pointers:

Instead we flip an error bit and log the error, and they can still
be deleted by file deletion. This means a temporary failure to find
an indirect extent (perhaps repaired later by btree node scan)
won't result in unnecessary data loss

- Improvements to rebalance data path option handling:

We can now correctly apply changed filesystem-level io path options
to pending rebalance work, and soon we'll be able to apply
file-level io path option changes to indirect extents

- Fix mount time regression that some users encountered post the 6.11
disk accounting rewrite.

Accounting keys were encoded little endian (typetag in the low
bits) - which didn't anticipate adding accounting keys for every
inode, which aren't stored in memory and we don't want to scan at
mount time.

- fsck time on large filesystems is improved by multiple orders of
magnitude. Previously, 100TB was about the practical max filesystem
size, where users were reporting fsck times of a day+. With the new
changes (which nearly eliminate backpointers fsck overhead), we
fsck'd a filesystem with 10PB of data in 1.5 hours.

The problematic fsck passes were walking every extent and checking
for missing backpointers, and walking every backpointer to check
for dangling backpointers. As we've been adding more and more
runtime self healing there was no reason to keep around the
backpointers -> extents pass; dangling backpointers are just
deleted, and we can do that when using them - thus, backpointers ->
extents is now only run in debug mode.

extents -> backpointers does need to exist, since missing
backpointers would mean we can't find data to move it (for e.g.
copygc, device evacuate, scrub). But the new on disk format version
makes possible a new strategy where we sum up backpointers within a
bucket and check it against the bucket sector counts, and then only
scan for missing backpointers if the counts are off (and then, only
for specific buckets).
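
The "sum up backpointers within a bucket" strategy described above can be illustrated with a minimal userspace sketch. Everything here is an assumption made for illustration: the struct layouts, the field names and `find_mismatched_buckets()` are hypothetical and are not the bcachefs on-disk format or API. The sketch also assumes that the generation number is used to skip backpointers left over from an earlier use of the bucket.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the bcachefs structures. */
struct bucket {
	uint32_t gen;		/* bucket generation number */
	uint32_t dirty_sectors;	/* sectors the allocator believes are in use */
};

struct backpointer {
	size_t   bucket;	/* index of the bucket this backpointer refers to */
	uint32_t bucket_gen;	/* generation recorded when it was written */
	uint32_t sectors;	/* length of the extent it points back to */
};

/*
 * Sum backpointer sectors per bucket, skipping backpointers whose generation
 * does not match the bucket (assumed stale), then flag every bucket whose sum
 * disagrees with its own sector count. Only flagged buckets would need the
 * expensive scan for missing backpointers.
 *
 * sums[] and mismatch[] must each have room for nr_buckets entries.
 * Returns the number of flagged buckets.
 */
size_t find_mismatched_buckets(const struct bucket *buckets, size_t nr_buckets,
			       const struct backpointer *bps, size_t nr_bps,
			       uint64_t *sums, bool *mismatch)
{
	size_t nr_bad = 0;

	for (size_t i = 0; i < nr_buckets; i++)
		sums[i] = 0;

	for (size_t i = 0; i < nr_bps; i++) {
		const struct backpointer *bp = &bps[i];

		if (bp->bucket >= nr_buckets)
			continue;	/* out of range: ignored in this sketch */
		if (bp->bucket_gen != buckets[bp->bucket].gen)
			continue;	/* stale generation */
		sums[bp->bucket] += bp->sectors;
	}

	for (size_t i = 0; i < nr_buckets; i++) {
		mismatch[i] = sums[i] != buckets[i].dirty_sectors;
		nr_bad += mismatch[i];
	}
	return nr_bad;
}
```

In this sketch the common case costs one pass over the backpointers plus one pass over the buckets, and the per-extent scan is reserved for buckets where `mismatch[i]` is set.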

Full list of on disk format changes:

- 1.14: backpointer_bucket_gen

Backpointers now have a field for the bucket generation number,
replacing the obsolete bucket_offset field. This is needed for the
new "sum up backpointers within a bucket" code, since backpointers
use the btree write buffer - meaning we will see stale reads, and
this runs online, with the filesystem in full rw mode.

- 1.15: disk_accounting_big_endian

As previously described, fix the endianness of accounting keys so
that accounting keys with the same typetag sort together, and
accounting read can skip types it's not interested in.

- 1.16: reflink_p_may_update_opts:

This version indicates that a new reflink pointer field is
understood and may be used; the field indicates whether the reflink
pointer has permission to update IO path options (e.g. compression,
replicas) on the indirect extent it points to.

This completes the rebalance/reflink data path option handling from
the 6.13 pull request.

- 1.17: inode_depth

Add a new inode field, bi_depth, to accelerate the
check_directory_structure fsck path, which checks for loops in the
filesystem hierarchy.

check_inodes and check_dirents check connectivity, so
check_directory_structure only has to check for loops - by walking
back up to the root from every directory.

But a path can't be a loop if it has a counter that increases
monotonically from root to leaf - adding a depth counter means that
we can check for loops with only local (parent -> child) checks. We
might need to occasionally renumber the depth field in fsck if
directories have been moved around, but then future fsck runs will
be much faster.

- 1.18: persistent_inode_cursors

Previously, the cursor used for inode allocation was only kept in
memory, which meant that users with large filesystems and lots of
files were reporting that the first create after mounting would
take a while - since it had to scan from the start.

Inode allocation cursors are now persistent, and also include a
generation field (incremented on wraparound, which will only happen
if inode allocation is restricted to 32 bit inodes), so that we
don't have to leave inode_generation keys around after a delete.

The option for 32 bit inode numbers may now also be set on
individual directories, and non-32 bit inode allocations are
disallowed from allocating from the 32 bit part of the inode number
space.

- 1.19: autofix_errors

Runtime self healing is now the default.

- 1.20: directory size (from Hongbo)

directory i_size is now meaningful, and not 0"
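
The loop check described under 1.17 (inode_depth) can be sketched in plain C as follows. This is a simplified illustration under stated assumptions, not the bcachefs implementation: `struct dir`, `depth_ok()`, `reaches_root()` and `check_no_loops()` are hypothetical names, directories are held in an in-memory array, and the root is modelled as its own parent. Only the field name `bi_depth` comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory directory record. */
struct dir {
	size_t   parent;	/* index of the parent; the root is its own parent */
	uint32_t bi_depth;	/* depth counter, 0 at the root */
};

/* Local (parent -> child) check: a child must be strictly deeper than its parent. */
bool depth_ok(const struct dir *dirs, size_t i)
{
	if (dirs[i].parent == i)	/* root */
		return dirs[i].bi_depth == 0;
	return dirs[i].bi_depth > dirs[dirs[i].parent].bi_depth;
}

/*
 * Slow path: walk back up towards the root. With nr directories, a walk that
 * has not reached the root after nr steps must be going around a loop.
 */
bool reaches_root(const struct dir *dirs, size_t nr, size_t i)
{
	for (size_t steps = 0; steps <= nr; steps++) {
		if (dirs[i].parent == i)
			return true;
		i = dirs[i].parent;
	}
	return false;
}

/*
 * A loop cannot have a strictly increasing counter all the way around, so at
 * least one directory on any loop fails depth_ok(). Only directories that fail
 * the local check need the walk to the root.
 */
bool check_no_loops(const struct dir *dirs, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (!depth_ok(dirs, i) && !reaches_root(dirs, nr, i))
			return false;
	return true;
}
```

A directory with an out-of-date depth (for example after a rename) fails `depth_ok()` without being part of a loop. In this sketch it simply takes the slow path, which corresponds to the occasional renumbering the commit message mentions.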

* tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs: (268 commits)
bcachefs: Fix check_inode_hash_info_matches_root()
bcachefs: Document issue with bch_stripe layout
bcachefs: Fix self healing on read error
bcachefs: Pop all the transactions from the abort one
bcachefs: Only abort the transactions in the cycle
bcachefs: Introduce lock_graph_pop_from
bcachefs: Convert open-coded lock_graph_pop_all to helper
bcachefs: Do not allow no fail lock request to fail
bcachefs: Merge the condition to avoid additional invocation
Revert "bcachefs: Fix bch2_btree_node_upgrade()"
bcachefs: bcachefs_metadata_version_directory_size
bcachefs: make directory i_size meaningful
bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet
bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck
bcachefs: Check for dirents to overwritten inodes
bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth
bcachefs: Don't set btree_path to updtodate if we don't fill
bcachefs: __bch2_btree_pos_to_text()
bcachefs: printbuf_reset() handles tabstops
bcachefs: Silence read-only errors when deleting snapshots
...

+7215 -4679
+1 -1
Documentation/filesystems/bcachefs/CodingStyle.rst
···
 A good code comment is wonderful, but even better is the comment that didn't
 need to exist because the code was so straightforward as to be obvious;
 organized into small clean and tidy modules, with clear and descriptive names
-for functions and variable, where every line of code has a clear purpose.
+for functions and variables, where every line of code has a clear purpose.
+1 -1
fs/bcachefs/Kconfig
···
 config BCACHEFS_PATH_TRACEPOINTS
 	bool "Extra btree_path tracepoints"
-	depends on BCACHEFS_FS
+	depends on BCACHEFS_FS && TRACING
 	help
 	  Enable extra tracepoints for debugging btree_path operations; we don't
 	  normally want these enabled because they happen at very high rates.
+1
fs/bcachefs/Makefile
···
 	siphash.o \
 	six.o \
 	snapshot.o \
+	str_hash.o \
 	subvolume.o \
 	super.o \
 	super-io.o \
+3 -8
fs/bcachefs/acl.c
···
 	return ERR_PTR(-EINVAL);
 }
 
-#define acl_for_each_entry(acl, acl_e) \
-	for (acl_e = acl->a_entries; \
-	     acl_e < acl->a_entries + acl->a_count; \
-	     acl_e++)
-
 /*
  * Convert from in-memory to filesystem representation.
  */
···
 {
 	struct bkey_i_xattr *xattr;
 	bch_acl_header *acl_header;
-	const struct posix_acl_entry *acl_e;
+	const struct posix_acl_entry *acl_e, *pe;
 	void *outptr;
 	unsigned nr_short = 0, nr_long = 0, acl_len, u64s;
 
-	acl_for_each_entry(acl, acl_e) {
+	FOREACH_ACL_ENTRY(acl_e, acl, pe) {
 		switch (acl_e->e_tag) {
 		case ACL_USER:
 		case ACL_GROUP:
···
 	outptr = (void *) acl_header + sizeof(*acl_header);
 
-	acl_for_each_entry(acl, acl_e) {
+	FOREACH_ACL_ENTRY(acl_e, acl, pe) {
 		bch_acl_entry *entry = outptr;
 
 		entry->e_tag = cpu_to_le16(acl_e->e_tag);
+290 -268
fs/bcachefs/alloc_background.c
···
 }
 
 int bch2_alloc_v1_validate(struct bch_fs *c, struct bkey_s_c k,
-			   enum bch_validate_flags flags)
+			   struct bkey_validate_context from)
 {
 	struct bkey_s_c_alloc a = bkey_s_c_to_alloc(k);
 	int ret = 0;
···
 }
 
 int bch2_alloc_v2_validate(struct bch_fs *c, struct bkey_s_c k,
-			   enum bch_validate_flags flags)
+			   struct bkey_validate_context from)
 {
 	struct bkey_alloc_unpacked u;
 	int ret = 0;
···
 }
 
 int bch2_alloc_v3_validate(struct bch_fs *c, struct bkey_s_c k,
-			   enum bch_validate_flags flags)
+			   struct bkey_validate_context from)
 {
 	struct bkey_alloc_unpacked u;
 	int ret = 0;
···
 }
 
 int bch2_alloc_v4_validate(struct bch_fs *c, struct bkey_s_c k,
-			   enum bch_validate_flags flags)
+			   struct bkey_validate_context from)
 {
 	struct bch_alloc_v4 a;
 	int ret = 0;
···
 void bch2_alloc_v4_swab(struct bkey_s k)
 {
 	struct bch_alloc_v4 *a = bkey_s_to_alloc_v4(k).v;
-	struct bch_backpointer *bp, *bps;
 
-	a->journal_seq = swab64(a->journal_seq);
+	a->journal_seq_nonempty = swab64(a->journal_seq_nonempty);
+	a->journal_seq_empty = swab64(a->journal_seq_empty);
 	a->flags = swab32(a->flags);
 	a->dirty_sectors = swab32(a->dirty_sectors);
 	a->cached_sectors = swab32(a->cached_sectors);
···
 	a->stripe = swab32(a->stripe);
 	a->nr_external_backpointers = swab32(a->nr_external_backpointers);
 	a->stripe_sectors = swab32(a->stripe_sectors);
-
-	bps = alloc_v4_backpointers(a);
-	for (bp = bps; bp < bps + BCH_ALLOC_V4_NR_BACKPOINTERS(a); bp++) {
-		bp->bucket_offset = swab40(bp->bucket_offset);
-		bp->bucket_len = swab32(bp->bucket_len);
-		bch2_bpos_swab(&bp->pos);
-	}
 }
 
 void bch2_alloc_to_text(struct printbuf *out, struct bch_fs *c, struct bkey_s_c k)
··· 347 354 prt_printf(out, "gen %u oldest_gen %u data_type ", a->gen, a->oldest_gen); 348 355 bch2_prt_data_type(out, a->data_type); 349 356 prt_newline(out); 350 - prt_printf(out, "journal_seq %llu\n", a->journal_seq); 351 - prt_printf(out, "need_discard %llu\n", BCH_ALLOC_V4_NEED_DISCARD(a)); 352 - prt_printf(out, "need_inc_gen %llu\n", BCH_ALLOC_V4_NEED_INC_GEN(a)); 353 - prt_printf(out, "dirty_sectors %u\n", a->dirty_sectors); 354 - prt_printf(out, "stripe_sectors %u\n", a->stripe_sectors); 355 - prt_printf(out, "cached_sectors %u\n", a->cached_sectors); 356 - prt_printf(out, "stripe %u\n", a->stripe); 357 - prt_printf(out, "stripe_redundancy %u\n", a->stripe_redundancy); 358 - prt_printf(out, "io_time[READ] %llu\n", a->io_time[READ]); 359 - prt_printf(out, "io_time[WRITE] %llu\n", a->io_time[WRITE]); 357 + prt_printf(out, "journal_seq_nonempty %llu\n", a->journal_seq_nonempty); 358 + prt_printf(out, "journal_seq_empty %llu\n", a->journal_seq_empty); 359 + prt_printf(out, "need_discard %llu\n", BCH_ALLOC_V4_NEED_DISCARD(a)); 360 + prt_printf(out, "need_inc_gen %llu\n", BCH_ALLOC_V4_NEED_INC_GEN(a)); 361 + prt_printf(out, "dirty_sectors %u\n", a->dirty_sectors); 362 + prt_printf(out, "stripe_sectors %u\n", a->stripe_sectors); 363 + prt_printf(out, "cached_sectors %u\n", a->cached_sectors); 364 + prt_printf(out, "stripe %u\n", a->stripe); 365 + prt_printf(out, "stripe_redundancy %u\n", a->stripe_redundancy); 366 + prt_printf(out, "io_time[READ] %llu\n", a->io_time[READ]); 367 + prt_printf(out, "io_time[WRITE] %llu\n", a->io_time[WRITE]); 360 368 361 369 if (ca) 362 370 prt_printf(out, "fragmentation %llu\n", alloc_lru_idx_fragmentation(*a, ca)); ··· 386 392 struct bkey_alloc_unpacked u = bch2_alloc_unpack(k); 387 393 388 394 *out = (struct bch_alloc_v4) { 389 - .journal_seq = u.journal_seq, 395 + .journal_seq_nonempty = u.journal_seq, 390 396 .flags = u.need_discard, 391 397 .gen = u.gen, 392 398 .oldest_gen = u.oldest_gen, ··· 511 517 } 512 518 513 519 int 
bch2_bucket_gens_validate(struct bch_fs *c, struct bkey_s_c k, 514 - enum bch_validate_flags flags) 520 + struct bkey_validate_context from) 515 521 { 516 522 int ret = 0; 517 523 ··· 658 664 659 665 /* Free space/discard btree: */ 660 666 667 + static int __need_discard_or_freespace_err(struct btree_trans *trans, 668 + struct bkey_s_c alloc_k, 669 + bool set, bool discard, bool repair) 670 + { 671 + struct bch_fs *c = trans->c; 672 + enum bch_fsck_flags flags = FSCK_CAN_IGNORE|(repair ? FSCK_CAN_FIX : 0); 673 + enum bch_sb_error_id err_id = discard 674 + ? BCH_FSCK_ERR_need_discard_key_wrong 675 + : BCH_FSCK_ERR_freespace_key_wrong; 676 + enum btree_id btree = discard ? BTREE_ID_need_discard : BTREE_ID_freespace; 677 + struct printbuf buf = PRINTBUF; 678 + 679 + bch2_bkey_val_to_text(&buf, c, alloc_k); 680 + 681 + int ret = __bch2_fsck_err(NULL, trans, flags, err_id, 682 + "bucket incorrectly %sset in %s btree\n" 683 + " %s", 684 + set ? "" : "un", 685 + bch2_btree_id_str(btree), 686 + buf.buf); 687 + if (ret == -BCH_ERR_fsck_ignore || 688 + ret == -BCH_ERR_fsck_errors_not_fixed) 689 + ret = 0; 690 + 691 + printbuf_exit(&buf); 692 + return ret; 693 + } 694 + 695 + #define need_discard_or_freespace_err(...) \ 696 + fsck_err_wrap(__need_discard_or_freespace_err(__VA_ARGS__)) 697 + 698 + #define need_discard_or_freespace_err_on(cond, ...) \ 699 + (unlikely(cond) ? need_discard_or_freespace_err(__VA_ARGS__) : false) 700 + 661 701 static int bch2_bucket_do_index(struct btree_trans *trans, 662 702 struct bch_dev *ca, 663 703 struct bkey_s_c alloc_k, 664 704 const struct bch_alloc_v4 *a, 665 705 bool set) 666 706 { 667 - struct bch_fs *c = trans->c; 668 - struct btree_iter iter; 669 - struct bkey_s_c old; 670 - struct bkey_i *k; 671 707 enum btree_id btree; 672 - enum bch_bkey_type old_type = !set ? KEY_TYPE_set : KEY_TYPE_deleted; 673 - enum bch_bkey_type new_type = set ? 
KEY_TYPE_set : KEY_TYPE_deleted; 674 - struct printbuf buf = PRINTBUF; 675 - int ret; 708 + struct bpos pos; 676 709 677 710 if (a->data_type != BCH_DATA_free && 678 711 a->data_type != BCH_DATA_need_discard) 679 712 return 0; 680 713 681 - k = bch2_trans_kmalloc_nomemzero(trans, sizeof(*k)); 682 - if (IS_ERR(k)) 683 - return PTR_ERR(k); 684 - 685 - bkey_init(&k->k); 686 - k->k.type = new_type; 687 - 688 714 switch (a->data_type) { 689 715 case BCH_DATA_free: 690 716 btree = BTREE_ID_freespace; 691 - k->k.p = alloc_freespace_pos(alloc_k.k->p, *a); 692 - bch2_key_resize(&k->k, 1); 717 + pos = alloc_freespace_pos(alloc_k.k->p, *a); 693 718 break; 694 719 case BCH_DATA_need_discard: 695 720 btree = BTREE_ID_need_discard; 696 - k->k.p = alloc_k.k->p; 721 + pos = alloc_k.k->p; 697 722 break; 698 723 default: 699 724 return 0; 700 725 } 701 726 702 - old = bch2_bkey_get_iter(trans, &iter, btree, 703 - bkey_start_pos(&k->k), 704 - BTREE_ITER_intent); 705 - ret = bkey_err(old); 727 + struct btree_iter iter; 728 + struct bkey_s_c old = bch2_bkey_get_iter(trans, &iter, btree, pos, BTREE_ITER_intent); 729 + int ret = bkey_err(old); 706 730 if (ret) 707 731 return ret; 708 732 709 - if (ca->mi.freespace_initialized && 710 - c->curr_recovery_pass > BCH_RECOVERY_PASS_check_alloc_info && 711 - bch2_trans_inconsistent_on(old.k->type != old_type, trans, 712 - "incorrect key when %s %s:%llu:%llu:0 (got %s should be %s)\n" 713 - " for %s", 714 - set ? 
"setting" : "clearing", 715 - bch2_btree_id_str(btree), 716 - iter.pos.inode, 717 - iter.pos.offset, 718 - bch2_bkey_types[old.k->type], 719 - bch2_bkey_types[old_type], 720 - (bch2_bkey_val_to_text(&buf, c, alloc_k), buf.buf))) { 721 - ret = -EIO; 722 - goto err; 723 - } 733 + need_discard_or_freespace_err_on(ca->mi.freespace_initialized && 734 + !old.k->type != set, 735 + trans, alloc_k, set, 736 + btree == BTREE_ID_need_discard, false); 724 737 725 - ret = bch2_trans_update(trans, &iter, k, 0); 726 - err: 738 + ret = bch2_btree_bit_mod_iter(trans, &iter, set); 739 + fsck_err: 727 740 bch2_trans_iter_exit(trans, &iter); 728 - printbuf_exit(&buf); 729 741 return ret; 730 742 } 731 743 ··· 858 858 if (flags & BTREE_TRIGGER_transactional) { 859 859 alloc_data_type_set(new_a, new_a->data_type); 860 860 861 - if (bch2_bucket_sectors_total(*new_a) > bch2_bucket_sectors_total(*old_a)) { 861 + int is_empty_delta = (int) data_type_is_empty(new_a->data_type) - 862 + (int) data_type_is_empty(old_a->data_type); 863 + 864 + if (is_empty_delta < 0) { 862 865 new_a->io_time[READ] = bch2_current_io_time(c, READ); 863 866 new_a->io_time[WRITE]= bch2_current_io_time(c, WRITE); 864 867 SET_BCH_ALLOC_V4_NEED_INC_GEN(new_a, true); ··· 931 928 } 932 929 933 930 if ((flags & BTREE_TRIGGER_atomic) && (flags & BTREE_TRIGGER_insert)) { 934 - u64 journal_seq = trans->journal_res.seq; 935 - u64 bucket_journal_seq = new_a->journal_seq; 931 + u64 transaction_seq = trans->journal_res.seq; 932 + BUG_ON(!transaction_seq); 936 933 937 - if ((flags & BTREE_TRIGGER_insert) && 938 - data_type_is_empty(old_a->data_type) != 939 - data_type_is_empty(new_a->data_type) && 940 - new.k->type == KEY_TYPE_alloc_v4) { 941 - struct bch_alloc_v4 *v = bkey_s_to_alloc_v4(new).v; 934 + if (log_fsck_err_on(transaction_seq && new_a->journal_seq_nonempty > transaction_seq, 935 + trans, alloc_key_journal_seq_in_future, 936 + "bucket journal seq in future (currently at %llu)\n%s", 937 + journal_cur_seq(&c->journal), 
938 + (bch2_bkey_val_to_text(&buf, c, new.s_c), buf.buf))) 939 + new_a->journal_seq_nonempty = transaction_seq; 942 940 943 - /* 944 - * If the btree updates referring to a bucket weren't flushed 945 - * before the bucket became empty again, then the we don't have 946 - * to wait on a journal flush before we can reuse the bucket: 947 - */ 948 - v->journal_seq = bucket_journal_seq = 949 - data_type_is_empty(new_a->data_type) && 950 - (journal_seq == v->journal_seq || 951 - bch2_journal_noflush_seq(&c->journal, v->journal_seq)) 952 - ? 0 : journal_seq; 941 + int is_empty_delta = (int) data_type_is_empty(new_a->data_type) - 942 + (int) data_type_is_empty(old_a->data_type); 943 + 944 + /* 945 + * Record journal sequence number of empty -> nonempty transition: 946 + * Note that there may be multiple empty -> nonempty 947 + * transitions, data in a bucket may be overwritten while we're 948 + * still writing to it - so be careful to only record the first: 949 + * */ 950 + if (is_empty_delta < 0 && 951 + new_a->journal_seq_empty <= c->journal.flushed_seq_ondisk) { 952 + new_a->journal_seq_nonempty = transaction_seq; 953 + new_a->journal_seq_empty = 0; 953 954 } 954 955 955 - if (!data_type_is_empty(old_a->data_type) && 956 - data_type_is_empty(new_a->data_type) && 957 - bucket_journal_seq) { 958 - ret = bch2_set_bucket_needs_journal_commit(&c->buckets_waiting_for_journal, 959 - c->journal.flushed_seq_ondisk, 960 - new.k->p.inode, new.k->p.offset, 961 - bucket_journal_seq); 962 - if (bch2_fs_fatal_err_on(ret, c, 963 - "setting bucket_needs_journal_commit: %s", bch2_err_str(ret))) 964 - goto err; 956 + /* 957 + * Bucket becomes empty: mark it as waiting for a journal flush, 958 + * unless updates since empty -> nonempty transition were never 959 + * flushed - we may need to ask the journal not to flush 960 + * intermediate sequence numbers: 961 + */ 962 + if (is_empty_delta > 0) { 963 + if (new_a->journal_seq_nonempty == transaction_seq || 964 + 
bch2_journal_noflush_seq(&c->journal, 965 + new_a->journal_seq_nonempty, 966 + transaction_seq)) { 967 + new_a->journal_seq_nonempty = new_a->journal_seq_empty = 0; 968 + } else { 969 + new_a->journal_seq_empty = transaction_seq; 970 + 971 + ret = bch2_set_bucket_needs_journal_commit(&c->buckets_waiting_for_journal, 972 + c->journal.flushed_seq_ondisk, 973 + new.k->p.inode, new.k->p.offset, 974 + transaction_seq); 975 + if (bch2_fs_fatal_err_on(ret, c, 976 + "setting bucket_needs_journal_commit: %s", 977 + bch2_err_str(ret))) 978 + goto err; 979 + } 965 980 } 966 981 967 982 if (new_a->gen != old_a->gen) { ··· 995 974 996 975 #define eval_state(_a, expr) ({ const struct bch_alloc_v4 *a = _a; expr; }) 997 976 #define statechange(expr) !eval_state(old_a, expr) && eval_state(new_a, expr) 998 - #define bucket_flushed(a) (!a->journal_seq || a->journal_seq <= c->journal.flushed_seq_ondisk) 977 + #define bucket_flushed(a) (a->journal_seq_empty <= c->journal.flushed_seq_ondisk) 999 978 1000 979 if (statechange(a->data_type == BCH_DATA_free) && 1001 980 bucket_flushed(new_a)) ··· 1027 1006 rcu_read_unlock(); 1028 1007 } 1029 1008 err: 1009 + fsck_err: 1030 1010 printbuf_exit(&buf); 1031 1011 bch2_dev_put(ca); 1032 1012 return ret; ··· 1067 1045 * btree node min/max is a closed interval, upto takes a half 1068 1046 * open interval: 1069 1047 */ 1070 - k = bch2_btree_iter_peek_upto(&iter2, end); 1048 + k = bch2_btree_iter_peek_max(&iter2, end); 1071 1049 next = iter2.pos; 1072 1050 bch2_trans_iter_exit(iter->trans, &iter2); 1073 1051 ··· 1151 1129 struct bch_fs *c = trans->c; 1152 1130 struct bch_alloc_v4 a_convert; 1153 1131 const struct bch_alloc_v4 *a; 1154 - unsigned discard_key_type, freespace_key_type; 1155 1132 unsigned gens_offset; 1156 1133 struct bkey_s_c k; 1157 1134 struct printbuf buf = PRINTBUF; ··· 1170 1149 1171 1150 a = bch2_alloc_to_v4(alloc_k, &a_convert); 1172 1151 1173 - discard_key_type = a->data_type == BCH_DATA_need_discard ? 
KEY_TYPE_set : 0; 1174 1152 bch2_btree_iter_set_pos(discard_iter, alloc_k.k->p); 1175 1153 k = bch2_btree_iter_peek_slot(discard_iter); 1176 1154 ret = bkey_err(k); 1177 1155 if (ret) 1178 1156 goto err; 1179 1157 1180 - if (fsck_err_on(k.k->type != discard_key_type, 1181 - trans, need_discard_key_wrong, 1182 - "incorrect key in need_discard btree (got %s should be %s)\n" 1183 - " %s", 1184 - bch2_bkey_types[k.k->type], 1185 - bch2_bkey_types[discard_key_type], 1186 - (bch2_bkey_val_to_text(&buf, c, alloc_k), buf.buf))) { 1187 - struct bkey_i *update = 1188 - bch2_trans_kmalloc(trans, sizeof(*update)); 1189 - 1190 - ret = PTR_ERR_OR_ZERO(update); 1191 - if (ret) 1192 - goto err; 1193 - 1194 - bkey_init(&update->k); 1195 - update->k.type = discard_key_type; 1196 - update->k.p = discard_iter->pos; 1197 - 1198 - ret = bch2_trans_update(trans, discard_iter, update, 0); 1158 + bool is_discarded = a->data_type == BCH_DATA_need_discard; 1159 + if (need_discard_or_freespace_err_on(!!k.k->type != is_discarded, 1160 + trans, alloc_k, !is_discarded, true, true)) { 1161 + ret = bch2_btree_bit_mod_iter(trans, discard_iter, is_discarded); 1199 1162 if (ret) 1200 1163 goto err; 1201 1164 } 1202 1165 1203 - freespace_key_type = a->data_type == BCH_DATA_free ? 
KEY_TYPE_set : 0; 1204 1166 bch2_btree_iter_set_pos(freespace_iter, alloc_freespace_pos(alloc_k.k->p, *a)); 1205 1167 k = bch2_btree_iter_peek_slot(freespace_iter); 1206 1168 ret = bkey_err(k); 1207 1169 if (ret) 1208 1170 goto err; 1209 1171 1210 - if (fsck_err_on(k.k->type != freespace_key_type, 1211 - trans, freespace_key_wrong, 1212 - "incorrect key in freespace btree (got %s should be %s)\n" 1213 - " %s", 1214 - bch2_bkey_types[k.k->type], 1215 - bch2_bkey_types[freespace_key_type], 1216 - (printbuf_reset(&buf), 1217 - bch2_bkey_val_to_text(&buf, c, alloc_k), buf.buf))) { 1218 - struct bkey_i *update = 1219 - bch2_trans_kmalloc(trans, sizeof(*update)); 1220 - 1221 - ret = PTR_ERR_OR_ZERO(update); 1222 - if (ret) 1223 - goto err; 1224 - 1225 - bkey_init(&update->k); 1226 - update->k.type = freespace_key_type; 1227 - update->k.p = freespace_iter->pos; 1228 - bch2_key_resize(&update->k, 1); 1229 - 1230 - ret = bch2_trans_update(trans, freespace_iter, update, 0); 1172 + bool is_free = a->data_type == BCH_DATA_free; 1173 + if (need_discard_or_freespace_err_on(!!k.k->type != is_free, 1174 + trans, alloc_k, !is_free, false, true)) { 1175 + ret = bch2_btree_bit_mod_iter(trans, freespace_iter, is_free); 1231 1176 if (ret) 1232 1177 goto err; 1233 1178 } ··· 1355 1368 return ret; 1356 1369 } 1357 1370 1358 - static noinline_for_stack int bch2_check_discard_freespace_key(struct btree_trans *trans, 1359 - struct btree_iter *iter) 1371 + struct check_discard_freespace_key_async { 1372 + struct work_struct work; 1373 + struct bch_fs *c; 1374 + struct bbpos pos; 1375 + }; 1376 + 1377 + static int bch2_recheck_discard_freespace_key(struct btree_trans *trans, struct bbpos pos) 1378 + { 1379 + struct btree_iter iter; 1380 + struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, pos.btree, pos.pos, 0); 1381 + int ret = bkey_err(k); 1382 + if (ret) 1383 + return ret; 1384 + 1385 + u8 gen; 1386 + ret = k.k->type != KEY_TYPE_set 1387 + ? 
bch2_check_discard_freespace_key(trans, &iter, &gen, false) 1388 + : 0; 1389 + bch2_trans_iter_exit(trans, &iter); 1390 + return ret; 1391 + } 1392 + 1393 + static void check_discard_freespace_key_work(struct work_struct *work) 1394 + { 1395 + struct check_discard_freespace_key_async *w = 1396 + container_of(work, struct check_discard_freespace_key_async, work); 1397 + 1398 + bch2_trans_do(w->c, bch2_recheck_discard_freespace_key(trans, w->pos)); 1399 + bch2_write_ref_put(w->c, BCH_WRITE_REF_check_discard_freespace_key); 1400 + kfree(w); 1401 + } 1402 + 1403 + int bch2_check_discard_freespace_key(struct btree_trans *trans, struct btree_iter *iter, u8 *gen, 1404 + bool async_repair) 1360 1405 { 1361 1406 struct bch_fs *c = trans->c; 1362 - struct btree_iter alloc_iter; 1363 - struct bkey_s_c alloc_k; 1364 - struct bch_alloc_v4 a_convert; 1365 - const struct bch_alloc_v4 *a; 1366 - u64 genbits; 1367 - struct bpos pos; 1368 1407 enum bch_data_type state = iter->btree_id == BTREE_ID_need_discard 1369 1408 ? BCH_DATA_need_discard 1370 1409 : BCH_DATA_free; 1371 1410 struct printbuf buf = PRINTBUF; 1372 - int ret; 1373 1411 1374 - pos = iter->pos; 1375 - pos.offset &= ~(~0ULL << 56); 1376 - genbits = iter->pos.offset & (~0ULL << 56); 1412 + struct bpos bucket = iter->pos; 1413 + bucket.offset &= ~(~0ULL << 56); 1414 + u64 genbits = iter->pos.offset & (~0ULL << 56); 1377 1415 1378 - alloc_k = bch2_bkey_get_iter(trans, &alloc_iter, BTREE_ID_alloc, pos, 0); 1379 - ret = bkey_err(alloc_k); 1416 + struct btree_iter alloc_iter; 1417 + struct bkey_s_c alloc_k = bch2_bkey_get_iter(trans, &alloc_iter, 1418 + BTREE_ID_alloc, bucket, 1419 + async_repair ? 
BTREE_ITER_cached : 0); 1420 + int ret = bkey_err(alloc_k); 1380 1421 if (ret) 1381 1422 return ret; 1382 1423 1383 - if (fsck_err_on(!bch2_dev_bucket_exists(c, pos), 1384 - trans, need_discard_freespace_key_to_invalid_dev_bucket, 1385 - "entry in %s btree for nonexistant dev:bucket %llu:%llu", 1386 - bch2_btree_id_str(iter->btree_id), pos.inode, pos.offset)) 1387 - goto delete; 1424 + if (!bch2_dev_bucket_exists(c, bucket)) { 1425 + if (fsck_err(trans, need_discard_freespace_key_to_invalid_dev_bucket, 1426 + "entry in %s btree for nonexistant dev:bucket %llu:%llu", 1427 + bch2_btree_id_str(iter->btree_id), bucket.inode, bucket.offset)) 1428 + goto delete; 1429 + ret = 1; 1430 + goto out; 1431 + } 1388 1432 1389 - a = bch2_alloc_to_v4(alloc_k, &a_convert); 1433 + struct bch_alloc_v4 a_convert; 1434 + const struct bch_alloc_v4 *a = bch2_alloc_to_v4(alloc_k, &a_convert); 1390 1435 1391 - if (fsck_err_on(a->data_type != state || 1392 - (state == BCH_DATA_free && 1393 - genbits != alloc_freespace_genbits(*a)), 1394 - trans, need_discard_freespace_key_bad, 1395 - "%s\n incorrectly set at %s:%llu:%llu:0 (free %u, genbits %llu should be %llu)", 1396 - (bch2_bkey_val_to_text(&buf, c, alloc_k), buf.buf), 1397 - bch2_btree_id_str(iter->btree_id), 1398 - iter->pos.inode, 1399 - iter->pos.offset, 1400 - a->data_type == state, 1401 - genbits >> 56, alloc_freespace_genbits(*a) >> 56)) 1402 - goto delete; 1436 + if (a->data_type != state || 1437 + (state == BCH_DATA_free && 1438 + genbits != alloc_freespace_genbits(*a))) { 1439 + if (fsck_err(trans, need_discard_freespace_key_bad, 1440 + "%s\n incorrectly set at %s:%llu:%llu:0 (free %u, genbits %llu should be %llu)", 1441 + (bch2_bkey_val_to_text(&buf, c, alloc_k), buf.buf), 1442 + bch2_btree_id_str(iter->btree_id), 1443 + iter->pos.inode, 1444 + iter->pos.offset, 1445 + a->data_type == state, 1446 + genbits >> 56, alloc_freespace_genbits(*a) >> 56)) 1447 + goto delete; 1448 + ret = 1; 1449 + goto out; 1450 + } 1451 + 1452 + *gen 
= a->gen;
 out:
 fsck_err:
 	bch2_set_btree_iter_dontneed(&alloc_iter);
···
 	printbuf_exit(&buf);
 	return ret;
 delete:
-	ret = bch2_btree_delete_extent_at(trans, iter,
-			iter->btree_id == BTREE_ID_freespace ? 1 : 0, 0) ?:
-		bch2_trans_commit(trans, NULL, NULL,
-			BCH_TRANS_COMMIT_no_enospc);
-	goto out;
+	if (!async_repair) {
+		ret =   bch2_btree_bit_mod_iter(trans, iter, false) ?:
+			bch2_trans_commit(trans, NULL, NULL,
+				BCH_TRANS_COMMIT_no_enospc) ?:
+			-BCH_ERR_transaction_restart_commit;
+		goto out;
+	} else {
+		/*
+		 * We can't repair here when called from the allocator path: the
+		 * commit will recurse back into the allocator
+		 */
+		struct check_discard_freespace_key_async *w =
+			kzalloc(sizeof(*w), GFP_KERNEL);
+		if (!w)
+			goto out;
+
+		if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_check_discard_freespace_key)) {
+			kfree(w);
+			goto out;
+		}
+
+		INIT_WORK(&w->work, check_discard_freespace_key_work);
+		w->c = c;
+		w->pos = BBPOS(iter->btree_id, iter->pos);
+		queue_work(c->write_ref_wq, &w->work);
+		goto out;
+	}
+}
+
+static int bch2_check_discard_freespace_key_fsck(struct btree_trans *trans, struct btree_iter *iter)
+{
+	u8 gen;
+	int ret = bch2_check_discard_freespace_key(trans, iter, &gen, false);
+	return ret < 0 ? ret : 0;
 }

 /*
···
 	ret = for_each_btree_key(trans, iter,
 			BTREE_ID_need_discard, POS_MIN,
 			BTREE_ITER_prefetch, k,
-		bch2_check_discard_freespace_key(trans, &iter));
+		bch2_check_discard_freespace_key_fsck(trans, &iter));
 	if (ret)
 		goto err;

···
 			break;

 		ret = bkey_err(k) ?:
-			bch2_check_discard_freespace_key(trans, &iter);
+			bch2_check_discard_freespace_key_fsck(trans, &iter);
 		if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) {
 			ret = 0;
 			continue;
···
 				   struct bch_dev *ca,
 				   struct btree_iter *need_discard_iter,
 				   struct bpos *discard_pos_done,
-				   struct discard_buckets_state *s)
+				   struct discard_buckets_state *s,
+				   bool fastpath)
 {
 	struct bch_fs *c = trans->c;
 	struct bpos pos = need_discard_iter->pos;
···
 	if (ret)
 		goto out;

-	if (bch2_bucket_sectors_total(a->v)) {
-		if (bch2_trans_inconsistent_on(c->curr_recovery_pass > BCH_RECOVERY_PASS_check_alloc_info,
-					       trans, "attempting to discard bucket with dirty data\n%s",
-					       (bch2_bkey_val_to_text(&buf, c, k), buf.buf)))
-			ret = -EIO;
-		goto out;
-	}
-
 	if (a->v.data_type != BCH_DATA_need_discard) {
-		if (data_type_is_empty(a->v.data_type) &&
-		    BCH_ALLOC_V4_NEED_INC_GEN(&a->v)) {
-			a->v.gen++;
-			SET_BCH_ALLOC_V4_NEED_INC_GEN(&a->v, false);
-			goto write;
+		if (need_discard_or_freespace_err(trans, k, true, true, true)) {
+			ret = bch2_btree_bit_mod_iter(trans, need_discard_iter, false);
+			if (ret)
+				goto out;
+			goto commit;
 		}

-		if (bch2_trans_inconsistent_on(c->curr_recovery_pass > BCH_RECOVERY_PASS_check_alloc_info,
-					       trans, "bucket incorrectly set in need_discard btree\n"
-					       "%s",
-					       (bch2_bkey_val_to_text(&buf, c, k), buf.buf)))
-			ret = -EIO;
 		goto out;
 	}

-	if (a->v.journal_seq > c->journal.flushed_seq_ondisk) {
-		if (bch2_trans_inconsistent_on(c->curr_recovery_pass > BCH_RECOVERY_PASS_check_alloc_info,
-					       trans, "clearing need_discard but journal_seq %llu > flushed_seq %llu\n%s",
-					       a->v.journal_seq,
-					       c->journal.flushed_seq_ondisk,
-					       (bch2_bkey_val_to_text(&buf, c, k), buf.buf)))
-			ret = -EIO;
-		goto out;
+	if (!fastpath) {
+		if (discard_in_flight_add(ca, iter.pos.offset, true))
+			goto out;
+
+		discard_locked = true;
 	}
-
-	if (discard_in_flight_add(ca, iter.pos.offset, true))
-		goto out;
-
-	discard_locked = true;

 	if (!bkey_eq(*discard_pos_done, iter.pos) &&
 	    ca->mi.discard && !c->opts.nochanges) {
···
 				     ca->mi.bucket_size,
 				     GFP_KERNEL);
 		*discard_pos_done = iter.pos;
+		s->discarded++;

 		ret = bch2_trans_relock_notrace(trans);
 		if (ret)
···
 	}

 	SET_BCH_ALLOC_V4_NEED_DISCARD(&a->v, false);
-write:
 	alloc_data_type_set(&a->v, a->v.data_type);

-	ret =   bch2_trans_update(trans, &iter, &a->k_i, 0) ?:
-		bch2_trans_commit(trans, NULL, NULL,
-				  BCH_WATERMARK_btree|
-				  BCH_TRANS_COMMIT_no_enospc);
+	ret = bch2_trans_update(trans, &iter, &a->k_i, 0);
+	if (ret)
+		goto out;
+commit:
+	ret = bch2_trans_commit(trans, NULL, NULL,
+				BCH_WATERMARK_btree|
+				BCH_TRANS_COMMIT_no_enospc);
 	if (ret)
 		goto out;

 	count_event(c, bucket_discard);
-	s->discarded++;
 out:
+fsck_err:
 	if (discard_locked)
 		discard_in_flight_remove(ca, iter.pos.offset);
-	s->seen++;
+	if (!ret)
+		s->seen++;
 	bch2_trans_iter_exit(trans, &iter);
 	printbuf_exit(&buf);
 	return ret;
···
 	 * successful commit:
 	 */
 	ret = bch2_trans_run(c,
-		for_each_btree_key_upto(trans, iter,
+		for_each_btree_key_max(trans, iter,
 				   BTREE_ID_need_discard,
 				   POS(ca->dev_idx, 0),
 				   POS(ca->dev_idx, U64_MAX), 0, k,
-			bch2_discard_one_bucket(trans, ca, &iter, &discard_pos_done, &s)));
+			bch2_discard_one_bucket(trans, ca, &iter, &discard_pos_done, &s, false)));

 	trace_discard_buckets(c, s.seen, s.open, s.need_journal_commit, s.discarded,
 			      bch2_err_str(ret));
···
 	bch2_dev_do_discards(ca);
 }

-static int bch2_clear_bucket_needs_discard(struct btree_trans *trans, struct bpos bucket)
+static int bch2_do_discards_fast_one(struct btree_trans *trans,
+				     struct bch_dev *ca,
+				     u64 bucket,
+				     struct bpos *discard_pos_done,
+				     struct discard_buckets_state *s)
 {
-	struct btree_iter iter;
-	bch2_trans_iter_init(trans, &iter, BTREE_ID_alloc, bucket, BTREE_ITER_intent);
-	struct bkey_s_c k = bch2_btree_iter_peek_slot(&iter);
-	int ret = bkey_err(k);
+	struct btree_iter need_discard_iter;
+	struct bkey_s_c discard_k = bch2_bkey_get_iter(trans, &need_discard_iter,
+					BTREE_ID_need_discard, POS(ca->dev_idx, bucket), 0);
+	int ret = bkey_err(discard_k);
 	if (ret)
-		goto err;
+		return ret;

-	struct bkey_i_alloc_v4 *a = bch2_alloc_to_v4_mut(trans, k);
-	ret = PTR_ERR_OR_ZERO(a);
-	if (ret)
-		goto err;
+	if (log_fsck_err_on(discard_k.k->type != KEY_TYPE_set,
+			    trans, discarding_bucket_not_in_need_discard_btree,
+			    "attempting to discard bucket %u:%llu not in need_discard btree",
+			    ca->dev_idx, bucket))
+		goto out;

-	BUG_ON(a->v.dirty_sectors);
-	SET_BCH_ALLOC_V4_NEED_DISCARD(&a->v, false);
-	alloc_data_type_set(&a->v, a->v.data_type);
-
-	ret = bch2_trans_update(trans, &iter, &a->k_i, 0);
-err:
-	bch2_trans_iter_exit(trans, &iter);
+	ret = bch2_discard_one_bucket(trans, ca, &need_discard_iter, discard_pos_done, s, true);
+out:
+fsck_err:
+	bch2_trans_iter_exit(trans, &need_discard_iter);
 	return ret;
 }

···
 {
 	struct bch_dev *ca = container_of(work, struct bch_dev, discard_fast_work);
 	struct bch_fs *c = ca->fs;
+	struct discard_buckets_state s = {};
+	struct bpos discard_pos_done = POS_MAX;
+	struct btree_trans *trans = bch2_trans_get(c);
+	int ret = 0;

 	while (1) {
 		bool got_bucket = false;
···
 		if (!got_bucket)
 			break;

-		if (ca->mi.discard && !c->opts.nochanges)
-			blkdev_issue_discard(ca->disk_sb.bdev,
-					     bucket_to_sector(ca, bucket),
-					     ca->mi.bucket_size,
-					     GFP_KERNEL);
-
-		int ret = bch2_trans_commit_do(c, NULL, NULL,
-					       BCH_WATERMARK_btree|
-					       BCH_TRANS_COMMIT_no_enospc,
-					       bch2_clear_bucket_needs_discard(trans, POS(ca->dev_idx, bucket)));
+		ret = lockrestart_do(trans,
+			bch2_do_discards_fast_one(trans, ca, bucket, &discard_pos_done, &s));
 		bch_err_fn(c, ret);

 		discard_in_flight_remove(ca, bucket);
···
 			break;
 	}

+	trace_discard_buckets(c, s.seen, s.open, s.need_journal_commit, s.discarded, bch2_err_str(ret));
+
+	bch2_trans_put(trans);
 	percpu_ref_put(&ca->io_ref);
 	bch2_write_ref_put(c, BCH_WRITE_REF_discard_fast);
 }
···
 		return 1;

 	if (!bch2_dev_bucket_exists(c, bucket)) {
-		prt_str(&buf, "lru entry points to invalid bucket");
-		goto err;
+		if (fsck_err(trans, lru_entry_to_invalid_bucket,
+			     "lru key points to nonexistent device:bucket %llu:%llu",
+			     bucket.inode, bucket.offset))
+			return bch2_btree_bit_mod_buffered(trans, BTREE_ID_lru, lru_iter->pos, false);
+		goto out;
 	}

 	if (bch2_bucket_is_open_safe(c, bucket.inode, bucket.offset))
···
 	trace_and_count(c, bucket_invalidate, c, bucket.inode, bucket.offset, cached_sectors);
 	--*nr_to_invalidate;
 out:
+fsck_err:
 	printbuf_exit(&buf);
 	return ret;
-err:
-	prt_str(&buf, "\n lru key: ");
-	bch2_bkey_val_to_text(&buf, c, lru_k);
-
-	prt_str(&buf, "\n lru entry: ");
-	bch2_lru_pos_to_text(&buf, lru_iter->pos);
-
-	prt_str(&buf, "\n alloc key: ");
-	if (!a)
-		bch2_bpos_to_text(&buf, bucket);
-	else
-		bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&a->k_i));
-
-	bch_err(c, "%s", buf.buf);
-	if (c->curr_recovery_pass > BCH_RECOVERY_PASS_check_lrus) {
-		bch2_inconsistent_error(c);
-		ret = -EINVAL;
-	}
-
-	goto out;
 }

 static struct bkey_s_c next_lru_key(struct btree_trans *trans, struct btree_iter *iter,
···
 {
 	struct bkey_s_c k;
 again:
-	k = bch2_btree_iter_peek_upto(iter, lru_pos(ca->dev_idx, U64_MAX, LRU_TIME_MAX));
+	k = bch2_btree_iter_peek_max(iter, lru_pos(ca->dev_idx, U64_MAX, LRU_TIME_MAX));
 	if (!k.k && !*wrapped) {
 		bch2_btree_iter_set_pos(iter, lru_pos(ca->dev_idx, 0, 0));
 		*wrapped = true;
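Reading note on the return convention above: the new bch2_check_discard_freespace_key_fsck() wrapper collapses any non-negative result to 0, while the allocator-side caller later in this diff treats a negative value as an error and a positive value as "skip this key". The following is a minimal user-space sketch of that three-way convention, under the assumption that a positive return means "key is bad, do not use it". The names check_key, check_key_fsck and try_use_key are hypothetical and are not bcachefs functions.

```c
#include <errno.h>

/* Hypothetical key states, used only to drive the sketch. */
enum key_state { KEY_OK, KEY_STALE, KEY_IO_ERROR };

/*
 * Hypothetical checker following the convention assumed above:
 *   < 0  hard error
 *   == 0 key is valid
 *   > 0  key is invalid; caller should skip it
 */
static int check_key(enum key_state st)
{
	switch (st) {
	case KEY_OK:
		return 0;
	case KEY_STALE:
		return 1;
	default:
		return -EIO;
	}
}

/* fsck-style wrapper: only real errors propagate, "skip" becomes success. */
static int check_key_fsck(enum key_state st)
{
	int ret = check_key(st);

	return ret < 0 ? ret : 0;
}

/* Allocator-style caller: 1 = use the key, 0 = skip it, < 0 = error. */
static int try_use_key(enum key_state st)
{
	int ret = check_key(st);

	if (ret < 0)
		return ret;
	if (ret)
		return 0;
	return 1;
}
```

A for-each loop driven by the fsck-style wrapper keeps iterating past stale keys, whereas the allocator-style caller moves on to the next candidate without treating the stale key as a failure.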
+11 -7
fs/bcachefs/alloc_background.h
···
 #include "debug.h"
 #include "super.h"

-enum bch_validate_flags;
-
 /* How out of date a pointer gen is allowed to be: */
 #define BUCKET_GC_GEN_MAX	96U

···

 int bch2_bucket_io_time_reset(struct btree_trans *, unsigned, size_t, int);

-int bch2_alloc_v1_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
-int bch2_alloc_v2_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
-int bch2_alloc_v3_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
-int bch2_alloc_v4_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_alloc_v1_validate(struct bch_fs *, struct bkey_s_c,
+			   struct bkey_validate_context);
+int bch2_alloc_v2_validate(struct bch_fs *, struct bkey_s_c,
+			   struct bkey_validate_context);
+int bch2_alloc_v3_validate(struct bch_fs *, struct bkey_s_c,
+			   struct bkey_validate_context);
+int bch2_alloc_v4_validate(struct bch_fs *, struct bkey_s_c,
+			   struct bkey_validate_context);
 void bch2_alloc_v4_swab(struct bkey_s);
 void bch2_alloc_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);

···
 })

 int bch2_bucket_gens_validate(struct bch_fs *, struct bkey_s_c,
-			      enum bch_validate_flags);
+			      struct bkey_validate_context);
 void bch2_bucket_gens_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);

 #define bch2_bkey_ops_bucket_gens ((struct bkey_ops) {	\
···
 int bch2_trigger_alloc(struct btree_trans *, enum btree_id, unsigned,
 		       struct bkey_s_c, struct bkey_s,
 		       enum btree_iter_update_trigger_flags);
+
+int bch2_check_discard_freespace_key(struct btree_trans *, struct btree_iter *, u8 *, bool);
 int bch2_check_alloc_info(struct bch_fs *);
 int bch2_check_alloc_to_lru_refs(struct bch_fs *);
 void bch2_dev_do_discards(struct bch_dev *);
+2 -2
fs/bcachefs/alloc_background_format.h
···

 struct bch_alloc_v4 {
 	struct bch_val		v;
-	__u64			journal_seq;
+	__u64			journal_seq_nonempty;
 	__u32			flags;
 	__u8			gen;
 	__u8			oldest_gen;
···
 	__u32			stripe;
 	__u32			nr_external_backpointers;
 	/* end of fields in original version of alloc_v4 */
-	__u64			_fragmentation_lru; /* obsolete */
+	__u64			journal_seq_empty;
 	__u32			stripe_sectors;
 	__u32			pad;
 } __packed __aligned(8);
+108 -196
fs/bcachefs/alloc_foreground.c
···
 		return;
 	}

-	percpu_down_read(&c->mark_lock);
 	spin_lock(&ob->lock);
-
 	ob->valid = false;
 	ob->data_type = 0;
-
 	spin_unlock(&ob->lock);
-	percpu_up_read(&c->mark_lock);

 	spin_lock(&c->freelist_lock);
 	bch2_open_bucket_hash_remove(c, ob);
···
 	return ob;
 }

+static inline bool is_superblock_bucket(struct bch_fs *c, struct bch_dev *ca, u64 b)
+{
+	if (c->curr_recovery_pass > BCH_RECOVERY_PASS_trans_mark_dev_sbs)
+		return false;
+
+	return bch2_is_superblock_bucket(ca, b);
+}
+
 static void open_bucket_free_unused(struct bch_fs *c, struct open_bucket *ob)
 {
 	BUG_ON(c->open_buckets_partial_nr >=
···
 	closure_wake_up(&c->freelist_wait);
 }

-/* _only_ for allocating the journal on a new device: */
-long bch2_bucket_alloc_new_fs(struct bch_dev *ca)
-{
-	while (ca->new_fs_bucket_idx < ca->mi.nbuckets) {
-		u64 b = ca->new_fs_bucket_idx++;
-
-		if (!is_superblock_bucket(ca, b) &&
-		    (!ca->buckets_nouse || !test_bit(b, ca->buckets_nouse)))
-			return b;
-	}
-
-	return -1;
-}
-
 static inline unsigned open_buckets_reserved(enum bch_watermark watermark)
 {
 	switch (watermark) {
···
 	}
 }

-static struct open_bucket *__try_alloc_bucket(struct bch_fs *c, struct bch_dev *ca,
-					      u64 bucket,
-					      enum bch_watermark watermark,
-					      const struct bch_alloc_v4 *a,
-					      struct bucket_alloc_state *s,
-					      struct closure *cl)
+static inline bool may_alloc_bucket(struct bch_fs *c,
+				    struct bpos bucket,
+				    struct bucket_alloc_state *s)
 {
-	struct open_bucket *ob;
-
-	if (unlikely(ca->buckets_nouse && test_bit(bucket, ca->buckets_nouse))) {
-		s->skipped_nouse++;
-		return NULL;
-	}
-
-	if (bch2_bucket_is_open(c, ca->dev_idx, bucket)) {
+	if (bch2_bucket_is_open(c, bucket.inode, bucket.offset)) {
 		s->skipped_open++;
-		return NULL;
+		return false;
 	}

 	if (bch2_bucket_needs_journal_commit(&c->buckets_waiting_for_journal,
-			c->journal.flushed_seq_ondisk, ca->dev_idx, bucket)) {
+			c->journal.flushed_seq_ondisk, bucket.inode, bucket.offset)) {
 		s->skipped_need_journal_commit++;
-		return NULL;
+		return false;
 	}

-	if (bch2_bucket_nocow_is_locked(&c->nocow_locks, POS(ca->dev_idx, bucket))) {
+	if (bch2_bucket_nocow_is_locked(&c->nocow_locks, bucket)) {
 		s->skipped_nocow++;
+		return false;
+	}
+
+	return true;
+}
+
+static struct open_bucket *__try_alloc_bucket(struct bch_fs *c, struct bch_dev *ca,
+					      u64 bucket, u8 gen,
+					      enum bch_watermark watermark,
+					      struct bucket_alloc_state *s,
+					      struct closure *cl)
+{
+	if (unlikely(is_superblock_bucket(c, ca, bucket)))
+		return NULL;
+
+	if (unlikely(ca->buckets_nouse && test_bit(bucket, ca->buckets_nouse))) {
+		s->skipped_nouse++;
 		return NULL;
 	}

···
 		return NULL;
 	}

-	ob = bch2_open_bucket_alloc(c);
+	struct open_bucket *ob = bch2_open_bucket_alloc(c);

 	spin_lock(&ob->lock);
-
 	ob->valid = true;
 	ob->sectors_free = ca->mi.bucket_size;
 	ob->dev = ca->dev_idx;
-	ob->gen = a->gen;
+	ob->gen = gen;
 	ob->bucket = bucket;
 	spin_unlock(&ob->lock);

···
 }

 static struct open_bucket *try_alloc_bucket(struct btree_trans *trans, struct bch_dev *ca,
-					    enum bch_watermark watermark, u64 free_entry,
+					    enum bch_watermark watermark,
 					    struct bucket_alloc_state *s,
-					    struct bkey_s_c freespace_k,
+					    struct btree_iter *freespace_iter,
 					    struct closure *cl)
 {
 	struct bch_fs *c = trans->c;
-	struct btree_iter iter = { NULL };
-	struct bkey_s_c k;
-	struct open_bucket *ob;
-	struct bch_alloc_v4 a_convert;
-	const struct bch_alloc_v4 *a;
-	u64 b = free_entry & ~(~0ULL << 56);
-	unsigned genbits = free_entry >> 56;
-	struct printbuf buf = PRINTBUF;
-	int ret;
+	u64 b = freespace_iter->pos.offset & ~(~0ULL << 56);

-	if (b < ca->mi.first_bucket || b >= ca->mi.nbuckets) {
-		prt_printf(&buf, "freespace btree has bucket outside allowed range %u-%llu\n"
-		       " freespace key ",
-			ca->mi.first_bucket, ca->mi.nbuckets);
-		bch2_bkey_val_to_text(&buf, c, freespace_k);
-		bch2_trans_inconsistent(trans, "%s", buf.buf);
-		ob = ERR_PTR(-EIO);
-		goto err;
-	}
+	if (!may_alloc_bucket(c, POS(ca->dev_idx, b), s))
+		return NULL;

-	k = bch2_bkey_get_iter(trans, &iter,
-			       BTREE_ID_alloc, POS(ca->dev_idx, b),
-			       BTREE_ITER_cached);
-	ret = bkey_err(k);
-	if (ret) {
-		ob = ERR_PTR(ret);
-		goto err;
-	}
+	u8 gen;
+	int ret = bch2_check_discard_freespace_key(trans, freespace_iter, &gen, true);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (ret)
+		return NULL;

-	a = bch2_alloc_to_v4(k, &a_convert);
-
-	if (a->data_type != BCH_DATA_free) {
-		if (c->curr_recovery_pass <= BCH_RECOVERY_PASS_check_alloc_info) {
-			ob = NULL;
-			goto err;
-		}
-
-		prt_printf(&buf, "non free bucket in freespace btree\n"
-		       " freespace key ");
-		bch2_bkey_val_to_text(&buf, c, freespace_k);
-		prt_printf(&buf, "\n ");
-		bch2_bkey_val_to_text(&buf, c, k);
-		bch2_trans_inconsistent(trans, "%s", buf.buf);
-		ob = ERR_PTR(-EIO);
-		goto err;
-	}
-
-	if (genbits != (alloc_freespace_genbits(*a) >> 56) &&
-	    c->curr_recovery_pass > BCH_RECOVERY_PASS_check_alloc_info) {
-		prt_printf(&buf, "bucket in freespace btree with wrong genbits (got %u should be %llu)\n"
-		       " freespace key ",
-		       genbits, alloc_freespace_genbits(*a) >> 56);
-		bch2_bkey_val_to_text(&buf, c, freespace_k);
-		prt_printf(&buf, "\n ");
-		bch2_bkey_val_to_text(&buf, c, k);
-		bch2_trans_inconsistent(trans, "%s", buf.buf);
-		ob = ERR_PTR(-EIO);
-		goto err;
-	}
-
-	if (c->curr_recovery_pass <= BCH_RECOVERY_PASS_check_extents_to_backpointers) {
-		struct bch_backpointer bp;
-		struct bpos bp_pos = POS_MIN;
-
-		ret = bch2_get_next_backpointer(trans, ca, POS(ca->dev_idx, b), -1,
-						&bp_pos, &bp,
-						BTREE_ITER_nopreserve);
-		if (ret) {
-			ob = ERR_PTR(ret);
-			goto err;
-		}
-
-		if (!bkey_eq(bp_pos, POS_MAX)) {
-			/*
-			 * Bucket may have data in it - we don't call
-			 * bc2h_trans_inconnsistent() because fsck hasn't
-			 * finished yet
-			 */
-			ob = NULL;
-			goto err;
-		}
-	}
-
-	ob = __try_alloc_bucket(c, ca, b, watermark, a, s, cl);
-	if (!ob)
-		bch2_set_btree_iter_dontneed(&iter);
-err:
-	if (iter.path)
-		bch2_set_btree_iter_dontneed(&iter);
-	bch2_trans_iter_exit(trans, &iter);
-	printbuf_exit(&buf);
-	return ob;
+	return __try_alloc_bucket(c, ca, b, gen, watermark, s, cl);
 }

 /*
  * This path is for before the freespace btree is initialized:
- *
- * If ca->new_fs_bucket_idx is nonzero, we haven't yet marked superblock &
- * journal buckets - journal buckets will be < ca->new_fs_bucket_idx
  */
 static noinline struct open_bucket *
 bch2_bucket_alloc_early(struct btree_trans *trans,
···
 			struct bucket_alloc_state *s,
 			struct closure *cl)
 {
+	struct bch_fs *c = trans->c;
 	struct btree_iter iter, citer;
 	struct bkey_s_c k, ck;
 	struct open_bucket *ob = NULL;
-	u64 first_bucket = max_t(u64, ca->mi.first_bucket, ca->new_fs_bucket_idx);
+	u64 first_bucket = ca->mi.first_bucket;
 	u64 *dev_alloc_cursor = &ca->alloc_cursor[s->btree_bitmap];
 	u64 alloc_start = max(first_bucket, *dev_alloc_cursor);
 	u64 alloc_cursor = alloc_start;
···

 		if (bkey_ge(k.k->p, POS(ca->dev_idx, ca->mi.nbuckets)))
 			break;
-
-		if (ca->new_fs_bucket_idx &&
-		    is_superblock_bucket(ca, k.k->p.offset))
-			continue;

 		if (s->btree_bitmap != BTREE_BITMAP_ANY &&
 		    s->btree_bitmap != bch2_dev_btree_bitmap_marked_sectors(ca,
···

 		s->buckets_seen++;

-		ob = __try_alloc_bucket(trans->c, ca, k.k->p.offset, watermark, a, s, cl);
+		ob = may_alloc_bucket(c, k.k->p, s)
+			? __try_alloc_bucket(c, ca, k.k->p.offset, a->gen,
+					     watermark, s, cl)
+			: NULL;
 next:
 		bch2_set_btree_iter_dontneed(&citer);
 		bch2_trans_iter_exit(trans, &citer);
···
 	u64 alloc_start = max_t(u64, ca->mi.first_bucket, READ_ONCE(*dev_alloc_cursor));
 	u64 alloc_cursor = alloc_start;
 	int ret;
-
-	BUG_ON(ca->new_fs_bucket_idx);
 again:
-	for_each_btree_key_norestart(trans, iter, BTREE_ID_freespace,
-				     POS(ca->dev_idx, alloc_cursor), 0, k, ret) {
-		if (k.k->p.inode != ca->dev_idx)
-			break;
+	for_each_btree_key_max_norestart(trans, iter, BTREE_ID_freespace,
+					 POS(ca->dev_idx, alloc_cursor),
+					 POS(ca->dev_idx, U64_MAX),
+					 0, k, ret) {
+		/*
+		 * peek normally dosen't trim extents - they can span iter.pos,
+		 * which is not what we want here:
+		 */
+		iter.k.size = iter.k.p.offset - iter.pos.offset;

-		for (alloc_cursor = max(alloc_cursor, bkey_start_offset(k.k));
-		     alloc_cursor < k.k->p.offset;
-		     alloc_cursor++) {
+		while (iter.k.size) {
 			s->buckets_seen++;

-			u64 bucket = alloc_cursor & ~(~0ULL << 56);
+			u64 bucket = iter.pos.offset & ~(~0ULL << 56);
 			if (s->btree_bitmap != BTREE_BITMAP_ANY &&
 			    s->btree_bitmap != bch2_dev_btree_bitmap_marked_sectors(ca,
 					bucket_to_sector(ca, bucket), ca->mi.bucket_size)) {
···
 					goto fail;

 				bucket = sector_to_bucket(ca,
-						round_up(bucket_to_sector(ca, bucket) + 1,
+						round_up(bucket_to_sector(ca, bucket + 1),
 							 1ULL << ca->mi.btree_bitmap_shift));
-				u64 genbits = alloc_cursor >> 56;
-				alloc_cursor = bucket | (genbits << 56);
+				alloc_cursor = bucket|(iter.pos.offset & (~0ULL << 56));

-				if (alloc_cursor > k.k->p.offset)
-					bch2_btree_iter_set_pos(&iter, POS(ca->dev_idx, alloc_cursor));
+				bch2_btree_iter_set_pos(&iter, POS(ca->dev_idx, alloc_cursor));
 				s->skipped_mi_btree_bitmap++;
-				continue;
+				goto next;
 			}

-			ob = try_alloc_bucket(trans, ca, watermark,
-					      alloc_cursor, s, k, cl);
+			ob = try_alloc_bucket(trans, ca, watermark, s, &iter, cl);
 			if (ob) {
+				if (!IS_ERR(ob))
+					*dev_alloc_cursor = iter.pos.offset;
 				bch2_set_btree_iter_dontneed(&iter);
 				break;
 			}
-		}

+			iter.k.size--;
+			iter.pos.offset++;
+		}
+next:
 		if (ob || ret)
 			break;
 	}
 fail:
 	bch2_trans_iter_exit(trans, &iter);

-	if (!ob && ret)
+	BUG_ON(ob && ret);
+
+	if (ret)
 		ob = ERR_PTR(ret);

 	if (!ob && alloc_start > ca->mi.first_bucket) {
 		alloc_cursor = alloc_start = ca->mi.first_bucket;
 		goto again;
 	}
-
-	*dev_alloc_cursor = alloc_cursor;

 	return ob;
 }
···
  * @watermark:	how important is this allocation?
  * @data_type:	BCH_DATA_journal, btree, user...
  * @cl:		if not NULL, closure to be used to wait if buckets not available
+ * @nowait:	if true, do not wait for buckets to become available
  * @usage:	for secondarily also returning the current device usage
  *
  * Returns:	an open_bucket on success, or an ERR_PTR() on failure.
···
 		bch2_dev_do_invalidates(ca);

 	if (!avail) {
+		if (watermark > BCH_WATERMARK_normal &&
+		    c->curr_recovery_pass <= BCH_RECOVERY_PASS_check_allocations)
+			goto alloc;
+
 		if (cl && !waiting) {
 			closure_wait(&c->freelist_wait, cl);
 			waiting = true;
···
 	unsigned i;

 	for_each_set_bit(i, devs->d, BCH_SB_MEMBERS_MAX)
-		ret.devs[ret.nr++] = i;
+		ret.data[ret.nr++] = i;

-	bubble_sort(ret.devs, ret.nr, dev_stripe_cmp);
+	bubble_sort(ret.data, ret.nr, dev_stripe_cmp);
 	return ret;
 }

···
 		      struct closure *cl)
 {
 	struct bch_fs *c = trans->c;
-	struct dev_alloc_list devs_sorted =
-		bch2_dev_alloc_list(c, stripe, devs_may_alloc);
 	int ret = -BCH_ERR_insufficient_devices;

 	BUG_ON(*nr_effective >= nr_replicas);

-	for (unsigned i = 0; i < devs_sorted.nr; i++) {
-		struct bch_dev_usage usage;
-		struct open_bucket *ob;
-
-		unsigned dev = devs_sorted.devs[i];
-		struct bch_dev *ca = bch2_dev_tryget_noerror(c, dev);
+	struct dev_alloc_list devs_sorted = bch2_dev_alloc_list(c, stripe, devs_may_alloc);
+	darray_for_each(devs_sorted, i) {
+		struct bch_dev *ca = bch2_dev_tryget_noerror(c, *i);
 		if (!ca)
 			continue;

···
 			continue;
 		}

-		ob = bch2_bucket_alloc_trans(trans, ca, watermark, data_type,
-				cl, flags & BCH_WRITE_ALLOC_NOWAIT, &usage);
+		struct bch_dev_usage usage;
+		struct open_bucket *ob = bch2_bucket_alloc_trans(trans, ca, watermark, data_type,
+						cl, flags & BCH_WRITE_ALLOC_NOWAIT, &usage);
 		if (!IS_ERR(ob))
 			bch2_dev_stripe_increment_inlined(ca, stripe, &usage);
 		bch2_dev_put(ca);
···
 			 struct closure *cl)
 {
 	struct bch_fs *c = trans->c;
-	struct dev_alloc_list devs_sorted;
-	struct ec_stripe_head *h;
-	struct open_bucket *ob;
-	unsigned i, ec_idx;
 	int ret = 0;

 	if (nr_replicas < 2)
···
 	if (ec_open_bucket(c, ptrs))
 		return 0;

-	h = bch2_ec_stripe_head_get(trans, target, 0, nr_replicas - 1, watermark, cl);
+	struct ec_stripe_head *h =
+		bch2_ec_stripe_head_get(trans, target, 0, nr_replicas - 1, watermark, cl);
 	if (IS_ERR(h))
 		return PTR_ERR(h);
 	if (!h)
 		return 0;

-	devs_sorted = bch2_dev_alloc_list(c, &wp->stripe, devs_may_alloc);
-
-	for (i = 0; i < devs_sorted.nr; i++)
-		for (ec_idx = 0; ec_idx < h->s->nr_data; ec_idx++) {
+	struct dev_alloc_list devs_sorted = bch2_dev_alloc_list(c, &wp->stripe, devs_may_alloc);
+	darray_for_each(devs_sorted, i)
+		for (unsigned ec_idx = 0; ec_idx < h->s->nr_data; ec_idx++) {
 			if (!h->s->blocks[ec_idx])
 				continue;

-			ob = c->open_buckets + h->s->blocks[ec_idx];
-			if (ob->dev == devs_sorted.devs[i] &&
-			    !test_and_set_bit(ec_idx, h->s->blocks_allocated))
-				goto got_bucket;
-		}
-	goto out_put_head;
-got_bucket:
-	ob->ec_idx	= ec_idx;
-	ob->ec		= h->s;
-	ec_stripe_new_get(h->s, STRIPE_REF_io);
+			struct open_bucket *ob = c->open_buckets + h->s->blocks[ec_idx];
+			if (ob->dev == *i && !test_and_set_bit(ec_idx, h->s->blocks_allocated)) {
+				ob->ec_idx	= ec_idx;
+				ob->ec		= h->s;
+				ec_stripe_new_get(h->s, STRIPE_REF_io);

-	ret = add_new_bucket(c, ptrs, devs_may_alloc,
-			     nr_replicas, nr_effective,
-			     have_cache, ob);
-out_put_head:
+				ret = add_new_bucket(c, ptrs, devs_may_alloc,
+						     nr_replicas, nr_effective,
+						     have_cache, ob);
+				goto out;
+			}
+		}
+out:
 	bch2_ec_stripe_head_put(c, h);
 	return ret;
 }
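Reading note on the freespace key layout used above: the allocator masks a freespace key offset with ~(~0ULL << 56) to get the bucket number, and the removed code read the top 8 bits (offset >> 56) as "genbits". The sketch below reproduces only that bit arithmetic in user space. The helper names and the FREESPACE_GENBITS_SHIFT macro are hypothetical, and the 56-bit split is taken from the expressions in the diff rather than from any format documentation.

```c
#include <stdint.h>

/* Hypothetical name; the value 56 comes from the masks in the diff above. */
#define FREESPACE_GENBITS_SHIFT	56

/* Low 56 bits: bucket number within the device. */
static inline uint64_t freespace_bucket(uint64_t offset)
{
	return offset & ~(~0ULL << FREESPACE_GENBITS_SHIFT);
}

/* High 8 bits: generation bits stored alongside the bucket number. */
static inline unsigned freespace_genbits(uint64_t offset)
{
	return (unsigned)(offset >> FREESPACE_GENBITS_SHIFT);
}

/*
 * Replace the bucket number while keeping the high bits, mirroring
 * "bucket|(iter.pos.offset & (~0ULL << 56))" in the new cursor update.
 */
static inline uint64_t freespace_with_bucket(uint64_t offset, uint64_t bucket)
{
	return bucket | (offset & (~0ULL << FREESPACE_GENBITS_SHIFT));
}
```

For an offset built as (3ULL << 56) | 1234, the helpers yield bucket 1234 and genbits 3, and swapping in bucket 2000 leaves the top byte unchanged.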
+1 -3
fs/bcachefs/alloc_foreground.h
···

 struct dev_alloc_list {
 	unsigned	nr;
-	u8		devs[BCH_SB_MEMBERS_MAX];
+	u8		data[BCH_SB_MEMBERS_MAX];
 };

 struct dev_alloc_list bch2_dev_alloc_list(struct bch_fs *,
 					  struct dev_stripe_state *,
 					  struct bch_devs_mask *);
 void bch2_dev_stripe_increment(struct bch_dev *, struct dev_stripe_state *);
-
-long bch2_bucket_alloc_new_fs(struct bch_dev *);

 static inline struct bch_dev *ob_dev(struct bch_fs *c, struct open_bucket *ob)
 {
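Reading note on the devs to data rename: the callers in alloc_foreground.c now walk the list with darray_for_each(devs_sorted, i) and dereference *i, which suggests the macro only needs a struct with nr and data members. The sketch below shows that pattern in user space. dev_list, MEMBERS_MAX and dev_list_for_each are hypothetical stand-ins, and the macro body is an assumption about how such an iterator can be written, not a copy of the kernel's definition.

```c
#include <stddef.h>

#define MEMBERS_MAX	64	/* hypothetical stand-in for BCH_SB_MEMBERS_MAX */

/* Same shape as struct dev_alloc_list after the rename: a count plus data[]. */
struct dev_list {
	unsigned	nr;
	unsigned char	data[MEMBERS_MAX];
};

/*
 * Simplified darray-style iterator (assumption, not the kernel macro):
 * _i is a pointer into data[], so the loop body reads the element as *_i.
 * Uses the GCC/Clang __typeof__ extension.
 */
#define dev_list_for_each(_l, _i)					\
	for (__typeof__(&(_l).data[0]) _i = (_l).data;			\
	     _i < (_l).data + (_l).nr;					\
	     _i++)

/* Example consumer: sum the device indices stored in the list. */
static unsigned sum_devs(struct dev_list l)
{
	unsigned sum = 0;

	dev_list_for_each(l, i)
		sum += *i;
	return sum;
}
```

Because the iterator depends only on the member names, any struct that exposes nr and data can reuse it, which would explain why the field was renamed instead of adding a second macro.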
+519 -321
fs/bcachefs/backpointers.c
···

 #include <linux/mm.h>

-static bool extent_matches_bp(struct bch_fs *c,
-			      enum btree_id btree_id, unsigned level,
-			      struct bkey_s_c k,
-			      struct bpos bucket,
-			      struct bch_backpointer bp)
-{
-	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
-	const union bch_extent_entry *entry;
-	struct extent_ptr_decoded p;
-
-	rcu_read_lock();
-	bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
-		struct bpos bucket2;
-		struct bch_backpointer bp2;
-
-		if (p.ptr.cached)
-			continue;
-
-		struct bch_dev *ca = bch2_dev_rcu(c, p.ptr.dev);
-		if (!ca)
-			continue;
-
-		bch2_extent_ptr_to_bp(c, ca, btree_id, level, k, p, entry, &bucket2, &bp2);
-		if (bpos_eq(bucket, bucket2) &&
-		    !memcmp(&bp, &bp2, sizeof(bp))) {
-			rcu_read_unlock();
-			return true;
-		}
-	}
-	rcu_read_unlock();
-
-	return false;
-}
-
 int bch2_backpointer_validate(struct bch_fs *c, struct bkey_s_c k,
-			      enum bch_validate_flags flags)
+			      struct bkey_validate_context from)
 {
 	struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k);
 	int ret = 0;
···
 			 "backpointer level bad: %u >= %u",
 			 bp.v->level, BTREE_MAX_DEPTH);

-	rcu_read_lock();
-	struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp.k->p.inode);
-	if (!ca) {
-		/* these will be caught by fsck */
-		rcu_read_unlock();
-		return 0;
-	}
-
-	struct bpos bucket = bp_pos_to_bucket(ca, bp.k->p);
-	struct bpos bp_pos = bucket_pos_to_bp_noerror(ca, bucket, bp.v->bucket_offset);
-	rcu_read_unlock();
-
-	bkey_fsck_err_on((bp.v->bucket_offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT) >= ca->mi.bucket_size ||
-			 !bpos_eq(bp.k->p, bp_pos),
-			 c, backpointer_bucket_offset_wrong,
-			 "backpointer bucket_offset wrong");
+	bkey_fsck_err_on(bp.k->p.inode == BCH_SB_MEMBER_INVALID,
+			 c, backpointer_dev_bad,
+			 "backpointer for BCH_SB_MEMBER_INVALID");
 fsck_err:
 	return ret;
 }

-void bch2_backpointer_to_text(struct printbuf *out, const struct bch_backpointer *bp)
+void bch2_backpointer_to_text(struct printbuf *out, struct bch_fs *c, struct bkey_s_c k)
 {
-	prt_printf(out, "btree=%s l=%u offset=%llu:%u len=%u pos=",
-	       bch2_btree_id_str(bp->btree_id),
-	       bp->level,
-	       (u64) (bp->bucket_offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT),
-	       (u32) bp->bucket_offset & ~(~0U << MAX_EXTENT_COMPRESS_RATIO_SHIFT),
-	       bp->bucket_len);
-	bch2_bpos_to_text(out, bp->pos);
-}
+	struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k);

-void bch2_backpointer_k_to_text(struct printbuf *out, struct bch_fs *c, struct bkey_s_c k)
-{
 	rcu_read_lock();
-	struct bch_dev *ca = bch2_dev_rcu_noerror(c, k.k->p.inode);
+	struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp.k->p.inode);
 	if (ca) {
-		struct bpos bucket = bp_pos_to_bucket(ca, k.k->p);
+		u32 bucket_offset;
+		struct bpos bucket = bp_pos_to_bucket_and_offset(ca, bp.k->p, &bucket_offset);
 		rcu_read_unlock();
-		prt_str(out, "bucket=");
-		bch2_bpos_to_text(out, bucket);
-		prt_str(out, " ");
+		prt_printf(out, "bucket=%llu:%llu:%u ", bucket.inode, bucket.offset, bucket_offset);
 	} else {
 		rcu_read_unlock();
+		prt_printf(out, "sector=%llu:%llu ", bp.k->p.inode, bp.k->p.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT);
 	}

-	bch2_backpointer_to_text(out, bkey_s_c_to_backpointer(k).v);
+	bch2_btree_id_level_to_text(out, bp.v->btree_id, bp.v->level);
+	prt_printf(out, " suboffset=%u len=%u gen=%u pos=",
+		   (u32) bp.k->p.offset & ~(~0U << MAX_EXTENT_COMPRESS_RATIO_SHIFT),
+		   bp.v->bucket_len,
+		   bp.v->bucket_gen);
+	bch2_bpos_to_text(out, bp.v->pos);
 }

 void bch2_backpointer_swab(struct bkey_s k)
 {
 	struct bkey_s_backpointer bp = bkey_s_to_backpointer(k);

-	bp.v->bucket_offset	= swab40(bp.v->bucket_offset);
 	bp.v->bucket_len	= swab32(bp.v->bucket_len);
 	bch2_bpos_swab(&bp.v->pos);
 }

+static bool extent_matches_bp(struct bch_fs *c,
+			      enum btree_id btree_id, unsigned level,
+			      struct bkey_s_c k,
+			      struct bkey_s_c_backpointer bp)
+{
+	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
+	const union bch_extent_entry *entry;
+	struct extent_ptr_decoded p;
+
+	bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
+		struct bkey_i_backpointer bp2;
+		bch2_extent_ptr_to_bp(c, btree_id, level, k, p, entry, &bp2);
+
+		if (bpos_eq(bp.k->p, bp2.k.p) &&
+		    !memcmp(bp.v, &bp2.v, sizeof(bp2.v)))
+			return true;
+	}
+
+	return false;
+}
+
 static noinline int backpointer_mod_err(struct btree_trans *trans,
-					struct bch_backpointer bp,
-					struct bkey_s_c bp_k,
 					struct bkey_s_c orig_k,
+					struct bkey_i_backpointer *new_bp,
+					struct bkey_s_c found_bp,
 					bool insert)
 {
 	struct bch_fs *c = trans->c;
···

 	if (insert) {
 		prt_printf(&buf, "existing backpointer found when inserting ");
-		bch2_backpointer_to_text(&buf, &bp);
+		bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&new_bp->k_i));
 		prt_newline(&buf);
 		printbuf_indent_add(&buf, 2);

 		prt_printf(&buf, "found ");
-		bch2_bkey_val_to_text(&buf, c, bp_k);
+		bch2_bkey_val_to_text(&buf, c, found_bp);
 		prt_newline(&buf);

 		prt_printf(&buf, "for ");
···
 		printbuf_indent_add(&buf, 2);

 		prt_printf(&buf, "searching for ");
-		bch2_backpointer_to_text(&buf, &bp);
+		bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&new_bp->k_i));
 		prt_newline(&buf);

 		prt_printf(&buf, "got ");
-		bch2_bkey_val_to_text(&buf, c, bp_k);
+		bch2_bkey_val_to_text(&buf, c, found_bp);
 		prt_newline(&buf);

 		prt_printf(&buf, "for ");
···
 }

 int bch2_bucket_backpointer_mod_nowritebuffer(struct btree_trans *trans,
-				struct bch_dev *ca,
-				struct bpos bucket,
-				struct bch_backpointer bp,
 				struct bkey_s_c orig_k,
+				struct bkey_i_backpointer *bp,
 				bool insert)
 {
 	struct btree_iter bp_iter;
-	struct bkey_s_c k;
-	struct bkey_i_backpointer *bp_k;
-	int ret;
-
-	bp_k = bch2_trans_kmalloc_nomemzero(trans, sizeof(struct bkey_i_backpointer));
-	ret = PTR_ERR_OR_ZERO(bp_k);
-	if (ret)
-		return ret;
-
-	bkey_backpointer_init(&bp_k->k_i);
-	bp_k->k.p = bucket_pos_to_bp(ca, bucket, bp.bucket_offset);
-	bp_k->v = bp;
-
-	if (!insert) {
-		bp_k->k.type = KEY_TYPE_deleted;
-		set_bkey_val_u64s(&bp_k->k, 0);
-	}
-
-	k = bch2_bkey_get_iter(trans, &bp_iter, BTREE_ID_backpointers,
-			       bp_k->k.p,
+	struct bkey_s_c k = bch2_bkey_get_iter(trans, &bp_iter, BTREE_ID_backpointers,
+			       bp->k.p,
 			       BTREE_ITER_intent|
 			       BTREE_ITER_slots|
 			       BTREE_ITER_with_updates);
-	ret = bkey_err(k);
+	int ret = bkey_err(k);
 	if (ret)
-		goto err;
+		return ret;

 	if (insert
 	    ? k.k->type
 	    : (k.k->type != KEY_TYPE_backpointer ||
-	       memcmp(bkey_s_c_to_backpointer(k).v, &bp, sizeof(bp)))) {
-		ret = backpointer_mod_err(trans, bp, k, orig_k, insert);
+	       memcmp(bkey_s_c_to_backpointer(k).v, &bp->v, sizeof(bp->v)))) {
+		ret = backpointer_mod_err(trans, orig_k, bp, k, insert);
 		if (ret)
 			goto err;
 	}

-	ret = bch2_trans_update(trans, &bp_iter, &bp_k->k_i, 0);
+	if (!insert) {
+		bp->k.type = KEY_TYPE_deleted;
+		set_bkey_val_u64s(&bp->k, 0);
+	}
+
+	ret = bch2_trans_update(trans, &bp_iter, &bp->k_i, 0);
 err:
 	bch2_trans_iter_exit(trans, &bp_iter);
 	return ret;
 }

-/*
- * Find the next backpointer >= *bp_offset:
- */
-int bch2_get_next_backpointer(struct btree_trans *trans,
-			      struct bch_dev *ca,
-			      struct bpos bucket, int gen,
-			      struct bpos *bp_pos,
-			      struct bch_backpointer *bp,
-			      unsigned iter_flags)
+static int bch2_backpointer_del(struct btree_trans *trans, struct bpos pos)
 {
-	struct bpos bp_end_pos = bucket_pos_to_bp(ca, bpos_nosnap_successor(bucket), 0);
-	struct btree_iter alloc_iter = { NULL }, bp_iter = { NULL };
-	struct bkey_s_c k;
-	int ret = 0;
-
-	if (bpos_ge(*bp_pos, bp_end_pos))
-		goto done;
-
-	if (gen >= 0) {
-		k = bch2_bkey_get_iter(trans, &alloc_iter, BTREE_ID_alloc,
-				       bucket, BTREE_ITER_cached|iter_flags);
-		ret = bkey_err(k);
-		if (ret)
-			goto out;
-
-		if (k.k->type != KEY_TYPE_alloc_v4 ||
-		    bkey_s_c_to_alloc_v4(k).v->gen != gen)
-			goto done;
-	}
-
-	*bp_pos = bpos_max(*bp_pos, bucket_pos_to_bp(ca, bucket, 0));
-
-	for_each_btree_key_norestart(trans, bp_iter, BTREE_ID_backpointers,
-				     *bp_pos, iter_flags, k, ret) {
-		if (bpos_ge(k.k->p, bp_end_pos))
-			break;
-
-		*bp_pos = k.k->p;
-		*bp = *bkey_s_c_to_backpointer(k).v;
-		goto out;
-	}
-done:
-	*bp_pos = SPOS_MAX;
-out:
-	bch2_trans_iter_exit(trans, &bp_iter);
-	bch2_trans_iter_exit(trans, &alloc_iter);
-	return ret;
+	return (likely(!bch2_backpointers_no_use_write_buffer)
+		? bch2_btree_delete_at_buffered(trans, BTREE_ID_backpointers, pos)
+		: bch2_btree_delete(trans, BTREE_ID_backpointers, pos, 0)) ?:
+		bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc);
 }

-static void backpointer_not_found(struct btree_trans *trans,
-				  struct bpos bp_pos,
-				  struct bch_backpointer bp,
-				  struct bkey_s_c k)
+static inline int bch2_backpointers_maybe_flush(struct btree_trans *trans,
+						struct bkey_s_c visiting_k,
+						struct bkey_buf *last_flushed)
+{
+	return likely(!bch2_backpointers_no_use_write_buffer)
+		? bch2_btree_write_buffer_maybe_flush(trans, visiting_k, last_flushed)
+		: 0;
+}
+
+static int backpointer_target_not_found(struct btree_trans *trans,
+					struct bkey_s_c_backpointer bp,
+					struct bkey_s_c target_k,
+					struct bkey_buf *last_flushed)
 {
 	struct bch_fs *c = trans->c;
 	struct printbuf buf = PRINTBUF;
+	int ret = 0;

 	/*
 	 * If we're using the btree write buffer, the backpointer we were
 	 * looking at may have already been deleted - failure to find what it
 	 * pointed to is not an error:
 	 */
-	if (likely(!bch2_backpointers_no_use_write_buffer))
-		return;
-
-	struct bpos bucket;
-	if (!bp_pos_to_bucket_nodev(c, bp_pos, &bucket))
-		return;
+	ret = last_flushed
+		? bch2_backpointers_maybe_flush(trans, bp.s_c, last_flushed)
+		: 0;
+	if (ret)
+		return ret;

 	prt_printf(&buf, "backpointer doesn't match %s it points to:\n ",
-		   bp.level ? "btree node" : "extent");
-	prt_printf(&buf, "bucket: ");
-	bch2_bpos_to_text(&buf, bucket);
-	prt_printf(&buf, "\n ");
+		   bp.v->level ? "btree node" : "extent");
+	bch2_bkey_val_to_text(&buf, c, bp.s_c);

-	prt_printf(&buf, "backpointer pos: ");
-	bch2_bpos_to_text(&buf, bp_pos);
 	prt_printf(&buf, "\n ");
+	bch2_bkey_val_to_text(&buf, c, target_k);

-	bch2_backpointer_to_text(&buf, &bp);
-	prt_printf(&buf, "\n ");
-	bch2_bkey_val_to_text(&buf, c, k);
-	if (c->curr_recovery_pass >= BCH_RECOVERY_PASS_check_extents_to_backpointers)
-		bch_err_ratelimited(c, "%s", buf.buf);
-	else
-		bch2_trans_inconsistent(trans, "%s", buf.buf);
+	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(target_k);
+	const union bch_extent_entry *entry;
+	struct extent_ptr_decoded p;
+	bkey_for_each_ptr_decode(target_k.k, ptrs, p, entry)
+		if (p.ptr.dev == bp.k->p.inode) {
+			prt_printf(&buf, "\n ");
+			struct bkey_i_backpointer bp2;
+			bch2_extent_ptr_to_bp(c, bp.v->btree_id, bp.v->level, target_k, p, entry, &bp2);
+			bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&bp2.k_i));
+		}

+	if (fsck_err(trans, backpointer_to_missing_ptr,
+		     "%s", buf.buf))
+		ret = bch2_backpointer_del(trans, bp.k->p);
+fsck_err:
 	printbuf_exit(&buf);
+	return ret;
 }

 struct bkey_s_c bch2_backpointer_get_key(struct btree_trans *trans,
+					 struct bkey_s_c_backpointer bp,
 					 struct btree_iter *iter,
-					 struct bpos bp_pos,
-					 struct bch_backpointer bp,
-					 unsigned iter_flags)
+					 unsigned iter_flags,
+					 struct bkey_buf *last_flushed)
 {
-	if (likely(!bp.level)) {
-		struct bch_fs *c = trans->c;
+	struct bch_fs *c = trans->c;

-		struct bpos bucket;
-		if (!bp_pos_to_bucket_nodev(c, bp_pos, &bucket))
-			return bkey_s_c_err(-EIO);
+	if (unlikely(bp.v->btree_id >= btree_id_nr_alive(c)))
+		return bkey_s_c_null;

+	if (likely(!bp.v->level)) {
 		bch2_trans_node_iter_init(trans, iter,
-					  bp.btree_id,
-					  bp.pos,
+					  bp.v->btree_id,
+					  bp.v->pos,
 					  0, 0,
 					  iter_flags);
 		struct bkey_s_c k = bch2_btree_iter_peek_slot(iter);
···
 			return k;
 		}

-		if (k.k && extent_matches_bp(c, bp.btree_id, bp.level, k, bucket, bp))
+		if (k.k &&
+		    extent_matches_bp(c, bp.v->btree_id, bp.v->level, k, bp))
 			return k;

 		bch2_trans_iter_exit(trans, iter);
-		backpointer_not_found(trans, bp_pos, bp, k);
-		return bkey_s_c_null;
+		int ret = backpointer_target_not_found(trans, bp, k, last_flushed);
+		return ret ? bkey_s_c_err(ret) : bkey_s_c_null;
 	} else {
-		struct btree *b = bch2_backpointer_get_node(trans, iter, bp_pos, bp);
+		struct btree *b = bch2_backpointer_get_node(trans, bp, iter, last_flushed);
+		if (IS_ERR_OR_NULL(b))
+			return ((struct bkey_s_c) { .k = ERR_CAST(b) });

-		if (IS_ERR_OR_NULL(b)) {
-			bch2_trans_iter_exit(trans, iter);
-			return IS_ERR(b) ? bkey_s_c_err(PTR_ERR(b)) : bkey_s_c_null;
-		}
 		return bkey_i_to_s_c(&b->key);
 	}
 }

 struct btree *bch2_backpointer_get_node(struct btree_trans *trans,
+					struct bkey_s_c_backpointer bp,
 					struct btree_iter *iter,
-					struct bpos bp_pos,
-					struct bch_backpointer bp)
+					struct bkey_buf *last_flushed)
 {
 	struct bch_fs *c = trans->c;

-	BUG_ON(!bp.level);
-
-	struct bpos bucket;
-	if (!bp_pos_to_bucket_nodev(c, bp_pos, &bucket))
-		return ERR_PTR(-EIO);
+	BUG_ON(!bp.v->level);

 	bch2_trans_node_iter_init(trans, iter,
-				  bp.btree_id,
-				  bp.pos,
+				  bp.v->btree_id,
+				  bp.v->pos,
 				  0,
-				  bp.level - 1,
+				  bp.v->level - 1,
 				  0);
 	struct btree *b = bch2_btree_iter_peek_node(iter);
 	if (IS_ERR_OR_NULL(b))
 		goto err;

-	BUG_ON(b->c.level != bp.level - 1);
+	BUG_ON(b->c.level != bp.v->level - 1);

-	if (extent_matches_bp(c, bp.btree_id, bp.level,
-			      bkey_i_to_s_c(&b->key),
-			      bucket, bp))
+	if (extent_matches_bp(c, bp.v->btree_id, bp.v->level,
+			      bkey_i_to_s_c(&b->key), bp))
 		return b;

 	if (btree_node_will_make_reachable(b)) {
 		b = ERR_PTR(-BCH_ERR_backpointer_to_overwritten_btree_node);
 	} else {
-		backpointer_not_found(trans, bp_pos, bp, bkey_i_to_s_c(&b->key));
-		b = NULL;
+		int ret = backpointer_target_not_found(trans, bp, bkey_i_to_s_c(&b->key), last_flushed);
+		b = ret ? ERR_PTR(ret) : NULL;
 	}
 err:
 	bch2_trans_iter_exit(trans, iter);
 	return b;
 }

-static int bch2_check_btree_backpointer(struct btree_trans *trans, struct btree_iter *bp_iter,
-					struct bkey_s_c k)
+static int bch2_check_backpointer_has_valid_bucket(struct btree_trans *trans, struct bkey_s_c k,
+						   struct bkey_buf *last_flushed)
 {
+	if (k.k->type != KEY_TYPE_backpointer)
+		return 0;
+
 	struct bch_fs *c = trans->c;
 	struct btree_iter alloc_iter = { NULL };
 	struct bkey_s_c alloc_k;
···

 	struct bpos bucket;
 	if (!bp_pos_to_bucket_nodev_noerror(c, k.k->p, &bucket)) {
+		ret = bch2_backpointers_maybe_flush(trans, k, last_flushed);
+		if (ret)
+			goto out;
+
 		if (fsck_err(trans, backpointer_to_missing_device,
 			     "backpointer for missing device:\n%s",
 			     (bch2_bkey_val_to_text(&buf, c, k), buf.buf)))
-			ret = bch2_btree_delete_at(trans, bp_iter, 0);
+			ret = bch2_backpointer_del(trans, k.k->p);
 		goto out;
 	}

···
 	if (ret)
 		goto out;

-	if (fsck_err_on(alloc_k.k->type != KEY_TYPE_alloc_v4,
-			trans, backpointer_to_missing_alloc,
-			"backpointer for nonexistent alloc key: %llu:%llu:0\n%s",
-			alloc_iter.pos.inode, alloc_iter.pos.offset,
-			(bch2_bkey_val_to_text(&buf, c, k), buf.buf))) {
-		ret = bch2_btree_delete_at(trans, bp_iter, 0);
-		goto out;
+	if (alloc_k.k->type != KEY_TYPE_alloc_v4) {
+		ret = bch2_backpointers_maybe_flush(trans, k, last_flushed);
+		if (ret)
+			goto out;
+
+		if (fsck_err(trans, backpointer_to_missing_alloc,
+			     "backpointer for nonexistent alloc key: %llu:%llu:0\n%s",
+			     alloc_iter.pos.inode, alloc_iter.pos.offset,
+			     (bch2_bkey_val_to_text(&buf, c, k), buf.buf)))
+			ret = bch2_backpointer_del(trans, k.k->p);
 	}
 out:
 fsck_err:
···
 /* verify that every backpointer has a corresponding alloc key */
 int bch2_check_btree_backpointers(struct bch_fs *c)
 {
+	struct bkey_buf last_flushed;
+	bch2_bkey_buf_init(&last_flushed);
+	bkey_init(&last_flushed.k->k);
+
 	int ret = bch2_trans_run(c,
 		for_each_btree_key_commit(trans, iter,
 			BTREE_ID_backpointers, POS_MIN, 0, k,
 			NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
-		  bch2_check_btree_backpointer(trans, &iter, k)));
+		  bch2_check_backpointer_has_valid_bucket(trans, k, &last_flushed)));
+
+	bch2_bkey_buf_exit(&last_flushed, c);
 	bch_err_fn(c, ret);
 	return ret;
 }

 struct extents_to_bp_state {
-	struct bpos	bucket_start;
-	struct bpos	bucket_end;
+	struct bpos	bp_start;
+	struct bpos	bp_end;
 	struct bkey_buf last_flushed;
 };

···
 		goto err;

 	prt_str(&buf, "extents pointing to same space, but first extent checksum bad:");
-	prt_printf(&buf, "\n %s ", bch2_btree_id_str(btree));
+	prt_printf(&buf, "\n ");
+	bch2_btree_id_to_text(&buf, btree);
+	prt_str(&buf, " ");
 	bch2_bkey_val_to_text(&buf, c, extent);
-	prt_printf(&buf, "\n %s ", bch2_btree_id_str(o_btree));
+	prt_printf(&buf, "\n ");
+	bch2_btree_id_to_text(&buf, o_btree);
+	prt_str(&buf, " ");
 	bch2_bkey_val_to_text(&buf, c, extent2);

 	struct nonce nonce = extent_nonce(extent.k->bversion, p.crc);
···

 static int check_bp_exists(struct btree_trans *trans,
 			   struct extents_to_bp_state *s,
-			   struct bpos bucket,
-			   struct bch_backpointer bp,
+			   struct bkey_i_backpointer *bp,
 			   struct bkey_s_c orig_k)
 {
 	struct bch_fs *c = trans->c;
-	struct btree_iter bp_iter = {};
 	struct btree_iter other_extent_iter = {};
 	struct printbuf buf = PRINTBUF;
-	struct bkey_s_c bp_k;
-	int ret = 0;

-	struct bch_dev *ca = bch2_dev_bucket_tryget(c, bucket);
-	if (!ca) {
-		prt_str(&buf, "extent for nonexistent device:bucket ");
-		bch2_bpos_to_text(&buf, bucket);
-		prt_str(&buf, "\n ");
-		bch2_bkey_val_to_text(&buf, c, orig_k);
-		bch_err(c, "%s", buf.buf);
-		ret = -BCH_ERR_fsck_repair_unimplemented;
-		goto err;
-	}
+	if (bpos_lt(bp->k.p, s->bp_start) ||
+	    bpos_gt(bp->k.p, s->bp_end))
+		return 0;

-	if (bpos_lt(bucket, s->bucket_start) ||
-	    bpos_gt(bucket, s->bucket_end))
-		goto out;
-
-	bp_k = bch2_bkey_get_iter(trans, &bp_iter, BTREE_ID_backpointers,
-				  bucket_pos_to_bp(ca, bucket, bp.bucket_offset),
-				  0);
-	ret = bkey_err(bp_k);
+	struct btree_iter bp_iter;
+	struct bkey_s_c bp_k = bch2_bkey_get_iter(trans, &bp_iter, BTREE_ID_backpointers, bp->k.p, 0);
+	int ret = bkey_err(bp_k);
 	if (ret)
 		goto err;

 	if (bp_k.k->type != KEY_TYPE_backpointer ||
-	    memcmp(bkey_s_c_to_backpointer(bp_k).v, &bp, sizeof(bp))) {
+	    memcmp(bkey_s_c_to_backpointer(bp_k).v, &bp->v, sizeof(bp->v))) {
 		ret = bch2_btree_write_buffer_maybe_flush(trans, orig_k, &s->last_flushed);
 		if (ret)
 			goto err;
···
 fsck_err:
 	bch2_trans_iter_exit(trans, &other_extent_iter);
 	bch2_trans_iter_exit(trans, &bp_iter);
-	bch2_dev_put(ca);
 	printbuf_exit(&buf);
 	return ret;
 check_existing_bp:
···
 	if (bp_k.k->type != KEY_TYPE_backpointer)
 		goto missing;

-	struct bch_backpointer other_bp = *bkey_s_c_to_backpointer(bp_k).v;
+	struct bkey_s_c_backpointer other_bp = bkey_s_c_to_backpointer(bp_k);

 	struct bkey_s_c other_extent =
-		bch2_backpointer_get_key(trans, &other_extent_iter, bp_k.k->p, other_bp, 0);
+		bch2_backpointer_get_key(trans, other_bp, &other_extent_iter, 0, NULL);
 	ret = bkey_err(other_extent);
 	if (ret == -BCH_ERR_backpointer_to_overwritten_btree_node)
 		ret = 0;
···
 		bch_err(c, "%s", buf.buf);

 		if (other_extent.k->size <= orig_k.k->size) {
-			ret = drop_dev_and_update(trans, other_bp.btree_id, other_extent, bucket.inode);
+			ret = drop_dev_and_update(trans, other_bp.v->btree_id,
+						  other_extent, bp->k.p.inode);
 			if (ret)
 				goto err;
 			goto out;
 		} else {
-			ret = drop_dev_and_update(trans, bp.btree_id, orig_k, bucket.inode);
+			ret = drop_dev_and_update(trans, bp->v.btree_id, orig_k, bp->k.p.inode);
 			if (ret)
 				goto err;
 			goto missing;
 		}
 	}

-	ret = check_extent_checksum(trans, other_bp.btree_id, other_extent, bp.btree_id, orig_k, bucket.inode);
+	ret = check_extent_checksum(trans,
+				    other_bp.v->btree_id, other_extent,
+				    bp->v.btree_id, orig_k,
+				    bp->k.p.inode);
 	if (ret < 0)
 		goto err;
 	if (ret) {
···
 		goto missing;
 	}

-	ret = check_extent_checksum(trans, bp.btree_id, orig_k, other_bp.btree_id, other_extent, bucket.inode);
+	ret = check_extent_checksum(trans, bp->v.btree_id, orig_k,
+				    other_bp.v->btree_id, other_extent, bp->k.p.inode);
 	if (ret < 0)
 		goto err;
 	if (ret) {
···
 	}

 	printbuf_reset(&buf);
-	prt_printf(&buf, "duplicate extents pointing to same space on dev %llu\n ", bucket.inode);
+	prt_printf(&buf, "duplicate extents pointing to same space on dev %llu\n ", bp->k.p.inode);
 	bch2_bkey_val_to_text(&buf, c, orig_k);
 	prt_str(&buf, "\n ");
 	bch2_bkey_val_to_text(&buf, c, other_extent);
···
 	goto err;
 missing:
 	printbuf_reset(&buf);
-	prt_printf(&buf, "missing backpointer for btree=%s l=%u ",
-		   bch2_btree_id_str(bp.btree_id), bp.level);
+	prt_str(&buf, "missing backpointer\n for: ");
 	bch2_bkey_val_to_text(&buf, c, orig_k);
-	prt_printf(&buf, "\n got: ");
+	prt_printf(&buf, "\n want: ");
+	bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&bp->k_i));
+	prt_printf(&buf, "\n got: ");
 	bch2_bkey_val_to_text(&buf, c, bp_k);

-	struct bkey_i_backpointer n_bp_k;
-	bkey_backpointer_init(&n_bp_k.k_i);
-	n_bp_k.k.p = bucket_pos_to_bp(ca, bucket, bp.bucket_offset);
-	n_bp_k.v = bp;
-	prt_printf(&buf, "\n want: ");
-	bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&n_bp_k.k_i));
-
 	if (fsck_err(trans, ptr_to_missing_backpointer, "%s", buf.buf))
-		ret = bch2_bucket_backpointer_mod(trans, ca, bucket, bp, orig_k, true);
+		ret = bch2_bucket_backpointer_mod(trans, orig_k, bp, true);

 	goto out;
 }
···
 					struct bkey_s_c k)
 {
 	struct bch_fs *c = trans->c;
-	struct bkey_ptrs_c ptrs;
+	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
 	const union bch_extent_entry *entry;
 	struct extent_ptr_decoded p;
-	int ret;

-	ptrs = bch2_bkey_ptrs_c(k);
 	bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
-		struct bpos bucket_pos = POS_MIN;
-		struct bch_backpointer bp;
-
 		if (p.ptr.cached)
+			continue;
+
+		if (p.ptr.dev == BCH_SB_MEMBER_INVALID)
 			continue;

 		rcu_read_lock();
 		struct bch_dev *ca = bch2_dev_rcu_noerror(c, p.ptr.dev);
-		if (ca)
-			bch2_extent_ptr_to_bp(c, ca, btree, level, k, p, entry, &bucket_pos, &bp);
+		bool check = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_mismatches);
+		bool empty = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_empty);
 		rcu_read_unlock();

-		if (!ca)
-			continue;
+		if (check || empty) {
+			struct bkey_i_backpointer bp;
+			bch2_extent_ptr_to_bp(c, btree, level, k, p, entry, &bp);

-		ret = check_bp_exists(trans, s, bucket_pos, bp, k);
-		if (ret)
-			return ret;
+			int ret = check
+				? check_bp_exists(trans, s, &bp, k)
+				: bch2_bucket_backpointer_mod(trans, k, &bp, true);
+			if (ret)
+				return ret;
+		}
 	}

 	return 0;
···
 	return 0;
 }

+enum alloc_sector_counter {
+	ALLOC_dirty,
+	ALLOC_cached,
+	ALLOC_stripe,
+	ALLOC_SECTORS_NR
+};
+
+static enum alloc_sector_counter data_type_to_alloc_counter(enum bch_data_type t)
+{
+	switch (t) {
+	case BCH_DATA_btree:
+	case BCH_DATA_user:
+		return ALLOC_dirty;
+	case BCH_DATA_cached:
+		return ALLOC_cached;
+	case BCH_DATA_stripe:
+		return ALLOC_stripe;
+	default:
+		BUG();
+	}
+}
+
+static int check_bucket_backpointers_to_extents(struct btree_trans *, struct bch_dev *, struct bpos);
+
+static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct bkey_s_c alloc_k,
+					     struct bkey_buf *last_flushed)
+{
+	struct bch_fs *c = trans->c;
+	struct bch_alloc_v4 a_convert;
+	const struct bch_alloc_v4 *a = bch2_alloc_to_v4(alloc_k, &a_convert);
+	bool need_commit = false;
+
+	if (a->data_type == BCH_DATA_sb ||
+	    a->data_type == BCH_DATA_journal ||
+	    a->data_type == BCH_DATA_parity)
+		return 0;
+
+	u32 sectors[ALLOC_SECTORS_NR];
+	memset(sectors, 0, sizeof(sectors));
+
+	struct bch_dev *ca = bch2_dev_bucket_tryget_noerror(trans->c, alloc_k.k->p);
+	if (!ca)
+		return 0;
+
+	struct btree_iter iter;
+	struct bkey_s_c bp_k;
+	int ret = 0;
+	for_each_btree_key_max_norestart(trans, iter, BTREE_ID_backpointers,
+				bucket_pos_to_bp_start(ca, alloc_k.k->p),
+				bucket_pos_to_bp_end(ca, alloc_k.k->p), 0, bp_k, ret) {
+		if (bp_k.k->type != KEY_TYPE_backpointer)
+			continue;
+
+		struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(bp_k);
+
+		if (c->sb.version_upgrade_complete >= bcachefs_metadata_version_backpointer_bucket_gen &&
+		    (bp.v->bucket_gen != a->gen ||
+		     bp.v->pad)) {
+			ret = bch2_backpointer_del(trans, bp_k.k->p);
+			if (ret)
+				break;
+
+			need_commit = true;
+			continue;
+		}
+
+		if (bp.v->bucket_gen != a->gen)
+			continue;
+
+		sectors[data_type_to_alloc_counter(bp.v->data_type)] += bp.v->bucket_len;
+	};
+	bch2_trans_iter_exit(trans, &iter);
+	if (ret)
+		goto err;
+
+	if (need_commit) {
+		ret = bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc);
+		if (ret)
+			goto err;
+	}
+
+	/* Cached pointers don't have backpointers: */
+
+	if (sectors[ALLOC_dirty] != a->dirty_sectors ||
+	    sectors[ALLOC_stripe] != a->stripe_sectors) {
+		if (c->sb.version_upgrade_complete >= bcachefs_metadata_version_backpointer_bucket_gen) {
+			ret = bch2_backpointers_maybe_flush(trans, alloc_k, last_flushed);
+			if (ret)
+				goto err;
+		}
+
+		if (sectors[ALLOC_dirty] > a->dirty_sectors ||
+		    sectors[ALLOC_stripe] > a->stripe_sectors) {
+			ret = check_bucket_backpointers_to_extents(trans, ca, alloc_k.k->p) ?:
+				-BCH_ERR_transaction_restart_nested;
+			goto err;
+		}
+
+		if (!sectors[ALLOC_dirty] &&
+		    !sectors[ALLOC_stripe])
+			__set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_empty);
+		else
+			__set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_mismatches);
+	}
+err:
+	bch2_dev_put(ca);
+	return ret;
+}
+
+static bool backpointer_node_has_missing(struct bch_fs *c, struct bkey_s_c k)
+{
+	switch (k.k->type) {
+	case KEY_TYPE_btree_ptr_v2: {
+		bool ret = false;
+
+		rcu_read_lock();
+		struct bpos pos = bkey_s_c_to_btree_ptr_v2(k).v->min_key;
+		while (pos.inode <= k.k->p.inode) {
+			if (pos.inode >= c->sb.nr_devices)
+				break;
+
+			struct bch_dev *ca = bch2_dev_rcu_noerror(c, pos.inode);
+			if (!ca)
+				goto next;
+
+			struct bpos bucket = bp_pos_to_bucket(ca, pos);
+			bucket.offset = find_next_bit(ca->bucket_backpointer_mismatches,
+						      ca->mi.nbuckets, bucket.offset);
+			if (bucket.offset == ca->mi.nbuckets)
+				goto next;
+
+			ret = bpos_le(bucket_pos_to_bp_end(ca, bucket), k.k->p);
+			if (ret)
+				break;
+next:
+			pos = SPOS(pos.inode + 1, 0, 0);
+		}
+		rcu_read_unlock();
+
+		return ret;
+	}
+	case KEY_TYPE_btree_ptr:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static int btree_node_get_and_pin(struct btree_trans *trans, struct bkey_i *k,
+				  enum btree_id btree, unsigned level)
+{
+	struct btree_iter iter;
+	bch2_trans_node_iter_init(trans, &iter, btree, k->k.p, 0, level, 0);
+	struct btree *b = bch2_btree_iter_peek_node(&iter);
+	int ret = PTR_ERR_OR_ZERO(b);
+	if (ret)
+		goto err;
+
+	if (b)
+		bch2_node_pin(trans->c, b);
+err:
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
+static int bch2_pin_backpointer_nodes_with_missing(struct btree_trans *trans,
+						   struct bpos start, struct bpos *end)
+{
+	struct bch_fs *c = trans->c;
+	int ret = 0;
+
+	struct bkey_buf tmp;
+	bch2_bkey_buf_init(&tmp);
+
+	bch2_btree_cache_unpin(c);
+
+	*end = SPOS_MAX;
+
+	s64 mem_may_pin = mem_may_pin_bytes(c);
+	struct btree_iter iter;
+	bch2_trans_node_iter_init(trans, &iter, BTREE_ID_backpointers, start,
+				  0, 1, BTREE_ITER_prefetch);
+	ret = for_each_btree_key_continue(trans, iter, 0, k, ({
+		if (!backpointer_node_has_missing(c, k))
+			continue;
+
+		mem_may_pin -= c->opts.btree_node_size;
+		if (mem_may_pin <= 0)
+			break;
+
+		bch2_bkey_buf_reassemble(&tmp, c, k);
+		struct btree_path *path = btree_iter_path(trans, &iter);
+
+		BUG_ON(path->level != 1);
+
+		bch2_btree_node_prefetch(trans, path, tmp.k, path->btree_id, path->level - 1);
+	}));
+	if (ret)
+		return ret;
+
+	struct bpos pinned = SPOS_MAX;
+	mem_may_pin = mem_may_pin_bytes(c);
+	bch2_trans_node_iter_init(trans, &iter, BTREE_ID_backpointers, start,
+				  0, 1, BTREE_ITER_prefetch);
+	ret = for_each_btree_key_continue(trans, iter, 0, k, ({
+		if (!backpointer_node_has_missing(c, k))
+			continue;
+
+		mem_may_pin -= c->opts.btree_node_size;
+		if (mem_may_pin <= 0) {
+			*end = pinned;
+			break;
+		}
+
+		bch2_bkey_buf_reassemble(&tmp, c, k);
+		struct btree_path *path = btree_iter_path(trans, &iter);
+
+		BUG_ON(path->level != 1);
+
+		int ret2 = btree_node_get_and_pin(trans, tmp.k, path->btree_id, path->level - 1);
+
+		if (!ret2)
+			pinned = tmp.k->k.p;
+
+		ret;
+	}));
+	if (ret)
+		return ret;
+
+	return ret;
+}
+
 int bch2_check_extents_to_backpointers(struct bch_fs *c)
 {
+	int ret = 0;
+
+	/*
+	 * Can't allow devices to come/go/resize while we have bucket bitmaps
+	 * allocated
+	 */
+	lockdep_assert_held(&c->state_lock);
+
+	for_each_member_device(c, ca) {
+		BUG_ON(ca->bucket_backpointer_mismatches);
+		ca->bucket_backpointer_mismatches = kvcalloc(BITS_TO_LONGS(ca->mi.nbuckets),
+							     sizeof(unsigned long),
+							     GFP_KERNEL);
+		ca->bucket_backpointer_empty = kvcalloc(BITS_TO_LONGS(ca->mi.nbuckets),
+							sizeof(unsigned long),
+							GFP_KERNEL);
+		if (!ca->bucket_backpointer_mismatches ||
+		    !ca->bucket_backpointer_empty) {
+			bch2_dev_put(ca);
+			ret = -BCH_ERR_ENOMEM_backpointer_mismatches_bitmap;
+			goto err_free_bitmaps;
+		}
+	}
+
 	struct btree_trans *trans = bch2_trans_get(c);
-	struct extents_to_bp_state s = { .bucket_start = POS_MIN };
-	int ret;
+	struct extents_to_bp_state s = { .bp_start = POS_MIN };

 	bch2_bkey_buf_init(&s.last_flushed);
 	bkey_init(&s.last_flushed.k->k);

+	ret = for_each_btree_key(trans, iter, BTREE_ID_alloc,
+				 POS_MIN, BTREE_ITER_prefetch, k, ({
+		check_bucket_backpointer_mismatch(trans, k, &s.last_flushed);
+	}));
+	if (ret)
+		goto err;
+
+	u64 nr_buckets = 0, nr_mismatches = 0, nr_empty = 0;
+	for_each_member_device(c, ca) {
+		nr_buckets	+= ca->mi.nbuckets;
+		nr_mismatches	+= bitmap_weight(ca->bucket_backpointer_mismatches, ca->mi.nbuckets);
+		nr_empty	+= bitmap_weight(ca->bucket_backpointer_empty, ca->mi.nbuckets);
+	}
+
+	if (!nr_mismatches && !nr_empty)
+		goto err;
+
+	bch_info(c, "scanning for missing backpointers in %llu/%llu buckets",
+		 nr_mismatches + nr_empty, nr_buckets);
+
 	while (1) {
-		struct bbpos end;
-		ret = bch2_get_btree_in_memory_pos(trans,
-				BIT_ULL(BTREE_ID_backpointers),
-				BIT_ULL(BTREE_ID_backpointers),
-				BBPOS(BTREE_ID_backpointers, s.bucket_start), &end);
+		ret = bch2_pin_backpointer_nodes_with_missing(trans, s.bp_start, &s.bp_end);
 		if (ret)
 			break;

-		s.bucket_end = end.pos;
-
-		if ( bpos_eq(s.bucket_start, POS_MIN) &&
-		    !bpos_eq(s.bucket_end, SPOS_MAX))
+		if ( bpos_eq(s.bp_start, POS_MIN) &&
+		    !bpos_eq(s.bp_end, SPOS_MAX))
 			bch_verbose(c, "%s(): alloc info does not fit in ram, running in multiple passes with %zu nodes per pass",
 				    __func__, btree_nodes_fit_in_ram(c));

-		if (!bpos_eq(s.bucket_start, POS_MIN) ||
-		    !bpos_eq(s.bucket_end, SPOS_MAX)) {
+		if (!bpos_eq(s.bp_start, POS_MIN) ||
+		    !bpos_eq(s.bp_end, SPOS_MAX)) {
 			struct printbuf buf = PRINTBUF;

 			prt_str(&buf, "check_extents_to_backpointers(): ");
-			bch2_bpos_to_text(&buf, s.bucket_start);
+			bch2_bpos_to_text(&buf, s.bp_start);
 			prt_str(&buf, "-");
-			bch2_bpos_to_text(&buf, s.bucket_end);
+			bch2_bpos_to_text(&buf, s.bp_end);

 			bch_verbose(c, "%s", buf.buf);
 			printbuf_exit(&buf);
 		}

 		ret = bch2_check_extents_to_backpointers_pass(trans, &s);
-		if (ret || bpos_eq(s.bucket_end, SPOS_MAX))
+		if (ret || bpos_eq(s.bp_end, SPOS_MAX))
 			break;

-		s.bucket_start = bpos_successor(s.bucket_end);
+		s.bp_start = bpos_successor(s.bp_end);
 	}
+err:
 	bch2_trans_put(trans);
 	bch2_bkey_buf_exit(&s.last_flushed, c);
-
 	bch2_btree_cache_unpin(c);
+err_free_bitmaps:
+	for_each_member_device(c, ca) {
+		kvfree(ca->bucket_backpointer_empty);
+		ca->bucket_backpointer_empty = NULL;
+		kvfree(ca->bucket_backpointer_mismatches);
+		ca->bucket_backpointer_mismatches = NULL;
+	}

 	bch_err_fn(c, ret);
 	return ret;
···
 		return 0;

 	struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(bp_k);
-	struct bch_fs *c = trans->c;
-	struct btree_iter iter;
 	struct bbpos pos = bp_to_bbpos(*bp.v);
-	struct bkey_s_c k;
-	struct printbuf buf = PRINTBUF;
-	int ret;

 	if (bbpos_cmp(pos, start) < 0 ||
 	    bbpos_cmp(pos, end) > 0)
 		return 0;

-	k = bch2_backpointer_get_key(trans, &iter, bp.k->p, *bp.v, 0);
-	ret = bkey_err(k);
+	struct btree_iter iter;
+	struct bkey_s_c k = bch2_backpointer_get_key(trans, bp, &iter, 0, last_flushed);
+	int ret = bkey_err(k);
 	if (ret == -BCH_ERR_backpointer_to_overwritten_btree_node)
 		return 0;
 	if (ret)
 		return ret;

-	if (!k.k) {
-		ret = bch2_btree_write_buffer_maybe_flush(trans, bp.s_c, last_flushed);
-		if (ret)
-			goto out;
-
-		if (fsck_err(trans, backpointer_to_missing_ptr,
-			     "backpointer for missing %s\n %s",
-			     bp.v->level ? "btree node" : "extent",
-			     (bch2_bkey_val_to_text(&buf, c, bp.s_c), buf.buf))) {
-			ret = bch2_btree_delete_at_buffered(trans, BTREE_ID_backpointers, bp.k->p);
-			goto out;
-		}
-	}
-out:
-fsck_err:
 	bch2_trans_iter_exit(trans, &iter);
-	printbuf_exit(&buf);
 	return ret;
+}
+
+static int check_bucket_backpointers_to_extents(struct btree_trans *trans,
+						struct bch_dev *ca, struct bpos bucket)
+{
+	u32 restart_count = trans->restart_count;
+	struct bkey_buf last_flushed;
+	bch2_bkey_buf_init(&last_flushed);
+	bkey_init(&last_flushed.k->k);
+
+	int ret = for_each_btree_key_max(trans, iter, BTREE_ID_backpointers,
+				      bucket_pos_to_bp_start(ca, bucket),
+				      bucket_pos_to_bp_end(ca, bucket),
+				      0, k,
+		check_one_backpointer(trans, BBPOS_MIN, BBPOS_MAX, k, &last_flushed)
+	);
+
+	bch2_bkey_buf_exit(&last_flushed, trans->c);
+	return ret ?: trans_was_restarted(trans, restart_count);
 }

 static int bch2_check_backpointers_to_extents_pass(struct btree_trans *trans,
···
 	bkey_init(&last_flushed.k->k);
 	progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_backpointers));

-	int ret = for_each_btree_key_commit(trans, iter, BTREE_ID_backpointers,
-				     POS_MIN, BTREE_ITER_prefetch, k,
-				     NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({
+	int ret = for_each_btree_key(trans, iter, BTREE_ID_backpointers,
+				     POS_MIN, BTREE_ITER_prefetch, k, ({
 		progress_update_iter(trans, &progress, &iter, "backpointers_to_extents");
 		check_one_backpointer(trans, start, end, k, &last_flushed);
 	}));
+44 -53
fs/bcachefs/backpointers.h
··· 18 18 ((x & 0xff00000000ULL) >> 32)); 19 19 } 20 20 21 - int bch2_backpointer_validate(struct bch_fs *, struct bkey_s_c k, enum bch_validate_flags); 22 - void bch2_backpointer_to_text(struct printbuf *, const struct bch_backpointer *); 23 - void bch2_backpointer_k_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c); 21 + int bch2_backpointer_validate(struct bch_fs *, struct bkey_s_c k, 22 + struct bkey_validate_context); 23 + void bch2_backpointer_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c); 24 24 void bch2_backpointer_swab(struct bkey_s); 25 25 26 26 #define bch2_bkey_ops_backpointer ((struct bkey_ops) { \ 27 27 .key_validate = bch2_backpointer_validate, \ 28 - .val_to_text = bch2_backpointer_k_to_text, \ 28 + .val_to_text = bch2_backpointer_to_text, \ 29 29 .swab = bch2_backpointer_swab, \ 30 30 .min_val_size = 32, \ 31 31 }) ··· 43 43 return POS(bp_pos.inode, sector_to_bucket(ca, bucket_sector)); 44 44 } 45 45 46 + static inline struct bpos bp_pos_to_bucket_and_offset(const struct bch_dev *ca, struct bpos bp_pos, 47 + u32 *bucket_offset) 48 + { 49 + u64 bucket_sector = bp_pos.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT; 50 + 51 + return POS(bp_pos.inode, sector_to_bucket_and_offset(ca, bucket_sector, bucket_offset)); 52 + } 53 + 46 54 static inline bool bp_pos_to_bucket_nodev_noerror(struct bch_fs *c, struct bpos bp_pos, struct bpos *bucket) 47 55 { 48 56 rcu_read_lock(); 49 - struct bch_dev *ca = bch2_dev_rcu(c, bp_pos.inode); 57 + struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp_pos.inode); 50 58 if (ca) 51 59 *bucket = bp_pos_to_bucket(ca, bp_pos); 52 60 rcu_read_unlock(); 53 61 return ca != NULL; 54 - } 55 - 56 - static inline bool bp_pos_to_bucket_nodev(struct bch_fs *c, struct bpos bp_pos, struct bpos *bucket) 57 - { 58 - return !bch2_fs_inconsistent_on(!bp_pos_to_bucket_nodev_noerror(c, bp_pos, bucket), 59 - c, "backpointer for missing device %llu", bp_pos.inode); 60 62 } 61 63 62 64 static inline struct bpos 
bucket_pos_to_bp_noerror(const struct bch_dev *ca, ··· 82 80 return ret; 83 81 } 84 82 85 - int bch2_bucket_backpointer_mod_nowritebuffer(struct btree_trans *, struct bch_dev *, 86 - struct bpos bucket, struct bch_backpointer, struct bkey_s_c, bool); 83 + static inline struct bpos bucket_pos_to_bp_start(const struct bch_dev *ca, struct bpos bucket) 84 + { 85 + return bucket_pos_to_bp(ca, bucket, 0); 86 + } 87 + 88 + static inline struct bpos bucket_pos_to_bp_end(const struct bch_dev *ca, struct bpos bucket) 89 + { 90 + return bpos_nosnap_predecessor(bucket_pos_to_bp(ca, bpos_nosnap_successor(bucket), 0)); 91 + } 92 + 93 + int bch2_bucket_backpointer_mod_nowritebuffer(struct btree_trans *, 94 + struct bkey_s_c, 95 + struct bkey_i_backpointer *, 96 + bool); 87 97 88 98 static inline int bch2_bucket_backpointer_mod(struct btree_trans *trans, 89 - struct bch_dev *ca, 90 - struct bpos bucket, 91 - struct bch_backpointer bp, 92 99 struct bkey_s_c orig_k, 100 + struct bkey_i_backpointer *bp, 93 101 bool insert) 94 102 { 95 103 if (unlikely(bch2_backpointers_no_use_write_buffer)) 96 - return bch2_bucket_backpointer_mod_nowritebuffer(trans, ca, bucket, bp, orig_k, insert); 97 - 98 - struct bkey_i_backpointer bp_k; 99 - 100 - bkey_backpointer_init(&bp_k.k_i); 101 - bp_k.k.p = bucket_pos_to_bp(ca, bucket, bp.bucket_offset); 102 - bp_k.v = bp; 104 + return bch2_bucket_backpointer_mod_nowritebuffer(trans, orig_k, bp, insert); 103 105 104 106 if (!insert) { 105 - bp_k.k.type = KEY_TYPE_deleted; 106 - set_bkey_val_u64s(&bp_k.k, 0); 107 + bp->k.type = KEY_TYPE_deleted; 108 + set_bkey_val_u64s(&bp->k, 0); 107 109 } 108 110 109 - return bch2_trans_update_buffered(trans, BTREE_ID_backpointers, &bp_k.k_i); 111 + return bch2_trans_update_buffered(trans, BTREE_ID_backpointers, &bp->k_i); 110 112 } 111 113 112 114 static inline enum bch_data_type bch2_bkey_ptr_data_type(struct bkey_s_c k, ··· 140 134 } 141 135 } 142 136 143 - static inline void __bch2_extent_ptr_to_bp(struct bch_fs *c, 
struct bch_dev *ca, 137 + static inline void bch2_extent_ptr_to_bp(struct bch_fs *c, 144 138 enum btree_id btree_id, unsigned level, 145 139 struct bkey_s_c k, struct extent_ptr_decoded p, 146 140 const union bch_extent_entry *entry, 147 - struct bpos *bucket_pos, struct bch_backpointer *bp, 148 - u64 sectors) 141 + struct bkey_i_backpointer *bp) 149 142 { 150 - u32 bucket_offset; 151 - *bucket_pos = PTR_BUCKET_POS_OFFSET(ca, &p.ptr, &bucket_offset); 152 - *bp = (struct bch_backpointer) { 143 + bkey_backpointer_init(&bp->k_i); 144 + bp->k.p = POS(p.ptr.dev, ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset); 145 + bp->v = (struct bch_backpointer) { 153 146 .btree_id = btree_id, 154 147 .level = level, 155 148 .data_type = bch2_bkey_ptr_data_type(k, p, entry), 156 - .bucket_offset = ((u64) bucket_offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + 157 - p.crc.offset, 158 - .bucket_len = sectors, 149 + .bucket_gen = p.ptr.gen, 150 + .bucket_len = ptr_disk_sectors(level ? btree_sectors(c) : k.k->size, p), 159 151 .pos = k.k->p, 160 152 }; 161 153 } 162 154 163 - static inline void bch2_extent_ptr_to_bp(struct bch_fs *c, struct bch_dev *ca, 164 - enum btree_id btree_id, unsigned level, 165 - struct bkey_s_c k, struct extent_ptr_decoded p, 166 - const union bch_extent_entry *entry, 167 - struct bpos *bucket_pos, struct bch_backpointer *bp) 168 - { 169 - u64 sectors = ptr_disk_sectors(level ? 
btree_sectors(c) : k.k->size, p); 170 - 171 - __bch2_extent_ptr_to_bp(c, ca, btree_id, level, k, p, entry, bucket_pos, bp, sectors); 172 - } 173 - 174 - int bch2_get_next_backpointer(struct btree_trans *, struct bch_dev *ca, struct bpos, int, 175 - struct bpos *, struct bch_backpointer *, unsigned); 176 - struct bkey_s_c bch2_backpointer_get_key(struct btree_trans *, struct btree_iter *, 177 - struct bpos, struct bch_backpointer, 178 - unsigned); 179 - struct btree *bch2_backpointer_get_node(struct btree_trans *, struct btree_iter *, 180 - struct bpos, struct bch_backpointer); 155 + struct bkey_buf; 156 + struct bkey_s_c bch2_backpointer_get_key(struct btree_trans *, struct bkey_s_c_backpointer, 157 + struct btree_iter *, unsigned, struct bkey_buf *); 158 + struct btree *bch2_backpointer_get_node(struct btree_trans *, struct bkey_s_c_backpointer, 159 + struct btree_iter *, struct bkey_buf *); 181 160 182 161 int bch2_check_btree_backpointers(struct bch_fs *); 183 162 int bch2_check_extents_to_backpointers(struct bch_fs *);
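Note on the backpointers.h hunk above: the new `bch2_extent_ptr_to_bp()` computes the backpointer key position directly from the device pointer (`p.ptr.offset` shifted left by `MAX_EXTENT_COMPRESS_RATIO_SHIFT`, plus `p.crc.offset`), and `bp_pos_to_bucket_and_offset()` undoes the shift. A minimal sketch of that round trip follows. Every `sketch_` name is hypothetical, the shift value of 10 is an assumption for illustration only, and the range check on the low bits is an assumption of this sketch rather than something the diff states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for MAX_EXTENT_COMPRESS_RATIO_SHIFT; the value is assumed. */
#define SKETCH_COMPRESS_RATIO_SHIFT 10

/* Pack a device sector and a sub-sector offset into one key offset. */
static uint64_t sketch_bp_offset(uint64_t dev_sector, uint32_t crc_offset)
{
	/* Assumption of this sketch: the low bits must fit crc_offset,
	 * otherwise the shift below would not recover the sector. */
	assert(crc_offset < (1U << SKETCH_COMPRESS_RATIO_SHIFT));
	return (dev_sector << SKETCH_COMPRESS_RATIO_SHIFT) + crc_offset;
}

/* Recover the device sector by dropping the low bits again. */
static uint64_t sketch_bp_offset_to_sector(uint64_t bp_offset)
{
	return bp_offset >> SKETCH_COMPRESS_RATIO_SHIFT;
}
```

Because the sector lives in the high bits, all backpointers for one bucket sort together in the btree, which is presumably what lets `bucket_pos_to_bp_start()` and `bucket_pos_to_bp_end()` bound a per-bucket range scan.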
+1 -1
fs/bcachefs/bbpos.h
··· 29 29 30 30 static inline void bch2_bbpos_to_text(struct printbuf *out, struct bbpos pos) 31 31 { 32 - prt_str(out, bch2_btree_id_str(pos.btree)); 32 + bch2_btree_id_to_text(out, pos.btree); 33 33 prt_char(out, ':'); 34 34 bch2_bpos_to_text(out, pos.pos); 35 35 }
+33 -37
fs/bcachefs/bcachefs.h
··· 205 205 #include <linux/zstd.h> 206 206 207 207 #include "bcachefs_format.h" 208 + #include "btree_journal_iter_types.h" 208 209 #include "disk_accounting_types.h" 209 210 #include "errcode.h" 210 211 #include "fifo.h" ··· 294 293 295 294 #define bch_info(c, fmt, ...) \ 296 295 bch2_print(c, KERN_INFO bch2_fmt(c, fmt), ##__VA_ARGS__) 296 + #define bch_info_ratelimited(c, fmt, ...) \ 297 + bch2_print_ratelimited(c, KERN_INFO bch2_fmt(c, fmt), ##__VA_ARGS__) 297 298 #define bch_notice(c, fmt, ...) \ 298 299 bch2_print(c, KERN_NOTICE bch2_fmt(c, fmt), ##__VA_ARGS__) 299 300 #define bch_warn(c, fmt, ...) \ ··· 353 350 do { \ 354 351 if ((c)->opts.verbose) \ 355 352 bch_info(c, fmt, ##__VA_ARGS__); \ 353 + } while (0) 354 + 355 + #define bch_verbose_ratelimited(c, fmt, ...) \ 356 + do { \ 357 + if ((c)->opts.verbose) \ 358 + bch_info_ratelimited(c, fmt, ##__VA_ARGS__); \ 356 359 } while (0) 357 360 358 361 #define pr_verbose_init(opts, fmt, ...) \ ··· 547 538 548 539 /* 549 540 * Buckets: 550 - * Per-bucket arrays are protected by c->mark_lock, bucket_lock and 551 - * gc_gens_lock, for device resize - holding any is sufficient for 552 - * access: Or rcu_read_lock(), but only for dev_ptr_stale(): 541 + * Per-bucket arrays are protected by either rcu_read_lock or 542 + * state_lock, for device resize. 
553 543 */ 554 544 GENRADIX(struct bucket) buckets_gc; 555 545 struct bucket_gens __rcu *bucket_gens; 556 546 u8 *oldest_gen; 557 547 unsigned long *buckets_nouse; 558 - struct rw_semaphore bucket_lock; 548 + 549 + unsigned long *bucket_backpointer_mismatches; 550 + unsigned long *bucket_backpointer_empty; 559 551 560 552 struct bch_dev_usage __percpu *usage; 561 553 562 554 /* Allocator: */ 563 - u64 new_fs_bucket_idx; 564 555 u64 alloc_cursor[3]; 565 556 566 557 unsigned nr_open_buckets; ··· 615 606 x(going_ro) \ 616 607 x(write_disable_complete) \ 617 608 x(clean_shutdown) \ 609 + x(recovery_running) \ 618 610 x(fsck_running) \ 619 611 x(initial_gc_unfixed) \ 620 612 x(need_delete_dead_snapshots) \ ··· 660 650 } entries[]; 661 651 }; 662 652 663 - struct journal_keys { 664 - /* must match layout in darray_types.h */ 665 - size_t nr, size; 666 - struct journal_key { 667 - u64 journal_seq; 668 - u32 journal_offset; 669 - enum btree_id btree_id:8; 670 - unsigned level:8; 671 - bool allocated; 672 - bool overwritten; 673 - struct bkey_i *k; 674 - } *data; 675 - /* 676 - * Gap buffer: instead of all the empty space in the array being at the 677 - * end of the buffer - from @nr to @size - the empty space is at @gap. 678 - * This means that sequential insertions are O(n) instead of O(n^2). 
679 - */ 680 - size_t gap; 681 - atomic_t ref; 682 - bool initial_ref_held; 683 - }; 684 - 685 653 struct btree_trans_buf { 686 654 struct btree_trans *trans; 687 655 }; ··· 668 680 ((subvol_inum) { BCACHEFS_ROOT_SUBVOL, BCACHEFS_ROOT_INO }) 669 681 670 682 #define BCH_WRITE_REFS() \ 683 + x(journal) \ 671 684 x(trans) \ 672 685 x(write) \ 673 686 x(promote) \ ··· 681 692 x(dio_write) \ 682 693 x(discard) \ 683 694 x(discard_fast) \ 695 + x(check_discard_freespace_key) \ 684 696 x(invalidate) \ 685 697 x(delete_dead_snapshots) \ 686 698 x(gc_gens) \ ··· 725 735 struct percpu_ref writes; 726 736 #endif 727 737 /* 738 + * Certain operations are only allowed in single threaded mode, during 739 + * recovery, and we want to assert that this is the case: 740 + */ 741 + struct task_struct *recovery_task; 742 + 743 + /* 728 744 * Analagous to c->writes, for asynchronous ops that don't necessarily 729 745 * need fs to be read-write 730 746 */ ··· 760 764 __uuid_t user_uuid; 761 765 762 766 u16 version; 767 + u16 version_incompat; 768 + u16 version_incompat_allowed; 763 769 u16 version_min; 764 770 u16 version_upgrade_complete; 765 771 ··· 832 834 struct work_struct btree_interior_update_work; 833 835 834 836 struct workqueue_struct *btree_node_rewrite_worker; 835 - 836 - struct list_head pending_node_rewrites; 837 - struct mutex pending_node_rewrites_lock; 837 + struct list_head btree_node_rewrites; 838 + struct list_head btree_node_rewrites_pending; 839 + spinlock_t btree_node_rewrites_lock; 840 + struct closure_waitlist btree_node_rewrites_wait; 838 841 839 842 /* btree_io.c: */ 840 843 spinlock_t btree_write_error_lock; ··· 966 967 struct rhashtable promote_table; 967 968 968 969 mempool_t compression_bounce[2]; 969 - mempool_t compress_workspace[BCH_COMPRESSION_TYPE_NR]; 970 - mempool_t decompress_workspace; 970 + mempool_t compress_workspace[BCH_COMPRESSION_OPT_NR]; 971 971 size_t zstd_workspace_size; 972 972 973 973 struct crypto_shash *sha256; ··· 1025 1027 struct 
list_head vfs_inodes_list; 1026 1028 struct mutex vfs_inodes_lock; 1027 1029 struct rhashtable vfs_inodes_table; 1030 + struct rhltable vfs_inodes_by_inum_table; 1028 1031 1029 1032 /* VFS IO PATH - fs-io.c */ 1030 1033 struct bio_set writepage_bioset; ··· 1047 1048 * for signaling to the toplevel code which pass we want to run now. 1048 1049 */ 1049 1050 enum bch_recovery_pass curr_recovery_pass; 1051 + enum bch_recovery_pass next_recovery_pass; 1050 1052 /* bitmask of recovery passes that we actually ran */ 1051 1053 u64 recovery_passes_complete; 1052 1054 /* never rewinds version of curr_recovery_pass */ 1053 1055 enum bch_recovery_pass recovery_pass_done; 1056 + spinlock_t recovery_pass_lock; 1054 1057 struct semaphore online_fsck_mutex; 1055 1058 1056 1059 /* DEBUG JUNK */ ··· 1062 1061 struct btree *verify_data; 1063 1062 struct btree_node *verify_ondisk; 1064 1063 struct mutex verify_lock; 1065 - 1066 - u64 *unused_inode_hints; 1067 - unsigned inode_shard_bits; 1068 1064 1069 1065 /* 1070 1066 * A btree node on disk could have too many bsets for an iterator to fit ··· 1083 1085 1084 1086 u64 counters_on_mount[BCH_COUNTER_NR]; 1085 1087 u64 __percpu *counters; 1086 - 1087 - unsigned copy_gc_enabled:1; 1088 1088 1089 1089 struct bch2_time_stats times[BCH_TIME_STAT_NR]; 1090 1090
+74 -32
fs/bcachefs/bcachefs_format.h
··· 418 418 x(snapshot_tree, 31) \ 419 419 x(logged_op_truncate, 32) \ 420 420 x(logged_op_finsert, 33) \ 421 - x(accounting, 34) 421 + x(accounting, 34) \ 422 + x(inode_alloc_cursor, 35) 422 423 423 424 enum bch_bkey_type { 424 425 #define x(name, nr) KEY_TYPE_##name = nr, ··· 464 463 __u8 btree_id; 465 464 __u8 level; 466 465 __u8 data_type; 467 - __u64 bucket_offset:40; 466 + __u8 bucket_gen; 467 + __u32 pad; 468 468 __u32 bucket_len; 469 469 struct bpos pos; 470 470 } __packed __aligned(8); ··· 501 499 #include "disk_groups_format.h" 502 500 #include "extents_format.h" 503 501 #include "ec_format.h" 504 - #include "dirent_format.h" 505 - #include "disk_groups_format.h" 506 502 #include "inode_format.h" 507 503 #include "journal_seq_blacklist_format.h" 508 504 #include "logged_ops_format.h" ··· 679 679 x(disk_accounting_v3, BCH_VERSION(1, 10)) \ 680 680 x(disk_accounting_inum, BCH_VERSION(1, 11)) \ 681 681 x(rebalance_work_acct_fix, BCH_VERSION(1, 12)) \ 682 - x(inode_has_child_snapshots, BCH_VERSION(1, 13)) 682 + x(inode_has_child_snapshots, BCH_VERSION(1, 13)) \ 683 + x(backpointer_bucket_gen, BCH_VERSION(1, 14)) \ 684 + x(disk_accounting_big_endian, BCH_VERSION(1, 15)) \ 685 + x(reflink_p_may_update_opts, BCH_VERSION(1, 16)) \ 686 + x(inode_depth, BCH_VERSION(1, 17)) \ 687 + x(persistent_inode_cursors, BCH_VERSION(1, 18)) \ 688 + x(autofix_errors, BCH_VERSION(1, 19)) \ 689 + x(directory_size, BCH_VERSION(1, 20)) 683 690 684 691 enum bcachefs_metadata_version { 685 692 bcachefs_metadata_version_min = 9, ··· 851 844 struct bch_sb, flags[5], 0, 16); 852 845 LE64_BITMASK(BCH_SB_ALLOCATOR_STUCK_TIMEOUT, 853 846 struct bch_sb, flags[5], 16, 32); 847 + LE64_BITMASK(BCH_SB_VERSION_INCOMPAT, struct bch_sb, flags[5], 32, 48); 848 + LE64_BITMASK(BCH_SB_VERSION_INCOMPAT_ALLOWED, 849 + struct bch_sb, flags[5], 48, 64); 850 + LE64_BITMASK(BCH_SB_SHARD_INUMS_NBITS, struct bch_sb, flags[6], 0, 4); 854 851 855 852 static inline __u64 BCH_SB_COMPRESSION_TYPE(const struct 
bch_sb *sb) 856 853 { ··· 907 896 x(new_varint, 15) \ 908 897 x(journal_no_flush, 16) \ 909 898 x(alloc_v2, 17) \ 910 - x(extents_across_btree_nodes, 18) 899 + x(extents_across_btree_nodes, 18) \ 900 + x(incompat_version_field, 19) 911 901 912 902 #define BCH_SB_FEATURES_ALWAYS \ 913 - ((1ULL << BCH_FEATURE_new_extent_overwrite)| \ 914 - (1ULL << BCH_FEATURE_extents_above_btree_updates)|\ 915 - (1ULL << BCH_FEATURE_btree_updates_journalled)|\ 916 - (1ULL << BCH_FEATURE_alloc_v2)|\ 917 - (1ULL << BCH_FEATURE_extents_across_btree_nodes)) 903 + (BIT_ULL(BCH_FEATURE_new_extent_overwrite)| \ 904 + BIT_ULL(BCH_FEATURE_extents_above_btree_updates)|\ 905 + BIT_ULL(BCH_FEATURE_btree_updates_journalled)|\ 906 + BIT_ULL(BCH_FEATURE_alloc_v2)|\ 907 + BIT_ULL(BCH_FEATURE_extents_across_btree_nodes)) 918 908 919 909 #define BCH_SB_FEATURES_ALL \ 920 910 (BCH_SB_FEATURES_ALWAYS| \ 921 - (1ULL << BCH_FEATURE_new_siphash)| \ 922 - (1ULL << BCH_FEATURE_btree_ptr_v2)| \ 923 - (1ULL << BCH_FEATURE_new_varint)| \ 924 - (1ULL << BCH_FEATURE_journal_no_flush)) 911 + BIT_ULL(BCH_FEATURE_new_siphash)| \ 912 + BIT_ULL(BCH_FEATURE_btree_ptr_v2)| \ 913 + BIT_ULL(BCH_FEATURE_new_varint)| \ 914 + BIT_ULL(BCH_FEATURE_journal_no_flush)) 925 915 926 916 enum bch_sb_feature { 927 917 #define x(f, n) BCH_FEATURE_##f, ··· 1044 1032 x(crc64, 2) \ 1045 1033 x(xxhash, 3) 1046 1034 1047 - enum bch_csum_opts { 1035 + enum bch_csum_opt { 1048 1036 #define x(t, n) BCH_CSUM_OPT_##t = n, 1049 1037 BCH_CSUM_OPTS() 1050 1038 #undef x ··· 1233 1221 u8 d[]; 1234 1222 } __packed __aligned(8); 1235 1223 1224 + static inline unsigned jset_entry_log_msg_bytes(struct jset_entry_log *l) 1225 + { 1226 + unsigned b = vstruct_bytes(&l->entry) - offsetof(struct jset_entry_log, d); 1227 + 1228 + while (b && !l->d[b - 1]) 1229 + --b; 1230 + return b; 1231 + } 1232 + 1236 1233 struct jset_entry_datetime { 1237 1234 struct jset_entry entry; 1238 1235 __le64 seconds; ··· 1289 1268 /* Btree: */ 1290 1269 1291 1270 enum 
btree_id_flags { 1292 - BTREE_ID_EXTENTS = BIT(0), 1293 - BTREE_ID_SNAPSHOTS = BIT(1), 1294 - BTREE_ID_SNAPSHOT_FIELD = BIT(2), 1295 - BTREE_ID_DATA = BIT(3), 1271 + BTREE_IS_extents = BIT(0), 1272 + BTREE_IS_snapshots = BIT(1), 1273 + BTREE_IS_snapshot_field = BIT(2), 1274 + BTREE_IS_data = BIT(3), 1275 + BTREE_IS_write_buffer = BIT(4), 1296 1276 }; 1297 1277 1298 1278 #define BCH_BTREE_IDS() \ 1299 - x(extents, 0, BTREE_ID_EXTENTS|BTREE_ID_SNAPSHOTS|BTREE_ID_DATA,\ 1279 + x(extents, 0, \ 1280 + BTREE_IS_extents| \ 1281 + BTREE_IS_snapshots| \ 1282 + BTREE_IS_data, \ 1300 1283 BIT_ULL(KEY_TYPE_whiteout)| \ 1301 1284 BIT_ULL(KEY_TYPE_error)| \ 1302 1285 BIT_ULL(KEY_TYPE_cookie)| \ ··· 1308 1283 BIT_ULL(KEY_TYPE_reservation)| \ 1309 1284 BIT_ULL(KEY_TYPE_reflink_p)| \ 1310 1285 BIT_ULL(KEY_TYPE_inline_data)) \ 1311 - x(inodes, 1, BTREE_ID_SNAPSHOTS, \ 1286 + x(inodes, 1, \ 1287 + BTREE_IS_snapshots, \ 1312 1288 BIT_ULL(KEY_TYPE_whiteout)| \ 1313 1289 BIT_ULL(KEY_TYPE_inode)| \ 1314 1290 BIT_ULL(KEY_TYPE_inode_v2)| \ 1315 1291 BIT_ULL(KEY_TYPE_inode_v3)| \ 1316 1292 BIT_ULL(KEY_TYPE_inode_generation)) \ 1317 - x(dirents, 2, BTREE_ID_SNAPSHOTS, \ 1293 + x(dirents, 2, \ 1294 + BTREE_IS_snapshots, \ 1318 1295 BIT_ULL(KEY_TYPE_whiteout)| \ 1319 1296 BIT_ULL(KEY_TYPE_hash_whiteout)| \ 1320 1297 BIT_ULL(KEY_TYPE_dirent)) \ 1321 - x(xattrs, 3, BTREE_ID_SNAPSHOTS, \ 1298 + x(xattrs, 3, \ 1299 + BTREE_IS_snapshots, \ 1322 1300 BIT_ULL(KEY_TYPE_whiteout)| \ 1323 1301 BIT_ULL(KEY_TYPE_cookie)| \ 1324 1302 BIT_ULL(KEY_TYPE_hash_whiteout)| \ ··· 1335 1307 BIT_ULL(KEY_TYPE_quota)) \ 1336 1308 x(stripes, 6, 0, \ 1337 1309 BIT_ULL(KEY_TYPE_stripe)) \ 1338 - x(reflink, 7, BTREE_ID_EXTENTS|BTREE_ID_DATA, \ 1310 + x(reflink, 7, \ 1311 + BTREE_IS_extents| \ 1312 + BTREE_IS_data, \ 1339 1313 BIT_ULL(KEY_TYPE_reflink_v)| \ 1340 1314 BIT_ULL(KEY_TYPE_indirect_inline_data)| \ 1341 1315 BIT_ULL(KEY_TYPE_error)) \ ··· 1345 1315 BIT_ULL(KEY_TYPE_subvolume)) \ 1346 1316 x(snapshots, 9, 0, \ 
1347 1317 BIT_ULL(KEY_TYPE_snapshot)) \ 1348 - x(lru, 10, 0, \ 1318 + x(lru, 10, \ 1319 + BTREE_IS_write_buffer, \ 1349 1320 BIT_ULL(KEY_TYPE_set)) \ 1350 - x(freespace, 11, BTREE_ID_EXTENTS, \ 1321 + x(freespace, 11, \ 1322 + BTREE_IS_extents, \ 1351 1323 BIT_ULL(KEY_TYPE_set)) \ 1352 1324 x(need_discard, 12, 0, \ 1353 1325 BIT_ULL(KEY_TYPE_set)) \ 1354 - x(backpointers, 13, 0, \ 1326 + x(backpointers, 13, \ 1327 + BTREE_IS_write_buffer, \ 1355 1328 BIT_ULL(KEY_TYPE_backpointer)) \ 1356 1329 x(bucket_gens, 14, 0, \ 1357 1330 BIT_ULL(KEY_TYPE_bucket_gens)) \ 1358 1331 x(snapshot_trees, 15, 0, \ 1359 1332 BIT_ULL(KEY_TYPE_snapshot_tree)) \ 1360 - x(deleted_inodes, 16, BTREE_ID_SNAPSHOT_FIELD, \ 1333 + x(deleted_inodes, 16, \ 1334 + BTREE_IS_snapshot_field| \ 1335 + BTREE_IS_write_buffer, \ 1361 1336 BIT_ULL(KEY_TYPE_set)) \ 1362 1337 x(logged_ops, 17, 0, \ 1363 1338 BIT_ULL(KEY_TYPE_logged_op_truncate)| \ 1364 - BIT_ULL(KEY_TYPE_logged_op_finsert)) \ 1365 - x(rebalance_work, 18, BTREE_ID_SNAPSHOT_FIELD, \ 1339 + BIT_ULL(KEY_TYPE_logged_op_finsert)| \ 1340 + BIT_ULL(KEY_TYPE_inode_alloc_cursor)) \ 1341 + x(rebalance_work, 18, \ 1342 + BTREE_IS_snapshot_field| \ 1343 + BTREE_IS_write_buffer, \ 1366 1344 BIT_ULL(KEY_TYPE_set)|BIT_ULL(KEY_TYPE_cookie)) \ 1367 1345 x(subvolume_children, 19, 0, \ 1368 1346 BIT_ULL(KEY_TYPE_set)) \ 1369 - x(accounting, 20, BTREE_ID_SNAPSHOT_FIELD, \ 1347 + x(accounting, 20, \ 1348 + BTREE_IS_snapshot_field| \ 1349 + BTREE_IS_write_buffer, \ 1370 1350 BIT_ULL(KEY_TYPE_accounting)) \ 1371 1351 1372 1352 enum btree_id { ··· 1401 1361 case BTREE_ID_need_discard: 1402 1362 case BTREE_ID_freespace: 1403 1363 case BTREE_ID_bucket_gens: 1364 + case BTREE_ID_lru: 1365 + case BTREE_ID_accounting: 1404 1366 return true; 1405 1367 default: 1406 1368 return false;
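Note on the bcachefs_format.h hunk above: `BCH_BTREE_IDS()` is an x-macro list, and this change adds a `BTREE_IS_write_buffer` bit to the per-btree flags column. The sketch below shows how such a list can be expanded once into an enum and once into a flags table. It is a simplified illustration: the `sketch_` names are hypothetical, the list has three columns instead of the four in the real header, and how the kernel itself consumes the flags column is not shown in this diff.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_btree_flags {
	SKETCH_IS_extents      = 1U << 0,
	SKETCH_IS_snapshots    = 1U << 1,
	SKETCH_IS_write_buffer = 1U << 4,
};

/* One row per btree: name, numeric id, flags. */
#define SKETCH_BTREE_IDS()						\
	x(extents,	0, SKETCH_IS_extents|SKETCH_IS_snapshots)	\
	x(lru,		1, SKETCH_IS_write_buffer)			\
	x(backpointers,	2, SKETCH_IS_write_buffer)

/* First expansion: the id enum. */
enum sketch_btree_id {
#define x(name, nr, flags) SKETCH_BTREE_ID_##name = nr,
	SKETCH_BTREE_IDS()
#undef x
	SKETCH_BTREE_ID_NR
};

/* Second expansion: a table of flags indexed by id. */
static const unsigned sketch_btree_flags_tbl[] = {
#define x(name, nr, flags) [nr] = (flags),
	SKETCH_BTREE_IDS()
#undef x
};

static bool sketch_btree_is_write_buffer(enum sketch_btree_id id)
{
	assert(id < SKETCH_BTREE_ID_NR);
	return sketch_btree_flags_tbl[id] & SKETCH_IS_write_buffer;
}
```

Keeping the flag next to the btree name means a new write-buffer btree needs only a one-line change in the list, rather than edits to a separate switch statement.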
-7
fs/bcachefs/bkey.h
··· 9 9 #include "util.h" 10 10 #include "vstructs.h" 11 11 12 - enum bch_validate_flags { 13 - BCH_VALIDATE_write = BIT(0), 14 - BCH_VALIDATE_commit = BIT(1), 15 - BCH_VALIDATE_journal = BIT(2), 16 - BCH_VALIDATE_silent = BIT(3), 17 - }; 18 - 19 12 #if 0 20 13 21 14 /*
+15 -14
fs/bcachefs/bkey_methods.c
··· 28 28 }; 29 29 30 30 static int deleted_key_validate(struct bch_fs *c, struct bkey_s_c k, 31 - enum bch_validate_flags flags) 31 + struct bkey_validate_context from) 32 32 { 33 33 return 0; 34 34 } ··· 42 42 }) 43 43 44 44 static int empty_val_key_validate(struct bch_fs *c, struct bkey_s_c k, 45 - enum bch_validate_flags flags) 45 + struct bkey_validate_context from) 46 46 { 47 47 int ret = 0; 48 48 ··· 59 59 }) 60 60 61 61 static int key_type_cookie_validate(struct bch_fs *c, struct bkey_s_c k, 62 - enum bch_validate_flags flags) 62 + struct bkey_validate_context from) 63 63 { 64 64 return 0; 65 65 } ··· 83 83 }) 84 84 85 85 static int key_type_inline_data_validate(struct bch_fs *c, struct bkey_s_c k, 86 - enum bch_validate_flags flags) 86 + struct bkey_validate_context from) 87 87 { 88 88 return 0; 89 89 } ··· 124 124 }; 125 125 126 126 int bch2_bkey_val_validate(struct bch_fs *c, struct bkey_s_c k, 127 - enum bch_validate_flags flags) 127 + struct bkey_validate_context from) 128 128 { 129 129 if (test_bit(BCH_FS_no_invalid_checks, &c->flags)) 130 130 return 0; ··· 140 140 if (!ops->key_validate) 141 141 return 0; 142 142 143 - ret = ops->key_validate(c, k, flags); 143 + ret = ops->key_validate(c, k, from); 144 144 fsck_err: 145 145 return ret; 146 146 } ··· 161 161 } 162 162 163 163 int __bch2_bkey_validate(struct bch_fs *c, struct bkey_s_c k, 164 - enum btree_node_type type, 165 - enum bch_validate_flags flags) 164 + struct bkey_validate_context from) 166 165 { 166 + enum btree_node_type type = __btree_node_type(from.level, from.btree); 167 + 167 168 if (test_bit(BCH_FS_no_invalid_checks, &c->flags)) 168 169 return 0; 169 170 ··· 178 177 return 0; 179 178 180 179 bkey_fsck_err_on(k.k->type < KEY_TYPE_MAX && 181 - (type == BKEY_TYPE_btree || (flags & BCH_VALIDATE_commit)) && 180 + (type == BKEY_TYPE_btree || (from.flags & BCH_VALIDATE_commit)) && 182 181 !(bch2_key_types_allowed[type] & BIT_ULL(k.k->type)), 183 182 c, bkey_invalid_type_for_btree, 184 183 
"invalid key type for btree %s (%s)", ··· 229 228 } 230 229 231 230 int bch2_bkey_validate(struct bch_fs *c, struct bkey_s_c k, 232 - enum btree_node_type type, 233 - enum bch_validate_flags flags) 231 + struct bkey_validate_context from) 234 232 { 235 - return __bch2_bkey_validate(c, k, type, flags) ?: 236 - bch2_bkey_val_validate(c, k, flags); 233 + return __bch2_bkey_validate(c, k, from) ?: 234 + bch2_bkey_val_validate(c, k, from); 237 235 } 238 236 239 237 int bch2_bkey_in_btree_node(struct bch_fs *c, struct btree *b, 240 - struct bkey_s_c k, enum bch_validate_flags flags) 238 + struct bkey_s_c k, 239 + struct bkey_validate_context from) 241 240 { 242 241 int ret = 0; 243 242
+8 -7
fs/bcachefs/bkey_methods.h
··· 22 22 */ 23 23 struct bkey_ops { 24 24 int (*key_validate)(struct bch_fs *c, struct bkey_s_c k, 25 - enum bch_validate_flags flags); 25 + struct bkey_validate_context from); 26 26 void (*val_to_text)(struct printbuf *, struct bch_fs *, 27 27 struct bkey_s_c); 28 28 void (*swab)(struct bkey_s); ··· 48 48 : &bch2_bkey_null_ops; 49 49 } 50 50 51 - int bch2_bkey_val_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags); 52 - int __bch2_bkey_validate(struct bch_fs *, struct bkey_s_c, enum btree_node_type, 53 - enum bch_validate_flags); 54 - int bch2_bkey_validate(struct bch_fs *, struct bkey_s_c, enum btree_node_type, 55 - enum bch_validate_flags); 51 + int bch2_bkey_val_validate(struct bch_fs *, struct bkey_s_c, 52 + struct bkey_validate_context); 53 + int __bch2_bkey_validate(struct bch_fs *, struct bkey_s_c, 54 + struct bkey_validate_context); 55 + int bch2_bkey_validate(struct bch_fs *, struct bkey_s_c, 56 + struct bkey_validate_context); 56 57 int bch2_bkey_in_btree_node(struct bch_fs *, struct btree *, struct bkey_s_c, 57 - enum bch_validate_flags); 58 + struct bkey_validate_context from); 58 59 59 60 void bch2_bpos_to_text(struct printbuf *, struct bpos); 60 61 void bch2_bkey_to_text(struct printbuf *, const struct bkey *);
+28
fs/bcachefs/bkey_types.h
··· 210 210 BCH_BKEY_TYPES(); 211 211 #undef x 212 212 213 + enum bch_validate_flags { 214 + BCH_VALIDATE_write = BIT(0), 215 + BCH_VALIDATE_commit = BIT(1), 216 + BCH_VALIDATE_silent = BIT(2), 217 + }; 218 + 219 + #define BKEY_VALIDATE_CONTEXTS() \ 220 + x(unknown) \ 221 + x(superblock) \ 222 + x(journal) \ 223 + x(btree_root) \ 224 + x(btree_node) \ 225 + x(commit) 226 + 227 + struct bkey_validate_context { 228 + enum { 229 + #define x(n) BKEY_VALIDATE_##n, 230 + BKEY_VALIDATE_CONTEXTS() 231 + #undef x 232 + } from:8; 233 + enum bch_validate_flags flags:8; 234 + u8 level; 235 + enum btree_id btree; 236 + bool root:1; 237 + unsigned journal_offset; 238 + u64 journal_seq; 239 + }; 240 + 213 241 #endif /* _BCACHEFS_BKEY_TYPES_H */
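Note on the bkey_types.h hunk above: the validate callbacks now take a `struct bkey_validate_context` by value instead of a bare `enum bch_validate_flags`, so the btree id, level, and origin travel together with the flags. A rough sketch of the pattern follows. The `sketch_` names and the depth limit are hypothetical, and the rule that "level > 0" corresponds to `BKEY_TYPE_btree` is an assumption about `__btree_node_type()`, which this diff calls but does not define.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_DEPTH 4	/* hypothetical limit, for the sanity check only */

enum sketch_validate_flags {
	SKETCH_VALIDATE_write  = 1U << 0,
	SKETCH_VALIDATE_commit = 1U << 1,
	SKETCH_VALIDATE_silent = 1U << 2,
};

enum sketch_validate_from {
	SKETCH_FROM_unknown,
	SKETCH_FROM_journal,
	SKETCH_FROM_btree_node,
	SKETCH_FROM_commit,
};

/* Everything a validator needs to know about where the key came from. */
struct sketch_validate_context {
	enum sketch_validate_from from;
	unsigned		  flags;
	uint8_t			  level;
	unsigned		  btree;
};

/* Mirrors the shape of the key-type check in __bch2_bkey_validate():
 * strict for interior nodes, or when validating at commit time. */
static bool sketch_strict_type_check(struct sketch_validate_context from)
{
	assert(from.level < SKETCH_MAX_DEPTH);
	return from.level > 0 || (from.flags & SKETCH_VALIDATE_commit);
}
```

Passing the struct by value keeps call sites short, since a caller can build the context inline with a designated initializer and omit the fields it does not care about.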
+39 -20
fs/bcachefs/btree_cache.c
··· 222 222 struct btree_cache *bc = &c->btree_cache; 223 223 224 224 mutex_lock(&bc->lock); 225 - BUG_ON(!__btree_node_pinned(bc, b)); 226 225 if (b != btree_node_root(c, b) && !btree_node_pinned(b)) { 227 226 set_btree_node_pinned(b); 228 227 list_move(&b->list, &bc->live[1].list); ··· 325 326 if (!IS_ERR_OR_NULL(b)) { 326 327 mutex_lock(&c->btree_cache.lock); 327 328 328 - bch2_btree_node_hash_remove(&c->btree_cache, b); 329 + __bch2_btree_node_hash_remove(&c->btree_cache, b); 329 330 330 331 bkey_copy(&b->key, new); 331 332 ret = __bch2_btree_node_hash_insert(&c->btree_cache, b); ··· 1003 1004 return; 1004 1005 1005 1006 prt_printf(&buf, 1006 - "btree node header doesn't match ptr\n" 1007 - "btree %s level %u\n" 1008 - "ptr: ", 1009 - bch2_btree_id_str(b->c.btree_id), b->c.level); 1007 + "btree node header doesn't match ptr: "); 1008 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level); 1009 + prt_str(&buf, "\nptr: "); 1010 1010 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 1011 1011 1012 - prt_printf(&buf, "\nheader: btree %s level %llu\n" 1013 - "min ", 1014 - bch2_btree_id_str(BTREE_NODE_ID(b->data)), 1015 - BTREE_NODE_LEVEL(b->data)); 1012 + prt_str(&buf, "\nheader: "); 1013 + bch2_btree_id_level_to_text(&buf, BTREE_NODE_ID(b->data), BTREE_NODE_LEVEL(b->data)); 1014 + prt_str(&buf, "\nmin "); 1016 1015 bch2_bpos_to_text(&buf, b->data->min_key); 1017 1016 1018 1017 prt_printf(&buf, "\nmax "); ··· 1130 1133 1131 1134 if (unlikely(btree_node_read_error(b))) { 1132 1135 six_unlock_type(&b->c.lock, lock_type); 1133 - return ERR_PTR(-BCH_ERR_btree_node_read_error); 1136 + return ERR_PTR(-BCH_ERR_btree_node_read_err_cached); 1134 1137 } 1135 1138 1136 1139 EBUG_ON(b->c.btree_id != path->btree_id); ··· 1220 1223 1221 1224 if (unlikely(btree_node_read_error(b))) { 1222 1225 six_unlock_type(&b->c.lock, lock_type); 1223 - return ERR_PTR(-BCH_ERR_btree_node_read_error); 1226 + return ERR_PTR(-BCH_ERR_btree_node_read_err_cached); 1224 1227 } 1225 
1228 1226 1229 EBUG_ON(b->c.btree_id != path->btree_id); ··· 1302 1305 1303 1306 if (unlikely(btree_node_read_error(b))) { 1304 1307 six_unlock_read(&b->c.lock); 1305 - b = ERR_PTR(-BCH_ERR_btree_node_read_error); 1308 + b = ERR_PTR(-BCH_ERR_btree_node_read_err_cached); 1306 1309 goto out; 1307 1310 } 1308 1311 ··· 1395 1398 prt_printf(out, "(unknown btree %u)", btree); 1396 1399 } 1397 1400 1401 + void bch2_btree_id_level_to_text(struct printbuf *out, enum btree_id btree, unsigned level) 1402 + { 1403 + prt_str(out, "btree="); 1404 + bch2_btree_id_to_text(out, btree); 1405 + prt_printf(out, " level=%u", level); 1406 + } 1407 + 1408 + void __bch2_btree_pos_to_text(struct printbuf *out, struct bch_fs *c, 1409 + enum btree_id btree, unsigned level, struct bkey_s_c k) 1410 + { 1411 + bch2_btree_id_to_text(out, btree); 1412 + prt_printf(out, " level %u/", level); 1413 + struct btree_root *r = bch2_btree_id_root(c, btree); 1414 + if (r) 1415 + prt_printf(out, "%u", r->level); 1416 + else 1417 + prt_printf(out, "(unknown)"); 1418 + prt_printf(out, "\n "); 1419 + 1420 + bch2_bkey_val_to_text(out, c, k); 1421 + } 1422 + 1398 1423 void bch2_btree_pos_to_text(struct printbuf *out, struct bch_fs *c, const struct btree *b) 1399 1424 { 1400 - prt_printf(out, "%s level %u/%u\n ", 1401 - bch2_btree_id_str(b->c.btree_id), 1402 - b->c.level, 1403 - bch2_btree_id_root(c, b->c.btree_id)->level); 1404 - bch2_bkey_val_to_text(out, c, bkey_i_to_s_c(&b->key)); 1425 + __bch2_btree_pos_to_text(out, c, b->c.btree_id, b->c.level, bkey_i_to_s_c(&b->key)); 1405 1426 } 1406 1427 1407 1428 void bch2_btree_node_to_text(struct printbuf *out, struct bch_fs *c, const struct btree *b) ··· 1493 1478 prt_printf(out, "cannibalize lock:\t%p\n", bc->alloc_lock); 1494 1479 prt_newline(out); 1495 1480 1496 - for (unsigned i = 0; i < ARRAY_SIZE(bc->nr_by_btree); i++) 1497 - prt_btree_cache_line(out, c, bch2_btree_id_str(i), bc->nr_by_btree[i]); 1481 + for (unsigned i = 0; i < ARRAY_SIZE(bc->nr_by_btree); 
i++) { 1482 + bch2_btree_id_to_text(out, i); 1483 + prt_printf(out, "\t"); 1484 + prt_human_readable_u64(out, bc->nr_by_btree[i] * c->opts.btree_node_size); 1485 + prt_printf(out, " (%zu)\n", bc->nr_by_btree[i]); 1486 + } 1498 1487 1499 1488 prt_newline(out); 1500 1489 prt_printf(out, "freed:\t%zu\n", bc->nr_freed);
+11 -3
fs/bcachefs/btree_cache.h
··· 128 128 } else { 129 129 unsigned idx = id - BTREE_ID_NR; 130 130 131 - EBUG_ON(idx >= c->btree_roots_extra.nr); 131 + /* This can happen when we're called from btree_node_scan */ 132 + if (idx >= c->btree_roots_extra.nr) 133 + return NULL; 134 + 132 135 return &c->btree_roots_extra.data[idx]; 133 136 } 134 137 } 135 138 136 139 static inline struct btree *btree_node_root(struct bch_fs *c, struct btree *b) 137 140 { 138 - return bch2_btree_id_root(c, b->c.btree_id)->b; 141 + struct btree_root *r = bch2_btree_id_root(c, b->c.btree_id); 142 + 143 + return r ? r->b : NULL; 139 144 } 140 145 141 - const char *bch2_btree_id_str(enum btree_id); 146 + const char *bch2_btree_id_str(enum btree_id); /* avoid */ 142 147 void bch2_btree_id_to_text(struct printbuf *, enum btree_id); 148 + void bch2_btree_id_level_to_text(struct printbuf *, enum btree_id, unsigned); 143 149 150 + void __bch2_btree_pos_to_text(struct printbuf *, struct bch_fs *, 151 + enum btree_id, unsigned, struct bkey_s_c); 144 152 void bch2_btree_pos_to_text(struct printbuf *, struct bch_fs *, const struct btree *); 145 153 void bch2_btree_node_to_text(struct printbuf *, struct bch_fs *, const struct btree *); 146 154 void bch2_btree_cache_to_text(struct printbuf *, const struct btree_cache *);
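Note on the btree_cache.h hunk above: `bch2_btree_id_root()` now returns NULL for an out-of-range dynamic btree id instead of asserting, and `btree_node_root()` checks for that before dereferencing. The sketch below shows the same "lookup may fail, caller checks" shape with hypothetical `sketch_` types; it does not model the split between fixed and extra roots in the real function.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_node { int id; };
struct sketch_root { struct sketch_node *b; };

struct sketch_fs {
	struct sketch_root roots[4];
	size_t		   nr_roots;	/* number of valid entries in roots[] */
};

/* Return the root slot for @id, or NULL if no such slot exists yet. */
static struct sketch_root *sketch_id_root(struct sketch_fs *c, size_t id)
{
	assert(c);
	if (id >= c->nr_roots)
		return NULL;
	return &c->roots[id];
}

/* Callers must tolerate a missing root rather than dereference blindly. */
static struct sketch_node *sketch_node_root(struct sketch_fs *c, size_t id)
{
	struct sketch_root *r = sketch_id_root(c, id);

	return r ? r->b : NULL;
}
```

The same NULL check appears in `__bch2_btree_pos_to_text()` in btree_cache.c, which prints "(unknown)" for the root level when the lookup fails.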
+43 -135
fs/bcachefs/btree_gc.c
··· 29 29 #include "move.h" 30 30 #include "recovery_passes.h" 31 31 #include "reflink.h" 32 + #include "recovery.h" 32 33 #include "replicas.h" 33 34 #include "super-io.h" 34 35 #include "trace.h" ··· 57 56 { 58 57 prt_str(out, bch2_gc_phase_strs[p->phase]); 59 58 prt_char(out, ' '); 60 - bch2_btree_id_to_text(out, p->btree); 61 - prt_printf(out, " l=%u ", p->level); 59 + bch2_btree_id_level_to_text(out, p->btree, p->level); 60 + prt_char(out, ' '); 62 61 bch2_bpos_to_text(out, p->pos); 63 62 } 64 63 ··· 210 209 if (bpos_eq(expected_start, cur->data->min_key)) 211 210 return 0; 212 211 213 - prt_printf(&buf, " at btree %s level %u:\n parent: ", 214 - bch2_btree_id_str(b->c.btree_id), b->c.level); 212 + prt_printf(&buf, " at "); 213 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level); 214 + prt_printf(&buf, ":\n parent: "); 215 215 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 216 216 217 217 if (prev) { ··· 279 277 if (bpos_eq(child->key.k.p, b->key.k.p)) 280 278 return 0; 281 279 282 - prt_printf(&buf, "at btree %s level %u:\n parent: ", 283 - bch2_btree_id_str(b->c.btree_id), b->c.level); 280 + prt_printf(&buf, " at "); 281 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level); 282 + prt_printf(&buf, ":\n parent: "); 284 283 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 285 284 286 285 prt_str(&buf, "\n child: "); ··· 344 341 ret = PTR_ERR_OR_ZERO(cur); 345 342 346 343 printbuf_reset(&buf); 344 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level - 1); 345 + prt_char(&buf, ' '); 347 346 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(cur_k.k)); 348 347 349 348 if (mustfix_fsck_err_on(bch2_err_matches(ret, EIO), 350 - trans, btree_node_unreadable, 351 - "Topology repair: unreadable btree node at btree %s level %u:\n" 349 + trans, btree_node_read_error, 350 + "Topology repair: unreadable btree node at\n" 352 351 " %s", 353 - bch2_btree_id_str(b->c.btree_id), 354 - b->c.level - 1, 355 352 buf.buf)) { 356 353 
bch2_btree_node_evict(trans, cur_k.k); 357 354 cur = NULL; ··· 360 357 if (ret) 361 358 break; 362 359 363 - if (!btree_id_is_alloc(b->c.btree_id)) { 364 - ret = bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_scan_for_btree_nodes); 365 - if (ret) 366 - break; 367 - } 360 + ret = bch2_btree_lost_data(c, b->c.btree_id); 361 + if (ret) 362 + break; 368 363 continue; 369 364 } 370 365 ··· 371 370 break; 372 371 373 372 if (bch2_btree_node_is_stale(c, cur)) { 374 - bch_info(c, "btree node %s older than nodes found by scanning", buf.buf); 373 + bch_info(c, "btree node older than nodes found by scanning\n %s", buf.buf); 375 374 six_unlock_read(&cur->c.lock); 376 375 bch2_btree_node_evict(trans, cur_k.k); 377 376 ret = bch2_journal_key_delete(c, b->c.btree_id, ··· 479 478 } 480 479 481 480 printbuf_reset(&buf); 481 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level); 482 + prt_newline(&buf); 482 483 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 483 484 484 485 if (mustfix_fsck_err_on(!have_child, 485 486 trans, btree_node_topology_interior_node_empty, 486 - "empty interior btree node at btree %s level %u\n" 487 - " %s", 488 - bch2_btree_id_str(b->c.btree_id), 489 - b->c.level, buf.buf)) 487 + "empty interior btree node at %s", buf.buf)) 490 488 ret = DROP_THIS_NODE; 491 489 err: 492 490 fsck_err: ··· 511 511 { 512 512 struct btree_trans *trans = bch2_trans_get(c); 513 513 struct bpos pulled_from_scan = POS_MIN; 514 + struct printbuf buf = PRINTBUF; 514 515 int ret = 0; 515 516 516 517 bch2_trans_srcu_unlock(trans); ··· 520 519 struct btree_root *r = bch2_btree_id_root(c, i); 521 520 bool reconstructed_root = false; 522 521 522 + printbuf_reset(&buf); 523 + bch2_btree_id_to_text(&buf, i); 524 + 523 525 if (r->error) { 524 - ret = bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_scan_for_btree_nodes); 526 + ret = bch2_btree_lost_data(c, i); 525 527 if (ret) 526 528 break; 527 529 reconstruct_root: 528 - bch_info(c, "btree root %s unreadable, 
must recover from scan", bch2_btree_id_str(i)); 530 + bch_info(c, "btree root %s unreadable, must recover from scan", buf.buf); 529 531 530 532 r->alive = false; 531 533 r->error = 0; 532 534 533 535 if (!bch2_btree_has_scanned_nodes(c, i)) { 534 536 mustfix_fsck_err(trans, btree_root_unreadable_and_scan_found_nothing, 535 - "no nodes found for btree %s, continue?", bch2_btree_id_str(i)); 537 + "no nodes found for btree %s, continue?", buf.buf); 536 538 bch2_btree_root_alloc_fake_trans(trans, i, 0); 537 539 } else { 538 540 bch2_btree_root_alloc_fake_trans(trans, i, 1); ··· 564 560 if (!reconstructed_root) 565 561 goto reconstruct_root; 566 562 567 - bch_err(c, "empty btree root %s", bch2_btree_id_str(i)); 563 + bch_err(c, "empty btree root %s", buf.buf); 568 564 bch2_btree_root_alloc_fake_trans(trans, i, 0); 569 565 r->alive = false; 570 566 ret = 0; 571 567 } 572 568 } 573 569 fsck_err: 570 + printbuf_exit(&buf); 574 571 bch2_trans_put(trans); 575 572 return ret; 576 573 } ··· 718 713 { 719 714 struct btree_trans *trans = bch2_trans_get(c); 720 715 enum btree_id ids[BTREE_ID_NR]; 716 + struct printbuf buf = PRINTBUF; 721 717 unsigned i; 722 718 int ret = 0; 723 719 ··· 733 727 continue; 734 728 735 729 ret = bch2_gc_btree(trans, btree, true); 736 - 737 - if (mustfix_fsck_err_on(bch2_err_matches(ret, EIO), 738 - trans, btree_node_read_error, 739 - "btree node read error for %s", 740 - bch2_btree_id_str(btree))) 741 - ret = bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_check_topology); 742 730 } 743 - fsck_err: 731 + 732 + printbuf_exit(&buf); 744 733 bch2_trans_put(trans); 745 734 bch_err_fn(c, ret); 746 735 return ret; ··· 803 802 old = bch2_alloc_to_v4(k, &old_convert); 804 803 gc = new = *old; 805 804 806 - percpu_down_read(&c->mark_lock); 807 805 __bucket_m_to_alloc(&gc, *gc_bucket(ca, iter->pos.offset)); 808 806 809 807 old_gc = gc; ··· 813 813 gc.data_type = old->data_type; 814 814 gc.dirty_sectors = old->dirty_sectors; 815 815 } 816 - 
percpu_up_read(&c->mark_lock); 817 816 818 817 /* 819 818 * gc.data_type doesn't yet include need_discard & need_gc_gen states - ··· 830 831 * safe w.r.t. transaction restarts, so fixup the gc_bucket so 831 832 * we don't run it twice: 832 833 */ 833 - percpu_down_read(&c->mark_lock); 834 834 struct bucket *gc_m = gc_bucket(ca, iter->pos.offset); 835 835 gc_m->data_type = gc.data_type; 836 836 gc_m->dirty_sectors = gc.dirty_sectors; 837 - percpu_up_read(&c->mark_lock); 838 837 } 839 838 840 839 if (fsck_err_on(new.data_type != gc.data_type, ··· 892 895 893 896 for_each_member_device(c, ca) { 894 897 ret = bch2_trans_run(c, 895 - for_each_btree_key_upto_commit(trans, iter, BTREE_ID_alloc, 898 + for_each_btree_key_max_commit(trans, iter, BTREE_ID_alloc, 896 899 POS(ca->dev_idx, ca->mi.first_bucket), 897 900 POS(ca->dev_idx, ca->mi.nbuckets - 1), 898 901 BTREE_ITER_slots|BTREE_ITER_prefetch, k, 899 - NULL, NULL, BCH_TRANS_COMMIT_lazy_rw, 902 + NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 900 903 bch2_alloc_write_key(trans, &iter, ca, k))); 901 904 if (ret) { 902 905 bch2_dev_put(ca); ··· 920 923 break; 921 924 } 922 925 } 923 - 924 - bch_err_fn(c, ret); 925 - return ret; 926 - } 927 - 928 - static int bch2_gc_write_reflink_key(struct btree_trans *trans, 929 - struct btree_iter *iter, 930 - struct bkey_s_c k, 931 - size_t *idx) 932 - { 933 - struct bch_fs *c = trans->c; 934 - const __le64 *refcount = bkey_refcount_c(k); 935 - struct printbuf buf = PRINTBUF; 936 - struct reflink_gc *r; 937 - int ret = 0; 938 - 939 - if (!refcount) 940 - return 0; 941 - 942 - while ((r = genradix_ptr(&c->reflink_gc_table, *idx)) && 943 - r->offset < k.k->p.offset) 944 - ++*idx; 945 - 946 - if (!r || 947 - r->offset != k.k->p.offset || 948 - r->size != k.k->size) { 949 - bch_err(c, "unexpected inconsistency walking reflink table at gc finish"); 950 - return -EINVAL; 951 - } 952 - 953 - if (fsck_err_on(r->refcount != le64_to_cpu(*refcount), 954 - trans, reflink_v_refcount_wrong, 955 - "reflink 
key has wrong refcount:\n" 956 - " %s\n" 957 - " should be %u", 958 - (bch2_bkey_val_to_text(&buf, c, k), buf.buf), 959 - r->refcount)) { 960 - struct bkey_i *new = bch2_bkey_make_mut_noupdate(trans, k); 961 - ret = PTR_ERR_OR_ZERO(new); 962 - if (ret) 963 - goto out; 964 - 965 - if (!r->refcount) 966 - new->k.type = KEY_TYPE_deleted; 967 - else 968 - *bkey_refcount(bkey_i_to_s(new)) = cpu_to_le64(r->refcount); 969 - ret = bch2_trans_update(trans, iter, new, 0); 970 - } 971 - out: 972 - fsck_err: 973 - printbuf_exit(&buf); 974 - return ret; 975 - } 976 - 977 - static int bch2_gc_reflink_done(struct bch_fs *c) 978 - { 979 - size_t idx = 0; 980 - 981 - int ret = bch2_trans_run(c, 982 - for_each_btree_key_commit(trans, iter, 983 - BTREE_ID_reflink, POS_MIN, 984 - BTREE_ITER_prefetch, k, 985 - NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 986 - bch2_gc_write_reflink_key(trans, &iter, k, &idx))); 987 - c->reflink_gc_nr = 0; 988 - return ret; 989 - } 990 - 991 - static int bch2_gc_reflink_start(struct bch_fs *c) 992 - { 993 - c->reflink_gc_nr = 0; 994 - 995 - int ret = bch2_trans_run(c, 996 - for_each_btree_key(trans, iter, BTREE_ID_reflink, POS_MIN, 997 - BTREE_ITER_prefetch, k, ({ 998 - const __le64 *refcount = bkey_refcount_c(k); 999 - 1000 - if (!refcount) 1001 - continue; 1002 - 1003 - struct reflink_gc *r = genradix_ptr_alloc(&c->reflink_gc_table, 1004 - c->reflink_gc_nr++, GFP_KERNEL); 1005 - if (!r) { 1006 - ret = -BCH_ERR_ENOMEM_gc_reflink_start; 1007 - break; 1008 - } 1009 - 1010 - r->offset = k.k->p.offset; 1011 - r->size = k.k->size; 1012 - r->refcount = 0; 1013 - 0; 1014 - }))); 1015 926 1016 927 bch_err_fn(c, ret); 1017 928 return ret; ··· 1076 1171 if (unlikely(test_bit(BCH_FS_going_ro, &c->flags))) 1077 1172 return -EROFS; 1078 1173 1079 - percpu_down_read(&c->mark_lock); 1080 1174 rcu_read_lock(); 1081 1175 bkey_for_each_ptr(ptrs, ptr) { 1082 1176 struct bch_dev *ca = bch2_dev_rcu(c, ptr->dev); ··· 1084 1180 1085 1181 if (dev_ptr_stale(ca, ptr) > 16) { 1086 
1182 rcu_read_unlock(); 1087 - percpu_up_read(&c->mark_lock); 1088 1183 goto update; 1089 1184 } 1090 1185 } ··· 1098 1195 *gen = ptr->gen; 1099 1196 } 1100 1197 rcu_read_unlock(); 1101 - percpu_up_read(&c->mark_lock); 1102 1198 return 0; 1103 1199 update: 1104 1200 u = bch2_bkey_make_mut(trans, iter, &k, 0); ··· 1126 1224 return ret; 1127 1225 1128 1226 a_mut->v.oldest_gen = ca->oldest_gen[iter->pos.offset]; 1129 - alloc_data_type_set(&a_mut->v, a_mut->v.data_type); 1130 1227 1131 1228 return bch2_trans_update(trans, iter, &a_mut->k_i, 0); 1132 1229 } ··· 1238 1337 bch2_write_ref_put(c, BCH_WRITE_REF_gc_gens); 1239 1338 } 1240 1339 1241 - void bch2_fs_gc_init(struct bch_fs *c) 1340 + void bch2_fs_btree_gc_exit(struct bch_fs *c) 1341 + { 1342 + } 1343 + 1344 + int bch2_fs_btree_gc_init(struct bch_fs *c) 1242 1345 { 1243 1346 seqcount_init(&c->gc_pos_lock); 1244 - 1245 1347 INIT_WORK(&c->gc_gens_work, bch2_gc_gens_work); 1348 + 1349 + init_rwsem(&c->gc_lock); 1350 + mutex_init(&c->gc_gens_lock); 1351 + return 0; 1246 1352 }
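A recurring pattern in the btree_gc.c hunks above is replacing open-coded "btree %s level %u" format strings with one helper call, bch2_btree_id_level_to_text(). The sketch below shows that consolidation in plain userspace C. It is a minimal illustration, not the kernel implementation: struct toy_buf, the toy_* functions, the name table, and the exact "btree=... level=..." output format are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct printbuf. */
struct toy_buf {
	char	data[128];
	size_t	pos;		/* index of the terminating NUL */
};

/* Hypothetical name table; the real names live in the kernel. */
static const char * const toy_btree_names[] = { "extents", "inodes", "alloc" };
#define TOY_BTREE_NR (sizeof(toy_btree_names) / sizeof(toy_btree_names[0]))

/* Append formatted text, truncating silently when the buffer is full. */
static void toy_prt(struct toy_buf *out, const char *fmt, ...)
{
	size_t left;
	va_list args;
	int n;

	assert(out->pos < sizeof(out->data));
	left = sizeof(out->data) - out->pos;

	va_start(args, fmt);
	n = vsnprintf(out->data + out->pos, left, fmt, args);
	va_end(args);

	if (n > 0)
		out->pos += (size_t) n < left ? (size_t) n : left - 1;
}

/* Print a btree name; ids outside the table are still printable. */
static void toy_btree_id_to_text(struct toy_buf *out, unsigned id)
{
	if (id < TOY_BTREE_NR)
		toy_prt(out, "%s", toy_btree_names[id]);
	else
		toy_prt(out, "(unknown btree %u)", id);
}

/* One place that decides how "which btree, which level" is rendered. */
void toy_btree_id_level_to_text(struct toy_buf *out, unsigned id, unsigned level)
{
	toy_prt(out, "btree=");
	toy_btree_id_to_text(out, id);
	toy_prt(out, " level=%u", level);
}
```

With a helper like this, each call site appends to a buffer rather than repeating its own format string, so the rendering only has to change in one place.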
+3 -1
fs/bcachefs/btree_gc.h
···
82 82
83 83 int bch2_gc_gens(struct bch_fs *);
84 84 void bch2_gc_gens_async(struct bch_fs *);
85 - void bch2_fs_gc_init(struct bch_fs *);
85 +
86 + void bch2_fs_btree_gc_exit(struct bch_fs *);
87 + int bch2_fs_btree_gc_init(struct bch_fs *);
86 88
87 89 #endif /* _BCACHEFS_BTREE_GC_H */
+145 -72
fs/bcachefs/btree_io.c
··· 25 25 26 26 static void bch2_btree_node_header_to_text(struct printbuf *out, struct btree_node *bn) 27 27 { 28 - prt_printf(out, "btree=%s l=%u seq %llux\n", 29 - bch2_btree_id_str(BTREE_NODE_ID(bn)), 30 - (unsigned) BTREE_NODE_LEVEL(bn), bn->keys.seq); 28 + bch2_btree_id_level_to_text(out, BTREE_NODE_ID(bn), BTREE_NODE_LEVEL(bn)); 29 + prt_printf(out, " seq %llx %llu\n", bn->keys.seq, BTREE_NODE_SEQ(bn)); 31 30 prt_str(out, "min: "); 32 31 bch2_bpos_to_text(out, bn->min_key); 33 32 prt_newline(out); ··· 489 490 if (b->nsets == MAX_BSETS && 490 491 !btree_node_write_in_flight(b) && 491 492 should_compact_all(c, b)) { 492 - bch2_btree_node_write(c, b, SIX_LOCK_write, 493 - BTREE_WRITE_init_next_bset); 493 + bch2_btree_node_write_trans(trans, b, SIX_LOCK_write, 494 + BTREE_WRITE_init_next_bset); 494 495 reinit_iter = true; 495 496 } 496 497 ··· 831 832 return ret; 832 833 } 833 834 835 + static int btree_node_bkey_val_validate(struct bch_fs *c, struct btree *b, 836 + struct bkey_s_c k, 837 + enum bch_validate_flags flags) 838 + { 839 + return bch2_bkey_val_validate(c, k, (struct bkey_validate_context) { 840 + .from = BKEY_VALIDATE_btree_node, 841 + .level = b->c.level, 842 + .btree = b->c.btree_id, 843 + .flags = flags 844 + }); 845 + } 846 + 834 847 static int bset_key_validate(struct bch_fs *c, struct btree *b, 835 848 struct bkey_s_c k, 836 - bool updated_range, int rw) 849 + bool updated_range, 850 + enum bch_validate_flags flags) 837 851 { 838 - return __bch2_bkey_validate(c, k, btree_node_type(b), 0) ?: 839 - (!updated_range ? bch2_bkey_in_btree_node(c, b, k, 0) : 0) ?: 840 - (rw == WRITE ? bch2_bkey_val_validate(c, k, 0) : 0); 852 + struct bkey_validate_context from = (struct bkey_validate_context) { 853 + .from = BKEY_VALIDATE_btree_node, 854 + .level = b->c.level, 855 + .btree = b->c.btree_id, 856 + .flags = flags, 857 + }; 858 + return __bch2_bkey_validate(c, k, from) ?: 859 + (!updated_range ? 
bch2_bkey_in_btree_node(c, b, k, from) : 0) ?: 860 + (flags & BCH_VALIDATE_write ? btree_node_bkey_val_validate(c, b, k, flags) : 0); 841 861 } 842 862 843 863 static bool bkey_packed_valid(struct bch_fs *c, struct btree *b, ··· 873 855 874 856 struct bkey tmp; 875 857 struct bkey_s u = __bkey_disassemble(b, k, &tmp); 876 - return !__bch2_bkey_validate(c, u.s_c, btree_node_type(b), BCH_VALIDATE_silent); 858 + return !__bch2_bkey_validate(c, u.s_c, 859 + (struct bkey_validate_context) { 860 + .from = BKEY_VALIDATE_btree_node, 861 + .level = b->c.level, 862 + .btree = b->c.btree_id, 863 + .flags = BCH_VALIDATE_silent 864 + }); 865 + } 866 + 867 + static inline int btree_node_read_bkey_cmp(const struct btree *b, 868 + const struct bkey_packed *l, 869 + const struct bkey_packed *r) 870 + { 871 + return bch2_bkey_cmp_packed(b, l, r) 872 + ?: (int) bkey_deleted(r) - (int) bkey_deleted(l); 877 873 } 878 874 879 875 static int validate_bset_keys(struct bch_fs *c, struct btree *b, ··· 950 918 BSET_BIG_ENDIAN(i), write, 951 919 &b->format, k); 952 920 953 - if (prev && bkey_iter_cmp(b, prev, k) > 0) { 921 + if (prev && btree_node_read_bkey_cmp(b, prev, k) >= 0) { 954 922 struct bkey up = bkey_unpack_key(b, prev); 955 923 956 924 printbuf_reset(&buf); ··· 997 965 got_good_key: 998 966 le16_add_cpu(&i->u64s, -next_good_key); 999 967 memmove_u64s_down(k, bkey_p_next(k), (u64 *) vstruct_end(i) - (u64 *) k); 968 + set_btree_node_need_rewrite(b); 1000 969 } 1001 970 fsck_err: 1002 971 printbuf_exit(&buf); ··· 1071 1038 1072 1039 while (b->written < (ptr_written ?: btree_sectors(c))) { 1073 1040 unsigned sectors; 1074 - struct nonce nonce; 1075 1041 bool first = !b->written; 1076 - bool csum_bad; 1077 1042 1078 - if (!b->written) { 1043 + if (first) { 1044 + bne = NULL; 1079 1045 i = &b->data->keys; 1046 + } else { 1047 + bne = write_block(b); 1048 + i = &bne->keys; 1080 1049 1081 - btree_err_on(!bch2_checksum_type_valid(c, BSET_CSUM_TYPE(i)), 1082 - 
-BCH_ERR_btree_node_read_err_want_retry, 1083 - c, ca, b, i, NULL, 1084 - bset_unknown_csum, 1085 - "unknown checksum type %llu", BSET_CSUM_TYPE(i)); 1050 + if (i->seq != b->data->keys.seq) 1051 + break; 1052 + } 1086 1053 1087 - nonce = btree_nonce(i, b->written << 9); 1054 + struct nonce nonce = btree_nonce(i, b->written << 9); 1055 + bool good_csum_type = bch2_checksum_type_valid(c, BSET_CSUM_TYPE(i)); 1088 1056 1089 - struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, b->data); 1090 - csum_bad = bch2_crc_cmp(b->data->csum, csum); 1091 - if (csum_bad) 1092 - bch2_io_error(ca, BCH_MEMBER_ERROR_checksum); 1057 + btree_err_on(!good_csum_type, 1058 + bch2_csum_type_is_encryption(BSET_CSUM_TYPE(i)) 1059 + ? -BCH_ERR_btree_node_read_err_must_retry 1060 + : -BCH_ERR_btree_node_read_err_want_retry, 1061 + c, ca, b, i, NULL, 1062 + bset_unknown_csum, 1063 + "unknown checksum type %llu", BSET_CSUM_TYPE(i)); 1093 1064 1094 - btree_err_on(csum_bad, 1095 - -BCH_ERR_btree_node_read_err_want_retry, 1096 - c, ca, b, i, NULL, 1097 - bset_bad_csum, 1098 - "%s", 1099 - (printbuf_reset(&buf), 1100 - bch2_csum_err_msg(&buf, BSET_CSUM_TYPE(i), b->data->csum, csum), 1101 - buf.buf)); 1065 + if (first) { 1066 + if (good_csum_type) { 1067 + struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, b->data); 1068 + bool csum_bad = bch2_crc_cmp(b->data->csum, csum); 1069 + if (csum_bad) 1070 + bch2_io_error(ca, BCH_MEMBER_ERROR_checksum); 1102 1071 1103 - ret = bset_encrypt(c, i, b->written << 9); 1104 - if (bch2_fs_fatal_err_on(ret, c, 1105 - "decrypting btree node: %s", bch2_err_str(ret))) 1106 - goto fsck_err; 1072 + btree_err_on(csum_bad, 1073 + -BCH_ERR_btree_node_read_err_want_retry, 1074 + c, ca, b, i, NULL, 1075 + bset_bad_csum, 1076 + "%s", 1077 + (printbuf_reset(&buf), 1078 + bch2_csum_err_msg(&buf, BSET_CSUM_TYPE(i), b->data->csum, csum), 1079 + buf.buf)); 1080 + 1081 + ret = bset_encrypt(c, i, b->written << 9); 1082 + if (bch2_fs_fatal_err_on(ret, c, 1083 
+ "decrypting btree node: %s", bch2_err_str(ret))) 1084 + goto fsck_err; 1085 + } 1107 1086 1108 1087 btree_err_on(btree_node_type_is_extents(btree_node_type(b)) && 1109 1088 !BTREE_NODE_NEW_EXTENT_OVERWRITE(b->data), ··· 1126 1081 1127 1082 sectors = vstruct_sectors(b->data, c->block_bits); 1128 1083 } else { 1129 - bne = write_block(b); 1130 - i = &bne->keys; 1084 + if (good_csum_type) { 1085 + struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, bne); 1086 + bool csum_bad = bch2_crc_cmp(bne->csum, csum); 1087 + if (ca && csum_bad) 1088 + bch2_io_error(ca, BCH_MEMBER_ERROR_checksum); 1131 1089 1132 - if (i->seq != b->data->keys.seq) 1133 - break; 1090 + btree_err_on(csum_bad, 1091 + -BCH_ERR_btree_node_read_err_want_retry, 1092 + c, ca, b, i, NULL, 1093 + bset_bad_csum, 1094 + "%s", 1095 + (printbuf_reset(&buf), 1096 + bch2_csum_err_msg(&buf, BSET_CSUM_TYPE(i), bne->csum, csum), 1097 + buf.buf)); 1134 1098 1135 - btree_err_on(!bch2_checksum_type_valid(c, BSET_CSUM_TYPE(i)), 1136 - -BCH_ERR_btree_node_read_err_want_retry, 1137 - c, ca, b, i, NULL, 1138 - bset_unknown_csum, 1139 - "unknown checksum type %llu", BSET_CSUM_TYPE(i)); 1140 - 1141 - nonce = btree_nonce(i, b->written << 9); 1142 - struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, bne); 1143 - csum_bad = bch2_crc_cmp(bne->csum, csum); 1144 - if (ca && csum_bad) 1145 - bch2_io_error(ca, BCH_MEMBER_ERROR_checksum); 1146 - 1147 - btree_err_on(csum_bad, 1148 - -BCH_ERR_btree_node_read_err_want_retry, 1149 - c, ca, b, i, NULL, 1150 - bset_bad_csum, 1151 - "%s", 1152 - (printbuf_reset(&buf), 1153 - bch2_csum_err_msg(&buf, BSET_CSUM_TYPE(i), bne->csum, csum), 1154 - buf.buf)); 1155 - 1156 - ret = bset_encrypt(c, i, b->written << 9); 1157 - if (bch2_fs_fatal_err_on(ret, c, 1158 - "decrypting btree node: %s", bch2_err_str(ret))) 1159 - goto fsck_err; 1099 + ret = bset_encrypt(c, i, b->written << 9); 1100 + if (bch2_fs_fatal_err_on(ret, c, 1101 + "decrypting btree node: %s", 
bch2_err_str(ret))) 1102 + goto fsck_err; 1103 + } 1160 1104 1161 1105 sectors = vstruct_sectors(bne, c->block_bits); 1162 1106 } ··· 1250 1216 struct bkey tmp; 1251 1217 struct bkey_s u = __bkey_disassemble(b, k, &tmp); 1252 1218 1253 - ret = bch2_bkey_val_validate(c, u.s_c, READ); 1219 + ret = btree_node_bkey_val_validate(c, b, u.s_c, READ); 1254 1220 if (ret == -BCH_ERR_fsck_delete_bkey || 1255 1221 (bch2_inject_invalid_keys && 1256 1222 !bversion_cmp(u.k->bversion, MAX_VERSION))) { ··· 1260 1226 memmove_u64s_down(k, bkey_p_next(k), 1261 1227 (u64 *) vstruct_end(i) - (u64 *) k); 1262 1228 set_btree_bset_end(b, b->set); 1229 + set_btree_node_need_rewrite(b); 1263 1230 continue; 1264 1231 } 1265 1232 if (ret) ··· 1374 1339 rb->start_time); 1375 1340 bio_put(&rb->bio); 1376 1341 1377 - if (saw_error && 1342 + if ((saw_error || 1343 + btree_node_need_rewrite(b)) && 1378 1344 !btree_node_read_error(b) && 1379 1345 c->curr_recovery_pass != BCH_RECOVERY_PASS_scan_for_btree_nodes) { 1380 - printbuf_reset(&buf); 1381 - bch2_bpos_to_text(&buf, b->key.k.p); 1382 - bch_err_ratelimited(c, "%s: rewriting btree node at btree=%s level=%u %s due to error", 1383 - __func__, bch2_btree_id_str(b->c.btree_id), b->c.level, buf.buf); 1346 + if (saw_error) { 1347 + printbuf_reset(&buf); 1348 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level); 1349 + prt_str(&buf, " "); 1350 + bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 1351 + bch_err_ratelimited(c, "%s: rewriting btree node at due to error\n %s", 1352 + __func__, buf.buf); 1353 + } 1384 1354 1385 1355 bch2_btree_node_rewrite_async(c, b); 1386 1356 } ··· 1973 1933 bool saw_error; 1974 1934 1975 1935 int ret = bch2_bkey_validate(c, bkey_i_to_s_c(&b->key), 1976 - BKEY_TYPE_btree, WRITE); 1936 + (struct bkey_validate_context) { 1937 + .from = BKEY_VALIDATE_btree_node, 1938 + .level = b->c.level + 1, 1939 + .btree = b->c.btree_id, 1940 + .flags = BCH_VALIDATE_write, 1941 + }); 1977 1942 if (ret) { 1978 1943 
bch2_fs_inconsistent(c, "invalid btree node key before write"); 1979 1944 return ret; ··· 2333 2288 six_trylock_write(&b->c.lock)) { 2334 2289 bch2_btree_post_write_cleanup(c, b); 2335 2290 six_unlock_write(&b->c.lock); 2291 + } 2292 + 2293 + if (lock_type_held == SIX_LOCK_read) 2294 + six_lock_downgrade(&b->c.lock); 2295 + } else { 2296 + __bch2_btree_node_write(c, b, flags); 2297 + if (lock_type_held == SIX_LOCK_write && 2298 + btree_node_just_written(b)) 2299 + bch2_btree_post_write_cleanup(c, b); 2300 + } 2301 + } 2302 + 2303 + void bch2_btree_node_write_trans(struct btree_trans *trans, struct btree *b, 2304 + enum six_lock_type lock_type_held, 2305 + unsigned flags) 2306 + { 2307 + struct bch_fs *c = trans->c; 2308 + 2309 + if (lock_type_held == SIX_LOCK_intent || 2310 + (lock_type_held == SIX_LOCK_read && 2311 + six_lock_tryupgrade(&b->c.lock))) { 2312 + __bch2_btree_node_write(c, b, flags); 2313 + 2314 + /* don't cycle lock unnecessarily: */ 2315 + if (btree_node_just_written(b) && 2316 + six_trylock_write(&b->c.lock)) { 2317 + bch2_btree_post_write_cleanup(c, b); 2318 + __bch2_btree_node_unlock_write(trans, b); 2336 2319 } 2337 2320 2338 2321 if (lock_type_held == SIX_LOCK_read)
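The validate_bset_keys() hunk above tightens the ordering check from `bkey_iter_cmp(...) > 0` to `btree_node_read_bkey_cmp(...) >= 0`, where the new comparison falls back to the deleted flag when two keys compare equal. Read literally, this means that two keys at the same position are accepted only when a deleted key is followed by a live one. The sketch below reproduces that rule on a toy key type. struct toy_key and both toy_* functions are hypothetical, and positions are reduced to a single integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified key: one integer position plus a deleted flag. */
struct toy_key {
	unsigned long long	pos;
	bool			deleted;
};

/*
 * Compare by position first; on a tie, a deleted key sorts before a
 * live key. Mirrors the shape of btree_node_read_bkey_cmp() in the diff.
 */
int toy_key_cmp(const struct toy_key *l, const struct toy_key *r)
{
	int cmp = (l->pos > r->pos) - (l->pos < r->pos);

	if (cmp)
		return cmp;
	return (int) r->deleted - (int) l->deleted;
}

/*
 * Return the index of the first key that is not strictly greater than
 * its predecessor under toy_key_cmp(), or n if the array is well ordered.
 */
size_t toy_first_out_of_order(const struct toy_key *keys, size_t n)
{
	assert(keys || !n);

	for (size_t i = 1; i < n; i++)
		if (toy_key_cmp(&keys[i - 1], &keys[i]) >= 0)
			return i;
	return n;
}
```

Under this rule an exact duplicate, or a live key followed by a deleted key at the same position, is reported as out of order. With a plain `> 0` check on position alone, both cases would pass.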
+4 -2
fs/bcachefs/btree_io.h
···
144 144 void __bch2_btree_node_write(struct bch_fs *, struct btree *, unsigned);
145 145 void bch2_btree_node_write(struct bch_fs *, struct btree *,
146 146 			   enum six_lock_type, unsigned);
147 + void bch2_btree_node_write_trans(struct btree_trans *, struct btree *,
148 + 				 enum six_lock_type, unsigned);
147 149
148 - static inline void btree_node_write_if_need(struct bch_fs *c, struct btree *b,
150 + static inline void btree_node_write_if_need(struct btree_trans *trans, struct btree *b,
149 151 					    enum six_lock_type lock_held)
150 152 {
151 - 	bch2_btree_node_write(c, b, lock_held, BTREE_WRITE_ONLY_IF_NEED);
153 + 	bch2_btree_node_write_trans(trans, b, lock_held, BTREE_WRITE_ONLY_IF_NEED);
152 154 }
153 155
154 156 bool bch2_btree_flush_all_reads(struct bch_fs *);
+383 -225
fs/bcachefs/btree_iter.c
··· 270 270 BUG_ON(!(iter->flags & BTREE_ITER_all_snapshots) && 271 271 iter->pos.snapshot != iter->snapshot); 272 272 273 - BUG_ON(bkey_lt(iter->pos, bkey_start_pos(&iter->k)) || 274 - bkey_gt(iter->pos, iter->k.p)); 273 + BUG_ON(iter->flags & BTREE_ITER_all_snapshots ? !bpos_eq(iter->pos, iter->k.p) : 274 + !(iter->flags & BTREE_ITER_is_extents) ? !bkey_eq(iter->pos, iter->k.p) : 275 + (bkey_lt(iter->pos, bkey_start_pos(&iter->k)) || 276 + bkey_gt(iter->pos, iter->k.p))); 275 277 } 276 278 277 279 static int bch2_btree_iter_verify_ret(struct btree_iter *iter, struct bkey_s_c k) ··· 329 327 void bch2_assert_pos_locked(struct btree_trans *trans, enum btree_id id, 330 328 struct bpos pos) 331 329 { 332 - bch2_trans_verify_not_unlocked(trans); 330 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 333 331 334 332 struct btree_path *path; 335 333 struct trans_for_each_path_inorder_iter iter; ··· 699 697 bch2_trans_revalidate_updates_in_node(trans, b); 700 698 } 701 699 700 + void bch2_trans_node_drop(struct btree_trans *trans, 701 + struct btree *b) 702 + { 703 + struct btree_path *path; 704 + unsigned i, level = b->c.level; 705 + 706 + trans_for_each_path(trans, path, i) 707 + if (path->l[level].b == b) { 708 + btree_node_unlock(trans, path, level); 709 + path->l[level].b = ERR_PTR(-BCH_ERR_no_btree_node_init); 710 + } 711 + } 712 + 702 713 /* 703 714 * A btree node has been modified in such a way as to invalidate iterators - fix 704 715 * them: ··· 735 720 unsigned long trace_ip) 736 721 { 737 722 struct bch_fs *c = trans->c; 738 - struct btree *b, **rootp = &bch2_btree_id_root(c, path->btree_id)->b; 723 + struct btree_root *r = bch2_btree_id_root(c, path->btree_id); 739 724 enum six_lock_type lock_type; 740 725 unsigned i; 741 726 int ret; ··· 743 728 EBUG_ON(path->nodes_locked); 744 729 745 730 while (1) { 746 - b = READ_ONCE(*rootp); 731 + struct btree *b = READ_ONCE(r->b); 732 + if (unlikely(!b)) { 733 + BUG_ON(!r->error); 734 + return r->error; 735 + } 736 
+ 747 737 path->level = READ_ONCE(b->c.level); 748 738 749 739 if (unlikely(path->level < depth_want)) { ··· 768 748 ret = btree_node_lock(trans, path, &b->c, 769 749 path->level, lock_type, trace_ip); 770 750 if (unlikely(ret)) { 771 - if (bch2_err_matches(ret, BCH_ERR_lock_fail_root_changed)) 772 - continue; 773 751 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 774 752 return ret; 775 753 BUG(); 776 754 } 777 755 778 - if (likely(b == READ_ONCE(*rootp) && 756 + if (likely(b == READ_ONCE(r->b) && 779 757 b->c.level == path->level && 780 758 !race_fault())) { 781 759 for (i = 0; i < path->level; i++) ··· 842 824 int ret = 0; 843 825 844 826 bch2_bkey_buf_init(&tmp); 827 + 828 + jiter->fail_if_too_many_whiteouts = true; 845 829 846 830 while (nr-- && !ret) { 847 831 if (!bch2_btree_node_relock(trans, path, path->level)) ··· 1020 1000 1021 1001 bch2_trans_unlock(trans); 1022 1002 cond_resched(); 1023 - trans_set_locked(trans); 1003 + trans_set_locked(trans, false); 1024 1004 1025 1005 if (unlikely(trans->memory_allocation_failure)) { 1026 1006 struct closure cl; ··· 1287 1267 { 1288 1268 int cmp = bpos_cmp(new_pos, trans->paths[path_idx].pos); 1289 1269 1290 - bch2_trans_verify_not_in_restart(trans); 1270 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 1291 1271 EBUG_ON(!trans->paths[path_idx].ref); 1292 1272 1293 1273 trace_btree_path_set_pos(trans, trans->paths + path_idx, &new_pos); ··· 1447 1427 (void *) trans->last_begin_ip); 1448 1428 } 1449 1429 1450 - void __noreturn bch2_trans_in_restart_error(struct btree_trans *trans) 1430 + static void __noreturn bch2_trans_in_restart_error(struct btree_trans *trans) 1451 1431 { 1432 + #ifdef CONFIG_BCACHEFS_DEBUG 1433 + struct printbuf buf = PRINTBUF; 1434 + bch2_prt_backtrace(&buf, &trans->last_restarted_trace); 1435 + panic("in transaction restart: %s, last restarted by\n%s", 1436 + bch2_err_str(trans->restarted), 1437 + buf.buf); 1438 + #else 1452 1439 panic("in transaction restart: %s, last restarted 
by %pS\n", 1453 1440 bch2_err_str(trans->restarted), 1454 1441 (void *) trans->last_restarted_ip); 1442 + #endif 1455 1443 } 1456 1444 1457 - void __noreturn bch2_trans_unlocked_error(struct btree_trans *trans) 1445 + void __noreturn bch2_trans_unlocked_or_in_restart_error(struct btree_trans *trans) 1458 1446 { 1459 - panic("trans should be locked, unlocked by %pS\n", 1460 - (void *) trans->last_unlock_ip); 1447 + if (trans->restarted) 1448 + bch2_trans_in_restart_error(trans); 1449 + 1450 + if (!trans->locked) 1451 + panic("trans should be locked, unlocked by %pS\n", 1452 + (void *) trans->last_unlock_ip); 1453 + 1454 + BUG(); 1461 1455 } 1462 1456 1463 1457 noinline __cold ··· 1484 1450 trans_for_each_update(trans, i) { 1485 1451 struct bkey_s_c old = { &i->old_k, i->old_v }; 1486 1452 1487 - prt_printf(buf, "update: btree=%s cached=%u %pS\n", 1488 - bch2_btree_id_str(i->btree_id), 1489 - i->cached, 1490 - (void *) i->ip_allocated); 1453 + prt_str(buf, "update: btree="); 1454 + bch2_btree_id_to_text(buf, i->btree_id); 1455 + prt_printf(buf, " cached=%u %pS\n", 1456 + i->cached, 1457 + (void *) i->ip_allocated); 1491 1458 1492 1459 prt_printf(buf, " old "); 1493 1460 bch2_bkey_val_to_text(buf, trans->c, old); ··· 1521 1486 { 1522 1487 struct btree_path *path = trans->paths + path_idx; 1523 1488 1524 - prt_printf(out, "path: idx %3u ref %u:%u %c %c %c btree=%s l=%u pos ", 1489 + prt_printf(out, "path: idx %3u ref %u:%u %c %c %c ", 1525 1490 path_idx, path->ref, path->intent_ref, 1526 1491 path->preserve ? 'P' : ' ', 1527 1492 path->should_be_locked ? 'S' : ' ', 1528 - path->cached ? 'C' : 'B', 1529 - bch2_btree_id_str(path->btree_id), 1530 - path->level); 1493 + path->cached ? 
'C' : 'B'); 1494 + bch2_btree_id_level_to_text(out, path->btree_id, path->level); 1495 + prt_str(out, " pos "); 1531 1496 bch2_bpos_to_text(out, path->pos); 1532 1497 1533 1498 if (!path->cached && btree_node_locked(path, path->level)) { ··· 1752 1717 struct trans_for_each_path_inorder_iter iter; 1753 1718 btree_path_idx_t path_pos = 0, path_idx; 1754 1719 1755 - bch2_trans_verify_not_unlocked(trans); 1756 - bch2_trans_verify_not_in_restart(trans); 1720 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 1757 1721 bch2_trans_verify_locks(trans); 1758 1722 1759 1723 btree_trans_sort_paths(trans); ··· 1867 1833 !bkey_eq(path->pos, ck->key.pos)); 1868 1834 1869 1835 *u = ck->k->k; 1870 - k = bkey_i_to_s_c(ck->k); 1836 + k = (struct bkey_s_c) { u, &ck->k->v }; 1871 1837 } 1872 1838 1873 1839 return k; ··· 1876 1842 u->p = path->pos; 1877 1843 return (struct bkey_s_c) { u, NULL }; 1878 1844 } 1879 - 1880 1845 1881 1846 void bch2_set_btree_iter_dontneed(struct btree_iter *iter) 1882 1847 { ··· 1903 1870 struct btree_trans *trans = iter->trans; 1904 1871 int ret; 1905 1872 1906 - bch2_trans_verify_not_unlocked(trans); 1873 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 1907 1874 1908 1875 iter->path = bch2_btree_path_set_pos(trans, iter->path, 1909 1876 btree_iter_search_key(iter), ··· 1978 1945 int ret; 1979 1946 1980 1947 EBUG_ON(trans->paths[iter->path].cached); 1981 - bch2_trans_verify_not_in_restart(trans); 1948 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 1982 1949 bch2_btree_iter_verify(iter); 1983 1950 1984 1951 ret = bch2_btree_path_traverse(trans, iter->path, iter->flags); ··· 2134 2101 { 2135 2102 struct btree_path *path = btree_iter_path(trans, iter); 2136 2103 2137 - return bch2_journal_keys_peek_upto(trans->c, iter->btree_id, 2104 + return bch2_journal_keys_peek_max(trans->c, iter->btree_id, 2138 2105 path->level, 2139 2106 path->pos, 2140 2107 end_pos, ··· 2157 2124 } 2158 2125 2159 2126 static noinline 2160 - struct bkey_s_c 
btree_trans_peek_journal(struct btree_trans *trans, 2161 - struct btree_iter *iter, 2162 - struct bkey_s_c k) 2127 + void btree_trans_peek_journal(struct btree_trans *trans, 2128 + struct btree_iter *iter, 2129 + struct bkey_s_c *k) 2163 2130 { 2164 2131 struct btree_path *path = btree_iter_path(trans, iter); 2165 2132 struct bkey_i *next_journal = 2166 2133 bch2_btree_journal_peek(trans, iter, 2167 - k.k ? k.k->p : path_l(path)->b->key.k.p); 2134 + k->k ? k->k->p : path_l(path)->b->key.k.p); 2135 + if (next_journal) { 2136 + iter->k = next_journal->k; 2137 + *k = bkey_i_to_s_c(next_journal); 2138 + } 2139 + } 2140 + 2141 + static struct bkey_i *bch2_btree_journal_peek_prev(struct btree_trans *trans, 2142 + struct btree_iter *iter, 2143 + struct bpos end_pos) 2144 + { 2145 + struct btree_path *path = btree_iter_path(trans, iter); 2146 + 2147 + return bch2_journal_keys_peek_prev_min(trans->c, iter->btree_id, 2148 + path->level, 2149 + path->pos, 2150 + end_pos, 2151 + &iter->journal_idx); 2152 + } 2153 + 2154 + static noinline 2155 + void btree_trans_peek_prev_journal(struct btree_trans *trans, 2156 + struct btree_iter *iter, 2157 + struct bkey_s_c *k) 2158 + { 2159 + struct btree_path *path = btree_iter_path(trans, iter); 2160 + struct bkey_i *next_journal = 2161 + bch2_btree_journal_peek_prev(trans, iter, 2162 + k->k ? 
k->k->p : path_l(path)->b->key.k.p); 2168 2163 2169 2164 if (next_journal) { 2170 2165 iter->k = next_journal->k; 2171 - k = bkey_i_to_s_c(next_journal); 2166 + *k = bkey_i_to_s_c(next_journal); 2172 2167 } 2173 - 2174 - return k; 2175 2168 } 2176 2169 2177 2170 /* ··· 2213 2154 struct bkey_s_c k; 2214 2155 int ret; 2215 2156 2216 - bch2_trans_verify_not_in_restart(trans); 2217 - bch2_trans_verify_not_unlocked(trans); 2157 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 2218 2158 2219 2159 if ((iter->flags & BTREE_ITER_key_cache_fill) && 2220 2160 bpos_eq(iter->pos, pos)) ··· 2242 2184 btree_path_set_should_be_locked(trans, trans->paths + iter->key_cache_path); 2243 2185 2244 2186 k = bch2_btree_path_peek_slot(trans->paths + iter->key_cache_path, &u); 2245 - if (k.k && !bkey_err(k)) { 2246 - iter->k = u; 2247 - k.k = &iter->k; 2248 - } 2187 + if (!k.k) 2188 + return k; 2189 + 2190 + if ((iter->flags & BTREE_ITER_all_snapshots) && 2191 + !bpos_eq(pos, k.k->p)) 2192 + return bkey_s_c_null; 2193 + 2194 + iter->k = u; 2195 + k.k = &iter->k; 2249 2196 return k; 2250 2197 } 2251 2198 ··· 2264 2201 bch2_btree_iter_verify(iter); 2265 2202 2266 2203 while (1) { 2267 - struct btree_path_level *l; 2268 - 2269 2204 iter->path = bch2_btree_path_set_pos(trans, iter->path, search_key, 2270 2205 iter->flags & BTREE_ITER_intent, 2271 2206 btree_iter_ip_allocated(iter)); ··· 2273 2212 /* ensure that iter->k is consistent with iter->pos: */ 2274 2213 bch2_btree_iter_set_pos(iter, iter->pos); 2275 2214 k = bkey_s_c_err(ret); 2276 - goto out; 2215 + break; 2277 2216 } 2278 2217 2279 2218 struct btree_path *path = btree_iter_path(trans, iter); 2280 - l = path_l(path); 2219 + struct btree_path_level *l = path_l(path); 2281 2220 2282 2221 if (unlikely(!l->b)) { 2283 2222 /* No btree nodes at requested level: */ 2284 2223 bch2_btree_iter_set_pos(iter, SPOS_MAX); 2285 2224 k = bkey_s_c_null; 2286 - goto out; 2225 + break; 2287 2226 } 2288 2227 2289 2228 
btree_path_set_should_be_locked(trans, path); ··· 2294 2233 k.k && 2295 2234 (k2 = btree_trans_peek_key_cache(iter, k.k->p)).k) { 2296 2235 k = k2; 2297 - ret = bkey_err(k); 2298 - if (ret) { 2236 + if (bkey_err(k)) { 2299 2237 bch2_btree_iter_set_pos(iter, iter->pos); 2300 - goto out; 2238 + break; 2301 2239 } 2302 2240 } 2303 2241 2304 2242 if (unlikely(iter->flags & BTREE_ITER_with_journal)) 2305 - k = btree_trans_peek_journal(trans, iter, k); 2243 + btree_trans_peek_journal(trans, iter, &k); 2306 2244 2307 2245 if (unlikely((iter->flags & BTREE_ITER_with_updates) && 2308 2246 trans->nr_updates)) ··· 2330 2270 /* End of btree: */ 2331 2271 bch2_btree_iter_set_pos(iter, SPOS_MAX); 2332 2272 k = bkey_s_c_null; 2333 - goto out; 2273 + break; 2334 2274 } 2335 2275 } 2336 - out: 2337 - bch2_btree_iter_verify(iter); 2338 2276 2277 + bch2_btree_iter_verify(iter); 2339 2278 return k; 2340 2279 } 2341 2280 2342 2281 /** 2343 - * bch2_btree_iter_peek_upto() - returns first key greater than or equal to 2282 + * bch2_btree_iter_peek_max() - returns first key greater than or equal to 2344 2283 * iterator's current position 2345 2284 * @iter: iterator to peek from 2346 2285 * @end: search limit: returns keys less than or equal to @end 2347 2286 * 2348 2287 * Returns: key if found, or an error extractable with bkey_err(). 
2349 2288 */ 2350 - struct bkey_s_c bch2_btree_iter_peek_upto(struct btree_iter *iter, struct bpos end) 2289 + struct bkey_s_c bch2_btree_iter_peek_max(struct btree_iter *iter, struct bpos end) 2351 2290 { 2352 2291 struct btree_trans *trans = iter->trans; 2353 2292 struct bpos search_key = btree_iter_search_key(iter); 2354 2293 struct bkey_s_c k; 2355 - struct bpos iter_pos; 2294 + struct bpos iter_pos = iter->pos; 2356 2295 int ret; 2357 2296 2358 - bch2_trans_verify_not_unlocked(trans); 2297 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 2298 + bch2_btree_iter_verify_entry_exit(iter); 2359 2299 EBUG_ON((iter->flags & BTREE_ITER_filter_snapshots) && bkey_eq(end, POS_MAX)); 2360 2300 2361 2301 if (iter->update_path) { ··· 2364 2304 iter->update_path = 0; 2365 2305 } 2366 2306 2367 - bch2_btree_iter_verify_entry_exit(iter); 2368 - 2369 2307 while (1) { 2370 2308 k = __bch2_btree_iter_peek(iter, search_key); 2371 2309 if (unlikely(!k.k)) ··· 2371 2313 if (unlikely(bkey_err(k))) 2372 2314 goto out_no_locked; 2373 2315 2374 - /* 2375 - * We need to check against @end before FILTER_SNAPSHOTS because 2376 - * if we get to a different inode that requested we might be 2377 - * seeing keys for a different snapshot tree that will all be 2378 - * filtered out. 2379 - * 2380 - * But we can't do the full check here, because bkey_start_pos() 2381 - * isn't monotonically increasing before FILTER_SNAPSHOTS, and 2382 - * that's what we check against in extents mode: 2383 - */ 2384 - if (unlikely(!(iter->flags & BTREE_ITER_is_extents) 2385 - ? bkey_gt(k.k->p, end) 2386 - : k.k->p.inode > end.inode)) 2387 - goto end; 2316 + if (iter->flags & BTREE_ITER_filter_snapshots) { 2317 + /* 2318 + * We need to check against @end before FILTER_SNAPSHOTS because 2319 + * if we get to a different inode that requested we might be 2320 + * seeing keys for a different snapshot tree that will all be 2321 + * filtered out. 
2322 + * 2323 + * But we can't do the full check here, because bkey_start_pos() 2324 + * isn't monotonically increasing before FILTER_SNAPSHOTS, and 2325 + * that's what we check against in extents mode: 2326 + */ 2327 + if (unlikely(!(iter->flags & BTREE_ITER_is_extents) 2328 + ? bkey_gt(k.k->p, end) 2329 + : k.k->p.inode > end.inode)) 2330 + goto end; 2388 2331 2389 - if (iter->update_path && 2390 - !bkey_eq(trans->paths[iter->update_path].pos, k.k->p)) { 2391 - bch2_path_put_nokeep(trans, iter->update_path, 2392 - iter->flags & BTREE_ITER_intent); 2393 - iter->update_path = 0; 2394 - } 2332 + if (iter->update_path && 2333 + !bkey_eq(trans->paths[iter->update_path].pos, k.k->p)) { 2334 + bch2_path_put_nokeep(trans, iter->update_path, 2335 + iter->flags & BTREE_ITER_intent); 2336 + iter->update_path = 0; 2337 + } 2395 2338 2396 - if ((iter->flags & BTREE_ITER_filter_snapshots) && 2397 - (iter->flags & BTREE_ITER_intent) && 2398 - !(iter->flags & BTREE_ITER_is_extents) && 2399 - !iter->update_path) { 2400 - struct bpos pos = k.k->p; 2339 + if ((iter->flags & BTREE_ITER_intent) && 2340 + !(iter->flags & BTREE_ITER_is_extents) && 2341 + !iter->update_path) { 2342 + struct bpos pos = k.k->p; 2401 2343 2402 - if (pos.snapshot < iter->snapshot) { 2344 + if (pos.snapshot < iter->snapshot) { 2345 + search_key = bpos_successor(k.k->p); 2346 + continue; 2347 + } 2348 + 2349 + pos.snapshot = iter->snapshot; 2350 + 2351 + /* 2352 + * advance, same as on exit for iter->path, but only up 2353 + * to snapshot 2354 + */ 2355 + __btree_path_get(trans, trans->paths + iter->path, iter->flags & BTREE_ITER_intent); 2356 + iter->update_path = iter->path; 2357 + 2358 + iter->update_path = bch2_btree_path_set_pos(trans, 2359 + iter->update_path, pos, 2360 + iter->flags & BTREE_ITER_intent, 2361 + _THIS_IP_); 2362 + ret = bch2_btree_path_traverse(trans, iter->update_path, iter->flags); 2363 + if (unlikely(ret)) { 2364 + k = bkey_s_c_err(ret); 2365 + goto out_no_locked; 2366 + } 2367 + } 
2368 + 2369 + /* 2370 + * We can never have a key in a leaf node at POS_MAX, so 2371 + * we don't have to check these successor() calls: 2372 + */ 2373 + if (!bch2_snapshot_is_ancestor(trans->c, 2374 + iter->snapshot, 2375 + k.k->p.snapshot)) { 2403 2376 search_key = bpos_successor(k.k->p); 2404 2377 continue; 2405 2378 } 2406 2379 2407 - pos.snapshot = iter->snapshot; 2408 - 2409 - /* 2410 - * advance, same as on exit for iter->path, but only up 2411 - * to snapshot 2412 - */ 2413 - __btree_path_get(trans, trans->paths + iter->path, iter->flags & BTREE_ITER_intent); 2414 - iter->update_path = iter->path; 2415 - 2416 - iter->update_path = bch2_btree_path_set_pos(trans, 2417 - iter->update_path, pos, 2418 - iter->flags & BTREE_ITER_intent, 2419 - _THIS_IP_); 2420 - ret = bch2_btree_path_traverse(trans, iter->update_path, iter->flags); 2421 - if (unlikely(ret)) { 2422 - k = bkey_s_c_err(ret); 2423 - goto out_no_locked; 2380 + if (bkey_whiteout(k.k) && 2381 + !(iter->flags & BTREE_ITER_key_cache_fill)) { 2382 + search_key = bkey_successor(iter, k.k->p); 2383 + continue; 2424 2384 } 2425 - } 2426 - 2427 - /* 2428 - * We can never have a key in a leaf node at POS_MAX, so 2429 - * we don't have to check these successor() calls: 2430 - */ 2431 - if ((iter->flags & BTREE_ITER_filter_snapshots) && 2432 - !bch2_snapshot_is_ancestor(trans->c, 2433 - iter->snapshot, 2434 - k.k->p.snapshot)) { 2435 - search_key = bpos_successor(k.k->p); 2436 - continue; 2437 - } 2438 - 2439 - if (bkey_whiteout(k.k) && 2440 - !(iter->flags & BTREE_ITER_all_snapshots)) { 2441 - search_key = bkey_successor(iter, k.k->p); 2442 - continue; 2443 2385 } 2444 2386 2445 2387 /* ··· 2509 2451 return bch2_btree_iter_peek(iter); 2510 2452 } 2511 2453 2512 - /** 2513 - * bch2_btree_iter_peek_prev() - returns first key less than or equal to 2514 - * iterator's current position 2515 - * @iter: iterator to peek from 2516 - * 2517 - * Returns: key if found, or an error extractable with bkey_err(). 
2518 - */ 2519 - struct bkey_s_c bch2_btree_iter_peek_prev(struct btree_iter *iter) 2454 + static struct bkey_s_c __bch2_btree_iter_peek_prev(struct btree_iter *iter, struct bpos search_key) 2520 2455 { 2521 2456 struct btree_trans *trans = iter->trans; 2522 - struct bpos search_key = iter->pos; 2523 - struct bkey_s_c k; 2524 - struct bkey saved_k; 2525 - const struct bch_val *saved_v; 2526 - btree_path_idx_t saved_path = 0; 2527 - int ret; 2528 - 2529 - bch2_trans_verify_not_unlocked(trans); 2530 - EBUG_ON(btree_iter_path(trans, iter)->cached || 2531 - btree_iter_path(trans, iter)->level); 2532 - 2533 - if (iter->flags & BTREE_ITER_with_journal) 2534 - return bkey_s_c_err(-BCH_ERR_btree_iter_with_journal_not_supported); 2457 + struct bkey_s_c k, k2; 2535 2458 2536 2459 bch2_btree_iter_verify(iter); 2537 - bch2_btree_iter_verify_entry_exit(iter); 2538 - 2539 - if (iter->flags & BTREE_ITER_filter_snapshots) 2540 - search_key.snapshot = U32_MAX; 2541 2460 2542 2461 while (1) { 2543 2462 iter->path = bch2_btree_path_set_pos(trans, iter->path, search_key, 2544 - iter->flags & BTREE_ITER_intent, 2545 - btree_iter_ip_allocated(iter)); 2463 + iter->flags & BTREE_ITER_intent, 2464 + btree_iter_ip_allocated(iter)); 2546 2465 2547 - ret = bch2_btree_path_traverse(trans, iter->path, iter->flags); 2466 + int ret = bch2_btree_path_traverse(trans, iter->path, iter->flags); 2548 2467 if (unlikely(ret)) { 2549 2468 /* ensure that iter->k is consistent with iter->pos: */ 2550 2469 bch2_btree_iter_set_pos(iter, iter->pos); 2551 2470 k = bkey_s_c_err(ret); 2552 - goto out_no_locked; 2471 + break; 2553 2472 } 2554 2473 2555 2474 struct btree_path *path = btree_iter_path(trans, iter); 2475 + struct btree_path_level *l = path_l(path); 2556 2476 2557 - k = btree_path_level_peek(trans, path, &path->l[0], &iter->k); 2558 - if (!k.k || 2559 - ((iter->flags & BTREE_ITER_is_extents) 2560 - ? 
bpos_ge(bkey_start_pos(k.k), search_key) 2561 - : bpos_gt(k.k->p, search_key))) 2562 - k = btree_path_level_prev(trans, path, &path->l[0], &iter->k); 2477 + if (unlikely(!l->b)) { 2478 + /* No btree nodes at requested level: */ 2479 + bch2_btree_iter_set_pos(iter, SPOS_MAX); 2480 + k = bkey_s_c_null; 2481 + break; 2482 + } 2483 + 2484 + btree_path_set_should_be_locked(trans, path); 2485 + 2486 + k = btree_path_level_peek_all(trans->c, l, &iter->k); 2487 + if (!k.k || bpos_gt(k.k->p, search_key)) { 2488 + k = btree_path_level_prev(trans, path, l, &iter->k); 2489 + 2490 + BUG_ON(k.k && bpos_gt(k.k->p, search_key)); 2491 + } 2492 + 2493 + if (unlikely(iter->flags & BTREE_ITER_with_key_cache) && 2494 + k.k && 2495 + (k2 = btree_trans_peek_key_cache(iter, k.k->p)).k) { 2496 + k = k2; 2497 + if (bkey_err(k2)) { 2498 + bch2_btree_iter_set_pos(iter, iter->pos); 2499 + break; 2500 + } 2501 + } 2502 + 2503 + if (unlikely(iter->flags & BTREE_ITER_with_journal)) 2504 + btree_trans_peek_prev_journal(trans, iter, &k); 2563 2505 2564 2506 if (unlikely((iter->flags & BTREE_ITER_with_updates) && 2565 2507 trans->nr_updates)) 2566 2508 bch2_btree_trans_peek_prev_updates(trans, iter, &k); 2567 2509 2568 - if (likely(k.k)) { 2569 - if (iter->flags & BTREE_ITER_filter_snapshots) { 2570 - if (k.k->p.snapshot == iter->snapshot) 2571 - goto got_key; 2572 - 2573 - /* 2574 - * If we have a saved candidate, and we're no 2575 - * longer at the same _key_ (not pos), return 2576 - * that candidate 2577 - */ 2578 - if (saved_path && !bkey_eq(k.k->p, saved_k.p)) { 2579 - bch2_path_put_nokeep(trans, iter->path, 2580 - iter->flags & BTREE_ITER_intent); 2581 - iter->path = saved_path; 2582 - saved_path = 0; 2583 - iter->k = saved_k; 2584 - k.v = saved_v; 2585 - goto got_key; 2586 - } 2587 - 2588 - if (bch2_snapshot_is_ancestor(trans->c, 2589 - iter->snapshot, 2590 - k.k->p.snapshot)) { 2591 - if (saved_path) 2592 - bch2_path_put_nokeep(trans, saved_path, 2593 - iter->flags & BTREE_ITER_intent); 2594 
- saved_path = btree_path_clone(trans, iter->path, 2595 - iter->flags & BTREE_ITER_intent, 2596 - _THIS_IP_); 2597 - path = btree_iter_path(trans, iter); 2598 - trace_btree_path_save_pos(trans, path, trans->paths + saved_path); 2599 - saved_k = *k.k; 2600 - saved_v = k.v; 2601 - } 2602 - 2603 - search_key = bpos_predecessor(k.k->p); 2604 - continue; 2605 - } 2606 - got_key: 2607 - if (bkey_whiteout(k.k) && 2608 - !(iter->flags & BTREE_ITER_all_snapshots)) { 2609 - search_key = bkey_predecessor(iter, k.k->p); 2610 - if (iter->flags & BTREE_ITER_filter_snapshots) 2611 - search_key.snapshot = U32_MAX; 2612 - continue; 2613 - } 2614 - 2615 - btree_path_set_should_be_locked(trans, path); 2510 + if (likely(k.k && !bkey_deleted(k.k))) { 2616 2511 break; 2512 + } else if (k.k) { 2513 + search_key = bpos_predecessor(k.k->p); 2617 2514 } else if (likely(!bpos_eq(path->l[0].b->data->min_key, POS_MIN))) { 2618 2515 /* Advance to previous leaf node: */ 2619 2516 search_key = bpos_predecessor(path->l[0].b->data->min_key); ··· 2576 2563 /* Start of btree: */ 2577 2564 bch2_btree_iter_set_pos(iter, POS_MIN); 2578 2565 k = bkey_s_c_null; 2579 - goto out_no_locked; 2566 + break; 2580 2567 } 2581 2568 } 2582 2569 2583 - EBUG_ON(bkey_gt(bkey_start_pos(k.k), iter->pos)); 2570 + bch2_btree_iter_verify(iter); 2571 + return k; 2572 + } 2573 + 2574 + /** 2575 + * bch2_btree_iter_peek_prev_min() - returns first key less than or equal to 2576 + * iterator's current position 2577 + * @iter: iterator to peek from 2578 + * @end: search limit: returns keys greater than or equal to @end 2579 + * 2580 + * Returns: key if found, or an error extractable with bkey_err(). 
2581 + */ 2582 + struct bkey_s_c bch2_btree_iter_peek_prev_min(struct btree_iter *iter, struct bpos end) 2583 + { 2584 + if ((iter->flags & (BTREE_ITER_is_extents|BTREE_ITER_filter_snapshots)) && 2585 + !bkey_eq(iter->pos, POS_MAX)) { 2586 + /* 2587 + * bkey_start_pos(), for extents, is not monotonically 2588 + * increasing until after filtering for snapshots: 2589 + * 2590 + * Thus, for extents we need to search forward until we find a 2591 + * real visible extents - easiest to just use peek_slot() (which 2592 + * internally uses peek() for extents) 2593 + */ 2594 + struct bkey_s_c k = bch2_btree_iter_peek_slot(iter); 2595 + if (bkey_err(k)) 2596 + return k; 2597 + 2598 + if (!bkey_deleted(k.k) && 2599 + (!(iter->flags & BTREE_ITER_is_extents) || 2600 + bkey_lt(bkey_start_pos(k.k), iter->pos))) 2601 + return k; 2602 + } 2603 + 2604 + struct btree_trans *trans = iter->trans; 2605 + struct bpos search_key = iter->pos; 2606 + struct bkey_s_c k; 2607 + btree_path_idx_t saved_path = 0; 2608 + 2609 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 2610 + bch2_btree_iter_verify_entry_exit(iter); 2611 + EBUG_ON((iter->flags & BTREE_ITER_filter_snapshots) && bpos_eq(end, POS_MIN)); 2612 + 2613 + while (1) { 2614 + k = __bch2_btree_iter_peek_prev(iter, search_key); 2615 + if (unlikely(!k.k)) 2616 + goto end; 2617 + if (unlikely(bkey_err(k))) 2618 + goto out_no_locked; 2619 + 2620 + if (iter->flags & BTREE_ITER_filter_snapshots) { 2621 + struct btree_path *s = saved_path ? 
trans->paths + saved_path : NULL; 2622 + if (s && bpos_lt(k.k->p, SPOS(s->pos.inode, s->pos.offset, iter->snapshot))) { 2623 + /* 2624 + * If we have a saved candidate, and we're past 2625 + * the last possible snapshot overwrite, return 2626 + * it: 2627 + */ 2628 + bch2_path_put_nokeep(trans, iter->path, 2629 + iter->flags & BTREE_ITER_intent); 2630 + iter->path = saved_path; 2631 + saved_path = 0; 2632 + k = bch2_btree_path_peek_slot(btree_iter_path(trans, iter), &iter->k); 2633 + break; 2634 + } 2635 + 2636 + /* 2637 + * We need to check against @end before FILTER_SNAPSHOTS because 2638 + * if we get to a different inode that requested we might be 2639 + * seeing keys for a different snapshot tree that will all be 2640 + * filtered out. 2641 + */ 2642 + if (unlikely(bkey_lt(k.k->p, end))) 2643 + goto end; 2644 + 2645 + if (!bch2_snapshot_is_ancestor(trans->c, iter->snapshot, k.k->p.snapshot)) { 2646 + search_key = bpos_predecessor(k.k->p); 2647 + continue; 2648 + } 2649 + 2650 + if (k.k->p.snapshot != iter->snapshot) { 2651 + /* 2652 + * Have a key visible in iter->snapshot, but 2653 + * might have overwrites: - save it and keep 2654 + * searching. Unless it's a whiteout - then drop 2655 + * our previous saved candidate: 2656 + */ 2657 + if (saved_path) { 2658 + bch2_path_put_nokeep(trans, saved_path, 2659 + iter->flags & BTREE_ITER_intent); 2660 + saved_path = 0; 2661 + } 2662 + 2663 + if (!bkey_whiteout(k.k)) { 2664 + saved_path = btree_path_clone(trans, iter->path, 2665 + iter->flags & BTREE_ITER_intent, 2666 + _THIS_IP_); 2667 + trace_btree_path_save_pos(trans, 2668 + trans->paths + iter->path, 2669 + trans->paths + saved_path); 2670 + } 2671 + 2672 + search_key = bpos_predecessor(k.k->p); 2673 + continue; 2674 + } 2675 + 2676 + if (bkey_whiteout(k.k)) { 2677 + search_key = bkey_predecessor(iter, k.k->p); 2678 + search_key.snapshot = U32_MAX; 2679 + continue; 2680 + } 2681 + } 2682 + 2683 + EBUG_ON(iter->flags & BTREE_ITER_all_snapshots ? 
bpos_gt(k.k->p, iter->pos) : 2684 + iter->flags & BTREE_ITER_is_extents ? bkey_ge(bkey_start_pos(k.k), iter->pos) : 2685 + bkey_gt(k.k->p, iter->pos)); 2686 + 2687 + if (unlikely(iter->flags & BTREE_ITER_all_snapshots ? bpos_lt(k.k->p, end) : 2688 + iter->flags & BTREE_ITER_is_extents ? bkey_le(k.k->p, end) : 2689 + bkey_lt(k.k->p, end))) 2690 + goto end; 2691 + 2692 + break; 2693 + } 2584 2694 2585 2695 /* Extents can straddle iter->pos: */ 2586 - if (bkey_lt(k.k->p, iter->pos)) 2587 - iter->pos = k.k->p; 2696 + iter->pos = bpos_min(iter->pos, k.k->p);; 2588 2697 2589 2698 if (iter->flags & BTREE_ITER_filter_snapshots) 2590 2699 iter->pos.snapshot = iter->snapshot; ··· 2716 2581 2717 2582 bch2_btree_iter_verify_entry_exit(iter); 2718 2583 bch2_btree_iter_verify(iter); 2719 - 2720 2584 return k; 2585 + end: 2586 + bch2_btree_iter_set_pos(iter, end); 2587 + k = bkey_s_c_null; 2588 + goto out_no_locked; 2721 2589 } 2722 2590 2723 2591 /** ··· 2745 2607 struct bkey_s_c k; 2746 2608 int ret; 2747 2609 2748 - bch2_trans_verify_not_unlocked(trans); 2610 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 2749 2611 bch2_btree_iter_verify(iter); 2750 2612 bch2_btree_iter_verify_entry_exit(iter); 2751 2613 EBUG_ON(btree_iter_path(trans, iter)->level && (iter->flags & BTREE_ITER_with_key_cache)); ··· 2769 2631 k = bkey_s_c_err(ret); 2770 2632 goto out_no_locked; 2771 2633 } 2634 + 2635 + struct btree_path *path = btree_iter_path(trans, iter); 2636 + if (unlikely(!btree_path_node(path, path->level))) 2637 + return bkey_s_c_null; 2772 2638 2773 2639 if ((iter->flags & BTREE_ITER_cached) || 2774 2640 !(iter->flags & (BTREE_ITER_is_extents|BTREE_ITER_filter_snapshots))) { ··· 2800 2658 k = bch2_btree_path_peek_slot(trans->paths + iter->path, &iter->k); 2801 2659 if (unlikely(!k.k)) 2802 2660 goto out_no_locked; 2661 + 2662 + if (unlikely(k.k->type == KEY_TYPE_whiteout && 2663 + (iter->flags & BTREE_ITER_filter_snapshots) && 2664 + !(iter->flags & BTREE_ITER_key_cache_fill))) 
2665 + iter->k.type = KEY_TYPE_deleted; 2803 2666 } else { 2804 2667 struct bpos next; 2805 2668 struct bpos end = iter->pos; ··· 2818 2671 struct btree_iter iter2; 2819 2672 2820 2673 bch2_trans_copy_iter(&iter2, iter); 2821 - k = bch2_btree_iter_peek_upto(&iter2, end); 2674 + k = bch2_btree_iter_peek_max(&iter2, end); 2822 2675 2823 2676 if (k.k && !bkey_err(k)) { 2824 2677 swap(iter->key_cache_path, iter2.key_cache_path); ··· 2829 2682 } else { 2830 2683 struct bpos pos = iter->pos; 2831 2684 2832 - k = bch2_btree_iter_peek_upto(iter, end); 2685 + k = bch2_btree_iter_peek_max(iter, end); 2833 2686 if (unlikely(bkey_err(k))) 2834 2687 bch2_btree_iter_set_pos(iter, pos); 2835 2688 else ··· 3049 2902 unsigned flags) 3050 2903 { 3051 2904 bch2_trans_iter_init_common(trans, iter, btree_id, pos, 0, 0, 3052 - bch2_btree_iter_flags(trans, btree_id, flags), 2905 + bch2_btree_iter_flags(trans, btree_id, 0, flags), 3053 2906 _RET_IP_); 3054 2907 } 3055 2908 ··· 3065 2918 flags |= BTREE_ITER_snapshot_field; 3066 2919 flags |= BTREE_ITER_all_snapshots; 3067 2920 2921 + if (!depth && btree_id_cached(trans->c, btree_id)) 2922 + flags |= BTREE_ITER_with_key_cache; 2923 + 3068 2924 bch2_trans_iter_init_common(trans, iter, btree_id, pos, locks_want, depth, 3069 - __bch2_btree_iter_flags(trans, btree_id, flags), 2925 + bch2_btree_iter_flags(trans, btree_id, depth, flags), 3070 2926 _RET_IP_); 3071 2927 3072 2928 iter->min_depth = depth; ··· 3272 3122 3273 3123 trans->last_begin_ip = _RET_IP_; 3274 3124 3275 - trans_set_locked(trans); 3125 + trans_set_locked(trans, false); 3276 3126 3277 3127 if (trans->restarted) { 3278 3128 bch2_btree_path_traverse_all(trans); 3279 3129 trans->notrace_relock_fail = false; 3280 3130 } 3281 3131 3282 - bch2_trans_verify_not_unlocked(trans); 3132 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 3283 3133 return trans->restart_count; 3284 3134 } 3285 3135 ··· 3378 3228 trans->srcu_idx = srcu_read_lock(&c->btree_trans_barrier); 3379 3229 
trans->srcu_lock_time = jiffies; 3380 3230 trans->srcu_held = true; 3381 - trans_set_locked(trans); 3231 + trans_set_locked(trans, false); 3382 3232 3383 3233 closure_init_stack_release(&trans->ref); 3384 3234 return trans; ··· 3412 3262 { 3413 3263 struct bch_fs *c = trans->c; 3414 3264 3265 + if (trans->restarted) 3266 + bch2_trans_in_restart_error(trans); 3267 + 3415 3268 bch2_trans_unlock(trans); 3416 3269 3417 3270 trans_for_each_update(trans, i) ··· 3437 3284 */ 3438 3285 closure_return_sync(&trans->ref); 3439 3286 trans->locking_wait.task = NULL; 3287 + 3288 + #ifdef CONFIG_BCACHEFS_DEBUG 3289 + darray_exit(&trans->last_restarted_trace); 3290 + #endif 3440 3291 3441 3292 unsigned long *paths_allocated = trans->paths_allocated; 3442 3293 trans->paths_allocated = NULL; ··· 3495 3338 pid = owner ? owner->pid : 0; 3496 3339 rcu_read_unlock(); 3497 3340 3498 - prt_printf(out, "\t%px %c l=%u %s:", b, b->cached ? 'c' : 'b', 3499 - b->level, bch2_btree_id_str(b->btree_id)); 3341 + prt_printf(out, "\t%px %c ", b, b->cached ? 'c' : 'b'); 3342 + bch2_btree_id_to_text(out, b->btree_id); 3343 + prt_printf(out, " l=%u:", b->level); 3500 3344 bch2_bpos_to_text(out, btree_node_pos(b)); 3501 3345 3502 3346 prt_printf(out, "\t locks %u:%u:%u held by pid %u", ··· 3536 3378 if (!path->nodes_locked) 3537 3379 continue; 3538 3380 3539 - prt_printf(out, " path %u %c l=%u %s:", 3540 - idx, 3541 - path->cached ? 'c' : 'b', 3542 - path->level, 3543 - bch2_btree_id_str(path->btree_id)); 3381 + prt_printf(out, " path %u %c ", 3382 + idx, 3383 + path->cached ? 
'c' : 'b'); 3384 + bch2_btree_id_to_text(out, path->btree_id); 3385 + prt_printf(out, " l=%u:", path->level); 3544 3386 bch2_bpos_to_text(out, path->pos); 3545 3387 prt_newline(out); 3546 3388 ··· 3646 3488 #ifdef CONFIG_LOCKDEP 3647 3489 fs_reclaim_acquire(GFP_KERNEL); 3648 3490 struct btree_trans *trans = bch2_trans_get(c); 3649 - trans_set_locked(trans); 3491 + trans_set_locked(trans, false); 3650 3492 bch2_trans_put(trans); 3651 3493 fs_reclaim_release(GFP_KERNEL); 3652 3494 #endif
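Note on the btree_trans_peek_journal() hunks in the btree_iter.c diff above: the helper used to take a struct bkey_s_c by value and return it, and now updates the caller's key through a pointer. A minimal standalone sketch of that change in shape follows. All demo_* names and types are hypothetical stand-ins invented for illustration, not bcachefs code, and the search position is passed explicitly here rather than derived from the key or btree node as in the real function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real bcachefs structures. */
struct demo_key {
	int pos;
};

/* Loosely modeled on struct bkey_s_c: a small handle passed around by value. */
struct demo_key_ref {
	const struct demo_key *k;
};

/* Pretend journal holding a single key at position 7. */
static const struct demo_key demo_journal_key = { 7 };

/* Returns the journal key at or after pos, or NULL if there is none. */
static const struct demo_key *demo_journal_peek(int pos)
{
	return pos <= demo_journal_key.pos ? &demo_journal_key : NULL;
}

/* Old shape: take the handle by value, return the possibly replaced handle. */
static struct demo_key_ref demo_peek_journal_old(struct demo_key_ref k, int pos)
{
	const struct demo_key *j = demo_journal_peek(pos);

	if (j)
		k.k = j;
	return k;
}

/* New shape: overwrite the caller's handle in place when the journal has a key. */
static void demo_peek_journal_new(struct demo_key_ref *k, int pos)
{
	const struct demo_key *j = demo_journal_peek(pos);

	if (j)
		k->k = j;
}
```

With the new shape the call site becomes `btree_trans_peek_journal(trans, iter, &k);` instead of `k = btree_trans_peek_journal(trans, iter, k);`, which is the change visible in the __bch2_btree_iter_peek hunk.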
+65 -69
fs/bcachefs/btree_iter.h
··· 23 23 { 24 24 unsigned idx = path - trans->paths; 25 25 26 + EBUG_ON(idx >= trans->nr_paths); 26 27 EBUG_ON(!test_bit(idx, trans->paths_allocated)); 27 28 if (unlikely(path->ref == U8_MAX)) { 28 29 bch2_dump_trans_paths_updates(trans); ··· 37 36 38 37 static inline bool __btree_path_put(struct btree_trans *trans, struct btree_path *path, bool intent) 39 38 { 39 + EBUG_ON(path - trans->paths >= trans->nr_paths); 40 40 EBUG_ON(!test_bit(path - trans->paths, trans->paths_allocated)); 41 41 EBUG_ON(!path->ref); 42 42 EBUG_ON(!path->intent_ref && intent); ··· 236 234 btree_path_idx_t, 237 235 unsigned, unsigned long); 238 236 239 - static inline void bch2_trans_verify_not_unlocked(struct btree_trans *); 237 + static inline void bch2_trans_verify_not_unlocked_or_in_restart(struct btree_trans *); 240 238 241 239 static inline int __must_check bch2_btree_path_traverse(struct btree_trans *trans, 242 240 btree_path_idx_t path, unsigned flags) 243 241 { 244 - bch2_trans_verify_not_unlocked(trans); 242 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 245 243 246 244 if (trans->paths[path].uptodate < BTREE_ITER_NEED_RELOCK) 247 245 return 0; ··· 326 324 bch2_trans_restart_error(trans, restart_count); 327 325 } 328 326 329 - void __noreturn bch2_trans_in_restart_error(struct btree_trans *); 327 + void __noreturn bch2_trans_unlocked_or_in_restart_error(struct btree_trans *); 330 328 331 - static inline void bch2_trans_verify_not_in_restart(struct btree_trans *trans) 329 + static inline void bch2_trans_verify_not_unlocked_or_in_restart(struct btree_trans *trans) 332 330 { 333 - if (trans->restarted) 334 - bch2_trans_in_restart_error(trans); 335 - } 336 - 337 - void __noreturn bch2_trans_unlocked_error(struct btree_trans *); 338 - 339 - static inline void bch2_trans_verify_not_unlocked(struct btree_trans *trans) 340 - { 341 - if (!trans->locked) 342 - bch2_trans_unlocked_error(trans); 331 + if (trans->restarted || !trans->locked) 332 + 
bch2_trans_unlocked_or_in_restart_error(trans); 343 333 } 344 334 345 335 __always_inline 346 - static int btree_trans_restart_nounlock(struct btree_trans *trans, int err) 336 + static int btree_trans_restart_ip(struct btree_trans *trans, int err, unsigned long ip) 347 337 { 348 338 BUG_ON(err <= 0); 349 339 BUG_ON(!bch2_err_matches(-err, BCH_ERR_transaction_restart)); 350 340 351 341 trans->restarted = err; 352 - trans->last_restarted_ip = _THIS_IP_; 342 + trans->last_restarted_ip = ip; 343 + #ifdef CONFIG_BCACHEFS_DEBUG 344 + darray_exit(&trans->last_restarted_trace); 345 + bch2_save_backtrace(&trans->last_restarted_trace, current, 0, GFP_NOWAIT); 346 + #endif 353 347 return -err; 354 348 } 355 349 356 350 __always_inline 357 351 static int btree_trans_restart(struct btree_trans *trans, int err) 358 352 { 359 - btree_trans_restart_nounlock(trans, err); 360 - return -err; 353 + return btree_trans_restart_ip(trans, err, _THIS_IP_); 361 354 } 362 355 363 356 bool bch2_btree_node_upgrade(struct btree_trans *, ··· 372 375 void bch2_trans_downgrade(struct btree_trans *); 373 376 374 377 void bch2_trans_node_add(struct btree_trans *trans, struct btree_path *, struct btree *); 378 + void bch2_trans_node_drop(struct btree_trans *trans, struct btree *); 375 379 void bch2_trans_node_reinit_iter(struct btree_trans *, struct btree *); 376 380 377 381 int __must_check __bch2_btree_iter_traverse(struct btree_iter *iter); ··· 382 384 struct btree *bch2_btree_iter_peek_node_and_restart(struct btree_iter *); 383 385 struct btree *bch2_btree_iter_next_node(struct btree_iter *); 384 386 385 - struct bkey_s_c bch2_btree_iter_peek_upto(struct btree_iter *, struct bpos); 387 + struct bkey_s_c bch2_btree_iter_peek_max(struct btree_iter *, struct bpos); 386 388 struct bkey_s_c bch2_btree_iter_next(struct btree_iter *); 387 389 388 390 static inline struct bkey_s_c bch2_btree_iter_peek(struct btree_iter *iter) 389 391 { 390 - return bch2_btree_iter_peek_upto(iter, SPOS_MAX); 392 + return 
bch2_btree_iter_peek_max(iter, SPOS_MAX); 391 393 } 392 394 393 - struct bkey_s_c bch2_btree_iter_peek_prev(struct btree_iter *); 395 + struct bkey_s_c bch2_btree_iter_peek_prev_min(struct btree_iter *, struct bpos); 396 + 397 + static inline struct bkey_s_c bch2_btree_iter_peek_prev(struct btree_iter *iter) 398 + { 399 + return bch2_btree_iter_peek_prev_min(iter, POS_MIN); 400 + } 401 + 394 402 struct bkey_s_c bch2_btree_iter_prev(struct btree_iter *); 395 403 396 404 struct bkey_s_c bch2_btree_iter_peek_slot(struct btree_iter *); ··· 447 443 448 444 void bch2_trans_iter_exit(struct btree_trans *, struct btree_iter *); 449 445 450 - static inline unsigned __bch2_btree_iter_flags(struct btree_trans *trans, 451 - unsigned btree_id, 452 - unsigned flags) 446 + static inline unsigned bch2_btree_iter_flags(struct btree_trans *trans, 447 + unsigned btree_id, 448 + unsigned level, 449 + unsigned flags) 453 450 { 451 + if (level || !btree_id_cached(trans->c, btree_id)) { 452 + flags &= ~BTREE_ITER_cached; 453 + flags &= ~BTREE_ITER_with_key_cache; 454 + } else if (!(flags & BTREE_ITER_cached)) 455 + flags |= BTREE_ITER_with_key_cache; 456 + 454 457 if (!(flags & (BTREE_ITER_all_snapshots|BTREE_ITER_not_extents)) && 455 458 btree_id_is_extents(btree_id)) 456 459 flags |= BTREE_ITER_is_extents; ··· 474 463 flags |= BTREE_ITER_with_journal; 475 464 476 465 return flags; 477 - } 478 - 479 - static inline unsigned bch2_btree_iter_flags(struct btree_trans *trans, 480 - unsigned btree_id, 481 - unsigned flags) 482 - { 483 - if (!btree_id_cached(trans->c, btree_id)) { 484 - flags &= ~BTREE_ITER_cached; 485 - flags &= ~BTREE_ITER_with_key_cache; 486 - } else if (!(flags & BTREE_ITER_cached)) 487 - flags |= BTREE_ITER_with_key_cache; 488 - 489 - return __bch2_btree_iter_flags(trans, btree_id, flags); 490 466 } 491 467 492 468 static inline void bch2_trans_iter_init_common(struct btree_trans *trans, ··· 512 514 if (__builtin_constant_p(btree_id) && 513 515 
 	    __builtin_constant_p(flags))
 		bch2_trans_iter_init_common(trans, iter, btree_id, pos, 0, 0,
-				bch2_btree_iter_flags(trans, btree_id, flags),
+				bch2_btree_iter_flags(trans, btree_id, 0, flags),
 				_THIS_IP_);
 	else
 		bch2_trans_iter_init_outlined(trans, iter, btree_id, pos, flags);
···
 	bkey_s_c_to_##_type(__bch2_bkey_get_iter(_trans, _iter,	\
 				_btree_id, _pos, _flags, KEY_TYPE_##_type))

+static inline void __bkey_val_copy(void *dst_v, unsigned dst_size, struct bkey_s_c src_k)
+{
+	unsigned b = min_t(unsigned, dst_size, bkey_val_bytes(src_k.k));
+	memcpy(dst_v, src_k.v, b);
+	if (unlikely(b < dst_size))
+		memset(dst_v + b, 0, dst_size - b);
+}
+
 #define bkey_val_copy(_dst_v, _src_k)	\
 do {	\
-	unsigned b = min_t(unsigned, sizeof(*_dst_v),	\
-			   bkey_val_bytes(_src_k.k));	\
-	memcpy(_dst_v, _src_k.v, b);	\
-	if (b < sizeof(*_dst_v))	\
-		memset((void *) (_dst_v) + b, 0, sizeof(*_dst_v) - b);	\
+	BUILD_BUG_ON(!__typecheck(*_dst_v, *_src_k.v));	\
+	__bkey_val_copy(_dst_v, sizeof(*_dst_v), _src_k.s_c);	\
 } while (0)

 static inline int __bch2_bkey_get_val_typed(struct btree_trans *trans,
···
 				unsigned val_size, void *val)
 {
 	struct btree_iter iter;
-	struct bkey_s_c k;
-	int ret;
-
-	k = __bch2_bkey_get_iter(trans, &iter, btree_id, pos, flags, type);
-	ret = bkey_err(k);
+	struct bkey_s_c k = __bch2_bkey_get_iter(trans, &iter, btree_id, pos, flags, type);
+	int ret = bkey_err(k);
 	if (!ret) {
-		unsigned b = min_t(unsigned, bkey_val_bytes(k.k), val_size);
-
-		memcpy(val, k.v, b);
-		if (unlikely(b < sizeof(*val)))
-			memset((void *) val + b, 0, sizeof(*val) - b);
+		__bkey_val_copy(val, val_size, k);
 		bch2_trans_iter_exit(trans, &iter);
 	}

···
 	bch2_btree_iter_peek(iter);
 }

-static inline struct bkey_s_c bch2_btree_iter_peek_upto_type(struct btree_iter *iter,
+static inline struct bkey_s_c bch2_btree_iter_peek_max_type(struct btree_iter *iter,
 						struct bpos end,
 						unsigned flags)
 {
 	if (!(flags & BTREE_ITER_slots))
-		return bch2_btree_iter_peek_upto(iter, end);
+		return bch2_btree_iter_peek_max(iter, end);

 	if (bkey_gt(iter->pos, end))
 		return bkey_s_c_null;
···
 	_ret2 ?: trans_was_restarted(_trans, _restart_count);	\
 })

-#define for_each_btree_key_upto_continue(_trans, _iter,	\
+#define for_each_btree_key_max_continue(_trans, _iter,	\
 					 _end, _flags, _k, _do)	\
 ({	\
 	struct bkey_s_c _k;	\
···
 	\
 	do {	\
 		_ret3 = lockrestart_do(_trans, ({	\
-			(_k) = bch2_btree_iter_peek_upto_type(&(_iter),	\
+			(_k) = bch2_btree_iter_peek_max_type(&(_iter),	\
 						_end, (_flags));	\
 			if (!(_k).k)	\
 				break;	\
···
 })

 #define for_each_btree_key_continue(_trans, _iter, _flags, _k, _do)	\
-	for_each_btree_key_upto_continue(_trans, _iter, SPOS_MAX, _flags, _k, _do)
+	for_each_btree_key_max_continue(_trans, _iter, SPOS_MAX, _flags, _k, _do)

-#define for_each_btree_key_upto(_trans, _iter, _btree_id,	\
+#define for_each_btree_key_max(_trans, _iter, _btree_id,	\
 				_start, _end, _flags, _k, _do)	\
 ({	\
 	bch2_trans_begin(trans);	\
···
 	bch2_trans_iter_init((_trans), &(_iter), (_btree_id),	\
 			     (_start), (_flags));	\
 	\
-	for_each_btree_key_upto_continue(_trans, _iter, _end, _flags, _k, _do);\
+	for_each_btree_key_max_continue(_trans, _iter, _end, _flags, _k, _do);\
 })

 #define for_each_btree_key(_trans, _iter, _btree_id,	\
 			   _start, _flags, _k, _do)	\
-	for_each_btree_key_upto(_trans, _iter, _btree_id, _start,	\
+	for_each_btree_key_max(_trans, _iter, _btree_id, _start,	\
 				 SPOS_MAX, _flags, _k, _do)

 #define for_each_btree_key_reverse(_trans, _iter, _btree_id,	\
···
 			    (_do) ?: bch2_trans_commit(_trans, (_disk_res),\
 					(_journal_seq), (_commit_flags)))

-#define for_each_btree_key_upto_commit(_trans, _iter, _btree_id,	\
+#define for_each_btree_key_max_commit(_trans, _iter, _btree_id,	\
 				  _start, _end, _iter_flags, _k,	\
 				  _disk_res, _journal_seq, _commit_flags,\
 				  _do)	\
-	for_each_btree_key_upto(_trans, _iter, _btree_id, _start, _end, _iter_flags, _k,\
+	for_each_btree_key_max(_trans, _iter, _btree_id, _start, _end, _iter_flags, _k,\
 			    (_do) ?: bch2_trans_commit(_trans, (_disk_res),\
 					(_journal_seq), (_commit_flags)))

 struct bkey_s_c bch2_btree_iter_peek_and_restart_outlined(struct btree_iter *);

-#define for_each_btree_key_upto_norestart(_trans, _iter, _btree_id,	\
+#define for_each_btree_key_max_norestart(_trans, _iter, _btree_id,	\
 			   _start, _end, _flags, _k, _ret)	\
 	for (bch2_trans_iter_init((_trans), &(_iter), (_btree_id),	\
 				  (_start), (_flags));	\
-	     (_k) = bch2_btree_iter_peek_upto_type(&(_iter), _end, _flags),\
+	     (_k) = bch2_btree_iter_peek_max_type(&(_iter), _end, _flags),\
 	     !((_ret) = bkey_err(_k)) && (_k).k;	\
 	     bch2_btree_iter_advance(&(_iter)))

-#define for_each_btree_key_upto_continue_norestart(_iter, _end, _flags, _k, _ret)\
+#define for_each_btree_key_max_continue_norestart(_iter, _end, _flags, _k, _ret)\
 	for (;	\
-	     (_k) = bch2_btree_iter_peek_upto_type(&(_iter), _end, _flags),	\
+	     (_k) = bch2_btree_iter_peek_max_type(&(_iter), _end, _flags),	\
 	     !((_ret) = bkey_err(_k)) && (_k).k;	\
 	     bch2_btree_iter_advance(&(_iter)))

 #define for_each_btree_key_norestart(_trans, _iter, _btree_id,	\
 			   _start, _flags, _k, _ret)	\
-	for_each_btree_key_upto_norestart(_trans, _iter, _btree_id, _start,\
+	for_each_btree_key_max_norestart(_trans, _iter, _btree_id, _start,\
 					  SPOS_MAX, _flags, _k, _ret)

 #define for_each_btree_key_reverse_norestart(_trans, _iter, _btree_id,	\
···
 	     bch2_btree_iter_rewind(&(_iter)))

 #define for_each_btree_key_continue_norestart(_iter, _flags, _k, _ret)	\
-	for_each_btree_key_upto_continue_norestart(_iter, SPOS_MAX, _flags, _k, _ret)
+	for_each_btree_key_max_continue_norestart(_iter, SPOS_MAX, _flags, _k, _ret)

 /*
  * This should not be used in a fastpath, without first trying _do in
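The `__bkey_val_copy()` helper added in this file copies at most `dst_size` bytes and zero-fills whatever the source value did not cover. The following is a minimal userspace sketch of that pattern, assuming plain byte buffers in place of `struct bkey_s_c`; the name `val_copy_padded` is hypothetical and not part of bcachefs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of __bkey_val_copy(): copy
 * min(dst_size, src_size) bytes, then zero the tail of dst so that a
 * short source value never leaves stale bytes behind.
 * Returns the number of bytes actually copied from src.
 */
static size_t val_copy_padded(void *dst, size_t dst_size,
			      const void *src, size_t src_size)
{
	size_t b = dst_size < src_size ? dst_size : src_size;

	assert(dst && (src || !src_size));
	if (b)
		memcpy(dst, src, b);
	if (b < dst_size)
		memset((unsigned char *) dst + b, 0, dst_size - b);
	return b;
}
```

Passing the destination size explicitly, rather than taking `sizeof(*val)` of a `void *` as the removed lines did, is what lets one helper serve both the typed macro and `__bch2_bkey_get_val_typed()`.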
+204 -33
fs/bcachefs/btree_journal_iter.c
···
  * operations for the regular btree iter code to use:
  */

+static inline size_t pos_to_idx(struct journal_keys *keys, size_t pos)
+{
+	size_t gap_size = keys->size - keys->nr;
+
+	BUG_ON(pos >= keys->gap && pos < keys->gap + gap_size);
+
+	if (pos >= keys->gap)
+		pos -= gap_size;
+	return pos;
+}
+
 static inline size_t idx_to_pos(struct journal_keys *keys, size_t idx)
 {
 	size_t gap_size = keys->size - keys->nr;
···
 }

 /* Returns first non-overwritten key >= search key: */
-struct bkey_i *bch2_journal_keys_peek_upto(struct bch_fs *c, enum btree_id btree_id,
+struct bkey_i *bch2_journal_keys_peek_max(struct bch_fs *c, enum btree_id btree_id,
 					   unsigned level, struct bpos pos,
 					   struct bpos end_pos, size_t *idx)
 {
···
 		}
 	}

+	struct bkey_i *ret = NULL;
+	rcu_read_lock(); /* for overwritten_ranges */
+
 	while ((k = *idx < keys->nr ? idx_to_key(keys, *idx) : NULL)) {
 		if (__journal_key_cmp(btree_id, level, end_pos, k) < 0)
-			return NULL;
+			break;

 		if (k->overwritten) {
-			(*idx)++;
+			if (k->overwritten_range)
+				*idx = rcu_dereference(k->overwritten_range)->end;
+			else
+				*idx += 1;
 			continue;
 		}

-		if (__journal_key_cmp(btree_id, level, pos, k) <= 0)
-			return k->k;
+		if (__journal_key_cmp(btree_id, level, pos, k) <= 0) {
+			ret = k->k;
+			break;
+		}

+		(*idx)++;
+		iters++;
+		if (iters == 10) {
+			*idx = 0;
+			rcu_read_unlock();
+			goto search;
+		}
+	}
+
+	rcu_read_unlock();
+	return ret;
+}
+
+struct bkey_i *bch2_journal_keys_peek_prev_min(struct bch_fs *c, enum btree_id btree_id,
+					   unsigned level, struct bpos pos,
+					   struct bpos end_pos, size_t *idx)
+{
+	struct journal_keys *keys = &c->journal_keys;
+	unsigned iters = 0;
+	struct journal_key *k;
+
+	BUG_ON(*idx > keys->nr);
+search:
+	if (!*idx)
+		*idx = __bch2_journal_key_search(keys, btree_id, level, pos);
+
+	while (*idx &&
+	       __journal_key_cmp(btree_id, level, end_pos, idx_to_key(keys, *idx - 1)) <= 0) {
 		(*idx)++;
 		iters++;
 		if (iters == 10) {
···
 		}
 	}

-	return NULL;
+	struct bkey_i *ret = NULL;
+	rcu_read_lock(); /* for overwritten_ranges */
+
+	while ((k = *idx < keys->nr ? idx_to_key(keys, *idx) : NULL)) {
+		if (__journal_key_cmp(btree_id, level, end_pos, k) > 0)
+			break;
+
+		if (k->overwritten) {
+			if (k->overwritten_range)
+				*idx = rcu_dereference(k->overwritten_range)->start - 1;
+			else
+				*idx -= 1;
+			continue;
+		}
+
+		if (__journal_key_cmp(btree_id, level, pos, k) >= 0) {
+			ret = k->k;
+			break;
+		}
+
+		--(*idx);
+		iters++;
+		if (iters == 10) {
+			*idx = 0;
+			goto search;
+		}
+	}
+
+	rcu_read_unlock();
+	return ret;
 }

 struct bkey_i *bch2_journal_keys_peek_slot(struct bch_fs *c, enum btree_id btree_id,
···
 {
 	size_t idx = 0;

-	return bch2_journal_keys_peek_upto(c, btree_id, level, pos, pos, &idx);
+	return bch2_journal_keys_peek_max(c, btree_id, level, pos, pos, &idx);
 }

 static void journal_iter_verify(struct journal_iter *iter)
 {
+#ifdef CONFIG_BCACHEFS_DEBUG
 	struct journal_keys *keys = iter->keys;
 	size_t gap_size = keys->size - keys->nr;

···
 	if (iter->idx < keys->size) {
 		struct journal_key *k = keys->data + iter->idx;

-		int cmp = cmp_int(k->btree_id,	iter->btree_id) ?:
-			  cmp_int(k->level,	iter->level);
-		BUG_ON(cmp < 0);
+		int cmp = __journal_key_btree_cmp(iter->btree_id, iter->level, k);
+		BUG_ON(cmp > 0);
 	}
+#endif
 }

 static void journal_iters_fix(struct bch_fs *c)
···
 		 * Ensure these keys are done last by journal replay, to unblock
 		 * journal reclaim:
 		 */
-		.journal_seq	= U32_MAX,
+		.journal_seq	= U64_MAX,
 	};
 	struct journal_keys *keys = &c->journal_keys;
 	size_t idx = bch2_journal_key_search(keys, id, level, k->k.p);
···
 		bkey_deleted(&keys->data[idx].k->k));
 }

+static void __bch2_journal_key_overwritten(struct journal_keys *keys, size_t pos)
+{
+	struct journal_key *k = keys->data + pos;
+	size_t idx = pos_to_idx(keys, pos);
+
+	k->overwritten = true;
+
+	struct journal_key *prev = idx > 0 ? keys->data + idx_to_pos(keys, idx - 1) : NULL;
+	struct journal_key *next = idx + 1 < keys->nr ? keys->data + idx_to_pos(keys, idx + 1) : NULL;
+
+	bool prev_overwritten = prev && prev->overwritten;
+	bool next_overwritten = next && next->overwritten;
+
+	struct journal_key_range_overwritten *prev_range =
+		prev_overwritten ? prev->overwritten_range : NULL;
+	struct journal_key_range_overwritten *next_range =
+		next_overwritten ? next->overwritten_range : NULL;
+
+	BUG_ON(prev_range && prev_range->end != idx);
+	BUG_ON(next_range && next_range->start != idx + 1);
+
+	if (prev_range && next_range) {
+		prev_range->end = next_range->end;
+
+		keys->data[pos].overwritten_range = prev_range;
+		for (size_t i = next_range->start; i < next_range->end; i++) {
+			struct journal_key *ip = keys->data + idx_to_pos(keys, i);
+			BUG_ON(ip->overwritten_range != next_range);
+			ip->overwritten_range = prev_range;
+		}
+
+		kfree_rcu_mightsleep(next_range);
+	} else if (prev_range) {
+		prev_range->end++;
+		k->overwritten_range = prev_range;
+		if (next_overwritten) {
+			prev_range->end++;
+			next->overwritten_range = prev_range;
+		}
+	} else if (next_range) {
+		next_range->start--;
+		k->overwritten_range = next_range;
+		if (prev_overwritten) {
+			next_range->start--;
+			prev->overwritten_range = next_range;
+		}
+	} else if (prev_overwritten || next_overwritten) {
+		struct journal_key_range_overwritten *r = kmalloc(sizeof(*r), GFP_KERNEL);
+		if (!r)
+			return;
+
+		r->start = idx - (size_t) prev_overwritten;
+		r->end = idx + 1 + (size_t) next_overwritten;
+
+		rcu_assign_pointer(k->overwritten_range, r);
+		if (prev_overwritten)
+			prev->overwritten_range = r;
+		if (next_overwritten)
+			next->overwritten_range = r;
+	}
+}
+
 void bch2_journal_key_overwritten(struct bch_fs *c, enum btree_id btree,
 				  unsigned level, struct bpos pos)
 {
···
 	if (idx < keys->size &&
 	    keys->data[idx].btree_id	== btree &&
 	    keys->data[idx].level	== level &&
-	    bpos_eq(keys->data[idx].k->k.p, pos))
-		keys->data[idx].overwritten = true;
+	    bpos_eq(keys->data[idx].k->k.p, pos) &&
+	    !keys->data[idx].overwritten) {
+		mutex_lock(&keys->overwrite_lock);
+		__bch2_journal_key_overwritten(keys, idx);
+		mutex_unlock(&keys->overwrite_lock);
+	}
 }

 static void bch2_journal_iter_advance(struct journal_iter *iter)
···

 static struct bkey_s_c bch2_journal_iter_peek(struct journal_iter *iter)
 {
+	struct bkey_s_c ret = bkey_s_c_null;
+
 	journal_iter_verify(iter);

+	rcu_read_lock();
 	while (iter->idx < iter->keys->size) {
 		struct journal_key *k = iter->keys->data + iter->idx;

-		int cmp = cmp_int(k->btree_id,	iter->btree_id) ?:
-			  cmp_int(k->level,	iter->level);
-		if (cmp > 0)
+		int cmp = __journal_key_btree_cmp(iter->btree_id, iter->level, k);
+		if (cmp < 0)
 			break;
 		BUG_ON(cmp);

-		if (!k->overwritten)
-			return bkey_i_to_s_c(k->k);
+		if (!k->overwritten) {
+			ret = bkey_i_to_s_c(k->k);
+			break;
+		}

-		bch2_journal_iter_advance(iter);
+		if (k->overwritten_range)
+			iter->idx = idx_to_pos(iter->keys, rcu_dereference(k->overwritten_range)->end);
+		else
+			bch2_journal_iter_advance(iter);
 	}
+	rcu_read_unlock();

-	return bkey_s_c_null;
+	return ret;
 }

 static void bch2_journal_iter_exit(struct journal_iter *iter)
···
 		: (level > 1 ? 1 : 16);

 	iter.prefetch = false;
+	iter.fail_if_too_many_whiteouts = true;
 	bch2_bkey_buf_init(&tmp);

 	while (nr--) {
···
 struct bkey_s_c bch2_btree_and_journal_iter_peek(struct btree_and_journal_iter *iter)
 {
 	struct bkey_s_c btree_k, journal_k = bkey_s_c_null, ret;
+	size_t iters = 0;

 	if (iter->prefetch && iter->journal.level)
 		btree_and_journal_iter_prefetch(iter);
 again:
 	if (iter->at_end)
+		return bkey_s_c_null;
+
+	iters++;
+
+	if (iters > 20 && iter->fail_if_too_many_whiteouts)
 		return bkey_s_c_null;

 	while ((btree_k = bch2_journal_iter_peek_btree(iter)).k &&
···

 /* sort and dedup all keys in the journal: */

-void bch2_journal_entries_free(struct bch_fs *c)
-{
-	struct journal_replay **i;
-	struct genradix_iter iter;
-
-	genradix_for_each(&c->journal_entries, iter, i)
-		kvfree(*i);
-	genradix_free(&c->journal_entries);
-}
-
 /*
  * When keys compare equal, oldest compares first:
  */
···

 	move_gap(keys, keys->nr);

-	darray_for_each(*keys, i)
+	darray_for_each(*keys, i) {
+		if (i->overwritten_range &&
+		    (i == &darray_last(*keys) ||
+		     i->overwritten_range != i[1].overwritten_range))
+			kfree(i->overwritten_range);
+
 		if (i->allocated)
 			kfree(i->k);
+	}

 	kvfree(keys->data);
 	keys->data = NULL;
 	keys->nr = keys->gap = keys->size = 0;

-	bch2_journal_entries_free(c);
+	struct journal_replay **i;
+	struct genradix_iter iter;
+
+	genradix_for_each(&c->journal_entries, iter, i)
+		kvfree(*i);
+	genradix_free(&c->journal_entries);
 }

 static void __journal_keys_sort(struct journal_keys *keys)
···

 	darray_for_each(*keys, i) {
 		printbuf_reset(&buf);
+		prt_printf(&buf, "btree=");
+		bch2_btree_id_to_text(&buf, i->btree_id);
+		prt_printf(&buf, " l=%u ", i->level);
 		bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(i->k));
-		pr_err("%s l=%u %s", bch2_btree_id_str(i->btree_id), i->level, buf.buf);
+		pr_err("%s", buf.buf);
 	}
 	printbuf_exit(&buf);
+}
+
+void bch2_fs_journal_keys_init(struct bch_fs *c)
+{
+	struct journal_keys *keys = &c->journal_keys;
+
+	atomic_set(&keys->ref, 1);
+	keys->initial_ref_held = true;
+	mutex_init(&keys->overwrite_lock);
 }
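The forward iteration paths in this file now jump over a whole run of overwritten keys in one step when a `journal_key_range_overwritten` is attached. Below is a simplified userspace sketch of that skip, assuming half-open `[start, end)` ranges (which is what the `prev_range->end != idx` check suggests) and ignoring RCU, the overwrite mutex and the gap buffer. The names `struct range`, `struct key` and `next_live` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for journal_key_range_overwritten: [start, end). */
struct range {
	size_t start, end;
};

/* Hypothetical reduced journal key: only the fields the skip logic reads. */
struct key {
	bool overwritten;
	const struct range *overwritten_range;	/* NULL for an isolated key */
};

/* Return the first non-overwritten index >= idx, or nr if there is none. */
static size_t next_live(const struct key *keys, size_t nr, size_t idx)
{
	while (idx < nr && keys[idx].overwritten) {
		const struct range *r = keys[idx].overwritten_range;

		if (r) {
			/* A key's range must contain it, so end > idx. */
			assert(r->start <= idx && idx < r->end);
			idx = r->end;	/* skip the whole run at once */
		} else {
			idx++;		/* isolated overwritten key */
		}
	}
	return idx;
}
```

With ranges attached, a long run of whiteouts costs one step per range rather than one step per key, which appears to be the point of the change.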
+16 -6
fs/bcachefs/btree_journal_iter.h
···
 	struct bpos		pos;
 	bool			at_end;
 	bool			prefetch;
+	bool			fail_if_too_many_whiteouts;
 };
+
+static inline int __journal_key_btree_cmp(enum btree_id	l_btree_id,
+					  unsigned	l_level,
+					  const struct journal_key *r)
+{
+	return -cmp_int(l_level,	r->level) ?:
+		cmp_int(l_btree_id,	r->btree_id);
+}

 static inline int __journal_key_cmp(enum btree_id	l_btree_id,
 				    unsigned		l_level,
 				    struct bpos		l_pos,
 				    const struct journal_key *r)
 {
-	return (cmp_int(l_btree_id,	r->btree_id) ?:
-		cmp_int(l_level,	r->level) ?:
-		bpos_cmp(l_pos,	r->k->k.p));
+	return __journal_key_btree_cmp(l_btree_id, l_level, r) ?:
+		bpos_cmp(l_pos,	r->k->k.p);
 }

 static inline int journal_key_cmp(const struct journal_key *l, const struct journal_key *r)
···
 	return __journal_key_cmp(l->btree_id, l->level, l->k->k.p, r);
 }

-struct bkey_i *bch2_journal_keys_peek_upto(struct bch_fs *, enum btree_id,
+struct bkey_i *bch2_journal_keys_peek_max(struct bch_fs *, enum btree_id,
+				unsigned, struct bpos, struct bpos, size_t *);
+struct bkey_i *bch2_journal_keys_peek_prev_min(struct bch_fs *, enum btree_id,
 				unsigned, struct bpos, struct bpos, size_t *);
 struct bkey_i *bch2_journal_keys_peek_slot(struct bch_fs *, enum btree_id,
 					   unsigned, struct bpos);
···
 	c->journal_keys.initial_ref_held = false;
 }

-void bch2_journal_entries_free(struct bch_fs *);
-
 int bch2_journal_keys_sort(struct bch_fs *);

 void bch2_shoot_down_journal_keys(struct bch_fs *, enum btree_id,
···
 				  struct bpos, struct bpos);

 void bch2_journal_keys_dump(struct bch_fs *);
+
+void bch2_fs_journal_keys_init(struct bch_fs *);

 #endif /* _BCACHEFS_BTREE_JOURNAL_ITER_H */
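The new `__journal_key_btree_cmp()` changes the sort order of journal keys: the level is now compared first, with its sign flipped so higher levels sort earlier, and the btree ID second. A minimal sketch of that ordering in standard C follows, without the GNU `?:` extension. `struct jkey`, `cmp_u`, `jkey_btree_cmp` and `jkey_cmp` are hypothetical names, and an integer stands in for `struct bpos`.

```c
#include <assert.h>

/* Three-way compare in the spirit of the kernel's cmp_int(): -1, 0 or 1. */
static int cmp_u(unsigned l, unsigned r)
{
	return (l > r) - (l < r);
}

/* Hypothetical reduced key: only the fields the ordering depends on. */
struct jkey {
	unsigned btree_id;
	unsigned level;
	unsigned long long pos;	/* stand-in for struct bpos */
};

static int jkey_btree_cmp(unsigned l_btree_id, unsigned l_level,
			  const struct jkey *r)
{
	int c = -cmp_u(l_level, r->level);	/* higher levels sort first */

	return c ? c : cmp_u(l_btree_id, r->btree_id);
}

static int jkey_cmp(const struct jkey *l, const struct jkey *r)
{
	int c = jkey_btree_cmp(l->btree_id, l->level, r);

	if (c)
		return c;
	return (l->pos > r->pos) - (l->pos < r->pos);
}
```

Under this ordering a level-1 key of any btree compares before every level-0 key, which matches the inverted `cmp` checks in `journal_iter_verify()` and `bch2_journal_iter_peek()`.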
+36
fs/bcachefs/btree_journal_iter_types.h
···
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H
+#define _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H
+
+struct journal_key_range_overwritten {
+	size_t			start, end;
+};
+
+struct journal_key {
+	u64			journal_seq;
+	u32			journal_offset;
+	enum btree_id		btree_id:8;
+	unsigned		level:8;
+	bool			allocated;
+	bool			overwritten;
+	struct journal_key_range_overwritten __rcu *
+				overwritten_range;
+	struct bkey_i		*k;
+};
+
+struct journal_keys {
+	/* must match layout in darray_types.h */
+	size_t			nr, size;
+	struct journal_key	*data;
+	/*
+	 * Gap buffer: instead of all the empty space in the array being at the
+	 * end of the buffer - from @nr to @size - the empty space is at @gap.
+	 * This means that sequential insertions are O(n) instead of O(n^2).
+	 */
+	size_t			gap;
+	atomic_t		ref;
+	bool			initial_ref_held;
+	struct mutex		overwrite_lock;
+};
+
+#endif /* _BCACHEFS_BTREE_JOURNAL_ITER_TYPES_H */
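The gap-buffer comment above implies two index spaces: a logical index that counts only live keys, and a physical position in `data` that also counts the unused slots at `gap`. The sketch below shows the translation in both directions, assuming the same `nr`/`size`/`gap` fields. `struct gap_buf` and the `gb_` names are hypothetical; `gb_pos_to_idx` mirrors the `pos_to_idx()` added in btree_journal_iter.c, and `gb_idx_to_pos` is assumed to be its inverse.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the nr/size/gap fields of struct journal_keys. */
struct gap_buf {
	size_t nr;	/* live elements */
	size_t size;	/* allocated slots */
	size_t gap;	/* slot where the unused region starts */
};

/* Logical index -> physical slot, skipping over the gap. */
static size_t gb_idx_to_pos(const struct gap_buf *g, size_t idx)
{
	size_t gap_size = g->size - g->nr;

	if (idx >= g->gap)
		idx += gap_size;
	return idx;
}

/* Physical slot -> logical index; pos must not lie inside the gap. */
static size_t gb_pos_to_idx(const struct gap_buf *g, size_t pos)
{
	size_t gap_size = g->size - g->nr;

	assert(!(pos >= g->gap && pos < g->gap + gap_size));
	if (pos >= g->gap)
		pos -= gap_size;
	return pos;
}
```

For example, with `nr = 5`, `size = 8` and `gap = 2`, slots 2 to 4 are unused, so logical index 2 lives in slot 5 and logical index 4 in slot 7.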
+55 -20
fs/bcachefs/btree_key_cache.c
···
 	return ck;
 }

-static int btree_key_cache_create(struct btree_trans *trans, struct btree_path *path,
+static int btree_key_cache_create(struct btree_trans *trans,
+				  struct btree_path *path,
+				  struct btree_path *ck_path,
 				  struct bkey_s_c k)
 {
 	struct bch_fs *c = trans->c;
···
 	key_u64s = min(256U, (key_u64s * 3) / 2);
 	key_u64s = roundup_pow_of_two(key_u64s);

-	struct bkey_cached *ck = bkey_cached_alloc(trans, path, key_u64s);
+	struct bkey_cached *ck = bkey_cached_alloc(trans, ck_path, key_u64s);
 	int ret = PTR_ERR_OR_ZERO(ck);
 	if (ret)
 		return ret;
···
 		ck = bkey_cached_reuse(bc);
 		if (unlikely(!ck)) {
 			bch_err(c, "error allocating memory for key cache item, btree %s",
-				bch2_btree_id_str(path->btree_id));
+				bch2_btree_id_str(ck_path->btree_id));
 			return -BCH_ERR_ENOMEM_btree_key_cache_create;
 		}
 	}

 	ck->c.level		= 0;
-	ck->c.btree_id		= path->btree_id;
-	ck->key.btree_id	= path->btree_id;
-	ck->key.pos		= path->pos;
+	ck->c.btree_id		= ck_path->btree_id;
+	ck->key.btree_id	= ck_path->btree_id;
+	ck->key.pos		= ck_path->pos;
 	ck->flags		= 1U << BKEY_CACHED_ACCESSED;

 	if (unlikely(key_u64s > ck->u64s)) {
-		mark_btree_node_locked_noreset(path, 0, BTREE_NODE_UNLOCKED);
+		mark_btree_node_locked_noreset(ck_path, 0, BTREE_NODE_UNLOCKED);

 		struct bkey_i *new_k = allocate_dropping_locks(trans, ret,
 				kmalloc(key_u64s * sizeof(u64), _gfp));
···

 	bkey_reassemble(ck->k, k);

+	ret = bch2_btree_node_lock_write(trans, path, &path_l(path)->b->c);
+	if (unlikely(ret))
+		goto err;
+
 	ret = rhashtable_lookup_insert_fast(&bc->table, &ck->hash, bch2_btree_key_cache_params);
+
+	bch2_btree_node_unlock_write(trans, path, path_l(path)->b);
+
 	if (unlikely(ret)) /* raced with another fill? */
 		goto err;

 	atomic_long_inc(&bc->nr_keys);
 	six_unlock_write(&ck->c.lock);

-	enum six_lock_type lock_want = __btree_lock_want(path, 0);
+	enum six_lock_type lock_want = __btree_lock_want(ck_path, 0);
 	if (lock_want == SIX_LOCK_read)
 		six_lock_downgrade(&ck->c.lock);
-	btree_path_cached_set(trans, path, ck, (enum btree_node_locked_type) lock_want);
-	path->uptodate = BTREE_ITER_UPTODATE;
+	btree_path_cached_set(trans, ck_path, ck, (enum btree_node_locked_type) lock_want);
+	ck_path->uptodate = BTREE_ITER_UPTODATE;
 	return 0;
 err:
 	bkey_cached_free(bc, ck);
-	mark_btree_node_locked_noreset(path, 0, BTREE_NODE_UNLOCKED);
+	mark_btree_node_locked_noreset(ck_path, 0, BTREE_NODE_UNLOCKED);

 	return ret;
 }
···
 					  struct btree_path *ck_path,
 					  unsigned flags)
 {
-	if (flags & BTREE_ITER_cached_nofill) {
-		ck_path->uptodate = BTREE_ITER_UPTODATE;
+	if (flags & BTREE_ITER_cached_nofill)
 		return 0;
-	}

 	struct bch_fs *c = trans->c;
 	struct btree_iter iter;
···
 	int ret;

 	bch2_trans_iter_init(trans, &iter, ck_path->btree_id, ck_path->pos,
+			     BTREE_ITER_intent|
 			     BTREE_ITER_key_cache_fill|
 			     BTREE_ITER_cached_nofill);
 	iter.flags &= ~BTREE_ITER_with_journal;
···
 	if (unlikely(ret))
 		goto out;

-	ret = btree_key_cache_create(trans, ck_path, k);
+	ret = btree_key_cache_create(trans, btree_iter_path(trans, &iter), ck_path, k);
 	if (ret)
 		goto err;
+
+	if (trace_key_cache_fill_enabled()) {
+		struct printbuf buf = PRINTBUF;
+
+		bch2_bpos_to_text(&buf, ck_path->pos);
+		prt_char(&buf, ' ');
+		bch2_bkey_val_to_text(&buf, trans->c, k);
+		trace_key_cache_fill(trans, buf.buf);
+		printbuf_exit(&buf);
+	}
 out:
 	/* We're not likely to need this iterator again: */
 	bch2_set_btree_iter_dontneed(&iter);
···
 	    !test_bit(JOURNAL_space_low, &c->journal.flags))
 		commit_flags |= BCH_TRANS_COMMIT_no_journal_res;

-	ret   = bch2_btree_iter_traverse(&b_iter) ?:
-		bch2_trans_update(trans, &b_iter, ck->k,
+	struct bkey_s_c btree_k = bch2_btree_iter_peek_slot(&b_iter);
+	ret = bkey_err(btree_k);
+	if (ret)
+		goto err;
+
+	/* * Check that we're not violating cache coherency rules: */
+	BUG_ON(bkey_deleted(btree_k.k));
+
+	ret   = bch2_trans_update(trans, &b_iter, ck->k,
 				  BTREE_UPDATE_key_cache_reclaim|
 				  BTREE_UPDATE_internal_snapshot_node|
 				  BTREE_TRIGGER_norun) ?:
···
 				  BCH_TRANS_COMMIT_no_check_rw|
 				  BCH_TRANS_COMMIT_no_enospc|
 				  commit_flags);
-
+err:
 	bch2_fs_fatal_err_on(ret &&
 			     !bch2_err_matches(ret, BCH_ERR_transaction_restart) &&
 			     !bch2_err_matches(ret, BCH_ERR_journal_reclaim_would_deadlock) &&
···
 	bkey_cached_free(bc, ck);

 	mark_btree_node_locked(trans, path, 0, BTREE_NODE_UNLOCKED);
-	btree_path_set_dirty(path, BTREE_ITER_NEED_TRAVERSE);
-	path->should_be_locked = false;
+
+	struct btree_path *path2;
+	unsigned i;
+	trans_for_each_path(trans, path2, i)
+		if (path2->l[0].b == (void *) ck) {
+			__bch2_btree_path_unlock(trans, path2);
+			path2->l[0].b = ERR_PTR(-BCH_ERR_no_btree_node_drop);
+			path2->should_be_locked = false;
+			btree_path_set_dirty(path2, BTREE_ITER_NEED_TRAVERSE);
+		}
+
+	bch2_trans_verify_locks(trans);
 }

 static unsigned long bch2_btree_key_cache_scan(struct shrinker *shrink,
+45 -33
fs/bcachefs/btree_locking.c
···
 		lock_graph_up(g);
 }

+static noinline void lock_graph_pop_from(struct lock_graph *g, struct trans_waiting_for_lock *i)
+{
+	while (g->g + g->nr > i)
+		lock_graph_up(g);
+}
+
 static void __lock_graph_down(struct lock_graph *g, struct btree_trans *trans)
 {
 	g->g[g->nr++] = (struct trans_waiting_for_lock) {
···
 	__lock_graph_down(g, trans);
 }

-static bool lock_graph_remove_non_waiters(struct lock_graph *g)
+static bool lock_graph_remove_non_waiters(struct lock_graph *g,
+					  struct trans_waiting_for_lock *from)
 {
 	struct trans_waiting_for_lock *i;

-	for (i = g->g + 1; i < g->g + g->nr; i++)
+	if (from->trans->locking != from->node_want) {
+		lock_graph_pop_from(g, from);
+		return true;
+	}
+
+	for (i = from + 1; i < g->g + g->nr; i++)
 		if (i->trans->locking != i->node_want ||
 		    i->trans->locking_wait.start_time != i[-1].lock_start_time) {
-			while (g->g + g->nr > i)
-				lock_graph_up(g);
+			lock_graph_pop_from(g, i);
 			return true;
 		}

···
 	return 3;
 }

-static noinline int break_cycle(struct lock_graph *g, struct printbuf *cycle)
+static noinline int break_cycle(struct lock_graph *g, struct printbuf *cycle,
+				struct trans_waiting_for_lock *from)
 {
 	struct trans_waiting_for_lock *i, *abort = NULL;
 	unsigned best = 0, pref;
 	int ret;

-	if (lock_graph_remove_non_waiters(g))
+	if (lock_graph_remove_non_waiters(g, from))
 		return 0;

 	/* Only checking, for debugfs: */
···
 		goto out;
 	}

-	for (i = g->g; i < g->g + g->nr; i++) {
+	for (i = from; i < g->g + g->nr; i++) {
 		pref = btree_trans_abort_preference(i->trans);
 		if (pref > best) {
 			abort = i;
···
 	ret = abort_lock(g, abort);
 out:
 	if (ret)
-		while (g->nr)
-			lock_graph_up(g);
+		lock_graph_pop_all(g);
+	else
+		lock_graph_pop_from(g, abort);
 	return ret;
 }

···
 	for (i = g->g; i < g->g + g->nr; i++)
 		if (i->trans == trans) {
 			closure_put(&trans->ref);
-			return break_cycle(g, cycle);
+			return break_cycle(g, cycle, i);
 		}

 	if (g->nr == ARRAY_SIZE(g->g)) {
···
 		if (orig_trans->lock_may_not_fail)
 			return 0;

-		while (g->nr)
-			lock_graph_up(g);
+		lock_graph_pop_all(g);

 		if (cycle)
 			return 0;
···

 	g.nr = 0;

-	if (trans->lock_must_abort) {
+	if (trans->lock_must_abort && !trans->lock_may_not_fail) {
 		if (cycle)
 			return -1;

···
 			 * structures - which means it can't be blocked
 			 * waiting on a lock:
 			 */
-			if (!lock_graph_remove_non_waiters(&g)) {
+			if (!lock_graph_remove_non_waiters(&g, g.g)) {
 				/*
 				 * If lock_graph_remove_non_waiters()
 				 * didn't do anything, it must be
···
 			       struct btree_path *path, unsigned level)
 {
 	struct btree *b = path->l[level].b;
-	struct six_lock_count count = bch2_btree_node_lock_counts(trans, path, &b->c, level);

 	if (!is_btree_node(path, level))
 		return false;
···
 	if (race_fault())
 		return false;

-	if (btree_node_locked(path, level)) {
-		bool ret;
+	if (btree_node_locked(path, level)
+	    ? six_lock_tryupgrade(&b->c.lock)
+	    : six_relock_type(&b->c.lock, SIX_LOCK_intent, path->l[level].lock_seq))
+		goto success;

-		six_lock_readers_add(&b->c.lock, -count.n[SIX_LOCK_read]);
-		ret = six_lock_tryupgrade(&b->c.lock);
-		six_lock_readers_add(&b->c.lock, count.n[SIX_LOCK_read]);
-
-		if (ret)
-			goto success;
-	} else {
-		if (six_relock_type(&b->c.lock, SIX_LOCK_intent, path->l[level].lock_seq))
-			goto success;
-	}
-
-	/*
-	 * Do we already have an intent lock via another path? If so, just bump
-	 * lock count:
-	 */
 	if (btree_node_lock_seq_matches(path, b, level) &&
 	    btree_node_lock_increment(trans, &b->c, level, BTREE_NODE_INTENT_LOCKED)) {
 		btree_node_unlock(trans, path, level);
···
 			return bch2_trans_relock_fail(trans, path, &f, trace);
 	}

-	trans_set_locked(trans);
+	trans_set_locked(trans, true);
 out:
 	bch2_trans_verify_locks(trans);
 	return 0;
···
 {
 	bch2_trans_unlock(trans);
 	bch2_trans_srcu_unlock(trans);
+}
+
+void bch2_trans_unlock_write(struct btree_trans *trans)
+{
+	struct btree_path *path;
+	unsigned i;
+
+	trans_for_each_path(trans, path, i)
+		for (unsigned l = 0; l < BTREE_MAX_DEPTH; l++)
+			if (btree_node_write_locked(path, l))
+				bch2_btree_node_unlock_write(trans, path, path->l[l].b);
 }

 int __bch2_trans_mutex_lock(struct btree_trans *trans,
···
 		       (want == BTREE_NODE_UNLOCKED ||
 			have != BTREE_NODE_WRITE_LOCKED) &&
 		       want != have);
+
+		BUG_ON(btree_node_locked(path, l) &&
+		       path->l[l].lock_seq != six_lock_seq(&path->l[l].b->c.lock));
 	}
 }

+27 -23
fs/bcachefs/btree_locking.h
···
 void bch2_btree_lock_init(struct btree_bkey_cached_common *, enum six_lock_init_flags);

 void bch2_trans_unlock_noassert(struct btree_trans *);
+void bch2_trans_unlock_write(struct btree_trans *);

 static inline bool is_btree_node(struct btree_path *path, unsigned l)
 {
···
 	path->nodes_locked |= (type + 1) << (level << 1);
 }

-static inline void mark_btree_node_unlocked(struct btree_path *path,
-					    unsigned level)
-{
-	EBUG_ON(btree_node_write_locked(path, level));
-	mark_btree_node_locked_noreset(path, level, BTREE_NODE_UNLOCKED);
-}
-
 static inline void mark_btree_node_locked(struct btree_trans *trans,
 					  struct btree_path *path,
 					  unsigned level,
···

 /* unlock: */

+void bch2_btree_node_unlock_write(struct btree_trans *,
+			struct btree_path *, struct btree *);
+
 static inline void btree_node_unlock(struct btree_trans *trans,
 				     struct btree_path *path, unsigned level)
 {
 	int lock_type = btree_node_locked_type(path, level);

 	EBUG_ON(level >= BTREE_MAX_DEPTH);
-	EBUG_ON(lock_type == BTREE_NODE_WRITE_LOCKED);

 	if (lock_type != BTREE_NODE_UNLOCKED) {
+		if (unlikely(lock_type == BTREE_NODE_WRITE_LOCKED)) {
+			bch2_btree_node_unlock_write(trans, path, path->l[level].b);
+			lock_type = BTREE_NODE_INTENT_LOCKED;
+		}
 		six_unlock_type(&path->l[level].b->c.lock, lock_type);
 		btree_trans_lock_hold_time_update(trans, path, level);
+		mark_btree_node_locked_noreset(path, level, BTREE_NODE_UNLOCKED);
 	}
-	mark_btree_node_unlocked(path, level);
 }

 static inline int btree_path_lowest_level_locked(struct btree_path *path)
···
  * succeed:
  */
 static inline void
+__bch2_btree_node_unlock_write(struct btree_trans *trans, struct btree *b)
+{
+	if (!b->c.lock.write_lock_recurse) {
+		struct btree_path *linked;
+		unsigned i;
+
+		trans_for_each_path_with_node(trans, b, linked, i)
+			linked->l[b->c.level].lock_seq++;
+	}
+
+	six_unlock_write(&b->c.lock);
+}
+
+static inline void
 bch2_btree_node_unlock_write_inlined(struct btree_trans *trans, struct btree_path *path,
 				     struct btree *b)
 {
-	struct btree_path *linked;
-	unsigned i;
-
 	EBUG_ON(path->l[b->c.level].b != b);
 	EBUG_ON(path->l[b->c.level].lock_seq != six_lock_seq(&b->c.lock));
 	EBUG_ON(btree_node_locked_type(path, b->c.level) != SIX_LOCK_write);

 	mark_btree_node_locked_noreset(path, b->c.level, BTREE_NODE_INTENT_LOCKED);
-
-	trans_for_each_path_with_node(trans, b, linked, i)
-		linked->l[b->c.level].lock_seq++;
-
-	six_unlock_write(&b->c.lock);
+	__bch2_btree_node_unlock_write(trans, b);
 }
-
-void bch2_btree_node_unlock_write(struct btree_trans *,
-			struct btree_path *, struct btree *);

 int bch2_six_check_for_deadlock(struct six_lock *lock, void *p);

 /* lock: */

-static inline void trans_set_locked(struct btree_trans *trans)
+static inline void trans_set_locked(struct btree_trans *trans, bool try)
 {
 	if (!trans->locked) {
-		lock_acquire_exclusive(&trans->dep_map, 0, 0, NULL, _THIS_IP_);
+		lock_acquire_exclusive(&trans->dep_map, 0, try, NULL, _THIS_IP_);
 		trans->locked = true;
 		trans->last_unlock_ip = 0;

···
 	int ret = 0;

 	EBUG_ON(level >= BTREE_MAX_DEPTH);
-	bch2_trans_verify_not_unlocked(trans);
+	bch2_trans_verify_not_unlocked_or_in_restart(trans);

 	if (likely(six_trylock_type(&b->lock, type)) ||
 	    btree_node_lock_increment(trans, b, level, (enum btree_node_locked_type) type) ||
+101 -52
fs/bcachefs/btree_node_scan.c
···
 #include "recovery_passes.h"

 #include <linux/kthread.h>
+#include <linux/min_heap.h>
 #include <linux/sort.h>

 struct find_btree_nodes_worker {
···

 static void found_btree_node_to_text(struct printbuf *out, struct bch_fs *c, const struct found_btree_node *n)
 {
-	prt_printf(out, "%s l=%u seq=%u journal_seq=%llu cookie=%llx ",
-		   bch2_btree_id_str(n->btree_id), n->level, n->seq,
-		   n->journal_seq, n->cookie);
+	bch2_btree_id_level_to_text(out, n->btree_id, n->level);
+	prt_printf(out, " seq=%u journal_seq=%llu cookie=%llx ",
+		   n->seq, n->journal_seq, n->cookie);
 	bch2_bpos_to_text(out, n->min_key);
 	prt_str(out, "-");
 	bch2_bpos_to_text(out, n->max_key);

 	if (n->range_updated)
 		prt_str(out, " range updated");
-	if (n->overwritten)
-		prt_str(out, " overwritten");

 	for (unsigned i = 0; i < n->nr_ptrs; i++) {
 		prt_char(out, ' ');
···
 		-found_btree_node_cmp_time(l, r);
 }

+static inline bool found_btree_node_cmp_pos_less(const void *l, const void *r, void *arg)
+{
+	return found_btree_node_cmp_pos(l, r) < 0;
+}
+
+static inline void found_btree_node_swap(void *_l, void *_r, void *arg)
+{
+	struct found_btree_node *l = _l;
+	struct found_btree_node *r = _r;
+
+	swap(*l, *r);
+}
+
+static const struct min_heap_callbacks found_btree_node_heap_cbs = {
+	.less	= found_btree_node_cmp_pos_less,
+	.swp	= found_btree_node_swap,
+};
+
 static void try_read_btree_node(struct find_btree_nodes *f, struct bch_dev *ca,
 				struct bio *bio, struct btree_node *bn, u64 offset)
 {
···
 		return;

 	if (bch2_csum_type_is_encryption(BSET_CSUM_TYPE(&bn->keys))) {
+		if (!c->chacha20)
+			return;
+
 		struct nonce nonce = btree_nonce(&bn->keys, 0);
 		unsigned bytes = (void *) &bn->keys - (void *) &bn->flags;

···
 	return f->ret ?: ret;
 }

-static void bubble_up(struct found_btree_node *n, struct found_btree_node *end)
+static bool nodes_overlap(const struct found_btree_node *l,
+			  const struct found_btree_node *r)
 {
-	while (n + 1 < end &&
-	       found_btree_node_cmp_pos(n, n + 1) > 0) {
-		swap(n[0], n[1]);
-		n++;
-	}
+	return (l->btree_id	== r->btree_id &&
+		l->level	== r->level &&
+		bpos_gt(l->max_key, r->min_key));
 }

 static int handle_overwrites(struct bch_fs *c,
-			     struct found_btree_node *start,
-			     struct found_btree_node *end)
+			     struct found_btree_node *l,
+			     found_btree_nodes *nodes_heap)
 {
-	struct found_btree_node *n;
-again:
-	for (n = start + 1;
-	     n < end &&
-	     n->btree_id	== start->btree_id &&
-	     n->level		== start->level &&
-	     bpos_lt(n->min_key, start->max_key);
-	     n++)  {
-		int cmp = found_btree_node_cmp_time(start, n);
+	struct found_btree_node *r;
+
+	while ((r = min_heap_peek(nodes_heap)) &&
+	       nodes_overlap(l, r)) {
+		int cmp = found_btree_node_cmp_time(l, r);

 		if (cmp > 0) {
-			if (bpos_cmp(start->max_key, n->max_key) >= 0)
-				n->overwritten = true;
+			if (bpos_cmp(l->max_key, r->max_key) >= 0)
+				min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
 			else {
-				n->range_updated = true;
-				n->min_key = bpos_successor(start->max_key);
-				n->range_updated = true;
-				bubble_up(n, end);
-				goto again;
+				r->range_updated = true;
+				r->min_key = bpos_successor(l->max_key);
+				r->range_updated = true;
+				min_heap_sift_down(nodes_heap, 0, &found_btree_node_heap_cbs, NULL);
 			}
 		} else if (cmp < 0) {
-			BUG_ON(bpos_cmp(n->min_key, start->min_key) <= 0);
+			BUG_ON(bpos_eq(l->min_key, r->min_key));

-			start->max_key = bpos_predecessor(n->min_key);
-			start->range_updated = true;
-		} else if (n->level) {
-			n->overwritten = true;
+			l->max_key = bpos_predecessor(r->min_key);
+			l->range_updated = true;
+		} else if (r->level) {
+			min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
 		} else {
-			if (bpos_cmp(start->max_key, n->max_key) >= 0)
-				n->overwritten = true;
+			if (bpos_cmp(l->max_key, r->max_key) >= 0)
+				min_heap_pop(nodes_heap, &found_btree_node_heap_cbs, NULL);
 			else {
-				n->range_updated = true;
-				n->min_key = bpos_successor(start->max_key);
-				n->range_updated = true;
-				bubble_up(n, end);
-				goto again;
+				r->range_updated = true;
+				r->min_key = bpos_successor(l->max_key);
+				r->range_updated = true;
+				min_heap_sift_down(nodes_heap, 0, &found_btree_node_heap_cbs, NULL);
 			}
 		}
 	}
···
 {
 	struct find_btree_nodes *f = &c->found_btree_nodes;
 	struct printbuf buf = PRINTBUF;
+	found_btree_nodes nodes_heap = {};
 	size_t dst;
 	int ret = 0;

···
 		bch2_print_string_as_lines(KERN_INFO, buf.buf);
 	}

-	dst = 0;
-	darray_for_each(f->nodes, i) {
-		if (i->overwritten)
-			continue;
+	swap(nodes_heap, f->nodes);

-		ret = handle_overwrites(c, i, &darray_top(f->nodes));
+	{
+		/* darray must have same layout as a heap */
+		min_heap_char real_heap;
+		BUILD_BUG_ON(sizeof(nodes_heap.nr)			!= sizeof(real_heap.nr));
+		BUILD_BUG_ON(sizeof(nodes_heap.size)			!= sizeof(real_heap.size));
+		BUILD_BUG_ON(offsetof(found_btree_nodes, nr)		!= offsetof(min_heap_char, nr));
+		BUILD_BUG_ON(offsetof(found_btree_nodes, size)	!= offsetof(min_heap_char, size));
+	}
+
+	min_heapify_all(&nodes_heap, &found_btree_node_heap_cbs, NULL);
+
+	if (nodes_heap.nr) {
+		ret = darray_push(&f->nodes, *min_heap_peek(&nodes_heap));
 		if (ret)
 			goto err;

-		BUG_ON(i->overwritten);
-		f->nodes.data[dst++] = *i;
+
min_heap_pop(&nodes_heap, &found_btree_node_heap_cbs, NULL); 434 428 } 435 - f->nodes.nr = dst; 436 429 437 - if (c->opts.verbose) { 430 + while (true) { 431 + ret = handle_overwrites(c, &darray_last(f->nodes), &nodes_heap); 432 + if (ret) 433 + goto err; 434 + 435 + if (!nodes_heap.nr) 436 + break; 437 + 438 + ret = darray_push(&f->nodes, *min_heap_peek(&nodes_heap)); 439 + if (ret) 440 + goto err; 441 + 442 + min_heap_pop(&nodes_heap, &found_btree_node_heap_cbs, NULL); 443 + } 444 + 445 + for (struct found_btree_node *n = f->nodes.data; n < &darray_last(f->nodes); n++) 446 + BUG_ON(nodes_overlap(n, n + 1)); 447 + 448 + if (0 && c->opts.verbose) { 438 449 printbuf_reset(&buf); 439 450 prt_printf(&buf, "%s: nodes found after overwrites:\n", __func__); 440 451 found_btree_nodes_to_text(&buf, c, f->nodes); 441 452 bch2_print_string_as_lines(KERN_INFO, buf.buf); 453 + } else { 454 + bch_info(c, "btree node scan found %zu nodes after overwrites", f->nodes.nr); 442 455 } 443 456 444 457 eytzinger0_sort(f->nodes.data, f->nodes.nr, sizeof(f->nodes.data[0]), found_btree_node_cmp_pos, NULL); 445 458 err: 459 + darray_exit(&nodes_heap); 446 460 printbuf_exit(&buf); 447 461 return ret; 448 462 } ··· 541 499 if (c->opts.verbose) { 542 500 struct printbuf buf = PRINTBUF; 543 501 544 - prt_printf(&buf, "recovering %s l=%u ", bch2_btree_id_str(btree), level); 502 + prt_str(&buf, "recovery "); 503 + bch2_btree_id_level_to_text(&buf, btree, level); 504 + prt_str(&buf, " "); 545 505 bch2_bpos_to_text(&buf, node_min); 546 506 prt_str(&buf, " - "); 547 507 bch2_bpos_to_text(&buf, node_max); ··· 577 533 bch_verbose(c, "%s(): recovering %s", __func__, buf.buf); 578 534 printbuf_exit(&buf); 579 535 580 - BUG_ON(bch2_bkey_validate(c, bkey_i_to_s_c(&tmp.k), BKEY_TYPE_btree, 0)); 536 + BUG_ON(bch2_bkey_validate(c, bkey_i_to_s_c(&tmp.k), 537 + (struct bkey_validate_context) { 538 + .from = BKEY_VALIDATE_btree_node, 539 + .level = level + 1, 540 + .btree = btree, 541 + })); 581 542 582 543 
ret = bch2_journal_key_insert(c, btree, level + 1, &tmp.k); 583 544 if (ret)
-1
fs/bcachefs/btree_node_scan_types.h
@@ -6,7 +6,6 @@
 
 struct found_btree_node {
 	bool			range_updated:1;
-	bool			overwritten:1;
 	u8			btree_id;
 	u8			level;
 	unsigned		sectors_written;
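The btree_node_scan.c change replaces the sorted-array pass (bubble_up() plus the overwritten flag removed here) with a min-heap: the lowest-positioned node is popped into the output, and the heap top is then dropped, trimmed, or used to trim the output node until nothing overlaps. What follows is a minimal userspace sketch of that idea, not the kernel code. The names struct range, range_less(), sift_down(), heapify(), heap_pop() and resolve_overlaps() are hypothetical; positions are plain inclusive integers rather than struct bpos, seq stands in for found_btree_node_cmp_time(), overlap is taken to mean l.max >= r.min, and the equal-age case is folded into the "l wins" branch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct found_btree_node:
 * an inclusive integer range plus a sequence number (higher = newer). */
struct range {
	int		min, max;
	unsigned	seq;
};

/* Order by start position; on ties, newer ranges sort first. */
static int range_less(const struct range *l, const struct range *r)
{
	if (l->min != r->min)
		return l->min < r->min;
	return l->seq > r->seq;
}

/* Restore the heap property below index i. */
static void sift_down(struct range *h, size_t nr, size_t i)
{
	for (;;) {
		size_t c = 2 * i + 1;

		if (c >= nr)
			break;
		if (c + 1 < nr && range_less(&h[c + 1], &h[c]))
			c++;
		if (!range_less(&h[c], &h[i]))
			break;

		struct range tmp = h[i];
		h[i] = h[c];
		h[c] = tmp;
		i = c;
	}
}

static void heapify(struct range *h, size_t nr)
{
	for (size_t i = nr / 2; i-- > 0;)
		sift_down(h, nr, i);
}

/* Remove the minimum element (h[0]). */
static void heap_pop(struct range *h, size_t *nr)
{
	h[0] = h[--*nr];
	sift_down(h, *nr, 0);
}

/*
 * Consumes heap[0..nr) and writes non-overlapping ranges to out[],
 * which must have room for nr entries. Returns the number written.
 */
static size_t resolve_overlaps(struct range *heap, size_t nr, struct range *out)
{
	size_t nr_out = 0;

	heapify(heap, nr);

	while (nr) {
		struct range *l = &out[nr_out++];

		*l = heap[0];
		heap_pop(heap, &nr);

		while (nr && l->max >= heap[0].min) {
			struct range *r = &heap[0];

			if (l->seq >= r->seq) {
				/* l wins: drop r entirely, or trim its start */
				if (l->max >= r->max) {
					heap_pop(heap, &nr);
				} else {
					r->min = l->max + 1;
					sift_down(heap, nr, 0);
				}
			} else {
				/* r is newer: l must end just before r begins */
				assert(l->min < r->min);
				l->max = r->min - 1;
			}
		}
	}
	return nr_out;
}
```

Because only the heap top is ever modified, a single sift-down restores ordering after a trim. The removed bubble_up() walked the array linearly for the same purpose and then restarted the scan via goto again.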
+79 -134
fs/bcachefs/btree_trans_commit.c
··· 133 133 return 0; 134 134 } 135 135 136 - static inline void bch2_trans_unlock_write(struct btree_trans *trans) 136 + static inline void bch2_trans_unlock_updates_write(struct btree_trans *trans) 137 137 { 138 138 if (likely(trans->write_locked)) { 139 139 trans_for_each_update(trans, i) ··· 249 249 new |= 1 << BTREE_NODE_need_write; 250 250 } while (!try_cmpxchg(&b->flags, &old, new)); 251 251 252 - btree_node_write_if_need(c, b, SIX_LOCK_read); 252 + btree_node_write_if_need(trans, b, SIX_LOCK_read); 253 253 six_unlock_read(&b->c.lock); 254 254 255 255 bch2_trans_put(trans); ··· 384 384 struct bkey_i *new_k; 385 385 int ret; 386 386 387 - bch2_trans_unlock_write(trans); 387 + bch2_trans_unlock_updates_write(trans); 388 388 bch2_trans_unlock(trans); 389 389 390 390 new_k = kmalloc(new_u64s * sizeof(u64), GFP_KERNEL); ··· 479 479 old, flags); 480 480 } 481 481 482 - static int run_one_trans_trigger(struct btree_trans *trans, struct btree_insert_entry *i, 483 - bool overwrite) 482 + static int run_one_trans_trigger(struct btree_trans *trans, struct btree_insert_entry *i) 484 483 { 485 484 verify_update_old_key(trans, i); 486 485 ··· 506 507 return bch2_key_trigger(trans, i->btree_id, i->level, old, bkey_i_to_s(i->k), 507 508 BTREE_TRIGGER_insert| 508 509 BTREE_TRIGGER_overwrite|flags) ?: 1; 509 - } else if (overwrite && !i->overwrite_trigger_run) { 510 + } else if (!i->overwrite_trigger_run) { 510 511 i->overwrite_trigger_run = true; 511 512 return bch2_key_trigger_old(trans, i->btree_id, i->level, old, flags) ?: 1; 512 - } else if (!overwrite && !i->insert_trigger_run) { 513 + } else if (!i->insert_trigger_run) { 513 514 i->insert_trigger_run = true; 514 515 return bch2_key_trigger_new(trans, i->btree_id, i->level, bkey_i_to_s(i->k), flags) ?: 1; 515 516 } else { ··· 518 519 } 519 520 520 521 static int run_btree_triggers(struct btree_trans *trans, enum btree_id btree_id, 521 - unsigned btree_id_start) 522 + unsigned *btree_id_updates_start) 522 523 { 523 - for 
(int overwrite = 1; overwrite >= 0; --overwrite) { 524 - bool trans_trigger_run; 524 + bool trans_trigger_run; 525 525 526 - /* 527 - * Running triggers will append more updates to the list of updates as 528 - * we're walking it: 529 - */ 530 - do { 531 - trans_trigger_run = false; 526 + /* 527 + * Running triggers will append more updates to the list of updates as 528 + * we're walking it: 529 + */ 530 + do { 531 + trans_trigger_run = false; 532 532 533 - for (unsigned i = btree_id_start; 534 - i < trans->nr_updates && trans->updates[i].btree_id <= btree_id; 535 - i++) { 536 - if (trans->updates[i].btree_id != btree_id) 537 - continue; 538 - 539 - int ret = run_one_trans_trigger(trans, trans->updates + i, overwrite); 540 - if (ret < 0) 541 - return ret; 542 - if (ret) 543 - trans_trigger_run = true; 533 + for (unsigned i = *btree_id_updates_start; 534 + i < trans->nr_updates && trans->updates[i].btree_id <= btree_id; 535 + i++) { 536 + if (trans->updates[i].btree_id < btree_id) { 537 + *btree_id_updates_start = i; 538 + continue; 544 539 } 545 - } while (trans_trigger_run); 546 - } 540 + 541 + int ret = run_one_trans_trigger(trans, trans->updates + i); 542 + if (ret < 0) 543 + return ret; 544 + if (ret) 545 + trans_trigger_run = true; 546 + } 547 + } while (trans_trigger_run); 548 + 549 + trans_for_each_update(trans, i) 550 + BUG_ON(!(i->flags & BTREE_TRIGGER_norun) && 551 + i->btree_id == btree_id && 552 + btree_node_type_has_trans_triggers(i->bkey_type) && 553 + (!i->insert_trigger_run || !i->overwrite_trigger_run)); 547 554 548 555 return 0; 549 556 } 550 557 551 558 static int bch2_trans_commit_run_triggers(struct btree_trans *trans) 552 559 { 553 - unsigned btree_id = 0, btree_id_start = 0; 560 + unsigned btree_id = 0, btree_id_updates_start = 0; 554 561 int ret = 0; 555 562 556 563 /* ··· 570 565 if (btree_id == BTREE_ID_alloc) 571 566 continue; 572 567 573 - while (btree_id_start < trans->nr_updates && 574 - trans->updates[btree_id_start].btree_id < 
btree_id) 575 - btree_id_start++; 576 - 577 - ret = run_btree_triggers(trans, btree_id, btree_id_start); 568 + ret = run_btree_triggers(trans, btree_id, &btree_id_updates_start); 578 569 if (ret) 579 570 return ret; 580 571 } 581 572 582 - for (unsigned idx = 0; idx < trans->nr_updates; idx++) { 583 - struct btree_insert_entry *i = trans->updates + idx; 584 - 585 - if (i->btree_id > BTREE_ID_alloc) 586 - break; 587 - if (i->btree_id == BTREE_ID_alloc) { 588 - ret = run_btree_triggers(trans, BTREE_ID_alloc, idx); 589 - if (ret) 590 - return ret; 591 - break; 592 - } 593 - } 573 + btree_id_updates_start = 0; 574 + ret = run_btree_triggers(trans, BTREE_ID_alloc, &btree_id_updates_start); 575 + if (ret) 576 + return ret; 594 577 595 578 #ifdef CONFIG_BCACHEFS_DEBUG 596 579 trans_for_each_update(trans, i) ··· 602 609 return 0; 603 610 } 604 611 605 - static struct bversion journal_pos_to_bversion(struct journal_res *res, unsigned offset) 606 - { 607 - return (struct bversion) { 608 - .hi = res->seq >> 32, 609 - .lo = (res->seq << 32) | (res->offset + offset), 610 - }; 611 - } 612 - 613 612 static inline int 614 613 bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags, 615 614 struct btree_insert_entry **stopped_at, ··· 612 627 unsigned u64s = 0; 613 628 int ret = 0; 614 629 615 - bch2_trans_verify_not_unlocked(trans); 616 - bch2_trans_verify_not_in_restart(trans); 630 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 617 631 618 632 if (race_fault()) { 619 633 trace_and_count(c, trans_restart_fault_inject, trans, trace_ip); 620 - return btree_trans_restart_nounlock(trans, BCH_ERR_transaction_restart_fault_inject); 634 + return btree_trans_restart(trans, BCH_ERR_transaction_restart_fault_inject); 621 635 } 622 636 623 637 /* ··· 685 701 struct jset_entry *entry = trans->journal_entries; 686 702 687 703 percpu_down_read(&c->mark_lock); 688 - 689 704 for (entry = trans->journal_entries; 690 705 entry != (void *) ((u64 *) trans->journal_entries + 
trans->journal_entries_u64s); 691 706 entry = vstruct_next(entry)) 692 707 if (entry->type == BCH_JSET_ENTRY_write_buffer_keys && 693 708 entry->start->k.type == KEY_TYPE_accounting) { 694 - BUG_ON(!trans->journal_res.ref); 695 - 696 - struct bkey_i_accounting *a = bkey_i_to_accounting(entry->start); 697 - 698 - a->k.bversion = journal_pos_to_bversion(&trans->journal_res, 699 - (u64 *) entry - (u64 *) trans->journal_entries); 700 - BUG_ON(bversion_zero(a->k.bversion)); 701 - 702 - if (likely(!(flags & BCH_TRANS_COMMIT_skip_accounting_apply))) { 703 - ret = bch2_accounting_mem_mod_locked(trans, accounting_i_to_s_c(a), BCH_ACCOUNTING_normal); 704 - if (ret) 705 - goto revert_fs_usage; 706 - } 709 + ret = bch2_accounting_trans_commit_hook(trans, bkey_i_to_accounting(entry->start), flags); 710 + if (ret) 711 + goto revert_fs_usage; 707 712 } 708 713 percpu_up_read(&c->mark_lock); 709 714 ··· 712 739 goto fatal_err; 713 740 } 714 741 742 + struct bkey_validate_context validate_context = { .from = BKEY_VALIDATE_commit }; 743 + 744 + if (!(flags & BCH_TRANS_COMMIT_no_journal_res)) 745 + validate_context.flags = BCH_VALIDATE_write|BCH_VALIDATE_commit; 746 + 747 + for (struct jset_entry *i = trans->journal_entries; 748 + i != (void *) ((u64 *) trans->journal_entries + trans->journal_entries_u64s); 749 + i = vstruct_next(i)) { 750 + ret = bch2_journal_entry_validate(c, NULL, i, 751 + bcachefs_metadata_version_current, 752 + CPU_BIG_ENDIAN, validate_context); 753 + if (unlikely(ret)) { 754 + bch2_trans_inconsistent(trans, "invalid journal entry on insert from %s\n", 755 + trans->fn); 756 + goto fatal_err; 757 + } 758 + } 759 + 715 760 trans_for_each_update(trans, i) { 716 - enum bch_validate_flags invalid_flags = 0; 761 + validate_context.level = i->level; 762 + validate_context.btree = i->btree_id; 717 763 718 - if (!(flags & BCH_TRANS_COMMIT_no_journal_res)) 719 - invalid_flags |= BCH_VALIDATE_write|BCH_VALIDATE_commit; 720 - 721 - ret = bch2_bkey_validate(c, 
bkey_i_to_s_c(i->k), 722 - i->bkey_type, invalid_flags); 764 + ret = bch2_bkey_validate(c, bkey_i_to_s_c(i->k), validate_context); 723 765 if (unlikely(ret)){ 724 766 bch2_trans_inconsistent(trans, "invalid bkey on insert from %s -> %ps\n", 725 767 trans->fn, (void *) i->ip_allocated); 726 768 goto fatal_err; 727 769 } 728 770 btree_insert_entry_checks(trans, i); 729 - } 730 - 731 - for (struct jset_entry *i = trans->journal_entries; 732 - i != (void *) ((u64 *) trans->journal_entries + trans->journal_entries_u64s); 733 - i = vstruct_next(i)) { 734 - enum bch_validate_flags invalid_flags = 0; 735 - 736 - if (!(flags & BCH_TRANS_COMMIT_no_journal_res)) 737 - invalid_flags |= BCH_VALIDATE_write|BCH_VALIDATE_commit; 738 - 739 - ret = bch2_journal_entry_validate(c, NULL, i, 740 - bcachefs_metadata_version_current, 741 - CPU_BIG_ENDIAN, invalid_flags); 742 - if (unlikely(ret)) { 743 - bch2_trans_inconsistent(trans, "invalid journal entry on insert from %s\n", 744 - trans->fn); 745 - goto fatal_err; 746 - } 747 771 } 748 772 749 773 if (likely(!(flags & BCH_TRANS_COMMIT_no_journal_res))) { ··· 803 833 entry2 != entry; 804 834 entry2 = vstruct_next(entry2)) 805 835 if (entry2->type == BCH_JSET_ENTRY_write_buffer_keys && 806 - entry2->start->k.type == KEY_TYPE_accounting) { 807 - struct bkey_s_accounting a = bkey_i_to_s_accounting(entry2->start); 808 - 809 - bch2_accounting_neg(a); 810 - bch2_accounting_mem_mod_locked(trans, a.c, BCH_ACCOUNTING_normal); 811 - bch2_accounting_neg(a); 812 - } 836 + entry2->start->k.type == KEY_TYPE_accounting) 837 + bch2_accounting_trans_commit_revert(trans, 838 + bkey_i_to_accounting(entry2->start), flags); 813 839 percpu_up_read(&c->mark_lock); 814 840 return ret; 815 841 } ··· 868 902 if (!ret && unlikely(trans->journal_replay_not_finished)) 869 903 bch2_drop_overwrites_from_journal(trans); 870 904 871 - bch2_trans_unlock_write(trans); 905 + bch2_trans_unlock_updates_write(trans); 872 906 873 907 if (!ret && trans->journal_pin) 874 908 
bch2_journal_pin_add(&c->journal, trans->journal_res.seq, ··· 960 994 return ret; 961 995 } 962 996 963 - static noinline int 964 - bch2_trans_commit_get_rw_cold(struct btree_trans *trans, unsigned flags) 965 - { 966 - struct bch_fs *c = trans->c; 967 - int ret; 968 - 969 - if (likely(!(flags & BCH_TRANS_COMMIT_lazy_rw)) || 970 - test_bit(BCH_FS_started, &c->flags)) 971 - return -BCH_ERR_erofs_trans_commit; 972 - 973 - ret = drop_locks_do(trans, bch2_fs_read_write_early(c)); 974 - if (ret) 975 - return ret; 976 - 977 - bch2_write_ref_get(c, BCH_WRITE_REF_trans); 978 - return 0; 979 - } 980 - 981 997 /* 982 998 * This is for updates done in the early part of fsck - btree_gc - before we've 983 999 * gone RW. we only add the new key to the list of keys for journal replay to ··· 969 1021 do_bch2_trans_commit_to_journal_replay(struct btree_trans *trans) 970 1022 { 971 1023 struct bch_fs *c = trans->c; 1024 + 1025 + BUG_ON(current != c->recovery_task); 972 1026 973 1027 trans_for_each_update(trans, i) { 974 1028 int ret = bch2_journal_key_insert(c, i->btree_id, i->level, i->k); ··· 997 1047 struct bch_fs *c = trans->c; 998 1048 int ret = 0; 999 1049 1000 - bch2_trans_verify_not_unlocked(trans); 1001 - bch2_trans_verify_not_in_restart(trans); 1050 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 1002 1051 1003 1052 if (!trans->nr_updates && 1004 1053 !trans->journal_entries_u64s) ··· 1007 1058 if (ret) 1008 1059 goto out_reset; 1009 1060 1010 - if (unlikely(!test_bit(BCH_FS_may_go_rw, &c->flags))) { 1011 - ret = do_bch2_trans_commit_to_journal_replay(trans); 1012 - goto out_reset; 1013 - } 1014 - 1015 1061 if (!(flags & BCH_TRANS_COMMIT_no_check_rw) && 1016 1062 unlikely(!bch2_write_ref_tryget(c, BCH_WRITE_REF_trans))) { 1017 - ret = bch2_trans_commit_get_rw_cold(trans, flags); 1018 - if (ret) 1019 - goto out_reset; 1063 + if (unlikely(!test_bit(BCH_FS_may_go_rw, &c->flags))) 1064 + ret = do_bch2_trans_commit_to_journal_replay(trans); 1065 + else 1066 + ret = 
-BCH_ERR_erofs_trans_commit; 1067 + goto out_reset; 1020 1068 } 1021 1069 1022 1070 EBUG_ON(test_bit(BCH_FS_clean_shutdown, &c->flags)); ··· 1058 1112 } 1059 1113 retry: 1060 1114 errored_at = NULL; 1061 - bch2_trans_verify_not_unlocked(trans); 1062 - bch2_trans_verify_not_in_restart(trans); 1115 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 1063 1116 if (likely(!(flags & BCH_TRANS_COMMIT_no_journal_res))) 1064 1117 memset(&trans->journal_res, 0, sizeof(trans->journal_res)); 1065 1118 memset(&trans->fs_usage_delta, 0, sizeof(trans->fs_usage_delta));
+38 -24
fs/bcachefs/btree_types.h
··· 513 513 u64 last_begin_time; 514 514 unsigned long last_begin_ip; 515 515 unsigned long last_restarted_ip; 516 + #ifdef CONFIG_BCACHEFS_DEBUG 517 + bch_stacktrace last_restarted_trace; 518 + #endif 516 519 unsigned long last_unlock_ip; 517 520 unsigned long srcu_lock_time; 518 521 ··· 790 787 return BIT_ULL(type) & BTREE_NODE_TYPE_HAS_TRIGGERS; 791 788 } 792 789 793 - static inline bool btree_node_type_is_extents(enum btree_node_type type) 794 - { 795 - const u64 mask = 0 796 - #define x(name, nr, flags, ...) |((!!((flags) & BTREE_ID_EXTENTS)) << (nr + 1)) 797 - BCH_BTREE_IDS() 798 - #undef x 799 - ; 800 - 801 - return BIT_ULL(type) & mask; 802 - } 803 - 804 790 static inline bool btree_id_is_extents(enum btree_id btree) 805 791 { 806 - return btree_node_type_is_extents(__btree_node_type(0, btree)); 807 - } 808 - 809 - static inline bool btree_type_has_snapshots(enum btree_id id) 810 - { 811 792 const u64 mask = 0 812 - #define x(name, nr, flags, ...) |((!!((flags) & BTREE_ID_SNAPSHOTS)) << nr) 793 + #define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_extents)) << nr) 813 794 BCH_BTREE_IDS() 814 795 #undef x 815 796 ; 816 797 817 - return BIT_ULL(id) & mask; 798 + return BIT_ULL(btree) & mask; 818 799 } 819 800 820 - static inline bool btree_type_has_snapshot_field(enum btree_id id) 801 + static inline bool btree_node_type_is_extents(enum btree_node_type type) 802 + { 803 + return type != BKEY_TYPE_btree && btree_id_is_extents(type - 1); 804 + } 805 + 806 + static inline bool btree_type_has_snapshots(enum btree_id btree) 821 807 { 822 808 const u64 mask = 0 823 - #define x(name, nr, flags, ...) |((!!((flags) & (BTREE_ID_SNAPSHOT_FIELD|BTREE_ID_SNAPSHOTS))) << nr) 809 + #define x(name, nr, flags, ...) 
|((!!((flags) & BTREE_IS_snapshots)) << nr) 824 810 BCH_BTREE_IDS() 825 811 #undef x 826 812 ; 827 813 828 - return BIT_ULL(id) & mask; 814 + return BIT_ULL(btree) & mask; 829 815 } 830 816 831 - static inline bool btree_type_has_ptrs(enum btree_id id) 817 + static inline bool btree_type_has_snapshot_field(enum btree_id btree) 832 818 { 833 819 const u64 mask = 0 834 - #define x(name, nr, flags, ...) |((!!((flags) & BTREE_ID_DATA)) << nr) 820 + #define x(name, nr, flags, ...) |((!!((flags) & (BTREE_IS_snapshot_field|BTREE_IS_snapshots))) << nr) 835 821 BCH_BTREE_IDS() 836 822 #undef x 837 823 ; 838 824 839 - return BIT_ULL(id) & mask; 825 + return BIT_ULL(btree) & mask; 826 + } 827 + 828 + static inline bool btree_type_has_ptrs(enum btree_id btree) 829 + { 830 + const u64 mask = 0 831 + #define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_data)) << nr) 832 + BCH_BTREE_IDS() 833 + #undef x 834 + ; 835 + 836 + return BIT_ULL(btree) & mask; 837 + } 838 + 839 + static inline bool btree_type_uses_write_buffer(enum btree_id btree) 840 + { 841 + const u64 mask = 0 842 + #define x(name, nr, flags, ...) |((!!((flags) & BTREE_IS_write_buffer)) << nr) 843 + BCH_BTREE_IDS() 844 + #undef x 845 + ; 846 + 847 + return BIT_ULL(btree) & mask; 840 848 } 841 849 842 850 struct btree_root {
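The btree_types.h helpers in this hunk all build a compile-time bitmask by expanding BCH_BTREE_IDS() with a local x() macro, then test one bit of it. The sketch below shows the same x-macro pattern in standalone C. It is an illustration under assumptions: the DEMO_* flags, the DEMO_BTREE_IDS() table and demo_btree_id_is_extents() are hypothetical and do not reflect the real bcachefs btree list, and a plain uint64_t shift is used in place of the kernel's BIT_ULL().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-btree flags, loosely modelled on BTREE_IS_*. */
enum {
	DEMO_IS_extents		= 1 << 0,
	DEMO_IS_snapshots	= 1 << 1,
};

/* Hypothetical table: x(name, id number, flags). Not the real list. */
#define DEMO_BTREE_IDS()					\
	x(extents,	0, DEMO_IS_extents|DEMO_IS_snapshots)	\
	x(inodes,	1, DEMO_IS_snapshots)			\
	x(alloc,	2, 0)					\
	x(reflink,	3, DEMO_IS_extents)

/* First expansion: generate the enum of btree IDs. */
enum demo_btree_id {
#define x(name, nr, flags) DEMO_BTREE_ID_##name = nr,
	DEMO_BTREE_IDS()
#undef x
	DEMO_BTREE_ID_NR
};

/* Second expansion: OR together one bit per btree that has the flag set. */
static inline bool demo_btree_id_is_extents(enum demo_btree_id btree)
{
	const uint64_t mask = 0
#define x(name, nr, flags) |((uint64_t) !!((flags) & DEMO_IS_extents) << (nr))
	DEMO_BTREE_IDS()
#undef x
	;

	return (UINT64_C(1) << btree) & mask;
}
```

The mask is a constant expression, so the compiler can reduce each predicate to a single bit test. Adding a row to the table updates every predicate without touching the helper functions.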
+38 -36
fs/bcachefs/btree_update.c
··· 144 144 !(ret = bkey_err(old_k)) && 145 145 bkey_eq(old_pos, old_k.k->p)) { 146 146 struct bpos whiteout_pos = 147 - SPOS(new_pos.inode, new_pos.offset, old_k.k->p.snapshot);; 147 + SPOS(new_pos.inode, new_pos.offset, old_k.k->p.snapshot); 148 148 149 149 if (!bch2_snapshot_is_ancestor(c, old_k.k->p.snapshot, old_pos.snapshot) || 150 150 snapshot_list_has_ancestor(c, &s, old_k.k->p.snapshot)) ··· 296 296 BTREE_ITER_intent| 297 297 BTREE_ITER_with_updates| 298 298 BTREE_ITER_not_extents); 299 - k = bch2_btree_iter_peek_upto(&iter, POS(insert->k.p.inode, U64_MAX)); 299 + k = bch2_btree_iter_peek_max(&iter, POS(insert->k.p.inode, U64_MAX)); 300 300 if ((ret = bkey_err(k))) 301 301 goto err; 302 302 if (!k.k) ··· 323 323 goto out; 324 324 next: 325 325 bch2_btree_iter_advance(&iter); 326 - k = bch2_btree_iter_peek_upto(&iter, POS(insert->k.p.inode, U64_MAX)); 326 + k = bch2_btree_iter_peek_max(&iter, POS(insert->k.p.inode, U64_MAX)); 327 327 if ((ret = bkey_err(k))) 328 328 goto err; 329 329 if (!k.k) ··· 588 588 int bch2_bkey_get_empty_slot(struct btree_trans *trans, struct btree_iter *iter, 589 589 enum btree_id btree, struct bpos end) 590 590 { 591 - struct bkey_s_c k; 592 - int ret = 0; 593 - 594 - bch2_trans_iter_init(trans, iter, btree, POS_MAX, BTREE_ITER_intent); 595 - k = bch2_btree_iter_prev(iter); 596 - ret = bkey_err(k); 591 + bch2_trans_iter_init(trans, iter, btree, end, BTREE_ITER_intent); 592 + struct bkey_s_c k = bch2_btree_iter_peek_prev(iter); 593 + int ret = bkey_err(k); 597 594 if (ret) 598 595 goto err; 599 596 ··· 669 672 bch2_btree_insert_trans(trans, id, k, iter_flags)); 670 673 } 671 674 672 - int bch2_btree_delete_extent_at(struct btree_trans *trans, struct btree_iter *iter, 673 - unsigned len, unsigned update_flags) 674 - { 675 - struct bkey_i *k; 676 - 677 - k = bch2_trans_kmalloc(trans, sizeof(*k)); 678 - if (IS_ERR(k)) 679 - return PTR_ERR(k); 680 - 681 - bkey_init(&k->k); 682 - k->k.p = iter->pos; 683 - bch2_key_resize(&k->k, len); 
684 - return bch2_trans_update(trans, iter, k, update_flags); 685 - } 686 - 687 675 int bch2_btree_delete_at(struct btree_trans *trans, 688 676 struct btree_iter *iter, unsigned update_flags) 689 677 { 690 - return bch2_btree_delete_extent_at(trans, iter, 0, update_flags); 678 + struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k)); 679 + int ret = PTR_ERR_OR_ZERO(k); 680 + if (ret) 681 + return ret; 682 + 683 + bkey_init(&k->k); 684 + k->k.p = iter->pos; 685 + return bch2_trans_update(trans, iter, k, update_flags); 691 686 } 692 687 693 688 int bch2_btree_delete(struct btree_trans *trans, ··· 710 721 int ret = 0; 711 722 712 723 bch2_trans_iter_init(trans, &iter, id, start, BTREE_ITER_intent); 713 - while ((k = bch2_btree_iter_peek_upto(&iter, end)).k) { 724 + while ((k = bch2_btree_iter_peek_max(&iter, end)).k) { 714 725 struct disk_reservation disk_res = 715 726 bch2_disk_reservation_init(trans->c, 0); 716 727 struct bkey_i delete; ··· 783 794 return ret; 784 795 } 785 796 786 - int bch2_btree_bit_mod(struct btree_trans *trans, enum btree_id btree, 787 - struct bpos pos, bool set) 797 + int bch2_btree_bit_mod_iter(struct btree_trans *trans, struct btree_iter *iter, bool set) 788 798 { 789 799 struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k)); 790 800 int ret = PTR_ERR_OR_ZERO(k); ··· 792 804 793 805 bkey_init(&k->k); 794 806 k->k.type = set ? 
KEY_TYPE_set : KEY_TYPE_deleted; 795 - k->k.p = pos; 807 + k->k.p = iter->pos; 808 + if (iter->flags & BTREE_ITER_is_extents) 809 + bch2_key_resize(&k->k, 1); 796 810 811 + return bch2_trans_update(trans, iter, k, 0); 812 + } 813 + 814 + int bch2_btree_bit_mod(struct btree_trans *trans, enum btree_id btree, 815 + struct bpos pos, bool set) 816 + { 797 817 struct btree_iter iter; 798 818 bch2_trans_iter_init(trans, &iter, btree, pos, BTREE_ITER_intent); 799 819 800 - ret = bch2_btree_iter_traverse(&iter) ?: 801 - bch2_trans_update(trans, &iter, k, 0); 820 + int ret = bch2_btree_iter_traverse(&iter) ?: 821 + bch2_btree_bit_mod_iter(trans, &iter, set); 802 822 bch2_trans_iter_exit(trans, &iter); 803 823 return ret; 804 824 } ··· 823 827 return bch2_trans_update_buffered(trans, btree, &k); 824 828 } 825 829 826 - static int __bch2_trans_log_msg(struct btree_trans *trans, struct printbuf *buf, unsigned u64s) 830 + int bch2_trans_log_msg(struct btree_trans *trans, struct printbuf *buf) 827 831 { 832 + unsigned u64s = DIV_ROUND_UP(buf->pos, sizeof(u64)); 833 + prt_chars(buf, '\0', u64s * sizeof(u64) - buf->pos); 834 + 835 + int ret = buf->allocation_failure ? -BCH_ERR_ENOMEM_trans_log_msg : 0; 836 + if (ret) 837 + return ret; 838 + 828 839 struct jset_entry *e = bch2_trans_jset_entry_alloc(trans, jset_u64s(u64s)); 829 - int ret = PTR_ERR_OR_ZERO(e); 840 + ret = PTR_ERR_OR_ZERO(e); 830 841 if (ret) 831 842 return ret; 832 843 ··· 868 865 memcpy(l->d, buf.buf, buf.pos); 869 866 c->journal.early_journal_entries.nr += jset_u64s(u64s); 870 867 } else { 871 - ret = bch2_trans_commit_do(c, NULL, NULL, 872 - BCH_TRANS_COMMIT_lazy_rw|commit_flags, 873 - __bch2_trans_log_msg(trans, &buf, u64s)); 868 + ret = bch2_trans_commit_do(c, NULL, NULL, commit_flags, 869 + bch2_trans_log_msg(trans, &buf)); 874 870 } 875 871 err: 876 872 printbuf_exit(&buf);
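In this hunk bch2_trans_log_msg() now computes the message length in u64s itself (DIV_ROUND_UP(buf->pos, sizeof(u64))) and zero-pads the printbuf up to that boundary before allocating the journal entry. The arithmetic can be sketched in userspace as below. This is an assumption-laden illustration: bytes_to_u64s() and pad_to_u64s() are hypothetical helpers operating on a plain char buffer, not the printbuf API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of 64-bit words needed to hold len bytes (rounds up). */
static size_t bytes_to_u64s(size_t len)
{
	return (len + sizeof(uint64_t) - 1) / sizeof(uint64_t);
}

/*
 * Zero-pad msg (len bytes used, cap bytes available) up to the next
 * u64 boundary and store the padded length, in u64s, in *u64s.
 * Returns false, leaving the buffer untouched, if cap is too small.
 */
static bool pad_to_u64s(char *msg, size_t len, size_t cap, size_t *u64s)
{
	size_t n = bytes_to_u64s(len);
	size_t padded = n * sizeof(uint64_t);

	if (padded > cap)
		return false;

	memset(msg + len, 0, padded - len);
	*u64s = n;
	return true;
}
```

For a 9-byte message the helper pads 7 zero bytes and reports 2 u64s. A message that is already a multiple of 8 bytes is left unchanged.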
+17 -12
fs/bcachefs/btree_update.h
··· 24 24 #define BCH_TRANS_COMMIT_FLAGS() \ 25 25 x(no_enospc, "don't check for enospc") \ 26 26 x(no_check_rw, "don't attempt to take a ref on c->writes") \ 27 - x(lazy_rw, "go read-write if we haven't yet - only for use in recovery") \ 28 27 x(no_journal_res, "don't take a journal reservation, instead " \ 29 28 "pin journal entry referred to by trans->journal_res.seq") \ 30 29 x(journal_reclaim, "operation required for journal reclaim; may return error" \ ··· 46 47 47 48 void bch2_trans_commit_flags_to_text(struct printbuf *, enum bch_trans_commit_flags); 48 49 49 - int bch2_btree_delete_extent_at(struct btree_trans *, struct btree_iter *, 50 - unsigned, unsigned); 51 50 int bch2_btree_delete_at(struct btree_trans *, struct btree_iter *, unsigned); 52 51 int bch2_btree_delete(struct btree_trans *, enum btree_id, struct bpos, unsigned); 53 52 ··· 63 66 int bch2_btree_delete_range(struct bch_fs *, enum btree_id, 64 67 struct bpos, struct bpos, unsigned, u64 *); 65 68 69 + int bch2_btree_bit_mod_iter(struct btree_trans *, struct btree_iter *, bool); 66 70 int bch2_btree_bit_mod(struct btree_trans *, enum btree_id, struct bpos, bool); 67 71 int bch2_btree_bit_mod_buffered(struct btree_trans *, enum btree_id, struct bpos, bool); 68 72 ··· 159 161 struct btree_trans_commit_hook *); 160 162 int __bch2_trans_commit(struct btree_trans *, unsigned); 161 163 164 + int bch2_trans_log_msg(struct btree_trans *, struct printbuf *); 162 165 __printf(2, 3) int bch2_fs_log_msg(struct bch_fs *, const char *, ...); 163 166 __printf(2, 3) int bch2_journal_log_msg(struct bch_fs *, const char *, ...); 164 167 ··· 243 244 KEY_TYPE_##_type, sizeof(struct bkey_i_##_type))) 244 245 245 246 static inline struct bkey_i *__bch2_bkey_make_mut(struct btree_trans *trans, struct btree_iter *iter, 246 - struct bkey_s_c *k, unsigned flags, 247 + struct bkey_s_c *k, 248 + enum btree_iter_update_trigger_flags flags, 247 249 unsigned type, unsigned min_bytes) 248 250 { 249 251 struct bkey_i *mut = 
__bch2_bkey_make_mut_noupdate(trans, *k, type, min_bytes); ··· 261 261 return mut; 262 262 } 263 263 264 - static inline struct bkey_i *bch2_bkey_make_mut(struct btree_trans *trans, struct btree_iter *iter, 265 - struct bkey_s_c *k, unsigned flags) 264 + static inline struct bkey_i *bch2_bkey_make_mut(struct btree_trans *trans, 265 + struct btree_iter *iter, struct bkey_s_c *k, 266 + enum btree_iter_update_trigger_flags flags) 266 267 { 267 268 return __bch2_bkey_make_mut(trans, iter, k, flags, 0, 0); 268 269 } ··· 275 274 static inline struct bkey_i *__bch2_bkey_get_mut_noupdate(struct btree_trans *trans, 276 275 struct btree_iter *iter, 277 276 unsigned btree_id, struct bpos pos, 278 - unsigned flags, unsigned type, unsigned min_bytes) 277 + enum btree_iter_update_trigger_flags flags, 278 + unsigned type, unsigned min_bytes) 279 279 { 280 280 struct bkey_s_c k = __bch2_bkey_get_iter(trans, iter, 281 281 btree_id, pos, flags|BTREE_ITER_intent, type); ··· 291 289 static inline struct bkey_i *bch2_bkey_get_mut_noupdate(struct btree_trans *trans, 292 290 struct btree_iter *iter, 293 291 unsigned btree_id, struct bpos pos, 294 - unsigned flags) 292 + enum btree_iter_update_trigger_flags flags) 295 293 { 296 294 return __bch2_bkey_get_mut_noupdate(trans, iter, btree_id, pos, flags, 0, 0); 297 295 } ··· 299 297 static inline struct bkey_i *__bch2_bkey_get_mut(struct btree_trans *trans, 300 298 struct btree_iter *iter, 301 299 unsigned btree_id, struct bpos pos, 302 - unsigned flags, unsigned type, unsigned min_bytes) 300 + enum btree_iter_update_trigger_flags flags, 301 + unsigned type, unsigned min_bytes) 303 302 { 304 303 struct bkey_i *mut = __bch2_bkey_get_mut_noupdate(trans, iter, 305 304 btree_id, pos, flags|BTREE_ITER_intent, type, min_bytes); ··· 321 318 static inline struct bkey_i *bch2_bkey_get_mut_minsize(struct btree_trans *trans, 322 319 struct btree_iter *iter, 323 320 unsigned btree_id, struct bpos pos, 324 - unsigned flags, unsigned min_bytes) 321 + enum 
btree_iter_update_trigger_flags flags, 322 + unsigned min_bytes) 325 323 { 326 324 return __bch2_bkey_get_mut(trans, iter, btree_id, pos, flags, 0, min_bytes); 327 325 } ··· 330 326 static inline struct bkey_i *bch2_bkey_get_mut(struct btree_trans *trans, 331 327 struct btree_iter *iter, 332 328 unsigned btree_id, struct bpos pos, 333 - unsigned flags) 329 + enum btree_iter_update_trigger_flags flags) 334 330 { 335 331 return __bch2_bkey_get_mut(trans, iter, btree_id, pos, flags, 0, 0); 336 332 } ··· 341 337 KEY_TYPE_##_type, sizeof(struct bkey_i_##_type))) 342 338 343 339 static inline struct bkey_i *__bch2_bkey_alloc(struct btree_trans *trans, struct btree_iter *iter, 344 - unsigned flags, unsigned type, unsigned val_size) 340 + enum btree_iter_update_trigger_flags flags, 341 + unsigned type, unsigned val_size) 345 342 { 346 343 struct bkey_i *k = bch2_trans_kmalloc(trans, sizeof(*k) + val_size); 347 344 int ret;
+159 -136
fs/bcachefs/btree_update_interior.c
··· 58 58 !bpos_eq(bkey_i_to_btree_ptr_v2(&b->key)->v.min_key, 59 59 b->data->min_key)); 60 60 61 + bch2_bkey_buf_init(&prev); 62 + bkey_init(&prev.k->k); 63 + bch2_btree_and_journal_iter_init_node_iter(trans, &iter, b); 64 + 61 65 if (b == btree_node_root(c, b)) { 62 66 if (!bpos_eq(b->data->min_key, POS_MIN)) { 63 67 printbuf_reset(&buf); 64 68 bch2_bpos_to_text(&buf, b->data->min_key); 65 - need_fsck_err(trans, btree_root_bad_min_key, 69 + log_fsck_err(trans, btree_root_bad_min_key, 66 70 "btree root with incorrect min_key: %s", buf.buf); 67 71 goto topology_repair; 68 72 } ··· 74 70 if (!bpos_eq(b->data->max_key, SPOS_MAX)) { 75 71 printbuf_reset(&buf); 76 72 bch2_bpos_to_text(&buf, b->data->max_key); 77 - need_fsck_err(trans, btree_root_bad_max_key, 73 + log_fsck_err(trans, btree_root_bad_max_key, 78 74 "btree root with incorrect max_key: %s", buf.buf); 79 75 goto topology_repair; 80 76 } 81 77 } 82 78 83 79 if (!b->c.level) 84 - return 0; 85 - 86 - bch2_bkey_buf_init(&prev); 87 - bkey_init(&prev.k->k); 88 - bch2_btree_and_journal_iter_init_node_iter(trans, &iter, b); 80 + goto out; 89 81 90 82 while ((k = bch2_btree_and_journal_iter_peek(&iter)).k) { 91 83 if (k.k->type != KEY_TYPE_btree_ptr_v2) ··· 97 97 bch2_topology_error(c); 98 98 99 99 printbuf_reset(&buf); 100 - prt_str(&buf, "end of prev node doesn't match start of next node\n"), 101 - prt_printf(&buf, " in btree %s level %u node ", 102 - bch2_btree_id_str(b->c.btree_id), b->c.level); 100 + prt_str(&buf, "end of prev node doesn't match start of next node\n in "); 101 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level); 102 + prt_str(&buf, " node "); 103 103 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 104 104 prt_str(&buf, "\n prev "); 105 105 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(prev.k)); 106 106 prt_str(&buf, "\n next "); 107 107 bch2_bkey_val_to_text(&buf, c, k); 108 108 109 - need_fsck_err(trans, btree_node_topology_bad_min_key, "%s", buf.buf); 109 + 
log_fsck_err(trans, btree_node_topology_bad_min_key, "%s", buf.buf); 110 110 goto topology_repair; 111 111 } 112 112 ··· 118 118 bch2_topology_error(c); 119 119 120 120 printbuf_reset(&buf); 121 - prt_str(&buf, "empty interior node\n"); 122 - prt_printf(&buf, " in btree %s level %u node ", 123 - bch2_btree_id_str(b->c.btree_id), b->c.level); 121 + prt_str(&buf, "empty interior node\n in "); 122 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level); 123 + prt_str(&buf, " node "); 124 124 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 125 125 126 - need_fsck_err(trans, btree_node_topology_empty_interior_node, "%s", buf.buf); 126 + log_fsck_err(trans, btree_node_topology_empty_interior_node, "%s", buf.buf); 127 127 goto topology_repair; 128 128 } else if (!bpos_eq(prev.k->k.p, b->key.k.p)) { 129 129 bch2_topology_error(c); 130 130 131 131 printbuf_reset(&buf); 132 - prt_str(&buf, "last child node doesn't end at end of parent node\n"); 133 - prt_printf(&buf, " in btree %s level %u node ", 134 - bch2_btree_id_str(b->c.btree_id), b->c.level); 132 + prt_str(&buf, "last child node doesn't end at end of parent node\n in "); 133 + bch2_btree_id_level_to_text(&buf, b->c.btree_id, b->c.level); 134 + prt_str(&buf, " node "); 135 135 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 136 136 prt_str(&buf, "\n last key "); 137 137 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(prev.k)); 138 138 139 - need_fsck_err(trans, btree_node_topology_bad_max_key, "%s", buf.buf); 139 + log_fsck_err(trans, btree_node_topology_bad_max_key, "%s", buf.buf); 140 140 goto topology_repair; 141 141 } 142 142 out: ··· 146 146 printbuf_exit(&buf); 147 147 return ret; 148 148 topology_repair: 149 - if ((c->opts.recovery_passes & BIT_ULL(BCH_RECOVERY_PASS_check_topology)) && 150 - c->curr_recovery_pass > BCH_RECOVERY_PASS_check_topology) { 151 - bch2_inconsistent_error(c); 152 - ret = -BCH_ERR_btree_need_topology_repair; 153 - } else { 154 - ret = bch2_run_explicit_recovery_pass(c, 
BCH_RECOVERY_PASS_check_topology); 155 - } 149 + ret = bch2_topology_error(c); 156 150 goto out; 157 151 } 158 152 ··· 238 244 struct btree *b) 239 245 { 240 246 struct bch_fs *c = trans->c; 241 - unsigned i, level = b->c.level; 242 247 243 248 bch2_btree_node_lock_write_nofail(trans, path, &b->c); 244 249 ··· 248 255 mutex_unlock(&c->btree_cache.lock); 249 256 250 257 six_unlock_write(&b->c.lock); 251 - mark_btree_node_locked_noreset(path, level, BTREE_NODE_INTENT_LOCKED); 258 + mark_btree_node_locked_noreset(path, b->c.level, BTREE_NODE_INTENT_LOCKED); 252 259 253 - trans_for_each_path(trans, path, i) 254 - if (path->l[level].b == b) { 255 - btree_node_unlock(trans, path, level); 256 - path->l[level].b = ERR_PTR(-BCH_ERR_no_btree_node_init); 257 - } 260 + bch2_trans_node_drop(trans, b); 258 261 } 259 262 260 263 static void bch2_btree_node_free_never_used(struct btree_update *as, ··· 259 270 { 260 271 struct bch_fs *c = as->c; 261 272 struct prealloc_nodes *p = &as->prealloc_nodes[b->c.lock.readers != NULL]; 262 - struct btree_path *path; 263 - unsigned i, level = b->c.level; 264 273 265 274 BUG_ON(!list_empty(&b->write_blocked)); 266 275 BUG_ON(b->will_make_reachable != (1UL|(unsigned long) as)); ··· 280 293 281 294 six_unlock_intent(&b->c.lock); 282 295 283 - trans_for_each_path(trans, path, i) 284 - if (path->l[level].b == b) { 285 - btree_node_unlock(trans, path, level); 286 - path->l[level].b = ERR_PTR(-BCH_ERR_no_btree_node_init); 287 - } 296 + bch2_trans_node_drop(trans, b); 288 297 } 289 298 290 299 static struct btree *__bch2_btree_node_alloc(struct btree_trans *trans, ··· 792 809 mark_btree_node_locked_noreset(path, b->c.level, BTREE_NODE_INTENT_LOCKED); 793 810 six_unlock_write(&b->c.lock); 794 811 795 - btree_node_write_if_need(c, b, SIX_LOCK_intent); 812 + btree_node_write_if_need(trans, b, SIX_LOCK_intent); 796 813 btree_node_unlock(trans, path, b->c.level); 797 814 bch2_path_put(trans, path_idx, true); 798 815 } ··· 813 830 b = as->new_nodes[i]; 
814 831 815 832 btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_read); 816 - btree_node_write_if_need(c, b, SIX_LOCK_read); 833 + btree_node_write_if_need(trans, b, SIX_LOCK_read); 817 834 six_unlock_read(&b->c.lock); 818 835 } 819 836 ··· 1349 1366 if (unlikely(!test_bit(JOURNAL_replay_done, &c->journal.flags))) 1350 1367 bch2_journal_key_overwritten(c, b->c.btree_id, b->c.level, insert->k.p); 1351 1368 1352 - if (bch2_bkey_validate(c, bkey_i_to_s_c(insert), 1353 - btree_node_type(b), BCH_VALIDATE_write) ?: 1354 - bch2_bkey_in_btree_node(c, b, bkey_i_to_s_c(insert), BCH_VALIDATE_write)) { 1369 + struct bkey_validate_context from = (struct bkey_validate_context) { 1370 + .from = BKEY_VALIDATE_btree_node, 1371 + .level = b->c.level, 1372 + .btree = b->c.btree_id, 1373 + .flags = BCH_VALIDATE_commit, 1374 + }; 1375 + if (bch2_bkey_validate(c, bkey_i_to_s_c(insert), from) ?: 1376 + bch2_bkey_in_btree_node(c, b, bkey_i_to_s_c(insert), from)) { 1355 1377 bch2_fs_inconsistent(c, "%s: inserting invalid bkey", __func__); 1356 1378 dump_stack(); 1357 1379 } ··· 1406 1418 (bkey_cmp_left_packed(b, k, &insert->k.p) >= 0)) 1407 1419 ; 1408 1420 1409 - while (!bch2_keylist_empty(keys)) { 1410 - insert = bch2_keylist_front(keys); 1411 - 1412 - if (bpos_gt(insert->k.p, b->key.k.p)) 1413 - break; 1414 - 1421 + for (; 1422 + insert != keys->top && bpos_le(insert->k.p, b->key.k.p); 1423 + insert = bkey_next(insert)) 1415 1424 bch2_insert_fixup_btree_ptr(as, trans, path, b, &node_iter, insert); 1416 - bch2_keylist_pop_front(keys); 1425 + 1426 + if (bch2_btree_node_check_topology(trans, b)) { 1427 + struct printbuf buf = PRINTBUF; 1428 + 1429 + for (struct bkey_i *k = keys->keys; 1430 + k != insert; 1431 + k = bkey_next(k)) { 1432 + bch2_bkey_val_to_text(&buf, trans->c, bkey_i_to_s_c(k)); 1433 + prt_newline(&buf); 1434 + } 1435 + 1436 + panic("%s(): check_topology error: inserted keys\n%s", __func__, buf.buf); 1417 1437 } 1438 + 1439 + memmove_u64s_down(keys->keys, insert, 
keys->top_p - insert->_data); 1440 + keys->top_p -= insert->_data - keys->keys_p; 1418 1441 } 1419 1442 1420 1443 static bool key_deleted_in_insert(struct keylist *insert_keys, struct bpos pos) ··· 1574 1575 bch2_btree_node_iter_init(&node_iter, b, &bch2_keylist_front(keys)->k.p); 1575 1576 1576 1577 bch2_btree_insert_keys_interior(as, trans, path, b, node_iter, keys); 1577 - 1578 - BUG_ON(bch2_btree_node_check_topology(trans, b)); 1579 1578 } 1580 1579 } 1581 1580 ··· 1595 1598 ret = bch2_btree_node_check_topology(trans, b); 1596 1599 if (ret) 1597 1600 return ret; 1598 - 1599 - bch2_btree_interior_update_will_free_node(as, b); 1600 1601 1601 1602 if (b->nr.live_u64s > BTREE_SPLIT_THRESHOLD(c)) { 1602 1603 struct btree *n[2]; ··· 1694 1699 if (ret) 1695 1700 goto err; 1696 1701 1702 + bch2_btree_interior_update_will_free_node(as, b); 1703 + 1697 1704 if (n3) { 1698 1705 bch2_btree_update_get_open_buckets(as, n3); 1699 - bch2_btree_node_write(c, n3, SIX_LOCK_intent, 0); 1706 + bch2_btree_node_write_trans(trans, n3, SIX_LOCK_intent, 0); 1700 1707 } 1701 1708 if (n2) { 1702 1709 bch2_btree_update_get_open_buckets(as, n2); 1703 - bch2_btree_node_write(c, n2, SIX_LOCK_intent, 0); 1710 + bch2_btree_node_write_trans(trans, n2, SIX_LOCK_intent, 0); 1704 1711 } 1705 1712 bch2_btree_update_get_open_buckets(as, n1); 1706 - bch2_btree_node_write(c, n1, SIX_LOCK_intent, 0); 1713 + bch2_btree_node_write_trans(trans, n1, SIX_LOCK_intent, 0); 1707 1714 1708 1715 /* 1709 1716 * The old node must be freed (in memory) _before_ unlocking the new ··· 1824 1827 1825 1828 btree_update_updated_node(as, b); 1826 1829 bch2_btree_node_unlock_write(trans, path, b); 1827 - 1828 - BUG_ON(bch2_btree_node_check_topology(trans, b)); 1829 1830 return 0; 1830 1831 split: 1831 1832 /* ··· 1900 1905 BUG_ON(ret); 1901 1906 1902 1907 bch2_btree_update_get_open_buckets(as, n); 1903 - bch2_btree_node_write(c, n, SIX_LOCK_intent, 0); 1908 + bch2_btree_node_write_trans(trans, n, SIX_LOCK_intent, 0); 1904 
1909 bch2_trans_node_add(trans, path, n); 1905 1910 six_unlock_intent(&n->c.lock); 1906 1911 ··· 1948 1953 u64 start_time = local_clock(); 1949 1954 int ret = 0; 1950 1955 1951 - bch2_trans_verify_not_in_restart(trans); 1952 - bch2_trans_verify_not_unlocked(trans); 1956 + bch2_trans_verify_not_unlocked_or_in_restart(trans); 1953 1957 BUG_ON(!trans->paths[path].should_be_locked); 1954 1958 BUG_ON(!btree_node_locked(&trans->paths[path], level)); 1955 1959 ··· 2052 2058 2053 2059 trace_and_count(c, btree_node_merge, trans, b); 2054 2060 2055 - bch2_btree_interior_update_will_free_node(as, b); 2056 - bch2_btree_interior_update_will_free_node(as, m); 2057 - 2058 2061 n = bch2_btree_node_alloc(as, trans, b->c.level); 2059 2062 2060 2063 SET_BTREE_NODE_SEQ(n->data, ··· 2087 2096 if (ret) 2088 2097 goto err_free_update; 2089 2098 2099 + bch2_btree_interior_update_will_free_node(as, b); 2100 + bch2_btree_interior_update_will_free_node(as, m); 2101 + 2090 2102 bch2_trans_verify_paths(trans); 2091 2103 2092 2104 bch2_btree_update_get_open_buckets(as, n); 2093 - bch2_btree_node_write(c, n, SIX_LOCK_intent, 0); 2105 + bch2_btree_node_write_trans(trans, n, SIX_LOCK_intent, 0); 2094 2106 2095 2107 bch2_btree_node_free_inmem(trans, trans->paths + path, b); 2096 2108 bch2_btree_node_free_inmem(trans, trans->paths + sib_path, m); ··· 2144 2150 if (ret) 2145 2151 goto out; 2146 2152 2147 - bch2_btree_interior_update_will_free_node(as, b); 2148 - 2149 2153 n = bch2_btree_node_alloc_replacement(as, trans, b); 2150 2154 2151 2155 bch2_btree_build_aux_trees(n); ··· 2167 2175 if (ret) 2168 2176 goto err; 2169 2177 2178 + bch2_btree_interior_update_will_free_node(as, b); 2179 + 2170 2180 bch2_btree_update_get_open_buckets(as, n); 2171 - bch2_btree_node_write(c, n, SIX_LOCK_intent, 0); 2181 + bch2_btree_node_write_trans(trans, n, SIX_LOCK_intent, 0); 2172 2182 2173 2183 bch2_btree_node_free_inmem(trans, btree_iter_path(trans, iter), b); 2174 2184 ··· 2195 2201 struct list_head list; 2196 
2202 enum btree_id btree_id; 2197 2203 unsigned level; 2198 - struct bpos pos; 2199 - __le64 seq; 2204 + struct bkey_buf key; 2200 2205 }; 2201 2206 2202 2207 static int async_btree_node_rewrite_trans(struct btree_trans *trans, 2203 2208 struct async_btree_rewrite *a) 2204 2209 { 2205 - struct bch_fs *c = trans->c; 2206 2210 struct btree_iter iter; 2207 - struct btree *b; 2208 - int ret; 2209 - 2210 - bch2_trans_node_iter_init(trans, &iter, a->btree_id, a->pos, 2211 + bch2_trans_node_iter_init(trans, &iter, 2212 + a->btree_id, a->key.k->k.p, 2211 2213 BTREE_MAX_DEPTH, a->level, 0); 2212 - b = bch2_btree_iter_peek_node(&iter); 2213 - ret = PTR_ERR_OR_ZERO(b); 2214 + struct btree *b = bch2_btree_iter_peek_node(&iter); 2215 + int ret = PTR_ERR_OR_ZERO(b); 2214 2216 if (ret) 2215 2217 goto out; 2216 2218 2217 - if (!b || b->data->keys.seq != a->seq) { 2219 + bool found = b && btree_ptr_hash_val(&b->key) == btree_ptr_hash_val(a->key.k); 2220 + ret = found 2221 + ? bch2_btree_node_rewrite(trans, &iter, b, 0) 2222 + : -ENOENT; 2223 + 2224 + #if 0 2225 + /* Tracepoint... 
*/ 2226 + if (!ret || ret == -ENOENT) { 2227 + struct bch_fs *c = trans->c; 2218 2228 struct printbuf buf = PRINTBUF; 2219 2229 2220 - if (b) 2221 - bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 2222 - else 2223 - prt_str(&buf, "(null"); 2224 - bch_info(c, "%s: node to rewrite not found:, searching for seq %llu, got\n%s", 2225 - __func__, a->seq, buf.buf); 2230 + if (!ret) { 2231 + prt_printf(&buf, "rewrite node:\n "); 2232 + bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(a->key.k)); 2233 + } else { 2234 + prt_printf(&buf, "node to rewrite not found:\n want: "); 2235 + bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(a->key.k)); 2236 + prt_printf(&buf, "\n got: "); 2237 + if (b) 2238 + bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 2239 + else 2240 + prt_str(&buf, "(null)"); 2241 + } 2242 + bch_info(c, "%s", buf.buf); 2226 2243 printbuf_exit(&buf); 2227 - goto out; 2228 2244 } 2229 - 2230 - ret = bch2_btree_node_rewrite(trans, &iter, b, 0); 2245 + #endif 2231 2246 out: 2232 2247 bch2_trans_iter_exit(trans, &iter); 2233 - 2234 2248 return ret; 2235 2249 } 2236 2250 ··· 2249 2247 struct bch_fs *c = a->c; 2250 2248 2251 2249 int ret = bch2_trans_do(c, async_btree_node_rewrite_trans(trans, a)); 2252 - bch_err_fn_ratelimited(c, ret); 2250 + if (ret != -ENOENT) 2251 + bch_err_fn_ratelimited(c, ret); 2252 + 2253 + spin_lock(&c->btree_node_rewrites_lock); 2254 + list_del(&a->list); 2255 + spin_unlock(&c->btree_node_rewrites_lock); 2256 + 2257 + closure_wake_up(&c->btree_node_rewrites_wait); 2258 + 2259 + bch2_bkey_buf_exit(&a->key, c); 2253 2260 bch2_write_ref_put(c, BCH_WRITE_REF_node_rewrite); 2254 2261 kfree(a); 2255 2262 } 2256 2263 2257 2264 void bch2_btree_node_rewrite_async(struct bch_fs *c, struct btree *b) 2258 2265 { 2259 - struct async_btree_rewrite *a; 2260 - int ret; 2261 - 2262 - a = kmalloc(sizeof(*a), GFP_NOFS); 2263 - if (!a) { 2264 - bch_err(c, "%s: error allocating memory", __func__); 2266 + struct async_btree_rewrite *a = 
kmalloc(sizeof(*a), GFP_NOFS); 2267 + if (!a) 2265 2268 return; 2266 - } 2267 2269 2268 2270 a->c = c; 2269 2271 a->btree_id = b->c.btree_id; 2270 2272 a->level = b->c.level; 2271 - a->pos = b->key.k.p; 2272 - a->seq = b->data->keys.seq; 2273 2273 INIT_WORK(&a->work, async_btree_node_rewrite_work); 2274 2274 2275 - if (unlikely(!test_bit(BCH_FS_may_go_rw, &c->flags))) { 2276 - mutex_lock(&c->pending_node_rewrites_lock); 2277 - list_add(&a->list, &c->pending_node_rewrites); 2278 - mutex_unlock(&c->pending_node_rewrites_lock); 2279 - return; 2275 + bch2_bkey_buf_init(&a->key); 2276 + bch2_bkey_buf_copy(&a->key, c, &b->key); 2277 + 2278 + bool now = false, pending = false; 2279 + 2280 + spin_lock(&c->btree_node_rewrites_lock); 2281 + if (c->curr_recovery_pass > BCH_RECOVERY_PASS_journal_replay && 2282 + bch2_write_ref_tryget(c, BCH_WRITE_REF_node_rewrite)) { 2283 + list_add(&a->list, &c->btree_node_rewrites); 2284 + now = true; 2285 + } else if (!test_bit(BCH_FS_may_go_rw, &c->flags)) { 2286 + list_add(&a->list, &c->btree_node_rewrites_pending); 2287 + pending = true; 2280 2288 } 2289 + spin_unlock(&c->btree_node_rewrites_lock); 2281 2290 2282 - if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_node_rewrite)) { 2283 - if (test_bit(BCH_FS_started, &c->flags)) { 2284 - bch_err(c, "%s: error getting c->writes ref", __func__); 2285 - kfree(a); 2286 - return; 2287 - } 2288 - 2289 - ret = bch2_fs_read_write_early(c); 2290 - bch_err_msg(c, ret, "going read-write"); 2291 - if (ret) { 2292 - kfree(a); 2293 - return; 2294 - } 2295 - 2296 - bch2_write_ref_get(c, BCH_WRITE_REF_node_rewrite); 2291 + if (now) { 2292 + queue_work(c->btree_node_rewrite_worker, &a->work); 2293 + } else if (pending) { 2294 + /* bch2_do_pending_node_rewrites will execute */ 2295 + } else { 2296 + bch2_bkey_buf_exit(&a->key, c); 2297 + kfree(a); 2297 2298 } 2299 + } 2298 2300 2299 - queue_work(c->btree_node_rewrite_worker, &a->work); 2301 + void bch2_async_btree_node_rewrites_flush(struct bch_fs *c) 2302 + { 
2303 + closure_wait_event(&c->btree_node_rewrites_wait, 2304 + list_empty(&c->btree_node_rewrites)); 2300 2305 } 2301 2306 2302 2307 void bch2_do_pending_node_rewrites(struct bch_fs *c) 2303 2308 { 2304 - struct async_btree_rewrite *a, *n; 2309 + while (1) { 2310 + spin_lock(&c->btree_node_rewrites_lock); 2311 + struct async_btree_rewrite *a = 2312 + list_pop_entry(&c->btree_node_rewrites_pending, 2313 + struct async_btree_rewrite, list); 2314 + if (a) 2315 + list_add(&a->list, &c->btree_node_rewrites); 2316 + spin_unlock(&c->btree_node_rewrites_lock); 2305 2317 2306 - mutex_lock(&c->pending_node_rewrites_lock); 2307 - list_for_each_entry_safe(a, n, &c->pending_node_rewrites, list) { 2308 - list_del(&a->list); 2318 + if (!a) 2319 + break; 2309 2320 2310 2321 bch2_write_ref_get(c, BCH_WRITE_REF_node_rewrite); 2311 2322 queue_work(c->btree_node_rewrite_worker, &a->work); 2312 2323 } 2313 - mutex_unlock(&c->pending_node_rewrites_lock); 2314 2324 } 2315 2325 2316 2326 void bch2_free_pending_node_rewrites(struct bch_fs *c) 2317 2327 { 2318 - struct async_btree_rewrite *a, *n; 2328 + while (1) { 2329 + spin_lock(&c->btree_node_rewrites_lock); 2330 + struct async_btree_rewrite *a = 2331 + list_pop_entry(&c->btree_node_rewrites_pending, 2332 + struct async_btree_rewrite, list); 2333 + spin_unlock(&c->btree_node_rewrites_lock); 2319 2334 2320 - mutex_lock(&c->pending_node_rewrites_lock); 2321 - list_for_each_entry_safe(a, n, &c->pending_node_rewrites, list) { 2322 - list_del(&a->list); 2335 + if (!a) 2336 + break; 2323 2337 2338 + bch2_bkey_buf_exit(&a->key, c); 2324 2339 kfree(a); 2325 2340 } 2326 - mutex_unlock(&c->pending_node_rewrites_lock); 2327 2341 } 2328 2342 2329 2343 static int __bch2_btree_node_update_key(struct btree_trans *trans, ··· 2593 2575 prt_printf(out, "%ps: ", (void *) as->ip_started); 2594 2576 bch2_trans_commit_flags_to_text(out, as->flags); 2595 2577 2596 - prt_printf(out, " btree=%s l=%u-%u mode=%s nodes_written=%u cl.remaining=%u 
journal_seq=%llu\n", 2597 - bch2_btree_id_str(as->btree_id), 2578 + prt_str(out, " "); 2579 + bch2_btree_id_to_text(out, as->btree_id); 2580 + prt_printf(out, " l=%u-%u mode=%s nodes_written=%u cl.remaining=%u journal_seq=%llu\n", 2598 2581 as->update_level_start, 2599 2582 as->update_level_end, 2600 2583 bch2_btree_update_modes[as->mode], ··· 2696 2677 2697 2678 void bch2_fs_btree_interior_update_exit(struct bch_fs *c) 2698 2679 { 2680 + WARN_ON(!list_empty(&c->btree_node_rewrites)); 2681 + WARN_ON(!list_empty(&c->btree_node_rewrites_pending)); 2682 + 2699 2683 if (c->btree_node_rewrite_worker) 2700 2684 destroy_workqueue(c->btree_node_rewrite_worker); 2701 2685 if (c->btree_interior_update_worker) ··· 2714 2692 mutex_init(&c->btree_interior_update_lock); 2715 2693 INIT_WORK(&c->btree_interior_update_work, btree_interior_update_work); 2716 2694 2717 - INIT_LIST_HEAD(&c->pending_node_rewrites); 2718 - mutex_init(&c->pending_node_rewrites_lock); 2695 + INIT_LIST_HEAD(&c->btree_node_rewrites); 2696 + INIT_LIST_HEAD(&c->btree_node_rewrites_pending); 2697 + spin_lock_init(&c->btree_node_rewrites_lock); 2719 2698 } 2720 2699 2721 2700 int bch2_fs_btree_interior_update_init(struct bch_fs *c)
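The reworked bch2_btree_node_rewrite_async() in the diff above keeps two lists under one spinlock: requests that can run immediately go on btree_node_rewrites, requests that arrive before the filesystem may go read-write are parked on btree_node_rewrites_pending, and anything else is freed. bch2_do_pending_node_rewrites() later pops parked entries one at a time and drops the lock before queueing each one. The following is a minimal userspace sketch of that pattern, not the kernel code: struct work_lists, submit(), run_pending() and the may_run/may_defer flags are hypothetical names invented for illustration, a pthread mutex stands in for the spinlock, and singly linked lists stand in for list_head.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct async_btree_rewrite. */
struct work_item {
	struct work_item *next;
};

/* Hypothetical stand-in for the relevant fields of struct bch_fs. */
struct work_lists {
	pthread_mutex_t lock;      /* stands in for btree_node_rewrites_lock */
	struct work_item *active;  /* stands in for btree_node_rewrites */
	struct work_item *pending; /* stands in for btree_node_rewrites_pending */
	bool may_run;              /* assumed: "can run now" condition */
	bool may_defer;            /* assumed: "not yet read-write, park it" */
};

enum submit_result { SUBMIT_NOW, SUBMIT_PENDING, SUBMIT_DROPPED };

static void push(struct work_item **head, struct work_item *it)
{
	assert(it != NULL);
	it->next = *head;
	*head = it;
}

static struct work_item *pop(struct work_item **head)
{
	struct work_item *it = *head;

	if (it)
		*head = it->next;
	return it;
}

/* Decide under the lock, then act on the decision after dropping it. */
static enum submit_result submit(struct work_lists *w, struct work_item *it)
{
	enum submit_result r = SUBMIT_DROPPED;

	pthread_mutex_lock(&w->lock);
	if (w->may_run) {
		push(&w->active, it);
		r = SUBMIT_NOW;
	} else if (w->may_defer) {
		push(&w->pending, it);
		r = SUBMIT_PENDING;
	}
	pthread_mutex_unlock(&w->lock);

	if (r == SUBMIT_DROPPED)
		free(it); /* neither runnable nor deferrable: discard */
	return r;
}

/* Move parked items to the active list one at a time; returns how many moved. */
static size_t run_pending(struct work_lists *w)
{
	size_t moved = 0;

	for (;;) {
		pthread_mutex_lock(&w->lock);
		struct work_item *it = pop(&w->pending);
		if (it)
			push(&w->active, it);
		pthread_mutex_unlock(&w->lock);

		if (!it)
			break;
		moved++; /* real code would queue the work here, outside the lock */
	}
	return moved;
}
```

Popping one entry per lock acquisition keeps the critical section short and means the lock is never held while the item is handed to a worker, which is presumably why the diff could switch from a mutex to a spinlock.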
+2 -1
fs/bcachefs/btree_update_interior.h
···
 				unsigned level,
 				unsigned flags)
 {
-	bch2_trans_verify_not_unlocked(trans);
+	bch2_trans_verify_not_unlocked_or_in_restart(trans);
 
 	return bch2_foreground_maybe_merge_sibling(trans, path, level, flags,
 						    btree_prev_sib) ?:
···
 struct jset_entry *bch2_btree_roots_to_journal_entries(struct bch_fs *,
 					struct jset_entry *, unsigned long);
 
+void bch2_async_btree_node_rewrites_flush(struct bch_fs *);
 void bch2_do_pending_node_rewrites(struct bch_fs *);
 void bch2_free_pending_node_rewrites(struct bch_fs *);
 
+50 -33
fs/bcachefs/btree_write_buffer.c
···
 static int bch2_btree_write_buffer_journal_flush(struct journal *,
 				struct journal_entry_pin *, u64);
 
-static int bch2_journal_keys_to_write_buffer(struct bch_fs *, struct journal_buf *);
-
 static inline bool __wb_key_ref_cmp(const struct wb_key_ref *l, const struct wb_key_ref *r)
 {
 	return (cmp_int(l->hi, r->hi) ?:
···
 	darray_for_each(wb->sorted, i) {
 		struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
 
+		BUG_ON(!btree_type_uses_write_buffer(k->btree));
+
 		for (struct wb_key_ref *n = i + 1; n < min(i + 4, &darray_top(wb->sorted)); n++)
 			prefetch(&wb->flushing.keys.data[n->idx]);
 
···
 	return ret;
 }
 
-static int fetch_wb_keys_from_journal(struct bch_fs *c, u64 seq)
+static int bch2_journal_keys_to_write_buffer(struct bch_fs *c, struct journal_buf *buf)
+{
+	struct journal_keys_to_wb dst;
+	int ret = 0;
+
+	bch2_journal_keys_to_write_buffer_start(c, &dst, le64_to_cpu(buf->data->seq));
+
+	for_each_jset_entry_type(entry, buf->data, BCH_JSET_ENTRY_write_buffer_keys) {
+		jset_entry_for_each_key(entry, k) {
+			ret = bch2_journal_key_to_wb(c, &dst, entry->btree_id, k);
+			if (ret)
+				goto out;
+		}
+
+		entry->type = BCH_JSET_ENTRY_btree_keys;
+	}
+out:
+	ret = bch2_journal_keys_to_write_buffer_end(c, &dst) ?: ret;
+	return ret;
+}
+
+static int fetch_wb_keys_from_journal(struct bch_fs *c, u64 max_seq)
 {
 	struct journal *j = &c->journal;
 	struct journal_buf *buf;
+	bool blocked;
 	int ret = 0;
 
-	while (!ret && (buf = bch2_next_write_buffer_flush_journal_buf(j, seq))) {
+	while (!ret && (buf = bch2_next_write_buffer_flush_journal_buf(j, max_seq, &blocked))) {
 		ret = bch2_journal_keys_to_write_buffer(c, buf);
+
+		if (!blocked && !ret) {
+			spin_lock(&j->lock);
+			buf->need_flush_to_write_buffer = false;
+			spin_unlock(&j->lock);
+		}
+
 		mutex_unlock(&j->buf_lock);
+
+		if (blocked) {
+			bch2_journal_unblock(j);
+			break;
+		}
 	}
 
 	return ret;
 }
 
-static int btree_write_buffer_flush_seq(struct btree_trans *trans, u64 seq,
+static int btree_write_buffer_flush_seq(struct btree_trans *trans, u64 max_seq,
 					bool *did_work)
 {
 	struct bch_fs *c = trans->c;
···
 	do {
 		bch2_trans_unlock(trans);
 
-		fetch_from_journal_err = fetch_wb_keys_from_journal(c, seq);
+		fetch_from_journal_err = fetch_wb_keys_from_journal(c, max_seq);
 
 		*did_work |= wb->inc.keys.nr || wb->flushing.keys.nr;
 
···
 		mutex_unlock(&wb->flushing.lock);
 	} while (!ret &&
 		 (fetch_from_journal_err ||
-		  (wb->inc.pin.seq && wb->inc.pin.seq <= seq) ||
-		  (wb->flushing.pin.seq && wb->flushing.pin.seq <= seq)));
+		  (wb->inc.pin.seq && wb->inc.pin.seq <= max_seq) ||
+		  (wb->flushing.pin.seq && wb->flushing.pin.seq <= max_seq)));
 
 	return ret;
 }
···
 	bch2_bkey_buf_init(&tmp);
 
 	if (!bkey_and_val_eq(referring_k, bkey_i_to_s_c(last_flushed->k))) {
+		if (trace_write_buffer_maybe_flush_enabled()) {
+			struct printbuf buf = PRINTBUF;
+
+			bch2_bkey_val_to_text(&buf, c, referring_k);
+			trace_write_buffer_maybe_flush(trans, _RET_IP_, buf.buf);
+			printbuf_exit(&buf);
+		}
+
 		bch2_bkey_buf_reassemble(&tmp, c, referring_k);
 
 		if (bkey_is_btree_ptr(referring_k.k)) {
···
 	mutex_unlock(&wb->flushing.lock);
 	mutex_unlock(&wb->inc.lock);
 
-	return ret;
-}
-
-static int bch2_journal_keys_to_write_buffer(struct bch_fs *c, struct journal_buf *buf)
-{
-	struct journal_keys_to_wb dst;
-	int ret = 0;
-
-	bch2_journal_keys_to_write_buffer_start(c, &dst, le64_to_cpu(buf->data->seq));
-
-	for_each_jset_entry_type(entry, buf->data, BCH_JSET_ENTRY_write_buffer_keys) {
-		jset_entry_for_each_key(entry, k) {
-			ret = bch2_journal_key_to_wb(c, &dst, entry->btree_id, k);
-			if (ret)
-				goto out;
-		}
-
-		entry->type = BCH_JSET_ENTRY_btree_keys;
-	}
-
-	spin_lock(&c->journal.lock);
-	buf->need_flush_to_write_buffer = false;
-	spin_unlock(&c->journal.lock);
-out:
-	ret = bch2_journal_keys_to_write_buffer_end(c, &dst) ?: ret;
 	return ret;
 }
 
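bch2_journal_keys_to_write_buffer() in the diff above finishes with `ret = bch2_journal_keys_to_write_buffer_end(c, &dst) ?: ret;`. The `a ?: b` form is the GNU C conditional with an omitted middle operand: it evaluates `a` once and yields it if nonzero, otherwise `b`. At an `out:` label this always runs the cleanup step, lets a cleanup error win, and otherwise preserves whatever error was already in `ret`. A small sketch of that idiom follows; step(), finish(), process() and the error values are hypothetical and chosen only for illustration, and the code assumes a compiler that accepts GNU extensions (gcc or clang in their default modes).

```c
#include <assert.h>

/* Hypothetical counter so the effect of the cleanup call can be observed. */
static int finish_calls;

/* Hypothetical work step: returns a negative error code on failure. */
static int step(int fail)
{
	return fail ? -5 : 0;
}

/* Hypothetical cleanup step: must always run, and may itself fail. */
static int finish(int fail)
{
	finish_calls++;
	return fail ? -12 : 0;
}

static int process(int step_fails, int finish_fails)
{
	int ret = step(step_fails);
	if (ret)
		goto out;

	/* further steps would go here */
out:
	/*
	 * GNU C extension: finish() is evaluated exactly once; its result
	 * replaces ret only when it is nonzero.
	 */
	ret = finish(finish_fails) ?: ret;
	assert(ret <= 0);
	return ret;
}
```

Note the precedence this gives: when both the work step and the cleanup fail, the caller sees the cleanup error. Writing `ret = ret ?: finish(...)` instead would skip the cleanup call entirely once `ret` is nonzero, which is usually not what an `out:` path wants.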
+75 -58
fs/bcachefs/buckets.c
··· 18 18 #include "error.h" 19 19 #include "inode.h" 20 20 #include "movinggc.h" 21 + #include "rebalance.h" 21 22 #include "recovery.h" 23 + #include "recovery_passes.h" 22 24 #include "reflink.h" 23 25 #include "replicas.h" 24 26 #include "subvolume.h" ··· 262 260 struct printbuf buf = PRINTBUF; 263 261 int ret = 0; 264 262 265 - percpu_down_read(&c->mark_lock); 266 - 267 263 bkey_for_each_ptr_decode(k.k, ptrs_c, p, entry_c) { 268 264 ret = bch2_check_fix_ptr(trans, k, p, entry_c, &do_update); 269 265 if (ret) ··· 362 362 bch_info(c, "new key %s", buf.buf); 363 363 } 364 364 365 - percpu_up_read(&c->mark_lock); 366 365 struct btree_iter iter; 367 366 bch2_trans_node_iter_init(trans, &iter, btree, new->k.p, 0, level, 368 367 BTREE_ITER_intent|BTREE_ITER_all_snapshots); ··· 370 371 BTREE_UPDATE_internal_snapshot_node| 371 372 BTREE_TRIGGER_norun); 372 373 bch2_trans_iter_exit(trans, &iter); 373 - percpu_down_read(&c->mark_lock); 374 - 375 374 if (ret) 376 375 goto err; 377 376 ··· 377 380 bch2_btree_node_update_key_early(trans, btree, level - 1, k, new); 378 381 } 379 382 err: 380 - percpu_up_read(&c->mark_lock); 381 383 printbuf_exit(&buf); 382 384 return ret; 383 385 } ··· 397 401 BUG_ON(!sectors); 398 402 399 403 if (gen_after(ptr->gen, b_gen)) { 400 - bch2_fsck_err(trans, FSCK_CAN_IGNORE|FSCK_NEED_FSCK, 401 - ptr_gen_newer_than_bucket_gen, 404 + bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_check_allocations); 405 + log_fsck_err(trans, ptr_gen_newer_than_bucket_gen, 402 406 "bucket %u:%zu gen %u data type %s: ptr gen %u newer than bucket gen\n" 403 407 "while marking %s", 404 408 ptr->dev, bucket_nr, b_gen, ··· 411 415 } 412 416 413 417 if (gen_cmp(b_gen, ptr->gen) > BUCKET_GC_GEN_MAX) { 414 - bch2_fsck_err(trans, FSCK_CAN_IGNORE|FSCK_NEED_FSCK, 415 - ptr_too_stale, 418 + bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_check_allocations); 419 + log_fsck_err(trans, ptr_too_stale, 416 420 "bucket %u:%zu gen %u data type %s: ptr gen %u too stale\n" 
417 421 "while marking %s", 418 422 ptr->dev, bucket_nr, b_gen, ··· 431 435 } 432 436 433 437 if (b_gen != ptr->gen) { 434 - bch2_fsck_err(trans, FSCK_CAN_IGNORE|FSCK_NEED_FSCK, 435 - stale_dirty_ptr, 438 + bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_check_allocations); 439 + log_fsck_err(trans, stale_dirty_ptr, 436 440 "bucket %u:%zu gen %u (mem gen %u) data type %s: stale dirty ptr (gen %u)\n" 437 441 "while marking %s", 438 442 ptr->dev, bucket_nr, b_gen, ··· 447 451 } 448 452 449 453 if (bucket_data_type_mismatch(bucket_data_type, ptr_data_type)) { 450 - bch2_fsck_err(trans, FSCK_CAN_IGNORE|FSCK_NEED_FSCK, 451 - ptr_bucket_data_type_mismatch, 454 + bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_check_allocations); 455 + log_fsck_err(trans, ptr_bucket_data_type_mismatch, 452 456 "bucket %u:%zu gen %u different types of data in same bucket: %s, %s\n" 453 457 "while marking %s", 454 458 ptr->dev, bucket_nr, b_gen, ··· 462 466 } 463 467 464 468 if ((u64) *bucket_sectors + sectors > U32_MAX) { 465 - bch2_fsck_err(trans, FSCK_CAN_IGNORE|FSCK_NEED_FSCK, 466 - bucket_sector_count_overflow, 469 + bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_check_allocations); 470 + log_fsck_err(trans, bucket_sector_count_overflow, 467 471 "bucket %u:%zu gen %u data type %s sector count overflow: %u + %lli > U32_MAX\n" 468 472 "while marking %s", 469 473 ptr->dev, bucket_nr, b_gen, ··· 481 485 printbuf_exit(&buf); 482 486 return ret; 483 487 err: 488 + fsck_err: 484 489 bch2_dump_trans_updates(trans); 490 + bch2_inconsistent_error(c); 485 491 ret = -BCH_ERR_bucket_ref_update; 486 492 goto out; 487 493 } ··· 541 543 struct bkey_s_c k, 542 544 const struct extent_ptr_decoded *p, 543 545 s64 sectors, enum bch_data_type ptr_data_type, 544 - struct bch_alloc_v4 *a) 546 + struct bch_alloc_v4 *a, 547 + bool insert) 545 548 { 546 549 u32 *dst_sectors = p->has_ec ? &a->stripe_sectors : 547 550 !p->ptr.cached ? 
&a->dirty_sectors : ··· 552 553 553 554 if (ret) 554 555 return ret; 555 - 556 - alloc_data_type_set(a, ptr_data_type); 556 + if (insert) 557 + alloc_data_type_set(a, ptr_data_type); 557 558 return 0; 558 559 } 559 560 ··· 569 570 struct printbuf buf = PRINTBUF; 570 571 int ret = 0; 571 572 572 - u64 abs_sectors = ptr_disk_sectors(level ? btree_sectors(c) : k.k->size, p); 573 - *sectors = insert ? abs_sectors : -abs_sectors; 573 + struct bkey_i_backpointer bp; 574 + bch2_extent_ptr_to_bp(c, btree_id, level, k, p, entry, &bp); 575 + 576 + *sectors = insert ? bp.v.bucket_len : -(s64) bp.v.bucket_len; 574 577 575 578 struct bch_dev *ca = bch2_dev_tryget(c, p.ptr.dev); 576 579 if (unlikely(!ca)) { ··· 581 580 goto err; 582 581 } 583 582 584 - struct bpos bucket; 585 - struct bch_backpointer bp; 586 - __bch2_extent_ptr_to_bp(trans->c, ca, btree_id, level, k, p, entry, &bucket, &bp, abs_sectors); 583 + struct bpos bucket = PTR_BUCKET_POS(ca, &p.ptr); 587 584 588 585 if (flags & BTREE_TRIGGER_transactional) { 589 586 struct bkey_i_alloc_v4 *a = bch2_trans_start_alloc_update(trans, bucket, 0); 590 587 ret = PTR_ERR_OR_ZERO(a) ?: 591 - __mark_pointer(trans, ca, k, &p, *sectors, bp.data_type, &a->v); 588 + __mark_pointer(trans, ca, k, &p, *sectors, bp.v.data_type, &a->v, insert); 592 589 if (ret) 593 590 goto err; 594 591 595 592 if (!p.ptr.cached) { 596 - ret = bch2_bucket_backpointer_mod(trans, ca, bucket, bp, k, insert); 593 + ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert); 597 594 if (ret) 598 595 goto err; 599 596 } 600 597 } 601 598 602 599 if (flags & BTREE_TRIGGER_gc) { 603 - percpu_down_read(&c->mark_lock); 604 600 struct bucket *g = gc_bucket(ca, bucket.offset); 605 601 if (bch2_fs_inconsistent_on(!g, c, "reference to invalid bucket on device %u\n %s", 606 602 p.ptr.dev, 607 603 (bch2_bkey_val_to_text(&buf, c, k), buf.buf))) { 608 604 ret = -BCH_ERR_trigger_pointer; 609 - goto err_unlock; 605 + goto err; 610 606 } 611 607 612 608 bucket_lock(g); 613 609 
struct bch_alloc_v4 old = bucket_m_to_alloc(*g), new = old; 614 - ret = __mark_pointer(trans, ca, k, &p, *sectors, bp.data_type, &new); 610 + ret = __mark_pointer(trans, ca, k, &p, *sectors, bp.v.data_type, &new, insert); 615 611 alloc_to_bucket(g, new); 616 612 bucket_unlock(g); 617 - err_unlock: 618 - percpu_up_read(&c->mark_lock); 619 613 620 614 if (!ret) 621 615 ret = bch2_alloc_key_to_dev_counters(trans, ca, &old, &new, flags); ··· 947 951 enum bch_data_type type, 948 952 unsigned sectors) 949 953 { 954 + struct bch_fs *c = trans->c; 950 955 struct btree_iter iter; 951 956 int ret = 0; 952 957 ··· 957 960 return PTR_ERR(a); 958 961 959 962 if (a->v.data_type && type && a->v.data_type != type) { 960 - bch2_fsck_err(trans, FSCK_CAN_IGNORE|FSCK_NEED_FSCK, 961 - bucket_metadata_type_mismatch, 963 + bch2_run_explicit_recovery_pass(c, BCH_RECOVERY_PASS_check_allocations); 964 + log_fsck_err(trans, bucket_metadata_type_mismatch, 962 965 "bucket %llu:%llu gen %u different types of data in same bucket: %s, %s\n" 963 966 "while marking %s", 964 967 iter.pos.inode, iter.pos.offset, a->v.gen, ··· 976 979 ret = bch2_trans_update(trans, &iter, &a->k_i, 0); 977 980 } 978 981 err: 982 + fsck_err: 979 983 bch2_trans_iter_exit(trans, &iter); 980 984 return ret; 981 985 } ··· 988 990 struct bch_fs *c = trans->c; 989 991 int ret = 0; 990 992 991 - percpu_down_read(&c->mark_lock); 992 993 struct bucket *g = gc_bucket(ca, b); 993 994 if (bch2_fs_inconsistent_on(!g, c, "reference to invalid bucket on device %u when marking metadata type %s", 994 995 ca->dev_idx, bch2_data_type_str(data_type))) 995 - goto err_unlock; 996 + goto err; 996 997 997 998 bucket_lock(g); 998 999 struct bch_alloc_v4 old = bucket_m_to_alloc(*g); ··· 1001 1004 "different types of data in same bucket: %s, %s", 1002 1005 bch2_data_type_str(g->data_type), 1003 1006 bch2_data_type_str(data_type))) 1004 - goto err; 1007 + goto err_unlock; 1005 1008 1006 1009 if (bch2_fs_inconsistent_on((u64) g->dirty_sectors + 
sectors > ca->mi.bucket_size, c, 1007 1010 "bucket %u:%llu gen %u data type %s sector count overflow: %u + %u > bucket size", 1008 1011 ca->dev_idx, b, g->gen, 1009 1012 bch2_data_type_str(g->data_type ?: data_type), 1010 1013 g->dirty_sectors, sectors)) 1011 - goto err; 1014 + goto err_unlock; 1012 1015 1013 1016 g->data_type = data_type; 1014 1017 g->dirty_sectors += sectors; 1015 1018 struct bch_alloc_v4 new = bucket_m_to_alloc(*g); 1016 1019 bucket_unlock(g); 1017 - percpu_up_read(&c->mark_lock); 1018 1020 ret = bch2_alloc_key_to_dev_counters(trans, ca, &old, &new, flags); 1019 1021 return ret; 1020 - err: 1021 - bucket_unlock(g); 1022 1022 err_unlock: 1023 - percpu_up_read(&c->mark_lock); 1023 + bucket_unlock(g); 1024 + err: 1024 1025 return -BCH_ERR_metadata_bucket_inconsistency; 1025 1026 } 1026 1027 ··· 1150 1155 return bch2_trans_mark_dev_sbs_flags(c, BTREE_TRIGGER_transactional); 1151 1156 } 1152 1157 1158 + bool bch2_is_superblock_bucket(struct bch_dev *ca, u64 b) 1159 + { 1160 + struct bch_sb_layout *layout = &ca->disk_sb.sb->layout; 1161 + u64 b_offset = bucket_to_sector(ca, b); 1162 + u64 b_end = bucket_to_sector(ca, b + 1); 1163 + unsigned i; 1164 + 1165 + if (!b) 1166 + return true; 1167 + 1168 + for (i = 0; i < layout->nr_superblocks; i++) { 1169 + u64 offset = le64_to_cpu(layout->sb_offset[i]); 1170 + u64 end = offset + (1 << layout->sb_max_size_bits); 1171 + 1172 + if (!(offset >= b_end || end <= b_offset)) 1173 + return true; 1174 + } 1175 + 1176 + for (i = 0; i < ca->journal.nr; i++) 1177 + if (b == ca->journal.buckets[i]) 1178 + return true; 1179 + 1180 + return false; 1181 + } 1182 + 1153 1183 /* Disk reservations: */ 1154 1184 1155 1185 #define SECTORS_CACHE 1024 ··· 1258 1238 for_each_member_device(c, ca) { 1259 1239 BUG_ON(ca->buckets_nouse); 1260 1240 1261 - ca->buckets_nouse = kvmalloc(BITS_TO_LONGS(ca->mi.nbuckets) * 1241 + ca->buckets_nouse = bch2_kvmalloc(BITS_TO_LONGS(ca->mi.nbuckets) * 1262 1242 sizeof(unsigned long), 1263 1243 
GFP_KERNEL|__GFP_ZERO); 1264 1244 if (!ca->buckets_nouse) { ··· 1284 1264 bool resize = ca->bucket_gens != NULL; 1285 1265 int ret; 1286 1266 1287 - BUG_ON(resize && ca->buckets_nouse); 1267 + if (resize) 1268 + lockdep_assert_held(&c->state_lock); 1288 1269 1289 - if (!(bucket_gens = kvmalloc(sizeof(struct bucket_gens) + nbuckets, 1290 - GFP_KERNEL|__GFP_ZERO))) { 1270 + if (resize && ca->buckets_nouse) 1271 + return -BCH_ERR_no_resize_with_buckets_nouse; 1272 + 1273 + bucket_gens = bch2_kvmalloc(struct_size(bucket_gens, b, nbuckets), 1274 + GFP_KERNEL|__GFP_ZERO); 1275 + if (!bucket_gens) { 1291 1276 ret = -BCH_ERR_ENOMEM_bucket_gens; 1292 1277 goto err; 1293 1278 } ··· 1302 1277 bucket_gens->nbuckets_minus_first = 1303 1278 bucket_gens->nbuckets - bucket_gens->first_bucket; 1304 1279 1305 - if (resize) { 1306 - down_write(&ca->bucket_lock); 1307 - percpu_down_write(&c->mark_lock); 1308 - } 1309 - 1310 1280 old_bucket_gens = rcu_dereference_protected(ca->bucket_gens, 1); 1311 1281 1312 1282 if (resize) { 1313 - size_t n = min(bucket_gens->nbuckets, old_bucket_gens->nbuckets); 1314 - 1283 + bucket_gens->nbuckets = min(bucket_gens->nbuckets, 1284 + old_bucket_gens->nbuckets); 1285 + bucket_gens->nbuckets_minus_first = 1286 + bucket_gens->nbuckets - bucket_gens->first_bucket; 1315 1287 memcpy(bucket_gens->b, 1316 1288 old_bucket_gens->b, 1317 - n); 1289 + bucket_gens->nbuckets); 1318 1290 } 1319 1291 1320 1292 rcu_assign_pointer(ca->bucket_gens, bucket_gens); 1321 1293 bucket_gens = old_bucket_gens; 1322 1294 1323 1295 nbuckets = ca->mi.nbuckets; 1324 - 1325 - if (resize) { 1326 - percpu_up_write(&c->mark_lock); 1327 - up_write(&ca->bucket_lock); 1328 - } 1329 1296 1330 1297 ret = 0; 1331 1298 err:
+5 -25
fs/bcachefs/buckets.h
@@ -82,16 +82,15 @@
 
 static inline struct bucket *gc_bucket(struct bch_dev *ca, size_t b)
 {
-	return genradix_ptr(&ca->buckets_gc, b);
+	return bucket_valid(ca, b)
+		? genradix_ptr(&ca->buckets_gc, b)
+		: NULL;
 }
 
 static inline struct bucket_gens *bucket_gens(struct bch_dev *ca)
 {
 	return rcu_dereference_check(ca->bucket_gens,
-				     !ca->fs ||
-				     percpu_rwsem_is_held(&ca->fs->mark_lock) ||
-				     lockdep_is_held(&ca->fs->state_lock) ||
-				     lockdep_is_held(&ca->bucket_lock));
+				     lockdep_is_held(&ca->fs->state_lock));
 }
 
 static inline u8 *bucket_gen(struct bch_dev *ca, size_t b)
@@ -308,26 +307,7 @@
 				    enum btree_iter_update_trigger_flags);
 int bch2_trans_mark_dev_sbs(struct bch_fs *);
 
-static inline bool is_superblock_bucket(struct bch_dev *ca, u64 b)
-{
-	struct bch_sb_layout *layout = &ca->disk_sb.sb->layout;
-	u64 b_offset = bucket_to_sector(ca, b);
-	u64 b_end = bucket_to_sector(ca, b + 1);
-	unsigned i;
-
-	if (!b)
-		return true;
-
-	for (i = 0; i < layout->nr_superblocks; i++) {
-		u64 offset = le64_to_cpu(layout->sb_offset[i]);
-		u64 end = offset + (1 << layout->sb_max_size_bits);
-
-		if (!(offset >= b_end || end <= b_offset))
-			return true;
-	}
-
-	return false;
-}
+bool bch2_is_superblock_bucket(struct bch_dev *, u64);
 
 static inline const char *bch2_data_type_str(enum bch_data_type type)
 {
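Reviewer note: the superblock check that moves out of this header relies on a half-open interval overlap test, `!(offset >= b_end || end <= b_offset)`. A minimal standalone sketch of that test follows. `ranges_overlap` is a hypothetical helper name used only for illustration and is not part of bcachefs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a bcachefs function): do the half-open sector
 * ranges [a_start, a_end) and [b_start, b_end) share at least one sector?
 */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
			   uint64_t b_start, uint64_t b_end)
{
	assert(a_start <= a_end && b_start <= b_end);

	/* The ranges are disjoint iff one starts at or after the other ends. */
	return !(a_start >= b_end || a_end <= b_start);
}
```

With half-open ranges, a superblock that ends exactly where a bucket begins does not count as overlapping it, which matches the form of the condition in the diff.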
+1 -1
fs/bcachefs/buckets_types.h
@@ -24,7 +24,7 @@
 	u16 first_bucket;
 	size_t nbuckets;
 	size_t nbuckets_minus_first;
-	u8 b[];
+	u8 b[] __counted_by(nbuckets);
 };
 
 struct bch_dev_usage {
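Reviewer note: `__counted_by(nbuckets)` ties the flexible array `b[]` to the field that holds its length, so bounds-checking tooling can use it. The userspace sketch below illustrates the same pattern under stated assumptions: `COUNTED_BY`, `struct gens`, `gens_alloc` and `gens_get` are hypothetical names, the attribute is only applied when the compiler reports support for it, and the overflow check is written by hand rather than with the kernel's `struct_size()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's __counted_by(). */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

struct gens {
	size_t	nbuckets;
	uint8_t	b[] COUNTED_BY(nbuckets);	/* length lives in nbuckets */
};

/* Allocate a zeroed header plus nbuckets bytes, rejecting size overflow. */
static struct gens *gens_alloc(size_t nbuckets)
{
	if (nbuckets > SIZE_MAX - sizeof(struct gens))
		return NULL;

	struct gens *g = calloc(1, sizeof(*g) + nbuckets);
	if (!g)
		return NULL;

	/* Set the count before the array is ever indexed. */
	g->nbuckets = nbuckets;
	return g;
}

/* Bounds-checked read of one generation number. */
static uint8_t gens_get(const struct gens *g, size_t i)
{
	assert(i < g->nbuckets);
	return g->b[i];
}
```

The count field is assigned immediately after allocation because the annotation is only meaningful once the count is valid.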
+1 -218
fs/bcachefs/chardev.c
··· 6 6 #include "buckets.h" 7 7 #include "chardev.h" 8 8 #include "disk_accounting.h" 9 + #include "fsck.h" 9 10 #include "journal.h" 10 11 #include "move.h" 11 12 #include "recovery_passes.h" 12 13 #include "replicas.h" 13 - #include "super.h" 14 14 #include "super-io.h" 15 15 #include "thread_with_file.h" 16 16 ··· 126 126 return 0; 127 127 } 128 128 #endif 129 - 130 - struct fsck_thread { 131 - struct thread_with_stdio thr; 132 - struct bch_fs *c; 133 - struct bch_opts opts; 134 - }; 135 - 136 - static void bch2_fsck_thread_exit(struct thread_with_stdio *_thr) 137 - { 138 - struct fsck_thread *thr = container_of(_thr, struct fsck_thread, thr); 139 - kfree(thr); 140 - } 141 - 142 - static int bch2_fsck_offline_thread_fn(struct thread_with_stdio *stdio) 143 - { 144 - struct fsck_thread *thr = container_of(stdio, struct fsck_thread, thr); 145 - struct bch_fs *c = thr->c; 146 - 147 - int ret = PTR_ERR_OR_ZERO(c); 148 - if (ret) 149 - return ret; 150 - 151 - ret = bch2_fs_start(thr->c); 152 - if (ret) 153 - goto err; 154 - 155 - if (test_bit(BCH_FS_errors_fixed, &c->flags)) { 156 - bch2_stdio_redirect_printf(&stdio->stdio, false, "%s: errors fixed\n", c->name); 157 - ret |= 1; 158 - } 159 - if (test_bit(BCH_FS_error, &c->flags)) { 160 - bch2_stdio_redirect_printf(&stdio->stdio, false, "%s: still has errors\n", c->name); 161 - ret |= 4; 162 - } 163 - err: 164 - bch2_fs_stop(c); 165 - return ret; 166 - } 167 - 168 - static const struct thread_with_stdio_ops bch2_offline_fsck_ops = { 169 - .exit = bch2_fsck_thread_exit, 170 - .fn = bch2_fsck_offline_thread_fn, 171 - }; 172 - 173 - static long bch2_ioctl_fsck_offline(struct bch_ioctl_fsck_offline __user *user_arg) 174 - { 175 - struct bch_ioctl_fsck_offline arg; 176 - struct fsck_thread *thr = NULL; 177 - darray_str(devs) = {}; 178 - long ret = 0; 179 - 180 - if (copy_from_user(&arg, user_arg, sizeof(arg))) 181 - return -EFAULT; 182 - 183 - if (arg.flags) 184 - return -EINVAL; 185 - 186 - if (!capable(CAP_SYS_ADMIN)) 
187 - return -EPERM; 188 - 189 - for (size_t i = 0; i < arg.nr_devs; i++) { 190 - u64 dev_u64; 191 - ret = copy_from_user_errcode(&dev_u64, &user_arg->devs[i], sizeof(u64)); 192 - if (ret) 193 - goto err; 194 - 195 - char *dev_str = strndup_user((char __user *)(unsigned long) dev_u64, PATH_MAX); 196 - ret = PTR_ERR_OR_ZERO(dev_str); 197 - if (ret) 198 - goto err; 199 - 200 - ret = darray_push(&devs, dev_str); 201 - if (ret) { 202 - kfree(dev_str); 203 - goto err; 204 - } 205 - } 206 - 207 - thr = kzalloc(sizeof(*thr), GFP_KERNEL); 208 - if (!thr) { 209 - ret = -ENOMEM; 210 - goto err; 211 - } 212 - 213 - thr->opts = bch2_opts_empty(); 214 - 215 - if (arg.opts) { 216 - char *optstr = strndup_user((char __user *)(unsigned long) arg.opts, 1 << 16); 217 - ret = PTR_ERR_OR_ZERO(optstr) ?: 218 - bch2_parse_mount_opts(NULL, &thr->opts, NULL, optstr); 219 - if (!IS_ERR(optstr)) 220 - kfree(optstr); 221 - 222 - if (ret) 223 - goto err; 224 - } 225 - 226 - opt_set(thr->opts, stdio, (u64)(unsigned long)&thr->thr.stdio); 227 - opt_set(thr->opts, read_only, 1); 228 - opt_set(thr->opts, ratelimit_errors, 0); 229 - 230 - /* We need request_key() to be called before we punt to kthread: */ 231 - opt_set(thr->opts, nostart, true); 232 - 233 - bch2_thread_with_stdio_init(&thr->thr, &bch2_offline_fsck_ops); 234 - 235 - thr->c = bch2_fs_open(devs.data, arg.nr_devs, thr->opts); 236 - 237 - if (!IS_ERR(thr->c) && 238 - thr->c->opts.errors == BCH_ON_ERROR_panic) 239 - thr->c->opts.errors = BCH_ON_ERROR_ro; 240 - 241 - ret = __bch2_run_thread_with_stdio(&thr->thr); 242 - out: 243 - darray_for_each(devs, i) 244 - kfree(*i); 245 - darray_exit(&devs); 246 - return ret; 247 - err: 248 - if (thr) 249 - bch2_fsck_thread_exit(&thr->thr); 250 - pr_err("ret %s", bch2_err_str(ret)); 251 - goto out; 252 - } 253 129 254 130 static long bch2_global_ioctl(unsigned cmd, void __user *arg) 255 131 { ··· 648 772 ret = bch2_set_nr_journal_buckets(c, ca, arg.nbuckets); 649 773 650 774 bch2_dev_put(ca); 651 - 
return ret; 652 - } 653 - 654 - static int bch2_fsck_online_thread_fn(struct thread_with_stdio *stdio) 655 - { 656 - struct fsck_thread *thr = container_of(stdio, struct fsck_thread, thr); 657 - struct bch_fs *c = thr->c; 658 - 659 - c->stdio_filter = current; 660 - c->stdio = &thr->thr.stdio; 661 - 662 - /* 663 - * XXX: can we figure out a way to do this without mucking with c->opts? 664 - */ 665 - unsigned old_fix_errors = c->opts.fix_errors; 666 - if (opt_defined(thr->opts, fix_errors)) 667 - c->opts.fix_errors = thr->opts.fix_errors; 668 - else 669 - c->opts.fix_errors = FSCK_FIX_ask; 670 - 671 - c->opts.fsck = true; 672 - set_bit(BCH_FS_fsck_running, &c->flags); 673 - 674 - c->curr_recovery_pass = BCH_RECOVERY_PASS_check_alloc_info; 675 - int ret = bch2_run_online_recovery_passes(c); 676 - 677 - clear_bit(BCH_FS_fsck_running, &c->flags); 678 - bch_err_fn(c, ret); 679 - 680 - c->stdio = NULL; 681 - c->stdio_filter = NULL; 682 - c->opts.fix_errors = old_fix_errors; 683 - 684 - up(&c->online_fsck_mutex); 685 - bch2_ro_ref_put(c); 686 - return ret; 687 - } 688 - 689 - static const struct thread_with_stdio_ops bch2_online_fsck_ops = { 690 - .exit = bch2_fsck_thread_exit, 691 - .fn = bch2_fsck_online_thread_fn, 692 - }; 693 - 694 - static long bch2_ioctl_fsck_online(struct bch_fs *c, 695 - struct bch_ioctl_fsck_online arg) 696 - { 697 - struct fsck_thread *thr = NULL; 698 - long ret = 0; 699 - 700 - if (arg.flags) 701 - return -EINVAL; 702 - 703 - if (!capable(CAP_SYS_ADMIN)) 704 - return -EPERM; 705 - 706 - if (!bch2_ro_ref_tryget(c)) 707 - return -EROFS; 708 - 709 - if (down_trylock(&c->online_fsck_mutex)) { 710 - bch2_ro_ref_put(c); 711 - return -EAGAIN; 712 - } 713 - 714 - thr = kzalloc(sizeof(*thr), GFP_KERNEL); 715 - if (!thr) { 716 - ret = -ENOMEM; 717 - goto err; 718 - } 719 - 720 - thr->c = c; 721 - thr->opts = bch2_opts_empty(); 722 - 723 - if (arg.opts) { 724 - char *optstr = strndup_user((char __user *)(unsigned long) arg.opts, 1 << 16); 725 - 726 - ret 
= PTR_ERR_OR_ZERO(optstr) ?: 727 - bch2_parse_mount_opts(c, &thr->opts, NULL, optstr); 728 - if (!IS_ERR(optstr)) 729 - kfree(optstr); 730 - 731 - if (ret) 732 - goto err; 733 - } 734 - 735 - ret = bch2_run_thread_with_stdio(&thr->thr, &bch2_online_fsck_ops); 736 - err: 737 - if (ret < 0) { 738 - bch_err_fn(c, ret); 739 - if (thr) 740 - bch2_fsck_thread_exit(&thr->thr); 741 - up(&c->online_fsck_mutex); 742 - bch2_ro_ref_put(c); 743 - } 744 775 return ret; 745 776 } 746 777
+8 -2
fs/bcachefs/checksum.c
@@ -2,6 +2,7 @@
 #include "bcachefs.h"
 #include "checksum.h"
 #include "errcode.h"
+#include "error.h"
 #include "super.h"
 #include "super-io.h"
 
@@ -252,6 +253,10 @@
 	if (!bch2_csum_type_is_encryption(type))
 		return 0;
 
+	if (bch2_fs_inconsistent_on(!c->chacha20,
+				    c, "attempting to encrypt without encryption key"))
+		return -BCH_ERR_no_encryption_key;
+
 	return do_encrypt(c->chacha20, nonce, data, len);
 }
 
@@ -337,8 +342,9 @@
 	size_t sgl_len = 0;
 	int ret = 0;
 
-	if (!bch2_csum_type_is_encryption(type))
-		return 0;
+	if (bch2_fs_inconsistent_on(!c->chacha20,
+				    c, "attempting to encrypt without encryption key"))
+		return -BCH_ERR_no_encryption_key;
 
 	darray_init(&sgl);
 
+1 -1
fs/bcachefs/checksum.h
@@ -109,7 +109,7 @@
 void bch2_fs_encryption_exit(struct bch_fs *);
 int bch2_fs_encryption_init(struct bch_fs *);
 
-static inline enum bch_csum_type bch2_csum_opt_to_type(enum bch_csum_opts type,
+static inline enum bch_csum_type bch2_csum_opt_to_type(enum bch_csum_opt type,
 						       bool data)
 {
 	switch (type) {
+67 -29
fs/bcachefs/compress.c
@@ -2,12 +2,32 @@
 #include "bcachefs.h"
 #include "checksum.h"
 #include "compress.h"
+#include "error.h"
 #include "extents.h"
+#include "opts.h"
 #include "super-io.h"
 
 #include <linux/lz4.h>
 #include <linux/zlib.h>
 #include <linux/zstd.h>
+
+static inline enum bch_compression_opts bch2_compression_type_to_opt(enum bch_compression_type type)
+{
+	switch (type) {
+	case BCH_COMPRESSION_TYPE_none:
+	case BCH_COMPRESSION_TYPE_incompressible:
+		return BCH_COMPRESSION_OPT_none;
+	case BCH_COMPRESSION_TYPE_lz4_old:
+	case BCH_COMPRESSION_TYPE_lz4:
+		return BCH_COMPRESSION_OPT_lz4;
+	case BCH_COMPRESSION_TYPE_gzip:
+		return BCH_COMPRESSION_OPT_gzip;
+	case BCH_COMPRESSION_TYPE_zstd:
+		return BCH_COMPRESSION_OPT_zstd;
+	default:
+		BUG();
+	}
+}
 
 /* Bounce buffer: */
 struct bbuf {
@@ -158,6 +178,19 @@
 	void *workspace;
 	int ret;
 
+	enum bch_compression_opts opt = bch2_compression_type_to_opt(crc.compression_type);
+	mempool_t *workspace_pool = &c->compress_workspace[opt];
+	if (unlikely(!mempool_initialized(workspace_pool))) {
+		if (fsck_err(c, compression_type_not_marked_in_sb,
+			     "compression type %s set but not marked in superblock",
+			     __bch2_compression_types[crc.compression_type]))
+			ret = bch2_check_set_has_compressed_data(c, opt);
+		else
+			ret = -BCH_ERR_compression_workspace_not_initialized;
+		if (ret)
+			goto out;
+	}
+
 	src_data = bio_map_or_bounce(c, src, READ);
 
 	switch (crc.compression_type) {
@@ -176,13 +209,13 @@
 			.avail_out = dst_len,
 		};
 
-		workspace = mempool_alloc(&c->decompress_workspace, GFP_NOFS);
+		workspace = mempool_alloc(workspace_pool, GFP_NOFS);
 
 		zlib_set_workspace(&strm, workspace);
 		zlib_inflateInit2(&strm, -MAX_WBITS);
 		ret = zlib_inflate(&strm, Z_FINISH);
 
-		mempool_free(workspace, &c->decompress_workspace);
+		mempool_free(workspace, workspace_pool);
 
 		if (ret != Z_STREAM_END)
 			goto err;
@@ -195,14 +228,14 @@
 		if (real_src_len > src_len - 4)
 			goto err;
 
-		workspace = mempool_alloc(&c->decompress_workspace, GFP_NOFS);
+		workspace = mempool_alloc(workspace_pool, GFP_NOFS);
 		ctx = zstd_init_dctx(workspace, zstd_dctx_workspace_bound());
 
 		ret = zstd_decompress_dctx(ctx,
 				dst_data, dst_len,
 				src_data.b + 4, real_src_len);
 
-		mempool_free(workspace, &c->decompress_workspace);
+		mempool_free(workspace, workspace_pool);
 
 		if (ret != dst_len)
 			goto err;
@@ -212,6 +245,7 @@
 		BUG();
 	}
 	ret = 0;
+fsck_err:
 out:
 	bio_unmap_or_unbounce(c, src_data);
 	return ret;
@@ -394,8 +428,21 @@
 	unsigned pad;
 	int ret = 0;
 
-	BUG_ON(compression_type >= BCH_COMPRESSION_TYPE_NR);
-	BUG_ON(!mempool_initialized(&c->compress_workspace[compression_type]));
+	/* bch2_compression_decode catches unknown compression types: */
+	BUG_ON(compression.type >= BCH_COMPRESSION_OPT_NR);
+
+	mempool_t *workspace_pool = &c->compress_workspace[compression.type];
+	if (unlikely(!mempool_initialized(workspace_pool))) {
+		if (fsck_err(c, compression_opt_not_marked_in_sb,
+			     "compression opt %s set but not marked in superblock",
+			     bch2_compression_opts[compression.type])) {
+			ret = bch2_check_set_has_compressed_data(c, compression.type);
+			if (ret) /* memory allocation failure, don't compress */
+				return 0;
+		} else {
+			return 0;
+		}
+	}
 
 	/* If it's only one block, don't bother trying to compress: */
 	if (src->bi_iter.bi_size <= c->opts.block_size)
@@ -404,7 +451,7 @@
 	dst_data = bio_map_or_bounce(c, dst, WRITE);
 	src_data = bio_map_or_bounce(c, src, READ);
 
-	workspace = mempool_alloc(&c->compress_workspace[compression_type], GFP_NOFS);
+	workspace = mempool_alloc(workspace_pool, GFP_NOFS);
 
 	*src_len = src->bi_iter.bi_size;
 	*dst_len = dst->bi_iter.bi_size;
@@ -447,7 +494,7 @@
 		*src_len = round_down(*src_len, block_bytes(c));
 	}
 
-	mempool_free(workspace, &c->compress_workspace[compression_type]);
+	mempool_free(workspace, workspace_pool);
 
 	if (ret)
 		goto err;
@@ -476,6 +523,9 @@
 	return ret;
 err:
 	ret = BCH_COMPRESSION_TYPE_incompressible;
+	goto out;
+fsck_err:
+	ret = 0;
 	goto out;
 }
 
@@ -559,7 +609,6 @@
 {
 	unsigned i;
 
-	mempool_exit(&c->decompress_workspace);
 	for (i = 0; i < ARRAY_SIZE(c->compress_workspace); i++)
 		mempool_exit(&c->compress_workspace[i]);
 	mempool_exit(&c->compression_bounce[WRITE]);
@@ -568,7 +617,6 @@
 
 static int __bch2_fs_compress_init(struct bch_fs *c, u64 features)
 {
-	size_t decompress_workspace_size = 0;
 	ZSTD_parameters params = zstd_get_params(zstd_max_clevel(),
 						 c->opts.encoded_extent_max);
 
@@ -576,19 +624,17 @@
 
 	struct {
 		unsigned feature;
-		enum bch_compression_type type;
+		enum bch_compression_opts type;
 		size_t compress_workspace;
-		size_t decompress_workspace;
 	} compression_types[] = {
-		{ BCH_FEATURE_lz4, BCH_COMPRESSION_TYPE_lz4,
-			max_t(size_t, LZ4_MEM_COMPRESS, LZ4HC_MEM_COMPRESS),
-			0 },
-		{ BCH_FEATURE_gzip, BCH_COMPRESSION_TYPE_gzip,
-			zlib_deflate_workspacesize(MAX_WBITS, DEF_MEM_LEVEL),
-			zlib_inflate_workspacesize(), },
-		{ BCH_FEATURE_zstd, BCH_COMPRESSION_TYPE_zstd,
-			c->zstd_workspace_size,
-			zstd_dctx_workspace_bound() },
+		{ BCH_FEATURE_lz4, BCH_COMPRESSION_OPT_lz4,
+			max_t(size_t, LZ4_MEM_COMPRESS, LZ4HC_MEM_COMPRESS) },
+		{ BCH_FEATURE_gzip, BCH_COMPRESSION_OPT_gzip,
+			max(zlib_deflate_workspacesize(MAX_WBITS, DEF_MEM_LEVEL),
+			    zlib_inflate_workspacesize()) },
+		{ BCH_FEATURE_zstd, BCH_COMPRESSION_OPT_zstd,
+			max(c->zstd_workspace_size,
+			    zstd_dctx_workspace_bound()) },
 	}, *i;
 	bool have_compressed = false;
 
@@ -613,9 +659,6 @@
 	for (i = compression_types;
 	     i < compression_types + ARRAY_SIZE(compression_types);
 	     i++) {
-		decompress_workspace_size =
-			max(decompress_workspace_size, i->decompress_workspace);
-
 		if (!(features & (1 << i->feature)))
 			continue;
 
@@ -627,11 +670,6 @@
 						1, i->compress_workspace))
 			return -BCH_ERR_ENOMEM_compression_workspace_init;
 	}
-
-	if (!mempool_initialized(&c->decompress_workspace) &&
-	    mempool_init_kvmalloc_pool(&c->decompress_workspace,
-				       1, decompress_workspace_size))
-		return -BCH_ERR_ENOMEM_decompression_workspace_init;
 
 	return 0;
 }
+1 -1
fs/bcachefs/darray.h
@@ -83,7 +83,7 @@
 	for (typeof(&(_d).data[0]) _i = (_d).data; _i < (_d).data + (_d).nr; _i++)
 
 #define darray_for_each_reverse(_d, _i) \
-	for (typeof(&(_d).data[0]) _i = (_d).data + (_d).nr - 1; _i >= (_d).data; --_i)
+	for (typeof(&(_d).data[0]) _i = (_d).data + (_d).nr - 1; _i >= (_d).data && (_d).nr; --_i)
 
 #define darray_init(_d) \
 do { \
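Reviewer note: the added `&& (_d).nr` term keeps the reverse loop from running on an empty array, where the starting pointer `data + nr - 1` would sit before the first element. The sketch below shows the same guard in an index-based form that never forms such a pointer. It is an illustrative equivalent, not the kernel macro, and `find_last` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: walk an array back to front and return the index of
 * the last element equal to value, or -1 if there is none. An empty array
 * (nr == 0, data possibly NULL) runs zero iterations.
 */
static long find_last(const int *data, size_t nr, int value)
{
	assert(data != NULL || nr == 0);

	/* i-- > 0 tests before stepping back, so index 0 is visited last. */
	for (size_t i = nr; i-- > 0;)
		if (data[i] == value)
			return (long) i;

	return -1;
}
```

Testing the count before decrementing is what makes the empty case safe. The pointer-based macro reaches the same result by adding the explicit `nr` check to its loop condition.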
+37 -39
fs/bcachefs/data_update.c
@@ -110,11 +110,8 @@
 {
 	struct bch_fs *c = m->op.c;
 	struct bkey_s_c old = bkey_i_to_s_c(m->k.k);
-	const union bch_extent_entry *entry;
-	struct bch_extent_ptr *ptr;
-	struct extent_ptr_decoded p;
 	struct printbuf buf = PRINTBUF;
-	unsigned i, rewrites_found = 0;
+	unsigned rewrites_found = 0;
 
 	if (!trace_move_extent_fail_enabled())
 		return;
@@ -122,27 +119,25 @@
 	prt_str(&buf, msg);
 
 	if (insert) {
-		i = 0;
+		const union bch_extent_entry *entry;
+		struct bch_extent_ptr *ptr;
+		struct extent_ptr_decoded p;
+
+		unsigned ptr_bit = 1;
 		bkey_for_each_ptr_decode(old.k, bch2_bkey_ptrs_c(old), p, entry) {
-			if (((1U << i) & m->data_opts.rewrite_ptrs) &&
+			if ((ptr_bit & m->data_opts.rewrite_ptrs) &&
 			    (ptr = bch2_extent_has_ptr(old, p, bkey_i_to_s(insert))) &&
 			    !ptr->cached)
-				rewrites_found |= 1U << i;
-			i++;
+				rewrites_found |= ptr_bit;
+			ptr_bit <<= 1;
 		}
 	}
 
-	prt_printf(&buf, "\nrewrite ptrs: %u%u%u%u",
-		   (m->data_opts.rewrite_ptrs & (1 << 0)) != 0,
-		   (m->data_opts.rewrite_ptrs & (1 << 1)) != 0,
-		   (m->data_opts.rewrite_ptrs & (1 << 2)) != 0,
-		   (m->data_opts.rewrite_ptrs & (1 << 3)) != 0);
+	prt_str(&buf, "rewrites found:\t");
+	bch2_prt_u64_base2(&buf, rewrites_found);
+	prt_newline(&buf);
 
-	prt_printf(&buf, "\nrewrites found: %u%u%u%u",
-		   (rewrites_found & (1 << 0)) != 0,
-		   (rewrites_found & (1 << 1)) != 0,
-		   (rewrites_found & (1 << 2)) != 0,
-		   (rewrites_found & (1 << 3)) != 0);
+	bch2_data_update_opts_to_text(&buf, c, &m->op.opts, &m->data_opts);
 
 	prt_str(&buf, "\nold: ");
 	bch2_bkey_val_to_text(&buf, c, old);
@@ -194,7 +189,7 @@
 	struct bpos next_pos;
 	bool should_check_enospc;
 	s64 i_sectors_delta = 0, disk_sectors_delta = 0;
-	unsigned rewrites_found = 0, durability, i;
+	unsigned rewrites_found = 0, durability, ptr_bit;
 
 	bch2_trans_begin(trans);
 
@@ -231,16 +226,16 @@
 	 *
 	 * Fist, drop rewrite_ptrs from @new:
 	 */
-	i = 0;
+	ptr_bit = 1;
 	bkey_for_each_ptr_decode(old.k, bch2_bkey_ptrs_c(old), p, entry_c) {
-		if (((1U << i) & m->data_opts.rewrite_ptrs) &&
+		if ((ptr_bit & m->data_opts.rewrite_ptrs) &&
 		    (ptr = bch2_extent_has_ptr(old, p, bkey_i_to_s(insert))) &&
 		    !ptr->cached) {
 			bch2_extent_ptr_set_cached(c, &m->op.opts,
 						   bkey_i_to_s(insert), ptr);
-			rewrites_found |= 1U << i;
+			rewrites_found |= ptr_bit;
 		}
-		i++;
+		ptr_bit <<= 1;
 	}
 
 	if (m->data_opts.rewrite_ptrs &&
@@ -323,8 +318,11 @@
 	 * it's been hard to reproduce, so this should give us some more
 	 * information when it does occur:
 	 */
-	int invalid = bch2_bkey_validate(c, bkey_i_to_s_c(insert), __btree_node_type(0, m->btree_id),
-					 BCH_VALIDATE_commit);
+	int invalid = bch2_bkey_validate(c, bkey_i_to_s_c(insert),
+					 (struct bkey_validate_context) {
+						.btree = m->btree_id,
+						.flags = BCH_VALIDATE_commit,
+					 });
 	if (invalid) {
 		struct printbuf buf = PRINTBUF;
 
@@ -362,7 +360,7 @@
 				k.k->p, bkey_start_pos(&insert->k)) ?:
 		bch2_insert_snapshot_whiteouts(trans, m->btree_id,
 				k.k->p, insert->k.p) ?:
-		bch2_bkey_set_needs_rebalance(c, insert, &op->opts) ?:
+		bch2_bkey_set_needs_rebalance(c, &op->opts, insert) ?:
 		bch2_trans_update(trans, &iter, insert,
 				BTREE_UPDATE_internal_snapshot_node) ?:
 		bch2_trans_commit(trans, &op->res,
@@ -540,7 +538,7 @@
 	prt_newline(out);
 
 	prt_str(out, "compression:\t");
-	bch2_compression_opt_to_text(out, background_compression(*io_opts));
+	bch2_compression_opt_to_text(out, io_opts->background_compression);
 	prt_newline(out);
 
 	prt_str(out, "opts.replicas:\t");
@@ -614,7 +612,7 @@
 	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
 	const union bch_extent_entry *entry;
 	struct extent_ptr_decoded p;
-	unsigned i, reserve_sectors = k.k->size * data_opts.extra_replicas;
+	unsigned reserve_sectors = k.k->size * data_opts.extra_replicas;
 	int ret = 0;
 
 	/*
@@ -622,7 +620,7 @@
 	 * and we have to check for this because we go rw before repairing the
 	 * snapshots table - just skip it, we can move it later.
 	 */
-	if (unlikely(k.k->p.snapshot && !bch2_snapshot_equiv(c, k.k->p.snapshot)))
+	if (unlikely(k.k->p.snapshot && !bch2_snapshot_exists(c, k.k->p.snapshot)))
 		return -BCH_ERR_data_update_done;
 
 	if (!bkey_get_dev_refs(c, k))
@@ -652,22 +650,22 @@
 		BCH_WRITE_DATA_ENCODED|
 		BCH_WRITE_MOVE|
 		m->data_opts.write_flags;
-	m->op.compression_opt = background_compression(io_opts);
+	m->op.compression_opt = io_opts.background_compression;
 	m->op.watermark = m->data_opts.btree_insert_flags & BCH_WATERMARK_MASK;
 
 	unsigned durability_have = 0, durability_removing = 0;
 
-	i = 0;
+	unsigned ptr_bit = 1;
 	bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
 		if (!p.ptr.cached) {
 			rcu_read_lock();
-			if (BIT(i) & m->data_opts.rewrite_ptrs) {
+			if (ptr_bit & m->data_opts.rewrite_ptrs) {
 				if (crc_is_compressed(p.crc))
 					reserve_sectors += k.k->size;
 
 				m->op.nr_replicas += bch2_extent_ptr_desired_durability(c, &p);
 				durability_removing += bch2_extent_ptr_desired_durability(c, &p);
-			} else if (!(BIT(i) & m->data_opts.kill_ptrs)) {
+			} else if (!(ptr_bit & m->data_opts.kill_ptrs)) {
 				bch2_dev_list_add_dev(&m->op.devs_have, p.ptr.dev);
 				durability_have += bch2_extent_ptr_durability(c, &p);
 			}
@@ -687,7 +685,7 @@
 		if (p.crc.compression_type == BCH_COMPRESSION_TYPE_incompressible)
 			m->op.incompressible = true;
 
-		i++;
+		ptr_bit <<= 1;
 	}
 
 	unsigned durability_required = max(0, (int) (io_opts.data_replicas - durability_have));
@@ -750,14 +748,14 @@
 void bch2_data_update_opts_normalize(struct bkey_s_c k, struct data_update_opts *opts)
 {
 	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
-	unsigned i = 0;
+	unsigned ptr_bit = 1;
 
 	bkey_for_each_ptr(ptrs, ptr) {
-		if ((opts->rewrite_ptrs & (1U << i)) && ptr->cached) {
-			opts->kill_ptrs |= 1U << i;
-			opts->rewrite_ptrs ^= 1U << i;
+		if ((opts->rewrite_ptrs & ptr_bit) && ptr->cached) {
+			opts->kill_ptrs |= ptr_bit;
+			opts->rewrite_ptrs ^= ptr_bit;
 		}
 
-		i++;
+		ptr_bit <<= 1;
 	}
 }
+3 -1
fs/bcachefs/debug.c
@@ -472,7 +472,9 @@
 	if (!out->nr_tabstops)
 		printbuf_tabstop_push(out, 32);
 
-	prt_printf(out, "%px btree=%s l=%u\n", b, bch2_btree_id_str(b->c.btree_id), b->c.level);
+	prt_printf(out, "%px ", b);
+	bch2_btree_id_level_to_text(out, b->c.btree_id, b->c.level);
+	prt_printf(out, "\n");
 
 	printbuf_indent_add(out, 2);
 
+5 -5
fs/bcachefs/dirent.c
@@ -101,7 +101,7 @@
 };
 
 int bch2_dirent_validate(struct bch_fs *c, struct bkey_s_c k,
-			 enum bch_validate_flags flags)
+			 struct bkey_validate_context from)
 {
 	struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
 	struct qstr d_name = bch2_dirent_get_name(d);
@@ -120,7 +120,7 @@
 	 * Check new keys don't exceed the max length
 	 * (older keys may be larger.)
 	 */
-	bkey_fsck_err_on((flags & BCH_VALIDATE_commit) && d_name.len > BCH_NAME_MAX,
+	bkey_fsck_err_on((from.flags & BCH_VALIDATE_commit) && d_name.len > BCH_NAME_MAX,
 			 c, dirent_name_too_long,
 			 "dirent name too big (%u > %u)",
 			 d_name.len, BCH_NAME_MAX);
@@ -266,7 +266,7 @@
 	} else {
 		target->subvol = le32_to_cpu(d.v->d_child_subvol);
 
-		ret = bch2_subvolume_get(trans, target->subvol, true, BTREE_ITER_cached, &s);
+		ret = bch2_subvolume_get(trans, target->subvol, true, &s);
 
 		target->inum = le64_to_cpu(s.inode);
 	}
@@ -500,7 +500,7 @@
 	struct bkey_s_c k;
 	int ret;
 
-	for_each_btree_key_upto_norestart(trans, iter, BTREE_ID_dirents,
+	for_each_btree_key_max_norestart(trans, iter, BTREE_ID_dirents,
 			   SPOS(dir, 0, snapshot),
 			   POS(dir, U64_MAX), 0, k, ret)
 		if (k.k->type == KEY_TYPE_dirent) {
@@ -549,7 +549,7 @@
 	bch2_bkey_buf_init(&sk);
 
 	int ret = bch2_trans_run(c,
-		for_each_btree_key_in_subvolume_upto(trans, iter, BTREE_ID_dirents,
+		for_each_btree_key_in_subvolume_max(trans, iter, BTREE_ID_dirents,
 				   POS(inum.inum, ctx->pos),
 				   POS(inum.inum, U64_MAX),
 				   inum.subvol, 0, k, ({
+7 -2
fs/bcachefs/dirent.h
@@ -4,10 +4,10 @@
 
 #include "str_hash.h"
 
-enum bch_validate_flags;
 extern const struct bch_hash_desc bch2_dirent_hash_desc;
 
-int bch2_dirent_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_dirent_validate(struct bch_fs *, struct bkey_s_c,
+			 struct bkey_validate_context);
 void bch2_dirent_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 
 #define bch2_bkey_ops_dirent ((struct bkey_ops) { \
@@ -29,6 +29,11 @@
 {
 	return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
 			    sizeof(u64));
+}
+
+static inline unsigned int dirent_occupied_size(const struct qstr *name)
+{
+	return (BKEY_U64s + dirent_val_u64s(name->len)) * sizeof(u64);
 }
 
 int bch2_dirent_read_target(struct btree_trans *, subvol_inum,
+93 -57
fs/bcachefs/disk_accounting.c
··· 79 79 memcpy_u64s_small(acc->v.d, d, nr); 80 80 } 81 81 82 + static int bch2_accounting_update_sb_one(struct bch_fs *, struct bpos); 83 + 82 84 int bch2_disk_accounting_mod(struct btree_trans *trans, 83 85 struct disk_accounting_pos *k, 84 86 s64 *d, unsigned nr, bool gc) ··· 98 96 99 97 accounting_key_init(&k_i.k, k, d, nr); 100 98 101 - return likely(!gc) 102 - ? bch2_trans_update_buffered(trans, BTREE_ID_accounting, &k_i.k) 103 - : bch2_accounting_mem_add(trans, bkey_i_to_s_c_accounting(&k_i.k), true); 99 + if (unlikely(gc)) { 100 + int ret = bch2_accounting_mem_add(trans, bkey_i_to_s_c_accounting(&k_i.k), true); 101 + if (ret == -BCH_ERR_btree_insert_need_mark_replicas) 102 + ret = drop_locks_do(trans, 103 + bch2_accounting_update_sb_one(trans->c, disk_accounting_pos_to_bpos(k))) ?: 104 + bch2_accounting_mem_add(trans, bkey_i_to_s_c_accounting(&k_i.k), true); 105 + return ret; 106 + } else { 107 + return bch2_trans_update_buffered(trans, BTREE_ID_accounting, &k_i.k); 108 + } 104 109 } 105 110 106 111 int bch2_mod_dev_cached_sectors(struct btree_trans *trans, ··· 136 127 #define field_end(p, member) (((void *) (&p.member)) + sizeof(p.member)) 137 128 138 129 int bch2_accounting_validate(struct bch_fs *c, struct bkey_s_c k, 139 - enum bch_validate_flags flags) 130 + struct bkey_validate_context from) 140 131 { 141 132 struct disk_accounting_pos acc_k; 142 133 bpos_to_disk_accounting_pos(&acc_k, k.k->p); 143 134 void *end = &acc_k + 1; 144 135 int ret = 0; 145 136 146 - bkey_fsck_err_on(bversion_zero(k.k->bversion), 137 + bkey_fsck_err_on((from.flags & BCH_VALIDATE_commit) && 138 + bversion_zero(k.k->bversion), 147 139 c, accounting_key_version_0, 148 140 "accounting key with version=0"); 149 141 ··· 227 217 prt_printf(out, "id=%u", k->snapshot.id); 228 218 break; 229 219 case BCH_DISK_ACCOUNTING_btree: 230 - prt_printf(out, "btree=%s", bch2_btree_id_str(k->btree.id)); 220 + prt_str(out, "btree="); 221 + bch2_btree_id_to_text(out, k->btree.id); 231 222 break; 
232 223 } 233 224 } ··· 254 243 } 255 244 256 245 static inline void __accounting_to_replicas(struct bch_replicas_entry_v1 *r, 257 - struct disk_accounting_pos acc) 246 + struct disk_accounting_pos *acc) 258 247 { 259 - unsafe_memcpy(r, &acc.replicas, 260 - replicas_entry_bytes(&acc.replicas), 248 + unsafe_memcpy(r, &acc->replicas, 249 + replicas_entry_bytes(&acc->replicas), 261 250 "variable length struct"); 262 251 } 263 252 ··· 268 257 269 258 switch (acc_k.type) { 270 259 case BCH_DISK_ACCOUNTING_replicas: 271 - __accounting_to_replicas(r, acc_k); 260 + __accounting_to_replicas(r, &acc_k); 272 261 return true; 273 262 default: 274 263 return false; ··· 333 322 334 323 eytzinger0_sort(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]), 335 324 accounting_pos_cmp, NULL); 325 + 326 + if (trace_accounting_mem_insert_enabled()) { 327 + struct printbuf buf = PRINTBUF; 328 + 329 + bch2_accounting_to_text(&buf, c, a.s_c); 330 + trace_accounting_mem_insert(c, buf.buf); 331 + printbuf_exit(&buf); 332 + } 336 333 return 0; 337 334 err: 338 335 free_percpu(n.v[1]); ··· 478 459 if (ret) 479 460 darray_exit(out_buf); 480 461 return ret; 481 - } 482 - 483 - void bch2_fs_accounting_to_text(struct printbuf *out, struct bch_fs *c) 484 - { 485 - struct bch_accounting_mem *acc = &c->accounting; 486 - 487 - percpu_down_read(&c->mark_lock); 488 - out->atomic++; 489 - 490 - eytzinger0_for_each(i, acc->k.nr) { 491 - struct disk_accounting_pos acc_k; 492 - bpos_to_disk_accounting_pos(&acc_k, acc->k.data[i].pos); 493 - 494 - bch2_accounting_key_to_text(out, &acc_k); 495 - 496 - u64 v[BCH_ACCOUNTING_MAX_COUNTERS]; 497 - bch2_accounting_mem_read_counters(acc, i, v, ARRAY_SIZE(v), false); 498 - 499 - prt_str(out, ":"); 500 - for (unsigned j = 0; j < acc->k.data[i].nr_counters; j++) 501 - prt_printf(out, " %llu", v[j]); 502 - prt_newline(out); 503 - } 504 - 505 - --out->atomic; 506 - percpu_up_read(&c->mark_lock); 507 462 } 508 463 509 464 static void bch2_accounting_free_counters(struct 
bch_accounting_mem *acc, bool gc) ··· 618 625 switch (acc.type) { 619 626 case BCH_DISK_ACCOUNTING_replicas: { 620 627 struct bch_replicas_padded r; 621 - __accounting_to_replicas(&r.e, acc); 628 + __accounting_to_replicas(&r.e, &acc); 622 629 623 630 for (unsigned i = 0; i < r.e.nr_devs; i++) 624 631 if (r.e.devs[i] != BCH_SB_MEMBER_INVALID && ··· 692 699 struct btree_trans *trans = bch2_trans_get(c); 693 700 struct printbuf buf = PRINTBUF; 694 701 695 - int ret = for_each_btree_key(trans, iter, 696 - BTREE_ID_accounting, POS_MIN, 702 + /* 703 + * We might run more than once if we rewind to start topology repair or 704 + * btree node scan - and those might cause us to get different results, 705 + * so we can't just skip if we've already run. 706 + * 707 + * Instead, zero out any accounting we have: 708 + */ 709 + percpu_down_write(&c->mark_lock); 710 + darray_for_each(acc->k, e) 711 + percpu_memset(e->v[0], 0, sizeof(u64) * e->nr_counters); 712 + for_each_member_device(c, ca) 713 + percpu_memset(ca->usage, 0, sizeof(*ca->usage)); 714 + percpu_memset(c->usage, 0, sizeof(*c->usage)); 715 + percpu_up_write(&c->mark_lock); 716 + 717 + struct btree_iter iter; 718 + bch2_trans_iter_init(trans, &iter, BTREE_ID_accounting, POS_MIN, 719 + BTREE_ITER_prefetch|BTREE_ITER_all_snapshots); 720 + iter.flags &= ~BTREE_ITER_with_journal; 721 + int ret = for_each_btree_key_continue(trans, iter, 697 722 BTREE_ITER_prefetch|BTREE_ITER_all_snapshots, k, ({ 698 723 struct bkey u; 699 724 struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, &iter), &u); 725 + 726 + if (k.k->type != KEY_TYPE_accounting) 727 + continue; 728 + 729 + struct disk_accounting_pos acc_k; 730 + bpos_to_disk_accounting_pos(&acc_k, k.k->p); 731 + 732 + if (acc_k.type >= BCH_DISK_ACCOUNTING_TYPE_NR) 733 + break; 734 + 735 + if (!bch2_accounting_is_mem(acc_k)) { 736 + struct disk_accounting_pos next = { .type = acc_k.type + 1 }; 737 + bch2_btree_iter_set_pos(&iter, 
disk_accounting_pos_to_bpos(&next)); 738 + continue; 739 + } 740 + 700 741 accounting_read_key(trans, k); 701 742 })); 702 743 if (ret) ··· 742 715 743 716 darray_for_each(*keys, i) { 744 717 if (i->k->k.type == KEY_TYPE_accounting) { 718 + struct disk_accounting_pos acc_k; 719 + bpos_to_disk_accounting_pos(&acc_k, i->k->k.p); 720 + 721 + if (!bch2_accounting_is_mem(acc_k)) 722 + continue; 723 + 745 724 struct bkey_s_c k = bkey_i_to_s_c(i->k); 746 725 unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr, 747 726 sizeof(acc->k.data[0]), ··· 781 748 keys->gap = keys->nr = dst - keys->data; 782 749 783 750 percpu_down_write(&c->mark_lock); 784 - unsigned i = 0; 785 - while (i < acc->k.nr) { 786 - unsigned idx = inorder_to_eytzinger0(i, acc->k.nr); 787 751 752 + darray_for_each_reverse(acc->k, i) { 788 753 struct disk_accounting_pos acc_k; 789 - bpos_to_disk_accounting_pos(&acc_k, acc->k.data[idx].pos); 754 + bpos_to_disk_accounting_pos(&acc_k, i->pos); 790 755 791 756 u64 v[BCH_ACCOUNTING_MAX_COUNTERS]; 792 - bch2_accounting_mem_read_counters(acc, idx, v, ARRAY_SIZE(v), false); 757 + memset(v, 0, sizeof(v)); 758 + 759 + for (unsigned j = 0; j < i->nr_counters; j++) 760 + v[j] = percpu_u64_get(i->v[0] + j); 793 761 794 762 /* 795 763 * If the entry counters are zeroed, it should be treated as ··· 799 765 * Remove it, so that if it's re-added it gets re-marked in the 800 766 * superblock: 801 767 */ 802 - ret = bch2_is_zero(v, sizeof(v[0]) * acc->k.data[idx].nr_counters) 768 + ret = bch2_is_zero(v, sizeof(v[0]) * i->nr_counters) 803 769 ? 
-BCH_ERR_remove_disk_accounting_entry 804 - : bch2_disk_accounting_validate_late(trans, acc_k, 805 - v, acc->k.data[idx].nr_counters); 770 + : bch2_disk_accounting_validate_late(trans, acc_k, v, i->nr_counters); 806 771 807 772 if (ret == -BCH_ERR_remove_disk_accounting_entry) { 808 - free_percpu(acc->k.data[idx].v[0]); 809 - free_percpu(acc->k.data[idx].v[1]); 810 - darray_remove_item(&acc->k, &acc->k.data[idx]); 811 - eytzinger0_sort(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]), 812 - accounting_pos_cmp, NULL); 773 + free_percpu(i->v[0]); 774 + free_percpu(i->v[1]); 775 + darray_remove_item(&acc->k, i); 813 776 ret = 0; 814 777 continue; 815 778 } 816 779 817 780 if (ret) 818 781 goto fsck_err; 819 - i++; 820 782 } 783 + 784 + eytzinger0_sort(acc->k.data, acc->k.nr, sizeof(acc->k.data[0]), 785 + accounting_pos_cmp, NULL); 821 786 822 787 preempt_disable(); 823 788 struct bch_fs_usage_base *usage = this_cpu_ptr(c->usage); ··· 837 804 break; 838 805 case BCH_DISK_ACCOUNTING_dev_data_type: 839 806 rcu_read_lock(); 840 - struct bch_dev *ca = bch2_dev_rcu(c, k.dev_data_type.dev); 807 + struct bch_dev *ca = bch2_dev_rcu_noerror(c, k.dev_data_type.dev); 841 808 if (ca) { 842 809 struct bch_dev_usage_type __percpu *d = &ca->usage->d[k.dev_data_type.data_type]; 843 810 percpu_u64_set(&d->buckets, v[0]); ··· 914 881 bpos_to_disk_accounting_pos(&acc_k, k.k->p); 915 882 916 883 if (acc_k.type >= BCH_DISK_ACCOUNTING_TYPE_NR) 917 - continue; 884 + break; 918 885 919 - if (acc_k.type == BCH_DISK_ACCOUNTING_inum) 886 + if (!bch2_accounting_is_mem(acc_k)) { 887 + struct disk_accounting_pos next = { .type = acc_k.type + 1 }; 888 + bch2_btree_iter_set_pos(&iter, disk_accounting_pos_to_bpos(&next)); 920 889 continue; 890 + } 921 891 922 892 bch2_accounting_mem_read(c, k.k->p, v, nr); 923 893 ··· 946 910 break; 947 911 case BCH_DISK_ACCOUNTING_dev_data_type: { 948 912 rcu_read_lock(); 949 - struct bch_dev *ca = bch2_dev_rcu(c, acc_k.dev_data_type.dev); 913 + struct bch_dev *ca = 
bch2_dev_rcu_noerror(c, acc_k.dev_data_type.dev); 950 914 if (!ca) { 951 915 rcu_read_unlock(); 952 916 continue;
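Note on the `darray_for_each_reverse` hunk above: the old loop walked the entries in order, translated each index into eytzinger order, and re-sorted after every removal. The new loop walks the array backwards, removes dead entries in place, and sorts once at the end. A minimal user-space sketch of why the backwards walk is safe follows. `struct entry`, `remove_at` and `drop_zero_entries` are hypothetical names for illustration, not bcachefs APIs, and the final eytzinger sort is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an accounting entry: a key plus one counter. */
struct entry {
	unsigned key;
	long	 counter;
};

/* Remove element idx by shifting the tail down; returns the new length. */
static size_t remove_at(struct entry *v, size_t nr, size_t idx)
{
	assert(idx < nr);
	memmove(&v[idx], &v[idx + 1], (nr - idx - 1) * sizeof(v[0]));
	return nr - 1;
}

/*
 * Drop every entry whose counter is zero.  Walking from the last element
 * to the first means a removal only shifts elements that were already
 * visited, so nothing is skipped and no index fix-up is needed.
 */
static size_t drop_zero_entries(struct entry *v, size_t nr)
{
	for (size_t i = nr; i-- > 0;)
		if (!v[i].counter)
			nr = remove_at(v, nr, i);
	return nr;
}
```

The surviving entries keep their relative order, so a single sort after the loop is enough to restore whatever layout the lookup code expects.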
+61 -12
fs/bcachefs/disk_accounting.h
··· 2 2 #ifndef _BCACHEFS_DISK_ACCOUNTING_H 3 3 #define _BCACHEFS_DISK_ACCOUNTING_H 4 4 5 + #include "btree_update.h" 5 6 #include "eytzinger.h" 6 7 #include "sb-members.h" 7 8 ··· 63 62 64 63 static inline void bpos_to_disk_accounting_pos(struct disk_accounting_pos *acc, struct bpos p) 65 64 { 66 - acc->_pad = p; 65 + BUILD_BUG_ON(sizeof(*acc) != sizeof(p)); 66 + 67 67 #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ 68 - bch2_bpos_swab(&acc->_pad); 68 + acc->_pad = p; 69 + #else 70 + memcpy_swab(acc, &p, sizeof(p)); 69 71 #endif 70 72 } 71 73 72 - static inline struct bpos disk_accounting_pos_to_bpos(struct disk_accounting_pos *k) 74 + static inline struct bpos disk_accounting_pos_to_bpos(struct disk_accounting_pos *acc) 73 75 { 74 - struct bpos ret = k->_pad; 75 - 76 + struct bpos p; 76 77 #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ 77 - bch2_bpos_swab(&ret); 78 + p = acc->_pad; 79 + #else 80 + memcpy_swab(&p, acc, sizeof(p)); 78 81 #endif 79 - return ret; 82 + return p; 80 83 } 81 84 82 85 int bch2_disk_accounting_mod(struct btree_trans *, struct disk_accounting_pos *, 83 86 s64 *, unsigned, bool); 84 87 int bch2_mod_dev_cached_sectors(struct btree_trans *, unsigned, s64, bool); 85 88 86 - int bch2_accounting_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags); 89 + int bch2_accounting_validate(struct bch_fs *, struct bkey_s_c, 90 + struct bkey_validate_context); 87 91 void bch2_accounting_key_to_text(struct printbuf *, struct disk_accounting_pos *); 88 92 void bch2_accounting_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c); 89 93 void bch2_accounting_swab(struct bkey_s); ··· 118 112 int bch2_accounting_mem_insert(struct bch_fs *, struct bkey_s_c_accounting, enum bch_accounting_mode); 119 113 void bch2_accounting_mem_gc(struct bch_fs *); 120 114 115 + static inline bool bch2_accounting_is_mem(struct disk_accounting_pos acc) 116 + { 117 + return acc.type < BCH_DISK_ACCOUNTING_TYPE_NR && 118 + acc.type != BCH_DISK_ACCOUNTING_inum; 119 + } 
120 + 121 121 /* 122 122 * Update in memory counters so they match the btree update we're doing; called 123 123 * from transaction commit path ··· 138 126 bpos_to_disk_accounting_pos(&acc_k, a.k->p); 139 127 bool gc = mode == BCH_ACCOUNTING_gc; 140 128 141 - EBUG_ON(gc && !acc->gc_running); 129 + if (gc && !acc->gc_running) 130 + return 0; 142 131 143 - if (acc_k.type == BCH_DISK_ACCOUNTING_inum) 132 + if (!bch2_accounting_is_mem(acc_k)) 144 133 return 0; 145 134 146 135 if (mode == BCH_ACCOUNTING_normal) { ··· 154 141 break; 155 142 case BCH_DISK_ACCOUNTING_dev_data_type: 156 143 rcu_read_lock(); 157 - struct bch_dev *ca = bch2_dev_rcu(c, acc_k.dev_data_type.dev); 144 + struct bch_dev *ca = bch2_dev_rcu_noerror(c, acc_k.dev_data_type.dev); 158 145 if (ca) { 159 146 this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].buckets, a.v->d[0]); 160 147 this_cpu_add(ca->usage->d[acc_k.dev_data_type.data_type].sectors, a.v->d[1]); ··· 217 204 bch2_accounting_mem_read_counters(acc, idx, v, nr, false); 218 205 } 219 206 207 + static inline struct bversion journal_pos_to_bversion(struct journal_res *res, unsigned offset) 208 + { 209 + EBUG_ON(!res->ref); 210 + 211 + return (struct bversion) { 212 + .hi = res->seq >> 32, 213 + .lo = (res->seq << 32) | (res->offset + offset), 214 + }; 215 + } 216 + 217 + static inline int bch2_accounting_trans_commit_hook(struct btree_trans *trans, 218 + struct bkey_i_accounting *a, 219 + unsigned commit_flags) 220 + { 221 + a->k.bversion = journal_pos_to_bversion(&trans->journal_res, 222 + (u64 *) a - (u64 *) trans->journal_entries); 223 + 224 + EBUG_ON(bversion_zero(a->k.bversion)); 225 + 226 + return likely(!(commit_flags & BCH_TRANS_COMMIT_skip_accounting_apply)) 227 + ? 
bch2_accounting_mem_mod_locked(trans, accounting_i_to_s_c(a), BCH_ACCOUNTING_normal) 228 + : 0; 229 + } 230 + 231 + static inline void bch2_accounting_trans_commit_revert(struct btree_trans *trans, 232 + struct bkey_i_accounting *a_i, 233 + unsigned commit_flags) 234 + { 235 + if (likely(!(commit_flags & BCH_TRANS_COMMIT_skip_accounting_apply))) { 236 + struct bkey_s_accounting a = accounting_i_to_s(a_i); 237 + 238 + bch2_accounting_neg(a); 239 + bch2_accounting_mem_mod_locked(trans, a.c, BCH_ACCOUNTING_normal); 240 + bch2_accounting_neg(a); 241 + } 242 + } 243 + 220 244 int bch2_fs_replicas_usage_read(struct bch_fs *, darray_char *); 221 245 int bch2_fs_accounting_read(struct bch_fs *, darray_char *, unsigned); 222 - void bch2_fs_accounting_to_text(struct printbuf *, struct bch_fs *); 223 246 224 247 int bch2_gc_accounting_start(struct bch_fs *); 225 248 int bch2_gc_accounting_done(struct bch_fs *);
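Note on `journal_pos_to_bversion()` above: it builds a key version from the journal sequence number and an offset, so versions sort in journal order. The sketch below mirrors that packing in user space, assuming a version made of a 32-bit high word and a 64-bit low word, as the `.hi`/`.lo` initializers in the diff suggest. `struct version96`, `pack_version` and `version_cmp` are hypothetical names, and the non-zero sequence assertion is an assumption of this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of a version: 32 high bits, 64 low bits. */
struct version96 {
	uint32_t hi;
	uint64_t lo;
};

/*
 * Pack a 64-bit sequence number and a 32-bit offset:
 * the upper half of seq goes to hi, the lower half of seq fills the
 * top 32 bits of lo, and the offset fills the bottom 32 bits of lo.
 */
static struct version96 pack_version(uint64_t seq, uint32_t offset)
{
	/* Sketch assumption: seq is never 0, so the result is never all-zero. */
	assert(seq != 0);

	return (struct version96) {
		.hi = (uint32_t)(seq >> 32),
		.lo = (seq << 32) | offset,
	};
}

/* Compare hi first, then lo; returns <0, 0 or >0. */
static int version_cmp(struct version96 a, struct version96 b)
{
	if (a.hi != b.hi)
		return a.hi < b.hi ? -1 : 1;
	if (a.lo != b.lo)
		return a.lo < b.lo ? -1 : 1;
	return 0;
}
```

With this layout, a later sequence number always compares greater, and within one sequence number a larger offset compares greater.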
+142 -133
fs/bcachefs/ec.c
··· 26 26 #include "util.h" 27 27 28 28 #include <linux/sort.h> 29 + #include <linux/string_choices.h> 29 30 30 31 #ifdef __KERNEL__ 31 32 ··· 110 109 /* Stripes btree keys: */ 111 110 112 111 int bch2_stripe_validate(struct bch_fs *c, struct bkey_s_c k, 113 - enum bch_validate_flags flags) 112 + struct bkey_validate_context from) 114 113 { 115 114 const struct bch_stripe *s = bkey_s_c_to_stripe(k).v; 116 115 int ret = 0; ··· 130 129 "invalid csum granularity (%u >= 64)", 131 130 s->csum_granularity_bits); 132 131 133 - ret = bch2_bkey_ptrs_validate(c, k, flags); 132 + ret = bch2_bkey_ptrs_validate(c, k, from); 134 133 fsck_err: 135 134 return ret; 136 135 } ··· 305 304 } 306 305 307 306 if (flags & BTREE_TRIGGER_gc) { 308 - percpu_down_read(&c->mark_lock); 309 307 struct bucket *g = gc_bucket(ca, bucket.offset); 310 308 if (bch2_fs_inconsistent_on(!g, c, "reference to invalid bucket on device %u\n %s", 311 309 ptr->dev, 312 310 (bch2_bkey_val_to_text(&buf, c, s.s_c), buf.buf))) { 313 311 ret = -BCH_ERR_mark_stripe; 314 - goto err_unlock; 312 + goto err; 315 313 } 316 314 317 315 bucket_lock(g); ··· 318 318 ret = __mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &new, flags); 319 319 alloc_to_bucket(g, new); 320 320 bucket_unlock(g); 321 - err_unlock: 322 - percpu_up_read(&c->mark_lock); 321 + 323 322 if (!ret) 324 323 ret = bch2_alloc_key_to_dev_counters(trans, ca, &old, &new, flags); 325 324 } ··· 731 732 ? BCH_MEMBER_ERROR_write 732 733 : BCH_MEMBER_ERROR_read, 733 734 "erasure coding %s error: %s", 734 - bio_data_dir(bio) ? 
"write" : "read", 735 + str_write_read(bio_data_dir(bio)), 735 736 bch2_blk_status_to_str(bio->bi_status))) 736 737 clear_bit(ec_bio->idx, ec_bio->buf->valid); 737 738 ··· 908 909 bch2_bkey_val_to_text(&msgbuf, c, orig_k); 909 910 bch_err_ratelimited(c, 910 911 "error doing reconstruct read: %s\n %s", msg, msgbuf.buf); 911 - printbuf_exit(&msgbuf);; 912 + printbuf_exit(&msgbuf); 912 913 ret = -BCH_ERR_stripe_reconstruct; 913 914 goto out; 914 915 } ··· 1265 1266 struct bch_dev *ca, 1266 1267 struct bpos bucket, u8 gen, 1267 1268 struct ec_stripe_buf *s, 1268 - struct bpos *bp_pos) 1269 + struct bkey_s_c_backpointer bp, 1270 + struct bkey_buf *last_flushed) 1269 1271 { 1270 1272 struct bch_stripe *v = &bkey_i_to_stripe(&s->key)->v; 1271 1273 struct bch_fs *c = trans->c; 1272 - struct bch_backpointer bp; 1273 1274 struct btree_iter iter; 1274 1275 struct bkey_s_c k; 1275 1276 const struct bch_extent_ptr *ptr_c; ··· 1278 1279 struct bkey_i *n; 1279 1280 int ret, dev, block; 1280 1281 1281 - ret = bch2_get_next_backpointer(trans, ca, bucket, gen, 1282 - bp_pos, &bp, BTREE_ITER_cached); 1283 - if (ret) 1284 - return ret; 1285 - if (bpos_eq(*bp_pos, SPOS_MAX)) 1286 - return 0; 1287 - 1288 - if (bp.level) { 1282 + if (bp.v->level) { 1289 1283 struct printbuf buf = PRINTBUF; 1290 1284 struct btree_iter node_iter; 1291 1285 struct btree *b; 1292 1286 1293 - b = bch2_backpointer_get_node(trans, &node_iter, *bp_pos, bp); 1287 + b = bch2_backpointer_get_node(trans, bp, &node_iter, last_flushed); 1294 1288 bch2_trans_iter_exit(trans, &node_iter); 1295 1289 1296 1290 if (!b) 1297 1291 return 0; 1298 1292 1299 1293 prt_printf(&buf, "found btree node in erasure coded bucket: b=%px\n", b); 1300 - bch2_backpointer_to_text(&buf, &bp); 1294 + bch2_bkey_val_to_text(&buf, c, bp.s_c); 1301 1295 1302 1296 bch2_fs_inconsistent(c, "%s", buf.buf); 1303 1297 printbuf_exit(&buf); 1304 1298 return -EIO; 1305 1299 } 1306 1300 1307 - k = bch2_backpointer_get_key(trans, &iter, *bp_pos, bp, 
BTREE_ITER_intent); 1301 + k = bch2_backpointer_get_key(trans, bp, &iter, BTREE_ITER_intent, last_flushed); 1308 1302 ret = bkey_err(k); 1309 1303 if (ret) 1310 1304 return ret; ··· 1356 1364 struct bch_fs *c = trans->c; 1357 1365 struct bch_stripe *v = &bkey_i_to_stripe(&s->key)->v; 1358 1366 struct bch_extent_ptr ptr = v->ptrs[block]; 1359 - struct bpos bp_pos = POS_MIN; 1360 1367 int ret = 0; 1361 1368 1362 1369 struct bch_dev *ca = bch2_dev_tryget(c, ptr.dev); ··· 1364 1373 1365 1374 struct bpos bucket_pos = PTR_BUCKET_POS(ca, &ptr); 1366 1375 1367 - while (1) { 1368 - ret = commit_do(trans, NULL, NULL, 1369 - BCH_TRANS_COMMIT_no_check_rw| 1370 - BCH_TRANS_COMMIT_no_enospc, 1371 - ec_stripe_update_extent(trans, ca, bucket_pos, ptr.gen, s, &bp_pos)); 1372 - if (ret) 1373 - break; 1374 - if (bkey_eq(bp_pos, POS_MAX)) 1376 + struct bkey_buf last_flushed; 1377 + bch2_bkey_buf_init(&last_flushed); 1378 + bkey_init(&last_flushed.k->k); 1379 + 1380 + ret = for_each_btree_key_max_commit(trans, bp_iter, BTREE_ID_backpointers, 1381 + bucket_pos_to_bp_start(ca, bucket_pos), 1382 + bucket_pos_to_bp_end(ca, bucket_pos), 0, bp_k, 1383 + NULL, NULL, 1384 + BCH_TRANS_COMMIT_no_check_rw| 1385 + BCH_TRANS_COMMIT_no_enospc, ({ 1386 + if (bkey_ge(bp_k.k->p, bucket_pos_to_bp(ca, bpos_nosnap_successor(bucket_pos), 0))) 1375 1387 break; 1376 1388 1377 - bp_pos = bpos_nosnap_successor(bp_pos); 1378 - } 1389 + if (bp_k.k->type != KEY_TYPE_backpointer) 1390 + continue; 1379 1391 1392 + ec_stripe_update_extent(trans, ca, bucket_pos, ptr.gen, s, 1393 + bkey_s_c_to_backpointer(bp_k), &last_flushed); 1394 + })); 1395 + 1396 + bch2_bkey_buf_exit(&last_flushed, c); 1380 1397 bch2_dev_put(ca); 1381 1398 return ret; 1382 1399 } ··· 1706 1707 set_bkey_val_u64s(&s->k, u64s); 1707 1708 } 1708 1709 1709 - static int ec_new_stripe_alloc(struct bch_fs *c, struct ec_stripe_head *h) 1710 + static struct ec_stripe_new *ec_new_stripe_alloc(struct bch_fs *c, struct ec_stripe_head *h) 1710 1711 { 1711 1712 
struct ec_stripe_new *s; 1712 1713 ··· 1714 1715 1715 1716 s = kzalloc(sizeof(*s), GFP_KERNEL); 1716 1717 if (!s) 1717 - return -BCH_ERR_ENOMEM_ec_new_stripe_alloc; 1718 + return NULL; 1718 1719 1719 1720 mutex_init(&s->lock); 1720 1721 closure_init(&s->iodone, NULL); ··· 1729 1730 ec_stripe_key_init(c, &s->new_stripe.key, 1730 1731 s->nr_data, s->nr_parity, 1731 1732 h->blocksize, h->disk_label); 1732 - 1733 - h->s = s; 1734 - h->nr_created++; 1735 - return 0; 1733 + return s; 1736 1734 } 1737 1735 1738 1736 static void ec_stripe_head_devs_update(struct bch_fs *c, struct ec_stripe_head *h) ··· 1874 1878 return h; 1875 1879 } 1876 1880 1877 - static int new_stripe_alloc_buckets(struct btree_trans *trans, struct ec_stripe_head *h, 1881 + static int new_stripe_alloc_buckets(struct btree_trans *trans, 1882 + struct ec_stripe_head *h, struct ec_stripe_new *s, 1878 1883 enum bch_watermark watermark, struct closure *cl) 1879 1884 { 1880 1885 struct bch_fs *c = trans->c; 1881 1886 struct bch_devs_mask devs = h->devs; 1882 1887 struct open_bucket *ob; 1883 1888 struct open_buckets buckets; 1884 - struct bch_stripe *v = &bkey_i_to_stripe(&h->s->new_stripe.key)->v; 1889 + struct bch_stripe *v = &bkey_i_to_stripe(&s->new_stripe.key)->v; 1885 1890 unsigned i, j, nr_have_parity = 0, nr_have_data = 0; 1886 1891 bool have_cache = true; 1887 1892 int ret = 0; 1888 1893 1889 - BUG_ON(v->nr_blocks != h->s->nr_data + h->s->nr_parity); 1890 - BUG_ON(v->nr_redundant != h->s->nr_parity); 1894 + BUG_ON(v->nr_blocks != s->nr_data + s->nr_parity); 1895 + BUG_ON(v->nr_redundant != s->nr_parity); 1891 1896 1892 1897 /* * We bypass the sector allocator which normally does this: */ 1893 1898 bitmap_and(devs.d, devs.d, c->rw_devs[BCH_DATA_user].d, BCH_SB_MEMBERS_MAX); 1894 1899 1895 - for_each_set_bit(i, h->s->blocks_gotten, v->nr_blocks) { 1900 + for_each_set_bit(i, s->blocks_gotten, v->nr_blocks) { 1896 1901 /* 1897 1902 * Note: we don't yet repair invalid blocks (failed/removed 1898 1903 * 
devices) when reusing stripes - we still need a codepath to ··· 1903 1906 if (v->ptrs[i].dev != BCH_SB_MEMBER_INVALID) 1904 1907 __clear_bit(v->ptrs[i].dev, devs.d); 1905 1908 1906 - if (i < h->s->nr_data) 1909 + if (i < s->nr_data) 1907 1910 nr_have_data++; 1908 1911 else 1909 1912 nr_have_parity++; 1910 1913 } 1911 1914 1912 - BUG_ON(nr_have_data > h->s->nr_data); 1913 - BUG_ON(nr_have_parity > h->s->nr_parity); 1915 + BUG_ON(nr_have_data > s->nr_data); 1916 + BUG_ON(nr_have_parity > s->nr_parity); 1914 1917 1915 1918 buckets.nr = 0; 1916 - if (nr_have_parity < h->s->nr_parity) { 1919 + if (nr_have_parity < s->nr_parity) { 1917 1920 ret = bch2_bucket_alloc_set_trans(trans, &buckets, 1918 1921 &h->parity_stripe, 1919 1922 &devs, 1920 - h->s->nr_parity, 1923 + s->nr_parity, 1921 1924 &nr_have_parity, 1922 1925 &have_cache, 0, 1923 1926 BCH_DATA_parity, ··· 1925 1928 cl); 1926 1929 1927 1930 open_bucket_for_each(c, &buckets, ob, i) { 1928 - j = find_next_zero_bit(h->s->blocks_gotten, 1929 - h->s->nr_data + h->s->nr_parity, 1930 - h->s->nr_data); 1931 - BUG_ON(j >= h->s->nr_data + h->s->nr_parity); 1931 + j = find_next_zero_bit(s->blocks_gotten, 1932 + s->nr_data + s->nr_parity, 1933 + s->nr_data); 1934 + BUG_ON(j >= s->nr_data + s->nr_parity); 1932 1935 1933 - h->s->blocks[j] = buckets.v[i]; 1936 + s->blocks[j] = buckets.v[i]; 1934 1937 v->ptrs[j] = bch2_ob_ptr(c, ob); 1935 - __set_bit(j, h->s->blocks_gotten); 1938 + __set_bit(j, s->blocks_gotten); 1936 1939 } 1937 1940 1938 1941 if (ret) ··· 1940 1943 } 1941 1944 1942 1945 buckets.nr = 0; 1943 - if (nr_have_data < h->s->nr_data) { 1946 + if (nr_have_data < s->nr_data) { 1944 1947 ret = bch2_bucket_alloc_set_trans(trans, &buckets, 1945 1948 &h->block_stripe, 1946 1949 &devs, 1947 - h->s->nr_data, 1950 + s->nr_data, 1948 1951 &nr_have_data, 1949 1952 &have_cache, 0, 1950 1953 BCH_DATA_user, ··· 1952 1955 cl); 1953 1956 1954 1957 open_bucket_for_each(c, &buckets, ob, i) { 1955 - j = 
find_next_zero_bit(h->s->blocks_gotten, 1956 - h->s->nr_data, 0); 1957 - BUG_ON(j >= h->s->nr_data); 1958 + j = find_next_zero_bit(s->blocks_gotten, 1959 + s->nr_data, 0); 1960 + BUG_ON(j >= s->nr_data); 1958 1961 1959 - h->s->blocks[j] = buckets.v[i]; 1962 + s->blocks[j] = buckets.v[i]; 1960 1963 v->ptrs[j] = bch2_ob_ptr(c, ob); 1961 - __set_bit(j, h->s->blocks_gotten); 1964 + __set_bit(j, s->blocks_gotten); 1962 1965 } 1963 1966 1964 1967 if (ret) ··· 2004 2007 return ret; 2005 2008 } 2006 2009 2007 - static int __bch2_ec_stripe_head_reuse(struct btree_trans *trans, struct ec_stripe_head *h) 2010 + static int init_new_stripe_from_existing(struct bch_fs *c, struct ec_stripe_new *s) 2011 + { 2012 + struct bch_stripe *new_v = &bkey_i_to_stripe(&s->new_stripe.key)->v; 2013 + struct bch_stripe *existing_v = &bkey_i_to_stripe(&s->existing_stripe.key)->v; 2014 + unsigned i; 2015 + 2016 + BUG_ON(existing_v->nr_redundant != s->nr_parity); 2017 + s->nr_data = existing_v->nr_blocks - 2018 + existing_v->nr_redundant; 2019 + 2020 + int ret = ec_stripe_buf_init(&s->existing_stripe, 0, le16_to_cpu(existing_v->sectors)); 2021 + if (ret) { 2022 + bch2_stripe_close(c, s); 2023 + return ret; 2024 + } 2025 + 2026 + BUG_ON(s->existing_stripe.size != le16_to_cpu(existing_v->sectors)); 2027 + 2028 + /* 2029 + * Free buckets we initially allocated - they might conflict with 2030 + * blocks from the stripe we're reusing: 2031 + */ 2032 + for_each_set_bit(i, s->blocks_gotten, new_v->nr_blocks) { 2033 + bch2_open_bucket_put(c, c->open_buckets + s->blocks[i]); 2034 + s->blocks[i] = 0; 2035 + } 2036 + memset(s->blocks_gotten, 0, sizeof(s->blocks_gotten)); 2037 + memset(s->blocks_allocated, 0, sizeof(s->blocks_allocated)); 2038 + 2039 + for (unsigned i = 0; i < existing_v->nr_blocks; i++) { 2040 + if (stripe_blockcount_get(existing_v, i)) { 2041 + __set_bit(i, s->blocks_gotten); 2042 + __set_bit(i, s->blocks_allocated); 2043 + } 2044 + 2045 + ec_block_io(c, &s->existing_stripe, READ, i, 
&s->iodone); 2046 + } 2047 + 2048 + bkey_copy(&s->new_stripe.key, &s->existing_stripe.key); 2049 + s->have_existing_stripe = true; 2050 + 2051 + return 0; 2052 + } 2053 + 2054 + static int __bch2_ec_stripe_head_reuse(struct btree_trans *trans, struct ec_stripe_head *h, 2055 + struct ec_stripe_new *s) 2008 2056 { 2009 2057 struct bch_fs *c = trans->c; 2010 - struct bch_stripe *new_v = &bkey_i_to_stripe(&h->s->new_stripe.key)->v; 2011 - struct bch_stripe *existing_v; 2012 - unsigned i; 2013 2058 s64 idx; 2014 2059 int ret; 2015 2060 ··· 2063 2024 if (idx < 0) 2064 2025 return -BCH_ERR_stripe_alloc_blocked; 2065 2026 2066 - ret = get_stripe_key_trans(trans, idx, &h->s->existing_stripe); 2027 + ret = get_stripe_key_trans(trans, idx, &s->existing_stripe); 2067 2028 bch2_fs_fatal_err_on(ret && !bch2_err_matches(ret, BCH_ERR_transaction_restart), c, 2068 2029 "reading stripe key: %s", bch2_err_str(ret)); 2069 2030 if (ret) { 2070 - bch2_stripe_close(c, h->s); 2031 + bch2_stripe_close(c, s); 2071 2032 return ret; 2072 2033 } 2073 2034 2074 - existing_v = &bkey_i_to_stripe(&h->s->existing_stripe.key)->v; 2075 - 2076 - BUG_ON(existing_v->nr_redundant != h->s->nr_parity); 2077 - h->s->nr_data = existing_v->nr_blocks - 2078 - existing_v->nr_redundant; 2079 - 2080 - ret = ec_stripe_buf_init(&h->s->existing_stripe, 0, h->blocksize); 2081 - if (ret) { 2082 - bch2_stripe_close(c, h->s); 2083 - return ret; 2084 - } 2085 - 2086 - BUG_ON(h->s->existing_stripe.size != h->blocksize); 2087 - BUG_ON(h->s->existing_stripe.size != le16_to_cpu(existing_v->sectors)); 2088 - 2089 - /* 2090 - * Free buckets we initially allocated - they might conflict with 2091 - * blocks from the stripe we're reusing: 2092 - */ 2093 - for_each_set_bit(i, h->s->blocks_gotten, new_v->nr_blocks) { 2094 - bch2_open_bucket_put(c, c->open_buckets + h->s->blocks[i]); 2095 - h->s->blocks[i] = 0; 2096 - } 2097 - memset(h->s->blocks_gotten, 0, sizeof(h->s->blocks_gotten)); 2098 - memset(h->s->blocks_allocated, 0, 
sizeof(h->s->blocks_allocated)); 2099 - 2100 - for (i = 0; i < existing_v->nr_blocks; i++) { 2101 - if (stripe_blockcount_get(existing_v, i)) { 2102 - __set_bit(i, h->s->blocks_gotten); 2103 - __set_bit(i, h->s->blocks_allocated); 2104 - } 2105 - 2106 - ec_block_io(c, &h->s->existing_stripe, READ, i, &h->s->iodone); 2107 - } 2108 - 2109 - bkey_copy(&h->s->new_stripe.key, &h->s->existing_stripe.key); 2110 - h->s->have_existing_stripe = true; 2111 - 2112 - return 0; 2035 + return init_new_stripe_from_existing(c, s); 2113 2036 } 2114 2037 2115 - static int __bch2_ec_stripe_head_reserve(struct btree_trans *trans, struct ec_stripe_head *h) 2038 + static int __bch2_ec_stripe_head_reserve(struct btree_trans *trans, struct ec_stripe_head *h, 2039 + struct ec_stripe_new *s) 2116 2040 { 2117 2041 struct bch_fs *c = trans->c; 2118 2042 struct btree_iter iter; ··· 2084 2082 struct bpos start_pos = bpos_max(min_pos, POS(0, c->ec_stripe_hint)); 2085 2083 int ret; 2086 2084 2087 - if (!h->s->res.sectors) { 2088 - ret = bch2_disk_reservation_get(c, &h->s->res, 2085 + if (!s->res.sectors) { 2086 + ret = bch2_disk_reservation_get(c, &s->res, 2089 2087 h->blocksize, 2090 - h->s->nr_parity, 2088 + s->nr_parity, 2091 2089 BCH_DISK_RESERVATION_NOFAIL); 2092 2090 if (ret) 2093 2091 return ret; 2094 2092 } 2095 2093 2094 + /* 2095 + * Allocate stripe slot 2096 + * XXX: we're going to need a bitrange btree of free stripes 2097 + */ 2096 2098 for_each_btree_key_norestart(trans, iter, BTREE_ID_stripes, start_pos, 2097 2099 BTREE_ITER_slots|BTREE_ITER_intent, k, ret) { 2098 2100 if (bkey_gt(k.k->p, POS(0, U32_MAX))) { ··· 2111 2105 } 2112 2106 2113 2107 if (bkey_deleted(k.k) && 2114 - bch2_try_open_stripe(c, h->s, k.k->p.offset)) 2108 + bch2_try_open_stripe(c, s, k.k->p.offset)) 2115 2109 break; 2116 2110 } 2117 2111 ··· 2122 2116 2123 2117 ret = ec_stripe_mem_alloc(trans, &iter); 2124 2118 if (ret) { 2125 - bch2_stripe_close(c, h->s); 2119 + bch2_stripe_close(c, s); 2126 2120 goto err; 2127 
2121 } 2128 2122 2129 - h->s->new_stripe.key.k.p = iter.pos; 2123 + s->new_stripe.key.k.p = iter.pos; 2130 2124 out: 2131 2125 bch2_trans_iter_exit(trans, &iter); 2132 2126 return ret; 2133 2127 err: 2134 - bch2_disk_reservation_put(c, &h->s->res); 2128 + bch2_disk_reservation_put(c, &s->res); 2135 2129 goto out; 2136 2130 } 2137 2131 ··· 2162 2156 return h; 2163 2157 2164 2158 if (!h->s) { 2165 - ret = ec_new_stripe_alloc(c, h); 2166 - if (ret) { 2159 + h->s = ec_new_stripe_alloc(c, h); 2160 + if (!h->s) { 2161 + ret = -BCH_ERR_ENOMEM_ec_new_stripe_alloc; 2167 2162 bch_err(c, "failed to allocate new stripe"); 2168 2163 goto err; 2169 2164 } 2165 + 2166 + h->nr_created++; 2170 2167 } 2171 2168 2172 - if (h->s->allocated) 2169 + struct ec_stripe_new *s = h->s; 2170 + 2171 + if (s->allocated) 2173 2172 goto allocated; 2174 2173 2175 - if (h->s->have_existing_stripe) 2174 + if (s->have_existing_stripe) 2176 2175 goto alloc_existing; 2177 2176 2178 2177 /* First, try to allocate a full stripe: */ 2179 - ret = new_stripe_alloc_buckets(trans, h, BCH_WATERMARK_stripe, NULL) ?: 2180 - __bch2_ec_stripe_head_reserve(trans, h); 2178 + ret = new_stripe_alloc_buckets(trans, h, s, BCH_WATERMARK_stripe, NULL) ?: 2179 + __bch2_ec_stripe_head_reserve(trans, h, s); 2181 2180 if (!ret) 2182 2181 goto allocate_buf; 2183 2182 if (bch2_err_matches(ret, BCH_ERR_transaction_restart) || ··· 2194 2183 * existing stripe: 2195 2184 */ 2196 2185 while (1) { 2197 - ret = __bch2_ec_stripe_head_reuse(trans, h); 2186 + ret = __bch2_ec_stripe_head_reuse(trans, h, s); 2198 2187 if (!ret) 2199 2188 break; 2200 2189 if (waiting || !cl || ret != -BCH_ERR_stripe_alloc_blocked) 2201 2190 goto err; 2202 2191 2203 2192 if (watermark == BCH_WATERMARK_copygc) { 2204 - ret = new_stripe_alloc_buckets(trans, h, watermark, NULL) ?: 2205 - __bch2_ec_stripe_head_reserve(trans, h); 2193 + ret = new_stripe_alloc_buckets(trans, h, s, watermark, NULL) ?: 2194 + __bch2_ec_stripe_head_reserve(trans, h, s); 2206 2195 if 
(ret) 2207 2196 goto err; 2208 2197 goto allocate_buf; ··· 2220 2209 * Retry allocating buckets, with the watermark for this 2221 2210 * particular write: 2222 2211 */ 2223 - ret = new_stripe_alloc_buckets(trans, h, watermark, cl); 2212 + ret = new_stripe_alloc_buckets(trans, h, s, watermark, cl); 2224 2213 if (ret) 2225 2214 goto err; 2226 2215 2227 2216 allocate_buf: 2228 - ret = ec_stripe_buf_init(&h->s->new_stripe, 0, h->blocksize); 2217 + ret = ec_stripe_buf_init(&s->new_stripe, 0, h->blocksize); 2229 2218 if (ret) 2230 2219 goto err; 2231 2220 2232 - h->s->allocated = true; 2221 + s->allocated = true; 2233 2222 allocated: 2234 - BUG_ON(!h->s->idx); 2235 - BUG_ON(!h->s->new_stripe.data[0]); 2223 + BUG_ON(!s->idx); 2224 + BUG_ON(!s->new_stripe.data[0]); 2236 2225 BUG_ON(trans->restarted); 2237 2226 return h; 2238 2227 err: ··· 2297 2286 int bch2_dev_remove_stripes(struct bch_fs *c, unsigned dev_idx) 2298 2287 { 2299 2288 return bch2_trans_run(c, 2300 - for_each_btree_key_upto_commit(trans, iter, 2289 + for_each_btree_key_max_commit(trans, iter, 2301 2290 BTREE_ID_alloc, POS(dev_idx, 0), POS(dev_idx, U64_MAX), 2302 2291 BTREE_ITER_intent, k, 2303 2292 NULL, NULL, 0, ({ ··· 2460 2449 2461 2450 while (1) { 2462 2451 mutex_lock(&c->ec_stripe_head_lock); 2463 - h = list_first_entry_or_null(&c->ec_stripe_head_list, 2464 - struct ec_stripe_head, list); 2465 - if (h) 2466 - list_del(&h->list); 2452 + h = list_pop_entry(&c->ec_stripe_head_list, struct ec_stripe_head, list); 2467 2453 mutex_unlock(&c->ec_stripe_head_lock); 2454 + 2468 2455 if (!h) 2469 2456 break; 2470 2457
+2 -3
fs/bcachefs/ec.h
···
 #include "buckets_types.h"
 #include "extents_types.h"
 
-enum bch_validate_flags;
-
-int bch2_stripe_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_stripe_validate(struct bch_fs *, struct bkey_s_c,
+			 struct bkey_validate_context);
 void bch2_stripe_to_text(struct printbuf *, struct bch_fs *,
 			 struct bkey_s_c);
 int bch2_trigger_stripe(struct btree_trans *, enum btree_id, unsigned,
+17
fs/bcachefs/ec_format.h
···
 	 */
 	__u8			disk_label;
 
+	/*
+	 * Variable length sections:
+	 * - Pointers
+	 * - Checksums
+	 *   2D array of [stripe block/device][csum block], with checksum block
+	 *   size given by csum_granularity_bits
+	 * - Block sector counts: per-block array of u16s
+	 *
+	 * XXX:
+	 * Either checksums should have come last, or we should have included a
+	 * checksum_size field (the size in bytes of the checksum itself, not
+	 * the blocksize the checksum covers).
+	 *
+	 * Currently we aren't able to access the block sector counts if the
+	 * checksum type is unknown.
+	 */
+
 	struct bch_extent_ptr	ptrs[];
 } __packed __aligned(8);
 
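Note on the new comment in `struct bch_stripe`: the three variable-length sections are laid out back to back, so the start of the block sector counts depends on the total size of the checksum section. The sketch below illustrates that dependency with made-up sizes. `struct demo_stripe_layout` and both helpers are hypothetical, and the real on-disk structures and sizes differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one stripe's trailing sections. */
struct demo_stripe_layout {
	unsigned nr_blocks;		/* data + parity blocks */
	unsigned csums_per_block;	/* checksum blocks per stripe block */
	size_t	 ptr_bytes;		/* size of one pointer */
	size_t	 csum_bytes;		/* size of one checksum */
	bool	 csum_size_known;	/* false for an unrecognized checksum type */
};

/* Checksums start right after the pointer array. */
static size_t demo_csums_offset(const struct demo_stripe_layout *l)
{
	return (size_t)l->nr_blocks * l->ptr_bytes;
}

/*
 * Block sector counts start after the checksum section, so their offset
 * cannot be computed without the checksum size.
 * Returns 0 and sets *off on success, -1 if the checksum size is unknown.
 */
static int demo_block_counts_offset(const struct demo_stripe_layout *l, size_t *off)
{
	assert(off != NULL);

	if (!l->csum_size_known)
		return -1;

	*off = demo_csums_offset(l) +
		(size_t)l->nr_blocks * l->csums_per_block * l->csum_bytes;
	return 0;
}
```

Had the checksums come last, or had the header carried an explicit checksum size, the second helper would not need to fail for unknown checksum types.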
+17 -4
fs/bcachefs/errcode.h
···
 	x(ENOMEM, ENOMEM_compression_bounce_read_init) \
 	x(ENOMEM, ENOMEM_compression_bounce_write_init) \
 	x(ENOMEM, ENOMEM_compression_workspace_init) \
-	x(ENOMEM, ENOMEM_decompression_workspace_init) \
+	x(ENOMEM, ENOMEM_backpointer_mismatches_bitmap) \
+	x(EIO, compression_workspace_not_initialized) \
 	x(ENOMEM, ENOMEM_bucket_gens) \
 	x(ENOMEM, ENOMEM_buckets_nouse) \
 	x(ENOMEM, ENOMEM_usage_init) \
···
 	x(ENOENT, ENOENT_dirent_doesnt_match_inode) \
 	x(ENOENT, ENOENT_dev_not_found) \
 	x(ENOENT, ENOENT_dev_idx_not_found) \
+	x(ENOENT, ENOENT_inode_no_backpointer) \
+	x(ENOENT, ENOENT_no_snapshot_tree_subvol) \
 	x(ENOTEMPTY, ENOTEMPTY_dir_not_empty) \
 	x(ENOTEMPTY, ENOTEMPTY_subvol_not_empty) \
 	x(EEXIST, EEXIST_str_hash_set) \
···
 	x(BCH_ERR_transaction_restart, transaction_restart_split_race) \
 	x(BCH_ERR_transaction_restart, transaction_restart_write_buffer_flush) \
 	x(BCH_ERR_transaction_restart, transaction_restart_nested) \
+	x(BCH_ERR_transaction_restart, transaction_restart_commit) \
 	x(0, no_btree_node) \
 	x(BCH_ERR_no_btree_node, no_btree_node_relock) \
 	x(BCH_ERR_no_btree_node, no_btree_node_upgrade) \
···
 	x(BCH_ERR_btree_insert_fail, btree_insert_need_journal_res) \
 	x(BCH_ERR_btree_insert_fail, btree_insert_need_journal_reclaim) \
 	x(0, backpointer_to_overwritten_btree_node) \
-	x(0, lock_fail_root_changed) \
 	x(0, journal_reclaim_would_deadlock) \
 	x(EINVAL, fsck) \
 	x(BCH_ERR_fsck, fsck_fix) \
···
 	x(BCH_ERR_fsck, fsck_errors_not_fixed) \
 	x(BCH_ERR_fsck, fsck_repair_unimplemented) \
 	x(BCH_ERR_fsck, fsck_repair_impossible) \
-	x(0, restart_recovery) \
+	x(EINVAL, restart_recovery) \
+	x(EINVAL, not_in_recovery) \
+	x(EINVAL, cannot_rewind_recovery) \
 	x(0, data_update_done) \
 	x(EINVAL, device_state_not_allowed) \
 	x(EINVAL, member_info_missing) \
···
 	x(EINVAL, opt_parse_error) \
 	x(EINVAL, remove_with_metadata_missing_unimplemented)\
 	x(EINVAL, remove_would_lose_data) \
-	x(EINVAL, btree_iter_with_journal_not_supported) \
+	x(EINVAL, no_resize_with_buckets_nouse) \
+	x(EINVAL, inode_unpack_error) \
+	x(EINVAL, varint_decode_error) \
 	x(EROFS, erofs_trans_commit) \
 	x(EROFS, erofs_no_writes) \
 	x(EROFS, erofs_journal_err) \
···
 	x(BCH_ERR_invalid_sb, invalid_sb_downgrade) \
 	x(BCH_ERR_invalid, invalid_bkey) \
 	x(BCH_ERR_operation_blocked, nocow_lock_blocked) \
+	x(EIO, journal_shutdown) \
+	x(EIO, journal_flush_err) \
 	x(EIO, btree_node_read_err) \
+	x(BCH_ERR_btree_node_read_err, btree_node_read_err_cached) \
 	x(EIO, sb_not_downgraded) \
 	x(EIO, btree_node_write_all_failed) \
 	x(EIO, btree_node_read_error) \
···
 	x(EIO, no_device_to_read_from) \
 	x(EIO, missing_indirect_extent) \
 	x(EIO, invalidate_stripe_to_dev) \
+	x(EIO, no_encryption_key) \
+	x(EIO, insufficient_journal_devices) \
 	x(BCH_ERR_btree_node_read_err, btree_node_read_err_fixable) \
 	x(BCH_ERR_btree_node_read_err, btree_node_read_err_want_retry) \
 	x(BCH_ERR_btree_node_read_err, btree_node_read_err_must_retry) \
···

 #define BLK_STS_REMOVED ((__force blk_status_t)128)

+#include <linux/blk_types.h>
 const char *bch2_blk_status_to_str(blk_status_t);

 #endif /* _BCACHFES_ERRCODE_H */
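Reviewer note: the `x(class, name)` entries above form a small class hierarchy, so that, for example, `btree_node_read_err_cached` belongs to `BCH_ERR_btree_node_read_err`, which in turn belongs to `EIO`. The following is a minimal sketch of how such a table can be expanded and queried. Every `DEMO_`/`demo_` name and the base value 2048 are assumptions made for illustration; this is not the bcachefs implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical miniature of an x(parent, name) table; not the kernel's list. */
#define DEMO_ERRCODES()						\
	x(EIO,			read_err)			\
	x(DEMO_ERR_read_err,	read_err_cached)		\
	x(0,			restart)			\
	x(DEMO_ERR_restart,	restart_commit)

enum demo_errcode {
	DEMO_ERR_START = 2048,	/* assumed base, chosen to sit above errno values */
#define x(parent, err) DEMO_ERR_##err,
	DEMO_ERRCODES()
#undef x
	DEMO_ERR_MAX
};

/* demo_err_parent[i] is the class that private code (DEMO_ERR_START + i) belongs to */
static const unsigned demo_err_parent[] = {
#define x(parent, err) [DEMO_ERR_##err - DEMO_ERR_START] = parent,
	DEMO_ERRCODES()
#undef x
};

/* True if err is cls itself or a (transitive) member of cls; sign is ignored. */
static bool demo_err_matches(int err, int cls)
{
	err = err < 0 ? -err : err;
	cls = cls < 0 ? -cls : cls;

	assert(err < DEMO_ERR_MAX);

	/* Walk up the chain until we reach cls, a plain errno, or 0 */
	while (err >= DEMO_ERR_START && err != cls)
		err = (int) demo_err_parent[err - DEMO_ERR_START];

	return err == cls;
}
```

Under this scheme, changing an entry's first argument (as the hunk does for `restart_recovery`, from `0` to `EINVAL`) changes which class checks it satisfies without touching any caller.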
+133 -54
fs/bcachefs/error.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 #include "bcachefs.h" 3 + #include "btree_cache.h" 3 4 #include "btree_iter.h" 4 5 #include "error.h" 6 + #include "fs-common.h" 5 7 #include "journal.h" 6 8 #include "recovery_passes.h" 7 9 #include "super.h" ··· 35 33 int bch2_topology_error(struct bch_fs *c) 36 34 { 37 35 set_bit(BCH_FS_topology_error, &c->flags); 38 - if (!test_bit(BCH_FS_fsck_running, &c->flags)) { 36 + if (!test_bit(BCH_FS_recovery_running, &c->flags)) { 39 37 bch2_inconsistent_error(c); 40 38 return -BCH_ERR_btree_need_topology_repair; 41 39 } else { ··· 220 218 #undef x 221 219 }; 222 220 221 + static int do_fsck_ask_yn(struct bch_fs *c, 222 + struct btree_trans *trans, 223 + struct printbuf *question, 224 + const char *action) 225 + { 226 + prt_str(question, ", "); 227 + prt_str(question, action); 228 + 229 + if (bch2_fs_stdio_redirect(c)) 230 + bch2_print(c, "%s", question->buf); 231 + else 232 + bch2_print_string_as_lines(KERN_ERR, question->buf); 233 + 234 + int ask = bch2_fsck_ask_yn(c, trans); 235 + 236 + if (trans) { 237 + int ret = bch2_trans_relock(trans); 238 + if (ret) 239 + return ret; 240 + } 241 + 242 + return ask; 243 + } 244 + 223 245 int __bch2_fsck_err(struct bch_fs *c, 224 246 struct btree_trans *trans, 225 247 enum bch_fsck_flags flags, ··· 252 226 { 253 227 struct fsck_err_state *s = NULL; 254 228 va_list args; 255 - bool print = true, suppressing = false, inconsistent = false; 229 + bool print = true, suppressing = false, inconsistent = false, exiting = false; 256 230 struct printbuf buf = PRINTBUF, *out = &buf; 257 231 int ret = -BCH_ERR_fsck_ignore; 258 232 const char *action_orig = "fix?", *action = action_orig; ··· 282 256 !trans && 283 257 bch2_current_has_btree_trans(c)); 284 258 285 - if ((flags & FSCK_CAN_FIX) && 286 - test_bit(err, c->sb.errors_silent)) 287 - return -BCH_ERR_fsck_fix; 259 + if (test_bit(err, c->sb.errors_silent)) 260 + return flags & FSCK_CAN_FIX 261 + ? 
-BCH_ERR_fsck_fix 262 + : -BCH_ERR_fsck_ignore; 288 263 289 264 bch2_sb_error_count(c, err); 290 265 ··· 316 289 */ 317 290 if (s->last_msg && !strcmp(buf.buf, s->last_msg)) { 318 291 ret = s->ret; 319 - mutex_unlock(&c->fsck_error_msgs_lock); 320 - goto err; 292 + goto err_unlock; 321 293 } 322 294 323 295 kfree(s->last_msg); 324 296 s->last_msg = kstrdup(buf.buf, GFP_KERNEL); 325 297 if (!s->last_msg) { 326 - mutex_unlock(&c->fsck_error_msgs_lock); 327 298 ret = -ENOMEM; 328 - goto err; 299 + goto err_unlock; 329 300 } 330 301 331 302 if (c->opts.ratelimit_errors && ··· 343 318 prt_printf(out, bch2_log_msg(c, "")); 344 319 #endif 345 320 346 - if ((flags & FSCK_CAN_FIX) && 347 - (flags & FSCK_AUTOFIX) && 321 + if ((flags & FSCK_AUTOFIX) && 348 322 (c->opts.errors == BCH_ON_ERROR_continue || 349 323 c->opts.errors == BCH_ON_ERROR_fix_safe)) { 350 324 prt_str(out, ", "); 351 - prt_actioning(out, action); 352 - ret = -BCH_ERR_fsck_fix; 325 + if (flags & FSCK_CAN_FIX) { 326 + prt_actioning(out, action); 327 + ret = -BCH_ERR_fsck_fix; 328 + } else { 329 + prt_str(out, ", continuing"); 330 + ret = -BCH_ERR_fsck_ignore; 331 + } 332 + 333 + goto print; 353 334 } else if (!test_bit(BCH_FS_fsck_running, &c->flags)) { 354 335 if (c->opts.errors != BCH_ON_ERROR_continue || 355 336 !(flags & (FSCK_CAN_FIX|FSCK_CAN_IGNORE))) { ··· 379 348 : c->opts.fix_errors; 380 349 381 350 if (fix == FSCK_FIX_ask) { 382 - prt_str(out, ", "); 383 - prt_str(out, action); 384 - 385 - if (bch2_fs_stdio_redirect(c)) 386 - bch2_print(c, "%s", out->buf); 387 - else 388 - bch2_print_string_as_lines(KERN_ERR, out->buf); 389 351 print = false; 390 352 391 - int ask = bch2_fsck_ask_yn(c, trans); 353 + ret = do_fsck_ask_yn(c, trans, out, action); 354 + if (ret < 0) 355 + goto err_unlock; 392 356 393 - if (trans) { 394 - ret = bch2_trans_relock(trans); 395 - if (ret) { 396 - mutex_unlock(&c->fsck_error_msgs_lock); 397 - goto err; 398 - } 399 - } 400 - 401 - if (ask >= YN_ALLNO && s) 402 - s->fix = ask 
== YN_ALLNO 357 + if (ret >= YN_ALLNO && s) 358 + s->fix = ret == YN_ALLNO 403 359 ? FSCK_FIX_no 404 360 : FSCK_FIX_yes; 405 361 406 - ret = ask & 1 362 + ret = ret & 1 407 363 ? -BCH_ERR_fsck_fix 408 364 : -BCH_ERR_fsck_ignore; 409 365 } else if (fix == FSCK_FIX_yes || ··· 403 385 prt_str(out, ", not "); 404 386 prt_actioning(out, action); 405 387 } 406 - } else if (flags & FSCK_NEED_FSCK) { 407 - prt_str(out, " (run fsck to correct)"); 408 - } else { 388 + } else if (!(flags & FSCK_CAN_IGNORE)) { 409 389 prt_str(out, " (repair unimplemented)"); 410 390 } 411 391 ··· 412 396 !(flags & FSCK_CAN_IGNORE))) 413 397 ret = -BCH_ERR_fsck_errors_not_fixed; 414 398 415 - bool exiting = 416 - test_bit(BCH_FS_fsck_running, &c->flags) && 417 - (ret != -BCH_ERR_fsck_fix && 418 - ret != -BCH_ERR_fsck_ignore); 419 - 420 - if (exiting) 399 + if (test_bit(BCH_FS_fsck_running, &c->flags) && 400 + (ret != -BCH_ERR_fsck_fix && 401 + ret != -BCH_ERR_fsck_ignore)) { 402 + exiting = true; 421 403 print = true; 422 - 404 + } 405 + print: 423 406 if (print) { 424 407 if (bch2_fs_stdio_redirect(c)) 425 408 bch2_print(c, "%s\n", out->buf); ··· 434 419 if (s) 435 420 s->ret = ret; 436 421 437 - mutex_unlock(&c->fsck_error_msgs_lock); 438 - 439 422 if (inconsistent) 440 423 bch2_inconsistent_error(c); 441 424 442 - if (ret == -BCH_ERR_fsck_fix) { 443 - set_bit(BCH_FS_errors_fixed, &c->flags); 444 - } else { 445 - set_bit(BCH_FS_errors_not_fixed, &c->flags); 446 - set_bit(BCH_FS_error, &c->flags); 425 + /* 426 + * We don't yet track whether the filesystem currently has errors, for 427 + * log_fsck_err()s: that would require us to track for every error type 428 + * which recovery pass corrects it, to get the fsck exit status correct: 429 + */ 430 + if (flags & FSCK_CAN_FIX) { 431 + if (ret == -BCH_ERR_fsck_fix) { 432 + set_bit(BCH_FS_errors_fixed, &c->flags); 433 + } else { 434 + set_bit(BCH_FS_errors_not_fixed, &c->flags); 435 + set_bit(BCH_FS_error, &c->flags); 436 + } 447 437 } 438 + 
err_unlock: 439 + mutex_unlock(&c->fsck_error_msgs_lock); 448 440 err: 449 441 if (action != action_orig) 450 442 kfree(action); ··· 459 437 return ret; 460 438 } 461 439 440 + static const char * const bch2_bkey_validate_contexts[] = { 441 + #define x(n) #n, 442 + BKEY_VALIDATE_CONTEXTS() 443 + #undef x 444 + NULL 445 + }; 446 + 462 447 int __bch2_bkey_fsck_err(struct bch_fs *c, 463 448 struct bkey_s_c k, 464 - enum bch_validate_flags validate_flags, 449 + struct bkey_validate_context from, 465 450 enum bch_sb_error_id err, 466 451 const char *fmt, ...) 467 452 { 468 - if (validate_flags & BCH_VALIDATE_silent) 453 + if (from.flags & BCH_VALIDATE_silent) 469 454 return -BCH_ERR_fsck_delete_bkey; 470 455 471 456 unsigned fsck_flags = 0; 472 - if (!(validate_flags & (BCH_VALIDATE_write|BCH_VALIDATE_commit))) 457 + if (!(from.flags & (BCH_VALIDATE_write|BCH_VALIDATE_commit))) { 458 + if (test_bit(err, c->sb.errors_silent)) 459 + return -BCH_ERR_fsck_delete_bkey; 460 + 473 461 fsck_flags |= FSCK_AUTOFIX|FSCK_CAN_FIX; 462 + } 463 + if (!WARN_ON(err >= ARRAY_SIZE(fsck_flags_extra))) 464 + fsck_flags |= fsck_flags_extra[err]; 474 465 475 466 struct printbuf buf = PRINTBUF; 476 - va_list args; 467 + prt_printf(&buf, "invalid bkey in %s", 468 + bch2_bkey_validate_contexts[from.from]); 477 469 478 - prt_str(&buf, "invalid bkey "); 470 + if (from.from == BKEY_VALIDATE_journal) 471 + prt_printf(&buf, " journal seq=%llu offset=%u", 472 + from.journal_seq, from.journal_offset); 473 + 474 + prt_str(&buf, " btree="); 475 + bch2_btree_id_to_text(&buf, from.btree); 476 + prt_printf(&buf, " level=%u: ", from.level); 477 + 479 478 bch2_bkey_val_to_text(&buf, c, k); 480 479 prt_str(&buf, "\n "); 480 + 481 + va_list args; 481 482 va_start(args, fmt); 482 483 prt_vprintf(&buf, fmt, args); 483 484 va_end(args); 485 + 484 486 prt_str(&buf, ": delete?"); 485 487 486 488 int ret = __bch2_fsck_err(c, NULL, fsck_flags, err, "%s", buf.buf); ··· 528 482 } 529 483 530 484 
mutex_unlock(&c->fsck_error_msgs_lock); 485 + } 486 + 487 + int bch2_inum_err_msg_trans(struct btree_trans *trans, struct printbuf *out, subvol_inum inum) 488 + { 489 + u32 restart_count = trans->restart_count; 490 + int ret = 0; 491 + 492 + /* XXX: we don't yet attempt to print paths when we don't know the subvol */ 493 + if (inum.subvol) 494 + ret = lockrestart_do(trans, bch2_inum_to_path(trans, inum, out)); 495 + if (!inum.subvol || ret) 496 + prt_printf(out, "inum %llu:%llu", inum.subvol, inum.inum); 497 + 498 + return trans_was_restarted(trans, restart_count); 499 + } 500 + 501 + int bch2_inum_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out, 502 + subvol_inum inum, u64 offset) 503 + { 504 + int ret = bch2_inum_err_msg_trans(trans, out, inum); 505 + prt_printf(out, " offset %llu: ", offset); 506 + return ret; 507 + } 508 + 509 + void bch2_inum_err_msg(struct bch_fs *c, struct printbuf *out, subvol_inum inum) 510 + { 511 + bch2_trans_run(c, bch2_inum_err_msg_trans(trans, out, inum)); 512 + } 513 + 514 + void bch2_inum_offset_err_msg(struct bch_fs *c, struct printbuf *out, 515 + subvol_inum inum, u64 offset) 516 + { 517 + bch2_trans_run(c, bch2_inum_offset_err_msg_trans(trans, out, inum, offset)); 531 518 }
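Reviewer note: in the `FSCK_FIX_ask` branch above, the single value returned by `do_fsck_ask_yn()` is decoded twice: `ret >= YN_ALLNO` decides whether the answer is remembered in `s->fix`, and `ret & 1` decides whether this occurrence is fixed. A minimal sketch of that decoding follows. The enum ordering (no, yes, all-no, all-yes) is an assumption, since the enum is not part of this diff, and all `demo_` names are hypothetical.

```c
#include <stdbool.h>

/* Assumed ordering: the low bit means "yes", values >= ALLNO are sticky. */
enum demo_yn { DEMO_YN_NO, DEMO_YN_YES, DEMO_YN_ALLNO, DEMO_YN_ALLYES };

struct demo_answer {
	bool fix;	/* repair this occurrence? */
	bool sticky;	/* remember the answer for later occurrences? */
};

static struct demo_answer demo_decode_yn(enum demo_yn ask)
{
	struct demo_answer a = {
		.fix	= ask & 1,
		.sticky	= ask >= DEMO_YN_ALLNO,
	};

	return a;
}
```

Because a negative return from the helper is handled first (`if (ret < 0) goto err_unlock;`), only the four non-negative answers reach this decoding step.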
+25 -33
fs/bcachefs/error.h
···
 	bch2_inconsistent_error(c); \
 })

-#define bch2_fs_inconsistent_on(cond, c, ...) \
+#define bch2_fs_inconsistent_on(cond, ...) \
 ({ \
 	bool _ret = unlikely(!!(cond)); \
- \
 	if (_ret) \
-		bch2_fs_inconsistent(c, __VA_ARGS__); \
-	_ret; \
-})
-
-/*
- * Later we might want to mark only the particular device inconsistent, not the
- * entire filesystem:
- */
-
-#define bch2_dev_inconsistent(ca, ...) \
-do { \
-	bch_err(ca, __VA_ARGS__); \
-	bch2_inconsistent_error((ca)->fs); \
-} while (0)
-
-#define bch2_dev_inconsistent_on(cond, ca, ...) \
-({ \
-	bool _ret = unlikely(!!(cond)); \
- \
-	if (_ret) \
-		bch2_dev_inconsistent(ca, __VA_ARGS__); \
+		bch2_fs_inconsistent(__VA_ARGS__); \
 	_ret; \
 })

···

 void bch2_flush_fsck_errs(struct bch_fs *);

-#define __fsck_err(c, _flags, _err_type, ...) \
+#define fsck_err_wrap(_do) \
 ({ \
-	int _ret = bch2_fsck_err(c, _flags, _err_type, __VA_ARGS__); \
+	int _ret = _do; \
 	if (_ret != -BCH_ERR_fsck_fix && \
 	    _ret != -BCH_ERR_fsck_ignore) { \
 		ret = _ret; \
···
 \
 	_ret == -BCH_ERR_fsck_fix; \
 })
+
+#define __fsck_err(...) fsck_err_wrap(bch2_fsck_err(__VA_ARGS__))

 /* These macros return true if error should be fixed: */

···
 	(unlikely(cond) ? __fsck_err(c, _flags, _err_type, __VA_ARGS__) : false);\
 })

-#define need_fsck_err_on(cond, c, _err_type, ...) \
-	__fsck_err_on(cond, c, FSCK_CAN_IGNORE|FSCK_NEED_FSCK, _err_type, __VA_ARGS__)
-
-#define need_fsck_err(c, _err_type, ...) \
-	__fsck_err(c, FSCK_CAN_IGNORE|FSCK_NEED_FSCK, _err_type, __VA_ARGS__)
-
 #define mustfix_fsck_err(c, _err_type, ...) \
 	__fsck_err(c, FSCK_CAN_FIX, _err_type, __VA_ARGS__)

···
 #define fsck_err_on(cond, c, _err_type, ...) \
 	__fsck_err_on(cond, c, FSCK_CAN_FIX|FSCK_CAN_IGNORE, _err_type, __VA_ARGS__)

+#define log_fsck_err(c, _err_type, ...) \
+	__fsck_err(c, FSCK_CAN_IGNORE, _err_type, __VA_ARGS__)
+
+#define log_fsck_err_on(cond, ...) \
+({ \
+	bool _ret = unlikely(!!(cond)); \
+	if (_ret) \
+		log_fsck_err(__VA_ARGS__); \
+	_ret; \
+})
+
 enum bch_validate_flags;
 __printf(5, 6)
 int __bch2_bkey_fsck_err(struct bch_fs *,
 			 struct bkey_s_c,
-			 enum bch_validate_flags,
+			 struct bkey_validate_context from,
 			 enum bch_sb_error_id,
 			 const char *, ...);

···
  */
 #define bkey_fsck_err(c, _err_type, _err_msg, ...) \
 do { \
-	int _ret = __bch2_bkey_fsck_err(c, k, flags, \
+	int _ret = __bch2_bkey_fsck_err(c, k, from, \
 				BCH_FSCK_ERR_##_err_type, \
 				_err_msg, ##__VA_ARGS__); \
 	if (_ret != -BCH_ERR_fsck_fix && \
···
 	} \
 	_ret; \
 })
+
+int bch2_inum_err_msg_trans(struct btree_trans *, struct printbuf *, subvol_inum);
+int bch2_inum_offset_err_msg_trans(struct btree_trans *, struct printbuf *, subvol_inum, u64);
+
+void bch2_inum_err_msg(struct bch_fs *, struct printbuf *, subvol_inum);
+void bch2_inum_offset_err_msg(struct bch_fs *, struct printbuf *, subvol_inum, u64);

 #endif /* _BCACHEFS_ERROR_H */
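Reviewer note: `fsck_err_wrap()` relies on two conventions in the calling function: a local `int ret` and a `fsck_err:` label. The wrapped expression evaluates to true only for "fix"; "ignore" evaluates to false, and anything else is stored in `ret` before jumping to the label. Below is a minimal standalone sketch of that control flow. It uses GNU C statement expressions, so it assumes GCC or Clang, and the `DEMO_`/`demo_` names and verdict values are hypothetical stand-ins for the `BCH_ERR_fsck_*` codes.

```c
#include <stdbool.h>

/* Hypothetical verdict codes standing in for the fsck error codes */
enum { DEMO_FIX = 1, DEMO_IGNORE = 2, DEMO_FATAL = 3 };

/*
 * Evaluates to true if the error should be fixed. Any verdict other than
 * fix/ignore is stored in the caller's `ret` and control jumps to the
 * caller's `fsck_err` label.
 */
#define demo_err_wrap(_do)						\
({									\
	int _ret = (_do);						\
	if (_ret != -DEMO_FIX && _ret != -DEMO_IGNORE) {		\
		ret = _ret;						\
		goto fsck_err;						\
	}								\
	_ret == -DEMO_FIX;						\
})

/* Stand-in for the reporting function: returns the verdict, negated */
static int demo_report(int verdict)
{
	return -verdict;
}

static int demo_check(int verdict, bool *repaired)
{
	int ret = 0;

	*repaired = false;

	if (demo_err_wrap(demo_report(verdict)))
		*repaired = true;	/* the repair itself would go here */
fsck_err:
	return ret;
}
```

Splitting the wrapper from the call, as the hunk does, lets the same fix/ignore/bail logic wrap any reporting expression rather than only `bch2_fsck_err()`.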
+2 -2
fs/bcachefs/extent_update.c
···
 		break;
 	case KEY_TYPE_reflink_p: {
 		struct bkey_s_c_reflink_p p = bkey_s_c_to_reflink_p(k);
-		u64 idx = le64_to_cpu(p.v->idx);
+		u64 idx = REFLINK_P_IDX(p.v);
 		unsigned sectors = bpos_min(*end, p.k->p).offset -
 			bkey_start_offset(p.k);
 		struct btree_iter iter;
···

 	bch2_trans_copy_iter(&copy, iter);

-	for_each_btree_key_upto_continue_norestart(copy, insert->k.p, 0, k, ret) {
+	for_each_btree_key_max_continue_norestart(copy, insert->k.p, 0, k, ret) {
 		unsigned offset = 0;

 		if (bkey_gt(bkey_start_pos(&insert->k), bkey_start_pos(k.k)))
+93 -197
fs/bcachefs/extents.c
··· 21 21 #include "extents.h" 22 22 #include "inode.h" 23 23 #include "journal.h" 24 + #include "rebalance.h" 24 25 #include "replicas.h" 25 26 #include "super.h" 26 27 #include "super-io.h" ··· 88 87 if (likely(!p1.idx && !p2.idx)) { 89 88 u64 l1 = dev_latency(c, p1.ptr.dev); 90 89 u64 l2 = dev_latency(c, p2.ptr.dev); 90 + 91 + /* 92 + * Square the latencies, to bias more in favor of the faster 93 + * device - we never want to stop issuing reads to the slower 94 + * device altogether, so that we can update our latency numbers: 95 + */ 96 + l1 *= l1; 97 + l2 *= l2; 91 98 92 99 /* Pick at random, biased in favor of the faster device: */ 93 100 ··· 178 169 /* KEY_TYPE_btree_ptr: */ 179 170 180 171 int bch2_btree_ptr_validate(struct bch_fs *c, struct bkey_s_c k, 181 - enum bch_validate_flags flags) 172 + struct bkey_validate_context from) 182 173 { 183 174 int ret = 0; 184 175 ··· 186 177 c, btree_ptr_val_too_big, 187 178 "value too big (%zu > %u)", bkey_val_u64s(k.k), BCH_REPLICAS_MAX); 188 179 189 - ret = bch2_bkey_ptrs_validate(c, k, flags); 180 + ret = bch2_bkey_ptrs_validate(c, k, from); 190 181 fsck_err: 191 182 return ret; 192 183 } ··· 198 189 } 199 190 200 191 int bch2_btree_ptr_v2_validate(struct bch_fs *c, struct bkey_s_c k, 201 - enum bch_validate_flags flags) 192 + struct bkey_validate_context from) 202 193 { 203 194 struct bkey_s_c_btree_ptr_v2 bp = bkey_s_c_to_btree_ptr_v2(k); 204 195 int ret = 0; ··· 212 203 c, btree_ptr_v2_min_key_bad, 213 204 "min_key > key"); 214 205 215 - if (flags & BCH_VALIDATE_write) 206 + if ((from.flags & BCH_VALIDATE_write) && 207 + c->sb.version_min >= bcachefs_metadata_version_btree_ptr_sectors_written) 216 208 bkey_fsck_err_on(!bp.v->sectors_written, 217 209 c, btree_ptr_v2_written_0, 218 210 "sectors_written == 0"); 219 211 220 - ret = bch2_bkey_ptrs_validate(c, k, flags); 212 + ret = bch2_bkey_ptrs_validate(c, k, from); 221 213 fsck_err: 222 214 return ret; 223 215 } ··· 405 395 /* KEY_TYPE_reservation: */ 406 396 407 
397 int bch2_reservation_validate(struct bch_fs *c, struct bkey_s_c k, 408 - enum bch_validate_flags flags) 398 + struct bkey_validate_context from) 409 399 { 410 400 struct bkey_s_c_reservation r = bkey_s_c_to_reservation(k); 411 401 int ret = 0; ··· 1130 1120 bch2_prt_compression_type(out, crc->compression_type); 1131 1121 } 1132 1122 1123 + static void bch2_extent_rebalance_to_text(struct printbuf *out, struct bch_fs *c, 1124 + const struct bch_extent_rebalance *r) 1125 + { 1126 + prt_str(out, "rebalance:"); 1127 + 1128 + prt_printf(out, " replicas=%u", r->data_replicas); 1129 + if (r->data_replicas_from_inode) 1130 + prt_str(out, " (inode)"); 1131 + 1132 + prt_str(out, " checksum="); 1133 + bch2_prt_csum_opt(out, r->data_checksum); 1134 + if (r->data_checksum_from_inode) 1135 + prt_str(out, " (inode)"); 1136 + 1137 + if (r->background_compression || r->background_compression_from_inode) { 1138 + prt_str(out, " background_compression="); 1139 + bch2_compression_opt_to_text(out, r->background_compression); 1140 + 1141 + if (r->background_compression_from_inode) 1142 + prt_str(out, " (inode)"); 1143 + } 1144 + 1145 + if (r->background_target || r->background_target_from_inode) { 1146 + prt_str(out, " background_target="); 1147 + if (c) 1148 + bch2_target_to_text(out, c, r->background_target); 1149 + else 1150 + prt_printf(out, "%u", r->background_target); 1151 + 1152 + if (r->background_target_from_inode) 1153 + prt_str(out, " (inode)"); 1154 + } 1155 + 1156 + if (r->promote_target || r->promote_target_from_inode) { 1157 + prt_str(out, " promote_target="); 1158 + if (c) 1159 + bch2_target_to_text(out, c, r->promote_target); 1160 + else 1161 + prt_printf(out, "%u", r->promote_target); 1162 + 1163 + if (r->promote_target_from_inode) 1164 + prt_str(out, " (inode)"); 1165 + } 1166 + 1167 + if (r->erasure_code || r->erasure_code_from_inode) { 1168 + prt_printf(out, " ec=%u", r->erasure_code); 1169 + if (r->erasure_code_from_inode) 1170 + prt_str(out, " (inode)"); 1171 
+ } 1172 + } 1173 + 1133 1174 void bch2_bkey_ptrs_to_text(struct printbuf *out, struct bch_fs *c, 1134 1175 struct bkey_s_c k) 1135 1176 { ··· 1216 1155 (u64) ec->idx, ec->block); 1217 1156 break; 1218 1157 } 1219 - case BCH_EXTENT_ENTRY_rebalance: { 1220 - const struct bch_extent_rebalance *r = &entry->rebalance; 1221 - 1222 - prt_str(out, "rebalance: target "); 1223 - if (c) 1224 - bch2_target_to_text(out, c, r->target); 1225 - else 1226 - prt_printf(out, "%u", r->target); 1227 - prt_str(out, " compression "); 1228 - bch2_compression_opt_to_text(out, r->compression); 1158 + case BCH_EXTENT_ENTRY_rebalance: 1159 + bch2_extent_rebalance_to_text(out, c, &entry->rebalance); 1229 1160 break; 1230 - } 1161 + 1231 1162 default: 1232 1163 prt_printf(out, "(invalid extent entry %.16llx)", *((u64 *) entry)); 1233 1164 return; ··· 1231 1178 1232 1179 static int extent_ptr_validate(struct bch_fs *c, 1233 1180 struct bkey_s_c k, 1234 - enum bch_validate_flags flags, 1181 + struct bkey_validate_context from, 1235 1182 const struct bch_extent_ptr *ptr, 1236 1183 unsigned size_ondisk, 1237 1184 bool metadata) 1238 1185 { 1239 1186 int ret = 0; 1187 + 1188 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 1189 + bkey_for_each_ptr(ptrs, ptr2) 1190 + bkey_fsck_err_on(ptr != ptr2 && ptr->dev == ptr2->dev, 1191 + c, ptr_to_duplicate_device, 1192 + "multiple pointers to same device (%u)", ptr->dev); 1240 1193 1241 1194 /* bad pointers are repaired by check_fix_ptrs(): */ 1242 1195 rcu_read_lock(); ··· 1257 1198 u64 nbuckets = ca->mi.nbuckets; 1258 1199 unsigned bucket_size = ca->mi.bucket_size; 1259 1200 rcu_read_unlock(); 1260 - 1261 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 1262 - bkey_for_each_ptr(ptrs, ptr2) 1263 - bkey_fsck_err_on(ptr != ptr2 && ptr->dev == ptr2->dev, 1264 - c, ptr_to_duplicate_device, 1265 - "multiple pointers to same device (%u)", ptr->dev); 1266 - 1267 1201 1268 1202 bkey_fsck_err_on(bucket >= nbuckets, 1269 1203 c, ptr_after_last_bucket, ··· 1273 1221 } 
1274 1222 1275 1223 int bch2_bkey_ptrs_validate(struct bch_fs *c, struct bkey_s_c k, 1276 - enum bch_validate_flags flags) 1224 + struct bkey_validate_context from) 1277 1225 { 1278 1226 struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 1279 1227 const union bch_extent_entry *entry; ··· 1300 1248 1301 1249 switch (extent_entry_type(entry)) { 1302 1250 case BCH_EXTENT_ENTRY_ptr: 1303 - ret = extent_ptr_validate(c, k, flags, &entry->ptr, size_ondisk, false); 1251 + ret = extent_ptr_validate(c, k, from, &entry->ptr, size_ondisk, false); 1304 1252 if (ret) 1305 1253 return ret; 1306 1254 ··· 1322 1270 case BCH_EXTENT_ENTRY_crc128: 1323 1271 crc = bch2_extent_crc_unpack(k.k, entry_to_crc(entry)); 1324 1272 1325 - bkey_fsck_err_on(crc.offset + crc.live_size > crc.uncompressed_size, 1326 - c, ptr_crc_uncompressed_size_too_small, 1327 - "checksum offset + key size > uncompressed size"); 1328 1273 bkey_fsck_err_on(!bch2_checksum_type_valid(c, crc.csum_type), 1329 1274 c, ptr_crc_csum_type_unknown, 1330 1275 "invalid checksum type"); 1331 1276 bkey_fsck_err_on(crc.compression_type >= BCH_COMPRESSION_TYPE_NR, 1332 1277 c, ptr_crc_compression_type_unknown, 1333 1278 "invalid compression type"); 1279 + 1280 + bkey_fsck_err_on(crc.offset + crc.live_size > crc.uncompressed_size, 1281 + c, ptr_crc_uncompressed_size_too_small, 1282 + "checksum offset + key size > uncompressed size"); 1283 + bkey_fsck_err_on(crc_is_encoded(crc) && 1284 + (crc.uncompressed_size > c->opts.encoded_extent_max >> 9) && 1285 + (from.flags & (BCH_VALIDATE_write|BCH_VALIDATE_commit)), 1286 + c, ptr_crc_uncompressed_size_too_big, 1287 + "too large encoded extent"); 1288 + bkey_fsck_err_on(!crc_is_compressed(crc) && 1289 + crc.compressed_size != crc.uncompressed_size, 1290 + c, ptr_crc_uncompressed_size_mismatch, 1291 + "not compressed but compressed != uncompressed size"); 1334 1292 1335 1293 if (bch2_csum_type_is_encryption(crc.csum_type)) { 1336 1294 if (nonce == UINT_MAX) ··· 1354 1292 c, ptr_crc_redundant, 
1355 1293 "redundant crc entry"); 1356 1294 crc_since_last_ptr = true; 1357 - 1358 - bkey_fsck_err_on(crc_is_encoded(crc) && 1359 - (crc.uncompressed_size > c->opts.encoded_extent_max >> 9) && 1360 - (flags & (BCH_VALIDATE_write|BCH_VALIDATE_commit)), 1361 - c, ptr_crc_uncompressed_size_too_big, 1362 - "too large encoded extent"); 1363 1295 1364 1296 size_ondisk = crc.compressed_size; 1365 1297 break; ··· 1447 1391 } 1448 1392 } 1449 1393 1450 - const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c k) 1451 - { 1452 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 1453 - const union bch_extent_entry *entry; 1454 - 1455 - bkey_extent_entry_for_each(ptrs, entry) 1456 - if (__extent_entry_type(entry) == BCH_EXTENT_ENTRY_rebalance) 1457 - return &entry->rebalance; 1458 - 1459 - return NULL; 1460 - } 1461 - 1462 - unsigned bch2_bkey_ptrs_need_rebalance(struct bch_fs *c, struct bkey_s_c k, 1463 - unsigned target, unsigned compression) 1464 - { 1465 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 1466 - unsigned rewrite_ptrs = 0; 1467 - 1468 - if (compression) { 1469 - unsigned compression_type = bch2_compression_opt_to_type(compression); 1470 - const union bch_extent_entry *entry; 1471 - struct extent_ptr_decoded p; 1472 - unsigned i = 0; 1473 - 1474 - bkey_for_each_ptr_decode(k.k, ptrs, p, entry) { 1475 - if (p.crc.compression_type == BCH_COMPRESSION_TYPE_incompressible || 1476 - p.ptr.unwritten) { 1477 - rewrite_ptrs = 0; 1478 - goto incompressible; 1479 - } 1480 - 1481 - if (!p.ptr.cached && p.crc.compression_type != compression_type) 1482 - rewrite_ptrs |= 1U << i; 1483 - i++; 1484 - } 1485 - } 1486 - incompressible: 1487 - if (target && bch2_target_accepts_data(c, BCH_DATA_user, target)) { 1488 - unsigned i = 0; 1489 - 1490 - bkey_for_each_ptr(ptrs, ptr) { 1491 - if (!ptr->cached && !bch2_dev_in_target(c, ptr->dev, target)) 1492 - rewrite_ptrs |= 1U << i; 1493 - i++; 1494 - } 1495 - } 1496 - 1497 - return rewrite_ptrs; 1498 - } 1499 - 1500 - 
bool bch2_bkey_needs_rebalance(struct bch_fs *c, struct bkey_s_c k) 1501 - { 1502 - const struct bch_extent_rebalance *r = bch2_bkey_rebalance_opts(k); 1503 - 1504 - /* 1505 - * If it's an indirect extent, we don't delete the rebalance entry when 1506 - * done so that we know what options were applied - check if it still 1507 - * needs work done: 1508 - */ 1509 - if (r && 1510 - k.k->type == KEY_TYPE_reflink_v && 1511 - !bch2_bkey_ptrs_need_rebalance(c, k, r->target, r->compression)) 1512 - r = NULL; 1513 - 1514 - return r != NULL; 1515 - } 1516 - 1517 - static u64 __bch2_bkey_sectors_need_rebalance(struct bch_fs *c, struct bkey_s_c k, 1518 - unsigned target, unsigned compression) 1519 - { 1520 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 1521 - const union bch_extent_entry *entry; 1522 - struct extent_ptr_decoded p; 1523 - u64 sectors = 0; 1524 - 1525 - if (compression) { 1526 - unsigned compression_type = bch2_compression_opt_to_type(compression); 1527 - 1528 - bkey_for_each_ptr_decode(k.k, ptrs, p, entry) { 1529 - if (p.crc.compression_type == BCH_COMPRESSION_TYPE_incompressible || 1530 - p.ptr.unwritten) { 1531 - sectors = 0; 1532 - goto incompressible; 1533 - } 1534 - 1535 - if (!p.ptr.cached && p.crc.compression_type != compression_type) 1536 - sectors += p.crc.compressed_size; 1537 - } 1538 - } 1539 - incompressible: 1540 - if (target && bch2_target_accepts_data(c, BCH_DATA_user, target)) { 1541 - bkey_for_each_ptr_decode(k.k, ptrs, p, entry) 1542 - if (!p.ptr.cached && !bch2_dev_in_target(c, p.ptr.dev, target)) 1543 - sectors += p.crc.compressed_size; 1544 - } 1545 - 1546 - return sectors; 1547 - } 1548 - 1549 - u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *c, struct bkey_s_c k) 1550 - { 1551 - const struct bch_extent_rebalance *r = bch2_bkey_rebalance_opts(k); 1552 - 1553 - return r ? 
__bch2_bkey_sectors_need_rebalance(c, k, r->target, r->compression) : 0; 1554 - } 1555 - 1556 - int bch2_bkey_set_needs_rebalance(struct bch_fs *c, struct bkey_i *_k, 1557 - struct bch_io_opts *opts) 1558 - { 1559 - struct bkey_s k = bkey_i_to_s(_k); 1560 - struct bch_extent_rebalance *r; 1561 - unsigned target = opts->background_target; 1562 - unsigned compression = background_compression(*opts); 1563 - bool needs_rebalance; 1564 - 1565 - if (!bkey_extent_is_direct_data(k.k)) 1566 - return 0; 1567 - 1568 - /* get existing rebalance entry: */ 1569 - r = (struct bch_extent_rebalance *) bch2_bkey_rebalance_opts(k.s_c); 1570 - if (r) { 1571 - if (k.k->type == KEY_TYPE_reflink_v) { 1572 - /* 1573 - * indirect extents: existing options take precedence, 1574 - * so that we don't move extents back and forth if 1575 - * they're referenced by different inodes with different 1576 - * options: 1577 - */ 1578 - if (r->target) 1579 - target = r->target; 1580 - if (r->compression) 1581 - compression = r->compression; 1582 - } 1583 - 1584 - r->target = target; 1585 - r->compression = compression; 1586 - } 1587 - 1588 - needs_rebalance = bch2_bkey_ptrs_need_rebalance(c, k.s_c, target, compression); 1589 - 1590 - if (needs_rebalance && !r) { 1591 - union bch_extent_entry *new = bkey_val_end(k); 1592 - 1593 - new->rebalance.type = 1U << BCH_EXTENT_ENTRY_rebalance; 1594 - new->rebalance.compression = compression; 1595 - new->rebalance.target = target; 1596 - new->rebalance.unused = 0; 1597 - k.k->u64s += extent_entry_u64s(new); 1598 - } else if (!needs_rebalance && r && k.k->type != KEY_TYPE_reflink_v) { 1599 - /* 1600 - * For indirect extents, don't delete the rebalance entry when 1601 - * we're finished so that we know we specifically moved it or 1602 - * compressed it to its current location/compression type 1603 - */ 1604 - extent_entry_drop(k, (union bch_extent_entry *) r); 1605 - } 1606 - 1607 - return 0; 1608 - } 1609 - 1610 1394 /* Generic extent code: */ 1611 1395 1612 1396 
int bch2_cut_front_s(struct bpos where, struct bkey_s k) ··· 1506 1610 case KEY_TYPE_reflink_p: { 1507 1611 struct bkey_s_reflink_p p = bkey_s_to_reflink_p(k); 1508 1612 1509 - le64_add_cpu(&p.v->idx, sub); 1613 + SET_REFLINK_P_IDX(p.v, REFLINK_P_IDX(p.v) + sub); 1510 1614 break; 1511 1615 } 1512 1616 case KEY_TYPE_inline_data:
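Reviewer note: the new comment in the read-device selection hunk says latencies are squared "to bias more in favor of the faster device" while still sending some reads to the slower one. The arithmetic can be sketched as below. The random pick itself is outside the hunk, so the weighting used here (device 1 is chosen with probability proportional to device 2's latency) is an assumption, and `demo_dev1_share_pct` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Percentage of reads that would go to device 1, assuming the pick is
 * weighted by the other device's latency: l2 / (l1 + l2).
 */
static unsigned demo_dev1_share_pct(uint64_t l1, uint64_t l2, bool square)
{
	if (square) {
		l1 *= l1;
		l2 *= l2;
	}

	assert(l1 + l2 > 0);

	return (unsigned) (100 * l2 / (l1 + l2));
}
```

With latencies of 1 and 3 units, the faster device's share rises from 75% to 90% once the latencies are squared, and the slower device still receives the remaining 10%, which keeps its latency estimate fresh.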
+4 -14
fs/bcachefs/extents.h
···

 struct bch_fs;
 struct btree_trans;
-enum bch_validate_flags;

 /* extent entries: */

···
 /* KEY_TYPE_btree_ptr: */

 int bch2_btree_ptr_validate(struct bch_fs *, struct bkey_s_c,
-			    enum bch_validate_flags);
+			    struct bkey_validate_context);
 void bch2_btree_ptr_to_text(struct printbuf *, struct bch_fs *,
 			    struct bkey_s_c);

 int bch2_btree_ptr_v2_validate(struct bch_fs *, struct bkey_s_c,
-			       enum bch_validate_flags);
+			       struct bkey_validate_context);
 void bch2_btree_ptr_v2_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 void bch2_btree_ptr_v2_compat(enum btree_id, unsigned, unsigned,
 			      int, struct bkey_s);
···
 /* KEY_TYPE_reservation: */

 int bch2_reservation_validate(struct bch_fs *, struct bkey_s_c,
-			      enum bch_validate_flags);
+			      struct bkey_validate_context);
 void bch2_reservation_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 bool bch2_reservation_merge(struct bch_fs *, struct bkey_s, struct bkey_s_c);

···
 void bch2_bkey_ptrs_to_text(struct printbuf *, struct bch_fs *,
 			    struct bkey_s_c);
 int bch2_bkey_ptrs_validate(struct bch_fs *, struct bkey_s_c,
-			    enum bch_validate_flags);
+			    struct bkey_validate_context);

 static inline bool bch2_extent_ptr_eq(struct bch_extent_ptr ptr1,
 				      struct bch_extent_ptr ptr2)
···
 }

 void bch2_ptr_swab(struct bkey_s);
-
-const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c);
-unsigned bch2_bkey_ptrs_need_rebalance(struct bch_fs *, struct bkey_s_c,
-				       unsigned, unsigned);
-bool bch2_bkey_needs_rebalance(struct bch_fs *, struct bkey_s_c);
-u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *, struct bkey_s_c);
-
-int bch2_bkey_set_needs_rebalance(struct bch_fs *, struct bkey_i *,
-				  struct bch_io_opts *);

 /* Generic extent code: */

+2 -13
fs/bcachefs/extents_format.h
···
 #endif
 };

-struct bch_extent_rebalance {
-#if defined(__LITTLE_ENDIAN_BITFIELD)
-	__u64 type:6,
-	      unused:34,
-	      compression:8, /* enum bch_compression_opt */
-	      target:16;
-#elif defined (__BIG_ENDIAN_BITFIELD)
-	__u64 target:16,
-	      compression:8,
-	      unused:34,
-	      type:6;
-#endif
-};
+/* bch_extent_rebalance: */
+#include "rebalance_format.h"

 union bch_extent_entry {
 #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ || __BITS_PER_LONG == 64
+116 -3
fs/bcachefs/fs-common.c
··· 69 69 if (!snapshot_src.inum) { 70 70 /* Inode wasn't specified, just snapshot: */ 71 71 struct bch_subvolume s; 72 - 73 - ret = bch2_subvolume_get(trans, snapshot_src.subvol, true, 74 - BTREE_ITER_cached, &s); 72 + ret = bch2_subvolume_get(trans, snapshot_src.subvol, true, &s); 75 73 if (ret) 76 74 goto err; 77 75 ··· 152 154 if (is_subdir_for_nlink(new_inode)) 153 155 dir_u->bi_nlink++; 154 156 dir_u->bi_mtime = dir_u->bi_ctime = now; 157 + dir_u->bi_size += dirent_occupied_size(name); 155 158 156 159 ret = bch2_inode_write(trans, &dir_iter, dir_u); 157 160 if (ret) ··· 170 171 new_inode->bi_dir = dir_u->bi_inum; 171 172 new_inode->bi_dir_offset = dir_offset; 172 173 } 174 + 175 + if (S_ISDIR(mode) && 176 + !new_inode->bi_subvol) 177 + new_inode->bi_depth = dir_u->bi_depth + 1; 173 178 174 179 inode_iter.flags &= ~BTREE_ITER_all_snapshots; 175 180 bch2_btree_iter_set_snapshot(&inode_iter, snapshot); ··· 221 218 } 222 219 223 220 dir_u->bi_mtime = dir_u->bi_ctime = now; 221 + dir_u->bi_size += dirent_occupied_size(name); 224 222 225 223 dir_hash = bch2_hash_info_init(c, dir_u); 226 224 ··· 324 320 325 321 dir_u->bi_mtime = dir_u->bi_ctime = inode_u->bi_ctime = now; 326 322 dir_u->bi_nlink -= is_subdir_for_nlink(inode_u); 323 + dir_u->bi_size -= dirent_occupied_size(name); 327 324 328 325 ret = bch2_hash_delete_at(trans, bch2_dirent_hash_desc, 329 326 &dir_hash, &dirent_iter, ··· 463 458 goto err; 464 459 } 465 460 461 + if (mode == BCH_RENAME) { 462 + src_dir_u->bi_size -= dirent_occupied_size(src_name); 463 + dst_dir_u->bi_size += dirent_occupied_size(dst_name); 464 + } 465 + 466 + if (mode == BCH_RENAME_OVERWRITE) 467 + src_dir_u->bi_size -= dirent_occupied_size(src_name); 468 + 466 469 if (src_inode_u->bi_parent_subvol) 467 470 src_inode_u->bi_parent_subvol = dst_dir.subvol; 468 471 ··· 525 512 dst_dir_u->bi_nlink++; 526 513 } 527 514 515 + if (S_ISDIR(src_inode_u->bi_mode) && 516 + !src_inode_u->bi_subvol) 517 + src_inode_u->bi_depth = dst_dir_u->bi_depth 
+ 1; 518 + 519 + if (mode == BCH_RENAME_EXCHANGE && 520 + S_ISDIR(dst_inode_u->bi_mode) && 521 + !dst_inode_u->bi_subvol) 522 + dst_inode_u->bi_depth = src_dir_u->bi_depth + 1; 523 + 528 524 if (dst_inum.inum && is_subdir_for_nlink(dst_inode_u)) { 529 525 dst_dir_u->bi_nlink--; 530 526 src_dir_u->bi_nlink += mode == BCH_RENAME_EXCHANGE; ··· 569 547 bch2_trans_iter_exit(trans, &dst_dir_iter); 570 548 bch2_trans_iter_exit(trans, &src_dir_iter); 571 549 return ret; 550 + } 551 + 552 + static inline void prt_bytes_reversed(struct printbuf *out, const void *b, unsigned n) 553 + { 554 + bch2_printbuf_make_room(out, n); 555 + 556 + unsigned can_print = min(n, printbuf_remaining(out)); 557 + 558 + b += n; 559 + 560 + for (unsigned i = 0; i < can_print; i++) 561 + out->buf[out->pos++] = *((char *) --b); 562 + 563 + printbuf_nul_terminate(out); 564 + } 565 + 566 + static inline void prt_str_reversed(struct printbuf *out, const char *s) 567 + { 568 + prt_bytes_reversed(out, s, strlen(s)); 569 + } 570 + 571 + static inline void reverse_bytes(void *b, size_t n) 572 + { 573 + char *e = b + n, *s = b; 574 + 575 + while (s < e) { 576 + --e; 577 + swap(*s, *e); 578 + s++; 579 + } 580 + } 581 + 582 + /* XXX: we don't yet attempt to print paths when we don't know the subvol */ 583 + int bch2_inum_to_path(struct btree_trans *trans, subvol_inum inum, struct printbuf *path) 584 + { 585 + unsigned orig_pos = path->pos; 586 + int ret = 0; 587 + 588 + while (!(inum.subvol == BCACHEFS_ROOT_SUBVOL && 589 + inum.inum == BCACHEFS_ROOT_INO)) { 590 + struct bch_inode_unpacked inode; 591 + ret = bch2_inode_find_by_inum_trans(trans, inum, &inode); 592 + if (ret) 593 + goto disconnected; 594 + 595 + if (!inode.bi_dir && !inode.bi_dir_offset) { 596 + ret = -BCH_ERR_ENOENT_inode_no_backpointer; 597 + goto disconnected; 598 + } 599 + 600 + inum.subvol = inode.bi_parent_subvol ?: inum.subvol; 601 + inum.inum = inode.bi_dir; 602 + 603 + u32 snapshot; 604 + ret = bch2_subvolume_get_snapshot(trans, 
inum.subvol, &snapshot); 605 + if (ret) 606 + goto disconnected; 607 + 608 + struct btree_iter d_iter; 609 + struct bkey_s_c_dirent d = bch2_bkey_get_iter_typed(trans, &d_iter, 610 + BTREE_ID_dirents, SPOS(inode.bi_dir, inode.bi_dir_offset, snapshot), 611 + 0, dirent); 612 + ret = bkey_err(d.s_c); 613 + if (ret) 614 + goto disconnected; 615 + 616 + struct qstr dirent_name = bch2_dirent_get_name(d); 617 + prt_bytes_reversed(path, dirent_name.name, dirent_name.len); 618 + 619 + prt_char(path, '/'); 620 + 621 + bch2_trans_iter_exit(trans, &d_iter); 622 + } 623 + 624 + if (orig_pos == path->pos) 625 + prt_char(path, '/'); 626 + out: 627 + ret = path->allocation_failure ? -ENOMEM : 0; 628 + if (ret) 629 + goto err; 630 + 631 + reverse_bytes(path->buf + orig_pos, path->pos - orig_pos); 632 + return 0; 633 + err: 634 + return ret; 635 + disconnected: 636 + if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 637 + goto err; 638 + 639 + prt_str_reversed(path, "(disconnected)"); 640 + goto out; 572 641 }
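The new bch2_inum_to_path() builds a path by walking from the inode up to the root, appending each directory entry name reversed, and then reversing the whole buffer once at the end. The following is a minimal userspace sketch of that idea, not the kernel code: `struct node`, `append_reversed()` and `node_to_path()` are hypothetical names, a fixed-size buffer stands in for `struct printbuf`, and parent pointers stand in for the dirent lookups.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an inode with a backpointer to its parent. */
struct node {
	const char		*name;
	const struct node	*parent;	/* NULL for the root */
};

/* Append n bytes of s to buf in reverse order; returns the new length. */
static size_t append_reversed(char *buf, size_t pos, size_t cap,
			      const char *s, size_t n)
{
	assert(pos <= cap);
	for (size_t i = 0; i < n && pos < cap; i++)
		buf[pos++] = s[n - 1 - i];
	return pos;
}

/* Reverse n bytes in place. */
static void reverse_bytes(char *b, size_t n)
{
	char *s = b, *e = b + n;

	while (s < e) {
		--e;
		char tmp = *s;
		*s = *e;
		*e = tmp;
		s++;
	}
}

/*
 * Walk from n up to the root, appending each name reversed followed by '/',
 * then reverse the whole buffer so the path reads root-first.
 */
static void node_to_path(const struct node *n, char *buf, size_t cap)
{
	size_t pos = 0;

	assert(cap > 0);
	cap--;	/* reserve room for the terminating NUL */

	for (; n->parent; n = n->parent) {
		pos = append_reversed(buf, pos, cap, n->name, strlen(n->name));
		pos = append_reversed(buf, pos, cap, "/", 1);
	}

	if (!pos)	/* n was the root itself */
		pos = append_reversed(buf, pos, cap, "/", 1);

	reverse_bytes(buf, pos);
	buf[pos] = '\0';
}
```

Appending reversed components means the buffer only ever grows at its tail, so nothing has to be shifted when a parent name is discovered after its child.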
+2
fs/bcachefs/fs-common.h
···
 bool bch2_reinherit_attrs(struct bch_inode_unpacked *,
 			  struct bch_inode_unpacked *);

+int bch2_inum_to_path(struct btree_trans *, subvol_inum, struct printbuf *);
+
 #endif /* _BCACHEFS_FS_COMMON_H */
+28 -17
fs/bcachefs/fs-io-buffered.c
···
 			     BTREE_ITER_slots);
 	while (1) {
 		struct bkey_s_c k;
-		unsigned bytes, sectors, offset_into_extent;
+		unsigned bytes, sectors;
+		s64 offset_into_extent;
 		enum btree_id data_btree = BTREE_ID_extents;

 		bch2_trans_begin(trans);
···

 		k = bkey_i_to_s_c(sk.k);

-		sectors = min(sectors, k.k->size - offset_into_extent);
+		sectors = min_t(unsigned, sectors, k.k->size - offset_into_extent);

 		if (readpages_iter) {
 			ret = readpage_bio_extend(trans, readpages_iter, &rbio->bio, sectors,
···
 	bch2_trans_iter_exit(trans, &iter);

 	if (ret) {
-		bch_err_inum_offset_ratelimited(c,
-				iter.pos.inode,
-				iter.pos.offset << 9,
-				"read error %i from btree lookup", ret);
+		struct printbuf buf = PRINTBUF;
+		bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter.pos.offset << 9);
+		prt_printf(&buf, "read error %i from btree lookup", ret);
+		bch_err_ratelimited(c, "%s", buf.buf);
+		printbuf_exit(&buf);
+
 		rbio->bio.bi_status = BLK_STS_IOERR;
 		bio_endio(&rbio->bio);
 	}
···
 	struct bch_io_opts opts;
 	struct folio *folio;
 	struct readpages_iter readpages_iter;
+	struct blk_plug plug;

 	bch2_inode_opts_get(&opts, c, &inode->ei_inode);

···
 	if (ret)
 		return;

+	/*
+	 * Besides being a general performance optimization, plugging helps with
+	 * avoiding btree transaction srcu warnings - submitting a bio can
+	 * block, and we don't want todo that with the transaction locked.
+	 *
+	 * However, plugged bios are submitted when we schedule; we ideally
+	 * would have our own scheduler hook to call unlock_long() before
+	 * scheduling.
+	 */
+	blk_start_plug(&plug);
 	bch2_pagecache_add_get(inode);

 	struct btree_trans *trans = bch2_trans_get(c);
···
 	bch2_trans_put(trans);

 	bch2_pagecache_add_put(inode);
-
+	blk_finish_plug(&plug);
 	darray_exit(&readpages_iter.folios);
 }

···
 	struct bch_fs *c = inode->v.i_sb->s_fs_info;
 	struct bch_read_bio *rbio;
 	struct bch_io_opts opts;
+	struct blk_plug plug;
 	int ret;
 	DECLARE_COMPLETION_ONSTACK(done);
+
+	BUG_ON(folio_test_uptodate(folio));
+	BUG_ON(folio_test_dirty(folio));

 	if (!bch2_folio_create(folio, GFP_KERNEL))
 		return -ENOMEM;
···
 	rbio->bio.bi_iter.bi_sector = folio_sector(folio);
 	BUG_ON(!bio_add_folio(&rbio->bio, folio, folio_size(folio), 0));

+	blk_start_plug(&plug);
 	bch2_trans_run(c, (bchfs_read(trans, rbio, inode_inum(inode), NULL), 0));
+	blk_finish_plug(&plug);
 	wait_for_completion(&done);

 	ret = blk_status_to_errno(rbio->bio.bi_status);
···
 		BUG_ON(!bio_add_folio(&w->io->op.wbio.bio, folio,
 				     sectors << 9, offset << 9));

-		/* Check for writing past i_size: */
-		WARN_ONCE((bio_end_sector(&w->io->op.wbio.bio) << 9) >
-			  round_up(i_size, block_bytes(c)) &&
-			  !test_bit(BCH_FS_emergency_ro, &c->flags),
-			  "writing past i_size: %llu > %llu (unrounded %llu)\n",
-			  bio_end_sector(&w->io->op.wbio.bio) << 9,
-			  round_up(i_size, block_bytes(c)),
-			  i_size);
-
 		w->io->op.res.sectors += reserved_sectors;
 		w->io->op.i_sectors_delta -= dirty_sectors;
 		w->io->op.new_i_size = i_size;
···
 	folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT,
 				    FGP_WRITEBEGIN | fgf_set_order(len),
 				    mapping_gfp_mask(mapping));
-	if (IS_ERR_OR_NULL(folio))
+	if (IS_ERR(folio))
 		goto err_unlock;

 	offset = pos - folio_pos(folio);
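The read path hunks above change `offset_into_extent` to `s64` and replace `min()` with `min_t(unsigned, ...)`. The kernel's `min()` rejects operands of mismatched types at build time, so once one side of the comparison is computed from a signed 64-bit value an explicit comparison type is needed. The diff itself does not say why the offset became signed. A minimal userspace sketch of the pattern, assuming a hypothetical `MIN_T` macro and `clamp_sectors()` helper that are not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's min_t(): cast both sides, then compare. */
#define MIN_T(type, a, b)	((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/*
 * Clamp a requested sector count to what remains of an extent.
 * extent_size - offset_into_extent is evaluated as int64_t, so it is cast
 * to unsigned before being compared with the unsigned sector count.
 */
static unsigned clamp_sectors(unsigned sectors, uint32_t extent_size,
			      int64_t offset_into_extent)
{
	assert(offset_into_extent >= 0 && offset_into_extent <= extent_size);
	return MIN_T(unsigned, sectors, extent_size - offset_into_extent);
}
```

The precondition check is an assumption of this sketch: the cast is only meaningful when the remaining length is non-negative and fits in `unsigned`.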
+5
fs/bcachefs/fs-io-direct.c
···
 	struct bch_io_opts opts;
 	struct dio_read *dio;
 	struct bio *bio;
+	struct blk_plug plug;
 	loff_t offset = req->ki_pos;
 	bool sync = is_sync_kiocb(req);
 	size_t shorten;
···
 	 */
 	dio->should_dirty = iter_is_iovec(iter);

+	blk_start_plug(&plug);
+
 	goto start;
 	while (iter->count) {
 		bio = bio_alloc_bioset(NULL,
···

 		bch2_read(c, rbio_init(bio, opts), inode_inum(inode));
 	}
+
+	blk_finish_plug(&plug);

 	iter->count += shorten;

+2 -2
fs/bcachefs/fs-io-pagecache.c
···
 			break;

 		f = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp_flags, gfp);
-		if (IS_ERR_OR_NULL(f))
+		if (IS_ERR(f))
 			break;

 		BUG_ON(fs->nr && folio_pos(f) != pos);
···
 	unsigned folio_idx = 0;

 	return bch2_trans_run(c,
-		for_each_btree_key_in_subvolume_upto(trans, iter, BTREE_ID_extents,
+		for_each_btree_key_in_subvolume_max(trans, iter, BTREE_ID_extents,
 				   POS(inum.inum, offset),
 				   POS(inum.inum, U64_MAX),
 				   inum.subvol, BTREE_ITER_slots, k, ({
+45 -9
fs/bcachefs/fs-io.c
···

 /* fsync: */

+static int bch2_get_inode_journal_seq_trans(struct btree_trans *trans, subvol_inum inum,
+					    u64 *seq)
+{
+	struct printbuf buf = PRINTBUF;
+	struct bch_inode_unpacked u;
+	struct btree_iter iter;
+	int ret = bch2_inode_peek(trans, &iter, &u, inum, 0);
+	if (ret)
+		return ret;
+
+	u64 cur_seq = journal_cur_seq(&trans->c->journal);
+	*seq = min(cur_seq, u.bi_journal_seq);
+
+	if (fsck_err_on(u.bi_journal_seq > cur_seq,
+			trans, inode_journal_seq_in_future,
+			"inode journal seq in future (currently at %llu)\n%s",
+			cur_seq,
+			(bch2_inode_unpacked_to_text(&buf, &u),
+			 buf.buf))) {
+		u.bi_journal_seq = cur_seq;
+		ret = bch2_inode_write(trans, &iter, &u);
+	}
+fsck_err:
+	bch2_trans_iter_exit(trans, &iter);
+	printbuf_exit(&buf);
+	return ret;
+}
+
 /*
  * inode->ei_inode.bi_journal_seq won't be up to date since it's set in an
  * insert trigger: look up the btree inode instead
···
 	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_fsync))
 		return -EROFS;

-	struct bch_inode_unpacked u;
-	int ret = bch2_inode_find_by_inum(c, inode_inum(inode), &u) ?:
-		bch2_journal_flush_seq(&c->journal, u.bi_journal_seq, TASK_INTERRUPTIBLE) ?:
+	u64 seq;
+	int ret = bch2_trans_commit_do(c, NULL, NULL, 0,
+			bch2_get_inode_journal_seq_trans(trans, inode_inum(inode), &seq)) ?:
+		bch2_journal_flush_seq(&c->journal, seq, TASK_INTERRUPTIBLE) ?:
 		bch2_inode_flush_nocow_writes(c, inode);
 	bch2_write_ref_put(c, BCH_WRITE_REF_fsync);
 	return ret;
···
 				   struct bpos end)
 {
 	return bch2_trans_run(c,
-		for_each_btree_key_in_subvolume_upto(trans, iter, BTREE_ID_extents, start, end,
+		for_each_btree_key_in_subvolume_max(trans, iter, BTREE_ID_extents, start, end,
 						    subvol, 0, k, ({
 			bkey_extent_is_data(k.k) && !bkey_extent_is_unwritten(k);
 		})));
···

 	folio = __filemap_get_folio(mapping, index,
 				    FGP_LOCK|FGP_CREAT, GFP_KERNEL);
-	if (IS_ERR_OR_NULL(folio)) {
+	if (IS_ERR(folio)) {
 		ret = -ENOMEM;
 		goto out;
 	}
···
 	u64 sectors = end - start;

 	int ret = bch2_trans_run(c,
-		for_each_btree_key_in_subvolume_upto(trans, iter,
+		for_each_btree_key_in_subvolume_max(trans, iter,
 				BTREE_ID_extents,
 				POS(inode->v.i_ino, start),
 				POS(inode->v.i_ino, end - 1),
···
 	bch2_mark_pagecache_unallocated(src, pos_src >> 9,
 				   (pos_src + aligned_len) >> 9);

+	/*
+	 * XXX: we'd like to be telling bch2_remap_range() if we have
+	 * permission to write to the source file, and thus if io path option
+	 * changes should be propagated through the copy, but we need mnt_idmap
+	 * from the pathwalk, awkward
+	 */
 	ret = bch2_remap_range(c,
 			       inode_inum(dst), pos_dst >> 9,
 			       inode_inum(src), pos_src >> 9,
 			       aligned_len >> 9,
-			       pos_dst + len, &i_sectors_delta);
+			       pos_dst + len, &i_sectors_delta,
+			       false);
 	if (ret < 0)
 		goto err;

···
 		return -ENXIO;

 	int ret = bch2_trans_run(c,
-		for_each_btree_key_in_subvolume_upto(trans, iter, BTREE_ID_extents,
+		for_each_btree_key_in_subvolume_max(trans, iter, BTREE_ID_extents,
 				   POS(inode->v.i_ino, offset >> 9),
 				   POS(inode->v.i_ino, U64_MAX),
 				   inum.subvol, 0, k, ({
···
 		return -ENXIO;

 	int ret = bch2_trans_run(c,
-		for_each_btree_key_in_subvolume_upto(trans, iter, BTREE_ID_extents,
+		for_each_btree_key_in_subvolume_max(trans, iter, BTREE_ID_extents,
 				   POS(inode->v.i_ino, offset >> 9),
 				   POS(inode->v.i_ino, U64_MAX),
 				   inum.subvol, BTREE_ITER_slots, k, ({
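The new fsync helper above no longer trusts the journal sequence number stored in the inode: it flushes up to the smaller of the stored value and the journal's current sequence, and flags the inode for repair when the stored value lies in the future. A minimal sketch of just that decision, assuming a hypothetical `struct seq_check` and `inode_flush_seq()` that do not exist in bcachefs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: what to flush to, and whether to repair. */
struct seq_check {
	uint64_t	flush_seq;	/* sequence number that is safe to wait on */
	bool		needs_repair;	/* recorded seq was ahead of the journal */
};

/*
 * Clamp the inode's recorded journal sequence to the journal's current
 * sequence, so a corrupt "future" value cannot make the flush wait for a
 * sequence number that has not been reached yet.
 */
static struct seq_check inode_flush_seq(uint64_t cur_seq, uint64_t inode_seq)
{
	struct seq_check r = {
		.flush_seq	= inode_seq < cur_seq ? inode_seq : cur_seq,
		.needs_repair	= inode_seq > cur_seq,
	};

	assert(r.flush_seq <= cur_seq);
	return r;
}
```

In the patch the repair branch rewrites `bi_journal_seq` inside the same transaction; the sketch only models the comparison.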
+1 -6
fs/bcachefs/fs-ioctl.c
···
 		sync_inodes_sb(c->vfs_sb);
 		up_read(&c->vfs_sb->s_umount);
 	}
-retry:
+
 	if (arg.src_ptr) {
 		error = user_path_at(arg.dirfd,
 				(const char __user *)(unsigned long)arg.src_ptr,
···
 err2:
 	if (arg.src_ptr)
 		path_put(&src_path);
-
-	if (retry_estale(error, lookup_flags)) {
-		lookup_flags |= LOOKUP_REVAL;
-		goto retry;
-	}
 err1:
 	return error;
 }
+69 -32
fs/bcachefs/fs.c
··· 23 23 #include "journal.h" 24 24 #include "keylist.h" 25 25 #include "quota.h" 26 + #include "rebalance.h" 26 27 #include "snapshot.h" 27 28 #include "super.h" 28 29 #include "xattr.h" ··· 39 38 #include <linux/posix_acl.h> 40 39 #include <linux/random.h> 41 40 #include <linux/seq_file.h> 41 + #include <linux/siphash.h> 42 42 #include <linux/statfs.h> 43 43 #include <linux/string.h> 44 44 #include <linux/xattr.h> ··· 67 65 i_gid_write(&inode->v, bi->bi_gid); 68 66 inode->v.i_mode = bi->bi_mode; 69 67 68 + if (fields & ATTR_SIZE) 69 + i_size_write(&inode->v, bi->bi_size); 70 + 70 71 if (fields & ATTR_ATIME) 71 72 inode_set_atime_to_ts(&inode->v, bch2_time_to_timespec(c, bi->bi_atime)); 72 73 if (fields & ATTR_MTIME) ··· 94 89 retry: 95 90 bch2_trans_begin(trans); 96 91 97 - ret = bch2_inode_peek(trans, &iter, &inode_u, inode_inum(inode), 98 - BTREE_ITER_intent) ?: 99 - (set ? set(trans, inode, &inode_u, p) : 0) ?: 100 - bch2_inode_write(trans, &iter, &inode_u) ?: 92 + ret = bch2_inode_peek(trans, &iter, &inode_u, inode_inum(inode), BTREE_ITER_intent); 93 + if (ret) 94 + goto err; 95 + 96 + struct bch_extent_rebalance old_r = bch2_inode_rebalance_opts_get(c, &inode_u); 97 + 98 + ret = (set ? 
set(trans, inode, &inode_u, p) : 0); 99 + if (ret) 100 + goto err; 101 + 102 + struct bch_extent_rebalance new_r = bch2_inode_rebalance_opts_get(c, &inode_u); 103 + 104 + if (memcmp(&old_r, &new_r, sizeof(new_r))) { 105 + ret = bch2_set_rebalance_needs_scan_trans(trans, inode_u.bi_inum); 106 + if (ret) 107 + goto err; 108 + } 109 + 110 + ret = bch2_inode_write(trans, &iter, &inode_u) ?: 101 111 bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc); 102 112 103 113 /* ··· 121 101 */ 122 102 if (!ret) 123 103 bch2_inode_update_after_write(trans, inode, &inode_u, fields); 124 - 104 + err: 125 105 bch2_trans_iter_exit(trans, &iter); 126 106 127 107 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) ··· 180 160 static u32 bch2_vfs_inode_hash_fn(const void *data, u32 len, u32 seed) 181 161 { 182 162 const subvol_inum *inum = data; 163 + siphash_key_t k = { .key[0] = seed }; 183 164 184 - return jhash(&inum->inum, sizeof(inum->inum), seed); 165 + return siphash_2u64(inum->subvol, inum->inum, &k); 185 166 } 186 167 187 168 static u32 bch2_vfs_inode_obj_hash_fn(const void *data, u32 len, u32 seed) ··· 211 190 .automatic_shrinking = true, 212 191 }; 213 192 193 + static const struct rhashtable_params bch2_vfs_inodes_by_inum_params = { 194 + .head_offset = offsetof(struct bch_inode_info, by_inum_hash), 195 + .key_offset = offsetof(struct bch_inode_info, ei_inum.inum), 196 + .key_len = sizeof(u64), 197 + .automatic_shrinking = true, 198 + }; 199 + 214 200 int bch2_inode_or_descendents_is_open(struct btree_trans *trans, struct bpos p) 215 201 { 216 202 struct bch_fs *c = trans->c; 217 - struct rhashtable *ht = &c->vfs_inodes_table; 218 - subvol_inum inum = (subvol_inum) { .inum = p.offset }; 203 + struct rhltable *ht = &c->vfs_inodes_by_inum_table; 204 + u64 inum = p.offset; 219 205 DARRAY(u32) subvols; 220 206 int ret = 0; 221 207 ··· 247 219 struct rhash_lock_head __rcu *const *bkt; 248 220 struct rhash_head *he; 249 221 unsigned int hash; 250 - struct 
bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht); 222 + struct bucket_table *tbl = rht_dereference_rcu(ht->ht.tbl, &ht->ht); 251 223 restart: 252 - hash = rht_key_hashfn(ht, tbl, &inum, bch2_vfs_inodes_params); 224 + hash = rht_key_hashfn(&ht->ht, tbl, &inum, bch2_vfs_inodes_by_inum_params); 253 225 bkt = rht_bucket(tbl, hash); 254 226 do { 255 227 struct bch_inode_info *inode; 256 228 257 229 rht_for_each_entry_rcu_from(inode, he, rht_ptr_rcu(bkt), tbl, hash, hash) { 258 - if (inode->ei_inum.inum == inum.inum) { 230 + if (inode->ei_inum.inum == inum) { 259 231 ret = darray_push_gfp(&subvols, inode->ei_inum.subvol, 260 232 GFP_NOWAIT|__GFP_NOWARN); 261 233 if (ret) { ··· 276 248 /* Ensure we see any new tables. */ 277 249 smp_rmb(); 278 250 279 - tbl = rht_dereference_rcu(tbl->future_tbl, ht); 251 + tbl = rht_dereference_rcu(tbl->future_tbl, &ht->ht); 280 252 if (unlikely(tbl)) 281 253 goto restart; 282 254 rcu_read_unlock(); ··· 355 327 spin_unlock(&inode->v.i_lock); 356 328 357 329 if (remove) { 358 - int ret = rhashtable_remove_fast(&c->vfs_inodes_table, 330 + int ret = rhltable_remove(&c->vfs_inodes_by_inum_table, 331 + &inode->by_inum_hash, bch2_vfs_inodes_by_inum_params); 332 + BUG_ON(ret); 333 + 334 + ret = rhashtable_remove_fast(&c->vfs_inodes_table, 359 335 &inode->hash, bch2_vfs_inodes_params); 360 336 BUG_ON(ret); 361 337 inode->v.i_hash.pprev = NULL; ··· 404 372 discard_new_inode(&inode->v); 405 373 return old; 406 374 } else { 375 + int ret = rhltable_insert(&c->vfs_inodes_by_inum_table, 376 + &inode->by_inum_hash, 377 + bch2_vfs_inodes_by_inum_params); 378 + BUG_ON(ret); 379 + 407 380 inode_fake_hash(&inode->v); 408 381 409 382 inode_sb_list_add(&inode->v); ··· 502 465 struct bch_inode_unpacked inode_u; 503 466 struct bch_subvolume subvol; 504 467 int ret = lockrestart_do(trans, 505 - bch2_subvolume_get(trans, inum.subvol, true, 0, &subvol) ?: 468 + bch2_subvolume_get(trans, inum.subvol, true, &subvol) ?: 506 469 
bch2_inode_find_by_inum_trans(trans, inum, &inode_u)) ?: 507 470 PTR_ERR_OR_ZERO(inode = bch2_inode_hash_init_insert(trans, inum, &inode_u, &subvol)); 508 471 bch2_trans_put(trans); ··· 572 535 inum.subvol = inode_u.bi_subvol ?: dir->ei_inum.subvol; 573 536 inum.inum = inode_u.bi_inum; 574 537 575 - ret = bch2_subvolume_get(trans, inum.subvol, true, 576 - BTREE_ITER_with_updates, &subvol) ?: 538 + ret = bch2_subvolume_get(trans, inum.subvol, true, &subvol) ?: 577 539 bch2_trans_commit(trans, NULL, &journal_seq, 0); 578 540 if (unlikely(ret)) { 579 541 bch2_quota_acct(c, bch_qid(&inode_u), Q_INO, -1, ··· 585 549 586 550 if (!(flags & BCH_CREATE_TMPFILE)) { 587 551 bch2_inode_update_after_write(trans, dir, &dir_u, 588 - ATTR_MTIME|ATTR_CTIME); 552 + ATTR_MTIME|ATTR_CTIME|ATTR_SIZE); 589 553 mutex_unlock(&dir->ei_update_lock); 590 554 } 591 555 ··· 653 617 654 618 struct bch_subvolume subvol; 655 619 struct bch_inode_unpacked inode_u; 656 - ret = bch2_subvolume_get(trans, inum.subvol, true, 0, &subvol) ?: 620 + ret = bch2_subvolume_get(trans, inum.subvol, true, &subvol) ?: 657 621 bch2_inode_find_by_inum_nowarn_trans(trans, inum, &inode_u) ?: 658 622 PTR_ERR_OR_ZERO(inode = bch2_inode_hash_init_insert(trans, inum, &inode_u, &subvol)); 659 623 ··· 664 628 goto err; 665 629 666 630 /* regular files may have hardlinks: */ 667 - if (bch2_fs_inconsistent_on(bch2_inode_should_have_bp(&inode_u) && 631 + if (bch2_fs_inconsistent_on(bch2_inode_should_have_single_bp(&inode_u) && 668 632 !bkey_eq(k.k->p, POS(inode_u.bi_dir, inode_u.bi_dir_offset)), 669 633 c, 670 634 "dirent points to inode that does not point back:\n %s", ··· 742 706 743 707 if (likely(!ret)) { 744 708 bch2_inode_update_after_write(trans, dir, &dir_u, 745 - ATTR_MTIME|ATTR_CTIME); 709 + ATTR_MTIME|ATTR_CTIME|ATTR_SIZE); 746 710 bch2_inode_update_after_write(trans, inode, &inode_u, ATTR_CTIME); 747 711 } 748 712 ··· 795 759 goto err; 796 760 797 761 bch2_inode_update_after_write(trans, dir, &dir_u, 798 - 
ATTR_MTIME|ATTR_CTIME); 762 + ATTR_MTIME|ATTR_CTIME|ATTR_SIZE); 799 763 bch2_inode_update_after_write(trans, inode, &inode_u, 800 764 ATTR_MTIME); 801 765 ··· 973 937 dst_inode->v.i_ino != dst_inode_u.bi_inum); 974 938 975 939 bch2_inode_update_after_write(trans, src_dir, &src_dir_u, 976 - ATTR_MTIME|ATTR_CTIME); 940 + ATTR_MTIME|ATTR_CTIME|ATTR_SIZE); 977 941 978 942 if (src_dir != dst_dir) 979 943 bch2_inode_update_after_write(trans, dst_dir, &dst_dir_u, 980 - ATTR_MTIME|ATTR_CTIME); 944 + ATTR_MTIME|ATTR_CTIME|ATTR_SIZE); 981 945 982 946 bch2_inode_update_after_write(trans, src_inode, &src_inode_u, 983 947 ATTR_CTIME); ··· 1281 1245 struct btree_iter iter; 1282 1246 struct bkey_s_c k; 1283 1247 struct bkey_buf cur, prev; 1284 - unsigned offset_into_extent, sectors; 1285 1248 bool have_extent = false; 1286 1249 int ret = 0; 1287 1250 ··· 1313 1278 1314 1279 bch2_btree_iter_set_snapshot(&iter, snapshot); 1315 1280 1316 - k = bch2_btree_iter_peek_upto(&iter, end); 1281 + k = bch2_btree_iter_peek_max(&iter, end); 1317 1282 ret = bkey_err(k); 1318 1283 if (ret) 1319 1284 continue; ··· 1327 1292 continue; 1328 1293 } 1329 1294 1330 - offset_into_extent = iter.pos.offset - 1331 - bkey_start_offset(k.k); 1332 - sectors = k.k->size - offset_into_extent; 1295 + s64 offset_into_extent = iter.pos.offset - bkey_start_offset(k.k); 1296 + unsigned sectors = k.k->size - offset_into_extent; 1333 1297 1334 1298 bch2_bkey_buf_reassemble(&cur, c, k); 1335 1299 ··· 1340 1306 k = bkey_i_to_s_c(cur.k); 1341 1307 bch2_bkey_buf_realloc(&prev, c, k.k->u64s); 1342 1308 1343 - sectors = min(sectors, k.k->size - offset_into_extent); 1309 + sectors = min_t(unsigned, sectors, k.k->size - offset_into_extent); 1344 1310 1345 1311 bch2_cut_front(POS(k.k->p.inode, 1346 1312 bkey_start_offset(k.k) + ··· 1770 1736 bch2_inode_update_after_write(trans, inode, bi, ~0); 1771 1737 1772 1738 inode->v.i_blocks = bi->bi_sectors; 1773 - inode->v.i_ino = bi->bi_inum; 1774 1739 inode->v.i_rdev = bi->bi_dev; 
1775 1740 inode->v.i_generation = bi->bi_generation; 1776 1741 inode->v.i_size = bi->bi_size; ··· 2233 2200 sb->s_time_gran = c->sb.nsec_per_time_unit; 2234 2201 sb->s_time_min = div_s64(S64_MIN, c->sb.time_units_per_sec) + 1; 2235 2202 sb->s_time_max = div_s64(S64_MAX, c->sb.time_units_per_sec); 2236 - sb->s_uuid = c->sb.user_uuid; 2203 + super_set_uuid(sb, c->sb.user_uuid.b, sizeof(c->sb.user_uuid)); 2204 + super_set_sysfs_name_uuid(sb); 2237 2205 sb->s_shrink->seeks = 0; 2238 2206 c->vfs_sb = sb; 2239 2207 strscpy(sb->s_id, c->name, sizeof(sb->s_id)); ··· 2379 2345 2380 2346 void bch2_fs_vfs_exit(struct bch_fs *c) 2381 2347 { 2348 + if (c->vfs_inodes_by_inum_table.ht.tbl) 2349 + rhltable_destroy(&c->vfs_inodes_by_inum_table); 2382 2350 if (c->vfs_inodes_table.tbl) 2383 2351 rhashtable_destroy(&c->vfs_inodes_table); 2384 2352 } 2385 2353 2386 2354 int bch2_fs_vfs_init(struct bch_fs *c) 2387 2355 { 2388 - return rhashtable_init(&c->vfs_inodes_table, &bch2_vfs_inodes_params); 2356 + return rhashtable_init(&c->vfs_inodes_table, &bch2_vfs_inodes_params) ?: 2357 + rhltable_init(&c->vfs_inodes_by_inum_table, &bch2_vfs_inodes_by_inum_params); 2389 2358 } 2390 2359 2391 2360 static struct file_system_type bcache_fs_type = {
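The fs.c changes above hash the main VFS inode table on both subvolume and inode number, and add a second table keyed on the inode number alone so that bch2_inode_or_descendents_is_open() can find every cached inode with a given number regardless of subvolume. The sketch below illustrates such a secondary index that permits duplicate keys. It is a plain chained hash table, not the kernel's rhltable, and `struct cached_inode`, `inum_index_insert()` and `subvols_for_inum()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_BUCKETS 8

/* Hypothetical cached inode, identified by (subvol, inum). */
struct cached_inode {
	uint32_t		subvol;
	uint64_t		inum;
	struct cached_inode	*next_by_inum;	/* chain in the inum-only index */
};

/* Secondary index keyed on inum only. */
struct inum_index {
	struct cached_inode	*buckets[NR_BUCKETS];
};

static unsigned inum_bucket(uint64_t inum)
{
	return (unsigned)(inum % NR_BUCKETS);
}

/* Duplicate keys are allowed: the same inum may be cached once per subvolume. */
static void inum_index_insert(struct inum_index *idx, struct cached_inode *inode)
{
	assert(idx && inode);

	unsigned b = inum_bucket(inode->inum);

	inode->next_by_inum = idx->buckets[b];
	idx->buckets[b] = inode;
}

/* Collect the subvolumes of every cached inode with this inum; returns the count. */
static size_t subvols_for_inum(const struct inum_index *idx, uint64_t inum,
			       uint32_t *out, size_t max)
{
	size_t nr = 0;

	for (const struct cached_inode *i = idx->buckets[inum_bucket(inum)];
	     i && nr < max; i = i->next_by_inum)
		if (i->inum == inum)
			out[nr++] = i->subvol;
	return nr;
}
```

With a single table hashed on the full (subvol, inum) pair, a lookup by inum alone would have to scan every bucket; the second index keeps that lookup to one chain.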
+1
fs/bcachefs/fs.h
···
 struct bch_inode_info {
 	struct inode v;
 	struct rhash_head hash;
+	struct rhlist_head by_inum_hash;
 	subvol_inum ei_inum;

 	struct list_head ei_vfs_inode_list;
+493 -281
fs/bcachefs/fsck.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 3 3 #include "bcachefs.h" 4 + #include "bcachefs_ioctl.h" 4 5 #include "bkey_buf.h" 5 6 #include "btree_cache.h" 6 7 #include "btree_update.h" ··· 17 16 #include "recovery_passes.h" 18 17 #include "snapshot.h" 19 18 #include "super.h" 19 + #include "thread_with_file.h" 20 20 #include "xattr.h" 21 21 22 22 #include <linux/bsearch.h> ··· 75 73 { 76 74 u64 sectors = 0; 77 75 78 - int ret = for_each_btree_key_upto(trans, iter, BTREE_ID_extents, 76 + int ret = for_each_btree_key_max(trans, iter, BTREE_ID_extents, 79 77 SPOS(inum, 0, snapshot), 80 78 POS(inum, U64_MAX), 81 79 0, k, ({ ··· 92 90 { 93 91 u64 subdirs = 0; 94 92 95 - int ret = for_each_btree_key_upto(trans, iter, BTREE_ID_dirents, 93 + int ret = for_each_btree_key_max(trans, iter, BTREE_ID_dirents, 96 94 SPOS(inum, 0, snapshot), 97 95 POS(inum, U64_MAX), 98 96 0, k, ({ ··· 109 107 u32 *snapshot, u64 *inum) 110 108 { 111 109 struct bch_subvolume s; 112 - int ret = bch2_subvolume_get(trans, subvol, false, 0, &s); 110 + int ret = bch2_subvolume_get(trans, subvol, false, &s); 113 111 114 112 *snapshot = le32_to_cpu(s.snapshot); 115 113 *inum = le64_to_cpu(s.inode); ··· 172 170 if (ret) 173 171 return ret; 174 172 175 - struct bkey_s_c_dirent d = bkey_s_c_to_dirent(bch2_btree_iter_peek_slot(&iter)); 173 + struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k); 176 174 *target = le64_to_cpu(d.v->d_inum); 177 175 *type = d.v->d_type; 178 176 bch2_trans_iter_exit(trans, &iter); ··· 205 203 return ret; 206 204 } 207 205 206 + /* 207 + * Find any subvolume associated with a tree of snapshots 208 + * We can't rely on master_subvol - it might have been deleted. 
209 + */ 210 + static int find_snapshot_tree_subvol(struct btree_trans *trans, 211 + u32 tree_id, u32 *subvol) 212 + { 213 + struct btree_iter iter; 214 + struct bkey_s_c k; 215 + int ret; 216 + 217 + for_each_btree_key_norestart(trans, iter, BTREE_ID_snapshots, POS_MIN, 0, k, ret) { 218 + if (k.k->type != KEY_TYPE_snapshot) 219 + continue; 220 + 221 + struct bkey_s_c_snapshot s = bkey_s_c_to_snapshot(k); 222 + if (le32_to_cpu(s.v->tree) != tree_id) 223 + continue; 224 + 225 + if (s.v->subvol) { 226 + *subvol = le32_to_cpu(s.v->subvol); 227 + goto found; 228 + } 229 + } 230 + ret = -BCH_ERR_ENOENT_no_snapshot_tree_subvol; 231 + found: 232 + bch2_trans_iter_exit(trans, &iter); 233 + return ret; 234 + } 235 + 208 236 /* Get lost+found, create if it doesn't exist: */ 209 237 static int lookup_lostfound(struct btree_trans *trans, u32 snapshot, 210 238 struct bch_inode_unpacked *lostfound, ··· 242 210 { 243 211 struct bch_fs *c = trans->c; 244 212 struct qstr lostfound_str = QSTR("lost+found"); 213 + struct btree_iter lostfound_iter = { NULL }; 245 214 u64 inum = 0; 246 215 unsigned d_type = 0; 247 216 int ret; ··· 253 220 if (ret) 254 221 return ret; 255 222 256 - subvol_inum root_inum = { .subvol = le32_to_cpu(st.master_subvol) }; 223 + u32 subvolid; 224 + ret = find_snapshot_tree_subvol(trans, 225 + bch2_snapshot_tree(c, snapshot), &subvolid); 226 + bch_err_msg(c, ret, "finding subvol associated with snapshot tree %u", 227 + bch2_snapshot_tree(c, snapshot)); 228 + if (ret) 229 + return ret; 257 230 258 231 struct bch_subvolume subvol; 259 - ret = bch2_subvolume_get(trans, le32_to_cpu(st.master_subvol), 260 - false, 0, &subvol); 261 - bch_err_msg(c, ret, "looking up root subvol %u for snapshot %u", 262 - le32_to_cpu(st.master_subvol), snapshot); 232 + ret = bch2_subvolume_get(trans, subvolid, false, &subvol); 233 + bch_err_msg(c, ret, "looking up subvol %u for snapshot %u", subvolid, snapshot); 263 234 if (ret) 264 235 return ret; 265 236 266 237 if (!subvol.inode) { 
267 238 struct btree_iter iter; 268 239 struct bkey_i_subvolume *subvol = bch2_bkey_get_mut_typed(trans, &iter, 269 - BTREE_ID_subvolumes, POS(0, le32_to_cpu(st.master_subvol)), 240 + BTREE_ID_subvolumes, POS(0, subvolid), 270 241 0, subvolume); 271 242 ret = PTR_ERR_OR_ZERO(subvol); 272 243 if (ret) ··· 280 243 bch2_trans_iter_exit(trans, &iter); 281 244 } 282 245 283 - root_inum.inum = le64_to_cpu(subvol.inode); 246 + subvol_inum root_inum = { 247 + .subvol = subvolid, 248 + .inum = le64_to_cpu(subvol.inode) 249 + }; 284 250 285 251 struct bch_inode_unpacked root_inode; 286 252 struct bch_hash_info root_hash_info; 287 253 ret = lookup_inode(trans, root_inum.inum, snapshot, &root_inode); 288 254 bch_err_msg(c, ret, "looking up root inode %llu for subvol %u", 289 - root_inum.inum, le32_to_cpu(st.master_subvol)); 255 + root_inum.inum, subvolid); 290 256 if (ret) 291 257 return ret; 292 258 ··· 328 288 * XXX: we could have a nicer log message here if we had a nice way to 329 289 * walk backpointers to print a path 330 290 */ 331 - bch_notice(c, "creating lost+found in subvol %llu snapshot %u", 332 - root_inum.subvol, le32_to_cpu(st.root_snapshot)); 291 + struct printbuf path = PRINTBUF; 292 + ret = bch2_inum_to_path(trans, root_inum, &path); 293 + if (ret) 294 + goto err; 295 + 296 + bch_notice(c, "creating %s/lost+found in subvol %llu snapshot %u", 297 + path.buf, root_inum.subvol, snapshot); 298 + printbuf_exit(&path); 333 299 334 300 u64 now = bch2_current_time(c); 335 - struct btree_iter lostfound_iter = { NULL }; 336 301 u64 cpu = raw_smp_processor_id(); 337 302 338 303 bch2_inode_init_early(c, lostfound); ··· 496 451 continue; 497 452 498 453 struct bch_inode_unpacked child_inode; 499 - bch2_inode_unpack(k, &child_inode); 454 + ret = bch2_inode_unpack(k, &child_inode); 455 + if (ret) 456 + break; 500 457 501 458 if (!inode_should_reattach(&child_inode)) { 502 459 ret = maybe_delete_dirent(trans, ··· 529 482 return ret; 530 483 } 531 484 485 + static struct 
bkey_s_c_dirent dirent_get_by_pos(struct btree_trans *trans, 486 + struct btree_iter *iter, 487 + struct bpos pos) 488 + { 489 + return bch2_bkey_get_iter_typed(trans, iter, BTREE_ID_dirents, pos, 0, dirent); 490 + } 491 + 532 492 static int remove_backpointer(struct btree_trans *trans, 533 493 struct bch_inode_unpacked *inode) 534 494 { ··· 544 490 545 491 struct bch_fs *c = trans->c; 546 492 struct btree_iter iter; 547 - struct bkey_s_c_dirent d = 548 - bch2_bkey_get_iter_typed(trans, &iter, BTREE_ID_dirents, 549 - SPOS(inode->bi_dir, inode->bi_dir_offset, inode->bi_snapshot), 0, 550 - dirent); 551 - int ret = bkey_err(d) ?: 552 - dirent_points_to_inode(c, d, inode) ?: 553 - __remove_dirent(trans, d.k->p); 493 + struct bkey_s_c_dirent d = dirent_get_by_pos(trans, &iter, 494 + SPOS(inode->bi_dir, inode->bi_dir_offset, inode->bi_snapshot)); 495 + int ret = bkey_err(d) ?: 496 + dirent_points_to_inode(c, d, inode) ?: 497 + __remove_dirent(trans, d.k->p); 554 498 bch2_trans_iter_exit(trans, &iter); 555 499 return ret; 556 500 } ··· 665 613 struct btree_iter iter = {}; 666 614 667 615 bch2_trans_iter_init(trans, &iter, BTREE_ID_extents, SPOS(inum, U64_MAX, snapshot), 0); 668 - struct bkey_s_c k = bch2_btree_iter_peek_prev(&iter); 616 + struct bkey_s_c k = bch2_btree_iter_peek_prev_min(&iter, POS(inum, 0)); 669 617 bch2_trans_iter_exit(trans, &iter); 670 618 int ret = bkey_err(k); 671 619 if (ret) ··· 832 780 struct bpos last_pos; 833 781 834 782 DARRAY(struct inode_walker_entry) inodes; 783 + snapshot_id_list deletes; 835 784 }; 836 785 837 786 static void inode_walker_exit(struct inode_walker *w) 838 787 { 839 788 darray_exit(&w->inodes); 789 + darray_exit(&w->deletes); 840 790 } 841 791 842 792 static struct inode_walker inode_walker_init(void) ··· 851 797 { 852 798 struct bch_inode_unpacked u; 853 799 854 - BUG_ON(bch2_inode_unpack(inode, &u)); 855 - 856 - return darray_push(&w->inodes, ((struct inode_walker_entry) { 800 + return bch2_inode_unpack(inode, &u) ?: 801 
+ darray_push(&w->inodes, ((struct inode_walker_entry) { 857 802 .inode = u, 858 803 .snapshot = inode.k->p.snapshot, 859 804 })); ··· 962 909 int ret; 963 910 964 911 w->inodes.nr = 0; 912 + w->deletes.nr = 0; 965 913 966 - for_each_btree_key_norestart(trans, iter, BTREE_ID_inodes, POS(0, inum), 914 + for_each_btree_key_reverse_norestart(trans, iter, BTREE_ID_inodes, SPOS(0, inum, s->pos.snapshot), 967 915 BTREE_ITER_all_snapshots, k, ret) { 968 916 if (k.k->p.offset != inum) 969 917 break; ··· 972 918 if (!ref_visible(c, s, s->pos.snapshot, k.k->p.snapshot)) 973 919 continue; 974 920 975 - if (bkey_is_inode(k.k)) 976 - add_inode(c, w, k); 921 + if (snapshot_list_has_ancestor(c, &w->deletes, k.k->p.snapshot)) 922 + continue; 977 923 978 - if (k.k->p.snapshot >= s->pos.snapshot) 924 + ret = bkey_is_inode(k.k) 925 + ? add_inode(c, w, k) 926 + : snapshot_list_add(c, &w->deletes, k.k->p.snapshot); 927 + if (ret) 979 928 break; 980 929 } 981 930 bch2_trans_iter_exit(trans, &iter); ··· 986 929 return ret; 987 930 } 988 931 989 - static int dirent_has_target(struct btree_trans *trans, struct bkey_s_c_dirent d) 990 - { 991 - if (d.v->d_type == DT_SUBVOL) { 992 - u32 snap; 993 - u64 inum; 994 - int ret = subvol_lookup(trans, le32_to_cpu(d.v->d_child_subvol), &snap, &inum); 995 - if (ret && !bch2_err_matches(ret, ENOENT)) 996 - return ret; 997 - return !ret; 998 - } else { 999 - struct btree_iter iter; 1000 - struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_inodes, 1001 - SPOS(0, le64_to_cpu(d.v->d_inum), d.k->p.snapshot), 0); 1002 - int ret = bkey_err(k); 1003 - if (ret) 1004 - return ret; 1005 - 1006 - ret = bkey_is_inode(k.k); 1007 - bch2_trans_iter_exit(trans, &iter); 1008 - return ret; 1009 - } 1010 - } 1011 - 1012 932 /* 1013 933 * Prefer to delete the first one, since that will be the one at the wrong 1014 934 * offset: 1015 935 * return value: 0 -> delete k1, 1 -> delete k2 1016 936 */ 1017 - static int hash_pick_winner(struct btree_trans *trans, 1018 - 
const struct bch_hash_desc desc, 1019 - struct bch_hash_info *hash_info, 1020 - struct bkey_s_c k1, 1021 - struct bkey_s_c k2) 1022 - { 1023 - if (bkey_val_bytes(k1.k) == bkey_val_bytes(k2.k) && 1024 - !memcmp(k1.v, k2.v, bkey_val_bytes(k1.k))) 1025 - return 0; 1026 - 1027 - switch (desc.btree_id) { 1028 - case BTREE_ID_dirents: { 1029 - int ret = dirent_has_target(trans, bkey_s_c_to_dirent(k1)); 1030 - if (ret < 0) 1031 - return ret; 1032 - if (!ret) 1033 - return 0; 1034 - 1035 - ret = dirent_has_target(trans, bkey_s_c_to_dirent(k2)); 1036 - if (ret < 0) 1037 - return ret; 1038 - if (!ret) 1039 - return 1; 1040 - return 2; 1041 - } 1042 - default: 1043 - return 0; 1044 - } 1045 - } 1046 - 1047 - static int fsck_update_backpointers(struct btree_trans *trans, 1048 - struct snapshots_seen *s, 1049 - const struct bch_hash_desc desc, 1050 - struct bch_hash_info *hash_info, 1051 - struct bkey_i *new) 937 + int bch2_fsck_update_backpointers(struct btree_trans *trans, 938 + struct snapshots_seen *s, 939 + const struct bch_hash_desc desc, 940 + struct bch_hash_info *hash_info, 941 + struct bkey_i *new) 1052 942 { 1053 943 if (new->k.type != KEY_TYPE_dirent) 1054 944 return 0; ··· 1021 1017 err: 1022 1018 inode_walker_exit(&target); 1023 1019 return ret; 1024 - } 1025 - 1026 - static int fsck_rename_dirent(struct btree_trans *trans, 1027 - struct snapshots_seen *s, 1028 - const struct bch_hash_desc desc, 1029 - struct bch_hash_info *hash_info, 1030 - struct bkey_s_c_dirent old) 1031 - { 1032 - struct qstr old_name = bch2_dirent_get_name(old); 1033 - struct bkey_i_dirent *new = bch2_trans_kmalloc(trans, bkey_bytes(old.k) + 32); 1034 - int ret = PTR_ERR_OR_ZERO(new); 1035 - if (ret) 1036 - return ret; 1037 - 1038 - bkey_dirent_init(&new->k_i); 1039 - dirent_copy_target(new, old); 1040 - new->k.p = old.k->p; 1041 - 1042 - for (unsigned i = 0; i < 1000; i++) { 1043 - unsigned len = sprintf(new->v.d_name, "%.*s.fsck_renamed-%u", 1044 - old_name.len, old_name.name, i); 1045 - 
unsigned u64s = BKEY_U64s + dirent_val_u64s(len); 1046 - 1047 - if (u64s > U8_MAX) 1048 - return -EINVAL; 1049 - 1050 - new->k.u64s = u64s; 1051 - 1052 - ret = bch2_hash_set_in_snapshot(trans, bch2_dirent_hash_desc, hash_info, 1053 - (subvol_inum) { 0, old.k->p.inode }, 1054 - old.k->p.snapshot, &new->k_i, 1055 - BTREE_UPDATE_internal_snapshot_node); 1056 - if (!bch2_err_matches(ret, EEXIST)) 1057 - break; 1058 - } 1059 - 1060 - if (ret) 1061 - return ret; 1062 - 1063 - return fsck_update_backpointers(trans, s, desc, hash_info, &new->k_i); 1064 - } 1065 - 1066 - static int hash_check_key(struct btree_trans *trans, 1067 - struct snapshots_seen *s, 1068 - const struct bch_hash_desc desc, 1069 - struct bch_hash_info *hash_info, 1070 - struct btree_iter *k_iter, struct bkey_s_c hash_k) 1071 - { 1072 - struct bch_fs *c = trans->c; 1073 - struct btree_iter iter = { NULL }; 1074 - struct printbuf buf = PRINTBUF; 1075 - struct bkey_s_c k; 1076 - u64 hash; 1077 - int ret = 0; 1078 - 1079 - if (hash_k.k->type != desc.key_type) 1080 - return 0; 1081 - 1082 - hash = desc.hash_bkey(hash_info, hash_k); 1083 - 1084 - if (likely(hash == hash_k.k->p.offset)) 1085 - return 0; 1086 - 1087 - if (hash_k.k->p.offset < hash) 1088 - goto bad_hash; 1089 - 1090 - for_each_btree_key_norestart(trans, iter, desc.btree_id, 1091 - SPOS(hash_k.k->p.inode, hash, hash_k.k->p.snapshot), 1092 - BTREE_ITER_slots, k, ret) { 1093 - if (bkey_eq(k.k->p, hash_k.k->p)) 1094 - break; 1095 - 1096 - if (k.k->type == desc.key_type && 1097 - !desc.cmp_bkey(k, hash_k)) 1098 - goto duplicate_entries; 1099 - 1100 - if (bkey_deleted(k.k)) { 1101 - bch2_trans_iter_exit(trans, &iter); 1102 - goto bad_hash; 1103 - } 1104 - } 1105 - out: 1106 - bch2_trans_iter_exit(trans, &iter); 1107 - printbuf_exit(&buf); 1108 - return ret; 1109 - bad_hash: 1110 - if (fsck_err(trans, hash_table_key_wrong_offset, 1111 - "hash table key at wrong offset: btree %s inode %llu offset %llu, hashed to %llu\n %s", 1112 - 
bch2_btree_id_str(desc.btree_id), hash_k.k->p.inode, hash_k.k->p.offset, hash, 1113 - (printbuf_reset(&buf), 1114 - bch2_bkey_val_to_text(&buf, c, hash_k), buf.buf))) { 1115 - struct bkey_i *new = bch2_bkey_make_mut_noupdate(trans, hash_k); 1116 - if (IS_ERR(new)) 1117 - return PTR_ERR(new); 1118 - 1119 - k = bch2_hash_set_or_get_in_snapshot(trans, &iter, desc, hash_info, 1120 - (subvol_inum) { 0, hash_k.k->p.inode }, 1121 - hash_k.k->p.snapshot, new, 1122 - STR_HASH_must_create| 1123 - BTREE_ITER_with_updates| 1124 - BTREE_UPDATE_internal_snapshot_node); 1125 - ret = bkey_err(k); 1126 - if (ret) 1127 - goto out; 1128 - if (k.k) 1129 - goto duplicate_entries; 1130 - 1131 - ret = bch2_hash_delete_at(trans, desc, hash_info, k_iter, 1132 - BTREE_UPDATE_internal_snapshot_node) ?: 1133 - fsck_update_backpointers(trans, s, desc, hash_info, new) ?: 1134 - bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc) ?: 1135 - -BCH_ERR_transaction_restart_nested; 1136 - goto out; 1137 - } 1138 - fsck_err: 1139 - goto out; 1140 - duplicate_entries: 1141 - ret = hash_pick_winner(trans, desc, hash_info, hash_k, k); 1142 - if (ret < 0) 1143 - goto out; 1144 - 1145 - if (!fsck_err(trans, hash_table_key_duplicate, 1146 - "duplicate hash table keys%s:\n%s", 1147 - ret != 2 ? 
"" : ", both point to valid inodes", 1148 - (printbuf_reset(&buf), 1149 - bch2_bkey_val_to_text(&buf, c, hash_k), 1150 - prt_newline(&buf), 1151 - bch2_bkey_val_to_text(&buf, c, k), 1152 - buf.buf))) 1153 - goto out; 1154 - 1155 - switch (ret) { 1156 - case 0: 1157 - ret = bch2_hash_delete_at(trans, desc, hash_info, k_iter, 0); 1158 - break; 1159 - case 1: 1160 - ret = bch2_hash_delete_at(trans, desc, hash_info, &iter, 0); 1161 - break; 1162 - case 2: 1163 - ret = fsck_rename_dirent(trans, s, desc, hash_info, bkey_s_c_to_dirent(hash_k)) ?: 1164 - bch2_hash_delete_at(trans, desc, hash_info, k_iter, 0); 1165 - goto out; 1166 - } 1167 - 1168 - ret = bch2_trans_commit(trans, NULL, NULL, 0) ?: 1169 - -BCH_ERR_transaction_restart_nested; 1170 - goto out; 1171 - } 1172 - 1173 - static struct bkey_s_c_dirent dirent_get_by_pos(struct btree_trans *trans, 1174 - struct btree_iter *iter, 1175 - struct bpos pos) 1176 - { 1177 - return bch2_bkey_get_iter_typed(trans, iter, BTREE_ID_dirents, pos, 0, dirent); 1178 1020 } 1179 1021 1180 1022 static struct bkey_s_c_dirent inode_get_dirent(struct btree_trans *trans, ··· 1110 1260 goto err; 1111 1261 BUG(); 1112 1262 found_root: 1113 - BUG_ON(bch2_inode_unpack(k, root)); 1263 + ret = bch2_inode_unpack(k, root); 1114 1264 err: 1115 1265 bch2_trans_iter_exit(trans, &iter); 1266 + return ret; 1267 + } 1268 + 1269 + static int check_directory_size(struct btree_trans *trans, 1270 + struct bch_inode_unpacked *inode_u, 1271 + struct bkey_s_c inode_k, bool *write_inode) 1272 + { 1273 + struct btree_iter iter; 1274 + struct bkey_s_c k; 1275 + u64 new_size = 0; 1276 + int ret; 1277 + 1278 + for_each_btree_key_max_norestart(trans, iter, BTREE_ID_dirents, 1279 + SPOS(inode_k.k->p.offset, 0, inode_k.k->p.snapshot), 1280 + POS(inode_k.k->p.offset, U64_MAX), 1281 + 0, k, ret) { 1282 + if (k.k->type != KEY_TYPE_dirent) 1283 + continue; 1284 + 1285 + struct bkey_s_c_dirent dirent = bkey_s_c_to_dirent(k); 1286 + struct qstr name = 
bch2_dirent_get_name(dirent); 1287 + 1288 + new_size += dirent_occupied_size(&name); 1289 + } 1290 + bch2_trans_iter_exit(trans, &iter); 1291 + 1292 + if (!ret && inode_u->bi_size != new_size) { 1293 + inode_u->bi_size = new_size; 1294 + *write_inode = true; 1295 + } 1296 + 1116 1297 return ret; 1117 1298 } 1118 1299 ··· 1172 1291 if (!bkey_is_inode(k.k)) 1173 1292 return 0; 1174 1293 1175 - BUG_ON(bch2_inode_unpack(k, &u)); 1294 + ret = bch2_inode_unpack(k, &u); 1295 + if (ret) 1296 + goto err; 1176 1297 1177 1298 if (snapshot_root->bi_inum != u.bi_inum) { 1178 1299 ret = get_snapshot_root_inode(trans, snapshot_root, u.bi_inum); ··· 1185 1302 if (fsck_err_on(u.bi_hash_seed != snapshot_root->bi_hash_seed || 1186 1303 INODE_STR_HASH(&u) != INODE_STR_HASH(snapshot_root), 1187 1304 trans, inode_snapshot_mismatch, 1188 - "inodes in different snapshots don't match")) { 1305 + "inode hash info in different snapshots don't match")) { 1189 1306 u.bi_hash_seed = snapshot_root->bi_hash_seed; 1190 1307 SET_INODE_STR_HASH(&u, INODE_STR_HASH(snapshot_root)); 1191 1308 do_update = true; ··· 1275 1392 1276 1393 if (fsck_err_on(!ret, 1277 1394 trans, inode_unlinked_and_not_open, 1278 - "inode %llu%u unlinked and not open", 1395 + "inode %llu:%u unlinked and not open", 1279 1396 u.bi_inum, u.bi_snapshot)) { 1280 1397 ret = bch2_inode_rm_snapshot(trans, u.bi_inum, iter->pos.snapshot); 1281 1398 bch_err_msg(c, ret, "in fsck deleting inode"); ··· 1298 1415 if (u.bi_subvol) { 1299 1416 struct bch_subvolume s; 1300 1417 1301 - ret = bch2_subvolume_get(trans, u.bi_subvol, false, 0, &s); 1418 + ret = bch2_subvolume_get(trans, u.bi_subvol, false, &s); 1302 1419 if (ret && !bch2_err_matches(ret, ENOENT)) 1303 1420 goto err; 1304 1421 ··· 1323 1440 u.bi_parent_subvol = 0; 1324 1441 do_update = true; 1325 1442 } 1443 + } 1444 + 1445 + if (fsck_err_on(u.bi_journal_seq > journal_cur_seq(&c->journal), 1446 + trans, inode_journal_seq_in_future, 1447 + "inode journal seq in future (currently at 
%llu)\n%s", 1448 + journal_cur_seq(&c->journal), 1449 + (printbuf_reset(&buf), 1450 + bch2_inode_unpacked_to_text(&buf, &u), 1451 + buf.buf))) { 1452 + u.bi_journal_seq = journal_cur_seq(&c->journal); 1453 + do_update = true; 1454 + } 1455 + 1456 + if (S_ISDIR(u.bi_mode)) { 1457 + ret = check_directory_size(trans, &u, k, &do_update); 1458 + 1459 + fsck_err_on(ret, 1460 + trans, directory_size_mismatch, 1461 + "directory inode %llu:%u with the mismatch directory size", 1462 + u.bi_inum, k.k->p.snapshot); 1463 + ret = 0; 1326 1464 } 1327 1465 do_update: 1328 1466 if (do_update) { ··· 1406 1502 break; 1407 1503 1408 1504 struct bch_inode_unpacked parent_inode; 1409 - bch2_inode_unpack(k, &parent_inode); 1505 + ret = bch2_inode_unpack(k, &parent_inode); 1506 + if (ret) 1507 + break; 1410 1508 1411 1509 if (!inode_should_reattach(&parent_inode)) 1412 1510 break; ··· 1431 1525 return 0; 1432 1526 1433 1527 struct bch_inode_unpacked inode; 1434 - BUG_ON(bch2_inode_unpack(k, &inode)); 1528 + ret = bch2_inode_unpack(k, &inode); 1529 + if (ret) 1530 + return ret; 1435 1531 1436 1532 if (!inode_should_reattach(&inode)) 1437 1533 return 0; ··· 1557 1649 if (i->count != count2) { 1558 1650 bch_err_ratelimited(c, "fsck counted i_sectors wrong for inode %llu:%u: got %llu should be %llu", 1559 1651 w->last_pos.inode, i->snapshot, i->count, count2); 1560 - return -BCH_ERR_internal_fsck_err; 1652 + i->count = count2; 1561 1653 } 1562 1654 1563 1655 if (fsck_err_on(!(i->inode.bi_flags & BCH_INODE_i_sectors_dirty), ··· 1661 1753 bch2_trans_iter_init(trans, &iter1, btree, pos1, 1662 1754 BTREE_ITER_all_snapshots| 1663 1755 BTREE_ITER_not_extents); 1664 - k1 = bch2_btree_iter_peek_upto(&iter1, POS(pos1.inode, U64_MAX)); 1756 + k1 = bch2_btree_iter_peek_max(&iter1, POS(pos1.inode, U64_MAX)); 1665 1757 ret = bkey_err(k1); 1666 1758 if (ret) 1667 1759 goto err; ··· 1686 1778 while (1) { 1687 1779 bch2_btree_iter_advance(&iter2); 1688 1780 1689 - k2 = bch2_btree_iter_peek_upto(&iter2, 
POS(pos1.inode, U64_MAX)); 1781 + k2 = bch2_btree_iter_peek_max(&iter2, POS(pos1.inode, U64_MAX)); 1690 1782 ret = bkey_err(k2); 1691 1783 if (ret) 1692 1784 goto err; ··· 2064 2156 return __bch2_fsck_write_inode(trans, target); 2065 2157 } 2066 2158 2067 - if (bch2_inode_should_have_bp(target) && 2159 + if (bch2_inode_should_have_single_bp(target) && 2068 2160 !fsck_err(trans, inode_wrong_backpointer, 2069 2161 "dirent points to inode that does not point back:\n %s", 2070 2162 (bch2_bkey_val_to_text(&buf, c, d.s_c), ··· 2388 2480 *hash_info = bch2_hash_info_init(c, &i->inode); 2389 2481 dir->first_this_inode = false; 2390 2482 2391 - ret = hash_check_key(trans, s, bch2_dirent_hash_desc, hash_info, iter, k); 2483 + ret = bch2_str_hash_check_key(trans, s, &bch2_dirent_hash_desc, hash_info, iter, k); 2392 2484 if (ret < 0) 2393 2485 goto err; 2394 2486 if (ret) { ··· 2427 2519 if (ret) 2428 2520 goto err; 2429 2521 } 2522 + 2523 + darray_for_each(target->deletes, i) 2524 + if (fsck_err_on(!snapshot_list_has_id(&s->ids, *i), 2525 + trans, dirent_to_overwritten_inode, 2526 + "dirent points to inode overwritten in snapshot %u:\n%s", 2527 + *i, 2528 + (printbuf_reset(&buf), 2529 + bch2_bkey_val_to_text(&buf, c, k), 2530 + buf.buf))) { 2531 + struct btree_iter delete_iter; 2532 + bch2_trans_iter_init(trans, &delete_iter, 2533 + BTREE_ID_dirents, 2534 + SPOS(k.k->p.inode, k.k->p.offset, *i), 2535 + BTREE_ITER_intent); 2536 + ret = bch2_btree_iter_traverse(&delete_iter) ?: 2537 + bch2_hash_delete_at(trans, bch2_dirent_hash_desc, 2538 + hash_info, 2539 + &delete_iter, 2540 + BTREE_UPDATE_internal_snapshot_node); 2541 + bch2_trans_iter_exit(trans, &delete_iter); 2542 + if (ret) 2543 + goto err; 2544 + 2545 + } 2430 2546 } 2431 2547 2432 2548 ret = bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc); ··· 2526 2594 *hash_info = bch2_hash_info_init(c, &i->inode); 2527 2595 inode->first_this_inode = false; 2528 2596 2529 - ret = hash_check_key(trans, NULL, 
bch2_xattr_hash_desc, hash_info, iter, k); 2597 + ret = bch2_str_hash_check_key(trans, NULL, &bch2_xattr_hash_desc, hash_info, iter, k); 2530 2598 bch_err_fn(c, ret); 2531 2599 return ret; 2532 2600 } ··· 2706 2774 2707 2775 typedef DARRAY(struct pathbuf_entry) pathbuf; 2708 2776 2777 + static int bch2_bi_depth_renumber_one(struct btree_trans *trans, struct pathbuf_entry *p, 2778 + u32 new_depth) 2779 + { 2780 + struct btree_iter iter; 2781 + struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_inodes, 2782 + SPOS(0, p->inum, p->snapshot), 0); 2783 + 2784 + struct bch_inode_unpacked inode; 2785 + int ret = bkey_err(k) ?: 2786 + !bkey_is_inode(k.k) ? -BCH_ERR_ENOENT_inode 2787 + : bch2_inode_unpack(k, &inode); 2788 + if (ret) 2789 + goto err; 2790 + 2791 + if (inode.bi_depth != new_depth) { 2792 + inode.bi_depth = new_depth; 2793 + ret = __bch2_fsck_write_inode(trans, &inode) ?: 2794 + bch2_trans_commit(trans, NULL, NULL, 0); 2795 + } 2796 + err: 2797 + bch2_trans_iter_exit(trans, &iter); 2798 + return ret; 2799 + } 2800 + 2801 + static int bch2_bi_depth_renumber(struct btree_trans *trans, pathbuf *path, u32 new_bi_depth) 2802 + { 2803 + u32 restart_count = trans->restart_count; 2804 + int ret = 0; 2805 + 2806 + darray_for_each_reverse(*path, i) { 2807 + ret = nested_lockrestart_do(trans, 2808 + bch2_bi_depth_renumber_one(trans, i, new_bi_depth)); 2809 + bch_err_fn(trans->c, ret); 2810 + if (ret) 2811 + break; 2812 + 2813 + new_bi_depth++; 2814 + } 2815 + 2816 + return ret ?: trans_was_restarted(trans, restart_count); 2817 + } 2818 + 2709 2819 static bool path_is_dup(pathbuf *p, u64 inum, u32 snapshot) 2710 2820 { 2711 2821 darray_for_each(*p, i) ··· 2757 2783 return false; 2758 2784 } 2759 2785 2760 - static int check_path(struct btree_trans *trans, pathbuf *p, struct bkey_s_c inode_k) 2786 + static int check_path_loop(struct btree_trans *trans, struct bkey_s_c inode_k) 2761 2787 { 2762 2788 struct bch_fs *c = trans->c; 2763 2789 struct btree_iter 
inode_iter = {}; 2764 - struct bch_inode_unpacked inode; 2790 + pathbuf path = {}; 2765 2791 struct printbuf buf = PRINTBUF; 2766 2792 u32 snapshot = inode_k.k->p.snapshot; 2793 + bool redo_bi_depth = false; 2794 + u32 min_bi_depth = U32_MAX; 2767 2795 int ret = 0; 2768 2796 2769 - p->nr = 0; 2770 - 2771 - BUG_ON(bch2_inode_unpack(inode_k, &inode)); 2772 - 2773 - if (!S_ISDIR(inode.bi_mode)) 2774 - return 0; 2797 + struct bch_inode_unpacked inode; 2798 + ret = bch2_inode_unpack(inode_k, &inode); 2799 + if (ret) 2800 + return ret; 2775 2801 2776 2802 while (!inode.bi_subvol) { 2777 2803 struct btree_iter dirent_iter; ··· 2781 2807 d = inode_get_dirent(trans, &dirent_iter, &inode, &parent_snapshot); 2782 2808 ret = bkey_err(d.s_c); 2783 2809 if (ret && !bch2_err_matches(ret, ENOENT)) 2784 - break; 2810 + goto out; 2785 2811 2786 2812 if (!ret && (ret = dirent_points_to_inode(c, d, &inode))) 2787 2813 bch2_trans_iter_exit(trans, &dirent_iter); ··· 2796 2822 2797 2823 bch2_trans_iter_exit(trans, &dirent_iter); 2798 2824 2799 - ret = darray_push(p, ((struct pathbuf_entry) { 2825 + ret = darray_push(&path, ((struct pathbuf_entry) { 2800 2826 .inum = inode.bi_inum, 2801 2827 .snapshot = snapshot, 2802 2828 })); ··· 2808 2834 bch2_trans_iter_exit(trans, &inode_iter); 2809 2835 inode_k = bch2_bkey_get_iter(trans, &inode_iter, BTREE_ID_inodes, 2810 2836 SPOS(0, inode.bi_dir, snapshot), 0); 2837 + 2838 + struct bch_inode_unpacked parent_inode; 2811 2839 ret = bkey_err(inode_k) ?: 2812 2840 !bkey_is_inode(inode_k.k) ? 
-BCH_ERR_ENOENT_inode 2813 - : bch2_inode_unpack(inode_k, &inode); 2841 + : bch2_inode_unpack(inode_k, &parent_inode); 2814 2842 if (ret) { 2815 2843 /* Should have been caught in dirents pass */ 2816 2844 bch_err_msg(c, ret, "error looking up parent directory"); 2817 - break; 2845 + goto out; 2818 2846 } 2819 2847 2820 - snapshot = inode_k.k->p.snapshot; 2848 + min_bi_depth = parent_inode.bi_depth; 2821 2849 2822 - if (path_is_dup(p, inode.bi_inum, snapshot)) { 2850 + if (parent_inode.bi_depth < inode.bi_depth && 2851 + min_bi_depth < U16_MAX) 2852 + break; 2853 + 2854 + inode = parent_inode; 2855 + snapshot = inode_k.k->p.snapshot; 2856 + redo_bi_depth = true; 2857 + 2858 + if (path_is_dup(&path, inode.bi_inum, snapshot)) { 2823 2859 /* XXX print path */ 2824 2860 bch_err(c, "directory structure loop"); 2825 2861 2826 - darray_for_each(*p, i) 2862 + darray_for_each(path, i) 2827 2863 pr_err("%llu:%u", i->inum, i->snapshot); 2828 2864 pr_err("%llu:%u", inode.bi_inum, snapshot); 2829 2865 ··· 2846 2862 ret = reattach_inode(trans, &inode); 2847 2863 bch_err_msg(c, ret, "reattaching inode %llu", inode.bi_inum); 2848 2864 } 2849 - break; 2865 + 2866 + goto out; 2850 2867 } 2851 2868 } 2869 + 2870 + if (inode.bi_subvol) 2871 + min_bi_depth = 0; 2872 + 2873 + if (redo_bi_depth) 2874 + ret = bch2_bi_depth_renumber(trans, &path, min_bi_depth); 2852 2875 out: 2853 2876 fsck_err: 2854 2877 bch2_trans_iter_exit(trans, &inode_iter); 2878 + darray_exit(&path); 2855 2879 printbuf_exit(&buf); 2856 2880 bch_err_fn(c, ret); 2857 2881 return ret; ··· 2871 2879 */ 2872 2880 int bch2_check_directory_structure(struct bch_fs *c) 2873 2881 { 2874 - pathbuf path = { 0, }; 2875 - int ret; 2876 - 2877 - ret = bch2_trans_run(c, 2882 + int ret = bch2_trans_run(c, 2878 2883 for_each_btree_key_commit(trans, iter, BTREE_ID_inodes, POS_MIN, 2879 2884 BTREE_ITER_intent| 2880 2885 BTREE_ITER_prefetch| 2881 2886 BTREE_ITER_all_snapshots, k, 2882 2887 NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({ 2883 
- if (!bkey_is_inode(k.k)) 2888 + if (!S_ISDIR(bkey_inode_mode(k))) 2884 2889 continue; 2885 2890 2886 2891 if (bch2_inode_flags(k) & BCH_INODE_unlinked) 2887 2892 continue; 2888 2893 2889 - check_path(trans, &path, k); 2894 + check_path_loop(trans, k); 2890 2895 }))); 2891 - darray_exit(&path); 2892 2896 2893 2897 bch_err_fn(c, ret); 2894 2898 return ret; ··· 2982 2994 2983 2995 /* Should never fail, checked by bch2_inode_invalid: */ 2984 2996 struct bch_inode_unpacked u; 2985 - BUG_ON(bch2_inode_unpack(k, &u)); 2997 + _ret3 = bch2_inode_unpack(k, &u); 2998 + if (_ret3) 2999 + break; 2986 3000 2987 3001 /* 2988 3002 * Backpointer and directory structure checks are sufficient for ··· 3062 3072 if (!bkey_is_inode(k.k)) 3063 3073 return 0; 3064 3074 3065 - BUG_ON(bch2_inode_unpack(k, &u)); 3075 + ret = bch2_inode_unpack(k, &u); 3076 + if (ret) 3077 + return ret; 3066 3078 3067 3079 if (S_ISDIR(u.bi_mode)) 3068 3080 return 0; ··· 3186 3194 bch_err_fn(c, ret); 3187 3195 return ret; 3188 3196 } 3197 + 3198 + #ifndef NO_BCACHEFS_CHARDEV 3199 + 3200 + struct fsck_thread { 3201 + struct thread_with_stdio thr; 3202 + struct bch_fs *c; 3203 + struct bch_opts opts; 3204 + }; 3205 + 3206 + static void bch2_fsck_thread_exit(struct thread_with_stdio *_thr) 3207 + { 3208 + struct fsck_thread *thr = container_of(_thr, struct fsck_thread, thr); 3209 + kfree(thr); 3210 + } 3211 + 3212 + static int bch2_fsck_offline_thread_fn(struct thread_with_stdio *stdio) 3213 + { 3214 + struct fsck_thread *thr = container_of(stdio, struct fsck_thread, thr); 3215 + struct bch_fs *c = thr->c; 3216 + 3217 + int ret = PTR_ERR_OR_ZERO(c); 3218 + if (ret) 3219 + return ret; 3220 + 3221 + ret = bch2_fs_start(thr->c); 3222 + if (ret) 3223 + goto err; 3224 + 3225 + if (test_bit(BCH_FS_errors_fixed, &c->flags)) { 3226 + bch2_stdio_redirect_printf(&stdio->stdio, false, "%s: errors fixed\n", c->name); 3227 + ret |= 1; 3228 + } 3229 + if (test_bit(BCH_FS_error, &c->flags)) { 3230 + 
bch2_stdio_redirect_printf(&stdio->stdio, false, "%s: still has errors\n", c->name); 3231 + ret |= 4; 3232 + } 3233 + err: 3234 + bch2_fs_stop(c); 3235 + return ret; 3236 + } 3237 + 3238 + static const struct thread_with_stdio_ops bch2_offline_fsck_ops = { 3239 + .exit = bch2_fsck_thread_exit, 3240 + .fn = bch2_fsck_offline_thread_fn, 3241 + }; 3242 + 3243 + long bch2_ioctl_fsck_offline(struct bch_ioctl_fsck_offline __user *user_arg) 3244 + { 3245 + struct bch_ioctl_fsck_offline arg; 3246 + struct fsck_thread *thr = NULL; 3247 + darray_str(devs) = {}; 3248 + long ret = 0; 3249 + 3250 + if (copy_from_user(&arg, user_arg, sizeof(arg))) 3251 + return -EFAULT; 3252 + 3253 + if (arg.flags) 3254 + return -EINVAL; 3255 + 3256 + if (!capable(CAP_SYS_ADMIN)) 3257 + return -EPERM; 3258 + 3259 + for (size_t i = 0; i < arg.nr_devs; i++) { 3260 + u64 dev_u64; 3261 + ret = copy_from_user_errcode(&dev_u64, &user_arg->devs[i], sizeof(u64)); 3262 + if (ret) 3263 + goto err; 3264 + 3265 + char *dev_str = strndup_user((char __user *)(unsigned long) dev_u64, PATH_MAX); 3266 + ret = PTR_ERR_OR_ZERO(dev_str); 3267 + if (ret) 3268 + goto err; 3269 + 3270 + ret = darray_push(&devs, dev_str); 3271 + if (ret) { 3272 + kfree(dev_str); 3273 + goto err; 3274 + } 3275 + } 3276 + 3277 + thr = kzalloc(sizeof(*thr), GFP_KERNEL); 3278 + if (!thr) { 3279 + ret = -ENOMEM; 3280 + goto err; 3281 + } 3282 + 3283 + thr->opts = bch2_opts_empty(); 3284 + 3285 + if (arg.opts) { 3286 + char *optstr = strndup_user((char __user *)(unsigned long) arg.opts, 1 << 16); 3287 + ret = PTR_ERR_OR_ZERO(optstr) ?: 3288 + bch2_parse_mount_opts(NULL, &thr->opts, NULL, optstr); 3289 + if (!IS_ERR(optstr)) 3290 + kfree(optstr); 3291 + 3292 + if (ret) 3293 + goto err; 3294 + } 3295 + 3296 + opt_set(thr->opts, stdio, (u64)(unsigned long)&thr->thr.stdio); 3297 + opt_set(thr->opts, read_only, 1); 3298 + opt_set(thr->opts, ratelimit_errors, 0); 3299 + 3300 + /* We need request_key() to be called before we punt to kthread: */ 
3301 + opt_set(thr->opts, nostart, true); 3302 + 3303 + bch2_thread_with_stdio_init(&thr->thr, &bch2_offline_fsck_ops); 3304 + 3305 + thr->c = bch2_fs_open(devs.data, arg.nr_devs, thr->opts); 3306 + 3307 + if (!IS_ERR(thr->c) && 3308 + thr->c->opts.errors == BCH_ON_ERROR_panic) 3309 + thr->c->opts.errors = BCH_ON_ERROR_ro; 3310 + 3311 + ret = __bch2_run_thread_with_stdio(&thr->thr); 3312 + out: 3313 + darray_for_each(devs, i) 3314 + kfree(*i); 3315 + darray_exit(&devs); 3316 + return ret; 3317 + err: 3318 + if (thr) 3319 + bch2_fsck_thread_exit(&thr->thr); 3320 + pr_err("ret %s", bch2_err_str(ret)); 3321 + goto out; 3322 + } 3323 + 3324 + static int bch2_fsck_online_thread_fn(struct thread_with_stdio *stdio) 3325 + { 3326 + struct fsck_thread *thr = container_of(stdio, struct fsck_thread, thr); 3327 + struct bch_fs *c = thr->c; 3328 + 3329 + c->stdio_filter = current; 3330 + c->stdio = &thr->thr.stdio; 3331 + 3332 + /* 3333 + * XXX: can we figure out a way to do this without mucking with c->opts? 
3334 + */ 3335 + unsigned old_fix_errors = c->opts.fix_errors; 3336 + if (opt_defined(thr->opts, fix_errors)) 3337 + c->opts.fix_errors = thr->opts.fix_errors; 3338 + else 3339 + c->opts.fix_errors = FSCK_FIX_ask; 3340 + 3341 + c->opts.fsck = true; 3342 + set_bit(BCH_FS_fsck_running, &c->flags); 3343 + 3344 + c->curr_recovery_pass = BCH_RECOVERY_PASS_check_alloc_info; 3345 + int ret = bch2_run_online_recovery_passes(c); 3346 + 3347 + clear_bit(BCH_FS_fsck_running, &c->flags); 3348 + bch_err_fn(c, ret); 3349 + 3350 + c->stdio = NULL; 3351 + c->stdio_filter = NULL; 3352 + c->opts.fix_errors = old_fix_errors; 3353 + 3354 + up(&c->online_fsck_mutex); 3355 + bch2_ro_ref_put(c); 3356 + return ret; 3357 + } 3358 + 3359 + static const struct thread_with_stdio_ops bch2_online_fsck_ops = { 3360 + .exit = bch2_fsck_thread_exit, 3361 + .fn = bch2_fsck_online_thread_fn, 3362 + }; 3363 + 3364 + long bch2_ioctl_fsck_online(struct bch_fs *c, struct bch_ioctl_fsck_online arg) 3365 + { 3366 + struct fsck_thread *thr = NULL; 3367 + long ret = 0; 3368 + 3369 + if (arg.flags) 3370 + return -EINVAL; 3371 + 3372 + if (!capable(CAP_SYS_ADMIN)) 3373 + return -EPERM; 3374 + 3375 + if (!bch2_ro_ref_tryget(c)) 3376 + return -EROFS; 3377 + 3378 + if (down_trylock(&c->online_fsck_mutex)) { 3379 + bch2_ro_ref_put(c); 3380 + return -EAGAIN; 3381 + } 3382 + 3383 + thr = kzalloc(sizeof(*thr), GFP_KERNEL); 3384 + if (!thr) { 3385 + ret = -ENOMEM; 3386 + goto err; 3387 + } 3388 + 3389 + thr->c = c; 3390 + thr->opts = bch2_opts_empty(); 3391 + 3392 + if (arg.opts) { 3393 + char *optstr = strndup_user((char __user *)(unsigned long) arg.opts, 1 << 16); 3394 + 3395 + ret = PTR_ERR_OR_ZERO(optstr) ?: 3396 + bch2_parse_mount_opts(c, &thr->opts, NULL, optstr); 3397 + if (!IS_ERR(optstr)) 3398 + kfree(optstr); 3399 + 3400 + if (ret) 3401 + goto err; 3402 + } 3403 + 3404 + ret = bch2_run_thread_with_stdio(&thr->thr, &bch2_online_fsck_ops); 3405 + err: 3406 + if (ret < 0) { 3407 + bch_err_fn(c, ret); 3408 + if 
(thr) 3409 + bch2_fsck_thread_exit(&thr->thr); 3410 + up(&c->online_fsck_mutex); 3411 + bch2_ro_ref_put(c); 3412 + } 3413 + return ret; 3414 + } 3415 + 3416 + #endif /* NO_BCACHEFS_CHARDEV */
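Reviewer note on the bi_depth renumbering added in fsck.c above: bch2_bi_depth_renumber() walks the collected path in reverse (darray_for_each_reverse) and hands out increasing depths, starting from the depth passed in by check_path_loop(). The following is a minimal standalone sketch of that walk, not the kernel code. The struct path_entry type and the renumber_depths() helper are hypothetical stand-ins, and the sketch leaves out the btree lookups, transaction commits and restart handling that the real function performs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pathbuf entry plus the inode's bi_depth field. */
struct path_entry {
	uint64_t inum;
	uint32_t depth;
};

/*
 * path[0] is the inode the walk started from; path[nr - 1] is the last
 * ancestor that was pushed. Walk from the last entry back to the first,
 * assigning new_depth, new_depth + 1, ... and rewriting only entries
 * whose depth actually differs (the patch likewise only writes the
 * inode when bi_depth != new_depth).
 *
 * Returns the number of entries that were changed.
 */
static size_t renumber_depths(struct path_entry *path, size_t nr,
			      uint32_t new_depth)
{
	size_t changed = 0;

	for (size_t i = nr; i-- > 0;) {
		if (path[i].depth != new_depth) {
			path[i].depth = new_depth;
			changed++;
		}
		new_depth++;
	}

	return changed;
}
```

In the patch, the starting depth is 0 when the walk reached a subvolume root, and otherwise the bi_depth of the parent at which the walk stopped.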
+11
fs/bcachefs/fsck.h
···
 #ifndef _BCACHEFS_FSCK_H
 #define _BCACHEFS_FSCK_H

+#include "str_hash.h"
+
+int bch2_fsck_update_backpointers(struct btree_trans *,
+				  struct snapshots_seen *,
+				  const struct bch_hash_desc,
+				  struct bch_hash_info *,
+				  struct bkey_i *);
+
 int bch2_check_inodes(struct bch_fs *);
 int bch2_check_extents(struct bch_fs *);
 int bch2_check_indirect_extents(struct bch_fs *);
···
 int bch2_check_directory_structure(struct bch_fs *);
 int bch2_check_nlinks(struct bch_fs *);
 int bch2_fix_reflink_p(struct bch_fs *);
+
+long bch2_ioctl_fsck_offline(struct bch_ioctl_fsck_offline __user *);
+long bch2_ioctl_fsck_online(struct bch_fs *, struct bch_ioctl_fsck_online);

 #endif /* _BCACHEFS_FSCK_H */
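Reviewer note on bch2_ioctl_fsck_offline(), declared above: in fsck.c its thread function returns a small bitmask, OR-ing in 1 when BCH_FS_errors_fixed is set and 4 when BCH_FS_error is still set. Those values line up with the usual fsck(8) exit-code convention (1 for errors corrected, 4 for errors left uncorrected). A minimal sketch of that composition follows. The enum names and the fsck_exit_status() helper are hypothetical and exist only for illustration; only the numeric values come from the patch.

```c
#include <stdbool.h>

/* Hypothetical names; the bit values match "ret |= 1" and "ret |= 4" in the patch. */
enum {
	FSCK_EXIT_ERRORS_FIXED	= 1,
	FSCK_EXIT_ERRORS_LEFT	= 4,
};

/* Compose the status the same way the offline fsck thread function does. */
static int fsck_exit_status(bool errors_fixed, bool still_has_errors)
{
	int ret = 0;

	if (errors_fixed)
		ret |= FSCK_EXIT_ERRORS_FIXED;
	if (still_has_errors)
		ret |= FSCK_EXIT_ERRORS_LEFT;

	return ret;
}
```

A caller can test the bits independently. For example, a result of 5 means some errors were repaired while others remain.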
+113 -56
fs/bcachefs/inode.c
··· 14 14 #include "extent_update.h" 15 15 #include "fs.h" 16 16 #include "inode.h" 17 + #include "opts.h" 17 18 #include "str_hash.h" 18 19 #include "snapshot.h" 19 20 #include "subvolume.h" ··· 48 47 u8 *p; 49 48 50 49 if (in >= end) 51 - return -1; 50 + return -BCH_ERR_inode_unpack_error; 52 51 53 52 if (!*in) 54 - return -1; 53 + return -BCH_ERR_inode_unpack_error; 55 54 56 55 /* 57 56 * position of highest set bit indicates number of bytes: ··· 61 60 bytes = byte_table[shift - 1]; 62 61 63 62 if (in + bytes > end) 64 - return -1; 63 + return -BCH_ERR_inode_unpack_error; 65 64 66 65 p = (u8 *) be + 16 - bytes; 67 66 memcpy(p, in, bytes); ··· 177 176 return ret; \ 178 177 \ 179 178 if (field_bits > sizeof(unpacked->_name) * 8) \ 180 - return -1; \ 179 + return -BCH_ERR_inode_unpack_error; \ 181 180 \ 182 181 unpacked->_name = field[1]; \ 183 182 in += ret; ··· 218 217 \ 219 218 unpacked->_name = v[0]; \ 220 219 if (v[1] || v[0] != unpacked->_name) \ 221 - return -1; \ 220 + return -BCH_ERR_inode_unpack_error; \ 222 221 fieldnr++; 223 222 224 223 BCH_INODE_FIELDS_v2() ··· 269 268 \ 270 269 unpacked->_name = v[0]; \ 271 270 if (v[1] || v[0] != unpacked->_name) \ 272 - return -1; \ 271 + return -BCH_ERR_inode_unpack_error; \ 273 272 fieldnr++; 274 273 275 274 BCH_INODE_FIELDS_v3() ··· 429 428 } 430 429 431 430 static int __bch2_inode_validate(struct bch_fs *c, struct bkey_s_c k, 432 - enum bch_validate_flags flags) 431 + struct bkey_validate_context from) 433 432 { 434 433 struct bch_inode_unpacked unpacked; 435 434 int ret = 0; ··· 469 468 } 470 469 471 470 int bch2_inode_validate(struct bch_fs *c, struct bkey_s_c k, 472 - enum bch_validate_flags flags) 471 + struct bkey_validate_context from) 473 472 { 474 473 struct bkey_s_c_inode inode = bkey_s_c_to_inode(k); 475 474 int ret = 0; ··· 479 478 "invalid str hash type (%llu >= %u)", 480 479 INODEv1_STR_HASH(inode.v), BCH_STR_HASH_NR); 481 480 482 - ret = __bch2_inode_validate(c, k, flags); 481 + ret = 
__bch2_inode_validate(c, k, from); 483 482 fsck_err: 484 483 return ret; 485 484 } 486 485 487 486 int bch2_inode_v2_validate(struct bch_fs *c, struct bkey_s_c k, 488 - enum bch_validate_flags flags) 487 + struct bkey_validate_context from) 489 488 { 490 489 struct bkey_s_c_inode_v2 inode = bkey_s_c_to_inode_v2(k); 491 490 int ret = 0; ··· 495 494 "invalid str hash type (%llu >= %u)", 496 495 INODEv2_STR_HASH(inode.v), BCH_STR_HASH_NR); 497 496 498 - ret = __bch2_inode_validate(c, k, flags); 497 + ret = __bch2_inode_validate(c, k, from); 499 498 fsck_err: 500 499 return ret; 501 500 } 502 501 503 502 int bch2_inode_v3_validate(struct bch_fs *c, struct bkey_s_c k, 504 - enum bch_validate_flags flags) 503 + struct bkey_validate_context from) 505 504 { 506 505 struct bkey_s_c_inode_v3 inode = bkey_s_c_to_inode_v3(k); 507 506 int ret = 0; ··· 519 518 "invalid str hash type (%llu >= %u)", 520 519 INODEv3_STR_HASH(inode.v), BCH_STR_HASH_NR); 521 520 522 - ret = __bch2_inode_validate(c, k, flags); 521 + ret = __bch2_inode_validate(c, k, from); 523 522 fsck_err: 524 523 return ret; 525 524 } ··· 618 617 struct bkey_s_c k; 619 618 int ret = 0; 620 619 621 - for_each_btree_key_upto_norestart(trans, *iter, btree, 620 + for_each_btree_key_max_norestart(trans, *iter, btree, 622 621 bpos_successor(pos), 623 622 SPOS(pos.inode, pos.offset, U32_MAX), 624 623 flags|BTREE_ITER_all_snapshots, k, ret) ··· 653 652 struct bkey_s_c k; 654 653 int ret = 0; 655 654 656 - for_each_btree_key_upto_norestart(trans, iter, 655 + for_each_btree_key_max_norestart(trans, iter, 657 656 BTREE_ID_inodes, POS(0, pos.offset), bpos_predecessor(pos), 658 657 BTREE_ITER_all_snapshots| 659 658 BTREE_ITER_with_updates, k, ret) ··· 780 779 } 781 780 782 781 int bch2_inode_generation_validate(struct bch_fs *c, struct bkey_s_c k, 783 - enum bch_validate_flags flags) 782 + struct bkey_validate_context from) 784 783 { 785 784 int ret = 0; 786 785 ··· 797 796 struct bkey_s_c_inode_generation gen = 
bkey_s_c_to_inode_generation(k); 798 797 799 798 prt_printf(out, "generation: %u", le32_to_cpu(gen.v->bi_generation)); 799 + } 800 + 801 + int bch2_inode_alloc_cursor_validate(struct bch_fs *c, struct bkey_s_c k, 802 + struct bkey_validate_context from) 803 + { 804 + int ret = 0; 805 + 806 + bkey_fsck_err_on(k.k->p.inode != LOGGED_OPS_INUM_inode_cursors, 807 + c, inode_alloc_cursor_inode_bad, 808 + "k.p.inode bad"); 809 + fsck_err: 810 + return ret; 811 + } 812 + 813 + void bch2_inode_alloc_cursor_to_text(struct printbuf *out, struct bch_fs *c, 814 + struct bkey_s_c k) 815 + { 816 + struct bkey_s_c_inode_alloc_cursor i = bkey_s_c_to_inode_alloc_cursor(k); 817 + 818 + prt_printf(out, "idx %llu generation %llu", 819 + le64_to_cpu(i.v->idx), 820 + le64_to_cpu(i.v->gen)); 800 821 } 801 822 802 823 void bch2_inode_init_early(struct bch_fs *c, ··· 881 858 } 882 859 } 883 860 861 + static struct bkey_i_inode_alloc_cursor * 862 + bch2_inode_alloc_cursor_get(struct btree_trans *trans, u64 cpu, u64 *min, u64 *max) 863 + { 864 + struct bch_fs *c = trans->c; 865 + 866 + u64 cursor_idx = c->opts.inodes_32bit ? 0 : cpu + 1; 867 + 868 + cursor_idx &= ~(~0ULL << c->opts.shard_inode_numbers_bits); 869 + 870 + struct btree_iter iter; 871 + struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, 872 + BTREE_ID_logged_ops, 873 + POS(LOGGED_OPS_INUM_inode_cursors, cursor_idx), 874 + BTREE_ITER_cached); 875 + int ret = bkey_err(k); 876 + if (ret) 877 + return ERR_PTR(ret); 878 + 879 + struct bkey_i_inode_alloc_cursor *cursor = 880 + k.k->type == KEY_TYPE_inode_alloc_cursor 881 + ? 
bch2_bkey_make_mut_typed(trans, &iter, &k, 0, inode_alloc_cursor) 882 + : bch2_bkey_alloc(trans, &iter, 0, inode_alloc_cursor); 883 + ret = PTR_ERR_OR_ZERO(cursor); 884 + if (ret) 885 + goto err; 886 + 887 + if (c->opts.inodes_32bit) { 888 + *min = BLOCKDEV_INODE_MAX; 889 + *max = INT_MAX; 890 + } else { 891 + cursor->v.bits = c->opts.shard_inode_numbers_bits; 892 + 893 + unsigned bits = 63 - c->opts.shard_inode_numbers_bits; 894 + 895 + *min = max(cpu << bits, (u64) INT_MAX + 1); 896 + *max = (cpu << bits) | ~(ULLONG_MAX << bits); 897 + } 898 + 899 + if (le64_to_cpu(cursor->v.idx) < *min) 900 + cursor->v.idx = cpu_to_le64(*min); 901 + 902 + if (le64_to_cpu(cursor->v.idx) >= *max) { 903 + cursor->v.idx = cpu_to_le64(*min); 904 + le32_add_cpu(&cursor->v.gen, 1); 905 + } 906 + err: 907 + bch2_trans_iter_exit(trans, &iter); 908 + return ret ? ERR_PTR(ret) : cursor; 909 + } 910 + 884 911 /* 885 912 * This just finds an empty slot: 886 913 */ ··· 939 866 struct bch_inode_unpacked *inode_u, 940 867 u32 snapshot, u64 cpu) 941 868 { 942 - struct bch_fs *c = trans->c; 943 - struct bkey_s_c k; 944 - u64 min, max, start, pos, *hint; 945 - int ret = 0; 946 - unsigned bits = (c->opts.inodes_32bit ? 
31 : 63); 869 + u64 min, max; 870 + struct bkey_i_inode_alloc_cursor *cursor = 871 + bch2_inode_alloc_cursor_get(trans, cpu, &min, &max); 872 + int ret = PTR_ERR_OR_ZERO(cursor); 873 + if (ret) 874 + return ret; 947 875 948 - if (c->opts.shard_inode_numbers) { 949 - bits -= c->inode_shard_bits; 876 + u64 start = le64_to_cpu(cursor->v.idx); 877 + u64 pos = start; 950 878 951 - min = (cpu << bits); 952 - max = (cpu << bits) | ~(ULLONG_MAX << bits); 953 - 954 - min = max_t(u64, min, BLOCKDEV_INODE_MAX); 955 - hint = c->unused_inode_hints + cpu; 956 - } else { 957 - min = BLOCKDEV_INODE_MAX; 958 - max = ~(ULLONG_MAX << bits); 959 - hint = c->unused_inode_hints; 960 - } 961 - 962 - start = READ_ONCE(*hint); 963 - 964 - if (start >= max || start < min) 965 - start = min; 966 - 967 - pos = start; 968 879 bch2_trans_iter_init(trans, iter, BTREE_ID_inodes, POS(0, pos), 969 880 BTREE_ITER_all_snapshots| 970 881 BTREE_ITER_intent); 882 + struct bkey_s_c k; 971 883 again: 972 884 while ((k = bch2_btree_iter_peek(iter)).k && 973 885 !(ret = bkey_err(k)) && ··· 982 924 /* Retry from start */ 983 925 pos = start = min; 984 926 bch2_btree_iter_set_pos(iter, POS(0, pos)); 927 + le32_add_cpu(&cursor->v.gen, 1); 985 928 goto again; 986 929 found_slot: 987 930 bch2_btree_iter_set_pos(iter, SPOS(0, pos, snapshot)); ··· 993 934 return ret; 994 935 } 995 936 996 - *hint = k.k->p.offset; 997 937 inode_u->bi_inum = k.k->p.offset; 998 - inode_u->bi_generation = bkey_generation(k); 938 + inode_u->bi_generation = le64_to_cpu(cursor->v.gen); 939 + cursor->v.idx = cpu_to_le64(k.k->p.offset + 1); 999 940 return 0; 1000 941 } 1001 942 ··· 1025 966 1026 967 bch2_btree_iter_set_snapshot(&iter, snapshot); 1027 968 1028 - k = bch2_btree_iter_peek_upto(&iter, end); 969 + k = bch2_btree_iter_peek_max(&iter, end); 1029 970 ret = bkey_err(k); 1030 971 if (ret) 1031 972 goto err; ··· 1057 998 { 1058 999 struct btree_trans *trans = bch2_trans_get(c); 1059 1000 struct btree_iter iter = { NULL }; 1060 - 
struct bkey_i_inode_generation delete; 1061 - struct bch_inode_unpacked inode_u; 1062 1001 struct bkey_s_c k; 1063 1002 u32 snapshot; 1064 1003 int ret; ··· 1096 1039 goto err; 1097 1040 } 1098 1041 1099 - bch2_inode_unpack(k, &inode_u); 1100 - 1101 - bkey_inode_generation_init(&delete.k_i); 1102 - delete.k.p = iter.pos; 1103 - delete.v.bi_generation = cpu_to_le32(inode_u.bi_generation + 1); 1104 - 1105 - ret = bch2_trans_update(trans, &iter, &delete.k_i, 0) ?: 1042 + ret = bch2_btree_delete_at(trans, &iter, 0) ?: 1106 1043 bch2_trans_commit(trans, NULL, NULL, 1107 1044 BCH_TRANS_COMMIT_no_enospc); 1108 1045 err: ··· 1192 1141 void bch2_inode_opts_get(struct bch_io_opts *opts, struct bch_fs *c, 1193 1142 struct bch_inode_unpacked *inode) 1194 1143 { 1195 - #define x(_name, _bits) opts->_name = inode_opt_get(c, inode, _name); 1144 + #define x(_name, _bits) \ 1145 + if ((inode)->bi_##_name) { \ 1146 + opts->_name = inode->bi_##_name - 1; \ 1147 + opts->_name##_from_inode = true; \ 1148 + } else { \ 1149 + opts->_name = c->opts._name; \ 1150 + } 1196 1151 BCH_INODE_OPTS() 1197 1152 #undef x 1198 1153 1199 - if (opts->nocow) 1200 - opts->compression = opts->background_compression = opts->data_checksum = opts->erasure_code = 0; 1154 + bch2_io_opts_fixups(opts); 1201 1155 } 1202 1156 1203 1157 int bch2_inum_opts_get(struct btree_trans *trans, subvol_inum inum, struct bch_io_opts *opts) ··· 1436 1380 NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({ 1437 1381 ret = may_delete_deleted_inode(trans, &iter, k.k->p, &need_another_pass); 1438 1382 if (ret > 0) { 1439 - bch_verbose(c, "deleting unlinked inode %llu:%u", k.k->p.offset, k.k->p.snapshot); 1383 + bch_verbose_ratelimited(c, "deleting unlinked inode %llu:%u", 1384 + k.k->p.offset, k.k->p.snapshot); 1440 1385 1441 1386 ret = bch2_inode_rm_snapshot(trans, k.k->p.offset, k.k->p.snapshot); 1442 1387 /*
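The per-CPU inode number range that bch2_inode_alloc_cursor_get() picks in the non-32-bit case can be modeled in isolation. The following is a minimal userspace sketch, not the kernel code: `struct inum_range` and `shard_inum_range()` are hypothetical names, and it assumes the caller passes a `cpu` value that fits in `shard_bits` bits.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical standalone model of the [min, max] range for one shard. */
struct inum_range {
	uint64_t min;
	uint64_t max;
};

static struct inum_range shard_inum_range(uint64_t cpu, unsigned shard_bits)
{
	/* Assumption: fewer than 63 shard bits, so the shifts below are defined. */
	assert(shard_bits < 63);

	unsigned bits = 63 - shard_bits;
	uint64_t base = cpu << bits;
	/* Numbers up to INT_MAX are left to the 32-bit allocation range. */
	uint64_t lowest = (uint64_t) INT_MAX + 1;
	struct inum_range r;

	r.min = base > lowest ? base : lowest;
	r.max = base | ~(UINT64_MAX << bits);
	return r;
}
```

With 4 shard bits, this gives CPU 0 the range 2^31 through 2^59 - 1 and CPU 1 the range 2^59 through 2^60 - 1, so the shards do not overlap.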
+37 -6
fs/bcachefs/inode.h
···
 #include "opts.h"
 #include "snapshot.h"

-enum bch_validate_flags;
 extern const char * const bch2_inode_opts[];

 int bch2_inode_validate(struct bch_fs *, struct bkey_s_c,
-			enum bch_validate_flags);
+			struct bkey_validate_context);
 int bch2_inode_v2_validate(struct bch_fs *, struct bkey_s_c,
-			   enum bch_validate_flags);
+			   struct bkey_validate_context);
 int bch2_inode_v3_validate(struct bch_fs *, struct bkey_s_c,
-			   enum bch_validate_flags);
+			   struct bkey_validate_context);
 void bch2_inode_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);

 int __bch2_inode_has_child_snapshots(struct btree_trans *, struct bpos);
···
 }

 int bch2_inode_generation_validate(struct bch_fs *, struct bkey_s_c,
-				   enum bch_validate_flags);
+				   struct bkey_validate_context);
 void bch2_inode_generation_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);

 #define bch2_bkey_ops_inode_generation ((struct bkey_ops) { \
 	.key_validate = bch2_inode_generation_validate, \
 	.val_to_text = bch2_inode_generation_to_text, \
 	.min_val_size = 8, \
+})
+
+int bch2_inode_alloc_cursor_validate(struct bch_fs *, struct bkey_s_c,
+				     struct bkey_validate_context);
+void bch2_inode_alloc_cursor_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
+
+#define bch2_bkey_ops_inode_alloc_cursor ((struct bkey_ops) { \
+	.key_validate = bch2_inode_alloc_cursor_validate, \
+	.val_to_text = bch2_inode_alloc_cursor_to_text, \
+	.min_val_size = 16, \
 })

 #if 0
···
 	}
 }

+static inline unsigned bkey_inode_mode(struct bkey_s_c k)
+{
+	switch (k.k->type) {
+	case KEY_TYPE_inode:
+		return le16_to_cpu(bkey_s_c_to_inode(k).v->bi_mode);
+	case KEY_TYPE_inode_v2:
+		return le16_to_cpu(bkey_s_c_to_inode_v2(k).v->bi_mode);
+	case KEY_TYPE_inode_v3:
+		return INODEv3_MODE(bkey_s_c_to_inode_v3(k).v);
+	default:
+		return 0;
+	}
+}
+
 /* i_nlink: */

 static inline unsigned nlink_bias(umode_t mode)
···
 int bch2_inode_nlink_inc(struct bch_inode_unpacked *);
 void bch2_inode_nlink_dec(struct btree_trans *, struct bch_inode_unpacked *);

-static inline bool bch2_inode_should_have_bp(struct bch_inode_unpacked *inode)
+static inline bool bch2_inode_should_have_single_bp(struct bch_inode_unpacked *inode)
 {
 	bool inode_has_bp = inode->bi_dir || inode->bi_dir_offset;

···
 void bch2_inode_opts_get(struct bch_io_opts *, struct bch_fs *,
 			 struct bch_inode_unpacked *);
 int bch2_inum_opts_get(struct btree_trans*, subvol_inum, struct bch_io_opts *);
+
+static inline struct bch_extent_rebalance
+bch2_inode_rebalance_opts_get(struct bch_fs *c, struct bch_inode_unpacked *inode)
+{
+	struct bch_io_opts io_opts;
+	bch2_inode_opts_get(&io_opts, c, inode);
+	return io_opts_to_rebalance_opts(&io_opts);
+}

 int bch2_inode_rm_snapshot(struct btree_trans *, u64, u32);
 int bch2_delete_dead_inodes(struct bch_fs *);
+13 -2
fs/bcachefs/inode_format.h
···
 	x(bi_dir_offset, 64) \
 	x(bi_subvol, 32) \
 	x(bi_parent_subvol, 32) \
-	x(bi_nocow, 8)
+	x(bi_nocow, 8) \
+	x(bi_depth, 32) \
+	x(bi_inodes_32bit, 8)

 /* subset of BCH_INODE_FIELDS */
 #define BCH_INODE_OPTS() \
···
 	x(foreground_target, 16) \
 	x(background_target, 16) \
 	x(erasure_code, 16) \
-	x(nocow, 8)
+	x(nocow, 8) \
+	x(inodes_32bit, 8)

 enum inode_opt_id {
 #define x(name, ...) \
···
 LE64_BITMASK(INODEv3_FIELDS_START,
 	     struct bch_inode_v3, bi_flags, 31, 36);
 LE64_BITMASK(INODEv3_MODE, struct bch_inode_v3, bi_flags, 36, 52);
+
+struct bch_inode_alloc_cursor {
+	struct bch_val v;
+	__u8 bits;
+	__u8 pad;
+	__le32 gen;
+	__le64 idx;
+};

 #endif /* _BCACHEFS_INODE_FORMAT_H */
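The new key type is registered with `.min_val_size = 16` in inode.h, which should match the size of the value struct added above. As a rough cross-check, the sketch below mirrors the layout in userspace. It is an approximation under stated assumptions: `struct inode_alloc_cursor_model` is a hypothetical name, the little-endian fields are replaced by plain fixed-width integers, and `struct bch_val` is assumed to contribute no bytes, so it is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct bch_inode_alloc_cursor. */
struct inode_alloc_cursor_model {
	uint8_t  bits;	/* __u8 bits */
	uint8_t  pad;	/* __u8 pad */
	uint32_t gen;	/* __le32 gen; naturally aligned, so it starts at offset 4 */
	uint64_t idx;	/* __le64 idx; starts at offset 8 */
};

/* Compile-time checks: the value occupies 16 bytes in total. */
static_assert(offsetof(struct inode_alloc_cursor_model, gen) == 4,
	      "gen follows two bytes of implicit padding");
static_assert(offsetof(struct inode_alloc_cursor_model, idx) == 8,
	      "idx starts on an 8-byte boundary");
static_assert(sizeof(struct inode_alloc_cursor_model) == 16,
	      "matches .min_val_size = 16");
```

Two bytes of implicit padding sit between `pad` and `gen` in this model. Whether the kernel struct relies on that padding or on an attribute is not shown in the diff.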
+12 -10
fs/bcachefs/io_misc.c
···
 err:
 	if (!ret && sectors_allocated)
 		bch2_increment_clock(c, sectors_allocated, WRITE);
-	if (should_print_err(ret))
-		bch_err_inum_offset_ratelimited(c,
-			inum.inum,
-			iter->pos.offset << 9,
-			"%s(): error: %s", __func__, bch2_err_str(ret));
+	if (should_print_err(ret)) {
+		struct printbuf buf = PRINTBUF;
+		bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter->pos.offset << 9);
+		prt_printf(&buf, "fallocate error: %s", bch2_err_str(ret));
+		bch_err_ratelimited(c, "%s", buf.buf);
+		printbuf_exit(&buf);
+	}
 err_noprint:
 	bch2_open_buckets_put(c, &open_buckets);
 	bch2_disk_reservation_put(c, &disk_res);
···
 		bch2_btree_iter_set_snapshot(iter, snapshot);

 		/*
-		 * peek_upto() doesn't have ideal semantics for extents:
+		 * peek_max() doesn't have ideal semantics for extents:
 		 */
-		k = bch2_btree_iter_peek_upto(iter, end_pos);
+		k = bch2_btree_iter_peek_max(iter, end_pos);
 		if (!k.k)
 			break;

···
 		bch2_btree_iter_set_pos(&iter, SPOS(inum.inum, pos, snapshot));

 		k = insert
-			? bch2_btree_iter_peek_prev(&iter)
-			: bch2_btree_iter_peek_upto(&iter, POS(inum.inum, U64_MAX));
+			? bch2_btree_iter_peek_prev_min(&iter, POS(inum.inum, 0))
+			: bch2_btree_iter_peek_max(&iter, POS(inum.inum, U64_MAX));
 		if ((ret = bkey_err(k)))
 			goto btree_err;

···

 		op->v.pos = cpu_to_le64(insert ? bkey_start_offset(&delete.k) : delete.k.p.offset);

-		ret = bch2_bkey_set_needs_rebalance(c, copy, &opts) ?:
+		ret = bch2_bkey_set_needs_rebalance(c, &opts, copy) ?:
 			bch2_btree_insert_trans(trans, BTREE_ID_extents, &delete, 0) ?:
 			bch2_btree_insert_trans(trans, BTREE_ID_extents, copy, 0) ?:
 			bch2_logged_op_update(trans, &op->k_i) ?:
+159 -100
fs/bcachefs/io_read.c
··· 21 21 #include "io_read.h" 22 22 #include "io_misc.h" 23 23 #include "io_write.h" 24 + #include "reflink.h" 24 25 #include "subvolume.h" 25 26 #include "trace.h" 26 27 ··· 91 90 .automatic_shrinking = true, 92 91 }; 93 92 93 + static inline bool have_io_error(struct bch_io_failures *failed) 94 + { 95 + return failed && failed->nr; 96 + } 97 + 94 98 static inline int should_promote(struct bch_fs *c, struct bkey_s_c k, 95 99 struct bpos pos, 96 100 struct bch_io_opts opts, 97 101 unsigned flags, 98 102 struct bch_io_failures *failed) 99 103 { 100 - if (!failed) { 104 + if (!have_io_error(failed)) { 101 105 BUG_ON(!opts.promote_target); 102 106 103 107 if (!(flags & BCH_READ_MAY_PROMOTE)) ··· 229 223 230 224 struct data_update_opts update_opts = {}; 231 225 232 - if (!failed) { 226 + if (!have_io_error(failed)) { 233 227 update_opts.target = opts.promote_target; 234 228 update_opts.extra_replicas = 1; 235 229 update_opts.write_flags = BCH_WRITE_ALLOC_NOWAIT|BCH_WRITE_CACHED; ··· 237 231 update_opts.target = opts.foreground_target; 238 232 239 233 struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 240 - unsigned i = 0; 234 + unsigned ptr_bit = 1; 241 235 bkey_for_each_ptr(ptrs, ptr) { 242 236 if (bch2_dev_io_failures(failed, ptr->dev)) 243 - update_opts.rewrite_ptrs |= BIT(i); 244 - i++; 237 + update_opts.rewrite_ptrs |= ptr_bit; 238 + ptr_bit <<= 1; 245 239 } 246 240 } 247 241 ··· 291 285 * if failed != NULL we're not actually doing a promote, we're 292 286 * recovering from an io/checksum error 293 287 */ 294 - bool promote_full = (failed || 288 + bool promote_full = (have_io_error(failed) || 295 289 *read_full || 296 290 READ_ONCE(c->opts.promote_whole_extents)); 297 291 /* data might have to be decompressed in the write path: */ ··· 326 320 } 327 321 328 322 /* Read */ 323 + 324 + static int bch2_read_err_msg_trans(struct btree_trans *trans, struct printbuf *out, 325 + struct bch_read_bio *rbio, struct bpos read_pos) 326 + { 327 + return 
bch2_inum_offset_err_msg_trans(trans, out, 328 + (subvol_inum) { rbio->subvol, read_pos.inode }, 329 + read_pos.offset << 9); 330 + } 331 + 332 + static void bch2_read_err_msg(struct bch_fs *c, struct printbuf *out, 333 + struct bch_read_bio *rbio, struct bpos read_pos) 334 + { 335 + bch2_trans_run(c, bch2_read_err_msg_trans(trans, out, rbio, read_pos)); 336 + } 329 337 330 338 #define READ_RETRY_AVOID 1 331 339 #define READ_RETRY 2 ··· 519 499 } 520 500 } 521 501 502 + static void bch2_read_io_err(struct work_struct *work) 503 + { 504 + struct bch_read_bio *rbio = 505 + container_of(work, struct bch_read_bio, work); 506 + struct bio *bio = &rbio->bio; 507 + struct bch_fs *c = rbio->c; 508 + struct bch_dev *ca = rbio->have_ioref ? bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 509 + struct printbuf buf = PRINTBUF; 510 + 511 + bch2_read_err_msg(c, &buf, rbio, rbio->read_pos); 512 + prt_printf(&buf, "data read error: %s", bch2_blk_status_to_str(bio->bi_status)); 513 + 514 + if (ca) { 515 + bch2_io_error(ca, BCH_MEMBER_ERROR_read); 516 + bch_err_ratelimited(ca, "%s", buf.buf); 517 + } else { 518 + bch_err_ratelimited(c, "%s", buf.buf); 519 + } 520 + 521 + printbuf_exit(&buf); 522 + bch2_rbio_error(rbio, READ_RETRY_AVOID, bio->bi_status); 523 + } 524 + 522 525 static int __bch2_rbio_narrow_crcs(struct btree_trans *trans, 523 526 struct bch_read_bio *rbio) 524 527 { ··· 603 560 { 604 561 bch2_trans_commit_do(rbio->c, NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 605 562 __bch2_rbio_narrow_crcs(trans, rbio)); 563 + } 564 + 565 + static void bch2_read_csum_err(struct work_struct *work) 566 + { 567 + struct bch_read_bio *rbio = 568 + container_of(work, struct bch_read_bio, work); 569 + struct bch_fs *c = rbio->c; 570 + struct bio *src = &rbio->bio; 571 + struct bch_extent_crc_unpacked crc = rbio->pick.crc; 572 + struct nonce nonce = extent_nonce(rbio->version, crc); 573 + struct bch_csum csum = bch2_checksum_bio(c, crc.csum_type, nonce, src); 574 + struct printbuf buf = 
PRINTBUF; 575 + 576 + bch2_read_err_msg(c, &buf, rbio, rbio->read_pos); 577 + prt_str(&buf, "data "); 578 + bch2_csum_err_msg(&buf, crc.csum_type, rbio->pick.crc.csum, csum); 579 + 580 + struct bch_dev *ca = rbio->have_ioref ? bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 581 + if (ca) { 582 + bch2_io_error(ca, BCH_MEMBER_ERROR_checksum); 583 + bch_err_ratelimited(ca, "%s", buf.buf); 584 + } else { 585 + bch_err_ratelimited(c, "%s", buf.buf); 586 + } 587 + 588 + bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 589 + printbuf_exit(&buf); 590 + } 591 + 592 + static void bch2_read_decompress_err(struct work_struct *work) 593 + { 594 + struct bch_read_bio *rbio = 595 + container_of(work, struct bch_read_bio, work); 596 + struct bch_fs *c = rbio->c; 597 + struct printbuf buf = PRINTBUF; 598 + 599 + bch2_read_err_msg(c, &buf, rbio, rbio->read_pos); 600 + prt_str(&buf, "decompression error"); 601 + 602 + struct bch_dev *ca = rbio->have_ioref ? bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 603 + if (ca) 604 + bch_err_ratelimited(ca, "%s", buf.buf); 605 + else 606 + bch_err_ratelimited(c, "%s", buf.buf); 607 + 608 + bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR); 609 + printbuf_exit(&buf); 610 + } 611 + 612 + static void bch2_read_decrypt_err(struct work_struct *work) 613 + { 614 + struct bch_read_bio *rbio = 615 + container_of(work, struct bch_read_bio, work); 616 + struct bch_fs *c = rbio->c; 617 + struct printbuf buf = PRINTBUF; 618 + 619 + bch2_read_err_msg(c, &buf, rbio, rbio->read_pos); 620 + prt_str(&buf, "decrypt error"); 621 + 622 + struct bch_dev *ca = rbio->have_ioref ? 
bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 623 + if (ca) 624 + bch_err_ratelimited(ca, "%s", buf.buf); 625 + else 626 + bch_err_ratelimited(c, "%s", buf.buf); 627 + 628 + bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR); 629 + printbuf_exit(&buf); 606 630 } 607 631 608 632 /* Inner part that may run in process context */ ··· 778 668 goto out; 779 669 } 780 670 781 - struct printbuf buf = PRINTBUF; 782 - buf.atomic++; 783 - prt_str(&buf, "data "); 784 - bch2_csum_err_msg(&buf, crc.csum_type, rbio->pick.crc.csum, csum); 785 - 786 - struct bch_dev *ca = rbio->have_ioref ? bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 787 - if (ca) { 788 - bch_err_inum_offset_ratelimited(ca, 789 - rbio->read_pos.inode, 790 - rbio->read_pos.offset << 9, 791 - "data %s", buf.buf); 792 - bch2_io_error(ca, BCH_MEMBER_ERROR_checksum); 793 - } 794 - printbuf_exit(&buf); 795 - bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 671 + bch2_rbio_punt(rbio, bch2_read_csum_err, RBIO_CONTEXT_UNBOUND, system_unbound_wq); 796 672 goto out; 797 673 decompression_err: 798 - bch_err_inum_offset_ratelimited(c, rbio->read_pos.inode, 799 - rbio->read_pos.offset << 9, 800 - "decompression error"); 801 - bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR); 674 + bch2_rbio_punt(rbio, bch2_read_decompress_err, RBIO_CONTEXT_UNBOUND, system_unbound_wq); 802 675 goto out; 803 676 decrypt_err: 804 - bch_err_inum_offset_ratelimited(c, rbio->read_pos.inode, 805 - rbio->read_pos.offset << 9, 806 - "decrypt error"); 807 - bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR); 677 + bch2_rbio_punt(rbio, bch2_read_decrypt_err, RBIO_CONTEXT_UNBOUND, system_unbound_wq); 808 678 goto out; 809 679 } 810 680 ··· 805 715 if (!rbio->split) 806 716 rbio->bio.bi_end_io = rbio->end_io; 807 717 808 - if (bio->bi_status) { 809 - if (ca) { 810 - bch_err_inum_offset_ratelimited(ca, 811 - rbio->read_pos.inode, 812 - rbio->read_pos.offset, 813 - "data read error: %s", 814 - bch2_blk_status_to_str(bio->bi_status)); 815 - 
bch2_io_error(ca, BCH_MEMBER_ERROR_read); 816 - } 817 - bch2_rbio_error(rbio, READ_RETRY_AVOID, bio->bi_status); 718 + if (unlikely(bio->bi_status)) { 719 + bch2_rbio_punt(rbio, bch2_read_io_err, RBIO_CONTEXT_UNBOUND, system_unbound_wq); 818 720 return; 819 721 } 820 722 ··· 830 748 context = RBIO_CONTEXT_HIGHPRI, wq = system_highpri_wq; 831 749 832 750 bch2_rbio_punt(rbio, __bch2_read_endio, context, wq); 833 - } 834 - 835 - int __bch2_read_indirect_extent(struct btree_trans *trans, 836 - unsigned *offset_into_extent, 837 - struct bkey_buf *orig_k) 838 - { 839 - struct btree_iter iter; 840 - struct bkey_s_c k; 841 - u64 reflink_offset; 842 - int ret; 843 - 844 - reflink_offset = le64_to_cpu(bkey_i_to_reflink_p(orig_k->k)->v.idx) + 845 - *offset_into_extent; 846 - 847 - k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_reflink, 848 - POS(0, reflink_offset), 0); 849 - ret = bkey_err(k); 850 - if (ret) 851 - goto err; 852 - 853 - if (k.k->type != KEY_TYPE_reflink_v && 854 - k.k->type != KEY_TYPE_indirect_inline_data) { 855 - bch_err_inum_offset_ratelimited(trans->c, 856 - orig_k->k->k.p.inode, 857 - orig_k->k->k.p.offset << 9, 858 - "%llu len %u points to nonexistent indirect extent %llu", 859 - orig_k->k->k.p.offset, 860 - orig_k->k->k.size, 861 - reflink_offset); 862 - bch2_inconsistent_error(trans->c); 863 - ret = -BCH_ERR_missing_indirect_extent; 864 - goto err; 865 - } 866 - 867 - *offset_into_extent = iter.pos.offset - bkey_start_offset(k.k); 868 - bch2_bkey_buf_reassemble(orig_k, trans->c, k); 869 - err: 870 - bch2_trans_iter_exit(trans, &iter); 871 - return ret; 872 751 } 873 752 874 753 static noinline void read_from_stale_dirty_pointer(struct btree_trans *trans, ··· 911 868 if (!pick_ret) 912 869 goto hole; 913 870 914 - if (pick_ret < 0) { 871 + if (unlikely(pick_ret < 0)) { 915 872 struct printbuf buf = PRINTBUF; 873 + bch2_read_err_msg_trans(trans, &buf, orig, read_pos); 874 + prt_printf(&buf, "no device to read from: %s\n ", bch2_err_str(pick_ret)); 916 875 
bch2_bkey_val_to_text(&buf, c, k); 917 876 918 - bch_err_inum_offset_ratelimited(c, 919 - read_pos.inode, read_pos.offset << 9, 920 - "no device to read from: %s\n %s", 921 - bch2_err_str(pick_ret), 922 - buf.buf); 877 + bch_err_ratelimited(c, "%s", buf.buf); 878 + printbuf_exit(&buf); 879 + goto err; 880 + } 881 + 882 + if (unlikely(bch2_csum_type_is_encryption(pick.crc.csum_type)) && !c->chacha20) { 883 + struct printbuf buf = PRINTBUF; 884 + bch2_read_err_msg_trans(trans, &buf, orig, read_pos); 885 + prt_printf(&buf, "attempting to read encrypted data without encryption key\n "); 886 + bch2_bkey_val_to_text(&buf, c, k); 887 + 888 + bch_err_ratelimited(c, "%s", buf.buf); 923 889 printbuf_exit(&buf); 924 890 goto err; 925 891 } ··· 994 942 bounce = true; 995 943 } 996 944 997 - if (orig->opts.promote_target)// || failed) 945 + if (orig->opts.promote_target || have_io_error(failed)) 998 946 promote = promote_alloc(trans, iter, k, &pick, orig->opts, flags, 999 947 &rbio, &bounce, &read_full, failed); 1000 948 ··· 1114 1062 } 1115 1063 1116 1064 if (!rbio->pick.idx) { 1117 - if (!rbio->have_ioref) { 1118 - bch_err_inum_offset_ratelimited(c, 1119 - read_pos.inode, 1120 - read_pos.offset << 9, 1121 - "no device to read from"); 1065 + if (unlikely(!rbio->have_ioref)) { 1066 + struct printbuf buf = PRINTBUF; 1067 + bch2_read_err_msg_trans(trans, &buf, rbio, read_pos); 1068 + prt_printf(&buf, "no device to read from:\n "); 1069 + bch2_bkey_val_to_text(&buf, c, k); 1070 + 1071 + bch_err_ratelimited(c, "%s", buf.buf); 1072 + printbuf_exit(&buf); 1073 + 1122 1074 bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 1123 1075 goto out; 1124 1076 } ··· 1220 1164 BTREE_ITER_slots); 1221 1165 1222 1166 while (1) { 1223 - unsigned bytes, sectors, offset_into_extent; 1224 1167 enum btree_id data_btree = BTREE_ID_extents; 1225 1168 1226 1169 bch2_trans_begin(trans); ··· 1239 1184 if (ret) 1240 1185 goto err; 1241 1186 1242 - offset_into_extent = iter.pos.offset - 1187 + s64 
offset_into_extent = iter.pos.offset - 1243 1188 bkey_start_offset(k.k); 1244 - sectors = k.k->size - offset_into_extent; 1189 + unsigned sectors = k.k->size - offset_into_extent; 1245 1190 1246 1191 bch2_bkey_buf_reassemble(&sk, c, k); 1247 1192 ··· 1256 1201 * With indirect extents, the amount of data to read is the min 1257 1202 * of the original extent and the indirect extent: 1258 1203 */ 1259 - sectors = min(sectors, k.k->size - offset_into_extent); 1204 + sectors = min_t(unsigned, sectors, k.k->size - offset_into_extent); 1260 1205 1261 - bytes = min(sectors, bvec_iter_sectors(bvec_iter)) << 9; 1206 + unsigned bytes = min(sectors, bvec_iter_sectors(bvec_iter)) << 9; 1262 1207 swap(bvec_iter.bi_size, bytes); 1263 1208 1264 1209 if (bvec_iter.bi_size == bytes) ··· 1284 1229 } 1285 1230 1286 1231 bch2_trans_iter_exit(trans, &iter); 1287 - bch2_trans_put(trans); 1288 - bch2_bkey_buf_exit(&sk, c); 1289 1232 1290 1233 if (ret) { 1291 - bch_err_inum_offset_ratelimited(c, inum.inum, 1292 - bvec_iter.bi_sector << 9, 1293 - "read error %i from btree lookup", ret); 1234 + struct printbuf buf = PRINTBUF; 1235 + bch2_inum_offset_err_msg_trans(trans, &buf, inum, bvec_iter.bi_sector << 9); 1236 + prt_printf(&buf, "read error %i from btree lookup", ret); 1237 + bch_err_ratelimited(c, "%s", buf.buf); 1238 + printbuf_exit(&buf); 1239 + 1294 1240 rbio->bio.bi_status = BLK_STS_IOERR; 1295 1241 bch2_rbio_done(rbio); 1296 1242 } 1243 + 1244 + bch2_trans_put(trans); 1245 + bch2_bkey_buf_exit(&sk, c); 1297 1246 } 1298 1247 1299 1248 void bch2_fs_io_read_exit(struct bch_fs *c)
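One of the smaller io_read.c changes above replaces the `BIT(i)` index counter in the promote path with a running `ptr_bit` that is shifted once per pointer. The sketch below models that loop in userspace under simplifying assumptions: devices are plain unsigned ids, and `dev_failed()` and `rewrite_mask()` are hypothetical stand-ins rather than bcachefs functions.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a "did this device fail" lookup. */
static bool dev_failed(const unsigned *failed, size_t nr_failed, unsigned dev)
{
	for (size_t i = 0; i < nr_failed; i++)
		if (failed[i] == dev)
			return true;
	return false;
}

/* Build a mask with one bit per pointer, set when that pointer's device failed. */
static unsigned rewrite_mask(const unsigned *ptr_devs, size_t nr_ptrs,
			     const unsigned *failed, size_t nr_failed)
{
	/* Assumption: one mask bit per pointer, so the mask must be wide enough. */
	assert(nr_ptrs <= sizeof(unsigned) * CHAR_BIT);

	unsigned mask = 0;
	unsigned ptr_bit = 1;

	for (size_t i = 0; i < nr_ptrs; i++) {
		if (dev_failed(failed, nr_failed, ptr_devs[i]))
			mask |= ptr_bit;
		/* Advance for every pointer, whether or not it was marked. */
		ptr_bit <<= 1;
	}
	return mask;
}
```

For pointers on devices 3, 5 and 7 with failures recorded on 5 and 7, the second and third bits are set and the mask is 6.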
+21 -7
fs/bcachefs/io_read.h
···
 #define _BCACHEFS_IO_READ_H

 #include "bkey_buf.h"
+#include "reflink.h"

 struct bch_read_bio {
 	struct bch_fs *c;
···
 struct cache_promote_op;
 struct extent_ptr_decoded;

-int __bch2_read_indirect_extent(struct btree_trans *, unsigned *,
-				struct bkey_buf *);
-
 static inline int bch2_read_indirect_extent(struct btree_trans *trans,
 					    enum btree_id *data_btree,
-					    unsigned *offset_into_extent,
-					    struct bkey_buf *k)
+					    s64 *offset_into_extent,
+					    struct bkey_buf *extent)
 {
-	if (k->k->k.type != KEY_TYPE_reflink_p)
+	if (extent->k->k.type != KEY_TYPE_reflink_p)
 		return 0;

 	*data_btree = BTREE_ID_reflink;
-	return __bch2_read_indirect_extent(trans, offset_into_extent, k);
+	struct btree_iter iter;
+	struct bkey_s_c k = bch2_lookup_indirect_extent(trans, &iter,
+						offset_into_extent,
+						bkey_i_to_s_c_reflink_p(extent->k),
+						true, 0);
+	int ret = bkey_err(k);
+	if (ret)
+		return ret;
+
+	if (bkey_deleted(k.k)) {
+		bch2_trans_iter_exit(trans, &iter);
+		return -BCH_ERR_missing_indirect_extent;
+	}
+
+	bch2_bkey_buf_reassemble(extent, trans->c, k);
+	bch2_trans_iter_exit(trans, &iter);
+	return 0;
 }

 enum bch_read_flags {
+60 -42
fs/bcachefs/io_write.c
··· 164 164 165 165 bch2_trans_copy_iter(&iter, extent_iter); 166 166 167 - for_each_btree_key_upto_continue_norestart(iter, 167 + for_each_btree_key_max_continue_norestart(iter, 168 168 new->k.p, BTREE_ITER_slots, old, ret) { 169 169 s64 sectors = min(new->k.p.offset, old.k->p.offset) - 170 170 max(bkey_start_offset(&new->k), ··· 216 216 SPOS(0, 217 217 extent_iter->pos.inode, 218 218 extent_iter->snapshot), 219 + BTREE_ITER_intent| 219 220 BTREE_ITER_cached); 220 221 int ret = bkey_err(k); 221 222 if (unlikely(ret)) ··· 370 369 bkey_start_pos(&sk.k->k), 371 370 BTREE_ITER_slots|BTREE_ITER_intent); 372 371 373 - ret = bch2_bkey_set_needs_rebalance(c, sk.k, &op->opts) ?: 372 + ret = bch2_bkey_set_needs_rebalance(c, &op->opts, sk.k) ?: 374 373 bch2_extent_update(trans, inum, &iter, sk.k, 375 374 &op->res, 376 375 op->new_i_size, &op->i_sectors_delta, ··· 395 394 } 396 395 397 396 /* Writes */ 397 + 398 + static void __bch2_write_op_error(struct printbuf *out, struct bch_write_op *op, 399 + u64 offset) 400 + { 401 + bch2_inum_offset_err_msg(op->c, out, 402 + (subvol_inum) { op->subvol, op->pos.inode, }, 403 + offset << 9); 404 + prt_printf(out, "write error%s: ", 405 + op->flags & BCH_WRITE_MOVE ? "(internal move)" : ""); 406 + } 407 + 408 + static void bch2_write_op_error(struct printbuf *out, struct bch_write_op *op) 409 + { 410 + __bch2_write_op_error(out, op, op->pos.offset); 411 + } 398 412 399 413 void bch2_submit_wbio_replicas(struct bch_write_bio *wbio, struct bch_fs *c, 400 414 enum bch_data_type type, ··· 547 531 548 532 op->written += sectors_start - keylist_sectors(keys); 549 533 550 - if (ret && !bch2_err_matches(ret, EROFS)) { 534 + if (unlikely(ret && !bch2_err_matches(ret, EROFS))) { 551 535 struct bkey_i *insert = bch2_keylist_front(&op->insert_keys); 552 536 553 - bch_err_inum_offset_ratelimited(c, 554 - insert->k.p.inode, insert->k.p.offset << 9, 555 - "%s write error while doing btree update: %s", 556 - op->flags & BCH_WRITE_MOVE ? 
"move" : "user", 557 - bch2_err_str(ret)); 537 + struct printbuf buf = PRINTBUF; 538 + __bch2_write_op_error(&buf, op, bkey_start_offset(&insert->k)); 539 + prt_printf(&buf, "btree update error: %s", bch2_err_str(ret)); 540 + bch_err_ratelimited(c, "%s", buf.buf); 541 + printbuf_exit(&buf); 558 542 } 559 543 560 544 if (ret) ··· 637 621 638 622 while (1) { 639 623 spin_lock_irq(&wp->writes_lock); 640 - op = list_first_entry_or_null(&wp->writes, struct bch_write_op, wp_list); 641 - if (op) 642 - list_del(&op->wp_list); 624 + op = list_pop_entry(&wp->writes, struct bch_write_op, wp_list); 643 625 wp_update_state(wp, op != NULL); 644 626 spin_unlock_irq(&wp->writes_lock); 645 627 ··· 1094 1080 *_dst = dst; 1095 1081 return more; 1096 1082 csum_err: 1097 - bch_err_inum_offset_ratelimited(c, 1098 - op->pos.inode, 1099 - op->pos.offset << 9, 1100 - "%s write error: error verifying existing checksum while rewriting existing data (memory corruption?)", 1101 - op->flags & BCH_WRITE_MOVE ? "move" : "user"); 1083 + { 1084 + struct printbuf buf = PRINTBUF; 1085 + bch2_write_op_error(&buf, op); 1086 + prt_printf(&buf, "error verifying existing checksum while rewriting existing data (memory corruption?)"); 1087 + bch_err_ratelimited(c, "%s", buf.buf); 1088 + printbuf_exit(&buf); 1089 + } 1090 + 1102 1091 ret = -EIO; 1103 1092 err: 1104 1093 if (to_wbio(dst)->bounce) ··· 1182 1165 struct btree_trans *trans = bch2_trans_get(c); 1183 1166 1184 1167 for_each_keylist_key(&op->insert_keys, orig) { 1185 - int ret = for_each_btree_key_upto_commit(trans, iter, BTREE_ID_extents, 1168 + int ret = for_each_btree_key_max_commit(trans, iter, BTREE_ID_extents, 1186 1169 bkey_start_pos(&orig->k), orig->k.p, 1187 1170 BTREE_ITER_intent, k, 1188 1171 NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({ ··· 1192 1175 if (ret && !bch2_err_matches(ret, EROFS)) { 1193 1176 struct bkey_i *insert = bch2_keylist_front(&op->insert_keys); 1194 1177 1195 - bch_err_inum_offset_ratelimited(c, 1196 - insert->k.p.inode, 
insert->k.p.offset << 9, 1197 - "%s write error while doing btree update: %s", 1198 - op->flags & BCH_WRITE_MOVE ? "move" : "user", 1199 - bch2_err_str(ret)); 1178 + struct printbuf buf = PRINTBUF; 1179 + __bch2_write_op_error(&buf, op, bkey_start_offset(&insert->k)); 1180 + prt_printf(&buf, "btree update error: %s", bch2_err_str(ret)); 1181 + bch_err_ratelimited(c, "%s", buf.buf); 1182 + printbuf_exit(&buf); 1200 1183 } 1201 1184 1202 1185 if (ret) { ··· 1356 1339 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 1357 1340 goto retry; 1358 1341 1342 + bch2_trans_put(trans); 1343 + darray_exit(&buckets); 1344 + 1359 1345 if (ret) { 1360 - bch_err_inum_offset_ratelimited(c, 1361 - op->pos.inode, op->pos.offset << 9, 1362 - "%s: btree lookup error %s", __func__, bch2_err_str(ret)); 1346 + struct printbuf buf = PRINTBUF; 1347 + bch2_write_op_error(&buf, op); 1348 + prt_printf(&buf, "%s(): btree lookup error: %s", __func__, bch2_err_str(ret)); 1349 + bch_err_ratelimited(c, "%s", buf.buf); 1350 + printbuf_exit(&buf); 1363 1351 op->error = ret; 1364 1352 op->flags |= BCH_WRITE_SUBMITTED; 1365 1353 } 1366 - 1367 - bch2_trans_put(trans); 1368 - darray_exit(&buckets); 1369 1354 1370 1355 /* fallback to cow write path? */ 1371 1356 if (!(op->flags & BCH_WRITE_SUBMITTED)) { ··· 1481 1462 if (ret <= 0) { 1482 1463 op->flags |= BCH_WRITE_SUBMITTED; 1483 1464 1484 - if (ret < 0) { 1485 - if (!(op->flags & BCH_WRITE_ALLOC_NOWAIT)) 1486 - bch_err_inum_offset_ratelimited(c, 1487 - op->pos.inode, 1488 - op->pos.offset << 9, 1489 - "%s(): %s error: %s", __func__, 1490 - op->flags & BCH_WRITE_MOVE ? 
"move" : "user", 1491 - bch2_err_str(ret)); 1465 + if (unlikely(ret < 0)) { 1466 + if (!(op->flags & BCH_WRITE_ALLOC_NOWAIT)) { 1467 + struct printbuf buf = PRINTBUF; 1468 + bch2_write_op_error(&buf, op); 1469 + prt_printf(&buf, "%s(): %s", __func__, bch2_err_str(ret)); 1470 + bch_err_ratelimited(c, "%s", buf.buf); 1471 + printbuf_exit(&buf); 1472 + } 1492 1473 op->error = ret; 1493 1474 break; 1494 1475 } ··· 1614 1595 bch2_keylist_init(&op->insert_keys, op->inline_keys); 1615 1596 wbio_init(bio)->put_bio = false; 1616 1597 1617 - if (bio->bi_iter.bi_size & (c->opts.block_size - 1)) { 1618 - bch_err_inum_offset_ratelimited(c, 1619 - op->pos.inode, 1620 - op->pos.offset << 9, 1621 - "%s write error: misaligned write", 1622 - op->flags & BCH_WRITE_MOVE ? "move" : "user"); 1598 + if (unlikely(bio->bi_iter.bi_size & (c->opts.block_size - 1))) { 1599 + struct printbuf buf = PRINTBUF; 1600 + bch2_write_op_error(&buf, op); 1601 + prt_printf(&buf, "misaligned write"); 1602 + printbuf_exit(&buf); 1623 1603 op->error = -EIO; 1624 1604 goto err; 1625 1605 }
+117 -45
fs/bcachefs/journal.c
··· 217 217 if (__bch2_journal_pin_put(j, seq)) 218 218 bch2_journal_reclaim_fast(j); 219 219 bch2_journal_do_writes(j); 220 + 221 + /* 222 + * for __bch2_next_write_buffer_flush_journal_buf(), when quiescing an 223 + * open journal entry 224 + */ 225 + wake_up(&j->wait); 220 226 } 221 227 222 228 /* ··· 256 250 257 251 if (!__journal_entry_is_open(old)) 258 252 return; 253 + 254 + if (old.cur_entry_offset == JOURNAL_ENTRY_BLOCKED_VAL) 255 + old.cur_entry_offset = j->cur_entry_offset_if_blocked; 259 256 260 257 /* Close out old buffer: */ 261 258 buf->data->u64s = cpu_to_le32(old.cur_entry_offset); ··· 381 372 382 373 if (nr_unwritten_journal_entries(j) == ARRAY_SIZE(j->buf)) 383 374 return JOURNAL_ERR_max_in_flight; 375 + 376 + if (bch2_fs_fatal_err_on(journal_cur_seq(j) >= JOURNAL_SEQ_MAX, 377 + c, "cannot start: journal seq overflow")) 378 + return JOURNAL_ERR_insufficient_devices; /* -EROFS */ 384 379 385 380 BUG_ON(!j->cur_entry_sectors); 386 381 ··· 677 664 * @seq: seq to flush 678 665 * @parent: closure object to wait with 679 666 * Returns: 1 if @seq has already been flushed, 0 if @seq is being flushed, 680 - * -EIO if @seq will never be flushed 667 + * -BCH_ERR_journal_flush_err if @seq will never be flushed 681 668 * 682 669 * Like bch2_journal_wait_on_seq, except that it triggers a write immediately if 683 670 * necessary ··· 700 687 701 688 /* Recheck under lock: */ 702 689 if (j->err_seq && seq >= j->err_seq) { 703 - ret = -EIO; 690 + ret = -BCH_ERR_journal_flush_err; 704 691 goto out; 705 692 } 706 693 ··· 807 794 } 808 795 809 796 /* 810 - * bch2_journal_noflush_seq - tell the journal not to issue any flushes before 797 + * bch2_journal_noflush_seq - ask the journal not to issue any flushes in the 798 + * range [start, end) 811 799 * @seq 812 800 */ 813 - bool bch2_journal_noflush_seq(struct journal *j, u64 seq) 801 + bool bch2_journal_noflush_seq(struct journal *j, u64 start, u64 end) 814 802 { 815 803 struct bch_fs *c = container_of(j, struct 
bch_fs, journal); 816 804 u64 unwritten_seq; ··· 820 806 if (!(c->sb.features & (1ULL << BCH_FEATURE_journal_no_flush))) 821 807 return false; 822 808 823 - if (seq <= c->journal.flushed_seq_ondisk) 809 + if (c->journal.flushed_seq_ondisk >= start) 824 810 return false; 825 811 826 812 spin_lock(&j->lock); 827 - if (seq <= c->journal.flushed_seq_ondisk) 813 + if (c->journal.flushed_seq_ondisk >= start) 828 814 goto out; 829 815 830 816 for (unwritten_seq = journal_last_unwritten_seq(j); 831 - unwritten_seq < seq; 817 + unwritten_seq < end; 832 818 unwritten_seq++) { 833 819 struct journal_buf *buf = journal_seq_to_buf(j, unwritten_seq); 834 820 ··· 845 831 return ret; 846 832 } 847 833 848 - int bch2_journal_meta(struct journal *j) 834 + static int __bch2_journal_meta(struct journal *j) 849 835 { 850 - struct journal_buf *buf; 851 - struct journal_res res; 852 - int ret; 853 - 854 - memset(&res, 0, sizeof(res)); 855 - 856 - ret = bch2_journal_res_get(j, &res, jset_u64s(0), 0); 836 + struct journal_res res = {}; 837 + int ret = bch2_journal_res_get(j, &res, jset_u64s(0), 0); 857 838 if (ret) 858 839 return ret; 859 840 860 - buf = j->buf + (res.seq & JOURNAL_BUF_MASK); 841 + struct journal_buf *buf = j->buf + (res.seq & JOURNAL_BUF_MASK); 861 842 buf->must_flush = true; 862 843 863 844 if (!buf->flush_time) { ··· 865 856 return bch2_journal_flush_seq(j, res.seq, TASK_UNINTERRUPTIBLE); 866 857 } 867 858 859 + int bch2_journal_meta(struct journal *j) 860 + { 861 + struct bch_fs *c = container_of(j, struct bch_fs, journal); 862 + 863 + if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_journal)) 864 + return -EROFS; 865 + 866 + int ret = __bch2_journal_meta(j); 867 + bch2_write_ref_put(c, BCH_WRITE_REF_journal); 868 + return ret; 869 + } 870 + 868 871 /* block/unlock the journal: */ 869 872 870 873 void bch2_journal_unblock(struct journal *j) 871 874 { 872 875 spin_lock(&j->lock); 873 - j->blocked--; 876 + if (!--j->blocked && 877 + j->cur_entry_offset_if_blocked < 
JOURNAL_ENTRY_CLOSED_VAL && 878 + j->reservations.cur_entry_offset == JOURNAL_ENTRY_BLOCKED_VAL) { 879 + union journal_res_state old, new; 880 + 881 + old.v = atomic64_read(&j->reservations.counter); 882 + do { 883 + new.v = old.v; 884 + new.cur_entry_offset = j->cur_entry_offset_if_blocked; 885 + } while (!atomic64_try_cmpxchg(&j->reservations.counter, &old.v, new.v)); 886 + } 874 887 spin_unlock(&j->lock); 875 888 876 889 journal_wake(j); 877 890 } 878 891 892 + static void __bch2_journal_block(struct journal *j) 893 + { 894 + if (!j->blocked++) { 895 + union journal_res_state old, new; 896 + 897 + old.v = atomic64_read(&j->reservations.counter); 898 + do { 899 + j->cur_entry_offset_if_blocked = old.cur_entry_offset; 900 + 901 + if (j->cur_entry_offset_if_blocked >= JOURNAL_ENTRY_CLOSED_VAL) 902 + break; 903 + 904 + new.v = old.v; 905 + new.cur_entry_offset = JOURNAL_ENTRY_BLOCKED_VAL; 906 + } while (!atomic64_try_cmpxchg(&j->reservations.counter, &old.v, new.v)); 907 + 908 + journal_cur_buf(j)->data->u64s = cpu_to_le32(old.cur_entry_offset); 909 + } 910 + } 911 + 879 912 void bch2_journal_block(struct journal *j) 880 913 { 881 914 spin_lock(&j->lock); 882 - j->blocked++; 915 + __bch2_journal_block(j); 883 916 spin_unlock(&j->lock); 884 917 885 918 journal_quiesce(j); 886 919 } 887 920 888 - static struct journal_buf *__bch2_next_write_buffer_flush_journal_buf(struct journal *j, u64 max_seq) 921 + static struct journal_buf *__bch2_next_write_buffer_flush_journal_buf(struct journal *j, 922 + u64 max_seq, bool *blocked) 889 923 { 890 924 struct journal_buf *ret = NULL; 891 925 ··· 945 893 struct journal_buf *buf = j->buf + idx; 946 894 947 895 if (buf->need_flush_to_write_buffer) { 948 - if (seq == journal_cur_seq(j)) 949 - __journal_entry_close(j, JOURNAL_ENTRY_CLOSED_VAL, true); 950 - 951 896 union journal_res_state s; 952 897 s.v = atomic64_read_acquire(&j->reservations.counter); 953 898 954 - ret = journal_state_count(s, idx) 899 + unsigned open = seq == 
journal_cur_seq(j) && __journal_entry_is_open(s); 900 + 901 + if (open && !*blocked) { 902 + __bch2_journal_block(j); 903 + *blocked = true; 904 + } 905 + 906 + ret = journal_state_count(s, idx) > open 955 907 ? ERR_PTR(-EAGAIN) 956 908 : buf; 957 909 break; ··· 968 912 return ret; 969 913 } 970 914 971 - struct journal_buf *bch2_next_write_buffer_flush_journal_buf(struct journal *j, u64 max_seq) 915 + struct journal_buf *bch2_next_write_buffer_flush_journal_buf(struct journal *j, 916 + u64 max_seq, bool *blocked) 972 917 { 973 918 struct journal_buf *ret; 919 + *blocked = false; 974 920 975 - wait_event(j->wait, (ret = __bch2_next_write_buffer_flush_journal_buf(j, max_seq)) != ERR_PTR(-EAGAIN)); 921 + wait_event(j->wait, (ret = __bch2_next_write_buffer_flush_journal_buf(j, 922 + max_seq, blocked)) != ERR_PTR(-EAGAIN)); 923 + if (IS_ERR_OR_NULL(ret) && *blocked) 924 + bch2_journal_unblock(j); 925 + 976 926 return ret; 977 927 } 978 928 ··· 1007 945 } 1008 946 1009 947 for (nr_got = 0; nr_got < nr_want; nr_got++) { 1010 - if (new_fs) { 1011 - bu[nr_got] = bch2_bucket_alloc_new_fs(ca); 1012 - if (bu[nr_got] < 0) { 1013 - ret = -BCH_ERR_ENOSPC_bucket_alloc; 1014 - break; 1015 - } 1016 - } else { 1017 - ob[nr_got] = bch2_bucket_alloc(c, ca, BCH_WATERMARK_normal, 1018 - BCH_DATA_journal, cl); 1019 - ret = PTR_ERR_OR_ZERO(ob[nr_got]); 1020 - if (ret) 1021 - break; 948 + enum bch_watermark watermark = new_fs 949 + ? 
BCH_WATERMARK_btree 950 + : BCH_WATERMARK_normal; 1022 951 952 + ob[nr_got] = bch2_bucket_alloc(c, ca, watermark, 953 + BCH_DATA_journal, cl); 954 + ret = PTR_ERR_OR_ZERO(ob[nr_got]); 955 + if (ret) 956 + break; 957 + 958 + if (!new_fs) { 1023 959 ret = bch2_trans_run(c, 1024 960 bch2_trans_mark_metadata_bucket(trans, ca, 1025 961 ob[nr_got]->bucket, BCH_DATA_journal, ··· 1027 967 bch_err_msg(c, ret, "marking new journal buckets"); 1028 968 break; 1029 969 } 1030 - 1031 - bu[nr_got] = ob[nr_got]->bucket; 1032 970 } 971 + 972 + bu[nr_got] = ob[nr_got]->bucket; 1033 973 } 1034 974 1035 975 if (!nr_got) ··· 1069 1009 if (ret) 1070 1010 goto err_unblock; 1071 1011 1072 - if (!new_fs) 1073 - bch2_write_super(c); 1012 + bch2_write_super(c); 1074 1013 1075 1014 /* Commit: */ 1076 1015 if (c) ··· 1103 1044 bu[i], BCH_DATA_free, 0, 1104 1045 BTREE_TRIGGER_transactional)); 1105 1046 err_free: 1106 - if (!new_fs) 1107 - for (i = 0; i < nr_got; i++) 1108 - bch2_open_bucket_put(c, ob[i]); 1047 + for (i = 0; i < nr_got; i++) 1048 + bch2_open_bucket_put(c, ob[i]); 1109 1049 1110 1050 kfree(new_bucket_seq); 1111 1051 kfree(new_buckets); ··· 1251 1193 * Always write a new journal entry, to make sure the clock hands are up 1252 1194 * to date (and match the superblock) 1253 1195 */ 1254 - bch2_journal_meta(j); 1196 + __bch2_journal_meta(j); 1255 1197 1256 1198 journal_quiesce(j); 1257 1199 cancel_delayed_work_sync(&j->write_work); ··· 1274 1216 struct genradix_iter iter; 1275 1217 bool had_entries = false; 1276 1218 u64 last_seq = cur_seq, nr, seq; 1219 + 1220 + if (cur_seq >= JOURNAL_SEQ_MAX) { 1221 + bch_err(c, "cannot start: journal seq overflow"); 1222 + return -EINVAL; 1223 + } 1277 1224 1278 1225 genradix_for_each_reverse(&c->journal_entries, iter, _i) { 1279 1226 i = *_i; ··· 1537 1474 case JOURNAL_ENTRY_CLOSED_VAL: 1538 1475 prt_printf(out, "closed\n"); 1539 1476 break; 1477 + case JOURNAL_ENTRY_BLOCKED_VAL: 1478 + prt_printf(out, "blocked\n"); 1479 + break; 1540 1480 
default: 1541 1481 prt_printf(out, "%u/%u\n", s.cur_entry_offset, j->cur_entry_u64s); 1542 1482 break; ··· 1565 1499 printbuf_indent_sub(out, 2); 1566 1500 1567 1501 for_each_member_device_rcu(c, ca, &c->rw_devs[BCH_DATA_journal]) { 1502 + if (!ca->mi.durability) 1503 + continue; 1504 + 1568 1505 struct journal_device *ja = &ca->journal; 1569 1506 1570 1507 if (!test_bit(ca->dev_idx, c->rw_devs[BCH_DATA_journal].d)) ··· 1577 1508 continue; 1578 1509 1579 1510 prt_printf(out, "dev %u:\n", ca->dev_idx); 1511 + prt_printf(out, "durability %u:\n", ca->mi.durability); 1580 1512 printbuf_indent_add(out, 2); 1581 1513 prt_printf(out, "nr\t%u\n", ja->nr); 1582 1514 prt_printf(out, "bucket size\t%u\n", ca->mi.bucket_size); ··· 1588 1518 prt_printf(out, "cur_idx\t%u (seq %llu)\n", ja->cur_idx, ja->bucket_seq[ja->cur_idx]); 1589 1519 printbuf_indent_sub(out, 2); 1590 1520 } 1521 + 1522 + prt_printf(out, "replicas want %u need %u\n", c->opts.metadata_replicas, c->opts.metadata_replicas_required); 1591 1523 1592 1524 rcu_read_unlock(); 1593 1525
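The block/unblock hunks above park the open entry's offset behind the new JOURNAL_ENTRY_BLOCKED_VAL sentinel and restore it once the last blocker leaves. A minimal user-space sketch of that idea follows, assuming C11 atomics in place of atomic64_try_cmpxchg() and a toy 64-bit state word whose low 20 bits hold cur_entry_offset. The real layout of union journal_res_state is not shown in this diff, and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Sentinel values copied from the journal_types.h hunk of this commit. */
#define JOURNAL_ENTRY_OFFSET_MAX	((1U << 20) - 1)
#define JOURNAL_ENTRY_BLOCKED_VAL	(JOURNAL_ENTRY_OFFSET_MAX - 2)
#define JOURNAL_ENTRY_CLOSED_VAL	(JOURNAL_ENTRY_OFFSET_MAX - 1)

/* Hypothetical stand-in for struct journal; the layout is an assumption. */
struct toy_journal {
	_Atomic uint64_t state;		/* low 20 bits: cur_entry_offset */
	unsigned blocked;		/* nesting count; caller holds the lock */
	unsigned offset_if_blocked;	/* offset saved while blocked */
};

static unsigned toy_offset(uint64_t v)
{
	return (unsigned)(v & JOURNAL_ENTRY_OFFSET_MAX);
}

static uint64_t toy_set_offset(uint64_t v, unsigned off)
{
	return (v & ~(uint64_t)JOURNAL_ENTRY_OFFSET_MAX) | off;
}

/* First blocker swaps the live offset for the BLOCKED sentinel. */
static void toy_journal_block(struct toy_journal *j)
{
	if (j->blocked++)
		return;

	uint64_t old = atomic_load(&j->state);
	uint64_t next = old;
	do {
		j->offset_if_blocked = toy_offset(old);
		/* Closed or errored entries are left untouched. */
		if (j->offset_if_blocked >= JOURNAL_ENTRY_CLOSED_VAL)
			break;
		next = toy_set_offset(old, JOURNAL_ENTRY_BLOCKED_VAL);
	} while (!atomic_compare_exchange_weak(&j->state, &old, next));
}

/* Last blocker restores the saved offset, if the entry is still parked. */
static void toy_journal_unblock(struct toy_journal *j)
{
	assert(j->blocked > 0);
	if (--j->blocked)
		return;
	if (j->offset_if_blocked >= JOURNAL_ENTRY_CLOSED_VAL)
		return;

	uint64_t old = atomic_load(&j->state);
	if (toy_offset(old) != JOURNAL_ENTRY_BLOCKED_VAL)
		return;

	while (!atomic_compare_exchange_weak(&j->state, &old,
			toy_set_offset(old, j->offset_if_blocked)))
		;
}
```

Only the offset bits change in either direction, so the other bits packed into the same word survive a block/unblock cycle. In the kernel code those bits carry the reservation counts.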
+5 -4
fs/bcachefs/journal.h
···
 		spin_lock(&j->lock);
 		bch2_journal_buf_put_final(j, seq);
 		spin_unlock(&j->lock);
-	}
+	} else if (unlikely(s.cur_entry_offset == JOURNAL_ENTRY_BLOCKED_VAL))
+		wake_up(&j->wait);
 }

 /*
···

 int bch2_journal_flush_seq(struct journal *, u64, unsigned);
 int bch2_journal_flush(struct journal *);
-bool bch2_journal_noflush_seq(struct journal *, u64);
+bool bch2_journal_noflush_seq(struct journal *, u64, u64);
 int bch2_journal_meta(struct journal *);

 void bch2_journal_halt(struct journal *);
···
 static inline int bch2_journal_error(struct journal *j)
 {
 	return j->reservations.cur_entry_offset == JOURNAL_ENTRY_ERROR_VAL
-		? -EIO : 0;
+		? -BCH_ERR_journal_shutdown : 0;
 }

 struct bch_dev;
···

 void bch2_journal_unblock(struct journal *);
 void bch2_journal_block(struct journal *);
-struct journal_buf *bch2_next_write_buffer_flush_journal_buf(struct journal *j, u64 max_seq);
+struct journal_buf *bch2_next_write_buffer_flush_journal_buf(struct journal *, u64, bool *);

 void __bch2_journal_debug_to_text(struct printbuf *, struct journal *);
 void bch2_journal_debug_to_text(struct printbuf *, struct journal *);
+130 -97
fs/bcachefs/journal_io.c
··· 17 17 #include "sb-clean.h" 18 18 #include "trace.h" 19 19 20 + #include <linux/string_choices.h> 21 + 20 22 void bch2_journal_pos_from_member_info_set(struct bch_fs *c) 21 23 { 22 24 lockdep_assert_held(&c->sb_lock); ··· 301 299 journal_entry_err_msg(&_buf, version, jset, entry); \ 302 300 prt_printf(&_buf, msg, ##__VA_ARGS__); \ 303 301 \ 304 - switch (flags & BCH_VALIDATE_write) { \ 302 + switch (from.flags & BCH_VALIDATE_write) { \ 305 303 case READ: \ 306 304 mustfix_fsck_err(c, _err, "%s", _buf.buf); \ 307 305 break; \ ··· 327 325 static int journal_validate_key(struct bch_fs *c, 328 326 struct jset *jset, 329 327 struct jset_entry *entry, 330 - unsigned level, enum btree_id btree_id, 331 328 struct bkey_i *k, 332 - unsigned version, int big_endian, 333 - enum bch_validate_flags flags) 329 + struct bkey_validate_context from, 330 + unsigned version, int big_endian) 334 331 { 332 + enum bch_validate_flags flags = from.flags; 335 333 int write = flags & BCH_VALIDATE_write; 336 334 void *next = vstruct_next(entry); 337 335 int ret = 0; ··· 366 364 } 367 365 368 366 if (!write) 369 - bch2_bkey_compat(level, btree_id, version, big_endian, 367 + bch2_bkey_compat(from.level, from.btree, version, big_endian, 370 368 write, NULL, bkey_to_packed(k)); 371 369 372 - ret = bch2_bkey_validate(c, bkey_i_to_s_c(k), 373 - __btree_node_type(level, btree_id), write); 370 + ret = bch2_bkey_validate(c, bkey_i_to_s_c(k), from); 374 371 if (ret == -BCH_ERR_fsck_delete_bkey) { 375 372 le16_add_cpu(&entry->u64s, -((u16) k->k.u64s)); 376 373 memmove(k, bkey_next(k), next - (void *) bkey_next(k)); ··· 380 379 goto fsck_err; 381 380 382 381 if (write) 383 - bch2_bkey_compat(level, btree_id, version, big_endian, 382 + bch2_bkey_compat(from.level, from.btree, version, big_endian, 384 383 write, NULL, bkey_to_packed(k)); 385 384 fsck_err: 386 385 return ret; ··· 390 389 struct jset *jset, 391 390 struct jset_entry *entry, 392 391 unsigned version, int big_endian, 393 - enum 
bch_validate_flags flags) 392 + struct bkey_validate_context from) 394 393 { 395 394 struct bkey_i *k = entry->start; 396 395 396 + from.level = entry->level; 397 + from.btree = entry->btree_id; 398 + 397 399 while (k != vstruct_last(entry)) { 398 - int ret = journal_validate_key(c, jset, entry, 399 - entry->level, 400 - entry->btree_id, 401 - k, version, big_endian, 402 - flags|BCH_VALIDATE_journal); 400 + int ret = journal_validate_key(c, jset, entry, k, from, version, big_endian); 403 401 if (ret == FSCK_DELETED_KEY) 404 402 continue; 405 403 else if (ret) ··· 421 421 bch2_prt_jset_entry_type(out, entry->type); 422 422 prt_str(out, ": "); 423 423 } 424 - prt_printf(out, "btree=%s l=%u ", bch2_btree_id_str(entry->btree_id), entry->level); 424 + bch2_btree_id_level_to_text(out, entry->btree_id, entry->level); 425 + prt_char(out, ' '); 425 426 bch2_bkey_val_to_text(out, c, bkey_i_to_s_c(k)); 426 427 first = false; 427 428 } ··· 432 431 struct jset *jset, 433 432 struct jset_entry *entry, 434 433 unsigned version, int big_endian, 435 - enum bch_validate_flags flags) 434 + struct bkey_validate_context from) 436 435 { 437 436 struct bkey_i *k = entry->start; 438 437 int ret = 0; 438 + 439 + from.root = true; 440 + from.level = entry->level + 1; 441 + from.btree = entry->btree_id; 439 442 440 443 if (journal_entry_err_on(!entry->u64s || 441 444 le16_to_cpu(entry->u64s) != k->k.u64s, ··· 457 452 return 0; 458 453 } 459 454 460 - ret = journal_validate_key(c, jset, entry, 1, entry->btree_id, k, 461 - version, big_endian, flags); 455 + ret = journal_validate_key(c, jset, entry, k, from, version, big_endian); 462 456 if (ret == FSCK_DELETED_KEY) 463 457 ret = 0; 464 458 fsck_err: ··· 474 470 struct jset *jset, 475 471 struct jset_entry *entry, 476 472 unsigned version, int big_endian, 477 - enum bch_validate_flags flags) 473 + struct bkey_validate_context from) 478 474 { 479 475 /* obsolete, don't care: */ 480 476 return 0; ··· 489 485 struct jset *jset, 490 486 struct 
jset_entry *entry, 491 487 unsigned version, int big_endian, 492 - enum bch_validate_flags flags) 488 + struct bkey_validate_context from) 493 489 { 494 490 int ret = 0; 495 491 ··· 516 512 struct jset *jset, 517 513 struct jset_entry *entry, 518 514 unsigned version, int big_endian, 519 - enum bch_validate_flags flags) 515 + struct bkey_validate_context from) 520 516 { 521 517 struct jset_entry_blacklist_v2 *bl_entry; 522 518 int ret = 0; ··· 558 554 struct jset *jset, 559 555 struct jset_entry *entry, 560 556 unsigned version, int big_endian, 561 - enum bch_validate_flags flags) 557 + struct bkey_validate_context from) 562 558 { 563 559 struct jset_entry_usage *u = 564 560 container_of(entry, struct jset_entry_usage, entry); ··· 592 588 struct jset *jset, 593 589 struct jset_entry *entry, 594 590 unsigned version, int big_endian, 595 - enum bch_validate_flags flags) 591 + struct bkey_validate_context from) 596 592 { 597 593 struct jset_entry_data_usage *u = 598 594 container_of(entry, struct jset_entry_data_usage, entry); ··· 636 632 struct jset *jset, 637 633 struct jset_entry *entry, 638 634 unsigned version, int big_endian, 639 - enum bch_validate_flags flags) 635 + struct bkey_validate_context from) 640 636 { 641 637 struct jset_entry_clock *clock = 642 638 container_of(entry, struct jset_entry_clock, entry); ··· 669 665 struct jset_entry_clock *clock = 670 666 container_of(entry, struct jset_entry_clock, entry); 671 667 672 - prt_printf(out, "%s=%llu", clock->rw ? 
"write" : "read", le64_to_cpu(clock->time)); 668 + prt_printf(out, "%s=%llu", str_write_read(clock->rw), le64_to_cpu(clock->time)); 673 669 } 674 670 675 671 static int journal_entry_dev_usage_validate(struct bch_fs *c, 676 672 struct jset *jset, 677 673 struct jset_entry *entry, 678 674 unsigned version, int big_endian, 679 - enum bch_validate_flags flags) 675 + struct bkey_validate_context from) 680 676 { 681 677 struct jset_entry_dev_usage *u = 682 678 container_of(entry, struct jset_entry_dev_usage, entry); ··· 733 729 struct jset *jset, 734 730 struct jset_entry *entry, 735 731 unsigned version, int big_endian, 736 - enum bch_validate_flags flags) 732 + struct bkey_validate_context from) 737 733 { 738 734 return 0; 739 735 } ··· 742 738 struct jset_entry *entry) 743 739 { 744 740 struct jset_entry_log *l = container_of(entry, struct jset_entry_log, entry); 745 - unsigned bytes = vstruct_bytes(entry) - offsetof(struct jset_entry_log, d); 746 741 747 - prt_printf(out, "%.*s", bytes, l->d); 742 + prt_printf(out, "%.*s", jset_entry_log_msg_bytes(l), l->d); 748 743 } 749 744 750 745 static int journal_entry_overwrite_validate(struct bch_fs *c, 751 746 struct jset *jset, 752 747 struct jset_entry *entry, 753 748 unsigned version, int big_endian, 754 - enum bch_validate_flags flags) 749 + struct bkey_validate_context from) 755 750 { 751 + from.flags = 0; 756 752 return journal_entry_btree_keys_validate(c, jset, entry, 757 - version, big_endian, READ); 753 + version, big_endian, from); 758 754 } 759 755 760 756 static void journal_entry_overwrite_to_text(struct printbuf *out, struct bch_fs *c, ··· 767 763 struct jset *jset, 768 764 struct jset_entry *entry, 769 765 unsigned version, int big_endian, 770 - enum bch_validate_flags flags) 766 + struct bkey_validate_context from) 771 767 { 772 768 return journal_entry_btree_keys_validate(c, jset, entry, 773 - version, big_endian, READ); 769 + version, big_endian, from); 774 770 } 775 771 776 772 static void 
journal_entry_write_buffer_keys_to_text(struct printbuf *out, struct bch_fs *c, ··· 783 779 struct jset *jset, 784 780 struct jset_entry *entry, 785 781 unsigned version, int big_endian, 786 - enum bch_validate_flags flags) 782 + struct bkey_validate_context from) 787 783 { 788 784 unsigned bytes = vstruct_bytes(entry); 789 785 unsigned expected = 16; ··· 813 809 struct jset_entry_ops { 814 810 int (*validate)(struct bch_fs *, struct jset *, 815 811 struct jset_entry *, unsigned, int, 816 - enum bch_validate_flags); 812 + struct bkey_validate_context); 817 813 void (*to_text)(struct printbuf *, struct bch_fs *, struct jset_entry *); 818 814 }; 819 815 ··· 831 827 struct jset *jset, 832 828 struct jset_entry *entry, 833 829 unsigned version, int big_endian, 834 - enum bch_validate_flags flags) 830 + struct bkey_validate_context from) 835 831 { 836 832 return entry->type < BCH_JSET_ENTRY_NR 837 833 ? bch2_jset_entry_ops[entry->type].validate(c, jset, entry, 838 - version, big_endian, flags) 834 + version, big_endian, from) 839 835 : 0; 840 836 } 841 837 ··· 853 849 static int jset_validate_entries(struct bch_fs *c, struct jset *jset, 854 850 enum bch_validate_flags flags) 855 851 { 852 + struct bkey_validate_context from = { 853 + .flags = flags, 854 + .from = BKEY_VALIDATE_journal, 855 + .journal_seq = le64_to_cpu(jset->seq), 856 + }; 857 + 856 858 unsigned version = le32_to_cpu(jset->version); 857 859 int ret = 0; 858 860 859 861 vstruct_for_each(jset, entry) { 862 + from.journal_offset = (u64 *) entry - jset->_data; 863 + 860 864 if (journal_entry_err_on(vstruct_next(entry) > vstruct_last(jset), 861 865 c, version, jset, entry, 862 866 journal_entry_past_jset_end, ··· 873 861 break; 874 862 } 875 863 876 - ret = bch2_journal_entry_validate(c, jset, entry, 877 - version, JSET_BIG_ENDIAN(jset), flags); 864 + ret = bch2_journal_entry_validate(c, jset, entry, version, 865 + JSET_BIG_ENDIAN(jset), from); 878 866 if (ret) 879 867 break; 880 868 } ··· 887 875 struct jset 
*jset, u64 sector, 888 876 enum bch_validate_flags flags) 889 877 { 890 - unsigned version; 878 + struct bkey_validate_context from = { 879 + .flags = flags, 880 + .from = BKEY_VALIDATE_journal, 881 + .journal_seq = le64_to_cpu(jset->seq), 882 + }; 891 883 int ret = 0; 892 884 893 885 if (le64_to_cpu(jset->magic) != jset_magic(c)) 894 886 return JOURNAL_ENTRY_NONE; 895 887 896 - version = le32_to_cpu(jset->version); 888 + unsigned version = le32_to_cpu(jset->version); 897 889 if (journal_entry_err_on(!bch2_version_compatible(version), 898 890 c, version, jset, NULL, 899 891 jset_unsupported_version, ··· 942 926 unsigned bucket_sectors_left, 943 927 unsigned sectors_read) 944 928 { 945 - size_t bytes = vstruct_bytes(jset); 946 - unsigned version; 947 - enum bch_validate_flags flags = BCH_VALIDATE_journal; 929 + struct bkey_validate_context from = { 930 + .from = BKEY_VALIDATE_journal, 931 + .journal_seq = le64_to_cpu(jset->seq), 932 + }; 948 933 int ret = 0; 949 934 950 935 if (le64_to_cpu(jset->magic) != jset_magic(c)) 951 936 return JOURNAL_ENTRY_NONE; 952 937 953 - version = le32_to_cpu(jset->version); 938 + unsigned version = le32_to_cpu(jset->version); 954 939 if (journal_entry_err_on(!bch2_version_compatible(version), 955 940 c, version, jset, NULL, 956 941 jset_unsupported_version, ··· 964 947 return -EINVAL; 965 948 } 966 949 950 + size_t bytes = vstruct_bytes(jset); 967 951 if (bytes > (sectors_read << 9) && 968 952 sectors_read < bucket_sectors_left) 969 953 return JOURNAL_ENTRY_REREAD; ··· 1249 1231 * those entries will be blacklisted: 1250 1232 */ 1251 1233 genradix_for_each_reverse(&c->journal_entries, radix_iter, _i) { 1252 - enum bch_validate_flags flags = BCH_VALIDATE_journal; 1253 - 1254 1234 i = *_i; 1255 1235 1256 1236 if (journal_replay_ignore(i)) ··· 1268 1252 continue; 1269 1253 } 1270 1254 1255 + struct bkey_validate_context from = { 1256 + .from = BKEY_VALIDATE_journal, 1257 + .journal_seq = le64_to_cpu(i->j.seq), 1258 + }; 1271 1259 if 
(journal_entry_err_on(le64_to_cpu(i->j.last_seq) > le64_to_cpu(i->j.seq), 1272 1260 c, le32_to_cpu(i->j.version), &i->j, NULL, 1273 1261 jset_last_seq_newer_than_seq, ··· 1431 1411 1432 1412 /* journal write: */ 1433 1413 1414 + static void journal_advance_devs_to_next_bucket(struct journal *j, 1415 + struct dev_alloc_list *devs, 1416 + unsigned sectors, u64 seq) 1417 + { 1418 + struct bch_fs *c = container_of(j, struct bch_fs, journal); 1419 + 1420 + darray_for_each(*devs, i) { 1421 + struct bch_dev *ca = rcu_dereference(c->devs[*i]); 1422 + if (!ca) 1423 + continue; 1424 + 1425 + struct journal_device *ja = &ca->journal; 1426 + 1427 + if (sectors > ja->sectors_free && 1428 + sectors <= ca->mi.bucket_size && 1429 + bch2_journal_dev_buckets_available(j, ja, 1430 + journal_space_discarded)) { 1431 + ja->cur_idx = (ja->cur_idx + 1) % ja->nr; 1432 + ja->sectors_free = ca->mi.bucket_size; 1433 + 1434 + /* 1435 + * ja->bucket_seq[ja->cur_idx] must always have 1436 + * something sensible: 1437 + */ 1438 + ja->bucket_seq[ja->cur_idx] = le64_to_cpu(seq); 1439 + } 1440 + } 1441 + } 1442 + 1434 1443 static void __journal_write_alloc(struct journal *j, 1435 1444 struct journal_buf *w, 1436 - struct dev_alloc_list *devs_sorted, 1445 + struct dev_alloc_list *devs, 1437 1446 unsigned sectors, 1438 1447 unsigned *replicas, 1439 1448 unsigned replicas_want) 1440 1449 { 1441 1450 struct bch_fs *c = container_of(j, struct bch_fs, journal); 1442 - struct journal_device *ja; 1443 - struct bch_dev *ca; 1444 - unsigned i; 1445 1451 1446 - if (*replicas >= replicas_want) 1447 - return; 1448 - 1449 - for (i = 0; i < devs_sorted->nr; i++) { 1450 - ca = rcu_dereference(c->devs[devs_sorted->devs[i]]); 1452 + darray_for_each(*devs, i) { 1453 + struct bch_dev *ca = rcu_dereference(c->devs[*i]); 1451 1454 if (!ca) 1452 1455 continue; 1453 1456 1454 - ja = &ca->journal; 1457 + struct journal_device *ja = &ca->journal; 1455 1458 1456 1459 /* 1457 1460 * Check that we can use this device, and 
aren't already using ··· 1520 1477 { 1521 1478 struct bch_fs *c = container_of(j, struct bch_fs, journal); 1522 1479 struct bch_devs_mask devs; 1523 - struct journal_device *ja; 1524 - struct bch_dev *ca; 1525 1480 struct dev_alloc_list devs_sorted; 1526 1481 unsigned sectors = vstruct_sectors(w->data, c->block_bits); 1527 1482 unsigned target = c->opts.metadata_target ?: 1528 1483 c->opts.foreground_target; 1529 - unsigned i, replicas = 0, replicas_want = 1484 + unsigned replicas = 0, replicas_want = 1530 1485 READ_ONCE(c->opts.metadata_replicas); 1531 1486 unsigned replicas_need = min_t(unsigned, replicas_want, 1532 1487 READ_ONCE(c->opts.metadata_replicas_required)); 1488 + bool advance_done = false; 1533 1489 1534 1490 rcu_read_lock(); 1535 - retry: 1536 - devs = target_rw_devs(c, BCH_DATA_journal, target); 1537 1491 1538 - devs_sorted = bch2_dev_alloc_list(c, &j->wp.stripe, &devs); 1539 - 1540 - __journal_write_alloc(j, w, &devs_sorted, 1541 - sectors, &replicas, replicas_want); 1542 - 1543 - if (replicas >= replicas_want) 1544 - goto done; 1545 - 1546 - for (i = 0; i < devs_sorted.nr; i++) { 1547 - ca = rcu_dereference(c->devs[devs_sorted.devs[i]]); 1548 - if (!ca) 1549 - continue; 1550 - 1551 - ja = &ca->journal; 1552 - 1553 - if (sectors > ja->sectors_free && 1554 - sectors <= ca->mi.bucket_size && 1555 - bch2_journal_dev_buckets_available(j, ja, 1556 - journal_space_discarded)) { 1557 - ja->cur_idx = (ja->cur_idx + 1) % ja->nr; 1558 - ja->sectors_free = ca->mi.bucket_size; 1559 - 1560 - /* 1561 - * ja->bucket_seq[ja->cur_idx] must always have 1562 - * something sensible: 1563 - */ 1564 - ja->bucket_seq[ja->cur_idx] = le64_to_cpu(w->data->seq); 1565 - } 1492 + /* We might run more than once if we have to stop and do discards: */ 1493 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(bkey_i_to_s_c(&w->key)); 1494 + bkey_for_each_ptr(ptrs, p) { 1495 + struct bch_dev *ca = bch2_dev_rcu_noerror(c, p->dev); 1496 + if (ca) 1497 + replicas += ca->mi.durability; 1566 
1498 } 1567 1499 1568 - __journal_write_alloc(j, w, &devs_sorted, 1569 - sectors, &replicas, replicas_want); 1500 + retry_target: 1501 + devs = target_rw_devs(c, BCH_DATA_journal, target); 1502 + devs_sorted = bch2_dev_alloc_list(c, &j->wp.stripe, &devs); 1503 + retry_alloc: 1504 + __journal_write_alloc(j, w, &devs_sorted, sectors, &replicas, replicas_want); 1505 + 1506 + if (likely(replicas >= replicas_want)) 1507 + goto done; 1508 + 1509 + if (!advance_done) { 1510 + journal_advance_devs_to_next_bucket(j, &devs_sorted, sectors, w->data->seq); 1511 + advance_done = true; 1512 + goto retry_alloc; 1513 + } 1570 1514 1571 1515 if (replicas < replicas_want && target) { 1572 1516 /* Retry from all devices: */ 1573 1517 target = 0; 1574 - goto retry; 1518 + advance_done = false; 1519 + goto retry_target; 1575 1520 } 1576 1521 done: 1577 1522 rcu_read_unlock(); 1578 1523 1579 1524 BUG_ON(bkey_val_u64s(&w->key.k) > BCH_REPLICAS_MAX); 1580 1525 1581 - return replicas >= replicas_need ? 0 : -EROFS; 1526 + return replicas >= replicas_need ? 
0 : -BCH_ERR_insufficient_journal_devices; 1582 1527 } 1583 1528 1584 1529 static void journal_buf_realloc(struct journal *j, struct journal_buf *buf) ··· 2054 2023 bch2_journal_do_discards(j); 2055 2024 } 2056 2025 2057 - if (ret) { 2026 + if (ret && !bch2_journal_error(j)) { 2058 2027 struct printbuf buf = PRINTBUF; 2059 2028 buf.atomic++; 2060 2029 2061 - prt_printf(&buf, bch2_fmt(c, "Unable to allocate journal write at seq %llu: %s"), 2030 + prt_printf(&buf, bch2_fmt(c, "Unable to allocate journal write at seq %llu for %zu sectors: %s"), 2062 2031 le64_to_cpu(w->data->seq), 2032 + vstruct_sectors(w->data, c->block_bits), 2063 2033 bch2_err_str(ret)); 2064 2034 __bch2_journal_debug_to_text(&buf, j); 2065 2035 spin_unlock(&j->lock); 2066 2036 bch2_print_string_as_lines(KERN_ERR, buf.buf); 2067 2037 printbuf_exit(&buf); 2068 - goto err; 2069 2038 } 2039 + if (ret) 2040 + goto err; 2070 2041 2071 2042 /* 2072 2043 * write is allocated, no longer need to account for it in
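Most of the journal_io.c churn above replaces the `enum bch_validate_flags flags` argument with a `struct bkey_validate_context from` that is passed by value, so each validator can adjust its own copy (for example `from.root = true; from.level = entry->level + 1;`) without affecting the caller. A small sketch of that pattern follows. The field names are taken from the diff, but the types, widths and the toy_* helpers are assumptions, since the real struct definition is not part of this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified context: field names from the diff, types assumed. */
struct toy_validate_context {
	unsigned	flags;
	unsigned	level;
	unsigned	btree;
	bool		root;
	uint64_t	journal_seq;
};

/* Hypothetical stand-in for the two jset_entry fields the validators read. */
struct toy_entry {
	unsigned	level;
	unsigned	btree_id;
};

/* Mirrors journal_entry_btree_keys_validate(): keys live at the entry's level. */
static struct toy_validate_context
toy_keys_context(struct toy_validate_context from, const struct toy_entry *e)
{
	assert(e != NULL);
	from.level = e->level;
	from.btree = e->btree_id;
	return from;
}

/* Mirrors journal_entry_btree_root_validate(): root pointer sits one level up. */
static struct toy_validate_context
toy_root_context(struct toy_validate_context from, const struct toy_entry *e)
{
	assert(e != NULL);
	from.root  = true;
	from.level = e->level + 1;
	from.btree = e->btree_id;
	return from;
}
```

Because the struct is copied on each call, the per-jset fields set once by the caller (such as journal_seq) carry through, while level, btree and root are filled in per entry.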
+1 -1
fs/bcachefs/journal_io.h
···

 int bch2_journal_entry_validate(struct bch_fs *, struct jset *,
 				struct jset_entry *, unsigned, int,
-				enum bch_validate_flags);
+				struct bkey_validate_context);
 void bch2_journal_entry_to_text(struct printbuf *, struct bch_fs *,
 				struct jset_entry *);
+15 -4
fs/bcachefs/journal_reclaim.c
···
 					    struct journal_device *ja,
 					    enum journal_space_from from)
 {
+	if (!ja->nr)
+		return 0;
+
 	unsigned available = (journal_space_from(ja, from) -
 			      ja->cur_idx - 1 + ja->nr) % ja->nr;

···
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
 	unsigned pos, nr_devs = 0;
 	struct journal_space space, dev_space[BCH_SB_MEMBERS_MAX];
+	unsigned min_bucket_size = U32_MAX;

 	BUG_ON(nr_devs_want > ARRAY_SIZE(dev_space));

 	rcu_read_lock();
 	for_each_member_device_rcu(c, ca, &c->rw_devs[BCH_DATA_journal]) {
-		if (!ca->journal.nr)
+		if (!ca->journal.nr ||
+		    !ca->mi.durability)
 			continue;
+
+		min_bucket_size = min(min_bucket_size, ca->mi.bucket_size);

 		space = journal_dev_space_available(j, ca, from);
 		if (!space.next_entry)
···
 	 * We sorted largest to smallest, and we want the smallest out of the
 	 * @nr_devs_want largest devices:
 	 */
-	return dev_space[nr_devs_want - 1];
+	space = dev_space[nr_devs_want - 1];
+	space.next_entry = min(space.next_entry, min_bucket_size);
+	return space;
 }

 void bch2_journal_space_available(struct journal *j)
···
 		journal_empty = fifo_empty(&j->pin);
 		spin_unlock(&j->lock);

+		long timeout = j->next_reclaim - jiffies;
+
 		if (journal_empty)
 			schedule();
-		else if (time_after(j->next_reclaim, jiffies))
-			schedule_timeout(j->next_reclaim - jiffies);
+		else if (timeout > 0)
+			schedule_timeout(timeout);
 		else
 			break;
 	}
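Two of the journal_reclaim.c hunks above are small arithmetic fixes: an early return when a device has no journal buckets (the expression otherwise ends in `% ja->nr`), and a single signed read of the reclaim deadline instead of a time_after() check followed by a second subtraction. The sketch below reproduces only the arithmetic visible in the diff. The toy_* helpers are hypothetical, and the suggestion that the timeout change is meant to avoid sampling jiffies twice is one plausible reading rather than something the commit states.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the expression shown in
 * bch2_journal_dev_buckets_available(): free slots between cur_idx and
 * from_idx in a ring of nr buckets.
 */
static unsigned toy_buckets_available(unsigned from_idx, unsigned cur_idx,
				      unsigned nr)
{
	if (!nr)
		return 0;	/* no journal buckets: never evaluate "% 0" */

	assert(from_idx < nr && cur_idx < nr);
	return (from_idx - cur_idx - 1 + nr) % nr;
}

/*
 * Signed distance to the deadline, computed from one sample of "now".
 * The unsigned subtraction wraps modulo 2^N; converting the result to long
 * gives a negative value once the deadline has passed (this conversion is
 * implementation-defined in ISO C, two's complement on the usual targets).
 */
static long toy_reclaim_timeout(unsigned long next_reclaim, unsigned long now)
{
	return (long)(next_reclaim - now);
}

/* Sleep only while the deadline is still ahead of us. */
static int toy_should_sleep(unsigned long next_reclaim, unsigned long now)
{
	return toy_reclaim_timeout(next_reclaim, now) > 0;
}
```

With a single subtraction, the value tested against zero is the same value handed to the sleep call, so the two can never disagree.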
+5
fs/bcachefs/journal_types.h
···
 #include "super_types.h"
 #include "fifo.h"

+/* btree write buffer steals 8 bits for its own purposes: */
+#define JOURNAL_SEQ_MAX ((1ULL << 56) - 1)
+
 #define JOURNAL_BUF_BITS 2
 #define JOURNAL_BUF_NR (1U << JOURNAL_BUF_BITS)
 #define JOURNAL_BUF_MASK (JOURNAL_BUF_NR - 1)
···
  */
 #define JOURNAL_ENTRY_OFFSET_MAX ((1U << 20) - 1)

+#define JOURNAL_ENTRY_BLOCKED_VAL (JOURNAL_ENTRY_OFFSET_MAX - 2)
 #define JOURNAL_ENTRY_CLOSED_VAL (JOURNAL_ENTRY_OFFSET_MAX - 1)
 #define JOURNAL_ENTRY_ERROR_VAL (JOURNAL_ENTRY_OFFSET_MAX)

···
 	 * insufficient devices:
 	 */
 	enum journal_errors cur_entry_error;
+	unsigned cur_entry_offset_if_blocked;

 	unsigned buf_size_want;
 	/*
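The new JOURNAL_SEQ_MAX caps journal sequence numbers at 56 bits because, per the comment in the hunk, the btree write buffer uses the remaining 8 bits of a 64-bit word. The sketch below illustrates why the cap matters. The packing shown (sequence in the high bits, an 8-bit tag in the low bits) is purely an assumption for illustration, since the write buffer's actual layout is not part of this diff, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Copied from the journal_types.h hunk of this commit. */
#define JOURNAL_SEQ_MAX ((1ULL << 56) - 1)

/* Mirrors the new startup checks: refuse to run at or beyond the cap. */
static bool toy_seq_can_start(uint64_t cur_seq)
{
	return cur_seq < JOURNAL_SEQ_MAX;
}

/* Hypothetical packing: 56-bit sequence plus an 8-bit tag in one u64. */
static uint64_t toy_pack(uint64_t seq, uint8_t tag)
{
	assert(seq <= JOURNAL_SEQ_MAX);	/* a larger seq would lose its top bits */
	return (seq << 8) | tag;
}

static uint64_t toy_unpack_seq(uint64_t packed)
{
	return packed >> 8;
}

static uint8_t toy_unpack_tag(uint64_t packed)
{
	return (uint8_t)(packed & 0xff);
}
```

Any sequence number at or below the cap round-trips through the packed form unchanged, which is the property the overflow checks in journal.c protect.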
+6 -5
fs/bcachefs/logged_ops.c
···
 int bch2_resume_logged_ops(struct bch_fs *c)
 {
 	int ret = bch2_trans_run(c,
-		for_each_btree_key(trans, iter,
-				   BTREE_ID_logged_ops, POS_MIN,
+		for_each_btree_key_max(trans, iter,
+				   BTREE_ID_logged_ops,
+				   POS(LOGGED_OPS_INUM_logged_ops, 0),
+				   POS(LOGGED_OPS_INUM_logged_ops, U64_MAX),
 				   BTREE_ITER_prefetch, k,
 			resume_logged_op(trans, &iter, k)));
 	bch_err_fn(c, ret);
···
 static int __bch2_logged_op_start(struct btree_trans *trans, struct bkey_i *k)
 {
 	struct btree_iter iter;
-	int ret;
-
-	ret = bch2_bkey_get_empty_slot(trans, &iter, BTREE_ID_logged_ops, POS_MAX);
+	int ret = bch2_bkey_get_empty_slot(trans, &iter,
+				 BTREE_ID_logged_ops, POS(LOGGED_OPS_INUM_logged_ops, U64_MAX));
 	if (ret)
 		return ret;
+5
fs/bcachefs/logged_ops_format.h
···
 #ifndef _BCACHEFS_LOGGED_OPS_FORMAT_H
 #define _BCACHEFS_LOGGED_OPS_FORMAT_H
 
+enum logged_ops_inums {
+	LOGGED_OPS_INUM_logged_ops,
+	LOGGED_OPS_INUM_inode_cursors,
+};
+
 struct bch_logged_op_truncate {
 	struct bch_val		v;
 	__le32			subvol;
+2 -2
fs/bcachefs/lru.c
···

 /* KEY_TYPE_lru is obsolete: */
 int bch2_lru_validate(struct bch_fs *c, struct bkey_s_c k,
-		      enum bch_validate_flags flags)
+		      struct bkey_validate_context from)
 {
 	int ret = 0;
 
···
 	int ret = bch2_trans_run(c,
 		for_each_btree_key_commit(trans, iter,
 				BTREE_ID_lru, POS_MIN, BTREE_ITER_prefetch, k,
-				NULL, NULL, BCH_TRANS_COMMIT_no_enospc|BCH_TRANS_COMMIT_lazy_rw,
+				NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
 			bch2_check_lru_key(trans, &iter, k, &last_flushed)));
 
 	bch2_bkey_buf_exit(&last_flushed, c);
+1 -1
fs/bcachefs/lru.h
···
 	return BCH_LRU_read;
 }
 
-int bch2_lru_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_lru_validate(struct bch_fs *, struct bkey_s_c, struct bkey_validate_context);
 void bch2_lru_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 
 void bch2_lru_pos_to_text(struct printbuf *, struct bpos);
+121 -63
fs/bcachefs/move.c
··· 21 21 #include "journal_reclaim.h" 22 22 #include "keylist.h" 23 23 #include "move.h" 24 + #include "rebalance.h" 25 + #include "reflink.h" 24 26 #include "replicas.h" 25 27 #include "snapshot.h" 26 28 #include "super-io.h" ··· 198 196 list_del(&ctxt->list); 199 197 mutex_unlock(&c->moving_context_lock); 200 198 199 + /* 200 + * Generally, releasing a transaction within a transaction restart means 201 + * an unhandled transaction restart: but this can happen legitimately 202 + * within the move code, e.g. when bch2_move_ratelimit() tells us to 203 + * exit before we've retried 204 + */ 205 + bch2_trans_begin(ctxt->trans); 201 206 bch2_trans_put(ctxt->trans); 202 207 memset(ctxt, 0, sizeof(*ctxt)); 203 208 } ··· 388 379 return ret; 389 380 } 390 381 391 - struct bch_io_opts *bch2_move_get_io_opts(struct btree_trans *trans, 382 + static struct bch_io_opts *bch2_move_get_io_opts(struct btree_trans *trans, 392 383 struct per_snapshot_io_opts *io_opts, 384 + struct bpos extent_pos, /* extent_iter, extent_k may be in reflink btree */ 385 + struct btree_iter *extent_iter, 393 386 struct bkey_s_c extent_k) 394 387 { 395 388 struct bch_fs *c = trans->c; 396 389 u32 restart_count = trans->restart_count; 390 + struct bch_io_opts *opts_ret = &io_opts->fs_io_opts; 397 391 int ret = 0; 398 392 399 - if (io_opts->cur_inum != extent_k.k->p.inode) { 393 + if (extent_k.k->type == KEY_TYPE_reflink_v) 394 + goto out; 395 + 396 + if (io_opts->cur_inum != extent_pos.inode) { 400 397 io_opts->d.nr = 0; 401 398 402 - ret = for_each_btree_key(trans, iter, BTREE_ID_inodes, POS(0, extent_k.k->p.inode), 399 + ret = for_each_btree_key(trans, iter, BTREE_ID_inodes, POS(0, extent_pos.inode), 403 400 BTREE_ITER_all_snapshots, k, ({ 404 - if (k.k->p.offset != extent_k.k->p.inode) 401 + if (k.k->p.offset != extent_pos.inode) 405 402 break; 406 403 407 404 if (!bkey_is_inode(k.k)) 408 405 continue; 409 406 410 407 struct bch_inode_unpacked inode; 411 - BUG_ON(bch2_inode_unpack(k, &inode)); 408 + 
_ret3 = bch2_inode_unpack(k, &inode); 409 + if (_ret3) 410 + break; 412 411 413 412 struct snapshot_io_opts_entry e = { .snapshot = k.k->p.snapshot }; 414 413 bch2_inode_opts_get(&e.io_opts, trans->c, &inode); 415 414 416 415 darray_push(&io_opts->d, e); 417 416 })); 418 - io_opts->cur_inum = extent_k.k->p.inode; 417 + io_opts->cur_inum = extent_pos.inode; 419 418 } 420 419 421 420 ret = ret ?: trans_was_restarted(trans, restart_count); ··· 432 415 433 416 if (extent_k.k->p.snapshot) 434 417 darray_for_each(io_opts->d, i) 435 - if (bch2_snapshot_is_ancestor(c, extent_k.k->p.snapshot, i->snapshot)) 436 - return &i->io_opts; 437 - 438 - return &io_opts->fs_io_opts; 418 + if (bch2_snapshot_is_ancestor(c, extent_k.k->p.snapshot, i->snapshot)) { 419 + opts_ret = &i->io_opts; 420 + break; 421 + } 422 + out: 423 + ret = bch2_get_update_rebalance_opts(trans, opts_ret, extent_iter, extent_k); 424 + if (ret) 425 + return ERR_PTR(ret); 426 + return opts_ret; 439 427 } 440 428 441 429 int bch2_move_get_io_opts_one(struct btree_trans *trans, 442 430 struct bch_io_opts *io_opts, 431 + struct btree_iter *extent_iter, 443 432 struct bkey_s_c extent_k) 444 433 { 445 - struct btree_iter iter; 446 - struct bkey_s_c k; 447 - int ret; 434 + struct bch_fs *c = trans->c; 435 + 436 + *io_opts = bch2_opts_to_inode_opts(c->opts); 448 437 449 438 /* reflink btree? 
*/ 450 - if (!extent_k.k->p.inode) { 451 - *io_opts = bch2_opts_to_inode_opts(trans->c->opts); 452 - return 0; 453 - } 439 + if (!extent_k.k->p.inode) 440 + goto out; 454 441 455 - k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_inodes, 442 + struct btree_iter inode_iter; 443 + struct bkey_s_c inode_k = bch2_bkey_get_iter(trans, &inode_iter, BTREE_ID_inodes, 456 444 SPOS(0, extent_k.k->p.inode, extent_k.k->p.snapshot), 457 445 BTREE_ITER_cached); 458 - ret = bkey_err(k); 446 + int ret = bkey_err(inode_k); 459 447 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 460 448 return ret; 461 449 462 - if (!ret && bkey_is_inode(k.k)) { 450 + if (!ret && bkey_is_inode(inode_k.k)) { 463 451 struct bch_inode_unpacked inode; 464 - bch2_inode_unpack(k, &inode); 465 - bch2_inode_opts_get(io_opts, trans->c, &inode); 466 - } else { 467 - *io_opts = bch2_opts_to_inode_opts(trans->c->opts); 452 + bch2_inode_unpack(inode_k, &inode); 453 + bch2_inode_opts_get(io_opts, c, &inode); 468 454 } 469 - 470 - bch2_trans_iter_exit(trans, &iter); 471 - return 0; 455 + bch2_trans_iter_exit(trans, &inode_iter); 456 + out: 457 + return bch2_get_update_rebalance_opts(trans, io_opts, extent_iter, extent_k); 472 458 } 473 459 474 460 int bch2_move_ratelimit(struct moving_context *ctxt) ··· 529 509 struct per_snapshot_io_opts snapshot_io_opts; 530 510 struct bch_io_opts *io_opts; 531 511 struct bkey_buf sk; 532 - struct btree_iter iter; 512 + struct btree_iter iter, reflink_iter = {}; 533 513 struct bkey_s_c k; 534 514 struct data_update_opts data_opts; 515 + /* 516 + * If we're moving a single file, also process reflinked data it points 517 + * to (this includes propagating changed io_opts from the inode to the 518 + * extent): 519 + */ 520 + bool walk_indirect = start.inode == end.inode; 535 521 int ret = 0, ret2; 536 522 537 523 per_snapshot_io_opts_init(&snapshot_io_opts, c); ··· 557 531 bch2_ratelimit_reset(ctxt->rate); 558 532 559 533 while (!bch2_move_ratelimit(ctxt)) { 534 + struct 
btree_iter *extent_iter = &iter; 535 + 560 536 bch2_trans_begin(trans); 561 537 562 538 k = bch2_btree_iter_peek(&iter); ··· 577 549 if (ctxt->stats) 578 550 ctxt->stats->pos = BBPOS(iter.btree_id, iter.pos); 579 551 552 + if (walk_indirect && 553 + k.k->type == KEY_TYPE_reflink_p && 554 + REFLINK_P_MAY_UPDATE_OPTIONS(bkey_s_c_to_reflink_p(k).v)) { 555 + struct bkey_s_c_reflink_p p = bkey_s_c_to_reflink_p(k); 556 + s64 offset_into_extent = iter.pos.offset - bkey_start_offset(k.k); 557 + 558 + bch2_trans_iter_exit(trans, &reflink_iter); 559 + k = bch2_lookup_indirect_extent(trans, &reflink_iter, &offset_into_extent, p, true, 0); 560 + ret = bkey_err(k); 561 + if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 562 + continue; 563 + if (ret) 564 + break; 565 + 566 + if (bkey_deleted(k.k)) 567 + goto next_nondata; 568 + 569 + /* 570 + * XXX: reflink pointers may point to multiple indirect 571 + * extents, so don't advance past the entire reflink 572 + * pointer - need to fixup iter->k 573 + */ 574 + extent_iter = &reflink_iter; 575 + } 576 + 580 577 if (!bkey_extent_is_direct_data(k.k)) 581 578 goto next_nondata; 582 579 583 - io_opts = bch2_move_get_io_opts(trans, &snapshot_io_opts, k); 580 + io_opts = bch2_move_get_io_opts(trans, &snapshot_io_opts, 581 + iter.pos, extent_iter, k); 584 582 ret = PTR_ERR_OR_ZERO(io_opts); 585 583 if (ret) 586 584 continue; ··· 622 568 bch2_bkey_buf_reassemble(&sk, c, k); 623 569 k = bkey_i_to_s_c(sk.k); 624 570 625 - ret2 = bch2_move_extent(ctxt, NULL, &iter, k, *io_opts, data_opts); 571 + ret2 = bch2_move_extent(ctxt, NULL, extent_iter, k, *io_opts, data_opts); 626 572 if (ret2) { 627 573 if (bch2_err_matches(ret2, BCH_ERR_transaction_restart)) 628 574 continue; ··· 643 589 bch2_btree_iter_advance(&iter); 644 590 } 645 591 592 + bch2_trans_iter_exit(trans, &reflink_iter); 646 593 bch2_trans_iter_exit(trans, &iter); 647 594 bch2_bkey_buf_exit(&sk, c); 648 595 per_snapshot_io_opts_exit(&snapshot_io_opts); ··· 709 654 struct bch_fs 
*c = trans->c; 710 655 bool is_kthread = current->flags & PF_KTHREAD; 711 656 struct bch_io_opts io_opts = bch2_opts_to_inode_opts(c->opts); 712 - struct btree_iter iter; 657 + struct btree_iter iter = {}, bp_iter = {}; 713 658 struct bkey_buf sk; 714 - struct bch_backpointer bp; 715 - struct bch_alloc_v4 a_convert; 716 - const struct bch_alloc_v4 *a; 717 659 struct bkey_s_c k; 718 660 struct data_update_opts data_opts; 719 - unsigned dirty_sectors, bucket_size; 720 - u64 fragmentation; 721 - struct bpos bp_pos = POS_MIN; 661 + unsigned sectors_moved = 0; 662 + struct bkey_buf last_flushed; 722 663 int ret = 0; 723 664 724 665 struct bch_dev *ca = bch2_dev_tryget(c, bucket.inode); ··· 723 672 724 673 trace_bucket_evacuate(c, &bucket); 725 674 675 + bch2_bkey_buf_init(&last_flushed); 676 + bkey_init(&last_flushed.k->k); 726 677 bch2_bkey_buf_init(&sk); 727 678 728 679 /* ··· 732 679 */ 733 680 bch2_trans_begin(trans); 734 681 735 - bch2_trans_iter_init(trans, &iter, BTREE_ID_alloc, 736 - bucket, BTREE_ITER_cached); 737 - ret = lockrestart_do(trans, 738 - bkey_err(k = bch2_btree_iter_peek_slot(&iter))); 739 - bch2_trans_iter_exit(trans, &iter); 682 + bch2_trans_iter_init(trans, &bp_iter, BTREE_ID_backpointers, 683 + bucket_pos_to_bp_start(ca, bucket), 0); 740 684 741 685 bch_err_msg(c, ret, "looking up alloc key"); 742 686 if (ret) 743 687 goto err; 744 - 745 - a = bch2_alloc_to_v4(k, &a_convert); 746 - dirty_sectors = bch2_bucket_sectors_dirty(*a); 747 - bucket_size = ca->mi.bucket_size; 748 - fragmentation = alloc_lru_idx_fragmentation(*a, ca); 749 688 750 689 ret = bch2_btree_write_buffer_tryflush(trans); 751 690 bch_err_msg(c, ret, "flushing btree write buffer"); ··· 750 705 751 706 bch2_trans_begin(trans); 752 707 753 - ret = bch2_get_next_backpointer(trans, ca, bucket, gen, 754 - &bp_pos, &bp, 755 - BTREE_ITER_cached); 708 + k = bch2_btree_iter_peek(&bp_iter); 709 + ret = bkey_err(k); 756 710 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 757 711 
continue; 758 712 if (ret) 759 713 goto err; 760 - if (bkey_eq(bp_pos, POS_MAX)) 714 + 715 + if (!k.k || bkey_gt(k.k->p, bucket_pos_to_bp_end(ca, bucket))) 761 716 break; 762 717 763 - if (!bp.level) { 764 - k = bch2_backpointer_get_key(trans, &iter, bp_pos, bp, 0); 718 + if (k.k->type != KEY_TYPE_backpointer) 719 + goto next; 720 + 721 + struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k); 722 + 723 + if (!bp.v->level) { 724 + k = bch2_backpointer_get_key(trans, bp, &iter, 0, &last_flushed); 765 725 ret = bkey_err(k); 766 726 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 767 727 continue; ··· 778 728 bch2_bkey_buf_reassemble(&sk, c, k); 779 729 k = bkey_i_to_s_c(sk.k); 780 730 781 - ret = bch2_move_get_io_opts_one(trans, &io_opts, k); 731 + ret = bch2_move_get_io_opts_one(trans, &io_opts, &iter, k); 782 732 if (ret) { 783 733 bch2_trans_iter_exit(trans, &iter); 784 734 continue; ··· 788 738 data_opts.target = io_opts.background_target; 789 739 data_opts.rewrite_ptrs = 0; 790 740 741 + unsigned sectors = bp.v->bucket_len; /* move_extent will drop locks */ 791 742 unsigned i = 0; 792 - bkey_for_each_ptr(bch2_bkey_ptrs_c(k), ptr) { 793 - if (ptr->dev == bucket.inode) { 794 - data_opts.rewrite_ptrs |= 1U << i; 795 - if (ptr->cached) { 743 + const union bch_extent_entry *entry; 744 + struct extent_ptr_decoded p; 745 + bkey_for_each_ptr_decode(k.k, bch2_bkey_ptrs_c(k), p, entry) { 746 + if (p.ptr.dev == bucket.inode) { 747 + if (p.ptr.cached) { 796 748 bch2_trans_iter_exit(trans, &iter); 797 749 goto next; 798 750 } 751 + data_opts.rewrite_ptrs |= 1U << i; 752 + break; 799 753 } 800 754 i++; 801 755 } ··· 819 765 goto err; 820 766 821 767 if (ctxt->stats) 822 - atomic64_add(k.k->size, &ctxt->stats->sectors_seen); 768 + atomic64_add(sectors, &ctxt->stats->sectors_seen); 769 + sectors_moved += sectors; 823 770 } else { 824 771 struct btree *b; 825 772 826 - b = bch2_backpointer_get_node(trans, &iter, bp_pos, bp); 773 + b = 
bch2_backpointer_get_node(trans, bp, &iter, &last_flushed); 827 774 ret = PTR_ERR_OR_ZERO(b); 828 775 if (ret == -BCH_ERR_backpointer_to_overwritten_btree_node) 829 - continue; 776 + goto next; 830 777 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 831 778 continue; 832 779 if (ret) ··· 851 796 atomic64_add(sectors, &ctxt->stats->sectors_seen); 852 797 atomic64_add(sectors, &ctxt->stats->sectors_moved); 853 798 } 799 + sectors_moved += btree_sectors(c); 854 800 } 855 801 next: 856 - bp_pos = bpos_nosnap_successor(bp_pos); 802 + bch2_btree_iter_advance(&bp_iter); 857 803 } 858 804 859 - trace_evacuate_bucket(c, &bucket, dirty_sectors, bucket_size, fragmentation, ret); 805 + trace_evacuate_bucket(c, &bucket, sectors_moved, ca->mi.bucket_size, ret); 860 806 err: 807 + bch2_trans_iter_exit(trans, &bp_iter); 861 808 bch2_dev_put(ca); 862 809 bch2_bkey_buf_exit(&sk, c); 810 + bch2_bkey_buf_exit(&last_flushed, c); 863 811 return ret; 864 812 } 865 813
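Note on the pointer loop in the bucket-evacuation hunk above: the new loop stops at the first extent pointer on the device being evacuated, skips the extent entirely if that pointer is cached, and otherwise sets one bit in `rewrite_ptrs` at that pointer's index. A minimal sketch of that selection logic, using a simplified pointer struct; `struct ptr_sketch` and `rewrite_mask_for_dev()` are hypothetical names, not bcachefs types.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for an extent pointer: device index plus cached flag. */
struct ptr_sketch {
	unsigned	dev;
	bool		cached;
};

/*
 * Return a bitmask with bit i set for the first pointer that lives on
 * `dev`. Returns 0 when no pointer is on `dev`, or when the first such
 * pointer is cached (cached copies are not worth moving).
 */
static unsigned rewrite_mask_for_dev(const struct ptr_sketch *ptrs,
				     unsigned nr, unsigned dev)
{
	for (unsigned i = 0; i < nr; i++)
		if (ptrs[i].dev == dev)
			return ptrs[i].cached ? 0 : 1U << i;

	return 0;
}
```

A return value of 0 corresponds to the `goto next` path in the hunk: nothing on this extent needs rewriting for this bucket.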
+2 -3
fs/bcachefs/move.h
···
 	darray_exit(&io_opts->d);
 }
 
-struct bch_io_opts *bch2_move_get_io_opts(struct btree_trans *,
-				struct per_snapshot_io_opts *, struct bkey_s_c);
-int bch2_move_get_io_opts_one(struct btree_trans *, struct bch_io_opts *, struct bkey_s_c);
+int bch2_move_get_io_opts_one(struct btree_trans *, struct bch_io_opts *,
+			      struct btree_iter *, struct bkey_s_c);
 
 int bch2_scan_old_btree_nodes(struct bch_fs *, struct bch_move_stats *);
+3 -3
fs/bcachefs/movinggc.c
···

 	bch2_trans_begin(trans);
 
-	ret = for_each_btree_key_upto(trans, iter, BTREE_ID_lru,
+	ret = for_each_btree_key_max(trans, iter, BTREE_ID_lru,
 				  lru_pos(BCH_LRU_FRAGMENTATION_START, 0, 0),
 				  lru_pos(BCH_LRU_FRAGMENTATION_START, U64_MAX, LRU_TIME_MAX),
 				  0, k, ({
···
 		bch2_trans_unlock_long(ctxt.trans);
 		cond_resched();
 
-		if (!c->copy_gc_enabled) {
+		if (!c->opts.copygc_enabled) {
 			move_buckets_wait(&ctxt, buckets, true);
-			kthread_wait_freezable(c->copy_gc_enabled ||
+			kthread_wait_freezable(c->opts.copygc_enabled ||
 					       kthread_should_stop());
 		}
+16 -10
fs/bcachefs/opts.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 3 3 #include <linux/kernel.h> 4 + #include <linux/fs_parser.h> 4 5 5 6 #include "bcachefs.h" 6 7 #include "compress.h" ··· 49 48 NULL 50 49 }; 51 50 52 - const char * const bch2_csum_opts[] = { 51 + const char * const __bch2_csum_opts[] = { 53 52 BCH_CSUM_OPTS() 54 53 NULL 55 54 }; 56 55 57 - static const char * const __bch2_compression_types[] = { 56 + const char * const __bch2_compression_types[] = { 58 57 BCH_COMPRESSION_TYPES() 59 58 NULL 60 59 }; ··· 114 113 PRT_STR_OPT_BOUNDSCHECKED(jset_entry_type, enum bch_jset_entry_type); 115 114 PRT_STR_OPT_BOUNDSCHECKED(fs_usage_type, enum bch_fs_usage_type); 116 115 PRT_STR_OPT_BOUNDSCHECKED(data_type, enum bch_data_type); 116 + PRT_STR_OPT_BOUNDSCHECKED(csum_opt, enum bch_csum_opt); 117 117 PRT_STR_OPT_BOUNDSCHECKED(csum_type, enum bch_csum_type); 118 118 PRT_STR_OPT_BOUNDSCHECKED(compression_type, enum bch_compression_type); 119 119 PRT_STR_OPT_BOUNDSCHECKED(str_hash_type, enum bch_str_hash_type); ··· 335 333 switch (opt->type) { 336 334 case BCH_OPT_BOOL: 337 335 if (val) { 338 - ret = kstrtou64(val, 10, res); 336 + ret = lookup_constant(bool_names, val, -BCH_ERR_option_not_bool); 337 + if (ret != -BCH_ERR_option_not_bool) { 338 + *res = ret; 339 + } else { 340 + if (err) 341 + prt_printf(err, "%s: must be bool", opt->attr.name); 342 + return ret; 343 + } 339 344 } else { 340 - ret = 0; 341 345 *res = 1; 342 346 } 343 347 344 - if (ret < 0 || (*res != 0 && *res != 1)) { 345 - if (err) 346 - prt_printf(err, "%s: must be bool", opt->attr.name); 347 - return ret < 0 ? 
ret : -BCH_ERR_option_not_bool; 348 - } 349 348 break; 350 349 case BCH_OPT_UINT: 351 350 if (!val) { ··· 713 710 714 711 struct bch_io_opts bch2_opts_to_inode_opts(struct bch_opts src) 715 712 { 716 - return (struct bch_io_opts) { 713 + struct bch_io_opts opts = { 717 714 #define x(_name, _bits) ._name = src._name, 718 715 BCH_INODE_OPTS() 719 716 #undef x 720 717 }; 718 + 719 + bch2_io_opts_fixups(&opts); 720 + return opts; 721 721 } 722 722 723 723 bool bch2_opt_is_inode_opt(enum bch_opt_id id)
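Note on the `BCH_OPT_BOOL` hunk above: boolean options are no longer parsed as a base-10 integer and range-checked. The value is instead looked up by name in a table, with a sentinel return for "not found". The sketch below mirrors that control flow in userspace. The table contents and the names `bool_table`, `lookup_name`, `parse_bool_opt` and `ERR_NOT_BOOL` are assumptions for illustration; the kernel's own table may accept a different set of spellings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One name/value pair; a NULL name terminates the table. */
struct name_value {
	const char	*name;
	int		value;
};

/* Illustrative set of accepted spellings (an assumption, not the kernel's list). */
static const struct name_value bool_table[] = {
	{ "0", 0 }, { "1", 1 },
	{ "false", 0 }, { "true", 1 },
	{ "no", 0 }, { "yes", 1 },
	{ NULL, 0 },
};

#define ERR_NOT_BOOL	(-1)

/* Return the value bound to `name`, or `not_found` if it is not in the table. */
static int lookup_name(const struct name_value *tbl, const char *name, int not_found)
{
	for (; tbl->name; tbl++)
		if (!strcmp(tbl->name, name))
			return tbl->value;

	return not_found;
}

/*
 * Parse a boolean option. A missing value means the bare option name was
 * given, which enables it. Returns 0 on success, ERR_NOT_BOOL otherwise.
 */
static int parse_bool_opt(const char *val, unsigned long long *res)
{
	if (!val) {
		*res = 1;
		return 0;
	}

	int ret = lookup_name(bool_table, val, ERR_NOT_BOOL);
	if (ret == ERR_NOT_BOOL)
		return ret;

	*res = (unsigned long long)ret;
	return 0;
}
```

The sentinel has to be a value no table entry can produce, which is why a negative error code works for a table of 0/1 values.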
+50 -11
fs/bcachefs/opts.h
··· 16 16 extern const char * const bch2_sb_features[]; 17 17 extern const char * const bch2_sb_compat[]; 18 18 extern const char * const __bch2_btree_ids[]; 19 - extern const char * const bch2_csum_opts[]; 19 + extern const char * const __bch2_csum_opts[]; 20 + extern const char * const __bch2_compression_types[]; 20 21 extern const char * const bch2_compression_opts[]; 21 22 extern const char * const __bch2_str_hash_types[]; 22 23 extern const char * const bch2_str_hash_opts[]; ··· 28 27 void bch2_prt_jset_entry_type(struct printbuf *, enum bch_jset_entry_type); 29 28 void bch2_prt_fs_usage_type(struct printbuf *, enum bch_fs_usage_type); 30 29 void bch2_prt_data_type(struct printbuf *, enum bch_data_type); 30 + void bch2_prt_csum_opt(struct printbuf *, enum bch_csum_opt); 31 31 void bch2_prt_csum_type(struct printbuf *, enum bch_csum_type); 32 32 void bch2_prt_compression_type(struct printbuf *, enum bch_compression_type); 33 33 void bch2_prt_str_hash_type(struct printbuf *, enum bch_str_hash_type); ··· 173 171 "size", "Maximum size of checksummed/compressed extents")\ 174 172 x(metadata_checksum, u8, \ 175 173 OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 176 - OPT_STR(bch2_csum_opts), \ 174 + OPT_STR(__bch2_csum_opts), \ 177 175 BCH_SB_META_CSUM_TYPE, BCH_CSUM_OPT_crc32c, \ 178 176 NULL, NULL) \ 179 177 x(data_checksum, u8, \ 180 178 OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 181 - OPT_STR(bch2_csum_opts), \ 179 + OPT_STR(__bch2_csum_opts), \ 182 180 BCH_SB_DATA_CSUM_TYPE, BCH_CSUM_OPT_crc32c, \ 183 181 NULL, NULL) \ 184 182 x(compression, u8, \ ··· 222 220 BCH_SB_ERASURE_CODE, false, \ 223 221 NULL, "Enable erasure coding (DO NOT USE YET)") \ 224 222 x(inodes_32bit, u8, \ 225 - OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 223 + OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 226 224 OPT_BOOL(), \ 227 225 BCH_SB_INODE_32BIT, true, \ 228 226 NULL, "Constrain inode numbers to 32 bits") \ 229 - x(shard_inode_numbers, u8, \ 230 - 
OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 231 - OPT_BOOL(), \ 232 - BCH_SB_SHARD_INUMS, true, \ 227 + x(shard_inode_numbers_bits, u8, \ 228 + OPT_FS|OPT_FORMAT, \ 229 + OPT_UINT(0, 8), \ 230 + BCH_SB_SHARD_INUMS_NBITS, 0, \ 233 231 NULL, "Shard new inode numbers by CPU id") \ 234 232 x(inodes_use_key_cache, u8, \ 235 233 OPT_FS|OPT_FORMAT|OPT_MOUNT, \ ··· 475 473 BCH2_NO_SB_OPT, true, \ 476 474 NULL, "Enable nocow mode: enables runtime locking in\n"\ 477 475 "data move path needed if nocow will ever be in use\n")\ 476 + x(copygc_enabled, u8, \ 477 + OPT_FS|OPT_MOUNT, \ 478 + OPT_BOOL(), \ 479 + BCH2_NO_SB_OPT, true, \ 480 + NULL, "Enable copygc: disable for debugging, or to\n"\ 481 + "quiet the system when doing performance testing\n")\ 482 + x(rebalance_enabled, u8, \ 483 + OPT_FS|OPT_MOUNT, \ 484 + OPT_BOOL(), \ 485 + BCH2_NO_SB_OPT, true, \ 486 + NULL, "Enable rebalance: disable for debugging, or to\n"\ 487 + "quiet the system when doing performance testing\n")\ 478 488 x(no_data_io, u8, \ 479 489 OPT_MOUNT, \ 480 490 OPT_BOOL(), \ ··· 502 488 OPT_DEVICE, \ 503 489 OPT_UINT(0, S64_MAX), \ 504 490 BCH2_NO_SB_OPT, 0, \ 505 - "size", "Size of filesystem on device") \ 491 + "size", "Specifies the bucket size; must be greater than the btree node size")\ 506 492 x(durability, u8, \ 507 493 OPT_DEVICE|OPT_SB_FIELD_ONE_BIAS, \ 508 494 OPT_UINT(0, BCH_REPLICAS_MAX), \ ··· 638 624 #define x(_name, _bits) u##_bits _name; 639 625 BCH_INODE_OPTS() 640 626 #undef x 627 + #define x(_name, _bits) u64 _name##_from_inode:1; 628 + BCH_INODE_OPTS() 629 + #undef x 641 630 }; 642 631 643 - static inline unsigned background_compression(struct bch_io_opts opts) 632 + static inline void bch2_io_opts_fixups(struct bch_io_opts *opts) 644 633 { 645 - return opts.background_compression ?: opts.compression; 634 + if (!opts->background_target) 635 + opts->background_target = opts->foreground_target; 636 + if (!opts->background_compression) 637 + opts->background_compression = 
opts->compression; 638 + if (opts->nocow) { 639 + opts->compression = opts->background_compression = 0; 640 + opts->data_checksum = 0; 641 + opts->erasure_code = 0; 642 + } 646 643 } 647 644 648 645 struct bch_io_opts bch2_opts_to_inode_opts(struct bch_opts); 649 646 bool bch2_opt_is_inode_opt(enum bch_opt_id); 647 + 648 + /* rebalance opts: */ 649 + 650 + static inline struct bch_extent_rebalance io_opts_to_rebalance_opts(struct bch_io_opts *opts) 651 + { 652 + return (struct bch_extent_rebalance) { 653 + .type = BIT(BCH_EXTENT_ENTRY_rebalance), 654 + #define x(_name) \ 655 + ._name = opts->_name, \ 656 + ._name##_from_inode = opts->_name##_from_inode, 657 + BCH_REBALANCE_OPTS() 658 + #undef x 659 + }; 660 + }; 650 661 651 662 #endif /* _BCACHEFS_OPTS_H */
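Note on `bch2_io_opts_fixups()` in the hunk above: derived defaults are now resolved once, when the option struct is built, rather than by a helper at each use site. The background target falls back to the foreground target, background compression falls back to foreground compression, and nocow clears compression, data checksumming and erasure coding. A self-contained sketch of the same rules on a reduced struct; `struct io_opts_sketch` and `io_opts_fixups_sketch()` are hypothetical names.

```c
#include <assert.h>

/* Reduced stand-in for the I/O option struct; 0 means "unset". */
struct io_opts_sketch {
	unsigned foreground_target;
	unsigned background_target;
	unsigned compression;
	unsigned background_compression;
	unsigned data_checksum;
	unsigned erasure_code;
	unsigned nocow;
};

/* Apply the fallback and exclusion rules in the same order as the hunk. */
static void io_opts_fixups_sketch(struct io_opts_sketch *opts)
{
	if (!opts->background_target)
		opts->background_target = opts->foreground_target;
	if (!opts->background_compression)
		opts->background_compression = opts->compression;
	if (opts->nocow) {
		/* nocow is incompatible with these features, so switch them off */
		opts->compression = opts->background_compression = 0;
		opts->data_checksum = 0;
		opts->erasure_code = 0;
	}
}
```

After the fixups run, consumers can read `background_target` and `background_compression` directly without repeating the fallback logic.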
+11 -4
fs/bcachefs/printbuf.h
···
 	printbuf_nul_terminate_reserved(out);
 }
 
+static inline void printbuf_reset_keep_tabstops(struct printbuf *buf)
+{
+	buf->pos		= 0;
+	buf->allocation_failure	= 0;
+	buf->last_newline	= 0;
+	buf->last_field		= 0;
+	buf->indent		= 0;
+	buf->cur_tabstop	= 0;
+}
+
 /**
  * printbuf_reset - re-use a printbuf without freeing and re-initializing it:
  */
 static inline void printbuf_reset(struct printbuf *buf)
 {
-	buf->pos		= 0;
-	buf->allocation_failure	= 0;
-	buf->indent		= 0;
+	printbuf_reset_keep_tabstops(buf);
 	buf->nr_tabstops	= 0;
-	buf->cur_tabstop	= 0;
 }
 
 /**
+1 -1
fs/bcachefs/quota.c
···
 };
 
 int bch2_quota_validate(struct bch_fs *c, struct bkey_s_c k,
-			enum bch_validate_flags flags)
+			struct bkey_validate_context from)
 {
 	int ret = 0;
+2 -2
fs/bcachefs/quota.h
···
 #include "inode.h"
 #include "quota_types.h"
 
-enum bch_validate_flags;
 extern const struct bch_sb_field_ops bch_sb_field_ops_quota;
 
-int bch2_quota_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_quota_validate(struct bch_fs *, struct bkey_s_c,
+			struct bkey_validate_context);
 void bch2_quota_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 
 #define bch2_bkey_ops_quota ((struct bkey_ops) {		\
+27 -11
fs/bcachefs/rcu_pending.c
··· 25 25 #define RCU_PENDING_KVFREE_FN ((rcu_pending_process_fn) (ulong) RCU_PENDING_KVFREE) 26 26 #define RCU_PENDING_CALL_RCU_FN ((rcu_pending_process_fn) (ulong) RCU_PENDING_CALL_RCU) 27 27 28 - static inline unsigned long __get_state_synchronize_rcu(struct srcu_struct *ssp) 28 + #ifdef __KERNEL__ 29 + typedef unsigned long rcu_gp_poll_state_t; 30 + 31 + static inline bool rcu_gp_poll_cookie_eq(rcu_gp_poll_state_t l, rcu_gp_poll_state_t r) 32 + { 33 + return l == r; 34 + } 35 + #else 36 + typedef struct urcu_gp_poll_state rcu_gp_poll_state_t; 37 + 38 + static inline bool rcu_gp_poll_cookie_eq(rcu_gp_poll_state_t l, rcu_gp_poll_state_t r) 39 + { 40 + return l.grace_period_id == r.grace_period_id; 41 + } 42 + #endif 43 + 44 + static inline rcu_gp_poll_state_t __get_state_synchronize_rcu(struct srcu_struct *ssp) 29 45 { 30 46 return ssp 31 47 ? get_state_synchronize_srcu(ssp) 32 48 : get_state_synchronize_rcu(); 33 49 } 34 50 35 - static inline unsigned long __start_poll_synchronize_rcu(struct srcu_struct *ssp) 51 + static inline rcu_gp_poll_state_t __start_poll_synchronize_rcu(struct srcu_struct *ssp) 36 52 { 37 53 return ssp 38 54 ? start_poll_synchronize_srcu(ssp) 39 55 : start_poll_synchronize_rcu(); 40 56 } 41 57 42 - static inline bool __poll_state_synchronize_rcu(struct srcu_struct *ssp, unsigned long cookie) 58 + static inline bool __poll_state_synchronize_rcu(struct srcu_struct *ssp, rcu_gp_poll_state_t cookie) 43 59 { 44 60 return ssp 45 61 ? 
poll_state_synchronize_srcu(ssp, cookie) ··· 87 71 GENRADIX(struct rcu_head *) objs; 88 72 size_t nr; 89 73 struct rcu_head **cursor; 90 - unsigned long seq; 74 + rcu_gp_poll_state_t seq; 91 75 }; 92 76 93 77 struct rcu_pending_list { 94 78 struct rcu_head *head; 95 79 struct rcu_head *tail; 96 - unsigned long seq; 80 + rcu_gp_poll_state_t seq; 97 81 }; 98 82 99 83 struct rcu_pending_pcpu { ··· 332 316 } 333 317 334 318 static __always_inline struct rcu_pending_seq * 335 - get_object_radix(struct rcu_pending_pcpu *p, unsigned long seq) 319 + get_object_radix(struct rcu_pending_pcpu *p, rcu_gp_poll_state_t seq) 336 320 { 337 321 darray_for_each_reverse(p->objs, objs) 338 - if (objs->seq == seq) 322 + if (rcu_gp_poll_cookie_eq(objs->seq, seq)) 339 323 return objs; 340 324 341 325 if (darray_push_gfp(&p->objs, ((struct rcu_pending_seq) { .seq = seq }), GFP_ATOMIC)) ··· 345 329 } 346 330 347 331 static noinline bool 348 - rcu_pending_enqueue_list(struct rcu_pending_pcpu *p, unsigned long seq, 332 + rcu_pending_enqueue_list(struct rcu_pending_pcpu *p, rcu_gp_poll_state_t seq, 349 333 struct rcu_head *head, void *ptr, 350 334 unsigned long *flags) 351 335 { ··· 380 364 again: 381 365 for (struct rcu_pending_list *i = p->lists; 382 366 i < p->lists + NUM_ACTIVE_RCU_POLL_OLDSTATE; i++) { 383 - if (i->seq == seq) { 367 + if (rcu_gp_poll_cookie_eq(i->seq, seq)) { 384 368 rcu_pending_list_add(i, head); 385 369 return false; 386 370 } ··· 424 408 struct rcu_pending_pcpu *p; 425 409 struct rcu_pending_seq *objs; 426 410 struct genradix_node *new_node = NULL; 427 - unsigned long seq, flags; 411 + unsigned long flags; 428 412 bool start_gp = false; 429 413 430 414 BUG_ON((ptr != NULL) != (pending->process == RCU_PENDING_KVFREE_FN)); ··· 432 416 local_irq_save(flags); 433 417 p = this_cpu_ptr(pending->p); 434 418 spin_lock(&p->lock); 435 - seq = __get_state_synchronize_rcu(pending->srcu); 419 + rcu_gp_poll_state_t seq = __get_state_synchronize_rcu(pending->srcu); 436 420 restart: 
437 421 if (may_sleep && 438 422 unlikely(process_finished_items(pending, p, flags)))
+225 -41
fs/bcachefs/rebalance.c
··· 24 24 #include <linux/kthread.h> 25 25 #include <linux/sched/cputime.h> 26 26 27 + /* bch_extent_rebalance: */ 28 + 29 + static const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c k) 30 + { 31 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 32 + const union bch_extent_entry *entry; 33 + 34 + bkey_extent_entry_for_each(ptrs, entry) 35 + if (__extent_entry_type(entry) == BCH_EXTENT_ENTRY_rebalance) 36 + return &entry->rebalance; 37 + 38 + return NULL; 39 + } 40 + 41 + static inline unsigned bch2_bkey_ptrs_need_compress(struct bch_fs *c, 42 + struct bch_io_opts *opts, 43 + struct bkey_s_c k, 44 + struct bkey_ptrs_c ptrs) 45 + { 46 + if (!opts->background_compression) 47 + return 0; 48 + 49 + unsigned compression_type = bch2_compression_opt_to_type(opts->background_compression); 50 + const union bch_extent_entry *entry; 51 + struct extent_ptr_decoded p; 52 + unsigned ptr_bit = 1; 53 + unsigned rewrite_ptrs = 0; 54 + 55 + bkey_for_each_ptr_decode(k.k, ptrs, p, entry) { 56 + if (p.crc.compression_type == BCH_COMPRESSION_TYPE_incompressible || 57 + p.ptr.unwritten) 58 + return 0; 59 + 60 + if (!p.ptr.cached && p.crc.compression_type != compression_type) 61 + rewrite_ptrs |= ptr_bit; 62 + ptr_bit <<= 1; 63 + } 64 + 65 + return rewrite_ptrs; 66 + } 67 + 68 + static inline unsigned bch2_bkey_ptrs_need_move(struct bch_fs *c, 69 + struct bch_io_opts *opts, 70 + struct bkey_ptrs_c ptrs) 71 + { 72 + if (!opts->background_target || 73 + !bch2_target_accepts_data(c, BCH_DATA_user, opts->background_target)) 74 + return 0; 75 + 76 + unsigned ptr_bit = 1; 77 + unsigned rewrite_ptrs = 0; 78 + 79 + bkey_for_each_ptr(ptrs, ptr) { 80 + if (!ptr->cached && !bch2_dev_in_target(c, ptr->dev, opts->background_target)) 81 + rewrite_ptrs |= ptr_bit; 82 + ptr_bit <<= 1; 83 + } 84 + 85 + return rewrite_ptrs; 86 + } 87 + 88 + static unsigned bch2_bkey_ptrs_need_rebalance(struct bch_fs *c, 89 + struct bch_io_opts *opts, 90 + struct bkey_s_c k) 91 + { 92 + struct 
bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 93 + 94 + return bch2_bkey_ptrs_need_compress(c, opts, k, ptrs) | 95 + bch2_bkey_ptrs_need_move(c, opts, ptrs); 96 + } 97 + 98 + u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *c, struct bkey_s_c k) 99 + { 100 + const struct bch_extent_rebalance *opts = bch2_bkey_rebalance_opts(k); 101 + if (!opts) 102 + return 0; 103 + 104 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 105 + const union bch_extent_entry *entry; 106 + struct extent_ptr_decoded p; 107 + u64 sectors = 0; 108 + 109 + if (opts->background_compression) { 110 + unsigned compression_type = bch2_compression_opt_to_type(opts->background_compression); 111 + 112 + bkey_for_each_ptr_decode(k.k, ptrs, p, entry) { 113 + if (p.crc.compression_type == BCH_COMPRESSION_TYPE_incompressible || 114 + p.ptr.unwritten) { 115 + sectors = 0; 116 + goto incompressible; 117 + } 118 + 119 + if (!p.ptr.cached && p.crc.compression_type != compression_type) 120 + sectors += p.crc.compressed_size; 121 + } 122 + } 123 + incompressible: 124 + if (opts->background_target && 125 + bch2_target_accepts_data(c, BCH_DATA_user, opts->background_target)) { 126 + bkey_for_each_ptr_decode(k.k, ptrs, p, entry) 127 + if (!p.ptr.cached && !bch2_dev_in_target(c, p.ptr.dev, opts->background_target)) 128 + sectors += p.crc.compressed_size; 129 + } 130 + 131 + return sectors; 132 + } 133 + 134 + static bool bch2_bkey_rebalance_needs_update(struct bch_fs *c, struct bch_io_opts *opts, 135 + struct bkey_s_c k) 136 + { 137 + if (!bkey_extent_is_direct_data(k.k)) 138 + return 0; 139 + 140 + const struct bch_extent_rebalance *old = bch2_bkey_rebalance_opts(k); 141 + 142 + if (k.k->type == KEY_TYPE_reflink_v || bch2_bkey_ptrs_need_rebalance(c, opts, k)) { 143 + struct bch_extent_rebalance new = io_opts_to_rebalance_opts(opts); 144 + return old == NULL || memcmp(old, &new, sizeof(new)); 145 + } else { 146 + return old != NULL; 147 + } 148 + } 149 + 150 + int bch2_bkey_set_needs_rebalance(struct bch_fs *c, 
struct bch_io_opts *opts, 151 + struct bkey_i *_k) 152 + { 153 + if (!bkey_extent_is_direct_data(&_k->k)) 154 + return 0; 155 + 156 + struct bkey_s k = bkey_i_to_s(_k); 157 + struct bch_extent_rebalance *old = 158 + (struct bch_extent_rebalance *) bch2_bkey_rebalance_opts(k.s_c); 159 + 160 + if (k.k->type == KEY_TYPE_reflink_v || bch2_bkey_ptrs_need_rebalance(c, opts, k.s_c)) { 161 + if (!old) { 162 + old = bkey_val_end(k); 163 + k.k->u64s += sizeof(*old) / sizeof(u64); 164 + } 165 + 166 + *old = io_opts_to_rebalance_opts(opts); 167 + } else { 168 + if (old) 169 + extent_entry_drop(k, (union bch_extent_entry *) old); 170 + } 171 + 172 + return 0; 173 + } 174 + 175 + int bch2_get_update_rebalance_opts(struct btree_trans *trans, 176 + struct bch_io_opts *io_opts, 177 + struct btree_iter *iter, 178 + struct bkey_s_c k) 179 + { 180 + BUG_ON(iter->flags & BTREE_ITER_is_extents); 181 + BUG_ON(iter->flags & BTREE_ITER_filter_snapshots); 182 + 183 + const struct bch_extent_rebalance *r = k.k->type == KEY_TYPE_reflink_v 184 + ? 
bch2_bkey_rebalance_opts(k) : NULL; 185 + if (r) { 186 + #define x(_name) \ 187 + if (r->_name##_from_inode) { \ 188 + io_opts->_name = r->_name; \ 189 + io_opts->_name##_from_inode = true; \ 190 + } 191 + BCH_REBALANCE_OPTS() 192 + #undef x 193 + } 194 + 195 + if (!bch2_bkey_rebalance_needs_update(trans->c, io_opts, k)) 196 + return 0; 197 + 198 + struct bkey_i *n = bch2_trans_kmalloc(trans, bkey_bytes(k.k) + 8); 199 + int ret = PTR_ERR_OR_ZERO(n); 200 + if (ret) 201 + return ret; 202 + 203 + bkey_reassemble(n, k); 204 + 205 + /* On successfull transaction commit, @k was invalidated: */ 206 + 207 + return bch2_bkey_set_needs_rebalance(trans->c, io_opts, n) ?: 208 + bch2_trans_update(trans, iter, n, BTREE_UPDATE_internal_snapshot_node) ?: 209 + bch2_trans_commit(trans, NULL, NULL, 0) ?: 210 + -BCH_ERR_transaction_restart_nested; 211 + } 212 + 27 213 #define REBALANCE_WORK_SCAN_OFFSET (U64_MAX - 1) 28 214 29 215 static const char * const bch2_rebalance_state_strs[] = { ··· 219 33 #undef x 220 34 }; 221 35 222 - static int __bch2_set_rebalance_needs_scan(struct btree_trans *trans, u64 inum) 36 + int bch2_set_rebalance_needs_scan_trans(struct btree_trans *trans, u64 inum) 223 37 { 224 38 struct btree_iter iter; 225 39 struct bkey_s_c k; ··· 257 71 int bch2_set_rebalance_needs_scan(struct bch_fs *c, u64 inum) 258 72 { 259 73 int ret = bch2_trans_commit_do(c, NULL, NULL, 260 - BCH_TRANS_COMMIT_no_enospc| 261 - BCH_TRANS_COMMIT_lazy_rw, 262 - __bch2_set_rebalance_needs_scan(trans, inum)); 74 + BCH_TRANS_COMMIT_no_enospc, 75 + bch2_set_rebalance_needs_scan_trans(trans, inum)); 263 76 rebalance_wakeup(c); 264 77 return ret; 265 78 } ··· 306 121 struct btree_iter *iter, 307 122 struct bkey_s_c k) 308 123 { 124 + if (!bch2_bkey_rebalance_opts(k)) 125 + return 0; 126 + 309 127 struct bkey_i *n = bch2_bkey_make_mut(trans, iter, &k, 0); 310 128 int ret = PTR_ERR_OR_ZERO(n); 311 129 if (ret) ··· 322 134 static struct bkey_s_c next_rebalance_extent(struct btree_trans *trans, 323 
135 struct bpos work_pos, 324 136 struct btree_iter *extent_iter, 137 + struct bch_io_opts *io_opts, 325 138 struct data_update_opts *data_opts) 326 139 { 327 140 struct bch_fs *c = trans->c; 328 - struct bkey_s_c k; 329 141 330 142 bch2_trans_iter_exit(trans, extent_iter); 331 143 bch2_trans_iter_init(trans, extent_iter, 332 144 work_pos.inode ? BTREE_ID_extents : BTREE_ID_reflink, 333 145 work_pos, 334 146 BTREE_ITER_all_snapshots); 335 - k = bch2_btree_iter_peek_slot(extent_iter); 147 + struct bkey_s_c k = bch2_btree_iter_peek_slot(extent_iter); 336 148 if (bkey_err(k)) 337 149 return k; 338 150 339 - const struct bch_extent_rebalance *r = k.k ? bch2_bkey_rebalance_opts(k) : NULL; 340 - if (!r) { 341 - /* raced due to btree write buffer, nothing to do */ 342 - return bkey_s_c_null; 343 - } 151 + int ret = bch2_move_get_io_opts_one(trans, io_opts, extent_iter, k); 152 + if (ret) 153 + return bkey_s_c_err(ret); 344 154 345 155 memset(data_opts, 0, sizeof(*data_opts)); 346 - 347 - data_opts->rewrite_ptrs = 348 - bch2_bkey_ptrs_need_rebalance(c, k, r->target, r->compression); 349 - data_opts->target = r->target; 156 + data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, io_opts, k); 157 + data_opts->target = io_opts->background_target; 350 158 data_opts->write_flags |= BCH_WRITE_ONLY_SPECIFIED_DEVS; 351 159 352 160 if (!data_opts->rewrite_ptrs) { ··· 362 178 if (trace_rebalance_extent_enabled()) { 363 179 struct printbuf buf = PRINTBUF; 364 180 365 - prt_str(&buf, "target="); 366 - bch2_target_to_text(&buf, c, r->target); 367 - prt_str(&buf, " compression="); 368 - bch2_compression_opt_to_text(&buf, r->compression); 369 - prt_str(&buf, " "); 370 181 bch2_bkey_val_to_text(&buf, c, k); 182 + prt_newline(&buf); 183 + 184 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 185 + 186 + unsigned p = bch2_bkey_ptrs_need_compress(c, io_opts, k, ptrs); 187 + if (p) { 188 + prt_str(&buf, "compression="); 189 + bch2_compression_opt_to_text(&buf, 
io_opts->background_compression); 190 + prt_str(&buf, " "); 191 + bch2_prt_u64_base2(&buf, p); 192 + prt_newline(&buf); 193 + } 194 + 195 + p = bch2_bkey_ptrs_need_move(c, io_opts, ptrs); 196 + if (p) { 197 + prt_str(&buf, "move="); 198 + bch2_target_to_text(&buf, c, io_opts->background_target); 199 + prt_str(&buf, " "); 200 + bch2_prt_u64_base2(&buf, p); 201 + prt_newline(&buf); 202 + } 371 203 372 204 trace_rebalance_extent(c, buf.buf); 373 205 printbuf_exit(&buf); ··· 412 212 bch2_bkey_buf_init(&sk); 413 213 414 214 ret = bkey_err(k = next_rebalance_extent(trans, work_pos, 415 - extent_iter, &data_opts)); 215 + extent_iter, &io_opts, &data_opts)); 416 216 if (ret || !k.k) 417 - goto out; 418 - 419 - ret = bch2_move_get_io_opts_one(trans, &io_opts, k); 420 - if (ret) 421 217 goto out; 422 218 423 219 atomic64_add(k.k->size, &ctxt->stats->sectors_seen); ··· 449 253 struct bch_io_opts *io_opts, 450 254 struct data_update_opts *data_opts) 451 255 { 452 - unsigned target, compression; 453 - 454 - if (k.k->p.inode) { 455 - target = io_opts->background_target; 456 - compression = background_compression(*io_opts); 457 - } else { 458 - const struct bch_extent_rebalance *r = bch2_bkey_rebalance_opts(k); 459 - 460 - target = r ? r->target : io_opts->background_target; 461 - compression = r ? 
r->compression : background_compression(*io_opts); 462 - } 463 - 464 - data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, k, target, compression); 465 - data_opts->target = target; 256 + data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, io_opts, k); 257 + data_opts->target = io_opts->background_target; 466 258 data_opts->write_flags |= BCH_WRITE_ONLY_SPECIFIED_DEVS; 467 259 return data_opts->rewrite_ptrs != 0; 468 260 } ··· 522 338 BTREE_ITER_all_snapshots); 523 339 524 340 while (!bch2_move_ratelimit(ctxt)) { 525 - if (!r->enabled) { 341 + if (!c->opts.rebalance_enabled) { 526 342 bch2_moving_ctxt_flush_all(ctxt); 527 - kthread_wait_freezable(r->enabled || 343 + kthread_wait_freezable(c->opts.rebalance_enabled || 528 344 kthread_should_stop()); 529 345 } 530 346
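The accounting in bch2_bkey_sectors_need_rebalance() in the rebalance.c hunks above can be sketched outside the kernel as follows. This is a simplified illustration under stated assumptions, not the kernel code: struct demo_ptr and demo_sectors_need_rebalance() are hypothetical names, the decoded pointer state is reduced to a few flags, and the compression option is compared directly to the pointer's compression type (the real code first converts it with bch2_compression_opt_to_type()).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened view of one decoded extent pointer. */
struct demo_ptr {
	bool		cached;
	bool		unwritten;
	bool		incompressible;
	bool		in_target;	/* already on the background target */
	unsigned	compression_type;
	unsigned	compressed_size;
};

/*
 * Count sectors that still need background work: first pointers whose
 * compression differs from the wanted one, then pointers that live
 * outside the background target. want_compression == 0 means "none".
 */
static uint64_t demo_sectors_need_rebalance(const struct demo_ptr *ptrs, size_t nr,
					    unsigned want_compression, bool have_target)
{
	uint64_t sectors = 0;

	if (want_compression) {
		for (size_t i = 0; i < nr; i++) {
			/* Incompressible or unwritten data: drop the compression count. */
			if (ptrs[i].incompressible || ptrs[i].unwritten) {
				sectors = 0;
				break;
			}

			if (!ptrs[i].cached &&
			    ptrs[i].compression_type != want_compression)
				sectors += ptrs[i].compressed_size;
		}
	}

	if (have_target)
		for (size_t i = 0; i < nr; i++)
			if (!ptrs[i].cached && !ptrs[i].in_target)
				sectors += ptrs[i].compressed_size;

	return sectors;
}
```

Cached pointers are ignored in both passes, and a pointer that needs both recompression and a move is counted twice, which is how the flattened source above reads as well.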
+10
fs/bcachefs/rebalance.h
···
 #ifndef _BCACHEFS_REBALANCE_H
 #define _BCACHEFS_REBALANCE_H
 
+#include "compress.h"
+#include "disk_groups.h"
 #include "rebalance_types.h"
 
+u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *, struct bkey_s_c);
+int bch2_bkey_set_needs_rebalance(struct bch_fs *, struct bch_io_opts *, struct bkey_i *);
+int bch2_get_update_rebalance_opts(struct btree_trans *,
+				   struct bch_io_opts *,
+				   struct btree_iter *,
+				   struct bkey_s_c);
+
+int bch2_set_rebalance_needs_scan_trans(struct btree_trans *, u64);
 int bch2_set_rebalance_needs_scan(struct bch_fs *, u64 inum);
 int bch2_set_fs_needs_rebalance(struct bch_fs *);
 
+53
fs/bcachefs/rebalance_format.h
···
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_REBALANCE_FORMAT_H
+#define _BCACHEFS_REBALANCE_FORMAT_H
+
+struct bch_extent_rebalance {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	__u64			type:6,
+				unused:3,
+
+				promote_target_from_inode:1,
+				erasure_code_from_inode:1,
+				data_checksum_from_inode:1,
+				background_compression_from_inode:1,
+				data_replicas_from_inode:1,
+				background_target_from_inode:1,
+
+				promote_target:16,
+				erasure_code:1,
+				data_checksum:4,
+				data_replicas:4,
+				background_compression:8, /* enum bch_compression_opt */
+				background_target:16;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+	__u64			background_target:16,
+				background_compression:8,
+				data_replicas:4,
+				data_checksum:4,
+				erasure_code:1,
+				promote_target:16,
+
+				background_target_from_inode:1,
+				data_replicas_from_inode:1,
+				background_compression_from_inode:1,
+				data_checksum_from_inode:1,
+				erasure_code_from_inode:1,
+				promote_target_from_inode:1,
+
+				unused:3,
+				type:6;
+#endif
+};
+
+/* subset of BCH_INODE_OPTS */
+#define BCH_REBALANCE_OPTS()			\
+	x(data_checksum)			\
+	x(background_compression)		\
+	x(data_replicas)			\
+	x(promote_target)			\
+	x(background_target)			\
+	x(erasure_code)
+
+#endif /* _BCACHEFS_REBALANCE_FORMAT_H */
+
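BCH_REBALANCE_OPTS() is an x-macro list: each user defines x() locally, expands the list, then undefines x(). The sketch below shows that pattern in isolation, mirroring how bch2_get_update_rebalance_opts() copies only the options flagged as coming from the inode. It is a standalone illustration, not kernel code: DEMO_REBALANCE_OPTS, struct demo_rebalance, struct demo_io_opts and demo_apply_inode_opts() are hypothetical names, and plain fields stand in for the on-disk bitfields.

```c
#include <stdbool.h>

/* Hypothetical two-entry option list, standing in for BCH_REBALANCE_OPTS(). */
#define DEMO_REBALANCE_OPTS()		\
	x(background_compression)	\
	x(background_target)

/* One value plus one "_from_inode" flag per listed option. */
struct demo_rebalance {
#define x(_name) unsigned _name; bool _name##_from_inode;
	DEMO_REBALANCE_OPTS()
#undef x
};

struct demo_io_opts {
#define x(_name) unsigned _name; bool _name##_from_inode;
	DEMO_REBALANCE_OPTS()
#undef x
};

/* Copy an option into io only when the extent says it came from the inode. */
static void demo_apply_inode_opts(struct demo_io_opts *io,
				  const struct demo_rebalance *r)
{
#define x(_name)					\
	if (r->_name##_from_inode) {			\
		io->_name = r->_name;			\
		io->_name##_from_inode = true;		\
	}
	DEMO_REBALANCE_OPTS()
#undef x
}
```

Adding an option to the list extends both structs and the copy loop at once, which is presumably why the format header keeps the list next to the struct definition.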
-2
fs/bcachefs/rebalance_types.h
···
 	struct bbpos			scan_start;
 	struct bbpos			scan_end;
 	struct bch_move_stats		scan_stats;
-
-	unsigned			enabled:1;
 };
 
 #endif /* _BCACHEFS_REBALANCE_TYPES_H */
+146 -72
fs/bcachefs/recovery.c
···
 
 #define QSTR(n) { { { .len = strlen(n) } }, .name = n }
 
-void bch2_btree_lost_data(struct bch_fs *c, enum btree_id btree)
+int bch2_btree_lost_data(struct bch_fs *c, enum btree_id btree)
 {
-	if (btree >= BTREE_ID_NR_MAX)
-		return;
-
 	u64 b = BIT_ULL(btree);
+	int ret = 0;
+
+	mutex_lock(&c->sb_lock);
+	struct bch_sb_field_ext *ext = bch2_sb_field_get(c->disk_sb.sb, ext);
 
 	if (!(c->sb.btrees_lost_data & b)) {
-		bch_err(c, "flagging btree %s lost data", bch2_btree_id_str(btree));
-
-		mutex_lock(&c->sb_lock);
-		bch2_sb_field_get(c->disk_sb.sb, ext)->btrees_lost_data |= cpu_to_le64(b);
-		bch2_write_super(c);
-		mutex_unlock(&c->sb_lock);
+		struct printbuf buf = PRINTBUF;
+		bch2_btree_id_to_text(&buf, btree);
+		bch_err(c, "flagging btree %s lost data", buf.buf);
+		printbuf_exit(&buf);
+		ext->btrees_lost_data |= cpu_to_le64(b);
 	}
+
+	/* Once we have runtime self healing for topology errors we won't need this: */
+	ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_topology) ?: ret;
+
+	/* Btree node accounting will be off: */
+	__set_bit_le64(BCH_FSCK_ERR_accounting_mismatch, ext->errors_silent);
+	ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_allocations) ?: ret;
+
+#ifdef CONFIG_BCACHEFS_DEBUG
+	/*
+	 * These are much more minor, and don't need to be corrected right away,
+	 * but in debug mode we want the next fsck run to be clean:
+	 */
+	ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_lrus) ?: ret;
+	ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_backpointers_to_extents) ?: ret;
+#endif
+
+	switch (btree) {
+	case BTREE_ID_alloc:
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_alloc_info) ?: ret;
+
+		__set_bit_le64(BCH_FSCK_ERR_alloc_key_data_type_wrong, ext->errors_silent);
+		__set_bit_le64(BCH_FSCK_ERR_alloc_key_gen_wrong, ext->errors_silent);
+		__set_bit_le64(BCH_FSCK_ERR_alloc_key_dirty_sectors_wrong, ext->errors_silent);
+		__set_bit_le64(BCH_FSCK_ERR_alloc_key_cached_sectors_wrong, ext->errors_silent);
+		__set_bit_le64(BCH_FSCK_ERR_alloc_key_stripe_wrong, ext->errors_silent);
+		__set_bit_le64(BCH_FSCK_ERR_alloc_key_stripe_redundancy_wrong, ext->errors_silent);
+		goto out;
+	case BTREE_ID_backpointers:
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_btree_backpointers) ?: ret;
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_extents_to_backpointers) ?: ret;
+		goto out;
+	case BTREE_ID_need_discard:
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_alloc_info) ?: ret;
+		goto out;
+	case BTREE_ID_freespace:
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_alloc_info) ?: ret;
+		goto out;
+	case BTREE_ID_bucket_gens:
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_alloc_info) ?: ret;
+		goto out;
+	case BTREE_ID_lru:
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_alloc_info) ?: ret;
+		goto out;
+	case BTREE_ID_accounting:
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_check_allocations) ?: ret;
+		goto out;
+	default:
+		ret = bch2_run_explicit_recovery_pass_persistent_locked(c, BCH_RECOVERY_PASS_scan_for_btree_nodes) ?: ret;
+		goto out;
+	}
+out:
+	bch2_write_super(c);
+	mutex_unlock(&c->sb_lock);
+
+	return ret;
+}
+
+static void kill_btree(struct bch_fs *c, enum btree_id btree)
+{
+	bch2_btree_id_root(c, btree)->alive = false;
+	bch2_shoot_down_journal_keys(c, btree, 0, BTREE_MAX_DEPTH, POS_MIN, SPOS_MAX);
 }
 
 /* for -o reconstruct_alloc: */
···
 	__set_bit_le64(BCH_FSCK_ERR_fs_usage_persistent_reserved_wrong, ext->errors_silent);
 	__set_bit_le64(BCH_FSCK_ERR_fs_usage_replicas_wrong, ext->errors_silent);
 
+	__set_bit_le64(BCH_FSCK_ERR_alloc_key_to_missing_lru_entry, ext->errors_silent);
+
 	__set_bit_le64(BCH_FSCK_ERR_alloc_key_data_type_wrong, ext->errors_silent);
 	__set_bit_le64(BCH_FSCK_ERR_alloc_key_gen_wrong, ext->errors_silent);
 	__set_bit_le64(BCH_FSCK_ERR_alloc_key_dirty_sectors_wrong, ext->errors_silent);
···
 	bch2_write_super(c);
 	mutex_unlock(&c->sb_lock);
 
-	bch2_shoot_down_journal_keys(c, BTREE_ID_alloc,
-				     0, BTREE_MAX_DEPTH, POS_MIN, SPOS_MAX);
-	bch2_shoot_down_journal_keys(c, BTREE_ID_backpointers,
-				     0, BTREE_MAX_DEPTH, POS_MIN, SPOS_MAX);
-	bch2_shoot_down_journal_keys(c, BTREE_ID_need_discard,
-				     0, BTREE_MAX_DEPTH, POS_MIN, SPOS_MAX);
-	bch2_shoot_down_journal_keys(c, BTREE_ID_freespace,
-				     0, BTREE_MAX_DEPTH, POS_MIN, SPOS_MAX);
-	bch2_shoot_down_journal_keys(c, BTREE_ID_bucket_gens,
-				     0, BTREE_MAX_DEPTH, POS_MIN, SPOS_MAX);
+	for (unsigned i = 0; i < btree_id_nr_alive(c); i++)
+		if (btree_id_is_alloc(i))
+			kill_btree(c, i);
 }
 
 /*
···
 				 ? BCH_TRANS_COMMIT_no_journal_res|BCH_WATERMARK_reclaim
 				 : 0),
 			     bch2_journal_replay_key(trans, k));
-		bch_err_msg(c, ret, "while replaying key at btree %s level %u:",
-			    bch2_btree_id_str(k->btree_id), k->level);
-		if (ret)
+		if (ret) {
+			struct printbuf buf = PRINTBUF;
+			bch2_btree_id_level_to_text(&buf, k->btree_id, k->level);
+			bch_err_msg(c, ret, "while replaying key at %s:", buf.buf);
+			printbuf_exit(&buf);
 			goto err;
+		}
 
 		BUG_ON(k->btree_id != BTREE_ID_accounting && !k->overwritten);
 	}
···
 
 	switch (entry->type) {
 	case BCH_JSET_ENTRY_btree_root: {
-		struct btree_root *r;
+
+		if (unlikely(!entry->u64s))
+			return 0;
 
 		if (fsck_err_on(entry->btree_id >= BTREE_ID_NR_MAX,
 				c, invalid_btree_id,
···
 			return ret;
 		}
 
-		r = bch2_btree_id_root(c, entry->btree_id);
+		struct btree_root *r = bch2_btree_id_root(c, entry->btree_id);
 
-		if (entry->u64s) {
-			r->level = entry->level;
-			bkey_copy(&r->key, (struct bkey_i *) entry->start);
-			r->error = 0;
-		} else {
-			r->error = -BCH_ERR_btree_node_read_error;
-		}
+		r->level = entry->level;
+		bkey_copy(&r->key, (struct bkey_i *) entry->start);
+		r->error = 0;
 		r->alive = true;
 		break;
 	}
···
 
 static int read_btree_roots(struct bch_fs *c)
 {
+	struct printbuf buf = PRINTBUF;
 	int ret = 0;
 
 	for (unsigned i = 0; i < btree_id_nr_alive(c); i++) {
···
 		if (!r->alive)
 			continue;
 
-		if (btree_id_is_alloc(i) && c->opts.reconstruct_alloc)
-			continue;
+		printbuf_reset(&buf);
+		bch2_btree_id_level_to_text(&buf, i, r->level);
 
 		if (mustfix_fsck_err_on((ret = r->error),
 					c, btree_root_bkey_invalid,
 					"invalid btree root %s",
-					bch2_btree_id_str(i)) ||
+					buf.buf) ||
 		    mustfix_fsck_err_on((ret = r->error = bch2_btree_root_read(c, i, &r->key, r->level)),
 					c, btree_root_read_error,
-					"error reading btree root %s l=%u: %s",
-					bch2_btree_id_str(i), r->level, bch2_err_str(ret))) {
-			if (btree_id_is_alloc(i)) {
-				c->opts.recovery_passes |= BIT_ULL(BCH_RECOVERY_PASS_check_allocations);
-				c->opts.recovery_passes |= BIT_ULL(BCH_RECOVERY_PASS_check_alloc_info);
-				c->opts.recovery_passes |= BIT_ULL(BCH_RECOVERY_PASS_check_lrus);
-				c->opts.recovery_passes |= BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers);
-				c->opts.recovery_passes |= BIT_ULL(BCH_RECOVERY_PASS_check_alloc_to_lru_refs);
-				c->sb.compat &= ~(1ULL << BCH_COMPAT_alloc_info);
+					"error reading btree root %s: %s",
+					buf.buf, bch2_err_str(ret))) {
+			if (btree_id_is_alloc(i))
 				r->error = 0;
-			} else if (!(c->opts.recovery_passes & BIT_ULL(BCH_RECOVERY_PASS_scan_for_btree_nodes))) {
-				bch_info(c, "will run btree node scan");
-				c->opts.recovery_passes |= BIT_ULL(BCH_RECOVERY_PASS_scan_for_btree_nodes);
-				c->opts.recovery_passes |= BIT_ULL(BCH_RECOVERY_PASS_check_topology);
-			}
 
-			ret = 0;
-			bch2_btree_lost_data(c, i);
+			ret = bch2_btree_lost_data(c, i);
+			BUG_ON(ret);
 		}
 	}
 
···
 		}
 	}
 fsck_err:
+	printbuf_exit(&buf);
 	return ret;
 }
 
···
 					bch2_latest_compatible_version(c->sb.version));
 	unsigned old_version = c->sb.version_upgrade_complete ?: c->sb.version;
 	unsigned new_version = 0;
+	bool ret = false;
 
 	if (old_version < bcachefs_metadata_required_upgrade_below) {
 		if (c->opts.version_upgrade == BCH_VERSION_UPGRADE_incompatible ||
···
 		}
 
 		bch_info(c, "%s", buf.buf);
-
-		bch2_sb_upgrade(c, new_version);
-
 		printbuf_exit(&buf);
-		return true;
+
+		ret = true;
 	}
 
-	return false;
+	if (new_version > c->sb.version_incompat &&
+	    c->opts.version_upgrade == BCH_VERSION_UPGRADE_incompatible) {
+		struct printbuf buf = PRINTBUF;
+
+		prt_str(&buf, "Now allowing incompatible features up to ");
+		bch2_version_to_text(&buf, new_version);
+		prt_str(&buf, ", previously allowed up to ");
+		bch2_version_to_text(&buf, c->sb.version_incompat_allowed);
+		prt_newline(&buf);
+
+		bch_info(c, "%s", buf.buf);
+		printbuf_exit(&buf);
+
+		ret = true;
+	}
+
+	if (ret)
+		bch2_sb_upgrade(c, new_version,
+				c->opts.version_upgrade == BCH_VERSION_UPGRADE_incompatible);
+
+	return ret;
 }
 
 int bch2_fs_recovery(struct bch_fs *c)
···
 		goto err;
 	}
 
-	if (c->opts.norecovery)
-		c->opts.recovery_pass_last = BCH_RECOVERY_PASS_journal_replay - 1;
+	if (c->opts.norecovery) {
+		c->opts.recovery_pass_last = c->opts.recovery_pass_last
+			? min(c->opts.recovery_pass_last, BCH_RECOVERY_PASS_snapshots_read)
+			: BCH_RECOVERY_PASS_snapshots_read;
+		c->opts.nochanges = true;
+		c->opts.read_only = true;
+	}
 
 	mutex_lock(&c->sb_lock);
 	struct bch_sb_field_ext *ext = bch2_sb_field_get(c->disk_sb.sb, ext);
···
 
 	c->opts.recovery_passes |= bch2_recovery_passes_from_stable(le64_to_cpu(ext->recovery_passes_required[0]));
 
+	if (c->sb.version_upgrade_complete < bcachefs_metadata_version_autofix_errors) {
+		SET_BCH_SB_ERROR_ACTION(c->disk_sb.sb, BCH_ON_ERROR_fix_safe);
+		write_sb = true;
+	}
+
 	if (write_sb)
 		bch2_write_super(c);
 	mutex_unlock(&c->sb_lock);
-
-	if (c->opts.fsck && IS_ENABLED(CONFIG_BCACHEFS_DEBUG))
-		c->opts.recovery_passes |= BIT_ULL(BCH_RECOVERY_PASS_check_topology);
 
 	if (c->opts.fsck)
 		set_bit(BCH_FS_fsck_running, &c->flags);
 	if (c->sb.clean)
 		set_bit(BCH_FS_clean_recovery, &c->flags);
+	set_bit(BCH_FS_recovery_running, &c->flags);
 
 	ret = bch2_blacklist_table_initialize(c);
 	if (ret) {
···
 	c->journal_replay_seq_start	= last_seq;
 	c->journal_replay_seq_end	= blacklist_seq - 1;
 
-	if (c->opts.reconstruct_alloc)
-		bch2_reconstruct_alloc(c);
-
 	zero_out_btree_mem_ptr(&c->journal_keys);
 
 	ret = journal_replay_early(c, clean);
 	if (ret)
 		goto err;
+
+	if (c->opts.reconstruct_alloc)
+		bch2_reconstruct_alloc(c);
 
 	/*
 	 * After an unclean shutdown, skip then next few journal sequence
···
 	 */
 	set_bit(BCH_FS_may_go_rw, &c->flags);
 	clear_bit(BCH_FS_fsck_running, &c->flags);
+	clear_bit(BCH_FS_recovery_running, &c->flags);
 
 	/* in case we don't run journal replay, i.e. norecovery mode */
 	set_bit(BCH_FS_accounting_replay_done, &c->flags);
 
+	bch2_async_btree_node_rewrites_flush(c);
+
 	/* fsync if we fixed errors */
-	if (test_bit(BCH_FS_errors_fixed, &c->flags) &&
-	    bch2_write_ref_tryget(c, BCH_WRITE_REF_fsync)) {
+	if (test_bit(BCH_FS_errors_fixed, &c->flags)) {
 		bch2_journal_flush_all_pins(&c->journal);
 		bch2_journal_meta(&c->journal);
-		bch2_write_ref_put(c, BCH_WRITE_REF_fsync);
 	}
 
 	/* If we fixed errors, verify that fs is actually clean now: */
···
 	bch2_check_version_downgrade(c);
 
 	if (c->opts.version_upgrade != BCH_VERSION_UPGRADE_none) {
-		bch2_sb_upgrade(c, bcachefs_metadata_version_current);
+		bch2_sb_upgrade(c, bcachefs_metadata_version_current, false);
 		SET_BCH_SB_VERSION_UPGRADE_COMPLETE(c->disk_sb.sb, bcachefs_metadata_version_current);
 		bch2_write_super(c);
 	}
···
 	bch2_write_super(c);
 	mutex_unlock(&c->sb_lock);
 
-	c->curr_recovery_pass = BCH_RECOVERY_PASS_NR;
 	set_bit(BCH_FS_btree_running, &c->flags);
 	set_bit(BCH_FS_may_go_rw, &c->flags);
 
···
 	bch_err_msg(c, ret, "marking superblocks");
 	if (ret)
 		goto err;
-
-	for_each_online_member(c, ca)
-		ca->new_fs_bucket_idx = 0;
 
 	ret = bch2_fs_freespace_init(c);
 	if (ret)
···
 	bch2_write_super(c);
 	mutex_unlock(&c->sb_lock);
 
+	c->curr_recovery_pass = BCH_RECOVERY_PASS_NR;
 	return 0;
 err:
 	bch_err_fn(c, ret);
+1 -1
fs/bcachefs/recovery.h
···
 #ifndef _BCACHEFS_RECOVERY_H
 #define _BCACHEFS_RECOVERY_H
 
-void bch2_btree_lost_data(struct bch_fs *, enum btree_id);
+int bch2_btree_lost_data(struct bch_fs *, enum btree_id);
 
 int bch2_journal_replay(struct bch_fs *);
 
+80 -28
fs/bcachefs/recovery_passes.c
···
 
 	set_bit(BCH_FS_may_go_rw, &c->flags);
 
-	if (keys->nr || c->opts.fsck || !c->sb.clean || c->opts.recovery_passes)
+	if (keys->nr || !c->opts.read_only || c->opts.fsck || !c->sb.clean || c->opts.recovery_passes)
 		return bch2_fs_read_write_early(c);
 	return 0;
 }
···
 /*
  * For when we need to rewind recovery passes and run a pass we skipped:
  */
-int bch2_run_explicit_recovery_pass(struct bch_fs *c,
-				    enum bch_recovery_pass pass)
+static int __bch2_run_explicit_recovery_pass(struct bch_fs *c,
+					     enum bch_recovery_pass pass)
 {
-	if (c->opts.recovery_passes & BIT_ULL(pass))
+	if (c->curr_recovery_pass == ARRAY_SIZE(recovery_pass_fns))
+		return -BCH_ERR_not_in_recovery;
+
+	if (c->recovery_passes_complete & BIT_ULL(pass))
 		return 0;
 
-	bch_info(c, "running explicit recovery pass %s (%u), currently at %s (%u)",
-		 bch2_recovery_passes[pass], pass,
-		 bch2_recovery_passes[c->curr_recovery_pass], c->curr_recovery_pass);
+	bool print = !(c->opts.recovery_passes & BIT_ULL(pass));
+
+	if (pass < BCH_RECOVERY_PASS_set_may_go_rw &&
+	    c->curr_recovery_pass >= BCH_RECOVERY_PASS_set_may_go_rw) {
+		if (print)
+			bch_info(c, "need recovery pass %s (%u), but already rw",
+				 bch2_recovery_passes[pass], pass);
+		return -BCH_ERR_cannot_rewind_recovery;
+	}
+
+	if (print)
+		bch_info(c, "running explicit recovery pass %s (%u), currently at %s (%u)",
+			 bch2_recovery_passes[pass], pass,
+			 bch2_recovery_passes[c->curr_recovery_pass], c->curr_recovery_pass);
 
 	c->opts.recovery_passes |= BIT_ULL(pass);
 
-	if (c->curr_recovery_pass >= pass) {
-		c->curr_recovery_pass = pass;
+	if (c->curr_recovery_pass > pass) {
+		c->next_recovery_pass = pass;
 		c->recovery_passes_complete &= (1ULL << pass) >> 1;
 		return -BCH_ERR_restart_recovery;
 	} else {
 		return 0;
 	}
+}
+
+int bch2_run_explicit_recovery_pass(struct bch_fs *c,
+				    enum bch_recovery_pass pass)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&c->recovery_pass_lock, flags);
+	int ret = __bch2_run_explicit_recovery_pass(c, pass);
+	spin_unlock_irqrestore(&c->recovery_pass_lock, flags);
+	return ret;
+}
+
+int bch2_run_explicit_recovery_pass_persistent_locked(struct bch_fs *c,
+					       enum bch_recovery_pass pass)
+{
+	lockdep_assert_held(&c->sb_lock);
+
+	struct bch_sb_field_ext *ext = bch2_sb_field_get(c->disk_sb.sb, ext);
+	__set_bit_le64(bch2_recovery_pass_to_stable(pass), ext->recovery_passes_required);
+
+	return bch2_run_explicit_recovery_pass(c, pass);
 }
 
 int bch2_run_explicit_recovery_pass_persistent(struct bch_fs *c,
···
 	 */
 	c->opts.recovery_passes_exclude &= ~BCH_RECOVERY_PASS_set_may_go_rw;
 
-	while (c->curr_recovery_pass < ARRAY_SIZE(recovery_pass_fns)) {
+	while (c->curr_recovery_pass < ARRAY_SIZE(recovery_pass_fns) && !ret) {
+		c->next_recovery_pass = c->curr_recovery_pass + 1;
+
+		spin_lock_irq(&c->recovery_pass_lock);
+		unsigned pass = c->curr_recovery_pass;
+
 		if (c->opts.recovery_pass_last &&
-		    c->curr_recovery_pass > c->opts.recovery_pass_last)
+		    c->curr_recovery_pass > c->opts.recovery_pass_last) {
+			spin_unlock_irq(&c->recovery_pass_lock);
 			break;
-
-		if (should_run_recovery_pass(c, c->curr_recovery_pass)) {
-			unsigned pass = c->curr_recovery_pass;
-
-			ret =   bch2_run_recovery_pass(c, c->curr_recovery_pass) ?:
-				bch2_journal_flush(&c->journal);
-			if (bch2_err_matches(ret, BCH_ERR_restart_recovery) ||
-			    (ret && c->curr_recovery_pass < pass))
-				continue;
-			if (ret)
-				break;
-
-			c->recovery_passes_complete |= BIT_ULL(c->curr_recovery_pass);
 		}
 
-		c->recovery_pass_done = max(c->recovery_pass_done, c->curr_recovery_pass);
+		if (!should_run_recovery_pass(c, pass)) {
+			c->curr_recovery_pass++;
+			c->recovery_pass_done = max(c->recovery_pass_done, pass);
+			spin_unlock_irq(&c->recovery_pass_lock);
+			continue;
+		}
+		spin_unlock_irq(&c->recovery_pass_lock);
 
-		if (!test_bit(BCH_FS_error, &c->flags))
-			bch2_clear_recovery_pass_required(c, c->curr_recovery_pass);
+		ret =   bch2_run_recovery_pass(c, pass) ?:
+			bch2_journal_flush(&c->journal);
 
-		c->curr_recovery_pass++;
+		if (!ret && !test_bit(BCH_FS_error, &c->flags))
+			bch2_clear_recovery_pass_required(c, pass);
+
+		spin_lock_irq(&c->recovery_pass_lock);
+		if (c->next_recovery_pass < c->curr_recovery_pass) {
+			/*
+			 * bch2_run_explicit_recovery_pass() was called: we
+			 * can't always catch -BCH_ERR_restart_recovery because
+			 * it may have been called from another thread (btree
+			 * node read completion)
+			 */
+			ret = 0;
+			c->recovery_passes_complete &= ~(~0ULL << c->curr_recovery_pass);
+		} else {
+			c->recovery_passes_complete |= BIT_ULL(pass);
+			c->recovery_pass_done = max(c->recovery_pass_done, pass);
+		}
+		c->curr_recovery_pass = c->next_recovery_pass;
+		spin_unlock_irq(&c->recovery_pass_lock);
 	}
 
 	return ret;
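The rewind branch in the loop above masks recovery_passes_complete with ~(~0ULL << c->curr_recovery_pass), while the normal branch sets BIT_ULL(pass). The two bit operations can be checked in isolation with the sketch below. The helper names demo_passes_below() and demo_mark_complete() are hypothetical, and the sketch assumes pass numbers stay below 64, since shifting a 64-bit value by 64 or more is undefined in C.

```c
#include <stdint.h>

/*
 * Keep only the completion bits for passes numbered below `pass`,
 * clearing bit `pass` and everything above it. Requires pass < 64.
 */
static uint64_t demo_passes_below(uint64_t complete, unsigned pass)
{
	return complete & ~(~0ULL << pass);
}

/* Record that `pass` finished; same effect as OR-ing in BIT_ULL(pass). */
static uint64_t demo_mark_complete(uint64_t complete, unsigned pass)
{
	return complete | (1ULL << pass);
}
```

With passes 0 through 7 complete (0xFF), demo_passes_below(0xFF, 4) leaves 0x0F, so passes 4 and later would be treated as not yet run.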
+1
fs/bcachefs/recovery_passes.h
···
 u64 bch2_fsck_recovery_passes(void);
 
 int bch2_run_explicit_recovery_pass(struct bch_fs *, enum bch_recovery_pass);
+int bch2_run_explicit_recovery_pass_persistent_locked(struct bch_fs *, enum bch_recovery_pass);
 int bch2_run_explicit_recovery_pass_persistent(struct bch_fs *, enum bch_recovery_pass);
 
 int bch2_run_online_recovery_passes(struct bch_fs *);
+49 -43
fs/bcachefs/recovery_passes_types.h
··· 8 8 #define PASS_ALWAYS BIT(3) 9 9 #define PASS_ONLINE BIT(4) 10 10 11 + #ifdef CONFIG_BCACHEFS_DEBUG 12 + #define PASS_FSCK_DEBUG BIT(1) 13 + #else 14 + #define PASS_FSCK_DEBUG 0 15 + #endif 16 + 11 17 /* 12 18 * Passes may be reordered, but the second field is a persistent identifier and 13 19 * must never change: 14 20 */ 15 - #define BCH_RECOVERY_PASSES() \ 16 - x(recovery_pass_empty, 41, PASS_SILENT) \ 17 - x(scan_for_btree_nodes, 37, 0) \ 18 - x(check_topology, 4, 0) \ 19 - x(accounting_read, 39, PASS_ALWAYS) \ 20 - x(alloc_read, 0, PASS_ALWAYS) \ 21 - x(stripes_read, 1, PASS_ALWAYS) \ 22 - x(initialize_subvolumes, 2, 0) \ 23 - x(snapshots_read, 3, PASS_ALWAYS) \ 24 - x(check_allocations, 5, PASS_FSCK) \ 25 - x(trans_mark_dev_sbs, 6, PASS_ALWAYS|PASS_SILENT) \ 26 - x(fs_journal_alloc, 7, PASS_ALWAYS|PASS_SILENT) \ 27 - x(set_may_go_rw, 8, PASS_ALWAYS|PASS_SILENT) \ 28 - x(journal_replay, 9, PASS_ALWAYS) \ 29 - x(check_alloc_info, 10, PASS_ONLINE|PASS_FSCK) \ 30 - x(check_lrus, 11, PASS_ONLINE|PASS_FSCK) \ 31 - x(check_btree_backpointers, 12, PASS_ONLINE|PASS_FSCK) \ 32 - x(check_backpointers_to_extents, 13, PASS_ONLINE|PASS_FSCK) \ 33 - x(check_extents_to_backpointers, 14, PASS_ONLINE|PASS_FSCK) \ 34 - x(check_alloc_to_lru_refs, 15, PASS_ONLINE|PASS_FSCK) \ 35 - x(fs_freespace_init, 16, PASS_ALWAYS|PASS_SILENT) \ 36 - x(bucket_gens_init, 17, 0) \ 37 - x(reconstruct_snapshots, 38, 0) \ 38 - x(check_snapshot_trees, 18, PASS_ONLINE|PASS_FSCK) \ 39 - x(check_snapshots, 19, PASS_ONLINE|PASS_FSCK) \ 40 - x(check_subvols, 20, PASS_ONLINE|PASS_FSCK) \ 41 - x(check_subvol_children, 35, PASS_ONLINE|PASS_FSCK) \ 42 - x(delete_dead_snapshots, 21, PASS_ONLINE|PASS_FSCK) \ 43 - x(fs_upgrade_for_subvolumes, 22, 0) \ 44 - x(check_inodes, 24, PASS_FSCK) \ 45 - x(check_extents, 25, PASS_FSCK) \ 46 - x(check_indirect_extents, 26, PASS_FSCK) \ 47 - x(check_dirents, 27, PASS_FSCK) \ 48 - x(check_xattrs, 28, PASS_FSCK) \ 49 - x(check_root, 29, PASS_ONLINE|PASS_FSCK) \ 50 - 
x(check_unreachable_inodes, 40, PASS_ONLINE|PASS_FSCK) \ 51 - x(check_subvolume_structure, 36, PASS_ONLINE|PASS_FSCK) \ 52 - x(check_directory_structure, 30, PASS_ONLINE|PASS_FSCK) \ 53 - x(check_nlinks, 31, PASS_FSCK) \ 54 - x(resume_logged_ops, 23, PASS_ALWAYS) \ 55 - x(delete_dead_inodes, 32, PASS_ALWAYS) \ 56 - x(fix_reflink_p, 33, 0) \ 57 - x(set_fs_needs_rebalance, 34, 0) \ 21 + #define BCH_RECOVERY_PASSES() \ 22 + x(recovery_pass_empty, 41, PASS_SILENT) \ 23 + x(scan_for_btree_nodes, 37, 0) \ 24 + x(check_topology, 4, 0) \ 25 + x(accounting_read, 39, PASS_ALWAYS) \ 26 + x(alloc_read, 0, PASS_ALWAYS) \ 27 + x(stripes_read, 1, PASS_ALWAYS) \ 28 + x(initialize_subvolumes, 2, 0) \ 29 + x(snapshots_read, 3, PASS_ALWAYS) \ 30 + x(check_allocations, 5, PASS_FSCK) \ 31 + x(trans_mark_dev_sbs, 6, PASS_ALWAYS|PASS_SILENT) \ 32 + x(fs_journal_alloc, 7, PASS_ALWAYS|PASS_SILENT) \ 33 + x(set_may_go_rw, 8, PASS_ALWAYS|PASS_SILENT) \ 34 + x(journal_replay, 9, PASS_ALWAYS) \ 35 + x(check_alloc_info, 10, PASS_ONLINE|PASS_FSCK) \ 36 + x(check_lrus, 11, PASS_ONLINE|PASS_FSCK) \ 37 + x(check_btree_backpointers, 12, PASS_ONLINE|PASS_FSCK) \ 38 + x(check_backpointers_to_extents, 13, PASS_ONLINE|PASS_FSCK_DEBUG) \ 39 + x(check_extents_to_backpointers, 14, PASS_ONLINE|PASS_FSCK) \ 40 + x(check_alloc_to_lru_refs, 15, PASS_ONLINE|PASS_FSCK) \ 41 + x(fs_freespace_init, 16, PASS_ALWAYS|PASS_SILENT) \ 42 + x(bucket_gens_init, 17, 0) \ 43 + x(reconstruct_snapshots, 38, 0) \ 44 + x(check_snapshot_trees, 18, PASS_ONLINE|PASS_FSCK) \ 45 + x(check_snapshots, 19, PASS_ONLINE|PASS_FSCK) \ 46 + x(check_subvols, 20, PASS_ONLINE|PASS_FSCK) \ 47 + x(check_subvol_children, 35, PASS_ONLINE|PASS_FSCK) \ 48 + x(delete_dead_snapshots, 21, PASS_ONLINE|PASS_FSCK) \ 49 + x(fs_upgrade_for_subvolumes, 22, 0) \ 50 + x(check_inodes, 24, PASS_FSCK) \ 51 + x(check_extents, 25, PASS_FSCK) \ 52 + x(check_indirect_extents, 26, PASS_ONLINE|PASS_FSCK) \ 53 + x(check_dirents, 27, PASS_FSCK) \ 54 + x(check_xattrs, 28, 
PASS_FSCK) \ 55 + x(check_root, 29, PASS_ONLINE|PASS_FSCK) \ 56 + x(check_unreachable_inodes, 40, PASS_FSCK) \ 57 + x(check_subvolume_structure, 36, PASS_ONLINE|PASS_FSCK) \ 58 + x(check_directory_structure, 30, PASS_ONLINE|PASS_FSCK) \ 59 + x(check_nlinks, 31, PASS_FSCK) \ 60 + x(resume_logged_ops, 23, PASS_ALWAYS) \ 61 + x(delete_dead_inodes, 32, PASS_ALWAYS) \ 62 + x(fix_reflink_p, 33, 0) \ 63 + x(set_fs_needs_rebalance, 34, 0) 58 64 59 65 /* We normally enumerate recovery passes in the order we run them: */ 60 66 enum bch_recovery_pass {
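The pass list above pairs each pass name with a fixed numeric id, while the comment on `enum bch_recovery_pass` says the enum itself follows run order. A minimal sketch of that x-macro pattern follows, using a shortened list. All `demo_`/`DEMO_` names, including the `demo_pass_from_stable()` helper, are hypothetical and are not the kernel's own definitions.

```c
#include <stddef.h>

/* Hypothetical flag, standing in for PASS_ALWAYS */
#define DEMO_F_ALWAYS	(1u << 0)

/* Shortened list: x(name, stable_id, flags); list order is run order */
#define DEMO_RECOVERY_PASSES()				\
	x(recovery_pass_empty,	41, 0)			\
	x(scan_for_btree_nodes,	37, 0)			\
	x(check_topology,	 4, 0)			\
	x(accounting_read,	39, DEMO_F_ALWAYS)	\
	x(alloc_read,		 0, DEMO_F_ALWAYS)

/* Run-order enum: values come from position in the list, not from the id */
enum demo_pass {
#define x(n, id, flags)	DEMO_PASS_##n,
	DEMO_RECOVERY_PASSES()
#undef x
	DEMO_PASS_NR
};

/* Run-order index -> stable id */
static const unsigned demo_pass_stable_id[DEMO_PASS_NR] = {
#define x(n, id, flags)	[DEMO_PASS_##n] = id,
	DEMO_RECOVERY_PASSES()
#undef x
};

/* Run-order index -> flags */
static const unsigned demo_pass_flags[DEMO_PASS_NR] = {
#define x(n, id, flags)	[DEMO_PASS_##n] = flags,
	DEMO_RECOVERY_PASSES()
#undef x
};

/* Stable id -> run-order index, or -1 if the id is unknown */
static int demo_pass_from_stable(unsigned id)
{
	for (size_t i = 0; i < DEMO_PASS_NR; i++)
		if (demo_pass_stable_id[i] == id)
			return (int)i;
	return -1;
}
```

With this layout a pass can be moved in the list, as `recovery_pass_empty` (id 41) sits first here, without changing the id that identifies it.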
+378 -120
fs/bcachefs/reflink.c
··· 15 15 16 16 #include <linux/sched/signal.h> 17 17 18 + static inline bool bkey_extent_is_reflink_data(const struct bkey *k) 19 + { 20 + switch (k->type) { 21 + case KEY_TYPE_reflink_v: 22 + case KEY_TYPE_indirect_inline_data: 23 + return true; 24 + default: 25 + return false; 26 + } 27 + } 28 + 18 29 static inline unsigned bkey_type_to_indirect(const struct bkey *k) 19 30 { 20 31 switch (k->type) { ··· 41 30 /* reflink pointers */ 42 31 43 32 int bch2_reflink_p_validate(struct bch_fs *c, struct bkey_s_c k, 44 - enum bch_validate_flags flags) 33 + struct bkey_validate_context from) 45 34 { 46 35 struct bkey_s_c_reflink_p p = bkey_s_c_to_reflink_p(k); 47 36 int ret = 0; 48 37 49 - bkey_fsck_err_on(le64_to_cpu(p.v->idx) < le32_to_cpu(p.v->front_pad), 38 + bkey_fsck_err_on(REFLINK_P_IDX(p.v) < le32_to_cpu(p.v->front_pad), 50 39 c, reflink_p_front_pad_bad, 51 40 "idx < front_pad (%llu < %u)", 52 - le64_to_cpu(p.v->idx), le32_to_cpu(p.v->front_pad)); 41 + REFLINK_P_IDX(p.v), le32_to_cpu(p.v->front_pad)); 53 42 fsck_err: 54 43 return ret; 55 44 } ··· 60 49 struct bkey_s_c_reflink_p p = bkey_s_c_to_reflink_p(k); 61 50 62 51 prt_printf(out, "idx %llu front_pad %u back_pad %u", 63 - le64_to_cpu(p.v->idx), 52 + REFLINK_P_IDX(p.v), 64 53 le32_to_cpu(p.v->front_pad), 65 54 le32_to_cpu(p.v->back_pad)); 66 55 } ··· 76 65 */ 77 66 return false; 78 67 79 - if (le64_to_cpu(l.v->idx) + l.k->size != le64_to_cpu(r.v->idx)) 68 + if (REFLINK_P_IDX(l.v) + l.k->size != REFLINK_P_IDX(r.v)) 69 + return false; 70 + 71 + if (REFLINK_P_ERROR(l.v) != REFLINK_P_ERROR(r.v)) 80 72 return false; 81 73 82 74 bch2_key_resize(l.k, l.k->size + r.k->size); 83 75 return true; 84 76 } 85 77 78 + /* indirect extents */ 79 + 80 + int bch2_reflink_v_validate(struct bch_fs *c, struct bkey_s_c k, 81 + struct bkey_validate_context from) 82 + { 83 + int ret = 0; 84 + 85 + bkey_fsck_err_on(bkey_gt(k.k->p, POS(0, REFLINK_P_IDX_MAX)), 86 + c, reflink_v_pos_bad, 87 + "indirect extent above maximum position 
0:%llu", 88 + REFLINK_P_IDX_MAX); 89 + 90 + ret = bch2_bkey_ptrs_validate(c, k, from); 91 + fsck_err: 92 + return ret; 93 + } 94 + 95 + void bch2_reflink_v_to_text(struct printbuf *out, struct bch_fs *c, 96 + struct bkey_s_c k) 97 + { 98 + struct bkey_s_c_reflink_v r = bkey_s_c_to_reflink_v(k); 99 + 100 + prt_printf(out, "refcount: %llu ", le64_to_cpu(r.v->refcount)); 101 + 102 + bch2_bkey_ptrs_to_text(out, c, k); 103 + } 104 + 105 + #if 0 106 + Currently disabled, needs to be debugged: 107 + 108 + bool bch2_reflink_v_merge(struct bch_fs *c, struct bkey_s _l, struct bkey_s_c _r) 109 + { 110 + struct bkey_s_reflink_v l = bkey_s_to_reflink_v(_l); 111 + struct bkey_s_c_reflink_v r = bkey_s_c_to_reflink_v(_r); 112 + 113 + return l.v->refcount == r.v->refcount && bch2_extent_merge(c, _l, _r); 114 + } 115 + #endif 116 + 117 + /* indirect inline data */ 118 + 119 + int bch2_indirect_inline_data_validate(struct bch_fs *c, struct bkey_s_c k, 120 + struct bkey_validate_context from) 121 + { 122 + return 0; 123 + } 124 + 125 + void bch2_indirect_inline_data_to_text(struct printbuf *out, 126 + struct bch_fs *c, struct bkey_s_c k) 127 + { 128 + struct bkey_s_c_indirect_inline_data d = bkey_s_c_to_indirect_inline_data(k); 129 + unsigned datalen = bkey_inline_data_bytes(k.k); 130 + 131 + prt_printf(out, "refcount %llu datalen %u: %*phN", 132 + le64_to_cpu(d.v->refcount), datalen, 133 + min(datalen, 32U), d.v->data); 134 + } 135 + 136 + /* lookup */ 137 + 138 + static int bch2_indirect_extent_not_missing(struct btree_trans *trans, struct bkey_s_c_reflink_p p, 139 + bool should_commit) 140 + { 141 + struct bkey_i_reflink_p *new = bch2_bkey_make_mut_noupdate_typed(trans, p.s_c, reflink_p); 142 + int ret = PTR_ERR_OR_ZERO(new); 143 + if (ret) 144 + return ret; 145 + 146 + SET_REFLINK_P_ERROR(&new->v, false); 147 + ret = bch2_btree_insert_trans(trans, BTREE_ID_extents, &new->k_i, BTREE_TRIGGER_norun); 148 + if (ret) 149 + return ret; 150 + 151 + if (!should_commit) 152 + return 0; 153 
+ 154 + return bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc) ?: 155 + -BCH_ERR_transaction_restart_nested; 156 + } 157 + 158 + static int bch2_indirect_extent_missing_error(struct btree_trans *trans, 159 + struct bkey_s_c_reflink_p p, 160 + u64 missing_start, u64 missing_end, 161 + bool should_commit) 162 + { 163 + if (REFLINK_P_ERROR(p.v)) 164 + return -BCH_ERR_missing_indirect_extent; 165 + 166 + struct bch_fs *c = trans->c; 167 + u64 live_start = REFLINK_P_IDX(p.v); 168 + u64 live_end = REFLINK_P_IDX(p.v) + p.k->size; 169 + u64 refd_start = live_start - le32_to_cpu(p.v->front_pad); 170 + u64 refd_end = live_end + le32_to_cpu(p.v->back_pad); 171 + struct printbuf buf = PRINTBUF; 172 + int ret = 0; 173 + 174 + BUG_ON(missing_start < refd_start); 175 + BUG_ON(missing_end > refd_end); 176 + 177 + if (fsck_err(trans, reflink_p_to_missing_reflink_v, 178 + "pointer to missing indirect extent\n" 179 + " %s\n" 180 + " missing range %llu-%llu", 181 + (bch2_bkey_val_to_text(&buf, c, p.s_c), buf.buf), 182 + missing_start, missing_end)) { 183 + struct bkey_i_reflink_p *new = bch2_bkey_make_mut_noupdate_typed(trans, p.s_c, reflink_p); 184 + ret = PTR_ERR_OR_ZERO(new); 185 + if (ret) 186 + goto err; 187 + 188 + /* 189 + * Is the missing range not actually needed? 190 + * 191 + * p.v->idx refers to the data that we actually want, but if the 192 + * indirect extent we point to was bigger, front_pad and back_pad 193 + * indicate the range we took a reference on. 
194 + */ 195 + 196 + if (missing_end <= live_start) { 197 + new->v.front_pad = cpu_to_le32(live_start - missing_end); 198 + } else if (missing_start >= live_end) { 199 + new->v.back_pad = cpu_to_le32(missing_start - live_end); 200 + } else { 201 + struct bpos new_start = bkey_start_pos(&new->k); 202 + struct bpos new_end = new->k.p; 203 + 204 + if (missing_start > live_start) 205 + new_start.offset += missing_start - live_start; 206 + if (missing_end < live_end) 207 + new_end.offset -= live_end - missing_end; 208 + 209 + bch2_cut_front(new_start, &new->k_i); 210 + bch2_cut_back(new_end, &new->k_i); 211 + 212 + SET_REFLINK_P_ERROR(&new->v, true); 213 + } 214 + 215 + ret = bch2_btree_insert_trans(trans, BTREE_ID_extents, &new->k_i, BTREE_TRIGGER_norun); 216 + if (ret) 217 + goto err; 218 + 219 + if (should_commit) 220 + ret = bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc) ?: 221 + -BCH_ERR_transaction_restart_nested; 222 + } 223 + err: 224 + fsck_err: 225 + printbuf_exit(&buf); 226 + return ret; 227 + } 228 + 229 + /* 230 + * This is used from the read path, which doesn't expect to have to do a 231 + * transaction commit, and from triggers, which should not be doing a commit: 232 + */ 233 + struct bkey_s_c bch2_lookup_indirect_extent(struct btree_trans *trans, 234 + struct btree_iter *iter, 235 + s64 *offset_into_extent, 236 + struct bkey_s_c_reflink_p p, 237 + bool should_commit, 238 + unsigned iter_flags) 239 + { 240 + BUG_ON(*offset_into_extent < -((s64) le32_to_cpu(p.v->front_pad))); 241 + BUG_ON(*offset_into_extent >= p.k->size + le32_to_cpu(p.v->back_pad)); 242 + 243 + u64 reflink_offset = REFLINK_P_IDX(p.v) + *offset_into_extent; 244 + 245 + struct bkey_s_c k = bch2_bkey_get_iter(trans, iter, BTREE_ID_reflink, 246 + POS(0, reflink_offset), iter_flags); 247 + if (bkey_err(k)) 248 + return k; 249 + 250 + if (unlikely(!bkey_extent_is_reflink_data(k.k))) { 251 + bch2_trans_iter_exit(trans, iter); 252 + 253 + unsigned size = min((u64) k.k->size, 
254 + REFLINK_P_IDX(p.v) + p.k->size + le32_to_cpu(p.v->back_pad) - 255 + reflink_offset); 256 + bch2_key_resize(&iter->k, size); 257 + 258 + int ret = bch2_indirect_extent_missing_error(trans, p, reflink_offset, 259 + k.k->p.offset, should_commit); 260 + if (ret) 261 + return bkey_s_c_err(ret); 262 + } else if (unlikely(REFLINK_P_ERROR(p.v))) { 263 + bch2_trans_iter_exit(trans, iter); 264 + 265 + int ret = bch2_indirect_extent_not_missing(trans, p, should_commit); 266 + if (ret) 267 + return bkey_s_c_err(ret); 268 + } 269 + 270 + *offset_into_extent = reflink_offset - bkey_start_offset(k.k); 271 + return k; 272 + } 273 + 274 + /* reflink pointer trigger */ 275 + 86 276 static int trans_trigger_reflink_p_segment(struct btree_trans *trans, 87 277 struct bkey_s_c_reflink_p p, u64 *idx, 88 278 enum btree_iter_update_trigger_flags flags) 89 279 { 90 280 struct bch_fs *c = trans->c; 91 - struct btree_iter iter; 92 - struct bkey_i *k; 93 - __le64 *refcount; 94 - int add = !(flags & BTREE_TRIGGER_overwrite) ? 
1 : -1; 95 281 struct printbuf buf = PRINTBUF; 96 - int ret; 97 282 98 - k = bch2_bkey_get_mut_noupdate(trans, &iter, 99 - BTREE_ID_reflink, POS(0, *idx), 100 - BTREE_ITER_with_updates); 101 - ret = PTR_ERR_OR_ZERO(k); 283 + s64 offset_into_extent = *idx - REFLINK_P_IDX(p.v); 284 + struct btree_iter iter; 285 + struct bkey_s_c k = bch2_lookup_indirect_extent(trans, &iter, &offset_into_extent, p, false, 286 + BTREE_ITER_intent| 287 + BTREE_ITER_with_updates); 288 + int ret = bkey_err(k); 289 + if (ret) 290 + return ret; 291 + 292 + if (bkey_deleted(k.k)) { 293 + if (!(flags & BTREE_TRIGGER_overwrite)) 294 + ret = -BCH_ERR_missing_indirect_extent; 295 + goto next; 296 + } 297 + 298 + struct bkey_i *new = bch2_bkey_make_mut_noupdate(trans, k); 299 + ret = PTR_ERR_OR_ZERO(new); 102 300 if (ret) 103 301 goto err; 104 302 105 - refcount = bkey_refcount(bkey_i_to_s(k)); 106 - if (!refcount) { 107 - bch2_bkey_val_to_text(&buf, c, p.s_c); 108 - bch2_trans_inconsistent(trans, 109 - "nonexistent indirect extent at %llu while marking\n %s", 110 - *idx, buf.buf); 111 - ret = -EIO; 112 - goto err; 113 - } 114 - 303 + __le64 *refcount = bkey_refcount(bkey_i_to_s(new)); 115 304 if (!*refcount && (flags & BTREE_TRIGGER_overwrite)) { 116 305 bch2_bkey_val_to_text(&buf, c, p.s_c); 117 - bch2_trans_inconsistent(trans, 118 - "indirect extent refcount underflow at %llu while marking\n %s", 119 - *idx, buf.buf); 120 - ret = -EIO; 121 - goto err; 306 + prt_printf(&buf, "\n "); 307 + bch2_bkey_val_to_text(&buf, c, k); 308 + log_fsck_err(trans, reflink_refcount_underflow, 309 + "indirect extent refcount underflow while marking\n %s", 310 + buf.buf); 311 + goto next; 122 312 } 123 313 124 314 if (flags & BTREE_TRIGGER_insert) { ··· 327 115 u64 pad; 328 116 329 117 pad = max_t(s64, le32_to_cpu(v->front_pad), 330 - le64_to_cpu(v->idx) - bkey_start_offset(&k->k)); 118 + REFLINK_P_IDX(v) - bkey_start_offset(&new->k)); 331 119 BUG_ON(pad > U32_MAX); 332 120 v->front_pad = cpu_to_le32(pad); 333 
121 334 122 pad = max_t(s64, le32_to_cpu(v->back_pad), 335 - k->k.p.offset - p.k->size - le64_to_cpu(v->idx)); 123 + new->k.p.offset - p.k->size - REFLINK_P_IDX(v)); 336 124 BUG_ON(pad > U32_MAX); 337 125 v->back_pad = cpu_to_le32(pad); 338 126 } 339 127 340 - le64_add_cpu(refcount, add); 128 + le64_add_cpu(refcount, !(flags & BTREE_TRIGGER_overwrite) ? 1 : -1); 341 129 342 130 bch2_btree_iter_set_pos_to_extent_start(&iter); 343 - ret = bch2_trans_update(trans, &iter, k, 0); 131 + ret = bch2_trans_update(trans, &iter, new, 0); 344 132 if (ret) 345 133 goto err; 346 - 347 - *idx = k->k.p.offset; 134 + next: 135 + *idx = k.k->p.offset; 348 136 err: 137 + fsck_err: 349 138 bch2_trans_iter_exit(trans, &iter); 350 139 printbuf_exit(&buf); 351 140 return ret; ··· 360 147 struct bch_fs *c = trans->c; 361 148 struct reflink_gc *r; 362 149 int add = !(flags & BTREE_TRIGGER_overwrite) ? 1 : -1; 363 - u64 start = le64_to_cpu(p.v->idx); 364 - u64 end = le64_to_cpu(p.v->idx) + p.k->size; 365 - u64 next_idx = end + le32_to_cpu(p.v->back_pad); 150 + u64 next_idx = REFLINK_P_IDX(p.v) + p.k->size + le32_to_cpu(p.v->back_pad); 366 151 s64 ret = 0; 367 152 struct printbuf buf = PRINTBUF; 368 153 ··· 379 168 *idx = r->offset; 380 169 return 0; 381 170 not_found: 382 - BUG_ON(!(flags & BTREE_TRIGGER_check_repair)); 383 - 384 - if (fsck_err(trans, reflink_p_to_missing_reflink_v, 385 - "pointer to missing indirect extent\n" 386 - " %s\n" 387 - " missing range %llu-%llu", 388 - (bch2_bkey_val_to_text(&buf, c, p.s_c), buf.buf), 389 - *idx, next_idx)) { 390 - struct bkey_i *update = bch2_bkey_make_mut_noupdate(trans, p.s_c); 391 - ret = PTR_ERR_OR_ZERO(update); 171 + if (flags & BTREE_TRIGGER_check_repair) { 172 + ret = bch2_indirect_extent_missing_error(trans, p, *idx, next_idx, false); 392 173 if (ret) 393 174 goto err; 394 - 395 - if (next_idx <= start) { 396 - bkey_i_to_reflink_p(update)->v.front_pad = cpu_to_le32(start - next_idx); 397 - } else if (*idx >= end) { 398 - 
bkey_i_to_reflink_p(update)->v.back_pad = cpu_to_le32(*idx - end); 399 - } else { 400 - bkey_error_init(update); 401 - update->k.p = p.k->p; 402 - update->k.size = p.k->size; 403 - set_bkey_val_u64s(&update->k, 0); 404 - } 405 - 406 - ret = bch2_btree_insert_trans(trans, BTREE_ID_extents, update, BTREE_TRIGGER_norun); 407 175 } 408 176 409 177 *idx = next_idx; 410 178 err: 411 - fsck_err: 412 179 printbuf_exit(&buf); 413 180 return ret; 414 181 } ··· 399 210 struct bkey_s_c_reflink_p p = bkey_s_c_to_reflink_p(k); 400 211 int ret = 0; 401 212 402 - u64 idx = le64_to_cpu(p.v->idx) - le32_to_cpu(p.v->front_pad); 403 - u64 end = le64_to_cpu(p.v->idx) + p.k->size + le32_to_cpu(p.v->back_pad); 213 + u64 idx = REFLINK_P_IDX(p.v) - le32_to_cpu(p.v->front_pad); 214 + u64 end = REFLINK_P_IDX(p.v) + p.k->size + le32_to_cpu(p.v->back_pad); 404 215 405 216 if (flags & BTREE_TRIGGER_transactional) { 406 217 while (idx < end && !ret) ··· 442 253 return trigger_run_overwrite_then_insert(__trigger_reflink_p, trans, btree_id, level, old, new, flags); 443 254 } 444 255 445 - /* indirect extents */ 446 - 447 - int bch2_reflink_v_validate(struct bch_fs *c, struct bkey_s_c k, 448 - enum bch_validate_flags flags) 449 - { 450 - return bch2_bkey_ptrs_validate(c, k, flags); 451 - } 452 - 453 - void bch2_reflink_v_to_text(struct printbuf *out, struct bch_fs *c, 454 - struct bkey_s_c k) 455 - { 456 - struct bkey_s_c_reflink_v r = bkey_s_c_to_reflink_v(k); 457 - 458 - prt_printf(out, "refcount: %llu ", le64_to_cpu(r.v->refcount)); 459 - 460 - bch2_bkey_ptrs_to_text(out, c, k); 461 - } 462 - 463 - #if 0 464 - Currently disabled, needs to be debugged: 465 - 466 - bool bch2_reflink_v_merge(struct bch_fs *c, struct bkey_s _l, struct bkey_s_c _r) 467 - { 468 - struct bkey_s_reflink_v l = bkey_s_to_reflink_v(_l); 469 - struct bkey_s_c_reflink_v r = bkey_s_c_to_reflink_v(_r); 470 - 471 - return l.v->refcount == r.v->refcount && bch2_extent_merge(c, _l, _r); 472 - } 473 - #endif 256 + /* indirect 
extent trigger */ 474 257 475 258 static inline void 476 259 check_indirect_extent_deleting(struct bkey_s new, ··· 468 307 return bch2_trigger_extent(trans, btree_id, level, old, new, flags); 469 308 } 470 309 471 - /* indirect inline data */ 472 - 473 - int bch2_indirect_inline_data_validate(struct bch_fs *c, struct bkey_s_c k, 474 - enum bch_validate_flags flags) 475 - { 476 - return 0; 477 - } 478 - 479 - void bch2_indirect_inline_data_to_text(struct printbuf *out, 480 - struct bch_fs *c, struct bkey_s_c k) 481 - { 482 - struct bkey_s_c_indirect_inline_data d = bkey_s_c_to_indirect_inline_data(k); 483 - unsigned datalen = bkey_inline_data_bytes(k.k); 484 - 485 - prt_printf(out, "refcount %llu datalen %u: %*phN", 486 - le64_to_cpu(d.v->refcount), datalen, 487 - min(datalen, 32U), d.v->data); 488 - } 489 - 490 310 int bch2_trigger_indirect_inline_data(struct btree_trans *trans, 491 311 enum btree_id btree_id, unsigned level, 492 312 struct bkey_s_c old, struct bkey_s new, ··· 478 336 return 0; 479 337 } 480 338 339 + /* create */ 340 + 481 341 static int bch2_make_extent_indirect(struct btree_trans *trans, 482 342 struct btree_iter *extent_iter, 483 - struct bkey_i *orig) 343 + struct bkey_i *orig, 344 + bool reflink_p_may_update_opts_field) 484 345 { 485 346 struct bch_fs *c = trans->c; 486 347 struct btree_iter reflink_iter = { NULL }; ··· 502 357 ret = bkey_err(k); 503 358 if (ret) 504 359 goto err; 360 + 361 + /* 362 + * XXX: we're assuming that 56 bits will be enough for the life of the 363 + * filesystem: we need to implement wraparound, with a cursor in the 364 + * logged ops btree: 365 + */ 366 + if (bkey_ge(reflink_iter.pos, POS(0, REFLINK_P_IDX_MAX - orig->k.size))) 367 + return -ENOSPC; 505 368 506 369 r_v = bch2_trans_kmalloc(trans, sizeof(__le64) + bkey_bytes(&orig->k)); 507 370 ret = PTR_ERR_OR_ZERO(r_v); ··· 547 394 memset(&r_p->v, 0, sizeof(r_p->v)); 548 395 #endif 549 396 550 - r_p->v.idx = cpu_to_le64(bkey_start_offset(&r_v->k)); 397 + 
SET_REFLINK_P_IDX(&r_p->v, bkey_start_offset(&r_v->k)); 398 + 399 + if (reflink_p_may_update_opts_field) 400 + SET_REFLINK_P_MAY_UPDATE_OPTIONS(&r_p->v, true); 551 401 552 402 ret = bch2_trans_update(trans, extent_iter, &r_p->k_i, 553 403 BTREE_UPDATE_internal_snapshot_node); ··· 565 409 struct bkey_s_c k; 566 410 int ret; 567 411 568 - for_each_btree_key_upto_continue_norestart(*iter, end, 0, k, ret) { 412 + for_each_btree_key_max_continue_norestart(*iter, end, 0, k, ret) { 569 413 if (bkey_extent_is_unwritten(k)) 570 414 continue; 571 415 ··· 582 426 subvol_inum dst_inum, u64 dst_offset, 583 427 subvol_inum src_inum, u64 src_offset, 584 428 u64 remap_sectors, 585 - u64 new_i_size, s64 *i_sectors_delta) 429 + u64 new_i_size, s64 *i_sectors_delta, 430 + bool may_change_src_io_path_opts) 586 431 { 587 432 struct btree_trans *trans; 588 433 struct btree_iter dst_iter, src_iter; ··· 596 439 struct bpos src_want; 597 440 u64 dst_done = 0; 598 441 u32 dst_snapshot, src_snapshot; 442 + bool reflink_p_may_update_opts_field = 443 + bch2_request_incompat_feature(c, bcachefs_metadata_version_reflink_p_may_update_opts); 599 444 int ret = 0, ret2 = 0; 600 445 601 446 if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_reflink)) ··· 679 520 src_k = bkey_i_to_s_c(new_src.k); 680 521 681 522 ret = bch2_make_extent_indirect(trans, &src_iter, 682 - new_src.k); 523 + new_src.k, 524 + reflink_p_may_update_opts_field); 683 525 if (ret) 684 526 continue; 685 527 ··· 693 533 struct bkey_i_reflink_p *dst_p = 694 534 bkey_reflink_p_init(new_dst.k); 695 535 696 - u64 offset = le64_to_cpu(src_p.v->idx) + 536 + u64 offset = REFLINK_P_IDX(src_p.v) + 697 537 (src_want.offset - 698 538 bkey_start_offset(src_k.k)); 699 539 700 - dst_p->v.idx = cpu_to_le64(offset); 540 + SET_REFLINK_P_IDX(&dst_p->v, offset); 541 + 542 + if (reflink_p_may_update_opts_field && 543 + may_change_src_io_path_opts) 544 + SET_REFLINK_P_MAY_UPDATE_OPTIONS(&dst_p->v, true); 701 545 } else { 702 546 BUG(); 703 547 } ··· 711 547 
min(src_k.k->p.offset - src_want.offset, 712 548 dst_end.offset - dst_iter.pos.offset)); 713 549 714 - ret = bch2_bkey_set_needs_rebalance(c, new_dst.k, &opts) ?: 550 + ret = bch2_bkey_set_needs_rebalance(c, &opts, new_dst.k) ?: 715 551 bch2_extent_update(trans, dst_inum, &dst_iter, 716 552 new_dst.k, &disk_res, 717 553 new_i_size, i_sectors_delta, ··· 754 590 bch2_write_ref_put(c, BCH_WRITE_REF_reflink); 755 591 756 592 return dst_done ?: ret ?: ret2; 593 + } 594 + 595 + /* fsck */ 596 + 597 + static int bch2_gc_write_reflink_key(struct btree_trans *trans, 598 + struct btree_iter *iter, 599 + struct bkey_s_c k, 600 + size_t *idx) 601 + { 602 + struct bch_fs *c = trans->c; 603 + const __le64 *refcount = bkey_refcount_c(k); 604 + struct printbuf buf = PRINTBUF; 605 + struct reflink_gc *r; 606 + int ret = 0; 607 + 608 + if (!refcount) 609 + return 0; 610 + 611 + while ((r = genradix_ptr(&c->reflink_gc_table, *idx)) && 612 + r->offset < k.k->p.offset) 613 + ++*idx; 614 + 615 + if (!r || 616 + r->offset != k.k->p.offset || 617 + r->size != k.k->size) { 618 + bch_err(c, "unexpected inconsistency walking reflink table at gc finish"); 619 + return -EINVAL; 620 + } 621 + 622 + if (fsck_err_on(r->refcount != le64_to_cpu(*refcount), 623 + trans, reflink_v_refcount_wrong, 624 + "reflink key has wrong refcount:\n" 625 + " %s\n" 626 + " should be %u", 627 + (bch2_bkey_val_to_text(&buf, c, k), buf.buf), 628 + r->refcount)) { 629 + struct bkey_i *new = bch2_bkey_make_mut_noupdate(trans, k); 630 + ret = PTR_ERR_OR_ZERO(new); 631 + if (ret) 632 + goto out; 633 + 634 + if (!r->refcount) 635 + new->k.type = KEY_TYPE_deleted; 636 + else 637 + *bkey_refcount(bkey_i_to_s(new)) = cpu_to_le64(r->refcount); 638 + ret = bch2_trans_update(trans, iter, new, 0); 639 + } 640 + out: 641 + fsck_err: 642 + printbuf_exit(&buf); 643 + return ret; 644 + } 645 + 646 + int bch2_gc_reflink_done(struct bch_fs *c) 647 + { 648 + size_t idx = 0; 649 + 650 + int ret = bch2_trans_run(c, 651 + 
for_each_btree_key_commit(trans, iter, 652 + BTREE_ID_reflink, POS_MIN, 653 + BTREE_ITER_prefetch, k, 654 + NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 655 + bch2_gc_write_reflink_key(trans, &iter, k, &idx))); 656 + c->reflink_gc_nr = 0; 657 + return ret; 658 + } 659 + 660 + int bch2_gc_reflink_start(struct bch_fs *c) 661 + { 662 + c->reflink_gc_nr = 0; 663 + 664 + int ret = bch2_trans_run(c, 665 + for_each_btree_key(trans, iter, BTREE_ID_reflink, POS_MIN, 666 + BTREE_ITER_prefetch, k, ({ 667 + const __le64 *refcount = bkey_refcount_c(k); 668 + 669 + if (!refcount) 670 + continue; 671 + 672 + struct reflink_gc *r = genradix_ptr_alloc(&c->reflink_gc_table, 673 + c->reflink_gc_nr++, GFP_KERNEL); 674 + if (!r) { 675 + ret = -BCH_ERR_ENOMEM_gc_reflink_start; 676 + break; 677 + } 678 + 679 + r->offset = k.k->p.offset; 680 + r->size = k.k->size; 681 + r->refcount = 0; 682 + 0; 683 + }))); 684 + 685 + bch_err_fn(c, ret); 686 + return ret; 757 687 }
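The three-way decision in `bch2_indirect_extent_missing_error()` (missing range entirely in the front pad, entirely in the back pad, or overlapping live data) can be sketched in isolation. The struct and function below are hypothetical stand-ins with plain integer fields. In particular, modelling the key trimming as an adjustment of `idx` and `size` is an assumption about what the cut helpers do, not a copy of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the reflink pointer fields used by the repair path */
struct demo_reflink_p {
	uint64_t idx;		/* start of the live range in the reflink btree */
	uint32_t size;		/* length of the live range */
	uint32_t front_pad;	/* referenced but unused range before idx */
	uint32_t back_pad;	/* referenced but unused range after idx + size */
	bool	 error;		/* set when live data is missing */
};

static void demo_missing_range(struct demo_reflink_p *p,
			       uint64_t missing_start, uint64_t missing_end)
{
	uint64_t live_start = p->idx;
	uint64_t live_end   = p->idx + p->size;

	/* The missing range must lie inside the referenced range */
	assert(missing_start >= live_start - p->front_pad);
	assert(missing_end   <= live_end + p->back_pad);

	if (missing_end <= live_start) {
		/* Only padding in front is missing: shrink front_pad */
		p->front_pad = (uint32_t)(live_start - missing_end);
	} else if (missing_start >= live_end) {
		/* Only padding behind is missing: shrink back_pad */
		p->back_pad = (uint32_t)(missing_start - live_end);
	} else {
		/* Live data is missing: narrow to the missing part and flag it */
		uint64_t new_start = missing_start > live_start ? missing_start : live_start;
		uint64_t new_end   = missing_end < live_end ? missing_end : live_end;

		p->idx   = new_start;
		p->size  = (uint32_t)(new_end - new_start);
		p->error = true;
	}
}
```

Only the third case marks the pointer as an error. The first two shrink the reference so that it no longer covers the missing range.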
+14 -6
fs/bcachefs/reflink.h
···
 #ifndef _BCACHEFS_REFLINK_H
 #define _BCACHEFS_REFLINK_H

-enum bch_validate_flags;
-
-int bch2_reflink_p_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_reflink_p_validate(struct bch_fs *, struct bkey_s_c,
+			    struct bkey_validate_context);
 void bch2_reflink_p_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 bool bch2_reflink_p_merge(struct bch_fs *, struct bkey_s, struct bkey_s_c);
 int bch2_trigger_reflink_p(struct btree_trans *, enum btree_id, unsigned,
···
 	.min_val_size	= 16,				\
 })

-int bch2_reflink_v_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_reflink_v_validate(struct bch_fs *, struct bkey_s_c,
+			    struct bkey_validate_context);
 void bch2_reflink_v_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 int bch2_trigger_reflink_v(struct btree_trans *, enum btree_id, unsigned,
 			   struct bkey_s_c, struct bkey_s,
···
 })

 int bch2_indirect_inline_data_validate(struct bch_fs *, struct bkey_s_c,
-				       enum bch_validate_flags);
+				       struct bkey_validate_context);
 void bch2_indirect_inline_data_to_text(struct printbuf *,
 				       struct bch_fs *, struct bkey_s_c);
 int bch2_trigger_indirect_inline_data(struct btree_trans *,
···
 	}
 }

+struct bkey_s_c bch2_lookup_indirect_extent(struct btree_trans *, struct btree_iter *,
+					    s64 *, struct bkey_s_c_reflink_p,
+					    bool, unsigned);
+
 s64 bch2_remap_range(struct bch_fs *, subvol_inum, u64,
-		     subvol_inum, u64, u64, u64, s64 *);
+		     subvol_inum, u64, u64, u64, s64 *,
+		     bool);
+
+int bch2_gc_reflink_done(struct bch_fs *);
+int bch2_gc_reflink_start(struct bch_fs *);

 #endif /* _BCACHEFS_REFLINK_H */
+5 -1
fs/bcachefs/sb-clean.c
···
 int bch2_sb_clean_validate_late(struct bch_fs *c, struct bch_sb_field_clean *clean,
 				int write)
 {
+	struct bkey_validate_context from = {
+		.flags = write,
+		.from = BKEY_VALIDATE_superblock,
+	};
 	struct jset_entry *entry;
 	int ret;

···
 		ret = bch2_journal_entry_validate(c, NULL, entry,
 						  le16_to_cpu(c->disk_sb.sb->version),
 						  BCH_SB_BIG_ENDIAN(c->disk_sb.sb),
-						  write);
+						  from);
 		if (ret)
 			return ret;
 	}
+85 -80
fs/bcachefs/sb-counters_format.h
···
 #ifndef _BCACHEFS_SB_COUNTERS_FORMAT_H
 #define _BCACHEFS_SB_COUNTERS_FORMAT_H

-#define BCH_PERSISTENT_COUNTERS() \
-	x(io_read, 0) \
-	x(io_write, 1) \
-	x(io_move, 2) \
-	x(bucket_invalidate, 3) \
-	x(bucket_discard, 4) \
-	x(bucket_alloc, 5) \
-	x(bucket_alloc_fail, 6) \
-	x(btree_cache_scan, 7) \
-	x(btree_cache_reap, 8) \
-	x(btree_cache_cannibalize, 9) \
-	x(btree_cache_cannibalize_lock, 10) \
-	x(btree_cache_cannibalize_lock_fail, 11) \
-	x(btree_cache_cannibalize_unlock, 12) \
-	x(btree_node_write, 13) \
-	x(btree_node_read, 14) \
-	x(btree_node_compact, 15) \
-	x(btree_node_merge, 16) \
-	x(btree_node_split, 17) \
-	x(btree_node_rewrite, 18) \
-	x(btree_node_alloc, 19) \
-	x(btree_node_free, 20) \
-	x(btree_node_set_root, 21) \
-	x(btree_path_relock_fail, 22) \
-	x(btree_path_upgrade_fail, 23) \
-	x(btree_reserve_get_fail, 24) \
-	x(journal_entry_full, 25) \
-	x(journal_full, 26) \
-	x(journal_reclaim_finish, 27) \
-	x(journal_reclaim_start, 28) \
-	x(journal_write, 29) \
-	x(read_promote, 30) \
-	x(read_bounce, 31) \
-	x(read_split, 33) \
-	x(read_retry, 32) \
-	x(read_reuse_race, 34) \
-	x(move_extent_read, 35) \
-	x(move_extent_write, 36) \
-	x(move_extent_finish, 37) \
-	x(move_extent_fail, 38) \
-	x(move_extent_start_fail, 39) \
-	x(copygc, 40) \
-	x(copygc_wait, 41) \
-	x(gc_gens_end, 42) \
-	x(gc_gens_start, 43) \
-	x(trans_blocked_journal_reclaim, 44) \
-	x(trans_restart_btree_node_reused, 45) \
-	x(trans_restart_btree_node_split, 46) \
-	x(trans_restart_fault_inject, 47) \
-	x(trans_restart_iter_upgrade, 48) \
-	x(trans_restart_journal_preres_get, 49) \
-	x(trans_restart_journal_reclaim, 50) \
-	x(trans_restart_journal_res_get, 51) \
-	x(trans_restart_key_cache_key_realloced, 52) \
-	x(trans_restart_key_cache_raced, 53) \
-	x(trans_restart_mark_replicas, 54) \
-	x(trans_restart_mem_realloced, 55) \
-	x(trans_restart_memory_allocation_failure, 56) \
-	x(trans_restart_relock, 57) \
-	x(trans_restart_relock_after_fill, 58) \
-	x(trans_restart_relock_key_cache_fill, 59) \
-	x(trans_restart_relock_next_node, 60) \
-	x(trans_restart_relock_parent_for_fill, 61) \
-	x(trans_restart_relock_path, 62) \
-	x(trans_restart_relock_path_intent, 63) \
-	x(trans_restart_too_many_iters, 64) \
-	x(trans_restart_traverse, 65) \
-	x(trans_restart_upgrade, 66) \
-	x(trans_restart_would_deadlock, 67) \
-	x(trans_restart_would_deadlock_write, 68) \
-	x(trans_restart_injected, 69) \
-	x(trans_restart_key_cache_upgrade, 70) \
-	x(trans_traverse_all, 71) \
-	x(transaction_commit, 72) \
-	x(write_super, 73) \
-	x(trans_restart_would_deadlock_recursion_limit, 74) \
-	x(trans_restart_write_buffer_flush, 75) \
-	x(trans_restart_split_race, 76) \
-	x(write_buffer_flush_slowpath, 77) \
-	x(write_buffer_flush_sync, 78)
+enum counters_flags {
+	TYPE_COUNTER = BIT(0), /* event counters */
+	TYPE_SECTORS = BIT(1), /* amount counters, the unit is sectors */
+};
+
+#define BCH_PERSISTENT_COUNTERS() \
+	x(io_read, 0, TYPE_SECTORS) \
+	x(io_write, 1, TYPE_SECTORS) \
+	x(io_move, 2, TYPE_SECTORS) \
+	x(bucket_invalidate, 3, TYPE_COUNTER) \
+	x(bucket_discard, 4, TYPE_COUNTER) \
+	x(bucket_alloc, 5, TYPE_COUNTER) \
+	x(bucket_alloc_fail, 6, TYPE_COUNTER) \
+	x(btree_cache_scan, 7, TYPE_COUNTER) \
+	x(btree_cache_reap, 8, TYPE_COUNTER) \
+	x(btree_cache_cannibalize, 9, TYPE_COUNTER) \
+	x(btree_cache_cannibalize_lock, 10, TYPE_COUNTER) \
+	x(btree_cache_cannibalize_lock_fail, 11, TYPE_COUNTER) \
+	x(btree_cache_cannibalize_unlock, 12, TYPE_COUNTER) \
+	x(btree_node_write, 13, TYPE_COUNTER) \
+	x(btree_node_read, 14, TYPE_COUNTER) \
+	x(btree_node_compact, 15, TYPE_COUNTER) \
+	x(btree_node_merge, 16, TYPE_COUNTER) \
+	x(btree_node_split, 17, TYPE_COUNTER) \
+	x(btree_node_rewrite, 18, TYPE_COUNTER) \
+	x(btree_node_alloc, 19, TYPE_COUNTER) \
+	x(btree_node_free, 20, TYPE_COUNTER) \
+	x(btree_node_set_root, 21, TYPE_COUNTER) \
+	x(btree_path_relock_fail, 22, TYPE_COUNTER) \
+	x(btree_path_upgrade_fail, 23, TYPE_COUNTER) \
+	x(btree_reserve_get_fail, 24, TYPE_COUNTER) \
+	x(journal_entry_full, 25, TYPE_COUNTER) \
+	x(journal_full, 26, TYPE_COUNTER) \
+	x(journal_reclaim_finish, 27, TYPE_COUNTER) \
+	x(journal_reclaim_start, 28, TYPE_COUNTER) \
+	x(journal_write, 29, TYPE_COUNTER) \
+	x(read_promote, 30, TYPE_COUNTER) \
+	x(read_bounce, 31, TYPE_COUNTER) \
+	x(read_split, 33, TYPE_COUNTER) \
+	x(read_retry, 32, TYPE_COUNTER) \
+	x(read_reuse_race, 34, TYPE_COUNTER) \
+	x(move_extent_read, 35, TYPE_SECTORS) \
+	x(move_extent_write, 36, TYPE_SECTORS) \
+	x(move_extent_finish, 37, TYPE_SECTORS) \
+	x(move_extent_fail, 38, TYPE_COUNTER) \
+	x(move_extent_start_fail, 39, TYPE_COUNTER) \
+	x(copygc, 40, TYPE_COUNTER) \
+	x(copygc_wait, 41, TYPE_COUNTER) \
+	x(gc_gens_end, 42, TYPE_COUNTER) \
+	x(gc_gens_start, 43, TYPE_COUNTER) \
+	x(trans_blocked_journal_reclaim, 44, TYPE_COUNTER) \
+	x(trans_restart_btree_node_reused, 45, TYPE_COUNTER) \
+	x(trans_restart_btree_node_split, 46, TYPE_COUNTER) \
+	x(trans_restart_fault_inject, 47, TYPE_COUNTER) \
+	x(trans_restart_iter_upgrade, 48, TYPE_COUNTER) \
+	x(trans_restart_journal_preres_get, 49, TYPE_COUNTER) \
+	x(trans_restart_journal_reclaim, 50, TYPE_COUNTER) \
+	x(trans_restart_journal_res_get, 51, TYPE_COUNTER) \
+	x(trans_restart_key_cache_key_realloced, 52, TYPE_COUNTER) \
+	x(trans_restart_key_cache_raced, 53, TYPE_COUNTER) \
+	x(trans_restart_mark_replicas, 54, TYPE_COUNTER) \
+	x(trans_restart_mem_realloced, 55, TYPE_COUNTER) \
+	x(trans_restart_memory_allocation_failure, 56, TYPE_COUNTER) \
+	x(trans_restart_relock, 57, TYPE_COUNTER) \
+	x(trans_restart_relock_after_fill, 58, TYPE_COUNTER) \
+	x(trans_restart_relock_key_cache_fill, 59, TYPE_COUNTER) \
+	x(trans_restart_relock_next_node, 60, TYPE_COUNTER) \
+	x(trans_restart_relock_parent_for_fill, 61, TYPE_COUNTER) \
+	x(trans_restart_relock_path, 62, TYPE_COUNTER) \
+	x(trans_restart_relock_path_intent, 63, TYPE_COUNTER) \
+	x(trans_restart_too_many_iters, 64, TYPE_COUNTER) \
+	x(trans_restart_traverse, 65, TYPE_COUNTER) \
+	x(trans_restart_upgrade, 66, TYPE_COUNTER) \
+	x(trans_restart_would_deadlock, 67, TYPE_COUNTER) \
+	x(trans_restart_would_deadlock_write, 68, TYPE_COUNTER) \
+	x(trans_restart_injected, 69, TYPE_COUNTER) \
+	x(trans_restart_key_cache_upgrade, 70, TYPE_COUNTER) \
+	x(trans_traverse_all, 71, TYPE_COUNTER) \
+	x(transaction_commit, 72, TYPE_COUNTER) \
+	x(write_super, 73, TYPE_COUNTER) \
+	x(trans_restart_would_deadlock_recursion_limit, 74, TYPE_COUNTER) \
+	x(trans_restart_write_buffer_flush, 75, TYPE_COUNTER) \
+	x(trans_restart_split_race, 76, TYPE_COUNTER) \
+	x(write_buffer_flush_slowpath, 77, TYPE_COUNTER) \
+	x(write_buffer_flush_sync, 78, TYPE_COUNTER)

 enum bch_persistent_counters {
 #define x(t, n, ...) BCH_COUNTER_##t,
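The counter list gains a third column here, and the existing `x(t, n, ...)` expansion keeps working because the variadic part swallows it. As a rough sketch of how such a flags column could be consumed, the example below builds a flags table and formats values by type. Every `demo_`/`DEMO_` name is hypothetical, and the 512-byte sector size is an assumption made for the example, not something this diff states.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum demo_counters_flags {
	DEMO_TYPE_COUNTER = 1u << 0,	/* event counters */
	DEMO_TYPE_SECTORS = 1u << 1,	/* amounts, unit is sectors */
};

/* Shortened list: x(name, stable_id, type) */
#define DEMO_PERSISTENT_COUNTERS()			\
	x(io_read,	0, DEMO_TYPE_SECTORS)		\
	x(io_write,	1, DEMO_TYPE_SECTORS)		\
	x(bucket_alloc,	5, DEMO_TYPE_COUNTER)

/* Expansions that ignore the new column stay unchanged */
enum demo_counter {
#define x(t, n, ...)	DEMO_COUNTER_##t,
	DEMO_PERSISTENT_COUNTERS()
#undef x
	DEMO_COUNTER_NR
};

static const char * const demo_counter_names[DEMO_COUNTER_NR] = {
#define x(t, n, ...)	[DEMO_COUNTER_##t] = #t,
	DEMO_PERSISTENT_COUNTERS()
#undef x
};

/* An expansion that picks up the new column */
static const unsigned demo_counter_flags[DEMO_COUNTER_NR] = {
#define x(t, n, flags)	[DEMO_COUNTER_##t] = flags,
	DEMO_PERSISTENT_COUNTERS()
#undef x
};

/* Format one counter; sector amounts are shown in bytes (assumes 512-byte sectors) */
static int demo_counter_to_text(char *buf, size_t len,
				enum demo_counter c, uint64_t v)
{
	if (demo_counter_flags[c] & DEMO_TYPE_SECTORS)
		return snprintf(buf, len, "%s: %llu bytes",
				demo_counter_names[c],
				(unsigned long long)(v << 9));

	return snprintf(buf, len, "%s: %llu events",
			demo_counter_names[c], (unsigned long long)v);
}
```

Keeping the type next to the name in the list means a reporting path does not need a separate per-counter switch to decide how to print a value.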
+26 -2
fs/bcachefs/sb-downgrade.c
···
 	  BCH_FSCK_ERR_accounting_mismatch) \
 	x(inode_has_child_snapshots, \
 	  BIT_ULL(BCH_RECOVERY_PASS_check_inodes), \
-	  BCH_FSCK_ERR_inode_has_child_snapshots_wrong)
+	  BCH_FSCK_ERR_inode_has_child_snapshots_wrong) \
+	x(backpointer_bucket_gen, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
+	  BCH_FSCK_ERR_backpointer_to_missing_ptr, \
+	  BCH_FSCK_ERR_ptr_to_missing_backpointer) \
+	x(disk_accounting_big_endian, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_allocations), \
+	  BCH_FSCK_ERR_accounting_mismatch, \
+	  BCH_FSCK_ERR_accounting_key_replicas_nr_devs_0, \
+	  BCH_FSCK_ERR_accounting_key_junk_at_end) \
+	x(directory_size, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_inodes), \
+	  BCH_FSCK_ERR_directory_size_mismatch) \

 #define DOWNGRADE_TABLE() \
 	x(bucket_stripe_sectors, \
···
 	  BCH_FSCK_ERR_bkey_version_in_future) \
 	x(rebalance_work_acct_fix, \
 	  BIT_ULL(BCH_RECOVERY_PASS_check_allocations), \
-	  BCH_FSCK_ERR_accounting_mismatch)
+	  BCH_FSCK_ERR_accounting_mismatch, \
+	  BCH_FSCK_ERR_accounting_key_replicas_nr_devs_0, \
+	  BCH_FSCK_ERR_accounting_key_junk_at_end) \
+	x(backpointer_bucket_gen, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
+	  BCH_FSCK_ERR_backpointer_bucket_offset_wrong, \
+	  BCH_FSCK_ERR_backpointer_to_missing_ptr, \
+	  BCH_FSCK_ERR_ptr_to_missing_backpointer) \
+	x(disk_accounting_big_endian, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_allocations), \
+	  BCH_FSCK_ERR_accounting_mismatch, \
+	  BCH_FSCK_ERR_accounting_key_replicas_nr_devs_0, \
+	  BCH_FSCK_ERR_accounting_key_junk_at_end)

 struct upgrade_downgrade_entry {
 	u64 recovery_passes;
+31 -23
fs/bcachefs/sb-errors_format.h
···
 enum bch_fsck_flags {
 	FSCK_CAN_FIX = 1 << 0,
 	FSCK_CAN_IGNORE = 1 << 1,
-	FSCK_NEED_FSCK = 1 << 2,
-	FSCK_NO_RATELIMIT = 1 << 3,
-	FSCK_AUTOFIX = 1 << 4,
+	FSCK_NO_RATELIMIT = 1 << 2,
+	FSCK_AUTOFIX = 1 << 3,
 };
 
 #define BCH_SB_ERRS() \
···
 	x(bset_empty, 45, 0) \
 	x(bset_bad_seq, 46, 0) \
 	x(bset_blacklisted_journal_seq, 47, 0) \
-	x(first_bset_blacklisted_journal_seq, 48, 0) \
+	x(first_bset_blacklisted_journal_seq, 48, FSCK_AUTOFIX) \
 	x(btree_node_bad_btree, 49, 0) \
 	x(btree_node_bad_level, 50, 0) \
 	x(btree_node_bad_min_key, 51, 0) \
···
 	x(btree_node_bkey_past_bset_end, 54, 0) \
 	x(btree_node_bkey_bad_format, 55, 0) \
 	x(btree_node_bad_bkey, 56, 0) \
-	x(btree_node_bkey_out_of_order, 57, 0) \
-	x(btree_root_bkey_invalid, 58, 0) \
-	x(btree_root_read_error, 59, 0) \
+	x(btree_node_bkey_out_of_order, 57, FSCK_AUTOFIX) \
+	x(btree_root_bkey_invalid, 58, FSCK_AUTOFIX) \
+	x(btree_root_read_error, 59, FSCK_AUTOFIX) \
 	x(btree_root_bad_min_key, 60, 0) \
 	x(btree_root_bad_max_key, 61, 0) \
-	x(btree_node_read_error, 62, 0) \
-	x(btree_node_topology_bad_min_key, 63, 0) \
-	x(btree_node_topology_bad_max_key, 64, 0) \
-	x(btree_node_topology_overwritten_by_prev_node, 65, 0) \
-	x(btree_node_topology_overwritten_by_next_node, 66, 0) \
-	x(btree_node_topology_interior_node_empty, 67, 0) \
+	x(btree_node_read_error, 62, FSCK_AUTOFIX) \
+	x(btree_node_topology_bad_min_key, 63, FSCK_AUTOFIX) \
+	x(btree_node_topology_bad_max_key, 64, FSCK_AUTOFIX) \
+	x(btree_node_topology_overwritten_by_prev_node, 65, FSCK_AUTOFIX) \
+	x(btree_node_topology_overwritten_by_next_node, 66, FSCK_AUTOFIX) \
+	x(btree_node_topology_interior_node_empty, 67, FSCK_AUTOFIX) \
 	x(fs_usage_hidden_wrong, 68, FSCK_AUTOFIX) \
 	x(fs_usage_btree_wrong, 69, FSCK_AUTOFIX) \
 	x(fs_usage_data_wrong, 70, FSCK_AUTOFIX) \
···
 	x(alloc_key_cached_sectors_wrong, 109, FSCK_AUTOFIX) \
 	x(alloc_key_stripe_wrong, 110, FSCK_AUTOFIX) \
 	x(alloc_key_stripe_redundancy_wrong, 111, FSCK_AUTOFIX) \
+	x(alloc_key_journal_seq_in_future, 298, FSCK_AUTOFIX) \
 	x(bucket_sector_count_overflow, 112, 0) \
 	x(bucket_metadata_type_mismatch, 113, 0) \
-	x(need_discard_key_wrong, 114, 0) \
-	x(freespace_key_wrong, 115, 0) \
-	x(freespace_hole_missing, 116, 0) \
+	x(need_discard_key_wrong, 114, FSCK_AUTOFIX) \
+	x(freespace_key_wrong, 115, FSCK_AUTOFIX) \
+	x(freespace_hole_missing, 116, FSCK_AUTOFIX) \
 	x(bucket_gens_val_size_bad, 117, 0) \
 	x(bucket_gens_key_wrong, 118, FSCK_AUTOFIX) \
 	x(bucket_gens_hole_wrong, 119, FSCK_AUTOFIX) \
···
 	x(discarding_bucket_not_in_need_discard_btree, 291, 0) \
 	x(backpointer_bucket_offset_wrong, 125, 0) \
 	x(backpointer_level_bad, 294, 0) \
-	x(backpointer_to_missing_device, 126, 0) \
-	x(backpointer_to_missing_alloc, 127, 0) \
-	x(backpointer_to_missing_ptr, 128, 0) \
+	x(backpointer_dev_bad, 297, 0) \
+	x(backpointer_to_missing_device, 126, FSCK_AUTOFIX) \
+	x(backpointer_to_missing_alloc, 127, FSCK_AUTOFIX) \
+	x(backpointer_to_missing_ptr, 128, FSCK_AUTOFIX) \
 	x(lru_entry_at_time_0, 129, FSCK_AUTOFIX) \
 	x(lru_entry_to_invalid_bucket, 130, FSCK_AUTOFIX) \
 	x(lru_entry_bad, 131, FSCK_AUTOFIX) \
···
 	x(ptr_to_incorrect_stripe, 151, 0) \
 	x(ptr_gen_newer_than_bucket_gen, 152, 0) \
 	x(ptr_too_stale, 153, 0) \
-	x(stale_dirty_ptr, 154, 0) \
+	x(stale_dirty_ptr, 154, FSCK_AUTOFIX) \
 	x(ptr_bucket_data_type_mismatch, 155, 0) \
 	x(ptr_cached_and_erasure_coded, 156, 0) \
 	x(ptr_crc_uncompressed_size_too_small, 157, 0) \
+	x(ptr_crc_uncompressed_size_too_big, 161, 0) \
+	x(ptr_crc_uncompressed_size_mismatch, 300, 0) \
 	x(ptr_crc_csum_type_unknown, 158, 0) \
 	x(ptr_crc_compression_type_unknown, 159, 0) \
 	x(ptr_crc_redundant, 160, 0) \
-	x(ptr_crc_uncompressed_size_too_big, 161, 0) \
 	x(ptr_crc_nonce_mismatch, 162, 0) \
 	x(ptr_stripe_redundant, 163, 0) \
 	x(reservation_key_nr_replicas_invalid, 164, 0) \
···
 	x(bkey_in_missing_snapshot, 190, 0) \
 	x(inode_pos_inode_nonzero, 191, 0) \
 	x(inode_pos_blockdev_range, 192, 0) \
+	x(inode_alloc_cursor_inode_bad, 301, 0) \
 	x(inode_unpack_error, 193, 0) \
 	x(inode_str_hash_invalid, 194, 0) \
 	x(inode_v3_fields_start_bad, 195, 0) \
···
 	x(inode_wrong_nlink, 209, FSCK_AUTOFIX) \
 	x(inode_has_child_snapshots_wrong, 287, 0) \
 	x(inode_unreachable, 210, FSCK_AUTOFIX) \
+	x(inode_journal_seq_in_future, 299, FSCK_AUTOFIX) \
 	x(deleted_inode_but_clean, 211, FSCK_AUTOFIX) \
 	x(deleted_inode_missing, 212, FSCK_AUTOFIX) \
 	x(deleted_inode_is_dir, 213, FSCK_AUTOFIX) \
···
 	x(dirent_in_missing_dir_inode, 227, 0) \
 	x(dirent_in_non_dir_inode, 228, 0) \
 	x(dirent_to_missing_inode, 229, 0) \
+	x(dirent_to_overwritten_inode, 302, 0) \
 	x(dirent_to_missing_subvol, 230, 0) \
 	x(dirent_to_itself, 231, 0) \
 	x(quota_type_invalid, 232, 0) \
···
 	x(btree_root_unreadable_and_scan_found_nothing, 263, 0) \
 	x(snapshot_node_missing, 264, 0) \
 	x(dup_backpointer_to_bad_csum_extent, 265, 0) \
-	x(btree_bitmap_not_marked, 266, 0) \
+	x(btree_bitmap_not_marked, 266, FSCK_AUTOFIX) \
 	x(sb_clean_entry_overrun, 267, 0) \
 	x(btree_ptr_v2_written_0, 268, 0) \
 	x(subvol_snapshot_bad, 269, 0) \
···
 	x(accounting_key_replicas_devs_unsorted, 280, FSCK_AUTOFIX) \
 	x(accounting_key_version_0, 282, FSCK_AUTOFIX) \
 	x(logged_op_but_clean, 283, FSCK_AUTOFIX) \
-	x(MAX, 295, 0)
+	x(compression_opt_not_marked_in_sb, 295, FSCK_AUTOFIX) \
+	x(compression_type_not_marked_in_sb, 296, FSCK_AUTOFIX) \
+	x(directory_size_mismatch, 303, FSCK_AUTOFIX) \
+	x(MAX, 304, 0)
 
 enum bch_sb_error_id {
 #define x(t, n, ...) BCH_FSCK_ERR_##t = n,
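The BCH_SB_ERRS() table in this hunk is an X-macro: each x(name, id, flags) row can be expanded several times with different definitions of x, and the hunk shows one such expansion into enum bch_sb_error_id. A minimal standalone sketch of the pattern follows, using three rows copied from the hunk. The DEMO_* names and demo_err_flags() are hypothetical and exist only for illustration; they are not bcachefs identifiers.

```c
#include <assert.h>

/* Flag values mirror the renumbered enum bch_fsck_flags in the hunk. */
enum demo_fsck_flags {
	DEMO_FSCK_CAN_FIX	= 1 << 0,
	DEMO_FSCK_CAN_IGNORE	= 1 << 1,
	DEMO_FSCK_NO_RATELIMIT	= 1 << 2,
	DEMO_FSCK_AUTOFIX	= 1 << 3,
};

/* Three rows taken from the diff; the real table is far longer. */
#define DEMO_SB_ERRS()							\
	x(btree_node_bad_bkey,		56,	0)			\
	x(btree_node_bkey_out_of_order,	57,	DEMO_FSCK_AUTOFIX)	\
	x(directory_size_mismatch,	303,	DEMO_FSCK_AUTOFIX)

/* First expansion: one enumerator per row, with the on-disk id as value. */
enum demo_err_id {
#define x(t, n, ...) DEMO_ERR_##t = n,
	DEMO_SB_ERRS()
#undef x
};

/* Second expansion (hypothetical helper): map an id back to its flags. */
static unsigned demo_err_flags(enum demo_err_id id)
{
	switch (id) {
#define x(t, n, flags) case DEMO_ERR_##t: return flags;
	DEMO_SB_ERRS()
#undef x
	}
	return 0;
}
```

Because the ids are explicit in each row, new errors such as directory_size_mismatch (303) can be placed next to related entries without renumbering existing ones; only the MAX row has to move.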
+17 -10
fs/bcachefs/six.c
···
 				list_del(&wait->list);
 			raw_spin_unlock(&lock->wait_lock);
 
-			if (unlikely(acquired))
+			if (unlikely(acquired)) {
 				do_six_unlock_type(lock, type);
+			} else if (type == SIX_LOCK_write) {
+				six_clear_bitmask(lock, SIX_LOCK_HELD_write);
+				six_lock_wakeup(lock, atomic_read(&lock->state), SIX_LOCK_read);
+			}
 			break;
 		}
 
···
 
 	__set_current_state(TASK_RUNNING);
 out:
-	if (ret && type == SIX_LOCK_write) {
-		six_clear_bitmask(lock, SIX_LOCK_HELD_write);
-		six_lock_wakeup(lock, atomic_read(&lock->state), SIX_LOCK_read);
-	}
 	trace_contention_end(lock, 0);
 
 	return ret;
···
 
 	if (type != SIX_LOCK_write)
 		six_release(&lock->dep_map, ip);
-	else
-		lock->seq++;
 
 	if (type == SIX_LOCK_intent &&
 	    lock->intent_lock_recurse) {
 		--lock->intent_lock_recurse;
 		return;
 	}
+
+	if (type == SIX_LOCK_write &&
+	    lock->write_lock_recurse) {
+		--lock->write_lock_recurse;
+		return;
+	}
+
+	if (type == SIX_LOCK_write)
+		lock->seq++;
 
 	do_six_unlock_type(lock, type);
 }
···
 			atomic_add(l[type].lock_val, &lock->state);
 		}
 		break;
+	case SIX_LOCK_write:
+		lock->write_lock_recurse++;
+		fallthrough;
 	case SIX_LOCK_intent:
 		EBUG_ON(!(atomic_read(&lock->state) & SIX_LOCK_HELD_intent));
 		lock->intent_lock_recurse++;
-		break;
-	case SIX_LOCK_write:
-		BUG();
 		break;
 	}
 }
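The six.c hunks make write locks recursive: taking an extra reference on a held write lock bumps write_lock_recurse instead of hitting BUG(), and the unlock path only increments seq and really releases the lock once the recursion count has dropped to zero. The following single-threaded sketch illustrates just that ordering. struct toy_lock and the toy_* functions are hypothetical stand-ins, not the six lock API, and the sketch leaves out the atomic state word, waiters, and the fallthrough that also bumps intent_lock_recurse.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_lock {
	bool		write_held;
	unsigned	seq;			/* bumped once per real write unlock */
	unsigned	write_lock_recurse;	/* extra references on the write lock */
};

static void toy_lock_write(struct toy_lock *l)
{
	assert(!l->write_held);	/* toy model: no contention, no waiting */
	l->write_held = true;
}

/* Analogue of the new SIX_LOCK_write case: take one more reference. */
static void toy_lock_increment_write(struct toy_lock *l)
{
	assert(l->write_held);
	l->write_lock_recurse++;
}

static void toy_unlock_write(struct toy_lock *l)
{
	assert(l->write_held);

	/* Recursive reference: drop it and leave seq and the lock untouched. */
	if (l->write_lock_recurse) {
		--l->write_lock_recurse;
		return;
	}

	/* Final unlock: only now does the sequence number change. */
	l->seq++;
	l->write_held = false;
}
```

Moving the seq increment after the recursion check matters because, in the old ordering shown as removed lines, seq was bumped before any early return; with recursive write locks that would have advanced seq while the lock was still held.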
+1
fs/bcachefs/six.h
···
 	atomic_t		state;
 	u32			seq;
 	unsigned		intent_lock_recurse;
+	unsigned		write_lock_recurse;
 	struct task_struct	*owner;
 	unsigned __percpu	*readers;
 	raw_spinlock_t		wait_lock;
+231 -298
fs/bcachefs/snapshot.c
··· 2 2 3 3 #include "bcachefs.h" 4 4 #include "bkey_buf.h" 5 + #include "btree_cache.h" 5 6 #include "btree_key_cache.h" 6 7 #include "btree_update.h" 7 8 #include "buckets.h" ··· 33 32 } 34 33 35 34 int bch2_snapshot_tree_validate(struct bch_fs *c, struct bkey_s_c k, 36 - enum bch_validate_flags flags) 35 + struct bkey_validate_context from) 37 36 { 38 37 int ret = 0; 39 38 ··· 226 225 } 227 226 228 227 int bch2_snapshot_validate(struct bch_fs *c, struct bkey_s_c k, 229 - enum bch_validate_flags flags) 228 + struct bkey_validate_context from) 230 229 { 231 230 struct bkey_s_c_snapshot s; 232 231 u32 i, id; ··· 280 279 return ret; 281 280 } 282 281 283 - static void __set_is_ancestor_bitmap(struct bch_fs *c, u32 id) 284 - { 285 - struct snapshot_t *t = snapshot_t_mut(c, id); 286 - u32 parent = id; 287 - 288 - while ((parent = bch2_snapshot_parent_early(c, parent)) && 289 - parent - id - 1 < IS_ANCESTOR_BITMAP) 290 - __set_bit(parent - id - 1, t->is_ancestor); 291 - } 292 - 293 - static void set_is_ancestor_bitmap(struct bch_fs *c, u32 id) 294 - { 295 - mutex_lock(&c->snapshot_table_lock); 296 - __set_is_ancestor_bitmap(c, id); 297 - mutex_unlock(&c->snapshot_table_lock); 298 - } 299 - 300 282 static int __bch2_mark_snapshot(struct btree_trans *trans, 301 283 enum btree_id btree, unsigned level, 302 284 struct bkey_s_c old, struct bkey_s_c new, ··· 301 317 if (new.k->type == KEY_TYPE_snapshot) { 302 318 struct bkey_s_c_snapshot s = bkey_s_c_to_snapshot(new); 303 319 320 + t->live = true; 304 321 t->parent = le32_to_cpu(s.v->parent); 305 322 t->children[0] = le32_to_cpu(s.v->children[0]); 306 323 t->children[1] = le32_to_cpu(s.v->children[1]); ··· 320 335 t->skip[2] = 0; 321 336 } 322 337 323 - __set_is_ancestor_bitmap(c, id); 338 + u32 parent = id; 339 + 340 + while ((parent = bch2_snapshot_parent_early(c, parent)) && 341 + parent - id - 1 < IS_ANCESTOR_BITMAP) 342 + __set_bit(parent - id - 1, t->is_ancestor); 324 343 325 344 if (BCH_SNAPSHOT_DELETED(s.v)) { 326 
345 set_bit(BCH_FS_need_delete_dead_snapshots, &c->flags); ··· 352 363 { 353 364 return bch2_bkey_get_val_typed(trans, BTREE_ID_snapshots, POS(0, id), 354 365 BTREE_ITER_with_updates, snapshot, s); 355 - } 356 - 357 - static int bch2_snapshot_live(struct btree_trans *trans, u32 id) 358 - { 359 - struct bch_snapshot v; 360 - int ret; 361 - 362 - if (!id) 363 - return 0; 364 - 365 - ret = bch2_snapshot_lookup(trans, id, &v); 366 - if (bch2_err_matches(ret, ENOENT)) 367 - bch_err(trans->c, "snapshot node %u not found", id); 368 - if (ret) 369 - return ret; 370 - 371 - return !BCH_SNAPSHOT_DELETED(&v); 372 - } 373 - 374 - /* 375 - * If @k is a snapshot with just one live child, it's part of a linear chain, 376 - * which we consider to be an equivalence class: and then after snapshot 377 - * deletion cleanup, there should only be a single key at a given position in 378 - * this equivalence class. 379 - * 380 - * This sets the equivalence class of @k to be the child's equivalence class, if 381 - * it's part of such a linear chain: this correctly sets equivalence classes on 382 - * startup if we run leaf to root (i.e. in natural key order). 383 - */ 384 - static int bch2_snapshot_set_equiv(struct btree_trans *trans, struct bkey_s_c k) 385 - { 386 - struct bch_fs *c = trans->c; 387 - unsigned i, nr_live = 0, live_idx = 0; 388 - struct bkey_s_c_snapshot snap; 389 - u32 id = k.k->p.offset, child[2]; 390 - 391 - if (k.k->type != KEY_TYPE_snapshot) 392 - return 0; 393 - 394 - snap = bkey_s_c_to_snapshot(k); 395 - 396 - child[0] = le32_to_cpu(snap.v->children[0]); 397 - child[1] = le32_to_cpu(snap.v->children[1]); 398 - 399 - for (i = 0; i < 2; i++) { 400 - int ret = bch2_snapshot_live(trans, child[i]); 401 - 402 - if (ret < 0) 403 - return ret; 404 - 405 - if (ret) 406 - live_idx = i; 407 - nr_live += ret; 408 - } 409 - 410 - mutex_lock(&c->snapshot_table_lock); 411 - 412 - snapshot_t_mut(c, id)->equiv = nr_live == 1 413 - ? 
snapshot_t_mut(c, child[live_idx])->equiv 414 - : id; 415 - 416 - mutex_unlock(&c->snapshot_table_lock); 417 - 418 - return 0; 419 366 } 420 367 421 368 /* fsck: */ ··· 431 506 break; 432 507 } 433 508 } 434 - 435 509 bch2_trans_iter_exit(trans, &iter); 436 510 437 511 if (!ret && !found) { ··· 460 536 struct bch_snapshot s; 461 537 struct bch_subvolume subvol; 462 538 struct printbuf buf = PRINTBUF; 539 + struct btree_iter snapshot_iter = {}; 463 540 u32 root_id; 464 541 int ret; 465 542 ··· 470 545 st = bkey_s_c_to_snapshot_tree(k); 471 546 root_id = le32_to_cpu(st.v->root_snapshot); 472 547 473 - ret = bch2_snapshot_lookup(trans, root_id, &s); 548 + struct bkey_s_c_snapshot snapshot_k = 549 + bch2_bkey_get_iter_typed(trans, &snapshot_iter, BTREE_ID_snapshots, 550 + POS(0, root_id), 0, snapshot); 551 + ret = bkey_err(snapshot_k); 474 552 if (ret && !bch2_err_matches(ret, ENOENT)) 475 553 goto err; 554 + 555 + if (!ret) 556 + bkey_val_copy(&s, snapshot_k); 476 557 477 558 if (fsck_err_on(ret || 478 559 root_id != bch2_snapshot_root(c, root_id) || 479 560 st.k->p.offset != le32_to_cpu(s.tree), 480 561 trans, snapshot_tree_to_missing_snapshot, 481 562 "snapshot tree points to missing/incorrect snapshot:\n %s", 482 - (bch2_bkey_val_to_text(&buf, c, st.s_c), buf.buf))) { 563 + (bch2_bkey_val_to_text(&buf, c, st.s_c), 564 + prt_newline(&buf), 565 + ret 566 + ? 
prt_printf(&buf, "(%s)", bch2_err_str(ret)) 567 + : bch2_bkey_val_to_text(&buf, c, snapshot_k.s_c), 568 + buf.buf))) { 483 569 ret = bch2_btree_delete_at(trans, iter, 0); 484 570 goto err; 485 571 } 486 572 487 - ret = bch2_subvolume_get(trans, le32_to_cpu(st.v->master_subvol), 488 - false, 0, &subvol); 573 + if (!st.v->master_subvol) 574 + goto out; 575 + 576 + ret = bch2_subvolume_get(trans, le32_to_cpu(st.v->master_subvol), false, &subvol); 489 577 if (ret && !bch2_err_matches(ret, ENOENT)) 490 578 goto err; 491 579 ··· 541 603 u->v.master_subvol = cpu_to_le32(subvol_id); 542 604 st = snapshot_tree_i_to_s_c(u); 543 605 } 606 + out: 544 607 err: 545 608 fsck_err: 609 + bch2_trans_iter_exit(trans, &snapshot_iter); 546 610 printbuf_exit(&buf); 547 611 return ret; 548 612 } ··· 739 799 740 800 if (should_have_subvol) { 741 801 id = le32_to_cpu(s.subvol); 742 - ret = bch2_subvolume_get(trans, id, 0, false, &subvol); 802 + ret = bch2_subvolume_get(trans, id, false, &subvol); 743 803 if (bch2_err_matches(ret, ENOENT)) 744 804 bch_err(c, "snapshot points to nonexistent subvolume:\n %s", 745 805 (bch2_bkey_val_to_text(&buf, c, k), buf.buf)); ··· 842 902 { 843 903 struct bch_fs *c = trans->c; 844 904 845 - if (bch2_snapshot_equiv(c, id)) 905 + if (bch2_snapshot_exists(c, id)) 846 906 return 0; 847 907 848 908 /* Do we need to reconstruct the snapshot_tree entry as well? 
*/ ··· 891 951 892 952 return bch2_btree_insert_trans(trans, BTREE_ID_snapshots, &snapshot->k_i, 0) ?: 893 953 bch2_mark_snapshot(trans, BTREE_ID_snapshots, 0, 894 - bkey_s_c_null, bkey_i_to_s(&snapshot->k_i), 0) ?: 895 - bch2_snapshot_set_equiv(trans, bkey_i_to_s_c(&snapshot->k_i)); 954 + bkey_s_c_null, bkey_i_to_s(&snapshot->k_i), 0); 896 955 } 897 956 898 957 /* Figure out which snapshot nodes belong in the same tree: */ ··· 989 1050 snapshot_id_list_to_text(&buf, t); 990 1051 991 1052 darray_for_each(*t, id) { 992 - if (fsck_err_on(!bch2_snapshot_equiv(c, *id), 1053 + if (fsck_err_on(!bch2_snapshot_exists(c, *id), 993 1054 trans, snapshot_node_missing, 994 1055 "snapshot node %u from tree %s missing, recreate?", *id, buf.buf)) { 995 1056 if (t->nr > 1) { ··· 1022 1083 struct printbuf buf = PRINTBUF; 1023 1084 int ret = 0; 1024 1085 1025 - if (fsck_err_on(!bch2_snapshot_equiv(c, k.k->p.snapshot), 1086 + if (fsck_err_on(!bch2_snapshot_exists(c, k.k->p.snapshot), 1026 1087 trans, bkey_in_missing_snapshot, 1027 1088 "key in missing snapshot %s, delete?", 1028 - (bch2_bkey_val_to_text(&buf, c, k), buf.buf))) 1089 + (bch2_btree_id_to_text(&buf, iter->btree_id), 1090 + prt_char(&buf, ' '), 1091 + bch2_bkey_val_to_text(&buf, c, k), buf.buf))) 1029 1092 ret = bch2_btree_delete_at(trans, iter, 1030 1093 BTREE_UPDATE_internal_snapshot_node) ?: 1; 1031 1094 fsck_err: ··· 1041 1100 int bch2_snapshot_node_set_deleted(struct btree_trans *trans, u32 id) 1042 1101 { 1043 1102 struct btree_iter iter; 1044 - struct bkey_i_snapshot *s; 1045 - int ret = 0; 1046 - 1047 - s = bch2_bkey_get_mut_typed(trans, &iter, 1103 + struct bkey_i_snapshot *s = 1104 + bch2_bkey_get_mut_typed(trans, &iter, 1048 1105 BTREE_ID_snapshots, POS(0, id), 1049 1106 0, snapshot); 1050 - ret = PTR_ERR_OR_ZERO(s); 1107 + int ret = PTR_ERR_OR_ZERO(s); 1051 1108 if (unlikely(ret)) { 1052 1109 bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT), 1053 1110 trans->c, "missing snapshot %u", id); ··· 1233 1294 
goto err; 1234 1295 1235 1296 new_snapids[i] = iter.pos.offset; 1236 - 1237 - mutex_lock(&c->snapshot_table_lock); 1238 - snapshot_t_mut(c, new_snapids[i])->equiv = new_snapids[i]; 1239 - mutex_unlock(&c->snapshot_table_lock); 1240 1297 } 1241 1298 err: 1242 1299 bch2_trans_iter_exit(trans, &iter); ··· 1338 1403 * that key to snapshot leaf nodes, where we can mutate it 1339 1404 */ 1340 1405 1341 - static int delete_dead_snapshots_process_key(struct btree_trans *trans, 1342 - struct btree_iter *iter, 1343 - struct bkey_s_c k, 1344 - snapshot_id_list *deleted, 1345 - snapshot_id_list *equiv_seen, 1346 - struct bpos *last_pos) 1347 - { 1348 - int ret = bch2_check_key_has_snapshot(trans, iter, k); 1349 - if (ret) 1350 - return ret < 0 ? ret : 0; 1406 + struct snapshot_interior_delete { 1407 + u32 id; 1408 + u32 live_child; 1409 + }; 1410 + typedef DARRAY(struct snapshot_interior_delete) interior_delete_list; 1351 1411 1352 - struct bch_fs *c = trans->c; 1353 - u32 equiv = bch2_snapshot_equiv(c, k.k->p.snapshot); 1354 - if (!equiv) /* key for invalid snapshot node, but we chose not to delete */ 1412 + static inline u32 interior_delete_has_id(interior_delete_list *l, u32 id) 1413 + { 1414 + darray_for_each(*l, i) 1415 + if (i->id == id) 1416 + return i->live_child; 1417 + return 0; 1418 + } 1419 + 1420 + static unsigned __live_child(struct snapshot_table *t, u32 id, 1421 + snapshot_id_list *delete_leaves, 1422 + interior_delete_list *delete_interior) 1423 + { 1424 + struct snapshot_t *s = __snapshot_t(t, id); 1425 + if (!s) 1355 1426 return 0; 1356 1427 1357 - if (!bkey_eq(k.k->p, *last_pos)) 1358 - equiv_seen->nr = 0; 1428 + for (unsigned i = 0; i < ARRAY_SIZE(s->children); i++) 1429 + if (s->children[i] && 1430 + !snapshot_list_has_id(delete_leaves, s->children[i]) && 1431 + !interior_delete_has_id(delete_interior, s->children[i])) 1432 + return s->children[i]; 1359 1433 1360 - if (snapshot_list_has_id(deleted, k.k->p.snapshot)) 1361 - return 
bch2_btree_delete_at(trans, iter, 1362 - BTREE_UPDATE_internal_snapshot_node); 1363 - 1364 - if (!bpos_eq(*last_pos, k.k->p) && 1365 - snapshot_list_has_id(equiv_seen, equiv)) 1366 - return bch2_btree_delete_at(trans, iter, 1367 - BTREE_UPDATE_internal_snapshot_node); 1368 - 1369 - *last_pos = k.k->p; 1370 - 1371 - ret = snapshot_list_add_nodup(c, equiv_seen, equiv); 1372 - if (ret) 1373 - return ret; 1374 - 1375 - /* 1376 - * When we have a linear chain of snapshot nodes, we consider 1377 - * those to form an equivalence class: we're going to collapse 1378 - * them all down to a single node, and keep the leaf-most node - 1379 - * which has the same id as the equivalence class id. 1380 - * 1381 - * If there are multiple keys in different snapshots at the same 1382 - * position, we're only going to keep the one in the newest 1383 - * snapshot (we delete the others above) - the rest have been 1384 - * overwritten and are redundant, and for the key we're going to keep we 1385 - * need to move it to the equivalance class ID if it's not there 1386 - * already. 1387 - */ 1388 - if (equiv != k.k->p.snapshot) { 1389 - struct bkey_i *new = bch2_bkey_make_mut_noupdate(trans, k); 1390 - int ret = PTR_ERR_OR_ZERO(new); 1391 - if (ret) 1392 - return ret; 1393 - 1394 - new->k.p.snapshot = equiv; 1395 - 1396 - struct btree_iter new_iter; 1397 - bch2_trans_iter_init(trans, &new_iter, iter->btree_id, new->k.p, 1398 - BTREE_ITER_all_snapshots| 1399 - BTREE_ITER_cached| 1400 - BTREE_ITER_intent); 1401 - 1402 - ret = bch2_btree_iter_traverse(&new_iter) ?: 1403 - bch2_trans_update(trans, &new_iter, new, 1404 - BTREE_UPDATE_internal_snapshot_node) ?: 1405 - bch2_btree_delete_at(trans, iter, 1406 - BTREE_UPDATE_internal_snapshot_node); 1407 - bch2_trans_iter_exit(trans, &new_iter); 1408 - if (ret) 1409 - return ret; 1434 + for (unsigned i = 0; i < ARRAY_SIZE(s->children); i++) { 1435 + u32 live_child = s->children[i] 1436 + ? 
__live_child(t, s->children[i], delete_leaves, delete_interior) 1437 + : 0; 1438 + if (live_child) 1439 + return live_child; 1410 1440 } 1411 1441 1412 1442 return 0; 1413 1443 } 1414 1444 1415 - static int bch2_snapshot_needs_delete(struct btree_trans *trans, struct bkey_s_c k) 1445 + static unsigned live_child(struct bch_fs *c, u32 id, 1446 + snapshot_id_list *delete_leaves, 1447 + interior_delete_list *delete_interior) 1416 1448 { 1417 - struct bkey_s_c_snapshot snap; 1418 - u32 children[2]; 1419 - int ret; 1449 + rcu_read_lock(); 1450 + u32 ret = __live_child(rcu_dereference(c->snapshots), id, 1451 + delete_leaves, delete_interior); 1452 + rcu_read_unlock(); 1453 + return ret; 1454 + } 1420 1455 1421 - if (k.k->type != KEY_TYPE_snapshot) 1422 - return 0; 1456 + static int delete_dead_snapshots_process_key(struct btree_trans *trans, 1457 + struct btree_iter *iter, 1458 + struct bkey_s_c k, 1459 + snapshot_id_list *delete_leaves, 1460 + interior_delete_list *delete_interior) 1461 + { 1462 + if (snapshot_list_has_id(delete_leaves, k.k->p.snapshot)) 1463 + return bch2_btree_delete_at(trans, iter, 1464 + BTREE_UPDATE_internal_snapshot_node); 1423 1465 1424 - snap = bkey_s_c_to_snapshot(k); 1425 - if (BCH_SNAPSHOT_DELETED(snap.v) || 1426 - BCH_SNAPSHOT_SUBVOL(snap.v)) 1427 - return 0; 1466 + u32 live_child = interior_delete_has_id(delete_interior, k.k->p.snapshot); 1467 + if (live_child) { 1468 + struct bkey_i *new = bch2_bkey_make_mut_noupdate(trans, k); 1469 + int ret = PTR_ERR_OR_ZERO(new); 1470 + if (ret) 1471 + return ret; 1428 1472 1429 - children[0] = le32_to_cpu(snap.v->children[0]); 1430 - children[1] = le32_to_cpu(snap.v->children[1]); 1473 + new->k.p.snapshot = live_child; 1431 1474 1432 - ret = bch2_snapshot_live(trans, children[0]) ?: 1433 - bch2_snapshot_live(trans, children[1]); 1434 - if (ret < 0) 1475 + struct btree_iter dst_iter; 1476 + struct bkey_s_c dst_k = bch2_bkey_get_iter(trans, &dst_iter, 1477 + iter->btree_id, new->k.p, 1478 + 
BTREE_ITER_all_snapshots| 1479 + BTREE_ITER_intent); 1480 + ret = bkey_err(dst_k); 1481 + if (ret) 1482 + return ret; 1483 + 1484 + ret = (bkey_deleted(dst_k.k) 1485 + ? bch2_trans_update(trans, &dst_iter, new, 1486 + BTREE_UPDATE_internal_snapshot_node) 1487 + : 0) ?: 1488 + bch2_btree_delete_at(trans, iter, 1489 + BTREE_UPDATE_internal_snapshot_node); 1490 + bch2_trans_iter_exit(trans, &dst_iter); 1435 1491 return ret; 1436 - return !ret; 1492 + } 1493 + 1494 + return 0; 1437 1495 } 1438 1496 1439 1497 /* ··· 1434 1506 * it doesn't have child snapshot nodes - it's now redundant and we can mark it 1435 1507 * as deleted. 1436 1508 */ 1437 - static int bch2_delete_redundant_snapshot(struct btree_trans *trans, struct bkey_s_c k) 1509 + static int check_should_delete_snapshot(struct btree_trans *trans, struct bkey_s_c k, 1510 + snapshot_id_list *delete_leaves, 1511 + interior_delete_list *delete_interior) 1438 1512 { 1439 - int ret = bch2_snapshot_needs_delete(trans, k); 1513 + if (k.k->type != KEY_TYPE_snapshot) 1514 + return 0; 1440 1515 1441 - return ret <= 0 1442 - ? 
ret 1443 - : bch2_snapshot_node_set_deleted(trans, k.k->p.offset); 1516 + struct bch_fs *c = trans->c; 1517 + struct bkey_s_c_snapshot s = bkey_s_c_to_snapshot(k); 1518 + unsigned live_children = 0; 1519 + 1520 + if (BCH_SNAPSHOT_SUBVOL(s.v)) 1521 + return 0; 1522 + 1523 + for (unsigned i = 0; i < 2; i++) { 1524 + u32 child = le32_to_cpu(s.v->children[i]); 1525 + 1526 + live_children += child && 1527 + !snapshot_list_has_id(delete_leaves, child); 1528 + } 1529 + 1530 + if (live_children == 0) { 1531 + return snapshot_list_add(c, delete_leaves, s.k->p.offset); 1532 + } else if (live_children == 1) { 1533 + struct snapshot_interior_delete d = { 1534 + .id = s.k->p.offset, 1535 + .live_child = live_child(c, s.k->p.offset, delete_leaves, delete_interior), 1536 + }; 1537 + 1538 + if (!d.live_child) { 1539 + bch_err(c, "error finding live child of snapshot %u", d.id); 1540 + return -EINVAL; 1541 + } 1542 + 1543 + return darray_push(delete_interior, d); 1544 + } else { 1545 + return 0; 1546 + } 1444 1547 } 1445 1548 1446 1549 static inline u32 bch2_snapshot_nth_parent_skip(struct bch_fs *c, u32 id, u32 n, 1447 - snapshot_id_list *skip) 1550 + interior_delete_list *skip) 1448 1551 { 1449 1552 rcu_read_lock(); 1450 - while (snapshot_list_has_id(skip, id)) 1553 + while (interior_delete_has_id(skip, id)) 1451 1554 id = __bch2_snapshot_parent(c, id); 1452 1555 1453 1556 while (n--) { 1454 1557 do { 1455 1558 id = __bch2_snapshot_parent(c, id); 1456 - } while (snapshot_list_has_id(skip, id)); 1559 + } while (interior_delete_has_id(skip, id)); 1457 1560 } 1458 1561 rcu_read_unlock(); 1459 1562 ··· 1493 1534 1494 1535 static int bch2_fix_child_of_deleted_snapshot(struct btree_trans *trans, 1495 1536 struct btree_iter *iter, struct bkey_s_c k, 1496 - snapshot_id_list *deleted) 1537 + interior_delete_list *deleted) 1497 1538 { 1498 1539 struct bch_fs *c = trans->c; 1499 1540 u32 nr_deleted_ancestors = 0; ··· 1503 1544 if (k.k->type != KEY_TYPE_snapshot) 1504 1545 return 0; 1505 
1546 1506 - if (snapshot_list_has_id(deleted, k.k->p.offset)) 1547 + if (interior_delete_has_id(deleted, k.k->p.offset)) 1507 1548 return 0; 1508 1549 1509 1550 s = bch2_bkey_make_mut_noupdate_typed(trans, k, snapshot); ··· 1512 1553 return ret; 1513 1554 1514 1555 darray_for_each(*deleted, i) 1515 - nr_deleted_ancestors += bch2_snapshot_is_ancestor(c, s->k.p.offset, *i); 1556 + nr_deleted_ancestors += bch2_snapshot_is_ancestor(c, s->k.p.offset, i->id); 1516 1557 1517 1558 if (!nr_deleted_ancestors) 1518 1559 return 0; ··· 1530 1571 for (unsigned j = 0; j < ARRAY_SIZE(s->v.skip); j++) { 1531 1572 u32 id = le32_to_cpu(s->v.skip[j]); 1532 1573 1533 - if (snapshot_list_has_id(deleted, id)) { 1574 + if (interior_delete_has_id(deleted, id)) { 1534 1575 id = bch2_snapshot_nth_parent_skip(c, 1535 1576 parent, 1536 1577 depth > 1 ··· 1549 1590 1550 1591 int bch2_delete_dead_snapshots(struct bch_fs *c) 1551 1592 { 1552 - struct btree_trans *trans; 1553 - snapshot_id_list deleted = { 0 }; 1554 - snapshot_id_list deleted_interior = { 0 }; 1555 - int ret = 0; 1556 - 1557 1593 if (!test_and_clear_bit(BCH_FS_need_delete_dead_snapshots, &c->flags)) 1558 1594 return 0; 1559 1595 1560 - trans = bch2_trans_get(c); 1596 + struct btree_trans *trans = bch2_trans_get(c); 1597 + snapshot_id_list delete_leaves = {}; 1598 + interior_delete_list delete_interior = {}; 1599 + int ret = 0; 1561 1600 1562 1601 /* 1563 1602 * For every snapshot node: If we have no live children and it's not 1564 1603 * pointed to by a subvolume, delete it: 1565 1604 */ 1566 - ret = for_each_btree_key_commit(trans, iter, BTREE_ID_snapshots, 1567 - POS_MIN, 0, k, 1568 - NULL, NULL, 0, 1569 - bch2_delete_redundant_snapshot(trans, k)); 1570 - bch_err_msg(c, ret, "deleting redundant snapshots"); 1605 + ret = for_each_btree_key(trans, iter, BTREE_ID_snapshots, POS_MIN, 0, k, 1606 + check_should_delete_snapshot(trans, k, &delete_leaves, &delete_interior)); 1607 + if (!bch2_err_matches(ret, EROFS)) 1608 + bch_err_msg(c, 
ret, "walking snapshots"); 1571 1609 if (ret) 1572 1610 goto err; 1573 1611 1574 - ret = for_each_btree_key(trans, iter, BTREE_ID_snapshots, 1575 - POS_MIN, 0, k, 1576 - bch2_snapshot_set_equiv(trans, k)); 1577 - bch_err_msg(c, ret, "in bch2_snapshots_set_equiv"); 1578 - if (ret) 1612 + if (!delete_leaves.nr && !delete_interior.nr) 1579 1613 goto err; 1580 1614 1581 - ret = for_each_btree_key(trans, iter, BTREE_ID_snapshots, 1582 - POS_MIN, 0, k, ({ 1583 - if (k.k->type != KEY_TYPE_snapshot) 1584 - continue; 1615 + { 1616 + struct printbuf buf = PRINTBUF; 1617 + prt_printf(&buf, "deleting leaves"); 1618 + darray_for_each(delete_leaves, i) 1619 + prt_printf(&buf, " %u", *i); 1585 1620 1586 - BCH_SNAPSHOT_DELETED(bkey_s_c_to_snapshot(k).v) 1587 - ? snapshot_list_add(c, &deleted, k.k->p.offset) 1588 - : 0; 1589 - })); 1590 - bch_err_msg(c, ret, "walking snapshots"); 1591 - if (ret) 1592 - goto err; 1621 + prt_printf(&buf, " interior"); 1622 + darray_for_each(delete_interior, i) 1623 + prt_printf(&buf, " %u->%u", i->id, i->live_child); 1624 + 1625 + ret = commit_do(trans, NULL, NULL, 0, bch2_trans_log_msg(trans, &buf)); 1626 + printbuf_exit(&buf); 1627 + if (ret) 1628 + goto err; 1629 + } 1593 1630 1594 1631 for (unsigned btree = 0; btree < BTREE_ID_NR; btree++) { 1595 - struct bpos last_pos = POS_MIN; 1596 - snapshot_id_list equiv_seen = { 0 }; 1597 1632 struct disk_reservation res = { 0 }; 1598 1633 1599 1634 if (!btree_type_has_snapshots(btree)) ··· 1597 1644 btree, POS_MIN, 1598 1645 BTREE_ITER_prefetch|BTREE_ITER_all_snapshots, k, 1599 1646 &res, NULL, BCH_TRANS_COMMIT_no_enospc, 1600 - delete_dead_snapshots_process_key(trans, &iter, k, &deleted, 1601 - &equiv_seen, &last_pos)); 1647 + delete_dead_snapshots_process_key(trans, &iter, k, 1648 + &delete_leaves, 1649 + &delete_interior)); 1602 1650 1603 1651 bch2_disk_reservation_put(c, &res); 1604 - darray_exit(&equiv_seen); 1605 1652 1606 - bch_err_msg(c, ret, "deleting keys from dying snapshots"); 1653 + if 
(!bch2_err_matches(ret, EROFS)) 1654 + bch_err_msg(c, ret, "deleting keys from dying snapshots"); 1607 1655 if (ret) 1608 1656 goto err; 1609 1657 } 1610 1658 1611 - bch2_trans_unlock(trans); 1612 - down_write(&c->snapshot_create_lock); 1613 - 1614 - ret = for_each_btree_key(trans, iter, BTREE_ID_snapshots, 1615 - POS_MIN, 0, k, ({ 1616 - u32 snapshot = k.k->p.offset; 1617 - u32 equiv = bch2_snapshot_equiv(c, snapshot); 1618 - 1619 - equiv != snapshot 1620 - ? snapshot_list_add(c, &deleted_interior, snapshot) 1621 - : 0; 1622 - })); 1623 - 1624 - bch_err_msg(c, ret, "walking snapshots"); 1625 - if (ret) 1626 - goto err_create_lock; 1659 + darray_for_each(delete_leaves, i) { 1660 + ret = commit_do(trans, NULL, NULL, 0, 1661 + bch2_snapshot_node_delete(trans, *i)); 1662 + if (!bch2_err_matches(ret, EROFS)) 1663 + bch_err_msg(c, ret, "deleting snapshot %u", *i); 1664 + if (ret) 1665 + goto err; 1666 + } 1627 1667 1628 1668 /* 1629 1669 * Fixing children of deleted snapshots can't be done completely ··· 1626 1680 ret = for_each_btree_key_commit(trans, iter, BTREE_ID_snapshots, POS_MIN, 1627 1681 BTREE_ITER_intent, k, 1628 1682 NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 1629 - bch2_fix_child_of_deleted_snapshot(trans, &iter, k, &deleted_interior)); 1683 + bch2_fix_child_of_deleted_snapshot(trans, &iter, k, &delete_interior)); 1630 1684 if (ret) 1631 - goto err_create_lock; 1685 + goto err; 1632 1686 1633 - darray_for_each(deleted, i) { 1687 + darray_for_each(delete_interior, i) { 1634 1688 ret = commit_do(trans, NULL, NULL, 0, 1635 - bch2_snapshot_node_delete(trans, *i)); 1636 - bch_err_msg(c, ret, "deleting snapshot %u", *i); 1689 + bch2_snapshot_node_delete(trans, i->id)); 1690 + if (!bch2_err_matches(ret, EROFS)) 1691 + bch_err_msg(c, ret, "deleting snapshot %u", i->id); 1637 1692 if (ret) 1638 - goto err_create_lock; 1693 + goto err; 1639 1694 } 1640 - 1641 - darray_for_each(deleted_interior, i) { 1642 - ret = commit_do(trans, NULL, NULL, 0, 1643 - 
bch2_snapshot_node_delete(trans, *i)); 1644 - bch_err_msg(c, ret, "deleting snapshot %u", *i); 1645 - if (ret) 1646 - goto err_create_lock; 1647 - } 1648 - err_create_lock: 1649 - up_write(&c->snapshot_create_lock); 1650 1695 err: 1651 - darray_exit(&deleted_interior); 1652 - darray_exit(&deleted); 1696 + darray_exit(&delete_interior); 1697 + darray_exit(&delete_leaves); 1653 1698 bch2_trans_put(trans); 1654 - bch_err_fn(c, ret); 1699 + if (!bch2_err_matches(ret, EROFS)) 1700 + bch_err_fn(c, ret); 1655 1701 return ret; 1656 1702 } 1657 1703 ··· 1659 1721 1660 1722 void bch2_delete_dead_snapshots_async(struct bch_fs *c) 1661 1723 { 1662 - if (bch2_write_ref_tryget(c, BCH_WRITE_REF_delete_dead_snapshots) && 1663 - !queue_work(c->write_ref_wq, &c->snapshot_delete_work)) 1724 + if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_delete_dead_snapshots)) 1725 + return; 1726 + 1727 + BUG_ON(!test_bit(BCH_FS_may_go_rw, &c->flags)); 1728 + 1729 + if (!queue_work(c->write_ref_wq, &c->snapshot_delete_work)) 1664 1730 bch2_write_ref_put(c, BCH_WRITE_REF_delete_dead_snapshots); 1665 1731 } 1666 1732 ··· 1677 1735 struct bkey_s_c k; 1678 1736 int ret; 1679 1737 1680 - bch2_trans_iter_init(trans, &iter, id, pos, 1681 - BTREE_ITER_not_extents| 1682 - BTREE_ITER_all_snapshots); 1683 - while (1) { 1684 - k = bch2_btree_iter_prev(&iter); 1685 - ret = bkey_err(k); 1686 - if (ret) 1687 - break; 1688 - 1689 - if (!k.k) 1690 - break; 1691 - 1738 + for_each_btree_key_reverse_norestart(trans, iter, id, bpos_predecessor(pos), 1739 + BTREE_ITER_not_extents| 1740 + BTREE_ITER_all_snapshots, 1741 + k, ret) { 1692 1742 if (!bkey_eq(pos, k.k->p)) 1693 1743 break; 1694 1744 ··· 1694 1760 return ret; 1695 1761 } 1696 1762 1763 + static bool interior_snapshot_needs_delete(struct bkey_s_c_snapshot snap) 1764 + { 1765 + /* If there's one child, it's redundant and keys will be moved to the child */ 1766 + return !!snap.v->children[0] + !!snap.v->children[1] == 1; 1767 + } 1768 + 1697 1769 static int 
bch2_check_snapshot_needs_deletion(struct btree_trans *trans, struct bkey_s_c k) 1698 1770 { 1699 - struct bch_fs *c = trans->c; 1700 - struct bkey_s_c_snapshot snap; 1701 - int ret = 0; 1702 - 1703 1771 if (k.k->type != KEY_TYPE_snapshot) 1704 1772 return 0; 1705 1773 1706 - snap = bkey_s_c_to_snapshot(k); 1774 + struct bkey_s_c_snapshot snap = bkey_s_c_to_snapshot(k); 1707 1775 if (BCH_SNAPSHOT_DELETED(snap.v) || 1708 - bch2_snapshot_equiv(c, k.k->p.offset) != k.k->p.offset || 1709 - (ret = bch2_snapshot_needs_delete(trans, k)) > 0) { 1710 - set_bit(BCH_FS_need_delete_dead_snapshots, &c->flags); 1711 - return 0; 1712 - } 1776 + interior_snapshot_needs_delete(snap)) 1777 + set_bit(BCH_FS_need_delete_dead_snapshots, &trans->c->flags); 1713 1778 1714 - return ret; 1779 + return 0; 1715 1780 } 1716 1781 1717 1782 int bch2_snapshots_read(struct bch_fs *c) 1718 1783 { 1784 + /* 1785 + * Initializing the is_ancestor bitmaps requires ancestors to already be 1786 + * initialized - so mark in reverse: 1787 + */ 1719 1788 int ret = bch2_trans_run(c, 1720 - for_each_btree_key(trans, iter, BTREE_ID_snapshots, 1721 - POS_MIN, 0, k, 1789 + for_each_btree_key_reverse(trans, iter, BTREE_ID_snapshots, 1790 + POS_MAX, 0, k, 1722 1791 __bch2_mark_snapshot(trans, BTREE_ID_snapshots, 0, bkey_s_c_null, k, 0) ?: 1723 - bch2_snapshot_set_equiv(trans, k) ?: 1724 - bch2_check_snapshot_needs_deletion(trans, k)) ?: 1725 - for_each_btree_key(trans, iter, BTREE_ID_snapshots, 1726 - POS_MIN, 0, k, 1727 - (set_is_ancestor_bitmap(c, k.k->p.offset), 0))); 1792 + bch2_check_snapshot_needs_deletion(trans, k))); 1728 1793 bch_err_fn(c, ret); 1729 1794 1730 1795 /*
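The `interior_snapshot_needs_delete()` helper added in this file treats an interior snapshot node with exactly one child as redundant, since its keys can be moved down to that child. A minimal standalone sketch of the test, assuming a simplified `struct snap_node` (a hypothetical stand-in, not the on-disk `bch_snapshot` layout) in which a child ID of 0 means the slot is empty:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a snapshot node. */
struct snap_node {
	uint32_t children[2];	/* 0 means "no child in this slot" */
};

/* Number of populated child slots: 0, 1 or 2. */
static unsigned snap_nr_children(const struct snap_node *n)
{
	/* !!x collapses any nonzero ID to 1, so the sum is a count. */
	return !!n->children[0] + !!n->children[1];
}

/* Exactly one child: the node is redundant and can be collapsed. */
static bool snap_interior_needs_delete(const struct snap_node *n)
{
	return snap_nr_children(n) == 1;
}
```

A leaf (no children) and a true branch point (two children) are both left alone by this predicate; in the diff, leaves are handled through the separate `BCH_SNAPSHOT_DELETED` check.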
+8 -9
fs/bcachefs/snapshot.h
···
 #ifndef _BCACHEFS_SNAPSHOT_H
 #define _BCACHEFS_SNAPSHOT_H

-enum bch_validate_flags;
-
 void bch2_snapshot_tree_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 int bch2_snapshot_tree_validate(struct bch_fs *, struct bkey_s_c,
-				enum bch_validate_flags);
+				struct bkey_validate_context);

 #define bch2_bkey_ops_snapshot_tree ((struct bkey_ops) {	\
 	.key_validate	= bch2_snapshot_tree_validate,		\
···
 int bch2_snapshot_tree_lookup(struct btree_trans *, u32, struct bch_snapshot_tree *);

 void bch2_snapshot_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
-int bch2_snapshot_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_snapshot_validate(struct bch_fs *, struct bkey_s_c,
+			   struct bkey_validate_context);
 int bch2_mark_snapshot(struct btree_trans *, enum btree_id, unsigned,
 		       struct bkey_s_c, struct bkey_s,
 		       enum btree_iter_update_trigger_flags);
···
 	return id;
 }

-static inline u32 __bch2_snapshot_equiv(struct bch_fs *c, u32 id)
+static inline bool __bch2_snapshot_exists(struct bch_fs *c, u32 id)
 {
 	const struct snapshot_t *s = snapshot_t(c, id);
-	return s ? s->equiv : 0;
+	return s ? s->live : 0;
 }

-static inline u32 bch2_snapshot_equiv(struct bch_fs *c, u32 id)
+static inline bool bch2_snapshot_exists(struct bch_fs *c, u32 id)
 {
 	rcu_read_lock();
-	id = __bch2_snapshot_equiv(c, id);
+	bool ret = __bch2_snapshot_exists(c, id);
 	rcu_read_unlock();

-	return id;
+	return ret;
 }

 static inline int bch2_snapshot_is_internal_node(struct bch_fs *c, u32 id)
+295
fs/bcachefs/str_hash.c
···
+// SPDX-License-Identifier: GPL-2.0
+
+#include "bcachefs.h"
+#include "btree_cache.h"
+#include "btree_update.h"
+#include "dirent.h"
+#include "fsck.h"
+#include "str_hash.h"
+#include "subvolume.h"
+
+static int bch2_dirent_has_target(struct btree_trans *trans, struct bkey_s_c_dirent d)
+{
+	if (d.v->d_type == DT_SUBVOL) {
+		struct bch_subvolume subvol;
+		int ret = bch2_subvolume_get(trans, le32_to_cpu(d.v->d_child_subvol),
+					     false, &subvol);
+		if (ret && !bch2_err_matches(ret, ENOENT))
+			return ret;
+		return !ret;
+	} else {
+		struct btree_iter iter;
+		struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_inodes,
+				SPOS(0, le64_to_cpu(d.v->d_inum), d.k->p.snapshot), 0);
+		int ret = bkey_err(k);
+		if (ret)
+			return ret;
+
+		ret = bkey_is_inode(k.k);
+		bch2_trans_iter_exit(trans, &iter);
+		return ret;
+	}
+}
+
+static int fsck_rename_dirent(struct btree_trans *trans,
+			      struct snapshots_seen *s,
+			      const struct bch_hash_desc desc,
+			      struct bch_hash_info *hash_info,
+			      struct bkey_s_c_dirent old)
+{
+	struct qstr old_name = bch2_dirent_get_name(old);
+	struct bkey_i_dirent *new = bch2_trans_kmalloc(trans, bkey_bytes(old.k) + 32);
+	int ret = PTR_ERR_OR_ZERO(new);
+	if (ret)
+		return ret;
+
+	bkey_dirent_init(&new->k_i);
+	dirent_copy_target(new, old);
+	new->k.p = old.k->p;
+
+	for (unsigned i = 0; i < 1000; i++) {
+		unsigned len = sprintf(new->v.d_name, "%.*s.fsck_renamed-%u",
+				       old_name.len, old_name.name, i);
+		unsigned u64s = BKEY_U64s + dirent_val_u64s(len);
+
+		if (u64s > U8_MAX)
+			return -EINVAL;
+
+		new->k.u64s = u64s;
+
+		ret = bch2_hash_set_in_snapshot(trans, bch2_dirent_hash_desc, hash_info,
+						(subvol_inum) { 0, old.k->p.inode },
+						old.k->p.snapshot, &new->k_i,
+						BTREE_UPDATE_internal_snapshot_node);
+		if (!bch2_err_matches(ret, EEXIST))
+			break;
+	}
+
+	if (ret)
+		return ret;
+
+	return bch2_fsck_update_backpointers(trans, s, desc, hash_info, &new->k_i);
+}
+
+static int hash_pick_winner(struct btree_trans *trans,
+			    const struct bch_hash_desc desc,
+			    struct bch_hash_info *hash_info,
+			    struct bkey_s_c k1,
+			    struct bkey_s_c k2)
+{
+	if (bkey_val_bytes(k1.k) == bkey_val_bytes(k2.k) &&
+	    !memcmp(k1.v, k2.v, bkey_val_bytes(k1.k)))
+		return 0;
+
+	switch (desc.btree_id) {
+	case BTREE_ID_dirents: {
+		int ret = bch2_dirent_has_target(trans, bkey_s_c_to_dirent(k1));
+		if (ret < 0)
+			return ret;
+		if (!ret)
+			return 0;
+
+		ret = bch2_dirent_has_target(trans, bkey_s_c_to_dirent(k2));
+		if (ret < 0)
+			return ret;
+		if (!ret)
+			return 1;
+		return 2;
+	}
+	default:
+		return 0;
+	}
+}
+
+static int repair_inode_hash_info(struct btree_trans *trans,
+				  struct bch_inode_unpacked *snapshot_root)
+{
+	struct btree_iter iter;
+	struct bkey_s_c k;
+	int ret = 0;
+
+	for_each_btree_key_reverse_norestart(trans, iter, BTREE_ID_inodes,
+					     SPOS(0, snapshot_root->bi_inum, snapshot_root->bi_snapshot - 1),
+					     BTREE_ITER_all_snapshots, k, ret) {
+		if (k.k->p.offset != snapshot_root->bi_inum)
+			break;
+		if (!bkey_is_inode(k.k))
+			continue;
+
+		struct bch_inode_unpacked inode;
+		ret = bch2_inode_unpack(k, &inode);
+		if (ret)
+			break;
+
+		if (fsck_err_on(inode.bi_hash_seed != snapshot_root->bi_hash_seed ||
+				INODE_STR_HASH(&inode) != INODE_STR_HASH(snapshot_root),
+				trans, inode_snapshot_mismatch,
+				"inode hash info in different snapshots don't match")) {
+			inode.bi_hash_seed = snapshot_root->bi_hash_seed;
+			SET_INODE_STR_HASH(&inode, INODE_STR_HASH(snapshot_root));
+			ret = __bch2_fsck_write_inode(trans, &inode) ?:
+				bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc) ?:
+				-BCH_ERR_transaction_restart_nested;
+			break;
+		}
+	}
+fsck_err:
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
+/*
+ * All versions of the same inode in different snapshots must have the same hash
+ * seed/type: verify that the hash info we're using matches the root
+ */
+static int check_inode_hash_info_matches_root(struct btree_trans *trans, u64 inum,
+					      struct bch_hash_info *hash_info)
+{
+	struct bch_fs *c = trans->c;
+	struct btree_iter iter;
+	struct bkey_s_c k;
+	int ret = 0;
+
+	for_each_btree_key_reverse_norestart(trans, iter, BTREE_ID_inodes, SPOS(0, inum, U32_MAX),
+					     BTREE_ITER_all_snapshots, k, ret) {
+		if (k.k->p.offset != inum)
+			break;
+		if (bkey_is_inode(k.k))
+			goto found;
+	}
+	bch_err(c, "%s(): inum %llu not found", __func__, inum);
+	ret = -BCH_ERR_fsck_repair_unimplemented;
+	goto err;
+found:;
+	struct bch_inode_unpacked inode;
+	ret = bch2_inode_unpack(k, &inode);
+	if (ret)
+		goto err;
+
+	struct bch_hash_info hash2 = bch2_hash_info_init(c, &inode);
+	if (hash_info->type != hash2.type ||
+	    memcmp(&hash_info->siphash_key, &hash2.siphash_key, sizeof(hash2.siphash_key))) {
+		ret = repair_inode_hash_info(trans, &inode);
+		if (!ret) {
+			bch_err(c, "inode hash info mismatch with root, but mismatch not found\n"
+				"%u %llx %llx\n"
+				"%u %llx %llx",
+				hash_info->type,
+				hash_info->siphash_key.k0,
+				hash_info->siphash_key.k1,
+				hash2.type,
+				hash2.siphash_key.k0,
+				hash2.siphash_key.k1);
+			ret = -BCH_ERR_fsck_repair_unimplemented;
+		}
+	}
+err:
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
+int __bch2_str_hash_check_key(struct btree_trans *trans,
+			      struct snapshots_seen *s,
+			      const struct bch_hash_desc *desc,
+			      struct bch_hash_info *hash_info,
+			      struct btree_iter *k_iter, struct bkey_s_c hash_k)
+{
+	struct bch_fs *c = trans->c;
+	struct btree_iter iter = { NULL };
+	struct printbuf buf = PRINTBUF;
+	struct bkey_s_c k;
+	int ret = 0;
+
+	u64 hash = desc->hash_bkey(hash_info, hash_k);
+	if (hash_k.k->p.offset < hash)
+		goto bad_hash;
+
+	for_each_btree_key_norestart(trans, iter, desc->btree_id,
+				     SPOS(hash_k.k->p.inode, hash, hash_k.k->p.snapshot),
+				     BTREE_ITER_slots, k, ret) {
+		if (bkey_eq(k.k->p, hash_k.k->p))
+			break;
+
+		if (k.k->type == desc->key_type &&
+		    !desc->cmp_bkey(k, hash_k))
+			goto duplicate_entries;
+
+		if (bkey_deleted(k.k)) {
+			bch2_trans_iter_exit(trans, &iter);
+			goto bad_hash;
+		}
+	}
+out:
+	bch2_trans_iter_exit(trans, &iter);
+	printbuf_exit(&buf);
+	return ret;
+bad_hash:
+	/*
+	 * Before doing any repair, check hash_info itself:
+	 */
+	ret = check_inode_hash_info_matches_root(trans, hash_k.k->p.inode, hash_info);
+	if (ret)
+		goto out;
+
+	if (fsck_err(trans, hash_table_key_wrong_offset,
+		     "hash table key at wrong offset: btree %s inode %llu offset %llu, hashed to %llu\n %s",
+		     bch2_btree_id_str(desc->btree_id), hash_k.k->p.inode, hash_k.k->p.offset, hash,
+		     (printbuf_reset(&buf),
+		      bch2_bkey_val_to_text(&buf, c, hash_k), buf.buf))) {
+		struct bkey_i *new = bch2_bkey_make_mut_noupdate(trans, hash_k);
+		if (IS_ERR(new))
+			return PTR_ERR(new);
+
+		k = bch2_hash_set_or_get_in_snapshot(trans, &iter, *desc, hash_info,
+						     (subvol_inum) { 0, hash_k.k->p.inode },
+						     hash_k.k->p.snapshot, new,
+						     STR_HASH_must_create|
+						     BTREE_ITER_with_updates|
+						     BTREE_UPDATE_internal_snapshot_node);
+		ret = bkey_err(k);
+		if (ret)
+			goto out;
+		if (k.k)
+			goto duplicate_entries;
+
+		ret = bch2_hash_delete_at(trans, *desc, hash_info, k_iter,
+					  BTREE_UPDATE_internal_snapshot_node) ?:
+			bch2_fsck_update_backpointers(trans, s, *desc, hash_info, new) ?:
+			bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc) ?:
+			-BCH_ERR_transaction_restart_nested;
+		goto out;
+	}
+fsck_err:
+	goto out;
+duplicate_entries:
+	ret = hash_pick_winner(trans, *desc, hash_info, hash_k, k);
+	if (ret < 0)
+		goto out;
+
+	if (!fsck_err(trans, hash_table_key_duplicate,
+		      "duplicate hash table keys%s:\n%s",
+		      ret != 2 ? "" : ", both point to valid inodes",
+		      (printbuf_reset(&buf),
+		       bch2_bkey_val_to_text(&buf, c, hash_k),
+		       prt_newline(&buf),
+		       bch2_bkey_val_to_text(&buf, c, k),
+		       buf.buf)))
+		goto out;
+
+	switch (ret) {
+	case 0:
+		ret = bch2_hash_delete_at(trans, *desc, hash_info, k_iter, 0);
+		break;
+	case 1:
+		ret = bch2_hash_delete_at(trans, *desc, hash_info, &iter, 0);
+		break;
+	case 2:
+		ret = fsck_rename_dirent(trans, s, *desc, hash_info, bkey_s_c_to_dirent(hash_k)) ?:
+			bch2_hash_delete_at(trans, *desc, hash_info, k_iter, 0);
+		goto out;
+	}
+
+	ret = bch2_trans_commit(trans, NULL, NULL, 0) ?:
+		-BCH_ERR_transaction_restart_nested;
+	goto out;
+}
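The duplicate-handling path in `__bch2_str_hash_check_key()` relies on the small integer returned by `hash_pick_winner()`: 0 deletes the key being checked, 1 deletes the other key, and 2 (both dirents point at valid targets) renames the key being checked. A minimal sketch of that decision table, with the btree lookups replaced by plain booleans; the enum and function names here are illustrative assumptions, not identifiers from the patch, and the error (negative) return path is left out:

```c
#include <stdbool.h>

/* Illustrative names for the 0/1/2 codes used in the diff. */
enum dup_action {
	DUP_DELETE_FIRST  = 0,	/* drop the key being checked */
	DUP_DELETE_SECOND = 1,	/* drop the other key found while probing */
	DUP_RENAME_FIRST  = 2,	/* both are valid: rename the key being checked */
};

static enum dup_action pick_winner(bool identical_values,
				   bool first_has_target,
				   bool second_has_target)
{
	/* Byte-identical values: either copy can go. */
	if (identical_values)
		return DUP_DELETE_FIRST;
	/* A dirent whose target is missing loses. */
	if (!first_has_target)
		return DUP_DELETE_FIRST;
	if (!second_has_target)
		return DUP_DELETE_SECOND;
	/* Both point at live targets: keep both, under different names. */
	return DUP_RENAME_FIRST;
}
```

In the diff, only dirents get the target check; for other hash-table btrees `hash_pick_winner()` falls through to 0.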
+25 -3
fs/bcachefs/str_hash.h
···
 	struct bkey_s_c k;
 	int ret;

-	for_each_btree_key_upto_norestart(trans, *iter, desc.btree_id,
+	for_each_btree_key_max_norestart(trans, *iter, desc.btree_id,
 			SPOS(inum.inum, desc.hash_key(info, key), snapshot),
 			POS(inum.inum, U64_MAX),
 			BTREE_ITER_slots|flags, k, ret) {
···
 	if (ret)
 		return ret;

-	for_each_btree_key_upto_norestart(trans, *iter, desc.btree_id,
+	for_each_btree_key_max_norestart(trans, *iter, desc.btree_id,
 			SPOS(inum.inum, desc.hash_key(info, key), snapshot),
 			POS(inum.inum, U64_MAX),
 			BTREE_ITER_slots|BTREE_ITER_intent, k, ret)
···
 	bool found = false;
 	int ret;

-	for_each_btree_key_upto_norestart(trans, *iter, desc.btree_id,
+	for_each_btree_key_max_norestart(trans, *iter, desc.btree_id,
 			SPOS(insert->k.p.inode,
 			     desc.hash_bkey(info, bkey_i_to_s_c(insert)),
 			     snapshot),
···
 	ret = bch2_hash_delete_at(trans, desc, info, &iter, 0);
 	bch2_trans_iter_exit(trans, &iter);
 	return ret;
+}
+
+struct snapshots_seen;
+int __bch2_str_hash_check_key(struct btree_trans *,
+			      struct snapshots_seen *,
+			      const struct bch_hash_desc *,
+			      struct bch_hash_info *,
+			      struct btree_iter *, struct bkey_s_c);
+
+static inline int bch2_str_hash_check_key(struct btree_trans *trans,
+			    struct snapshots_seen *s,
+			    const struct bch_hash_desc *desc,
+			    struct bch_hash_info *hash_info,
+			    struct btree_iter *k_iter, struct bkey_s_c hash_k)
+{
+	if (hash_k.k->type != desc->key_type)
+		return 0;
+
+	if (likely(desc->hash_bkey(hash_info, hash_k) == hash_k.k->p.offset))
+		return 0;
+
+	return __bch2_str_hash_check_key(trans, s, desc, hash_info, k_iter, hash_k);
 }

 #endif /* _BCACHEFS_STR_HASH_H */
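The inline `bch2_str_hash_check_key()` wrapper returns early when a key sits exactly at its hashed offset and otherwise calls the out-of-line check, which walks the slots between the hashed offset and the key's actual position. The invariant being verified is the usual linear-probing one. A minimal sketch over a plain array, assuming a toy `struct slot` table (hypothetical; no snapshots, no btree iterators, and no wraparound):

```c
#include <stddef.h>
#include <string.h>

enum probe_result { PROBE_OK, PROBE_BAD_HASH, PROBE_DUPLICATE };

/* Toy open-addressing slot; all names here are illustrative only. */
struct slot {
	const char *name;	/* NULL marks an empty slot */
	size_t home;		/* index the name hashes to */
};

/*
 * Check the entry stored at index pos:
 *  - it must not sit before its home slot,
 *  - every slot between home and pos must be occupied,
 *  - none of those slots may hold the same name.
 */
static enum probe_result probe_check(const struct slot *tbl, size_t pos)
{
	const struct slot *key = &tbl[pos];

	if (pos < key->home)
		return PROBE_BAD_HASH;

	/* Zero iterations when pos == home: the common fast path. */
	for (size_t i = key->home; i < pos; i++) {
		if (!tbl[i].name)
			return PROBE_BAD_HASH;	/* a lookup would stop here */
		if (!strcmp(tbl[i].name, key->name))
			return PROBE_DUPLICATE;
	}
	return PROBE_OK;
}
```

An empty slot in the probe path means a lookup starting at the home slot would give up before reaching the key, which is why the diff treats it as a wrong-offset error and re-inserts the key.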
+48 -20
fs/bcachefs/subvolume.c
··· 207 207 /* Subvolumes: */ 208 208 209 209 int bch2_subvolume_validate(struct bch_fs *c, struct bkey_s_c k, 210 - enum bch_validate_flags flags) 210 + struct bkey_validate_context from) 211 211 { 212 212 struct bkey_s_c_subvolume subvol = bkey_s_c_to_subvolume(k); 213 213 int ret = 0; ··· 286 286 static __always_inline int 287 287 bch2_subvolume_get_inlined(struct btree_trans *trans, unsigned subvol, 288 288 bool inconsistent_if_not_found, 289 - int iter_flags, 290 289 struct bch_subvolume *s) 291 290 { 292 291 int ret = bch2_bkey_get_val_typed(trans, BTREE_ID_subvolumes, POS(0, subvol), 293 - iter_flags, subvolume, s); 292 + BTREE_ITER_cached| 293 + BTREE_ITER_with_updates, subvolume, s); 294 294 bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT) && 295 295 inconsistent_if_not_found, 296 296 trans->c, "missing subvolume %u", subvol); ··· 299 299 300 300 int bch2_subvolume_get(struct btree_trans *trans, unsigned subvol, 301 301 bool inconsistent_if_not_found, 302 - int iter_flags, 303 302 struct bch_subvolume *s) 304 303 { 305 - return bch2_subvolume_get_inlined(trans, subvol, inconsistent_if_not_found, iter_flags, s); 304 + return bch2_subvolume_get_inlined(trans, subvol, inconsistent_if_not_found, s); 306 305 } 307 306 308 307 int bch2_subvol_is_ro_trans(struct btree_trans *trans, u32 subvol) 309 308 { 310 309 struct bch_subvolume s; 311 - int ret = bch2_subvolume_get_inlined(trans, subvol, true, 0, &s); 310 + int ret = bch2_subvolume_get_inlined(trans, subvol, true, &s); 312 311 if (ret) 313 312 return ret; 314 313 ··· 327 328 struct bch_snapshot snap; 328 329 329 330 return bch2_snapshot_lookup(trans, snapshot, &snap) ?: 330 - bch2_subvolume_get(trans, le32_to_cpu(snap.subvol), true, 0, subvol); 331 + bch2_subvolume_get(trans, le32_to_cpu(snap.subvol), true, subvol); 331 332 } 332 333 333 334 int __bch2_subvolume_get_snapshot(struct btree_trans *trans, u32 subvolid, ··· 395 396 struct bch_subvolume s; 396 397 397 398 return lockrestart_do(trans, 398 - 
bch2_subvolume_get(trans, subvolid_to_delete, true, 399 - BTREE_ITER_cached, &s)) ?: 399 + bch2_subvolume_get(trans, subvolid_to_delete, true, &s)) ?: 400 400 for_each_btree_key_commit(trans, iter, 401 401 BTREE_ID_subvolumes, POS_MIN, BTREE_ITER_prefetch, k, 402 402 NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ··· 409 411 */ 410 412 static int __bch2_subvolume_delete(struct btree_trans *trans, u32 subvolid) 411 413 { 412 - struct btree_iter iter; 413 - struct bkey_s_c_subvolume subvol; 414 - u32 snapid; 415 - int ret = 0; 414 + struct btree_iter subvol_iter = {}, snapshot_iter = {}, snapshot_tree_iter = {}; 416 415 417 - subvol = bch2_bkey_get_iter_typed(trans, &iter, 416 + struct bkey_s_c_subvolume subvol = 417 + bch2_bkey_get_iter_typed(trans, &subvol_iter, 418 418 BTREE_ID_subvolumes, POS(0, subvolid), 419 419 BTREE_ITER_cached|BTREE_ITER_intent, 420 420 subvolume); 421 - ret = bkey_err(subvol); 421 + int ret = bkey_err(subvol); 422 422 bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT), trans->c, 423 423 "missing subvolume %u", subvolid); 424 424 if (ret) 425 - return ret; 425 + goto err; 426 426 427 - snapid = le32_to_cpu(subvol.v->snapshot); 427 + u32 snapid = le32_to_cpu(subvol.v->snapshot); 428 428 429 - ret = bch2_btree_delete_at(trans, &iter, 0) ?: 429 + struct bkey_s_c_snapshot snapshot = 430 + bch2_bkey_get_iter_typed(trans, &snapshot_iter, 431 + BTREE_ID_snapshots, POS(0, snapid), 432 + 0, snapshot); 433 + ret = bkey_err(subvol); 434 + bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT), trans->c, 435 + "missing snapshot %u", snapid); 436 + if (ret) 437 + goto err; 438 + 439 + u32 treeid = le32_to_cpu(snapshot.v->tree); 440 + 441 + struct bkey_s_c_snapshot_tree snapshot_tree = 442 + bch2_bkey_get_iter_typed(trans, &snapshot_tree_iter, 443 + BTREE_ID_snapshot_trees, POS(0, treeid), 444 + 0, snapshot_tree); 445 + 446 + if (le32_to_cpu(snapshot_tree.v->master_subvol) == subvolid) { 447 + struct bkey_i_snapshot_tree *snapshot_tree_mut = 448 + 
bch2_bkey_make_mut_typed(trans, &snapshot_tree_iter, 449 + &snapshot_tree.s_c, 450 + 0, snapshot_tree); 451 + ret = PTR_ERR_OR_ZERO(snapshot_tree_mut); 452 + if (ret) 453 + goto err; 454 + 455 + snapshot_tree_mut->v.master_subvol = 0; 456 + } 457 + 458 + ret = bch2_btree_delete_at(trans, &subvol_iter, 0) ?: 430 459 bch2_snapshot_node_set_deleted(trans, snapid); 431 - bch2_trans_iter_exit(trans, &iter); 460 + err: 461 + bch2_trans_iter_exit(trans, &snapshot_tree_iter); 462 + bch2_trans_iter_exit(trans, &snapshot_iter); 463 + bch2_trans_iter_exit(trans, &subvol_iter); 432 464 return ret; 433 465 } 434 466 ··· 703 675 /* set bi_subvol on root inode */ 704 676 int bch2_fs_upgrade_for_subvolumes(struct bch_fs *c) 705 677 { 706 - int ret = bch2_trans_commit_do(c, NULL, NULL, BCH_TRANS_COMMIT_lazy_rw, 678 + int ret = bch2_trans_commit_do(c, NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 707 679 __bch2_fs_upgrade_for_subvolumes(trans)); 708 680 bch_err_fn(c, ret); 709 681 return ret;
+9 -10
fs/bcachefs/subvolume.h
···
 #include "darray.h"
 #include "subvolume_types.h"

-enum bch_validate_flags;
-
 int bch2_check_subvols(struct bch_fs *);
 int bch2_check_subvol_children(struct bch_fs *);

-int bch2_subvolume_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_subvolume_validate(struct bch_fs *, struct bkey_s_c,
+			    struct bkey_validate_context);
 void bch2_subvolume_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 int bch2_subvolume_trigger(struct btree_trans *, enum btree_id, unsigned,
 			   struct bkey_s_c, struct bkey_s,
···

 int bch2_subvol_has_children(struct btree_trans *, u32);
 int bch2_subvolume_get(struct btree_trans *, unsigned,
-		       bool, int, struct bch_subvolume *);
+		       bool, struct bch_subvolume *);
 int __bch2_subvolume_get_snapshot(struct btree_trans *, u32,
 				  u32 *, bool);
 int bch2_subvolume_get_snapshot(struct btree_trans *, u32, u32 *);
···
 int bch2_subvol_is_ro(struct bch_fs *, u32);

 static inline struct bkey_s_c
-bch2_btree_iter_peek_in_subvolume_upto_type(struct btree_iter *iter, struct bpos end,
+bch2_btree_iter_peek_in_subvolume_max_type(struct btree_iter *iter, struct bpos end,
 					    u32 subvolid, unsigned flags)
 {
 	u32 snapshot;
···
 		return bkey_s_c_err(ret);

 	bch2_btree_iter_set_snapshot(iter, snapshot);
-	return bch2_btree_iter_peek_upto_type(iter, end, flags);
+	return bch2_btree_iter_peek_max_type(iter, end, flags);
 }

-#define for_each_btree_key_in_subvolume_upto_continue(_trans, _iter,		\
+#define for_each_btree_key_in_subvolume_max_continue(_trans, _iter,		\
 					 _end, _subvolid, _flags, _k, _do)	\
 ({										\
 	struct bkey_s_c _k;							\
···
 										\
 	do {									\
 		_ret3 = lockrestart_do(_trans, ({				\
-			(_k) = bch2_btree_iter_peek_in_subvolume_upto_type(&(_iter),	\
+			(_k) = bch2_btree_iter_peek_in_subvolume_max_type(&(_iter),	\
 						_end, _subvolid, (_flags));	\
 			if (!(_k).k)						\
 				break;						\
···
 	_ret3;									\
 })

-#define for_each_btree_key_in_subvolume_upto(_trans, _iter, _btree_id,		\
+#define for_each_btree_key_in_subvolume_max(_trans, _iter, _btree_id,		\
 				_start, _end, _subvolid, _flags, _k, _do)	\
 ({										\
 	struct btree_iter _iter;						\
 	bch2_trans_iter_init((_trans), &(_iter), (_btree_id),			\
 			     (_start), (_flags));				\
 										\
-	for_each_btree_key_in_subvolume_upto_continue(_trans, _iter,		\
+	for_each_btree_key_in_subvolume_max_continue(_trans, _iter,		\
 					_end, _subvolid, _flags, _k, _do);	\
 })

+1 -1
fs/bcachefs/subvolume_types.h
···
 #define IS_ANCESTOR_BITMAP	128

 struct snapshot_t {
+	bool			live;
 	u32			parent;
 	u32			skip[3];
 	u32			depth;
 	u32			children[2];
 	u32			subvol; /* Nonzero only if a subvolume points to this node: */
 	u32			tree;
-	u32			equiv;
 	unsigned long		is_ancestor[BITS_TO_LONGS(IS_ANCESTOR_BITMAP)];
 };

+73 -10
fs/bcachefs/super-io.c
··· 23 23 24 24 #include <linux/backing-dev.h> 25 25 #include <linux/sort.h> 26 + #include <linux/string_choices.h> 26 27 27 28 static const struct blk_holder_ops bch2_sb_handle_bdev_ops = { 28 29 }; ··· 42 41 #undef x 43 42 }; 44 43 45 - void bch2_version_to_text(struct printbuf *out, unsigned v) 44 + void bch2_version_to_text(struct printbuf *out, enum bcachefs_metadata_version v) 46 45 { 47 46 const char *str = "(unknown version)"; 48 47 ··· 55 54 prt_printf(out, "%u.%u: %s", BCH_VERSION_MAJOR(v), BCH_VERSION_MINOR(v), str); 56 55 } 57 56 58 - unsigned bch2_latest_compatible_version(unsigned v) 57 + enum bcachefs_metadata_version bch2_latest_compatible_version(enum bcachefs_metadata_version v) 59 58 { 60 59 if (!BCH_VERSION_MAJOR(v)) 61 60 return v; ··· 67 66 v = bch2_metadata_versions[i].version; 68 67 69 68 return v; 69 + } 70 + 71 + void bch2_set_version_incompat(struct bch_fs *c, enum bcachefs_metadata_version version) 72 + { 73 + mutex_lock(&c->sb_lock); 74 + SET_BCH_SB_VERSION_INCOMPAT(c->disk_sb.sb, 75 + max(BCH_SB_VERSION_INCOMPAT(c->disk_sb.sb), version)); 76 + c->disk_sb.sb->features[0] |= cpu_to_le64(BCH_FEATURE_incompat_version_field); 77 + bch2_write_super(c); 78 + mutex_unlock(&c->sb_lock); 70 79 } 71 80 72 81 const char * const bch2_sb_fields[] = { ··· 379 368 return -BCH_ERR_invalid_sb_features; 380 369 } 381 370 371 + if (BCH_VERSION_MAJOR(le16_to_cpu(sb->version)) > BCH_VERSION_MAJOR(bcachefs_metadata_version_current) || 372 + BCH_SB_VERSION_INCOMPAT(sb) > bcachefs_metadata_version_current) { 373 + prt_printf(out, "Filesystem has incompatible version"); 374 + return -BCH_ERR_invalid_sb_features; 375 + } 376 + 382 377 block_size = le16_to_cpu(sb->block_size); 383 378 384 379 if (block_size > PAGE_SECTORS) { ··· 423 406 return -BCH_ERR_invalid_sb_time_precision; 424 407 } 425 408 409 + /* old versions didn't know to downgrade this field */ 410 + if (BCH_SB_VERSION_INCOMPAT_ALLOWED(sb) > le16_to_cpu(sb->version)) 411 + 
SET_BCH_SB_VERSION_INCOMPAT_ALLOWED(sb, le16_to_cpu(sb->version)); 412 + 413 + if (BCH_SB_VERSION_INCOMPAT(sb) > BCH_SB_VERSION_INCOMPAT_ALLOWED(sb)) { 414 + prt_printf(out, "Invalid version_incompat "); 415 + bch2_version_to_text(out, BCH_SB_VERSION_INCOMPAT(sb)); 416 + prt_str(out, " > incompat_allowed "); 417 + bch2_version_to_text(out, BCH_SB_VERSION_INCOMPAT_ALLOWED(sb)); 418 + if (flags & BCH_VALIDATE_write) 419 + return -BCH_ERR_invalid_sb_version; 420 + else 421 + SET_BCH_SB_VERSION_INCOMPAT_ALLOWED(sb, BCH_SB_VERSION_INCOMPAT(sb)); 422 + } 423 + 426 424 if (!flags) { 427 425 /* 428 426 * Been seeing a bug where these are getting inexplicably ··· 459 427 if (le16_to_cpu(sb->version) <= bcachefs_metadata_version_disk_accounting_v2) 460 428 SET_BCH_SB_PROMOTE_WHOLE_EXTENTS(sb, true); 461 429 } 430 + 431 + #ifdef __KERNEL__ 432 + if (!BCH_SB_SHARD_INUMS_NBITS(sb)) 433 + SET_BCH_SB_SHARD_INUMS_NBITS(sb, ilog2(roundup_pow_of_two(num_online_cpus()))); 434 + #endif 462 435 463 436 for (opt_id = 0; opt_id < bch2_opts_nr; opt_id++) { 464 437 const struct bch_option *opt = bch2_opt_table + opt_id; ··· 556 519 c->sb.uuid = src->uuid; 557 520 c->sb.user_uuid = src->user_uuid; 558 521 c->sb.version = le16_to_cpu(src->version); 522 + c->sb.version_incompat = BCH_SB_VERSION_INCOMPAT(src); 523 + c->sb.version_incompat_allowed 524 + = BCH_SB_VERSION_INCOMPAT_ALLOWED(src); 559 525 c->sb.version_min = le16_to_cpu(src->version_min); 560 526 c->sb.version_upgrade_complete = BCH_SB_VERSION_UPGRADE_COMPLETE(src); 561 527 c->sb.nr_devices = src->nr_devices; ··· 716 676 } 717 677 718 678 enum bch_csum_type csum_type = BCH_SB_CSUM_TYPE(sb->sb); 719 - if (csum_type >= BCH_CSUM_NR) { 679 + if (csum_type >= BCH_CSUM_NR || 680 + bch2_csum_type_is_encryption(csum_type)) { 720 681 prt_printf(err, "unknown checksum type %llu", BCH_SB_CSUM_TYPE(sb->sb)); 721 682 return -BCH_ERR_invalid_sb_csum_type; 722 683 } ··· 919 878 ? 
BCH_MEMBER_ERROR_write 920 879 : BCH_MEMBER_ERROR_read, 921 880 "superblock %s error: %s", 922 - bio_data_dir(bio) ? "write" : "read", 881 + str_write_read(bio_data_dir(bio)), 923 882 bch2_blk_status_to_str(bio->bi_status))) 924 883 ca->sb_write_error = 1; 925 884 ··· 932 891 struct bch_sb *sb = ca->disk_sb.sb; 933 892 struct bio *bio = ca->disk_sb.bio; 934 893 894 + memset(ca->sb_read_scratch, 0, BCH_SB_READ_SCRATCH_BUF_SIZE); 895 + 935 896 bio_reset(bio, ca->disk_sb.bdev, REQ_OP_READ|REQ_SYNC|REQ_META); 936 897 bio->bi_iter.bi_sector = le64_to_cpu(sb->layout.sb_offset[0]); 937 898 bio->bi_end_io = write_super_endio; 938 899 bio->bi_private = ca; 939 - bch2_bio_map(bio, ca->sb_read_scratch, PAGE_SIZE); 900 + bch2_bio_map(bio, ca->sb_read_scratch, BCH_SB_READ_SCRATCH_BUF_SIZE); 940 901 941 - this_cpu_add(ca->io_done->sectors[READ][BCH_DATA_sb], 942 - bio_sectors(bio)); 902 + this_cpu_add(ca->io_done->sectors[READ][BCH_DATA_sb], bio_sectors(bio)); 943 903 944 904 percpu_ref_get(&ca->io_ref); 945 905 closure_bio_submit(bio, &c->sb_write); ··· 1084 1042 ": Superblock write was silently dropped! 
(seq %llu expected %llu)", 1085 1043 le64_to_cpu(ca->sb_read_scratch->seq), 1086 1044 ca->disk_sb.seq); 1087 - bch2_fs_fatal_error(c, "%s", buf.buf); 1045 + 1046 + if (c->opts.errors != BCH_ON_ERROR_continue && 1047 + c->opts.errors != BCH_ON_ERROR_fix_safe) { 1048 + ret = -BCH_ERR_erofs_sb_err; 1049 + bch2_fs_fatal_error(c, "%s", buf.buf); 1050 + } else { 1051 + bch_err(c, "%s", buf.buf); 1052 + } 1053 + 1088 1054 printbuf_exit(&buf); 1089 - ret = -BCH_ERR_erofs_sb_err; 1090 1055 } 1091 1056 1092 1057 if (le64_to_cpu(ca->sb_read_scratch->seq) > ca->disk_sb.seq) { ··· 1198 1149 */ 1199 1150 if (BCH_SB_VERSION_UPGRADE_COMPLETE(c->disk_sb.sb) > bcachefs_metadata_version_current) 1200 1151 SET_BCH_SB_VERSION_UPGRADE_COMPLETE(c->disk_sb.sb, bcachefs_metadata_version_current); 1152 + if (BCH_SB_VERSION_INCOMPAT_ALLOWED(c->disk_sb.sb) > bcachefs_metadata_version_current) 1153 + SET_BCH_SB_VERSION_INCOMPAT_ALLOWED(c->disk_sb.sb, bcachefs_metadata_version_current); 1201 1154 if (c->sb.version > bcachefs_metadata_version_current) 1202 1155 c->disk_sb.sb->version = cpu_to_le16(bcachefs_metadata_version_current); 1203 1156 if (c->sb.version_min > bcachefs_metadata_version_current) ··· 1208 1157 return ret; 1209 1158 } 1210 1159 1211 - void bch2_sb_upgrade(struct bch_fs *c, unsigned new_version) 1160 + void bch2_sb_upgrade(struct bch_fs *c, unsigned new_version, bool incompat) 1212 1161 { 1213 1162 lockdep_assert_held(&c->sb_lock); 1214 1163 ··· 1218 1167 1219 1168 c->disk_sb.sb->version = cpu_to_le16(new_version); 1220 1169 c->disk_sb.sb->features[0] |= cpu_to_le64(BCH_SB_FEATURES_ALL); 1170 + 1171 + if (incompat) 1172 + SET_BCH_SB_VERSION_INCOMPAT_ALLOWED(c->disk_sb.sb, 1173 + max(BCH_SB_VERSION_INCOMPAT_ALLOWED(c->disk_sb.sb), new_version)); 1221 1174 } 1222 1175 1223 1176 static int bch2_sb_ext_validate(struct bch_sb *sb, struct bch_sb_field *f, ··· 1384 1329 1385 1330 prt_printf(out, "Version:\t"); 1386 1331 bch2_version_to_text(out, le16_to_cpu(sb->version)); 1332 + 
prt_newline(out); 1333 + 1334 + prt_printf(out, "Incompatible features allowed:\t"); 1335 + bch2_version_to_text(out, BCH_SB_VERSION_INCOMPAT_ALLOWED(sb)); 1336 + prt_newline(out); 1337 + 1338 + prt_printf(out, "Incompatible features in use:\t"); 1339 + bch2_version_to_text(out, BCH_SB_VERSION_INCOMPAT(sb)); 1387 1340 prt_newline(out); 1388 1341 1389 1342 prt_printf(out, "Version upgrade complete:\t");
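The superblock validation hunk in this file enforces an ordering between three version fields: the incompatible version in use must not exceed the incompatible version allowed, which is first clamped to the superblock version. A minimal sketch of that rule, assuming a flat `struct sb_versions` (hypothetical; the real fields live in the superblock and are read through the `BCH_SB_VERSION_INCOMPAT*` accessors) and a plain `-1` in place of `-BCH_ERR_invalid_sb_version`:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flat view of the three version fields. */
struct sb_versions {
	uint16_t version;		/* on-disk format version */
	uint16_t incompat_allowed;	/* highest incompat version permitted */
	uint16_t incompat;		/* highest incompat version in use */
};

/* Returns 0 if the fields are (or were made) consistent, -1 otherwise. */
static int sb_validate_incompat(struct sb_versions *sb, bool is_write)
{
	/* Old kernels did not know to downgrade this field: clamp it. */
	if (sb->incompat_allowed > sb->version)
		sb->incompat_allowed = sb->version;

	if (sb->incompat > sb->incompat_allowed) {
		/* Never write out an inconsistent superblock... */
		if (is_write)
			return -1;
		/* ...but repair one that was read from disk. */
		sb->incompat_allowed = sb->incompat;
	}
	return 0;
}
```

The asymmetry is the point of the sketch: on the read path the field is repaired and the mount continues, while on the write path the same inconsistency is reported as an error.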
+18 -3
fs/bcachefs/super-io.h
···

 #include <asm/byteorder.h>

+#define BCH_SB_READ_SCRATCH_BUF_SIZE	4096
+
 static inline bool bch2_version_compatible(u16 version)
 {
 	return BCH_VERSION_MAJOR(version) <= BCH_VERSION_MAJOR(bcachefs_metadata_version_current) &&
 		version >= bcachefs_metadata_version_min;
 }

-void bch2_version_to_text(struct printbuf *, unsigned);
-unsigned bch2_latest_compatible_version(unsigned);
+void bch2_version_to_text(struct printbuf *, enum bcachefs_metadata_version);
+enum bcachefs_metadata_version bch2_latest_compatible_version(enum bcachefs_metadata_version);
+
+void bch2_set_version_incompat(struct bch_fs *, enum bcachefs_metadata_version);
+
+static inline bool bch2_request_incompat_feature(struct bch_fs *c,
+						 enum bcachefs_metadata_version version)
+{
+	if (unlikely(version > c->sb.version_incompat)) {
+		if (version > c->sb.version_incompat_allowed)
+			return false;
+		bch2_set_version_incompat(c, version);
+	}
+	return true;
+}

 static inline size_t bch2_sb_field_bytes(struct bch_sb_field *f)
 {
···
 }

 bool bch2_check_version_downgrade(struct bch_fs *);
-void bch2_sb_upgrade(struct bch_fs *, unsigned);
+void bch2_sb_upgrade(struct bch_fs *, unsigned, bool);

 void __bch2_sb_field_to_text(struct printbuf *, struct bch_sb *,
 			     struct bch_sb_field *);
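`bch2_request_incompat_feature()` is the gate a caller passes through before using an incompatible on-disk feature: the request succeeds immediately if the feature's version is already in use, is refused if it exceeds what the superblock allows, and otherwise records the new version. A minimal sketch of that gate, assuming a simplified `struct fs_versions` (hypothetical) and collapsing `bch2_set_version_incompat()`, which in the diff takes `sb_lock`, sets the feature bit and writes the superblock, into a field update plus a write counter:

```c
#include <stdbool.h>

/* Hypothetical in-memory view of the two incompat fields. */
struct fs_versions {
	unsigned incompat;		/* highest incompat version in use */
	unsigned incompat_allowed;	/* highest incompat version permitted */
	unsigned sb_writes;		/* stands in for bch2_write_super() calls */
};

static bool request_incompat_feature(struct fs_versions *fs, unsigned version)
{
	if (version > fs->incompat) {
		/* Not permitted on this filesystem: caller must fall back. */
		if (version > fs->incompat_allowed)
			return false;
		/* First use: record it persistently (modelled as a counter). */
		fs->incompat = version;
		fs->sb_writes++;
	}
	return true;
}
```

Because the slow path runs only the first time a given version is requested, repeated requests cost a single comparison, which matches the `unlikely()` annotation in the header.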
+21 -33
fs/bcachefs/super.c
@@ -290,7 +290,7 @@
 
 	bch2_fs_journal_stop(&c->journal);
 
-	bch_info(c, "%sshutdown complete, journal seq %llu",
+	bch_info(c, "%sclean shutdown complete, journal seq %llu",
 		 test_bit(BCH_FS_clean_shutdown, &c->flags) ? "" : "un",
 		 c->journal.seq_ondisk);
 
@@ -441,6 +441,8 @@
 {
 	int ret;
 
+	BUG_ON(!test_bit(BCH_FS_may_go_rw, &c->flags));
+
 	if (test_bit(BCH_FS_initial_gc_unfixed, &c->flags)) {
 		bch_err(c, "cannot go rw, unfixed btree errors");
 		return -BCH_ERR_erofs_unfixed_errors;
@@ -561,6 +563,7 @@
 	bch2_io_clock_exit(&c->io_clock[WRITE]);
 	bch2_io_clock_exit(&c->io_clock[READ]);
 	bch2_fs_compress_exit(c);
+	bch2_fs_btree_gc_exit(c);
 	bch2_journal_keys_put_initial(c);
 	bch2_find_btree_nodes_exit(&c->found_btree_nodes);
 	BUG_ON(atomic_read(&c->journal_keys.ref));
@@ -584,7 +587,6 @@
 #endif
 	kfree(rcu_dereference_protected(c->disk_groups, 1));
 	kfree(c->journal_seq_blacklist_table);
-	kfree(c->unused_inode_hints);
 
 	if (c->write_ref_wq)
 		destroy_workqueue(c->write_ref_wq);
@@ -766,21 +768,17 @@
 
 	refcount_set(&c->ro_ref, 1);
 	init_waitqueue_head(&c->ro_ref_wait);
+	spin_lock_init(&c->recovery_pass_lock);
 	sema_init(&c->online_fsck_mutex, 1);
-
-	init_rwsem(&c->gc_lock);
-	mutex_init(&c->gc_gens_lock);
-	atomic_set(&c->journal_keys.ref, 1);
-	c->journal_keys.initial_ref_held = true;
 
 	for (i = 0; i < BCH_TIME_STAT_NR; i++)
 		bch2_time_stats_init(&c->times[i]);
 
-	bch2_fs_gc_init(c);
 	bch2_fs_copygc_init(c);
 	bch2_fs_btree_key_cache_init_early(&c->btree_key_cache);
 	bch2_fs_btree_iter_init_early(c);
 	bch2_fs_btree_interior_update_init_early(c);
+	bch2_fs_journal_keys_init(c);
 	bch2_fs_allocator_background_init(c);
 	bch2_fs_allocator_foreground_init(c);
 	bch2_fs_rebalance_init(c);
@@ -808,9 +806,6 @@
 
 	INIT_LIST_HEAD(&c->vfs_inodes_list);
 	mutex_init(&c->vfs_inodes_lock);
-
-	c->copy_gc_enabled = 1;
-	c->rebalance.enabled = 1;
 
 	c->journal.flush_write_time = &c->times[BCH_TIME_journal_flush_write];
 	c->journal.noflush_write_time = &c->times[BCH_TIME_journal_noflush_write];
@@ -873,8 +868,6 @@
 		(btree_blocks(c) + 1) * 2 *
 		sizeof(struct sort_iter_set);
 
-	c->inode_shard_bits = ilog2(roundup_pow_of_two(num_possible_cpus()));
-
 	if (!(c->btree_update_wq = alloc_workqueue("bcachefs",
 				WQ_HIGHPRI|WQ_FREEZABLE|WQ_MEM_RECLAIM|WQ_UNBOUND, 512)) ||
 	    !(c->btree_io_complete_wq = alloc_workqueue("bcachefs_btree_io",
@@ -901,9 +894,7 @@
 	    !(c->online_reserved = alloc_percpu(u64)) ||
 	    mempool_init_kvmalloc_pool(&c->btree_bounce_pool, 1,
 				       c->opts.btree_node_size) ||
-	    mempool_init_kmalloc_pool(&c->large_bkey_pool, 1, 2048) ||
-	    !(c->unused_inode_hints = kcalloc(1U << c->inode_shard_bits,
-					      sizeof(u64), GFP_KERNEL))) {
+	    mempool_init_kmalloc_pool(&c->large_bkey_pool, 1, 2048)) {
 		ret = -BCH_ERR_ENOMEM_fs_other_alloc;
 		goto err;
 	}
@@ -917,6 +908,7 @@
 	    bch2_fs_btree_cache_init(c) ?:
 	    bch2_fs_btree_key_cache_init(&c->btree_key_cache) ?:
 	    bch2_fs_btree_interior_update_init(c) ?:
+	    bch2_fs_btree_gc_init(c) ?:
 	    bch2_fs_buckets_waiting_for_journal_init(c) ?:
 	    bch2_fs_btree_write_buffer_init(c) ?:
 	    bch2_fs_subvolumes_init(c) ?:
@@ -1033,9 +1025,12 @@
 		bch2_dev_allocator_add(c, ca);
 	bch2_recalc_capacity(c);
 
+	c->recovery_task = current;
 	ret = BCH_SB_INITIALIZED(c->disk_sb.sb)
 		? bch2_fs_recovery(c)
 		: bch2_fs_initialize(c);
+	c->recovery_task = NULL;
+
 	if (ret)
 		goto err;
 
@@ -1120,12 +1115,12 @@
 
 		prt_bdevname(&buf, fs->bdev);
 		prt_char(&buf, ' ');
-		bch2_prt_datetime(&buf, le64_to_cpu(fs->sb->write_time));;
+		bch2_prt_datetime(&buf, le64_to_cpu(fs->sb->write_time));
 		prt_newline(&buf);
 
 		prt_bdevname(&buf, sb->bdev);
 		prt_char(&buf, ' ');
-		bch2_prt_datetime(&buf, le64_to_cpu(sb->sb->write_time));;
+		bch2_prt_datetime(&buf, le64_to_cpu(sb->sb->write_time));
 		prt_newline(&buf);
 
 		if (!opts->no_splitbrain_check)
@@ -1198,7 +1193,7 @@
 
 	free_percpu(ca->io_done);
 	bch2_dev_buckets_free(ca);
-	free_page((unsigned long) ca->sb_read_scratch);
+	kfree(ca->sb_read_scratch);
 
 	bch2_time_stats_quantiles_exit(&ca->io_latency[WRITE]);
 	bch2_time_stats_quantiles_exit(&ca->io_latency[READ]);
@@ -1309,8 +1304,6 @@
 	init_completion(&ca->ref_completion);
 	init_completion(&ca->io_ref_completion);
 
-	init_rwsem(&ca->bucket_lock);
-
 	INIT_WORK(&ca->io_error_work, bch2_io_error_work);
 
 	bch2_time_stats_quantiles_init(&ca->io_latency[READ]);
@@ -1337,7 +1330,7 @@
 
 	if (percpu_ref_init(&ca->io_ref, bch2_dev_io_ref_complete,
 			    PERCPU_REF_INIT_DEAD, GFP_KERNEL) ||
-	    !(ca->sb_read_scratch = (void *) __get_free_page(GFP_KERNEL)) ||
+	    !(ca->sb_read_scratch = kmalloc(BCH_SB_READ_SCRATCH_BUF_SIZE, GFP_KERNEL)) ||
 	    bch2_dev_buckets_alloc(c, ca) ||
 	    !(ca->io_done = alloc_percpu(*ca->io_done)))
 		goto err;
@@ -1366,7 +1359,6 @@
 {
 	struct bch_member member = bch2_sb_member_get(c->disk_sb.sb, dev_idx);
 	struct bch_dev *ca = NULL;
-	int ret = 0;
 
 	if (bch2_fs_init_fault("dev_alloc"))
 		goto err;
@@ -1378,10 +1370,8 @@
 	ca->fs = c;
 
 	bch2_dev_attach(c, ca, dev_idx);
-	return ret;
+	return 0;
 err:
-	if (ca)
-		bch2_dev_free(ca);
 	return -BCH_ERR_ENOMEM_dev_alloc;
 }
 
@@ -1751,11 +1741,6 @@
 	if (ret)
 		goto err;
 
-	ret = bch2_dev_journal_alloc(ca, true);
-	bch_err_msg(c, ret, "allocating journal");
-	if (ret)
-		goto err;
-
 	down_write(&c->state_lock);
 	mutex_lock(&c->sb_lock);
 
@@ -1806,10 +1791,13 @@
 	if (ret)
 		goto err_late;
 
-	ca->new_fs_bucket_idx = 0;
-
 	if (ca->mi.state == BCH_MEMBER_STATE_rw)
 		__bch2_dev_read_write(c, ca);
+
+	ret = bch2_dev_journal_alloc(ca, false);
+	bch_err_msg(c, ret, "allocating journal");
+	if (ret)
+		goto err_late;
 
 	up_write(&c->state_lock);
 	return 0;
-10
fs/bcachefs/super.h
@@ -34,16 +34,6 @@
 int bch2_fs_read_write(struct bch_fs *);
 int bch2_fs_read_write_early(struct bch_fs *);
 
-/*
- * Only for use in the recovery/fsck path:
- */
-static inline void bch2_fs_lazy_rw(struct bch_fs *c)
-{
-	if (!test_bit(BCH_FS_rw, &c->flags) &&
-	    !test_bit(BCH_FS_was_rw, &c->flags))
-		bch2_fs_read_write_early(c);
-}
-
 void __bch2_fs_stop(struct bch_fs *);
 void bch2_fs_free(struct bch_fs *);
 void bch2_fs_stop(struct bch_fs *);
+21 -39
fs/bcachefs/sysfs.c
@@ -146,7 +146,7 @@
 write_attribute(trigger_btree_cache_shrink);
 write_attribute(trigger_btree_key_cache_shrink);
 write_attribute(trigger_freelist_wakeup);
-rw_attribute(gc_gens_pos);
+read_attribute(gc_gens_pos);
 
 read_attribute(uuid);
 read_attribute(minor);
@@ -203,7 +203,6 @@
 
 read_attribute(has_data);
 read_attribute(alloc_debug);
-read_attribute(accounting);
 read_attribute(usage_base);
 
 #define x(t, n, ...) read_attribute(t);
@@ -211,12 +210,11 @@
 #undef x
 
 rw_attribute(discard);
+read_attribute(state);
 rw_attribute(label);
 
-rw_attribute(copy_gc_enabled);
 read_attribute(copy_gc_wait);
 
-rw_attribute(rebalance_enabled);
 sysfs_pd_controller_attribute(rebalance);
 read_attribute(rebalance_status);
 
@@ -236,11 +234,6 @@
 		{ .name = #_name, .mode = 0644 };
 	BCH_TIME_STATS()
 #undef x
-
-static struct attribute sysfs_state_rw = {
-	.name = "state",
-	.mode =  0444,
-};
 
 static size_t bch2_btree_cache_size(struct bch_fs *c)
 {
@@ -302,7 +295,8 @@
 
 static void bch2_gc_gens_pos_to_text(struct printbuf *out, struct bch_fs *c)
 {
-	prt_printf(out, "%s: ", bch2_btree_id_str(c->gc_gens_btree));
+	bch2_btree_id_to_text(out, c->gc_gens_btree);
+	prt_printf(out, ": ");
 	bch2_bpos_to_text(out, c->gc_gens_pos);
 	prt_printf(out, "\n");
 }
@@ -339,9 +333,6 @@
 	if (attr == &sysfs_gc_gens_pos)
 		bch2_gc_gens_pos_to_text(out, c);
 
-	sysfs_printf(copy_gc_enabled, "%i", c->copy_gc_enabled);
-
-	sysfs_printf(rebalance_enabled, "%i", c->rebalance.enabled);
 	sysfs_pd_controller_show(rebalance, &c->rebalance.pd); /* XXX */
 
 	if (attr == &sysfs_copy_gc_wait)
@@ -405,9 +396,6 @@
 	if (attr == &sysfs_alloc_debug)
 		bch2_fs_alloc_debug_to_text(out, c);
 
-	if (attr == &sysfs_accounting)
-		bch2_fs_accounting_to_text(out, c);
-
 	if (attr == &sysfs_usage_base)
 		bch2_fs_usage_base_to_text(out, c);
 
@@ -417,23 +405,6 @@
 STORE(bch2_fs)
 {
 	struct bch_fs *c = container_of(kobj, struct bch_fs, kobj);
-
-	if (attr == &sysfs_copy_gc_enabled) {
-		ssize_t ret = strtoul_safe(buf, c->copy_gc_enabled)
-			?: (ssize_t) size;
-
-		if (c->copygc_thread)
-			wake_up_process(c->copygc_thread);
-		return ret;
-	}
-
-	if (attr == &sysfs_rebalance_enabled) {
-		ssize_t ret = strtoul_safe(buf, c->rebalance.enabled)
-			?: (ssize_t) size;
-
-		rebalance_wakeup(c);
-		return ret;
-	}
 
 	sysfs_pd_controller_store(rebalance, &c->rebalance.pd);
 
@@ -534,15 +505,22 @@
 
 	printbuf_tabstop_push(out, 32);
 
-	#define x(t, ...) \
+	#define x(t, n, f, ...) \
 		if (attr == &sysfs_##t) {					\
 			counter = percpu_u64_get(&c->counters[BCH_COUNTER_##t]);\
 			counter_since_mount = counter - c->counters_on_mount[BCH_COUNTER_##t];\
+			if (f & TYPE_SECTORS) {					\
+				counter <<= 9;					\
+				counter_since_mount <<= 9;			\
+			}							\
+										\
 			prt_printf(out, "since mount:\t");			\
+			(f & TYPE_COUNTER) ? prt_u64(out, counter_since_mount) :\
 			prt_human_readable_u64(out, counter_since_mount);	\
 			prt_newline(out);					\
 										\
 			prt_printf(out, "since filesystem creation:\t");	\
+			(f & TYPE_COUNTER) ? prt_u64(out, counter) :		\
 			prt_human_readable_u64(out, counter);			\
 			prt_newline(out);					\
 		}
@@ -610,10 +588,8 @@
 
 	&sysfs_gc_gens_pos,
 
-	&sysfs_copy_gc_enabled,
 	&sysfs_copy_gc_wait,
 
-	&sysfs_rebalance_enabled,
 	sysfs_pd_controller_files(rebalance),
 
 	&sysfs_moving_ctxts,
@@ -622,7 +598,6 @@
 
 	&sysfs_disk_groups,
 	&sysfs_alloc_debug,
-	&sysfs_accounting,
 	&sysfs_usage_base,
 	NULL
 };
@@ -681,6 +656,13 @@
 	     id == Opt_background_compression ||
 	     (id == Opt_compression && !c->opts.background_compression)))
 		bch2_set_rebalance_needs_scan(c, 0);
+
+	if (v && id == Opt_rebalance_enabled)
+		rebalance_wakeup(c);
+
+	if (v && id == Opt_copygc_enabled &&
+	    c->copygc_thread)
+		wake_up_process(c->copygc_thread);
 
 	ret = size;
 err:
@@ -790,7 +772,7 @@
 		prt_char(out, '\n');
 	}
 
-	if (attr == &sysfs_state_rw) {
+	if (attr == &sysfs_state) {
 		prt_string_option(out, bch2_member_states, ca->mi.state);
 		prt_char(out, '\n');
 	}
@@ -870,7 +852,7 @@
 
 	/* settings: */
 	&sysfs_discard,
-	&sysfs_state_rw,
+	&sysfs_state,
 	&sysfs_label,
 
 	&sysfs_has_data,
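The counter hunk in sysfs.c converts sector counts to bytes with a left shift by 9 (512-byte sectors) and prints values flagged TYPE_COUNTER as plain integers rather than in human-readable units. A minimal userspace sketch of that branching follows. The flag values and the helper names format_human_readable() and format_counter() are assumptions made for illustration; only the names TYPE_COUNTER and TYPE_SECTORS appear in the diff, and the helper below is a simplified stand-in for prt_human_readable_u64(), not its implementation.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag values: the diff shows only the names. */
enum counter_flags {
	TYPE_COUNTER = 1 << 0,	/* print as a plain integer */
	TYPE_SECTORS = 1 << 1,	/* value is in 512-byte sectors */
};

/*
 * Simplified stand-in for prt_human_readable_u64(): binary units,
 * fractional part truncated.
 */
static void format_human_readable(char *buf, size_t len, uint64_t v)
{
	static const char units[] = { 'B', 'k', 'M', 'G', 'T', 'P', 'E' };
	size_t u = 0;

	while (v >= 1024 && u + 1 < sizeof(units)) {
		v >>= 10;
		u++;
	}
	snprintf(buf, len, "%" PRIu64 "%c", v, units[u]);
}

/* Mirrors the order of operations in the x() macro: scale first, then pick a printer. */
static void format_counter(char *buf, size_t len, uint64_t counter, unsigned flags)
{
	if (flags & TYPE_SECTORS)
		counter <<= 9;	/* sectors to bytes */

	if (flags & TYPE_COUNTER)
		snprintf(buf, len, "%" PRIu64, counter);
	else
		format_human_readable(buf, len, counter);
}
```

With these assumptions, a counter of 2048 sectors formats as 1M, while an event counter of 12345 is printed unchanged.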
+13 -13
fs/bcachefs/tests.c
@@ -131,7 +131,7 @@
 	i = 0;
 
 	ret = bch2_trans_run(c,
-		for_each_btree_key_upto(trans, iter, BTREE_ID_xattrs,
+		for_each_btree_key_max(trans, iter, BTREE_ID_xattrs,
 					SPOS(0, 0, U32_MAX), POS(0, U64_MAX),
 					0, k, ({
 			BUG_ON(k.k->p.offset != i++);
@@ -186,7 +186,7 @@
 	i = 0;
 
 	ret = bch2_trans_run(c,
-		for_each_btree_key_upto(trans, iter, BTREE_ID_extents,
+		for_each_btree_key_max(trans, iter, BTREE_ID_extents,
 					SPOS(0, 0, U32_MAX), POS(0, U64_MAX),
 					0, k, ({
 			BUG_ON(bkey_start_offset(k.k) != i);
@@ -242,7 +242,7 @@
 	i = 0;
 
 	ret = bch2_trans_run(c,
-		for_each_btree_key_upto(trans, iter, BTREE_ID_xattrs,
+		for_each_btree_key_max(trans, iter, BTREE_ID_xattrs,
 					SPOS(0, 0, U32_MAX), POS(0, U64_MAX),
 					0, k, ({
 			BUG_ON(k.k->p.offset != i);
@@ -259,7 +259,7 @@
 	i = 0;
 
 	ret = bch2_trans_run(c,
-		for_each_btree_key_upto(trans, iter, BTREE_ID_xattrs,
+		for_each_btree_key_max(trans, iter, BTREE_ID_xattrs,
 					SPOS(0, 0, U32_MAX), POS(0, U64_MAX),
 					BTREE_ITER_slots, k, ({
 			if (i >= nr * 2)
@@ -302,7 +302,7 @@
 	i = 0;
 
 	ret = bch2_trans_run(c,
-		for_each_btree_key_upto(trans, iter, BTREE_ID_extents,
+		for_each_btree_key_max(trans, iter, BTREE_ID_extents,
 					SPOS(0, 0, U32_MAX), POS(0, U64_MAX),
 					0, k, ({
 			BUG_ON(bkey_start_offset(k.k) != i + 8);
@@ -320,7 +320,7 @@
 	i = 0;
 
 	ret = bch2_trans_run(c,
-		for_each_btree_key_upto(trans, iter, BTREE_ID_extents,
+		for_each_btree_key_max(trans, iter, BTREE_ID_extents,
 					SPOS(0, 0, U32_MAX), POS(0, U64_MAX),
 					BTREE_ITER_slots, k, ({
 			if (i == nr)
@@ -349,10 +349,10 @@
 	bch2_trans_iter_init(trans, &iter, BTREE_ID_xattrs,
 			     SPOS(0, 0, U32_MAX), 0);
 
-	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_upto(&iter, POS(0, U64_MAX))));
+	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_max(&iter, POS(0, U64_MAX))));
 	BUG_ON(k.k);
 
-	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_upto(&iter, POS(0, U64_MAX))));
+	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_max(&iter, POS(0, U64_MAX))));
 	BUG_ON(k.k);
 
 	bch2_trans_iter_exit(trans, &iter);
@@ -369,10 +369,10 @@
 	bch2_trans_iter_init(trans, &iter, BTREE_ID_extents,
 			     SPOS(0, 0, U32_MAX), 0);
 
-	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_upto(&iter, POS(0, U64_MAX))));
+	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_max(&iter, POS(0, U64_MAX))));
 	BUG_ON(k.k);
 
-	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_upto(&iter, POS(0, U64_MAX))));
+	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_max(&iter, POS(0, U64_MAX))));
 	BUG_ON(k.k);
 
 	bch2_trans_iter_exit(trans, &iter);
@@ -488,7 +488,7 @@
 	trans = bch2_trans_get(c);
 	bch2_trans_iter_init(trans, &iter, BTREE_ID_xattrs,
 			     SPOS(0, 0, snapid_lo), 0);
-	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_upto(&iter, POS(0, U64_MAX))));
+	lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_max(&iter, POS(0, U64_MAX))));
 
 	BUG_ON(k.k->p.snapshot != U32_MAX);
 
@@ -672,7 +672,7 @@
 
 	bch2_trans_iter_init(trans, &iter, BTREE_ID_xattrs, pos,
 			     BTREE_ITER_intent);
-	k = bch2_btree_iter_peek_upto(&iter, POS(0, U64_MAX));
+	k = bch2_btree_iter_peek_max(&iter, POS(0, U64_MAX));
 	ret = bkey_err(k);
 	if (ret)
 		goto err;
@@ -726,7 +726,7 @@
 static int seq_lookup(struct bch_fs *c, u64 nr)
 {
 	return bch2_trans_run(c,
-		for_each_btree_key_upto(trans, iter, BTREE_ID_xattrs,
+		for_each_btree_key_max(trans, iter, BTREE_ID_xattrs,
 				  SPOS(0, 0, U32_MAX), POS(0, U64_MAX),
 				  0, k,
 		0));
+67 -10
fs/bcachefs/trace.h
@@ -199,6 +199,30 @@
 		  (unsigned long long)__entry->sector, __entry->nr_sector)
 );
 
+/* disk_accounting.c */
+
+TRACE_EVENT(accounting_mem_insert,
+	TP_PROTO(struct bch_fs *c, const char *acc),
+	TP_ARGS(c, acc),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		dev			)
+		__field(unsigned,	new_nr			)
+		__string(acc,		acc			)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= c->dev;
+		__entry->new_nr		= c->accounting.k.nr;
+		__assign_str(acc);
+	),
+
+	TP_printk("%d,%d entries %u added %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->new_nr,
+		  __get_str(acc))
+);
+
 /* fs.c: */
 TRACE_EVENT(bch2_sync_fs,
 	TP_PROTO(struct super_block *sb, int wait),
@@ -848,8 +872,8 @@
 TRACE_EVENT(evacuate_bucket,
 	TP_PROTO(struct bch_fs *c, struct bpos *bucket,
 		 unsigned sectors, unsigned bucket_size,
-		 u64 fragmentation, int ret),
-	TP_ARGS(c, bucket, sectors, bucket_size, fragmentation, ret),
+		 int ret),
+	TP_ARGS(c, bucket, sectors, bucket_size, ret),
 
 	TP_STRUCT__entry(
 		__field(dev_t,		dev			)
@@ -857,7 +881,6 @@
 		__field(u64,		bucket			)
 		__field(u32,		sectors			)
 		__field(u32,		bucket_size		)
-		__field(u64,		fragmentation		)
 		__field(int,		ret			)
 	),
 
@@ -867,15 +890,14 @@
 		__entry->bucket			= bucket->offset;
 		__entry->sectors		= sectors;
 		__entry->bucket_size		= bucket_size;
-		__entry->fragmentation		= fragmentation;
 		__entry->ret			= ret;
 	),
 
-	TP_printk("%d,%d %llu:%llu sectors %u/%u fragmentation %llu ret %i",
+	TP_printk("%d,%d %llu:%llu sectors %u/%u ret %i",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->member, __entry->bucket,
 		  __entry->sectors, __entry->bucket_size,
-		  __entry->fragmentation, __entry->ret)
+		  __entry->ret)
 );
 
 TRACE_EVENT(copygc,
@@ -1316,6 +1338,12 @@
 		  __entry->new_u64s)
 );
 
+DEFINE_EVENT(transaction_event,	trans_restart_write_buffer_flush,
+	TP_PROTO(struct btree_trans *trans,
+		 unsigned long caller_ip),
+	TP_ARGS(trans, caller_ip)
+);
+
 TRACE_EVENT(path_downgrade,
 	TP_PROTO(struct btree_trans *trans,
 		 unsigned long caller_ip,
@@ -1352,10 +1380,21 @@
 		  __entry->pos_snapshot)
 );
 
-DEFINE_EVENT(transaction_event,	trans_restart_write_buffer_flush,
-	TP_PROTO(struct btree_trans *trans,
-		 unsigned long caller_ip),
-	TP_ARGS(trans, caller_ip)
+TRACE_EVENT(key_cache_fill,
+	TP_PROTO(struct btree_trans *trans, const char *key),
+	TP_ARGS(trans, key),
+
+	TP_STRUCT__entry(
+		__array(char,		trans_fn, 32	)
+		__string(key,		key			)
+	),
+
+	TP_fast_assign(
+		strscpy(__entry->trans_fn, trans->fn, sizeof(__entry->trans_fn));
+		__assign_str(key);
+	),
+
+	TP_printk("%s %s", __entry->trans_fn, __get_str(key))
 );
 
 TRACE_EVENT(write_buffer_flush,
@@ -1412,6 +1451,24 @@
 	),
 
 	TP_printk("%zu/%zu", __entry->slowpath, __entry->total)
+);
+
+TRACE_EVENT(write_buffer_maybe_flush,
+	TP_PROTO(struct btree_trans *trans, unsigned long caller_ip, const char *key),
+	TP_ARGS(trans, caller_ip, key),
+
+	TP_STRUCT__entry(
+		__array(char,			trans_fn, 32	)
+		__field(unsigned long,		caller_ip	)
+		__string(key,			key		)
+	),
+
+	TP_fast_assign(
+		strscpy(__entry->trans_fn, trans->fn, sizeof(__entry->trans_fn));
+		__assign_str(key);
+	),
+
+	TP_printk("%s %pS %s", __entry->trans_fn, (void *) __entry->caller_ip, __get_str(key))
 );
 
 DEFINE_EVENT(fs_str, rebalance_extent,
+32
fs/bcachefs/util.h
@@ -55,6 +55,16 @@
 		PAGE_SIZE);
 }
 
+static inline void *bch2_kvmalloc(size_t n, gfp_t flags)
+{
+	void *p = unlikely(n >= INT_MAX)
+		? vmalloc(n)
+		: kvmalloc(n, flags & ~__GFP_ZERO);
+	if (p && (flags & __GFP_ZERO))
+		memset(p, 0, n);
+	return p;
+}
+
 #define init_heap(heap, _size, gfp)					\
 ({									\
 	(heap)->nr = 0;							\
@@ -316,6 +326,19 @@
 	typeof(ptr) _ptr = ptr;						\
 	_ptr ? container_of(_ptr, type, member) : NULL;			\
 })
+
+static inline struct list_head *list_pop(struct list_head *head)
+{
+	if (list_empty(head))
+		return NULL;
+
+	struct list_head *ret = head->next;
+	list_del_init(ret);
+	return ret;
+}
+
+#define list_pop_entry(head, type, member)		\
+	container_of_or_null(list_pop(head), type, member)
 
 /* Does linear interpolation between powers of two */
 static inline unsigned fract_exp_two(unsigned x, unsigned fract_bits)
@@ -694,6 +717,15 @@
 static inline bool test_bit_le64(size_t bit, __le64 *addr)
 {
 	return (addr[bit / 64] & cpu_to_le64(BIT_ULL(bit % 64))) != 0;
+}
+
+static inline void memcpy_swab(void *_dst, void *_src, size_t len)
+{
+	u8 *dst = _dst + len;
+	u8 *src = _src;
+
+	while (len--)
+		*--dst = *src++;
 }
 
 #endif /* _BCACHEFS_UTIL_H */
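Two of the helpers added to util.h are small enough to try outside the kernel. The sketch below is a userspace approximation, not the kernel code: struct list_head, list_init() and list_add_tail() are minimal stand-ins written for this example, and memcpy_swab() uses explicit uint8_t casts instead of the GNU void-pointer arithmetic in the diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-reversing copy: dst[len - 1 - i] = src[i]. The buffers must not overlap. */
static void memcpy_swab(void *dst_, const void *src_, size_t len)
{
	uint8_t *dst = (uint8_t *)dst_ + len;
	const uint8_t *src = src_;

	while (len--)
		*--dst = *src++;
}

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

/* Stand-in for INIT_LIST_HEAD(): an empty list points at itself. */
static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

/* Insert n just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Detach and return the first entry, or NULL if the list is empty. */
static struct list_head *list_pop(struct list_head *head)
{
	if (head->next == head)
		return NULL;

	struct list_head *ret = head->next;

	ret->prev->next = ret->next;
	ret->next->prev = ret->prev;
	list_init(ret);		/* same end state as list_del_init() */
	return ret;
}
```

Popping from a list built with list_add_tail() returns entries in insertion order, and the NULL return on an empty list is what lets list_pop_entry() in the diff rely on container_of_or_null().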
+3 -2
fs/bcachefs/varint.c
@@ -9,6 +9,7 @@
 #include <valgrind/memcheck.h>
 #endif
 
+#include "errcode.h"
 #include "varint.h"
 
 /**
@@ -53,7 +54,7 @@
 	u64 v;
 
 	if (unlikely(in + bytes > end))
-		return -1;
+		return -BCH_ERR_varint_decode_error;
 
 	if (likely(bytes < 9)) {
 		__le64 v_le = 0;
@@ -115,7 +116,7 @@
 	unsigned bytes = ffz(*in) + 1;
 
 	if (unlikely(in + bytes > end))
-		return -1;
+		return -BCH_ERR_varint_decode_error;
 
 	if (likely(bytes < 9)) {
 		v >>= bytes;
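The varint.c change replaces a bare -1 with a named error code on the bounds check. A portable sketch of a decoder with that check is shown below. It assumes the format implied by the hunks (the length is the index of the first zero bit in the first byte plus one, and the value is shifted right by the length) plus one assumption not visible in this diff: that a first byte of 0xff introduces a 9-byte encoding carrying a raw little-endian 64-bit value. VARINT_DECODE_ERROR, first_zero_bit() and varint_decode() are hypothetical names standing in for the kernel's BCH_ERR_varint_decode_error, ffz() and decode functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error value; the kernel returns -BCH_ERR_varint_decode_error. */
#define VARINT_DECODE_ERROR 1

/* Index of the first zero bit of b, i.e. the number of trailing one bits (0..8). */
static unsigned first_zero_bit(uint8_t b)
{
	unsigned n = 0;

	while (b & 1) {
		b >>= 1;
		n++;
	}
	return n;
}

/*
 * Decode one varint from [in, end). Returns the number of bytes consumed,
 * or -VARINT_DECODE_ERROR if the encoding would run past end.
 */
static int varint_decode(const uint8_t *in, const uint8_t *end, uint64_t *out)
{
	uint64_t v = 0;

	if (in >= end)
		return -VARINT_DECODE_ERROR;

	unsigned bytes = first_zero_bit(*in) + 1;

	if ((size_t)(end - in) < bytes)
		return -VARINT_DECODE_ERROR;

	if (bytes < 9) {
		/* Assemble little-endian, then drop the length prefix bits. */
		for (unsigned i = 0; i < bytes; i++)
			v |= (uint64_t)in[i] << (8 * i);
		v >>= bytes;
	} else {
		/* Assumed 9-byte form: 0xff marker, then 8 raw little-endian bytes. */
		for (unsigned i = 0; i < 8; i++)
			v |= (uint64_t)in[1 + i] << (8 * i);
	}

	*out = v;
	return (int)bytes;
}
```

Returning a distinct negative code lets a caller tell a truncated buffer apart from other failures, which a bare -1 does not.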
+3 -10
fs/bcachefs/xattr.c
@@ -71,7 +71,7 @@
 };
 
 int bch2_xattr_validate(struct bch_fs *c, struct bkey_s_c k,
-			enum bch_validate_flags flags)
+			struct bkey_validate_context from)
 {
 	struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);
 	unsigned val_u64s = xattr_val_u64s(xattr.v->x_name_len,
@@ -309,7 +309,7 @@
 	u64 offset = 0, inum = inode->ei_inode.bi_inum;
 
 	int ret = bch2_trans_run(c,
-		for_each_btree_key_in_subvolume_upto(trans, iter, BTREE_ID_xattrs,
+		for_each_btree_key_in_subvolume_max(trans, iter, BTREE_ID_xattrs,
 				   POS(inum, offset),
 				   POS(inum, U64_MAX),
 				   inode->ei_inum.subvol, 0, k, ({
@@ -565,13 +565,6 @@
 	ret = bch2_write_inode(c, inode, inode_opt_set_fn, &s, 0);
 err:
 	mutex_unlock(&inode->ei_update_lock);
-
-	if (value &&
-	    (opt_id == Opt_background_target ||
-	     opt_id == Opt_background_compression ||
-	     (opt_id == Opt_compression && !inode_opt_get(c, &inode->ei_inode, background_compression))))
-		bch2_set_rebalance_needs_scan(c, inode->ei_inode.bi_inum);
-
 err_class_exit:
 	return bch2_err_class(ret);
 }
@@ -609,7 +602,7 @@
 
 #endif /* NO_BCACHEFS_FS */
 
-const struct xattr_handler *bch2_xattr_handlers[] = {
+const struct xattr_handler * const bch2_xattr_handlers[] = {
 	&bch_xattr_user_handler,
 	&bch_xattr_trusted_handler,
 	&bch_xattr_security_handler,
+3 -2
fs/bcachefs/xattr.h
@@ -6,7 +6,8 @@
 
 extern const struct bch_hash_desc bch2_xattr_hash_desc;
 
-int bch2_xattr_validate(struct bch_fs *, struct bkey_s_c, enum bch_validate_flags);
+int bch2_xattr_validate(struct bch_fs *, struct bkey_s_c,
+			struct bkey_validate_context);
 void bch2_xattr_to_text(struct printbuf *, struct bch_fs *, struct bkey_s_c);
 
 #define bch2_bkey_ops_xattr ((struct bkey_ops) {	\
@@ -44,6 +45,6 @@
 
 ssize_t bch2_xattr_list(struct dentry *, char *, size_t);
 
-extern const struct xattr_handler *bch2_xattr_handlers[];
+extern const struct xattr_handler * const bch2_xattr_handlers[];
 
 #endif /* _BCACHEFS_XATTR_H */
+2 -1
fs/fs_parser.c
@@ -13,7 +13,7 @@
 #include <linux/namei.h>
 #include "internal.h"
 
-static const struct constant_table bool_names[] = {
+const struct constant_table bool_names[] = {
 	{ "0",		false },
 	{ "1",		true },
 	{ "false",	false },
@@ -22,6 +22,7 @@
 	{ "yes",	true },
 	{ },
 };
+EXPORT_SYMBOL(bool_names);
 
 static const struct constant_table *
 __lookup_constant(const struct constant_table *tbl, const char *name)
+2
include/linux/fs_parser.h
@@ -84,6 +84,8 @@
 
 extern int lookup_constant(const struct constant_table tbl[], const char *name, int not_found);
 
+extern const struct constant_table bool_names[];
+
 #ifdef CONFIG_VALIDATE_FS_PARSER
 extern bool validate_constant_table(const struct constant_table *tbl, size_t tbl_size,
 				    int low, int high, int special);
+2 -2
include/linux/min_heap.h
@@ -15,8 +15,8 @@
  */
 #define MIN_HEAP_PREALLOCATED(_type, _name, _nr)	\
 struct _name {	\
-	int nr;	\
-	int size;	\
+	size_t nr;	\
+	size_t size;	\
 	_type *data;	\
 	_type preallocated[_nr];	\
 }