Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs

Pull bcachefs updates from Kent Overstreet:
"On disk format is now soft frozen: no more required/automatic upgrades
are anticipated before taking off the experimental label.

Major changes/features since 6.14:

- Scrub

- Blocksize greater than page size support

- A number of "rebalance spinning and doing no work" issues have been
fixed; we now check if the write allocation will succeed in
bch2_data_update_init(), before kicking off the read.

There's still more work to do in this area. Later we may want to
add another bitset btree, like rebalance_work, to track "extents
that rebalance was requested to move but couldn't", e.g. due to
destination target having insufficient online devices.

- We can now support scaling well into the petabyte range: latest
bcachefs-tools will pick an appropriate bucket size at format time
to ensure fsck can run in available memory (e.g. a server with
256GB of ram and 100PB of storage would want 16MB buckets).

On disk format changes:

- 1.21: cached backpointers (scalability improvement)

Cached replicas now get backpointers, which means we no longer rely
on incrementing bucket generation numbers to invalidate cached
data: this lets us get rid of the bucket generation number garbage
collection, which had to periodically rescan all extents to
recompute bucket oldest_gen.

Bucket generation numbers are now only used as a consistency check,
but they're quite useful for that.

- 1.22: stripe backpointers

Stripes now have backpointers: erasure coded stripes have their own
checksums, separate from the checksums for the extents they contain
(and stripe checksums also cover the parity blocks). This is
required for implementing scrub for stripes.

- 1.23: stripe lru (scalability improvement)

Persistent lru for stripes, ordered by "number of empty blocks".
This is used by the stripe creation path, which depending on free
space may create a new stripe out of a partially empty existing
stripe instead of starting a brand new stripe.

This replaces an in-memory heap, and means we no longer have to
read in the stripes btree at startup.

- 1.24: casefolding

Case insensitive directory support, courtesy of Valve.

This is an incompatible feature; to enable it, mount with
-o version_upgrade=incompatible

- 1.25: extent_flags

Another incompatible feature requiring explicit opt-in to enable.

This adds a flags entry to extents, and a flag bit that marks
extents as poisoned.

A poisoned extent is an extent that was unreadable due to checksum
errors. We can't move such extents without giving them a new
checksum, and we may have to move them (for e.g. copygc or device
evacuate). We also don't want to delete them: in the future we'll
have an API that lets userspace ignore checksum errors and attempt
to deal with simple bitrot itself. Marking them as poisoned lets us
continue to return the correct error to userspace on normal read
calls.

Other changes/features:

- BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
command, which shows a live view of all internal filesystem
counters.

- Improved journal pipelining: we can now have 16 journal writes in
flight concurrently, up from 4. We're logging significantly more to
the journal than we used to with all the recent disk accounting
changes and additions, so some users should see a performance
increase on some workloads.

- BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
devices marked as failed. Now we will attempt to read from them,
but only if we have no better options.

- New option, write_error_timeout: devices will be kicked out of the
filesystem if all writes have been failing for x number of seconds.

We now also kick devices out when notified by blk_holder_ops that
they've gone offline.

- Device option handling improvements: the discard option should now
be working as expected (additionally, in -tools, all device options
that can be set at format time can now be set at device add time,
i.e. data_allowed, state).

- We now try harder to read data after a checksum error: we'll do
additional retries if necessary to a device after it gave us
data with a checksum error.

- More self healing work: the full inode <-> dirent consistency
checks that are currently run by fsck are now also run every time
we do a lookup, meaning we'll be able to correct errors at runtime.
Runtime self healing will be flipped on after the new changes have
seen more testing; currently they're just checking for consistency.

- KMSAN fixes: our KMSAN builds should be nearly clean now, which
will put a massive dent in the syzbot dashboard"
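
The petabyte-scale example in the pull message can be checked with simple arithmetic. The sketch below is not the heuristic used by bcachefs-tools; it only shows how many bytes of RAM each bucket gets for a given bucket size, and how a format tool could pick the smallest power-of-two bucket size under a per-bucket memory budget. The function names and the `per_bucket_cost` parameter are hypothetical, and "GB", "PB" and "MB" are read as binary units.

```c
#include <assert.h>
#include <stdint.h>

/* Binary units; reading the message's GB/PB/MB as GiB/PiB/MiB is an assumption. */
#define MIB (UINT64_C(1) << 20)
#define GIB (UINT64_C(1) << 30)
#define PIB (UINT64_C(1) << 50)

/* Bytes of RAM available per bucket if fsck could use all of memory. */
uint64_t ram_per_bucket(uint64_t ram, uint64_t capacity, uint64_t bucket_size)
{
	assert(bucket_size && capacity >= bucket_size);
	return ram / (capacity / bucket_size);
}

/*
 * Smallest power-of-two bucket size (starting from 512 bytes) for which
 * per_bucket_cost bytes of in-memory state per bucket still fit in ram.
 * per_bucket_cost is a hypothetical budget, not a figure from bcachefs.
 */
uint64_t min_bucket_size(uint64_t ram, uint64_t capacity, uint64_t per_bucket_cost)
{
	assert(ram && capacity && per_bucket_cost);
	uint64_t size = 512;

	/* Double the bucket size until the bucket count fits the budget. */
	while (size < capacity && capacity / size > ram / per_bucket_cost)
		size <<= 1;
	return size;
}
```

With 256GB of RAM, 100PB of storage and 16MB buckets there are about 6.7 billion buckets, which leaves roughly 40 bytes of RAM per bucket. Under an assumed budget of 32 bytes per bucket, `min_bucket_size()` lands on the same 16MB figure quoted in the message.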

* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
bcachefs: Kill unnecessary bch2_dev_usage_read()
bcachefs: btree node write errors now print btree node
bcachefs: Fix race in print_chain()
bcachefs: btree_trans_restart_foreign_task()
bcachefs: bch2_disk_accounting_mod2()
bcachefs: zero init journal bios
bcachefs: Eliminate padding in move_bucket_key
bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
bcachefs: kmsan asserts
bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
bcachefs: Disable asm memcpys when kmsan enabled
bcachefs: Handle backpointers with unknown data types
bcachefs: Count BCH_DATA_parity backpointers correctly
bcachefs: Run bch2_check_dirent_target() at lookup time
bcachefs: Refactor bch2_check_dirent_target()
bcachefs: Move bch2_check_dirent_target() to namei.c
bcachefs: fs-common.c -> namei.c
bcachefs: EIO cleanup
bcachefs: bch2_write_prep_encoded_data() now returns errcode
bcachefs: Simplify bch2_write_op_error()
...

+4880 -3008
+25 -18
Documentation/filesystems/bcachefs/SubmittingPatches.rst
···
-Submitting patches to bcachefs:
-===============================
+Submitting patches to bcachefs
+==============================
+
+Here are suggestions for submitting patches to bcachefs subsystem.
+
+Submission checklist
+--------------------
 
 Patches must be tested before being submitted, either with the xfstests suite
-[0], or the full bcachefs test suite in ktest [1], depending on what's being
+[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
 touched. Note that ktest wraps xfstests and will be an easier method to running
 it for most users; it includes single-command wrappers for all the mainstream
 in-kernel local filesystems.
···
 Focus on writing code that reads well and is organized well; code should be
 aesthetically pleasing.
 
-CI:
-===
+CI
+--
 
 Instead of running your tests locally, when running the full test suite it's
 preferable to let a server farm do it in parallel, and then have the results
 in a nice test dashboard (which can tell you which failures are new, and
 presents results in a git log view, avoiding the need for most bisecting).
 
-That exists [2], and community members may request an account. If you work for
+That exists [2]_, and community members may request an account. If you work for
 a big tech company, you'll need to help out with server costs to get access -
 but the CI is not restricted to running bcachefs tests: it runs any ktest test
 (which generally makes it easy to wrap other tests that can run in qemu).
 
-Other things to think about:
-============================
+Other things to think about
+---------------------------
 
 - How will we debug this code? Is there sufficient introspection to diagnose
   when something starts acting wonky on a user machine?
···
   tested? (Automated tests exists but aren't in the CI, due to the hassle of
   disk image management; coordinate to have them run.)
 
-Mailing list, IRC:
-==================
+Mailing list, IRC
+-----------------
 
-Patches should hit the list [3], but much discussion and code review happens on
-IRC as well [4]; many people appreciate the more conversational approach and
-quicker feedback.
+Patches should hit the list [3]_, but much discussion and code review happens
+on IRC as well [4]_; many people appreciate the more conversational approach
+and quicker feedback.
 
 Additionally, we have a lively user community doing excellent QA work, which
 exists primarily on IRC. Please make use of that resource; user feedback is
 important for any nontrivial feature, and documenting it in commit messages
 would be a good idea.
 
-[0]: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
-[1]: https://evilpiepirate.org/git/ktest.git/
-[2]: https://evilpiepirate.org/~testdashboard/ci/
-[3]: linux-bcachefs@vger.kernel.org
-[4]: irc.oftc.net#bcache, #bcachefs-dev
+.. rubric:: References
+
+.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
+.. [1] https://evilpiepirate.org/git/ktest.git/
+.. [2] https://evilpiepirate.org/~testdashboard/ci/
+.. [3] linux-bcachefs@vger.kernel.org
+.. [4] irc.oftc.net#bcache, #bcachefs-dev
+90
Documentation/filesystems/bcachefs/casefolding.rst
···
+.. SPDX-License-Identifier: GPL-2.0
+
+Casefolding
+===========
+
+bcachefs has support for case-insensitive file and directory
+lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
+casefolding attributes.
+
+The main usecase for casefolding is compatibility with software written
+against other filesystems that rely on casefolded lookups
+(eg. NTFS and Wine/Proton).
+Taking advantage of file-system level casefolding can lead to great
+loading time gains in many applications and games.
+
+Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
+Once a directory has been flagged for casefolding, a feature bit
+is enabled on the superblock which marks the filesystem as using
+casefolding.
+When the feature bit for casefolding is enabled, it is no longer possible
+to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
+
+On the lookup/query side: casefolding is implemented by allocating a new
+string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
+casefold the query string.
+
+On the dirent side: casefolding is implemented by ensuring the `bkey`'s
+hash is made from the casefolded string and storing the cached casefolded
+name with the regular name in the dirent.
+
+The structure looks like this:
+
+* Regular: [dirent data][regular name][nul][nul]...
+* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
+
+(Do note, the number of NULs here is merely for illustration; their count can
+vary per-key, and they may not even be present if the key is aligned to
+`sizeof(u64)`.)
+
+This is efficient as it means that for all file lookups that require casefolding,
+it has identical performance to a regular lookup:
+a hash comparison and a `memcmp` of the name.
+
+Rationale
+---------
+
+Several designs were considered for this system:
+One was to introduce a dirent_v2, however that would be painful especially as
+the hash system only has support for a single key type. This would also need
+`BCH_NAME_MAX` to change between versions, and a new feature bit.
+
+Another option was to store without the two lengths, and just take the length of
+the regular name and casefolded name contiguously / 2 as the length. This would
+assume that the regular length == casefolded length, but that could potentially
+not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
+the lowercase unicode glyph.
+It would be possible to disregard the casefold cache for those cases, but it was
+decided to simply encode the two string lengths in the key to avoid random
+performance issues if this edgecase was ever hit.
+
+The option settled on was to use a free-bit in d_type to mark a dirent as having
+a casefold cache, and then treat the first 4 bytes the name block as lengths.
+You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
+
+The feature bit was used to allow casefolding support to be enabled for the majority
+of users, but some allow users who have no need for the feature to still use bcachefs as
+`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
+which may be decider between using bcachefs for eg. embedded platforms.
+
+Other filesystems like ext4 and f2fs have a super-block level option for casefolding
+encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
+any encodings than a single UTF-8 version. When future encodings are desirable,
+they will be added trivially using the opts mechanism.
+
+dentry/dcache considerations
+----------------------------
+
+Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
+negative dentry's.
+
+This is because currently doing so presents a problem in the following scenario:
+
+- Lookup file "blAH" in a casefolded directory
+- Creation of file "BLAH" in a casefolded directory
+- Lookup file "blAH" in a casefolded directory
+
+This would fail if negative dentry's were cached.
+
+This is slightly suboptimal, but could be fixed in future with some vfs work.
+
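
The casefolded name layout described in casefolding.rst ([reg len][cf len][regular name][casefolded name]) can be illustrated with a small userspace sketch. This is not the kernel's `bch_dirent` definition: the struct, the function names and `NAME_MAX_SKETCH` are hypothetical, and an ASCII-only `tolower()` loop stands in for `utf8_casefold`, so both names always have the same length here, which the document notes is not guaranteed for real Unicode input.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define NAME_MAX_SKETCH 64	/* hypothetical; stands in for BCH_NAME_MAX */

/* Hypothetical layout: [reg len][cf len][regular name][casefolded name] */
struct cf_name_block {
	uint16_t	name_len;
	uint16_t	cf_name_len;
	unsigned char	names[2 * NAME_MAX_SKETCH];
};

/* ASCII-only stand-in for utf8_casefold(); returns the folded length. */
static size_t ascii_casefold(const char *src, size_t len, unsigned char *dst)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = (unsigned char) tolower((unsigned char) src[i]);
	return len;
}

/* Store the regular name followed by its cached casefolded form. */
void cf_block_pack(struct cf_name_block *b, const char *name)
{
	size_t len = strlen(name);

	assert(len <= NAME_MAX_SKETCH);
	memset(b, 0, sizeof(*b));
	memcpy(b->names, name, len);
	b->name_len = (uint16_t) len;
	b->cf_name_len = (uint16_t) ascii_casefold(name, len, b->names + len);
}

/* Casefold the query once, then compare it against the cached folded name. */
int cf_block_matches(const struct cf_name_block *b, const char *query)
{
	unsigned char folded[NAME_MAX_SKETCH];
	size_t len = strlen(query);

	if (len > NAME_MAX_SKETCH)
		return 0;

	size_t flen = ascii_casefold(query, len, folded);

	return flen == b->cf_name_len &&
	       !memcmp(folded, b->names + b->name_len, flen);
}
```

Because the folded name is stored next to the regular one, a lookup only folds the query string and then does a single `memcmp`, which matches the performance argument made in the document.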
+19 -1
Documentation/filesystems/bcachefs/index.rst
···
 bcachefs Documentation
 ======================
 
+Subsystem-specific development process notes
+--------------------------------------------
+
+Development notes specific to bcachefs. These are intended to supplement
+:doc:`general kernel development handbook </process/index>`.
+
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :numbered:
 
    CodingStyle
    SubmittingPatches
+
+Filesystem implementation
+-------------------------
+
+Documentation for filesystem features and their implementation details.
+At this moment, only a few of these are described here.
+
+.. toctree::
+   :maxdepth: 1
+   :numbered:
+
+   casefolding
    errorcodes
+1 -1
fs/bcachefs/Kconfig
···
 	select ZSTD_COMPRESS
 	select ZSTD_DECOMPRESS
 	select CRYPTO
-	select CRYPTO_SHA256
+	select CRYPTO_LIB_SHA256
 	select CRYPTO_CHACHA20
 	select CRYPTO_POLY1305
 	select KEYS
+2 -1
fs/bcachefs/Makefile
···
 	extent_update.o \
 	eytzinger.o \
 	fs.o \
-	fs-common.o \
 	fs-ioctl.o \
 	fs-io.o \
 	fs-io-buffered.o \
···
 	migrate.o \
 	move.o \
 	movinggc.o \
+	namei.o \
 	nocow_locking.o \
 	opts.o \
 	printbuf.o \
+	progress.o \
 	quota.o \
 	rebalance.o \
 	rcu_pending.o \
+134 -56
fs/bcachefs/alloc_background.c
··· 232 232 int ret = 0; 233 233 234 234 bkey_fsck_err_on(bch2_alloc_unpack_v3(&u, k), 235 - c, alloc_v2_unpack_error, 235 + c, alloc_v3_unpack_error, 236 236 "unpack error"); 237 237 fsck_err: 238 238 return ret; ··· 777 777 s64 delta_sectors, 778 778 s64 delta_fragmented, unsigned flags) 779 779 { 780 - struct disk_accounting_pos acc = { 781 - .type = BCH_DISK_ACCOUNTING_dev_data_type, 782 - .dev_data_type.dev = ca->dev_idx, 783 - .dev_data_type.data_type = data_type, 784 - }; 785 780 s64 d[3] = { delta_buckets, delta_sectors, delta_fragmented }; 786 781 787 - return bch2_disk_accounting_mod(trans, &acc, d, 3, flags & BTREE_TRIGGER_gc); 782 + return bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc, 783 + d, dev_data_type, 784 + .dev = ca->dev_idx, 785 + .data_type = data_type); 788 786 } 789 787 790 788 int bch2_alloc_key_to_dev_counters(struct btree_trans *trans, struct bch_dev *ca, ··· 835 837 836 838 struct bch_dev *ca = bch2_dev_bucket_tryget(c, new.k->p); 837 839 if (!ca) 838 - return -EIO; 840 + return -BCH_ERR_trigger_alloc; 839 841 840 842 struct bch_alloc_v4 old_a_convert; 841 843 const struct bch_alloc_v4 *old_a = bch2_alloc_to_v4(old, &old_a_convert); ··· 869 871 if (data_type_is_empty(new_a->data_type) && 870 872 BCH_ALLOC_V4_NEED_INC_GEN(new_a) && 871 873 !bch2_bucket_is_open_safe(c, new.k->p.inode, new.k->p.offset)) { 874 + if (new_a->oldest_gen == new_a->gen && 875 + !bch2_bucket_sectors_total(*new_a)) 876 + new_a->oldest_gen++; 872 877 new_a->gen++; 873 878 SET_BCH_ALLOC_V4_NEED_INC_GEN(new_a, false); 874 879 alloc_data_type_set(new_a, new_a->data_type); ··· 890 889 !new_a->io_time[READ]) 891 890 new_a->io_time[READ] = bch2_current_io_time(c, READ); 892 891 893 - u64 old_lru = alloc_lru_idx_read(*old_a); 894 - u64 new_lru = alloc_lru_idx_read(*new_a); 895 - if (old_lru != new_lru) { 896 - ret = bch2_lru_change(trans, new.k->p.inode, 897 - bucket_to_u64(new.k->p), 898 - old_lru, new_lru); 899 - if (ret) 900 - goto err; 901 - } 892 + ret = 
bch2_lru_change(trans, new.k->p.inode, 893 + bucket_to_u64(new.k->p), 894 + alloc_lru_idx_read(*old_a), 895 + alloc_lru_idx_read(*new_a)); 896 + if (ret) 897 + goto err; 902 898 903 - old_lru = alloc_lru_idx_fragmentation(*old_a, ca); 904 - new_lru = alloc_lru_idx_fragmentation(*new_a, ca); 905 - if (old_lru != new_lru) { 906 - ret = bch2_lru_change(trans, 907 - BCH_LRU_FRAGMENTATION_START, 908 - bucket_to_u64(new.k->p), 909 - old_lru, new_lru); 910 - if (ret) 911 - goto err; 912 - } 899 + ret = bch2_lru_change(trans, 900 + BCH_LRU_BUCKET_FRAGMENTATION, 901 + bucket_to_u64(new.k->p), 902 + alloc_lru_idx_fragmentation(*old_a, ca), 903 + alloc_lru_idx_fragmentation(*new_a, ca)); 904 + if (ret) 905 + goto err; 913 906 914 907 if (old_a->gen != new_a->gen) { 915 908 ret = bch2_bucket_gen_update(trans, new.k->p, new_a->gen); ··· 1029 1034 invalid_bucket: 1030 1035 bch2_fs_inconsistent(c, "reference to invalid bucket\n %s", 1031 1036 (bch2_bkey_val_to_text(&buf, c, new.s_c), buf.buf)); 1032 - ret = -EIO; 1037 + ret = -BCH_ERR_trigger_alloc; 1033 1038 goto err; 1034 1039 } 1035 1040 ··· 1700 1705 1701 1706 u64 lru_idx = alloc_lru_idx_fragmentation(*a, ca); 1702 1707 if (lru_idx) { 1703 - ret = bch2_lru_check_set(trans, BCH_LRU_FRAGMENTATION_START, 1708 + ret = bch2_lru_check_set(trans, BCH_LRU_BUCKET_FRAGMENTATION, 1709 + bucket_to_u64(alloc_k.k->p), 1704 1710 lru_idx, alloc_k, last_flushed); 1705 1711 if (ret) 1706 1712 goto err; ··· 1731 1735 a = &a_mut->v; 1732 1736 } 1733 1737 1734 - ret = bch2_lru_check_set(trans, alloc_k.k->p.inode, a->io_time[READ], 1738 + ret = bch2_lru_check_set(trans, alloc_k.k->p.inode, 1739 + bucket_to_u64(alloc_k.k->p), 1740 + a->io_time[READ], 1735 1741 alloc_k, last_flushed); 1736 1742 if (ret) 1737 1743 goto err; ··· 1755 1757 for_each_btree_key_commit(trans, iter, BTREE_ID_alloc, 1756 1758 POS_MIN, BTREE_ITER_prefetch, k, 1757 1759 NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 1758 - bch2_check_alloc_to_lru_ref(trans, &iter, &last_flushed))); 
1760 + bch2_check_alloc_to_lru_ref(trans, &iter, &last_flushed))) ?: 1761 + bch2_check_stripe_to_lru_refs(c); 1759 1762 1760 1763 bch2_bkey_buf_exit(&last_flushed, c); 1761 1764 bch_err_fn(c, ret); ··· 1803 1804 u64 need_journal_commit; 1804 1805 u64 discarded; 1805 1806 }; 1807 + 1808 + /* 1809 + * This is needed because discard is both a filesystem option and a device 1810 + * option, and mount options are supposed to apply to that mount and not be 1811 + * persisted, i.e. if it's set as a mount option we can't propagate it to the 1812 + * device. 1813 + */ 1814 + static inline bool discard_opt_enabled(struct bch_fs *c, struct bch_dev *ca) 1815 + { 1816 + return test_bit(BCH_FS_discard_mount_opt_set, &c->flags) 1817 + ? c->opts.discard 1818 + : ca->mi.discard; 1819 + } 1806 1820 1807 1821 static int bch2_discard_one_bucket(struct btree_trans *trans, 1808 1822 struct bch_dev *ca, ··· 1880 1868 s->discarded++; 1881 1869 *discard_pos_done = iter.pos; 1882 1870 1883 - if (ca->mi.discard && !c->opts.nochanges) { 1871 + if (discard_opt_enabled(c, ca) && !c->opts.nochanges) { 1884 1872 /* 1885 1873 * This works without any other locks because this is the only 1886 1874 * thread that removes items from the need_discard tree ··· 1909 1897 if (ret) 1910 1898 goto out; 1911 1899 1912 - count_event(c, bucket_discard); 1900 + if (!fastpath) 1901 + count_event(c, bucket_discard); 1902 + else 1903 + count_event(c, bucket_discard_fast); 1913 1904 out: 1914 1905 fsck_err: 1915 1906 if (discard_locked) ··· 2070 2055 bch2_write_ref_put(c, BCH_WRITE_REF_discard_fast); 2071 2056 } 2072 2057 2058 + static int invalidate_one_bp(struct btree_trans *trans, 2059 + struct bch_dev *ca, 2060 + struct bkey_s_c_backpointer bp, 2061 + struct bkey_buf *last_flushed) 2062 + { 2063 + struct btree_iter extent_iter; 2064 + struct bkey_s_c extent_k = 2065 + bch2_backpointer_get_key(trans, bp, &extent_iter, 0, last_flushed); 2066 + int ret = bkey_err(extent_k); 2067 + if (ret) 2068 + return ret; 2069 
+ 2070 + struct bkey_i *n = 2071 + bch2_bkey_make_mut(trans, &extent_iter, &extent_k, 2072 + BTREE_UPDATE_internal_snapshot_node); 2073 + ret = PTR_ERR_OR_ZERO(n); 2074 + if (ret) 2075 + goto err; 2076 + 2077 + bch2_bkey_drop_device(bkey_i_to_s(n), ca->dev_idx); 2078 + err: 2079 + bch2_trans_iter_exit(trans, &extent_iter); 2080 + return ret; 2081 + } 2082 + 2083 + static int invalidate_one_bucket_by_bps(struct btree_trans *trans, 2084 + struct bch_dev *ca, 2085 + struct bpos bucket, 2086 + u8 gen, 2087 + struct bkey_buf *last_flushed) 2088 + { 2089 + struct bpos bp_start = bucket_pos_to_bp_start(ca, bucket); 2090 + struct bpos bp_end = bucket_pos_to_bp_end(ca, bucket); 2091 + 2092 + return for_each_btree_key_max_commit(trans, iter, BTREE_ID_backpointers, 2093 + bp_start, bp_end, 0, k, 2094 + NULL, NULL, 2095 + BCH_WATERMARK_btree| 2096 + BCH_TRANS_COMMIT_no_enospc, ({ 2097 + if (k.k->type != KEY_TYPE_backpointer) 2098 + continue; 2099 + 2100 + struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k); 2101 + 2102 + if (bp.v->bucket_gen != gen) 2103 + continue; 2104 + 2105 + /* filter out bps with gens that don't match */ 2106 + 2107 + invalidate_one_bp(trans, ca, bp, last_flushed); 2108 + })); 2109 + } 2110 + 2111 + noinline_for_stack 2073 2112 static int invalidate_one_bucket(struct btree_trans *trans, 2113 + struct bch_dev *ca, 2074 2114 struct btree_iter *lru_iter, 2075 2115 struct bkey_s_c lru_k, 2116 + struct bkey_buf *last_flushed, 2076 2117 s64 *nr_to_invalidate) 2077 2118 { 2078 2119 struct bch_fs *c = trans->c; 2079 - struct bkey_i_alloc_v4 *a = NULL; 2080 2120 struct printbuf buf = PRINTBUF; 2081 2121 struct bpos bucket = u64_to_bucket(lru_k.k->p.offset); 2082 - unsigned cached_sectors; 2122 + struct btree_iter alloc_iter = {}; 2083 2123 int ret = 0; 2084 2124 2085 2125 if (*nr_to_invalidate <= 0) ··· 2151 2081 if (bch2_bucket_is_open_safe(c, bucket.inode, bucket.offset)) 2152 2082 return 0; 2153 2083 2154 - a = bch2_trans_start_alloc_update(trans, 
bucket, BTREE_TRIGGER_bucket_invalidate); 2155 - ret = PTR_ERR_OR_ZERO(a); 2084 + struct bkey_s_c alloc_k = bch2_bkey_get_iter(trans, &alloc_iter, 2085 + BTREE_ID_alloc, bucket, 2086 + BTREE_ITER_cached); 2087 + ret = bkey_err(alloc_k); 2156 2088 if (ret) 2157 - goto out; 2089 + return ret; 2090 + 2091 + struct bch_alloc_v4 a_convert; 2092 + const struct bch_alloc_v4 *a = bch2_alloc_to_v4(alloc_k, &a_convert); 2158 2093 2159 2094 /* We expect harmless races here due to the btree write buffer: */ 2160 - if (lru_pos_time(lru_iter->pos) != alloc_lru_idx_read(a->v)) 2095 + if (lru_pos_time(lru_iter->pos) != alloc_lru_idx_read(*a)) 2161 2096 goto out; 2162 2097 2163 - BUG_ON(a->v.data_type != BCH_DATA_cached); 2164 - BUG_ON(a->v.dirty_sectors); 2098 + /* 2099 + * Impossible since alloc_lru_idx_read() only returns nonzero if the 2100 + * bucket is supposed to be on the cached bucket LRU (i.e. 2101 + * BCH_DATA_cached) 2102 + * 2103 + * bch2_lru_validate() also disallows lru keys with lru_pos_time() == 0 2104 + */ 2105 + BUG_ON(a->data_type != BCH_DATA_cached); 2106 + BUG_ON(a->dirty_sectors); 2165 2107 2166 - if (!a->v.cached_sectors) 2108 + if (!a->cached_sectors) 2167 2109 bch_err(c, "invalidating empty bucket, confused"); 2168 2110 2169 - cached_sectors = a->v.cached_sectors; 2111 + unsigned cached_sectors = a->cached_sectors; 2112 + u8 gen = a->gen; 2170 2113 2171 - SET_BCH_ALLOC_V4_NEED_INC_GEN(&a->v, false); 2172 - a->v.gen++; 2173 - a->v.data_type = 0; 2174 - a->v.dirty_sectors = 0; 2175 - a->v.stripe_sectors = 0; 2176 - a->v.cached_sectors = 0; 2177 - a->v.io_time[READ] = bch2_current_io_time(c, READ); 2178 - a->v.io_time[WRITE] = bch2_current_io_time(c, WRITE); 2179 - 2180 - ret = bch2_trans_commit(trans, NULL, NULL, 2181 - BCH_WATERMARK_btree| 2182 - BCH_TRANS_COMMIT_no_enospc); 2114 + ret = invalidate_one_bucket_by_bps(trans, ca, bucket, gen, last_flushed); 2183 2115 if (ret) 2184 2116 goto out; 2185 2117 ··· 2189 2117 --*nr_to_invalidate; 2190 2118 out: 2191 
2119 fsck_err: 2120 + bch2_trans_iter_exit(trans, &alloc_iter); 2192 2121 printbuf_exit(&buf); 2193 2122 return ret; 2194 2123 } ··· 2216 2143 struct btree_trans *trans = bch2_trans_get(c); 2217 2144 int ret = 0; 2218 2145 2146 + struct bkey_buf last_flushed; 2147 + bch2_bkey_buf_init(&last_flushed); 2148 + bkey_init(&last_flushed.k->k); 2149 + 2219 2150 ret = bch2_btree_write_buffer_tryflush(trans); 2220 2151 if (ret) 2221 2152 goto err; ··· 2244 2167 if (!k.k) 2245 2168 break; 2246 2169 2247 - ret = invalidate_one_bucket(trans, &iter, k, &nr_to_invalidate); 2170 + ret = invalidate_one_bucket(trans, ca, &iter, k, &last_flushed, &nr_to_invalidate); 2248 2171 restart_err: 2249 2172 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 2250 2173 continue; ··· 2257 2180 err: 2258 2181 bch2_trans_put(trans); 2259 2182 percpu_ref_put(&ca->io_ref); 2183 + bch2_bkey_buf_exit(&last_flushed, c); 2260 2184 bch2_write_ref_put(c, BCH_WRITE_REF_invalidate); 2261 2185 } 2262 2186
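
The comment added above `discard_opt_enabled()` in this diff describes a precedence rule: a `discard` mount option, when given, overrides the persistent per-device setting for that mount without being written back to the device. A minimal standalone sketch of that rule follows; the struct and function names are hypothetical stand-ins for `struct bch_fs`, `struct bch_dev` and the `BCH_FS_discard_mount_opt_set` flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the filesystem-wide (per-mount) options. */
struct fs_opts_sketch {
	bool discard_mount_opt_set;	/* was "discard" given at mount time? */
	bool discard;			/* value of the mount option, if given */
};

/* Hypothetical stand-in for the persistent per-device member info. */
struct dev_opts_sketch {
	bool discard;
};

/* The mount option wins when present; otherwise use the device setting. */
bool discard_enabled(const struct fs_opts_sketch *fs,
		     const struct dev_opts_sketch *dev)
{
	assert(fs && dev);
	return fs->discard_mount_opt_set ? fs->discard : dev->discard;
}
```

Keeping the "was it set" bit separate from the value is what lets an explicit request to disable discard at mount time override a device that has discard enabled.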
+1 -1
fs/bcachefs/alloc_background.h
···
 	if (a.stripe)
 		return data_type == BCH_DATA_parity ? data_type : BCH_DATA_stripe;
 	if (bch2_bucket_sectors_dirty(a))
-		return data_type;
+		return bucket_data_type(data_type);
 	if (a.cached_sectors)
 		return BCH_DATA_cached;
 	if (BCH_ALLOC_V4_NEED_DISCARD(&a))
+7 -24
fs/bcachefs/alloc_foreground.c
··· 127 127 128 128 void bch2_open_bucket_write_error(struct bch_fs *c, 129 129 struct open_buckets *obs, 130 - unsigned dev) 130 + unsigned dev, int err) 131 131 { 132 132 struct open_bucket *ob; 133 133 unsigned i; 134 134 135 135 open_bucket_for_each(c, obs, ob, i) 136 136 if (ob->dev == dev && ob->ec) 137 - bch2_ec_bucket_cancel(c, ob); 137 + bch2_ec_bucket_cancel(c, ob, err); 138 138 } 139 139 140 140 static struct open_bucket *bch2_open_bucket_alloc(struct bch_fs *c) ··· 177 177 178 178 closure_wake_up(&c->open_buckets_wait); 179 179 closure_wake_up(&c->freelist_wait); 180 - } 181 - 182 - static inline unsigned open_buckets_reserved(enum bch_watermark watermark) 183 - { 184 - switch (watermark) { 185 - case BCH_WATERMARK_interior_updates: 186 - return 0; 187 - case BCH_WATERMARK_reclaim: 188 - return OPEN_BUCKETS_COUNT / 6; 189 - case BCH_WATERMARK_btree: 190 - case BCH_WATERMARK_btree_copygc: 191 - return OPEN_BUCKETS_COUNT / 4; 192 - case BCH_WATERMARK_copygc: 193 - return OPEN_BUCKETS_COUNT / 3; 194 - default: 195 - return OPEN_BUCKETS_COUNT / 2; 196 - } 197 180 } 198 181 199 182 static inline bool may_alloc_bucket(struct bch_fs *c, ··· 222 239 223 240 spin_lock(&c->freelist_lock); 224 241 225 - if (unlikely(c->open_buckets_nr_free <= open_buckets_reserved(watermark))) { 242 + if (unlikely(c->open_buckets_nr_free <= bch2_open_buckets_reserved(watermark))) { 226 243 if (cl) 227 244 closure_wait(&c->open_buckets_wait, cl); 228 245 ··· 631 648 struct bch_dev_usage *usage) 632 649 { 633 650 u64 *v = stripe->next_alloc + ca->dev_idx; 634 - u64 free_space = dev_buckets_available(ca, BCH_WATERMARK_normal); 651 + u64 free_space = __dev_buckets_available(ca, *usage, BCH_WATERMARK_normal); 635 652 u64 free_space_inv = free_space 636 653 ? 
div64_u64(1ULL << 48, free_space) 637 654 : 1ULL << 48; ··· 711 728 712 729 struct bch_dev_usage usage; 713 730 struct open_bucket *ob = bch2_bucket_alloc_trans(trans, ca, watermark, data_type, 714 - cl, flags & BCH_WRITE_ALLOC_NOWAIT, &usage); 731 + cl, flags & BCH_WRITE_alloc_nowait, &usage); 715 732 if (!IS_ERR(ob)) 716 733 bch2_dev_stripe_increment_inlined(ca, stripe, &usage); 717 734 bch2_dev_put(ca); ··· 1319 1336 if (wp->data_type != BCH_DATA_user) 1320 1337 have_cache = true; 1321 1338 1322 - if (target && !(flags & BCH_WRITE_ONLY_SPECIFIED_DEVS)) { 1339 + if (target && !(flags & BCH_WRITE_only_specified_devs)) { 1323 1340 ret = open_bucket_add_buckets(trans, &ptrs, wp, devs_have, 1324 1341 target, erasure_code, 1325 1342 nr_replicas, &nr_effective, ··· 1409 1426 if (cl && bch2_err_matches(ret, BCH_ERR_open_buckets_empty)) 1410 1427 ret = -BCH_ERR_bucket_alloc_blocked; 1411 1428 1412 - if (cl && !(flags & BCH_WRITE_ALLOC_NOWAIT) && 1429 + if (cl && !(flags & BCH_WRITE_alloc_nowait) && 1413 1430 bch2_err_matches(ret, BCH_ERR_freelist_empty)) 1414 1431 ret = -BCH_ERR_bucket_alloc_blocked; 1415 1432
+18 -1
fs/bcachefs/alloc_foreground.h
···
 	return bch2_dev_have_ref(c, ob->dev);
 }
 
+static inline unsigned bch2_open_buckets_reserved(enum bch_watermark watermark)
+{
+	switch (watermark) {
+	case BCH_WATERMARK_interior_updates:
+		return 0;
+	case BCH_WATERMARK_reclaim:
+		return OPEN_BUCKETS_COUNT / 6;
+	case BCH_WATERMARK_btree:
+	case BCH_WATERMARK_btree_copygc:
+		return OPEN_BUCKETS_COUNT / 4;
+	case BCH_WATERMARK_copygc:
+		return OPEN_BUCKETS_COUNT / 3;
+	default:
+		return OPEN_BUCKETS_COUNT / 2;
+	}
+}
+
 struct open_bucket *bch2_bucket_alloc(struct bch_fs *, struct bch_dev *,
 				      enum bch_watermark, enum bch_data_type,
 				      struct closure *);
···
 }
 
 void bch2_open_bucket_write_error(struct bch_fs *,
-			struct open_buckets *, unsigned);
+			struct open_buckets *, unsigned, int);
 
 void __bch2_open_bucket_put(struct bch_fs *, struct open_bucket *);
 
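
The reserve table in `bch2_open_buckets_reserved()` gates open bucket allocation: in alloc_foreground.c the caller waits when `open_buckets_nr_free <= reserved`. The sketch below mirrors that comparison in userspace to show how more critical watermarks can dip further into the pool. The enum, its ordering and the value 1024 for `OPEN_BUCKETS_COUNT` are assumptions made for illustration, not copied from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

#define OPEN_BUCKETS_COUNT 1024	/* assumed value, for illustration only */

/* Hypothetical mirror of enum bch_watermark; only the names matter here. */
enum watermark_sketch {
	WM_normal,
	WM_copygc,
	WM_btree,
	WM_btree_copygc,
	WM_reclaim,
	WM_interior_updates,
	WM_NR,
};

/* Number of open buckets held back from callers at this watermark. */
unsigned open_buckets_reserved(enum watermark_sketch w)
{
	switch (w) {
	case WM_interior_updates:
		return 0;
	case WM_reclaim:
		return OPEN_BUCKETS_COUNT / 6;
	case WM_btree:
	case WM_btree_copygc:
		return OPEN_BUCKETS_COUNT / 4;
	case WM_copygc:
		return OPEN_BUCKETS_COUNT / 3;
	default:
		return OPEN_BUCKETS_COUNT / 2;
	}
}

/* Allocation proceeds only while the free count exceeds the reserve. */
bool may_take_open_bucket(unsigned nr_free, enum watermark_sketch w)
{
	assert(w < WM_NR);
	return nr_free > open_buckets_reserved(w);
}
```

With 200 open buckets free, a reclaim allocation still succeeds under these assumed numbers while a copygc or normal allocation has to wait, which is the ordering the reserve fractions are meant to enforce.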
+2
fs/bcachefs/alloc_types.h
···
 	x(stopped) \
 	x(waiting_io) \
 	x(waiting_work) \
+	x(runnable) \
 	x(running)
 
 enum write_point_state {
···
 		enum write_point_state state;
 		u64 last_state_change;
 		u64 time[WRITE_POINT_STATE_NR];
+		u64 last_runtime;
 	} __aligned(SMP_CACHE_BYTES);
 };
 
+58 -95
fs/bcachefs/backpointers.c
···
 #include "checksum.h"
 #include "disk_accounting.h"
 #include "error.h"
+#include "progress.h"

 #include <linux/mm.h>

···
 	}

 	bch2_btree_id_level_to_text(out, bp.v->btree_id, bp.v->level);
+	prt_str(out, " data_type=");
+	bch2_prt_data_type(out, bp.v->data_type);
 	prt_printf(out, " suboffset=%u len=%u gen=%u pos=",
 		   (u32) bp.k->p.offset & ~(~0U << MAX_EXTENT_COMPRESS_RATIO_SHIFT),
 		   bp.v->bucket_len,
···
 	if (unlikely(bp.v->btree_id >= btree_id_nr_alive(c)))
 		return bkey_s_c_null;

-	if (likely(!bp.v->level)) {
-		bch2_trans_node_iter_init(trans, iter,
-					  bp.v->btree_id,
-					  bp.v->pos,
-					  0, 0,
-					  iter_flags);
-		struct bkey_s_c k = bch2_btree_iter_peek_slot(iter);
-		if (bkey_err(k)) {
-			bch2_trans_iter_exit(trans, iter);
-			return k;
-		}
-
-		if (k.k &&
-		    extent_matches_bp(c, bp.v->btree_id, bp.v->level, k, bp))
-			return k;
-
+	bch2_trans_node_iter_init(trans, iter,
+				  bp.v->btree_id,
+				  bp.v->pos,
+				  0,
+				  bp.v->level,
+				  iter_flags);
+	struct bkey_s_c k = bch2_btree_iter_peek_slot(iter);
+	if (bkey_err(k)) {
 		bch2_trans_iter_exit(trans, iter);
+		return k;
+	}
+
+	if (k.k &&
+	    extent_matches_bp(c, bp.v->btree_id, bp.v->level, k, bp))
+		return k;
+
+	bch2_trans_iter_exit(trans, iter);
+
+	if (!bp.v->level) {
 		int ret = backpointer_target_not_found(trans, bp, k, last_flushed);
 		return ret ? bkey_s_c_err(ret) : bkey_s_c_null;
 	} else {
 		struct btree *b = bch2_backpointer_get_node(trans, bp, iter, last_flushed);
+		if (b == ERR_PTR(-BCH_ERR_backpointer_to_overwritten_btree_node))
+			return bkey_s_c_null;
 		if (IS_ERR_OR_NULL(b))
 			return ((struct bkey_s_c) { .k = ERR_CAST(b) });

···
 	if (!other_extent.k)
 		goto missing;

+	rcu_read_lock();
+	struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp->k.p.inode);
+	if (ca) {
+		struct bkey_ptrs_c other_extent_ptrs = bch2_bkey_ptrs_c(other_extent);
+		bkey_for_each_ptr(other_extent_ptrs, ptr)
+			if (ptr->dev == bp->k.p.inode &&
+			    dev_ptr_stale_rcu(ca, ptr)) {
+				ret = drop_dev_and_update(trans, other_bp.v->btree_id,
+							  other_extent, bp->k.p.inode);
+				if (ret)
+					goto err;
+				goto out;
+			}
+	}
+	rcu_read_unlock();
+
 	if (bch2_extents_match(orig_k, other_extent)) {
 		printbuf_reset(&buf);
 		prt_printf(&buf, "duplicate versions of same extent, deleting smaller\n ");
···
 	struct extent_ptr_decoded p;

 	bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
-		if (p.ptr.cached)
-			continue;
-
 		if (p.ptr.dev == BCH_SB_MEMBER_INVALID)
 			continue;

···
 		struct bch_dev *ca = bch2_dev_rcu_noerror(c, p.ptr.dev);
 		bool check = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_mismatches);
 		bool empty = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_empty);
+
+		bool stale = p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr));
 		rcu_read_unlock();

-		if (check || empty) {
+		if ((check || empty) && !stale) {
 			struct bkey_i_backpointer bp;
 			bch2_extent_ptr_to_bp(c, btree, level, k, p, entry, &bp);

···
 	return ret;
 }

-struct progress_indicator_state {
-	unsigned long		next_print;
-	u64			nodes_seen;
-	u64			nodes_total;
-	struct btree		*last_node;
-};
-
-static inline void progress_init(struct progress_indicator_state *s,
-				 struct bch_fs *c,
-				 u64 btree_id_mask)
-{
-	memset(s, 0, sizeof(*s));
-
-	s->next_print = jiffies + HZ * 10;
-
-	for (unsigned i = 0; i < BTREE_ID_NR; i++) {
-		if (!(btree_id_mask & BIT_ULL(i)))
-			continue;
-
-		struct disk_accounting_pos acc = {
-			.type		= BCH_DISK_ACCOUNTING_btree,
-			.btree.id	= i,
-		};
-
-		u64 v;
-		bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1);
-		s->nodes_total += div64_ul(v, btree_sectors(c));
-	}
-}
-
-static inline bool progress_update_p(struct progress_indicator_state *s)
-{
-	bool ret = time_after_eq(jiffies, s->next_print);
-
-	if (ret)
-		s->next_print = jiffies + HZ * 10;
-	return ret;
-}
-
-static void progress_update_iter(struct btree_trans *trans,
-				 struct progress_indicator_state *s,
-				 struct btree_iter *iter,
-				 const char *msg)
-{
-	struct bch_fs *c = trans->c;
-	struct btree *b = path_l(btree_iter_path(trans, iter))->b;
-
-	s->nodes_seen += b != s->last_node;
-	s->last_node = b;
-
-	if (progress_update_p(s)) {
-		struct printbuf buf = PRINTBUF;
-		unsigned percent = s->nodes_total
-			? div64_u64(s->nodes_seen * 100, s->nodes_total)
-			: 0;
-
-		prt_printf(&buf, "%s: %d%%, done %llu/%llu nodes, at ",
-			   msg, percent, s->nodes_seen, s->nodes_total);
-		bch2_bbpos_to_text(&buf, BBPOS(iter->btree_id, iter->pos));
-
-		bch_info(c, "%s", buf.buf);
-		printbuf_exit(&buf);
-	}
-}
-
 static int bch2_check_extents_to_backpointers_pass(struct btree_trans *trans,
 						   struct extents_to_bp_state *s)
 {
···
 	struct progress_indicator_state progress;
 	int ret = 0;

-	progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_extents)|BIT_ULL(BTREE_ID_reflink));
+	bch2_progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_extents)|BIT_ULL(BTREE_ID_reflink));

 	for (enum btree_id btree_id = 0;
 	     btree_id < btree_id_nr_alive(c);
···
 						  BTREE_ITER_prefetch);

 			ret = for_each_btree_key_continue(trans, iter, 0, k, ({
-				progress_update_iter(trans, &progress, &iter, "extents_to_backpointers");
+				bch2_progress_update_iter(trans, &progress, &iter, "extents_to_backpointers");
 				check_extent_to_backpointers(trans, s, btree_id, level, k) ?:
 				bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc);
 			}));
···
 	ALLOC_SECTORS_NR
 };

-static enum alloc_sector_counter data_type_to_alloc_counter(enum bch_data_type t)
+static int data_type_to_alloc_counter(enum bch_data_type t)
 {
 	switch (t) {
 	case BCH_DATA_btree:
···
 	case BCH_DATA_cached:
 		return ALLOC_cached;
 	case BCH_DATA_stripe:
+	case BCH_DATA_parity:
 		return ALLOC_stripe;
 	default:
-		BUG();
+		return -1;
 	}
 }

···
 		if (bp.v->bucket_gen != a->gen)
 			continue;

-		sectors[data_type_to_alloc_counter(bp.v->data_type)] += bp.v->bucket_len;
+		int alloc_counter = data_type_to_alloc_counter(bp.v->data_type);
+		if (alloc_counter < 0)
+			continue;
+
+		sectors[alloc_counter] += bp.v->bucket_len;
 	};
 	bch2_trans_iter_exit(trans, &iter);
 	if (ret)
···
 		goto err;
 	}

-	/* Cached pointers don't have backpointers: */
-
 	if (sectors[ALLOC_dirty]  != a->dirty_sectors ||
+	    sectors[ALLOC_cached] != a->cached_sectors ||
 	    sectors[ALLOC_stripe] != a->stripe_sectors) {
 		if (c->sb.version_upgrade_complete >= bcachefs_metadata_version_backpointer_bucket_gen) {
 			ret = bch2_backpointers_maybe_flush(trans, alloc_k, last_flushed);
···
 		}

 		if (sectors[ALLOC_dirty]  > a->dirty_sectors ||
+		    sectors[ALLOC_cached] > a->cached_sectors ||
 		    sectors[ALLOC_stripe] > a->stripe_sectors) {
 			ret = check_bucket_backpointers_to_extents(trans, ca, alloc_k.k->p) ?:
 				-BCH_ERR_transaction_restart_nested;
···
 		}

 		if (!sectors[ALLOC_dirty] &&
-		    !sectors[ALLOC_stripe])
+		    !sectors[ALLOC_stripe] &&
+		    !sectors[ALLOC_cached])
 			__set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_empty);
 		else
 			__set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_mismatches);
···

 	bch2_bkey_buf_init(&last_flushed);
 	bkey_init(&last_flushed.k->k);
-	progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_backpointers));
+	bch2_progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_backpointers));

 	int ret = for_each_btree_key(trans, iter, BTREE_ID_backpointers,
 				     POS_MIN, BTREE_ITER_prefetch, k, ({
-			progress_update_iter(trans, &progress, &iter, "backpointers_to_extents");
+			bch2_progress_update_iter(trans, &progress, &iter, "backpointers_to_extents");
 			check_one_backpointer(trans, start, end, k, &last_flushed);
 	}));

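The change to data_type_to_alloc_counter() replaces BUG() with a -1 return that the caller skips, and cached sectors are now summed alongside dirty and stripe sectors. The sketch below mirrors that pattern in user space. Everything prefixed `demo_` is hypothetical, and the mapping of the btree and user types to the dirty counter is an assumption, since that part of the switch is elided from the hunk above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for enum bch_data_type and the ALLOC_* counters. */
enum demo_data_type {
	DEMO_DATA_free,
	DEMO_DATA_sb,
	DEMO_DATA_btree,
	DEMO_DATA_user,
	DEMO_DATA_cached,
	DEMO_DATA_parity,
	DEMO_DATA_stripe,
};

enum { DEMO_ALLOC_dirty, DEMO_ALLOC_cached, DEMO_ALLOC_stripe, DEMO_ALLOC_NR };

/* Simplified backpointer: just the fields the summation needs. */
struct demo_bp {
	enum demo_data_type	data_type;
	uint32_t		bucket_len;
};

/* Returns a counter index, or -1 for data types that are not counted. */
static int demo_alloc_counter(enum demo_data_type t)
{
	switch (t) {
	case DEMO_DATA_btree:	/* assumption: this case is elided in the diff */
	case DEMO_DATA_user:	/* assumption: this case is elided in the diff */
		return DEMO_ALLOC_dirty;
	case DEMO_DATA_cached:
		return DEMO_ALLOC_cached;
	case DEMO_DATA_stripe:
	case DEMO_DATA_parity:
		return DEMO_ALLOC_stripe;
	default:
		return -1;
	}
}

/* Sum sectors per counter, skipping unknown types instead of crashing. */
static void demo_sum_sectors(const struct demo_bp *bps, size_t nr,
			     uint64_t sectors[DEMO_ALLOC_NR])
{
	for (size_t i = 0; i < DEMO_ALLOC_NR; i++)
		sectors[i] = 0;

	for (size_t i = 0; i < nr; i++) {
		int counter = demo_alloc_counter(bps[i].data_type);
		if (counter < 0)
			continue;

		sectors[counter] += bps[i].bucket_len;
	}
}
```

Returning a sentinel keeps a backpointer with an unexpected data type from taking down the whole check pass; the resulting sector mismatch can still be reported by the comparison against the alloc key.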
+22 -4
fs/bcachefs/backpointers.h
···
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _BCACHEFS_BACKPOINTERS_BACKGROUND_H
-#define _BCACHEFS_BACKPOINTERS_BACKGROUND_H
+#ifndef _BCACHEFS_BACKPOINTERS_H
+#define _BCACHEFS_BACKPOINTERS_H

 #include "btree_cache.h"
 #include "btree_iter.h"
···
 		return BCH_DATA_btree;
 	case KEY_TYPE_extent:
 	case KEY_TYPE_reflink_v:
-		return p.has_ec ? BCH_DATA_stripe : BCH_DATA_user;
+		if (p.has_ec)
+			return BCH_DATA_stripe;
+		if (p.ptr.cached)
+			return BCH_DATA_cached;
+		else
+			return BCH_DATA_user;
 	case KEY_TYPE_stripe: {
 		const struct bch_extent_ptr *ptr = &entry->ptr;
 		struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
···
 			   struct bkey_i_backpointer *bp)
 {
 	bkey_backpointer_init(&bp->k_i);
-	bp->k.p = POS(p.ptr.dev, ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset);
+	bp->k.p.inode = p.ptr.dev;
+
+	if (k.k->type != KEY_TYPE_stripe)
+		bp->k.p.offset = ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset;
+	else {
+		/*
+		 * Put stripe backpointers where they won't collide with the
+		 * extent backpointers within the stripe:
+		 */
+		struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
+		bp->k.p.offset = ((u64) (p.ptr.offset + le16_to_cpu(s.v->sectors)) <<
+				  MAX_EXTENT_COMPRESS_RATIO_SHIFT) - 1;
+	}
+
 	bp->v = (struct bch_backpointer) {
 		.btree_id	= btree_id,
 		.level		= level,
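The two offset formulas in bch2_extent_ptr_to_bp() can be compared with a small worked example. This is a sketch under stated assumptions: the `demo_` helpers are hypothetical, and the shift value of 10 is assumed for illustration rather than taken from this diff.

```c
#include <stdint.h>

/* Assumption: illustrative stand-in for MAX_EXTENT_COMPRESS_RATIO_SHIFT. */
#define DEMO_COMPRESS_RATIO_SHIFT 10

/* Extent backpointer: device sector shifted up, plus the offset within the checksummed region. */
static uint64_t demo_extent_bp_offset(uint64_t ptr_offset, uint32_t crc_offset)
{
	return (ptr_offset << DEMO_COMPRESS_RATIO_SHIFT) + crc_offset;
}

/* Stripe backpointer: one position before the first slot past the end of the stripe block. */
static uint64_t demo_stripe_bp_offset(uint64_t ptr_offset, uint16_t stripe_sectors)
{
	return ((ptr_offset + stripe_sectors) << DEMO_COMPRESS_RATIO_SHIFT) - 1;
}
```

For a stripe block starting at sector 100 with 128 sectors, the stripe backpointer lands at (228 << 10) - 1 = 233471, which sorts after an extent backpointer for sector 227 with a zero crc offset (232448) and before any backpointer for sector 228 (233472).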
+13 -7
fs/bcachefs/bcachefs.h
···
 #include <linux/types.h>
 #include <linux/workqueue.h>
 #include <linux/zstd.h>
+#include <linux/unicode.h>

 #include "bcachefs_format.h"
 #include "btree_journal_iter_types.h"
···
 	x(btree_node_sort)			\
 	x(btree_node_read)			\
 	x(btree_node_read_done)			\
+	x(btree_node_write)			\
 	x(btree_interior_update_foreground)	\
 	x(btree_interior_update_total)		\
 	x(btree_gc)				\
···
 	x(blocked_journal_low_on_space)		\
 	x(blocked_journal_low_on_pin)		\
 	x(blocked_journal_max_in_flight)	\
+	x(blocked_journal_max_open)		\
 	x(blocked_key_cache_flush)		\
 	x(blocked_allocate)			\
 	x(blocked_allocate_open_bucket)		\
···
 	 */
 	struct bch_member_cpu	mi;
 	atomic64_t		errors[BCH_MEMBER_ERROR_NR];
+	unsigned long		write_errors_start;

 	__uuid_t		uuid;
 	char			name[BDEVNAME_SIZE];
···
 	x(topology_error)		\
 	x(errors_fixed)			\
 	x(errors_not_fixed)		\
-	x(no_invalid_checks)
+	x(no_invalid_checks)		\
+	x(discard_mount_opt_set)	\

 enum bch_fs_flags {
 #define x(n)		BCH_FS_##n,
···
 	x(gc_gens)							\
 	x(snapshot_delete_pagecache)					\
 	x(sysfs)							\
-	x(btree_write_buffer)
+	x(btree_write_buffer)						\
+	x(btree_node_scrub)

 enum bch_write_ref {
 #define x(n) BCH_WRITE_REF_##n,
···
 #undef x
 	BCH_WRITE_REF_NR,
 };
+
+#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0)

 struct bch_fs {
 	struct closure		cl;
···
 		u64		btrees_lost_data;
 	}			sb;

+#ifdef CONFIG_UNICODE
+	struct unicode_map	*cf_encoding;
+#endif

 	struct bch_sb_handle	disk_sb;

···
 	mempool_t		compress_workspace[BCH_COMPRESSION_OPT_NR];
 	size_t			zstd_workspace_size;

-	struct crypto_shash	*sha256;
 	struct crypto_sync_skcipher *chacha20;
 	struct crypto_shash	*poly1305;

···
 	wait_queue_head_t	copygc_running_wq;

 	/* STRIPES: */
-	GENRADIX(struct stripe) stripes;
 	GENRADIX(struct gc_stripe) gc_stripes;

 	struct hlist_head	ec_stripes_new[32];
 	spinlock_t		ec_stripes_new_lock;
-
-	ec_stripes_heap		ec_stripes_heap;
-	struct mutex		ec_stripes_heap_lock;

 	/* ERASURE CODING */
 	struct list_head	ec_stripe_head_list;
+13 -3
fs/bcachefs/bcachefs_format.h
···
 	x(inode_depth,			BCH_VERSION(1, 17))	\
 	x(persistent_inode_cursors,	BCH_VERSION(1, 18))	\
 	x(autofix_errors,		BCH_VERSION(1, 19))	\
-	x(directory_size,		BCH_VERSION(1, 20))
+	x(directory_size,		BCH_VERSION(1, 20))	\
+	x(cached_backpointers,		BCH_VERSION(1, 21))	\
+	x(stripe_backpointers,		BCH_VERSION(1, 22))	\
+	x(stripe_lru,			BCH_VERSION(1, 23))	\
+	x(casefolding,			BCH_VERSION(1, 24))	\
+	x(extent_flags,			BCH_VERSION(1, 25))

 enum bcachefs_metadata_version {
 	bcachefs_metadata_version_min = 9,
···
 LE64_BITMASK(BCH_SB_INODES_USE_KEY_CACHE,struct bch_sb, flags[3], 29, 30);
 LE64_BITMASK(BCH_SB_JOURNAL_FLUSH_DELAY,struct bch_sb, flags[3], 30, 62);
 LE64_BITMASK(BCH_SB_JOURNAL_FLUSH_DISABLED,struct bch_sb, flags[3], 62, 63);
+/* one free bit */
 LE64_BITMASK(BCH_SB_JOURNAL_RECLAIM_DELAY,struct bch_sb, flags[4], 0, 32);
 LE64_BITMASK(BCH_SB_JOURNAL_TRANSACTION_NAMES,struct bch_sb, flags[4], 32, 33);
 LE64_BITMASK(BCH_SB_NOCOW,		struct bch_sb, flags[4], 33, 34);
···
 LE64_BITMASK(BCH_SB_VERSION_INCOMPAT_ALLOWED,
 					struct bch_sb, flags[5], 48, 64);
 LE64_BITMASK(BCH_SB_SHARD_INUMS_NBITS,	struct bch_sb, flags[6],  0,  4);
+LE64_BITMASK(BCH_SB_WRITE_ERROR_TIMEOUT,struct bch_sb, flags[6],  4, 14);
+LE64_BITMASK(BCH_SB_CSUM_ERR_RETRY_NR,	struct bch_sb, flags[6], 14, 20);

 static inline __u64 BCH_SB_COMPRESSION_TYPE(const struct bch_sb *sb)
 {
···
 	x(journal_no_flush,		16)	\
 	x(alloc_v2,			17)	\
 	x(extents_across_btree_nodes,	18)	\
-	x(incompat_version_field,	19)
+	x(incompat_version_field,	19)	\
+	x(casefolding,			20)

 #define BCH_SB_FEATURES_ALWAYS				\
 	(BIT_ULL(BCH_FEATURE_new_extent_overwrite)|	\
···
 	 BIT_ULL(BCH_FEATURE_new_siphash)|		\
 	 BIT_ULL(BCH_FEATURE_btree_ptr_v2)|		\
 	 BIT_ULL(BCH_FEATURE_new_varint)|		\
-	 BIT_ULL(BCH_FEATURE_journal_no_flush))
+	 BIT_ULL(BCH_FEATURE_journal_no_flush)|		\
+	 BIT_ULL(BCH_FEATURE_incompat_version_field))

 enum bch_sb_feature {
 #define x(f, n) BCH_FEATURE_##f,
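The two new superblock fields share flags[6] with BCH_SB_SHARD_INUMS_NBITS: bits 4 to 13 hold the write error timeout and bits 14 to 19 hold the checksum error retry count. The sketch below shows how such half-open bit ranges pack into one 64-bit word. The `demo_bits_*` helpers are hypothetical stand-ins for the accessors that LE64_BITMASK generates, and they ignore the little-endian conversion the real macro performs.

```c
#include <stdint.h>

/* Read the field occupying bits [lo, hi) of word. Assumes hi - lo < 64. */
static uint64_t demo_bits_get(uint64_t word, unsigned lo, unsigned hi)
{
	return (word >> lo) & ((1ULL << (hi - lo)) - 1);
}

/* Return word with bits [lo, hi) replaced by v; excess high bits of v are dropped. */
static uint64_t demo_bits_set(uint64_t word, unsigned lo, unsigned hi, uint64_t v)
{
	uint64_t mask = ((1ULL << (hi - lo)) - 1) << lo;

	return (word & ~mask) | ((v << lo) & mask);
}
```

A 10-bit field such as the one at bits 4 to 13 can hold values up to 1023, and the 6-bit field at bits 14 to 19 up to 63; writing one field leaves its neighbours untouched.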
+28 -1
fs/bcachefs/bcachefs_ioctl.h
···
 #define BCH_IOCTL_FSCK_OFFLINE	_IOW(0xbc,	19,  struct bch_ioctl_fsck_offline)
 #define BCH_IOCTL_FSCK_ONLINE	_IOW(0xbc,	20,  struct bch_ioctl_fsck_online)
 #define BCH_IOCTL_QUERY_ACCOUNTING _IOW(0xbc,	21,  struct bch_ioctl_query_accounting)
+#define BCH_IOCTL_QUERY_COUNTERS _IOW(0xbc,	21,  struct bch_ioctl_query_counters)

 /* ioctl below act on a particular file, not the filesystem as a whole: */

···
 	union {
 	struct {
 		__u32		dev;
+		__u32		data_types;
+	}			scrub;
+	struct {
+		__u32		dev;
 		__u32		pad;
 	}			migrate;
 	struct {
···
 	BCH_DATA_EVENT_NR	= 1,
 };

+enum data_progress_data_type_special {
+	DATA_PROGRESS_DATA_TYPE_phys = 254,
+	DATA_PROGRESS_DATA_TYPE_done = 255,
+};
+
 struct bch_ioctl_data_progress {
 	__u8			data_type;
 	__u8			btree_id;
···

 	__u64			sectors_done;
 	__u64			sectors_total;
+	__u64			sectors_error_corrected;
+	__u64			sectors_error_uncorrected;
 } __packed __aligned(8);
+
+enum bch_ioctl_data_event_ret {
+	BCH_IOCTL_DATA_EVENT_RET_done		= 1,
+	BCH_IOCTL_DATA_EVENT_RET_device_offline	= 2,
+};

 struct bch_ioctl_data_event {
 	__u8			type;
-	__u8			pad[7];
+	__u8			ret;
+	__u8			pad[6];
 	union {
 	struct bch_ioctl_data_progress p;
 	__u64			pad2[15];
···
 	__u32			accounting_types_mask;	/* input parameter */

 	struct bkey_i_accounting accounting[];
+};
+
+#define BCH_IOCTL_QUERY_COUNTERS_MOUNT	(1 << 0)
+
+struct bch_ioctl_query_counters {
+	__u16			nr;
+	__u16			flags;
+	__u32			pad;
+	__u64			d[];
 };

 #endif /* _BCACHEFS_IOCTL_H */
+1
fs/bcachefs/btree_cache.c
···
 			btree_node_write_in_flight(b));

 		btree_node_data_free(bc, b);
+		cond_resched();
 	}

 	BUG_ON(!bch2_journal_error(&c->journal) &&
+12 -6
fs/bcachefs/btree_gc.c
···
 #include "journal.h"
 #include "keylist.h"
 #include "move.h"
+#include "progress.h"
 #include "recovery_passes.h"
 #include "reflink.h"
 #include "recovery.h"
···
 	return ret;
 }

-static int bch2_gc_btree(struct btree_trans *trans, enum btree_id btree, bool initial)
+static int bch2_gc_btree(struct btree_trans *trans,
+			 struct progress_indicator_state *progress,
+			 enum btree_id btree, bool initial)
 {
 	struct bch_fs *c = trans->c;
 	unsigned target_depth = btree_node_type_has_triggers(__btree_node_type(0, btree)) ? 0 : 1;
···
 					  BTREE_ITER_prefetch);

 		ret = for_each_btree_key_continue(trans, iter, 0, k, ({
+			bch2_progress_update_iter(trans, progress, &iter, "check_allocations");
 			gc_pos_set(c, gc_pos_btree(btree, level, k.k->p));
 			bch2_gc_mark_key(trans, btree, level, &prev, &iter, k, initial);
 		}));
···
 static int bch2_gc_btrees(struct bch_fs *c)
 {
 	struct btree_trans *trans = bch2_trans_get(c);
-	enum btree_id ids[BTREE_ID_NR];
 	struct printbuf buf = PRINTBUF;
-	unsigned i;
 	int ret = 0;

-	for (i = 0; i < BTREE_ID_NR; i++)
+	struct progress_indicator_state progress;
+	bch2_progress_init(&progress, c, ~0ULL);
+
+	enum btree_id ids[BTREE_ID_NR];
+	for (unsigned i = 0; i < BTREE_ID_NR; i++)
 		ids[i] = i;
 	bubble_sort(ids, BTREE_ID_NR, btree_id_gc_phase_cmp);

-	for (i = 0; i < btree_id_nr_alive(c) && !ret; i++) {
+	for (unsigned i = 0; i < btree_id_nr_alive(c) && !ret; i++) {
 		unsigned btree = i < BTREE_ID_NR ? ids[i] : i;

 		if (IS_ERR_OR_NULL(bch2_btree_id_root(c, btree)->b))
 			continue;

-		ret = bch2_gc_btree(trans, btree, true);
+		ret = bch2_gc_btree(trans, &progress, btree, true);
 	}

 	printbuf_exit(&buf);
+233 -26
fs/bcachefs/btree_io.c
···
 // SPDX-License-Identifier: GPL-2.0

 #include "bcachefs.h"
+#include "bkey_buf.h"
 #include "bkey_methods.h"
 #include "bkey_sort.h"
 #include "btree_cache.h"
···
 		bch_info(c, "retrying read");
 		ca = bch2_dev_get_ioref(c, rb->pick.ptr.dev, READ);
 		rb->have_ioref		= ca != NULL;
+		rb->start_time		= local_clock();
 		bio_reset(bio, NULL, REQ_OP_READ|REQ_SYNC|REQ_META);
 		bio->bi_iter.bi_sector	= rb->pick.ptr.offset;
 		bio->bi_iter.bi_size	= btree_buf_bytes(b);
···
 		} else {
 			bio->bi_status = BLK_STS_REMOVED;
 		}
+
+		bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read,
+					   rb->start_time, !bio->bi_status);
 start:
 		printbuf_reset(&buf);
 		bch2_btree_pos_to_text(&buf, c, b);
-		bch2_dev_io_err_on(ca && bio->bi_status, ca, BCH_MEMBER_ERROR_read,
-				   "btree read error %s for %s",
-				   bch2_blk_status_to_str(bio->bi_status), buf.buf);
+
+		if (ca && bio->bi_status)
+			bch_err_dev_ratelimited(ca,
+					"btree read error %s for %s",
+					bch2_blk_status_to_str(bio->bi_status), buf.buf);
 		if (rb->have_ioref)
 			percpu_ref_put(&ca->io_ref);
 		rb->have_ioref = false;

-		bch2_mark_io_failure(&failed, &rb->pick);
+		bch2_mark_io_failure(&failed, &rb->pick, false);

 		can_retry = bch2_bkey_pick_read_device(c,
 				bkey_i_to_s_c(&b->key),
-				&failed, &rb->pick) > 0;
+				&failed, &rb->pick, -1) > 0;

 		if (!bio->bi_status &&
 		    !bch2_btree_node_read_done(c, ca, b, can_retry, &saw_error)) {
···
 	struct btree_read_bio *rb =
 		container_of(bio, struct btree_read_bio, bio);
 	struct bch_fs *c	= rb->c;
+	struct bch_dev *ca	= rb->have_ioref
+		? bch2_dev_have_ref(c, rb->pick.ptr.dev) : NULL;

-	if (rb->have_ioref) {
-		struct bch_dev *ca = bch2_dev_have_ref(c, rb->pick.ptr.dev);
-
-		bch2_latency_acct(ca, rb->start_time, READ);
-	}
+	bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read,
+				   rb->start_time, !bio->bi_status);

 	queue_work(c->btree_read_complete_wq, &rb->work);
 }
···
 		return;

 	ret = bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key),
-					 NULL, &pick);
+					 NULL, &pick, -1);

 	if (ret <= 0) {
 		struct printbuf buf = PRINTBUF;
···
 	return bch2_trans_run(c, __bch2_btree_root_read(trans, id, k, level));
 }

+struct btree_node_scrub {
+	struct bch_fs		*c;
+	struct bch_dev		*ca;
+	void			*buf;
+	bool			used_mempool;
+	unsigned		written;
+
+	enum btree_id		btree;
+	unsigned		level;
+	struct bkey_buf		key;
+	__le64			seq;
+
+	struct work_struct	work;
+	struct bio		bio;
+};
+
+static bool btree_node_scrub_check(struct bch_fs *c, struct btree_node *data, unsigned ptr_written,
+				   struct printbuf *err)
+{
+	unsigned written = 0;
+
+	if (le64_to_cpu(data->magic) != bset_magic(c)) {
+		prt_printf(err, "bad magic: want %llx, got %llx",
+			   bset_magic(c), le64_to_cpu(data->magic));
+		return false;
+	}
+
+	while (written < (ptr_written ?: btree_sectors(c))) {
+		struct btree_node_entry *bne;
+		struct bset *i;
+		bool first = !written;
+
+		if (first) {
+			bne = NULL;
+			i = &data->keys;
+		} else {
+			bne = (void *) data + (written << 9);
+			i = &bne->keys;
+
+			if (!ptr_written && i->seq != data->keys.seq)
+				break;
+		}
+
+		struct nonce nonce = btree_nonce(i, written << 9);
+		bool good_csum_type = bch2_checksum_type_valid(c, BSET_CSUM_TYPE(i));
+
+		if (first) {
+			if (good_csum_type) {
+				struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, data);
+				if (bch2_crc_cmp(data->csum, csum)) {
+					bch2_csum_err_msg(err, BSET_CSUM_TYPE(i), data->csum, csum);
+					return false;
+				}
+			}
+
+			written += vstruct_sectors(data, c->block_bits);
+		} else {
+			if (good_csum_type) {
+				struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, bne);
+				if (bch2_crc_cmp(bne->csum, csum)) {
+					bch2_csum_err_msg(err, BSET_CSUM_TYPE(i), bne->csum, csum);
+					return false;
+				}
+			}
+
+			written += vstruct_sectors(bne, c->block_bits);
+		}
+	}
+
+	return true;
+}
+
+static void btree_node_scrub_work(struct work_struct *work)
+{
+	struct btree_node_scrub *scrub = container_of(work, struct btree_node_scrub, work);
+	struct bch_fs *c = scrub->c;
+	struct printbuf err = PRINTBUF;
+
+	__bch2_btree_pos_to_text(&err, c, scrub->btree, scrub->level,
+				 bkey_i_to_s_c(scrub->key.k));
+	prt_newline(&err);
+
+	if (!btree_node_scrub_check(c, scrub->buf, scrub->written, &err)) {
+		struct btree_trans *trans = bch2_trans_get(c);
+
+		struct btree_iter iter;
+		bch2_trans_node_iter_init(trans, &iter, scrub->btree,
+					  scrub->key.k->k.p, 0, scrub->level - 1, 0);
+
+		struct btree *b;
+		int ret = lockrestart_do(trans, PTR_ERR_OR_ZERO(b = bch2_btree_iter_peek_node(&iter)));
+		if (ret)
+			goto err;
+
+		if (bkey_i_to_btree_ptr_v2(&b->key)->v.seq == scrub->seq) {
+			bch_err(c, "error validating btree node during scrub on %s at btree %s",
+				scrub->ca->name, err.buf);
+
+			ret = bch2_btree_node_rewrite(trans, &iter, b, 0);
+		}
+err:
+		bch2_trans_iter_exit(trans, &iter);
+		bch2_trans_begin(trans);
+		bch2_trans_put(trans);
+	}
+
+	printbuf_exit(&err);
+	bch2_bkey_buf_exit(&scrub->key, c);;
+	btree_bounce_free(c, c->opts.btree_node_size, scrub->used_mempool, scrub->buf);
+	percpu_ref_put(&scrub->ca->io_ref);
+	kfree(scrub);
+	bch2_write_ref_put(c, BCH_WRITE_REF_btree_node_scrub);
+}
+
+static void btree_node_scrub_endio(struct bio *bio)
+{
+	struct btree_node_scrub *scrub = container_of(bio, struct btree_node_scrub, bio);
+
+	queue_work(scrub->c->btree_read_complete_wq, &scrub->work);
+}
+
+int bch2_btree_node_scrub(struct btree_trans *trans,
+			  enum btree_id btree, unsigned level,
+			  struct bkey_s_c k, unsigned dev)
+{
+	if (k.k->type != KEY_TYPE_btree_ptr_v2)
+		return 0;
+
+	struct bch_fs *c = trans->c;
+
+	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_node_scrub))
+		return -BCH_ERR_erofs_no_writes;
+
+	struct extent_ptr_decoded pick;
+	int ret = bch2_bkey_pick_read_device(c, k, NULL, &pick, dev);
+	if (ret <= 0)
+		goto err;
+
+	struct bch_dev *ca = bch2_dev_get_ioref(c, pick.ptr.dev, READ);
+	if (!ca) {
+		ret = -BCH_ERR_device_offline;
+		goto err;
+	}
+
+	bool used_mempool = false;
+	void *buf = btree_bounce_alloc(c, c->opts.btree_node_size, &used_mempool);
+
+	unsigned vecs = buf_pages(buf, c->opts.btree_node_size);
+
+	struct btree_node_scrub *scrub =
+		kzalloc(sizeof(*scrub) + sizeof(struct bio_vec) * vecs, GFP_KERNEL);
+	if (!scrub) {
+		ret = -ENOMEM;
+		goto err_free;
+	}
+
+	scrub->c		= c;
+	scrub->ca		= ca;
+	scrub->buf		= buf;
+	scrub->used_mempool	= used_mempool;
+	scrub->written		= btree_ptr_sectors_written(k);
+
+	scrub->btree		= btree;
+	scrub->level		= level;
+	bch2_bkey_buf_init(&scrub->key);
+	bch2_bkey_buf_reassemble(&scrub->key, c, k);
+	scrub->seq		= bkey_s_c_to_btree_ptr_v2(k).v->seq;
+
+	INIT_WORK(&scrub->work, btree_node_scrub_work);
+
+	bio_init(&scrub->bio, ca->disk_sb.bdev, scrub->bio.bi_inline_vecs, vecs, REQ_OP_READ);
+	bch2_bio_map(&scrub->bio, scrub->buf, c->opts.btree_node_size);
+	scrub->bio.bi_iter.bi_sector	= pick.ptr.offset;
+	scrub->bio.bi_end_io		= btree_node_scrub_endio;
+	submit_bio(&scrub->bio);
+	return 0;
+err_free:
+	btree_bounce_free(c, c->opts.btree_node_size, used_mempool, buf);
+	percpu_ref_put(&ca->io_ref);
+err:
+	bch2_write_ref_put(c, BCH_WRITE_REF_btree_node_scrub);
+	return ret;
+}
+
 static void bch2_btree_complete_write(struct bch_fs *c, struct btree *b,
 				      struct btree_write *w)
 {
···
 	bch2_journal_pin_drop(&c->journal, &w->journal);
 }

-static void __btree_node_write_done(struct bch_fs *c, struct btree *b)
+static void __btree_node_write_done(struct bch_fs *c, struct btree *b, u64 start_time)
 {
 	struct btree_write *w = btree_prev_write(b);
 	unsigned long old, new;
 	unsigned type = 0;

 	bch2_btree_complete_write(c, b, w);
+
+	if (start_time)
+		bch2_time_stats_update(&c->times[BCH_TIME_btree_node_write], start_time);

 	old = READ_ONCE(b->flags);
 	do {
···
 		wake_up_bit(&b->flags, BTREE_NODE_write_in_flight);
 }

-static void btree_node_write_done(struct bch_fs *c, struct btree *b)
+static void btree_node_write_done(struct bch_fs *c, struct btree *b, u64 start_time)
 {
 	struct btree_trans *trans = bch2_trans_get(c);

···

 	/* we don't need transaction context anymore after we got the lock. */
 	bch2_trans_put(trans);
-	__btree_node_write_done(c, b);
+	__btree_node_write_done(c, b, start_time);
 	six_unlock_read(&b->c.lock);
 }

···
 		container_of(work, struct btree_write_bio, work);
 	struct bch_fs *c	= wbio->wbio.c;
 	struct btree *b		= wbio->wbio.bio.bi_private;
+	u64 start_time		= wbio->start_time;
 	int ret = 0;

 	btree_bounce_free(c,
···
 	}
 out:
 	bio_put(&wbio->wbio.bio);
-	btree_node_write_done(c, b);
+	btree_node_write_done(c, b, start_time);
 	return;
 err:
 	set_btree_node_noevict(b);
-	bch2_fs_fatal_err_on(!bch2_err_matches(ret, EROFS), c,
-			     "writing btree node: %s", bch2_err_str(ret));
+
+	if (!bch2_err_matches(ret, EROFS)) {
+		struct printbuf buf = PRINTBUF;
+		prt_printf(&buf, "writing btree node: %s\n ", bch2_err_str(ret));
+		bch2_btree_pos_to_text(&buf, c, b);
+		bch2_fs_fatal_error(c, "%s", buf.buf);
+		printbuf_exit(&buf);
+	}
 	goto out;
 }

···
 	struct bch_fs *c		= wbio->c;
 	struct btree *b			= wbio->bio.bi_private;
 	struct bch_dev *ca		= wbio->have_ioref ? bch2_dev_have_ref(c, wbio->dev) : NULL;
-	unsigned long flags;

-	if (wbio->have_ioref)
-		bch2_latency_acct(ca, wbio->submit_time, WRITE);
+	bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write,
+				   wbio->submit_time, !bio->bi_status);

-	if (!ca ||
-	    bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write,
-			       "btree write error: %s",
-			       bch2_blk_status_to_str(bio->bi_status)) ||
-	    bch2_meta_write_fault("btree")) {
+	if (ca && bio->bi_status) {
+		struct printbuf buf = PRINTBUF;
+		prt_printf(&buf, "btree write error: %s\n ",
+			   bch2_blk_status_to_str(bio->bi_status));
+		bch2_btree_pos_to_text(&buf, c, b);
+		bch_err_dev_ratelimited(ca, "%s", buf.buf);
+		printbuf_exit(&buf);
+	}
+
+	if (bio->bi_status) {
+		unsigned long flags;
 		spin_lock_irqsave(&c->btree_write_error_lock, flags);
 		bch2_dev_list_add_dev(&orig->failed, wbio->dev);
 		spin_unlock_irqrestore(&c->btree_write_error_lock, flags);
···
 	bool validate_before_checksum = false;
 	enum btree_write_type type = flags & BTREE_WRITE_TYPE_MASK;
 	void *data;
+	u64 start_time = local_clock();
 	int ret;

 	if (flags & BTREE_WRITE_ALREADY_STARTED)
···
 	wbio->data			= data;
 	wbio->data_bytes		= bytes;
 	wbio->sector_offset		= b->written;
+	wbio->start_time		= start_time;
 	wbio->wbio.c			= c;
 	wbio->wbio.used_mempool		= used_mempool;
 	wbio->wbio.first_btree_write	= !b->written;
···
 	b->written += sectors_to_write;
 nowrite:
 	btree_bounce_free(c, bytes, used_mempool, data);
-	__btree_node_write_done(c, b);
+	__btree_node_write_done(c, b, 0);
 }

 /*
+4
fs/bcachefs/btree_io.h
···
 	void			*data;
 	unsigned		data_bytes;
 	unsigned		sector_offset;
+	u64			start_time;
 	struct bch_write_bio	wbio;
 };

···
 void bch2_btree_node_read(struct btree_trans *, struct btree *, bool);
 int bch2_btree_root_read(struct bch_fs *, enum btree_id,
 			 const struct bkey_i *, unsigned);
+
+int bch2_btree_node_scrub(struct btree_trans *, enum btree_id, unsigned,
+			  struct bkey_s_c, unsigned);

 bool bch2_btree_post_write_cleanup(struct bch_fs *, struct btree *);

-14
fs/bcachefs/btree_iter.c
···
 			bch2_btree_node_iter_peek_all(&l->iter, l->b));
 }

-static inline struct bkey_s_c btree_path_level_peek(struct btree_trans *trans,
-						    struct btree_path *path,
-						    struct btree_path_level *l,
-						    struct bkey *u)
-{
-	struct bkey_s_c k = __btree_iter_unpack(trans->c, l, u,
-			bch2_btree_node_iter_peek(&l->iter, l->b));
-
-	path->pos = k.k ? k.k->p : l->b->key.k.p;
-	trans->paths_sorted = false;
-	bch2_btree_path_verify_level(trans, path, l - path->l);
-	return k;
-}
-
 static inline struct bkey_s_c btree_path_level_prev(struct btree_trans *trans,
 						    struct btree_path *path,
 						    struct btree_path_level *l,
+8 -1
fs/bcachefs/btree_iter.h
···
 }

 __always_inline
-static int btree_trans_restart_ip(struct btree_trans *trans, int err, unsigned long ip)
+static int btree_trans_restart_foreign_task(struct btree_trans *trans, int err, unsigned long ip)
 {
 	BUG_ON(err <= 0);
 	BUG_ON(!bch2_err_matches(-err, BCH_ERR_transaction_restart));

 	trans->restarted = err;
 	trans->last_restarted_ip = ip;
+	return -err;
+}
+
+__always_inline
+static int btree_trans_restart_ip(struct btree_trans *trans, int err, unsigned long ip)
+{
+	btree_trans_restart_foreign_task(trans, err, ip);
 #ifdef CONFIG_BCACHEFS_DEBUG
 	darray_exit(&trans->last_restarted_trace);
 	bch2_save_backtrace(&trans->last_restarted_trace, current, 0, GFP_NOWAIT);
+5 -3
fs/bcachefs/btree_locking.c
···
 	struct trans_waiting_for_lock *i;
 
 	for (i = g->g; i != g->g + g->nr; i++) {
-		struct task_struct *task = i->trans->locking_wait.task;
+		struct task_struct *task = READ_ONCE(i->trans->locking_wait.task);
 		if (i != g->g)
 			prt_str(out, "<- ");
-		prt_printf(out, "%u ", task ?task->pid : 0);
+		prt_printf(out, "%u ", task ? task->pid : 0);
 	}
 	prt_newline(out);
 }
···
 {
 	if (i == g->g) {
 		trace_would_deadlock(g, i->trans);
-		return btree_trans_restart(i->trans, BCH_ERR_transaction_restart_would_deadlock);
+		return btree_trans_restart_foreign_task(i->trans,
+					BCH_ERR_transaction_restart_would_deadlock,
+					_THIS_IP_);
 	} else {
 		i->trans->lock_must_abort = true;
 		wake_up_process(i->trans->locking_wait.task);
+17 -12
fs/bcachefs/btree_node_scan.c
···
 	bio->bi_iter.bi_sector = offset;
 	bch2_bio_map(bio, bn, PAGE_SIZE);
 
+	u64 submit_time = local_clock();
 	submit_bio_wait(bio);
-	if (bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_read,
-			       "IO error in try_read_btree_node() at %llu: %s",
-			       offset, bch2_blk_status_to_str(bio->bi_status)))
+
+	bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, submit_time, !bio->bi_status);
+
+	if (bio->bi_status) {
+		bch_err_dev_ratelimited(ca,
+				"IO error in try_read_btree_node() at %llu: %s",
+				offset, bch2_blk_status_to_str(bio->bi_status));
 		return;
+	}
 
 	if (le64_to_cpu(bn->magic) != bset_magic(c))
 		return;
···
 err:
 	bio_put(bio);
 	free_page((unsigned long) buf);
-	percpu_ref_get(&ca->io_ref);
+	percpu_ref_put(&ca->io_ref);
 	closure_put(w->cl);
 	kfree(w);
 	return 0;
···
 			continue;
 
 		struct find_btree_nodes_worker *w = kmalloc(sizeof(*w), GFP_KERNEL);
-		struct task_struct *t;
-
 		if (!w) {
 			percpu_ref_put(&ca->io_ref);
 			ret = -ENOMEM;
 			goto err;
 		}
 
-		percpu_ref_get(&ca->io_ref);
-		closure_get(&cl);
 		w->cl = &cl;
 		w->f = f;
 		w->ca = ca;
 
-		t = kthread_run(read_btree_nodes_worker, w, "read_btree_nodes/%s", ca->name);
+		struct task_struct *t = kthread_create(read_btree_nodes_worker, w, "read_btree_nodes/%s", ca->name);
 		ret = PTR_ERR_OR_ZERO(t);
 		if (ret) {
 			percpu_ref_put(&ca->io_ref);
-			closure_put(&cl);
-			f->ret = ret;
-			bch_err(c, "error starting kthread: %i", ret);
+			kfree(w);
+			bch_err_msg(c, ret, "starting kthread");
 			break;
 		}
+
+		closure_get(&cl);
+		percpu_ref_get(&ca->io_ref);
+		wake_up_process(t);
 	}
 err:
 	closure_sync(&cl);
+54 -74
fs/bcachefs/btree_trans_commit.c
···
 	EBUG_ON(bpos_gt(insert->k.p, b->data->max_key));
 	EBUG_ON(insert->k.u64s > bch2_btree_keys_u64s_remaining(b));
 	EBUG_ON(!b->c.level && !bpos_eq(insert->k.p, path->pos));
+	kmsan_check_memory(insert, bkey_bytes(&insert->k));
 
 	k = bch2_btree_node_iter_peek_all(node_iter, b);
 	if (k && bkey_cmp_left_packed(b, k, &insert->k.p))
···
 	BUG_ON(i->cached != path->cached);
 	BUG_ON(i->level != path->level);
 	BUG_ON(i->btree_id != path->btree_id);
+	BUG_ON(i->bkey_type != __btree_node_type(path->level, path->btree_id));
 	EBUG_ON(!i->level &&
 		btree_type_has_snapshots(i->btree_id) &&
 		!(i->flags & BTREE_UPDATE_internal_snapshot_node) &&
···
 	}
 }
 
-static int run_btree_triggers(struct btree_trans *trans, enum btree_id btree_id,
-			      unsigned *btree_id_updates_start)
-{
-	bool trans_trigger_run;
-
-	/*
-	 * Running triggers will append more updates to the list of updates as
-	 * we're walking it:
-	 */
-	do {
-		trans_trigger_run = false;
-
-		for (unsigned i = *btree_id_updates_start;
-		     i < trans->nr_updates && trans->updates[i].btree_id <= btree_id;
-		     i++) {
-			if (trans->updates[i].btree_id < btree_id) {
-				*btree_id_updates_start = i;
-				continue;
-			}
-
-			int ret = run_one_trans_trigger(trans, trans->updates + i);
-			if (ret < 0)
-				return ret;
-			if (ret)
-				trans_trigger_run = true;
-		}
-	} while (trans_trigger_run);
-
-	trans_for_each_update(trans, i)
-		BUG_ON(!(i->flags & BTREE_TRIGGER_norun) &&
-		       i->btree_id == btree_id &&
-		       btree_node_type_has_trans_triggers(i->bkey_type) &&
-		       (!i->insert_trigger_run || !i->overwrite_trigger_run));
-
-	return 0;
-}
-
 static int bch2_trans_commit_run_triggers(struct btree_trans *trans)
 {
-	unsigned btree_id = 0, btree_id_updates_start = 0;
-	int ret = 0;
+	unsigned sort_id_start = 0;
 
-	/*
-	 *
-	 * For a given btree, this algorithm runs insert triggers before
-	 * overwrite triggers: this is so that when extents are being moved
-	 * (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop references before
-	 * they are re-added.
-	 */
-	for (btree_id = 0; btree_id < BTREE_ID_NR; btree_id++) {
-		if (btree_id == BTREE_ID_alloc)
-			continue;
+	while (sort_id_start < trans->nr_updates) {
+		unsigned i, sort_id = trans->updates[sort_id_start].sort_order;
+		bool trans_trigger_run;
 
-		ret = run_btree_triggers(trans, btree_id, &btree_id_updates_start);
-		if (ret)
-			return ret;
+		/*
+		 * For a given btree, this algorithm runs insert triggers before
+		 * overwrite triggers: this is so that when extents are being
+		 * moved (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop
+		 * references before they are re-added.
+		 *
+		 * Running triggers will append more updates to the list of
+		 * updates as we're walking it:
+		 */
+		do {
+			trans_trigger_run = false;
+
+			for (i = sort_id_start;
+			     i < trans->nr_updates && trans->updates[i].sort_order <= sort_id;
+			     i++) {
+				if (trans->updates[i].sort_order < sort_id) {
+					sort_id_start = i;
+					continue;
+				}
+
+				int ret = run_one_trans_trigger(trans, trans->updates + i);
+				if (ret < 0)
+					return ret;
+				if (ret)
+					trans_trigger_run = true;
+			}
+		} while (trans_trigger_run);
+
+		sort_id_start = i;
 	}
-
-	btree_id_updates_start = 0;
-	ret = run_btree_triggers(trans, BTREE_ID_alloc, &btree_id_updates_start);
-	if (ret)
-		return ret;
 
 #ifdef CONFIG_BCACHEFS_DEBUG
 	trans_for_each_update(trans, i)
···
 	struct bch_fs *c = trans->c;
 	enum bch_watermark watermark = flags & BCH_WATERMARK_MASK;
 
+	if (bch2_err_matches(ret, BCH_ERR_journal_res_blocked)) {
+		/*
+		 * XXX: this should probably be a separate BTREE_INSERT_NONBLOCK
+		 * flag
+		 */
+		if ((flags & BCH_TRANS_COMMIT_journal_reclaim) &&
+		    watermark < BCH_WATERMARK_reclaim) {
+			ret = -BCH_ERR_journal_reclaim_would_deadlock;
+			goto out;
+		}
+
+		ret = drop_locks_do(trans,
+			bch2_trans_journal_res_get(trans,
+					(flags & BCH_WATERMARK_MASK)|
+					JOURNAL_RES_GET_CHECK));
+		goto out;
+	}
+
 	switch (ret) {
 	case -BCH_ERR_btree_insert_btree_node_full:
 		ret = bch2_btree_split_leaf(trans, i->path, flags);
···
 	case -BCH_ERR_btree_insert_need_mark_replicas:
 		ret = drop_locks_do(trans,
 			bch2_accounting_update_sb(trans));
-		break;
-	case -BCH_ERR_journal_res_get_blocked:
-		/*
-		 * XXX: this should probably be a separate BTREE_INSERT_NONBLOCK
-		 * flag
-		 */
-		if ((flags & BCH_TRANS_COMMIT_journal_reclaim) &&
-		    watermark < BCH_WATERMARK_reclaim) {
-			ret = -BCH_ERR_journal_reclaim_would_deadlock;
-			break;
-		}
-
-		ret = drop_locks_do(trans,
-			bch2_trans_journal_res_get(trans,
-					(flags & BCH_WATERMARK_MASK)|
-					JOURNAL_RES_GET_CHECK));
 		break;
 	case -BCH_ERR_btree_insert_need_journal_reclaim:
 		bch2_trans_unlock(trans);
···
 		BUG_ON(ret >= 0);
 		break;
 	}
-
+out:
 	BUG_ON(bch2_err_matches(ret, BCH_ERR_transaction_restart) != !!trans->restarted);
 
 	bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOSPC) &&
+13
fs/bcachefs/btree_types.h
···
 
 struct btree_insert_entry {
 	unsigned flags;
+	u8 sort_order;
 	u8 bkey_type;
 	enum btree_id btree_id:8;
 	u8 level:4;
···
 		;
 
 	return BIT_ULL(btree) & mask;
+}
+
+static inline u8 btree_trigger_order(enum btree_id btree)
+{
+	switch (btree) {
+	case BTREE_ID_alloc:
+		return U8_MAX;
+	case BTREE_ID_stripes:
+		return U8_MAX - 1;
+	default:
+		return btree;
+	}
 }
 
 struct btree_root {
+4 -1
fs/bcachefs/btree_update.c
···
 static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
 					 const struct btree_insert_entry *r)
 {
-	return cmp_int(l->btree_id, r->btree_id) ?:
+	return cmp_int(l->sort_order, r->sort_order) ?:
 		cmp_int(l->cached, r->cached) ?:
 		-cmp_int(l->level, r->level) ?:
 		bpos_cmp(l->k->k.p, r->k->k.p);
···
 
 	n = (struct btree_insert_entry) {
 		.flags = flags,
+		.sort_order = btree_trigger_order(path->btree_id),
 		.bkey_type = __btree_node_type(path->level, path->btree_id),
 		.btree_id = path->btree_id,
 		.level = path->level,
···
 int __must_check bch2_trans_update(struct btree_trans *trans, struct btree_iter *iter,
 				   struct bkey_i *k, enum btree_iter_update_trigger_flags flags)
 {
+	kmsan_check_memory(k, bkey_bytes(&k->k));
+
 	btree_path_idx_t path_idx = iter->update_path ?: iter->path;
 	int ret;
 
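The effect of switching the comparator from btree_id to sort_order can be sketched in user space. This is a minimal model, not the kernel code: the enum values, `struct update`, `trigger_order()`, `update_cmp()` and `sort_updates()` are hypothetical stand-ins, and only the mapping (alloc last, stripes second to last, everything else by btree ID) is taken from btree_trigger_order() in the diff.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's btree IDs; values are illustrative only. */
enum btree_id {
	BTREE_ID_extents,
	BTREE_ID_inodes,
	BTREE_ID_alloc,
	BTREE_ID_stripes,
	BTREE_ID_NR
};

/* Simplified model of an update entry: only the fields needed for ordering. */
struct update {
	uint8_t sort_order;
	enum btree_id btree_id;
};

/* Same mapping as btree_trigger_order() in the diff: alloc last, stripes second to last. */
static uint8_t trigger_order(enum btree_id btree)
{
	switch (btree) {
	case BTREE_ID_alloc:
		return UINT8_MAX;
	case BTREE_ID_stripes:
		return UINT8_MAX - 1;
	default:
		return (uint8_t) btree;
	}
}

/* Compare by sort_order only; the real comparator also compares cached, level and position. */
static int update_cmp(const void *l, const void *r)
{
	const struct update *a = l, *b = r;

	return (a->sort_order > b->sort_order) - (a->sort_order < b->sort_order);
}

/* Assign sort_order from the btree ID, then sort the list by it. */
static void sort_updates(struct update *u, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		u[i].sort_order = trigger_order(u[i].btree_id);
	qsort(u, nr, sizeof(*u), update_cmp);
}
```

With this ordering, a single pass over the sorted list reaches the stripes and alloc updates last, which is presumably why the new bch2_trans_commit_run_triggers() no longer needs a separate final pass for BTREE_ID_alloc.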
+2
fs/bcachefs/btree_update.h
···
 					    enum btree_id btree,
 					    struct bkey_i *k)
 {
+	kmsan_check_memory(k, bkey_bytes(&k->k));
+
 	if (unlikely(!btree_type_uses_write_buffer(btree))) {
 		int ret = bch2_btree_write_buffer_insert_err(trans, btree, k);
 		dump_stack();
+93 -71
fs/bcachefs/btree_update_interior.c
···
 	return 0;
 }
 
+/* If the node has been reused, we might be reading uninitialized memory - that's fine: */
+static noinline __no_kmsan_checks bool btree_node_seq_matches(struct btree *b, __le64 seq)
+{
+	struct btree_node *b_data = READ_ONCE(b->data);
+
+	return (b_data ? b_data->keys.seq : 0) == seq;
+}
+
 static void btree_update_nodes_written(struct btree_update *as)
 {
 	struct bch_fs *c = as->c;
···
 	 * on disk:
 	 */
 	for (i = 0; i < as->nr_old_nodes; i++) {
-		__le64 seq;
-
 		b = as->old_nodes[i];
 
-		bch2_trans_begin(trans);
-		btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_read);
-		seq = b->data ? b->data->keys.seq : 0;
-		six_unlock_read(&b->c.lock);
-		bch2_trans_unlock_long(trans);
-
-		if (seq == as->old_nodes_seq[i])
+		if (btree_node_seq_matches(b, as->old_nodes_seq[i]))
 			wait_on_bit_io(&b->flags, BTREE_NODE_write_in_flight_inner,
 				       TASK_UNINTERRUPTIBLE);
 	}
···
 	goto out;
 }
 
+static int get_iter_to_node(struct btree_trans *trans, struct btree_iter *iter,
+			    struct btree *b)
+{
+	bch2_trans_node_iter_init(trans, iter, b->c.btree_id, b->key.k.p,
+				  BTREE_MAX_DEPTH, b->c.level,
+				  BTREE_ITER_intent);
+	int ret = bch2_btree_iter_traverse(iter);
+	if (ret)
+		goto err;
+
+	/* has node been freed? */
+	if (btree_iter_path(trans, iter)->l[b->c.level].b != b) {
+		/* node has been freed: */
+		BUG_ON(!btree_node_dying(b));
+		ret = -BCH_ERR_btree_node_dying;
+		goto err;
+	}
+
+	BUG_ON(!btree_node_hashed(b));
+	return 0;
+err:
+	bch2_trans_iter_exit(trans, iter);
+	return ret;
+}
+
 int bch2_btree_node_rewrite(struct btree_trans *trans,
 			    struct btree_iter *iter,
 			    struct btree *b,
···
 	goto out;
 }
 
+static int bch2_btree_node_rewrite_key(struct btree_trans *trans,
+				       enum btree_id btree, unsigned level,
+				       struct bkey_i *k, unsigned flags)
+{
+	struct btree_iter iter;
+	bch2_trans_node_iter_init(trans, &iter,
+				  btree, k->k.p,
+				  BTREE_MAX_DEPTH, level, 0);
+	struct btree *b = bch2_btree_iter_peek_node(&iter);
+	int ret = PTR_ERR_OR_ZERO(b);
+	if (ret)
+		goto out;
+
+	bool found = b && btree_ptr_hash_val(&b->key) == btree_ptr_hash_val(k);
+	ret = found
+		? bch2_btree_node_rewrite(trans, &iter, b, flags)
+		: -ENOENT;
+out:
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
+int bch2_btree_node_rewrite_pos(struct btree_trans *trans,
+				enum btree_id btree, unsigned level,
+				struct bpos pos, unsigned flags)
+{
+	BUG_ON(!level);
+
+	/* Traverse one depth lower to get a pointer to the node itself: */
+	struct btree_iter iter;
+	bch2_trans_node_iter_init(trans, &iter, btree, pos, 0, level - 1, 0);
+	struct btree *b = bch2_btree_iter_peek_node(&iter);
+	int ret = PTR_ERR_OR_ZERO(b);
+	if (ret)
+		goto err;
+
+	ret = bch2_btree_node_rewrite(trans, &iter, b, flags);
+err:
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
+int bch2_btree_node_rewrite_key_get_iter(struct btree_trans *trans,
+					 struct btree *b, unsigned flags)
+{
+	struct btree_iter iter;
+	int ret = get_iter_to_node(trans, &iter, b);
+	if (ret)
+		return ret == -BCH_ERR_btree_node_dying ? 0 : ret;
+
+	ret = bch2_btree_node_rewrite(trans, &iter, b, flags);
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
 struct async_btree_rewrite {
 	struct bch_fs *c;
 	struct work_struct work;
···
 	struct bkey_buf key;
 };
 
-static int async_btree_node_rewrite_trans(struct btree_trans *trans,
-					  struct async_btree_rewrite *a)
-{
-	struct btree_iter iter;
-	bch2_trans_node_iter_init(trans, &iter,
-				  a->btree_id, a->key.k->k.p,
-				  BTREE_MAX_DEPTH, a->level, 0);
-	struct btree *b = bch2_btree_iter_peek_node(&iter);
-	int ret = PTR_ERR_OR_ZERO(b);
-	if (ret)
-		goto out;
-
-	bool found = b && btree_ptr_hash_val(&b->key) == btree_ptr_hash_val(a->key.k);
-	ret = found
-		? bch2_btree_node_rewrite(trans, &iter, b, 0)
-		: -ENOENT;
-
-#if 0
-	/* Tracepoint... */
-	if (!ret || ret == -ENOENT) {
-		struct bch_fs *c = trans->c;
-		struct printbuf buf = PRINTBUF;
-
-		if (!ret) {
-			prt_printf(&buf, "rewrite node:\n ");
-			bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(a->key.k));
-		} else {
-			prt_printf(&buf, "node to rewrite not found:\n want: ");
-			bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(a->key.k));
-			prt_printf(&buf, "\n got: ");
-			if (b)
-				bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key));
-			else
-				prt_str(&buf, "(null)");
-		}
-		bch_info(c, "%s", buf.buf);
-		printbuf_exit(&buf);
-	}
-#endif
-out:
-	bch2_trans_iter_exit(trans, &iter);
-	return ret;
-}
-
 static void async_btree_node_rewrite_work(struct work_struct *work)
 {
 	struct async_btree_rewrite *a =
 		container_of(work, struct async_btree_rewrite, work);
 	struct bch_fs *c = a->c;
 
-	int ret = bch2_trans_do(c, async_btree_node_rewrite_trans(trans, a));
+	int ret = bch2_trans_do(c, bch2_btree_node_rewrite_key(trans,
+						a->btree_id, a->level, a->key.k, 0));
 	if (ret != -ENOENT)
 		bch_err_fn_ratelimited(c, ret);
 
···
 					unsigned commit_flags, bool skip_triggers)
 {
 	struct btree_iter iter;
-	int ret;
-
-	bch2_trans_node_iter_init(trans, &iter, b->c.btree_id, b->key.k.p,
-				  BTREE_MAX_DEPTH, b->c.level,
-				  BTREE_ITER_intent);
-	ret = bch2_btree_iter_traverse(&iter);
+	int ret = get_iter_to_node(trans, &iter, b);
 	if (ret)
-		goto out;
-
-	/* has node been freed? */
-	if (btree_iter_path(trans, &iter)->l[b->c.level].b != b) {
-		/* node has been freed: */
-		BUG_ON(!btree_node_dying(b));
-		goto out;
-	}
-
-	BUG_ON(!btree_node_hashed(b));
+		return ret == -BCH_ERR_btree_node_dying ? 0 : ret;
 
 	bch2_bkey_drop_ptrs(bkey_i_to_s(new_key), ptr,
 			    !bch2_bkey_has_device(bkey_i_to_s(&b->key), ptr->dev));
 
 	ret = bch2_btree_node_update_key(trans, &iter, b, new_key,
 					 commit_flags, skip_triggers);
-out:
 	bch2_trans_iter_exit(trans, &iter);
 	return ret;
 }
+7
fs/bcachefs/btree_update_interior.h
···
 
 int bch2_btree_node_rewrite(struct btree_trans *, struct btree_iter *,
 			    struct btree *, unsigned);
+int bch2_btree_node_rewrite_pos(struct btree_trans *,
+				enum btree_id, unsigned,
+				struct bpos, unsigned);
+int bch2_btree_node_rewrite_key_get_iter(struct btree_trans *,
+					 struct btree *, unsigned);
+
 void bch2_btree_node_rewrite_async(struct bch_fs *, struct btree *);
+
 int bch2_btree_node_update_key(struct btree_trans *, struct btree_iter *,
 			       struct btree *, struct bkey_i *,
 			       unsigned, bool);
+29 -51
fs/bcachefs/buckets.c
···
 		if (ret)
 			goto err;
 
-		if (!p.ptr.cached) {
-			ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert);
-			if (ret)
-				goto err;
-		}
+		ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert);
+		if (ret)
+			goto err;
 	}
 
 	if (flags & BTREE_TRIGGER_gc) {
···
 			return -BCH_ERR_ENOMEM_mark_stripe_ptr;
 		}
 
-		mutex_lock(&c->ec_stripes_heap_lock);
+		gc_stripe_lock(m);
 
 		if (!m || !m->alive) {
-			mutex_unlock(&c->ec_stripes_heap_lock);
+			gc_stripe_unlock(m);
 			struct printbuf buf = PRINTBUF;
 			bch2_bkey_val_to_text(&buf, c, k);
 			bch_err_ratelimited(c, "pointer to nonexistent stripe %llu\n while marking %s",
···
 			.type = BCH_DISK_ACCOUNTING_replicas,
 		};
 		memcpy(&acc.replicas, &m->r.e, replicas_entry_bytes(&m->r.e));
-		mutex_unlock(&c->ec_stripes_heap_lock);
+		gc_stripe_unlock(m);
 
 		acc.replicas.data_type = data_type;
 		int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1, true);
···
 		.replicas.nr_required = 1,
 	};
 
-	struct disk_accounting_pos acct_compression_key = {
-		.type = BCH_DISK_ACCOUNTING_compression,
-	};
+	unsigned cur_compression_type = 0;
 	u64 compression_acct[3] = { 1, 0, 0 };
 
 	bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
···
 			acc_replicas_key.replicas.nr_required = 0;
 		}
 
-		if (acct_compression_key.compression.type &&
-		    acct_compression_key.compression.type != p.crc.compression_type) {
+		if (cur_compression_type &&
+		    cur_compression_type != p.crc.compression_type) {
 			if (flags & BTREE_TRIGGER_overwrite)
 				bch2_u64s_neg(compression_acct, ARRAY_SIZE(compression_acct));
 
-			ret = bch2_disk_accounting_mod(trans, &acct_compression_key, compression_acct,
-						       ARRAY_SIZE(compression_acct), gc);
+			ret = bch2_disk_accounting_mod2(trans, gc, compression_acct,
+							compression, cur_compression_type);
 			if (ret)
 				return ret;
 
···
 			compression_acct[2] = 0;
 		}
 
-		acct_compression_key.compression.type = p.crc.compression_type;
+		cur_compression_type = p.crc.compression_type;
 		if (p.crc.compression_type) {
 			compression_acct[1] += p.crc.uncompressed_size;
 			compression_acct[2] += p.crc.compressed_size;
···
 	}
 
 	if (acc_replicas_key.replicas.nr_devs && !level && k.k->p.snapshot) {
-		struct disk_accounting_pos acc_snapshot_key = {
-			.type = BCH_DISK_ACCOUNTING_snapshot,
-			.snapshot.id = k.k->p.snapshot,
-		};
-		ret = bch2_disk_accounting_mod(trans, &acc_snapshot_key, replicas_sectors, 1, gc);
+		ret = bch2_disk_accounting_mod2_nr(trans, gc, replicas_sectors, 1, snapshot, k.k->p.snapshot);
 		if (ret)
 			return ret;
 	}
 
-	if (acct_compression_key.compression.type) {
+	if (cur_compression_type) {
 		if (flags & BTREE_TRIGGER_overwrite)
 			bch2_u64s_neg(compression_acct, ARRAY_SIZE(compression_acct));
 
-		ret = bch2_disk_accounting_mod(trans, &acct_compression_key, compression_acct,
-					       ARRAY_SIZE(compression_acct), gc);
+		ret = bch2_disk_accounting_mod2(trans, gc, compression_acct,
+						compression, cur_compression_type);
 		if (ret)
 			return ret;
 	}
 
 	if (level) {
-		struct disk_accounting_pos acc_btree_key = {
-			.type = BCH_DISK_ACCOUNTING_btree,
-			.btree.id = btree_id,
-		};
-		ret = bch2_disk_accounting_mod(trans, &acc_btree_key, replicas_sectors, 1, gc);
+		ret = bch2_disk_accounting_mod2_nr(trans, gc, replicas_sectors, 1, btree, btree_id);
 		if (ret)
 			return ret;
 	} else {
 		bool insert = !(flags & BTREE_TRIGGER_overwrite);
-		struct disk_accounting_pos acc_inum_key = {
-			.type = BCH_DISK_ACCOUNTING_inum,
-			.inum.inum = k.k->p.inode,
-		};
+
 		s64 v[3] = {
 			insert ? 1 : -1,
 			insert ? k.k->size : -((s64) k.k->size),
 			*replicas_sectors,
 		};
-		ret = bch2_disk_accounting_mod(trans, &acc_inum_key, v, ARRAY_SIZE(v), gc);
+		ret = bch2_disk_accounting_mod2(trans, gc, v, inum, k.k->p.inode);
 		if (ret)
 			return ret;
 	}
···
 	}
 
 	int need_rebalance_delta = 0;
-	s64 need_rebalance_sectors_delta = 0;
+	s64 need_rebalance_sectors_delta[1] = { 0 };
 
 	s64 s = bch2_bkey_sectors_need_rebalance(c, old);
 	need_rebalance_delta -= s != 0;
-	need_rebalance_sectors_delta -= s;
+	need_rebalance_sectors_delta[0] -= s;
 
 	s = bch2_bkey_sectors_need_rebalance(c, new.s_c);
 	need_rebalance_delta += s != 0;
-	need_rebalance_sectors_delta += s;
+	need_rebalance_sectors_delta[0] += s;
 
 	if ((flags & BTREE_TRIGGER_transactional) && need_rebalance_delta) {
 		int ret = bch2_btree_bit_mod_buffered(trans, BTREE_ID_rebalance_work,
···
 			return ret;
 	}
 
-	if (need_rebalance_sectors_delta) {
-		struct disk_accounting_pos acc = {
-			.type = BCH_DISK_ACCOUNTING_rebalance_work,
-		};
-		int ret = bch2_disk_accounting_mod(trans, &acc, &need_rebalance_sectors_delta, 1,
-						   flags & BTREE_TRIGGER_gc);
+	if (need_rebalance_sectors_delta[0]) {
+		int ret = bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc,
+						    need_rebalance_sectors_delta, rebalance_work);
 		if (ret)
 			return ret;
 	}
···
 			     enum btree_iter_update_trigger_flags flags)
 {
 	if (flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) {
-		s64 sectors = k.k->size;
+		s64 sectors[1] = { k.k->size };
 
 		if (flags & BTREE_TRIGGER_overwrite)
-			sectors = -sectors;
+			sectors[0] = -sectors[0];
 
-		struct disk_accounting_pos acc = {
-			.type = BCH_DISK_ACCOUNTING_persistent_reserved,
-			.persistent_reserved.nr_replicas = bkey_s_c_to_reservation(k).v->nr_replicas,
-		};
-
-		return bch2_disk_accounting_mod(trans, &acc, &sectors, 1, flags & BTREE_TRIGGER_gc);
+		return bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc, sectors,
+				persistent_reserved, bkey_s_c_to_reservation(k).v->nr_replicas);
 	}
 
 	return 0;
+1 -30
fs/bcachefs/buckets.h
···
 	for (_b = (_buckets)->b + (_buckets)->first_bucket; \
 	     _b < (_buckets)->b + (_buckets)->nbuckets; _b++)
 
-/*
- * Ugly hack alert:
- *
- * We need to cram a spinlock in a single byte, because that's what we have left
- * in struct bucket, and we care about the size of these - during fsck, we need
- * in memory state for every single bucket on every device.
- *
- * We used to do
- * while (xchg(&b->lock, 1) cpu_relax();
- * but, it turns out not all architectures support xchg on a single byte.
- *
- * So now we use bit_spin_lock(), with fun games since we can't burn a whole
- * ulong for this - we just need to make sure the lock bit always ends up in the
- * first byte.
- */
-
-#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-#define BUCKET_LOCK_BITNR 0
-#else
-#define BUCKET_LOCK_BITNR (BITS_PER_LONG - 1)
-#endif
-
-union ulong_byte_assert {
-	ulong ulong;
-	u8 byte;
-};
-
 static inline void bucket_unlock(struct bucket *b)
 {
 	BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
···
 
 static inline int gen_after(u8 a, u8 b)
 {
-	int r = gen_cmp(a, b);
-
-	return r > 0 ? r : 0;
+	return max(0, gen_cmp(a, b));
 }
 
 static inline int dev_ptr_stale_rcu(struct bch_dev *ca, const struct bch_extent_ptr *ptr)
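The gen_after() change above is behavior-preserving: both forms clamp a signed generation distance at zero. A minimal user-space sketch follows. The definition of `gen_cmp()` here is an assumption (it is not shown in this diff): it is modelled as the signed 8-bit difference of two wrapping generation counters. The narrowing cast relies on the two's-complement wraparound that GCC and Clang provide.

```c
#include <stdint.h>

/*
 * Assumed definition, not shown in the diff: signed distance between two
 * wrapping 8-bit generation numbers. The cast to int8_t wraps modulo 256
 * on GCC and Clang.
 */
static inline int gen_cmp(uint8_t a, uint8_t b)
{
	return (int8_t) (uint8_t) (a - b);
}

/* How far a is ahead of b, or 0 if it is not ahead; equivalent to max(0, gen_cmp(a, b)). */
static inline int gen_after(uint8_t a, uint8_t b)
{
	int r = gen_cmp(a, b);

	return r > 0 ? r : 0;
}
```

Under that assumption, a generation of 2 counts as 4 steps after 254, because the counter wrapped in between, while 254 is not after 2.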
+27
fs/bcachefs/buckets_types.h
···
 
 #define BUCKET_JOURNAL_SEQ_BITS 16
 
+/*
+ * Ugly hack alert:
+ *
+ * We need to cram a spinlock in a single byte, because that's what we have left
+ * in struct bucket, and we care about the size of these - during fsck, we need
+ * in memory state for every single bucket on every device.
+ *
+ * We used to do
+ * while (xchg(&b->lock, 1) cpu_relax();
+ * but, it turns out not all architectures support xchg on a single byte.
+ *
+ * So now we use bit_spin_lock(), with fun games since we can't burn a whole
+ * ulong for this - we just need to make sure the lock bit always ends up in the
+ * first byte.
+ */
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define BUCKET_LOCK_BITNR 0
+#else
+#define BUCKET_LOCK_BITNR (BITS_PER_LONG - 1)
+#endif
+
+union ulong_byte_assert {
+	ulong ulong;
+	u8 byte;
+};
+
 struct bucket {
 	u8 lock;
 	u8 gen_valid:1;
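The comment moved into buckets_types.h explains that the lock bit has to land in the first byte of an unsigned long on either endianness. A user-space sketch of that check is below. `lock_bit_in_first_byte()` and the field name `ulong_val` are hypothetical, `BITS_PER_LONG` is redefined locally, and the kernel performs the same check at compile time with BUILD_BUG_ON rather than at run time. `__BYTE_ORDER__` is a GCC/Clang predefined macro.

```c
#include <limits.h>
#include <stdint.h>

/* Local stand-in for the kernel's BITS_PER_LONG. */
#define BITS_PER_LONG ((int) (sizeof(unsigned long) * CHAR_BIT))

/* Same selection as in the diff: lowest bit on little endian, highest on big endian. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define BUCKET_LOCK_BITNR 0
#else
#define BUCKET_LOCK_BITNR (BITS_PER_LONG - 1)
#endif

/* Overlay a single byte on the first byte in memory of an unsigned long. */
union ulong_byte_assert {
	unsigned long ulong_val;
	uint8_t byte;
};

/* Returns nonzero when the chosen lock bit is stored in the first byte in memory. */
static int lock_bit_in_first_byte(void)
{
	union ulong_byte_assert u = { .ulong_val = 1UL << BUCKET_LOCK_BITNR };

	return u.byte != 0;
}
```

This is what lets a bit spinlock that operates on a whole unsigned long be pointed at the single `u8 lock` byte at the start of struct bucket.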
+31 -7
fs/bcachefs/chardev.c
···
 #include "move.h"
 #include "recovery_passes.h"
 #include "replicas.h"
+#include "sb-counters.h"
 #include "super-io.h"
 #include "thread_with_file.h"
 
···
 	struct bch_data_ctx *ctx = container_of(arg, struct bch_data_ctx, thr);
 
 	ctx->thr.ret = bch2_data_job(ctx->c, &ctx->stats, ctx->arg);
-	ctx->stats.data_type = U8_MAX;
+	if (ctx->thr.ret == -BCH_ERR_device_offline)
+		ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_device_offline;
+	else {
+		ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_done;
+		ctx->stats.data_type = (int) DATA_PROGRESS_DATA_TYPE_done;
+	}
 	return 0;
 }
 
···
 	struct bch_data_ctx *ctx = container_of(file->private_data, struct bch_data_ctx, thr);
 	struct bch_fs *c = ctx->c;
 	struct bch_ioctl_data_event e = {
-		.type = BCH_DATA_EVENT_PROGRESS,
-		.p.data_type = ctx->stats.data_type,
-		.p.btree_id = ctx->stats.pos.btree,
-		.p.pos = ctx->stats.pos.pos,
-		.p.sectors_done = atomic64_read(&ctx->stats.sectors_seen),
-		.p.sectors_total = bch2_fs_usage_read_short(c).used,
+		.type = BCH_DATA_EVENT_PROGRESS,
+		.ret = ctx->stats.ret,
+		.p.data_type = ctx->stats.data_type,
+		.p.btree_id = ctx->stats.pos.btree,
+		.p.pos = ctx->stats.pos.pos,
+		.p.sectors_done = atomic64_read(&ctx->stats.sectors_seen),
+		.p.sectors_error_corrected = atomic64_read(&ctx->stats.sectors_error_corrected),
+		.p.sectors_error_uncorrected = atomic64_read(&ctx->stats.sectors_error_uncorrected),
 	};
+
+	if (ctx->arg.op == BCH_DATA_OP_scrub) {
+		struct bch_dev *ca = bch2_dev_tryget(c, ctx->arg.scrub.dev);
+		if (ca) {
+			struct bch_dev_usage u;
+			bch2_dev_usage_read_fast(ca, &u);
+			for (unsigned i = BCH_DATA_btree; i < ARRAY_SIZE(u.d); i++)
+				if (ctx->arg.scrub.data_types & BIT(i))
+					e.p.sectors_total += u.d[i].sectors;
+			bch2_dev_put(ca);
+		}
+	} else {
+		e.p.sectors_total = bch2_fs_usage_read_short(c).used;
+	}
 
 	if (len < sizeof(e))
 		return -EINVAL;
···
 		BCH_IOCTL(fsck_online, struct bch_ioctl_fsck_online);
 	case BCH_IOCTL_QUERY_ACCOUNTING:
 		return bch2_ioctl_query_accounting(c, arg);
+	case BCH_IOCTL_QUERY_COUNTERS:
+		return bch2_ioctl_query_counters(c, arg);
 	default:
 		return -ENOTTY;
 	}
+14 -11
fs/bcachefs/checksum.c
···
 		prt_str(&buf, ")");
 		WARN_RATELIMIT(1, "%s", buf.buf);
 		printbuf_exit(&buf);
-		return -EIO;
+		return -BCH_ERR_recompute_checksum;
 	}
 
 	for (i = splits; i < splits + ARRAY_SIZE(splits); i++) {
···
 	return 0;
 }
 
+#if 0
+
+/*
+ * This seems to be duplicating code in cmd_remove_passphrase() in
+ * bcachefs-tools, but we might want to switch userspace to use this - and
+ * perhaps add an ioctl for calling this at runtime, so we can take the
+ * passphrase off of a mounted filesystem (which has come up).
+ */
 int bch2_disable_encryption(struct bch_fs *c)
 {
 	struct bch_sb_field_crypt *crypt;
···
 	return ret;
 }
 
+/*
+ * For enabling encryption on an existing filesystem: not hooked up yet, but it
+ * should be
+ */
 int bch2_enable_encryption(struct bch_fs *c, bool keyed)
 {
 	struct bch_encrypted_key key;
···
 	memzero_explicit(&key, sizeof(key));
 	return ret;
 }
+#endif
 
 void bch2_fs_encryption_exit(struct bch_fs *c)
 {
···
 		crypto_free_shash(c->poly1305);
 	if (c->chacha20)
 		crypto_free_sync_skcipher(c->chacha20);
-	if (c->sha256)
-		crypto_free_shash(c->sha256);
 }
 
 int bch2_fs_encryption_init(struct bch_fs *c)
···
 	struct bch_sb_field_crypt *crypt;
 	struct bch_key key;
 	int ret = 0;
-
-	c->sha256 = crypto_alloc_shash("sha256", 0, 0);
-	ret = PTR_ERR_OR_ZERO(c->sha256);
-	if (ret) {
-		c->sha256 = NULL;
-		bch_err(c, "error requesting sha256 module: %s", bch2_err_str(ret));
-		goto out;
-	}
 
 	crypt = bch2_sb_field_get(c->disk_sb.sb, crypt);
 	if (!crypt)
+2
fs/bcachefs/checksum.h
···
 int bch2_decrypt_sb_key(struct bch_fs *, struct bch_sb_field_crypt *,
 			struct bch_key *);
 
+#if 0
 int bch2_disable_encryption(struct bch_fs *);
 int bch2_enable_encryption(struct bch_fs *, bool);
+#endif
 
 void bch2_fs_encryption_exit(struct bch_fs *);
 int bch2_fs_encryption_init(struct bch_fs *);
+29 -36
fs/bcachefs/compress.c
···
 	size_t src_len = src->bi_iter.bi_size;
 	size_t dst_len = crc.uncompressed_size << 9;
 	void *workspace;
-	int ret;
+	int ret = 0, ret2;

 	enum bch_compression_opts opt = bch2_compression_type_to_opt(crc.compression_type);
 	mempool_t *workspace_pool = &c->compress_workspace[opt];
···
 		else
 			ret = -BCH_ERR_compression_workspace_not_initialized;
 		if (ret)
-			goto out;
+			goto err;
 	}

 	src_data = bio_map_or_bounce(c, src, READ);
···
 	switch (crc.compression_type) {
 	case BCH_COMPRESSION_TYPE_lz4_old:
 	case BCH_COMPRESSION_TYPE_lz4:
-		ret = LZ4_decompress_safe_partial(src_data.b, dst_data,
-						  src_len, dst_len, dst_len);
-		if (ret != dst_len)
-			goto err;
+		ret2 = LZ4_decompress_safe_partial(src_data.b, dst_data,
+						   src_len, dst_len, dst_len);
+		if (ret2 != dst_len)
+			ret = -BCH_ERR_decompress_lz4;
 		break;
 	case BCH_COMPRESSION_TYPE_gzip: {
 		z_stream strm = {
···

 		zlib_set_workspace(&strm, workspace);
 		zlib_inflateInit2(&strm, -MAX_WBITS);
-		ret = zlib_inflate(&strm, Z_FINISH);
+		ret2 = zlib_inflate(&strm, Z_FINISH);

 		mempool_free(workspace, workspace_pool);

-		if (ret != Z_STREAM_END)
-			goto err;
+		if (ret2 != Z_STREAM_END)
+			ret = -BCH_ERR_decompress_gzip;
 		break;
 	}
 	case BCH_COMPRESSION_TYPE_zstd: {
 		ZSTD_DCtx *ctx;
 		size_t real_src_len = le32_to_cpup(src_data.b);

-		if (real_src_len > src_len - 4)
+		if (real_src_len > src_len - 4) {
+			ret = -BCH_ERR_decompress_zstd_src_len_bad;
 			goto err;
+		}

 		workspace = mempool_alloc(workspace_pool, GFP_NOFS);
 		ctx = zstd_init_dctx(workspace, zstd_dctx_workspace_bound());

-		ret = zstd_decompress_dctx(ctx,
+		ret2 = zstd_decompress_dctx(ctx,
 				dst_data, dst_len,
 				src_data.b + 4, real_src_len);

 		mempool_free(workspace, workspace_pool);

-		if (ret != dst_len)
-			goto err;
+		if (ret2 != dst_len)
+			ret = -BCH_ERR_decompress_zstd;
 		break;
 	}
 	default:
 		BUG();
 	}
-	ret = 0;
+err:
 fsck_err:
-out:
 	bio_unmap_or_unbounce(c, src_data);
 	return ret;
-err:
-	ret = -EIO;
-	goto out;
 }

 int bch2_bio_uncompress_inplace(struct bch_write_op *op,
···
 	BUG_ON(!bio->bi_vcnt);
 	BUG_ON(DIV_ROUND_UP(crc->live_size, PAGE_SECTORS) > bio->bi_max_vecs);

-	if (crc->uncompressed_size << 9 > c->opts.encoded_extent_max ||
-	    crc->compressed_size << 9 > c->opts.encoded_extent_max) {
-		struct printbuf buf = PRINTBUF;
-		bch2_write_op_error(&buf, op);
-		prt_printf(&buf, "error rewriting existing data: extent too big");
-		bch_err_ratelimited(c, "%s", buf.buf);
-		printbuf_exit(&buf);
-		return -EIO;
+	if (crc->uncompressed_size << 9 > c->opts.encoded_extent_max) {
+		bch2_write_op_error(op, op->pos.offset,
+				    "extent too big to decompress (%u > %u)",
+				    crc->uncompressed_size << 9, c->opts.encoded_extent_max);
+		return -BCH_ERR_decompress_exceeded_max_encoded_extent;
 	}

 	data = __bounce_alloc(c, dst_len, WRITE);

-	if (__bio_uncompress(c, bio, data.b, *crc)) {
-		if (!c->opts.no_data_io) {
-			struct printbuf buf = PRINTBUF;
-			bch2_write_op_error(&buf, op);
-			prt_printf(&buf, "error rewriting existing data: decompression error");
-			bch_err_ratelimited(c, "%s", buf.buf);
-			printbuf_exit(&buf);
-		}
-		ret = -EIO;
+	ret = __bio_uncompress(c, bio, data.b, *crc);
+
+	if (c->opts.no_data_io)
+		ret = 0;
+
+	if (ret) {
+		bch2_write_op_error(op, op->pos.offset, "%s", bch2_err_str(ret));
 		goto err;
 	}

···

 	if (crc.uncompressed_size << 9 > c->opts.encoded_extent_max ||
 	    crc.compressed_size << 9 > c->opts.encoded_extent_max)
-		return -EIO;
+		return -BCH_ERR_decompress_exceeded_max_encoded_extent;

 	dst_data = dst_len == dst_iter.bi_size
 		? __bio_map_or_bounce(c, dst, dst_iter, WRITE)
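The compress.c hunks above stop reusing one variable for both the decompressor's return value and the function's own error code, and they replace the blanket -EIO with a specific code per failure. A minimal user-space sketch of that pattern follows. The error codes, `toy_decompress()` and `uncompress()` are hypothetical stand-ins invented for illustration; they are not bcachefs or kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical error codes standing in for the BCH_ERR_decompress_* values. */
enum {
	ERR_decompress_len_bad = 1001,
	ERR_decompress_failed  = 1002,
};

/* Toy "decompressor": copies src to dst and returns the number of bytes produced. */
static int toy_decompress(const char *src, size_t src_len, char *dst, size_t dst_len)
{
	size_t n = src_len < dst_len ? src_len : dst_len;

	memcpy(dst, src, n);
	return (int) n;
}

static int uncompress(const char *src, size_t src_len, char *dst, size_t dst_len)
{
	/* ret is the code returned to the caller; ret2 holds the library's raw result. */
	int ret = 0, ret2;

	assert(src != NULL && dst != NULL);

	if (src_len > dst_len) {
		ret = -ERR_decompress_len_bad;
		goto err;
	}

	ret2 = toy_decompress(src, src_len, dst, dst_len);
	if (ret2 != (int) dst_len)
		ret = -ERR_decompress_failed;
err:
	/* Shared cleanup would go here; success and failure both fall through it. */
	return ret;
}
```

Because `ret` starts at 0 and is only ever assigned an error, the success path needs no separate `ret = 0` before the cleanup label, which is what lets the diff drop the old `out:`/`err:` pair.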
+182 -55
fs/bcachefs/data_update.c
···
 #include "subvolume.h"
 #include "trace.h"

+#include <linux/ioprio.h>
+
 static void bkey_put_dev_refs(struct bch_fs *c, struct bkey_s_c k)
 {
 	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
···
 	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);

 	bkey_for_each_ptr(ptrs, ptr) {
-		if (!bch2_dev_tryget(c, ptr->dev)) {
+		if (unlikely(!bch2_dev_tryget(c, ptr->dev))) {
 			bkey_for_each_ptr(ptrs, ptr2) {
 				if (ptr2 == ptr)
 					break;
···
 	return true;
 }

-static noinline void trace_move_extent_finish2(struct data_update *u,
+static noinline void trace_io_move_finish2(struct data_update *u,
 				struct bkey_i *new,
 				struct bkey_i *insert)
 {
···
 	bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(insert));
 	prt_newline(&buf);

-	trace_move_extent_finish(c, buf.buf);
+	trace_io_move_finish(c, buf.buf);
 	printbuf_exit(&buf);
 }

-static void trace_move_extent_fail2(struct data_update *m,
+static void trace_io_move_fail2(struct data_update *m,
 			 struct bkey_s_c new,
 			 struct bkey_s_c wrote,
 			 struct bkey_i *insert,
···
 	struct printbuf buf = PRINTBUF;
 	unsigned rewrites_found = 0;

-	if (!trace_move_extent_fail_enabled())
+	if (!trace_io_move_fail_enabled())
 		return;

 	prt_str(&buf, msg);
···
 		bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(insert));
 	}

-	trace_move_extent_fail(c, buf.buf);
+	trace_io_move_fail(c, buf.buf);
 	printbuf_exit(&buf);
 }

···
 		new = bkey_i_to_extent(bch2_keylist_front(keys));

 		if (!bch2_extents_match(k, old)) {
-			trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i),
+			trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i),
 						NULL, "no match:");
 			goto nowork;
 		}
···
 		if (m->data_opts.rewrite_ptrs &&
 		    !rewrites_found &&
 		    bch2_bkey_durability(c, k) >= m->op.opts.data_replicas) {
-			trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "no rewrites found:");
+			trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "no rewrites found:");
 			goto nowork;
 		}

···
 		}

 		if (!bkey_val_u64s(&new->k)) {
-			trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "new replicas conflicted:");
+			trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "new replicas conflicted:");
 			goto nowork;
 		}

···
 			printbuf_exit(&buf);

 			bch2_fatal_error(c);
-			ret = -EIO;
+			ret = -BCH_ERR_invalid_bkey;
 			goto out;
 		}

···
 		if (!ret) {
 			bch2_btree_iter_set_pos(&iter, next_pos);

-			this_cpu_add(c->counters[BCH_COUNTER_move_extent_finish], new->k.size);
-			if (trace_move_extent_finish_enabled())
-				trace_move_extent_finish2(m, &new->k_i, insert);
+			this_cpu_add(c->counters[BCH_COUNTER_io_move_finish], new->k.size);
+			if (trace_io_move_finish_enabled())
+				trace_io_move_finish2(m, &new->k_i, insert);
 		}
 err:
 		if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
···
 				     &m->stats->sectors_raced);
 		}

-		count_event(c, move_extent_fail);
+		count_event(c, io_move_fail);

 		bch2_btree_iter_advance(&iter);
 		goto next;
···
 	return bch2_trans_run(op->c, __bch2_data_update_index_update(trans, op));
 }

-void bch2_data_update_read_done(struct data_update *m,
-				struct bch_extent_crc_unpacked crc)
+void bch2_data_update_read_done(struct data_update *m)
 {
+	m->read_done = true;
+
 	/* write bio must own pages: */
 	BUG_ON(!m->op.wbio.bio.bi_vcnt);

-	m->op.crc = crc;
-	m->op.wbio.bio.bi_iter.bi_size = crc.compressed_size << 9;
+	m->op.crc = m->rbio.pick.crc;
+	m->op.wbio.bio.bi_iter.bi_size = m->op.crc.compressed_size << 9;
+
+	this_cpu_add(m->op.c->counters[BCH_COUNTER_io_move_write], m->k.k->k.size);

 	closure_call(&m->op.cl, bch2_write, NULL, NULL);
 }
···
 	struct bch_fs *c = update->op.c;
 	struct bkey_s_c k = bkey_i_to_s_c(update->k.k);

+	bch2_bio_free_pages_pool(c, &update->op.wbio.bio);
+	kfree(update->bvecs);
+	update->bvecs = NULL;
+
 	if (c->opts.nocow_enabled)
 		bkey_nocow_unlock(c, k);
 	bkey_put_dev_refs(c, k);
-	bch2_bkey_buf_exit(&update->k, c);
 	bch2_disk_reservation_put(c, &update->op.res);
-	bch2_bio_free_pages_pool(c, &update->op.wbio.bio);
+	bch2_bkey_buf_exit(&update->k, c);
 }

-static void bch2_update_unwritten_extent(struct btree_trans *trans,
-				   struct data_update *update)
+static int bch2_update_unwritten_extent(struct btree_trans *trans,
+					struct data_update *update)
 {
 	struct bch_fs *c = update->op.c;
-	struct bio *bio = &update->op.wbio.bio;
 	struct bkey_i_extent *e;
 	struct write_point *wp;
 	struct closure cl;
 	struct btree_iter iter;
 	struct bkey_s_c k;
-	int ret;
+	int ret = 0;

 	closure_init_stack(&cl);
 	bch2_keylist_init(&update->op.insert_keys, update->op.inline_keys);

-	while (bio_sectors(bio)) {
-		unsigned sectors = bio_sectors(bio);
+	while (bpos_lt(update->op.pos, update->k.k->k.p)) {
+		unsigned sectors = update->k.k->k.p.offset -
+			update->op.pos.offset;

 		bch2_trans_begin(trans);

···
 			bch_err_fn_ratelimited(c, ret);

 		if (ret)
-			return;
+			break;

 		sectors = min(sectors, wp->sectors_free);

···
 		bch2_alloc_sectors_append_ptrs(c, wp, &e->k_i, sectors, false);
 		bch2_alloc_sectors_done(c, wp);

-		bio_advance(bio, sectors << 9);
 		update->op.pos.offset += sectors;

 		extent_for_each_ptr(extent_i_to_s(e), ptr)
···
 		bch2_trans_unlock(trans);
 		closure_sync(&cl);
 	}
+
+	return ret;
 }

 void bch2_data_update_opts_to_text(struct printbuf *out, struct bch_fs *c,
 				   struct bch_io_opts *io_opts,
 				   struct data_update_opts *data_opts)
 {
-	printbuf_tabstop_push(out, 20);
+	if (!out->nr_tabstops)
+		printbuf_tabstop_push(out, 20);

 	prt_str_indented(out, "rewrite ptrs:\t");
 	bch2_prt_u64_base2(out, data_opts->rewrite_ptrs);
···

 	prt_str_indented(out, "old key:\t");
 	bch2_bkey_val_to_text(out, m->op.c, bkey_i_to_s_c(m->k.k));
+}
+
+void bch2_data_update_inflight_to_text(struct printbuf *out, struct data_update *m)
+{
+	bch2_bkey_val_to_text(out, m->op.c, bkey_i_to_s_c(m->k.k));
+	prt_newline(out);
+	printbuf_indent_add(out, 2);
+	bch2_data_update_opts_to_text(out, m->op.c, &m->op.opts, &m->data_opts);
+	prt_printf(out, "read_done:\t\%u\n", m->read_done);
+	bch2_write_op_to_text(out, &m->op);
+	printbuf_indent_sub(out, 2);
 }

 int bch2_extent_drop_ptrs(struct btree_trans *trans,
···
 		bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc);
 }

+int bch2_data_update_bios_init(struct data_update *m, struct bch_fs *c,
+			       struct bch_io_opts *io_opts)
+{
+	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(bkey_i_to_s_c(m->k.k));
+	const union bch_extent_entry *entry;
+	struct extent_ptr_decoded p;
+
+	/* write path might have to decompress data: */
+	unsigned buf_bytes = 0;
+	bkey_for_each_ptr_decode(&m->k.k->k, ptrs, p, entry)
+		buf_bytes = max_t(unsigned, buf_bytes, p.crc.uncompressed_size << 9);
+
+	unsigned nr_vecs = DIV_ROUND_UP(buf_bytes, PAGE_SIZE);
+
+	m->bvecs = kmalloc_array(nr_vecs, sizeof*(m->bvecs), GFP_KERNEL);
+	if (!m->bvecs)
+		return -ENOMEM;
+
+	bio_init(&m->rbio.bio, NULL, m->bvecs, nr_vecs, REQ_OP_READ);
+	bio_init(&m->op.wbio.bio, NULL, m->bvecs, nr_vecs, 0);
+
+	if (bch2_bio_alloc_pages(&m->op.wbio.bio, buf_bytes, GFP_KERNEL)) {
+		kfree(m->bvecs);
+		m->bvecs = NULL;
+		return -ENOMEM;
+	}
+
+	rbio_init(&m->rbio.bio, c, *io_opts, NULL);
+	m->rbio.data_update = true;
+	m->rbio.bio.bi_iter.bi_size = buf_bytes;
+	m->rbio.bio.bi_iter.bi_sector = bkey_start_offset(&m->k.k->k);
+	m->op.wbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
+	return 0;
+}
+
+static int can_write_extent(struct bch_fs *c, struct data_update *m)
+{
+	if ((m->op.flags & BCH_WRITE_alloc_nowait) &&
+	    unlikely(c->open_buckets_nr_free <= bch2_open_buckets_reserved(m->op.watermark)))
+		return -BCH_ERR_data_update_done_would_block;
+
+	unsigned target = m->op.flags & BCH_WRITE_only_specified_devs
+		? m->op.target
+		: 0;
+	struct bch_devs_mask devs = target_rw_devs(c, BCH_DATA_user, target);
+
+	darray_for_each(m->op.devs_have, i)
+		__clear_bit(*i, devs.d);
+
+	rcu_read_lock();
+	unsigned nr_replicas = 0, i;
+	for_each_set_bit(i, devs.d, BCH_SB_MEMBERS_MAX) {
+		struct bch_dev *ca = bch2_dev_rcu(c, i);
+
+		struct bch_dev_usage usage;
+		bch2_dev_usage_read_fast(ca, &usage);
+
+		if (!dev_buckets_free(ca, usage, m->op.watermark))
+			continue;
+
+		nr_replicas += ca->mi.durability;
+		if (nr_replicas >= m->op.nr_replicas)
+			break;
+	}
+	rcu_read_unlock();
+
+	if (!nr_replicas)
+		return -BCH_ERR_data_update_done_no_rw_devs;
+	if (nr_replicas < m->op.nr_replicas)
+		return -BCH_ERR_insufficient_devices;
+	return 0;
+}
+
 int bch2_data_update_init(struct btree_trans *trans,
 			  struct btree_iter *iter,
 			  struct moving_context *ctxt,
 			  struct data_update *m,
 			  struct write_point_specifier wp,
-			  struct bch_io_opts io_opts,
+			  struct bch_io_opts *io_opts,
 			  struct data_update_opts data_opts,
 			  enum btree_id btree_id,
 			  struct bkey_s_c k)
···
 	 * snapshots table - just skip it, we can move it later.
 	 */
 	if (unlikely(k.k->p.snapshot && !bch2_snapshot_exists(c, k.k->p.snapshot)))
-		return -BCH_ERR_data_update_done;
-
-	if (!bkey_get_dev_refs(c, k))
-		return -BCH_ERR_data_update_done;
-
-	if (c->opts.nocow_enabled &&
-	    !bkey_nocow_lock(c, ctxt, k)) {
-		bkey_put_dev_refs(c, k);
-		return -BCH_ERR_nocow_lock_blocked;
-	}
+		return -BCH_ERR_data_update_done_no_snapshot;

 	bch2_bkey_buf_init(&m->k);
 	bch2_bkey_buf_reassemble(&m->k, c, k);
···
 	m->ctxt = ctxt;
 	m->stats = ctxt ? ctxt->stats : NULL;

-	bch2_write_op_init(&m->op, c, io_opts);
+	bch2_write_op_init(&m->op, c, *io_opts);
 	m->op.pos = bkey_start_pos(k.k);
 	m->op.version = k.k->bversion;
 	m->op.target = data_opts.target;
 	m->op.write_point = wp;
 	m->op.nr_replicas = 0;
-	m->op.flags |= BCH_WRITE_PAGES_STABLE|
-		BCH_WRITE_PAGES_OWNED|
-		BCH_WRITE_DATA_ENCODED|
-		BCH_WRITE_MOVE|
+	m->op.flags |= BCH_WRITE_pages_stable|
+		BCH_WRITE_pages_owned|
+		BCH_WRITE_data_encoded|
+		BCH_WRITE_move|
 		m->data_opts.write_flags;
-	m->op.compression_opt = io_opts.background_compression;
+	m->op.compression_opt = io_opts->background_compression;
 	m->op.watermark = m->data_opts.btree_insert_flags & BCH_WATERMARK_MASK;

 	unsigned durability_have = 0, durability_removing = 0;
···
 		ptr_bit <<= 1;
 	}

-	unsigned durability_required = max(0, (int) (io_opts.data_replicas - durability_have));
+	unsigned durability_required = max(0, (int) (io_opts->data_replicas - durability_have));

 	/*
 	 * If current extent durability is less than io_opts.data_replicas,
···
 		m->data_opts.rewrite_ptrs = 0;
 		/* if iter == NULL, it's just a promote */
 		if (iter)
-			ret = bch2_extent_drop_ptrs(trans, iter, k, &io_opts, &m->data_opts);
-		goto out;
+			ret = bch2_extent_drop_ptrs(trans, iter, k, io_opts, &m->data_opts);
+		if (!ret)
+			ret = -BCH_ERR_data_update_done_no_writes_needed;
+		goto out_bkey_buf_exit;
 	}
+
+	/*
+	 * Check if the allocation will succeed, to avoid getting an error later
+	 * in bch2_write() -> bch2_alloc_sectors_start() and doing a useless
+	 * read:
+	 *
+	 * This guards against
+	 * - BCH_WRITE_alloc_nowait allocations failing (promotes)
+	 * - Destination target full
+	 * - Device(s) in destination target offline
+	 * - Insufficient durability available in destination target
+	 *   (i.e. trying to move a durability=2 replica to a target with a
+	 *   single durability=2 device)
+	 */
+	ret = can_write_extent(c, m);
+	if (ret)
+		goto out_bkey_buf_exit;

 	if (reserve_sectors) {
 		ret = bch2_disk_reservation_add(c, &m->op.res, reserve_sectors,
···
 				? 0
 				: BCH_DISK_RESERVATION_NOFAIL);
 		if (ret)
-			goto out;
+			goto out_bkey_buf_exit;
+	}
+
+	if (!bkey_get_dev_refs(c, k)) {
+		ret = -BCH_ERR_data_update_done_no_dev_refs;
+		goto out_put_disk_res;
+	}
+
+	if (c->opts.nocow_enabled &&
+	    !bkey_nocow_lock(c, ctxt, k)) {
+		ret = -BCH_ERR_nocow_lock_blocked;
+		goto out_put_dev_refs;
 	}

 	if (bkey_extent_is_unwritten(k)) {
-		bch2_update_unwritten_extent(trans, m);
-		goto out;
+		ret = bch2_update_unwritten_extent(trans, m) ?:
+			-BCH_ERR_data_update_done_unwritten;
+		goto out_nocow_unlock;
 	}

+	ret = bch2_data_update_bios_init(m, c, io_opts);
+	if (ret)
+		goto out_nocow_unlock;
+
 	return 0;
-out:
-	bch2_data_update_exit(m);
-	return ret ?: -BCH_ERR_data_update_done;
+out_nocow_unlock:
+	if (c->opts.nocow_enabled)
+		bkey_nocow_unlock(c, k);
+out_put_dev_refs:
+	bkey_put_dev_refs(c, k);
+out_put_disk_res:
+	bch2_disk_reservation_put(c, &m->op.res);
+out_bkey_buf_exit:
+	bch2_bkey_buf_exit(&m->k, c);
+	return ret;
 }

 void bch2_data_update_opts_normalize(struct bkey_s_c k, struct data_update_opts *opts)
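Two ideas in the data_update.c changes above lend themselves to a small illustration: the up-front check that enough writable durability exists before any read is issued, and the staged `goto` labels that release only what has been acquired so far. The sketch below models both in user space. `struct toy_dev`, `struct toy_update`, `can_write()`, `toy_update_init()` and the error codes are hypothetical names made up for this example; the real code works on bcachefs device masks under RCU and is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes standing in for the -BCH_ERR_* values in the diff. */
enum {
	ERR_no_rw_devs           = 2001,
	ERR_insufficient_devices = 2002,
	ERR_no_dev_refs          = 2003,
};

struct toy_dev {
	bool     has_free_buckets;
	unsigned durability;
};

struct toy_update {
	bool have_reservation;
	bool have_dev_refs;
};

/* Sum the durability of devices that can take writes, stopping once enough is found. */
static int can_write(const struct toy_dev *devs, size_t nr_devs, unsigned nr_replicas_wanted)
{
	unsigned nr_replicas = 0;

	for (size_t i = 0; i < nr_devs; i++) {
		if (!devs[i].has_free_buckets)
			continue;

		nr_replicas += devs[i].durability;
		if (nr_replicas >= nr_replicas_wanted)
			break;
	}

	if (!nr_replicas)
		return -ERR_no_rw_devs;
	if (nr_replicas < nr_replicas_wanted)
		return -ERR_insufficient_devices;
	return 0;
}

static int toy_update_init(struct toy_update *u,
			   const struct toy_dev *devs, size_t nr_devs,
			   unsigned nr_replicas_wanted, bool dev_refs_available)
{
	int ret;

	assert(u != NULL);

	/* Cheap check first: nothing has been acquired yet, so there is nothing to undo. */
	ret = can_write(devs, nr_devs, nr_replicas_wanted);
	if (ret)
		return ret;

	u->have_reservation = true;		/* stage 1: disk reservation */

	if (!dev_refs_available) {		/* stage 2: device references */
		ret = -ERR_no_dev_refs;
		goto out_put_reservation;
	}
	u->have_dev_refs = true;

	return 0;

out_put_reservation:
	/* Each label undoes exactly the stages that completed before the failure. */
	u->have_reservation = false;
	return ret;
}
```

Ordering the labels in reverse acquisition order means a failure at any stage falls through only the cleanups that apply, which is the shape the new `out_nocow_unlock` / `out_put_dev_refs` / `out_put_disk_res` / `out_bkey_buf_exit` labels take in the diff.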
+14 -3
fs/bcachefs/data_update.h
···
 #define _BCACHEFS_DATA_UPDATE_H

 #include "bkey_buf.h"
+#include "io_read.h"
 #include "io_write_types.h"

 struct moving_context;
···
 	u8 extra_replicas;
 	unsigned btree_insert_flags;
 	unsigned write_flags;
+
+	int read_dev;
+	bool scrub;
 };

 void bch2_data_update_opts_to_text(struct printbuf *, struct bch_fs *,
···

 struct data_update {
 	/* extent being updated: */
+	bool read_done;
 	enum btree_id btree_id;
 	struct bkey_buf k;
 	struct data_update_opts data_opts;
 	struct moving_context *ctxt;
 	struct bch_move_stats *stats;
+
+	struct bch_read_bio rbio;
 	struct bch_write_op op;
+	struct bio_vec *bvecs;
 };

 void bch2_data_update_to_text(struct printbuf *, struct data_update *);
+void bch2_data_update_inflight_to_text(struct printbuf *, struct data_update *);

 int bch2_data_update_index_update(struct bch_write_op *);

-void bch2_data_update_read_done(struct data_update *,
-				struct bch_extent_crc_unpacked);
+void bch2_data_update_read_done(struct data_update *);

 int bch2_extent_drop_ptrs(struct btree_trans *,
 			  struct btree_iter *,
···
 			  struct bch_io_opts *,
 			  struct data_update_opts *);

+int bch2_data_update_bios_init(struct data_update *, struct bch_fs *,
+			       struct bch_io_opts *);
+
 void bch2_data_update_exit(struct data_update *);
 int bch2_data_update_init(struct btree_trans *, struct btree_iter *,
 			  struct moving_context *,
 			  struct data_update *,
 			  struct write_point_specifier,
-			  struct bch_io_opts, struct data_update_opts,
+			  struct bch_io_opts *, struct data_update_opts,
 			  enum btree_id, struct bkey_s_c);
 void bch2_data_update_opts_normalize(struct bkey_s_c, struct data_update_opts *);

+30 -4
fs/bcachefs/debug.c
···
  */

 #include "bcachefs.h"
+#include "alloc_foreground.h"
 #include "bkey_methods.h"
 #include "btree_cache.h"
 #include "btree_io.h"
···
 	unsigned offset = 0;
 	int ret;

-	if (bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key), NULL, &pick) <= 0) {
+	if (bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key), NULL, &pick, -1) <= 0) {
 		prt_printf(out, "error getting device to read from: invalid device\n");
 		return;
 	}
···
 	seqmutex_unlock(&c->btree_trans_lock);
 }

-static ssize_t bch2_btree_deadlock_read(struct file *file, char __user *buf,
-					    size_t size, loff_t *ppos)
+typedef void (*fs_to_text_fn)(struct printbuf *, struct bch_fs *);
+
+static ssize_t bch2_simple_print(struct file *file, char __user *buf,
+				 size_t size, loff_t *ppos,
+				 fs_to_text_fn fn)
 {
 	struct dump_iter *i = file->private_data;
 	struct bch_fs *c = i->c;
···
 	i->ret = 0;

 	if (!i->iter) {
-		btree_deadlock_to_text(&i->buf, c);
+		fn(&i->buf, c);
 		i->iter++;
 	}

···
 	return ret ?: i->ret;
 }

+static ssize_t bch2_btree_deadlock_read(struct file *file, char __user *buf,
+					size_t size, loff_t *ppos)
+{
+	return bch2_simple_print(file, buf, size, ppos, btree_deadlock_to_text);
+}
+
 static const struct file_operations btree_deadlock_ops = {
 	.owner = THIS_MODULE,
 	.open = bch2_dump_open,
 	.release = bch2_dump_release,
 	.read = bch2_btree_deadlock_read,
+};
+
+static ssize_t bch2_write_points_read(struct file *file, char __user *buf,
+				      size_t size, loff_t *ppos)
+{
+	return bch2_simple_print(file, buf, size, ppos, bch2_write_points_to_text);
+}
+
+static const struct file_operations write_points_ops = {
+	.owner = THIS_MODULE,
+	.open = bch2_dump_open,
+	.release = bch2_dump_release,
+	.read = bch2_write_points_read,
 };

 void bch2_fs_debug_exit(struct bch_fs *c)
···

 	debugfs_create_file("btree_deadlock", 0400, c->fs_debug_dir,
 			    c->btree_debug, &btree_deadlock_ops);
+
+	debugfs_create_file("write_points", 0400, c->fs_debug_dir,
+			    c->btree_debug, &write_points_ops);

 	c->btree_debug_dir = debugfs_create_dir("btrees", c->fs_debug_dir);
 	if (IS_ERR_OR_NULL(c->btree_debug_dir))
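The debug.c change above turns the deadlock read handler into a generic helper that takes a "to text" callback, so adding the new write_points file costs only a thin wrapper. A minimal user-space sketch of that refactor is below. `struct toy_buf`, `simple_print()` and the two renderers are hypothetical stand-ins for illustration; the real helper works on a `struct dump_iter` and copies to a user-space buffer across repeated reads.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for struct printbuf. */
struct toy_buf {
	char data[64];
};

typedef void (*to_text_fn)(struct toy_buf *);

static void deadlock_to_text(struct toy_buf *out)
{
	snprintf(out->data, sizeof(out->data), "deadlock info");
}

static void write_points_to_text(struct toy_buf *out)
{
	snprintf(out->data, sizeof(out->data), "write points");
}

/* Shared read path: render with fn, then copy at most size - 1 bytes plus a NUL to dst. */
static size_t simple_print(char *dst, size_t size, to_text_fn fn)
{
	struct toy_buf buf = { .data = "" };
	size_t len;

	assert(fn != NULL);
	if (!size)
		return 0;

	fn(&buf);
	len = strlen(buf.data);
	if (len > size - 1)
		len = size - 1;
	memcpy(dst, buf.data, len);
	dst[len] = '\0';
	return len;
}

/* Each file's read handler is now a one-line wrapper that picks the renderer. */
static size_t deadlock_read(char *dst, size_t size)
{
	return simple_print(dst, size, deadlock_to_text);
}

static size_t write_points_read(char *dst, size_t size)
{
	return simple_print(dst, size, write_points_to_text);
}
```

The design choice is the usual one for this kind of duplication: the buffering and copy-out logic lives in one place, and each new debugfs file only contributes its renderer.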
+241 -33
fs/bcachefs/dirent.c
···

 #include <linux/dcache.h>

+static int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info,
+			 const struct qstr *str, struct qstr *out_cf)
+{
+	*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
+
+#ifdef CONFIG_UNICODE
+	unsigned char *buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
+	int ret = PTR_ERR_OR_ZERO(buf);
+	if (ret)
+		return ret;
+
+	ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);
+	if (ret <= 0)
+		return ret;
+
+	*out_cf = (struct qstr) QSTR_INIT(buf, ret);
+	return 0;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
+static inline int bch2_maybe_casefold(struct btree_trans *trans,
+				      const struct bch_hash_info *info,
+				      const struct qstr *str, struct qstr *out_cf)
+{
+	if (likely(!info->cf_encoding)) {
+		*out_cf = *str;
+		return 0;
+	} else {
+		return bch2_casefold(trans, info, str, out_cf);
+	}
+}
+
 static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d)
 {
 	if (bkey_val_bytes(d.k) < offsetof(struct bch_dirent, d_name))
···
 #endif

 	return bkey_bytes -
-		offsetof(struct bch_dirent, d_name) -
+		(d.v->d_casefold
+		 ? offsetof(struct bch_dirent, d_cf_name_block.d_names)
+		 : offsetof(struct bch_dirent, d_name)) -
 		trailing_nuls;
 }

 struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d)
 {
-	return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d));
+	if (d.v->d_casefold) {
+		unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len);
+		return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[0], name_len);
+	} else {
+		return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d));
+	}
+}
+
+static struct qstr bch2_dirent_get_casefold_name(struct bkey_s_c_dirent d)
+{
+	if (d.v->d_casefold) {
+		unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len);
+		unsigned cf_name_len = le16_to_cpu(d.v->d_cf_name_block.d_cf_name_len);
+		return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[name_len], cf_name_len);
+	} else {
+		return (struct qstr) QSTR_INIT(NULL, 0);
+	}
+}
+
+static inline struct qstr bch2_dirent_get_lookup_name(struct bkey_s_c_dirent d)
+{
+	return d.v->d_casefold
+		? bch2_dirent_get_casefold_name(d)
+		: bch2_dirent_get_name(d);
 }

 static u64 bch2_dirent_hash(const struct bch_hash_info *info,
···
 static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k)
 {
 	struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
-	struct qstr name = bch2_dirent_get_name(d);
+	struct qstr name = bch2_dirent_get_lookup_name(d);

 	return bch2_dirent_hash(info, &name);
 }
···
 static bool dirent_cmp_key(struct bkey_s_c _l, const void *_r)
 {
 	struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l);
-	const struct qstr l_name = bch2_dirent_get_name(l);
+	const struct qstr l_name = bch2_dirent_get_lookup_name(l);
 	const struct qstr *r_name = _r;

 	return !qstr_eq(l_name, *r_name);
···
 {
 	struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l);
 	struct bkey_s_c_dirent r = bkey_s_c_to_dirent(_r);
-	const struct qstr l_name = bch2_dirent_get_name(l);
-	const struct qstr r_name = bch2_dirent_get_name(r);
+	const struct qstr l_name = bch2_dirent_get_lookup_name(l);
+	const struct qstr r_name = bch2_dirent_get_lookup_name(r);

 	return !qstr_eq(l_name, r_name);
 }
···
 			 struct bkey_validate_context from)
 {
 	struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
+	unsigned name_block_len = bch2_dirent_name_bytes(d);
 	struct qstr d_name = bch2_dirent_get_name(d);
+	struct qstr d_cf_name = bch2_dirent_get_casefold_name(d);
 	int ret = 0;

 	bkey_fsck_err_on(!d_name.len,
 			 c, dirent_empty_name,
 			 "empty name");

-	bkey_fsck_err_on(bkey_val_u64s(k.k) > dirent_val_u64s(d_name.len),
+	bkey_fsck_err_on(d_name.len + d_cf_name.len > name_block_len,
 			 c, dirent_val_too_big,
-			 "value too big (%zu > %u)",
-			 bkey_val_u64s(k.k), dirent_val_u64s(d_name.len));
+			 "dirent names exceed bkey size (%d + %d > %d)",
+			 d_name.len, d_cf_name.len, name_block_len);

 	/*
 	 * Check new keys don't exceed the max length
···
 			 le64_to_cpu(d.v->d_inum) == d.k->p.inode,
 			 c, dirent_to_itself,
 			 "dirent points to own directory");
+
+	if (d.v->d_casefold) {
+		bkey_fsck_err_on(from.from == BKEY_VALIDATE_commit &&
+				 d_cf_name.len > BCH_NAME_MAX,
+				 c, dirent_cf_name_too_big,
+				 "dirent w/ cf name too big (%u > %u)",
+				 d_cf_name.len, BCH_NAME_MAX);
+
+		bkey_fsck_err_on(d_cf_name.len != strnlen(d_cf_name.name, d_cf_name.len),
+				 c, dirent_stray_data_after_cf_name,
+				 "dirent has stray data after cf name's NUL");
+	}
 fsck_err:
 	return ret;
 }
···
 	prt_printf(out, " type %s", bch2_d_type_str(d.v->d_type));
 }

-static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
-				subvol_inum dir, u8 type,
-				const struct qstr *name, u64 dst)
+static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
+				subvol_inum dir,
+				u8 type,
+				int name_len, int cf_name_len,
+				u64 dst)
 {
 	struct bkey_i_dirent *dirent;
-	unsigned u64s = BKEY_U64s + dirent_val_u64s(name->len);
-
-	if (name->len > BCH_NAME_MAX)
-		return ERR_PTR(-ENAMETOOLONG);
+	unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len, cf_name_len);

 	BUG_ON(u64s > U8_MAX);

···
 	}

 	dirent->v.d_type = type;
+	dirent->v.d_unused = 0;
+	dirent->v.d_casefold = cf_name_len ? 1 : 0;

-	memcpy(dirent->v.d_name, name->name, name->len);
-	memset(dirent->v.d_name + name->len, 0,
-	       bkey_val_bytes(&dirent->k) -
-	       offsetof(struct bch_dirent, d_name) -
-	       name->len);
+	return dirent;
+}

-	EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len);
+static void dirent_init_regular_name(struct bkey_i_dirent *dirent,
+				     const struct qstr *name)
+{
+	EBUG_ON(dirent->v.d_casefold);
+
+	memcpy(&dirent->v.d_name[0], name->name, name->len);
+	memset(&dirent->v.d_name[name->len], 0,
+	       bkey_val_bytes(&dirent->k) -
+	       offsetof(struct bch_dirent, d_name) -
+	       name->len);
+}
+
+static void dirent_init_casefolded_name(struct bkey_i_dirent *dirent,
+					const struct qstr *name,
+					const struct qstr *cf_name)
+{
+	EBUG_ON(!dirent->v.d_casefold);
+	EBUG_ON(!cf_name->len);
+
+	dirent->v.d_cf_name_block.d_name_len = name->len;
+	dirent->v.d_cf_name_block.d_cf_name_len = cf_name->len;
+	memcpy(&dirent->v.d_cf_name_block.d_names[0], name->name, name->len);
+	memcpy(&dirent->v.d_cf_name_block.d_names[name->len], cf_name->name, cf_name->len);
+	memset(&dirent->v.d_cf_name_block.d_names[name->len + cf_name->len], 0,
+	       bkey_val_bytes(&dirent->k) -
+	       offsetof(struct bch_dirent, d_cf_name_block.d_names) -
+	       name->len + cf_name->len);
+
+	EBUG_ON(bch2_dirent_get_casefold_name(dirent_i_to_s_c(dirent)).len != cf_name->len);
+}
+
+static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
+				subvol_inum dir,
+				u8 type,
+				const struct qstr *name,
+				const struct qstr *cf_name,
+				u64 dst)
+{
+	struct bkey_i_dirent *dirent;
+
+	if (name->len > BCH_NAME_MAX)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	dirent = dirent_alloc_key(trans, dir, type, name->len, cf_name ? cf_name->len : 0, dst);
+	if (IS_ERR(dirent))
+		return dirent;
+
+	if (cf_name)
+		dirent_init_casefolded_name(dirent, name, cf_name);
+	else
+		dirent_init_regular_name(dirent, name);
+
+	EBUG_ON(bch2_dirent_get_name(dirent_i_to_s_c(dirent)).len != name->len);

 	return dirent;
 }
···
 	struct bkey_i_dirent *dirent;
 	int ret;

-	dirent = dirent_create_key(trans, dir_inum, type, name, dst_inum);
+	dirent = dirent_create_key(trans, dir_inum, type, name, NULL, dst_inum);
 	ret = PTR_ERR_OR_ZERO(dirent);
 	if (ret)
 		return ret;
···
 		       const struct bch_hash_info *hash_info,
 		       u8 type, const struct qstr *name, u64 dst_inum,
 		       u64 *dir_offset,
+		       u64 *i_size,
 		       enum btree_iter_update_trigger_flags flags)
 {
 	struct bkey_i_dirent *dirent;
 	int ret;

-	dirent = dirent_create_key(trans, dir, type, name, dst_inum);
+	if (hash_info->cf_encoding) {
+		struct qstr cf_name;
+		ret = bch2_casefold(trans, hash_info, name, &cf_name);
+		if (ret)
+			return ret;
+		dirent = dirent_create_key(trans, dir, type, name, &cf_name, dst_inum);
+	} else {
+		dirent = dirent_create_key(trans, dir, type, name, NULL, dst_inum);
+	}
+
 	ret = PTR_ERR_OR_ZERO(dirent);
 	if (ret)
 		return ret;
+
+	*i_size += bkey_bytes(&dirent->k);

 	ret = bch2_hash_set(trans, bch2_dirent_hash_desc, hash_info,
 			    dir, &dirent->k_i, flags);
···
 }

 int bch2_dirent_rename(struct btree_trans *trans,
-		subvol_inum src_dir, struct bch_hash_info *src_hash,
-		subvol_inum dst_dir, struct bch_hash_info *dst_hash,
+		subvol_inum src_dir, struct bch_hash_info *src_hash, u64 *src_dir_i_size,
+		subvol_inum dst_dir, struct bch_hash_info *dst_hash, u64 *dst_dir_i_size,
 		const struct qstr *src_name, subvol_inum *src_inum, u64 *src_offset,
 		const struct qstr *dst_name, subvol_inum *dst_inum, u64 *dst_offset,
 		enum bch_rename_mode mode)
 {
+	struct qstr src_name_lookup, dst_name_lookup;
 	struct btree_iter src_iter = { NULL };
 	struct btree_iter dst_iter = { NULL };
 	struct bkey_s_c old_src, old_dst = bkey_s_c_null;
···
 	memset(dst_inum, 0, sizeof(*dst_inum));

 	/* Lookup src: */
+	ret = bch2_maybe_casefold(trans, src_hash, src_name, &src_name_lookup);
+	if (ret)
+		goto out;
 	old_src = bch2_hash_lookup(trans, &src_iter, bch2_dirent_hash_desc,
-				   src_hash, src_dir, src_name,
+				   src_hash, src_dir, &src_name_lookup,
 				   BTREE_ITER_intent);
 	ret = bkey_err(old_src);
 	if (ret)
···
 		goto out;

 	/* Lookup dst: */
+	ret = bch2_maybe_casefold(trans, dst_hash, dst_name, &dst_name_lookup);
+	if (ret)
+		goto out;
 	if (mode == BCH_RENAME) {
 		/*
 		 * Note that we're _not_ checking if the target already exists -
···
 		 * correctness:
 		 */
 		ret = bch2_hash_hole(trans, &dst_iter, bch2_dirent_hash_desc,
-				     dst_hash, dst_dir, dst_name);
+				     dst_hash, dst_dir, &dst_name_lookup);
 		if (ret)
 			goto out;
 	} else {
 		old_dst = bch2_hash_lookup(trans, &dst_iter, bch2_dirent_hash_desc,
-					    dst_hash, dst_dir, dst_name,
+					    dst_hash, dst_dir, &dst_name_lookup,
 					    BTREE_ITER_intent);
 		ret = bkey_err(old_dst);
 		if (ret)
···
 		*src_offset = dst_iter.pos.offset;

 	/* Create new dst key: */
-	new_dst = dirent_create_key(trans, dst_dir, 0, dst_name, 0);
+	new_dst = dirent_create_key(trans, dst_dir, 0, dst_name,
+				    dst_hash->cf_encoding ? &dst_name_lookup : NULL, 0);
 	ret = PTR_ERR_OR_ZERO(new_dst);
 	if (ret)
 		goto out;
···

 	/* Create new src key: */
 	if (mode == BCH_RENAME_EXCHANGE) {
-		new_src = dirent_create_key(trans, src_dir, 0, src_name, 0);
+		new_src = dirent_create_key(trans, src_dir, 0, src_name,
+					    src_hash->cf_encoding ? &src_name_lookup : NULL, 0);
 		ret = PTR_ERR_OR_ZERO(new_src);
 		if (ret)
 			goto out;
···
 	if ((mode == BCH_RENAME_EXCHANGE) &&
 	    new_src->v.d_type == DT_SUBVOL)
 		new_src->v.d_parent_subvol = cpu_to_le32(src_dir.subvol);
+
+	if (old_dst.k)
+		*dst_dir_i_size -= bkey_bytes(old_dst.k);
+	*src_dir_i_size -= bkey_bytes(old_src.k);
+
+	if (mode == BCH_RENAME_EXCHANGE)
+		*src_dir_i_size += bkey_bytes(&new_src->k);
+	*dst_dir_i_size += bkey_bytes(&new_dst->k);

 	ret = bch2_trans_update(trans, &dst_iter, &new_dst->k_i, 0);
 	if (ret)
···
 			       const struct qstr *name, subvol_inum *inum,
 			       unsigned flags)
 {
+	struct qstr lookup_name;
+	int ret = bch2_maybe_casefold(trans, hash_info, name, &lookup_name);
+	if (ret)
+		return ret;
+
 	struct bkey_s_c k = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc,
-					     hash_info, dir, name, flags);
-	int ret = bkey_err(k);
+					     hash_info, dir, &lookup_name, flags);
+	ret = bkey_err(k);
 	if (ret)
 		goto err;

···
 	bch2_bkey_buf_exit(&sk, c);

 	return ret < 0 ? ret : 0;
+}
+
+/* fsck */
+
+static int lookup_first_inode(struct btree_trans *trans, u64 inode_nr,
+			      struct bch_inode_unpacked *inode)
+{
+	struct btree_iter iter;
+	struct bkey_s_c k;
+	int ret;
+
+	for_each_btree_key_norestart(trans, iter, BTREE_ID_inodes, POS(0, inode_nr),
+				     BTREE_ITER_all_snapshots, k, ret) {
+		if (k.k->p.offset != inode_nr)
+			break;
+		if (!bkey_is_inode(k.k))
+			continue;
+		ret = bch2_inode_unpack(k, inode);
+		goto found;
+	}
+	ret = -BCH_ERR_ENOENT_inode;
+found:
+	bch_err_msg(trans->c, ret, "fetching inode %llu", inode_nr);
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
+int bch2_fsck_remove_dirent(struct btree_trans *trans, struct bpos pos)
+{
+	struct bch_fs *c = trans->c;
+	struct btree_iter iter;
+	struct bch_inode_unpacked dir_inode;
+	struct bch_hash_info dir_hash_info;
+	int ret;
+
+	ret = lookup_first_inode(trans, pos.inode, &dir_inode);
+	if (ret)
+		goto err;
+
+	dir_hash_info = bch2_hash_info_init(c, &dir_inode);
+
+	bch2_trans_iter_init(trans, &iter, BTREE_ID_dirents, pos, BTREE_ITER_intent);
+
+	ret = bch2_btree_iter_traverse(&iter) ?:
+		bch2_hash_delete_at(trans, bch2_dirent_hash_desc,
+				    &dir_hash_info, &iter,
+				    BTREE_UPDATE_internal_snapshot_node);
+	bch2_trans_iter_exit(trans, &iter);
+err:
+	bch_err_fn(c, ret);
+	return ret;
 }
+11 -6
fs/bcachefs/dirent.h
···

 struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d);

-static inline unsigned dirent_val_u64s(unsigned len)
+static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len)
 {
-	return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
-			sizeof(u64));
+	unsigned bytes = cf_len
+		? offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len
+		: offsetof(struct bch_dirent, d_name) + len;
+
+	return DIV_ROUND_UP(bytes, sizeof(u64));
 }

 int bch2_dirent_read_target(struct btree_trans *, subvol_inum,
···
 		enum btree_iter_update_trigger_flags);
 int bch2_dirent_create(struct btree_trans *, subvol_inum,
 		const struct bch_hash_info *, u8,
-		const struct qstr *, u64, u64 *,
+		const struct qstr *, u64, u64 *, u64 *,
 		enum btree_iter_update_trigger_flags);

 static inline unsigned vfs_d_type(unsigned type)
···
 };

 int bch2_dirent_rename(struct btree_trans *,
-		subvol_inum, struct bch_hash_info *,
-		subvol_inum, struct bch_hash_info *,
+		subvol_inum, struct bch_hash_info *, u64 *,
+		subvol_inum, struct bch_hash_info *, u64 *,
 		const struct qstr *, subvol_inum *, u64 *,
 		const struct qstr *, subvol_inum *, u64 *,
 		enum bch_rename_mode);
···
 int bch2_empty_dir_snapshot(struct btree_trans *, u64, u32, u32);
 int bch2_empty_dir_trans(struct btree_trans *, subvol_inum);
 int bch2_readdir(struct bch_fs *, subvol_inum, struct dir_context *);
+
+int bch2_fsck_remove_dirent(struct btree_trans *, struct bpos);

 #endif /* _BCACHEFS_DIRENT_H */
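The size calculation in the new two-argument `dirent_val_u64s()` can be tried out in isolation. The following is a minimal standalone sketch, not the kernel code: `sketch_dirent_val_u64s`, `NAME_OFFSET` and `CF_NAMES_OFFSET` are hypothetical names, and the two offset values are assumptions picked for illustration (the kernel derives them with `offsetof()` on `struct bch_dirent`, whose full definition is not part of this hunk).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed offsets, for illustration only; the kernel uses offsetof(). */
#define NAME_OFFSET      9u	/* hypothetical offsetof(struct bch_dirent, d_name) */
#define CF_NAMES_OFFSET 14u	/* hypothetical offsetof(..., d_cf_name_block.d_names) */

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Size of a dirent value in 64-bit words.
 * cf_len == 0: only the regular name is stored.
 * cf_len != 0: the name and its casefolded form are stored back to back
 *              after the (assumed) casefold header.
 */
static unsigned sketch_dirent_val_u64s(unsigned len, unsigned cf_len)
{
	unsigned bytes = cf_len
		? CF_NAMES_OFFSET + len + cf_len
		: NAME_OFFSET + len;

	return (unsigned) DIV_ROUND_UP(bytes, sizeof(uint64_t));
}
```

Under these assumed offsets, a 5-byte name needs two words without casefolding and three words once a 5-byte casefolded copy is stored alongside it.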
+18 -2
fs/bcachefs/dirent_format.h
···
 	 * Copy of mode bits 12-15 from the target inode - so userspace can get
 	 * the filetype without having to do a stat()
 	 */
-	__u8			d_type;
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	__u8			d_type:5,
+				d_unused:2,
+				d_casefold:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	__u8			d_casefold:1,
+				d_unused:2,
+				d_type:5;
+#endif

-	__u8			d_name[];
+	union {
+	struct {
+		__u8		d_pad;
+		__le16		d_name_len;
+		__le16		d_cf_name_len;
+		__u8		d_names[];
+	} d_cf_name_block __packed;
+	__DECLARE_FLEX_ARRAY(__u8, d_name);
+	} __packed;
 } __packed __aligned(8);

 #define DT_SUBVOL	16
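The `#if`/`#elif` pair above declares the same three fields in opposite order. A plausible reading is that both orderings are meant to produce the same on-disk byte, with `d_type` in the low five bits and `d_casefold` in the top bit. The sketch below expresses that assumed layout with explicit masks; the `sketch_*` names and `SKETCH_*` macros are hypothetical, and since bit-field ordering is implementation-defined in C, treat this as an illustration of the intent rather than a statement about any particular compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk layout of the flags byte: bits 0-4 type, bit 7 casefold. */
#define SKETCH_D_TYPE_MASK	0x1fu
#define SKETCH_D_CASEFOLD	0x80u

/* Pack a file type and the casefold flag into one byte. */
static uint8_t sketch_pack(unsigned d_type, int casefold)
{
	return (uint8_t) ((d_type & SKETCH_D_TYPE_MASK) |
			  (casefold ? SKETCH_D_CASEFOLD : 0));
}

/* Extract the file type (low five bits). */
static unsigned sketch_d_type(uint8_t byte)
{
	return byte & SKETCH_D_TYPE_MASK;
}

/* Extract the casefold flag (top bit). */
static int sketch_d_casefold(uint8_t byte)
{
	return (byte & SKETCH_D_CASEFOLD) != 0;
}
```

Five bits are enough for `DT_SUBVOL` (16), so under this assumption a byte written by the old `__u8 d_type` field with a value below 32 reads back with the same type and `d_casefold` clear.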
+18
fs/bcachefs/disk_accounting.h
···

 int bch2_disk_accounting_mod(struct btree_trans *, struct disk_accounting_pos *,
 			s64 *, unsigned, bool);
+
+#define disk_accounting_key_init(_k, _type, ...)			\
+do {									\
+	memset(&(_k), 0, sizeof(_k));					\
+	(_k).type = BCH_DISK_ACCOUNTING_##_type;			\
+	(_k)._type = (struct bch_acct_##_type) { __VA_ARGS__ };		\
+} while (0)
+
+#define bch2_disk_accounting_mod2_nr(_trans, _gc, _v, _nr, ...)	\
+({									\
+	struct disk_accounting_pos pos;					\
+	disk_accounting_key_init(pos, __VA_ARGS__);			\
+	bch2_disk_accounting_mod(trans, &pos, _v, _nr, _gc);		\
+})
+
+#define bch2_disk_accounting_mod2(_trans, _gc, _v, ...)		\
+	bch2_disk_accounting_mod2_nr(_trans, _gc, _v, ARRAY_SIZE(_v), __VA_ARGS__)
+
 int bch2_mod_dev_cached_sectors(struct btree_trans *, unsigned, s64, bool);

 int bch2_accounting_validate(struct bch_fs *, struct bkey_s_c,
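`disk_accounting_key_init()` relies on token pasting: one `_type` argument selects the enum value, the union member, and the `struct bch_acct_*` type, which is presumably why the structs in `disk_accounting_format.h` are renamed to a uniform `bch_acct_` prefix in this commit. The sketch below reproduces the macro against mock types to show the expansion. `struct mock_accounting_pos`, the enum values and `sketch_dev_key` are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mock stand-ins for the kernel types; values are illustrative only. */
enum {
	BCH_DISK_ACCOUNTING_persistent_reserved	= 1,
	BCH_DISK_ACCOUNTING_dev_data_type	= 2,
};

struct bch_acct_persistent_reserved { uint8_t nr_replicas; };
struct bch_acct_dev_data_type { uint8_t dev; uint8_t data_type; };

struct mock_accounting_pos {
	uint8_t type;
	union {	/* anonymous union, so members are reachable as k.<member> */
		struct bch_acct_persistent_reserved	persistent_reserved;
		struct bch_acct_dev_data_type		dev_data_type;
	};
};

/* Same shape as the macro in the diff: _type is pasted three ways. */
#define disk_accounting_key_init(_k, _type, ...)			\
do {									\
	memset(&(_k), 0, sizeof(_k));					\
	(_k).type = BCH_DISK_ACCOUNTING_##_type;			\
	(_k)._type = (struct bch_acct_##_type) { __VA_ARGS__ };		\
} while (0)

/* Build a per-device, per-data-type accounting key. */
static struct mock_accounting_pos sketch_dev_key(uint8_t dev, uint8_t data_type)
{
	struct mock_accounting_pos k;

	disk_accounting_key_init(k, dev_data_type,
				 .dev = dev, .data_type = data_type);
	return k;
}
```

The trailing arguments are designated initializers for the selected struct, so any field a caller omits stays zero.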
+6 -6
fs/bcachefs/disk_accounting_format.h
···
 	BCH_DISK_ACCOUNTING_TYPE_NR,
 };

-struct bch_nr_inodes {
+struct bch_acct_nr_inodes {
 };

-struct bch_persistent_reserved {
+struct bch_acct_persistent_reserved {
 	__u8			nr_replicas;
 };

-struct bch_dev_data_type {
+struct bch_acct_dev_data_type {
 	__u8			dev;
 	__u8			data_type;
 };
···
 	struct {
 		__u8				type;
 		union {
-		struct bch_nr_inodes		nr_inodes;
-		struct bch_persistent_reserved	persistent_reserved;
+		struct bch_acct_nr_inodes	nr_inodes;
+		struct bch_acct_persistent_reserved	persistent_reserved;
 		struct bch_replicas_entry_v1	replicas;
-		struct bch_dev_data_type	dev_data_type;
+		struct bch_acct_dev_data_type	dev_data_type;
 		struct bch_acct_compression	compression;
 		struct bch_acct_snapshot	snapshot;
 		struct bch_acct_btree		btree;
+162 -320
fs/bcachefs/ec.c
··· 20 20 #include "io_read.h" 21 21 #include "io_write.h" 22 22 #include "keylist.h" 23 + #include "lru.h" 23 24 #include "recovery.h" 24 25 #include "replicas.h" 25 26 #include "super-io.h" ··· 105 104 struct bch_dev *ca; 106 105 struct ec_stripe_buf *buf; 107 106 size_t idx; 107 + u64 submit_time; 108 108 struct bio bio; 109 109 }; 110 110 ··· 300 298 struct bpos bucket = PTR_BUCKET_POS(ca, ptr); 301 299 302 300 if (flags & BTREE_TRIGGER_transactional) { 301 + struct extent_ptr_decoded p = { 302 + .ptr = *ptr, 303 + .crc = bch2_extent_crc_unpack(s.k, NULL), 304 + }; 305 + struct bkey_i_backpointer bp; 306 + bch2_extent_ptr_to_bp(c, BTREE_ID_stripes, 0, s.s_c, p, 307 + (const union bch_extent_entry *) ptr, &bp); 308 + 303 309 struct bkey_i_alloc_v4 *a = 304 310 bch2_trans_start_alloc_update(trans, bucket, 0); 305 - ret = PTR_ERR_OR_ZERO(a) ?: 306 - __mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &a->v, flags); 311 + ret = PTR_ERR_OR_ZERO(a) ?: 312 + __mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &a->v, flags) ?: 313 + bch2_bucket_backpointer_mod(trans, s.s_c, &bp, 314 + !(flags & BTREE_TRIGGER_overwrite)); 315 + if (ret) 316 + goto err; 307 317 } 308 318 309 319 if (flags & BTREE_TRIGGER_gc) { ··· 380 366 return 0; 381 367 } 382 368 383 - static inline void stripe_to_mem(struct stripe *m, const struct bch_stripe *s) 384 - { 385 - m->sectors = le16_to_cpu(s->sectors); 386 - m->algorithm = s->algorithm; 387 - m->nr_blocks = s->nr_blocks; 388 - m->nr_redundant = s->nr_redundant; 389 - m->disk_label = s->disk_label; 390 - m->blocks_nonempty = 0; 391 - 392 - for (unsigned i = 0; i < s->nr_blocks; i++) 393 - m->blocks_nonempty += !!stripe_blockcount_get(s, i); 394 - } 395 - 396 369 int bch2_trigger_stripe(struct btree_trans *trans, 397 370 enum btree_id btree, unsigned level, 398 371 struct bkey_s_c old, struct bkey_s _new, ··· 400 399 (new_s->nr_blocks != old_s->nr_blocks || 401 400 new_s->nr_redundant != old_s->nr_redundant)); 402 401 402 
+ if (flags & BTREE_TRIGGER_transactional) { 403 + int ret = bch2_lru_change(trans, 404 + BCH_LRU_STRIPE_FRAGMENTATION, 405 + idx, 406 + stripe_lru_pos(old_s), 407 + stripe_lru_pos(new_s)); 408 + if (ret) 409 + return ret; 410 + } 403 411 404 412 if (flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) { 405 413 /* ··· 480 470 int ret = mark_stripe_buckets(trans, old, new, flags); 481 471 if (ret) 482 472 return ret; 483 - } 484 - 485 - if (flags & BTREE_TRIGGER_atomic) { 486 - struct stripe *m = genradix_ptr(&c->stripes, idx); 487 - 488 - if (!m) { 489 - struct printbuf buf1 = PRINTBUF; 490 - struct printbuf buf2 = PRINTBUF; 491 - 492 - bch2_bkey_val_to_text(&buf1, c, old); 493 - bch2_bkey_val_to_text(&buf2, c, new); 494 - bch_err_ratelimited(c, "error marking nonexistent stripe %llu while marking\n" 495 - "old %s\n" 496 - "new %s", idx, buf1.buf, buf2.buf); 497 - printbuf_exit(&buf2); 498 - printbuf_exit(&buf1); 499 - bch2_inconsistent_error(c); 500 - return -1; 501 - } 502 - 503 - if (!new_s) { 504 - bch2_stripes_heap_del(c, m, idx); 505 - 506 - memset(m, 0, sizeof(*m)); 507 - } else { 508 - stripe_to_mem(m, new_s); 509 - 510 - if (!old_s) 511 - bch2_stripes_heap_insert(c, m, idx); 512 - else 513 - bch2_stripes_heap_update(c, m, idx); 514 - } 515 473 } 516 474 517 475 return 0; ··· 704 726 struct bch_dev *ca = ec_bio->ca; 705 727 struct closure *cl = bio->bi_private; 706 728 707 - if (bch2_dev_io_err_on(bio->bi_status, ca, 708 - bio_data_dir(bio) 709 - ? 
BCH_MEMBER_ERROR_write 710 - : BCH_MEMBER_ERROR_read, 711 - "erasure coding %s error: %s", 729 + bch2_account_io_completion(ca, bio_data_dir(bio), 730 + ec_bio->submit_time, !bio->bi_status); 731 + 732 + if (bio->bi_status) { 733 + bch_err_dev_ratelimited(ca, "erasure coding %s error: %s", 712 734 str_write_read(bio_data_dir(bio)), 713 - bch2_blk_status_to_str(bio->bi_status))) 735 + bch2_blk_status_to_str(bio->bi_status)); 714 736 clear_bit(ec_bio->idx, ec_bio->buf->valid); 737 + } 715 738 716 739 int stale = dev_ptr_stale(ca, ptr); 717 740 if (stale) { ··· 775 796 ec_bio->ca = ca; 776 797 ec_bio->buf = buf; 777 798 ec_bio->idx = idx; 799 + ec_bio->submit_time = local_clock(); 778 800 779 801 ec_bio->bio.bi_iter.bi_sector = ptr->offset + buf->offset + (offset >> 9); 780 802 ec_bio->bio.bi_end_io = ec_block_endio; ··· 897 917 898 918 static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx, gfp_t gfp) 899 919 { 900 - ec_stripes_heap n, *h = &c->ec_stripes_heap; 901 - 902 - if (idx >= h->size) { 903 - if (!init_heap(&n, max(1024UL, roundup_pow_of_two(idx + 1)), gfp)) 904 - return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc; 905 - 906 - mutex_lock(&c->ec_stripes_heap_lock); 907 - if (n.size > h->size) { 908 - memcpy(n.data, h->data, h->nr * sizeof(h->data[0])); 909 - n.nr = h->nr; 910 - swap(*h, n); 911 - } 912 - mutex_unlock(&c->ec_stripes_heap_lock); 913 - 914 - free_heap(&n); 915 - } 916 - 917 - if (!genradix_ptr_alloc(&c->stripes, idx, gfp)) 918 - return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc; 919 - 920 920 if (c->gc_pos.phase != GC_PHASE_not_running && 921 921 !genradix_ptr_alloc(&c->gc_stripes, idx, gfp)) 922 922 return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc; ··· 969 1009 s->idx = 0; 970 1010 } 971 1011 972 - /* Heap of all existing stripes, ordered by blocks_nonempty */ 973 - 974 - static u64 stripe_idx_to_delete(struct bch_fs *c) 975 - { 976 - ec_stripes_heap *h = &c->ec_stripes_heap; 977 - 978 - lockdep_assert_held(&c->ec_stripes_heap_lock); 979 - 980 - if (h->nr && 
981 - h->data[0].blocks_nonempty == 0 && 982 - !bch2_stripe_is_open(c, h->data[0].idx)) 983 - return h->data[0].idx; 984 - 985 - return 0; 986 - } 987 - 988 - static inline void ec_stripes_heap_set_backpointer(ec_stripes_heap *h, 989 - size_t i) 990 - { 991 - struct bch_fs *c = container_of(h, struct bch_fs, ec_stripes_heap); 992 - 993 - genradix_ptr(&c->stripes, h->data[i].idx)->heap_idx = i; 994 - } 995 - 996 - static inline bool ec_stripes_heap_cmp(const void *l, const void *r, void __always_unused *args) 997 - { 998 - struct ec_stripe_heap_entry *_l = (struct ec_stripe_heap_entry *)l; 999 - struct ec_stripe_heap_entry *_r = (struct ec_stripe_heap_entry *)r; 1000 - 1001 - return ((_l->blocks_nonempty > _r->blocks_nonempty) < 1002 - (_l->blocks_nonempty < _r->blocks_nonempty)); 1003 - } 1004 - 1005 - static inline void ec_stripes_heap_swap(void *l, void *r, void *h) 1006 - { 1007 - struct ec_stripe_heap_entry *_l = (struct ec_stripe_heap_entry *)l; 1008 - struct ec_stripe_heap_entry *_r = (struct ec_stripe_heap_entry *)r; 1009 - ec_stripes_heap *_h = (ec_stripes_heap *)h; 1010 - size_t i = _l - _h->data; 1011 - size_t j = _r - _h->data; 1012 - 1013 - swap(*_l, *_r); 1014 - 1015 - ec_stripes_heap_set_backpointer(_h, i); 1016 - ec_stripes_heap_set_backpointer(_h, j); 1017 - } 1018 - 1019 - static const struct min_heap_callbacks callbacks = { 1020 - .less = ec_stripes_heap_cmp, 1021 - .swp = ec_stripes_heap_swap, 1022 - }; 1023 - 1024 - static void heap_verify_backpointer(struct bch_fs *c, size_t idx) 1025 - { 1026 - ec_stripes_heap *h = &c->ec_stripes_heap; 1027 - struct stripe *m = genradix_ptr(&c->stripes, idx); 1028 - 1029 - BUG_ON(m->heap_idx >= h->nr); 1030 - BUG_ON(h->data[m->heap_idx].idx != idx); 1031 - } 1032 - 1033 - void bch2_stripes_heap_del(struct bch_fs *c, 1034 - struct stripe *m, size_t idx) 1035 - { 1036 - mutex_lock(&c->ec_stripes_heap_lock); 1037 - heap_verify_backpointer(c, idx); 1038 - 1039 - min_heap_del(&c->ec_stripes_heap, m->heap_idx, 
&callbacks, &c->ec_stripes_heap); 1040 - mutex_unlock(&c->ec_stripes_heap_lock); 1041 - } 1042 - 1043 - void bch2_stripes_heap_insert(struct bch_fs *c, 1044 - struct stripe *m, size_t idx) 1045 - { 1046 - mutex_lock(&c->ec_stripes_heap_lock); 1047 - BUG_ON(min_heap_full(&c->ec_stripes_heap)); 1048 - 1049 - genradix_ptr(&c->stripes, idx)->heap_idx = c->ec_stripes_heap.nr; 1050 - min_heap_push(&c->ec_stripes_heap, &((struct ec_stripe_heap_entry) { 1051 - .idx = idx, 1052 - .blocks_nonempty = m->blocks_nonempty, 1053 - }), 1054 - &callbacks, 1055 - &c->ec_stripes_heap); 1056 - 1057 - heap_verify_backpointer(c, idx); 1058 - mutex_unlock(&c->ec_stripes_heap_lock); 1059 - } 1060 - 1061 - void bch2_stripes_heap_update(struct bch_fs *c, 1062 - struct stripe *m, size_t idx) 1063 - { 1064 - ec_stripes_heap *h = &c->ec_stripes_heap; 1065 - bool do_deletes; 1066 - size_t i; 1067 - 1068 - mutex_lock(&c->ec_stripes_heap_lock); 1069 - heap_verify_backpointer(c, idx); 1070 - 1071 - h->data[m->heap_idx].blocks_nonempty = m->blocks_nonempty; 1072 - 1073 - i = m->heap_idx; 1074 - min_heap_sift_up(h, i, &callbacks, &c->ec_stripes_heap); 1075 - min_heap_sift_down(h, i, &callbacks, &c->ec_stripes_heap); 1076 - 1077 - heap_verify_backpointer(c, idx); 1078 - 1079 - do_deletes = stripe_idx_to_delete(c) != 0; 1080 - mutex_unlock(&c->ec_stripes_heap_lock); 1081 - 1082 - if (do_deletes) 1083 - bch2_do_stripe_deletes(c); 1084 - } 1085 - 1086 1012 /* stripe deletion */ 1087 1013 1088 1014 static int ec_stripe_delete(struct btree_trans *trans, u64 idx) 1089 1015 { 1090 - struct bch_fs *c = trans->c; 1091 1016 struct btree_iter iter; 1092 - struct bkey_s_c k; 1093 - struct bkey_s_c_stripe s; 1094 - int ret; 1095 - 1096 - k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_stripes, POS(0, idx), 1097 - BTREE_ITER_intent); 1098 - ret = bkey_err(k); 1017 + struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, 1018 + BTREE_ID_stripes, POS(0, idx), 1019 + BTREE_ITER_intent); 1020 + int ret = bkey_err(k); 1099 
1021 if (ret) 1100 1022 goto err; 1101 1023 1102 - if (k.k->type != KEY_TYPE_stripe) { 1103 - bch2_fs_inconsistent(c, "attempting to delete nonexistent stripe %llu", idx); 1104 - ret = -EINVAL; 1105 - goto err; 1106 - } 1107 - 1108 - s = bkey_s_c_to_stripe(k); 1109 - for (unsigned i = 0; i < s.v->nr_blocks; i++) 1110 - if (stripe_blockcount_get(s.v, i)) { 1111 - struct printbuf buf = PRINTBUF; 1112 - 1113 - bch2_bkey_val_to_text(&buf, c, k); 1114 - bch2_fs_inconsistent(c, "attempting to delete nonempty stripe %s", buf.buf); 1115 - printbuf_exit(&buf); 1116 - ret = -EINVAL; 1117 - goto err; 1118 - } 1119 - 1120 - ret = bch2_btree_delete_at(trans, &iter, 0); 1024 + /* 1025 + * We expect write buffer races here 1026 + * Important: check stripe_is_open with stripe key locked: 1027 + */ 1028 + if (k.k->type == KEY_TYPE_stripe && 1029 + !bch2_stripe_is_open(trans->c, idx) && 1030 + stripe_lru_pos(bkey_s_c_to_stripe(k).v) == 1) 1031 + ret = bch2_btree_delete_at(trans, &iter, 0); 1121 1032 err: 1122 1033 bch2_trans_iter_exit(trans, &iter); 1123 1034 return ret; 1124 1035 } 1125 1036 1037 + /* 1038 + * XXX 1039 + * can we kill this and delete stripes from the trigger? 
1040 + */ 1126 1041 static void ec_stripe_delete_work(struct work_struct *work) 1127 1042 { 1128 1043 struct bch_fs *c = 1129 1044 container_of(work, struct bch_fs, ec_stripe_delete_work); 1130 1045 1131 - while (1) { 1132 - mutex_lock(&c->ec_stripes_heap_lock); 1133 - u64 idx = stripe_idx_to_delete(c); 1134 - mutex_unlock(&c->ec_stripes_heap_lock); 1135 - 1136 - if (!idx) 1137 - break; 1138 - 1139 - int ret = bch2_trans_commit_do(c, NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 1140 - ec_stripe_delete(trans, idx)); 1141 - bch_err_fn(c, ret); 1142 - if (ret) 1143 - break; 1144 - } 1145 - 1046 + bch2_trans_run(c, 1047 + bch2_btree_write_buffer_tryflush(trans) ?: 1048 + for_each_btree_key_max_commit(trans, lru_iter, BTREE_ID_lru, 1049 + lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 1, 0), 1050 + lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 1, LRU_TIME_MAX), 1051 + 0, lru_k, 1052 + NULL, NULL, 1053 + BCH_TRANS_COMMIT_no_enospc, ({ 1054 + ec_stripe_delete(trans, lru_k.k->p.offset); 1055 + }))); 1146 1056 bch2_write_ref_put(c, BCH_WRITE_REF_stripe_delete); 1147 1057 } 1148 1058 ··· 1124 1294 1125 1295 bch2_fs_inconsistent(c, "%s", buf.buf); 1126 1296 printbuf_exit(&buf); 1127 - return -EIO; 1297 + return -BCH_ERR_erasure_coding_found_btree_node; 1128 1298 } 1129 1299 1130 1300 k = bch2_backpointer_get_key(trans, bp, &iter, BTREE_ITER_intent, last_flushed); ··· 1190 1360 1191 1361 struct bch_dev *ca = bch2_dev_tryget(c, ptr.dev); 1192 1362 if (!ca) 1193 - return -EIO; 1363 + return -BCH_ERR_ENOENT_dev_not_found; 1194 1364 1195 1365 struct bpos bucket_pos = PTR_BUCKET_POS(ca, &ptr); 1196 1366 ··· 1210 1380 if (bp_k.k->type != KEY_TYPE_backpointer) 1211 1381 continue; 1212 1382 1383 + struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(bp_k); 1384 + if (bp.v->btree_id == BTREE_ID_stripes) 1385 + continue; 1386 + 1213 1387 ec_stripe_update_extent(trans, ca, bucket_pos, ptr.gen, s, 1214 - bkey_s_c_to_backpointer(bp_k), &last_flushed); 1388 + bp, &last_flushed); 1215 1389 })); 1216 1390 1217 
1391 bch2_bkey_buf_exit(&last_flushed, c); ··· 1227 1393 { 1228 1394 struct btree_trans *trans = bch2_trans_get(c); 1229 1395 struct bch_stripe *v = &bkey_i_to_stripe(&s->key)->v; 1230 - unsigned i, nr_data = v->nr_blocks - v->nr_redundant; 1231 - int ret = 0; 1396 + unsigned nr_data = v->nr_blocks - v->nr_redundant; 1232 1397 1233 - ret = bch2_btree_write_buffer_flush_sync(trans); 1398 + int ret = bch2_btree_write_buffer_flush_sync(trans); 1234 1399 if (ret) 1235 1400 goto err; 1236 1401 1237 - for (i = 0; i < nr_data; i++) { 1402 + for (unsigned i = 0; i < nr_data; i++) { 1238 1403 ret = ec_stripe_update_bucket(trans, s, i); 1239 1404 if (ret) 1240 1405 break; 1241 1406 } 1242 1407 err: 1243 1408 bch2_trans_put(trans); 1244 - 1245 1409 return ret; 1246 1410 } 1247 1411 ··· 1305 1473 if (s->err) { 1306 1474 if (!bch2_err_matches(s->err, EROFS)) 1307 1475 bch_err(c, "error creating stripe: error writing data buckets"); 1476 + ret = s->err; 1308 1477 goto err; 1309 1478 } 1310 1479 ··· 1314 1481 1315 1482 if (ec_do_recov(c, &s->existing_stripe)) { 1316 1483 bch_err(c, "error creating stripe: error reading existing stripe"); 1484 + ret = -BCH_ERR_ec_block_read; 1317 1485 goto err; 1318 1486 } 1319 1487 ··· 1340 1506 1341 1507 if (ec_nr_failed(&s->new_stripe)) { 1342 1508 bch_err(c, "error creating stripe: error writing redundancy buckets"); 1509 + ret = -BCH_ERR_ec_block_write; 1343 1510 goto err; 1344 1511 } 1345 1512 ··· 1362 1527 if (ret) 1363 1528 goto err; 1364 1529 err: 1530 + trace_stripe_create(c, s->idx, ret); 1531 + 1365 1532 bch2_disk_reservation_put(c, &s->res); 1366 1533 1367 1534 for (i = 0; i < v->nr_blocks; i++) ··· 1449 1612 ec_stripe_new_set_pending(c, h); 1450 1613 } 1451 1614 1452 - void bch2_ec_bucket_cancel(struct bch_fs *c, struct open_bucket *ob) 1615 + void bch2_ec_bucket_cancel(struct bch_fs *c, struct open_bucket *ob, int err) 1453 1616 { 1454 1617 struct ec_stripe_new *s = ob->ec; 1455 1618 1456 - s->err = -EIO; 1619 + s->err = err; 1457 
1620 } 1458 1621 1459 1622 void *bch2_writepoint_ec_buf(struct bch_fs *c, struct write_point *wp) ··· 1805 1968 return 0; 1806 1969 } 1807 1970 1808 - static s64 get_existing_stripe(struct bch_fs *c, 1809 - struct ec_stripe_head *head) 1971 + static int __get_existing_stripe(struct btree_trans *trans, 1972 + struct ec_stripe_head *head, 1973 + struct ec_stripe_buf *stripe, 1974 + u64 idx) 1810 1975 { 1811 - ec_stripes_heap *h = &c->ec_stripes_heap; 1812 - struct stripe *m; 1813 - size_t heap_idx; 1814 - u64 stripe_idx; 1815 - s64 ret = -1; 1976 + struct bch_fs *c = trans->c; 1816 1977 1817 - if (may_create_new_stripe(c)) 1818 - return -1; 1978 + struct btree_iter iter; 1979 + struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, 1980 + BTREE_ID_stripes, POS(0, idx), 0); 1981 + int ret = bkey_err(k); 1982 + if (ret) 1983 + goto err; 1819 1984 1820 - mutex_lock(&c->ec_stripes_heap_lock); 1821 - for (heap_idx = 0; heap_idx < h->nr; heap_idx++) { 1822 - /* No blocks worth reusing, stripe will just be deleted: */ 1823 - if (!h->data[heap_idx].blocks_nonempty) 1824 - continue; 1985 + /* We expect write buffer races here */ 1986 + if (k.k->type != KEY_TYPE_stripe) 1987 + goto out; 1825 1988 1826 - stripe_idx = h->data[heap_idx].idx; 1989 + struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k); 1990 + if (stripe_lru_pos(s.v) <= 1) 1991 + goto out; 1827 1992 1828 - m = genradix_ptr(&c->stripes, stripe_idx); 1829 - 1830 - if (m->disk_label == head->disk_label && 1831 - m->algorithm == head->algo && 1832 - m->nr_redundant == head->redundancy && 1833 - m->sectors == head->blocksize && 1834 - m->blocks_nonempty < m->nr_blocks - m->nr_redundant && 1835 - bch2_try_open_stripe(c, head->s, stripe_idx)) { 1836 - ret = stripe_idx; 1837 - break; 1838 - } 1993 + if (s.v->disk_label == head->disk_label && 1994 + s.v->algorithm == head->algo && 1995 + s.v->nr_redundant == head->redundancy && 1996 + le16_to_cpu(s.v->sectors) == head->blocksize && 1997 + bch2_try_open_stripe(c, head->s, idx)) { 
1998 + bkey_reassemble(&stripe->key, k); 1999 + ret = 1; 1839 2000 } 1840 - mutex_unlock(&c->ec_stripes_heap_lock); 2001 + out: 2002 + bch2_set_btree_iter_dontneed(&iter); 2003 + err: 2004 + bch2_trans_iter_exit(trans, &iter); 1841 2005 return ret; 1842 2006 } 1843 2007 ··· 1890 2052 struct ec_stripe_new *s) 1891 2053 { 1892 2054 struct bch_fs *c = trans->c; 1893 - s64 idx; 1894 - int ret; 1895 2055 1896 2056 /* 1897 2057 * If we can't allocate a new stripe, and there's no stripes with empty 1898 2058 * blocks for us to reuse, that means we have to wait on copygc: 1899 2059 */ 1900 - idx = get_existing_stripe(c, h); 1901 - if (idx < 0) 1902 - return -BCH_ERR_stripe_alloc_blocked; 2060 + if (may_create_new_stripe(c)) 2061 + return -1; 1903 2062 1904 - ret = get_stripe_key_trans(trans, idx, &s->existing_stripe); 1905 - bch2_fs_fatal_err_on(ret && !bch2_err_matches(ret, BCH_ERR_transaction_restart), c, 1906 - "reading stripe key: %s", bch2_err_str(ret)); 1907 - if (ret) { 1908 - bch2_stripe_close(c, s); 1909 - return ret; 2063 + struct btree_iter lru_iter; 2064 + struct bkey_s_c lru_k; 2065 + int ret = 0; 2066 + 2067 + for_each_btree_key_max_norestart(trans, lru_iter, BTREE_ID_lru, 2068 + lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 2, 0), 2069 + lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 2, LRU_TIME_MAX), 2070 + 0, lru_k, ret) { 2071 + ret = __get_existing_stripe(trans, h, &s->existing_stripe, lru_k.k->p.offset); 2072 + if (ret) 2073 + break; 1910 2074 } 2075 + bch2_trans_iter_exit(trans, &lru_iter); 2076 + if (!ret) 2077 + ret = -BCH_ERR_stripe_alloc_blocked; 2078 + if (ret == 1) 2079 + ret = 0; 2080 + if (ret) 2081 + return ret; 1911 2082 1912 2083 return init_new_stripe_from_existing(c, s); 1913 2084 } ··· 2214 2367 2215 2368 int bch2_stripes_read(struct bch_fs *c) 2216 2369 { 2217 - int ret = bch2_trans_run(c, 2218 - for_each_btree_key(trans, iter, BTREE_ID_stripes, POS_MIN, 2219 - BTREE_ITER_prefetch, k, ({ 2220 - if (k.k->type != KEY_TYPE_stripe) 2221 - continue; 2222 - 
2223 - ret = __ec_stripe_mem_alloc(c, k.k->p.offset, GFP_KERNEL); 2224 - if (ret) 2225 - break; 2226 - 2227 - struct stripe *m = genradix_ptr(&c->stripes, k.k->p.offset); 2228 - 2229 - stripe_to_mem(m, bkey_s_c_to_stripe(k).v); 2230 - 2231 - bch2_stripes_heap_insert(c, m, k.k->p.offset); 2232 - 0; 2233 - }))); 2234 - bch_err_fn(c, ret); 2235 - return ret; 2236 - } 2237 - 2238 - void bch2_stripes_heap_to_text(struct printbuf *out, struct bch_fs *c) 2239 - { 2240 - ec_stripes_heap *h = &c->ec_stripes_heap; 2241 - struct stripe *m; 2242 - size_t i; 2243 - 2244 - mutex_lock(&c->ec_stripes_heap_lock); 2245 - for (i = 0; i < min_t(size_t, h->nr, 50); i++) { 2246 - m = genradix_ptr(&c->stripes, h->data[i].idx); 2247 - 2248 - prt_printf(out, "%zu %u/%u+%u", h->data[i].idx, 2249 - h->data[i].blocks_nonempty, 2250 - m->nr_blocks - m->nr_redundant, 2251 - m->nr_redundant); 2252 - if (bch2_stripe_is_open(c, h->data[i].idx)) 2253 - prt_str(out, " open"); 2254 - prt_newline(out); 2255 - } 2256 - mutex_unlock(&c->ec_stripes_heap_lock); 2370 + return 0; 2257 2371 } 2258 2372 2259 2373 static void bch2_new_stripe_to_text(struct printbuf *out, struct bch_fs *c, ··· 2285 2477 2286 2478 BUG_ON(!list_empty(&c->ec_stripe_new_list)); 2287 2479 2288 - free_heap(&c->ec_stripes_heap); 2289 - genradix_free(&c->stripes); 2290 2480 bioset_exit(&c->ec_bioset); 2291 2481 } 2292 2482 2293 2483 void bch2_fs_ec_init_early(struct bch_fs *c) 2294 2484 { 2295 2485 spin_lock_init(&c->ec_stripes_new_lock); 2296 - mutex_init(&c->ec_stripes_heap_lock); 2297 2486 2298 2487 INIT_LIST_HEAD(&c->ec_stripe_head_list); 2299 2488 mutex_init(&c->ec_stripe_head_lock); ··· 2307 2502 { 2308 2503 return bioset_init(&c->ec_bioset, 1, offsetof(struct ec_bio, bio), 2309 2504 BIOSET_NEED_BVECS); 2505 + } 2506 + 2507 + static int bch2_check_stripe_to_lru_ref(struct btree_trans *trans, 2508 + struct bkey_s_c k, 2509 + struct bkey_buf *last_flushed) 2510 + { 2511 + if (k.k->type != KEY_TYPE_stripe) 2512 + return 0; 2513 + 
2514 + struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k); 2515 + 2516 + u64 lru_idx = stripe_lru_pos(s.v); 2517 + if (lru_idx) { 2518 + int ret = bch2_lru_check_set(trans, BCH_LRU_STRIPE_FRAGMENTATION, 2519 + k.k->p.offset, lru_idx, k, last_flushed); 2520 + if (ret) 2521 + return ret; 2522 + } 2523 + return 0; 2524 + } 2525 + 2526 + int bch2_check_stripe_to_lru_refs(struct bch_fs *c) 2527 + { 2528 + struct bkey_buf last_flushed; 2529 + 2530 + bch2_bkey_buf_init(&last_flushed); 2531 + bkey_init(&last_flushed.k->k); 2532 + 2533 + int ret = bch2_trans_run(c, 2534 + for_each_btree_key_commit(trans, iter, BTREE_ID_stripes, 2535 + POS_MIN, BTREE_ITER_prefetch, k, 2536 + NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 2537 + bch2_check_stripe_to_lru_ref(trans, k, &last_flushed))); 2538 + 2539 + bch2_bkey_buf_exit(&last_flushed, c); 2540 + bch_err_fn(c, ret); 2541 + return ret; 2310 2542 }
+40 -6
fs/bcachefs/ec.h
···
 	memcpy(stripe_csum(s, block, csum_idx), &csum, bch_crc_bytes[s->csum_type]);
 }

+#define STRIPE_LRU_POS_EMPTY	1
+
+static inline u64 stripe_lru_pos(const struct bch_stripe *s)
+{
+	if (!s)
+		return 0;
+
+	unsigned nr_data = s->nr_blocks - s->nr_redundant, blocks_empty = 0;
+
+	for (unsigned i = 0; i < nr_data; i++)
+		blocks_empty += !stripe_blockcount_get(s, i);
+
+	/* Will be picked up by the stripe_delete worker */
+	if (blocks_empty == nr_data)
+		return STRIPE_LRU_POS_EMPTY;
+
+	if (!blocks_empty)
+		return 0;
+
+	/* invert: more blocks empty = reuse first */
+	return LRU_TIME_MAX - blocks_empty;
+}
+
 static inline bool __bch2_ptr_matches_stripe(const struct bch_extent_ptr *stripe_ptr,
 					const struct bch_extent_ptr *data_ptr,
 					unsigned sectors)
···

 	return __bch2_ptr_matches_stripe(&m->ptrs[p.ec.block], &p.ptr,
 					m->sectors);
+}
+
+static inline void gc_stripe_unlock(struct gc_stripe *s)
+{
+	BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
+
+	clear_bit_unlock(BUCKET_LOCK_BITNR, (void *) &s->lock);
+	wake_up_bit((void *) &s->lock, BUCKET_LOCK_BITNR);
+}
+
+static inline void gc_stripe_lock(struct gc_stripe *s)
+{
+	wait_on_bit_lock((void *) &s->lock, BUCKET_LOCK_BITNR,
+			TASK_UNINTERRUPTIBLE);
 }

 struct bch_read_bio;
···

 void *bch2_writepoint_ec_buf(struct bch_fs *, struct write_point *);

-void bch2_ec_bucket_cancel(struct bch_fs *, struct open_bucket *);
+void bch2_ec_bucket_cancel(struct bch_fs *, struct open_bucket *, int);

 int bch2_ec_stripe_new_alloc(struct bch_fs *, struct ec_stripe_head *);

···
 struct ec_stripe_head *bch2_ec_stripe_head_get(struct btree_trans *,
 			unsigned, unsigned, unsigned,
 			enum bch_watermark, struct closure *);
-
-void bch2_stripes_heap_update(struct bch_fs *, struct stripe *, size_t);
-void bch2_stripes_heap_del(struct bch_fs *, struct stripe *, size_t);
-void bch2_stripes_heap_insert(struct bch_fs *, struct stripe *, size_t);

 void bch2_do_stripe_deletes(struct bch_fs *);
 void bch2_ec_do_stripe_creates(struct bch_fs *);
···

 int bch2_stripes_read(struct bch_fs *);

-void bch2_stripes_heap_to_text(struct printbuf *, struct bch_fs *);
 void bch2_new_stripes_to_text(struct printbuf *, struct bch_fs *);

 void bch2_fs_ec_exit(struct bch_fs *);
 void bch2_fs_ec_init_early(struct bch_fs *);
 int bch2_fs_ec_init(struct bch_fs *);
+
+int bch2_check_stripe_to_lru_refs(struct bch_fs *);

 #endif /* _BCACHEFS_EC_H */
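The `stripe_lru_pos()` helper added in this header replaces the in-memory heap ordering with a position in the LRU btree. A minimal userspace sketch of the same mapping follows, assuming a simplified stripe representation: `struct mock_stripe`, `sketch_stripe_lru_pos` and the value of `SKETCH_LRU_TIME_MAX` are hypothetical, and the real function reads per-block sector counts via `stripe_blockcount_get()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LRU_TIME_MAX	((1ULL << 48) - 1)	/* assumed value */
#define STRIPE_LRU_POS_EMPTY	1

/* Simplified stand-in for struct bch_stripe. */
struct mock_stripe {
	unsigned	nr_blocks;	/* data + parity blocks */
	unsigned	nr_redundant;	/* parity blocks */
	const unsigned	*blockcounts;	/* live sectors per block */
};

static uint64_t sketch_stripe_lru_pos(const struct mock_stripe *s)
{
	if (!s)
		return 0;	/* no stripe: no LRU entry */

	unsigned nr_data = s->nr_blocks - s->nr_redundant, blocks_empty = 0;

	/* Only data blocks are counted; parity blocks are ignored. */
	for (unsigned i = 0; i < nr_data; i++)
		blocks_empty += !s->blockcounts[i];

	if (blocks_empty == nr_data)
		return STRIPE_LRU_POS_EMPTY;	/* fully empty: candidate for deletion */

	if (!blocks_empty)
		return 0;	/* full: nothing to reuse, no LRU entry */

	/* More empty blocks gives a smaller position, so it sorts first. */
	return SKETCH_LRU_TIME_MAX - blocks_empty;
}
```

The three outcomes match the comments in the diff: position 1 marks stripes for the delete worker, 0 means the stripe is not tracked, and every other value orders partially empty stripes so that the emptiest is found first.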
+2 -10
fs/bcachefs/ec_types.h
···
 };

 struct gc_stripe {
+	u8			lock;
+	unsigned		alive:1; /* does a corresponding key exist in stripes btree? */
 	u16			sectors;
-
 	u8			nr_blocks;
 	u8			nr_redundant;
-
-	unsigned		alive:1; /* does a corresponding key exist in stripes btree? */
 	u16			block_sectors[BCH_BKEY_PTRS_MAX];
 	struct bch_extent_ptr	ptrs[BCH_BKEY_PTRS_MAX];

 	struct bch_replicas_padded r;
 };
-
-struct ec_stripe_heap_entry {
-	size_t			idx;
-	unsigned		blocks_nonempty;
-};
-
-typedef DEFINE_MIN_HEAP(struct ec_stripe_heap_entry, ec_stripes_heap) ec_stripes_heap;

 #endif /* _BCACHEFS_EC_TYPES_H */
+60 -5
fs/bcachefs/errcode.h
···
 	x(ENOENT, ENOENT_snapshot_tree) \
 	x(ENOENT, ENOENT_dirent_doesnt_match_inode) \
 	x(ENOENT, ENOENT_dev_not_found) \
+	x(ENOENT, ENOENT_dev_bucket_not_found) \
 	x(ENOENT, ENOENT_dev_idx_not_found) \
 	x(ENOENT, ENOENT_inode_no_backpointer) \
 	x(ENOENT, ENOENT_no_snapshot_tree_subvol) \
+	x(ENOENT, btree_node_dying) \
 	x(ENOTEMPTY, ENOTEMPTY_dir_not_empty) \
 	x(ENOTEMPTY, ENOTEMPTY_subvol_not_empty) \
 	x(EEXIST, EEXIST_str_hash_set) \
···
 	x(EINVAL, not_in_recovery) \
 	x(EINVAL, cannot_rewind_recovery) \
 	x(0, data_update_done) \
+	x(BCH_ERR_data_update_done, data_update_done_would_block) \
+	x(BCH_ERR_data_update_done, data_update_done_unwritten) \
+	x(BCH_ERR_data_update_done, data_update_done_no_writes_needed) \
+	x(BCH_ERR_data_update_done, data_update_done_no_snapshot) \
+	x(BCH_ERR_data_update_done, data_update_done_no_dev_refs) \
+	x(BCH_ERR_data_update_done, data_update_done_no_rw_devs) \
 	x(EINVAL, device_state_not_allowed) \
 	x(EINVAL, member_info_missing) \
 	x(EINVAL, mismatched_block_size) \
···
 	x(EINVAL, no_resize_with_buckets_nouse) \
 	x(EINVAL, inode_unpack_error) \
 	x(EINVAL, varint_decode_error) \
+	x(EINVAL, erasure_coding_found_btree_node) \
+	x(EOPNOTSUPP, may_not_use_incompat_feature) \
 	x(EROFS, erofs_trans_commit) \
 	x(EROFS, erofs_no_writes) \
 	x(EROFS, erofs_journal_err) \
···
 	x(EROFS, insufficient_devices) \
 	x(0, operation_blocked) \
 	x(BCH_ERR_operation_blocked, btree_cache_cannibalize_lock_blocked) \
-	x(BCH_ERR_operation_blocked, journal_res_get_blocked) \
-	x(BCH_ERR_operation_blocked, journal_preres_get_blocked) \
-	x(BCH_ERR_operation_blocked, bucket_alloc_blocked) \
-	x(BCH_ERR_operation_blocked, stripe_alloc_blocked) \
+	x(BCH_ERR_operation_blocked, journal_res_blocked) \
+	x(BCH_ERR_journal_res_blocked, journal_blocked) \
+	x(BCH_ERR_journal_res_blocked, journal_max_in_flight) \
+	x(BCH_ERR_journal_res_blocked, journal_max_open) \
+	x(BCH_ERR_journal_res_blocked, journal_full) \
+	x(BCH_ERR_journal_res_blocked, journal_pin_full) \
+	x(BCH_ERR_journal_res_blocked, journal_buf_enomem) \
+	x(BCH_ERR_journal_res_blocked, journal_stuck) \
+	x(BCH_ERR_journal_res_blocked, journal_retry_open) \
+	x(BCH_ERR_journal_res_blocked, journal_preres_get_blocked) \
+	x(BCH_ERR_journal_res_blocked, bucket_alloc_blocked) \
+	x(BCH_ERR_journal_res_blocked, stripe_alloc_blocked) \
 	x(BCH_ERR_invalid, invalid_sb) \
 	x(BCH_ERR_invalid_sb, invalid_sb_magic) \
 	x(BCH_ERR_invalid_sb, invalid_sb_version) \
···
 	x(BCH_ERR_invalid_sb, invalid_sb_csum) \
 	x(BCH_ERR_invalid_sb, invalid_sb_block_size) \
 	x(BCH_ERR_invalid_sb, invalid_sb_uuid) \
+	x(BCH_ERR_invalid_sb, invalid_sb_offset) \
 	x(BCH_ERR_invalid_sb, invalid_sb_too_many_members) \
 	x(BCH_ERR_invalid_sb, invalid_sb_dev_idx) \
 	x(BCH_ERR_invalid_sb, invalid_sb_time_precision) \
···
 	x(BCH_ERR_operation_blocked, nocow_lock_blocked) \
 	x(EIO, journal_shutdown) \
 	x(EIO, journal_flush_err) \
+	x(EIO, journal_write_err) \
 	x(EIO, btree_node_read_err) \
 	x(BCH_ERR_btree_node_read_err, btree_node_read_err_cached) \
 	x(EIO, sb_not_downgraded) \
···
 	x(EIO, btree_node_read_validate_error) \
 	x(EIO, btree_need_topology_repair) \
 	x(EIO, bucket_ref_update) \
+	x(EIO, trigger_alloc) \
 	x(EIO, trigger_pointer) \
 	x(EIO, trigger_stripe_pointer) \
 	x(EIO, metadata_bucket_inconsistency) \
 	x(EIO, mark_stripe) \
 	x(EIO, stripe_reconstruct) \
 	x(EIO, key_type_error) \
-	x(EIO, no_device_to_read_from) \
+	x(EIO, extent_poisened) \
 	x(EIO, missing_indirect_extent) \
 	x(EIO, invalidate_stripe_to_dev) \
 	x(EIO, no_encryption_key) \
 	x(EIO, insufficient_journal_devices) \
+	x(EIO, device_offline) \
+	x(EIO, EIO_fault_injected) \
+	x(EIO, ec_block_read) \
+	x(EIO, ec_block_write) \
+	x(EIO, recompute_checksum) \
+	x(EIO, decompress) \
+	x(BCH_ERR_decompress, decompress_exceeded_max_encoded_extent) \
+	x(BCH_ERR_decompress, decompress_lz4) \
+	x(BCH_ERR_decompress, decompress_gzip) \
+	x(BCH_ERR_decompress, decompress_zstd_src_len_bad) \
+	x(BCH_ERR_decompress, decompress_zstd) \
+	x(EIO, data_write) \
+	x(BCH_ERR_data_write, data_write_io) \
+	x(BCH_ERR_data_write, data_write_csum) \
+	x(BCH_ERR_data_write, data_write_invalid_ptr) \
+	x(BCH_ERR_data_write, data_write_misaligned) \
+	x(BCH_ERR_decompress, data_read) \
+	x(BCH_ERR_data_read, no_device_to_read_from) \
+	x(BCH_ERR_data_read, data_read_io_err) \
+	x(BCH_ERR_data_read, data_read_csum_err) \
+	x(BCH_ERR_data_read, data_read_retry) \
+	x(BCH_ERR_data_read_retry, data_read_retry_avoid) \
+	x(BCH_ERR_data_read_retry_avoid,data_read_retry_device_offline) \
+	x(BCH_ERR_data_read_retry_avoid,data_read_retry_io_err) \
+	x(BCH_ERR_data_read_retry_avoid,data_read_retry_ec_reconstruct_err) \
+	x(BCH_ERR_data_read_retry_avoid,data_read_retry_csum_err) \
+	x(BCH_ERR_data_read_retry, data_read_retry_csum_err_maybe_userspace)\
+	x(BCH_ERR_data_read, data_read_decompress_err) \
+	x(BCH_ERR_data_read, data_read_decrypt_err) \
+	x(BCH_ERR_data_read, data_read_ptr_stale_race) \
+	x(BCH_ERR_data_read_retry, data_read_ptr_stale_retry) \
+	x(BCH_ERR_data_read, data_read_no_encryption_key) \
+	x(BCH_ERR_data_read, data_read_buffer_too_small) \
+	x(BCH_ERR_data_read, data_read_key_overwritten) \
 	x(BCH_ERR_btree_node_read_err, btree_node_read_err_fixable) \
 	x(BCH_ERR_btree_node_read_err, btree_node_read_err_want_retry) \
 	x(BCH_ERR_btree_node_read_err, btree_node_read_err_must_retry) \
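Each `x(parent, name)` row in this table names an error code and the class it belongs to, so a caller can test a specific error against any of its ancestors (the diff's `bch2_err_matches(ret, BCH_ERR_transaction_restart)` calls rely on this). Below is a minimal sketch of how such a parent table can be built and walked. Every `sketch_`/`SKETCH_` name is hypothetical, the table is a tiny invented subset, and the walk is an assumption about the general technique rather than a copy of the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical miniature of an x(parent, name) table. */
#define SKETCH_ERRCODES()						\
	x(EIO,				data_read)			\
	x(SKETCH_ERR_data_read,		data_read_retry)		\
	x(SKETCH_ERR_data_read_retry,	data_read_retry_avoid)		\
	x(SKETCH_ERR_data_read_retry_avoid, data_read_retry_io_err)	\
	x(SKETCH_ERR_data_read,		data_read_csum_err)		\
	x(EIO,				data_write)

/* Private codes start above the standard errno range. */
enum sketch_errcode {
	SKETCH_ERR_START = 2048,
#define x(parent, name) SKETCH_ERR_##name,
	SKETCH_ERRCODES()
#undef x
	SKETCH_ERR_MAX
};

/* parent[code - START] is the class one level up (a private code or an errno). */
static const unsigned sketch_err_parent[] = {
#define x(parent, name) [SKETCH_ERR_##name - SKETCH_ERR_START] = parent,
	SKETCH_ERRCODES()
#undef x
};

/* True if @err is @class or a descendant of @class; signs are ignored. */
static bool sketch_err_matches(int err, int class)
{
	err = err < 0 ? -err : err;
	class = class < 0 ? -class : class;

	assert(err < SKETCH_ERR_MAX);

	/* Climb towards the root until we hit @class or a standard errno. */
	while (err > SKETCH_ERR_START && err != class)
		err = sketch_err_parent[err - SKETCH_ERR_START];

	return err == class;
}
```

In this sketch, `-SKETCH_ERR_data_read_retry_io_err` matches `SKETCH_ERR_data_read_retry`, `SKETCH_ERR_data_read`, and `EIO`, while `SKETCH_ERR_data_read_csum_err` does not match `SKETCH_ERR_data_read_retry`.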
+68 -28
fs/bcachefs/error.c
···
 #include "btree_cache.h"
 #include "btree_iter.h"
 #include "error.h"
-#include "fs-common.h"
 #include "journal.h"
+#include "namei.h"
 #include "recovery_passes.h"
 #include "super.h"
 #include "thread_with_file.h"
···
 {
 	struct bch_dev *ca = container_of(work, struct bch_dev, io_error_work);
 	struct bch_fs *c = ca->fs;
-	bool dev;
+
+	/* XXX: if it's reads or checksums that are failing, set it to failed */
 
 	down_write(&c->state_lock);
-	dev = bch2_dev_state_allowed(c, ca, BCH_MEMBER_STATE_ro,
-				     BCH_FORCE_IF_DEGRADED);
-	if (dev
-	    ? __bch2_dev_set_state(c, ca, BCH_MEMBER_STATE_ro,
-				   BCH_FORCE_IF_DEGRADED)
-	    : bch2_fs_emergency_read_only(c))
+	unsigned long write_errors_start = READ_ONCE(ca->write_errors_start);
+
+	if (write_errors_start &&
+	    time_after(jiffies,
+		       write_errors_start + c->opts.write_error_timeout * HZ)) {
+		if (ca->mi.state >= BCH_MEMBER_STATE_ro)
+			goto out;
+
+		bool dev = !__bch2_dev_set_state(c, ca, BCH_MEMBER_STATE_ro,
+						 BCH_FORCE_IF_DEGRADED);
+
 		bch_err(ca,
-			"too many IO errors, setting %s RO",
+			"writes erroring for %u seconds, setting %s ro",
+			c->opts.write_error_timeout,
 			dev ? "device" : "filesystem");
+		if (!dev)
+			bch2_fs_emergency_read_only(c);
+
+	}
+out:
 	up_write(&c->state_lock);
 }
 
 void bch2_io_error(struct bch_dev *ca, enum bch_member_error_type type)
 {
 	atomic64_inc(&ca->errors[type]);
-	//queue_work(system_long_wq, &ca->io_error_work);
+
+	if (type == BCH_MEMBER_ERROR_write && !ca->write_errors_start)
+		ca->write_errors_start = jiffies;
+
+	queue_work(system_long_wq, &ca->io_error_work);
 }
 
 enum ask_yn {
···
 	mutex_unlock(&c->fsck_error_msgs_lock);
 }
 
-int bch2_inum_err_msg_trans(struct btree_trans *trans, struct printbuf *out, subvol_inum inum)
+int bch2_inum_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out,
+				   subvol_inum inum, u64 offset)
 {
 	u32 restart_count = trans->restart_count;
 	int ret = 0;
 
-	/* XXX: we don't yet attempt to print paths when we don't know the subvol */
-	if (inum.subvol)
-		ret = lockrestart_do(trans, bch2_inum_to_path(trans, inum, out));
+	if (inum.subvol) {
+		ret = bch2_inum_to_path(trans, inum, out);
+		if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+			return ret;
+	}
 	if (!inum.subvol || ret)
 		prt_printf(out, "inum %llu:%llu", inum.subvol, inum.inum);
+	prt_printf(out, " offset %llu: ", offset);
 
 	return trans_was_restarted(trans, restart_count);
-}
-
-int bch2_inum_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out,
-				   subvol_inum inum, u64 offset)
-{
-	int ret = bch2_inum_err_msg_trans(trans, out, inum);
-	prt_printf(out, " offset %llu: ", offset);
-	return ret;
-}
-
-void bch2_inum_err_msg(struct bch_fs *c, struct printbuf *out, subvol_inum inum)
-{
-	bch2_trans_run(c, bch2_inum_err_msg_trans(trans, out, inum));
 }
 
 void bch2_inum_offset_err_msg(struct bch_fs *c, struct printbuf *out,
 			      subvol_inum inum, u64 offset)
 {
-	bch2_trans_run(c, bch2_inum_offset_err_msg_trans(trans, out, inum, offset));
+	bch2_trans_do(c, bch2_inum_offset_err_msg_trans(trans, out, inum, offset));
+}
+
+int bch2_inum_snap_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out,
+					struct bpos pos)
+{
+	struct bch_fs *c = trans->c;
+	int ret = 0;
+
+	if (!bch2_snapshot_is_leaf(c, pos.snapshot))
+		prt_str(out, "(multiple snapshots) ");
+
+	subvol_inum inum = {
+		.subvol = bch2_snapshot_tree_oldest_subvol(c, pos.snapshot),
+		.inum = pos.inode,
+	};
+
+	if (inum.subvol) {
+		ret = bch2_inum_to_path(trans, inum, out);
+		if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+			return ret;
+	}
+
+	if (!inum.subvol || ret)
+		prt_printf(out, "inum %llu:%u", pos.inode, pos.snapshot);
+
+	prt_printf(out, " offset %llu: ", pos.offset << 8);
+	return 0;
+}
+
+void bch2_inum_snap_offset_err_msg(struct bch_fs *c, struct printbuf *out,
+				   struct bpos pos)
+{
+	bch2_trans_do(c, bch2_inum_snap_offset_err_msg_trans(trans, out, pos));
 }
+33 -22
fs/bcachefs/error.h
···
 /* Does the error handling without logging a message */
 void bch2_io_error(struct bch_dev *, enum bch_member_error_type);
 
-#define bch2_dev_io_err_on(cond, ca, _type, ...) \
-({ \
-	bool _ret = (cond); \
-	\
-	if (_ret) { \
-		bch_err_dev_ratelimited(ca, __VA_ARGS__); \
-		bch2_io_error(ca, _type); \
-	} \
-	_ret; \
-})
+#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
+void bch2_latency_acct(struct bch_dev *, u64, int);
+#else
+static inline void bch2_latency_acct(struct bch_dev *ca, u64 submit_time, int rw) {}
+#endif
 
-#define bch2_dev_inum_io_err_on(cond, ca, _type, ...) \
-({ \
-	bool _ret = (cond); \
-	\
-	if (_ret) { \
-		bch_err_inum_offset_ratelimited(ca, __VA_ARGS__); \
-		bch2_io_error(ca, _type); \
-	} \
-	_ret; \
-})
+static inline void bch2_account_io_success_fail(struct bch_dev *ca,
+						enum bch_member_error_type type,
+						bool success)
+{
+	if (likely(success)) {
+		if (type == BCH_MEMBER_ERROR_write &&
+		    ca->write_errors_start)
+			ca->write_errors_start = 0;
+	} else {
+		bch2_io_error(ca, type);
+	}
+}
 
-int bch2_inum_err_msg_trans(struct btree_trans *, struct printbuf *, subvol_inum);
+static inline void bch2_account_io_completion(struct bch_dev *ca,
+					      enum bch_member_error_type type,
+					      u64 submit_time, bool success)
+{
+	if (unlikely(!ca))
+		return;
+
+	if (type != BCH_MEMBER_ERROR_checksum)
+		bch2_latency_acct(ca, submit_time, type);
+
+	bch2_account_io_success_fail(ca, type, success);
+}
+
 int bch2_inum_offset_err_msg_trans(struct btree_trans *, struct printbuf *, subvol_inum, u64);
 
-void bch2_inum_err_msg(struct bch_fs *, struct printbuf *, subvol_inum);
 void bch2_inum_offset_err_msg(struct bch_fs *, struct printbuf *, subvol_inum, u64);
+
+int bch2_inum_snap_offset_err_msg_trans(struct btree_trans *, struct printbuf *, struct bpos);
+void bch2_inum_snap_offset_err_msg(struct bch_fs *, struct printbuf *, struct bpos);
 
 #endif /* _BCACHEFS_ERROR_H */
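Taken together, the `error.c` and `error.h` changes replace "too many IO errors" with a time-based rule: the first failed write starts a clock, any successful write resets it, and the deferred work item only acts once writes have been failing for longer than `write_error_timeout`. The sketch below models just that rule in userspace. The `sketch_` names are hypothetical, time is passed in as plain seconds instead of `jiffies`, and the locking and the filesystem-wide emergency read-only fallback are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state: only what the timeout rule needs. */
struct sketch_dev {
	uint64_t write_errors_start;	/* 0 = no unbroken run of write errors */
	bool	 read_only;
};

/* A write failed at time @now (seconds, nonzero): start the clock once. */
static void sketch_write_failed(struct sketch_dev *ca, uint64_t now)
{
	assert(now != 0);

	if (!ca->write_errors_start)
		ca->write_errors_start = now;
}

/* A write succeeded: the run of errors is broken, reset the clock. */
static void sketch_write_succeeded(struct sketch_dev *ca)
{
	ca->write_errors_start = 0;
}

/* Deferred check: go read-only only if errors have persisted past @timeout. */
static void sketch_io_error_work(struct sketch_dev *ca, uint64_t now, uint64_t timeout)
{
	if (ca->write_errors_start &&
	    now > ca->write_errors_start + timeout)
		ca->read_only = true;
}
```

The design choice this illustrates is that a burst of errors followed by a success never trips the device, whereas a device that fails every write for the whole timeout window does.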
+171 -78
fs/bcachefs/extents.c
··· 28 28 #include "trace.h" 29 29 #include "util.h" 30 30 31 + static const char * const bch2_extent_flags_strs[] = { 32 + #define x(n, v) [BCH_EXTENT_FLAG_##n] = #n, 33 + BCH_EXTENT_FLAGS() 34 + #undef x 35 + NULL, 36 + }; 37 + 31 38 static unsigned bch2_crc_field_size_max[] = { 32 39 [BCH_EXTENT_ENTRY_crc32] = CRC32_SIZE_MAX, 33 40 [BCH_EXTENT_ENTRY_crc64] = CRC64_SIZE_MAX, ··· 58 51 } 59 52 60 53 void bch2_mark_io_failure(struct bch_io_failures *failed, 61 - struct extent_ptr_decoded *p) 54 + struct extent_ptr_decoded *p, 55 + bool csum_error) 62 56 { 63 57 struct bch_dev_io_failures *f = bch2_dev_io_failures(failed, p->ptr.dev); 64 58 ··· 67 59 BUG_ON(failed->nr >= ARRAY_SIZE(failed->devs)); 68 60 69 61 f = &failed->devs[failed->nr++]; 70 - f->dev = p->ptr.dev; 71 - f->idx = p->idx; 72 - f->nr_failed = 1; 73 - f->nr_retries = 0; 74 - } else if (p->idx != f->idx) { 75 - f->idx = p->idx; 76 - f->nr_failed = 1; 77 - f->nr_retries = 0; 78 - } else { 79 - f->nr_failed++; 62 + memset(f, 0, sizeof(*f)); 63 + f->dev = p->ptr.dev; 80 64 } 65 + 66 + if (p->do_ec_reconstruct) 67 + f->failed_ec = true; 68 + else if (!csum_error) 69 + f->failed_io = true; 70 + else 71 + f->failed_csum_nr++; 81 72 } 82 73 83 - static inline u64 dev_latency(struct bch_fs *c, unsigned dev) 74 + static inline u64 dev_latency(struct bch_dev *ca) 84 75 { 85 - struct bch_dev *ca = bch2_dev_rcu(c, dev); 86 76 return ca ? 
atomic64_read(&ca->cur_latency[READ]) : S64_MAX; 77 + } 78 + 79 + static inline int dev_failed(struct bch_dev *ca) 80 + { 81 + return !ca || ca->mi.state == BCH_MEMBER_STATE_failed; 87 82 } 88 83 89 84 /* ··· 94 83 */ 95 84 static inline bool ptr_better(struct bch_fs *c, 96 85 const struct extent_ptr_decoded p1, 97 - const struct extent_ptr_decoded p2) 86 + u64 p1_latency, 87 + struct bch_dev *ca1, 88 + const struct extent_ptr_decoded p2, 89 + u64 p2_latency) 98 90 { 99 - if (likely(!p1.idx && !p2.idx)) { 100 - u64 l1 = dev_latency(c, p1.ptr.dev); 101 - u64 l2 = dev_latency(c, p2.ptr.dev); 91 + struct bch_dev *ca2 = bch2_dev_rcu(c, p2.ptr.dev); 102 92 103 - /* 104 - * Square the latencies, to bias more in favor of the faster 105 - * device - we never want to stop issuing reads to the slower 106 - * device altogether, so that we can update our latency numbers: 107 - */ 108 - l1 *= l1; 109 - l2 *= l2; 93 + int failed_delta = dev_failed(ca1) - dev_failed(ca2); 94 + if (unlikely(failed_delta)) 95 + return failed_delta < 0; 110 96 111 - /* Pick at random, biased in favor of the faster device: */ 97 + if (unlikely(bch2_force_reconstruct_read)) 98 + return p1.do_ec_reconstruct > p2.do_ec_reconstruct; 112 99 113 - return bch2_get_random_u64_below(l1 + l2) > l1; 114 - } 100 + if (unlikely(p1.do_ec_reconstruct || p2.do_ec_reconstruct)) 101 + return p1.do_ec_reconstruct < p2.do_ec_reconstruct; 115 102 116 - if (bch2_force_reconstruct_read) 117 - return p1.idx > p2.idx; 103 + int crc_retry_delta = (int) p1.crc_retry_nr - (int) p2.crc_retry_nr; 104 + if (unlikely(crc_retry_delta)) 105 + return crc_retry_delta < 0; 118 106 119 - return p1.idx < p2.idx; 107 + /* Pick at random, biased in favor of the faster device: */ 108 + 109 + return bch2_get_random_u64_below(p1_latency + p2_latency) > p1_latency; 120 110 } 121 111 122 112 /* ··· 127 115 */ 128 116 int bch2_bkey_pick_read_device(struct bch_fs *c, struct bkey_s_c k, 129 117 struct bch_io_failures *failed, 130 - struct 
extent_ptr_decoded *pick) 118 + struct extent_ptr_decoded *pick, 119 + int dev) 131 120 { 132 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 133 - const union bch_extent_entry *entry; 134 - struct extent_ptr_decoded p; 135 - struct bch_dev_io_failures *f; 136 - int ret = 0; 121 + bool have_csum_errors = false, have_io_errors = false, have_missing_devs = false; 122 + bool have_dirty_ptrs = false, have_pick = false; 137 123 138 124 if (k.k->type == KEY_TYPE_error) 139 125 return -BCH_ERR_key_type_error; 140 126 127 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 128 + 129 + if (bch2_bkey_extent_ptrs_flags(ptrs) & BIT_ULL(BCH_EXTENT_FLAG_poisoned)) 130 + return -BCH_ERR_extent_poisened; 131 + 141 132 rcu_read_lock(); 133 + const union bch_extent_entry *entry; 134 + struct extent_ptr_decoded p; 135 + u64 pick_latency; 136 + 142 137 bkey_for_each_ptr_decode(k.k, ptrs, p, entry) { 138 + have_dirty_ptrs |= !p.ptr.cached; 139 + 143 140 /* 144 141 * Unwritten extent: no need to actually read, treat it as a 145 142 * hole and return 0s: 146 143 */ 147 144 if (p.ptr.unwritten) { 148 - ret = 0; 149 - break; 145 + rcu_read_unlock(); 146 + return 0; 150 147 } 151 148 152 - /* 153 - * If there are any dirty pointers it's an error if we can't 154 - * read: 155 - */ 156 - if (!ret && !p.ptr.cached) 157 - ret = -BCH_ERR_no_device_to_read_from; 149 + /* Are we being asked to read from a specific device? */ 150 + if (dev >= 0 && p.ptr.dev != dev) 151 + continue; 158 152 159 153 struct bch_dev *ca = bch2_dev_rcu(c, p.ptr.dev); 160 154 161 155 if (p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr))) 162 156 continue; 163 157 164 - f = failed ? bch2_dev_io_failures(failed, p.ptr.dev) : NULL; 165 - if (f) 166 - p.idx = f->nr_failed < f->nr_retries 167 - ? f->idx 168 - : f->idx + 1; 158 + struct bch_dev_io_failures *f = 159 + unlikely(failed) ? 
bch2_dev_io_failures(failed, p.ptr.dev) : NULL; 160 + if (unlikely(f)) { 161 + p.crc_retry_nr = f->failed_csum_nr; 162 + p.has_ec &= ~f->failed_ec; 169 163 170 - if (!p.idx && (!ca || !bch2_dev_is_readable(ca))) 171 - p.idx++; 164 + if (ca && ca->mi.state != BCH_MEMBER_STATE_failed) { 165 + have_io_errors |= f->failed_io; 166 + have_io_errors |= f->failed_ec; 167 + } 168 + have_csum_errors |= !!f->failed_csum_nr; 172 169 173 - if (!p.idx && p.has_ec && bch2_force_reconstruct_read) 174 - p.idx++; 170 + if (p.has_ec && (f->failed_io || f->failed_csum_nr)) 171 + p.do_ec_reconstruct = true; 172 + else if (f->failed_io || 173 + f->failed_csum_nr > c->opts.checksum_err_retry_nr) 174 + continue; 175 + } 175 176 176 - if (p.idx > (unsigned) p.has_ec) 177 - continue; 177 + have_missing_devs |= ca && !bch2_dev_is_online(ca); 178 178 179 - if (ret > 0 && !ptr_better(c, p, *pick)) 180 - continue; 179 + if (!ca || !bch2_dev_is_online(ca)) { 180 + if (!p.has_ec) 181 + continue; 182 + p.do_ec_reconstruct = true; 183 + } 181 184 182 - *pick = p; 183 - ret = 1; 185 + if (bch2_force_reconstruct_read && p.has_ec) 186 + p.do_ec_reconstruct = true; 187 + 188 + u64 p_latency = dev_latency(ca); 189 + /* 190 + * Square the latencies, to bias more in favor of the faster 191 + * device - we never want to stop issuing reads to the slower 192 + * device altogether, so that we can update our latency numbers: 193 + */ 194 + p_latency *= p_latency; 195 + 196 + if (!have_pick || 197 + ptr_better(c, 198 + p, p_latency, ca, 199 + *pick, pick_latency)) { 200 + *pick = p; 201 + pick_latency = p_latency; 202 + have_pick = true; 203 + } 184 204 } 185 205 rcu_read_unlock(); 186 206 187 - return ret; 207 + if (have_pick) 208 + return 1; 209 + if (!have_dirty_ptrs) 210 + return 0; 211 + if (have_missing_devs) 212 + return -BCH_ERR_no_device_to_read_from; 213 + if (have_csum_errors) 214 + return -BCH_ERR_data_read_csum_err; 215 + if (have_io_errors) 216 + return -BCH_ERR_data_read_io_err; 217 + 218 + 
WARN_ONCE(1, "unhandled error case in %s\n", __func__); 219 + return -EINVAL; 188 220 } 189 221 190 222 /* KEY_TYPE_btree_ptr: */ ··· 592 536 struct bch_extent_crc_unpacked src, 593 537 enum bch_extent_entry_type type) 594 538 { 595 - #define set_common_fields(_dst, _src) \ 596 - _dst.type = 1 << type; \ 597 - _dst.csum_type = _src.csum_type, \ 598 - _dst.compression_type = _src.compression_type, \ 599 - _dst._compressed_size = _src.compressed_size - 1, \ 600 - _dst._uncompressed_size = _src.uncompressed_size - 1, \ 601 - _dst.offset = _src.offset 539 + #define common_fields(_src) \ 540 + .type = BIT(type), \ 541 + .csum_type = _src.csum_type, \ 542 + .compression_type = _src.compression_type, \ 543 + ._compressed_size = _src.compressed_size - 1, \ 544 + ._uncompressed_size = _src.uncompressed_size - 1, \ 545 + .offset = _src.offset 602 546 603 547 switch (type) { 604 548 case BCH_EXTENT_ENTRY_crc32: 605 - set_common_fields(dst->crc32, src); 606 - dst->crc32.csum = (u32 __force) *((__le32 *) &src.csum.lo); 549 + dst->crc32 = (struct bch_extent_crc32) { 550 + common_fields(src), 551 + .csum = (u32 __force) *((__le32 *) &src.csum.lo), 552 + }; 607 553 break; 608 554 case BCH_EXTENT_ENTRY_crc64: 609 - set_common_fields(dst->crc64, src); 610 - dst->crc64.nonce = src.nonce; 611 - dst->crc64.csum_lo = (u64 __force) src.csum.lo; 612 - dst->crc64.csum_hi = (u64 __force) *((__le16 *) &src.csum.hi); 555 + dst->crc64 = (struct bch_extent_crc64) { 556 + common_fields(src), 557 + .nonce = src.nonce, 558 + .csum_lo = (u64 __force) src.csum.lo, 559 + .csum_hi = (u64 __force) *((__le16 *) &src.csum.hi), 560 + }; 613 561 break; 614 562 case BCH_EXTENT_ENTRY_crc128: 615 - set_common_fields(dst->crc128, src); 616 - dst->crc128.nonce = src.nonce; 617 - dst->crc128.csum = src.csum; 563 + dst->crc128 = (struct bch_extent_crc128) { 564 + common_fields(src), 565 + .nonce = src.nonce, 566 + .csum = src.csum, 567 + }; 618 568 break; 619 569 default: 620 570 BUG(); ··· 1059 997 1060 998 
struct bch_dev *ca = bch2_dev_rcu_noerror(c, ptr->dev); 1061 999 1062 - return ca && bch2_dev_is_readable(ca) && !dev_ptr_stale_rcu(ca, ptr); 1000 + return ca && bch2_dev_is_healthy(ca) && !dev_ptr_stale_rcu(ca, ptr); 1063 1001 } 1064 1002 1065 1003 void bch2_extent_ptr_set_cached(struct bch_fs *c, ··· 1282 1220 bch2_extent_rebalance_to_text(out, c, &entry->rebalance); 1283 1221 break; 1284 1222 1223 + case BCH_EXTENT_ENTRY_flags: 1224 + prt_bitflags(out, bch2_extent_flags_strs, entry->flags.flags); 1225 + break; 1226 + 1285 1227 default: 1286 1228 prt_printf(out, "(invalid extent entry %.16llx)", *((u64 *) entry)); 1287 1229 return; ··· 1447 1381 #endif 1448 1382 break; 1449 1383 } 1384 + case BCH_EXTENT_ENTRY_flags: 1385 + bkey_fsck_err_on(entry != ptrs.start, 1386 + c, extent_flags_not_at_start, 1387 + "extent flags entry not at start"); 1388 + break; 1450 1389 } 1451 1390 } 1452 1391 ··· 1518 1447 } 1519 1448 } 1520 1449 1450 + int bch2_bkey_extent_flags_set(struct bch_fs *c, struct bkey_i *k, u64 flags) 1451 + { 1452 + int ret = bch2_request_incompat_feature(c, bcachefs_metadata_version_extent_flags); 1453 + if (ret) 1454 + return ret; 1455 + 1456 + struct bkey_ptrs ptrs = bch2_bkey_ptrs(bkey_i_to_s(k)); 1457 + 1458 + if (ptrs.start != ptrs.end && 1459 + extent_entry_type(ptrs.start) == BCH_EXTENT_ENTRY_flags) { 1460 + ptrs.start->flags.flags = flags; 1461 + } else { 1462 + struct bch_extent_flags f = { 1463 + .type = BIT(BCH_EXTENT_ENTRY_flags), 1464 + .flags = flags, 1465 + }; 1466 + __extent_entry_insert(k, ptrs.start, (union bch_extent_entry *) &f); 1467 + } 1468 + 1469 + return 0; 1470 + } 1471 + 1521 1472 /* Generic extent code: */ 1522 1473 1523 1474 int bch2_cut_front_s(struct bpos where, struct bkey_s k) ··· 1585 1492 entry->crc128.offset += sub; 1586 1493 break; 1587 1494 case BCH_EXTENT_ENTRY_stripe_ptr: 1588 - break; 1589 1495 case BCH_EXTENT_ENTRY_rebalance: 1496 + case BCH_EXTENT_ENTRY_flags: 1590 1497 break; 1591 1498 } 1592 1499
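The rewritten `ptr_better()` compares two candidate pointers in a fixed order: a pointer on a failed device loses, then a pointer that needs erasure-coded reconstruction loses, then the one with more checksum retries loses, and only then does the latency-weighted random pick apply. The sketch below restates that ordering with hypothetical types; `struct sketch_ptr` and `sketch_ptr_better()` are not kernel names, the `bch2_force_reconstruct_read` debug override is omitted, and the random draw is passed in as a parameter so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened view of one candidate replica. */
struct sketch_ptr {
	bool	 dev_failed;		/* device missing or marked failed */
	bool	 do_ec_reconstruct;	/* read needs erasure-coded reconstruction */
	unsigned crc_retry_nr;		/* checksum errors already seen on this device */
	uint64_t latency;		/* unsquared; assumed small enough not to overflow */
};

/*
 * Returns true if @p1 should be preferred over @p2.
 * @rnd stands in for a uniform draw from [0, l1 + l2), where l1 and l2 are
 * the squared latencies; it is only consulted when nothing else decides.
 */
static bool sketch_ptr_better(const struct sketch_ptr *p1,
			      const struct sketch_ptr *p2,
			      uint64_t rnd)
{
	if (p1->dev_failed != p2->dev_failed)
		return !p1->dev_failed;

	if (p1->do_ec_reconstruct != p2->do_ec_reconstruct)
		return !p1->do_ec_reconstruct;

	if (p1->crc_retry_nr != p2->crc_retry_nr)
		return p1->crc_retry_nr < p2->crc_retry_nr;

	/* Squaring biases the pick towards the faster device. */
	uint64_t l1 = p1->latency * p1->latency;
	uint64_t l2 = p2->latency * p2->latency;

	assert(rnd < l1 + l2);
	return rnd > l1;
}
```

With latencies 1 and 3 the squared weights are 1 and 9, so the faster pointer wins for 8 of the 10 possible draws, yet the slower device still receives some reads and its latency estimate keeps being refreshed.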
+20 -4
fs/bcachefs/extents.h
···
 ({ \
 	__label__ out; \
 	\
-	(_ptr).idx = 0; \
-	(_ptr).has_ec = false; \
+	(_ptr).has_ec = false; \
+	(_ptr).do_ec_reconstruct = false; \
+	(_ptr).crc_retry_nr = 0; \
 	\
 	__bkey_extent_entry_for_each_from(_entry, _end, _entry) \
 		switch (__extent_entry_type(_entry)) { \
···
 struct bch_dev_io_failures *bch2_dev_io_failures(struct bch_io_failures *,
 						 unsigned);
 void bch2_mark_io_failure(struct bch_io_failures *,
-			  struct extent_ptr_decoded *);
+			  struct extent_ptr_decoded *, bool);
 int bch2_bkey_pick_read_device(struct bch_fs *, struct bkey_s_c,
 			       struct bch_io_failures *,
-			       struct extent_ptr_decoded *);
+			       struct extent_ptr_decoded *, int);
 
 /* KEY_TYPE_btree_ptr: */
 
···
 	k->p.offset += new_size;
 	k->size = new_size;
 }
+
+static inline u64 bch2_bkey_extent_ptrs_flags(struct bkey_ptrs_c ptrs)
+{
+	if (ptrs.start != ptrs.end &&
+	    extent_entry_type(ptrs.start) == BCH_EXTENT_ENTRY_flags)
+		return ptrs.start->flags.flags;
+	return 0;
+}
+
+static inline u64 bch2_bkey_extent_flags(struct bkey_s_c k)
+{
+	return bch2_bkey_extent_ptrs_flags(bch2_bkey_ptrs_c(k));
+}
+
+int bch2_bkey_extent_flags_set(struct bch_fs *, struct bkey_i *, u64);
 
 #endif /* _BCACHEFS_EXTENTS_H */
+22 -2
fs/bcachefs/extents_format.h
···
 	x(crc64, 2) \
 	x(crc128, 3) \
 	x(stripe_ptr, 4) \
-	x(rebalance, 5)
-#define BCH_EXTENT_ENTRY_MAX 6
+	x(rebalance, 5) \
+	x(flags, 6)
+#define BCH_EXTENT_ENTRY_MAX 7
 
 enum bch_extent_entry_type {
 #define x(f, n) BCH_EXTENT_ENTRY_##f = n,
···
 		redundancy:4,
 		block:8,
 		type:5;
+#endif
+};
+
+#define BCH_EXTENT_FLAGS() \
+	x(poisoned, 0)
+
+enum bch_extent_flags_e {
+#define x(n, v) BCH_EXTENT_FLAG_##n = v,
+	BCH_EXTENT_FLAGS()
+#undef x
+};
+
+struct bch_extent_flags {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	__u64 type:7,
+	      flags:57;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+	__u64 flags:57,
+	      type:7;
 #endif
 };
 
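The new `flags` entry packs a 7-bit `type` field and 57 bits of flags into one 64-bit word, and the setter in `extents.c` initializes `type` to `BIT(BCH_EXTENT_ENTRY_flags)`. The sketch below shows one way to read that layout with explicit shifts instead of bitfields. It assumes, based on that initializer, that an entry's type is recovered from the position of the lowest set bit; all `sketch_`/`SKETCH_` names are hypothetical and the layout shown corresponds to the little-endian bitfield variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the flag list in the diff. */
#define SKETCH_EXTENT_FLAGS()	\
	x(poisoned, 0)

enum sketch_extent_flag {
#define x(n, v) SKETCH_EXTENT_FLAG_##n = v,
	SKETCH_EXTENT_FLAGS()
#undef x
};

/* Flag names indexed by bit number, NULL-terminated. */
static const char * const sketch_extent_flag_strs[] = {
#define x(n, v) [SKETCH_EXTENT_FLAG_##n] = #n,
	SKETCH_EXTENT_FLAGS()
#undef x
	NULL,
};

#define SKETCH_ENTRY_flags	6	/* entry type number from the diff */
#define SKETCH_TYPE_BITS	7	/* width of the type field */

/* Build the 64-bit word: one-hot type in the low 7 bits, flags above it. */
static uint64_t sketch_flags_entry_pack(uint64_t flags)
{
	assert(flags < (UINT64_C(1) << 57));
	return (flags << SKETCH_TYPE_BITS) | (UINT64_C(1) << SKETCH_ENTRY_flags);
}

/* Assumed decoding rule: the type is the index of the lowest set bit. */
static int sketch_entry_type(uint64_t entry)
{
	for (int i = 0; i < 64; i++)
		if (entry & (UINT64_C(1) << i))
			return i;
	return -1;
}

static uint64_t sketch_flags_entry_unpack(uint64_t entry)
{
	return entry >> SKETCH_TYPE_BITS;
}
```

Because the flag bits sit entirely above the type field, no flag value can change which bit is the lowest one set, so the type stays decodable whatever flags are stored.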
+6 -5
fs/bcachefs/extents_types.h
···
 };
 
 struct extent_ptr_decoded {
-	unsigned idx;
 	bool has_ec;
+	bool do_ec_reconstruct;
+	u8 crc_retry_nr;
 	struct bch_extent_crc_unpacked crc;
 	struct bch_extent_ptr ptr;
 	struct bch_extent_stripe_ptr ec;
···
 	u8 nr;
 	struct bch_dev_io_failures {
 		u8 dev;
-		u8 idx;
-		u8 nr_failed;
-		u8 nr_retries;
-	} devs[BCH_REPLICAS_MAX];
+		unsigned failed_csum_nr:6,
+			 failed_io:1,
+			 failed_ec:1;
+	} devs[BCH_REPLICAS_MAX + 1];
 };
 
 #endif /* _BCACHEFS_EXTENTS_TYPES_H */
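The reshaped `struct bch_dev_io_failures` records what kind of failure was seen on each device (I/O error, erasure-coding reconstruct failure, or a count of checksum errors) instead of the old `idx`/`nr_failed`/`nr_retries` triple. The following sketch mirrors the bookkeeping done by the new `bch2_mark_io_failure()` using hypothetical types: the `sketch_` names are not kernel names, `SKETCH_REPLICAS_MAX` is a placeholder, and the pointer argument is reduced to a device index plus a flag saying whether the failed read was a reconstruct read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_REPLICAS_MAX 4	/* placeholder, not necessarily the kernel's value */

/* Hypothetical mirror of the per-read failure table. */
struct sketch_io_failures {
	uint8_t nr;
	struct sketch_dev_io_failures {
		uint8_t  dev;
		unsigned failed_csum_nr:6,
			 failed_io:1,
			 failed_ec:1;
	} devs[SKETCH_REPLICAS_MAX + 1];
};

/* Linear lookup of the record for @dev, or NULL if none exists yet. */
static struct sketch_dev_io_failures *
sketch_dev_io_failures(struct sketch_io_failures *failed, unsigned dev)
{
	for (unsigned i = 0; i < failed->nr; i++)
		if (failed->devs[i].dev == dev)
			return &failed->devs[i];
	return NULL;
}

/* Record one failed read attempt against @dev. */
static void sketch_mark_io_failure(struct sketch_io_failures *failed,
				   unsigned dev,
				   bool was_ec_reconstruct,
				   bool csum_error)
{
	struct sketch_dev_io_failures *f = sketch_dev_io_failures(failed, dev);

	if (!f) {
		assert(failed->nr < sizeof(failed->devs) / sizeof(failed->devs[0]));

		f = &failed->devs[failed->nr++];
		memset(f, 0, sizeof(*f));
		f->dev = dev;
	}

	if (was_ec_reconstruct)
		f->failed_ec = true;	/* reconstruct read failed */
	else if (!csum_error)
		f->failed_io = true;	/* plain I/O error */
	else
		f->failed_csum_nr++;	/* checksum mismatch: count retries */
}
```

Keeping the three failure kinds separate is what lets the read path retry checksum errors a bounded number of times while skipping a device immediately after an I/O error.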
+73 -63
fs/bcachefs/eytzinger.c
···
 	return cmp(a, b, priv);
 }
 
-static inline int eytzinger0_do_cmp(void *base, size_t n, size_t size,
+static inline int eytzinger1_do_cmp(void *base1, size_t n, size_t size,
 			 cmp_r_func_t cmp_func, const void *priv,
 			 size_t l, size_t r)
 {
-	return do_cmp(base + inorder_to_eytzinger0(l, n) * size,
-		      base + inorder_to_eytzinger0(r, n) * size,
+	return do_cmp(base1 + inorder_to_eytzinger1(l, n) * size,
+		      base1 + inorder_to_eytzinger1(r, n) * size,
 		      cmp_func, priv);
 }
 
-static inline void eytzinger0_do_swap(void *base, size_t n, size_t size,
+static inline void eytzinger1_do_swap(void *base1, size_t n, size_t size,
 			   swap_r_func_t swap_func, const void *priv,
 			   size_t l, size_t r)
 {
-	do_swap(base + inorder_to_eytzinger0(l, n) * size,
-		base + inorder_to_eytzinger0(r, n) * size,
+	do_swap(base1 + inorder_to_eytzinger1(l, n) * size,
+		base1 + inorder_to_eytzinger1(r, n) * size,
 		size, swap_func, priv);
+}
+
+static void eytzinger1_sort_r(void *base1, size_t n, size_t size,
+			      cmp_r_func_t cmp_func,
+			      swap_r_func_t swap_func,
+			      const void *priv)
+{
+	unsigned i, j, k;
+
+	/* called from 'sort' without swap function, let's pick the default */
+	if (swap_func == SWAP_WRAPPER && !((struct wrapper *)priv)->swap_func)
+		swap_func = NULL;
+
+	if (!swap_func) {
+		if (is_aligned(base1, size, 8))
+			swap_func = SWAP_WORDS_64;
+		else if (is_aligned(base1, size, 4))
+			swap_func = SWAP_WORDS_32;
+		else
+			swap_func = SWAP_BYTES;
+	}
+
+	/* heapify */
+	for (i = n / 2; i >= 1; --i) {
+		/* Find the sift-down path all the way to the leaves. */
+		for (j = i; k = j * 2, k < n;)
+			j = eytzinger1_do_cmp(base1, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1;
+
+		/* Special case for the last leaf with no sibling. */
+		if (j * 2 == n)
+			j *= 2;
+
+		/* Backtrack to the correct location. */
+		while (j != i && eytzinger1_do_cmp(base1, n, size, cmp_func, priv, i, j) >= 0)
+			j /= 2;
+
+		/* Shift the element into its correct place. */
+		for (k = j; j != i;) {
+			j /= 2;
+			eytzinger1_do_swap(base1, n, size, swap_func, priv, j, k);
+		}
+	}
+
+	/* sort */
+	for (i = n; i > 1; --i) {
+		eytzinger1_do_swap(base1, n, size, swap_func, priv, 1, i);
+
+		/* Find the sift-down path all the way to the leaves. */
+		for (j = 1; k = j * 2, k + 1 < i;)
+			j = eytzinger1_do_cmp(base1, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1;
+
+		/* Special case for the last leaf with no sibling. */
+		if (j * 2 + 1 == i)
+			j *= 2;
+
+		/* Backtrack to the correct location. */
+		while (j >= 1 && eytzinger1_do_cmp(base1, n, size, cmp_func, priv, 1, j) >= 0)
+			j /= 2;
+
+		/* Shift the element into its correct place. */
+		for (k = j; j > 1;) {
+			j /= 2;
+			eytzinger1_do_swap(base1, n, size, swap_func, priv, j, k);
+		}
+	}
 }
 
 void eytzinger0_sort_r(void *base, size_t n, size_t size,
···
 		       swap_r_func_t swap_func,
 		       const void *priv)
 {
-	int i, j, k;
+	void *base1 = base - size;
 
-	/* called from 'sort' without swap function, let's pick the default */
-	if (swap_func == SWAP_WRAPPER && !((struct wrapper *)priv)->swap_func)
-		swap_func = NULL;
-
-	if (!swap_func) {
-		if (is_aligned(base, size, 8))
-			swap_func = SWAP_WORDS_64;
-		else if (is_aligned(base, size, 4))
-			swap_func = SWAP_WORDS_32;
-		else
-			swap_func = SWAP_BYTES;
-	}
-
-	/* heapify */
-	for (i = n / 2 - 1; i >= 0; --i) {
-		/* Find the sift-down path all the way to the leaves. */
-		for (j = i; k = j * 2 + 1, k + 1 < n;)
-			j = eytzinger0_do_cmp(base, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1;
-
-		/* Special case for the last leaf with no sibling. */
-		if (j * 2 + 2 == n)
-			j = j * 2 + 1;
-
-		/* Backtrack to the correct location. */
-		while (j != i && eytzinger0_do_cmp(base, n, size, cmp_func, priv, i, j) >= 0)
-			j = (j - 1) / 2;
-
-		/* Shift the element into its correct place. */
-		for (k = j; j != i;) {
-			j = (j - 1) / 2;
-			eytzinger0_do_swap(base, n, size, swap_func, priv, j, k);
-		}
-	}
-
-	/* sort */
-	for (i = n - 1; i > 0; --i) {
-		eytzinger0_do_swap(base, n, size, swap_func, priv, 0, i);
-
-		/* Find the sift-down path all the way to the leaves. */
-		for (j = 0; k = j * 2 + 1, k + 1 < i;)
-			j = eytzinger0_do_cmp(base, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1;
-
-		/* Special case for the last leaf with no sibling. */
-		if (j * 2 + 2 == i)
-			j = j * 2 + 1;
-
-		/* Backtrack to the correct location. */
-		while (j && eytzinger0_do_cmp(base, n, size, cmp_func, priv, 0, j) >= 0)
-			j = (j - 1) / 2;
-
-		/* Shift the element into its correct place. */
-		for (k = j; j;) {
-			j = (j - 1) / 2;
-			eytzinger0_do_swap(base, n, size, swap_func, priv, j, k);
-		}
-	}
+	return eytzinger1_sort_r(base1, n, size, cmp_func, swap_func, priv);
 }
 
 void eytzinger0_sort(void *base, size_t n, size_t size,
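This file moves the sort onto 1-based indices, where the children of node `i` are `2 * i` and `2 * i + 1` and the parent is `i / 2`, and keeps the 0-based entry point as a thin wrapper that shifts the base pointer back by one element. The sketch below shows the same 1-based index arithmetic on a deliberately simpler problem: a plain top-down heapsort of `int` in ordinary array order, not the Eytzinger-order, bottom-up sort from the diff. The `sketch_` names and the `SKETCH_AT` accessor are hypothetical; the accessor subtracts one from the index rather than forming a pointer before the start of the array, which ISO C does not allow outside the kernel's dialect.

```c
#include <assert.h>
#include <stddef.h>

/* 1-based accessor: logical element i (1..n) lives at a[i - 1]. */
#define SKETCH_AT(a, i) ((a)[(i) - 1])

static void sketch_swap(int *x, int *y)
{
	int tmp = *x;
	*x = *y;
	*y = tmp;
}

/* Restore the max-heap property below node @i in a heap of @n elements. */
static void sketch_sift_down(int *a, size_t i, size_t n)
{
	for (;;) {
		size_t c = 2 * i;	/* left child in 1-based numbering */

		if (c > n)
			break;
		if (c + 1 <= n && SKETCH_AT(a, c + 1) > SKETCH_AT(a, c))
			c++;		/* right child is larger */
		if (SKETCH_AT(a, i) >= SKETCH_AT(a, c))
			break;

		sketch_swap(&SKETCH_AT(a, i), &SKETCH_AT(a, c));
		i = c;
	}
}

/* Sort @a[0..n) ascending using 1-based heap indices internally. */
static void sketch_heapsort1(int *a, size_t n)
{
	assert(a || !n);

	/* heapify: nodes n/2 .. 1 are exactly the internal nodes */
	for (size_t i = n / 2; i >= 1; i--)
		sketch_sift_down(a, i, n);

	/* sort: move the max to the end, shrink the heap, repair the root */
	for (size_t i = n; i > 1; i--) {
		sketch_swap(&SKETCH_AT(a, 1), &SKETCH_AT(a, i));
		sketch_sift_down(a, 1, i - 1);
	}
}
```

Compared with the removed 0-based loops, which needed `j * 2 + 1` and `(j - 1) / 2`, the 1-based form keeps the child and parent steps to a single shift each, which is the simplification the diff is after.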
+37 -56
fs/bcachefs/eytzinger.h
···
 #include <linux/log2.h>

 #ifdef EYTZINGER_DEBUG
+#include <linux/bug.h>
 #define EYTZINGER_BUG_ON(cond) BUG_ON(cond)
 #else
 #define EYTZINGER_BUG_ON(cond)
···
 	return rounddown_pow_of_two(size + 1) - 1;
 }

-/*
- * eytzinger1_next() and eytzinger1_prev() have the nice properties that
- *
- * eytzinger1_next(0) == eytzinger1_first())
- * eytzinger1_prev(0) == eytzinger1_last())
- *
- * eytzinger1_prev(eytzinger1_first()) == 0
- * eytzinger1_next(eytzinger1_last()) == 0
- */
-
 static inline unsigned eytzinger1_next(unsigned i, unsigned size)
 {
-	EYTZINGER_BUG_ON(i > size);
+	EYTZINGER_BUG_ON(i == 0 || i > size);

 	if (eytzinger1_right_child(i) <= size) {
 		i = eytzinger1_right_child(i);

-		i <<= __fls(size + 1) - __fls(i);
+		i <<= __fls(size) - __fls(i);
 		i >>= i > size;
 	} else {
 		i >>= ffz(i) + 1;
···

 static inline unsigned eytzinger1_prev(unsigned i, unsigned size)
 {
-	EYTZINGER_BUG_ON(i > size);
+	EYTZINGER_BUG_ON(i == 0 || i > size);

 	if (eytzinger1_left_child(i) <= size) {
 		i = eytzinger1_left_child(i) + 1;

-		i <<= __fls(size + 1) - __fls(i);
+		i <<= __fls(size) - __fls(i);
 		i -= 1;
 		i >>= i > size;
 	} else {
···
 	     (_i) != -1; \
 	     (_i) = eytzinger0_next((_i), (_size)))

+#define eytzinger0_for_each_prev(_i, _size) \
+	for (unsigned (_i) = eytzinger0_last((_size)); \
+	     (_i) != -1; \
+	     (_i) = eytzinger0_prev((_i), (_size)))
+
 /* return greatest node <= @search, or -1 if not found */
 static inline int eytzinger0_find_le(void *base, size_t nr, size_t size,
 				     cmp_func_t cmp, const void *search)
 {
-	unsigned i, n = 0;
+	void *base1 = base - size;
+	unsigned n = 1;

-	if (!nr)
-		return -1;
-
-	do {
-		i = n;
-		n = eytzinger0_child(i, cmp(base + i * size, search) <= 0);
-	} while (n < nr);
-
-	if (n & 1) {
-		/*
-		 * @i was greater than @search, return previous node:
-		 *
-		 * if @i was leftmost/smallest element,
-		 * eytzinger0_prev(eytzinger0_first())) returns -1, as expected
-		 */
-		return eytzinger0_prev(i, nr);
-	} else {
-		return i;
-	}
+	while (n <= nr)
+		n = eytzinger1_child(n, cmp(base1 + n * size, search) <= 0);
+	n >>= __ffs(n) + 1;
+	return n - 1;
 }

+/* return smallest node > @search, or -1 if not found */
 static inline int eytzinger0_find_gt(void *base, size_t nr, size_t size,
 				     cmp_func_t cmp, const void *search)
 {
-	ssize_t idx = eytzinger0_find_le(base, nr, size, cmp, search);
+	void *base1 = base - size;
+	unsigned n = 1;

-	/*
-	 * if eytitzinger0_find_le() returned -1 - no element was <= search - we
-	 * want to return the first element; next/prev identities mean this work
-	 * as expected
-	 *
-	 * similarly if find_le() returns last element, we should return -1;
-	 * identities mean this all works out:
-	 */
-	return eytzinger0_next(idx, nr);
+	while (n <= nr)
+		n = eytzinger1_child(n, cmp(base1 + n * size, search) <= 0);
+	n >>= __ffs(n + 1) + 1;
+	return n - 1;
 }

+/* return smallest node >= @search, or -1 if not found */
 static inline int eytzinger0_find_ge(void *base, size_t nr, size_t size,
 				     cmp_func_t cmp, const void *search)
 {
-	ssize_t idx = eytzinger0_find_le(base, nr, size, cmp, search);
+	void *base1 = base - size;
+	unsigned n = 1;

-	if (idx < nr && !cmp(base + idx * size, search))
-		return idx;
-
-	return eytzinger0_next(idx, nr);
+	while (n <= nr)
+		n = eytzinger1_child(n, cmp(base1 + n * size, search) < 0);
+	n >>= __ffs(n + 1) + 1;
+	return n - 1;
 }

 #define eytzinger0_find(base, nr, size, _cmp, search) \
 ({ \
-	void *_base = (base); \
+	size_t _size = (size); \
+	void *_base1 = (void *)(base) - _size; \
 	const void *_search = (search); \
 	size_t _nr = (nr); \
-	size_t _size = (size); \
-	size_t _i = 0; \
+	size_t _i = 1; \
 	int _res; \
 	\
-	while (_i < _nr && \
-	       (_res = _cmp(_search, _base + _i * _size))) \
-		_i = eytzinger0_child(_i, _res > 0); \
-	_i; \
+	while (_i <= _nr && \
+	       (_res = _cmp(_search, _base1 + _i * _size))) \
+		_i = eytzinger1_child(_i, _res > 0); \
+	_i - 1; \
 })

 void eytzinger0_sort_r(void *, size_t, size_t,
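The rewritten `eytzinger0_find_le()` in this file walks a 1-based Eytzinger array, appending one bit to the node index per comparison, then shifts off everything up to and including the lowest set bit to recover the last node at which the search turned right. Below is a minimal userspace sketch of that idea, not the kernel code itself. It assumes GCC or Clang (`__builtin_ctz` stands in for the kernel's `__ffs`), works on plain `int` arrays, and the names `eytz_build`, `eytz1_find_le`, and `demo_find_le` are hypothetical.

```c
#include <stddef.h>

/*
 * Fill out[1..n] with the Eytzinger (breadth-first) layout of sorted[0..n-1]
 * via an in-order walk of the implicit tree rooted at index k.
 * Returns the next unread index in sorted[]. Hypothetical helper.
 */
static size_t eytz_build(const int *sorted, int *out, size_t n, size_t i, size_t k)
{
	if (k <= n) {
		i = eytz_build(sorted, out, n, i, 2 * k);
		out[k] = sorted[i++];
		i = eytz_build(sorted, out, n, i, 2 * k + 1);
	}
	return i;
}

/*
 * Return the 1-based index of the greatest element <= search, or 0 if
 * every element is greater. a1[0] is unused.
 */
static unsigned eytz1_find_le(const int *a1, unsigned nr, int search)
{
	unsigned n = 1;

	/* Descend: go right (append a 1 bit) when a1[n] <= search. */
	while (n <= nr)
		n = 2 * n + (a1[n] <= search);

	/* Drop trailing left turns plus the last right turn itself. */
	n >>= __builtin_ctz(n) + 1;
	return n;
}

/* Demo wrapper over a fixed 7-element table; returns the value found or -1. */
int demo_find_le(int search)
{
	static const int sorted[7] = { 10, 20, 30, 40, 50, 60, 70 };
	int tree[8] = { 0 };
	unsigned idx;

	eytz_build(sorted, tree, 7, 0, 1);
	idx = eytz1_find_le(tree, 7, search);
	return idx ? tree[idx] : -1;
}
```

If the search never turns right, the index is a pure power of two and the shift leaves 0, which corresponds to the `-1` "not found" result of the 0-based wrapper in the diff.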
+196 -14
fs/bcachefs/fs-common.c → fs/bcachefs/namei.c
···
 #include "acl.h"
 #include "btree_update.h"
 #include "dirent.h"
-#include "fs-common.h"
 #include "inode.h"
+#include "namei.h"
 #include "subvolume.h"
 #include "xattr.h"

···
 				  BTREE_ITER_intent|BTREE_ITER_with_updates);
 	if (ret)
 		goto err;
+
+	/* Inherit casefold state from parent. */
+	if (S_ISDIR(mode))
+		new_inode->bi_flags |= dir_u->bi_flags & BCH_INODE_casefolded;

 	if (!(flags & BCH_CREATE_SNAPSHOT)) {
 		/* Normal create path - allocate a new inode: */
···
 			dir_u->bi_nlink++;
 		dir_u->bi_mtime = dir_u->bi_ctime = now;

-		ret = bch2_inode_write(trans, &dir_iter, dir_u);
-		if (ret)
-			goto err;
-
-		ret = bch2_dirent_create(trans, dir, &dir_hash,
-					 dir_type,
-					 name,
-					 dir_target,
-					 &dir_offset,
-					 STR_HASH_must_create|BTREE_ITER_with_updates);
+		ret = bch2_dirent_create(trans, dir, &dir_hash,
+					 dir_type,
+					 name,
+					 dir_target,
+					 &dir_offset,
+					 &dir_u->bi_size,
+					 STR_HASH_must_create|BTREE_ITER_with_updates) ?:
+		      bch2_inode_write(trans, &dir_iter, dir_u);
 		if (ret)
 			goto err;

···

 	ret = bch2_dirent_create(trans, dir, &dir_hash,
 				 mode_to_type(inode_u->bi_mode),
-				 name, inum.inum, &dir_offset,
+				 name, inum.inum,
+				 &dir_offset,
+				 &dir_u->bi_size,
 				 STR_HASH_must_create);
 	if (ret)
 		goto err;
···
 	}

 	ret = bch2_dirent_rename(trans,
-				 src_dir, &src_hash,
-				 dst_dir, &dst_hash,
+				 src_dir, &src_hash, &src_dir_u->bi_size,
+				 dst_dir, &dst_hash, &dst_dir_u->bi_size,
 				 src_name, &src_inum, &src_offset,
 				 dst_name, &dst_inum, &dst_offset,
 				 mode);
···
 	return ret;
 }

+/* inum_to_path */
+
 static inline void prt_bytes_reversed(struct printbuf *out, const void *b, unsigned n)
 {
 	bch2_printbuf_make_room(out, n);
···

 	prt_str_reversed(path, "(disconnected)");
 	goto out;
+}
+
+/* fsck */
+
+static int bch2_check_dirent_inode_dirent(struct btree_trans *trans,
+					  struct bkey_s_c_dirent d,
+					  struct bch_inode_unpacked *target,
+					  bool in_fsck)
+{
+	struct bch_fs *c = trans->c;
+	struct printbuf buf = PRINTBUF;
+	struct btree_iter bp_iter = { NULL };
+	int ret = 0;
+
+	if (inode_points_to_dirent(target, d))
+		return 0;
+
+	if (!target->bi_dir &&
+	    !target->bi_dir_offset) {
+		fsck_err_on(S_ISDIR(target->bi_mode),
+			    trans, inode_dir_missing_backpointer,
+			    "directory with missing backpointer\n%s",
+			    (printbuf_reset(&buf),
+			     bch2_bkey_val_to_text(&buf, c, d.s_c),
+			     prt_printf(&buf, "\n"),
+			     bch2_inode_unpacked_to_text(&buf, target),
+			     buf.buf));
+
+		fsck_err_on(target->bi_flags & BCH_INODE_unlinked,
+			    trans, inode_unlinked_but_has_dirent,
+			    "inode unlinked but has dirent\n%s",
+			    (printbuf_reset(&buf),
+			     bch2_bkey_val_to_text(&buf, c, d.s_c),
+			     prt_printf(&buf, "\n"),
+			     bch2_inode_unpacked_to_text(&buf, target),
+			     buf.buf));
+
+		target->bi_flags &= ~BCH_INODE_unlinked;
+		target->bi_dir = d.k->p.inode;
+		target->bi_dir_offset = d.k->p.offset;
+		return __bch2_fsck_write_inode(trans, target);
+	}
+
+	if (bch2_inode_should_have_single_bp(target) &&
+	    !fsck_err(trans, inode_wrong_backpointer,
+		      "dirent points to inode that does not point back:\n %s",
+		      (bch2_bkey_val_to_text(&buf, c, d.s_c),
+		       prt_printf(&buf, "\n "),
+		       bch2_inode_unpacked_to_text(&buf, target),
+		       buf.buf)))
+		goto err;
+
+	struct bkey_s_c_dirent bp_dirent =
+		bch2_bkey_get_iter_typed(trans, &bp_iter, BTREE_ID_dirents,
+					 SPOS(target->bi_dir, target->bi_dir_offset, target->bi_snapshot),
+					 0, dirent);
+	ret = bkey_err(bp_dirent);
+	if (ret && !bch2_err_matches(ret, ENOENT))
+		goto err;
+
+	bool backpointer_exists = !ret;
+	ret = 0;
+
+	if (!backpointer_exists) {
+		if (fsck_err(trans, inode_wrong_backpointer,
+			     "inode %llu:%u has wrong backpointer:\n"
+			     "got %llu:%llu\n"
+			     "should be %llu:%llu",
+			     target->bi_inum, target->bi_snapshot,
+			     target->bi_dir,
+			     target->bi_dir_offset,
+			     d.k->p.inode,
+			     d.k->p.offset)) {
+			target->bi_dir = d.k->p.inode;
+			target->bi_dir_offset = d.k->p.offset;
+			ret = __bch2_fsck_write_inode(trans, target);
+		}
+	} else {
+		bch2_bkey_val_to_text(&buf, c, d.s_c);
+		prt_newline(&buf);
+		bch2_bkey_val_to_text(&buf, c, bp_dirent.s_c);
+
+		if (S_ISDIR(target->bi_mode) || target->bi_subvol) {
+			/*
+			 * XXX: verify connectivity of the other dirent
+			 * up to the root before removing this one
+			 *
+			 * Additionally, bch2_lookup would need to cope with the
+			 * dirent it found being removed - or should we remove
+			 * the other one, even though the inode points to it?
+			 */
+			if (in_fsck) {
+				if (fsck_err(trans, inode_dir_multiple_links,
+					     "%s %llu:%u with multiple links\n%s",
+					     S_ISDIR(target->bi_mode) ? "directory" : "subvolume",
+					     target->bi_inum, target->bi_snapshot, buf.buf))
+					ret = bch2_fsck_remove_dirent(trans, d.k->p);
+			} else {
+				bch2_fs_inconsistent(c,
+						"%s %llu:%u with multiple links\n%s",
+						S_ISDIR(target->bi_mode) ? "directory" : "subvolume",
+						target->bi_inum, target->bi_snapshot, buf.buf);
+			}
+
+			goto out;
+		} else {
+			/*
+			 * hardlinked file with nlink 0:
+			 * We're just adjusting nlink here so check_nlinks() will pick
+			 * it up, it ignores inodes with nlink 0
+			 */
+			if (fsck_err_on(!target->bi_nlink,
+					trans, inode_multiple_links_but_nlink_0,
+					"inode %llu:%u type %s has multiple links but i_nlink 0\n%s",
+					target->bi_inum, target->bi_snapshot, bch2_d_types[d.v->d_type], buf.buf)) {
+				target->bi_nlink++;
+				target->bi_flags &= ~BCH_INODE_unlinked;
+				ret = __bch2_fsck_write_inode(trans, target);
+				if (ret)
+					goto err;
+			}
+		}
+	}
+out:
+err:
+fsck_err:
+	bch2_trans_iter_exit(trans, &bp_iter);
+	printbuf_exit(&buf);
+	bch_err_fn(c, ret);
+	return ret;
+}
+
+int __bch2_check_dirent_target(struct btree_trans *trans,
+			       struct btree_iter *dirent_iter,
+			       struct bkey_s_c_dirent d,
+			       struct bch_inode_unpacked *target,
+			       bool in_fsck)
+{
+	struct bch_fs *c = trans->c;
+	struct printbuf buf = PRINTBUF;
+	int ret = 0;
+
+	ret = bch2_check_dirent_inode_dirent(trans, d, target, in_fsck);
+	if (ret)
+		goto err;
+
+	if (fsck_err_on(d.v->d_type != inode_d_type(target),
+			trans, dirent_d_type_wrong,
+			"incorrect d_type: got %s, should be %s:\n%s",
+			bch2_d_type_str(d.v->d_type),
+			bch2_d_type_str(inode_d_type(target)),
+			(printbuf_reset(&buf),
+			 bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf))) {
+		struct bkey_i_dirent *n = bch2_trans_kmalloc(trans, bkey_bytes(d.k));
+		ret = PTR_ERR_OR_ZERO(n);
+		if (ret)
+			goto err;
+
+		bkey_reassemble(&n->k_i, d.s_c);
+		n->v.d_type = inode_d_type(target);
+		if (n->v.d_type == DT_SUBVOL) {
+			n->v.d_parent_subvol = cpu_to_le32(target->bi_parent_subvol);
+			n->v.d_child_subvol = cpu_to_le32(target->bi_subvol);
+		} else {
+			n->v.d_inum = cpu_to_le64(target->bi_inum);
+		}
+
+		ret = bch2_trans_update(trans, dirent_iter, &n->k_i, 0);
+		if (ret)
+			goto err;
+	}
+err:
+fsck_err:
+	printbuf_exit(&buf);
+	bch_err_fn(c, ret);
+	return ret;
 }
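Several call sites in this file chain fallible steps with the GNU C `a ?: b` extension, which evaluates `b` only when `a` is zero, so the first non-zero error code short-circuits the rest of the chain. The sketch below shows that behaviour in isolation, assuming GCC or Clang; `step`, `run_chain`, `steps_run`, and `calls` are hypothetical names used only for illustration.

```c
static int calls;	/* how many steps actually ran in the last chain */

/* Pretend to do some work and return the given status (0 means success). */
static int step(int ret)
{
	calls++;
	return ret;
}

/* Run up to two steps; stop at the first one that returns non-zero. */
int run_chain(int first, int second)
{
	calls = 0;
	return step(first) ?: step(second);
}

/* Report how many steps the most recent run_chain() executed. */
int steps_run(void)
{
	return calls;
}
```

In the create path of the diff, this ordering means `bch2_inode_write()` runs only if `bch2_dirent_create()` returned 0.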
+28 -3
fs/bcachefs/fs-common.h → fs/bcachefs/namei.h
···
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _BCACHEFS_FS_COMMON_H
-#define _BCACHEFS_FS_COMMON_H
+#ifndef _BCACHEFS_NAMEI_H
+#define _BCACHEFS_NAMEI_H

 #include "dirent.h"

···

 int bch2_inum_to_path(struct btree_trans *, subvol_inum, struct printbuf *);

-#endif /* _BCACHEFS_FS_COMMON_H */
+int __bch2_check_dirent_target(struct btree_trans *,
+			       struct btree_iter *,
+			       struct bkey_s_c_dirent,
+			       struct bch_inode_unpacked *, bool);
+
+static inline bool inode_points_to_dirent(struct bch_inode_unpacked *inode,
+					  struct bkey_s_c_dirent d)
+{
+	return inode->bi_dir == d.k->p.inode &&
+		inode->bi_dir_offset == d.k->p.offset;
+}
+
+static inline int bch2_check_dirent_target(struct btree_trans *trans,
+					   struct btree_iter *dirent_iter,
+					   struct bkey_s_c_dirent d,
+					   struct bch_inode_unpacked *target,
+					   bool in_fsck)
+{
+	if (likely(inode_points_to_dirent(target, d) &&
+		   d.v->d_type == inode_d_type(target)))
+		return 0;
+
+	return __bch2_check_dirent_target(trans, dirent_iter, d, target, in_fsck);
+}
+
+#endif /* _BCACHEFS_NAMEI_H */
+24 -14
fs/bcachefs/fs-io-buffered.c
···
 			if (!get_more)
 				break;

+			unsigned sectors_remaining = sectors_this_extent - bio_sectors(bio);
+
+			if (sectors_remaining < PAGE_SECTORS << mapping_min_folio_order(iter->mapping))
+				break;
+
+			unsigned order = ilog2(rounddown_pow_of_two(sectors_remaining) / PAGE_SECTORS);
+
+			/* ensure proper alignment */
+			order = min(order, __ffs(folio_offset|BIT(31)));
+
 			folio = xa_load(&iter->mapping->i_pages, folio_offset);
 			if (folio && !xa_is_value(folio))
 				break;

-			folio = filemap_alloc_folio(readahead_gfp_mask(iter->mapping), 0);
+			folio = filemap_alloc_folio(readahead_gfp_mask(iter->mapping), order);
 			if (!folio)
 				break;

···
 	struct bch_fs *c = trans->c;
 	struct btree_iter iter;
 	struct bkey_buf sk;
-	int flags = BCH_READ_RETRY_IF_STALE|
-		BCH_READ_MAY_PROMOTE;
+	int flags = BCH_READ_retry_if_stale|
+		BCH_READ_may_promote;
 	int ret = 0;

-	rbio->c = c;
-	rbio->start_time = local_clock();
 	rbio->subvol = inum.subvol;

 	bch2_bkey_buf_init(&sk);
···
 		swap(rbio->bio.bi_iter.bi_size, bytes);

 		if (rbio->bio.bi_iter.bi_size == bytes)
-			flags |= BCH_READ_LAST_FRAGMENT;
+			flags |= BCH_READ_last_fragment;

 		bch2_bio_page_state_set(&rbio->bio, k);

 		bch2_read_extent(trans, rbio, iter.pos,
 				 data_btree, k, offset_into_extent, flags);

-		if (flags & BCH_READ_LAST_FRAGMENT)
+		if (flags & BCH_READ_last_fragment)
 			break;

 		swap(rbio->bio.bi_iter.bi_size, bytes);
···

 	if (ret) {
 		struct printbuf buf = PRINTBUF;
-		bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter.pos.offset << 9);
+		lockrestart_do(trans,
+			bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter.pos.offset << 9));
 		prt_printf(&buf, "read error %i from btree lookup", ret);
 		bch_err_ratelimited(c, "%s", buf.buf);
 		printbuf_exit(&buf);
···
 		struct bch_read_bio *rbio =
 			rbio_init(bio_alloc_bioset(NULL, n, REQ_OP_READ,
 						   GFP_KERNEL, &c->bio_read),
-				  opts);
+				  c,
+				  opts,
+				  bch2_readpages_end_io);

 		readpage_iter_advance(&readpages_iter);

 		rbio->bio.bi_iter.bi_sector = folio_sector(folio);
-		rbio->bio.bi_end_io = bch2_readpages_end_io;
 		BUG_ON(!bio_add_folio(&rbio->bio, folio, folio_size(folio), 0));

 		bchfs_read(trans, rbio, inode_inum(inode),
···
 	bch2_inode_opts_get(&opts, c, &inode->ei_inode);

 	rbio = rbio_init(bio_alloc_bioset(NULL, 1, REQ_OP_READ, GFP_KERNEL, &c->bio_read),
-			 opts);
+			 c,
+			 opts,
+			 bch2_read_single_folio_end_io);
 	rbio->bio.bi_private = &done;
-	rbio->bio.bi_end_io = bch2_read_single_folio_end_io;
-
 	rbio->bio.bi_opf = REQ_OP_READ|REQ_SYNC;
 	rbio->bio.bi_iter.bi_sector = folio_sector(folio);
 	BUG_ON(!bio_add_folio(&rbio->bio, folio, folio_size(folio), 0));
···
 		}
 	}

-	if (io->op.flags & BCH_WRITE_WROTE_DATA_INLINE) {
+	if (io->op.flags & BCH_WRITE_wrote_data_inline) {
 		bio_for_each_folio_all(fi, bio) {
 			struct bch_folio *s;

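The readahead hunk in this file sizes each newly allocated folio from the sectors left in the extent, then caps the order by the alignment of the folio's page offset. A rough userspace sketch of that arithmetic follows. It assumes 4 KiB pages and 512-byte sectors (8 sectors per page) and GCC or Clang builtins; `pick_folio_order` and `DEMO_PAGE_SECTORS` are hypothetical names, not kernel symbols.

```c
#define DEMO_PAGE_SECTORS 8u	/* assumption: 4 KiB pages, 512-byte sectors */

/*
 * Return the folio order to allocate, or -1 if fewer sectors remain than
 * the smallest folio the mapping allows (1 << min_order pages).
 */
int pick_folio_order(unsigned sectors_remaining, unsigned long folio_offset,
		     unsigned min_order)
{
	unsigned order, align;

	if (sectors_remaining < (DEMO_PAGE_SECTORS << min_order))
		return -1;

	/* floor(log2(sectors)) minus log2(8 sectors per page) */
	order = (31 - __builtin_clz(sectors_remaining)) - 3;

	/*
	 * A folio of a given order must start at a page offset aligned to
	 * that order. Bit 31 is ORed in so that offset 0 gives a finite cap.
	 */
	align = __builtin_ctzl(folio_offset | (1UL << 31));

	return order < align ? order : align;
}
```

For example, 64 remaining sectors at page offset 0 give order 3 (eight pages), while the same 64 sectors at page offset 2 are capped to order 1.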
+14 -6
fs/bcachefs/fs-io-direct.c
···
 	struct blk_plug plug;
 	loff_t offset = req->ki_pos;
 	bool sync = is_sync_kiocb(req);
+	bool split = false;
 	size_t shorten;
 	ssize_t ret;

···
 			       REQ_OP_READ,
 			       GFP_KERNEL,
 			       &c->dio_read_bioset);
-
-	bio->bi_end_io = bch2_direct_IO_read_endio;

 	dio = container_of(bio, struct dio_read, rbio.bio);
 	closure_init(&dio->cl, NULL);
···

 	goto start;
 	while (iter->count) {
+		split = true;
+
 		bio = bio_alloc_bioset(NULL,
 				       bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS),
 				       REQ_OP_READ,
 				       GFP_KERNEL,
 				       &c->bio_read);
-		bio->bi_end_io = bch2_direct_IO_read_split_endio;
 start:
 		bio->bi_opf = REQ_OP_READ|REQ_SYNC;
 		bio->bi_iter.bi_sector = offset >> 9;
···
 		if (iter->count)
 			closure_get(&dio->cl);

-		bch2_read(c, rbio_init(bio, opts), inode_inum(inode));
+		struct bch_read_bio *rbio =
+			rbio_init(bio,
+				  c,
+				  opts,
+				  split
+				  ? bch2_direct_IO_read_split_endio
+				  : bch2_direct_IO_read_endio);
+
+		bch2_read(c, rbio, inode_inum(inode));
 	}

 	blk_finish_plug(&plug);
···
 		dio->op.devs_need_flush = &inode->ei_devs_need_flush;

 		if (sync)
-			dio->op.flags |= BCH_WRITE_SYNC;
-		dio->op.flags |= BCH_WRITE_CHECK_ENOSPC;
+			dio->op.flags |= BCH_WRITE_sync;
+		dio->op.flags |= BCH_WRITE_check_enospc;

 		ret = bch2_quota_reservation_add(c, inode, &dio->quota_res,
 						 bio_sectors(bio), true);
+28 -2
fs/bcachefs/fs-ioctl.c
···
 #include "chardev.h"
 #include "dirent.h"
 #include "fs.h"
-#include "fs-common.h"
 #include "fs-ioctl.h"
+#include "namei.h"
 #include "quota.h"

 #include <linux/compat.h>
···
 	    !S_ISDIR(bi->bi_mode) &&
 	    (newflags & (BCH_INODE_nodump|BCH_INODE_noatime)) != newflags)
 		return -EINVAL;
+
+	if ((newflags ^ oldflags) & BCH_INODE_casefolded) {
+#ifdef CONFIG_UNICODE
+		int ret = 0;
+		/* Not supported on individual files. */
+		if (!S_ISDIR(bi->bi_mode))
+			return -EOPNOTSUPP;
+
+		/*
+		 * Make sure the dir is empty, as otherwise we'd need to
+		 * rehash everything and update the dirent keys.
+		 */
+		ret = bch2_empty_dir_trans(trans, inode_inum(inode));
+		if (ret < 0)
+			return ret;
+
+		ret = bch2_request_incompat_feature(c,bcachefs_metadata_version_casefolding);
+		if (ret)
+			return ret;
+
+		bch2_check_set_feature(c, BCH_FEATURE_casefolding);
+#else
+		printk(KERN_ERR "Cannot use casefolding on a kernel without CONFIG_UNICODE\n");
+		return -EOPNOTSUPP;
+#endif
+	}

 	if (s->set_projinherit) {
 		bi->bi_fields_set &= ~(1 << Inode_opt_project);
···
 	int ret = 0;
 	subvol_inum inum;

-	kname = kmalloc(BCH_NAME_MAX + 1, GFP_KERNEL);
+	kname = kmalloc(BCH_NAME_MAX, GFP_KERNEL);
 	if (!kname)
 		return -ENOMEM;

+11 -9
fs/bcachefs/fs-ioctl.h
···

 /* bcachefs inode flags -> vfs inode flags: */
 static const __maybe_unused unsigned bch_flags_to_vfs[] = {
-	[__BCH_INODE_sync]	= S_SYNC,
-	[__BCH_INODE_immutable]	= S_IMMUTABLE,
-	[__BCH_INODE_append]	= S_APPEND,
-	[__BCH_INODE_noatime]	= S_NOATIME,
+	[__BCH_INODE_sync]		= S_SYNC,
+	[__BCH_INODE_immutable]		= S_IMMUTABLE,
+	[__BCH_INODE_append]		= S_APPEND,
+	[__BCH_INODE_noatime]		= S_NOATIME,
+	[__BCH_INODE_casefolded]	= S_CASEFOLD,
 };

 /* bcachefs inode flags -> FS_IOC_GETFLAGS: */
 static const __maybe_unused unsigned bch_flags_to_uflags[] = {
-	[__BCH_INODE_sync]	= FS_SYNC_FL,
-	[__BCH_INODE_immutable]	= FS_IMMUTABLE_FL,
-	[__BCH_INODE_append]	= FS_APPEND_FL,
-	[__BCH_INODE_nodump]	= FS_NODUMP_FL,
-	[__BCH_INODE_noatime]	= FS_NOATIME_FL,
+	[__BCH_INODE_sync]		= FS_SYNC_FL,
+	[__BCH_INODE_immutable]		= FS_IMMUTABLE_FL,
+	[__BCH_INODE_append]		= FS_APPEND_FL,
+	[__BCH_INODE_nodump]		= FS_NODUMP_FL,
+	[__BCH_INODE_noatime]		= FS_NOATIME_FL,
+	[__BCH_INODE_casefolded]	= FS_CASEFOLD_FL,
 };

 /* bcachefs inode flags -> FS_IOC_FSGETXATTR: */
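The tables in this header are indexed by internal flag bit number and hold the matching external flag mask, so adding a flag such as casefolding is a one-line change per table. The following standalone sketch illustrates how such a table can be applied. Every name and value in it (`demo_flag_bit`, the `DEMO_OUT_*` masks, `demo_map_flags`) is made up for illustration and does not correspond to the real kernel constants or to how bcachefs itself consumes these tables.

```c
#include <stddef.h>

/* Internal flag bit numbers (hypothetical). */
enum demo_flag_bit { DEMO_sync, DEMO_immutable, DEMO_casefolded };

/* External flag masks (hypothetical values). */
#define DEMO_OUT_SYNC		0x001u
#define DEMO_OUT_IMMUTABLE	0x010u
#define DEMO_OUT_CASEFOLD	0x400u

/* internal bit number -> external mask */
static const unsigned demo_flags_to_uflags[] = {
	[DEMO_sync]		= DEMO_OUT_SYNC,
	[DEMO_immutable]	= DEMO_OUT_IMMUTABLE,
	[DEMO_casefolded]	= DEMO_OUT_CASEFOLD,
};

/* Translate a word of internal flags; bits without a table entry are dropped. */
unsigned demo_map_flags(unsigned in)
{
	unsigned out = 0;
	size_t i;

	for (i = 0; i < sizeof(demo_flags_to_uflags) / sizeof(demo_flags_to_uflags[0]); i++)
		if (in & (1u << i))
			out |= demo_flags_to_uflags[i];
	return out;
}
```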
+80 -59
fs/bcachefs/fs.c
···
 #include "errcode.h"
 #include "extents.h"
 #include "fs.h"
-#include "fs-common.h"
 #include "fs-io.h"
 #include "fs-ioctl.h"
 #include "fs-io-buffered.h"
···
 #include "io_read.h"
 #include "journal.h"
 #include "keylist.h"
+#include "namei.h"
 #include "quota.h"
 #include "rebalance.h"
 #include "snapshot.h"
···
 	if (ret)
 		return ERR_PTR(ret);

-	ret = bch2_dirent_read_target(trans, dir, bkey_s_c_to_dirent(k), &inum);
+	struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
+
+	ret = bch2_dirent_read_target(trans, dir, d, &inum);
 	if (ret > 0)
 		ret = -ENOENT;
 	if (ret)
···
 	if (inode)
 		goto out;

+	/*
+	 * Note: if check/repair needs it, we commit before
+	 * bch2_inode_hash_init_insert(), as after that point we can't take a
+	 * restart - not in the top level loop with a commit_do(), like we
+	 * usually do:
+	 */
+
 	struct bch_subvolume subvol;
 	struct bch_inode_unpacked inode_u;
 	ret = bch2_subvolume_get(trans, inum.subvol, true, &subvol) ?:
 		bch2_inode_find_by_inum_nowarn_trans(trans, inum, &inode_u) ?:
+		bch2_check_dirent_target(trans, &dirent_iter, d, &inode_u, false) ?:
+		bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc) ?:
 		PTR_ERR_OR_ZERO(inode = bch2_inode_hash_init_insert(trans, inum, &inode_u, &subvol));

+	/*
+	 * don't remove it: check_inodes might find another inode that points
+	 * back to this dirent
+	 */
 	bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT),
 				c, "dirent to missing inode:\n %s",
-				(bch2_bkey_val_to_text(&buf, c, k), buf.buf));
+				(bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf));
 	if (ret)
 		goto err;
-
-	/* regular files may have hardlinks: */
-	if (bch2_fs_inconsistent_on(bch2_inode_should_have_single_bp(&inode_u) &&
-				    !bkey_eq(k.k->p, POS(inode_u.bi_dir, inode_u.bi_dir_offset)),
-				    c,
-				    "dirent points to inode that does not point back:\n %s",
-				    (bch2_bkey_val_to_text(&buf, c, k),
-				     prt_printf(&buf, "\n "),
-				     bch2_inode_unpacked_to_text(&buf, &inode_u),
-				     buf.buf))) {
-		ret = -ENOENT;
-		goto err;
-	}
 out:
 	bch2_trans_iter_exit(trans, &dirent_iter);
 	printbuf_exit(&buf);
···
 						  &hash, &dentry->d_name)));
 	if (IS_ERR(inode))
 		inode = NULL;
+
+#ifdef CONFIG_UNICODE
+	if (!inode && IS_CASEFOLDED(vdir)) {
+		/*
+		 * Do not cache a negative dentry in casefolded directories
+		 * as it would need to be invalidated in the following situation:
+		 * - Lookup file "blAH" in a casefolded directory
+		 * - Creation of file "BLAH" in a casefolded directory
+		 * - Lookup file "blAH" in a casefolded directory
+		 * which would fail if we had a negative dentry.
+		 *
+		 * We should come back to this when VFS has a method to handle
+		 * this edgecase.
+		 */
+		return NULL;
+	}
+#endif

 	return d_splice_alias(&inode->v, dentry);
 }
···
 		break;
 	}

-	mapping_set_large_folios(inode->v.i_mapping);
+	mapping_set_folio_min_order(inode->v.i_mapping,
+				    get_order(trans->c->opts.block_size));
 }

 static void bch2_free_inode(struct inode *vinode)
···
 	return c ?: ERR_PTR(-ENOENT);
 }

-static int bch2_remount(struct super_block *sb, int *flags,
-			struct bch_opts opts)
-{
-	struct bch_fs *c = sb->s_fs_info;
-	int ret = 0;
-
-	opt_set(opts, read_only, (*flags & SB_RDONLY) != 0);
-
-	if (opts.read_only != c->opts.read_only) {
-		down_write(&c->state_lock);
-
-		if (opts.read_only) {
-			bch2_fs_read_only(c);
-
-			sb->s_flags |= SB_RDONLY;
-		} else {
-			ret = bch2_fs_read_write(c);
-			if (ret) {
-				bch_err(c, "error going rw: %i", ret);
-				up_write(&c->state_lock);
-				ret = -EINVAL;
-				goto err;
-			}
-
-			sb->s_flags &= ~SB_RDONLY;
-		}
-
-		c->opts.read_only = opts.read_only;
-
-		up_write(&c->state_lock);
-	}
-
-	if (opt_defined(opts, errors))
-		c->opts.errors = opts.errors;
-err:
-	return bch2_err_class(ret);
-}
-
 static int bch2_show_devname(struct seq_file *seq, struct dentry *root)
 {
 	struct bch_fs *c = root->d_sb->s_fs_info;
···
 	if (ret)
 		goto err;

+	if (opt_defined(opts, discard))
+		set_bit(BCH_FS_discard_mount_opt_set, &c->flags);
+
 	/* Some options can't be parsed until after the fs is started: */
 	opts = bch2_opts_empty();
 	ret = bch2_parse_mount_opts(c, &opts, NULL, opts_parse->parse_later.buf);
···

 	bch2_opts_apply(&c->opts, opts);

-	ret = bch2_fs_start(c);
-	if (ret)
-		goto err_stop_fs;
+	/*
+	 * need to initialise sb and set c->vfs_sb _before_ starting fs,
+	 * for blk_holder_ops
+	 */

 	sb = sget(fc->fs_type, NULL, bch2_set_super, fc->sb_flags|SB_NOSEC, c);
 	ret = PTR_ERR_OR_ZERO(sb);
···
 #endif

 	sb->s_shrink->seeks = 0;
+
+	ret = bch2_fs_start(c);
+	if (ret)
+		goto err_put_super;

 	vinode = bch2_vfs_inode_get(c, BCACHEFS_ROOT_SUBVOL_INUM);
 	ret = PTR_ERR_OR_ZERO(vinode);
···
 {
 	struct super_block *sb = fc->root->d_sb;
 	struct bch2_opts_parse *opts = fc->fs_private;
+	struct bch_fs *c = sb->s_fs_info;
+	int ret = 0;

-	return bch2_remount(sb, &fc->sb_flags, opts->opts);
+	opt_set(opts->opts, read_only, (fc->sb_flags & SB_RDONLY) != 0);
+
+	if (opts->opts.read_only != c->opts.read_only) {
+		down_write(&c->state_lock);
+
+		if (opts->opts.read_only) {
+			bch2_fs_read_only(c);
+
+			sb->s_flags |= SB_RDONLY;
+		} else {
+			ret = bch2_fs_read_write(c);
+			if (ret) {
+				bch_err(c, "error going rw: %i", ret);
+				up_write(&c->state_lock);
+				ret = -EINVAL;
+				goto err;
+			}
+
+			sb->s_flags &= ~SB_RDONLY;
+		}
+
+		c->opts.read_only = opts->opts.read_only;
+
+		up_write(&c->state_lock);
+	}
+
+	if (opt_defined(opts->opts, errors))
+		c->opts.errors = opts->opts.errors;
+err:
+	return bch2_err_class(ret);
 }

 static const struct fs_context_operations bch2_context_ops = {
+6 -225
fs/bcachefs/fsck.c
···
 #include "dirent.h"
 #include "error.h"
 #include "fs.h"
-#include "fs-common.h"
 #include "fsck.h"
 #include "inode.h"
 #include "keylist.h"
+#include "namei.h"
 #include "recovery_passes.h"
 #include "snapshot.h"
 #include "super.h"
···

 #include <linux/bsearch.h>
 #include <linux/dcache.h> /* struct qstr */
-
-static bool inode_points_to_dirent(struct bch_inode_unpacked *inode,
-				   struct bkey_s_c_dirent d)
-{
-	return inode->bi_dir == d.k->p.inode &&
-		inode->bi_dir_offset == d.k->p.offset;
-}

 static int dirent_points_to_inode_nowarn(struct bkey_s_c_dirent d,
 					 struct bch_inode_unpacked *inode)
···
 	return ret;
 }

-static int lookup_first_inode(struct btree_trans *trans, u64 inode_nr,
-			      struct bch_inode_unpacked *inode)
-{
-	struct btree_iter iter;
-	struct bkey_s_c k;
-	int ret;
-
-	for_each_btree_key_norestart(trans, iter, BTREE_ID_inodes, POS(0, inode_nr),
-				     BTREE_ITER_all_snapshots, k, ret) {
-		if (k.k->p.offset != inode_nr)
-			break;
-		if (!bkey_is_inode(k.k))
-			continue;
-		ret = bch2_inode_unpack(k, inode);
-		goto found;
-	}
-	ret = -BCH_ERR_ENOENT_inode;
-found:
-	bch_err_msg(trans->c, ret, "fetching inode %llu", inode_nr);
-	bch2_trans_iter_exit(trans, &iter);
-	return ret;
-}
-
 static int lookup_inode(struct btree_trans *trans, u64 inode_nr, u32 snapshot,
 			struct bch_inode_unpacked *inode)
 {
···
 	*type = d.v->d_type;
 	bch2_trans_iter_exit(trans, &iter);
 	return 0;
-}
-
-static int __remove_dirent(struct btree_trans *trans, struct bpos pos)
-{
-	struct bch_fs *c = trans->c;
-	struct btree_iter iter;
-	struct bch_inode_unpacked dir_inode;
-	struct bch_hash_info dir_hash_info;
-	int ret;
-
-	ret = lookup_first_inode(trans, pos.inode, &dir_inode);
-	if (ret)
-		goto err;
-
-	dir_hash_info = bch2_hash_info_init(c, &dir_inode);
-
-	bch2_trans_iter_init(trans, &iter, BTREE_ID_dirents, pos, BTREE_ITER_intent);
-
-	ret = bch2_btree_iter_traverse(&iter) ?:
-		bch2_hash_delete_at(trans, bch2_dirent_hash_desc,
-				    &dir_hash_info, &iter,
-				    BTREE_UPDATE_internal_snapshot_node);
-	bch2_trans_iter_exit(trans, &iter);
-err:
-	bch_err_fn(c, ret);
-	return ret;
 }

 /*
···
 				SPOS(inode->bi_dir, inode->bi_dir_offset, inode->bi_snapshot));
 	int ret = bkey_err(d) ?:
 		dirent_points_to_inode(c, d, inode) ?:
-		__remove_dirent(trans, d.k->p);
+		bch2_fsck_remove_dirent(trans, d.k->p);
 	bch2_trans_iter_exit(trans, &iter);
 	return ret;
 }
···
 		trans_was_restarted(trans, restart_count);
 }

-noinline_for_stack
-static int check_dirent_inode_dirent(struct btree_trans *trans,
-				     struct btree_iter *iter,
-				     struct bkey_s_c_dirent d,
-				     struct bch_inode_unpacked *target)
-{
-	struct bch_fs *c = trans->c;
-	struct printbuf buf = PRINTBUF;
-	struct btree_iter bp_iter = { NULL };
-	int ret = 0;
-
-	if (inode_points_to_dirent(target, d))
-		return 0;
-
-	if (!target->bi_dir &&
-	    !target->bi_dir_offset) {
-		fsck_err_on(S_ISDIR(target->bi_mode),
-			    trans, inode_dir_missing_backpointer,
-			    "directory with missing backpointer\n%s",
-			    (printbuf_reset(&buf),
-			     bch2_bkey_val_to_text(&buf, c, d.s_c),
-			     prt_printf(&buf, "\n"),
-			     bch2_inode_unpacked_to_text(&buf, target),
-			     buf.buf));
-
-		fsck_err_on(target->bi_flags & BCH_INODE_unlinked,
-			    trans, inode_unlinked_but_has_dirent,
-			    "inode unlinked but has dirent\n%s",
-			    (printbuf_reset(&buf),
-			     bch2_bkey_val_to_text(&buf, c, d.s_c),
-			     prt_printf(&buf, "\n"),
-			     bch2_inode_unpacked_to_text(&buf, target),
-			     buf.buf));
-
-		target->bi_flags &= ~BCH_INODE_unlinked;
-		target->bi_dir = d.k->p.inode;
-		target->bi_dir_offset = d.k->p.offset;
-		return __bch2_fsck_write_inode(trans, target);
-	}
-
-	if (bch2_inode_should_have_single_bp(target) &&
-	    !fsck_err(trans, inode_wrong_backpointer,
-		      "dirent points to inode that does not point back:\n %s",
-		      (bch2_bkey_val_to_text(&buf, c, d.s_c),
-		       prt_printf(&buf, "\n "),
-		       bch2_inode_unpacked_to_text(&buf, target),
-		       buf.buf)))
-		goto err;
-
-	struct bkey_s_c_dirent bp_dirent = dirent_get_by_pos(trans, &bp_iter,
-			SPOS(target->bi_dir, target->bi_dir_offset, target->bi_snapshot));
-	ret = bkey_err(bp_dirent);
-	if (ret && !bch2_err_matches(ret, ENOENT))
-		goto err;
-
-	bool backpointer_exists = !ret;
-	ret = 0;
-
-	if (fsck_err_on(!backpointer_exists,
-			trans, inode_wrong_backpointer,
-			"inode %llu:%u has wrong backpointer:\n"
-			"got %llu:%llu\n"
-			"should be %llu:%llu",
-			target->bi_inum, target->bi_snapshot,
-			target->bi_dir,
-			target->bi_dir_offset,
-			d.k->p.inode,
-			d.k->p.offset)) {
-		target->bi_dir = d.k->p.inode;
-		target->bi_dir_offset = d.k->p.offset;
-		ret = __bch2_fsck_write_inode(trans, target);
-		goto out;
-	}
-
-	bch2_bkey_val_to_text(&buf, c, d.s_c);
-	prt_newline(&buf);
-	if (backpointer_exists)
-		bch2_bkey_val_to_text(&buf, c, bp_dirent.s_c);
-
-	if (fsck_err_on(backpointer_exists &&
-			(S_ISDIR(target->bi_mode) ||
-			 target->bi_subvol),
-			trans, inode_dir_multiple_links,
-			"%s %llu:%u with multiple links\n%s",
-			S_ISDIR(target->bi_mode) ? "directory" : "subvolume",
-			target->bi_inum, target->bi_snapshot, buf.buf)) {
-		ret = __remove_dirent(trans, d.k->p);
-		goto out;
-	}
-
-	/*
-	 * hardlinked file with nlink 0:
-	 * We're just adjusting nlink here so check_nlinks() will pick
-	 * it up, it ignores inodes with nlink 0
-	 */
-	if (fsck_err_on(backpointer_exists && !target->bi_nlink,
-			trans, inode_multiple_links_but_nlink_0,
-			"inode %llu:%u type %s has multiple links but i_nlink 0\n%s",
-			target->bi_inum, target->bi_snapshot, bch2_d_types[d.v->d_type], buf.buf)) {
-		target->bi_nlink++;
-		target->bi_flags &= ~BCH_INODE_unlinked;
-		ret = __bch2_fsck_write_inode(trans, target);
-		if (ret)
-			goto err;
-	}
-out:
-err:
-fsck_err:
-	bch2_trans_iter_exit(trans, &bp_iter);
-	printbuf_exit(&buf);
-	bch_err_fn(c, ret);
-	return ret;
-}
-
-noinline_for_stack
-static int check_dirent_target(struct btree_trans *trans,
-			       struct btree_iter *iter,
-			       struct bkey_s_c_dirent d,
-			       struct bch_inode_unpacked *target)
-{
-	struct bch_fs *c = trans->c;
-	struct bkey_i_dirent *n;
-	struct printbuf buf = PRINTBUF;
-	int ret = 0;
-
-	ret = check_dirent_inode_dirent(trans, iter, d, target);
-	if (ret)
-		goto err;
-
-	if (fsck_err_on(d.v->d_type != inode_d_type(target),
-			trans, dirent_d_type_wrong,
-			"incorrect d_type: got %s, should be %s:\n%s",
-			bch2_d_type_str(d.v->d_type),
-			bch2_d_type_str(inode_d_type(target)),
-			(printbuf_reset(&buf),
-			 bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf))) {
-		n = bch2_trans_kmalloc(trans, bkey_bytes(d.k));
-		ret = PTR_ERR_OR_ZERO(n);
-		if (ret)
-			goto err;
-
-		bkey_reassemble(&n->k_i, d.s_c);
-		n->v.d_type = inode_d_type(target);
-		if (n->v.d_type == DT_SUBVOL) {
-			n->v.d_parent_subvol =
cpu_to_le32(target->bi_parent_subvol); 2077 - n->v.d_child_subvol = cpu_to_le32(target->bi_subvol); 2078 - } else { 2079 - n->v.d_inum = cpu_to_le64(target->bi_inum); 2080 - } 2081 - 2082 - ret = bch2_trans_update(trans, iter, &n->k_i, 0); 2083 - if (ret) 2084 - goto err; 2085 - 2086 - d = dirent_i_to_s_c(n); 2087 - } 2088 - err: 2089 - fsck_err: 2090 - printbuf_exit(&buf); 2091 - bch_err_fn(c, ret); 2092 - return ret; 2093 - } 2094 - 2095 1988 /* find a subvolume that's a descendent of @snapshot: */ 2096 1989 static int find_snapshot_subvol(struct btree_trans *trans, u32 snapshot, u32 *subvolid) 2097 1990 { ··· 2028 2247 if (fsck_err(trans, dirent_to_missing_subvol, 2029 2248 "dirent points to missing subvolume\n%s", 2030 2249 (bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf))) 2031 - return __remove_dirent(trans, d.k->p); 2250 + return bch2_fsck_remove_dirent(trans, d.k->p); 2032 2251 ret = 0; 2033 2252 goto out; 2034 2253 } ··· 2072 2291 goto err; 2073 2292 } 2074 2293 2075 - ret = check_dirent_target(trans, iter, d, &subvol_root); 2294 + ret = bch2_check_dirent_target(trans, iter, d, &subvol_root, true); 2076 2295 if (ret) 2077 2296 goto err; 2078 2297 out: ··· 2159 2378 (printbuf_reset(&buf), 2160 2379 bch2_bkey_val_to_text(&buf, c, k), 2161 2380 buf.buf))) { 2162 - ret = __remove_dirent(trans, d.k->p); 2381 + ret = bch2_fsck_remove_dirent(trans, d.k->p); 2163 2382 if (ret) 2164 2383 goto err; 2165 2384 } 2166 2385 2167 2386 darray_for_each(target->inodes, i) { 2168 - ret = check_dirent_target(trans, iter, d, &i->inode); 2387 + ret = bch2_check_dirent_target(trans, iter, d, &i->inode, true); 2169 2388 if (ret) 2170 2389 goto err; 2171 2390 }
+5 -19
fs/bcachefs/inode.c
···
 		bkey_s_to_inode_v3(new).v->bi_journal_seq = cpu_to_le64(trans->journal_res.seq);
 	}

-	s64 nr = bkey_is_inode(new.k) - bkey_is_inode(old.k);
-	if ((flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) && nr) {
-		struct disk_accounting_pos acc = { .type = BCH_DISK_ACCOUNTING_nr_inodes };
-		int ret = bch2_disk_accounting_mod(trans, &acc, &nr, 1, flags & BTREE_TRIGGER_gc);
+	s64 nr[1] = { bkey_is_inode(new.k) - bkey_is_inode(old.k) };
+	if ((flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) && nr[0]) {
+		int ret = bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc, nr, nr_inodes);
 		if (ret)
 			return ret;
 	}
···
 	bch2_inode_init_early(c, inode_u);
 	bch2_inode_init_late(inode_u, bch2_current_time(c),
 			     uid, gid, mode, rdev, parent);
-}
-
-static inline u32 bkey_generation(struct bkey_s_c k)
-{
-	switch (k.k->type) {
-	case KEY_TYPE_inode:
-	case KEY_TYPE_inode_v2:
-		BUG();
-	case KEY_TYPE_inode_generation:
-		return le32_to_cpu(bkey_s_c_to_inode_generation(k).v->bi_generation);
-	default:
-		return 0;
-	}
 }

 static struct bkey_i_inode_alloc_cursor *
···
 		bch2_fs_inconsistent(c,
 				     "inode %llu:%u not found when deleting",
 				     inum.inum, snapshot);
-		ret = -EIO;
+		ret = -BCH_ERR_ENOENT_inode;
 		goto err;
 	}

···
 		bch2_fs_inconsistent(c,
 				     "inode %llu:%u not found when deleting",
 				     inum, snapshot);
-		ret = -EIO;
+		ret = -BCH_ERR_ENOENT_inode;
 		goto err;
 	}
+1
fs/bcachefs/inode.h
···
 	bool inode_has_bp = inode->bi_dir || inode->bi_dir_offset;

 	return S_ISDIR(inode->bi_mode) ||
+		inode->bi_subvol ||
 		(!inode->bi_nlink && inode_has_bp);
 }
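Note on the inode.h hunk above: the added line makes the single-backpointer check return true for any inode with a nonzero bi_subvol, in addition to directories and to inodes whose stored bi_nlink is zero while a backpointer is recorded. The following is a minimal standalone sketch of that expression, not the kernel code. The struct demo_inode type, the demo_ prefixed names and the DEMO_S_ constants are assumptions made for illustration (the constants copy the usual Linux file-type values), and only the fields the expression reads are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative file-type bits; these mirror the usual Linux S_IFMT values. */
#define DEMO_S_IFMT	0170000u
#define DEMO_S_IFDIR	0040000u
#define DEMO_S_IFREG	0100000u

/* Hypothetical stand-in for the few inode fields the check reads. */
struct demo_inode {
	uint32_t bi_mode;	/* file type and permission bits */
	uint32_t bi_nlink;	/* stored link count field, as read by the hunk */
	uint32_t bi_subvol;	/* nonzero when the inode belongs to a subvolume root */
	uint64_t bi_dir;	/* backpointer: inode number of the directory */
	uint64_t bi_dir_offset;	/* backpointer: offset of the dirent */
};

static bool demo_is_dir(uint32_t mode)
{
	return (mode & DEMO_S_IFMT) == DEMO_S_IFDIR;
}

/* Same boolean expression as the hunk, after the added bi_subvol clause. */
static bool demo_should_have_single_bp(const struct demo_inode *inode)
{
	assert(inode);

	bool inode_has_bp = inode->bi_dir || inode->bi_dir_offset;

	return demo_is_dir(inode->bi_mode) ||
		inode->bi_subvol ||
		(!inode->bi_nlink && inode_has_bp);
}
```

With this sketch, a regular-file inode with bi_subvol set now takes the single-backpointer path, while a regular file with a nonzero stored bi_nlink and no subvolume still does not.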
+2 -1
fs/bcachefs/inode_format.h
···
 	x(i_sectors_dirty, 6) \
 	x(unlinked, 7) \
 	x(backptr_untrusted, 8) \
-	x(has_child_snapshot, 9)
+	x(has_child_snapshot, 9) \
+	x(casefolded, 10)

 /* bits 20+ reserved for packed fields below: */
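Note on the inode_format.h hunk above: the flag list is an X-macro, so adding casefolded as entry 10 is a one-line change that every expansion of the list picks up. The diff does not show how the list is expanded, so the sketch below is only one plausible arrangement: the DEMO_ names are hypothetical, the list contains just the entries visible in the hunk, and the "mask is 1U << nr" convention is an assumption (it is consistent with the fsck code in this diff testing bi_flags & BCH_INODE_unlinked).

```c
#include <assert.h>

/* Only the entries visible in the hunk; the real list is longer. */
#define DEMO_INODE_FLAGS()		\
	x(i_sectors_dirty,	6)	\
	x(unlinked,		7)	\
	x(backptr_untrusted,	8)	\
	x(has_child_snapshot,	9)	\
	x(casefolded,		10)

/* Expansion 1: one bit mask per list entry. */
enum demo_inode_flags {
#define x(name, nr)	DEMO_INODE_##name = 1U << nr,
	DEMO_INODE_FLAGS()
#undef x
};

/* Expansion 2: a name table indexed by bit number, built from the same list. */
static const char * const demo_inode_flag_names[] = {
#define x(name, nr)	[nr] = #name,
	DEMO_INODE_FLAGS()
#undef x
};

/* Compile-time check that the new entry landed on bit 10. */
static_assert(DEMO_INODE_casefolded == 1U << 10, "casefolded should be bit 10");

/* Look up a flag name by bit number; gaps and out-of-range bits are "(unknown)". */
static const char *demo_inode_flag_name(unsigned nr)
{
	unsigned nr_names = sizeof(demo_inode_flag_names) /
			    sizeof(demo_inode_flag_names[0]);

	if (nr >= nr_names || !demo_inode_flag_names[nr])
		return "(unknown)";
	return demo_inode_flag_names[nr];
}
```

The point of the pattern is that the enum and the name table cannot drift apart: both are generated from the single list that the hunk extends.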
+2 -1
fs/bcachefs/io_misc.c
···
 	bch2_increment_clock(c, sectors_allocated, WRITE);
 	if (should_print_err(ret)) {
 		struct printbuf buf = PRINTBUF;
-		bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter->pos.offset << 9);
+		lockrestart_do(trans,
+			bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter->pos.offset << 9));
 		prt_printf(&buf, "fallocate error: %s", bch2_err_str(ret));
 		bch_err_ratelimited(c, "%s", buf.buf);
 		printbuf_exit(&buf);
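Note on the io_misc.c hunk above: the error-message helper is now wrapped in lockrestart_do(). The definition of that wrapper is not part of this diff. Assuming it follows the retry-on-restart pattern its name suggests (begin the transaction, run the expression, repeat while the result is a transaction-restart error), the control flow can be sketched in standalone C as below. Every name in the sketch (demo_trans, demo_lockrestart_do, DEMO_ERR_transaction_restart, demo_flaky_op) is hypothetical, and a function pointer stands in for the expression argument of the real wrapper.

```c
#include <assert.h>

#define DEMO_ERR_transaction_restart	1	/* hypothetical error code */

/* Hypothetical stand-in for a btree transaction. */
struct demo_trans {
	unsigned nr_begins;	/* how many times the transaction was (re)started */
};

typedef int (*demo_op_fn)(struct demo_trans *trans, void *arg);

/*
 * Run op; while it reports that the transaction must be restarted,
 * begin the transaction again and call op again. Any other result,
 * success or failure, is returned to the caller unchanged.
 */
static int demo_lockrestart_do(struct demo_trans *trans, demo_op_fn op, void *arg)
{
	int ret;

	assert(trans && op);

	do {
		trans->nr_begins++;	/* stands in for beginning the transaction */
		ret = op(trans, arg);
	} while (ret == -DEMO_ERR_transaction_restart);

	return ret;
}

/* Example operation: asks for a restart until the counter in arg reaches zero. */
static int demo_flaky_op(struct demo_trans *trans, void *arg)
{
	int *restarts_left = arg;

	(void) trans;

	if (*restarts_left > 0) {
		(*restarts_left)--;
		return -DEMO_ERR_transaction_restart;
	}
	return 0;
}
```

Under that assumption, the change means a restart raised while building the error message is retried inside the wrapper rather than being returned from the unwrapped call.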
+407 -350
fs/bcachefs/io_read.c
··· 25 25 #include "subvolume.h" 26 26 #include "trace.h" 27 27 28 + #include <linux/random.h> 28 29 #include <linux/sched/mm.h> 30 + 31 + #ifdef CONFIG_BCACHEFS_DEBUG 32 + static unsigned bch2_read_corrupt_ratio; 33 + module_param_named(read_corrupt_ratio, bch2_read_corrupt_ratio, uint, 0644); 34 + MODULE_PARM_DESC(read_corrupt_ratio, ""); 35 + #endif 29 36 30 37 #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT 31 38 ··· 87 80 struct rhash_head hash; 88 81 struct bpos pos; 89 82 83 + struct work_struct work; 90 84 struct data_update write; 91 85 struct bio_vec bi_inline_vecs[]; /* must be last */ 92 86 }; ··· 104 96 return failed && failed->nr; 105 97 } 106 98 99 + static inline struct data_update *rbio_data_update(struct bch_read_bio *rbio) 100 + { 101 + EBUG_ON(rbio->split); 102 + 103 + return rbio->data_update 104 + ? container_of(rbio, struct data_update, rbio) 105 + : NULL; 106 + } 107 + 108 + static bool ptr_being_rewritten(struct bch_read_bio *orig, unsigned dev) 109 + { 110 + struct data_update *u = rbio_data_update(orig); 111 + if (!u) 112 + return false; 113 + 114 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(bkey_i_to_s_c(u->k.k)); 115 + unsigned i = 0; 116 + bkey_for_each_ptr(ptrs, ptr) { 117 + if (ptr->dev == dev && 118 + u->data_opts.rewrite_ptrs & BIT(i)) 119 + return true; 120 + i++; 121 + } 122 + 123 + return false; 124 + } 125 + 107 126 static inline int should_promote(struct bch_fs *c, struct bkey_s_c k, 108 127 struct bpos pos, 109 128 struct bch_io_opts opts, ··· 140 105 if (!have_io_error(failed)) { 141 106 BUG_ON(!opts.promote_target); 142 107 143 - if (!(flags & BCH_READ_MAY_PROMOTE)) 108 + if (!(flags & BCH_READ_may_promote)) 144 109 return -BCH_ERR_nopromote_may_not; 145 110 146 111 if (bch2_bkey_has_target(c, k, opts.promote_target)) ··· 160 125 return 0; 161 126 } 162 127 163 - static void promote_free(struct bch_fs *c, struct promote_op *op) 128 + static noinline void promote_free(struct bch_read_bio *rbio) 164 129 { 165 - int ret; 130 + struct 
promote_op *op = container_of(rbio, struct promote_op, write.rbio); 131 + struct bch_fs *c = rbio->c; 132 + 133 + int ret = rhashtable_remove_fast(&c->promote_table, &op->hash, 134 + bch_promote_params); 135 + BUG_ON(ret); 166 136 167 137 bch2_data_update_exit(&op->write); 168 138 169 - ret = rhashtable_remove_fast(&c->promote_table, &op->hash, 170 - bch_promote_params); 171 - BUG_ON(ret); 172 139 bch2_write_ref_put(c, BCH_WRITE_REF_promote); 173 140 kfree_rcu(op, rcu); 174 141 } 175 142 176 143 static void promote_done(struct bch_write_op *wop) 177 144 { 178 - struct promote_op *op = 179 - container_of(wop, struct promote_op, write.op); 180 - struct bch_fs *c = op->write.op.c; 145 + struct promote_op *op = container_of(wop, struct promote_op, write.op); 146 + struct bch_fs *c = op->write.rbio.c; 181 147 182 - bch2_time_stats_update(&c->times[BCH_TIME_data_promote], 183 - op->start_time); 184 - promote_free(c, op); 148 + bch2_time_stats_update(&c->times[BCH_TIME_data_promote], op->start_time); 149 + promote_free(&op->write.rbio); 185 150 } 186 151 187 - static void promote_start(struct promote_op *op, struct bch_read_bio *rbio) 152 + static void promote_start_work(struct work_struct *work) 188 153 { 189 - struct bio *bio = &op->write.op.wbio.bio; 154 + struct promote_op *op = container_of(work, struct promote_op, work); 190 155 191 - trace_and_count(op->write.op.c, read_promote, &rbio->bio); 192 - 193 - /* we now own pages: */ 194 - BUG_ON(!rbio->bounce); 195 - BUG_ON(rbio->bio.bi_vcnt > bio->bi_max_vecs); 196 - 197 - memcpy(bio->bi_io_vec, rbio->bio.bi_io_vec, 198 - sizeof(struct bio_vec) * rbio->bio.bi_vcnt); 199 - swap(bio->bi_vcnt, rbio->bio.bi_vcnt); 200 - 201 - bch2_data_update_read_done(&op->write, rbio->pick.crc); 156 + bch2_data_update_read_done(&op->write); 202 157 } 203 158 204 - static struct promote_op *__promote_alloc(struct btree_trans *trans, 205 - enum btree_id btree_id, 206 - struct bkey_s_c k, 207 - struct bpos pos, 208 - struct 
extent_ptr_decoded *pick, 209 - struct bch_io_opts opts, 210 - unsigned sectors, 211 - struct bch_read_bio **rbio, 212 - struct bch_io_failures *failed) 159 + static noinline void promote_start(struct bch_read_bio *rbio) 160 + { 161 + struct promote_op *op = container_of(rbio, struct promote_op, write.rbio); 162 + 163 + trace_and_count(op->write.op.c, io_read_promote, &rbio->bio); 164 + 165 + INIT_WORK(&op->work, promote_start_work); 166 + queue_work(rbio->c->write_ref_wq, &op->work); 167 + } 168 + 169 + static struct bch_read_bio *__promote_alloc(struct btree_trans *trans, 170 + enum btree_id btree_id, 171 + struct bkey_s_c k, 172 + struct bpos pos, 173 + struct extent_ptr_decoded *pick, 174 + unsigned sectors, 175 + struct bch_read_bio *orig, 176 + struct bch_io_failures *failed) 213 177 { 214 178 struct bch_fs *c = trans->c; 215 - struct promote_op *op = NULL; 216 - struct bio *bio; 217 - unsigned pages = DIV_ROUND_UP(sectors, PAGE_SECTORS); 218 179 int ret; 180 + 181 + struct data_update_opts update_opts = { .write_flags = BCH_WRITE_alloc_nowait }; 182 + 183 + if (!have_io_error(failed)) { 184 + update_opts.target = orig->opts.promote_target; 185 + update_opts.extra_replicas = 1; 186 + update_opts.write_flags |= BCH_WRITE_cached; 187 + update_opts.write_flags |= BCH_WRITE_only_specified_devs; 188 + } else { 189 + update_opts.target = orig->opts.foreground_target; 190 + 191 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 192 + unsigned ptr_bit = 1; 193 + bkey_for_each_ptr(ptrs, ptr) { 194 + if (bch2_dev_io_failures(failed, ptr->dev) && 195 + !ptr_being_rewritten(orig, ptr->dev)) 196 + update_opts.rewrite_ptrs |= ptr_bit; 197 + ptr_bit <<= 1; 198 + } 199 + 200 + if (!update_opts.rewrite_ptrs) 201 + return NULL; 202 + } 219 203 220 204 if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_promote)) 221 205 return ERR_PTR(-BCH_ERR_nopromote_no_writes); 222 206 223 - op = kzalloc(struct_size(op, bi_inline_vecs, pages), GFP_KERNEL); 207 + struct promote_op *op = 
kzalloc(sizeof(*op), GFP_KERNEL); 224 208 if (!op) { 225 209 ret = -BCH_ERR_nopromote_enomem; 226 - goto err; 210 + goto err_put; 227 211 } 228 212 229 213 op->start_time = local_clock(); 230 214 op->pos = pos; 231 - 232 - /* 233 - * We don't use the mempool here because extents that aren't 234 - * checksummed or compressed can be too big for the mempool: 235 - */ 236 - *rbio = kzalloc(sizeof(struct bch_read_bio) + 237 - sizeof(struct bio_vec) * pages, 238 - GFP_KERNEL); 239 - if (!*rbio) { 240 - ret = -BCH_ERR_nopromote_enomem; 241 - goto err; 242 - } 243 - 244 - rbio_init(&(*rbio)->bio, opts); 245 - bio_init(&(*rbio)->bio, NULL, (*rbio)->bio.bi_inline_vecs, pages, 0); 246 - 247 - if (bch2_bio_alloc_pages(&(*rbio)->bio, sectors << 9, GFP_KERNEL)) { 248 - ret = -BCH_ERR_nopromote_enomem; 249 - goto err; 250 - } 251 - 252 - (*rbio)->bounce = true; 253 - (*rbio)->split = true; 254 - (*rbio)->kmalloc = true; 255 215 256 216 if (rhashtable_lookup_insert_fast(&c->promote_table, &op->hash, 257 217 bch_promote_params)) { ··· 254 224 goto err; 255 225 } 256 226 257 - bio = &op->write.op.wbio.bio; 258 - bio_init(bio, NULL, bio->bi_inline_vecs, pages, 0); 259 - 260 - struct data_update_opts update_opts = {}; 261 - 262 - if (!have_io_error(failed)) { 263 - update_opts.target = opts.promote_target; 264 - update_opts.extra_replicas = 1; 265 - update_opts.write_flags = BCH_WRITE_ALLOC_NOWAIT|BCH_WRITE_CACHED; 266 - } else { 267 - update_opts.target = opts.foreground_target; 268 - 269 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 270 - unsigned ptr_bit = 1; 271 - bkey_for_each_ptr(ptrs, ptr) { 272 - if (bch2_dev_io_failures(failed, ptr->dev)) 273 - update_opts.rewrite_ptrs |= ptr_bit; 274 - ptr_bit <<= 1; 275 - } 276 - } 277 - 278 227 ret = bch2_data_update_init(trans, NULL, NULL, &op->write, 279 228 writepoint_hashed((unsigned long) current), 280 - opts, 229 + &orig->opts, 281 230 update_opts, 282 231 btree_id, k); 283 232 /* 284 233 * possible errors: 
-BCH_ERR_nocow_lock_blocked, 285 234 * -BCH_ERR_ENOSPC_disk_reservation: 286 235 */ 287 - if (ret) { 288 - BUG_ON(rhashtable_remove_fast(&c->promote_table, &op->hash, 289 - bch_promote_params)); 290 - goto err; 291 - } 236 + if (ret) 237 + goto err_remove_hash; 292 238 239 + rbio_init_fragment(&op->write.rbio.bio, orig); 240 + op->write.rbio.bounce = true; 241 + op->write.rbio.promote = true; 293 242 op->write.op.end_io = promote_done; 294 243 295 - return op; 244 + return &op->write.rbio; 245 + err_remove_hash: 246 + BUG_ON(rhashtable_remove_fast(&c->promote_table, &op->hash, 247 + bch_promote_params)); 296 248 err: 297 - if (*rbio) 298 - bio_free_pages(&(*rbio)->bio); 299 - kfree(*rbio); 300 - *rbio = NULL; 249 + bio_free_pages(&op->write.op.wbio.bio); 301 250 /* We may have added to the rhashtable and thus need rcu freeing: */ 302 251 kfree_rcu(op, rcu); 252 + err_put: 303 253 bch2_write_ref_put(c, BCH_WRITE_REF_promote); 304 254 return ERR_PTR(ret); 305 255 } 306 256 307 257 noinline 308 - static struct promote_op *promote_alloc(struct btree_trans *trans, 258 + static struct bch_read_bio *promote_alloc(struct btree_trans *trans, 309 259 struct bvec_iter iter, 310 260 struct bkey_s_c k, 311 261 struct extent_ptr_decoded *pick, 312 - struct bch_io_opts opts, 313 262 unsigned flags, 314 - struct bch_read_bio **rbio, 263 + struct bch_read_bio *orig, 315 264 bool *bounce, 316 265 bool *read_full, 317 266 struct bch_io_failures *failed) ··· 310 301 struct bpos pos = promote_full 311 302 ? bkey_start_pos(k.k) 312 303 : POS(k.k->p.inode, iter.bi_sector); 313 - struct promote_op *promote; 314 304 int ret; 315 305 316 - ret = should_promote(c, k, pos, opts, flags, failed); 306 + ret = should_promote(c, k, pos, orig->opts, flags, failed); 317 307 if (ret) 318 308 goto nopromote; 319 309 320 - promote = __promote_alloc(trans, 321 - k.k->type == KEY_TYPE_reflink_v 322 - ? 
BTREE_ID_reflink 323 - : BTREE_ID_extents, 324 - k, pos, pick, opts, sectors, rbio, failed); 310 + struct bch_read_bio *promote = 311 + __promote_alloc(trans, 312 + k.k->type == KEY_TYPE_reflink_v 313 + ? BTREE_ID_reflink 314 + : BTREE_ID_extents, 315 + k, pos, pick, sectors, orig, failed); 316 + if (!promote) 317 + return NULL; 318 + 325 319 ret = PTR_ERR_OR_ZERO(promote); 326 320 if (ret) 327 321 goto nopromote; ··· 333 321 *read_full = promote_full; 334 322 return promote; 335 323 nopromote: 336 - trace_read_nopromote(c, ret); 324 + trace_io_read_nopromote(c, ret); 337 325 return NULL; 338 326 } 339 327 ··· 342 330 static int bch2_read_err_msg_trans(struct btree_trans *trans, struct printbuf *out, 343 331 struct bch_read_bio *rbio, struct bpos read_pos) 344 332 { 345 - return bch2_inum_offset_err_msg_trans(trans, out, 346 - (subvol_inum) { rbio->subvol, read_pos.inode }, 347 - read_pos.offset << 9); 333 + int ret = lockrestart_do(trans, 334 + bch2_inum_offset_err_msg_trans(trans, out, 335 + (subvol_inum) { rbio->subvol, read_pos.inode }, 336 + read_pos.offset << 9)); 337 + if (ret) 338 + return ret; 339 + 340 + if (rbio->data_update) 341 + prt_str(out, "(internal move) "); 342 + 343 + return 0; 348 344 } 349 345 350 346 static void bch2_read_err_msg(struct bch_fs *c, struct printbuf *out, ··· 360 340 { 361 341 bch2_trans_run(c, bch2_read_err_msg_trans(trans, out, rbio, read_pos)); 362 342 } 363 - 364 - #define READ_RETRY_AVOID 1 365 - #define READ_RETRY 2 366 - #define READ_ERR 3 367 343 368 344 enum rbio_context { 369 345 RBIO_CONTEXT_NULL, ··· 391 375 { 392 376 BUG_ON(rbio->bounce && !rbio->split); 393 377 394 - if (rbio->promote) 395 - promote_free(rbio->c, rbio->promote); 396 - rbio->promote = NULL; 397 - 398 - if (rbio->bounce) 399 - bch2_bio_free_pages_pool(rbio->c, &rbio->bio); 378 + if (rbio->have_ioref) { 379 + struct bch_dev *ca = bch2_dev_have_ref(rbio->c, rbio->pick.ptr.dev); 380 + percpu_ref_put(&ca->io_ref); 381 + } 400 382 401 383 if (rbio->split) 
{ 402 384 struct bch_read_bio *parent = rbio->parent; 403 385 404 - if (rbio->kmalloc) 405 - kfree(rbio); 406 - else 386 + if (unlikely(rbio->promote)) { 387 + if (!rbio->bio.bi_status) 388 + promote_start(rbio); 389 + else 390 + promote_free(rbio); 391 + } else { 392 + if (rbio->bounce) 393 + bch2_bio_free_pages_pool(rbio->c, &rbio->bio); 394 + 407 395 bio_put(&rbio->bio); 396 + } 408 397 409 398 rbio = parent; 410 399 } ··· 429 408 bio_endio(&rbio->bio); 430 409 } 431 410 432 - static void bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_bio *rbio, 433 - struct bvec_iter bvec_iter, 434 - struct bch_io_failures *failed, 435 - unsigned flags) 411 + static noinline int bch2_read_retry_nodecode(struct btree_trans *trans, 412 + struct bch_read_bio *rbio, 413 + struct bvec_iter bvec_iter, 414 + struct bch_io_failures *failed, 415 + unsigned flags) 436 416 { 437 - struct btree_trans *trans = bch2_trans_get(c); 438 - struct btree_iter iter; 439 - struct bkey_buf sk; 440 - struct bkey_s_c k; 441 - int ret; 442 - 443 - flags &= ~BCH_READ_LAST_FRAGMENT; 444 - flags |= BCH_READ_MUST_CLONE; 445 - 446 - bch2_bkey_buf_init(&sk); 447 - 448 - bch2_trans_iter_init(trans, &iter, rbio->data_btree, 449 - rbio->read_pos, BTREE_ITER_slots); 417 + struct data_update *u = container_of(rbio, struct data_update, rbio); 450 418 retry: 451 419 bch2_trans_begin(trans); 452 - rbio->bio.bi_status = 0; 453 420 454 - ret = lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_slot(&iter))); 421 + struct btree_iter iter; 422 + struct bkey_s_c k; 423 + int ret = lockrestart_do(trans, 424 + bkey_err(k = bch2_bkey_get_iter(trans, &iter, 425 + u->btree_id, bkey_start_pos(&u->k.k->k), 426 + 0))); 455 427 if (ret) 456 428 goto err; 457 429 458 - bch2_bkey_buf_reassemble(&sk, c, k); 459 - k = bkey_i_to_s_c(sk.k); 460 - 461 - if (!bch2_bkey_matches_ptr(c, k, 462 - rbio->pick.ptr, 463 - rbio->data_pos.offset - 464 - rbio->pick.crc.offset)) { 430 + if (!bkey_and_val_eq(k, bkey_i_to_s_c(u->k.k))) 
{ 465 431 /* extent we wanted to read no longer exists: */ 466 - rbio->hole = true; 467 - goto out; 432 + rbio->ret = -BCH_ERR_data_read_key_overwritten; 433 + goto err; 468 434 } 469 435 470 436 ret = __bch2_read_extent(trans, rbio, bvec_iter, 471 - rbio->read_pos, 472 - rbio->data_btree, 473 - k, 0, failed, flags); 474 - if (ret == READ_RETRY) 475 - goto retry; 476 - if (ret) 477 - goto err; 478 - out: 479 - bch2_rbio_done(rbio); 480 - bch2_trans_iter_exit(trans, &iter); 481 - bch2_trans_put(trans); 482 - bch2_bkey_buf_exit(&sk, c); 483 - return; 437 + bkey_start_pos(&u->k.k->k), 438 + u->btree_id, 439 + bkey_i_to_s_c(u->k.k), 440 + 0, failed, flags, -1); 484 441 err: 485 - rbio->bio.bi_status = BLK_STS_IOERR; 486 - goto out; 442 + bch2_trans_iter_exit(trans, &iter); 443 + 444 + if (bch2_err_matches(ret, BCH_ERR_data_read_retry)) 445 + goto retry; 446 + 447 + if (ret) { 448 + rbio->bio.bi_status = BLK_STS_IOERR; 449 + rbio->ret = ret; 450 + } 451 + 452 + BUG_ON(atomic_read(&rbio->bio.__bi_remaining) != 1); 453 + return ret; 487 454 } 488 455 489 456 static void bch2_rbio_retry(struct work_struct *work) ··· 486 477 .inum = rbio->read_pos.inode, 487 478 }; 488 479 struct bch_io_failures failed = { .nr = 0 }; 480 + struct btree_trans *trans = bch2_trans_get(c); 489 481 490 - trace_and_count(c, read_retry, &rbio->bio); 482 + trace_io_read_retry(&rbio->bio); 483 + this_cpu_add(c->counters[BCH_COUNTER_io_read_retry], 484 + bvec_iter_sectors(rbio->bvec_iter)); 491 485 492 - if (rbio->retry == READ_RETRY_AVOID) 493 - bch2_mark_io_failure(&failed, &rbio->pick); 486 + if (bch2_err_matches(rbio->ret, BCH_ERR_data_read_retry_avoid)) 487 + bch2_mark_io_failure(&failed, &rbio->pick, 488 + rbio->ret == -BCH_ERR_data_read_retry_csum_err); 494 489 495 - rbio->bio.bi_status = 0; 490 + if (!rbio->split) { 491 + rbio->bio.bi_status = 0; 492 + rbio->ret = 0; 493 + } 494 + 495 + unsigned subvol = rbio->subvol; 496 + struct bpos read_pos = rbio->read_pos; 496 497 497 498 rbio = 
bch2_rbio_free(rbio); 498 499 499 - flags |= BCH_READ_IN_RETRY; 500 - flags &= ~BCH_READ_MAY_PROMOTE; 500 + flags |= BCH_READ_in_retry; 501 + flags &= ~BCH_READ_may_promote; 502 + flags &= ~BCH_READ_last_fragment; 503 + flags |= BCH_READ_must_clone; 501 504 502 - if (flags & BCH_READ_NODECODE) { 503 - bch2_read_retry_nodecode(c, rbio, iter, &failed, flags); 505 + int ret = rbio->data_update 506 + ? bch2_read_retry_nodecode(trans, rbio, iter, &failed, flags) 507 + : __bch2_read(trans, rbio, iter, inum, &failed, flags); 508 + 509 + if (ret) { 510 + rbio->ret = ret; 511 + rbio->bio.bi_status = BLK_STS_IOERR; 504 512 } else { 505 - flags &= ~BCH_READ_LAST_FRAGMENT; 506 - flags |= BCH_READ_MUST_CLONE; 513 + struct printbuf buf = PRINTBUF; 507 514 508 - __bch2_read(c, rbio, iter, inum, &failed, flags); 515 + lockrestart_do(trans, 516 + bch2_inum_offset_err_msg_trans(trans, &buf, 517 + (subvol_inum) { subvol, read_pos.inode }, 518 + read_pos.offset << 9)); 519 + if (rbio->data_update) 520 + prt_str(&buf, "(internal move) "); 521 + prt_str(&buf, "successful retry"); 522 + 523 + bch_err_ratelimited(c, "%s", buf.buf); 524 + printbuf_exit(&buf); 509 525 } 526 + 527 + bch2_rbio_done(rbio); 528 + bch2_trans_put(trans); 510 529 } 511 530 512 - static void bch2_rbio_error(struct bch_read_bio *rbio, int retry, 513 - blk_status_t error) 531 + static void bch2_rbio_error(struct bch_read_bio *rbio, 532 + int ret, blk_status_t blk_error) 514 533 { 515 - rbio->retry = retry; 534 + BUG_ON(ret >= 0); 516 535 517 - if (rbio->flags & BCH_READ_IN_RETRY) 536 + rbio->ret = ret; 537 + rbio->bio.bi_status = blk_error; 538 + 539 + bch2_rbio_parent(rbio)->saw_error = true; 540 + 541 + if (rbio->flags & BCH_READ_in_retry) 518 542 return; 519 543 520 - if (retry == READ_ERR) { 521 - rbio = bch2_rbio_free(rbio); 522 - 523 - rbio->bio.bi_status = error; 524 - bch2_rbio_done(rbio); 525 - } else { 544 + if (bch2_err_matches(ret, BCH_ERR_data_read_retry)) { 526 545 bch2_rbio_punt(rbio, bch2_rbio_retry, 
527 546 RBIO_CONTEXT_UNBOUND, system_unbound_wq); 547 + } else { 548 + rbio = bch2_rbio_free(rbio); 549 + 550 + rbio->ret = ret; 551 + rbio->bio.bi_status = blk_error; 552 + 553 + bch2_rbio_done(rbio); 528 554 } 529 555 } 530 556 ··· 575 531 bch2_read_err_msg(c, &buf, rbio, rbio->read_pos); 576 532 prt_printf(&buf, "data read error: %s", bch2_blk_status_to_str(bio->bi_status)); 577 533 578 - if (ca) { 579 - bch2_io_error(ca, BCH_MEMBER_ERROR_read); 534 + if (ca) 580 535 bch_err_ratelimited(ca, "%s", buf.buf); 581 - } else { 536 + else 582 537 bch_err_ratelimited(c, "%s", buf.buf); 583 - } 584 538 585 539 printbuf_exit(&buf); 586 - bch2_rbio_error(rbio, READ_RETRY_AVOID, bio->bi_status); 540 + bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_io_err, bio->bi_status); 587 541 } 588 542 589 543 static int __bch2_rbio_narrow_crcs(struct btree_trans *trans, ··· 663 621 bch2_csum_err_msg(&buf, crc.csum_type, rbio->pick.crc.csum, csum); 664 622 665 623 struct bch_dev *ca = rbio->have_ioref ? bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 666 - if (ca) { 667 - bch2_io_error(ca, BCH_MEMBER_ERROR_checksum); 624 + if (ca) 668 625 bch_err_ratelimited(ca, "%s", buf.buf); 669 - } else { 626 + else 670 627 bch_err_ratelimited(c, "%s", buf.buf); 671 - } 672 628 673 - bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 629 + bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_csum_err, BLK_STS_IOERR); 674 630 printbuf_exit(&buf); 675 631 } 676 632 ··· 688 648 else 689 649 bch_err_ratelimited(c, "%s", buf.buf); 690 650 691 - bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR); 651 + bch2_rbio_error(rbio, -BCH_ERR_data_read_decompress_err, BLK_STS_IOERR); 692 652 printbuf_exit(&buf); 693 653 } 694 654 ··· 708 668 else 709 669 bch_err_ratelimited(c, "%s", buf.buf); 710 670 711 - bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR); 671 + bch2_rbio_error(rbio, -BCH_ERR_data_read_decrypt_err, BLK_STS_IOERR); 712 672 printbuf_exit(&buf); 713 673 } 714 674 ··· 718 678 struct bch_read_bio *rbio = 719 
679 container_of(work, struct bch_read_bio, work); 720 680 struct bch_fs *c = rbio->c; 721 - struct bio *src = &rbio->bio; 722 - struct bio *dst = &bch2_rbio_parent(rbio)->bio; 723 - struct bvec_iter dst_iter = rbio->bvec_iter; 681 + struct bch_dev *ca = rbio->have_ioref ? bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 682 + struct bch_read_bio *parent = bch2_rbio_parent(rbio); 683 + struct bio *src = &rbio->bio; 684 + struct bio *dst = &parent->bio; 685 + struct bvec_iter dst_iter = rbio->bvec_iter; 724 686 struct bch_extent_crc_unpacked crc = rbio->pick.crc; 725 687 struct nonce nonce = extent_nonce(rbio->version, crc); 726 688 unsigned nofs_flags; ··· 740 698 src->bi_iter = rbio->bvec_iter; 741 699 } 742 700 701 + bch2_maybe_corrupt_bio(src, bch2_read_corrupt_ratio); 702 + 743 703 csum = bch2_checksum_bio(c, crc.csum_type, nonce, src); 744 - if (bch2_crc_cmp(csum, rbio->pick.crc.csum) && !c->opts.no_data_io) 704 + bool csum_good = !bch2_crc_cmp(csum, rbio->pick.crc.csum) || c->opts.no_data_io; 705 + 706 + /* 707 + * Checksum error: if the bio wasn't bounced, we may have been 708 + * reading into buffers owned by userspace (that userspace can 709 + * scribble over) - retry the read, bouncing it this time: 710 + */ 711 + if (!csum_good && !rbio->bounce && (rbio->flags & BCH_READ_user_mapped)) { 712 + rbio->flags |= BCH_READ_must_bounce; 713 + bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_csum_err_maybe_userspace, 714 + BLK_STS_IOERR); 715 + goto out; 716 + } 717 + 718 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_checksum, 0, csum_good); 719 + 720 + if (!csum_good) 745 721 goto csum_err; 746 722 747 723 /* ··· 772 712 if (unlikely(rbio->narrow_crcs)) 773 713 bch2_rbio_narrow_crcs(rbio); 774 714 775 - if (rbio->flags & BCH_READ_NODECODE) 776 - goto nodecode; 715 + if (likely(!parent->data_update)) { 716 + /* Adjust crc to point to subset of data we want: */ 717 + crc.offset += rbio->offset_into_extent; 718 + crc.live_size = 
bvec_iter_sectors(rbio->bvec_iter); 777 719 778 - /* Adjust crc to point to subset of data we want: */ 779 - crc.offset += rbio->offset_into_extent; 780 - crc.live_size = bvec_iter_sectors(rbio->bvec_iter); 720 + if (crc_is_compressed(crc)) { 721 + ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 722 + if (ret) 723 + goto decrypt_err; 781 724 782 - if (crc_is_compressed(crc)) { 783 - ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 784 - if (ret) 785 - goto decrypt_err; 725 + if (bch2_bio_uncompress(c, src, dst, dst_iter, crc) && 726 + !c->opts.no_data_io) 727 + goto decompression_err; 728 + } else { 729 + /* don't need to decrypt the entire bio: */ 730 + nonce = nonce_add(nonce, crc.offset << 9); 731 + bio_advance(src, crc.offset << 9); 786 732 787 - if (bch2_bio_uncompress(c, src, dst, dst_iter, crc) && 788 - !c->opts.no_data_io) 789 - goto decompression_err; 733 + BUG_ON(src->bi_iter.bi_size < dst_iter.bi_size); 734 + src->bi_iter.bi_size = dst_iter.bi_size; 735 + 736 + ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 737 + if (ret) 738 + goto decrypt_err; 739 + 740 + if (rbio->bounce) { 741 + struct bvec_iter src_iter = src->bi_iter; 742 + 743 + bio_copy_data_iter(dst, &dst_iter, src, &src_iter); 744 + } 745 + } 790 746 } else { 791 - /* don't need to decrypt the entire bio: */ 792 - nonce = nonce_add(nonce, crc.offset << 9); 793 - bio_advance(src, crc.offset << 9); 794 - 795 - BUG_ON(src->bi_iter.bi_size < dst_iter.bi_size); 796 - src->bi_iter.bi_size = dst_iter.bi_size; 797 - 798 - ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 799 - if (ret) 800 - goto decrypt_err; 747 + if (rbio->split) 748 + rbio->parent->pick = rbio->pick; 801 749 802 750 if (rbio->bounce) { 803 751 struct bvec_iter src_iter = src->bi_iter; ··· 822 754 ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 823 755 if (ret) 824 756 goto decrypt_err; 825 - 826 - promote_start(rbio->promote, rbio); 827 - rbio->promote = NULL; 828 757 } 829 - nodecode: 830 - if 
(likely(!(rbio->flags & BCH_READ_IN_RETRY))) { 758 + 759 + if (likely(!(rbio->flags & BCH_READ_in_retry))) { 831 760 rbio = bch2_rbio_free(rbio); 832 761 bch2_rbio_done(rbio); 833 762 } ··· 832 767 memalloc_nofs_restore(nofs_flags); 833 768 return; 834 769 csum_err: 835 - /* 836 - * Checksum error: if the bio wasn't bounced, we may have been 837 - * reading into buffers owned by userspace (that userspace can 838 - * scribble over) - retry the read, bouncing it this time: 839 - */ 840 - if (!rbio->bounce && (rbio->flags & BCH_READ_USER_MAPPED)) { 841 - rbio->flags |= BCH_READ_MUST_BOUNCE; 842 - bch2_rbio_error(rbio, READ_RETRY, BLK_STS_IOERR); 843 - goto out; 844 - } 845 - 846 770 bch2_rbio_punt(rbio, bch2_read_csum_err, RBIO_CONTEXT_UNBOUND, system_unbound_wq); 847 771 goto out; 848 772 decompression_err: ··· 851 797 struct workqueue_struct *wq = NULL; 852 798 enum rbio_context context = RBIO_CONTEXT_NULL; 853 799 854 - if (rbio->have_ioref) { 855 - bch2_latency_acct(ca, rbio->submit_time, READ); 856 - percpu_ref_put(&ca->io_ref); 857 - } 800 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, 801 + rbio->submit_time, !bio->bi_status); 858 802 859 803 if (!rbio->split) 860 804 rbio->bio.bi_end_io = rbio->end_io; ··· 862 810 return; 863 811 } 864 812 865 - if (((rbio->flags & BCH_READ_RETRY_IF_STALE) && race_fault()) || 813 + if (((rbio->flags & BCH_READ_retry_if_stale) && race_fault()) || 866 814 (ca && dev_ptr_stale(ca, &rbio->pick.ptr))) { 867 - trace_and_count(c, read_reuse_race, &rbio->bio); 815 + trace_and_count(c, io_read_reuse_race, &rbio->bio); 868 816 869 - if (rbio->flags & BCH_READ_RETRY_IF_STALE) 870 - bch2_rbio_error(rbio, READ_RETRY, BLK_STS_AGAIN); 817 + if (rbio->flags & BCH_READ_retry_if_stale) 818 + bch2_rbio_error(rbio, -BCH_ERR_data_read_ptr_stale_retry, BLK_STS_AGAIN); 871 819 else 872 - bch2_rbio_error(rbio, READ_ERR, BLK_STS_AGAIN); 820 + bch2_rbio_error(rbio, -BCH_ERR_data_read_ptr_stale_race, BLK_STS_AGAIN); 873 821 return; 874 822 } 
875 823 ··· 935 883 struct bvec_iter iter, struct bpos read_pos, 936 884 enum btree_id data_btree, struct bkey_s_c k, 937 885 unsigned offset_into_extent, 938 - struct bch_io_failures *failed, unsigned flags) 886 + struct bch_io_failures *failed, unsigned flags, int dev) 939 887 { 940 888 struct bch_fs *c = trans->c; 941 889 struct extent_ptr_decoded pick; 942 890 struct bch_read_bio *rbio = NULL; 943 - struct promote_op *promote = NULL; 944 891 bool bounce = false, read_full = false, narrow_crcs = false; 945 892 struct bpos data_pos = bkey_start_pos(k.k); 946 - int pick_ret; 893 + struct data_update *u = rbio_data_update(orig); 894 + int ret = 0; 947 895 948 896 if (bkey_extent_is_inline_data(k.k)) { 949 897 unsigned bytes = min_t(unsigned, iter.bi_size, ··· 954 902 swap(iter.bi_size, bytes); 955 903 bio_advance_iter(&orig->bio, &iter, bytes); 956 904 zero_fill_bio_iter(&orig->bio, iter); 905 + this_cpu_add(c->counters[BCH_COUNTER_io_read_inline], 906 + bvec_iter_sectors(iter)); 957 907 goto out_read_done; 958 908 } 959 909 retry_pick: 960 - pick_ret = bch2_bkey_pick_read_device(c, k, failed, &pick); 910 + ret = bch2_bkey_pick_read_device(c, k, failed, &pick, dev); 961 911 962 912 /* hole or reservation - just zero fill: */ 963 - if (!pick_ret) 913 + if (!ret) 964 914 goto hole; 965 915 966 - if (unlikely(pick_ret < 0)) { 916 + if (unlikely(ret < 0)) { 967 917 struct printbuf buf = PRINTBUF; 968 918 bch2_read_err_msg_trans(trans, &buf, orig, read_pos); 969 - prt_printf(&buf, "no device to read from: %s\n ", bch2_err_str(pick_ret)); 919 + prt_printf(&buf, "%s\n ", bch2_err_str(ret)); 970 920 bch2_bkey_val_to_text(&buf, c, k); 971 921 972 922 bch_err_ratelimited(c, "%s", buf.buf); ··· 984 930 985 931 bch_err_ratelimited(c, "%s", buf.buf); 986 932 printbuf_exit(&buf); 933 + ret = -BCH_ERR_data_read_no_encryption_key; 987 934 goto err; 988 935 } 989 936 ··· 996 941 * retry path, don't check here, it'll be caught in bch2_read_endio() 997 942 * and we'll end up in the 
retry path: 998 943 */ 999 - if ((flags & BCH_READ_IN_RETRY) && 944 + if ((flags & BCH_READ_in_retry) && 1000 945 !pick.ptr.cached && 1001 946 ca && 1002 947 unlikely(dev_ptr_stale(ca, &pick.ptr))) { 1003 948 read_from_stale_dirty_pointer(trans, ca, k, pick.ptr); 1004 - bch2_mark_io_failure(failed, &pick); 949 + bch2_mark_io_failure(failed, &pick, false); 1005 950 percpu_ref_put(&ca->io_ref); 1006 951 goto retry_pick; 1007 952 } 1008 953 1009 - if (flags & BCH_READ_NODECODE) { 954 + if (likely(!u)) { 955 + if (!(flags & BCH_READ_last_fragment) || 956 + bio_flagged(&orig->bio, BIO_CHAIN)) 957 + flags |= BCH_READ_must_clone; 958 + 959 + narrow_crcs = !(flags & BCH_READ_in_retry) && 960 + bch2_can_narrow_extent_crcs(k, pick.crc); 961 + 962 + if (narrow_crcs && (flags & BCH_READ_user_mapped)) 963 + flags |= BCH_READ_must_bounce; 964 + 965 + EBUG_ON(offset_into_extent + bvec_iter_sectors(iter) > k.k->size); 966 + 967 + if (crc_is_compressed(pick.crc) || 968 + (pick.crc.csum_type != BCH_CSUM_none && 969 + (bvec_iter_sectors(iter) != pick.crc.uncompressed_size || 970 + (bch2_csum_type_is_encryption(pick.crc.csum_type) && 971 + (flags & BCH_READ_user_mapped)) || 972 + (flags & BCH_READ_must_bounce)))) { 973 + read_full = true; 974 + bounce = true; 975 + } 976 + } else { 1010 977 /* 1011 978 * can happen if we retry, and the extent we were going to read 1012 979 * has been merged in the meantime: 1013 980 */ 1014 - if (pick.crc.compressed_size > orig->bio.bi_vcnt * PAGE_SECTORS) { 981 + if (pick.crc.compressed_size > u->op.wbio.bio.bi_iter.bi_size) { 1015 982 if (ca) 1016 983 percpu_ref_put(&ca->io_ref); 1017 - goto hole; 984 + rbio->ret = -BCH_ERR_data_read_buffer_too_small; 985 + goto out_read_done; 1018 986 } 1019 987 1020 988 iter.bi_size = pick.crc.compressed_size << 9; 1021 - goto get_bio; 1022 - } 1023 - 1024 - if (!(flags & BCH_READ_LAST_FRAGMENT) || 1025 - bio_flagged(&orig->bio, BIO_CHAIN)) 1026 - flags |= BCH_READ_MUST_CLONE; 1027 - 1028 - narrow_crcs = !(flags & 
BCH_READ_IN_RETRY) && 1029 - bch2_can_narrow_extent_crcs(k, pick.crc); 1030 - 1031 - if (narrow_crcs && (flags & BCH_READ_USER_MAPPED)) 1032 - flags |= BCH_READ_MUST_BOUNCE; 1033 - 1034 - EBUG_ON(offset_into_extent + bvec_iter_sectors(iter) > k.k->size); 1035 - 1036 - if (crc_is_compressed(pick.crc) || 1037 - (pick.crc.csum_type != BCH_CSUM_none && 1038 - (bvec_iter_sectors(iter) != pick.crc.uncompressed_size || 1039 - (bch2_csum_type_is_encryption(pick.crc.csum_type) && 1040 - (flags & BCH_READ_USER_MAPPED)) || 1041 - (flags & BCH_READ_MUST_BOUNCE)))) { 1042 989 read_full = true; 1043 - bounce = true; 1044 990 } 1045 991 1046 992 if (orig->opts.promote_target || have_io_error(failed)) 1047 - promote = promote_alloc(trans, iter, k, &pick, orig->opts, flags, 1048 - &rbio, &bounce, &read_full, failed); 993 + rbio = promote_alloc(trans, iter, k, &pick, flags, orig, 994 + &bounce, &read_full, failed); 1049 995 1050 996 if (!read_full) { 1051 997 EBUG_ON(crc_is_compressed(pick.crc)); ··· 1065 1009 pick.crc.offset = 0; 1066 1010 pick.crc.live_size = bvec_iter_sectors(iter); 1067 1011 } 1068 - get_bio: 1012 + 1069 1013 if (rbio) { 1070 1014 /* 1071 1015 * promote already allocated bounce rbio: ··· 1080 1024 } else if (bounce) { 1081 1025 unsigned sectors = pick.crc.compressed_size; 1082 1026 1083 - rbio = rbio_init(bio_alloc_bioset(NULL, 1027 + rbio = rbio_init_fragment(bio_alloc_bioset(NULL, 1084 1028 DIV_ROUND_UP(sectors, PAGE_SECTORS), 1085 1029 0, 1086 1030 GFP_NOFS, 1087 1031 &c->bio_read_split), 1088 - orig->opts); 1032 + orig); 1089 1033 1090 1034 bch2_bio_alloc_pages_pool(c, &rbio->bio, sectors << 9); 1091 1035 rbio->bounce = true; 1092 - rbio->split = true; 1093 - } else if (flags & BCH_READ_MUST_CLONE) { 1036 + } else if (flags & BCH_READ_must_clone) { 1094 1037 /* 1095 1038 * Have to clone if there were any splits, due to error 1096 1039 * reporting issues (if a split errored, and retrying didn't ··· 1098 1043 * from the whole bio, in which case we don't want 
to retry and 1099 1044 * lose the error) 1100 1045 */ 1101 - rbio = rbio_init(bio_alloc_clone(NULL, &orig->bio, GFP_NOFS, 1046 + rbio = rbio_init_fragment(bio_alloc_clone(NULL, &orig->bio, GFP_NOFS, 1102 1047 &c->bio_read_split), 1103 - orig->opts); 1048 + orig); 1104 1049 rbio->bio.bi_iter = iter; 1105 - rbio->split = true; 1106 1050 } else { 1107 1051 rbio = orig; 1108 1052 rbio->bio.bi_iter = iter; ··· 1110 1056 1111 1057 EBUG_ON(bio_sectors(&rbio->bio) != pick.crc.compressed_size); 1112 1058 1113 - rbio->c = c; 1114 1059 rbio->submit_time = local_clock(); 1115 - if (rbio->split) 1116 - rbio->parent = orig; 1117 - else 1060 + if (!rbio->split) 1118 1061 rbio->end_io = orig->bio.bi_end_io; 1119 1062 rbio->bvec_iter = iter; 1120 1063 rbio->offset_into_extent= offset_into_extent; 1121 1064 rbio->flags = flags; 1122 1065 rbio->have_ioref = ca != NULL; 1123 1066 rbio->narrow_crcs = narrow_crcs; 1124 - rbio->hole = 0; 1125 - rbio->retry = 0; 1067 + rbio->ret = 0; 1126 1068 rbio->context = 0; 1127 - /* XXX: only initialize this if needed */ 1128 - rbio->devs_have = bch2_bkey_devs(k); 1129 1069 rbio->pick = pick; 1130 1070 rbio->subvol = orig->subvol; 1131 1071 rbio->read_pos = read_pos; 1132 1072 rbio->data_btree = data_btree; 1133 1073 rbio->data_pos = data_pos; 1134 1074 rbio->version = k.k->bversion; 1135 - rbio->promote = promote; 1136 1075 INIT_WORK(&rbio->work, NULL); 1137 - 1138 - if (flags & BCH_READ_NODECODE) 1139 - orig->pick = pick; 1140 1076 1141 1077 rbio->bio.bi_opf = orig->bio.bi_opf; 1142 1078 rbio->bio.bi_iter.bi_sector = pick.ptr.offset; 1143 1079 rbio->bio.bi_end_io = bch2_read_endio; 1144 1080 1145 1081 if (rbio->bounce) 1146 - trace_and_count(c, read_bounce, &rbio->bio); 1082 + trace_and_count(c, io_read_bounce, &rbio->bio); 1147 1083 1148 - this_cpu_add(c->counters[BCH_COUNTER_io_read], bio_sectors(&rbio->bio)); 1084 + if (!u) 1085 + this_cpu_add(c->counters[BCH_COUNTER_io_read], bio_sectors(&rbio->bio)); 1086 + else 1087 + 
this_cpu_add(c->counters[BCH_COUNTER_io_move_read], bio_sectors(&rbio->bio)); 1149 1088 bch2_increment_clock(c, bio_sectors(&rbio->bio), READ); 1150 1089 1151 1090 /* 1152 1091 * If it's being moved internally, we don't want to flag it as a cache 1153 1092 * hit: 1154 1093 */ 1155 - if (ca && pick.ptr.cached && !(flags & BCH_READ_NODECODE)) 1094 + if (ca && pick.ptr.cached && !u) 1156 1095 bch2_bucket_io_time_reset(trans, pick.ptr.dev, 1157 1096 PTR_BUCKET_NR(ca, &pick.ptr), READ); 1158 1097 1159 - if (!(flags & (BCH_READ_IN_RETRY|BCH_READ_LAST_FRAGMENT))) { 1098 + if (!(flags & (BCH_READ_in_retry|BCH_READ_last_fragment))) { 1160 1099 bio_inc_remaining(&orig->bio); 1161 - trace_and_count(c, read_split, &orig->bio); 1100 + trace_and_count(c, io_read_split, &orig->bio); 1162 1101 } 1163 1102 1164 1103 /* 1165 1104 * Unlock the iterator while the btree node's lock is still in 1166 1105 * cache, before doing the IO: 1167 1106 */ 1168 - if (!(flags & BCH_READ_IN_RETRY)) 1107 + if (!(flags & BCH_READ_in_retry)) 1169 1108 bch2_trans_unlock(trans); 1170 1109 else 1171 1110 bch2_trans_unlock_long(trans); 1172 1111 1173 - if (!rbio->pick.idx) { 1112 + if (likely(!rbio->pick.do_ec_reconstruct)) { 1174 1113 if (unlikely(!rbio->have_ioref)) { 1175 1114 struct printbuf buf = PRINTBUF; 1176 1115 bch2_read_err_msg_trans(trans, &buf, rbio, read_pos); ··· 1173 1126 bch_err_ratelimited(c, "%s", buf.buf); 1174 1127 printbuf_exit(&buf); 1175 1128 1176 - bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 1129 + bch2_rbio_error(rbio, 1130 + -BCH_ERR_data_read_retry_device_offline, 1131 + BLK_STS_IOERR); 1177 1132 goto out; 1178 1133 } 1179 1134 ··· 1184 1135 bio_set_dev(&rbio->bio, ca->disk_sb.bdev); 1185 1136 1186 1137 if (unlikely(c->opts.no_data_io)) { 1187 - if (likely(!(flags & BCH_READ_IN_RETRY))) 1138 + if (likely(!(flags & BCH_READ_in_retry))) 1188 1139 bio_endio(&rbio->bio); 1189 1140 } else { 1190 - if (likely(!(flags & BCH_READ_IN_RETRY))) 1141 + if (likely(!(flags & 
BCH_READ_in_retry))) 1191 1142 submit_bio(&rbio->bio); 1192 1143 else 1193 1144 submit_bio_wait(&rbio->bio); ··· 1201 1152 } else { 1202 1153 /* Attempting reconstruct read: */ 1203 1154 if (bch2_ec_read_extent(trans, rbio, k)) { 1204 - bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 1155 + bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_ec_reconstruct_err, 1156 + BLK_STS_IOERR); 1205 1157 goto out; 1206 1158 } 1207 1159 1208 - if (likely(!(flags & BCH_READ_IN_RETRY))) 1160 + if (likely(!(flags & BCH_READ_in_retry))) 1209 1161 bio_endio(&rbio->bio); 1210 1162 } 1211 1163 out: 1212 - if (likely(!(flags & BCH_READ_IN_RETRY))) { 1164 + if (likely(!(flags & BCH_READ_in_retry))) { 1213 1165 return 0; 1214 1166 } else { 1215 1167 bch2_trans_unlock(trans); ··· 1220 1170 rbio->context = RBIO_CONTEXT_UNBOUND; 1221 1171 bch2_read_endio(&rbio->bio); 1222 1172 1223 - ret = rbio->retry; 1173 + ret = rbio->ret; 1224 1174 rbio = bch2_rbio_free(rbio); 1225 1175 1226 - if (ret == READ_RETRY_AVOID) { 1227 - bch2_mark_io_failure(failed, &pick); 1228 - ret = READ_RETRY; 1229 - } 1230 - 1231 - if (!ret) 1232 - goto out_read_done; 1176 + if (bch2_err_matches(ret, BCH_ERR_data_read_retry_avoid)) 1177 + bch2_mark_io_failure(failed, &pick, 1178 + ret == -BCH_ERR_data_read_retry_csum_err); 1233 1179 1234 1180 return ret; 1235 1181 } 1236 1182 1237 1183 err: 1238 - if (flags & BCH_READ_IN_RETRY) 1239 - return READ_ERR; 1184 + if (flags & BCH_READ_in_retry) 1185 + return ret; 1240 1186 1241 - orig->bio.bi_status = BLK_STS_IOERR; 1187 + orig->bio.bi_status = BLK_STS_IOERR; 1188 + orig->ret = ret; 1242 1189 goto out_read_done; 1243 1190 1244 1191 hole: 1192 + this_cpu_add(c->counters[BCH_COUNTER_io_read_hole], 1193 + bvec_iter_sectors(iter)); 1245 1194 /* 1246 - * won't normally happen in the BCH_READ_NODECODE 1247 - * (bch2_move_extent()) path, but if we retry and the extent we wanted 1248 - * to read no longer exists we have to signal that: 1195 + * won't normally happen in the data 
update (bch2_move_extent()) path, 1196 + * but if we retry and the extent we wanted to read no longer exists we 1197 + * have to signal that: 1249 1198 */ 1250 - if (flags & BCH_READ_NODECODE) 1251 - orig->hole = true; 1199 + if (u) 1200 + orig->ret = -BCH_ERR_data_read_key_overwritten; 1252 1201 1253 1202 zero_fill_bio_iter(&orig->bio, iter); 1254 1203 out_read_done: 1255 - if (flags & BCH_READ_LAST_FRAGMENT) 1204 + if ((flags & BCH_READ_last_fragment) && 1205 + !(flags & BCH_READ_in_retry)) 1256 1206 bch2_rbio_done(orig); 1257 1207 return 0; 1258 1208 } 1259 1209 1260 - void __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio, 1261 - struct bvec_iter bvec_iter, subvol_inum inum, 1262 - struct bch_io_failures *failed, unsigned flags) 1210 + int __bch2_read(struct btree_trans *trans, struct bch_read_bio *rbio, 1211 + struct bvec_iter bvec_iter, subvol_inum inum, 1212 + struct bch_io_failures *failed, unsigned flags) 1263 1213 { 1264 - struct btree_trans *trans = bch2_trans_get(c); 1214 + struct bch_fs *c = trans->c; 1265 1215 struct btree_iter iter; 1266 1216 struct bkey_buf sk; 1267 1217 struct bkey_s_c k; 1268 1218 int ret; 1269 1219 1270 - BUG_ON(flags & BCH_READ_NODECODE); 1220 + EBUG_ON(rbio->data_update); 1271 1221 1272 1222 bch2_bkey_buf_init(&sk); 1273 1223 bch2_trans_iter_init(trans, &iter, BTREE_ID_extents, ··· 1317 1267 swap(bvec_iter.bi_size, bytes); 1318 1268 1319 1269 if (bvec_iter.bi_size == bytes) 1320 - flags |= BCH_READ_LAST_FRAGMENT; 1270 + flags |= BCH_READ_last_fragment; 1321 1271 1322 1272 ret = __bch2_read_extent(trans, rbio, bvec_iter, iter.pos, 1323 1273 data_btree, k, 1324 - offset_into_extent, failed, flags); 1274 + offset_into_extent, failed, flags, -1); 1325 1275 if (ret) 1326 1276 goto err; 1327 1277 1328 - if (flags & BCH_READ_LAST_FRAGMENT) 1278 + if (flags & BCH_READ_last_fragment) 1329 1279 break; 1330 1280 1331 1281 swap(bvec_iter.bi_size, bytes); 1332 1282 bio_advance_iter(&rbio->bio, &bvec_iter, bytes); 1333 1283 err: 1284 + 
if (ret == -BCH_ERR_data_read_retry_csum_err_maybe_userspace) 1285 + flags |= BCH_READ_must_bounce; 1286 + 1334 1287 if (ret && 1335 1288 !bch2_err_matches(ret, BCH_ERR_transaction_restart) && 1336 - ret != READ_RETRY && 1337 - ret != READ_RETRY_AVOID) 1289 + !bch2_err_matches(ret, BCH_ERR_data_read_retry)) 1338 1290 break; 1339 1291 } 1340 1292 ··· 1344 1292 1345 1293 if (ret) { 1346 1294 struct printbuf buf = PRINTBUF; 1347 - bch2_inum_offset_err_msg_trans(trans, &buf, inum, bvec_iter.bi_sector << 9); 1348 - prt_printf(&buf, "read error %i from btree lookup", ret); 1295 + lockrestart_do(trans, 1296 + bch2_inum_offset_err_msg_trans(trans, &buf, inum, 1297 + bvec_iter.bi_sector << 9)); 1298 + prt_printf(&buf, "read error: %s", bch2_err_str(ret)); 1349 1299 bch_err_ratelimited(c, "%s", buf.buf); 1350 1300 printbuf_exit(&buf); 1351 1301 1352 - rbio->bio.bi_status = BLK_STS_IOERR; 1353 - bch2_rbio_done(rbio); 1302 + rbio->bio.bi_status = BLK_STS_IOERR; 1303 + rbio->ret = ret; 1304 + 1305 + if (!(flags & BCH_READ_in_retry)) 1306 + bch2_rbio_done(rbio); 1354 1307 } 1355 1308 1356 - bch2_trans_put(trans); 1357 1309 bch2_bkey_buf_exit(&sk, c); 1310 + return ret; 1358 1311 } 1359 1312 1360 1313 void bch2_fs_io_read_exit(struct bch_fs *c)
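The io_read.c hunks above drop the old READ_RETRY / READ_RETRY_AVOID / READ_ERR return values in favour of negative BCH_ERR_* codes that are tested by class through bch2_err_matches(). A minimal sketch of how class matching over a parent table can work follows. Every name in it (ERR_*, err_parent, err_matches) is hypothetical, the hierarchy shown is an assumption inferred from how the codes are tested in the diff, and it is not the bcachefs errcode implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical private error codes, numbered above the errno range. */
enum {
	ERR_START = 2048,
	ERR_data_read = ERR_START,
	ERR_data_read_retry,
	ERR_data_read_retry_avoid,
	ERR_data_read_retry_csum_err,
	ERR_data_read_ptr_stale_retry,
	ERR_data_read_no_encryption_key,
	ERR_MAX,
};

/* Parent class of each code; 0 marks a root. The hierarchy is assumed. */
static const int err_parent[ERR_MAX - ERR_START] = {
	[ERR_data_read - ERR_START]			= 0,
	[ERR_data_read_retry - ERR_START]		= ERR_data_read,
	[ERR_data_read_retry_avoid - ERR_START]		= ERR_data_read_retry,
	[ERR_data_read_retry_csum_err - ERR_START]	= ERR_data_read_retry_avoid,
	[ERR_data_read_ptr_stale_retry - ERR_START]	= ERR_data_read_retry,
	[ERR_data_read_no_encryption_key - ERR_START]	= ERR_data_read,
};

/*
 * Return true if err is class itself or a descendant of it.
 * Both arguments may be passed positive or negative.
 */
static bool err_matches(int err, int class)
{
	err = abs(err);
	class = abs(class);

	/* Walk up the parent chain until we hit the class or leave the table. */
	while (err >= ERR_START && err < ERR_MAX && err != class)
		err = err_parent[err - ERR_START];

	return err == class;
}
```

With a table like this, a caller can ask "is this any retryable read error?" with a single err_matches(ret, ERR_data_read_retry) test, and still compare against one specific code when it needs to, which mirrors the mix of bch2_err_matches() and direct == comparisons in the hunks above.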
+59 -35
fs/bcachefs/io_read.h
···
 #define _BCACHEFS_IO_READ_H

 #include "bkey_buf.h"
+#include "btree_iter.h"
 #include "reflink.h"

 struct bch_read_bio {
···
 	u16 flags;
 	union {
 	struct {
-	u16 bounce:1,
+	u16 data_update:1,
+		promote:1,
+		bounce:1,
 		split:1,
-		kmalloc:1,
 		have_ioref:1,
 		narrow_crcs:1,
-		hole:1,
-		retry:2,
+		saw_error:1,
 		context:2;
 	};
 	u16 _state;
 	};
-
-	struct bch_devs_list devs_have;
+	s16 ret;

 	struct extent_ptr_decoded pick;

···
 	enum btree_id data_btree;
 	struct bpos data_pos;
 	struct bversion version;
-
-	struct promote_op *promote;

 	struct bch_io_opts opts;

···
 	return 0;
 }

-enum bch_read_flags {
-	BCH_READ_RETRY_IF_STALE = 1 << 0,
-	BCH_READ_MAY_PROMOTE = 1 << 1,
-	BCH_READ_USER_MAPPED = 1 << 2,
-	BCH_READ_NODECODE = 1 << 3,
-	BCH_READ_LAST_FRAGMENT = 1 << 4,
+#define BCH_READ_FLAGS() \
+	x(retry_if_stale) \
+	x(may_promote) \
+	x(user_mapped) \
+	x(last_fragment) \
+	x(must_bounce) \
+	x(must_clone) \
+	x(in_retry)

-	/* internal: */
-	BCH_READ_MUST_BOUNCE = 1 << 5,
-	BCH_READ_MUST_CLONE = 1 << 6,
-	BCH_READ_IN_RETRY = 1 << 7,
+enum __bch_read_flags {
+#define x(n) __BCH_READ_##n,
+	BCH_READ_FLAGS()
+#undef x
+};
+
+enum bch_read_flags {
+#define x(n) BCH_READ_##n = BIT(__BCH_READ_##n),
+	BCH_READ_FLAGS()
+#undef x
 };

 int __bch2_read_extent(struct btree_trans *, struct bch_read_bio *,
 		struct bvec_iter, struct bpos, enum btree_id,
 		struct bkey_s_c, unsigned,
-		struct bch_io_failures *, unsigned);
+		struct bch_io_failures *, unsigned, int);

 static inline void bch2_read_extent(struct btree_trans *trans,
 		struct bch_read_bio *rbio, struct bpos read_pos,
···
 		unsigned offset_into_extent, unsigned flags)
 {
 	__bch2_read_extent(trans, rbio, rbio->bio.bi_iter, read_pos,
-			data_btree, k, offset_into_extent, NULL, flags);
+			data_btree, k, offset_into_extent, NULL, flags, -1);
 }

-void __bch2_read(struct bch_fs *, struct bch_read_bio *, struct bvec_iter,
-		subvol_inum, struct bch_io_failures *, unsigned flags);
+int __bch2_read(struct btree_trans *, struct bch_read_bio *, struct bvec_iter,
+		subvol_inum, struct bch_io_failures *, unsigned flags);

 static inline void bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 		subvol_inum inum)
 {
-	struct bch_io_failures failed = { .nr = 0 };
-
 	BUG_ON(rbio->_state);

-	rbio->c = c;
-	rbio->start_time = local_clock();
 	rbio->subvol = inum.subvol;

-	__bch2_read(c, rbio, rbio->bio.bi_iter, inum, &failed,
-			BCH_READ_RETRY_IF_STALE|
-			BCH_READ_MAY_PROMOTE|
-			BCH_READ_USER_MAPPED);
+	bch2_trans_run(c,
+		__bch2_read(trans, rbio, rbio->bio.bi_iter, inum, NULL,
+			BCH_READ_retry_if_stale|
+			BCH_READ_may_promote|
+			BCH_READ_user_mapped));
 }

-static inline struct bch_read_bio *rbio_init(struct bio *bio,
-		struct bch_io_opts opts)
+static inline struct bch_read_bio *rbio_init_fragment(struct bio *bio,
+		struct bch_read_bio *orig)
 {
 	struct bch_read_bio *rbio = to_rbio(bio);

-	rbio->_state = 0;
-	rbio->promote = NULL;
-	rbio->opts = opts;
+	rbio->c = orig->c;
+	rbio->_state = 0;
+	rbio->flags = 0;
+	rbio->ret = 0;
+	rbio->split = true;
+	rbio->parent = orig;
+	rbio->opts = orig->opts;
+	return rbio;
+}
+
+static inline struct bch_read_bio *rbio_init(struct bio *bio,
+		struct bch_fs *c,
+		struct bch_io_opts opts,
+		bio_end_io_t end_io)
+{
+	struct bch_read_bio *rbio = to_rbio(bio);
+
+	rbio->start_time = local_clock();
+	rbio->c = c;
+	rbio->_state = 0;
+	rbio->flags = 0;
+	rbio->ret = 0;
+	rbio->opts = opts;
+	rbio->bio.bi_end_io = end_io;
 	return rbio;
 }
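The io_read.h change above replaces the hand-numbered bch_read_flags enum with a BCH_READ_FLAGS() x-macro list that is expanded once for bit positions and once for bit masks. The standalone sketch below shows the same pattern in userspace C. The READ_* names are hypothetical stand-ins, `1U << n` stands in for the kernel's BIT() macro, and the name table is only one possible extra use of the list, not something this diff adds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Single list of flag names; every table below is generated from it. */
#define READ_FLAGS()		\
	x(retry_if_stale)	\
	x(may_promote)		\
	x(user_mapped)

/* First expansion: bit positions (0, 1, 2, ...), plus a count. */
enum read_flag_bits {
#define x(n)	READ_BIT_##n,
	READ_FLAGS()
#undef x
	READ_FLAGS_NR
};

/* Second expansion: one mask per flag, derived from its bit position. */
enum read_flags {
#define x(n)	READ_##n = 1U << READ_BIT_##n,
	READ_FLAGS()
#undef x
};

/* Third expansion: printable names, kept in sync with the enums for free. */
static const char * const read_flag_names[] = {
#define x(n)	#n,
	READ_FLAGS()
#undef x
	NULL
};

/* Return the name of the lowest set flag, or NULL if no known flag is set. */
static const char *first_flag_name(unsigned flags)
{
	for (unsigned i = 0; i < (unsigned) READ_FLAGS_NR; i++)
		if (flags & (1U << i))
			return read_flag_names[i];
	return NULL;
}
```

Adding or reordering a flag then means touching only the READ_FLAGS() list; the bit numbers, masks and names cannot drift apart, which is the usual motivation for this pattern over explicit `1 << N` constants.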
+213 -203
fs/bcachefs/io_write.c
··· 34 34 #include <linux/random.h> 35 35 #include <linux/sched/mm.h> 36 36 37 + #ifdef CONFIG_BCACHEFS_DEBUG 38 + static unsigned bch2_write_corrupt_ratio; 39 + module_param_named(write_corrupt_ratio, bch2_write_corrupt_ratio, uint, 0644); 40 + MODULE_PARM_DESC(write_corrupt_ratio, ""); 41 + #endif 42 + 37 43 #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT 38 44 39 45 static inline void bch2_congested_acct(struct bch_dev *ca, u64 io_latency, ··· 380 374 bch2_extent_update(trans, inum, &iter, sk.k, 381 375 &op->res, 382 376 op->new_i_size, &op->i_sectors_delta, 383 - op->flags & BCH_WRITE_CHECK_ENOSPC); 377 + op->flags & BCH_WRITE_check_enospc); 384 378 bch2_trans_iter_exit(trans, &iter); 385 379 386 380 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) ··· 402 396 403 397 /* Writes */ 404 398 405 - static void __bch2_write_op_error(struct printbuf *out, struct bch_write_op *op, 406 - u64 offset) 399 + void bch2_write_op_error(struct bch_write_op *op, u64 offset, const char *fmt, ...) 407 400 { 408 - bch2_inum_offset_err_msg(op->c, out, 409 - (subvol_inum) { op->subvol, op->pos.inode, }, 410 - offset << 9); 411 - prt_printf(out, "write error%s: ", 412 - op->flags & BCH_WRITE_MOVE ? 
"(internal move)" : ""); 401 + struct printbuf buf = PRINTBUF; 402 + 403 + if (op->subvol) { 404 + bch2_inum_offset_err_msg(op->c, &buf, 405 + (subvol_inum) { op->subvol, op->pos.inode, }, 406 + offset << 9); 407 + } else { 408 + struct bpos pos = op->pos; 409 + pos.offset = offset; 410 + bch2_inum_snap_offset_err_msg(op->c, &buf, pos); 411 + } 412 + 413 + prt_str(&buf, "write error: "); 414 + 415 + va_list args; 416 + va_start(args, fmt); 417 + prt_vprintf(&buf, fmt, args); 418 + va_end(args); 419 + 420 + if (op->flags & BCH_WRITE_move) { 421 + struct data_update *u = container_of(op, struct data_update, op); 422 + 423 + prt_printf(&buf, "\n from internal move "); 424 + bch2_bkey_val_to_text(&buf, op->c, bkey_i_to_s_c(u->k.k)); 425 + } 426 + 427 + bch_err_ratelimited(op->c, "%s", buf.buf); 428 + printbuf_exit(&buf); 413 429 } 414 430 415 - void bch2_write_op_error(struct printbuf *out, struct bch_write_op *op) 431 + static void bch2_write_csum_err_msg(struct bch_write_op *op) 416 432 { 417 - __bch2_write_op_error(out, op, op->pos.offset); 418 - } 419 - 420 - static void bch2_write_op_error_trans(struct btree_trans *trans, struct printbuf *out, 421 - struct bch_write_op *op, u64 offset) 422 - { 423 - bch2_inum_offset_err_msg_trans(trans, out, 424 - (subvol_inum) { op->subvol, op->pos.inode, }, 425 - offset << 9); 426 - prt_printf(out, "write error%s: ", 427 - op->flags & BCH_WRITE_MOVE ? 
"(internal move)" : ""); 433 + bch2_write_op_error(op, op->pos.offset, 434 + "error verifying existing checksum while rewriting existing data (memory corruption?)"); 428 435 } 429 436 430 437 void bch2_submit_wbio_replicas(struct bch_write_bio *wbio, struct bch_fs *c, ··· 512 493 bch2_time_stats_update(&c->times[BCH_TIME_data_write], op->start_time); 513 494 bch2_disk_reservation_put(c, &op->res); 514 495 515 - if (!(op->flags & BCH_WRITE_MOVE)) 496 + if (!(op->flags & BCH_WRITE_move)) 516 497 bch2_write_ref_put(c, BCH_WRITE_REF_write); 517 498 bch2_keylist_free(&op->insert_keys, op->inline_keys); 518 499 ··· 535 516 test_bit(ptr->dev, op->failed.d)); 536 517 537 518 if (!bch2_bkey_nr_ptrs(bkey_i_to_s_c(src))) 538 - return -EIO; 519 + return -BCH_ERR_data_write_io; 539 520 } 540 521 541 522 if (dst != src) ··· 558 539 unsigned dev; 559 540 int ret = 0; 560 541 561 - if (unlikely(op->flags & BCH_WRITE_IO_ERROR)) { 542 + if (unlikely(op->flags & BCH_WRITE_io_error)) { 562 543 ret = bch2_write_drop_io_error_ptrs(op); 563 544 if (ret) 564 545 goto err; ··· 567 548 if (!bch2_keylist_empty(keys)) { 568 549 u64 sectors_start = keylist_sectors(keys); 569 550 570 - ret = !(op->flags & BCH_WRITE_MOVE) 551 + ret = !(op->flags & BCH_WRITE_move) 571 552 ? 
bch2_write_index_default(op) 572 553 : bch2_data_update_index_update(op); 573 554 ··· 579 560 if (unlikely(ret && !bch2_err_matches(ret, EROFS))) { 580 561 struct bkey_i *insert = bch2_keylist_front(&op->insert_keys); 581 562 582 - struct printbuf buf = PRINTBUF; 583 - __bch2_write_op_error(&buf, op, bkey_start_offset(&insert->k)); 584 - prt_printf(&buf, "btree update error: %s", bch2_err_str(ret)); 585 - bch_err_ratelimited(c, "%s", buf.buf); 586 - printbuf_exit(&buf); 563 + bch2_write_op_error(op, bkey_start_offset(&insert->k), 564 + "btree update error: %s", bch2_err_str(ret)); 587 565 } 588 566 589 567 if (ret) ··· 589 573 out: 590 574 /* If some a bucket wasn't written, we can't erasure code it: */ 591 575 for_each_set_bit(dev, op->failed.d, BCH_SB_MEMBERS_MAX) 592 - bch2_open_bucket_write_error(c, &op->open_buckets, dev); 576 + bch2_open_bucket_write_error(c, &op->open_buckets, dev, -BCH_ERR_data_write_io); 593 577 594 578 bch2_open_buckets_put(c, &op->open_buckets); 595 579 return; 596 580 err: 597 581 keys->top = keys->keys; 598 582 op->error = ret; 599 - op->flags |= BCH_WRITE_SUBMITTED; 583 + op->flags |= BCH_WRITE_submitted; 600 584 goto out; 601 585 } 602 586 603 587 static inline void __wp_update_state(struct write_point *wp, enum write_point_state state) 604 588 { 605 589 if (state != wp->state) { 590 + struct task_struct *p = current; 606 591 u64 now = ktime_get_ns(); 592 + u64 runtime = p->se.sum_exec_runtime + 593 + (now - p->se.exec_start); 594 + 595 + if (state == WRITE_POINT_runnable) 596 + wp->last_runtime = runtime; 597 + else if (wp->state == WRITE_POINT_runnable) 598 + wp->time[WRITE_POINT_running] += runtime - wp->last_runtime; 607 599 608 600 if (wp->last_state_change && 609 601 time_after64(now, wp->last_state_change)) ··· 625 601 { 626 602 enum write_point_state state; 627 603 628 - state = running ? WRITE_POINT_running : 604 + state = running ? WRITE_POINT_runnable: 629 605 !list_empty(&wp->writes) ? 
WRITE_POINT_waiting_io 630 606 : WRITE_POINT_stopped; 631 607 ··· 639 615 struct workqueue_struct *wq = index_update_wq(op); 640 616 unsigned long flags; 641 617 642 - if ((op->flags & BCH_WRITE_SUBMITTED) && 643 - (op->flags & BCH_WRITE_MOVE)) 618 + if ((op->flags & BCH_WRITE_submitted) && 619 + (op->flags & BCH_WRITE_move)) 644 620 bch2_bio_free_pages_pool(op->c, &op->wbio.bio); 645 621 646 622 spin_lock_irqsave(&wp->writes_lock, flags); ··· 678 654 if (!op) 679 655 break; 680 656 681 - op->flags |= BCH_WRITE_IN_WORKER; 657 + op->flags |= BCH_WRITE_in_worker; 682 658 683 659 __bch2_write_index(op); 684 660 685 - if (!(op->flags & BCH_WRITE_SUBMITTED)) 661 + if (!(op->flags & BCH_WRITE_submitted)) 686 662 __bch2_write(op); 687 663 else 688 664 bch2_write_done(&op->cl); ··· 700 676 ? bch2_dev_have_ref(c, wbio->dev) 701 677 : NULL; 702 678 703 - if (bch2_dev_inum_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write, 679 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write, 680 + wbio->submit_time, !bio->bi_status); 681 + 682 + if (bio->bi_status) { 683 + bch_err_inum_offset_ratelimited(ca, 704 684 op->pos.inode, 705 685 wbio->inode_offset << 9, 706 686 "data write error: %s", 707 - bch2_blk_status_to_str(bio->bi_status))) { 687 + bch2_blk_status_to_str(bio->bi_status)); 708 688 set_bit(wbio->dev, op->failed.d); 709 - op->flags |= BCH_WRITE_IO_ERROR; 689 + op->flags |= BCH_WRITE_io_error; 710 690 } 711 691 712 692 if (wbio->nocow) { ··· 720 692 set_bit(wbio->dev, op->devs_need_flush->d); 721 693 } 722 694 723 - if (wbio->have_ioref) { 724 - bch2_latency_acct(ca, wbio->submit_time, WRITE); 695 + if (wbio->have_ioref) 725 696 percpu_ref_put(&ca->io_ref); 726 - } 727 697 728 698 if (wbio->bounce) 729 699 bch2_bio_free_pages_pool(c, bio); ··· 755 729 bch2_extent_crc_append(&e->k_i, crc); 756 730 757 731 bch2_alloc_sectors_append_ptrs_inlined(op->c, wp, &e->k_i, crc.compressed_size, 758 - op->flags & BCH_WRITE_CACHED); 732 + op->flags & BCH_WRITE_cached); 759 733 760 
 	bch2_keylist_push(&op->insert_keys);
 }
···
 {
 	struct bio *bio = &op->wbio.bio;
 	struct bch_extent_crc_unpacked new_crc;
-	int ret;

 	/* bch2_rechecksum_bio() can't encrypt or decrypt data: */

···
 	    bch2_csum_type_is_encryption(new_csum_type))
 		new_csum_type = op->crc.csum_type;

-	ret = bch2_rechecksum_bio(c, bio, op->version, op->crc,
-				  NULL, &new_crc,
-				  op->crc.offset, op->crc.live_size,
-				  new_csum_type);
+	int ret = bch2_rechecksum_bio(c, bio, op->version, op->crc,
+				      NULL, &new_crc,
+				      op->crc.offset, op->crc.live_size,
+				      new_csum_type);
 	if (ret)
 		return ret;

···
 	return 0;
 }

-static int bch2_write_decrypt(struct bch_write_op *op)
-{
-	struct bch_fs *c = op->c;
-	struct nonce nonce = extent_nonce(op->version, op->crc);
-	struct bch_csum csum;
-	int ret;
-
-	if (!bch2_csum_type_is_encryption(op->crc.csum_type))
-		return 0;
-
-	/*
-	 * If we need to decrypt data in the write path, we'll no longer be able
-	 * to verify the existing checksum (poly1305 mac, in this case) after
-	 * it's decrypted - this is the last point we'll be able to reverify the
-	 * checksum:
-	 */
-	csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, &op->wbio.bio);
-	if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io)
-		return -EIO;
-
-	ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, &op->wbio.bio);
-	op->crc.csum_type = 0;
-	op->crc.csum = (struct bch_csum) { 0, 0 };
-	return ret;
-}
-
-static enum prep_encoded_ret {
-	PREP_ENCODED_OK,
-	PREP_ENCODED_ERR,
-	PREP_ENCODED_CHECKSUM_ERR,
-	PREP_ENCODED_DO_WRITE,
-} bch2_write_prep_encoded_data(struct bch_write_op *op, struct write_point *wp)
+static noinline int bch2_write_prep_encoded_data(struct bch_write_op *op, struct write_point *wp)
 {
 	struct bch_fs *c = op->c;
 	struct bio *bio = &op->wbio.bio;
-
-	if (!(op->flags & BCH_WRITE_DATA_ENCODED))
-		return PREP_ENCODED_OK;
+	struct nonce nonce = extent_nonce(op->version, op->crc);
+	int ret = 0;

 	BUG_ON(bio_sectors(bio) != op->crc.compressed_size);

···
 	    (op->crc.compression_type == bch2_compression_opt_to_type(op->compression_opt) ||
 	     op->incompressible)) {
 		if (!crc_is_compressed(op->crc) &&
-		    op->csum_type != op->crc.csum_type &&
-		    bch2_write_rechecksum(c, op, op->csum_type) &&
-		    !c->opts.no_data_io)
-			return PREP_ENCODED_CHECKSUM_ERR;
+		    op->csum_type != op->crc.csum_type) {
+			ret = bch2_write_rechecksum(c, op, op->csum_type);
+			if (ret)
+				return ret;
+		}

-		return PREP_ENCODED_DO_WRITE;
+		return 1;
 	}

 	/*
···
 	 * is, we have to decompress it:
 	 */
 	if (crc_is_compressed(op->crc)) {
-		struct bch_csum csum;
-
-		if (bch2_write_decrypt(op))
-			return PREP_ENCODED_CHECKSUM_ERR;
-
 		/* Last point we can still verify checksum: */
-		csum = bch2_checksum_bio(c, op->crc.csum_type,
-					 extent_nonce(op->version, op->crc),
-					 bio);
+		struct bch_csum csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, bio);
 		if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io)
-			return PREP_ENCODED_CHECKSUM_ERR;
+			goto csum_err;

-		if (bch2_bio_uncompress_inplace(op, bio))
-			return PREP_ENCODED_ERR;
+		if (bch2_csum_type_is_encryption(op->crc.csum_type)) {
+			ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, bio);
+			if (ret)
+				return ret;
+
+			op->crc.csum_type = 0;
+			op->crc.csum = (struct bch_csum) { 0, 0 };
+		}
+
+		ret = bch2_bio_uncompress_inplace(op, bio);
+		if (ret)
+			return ret;
 	}

 	/*
···
 	 * If the data is checksummed and we're only writing a subset,
 	 * rechecksum and adjust bio to point to currently live data:
 	 */
-	if ((op->crc.live_size != op->crc.uncompressed_size ||
-	     op->crc.csum_type != op->csum_type) &&
-	    bch2_write_rechecksum(c, op, op->csum_type) &&
-	    !c->opts.no_data_io)
-		return PREP_ENCODED_CHECKSUM_ERR;
+	if (op->crc.live_size != op->crc.uncompressed_size ||
+	    op->crc.csum_type != op->csum_type) {
+		ret = bch2_write_rechecksum(c, op, op->csum_type);
+		if (ret)
+			return ret;
+	}

 	/*
 	 * If we want to compress the data, it has to be decrypted:
 	 */
-	if ((op->compression_opt ||
-	     bch2_csum_type_is_encryption(op->crc.csum_type) !=
-	     bch2_csum_type_is_encryption(op->csum_type)) &&
-	    bch2_write_decrypt(op))
-		return PREP_ENCODED_CHECKSUM_ERR;
+	if (bch2_csum_type_is_encryption(op->crc.csum_type) &&
+	    (op->compression_opt || op->crc.csum_type != op->csum_type)) {
+		struct bch_csum csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, bio);
+		if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io)
+			goto csum_err;

-	return PREP_ENCODED_OK;
+		ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, bio);
+		if (ret)
+			return ret;
+
+		op->crc.csum_type = 0;
+		op->crc.csum = (struct bch_csum) { 0, 0 };
+	}
+
+	return 0;
+csum_err:
+	bch2_write_csum_err_msg(op);
+	return -BCH_ERR_data_write_csum;
 }

 static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
···

 	ec_buf = bch2_writepoint_ec_buf(c, wp);

-	switch (bch2_write_prep_encoded_data(op, wp)) {
-	case PREP_ENCODED_OK:
-		break;
-	case PREP_ENCODED_ERR:
-		ret = -EIO;
-		goto err;
-	case PREP_ENCODED_CHECKSUM_ERR:
-		goto csum_err;
-	case PREP_ENCODED_DO_WRITE:
-		/* XXX look for bug here */
-		if (ec_buf) {
-			dst = bch2_write_bio_alloc(c, wp, src,
-						   &page_alloc_failed,
-						   ec_buf);
-			bio_copy_data(dst, src);
-			bounce = true;
+	if (unlikely(op->flags & BCH_WRITE_data_encoded)) {
+		ret = bch2_write_prep_encoded_data(op, wp);
+		if (ret < 0)
+			goto err;
+		if (ret) {
+			if (ec_buf) {
+				dst = bch2_write_bio_alloc(c, wp, src,
+							   &page_alloc_failed,
+							   ec_buf);
+				bio_copy_data(dst, src);
+				bounce = true;
+			}
+			init_append_extent(op, wp, op->version, op->crc);
+			goto do_write;
 		}
-		init_append_extent(op, wp, op->version, op->crc);
-		goto do_write;
 	}

 	if (ec_buf ||
 	    op->compression_opt ||
 	    (op->csum_type &&
-	     !(op->flags & BCH_WRITE_PAGES_STABLE)) ||
+	     !(op->flags & BCH_WRITE_pages_stable)) ||
 	    (bch2_csum_type_is_encryption(op->csum_type) &&
-	     !(op->flags & BCH_WRITE_PAGES_OWNED))) {
+	     !(op->flags & BCH_WRITE_pages_owned))) {
 		dst = bch2_write_bio_alloc(c, wp, src,
 					   &page_alloc_failed,
 					   ec_buf);
 		bounce = true;
 	}

+#ifdef CONFIG_BCACHEFS_DEBUG
+	unsigned write_corrupt_ratio = READ_ONCE(bch2_write_corrupt_ratio);
+	if (!bounce && write_corrupt_ratio) {
+		dst = bch2_write_bio_alloc(c, wp, src,
+					   &page_alloc_failed,
+					   ec_buf);
+		bounce = true;
+	}
+#endif
 	saved_iter = dst->bi_iter;

 	do {
···
 			break;

 		BUG_ON(op->compression_opt &&
-		       (op->flags & BCH_WRITE_DATA_ENCODED) &&
+		       (op->flags & BCH_WRITE_data_encoded) &&
 		       bch2_csum_type_is_encryption(op->crc.csum_type));
 		BUG_ON(op->compression_opt && !bounce);

···
 			}
 		}

-		if ((op->flags & BCH_WRITE_DATA_ENCODED) &&
+		if ((op->flags & BCH_WRITE_data_encoded) &&
 		    !crc_is_compressed(crc) &&
 		    bch2_csum_type_is_encryption(op->crc.csum_type) ==
 		    bch2_csum_type_is_encryption(op->csum_type)) {
···
 			crc.compression_type = compression_type;
 			crc.nonce = nonce;
 		} else {
-			if ((op->flags & BCH_WRITE_DATA_ENCODED) &&
+			if ((op->flags & BCH_WRITE_data_encoded) &&
 			    bch2_rechecksum_bio(c, src, version, op->crc,
 					NULL, &op->crc,
 					src_len >> 9,
···
 		}

 		init_append_extent(op, wp, version, crc);
+
+#ifdef CONFIG_BCACHEFS_DEBUG
+		if (write_corrupt_ratio) {
+			swap(dst->bi_iter.bi_size, dst_len);
+			bch2_maybe_corrupt_bio(dst, write_corrupt_ratio);
+			swap(dst->bi_iter.bi_size, dst_len);
+		}
+#endif

 		if (dst != src)
 			bio_advance(dst, dst_len);
···
 	*_dst = dst;
 	return more;
 csum_err:
-	{
-		struct printbuf buf = PRINTBUF;
-		bch2_write_op_error(&buf, op);
-		prt_printf(&buf, "error verifying existing checksum while rewriting existing data (memory corruption?)");
-		bch_err_ratelimited(c, "%s", buf.buf);
-		printbuf_exit(&buf);
-	}
-
-	ret = -EIO;
+	bch2_write_csum_err_msg(op);
+	ret = -BCH_ERR_data_write_csum;
 err:
 	if (to_wbio(dst)->bounce)
 		bch2_bio_free_pages_pool(c, dst);
···
 {
 	struct bch_fs *c = op->c;
 	struct btree_trans *trans = bch2_trans_get(c);
+	int ret = 0;

 	for_each_keylist_key(&op->insert_keys, orig) {
-		int ret = for_each_btree_key_max_commit(trans, iter, BTREE_ID_extents,
+		ret = for_each_btree_key_max_commit(trans, iter, BTREE_ID_extents,
 				     bkey_start_pos(&orig->k), orig->k.p,
 				     BTREE_ITER_intent, k,
 				     NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({
 			bch2_nocow_write_convert_one_unwritten(trans, &iter, orig, k, op->new_i_size);
 		}));
-
-		if (ret && !bch2_err_matches(ret, EROFS)) {
-			struct bkey_i *insert = bch2_keylist_front(&op->insert_keys);
-
-			struct printbuf buf = PRINTBUF;
-			bch2_write_op_error_trans(trans, &buf, op, bkey_start_offset(&insert->k));
-			prt_printf(&buf, "btree update error: %s", bch2_err_str(ret));
-			bch_err_ratelimited(c, "%s", buf.buf);
-			printbuf_exit(&buf);
-		}
-
-		if (ret) {
-			op->error = ret;
+		if (ret)
 			break;
-		}
 	}

 	bch2_trans_put(trans);
+
+	if (ret && !bch2_err_matches(ret, EROFS)) {
+		struct bkey_i *insert = bch2_keylist_front(&op->insert_keys);
+		bch2_write_op_error(op, bkey_start_offset(&insert->k),
+				    "btree update error: %s", bch2_err_str(ret));
+	}
+
+	if (ret)
+		op->error = ret;
 }

 static void __bch2_nocow_write_done(struct bch_write_op *op)
 {
-	if (unlikely(op->flags & BCH_WRITE_IO_ERROR)) {
-		op->error = -EIO;
-	} else if (unlikely(op->flags & BCH_WRITE_CONVERT_UNWRITTEN))
+	if (unlikely(op->flags & BCH_WRITE_io_error)) {
+		op->error = -BCH_ERR_data_write_io;
+	} else if (unlikely(op->flags & BCH_WRITE_convert_unwritten))
 		bch2_nocow_write_convert_unwritten(op);
 }

···
 	struct bucket_to_lock *stale_at;
 	int stale, ret;

-	if (op->flags & BCH_WRITE_MOVE)
+	if (op->flags & BCH_WRITE_move)
 		return;

 	darray_init(&buckets);
···
 			}), GFP_KERNEL|__GFP_NOFAIL);

 			if (ptr->unwritten)
-				op->flags |= BCH_WRITE_CONVERT_UNWRITTEN;
+				op->flags |= BCH_WRITE_convert_unwritten;
 		}

 		/* Unlock before taking nocow locks, doing IO: */
···
 		bch2_trans_unlock(trans);

 		bch2_cut_front(op->pos, op->insert_keys.top);
-		if (op->flags & BCH_WRITE_CONVERT_UNWRITTEN)
+		if (op->flags & BCH_WRITE_convert_unwritten)
 			bch2_cut_back(POS(op->pos.inode, op->pos.offset + bio_sectors(bio)), op->insert_keys.top);

 		darray_for_each(buckets, i) {
···
 			wbio_init(bio)->put_bio = true;
 			bio->bi_opf = op->wbio.bio.bi_opf;
 		} else {
-			op->flags |= BCH_WRITE_SUBMITTED;
+			op->flags |= BCH_WRITE_submitted;
 		}

 		op->pos.offset += bio_sectors(bio);
···
 		bio->bi_private = &op->cl;
 		bio->bi_opf |= REQ_OP_WRITE;
 		closure_get(&op->cl);
+
 		bch2_submit_wbio_replicas(to_wbio(bio), c, BCH_DATA_user,
 					  op->insert_keys.top, true);

 		bch2_keylist_push(&op->insert_keys);
-		if (op->flags & BCH_WRITE_SUBMITTED)
+		if (op->flags & BCH_WRITE_submitted)
 			break;
 		bch2_btree_iter_advance(&iter);
 	}
···
 	darray_exit(&buckets);

 	if (ret) {
-		struct printbuf buf = PRINTBUF;
-		bch2_write_op_error(&buf, op);
-		prt_printf(&buf, "%s(): btree lookup error: %s", __func__, bch2_err_str(ret));
-		bch_err_ratelimited(c, "%s", buf.buf);
-		printbuf_exit(&buf);
+		bch2_write_op_error(op, op->pos.offset,
+				    "%s(): btree lookup error: %s", __func__, bch2_err_str(ret));
 		op->error = ret;
-		op->flags |= BCH_WRITE_SUBMITTED;
+		op->flags |= BCH_WRITE_submitted;
 	}

 	/* fallback to cow write path? */
-	if (!(op->flags & BCH_WRITE_SUBMITTED)) {
+	if (!(op->flags & BCH_WRITE_submitted)) {
 		closure_sync(&op->cl);
 		__bch2_nocow_write_done(op);
 		op->insert_keys.top = op->insert_keys.keys;
-	} else if (op->flags & BCH_WRITE_SYNC) {
+	} else if (op->flags & BCH_WRITE_sync) {
 		closure_sync(&op->cl);
 		bch2_nocow_write_done(&op->cl.work);
 	} else {
···
 			"pointer to invalid bucket in nocow path on device %llu\n %s",
 			stale_at->b.inode,
 			(bch2_bkey_val_to_text(&buf, c, k), buf.buf))) {
-		ret = -EIO;
+		ret = -BCH_ERR_data_write_invalid_ptr;
 	} else {
 		/* We can retry this: */
 		ret = -BCH_ERR_transaction_restart;
···

 	if (unlikely(op->opts.nocow && c->opts.nocow_enabled)) {
 		bch2_nocow_write(op);
-		if (op->flags & BCH_WRITE_SUBMITTED)
+		if (op->flags & BCH_WRITE_submitted)
 			goto out_nofs_restore;
 	}
 again:
···
 		ret = bch2_trans_run(c, lockrestart_do(trans,
 			bch2_alloc_sectors_start_trans(trans,
 				op->target,
-				op->opts.erasure_code && !(op->flags & BCH_WRITE_CACHED),
+				op->opts.erasure_code && !(op->flags & BCH_WRITE_cached),
 				op->write_point,
 				&op->devs_have,
 				op->nr_replicas,
···
 		bch2_alloc_sectors_done_inlined(c, wp);
 err:
 		if (ret <= 0) {
-			op->flags |= BCH_WRITE_SUBMITTED;
+			op->flags |= BCH_WRITE_submitted;

 			if (unlikely(ret < 0)) {
-				if (!(op->flags & BCH_WRITE_ALLOC_NOWAIT)) {
-					struct printbuf buf = PRINTBUF;
-					bch2_write_op_error(&buf, op);
-					prt_printf(&buf, "%s(): %s", __func__, bch2_err_str(ret));
-					bch_err_ratelimited(c, "%s", buf.buf);
-					printbuf_exit(&buf);
-				}
+				if (!(op->flags & BCH_WRITE_alloc_nowait))
+					bch2_write_op_error(op, op->pos.offset,
+							    "%s(): %s", __func__, bch2_err_str(ret));
 				op->error = ret;
 				break;
 			}
···
 	 * synchronously here if we weren't able to submit all of the IO at
 	 * once, as that signals backpressure to the caller.
 	 */
-	if ((op->flags & BCH_WRITE_SYNC) ||
-	    (!(op->flags & BCH_WRITE_SUBMITTED) &&
-	     !(op->flags & BCH_WRITE_IN_WORKER))) {
+	if ((op->flags & BCH_WRITE_sync) ||
+	    (!(op->flags & BCH_WRITE_submitted) &&
+	     !(op->flags & BCH_WRITE_in_worker))) {
 		bch2_wait_on_allocator(c, &op->cl);

 		__bch2_write_index(op);

-		if (!(op->flags & BCH_WRITE_SUBMITTED))
+		if (!(op->flags & BCH_WRITE_submitted))
 			goto again;
 		bch2_write_done(&op->cl);
 	} else {
···

 	memset(&op->failed, 0, sizeof(op->failed));

-	op->flags |= BCH_WRITE_WROTE_DATA_INLINE;
-	op->flags |= BCH_WRITE_SUBMITTED;
+	op->flags |= BCH_WRITE_wrote_data_inline;
+	op->flags |= BCH_WRITE_submitted;

 	bch2_check_set_feature(op->c, BCH_FEATURE_inline_data);

···
 	BUG_ON(!op->write_point.v);
 	BUG_ON(bkey_eq(op->pos, POS_MAX));

-	if (op->flags & BCH_WRITE_ONLY_SPECIFIED_DEVS)
-		op->flags |= BCH_WRITE_ALLOC_NOWAIT;
+	if (op->flags & BCH_WRITE_only_specified_devs)
+		op->flags |= BCH_WRITE_alloc_nowait;

 	op->nr_replicas_required = min_t(unsigned, op->nr_replicas_required, op->nr_replicas);
 	op->start_time = local_clock();
···
 	wbio_init(bio)->put_bio = false;

 	if (unlikely(bio->bi_iter.bi_size & (c->opts.block_size - 1))) {
-		struct printbuf buf = PRINTBUF;
-		bch2_write_op_error(&buf, op);
-		prt_printf(&buf, "misaligned write");
-		printbuf_exit(&buf);
-		op->error = -EIO;
+		bch2_write_op_error(op, op->pos.offset, "misaligned write");
+		op->error = -BCH_ERR_data_write_misaligned;
 		goto err;
 	}

···
 		goto err;
 	}

-	if (!(op->flags & BCH_WRITE_MOVE) &&
+	if (!(op->flags & BCH_WRITE_move) &&
 	    !bch2_write_ref_tryget(c, BCH_WRITE_REF_write)) {
 		op->error = -BCH_ERR_erofs_no_writes;
 		goto err;
 	}

-	this_cpu_add(c->counters[BCH_COUNTER_io_write], bio_sectors(bio));
+	if (!(op->flags & BCH_WRITE_move))
+		this_cpu_add(c->counters[BCH_COUNTER_io_write], bio_sectors(bio));
 	bch2_increment_clock(c, bio_sectors(bio), WRITE);

 	data_len = min_t(u64, bio->bi_iter.bi_size,
···

 void bch2_write_op_to_text(struct printbuf *out, struct bch_write_op *op)
 {
-	prt_str(out, "pos: ");
+	if (!out->nr_tabstops)
+		printbuf_tabstop_push(out, 32);
+
+	prt_printf(out, "pos:\t");
 	bch2_bpos_to_text(out, op->pos);
 	prt_newline(out);
 	printbuf_indent_add(out, 2);

-	prt_str(out, "started: ");
+	prt_printf(out, "started:\t");
 	bch2_pr_time_units(out, local_clock() - op->start_time);
 	prt_newline(out);

-	prt_str(out, "flags: ");
+	prt_printf(out, "flags:\t");
 	prt_bitflags(out, bch2_write_flags, op->flags);
 	prt_newline(out);

-	prt_printf(out, "ref: %u\n", closure_nr_remaining(&op->cl));
+	prt_printf(out, "nr_replicas:\t%u\n", op->nr_replicas);
+	prt_printf(out, "nr_replicas_required:\t%u\n", op->nr_replicas_required);
+
+	prt_printf(out, "ref:\t%u\n", closure_nr_remaining(&op->cl));

 	printbuf_indent_sub(out, 2);
 }
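Note on the return convention: in this diff, bch2_write_prep_encoded_data() stops returning a four-value enum and returns a plain int instead: negative for an error, 0 to carry on down the normal write path, and 1 when the already-encoded data can be written as it is. The sketch below illustrates only that calling convention. Every name in it (prep_encoded, write_extent, ERR_CSUM, PATH_NORMAL, PATH_DIRECT) is hypothetical and stands in for kernel code that is not reproduced here.

```c
#include <assert.h>

/* Hypothetical error code, standing in for BCH_ERR_data_write_csum. */
#define ERR_CSUM 1001

/* Illustrative markers for which path the caller ended up taking. */
enum { PATH_NORMAL = 0, PATH_DIRECT = 1 };

/*
 * Returns <0 on error, 0 if the caller should continue with the normal
 * write path, 1 if the encoded data can be written unchanged.
 */
static int prep_encoded(int can_write_as_is, int csum_ok)
{
	if (can_write_as_is)
		return 1;
	if (!csum_ok)
		return -ERR_CSUM;
	return 0;
}

/* Caller: one "ret < 0" test and one "ret" test replace the old switch. */
static int write_extent(int data_encoded, int can_write_as_is, int csum_ok)
{
	if (data_encoded) {
		int ret = prep_encoded(can_write_as_is, csum_ok);
		if (ret < 0)
			return ret;		/* propagate the specific error */
		if (ret)
			return PATH_DIRECT;	/* skip straight to the write */
	}
	return PATH_NORMAL;
}
```

With this shape the error code chosen inside the helper reaches the caller unchanged, rather than being collapsed into a generic -EIO at the call site.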
+16 -22
fs/bcachefs/io_write.h
···
 void bch2_bio_free_pages_pool(struct bch_fs *, struct bio *);
 void bch2_bio_alloc_pages_pool(struct bch_fs *, struct bio *, size_t);

-#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
-void bch2_latency_acct(struct bch_dev *, u64, int);
-#else
-static inline void bch2_latency_acct(struct bch_dev *ca, u64 submit_time, int rw) {}
-#endif
-
 void bch2_submit_wbio_replicas(struct bch_write_bio *, struct bch_fs *,
 			       enum bch_data_type, const struct bkey_i *, bool);

-void bch2_write_op_error(struct printbuf *out, struct bch_write_op *op);
+__printf(3, 4)
+void bch2_write_op_error(struct bch_write_op *op, u64, const char *, ...);

 #define BCH_WRITE_FLAGS()		\
-	x(ALLOC_NOWAIT)			\
-	x(CACHED)			\
-	x(DATA_ENCODED)			\
-	x(PAGES_STABLE)			\
-	x(PAGES_OWNED)			\
-	x(ONLY_SPECIFIED_DEVS)		\
-	x(WROTE_DATA_INLINE)		\
-	x(FROM_INTERNAL)		\
-	x(CHECK_ENOSPC)			\
-	x(SYNC)				\
-	x(MOVE)				\
-	x(IN_WORKER)			\
-	x(SUBMITTED)			\
-	x(IO_ERROR)			\
-	x(CONVERT_UNWRITTEN)
+	x(alloc_nowait)			\
+	x(cached)			\
+	x(data_encoded)			\
+	x(pages_stable)			\
+	x(pages_owned)			\
+	x(only_specified_devs)		\
+	x(wrote_data_inline)		\
+	x(check_enospc)			\
+	x(sync)				\
+	x(move)				\
+	x(in_worker)			\
+	x(submitted)			\
+	x(io_error)			\
+	x(convert_unwritten)

 enum __bch_write_flags {
 #define x(f)	__BCH_WRITE_##f,
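Note on the flag list: BCH_WRITE_FLAGS() is an x-macro, so renaming the entries in that one list changes every identifier generated from it, and therefore every BCH_WRITE_* use in io_write.c. The sketch below shows the general technique with a hypothetical three-entry list. The DEMO_* names are invented for illustration, and the mask enum and name table are assumptions about how such a list is typically expanded, not a copy of the bcachefs header.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag list; the real one is BCH_WRITE_FLAGS(). */
#define DEMO_WRITE_FLAGS()	\
	x(alloc_nowait)		\
	x(cached)		\
	x(sync)

/* First expansion: one bit number per entry, in list order. */
enum demo_write_flag_bits {
#define x(f)	DEMO_WRITE_BIT_##f,
	DEMO_WRITE_FLAGS()
#undef x
	DEMO_WRITE_FLAG_NR
};

/* Second expansion: a single-bit mask per entry. */
enum demo_write_flags {
#define x(f)	DEMO_WRITE_##f = 1U << DEMO_WRITE_BIT_##f,
	DEMO_WRITE_FLAGS()
#undef x
};

/* Third expansion: printable names, indexed by bit number. */
static const char * const demo_write_flag_names[] = {
#define x(f)	#f,
	DEMO_WRITE_FLAGS()
#undef x
	NULL
};
```

Because the names are stringified from the same list, lower-casing the entries also lower-cases whatever a flag printer built on such a table emits.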
+1 -1
fs/bcachefs/io_write_types.h
···
 	struct bpos		pos;
 	struct bversion		version;

-	/* For BCH_WRITE_DATA_ENCODED: */
+	/* For BCH_WRITE_data_encoded: */
 	struct bch_extent_crc_unpacked crc;

 	struct write_point_specifier write_point;
+131 -62
fs/bcachefs/journal.c
···
 #include "journal_seq_blacklist.h"
 #include "trace.h"

-static const char * const bch2_journal_errors[] = {
-#define x(n)	#n,
-	JOURNAL_ERRORS()
-#undef x
-	NULL
-};
-
 static inline bool journal_seq_unwritten(struct journal *j, u64 seq)
 {
 	return seq > j->seq_ondisk;
···
 	prt_printf(out, "seq:\t%llu\n", seq);
 	printbuf_indent_add(out, 2);

-	prt_printf(out, "refcount:\t%u\n", journal_state_count(s, i));
+	if (!buf->write_started)
+		prt_printf(out, "refcount:\t%u\n", journal_state_count(s, i & JOURNAL_STATE_BUF_MASK));

-	prt_printf(out, "size:\t");
-	prt_human_readable_u64(out, vstruct_bytes(buf->data));
-	prt_newline(out);
+	struct closure *cl = &buf->io;
+	int r = atomic_read(&cl->remaining);
+	prt_printf(out, "io:\t%pS r %i\n", cl->fn, r & CLOSURE_REMAINING_MASK);
+
+	if (buf->data) {
+		prt_printf(out, "size:\t");
+		prt_human_readable_u64(out, vstruct_bytes(buf->data));
+		prt_newline(out);
+	}

 	prt_printf(out, "expires:\t");
 	prt_printf(out, "%li jiffies\n", buf->expires - jiffies);
···

 static void bch2_journal_bufs_to_text(struct printbuf *out, struct journal *j)
 {
+	lockdep_assert_held(&j->lock);
+	out->atomic++;
+
 	if (!out->nr_tabstops)
 		printbuf_tabstop_push(out, 24);

···
 	     seq++)
 		bch2_journal_buf_to_text(out, j, seq);
 	prt_printf(out, "last buf %s\n", journal_entry_is_open(j) ? "open" : "closed");
+
+	--out->atomic;
 }

 static inline struct journal_buf *
···

 	EBUG_ON(seq > journal_cur_seq(j));

-	if (journal_seq_unwritten(j, seq)) {
+	if (journal_seq_unwritten(j, seq))
 		buf = j->buf + (seq & JOURNAL_BUF_MASK);
-		EBUG_ON(le64_to_cpu(buf->data->seq) != seq);
-	}
 	return buf;
 }

···
 	bool stuck = false;
 	struct printbuf buf = PRINTBUF;

-	if (!(error == JOURNAL_ERR_journal_full ||
-	      error == JOURNAL_ERR_journal_pin_full) ||
+	if (!(error == -BCH_ERR_journal_full ||
+	      error == -BCH_ERR_journal_pin_full) ||
 	    nr_unwritten_journal_entries(j) ||
 	    (flags & BCH_WATERMARK_MASK) != BCH_WATERMARK_reclaim)
 		return stuck;
···
 	spin_unlock(&j->lock);

 	bch_err(c, "Journal stuck! Hava a pre-reservation but journal full (error %s)",
-		bch2_journal_errors[error]);
+		bch2_err_str(error));
 	bch2_journal_debug_to_text(&buf, j);
 	bch_err(c, "%s", buf.buf);

···
 		if (w->write_started)
 			continue;

-		if (!journal_state_count(j->reservations, idx)) {
+		if (!journal_state_seq_count(j, j->reservations, seq)) {
+			j->seq_write_started = seq;
 			w->write_started = true;
 			closure_call(&w->io, bch2_journal_write, j->wq, NULL);
 		}
···

 	bch2_journal_space_available(j);

-	__bch2_journal_buf_put(j, old.idx, le64_to_cpu(buf->data->seq));
+	__bch2_journal_buf_put(j, le64_to_cpu(buf->data->seq));
 }

 void bch2_journal_halt(struct journal *j)
···
 	BUG_ON(BCH_SB_CLEAN(c->disk_sb.sb));

 	if (j->blocked)
-		return JOURNAL_ERR_blocked;
+		return -BCH_ERR_journal_blocked;

 	if (j->cur_entry_error)
 		return j->cur_entry_error;

-	if (bch2_journal_error(j))
-		return JOURNAL_ERR_insufficient_devices; /* -EROFS */
+	int ret = bch2_journal_error(j);
+	if (unlikely(ret))
+		return ret;

 	if (!fifo_free(&j->pin))
-		return JOURNAL_ERR_journal_pin_full;
+		return -BCH_ERR_journal_pin_full;

 	if (nr_unwritten_journal_entries(j) == ARRAY_SIZE(j->buf))
-		return JOURNAL_ERR_max_in_flight;
+		return -BCH_ERR_journal_max_in_flight;
+
+	if (atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR)
+		return -BCH_ERR_journal_max_open;

 	if (journal_cur_seq(j) >= JOURNAL_SEQ_MAX) {
 		bch_err(c, "cannot start: journal seq overflow");
 		if (bch2_fs_emergency_read_only_locked(c))
 			bch_err(c, "fatal error - emergency read only");
-		return JOURNAL_ERR_insufficient_devices; /* -EROFS */
+		return -BCH_ERR_journal_shutdown;
 	}

+	if (!j->free_buf && !buf->data)
+		return -BCH_ERR_journal_buf_enomem; /* will retry after write completion frees up a buf */
+
 	BUG_ON(!j->cur_entry_sectors);
+
+	if (!buf->data) {
+		swap(buf->data,		j->free_buf);
+		swap(buf->buf_size,	j->free_buf_size);
+	}

 	buf->expires =
 		(journal_cur_seq(j) == j->flushed_seq_ondisk
···
 	u64s = clamp_t(int, u64s, 0, JOURNAL_ENTRY_CLOSED_VAL - 1);

 	if (u64s <= (ssize_t) j->early_journal_entries.nr)
-		return JOURNAL_ERR_journal_full;
+		return -BCH_ERR_journal_full;

 	if (fifo_empty(&j->pin) && j->reclaim_thread)
 		wake_up_process(j->reclaim_thread);
···

 	new.idx++;
 	BUG_ON(journal_state_count(new, new.idx));
-	BUG_ON(new.idx != (journal_cur_seq(j) & JOURNAL_BUF_MASK));
+	BUG_ON(new.idx != (journal_cur_seq(j) & JOURNAL_STATE_BUF_MASK));

 	journal_state_inc(&new);

···
 	spin_unlock(&j->lock);
 }

+static void journal_buf_prealloc(struct journal *j)
+{
+	if (j->free_buf &&
+	    j->free_buf_size >= j->buf_size_want)
+		return;
+
+	unsigned buf_size = j->buf_size_want;
+
+	spin_unlock(&j->lock);
+	void *buf = kvmalloc(buf_size, GFP_NOFS);
+	spin_lock(&j->lock);
+
+	if (buf &&
+	    (!j->free_buf ||
+	     buf_size > j->free_buf_size)) {
+		swap(buf,	j->free_buf);
+		swap(buf_size,	j->free_buf_size);
+	}
+
+	if (unlikely(buf)) {
+		spin_unlock(&j->lock);
+		/* kvfree can sleep */
+		kvfree(buf);
+		spin_lock(&j->lock);
+	}
+}
+
 static int __journal_res_get(struct journal *j, struct journal_res *res,
 			     unsigned flags)
 {
···
 	if (journal_res_get_fast(j, res, flags))
 		return 0;

-	if (bch2_journal_error(j))
-		return -BCH_ERR_erofs_journal_err;
+	ret = bch2_journal_error(j);
+	if (unlikely(ret))
+		return ret;

 	if (j->blocked)
-		return -BCH_ERR_journal_res_get_blocked;
+		return -BCH_ERR_journal_blocked;

 	if ((flags & BCH_WATERMARK_MASK) < j->watermark) {
-		ret = JOURNAL_ERR_journal_full;
+		ret = -BCH_ERR_journal_full;
 		can_discard = j->can_discard;
 		goto out;
 	}

 	if (nr_unwritten_journal_entries(j) == ARRAY_SIZE(j->buf) && !journal_entry_is_open(j)) {
-		ret = JOURNAL_ERR_max_in_flight;
+		ret = -BCH_ERR_journal_max_in_flight;
 		goto out;
 	}

 	spin_lock(&j->lock);
+
+	journal_buf_prealloc(j);

 	/*
 	 * Recheck after taking the lock, so we don't race with another thread
···
 	j->buf_size_want = max(j->buf_size_want, buf->buf_size << 1);

 	__journal_entry_close(j, JOURNAL_ENTRY_CLOSED_VAL, false);
-	ret = journal_entry_open(j) ?: JOURNAL_ERR_retry;
+	ret = journal_entry_open(j) ?: -BCH_ERR_journal_retry_open;
 unlock:
 	can_discard = j->can_discard;
 	spin_unlock(&j->lock);
 out:
-	if (ret == JOURNAL_ERR_retry)
-		goto retry;
-	if (!ret)
+	if (likely(!ret))
 		return 0;
+	if (ret == -BCH_ERR_journal_retry_open)
+		goto retry;

 	if (journal_error_check_stuck(j, ret, flags))
-		ret = -BCH_ERR_journal_res_get_blocked;
+		ret = -BCH_ERR_journal_stuck;

-	if (ret == JOURNAL_ERR_max_in_flight &&
-	    track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight], true)) {
-
+	if (ret == -BCH_ERR_journal_max_in_flight &&
+	    track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight], true) &&
+	    trace_journal_entry_full_enabled()) {
 		struct printbuf buf = PRINTBUF;
+
+		bch2_printbuf_make_room(&buf, 4096);
+
+		spin_lock(&j->lock);
 		prt_printf(&buf, "seq %llu\n", journal_cur_seq(j));
 		bch2_journal_bufs_to_text(&buf, j);
+		spin_unlock(&j->lock);
+
+		trace_journal_entry_full(c, buf.buf);
+		printbuf_exit(&buf);
+		count_event(c, journal_entry_full);
+	}
+
+	if (ret == -BCH_ERR_journal_max_open &&
+	    track_event_change(&c->times[BCH_TIME_blocked_journal_max_open], true) &&
+	    trace_journal_entry_full_enabled()) {
+		struct printbuf buf = PRINTBUF;
+
+		bch2_printbuf_make_room(&buf, 4096);
+
+		spin_lock(&j->lock);
+		prt_printf(&buf, "seq %llu\n", journal_cur_seq(j));
+		bch2_journal_bufs_to_text(&buf, j);
+		spin_unlock(&j->lock);
+
 		trace_journal_entry_full(c, buf.buf);
 		printbuf_exit(&buf);
 		count_event(c, journal_entry_full);
···
 	 * Journal is full - can't rely on reclaim from work item due to
 	 * freezing:
 	 */
-	if ((ret == JOURNAL_ERR_journal_full ||
-	     ret == JOURNAL_ERR_journal_pin_full) &&
+	if ((ret == -BCH_ERR_journal_full ||
+	     ret == -BCH_ERR_journal_pin_full) &&
 	    !(flags & JOURNAL_RES_GET_NONBLOCK)) {
 		if (can_discard) {
 			bch2_journal_do_discards(j);
···
 		}
 	}

-	return ret == JOURNAL_ERR_insufficient_devices
-		? -BCH_ERR_erofs_journal_err
-		: -BCH_ERR_journal_res_get_blocked;
+	return ret;
 }

 static unsigned max_dev_latency(struct bch_fs *c)
···
 	int ret;

 	if (closure_wait_event_timeout(&j->async_wait,
-		   (ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked ||
+		   !bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) ||
 		   (flags & JOURNAL_RES_GET_NONBLOCK),
 		   HZ))
 		return ret;
···
 	remaining_wait = max(0, remaining_wait - HZ);

 	if (closure_wait_event_timeout(&j->async_wait,
-		   (ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked ||
+		   !bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) ||
 		   (flags & JOURNAL_RES_GET_NONBLOCK),
 		   remaining_wait))
 		return ret;
···
 	printbuf_exit(&buf);

 	closure_wait_event(&j->async_wait,
-		   (ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked ||
+		   !bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) ||
 		   (flags & JOURNAL_RES_GET_NONBLOCK));
 	return ret;
 }
···
 		goto out;

 	j->cur_entry_u64s = max_t(int, 0, j->cur_entry_u64s - d);
-	smp_mb();
 	state = READ_ONCE(j->reservations);

 	if (state.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL &&
···
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);

 	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_journal))
-		return -EROFS;
+		return -BCH_ERR_erofs_no_writes;

 	int ret = __bch2_journal_meta(j);
 	bch2_write_ref_put(c, BCH_WRITE_REF_journal);
···
 			new.cur_entry_offset = JOURNAL_ENTRY_BLOCKED_VAL;
 		} while (!atomic64_try_cmpxchg(&j->reservations.counter, &old.v, new.v));

-		journal_cur_buf(j)->data->u64s = cpu_to_le32(old.cur_entry_offset);
+		if (old.cur_entry_offset < JOURNAL_ENTRY_BLOCKED_VAL)
+			journal_cur_buf(j)->data->u64s = cpu_to_le32(old.cur_entry_offset);
 	}
 }

···
 			*blocked = true;
 		}

-		ret = journal_state_count(s, idx) > open
+		ret = journal_state_count(s, idx & JOURNAL_STATE_BUF_MASK) > open
 			? ERR_PTR(-EAGAIN)
 			: buf;
 		break;
···
 	j->replay_journal_seq_end = cur_seq;
 	j->last_seq_ondisk	= last_seq;
 	j->flushed_seq_ondisk	= cur_seq - 1;
+	j->seq_write_started	= cur_seq - 1;
 	j->seq_ondisk		= cur_seq - 1;
 	j->pin.front		= last_seq;
 	j->pin.back		= cur_seq;
···
 	set_bit(JOURNAL_running, &j->flags);
 	j->last_flush_write = jiffies;

-	j->reservations.idx = j->reservations.unwritten_idx = journal_cur_seq(j);
-	j->reservations.unwritten_idx++;
+	j->reservations.idx = journal_cur_seq(j);

 	c->last_bucket_seq_cleanup = journal_cur_seq(j);

···
 	unsigned nr_bvecs = DIV_ROUND_UP(JOURNAL_ENTRY_SIZE_MAX, PAGE_SIZE);

 	for (unsigned i = 0; i < ARRAY_SIZE(ja->bio); i++) {
-		ja->bio[i] = kmalloc(struct_size(ja->bio[i], bio.bi_inline_vecs,
+		ja->bio[i] = kzalloc(struct_size(ja->bio[i], bio.bi_inline_vecs,
 				     nr_bvecs), GFP_KERNEL);
 		if (!ja->bio[i])
 			return -BCH_ERR_ENOMEM_dev_journal_init;
···

 	for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++)
 		kvfree(j->buf[i].data);
+	kvfree(j->free_buf);
 	free_fifo(&j->pin);
 }

···
 	if (!(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)))
 		return -BCH_ERR_ENOMEM_journal_pin_fifo;

-	for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++) {
-		j->buf[i].buf_size = JOURNAL_ENTRY_SIZE_MIN;
-		j->buf[i].data = kvmalloc(j->buf[i].buf_size, GFP_KERNEL);
-		if (!j->buf[i].data)
-			return -BCH_ERR_ENOMEM_journal_buf;
+	j->free_buf_size = j->buf_size_want = JOURNAL_ENTRY_SIZE_MIN;
+	j->free_buf = kvmalloc(j->free_buf_size, GFP_KERNEL);
+	if (!j->free_buf)
+		return -BCH_ERR_ENOMEM_journal_buf;
+
+	for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++)
 		j->buf[i].idx = i;
-	}

 	j->pin.front = j->pin.back = 1;

···
 	prt_printf(out, "average write size:\t");
 	prt_human_readable_u64(out, nr_writes ? div64_u64(j->entry_bytes_written, nr_writes) : 0);
 	prt_newline(out);
+	prt_printf(out, "free buf:\t%u\n",			j->free_buf ? j->free_buf_size : 0);
 	prt_printf(out, "nr direct reclaim:\t%llu\n",		j->nr_direct_reclaim);
 	prt_printf(out, "nr background reclaim:\t%llu\n",	j->nr_background_reclaim);
 	prt_printf(out, "reclaim kicked:\t%u\n",		j->reclaim_kicked);
···
 	       ? jiffies_to_msecs(j->next_reclaim - jiffies) : 0);
 	prt_printf(out, "blocked:\t%u\n",			j->blocked);
 	prt_printf(out, "current entry sectors:\t%u\n",	j->cur_entry_sectors);
-	prt_printf(out, "current entry error:\t%s\n",		bch2_journal_errors[j->cur_entry_error]);
+	prt_printf(out, "current entry error:\t%s\n",		bch2_err_str(j->cur_entry_error));
 	prt_printf(out, "current entry:\t");

 	switch (s.cur_entry_offset) {
+30 -12
fs/bcachefs/journal.h
···
 	closure_wake_up(&j->async_wait);
 }
 
-static inline struct journal_buf *journal_cur_buf(struct journal *j)
-{
-	return j->buf + j->reservations.idx;
-}
-
 /* Sequence number of oldest dirty journal entry */
 
 static inline u64 journal_last_seq(struct journal *j)
···
 	return j->seq_ondisk + 1;
 }
 
+static inline struct journal_buf *journal_cur_buf(struct journal *j)
+{
+	unsigned idx = (journal_cur_seq(j) &
+			JOURNAL_BUF_MASK &
+			~JOURNAL_STATE_BUF_MASK) + j->reservations.idx;
+
+	return j->buf + idx;
+}
+
 static inline int journal_state_count(union journal_res_state s, int idx)
 {
 	switch (idx) {
···
 	case 3: return s.buf3_count;
 	}
 	BUG();
+}
+
+static inline int journal_state_seq_count(struct journal *j,
+					  union journal_res_state s, u64 seq)
+{
+	if (journal_cur_seq(j) - seq < JOURNAL_STATE_BUF_NR)
+		return journal_state_count(s, seq & JOURNAL_STATE_BUF_MASK);
+	else
+		return 0;
 }
 
 static inline void journal_state_inc(union journal_res_state *s)
···
 static inline struct jset_entry *
 journal_res_entry(struct journal *j, struct journal_res *res)
 {
-	return vstruct_idx(j->buf[res->idx].data, res->offset);
+	return vstruct_idx(j->buf[res->seq & JOURNAL_BUF_MASK].data, res->offset);
 }
 
 static inline unsigned journal_entry_init(struct jset_entry *entry, unsigned type,
···
 void bch2_journal_do_writes(struct journal *);
 void bch2_journal_buf_put_final(struct journal *, u64);
 
-static inline void __bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
+static inline void __bch2_journal_buf_put(struct journal *j, u64 seq)
 {
+	unsigned idx = seq & JOURNAL_STATE_BUF_MASK;
 	union journal_res_state s;
 
 	s = journal_state_buf_put(j, idx);
···
 		bch2_journal_buf_put_final(j, seq);
 }
 
-static inline void bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
+static inline void bch2_journal_buf_put(struct journal *j, u64 seq)
 {
+	unsigned idx = seq & JOURNAL_STATE_BUF_MASK;
 	union journal_res_state s;
 
 	s = journal_state_buf_put(j, idx);
···
 				       BCH_JSET_ENTRY_btree_keys,
 				       0, 0, 0);
 
-	bch2_journal_buf_put(j, res->idx, res->seq);
+	bch2_journal_buf_put(j, res->seq);
 
 	res->ref = 0;
 }
···
 
 	/*
 	 * Check if there is still room in the current journal
-	 * entry:
+	 * entry, smp_rmb() guarantees that reads from reservations.counter
+	 * occur before accessing cur_entry_u64s:
 	 */
+	smp_rmb();
 	if (new.cur_entry_offset + res->u64s > j->cur_entry_u64s)
 		return 0;
 
···
 				       &old.v, new.v));
 
 	res->ref = true;
-	res->idx = old.idx;
 	res->offset = old.cur_entry_offset;
-	res->seq = le64_to_cpu(j->buf[old.idx].data->seq);
+	res->seq = journal_cur_seq(j);
+	res->seq -= (res->seq - old.idx) & JOURNAL_STATE_BUF_MASK;
 	return 1;
 }
 
···
 					(flags & JOURNAL_RES_GET_NONBLOCK) != 0,
 					NULL, _THIS_IP_);
 		EBUG_ON(!res->ref);
+		BUG_ON(!res->seq);
 	}
 	return 0;
 }
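The index arithmetic in the journal.h changes above can be checked in isolation. The following is a minimal userspace sketch, not kernel code: the helper names `seq_to_buf_idx`, `cur_buf_idx` and `res_seq_from_idx` are hypothetical, and only the mask constants and the expressions inside them are taken from the diff. It illustrates how a 2-bit state index plus the current sequence number appears to be enough to recover both the full sequence number of a reservation and its slot in the 16-entry buffer array.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as they appear in the journal_types.h part of this diff. */
#define JOURNAL_STATE_BUF_BITS	2
#define JOURNAL_STATE_BUF_NR	(1U << JOURNAL_STATE_BUF_BITS)
#define JOURNAL_STATE_BUF_MASK	(JOURNAL_STATE_BUF_NR - 1)

#define JOURNAL_BUF_BITS	4
#define JOURNAL_BUF_NR		(1U << JOURNAL_BUF_BITS)
#define JOURNAL_BUF_MASK	(JOURNAL_BUF_NR - 1)

/* Hypothetical helper: buffer slot for a sequence number (seq & JOURNAL_BUF_MASK). */
static unsigned seq_to_buf_idx(uint64_t seq)
{
	return seq & JOURNAL_BUF_MASK;
}

/* Hypothetical helper mirroring the expression in the new journal_cur_buf(). */
static unsigned cur_buf_idx(uint64_t cur_seq, unsigned state_idx)
{
	return (cur_seq & JOURNAL_BUF_MASK & ~JOURNAL_STATE_BUF_MASK) + state_idx;
}

/*
 * Hypothetical helper mirroring the two res->seq lines in the diff:
 * step back from cur_seq to the nearest sequence number whose low
 * two bits equal state_idx.
 */
static uint64_t res_seq_from_idx(uint64_t cur_seq, unsigned state_idx)
{
	return cur_seq - ((cur_seq - state_idx) & JOURNAL_STATE_BUF_MASK);
}
```

For example, with a current sequence number of 23 and a reservation taken while the state index was 2, `res_seq_from_idx(23, 2)` yields 22, and `seq_to_buf_idx(22)` yields slot 6.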
+60 -39
fs/bcachefs/journal_io.c
···
 			bio->bi_iter.bi_sector = offset;
 			bch2_bio_map(bio, buf->data, sectors_read << 9);
 
+			u64 submit_time = local_clock();
 			ret = submit_bio_wait(bio);
 			kfree(bio);
 
-			if (bch2_dev_io_err_on(ret, ca, BCH_MEMBER_ERROR_read,
-					       "journal read error: sector %llu",
-					       offset) ||
-			    bch2_meta_read_fault("journal")) {
+			if (!ret && bch2_meta_read_fault("journal"))
+				ret = -BCH_ERR_EIO_fault_injected;
+
+			bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read,
+						   submit_time, !ret);
+
+			if (ret) {
+				bch_err_dev_ratelimited(ca,
+					"journal read error: sector %llu", offset);
 				/*
 				 * We don't error out of the recovery process
 				 * here, since the relevant journal entry may be
···
 		struct bch_csum csum;
 		csum_good = jset_csum_good(c, j, &csum);
 
-		if (bch2_dev_io_err_on(!csum_good, ca, BCH_MEMBER_ERROR_checksum,
-				       "%s",
-				       (printbuf_reset(&err),
-					prt_str(&err, "journal "),
-					bch2_csum_err_msg(&err, csum_type, j->csum, csum),
-					err.buf)))
+		bch2_account_io_completion(ca, BCH_MEMBER_ERROR_checksum, 0, csum_good);
+
+		if (!csum_good) {
+			bch_err_dev_ratelimited(ca, "%s",
+				(printbuf_reset(&err),
+				 prt_str(&err, "journal "),
+				 bch2_csum_err_msg(&err, csum_type, j->csum, csum),
+				 err.buf));
 			saw_bad = true;
+		}
 
 		ret = bch2_encrypt(c, JSET_CSUM_TYPE(j), journal_nonce(j),
 				   j->encrypted_start,
···
  * @j: journal object
  * @w: journal buf (entry to be written)
  *
- * Returns: 0 on success, or -EROFS on failure
+ * Returns: 0 on success, or -BCH_ERR_insufficient_devices on failure
  */
 static int journal_write_alloc(struct journal *j, struct journal_buf *w)
 {
···
 	kvfree(new_buf);
 }
 
-static inline struct journal_buf *journal_last_unwritten_buf(struct journal *j)
-{
-	return j->buf + (journal_last_unwritten_seq(j) & JOURNAL_BUF_MASK);
-}
-
 static CLOSURE_CALLBACK(journal_write_done)
 {
 	closure_type(w, struct journal_buf, io);
 	struct journal *j = container_of(w, struct journal, buf[w->idx]);
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
 	struct bch_replicas_padded replicas;
-	union journal_res_state old, new;
 	u64 seq = le64_to_cpu(w->data->seq);
 	int err = 0;
 
···
 
 	if (!w->devs_written.nr) {
 		bch_err(c, "unable to write journal to sufficient devices");
-		err = -EIO;
+		err = -BCH_ERR_journal_write_err;
 	} else {
 		bch2_devlist_to_replicas(&replicas.e, BCH_DATA_journal,
 					 w->devs_written);
-		if (bch2_mark_replicas(c, &replicas.e))
-			err = -EIO;
+		err = bch2_mark_replicas(c, &replicas.e);
 	}
 
 	if (err)
···
 		j->err_seq = seq;
 	w->write_done = true;
 
+	if (!j->free_buf || j->free_buf_size < w->buf_size) {
+		swap(j->free_buf, w->data);
+		swap(j->free_buf_size, w->buf_size);
+	}
+
+	if (w->data) {
+		void *buf = w->data;
+		w->data = NULL;
+		w->buf_size = 0;
+
+		spin_unlock(&j->lock);
+		kvfree(buf);
+		spin_lock(&j->lock);
+	}
+
 	bool completed = false;
+	bool do_discards = false;
 
 	for (seq = journal_last_unwritten_seq(j);
 	     seq <= journal_cur_seq(j);
···
 		if (!w->write_done)
 			break;
 
-		if (!j->err_seq && !JSET_NO_FLUSH(w->data)) {
+		if (!j->err_seq && !w->noflush) {
 			j->flushed_seq_ondisk = seq;
 			j->last_seq_ondisk = w->last_seq;
 
-			bch2_do_discards(c);
 			closure_wake_up(&c->freelist_wait);
 			bch2_reset_alloc_cursors(c);
 		}
···
 		if (j->watermark != BCH_WATERMARK_stripe)
 			journal_reclaim_kick(&c->journal);
 
-		old.v = atomic64_read(&j->reservations.counter);
-		do {
-			new.v = old.v;
-			BUG_ON(journal_state_count(new, new.unwritten_idx));
-			BUG_ON(new.unwritten_idx != (seq & JOURNAL_BUF_MASK));
-
-			new.unwritten_idx++;
-		} while (!atomic64_try_cmpxchg(&j->reservations.counter,
-					       &old.v, new.v));
-
 		closure_wake_up(&w->wait);
 		completed = true;
 	}
···
 	}
 
 	if (journal_last_unwritten_seq(j) == journal_cur_seq(j) &&
-	    new.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL) {
+	    j->reservations.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL) {
 		struct journal_buf *buf = journal_cur_buf(j);
 		long delta = buf->expires - jiffies;
 
···
 	 */
 	bch2_journal_do_writes(j);
 	spin_unlock(&j->lock);
+
+	if (do_discards)
+		bch2_do_discards(c);
 }
 
 static void journal_write_endio(struct bio *bio)
···
 	struct journal *j = &ca->fs->journal;
 	struct journal_buf *w = j->buf + jbio->buf_idx;
 
-	if (bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write,
+	bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write,
+				   jbio->submit_time, !bio->bi_status);
+
+	if (bio->bi_status) {
+		bch_err_dev_ratelimited(ca,
 			       "error writing journal entry %llu: %s",
 			       le64_to_cpu(w->data->seq),
-			       bch2_blk_status_to_str(bio->bi_status)) ||
-	    bch2_meta_write_fault("journal")) {
-		unsigned long flags;
+			       bch2_blk_status_to_str(bio->bi_status));
 
+		unsigned long flags;
 		spin_lock_irqsave(&j->err_lock, flags);
 		bch2_dev_list_drop_dev(&w->devs_written, ca->dev_idx);
 		spin_unlock_irqrestore(&j->err_lock, flags);
···
 			     sectors);
 
 		struct journal_device *ja = &ca->journal;
-		struct bio *bio = &ja->bio[w->idx]->bio;
+		struct journal_bio *jbio = ja->bio[w->idx];
+		struct bio *bio = &jbio->bio;
+
+		jbio->submit_time = local_clock();
+
 		bio_reset(bio, ca->disk_sb.bdev, REQ_OP_WRITE|REQ_SYNC|REQ_META);
 		bio->bi_iter.bi_sector = ptr->offset;
 		bio->bi_end_io = journal_write_endio;
···
 	struct journal *j = container_of(w, struct journal, buf[w->idx]);
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
 
+	/*
+	 * Wait for previous journal writes to comelete; they won't necessarily
+	 * be flushed if they're still in flight
+	 */
 	if (j->seq_ondisk + 1 != le64_to_cpu(w->data->seq)) {
 		spin_lock(&j->lock);
 		if (j->seq_ondisk + 1 != le64_to_cpu(w->data->seq)) {
···
 	 * write anything at all.
 	 */
 	if (error && test_bit(JOURNAL_need_flush_write, &j->flags))
-		return -EIO;
+		return error;
 
 	if (error ||
 	    w->noflush ||
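The `free_buf` handling added to `journal_write_done()` above reads as a "keep the largest spare" policy: when a write completes, the journal holds on to whichever of the spare and the just-written buffer is larger and frees the other. The sketch below models that policy in userspace with `malloc`/`free`. The `struct spare` type and `retire_buf` function are hypothetical names for illustration, and the sketch omits the locking (the kernel code drops `j->lock` around `kvfree()`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the journal's free_buf / free_buf_size pair. */
struct spare {
	void	*buf;
	size_t	size;
};

/*
 * Retire a buffer whose write has completed: if there is no spare yet, or
 * the spare is smaller, swap the two, then free whatever the caller is
 * left holding. Afterwards *data is NULL and *size is 0.
 */
static void retire_buf(struct spare *s, void **data, size_t *size)
{
	if (!s->buf || s->size < *size) {
		void *tmp_buf = s->buf;
		size_t tmp_size = s->size;

		s->buf = *data;
		s->size = *size;
		*data = tmp_buf;
		*size = tmp_size;
	}

	free(*data);	/* free(NULL) is a no-op */
	*data = NULL;
	*size = 0;
}
```

Retiring a 64-byte buffer, then a 32-byte one, then a 128-byte one leaves a 128-byte spare: the 32-byte buffer is freed immediately, and the 64-byte spare is freed when the larger buffer replaces it.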
+4 -6
fs/bcachefs/journal_reclaim.c
···
 
 		bch_err(c, "%s", buf.buf);
 		printbuf_exit(&buf);
-		ret = JOURNAL_ERR_insufficient_devices;
+		ret = -BCH_ERR_insufficient_journal_devices;
 		goto out;
 	}
 
···
 	total = j->space[journal_space_total].total;
 
 	if (!j->space[journal_space_discarded].next_entry)
-		ret = JOURNAL_ERR_journal_full;
+		ret = -BCH_ERR_journal_full;
 
 	if ((j->space[journal_space_clean_ondisk].next_entry <
 	     j->space[journal_space_clean_ondisk].total) &&
···
  * @j: journal object
  * @direct: direct or background reclaim?
  * @kicked: requested to run since we last ran?
- * Returns: 0 on success, or -EIO if the journal has been shutdown
  *
  * Background journal reclaim writes out btree nodes. It should be run
  * early enough so that we never completely run out of journal buckets.
···
 		if (kthread && kthread_should_stop())
 			break;
 
-		if (bch2_journal_error(j)) {
-			ret = -EIO;
+		ret = bch2_journal_error(j);
+		if (ret)
 			break;
-		}
 
 		bch2_journal_do_discards(j);
 
+3 -4
fs/bcachefs/journal_seq_blacklist.c
···
 	struct journal_seq_blacklist_table *t = c->journal_seq_blacklist_table;
 	BUG_ON(nr != t->nr);
 
-	unsigned i;
-	for (src = bl->start, i = t->nr == 0 ? 0 : eytzinger0_first(t->nr);
-	     src < bl->start + nr;
-	     src++, i = eytzinger0_next(i, nr)) {
+	src = bl->start;
+	eytzinger0_for_each(i, nr) {
 		BUG_ON(t->entries[i].start != le64_to_cpu(src->start));
 		BUG_ON(t->entries[i].end != le64_to_cpu(src->end));
 
 		if (t->entries[i].dirty || t->entries[i].end >= c->journal.oldest_seq_found_ondisk)
 			*dst++ = *src;
+		src++;
 	}
 
 	unsigned new_nr = dst - bl->start;
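The loop above walks an Eytzinger-ordered table in sorted order while `src` advances linearly through the on-disk array. As a rough illustration of what "in sorted order" means for such a layout, here is a small recursive sketch. It assumes the conventional 0-based layout in which the children of index `i` are `2i + 1` and `2i + 2`; whether the kernel's `eytzinger0_*` helpers match this in every detail is an assumption, and `eytz_inorder` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Append, in sorted (in-order) sequence, the indices of the subtree rooted
 * at i of an nr-element array stored in 0-based Eytzinger (breadth-first)
 * order. Returns the new number of entries in out[].
 */
static size_t eytz_inorder(size_t i, size_t nr, size_t *out, size_t n_out)
{
	if (i >= nr)
		return n_out;

	n_out = eytz_inorder(2 * i + 1, nr, out, n_out);	/* left subtree */
	out[n_out++] = i;					/* this node */
	return eytz_inorder(2 * i + 2, nr, out, n_out);	/* right subtree */
}
```

For a six-element table this visits indices 3, 1, 4, 0, 5, 2, so the first entry visited holds the smallest key and the last one the largest.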
+13 -24
fs/bcachefs/journal_types.h
···
 /* btree write buffer steals 8 bits for its own purposes: */
 #define JOURNAL_SEQ_MAX ((1ULL << 56) - 1)
 
-#define JOURNAL_BUF_BITS 2
+#define JOURNAL_STATE_BUF_BITS 2
+#define JOURNAL_STATE_BUF_NR (1U << JOURNAL_STATE_BUF_BITS)
+#define JOURNAL_STATE_BUF_MASK (JOURNAL_STATE_BUF_NR - 1)
+
+#define JOURNAL_BUF_BITS 4
 #define JOURNAL_BUF_NR (1U << JOURNAL_BUF_BITS)
 #define JOURNAL_BUF_MASK (JOURNAL_BUF_NR - 1)
 
···
 
 struct journal_res {
 	bool ref;
-	u8 idx;
 	u16 u64s;
 	u32 offset;
 	u64 seq;
···
 	};
 
 	struct {
-		u64 cur_entry_offset:20,
+		u64 cur_entry_offset:22,
 		    idx:2,
-		    unwritten_idx:2,
 		    buf0_count:10,
 		    buf1_count:10,
 		    buf2_count:10,
···
 
 /* bytes: */
 #define JOURNAL_ENTRY_SIZE_MIN (64U << 10) /* 64k */
-#define JOURNAL_ENTRY_SIZE_MAX (4U << 20) /* 4M */
+#define JOURNAL_ENTRY_SIZE_MAX (4U << 22) /* 16M */
 
 /*
  * We stash some journal state as sentinal values in cur_entry_offset:
  * note - cur_entry_offset is in units of u64s
  */
-#define JOURNAL_ENTRY_OFFSET_MAX ((1U << 20) - 1)
+#define JOURNAL_ENTRY_OFFSET_MAX ((1U << 22) - 1)
 
 #define JOURNAL_ENTRY_BLOCKED_VAL (JOURNAL_ENTRY_OFFSET_MAX - 2)
 #define JOURNAL_ENTRY_CLOSED_VAL (JOURNAL_ENTRY_OFFSET_MAX - 1)
···
 #undef x
 };
 
-/* Reasons we may fail to get a journal reservation: */
-#define JOURNAL_ERRORS() \
-	x(ok) \
-	x(retry) \
-	x(blocked) \
-	x(max_in_flight) \
-	x(journal_full) \
-	x(journal_pin_full) \
-	x(journal_stuck) \
-	x(insufficient_devices)
-
-enum journal_errors {
-#define x(n) JOURNAL_ERR_##n,
-	JOURNAL_ERRORS()
-#undef x
-};
-
 typedef DARRAY(u64) darray_u64;
 
 struct journal_bio {
 	struct bch_dev *ca;
 	unsigned buf_idx;
+	u64 submit_time;
 
 	struct bio bio;
 };
···
 	 * 0, or -ENOSPC if waiting on journal reclaim, or -EROFS if
 	 * insufficient devices:
 	 */
-	enum journal_errors cur_entry_error;
+	int cur_entry_error;
 	unsigned cur_entry_offset_if_blocked;
 
 	unsigned buf_size_want;
···
 	 * other is possibly being written out.
 	 */
 	struct journal_buf buf[JOURNAL_BUF_NR];
+	void *free_buf;
+	unsigned free_buf_size;
 
 	spinlock_t lock;
 
···
 	/* Sequence number of most recent journal entry (last entry in @pin) */
 	atomic64_t seq;
 
+	u64 seq_write_started;
 	/* seq, last_seq from the most recent journal entry successfully written */
 	u64 seq_ondisk;
 	u64 flushed_seq_ondisk;
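The bit budget behind the journal_types.h change can be verified with a few compile-time checks. The sketch below is a userspace approximation: `union res_state_sketch` is a hypothetical stand-in for `union journal_res_state`, and `buf3_count` is assumed to have the same 10-bit width as the three counts visible in the hunk. Dropping the 2-bit `unwritten_idx` is what lets `cur_entry_offset` grow from 20 to 22 bits while the whole state still fits in one 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

#define JOURNAL_ENTRY_SIZE_MAX		(4U << 22)		/* bytes: 16M */
#define JOURNAL_ENTRY_OFFSET_MAX	((1U << 22) - 1)	/* in units of u64s */
#define JOURNAL_ENTRY_BLOCKED_VAL	(JOURNAL_ENTRY_OFFSET_MAX - 2)

/* Hypothetical userspace stand-in for the packed reservation state. */
union res_state_sketch {
	uint64_t v;
	struct {
		uint64_t cur_entry_offset:22,
			 idx:2,
			 buf0_count:10,
			 buf1_count:10,
			 buf2_count:10,
			 buf3_count:10;	/* width assumed, not shown in the hunk */
	};
};

/* 22 + 2 + 4 * 10 = 64 bits, so the union must be exactly one u64. */
static_assert(sizeof(union res_state_sketch) == sizeof(uint64_t),
	      "reservation state must fit in a single 64-bit word");

/* Largest byte offset that cur_entry_offset can express. */
static uint64_t offset_max_bytes(void)
{
	return (uint64_t) JOURNAL_ENTRY_OFFSET_MAX * sizeof(uint64_t);
}
```

With these values a 16M entry spans 2,097,152 u64s, which stays below the sentinel values stored near the top of the 22-bit range.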
+62 -38
fs/bcachefs/lru.c
···
 #include "btree_iter.h"
 #include "btree_update.h"
 #include "btree_write_buffer.h"
+#include "ec.h"
 #include "error.h"
 #include "lru.h"
 #include "recovery.h"
···
 	return __bch2_lru_set(trans, lru_id, dev_bucket, time, KEY_TYPE_set);
 }
 
-int bch2_lru_change(struct btree_trans *trans,
-		    u16 lru_id, u64 dev_bucket,
-		    u64 old_time, u64 new_time)
+int __bch2_lru_change(struct btree_trans *trans,
+		      u16 lru_id, u64 dev_bucket,
+		      u64 old_time, u64 new_time)
 {
 	if (old_time == new_time)
 		return 0;
···
 };
 
 int bch2_lru_check_set(struct btree_trans *trans,
-		       u16 lru_id, u64 time,
+		       u16 lru_id,
+		       u64 dev_bucket,
+		       u64 time,
 		       struct bkey_s_c referring_k,
 		       struct bkey_buf *last_flushed)
 {
···
 	struct btree_iter lru_iter;
 	struct bkey_s_c lru_k =
 		bch2_bkey_get_iter(trans, &lru_iter, BTREE_ID_lru,
-				   lru_pos(lru_id,
-					   bucket_to_u64(referring_k.k->p),
-					   time), 0);
+				   lru_pos(lru_id, dev_bucket, time), 0);
 	int ret = bkey_err(lru_k);
 	if (ret)
 		return ret;
···
 			" %s",
 			bch2_lru_types[lru_type(lru_k)],
 			(bch2_bkey_val_to_text(&buf, c, referring_k), buf.buf))) {
-		ret = bch2_lru_set(trans, lru_id, bucket_to_u64(referring_k.k->p), time);
+		ret = bch2_lru_set(trans, lru_id, dev_bucket, time);
 		if (ret)
 			goto err;
 	}
···
 	return ret;
 }
 
+static struct bbpos lru_pos_to_bp(struct bkey_s_c lru_k)
+{
+	enum bch_lru_type type = lru_type(lru_k);
+
+	switch (type) {
+	case BCH_LRU_read:
+	case BCH_LRU_fragmentation:
+		return BBPOS(BTREE_ID_alloc, u64_to_bucket(lru_k.k->p.offset));
+	case BCH_LRU_stripes:
+		return BBPOS(BTREE_ID_stripes, POS(0, lru_k.k->p.offset));
+	default:
+		BUG();
+	}
+}
+
+static u64 bkey_lru_type_idx(struct bch_fs *c,
+			     enum bch_lru_type type,
+			     struct bkey_s_c k)
+{
+	struct bch_alloc_v4 a_convert;
+	const struct bch_alloc_v4 *a;
+
+	switch (type) {
+	case BCH_LRU_read:
+		a = bch2_alloc_to_v4(k, &a_convert);
+		return alloc_lru_idx_read(*a);
+	case BCH_LRU_fragmentation: {
+		a = bch2_alloc_to_v4(k, &a_convert);
+
+		rcu_read_lock();
+		struct bch_dev *ca = bch2_dev_rcu_noerror(c, k.k->p.inode);
+		u64 idx = ca
+			? alloc_lru_idx_fragmentation(*a, ca)
+			: 0;
+		rcu_read_unlock();
+		return idx;
+	}
+	case BCH_LRU_stripes:
+		return k.k->type == KEY_TYPE_stripe
+			? stripe_lru_pos(bkey_s_c_to_stripe(k).v)
+			: 0;
+	default:
+		BUG();
+	}
+}
+
 static int bch2_check_lru_key(struct btree_trans *trans,
 			      struct btree_iter *lru_iter,
 			      struct bkey_s_c lru_k,
 			      struct bkey_buf *last_flushed)
 {
 	struct bch_fs *c = trans->c;
-	struct btree_iter iter;
-	struct bkey_s_c k;
-	struct bch_alloc_v4 a_convert;
-	const struct bch_alloc_v4 *a;
 	struct printbuf buf1 = PRINTBUF;
 	struct printbuf buf2 = PRINTBUF;
-	enum bch_lru_type type = lru_type(lru_k);
-	struct bpos alloc_pos = u64_to_bucket(lru_k.k->p.offset);
-	u64 idx;
-	int ret;
 
-	struct bch_dev *ca = bch2_dev_bucket_tryget_noerror(c, alloc_pos);
+	struct bbpos bp = lru_pos_to_bp(lru_k);
 
-	if (fsck_err_on(!ca,
-			trans, lru_entry_to_invalid_bucket,
-			"lru key points to nonexistent device:bucket %llu:%llu",
-			alloc_pos.inode, alloc_pos.offset))
-		return bch2_btree_bit_mod_buffered(trans, BTREE_ID_lru, lru_iter->pos, false);
-
-	k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_alloc, alloc_pos, 0);
-	ret = bkey_err(k);
+	struct btree_iter iter;
+	struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, bp.btree, bp.pos, 0);
+	int ret = bkey_err(k);
 	if (ret)
 		goto err;
 
-	a = bch2_alloc_to_v4(k, &a_convert);
+	enum bch_lru_type type = lru_type(lru_k);
+	u64 idx = bkey_lru_type_idx(c, type, k);
 
-	switch (type) {
-	case BCH_LRU_read:
-		idx = alloc_lru_idx_read(*a);
-		break;
-	case BCH_LRU_fragmentation:
-		idx = alloc_lru_idx_fragmentation(*a, ca);
-		break;
-	}
-
-	if (lru_k.k->type != KEY_TYPE_set ||
-	    lru_pos_time(lru_k.k->p) != idx) {
+	if (lru_pos_time(lru_k.k->p) != idx) {
 		ret = bch2_btree_write_buffer_maybe_flush(trans, lru_k, last_flushed);
 		if (ret)
 			goto err;
···
 err:
 fsck_err:
 	bch2_trans_iter_exit(trans, &iter);
-	bch2_dev_put(ca);
 	printbuf_exit(&buf2);
 	printbuf_exit(&buf1);
 	return ret;
+18 -4
fs/bcachefs/lru.h
···
 {
 	u16 lru_id = l.k->p.inode >> 48;
 
-	if (lru_id == BCH_LRU_FRAGMENTATION_START)
+	switch (lru_id) {
+	case BCH_LRU_BUCKET_FRAGMENTATION:
 		return BCH_LRU_fragmentation;
-	return BCH_LRU_read;
+	case BCH_LRU_STRIPE_FRAGMENTATION:
+		return BCH_LRU_stripes;
+	default:
+		return BCH_LRU_read;
+	}
 }
 
 int bch2_lru_validate(struct bch_fs *, struct bkey_s_c, struct bkey_validate_context);
···
 
 int bch2_lru_del(struct btree_trans *, u16, u64, u64);
 int bch2_lru_set(struct btree_trans *, u16, u64, u64);
-int bch2_lru_change(struct btree_trans *, u16, u64, u64, u64);
+int __bch2_lru_change(struct btree_trans *, u16, u64, u64, u64);
+
+static inline int bch2_lru_change(struct btree_trans *trans,
+				  u16 lru_id, u64 dev_bucket,
+				  u64 old_time, u64 new_time)
+{
+	return old_time != new_time
+		? __bch2_lru_change(trans, lru_id, dev_bucket, old_time, new_time)
+		: 0;
+}
 
 struct bkey_buf;
-int bch2_lru_check_set(struct btree_trans *, u16, u64, struct bkey_s_c, struct bkey_buf *);
+int bch2_lru_check_set(struct btree_trans *, u16, u64, u64, struct bkey_s_c, struct bkey_buf *);
 
 int bch2_check_lrus(struct bch_fs *);
 
+4 -2
fs/bcachefs/lru_format.h
···
 
 #define BCH_LRU_TYPES() \
 	x(read) \
-	x(fragmentation)
+	x(fragmentation) \
+	x(stripes)
 
 enum bch_lru_type {
 #define x(n) BCH_LRU_##n,
···
 #undef x
 };
 
-#define BCH_LRU_FRAGMENTATION_START ((1U << 16) - 1)
+#define BCH_LRU_BUCKET_FRAGMENTATION ((1U << 16) - 1)
+#define BCH_LRU_STRIPE_FRAGMENTATION ((1U << 16) - 2)
 
 #define LRU_TIME_BITS 48
 #define LRU_TIME_MAX ((1ULL << LRU_TIME_BITS) - 1)
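The lru.h and lru_format.h hunks together define how an LRU key's type is derived from the top 16 bits of its inode field. The sketch below reproduces that mapping in userspace. The `lru_type_of_inode` function follows the `switch` shown in lru.h; the `lru_inode` packing helper is a hypothetical illustration that assumes the LRU id sits directly above the 48 time bits, which the `>> 48` in the diff suggests but does not spell out.

```c
#include <assert.h>
#include <stdint.h>

enum bch_lru_type {
	BCH_LRU_read,
	BCH_LRU_fragmentation,
	BCH_LRU_stripes,
};

#define BCH_LRU_BUCKET_FRAGMENTATION	((1U << 16) - 1)
#define BCH_LRU_STRIPE_FRAGMENTATION	((1U << 16) - 2)

#define LRU_TIME_BITS	48
#define LRU_TIME_MAX	((1ULL << LRU_TIME_BITS) - 1)

/* Hypothetical helper: pack an LRU id and a time into one 64-bit inode field. */
static uint64_t lru_inode(uint16_t lru_id, uint64_t time)
{
	return ((uint64_t) lru_id << LRU_TIME_BITS) | (time & LRU_TIME_MAX);
}

/* Mirrors the switch in lru_type(): two reserved ids, everything else is "read". */
static enum bch_lru_type lru_type_of_inode(uint64_t inode)
{
	uint16_t lru_id = inode >> 48;

	switch (lru_id) {
	case BCH_LRU_BUCKET_FRAGMENTATION:
		return BCH_LRU_fragmentation;
	case BCH_LRU_STRIPE_FRAGMENTATION:
		return BCH_LRU_stripes;
	default:
		return BCH_LRU_read;
	}
}
```

Under this mapping the two highest ids are reserved for bucket and stripe fragmentation, and every other id falls through to the read LRU.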
+20 -6
fs/bcachefs/migrate.c
···
 #include "keylist.h"
 #include "migrate.h"
 #include "move.h"
+#include "progress.h"
 #include "replicas.h"
 #include "super-io.h"
 
···
 	return 0;
 }
 
-static int bch2_dev_usrdata_drop(struct bch_fs *c, unsigned dev_idx, int flags)
+static int bch2_dev_usrdata_drop(struct bch_fs *c,
+				 struct progress_indicator_state *progress,
+				 unsigned dev_idx, int flags)
 {
 	struct btree_trans *trans = bch2_trans_get(c);
 	enum btree_id id;
···
 
 		ret = for_each_btree_key_commit(trans, iter, id, POS_MIN,
 				BTREE_ITER_prefetch|BTREE_ITER_all_snapshots, k,
-				NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
-			bch2_dev_usrdata_drop_key(trans, &iter, k, dev_idx, flags));
+				NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({
+			bch2_progress_update_iter(trans, progress, &iter, "dropping user data");
+			bch2_dev_usrdata_drop_key(trans, &iter, k, dev_idx, flags);
+		}));
 		if (ret)
 			break;
 	}
···
 	return ret;
 }
 
-static int bch2_dev_metadata_drop(struct bch_fs *c, unsigned dev_idx, int flags)
+static int bch2_dev_metadata_drop(struct bch_fs *c,
+				  struct progress_indicator_state *progress,
+				  unsigned dev_idx, int flags)
 {
 	struct btree_trans *trans;
 	struct btree_iter iter;
···
 		while (bch2_trans_begin(trans),
 		       (b = bch2_btree_iter_peek_node(&iter)) &&
 		       !(ret = PTR_ERR_OR_ZERO(b))) {
+			bch2_progress_update_iter(trans, progress, &iter, "dropping metadata");
+
 			if (!bch2_bkey_has_device_c(bkey_i_to_s_c(&b->key), dev_idx))
 				goto next;
 
···
 
 int bch2_dev_data_drop(struct bch_fs *c, unsigned dev_idx, int flags)
 {
-	return bch2_dev_usrdata_drop(c, dev_idx, flags) ?:
-		bch2_dev_metadata_drop(c, dev_idx, flags);
+	struct progress_indicator_state progress;
+	bch2_progress_init(&progress, c,
+			   BIT_ULL(BTREE_ID_extents)|
+			   BIT_ULL(BTREE_ID_reflink));
+
+	return bch2_dev_usrdata_drop(c, &progress, dev_idx, flags) ?:
+		bch2_dev_metadata_drop(c, &progress, dev_idx, flags);
 }
+283 -193
fs/bcachefs/move.c
···
 	NULL
 };
 
-static void trace_move_extent2(struct bch_fs *c, struct bkey_s_c k,
+static void trace_io_move2(struct bch_fs *c, struct bkey_s_c k,
 			       struct bch_io_opts *io_opts,
 			       struct data_update_opts *data_opts)
 {
-	if (trace_move_extent_enabled()) {
+	if (trace_io_move_enabled()) {
 		struct printbuf buf = PRINTBUF;
 
 		bch2_bkey_val_to_text(&buf, c, k);
 		prt_newline(&buf);
 		bch2_data_update_opts_to_text(&buf, c, io_opts, data_opts);
-		trace_move_extent(c, buf.buf);
+		trace_io_move(c, buf.buf);
 		printbuf_exit(&buf);
 	}
 }
 
-static void trace_move_extent_read2(struct bch_fs *c, struct bkey_s_c k)
+static void trace_io_move_read2(struct bch_fs *c, struct bkey_s_c k)
 {
-	if (trace_move_extent_read_enabled()) {
+	if (trace_io_move_read_enabled()) {
 		struct printbuf buf = PRINTBUF;
 
 		bch2_bkey_val_to_text(&buf, c, k);
-		trace_move_extent_read(c, buf.buf);
+		trace_io_move_read(c, buf.buf);
 		printbuf_exit(&buf);
 	}
 }
···
 	unsigned read_sectors;
 	unsigned write_sectors;
 
-	struct bch_read_bio rbio;
-
 	struct data_update write;
-	/* Must be last since it is variable size */
-	struct bio_vec bi_inline_vecs[];
 };
 
 static void move_free(struct moving_io *io)
···
 	if (io->b)
 		atomic_dec(&io->b->count);
 
-	bch2_data_update_exit(&io->write);
-
 	mutex_lock(&ctxt->lock);
 	list_del(&io->io_list);
 	wake_up(&ctxt->wait);
 	mutex_unlock(&ctxt->lock);
 
+	if (!io->write.data_opts.scrub) {
+		bch2_data_update_exit(&io->write);
+	} else {
+		bch2_bio_free_pages_pool(io->write.op.c, &io->write.op.wbio.bio);
+		kfree(io->write.bvecs);
+	}
 	kfree(io);
 }
 
 static void move_write_done(struct bch_write_op *op)
 {
 	struct moving_io *io = container_of(op, struct moving_io, write.op);
+	struct bch_fs *c = op->c;
 	struct moving_context *ctxt = io->write.ctxt;
 
-	if (io->write.op.error)
-		ctxt->write_error = true;
+	if (op->error) {
+		if (trace_io_move_write_fail_enabled()) {
+			struct printbuf buf = PRINTBUF;
 
-	atomic_sub(io->write_sectors, &io->write.ctxt->write_sectors);
-	atomic_dec(&io->write.ctxt->write_ios);
+			bch2_write_op_to_text(&buf, op);
+			prt_printf(&buf, "ret\t%s\n", bch2_err_str(op->error));
+			trace_io_move_write_fail(c, buf.buf);
+			printbuf_exit(&buf);
+		}
+		this_cpu_inc(c->counters[BCH_COUNTER_io_move_write_fail]);
+
+		ctxt->write_error = true;
+	}
+
+	atomic_sub(io->write_sectors, &ctxt->write_sectors);
+	atomic_dec(&ctxt->write_ios);
 	move_free(io);
 	closure_put(&ctxt->cl);
 }
 
 static void move_write(struct moving_io *io)
 {
-	if (unlikely(io->rbio.bio.bi_status || io->rbio.hole)) {
+	struct moving_context *ctxt = io->write.ctxt;
+
+	if (ctxt->stats) {
+		if (io->write.rbio.bio.bi_status)
+			atomic64_add(io->write.rbio.bvec_iter.bi_size >> 9,
+				     &ctxt->stats->sectors_error_uncorrected);
+		else if (io->write.rbio.saw_error)
+			atomic64_add(io->write.rbio.bvec_iter.bi_size >> 9,
+				     &ctxt->stats->sectors_error_corrected);
+	}
+
+	if (unlikely(io->write.rbio.ret ||
+		     io->write.rbio.bio.bi_status ||
+		     io->write.data_opts.scrub)) {
 		move_free(io);
 		return;
 	}
 
-	if (trace_move_extent_write_enabled()) {
+	if (trace_io_move_write_enabled()) {
 		struct bch_fs *c = io->write.op.c;
 		struct printbuf buf = PRINTBUF;
 
 		bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(io->write.k.k));
-		trace_move_extent_write(c, buf.buf);
+		trace_io_move_write(c, buf.buf);
 		printbuf_exit(&buf);
 	}
 
···
 	atomic_add(io->write_sectors, &io->write.ctxt->write_sectors);
 	atomic_inc(&io->write.ctxt->write_ios);
 
-	bch2_data_update_read_done(&io->write, io->rbio.pick.crc);
+	bch2_data_update_read_done(&io->write);
 }
 
 struct moving_io *bch2_moving_ctxt_next_pending_write(struct moving_context *ctxt)
···
 
 static void move_read_endio(struct bio *bio)
 {
-	struct moving_io *io = container_of(bio, struct moving_io, rbio.bio);
+	struct moving_io *io = container_of(bio, struct moving_io, write.rbio.bio);
 	struct moving_context *ctxt = io->write.ctxt;
 
 	atomic_sub(io->read_sectors, &ctxt->read_sectors);
···
 {
 	struct btree_trans *trans = ctxt->trans;
 	struct bch_fs *c = trans->c;
-	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
-	struct moving_io *io;
-	const union bch_extent_entry *entry;
-	struct extent_ptr_decoded p;
-	unsigned sectors = k.k->size, pages;
 	int ret = -ENOMEM;
 
-	trace_move_extent2(c, k, &io_opts, &data_opts);
+	trace_io_move2(c, k, &io_opts, &data_opts);
+	this_cpu_add(c->counters[BCH_COUNTER_io_move], k.k->size);
 
 	if (ctxt->stats)
 		ctxt->stats->pos = BBPOS(iter->btree_id, iter->pos);
···
 	bch2_data_update_opts_normalize(k, &data_opts);
 
 	if (!data_opts.rewrite_ptrs &&
-	    !data_opts.extra_replicas) {
+	    !data_opts.extra_replicas &&
+	    !data_opts.scrub) {
 		if (data_opts.kill_ptrs)
 			return bch2_extent_drop_ptrs(trans, iter, k, &io_opts, &data_opts);
 		return 0;
···
 	 */
 	bch2_trans_unlock(trans);
 
-	/* write path might have to decompress data: */
-	bkey_for_each_ptr_decode(k.k, ptrs, p, entry)
-		sectors = max_t(unsigned, sectors, p.crc.uncompressed_size);
-
-	pages = DIV_ROUND_UP(sectors, PAGE_SECTORS);
-	io = kzalloc(sizeof(struct moving_io) +
-		     sizeof(struct bio_vec) * pages, GFP_KERNEL);
+	struct moving_io *io = kzalloc(sizeof(struct moving_io), GFP_KERNEL);
 	if (!io)
 		goto err;
 
···
 	io->read_sectors = k.k->size;
 	io->write_sectors = k.k->size;
 
-	bio_init(&io->write.op.wbio.bio, NULL, io->bi_inline_vecs, pages, 0);
-	io->write.op.wbio.bio.bi_ioprio =
-		     IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
+	if (!data_opts.scrub) {
+		ret = bch2_data_update_init(trans, iter, ctxt, &io->write, ctxt->wp,
+					    &io_opts, data_opts, iter->btree_id, k);
+		if (ret)
+			goto err_free;
 
-	if (bch2_bio_alloc_pages(&io->write.op.wbio.bio, sectors << 9,
-				 GFP_KERNEL))
-		goto err_free;
+		io->write.op.end_io = move_write_done;
+	} else {
+		bch2_bkey_buf_init(&io->write.k);
+		bch2_bkey_buf_reassemble(&io->write.k, c, k);
 
-	io->rbio.c = c;
-	io->rbio.opts = io_opts;
-	bio_init(&io->rbio.bio, NULL, io->bi_inline_vecs, pages, 0);
-	io->rbio.bio.bi_vcnt = pages;
-	io->rbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
-	io->rbio.bio.bi_iter.bi_size = sectors << 9;
+		io->write.op.c = c;
+		io->write.data_opts = data_opts;
 
-	io->rbio.bio.bi_opf = REQ_OP_READ;
-	io->rbio.bio.bi_iter.bi_sector = bkey_start_offset(k.k);
-	io->rbio.bio.bi_end_io = move_read_endio;
+		ret = bch2_data_update_bios_init(&io->write, c, &io_opts);
+		if (ret)
+			goto err_free;
+	}
 
-	ret = bch2_data_update_init(trans, iter, ctxt, &io->write, ctxt->wp,
-				    io_opts, data_opts, iter->btree_id, k);
-	if (ret)
-		goto err_free_pages;
-
-	io->write.op.end_io = move_write_done;
+	io->write.rbio.bio.bi_end_io = move_read_endio;
+	io->write.rbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
 
 	if (ctxt->rate)
 		bch2_ratelimit_increment(ctxt->rate, k.k->size);
···
 		atomic_inc(&io->b->count);
 	}
 
-	this_cpu_add(c->counters[BCH_COUNTER_io_move], k.k->size);
-	this_cpu_add(c->counters[BCH_COUNTER_move_extent_read], k.k->size);
-	trace_move_extent_read2(c, k);
+	trace_io_move_read2(c, k);
 
 	mutex_lock(&ctxt->lock);
 	atomic_add(io->read_sectors, &ctxt->read_sectors);
···
 	 * ctxt when doing wakeup
 	 */
 	closure_get(&ctxt->cl);
-	bch2_read_extent(trans, &io->rbio,
-			 bkey_start_pos(k.k),
-			 iter->btree_id, k, 0,
-			 BCH_READ_NODECODE|
-			 BCH_READ_LAST_FRAGMENT);
+	__bch2_read_extent(trans, &io->write.rbio,
+			   io->write.rbio.bio.bi_iter,
+			   bkey_start_pos(k.k),
+			   iter->btree_id, k, 0,
+			   NULL,
+			   BCH_READ_last_fragment,
+			   data_opts.scrub ? data_opts.read_dev : -1);
 	return 0;
-err_free_pages:
-	bio_free_pages(&io->write.op.wbio.bio);
 err_free:
 	kfree(io);
 err:
-	if (ret == -BCH_ERR_data_update_done)
+	if (bch2_err_matches(ret, BCH_ERR_data_update_done))
 		return 0;
 
 	if (bch2_err_matches(ret, EROFS) ||
 	    bch2_err_matches(ret, BCH_ERR_transaction_restart))
 		return ret;
 
-	count_event(c, move_extent_start_fail);
+	count_event(c, io_move_start_fail);
 
-	if (trace_move_extent_start_fail_enabled()) {
+	if (trace_io_move_start_fail_enabled()) {
 		struct printbuf buf = PRINTBUF;
 
 		bch2_bkey_val_to_text(&buf, c, k);
 		prt_str(&buf, ": ");
 		prt_str(&buf, bch2_err_str(ret));
-		trace_move_extent_start_fail(c, buf.buf);
+		trace_io_move_start_fail(c, buf.buf);
 		printbuf_exit(&buf);
 	}
 	return ret;
···
 	bch2_trans_begin(trans);
 	bch2_trans_iter_init(trans, &iter, btree_id, start,
 			     BTREE_ITER_prefetch|
+			     BTREE_ITER_not_extents|
 			     BTREE_ITER_all_snapshots);
 
 	if (ctxt->rate)
···
 		    k.k->type == KEY_TYPE_reflink_p &&
 		    REFLINK_P_MAY_UPDATE_OPTIONS(bkey_s_c_to_reflink_p(k).v)) {
 			struct bkey_s_c_reflink_p p = bkey_s_c_to_reflink_p(k);
-			s64 offset_into_extent = iter.pos.offset - bkey_start_offset(k.k);
+			s64 offset_into_extent = 0;
 
 			bch2_trans_iter_exit(trans, &reflink_iter);
 			k = bch2_lookup_indirect_extent(trans, &reflink_iter, &offset_into_extent, p, true, 0);
···
 			 * pointer - need to fixup iter->k
 			 */
 			extent_iter = &reflink_iter;
+			offset_into_extent = 0;
 		}
 
 		if (!bkey_extent_is_direct_data(k.k))
···
 			if (bch2_err_matches(ret2, BCH_ERR_transaction_restart))
 				continue;
 
-			if (ret2 == -ENOMEM) {
+			if (bch2_err_matches(ret2, ENOMEM)) {
 				/* memory allocation failure, wait for some IO to finish */
 				bch2_move_ctxt_wait_for_io(ctxt);
 				continue;
···
 		   bool wait_on_copygc,
 		   move_pred_fn pred, void *arg)
 {
-
 	struct moving_context ctxt;
-	int ret;
 
 	bch2_moving_ctxt_init(&ctxt, c, rate, stats, wp, wait_on_copygc);
-	ret = __bch2_move_data(&ctxt, start, end, pred, arg);
+	int ret = __bch2_move_data(&ctxt, start, end, pred, arg);
 	bch2_moving_ctxt_exit(&ctxt);
 
 	return ret;
 }
 
-int bch2_evacuate_bucket(struct moving_context *ctxt,
-			 struct move_bucket_in_flight *bucket_in_flight,
-			 struct bpos bucket, int gen,
-			 struct data_update_opts _data_opts)
+static int __bch2_move_data_phys(struct moving_context *ctxt,
+			struct move_bucket_in_flight *bucket_in_flight,
+			unsigned dev,
+			u64 bucket_start,
+			u64 bucket_end,
+			unsigned data_types,
+			move_pred_fn pred, void *arg)
 {
 	struct btree_trans *trans = ctxt->trans;
 	struct bch_fs *c = trans->c;
···
 	struct btree_iter iter = {}, bp_iter = {};
 	struct bkey_buf sk;
 	struct bkey_s_c k;
-	struct data_update_opts data_opts;
-	unsigned sectors_moved = 0;
 	struct bkey_buf last_flushed;
 	int ret = 0;
 
-	struct bch_dev *ca = bch2_dev_tryget(c, bucket.inode);
+	struct bch_dev *ca = bch2_dev_tryget(c, dev);
 	if (!ca)
 		return 0;
 
-	trace_bucket_evacuate(c, &bucket);
+	bucket_end = min(bucket_end, ca->mi.nbuckets);
+
+	struct bpos bp_start = bucket_pos_to_bp_start(ca, POS(dev, bucket_start));
+	struct bpos bp_end = bucket_pos_to_bp_end(ca, POS(dev, bucket_end));
+	bch2_dev_put(ca);
+	ca = NULL;
 
 	bch2_bkey_buf_init(&last_flushed);
 	bkey_init(&last_flushed.k->k);
···
 	 */
 	bch2_trans_begin(trans);
 
-	bch2_trans_iter_init(trans, &bp_iter, BTREE_ID_backpointers,
-			     bucket_pos_to_bp_start(ca, bucket), 0);
+	bch2_trans_iter_init(trans, &bp_iter, BTREE_ID_backpointers, bp_start, 0);
 
 	bch_err_msg(c, ret, "looking up alloc key");
 	if (ret)
···
 		if (ret)
 			goto err;
 
-		if (!k.k || bkey_gt(k.k->p, bucket_pos_to_bp_end(ca, bucket)))
+		if (!k.k || bkey_gt(k.k->p, bp_end))
 			break;
 
 		if (k.k->type != KEY_TYPE_backpointer)
···
 
 		struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k);
 
+		if (ctxt->stats)
+			ctxt->stats->offset = bp.k->p.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT;
+
+		if (!(data_types & BIT(bp.v->data_type)))
+			goto next;
+
+		if (!bp.v->level && bp.v->btree_id == BTREE_ID_stripes)
+			goto next;
+
+		k = bch2_backpointer_get_key(trans, bp, &iter, 0, &last_flushed);
+		ret = bkey_err(k);
+		if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
+			continue;
+		if (ret)
+			goto err;
+		if (!k.k)
+			goto next;
+
 		if (!bp.v->level) {
-			k = bch2_backpointer_get_key(trans, bp, &iter, 0, &last_flushed);
-			ret = bkey_err(k);
-			if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
-				continue;
-			if (ret)
-				goto err;
-			if (!k.k)
-				goto next;
-
-
bch2_bkey_buf_reassemble(&sk, c, k); 794 - k = bkey_i_to_s_c(sk.k); 795 - 796 787 ret = bch2_move_get_io_opts_one(trans, &io_opts, &iter, k); 797 788 if (ret) { 798 789 bch2_trans_iter_exit(trans, &iter); 799 790 continue; 800 791 } 801 - 802 - data_opts = _data_opts; 803 - data_opts.target = io_opts.background_target; 804 - data_opts.rewrite_ptrs = 0; 805 - 806 - unsigned sectors = bp.v->bucket_len; /* move_extent will drop locks */ 807 - unsigned i = 0; 808 - const union bch_extent_entry *entry; 809 - struct extent_ptr_decoded p; 810 - bkey_for_each_ptr_decode(k.k, bch2_bkey_ptrs_c(k), p, entry) { 811 - if (p.ptr.dev == bucket.inode) { 812 - if (p.ptr.cached) { 813 - bch2_trans_iter_exit(trans, &iter); 814 - goto next; 815 - } 816 - data_opts.rewrite_ptrs |= 1U << i; 817 - break; 818 - } 819 - i++; 820 - } 821 - 822 - ret = bch2_move_extent(ctxt, bucket_in_flight, 823 - &iter, k, io_opts, data_opts); 824 - bch2_trans_iter_exit(trans, &iter); 825 - 826 - if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 827 - continue; 828 - if (ret == -ENOMEM) { 829 - /* memory allocation failure, wait for some IO to finish */ 830 - bch2_move_ctxt_wait_for_io(ctxt); 831 - continue; 832 - } 833 - if (ret) 834 - goto err; 835 - 836 - if (ctxt->stats) 837 - atomic64_add(sectors, &ctxt->stats->sectors_seen); 838 - sectors_moved += sectors; 839 - } else { 840 - struct btree *b; 841 - 842 - b = bch2_backpointer_get_node(trans, bp, &iter, &last_flushed); 843 - ret = PTR_ERR_OR_ZERO(b); 844 - if (ret == -BCH_ERR_backpointer_to_overwritten_btree_node) 845 - goto next; 846 - if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 847 - continue; 848 - if (ret) 849 - goto err; 850 - if (!b) 851 - goto next; 852 - 853 - unsigned sectors = btree_ptr_sectors_written(bkey_i_to_s_c(&b->key)); 854 - 855 - ret = bch2_btree_node_rewrite(trans, &iter, b, 0); 856 - bch2_trans_iter_exit(trans, &iter); 857 - 858 - if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 859 - continue; 860 - if 
(ret) 861 - goto err; 862 - 863 - if (ctxt->rate) 864 - bch2_ratelimit_increment(ctxt->rate, sectors); 865 - if (ctxt->stats) { 866 - atomic64_add(sectors, &ctxt->stats->sectors_seen); 867 - atomic64_add(sectors, &ctxt->stats->sectors_moved); 868 - } 869 - sectors_moved += btree_sectors(c); 870 792 } 793 + 794 + struct data_update_opts data_opts = {}; 795 + if (!pred(c, arg, k, &io_opts, &data_opts)) { 796 + bch2_trans_iter_exit(trans, &iter); 797 + goto next; 798 + } 799 + 800 + if (data_opts.scrub && 801 + !bch2_dev_idx_is_online(c, data_opts.read_dev)) { 802 + bch2_trans_iter_exit(trans, &iter); 803 + ret = -BCH_ERR_device_offline; 804 + break; 805 + } 806 + 807 + bch2_bkey_buf_reassemble(&sk, c, k); 808 + k = bkey_i_to_s_c(sk.k); 809 + 810 + /* move_extent will drop locks */ 811 + unsigned sectors = bp.v->bucket_len; 812 + 813 + if (!bp.v->level) 814 + ret = bch2_move_extent(ctxt, bucket_in_flight, &iter, k, io_opts, data_opts); 815 + else if (!data_opts.scrub) 816 + ret = bch2_btree_node_rewrite_pos(trans, bp.v->btree_id, bp.v->level, k.k->p, 0); 817 + else 818 + ret = bch2_btree_node_scrub(trans, bp.v->btree_id, bp.v->level, k, data_opts.read_dev); 819 + 820 + bch2_trans_iter_exit(trans, &iter); 821 + 822 + if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 823 + continue; 824 + if (ret == -ENOMEM) { 825 + /* memory allocation failure, wait for some IO to finish */ 826 + bch2_move_ctxt_wait_for_io(ctxt); 827 + continue; 828 + } 829 + if (ret) 830 + goto err; 831 + 832 + if (ctxt->stats) 833 + atomic64_add(sectors, &ctxt->stats->sectors_seen); 871 834 next: 872 835 bch2_btree_iter_advance(&bp_iter); 873 836 } 874 - 875 - trace_evacuate_bucket(c, &bucket, sectors_moved, ca->mi.bucket_size, ret); 876 837 err: 877 838 bch2_trans_iter_exit(trans, &bp_iter); 878 - bch2_dev_put(ca); 879 839 bch2_bkey_buf_exit(&sk, c); 880 840 bch2_bkey_buf_exit(&last_flushed, c); 881 841 return ret; 842 + } 843 + 844 + static int bch2_move_data_phys(struct bch_fs *c, 845 + 
unsigned dev, 846 + u64 start, 847 + u64 end, 848 + unsigned data_types, 849 + struct bch_ratelimit *rate, 850 + struct bch_move_stats *stats, 851 + struct write_point_specifier wp, 852 + bool wait_on_copygc, 853 + move_pred_fn pred, void *arg) 854 + { 855 + struct moving_context ctxt; 856 + 857 + bch2_trans_run(c, bch2_btree_write_buffer_flush_sync(trans)); 858 + 859 + bch2_moving_ctxt_init(&ctxt, c, rate, stats, wp, wait_on_copygc); 860 + ctxt.stats->phys = true; 861 + ctxt.stats->data_type = (int) DATA_PROGRESS_DATA_TYPE_phys; 862 + 863 + int ret = __bch2_move_data_phys(&ctxt, NULL, dev, start, end, data_types, pred, arg); 864 + bch2_moving_ctxt_exit(&ctxt); 865 + 866 + return ret; 867 + } 868 + 869 + struct evacuate_bucket_arg { 870 + struct bpos bucket; 871 + int gen; 872 + struct data_update_opts data_opts; 873 + }; 874 + 875 + static bool evacuate_bucket_pred(struct bch_fs *c, void *_arg, struct bkey_s_c k, 876 + struct bch_io_opts *io_opts, 877 + struct data_update_opts *data_opts) 878 + { 879 + struct evacuate_bucket_arg *arg = _arg; 880 + 881 + *data_opts = arg->data_opts; 882 + 883 + unsigned i = 0; 884 + bkey_for_each_ptr(bch2_bkey_ptrs_c(k), ptr) { 885 + if (ptr->dev == arg->bucket.inode && 886 + (arg->gen < 0 || arg->gen == ptr->gen) && 887 + !ptr->cached) 888 + data_opts->rewrite_ptrs |= BIT(i); 889 + i++; 890 + } 891 + 892 + return data_opts->rewrite_ptrs != 0; 893 + } 894 + 895 + int bch2_evacuate_bucket(struct moving_context *ctxt, 896 + struct move_bucket_in_flight *bucket_in_flight, 897 + struct bpos bucket, int gen, 898 + struct data_update_opts data_opts) 899 + { 900 + struct evacuate_bucket_arg arg = { bucket, gen, data_opts, }; 901 + 902 + return __bch2_move_data_phys(ctxt, bucket_in_flight, 903 + bucket.inode, 904 + bucket.offset, 905 + bucket.offset + 1, 906 + ~0, 907 + evacuate_bucket_pred, &arg); 882 908 } 883 909 884 910 typedef bool (*move_btree_pred)(struct bch_fs *, void *, ··· 1063 1007 return rereplicate_pred(c, arg, 
bkey_i_to_s_c(&b->key), io_opts, data_opts); 1064 1008 } 1065 1009 1066 - static bool migrate_btree_pred(struct bch_fs *c, void *arg, 1067 - struct btree *b, 1068 - struct bch_io_opts *io_opts, 1069 - struct data_update_opts *data_opts) 1070 - { 1071 - return migrate_pred(c, arg, bkey_i_to_s_c(&b->key), io_opts, data_opts); 1072 - } 1073 - 1074 1010 /* 1075 1011 * Ancient versions of bcachefs produced packed formats which could represent 1076 1012 * keys that the in memory format cannot represent; this checks for those ··· 1152 1104 return drop_extra_replicas_pred(c, arg, bkey_i_to_s_c(&b->key), io_opts, data_opts); 1153 1105 } 1154 1106 1107 + static bool scrub_pred(struct bch_fs *c, void *_arg, 1108 + struct bkey_s_c k, 1109 + struct bch_io_opts *io_opts, 1110 + struct data_update_opts *data_opts) 1111 + { 1112 + struct bch_ioctl_data *arg = _arg; 1113 + 1114 + if (k.k->type != KEY_TYPE_btree_ptr_v2) { 1115 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 1116 + const union bch_extent_entry *entry; 1117 + struct extent_ptr_decoded p; 1118 + bkey_for_each_ptr_decode(k.k, ptrs, p, entry) 1119 + if (p.ptr.dev == arg->migrate.dev) { 1120 + if (!p.crc.csum_type) 1121 + return false; 1122 + break; 1123 + } 1124 + } 1125 + 1126 + data_opts->scrub = true; 1127 + data_opts->read_dev = arg->migrate.dev; 1128 + return true; 1129 + } 1130 + 1155 1131 int bch2_data_job(struct bch_fs *c, 1156 1132 struct bch_move_stats *stats, 1157 1133 struct bch_ioctl_data op) ··· 1190 1118 bch2_move_stats_init(stats, bch2_data_ops_strs[op.op]); 1191 1119 1192 1120 switch (op.op) { 1121 + case BCH_DATA_OP_scrub: 1122 + /* 1123 + * prevent tests from spuriously failing, make sure we see all 1124 + * btree nodes that need to be repaired 1125 + */ 1126 + bch2_btree_interior_updates_flush(c); 1127 + 1128 + ret = bch2_move_data_phys(c, op.scrub.dev, 0, U64_MAX, 1129 + op.scrub.data_types, 1130 + NULL, 1131 + stats, 1132 + writepoint_hashed((unsigned long) current), 1133 + false, 1134 + 
scrub_pred, &op) ?: ret; 1135 + break; 1136 + 1193 1137 case BCH_DATA_OP_rereplicate: 1194 1138 stats->data_type = BCH_DATA_journal; 1195 1139 ret = bch2_journal_flush_device_pins(&c->journal, -1); ··· 1225 1137 1226 1138 stats->data_type = BCH_DATA_journal; 1227 1139 ret = bch2_journal_flush_device_pins(&c->journal, op.migrate.dev); 1228 - ret = bch2_move_btree(c, start, end, 1229 - migrate_btree_pred, &op, stats) ?: ret; 1230 - ret = bch2_move_data(c, start, end, 1231 - NULL, 1232 - stats, 1233 - writepoint_hashed((unsigned long) current), 1234 - true, 1235 - migrate_pred, &op) ?: ret; 1140 + ret = bch2_move_data_phys(c, op.migrate.dev, 0, U64_MAX, 1141 + ~0, 1142 + NULL, 1143 + stats, 1144 + writepoint_hashed((unsigned long) current), 1145 + true, 1146 + migrate_pred, &op) ?: ret; 1147 + bch2_btree_interior_updates_flush(c); 1236 1148 ret = bch2_replicas_gc2(c) ?: ret; 1237 1149 break; 1238 1150 case BCH_DATA_OP_rewrite_old_nodes: ··· 1264 1176 prt_newline(out); 1265 1177 printbuf_indent_add(out, 2); 1266 1178 1267 - prt_printf(out, "keys moved: %llu\n", atomic64_read(&stats->keys_moved)); 1268 - prt_printf(out, "keys raced: %llu\n", atomic64_read(&stats->keys_raced)); 1269 - prt_printf(out, "bytes seen: "); 1179 + prt_printf(out, "keys moved:\t%llu\n", atomic64_read(&stats->keys_moved)); 1180 + prt_printf(out, "keys raced:\t%llu\n", atomic64_read(&stats->keys_raced)); 1181 + prt_printf(out, "bytes seen:\t"); 1270 1182 prt_human_readable_u64(out, atomic64_read(&stats->sectors_seen) << 9); 1271 1183 prt_newline(out); 1272 1184 1273 - prt_printf(out, "bytes moved: "); 1185 + prt_printf(out, "bytes moved:\t"); 1274 1186 prt_human_readable_u64(out, atomic64_read(&stats->sectors_moved) << 9); 1275 1187 prt_newline(out); 1276 1188 1277 - prt_printf(out, "bytes raced: "); 1189 + prt_printf(out, "bytes raced:\t"); 1278 1190 prt_human_readable_u64(out, atomic64_read(&stats->sectors_raced) << 9); 1279 1191 prt_newline(out); 1280 1192 ··· 1283 1195 1284 1196 static void 
bch2_moving_ctxt_to_text(struct printbuf *out, struct bch_fs *c, struct moving_context *ctxt) 1285 1197 { 1286 - struct moving_io *io; 1198 + if (!out->nr_tabstops) 1199 + printbuf_tabstop_push(out, 32); 1287 1200 1288 1201 bch2_move_stats_to_text(out, ctxt->stats); 1289 1202 printbuf_indent_add(out, 2); ··· 1304 1215 printbuf_indent_add(out, 2); 1305 1216 1306 1217 mutex_lock(&ctxt->lock); 1218 + struct moving_io *io; 1307 1219 list_for_each_entry(io, &ctxt->ios, io_list) 1308 - bch2_write_op_to_text(out, &io->write.op); 1220 + bch2_data_update_inflight_to_text(out, &io->write); 1309 1221 mutex_unlock(&ctxt->lock); 1310 1222 1311 1223 printbuf_indent_sub(out, 4);
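The pointer selection done by evacuate_bucket_pred() in the move.c diff above can be looked at in isolation. What follows is a minimal userspace sketch under stated assumptions, not the kernel code: struct sketch_ptr and sketch_rewrite_mask() are hypothetical names, and the real predicate walks the key's pointers with bkey_for_each_ptr() instead of indexing an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one extent pointer. */
struct sketch_ptr {
	unsigned	dev;	/* device index the pointer refers to */
	unsigned	gen;	/* bucket generation recorded in the pointer */
	bool		cached;	/* cached copies are never rewritten */
};

/*
 * Build a bitmask of pointers to rewrite: bit i is set when pointer i
 * lives on `dev`, matches `gen` (gen < 0 means "any generation"), and
 * is not a cached copy. A zero result means "nothing to move".
 */
static uint32_t sketch_rewrite_mask(const struct sketch_ptr *ptrs, size_t nr,
				    unsigned dev, int gen)
{
	uint32_t mask = 0;

	assert(nr <= 32);	/* one bit per pointer */

	for (size_t i = 0; i < nr; i++)
		if (ptrs[i].dev == dev &&
		    (gen < 0 || (unsigned) gen == ptrs[i].gen) &&
		    !ptrs[i].cached)
			mask |= UINT32_C(1) << i;

	return mask;
}
```

As in the diff, the caller treats a zero mask as "skip this key", which is how the predicate's boolean return value is derived from rewrite_ptrs.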
+17 -3
fs/bcachefs/move_types.h
···
 #define _BCACHEFS_MOVE_TYPES_H

 #include "bbpos_types.h"
+#include "bcachefs_ioctl.h"

 struct bch_move_stats {
-	enum bch_data_type data_type;
-	struct bbpos pos;
 	char name[32];
+	bool phys;
+	enum bch_ioctl_data_event_ret ret;
+
+	union {
+		struct {
+			enum bch_data_type data_type;
+			struct bbpos pos;
+		};
+		struct {
+			unsigned dev;
+			u64 offset;
+		};
+	};

 	atomic64_t keys_moved;
 	atomic64_t keys_raced;
 	atomic64_t sectors_seen;
 	atomic64_t sectors_moved;
 	atomic64_t sectors_raced;
+	atomic64_t sectors_error_corrected;
+	atomic64_t sectors_error_uncorrected;
 };

 struct move_bucket_key {
 	struct bpos bucket;
-	u8 gen;
+	unsigned gen;
 };

 struct move_bucket {
+13 -2
fs/bcachefs/movinggc.c
···
 	bch2_trans_begin(trans);

 	ret = for_each_btree_key_max(trans, iter, BTREE_ID_lru,
-			lru_pos(BCH_LRU_FRAGMENTATION_START, 0, 0),
-			lru_pos(BCH_LRU_FRAGMENTATION_START, U64_MAX, LRU_TIME_MAX),
+			lru_pos(BCH_LRU_BUCKET_FRAGMENTATION, 0, 0),
+			lru_pos(BCH_LRU_BUCKET_FRAGMENTATION, U64_MAX, LRU_TIME_MAX),
 			0, k, ({
 		struct move_bucket b = { .k.bucket = u64_to_bucket(k.k->p.offset) };
 		int ret2 = 0;
···
 	prt_printf(out, "Currently calculated wait:\t");
 	prt_human_readable_u64(out, bch2_copygc_wait_amount(c));
 	prt_newline(out);
+
+	rcu_read_lock();
+	struct task_struct *t = rcu_dereference(c->copygc_thread);
+	if (t)
+		get_task_struct(t);
+	rcu_read_unlock();
+
+	if (t) {
+		bch2_prt_task_backtrace(out, t, 0, GFP_KERNEL);
+		put_task_struct(t);
+	}
 }

 static int bch2_copygc_thread(void *arg)
+56 -59
fs/bcachefs/opts.c
··· 163 163 [DT_SUBVOL] = "subvol", 164 164 }; 165 165 166 - u64 BCH2_NO_SB_OPT(const struct bch_sb *sb) 167 - { 168 - BUG(); 169 - } 170 - 171 - void SET_BCH2_NO_SB_OPT(struct bch_sb *sb, u64 v) 172 - { 173 - BUG(); 174 - } 175 - 176 166 void bch2_opts_apply(struct bch_opts *dst, struct bch_opts src) 177 167 { 178 168 #define x(_name, ...) \ ··· 213 223 } 214 224 } 215 225 226 + /* dummy option, for options that aren't stored in the superblock */ 227 + typedef u64 (*sb_opt_get_fn)(const struct bch_sb *); 228 + typedef void (*sb_opt_set_fn)(struct bch_sb *, u64); 229 + typedef u64 (*member_opt_get_fn)(const struct bch_member *); 230 + typedef void (*member_opt_set_fn)(struct bch_member *, u64); 231 + 232 + __maybe_unused static const sb_opt_get_fn BCH2_NO_SB_OPT = NULL; 233 + __maybe_unused static const sb_opt_set_fn SET_BCH2_NO_SB_OPT = NULL; 234 + __maybe_unused static const member_opt_get_fn BCH2_NO_MEMBER_OPT = NULL; 235 + __maybe_unused static const member_opt_set_fn SET_BCH2_NO_MEMBER_OPT = NULL; 236 + 237 + #define type_compatible_or_null(_p, _type) \ 238 + __builtin_choose_expr( \ 239 + __builtin_types_compatible_p(typeof(_p), typeof(_type)), _p, NULL) 240 + 216 241 const struct bch_option bch2_opt_table[] = { 217 242 #define OPT_BOOL() .type = BCH_OPT_BOOL, .min = 0, .max = 2 218 243 #define OPT_UINT(_min, _max) .type = BCH_OPT_UINT, \ ··· 244 239 245 240 #define x(_name, _bits, _flags, _type, _sb_opt, _default, _hint, _help) \ 246 241 [Opt_##_name] = { \ 247 - .attr = { \ 248 - .name = #_name, \ 249 - .mode = (_flags) & OPT_RUNTIME ? 0644 : 0444, \ 250 - }, \ 251 - .flags = _flags, \ 252 - .hint = _hint, \ 253 - .help = _help, \ 254 - .get_sb = _sb_opt, \ 255 - .set_sb = SET_##_sb_opt, \ 242 + .attr.name = #_name, \ 243 + .attr.mode = (_flags) & OPT_RUNTIME ? 
0644 : 0444, \ 244 + .flags = _flags, \ 245 + .hint = _hint, \ 246 + .help = _help, \ 247 + .get_sb = type_compatible_or_null(_sb_opt, *BCH2_NO_SB_OPT), \ 248 + .set_sb = type_compatible_or_null(SET_##_sb_opt,*SET_BCH2_NO_SB_OPT), \ 249 + .get_member = type_compatible_or_null(_sb_opt, *BCH2_NO_MEMBER_OPT), \ 250 + .set_member = type_compatible_or_null(SET_##_sb_opt,*SET_BCH2_NO_MEMBER_OPT),\ 256 251 _type \ 257 252 }, 258 253 ··· 480 475 } 481 476 } 482 477 483 - int bch2_opt_check_may_set(struct bch_fs *c, int id, u64 v) 478 + int bch2_opt_check_may_set(struct bch_fs *c, struct bch_dev *ca, int id, u64 v) 484 479 { 480 + lockdep_assert_held(&c->state_lock); 481 + 485 482 int ret = 0; 486 483 487 484 switch (id) { 485 + case Opt_state: 486 + if (ca) 487 + return __bch2_dev_set_state(c, ca, v, BCH_FORCE_IF_DEGRADED); 488 + break; 489 + 488 490 case Opt_compression: 489 491 case Opt_background_compression: 490 492 ret = bch2_check_set_has_compressed_data(c, v); ··· 507 495 508 496 int bch2_opts_check_may_set(struct bch_fs *c) 509 497 { 510 - unsigned i; 511 - int ret; 512 - 513 - for (i = 0; i < bch2_opts_nr; i++) { 514 - ret = bch2_opt_check_may_set(c, i, 515 - bch2_opt_get_by_id(&c->opts, i)); 498 + for (unsigned i = 0; i < bch2_opts_nr; i++) { 499 + int ret = bch2_opt_check_may_set(c, NULL, i, bch2_opt_get_by_id(&c->opts, i)); 516 500 if (ret) 517 501 return ret; 518 502 } ··· 627 619 return ret; 628 620 } 629 621 630 - u64 bch2_opt_from_sb(struct bch_sb *sb, enum bch_opt_id id) 622 + u64 bch2_opt_from_sb(struct bch_sb *sb, enum bch_opt_id id, int dev_idx) 631 623 { 632 624 const struct bch_option *opt = bch2_opt_table + id; 633 625 u64 v; 634 626 635 - v = opt->get_sb(sb); 627 + if (dev_idx < 0) { 628 + v = opt->get_sb(sb); 629 + } else { 630 + if (WARN(!bch2_member_exists(sb, dev_idx), 631 + "tried to set device option %s on nonexistent device %i", 632 + opt->attr.name, dev_idx)) 633 + return 0; 634 + 635 + struct bch_member m = bch2_sb_member_get(sb, dev_idx); 
636 + v = opt->get_member(&m); 637 + } 638 + 639 + if (opt->flags & OPT_SB_FIELD_ONE_BIAS) 640 + --v; 636 641 637 642 if (opt->flags & OPT_SB_FIELD_ILOG2) 638 643 v = 1ULL << v; ··· 662 641 */ 663 642 int bch2_opts_from_sb(struct bch_opts *opts, struct bch_sb *sb) 664 643 { 665 - unsigned id; 666 - 667 - for (id = 0; id < bch2_opts_nr; id++) { 644 + for (unsigned id = 0; id < bch2_opts_nr; id++) { 668 645 const struct bch_option *opt = bch2_opt_table + id; 669 646 670 - if (opt->get_sb == BCH2_NO_SB_OPT) 671 - continue; 672 - 673 - bch2_opt_set_by_id(opts, id, bch2_opt_from_sb(sb, id)); 647 + if (opt->get_sb) 648 + bch2_opt_set_by_id(opts, id, bch2_opt_from_sb(sb, id, -1)); 674 649 } 675 650 676 651 return 0; 677 652 } 678 653 679 - struct bch_dev_sb_opt_set { 680 - void (*set_sb)(struct bch_member *, u64); 681 - }; 682 - 683 - static const struct bch_dev_sb_opt_set bch2_dev_sb_opt_setters [] = { 684 - #define x(n, set) [Opt_##n] = { .set_sb = SET_##set }, 685 - BCH_DEV_OPT_SETTERS() 686 - #undef x 687 - }; 688 - 689 654 void __bch2_opt_set_sb(struct bch_sb *sb, int dev_idx, 690 655 const struct bch_option *opt, u64 v) 691 656 { 692 - enum bch_opt_id id = opt - bch2_opt_table; 693 - 694 657 if (opt->flags & OPT_SB_FIELD_SECTORS) 695 658 v >>= 9; 696 659 ··· 684 679 if (opt->flags & OPT_SB_FIELD_ONE_BIAS) 685 680 v++; 686 681 687 - if (opt->flags & OPT_FS) { 688 - if (opt->set_sb != SET_BCH2_NO_SB_OPT) 689 - opt->set_sb(sb, v); 690 - } 682 + if ((opt->flags & OPT_FS) && opt->set_sb && dev_idx < 0) 683 + opt->set_sb(sb, v); 691 684 692 - if ((opt->flags & OPT_DEVICE) && dev_idx >= 0) { 685 + if ((opt->flags & OPT_DEVICE) && opt->set_member && dev_idx >= 0) { 693 686 if (WARN(!bch2_member_exists(sb, dev_idx), 694 687 "tried to set device option %s on nonexistent device %i", 695 688 opt->attr.name, dev_idx)) 696 689 return; 697 690 698 - struct bch_member *m = bch2_members_v2_get_mut(sb, dev_idx); 699 - 700 - const struct bch_dev_sb_opt_set *set = 
bch2_dev_sb_opt_setters + id; 701 - if (set->set_sb) 702 - set->set_sb(m, v); 703 - else 704 - pr_err("option %s cannot be set via opt_set_sb()", opt->attr.name); 691 + opt->set_member(bch2_members_v2_get_mut(sb, dev_idx), v); 705 692 } 706 693 } 707 694
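The opts.c change above replaces the BUG()-ing dummy accessors with typed NULL sentinels and a type_compatible_or_null() macro, so that one accessor name can populate either the superblock slot or the member slot of the option table at compile time. The sketch below tries to show that mechanism in a self-contained form. It assumes GCC or Clang (both builtins are compiler extensions), and every struct, typedef and function name in it is hypothetical; only the macro body mirrors the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the superblock and per-device member. */
struct sketch_sb	{ unsigned long long block_size; };
struct sketch_member	{ unsigned long long durability; };

typedef unsigned long long (*sb_get_fn)(const struct sketch_sb *);
typedef unsigned long long (*member_get_fn)(const struct sketch_member *);

/* Typed NULL sentinels: only their types are ever used. */
static const sb_get_fn NO_SB_OPT __attribute__((unused)) = NULL;
static const member_get_fn NO_MEMBER_OPT __attribute__((unused)) = NULL;

/* Yield _p if its type matches _type, otherwise NULL; resolved at compile time. */
#define type_compatible_or_null(_p, _type)				\
	__builtin_choose_expr(						\
		__builtin_types_compatible_p(__typeof__(_p), __typeof__(_type)), _p, NULL)

static unsigned long long sb_block_size(const struct sketch_sb *sb)
{
	return sb->block_size;
}

static unsigned long long member_durability(const struct sketch_member *m)
{
	return m->durability;
}

struct sketch_option {
	sb_get_fn	get_sb;
	member_get_fn	get_member;
};

/* The same accessor is offered to both slots; only the matching one sticks. */
#define SKETCH_OPT(_fn) {						\
	.get_sb		= type_compatible_or_null(_fn, *NO_SB_OPT),	\
	.get_member	= type_compatible_or_null(_fn, *NO_MEMBER_OPT),	\
}

static const struct sketch_option opt_block_size = SKETCH_OPT(sb_block_size);
static const struct sketch_option opt_durability = SKETCH_OPT(member_durability);
```

With this shape, callers can test the slot itself for NULL, which matches how the diff turns the old comparison against BCH2_NO_SB_OPT into a plain `if (opt->get_sb)` check.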
+37 -32
fs/bcachefs/opts.h
··· 50 50 * apply the options from that struct that are defined. 51 51 */ 52 52 53 - /* dummy option, for options that aren't stored in the superblock */ 54 - u64 BCH2_NO_SB_OPT(const struct bch_sb *); 55 - void SET_BCH2_NO_SB_OPT(struct bch_sb *, u64); 56 - 57 53 /* When can be set: */ 58 54 enum opt_flags { 59 55 OPT_FS = BIT(0), /* Filesystem option */ ··· 128 132 OPT_FS|OPT_FORMAT| \ 129 133 OPT_HUMAN_READABLE|OPT_MUST_BE_POW_2|OPT_SB_FIELD_SECTORS, \ 130 134 OPT_UINT(512, 1U << 16), \ 131 - BCH_SB_BLOCK_SIZE, 8, \ 135 + BCH_SB_BLOCK_SIZE, 4 << 10, \ 132 136 "size", NULL) \ 133 137 x(btree_node_size, u32, \ 134 138 OPT_FS|OPT_FORMAT| \ 135 139 OPT_HUMAN_READABLE|OPT_MUST_BE_POW_2|OPT_SB_FIELD_SECTORS, \ 136 140 OPT_UINT(512, 1U << 20), \ 137 - BCH_SB_BTREE_NODE_SIZE, 512, \ 141 + BCH_SB_BTREE_NODE_SIZE, 256 << 10, \ 138 142 "size", "Btree node size, default 256k") \ 139 143 x(errors, u8, \ 140 144 OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 141 145 OPT_STR(bch2_error_actions), \ 142 146 BCH_SB_ERROR_ACTION, BCH_ON_ERROR_fix_safe, \ 143 147 NULL, "Action to take on filesystem error") \ 148 + x(write_error_timeout, u16, \ 149 + OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 150 + OPT_UINT(1, 300), \ 151 + BCH_SB_WRITE_ERROR_TIMEOUT, 30, \ 152 + NULL, "Number of consecutive write errors allowed before kicking out a device")\ 144 153 x(metadata_replicas, u8, \ 145 154 OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 146 155 OPT_UINT(1, BCH_REPLICAS_MAX), \ ··· 182 181 OPT_STR(__bch2_csum_opts), \ 183 182 BCH_SB_DATA_CSUM_TYPE, BCH_CSUM_OPT_crc32c, \ 184 183 NULL, NULL) \ 184 + x(checksum_err_retry_nr, u8, \ 185 + OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 186 + OPT_UINT(0, 32), \ 187 + BCH_SB_CSUM_ERR_RETRY_NR, 3, \ 188 + NULL, NULL) \ 185 189 x(compression, u8, \ 186 190 OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 187 191 OPT_FN(bch2_opt_compression), \ ··· 203 197 BCH_SB_STR_HASH_TYPE, BCH_STR_HASH_OPT_siphash, \ 204 198 NULL, "Hash function for directory entries 
and xattrs")\ 205 199 x(metadata_target, u16, \ 206 - OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 200 + OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 207 201 OPT_FN(bch2_opt_target), \ 208 202 BCH_SB_METADATA_TARGET, 0, \ 209 203 "(target)", "Device or label for metadata writes") \ ··· 314 308 OPT_BOOL(), \ 315 309 BCH2_NO_SB_OPT, false, \ 316 310 NULL, "Don't kick drives out when splitbrain detected")\ 317 - x(discard, u8, \ 318 - OPT_FS|OPT_MOUNT|OPT_DEVICE, \ 319 - OPT_BOOL(), \ 320 - BCH2_NO_SB_OPT, true, \ 321 - NULL, "Enable discard/TRIM support") \ 322 311 x(verbose, u8, \ 323 312 OPT_FS|OPT_MOUNT|OPT_RUNTIME, \ 324 313 OPT_BOOL(), \ ··· 494 493 BCH2_NO_SB_OPT, false, \ 495 494 NULL, "Skip submit_bio() for data reads and writes, " \ 496 495 "for performance testing purposes") \ 497 - x(fs_size, u64, \ 498 - OPT_DEVICE, \ 496 + x(state, u64, \ 497 + OPT_DEVICE|OPT_RUNTIME, \ 498 + OPT_STR(bch2_member_states), \ 499 + BCH_MEMBER_STATE, BCH_MEMBER_STATE_rw, \ 500 + "state", "rw,ro,failed,spare") \ 501 + x(bucket_size, u32, \ 502 + OPT_DEVICE|OPT_HUMAN_READABLE|OPT_SB_FIELD_SECTORS, \ 499 503 OPT_UINT(0, S64_MAX), \ 500 - BCH2_NO_SB_OPT, 0, \ 501 - "size", "Size of filesystem on device") \ 502 - x(bucket, u32, \ 503 - OPT_DEVICE, \ 504 - OPT_UINT(0, S64_MAX), \ 505 - BCH2_NO_SB_OPT, 0, \ 504 + BCH_MEMBER_BUCKET_SIZE, 0, \ 506 505 "size", "Specifies the bucket size; must be greater than the btree node size")\ 507 506 x(durability, u8, \ 508 - OPT_DEVICE|OPT_SB_FIELD_ONE_BIAS, \ 507 + OPT_DEVICE|OPT_RUNTIME|OPT_SB_FIELD_ONE_BIAS, \ 509 508 OPT_UINT(0, BCH_REPLICAS_MAX), \ 510 - BCH2_NO_SB_OPT, 1, \ 509 + BCH_MEMBER_DURABILITY, 1, \ 511 510 "n", "Data written to this device will be considered\n"\ 512 511 "to have already been replicated n times") \ 513 512 x(data_allowed, u8, \ 514 513 OPT_DEVICE, \ 515 514 OPT_BITFIELD(__bch2_data_types), \ 516 - BCH2_NO_SB_OPT, BIT(BCH_DATA_journal)|BIT(BCH_DATA_btree)|BIT(BCH_DATA_user),\ 515 + BCH_MEMBER_DATA_ALLOWED, 
BIT(BCH_DATA_journal)|BIT(BCH_DATA_btree)|BIT(BCH_DATA_user),\ 517 516 "types", "Allowed data types for this device: journal, btree, and/or user")\ 517 + x(discard, u8, \ 518 + OPT_MOUNT|OPT_DEVICE|OPT_RUNTIME, \ 519 + OPT_BOOL(), \ 520 + BCH_MEMBER_DISCARD, true, \ 521 + NULL, "Enable discard/TRIM support") \ 518 522 x(btree_node_prefetch, u8, \ 519 523 OPT_FS|OPT_MOUNT|OPT_RUNTIME, \ 520 524 OPT_BOOL(), \ 521 525 BCH2_NO_SB_OPT, true, \ 522 526 NULL, "BTREE_ITER_prefetch casuse btree nodes to be\n"\ 523 527 " prefetched sequentially") 524 - 525 - #define BCH_DEV_OPT_SETTERS() \ 526 - x(discard, BCH_MEMBER_DISCARD) \ 527 - x(durability, BCH_MEMBER_DURABILITY) \ 528 - x(data_allowed, BCH_MEMBER_DATA_ALLOWED) 529 528 530 529 struct bch_opts { 531 530 #define x(_name, _bits, ...) unsigned _name##_defined:1; ··· 583 582 584 583 struct bch_option { 585 584 struct attribute attr; 586 - u64 (*get_sb)(const struct bch_sb *); 587 - void (*set_sb)(struct bch_sb *, u64); 588 585 enum opt_type type; 589 586 enum opt_flags flags; 590 587 u64 min, max; ··· 594 595 const char *hint; 595 596 const char *help; 596 597 598 + u64 (*get_sb)(const struct bch_sb *); 599 + void (*set_sb)(struct bch_sb *, u64); 600 + 601 + u64 (*get_member)(const struct bch_member *); 602 + void (*set_member)(struct bch_member *, u64); 603 + 597 604 }; 598 605 599 606 extern const struct bch_option bch2_opt_table[]; ··· 608 603 u64 bch2_opt_get_by_id(const struct bch_opts *, enum bch_opt_id); 609 604 void bch2_opt_set_by_id(struct bch_opts *, enum bch_opt_id, u64); 610 605 611 - u64 bch2_opt_from_sb(struct bch_sb *, enum bch_opt_id); 606 + u64 bch2_opt_from_sb(struct bch_sb *, enum bch_opt_id, int); 612 607 int bch2_opts_from_sb(struct bch_opts *, struct bch_sb *); 613 608 void __bch2_opt_set_sb(struct bch_sb *, int, const struct bch_option *, u64); 614 609 ··· 630 625 struct bch_fs *, struct bch_sb *, 631 626 unsigned, unsigned, unsigned); 632 627 633 - int bch2_opt_check_may_set(struct bch_fs *, int, 
u64); 628 + int bch2_opt_check_may_set(struct bch_fs *, struct bch_dev *, int, u64); 634 629 int bch2_opts_check_may_set(struct bch_fs *); 635 630 int bch2_parse_one_mount_opt(struct bch_fs *, struct bch_opts *, 636 631 struct printbuf *, const char *, const char *);
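Several options in this table carry OPT_SB_FIELD_* flags that change how a value is stored on disk, and the opts.c diff shows both directions of that conversion. A rough round-trip sketch follows. The flag and function names are hypothetical, and two steps are assumptions because the diff collapses them: the ilog2 step on the write side and the sector shift on the read side.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag names mirroring OPT_SB_FIELD_{SECTORS,ILOG2,ONE_BIAS}. */
enum sketch_flags {
	SKETCH_FIELD_SECTORS	= 1 << 0,	/* stored in 512-byte sectors */
	SKETCH_FIELD_ILOG2	= 1 << 1,	/* stored as log2 of the value */
	SKETCH_FIELD_ONE_BIAS	= 1 << 2,	/* stored as value + 1 */
};

/* Integer log2 of a nonzero value. */
static unsigned sketch_ilog2(uint64_t v)
{
	unsigned r = 0;

	assert(v != 0);
	while (v >>= 1)
		r++;
	return r;
}

/* In-memory option value -> on-disk field value. */
static uint64_t sketch_opt_to_sb(uint64_t v, unsigned flags)
{
	if (flags & SKETCH_FIELD_SECTORS)
		v >>= 9;
	if (flags & SKETCH_FIELD_ILOG2)		/* assumed: collapsed in the diff */
		v = sketch_ilog2(v);
	if (flags & SKETCH_FIELD_ONE_BIAS)
		v++;
	return v;
}

/* On-disk field value -> in-memory option value (inverse order). */
static uint64_t sketch_opt_from_sb(uint64_t v, unsigned flags)
{
	if (flags & SKETCH_FIELD_ONE_BIAS)
		--v;
	if (flags & SKETCH_FIELD_ILOG2)
		v = UINT64_C(1) << v;
	if (flags & SKETCH_FIELD_SECTORS)	/* assumed: collapsed in the diff */
		v <<= 9;
	return v;
}
```

Under these assumptions a byte value of 4 << 10 with the sectors flag is stored as 8, which lines up with the block_size default in the table changing from 8 to 4 << 10.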
+63
fs/bcachefs/progress.c
+// SPDX-License-Identifier: GPL-2.0
+#include "bcachefs.h"
+#include "bbpos.h"
+#include "disk_accounting.h"
+#include "progress.h"
+
+void bch2_progress_init(struct progress_indicator_state *s,
+			struct bch_fs *c,
+			u64 btree_id_mask)
+{
+	memset(s, 0, sizeof(*s));
+
+	s->next_print = jiffies + HZ * 10;
+
+	for (unsigned i = 0; i < BTREE_ID_NR; i++) {
+		if (!(btree_id_mask & BIT_ULL(i)))
+			continue;
+
+		struct disk_accounting_pos acc = {
+			.type = BCH_DISK_ACCOUNTING_btree,
+			.btree.id = i,
+		};
+
+		u64 v;
+		bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1);
+		s->nodes_total += div64_ul(v, btree_sectors(c));
+	}
+}
+
+static inline bool progress_update_p(struct progress_indicator_state *s)
+{
+	bool ret = time_after_eq(jiffies, s->next_print);
+
+	if (ret)
+		s->next_print = jiffies + HZ * 10;
+	return ret;
+}
+
+void bch2_progress_update_iter(struct btree_trans *trans,
+			       struct progress_indicator_state *s,
+			       struct btree_iter *iter,
+			       const char *msg)
+{
+	struct bch_fs *c = trans->c;
+	struct btree *b = path_l(btree_iter_path(trans, iter))->b;
+
+	s->nodes_seen += b != s->last_node;
+	s->last_node = b;
+
+	if (progress_update_p(s)) {
+		struct printbuf buf = PRINTBUF;
+		unsigned percent = s->nodes_total
+			? div64_u64(s->nodes_seen * 100, s->nodes_total)
+			: 0;
+
+		prt_printf(&buf, "%s: %d%%, done %llu/%llu nodes, at ",
+			   msg, percent, s->nodes_seen, s->nodes_total);
+		bch2_bbpos_to_text(&buf, BBPOS(iter->btree_id, iter->pos));
+
+		bch_info(c, "%s", buf.buf);
+		printbuf_exit(&buf);
+	}
+}
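The new progress.c combines three small ideas: counting a btree node only when it differs from the previously visited one, rate-limiting output to one message per interval, and guarding the percentage against a zero total. A minimal userspace sketch of those ideas is given here. All names are hypothetical, time is passed in as a plain tick count, and the comparison is a simple `>=`, whereas the kernel code uses jiffies with time_after_eq(), which also copes with counter wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INTERVAL	10	/* ticks between prints; stands in for HZ * 10 */

/* Hypothetical stand-in for struct progress_indicator_state. */
struct sketch_progress {
	uint64_t	next_print;	/* earliest tick at which to print again */
	uint64_t	nodes_seen;
	uint64_t	nodes_total;
	const void	*last_node;	/* most recently visited node */
};

/* Count a node only the first time it is seen in a row. */
static void sketch_visit(struct sketch_progress *s, const void *node)
{
	s->nodes_seen += node != s->last_node;
	s->last_node = node;
}

/* Return true at most once per interval, and re-arm the timer when it does. */
static bool sketch_should_print(struct sketch_progress *s, uint64_t now)
{
	bool ret = now >= s->next_print;

	if (ret)
		s->next_print = now + SKETCH_INTERVAL;
	return ret;
}

/* Percentage done; 0 when the total is unknown, avoiding a divide by zero. */
static unsigned sketch_percent(const struct sketch_progress *s)
{
	return s->nodes_total
		? (unsigned) (s->nodes_seen * 100 / s->nodes_total)
		: 0;
}
```

A caller would invoke sketch_visit() once per iteration and only format a message when sketch_should_print() returns true, which keeps the cost of the common path to a comparison and an increment.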
+29
fs/bcachefs/progress.h
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _BCACHEFS_PROGRESS_H
+#define _BCACHEFS_PROGRESS_H
+
+/*
+ * Lame progress indicators
+ *
+ * We don't like to use these because they print to the dmesg console, which is
+ * spammy - we much prefer to be wired up to a userspace programm (e.g. via
+ * thread_with_file) and have it print the progress indicator.
+ *
+ * But some code is old and doesn't support that, or runs in a context where
+ * that's not yet practical (mount).
+ */
+
+struct progress_indicator_state {
+	unsigned long next_print;
+	u64 nodes_seen;
+	u64 nodes_total;
+	struct btree *last_node;
+};
+
+void bch2_progress_init(struct progress_indicator_state *, struct bch_fs *, u64);
+void bch2_progress_update_iter(struct btree_trans *,
+			       struct progress_indicator_state *,
+			       struct btree_iter *,
+			       const char *);
+
+#endif /* _BCACHEFS_PROGRESS_H */
+37 -9
fs/bcachefs/rebalance.c
··· 26 26 27 27 /* bch_extent_rebalance: */ 28 28 29 - static const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c k) 29 + static const struct bch_extent_rebalance *bch2_bkey_ptrs_rebalance_opts(struct bkey_ptrs_c ptrs) 30 30 { 31 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 32 31 const union bch_extent_entry *entry; 33 32 34 33 bkey_extent_entry_for_each(ptrs, entry) ··· 35 36 return &entry->rebalance; 36 37 37 38 return NULL; 39 + } 40 + 41 + static const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c k) 42 + { 43 + return bch2_bkey_ptrs_rebalance_opts(bch2_bkey_ptrs_c(k)); 38 44 } 39 45 40 46 static inline unsigned bch2_bkey_ptrs_need_compress(struct bch_fs *c, ··· 101 97 102 98 u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *c, struct bkey_s_c k) 103 99 { 104 - const struct bch_extent_rebalance *opts = bch2_bkey_rebalance_opts(k); 100 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 101 + 102 + const struct bch_extent_rebalance *opts = bch2_bkey_ptrs_rebalance_opts(ptrs); 105 103 if (!opts) 106 104 return 0; 107 105 108 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 109 106 const union bch_extent_entry *entry; 110 107 struct extent_ptr_decoded p; 111 108 u64 sectors = 0; ··· 346 341 memset(data_opts, 0, sizeof(*data_opts)); 347 342 data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, io_opts, k); 348 343 data_opts->target = io_opts->background_target; 349 - data_opts->write_flags |= BCH_WRITE_ONLY_SPECIFIED_DEVS; 344 + data_opts->write_flags |= BCH_WRITE_only_specified_devs; 350 345 351 346 if (!data_opts->rewrite_ptrs) { 352 347 /* ··· 454 449 { 455 450 data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, io_opts, k); 456 451 data_opts->target = io_opts->background_target; 457 - data_opts->write_flags |= BCH_WRITE_ONLY_SPECIFIED_DEVS; 452 + data_opts->write_flags |= BCH_WRITE_only_specified_devs; 458 453 return data_opts->rewrite_ptrs != 0; 459 454 } 460 455 ··· 595 590 596 591 void 
bch2_rebalance_status_to_text(struct printbuf *out, struct bch_fs *c) 597 592 { 593 + printbuf_tabstop_push(out, 32); 594 + 598 595 struct bch_fs_rebalance *r = &c->rebalance; 596 + 597 + /* print pending work */ 598 + struct disk_accounting_pos acc = { .type = BCH_DISK_ACCOUNTING_rebalance_work, }; 599 + u64 v; 600 + bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1); 601 + 602 + prt_printf(out, "pending work:\t"); 603 + prt_human_readable_u64(out, v); 604 + prt_printf(out, "\n\n"); 599 605 600 606 prt_str(out, bch2_rebalance_state_strs[r->state]); 601 607 prt_newline(out); ··· 616 600 case BCH_REBALANCE_waiting: { 617 601 u64 now = atomic64_read(&c->io_clock[WRITE].now); 618 602 619 - prt_str(out, "io wait duration: "); 603 + prt_printf(out, "io wait duration:\t"); 620 604 bch2_prt_human_readable_s64(out, (r->wait_iotime_end - r->wait_iotime_start) << 9); 621 605 prt_newline(out); 622 606 623 - prt_str(out, "io wait remaining: "); 607 + prt_printf(out, "io wait remaining:\t"); 624 608 bch2_prt_human_readable_s64(out, (r->wait_iotime_end - now) << 9); 625 609 prt_newline(out); 626 610 627 - prt_str(out, "duration waited: "); 611 + prt_printf(out, "duration waited:\t"); 628 612 bch2_pr_time_units(out, ktime_get_real_ns() - r->wait_wallclock_start); 629 613 prt_newline(out); 630 614 break; ··· 637 621 break; 638 622 } 639 623 prt_newline(out); 624 + 625 + rcu_read_lock(); 626 + struct task_struct *t = rcu_dereference(c->rebalance.thread); 627 + if (t) 628 + get_task_struct(t); 629 + rcu_read_unlock(); 630 + 631 + if (t) { 632 + bch2_prt_task_backtrace(out, t, 0, GFP_KERNEL); 633 + put_task_struct(t); 634 + } 635 + 640 636 printbuf_indent_sub(out, 2); 641 637 } 642 638
+2 -2
fs/bcachefs/recovery.c
···
 #include "disk_accounting.h"
 #include "errcode.h"
 #include "error.h"
-#include "fs-common.h"
 #include "journal_io.h"
 #include "journal_reclaim.h"
 #include "journal_seq_blacklist.h"
 #include "logged_ops.h"
 #include "move.h"
+#include "namei.h"
 #include "quota.h"
 #include "rebalance.h"
 #include "recovery.h"
···
 	 * journal sequence numbers:
 	 */
 	if (!c->sb.clean)
-		journal_seq += 8;
+		journal_seq += JOURNAL_BUF_NR * 4;

 	if (blacklist_seq != journal_seq) {
 		ret = bch2_journal_log_msg(c, "blacklisting entries %llu-%llu",
+1 -1
fs/bcachefs/recovery_passes_types.h
···
 	x(check_topology, 4, 0) \
 	x(accounting_read, 39, PASS_ALWAYS) \
 	x(alloc_read, 0, PASS_ALWAYS) \
-	x(stripes_read, 1, PASS_ALWAYS) \
+	x(stripes_read, 1, 0) \
 	x(initialize_subvolumes, 2, 0) \
 	x(snapshots_read, 3, PASS_ALWAYS) \
 	x(check_allocations, 5, PASS_FSCK) \
+16 -7
fs/bcachefs/reflink.c
···
 	BUG_ON(missing_start < refd_start);
 	BUG_ON(missing_end > refd_end);

-	if (fsck_err(trans, reflink_p_to_missing_reflink_v,
-		     "pointer to missing indirect extent\n"
-		     " %s\n"
-		     " missing range %llu-%llu",
-		     (bch2_bkey_val_to_text(&buf, c, p.s_c), buf.buf),
-		     missing_start, missing_end)) {
+	struct bpos missing_pos = bkey_start_pos(p.k);
+	missing_pos.offset += missing_start - live_start;
+
+	prt_printf(&buf, "pointer to missing indirect extent in ");
+	ret = bch2_inum_snap_offset_err_msg_trans(trans, &buf, missing_pos);
+	if (ret)
+		goto err;
+
+	prt_printf(&buf, "-%llu\n ", (missing_pos.offset + (missing_end - missing_start)) << 9);
+	bch2_bkey_val_to_text(&buf, c, p.s_c);
+
+	prt_printf(&buf, "\n missing reflink btree range %llu-%llu",
+		   missing_start, missing_end);
+
+	if (fsck_err(trans, reflink_p_to_missing_reflink_v, "%s", buf.buf)) {
 		struct bkey_i_reflink_p *new = bch2_bkey_make_mut_noupdate_typed(trans, p.s_c, reflink_p);
 		ret = PTR_ERR_OR_ZERO(new);
 		if (ret)
···
 	u64 dst_done = 0;
 	u32 dst_snapshot, src_snapshot;
 	bool reflink_p_may_update_opts_field =
-		bch2_request_incompat_feature(c, bcachefs_metadata_version_reflink_p_may_update_opts);
+		!bch2_request_incompat_feature(c, bcachefs_metadata_version_reflink_p_may_update_opts);
 	int ret = 0, ret2 = 0;

 	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_reflink))
+69 -21
fs/bcachefs/sb-counters.c
··· 5 5 6 6 /* BCH_SB_FIELD_counters */ 7 7 8 - static const char * const bch2_counter_names[] = { 8 + static const u8 counters_to_stable_map[] = { 9 + #define x(n, id, ...) [BCH_COUNTER_##n] = BCH_COUNTER_STABLE_##n, 10 + BCH_PERSISTENT_COUNTERS() 11 + #undef x 12 + }; 13 + 14 + const char * const bch2_counter_names[] = { 9 15 #define x(t, n, ...) (#t), 10 16 BCH_PERSISTENT_COUNTERS() 11 17 #undef x ··· 24 18 return 0; 25 19 26 20 return (__le64 *) vstruct_end(&ctrs->field) - &ctrs->d[0]; 27 - }; 21 + } 28 22 29 23 static int bch2_sb_counters_validate(struct bch_sb *sb, struct bch_sb_field *f, 30 24 enum bch_validate_flags flags, struct printbuf *err) 31 25 { 32 26 return 0; 33 - }; 27 + } 34 28 35 29 static void bch2_sb_counters_to_text(struct printbuf *out, struct bch_sb *sb, 36 30 struct bch_sb_field *f) ··· 38 32 struct bch_sb_field_counters *ctrs = field_to_type(f, counters); 39 33 unsigned int nr = bch2_sb_counter_nr_entries(ctrs); 40 34 41 - for (unsigned i = 0; i < nr; i++) 42 - prt_printf(out, "%s \t%llu\n", 43 - i < BCH_COUNTER_NR ? 
bch2_counter_names[i] : "(unknown)", 44 - le64_to_cpu(ctrs->d[i])); 45 - }; 35 + for (unsigned i = 0; i < BCH_COUNTER_NR; i++) { 36 + unsigned stable = counters_to_stable_map[i]; 37 + if (stable < nr) 38 + prt_printf(out, "%s \t%llu\n", 39 + bch2_counter_names[i], 40 + le64_to_cpu(ctrs->d[stable])); 41 + } 42 + } 46 43 47 44 int bch2_sb_counters_to_cpu(struct bch_fs *c) 48 45 { 49 46 struct bch_sb_field_counters *ctrs = bch2_sb_field_get(c->disk_sb.sb, counters); 50 - unsigned int i; 51 47 unsigned int nr = bch2_sb_counter_nr_entries(ctrs); 52 - u64 val = 0; 53 48 54 - for (i = 0; i < BCH_COUNTER_NR; i++) 49 + for (unsigned i = 0; i < BCH_COUNTER_NR; i++) 55 50 c->counters_on_mount[i] = 0; 56 51 57 - for (i = 0; i < min_t(unsigned int, nr, BCH_COUNTER_NR); i++) { 58 - val = le64_to_cpu(ctrs->d[i]); 59 - percpu_u64_set(&c->counters[i], val); 60 - c->counters_on_mount[i] = val; 52 + for (unsigned i = 0; i < BCH_COUNTER_NR; i++) { 53 + unsigned stable = counters_to_stable_map[i]; 54 + if (stable < nr) { 55 + u64 v = le64_to_cpu(ctrs->d[stable]); 56 + percpu_u64_set(&c->counters[i], v); 57 + c->counters_on_mount[i] = v; 58 + } 61 59 } 60 + 62 61 return 0; 63 - }; 62 + } 64 63 65 64 int bch2_sb_counters_from_cpu(struct bch_fs *c) 66 65 { 67 66 struct bch_sb_field_counters *ctrs = bch2_sb_field_get(c->disk_sb.sb, counters); 68 67 struct bch_sb_field_counters *ret; 69 - unsigned int i; 70 68 unsigned int nr = bch2_sb_counter_nr_entries(ctrs); 71 69 72 70 if (nr < BCH_COUNTER_NR) { 73 71 ret = bch2_sb_field_resize(&c->disk_sb, counters, 74 - sizeof(*ctrs) / sizeof(u64) + BCH_COUNTER_NR); 75 - 72 + sizeof(*ctrs) / sizeof(u64) + BCH_COUNTER_NR); 76 73 if (ret) { 77 74 ctrs = ret; 78 75 nr = bch2_sb_counter_nr_entries(ctrs); 79 76 } 80 77 } 81 78 79 + for (unsigned i = 0; i < BCH_COUNTER_NR; i++) { 80 + unsigned stable = counters_to_stable_map[i]; 81 + if (stable < nr) 82 + ctrs->d[stable] = cpu_to_le64(percpu_u64_get(&c->counters[i])); 83 + } 82 84 83 - for (i = 0; i < 
min_t(unsigned int, nr, BCH_COUNTER_NR); i++) 84 - ctrs->d[i] = cpu_to_le64(percpu_u64_get(&c->counters[i])); 85 85 return 0; 86 86 } 87 87 ··· 109 97 .validate = bch2_sb_counters_validate, 110 98 .to_text = bch2_sb_counters_to_text, 111 99 }; 100 + 101 + #ifndef NO_BCACHEFS_CHARDEV 102 + long bch2_ioctl_query_counters(struct bch_fs *c, 103 + struct bch_ioctl_query_counters __user *user_arg) 104 + { 105 + struct bch_ioctl_query_counters arg; 106 + int ret = copy_from_user_errcode(&arg, user_arg, sizeof(arg)); 107 + if (ret) 108 + return ret; 109 + 110 + if ((arg.flags & ~BCH_IOCTL_QUERY_COUNTERS_MOUNT) || 111 + arg.pad) 112 + return -EINVAL; 113 + 114 + arg.nr = min(arg.nr, BCH_COUNTER_NR); 115 + ret = put_user(arg.nr, &user_arg->nr); 116 + if (ret) 117 + return ret; 118 + 119 + for (unsigned i = 0; i < BCH_COUNTER_NR; i++) { 120 + unsigned stable = counters_to_stable_map[i]; 121 + 122 + if (stable < arg.nr) { 123 + u64 v = !(arg.flags & BCH_IOCTL_QUERY_COUNTERS_MOUNT) 124 + ? percpu_u64_get(&c->counters[i]) 125 + : c->counters_on_mount[i]; 126 + 127 + ret = put_user(v, &user_arg->d[stable]); 128 + if (ret) 129 + return ret; 130 + } 131 + } 132 + 133 + return 0; 134 + } 135 + #endif
+4
fs/bcachefs/sb-counters.h
···
 void bch2_fs_counters_exit(struct bch_fs *);
 int bch2_fs_counters_init(struct bch_fs *);

+extern const char * const bch2_counter_names[];
 extern const struct bch_sb_field_ops bch_sb_field_ops_counters;
+
+long bch2_ioctl_query_counters(struct bch_fs *,
+			struct bch_ioctl_query_counters __user *);

 #endif // _BCACHEFS_SB_COUNTERS_H
+21 -10
fs/bcachefs/sb-counters_format.h
···

 #define BCH_PERSISTENT_COUNTERS() \
 	x(io_read, 0, TYPE_SECTORS) \
+	x(io_read_inline, 80, TYPE_SECTORS) \
+	x(io_read_hole, 81, TYPE_SECTORS) \
+	x(io_read_promote, 30, TYPE_COUNTER) \
+	x(io_read_bounce, 31, TYPE_COUNTER) \
+	x(io_read_split, 33, TYPE_COUNTER) \
+	x(io_read_reuse_race, 34, TYPE_COUNTER) \
+	x(io_read_retry, 32, TYPE_COUNTER) \
 	x(io_write, 1, TYPE_SECTORS) \
 	x(io_move, 2, TYPE_SECTORS) \
+	x(io_move_read, 35, TYPE_SECTORS) \
+	x(io_move_write, 36, TYPE_SECTORS) \
+	x(io_move_finish, 37, TYPE_SECTORS) \
+	x(io_move_fail, 38, TYPE_COUNTER) \
+	x(io_move_write_fail, 82, TYPE_COUNTER) \
+	x(io_move_start_fail, 39, TYPE_COUNTER) \
 	x(bucket_invalidate, 3, TYPE_COUNTER) \
 	x(bucket_discard, 4, TYPE_COUNTER) \
+	x(bucket_discard_fast, 79, TYPE_COUNTER) \
 	x(bucket_alloc, 5, TYPE_COUNTER) \
 	x(bucket_alloc_fail, 6, TYPE_COUNTER) \
 	x(btree_cache_scan, 7, TYPE_COUNTER) \
···
 	x(journal_reclaim_finish, 27, TYPE_COUNTER) \
 	x(journal_reclaim_start, 28, TYPE_COUNTER) \
 	x(journal_write, 29, TYPE_COUNTER) \
-	x(read_promote, 30, TYPE_COUNTER) \
-	x(read_bounce, 31, TYPE_COUNTER) \
-	x(read_split, 33, TYPE_COUNTER) \
-	x(read_retry, 32, TYPE_COUNTER) \
-	x(read_reuse_race, 34, TYPE_COUNTER) \
-	x(move_extent_read, 35, TYPE_SECTORS) \
-	x(move_extent_write, 36, TYPE_SECTORS) \
-	x(move_extent_finish, 37, TYPE_SECTORS) \
-	x(move_extent_fail, 38, TYPE_COUNTER) \
-	x(move_extent_start_fail, 39, TYPE_COUNTER) \
 	x(copygc, 40, TYPE_COUNTER) \
 	x(copygc_wait, 41, TYPE_COUNTER) \
 	x(gc_gens_end, 42, TYPE_COUNTER) \
···
 	BCH_PERSISTENT_COUNTERS()
 #undef x
 	BCH_COUNTER_NR
+};
+
+enum bch_persistent_counters_stable {
+#define x(t, n, ...) BCH_COUNTER_STABLE_##t = n,
+	BCH_PERSISTENT_COUNTERS()
+#undef x
+	BCH_COUNTER_STABLE_NR
 };

 struct bch_sb_field_counters {
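Note on the hunks above: the counter list is now reordered and renamed freely, while the numeric column is kept as a separate "stable" index (`BCH_COUNTER_STABLE_*`), and sb-counters.c translates between the two through `counters_to_stable_map[]`. The following is a minimal userspace sketch of that x-macro pattern, not the kernel code. The three-entry list and every `demo_`/`DEMO_` name are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter list: x(name, stable_id). Declaration order may
 * change between versions; the second column is the slot used in the
 * stored array and must not change. */
#define DEMO_COUNTERS()		\
	x(io_read,	 0)	\
	x(io_read_retry, 3)	\
	x(io_write,	 1)

/* In-memory index: follows declaration order. */
enum demo_counter {
#define x(t, n) DEMO_COUNTER_##t,
	DEMO_COUNTERS()
#undef x
	DEMO_COUNTER_NR
};

/* Stable index: taken from the explicit id column. */
enum demo_counter_stable {
#define x(t, n) DEMO_COUNTER_STABLE_##t = n,
	DEMO_COUNTERS()
#undef x
};

/* Lookup table from in-memory index to stable slot. */
static const uint8_t demo_to_stable[] = {
#define x(t, n) [DEMO_COUNTER_##t] = DEMO_COUNTER_STABLE_##t,
	DEMO_COUNTERS()
#undef x
};

/* Load counters from a stored array holding nr slots. A counter whose
 * stable slot lies beyond nr is treated as absent and reads as zero. */
static void demo_counters_load(uint64_t *mem, const uint64_t *stored, size_t nr)
{
	for (size_t i = 0; i < DEMO_COUNTER_NR; i++) {
		size_t stable = demo_to_stable[i];

		mem[i] = stable < nr ? stored[stable] : 0;
	}
}
```

With this layout, moving `io_read_retry` up or down in the list changes only its in-memory index; the stored slot stays 3, which appears to be the property the diff relies on when it regroups the `io_read_*` and `io_move_*` counters.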
+7 -1
fs/bcachefs/sb-downgrade.c
···
 	  BIT_ULL(BCH_RECOVERY_PASS_check_allocations), \
 	  BCH_FSCK_ERR_accounting_mismatch, \
 	  BCH_FSCK_ERR_accounting_key_replicas_nr_devs_0, \
-	  BCH_FSCK_ERR_accounting_key_junk_at_end)
+	  BCH_FSCK_ERR_accounting_key_junk_at_end) \
+	x(cached_backpointers, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
+	  BCH_FSCK_ERR_ptr_to_missing_backpointer) \
+	x(stripe_backpointers, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
+	  BCH_FSCK_ERR_ptr_to_missing_backpointer)

 #define DOWNGRADE_TABLE() \
 	x(bucket_stripe_sectors, \
+4 -1
fs/bcachefs/sb-errors_format.h
···
 	x(ptr_crc_redundant, 160, 0) \
 	x(ptr_crc_nonce_mismatch, 162, 0) \
 	x(ptr_stripe_redundant, 163, 0) \
+	x(extent_flags_not_at_start, 306, 0) \
 	x(reservation_key_nr_replicas_invalid, 164, 0) \
 	x(reflink_v_refcount_wrong, 165, FSCK_AUTOFIX) \
 	x(reflink_v_pos_bad, 292, 0) \
···
 	x(compression_opt_not_marked_in_sb, 295, FSCK_AUTOFIX) \
 	x(compression_type_not_marked_in_sb, 296, FSCK_AUTOFIX) \
 	x(directory_size_mismatch, 303, FSCK_AUTOFIX) \
-	x(MAX, 304, 0)
+	x(dirent_cf_name_too_big, 304, 0) \
+	x(dirent_stray_data_after_cf_name, 305, 0) \
+	x(MAX, 307, 0)

 enum bch_sb_error_id {
 #define x(t, n, ...) BCH_FSCK_ERR_##t = n,
+15 -1
fs/bcachefs/sb-members.h
···
 	return !percpu_ref_is_zero(&ca->io_ref);
 }

-static inline bool bch2_dev_is_readable(struct bch_dev *ca)
+static inline struct bch_dev *bch2_dev_rcu(struct bch_fs *, unsigned);
+
+static inline bool bch2_dev_idx_is_online(struct bch_fs *c, unsigned dev)
+{
+	rcu_read_lock();
+	struct bch_dev *ca = bch2_dev_rcu(c, dev);
+	bool ret = ca && bch2_dev_is_online(ca);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool bch2_dev_is_healthy(struct bch_dev *ca)
 {
 	return bch2_dev_is_online(ca) &&
 		ca->mi.state != BCH_MEMBER_STATE_failed;
···

 static inline struct bch_dev *bch2_dev_get_ioref(struct bch_fs *c, unsigned dev, int rw)
 {
+	might_sleep();
+
 	rcu_read_lock();
 	struct bch_dev *ca = bch2_dev_rcu(c, dev);
 	if (ca && !percpu_ref_tryget(&ca->io_ref))
+1
fs/bcachefs/sb-members_format.h
···

 #define BCH_MEMBER_V1_BYTES 56

+LE16_BITMASK(BCH_MEMBER_BUCKET_SIZE, struct bch_member, bucket_size, 0, 16)
 LE64_BITMASK(BCH_MEMBER_STATE, struct bch_member, flags, 0, 4)
 /* 4-14 unused, was TIER, HAS_(META)DATA, REPLACEMENT */
 LE64_BITMASK(BCH_MEMBER_DISCARD, struct bch_member, flags, 14, 15)
+4 -3
fs/bcachefs/snapshot.c
···
 		goto out;
 	}

-	while (id && id < ancestor - IS_ANCESTOR_BITMAP)
-		id = get_ancestor_below(t, id, ancestor);
+	if (likely(ancestor >= IS_ANCESTOR_BITMAP))
+		while (id && id < ancestor - IS_ANCESTOR_BITMAP)
+			id = get_ancestor_below(t, id, ancestor);

 	ret = id && id < ancestor
 		? test_ancestor_bitmap(t, id, ancestor)
···
 	return 0;
 }

-static u32 bch2_snapshot_tree_oldest_subvol(struct bch_fs *c, u32 snapshot_root)
+u32 bch2_snapshot_tree_oldest_subvol(struct bch_fs *c, u32 snapshot_root)
 {
 	u32 id = snapshot_root;
 	u32 subvol = 0, s;
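Note on the first hunk above: `id` and `ancestor` are unsigned 32-bit values, so when `ancestor` is smaller than `IS_ANCESTOR_BITMAP` the expression `ancestor - IS_ANCESTOR_BITMAP` wraps around to a very large number and the loop condition becomes true for almost any nonzero `id`. The new `ancestor >= IS_ANCESTOR_BITMAP` check presumably exists to rule that case out. A small sketch of the arithmetic follows. The value 128 and the `demo_` names are assumptions for illustration, not taken from this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BITMAP_BITS 128u	/* assumed value, for illustration only */

/* Loop condition as written before the change: the subtraction wraps
 * when ancestor < DEMO_BITMAP_BITS. */
static bool demo_should_walk_unguarded(uint32_t id, uint32_t ancestor)
{
	return id && id < ancestor - DEMO_BITMAP_BITS;
}

/* Loop condition with the added guard: never true for small ancestors. */
static bool demo_should_walk_guarded(uint32_t id, uint32_t ancestor)
{
	return ancestor >= DEMO_BITMAP_BITS &&
	       id && id < ancestor - DEMO_BITMAP_BITS;
}
```

For `ancestor` values at or above the bitmap width the two conditions agree; they differ only in the wraparound range.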
+1
fs/bcachefs/snapshot.h
···
 	return id;
 }

+u32 bch2_snapshot_tree_oldest_subvol(struct bch_fs *, u32);
 u32 bch2_snapshot_skiplist_get(struct bch_fs *, u32);

 static inline u32 bch2_snapshot_root(struct bch_fs *c, u32 id)
+1 -1
fs/bcachefs/str_hash.c
···
 	for (unsigned i = 0; i < 1000; i++) {
 		unsigned len = sprintf(new->v.d_name, "%.*s.fsck_renamed-%u",
 				       old_name.len, old_name.name, i);
-		unsigned u64s = BKEY_U64s + dirent_val_u64s(len);
+		unsigned u64s = BKEY_U64s + dirent_val_u64s(len, 0);

 		if (u64s > U8_MAX)
 			return -EINVAL;
+6 -6
fs/bcachefs/str_hash.h
···
 #include "super.h"

 #include <linux/crc32c.h>
-#include <crypto/hash.h>
 #include <crypto/sha2.h>

 static inline enum bch_str_hash_type
···

 struct bch_hash_info {
 	u8 type;
+	struct unicode_map *cf_encoding;
 	/*
 	 * For crc32 or crc64 string hashes the first key value of
 	 * the siphash_key (k0) is used as the key.
···
 	/* XXX ick */
 	struct bch_hash_info info = {
 		.type = INODE_STR_HASH(bi),
+#ifdef CONFIG_UNICODE
+		.cf_encoding = !!(bi->bi_flags & BCH_INODE_casefolded) ? c->cf_encoding : NULL,
+#endif
 		.siphash_key = { .k0 = bi->bi_hash_seed }
 	};

 	if (unlikely(info.type == BCH_STR_HASH_siphash_old)) {
-		SHASH_DESC_ON_STACK(desc, c->sha256);
 		u8 digest[SHA256_DIGEST_SIZE];

-		desc->tfm = c->sha256;
-
-		crypto_shash_digest(desc, (void *) &bi->bi_hash_seed,
-				    sizeof(bi->bi_hash_seed), digest);
+		sha256((const u8 *)&bi->bi_hash_seed,
+		       sizeof(bi->bi_hash_seed), digest);
 		memcpy(&info.siphash_key, digest, sizeof(info.siphash_key));
 	}

+53 -39
fs/bcachefs/super-io.c
··· 25 25 #include <linux/sort.h> 26 26 #include <linux/string_choices.h> 27 27 28 - static const struct blk_holder_ops bch2_sb_handle_bdev_ops = { 29 - }; 30 - 31 28 struct bch2_metadata_version { 32 29 u16 version; 33 30 const char *name; ··· 66 69 return v; 67 70 } 68 71 69 - bool bch2_set_version_incompat(struct bch_fs *c, enum bcachefs_metadata_version version) 72 + int bch2_set_version_incompat(struct bch_fs *c, enum bcachefs_metadata_version version) 70 73 { 71 - bool ret = (c->sb.features & BIT_ULL(BCH_FEATURE_incompat_version_field)) && 72 - version <= c->sb.version_incompat_allowed; 74 + int ret = ((c->sb.features & BIT_ULL(BCH_FEATURE_incompat_version_field)) && 75 + version <= c->sb.version_incompat_allowed) 76 + ? 0 77 + : -BCH_ERR_may_not_use_incompat_feature; 73 78 74 - if (ret) { 79 + if (!ret) { 75 80 mutex_lock(&c->sb_lock); 76 81 SET_BCH_SB_VERSION_INCOMPAT(c->disk_sb.sb, 77 82 max(BCH_SB_VERSION_INCOMPAT(c->disk_sb.sb), version)); ··· 365 366 return 0; 366 367 } 367 368 368 - static int bch2_sb_validate(struct bch_sb_handle *disk_sb, 369 - enum bch_validate_flags flags, struct printbuf *out) 369 + int bch2_sb_validate(struct bch_sb *sb, u64 read_offset, 370 + enum bch_validate_flags flags, struct printbuf *out) 370 371 { 371 - struct bch_sb *sb = disk_sb->sb; 372 372 struct bch_sb_field_members_v1 *mi; 373 373 enum bch_opt_id opt_id; 374 - u16 block_size; 375 374 int ret; 376 375 377 376 ret = bch2_sb_compatible(sb, out); 378 377 if (ret) 379 378 return ret; 380 379 381 - if (sb->features[1] || 382 - (le64_to_cpu(sb->features[0]) & (~0ULL << BCH_FEATURE_NR))) { 383 - prt_printf(out, "Filesystem has incompatible features"); 380 + u64 incompat = le64_to_cpu(sb->features[0]) & (~0ULL << BCH_FEATURE_NR); 381 + unsigned incompat_bit = 0; 382 + if (incompat) 383 + incompat_bit = __ffs64(incompat); 384 + else if (sb->features[1]) 385 + incompat_bit = 64 + __ffs64(le64_to_cpu(sb->features[1])); 386 + 387 + if (incompat_bit) { 388 + prt_printf(out, 
"Filesystem has incompatible feature bit %u, highest supported %s (%u)", 389 + incompat_bit, 390 + bch2_sb_features[BCH_FEATURE_NR - 1], 391 + BCH_FEATURE_NR - 1); 384 392 return -BCH_ERR_invalid_sb_features; 385 393 } 386 394 387 395 if (BCH_VERSION_MAJOR(le16_to_cpu(sb->version)) > BCH_VERSION_MAJOR(bcachefs_metadata_version_current) || 388 396 BCH_SB_VERSION_INCOMPAT(sb) > bcachefs_metadata_version_current) { 389 - prt_printf(out, "Filesystem has incompatible version"); 397 + prt_str(out, "Filesystem has incompatible version "); 398 + bch2_version_to_text(out, le16_to_cpu(sb->version)); 399 + prt_str(out, ", current version "); 400 + bch2_version_to_text(out, bcachefs_metadata_version_current); 390 401 return -BCH_ERR_invalid_sb_features; 391 - } 392 - 393 - block_size = le16_to_cpu(sb->block_size); 394 - 395 - if (block_size > PAGE_SECTORS) { 396 - prt_printf(out, "Block size too big (got %u, max %u)", 397 - block_size, PAGE_SECTORS); 398 - return -BCH_ERR_invalid_sb_block_size; 399 402 } 400 403 401 404 if (bch2_is_zero(sb->user_uuid.b, sizeof(sb->user_uuid))) { ··· 408 407 if (bch2_is_zero(sb->uuid.b, sizeof(sb->uuid))) { 409 408 prt_printf(out, "Bad internal UUID (got zeroes)"); 410 409 return -BCH_ERR_invalid_sb_uuid; 410 + } 411 + 412 + if (!(flags & BCH_VALIDATE_write) && 413 + le64_to_cpu(sb->offset) != read_offset) { 414 + prt_printf(out, "Bad sb offset (got %llu, read from %llu)", 415 + le64_to_cpu(sb->offset), read_offset); 416 + return -BCH_ERR_invalid_sb_offset; 411 417 } 412 418 413 419 if (!sb->nr_devices || ··· 472 464 473 465 if (le16_to_cpu(sb->version) <= bcachefs_metadata_version_disk_accounting_v2) 474 466 SET_BCH_SB_PROMOTE_WHOLE_EXTENTS(sb, true); 467 + 468 + if (!BCH_SB_WRITE_ERROR_TIMEOUT(sb)) 469 + SET_BCH_SB_WRITE_ERROR_TIMEOUT(sb, 30); 470 + 471 + if (le16_to_cpu(sb->version) <= bcachefs_metadata_version_extent_flags && 472 + !BCH_SB_CSUM_ERR_RETRY_NR(sb)) 473 + SET_BCH_SB_CSUM_ERR_RETRY_NR(sb, 3); 475 474 } 476 475 477 476 #ifdef 
__KERNEL__ ··· 489 474 for (opt_id = 0; opt_id < bch2_opts_nr; opt_id++) { 490 475 const struct bch_option *opt = bch2_opt_table + opt_id; 491 476 492 - if (opt->get_sb != BCH2_NO_SB_OPT) { 493 - u64 v = bch2_opt_from_sb(sb, opt_id); 477 + if (opt->get_sb) { 478 + u64 v = bch2_opt_from_sb(sb, opt_id, -1); 494 479 495 480 prt_printf(out, "Invalid option "); 496 481 ret = bch2_opt_validate(opt, v, out); ··· 770 755 memset(sb, 0, sizeof(*sb)); 771 756 sb->mode = BLK_OPEN_READ; 772 757 sb->have_bio = true; 773 - sb->holder = kmalloc(1, GFP_KERNEL); 758 + sb->holder = kzalloc(sizeof(*sb->holder), GFP_KERNEL); 774 759 if (!sb->holder) 775 760 return -ENOMEM; 776 761 ··· 896 881 897 882 sb->have_layout = true; 898 883 899 - ret = bch2_sb_validate(sb, 0, &err); 884 + ret = bch2_sb_validate(sb->sb, offset, 0, &err); 900 885 if (ret) { 901 886 bch2_print_opts(opts, KERN_ERR "bcachefs (%s): error validating superblock: %s\n", 902 887 path, err.buf); ··· 933 918 { 934 919 struct bch_dev *ca = bio->bi_private; 935 920 921 + bch2_account_io_success_fail(ca, bio_data_dir(bio), !bio->bi_status); 922 + 936 923 /* XXX: return errors directly */ 937 924 938 - if (bch2_dev_io_err_on(bio->bi_status, ca, 939 - bio_data_dir(bio) 940 - ? 
BCH_MEMBER_ERROR_write 941 - : BCH_MEMBER_ERROR_read, 942 - "superblock %s error: %s", 925 + if (bio->bi_status) { 926 + bch_err_dev_ratelimited(ca, "superblock %s error: %s", 943 927 str_write_read(bio_data_dir(bio)), 944 - bch2_blk_status_to_str(bio->bi_status))) 928 + bch2_blk_status_to_str(bio->bi_status)); 945 929 ca->sb_write_error = 1; 930 + } 946 931 947 932 closure_put(&ca->fs->sb_write); 948 933 percpu_ref_put(&ca->io_ref); ··· 1053 1038 darray_for_each(online_devices, ca) { 1054 1039 printbuf_reset(&err); 1055 1040 1056 - ret = bch2_sb_validate(&(*ca)->disk_sb, BCH_VALIDATE_write, &err); 1041 + ret = bch2_sb_validate((*ca)->disk_sb.sb, 0, BCH_VALIDATE_write, &err); 1057 1042 if (ret) { 1058 1043 bch2_fs_inconsistent(c, "sb invalid before write: %s", err.buf); 1059 1044 goto out; ··· 1181 1166 !can_mount_with_written), c, 1182 1167 ": Unable to write superblock to sufficient devices (from %ps)", 1183 1168 (void *) _RET_IP_)) 1184 - ret = -1; 1169 + ret = -BCH_ERR_erofs_sb_err; 1185 1170 out: 1186 1171 /* Make new options visible after they're persistent: */ 1187 1172 bch2_sb_update(c); ··· 1238 1223 bch2_sb_field_resize(&c->disk_sb, downgrade, 0); 1239 1224 1240 1225 c->disk_sb.sb->version = cpu_to_le16(new_version); 1241 - c->disk_sb.sb->features[0] |= cpu_to_le64(BCH_SB_FEATURES_ALL); 1242 1226 1243 1227 if (incompat) { 1228 + c->disk_sb.sb->features[0] |= cpu_to_le64(BCH_SB_FEATURES_ALL); 1244 1229 SET_BCH_SB_VERSION_INCOMPAT_ALLOWED(c->disk_sb.sb, 1245 1230 max(BCH_SB_VERSION_INCOMPAT_ALLOWED(c->disk_sb.sb), new_version)); 1246 - c->disk_sb.sb->features[0] |= cpu_to_le64(BCH_FEATURE_incompat_version_field); 1247 1231 } 1248 1232 } 1249 1233 ··· 1473 1459 for (id = 0; id < bch2_opts_nr; id++) { 1474 1460 const struct bch_option *opt = bch2_opt_table + id; 1475 1461 1476 - if (opt->get_sb != BCH2_NO_SB_OPT) { 1477 - u64 v = bch2_opt_from_sb(sb, id); 1462 + if (opt->get_sb) { 1463 + u64 v = bch2_opt_from_sb(sb, id, -1); 1478 1464 1479 1465 
prt_printf(out, "%s:\t", opt->attr.name); 1480 1466 bch2_opt_to_text(out, NULL, sb, opt, v,
+6 -4
fs/bcachefs/super-io.h
···
 void bch2_version_to_text(struct printbuf *, enum bcachefs_metadata_version);
 enum bcachefs_metadata_version bch2_latest_compatible_version(enum bcachefs_metadata_version);

-bool bch2_set_version_incompat(struct bch_fs *, enum bcachefs_metadata_version);
+int bch2_set_version_incompat(struct bch_fs *, enum bcachefs_metadata_version);

-static inline bool bch2_request_incompat_feature(struct bch_fs *c,
-						 enum bcachefs_metadata_version version)
+static inline int bch2_request_incompat_feature(struct bch_fs *c,
+						enum bcachefs_metadata_version version)
 {
 	return likely(version <= c->sb.version_incompat)
-		? true
+		? 0
 		: bch2_set_version_incompat(c, version);
 }

···

 void bch2_free_super(struct bch_sb_handle *);
 int bch2_sb_realloc(struct bch_sb_handle *, unsigned);
+
+int bch2_sb_validate(struct bch_sb *, u64, enum bch_validate_flags, struct printbuf *);

 int bch2_read_super(const char *, struct bch_opts *, struct bch_sb_handle *);
 int bch2_read_super_silent(const char *, struct bch_opts *, struct bch_sb_handle *);
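Note on the first hunk above: `bch2_request_incompat_feature()` changes from returning `bool` (true meaning "may use") to returning `int` (0 meaning success, a negative error code otherwise). Any caller that stores the result as a yes/no flag therefore has to negate it, which matches the `!` added in the reflink.c hunk of this same commit. The sketch below illustrates the convention with a simplified stand-in; the struct, the error number and all `demo_` names are hypothetical, and the real function also checks a superblock feature bit and takes a lock.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_ERR_may_not_use_incompat_feature 1	/* made-up error number */

struct demo_fs {
	unsigned version_incompat;		/* highest version already in use */
	unsigned version_incompat_allowed;	/* highest version we may enable */
};

/* Returns 0 if the feature may be used, a negative error code otherwise. */
static int demo_request_incompat_feature(struct demo_fs *c, unsigned version)
{
	if (version <= c->version_incompat)
		return 0;
	if (version > c->version_incompat_allowed)
		return -DEMO_ERR_may_not_use_incompat_feature;

	/* Record that the feature is now in use. */
	c->version_incompat = version;
	return 0;
}

/* A caller that wants a yes/no answer has to negate the result. */
static bool demo_feature_enabled(struct demo_fs *c, unsigned version)
{
	return !demo_request_incompat_feature(c, version);
}
```

Returning an error code rather than a boolean lets callers propagate the specific failure instead of inventing one, at the cost of requiring every boolean-style caller to be audited for the inverted sense.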
+129 -12
fs/bcachefs/super.c
··· 75 75 MODULE_LICENSE("GPL"); 76 76 MODULE_AUTHOR("Kent Overstreet <kent.overstreet@gmail.com>"); 77 77 MODULE_DESCRIPTION("bcachefs filesystem"); 78 - MODULE_SOFTDEP("pre: crc32c"); 79 - MODULE_SOFTDEP("pre: crc64"); 80 - MODULE_SOFTDEP("pre: sha256"); 81 78 MODULE_SOFTDEP("pre: chacha20"); 82 79 MODULE_SOFTDEP("pre: poly1305"); 83 80 MODULE_SOFTDEP("pre: xxhash"); ··· 715 718 kobject_add(&c->time_stats, &c->kobj, "time_stats") ?: 716 719 #endif 717 720 kobject_add(&c->counters_kobj, &c->kobj, "counters") ?: 718 - bch2_opts_create_sysfs_files(&c->opts_dir); 721 + bch2_opts_create_sysfs_files(&c->opts_dir, OPT_FS); 719 722 if (ret) { 720 723 bch_err(c, "error creating sysfs objects"); 721 724 return ret; ··· 833 836 834 837 if (ret) 835 838 goto err; 839 + 840 + #ifdef CONFIG_UNICODE 841 + /* Default encoding until we can potentially have more as an option. */ 842 + c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING); 843 + if (IS_ERR(c->cf_encoding)) { 844 + printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u", 845 + unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING), 846 + unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING), 847 + unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING)); 848 + ret = -EINVAL; 849 + goto err; 850 + } 851 + #else 852 + if (c->sb.features & BIT_ULL(BCH_FEATURE_casefolding)) { 853 + printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n"); 854 + ret = -EINVAL; 855 + goto err; 856 + } 857 + #endif 836 858 837 859 pr_uuid(&name, c->sb.user_uuid.b); 838 860 ret = name.allocation_failure ? 
-BCH_ERR_ENOMEM_fs_name_alloc : 0; ··· 1072 1056 } 1073 1057 1074 1058 set_bit(BCH_FS_started, &c->flags); 1059 + wake_up(&c->ro_ref_wait); 1075 1060 1076 1061 if (c->opts.read_only) { 1077 1062 bch2_fs_read_only(c); ··· 1297 1280 return 0; 1298 1281 1299 1282 if (!ca->kobj.state_in_sysfs) { 1300 - ret = kobject_add(&ca->kobj, &c->kobj, 1301 - "dev-%u", ca->dev_idx); 1283 + ret = kobject_add(&ca->kobj, &c->kobj, "dev-%u", ca->dev_idx) ?: 1284 + bch2_opts_create_sysfs_files(&ca->kobj, OPT_DEVICE); 1302 1285 if (ret) 1303 1286 return ret; 1304 1287 } ··· 1428 1411 /* Commit: */ 1429 1412 ca->disk_sb = *sb; 1430 1413 memset(sb, 0, sizeof(*sb)); 1414 + 1415 + /* 1416 + * Stash pointer to the filesystem for blk_holder_ops - note that once 1417 + * attached to a filesystem, we will always close the block device 1418 + * before tearing down the filesystem object. 1419 + */ 1420 + ca->disk_sb.holder->c = ca->fs; 1431 1421 1432 1422 ca->dev = ca->disk_sb.bdev->bd_dev; 1433 1423 ··· 1990 1966 mutex_unlock(&c->sb_lock); 1991 1967 1992 1968 if (ca->mi.freespace_initialized) { 1993 - struct disk_accounting_pos acc = { 1994 - .type = BCH_DISK_ACCOUNTING_dev_data_type, 1995 - .dev_data_type.dev = ca->dev_idx, 1996 - .dev_data_type.data_type = BCH_DATA_free, 1997 - }; 1998 1969 u64 v[3] = { nbuckets - old_nbuckets, 0, 0 }; 1999 1970 2000 1971 ret = bch2_trans_commit_do(ca->fs, NULL, NULL, 0, 2001 - bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v), false)) ?: 1972 + bch2_disk_accounting_mod2(trans, false, v, dev_data_type, 1973 + .dev = ca->dev_idx, 1974 + .data_type = BCH_DATA_free)) ?: 2002 1975 bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets); 2003 1976 if (ret) 2004 1977 goto err; ··· 2018 1997 return ca; 2019 1998 return ERR_PTR(-BCH_ERR_ENOENT_dev_not_found); 2020 1999 } 2000 + 2001 + /* blk_holder_ops: */ 2002 + 2003 + static struct bch_fs *bdev_get_fs(struct block_device *bdev) 2004 + __releases(&bdev->bd_holder_lock) 2005 + { 2006 + struct bch_sb_handle_holder 
*holder = bdev->bd_holder; 2007 + struct bch_fs *c = holder->c; 2008 + 2009 + if (c && !bch2_ro_ref_tryget(c)) 2010 + c = NULL; 2011 + 2012 + mutex_unlock(&bdev->bd_holder_lock); 2013 + 2014 + if (c) 2015 + wait_event(c->ro_ref_wait, test_bit(BCH_FS_started, &c->flags)); 2016 + return c; 2017 + } 2018 + 2019 + /* returns with ref on ca->ref */ 2020 + static struct bch_dev *bdev_to_bch_dev(struct bch_fs *c, struct block_device *bdev) 2021 + { 2022 + for_each_member_device(c, ca) 2023 + if (ca->disk_sb.bdev == bdev) 2024 + return ca; 2025 + return NULL; 2026 + } 2027 + 2028 + static void bch2_fs_bdev_mark_dead(struct block_device *bdev, bool surprise) 2029 + { 2030 + struct bch_fs *c = bdev_get_fs(bdev); 2031 + if (!c) 2032 + return; 2033 + 2034 + struct super_block *sb = c->vfs_sb; 2035 + if (sb) { 2036 + /* 2037 + * Not necessary, c->ro_ref guards against the filesystem being 2038 + * unmounted - we only take this to avoid a warning in 2039 + * sync_filesystem: 2040 + */ 2041 + down_read(&sb->s_umount); 2042 + } 2043 + 2044 + down_write(&c->state_lock); 2045 + struct bch_dev *ca = bdev_to_bch_dev(c, bdev); 2046 + if (!ca) 2047 + goto unlock; 2048 + 2049 + if (bch2_dev_state_allowed(c, ca, BCH_MEMBER_STATE_failed, BCH_FORCE_IF_DEGRADED)) { 2050 + __bch2_dev_offline(c, ca); 2051 + } else { 2052 + if (sb) { 2053 + if (!surprise) 2054 + sync_filesystem(sb); 2055 + shrink_dcache_sb(sb); 2056 + evict_inodes(sb); 2057 + } 2058 + 2059 + bch2_journal_flush(&c->journal); 2060 + bch2_fs_emergency_read_only(c); 2061 + } 2062 + 2063 + bch2_dev_put(ca); 2064 + unlock: 2065 + if (sb) 2066 + up_read(&sb->s_umount); 2067 + up_write(&c->state_lock); 2068 + bch2_ro_ref_put(c); 2069 + } 2070 + 2071 + static void bch2_fs_bdev_sync(struct block_device *bdev) 2072 + { 2073 + struct bch_fs *c = bdev_get_fs(bdev); 2074 + if (!c) 2075 + return; 2076 + 2077 + struct super_block *sb = c->vfs_sb; 2078 + if (sb) { 2079 + /* 2080 + * Not necessary, c->ro_ref guards against the filesystem being 
2081 + * unmounted - we only take this to avoid a warning in 2082 + * sync_filesystem: 2083 + */ 2084 + down_read(&sb->s_umount); 2085 + sync_filesystem(sb); 2086 + up_read(&sb->s_umount); 2087 + } 2088 + 2089 + bch2_ro_ref_put(c); 2090 + } 2091 + 2092 + const struct blk_holder_ops bch2_sb_handle_bdev_ops = { 2093 + .mark_dead = bch2_fs_bdev_mark_dead, 2094 + .sync = bch2_fs_bdev_sync, 2095 + }; 2021 2096 2022 2097 /* Filesystem open: */ 2023 2098
+2
fs/bcachefs/super.h
···
 int bch2_fs_start(struct bch_fs *);
 struct bch_fs *bch2_fs_open(char * const *, unsigned, struct bch_opts);

+extern const struct blk_holder_ops bch2_sb_handle_bdev_ops;
+
 #endif /* _BCACHEFS_SUPER_H */
+7 -1
fs/bcachefs/super_types.h
···
 #ifndef _BCACHEFS_SUPER_TYPES_H
 #define _BCACHEFS_SUPER_TYPES_H

+struct bch_fs;
+
+struct bch_sb_handle_holder {
+	struct bch_fs *c;
+};
+
 struct bch_sb_handle {
 	struct bch_sb *sb;
 	struct file *s_bdev_file;
 	struct block_device *bdev;
 	char *sb_name;
 	struct bio *bio;
-	void *holder;
+	struct bch_sb_handle_holder *holder;
 	size_t buffer_size;
 	blk_mode_t mode;
 	unsigned have_layout:1;
+77 -64
fs/bcachefs/sysfs.c
···
 write_attribute(trigger_btree_cache_shrink);
 write_attribute(trigger_btree_key_cache_shrink);
 write_attribute(trigger_freelist_wakeup);
+write_attribute(trigger_btree_updates);
 read_attribute(gc_gens_pos);

 read_attribute(uuid);
 read_attribute(minor);
 read_attribute(flags);
-read_attribute(bucket_size);
 read_attribute(first_bucket);
 read_attribute(nbuckets);
-rw_attribute(durability);
 read_attribute(io_done);
 read_attribute(io_errors);
 write_attribute(io_errors_reset);
···
 read_attribute(btree_cache);
 read_attribute(btree_key_cache);
 read_attribute(btree_reserve_cache);
-read_attribute(stripes_heap);
 read_attribute(open_buckets);
 read_attribute(open_buckets_partial);
-read_attribute(write_points);
 read_attribute(nocow_lock_table);

 #ifdef BCH_WRITE_REF_DEBUG
···
 BCH_PERSISTENT_COUNTERS()
 #undef x

-rw_attribute(discard);
-read_attribute(state);
 rw_attribute(label);

 read_attribute(copy_gc_wait);
···
 	if (attr == &sysfs_btree_reserve_cache)
 		bch2_btree_reserve_cache_to_text(out, c);

-	if (attr == &sysfs_stripes_heap)
-		bch2_stripes_heap_to_text(out, c);
-
 	if (attr == &sysfs_open_buckets)
 		bch2_open_buckets_to_text(out, c, NULL);

 	if (attr == &sysfs_open_buckets_partial)
 		bch2_open_buckets_partial_to_text(out, c);
-
-	if (attr == &sysfs_write_points)
-		bch2_write_points_to_text(out, c);

 	if (attr == &sysfs_compression_stats)
 		bch2_compression_stats_to_text(out, c);
···
 		return -EPERM;

 	/* Debugging: */
+
+	if (attr == &sysfs_trigger_btree_updates)
+		queue_work(c->btree_interior_update_worker, &c->btree_interior_update_work);

 	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_sysfs))
 		return -EROFS;
···
 	&sysfs_btree_key_cache,
 	&sysfs_btree_reserve_cache,
 	&sysfs_new_stripes,
-	&sysfs_stripes_heap,
 	&sysfs_open_buckets,
 	&sysfs_open_buckets_partial,
-	&sysfs_write_points,
 #ifdef BCH_WRITE_REF_DEBUG
 	&sysfs_write_refs,
 #endif
···
 	&sysfs_trigger_btree_cache_shrink,
 	&sysfs_trigger_btree_key_cache_shrink,
 	&sysfs_trigger_freelist_wakeup,
+	&sysfs_trigger_btree_updates,

 	&sysfs_gc_gens_pos,

···

 /* options */

-SHOW(bch2_fs_opts_dir)
+static ssize_t sysfs_opt_show(struct bch_fs *c,
+			      struct bch_dev *ca,
+			      enum bch_opt_id id,
+			      struct printbuf *out)
 {
-	struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir);
-	const struct bch_option *opt = container_of(attr, struct bch_option, attr);
-	int id = opt - bch2_opt_table;
-	u64 v = bch2_opt_get_by_id(&c->opts, id);
+	const struct bch_option *opt = bch2_opt_table + id;
+	u64 v;
+
+	if (opt->flags & OPT_FS) {
+		v = bch2_opt_get_by_id(&c->opts, id);
+	} else if ((opt->flags & OPT_DEVICE) && opt->get_member) {
+		v = bch2_opt_from_sb(c->disk_sb.sb, id, ca->dev_idx);
+	} else {
+		return -EINVAL;
+	}

 	bch2_opt_to_text(out, c, c->disk_sb.sb, opt, v, OPT_SHOW_FULL_LIST);
 	prt_char(out, '\n');
-
 	return 0;
 }

-STORE(bch2_fs_opts_dir)
+static ssize_t sysfs_opt_store(struct bch_fs *c,
+			       struct bch_dev *ca,
+			       enum bch_opt_id id,
+			       const char *buf, size_t size)
 {
-	struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir);
-	const struct bch_option *opt = container_of(attr, struct bch_option, attr);
-	int ret, id = opt - bch2_opt_table;
-	char *tmp;
-	u64 v;
+	const struct bch_option *opt = bch2_opt_table + id;
+	int ret = 0;

 	/*
 	 * We don't need to take c->writes for correctness, but it eliminates an
···
 	if (unlikely(!bch2_write_ref_tryget(c, BCH_WRITE_REF_sysfs)))
 		return -EROFS;

-	tmp = kstrdup(buf, GFP_KERNEL);
+	down_write(&c->state_lock);
+
+	char *tmp = kstrdup(buf, GFP_KERNEL);
 	if (!tmp) {
 		ret = -ENOMEM;
 		goto err;
 	}

-	ret = bch2_opt_parse(c, opt, strim(tmp), &v, NULL);
+	u64 v;
+	ret = bch2_opt_parse(c, opt, strim(tmp), &v, NULL) ?:
+		bch2_opt_check_may_set(c, ca, id, v);
 	kfree(tmp);

 	if (ret < 0)
 		goto err;

-	ret = bch2_opt_check_may_set(c, id, v);
-	if (ret < 0)
-		goto err;
-
-	bch2_opt_set_sb(c, NULL, opt, v);
+	bch2_opt_set_sb(c, ca, opt, v);
 	bch2_opt_set_by_id(&c->opts, id, v);

 	if (v &&
 	    (id == Opt_background_target ||
+	     (id == Opt_foreground_target && !c->opts.background_target) ||
 	     id == Opt_background_compression ||
 	     (id == Opt_compression && !c->opts.background_compression)))
 		bch2_set_rebalance_needs_scan(c, 0);
···
 	    c->copygc_thread)
 		wake_up_process(c->copygc_thread);

+	if (id == Opt_discard && !ca) {
+		mutex_lock(&c->sb_lock);
+		for_each_member_device(c, ca)
+			opt->set_member(bch2_members_v2_get_mut(ca->disk_sb.sb, ca->dev_idx), v);
+
+		bch2_write_super(c);
+		mutex_unlock(&c->sb_lock);
+	}
+
 	ret = size;
 err:
+	up_write(&c->state_lock);
 	bch2_write_ref_put(c, BCH_WRITE_REF_sysfs);
 	return ret;
+}
+
+SHOW(bch2_fs_opts_dir)
+{
+	struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir);
+	int id = bch2_opt_lookup(attr->name);
+	if (id < 0)
+		return 0;
+
+	return sysfs_opt_show(c, NULL, id, out);
+}
+
+STORE(bch2_fs_opts_dir)
+{
+	struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir);
+	int id = bch2_opt_lookup(attr->name);
+	if (id < 0)
+		return 0;
+
+	return sysfs_opt_store(c, NULL, id, buf, size);
 }
 SYSFS_OPS(bch2_fs_opts_dir);

 struct attribute *bch2_fs_opts_dir_files[] = { NULL };

-int bch2_opts_create_sysfs_files(struct kobject *kobj)
+int bch2_opts_create_sysfs_files(struct kobject *kobj, unsigned type)
 {
-	const struct bch_option *i;
-	int ret;
-
-	for (i = bch2_opt_table;
+	for (const struct bch_option *i = bch2_opt_table;
 	     i < bch2_opt_table + bch2_opts_nr;
 	     i++) {
-		if (!(i->flags & OPT_FS))
+		if (i->flags & OPT_HIDDEN)
+			continue;
+		if (!(i->flags & type))
 			continue;

-		ret = sysfs_create_file(kobj, &i->attr);
+		int ret = sysfs_create_file(kobj, &i->attr);
 		if (ret)
 			return ret;
 	}
···

 	sysfs_printf(uuid, "%pU\n", ca->uuid.b);

-	sysfs_print(bucket_size, bucket_bytes(ca));
 	sysfs_print(first_bucket, ca->mi.first_bucket);
 	sysfs_print(nbuckets, ca->mi.nbuckets);
-	sysfs_print(durability, ca->mi.durability);
-	sysfs_print(discard, ca->mi.discard);

 	if (attr == &sysfs_label) {
 		if (ca->mi.group)
···

 	if (attr == &sysfs_has_data) {
 		prt_bitflags(out, __bch2_data_types, bch2_dev_has_data(c, ca));
-		prt_char(out, '\n');
-	}
-
-	if (attr == &sysfs_state) {
-		prt_string_option(out, bch2_member_states, ca->mi.state);
 		prt_char(out, '\n');
 	}

···
 	if (attr == &sysfs_open_buckets)
 		bch2_open_buckets_to_text(out, c, ca);

+	int opt_id = bch2_opt_lookup(attr->name);
+	if (opt_id >= 0)
+		return sysfs_opt_show(c, ca, opt_id, out);
+
 	return 0;
 }

···
 {
 	struct bch_dev *ca = container_of(kobj, struct bch_dev, kobj);
 	struct bch_fs *c = ca->fs;
-
-	if (attr == &sysfs_discard) {
-		bool v = strtoul_or_return(buf);
-
-		bch2_opt_set_sb(c, ca, bch2_opt_table + Opt_discard, v);
-	}
-
-	if (attr == &sysfs_durability) {
-		u64 v = strtoul_or_return(buf);
-
-		bch2_opt_set_sb(c, ca, bch2_opt_table + Opt_durability, v);
-	}

 	if (attr == &sysfs_label) {
 		char *tmp;
···
 	if (attr == &sysfs_io_errors_reset)
 		bch2_dev_errors_reset(ca);

+	int opt_id = bch2_opt_lookup(attr->name);
+	if (opt_id >= 0)
+		return sysfs_opt_store(c, ca, opt_id, buf, size);
+
 	return size;
 }
 SYSFS_OPS(bch2_dev);

 struct attribute *bch2_dev_files[] = {
 	&sysfs_uuid,
-	&sysfs_bucket_size,
 	&sysfs_first_bucket,
 	&sysfs_nbuckets,
-	&sysfs_durability,

 	/* settings: */
-	&sysfs_discard,
-	&sysfs_state,
 	&sysfs_label,

 	&sysfs_has_data,
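The sysfs.c hunks replace per-attribute `if (attr == &sysfs_...)` branches with a lookup of the attribute name in the option table, followed by a shared show helper that reads either the filesystem-wide or the per-device value. The following is a minimal userspace sketch of that dispatch pattern only. The table contents, the `sketch_` names, and the plain value arrays are hypothetical stand-ins, and the NULL check on `dev_vals` is a simplification of the `opt->get_member` test in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum { SKETCH_OPT_FS = 1 << 0, SKETCH_OPT_DEVICE = 1 << 1 };

/* Hypothetical, much-reduced stand-in for struct bch_option. */
struct sketch_opt {
	const char *name;
	unsigned    flags;
};

static const struct sketch_opt sketch_opt_table[] = {
	{ "compression", SKETCH_OPT_FS },
	{ "durability",  SKETCH_OPT_DEVICE },
	{ "discard",     SKETCH_OPT_FS | SKETCH_OPT_DEVICE },
};

#define SKETCH_OPTS_NR (sizeof(sketch_opt_table) / sizeof(sketch_opt_table[0]))

/* Map an attribute name to an option id, or -1 if it is not an option. */
static int sketch_opt_lookup(const char *name)
{
	for (size_t i = 0; i < SKETCH_OPTS_NR; i++)
		if (!strcmp(sketch_opt_table[i].name, name))
			return (int) i;
	return -1;
}

/*
 * Shared show path: filesystem-wide options win, device-only options need
 * a device (dev_vals != NULL here), anything else is rejected.
 */
static int sketch_opt_show(int id, const unsigned long *fs_vals,
			   const unsigned long *dev_vals, unsigned long *out)
{
	const struct sketch_opt *opt = &sketch_opt_table[id];

	assert(id >= 0 && (size_t) id < SKETCH_OPTS_NR);

	if (opt->flags & SKETCH_OPT_FS)
		*out = fs_vals[id];
	else if ((opt->flags & SKETCH_OPT_DEVICE) && dev_vals)
		*out = dev_vals[id];
	else
		return -EINVAL;
	return 0;
}
```

A caller in the style of the new `SHOW(bch2_dev)` would first handle its fixed attributes, then fall through to `sketch_opt_lookup(name)` and call `sketch_opt_show()` only when the id is non-negative.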
+3 -2
fs/bcachefs/sysfs.h
···
 extern const struct sysfs_ops bch2_fs_time_stats_sysfs_ops;
 extern const struct sysfs_ops bch2_dev_sysfs_ops;

-int bch2_opts_create_sysfs_files(struct kobject *);
+int bch2_opts_create_sysfs_files(struct kobject *, unsigned);

 #else

···
 static const struct sysfs_ops bch2_fs_time_stats_sysfs_ops;
 static const struct sysfs_ops bch2_dev_sysfs_ops;

-static inline int bch2_opts_create_sysfs_files(struct kobject *kobj) { return 0; }
+static inline int bch2_opts_create_sysfs_files(struct kobject *kobj, unsigned type)
+{ return 0; }

 #endif /* NO_BCACHEFS_SYSFS */

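The new `unsigned type` parameter lets one function populate both the filesystem and the per-device options directories: an option gets a file only if it is not hidden and its flags intersect the requested type. A small userspace sketch of that filter follows. The flag values, the `demo_` names, and the output array are assumptions for illustration and do not match the kernel's definitions.

```c
#include <stddef.h>

enum {
	DEMO_OPT_FS     = 1 << 0,
	DEMO_OPT_DEVICE = 1 << 1,
	DEMO_OPT_HIDDEN = 1 << 2,
};

/* Hypothetical reduced option descriptor. */
struct demo_opt {
	const char *name;
	unsigned    flags;
};

/*
 * Collect the names that would get a sysfs file for the given type mask.
 * Returns the number of names written to out (out must hold nr entries).
 */
static size_t demo_select_opts(const struct demo_opt *tbl, size_t nr,
			       unsigned type, const char **out)
{
	size_t n = 0;

	for (const struct demo_opt *i = tbl; i < tbl + nr; i++) {
		if (i->flags & DEMO_OPT_HIDDEN)
			continue;	/* never exposed */
		if (!(i->flags & type))
			continue;	/* belongs to the other directory */
		out[n++] = i->name;
	}
	return n;
}
```

An option flagged for both filesystem and device scope appears in both directories, which is consistent with how the patch treats `discard`.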
+41 -64
fs/bcachefs/trace.h
···

 /* io.c: */

-DEFINE_EVENT(bio, read_promote,
+DEFINE_EVENT(bio, io_read_promote,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
 );

-TRACE_EVENT(read_nopromote,
+TRACE_EVENT(io_read_nopromote,
 	TP_PROTO(struct bch_fs *c, int ret),
 	TP_ARGS(c, ret),

···
 		  __entry->ret)
 );

-DEFINE_EVENT(bio, read_bounce,
+DEFINE_EVENT(bio, io_read_bounce,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
 );

-DEFINE_EVENT(bio, read_split,
+DEFINE_EVENT(bio, io_read_split,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
 );

-DEFINE_EVENT(bio, read_retry,
+DEFINE_EVENT(bio, io_read_retry,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
 );

-DEFINE_EVENT(bio, read_reuse_race,
+DEFINE_EVENT(bio, io_read_reuse_race,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
+);
+
+/* ec.c */
+
+TRACE_EVENT(stripe_create,
+	TP_PROTO(struct bch_fs *c, u64 idx, int ret),
+	TP_ARGS(c, idx, ret),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		dev			)
+		__field(u64,		idx			)
+		__field(int,		ret			)
+	),
+
+	TP_fast_assign(
+		__entry->dev			= c->dev;
+		__entry->idx			= idx;
+		__entry->ret			= ret;
+	),
+
+	TP_printk("%d,%d idx %llu ret %i",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->idx,
+		  __entry->ret)
 );

 /* Journal */
···


 /* Moving IO */
-TRACE_EVENT(bucket_evacuate,
-	TP_PROTO(struct bch_fs *c, struct bpos *bucket),
-	TP_ARGS(c, bucket),
-
-	TP_STRUCT__entry(
-		__field(dev_t,		dev			)
-		__field(u32,		dev_idx			)
-		__field(u64,		bucket			)
-	),
-
-	TP_fast_assign(
-		__entry->dev		= c->dev;
-		__entry->dev_idx	= bucket->inode;
-		__entry->bucket		= bucket->offset;
-	),
-
-	TP_printk("%d:%d %u:%llu",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->dev_idx, __entry->bucket)
-);
-
-DEFINE_EVENT(fs_str, move_extent,
+DEFINE_EVENT(fs_str, io_move,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_read,
+DEFINE_EVENT(fs_str, io_move_read,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_write,
+DEFINE_EVENT(fs_str, io_move_write,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_finish,
+DEFINE_EVENT(fs_str, io_move_finish,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_fail,
+DEFINE_EVENT(fs_str, io_move_fail,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_start_fail,
+DEFINE_EVENT(fs_str, io_move_write_fail,
+	TP_PROTO(struct bch_fs *c, const char *str),
+	TP_ARGS(c, str)
+);
+
+DEFINE_EVENT(fs_str, io_move_start_fail,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );
···
 		  __entry->sectors_seen,
 		  __entry->sectors_moved,
 		  __entry->sectors_raced)
-);
-
-TRACE_EVENT(evacuate_bucket,
-	TP_PROTO(struct bch_fs *c, struct bpos *bucket,
-		 unsigned sectors, unsigned bucket_size,
-		 int ret),
-	TP_ARGS(c, bucket, sectors, bucket_size, ret),
-
-	TP_STRUCT__entry(
-		__field(dev_t,		dev		)
-		__field(u64,		member		)
-		__field(u64,		bucket		)
-		__field(u32,		sectors		)
-		__field(u32,		bucket_size	)
-		__field(int,		ret		)
-	),
-
-	TP_fast_assign(
-		__entry->dev			= c->dev;
-		__entry->member			= bucket->inode;
-		__entry->bucket			= bucket->offset;
-		__entry->sectors		= sectors;
-		__entry->bucket_size		= bucket_size;
-		__entry->ret			= ret;
-	),
-
-	TP_printk("%d,%d %llu:%llu sectors %u/%u ret %i",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->member, __entry->bucket,
-		  __entry->sectors, __entry->bucket_size,
-		  __entry->ret)
 );

 TRACE_EVENT(copygc,
+189 -42
fs/bcachefs/util.c
···
 	u64 last_q = 0;

 	prt_printf(out, "quantiles (%s):\t", u->name);
-	eytzinger0_for_each(i, NR_QUANTILES) {
-		bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
+	eytzinger0_for_each(j, NR_QUANTILES) {
+		bool is_last = eytzinger0_next(j, NR_QUANTILES) == -1;

-		u64 q = max(quantiles->entries[i].m, last_q);
+		u64 q = max(quantiles->entries[j].m, last_q);
 		prt_printf(out, "%llu ", div64_u64(q, u->nsecs));
 		if (is_last)
 			prt_newline(out);
···
 	}
 }

+#ifdef CONFIG_BCACHEFS_DEBUG
+void bch2_corrupt_bio(struct bio *bio)
+{
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	unsigned offset = get_random_u32_below(bio->bi_iter.bi_size / sizeof(u64));
+
+	bio_for_each_segment(bv, bio, iter) {
+		unsigned u64s = bv.bv_len / sizeof(u64);
+
+		if (offset < u64s) {
+			u64 *segment = bvec_kmap_local(&bv);
+			segment[offset] = get_random_u64();
+			kunmap_local(segment);
+			return;
+		}
+		offset -= u64s;
+	}
+}
+#endif
+
 #if 0
 void eytzinger1_test(void)
 {
-	unsigned inorder, eytz, size;
+	unsigned inorder, size;

-	pr_info("1 based eytzinger test:");
+	pr_info("1 based eytzinger test:\n");

 	for (size = 2;
 	     size < 65536;
···
 		unsigned extra = eytzinger1_extra(size);

 		if (!(size % 4096))
-			pr_info("tree size %u", size);
-
-		BUG_ON(eytzinger1_prev(0, size) != eytzinger1_last(size));
-		BUG_ON(eytzinger1_next(0, size) != eytzinger1_first(size));
-
-		BUG_ON(eytzinger1_prev(eytzinger1_first(size), size) != 0);
-		BUG_ON(eytzinger1_next(eytzinger1_last(size), size) != 0);
+			pr_info("tree size %u\n", size);

 		inorder = 1;
 		eytzinger1_for_each(eytz, size) {
···

 			inorder++;
 		}
+		BUG_ON(inorder - 1 != size);
 	}
 }

 void eytzinger0_test(void)
 {

-	unsigned inorder, eytz, size;
+	unsigned inorder, size;

-	pr_info("0 based eytzinger test:");
+	pr_info("0 based eytzinger test:\n");

 	for (size = 1;
 	     size < 65536;
···
 		unsigned extra = eytzinger0_extra(size);

 		if (!(size % 4096))
-			pr_info("tree size %u", size);
-
-		BUG_ON(eytzinger0_prev(-1, size) != eytzinger0_last(size));
-		BUG_ON(eytzinger0_next(-1, size) != eytzinger0_first(size));
-
-		BUG_ON(eytzinger0_prev(eytzinger0_first(size), size) != -1);
-		BUG_ON(eytzinger0_next(eytzinger0_last(size), size) != -1);
+			pr_info("tree size %u\n", size);

 		inorder = 0;
 		eytzinger0_for_each(eytz, size) {
···

 			inorder++;
 		}
+		BUG_ON(inorder != size);
+
+		inorder = size - 1;
+		eytzinger0_for_each_prev(eytz, size) {
+			BUG_ON(eytz != eytzinger0_first(size) &&
+			       eytzinger0_next(eytzinger0_prev(eytz, size), size) != eytz);
+
+			inorder--;
+		}
+		BUG_ON(inorder != -1);
 	}
 }

-static inline int cmp_u16(const void *_l, const void *_r, size_t size)
+static inline int cmp_u16(const void *_l, const void *_r)
 {
 	const u16 *l = _l, *r = _r;

-	return (*l > *r) - (*r - *l);
+	return (*l > *r) - (*r > *l);
 }

-static void eytzinger0_find_test_val(u16 *test_array, unsigned nr, u16 search)
+static void eytzinger0_find_test_le(u16 *test_array, unsigned nr, u16 search)
 {
-	int i, c1 = -1, c2 = -1;
-	ssize_t r;
+	int r, s;
+	bool bad;

 	r = eytzinger0_find_le(test_array, nr,
 			       sizeof(test_array[0]),
 			       cmp_u16, &search);
-	if (r >= 0)
-		c1 = test_array[r];
-
-	for (i = 0; i < nr; i++)
-		if (test_array[i] <= search && test_array[i] > c2)
-			c2 = test_array[i];
-
-	if (c1 != c2) {
-		eytzinger0_for_each(i, nr)
-			pr_info("[%3u] = %12u", i, test_array[i]);
-		pr_info("find_le(%2u) -> [%2zi] = %2i should be %2i",
-			i, r, c1, c2);
+	if (r >= 0) {
+		if (test_array[r] > search) {
+			bad = true;
+		} else {
+			s = eytzinger0_next(r, nr);
+			bad = s >= 0 && test_array[s] <= search;
+		}
+	} else {
+		s = eytzinger0_last(nr);
+		bad = s >= 0 && test_array[s] <= search;
 	}
+
+	if (bad) {
+		s = -1;
+		eytzinger0_for_each_prev(j, nr) {
+			if (test_array[j] <= search) {
+				s = j;
+				break;
+			}
+		}
+
+		eytzinger0_for_each(j, nr)
+			pr_info("[%3u] = %12u\n", j, test_array[j]);
+		pr_info("find_le(%12u) = %3i should be %3i\n",
+			search, r, s);
+		BUG();
+	}
+}
+
+static void eytzinger0_find_test_gt(u16 *test_array, unsigned nr, u16 search)
+{
+	int r, s;
+	bool bad;
+
+	r = eytzinger0_find_gt(test_array, nr,
+			       sizeof(test_array[0]),
+			       cmp_u16, &search);
+	if (r >= 0) {
+		if (test_array[r] <= search) {
+			bad = true;
+		} else {
+			s = eytzinger0_prev(r, nr);
+			bad = s >= 0 && test_array[s] > search;
+		}
+	} else {
+		s = eytzinger0_first(nr);
+		bad = s >= 0 && test_array[s] > search;
+	}
+
+	if (bad) {
+		s = -1;
+		eytzinger0_for_each(j, nr) {
+			if (test_array[j] > search) {
+				s = j;
+				break;
+			}
+		}
+
+		eytzinger0_for_each(j, nr)
+			pr_info("[%3u] = %12u\n", j, test_array[j]);
+		pr_info("find_gt(%12u) = %3i should be %3i\n",
+			search, r, s);
+		BUG();
+	}
+}
+
+static void eytzinger0_find_test_ge(u16 *test_array, unsigned nr, u16 search)
+{
+	int r, s;
+	bool bad;
+
+	r = eytzinger0_find_ge(test_array, nr,
+			       sizeof(test_array[0]),
+			       cmp_u16, &search);
+	if (r >= 0) {
+		if (test_array[r] < search) {
+			bad = true;
+		} else {
+			s = eytzinger0_prev(r, nr);
+			bad = s >= 0 && test_array[s] >= search;
+		}
+	} else {
+		s = eytzinger0_first(nr);
+		bad = s >= 0 && test_array[s] >= search;
+	}
+
+	if (bad) {
+		s = -1;
+		eytzinger0_for_each(j, nr) {
+			if (test_array[j] >= search) {
+				s = j;
+				break;
+			}
+		}
+
+		eytzinger0_for_each(j, nr)
+			pr_info("[%3u] = %12u\n", j, test_array[j]);
+		pr_info("find_ge(%12u) = %3i should be %3i\n",
+			search, r, s);
+		BUG();
+	}
+}
+
+static void eytzinger0_find_test_eq(u16 *test_array, unsigned nr, u16 search)
+{
+	unsigned r;
+	int s;
+	bool bad;
+
+	r = eytzinger0_find(test_array, nr,
+			    sizeof(test_array[0]),
+			    cmp_u16, &search);
+
+	if (r < nr) {
+		bad = test_array[r] != search;
+	} else {
+		s = eytzinger0_find_le(test_array, nr,
+				       sizeof(test_array[0]),
+				       cmp_u16, &search);
+		bad = s >= 0 && test_array[s] == search;
+	}
+
+	if (bad) {
+		eytzinger0_for_each(j, nr)
+			pr_info("[%3u] = %12u\n", j, test_array[j]);
+		pr_info("find(%12u) = %3i is incorrect\n",
+			search, r);
+		BUG();
+	}
+}
+
+static void eytzinger0_find_test_val(u16 *test_array, unsigned nr, u16 search)
+{
+	eytzinger0_find_test_le(test_array, nr, search);
+	eytzinger0_find_test_gt(test_array, nr, search);
+	eytzinger0_find_test_ge(test_array, nr, search);
+	eytzinger0_find_test_eq(test_array, nr, search);
 }

 void eytzinger0_find_test(void)
···
 	u16 *test_array = kmalloc_array(allocated, sizeof(test_array[0]), GFP_KERNEL);

 	for (nr = 1; nr < allocated; nr++) {
-		pr_info("testing %u elems", nr);
+		u16 prev = 0;
+
+		pr_info("testing %u elems\n", nr);

 		get_random_bytes(test_array, nr * sizeof(test_array[0]));
 		eytzinger0_sort(test_array, nr, sizeof(test_array[0]), cmp_u16, NULL);

 		/* verify array is sorted correctly: */
-		eytzinger0_for_each(i, nr)
-			BUG_ON(i != eytzinger0_last(nr) &&
-			       test_array[i] > test_array[eytzinger0_next(i, nr)]);
+		eytzinger0_for_each(j, nr) {
+			BUG_ON(test_array[j] < prev);
+			prev = test_array[j];
+		}

 		for (i = 0; i < U16_MAX; i += 1 << 12)
 			eytzinger0_find_test_val(test_array, nr, i);
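The `cmp_u16` hunk in util.c switches the return expression to the `(a > b) - (b > a)` idiom and drops the unused `size` argument. For `u16` operands promoted to `int`, the old expression `(*l > *r) - (*r - *l)` appears to produce the same sign, so this reads as a normalization rather than a behavior fix: the new form always returns exactly -1, 0, or 1 and never depends on a subtraction. The userspace sketch below shows the idiom with the standard `qsort`, whose comparator has the same two-argument shape. The `sort_u16` helper is a hypothetical name, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way compare: two boolean results, so the value is -1, 0 or 1. */
static int cmp_u16_sketch(const void *_l, const void *_r)
{
	const uint16_t *l = _l, *r = _r;

	return (*l > *r) - (*r > *l);
}

/* The pre-patch expression, kept only to compare return values. */
static int cmp_u16_prepatch(const void *_l, const void *_r)
{
	const uint16_t *l = _l, *r = _r;

	return (*l > *r) - (*r - *l);
}

/* Hypothetical helper: sort an array of u16 in ascending order. */
static void sort_u16(uint16_t *a, size_t n)
{
	qsort(a, n, sizeof(a[0]), cmp_u16_sketch);
}
```

The same idiom is worth preferring for wider types, where a subtraction-based comparator can overflow or truncate when narrowed to `int`.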
+14 -2
fs/bcachefs/util.h
···
 void memcpy_to_bio(struct bio *, struct bvec_iter, const void *);
 void memcpy_from_bio(void *, struct bio *, struct bvec_iter);

+#ifdef CONFIG_BCACHEFS_DEBUG
+void bch2_corrupt_bio(struct bio *);
+
+static inline void bch2_maybe_corrupt_bio(struct bio *bio, unsigned ratio)
+{
+	if (ratio && !get_random_u32_below(ratio))
+		bch2_corrupt_bio(bio);
+}
+#else
+#define bch2_maybe_corrupt_bio(...)	do {} while (0)
+#endif
+
 static inline void memcpy_u64s_small(void *dst, const void *src,
 				     unsigned u64s)
 {
···
 static inline void __memcpy_u64s(void *dst, const void *src,
 				 unsigned u64s)
 {
-#ifdef CONFIG_X86_64
+#if defined(CONFIG_X86_64) && !defined(CONFIG_KMSAN)
 	long d0, d1, d2;

 	asm volatile("rep ; movsq"
···
 	u64 *dst = (u64 *) _dst + u64s - 1;
 	u64 *src = (u64 *) _src + u64s - 1;

-#ifdef CONFIG_X86_64
+#if defined(CONFIG_X86_64) && !defined(CONFIG_KMSAN)
 	long d0, d1, d2;

 	asm volatile("std ;\n"
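`bch2_maybe_corrupt_bio()` gates fault injection with a 1-in-`ratio` draw, and `bch2_corrupt_bio()` picks one flat `u64` index across the whole bio and walks the segments until that index falls inside one. The userspace sketch below mirrors both steps, under the assumption that the random draw and the replacement value are passed in by the caller so the behavior is deterministic. The `struct segment` type and all function names here are hypothetical stand-ins for the kernel's bio machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one bio segment: a buffer of whole u64 words. */
struct segment {
	uint64_t *words;
	size_t    len_bytes;
};

/*
 * Overwrite the u64 at flat index `offset`, counted across all segments.
 * Returns the index of the segment that was modified, or -1 if `offset`
 * lies past the end of the last segment.
 */
static int corrupt_word_at(struct segment *segs, size_t nr_segs,
			   size_t offset, uint64_t garbage)
{
	for (size_t i = 0; i < nr_segs; i++) {
		size_t u64s = segs[i].len_bytes / sizeof(uint64_t);

		if (offset < u64s) {
			segs[i].words[offset] = garbage;
			return (int) i;
		}
		/* Not in this segment: make the index relative to the next one. */
		offset -= u64s;
	}
	return -1;
}

/*
 * 1-in-`ratio` gate: ratio == 0 disables injection; otherwise fire only
 * when the caller-supplied draw from [0, ratio) is zero.
 */
static int should_corrupt(unsigned ratio, unsigned draw)
{
	assert(ratio == 0 || draw < ratio);
	return ratio && !draw;
}
```

In the kernel version the draw comes from `get_random_u32_below(ratio)`, and the whole facility compiles away to an empty statement when `CONFIG_BCACHEFS_DEBUG` is not set.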
+1 -1
fs/bcachefs/xattr.c
···
 		if (ret < 0)
 			goto err_class_exit;

-		ret = bch2_opt_check_may_set(c, opt_id, v);
+		ret = bch2_opt_check_may_set(c, NULL, opt_id, v);
 		if (ret < 0)
 			goto err_class_exit;
