Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs

+25 -18

Documentation/filesystems/bcachefs/SubmittingPatches.rst

···
-Submitting patches to bcachefs:
-===============================
+Submitting patches to bcachefs
+==============================
+
+Here are suggestions for submitting patches to bcachefs subsystem.
+
+Submission checklist
+--------------------
 
 Patches must be tested before being submitted, either with the xfstests suite
-[0], or the full bcachefs test suite in ktest [1], depending on what's being
+[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
 touched. Note that ktest wraps xfstests and will be an easier method to running
 it for most users; it includes single-command wrappers for all the mainstream
 in-kernel local filesystems.
···
 Focus on writing code that reads well and is organized well; code should be
 aesthetically pleasing.
 
-CI:
-===
+CI
+--
 
 Instead of running your tests locally, when running the full test suite it's
 preferable to let a server farm do it in parallel, and then have the results
 in a nice test dashboard (which can tell you which failures are new, and
 presents results in a git log view, avoiding the need for most bisecting).
 
-That exists [2], and community members may request an account. If you work for
+That exists [2]_, and community members may request an account. If you work for
 a big tech company, you'll need to help out with server costs to get access -
 but the CI is not restricted to running bcachefs tests: it runs any ktest test
 (which generally makes it easy to wrap other tests that can run in qemu).
 
-Other things to think about:
-============================
+Other things to think about
+---------------------------
 
 - How will we debug this code? Is there sufficient introspection to diagnose
   when something starts acting wonky on a user machine?
···
   tested? (Automated tests exists but aren't in the CI, due to the hassle of
   disk image management; coordinate to have them run.)
 
-Mailing list, IRC:
-==================
+Mailing list, IRC
+-----------------
 
-Patches should hit the list [3], but much discussion and code review happens on
-IRC as well [4]; many people appreciate the more conversational approach and
-quicker feedback.
+Patches should hit the list [3]_, but much discussion and code review happens
+on IRC as well [4]_; many people appreciate the more conversational approach
+and quicker feedback.
 
 Additionally, we have a lively user community doing excellent QA work, which
 exists primarily on IRC. Please make use of that resource; user feedback is
 important for any nontrivial feature, and documenting it in commit messages
 would be a good idea.
 
-[0]: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
-[1]: https://evilpiepirate.org/git/ktest.git/
-[2]: https://evilpiepirate.org/~testdashboard/ci/
-[3]: linux-bcachefs@vger.kernel.org
-[4]: irc.oftc.net#bcache, #bcachefs-dev
+.. rubric:: References
+
+.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
+.. [1] https://evilpiepirate.org/git/ktest.git/
+.. [2] https://evilpiepirate.org/~testdashboard/ci/
+.. [3] linux-bcachefs@vger.kernel.org
+.. [4] irc.oftc.net#bcache, #bcachefs-dev

+90

Documentation/filesystems/bcachefs/casefolding.rst

···
+.. SPDX-License-Identifier: GPL-2.0
+
+Casefolding
+===========
+
+bcachefs has support for case-insensitive file and directory
+lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
+casefolding attributes.
+
+The main usecase for casefolding is compatibility with software written
+against other filesystems that rely on casefolded lookups
+(eg. NTFS and Wine/Proton).
+Taking advantage of file-system level casefolding can lead to great
+loading time gains in many applications and games.
+
+Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
+Once a directory has been flagged for casefolding, a feature bit
+is enabled on the superblock which marks the filesystem as using
+casefolding.
+When the feature bit for casefolding is enabled, it is no longer possible
+to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
+
+On the lookup/query side: casefolding is implemented by allocating a new
+string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
+casefold the query string.
+
+On the dirent side: casefolding is implemented by ensuring the `bkey`'s
+hash is made from the casefolded string and storing the cached casefolded
+name with the regular name in the dirent.
+
+The structure looks like this:
+
+* Regular:    [dirent data][regular name][nul][nul]...
+* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
+
+(Do note, the number of NULs here is merely for illustration; their count can
+vary per-key, and they may not even be present if the key is aligned to
+`sizeof(u64)`.)
+
+This is efficient as it means that for all file lookups that require casefolding,
+it has identical performance to a regular lookup:
+a hash comparison and a `memcmp` of the name.
+
+Rationale
+---------
+
+Several designs were considered for this system:
+One was to introduce a dirent_v2, however that would be painful especially as
+the hash system only has support for a single key type. This would also need
+`BCH_NAME_MAX` to change between versions, and a new feature bit.
+
+Another option was to store without the two lengths, and just take the length of
+the regular name and casefolded name contiguously / 2 as the length. This would
+assume that the regular length == casefolded length, but that could potentially
+not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
+the lowercase unicode glyph.
+It would be possible to disregard the casefold cache for those cases, but it was
+decided to simply encode the two string lengths in the key to avoid random
+performance issues if this edgecase was ever hit.
+
+The option settled on was to use a free-bit in d_type to mark a dirent as having
+a casefold cache, and then treat the first 4 bytes the name block as lengths.
+You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
+
+The feature bit was used to allow casefolding support to be enabled for the majority
+of users, but some allow users who have no need for the feature to still use bcachefs as
+`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
+which may be decider between using bcachefs for eg. embedded platforms.
+
+Other filesystems like ext4 and f2fs have a super-block level option for casefolding
+encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
+any encodings than a single UTF-8 version. When future encodings are desirable,
+they will be added trivially using the opts mechanism.
+
+dentry/dcache considerations
+----------------------------
+
+Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
+negative dentry's.
+
+This is because currently doing so presents a problem in the following scenario:
+
+- Lookup file "blAH" in a casefolded directory
+- Creation of file "BLAH" in a casefolded directory
+- Lookup file "blAH" in a casefolded directory
+
+This would fail if negative dentry's were cached.
+
+This is slightly suboptimal, but could be fixed in future with some vfs work.
+
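The casefolded dirent layout described in casefolding.rst above ([reg len][cf len][regular name][casefolded name]) can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel's code: `cf_block_pack`, `cf_block_matches` and `ascii_casefold` are hypothetical names, the two lengths are assumed to be 16 bits each (the document only says the first 4 bytes hold the lengths), they are stored in host byte order, and ASCII `tolower` stands in for `utf8_casefold`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for utf8_casefold(): ASCII-only lowercase fold. */
static void ascii_casefold(char *dst, const char *src)
{
	while (*src)
		*dst++ = (char)tolower((unsigned char)*src++);
	*dst = '\0';
}

/*
 * Pack [reg len][cf len][regular name][casefolded name] into buf.
 * Assumption: each length is a u16, stored in host byte order.
 * Returns the number of bytes written, or 0 if the input does not fit.
 */
static size_t cf_block_pack(uint8_t *buf, size_t buf_size,
			    const char *name, const char *cf_name)
{
	size_t name_len = strlen(name);
	size_t cf_len = strlen(cf_name);
	size_t total = 4 + name_len + cf_len;

	if (name_len > UINT16_MAX || cf_len > UINT16_MAX || total > buf_size)
		return 0;

	uint16_t lens[2] = { (uint16_t)name_len, (uint16_t)cf_len };

	memcpy(buf, lens, sizeof(lens));
	memcpy(buf + 4, name, name_len);
	memcpy(buf + 4 + name_len, cf_name, cf_len);
	return total;
}

/*
 * Compare an already-casefolded query against the cached casefolded name:
 * a length check plus one memcmp, the same cost as a regular name compare.
 */
static int cf_block_matches(const uint8_t *buf, const char *cf_query)
{
	uint16_t lens[2];
	size_t qlen = strlen(cf_query);

	memcpy(lens, buf, sizeof(lens));
	return qlen == lens[1] &&
	       !memcmp(buf + 4 + lens[0], cf_query, qlen);
}
```

A lookup for "blAH" would first fold the query with `ascii_casefold()` and then call `cf_block_matches()` on the block packed for "BLAH"; storing both lengths means the two names never need to be the same size.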

+19 -1

Documentation/filesystems/bcachefs/index.rst

···
 bcachefs Documentation
 ======================
 
+Subsystem-specific development process notes
+--------------------------------------------
+
+Development notes specific to bcachefs. These are intended to supplement
+:doc:`general kernel development handbook </process/index>`.
+
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :numbered:
 
    CodingStyle
    SubmittingPatches
+
+Filesystem implementation
+-------------------------
+
+Documentation for filesystem features and their implementation details.
+At this moment, only a few of these are described here.
+
+.. toctree::
+   :maxdepth: 1
+   :numbered:
+
+   casefolding
    errorcodes

+1 -1

fs/bcachefs/Kconfig

···
 	select ZSTD_COMPRESS
 	select ZSTD_DECOMPRESS
 	select CRYPTO
-	select CRYPTO_SHA256
+	select CRYPTO_LIB_SHA256
 	select CRYPTO_CHACHA20
 	select CRYPTO_POLY1305
 	select KEYS

+2 -1

fs/bcachefs/Makefile

···
 	extent_update.o		\
 	eytzinger.o		\
 	fs.o			\
-	fs-common.o		\
 	fs-ioctl.o		\
 	fs-io.o			\
 	fs-io-buffered.o	\
···
 	migrate.o		\
 	move.o			\
 	movinggc.o		\
+	namei.o			\
 	nocow_locking.o		\
 	opts.o			\
 	printbuf.o		\
+	progress.o		\
 	quota.o			\
 	rebalance.o		\
 	rcu_pending.o		\

+134 -56

fs/bcachefs/alloc_background.c

···
 	int ret = 0;
 
 	bkey_fsck_err_on(bch2_alloc_unpack_v3(&u, k),
-			 c, alloc_v2_unpack_error,
+			 c, alloc_v3_unpack_error,
 			 "unpack error");
 fsck_err:
 	return ret;
···
 					 s64 delta_sectors,
 					 s64 delta_fragmented, unsigned flags)
 {
-	struct disk_accounting_pos acc = {
-		.type = BCH_DISK_ACCOUNTING_dev_data_type,
-		.dev_data_type.dev = ca->dev_idx,
-		.dev_data_type.data_type = data_type,
-	};
 	s64 d[3] = { delta_buckets, delta_sectors, delta_fragmented };
 
-	return bch2_disk_accounting_mod(trans, &acc, d, 3, flags & BTREE_TRIGGER_gc);
+	return bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc,
+					 d, dev_data_type,
+					 .dev = ca->dev_idx,
+					 .data_type = data_type);
 }
 
 int bch2_alloc_key_to_dev_counters(struct btree_trans *trans, struct bch_dev *ca,
···
 
 	struct bch_dev *ca = bch2_dev_bucket_tryget(c, new.k->p);
 	if (!ca)
-		return -EIO;
+		return -BCH_ERR_trigger_alloc;
 
 	struct bch_alloc_v4 old_a_convert;
 	const struct bch_alloc_v4 *old_a = bch2_alloc_to_v4(old, &old_a_convert);
···
 		if (data_type_is_empty(new_a->data_type) &&
 		    BCH_ALLOC_V4_NEED_INC_GEN(new_a) &&
 		    !bch2_bucket_is_open_safe(c, new.k->p.inode, new.k->p.offset)) {
+			if (new_a->oldest_gen == new_a->gen &&
+			    !bch2_bucket_sectors_total(*new_a))
+				new_a->oldest_gen++;
 			new_a->gen++;
 			SET_BCH_ALLOC_V4_NEED_INC_GEN(new_a, false);
 			alloc_data_type_set(new_a, new_a->data_type);
···
 		    !new_a->io_time[READ])
 			new_a->io_time[READ] = bch2_current_io_time(c, READ);
 
-		u64 old_lru = alloc_lru_idx_read(*old_a);
-		u64 new_lru = alloc_lru_idx_read(*new_a);
-		if (old_lru != new_lru) {
-			ret = bch2_lru_change(trans, new.k->p.inode,
-					      bucket_to_u64(new.k->p),
-					      old_lru, new_lru);
-			if (ret)
-				goto err;
-		}
+		ret = bch2_lru_change(trans, new.k->p.inode,
+				      bucket_to_u64(new.k->p),
+				      alloc_lru_idx_read(*old_a),
+				      alloc_lru_idx_read(*new_a));
+		if (ret)
+			goto err;
 
-		old_lru = alloc_lru_idx_fragmentation(*old_a, ca);
-		new_lru = alloc_lru_idx_fragmentation(*new_a, ca);
-		if (old_lru != new_lru) {
-			ret = bch2_lru_change(trans,
-					BCH_LRU_FRAGMENTATION_START,
-					bucket_to_u64(new.k->p),
-					old_lru, new_lru);
-			if (ret)
-				goto err;
-		}
+		ret = bch2_lru_change(trans,
+				      BCH_LRU_BUCKET_FRAGMENTATION,
+				      bucket_to_u64(new.k->p),
+				      alloc_lru_idx_fragmentation(*old_a, ca),
+				      alloc_lru_idx_fragmentation(*new_a, ca));
+		if (ret)
+			goto err;
 
 		if (old_a->gen != new_a->gen) {
 			ret = bch2_bucket_gen_update(trans, new.k->p, new_a->gen);
···
 invalid_bucket:
 	bch2_fs_inconsistent(c, "reference to invalid bucket\n %s",
 			     (bch2_bkey_val_to_text(&buf, c, new.s_c), buf.buf));
-	ret = -EIO;
+	ret = -BCH_ERR_trigger_alloc;
 	goto err;
 }
 
···
 
 	u64 lru_idx = alloc_lru_idx_fragmentation(*a, ca);
 	if (lru_idx) {
-		ret = bch2_lru_check_set(trans, BCH_LRU_FRAGMENTATION_START,
+		ret = bch2_lru_check_set(trans, BCH_LRU_BUCKET_FRAGMENTATION,
+					 bucket_to_u64(alloc_k.k->p),
 					 lru_idx, alloc_k, last_flushed);
 		if (ret)
 			goto err;
···
 		a = &a_mut->v;
 	}
 
-	ret = bch2_lru_check_set(trans, alloc_k.k->p.inode, a->io_time[READ],
+	ret = bch2_lru_check_set(trans, alloc_k.k->p.inode,
+				 bucket_to_u64(alloc_k.k->p),
+				 a->io_time[READ],
 				 alloc_k, last_flushed);
 	if (ret)
 		goto err;
···
 		for_each_btree_key_commit(trans, iter, BTREE_ID_alloc,
 				POS_MIN, BTREE_ITER_prefetch, k,
 				NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
-			bch2_check_alloc_to_lru_ref(trans, &iter, &last_flushed)));
+			bch2_check_alloc_to_lru_ref(trans, &iter, &last_flushed))) ?:
+		bch2_check_stripe_to_lru_refs(c);
 
 	bch2_bkey_buf_exit(&last_flushed, c);
 	bch_err_fn(c, ret);
···
 	u64		need_journal_commit;
 	u64		discarded;
 };
+
+/*
+ * This is needed because discard is both a filesystem option and a device
+ * option, and mount options are supposed to apply to that mount and not be
+ * persisted, i.e. if it's set as a mount option we can't propagate it to the
+ * device.
+ */
+static inline bool discard_opt_enabled(struct bch_fs *c, struct bch_dev *ca)
+{
+	return test_bit(BCH_FS_discard_mount_opt_set, &c->flags)
+		? c->opts.discard
+		: ca->mi.discard;
+}
 
 static int bch2_discard_one_bucket(struct btree_trans *trans,
 				   struct bch_dev *ca,
···
 	s->discarded++;
 	*discard_pos_done = iter.pos;
 
-	if (ca->mi.discard && !c->opts.nochanges) {
+	if (discard_opt_enabled(c, ca) && !c->opts.nochanges) {
 		/*
 		 * This works without any other locks because this is the only
 		 * thread that removes items from the need_discard tree
···
 	if (ret)
 		goto out;
 
-	count_event(c, bucket_discard);
+	if (!fastpath)
+		count_event(c, bucket_discard);
+	else
+		count_event(c, bucket_discard_fast);
 out:
 fsck_err:
 	if (discard_locked)
···
 	bch2_write_ref_put(c, BCH_WRITE_REF_discard_fast);
 }
 
+static int invalidate_one_bp(struct btree_trans *trans,
+			     struct bch_dev *ca,
+			     struct bkey_s_c_backpointer bp,
+			     struct bkey_buf *last_flushed)
+{
+	struct btree_iter extent_iter;
+	struct bkey_s_c extent_k =
+		bch2_backpointer_get_key(trans, bp, &extent_iter, 0, last_flushed);
+	int ret = bkey_err(extent_k);
+	if (ret)
+		return ret;
+
+	struct bkey_i *n =
+		bch2_bkey_make_mut(trans, &extent_iter, &extent_k,
+				   BTREE_UPDATE_internal_snapshot_node);
+	ret = PTR_ERR_OR_ZERO(n);
+	if (ret)
+		goto err;
+
+	bch2_bkey_drop_device(bkey_i_to_s(n), ca->dev_idx);
+err:
+	bch2_trans_iter_exit(trans, &extent_iter);
+	return ret;
+}
+
+static int invalidate_one_bucket_by_bps(struct btree_trans *trans,
+					struct bch_dev *ca,
+					struct bpos bucket,
+					u8 gen,
+					struct bkey_buf *last_flushed)
+{
+	struct bpos bp_start	= bucket_pos_to_bp_start(ca,	bucket);
+	struct bpos bp_end	= bucket_pos_to_bp_end(ca,	bucket);
+
+	return for_each_btree_key_max_commit(trans, iter, BTREE_ID_backpointers,
+				      bp_start, bp_end, 0, k,
+				      NULL, NULL,
+				      BCH_WATERMARK_btree|
+				      BCH_TRANS_COMMIT_no_enospc, ({
+		if (k.k->type != KEY_TYPE_backpointer)
+			continue;
+
+		struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k);
+
+		if (bp.v->bucket_gen != gen)
+			continue;
+
+		/* filter out bps with gens that don't match */
+
+		invalidate_one_bp(trans, ca, bp, last_flushed);
+	}));
+}
+
+noinline_for_stack
 static int invalidate_one_bucket(struct btree_trans *trans,
+				 struct bch_dev *ca,
 				 struct btree_iter *lru_iter,
 				 struct bkey_s_c lru_k,
+				 struct bkey_buf *last_flushed,
 				 s64 *nr_to_invalidate)
 {
 	struct bch_fs *c = trans->c;
-	struct bkey_i_alloc_v4 *a = NULL;
 	struct printbuf buf = PRINTBUF;
 	struct bpos bucket = u64_to_bucket(lru_k.k->p.offset);
-	unsigned cached_sectors;
+	struct btree_iter alloc_iter = {};
 	int ret = 0;
 
 	if (*nr_to_invalidate <= 0)
···
 	if (bch2_bucket_is_open_safe(c, bucket.inode, bucket.offset))
 		return 0;
 
-	a = bch2_trans_start_alloc_update(trans, bucket, BTREE_TRIGGER_bucket_invalidate);
-	ret = PTR_ERR_OR_ZERO(a);
+	struct bkey_s_c alloc_k = bch2_bkey_get_iter(trans, &alloc_iter,
+						     BTREE_ID_alloc, bucket,
+						     BTREE_ITER_cached);
+	ret = bkey_err(alloc_k);
 	if (ret)
-		goto out;
+		return ret;
+
+	struct bch_alloc_v4 a_convert;
+	const struct bch_alloc_v4 *a = bch2_alloc_to_v4(alloc_k, &a_convert);
 
 	/* We expect harmless races here due to the btree write buffer: */
-	if (lru_pos_time(lru_iter->pos) != alloc_lru_idx_read(a->v))
+	if (lru_pos_time(lru_iter->pos) != alloc_lru_idx_read(*a))
 		goto out;
 
-	BUG_ON(a->v.data_type != BCH_DATA_cached);
-	BUG_ON(a->v.dirty_sectors);
+	/*
+	 * Impossible since alloc_lru_idx_read() only returns nonzero if the
+	 * bucket is supposed to be on the cached bucket LRU (i.e.
+	 * BCH_DATA_cached)
+	 *
+	 * bch2_lru_validate() also disallows lru keys with lru_pos_time() == 0
+	 */
+	BUG_ON(a->data_type != BCH_DATA_cached);
+	BUG_ON(a->dirty_sectors);
 
-	if (!a->v.cached_sectors)
+	if (!a->cached_sectors)
 		bch_err(c, "invalidating empty bucket, confused");
 
-	cached_sectors = a->v.cached_sectors;
+	unsigned cached_sectors = a->cached_sectors;
+	u8 gen = a->gen;
 
-	SET_BCH_ALLOC_V4_NEED_INC_GEN(&a->v, false);
-	a->v.gen++;
-	a->v.data_type		= 0;
-	a->v.dirty_sectors	= 0;
-	a->v.stripe_sectors	= 0;
-	a->v.cached_sectors	= 0;
-	a->v.io_time[READ]	= bch2_current_io_time(c, READ);
-	a->v.io_time[WRITE]	= bch2_current_io_time(c, WRITE);
-
-	ret = bch2_trans_commit(trans, NULL, NULL,
-				BCH_WATERMARK_btree|
-				BCH_TRANS_COMMIT_no_enospc);
+	ret = invalidate_one_bucket_by_bps(trans, ca, bucket, gen, last_flushed);
 	if (ret)
 		goto out;
 
···
 	--*nr_to_invalidate;
 out:
 fsck_err:
+	bch2_trans_iter_exit(trans, &alloc_iter);
 	printbuf_exit(&buf);
 	return ret;
 }
···
 	struct btree_trans *trans = bch2_trans_get(c);
 	int ret = 0;
 
+	struct bkey_buf last_flushed;
+	bch2_bkey_buf_init(&last_flushed);
+	bkey_init(&last_flushed.k->k);
+
 	ret = bch2_btree_write_buffer_tryflush(trans);
 	if (ret)
 		goto err;
···
 		if (!k.k)
 			break;
 
-		ret = invalidate_one_bucket(trans, &iter, k, &nr_to_invalidate);
+		ret = invalidate_one_bucket(trans, ca, &iter, k, &last_flushed, &nr_to_invalidate);
 restart_err:
 		if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
 			continue;
···
 err:
 	bch2_trans_put(trans);
 	percpu_ref_put(&ca->io_ref);
+	bch2_bkey_buf_exit(&last_flushed, c);
 	bch2_write_ref_put(c, BCH_WRITE_REF_invalidate);
 }
 
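The precedence rule introduced by `discard_opt_enabled()` in the hunks above (an explicitly set mount option overrides the per-device setting, and is never written back to the device) can be exercised in isolation. The following is a minimal userspace sketch, assuming simplified stand-in types: `struct fs_opts`, `struct dev_opts` and `discard_enabled` are hypothetical names for illustration, not the kernel's structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the filesystem-wide and per-device state. */
struct fs_opts {
	bool discard_mount_opt_set;	/* was "discard" given at mount time? */
	bool discard;			/* value of the mount option, if set */
};

struct dev_opts {
	bool discard;			/* persistent per-device setting */
};

/*
 * The mount option wins only when it was explicitly provided; otherwise the
 * device's own persistent setting decides. Nothing is written to dev.
 */
static bool discard_enabled(const struct fs_opts *fs, const struct dev_opts *dev)
{
	return fs->discard_mount_opt_set ? fs->discard : dev->discard;
}
```

With this shape, mounting with discard disabled turns discards off for that mount even on a device whose persistent setting is on, and the device setting is untouched for the next mount.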

+1 -1

fs/bcachefs/alloc_background.h

···
 	if (a.stripe)
 		return data_type == BCH_DATA_parity ? data_type : BCH_DATA_stripe;
 	if (bch2_bucket_sectors_dirty(a))
-		return data_type;
+		return bucket_data_type(data_type);
 	if (a.cached_sectors)
 		return BCH_DATA_cached;
 	if (BCH_ALLOC_V4_NEED_DISCARD(&a))

+7 -24

fs/bcachefs/alloc_foreground.c

···
 
 void bch2_open_bucket_write_error(struct bch_fs *c,
 				  struct open_buckets *obs,
-				  unsigned dev)
+				  unsigned dev, int err)
 {
 	struct open_bucket *ob;
 	unsigned i;
 
 	open_bucket_for_each(c, obs, ob, i)
 		if (ob->dev == dev && ob->ec)
-			bch2_ec_bucket_cancel(c, ob);
+			bch2_ec_bucket_cancel(c, ob, err);
 }
 
 static struct open_bucket *bch2_open_bucket_alloc(struct bch_fs *c)
···
 
 	closure_wake_up(&c->open_buckets_wait);
 	closure_wake_up(&c->freelist_wait);
-}
-
-static inline unsigned open_buckets_reserved(enum bch_watermark watermark)
-{
-	switch (watermark) {
-	case BCH_WATERMARK_interior_updates:
-		return 0;
-	case BCH_WATERMARK_reclaim:
-		return OPEN_BUCKETS_COUNT / 6;
-	case BCH_WATERMARK_btree:
-	case BCH_WATERMARK_btree_copygc:
-		return OPEN_BUCKETS_COUNT / 4;
-	case BCH_WATERMARK_copygc:
-		return OPEN_BUCKETS_COUNT / 3;
-	default:
-		return OPEN_BUCKETS_COUNT / 2;
-	}
 }
 
 static inline bool may_alloc_bucket(struct bch_fs *c,
···
 
 	spin_lock(&c->freelist_lock);
 
-	if (unlikely(c->open_buckets_nr_free <= open_buckets_reserved(watermark))) {
+	if (unlikely(c->open_buckets_nr_free <= bch2_open_buckets_reserved(watermark))) {
 		if (cl)
 			closure_wait(&c->open_buckets_wait, cl);
 
···
 					     struct bch_dev_usage *usage)
 {
 	u64 *v = stripe->next_alloc + ca->dev_idx;
-	u64 free_space = dev_buckets_available(ca, BCH_WATERMARK_normal);
+	u64 free_space = __dev_buckets_available(ca, *usage, BCH_WATERMARK_normal);
 	u64 free_space_inv = free_space
 		? div64_u64(1ULL << 48, free_space)
 		: 1ULL << 48;
···
 
 		struct bch_dev_usage usage;
 		struct open_bucket *ob = bch2_bucket_alloc_trans(trans, ca, watermark, data_type,
-						     cl, flags & BCH_WRITE_ALLOC_NOWAIT, &usage);
+						     cl, flags & BCH_WRITE_alloc_nowait, &usage);
 		if (!IS_ERR(ob))
 			bch2_dev_stripe_increment_inlined(ca, stripe, &usage);
 		bch2_dev_put(ca);
···
 	if (wp->data_type != BCH_DATA_user)
 		have_cache = true;
 
-	if (target && !(flags & BCH_WRITE_ONLY_SPECIFIED_DEVS)) {
+	if (target && !(flags & BCH_WRITE_only_specified_devs)) {
 		ret = open_bucket_add_buckets(trans, &ptrs, wp, devs_have,
 					      target, erasure_code,
 					      nr_replicas, &nr_effective,
···
 	if (cl && bch2_err_matches(ret, BCH_ERR_open_buckets_empty))
 		ret = -BCH_ERR_bucket_alloc_blocked;
 
-	if (cl && !(flags & BCH_WRITE_ALLOC_NOWAIT) &&
+	if (cl && !(flags & BCH_WRITE_alloc_nowait) &&
 	    bch2_err_matches(ret, BCH_ERR_freelist_empty))
 		ret = -BCH_ERR_bucket_alloc_blocked;
 

+18 -1

fs/bcachefs/alloc_foreground.h

···
 	return bch2_dev_have_ref(c, ob->dev);
 }
 
+static inline unsigned bch2_open_buckets_reserved(enum bch_watermark watermark)
+{
+	switch (watermark) {
+	case BCH_WATERMARK_interior_updates:
+		return 0;
+	case BCH_WATERMARK_reclaim:
+		return OPEN_BUCKETS_COUNT / 6;
+	case BCH_WATERMARK_btree:
+	case BCH_WATERMARK_btree_copygc:
+		return OPEN_BUCKETS_COUNT / 4;
+	case BCH_WATERMARK_copygc:
+		return OPEN_BUCKETS_COUNT / 3;
+	default:
+		return OPEN_BUCKETS_COUNT / 2;
+	}
+}
+
 struct open_bucket *bch2_bucket_alloc(struct bch_fs *, struct bch_dev *,
 				      enum bch_watermark, enum bch_data_type,
 				      struct closure *);
···
 }
 
 void bch2_open_bucket_write_error(struct bch_fs *,
-			struct open_buckets *, unsigned);
+			struct open_buckets *, unsigned, int);
 
 void __bch2_open_bucket_put(struct bch_fs *, struct open_bucket *);
 

+2

fs/bcachefs/alloc_types.h

···
 	x(stopped)		\
 	x(waiting_io)		\
 	x(waiting_work)		\
+	x(runnable)		\
 	x(running)
 
 enum write_point_state {
···
 		enum write_point_state	state;
 		u64			last_state_change;
 		u64			time[WRITE_POINT_STATE_NR];
+		u64			last_runtime;
 	} __aligned(SMP_CACHE_BYTES);
 };
 

+58 -95

fs/bcachefs/backpointers.c

··· 11 11 #include "checksum.h" 12 12 #include "disk_accounting.h" 13 13 #include "error.h" 14 + #include "progress.h" 14 15 15 16 #include <linux/mm.h> 16 17 ··· 50 49 } 51 50 52 51 bch2_btree_id_level_to_text(out, bp.v->btree_id, bp.v->level); 52 + prt_str(out, " data_type="); 53 + bch2_prt_data_type(out, bp.v->data_type); 53 54 prt_printf(out, " suboffset=%u len=%u gen=%u pos=", 54 55 (u32) bp.k->p.offset & ~(~0U << MAX_EXTENT_COMPRESS_RATIO_SHIFT), 55 56 bp.v->bucket_len, ··· 247 244 if (unlikely(bp.v->btree_id >= btree_id_nr_alive(c))) 248 245 return bkey_s_c_null; 249 246 250 - if (likely(!bp.v->level)) { 251 - bch2_trans_node_iter_init(trans, iter, 252 - bp.v->btree_id, 253 - bp.v->pos, 254 - 0, 0, 255 - iter_flags); 256 - struct bkey_s_c k = bch2_btree_iter_peek_slot(iter); 257 - if (bkey_err(k)) { 258 - bch2_trans_iter_exit(trans, iter); 259 - return k; 260 - } 261 - 262 - if (k.k && 263 - extent_matches_bp(c, bp.v->btree_id, bp.v->level, k, bp)) 264 - return k; 265 - 247 + bch2_trans_node_iter_init(trans, iter, 248 + bp.v->btree_id, 249 + bp.v->pos, 250 + 0, 251 + bp.v->level, 252 + iter_flags); 253 + struct bkey_s_c k = bch2_btree_iter_peek_slot(iter); 254 + if (bkey_err(k)) { 266 255 bch2_trans_iter_exit(trans, iter); 256 + return k; 257 + } 258 + 259 + if (k.k && 260 + extent_matches_bp(c, bp.v->btree_id, bp.v->level, k, bp)) 261 + return k; 262 + 263 + bch2_trans_iter_exit(trans, iter); 264 + 265 + if (!bp.v->level) { 267 266 int ret = backpointer_target_not_found(trans, bp, k, last_flushed); 268 267 return ret ? 
bkey_s_c_err(ret) : bkey_s_c_null; 269 268 } else { 270 269 struct btree *b = bch2_backpointer_get_node(trans, bp, iter, last_flushed); 270 + if (b == ERR_PTR(-BCH_ERR_backpointer_to_overwritten_btree_node)) 271 + return bkey_s_c_null; 271 272 if (IS_ERR_OR_NULL(b)) 272 273 return ((struct bkey_s_c) { .k = ERR_CAST(b) }); 273 274 ··· 521 514 if (!other_extent.k) 522 515 goto missing; 523 516 517 + rcu_read_lock(); 518 + struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp->k.p.inode); 519 + if (ca) { 520 + struct bkey_ptrs_c other_extent_ptrs = bch2_bkey_ptrs_c(other_extent); 521 + bkey_for_each_ptr(other_extent_ptrs, ptr) 522 + if (ptr->dev == bp->k.p.inode && 523 + dev_ptr_stale_rcu(ca, ptr)) { 524 + ret = drop_dev_and_update(trans, other_bp.v->btree_id, 525 + other_extent, bp->k.p.inode); 526 + if (ret) 527 + goto err; 528 + goto out; 529 + } 530 + } 531 + rcu_read_unlock(); 532 + 524 533 if (bch2_extents_match(orig_k, other_extent)) { 525 534 printbuf_reset(&buf); 526 535 prt_printf(&buf, "duplicate versions of same extent, deleting smaller\n "); ··· 613 590 struct extent_ptr_decoded p; 614 591 615 592 bkey_for_each_ptr_decode(k.k, ptrs, p, entry) { 616 - if (p.ptr.cached) 617 - continue; 618 - 619 593 if (p.ptr.dev == BCH_SB_MEMBER_INVALID) 620 594 continue; 621 595 ··· 620 600 struct bch_dev *ca = bch2_dev_rcu_noerror(c, p.ptr.dev); 621 601 bool check = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_mismatches); 622 602 bool empty = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_empty); 603 + 604 + bool stale = p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr)); 623 605 rcu_read_unlock(); 624 606 625 - if (check || empty) { 607 + if ((check || empty) && !stale) { 626 608 struct bkey_i_backpointer bp; 627 609 bch2_extent_ptr_to_bp(c, btree, level, k, p, entry, &bp); 628 610 ··· 737 715 return ret; 738 716 } 739 717 740 - struct progress_indicator_state { 741 - unsigned long next_print; 742 - u64 nodes_seen; 743 - u64 
nodes_total; 744 - struct btree *last_node; 745 - }; 746 - 747 - static inline void progress_init(struct progress_indicator_state *s, 748 - struct bch_fs *c, 749 - u64 btree_id_mask) 750 - { 751 - memset(s, 0, sizeof(*s)); 752 - 753 - s->next_print = jiffies + HZ * 10; 754 - 755 - for (unsigned i = 0; i < BTREE_ID_NR; i++) { 756 - if (!(btree_id_mask & BIT_ULL(i))) 757 - continue; 758 - 759 - struct disk_accounting_pos acc = { 760 - .type = BCH_DISK_ACCOUNTING_btree, 761 - .btree.id = i, 762 - }; 763 - 764 - u64 v; 765 - bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1); 766 - s->nodes_total += div64_ul(v, btree_sectors(c)); 767 - } 768 - } 769 - 770 - static inline bool progress_update_p(struct progress_indicator_state *s) 771 - { 772 - bool ret = time_after_eq(jiffies, s->next_print); 773 - 774 - if (ret) 775 - s->next_print = jiffies + HZ * 10; 776 - return ret; 777 - } 778 - 779 - static void progress_update_iter(struct btree_trans *trans, 780 - struct progress_indicator_state *s, 781 - struct btree_iter *iter, 782 - const char *msg) 783 - { 784 - struct bch_fs *c = trans->c; 785 - struct btree *b = path_l(btree_iter_path(trans, iter))->b; 786 - 787 - s->nodes_seen += b != s->last_node; 788 - s->last_node = b; 789 - 790 - if (progress_update_p(s)) { 791 - struct printbuf buf = PRINTBUF; 792 - unsigned percent = s->nodes_total 793 - ? 
div64_u64(s->nodes_seen * 100, s->nodes_total) 794 - : 0; 795 - 796 - prt_printf(&buf, "%s: %d%%, done %llu/%llu nodes, at ", 797 - msg, percent, s->nodes_seen, s->nodes_total); 798 - bch2_bbpos_to_text(&buf, BBPOS(iter->btree_id, iter->pos)); 799 - 800 - bch_info(c, "%s", buf.buf); 801 - printbuf_exit(&buf); 802 - } 803 - } 804 - 805 718 static int bch2_check_extents_to_backpointers_pass(struct btree_trans *trans, 806 719 struct extents_to_bp_state *s) 807 720 { ··· 744 787 struct progress_indicator_state progress; 745 788 int ret = 0; 746 789 747 - progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_extents)|BIT_ULL(BTREE_ID_reflink)); 790 + bch2_progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_extents)|BIT_ULL(BTREE_ID_reflink)); 748 791 749 792 for (enum btree_id btree_id = 0; 750 793 btree_id < btree_id_nr_alive(c); ··· 763 806 BTREE_ITER_prefetch); 764 807 765 808 ret = for_each_btree_key_continue(trans, iter, 0, k, ({ 766 - progress_update_iter(trans, &progress, &iter, "extents_to_backpointers"); 809 + bch2_progress_update_iter(trans, &progress, &iter, "extents_to_backpointers"); 767 810 check_extent_to_backpointers(trans, s, btree_id, level, k) ?: 768 811 bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc); 769 812 })); ··· 784 827 ALLOC_SECTORS_NR 785 828 }; 786 829 787 - static enum alloc_sector_counter data_type_to_alloc_counter(enum bch_data_type t) 830 + static int data_type_to_alloc_counter(enum bch_data_type t) 788 831 { 789 832 switch (t) { 790 833 case BCH_DATA_btree: ··· 793 836 case BCH_DATA_cached: 794 837 return ALLOC_cached; 795 838 case BCH_DATA_stripe: 839 + case BCH_DATA_parity: 796 840 return ALLOC_stripe; 797 841 default: 798 - BUG(); 842 + return -1; 799 843 } 800 844 } 801 845 ··· 847 889 if (bp.v->bucket_gen != a->gen) 848 890 continue; 849 891 850 - sectors[data_type_to_alloc_counter(bp.v->data_type)] += bp.v->bucket_len; 892 + int alloc_counter = data_type_to_alloc_counter(bp.v->data_type); 893 + if (alloc_counter < 
0) 894 + continue; 895 + 896 + sectors[alloc_counter] += bp.v->bucket_len; 851 897 }; 852 898 bch2_trans_iter_exit(trans, &iter); 853 899 if (ret) ··· 863 901 goto err; 864 902 } 865 903 866 - /* Cached pointers don't have backpointers: */ 867 - 868 904 if (sectors[ALLOC_dirty] != a->dirty_sectors || 905 + sectors[ALLOC_cached] != a->cached_sectors || 869 906 sectors[ALLOC_stripe] != a->stripe_sectors) { 870 907 if (c->sb.version_upgrade_complete >= bcachefs_metadata_version_backpointer_bucket_gen) { 871 908 ret = bch2_backpointers_maybe_flush(trans, alloc_k, last_flushed); ··· 873 912 } 874 913 875 914 if (sectors[ALLOC_dirty] > a->dirty_sectors || 915 + sectors[ALLOC_cached] > a->cached_sectors || 876 916 sectors[ALLOC_stripe] > a->stripe_sectors) { 877 917 ret = check_bucket_backpointers_to_extents(trans, ca, alloc_k.k->p) ?: 878 918 -BCH_ERR_transaction_restart_nested; ··· 881 919 } 882 920 883 921 if (!sectors[ALLOC_dirty] && 884 - !sectors[ALLOC_stripe]) 922 + !sectors[ALLOC_stripe] && 923 + !sectors[ALLOC_cached]) 885 924 __set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_empty); 886 925 else 887 926 __set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_mismatches); ··· 1169 1206 1170 1207 bch2_bkey_buf_init(&last_flushed); 1171 1208 bkey_init(&last_flushed.k->k); 1172 - progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_backpointers)); 1209 + bch2_progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_backpointers)); 1173 1210 1174 1211 int ret = for_each_btree_key(trans, iter, BTREE_ID_backpointers, 1175 1212 POS_MIN, BTREE_ITER_prefetch, k, ({ 1176 - progress_update_iter(trans, &progress, &iter, "backpointers_to_extents"); 1213 + bch2_progress_update_iter(trans, &progress, &iter, "backpointers_to_extents"); 1177 1214 check_one_backpointer(trans, start, end, k, &last_flushed); 1178 1215 })); 1179 1216
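Review note: the `data_type_to_alloc_counter()` hunk above replaces `BUG()` with a `-1` return that the caller skips. The pattern can be sketched in userspace as follows. This is a minimal sketch, not the kernel code: the enums, `struct backpointer`, and `sum_sectors()` are hypothetical stand-ins, and the btree/user mapping to the dirty counter is assumed because that part of the hunk is elided.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel enums; values are illustrative only. */
enum data_type {
	DATA_btree, DATA_user, DATA_cached, DATA_stripe, DATA_parity, DATA_journal
};
enum alloc_counter { ALLOC_dirty, ALLOC_cached, ALLOC_stripe, ALLOC_NR };

/* Map a data type to a counter slot, or -1 when the type has no counter. */
static int data_type_to_counter(enum data_type t)
{
	switch (t) {
	case DATA_btree:	/* assumed: elided in the hunk */
	case DATA_user:		/* assumed: elided in the hunk */
		return ALLOC_dirty;
	case DATA_cached:
		return ALLOC_cached;
	case DATA_stripe:
	case DATA_parity:
		return ALLOC_stripe;
	default:
		return -1;
	}
}

/* Hypothetical, simplified backpointer: just a type and a length. */
struct backpointer {
	enum data_type type;
	unsigned len;
};

/* Sum lengths per counter, skipping untracked types instead of crashing. */
static void sum_sectors(const struct backpointer *bps, size_t nr,
			unsigned sectors[ALLOC_NR])
{
	for (size_t i = 0; i < nr; i++) {
		int idx = data_type_to_counter(bps[i].type);

		if (idx < 0)
			continue;
		sectors[idx] += bps[i].len;
	}
}
```

Returning a sentinel keeps an unexpected on-disk data type from taking the whole machine down; the caller decides whether skipping is acceptable.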

+22 -4

fs/bcachefs/backpointers.h

···
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _BCACHEFS_BACKPOINTERS_BACKGROUND_H
-#define _BCACHEFS_BACKPOINTERS_BACKGROUND_H
+#ifndef _BCACHEFS_BACKPOINTERS_H
+#define _BCACHEFS_BACKPOINTERS_H

 #include "btree_cache.h"
 #include "btree_iter.h"
···
 		return BCH_DATA_btree;
 	case KEY_TYPE_extent:
 	case KEY_TYPE_reflink_v:
-		return p.has_ec ? BCH_DATA_stripe : BCH_DATA_user;
+		if (p.has_ec)
+			return BCH_DATA_stripe;
+		if (p.ptr.cached)
+			return BCH_DATA_cached;
+		else
+			return BCH_DATA_user;
 	case KEY_TYPE_stripe: {
 		const struct bch_extent_ptr *ptr = &entry->ptr;
 		struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
···
 					   struct bkey_i_backpointer *bp)
 {
 	bkey_backpointer_init(&bp->k_i);
-	bp->k.p = POS(p.ptr.dev, ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset);
+	bp->k.p.inode = p.ptr.dev;
+
+	if (k.k->type != KEY_TYPE_stripe)
+		bp->k.p.offset = ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset;
+	else {
+		/*
+		 * Put stripe backpointers where they won't collide with the
+		 * extent backpointers within the stripe:
+		 */
+		struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
+		bp->k.p.offset = ((u64) (p.ptr.offset + le16_to_cpu(s.v->sectors)) <<
+				  MAX_EXTENT_COMPRESS_RATIO_SHIFT) - 1;
+	}
+
 	bp->v = (struct bch_backpointer) {
 		.btree_id	= btree_id,
 		.level		= level,
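Review note: the two offset formulas in the hunk above can be worked through with concrete numbers. This is a sketch under stated assumptions: `COMPRESS_RATIO_SHIFT` is set to 10 purely for illustration (the kernel defines `MAX_EXTENT_COMPRESS_RATIO_SHIFT` itself), and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed value for illustration only; see MAX_EXTENT_COMPRESS_RATIO_SHIFT. */
#define COMPRESS_RATIO_SHIFT 10

/* Extent backpointer: device sector scaled up, plus the offset within the crc. */
static uint64_t extent_bp_offset(uint64_t ptr_offset, uint64_t crc_offset)
{
	return (ptr_offset << COMPRESS_RATIO_SHIFT) + crc_offset;
}

/*
 * Stripe backpointer: the last slot before the scaled end of the stripe
 * block, i.e. ((start + sectors) << shift) - 1.
 */
static uint64_t stripe_bp_offset(uint64_t ptr_offset, uint16_t stripe_sectors)
{
	return ((ptr_offset + stripe_sectors) << COMPRESS_RATIO_SHIFT) - 1;
}
```

With a stripe block starting at sector 8 and 16 sectors long, the stripe backpointer lands at `(24 << 10) - 1 = 24575`, while an extent pointing at sector 8 with crc offset 3 lands at `8195`, so the stripe key sorts after the extent keys that begin inside the block.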

+13 -7

fs/bcachefs/bcachefs.h

··· 203 203 #include <linux/types.h> 204 204 #include <linux/workqueue.h> 205 205 #include <linux/zstd.h> 206 + #include <linux/unicode.h> 206 207 207 208 #include "bcachefs_format.h" 208 209 #include "btree_journal_iter_types.h" ··· 445 444 x(btree_node_sort) \ 446 445 x(btree_node_read) \ 447 446 x(btree_node_read_done) \ 447 + x(btree_node_write) \ 448 448 x(btree_interior_update_foreground) \ 449 449 x(btree_interior_update_total) \ 450 450 x(btree_gc) \ ··· 458 456 x(blocked_journal_low_on_space) \ 459 457 x(blocked_journal_low_on_pin) \ 460 458 x(blocked_journal_max_in_flight) \ 459 + x(blocked_journal_max_open) \ 461 460 x(blocked_key_cache_flush) \ 462 461 x(blocked_allocate) \ 463 462 x(blocked_allocate_open_bucket) \ ··· 536 533 */ 537 534 struct bch_member_cpu mi; 538 535 atomic64_t errors[BCH_MEMBER_ERROR_NR]; 536 + unsigned long write_errors_start; 539 537 540 538 __uuid_t uuid; 541 539 char name[BDEVNAME_SIZE]; ··· 627 623 x(topology_error) \ 628 624 x(errors_fixed) \ 629 625 x(errors_not_fixed) \ 630 - x(no_invalid_checks) 626 + x(no_invalid_checks) \ 627 + x(discard_mount_opt_set) \ 631 628 632 629 enum bch_fs_flags { 633 630 #define x(n) BCH_FS_##n, ··· 692 687 x(gc_gens) \ 693 688 x(snapshot_delete_pagecache) \ 694 689 x(sysfs) \ 695 - x(btree_write_buffer) 690 + x(btree_write_buffer) \ 691 + x(btree_node_scrub) 696 692 697 693 enum bch_write_ref { 698 694 #define x(n) BCH_WRITE_REF_##n, ··· 701 695 #undef x 702 696 BCH_WRITE_REF_NR, 703 697 }; 698 + 699 + #define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0) 704 700 705 701 struct bch_fs { 706 702 struct closure cl; ··· 788 780 u64 btrees_lost_data; 789 781 } sb; 790 782 783 + #ifdef CONFIG_UNICODE 784 + struct unicode_map *cf_encoding; 785 + #endif 791 786 792 787 struct bch_sb_handle disk_sb; 793 788 ··· 980 969 mempool_t compress_workspace[BCH_COMPRESSION_OPT_NR]; 981 970 size_t zstd_workspace_size; 982 971 983 - struct crypto_shash *sha256; 984 972 struct crypto_sync_skcipher *chacha20; 
985 973 struct crypto_shash *poly1305; 986 974 ··· 1003 993 wait_queue_head_t copygc_running_wq; 1004 994 1005 995 /* STRIPES: */ 1006 - GENRADIX(struct stripe) stripes; 1007 996 GENRADIX(struct gc_stripe) gc_stripes; 1008 997 1009 998 struct hlist_head ec_stripes_new[32]; 1010 999 spinlock_t ec_stripes_new_lock; 1011 - 1012 - ec_stripes_heap ec_stripes_heap; 1013 - struct mutex ec_stripes_heap_lock; 1014 1000 1015 1001 /* ERASURE CODING */ 1016 1002 struct list_head ec_stripe_head_list;

+13 -3

fs/bcachefs/bcachefs_format.h

··· 686 686 x(inode_depth, BCH_VERSION(1, 17)) \ 687 687 x(persistent_inode_cursors, BCH_VERSION(1, 18)) \ 688 688 x(autofix_errors, BCH_VERSION(1, 19)) \ 689 - x(directory_size, BCH_VERSION(1, 20)) 689 + x(directory_size, BCH_VERSION(1, 20)) \ 690 + x(cached_backpointers, BCH_VERSION(1, 21)) \ 691 + x(stripe_backpointers, BCH_VERSION(1, 22)) \ 692 + x(stripe_lru, BCH_VERSION(1, 23)) \ 693 + x(casefolding, BCH_VERSION(1, 24)) \ 694 + x(extent_flags, BCH_VERSION(1, 25)) 690 695 691 696 enum bcachefs_metadata_version { 692 697 bcachefs_metadata_version_min = 9, ··· 842 837 LE64_BITMASK(BCH_SB_INODES_USE_KEY_CACHE,struct bch_sb, flags[3], 29, 30); 843 838 LE64_BITMASK(BCH_SB_JOURNAL_FLUSH_DELAY,struct bch_sb, flags[3], 30, 62); 844 839 LE64_BITMASK(BCH_SB_JOURNAL_FLUSH_DISABLED,struct bch_sb, flags[3], 62, 63); 840 + /* one free bit */ 845 841 LE64_BITMASK(BCH_SB_JOURNAL_RECLAIM_DELAY,struct bch_sb, flags[4], 0, 32); 846 842 LE64_BITMASK(BCH_SB_JOURNAL_TRANSACTION_NAMES,struct bch_sb, flags[4], 32, 33); 847 843 LE64_BITMASK(BCH_SB_NOCOW, struct bch_sb, flags[4], 33, 34); ··· 861 855 LE64_BITMASK(BCH_SB_VERSION_INCOMPAT_ALLOWED, 862 856 struct bch_sb, flags[5], 48, 64); 863 857 LE64_BITMASK(BCH_SB_SHARD_INUMS_NBITS, struct bch_sb, flags[6], 0, 4); 858 + LE64_BITMASK(BCH_SB_WRITE_ERROR_TIMEOUT,struct bch_sb, flags[6], 4, 14); 859 + LE64_BITMASK(BCH_SB_CSUM_ERR_RETRY_NR, struct bch_sb, flags[6], 14, 20); 864 860 865 861 static inline __u64 BCH_SB_COMPRESSION_TYPE(const struct bch_sb *sb) 866 862 { ··· 916 908 x(journal_no_flush, 16) \ 917 909 x(alloc_v2, 17) \ 918 910 x(extents_across_btree_nodes, 18) \ 919 - x(incompat_version_field, 19) 911 + x(incompat_version_field, 19) \ 912 + x(casefolding, 20) 920 913 921 914 #define BCH_SB_FEATURES_ALWAYS \ 922 915 (BIT_ULL(BCH_FEATURE_new_extent_overwrite)| \ ··· 931 922 BIT_ULL(BCH_FEATURE_new_siphash)| \ 932 923 BIT_ULL(BCH_FEATURE_btree_ptr_v2)| \ 933 924 BIT_ULL(BCH_FEATURE_new_varint)| \ 934 - 
BIT_ULL(BCH_FEATURE_journal_no_flush)) 925 + BIT_ULL(BCH_FEATURE_journal_no_flush)| \ 926 + BIT_ULL(BCH_FEATURE_incompat_version_field)) 935 927 936 928 enum bch_sb_feature { 937 929 #define x(f, n) BCH_FEATURE_##f,

+28 -1

fs/bcachefs/bcachefs_ioctl.h

··· 87 87 #define BCH_IOCTL_FSCK_OFFLINE _IOW(0xbc, 19, struct bch_ioctl_fsck_offline) 88 88 #define BCH_IOCTL_FSCK_ONLINE _IOW(0xbc, 20, struct bch_ioctl_fsck_online) 89 89 #define BCH_IOCTL_QUERY_ACCOUNTING _IOW(0xbc, 21, struct bch_ioctl_query_accounting) 90 + #define BCH_IOCTL_QUERY_COUNTERS _IOW(0xbc, 21, struct bch_ioctl_query_counters) 90 91 91 92 /* ioctl below act on a particular file, not the filesystem as a whole: */ 92 93 ··· 216 215 union { 217 216 struct { 218 217 __u32 dev; 218 + __u32 data_types; 219 + } scrub; 220 + struct { 221 + __u32 dev; 219 222 __u32 pad; 220 223 } migrate; 221 224 struct { ··· 234 229 BCH_DATA_EVENT_NR = 1, 235 230 }; 236 231 232 + enum data_progress_data_type_special { 233 + DATA_PROGRESS_DATA_TYPE_phys = 254, 234 + DATA_PROGRESS_DATA_TYPE_done = 255, 235 + }; 236 + 237 237 struct bch_ioctl_data_progress { 238 238 __u8 data_type; 239 239 __u8 btree_id; ··· 247 237 248 238 __u64 sectors_done; 249 239 __u64 sectors_total; 240 + __u64 sectors_error_corrected; 241 + __u64 sectors_error_uncorrected; 250 242 } __packed __aligned(8); 243 + 244 + enum bch_ioctl_data_event_ret { 245 + BCH_IOCTL_DATA_EVENT_RET_done = 1, 246 + BCH_IOCTL_DATA_EVENT_RET_device_offline = 2, 247 + }; 251 248 252 249 struct bch_ioctl_data_event { 253 250 __u8 type; 254 - __u8 pad[7]; 251 + __u8 ret; 252 + __u8 pad[6]; 255 253 union { 256 254 struct bch_ioctl_data_progress p; 257 255 __u64 pad2[15]; ··· 459 441 __u32 accounting_types_mask; /* input parameter */ 460 442 461 443 struct bkey_i_accounting accounting[]; 444 + }; 445 + 446 + #define BCH_IOCTL_QUERY_COUNTERS_MOUNT (1 << 0) 447 + 448 + struct bch_ioctl_query_counters { 449 + __u16 nr; 450 + __u16 flags; 451 + __u32 pad; 452 + __u64 d[]; 462 453 }; 463 454 464 455 #endif /* _BCACHEFS_IOCTL_H */

+1

fs/bcachefs/btree_cache.c

···
 			btree_node_write_in_flight(b));

 		btree_node_data_free(bc, b);
+		cond_resched();
 	}

 	BUG_ON(!bch2_journal_error(&c->journal) &&

+12 -6

fs/bcachefs/btree_gc.c

···
 #include "journal.h"
 #include "keylist.h"
 #include "move.h"
+#include "progress.h"
 #include "recovery_passes.h"
 #include "reflink.h"
 #include "recovery.h"
···
 	return ret;
 }

-static int bch2_gc_btree(struct btree_trans *trans, enum btree_id btree, bool initial)
+static int bch2_gc_btree(struct btree_trans *trans,
+			 struct progress_indicator_state *progress,
+			 enum btree_id btree, bool initial)
 {
 	struct bch_fs *c = trans->c;
 	unsigned target_depth = btree_node_type_has_triggers(__btree_node_type(0, btree)) ? 0 : 1;
···
 					  BTREE_ITER_prefetch);

 		ret = for_each_btree_key_continue(trans, iter, 0, k, ({
+			bch2_progress_update_iter(trans, progress, &iter, "check_allocations");
 			gc_pos_set(c, gc_pos_btree(btree, level, k.k->p));
 			bch2_gc_mark_key(trans, btree, level, &prev, &iter, k, initial);
 		}));
···
 static int bch2_gc_btrees(struct bch_fs *c)
 {
 	struct btree_trans *trans = bch2_trans_get(c);
-	enum btree_id ids[BTREE_ID_NR];
 	struct printbuf buf = PRINTBUF;
-	unsigned i;
 	int ret = 0;

-	for (i = 0; i < BTREE_ID_NR; i++)
+	struct progress_indicator_state progress;
+	bch2_progress_init(&progress, c, ~0ULL);
+
+	enum btree_id ids[BTREE_ID_NR];
+	for (unsigned i = 0; i < BTREE_ID_NR; i++)
 		ids[i] = i;
 	bubble_sort(ids, BTREE_ID_NR, btree_id_gc_phase_cmp);

-	for (i = 0; i < btree_id_nr_alive(c) && !ret; i++) {
+	for (unsigned i = 0; i < btree_id_nr_alive(c) && !ret; i++) {
 		unsigned btree = i < BTREE_ID_NR ? ids[i] : i;

 		if (IS_ERR_OR_NULL(bch2_btree_id_root(c, btree)->b))
 			continue;

-		ret = bch2_gc_btree(trans, btree, true);
+		ret = bch2_gc_btree(trans, &progress, btree, true);
 	}

 	printbuf_exit(&buf);

+233 -26

fs/bcachefs/btree_io.c

··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 3 3 #include "bcachefs.h" 4 + #include "bkey_buf.h" 4 5 #include "bkey_methods.h" 5 6 #include "bkey_sort.h" 6 7 #include "btree_cache.h" ··· 1329 1328 bch_info(c, "retrying read"); 1330 1329 ca = bch2_dev_get_ioref(c, rb->pick.ptr.dev, READ); 1331 1330 rb->have_ioref = ca != NULL; 1331 + rb->start_time = local_clock(); 1332 1332 bio_reset(bio, NULL, REQ_OP_READ|REQ_SYNC|REQ_META); 1333 1333 bio->bi_iter.bi_sector = rb->pick.ptr.offset; 1334 1334 bio->bi_iter.bi_size = btree_buf_bytes(b); ··· 1340 1338 } else { 1341 1339 bio->bi_status = BLK_STS_REMOVED; 1342 1340 } 1341 + 1342 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, 1343 + rb->start_time, !bio->bi_status); 1343 1344 start: 1344 1345 printbuf_reset(&buf); 1345 1346 bch2_btree_pos_to_text(&buf, c, b); 1346 - bch2_dev_io_err_on(ca && bio->bi_status, ca, BCH_MEMBER_ERROR_read, 1347 - "btree read error %s for %s", 1348 - bch2_blk_status_to_str(bio->bi_status), buf.buf); 1347 + 1348 + if (ca && bio->bi_status) 1349 + bch_err_dev_ratelimited(ca, 1350 + "btree read error %s for %s", 1351 + bch2_blk_status_to_str(bio->bi_status), buf.buf); 1349 1352 if (rb->have_ioref) 1350 1353 percpu_ref_put(&ca->io_ref); 1351 1354 rb->have_ioref = false; 1352 1355 1353 - bch2_mark_io_failure(&failed, &rb->pick); 1356 + bch2_mark_io_failure(&failed, &rb->pick, false); 1354 1357 1355 1358 can_retry = bch2_bkey_pick_read_device(c, 1356 1359 bkey_i_to_s_c(&b->key), 1357 - &failed, &rb->pick) > 0; 1360 + &failed, &rb->pick, -1) > 0; 1358 1361 1359 1362 if (!bio->bi_status && 1360 1363 !bch2_btree_node_read_done(c, ca, b, can_retry, &saw_error)) { ··· 1407 1400 struct btree_read_bio *rb = 1408 1401 container_of(bio, struct btree_read_bio, bio); 1409 1402 struct bch_fs *c = rb->c; 1403 + struct bch_dev *ca = rb->have_ioref 1404 + ? 
bch2_dev_have_ref(c, rb->pick.ptr.dev) : NULL; 1410 1405 1411 - if (rb->have_ioref) { 1412 - struct bch_dev *ca = bch2_dev_have_ref(c, rb->pick.ptr.dev); 1413 - 1414 - bch2_latency_acct(ca, rb->start_time, READ); 1415 - } 1406 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, 1407 + rb->start_time, !bio->bi_status); 1416 1408 1417 1409 queue_work(c->btree_read_complete_wq, &rb->work); 1418 1410 } ··· 1703 1697 return; 1704 1698 1705 1699 ret = bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key), 1706 - NULL, &pick); 1700 + NULL, &pick, -1); 1707 1701 1708 1702 if (ret <= 0) { 1709 1703 struct printbuf buf = PRINTBUF; ··· 1817 1811 return bch2_trans_run(c, __bch2_btree_root_read(trans, id, k, level)); 1818 1812 } 1819 1813 1814 + struct btree_node_scrub { 1815 + struct bch_fs *c; 1816 + struct bch_dev *ca; 1817 + void *buf; 1818 + bool used_mempool; 1819 + unsigned written; 1820 + 1821 + enum btree_id btree; 1822 + unsigned level; 1823 + struct bkey_buf key; 1824 + __le64 seq; 1825 + 1826 + struct work_struct work; 1827 + struct bio bio; 1828 + }; 1829 + 1830 + static bool btree_node_scrub_check(struct bch_fs *c, struct btree_node *data, unsigned ptr_written, 1831 + struct printbuf *err) 1832 + { 1833 + unsigned written = 0; 1834 + 1835 + if (le64_to_cpu(data->magic) != bset_magic(c)) { 1836 + prt_printf(err, "bad magic: want %llx, got %llx", 1837 + bset_magic(c), le64_to_cpu(data->magic)); 1838 + return false; 1839 + } 1840 + 1841 + while (written < (ptr_written ?: btree_sectors(c))) { 1842 + struct btree_node_entry *bne; 1843 + struct bset *i; 1844 + bool first = !written; 1845 + 1846 + if (first) { 1847 + bne = NULL; 1848 + i = &data->keys; 1849 + } else { 1850 + bne = (void *) data + (written << 9); 1851 + i = &bne->keys; 1852 + 1853 + if (!ptr_written && i->seq != data->keys.seq) 1854 + break; 1855 + } 1856 + 1857 + struct nonce nonce = btree_nonce(i, written << 9); 1858 + bool good_csum_type = bch2_checksum_type_valid(c, BSET_CSUM_TYPE(i)); 1859 + 1860 
+ if (first) { 1861 + if (good_csum_type) { 1862 + struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, data); 1863 + if (bch2_crc_cmp(data->csum, csum)) { 1864 + bch2_csum_err_msg(err, BSET_CSUM_TYPE(i), data->csum, csum); 1865 + return false; 1866 + } 1867 + } 1868 + 1869 + written += vstruct_sectors(data, c->block_bits); 1870 + } else { 1871 + if (good_csum_type) { 1872 + struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, bne); 1873 + if (bch2_crc_cmp(bne->csum, csum)) { 1874 + bch2_csum_err_msg(err, BSET_CSUM_TYPE(i), bne->csum, csum); 1875 + return false; 1876 + } 1877 + } 1878 + 1879 + written += vstruct_sectors(bne, c->block_bits); 1880 + } 1881 + } 1882 + 1883 + return true; 1884 + } 1885 + 1886 + static void btree_node_scrub_work(struct work_struct *work) 1887 + { 1888 + struct btree_node_scrub *scrub = container_of(work, struct btree_node_scrub, work); 1889 + struct bch_fs *c = scrub->c; 1890 + struct printbuf err = PRINTBUF; 1891 + 1892 + __bch2_btree_pos_to_text(&err, c, scrub->btree, scrub->level, 1893 + bkey_i_to_s_c(scrub->key.k)); 1894 + prt_newline(&err); 1895 + 1896 + if (!btree_node_scrub_check(c, scrub->buf, scrub->written, &err)) { 1897 + struct btree_trans *trans = bch2_trans_get(c); 1898 + 1899 + struct btree_iter iter; 1900 + bch2_trans_node_iter_init(trans, &iter, scrub->btree, 1901 + scrub->key.k->k.p, 0, scrub->level - 1, 0); 1902 + 1903 + struct btree *b; 1904 + int ret = lockrestart_do(trans, PTR_ERR_OR_ZERO(b = bch2_btree_iter_peek_node(&iter))); 1905 + if (ret) 1906 + goto err; 1907 + 1908 + if (bkey_i_to_btree_ptr_v2(&b->key)->v.seq == scrub->seq) { 1909 + bch_err(c, "error validating btree node during scrub on %s at btree %s", 1910 + scrub->ca->name, err.buf); 1911 + 1912 + ret = bch2_btree_node_rewrite(trans, &iter, b, 0); 1913 + } 1914 + err: 1915 + bch2_trans_iter_exit(trans, &iter); 1916 + bch2_trans_begin(trans); 1917 + bch2_trans_put(trans); 1918 + } 1919 + 1920 + printbuf_exit(&err); 1921 + 
bch2_bkey_buf_exit(&scrub->key, c);; 1922 + btree_bounce_free(c, c->opts.btree_node_size, scrub->used_mempool, scrub->buf); 1923 + percpu_ref_put(&scrub->ca->io_ref); 1924 + kfree(scrub); 1925 + bch2_write_ref_put(c, BCH_WRITE_REF_btree_node_scrub); 1926 + } 1927 + 1928 + static void btree_node_scrub_endio(struct bio *bio) 1929 + { 1930 + struct btree_node_scrub *scrub = container_of(bio, struct btree_node_scrub, bio); 1931 + 1932 + queue_work(scrub->c->btree_read_complete_wq, &scrub->work); 1933 + } 1934 + 1935 + int bch2_btree_node_scrub(struct btree_trans *trans, 1936 + enum btree_id btree, unsigned level, 1937 + struct bkey_s_c k, unsigned dev) 1938 + { 1939 + if (k.k->type != KEY_TYPE_btree_ptr_v2) 1940 + return 0; 1941 + 1942 + struct bch_fs *c = trans->c; 1943 + 1944 + if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_node_scrub)) 1945 + return -BCH_ERR_erofs_no_writes; 1946 + 1947 + struct extent_ptr_decoded pick; 1948 + int ret = bch2_bkey_pick_read_device(c, k, NULL, &pick, dev); 1949 + if (ret <= 0) 1950 + goto err; 1951 + 1952 + struct bch_dev *ca = bch2_dev_get_ioref(c, pick.ptr.dev, READ); 1953 + if (!ca) { 1954 + ret = -BCH_ERR_device_offline; 1955 + goto err; 1956 + } 1957 + 1958 + bool used_mempool = false; 1959 + void *buf = btree_bounce_alloc(c, c->opts.btree_node_size, &used_mempool); 1960 + 1961 + unsigned vecs = buf_pages(buf, c->opts.btree_node_size); 1962 + 1963 + struct btree_node_scrub *scrub = 1964 + kzalloc(sizeof(*scrub) + sizeof(struct bio_vec) * vecs, GFP_KERNEL); 1965 + if (!scrub) { 1966 + ret = -ENOMEM; 1967 + goto err_free; 1968 + } 1969 + 1970 + scrub->c = c; 1971 + scrub->ca = ca; 1972 + scrub->buf = buf; 1973 + scrub->used_mempool = used_mempool; 1974 + scrub->written = btree_ptr_sectors_written(k); 1975 + 1976 + scrub->btree = btree; 1977 + scrub->level = level; 1978 + bch2_bkey_buf_init(&scrub->key); 1979 + bch2_bkey_buf_reassemble(&scrub->key, c, k); 1980 + scrub->seq = bkey_s_c_to_btree_ptr_v2(k).v->seq; 1981 + 1982 + 
INIT_WORK(&scrub->work, btree_node_scrub_work); 1983 + 1984 + bio_init(&scrub->bio, ca->disk_sb.bdev, scrub->bio.bi_inline_vecs, vecs, REQ_OP_READ); 1985 + bch2_bio_map(&scrub->bio, scrub->buf, c->opts.btree_node_size); 1986 + scrub->bio.bi_iter.bi_sector = pick.ptr.offset; 1987 + scrub->bio.bi_end_io = btree_node_scrub_endio; 1988 + submit_bio(&scrub->bio); 1989 + return 0; 1990 + err_free: 1991 + btree_bounce_free(c, c->opts.btree_node_size, used_mempool, buf); 1992 + percpu_ref_put(&ca->io_ref); 1993 + err: 1994 + bch2_write_ref_put(c, BCH_WRITE_REF_btree_node_scrub); 1995 + return ret; 1996 + } 1997 + 1820 1998 static void bch2_btree_complete_write(struct bch_fs *c, struct btree *b, 1821 1999 struct btree_write *w) 1822 2000 { ··· 2021 1831 bch2_journal_pin_drop(&c->journal, &w->journal); 2022 1832 } 2023 1833 2024 - static void __btree_node_write_done(struct bch_fs *c, struct btree *b) 1834 + static void __btree_node_write_done(struct bch_fs *c, struct btree *b, u64 start_time) 2025 1835 { 2026 1836 struct btree_write *w = btree_prev_write(b); 2027 1837 unsigned long old, new; 2028 1838 unsigned type = 0; 2029 1839 2030 1840 bch2_btree_complete_write(c, b, w); 1841 + 1842 + if (start_time) 1843 + bch2_time_stats_update(&c->times[BCH_TIME_btree_node_write], start_time); 2031 1844 2032 1845 old = READ_ONCE(b->flags); 2033 1846 do { ··· 2062 1869 wake_up_bit(&b->flags, BTREE_NODE_write_in_flight); 2063 1870 } 2064 1871 2065 - static void btree_node_write_done(struct bch_fs *c, struct btree *b) 1872 + static void btree_node_write_done(struct bch_fs *c, struct btree *b, u64 start_time) 2066 1873 { 2067 1874 struct btree_trans *trans = bch2_trans_get(c); 2068 1875 ··· 2070 1877 2071 1878 /* we don't need transaction context anymore after we got the lock. 
*/ 2072 1879 bch2_trans_put(trans); 2073 - __btree_node_write_done(c, b); 1880 + __btree_node_write_done(c, b, start_time); 2074 1881 six_unlock_read(&b->c.lock); 2075 1882 } 2076 1883 ··· 2080 1887 container_of(work, struct btree_write_bio, work); 2081 1888 struct bch_fs *c = wbio->wbio.c; 2082 1889 struct btree *b = wbio->wbio.bio.bi_private; 1890 + u64 start_time = wbio->start_time; 2083 1891 int ret = 0; 2084 1892 2085 1893 btree_bounce_free(c, ··· 2113 1919 } 2114 1920 out: 2115 1921 bio_put(&wbio->wbio.bio); 2116 - btree_node_write_done(c, b); 1922 + btree_node_write_done(c, b, start_time); 2117 1923 return; 2118 1924 err: 2119 1925 set_btree_node_noevict(b); 2120 - bch2_fs_fatal_err_on(!bch2_err_matches(ret, EROFS), c, 2121 - "writing btree node: %s", bch2_err_str(ret)); 1926 + 1927 + if (!bch2_err_matches(ret, EROFS)) { 1928 + struct printbuf buf = PRINTBUF; 1929 + prt_printf(&buf, "writing btree node: %s\n ", bch2_err_str(ret)); 1930 + bch2_btree_pos_to_text(&buf, c, b); 1931 + bch2_fs_fatal_error(c, "%s", buf.buf); 1932 + printbuf_exit(&buf); 1933 + } 2122 1934 goto out; 2123 1935 } 2124 1936 ··· 2137 1937 struct bch_fs *c = wbio->c; 2138 1938 struct btree *b = wbio->bio.bi_private; 2139 1939 struct bch_dev *ca = wbio->have_ioref ? 
bch2_dev_have_ref(c, wbio->dev) : NULL; 2140 - unsigned long flags; 2141 1940 2142 - if (wbio->have_ioref) 2143 - bch2_latency_acct(ca, wbio->submit_time, WRITE); 1941 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write, 1942 + wbio->submit_time, !bio->bi_status); 2144 1943 2145 - if (!ca || 2146 - bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write, 2147 - "btree write error: %s", 2148 - bch2_blk_status_to_str(bio->bi_status)) || 2149 - bch2_meta_write_fault("btree")) { 1944 + if (ca && bio->bi_status) { 1945 + struct printbuf buf = PRINTBUF; 1946 + prt_printf(&buf, "btree write error: %s\n ", 1947 + bch2_blk_status_to_str(bio->bi_status)); 1948 + bch2_btree_pos_to_text(&buf, c, b); 1949 + bch_err_dev_ratelimited(ca, "%s", buf.buf); 1950 + printbuf_exit(&buf); 1951 + } 1952 + 1953 + if (bio->bi_status) { 1954 + unsigned long flags; 2150 1955 spin_lock_irqsave(&c->btree_write_error_lock, flags); 2151 1956 bch2_dev_list_add_dev(&orig->failed, wbio->dev); 2152 1957 spin_unlock_irqrestore(&c->btree_write_error_lock, flags); ··· 2228 2023 bool validate_before_checksum = false; 2229 2024 enum btree_write_type type = flags & BTREE_WRITE_TYPE_MASK; 2230 2025 void *data; 2026 + u64 start_time = local_clock(); 2231 2027 int ret; 2232 2028 2233 2029 if (flags & BTREE_WRITE_ALREADY_STARTED) ··· 2437 2231 wbio->data = data; 2438 2232 wbio->data_bytes = bytes; 2439 2233 wbio->sector_offset = b->written; 2234 + wbio->start_time = start_time; 2440 2235 wbio->wbio.c = c; 2441 2236 wbio->wbio.used_mempool = used_mempool; 2442 2237 wbio->wbio.first_btree_write = !b->written; ··· 2465 2258 b->written += sectors_to_write; 2466 2259 nowrite: 2467 2260 btree_bounce_free(c, bytes, used_mempool, data); 2468 - __btree_node_write_done(c, b); 2261 + __btree_node_write_done(c, b, 0); 2469 2262 } 2470 2263 2471 2264 /*

+4

fs/bcachefs/btree_io.h

···
 	void			*data;
 	unsigned		data_bytes;
 	unsigned		sector_offset;
+	u64			start_time;
 	struct bch_write_bio	wbio;
 };

···
 void bch2_btree_node_read(struct btree_trans *, struct btree *, bool);
 int bch2_btree_root_read(struct bch_fs *, enum btree_id,
 			 const struct bkey_i *, unsigned);
+
+int bch2_btree_node_scrub(struct btree_trans *, enum btree_id, unsigned,
+			  struct bkey_s_c, unsigned);

 bool bch2_btree_post_write_cleanup(struct bch_fs *, struct btree *);


-14

fs/bcachefs/btree_iter.c

···
 			bch2_btree_node_iter_peek_all(&l->iter, l->b));
 }

-static inline struct bkey_s_c btree_path_level_peek(struct btree_trans *trans,
-						    struct btree_path *path,
-						    struct btree_path_level *l,
-						    struct bkey *u)
-{
-	struct bkey_s_c k = __btree_iter_unpack(trans->c, l, u,
-			bch2_btree_node_iter_peek(&l->iter, l->b));
-
-	path->pos = k.k ? k.k->p : l->b->key.k.p;
-	trans->paths_sorted = false;
-	bch2_btree_path_verify_level(trans, path, l - path->l);
-	return k;
-}
-
 static inline struct bkey_s_c btree_path_level_prev(struct btree_trans *trans,
 						    struct btree_path *path,
 						    struct btree_path_level *l,

+8 -1

fs/bcachefs/btree_iter.h

···
 }

 __always_inline
-static int btree_trans_restart_ip(struct btree_trans *trans, int err, unsigned long ip)
+static int btree_trans_restart_foreign_task(struct btree_trans *trans, int err, unsigned long ip)
 {
 	BUG_ON(err <= 0);
 	BUG_ON(!bch2_err_matches(-err, BCH_ERR_transaction_restart));

 	trans->restarted = err;
 	trans->last_restarted_ip = ip;
+	return -err;
+}
+
+__always_inline
+static int btree_trans_restart_ip(struct btree_trans *trans, int err, unsigned long ip)
+{
+	btree_trans_restart_foreign_task(trans, err, ip);
 #ifdef CONFIG_BCACHEFS_DEBUG
 	darray_exit(&trans->last_restarted_trace);
 	bch2_save_backtrace(&trans->last_restarted_trace, current, 0, GFP_NOWAIT);

+5 -3

fs/bcachefs/btree_locking.c

···
 	struct trans_waiting_for_lock *i;

 	for (i = g->g; i != g->g + g->nr; i++) {
-		struct task_struct *task = i->trans->locking_wait.task;
+		struct task_struct *task = READ_ONCE(i->trans->locking_wait.task);
 		if (i != g->g)
 			prt_str(out, "<- ");
-		prt_printf(out, "%u ", task ?task->pid : 0);
+		prt_printf(out, "%u ", task ? task->pid : 0);
 	}
 	prt_newline(out);
 }
···
 {
 	if (i == g->g) {
 		trace_would_deadlock(g, i->trans);
-		return btree_trans_restart(i->trans, BCH_ERR_transaction_restart_would_deadlock);
+		return btree_trans_restart_foreign_task(i->trans,
+					BCH_ERR_transaction_restart_would_deadlock,
+					_THIS_IP_);
 	} else {
 		i->trans->lock_must_abort = true;
 		wake_up_process(i->trans->locking_wait.task);

+17 -12

fs/bcachefs/btree_node_scan.c

···
 	bio->bi_iter.bi_sector	= offset;
 	bch2_bio_map(bio, bn, PAGE_SIZE);

+	u64 submit_time = local_clock();
 	submit_bio_wait(bio);
-	if (bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_read,
-			       "IO error in try_read_btree_node() at %llu: %s",
-			       offset, bch2_blk_status_to_str(bio->bi_status)))
+
+	bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, submit_time, !bio->bi_status);
+
+	if (bio->bi_status) {
+		bch_err_dev_ratelimited(ca,
+				"IO error in try_read_btree_node() at %llu: %s",
+				offset, bch2_blk_status_to_str(bio->bi_status));
 		return;
+	}

 	if (le64_to_cpu(bn->magic) != bset_magic(c))
 		return;
···
 err:
 	bio_put(bio);
 	free_page((unsigned long) buf);
-	percpu_ref_get(&ca->io_ref);
+	percpu_ref_put(&ca->io_ref);
 	closure_put(w->cl);
 	kfree(w);
 	return 0;
···
 			continue;

 		struct find_btree_nodes_worker *w = kmalloc(sizeof(*w), GFP_KERNEL);
-		struct task_struct *t;
-
 		if (!w) {
 			percpu_ref_put(&ca->io_ref);
 			ret = -ENOMEM;
 			goto err;
 		}

-		percpu_ref_get(&ca->io_ref);
-		closure_get(&cl);
 		w->cl		= &cl;
 		w->f		= f;
 		w->ca		= ca;

-		t = kthread_run(read_btree_nodes_worker, w, "read_btree_nodes/%s", ca->name);
+		struct task_struct *t = kthread_create(read_btree_nodes_worker, w, "read_btree_nodes/%s", ca->name);
 		ret = PTR_ERR_OR_ZERO(t);
 		if (ret) {
 			percpu_ref_put(&ca->io_ref);
-			closure_put(&cl);
-			f->ret = ret;
-			bch_err(c, "error starting kthread: %i", ret);
+			kfree(w);
+			bch_err_msg(c, ret, "starting kthread");
 			break;
 		}
+
+		closure_get(&cl);
+		percpu_ref_get(&ca->io_ref);
+		wake_up_process(t);
 	}
 err:
 	closure_sync(&cl);

+54 -74

fs/bcachefs/btree_trans_commit.c

··· 164 164 EBUG_ON(bpos_gt(insert->k.p, b->data->max_key)); 165 165 EBUG_ON(insert->k.u64s > bch2_btree_keys_u64s_remaining(b)); 166 166 EBUG_ON(!b->c.level && !bpos_eq(insert->k.p, path->pos)); 167 + kmsan_check_memory(insert, bkey_bytes(&insert->k)); 167 168 168 169 k = bch2_btree_node_iter_peek_all(node_iter, b); 169 170 if (k && bkey_cmp_left_packed(b, k, &insert->k.p)) ··· 337 336 BUG_ON(i->cached != path->cached); 338 337 BUG_ON(i->level != path->level); 339 338 BUG_ON(i->btree_id != path->btree_id); 339 + BUG_ON(i->bkey_type != __btree_node_type(path->level, path->btree_id)); 340 340 EBUG_ON(!i->level && 341 341 btree_type_has_snapshots(i->btree_id) && 342 342 !(i->flags & BTREE_UPDATE_internal_snapshot_node) && ··· 519 517 } 520 518 } 521 519 522 - static int run_btree_triggers(struct btree_trans *trans, enum btree_id btree_id, 523 - unsigned *btree_id_updates_start) 524 - { 525 - bool trans_trigger_run; 526 - 527 - /* 528 - * Running triggers will append more updates to the list of updates as 529 - * we're walking it: 530 - */ 531 - do { 532 - trans_trigger_run = false; 533 - 534 - for (unsigned i = *btree_id_updates_start; 535 - i < trans->nr_updates && trans->updates[i].btree_id <= btree_id; 536 - i++) { 537 - if (trans->updates[i].btree_id < btree_id) { 538 - *btree_id_updates_start = i; 539 - continue; 540 - } 541 - 542 - int ret = run_one_trans_trigger(trans, trans->updates + i); 543 - if (ret < 0) 544 - return ret; 545 - if (ret) 546 - trans_trigger_run = true; 547 - } 548 - } while (trans_trigger_run); 549 - 550 - trans_for_each_update(trans, i) 551 - BUG_ON(!(i->flags & BTREE_TRIGGER_norun) && 552 - i->btree_id == btree_id && 553 - btree_node_type_has_trans_triggers(i->bkey_type) && 554 - (!i->insert_trigger_run || !i->overwrite_trigger_run)); 555 - 556 - return 0; 557 - } 558 - 559 520 static int bch2_trans_commit_run_triggers(struct btree_trans *trans) 560 521 { 561 - unsigned btree_id = 0, btree_id_updates_start = 0; 562 - int ret = 0; 522 + 
unsigned sort_id_start = 0; 563 523 564 - /* 565 - * 566 - * For a given btree, this algorithm runs insert triggers before 567 - * overwrite triggers: this is so that when extents are being moved 568 - * (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop references before 569 - * they are re-added. 570 - */ 571 - for (btree_id = 0; btree_id < BTREE_ID_NR; btree_id++) { 572 - if (btree_id == BTREE_ID_alloc) 573 - continue; 524 + while (sort_id_start < trans->nr_updates) { 525 + unsigned i, sort_id = trans->updates[sort_id_start].sort_order; 526 + bool trans_trigger_run; 574 527 575 - ret = run_btree_triggers(trans, btree_id, &btree_id_updates_start); 576 - if (ret) 577 - return ret; 528 + /* 529 + * For a given btree, this algorithm runs insert triggers before 530 + * overwrite triggers: this is so that when extents are being 531 + * moved (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop 532 + * references before they are re-added. 533 + * 534 + * Running triggers will append more updates to the list of 535 + * updates as we're walking it: 536 + */ 537 + do { 538 + trans_trigger_run = false; 539 + 540 + for (i = sort_id_start; 541 + i < trans->nr_updates && trans->updates[i].sort_order <= sort_id; 542 + i++) { 543 + if (trans->updates[i].sort_order < sort_id) { 544 + sort_id_start = i; 545 + continue; 546 + } 547 + 548 + int ret = run_one_trans_trigger(trans, trans->updates + i); 549 + if (ret < 0) 550 + return ret; 551 + if (ret) 552 + trans_trigger_run = true; 553 + } 554 + } while (trans_trigger_run); 555 + 556 + sort_id_start = i; 578 557 } 579 - 580 - btree_id_updates_start = 0; 581 - ret = run_btree_triggers(trans, BTREE_ID_alloc, &btree_id_updates_start); 582 - if (ret) 583 - return ret; 584 558 585 559 #ifdef CONFIG_BCACHEFS_DEBUG 586 560 trans_for_each_update(trans, i) ··· 881 903 struct bch_fs *c = trans->c; 882 904 enum bch_watermark watermark = flags & BCH_WATERMARK_MASK; 883 905 906 + if (bch2_err_matches(ret, BCH_ERR_journal_res_blocked)) { 907 + /* 
908 + * XXX: this should probably be a separate BTREE_INSERT_NONBLOCK 909 + * flag 910 + */ 911 + if ((flags & BCH_TRANS_COMMIT_journal_reclaim) && 912 + watermark < BCH_WATERMARK_reclaim) { 913 + ret = -BCH_ERR_journal_reclaim_would_deadlock; 914 + goto out; 915 + } 916 + 917 + ret = drop_locks_do(trans, 918 + bch2_trans_journal_res_get(trans, 919 + (flags & BCH_WATERMARK_MASK)| 920 + JOURNAL_RES_GET_CHECK)); 921 + goto out; 922 + } 923 + 884 924 switch (ret) { 885 925 case -BCH_ERR_btree_insert_btree_node_full: 886 926 ret = bch2_btree_split_leaf(trans, i->path, flags); ··· 909 913 case -BCH_ERR_btree_insert_need_mark_replicas: 910 914 ret = drop_locks_do(trans, 911 915 bch2_accounting_update_sb(trans)); 912 - break; 913 - case -BCH_ERR_journal_res_get_blocked: 914 - /* 915 - * XXX: this should probably be a separate BTREE_INSERT_NONBLOCK 916 - * flag 917 - */ 918 - if ((flags & BCH_TRANS_COMMIT_journal_reclaim) && 919 - watermark < BCH_WATERMARK_reclaim) { 920 - ret = -BCH_ERR_journal_reclaim_would_deadlock; 921 - break; 922 - } 923 - 924 - ret = drop_locks_do(trans, 925 - bch2_trans_journal_res_get(trans, 926 - (flags & BCH_WATERMARK_MASK)| 927 - JOURNAL_RES_GET_CHECK)); 928 916 break; 929 917 case -BCH_ERR_btree_insert_need_journal_reclaim: 930 918 bch2_trans_unlock(trans); ··· 930 950 BUG_ON(ret >= 0); 931 951 break; 932 952 } 933 - 953 + out: 934 954 BUG_ON(bch2_err_matches(ret, BCH_ERR_transaction_restart) != !!trans->restarted); 935 955 936 956 bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOSPC) &&

+13

fs/bcachefs/btree_types.h

···

 struct btree_insert_entry {
 	unsigned flags;
+	u8 sort_order;
 	u8 bkey_type;
 	enum btree_id btree_id:8;
 	u8 level:4;
···
 		;

 	return BIT_ULL(btree) & mask;
+}
+
+static inline u8 btree_trigger_order(enum btree_id btree)
+{
+	switch (btree) {
+	case BTREE_ID_alloc:
+		return U8_MAX;
+	case BTREE_ID_stripes:
+		return U8_MAX - 1;
+	default:
+		return btree;
+	}
 }

 struct btree_root {
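The new `sort_order` field replaces `btree_id` as the primary sort key for pending updates, apparently so that alloc triggers still run last (as the removed special case did) with stripes triggers immediately before them. A minimal userspace sketch of that ordering follows; the `sketch_` names and the enum values are hypothetical stand-ins, not the real `enum btree_id` values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for enum btree_id; the numeric values are illustrative only. */
enum sketch_btree_id {
	SKETCH_BTREE_extents = 0,
	SKETCH_BTREE_inodes  = 1,
	SKETCH_BTREE_alloc   = 4,
	SKETCH_BTREE_stripes = 6,
};

/* Same shape as btree_trigger_order(): alloc sorts last, stripes second to last. */
static inline uint8_t sketch_trigger_order(enum sketch_btree_id btree)
{
	switch (btree) {
	case SKETCH_BTREE_alloc:
		return UINT8_MAX;
	case SKETCH_BTREE_stripes:
		return UINT8_MAX - 1;
	default:
		return (uint8_t) btree;
	}
}

/* qsort() comparator: order btree IDs by their trigger order. */
static int sketch_order_cmp(const void *l, const void *r)
{
	uint8_t a = sketch_trigger_order(*(const enum sketch_btree_id *) l);
	uint8_t b = sketch_trigger_order(*(const enum sketch_btree_id *) r);

	return (a > b) - (a < b);
}

/* Sort an array of btree IDs in place into trigger order. */
static void sketch_sort_by_trigger_order(enum sketch_btree_id *ids, size_t nr)
{
	assert(ids || !nr);
	qsort(ids, nr, sizeof(*ids), sketch_order_cmp);
}
```

Sorting `{ alloc, stripes, inodes, extents }` with this comparator yields extents, inodes, stripes, alloc. The real comparator also breaks ties on `cached`, `level` and key position, which this sketch leaves out.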

+4 -1

fs/bcachefs/btree_update.c

···
 static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
 					 const struct btree_insert_entry *r)
 {
-	return   cmp_int(l->btree_id,	r->btree_id) ?:
+	return   cmp_int(l->sort_order,	r->sort_order) ?:
 		 cmp_int(l->cached,	r->cached) ?:
 		 -cmp_int(l->level,	r->level) ?:
 		 bpos_cmp(l->k->k.p,	r->k->k.p);
···

 	n = (struct btree_insert_entry) {
 		.flags		= flags,
+		.sort_order	= btree_trigger_order(path->btree_id),
 		.bkey_type	= __btree_node_type(path->level, path->btree_id),
 		.btree_id	= path->btree_id,
 		.level		= path->level,
···
 int __must_check bch2_trans_update(struct btree_trans *trans, struct btree_iter *iter,
 				   struct bkey_i *k, enum btree_iter_update_trigger_flags flags)
 {
+	kmsan_check_memory(k, bkey_bytes(&k->k));
+
 	btree_path_idx_t path_idx = iter->update_path ?: iter->path;
 	int ret;


+2

fs/bcachefs/btree_update.h

···
 					    enum btree_id btree,
 					    struct bkey_i *k)
 {
+	kmsan_check_memory(k, bkey_bytes(&k->k));
+
 	if (unlikely(!btree_type_uses_write_buffer(btree))) {
 		int ret = bch2_btree_write_buffer_insert_err(trans, btree, k);
 		dump_stack();

+93 -71

fs/bcachefs/btree_update_interior.c

··· 649 649 return 0; 650 650 } 651 651 652 + /* If the node has been reused, we might be reading uninitialized memory - that's fine: */ 653 + static noinline __no_kmsan_checks bool btree_node_seq_matches(struct btree *b, __le64 seq) 654 + { 655 + struct btree_node *b_data = READ_ONCE(b->data); 656 + 657 + return (b_data ? b_data->keys.seq : 0) == seq; 658 + } 659 + 652 660 static void btree_update_nodes_written(struct btree_update *as) 653 661 { 654 662 struct bch_fs *c = as->c; ··· 685 677 * on disk: 686 678 */ 687 679 for (i = 0; i < as->nr_old_nodes; i++) { 688 - __le64 seq; 689 - 690 680 b = as->old_nodes[i]; 691 681 692 - bch2_trans_begin(trans); 693 - btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_read); 694 - seq = b->data ? b->data->keys.seq : 0; 695 - six_unlock_read(&b->c.lock); 696 - bch2_trans_unlock_long(trans); 697 - 698 - if (seq == as->old_nodes_seq[i]) 682 + if (btree_node_seq_matches(b, as->old_nodes_seq[i])) 699 683 wait_on_bit_io(&b->flags, BTREE_NODE_write_in_flight_inner, 700 684 TASK_UNINTERRUPTIBLE); 701 685 } ··· 2126 2126 goto out; 2127 2127 } 2128 2128 2129 + static int get_iter_to_node(struct btree_trans *trans, struct btree_iter *iter, 2130 + struct btree *b) 2131 + { 2132 + bch2_trans_node_iter_init(trans, iter, b->c.btree_id, b->key.k.p, 2133 + BTREE_MAX_DEPTH, b->c.level, 2134 + BTREE_ITER_intent); 2135 + int ret = bch2_btree_iter_traverse(iter); 2136 + if (ret) 2137 + goto err; 2138 + 2139 + /* has node been freed? 
*/ 2140 + if (btree_iter_path(trans, iter)->l[b->c.level].b != b) { 2141 + /* node has been freed: */ 2142 + BUG_ON(!btree_node_dying(b)); 2143 + ret = -BCH_ERR_btree_node_dying; 2144 + goto err; 2145 + } 2146 + 2147 + BUG_ON(!btree_node_hashed(b)); 2148 + return 0; 2149 + err: 2150 + bch2_trans_iter_exit(trans, iter); 2151 + return ret; 2152 + } 2153 + 2129 2154 int bch2_btree_node_rewrite(struct btree_trans *trans, 2130 2155 struct btree_iter *iter, 2131 2156 struct btree *b, ··· 2216 2191 goto out; 2217 2192 } 2218 2193 2194 + static int bch2_btree_node_rewrite_key(struct btree_trans *trans, 2195 + enum btree_id btree, unsigned level, 2196 + struct bkey_i *k, unsigned flags) 2197 + { 2198 + struct btree_iter iter; 2199 + bch2_trans_node_iter_init(trans, &iter, 2200 + btree, k->k.p, 2201 + BTREE_MAX_DEPTH, level, 0); 2202 + struct btree *b = bch2_btree_iter_peek_node(&iter); 2203 + int ret = PTR_ERR_OR_ZERO(b); 2204 + if (ret) 2205 + goto out; 2206 + 2207 + bool found = b && btree_ptr_hash_val(&b->key) == btree_ptr_hash_val(k); 2208 + ret = found 2209 + ? 
bch2_btree_node_rewrite(trans, &iter, b, flags) 2210 + : -ENOENT; 2211 + out: 2212 + bch2_trans_iter_exit(trans, &iter); 2213 + return ret; 2214 + } 2215 + 2216 + int bch2_btree_node_rewrite_pos(struct btree_trans *trans, 2217 + enum btree_id btree, unsigned level, 2218 + struct bpos pos, unsigned flags) 2219 + { 2220 + BUG_ON(!level); 2221 + 2222 + /* Traverse one depth lower to get a pointer to the node itself: */ 2223 + struct btree_iter iter; 2224 + bch2_trans_node_iter_init(trans, &iter, btree, pos, 0, level - 1, 0); 2225 + struct btree *b = bch2_btree_iter_peek_node(&iter); 2226 + int ret = PTR_ERR_OR_ZERO(b); 2227 + if (ret) 2228 + goto err; 2229 + 2230 + ret = bch2_btree_node_rewrite(trans, &iter, b, flags); 2231 + err: 2232 + bch2_trans_iter_exit(trans, &iter); 2233 + return ret; 2234 + } 2235 + 2236 + int bch2_btree_node_rewrite_key_get_iter(struct btree_trans *trans, 2237 + struct btree *b, unsigned flags) 2238 + { 2239 + struct btree_iter iter; 2240 + int ret = get_iter_to_node(trans, &iter, b); 2241 + if (ret) 2242 + return ret == -BCH_ERR_btree_node_dying ? 0 : ret; 2243 + 2244 + ret = bch2_btree_node_rewrite(trans, &iter, b, flags); 2245 + bch2_trans_iter_exit(trans, &iter); 2246 + return ret; 2247 + } 2248 + 2219 2249 struct async_btree_rewrite { 2220 2250 struct bch_fs *c; 2221 2251 struct work_struct work; ··· 2280 2200 struct bkey_buf key; 2281 2201 }; 2282 2202 2283 - static int async_btree_node_rewrite_trans(struct btree_trans *trans, 2284 - struct async_btree_rewrite *a) 2285 - { 2286 - struct btree_iter iter; 2287 - bch2_trans_node_iter_init(trans, &iter, 2288 - a->btree_id, a->key.k->k.p, 2289 - BTREE_MAX_DEPTH, a->level, 0); 2290 - struct btree *b = bch2_btree_iter_peek_node(&iter); 2291 - int ret = PTR_ERR_OR_ZERO(b); 2292 - if (ret) 2293 - goto out; 2294 - 2295 - bool found = b && btree_ptr_hash_val(&b->key) == btree_ptr_hash_val(a->key.k); 2296 - ret = found 2297 - ? 
bch2_btree_node_rewrite(trans, &iter, b, 0) 2298 - : -ENOENT; 2299 - 2300 - #if 0 2301 - /* Tracepoint... */ 2302 - if (!ret || ret == -ENOENT) { 2303 - struct bch_fs *c = trans->c; 2304 - struct printbuf buf = PRINTBUF; 2305 - 2306 - if (!ret) { 2307 - prt_printf(&buf, "rewrite node:\n "); 2308 - bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(a->key.k)); 2309 - } else { 2310 - prt_printf(&buf, "node to rewrite not found:\n want: "); 2311 - bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(a->key.k)); 2312 - prt_printf(&buf, "\n got: "); 2313 - if (b) 2314 - bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key)); 2315 - else 2316 - prt_str(&buf, "(null)"); 2317 - } 2318 - bch_info(c, "%s", buf.buf); 2319 - printbuf_exit(&buf); 2320 - } 2321 - #endif 2322 - out: 2323 - bch2_trans_iter_exit(trans, &iter); 2324 - return ret; 2325 - } 2326 - 2327 2203 static void async_btree_node_rewrite_work(struct work_struct *work) 2328 2204 { 2329 2205 struct async_btree_rewrite *a = 2330 2206 container_of(work, struct async_btree_rewrite, work); 2331 2207 struct bch_fs *c = a->c; 2332 2208 2333 - int ret = bch2_trans_do(c, async_btree_node_rewrite_trans(trans, a)); 2209 + int ret = bch2_trans_do(c, bch2_btree_node_rewrite_key(trans, 2210 + a->btree_id, a->level, a->key.k, 0)); 2334 2211 if (ret != -ENOENT) 2335 2212 bch_err_fn_ratelimited(c, ret); 2336 2213 ··· 2531 2494 unsigned commit_flags, bool skip_triggers) 2532 2495 { 2533 2496 struct btree_iter iter; 2534 - int ret; 2535 - 2536 - bch2_trans_node_iter_init(trans, &iter, b->c.btree_id, b->key.k.p, 2537 - BTREE_MAX_DEPTH, b->c.level, 2538 - BTREE_ITER_intent); 2539 - ret = bch2_btree_iter_traverse(&iter); 2497 + int ret = get_iter_to_node(trans, &iter, b); 2540 2498 if (ret) 2541 - goto out; 2542 - 2543 - /* has node been freed? 
*/ 2544 - if (btree_iter_path(trans, &iter)->l[b->c.level].b != b) { 2545 - /* node has been freed: */ 2546 - BUG_ON(!btree_node_dying(b)); 2547 - goto out; 2548 - } 2549 - 2550 - BUG_ON(!btree_node_hashed(b)); 2499 + return ret == -BCH_ERR_btree_node_dying ? 0 : ret; 2551 2500 2552 2501 bch2_bkey_drop_ptrs(bkey_i_to_s(new_key), ptr, 2553 2502 !bch2_bkey_has_device(bkey_i_to_s(&b->key), ptr->dev)); 2554 2503 2555 2504 ret = bch2_btree_node_update_key(trans, &iter, b, new_key, 2556 2505 commit_flags, skip_triggers); 2557 - out: 2558 2506 bch2_trans_iter_exit(trans, &iter); 2559 2507 return ret; 2560 2508 }

+7

fs/bcachefs/btree_update_interior.h

···

 int bch2_btree_node_rewrite(struct btree_trans *, struct btree_iter *,
 			    struct btree *, unsigned);
+int bch2_btree_node_rewrite_pos(struct btree_trans *,
+				enum btree_id, unsigned,
+				struct bpos, unsigned);
+int bch2_btree_node_rewrite_key_get_iter(struct btree_trans *,
+					 struct btree *, unsigned);
+
 void bch2_btree_node_rewrite_async(struct bch_fs *, struct btree *);
+
 int bch2_btree_node_update_key(struct btree_trans *, struct btree_iter *,
 			       struct btree *, struct bkey_i *,
 			       unsigned, bool);

+29 -51

fs/bcachefs/buckets.c

··· 590 590 if (ret) 591 591 goto err; 592 592 593 - if (!p.ptr.cached) { 594 - ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert); 595 - if (ret) 596 - goto err; 597 - } 593 + ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert); 594 + if (ret) 595 + goto err; 598 596 } 599 597 600 598 if (flags & BTREE_TRIGGER_gc) { ··· 672 674 return -BCH_ERR_ENOMEM_mark_stripe_ptr; 673 675 } 674 676 675 - mutex_lock(&c->ec_stripes_heap_lock); 677 + gc_stripe_lock(m); 676 678 677 679 if (!m || !m->alive) { 678 - mutex_unlock(&c->ec_stripes_heap_lock); 680 + gc_stripe_unlock(m); 679 681 struct printbuf buf = PRINTBUF; 680 682 bch2_bkey_val_to_text(&buf, c, k); 681 683 bch_err_ratelimited(c, "pointer to nonexistent stripe %llu\n while marking %s", ··· 691 693 .type = BCH_DISK_ACCOUNTING_replicas, 692 694 }; 693 695 memcpy(&acc.replicas, &m->r.e, replicas_entry_bytes(&m->r.e)); 694 - mutex_unlock(&c->ec_stripes_heap_lock); 696 + gc_stripe_unlock(m); 695 697 696 698 acc.replicas.data_type = data_type; 697 699 int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1, true); ··· 724 726 .replicas.nr_required = 1, 725 727 }; 726 728 727 - struct disk_accounting_pos acct_compression_key = { 728 - .type = BCH_DISK_ACCOUNTING_compression, 729 - }; 729 + unsigned cur_compression_type = 0; 730 730 u64 compression_acct[3] = { 1, 0, 0 }; 731 731 732 732 bkey_for_each_ptr_decode(k.k, ptrs, p, entry) { ··· 758 762 acc_replicas_key.replicas.nr_required = 0; 759 763 } 760 764 761 - if (acct_compression_key.compression.type && 762 - acct_compression_key.compression.type != p.crc.compression_type) { 765 + if (cur_compression_type && 766 + cur_compression_type != p.crc.compression_type) { 763 767 if (flags & BTREE_TRIGGER_overwrite) 764 768 bch2_u64s_neg(compression_acct, ARRAY_SIZE(compression_acct)); 765 769 766 - ret = bch2_disk_accounting_mod(trans, &acct_compression_key, compression_acct, 767 - ARRAY_SIZE(compression_acct), gc); 770 + ret = bch2_disk_accounting_mod2(trans, gc, 
compression_acct, 771 + compression, cur_compression_type); 768 772 if (ret) 769 773 return ret; 770 774 ··· 773 777 compression_acct[2] = 0; 774 778 } 775 779 776 - acct_compression_key.compression.type = p.crc.compression_type; 780 + cur_compression_type = p.crc.compression_type; 777 781 if (p.crc.compression_type) { 778 782 compression_acct[1] += p.crc.uncompressed_size; 779 783 compression_acct[2] += p.crc.compressed_size; ··· 787 791 } 788 792 789 793 if (acc_replicas_key.replicas.nr_devs && !level && k.k->p.snapshot) { 790 - struct disk_accounting_pos acc_snapshot_key = { 791 - .type = BCH_DISK_ACCOUNTING_snapshot, 792 - .snapshot.id = k.k->p.snapshot, 793 - }; 794 - ret = bch2_disk_accounting_mod(trans, &acc_snapshot_key, replicas_sectors, 1, gc); 794 + ret = bch2_disk_accounting_mod2_nr(trans, gc, replicas_sectors, 1, snapshot, k.k->p.snapshot); 795 795 if (ret) 796 796 return ret; 797 797 } 798 798 799 - if (acct_compression_key.compression.type) { 799 + if (cur_compression_type) { 800 800 if (flags & BTREE_TRIGGER_overwrite) 801 801 bch2_u64s_neg(compression_acct, ARRAY_SIZE(compression_acct)); 802 802 803 - ret = bch2_disk_accounting_mod(trans, &acct_compression_key, compression_acct, 804 - ARRAY_SIZE(compression_acct), gc); 803 + ret = bch2_disk_accounting_mod2(trans, gc, compression_acct, 804 + compression, cur_compression_type); 805 805 if (ret) 806 806 return ret; 807 807 } 808 808 809 809 if (level) { 810 - struct disk_accounting_pos acc_btree_key = { 811 - .type = BCH_DISK_ACCOUNTING_btree, 812 - .btree.id = btree_id, 813 - }; 814 - ret = bch2_disk_accounting_mod(trans, &acc_btree_key, replicas_sectors, 1, gc); 810 + ret = bch2_disk_accounting_mod2_nr(trans, gc, replicas_sectors, 1, btree, btree_id); 815 811 if (ret) 816 812 return ret; 817 813 } else { 818 814 bool insert = !(flags & BTREE_TRIGGER_overwrite); 819 - struct disk_accounting_pos acc_inum_key = { 820 - .type = BCH_DISK_ACCOUNTING_inum, 821 - .inum.inum = k.k->p.inode, 822 - }; 815 + 
823 816 s64 v[3] = { 824 817 insert ? 1 : -1, 825 818 insert ? k.k->size : -((s64) k.k->size), 826 819 *replicas_sectors, 827 820 }; 828 - ret = bch2_disk_accounting_mod(trans, &acc_inum_key, v, ARRAY_SIZE(v), gc); 821 + ret = bch2_disk_accounting_mod2(trans, gc, v, inum, k.k->p.inode); 829 822 if (ret) 830 823 return ret; 831 824 } ··· 863 878 } 864 879 865 880 int need_rebalance_delta = 0; 866 - s64 need_rebalance_sectors_delta = 0; 881 + s64 need_rebalance_sectors_delta[1] = { 0 }; 867 882 868 883 s64 s = bch2_bkey_sectors_need_rebalance(c, old); 869 884 need_rebalance_delta -= s != 0; 870 - need_rebalance_sectors_delta -= s; 885 + need_rebalance_sectors_delta[0] -= s; 871 886 872 887 s = bch2_bkey_sectors_need_rebalance(c, new.s_c); 873 888 need_rebalance_delta += s != 0; 874 - need_rebalance_sectors_delta += s; 889 + need_rebalance_sectors_delta[0] += s; 875 890 876 891 if ((flags & BTREE_TRIGGER_transactional) && need_rebalance_delta) { 877 892 int ret = bch2_btree_bit_mod_buffered(trans, BTREE_ID_rebalance_work, ··· 880 895 return ret; 881 896 } 882 897 883 - if (need_rebalance_sectors_delta) { 884 - struct disk_accounting_pos acc = { 885 - .type = BCH_DISK_ACCOUNTING_rebalance_work, 886 - }; 887 - int ret = bch2_disk_accounting_mod(trans, &acc, &need_rebalance_sectors_delta, 1, 888 - flags & BTREE_TRIGGER_gc); 898 + if (need_rebalance_sectors_delta[0]) { 899 + int ret = bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc, 900 + need_rebalance_sectors_delta, rebalance_work); 889 901 if (ret) 890 902 return ret; 891 903 } ··· 898 916 enum btree_iter_update_trigger_flags flags) 899 917 { 900 918 if (flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) { 901 - s64 sectors = k.k->size; 919 + s64 sectors[1] = { k.k->size }; 902 920 903 921 if (flags & BTREE_TRIGGER_overwrite) 904 - sectors = -sectors; 922 + sectors[0] = -sectors[0]; 905 923 906 - struct disk_accounting_pos acc = { 907 - .type = BCH_DISK_ACCOUNTING_persistent_reserved, 908 - 
.persistent_reserved.nr_replicas = bkey_s_c_to_reservation(k).v->nr_replicas, 909 - }; 910 - 911 - return bch2_disk_accounting_mod(trans, &acc, &sectors, 1, flags & BTREE_TRIGGER_gc); 924 + return bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc, sectors, 925 + persistent_reserved, bkey_s_c_to_reservation(k).v->nr_replicas); 912 926 } 913 927 914 928 return 0;

+1 -30

fs/bcachefs/buckets.h

···
 	for (_b = (_buckets)->b + (_buckets)->first_bucket;	\
 	     _b < (_buckets)->b + (_buckets)->nbuckets; _b++)

-/*
- * Ugly hack alert:
- *
- * We need to cram a spinlock in a single byte, because that's what we have left
- * in struct bucket, and we care about the size of these - during fsck, we need
- * in memory state for every single bucket on every device.
- *
- * We used to do
- *   while (xchg(&b->lock, 1) cpu_relax();
- * but, it turns out not all architectures support xchg on a single byte.
- *
- * So now we use bit_spin_lock(), with fun games since we can't burn a whole
- * ulong for this - we just need to make sure the lock bit always ends up in the
- * first byte.
- */
-
-#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
-#define BUCKET_LOCK_BITNR	0
-#else
-#define BUCKET_LOCK_BITNR	(BITS_PER_LONG - 1)
-#endif
-
-union ulong_byte_assert {
-	ulong	ulong;
-	u8	byte;
-};
-
 static inline void bucket_unlock(struct bucket *b)
 {
 	BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
···

 static inline int gen_after(u8 a, u8 b)
 {
-	int r = gen_cmp(a, b);
-
-	return r > 0 ? r : 0;
+	return max(0, gen_cmp(a, b));
 }

 static inline int dev_ptr_stale_rcu(struct bch_dev *ca, const struct bch_extent_ptr *ptr)
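The `gen_after()` change is a pure simplification: clamping the comparison result at zero is the same as `max(0, ...)`. A small userspace sketch of the clamped comparison follows. `gen_cmp()` is not shown in this diff, so `sketch_gen_cmp()` assumes it is a signed 8-bit difference that tolerates counter wraparound; treat that definition as an assumption, not a quote from the source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed definition (not shown in the diff): the signed 8-bit difference
 * of two generation counters, so the result stays meaningful across u8
 * wraparound. Relies on the usual modular conversion to int8_t done by
 * GCC and Clang.
 */
static inline int sketch_gen_cmp(uint8_t a, uint8_t b)
{
	return (int8_t) (uint8_t) (a - b);
}

/* How far a is ahead of b, or 0 if it is not ahead: max(0, gen_cmp(a, b)). */
static inline int sketch_gen_after(uint8_t a, uint8_t b)
{
	int r = sketch_gen_cmp(a, b);

	return r > 0 ? r : 0;
}
```

Under that assumption, a generation of 2 is 8 steps after 250 (the counter wrapped), while 250 is not after 2 and gives 0.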

+27

fs/bcachefs/buckets_types.h

···

 #define BUCKET_JOURNAL_SEQ_BITS		16

+/*
+ * Ugly hack alert:
+ *
+ * We need to cram a spinlock in a single byte, because that's what we have left
+ * in struct bucket, and we care about the size of these - during fsck, we need
+ * in memory state for every single bucket on every device.
+ *
+ * We used to do
+ *   while (xchg(&b->lock, 1) cpu_relax();
+ * but, it turns out not all architectures support xchg on a single byte.
+ *
+ * So now we use bit_spin_lock(), with fun games since we can't burn a whole
+ * ulong for this - we just need to make sure the lock bit always ends up in the
+ * first byte.
+ */
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define BUCKET_LOCK_BITNR	0
+#else
+#define BUCKET_LOCK_BITNR	(BITS_PER_LONG - 1)
+#endif
+
+union ulong_byte_assert {
+	ulong	ulong;
+	u8	byte;
+};
+
 struct bucket {
 	u8			lock;
 	u8			gen_valid:1;
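The comment above says the lock bit must always land in the first byte of the word. The following userspace sketch shows how a union like `ulong_byte_assert` can check that. The `sketch_` names are hypothetical, `__BYTE_ORDER__` is a GCC/Clang predefined macro, and the kernel performs this check at build time with `BUILD_BUG_ON()` rather than at run time.

```c
#include <assert.h>
#include <limits.h>

/* Bit 0 on little endian, the most significant bit on big endian. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define SKETCH_LOCK_BITNR	0
#else
#define SKETCH_LOCK_BITNR	(sizeof(unsigned long) * CHAR_BIT - 1)
#endif

/* Overlay a single byte on the start of an unsigned long. */
union sketch_ulong_byte {
	unsigned long	ulong;
	unsigned char	byte;
};

/* Nonzero if setting only the lock bit in a long changes its first byte. */
static inline int sketch_lock_bit_in_first_byte(void)
{
	union sketch_ulong_byte u = { .ulong = 1UL << SKETCH_LOCK_BITNR };

	return u.byte != 0;
}
```

On either byte order the function should return nonzero, which is the property that lets a bit spinlock operate on the one-byte `lock` field at the start of `struct bucket`.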

+31 -7

fs/bcachefs/chardev.c

··· 11 11 #include "move.h" 12 12 #include "recovery_passes.h" 13 13 #include "replicas.h" 14 + #include "sb-counters.h" 14 15 #include "super-io.h" 15 16 #include "thread_with_file.h" 16 17 ··· 313 312 struct bch_data_ctx *ctx = container_of(arg, struct bch_data_ctx, thr); 314 313 315 314 ctx->thr.ret = bch2_data_job(ctx->c, &ctx->stats, ctx->arg); 316 - ctx->stats.data_type = U8_MAX; 315 + if (ctx->thr.ret == -BCH_ERR_device_offline) 316 + ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_device_offline; 317 + else { 318 + ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_done; 319 + ctx->stats.data_type = (int) DATA_PROGRESS_DATA_TYPE_done; 320 + } 317 321 return 0; 318 322 } 319 323 ··· 337 331 struct bch_data_ctx *ctx = container_of(file->private_data, struct bch_data_ctx, thr); 338 332 struct bch_fs *c = ctx->c; 339 333 struct bch_ioctl_data_event e = { 340 - .type = BCH_DATA_EVENT_PROGRESS, 341 - .p.data_type = ctx->stats.data_type, 342 - .p.btree_id = ctx->stats.pos.btree, 343 - .p.pos = ctx->stats.pos.pos, 344 - .p.sectors_done = atomic64_read(&ctx->stats.sectors_seen), 345 - .p.sectors_total = bch2_fs_usage_read_short(c).used, 334 + .type = BCH_DATA_EVENT_PROGRESS, 335 + .ret = ctx->stats.ret, 336 + .p.data_type = ctx->stats.data_type, 337 + .p.btree_id = ctx->stats.pos.btree, 338 + .p.pos = ctx->stats.pos.pos, 339 + .p.sectors_done = atomic64_read(&ctx->stats.sectors_seen), 340 + .p.sectors_error_corrected = atomic64_read(&ctx->stats.sectors_error_corrected), 341 + .p.sectors_error_uncorrected = atomic64_read(&ctx->stats.sectors_error_uncorrected), 346 342 }; 343 + 344 + if (ctx->arg.op == BCH_DATA_OP_scrub) { 345 + struct bch_dev *ca = bch2_dev_tryget(c, ctx->arg.scrub.dev); 346 + if (ca) { 347 + struct bch_dev_usage u; 348 + bch2_dev_usage_read_fast(ca, &u); 349 + for (unsigned i = BCH_DATA_btree; i < ARRAY_SIZE(u.d); i++) 350 + if (ctx->arg.scrub.data_types & BIT(i)) 351 + e.p.sectors_total += u.d[i].sectors; 352 + bch2_dev_put(ca); 353 + } 354 + } else { 355 + 
e.p.sectors_total = bch2_fs_usage_read_short(c).used; 356 + } 347 357 348 358 if (len < sizeof(e)) 349 359 return -EINVAL; ··· 732 710 BCH_IOCTL(fsck_online, struct bch_ioctl_fsck_online); 733 711 case BCH_IOCTL_QUERY_ACCOUNTING: 734 712 return bch2_ioctl_query_accounting(c, arg); 713 + case BCH_IOCTL_QUERY_COUNTERS: 714 + return bch2_ioctl_query_counters(c, arg); 735 715 default: 736 716 return -ENOTTY; 737 717 }

+14 -11

fs/bcachefs/checksum.c

··· 466 466 prt_str(&buf, ")"); 467 467 WARN_RATELIMIT(1, "%s", buf.buf); 468 468 printbuf_exit(&buf); 469 - return -EIO; 469 + return -BCH_ERR_recompute_checksum; 470 470 } 471 471 472 472 for (i = splits; i < splits + ARRAY_SIZE(splits); i++) { ··· 693 693 return 0; 694 694 } 695 695 696 + #if 0 697 + 698 + /* 699 + * This seems to be duplicating code in cmd_remove_passphrase() in 700 + * bcachefs-tools, but we might want to switch userspace to use this - and 701 + * perhaps add an ioctl for calling this at runtime, so we can take the 702 + * passphrase off of a mounted filesystem (which has come up). 703 + */ 696 704 int bch2_disable_encryption(struct bch_fs *c) 697 705 { 698 706 struct bch_sb_field_crypt *crypt; ··· 733 725 return ret; 734 726 } 735 727 728 + /* 729 + * For enabling encryption on an existing filesystem: not hooked up yet, but it 730 + * should be 731 + */ 736 732 int bch2_enable_encryption(struct bch_fs *c, bool keyed) 737 733 { 738 734 struct bch_encrypted_key key; ··· 793 781 memzero_explicit(&key, sizeof(key)); 794 782 return ret; 795 783 } 784 + #endif 796 785 797 786 void bch2_fs_encryption_exit(struct bch_fs *c) 798 787 { ··· 801 788 crypto_free_shash(c->poly1305); 802 789 if (c->chacha20) 803 790 crypto_free_sync_skcipher(c->chacha20); 804 - if (c->sha256) 805 - crypto_free_shash(c->sha256); 806 791 } 807 792 808 793 int bch2_fs_encryption_init(struct bch_fs *c) ··· 808 797 struct bch_sb_field_crypt *crypt; 809 798 struct bch_key key; 810 799 int ret = 0; 811 - 812 - c->sha256 = crypto_alloc_shash("sha256", 0, 0); 813 - ret = PTR_ERR_OR_ZERO(c->sha256); 814 - if (ret) { 815 - c->sha256 = NULL; 816 - bch_err(c, "error requesting sha256 module: %s", bch2_err_str(ret)); 817 - goto out; 818 - } 819 800 820 801 crypt = bch2_sb_field_get(c->disk_sb.sb, crypt); 821 802 if (!crypt)

+2

fs/bcachefs/checksum.h

···
 int bch2_decrypt_sb_key(struct bch_fs *, struct bch_sb_field_crypt *,
 			struct bch_key *);

+#if 0
 int bch2_disable_encryption(struct bch_fs *);
 int bch2_enable_encryption(struct bch_fs *, bool);
+#endif

 void bch2_fs_encryption_exit(struct bch_fs *);
 int bch2_fs_encryption_init(struct bch_fs *);

+29 -36

fs/bcachefs/compress.c

··· 177 177 size_t src_len = src->bi_iter.bi_size; 178 178 size_t dst_len = crc.uncompressed_size << 9; 179 179 void *workspace; 180 - int ret; 180 + int ret = 0, ret2; 181 181 182 182 enum bch_compression_opts opt = bch2_compression_type_to_opt(crc.compression_type); 183 183 mempool_t *workspace_pool = &c->compress_workspace[opt]; ··· 189 189 else 190 190 ret = -BCH_ERR_compression_workspace_not_initialized; 191 191 if (ret) 192 - goto out; 192 + goto err; 193 193 } 194 194 195 195 src_data = bio_map_or_bounce(c, src, READ); ··· 197 197 switch (crc.compression_type) { 198 198 case BCH_COMPRESSION_TYPE_lz4_old: 199 199 case BCH_COMPRESSION_TYPE_lz4: 200 - ret = LZ4_decompress_safe_partial(src_data.b, dst_data, 201 - src_len, dst_len, dst_len); 202 - if (ret != dst_len) 203 - goto err; 200 + ret2 = LZ4_decompress_safe_partial(src_data.b, dst_data, 201 + src_len, dst_len, dst_len); 202 + if (ret2 != dst_len) 203 + ret = -BCH_ERR_decompress_lz4; 204 204 break; 205 205 case BCH_COMPRESSION_TYPE_gzip: { 206 206 z_stream strm = { ··· 214 214 215 215 zlib_set_workspace(&strm, workspace); 216 216 zlib_inflateInit2(&strm, -MAX_WBITS); 217 - ret = zlib_inflate(&strm, Z_FINISH); 217 + ret2 = zlib_inflate(&strm, Z_FINISH); 218 218 219 219 mempool_free(workspace, workspace_pool); 220 220 221 - if (ret != Z_STREAM_END) 222 - goto err; 221 + if (ret2 != Z_STREAM_END) 222 + ret = -BCH_ERR_decompress_gzip; 223 223 break; 224 224 } 225 225 case BCH_COMPRESSION_TYPE_zstd: { 226 226 ZSTD_DCtx *ctx; 227 227 size_t real_src_len = le32_to_cpup(src_data.b); 228 228 229 - if (real_src_len > src_len - 4) 229 + if (real_src_len > src_len - 4) { 230 + ret = -BCH_ERR_decompress_zstd_src_len_bad; 230 231 goto err; 232 + } 231 233 232 234 workspace = mempool_alloc(workspace_pool, GFP_NOFS); 233 235 ctx = zstd_init_dctx(workspace, zstd_dctx_workspace_bound()); 234 236 235 - ret = zstd_decompress_dctx(ctx, 237 + ret2 = zstd_decompress_dctx(ctx, 236 238 dst_data, dst_len, 237 239 src_data.b + 4, 
real_src_len); 238 240 239 241 mempool_free(workspace, workspace_pool); 240 242 241 - if (ret != dst_len) 242 - goto err; 243 + if (ret2 != dst_len) 244 + ret = -BCH_ERR_decompress_zstd; 243 245 break; 244 246 } 245 247 default: 246 248 BUG(); 247 249 } 248 - ret = 0; 250 + err: 249 251 fsck_err: 250 - out: 251 252 bio_unmap_or_unbounce(c, src_data); 252 253 return ret; 253 - err: 254 - ret = -EIO; 255 - goto out; 256 254 } 257 255 258 256 int bch2_bio_uncompress_inplace(struct bch_write_op *op, ··· 266 268 BUG_ON(!bio->bi_vcnt); 267 269 BUG_ON(DIV_ROUND_UP(crc->live_size, PAGE_SECTORS) > bio->bi_max_vecs); 268 270 269 - if (crc->uncompressed_size << 9 > c->opts.encoded_extent_max || 270 - crc->compressed_size << 9 > c->opts.encoded_extent_max) { 271 - struct printbuf buf = PRINTBUF; 272 - bch2_write_op_error(&buf, op); 273 - prt_printf(&buf, "error rewriting existing data: extent too big"); 274 - bch_err_ratelimited(c, "%s", buf.buf); 275 - printbuf_exit(&buf); 276 - return -EIO; 271 + if (crc->uncompressed_size << 9 > c->opts.encoded_extent_max) { 272 + bch2_write_op_error(op, op->pos.offset, 273 + "extent too big to decompress (%u > %u)", 274 + crc->uncompressed_size << 9, c->opts.encoded_extent_max); 275 + return -BCH_ERR_decompress_exceeded_max_encoded_extent; 277 276 } 278 277 279 278 data = __bounce_alloc(c, dst_len, WRITE); 280 279 281 - if (__bio_uncompress(c, bio, data.b, *crc)) { 282 - if (!c->opts.no_data_io) { 283 - struct printbuf buf = PRINTBUF; 284 - bch2_write_op_error(&buf, op); 285 - prt_printf(&buf, "error rewriting existing data: decompression error"); 286 - bch_err_ratelimited(c, "%s", buf.buf); 287 - printbuf_exit(&buf); 288 - } 289 - ret = -EIO; 280 + ret = __bio_uncompress(c, bio, data.b, *crc); 281 + 282 + if (c->opts.no_data_io) 283 + ret = 0; 284 + 285 + if (ret) { 286 + bch2_write_op_error(op, op->pos.offset, "%s", bch2_err_str(ret)); 290 287 goto err; 291 288 } 292 289 ··· 314 321 315 322 if (crc.uncompressed_size << 9 > 
c->opts.encoded_extent_max || 316 323 crc.compressed_size << 9 > c->opts.encoded_extent_max) 317 - return -EIO; 324 + return -BCH_ERR_decompress_exceeded_max_encoded_extent; 318 325 319 326 dst_data = dst_len == dst_iter.bi_size 320 327 ? __bio_map_or_bounce(c, dst, dst_iter, WRITE)

+182 -55

fs/bcachefs/data_update.c

··· 20 20 #include "subvolume.h" 21 21 #include "trace.h" 22 22 23 + #include <linux/ioprio.h> 24 + 23 25 static void bkey_put_dev_refs(struct bch_fs *c, struct bkey_s_c k) 24 26 { 25 27 struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); ··· 35 33 struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 36 34 37 35 bkey_for_each_ptr(ptrs, ptr) { 38 - if (!bch2_dev_tryget(c, ptr->dev)) { 36 + if (unlikely(!bch2_dev_tryget(c, ptr->dev))) { 39 37 bkey_for_each_ptr(ptrs, ptr2) { 40 38 if (ptr2 == ptr) 41 39 break; ··· 93 91 return true; 94 92 } 95 93 96 - static noinline void trace_move_extent_finish2(struct data_update *u, 94 + static noinline void trace_io_move_finish2(struct data_update *u, 97 95 struct bkey_i *new, 98 96 struct bkey_i *insert) 99 97 { ··· 113 111 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(insert)); 114 112 prt_newline(&buf); 115 113 116 - trace_move_extent_finish(c, buf.buf); 114 + trace_io_move_finish(c, buf.buf); 117 115 printbuf_exit(&buf); 118 116 } 119 117 120 - static void trace_move_extent_fail2(struct data_update *m, 118 + static void trace_io_move_fail2(struct data_update *m, 121 119 struct bkey_s_c new, 122 120 struct bkey_s_c wrote, 123 121 struct bkey_i *insert, ··· 128 126 struct printbuf buf = PRINTBUF; 129 127 unsigned rewrites_found = 0; 130 128 131 - if (!trace_move_extent_fail_enabled()) 129 + if (!trace_io_move_fail_enabled()) 132 130 return; 133 131 134 132 prt_str(&buf, msg); ··· 168 166 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(insert)); 169 167 } 170 168 171 - trace_move_extent_fail(c, buf.buf); 169 + trace_io_move_fail(c, buf.buf); 172 170 printbuf_exit(&buf); 173 171 } 174 172 ··· 216 214 new = bkey_i_to_extent(bch2_keylist_front(keys)); 217 215 218 216 if (!bch2_extents_match(k, old)) { 219 - trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i), 217 + trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i), 220 218 NULL, "no match:"); 221 219 goto nowork; 222 220 } ··· 256 254 if (m->data_opts.rewrite_ptrs && 257 255 !rewrites_found 
&& 258 256 bch2_bkey_durability(c, k) >= m->op.opts.data_replicas) { 259 - trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "no rewrites found:"); 257 + trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "no rewrites found:"); 260 258 goto nowork; 261 259 } 262 260 ··· 273 271 } 274 272 275 273 if (!bkey_val_u64s(&new->k)) { 276 - trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "new replicas conflicted:"); 274 + trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "new replicas conflicted:"); 277 275 goto nowork; 278 276 } 279 277 ··· 354 352 printbuf_exit(&buf); 355 353 356 354 bch2_fatal_error(c); 357 - ret = -EIO; 355 + ret = -BCH_ERR_invalid_bkey; 358 356 goto out; 359 357 } 360 358 ··· 387 385 if (!ret) { 388 386 bch2_btree_iter_set_pos(&iter, next_pos); 389 387 390 - this_cpu_add(c->counters[BCH_COUNTER_move_extent_finish], new->k.size); 391 - if (trace_move_extent_finish_enabled()) 392 - trace_move_extent_finish2(m, &new->k_i, insert); 388 + this_cpu_add(c->counters[BCH_COUNTER_io_move_finish], new->k.size); 389 + if (trace_io_move_finish_enabled()) 390 + trace_io_move_finish2(m, &new->k_i, insert); 393 391 } 394 392 err: 395 393 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) ··· 411 409 &m->stats->sectors_raced); 412 410 } 413 411 414 - count_event(c, move_extent_fail); 412 + count_event(c, io_move_fail); 415 413 416 414 bch2_btree_iter_advance(&iter); 417 415 goto next; ··· 429 427 return bch2_trans_run(op->c, __bch2_data_update_index_update(trans, op)); 430 428 } 431 429 432 - void bch2_data_update_read_done(struct data_update *m, 433 - struct bch_extent_crc_unpacked crc) 430 + void bch2_data_update_read_done(struct data_update *m) 434 431 { 432 + m->read_done = true; 433 + 435 434 /* write bio must own pages: */ 436 435 BUG_ON(!m->op.wbio.bio.bi_vcnt); 437 436 438 - m->op.crc = crc; 439 - m->op.wbio.bio.bi_iter.bi_size = crc.compressed_size << 9; 437 + m->op.crc = m->rbio.pick.crc; 438 + 
m->op.wbio.bio.bi_iter.bi_size = m->op.crc.compressed_size << 9; 439 + 440 + this_cpu_add(m->op.c->counters[BCH_COUNTER_io_move_write], m->k.k->k.size); 440 441 441 442 closure_call(&m->op.cl, bch2_write, NULL, NULL); 442 443 } ··· 449 444 struct bch_fs *c = update->op.c; 450 445 struct bkey_s_c k = bkey_i_to_s_c(update->k.k); 451 446 447 + bch2_bio_free_pages_pool(c, &update->op.wbio.bio); 448 + kfree(update->bvecs); 449 + update->bvecs = NULL; 450 + 452 451 if (c->opts.nocow_enabled) 453 452 bkey_nocow_unlock(c, k); 454 453 bkey_put_dev_refs(c, k); 455 - bch2_bkey_buf_exit(&update->k, c); 456 454 bch2_disk_reservation_put(c, &update->op.res); 457 - bch2_bio_free_pages_pool(c, &update->op.wbio.bio); 455 + bch2_bkey_buf_exit(&update->k, c); 458 456 } 459 457 460 - static void bch2_update_unwritten_extent(struct btree_trans *trans, 461 - struct data_update *update) 458 + static int bch2_update_unwritten_extent(struct btree_trans *trans, 459 + struct data_update *update) 462 460 { 463 461 struct bch_fs *c = update->op.c; 464 - struct bio *bio = &update->op.wbio.bio; 465 462 struct bkey_i_extent *e; 466 463 struct write_point *wp; 467 464 struct closure cl; 468 465 struct btree_iter iter; 469 466 struct bkey_s_c k; 470 - int ret; 467 + int ret = 0; 471 468 472 469 closure_init_stack(&cl); 473 470 bch2_keylist_init(&update->op.insert_keys, update->op.inline_keys); 474 471 475 - while (bio_sectors(bio)) { 476 - unsigned sectors = bio_sectors(bio); 472 + while (bpos_lt(update->op.pos, update->k.k->k.p)) { 473 + unsigned sectors = update->k.k->k.p.offset - 474 + update->op.pos.offset; 477 475 478 476 bch2_trans_begin(trans); 479 477 ··· 512 504 bch_err_fn_ratelimited(c, ret); 513 505 514 506 if (ret) 515 - return; 507 + break; 516 508 517 509 sectors = min(sectors, wp->sectors_free); 518 510 ··· 522 514 bch2_alloc_sectors_append_ptrs(c, wp, &e->k_i, sectors, false); 523 515 bch2_alloc_sectors_done(c, wp); 524 516 525 - bio_advance(bio, sectors << 9); 526 517 
update->op.pos.offset += sectors; 527 518 528 519 extent_for_each_ptr(extent_i_to_s(e), ptr) ··· 540 533 bch2_trans_unlock(trans); 541 534 closure_sync(&cl); 542 535 } 536 + 537 + return ret; 543 538 } 544 539 545 540 void bch2_data_update_opts_to_text(struct printbuf *out, struct bch_fs *c, 546 541 struct bch_io_opts *io_opts, 547 542 struct data_update_opts *data_opts) 548 543 { 549 - printbuf_tabstop_push(out, 20); 544 + if (!out->nr_tabstops) 545 + printbuf_tabstop_push(out, 20); 550 546 551 547 prt_str_indented(out, "rewrite ptrs:\t"); 552 548 bch2_prt_u64_base2(out, data_opts->rewrite_ptrs); ··· 582 572 583 573 prt_str_indented(out, "old key:\t"); 584 574 bch2_bkey_val_to_text(out, m->op.c, bkey_i_to_s_c(m->k.k)); 575 + } 576 + 577 + void bch2_data_update_inflight_to_text(struct printbuf *out, struct data_update *m) 578 + { 579 + bch2_bkey_val_to_text(out, m->op.c, bkey_i_to_s_c(m->k.k)); 580 + prt_newline(out); 581 + printbuf_indent_add(out, 2); 582 + bch2_data_update_opts_to_text(out, m->op.c, &m->op.opts, &m->data_opts); 583 + prt_printf(out, "read_done:\t\%u\n", m->read_done); 584 + bch2_write_op_to_text(out, &m->op); 585 + printbuf_indent_sub(out, 2); 585 586 } 586 587 587 588 int bch2_extent_drop_ptrs(struct btree_trans *trans, ··· 638 617 bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc); 639 618 } 640 619 620 + int bch2_data_update_bios_init(struct data_update *m, struct bch_fs *c, 621 + struct bch_io_opts *io_opts) 622 + { 623 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(bkey_i_to_s_c(m->k.k)); 624 + const union bch_extent_entry *entry; 625 + struct extent_ptr_decoded p; 626 + 627 + /* write path might have to decompress data: */ 628 + unsigned buf_bytes = 0; 629 + bkey_for_each_ptr_decode(&m->k.k->k, ptrs, p, entry) 630 + buf_bytes = max_t(unsigned, buf_bytes, p.crc.uncompressed_size << 9); 631 + 632 + unsigned nr_vecs = DIV_ROUND_UP(buf_bytes, PAGE_SIZE); 633 + 634 + m->bvecs = kmalloc_array(nr_vecs, sizeof*(m->bvecs), GFP_KERNEL); 
635 + if (!m->bvecs) 636 + return -ENOMEM; 637 + 638 + bio_init(&m->rbio.bio, NULL, m->bvecs, nr_vecs, REQ_OP_READ); 639 + bio_init(&m->op.wbio.bio, NULL, m->bvecs, nr_vecs, 0); 640 + 641 + if (bch2_bio_alloc_pages(&m->op.wbio.bio, buf_bytes, GFP_KERNEL)) { 642 + kfree(m->bvecs); 643 + m->bvecs = NULL; 644 + return -ENOMEM; 645 + } 646 + 647 + rbio_init(&m->rbio.bio, c, *io_opts, NULL); 648 + m->rbio.data_update = true; 649 + m->rbio.bio.bi_iter.bi_size = buf_bytes; 650 + m->rbio.bio.bi_iter.bi_sector = bkey_start_offset(&m->k.k->k); 651 + m->op.wbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0); 652 + return 0; 653 + } 654 + 655 + static int can_write_extent(struct bch_fs *c, struct data_update *m) 656 + { 657 + if ((m->op.flags & BCH_WRITE_alloc_nowait) && 658 + unlikely(c->open_buckets_nr_free <= bch2_open_buckets_reserved(m->op.watermark))) 659 + return -BCH_ERR_data_update_done_would_block; 660 + 661 + unsigned target = m->op.flags & BCH_WRITE_only_specified_devs 662 + ? m->op.target 663 + : 0; 664 + struct bch_devs_mask devs = target_rw_devs(c, BCH_DATA_user, target); 665 + 666 + darray_for_each(m->op.devs_have, i) 667 + __clear_bit(*i, devs.d); 668 + 669 + rcu_read_lock(); 670 + unsigned nr_replicas = 0, i; 671 + for_each_set_bit(i, devs.d, BCH_SB_MEMBERS_MAX) { 672 + struct bch_dev *ca = bch2_dev_rcu(c, i); 673 + 674 + struct bch_dev_usage usage; 675 + bch2_dev_usage_read_fast(ca, &usage); 676 + 677 + if (!dev_buckets_free(ca, usage, m->op.watermark)) 678 + continue; 679 + 680 + nr_replicas += ca->mi.durability; 681 + if (nr_replicas >= m->op.nr_replicas) 682 + break; 683 + } 684 + rcu_read_unlock(); 685 + 686 + if (!nr_replicas) 687 + return -BCH_ERR_data_update_done_no_rw_devs; 688 + if (nr_replicas < m->op.nr_replicas) 689 + return -BCH_ERR_insufficient_devices; 690 + return 0; 691 + } 692 + 641 693 int bch2_data_update_init(struct btree_trans *trans, 642 694 struct btree_iter *iter, 643 695 struct moving_context *ctxt, 644 696 struct 
data_update *m, 645 697 struct write_point_specifier wp, 646 - struct bch_io_opts io_opts, 698 + struct bch_io_opts *io_opts, 647 699 struct data_update_opts data_opts, 648 700 enum btree_id btree_id, 649 701 struct bkey_s_c k) ··· 734 640 * snapshots table - just skip it, we can move it later. 735 641 */ 736 642 if (unlikely(k.k->p.snapshot && !bch2_snapshot_exists(c, k.k->p.snapshot))) 737 - return -BCH_ERR_data_update_done; 738 - 739 - if (!bkey_get_dev_refs(c, k)) 740 - return -BCH_ERR_data_update_done; 741 - 742 - if (c->opts.nocow_enabled && 743 - !bkey_nocow_lock(c, ctxt, k)) { 744 - bkey_put_dev_refs(c, k); 745 - return -BCH_ERR_nocow_lock_blocked; 746 - } 643 + return -BCH_ERR_data_update_done_no_snapshot; 747 644 748 645 bch2_bkey_buf_init(&m->k); 749 646 bch2_bkey_buf_reassemble(&m->k, c, k); ··· 743 658 m->ctxt = ctxt; 744 659 m->stats = ctxt ? ctxt->stats : NULL; 745 660 746 - bch2_write_op_init(&m->op, c, io_opts); 661 + bch2_write_op_init(&m->op, c, *io_opts); 747 662 m->op.pos = bkey_start_pos(k.k); 748 663 m->op.version = k.k->bversion; 749 664 m->op.target = data_opts.target; 750 665 m->op.write_point = wp; 751 666 m->op.nr_replicas = 0; 752 - m->op.flags |= BCH_WRITE_PAGES_STABLE| 753 - BCH_WRITE_PAGES_OWNED| 754 - BCH_WRITE_DATA_ENCODED| 755 - BCH_WRITE_MOVE| 667 + m->op.flags |= BCH_WRITE_pages_stable| 668 + BCH_WRITE_pages_owned| 669 + BCH_WRITE_data_encoded| 670 + BCH_WRITE_move| 756 671 m->data_opts.write_flags; 757 - m->op.compression_opt = io_opts.background_compression; 672 + m->op.compression_opt = io_opts->background_compression; 758 673 m->op.watermark = m->data_opts.btree_insert_flags & BCH_WATERMARK_MASK; 759 674 760 675 unsigned durability_have = 0, durability_removing = 0; ··· 792 707 ptr_bit <<= 1; 793 708 } 794 709 795 - unsigned durability_required = max(0, (int) (io_opts.data_replicas - durability_have)); 710 + unsigned durability_required = max(0, (int) (io_opts->data_replicas - durability_have)); 796 711 797 712 /* 798 713 * 
If current extent durability is less than io_opts.data_replicas, ··· 825 740 m->data_opts.rewrite_ptrs = 0; 826 741 /* if iter == NULL, it's just a promote */ 827 742 if (iter) 828 - ret = bch2_extent_drop_ptrs(trans, iter, k, &io_opts, &m->data_opts); 829 - goto out; 743 + ret = bch2_extent_drop_ptrs(trans, iter, k, io_opts, &m->data_opts); 744 + if (!ret) 745 + ret = -BCH_ERR_data_update_done_no_writes_needed; 746 + goto out_bkey_buf_exit; 830 747 } 748 + 749 + /* 750 + * Check if the allocation will succeed, to avoid getting an error later 751 + * in bch2_write() -> bch2_alloc_sectors_start() and doing a useless 752 + * read: 753 + * 754 + * This guards against 755 + * - BCH_WRITE_alloc_nowait allocations failing (promotes) 756 + * - Destination target full 757 + * - Device(s) in destination target offline 758 + * - Insufficient durability available in destination target 759 + * (i.e. trying to move a durability=2 replica to a target with a 760 + * single durability=2 device) 761 + */ 762 + ret = can_write_extent(c, m); 763 + if (ret) 764 + goto out_bkey_buf_exit; 831 765 832 766 if (reserve_sectors) { 833 767 ret = bch2_disk_reservation_add(c, &m->op.res, reserve_sectors, ··· 854 750 ? 
0 855 751 : BCH_DISK_RESERVATION_NOFAIL); 856 752 if (ret) 857 - goto out; 753 + goto out_bkey_buf_exit; 754 + } 755 + 756 + if (!bkey_get_dev_refs(c, k)) { 757 + ret = -BCH_ERR_data_update_done_no_dev_refs; 758 + goto out_put_disk_res; 759 + } 760 + 761 + if (c->opts.nocow_enabled && 762 + !bkey_nocow_lock(c, ctxt, k)) { 763 + ret = -BCH_ERR_nocow_lock_blocked; 764 + goto out_put_dev_refs; 858 765 } 859 766 860 767 if (bkey_extent_is_unwritten(k)) { 861 - bch2_update_unwritten_extent(trans, m); 862 - goto out; 768 + ret = bch2_update_unwritten_extent(trans, m) ?: 769 + -BCH_ERR_data_update_done_unwritten; 770 + goto out_nocow_unlock; 863 771 } 864 772 773 + ret = bch2_data_update_bios_init(m, c, io_opts); 774 + if (ret) 775 + goto out_nocow_unlock; 776 + 865 777 return 0; 866 - out: 867 - bch2_data_update_exit(m); 868 - return ret ?: -BCH_ERR_data_update_done; 778 + out_nocow_unlock: 779 + if (c->opts.nocow_enabled) 780 + bkey_nocow_unlock(c, k); 781 + out_put_dev_refs: 782 + bkey_put_dev_refs(c, k); 783 + out_put_disk_res: 784 + bch2_disk_reservation_put(c, &m->op.res); 785 + out_bkey_buf_exit: 786 + bch2_bkey_buf_exit(&m->k, c); 787 + return ret; 869 788 } 870 789 871 790 void bch2_data_update_opts_normalize(struct bkey_s_c k, struct data_update_opts *opts)
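
The reworked error path in bch2_data_update_init() replaces the single `out:` label with a ladder of labels, so each failure releases only what was acquired before it. The following is a minimal userspace sketch of that pattern, not the kernel code: `struct fake_update`, `fake_update_init()` and the `fail_at` parameter are hypothetical, and the booleans merely stand in for the real resources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the resources acquired in order. */
struct fake_update {
	bool key_buf;		/* stands in for bch2_bkey_buf_init() */
	bool disk_res;		/* stands in for bch2_disk_reservation_add() */
	bool dev_refs;		/* stands in for bkey_get_dev_refs() */
	bool nocow_lock;	/* stands in for bkey_nocow_lock() */
};

/*
 * fail_at selects which step fails (1..4), or 0 for success.
 * Returns 0 on success and -fail_at on failure. On failure every
 * resource taken so far is released, in reverse order of acquisition.
 */
static int fake_update_init(struct fake_update *u, int fail_at)
{
	int ret = 0;

	assert(u != NULL);

	u->key_buf = true;

	if (fail_at == 1) {		/* reservation failed */
		ret = -1;
		goto out_key_buf_exit;
	}
	u->disk_res = true;

	if (fail_at == 2) {		/* could not take device refs */
		ret = -2;
		goto out_put_disk_res;
	}
	u->dev_refs = true;

	if (fail_at == 3) {		/* nocow lock blocked */
		ret = -3;
		goto out_put_dev_refs;
	}
	u->nocow_lock = true;

	if (fail_at == 4) {		/* a later step (e.g. bio setup) failed */
		ret = -4;
		goto out_nocow_unlock;
	}

	return 0;
out_nocow_unlock:
	u->nocow_lock = false;
out_put_dev_refs:
	u->dev_refs = false;
out_put_disk_res:
	u->disk_res = false;
out_key_buf_exit:
	u->key_buf = false;
	return ret;
}
```

Because the labels fall through, jumping to a later label skips the cleanup of resources that were never taken, which is why the diff moves bkey_get_dev_refs() and bkey_nocow_lock() below the cheaper checks.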

+14 -3

fs/bcachefs/data_update.h

··· 4 4 #define _BCACHEFS_DATA_UPDATE_H 5 5 6 6 #include "bkey_buf.h" 7 + #include "io_read.h" 7 8 #include "io_write_types.h" 8 9 9 10 struct moving_context; ··· 16 15 u8 extra_replicas; 17 16 unsigned btree_insert_flags; 18 17 unsigned write_flags; 18 + 19 + int read_dev; 20 + bool scrub; 19 21 }; 20 22 21 23 void bch2_data_update_opts_to_text(struct printbuf *, struct bch_fs *, ··· 26 22 27 23 struct data_update { 28 24 /* extent being updated: */ 25 + bool read_done; 29 26 enum btree_id btree_id; 30 27 struct bkey_buf k; 31 28 struct data_update_opts data_opts; 32 29 struct moving_context *ctxt; 33 30 struct bch_move_stats *stats; 31 + 32 + struct bch_read_bio rbio; 34 33 struct bch_write_op op; 34 + struct bio_vec *bvecs; 35 35 }; 36 36 37 37 void bch2_data_update_to_text(struct printbuf *, struct data_update *); 38 + void bch2_data_update_inflight_to_text(struct printbuf *, struct data_update *); 38 39 39 40 int bch2_data_update_index_update(struct bch_write_op *); 40 41 41 - void bch2_data_update_read_done(struct data_update *, 42 - struct bch_extent_crc_unpacked); 42 + void bch2_data_update_read_done(struct data_update *); 43 43 44 44 int bch2_extent_drop_ptrs(struct btree_trans *, 45 45 struct btree_iter *, ··· 51 43 struct bch_io_opts *, 52 44 struct data_update_opts *); 53 45 46 + int bch2_data_update_bios_init(struct data_update *, struct bch_fs *, 47 + struct bch_io_opts *); 48 + 54 49 void bch2_data_update_exit(struct data_update *); 55 50 int bch2_data_update_init(struct btree_trans *, struct btree_iter *, 56 51 struct moving_context *, 57 52 struct data_update *, 58 53 struct write_point_specifier, 59 - struct bch_io_opts, struct data_update_opts, 54 + struct bch_io_opts *, struct data_update_opts, 60 55 enum btree_id, struct bkey_s_c); 61 56 void bch2_data_update_opts_normalize(struct bkey_s_c, struct data_update_opts *); 62 57

+30 -4

fs/bcachefs/debug.c

··· 7 7 */ 8 8 9 9 #include "bcachefs.h" 10 + #include "alloc_foreground.h" 10 11 #include "bkey_methods.h" 11 12 #include "btree_cache.h" 12 13 #include "btree_io.h" ··· 191 190 unsigned offset = 0; 192 191 int ret; 193 192 194 - if (bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key), NULL, &pick) <= 0) { 193 + if (bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key), NULL, &pick, -1) <= 0) { 195 194 prt_printf(out, "error getting device to read from: invalid device\n"); 196 195 return; 197 196 } ··· 845 844 seqmutex_unlock(&c->btree_trans_lock); 846 845 } 847 846 848 - static ssize_t bch2_btree_deadlock_read(struct file *file, char __user *buf, 849 - size_t size, loff_t *ppos) 847 + typedef void (*fs_to_text_fn)(struct printbuf *, struct bch_fs *); 848 + 849 + static ssize_t bch2_simple_print(struct file *file, char __user *buf, 850 + size_t size, loff_t *ppos, 851 + fs_to_text_fn fn) 850 852 { 851 853 struct dump_iter *i = file->private_data; 852 854 struct bch_fs *c = i->c; ··· 860 856 i->ret = 0; 861 857 862 858 if (!i->iter) { 863 - btree_deadlock_to_text(&i->buf, c); 859 + fn(&i->buf, c); 864 860 i->iter++; 865 861 } 866 862 ··· 873 869 return ret ?: i->ret; 874 870 } 875 871 872 + static ssize_t bch2_btree_deadlock_read(struct file *file, char __user *buf, 873 + size_t size, loff_t *ppos) 874 + { 875 + return bch2_simple_print(file, buf, size, ppos, btree_deadlock_to_text); 876 + } 877 + 876 878 static const struct file_operations btree_deadlock_ops = { 877 879 .owner = THIS_MODULE, 878 880 .open = bch2_dump_open, 879 881 .release = bch2_dump_release, 880 882 .read = bch2_btree_deadlock_read, 883 + }; 884 + 885 + static ssize_t bch2_write_points_read(struct file *file, char __user *buf, 886 + size_t size, loff_t *ppos) 887 + { 888 + return bch2_simple_print(file, buf, size, ppos, bch2_write_points_to_text); 889 + } 890 + 891 + static const struct file_operations write_points_ops = { 892 + .owner = THIS_MODULE, 893 + .open = bch2_dump_open, 894 + .release 
= bch2_dump_release, 895 + .read = bch2_write_points_read, 881 896 }; 882 897 883 898 void bch2_fs_debug_exit(struct bch_fs *c) ··· 949 926 950 927 debugfs_create_file("btree_deadlock", 0400, c->fs_debug_dir, 951 928 c->btree_debug, &btree_deadlock_ops); 929 + 930 + debugfs_create_file("write_points", 0400, c->fs_debug_dir, 931 + c->btree_debug, &write_points_ops); 952 932 953 933 c->btree_debug_dir = debugfs_create_dir("btrees", c->fs_debug_dir); 954 934 if (IS_ERR_OR_NULL(c->btree_debug_dir))
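
The debug.c hunk above factors the body of the deadlock reader into bch2_simple_print(), which takes a formatting callback so that new debugfs files such as write_points need only a one-line wrapper. A rough userspace sketch of that shape follows; `struct fake_fs`, `struct fake_buf` and every function here are hypothetical stand-ins, and unlike the kernel code the sketch reformats on every call instead of caching the text in a dump iterator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct fake_fs  { int nr_write_points; };		/* hypothetical */
struct fake_buf { char data[128]; size_t len; };	/* printbuf stand-in */

typedef void (*fs_to_text_fn)(struct fake_buf *, struct fake_fs *);

/* One possible formatter; any function with this signature can be plugged in. */
static void write_points_to_text(struct fake_buf *out, struct fake_fs *c)
{
	int n = snprintf(out->data, sizeof(out->data),
			 "write points: %d\n", c->nr_write_points);

	if (n < 0)
		out->len = 0;
	else if ((size_t) n >= sizeof(out->data))
		out->len = sizeof(out->data) - 1;	/* output was truncated */
	else
		out->len = (size_t) n;
}

/* Generic reader: format via fn, then copy up to size bytes starting at *ppos. */
static size_t simple_print(struct fake_fs *c, char *dst, size_t size,
			   size_t *ppos, fs_to_text_fn fn)
{
	struct fake_buf buf = { .len = 0 };
	size_t n;

	assert(fn != NULL);
	fn(&buf, c);

	if (*ppos >= buf.len)
		return 0;				/* end of "file" */

	n = buf.len - *ppos;
	if (n > size)
		n = size;
	memcpy(dst, buf.data + *ppos, n);
	*ppos += n;
	return n;
}

/* Each new file then needs only a thin wrapper that picks the formatter. */
static size_t write_points_read(struct fake_fs *c, char *dst, size_t size,
				size_t *ppos)
{
	return simple_print(c, dst, size, ppos, write_points_to_text);
}
```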

+241 -33

fs/bcachefs/dirent.c

··· 13 13 14 14 #include <linux/dcache.h> 15 15 16 + static int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info, 17 + const struct qstr *str, struct qstr *out_cf) 18 + { 19 + *out_cf = (struct qstr) QSTR_INIT(NULL, 0); 20 + 21 + #ifdef CONFIG_UNICODE 22 + unsigned char *buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1); 23 + int ret = PTR_ERR_OR_ZERO(buf); 24 + if (ret) 25 + return ret; 26 + 27 + ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1); 28 + if (ret <= 0) 29 + return ret; 30 + 31 + *out_cf = (struct qstr) QSTR_INIT(buf, ret); 32 + return 0; 33 + #else 34 + return -EOPNOTSUPP; 35 + #endif 36 + } 37 + 38 + static inline int bch2_maybe_casefold(struct btree_trans *trans, 39 + const struct bch_hash_info *info, 40 + const struct qstr *str, struct qstr *out_cf) 41 + { 42 + if (likely(!info->cf_encoding)) { 43 + *out_cf = *str; 44 + return 0; 45 + } else { 46 + return bch2_casefold(trans, info, str, out_cf); 47 + } 48 + } 49 + 16 50 static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d) 17 51 { 18 52 if (bkey_val_bytes(d.k) < offsetof(struct bch_dirent, d_name)) ··· 62 28 #endif 63 29 64 30 return bkey_bytes - 65 - offsetof(struct bch_dirent, d_name) - 31 + (d.v->d_casefold 32 + ? 
offsetof(struct bch_dirent, d_cf_name_block.d_names) 33 + : offsetof(struct bch_dirent, d_name)) - 66 34 trailing_nuls; 67 35 } 68 36 69 37 struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d) 70 38 { 71 - return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d)); 39 + if (d.v->d_casefold) { 40 + unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len); 41 + return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[0], name_len); 42 + } else { 43 + return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d)); 44 + } 45 + } 46 + 47 + static struct qstr bch2_dirent_get_casefold_name(struct bkey_s_c_dirent d) 48 + { 49 + if (d.v->d_casefold) { 50 + unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len); 51 + unsigned cf_name_len = le16_to_cpu(d.v->d_cf_name_block.d_cf_name_len); 52 + return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[name_len], cf_name_len); 53 + } else { 54 + return (struct qstr) QSTR_INIT(NULL, 0); 55 + } 56 + } 57 + 58 + static inline struct qstr bch2_dirent_get_lookup_name(struct bkey_s_c_dirent d) 59 + { 60 + return d.v->d_casefold 61 + ? 
bch2_dirent_get_casefold_name(d) 62 + : bch2_dirent_get_name(d); 72 63 } 73 64 74 65 static u64 bch2_dirent_hash(const struct bch_hash_info *info, ··· 116 57 static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k) 117 58 { 118 59 struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k); 119 - struct qstr name = bch2_dirent_get_name(d); 60 + struct qstr name = bch2_dirent_get_lookup_name(d); 120 61 121 62 return bch2_dirent_hash(info, &name); 122 63 } ··· 124 65 static bool dirent_cmp_key(struct bkey_s_c _l, const void *_r) 125 66 { 126 67 struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l); 127 - const struct qstr l_name = bch2_dirent_get_name(l); 68 + const struct qstr l_name = bch2_dirent_get_lookup_name(l); 128 69 const struct qstr *r_name = _r; 129 70 130 71 return !qstr_eq(l_name, *r_name); ··· 134 75 { 135 76 struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l); 136 77 struct bkey_s_c_dirent r = bkey_s_c_to_dirent(_r); 137 - const struct qstr l_name = bch2_dirent_get_name(l); 138 - const struct qstr r_name = bch2_dirent_get_name(r); 78 + const struct qstr l_name = bch2_dirent_get_lookup_name(l); 79 + const struct qstr r_name = bch2_dirent_get_lookup_name(r); 139 80 140 81 return !qstr_eq(l_name, r_name); 141 82 } ··· 163 104 struct bkey_validate_context from) 164 105 { 165 106 struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k); 107 + unsigned name_block_len = bch2_dirent_name_bytes(d); 166 108 struct qstr d_name = bch2_dirent_get_name(d); 109 + struct qstr d_cf_name = bch2_dirent_get_casefold_name(d); 167 110 int ret = 0; 168 111 169 112 bkey_fsck_err_on(!d_name.len, 170 113 c, dirent_empty_name, 171 114 "empty name"); 172 115 173 - bkey_fsck_err_on(bkey_val_u64s(k.k) > dirent_val_u64s(d_name.len), 116 + bkey_fsck_err_on(d_name.len + d_cf_name.len > name_block_len, 174 117 c, dirent_val_too_big, 175 - "value too big (%zu > %u)", 176 - bkey_val_u64s(k.k), dirent_val_u64s(d_name.len)); 118 + "dirent names exceed bkey size (%d + %d > %d)", 119 + 
d_name.len, d_cf_name.len, name_block_len); 177 120 178 121 /* 179 122 * Check new keys don't exceed the max length ··· 203 142 le64_to_cpu(d.v->d_inum) == d.k->p.inode, 204 143 c, dirent_to_itself, 205 144 "dirent points to own directory"); 145 + 146 + if (d.v->d_casefold) { 147 + bkey_fsck_err_on(from.from == BKEY_VALIDATE_commit && 148 + d_cf_name.len > BCH_NAME_MAX, 149 + c, dirent_cf_name_too_big, 150 + "dirent w/ cf name too big (%u > %u)", 151 + d_cf_name.len, BCH_NAME_MAX); 152 + 153 + bkey_fsck_err_on(d_cf_name.len != strnlen(d_cf_name.name, d_cf_name.len), 154 + c, dirent_stray_data_after_cf_name, 155 + "dirent has stray data after cf name's NUL"); 156 + } 206 157 fsck_err: 207 158 return ret; 208 159 } ··· 236 163 prt_printf(out, " type %s", bch2_d_type_str(d.v->d_type)); 237 164 } 238 165 239 - static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans, 240 - subvol_inum dir, u8 type, 241 - const struct qstr *name, u64 dst) 166 + static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans, 167 + subvol_inum dir, 168 + u8 type, 169 + int name_len, int cf_name_len, 170 + u64 dst) 242 171 { 243 172 struct bkey_i_dirent *dirent; 244 - unsigned u64s = BKEY_U64s + dirent_val_u64s(name->len); 245 - 246 - if (name->len > BCH_NAME_MAX) 247 - return ERR_PTR(-ENAMETOOLONG); 173 + unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len, cf_name_len); 248 174 249 175 BUG_ON(u64s > U8_MAX); 250 176 ··· 262 190 } 263 191 264 192 dirent->v.d_type = type; 193 + dirent->v.d_unused = 0; 194 + dirent->v.d_casefold = cf_name_len ? 
 1 : 0; 265 195 266 - memcpy(dirent->v.d_name, name->name, name->len); 267 - memset(dirent->v.d_name + name->len, 0, 268 - bkey_val_bytes(&dirent->k) - 269 - offsetof(struct bch_dirent, d_name) - 270 - name->len); 196 + return dirent; 197 + } 271 198 272 - EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len); 199 + static void dirent_init_regular_name(struct bkey_i_dirent *dirent, 200 + const struct qstr *name) 201 + { 202 + EBUG_ON(dirent->v.d_casefold); 203 + 204 + memcpy(&dirent->v.d_name[0], name->name, name->len); 205 + memset(&dirent->v.d_name[name->len], 0, 206 + bkey_val_bytes(&dirent->k) - 207 + offsetof(struct bch_dirent, d_name) - 208 + name->len); 209 + } 210 + 211 + static void dirent_init_casefolded_name(struct bkey_i_dirent *dirent, 212 + const struct qstr *name, 213 + const struct qstr *cf_name) 214 + { 215 + EBUG_ON(!dirent->v.d_casefold); 216 + EBUG_ON(!cf_name->len); 217 + 218 + dirent->v.d_cf_name_block.d_name_len = cpu_to_le16(name->len); 219 + dirent->v.d_cf_name_block.d_cf_name_len = cpu_to_le16(cf_name->len); 220 + memcpy(&dirent->v.d_cf_name_block.d_names[0], name->name, name->len); 221 + memcpy(&dirent->v.d_cf_name_block.d_names[name->len], cf_name->name, cf_name->len); 222 + memset(&dirent->v.d_cf_name_block.d_names[name->len + cf_name->len], 0, 223 + bkey_val_bytes(&dirent->k) - 224 + offsetof(struct bch_dirent, d_cf_name_block.d_names) - 225 + name->len - cf_name->len); 226 + 227 + EBUG_ON(bch2_dirent_get_casefold_name(dirent_i_to_s_c(dirent)).len != cf_name->len); 228 + } 229 + 230 + static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans, 231 + subvol_inum dir, 232 + u8 type, 233 + const struct qstr *name, 234 + const struct qstr *cf_name, 235 + u64 dst) 236 + { 237 + struct bkey_i_dirent *dirent; 238 + 239 + if (name->len > BCH_NAME_MAX) 240 + return ERR_PTR(-ENAMETOOLONG); 241 + 242 + dirent = dirent_alloc_key(trans, dir, type, name->len, cf_name ? 
cf_name->len : 0, dst); 243 + if (IS_ERR(dirent)) 244 + return dirent; 245 + 246 + if (cf_name) 247 + dirent_init_casefolded_name(dirent, name, cf_name); 248 + else 249 + dirent_init_regular_name(dirent, name); 250 + 251 + EBUG_ON(bch2_dirent_get_name(dirent_i_to_s_c(dirent)).len != name->len); 273 252 274 253 return dirent; 275 254 } ··· 336 213 struct bkey_i_dirent *dirent; 337 214 int ret; 338 215 339 - dirent = dirent_create_key(trans, dir_inum, type, name, dst_inum); 216 + dirent = dirent_create_key(trans, dir_inum, type, name, NULL, dst_inum); 340 217 ret = PTR_ERR_OR_ZERO(dirent); 341 218 if (ret) 342 219 return ret; ··· 356 233 const struct bch_hash_info *hash_info, 357 234 u8 type, const struct qstr *name, u64 dst_inum, 358 235 u64 *dir_offset, 236 + u64 *i_size, 359 237 enum btree_iter_update_trigger_flags flags) 360 238 { 361 239 struct bkey_i_dirent *dirent; 362 240 int ret; 363 241 364 - dirent = dirent_create_key(trans, dir, type, name, dst_inum); 242 + if (hash_info->cf_encoding) { 243 + struct qstr cf_name; 244 + ret = bch2_casefold(trans, hash_info, name, &cf_name); 245 + if (ret) 246 + return ret; 247 + dirent = dirent_create_key(trans, dir, type, name, &cf_name, dst_inum); 248 + } else { 249 + dirent = dirent_create_key(trans, dir, type, name, NULL, dst_inum); 250 + } 251 + 365 252 ret = PTR_ERR_OR_ZERO(dirent); 366 253 if (ret) 367 254 return ret; 255 + 256 + *i_size += bkey_bytes(&dirent->k); 368 257 369 258 ret = bch2_hash_set(trans, bch2_dirent_hash_desc, hash_info, 370 259 dir, &dirent->k_i, flags); ··· 410 275 } 411 276 412 277 int bch2_dirent_rename(struct btree_trans *trans, 413 - subvol_inum src_dir, struct bch_hash_info *src_hash, 414 - subvol_inum dst_dir, struct bch_hash_info *dst_hash, 278 + subvol_inum src_dir, struct bch_hash_info *src_hash, u64 *src_dir_i_size, 279 + subvol_inum dst_dir, struct bch_hash_info *dst_hash, u64 *dst_dir_i_size, 415 280 const struct qstr *src_name, subvol_inum *src_inum, u64 *src_offset, 416 281 const 
struct qstr *dst_name, subvol_inum *dst_inum, u64 *dst_offset, 417 282 enum bch_rename_mode mode) 418 283 { 284 + struct qstr src_name_lookup, dst_name_lookup; 419 285 struct btree_iter src_iter = { NULL }; 420 286 struct btree_iter dst_iter = { NULL }; 421 287 struct bkey_s_c old_src, old_dst = bkey_s_c_null; ··· 431 295 memset(dst_inum, 0, sizeof(*dst_inum)); 432 296 433 297 /* Lookup src: */ 298 + ret = bch2_maybe_casefold(trans, src_hash, src_name, &src_name_lookup); 299 + if (ret) 300 + goto out; 434 301 old_src = bch2_hash_lookup(trans, &src_iter, bch2_dirent_hash_desc, 435 - src_hash, src_dir, src_name, 302 + src_hash, src_dir, &src_name_lookup, 436 303 BTREE_ITER_intent); 437 304 ret = bkey_err(old_src); 438 305 if (ret) ··· 447 308 goto out; 448 309 449 310 /* Lookup dst: */ 311 + ret = bch2_maybe_casefold(trans, dst_hash, dst_name, &dst_name_lookup); 312 + if (ret) 313 + goto out; 450 314 if (mode == BCH_RENAME) { 451 315 /* 452 316 * Note that we're _not_ checking if the target already exists - ··· 457 315 * correctness: 458 316 */ 459 317 ret = bch2_hash_hole(trans, &dst_iter, bch2_dirent_hash_desc, 460 - dst_hash, dst_dir, dst_name); 318 + dst_hash, dst_dir, &dst_name_lookup); 461 319 if (ret) 462 320 goto out; 463 321 } else { 464 322 old_dst = bch2_hash_lookup(trans, &dst_iter, bch2_dirent_hash_desc, 465 - dst_hash, dst_dir, dst_name, 323 + dst_hash, dst_dir, &dst_name_lookup, 466 324 BTREE_ITER_intent); 467 325 ret = bkey_err(old_dst); 468 326 if (ret) ··· 478 336 *src_offset = dst_iter.pos.offset; 479 337 480 338 /* Create new dst key: */ 481 - new_dst = dirent_create_key(trans, dst_dir, 0, dst_name, 0); 339 + new_dst = dirent_create_key(trans, dst_dir, 0, dst_name, 340 + dst_hash->cf_encoding ? 
&dst_name_lookup : NULL, 0); 482 341 ret = PTR_ERR_OR_ZERO(new_dst); 483 342 if (ret) 484 343 goto out; ··· 489 346 490 347 /* Create new src key: */ 491 348 if (mode == BCH_RENAME_EXCHANGE) { 492 - new_src = dirent_create_key(trans, src_dir, 0, src_name, 0); 349 + new_src = dirent_create_key(trans, src_dir, 0, src_name, 350 + src_hash->cf_encoding ? &src_name_lookup : NULL, 0); 493 351 ret = PTR_ERR_OR_ZERO(new_src); 494 352 if (ret) 495 353 goto out; ··· 549 405 if ((mode == BCH_RENAME_EXCHANGE) && 550 406 new_src->v.d_type == DT_SUBVOL) 551 407 new_src->v.d_parent_subvol = cpu_to_le32(src_dir.subvol); 408 + 409 + if (old_dst.k) 410 + *dst_dir_i_size -= bkey_bytes(old_dst.k); 411 + *src_dir_i_size -= bkey_bytes(old_src.k); 412 + 413 + if (mode == BCH_RENAME_EXCHANGE) 414 + *src_dir_i_size += bkey_bytes(&new_src->k); 415 + *dst_dir_i_size += bkey_bytes(&new_dst->k); 552 416 553 417 ret = bch2_trans_update(trans, &dst_iter, &new_dst->k_i, 0); 554 418 if (ret) ··· 617 465 const struct qstr *name, subvol_inum *inum, 618 466 unsigned flags) 619 467 { 468 + struct qstr lookup_name; 469 + int ret = bch2_maybe_casefold(trans, hash_info, name, &lookup_name); 470 + if (ret) 471 + return ret; 472 + 620 473 struct bkey_s_c k = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc, 621 - hash_info, dir, name, flags); 622 - int ret = bkey_err(k); 474 + hash_info, dir, &lookup_name, flags); 475 + ret = bkey_err(k); 623 476 if (ret) 624 477 goto err; 625 478 ··· 728 571 bch2_bkey_buf_exit(&sk, c); 729 572 730 573 return ret < 0 ? 
ret : 0; 574 + } 575 + 576 + /* fsck */ 577 + 578 + static int lookup_first_inode(struct btree_trans *trans, u64 inode_nr, 579 + struct bch_inode_unpacked *inode) 580 + { 581 + struct btree_iter iter; 582 + struct bkey_s_c k; 583 + int ret; 584 + 585 + for_each_btree_key_norestart(trans, iter, BTREE_ID_inodes, POS(0, inode_nr), 586 + BTREE_ITER_all_snapshots, k, ret) { 587 + if (k.k->p.offset != inode_nr) 588 + break; 589 + if (!bkey_is_inode(k.k)) 590 + continue; 591 + ret = bch2_inode_unpack(k, inode); 592 + goto found; 593 + } 594 + ret = -BCH_ERR_ENOENT_inode; 595 + found: 596 + bch_err_msg(trans->c, ret, "fetching inode %llu", inode_nr); 597 + bch2_trans_iter_exit(trans, &iter); 598 + return ret; 599 + } 600 + 601 + int bch2_fsck_remove_dirent(struct btree_trans *trans, struct bpos pos) 602 + { 603 + struct bch_fs *c = trans->c; 604 + struct btree_iter iter; 605 + struct bch_inode_unpacked dir_inode; 606 + struct bch_hash_info dir_hash_info; 607 + int ret; 608 + 609 + ret = lookup_first_inode(trans, pos.inode, &dir_inode); 610 + if (ret) 611 + goto err; 612 + 613 + dir_hash_info = bch2_hash_info_init(c, &dir_inode); 614 + 615 + bch2_trans_iter_init(trans, &iter, BTREE_ID_dirents, pos, BTREE_ITER_intent); 616 + 617 + ret = bch2_btree_iter_traverse(&iter) ?: 618 + bch2_hash_delete_at(trans, bch2_dirent_hash_desc, 619 + &dir_hash_info, &iter, 620 + BTREE_UPDATE_internal_snapshot_node); 621 + bch2_trans_iter_exit(trans, &iter); 622 + err: 623 + bch_err_fn(c, ret); 624 + return ret; 731 625 }
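
In the dirent.c changes above, lookups, rename and hashing all go through a "lookup name": the name as given for ordinary directories, and its casefolded form when the directory has a casefold encoding. The sketch below illustrates that idea in userspace under loud assumptions: `fake_casefold()` uses ASCII tolower() as a stand-in for utf8_casefold(), `struct fake_hash_info` reduces `cf_encoding` to a boolean, and the helper copies the name where the kernel code simply aliases the caller's qstr.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: "casefold" stands in for hash_info->cf_encoding being set. */
struct fake_hash_info { bool casefold; };

/* ASCII-only stand-in for utf8_casefold(); returns the folded length or -1. */
static int fake_casefold(const char *in, size_t len, char *out, size_t out_size)
{
	if (len > out_size)
		return -1;
	for (size_t i = 0; i < len; i++)
		out[i] = (char) tolower((unsigned char) in[i]);
	return (int) len;
}

/* Pass the name through unchanged unless the directory is casefolded. */
static int fake_maybe_casefold(const struct fake_hash_info *info,
			       const char *in, size_t len,
			       char *out, size_t out_size)
{
	assert(info != NULL);

	if (!info->casefold) {
		if (len > out_size)
			return -1;
		memcpy(out, in, len);
		return (int) len;
	}
	return fake_casefold(in, len, out, out_size);
}

/* Compare a query against the lookup name stored for an existing entry. */
static bool fake_lookup_matches(const struct fake_hash_info *info,
				const char *stored, size_t stored_len,
				const char *query, size_t query_len)
{
	char buf[256];
	int n = fake_maybe_casefold(info, query, query_len, buf, sizeof(buf));

	return n >= 0 &&
	       (size_t) n == stored_len &&
	       memcmp(buf, stored, stored_len) == 0;
}
```

Folding the query once, before the hash lookup, is what lets the comparison stay a plain byte-wise equality test on the stored casefolded name.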

+11 -6

fs/bcachefs/dirent.h

··· 25 25 26 26 struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d); 27 27 28 - static inline unsigned dirent_val_u64s(unsigned len) 28 + static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len) 29 29 { 30 - return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len, 31 - sizeof(u64)); 30 + unsigned bytes = cf_len 31 + ? offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len 32 + : offsetof(struct bch_dirent, d_name) + len; 33 + 34 + return DIV_ROUND_UP(bytes, sizeof(u64)); 32 35 } 33 36 34 37 int bch2_dirent_read_target(struct btree_trans *, subvol_inum, ··· 50 47 enum btree_iter_update_trigger_flags); 51 48 int bch2_dirent_create(struct btree_trans *, subvol_inum, 52 49 const struct bch_hash_info *, u8, 53 - const struct qstr *, u64, u64 *, 50 + const struct qstr *, u64, u64 *, u64 *, 54 51 enum btree_iter_update_trigger_flags); 55 52 56 53 static inline unsigned vfs_d_type(unsigned type) ··· 65 62 }; 66 63 67 64 int bch2_dirent_rename(struct btree_trans *, 68 - subvol_inum, struct bch_hash_info *, 69 - subvol_inum, struct bch_hash_info *, 65 + subvol_inum, struct bch_hash_info *, u64 *, 66 + subvol_inum, struct bch_hash_info *, u64 *, 70 67 const struct qstr *, subvol_inum *, u64 *, 71 68 const struct qstr *, subvol_inum *, u64 *, 72 69 enum bch_rename_mode); ··· 81 78 int bch2_empty_dir_snapshot(struct btree_trans *, u64, u32, u32); 82 79 int bch2_empty_dir_trans(struct btree_trans *, subvol_inum); 83 80 int bch2_readdir(struct bch_fs *, subvol_inum, struct dir_context *); 81 + 82 + int bch2_fsck_remove_dirent(struct btree_trans *, struct bpos); 84 83 85 84 #endif /* _BCACHEFS_DIRENT_H */
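
The new two-argument dirent_val_u64s() sizes the value in 64-bit words, picking a different header offset depending on whether a casefolded name is stored. The following sketch reproduces only the arithmetic; the two offsets are assumptions chosen for illustration (the kernel takes them from offsetof() on struct bch_dirent), and `fake_dirent_val_u64s()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/*
 * Assumed header sizes, for illustration only. The real values are
 * offsetof(struct bch_dirent, d_name) and
 * offsetof(struct bch_dirent, d_cf_name_block.d_names).
 */
enum {
	FAKE_NAME_OFFSET	= 9,
	FAKE_CF_NAMES_OFFSET	= 14,
};

/* Number of u64s needed for a dirent value holding one or two names. */
static unsigned fake_dirent_val_u64s(unsigned len, unsigned cf_len)
{
	unsigned bytes;

	assert(len > 0);	/* empty names are rejected by validation */

	bytes = cf_len
		? FAKE_CF_NAMES_OFFSET + len + cf_len
		: FAKE_NAME_OFFSET + len;

	return (unsigned) DIV_ROUND_UP(bytes, sizeof(uint64_t));
}
```

With these assumed offsets, a 3-byte name alone needs 12 bytes (two u64s), while the same name plus a 3-byte casefolded copy needs 20 bytes (three u64s).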

+18 -2

fs/bcachefs/dirent_format.h

··· 29 29 * Copy of mode bits 12-15 from the target inode - so userspace can get 30 30 * the filetype without having to do a stat() 31 31 */ 32 - __u8 d_type; 32 + #if defined(__LITTLE_ENDIAN_BITFIELD) 33 + __u8 d_type:5, 34 + d_unused:2, 35 + d_casefold:1; 36 + #elif defined(__BIG_ENDIAN_BITFIELD) 37 + __u8 d_casefold:1, 38 + d_unused:2, 39 + d_type:5; 40 + #endif 33 41 34 - __u8 d_name[]; 42 + union { 43 + struct { 44 + __u8 d_pad; 45 + __le16 d_name_len; 46 + __le16 d_cf_name_len; 47 + __u8 d_names[]; 48 + } d_cf_name_block __packed; 49 + __DECLARE_FLEX_ARRAY(__u8, d_name); 50 + } __packed; 35 51 } __packed __aligned(8); 36 52 37 53 #define DT_SUBVOL 16
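
The new d_cf_name_block stores two 16-bit lengths followed by both names back to back, with the casefolded name starting where the original name ends. Below is a small userspace sketch of packing and reading such a block. `struct fake_cf_block`, its 60-byte capacity and all helper names are hypothetical; the sketch stores the lengths explicitly as little-endian bytes to match the `__le16` field types, and zeroes the tail by subtracting both name lengths from the capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat block: two little-endian lengths, then both names. */
struct fake_cf_block {
	uint8_t name_len_le[2];
	uint8_t cf_name_len_le[2];
	uint8_t names[60];
};

static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t) (v & 0xff);
	p[1] = (uint8_t) (v >> 8);
}

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t) (p[0] | (p[1] << 8));
}

/* Returns 0 on success, -1 if the two names do not fit in the block. */
static int fake_cf_block_init(struct fake_cf_block *b,
			      const char *name, size_t name_len,
			      const char *cf_name, size_t cf_len)
{
	assert(b != NULL);

	if (name_len + cf_len > sizeof(b->names))
		return -1;

	put_le16(b->name_len_le, (uint16_t) name_len);
	put_le16(b->cf_name_len_le, (uint16_t) cf_len);
	memcpy(&b->names[0], name, name_len);
	memcpy(&b->names[name_len], cf_name, cf_len);

	/* Zero the remainder: capacity minus both names. */
	memset(&b->names[name_len + cf_len], 0,
	       sizeof(b->names) - name_len - cf_len);
	return 0;
}

/* The casefolded (lookup) name starts right after the original name. */
static const uint8_t *fake_cf_block_lookup_name(const struct fake_cf_block *b,
						size_t *len)
{
	*len = get_le16(b->cf_name_len_le);
	return &b->names[get_le16(b->name_len_le)];
}
```

Keeping the trailing bytes zeroed matters because validation compares the name length against strnlen() of the stored bytes and flags stray data after the name.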

+18

fs/bcachefs/disk_accounting.h

···

 int bch2_disk_accounting_mod(struct btree_trans *, struct disk_accounting_pos *,
 			     s64 *, unsigned, bool);
+
+#define disk_accounting_key_init(_k, _type, ...) \
+do { \
+	memset(&(_k), 0, sizeof(_k)); \
+	(_k).type = BCH_DISK_ACCOUNTING_##_type; \
+	(_k)._type = (struct bch_acct_##_type) { __VA_ARGS__ }; \
+} while (0)
+
+#define bch2_disk_accounting_mod2_nr(_trans, _gc, _v, _nr, ...) \
+({ \
+	struct disk_accounting_pos pos; \
+	disk_accounting_key_init(pos, __VA_ARGS__); \
+	bch2_disk_accounting_mod(trans, &pos, _v, _nr, _gc); \
+})
+
+#define bch2_disk_accounting_mod2(_trans, _gc, _v, ...) \
+	bch2_disk_accounting_mod2_nr(_trans, _gc, _v, ARRAY_SIZE(_v), __VA_ARGS__)
+
 int bch2_mod_dev_cached_sectors(struct btree_trans *, unsigned, s64, bool);

 int bch2_accounting_validate(struct bch_fs *, struct bkey_s_c,
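
`disk_accounting_key_init()` relies on token pasting: one `_type` argument selects the enum constant, the union member, and the struct type, and the remaining arguments become designated initializers. The sketch below reproduces that pattern in a self-contained userspace model; every `acct_*` and `ACCT_TYPE_*` name is a hypothetical stand-in, not a bcachefs identifier. It needs C11 for the anonymous union.

```c
#include <stdint.h>
#include <string.h>

enum acct_type {
	ACCT_TYPE_persistent_reserved,
	ACCT_TYPE_dev_data_type,
};

struct acct_persistent_reserved { uint8_t nr_replicas; };
struct acct_dev_data_type	{ uint8_t dev; uint8_t data_type; };

struct acct_key {
	uint8_t type;
	union {
		struct acct_persistent_reserved	persistent_reserved;
		struct acct_dev_data_type	dev_data_type;
	};
};

/*
 * Zero the key, then derive the enum constant, the union member and the
 * struct type from the single _type token.
 */
#define acct_key_init(_k, _type, ...)					\
do {									\
	memset(&(_k), 0, sizeof(_k));					\
	(_k).type = ACCT_TYPE_##_type;					\
	(_k)._type = (struct acct_##_type) { __VA_ARGS__ };		\
} while (0)

/* Example caller: build a per-device, per-data-type accounting key. */
static struct acct_key make_dev_key(uint8_t dev, uint8_t data_type)
{
	struct acct_key k;

	acct_key_init(k, dev_data_type, .dev = dev, .data_type = data_type);
	return k;
}
```

The pattern only works when struct names follow one scheme, which is presumably why the same merge renames `bch_nr_inodes`, `bch_persistent_reserved` and `bch_dev_data_type` to carry the `bch_acct_` prefix.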

+6 -6

fs/bcachefs/disk_accounting_format.h

···
 	BCH_DISK_ACCOUNTING_TYPE_NR,
 };

-struct bch_nr_inodes {
+struct bch_acct_nr_inodes {
 };

-struct bch_persistent_reserved {
+struct bch_acct_persistent_reserved {
 	__u8			nr_replicas;
 };

-struct bch_dev_data_type {
+struct bch_acct_dev_data_type {
 	__u8			dev;
 	__u8			data_type;
 };
···
 	struct {
 		__u8				type;
 		union {
-		struct bch_nr_inodes		nr_inodes;
-		struct bch_persistent_reserved	persistent_reserved;
+		struct bch_acct_nr_inodes	nr_inodes;
+		struct bch_acct_persistent_reserved	persistent_reserved;
 		struct bch_replicas_entry_v1	replicas;
-		struct bch_dev_data_type	dev_data_type;
+		struct bch_acct_dev_data_type	dev_data_type;
 		struct bch_acct_compression	compression;
 		struct bch_acct_snapshot	snapshot;
 		struct bch_acct_btree		btree;

+162 -320

fs/bcachefs/ec.c

···
 #include "io_read.h"
 #include "io_write.h"
 #include "keylist.h"
+#include "lru.h"
 #include "recovery.h"
 #include "replicas.h"
 #include "super-io.h"
···
 	struct bch_dev		*ca;
 	struct ec_stripe_buf	*buf;
 	size_t			idx;
+	u64			submit_time;
 	struct bio		bio;
 };

···
 	struct bpos bucket = PTR_BUCKET_POS(ca, ptr);

 	if (flags & BTREE_TRIGGER_transactional) {
+		struct extent_ptr_decoded p = {
+			.ptr = *ptr,
+			.crc = bch2_extent_crc_unpack(s.k, NULL),
+		};
+		struct bkey_i_backpointer bp;
+		bch2_extent_ptr_to_bp(c, BTREE_ID_stripes, 0, s.s_c, p,
+				      (const union bch_extent_entry *) ptr, &bp);
+
 		struct bkey_i_alloc_v4 *a =
 			bch2_trans_start_alloc_update(trans, bucket, 0);
-		ret = PTR_ERR_OR_ZERO(a) ?:
-			__mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &a->v, flags);
+		ret = PTR_ERR_OR_ZERO(a) ?:
+			__mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &a->v, flags) ?:
+			bch2_bucket_backpointer_mod(trans, s.s_c, &bp,
+						    !(flags & BTREE_TRIGGER_overwrite));
+		if (ret)
+			goto err;
 	}

 	if (flags & BTREE_TRIGGER_gc) {
···
 	return 0;
 }

-static inline void stripe_to_mem(struct stripe *m, const struct bch_stripe *s)
-{
-	m->sectors	= le16_to_cpu(s->sectors);
-	m->algorithm	= s->algorithm;
-	m->nr_blocks	= s->nr_blocks;
-	m->nr_redundant	= s->nr_redundant;
-	m->disk_label	= s->disk_label;
-	m->blocks_nonempty = 0;
-
-	for (unsigned i = 0; i < s->nr_blocks; i++)
-		m->blocks_nonempty += !!stripe_blockcount_get(s, i);
-}
-
 int bch2_trigger_stripe(struct btree_trans *trans,
 			enum btree_id btree, unsigned level,
 			struct bkey_s_c old, struct bkey_s _new,
···
 	       (new_s->nr_blocks	!= old_s->nr_blocks ||
 		new_s->nr_redundant	!= old_s->nr_redundant));

+	if (flags & BTREE_TRIGGER_transactional) {
+		int ret = bch2_lru_change(trans,
+					  BCH_LRU_STRIPE_FRAGMENTATION,
+					  idx,
+					  stripe_lru_pos(old_s),
+					  stripe_lru_pos(new_s));
+		if (ret)
+			return ret;
+	}

 	if (flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) {
 		/*
···
 		int ret = mark_stripe_buckets(trans, old, new, flags);
 		if (ret)
 			return ret;
-	}
-
-	if (flags & BTREE_TRIGGER_atomic) {
-		struct stripe *m = genradix_ptr(&c->stripes, idx);
-
-		if (!m) {
-			struct printbuf buf1 = PRINTBUF;
-			struct printbuf buf2 = PRINTBUF;
-
-			bch2_bkey_val_to_text(&buf1, c, old);
-			bch2_bkey_val_to_text(&buf2, c, new);
-			bch_err_ratelimited(c, "error marking nonexistent stripe %llu while marking\n"
-					    "old %s\n"
-					    "new %s", idx, buf1.buf, buf2.buf);
-			printbuf_exit(&buf2);
-			printbuf_exit(&buf1);
-			bch2_inconsistent_error(c);
-			return -1;
-		}
-
-		if (!new_s) {
-			bch2_stripes_heap_del(c, m, idx);
-
-			memset(m, 0, sizeof(*m));
-		} else {
-			stripe_to_mem(m, new_s);
-
-			if (!old_s)
-				bch2_stripes_heap_insert(c, m, idx);
-			else
-				bch2_stripes_heap_update(c, m, idx);
-		}
 	}

 	return 0;
···
 	struct bch_dev *ca = ec_bio->ca;
 	struct closure *cl = bio->bi_private;

-	if (bch2_dev_io_err_on(bio->bi_status, ca,
-			       bio_data_dir(bio)
-			       ? BCH_MEMBER_ERROR_write
-			       : BCH_MEMBER_ERROR_read,
-			       "erasure coding %s error: %s",
+	bch2_account_io_completion(ca, bio_data_dir(bio),
+				   ec_bio->submit_time, !bio->bi_status);
+
+	if (bio->bi_status) {
+		bch_err_dev_ratelimited(ca, "erasure coding %s error: %s",
 			       str_write_read(bio_data_dir(bio)),
-			       bch2_blk_status_to_str(bio->bi_status)))
+			       bch2_blk_status_to_str(bio->bi_status));
 		clear_bit(ec_bio->idx, ec_bio->buf->valid);
+	}

 	int stale = dev_ptr_stale(ca, ptr);
 	if (stale) {
···
 		ec_bio->ca			= ca;
 		ec_bio->buf			= buf;
 		ec_bio->idx			= idx;
+		ec_bio->submit_time		= local_clock();

 		ec_bio->bio.bi_iter.bi_sector	= ptr->offset + buf->offset + (offset >> 9);
 		ec_bio->bio.bi_end_io		= ec_block_endio;
···

 static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx, gfp_t gfp)
 {
-	ec_stripes_heap n, *h = &c->ec_stripes_heap;
-
-	if (idx >= h->size) {
-		if (!init_heap(&n, max(1024UL, roundup_pow_of_two(idx + 1)), gfp))
-			return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
-
-		mutex_lock(&c->ec_stripes_heap_lock);
-		if (n.size > h->size) {
-			memcpy(n.data, h->data, h->nr * sizeof(h->data[0]));
-			n.nr = h->nr;
-			swap(*h, n);
-		}
-		mutex_unlock(&c->ec_stripes_heap_lock);
-
-		free_heap(&n);
-	}
-
-	if (!genradix_ptr_alloc(&c->stripes, idx, gfp))
-		return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
-
 	if (c->gc_pos.phase != GC_PHASE_not_running &&
 	    !genradix_ptr_alloc(&c->gc_stripes, idx, gfp))
 		return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
···
 	s->idx = 0;
 }

-/* Heap of all existing stripes, ordered by blocks_nonempty */
-
-static u64 stripe_idx_to_delete(struct bch_fs *c)
-{
-	ec_stripes_heap *h = &c->ec_stripes_heap;
-
-	lockdep_assert_held(&c->ec_stripes_heap_lock);
-
-	if (h->nr &&
-	    h->data[0].blocks_nonempty == 0 &&
-	    !bch2_stripe_is_open(c, h->data[0].idx))
-		return h->data[0].idx;
-
-	return 0;
-}
-
-static inline void ec_stripes_heap_set_backpointer(ec_stripes_heap *h,
-						   size_t i)
-{
-	struct bch_fs *c = container_of(h, struct bch_fs, ec_stripes_heap);
-
-	genradix_ptr(&c->stripes, h->data[i].idx)->heap_idx = i;
-}
-
-static inline bool ec_stripes_heap_cmp(const void *l, const void *r, void __always_unused *args)
-{
-	struct ec_stripe_heap_entry *_l = (struct ec_stripe_heap_entry *)l;
-	struct ec_stripe_heap_entry *_r = (struct ec_stripe_heap_entry *)r;
-
-	return ((_l->blocks_nonempty > _r->blocks_nonempty) <
-		(_l->blocks_nonempty < _r->blocks_nonempty));
-}
-
-static inline void ec_stripes_heap_swap(void *l, void *r, void *h)
-{
-	struct ec_stripe_heap_entry *_l = (struct ec_stripe_heap_entry *)l;
-	struct ec_stripe_heap_entry *_r = (struct ec_stripe_heap_entry *)r;
-	ec_stripes_heap *_h = (ec_stripes_heap *)h;
-	size_t i = _l - _h->data;
-	size_t j = _r - _h->data;
-
-	swap(*_l, *_r);
-
-	ec_stripes_heap_set_backpointer(_h, i);
-	ec_stripes_heap_set_backpointer(_h, j);
-}
-
-static const struct min_heap_callbacks callbacks = {
-	.less = ec_stripes_heap_cmp,
-	.swp = ec_stripes_heap_swap,
-};
-
-static void heap_verify_backpointer(struct bch_fs *c, size_t idx)
-{
-	ec_stripes_heap *h = &c->ec_stripes_heap;
-	struct stripe *m = genradix_ptr(&c->stripes, idx);
-
-	BUG_ON(m->heap_idx >= h->nr);
-	BUG_ON(h->data[m->heap_idx].idx != idx);
-}
-
-void bch2_stripes_heap_del(struct bch_fs *c,
-			   struct stripe *m, size_t idx)
-{
-	mutex_lock(&c->ec_stripes_heap_lock);
-	heap_verify_backpointer(c, idx);
-
-	min_heap_del(&c->ec_stripes_heap, m->heap_idx, &callbacks, &c->ec_stripes_heap);
-	mutex_unlock(&c->ec_stripes_heap_lock);
-}
-
-void bch2_stripes_heap_insert(struct bch_fs *c,
-			      struct stripe *m, size_t idx)
-{
-	mutex_lock(&c->ec_stripes_heap_lock);
-	BUG_ON(min_heap_full(&c->ec_stripes_heap));
-
-	genradix_ptr(&c->stripes, idx)->heap_idx = c->ec_stripes_heap.nr;
-	min_heap_push(&c->ec_stripes_heap, &((struct ec_stripe_heap_entry) {
-			.idx = idx,
-			.blocks_nonempty = m->blocks_nonempty,
-		}),
-		&callbacks,
-		&c->ec_stripes_heap);
-
-	heap_verify_backpointer(c, idx);
-	mutex_unlock(&c->ec_stripes_heap_lock);
-}
-
-void bch2_stripes_heap_update(struct bch_fs *c,
-			      struct stripe *m, size_t idx)
-{
-	ec_stripes_heap *h = &c->ec_stripes_heap;
-	bool do_deletes;
-	size_t i;
-
-	mutex_lock(&c->ec_stripes_heap_lock);
-	heap_verify_backpointer(c, idx);
-
-	h->data[m->heap_idx].blocks_nonempty = m->blocks_nonempty;
-
-	i = m->heap_idx;
-	min_heap_sift_up(h, i, &callbacks, &c->ec_stripes_heap);
-	min_heap_sift_down(h, i, &callbacks, &c->ec_stripes_heap);
-
-	heap_verify_backpointer(c, idx);
-
-	do_deletes = stripe_idx_to_delete(c) != 0;
-	mutex_unlock(&c->ec_stripes_heap_lock);
-
-	if (do_deletes)
-		bch2_do_stripe_deletes(c);
-}
-
 /* stripe deletion */

 static int ec_stripe_delete(struct btree_trans *trans, u64 idx)
 {
-	struct bch_fs *c = trans->c;
 	struct btree_iter iter;
-	struct bkey_s_c k;
-	struct bkey_s_c_stripe s;
-	int ret;
-
-	k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_stripes, POS(0, idx),
-			       BTREE_ITER_intent);
-	ret = bkey_err(k);
+	struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter,
+					       BTREE_ID_stripes, POS(0, idx),
+					       BTREE_ITER_intent);
+	int ret = bkey_err(k);
 	if (ret)
 		goto err;

-	if (k.k->type != KEY_TYPE_stripe) {
-		bch2_fs_inconsistent(c, "attempting to delete nonexistent stripe %llu", idx);
-		ret = -EINVAL;
-		goto err;
-	}
-
-	s = bkey_s_c_to_stripe(k);
-	for (unsigned i = 0; i < s.v->nr_blocks; i++)
-		if (stripe_blockcount_get(s.v, i)) {
-			struct printbuf buf = PRINTBUF;
-
-			bch2_bkey_val_to_text(&buf, c, k);
-			bch2_fs_inconsistent(c, "attempting to delete nonempty stripe %s", buf.buf);
-			printbuf_exit(&buf);
-			ret = -EINVAL;
-			goto err;
-		}
-
-	ret = bch2_btree_delete_at(trans, &iter, 0);
+	/*
+	 * We expect write buffer races here
+	 * Important: check stripe_is_open with stripe key locked:
+	 */
+	if (k.k->type == KEY_TYPE_stripe &&
+	    !bch2_stripe_is_open(trans->c, idx) &&
+	    stripe_lru_pos(bkey_s_c_to_stripe(k).v) == 1)
+		ret = bch2_btree_delete_at(trans, &iter, 0);
 err:
 	bch2_trans_iter_exit(trans, &iter);
 	return ret;
 }

+/*
+ * XXX
+ * can we kill this and delete stripes from the trigger?
+ */
 static void ec_stripe_delete_work(struct work_struct *work)
 {
 	struct bch_fs *c =
 		container_of(work, struct bch_fs, ec_stripe_delete_work);

-	while (1) {
-		mutex_lock(&c->ec_stripes_heap_lock);
-		u64 idx = stripe_idx_to_delete(c);
-		mutex_unlock(&c->ec_stripes_heap_lock);
-
-		if (!idx)
-			break;
-
-		int ret = bch2_trans_commit_do(c, NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
-					       ec_stripe_delete(trans, idx));
-		bch_err_fn(c, ret);
-		if (ret)
-			break;
-	}
-
+	bch2_trans_run(c,
+		bch2_btree_write_buffer_tryflush(trans) ?:
+		for_each_btree_key_max_commit(trans, lru_iter, BTREE_ID_lru,
+				lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 1, 0),
+				lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 1, LRU_TIME_MAX),
+				0, lru_k,
+				NULL, NULL,
+				BCH_TRANS_COMMIT_no_enospc, ({
+			ec_stripe_delete(trans, lru_k.k->p.offset);
+		})));
 	bch2_write_ref_put(c, BCH_WRITE_REF_stripe_delete);
 }

···

 		bch2_fs_inconsistent(c, "%s", buf.buf);
 		printbuf_exit(&buf);
-		return -EIO;
+		return -BCH_ERR_erasure_coding_found_btree_node;
 	}

 	k = bch2_backpointer_get_key(trans, bp, &iter, BTREE_ITER_intent, last_flushed);
···

 	struct bch_dev *ca = bch2_dev_tryget(c, ptr.dev);
 	if (!ca)
-		return -EIO;
+		return -BCH_ERR_ENOENT_dev_not_found;

 	struct bpos bucket_pos = PTR_BUCKET_POS(ca, &ptr);

···
 		if (bp_k.k->type != KEY_TYPE_backpointer)
 			continue;

+		struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(bp_k);
+		if (bp.v->btree_id == BTREE_ID_stripes)
+			continue;
+
 		ec_stripe_update_extent(trans, ca, bucket_pos, ptr.gen, s,
-					bkey_s_c_to_backpointer(bp_k), &last_flushed);
+					bp, &last_flushed);
 	}));

 	bch2_bkey_buf_exit(&last_flushed, c);
···
 {
 	struct btree_trans *trans = bch2_trans_get(c);
 	struct bch_stripe *v = &bkey_i_to_stripe(&s->key)->v;
-	unsigned i, nr_data = v->nr_blocks - v->nr_redundant;
-	int ret = 0;
+	unsigned nr_data = v->nr_blocks - v->nr_redundant;

-	ret = bch2_btree_write_buffer_flush_sync(trans);
+	int ret = bch2_btree_write_buffer_flush_sync(trans);
 	if (ret)
 		goto err;

-	for (i = 0; i < nr_data; i++) {
+	for (unsigned i = 0; i < nr_data; i++) {
 		ret = ec_stripe_update_bucket(trans, s, i);
 		if (ret)
 			break;
 	}
 err:
 	bch2_trans_put(trans);
-
 	return ret;
 }

···
 	if (s->err) {
 		if (!bch2_err_matches(s->err, EROFS))
 			bch_err(c, "error creating stripe: error writing data buckets");
+		ret = s->err;
 		goto err;
 	}

···

 		if (ec_do_recov(c, &s->existing_stripe)) {
 			bch_err(c, "error creating stripe: error reading existing stripe");
+			ret = -BCH_ERR_ec_block_read;
 			goto err;
 		}

···

 	if (ec_nr_failed(&s->new_stripe)) {
 		bch_err(c, "error creating stripe: error writing redundancy buckets");
+		ret = -BCH_ERR_ec_block_write;
 		goto err;
 	}

···
 	if (ret)
 		goto err;
 err:
+	trace_stripe_create(c, s->idx, ret);
+
 	bch2_disk_reservation_put(c, &s->res);

 	for (i = 0; i < v->nr_blocks; i++)
···
 	ec_stripe_new_set_pending(c, h);
 }

-void bch2_ec_bucket_cancel(struct bch_fs *c, struct open_bucket *ob)
+void bch2_ec_bucket_cancel(struct bch_fs *c, struct open_bucket *ob, int err)
 {
 	struct ec_stripe_new *s = ob->ec;

-	s->err = -EIO;
+	s->err = err;
 }

 void *bch2_writepoint_ec_buf(struct bch_fs *c, struct write_point *wp)
···
 	return 0;
 }

-static s64 get_existing_stripe(struct bch_fs *c,
-			       struct ec_stripe_head *head)
+static int __get_existing_stripe(struct btree_trans *trans,
+				 struct ec_stripe_head *head,
+				 struct ec_stripe_buf *stripe,
+				 u64 idx)
 {
-	ec_stripes_heap *h = &c->ec_stripes_heap;
-	struct stripe *m;
-	size_t heap_idx;
-	u64 stripe_idx;
-	s64 ret = -1;
+	struct bch_fs *c = trans->c;

-	if (may_create_new_stripe(c))
-		return -1;
+	struct btree_iter iter;
+	struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter,
+					       BTREE_ID_stripes, POS(0, idx), 0);
+	int ret = bkey_err(k);
+	if (ret)
+		goto err;

-	mutex_lock(&c->ec_stripes_heap_lock);
-	for (heap_idx = 0; heap_idx < h->nr; heap_idx++) {
-		/* No blocks worth reusing, stripe will just be deleted: */
-		if (!h->data[heap_idx].blocks_nonempty)
-			continue;
+	/* We expect write buffer races here */
+	if (k.k->type != KEY_TYPE_stripe)
+		goto out;

-		stripe_idx = h->data[heap_idx].idx;
+	struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
+	if (stripe_lru_pos(s.v) <= 1)
+		goto out;

-		m = genradix_ptr(&c->stripes, stripe_idx);
-
-		if (m->disk_label	== head->disk_label &&
-		    m->algorithm	== head->algo &&
-		    m->nr_redundant	== head->redundancy &&
-		    m->sectors		== head->blocksize &&
-		    m->blocks_nonempty	< m->nr_blocks - m->nr_redundant &&
-		    bch2_try_open_stripe(c, head->s, stripe_idx)) {
-			ret = stripe_idx;
-			break;
-		}
+	if (s.v->disk_label		== head->disk_label &&
+	    s.v->algorithm		== head->algo &&
+	    s.v->nr_redundant		== head->redundancy &&
+	    le16_to_cpu(s.v->sectors)	== head->blocksize &&
+	    bch2_try_open_stripe(c, head->s, idx)) {
+		bkey_reassemble(&stripe->key, k);
+		ret = 1;
 	}
-	mutex_unlock(&c->ec_stripes_heap_lock);
+out:
+	bch2_set_btree_iter_dontneed(&iter);
+err:
+	bch2_trans_iter_exit(trans, &iter);
 	return ret;
 }

···
 				       struct ec_stripe_new *s)
 {
 	struct bch_fs *c = trans->c;
-	s64 idx;
-	int ret;

 	/*
 	 * If we can't allocate a new stripe, and there's no stripes with empty
 	 * blocks for us to reuse, that means we have to wait on copygc:
 	 */
-	idx = get_existing_stripe(c, h);
-	if (idx < 0)
-		return -BCH_ERR_stripe_alloc_blocked;
+	if (may_create_new_stripe(c))
+		return -1;

-	ret = get_stripe_key_trans(trans, idx, &s->existing_stripe);
-	bch2_fs_fatal_err_on(ret && !bch2_err_matches(ret, BCH_ERR_transaction_restart), c,
-			     "reading stripe key: %s", bch2_err_str(ret));
-	if (ret) {
-		bch2_stripe_close(c, s);
-		return ret;
+	struct btree_iter lru_iter;
+	struct bkey_s_c lru_k;
+	int ret = 0;
+
+	for_each_btree_key_max_norestart(trans, lru_iter, BTREE_ID_lru,
+			lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 2, 0),
+			lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 2, LRU_TIME_MAX),
+			0, lru_k, ret) {
+		ret = __get_existing_stripe(trans, h, &s->existing_stripe, lru_k.k->p.offset);
+		if (ret)
+			break;
 	}
+	bch2_trans_iter_exit(trans, &lru_iter);
+	if (!ret)
+		ret = -BCH_ERR_stripe_alloc_blocked;
+	if (ret == 1)
+		ret = 0;
+	if (ret)
+		return ret;

 	return init_new_stripe_from_existing(c, s);
 }
···

 int bch2_stripes_read(struct bch_fs *c)
 {
-	int ret = bch2_trans_run(c,
-		for_each_btree_key(trans, iter, BTREE_ID_stripes, POS_MIN,
-				   BTREE_ITER_prefetch, k, ({
-			if (k.k->type != KEY_TYPE_stripe)
-				continue;
-
-			ret = __ec_stripe_mem_alloc(c, k.k->p.offset, GFP_KERNEL);
-			if (ret)
-				break;
-
-			struct stripe *m = genradix_ptr(&c->stripes, k.k->p.offset);
-
-			stripe_to_mem(m, bkey_s_c_to_stripe(k).v);
-
-			bch2_stripes_heap_insert(c, m, k.k->p.offset);
-			0;
-		})));
-	bch_err_fn(c, ret);
-	return ret;
-}
-
-void bch2_stripes_heap_to_text(struct printbuf *out, struct bch_fs *c)
-{
-	ec_stripes_heap *h = &c->ec_stripes_heap;
-	struct stripe *m;
-	size_t i;
-
-	mutex_lock(&c->ec_stripes_heap_lock);
-	for (i = 0; i < min_t(size_t, h->nr, 50); i++) {
-		m = genradix_ptr(&c->stripes, h->data[i].idx);
-
-		prt_printf(out, "%zu %u/%u+%u", h->data[i].idx,
-			   h->data[i].blocks_nonempty,
-			   m->nr_blocks - m->nr_redundant,
-			   m->nr_redundant);
-		if (bch2_stripe_is_open(c, h->data[i].idx))
-			prt_str(out, " open");
-		prt_newline(out);
-	}
-	mutex_unlock(&c->ec_stripes_heap_lock);
+	return 0;
 }

 static void bch2_new_stripe_to_text(struct printbuf *out, struct bch_fs *c,
···

 	BUG_ON(!list_empty(&c->ec_stripe_new_list));

-	free_heap(&c->ec_stripes_heap);
-	genradix_free(&c->stripes);
 	bioset_exit(&c->ec_bioset);
 }

 void bch2_fs_ec_init_early(struct bch_fs *c)
 {
 	spin_lock_init(&c->ec_stripes_new_lock);
-	mutex_init(&c->ec_stripes_heap_lock);

 	INIT_LIST_HEAD(&c->ec_stripe_head_list);
 	mutex_init(&c->ec_stripe_head_lock);
···
 {
 	return bioset_init(&c->ec_bioset, 1, offsetof(struct ec_bio, bio),
 			   BIOSET_NEED_BVECS);
+}
+
+static int bch2_check_stripe_to_lru_ref(struct btree_trans *trans,
+					struct bkey_s_c k,
+					struct bkey_buf *last_flushed)
+{
+	if (k.k->type != KEY_TYPE_stripe)
+		return 0;
+
+	struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
+
+	u64 lru_idx = stripe_lru_pos(s.v);
+	if (lru_idx) {
+		int ret = bch2_lru_check_set(trans, BCH_LRU_STRIPE_FRAGMENTATION,
+					     k.k->p.offset, lru_idx, k, last_flushed);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+int bch2_check_stripe_to_lru_refs(struct bch_fs *c)
+{
+	struct bkey_buf last_flushed;
+
+	bch2_bkey_buf_init(&last_flushed);
+	bkey_init(&last_flushed.k->k);
+
+	int ret = bch2_trans_run(c,
+		for_each_btree_key_commit(trans, iter, BTREE_ID_stripes,
+				POS_MIN, BTREE_ITER_prefetch, k,
+				NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
+			bch2_check_stripe_to_lru_ref(trans, k, &last_flushed)));
+
+	bch2_bkey_buf_exit(&last_flushed, c);
+	bch_err_fn(c, ret);
+	return ret;
 }

+40 -6

fs/bcachefs/ec.h

···
 	memcpy(stripe_csum(s, block, csum_idx), &csum, bch_crc_bytes[s->csum_type]);
 }

+#define STRIPE_LRU_POS_EMPTY	1
+
+static inline u64 stripe_lru_pos(const struct bch_stripe *s)
+{
+	if (!s)
+		return 0;
+
+	unsigned nr_data = s->nr_blocks - s->nr_redundant, blocks_empty = 0;
+
+	for (unsigned i = 0; i < nr_data; i++)
+		blocks_empty += !stripe_blockcount_get(s, i);
+
+	/* Will be picked up by the stripe_delete worker */
+	if (blocks_empty == nr_data)
+		return STRIPE_LRU_POS_EMPTY;
+
+	if (!blocks_empty)
+		return 0;
+
+	/* invert: more blocks empty = reuse first */
+	return LRU_TIME_MAX - blocks_empty;
+}
+
 static inline bool __bch2_ptr_matches_stripe(const struct bch_extent_ptr *stripe_ptr,
 					     const struct bch_extent_ptr *data_ptr,
 					     unsigned sectors)
···

 	return __bch2_ptr_matches_stripe(&m->ptrs[p.ec.block], &p.ptr,
 					 m->sectors);
+}
+
+static inline void gc_stripe_unlock(struct gc_stripe *s)
+{
+	BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
+
+	clear_bit_unlock(BUCKET_LOCK_BITNR, (void *) &s->lock);
+	wake_up_bit((void *) &s->lock, BUCKET_LOCK_BITNR);
+}
+
+static inline void gc_stripe_lock(struct gc_stripe *s)
+{
+	wait_on_bit_lock((void *) &s->lock, BUCKET_LOCK_BITNR,
+			 TASK_UNINTERRUPTIBLE);
 }

 struct bch_read_bio;
···

 void *bch2_writepoint_ec_buf(struct bch_fs *, struct write_point *);

-void bch2_ec_bucket_cancel(struct bch_fs *, struct open_bucket *);
+void bch2_ec_bucket_cancel(struct bch_fs *, struct open_bucket *, int);

 int bch2_ec_stripe_new_alloc(struct bch_fs *, struct ec_stripe_head *);

···
 struct ec_stripe_head *bch2_ec_stripe_head_get(struct btree_trans *,
 			unsigned, unsigned, unsigned,
 			enum bch_watermark, struct closure *);
-
-void bch2_stripes_heap_update(struct bch_fs *, struct stripe *, size_t);
-void bch2_stripes_heap_del(struct bch_fs *, struct stripe *, size_t);
-void bch2_stripes_heap_insert(struct bch_fs *, struct stripe *, size_t);

 void bch2_do_stripe_deletes(struct bch_fs *);
 void bch2_ec_do_stripe_creates(struct bch_fs *);
···

 int bch2_stripes_read(struct bch_fs *);

-void bch2_stripes_heap_to_text(struct printbuf *, struct bch_fs *);
 void bch2_new_stripes_to_text(struct printbuf *, struct bch_fs *);

 void bch2_fs_ec_exit(struct bch_fs *);
 void bch2_fs_ec_init_early(struct bch_fs *);
 int bch2_fs_ec_init(struct bch_fs *);
+
+int bch2_check_stripe_to_lru_refs(struct bch_fs *);

 #endif /* _BCACHEFS_EC_H */
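
`stripe_lru_pos()` replaces the in-memory stripes heap with a position in the LRU btree: fully empty stripes share one position that the delete worker scans, full stripes are left out of the LRU, and partially empty stripes are ordered so that emptier ones sort first. The sketch below models only that mapping in userspace. It assumes `LRU_TIME_MAX` is a 48-bit maximum, which is not shown in this diff, and replaces `stripe_blockcount_get()` with a plain array of per-block sector counts; `stripe_lru_pos_model` is a hypothetical name.

```c
#include <stdint.h>

#define LRU_TIME_BITS		48	/* assumption: not taken from this diff */
#define LRU_TIME_MAX		((1ULL << LRU_TIME_BITS) - 1)
#define STRIPE_LRU_POS_EMPTY	1

/*
 * blockcounts[i] is the number of live sectors in data block i.
 * Returns 0 when the stripe should not appear in the LRU at all.
 */
static uint64_t stripe_lru_pos_model(const unsigned *blockcounts, unsigned nr_data)
{
	unsigned blocks_empty = 0;

	for (unsigned i = 0; i < nr_data; i++)
		blocks_empty += !blockcounts[i];

	/* Every data block is empty: candidate for deletion. */
	if (blocks_empty == nr_data)
		return STRIPE_LRU_POS_EMPTY;

	/* No empty blocks: nothing to reuse. */
	if (!blocks_empty)
		return 0;

	/* Inverted so that stripes with more empty blocks sort earlier. */
	return LRU_TIME_MAX - blocks_empty;
}
```

This matches how the callers in `ec.c` use the value: the delete worker walks keys at position 1, and stripe reuse walks from position 2 up to `LRU_TIME_MAX`, so it meets the emptiest reusable stripes first.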

+2 -10

fs/bcachefs/ec_types.h

···
 };

 struct gc_stripe {
+	u8			lock;
+	unsigned		alive:1; /* does a corresponding key exist in stripes btree? */
 	u16			sectors;
-
 	u8			nr_blocks;
 	u8			nr_redundant;
-
-	unsigned		alive:1; /* does a corresponding key exist in stripes btree? */
 	u16			block_sectors[BCH_BKEY_PTRS_MAX];
 	struct bch_extent_ptr	ptrs[BCH_BKEY_PTRS_MAX];

 	struct bch_replicas_padded r;
 };
-
-struct ec_stripe_heap_entry {
-	size_t			idx;
-	unsigned		blocks_nonempty;
-};
-
-typedef DEFINE_MIN_HEAP(struct ec_stripe_heap_entry, ec_stripes_heap) ec_stripes_heap;

 #endif /* _BCACHEFS_EC_TYPES_H */
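
The new `lock` byte in `struct gc_stripe` is used as a single-bit lock by `gc_stripe_lock()` and `gc_stripe_unlock()` in `ec.h`. The kernel helpers sleep via `wait_on_bit_lock()`; the sketch below is a rough userspace analogue using C11 atomics that exposes a trylock instead of sleeping. The bit number 0 is an assumption (the value of `BUCKET_LOCK_BITNR` is not shown in this diff), and all `*_model` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_BITNR_MODEL	0	/* assumption: stands in for BUCKET_LOCK_BITNR */

struct gc_stripe_model {
	_Atomic uint8_t	lock;
	unsigned	alive:1;
};

/* Try to take the bit lock; returns true if this caller now holds it. */
static bool gc_stripe_trylock_model(struct gc_stripe_model *s)
{
	uint8_t mask = (uint8_t) (1u << LOCK_BITNR_MODEL);

	/* Acquire ordering: later accesses cannot move above taking the lock. */
	return !(atomic_fetch_or_explicit(&s->lock, mask, memory_order_acquire) & mask);
}

/* Release the bit lock. */
static void gc_stripe_unlock_model(struct gc_stripe_model *s)
{
	uint8_t mask = (uint8_t) (1u << LOCK_BITNR_MODEL);

	/* Release ordering: earlier accesses are visible before the bit clears. */
	atomic_fetch_and_explicit(&s->lock, (uint8_t) ~mask, memory_order_release);
}
```

A caller that needs blocking behaviour would loop on the trylock or park on a condition variable; the kernel version gets that from the wait-bit machinery together with `wake_up_bit()`.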

+60 -5

fs/bcachefs/errcode.h

··· 116 116 x(ENOENT, ENOENT_snapshot_tree) \ 117 117 x(ENOENT, ENOENT_dirent_doesnt_match_inode) \ 118 118 x(ENOENT, ENOENT_dev_not_found) \ 119 + x(ENOENT, ENOENT_dev_bucket_not_found) \ 119 120 x(ENOENT, ENOENT_dev_idx_not_found) \ 120 121 x(ENOENT, ENOENT_inode_no_backpointer) \ 121 122 x(ENOENT, ENOENT_no_snapshot_tree_subvol) \ 123 + x(ENOENT, btree_node_dying) \ 122 124 x(ENOTEMPTY, ENOTEMPTY_dir_not_empty) \ 123 125 x(ENOTEMPTY, ENOTEMPTY_subvol_not_empty) \ 124 126 x(EEXIST, EEXIST_str_hash_set) \ ··· 182 180 x(EINVAL, not_in_recovery) \ 183 181 x(EINVAL, cannot_rewind_recovery) \ 184 182 x(0, data_update_done) \ 183 + x(BCH_ERR_data_update_done, data_update_done_would_block) \ 184 + x(BCH_ERR_data_update_done, data_update_done_unwritten) \ 185 + x(BCH_ERR_data_update_done, data_update_done_no_writes_needed) \ 186 + x(BCH_ERR_data_update_done, data_update_done_no_snapshot) \ 187 + x(BCH_ERR_data_update_done, data_update_done_no_dev_refs) \ 188 + x(BCH_ERR_data_update_done, data_update_done_no_rw_devs) \ 185 189 x(EINVAL, device_state_not_allowed) \ 186 190 x(EINVAL, member_info_missing) \ 187 191 x(EINVAL, mismatched_block_size) \ ··· 208 200 x(EINVAL, no_resize_with_buckets_nouse) \ 209 201 x(EINVAL, inode_unpack_error) \ 210 202 x(EINVAL, varint_decode_error) \ 203 + x(EINVAL, erasure_coding_found_btree_node) \ 204 + x(EOPNOTSUPP, may_not_use_incompat_feature) \ 211 205 x(EROFS, erofs_trans_commit) \ 212 206 x(EROFS, erofs_no_writes) \ 213 207 x(EROFS, erofs_journal_err) \ ··· 220 210 x(EROFS, insufficient_devices) \ 221 211 x(0, operation_blocked) \ 222 212 x(BCH_ERR_operation_blocked, btree_cache_cannibalize_lock_blocked) \ 223 - x(BCH_ERR_operation_blocked, journal_res_get_blocked) \ 224 - x(BCH_ERR_operation_blocked, journal_preres_get_blocked) \ 225 - x(BCH_ERR_operation_blocked, bucket_alloc_blocked) \ 226 - x(BCH_ERR_operation_blocked, stripe_alloc_blocked) \ 213 + x(BCH_ERR_operation_blocked, journal_res_blocked) \ 214 + 
x(BCH_ERR_journal_res_blocked, journal_blocked) \ 215 + x(BCH_ERR_journal_res_blocked, journal_max_in_flight) \ 216 + x(BCH_ERR_journal_res_blocked, journal_max_open) \ 217 + x(BCH_ERR_journal_res_blocked, journal_full) \ 218 + x(BCH_ERR_journal_res_blocked, journal_pin_full) \ 219 + x(BCH_ERR_journal_res_blocked, journal_buf_enomem) \ 220 + x(BCH_ERR_journal_res_blocked, journal_stuck) \ 221 + x(BCH_ERR_journal_res_blocked, journal_retry_open) \ 222 + x(BCH_ERR_journal_res_blocked, journal_preres_get_blocked) \ 223 + x(BCH_ERR_journal_res_blocked, bucket_alloc_blocked) \ 224 + x(BCH_ERR_journal_res_blocked, stripe_alloc_blocked) \ 227 225 x(BCH_ERR_invalid, invalid_sb) \ 228 226 x(BCH_ERR_invalid_sb, invalid_sb_magic) \ 229 227 x(BCH_ERR_invalid_sb, invalid_sb_version) \ ··· 241 223 x(BCH_ERR_invalid_sb, invalid_sb_csum) \ 242 224 x(BCH_ERR_invalid_sb, invalid_sb_block_size) \ 243 225 x(BCH_ERR_invalid_sb, invalid_sb_uuid) \ 226 + x(BCH_ERR_invalid_sb, invalid_sb_offset) \ 244 227 x(BCH_ERR_invalid_sb, invalid_sb_too_many_members) \ 245 228 x(BCH_ERR_invalid_sb, invalid_sb_dev_idx) \ 246 229 x(BCH_ERR_invalid_sb, invalid_sb_time_precision) \ ··· 269 250 x(BCH_ERR_operation_blocked, nocow_lock_blocked) \ 270 251 x(EIO, journal_shutdown) \ 271 252 x(EIO, journal_flush_err) \ 253 + x(EIO, journal_write_err) \ 272 254 x(EIO, btree_node_read_err) \ 273 255 x(BCH_ERR_btree_node_read_err, btree_node_read_err_cached) \ 274 256 x(EIO, sb_not_downgraded) \ ··· 278 258 x(EIO, btree_node_read_validate_error) \ 279 259 x(EIO, btree_need_topology_repair) \ 280 260 x(EIO, bucket_ref_update) \ 261 + x(EIO, trigger_alloc) \ 281 262 x(EIO, trigger_pointer) \ 282 263 x(EIO, trigger_stripe_pointer) \ 283 264 x(EIO, metadata_bucket_inconsistency) \ 284 265 x(EIO, mark_stripe) \ 285 266 x(EIO, stripe_reconstruct) \ 286 267 x(EIO, key_type_error) \ 287 - x(EIO, no_device_to_read_from) \ 268 + x(EIO, extent_poisened) \ 288 269 x(EIO, missing_indirect_extent) \ 289 270 x(EIO, 
invalidate_stripe_to_dev) \ 290 271 x(EIO, no_encryption_key) \ 291 272 x(EIO, insufficient_journal_devices) \ 273 + x(EIO, device_offline) \ 274 + x(EIO, EIO_fault_injected) \ 275 + x(EIO, ec_block_read) \ 276 + x(EIO, ec_block_write) \ 277 + x(EIO, recompute_checksum) \ 278 + x(EIO, decompress) \ 279 + x(BCH_ERR_decompress, decompress_exceeded_max_encoded_extent) \ 280 + x(BCH_ERR_decompress, decompress_lz4) \ 281 + x(BCH_ERR_decompress, decompress_gzip) \ 282 + x(BCH_ERR_decompress, decompress_zstd_src_len_bad) \ 283 + x(BCH_ERR_decompress, decompress_zstd) \ 284 + x(EIO, data_write) \ 285 + x(BCH_ERR_data_write, data_write_io) \ 286 + x(BCH_ERR_data_write, data_write_csum) \ 287 + x(BCH_ERR_data_write, data_write_invalid_ptr) \ 288 + x(BCH_ERR_data_write, data_write_misaligned) \ 289 + x(BCH_ERR_decompress, data_read) \ 290 + x(BCH_ERR_data_read, no_device_to_read_from) \ 291 + x(BCH_ERR_data_read, data_read_io_err) \ 292 + x(BCH_ERR_data_read, data_read_csum_err) \ 293 + x(BCH_ERR_data_read, data_read_retry) \ 294 + x(BCH_ERR_data_read_retry, data_read_retry_avoid) \ 295 + x(BCH_ERR_data_read_retry_avoid,data_read_retry_device_offline) \ 296 + x(BCH_ERR_data_read_retry_avoid,data_read_retry_io_err) \ 297 + x(BCH_ERR_data_read_retry_avoid,data_read_retry_ec_reconstruct_err) \ 298 + x(BCH_ERR_data_read_retry_avoid,data_read_retry_csum_err) \ 299 + x(BCH_ERR_data_read_retry, data_read_retry_csum_err_maybe_userspace)\ 300 + x(BCH_ERR_data_read, data_read_decompress_err) \ 301 + x(BCH_ERR_data_read, data_read_decrypt_err) \ 302 + x(BCH_ERR_data_read, data_read_ptr_stale_race) \ 303 + x(BCH_ERR_data_read_retry, data_read_ptr_stale_retry) \ 304 + x(BCH_ERR_data_read, data_read_no_encryption_key) \ 305 + x(BCH_ERR_data_read, data_read_buffer_too_small) \ 306 + x(BCH_ERR_data_read, data_read_key_overwritten) \ 292 307 x(BCH_ERR_btree_node_read_err, btree_node_read_err_fixable) \ 293 308 x(BCH_ERR_btree_node_read_err, btree_node_read_err_want_retry) \ 294 309 
x(BCH_ERR_btree_node_read_err, btree_node_read_err_must_retry) \
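
The `x(parent, name)` entries above form a tree: every private error code names the class it belongs to, so a caller can ask whether a specific error falls under a broader class such as `EIO`. The following is a minimal userspace sketch of that idea, not the kernel's implementation. It assumes positive error values and a flat parent table; `ERRCODES`, `ERR_START`, `err_parent` and `err_matches` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical three-entry excerpt of an x-macro error table: x(parent, name) */
#define ERRCODES()					\
	x(EIO,			data_read)		\
	x(ERR_data_read,	data_read_retry)	\
	x(ERR_data_read_retry,	data_read_retry_avoid)

enum {
	ERR_START = 2048,	/* assumed base, chosen to sit above standard errno values */
#define x(parent, name) ERR_##name,
	ERRCODES()
#undef x
	ERR_MAX
};

/* Parent of each private code, indexed by (code - ERR_START). */
static const int err_parent[] = {
#define x(parent, name) [ERR_##name - ERR_START] = parent,
	ERRCODES()
#undef x
};

/* Walk up the parent chain until we hit 'class' or leave the private range. */
static bool err_matches(int err, int class)
{
	while (err > ERR_START && err < ERR_MAX) {
		if (err == class)
			return true;
		err = err_parent[err - ERR_START];
	}
	return err == class;
}
```

With this layout, `err_matches(ERR_data_read_retry_avoid, EIO)` holds because the chain runs retry_avoid, retry, data_read, `EIO`, while a parent never matches one of its own children.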

+68 -28

fs/bcachefs/error.c

··· 3 3 #include "btree_cache.h" 4 4 #include "btree_iter.h" 5 5 #include "error.h" 6 - #include "fs-common.h" 7 6 #include "journal.h" 7 + #include "namei.h" 8 8 #include "recovery_passes.h" 9 9 #include "super.h" 10 10 #include "thread_with_file.h" ··· 54 54 { 55 55 struct bch_dev *ca = container_of(work, struct bch_dev, io_error_work); 56 56 struct bch_fs *c = ca->fs; 57 - bool dev; 57 + 58 + /* XXX: if it's reads or checksums that are failing, set it to failed */ 58 59 59 60 down_write(&c->state_lock); 60 - dev = bch2_dev_state_allowed(c, ca, BCH_MEMBER_STATE_ro, 61 - BCH_FORCE_IF_DEGRADED); 62 - if (dev 63 - ? __bch2_dev_set_state(c, ca, BCH_MEMBER_STATE_ro, 64 - BCH_FORCE_IF_DEGRADED) 65 - : bch2_fs_emergency_read_only(c)) 61 + unsigned long write_errors_start = READ_ONCE(ca->write_errors_start); 62 + 63 + if (write_errors_start && 64 + time_after(jiffies, 65 + write_errors_start + c->opts.write_error_timeout * HZ)) { 66 + if (ca->mi.state >= BCH_MEMBER_STATE_ro) 67 + goto out; 68 + 69 + bool dev = !__bch2_dev_set_state(c, ca, BCH_MEMBER_STATE_ro, 70 + BCH_FORCE_IF_DEGRADED); 71 + 66 72 bch_err(ca, 67 - "too many IO errors, setting %s RO", 73 + "writes erroring for %u seconds, setting %s ro", 74 + c->opts.write_error_timeout, 68 75 dev ? 
"device" : "filesystem"); 76 + if (!dev) 77 + bch2_fs_emergency_read_only(c); 78 + 79 + } 80 + out: 69 81 up_write(&c->state_lock); 70 82 } 71 83 72 84 void bch2_io_error(struct bch_dev *ca, enum bch_member_error_type type) 73 85 { 74 86 atomic64_inc(&ca->errors[type]); 75 - //queue_work(system_long_wq, &ca->io_error_work); 87 + 88 + if (type == BCH_MEMBER_ERROR_write && !ca->write_errors_start) 89 + ca->write_errors_start = jiffies; 90 + 91 + queue_work(system_long_wq, &ca->io_error_work); 76 92 } 77 93 78 94 enum ask_yn { ··· 546 530 mutex_unlock(&c->fsck_error_msgs_lock); 547 531 } 548 532 549 - int bch2_inum_err_msg_trans(struct btree_trans *trans, struct printbuf *out, subvol_inum inum) 533 + int bch2_inum_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out, 534 + subvol_inum inum, u64 offset) 550 535 { 551 536 u32 restart_count = trans->restart_count; 552 537 int ret = 0; 553 538 554 - /* XXX: we don't yet attempt to print paths when we don't know the subvol */ 555 - if (inum.subvol) 556 - ret = lockrestart_do(trans, bch2_inum_to_path(trans, inum, out)); 539 + if (inum.subvol) { 540 + ret = bch2_inum_to_path(trans, inum, out); 541 + if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 542 + return ret; 543 + } 557 544 if (!inum.subvol || ret) 558 545 prt_printf(out, "inum %llu:%llu", inum.subvol, inum.inum); 546 + prt_printf(out, " offset %llu: ", offset); 559 547 560 548 return trans_was_restarted(trans, restart_count); 561 - } 562 - 563 - int bch2_inum_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out, 564 - subvol_inum inum, u64 offset) 565 - { 566 - int ret = bch2_inum_err_msg_trans(trans, out, inum); 567 - prt_printf(out, " offset %llu: ", offset); 568 - return ret; 569 - } 570 - 571 - void bch2_inum_err_msg(struct bch_fs *c, struct printbuf *out, subvol_inum inum) 572 - { 573 - bch2_trans_run(c, bch2_inum_err_msg_trans(trans, out, inum)); 574 549 } 575 550 576 551 void bch2_inum_offset_err_msg(struct bch_fs *c, 
struct printbuf *out, 577 552 subvol_inum inum, u64 offset) 578 553 { 579 - bch2_trans_run(c, bch2_inum_offset_err_msg_trans(trans, out, inum, offset)); 554 + bch2_trans_do(c, bch2_inum_offset_err_msg_trans(trans, out, inum, offset)); 555 + } 556 + 557 + int bch2_inum_snap_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out, 558 + struct bpos pos) 559 + { 560 + struct bch_fs *c = trans->c; 561 + int ret = 0; 562 + 563 + if (!bch2_snapshot_is_leaf(c, pos.snapshot)) 564 + prt_str(out, "(multiple snapshots) "); 565 + 566 + subvol_inum inum = { 567 + .subvol = bch2_snapshot_tree_oldest_subvol(c, pos.snapshot), 568 + .inum = pos.inode, 569 + }; 570 + 571 + if (inum.subvol) { 572 + ret = bch2_inum_to_path(trans, inum, out); 573 + if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 574 + return ret; 575 + } 576 + 577 + if (!inum.subvol || ret) 578 + prt_printf(out, "inum %llu:%u", pos.inode, pos.snapshot); 579 + 580 + prt_printf(out, " offset %llu: ", pos.offset << 8); 581 + return 0; 582 + } 583 + 584 + void bch2_inum_snap_offset_err_msg(struct bch_fs *c, struct printbuf *out, 585 + struct bpos pos) 586 + { 587 + bch2_trans_do(c, bch2_inum_snap_offset_err_msg_trans(trans, out, pos)); 580 588 }

+33 -22

fs/bcachefs/error.h

··· 216 216 /* Does the error handling without logging a message */ 217 217 void bch2_io_error(struct bch_dev *, enum bch_member_error_type); 218 218 219 - #define bch2_dev_io_err_on(cond, ca, _type, ...) \ 220 - ({ \ 221 - bool _ret = (cond); \ 222 - \ 223 - if (_ret) { \ 224 - bch_err_dev_ratelimited(ca, __VA_ARGS__); \ 225 - bch2_io_error(ca, _type); \ 226 - } \ 227 - _ret; \ 228 - }) 219 + #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT 220 + void bch2_latency_acct(struct bch_dev *, u64, int); 221 + #else 222 + static inline void bch2_latency_acct(struct bch_dev *ca, u64 submit_time, int rw) {} 223 + #endif 229 224 230 - #define bch2_dev_inum_io_err_on(cond, ca, _type, ...) \ 231 - ({ \ 232 - bool _ret = (cond); \ 233 - \ 234 - if (_ret) { \ 235 - bch_err_inum_offset_ratelimited(ca, __VA_ARGS__); \ 236 - bch2_io_error(ca, _type); \ 237 - } \ 238 - _ret; \ 239 - }) 225 + static inline void bch2_account_io_success_fail(struct bch_dev *ca, 226 + enum bch_member_error_type type, 227 + bool success) 228 + { 229 + if (likely(success)) { 230 + if (type == BCH_MEMBER_ERROR_write && 231 + ca->write_errors_start) 232 + ca->write_errors_start = 0; 233 + } else { 234 + bch2_io_error(ca, type); 235 + } 236 + } 240 237 241 - int bch2_inum_err_msg_trans(struct btree_trans *, struct printbuf *, subvol_inum); 238 + static inline void bch2_account_io_completion(struct bch_dev *ca, 239 + enum bch_member_error_type type, 240 + u64 submit_time, bool success) 241 + { 242 + if (unlikely(!ca)) 243 + return; 244 + 245 + if (type != BCH_MEMBER_ERROR_checksum) 246 + bch2_latency_acct(ca, submit_time, type); 247 + 248 + bch2_account_io_success_fail(ca, type, success); 249 + } 250 + 242 251 int bch2_inum_offset_err_msg_trans(struct btree_trans *, struct printbuf *, subvol_inum, u64); 243 252 244 - void bch2_inum_err_msg(struct bch_fs *, struct printbuf *, subvol_inum); 245 253 void bch2_inum_offset_err_msg(struct bch_fs *, struct printbuf *, subvol_inum, u64); 254 + 255 + int 
bch2_inum_snap_offset_err_msg_trans(struct btree_trans *, struct printbuf *, struct bpos); 256 + void bch2_inum_snap_offset_err_msg(struct bch_fs *, struct printbuf *, struct bpos); 246 257 247 258 #endif /* _BCACHEFS_ERROR_H */
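
The write-error handling in this hunk and in `error.c` above replaces a simple error count with a time window: the first failed write records a start time, any successful write clears it, and the device is set read-only only once writes have been failing for longer than the timeout. A rough userspace sketch of that state machine, assuming time is a plain counter in seconds rather than jiffies; `struct dev_sketch`, `account_write` and `should_go_ro` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

struct dev_sketch {
	/* Time of the first write error in the current run; 0 means "none". */
	unsigned long write_errors_start;
};

/* Record the outcome of one write completed at time 'now'. */
static void account_write(struct dev_sketch *d, unsigned long now, bool success)
{
	if (success)
		d->write_errors_start = 0;	/* any success resets the window */
	else if (!d->write_errors_start)
		d->write_errors_start = now;	/* only the first failure sets it */
}

/* True once writes have been failing for strictly more than 'timeout'. */
static bool should_go_ro(const struct dev_sketch *d, unsigned long now,
			 unsigned long timeout)
{
	return d->write_errors_start &&
	       now - d->write_errors_start > timeout;
}
```

As in the diff, a start time of 0 doubles as the "no errors" marker, so a failure recorded at exactly time 0 would be missed; the sketch keeps that simplification.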

+171 -78

fs/bcachefs/extents.c

··· 28 28 #include "trace.h" 29 29 #include "util.h" 30 30 31 + static const char * const bch2_extent_flags_strs[] = { 32 + #define x(n, v) [BCH_EXTENT_FLAG_##n] = #n, 33 + BCH_EXTENT_FLAGS() 34 + #undef x 35 + NULL, 36 + }; 37 + 31 38 static unsigned bch2_crc_field_size_max[] = { 32 39 [BCH_EXTENT_ENTRY_crc32] = CRC32_SIZE_MAX, 33 40 [BCH_EXTENT_ENTRY_crc64] = CRC64_SIZE_MAX, ··· 58 51 } 59 52 60 53 void bch2_mark_io_failure(struct bch_io_failures *failed, 61 - struct extent_ptr_decoded *p) 54 + struct extent_ptr_decoded *p, 55 + bool csum_error) 62 56 { 63 57 struct bch_dev_io_failures *f = bch2_dev_io_failures(failed, p->ptr.dev); 64 58 ··· 67 59 BUG_ON(failed->nr >= ARRAY_SIZE(failed->devs)); 68 60 69 61 f = &failed->devs[failed->nr++]; 70 - f->dev = p->ptr.dev; 71 - f->idx = p->idx; 72 - f->nr_failed = 1; 73 - f->nr_retries = 0; 74 - } else if (p->idx != f->idx) { 75 - f->idx = p->idx; 76 - f->nr_failed = 1; 77 - f->nr_retries = 0; 78 - } else { 79 - f->nr_failed++; 62 + memset(f, 0, sizeof(*f)); 63 + f->dev = p->ptr.dev; 80 64 } 65 + 66 + if (p->do_ec_reconstruct) 67 + f->failed_ec = true; 68 + else if (!csum_error) 69 + f->failed_io = true; 70 + else 71 + f->failed_csum_nr++; 81 72 } 82 73 83 - static inline u64 dev_latency(struct bch_fs *c, unsigned dev) 74 + static inline u64 dev_latency(struct bch_dev *ca) 84 75 { 85 - struct bch_dev *ca = bch2_dev_rcu(c, dev); 86 76 return ca ? 
atomic64_read(&ca->cur_latency[READ]) : S64_MAX; 77 + } 78 + 79 + static inline int dev_failed(struct bch_dev *ca) 80 + { 81 + return !ca || ca->mi.state == BCH_MEMBER_STATE_failed; 87 82 } 88 83 89 84 /* ··· 94 83 */ 95 84 static inline bool ptr_better(struct bch_fs *c, 96 85 const struct extent_ptr_decoded p1, 97 - const struct extent_ptr_decoded p2) 86 + u64 p1_latency, 87 + struct bch_dev *ca1, 88 + const struct extent_ptr_decoded p2, 89 + u64 p2_latency) 98 90 { 99 - if (likely(!p1.idx && !p2.idx)) { 100 - u64 l1 = dev_latency(c, p1.ptr.dev); 101 - u64 l2 = dev_latency(c, p2.ptr.dev); 91 + struct bch_dev *ca2 = bch2_dev_rcu(c, p2.ptr.dev); 102 92 103 - /* 104 - * Square the latencies, to bias more in favor of the faster 105 - * device - we never want to stop issuing reads to the slower 106 - * device altogether, so that we can update our latency numbers: 107 - */ 108 - l1 *= l1; 109 - l2 *= l2; 93 + int failed_delta = dev_failed(ca1) - dev_failed(ca2); 94 + if (unlikely(failed_delta)) 95 + return failed_delta < 0; 110 96 111 - /* Pick at random, biased in favor of the faster device: */ 97 + if (unlikely(bch2_force_reconstruct_read)) 98 + return p1.do_ec_reconstruct > p2.do_ec_reconstruct; 112 99 113 - return bch2_get_random_u64_below(l1 + l2) > l1; 114 - } 100 + if (unlikely(p1.do_ec_reconstruct || p2.do_ec_reconstruct)) 101 + return p1.do_ec_reconstruct < p2.do_ec_reconstruct; 115 102 116 - if (bch2_force_reconstruct_read) 117 - return p1.idx > p2.idx; 103 + int crc_retry_delta = (int) p1.crc_retry_nr - (int) p2.crc_retry_nr; 104 + if (unlikely(crc_retry_delta)) 105 + return crc_retry_delta < 0; 118 106 119 - return p1.idx < p2.idx; 107 + /* Pick at random, biased in favor of the faster device: */ 108 + 109 + return bch2_get_random_u64_below(p1_latency + p2_latency) > p1_latency; 120 110 } 121 111 122 112 /* ··· 127 115 */ 128 116 int bch2_bkey_pick_read_device(struct bch_fs *c, struct bkey_s_c k, 129 117 struct bch_io_failures *failed, 130 - struct 
extent_ptr_decoded *pick) 118 + struct extent_ptr_decoded *pick, 119 + int dev) 131 120 { 132 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 133 - const union bch_extent_entry *entry; 134 - struct extent_ptr_decoded p; 135 - struct bch_dev_io_failures *f; 136 - int ret = 0; 121 + bool have_csum_errors = false, have_io_errors = false, have_missing_devs = false; 122 + bool have_dirty_ptrs = false, have_pick = false; 137 123 138 124 if (k.k->type == KEY_TYPE_error) 139 125 return -BCH_ERR_key_type_error; 140 126 127 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 128 + 129 + if (bch2_bkey_extent_ptrs_flags(ptrs) & BIT_ULL(BCH_EXTENT_FLAG_poisoned)) 130 + return -BCH_ERR_extent_poisened; 131 + 141 132 rcu_read_lock(); 133 + const union bch_extent_entry *entry; 134 + struct extent_ptr_decoded p; 135 + u64 pick_latency; 136 + 142 137 bkey_for_each_ptr_decode(k.k, ptrs, p, entry) { 138 + have_dirty_ptrs |= !p.ptr.cached; 139 + 143 140 /* 144 141 * Unwritten extent: no need to actually read, treat it as a 145 142 * hole and return 0s: 146 143 */ 147 144 if (p.ptr.unwritten) { 148 - ret = 0; 149 - break; 145 + rcu_read_unlock(); 146 + return 0; 150 147 } 151 148 152 - /* 153 - * If there are any dirty pointers it's an error if we can't 154 - * read: 155 - */ 156 - if (!ret && !p.ptr.cached) 157 - ret = -BCH_ERR_no_device_to_read_from; 149 + /* Are we being asked to read from a specific device? */ 150 + if (dev >= 0 && p.ptr.dev != dev) 151 + continue; 158 152 159 153 struct bch_dev *ca = bch2_dev_rcu(c, p.ptr.dev); 160 154 161 155 if (p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr))) 162 156 continue; 163 157 164 - f = failed ? bch2_dev_io_failures(failed, p.ptr.dev) : NULL; 165 - if (f) 166 - p.idx = f->nr_failed < f->nr_retries 167 - ? f->idx 168 - : f->idx + 1; 158 + struct bch_dev_io_failures *f = 159 + unlikely(failed) ? 
bch2_dev_io_failures(failed, p.ptr.dev) : NULL; 160 + if (unlikely(f)) { 161 + p.crc_retry_nr = f->failed_csum_nr; 162 + p.has_ec &= ~f->failed_ec; 169 163 170 - if (!p.idx && (!ca || !bch2_dev_is_readable(ca))) 171 - p.idx++; 164 + if (ca && ca->mi.state != BCH_MEMBER_STATE_failed) { 165 + have_io_errors |= f->failed_io; 166 + have_io_errors |= f->failed_ec; 167 + } 168 + have_csum_errors |= !!f->failed_csum_nr; 172 169 173 - if (!p.idx && p.has_ec && bch2_force_reconstruct_read) 174 - p.idx++; 170 + if (p.has_ec && (f->failed_io || f->failed_csum_nr)) 171 + p.do_ec_reconstruct = true; 172 + else if (f->failed_io || 173 + f->failed_csum_nr > c->opts.checksum_err_retry_nr) 174 + continue; 175 + } 175 176 176 - if (p.idx > (unsigned) p.has_ec) 177 - continue; 177 + have_missing_devs |= ca && !bch2_dev_is_online(ca); 178 178 179 - if (ret > 0 && !ptr_better(c, p, *pick)) 180 - continue; 179 + if (!ca || !bch2_dev_is_online(ca)) { 180 + if (!p.has_ec) 181 + continue; 182 + p.do_ec_reconstruct = true; 183 + } 181 184 182 - *pick = p; 183 - ret = 1; 185 + if (bch2_force_reconstruct_read && p.has_ec) 186 + p.do_ec_reconstruct = true; 187 + 188 + u64 p_latency = dev_latency(ca); 189 + /* 190 + * Square the latencies, to bias more in favor of the faster 191 + * device - we never want to stop issuing reads to the slower 192 + * device altogether, so that we can update our latency numbers: 193 + */ 194 + p_latency *= p_latency; 195 + 196 + if (!have_pick || 197 + ptr_better(c, 198 + p, p_latency, ca, 199 + *pick, pick_latency)) { 200 + *pick = p; 201 + pick_latency = p_latency; 202 + have_pick = true; 203 + } 184 204 } 185 205 rcu_read_unlock(); 186 206 187 - return ret; 207 + if (have_pick) 208 + return 1; 209 + if (!have_dirty_ptrs) 210 + return 0; 211 + if (have_missing_devs) 212 + return -BCH_ERR_no_device_to_read_from; 213 + if (have_csum_errors) 214 + return -BCH_ERR_data_read_csum_err; 215 + if (have_io_errors) 216 + return -BCH_ERR_data_read_io_err; 217 + 218 + 
WARN_ONCE(1, "unhandled error case in %s\n", __func__); 219 + return -EINVAL; 188 220 } 189 221 190 222 /* KEY_TYPE_btree_ptr: */ ··· 592 536 struct bch_extent_crc_unpacked src, 593 537 enum bch_extent_entry_type type) 594 538 { 595 - #define set_common_fields(_dst, _src) \ 596 - _dst.type = 1 << type; \ 597 - _dst.csum_type = _src.csum_type, \ 598 - _dst.compression_type = _src.compression_type, \ 599 - _dst._compressed_size = _src.compressed_size - 1, \ 600 - _dst._uncompressed_size = _src.uncompressed_size - 1, \ 601 - _dst.offset = _src.offset 539 + #define common_fields(_src) \ 540 + .type = BIT(type), \ 541 + .csum_type = _src.csum_type, \ 542 + .compression_type = _src.compression_type, \ 543 + ._compressed_size = _src.compressed_size - 1, \ 544 + ._uncompressed_size = _src.uncompressed_size - 1, \ 545 + .offset = _src.offset 602 546 603 547 switch (type) { 604 548 case BCH_EXTENT_ENTRY_crc32: 605 - set_common_fields(dst->crc32, src); 606 - dst->crc32.csum = (u32 __force) *((__le32 *) &src.csum.lo); 549 + dst->crc32 = (struct bch_extent_crc32) { 550 + common_fields(src), 551 + .csum = (u32 __force) *((__le32 *) &src.csum.lo), 552 + }; 607 553 break; 608 554 case BCH_EXTENT_ENTRY_crc64: 609 - set_common_fields(dst->crc64, src); 610 - dst->crc64.nonce = src.nonce; 611 - dst->crc64.csum_lo = (u64 __force) src.csum.lo; 612 - dst->crc64.csum_hi = (u64 __force) *((__le16 *) &src.csum.hi); 555 + dst->crc64 = (struct bch_extent_crc64) { 556 + common_fields(src), 557 + .nonce = src.nonce, 558 + .csum_lo = (u64 __force) src.csum.lo, 559 + .csum_hi = (u64 __force) *((__le16 *) &src.csum.hi), 560 + }; 613 561 break; 614 562 case BCH_EXTENT_ENTRY_crc128: 615 - set_common_fields(dst->crc128, src); 616 - dst->crc128.nonce = src.nonce; 617 - dst->crc128.csum = src.csum; 563 + dst->crc128 = (struct bch_extent_crc128) { 564 + common_fields(src), 565 + .nonce = src.nonce, 566 + .csum = src.csum, 567 + }; 618 568 break; 619 569 default: 620 570 BUG(); ··· 1059 997 1060 998 
struct bch_dev *ca = bch2_dev_rcu_noerror(c, ptr->dev); 1061 999 1062 - return ca && bch2_dev_is_readable(ca) && !dev_ptr_stale_rcu(ca, ptr); 1000 + return ca && bch2_dev_is_healthy(ca) && !dev_ptr_stale_rcu(ca, ptr); 1063 1001 } 1064 1002 1065 1003 void bch2_extent_ptr_set_cached(struct bch_fs *c, ··· 1282 1220 bch2_extent_rebalance_to_text(out, c, &entry->rebalance); 1283 1221 break; 1284 1222 1223 + case BCH_EXTENT_ENTRY_flags: 1224 + prt_bitflags(out, bch2_extent_flags_strs, entry->flags.flags); 1225 + break; 1226 + 1285 1227 default: 1286 1228 prt_printf(out, "(invalid extent entry %.16llx)", *((u64 *) entry)); 1287 1229 return; ··· 1447 1381 #endif 1448 1382 break; 1449 1383 } 1384 + case BCH_EXTENT_ENTRY_flags: 1385 + bkey_fsck_err_on(entry != ptrs.start, 1386 + c, extent_flags_not_at_start, 1387 + "extent flags entry not at start"); 1388 + break; 1450 1389 } 1451 1390 } 1452 1391 ··· 1518 1447 } 1519 1448 } 1520 1449 1450 + int bch2_bkey_extent_flags_set(struct bch_fs *c, struct bkey_i *k, u64 flags) 1451 + { 1452 + int ret = bch2_request_incompat_feature(c, bcachefs_metadata_version_extent_flags); 1453 + if (ret) 1454 + return ret; 1455 + 1456 + struct bkey_ptrs ptrs = bch2_bkey_ptrs(bkey_i_to_s(k)); 1457 + 1458 + if (ptrs.start != ptrs.end && 1459 + extent_entry_type(ptrs.start) == BCH_EXTENT_ENTRY_flags) { 1460 + ptrs.start->flags.flags = flags; 1461 + } else { 1462 + struct bch_extent_flags f = { 1463 + .type = BIT(BCH_EXTENT_ENTRY_flags), 1464 + .flags = flags, 1465 + }; 1466 + __extent_entry_insert(k, ptrs.start, (union bch_extent_entry *) &f); 1467 + } 1468 + 1469 + return 0; 1470 + } 1471 + 1521 1472 /* Generic extent code: */ 1522 1473 1523 1474 int bch2_cut_front_s(struct bpos where, struct bkey_s k) ··· 1585 1492 entry->crc128.offset += sub; 1586 1493 break; 1587 1494 case BCH_EXTENT_ENTRY_stripe_ptr: 1588 - break; 1589 1495 case BCH_EXTENT_ENTRY_rebalance: 1496 + case BCH_EXTENT_ENTRY_flags: 1590 1497 break; 1591 1498 } 1592 1499
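
The reworked `ptr_better()` above compares two candidate replicas in stages: a pointer on a failed device loses, then a pointer that needs erasure-coding reconstruction loses, then the one with more checksum retries loses, and only then does the latency-biased random choice apply. A simplified sketch of that ordering, under stated assumptions: `struct cand` and `cand_better` are hypothetical, the `bch2_force_reconstruct_read` debug knob is left out, and the random number is passed in so the result is reproducible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cand {
	bool     dev_failed;		/* device missing or marked failed */
	bool     do_ec_reconstruct;	/* read would need EC reconstruction */
	unsigned crc_retry_nr;		/* earlier checksum failures on this device */
	uint64_t latency;		/* current read latency estimate */
};

/* Returns true if 'a' should replace 'b' as the current pick. */
static bool cand_better(struct cand a, struct cand b, uint64_t rnd)
{
	if (a.dev_failed != b.dev_failed)
		return !a.dev_failed;
	if (a.do_ec_reconstruct != b.do_ec_reconstruct)
		return !a.do_ec_reconstruct;
	if (a.crc_retry_nr != b.crc_retry_nr)
		return a.crc_retry_nr < b.crc_retry_nr;

	/* Square the latencies to bias more strongly toward the faster device. */
	uint64_t la = a.latency * a.latency;
	uint64_t lb = b.latency * b.latency;

	if (la + lb == 0)
		return false;		/* no latency data: keep the current pick */

	/* 'a' wins when the draw lands above its own share of the range. */
	return rnd % (la + lb) > la;
}
```

Because the slower device still wins some draws, it keeps receiving occasional reads, which is what lets its latency estimate stay current.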

+20 -4

fs/bcachefs/extents.h

··· 320 320 ({ \ 321 321 __label__ out; \ 322 322 \ 323 - (_ptr).idx = 0; \ 324 - (_ptr).has_ec = false; \ 323 + (_ptr).has_ec = false; \ 324 + (_ptr).do_ec_reconstruct = false; \ 325 + (_ptr).crc_retry_nr = 0; \ 325 326 \ 326 327 __bkey_extent_entry_for_each_from(_entry, _end, _entry) \ 327 328 switch (__extent_entry_type(_entry)) { \ ··· 402 401 struct bch_dev_io_failures *bch2_dev_io_failures(struct bch_io_failures *, 403 402 unsigned); 404 403 void bch2_mark_io_failure(struct bch_io_failures *, 405 - struct extent_ptr_decoded *); 404 + struct extent_ptr_decoded *, bool); 406 405 int bch2_bkey_pick_read_device(struct bch_fs *, struct bkey_s_c, 407 406 struct bch_io_failures *, 408 - struct extent_ptr_decoded *); 407 + struct extent_ptr_decoded *, int); 409 408 410 409 /* KEY_TYPE_btree_ptr: */ 411 410 ··· 753 752 k->p.offset += new_size; 754 753 k->size = new_size; 755 754 } 755 + 756 + static inline u64 bch2_bkey_extent_ptrs_flags(struct bkey_ptrs_c ptrs) 757 + { 758 + if (ptrs.start != ptrs.end && 759 + extent_entry_type(ptrs.start) == BCH_EXTENT_ENTRY_flags) 760 + return ptrs.start->flags.flags; 761 + return 0; 762 + } 763 + 764 + static inline u64 bch2_bkey_extent_flags(struct bkey_s_c k) 765 + { 766 + return bch2_bkey_extent_ptrs_flags(bch2_bkey_ptrs_c(k)); 767 + } 768 + 769 + int bch2_bkey_extent_flags_set(struct bch_fs *, struct bkey_i *, u64); 756 770 757 771 #endif /* _BCACHEFS_EXTENTS_H */

+22 -2

fs/bcachefs/extents_format.h

··· 79 79 x(crc64, 2) \ 80 80 x(crc128, 3) \ 81 81 x(stripe_ptr, 4) \ 82 - x(rebalance, 5) 83 - #define BCH_EXTENT_ENTRY_MAX 6 82 + x(rebalance, 5) \ 83 + x(flags, 6) 84 + #define BCH_EXTENT_ENTRY_MAX 7 84 85 85 86 enum bch_extent_entry_type { 86 87 #define x(f, n) BCH_EXTENT_ENTRY_##f = n, ··· 199 198 redundancy:4, 200 199 block:8, 201 200 type:5; 201 + #endif 202 + }; 203 + 204 + #define BCH_EXTENT_FLAGS() \ 205 + x(poisoned, 0) 206 + 207 + enum bch_extent_flags_e { 208 + #define x(n, v) BCH_EXTENT_FLAG_##n = v, 209 + BCH_EXTENT_FLAGS() 210 + #undef x 211 + }; 212 + 213 + struct bch_extent_flags { 214 + #if defined(__LITTLE_ENDIAN_BITFIELD) 215 + __u64 type:7, 216 + flags:57; 217 + #elif defined (__BIG_ENDIAN_BITFIELD) 218 + __u64 flags:57, 219 + type:7; 202 220 #endif 203 221 }; 204 222
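
`BCH_EXTENT_FLAGS()` is an x-macro: the flag list is written once and expanded twice, here into an enum and in `extents.c` into the string table used by `prt_bitflags()`. A small standalone sketch of the pattern, with the `BCH_` prefixes dropped to make clear these are illustrative names rather than the kernel's definitions:

```c
#include <assert.h>
#include <string.h>

/* Single source of truth for the flag list: x(name, bit number) */
#define EXTENT_FLAGS()	\
	x(poisoned, 0)

/* First expansion: one enum constant per flag. */
enum extent_flag {
#define x(n, v) EXTENT_FLAG_##n = v,
	EXTENT_FLAGS()
#undef x
};

/* Second expansion: NULL-terminated name table indexed by bit number. */
static const char * const extent_flag_strs[] = {
#define x(n, v) [EXTENT_FLAG_##n] = #n,
	EXTENT_FLAGS()
#undef x
	NULL,
};
```

Adding a flag then means adding one `x(...)` line; the enum and the name table cannot drift apart.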

+6 -5

fs/bcachefs/extents_types.h

··· 20 20 }; 21 21 22 22 struct extent_ptr_decoded { 23 - unsigned idx; 24 23 bool has_ec; 24 + bool do_ec_reconstruct; 25 + u8 crc_retry_nr; 25 26 struct bch_extent_crc_unpacked crc; 26 27 struct bch_extent_ptr ptr; 27 28 struct bch_extent_stripe_ptr ec; ··· 32 31 u8 nr; 33 32 struct bch_dev_io_failures { 34 33 u8 dev; 35 - u8 idx; 36 - u8 nr_failed; 37 - u8 nr_retries; 38 - } devs[BCH_REPLICAS_MAX]; 34 + unsigned failed_csum_nr:6, 35 + failed_io:1, 36 + failed_ec:1; 37 + } devs[BCH_REPLICAS_MAX + 1]; 39 38 }; 40 39 41 40 #endif /* _BCACHEFS_EXTENTS_TYPES_H */

+73 -63

fs/bcachefs/eytzinger.c

··· 148 148 return cmp(a, b, priv); 149 149 } 150 150 151 - static inline int eytzinger0_do_cmp(void *base, size_t n, size_t size, 151 + static inline int eytzinger1_do_cmp(void *base1, size_t n, size_t size, 152 152 cmp_r_func_t cmp_func, const void *priv, 153 153 size_t l, size_t r) 154 154 { 155 - return do_cmp(base + inorder_to_eytzinger0(l, n) * size, 156 - base + inorder_to_eytzinger0(r, n) * size, 155 + return do_cmp(base1 + inorder_to_eytzinger1(l, n) * size, 156 + base1 + inorder_to_eytzinger1(r, n) * size, 157 157 cmp_func, priv); 158 158 } 159 159 160 - static inline void eytzinger0_do_swap(void *base, size_t n, size_t size, 160 + static inline void eytzinger1_do_swap(void *base1, size_t n, size_t size, 161 161 swap_r_func_t swap_func, const void *priv, 162 162 size_t l, size_t r) 163 163 { 164 - do_swap(base + inorder_to_eytzinger0(l, n) * size, 165 - base + inorder_to_eytzinger0(r, n) * size, 164 + do_swap(base1 + inorder_to_eytzinger1(l, n) * size, 165 + base1 + inorder_to_eytzinger1(r, n) * size, 166 166 size, swap_func, priv); 167 + } 168 + 169 + static void eytzinger1_sort_r(void *base1, size_t n, size_t size, 170 + cmp_r_func_t cmp_func, 171 + swap_r_func_t swap_func, 172 + const void *priv) 173 + { 174 + unsigned i, j, k; 175 + 176 + /* called from 'sort' without swap function, let's pick the default */ 177 + if (swap_func == SWAP_WRAPPER && !((struct wrapper *)priv)->swap_func) 178 + swap_func = NULL; 179 + 180 + if (!swap_func) { 181 + if (is_aligned(base1, size, 8)) 182 + swap_func = SWAP_WORDS_64; 183 + else if (is_aligned(base1, size, 4)) 184 + swap_func = SWAP_WORDS_32; 185 + else 186 + swap_func = SWAP_BYTES; 187 + } 188 + 189 + /* heapify */ 190 + for (i = n / 2; i >= 1; --i) { 191 + /* Find the sift-down path all the way to the leaves. */ 192 + for (j = i; k = j * 2, k < n;) 193 + j = eytzinger1_do_cmp(base1, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1; 194 + 195 + /* Special case for the last leaf with no sibling. 
*/ 196 + if (j * 2 == n) 197 + j *= 2; 198 + 199 + /* Backtrack to the correct location. */ 200 + while (j != i && eytzinger1_do_cmp(base1, n, size, cmp_func, priv, i, j) >= 0) 201 + j /= 2; 202 + 203 + /* Shift the element into its correct place. */ 204 + for (k = j; j != i;) { 205 + j /= 2; 206 + eytzinger1_do_swap(base1, n, size, swap_func, priv, j, k); 207 + } 208 + } 209 + 210 + /* sort */ 211 + for (i = n; i > 1; --i) { 212 + eytzinger1_do_swap(base1, n, size, swap_func, priv, 1, i); 213 + 214 + /* Find the sift-down path all the way to the leaves. */ 215 + for (j = 1; k = j * 2, k + 1 < i;) 216 + j = eytzinger1_do_cmp(base1, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1; 217 + 218 + /* Special case for the last leaf with no sibling. */ 219 + if (j * 2 + 1 == i) 220 + j *= 2; 221 + 222 + /* Backtrack to the correct location. */ 223 + while (j >= 1 && eytzinger1_do_cmp(base1, n, size, cmp_func, priv, 1, j) >= 0) 224 + j /= 2; 225 + 226 + /* Shift the element into its correct place. */ 227 + for (k = j; j > 1;) { 228 + j /= 2; 229 + eytzinger1_do_swap(base1, n, size, swap_func, priv, j, k); 230 + } 231 + } 167 232 } 168 233 169 234 void eytzinger0_sort_r(void *base, size_t n, size_t size, ··· 236 171 swap_r_func_t swap_func, 237 172 const void *priv) 238 173 { 239 - int i, j, k; 174 + void *base1 = base - size; 240 175 241 - /* called from 'sort' without swap function, let's pick the default */ 242 - if (swap_func == SWAP_WRAPPER && !((struct wrapper *)priv)->swap_func) 243 - swap_func = NULL; 244 - 245 - if (!swap_func) { 246 - if (is_aligned(base, size, 8)) 247 - swap_func = SWAP_WORDS_64; 248 - else if (is_aligned(base, size, 4)) 249 - swap_func = SWAP_WORDS_32; 250 - else 251 - swap_func = SWAP_BYTES; 252 - } 253 - 254 - /* heapify */ 255 - for (i = n / 2 - 1; i >= 0; --i) { 256 - /* Find the sift-down path all the way to the leaves. 
*/ 257 - for (j = i; k = j * 2 + 1, k + 1 < n;) 258 - j = eytzinger0_do_cmp(base, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1; 259 - 260 - /* Special case for the last leaf with no sibling. */ 261 - if (j * 2 + 2 == n) 262 - j = j * 2 + 1; 263 - 264 - /* Backtrack to the correct location. */ 265 - while (j != i && eytzinger0_do_cmp(base, n, size, cmp_func, priv, i, j) >= 0) 266 - j = (j - 1) / 2; 267 - 268 - /* Shift the element into its correct place. */ 269 - for (k = j; j != i;) { 270 - j = (j - 1) / 2; 271 - eytzinger0_do_swap(base, n, size, swap_func, priv, j, k); 272 - } 273 - } 274 - 275 - /* sort */ 276 - for (i = n - 1; i > 0; --i) { 277 - eytzinger0_do_swap(base, n, size, swap_func, priv, 0, i); 278 - 279 - /* Find the sift-down path all the way to the leaves. */ 280 - for (j = 0; k = j * 2 + 1, k + 1 < i;) 281 - j = eytzinger0_do_cmp(base, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1; 282 - 283 - /* Special case for the last leaf with no sibling. */ 284 - if (j * 2 + 2 == i) 285 - j = j * 2 + 1; 286 - 287 - /* Backtrack to the correct location. */ 288 - while (j && eytzinger0_do_cmp(base, n, size, cmp_func, priv, 0, j) >= 0) 289 - j = (j - 1) / 2; 290 - 291 - /* Shift the element into its correct place. */ 292 - for (k = j; j;) { 293 - j = (j - 1) / 2; 294 - eytzinger0_do_swap(base, n, size, swap_func, priv, j, k); 295 - } 296 - } 176 + return eytzinger1_sort_r(base1, n, size, cmp_func, swap_func, priv); 297 177 } 298 178 299 179 void eytzinger0_sort(void *base, size_t n, size_t size,

+37 -56

fs/bcachefs/eytzinger.h

··· 6 6 #include <linux/log2.h> 7 7 8 8 #ifdef EYTZINGER_DEBUG 9 + #include <linux/bug.h> 9 10 #define EYTZINGER_BUG_ON(cond) BUG_ON(cond) 10 11 #else 11 12 #define EYTZINGER_BUG_ON(cond) ··· 57 56 return rounddown_pow_of_two(size + 1) - 1; 58 57 } 59 58 60 - /* 61 - * eytzinger1_next() and eytzinger1_prev() have the nice properties that 62 - * 63 - * eytzinger1_next(0) == eytzinger1_first()) 64 - * eytzinger1_prev(0) == eytzinger1_last()) 65 - * 66 - * eytzinger1_prev(eytzinger1_first()) == 0 67 - * eytzinger1_next(eytzinger1_last()) == 0 68 - */ 69 - 70 59 static inline unsigned eytzinger1_next(unsigned i, unsigned size) 71 60 { 72 - EYTZINGER_BUG_ON(i > size); 61 + EYTZINGER_BUG_ON(i == 0 || i > size); 73 62 74 63 if (eytzinger1_right_child(i) <= size) { 75 64 i = eytzinger1_right_child(i); 76 65 77 - i <<= __fls(size + 1) - __fls(i); 66 + i <<= __fls(size) - __fls(i); 78 67 i >>= i > size; 79 68 } else { 80 69 i >>= ffz(i) + 1; ··· 75 84 76 85 static inline unsigned eytzinger1_prev(unsigned i, unsigned size) 77 86 { 78 - EYTZINGER_BUG_ON(i > size); 87 + EYTZINGER_BUG_ON(i == 0 || i > size); 79 88 80 89 if (eytzinger1_left_child(i) <= size) { 81 90 i = eytzinger1_left_child(i) + 1; 82 91 83 - i <<= __fls(size + 1) - __fls(i); 92 + i <<= __fls(size) - __fls(i); 84 93 i -= 1; 85 94 i >>= i > size; 86 95 } else { ··· 234 243 (_i) != -1; \ 235 244 (_i) = eytzinger0_next((_i), (_size))) 236 245 246 + #define eytzinger0_for_each_prev(_i, _size) \ 247 + for (unsigned (_i) = eytzinger0_last((_size)); \ 248 + (_i) != -1; \ 249 + (_i) = eytzinger0_prev((_i), (_size))) 250 + 237 251 /* return greatest node <= @search, or -1 if not found */ 238 252 static inline int eytzinger0_find_le(void *base, size_t nr, size_t size, 239 253 cmp_func_t cmp, const void *search) 240 254 { 241 - unsigned i, n = 0; 255 + void *base1 = base - size; 256 + unsigned n = 1; 242 257 243 - if (!nr) 244 - return -1; 245 - 246 - do { 247 - i = n; 248 - n = eytzinger0_child(i, cmp(base + i * size, 
search) <= 0); 249 - } while (n < nr); 250 - 251 - if (n & 1) { 252 - /* 253 - * @i was greater than @search, return previous node: 254 - * 255 - * if @i was leftmost/smallest element, 256 - * eytzinger0_prev(eytzinger0_first())) returns -1, as expected 257 - */ 258 - return eytzinger0_prev(i, nr); 259 - } else { 260 - return i; 261 - } 258 + while (n <= nr) 259 + n = eytzinger1_child(n, cmp(base1 + n * size, search) <= 0); 260 + n >>= __ffs(n) + 1; 261 + return n - 1; 262 262 } 263 263 264 + /* return smallest node > @search, or -1 if not found */ 264 265 static inline int eytzinger0_find_gt(void *base, size_t nr, size_t size, 265 266 cmp_func_t cmp, const void *search) 266 267 { 267 - ssize_t idx = eytzinger0_find_le(base, nr, size, cmp, search); 268 + void *base1 = base - size; 269 + unsigned n = 1; 268 270 269 - /* 270 - * if eytitzinger0_find_le() returned -1 - no element was <= search - we 271 - * want to return the first element; next/prev identities mean this work 272 - * as expected 273 - * 274 - * similarly if find_le() returns last element, we should return -1; 275 - * identities mean this all works out: 276 - */ 277 - return eytzinger0_next(idx, nr); 271 + while (n <= nr) 272 + n = eytzinger1_child(n, cmp(base1 + n * size, search) <= 0); 273 + n >>= __ffs(n + 1) + 1; 274 + return n - 1; 278 275 } 279 276 277 + /* return smallest node >= @search, or -1 if not found */ 280 278 static inline int eytzinger0_find_ge(void *base, size_t nr, size_t size, 281 279 cmp_func_t cmp, const void *search) 282 280 { 283 - ssize_t idx = eytzinger0_find_le(base, nr, size, cmp, search); 281 + void *base1 = base - size; 282 + unsigned n = 1; 284 283 285 - if (idx < nr && !cmp(base + idx * size, search)) 286 - return idx; 287 - 288 - return eytzinger0_next(idx, nr); 284 + while (n <= nr) 285 + n = eytzinger1_child(n, cmp(base1 + n * size, search) < 0); 286 + n >>= __ffs(n + 1) + 1; 287 + return n - 1; 289 288 } 290 289 291 290 #define eytzinger0_find(base, nr, size, _cmp, 
search) \ 292 291 ({ \ 293 - void *_base = (base); \ 292 + size_t _size = (size); \ 293 + void *_base1 = (void *)(base) - _size; \ 294 294 const void *_search = (search); \ 295 295 size_t _nr = (nr); \ 296 - size_t _size = (size); \ 297 - size_t _i = 0; \ 296 + size_t _i = 1; \ 298 297 int _res; \ 299 298 \ 300 - while (_i < _nr && \ 301 - (_res = _cmp(_search, _base + _i * _size))) \ 302 - _i = eytzinger0_child(_i, _res > 0); \ 303 - _i; \ 299 + while (_i <= _nr && \ 300 + (_res = _cmp(_search, _base1 + _i * _size))) \ 301 + _i = eytzinger1_child(_i, _res > 0); \ 302 + _i - 1; \ 304 303 }) 305 304 306 305 void eytzinger0_sort_r(void *, size_t, size_t,
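
The new `eytzinger0_find_le()` above descends the tree using 1-based indices, where going left appends a 0 bit to the index and going right appends a 1 bit. After falling off the bottom, shifting away the trailing 0 bits plus the last 1 bit recovers the last node at which the search went right, which is the greatest element `<=` the key. A userspace sketch of that trick on an `int` array, assuming the array is already in Eytzinger (breadth-first) order; `lowest_set_bit` is a hypothetical portable stand-in for the kernel's `__ffs()`, and `eytz_find_le` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the lowest set bit of x; x must be nonzero. */
static unsigned lowest_set_bit(unsigned x)
{
	unsigned bit = 0;

	assert(x != 0);
	while (!(x & 1)) {
		x >>= 1;
		bit++;
	}
	return bit;
}

/*
 * 'a' holds 'nr' elements in Eytzinger order (0-based storage).
 * Returns the 0-based index of the greatest element <= key, or -1 if none.
 */
static int eytz_find_le(const int *a, size_t nr, int key)
{
	unsigned n = 1;		/* 1-based node number; node n lives at a[n - 1] */

	/* Descend: right child (2n + 1) if a[n - 1] <= key, else left child (2n). */
	while (n <= nr)
		n = 2 * n + (a[n - 1] <= key);

	/* Undo the trailing left turns and the final right turn. */
	n >>= lowest_set_bit(n) + 1;
	return (int)n - 1;
}
```

For the sorted values 10 through 70, the Eytzinger layout is `{40, 20, 60, 10, 30, 50, 70}`. Searching for 35 visits nodes 1, 2 and 5, falls off at 11 (binary 1011), and one shift gives node 5, that is index 4, which holds 30. If the search never goes right, the index is a power of two and the shift yields 0, so the function returns -1.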

+196 -14

fs/bcachefs/fs-common.c → fs/bcachefs/namei.c

··· 4 4 #include "acl.h" 5 5 #include "btree_update.h" 6 6 #include "dirent.h" 7 - #include "fs-common.h" 8 7 #include "inode.h" 8 + #include "namei.h" 9 9 #include "subvolume.h" 10 10 #include "xattr.h" 11 11 ··· 46 46 BTREE_ITER_intent|BTREE_ITER_with_updates); 47 47 if (ret) 48 48 goto err; 49 + 50 + /* Inherit casefold state from parent. */ 51 + if (S_ISDIR(mode)) 52 + new_inode->bi_flags |= dir_u->bi_flags & BCH_INODE_casefolded; 49 53 50 54 if (!(flags & BCH_CREATE_SNAPSHOT)) { 51 55 /* Normal create path - allocate a new inode: */ ··· 157 153 dir_u->bi_nlink++; 158 154 dir_u->bi_mtime = dir_u->bi_ctime = now; 159 155 160 - ret = bch2_inode_write(trans, &dir_iter, dir_u); 161 - if (ret) 162 - goto err; 163 - 164 - ret = bch2_dirent_create(trans, dir, &dir_hash, 165 - dir_type, 166 - name, 167 - dir_target, 168 - &dir_offset, 169 - STR_HASH_must_create|BTREE_ITER_with_updates); 156 + ret = bch2_dirent_create(trans, dir, &dir_hash, 157 + dir_type, 158 + name, 159 + dir_target, 160 + &dir_offset, 161 + &dir_u->bi_size, 162 + STR_HASH_must_create|BTREE_ITER_with_updates) ?: 163 + bch2_inode_write(trans, &dir_iter, dir_u); 170 164 if (ret) 171 165 goto err; 172 166 ··· 227 225 228 226 ret = bch2_dirent_create(trans, dir, &dir_hash, 229 227 mode_to_type(inode_u->bi_mode), 230 - name, inum.inum, &dir_offset, 228 + name, inum.inum, 229 + &dir_offset, 230 + &dir_u->bi_size, 231 231 STR_HASH_must_create); 232 232 if (ret) 233 233 goto err; ··· 421 417 } 422 418 423 419 ret = bch2_dirent_rename(trans, 424 - src_dir, &src_hash, 425 - dst_dir, &dst_hash, 420 + src_dir, &src_hash, &src_dir_u->bi_size, 421 + dst_dir, &dst_hash, &dst_dir_u->bi_size, 426 422 src_name, &src_inum, &src_offset, 427 423 dst_name, &dst_inum, &dst_offset, 428 424 mode); ··· 564 560 return ret; 565 561 } 566 562 563 + /* inum_to_path */ 564 + 567 565 static inline void prt_bytes_reversed(struct printbuf *out, const void *b, unsigned n) 568 566 { 569 567 bch2_printbuf_make_room(out, n); ··· 655 649 
656 650 prt_str_reversed(path, "(disconnected)"); 657 651 goto out; 652 + } 653 + 654 + /* fsck */ 655 + 656 + static int bch2_check_dirent_inode_dirent(struct btree_trans *trans, 657 + struct bkey_s_c_dirent d, 658 + struct bch_inode_unpacked *target, 659 + bool in_fsck) 660 + { 661 + struct bch_fs *c = trans->c; 662 + struct printbuf buf = PRINTBUF; 663 + struct btree_iter bp_iter = { NULL }; 664 + int ret = 0; 665 + 666 + if (inode_points_to_dirent(target, d)) 667 + return 0; 668 + 669 + if (!target->bi_dir && 670 + !target->bi_dir_offset) { 671 + fsck_err_on(S_ISDIR(target->bi_mode), 672 + trans, inode_dir_missing_backpointer, 673 + "directory with missing backpointer\n%s", 674 + (printbuf_reset(&buf), 675 + bch2_bkey_val_to_text(&buf, c, d.s_c), 676 + prt_printf(&buf, "\n"), 677 + bch2_inode_unpacked_to_text(&buf, target), 678 + buf.buf)); 679 + 680 + fsck_err_on(target->bi_flags & BCH_INODE_unlinked, 681 + trans, inode_unlinked_but_has_dirent, 682 + "inode unlinked but has dirent\n%s", 683 + (printbuf_reset(&buf), 684 + bch2_bkey_val_to_text(&buf, c, d.s_c), 685 + prt_printf(&buf, "\n"), 686 + bch2_inode_unpacked_to_text(&buf, target), 687 + buf.buf)); 688 + 689 + target->bi_flags &= ~BCH_INODE_unlinked; 690 + target->bi_dir = d.k->p.inode; 691 + target->bi_dir_offset = d.k->p.offset; 692 + return __bch2_fsck_write_inode(trans, target); 693 + } 694 + 695 + if (bch2_inode_should_have_single_bp(target) && 696 + !fsck_err(trans, inode_wrong_backpointer, 697 + "dirent points to inode that does not point back:\n %s", 698 + (bch2_bkey_val_to_text(&buf, c, d.s_c), 699 + prt_printf(&buf, "\n "), 700 + bch2_inode_unpacked_to_text(&buf, target), 701 + buf.buf))) 702 + goto err; 703 + 704 + struct bkey_s_c_dirent bp_dirent = 705 + bch2_bkey_get_iter_typed(trans, &bp_iter, BTREE_ID_dirents, 706 + SPOS(target->bi_dir, target->bi_dir_offset, target->bi_snapshot), 707 + 0, dirent); 708 + ret = bkey_err(bp_dirent); 709 + if (ret && !bch2_err_matches(ret, ENOENT)) 710 + goto 
err; 711 + 712 + bool backpointer_exists = !ret; 713 + ret = 0; 714 + 715 + if (!backpointer_exists) { 716 + if (fsck_err(trans, inode_wrong_backpointer, 717 + "inode %llu:%u has wrong backpointer:\n" 718 + "got %llu:%llu\n" 719 + "should be %llu:%llu", 720 + target->bi_inum, target->bi_snapshot, 721 + target->bi_dir, 722 + target->bi_dir_offset, 723 + d.k->p.inode, 724 + d.k->p.offset)) { 725 + target->bi_dir = d.k->p.inode; 726 + target->bi_dir_offset = d.k->p.offset; 727 + ret = __bch2_fsck_write_inode(trans, target); 728 + } 729 + } else { 730 + bch2_bkey_val_to_text(&buf, c, d.s_c); 731 + prt_newline(&buf); 732 + bch2_bkey_val_to_text(&buf, c, bp_dirent.s_c); 733 + 734 + if (S_ISDIR(target->bi_mode) || target->bi_subvol) { 735 + /* 736 + * XXX: verify connectivity of the other dirent 737 + * up to the root before removing this one 738 + * 739 + * Additionally, bch2_lookup would need to cope with the 740 + * dirent it found being removed - or should we remove 741 + * the other one, even though the inode points to it? 742 + */ 743 + if (in_fsck) { 744 + if (fsck_err(trans, inode_dir_multiple_links, 745 + "%s %llu:%u with multiple links\n%s", 746 + S_ISDIR(target->bi_mode) ? "directory" : "subvolume", 747 + target->bi_inum, target->bi_snapshot, buf.buf)) 748 + ret = bch2_fsck_remove_dirent(trans, d.k->p); 749 + } else { 750 + bch2_fs_inconsistent(c, 751 + "%s %llu:%u with multiple links\n%s", 752 + S_ISDIR(target->bi_mode) ? 
"directory" : "subvolume", 753 + target->bi_inum, target->bi_snapshot, buf.buf); 754 + } 755 + 756 + goto out; 757 + } else { 758 + /* 759 + * hardlinked file with nlink 0: 760 + * We're just adjusting nlink here so check_nlinks() will pick 761 + * it up, it ignores inodes with nlink 0 762 + */ 763 + if (fsck_err_on(!target->bi_nlink, 764 + trans, inode_multiple_links_but_nlink_0, 765 + "inode %llu:%u type %s has multiple links but i_nlink 0\n%s", 766 + target->bi_inum, target->bi_snapshot, bch2_d_types[d.v->d_type], buf.buf)) { 767 + target->bi_nlink++; 768 + target->bi_flags &= ~BCH_INODE_unlinked; 769 + ret = __bch2_fsck_write_inode(trans, target); 770 + if (ret) 771 + goto err; 772 + } 773 + } 774 + } 775 + out: 776 + err: 777 + fsck_err: 778 + bch2_trans_iter_exit(trans, &bp_iter); 779 + printbuf_exit(&buf); 780 + bch_err_fn(c, ret); 781 + return ret; 782 + } 783 + 784 + int __bch2_check_dirent_target(struct btree_trans *trans, 785 + struct btree_iter *dirent_iter, 786 + struct bkey_s_c_dirent d, 787 + struct bch_inode_unpacked *target, 788 + bool in_fsck) 789 + { 790 + struct bch_fs *c = trans->c; 791 + struct printbuf buf = PRINTBUF; 792 + int ret = 0; 793 + 794 + ret = bch2_check_dirent_inode_dirent(trans, d, target, in_fsck); 795 + if (ret) 796 + goto err; 797 + 798 + if (fsck_err_on(d.v->d_type != inode_d_type(target), 799 + trans, dirent_d_type_wrong, 800 + "incorrect d_type: got %s, should be %s:\n%s", 801 + bch2_d_type_str(d.v->d_type), 802 + bch2_d_type_str(inode_d_type(target)), 803 + (printbuf_reset(&buf), 804 + bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf))) { 805 + struct bkey_i_dirent *n = bch2_trans_kmalloc(trans, bkey_bytes(d.k)); 806 + ret = PTR_ERR_OR_ZERO(n); 807 + if (ret) 808 + goto err; 809 + 810 + bkey_reassemble(&n->k_i, d.s_c); 811 + n->v.d_type = inode_d_type(target); 812 + if (n->v.d_type == DT_SUBVOL) { 813 + n->v.d_parent_subvol = cpu_to_le32(target->bi_parent_subvol); 814 + n->v.d_child_subvol = cpu_to_le32(target->bi_subvol); 
815 + } else { 816 + n->v.d_inum = cpu_to_le64(target->bi_inum); 817 + } 818 + 819 + ret = bch2_trans_update(trans, dirent_iter, &n->k_i, 0); 820 + if (ret) 821 + goto err; 822 + } 823 + err: 824 + fsck_err: 825 + printbuf_exit(&buf); 826 + bch_err_fn(c, ret); 827 + return ret; 658 828 }

+28 -3

fs/bcachefs/fs-common.h → fs/bcachefs/namei.h

@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _BCACHEFS_FS_COMMON_H
-#define _BCACHEFS_FS_COMMON_H
+#ifndef _BCACHEFS_NAMEI_H
+#define _BCACHEFS_NAMEI_H

 #include "dirent.h"

@@ -44,4 +44,29 @@

 int bch2_inum_to_path(struct btree_trans *, subvol_inum, struct printbuf *);

-#endif /* _BCACHEFS_FS_COMMON_H */
+int __bch2_check_dirent_target(struct btree_trans *,
+			       struct btree_iter *,
+			       struct bkey_s_c_dirent,
+			       struct bch_inode_unpacked *, bool);
+
+static inline bool inode_points_to_dirent(struct bch_inode_unpacked *inode,
+					  struct bkey_s_c_dirent d)
+{
+	return inode->bi_dir == d.k->p.inode &&
+		inode->bi_dir_offset == d.k->p.offset;
+}
+
+static inline int bch2_check_dirent_target(struct btree_trans *trans,
+					   struct btree_iter *dirent_iter,
+					   struct bkey_s_c_dirent d,
+					   struct bch_inode_unpacked *target,
+					   bool in_fsck)
+{
+	if (likely(inode_points_to_dirent(target, d) &&
+		   d.v->d_type == inode_d_type(target)))
+		return 0;
+
+	return __bch2_check_dirent_target(trans, dirent_iter, d, target, in_fsck);
+}
+
+#endif /* _BCACHEFS_NAMEI_H */
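The new header splits the dirent check into an inline fast path and an out-of-line slow path: the common case (backpointer and d_type already agree) costs two comparisons, and only a mismatch calls into the repair code. Below is a minimal userspace sketch of that split. The types `inode_ref` and `dirent_ref`, the `slow_calls` counter and the "repair" itself are hypothetical simplifications, not the bcachefs structures or its actual repair logic; `__builtin_expect` (GCC/Clang) stands in for the kernel's `likely()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode's view of its parent dirent. */
struct inode_ref {
	uint64_t dir;        /* backpointer: inode number of the parent directory */
	uint64_t dir_offset; /* backpointer: offset of the dirent in that directory */
	uint8_t  d_type;     /* type the inode expects its dirent to carry */
};

/* Hypothetical stand-in for a dirent key. */
struct dirent_ref {
	uint64_t dir;
	uint64_t offset;
	uint8_t  d_type;
};

static int slow_calls; /* illustration only: counts slow-path invocations */

/* Out-of-line slow path: stands in for the full check-and-repair logic. */
static int check_dirent_target_slow(struct dirent_ref *d, struct inode_ref *target)
{
	assert(d && target);
	slow_calls++;

	/* Simplified "repair": point the inode back at the dirent, fix d_type. */
	target->dir = d->dir;
	target->dir_offset = d->offset;
	d->d_type = target->d_type;
	return 0;
}

static inline bool inode_points_to_dirent(const struct inode_ref *inode,
					  const struct dirent_ref *d)
{
	return inode->dir == d->dir &&
		inode->dir_offset == d->offset;
}

/* Inline fast path: return early when nothing needs checking. */
static inline int check_dirent_target(struct dirent_ref *d, struct inode_ref *target)
{
	if (__builtin_expect(inode_points_to_dirent(target, d) &&
			     d->d_type == target->d_type, 1))
		return 0;

	return check_dirent_target_slow(d, target);
}
```

Keeping the wrapper in the header lets callers such as the lookup path pay for the slow path only when an inconsistency is actually present.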

+24 -14

fs/bcachefs/fs-io-buffered.c

··· 110 110 if (!get_more) 111 111 break; 112 112 113 + unsigned sectors_remaining = sectors_this_extent - bio_sectors(bio); 114 + 115 + if (sectors_remaining < PAGE_SECTORS << mapping_min_folio_order(iter->mapping)) 116 + break; 117 + 118 + unsigned order = ilog2(rounddown_pow_of_two(sectors_remaining) / PAGE_SECTORS); 119 + 120 + /* ensure proper alignment */ 121 + order = min(order, __ffs(folio_offset|BIT(31))); 122 + 113 123 folio = xa_load(&iter->mapping->i_pages, folio_offset); 114 124 if (folio && !xa_is_value(folio)) 115 125 break; 116 126 117 - folio = filemap_alloc_folio(readahead_gfp_mask(iter->mapping), 0); 127 + folio = filemap_alloc_folio(readahead_gfp_mask(iter->mapping), order); 118 128 if (!folio) 119 129 break; 120 130 ··· 159 149 struct bch_fs *c = trans->c; 160 150 struct btree_iter iter; 161 151 struct bkey_buf sk; 162 - int flags = BCH_READ_RETRY_IF_STALE| 163 - BCH_READ_MAY_PROMOTE; 152 + int flags = BCH_READ_retry_if_stale| 153 + BCH_READ_may_promote; 164 154 int ret = 0; 165 155 166 - rbio->c = c; 167 - rbio->start_time = local_clock(); 168 156 rbio->subvol = inum.subvol; 169 157 170 158 bch2_bkey_buf_init(&sk); ··· 219 211 swap(rbio->bio.bi_iter.bi_size, bytes); 220 212 221 213 if (rbio->bio.bi_iter.bi_size == bytes) 222 - flags |= BCH_READ_LAST_FRAGMENT; 214 + flags |= BCH_READ_last_fragment; 223 215 224 216 bch2_bio_page_state_set(&rbio->bio, k); 225 217 226 218 bch2_read_extent(trans, rbio, iter.pos, 227 219 data_btree, k, offset_into_extent, flags); 228 220 229 - if (flags & BCH_READ_LAST_FRAGMENT) 221 + if (flags & BCH_READ_last_fragment) 230 222 break; 231 223 232 224 swap(rbio->bio.bi_iter.bi_size, bytes); ··· 240 232 241 233 if (ret) { 242 234 struct printbuf buf = PRINTBUF; 243 - bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter.pos.offset << 9); 235 + lockrestart_do(trans, 236 + bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter.pos.offset << 9)); 244 237 prt_printf(&buf, "read error %i from btree lookup", ret); 245 238 
bch_err_ratelimited(c, "%s", buf.buf); 246 239 printbuf_exit(&buf); ··· 289 280 struct bch_read_bio *rbio = 290 281 rbio_init(bio_alloc_bioset(NULL, n, REQ_OP_READ, 291 282 GFP_KERNEL, &c->bio_read), 292 - opts); 283 + c, 284 + opts, 285 + bch2_readpages_end_io); 293 286 294 287 readpage_iter_advance(&readpages_iter); 295 288 296 289 rbio->bio.bi_iter.bi_sector = folio_sector(folio); 297 - rbio->bio.bi_end_io = bch2_readpages_end_io; 298 290 BUG_ON(!bio_add_folio(&rbio->bio, folio, folio_size(folio), 0)); 299 291 300 292 bchfs_read(trans, rbio, inode_inum(inode), ··· 333 323 bch2_inode_opts_get(&opts, c, &inode->ei_inode); 334 324 335 325 rbio = rbio_init(bio_alloc_bioset(NULL, 1, REQ_OP_READ, GFP_KERNEL, &c->bio_read), 336 - opts); 326 + c, 327 + opts, 328 + bch2_read_single_folio_end_io); 337 329 rbio->bio.bi_private = &done; 338 - rbio->bio.bi_end_io = bch2_read_single_folio_end_io; 339 - 340 330 rbio->bio.bi_opf = REQ_OP_READ|REQ_SYNC; 341 331 rbio->bio.bi_iter.bi_sector = folio_sector(folio); 342 332 BUG_ON(!bio_add_folio(&rbio->bio, folio, folio_size(folio), 0)); ··· 430 420 } 431 421 } 432 422 433 - if (io->op.flags & BCH_WRITE_WROTE_DATA_INLINE) { 423 + if (io->op.flags & BCH_WRITE_wrote_data_inline) { 434 424 bio_for_each_folio_all(fi, bio) { 435 425 struct bch_folio *s; 436 426
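The readahead hunk above picks a folio order from two limits: how much of the extent is left, and how well the folio's page index is aligned. The following is a userspace sketch of that arithmetic only. `PAGE_SECTORS` is assumed to be 8 (4 KiB pages, 512-byte sectors), and the helper names `ilog2_u32`, `lowest_set_bit` and `readahead_folio_order` are hypothetical, chosen to mirror the kernel's `ilog2()`, `__ffs()` and the inline code in the diff.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SECTORS 8u /* assumption: 4 KiB pages, 512-byte sectors */

/* floor(log2(x)), x must be nonzero (mirrors the kernel's ilog2()) */
static unsigned ilog2_u32(uint32_t x)
{
	unsigned r = 0;

	assert(x != 0);
	while (x >>= 1)
		r++;
	return r;
}

/* index of the lowest set bit, x must be nonzero (mirrors __ffs()) */
static unsigned lowest_set_bit(uint32_t x)
{
	unsigned r = 0;

	assert(x != 0);
	while (!(x & 1)) {
		x >>= 1;
		r++;
	}
	return r;
}

/*
 * Returns the folio order to allocate, or -1 when less than one
 * minimum-order folio's worth of sectors remains in the extent.
 */
static int readahead_folio_order(uint32_t sectors_remaining,
				 uint32_t folio_index,
				 unsigned min_order)
{
	if (sectors_remaining < (PAGE_SECTORS << min_order))
		return -1;

	/* largest power-of-two number of pages that still fits */
	uint32_t pow2_sectors = 1u << ilog2_u32(sectors_remaining);
	unsigned order = ilog2_u32(pow2_sectors / PAGE_SECTORS);

	/* a folio of order n must start at a page index aligned to 2^n;
	 * OR-ing in bit 31 keeps the argument nonzero for index 0 */
	unsigned align = lowest_set_bit(folio_index | (1u << 31));

	return (int)(order < align ? order : align);
}
```

For example, with 100 sectors left and a folio starting at page index 4, the size limit alone would allow order 3 (8 pages), but index 4 is only aligned to 4 pages, so the result is order 2.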

+14 -6

fs/bcachefs/fs-io-direct.c

@@ -73,6 +73,7 @@
 	struct blk_plug plug;
 	loff_t offset = req->ki_pos;
 	bool sync = is_sync_kiocb(req);
+	bool split = false;
 	size_t shorten;
 	ssize_t ret;

@@ -98,8 +99,6 @@
 			       REQ_OP_READ,
 			       GFP_KERNEL,
 			       &c->dio_read_bioset);
-
-	bio->bi_end_io = bch2_direct_IO_read_endio;

 	dio = container_of(bio, struct dio_read, rbio.bio);
 	closure_init(&dio->cl, NULL);
@@ -133,12 +132,13 @@

 	goto start;
 	while (iter->count) {
+		split = true;
+
 		bio = bio_alloc_bioset(NULL,
 				       bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS),
 				       REQ_OP_READ,
 				       GFP_KERNEL,
 				       &c->bio_read);
-		bio->bi_end_io = bch2_direct_IO_read_split_endio;
 start:
 		bio->bi_opf = REQ_OP_READ|REQ_SYNC;
 		bio->bi_iter.bi_sector = offset >> 9;
@@ -160,7 +160,15 @@
 		if (iter->count)
 			closure_get(&dio->cl);

-		bch2_read(c, rbio_init(bio, opts), inode_inum(inode));
+		struct bch_read_bio *rbio =
+			rbio_init(bio,
+				  c,
+				  opts,
+				  split
+				  ? bch2_direct_IO_read_split_endio
+				  : bch2_direct_IO_read_endio);
+
+		bch2_read(c, rbio, inode_inum(inode));
 	}

 	blk_finish_plug(&plug);
@@ -511,8 +519,8 @@
 	dio->op.devs_need_flush = &inode->ei_devs_need_flush;

 	if (sync)
-		dio->op.flags |= BCH_WRITE_SYNC;
-	dio->op.flags |= BCH_WRITE_CHECK_ENOSPC;
+		dio->op.flags |= BCH_WRITE_sync;
+	dio->op.flags |= BCH_WRITE_check_enospc;

 	ret = bch2_quota_reservation_add(c, inode, &dio->quota_res,
 					 bio_sectors(bio), true);

+28 -2

fs/bcachefs/fs-ioctl.c

@@ -5,8 +5,8 @@
 #include "chardev.h"
 #include "dirent.h"
 #include "fs.h"
-#include "fs-common.h"
 #include "fs-ioctl.h"
+#include "namei.h"
 #include "quota.h"

 #include <linux/compat.h>
@@ -53,6 +53,32 @@
 	    !S_ISDIR(bi->bi_mode) &&
 	    (newflags & (BCH_INODE_nodump|BCH_INODE_noatime)) != newflags)
 		return -EINVAL;
+
+	if ((newflags ^ oldflags) & BCH_INODE_casefolded) {
+#ifdef CONFIG_UNICODE
+		int ret = 0;
+		/* Not supported on individual files. */
+		if (!S_ISDIR(bi->bi_mode))
+			return -EOPNOTSUPP;
+
+		/*
+		 * Make sure the dir is empty, as otherwise we'd need to
+		 * rehash everything and update the dirent keys.
+		 */
+		ret = bch2_empty_dir_trans(trans, inode_inum(inode));
+		if (ret < 0)
+			return ret;
+
+		ret = bch2_request_incompat_feature(c,bcachefs_metadata_version_casefolding);
+		if (ret)
+			return ret;
+
+		bch2_check_set_feature(c, BCH_FEATURE_casefolding);
+#else
+		printk(KERN_ERR "Cannot use casefolding on a kernel without CONFIG_UNICODE\n");
+		return -EOPNOTSUPP;
+#endif
+	}

 	if (s->set_projinherit) {
 		bi->bi_fields_set &= ~(1 << Inode_opt_project);
@@ -218,7 +244,7 @@
 	int ret = 0;
 	subvol_inum inum;

-	kname = kmalloc(BCH_NAME_MAX + 1, GFP_KERNEL);
+	kname = kmalloc(BCH_NAME_MAX, GFP_KERNEL);
 	if (!kname)
 		return -ENOMEM;

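The casefold hunk validates the flag only when it is actually being toggled, using `(newflags ^ oldflags) & BCH_INODE_casefolded`, and then allows the change only on an empty directory. A small userspace sketch of that ordering follows. `struct inode_state`, `set_inode_flags` and the use of `-ENOTEMPTY` for the non-empty case are assumptions made for illustration; the real code calls `bch2_empty_dir_trans()` and also requests an incompat feature bit, which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define INODE_CASEFOLDED (1u << 10) /* bit 10, as added in inode_format.h */

/* Hypothetical, simplified inode state for illustration. */
struct inode_state {
	uint32_t flags;
	bool     is_dir;
	unsigned nr_entries; /* number of dirents, meaningful for directories */
};

static int set_inode_flags(struct inode_state *st, uint32_t newflags)
{
	assert(st);

	/* XOR isolates the bits that differ; validate only on a real toggle */
	if ((newflags ^ st->flags) & INODE_CASEFOLDED) {
		/* not supported on individual files */
		if (!st->is_dir)
			return -EOPNOTSUPP;

		/* existing entries would need rehashing, so require an empty dir */
		if (st->nr_entries)
			return -ENOTEMPTY;
	}

	st->flags = newflags;
	return 0;
}
```

Because the test is on the changed bits rather than on the new value, clearing the flag on a populated directory is rejected just like setting it, while unrelated flag updates on a casefolded directory pass through untouched.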

+11 -9

fs/bcachefs/fs-ioctl.h

@@ -6,19 +6,21 @@

 /* bcachefs inode flags -> vfs inode flags: */
 static const __maybe_unused unsigned bch_flags_to_vfs[] = {
-	[__BCH_INODE_sync]	= S_SYNC,
-	[__BCH_INODE_immutable]	= S_IMMUTABLE,
-	[__BCH_INODE_append]	= S_APPEND,
-	[__BCH_INODE_noatime]	= S_NOATIME,
+	[__BCH_INODE_sync]		= S_SYNC,
+	[__BCH_INODE_immutable]		= S_IMMUTABLE,
+	[__BCH_INODE_append]		= S_APPEND,
+	[__BCH_INODE_noatime]		= S_NOATIME,
+	[__BCH_INODE_casefolded]	= S_CASEFOLD,
 };

 /* bcachefs inode flags -> FS_IOC_GETFLAGS: */
 static const __maybe_unused unsigned bch_flags_to_uflags[] = {
-	[__BCH_INODE_sync]	= FS_SYNC_FL,
-	[__BCH_INODE_immutable]	= FS_IMMUTABLE_FL,
-	[__BCH_INODE_append]	= FS_APPEND_FL,
-	[__BCH_INODE_nodump]	= FS_NODUMP_FL,
-	[__BCH_INODE_noatime]	= FS_NOATIME_FL,
+	[__BCH_INODE_sync]		= FS_SYNC_FL,
+	[__BCH_INODE_immutable]		= FS_IMMUTABLE_FL,
+	[__BCH_INODE_append]		= FS_APPEND_FL,
+	[__BCH_INODE_nodump]		= FS_NODUMP_FL,
+	[__BCH_INODE_noatime]		= FS_NOATIME_FL,
+	[__BCH_INODE_casefolded]	= FS_CASEFOLD_FL,
 };

 /* bcachefs inode flags -> FS_IOC_FSGETXATTR: */

+80 -59

fs/bcachefs/fs.c

··· 11 11 #include "errcode.h" 12 12 #include "extents.h" 13 13 #include "fs.h" 14 - #include "fs-common.h" 15 14 #include "fs-io.h" 16 15 #include "fs-ioctl.h" 17 16 #include "fs-io-buffered.h" ··· 21 22 #include "io_read.h" 22 23 #include "journal.h" 23 24 #include "keylist.h" 25 + #include "namei.h" 24 26 #include "quota.h" 25 27 #include "rebalance.h" 26 28 #include "snapshot.h" ··· 641 641 if (ret) 642 642 return ERR_PTR(ret); 643 643 644 - ret = bch2_dirent_read_target(trans, dir, bkey_s_c_to_dirent(k), &inum); 644 + struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k); 645 + 646 + ret = bch2_dirent_read_target(trans, dir, d, &inum); 645 647 if (ret > 0) 646 648 ret = -ENOENT; 647 649 if (ret) ··· 653 651 if (inode) 654 652 goto out; 655 653 654 + /* 655 + * Note: if check/repair needs it, we commit before 656 + * bch2_inode_hash_init_insert(), as after that point we can't take a 657 + * restart - not in the top level loop with a commit_do(), like we 658 + * usually do: 659 + */ 660 + 656 661 struct bch_subvolume subvol; 657 662 struct bch_inode_unpacked inode_u; 658 663 ret = bch2_subvolume_get(trans, inum.subvol, true, &subvol) ?: 659 664 bch2_inode_find_by_inum_nowarn_trans(trans, inum, &inode_u) ?: 665 + bch2_check_dirent_target(trans, &dirent_iter, d, &inode_u, false) ?: 666 + bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc) ?: 660 667 PTR_ERR_OR_ZERO(inode = bch2_inode_hash_init_insert(trans, inum, &inode_u, &subvol)); 661 668 669 + /* 670 + * don't remove it: check_inodes might find another inode that points 671 + * back to this dirent 672 + */ 662 673 bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT), 663 674 c, "dirent to missing inode:\n %s", 664 - (bch2_bkey_val_to_text(&buf, c, k), buf.buf)); 675 + (bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf)); 665 676 if (ret) 666 677 goto err; 667 - 668 - /* regular files may have hardlinks: */ 669 - if (bch2_fs_inconsistent_on(bch2_inode_should_have_single_bp(&inode_u) && 670 - 
!bkey_eq(k.k->p, POS(inode_u.bi_dir, inode_u.bi_dir_offset)), 671 - c, 672 - "dirent points to inode that does not point back:\n %s", 673 - (bch2_bkey_val_to_text(&buf, c, k), 674 - prt_printf(&buf, "\n "), 675 - bch2_inode_unpacked_to_text(&buf, &inode_u), 676 - buf.buf))) { 677 - ret = -ENOENT; 678 - goto err; 679 - } 680 678 out: 681 679 bch2_trans_iter_exit(trans, &dirent_iter); 682 680 printbuf_exit(&buf); ··· 699 697 &hash, &dentry->d_name))); 700 698 if (IS_ERR(inode)) 701 699 inode = NULL; 700 + 701 + #ifdef CONFIG_UNICODE 702 + if (!inode && IS_CASEFOLDED(vdir)) { 703 + /* 704 + * Do not cache a negative dentry in casefolded directories 705 + * as it would need to be invalidated in the following situation: 706 + * - Lookup file "blAH" in a casefolded directory 707 + * - Creation of file "BLAH" in a casefolded directory 708 + * - Lookup file "blAH" in a casefolded directory 709 + * which would fail if we had a negative dentry. 710 + * 711 + * We should come back to this when VFS has a method to handle 712 + * this edgecase. 
713 + */ 714 + return NULL; 715 + } 716 + #endif 702 717 703 718 return d_splice_alias(&inode->v, dentry); 704 719 } ··· 1821 1802 break; 1822 1803 } 1823 1804 1824 - mapping_set_large_folios(inode->v.i_mapping); 1805 + mapping_set_folio_min_order(inode->v.i_mapping, 1806 + get_order(trans->c->opts.block_size)); 1825 1807 } 1826 1808 1827 1809 static void bch2_free_inode(struct inode *vinode) ··· 2028 2008 return c ?: ERR_PTR(-ENOENT); 2029 2009 } 2030 2010 2031 - static int bch2_remount(struct super_block *sb, int *flags, 2032 - struct bch_opts opts) 2033 - { 2034 - struct bch_fs *c = sb->s_fs_info; 2035 - int ret = 0; 2036 - 2037 - opt_set(opts, read_only, (*flags & SB_RDONLY) != 0); 2038 - 2039 - if (opts.read_only != c->opts.read_only) { 2040 - down_write(&c->state_lock); 2041 - 2042 - if (opts.read_only) { 2043 - bch2_fs_read_only(c); 2044 - 2045 - sb->s_flags |= SB_RDONLY; 2046 - } else { 2047 - ret = bch2_fs_read_write(c); 2048 - if (ret) { 2049 - bch_err(c, "error going rw: %i", ret); 2050 - up_write(&c->state_lock); 2051 - ret = -EINVAL; 2052 - goto err; 2053 - } 2054 - 2055 - sb->s_flags &= ~SB_RDONLY; 2056 - } 2057 - 2058 - c->opts.read_only = opts.read_only; 2059 - 2060 - up_write(&c->state_lock); 2061 - } 2062 - 2063 - if (opt_defined(opts, errors)) 2064 - c->opts.errors = opts.errors; 2065 - err: 2066 - return bch2_err_class(ret); 2067 - } 2068 - 2069 2011 static int bch2_show_devname(struct seq_file *seq, struct dentry *root) 2070 2012 { 2071 2013 struct bch_fs *c = root->d_sb->s_fs_info; ··· 2174 2192 if (ret) 2175 2193 goto err; 2176 2194 2195 + if (opt_defined(opts, discard)) 2196 + set_bit(BCH_FS_discard_mount_opt_set, &c->flags); 2197 + 2177 2198 /* Some options can't be parsed until after the fs is started: */ 2178 2199 opts = bch2_opts_empty(); 2179 2200 ret = bch2_parse_mount_opts(c, &opts, NULL, opts_parse->parse_later.buf); ··· 2185 2200 2186 2201 bch2_opts_apply(&c->opts, opts); 2187 2202 2188 - ret = bch2_fs_start(c); 2189 - if (ret) 2190 
- goto err_stop_fs; 2203 + /* 2204 + * need to initialise sb and set c->vfs_sb _before_ starting fs, 2205 + * for blk_holder_ops 2206 + */ 2191 2207 2192 2208 sb = sget(fc->fs_type, NULL, bch2_set_super, fc->sb_flags|SB_NOSEC, c); 2193 2209 ret = PTR_ERR_OR_ZERO(sb); ··· 2249 2263 #endif 2250 2264 2251 2265 sb->s_shrink->seeks = 0; 2266 + 2267 + ret = bch2_fs_start(c); 2268 + if (ret) 2269 + goto err_put_super; 2252 2270 2253 2271 vinode = bch2_vfs_inode_get(c, BCACHEFS_ROOT_SUBVOL_INUM); 2254 2272 ret = PTR_ERR_OR_ZERO(vinode); ··· 2341 2351 { 2342 2352 struct super_block *sb = fc->root->d_sb; 2343 2353 struct bch2_opts_parse *opts = fc->fs_private; 2354 + struct bch_fs *c = sb->s_fs_info; 2355 + int ret = 0; 2344 2356 2345 - return bch2_remount(sb, &fc->sb_flags, opts->opts); 2357 + opt_set(opts->opts, read_only, (fc->sb_flags & SB_RDONLY) != 0); 2358 + 2359 + if (opts->opts.read_only != c->opts.read_only) { 2360 + down_write(&c->state_lock); 2361 + 2362 + if (opts->opts.read_only) { 2363 + bch2_fs_read_only(c); 2364 + 2365 + sb->s_flags |= SB_RDONLY; 2366 + } else { 2367 + ret = bch2_fs_read_write(c); 2368 + if (ret) { 2369 + bch_err(c, "error going rw: %i", ret); 2370 + up_write(&c->state_lock); 2371 + ret = -EINVAL; 2372 + goto err; 2373 + } 2374 + 2375 + sb->s_flags &= ~SB_RDONLY; 2376 + } 2377 + 2378 + c->opts.read_only = opts->opts.read_only; 2379 + 2380 + up_write(&c->state_lock); 2381 + } 2382 + 2383 + if (opt_defined(opts->opts, errors)) 2384 + c->opts.errors = opts->opts.errors; 2385 + err: 2386 + return bch2_err_class(ret); 2346 2387 } 2347 2388 2348 2389 static const struct fs_context_operations bch2_context_ops = {
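Several hunks in this merge, including the lookup path in fs.c, chain fallible steps with `?:`. This is the GNU C "conditional with omitted operand" extension (accepted by GCC and Clang): `a ?: b` evaluates `a` once and yields it if nonzero, otherwise evaluates `b`. With the 0-on-success convention, a chain stops at the first error. A minimal sketch, where `step`, `calls` and `run_chain` are hypothetical names used only to make the evaluation order observable:

```c
#include <assert.h>

#define NR_STEPS 3

static int calls[NR_STEPS]; /* how many times each step has run */

/* Pretend operation: records that it ran and returns the given error code. */
static int step(unsigned idx, int ret)
{
	assert(idx < NR_STEPS);
	calls[idx]++;
	return ret;
}

/*
 * Runs the steps in order and returns the first nonzero result.
 * Steps after a failing one are never evaluated.
 */
static int run_chain(int r0, int r1, int r2)
{
	return step(0, r0) ?:
		step(1, r1) ?:
		step(2, r2);
}
```

This is why inserting `bch2_check_dirent_target(...)` and `bch2_trans_commit(...)` into the chain ahead of `bch2_inode_hash_init_insert()` is enough to guarantee the hash insert only happens after the check and commit have succeeded.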

+6 -225

fs/bcachefs/fsck.c

··· 10 10 #include "dirent.h" 11 11 #include "error.h" 12 12 #include "fs.h" 13 - #include "fs-common.h" 14 13 #include "fsck.h" 15 14 #include "inode.h" 16 15 #include "keylist.h" 16 + #include "namei.h" 17 17 #include "recovery_passes.h" 18 18 #include "snapshot.h" 19 19 #include "super.h" ··· 22 22 23 23 #include <linux/bsearch.h> 24 24 #include <linux/dcache.h> /* struct qstr */ 25 - 26 - static bool inode_points_to_dirent(struct bch_inode_unpacked *inode, 27 - struct bkey_s_c_dirent d) 28 - { 29 - return inode->bi_dir == d.k->p.inode && 30 - inode->bi_dir_offset == d.k->p.offset; 31 - } 32 25 33 26 static int dirent_points_to_inode_nowarn(struct bkey_s_c_dirent d, 34 27 struct bch_inode_unpacked *inode) ··· 109 116 return ret; 110 117 } 111 118 112 - static int lookup_first_inode(struct btree_trans *trans, u64 inode_nr, 113 - struct bch_inode_unpacked *inode) 114 - { 115 - struct btree_iter iter; 116 - struct bkey_s_c k; 117 - int ret; 118 - 119 - for_each_btree_key_norestart(trans, iter, BTREE_ID_inodes, POS(0, inode_nr), 120 - BTREE_ITER_all_snapshots, k, ret) { 121 - if (k.k->p.offset != inode_nr) 122 - break; 123 - if (!bkey_is_inode(k.k)) 124 - continue; 125 - ret = bch2_inode_unpack(k, inode); 126 - goto found; 127 - } 128 - ret = -BCH_ERR_ENOENT_inode; 129 - found: 130 - bch_err_msg(trans->c, ret, "fetching inode %llu", inode_nr); 131 - bch2_trans_iter_exit(trans, &iter); 132 - return ret; 133 - } 134 - 135 119 static int lookup_inode(struct btree_trans *trans, u64 inode_nr, u32 snapshot, 136 120 struct bch_inode_unpacked *inode) 137 121 { ··· 147 177 *type = d.v->d_type; 148 178 bch2_trans_iter_exit(trans, &iter); 149 179 return 0; 150 - } 151 - 152 - static int __remove_dirent(struct btree_trans *trans, struct bpos pos) 153 - { 154 - struct bch_fs *c = trans->c; 155 - struct btree_iter iter; 156 - struct bch_inode_unpacked dir_inode; 157 - struct bch_hash_info dir_hash_info; 158 - int ret; 159 - 160 - ret = lookup_first_inode(trans, pos.inode, 
&dir_inode); 161 - if (ret) 162 - goto err; 163 - 164 - dir_hash_info = bch2_hash_info_init(c, &dir_inode); 165 - 166 - bch2_trans_iter_init(trans, &iter, BTREE_ID_dirents, pos, BTREE_ITER_intent); 167 - 168 - ret = bch2_btree_iter_traverse(&iter) ?: 169 - bch2_hash_delete_at(trans, bch2_dirent_hash_desc, 170 - &dir_hash_info, &iter, 171 - BTREE_UPDATE_internal_snapshot_node); 172 - bch2_trans_iter_exit(trans, &iter); 173 - err: 174 - bch_err_fn(c, ret); 175 - return ret; 176 180 } 177 181 178 182 /* ··· 492 548 SPOS(inode->bi_dir, inode->bi_dir_offset, inode->bi_snapshot)); 493 549 int ret = bkey_err(d) ?: 494 550 dirent_points_to_inode(c, d, inode) ?: 495 - __remove_dirent(trans, d.k->p); 551 + bch2_fsck_remove_dirent(trans, d.k->p); 496 552 bch2_trans_iter_exit(trans, &iter); 497 553 return ret; 498 554 } ··· 1929 1985 trans_was_restarted(trans, restart_count); 1930 1986 } 1931 1987 1932 - noinline_for_stack 1933 - static int check_dirent_inode_dirent(struct btree_trans *trans, 1934 - struct btree_iter *iter, 1935 - struct bkey_s_c_dirent d, 1936 - struct bch_inode_unpacked *target) 1937 - { 1938 - struct bch_fs *c = trans->c; 1939 - struct printbuf buf = PRINTBUF; 1940 - struct btree_iter bp_iter = { NULL }; 1941 - int ret = 0; 1942 - 1943 - if (inode_points_to_dirent(target, d)) 1944 - return 0; 1945 - 1946 - if (!target->bi_dir && 1947 - !target->bi_dir_offset) { 1948 - fsck_err_on(S_ISDIR(target->bi_mode), 1949 - trans, inode_dir_missing_backpointer, 1950 - "directory with missing backpointer\n%s", 1951 - (printbuf_reset(&buf), 1952 - bch2_bkey_val_to_text(&buf, c, d.s_c), 1953 - prt_printf(&buf, "\n"), 1954 - bch2_inode_unpacked_to_text(&buf, target), 1955 - buf.buf)); 1956 - 1957 - fsck_err_on(target->bi_flags & BCH_INODE_unlinked, 1958 - trans, inode_unlinked_but_has_dirent, 1959 - "inode unlinked but has dirent\n%s", 1960 - (printbuf_reset(&buf), 1961 - bch2_bkey_val_to_text(&buf, c, d.s_c), 1962 - prt_printf(&buf, "\n"), 1963 - 
bch2_inode_unpacked_to_text(&buf, target), 1964 - buf.buf)); 1965 - 1966 - target->bi_flags &= ~BCH_INODE_unlinked; 1967 - target->bi_dir = d.k->p.inode; 1968 - target->bi_dir_offset = d.k->p.offset; 1969 - return __bch2_fsck_write_inode(trans, target); 1970 - } 1971 - 1972 - if (bch2_inode_should_have_single_bp(target) && 1973 - !fsck_err(trans, inode_wrong_backpointer, 1974 - "dirent points to inode that does not point back:\n %s", 1975 - (bch2_bkey_val_to_text(&buf, c, d.s_c), 1976 - prt_printf(&buf, "\n "), 1977 - bch2_inode_unpacked_to_text(&buf, target), 1978 - buf.buf))) 1979 - goto err; 1980 - 1981 - struct bkey_s_c_dirent bp_dirent = dirent_get_by_pos(trans, &bp_iter, 1982 - SPOS(target->bi_dir, target->bi_dir_offset, target->bi_snapshot)); 1983 - ret = bkey_err(bp_dirent); 1984 - if (ret && !bch2_err_matches(ret, ENOENT)) 1985 - goto err; 1986 - 1987 - bool backpointer_exists = !ret; 1988 - ret = 0; 1989 - 1990 - if (fsck_err_on(!backpointer_exists, 1991 - trans, inode_wrong_backpointer, 1992 - "inode %llu:%u has wrong backpointer:\n" 1993 - "got %llu:%llu\n" 1994 - "should be %llu:%llu", 1995 - target->bi_inum, target->bi_snapshot, 1996 - target->bi_dir, 1997 - target->bi_dir_offset, 1998 - d.k->p.inode, 1999 - d.k->p.offset)) { 2000 - target->bi_dir = d.k->p.inode; 2001 - target->bi_dir_offset = d.k->p.offset; 2002 - ret = __bch2_fsck_write_inode(trans, target); 2003 - goto out; 2004 - } 2005 - 2006 - bch2_bkey_val_to_text(&buf, c, d.s_c); 2007 - prt_newline(&buf); 2008 - if (backpointer_exists) 2009 - bch2_bkey_val_to_text(&buf, c, bp_dirent.s_c); 2010 - 2011 - if (fsck_err_on(backpointer_exists && 2012 - (S_ISDIR(target->bi_mode) || 2013 - target->bi_subvol), 2014 - trans, inode_dir_multiple_links, 2015 - "%s %llu:%u with multiple links\n%s", 2016 - S_ISDIR(target->bi_mode) ? 
"directory" : "subvolume", 2017 - target->bi_inum, target->bi_snapshot, buf.buf)) { 2018 - ret = __remove_dirent(trans, d.k->p); 2019 - goto out; 2020 - } 2021 - 2022 - /* 2023 - * hardlinked file with nlink 0: 2024 - * We're just adjusting nlink here so check_nlinks() will pick 2025 - * it up, it ignores inodes with nlink 0 2026 - */ 2027 - if (fsck_err_on(backpointer_exists && !target->bi_nlink, 2028 - trans, inode_multiple_links_but_nlink_0, 2029 - "inode %llu:%u type %s has multiple links but i_nlink 0\n%s", 2030 - target->bi_inum, target->bi_snapshot, bch2_d_types[d.v->d_type], buf.buf)) { 2031 - target->bi_nlink++; 2032 - target->bi_flags &= ~BCH_INODE_unlinked; 2033 - ret = __bch2_fsck_write_inode(trans, target); 2034 - if (ret) 2035 - goto err; 2036 - } 2037 - out: 2038 - err: 2039 - fsck_err: 2040 - bch2_trans_iter_exit(trans, &bp_iter); 2041 - printbuf_exit(&buf); 2042 - bch_err_fn(c, ret); 2043 - return ret; 2044 - } 2045 - 2046 - noinline_for_stack 2047 - static int check_dirent_target(struct btree_trans *trans, 2048 - struct btree_iter *iter, 2049 - struct bkey_s_c_dirent d, 2050 - struct bch_inode_unpacked *target) 2051 - { 2052 - struct bch_fs *c = trans->c; 2053 - struct bkey_i_dirent *n; 2054 - struct printbuf buf = PRINTBUF; 2055 - int ret = 0; 2056 - 2057 - ret = check_dirent_inode_dirent(trans, iter, d, target); 2058 - if (ret) 2059 - goto err; 2060 - 2061 - if (fsck_err_on(d.v->d_type != inode_d_type(target), 2062 - trans, dirent_d_type_wrong, 2063 - "incorrect d_type: got %s, should be %s:\n%s", 2064 - bch2_d_type_str(d.v->d_type), 2065 - bch2_d_type_str(inode_d_type(target)), 2066 - (printbuf_reset(&buf), 2067 - bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf))) { 2068 - n = bch2_trans_kmalloc(trans, bkey_bytes(d.k)); 2069 - ret = PTR_ERR_OR_ZERO(n); 2070 - if (ret) 2071 - goto err; 2072 - 2073 - bkey_reassemble(&n->k_i, d.s_c); 2074 - n->v.d_type = inode_d_type(target); 2075 - if (n->v.d_type == DT_SUBVOL) { 2076 - n->v.d_parent_subvol = 
cpu_to_le32(target->bi_parent_subvol); 2077 - n->v.d_child_subvol = cpu_to_le32(target->bi_subvol); 2078 - } else { 2079 - n->v.d_inum = cpu_to_le64(target->bi_inum); 2080 - } 2081 - 2082 - ret = bch2_trans_update(trans, iter, &n->k_i, 0); 2083 - if (ret) 2084 - goto err; 2085 - 2086 - d = dirent_i_to_s_c(n); 2087 - } 2088 - err: 2089 - fsck_err: 2090 - printbuf_exit(&buf); 2091 - bch_err_fn(c, ret); 2092 - return ret; 2093 - } 2094 - 2095 1988 /* find a subvolume that's a descendent of @snapshot: */ 2096 1989 static int find_snapshot_subvol(struct btree_trans *trans, u32 snapshot, u32 *subvolid) 2097 1990 { ··· 2028 2247 if (fsck_err(trans, dirent_to_missing_subvol, 2029 2248 "dirent points to missing subvolume\n%s", 2030 2249 (bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf))) 2031 - return __remove_dirent(trans, d.k->p); 2250 + return bch2_fsck_remove_dirent(trans, d.k->p); 2032 2251 ret = 0; 2033 2252 goto out; 2034 2253 } ··· 2072 2291 goto err; 2073 2292 } 2074 2293 2075 - ret = check_dirent_target(trans, iter, d, &subvol_root); 2294 + ret = bch2_check_dirent_target(trans, iter, d, &subvol_root, true); 2076 2295 if (ret) 2077 2296 goto err; 2078 2297 out: ··· 2159 2378 (printbuf_reset(&buf), 2160 2379 bch2_bkey_val_to_text(&buf, c, k), 2161 2380 buf.buf))) { 2162 - ret = __remove_dirent(trans, d.k->p); 2381 + ret = bch2_fsck_remove_dirent(trans, d.k->p); 2163 2382 if (ret) 2164 2383 goto err; 2165 2384 } 2166 2385 2167 2386 darray_for_each(target->inodes, i) { 2168 - ret = check_dirent_target(trans, iter, d, &i->inode); 2387 + ret = bch2_check_dirent_target(trans, iter, d, &i->inode, true); 2169 2388 if (ret) 2170 2389 goto err; 2171 2390 }

+5 -19

fs/bcachefs/inode.c

@@ -731,10 +731,9 @@
 		bkey_s_to_inode_v3(new).v->bi_journal_seq = cpu_to_le64(trans->journal_res.seq);
 	}

-	s64 nr = bkey_is_inode(new.k) - bkey_is_inode(old.k);
-	if ((flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) && nr) {
-		struct disk_accounting_pos acc = { .type = BCH_DISK_ACCOUNTING_nr_inodes };
-		int ret = bch2_disk_accounting_mod(trans, &acc, &nr, 1, flags & BTREE_TRIGGER_gc);
+	s64 nr[1] = { bkey_is_inode(new.k) - bkey_is_inode(old.k) };
+	if ((flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) && nr[0]) {
+		int ret = bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc, nr, nr_inodes);
 		if (ret)
 			return ret;
 	}
@@ -866,19 +865,6 @@
 	bch2_inode_init_early(c, inode_u);
 	bch2_inode_init_late(inode_u, bch2_current_time(c),
 			     uid, gid, mode, rdev, parent);
 }
-
-static inline u32 bkey_generation(struct bkey_s_c k)
-{
-	switch (k.k->type) {
-	case KEY_TYPE_inode:
-	case KEY_TYPE_inode_v2:
-		BUG();
-	case KEY_TYPE_inode_generation:
-		return le32_to_cpu(bkey_s_c_to_inode_generation(k).v->bi_generation);
-	default:
-		return 0;
-	}
-}

 static struct bkey_i_inode_alloc_cursor *
@@ -1092,7 +1078,7 @@
 		bch2_fs_inconsistent(c,
 			"inode %llu:%u not found when deleting",
 			inum.inum, snapshot);
-		ret = -EIO;
+		ret = -BCH_ERR_ENOENT_inode;
 		goto err;
 	}

@@ -1256,7 +1242,7 @@
 		bch2_fs_inconsistent(c,
 			"inode %llu:%u not found when deleting",
 			inum, snapshot);
-		ret = -EIO;
+		ret = -BCH_ERR_ENOENT_inode;
 		goto err;
 	}


+1

fs/bcachefs/inode.h

···
277 277 	bool inode_has_bp = inode->bi_dir || inode->bi_dir_offset;
278 278
279 279 	return S_ISDIR(inode->bi_mode) ||
280 + 		inode->bi_subvol ||
280 281 		(!inode->bi_nlink && inode_has_bp);
281 282 }
282 283

+2 -1

fs/bcachefs/inode_format.h

···
137 137 	x(i_sectors_dirty,	6)	\
138 138 	x(unlinked,		7)	\
139 139 	x(backptr_untrusted,	8)	\
140 - 	x(has_child_snapshot,	9)
140 + 	x(has_child_snapshot,	9)	\
141 + 	x(casefolded,		10)
141 142
142 143 /* bits 20+ reserved for packed fields below: */
143 144
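
The flag list above is an x-macro: each `x(name, bit)` entry is expanded several times with different definitions of `x`. That is why appending `casefolded` also requires adding a trailing backslash to the previous last entry. The sketch below shows the general technique with hypothetical `SKETCH_` names and a shortened list; the exact expansions used in the real header are not visible in this hunk and are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical flag list in the same x(name, bit) style as the hunk above. */
#define SKETCH_INODE_FLAGS()		\
	x(unlinked,		7)	\
	x(backptr_untrusted,	8)	\
	x(has_child_snapshot,	9)	\
	x(casefolded,		10)

/* Expand the list once into bit numbers... */
enum sketch_inode_flag_bits {
#define x(name, bit)	SKETCH_INODE_##name##_BIT = bit,
	SKETCH_INODE_FLAGS()
#undef x
};

/* ...once into bit masks... */
enum sketch_inode_flag_masks {
#define x(name, bit)	SKETCH_INODE_##name = 1U << bit,
	SKETCH_INODE_FLAGS()
#undef x
};

/* ...and once into a name table indexed by bit number. */
static const char * const sketch_inode_flag_names[] = {
#define x(name, bit)	[bit] = #name,
	SKETCH_INODE_FLAGS()
#undef x
};
```

Adding a flag in one place keeps the enum values, masks, and printable names in sync, which is the main reason for using the pattern.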

+2 -1

fs/bcachefs/io_misc.c

···
115 115 		bch2_increment_clock(c, sectors_allocated, WRITE);
116 116 	if (should_print_err(ret)) {
117 117 		struct printbuf buf = PRINTBUF;
118 - 		bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter->pos.offset << 9);
118 + 		lockrestart_do(trans,
119 + 			bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter->pos.offset << 9));
119 120 		prt_printf(&buf, "fallocate error: %s", bch2_err_str(ret));
120 121 		bch_err_ratelimited(c, "%s", buf.buf);
121 122 		printbuf_exit(&buf);
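
The hunk above wraps the error-message helper in `lockrestart_do()`. Assuming that wrapper re-runs its expression for as long as it reports a transaction restart, the pattern can be sketched in plain C as below. The real `lockrestart_do()` is a kernel macro; the function-pointer form, the `SKETCH_ERR_restart` code, and all `sketch_` names here are hypothetical stand-ins for illustration.

```c
/* Hypothetical "transaction restarted, try again" error code. */
#define SKETCH_ERR_restart 4096

struct sketch_trans {
	unsigned restarts;	/* how many times the operation was retried */
};

/* Stand-in for resetting per-attempt transaction state. */
static void sketch_trans_begin(struct sketch_trans *trans)
{
	(void) trans;
}

/* Re-run @fn until it returns something other than a restart. */
static int sketch_lockrestart_do(struct sketch_trans *trans,
				 int (*fn)(struct sketch_trans *, void *),
				 void *arg)
{
	int ret;

	do {
		sketch_trans_begin(trans);
		ret = fn(trans, arg);
		if (ret == -SKETCH_ERR_restart)
			trans->restarts++;
	} while (ret == -SKETCH_ERR_restart);

	return ret;
}

/* Example operation: asks for a restart twice, then succeeds. */
struct sketch_op {
	int attempts;
};

static int sketch_flaky_op(struct sketch_trans *trans, void *arg)
{
	struct sketch_op *op = arg;

	(void) trans;
	return ++op->attempts < 3 ? -SKETCH_ERR_restart : 0;
}
```

Without a retry loop of this kind, a restart returned by the helper would go unhandled and the error message could be left incomplete.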

+407 -350

fs/bcachefs/io_read.c

··· 25 25 #include "subvolume.h" 26 26 #include "trace.h" 27 27 28 + #include <linux/random.h> 28 29 #include <linux/sched/mm.h> 30 + 31 + #ifdef CONFIG_BCACHEFS_DEBUG 32 + static unsigned bch2_read_corrupt_ratio; 33 + module_param_named(read_corrupt_ratio, bch2_read_corrupt_ratio, uint, 0644); 34 + MODULE_PARM_DESC(read_corrupt_ratio, ""); 35 + #endif 29 36 30 37 #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT 31 38 ··· 87 80 struct rhash_head hash; 88 81 struct bpos pos; 89 82 83 + struct work_struct work; 90 84 struct data_update write; 91 85 struct bio_vec bi_inline_vecs[]; /* must be last */ 92 86 }; ··· 104 96 return failed && failed->nr; 105 97 } 106 98 99 + static inline struct data_update *rbio_data_update(struct bch_read_bio *rbio) 100 + { 101 + EBUG_ON(rbio->split); 102 + 103 + return rbio->data_update 104 + ? container_of(rbio, struct data_update, rbio) 105 + : NULL; 106 + } 107 + 108 + static bool ptr_being_rewritten(struct bch_read_bio *orig, unsigned dev) 109 + { 110 + struct data_update *u = rbio_data_update(orig); 111 + if (!u) 112 + return false; 113 + 114 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(bkey_i_to_s_c(u->k.k)); 115 + unsigned i = 0; 116 + bkey_for_each_ptr(ptrs, ptr) { 117 + if (ptr->dev == dev && 118 + u->data_opts.rewrite_ptrs & BIT(i)) 119 + return true; 120 + i++; 121 + } 122 + 123 + return false; 124 + } 125 + 107 126 static inline int should_promote(struct bch_fs *c, struct bkey_s_c k, 108 127 struct bpos pos, 109 128 struct bch_io_opts opts, ··· 140 105 if (!have_io_error(failed)) { 141 106 BUG_ON(!opts.promote_target); 142 107 143 - if (!(flags & BCH_READ_MAY_PROMOTE)) 108 + if (!(flags & BCH_READ_may_promote)) 144 109 return -BCH_ERR_nopromote_may_not; 145 110 146 111 if (bch2_bkey_has_target(c, k, opts.promote_target)) ··· 160 125 return 0; 161 126 } 162 127 163 - static void promote_free(struct bch_fs *c, struct promote_op *op) 128 + static noinline void promote_free(struct bch_read_bio *rbio) 164 129 { 165 - int ret; 130 + struct 
promote_op *op = container_of(rbio, struct promote_op, write.rbio); 131 + struct bch_fs *c = rbio->c; 132 + 133 + int ret = rhashtable_remove_fast(&c->promote_table, &op->hash, 134 + bch_promote_params); 135 + BUG_ON(ret); 166 136 167 137 bch2_data_update_exit(&op->write); 168 138 169 - ret = rhashtable_remove_fast(&c->promote_table, &op->hash, 170 - bch_promote_params); 171 - BUG_ON(ret); 172 139 bch2_write_ref_put(c, BCH_WRITE_REF_promote); 173 140 kfree_rcu(op, rcu); 174 141 } 175 142 176 143 static void promote_done(struct bch_write_op *wop) 177 144 { 178 - struct promote_op *op = 179 - container_of(wop, struct promote_op, write.op); 180 - struct bch_fs *c = op->write.op.c; 145 + struct promote_op *op = container_of(wop, struct promote_op, write.op); 146 + struct bch_fs *c = op->write.rbio.c; 181 147 182 - bch2_time_stats_update(&c->times[BCH_TIME_data_promote], 183 - op->start_time); 184 - promote_free(c, op); 148 + bch2_time_stats_update(&c->times[BCH_TIME_data_promote], op->start_time); 149 + promote_free(&op->write.rbio); 185 150 } 186 151 187 - static void promote_start(struct promote_op *op, struct bch_read_bio *rbio) 152 + static void promote_start_work(struct work_struct *work) 188 153 { 189 - struct bio *bio = &op->write.op.wbio.bio; 154 + struct promote_op *op = container_of(work, struct promote_op, work); 190 155 191 - trace_and_count(op->write.op.c, read_promote, &rbio->bio); 192 - 193 - /* we now own pages: */ 194 - BUG_ON(!rbio->bounce); 195 - BUG_ON(rbio->bio.bi_vcnt > bio->bi_max_vecs); 196 - 197 - memcpy(bio->bi_io_vec, rbio->bio.bi_io_vec, 198 - sizeof(struct bio_vec) * rbio->bio.bi_vcnt); 199 - swap(bio->bi_vcnt, rbio->bio.bi_vcnt); 200 - 201 - bch2_data_update_read_done(&op->write, rbio->pick.crc); 156 + bch2_data_update_read_done(&op->write); 202 157 } 203 158 204 - static struct promote_op *__promote_alloc(struct btree_trans *trans, 205 - enum btree_id btree_id, 206 - struct bkey_s_c k, 207 - struct bpos pos, 208 - struct 
extent_ptr_decoded *pick, 209 - struct bch_io_opts opts, 210 - unsigned sectors, 211 - struct bch_read_bio **rbio, 212 - struct bch_io_failures *failed) 159 + static noinline void promote_start(struct bch_read_bio *rbio) 160 + { 161 + struct promote_op *op = container_of(rbio, struct promote_op, write.rbio); 162 + 163 + trace_and_count(op->write.op.c, io_read_promote, &rbio->bio); 164 + 165 + INIT_WORK(&op->work, promote_start_work); 166 + queue_work(rbio->c->write_ref_wq, &op->work); 167 + } 168 + 169 + static struct bch_read_bio *__promote_alloc(struct btree_trans *trans, 170 + enum btree_id btree_id, 171 + struct bkey_s_c k, 172 + struct bpos pos, 173 + struct extent_ptr_decoded *pick, 174 + unsigned sectors, 175 + struct bch_read_bio *orig, 176 + struct bch_io_failures *failed) 213 177 { 214 178 struct bch_fs *c = trans->c; 215 - struct promote_op *op = NULL; 216 - struct bio *bio; 217 - unsigned pages = DIV_ROUND_UP(sectors, PAGE_SECTORS); 218 179 int ret; 180 + 181 + struct data_update_opts update_opts = { .write_flags = BCH_WRITE_alloc_nowait }; 182 + 183 + if (!have_io_error(failed)) { 184 + update_opts.target = orig->opts.promote_target; 185 + update_opts.extra_replicas = 1; 186 + update_opts.write_flags |= BCH_WRITE_cached; 187 + update_opts.write_flags |= BCH_WRITE_only_specified_devs; 188 + } else { 189 + update_opts.target = orig->opts.foreground_target; 190 + 191 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 192 + unsigned ptr_bit = 1; 193 + bkey_for_each_ptr(ptrs, ptr) { 194 + if (bch2_dev_io_failures(failed, ptr->dev) && 195 + !ptr_being_rewritten(orig, ptr->dev)) 196 + update_opts.rewrite_ptrs |= ptr_bit; 197 + ptr_bit <<= 1; 198 + } 199 + 200 + if (!update_opts.rewrite_ptrs) 201 + return NULL; 202 + } 219 203 220 204 if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_promote)) 221 205 return ERR_PTR(-BCH_ERR_nopromote_no_writes); 222 206 223 - op = kzalloc(struct_size(op, bi_inline_vecs, pages), GFP_KERNEL); 207 + struct promote_op *op = 
kzalloc(sizeof(*op), GFP_KERNEL); 224 208 if (!op) { 225 209 ret = -BCH_ERR_nopromote_enomem; 226 - goto err; 210 + goto err_put; 227 211 } 228 212 229 213 op->start_time = local_clock(); 230 214 op->pos = pos; 231 - 232 - /* 233 - * We don't use the mempool here because extents that aren't 234 - * checksummed or compressed can be too big for the mempool: 235 - */ 236 - *rbio = kzalloc(sizeof(struct bch_read_bio) + 237 - sizeof(struct bio_vec) * pages, 238 - GFP_KERNEL); 239 - if (!*rbio) { 240 - ret = -BCH_ERR_nopromote_enomem; 241 - goto err; 242 - } 243 - 244 - rbio_init(&(*rbio)->bio, opts); 245 - bio_init(&(*rbio)->bio, NULL, (*rbio)->bio.bi_inline_vecs, pages, 0); 246 - 247 - if (bch2_bio_alloc_pages(&(*rbio)->bio, sectors << 9, GFP_KERNEL)) { 248 - ret = -BCH_ERR_nopromote_enomem; 249 - goto err; 250 - } 251 - 252 - (*rbio)->bounce = true; 253 - (*rbio)->split = true; 254 - (*rbio)->kmalloc = true; 255 215 256 216 if (rhashtable_lookup_insert_fast(&c->promote_table, &op->hash, 257 217 bch_promote_params)) { ··· 254 224 goto err; 255 225 } 256 226 257 - bio = &op->write.op.wbio.bio; 258 - bio_init(bio, NULL, bio->bi_inline_vecs, pages, 0); 259 - 260 - struct data_update_opts update_opts = {}; 261 - 262 - if (!have_io_error(failed)) { 263 - update_opts.target = opts.promote_target; 264 - update_opts.extra_replicas = 1; 265 - update_opts.write_flags = BCH_WRITE_ALLOC_NOWAIT|BCH_WRITE_CACHED; 266 - } else { 267 - update_opts.target = opts.foreground_target; 268 - 269 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 270 - unsigned ptr_bit = 1; 271 - bkey_for_each_ptr(ptrs, ptr) { 272 - if (bch2_dev_io_failures(failed, ptr->dev)) 273 - update_opts.rewrite_ptrs |= ptr_bit; 274 - ptr_bit <<= 1; 275 - } 276 - } 277 - 278 227 ret = bch2_data_update_init(trans, NULL, NULL, &op->write, 279 228 writepoint_hashed((unsigned long) current), 280 - opts, 229 + &orig->opts, 281 230 update_opts, 282 231 btree_id, k); 283 232 /* 284 233 * possible errors: 
-BCH_ERR_nocow_lock_blocked, 285 234 * -BCH_ERR_ENOSPC_disk_reservation: 286 235 */ 287 - if (ret) { 288 - BUG_ON(rhashtable_remove_fast(&c->promote_table, &op->hash, 289 - bch_promote_params)); 290 - goto err; 291 - } 236 + if (ret) 237 + goto err_remove_hash; 292 238 239 + rbio_init_fragment(&op->write.rbio.bio, orig); 240 + op->write.rbio.bounce = true; 241 + op->write.rbio.promote = true; 293 242 op->write.op.end_io = promote_done; 294 243 295 - return op; 244 + return &op->write.rbio; 245 + err_remove_hash: 246 + BUG_ON(rhashtable_remove_fast(&c->promote_table, &op->hash, 247 + bch_promote_params)); 296 248 err: 297 - if (*rbio) 298 - bio_free_pages(&(*rbio)->bio); 299 - kfree(*rbio); 300 - *rbio = NULL; 249 + bio_free_pages(&op->write.op.wbio.bio); 301 250 /* We may have added to the rhashtable and thus need rcu freeing: */ 302 251 kfree_rcu(op, rcu); 252 + err_put: 303 253 bch2_write_ref_put(c, BCH_WRITE_REF_promote); 304 254 return ERR_PTR(ret); 305 255 } 306 256 307 257 noinline 308 - static struct promote_op *promote_alloc(struct btree_trans *trans, 258 + static struct bch_read_bio *promote_alloc(struct btree_trans *trans, 309 259 struct bvec_iter iter, 310 260 struct bkey_s_c k, 311 261 struct extent_ptr_decoded *pick, 312 - struct bch_io_opts opts, 313 262 unsigned flags, 314 - struct bch_read_bio **rbio, 263 + struct bch_read_bio *orig, 315 264 bool *bounce, 316 265 bool *read_full, 317 266 struct bch_io_failures *failed) ··· 310 301 struct bpos pos = promote_full 311 302 ? bkey_start_pos(k.k) 312 303 : POS(k.k->p.inode, iter.bi_sector); 313 - struct promote_op *promote; 314 304 int ret; 315 305 316 - ret = should_promote(c, k, pos, opts, flags, failed); 306 + ret = should_promote(c, k, pos, orig->opts, flags, failed); 317 307 if (ret) 318 308 goto nopromote; 319 309 320 - promote = __promote_alloc(trans, 321 - k.k->type == KEY_TYPE_reflink_v 322 - ? 
BTREE_ID_reflink 323 - : BTREE_ID_extents, 324 - k, pos, pick, opts, sectors, rbio, failed); 310 + struct bch_read_bio *promote = 311 + __promote_alloc(trans, 312 + k.k->type == KEY_TYPE_reflink_v 313 + ? BTREE_ID_reflink 314 + : BTREE_ID_extents, 315 + k, pos, pick, sectors, orig, failed); 316 + if (!promote) 317 + return NULL; 318 + 325 319 ret = PTR_ERR_OR_ZERO(promote); 326 320 if (ret) 327 321 goto nopromote; ··· 333 321 *read_full = promote_full; 334 322 return promote; 335 323 nopromote: 336 - trace_read_nopromote(c, ret); 324 + trace_io_read_nopromote(c, ret); 337 325 return NULL; 338 326 } 339 327 ··· 342 330 static int bch2_read_err_msg_trans(struct btree_trans *trans, struct printbuf *out, 343 331 struct bch_read_bio *rbio, struct bpos read_pos) 344 332 { 345 - return bch2_inum_offset_err_msg_trans(trans, out, 346 - (subvol_inum) { rbio->subvol, read_pos.inode }, 347 - read_pos.offset << 9); 333 + int ret = lockrestart_do(trans, 334 + bch2_inum_offset_err_msg_trans(trans, out, 335 + (subvol_inum) { rbio->subvol, read_pos.inode }, 336 + read_pos.offset << 9)); 337 + if (ret) 338 + return ret; 339 + 340 + if (rbio->data_update) 341 + prt_str(out, "(internal move) "); 342 + 343 + return 0; 348 344 } 349 345 350 346 static void bch2_read_err_msg(struct bch_fs *c, struct printbuf *out, ··· 360 340 { 361 341 bch2_trans_run(c, bch2_read_err_msg_trans(trans, out, rbio, read_pos)); 362 342 } 363 - 364 - #define READ_RETRY_AVOID 1 365 - #define READ_RETRY 2 366 - #define READ_ERR 3 367 343 368 344 enum rbio_context { 369 345 RBIO_CONTEXT_NULL, ··· 391 375 { 392 376 BUG_ON(rbio->bounce && !rbio->split); 393 377 394 - if (rbio->promote) 395 - promote_free(rbio->c, rbio->promote); 396 - rbio->promote = NULL; 397 - 398 - if (rbio->bounce) 399 - bch2_bio_free_pages_pool(rbio->c, &rbio->bio); 378 + if (rbio->have_ioref) { 379 + struct bch_dev *ca = bch2_dev_have_ref(rbio->c, rbio->pick.ptr.dev); 380 + percpu_ref_put(&ca->io_ref); 381 + } 400 382 401 383 if (rbio->split) 
{ 402 384 struct bch_read_bio *parent = rbio->parent; 403 385 404 - if (rbio->kmalloc) 405 - kfree(rbio); 406 - else 386 + if (unlikely(rbio->promote)) { 387 + if (!rbio->bio.bi_status) 388 + promote_start(rbio); 389 + else 390 + promote_free(rbio); 391 + } else { 392 + if (rbio->bounce) 393 + bch2_bio_free_pages_pool(rbio->c, &rbio->bio); 394 + 407 395 bio_put(&rbio->bio); 396 + } 408 397 409 398 rbio = parent; 410 399 } ··· 429 408 bio_endio(&rbio->bio); 430 409 } 431 410 432 - static void bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_bio *rbio, 433 - struct bvec_iter bvec_iter, 434 - struct bch_io_failures *failed, 435 - unsigned flags) 411 + static noinline int bch2_read_retry_nodecode(struct btree_trans *trans, 412 + struct bch_read_bio *rbio, 413 + struct bvec_iter bvec_iter, 414 + struct bch_io_failures *failed, 415 + unsigned flags) 436 416 { 437 - struct btree_trans *trans = bch2_trans_get(c); 438 - struct btree_iter iter; 439 - struct bkey_buf sk; 440 - struct bkey_s_c k; 441 - int ret; 442 - 443 - flags &= ~BCH_READ_LAST_FRAGMENT; 444 - flags |= BCH_READ_MUST_CLONE; 445 - 446 - bch2_bkey_buf_init(&sk); 447 - 448 - bch2_trans_iter_init(trans, &iter, rbio->data_btree, 449 - rbio->read_pos, BTREE_ITER_slots); 417 + struct data_update *u = container_of(rbio, struct data_update, rbio); 450 418 retry: 451 419 bch2_trans_begin(trans); 452 - rbio->bio.bi_status = 0; 453 420 454 - ret = lockrestart_do(trans, bkey_err(k = bch2_btree_iter_peek_slot(&iter))); 421 + struct btree_iter iter; 422 + struct bkey_s_c k; 423 + int ret = lockrestart_do(trans, 424 + bkey_err(k = bch2_bkey_get_iter(trans, &iter, 425 + u->btree_id, bkey_start_pos(&u->k.k->k), 426 + 0))); 455 427 if (ret) 456 428 goto err; 457 429 458 - bch2_bkey_buf_reassemble(&sk, c, k); 459 - k = bkey_i_to_s_c(sk.k); 460 - 461 - if (!bch2_bkey_matches_ptr(c, k, 462 - rbio->pick.ptr, 463 - rbio->data_pos.offset - 464 - rbio->pick.crc.offset)) { 430 + if (!bkey_and_val_eq(k, bkey_i_to_s_c(u->k.k))) 
{ 465 431 /* extent we wanted to read no longer exists: */ 466 - rbio->hole = true; 467 - goto out; 432 + rbio->ret = -BCH_ERR_data_read_key_overwritten; 433 + goto err; 468 434 } 469 435 470 436 ret = __bch2_read_extent(trans, rbio, bvec_iter, 471 - rbio->read_pos, 472 - rbio->data_btree, 473 - k, 0, failed, flags); 474 - if (ret == READ_RETRY) 475 - goto retry; 476 - if (ret) 477 - goto err; 478 - out: 479 - bch2_rbio_done(rbio); 480 - bch2_trans_iter_exit(trans, &iter); 481 - bch2_trans_put(trans); 482 - bch2_bkey_buf_exit(&sk, c); 483 - return; 437 + bkey_start_pos(&u->k.k->k), 438 + u->btree_id, 439 + bkey_i_to_s_c(u->k.k), 440 + 0, failed, flags, -1); 484 441 err: 485 - rbio->bio.bi_status = BLK_STS_IOERR; 486 - goto out; 442 + bch2_trans_iter_exit(trans, &iter); 443 + 444 + if (bch2_err_matches(ret, BCH_ERR_data_read_retry)) 445 + goto retry; 446 + 447 + if (ret) { 448 + rbio->bio.bi_status = BLK_STS_IOERR; 449 + rbio->ret = ret; 450 + } 451 + 452 + BUG_ON(atomic_read(&rbio->bio.__bi_remaining) != 1); 453 + return ret; 487 454 } 488 455 489 456 static void bch2_rbio_retry(struct work_struct *work) ··· 486 477 .inum = rbio->read_pos.inode, 487 478 }; 488 479 struct bch_io_failures failed = { .nr = 0 }; 480 + struct btree_trans *trans = bch2_trans_get(c); 489 481 490 - trace_and_count(c, read_retry, &rbio->bio); 482 + trace_io_read_retry(&rbio->bio); 483 + this_cpu_add(c->counters[BCH_COUNTER_io_read_retry], 484 + bvec_iter_sectors(rbio->bvec_iter)); 491 485 492 - if (rbio->retry == READ_RETRY_AVOID) 493 - bch2_mark_io_failure(&failed, &rbio->pick); 486 + if (bch2_err_matches(rbio->ret, BCH_ERR_data_read_retry_avoid)) 487 + bch2_mark_io_failure(&failed, &rbio->pick, 488 + rbio->ret == -BCH_ERR_data_read_retry_csum_err); 494 489 495 - rbio->bio.bi_status = 0; 490 + if (!rbio->split) { 491 + rbio->bio.bi_status = 0; 492 + rbio->ret = 0; 493 + } 494 + 495 + unsigned subvol = rbio->subvol; 496 + struct bpos read_pos = rbio->read_pos; 496 497 497 498 rbio = 
bch2_rbio_free(rbio); 498 499 499 - flags |= BCH_READ_IN_RETRY; 500 - flags &= ~BCH_READ_MAY_PROMOTE; 500 + flags |= BCH_READ_in_retry; 501 + flags &= ~BCH_READ_may_promote; 502 + flags &= ~BCH_READ_last_fragment; 503 + flags |= BCH_READ_must_clone; 501 504 502 - if (flags & BCH_READ_NODECODE) { 503 - bch2_read_retry_nodecode(c, rbio, iter, &failed, flags); 505 + int ret = rbio->data_update 506 + ? bch2_read_retry_nodecode(trans, rbio, iter, &failed, flags) 507 + : __bch2_read(trans, rbio, iter, inum, &failed, flags); 508 + 509 + if (ret) { 510 + rbio->ret = ret; 511 + rbio->bio.bi_status = BLK_STS_IOERR; 504 512 } else { 505 - flags &= ~BCH_READ_LAST_FRAGMENT; 506 - flags |= BCH_READ_MUST_CLONE; 513 + struct printbuf buf = PRINTBUF; 507 514 508 - __bch2_read(c, rbio, iter, inum, &failed, flags); 515 + lockrestart_do(trans, 516 + bch2_inum_offset_err_msg_trans(trans, &buf, 517 + (subvol_inum) { subvol, read_pos.inode }, 518 + read_pos.offset << 9)); 519 + if (rbio->data_update) 520 + prt_str(&buf, "(internal move) "); 521 + prt_str(&buf, "successful retry"); 522 + 523 + bch_err_ratelimited(c, "%s", buf.buf); 524 + printbuf_exit(&buf); 509 525 } 526 + 527 + bch2_rbio_done(rbio); 528 + bch2_trans_put(trans); 510 529 } 511 530 512 - static void bch2_rbio_error(struct bch_read_bio *rbio, int retry, 513 - blk_status_t error) 531 + static void bch2_rbio_error(struct bch_read_bio *rbio, 532 + int ret, blk_status_t blk_error) 514 533 { 515 - rbio->retry = retry; 534 + BUG_ON(ret >= 0); 516 535 517 - if (rbio->flags & BCH_READ_IN_RETRY) 536 + rbio->ret = ret; 537 + rbio->bio.bi_status = blk_error; 538 + 539 + bch2_rbio_parent(rbio)->saw_error = true; 540 + 541 + if (rbio->flags & BCH_READ_in_retry) 518 542 return; 519 543 520 - if (retry == READ_ERR) { 521 - rbio = bch2_rbio_free(rbio); 522 - 523 - rbio->bio.bi_status = error; 524 - bch2_rbio_done(rbio); 525 - } else { 544 + if (bch2_err_matches(ret, BCH_ERR_data_read_retry)) { 526 545 bch2_rbio_punt(rbio, bch2_rbio_retry, 
527 546 RBIO_CONTEXT_UNBOUND, system_unbound_wq); 547 + } else { 548 + rbio = bch2_rbio_free(rbio); 549 + 550 + rbio->ret = ret; 551 + rbio->bio.bi_status = blk_error; 552 + 553 + bch2_rbio_done(rbio); 528 554 } 529 555 } 530 556 ··· 575 531 bch2_read_err_msg(c, &buf, rbio, rbio->read_pos); 576 532 prt_printf(&buf, "data read error: %s", bch2_blk_status_to_str(bio->bi_status)); 577 533 578 - if (ca) { 579 - bch2_io_error(ca, BCH_MEMBER_ERROR_read); 534 + if (ca) 580 535 bch_err_ratelimited(ca, "%s", buf.buf); 581 - } else { 536 + else 582 537 bch_err_ratelimited(c, "%s", buf.buf); 583 - } 584 538 585 539 printbuf_exit(&buf); 586 - bch2_rbio_error(rbio, READ_RETRY_AVOID, bio->bi_status); 540 + bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_io_err, bio->bi_status); 587 541 } 588 542 589 543 static int __bch2_rbio_narrow_crcs(struct btree_trans *trans, ··· 663 621 bch2_csum_err_msg(&buf, crc.csum_type, rbio->pick.crc.csum, csum); 664 622 665 623 struct bch_dev *ca = rbio->have_ioref ? bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 666 - if (ca) { 667 - bch2_io_error(ca, BCH_MEMBER_ERROR_checksum); 624 + if (ca) 668 625 bch_err_ratelimited(ca, "%s", buf.buf); 669 - } else { 626 + else 670 627 bch_err_ratelimited(c, "%s", buf.buf); 671 - } 672 628 673 - bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 629 + bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_csum_err, BLK_STS_IOERR); 674 630 printbuf_exit(&buf); 675 631 } 676 632 ··· 688 648 else 689 649 bch_err_ratelimited(c, "%s", buf.buf); 690 650 691 - bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR); 651 + bch2_rbio_error(rbio, -BCH_ERR_data_read_decompress_err, BLK_STS_IOERR); 692 652 printbuf_exit(&buf); 693 653 } 694 654 ··· 708 668 else 709 669 bch_err_ratelimited(c, "%s", buf.buf); 710 670 711 - bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR); 671 + bch2_rbio_error(rbio, -BCH_ERR_data_read_decrypt_err, BLK_STS_IOERR); 712 672 printbuf_exit(&buf); 713 673 } 714 674 ··· 718 678 struct bch_read_bio *rbio = 719 
679 container_of(work, struct bch_read_bio, work); 720 680 struct bch_fs *c = rbio->c; 721 - struct bio *src = &rbio->bio; 722 - struct bio *dst = &bch2_rbio_parent(rbio)->bio; 723 - struct bvec_iter dst_iter = rbio->bvec_iter; 681 + struct bch_dev *ca = rbio->have_ioref ? bch2_dev_have_ref(c, rbio->pick.ptr.dev) : NULL; 682 + struct bch_read_bio *parent = bch2_rbio_parent(rbio); 683 + struct bio *src = &rbio->bio; 684 + struct bio *dst = &parent->bio; 685 + struct bvec_iter dst_iter = rbio->bvec_iter; 724 686 struct bch_extent_crc_unpacked crc = rbio->pick.crc; 725 687 struct nonce nonce = extent_nonce(rbio->version, crc); 726 688 unsigned nofs_flags; ··· 740 698 src->bi_iter = rbio->bvec_iter; 741 699 } 742 700 701 + bch2_maybe_corrupt_bio(src, bch2_read_corrupt_ratio); 702 + 743 703 csum = bch2_checksum_bio(c, crc.csum_type, nonce, src); 744 - if (bch2_crc_cmp(csum, rbio->pick.crc.csum) && !c->opts.no_data_io) 704 + bool csum_good = !bch2_crc_cmp(csum, rbio->pick.crc.csum) || c->opts.no_data_io; 705 + 706 + /* 707 + * Checksum error: if the bio wasn't bounced, we may have been 708 + * reading into buffers owned by userspace (that userspace can 709 + * scribble over) - retry the read, bouncing it this time: 710 + */ 711 + if (!csum_good && !rbio->bounce && (rbio->flags & BCH_READ_user_mapped)) { 712 + rbio->flags |= BCH_READ_must_bounce; 713 + bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_csum_err_maybe_userspace, 714 + BLK_STS_IOERR); 715 + goto out; 716 + } 717 + 718 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_checksum, 0, csum_good); 719 + 720 + if (!csum_good) 745 721 goto csum_err; 746 722 747 723 /* ··· 772 712 if (unlikely(rbio->narrow_crcs)) 773 713 bch2_rbio_narrow_crcs(rbio); 774 714 775 - if (rbio->flags & BCH_READ_NODECODE) 776 - goto nodecode; 715 + if (likely(!parent->data_update)) { 716 + /* Adjust crc to point to subset of data we want: */ 717 + crc.offset += rbio->offset_into_extent; 718 + crc.live_size = 
bvec_iter_sectors(rbio->bvec_iter); 777 719 778 - /* Adjust crc to point to subset of data we want: */ 779 - crc.offset += rbio->offset_into_extent; 780 - crc.live_size = bvec_iter_sectors(rbio->bvec_iter); 720 + if (crc_is_compressed(crc)) { 721 + ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 722 + if (ret) 723 + goto decrypt_err; 781 724 782 - if (crc_is_compressed(crc)) { 783 - ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 784 - if (ret) 785 - goto decrypt_err; 725 + if (bch2_bio_uncompress(c, src, dst, dst_iter, crc) && 726 + !c->opts.no_data_io) 727 + goto decompression_err; 728 + } else { 729 + /* don't need to decrypt the entire bio: */ 730 + nonce = nonce_add(nonce, crc.offset << 9); 731 + bio_advance(src, crc.offset << 9); 786 732 787 - if (bch2_bio_uncompress(c, src, dst, dst_iter, crc) && 788 - !c->opts.no_data_io) 789 - goto decompression_err; 733 + BUG_ON(src->bi_iter.bi_size < dst_iter.bi_size); 734 + src->bi_iter.bi_size = dst_iter.bi_size; 735 + 736 + ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 737 + if (ret) 738 + goto decrypt_err; 739 + 740 + if (rbio->bounce) { 741 + struct bvec_iter src_iter = src->bi_iter; 742 + 743 + bio_copy_data_iter(dst, &dst_iter, src, &src_iter); 744 + } 745 + } 790 746 } else { 791 - /* don't need to decrypt the entire bio: */ 792 - nonce = nonce_add(nonce, crc.offset << 9); 793 - bio_advance(src, crc.offset << 9); 794 - 795 - BUG_ON(src->bi_iter.bi_size < dst_iter.bi_size); 796 - src->bi_iter.bi_size = dst_iter.bi_size; 797 - 798 - ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 799 - if (ret) 800 - goto decrypt_err; 747 + if (rbio->split) 748 + rbio->parent->pick = rbio->pick; 801 749 802 750 if (rbio->bounce) { 803 751 struct bvec_iter src_iter = src->bi_iter; ··· 822 754 ret = bch2_encrypt_bio(c, crc.csum_type, nonce, src); 823 755 if (ret) 824 756 goto decrypt_err; 825 - 826 - promote_start(rbio->promote, rbio); 827 - rbio->promote = NULL; 828 757 } 829 - nodecode: 830 - if 
(likely(!(rbio->flags & BCH_READ_IN_RETRY))) { 758 + 759 + if (likely(!(rbio->flags & BCH_READ_in_retry))) { 831 760 rbio = bch2_rbio_free(rbio); 832 761 bch2_rbio_done(rbio); 833 762 } ··· 832 767 memalloc_nofs_restore(nofs_flags); 833 768 return; 834 769 csum_err: 835 - /* 836 - * Checksum error: if the bio wasn't bounced, we may have been 837 - * reading into buffers owned by userspace (that userspace can 838 - * scribble over) - retry the read, bouncing it this time: 839 - */ 840 - if (!rbio->bounce && (rbio->flags & BCH_READ_USER_MAPPED)) { 841 - rbio->flags |= BCH_READ_MUST_BOUNCE; 842 - bch2_rbio_error(rbio, READ_RETRY, BLK_STS_IOERR); 843 - goto out; 844 - } 845 - 846 770 bch2_rbio_punt(rbio, bch2_read_csum_err, RBIO_CONTEXT_UNBOUND, system_unbound_wq); 847 771 goto out; 848 772 decompression_err: ··· 851 797 struct workqueue_struct *wq = NULL; 852 798 enum rbio_context context = RBIO_CONTEXT_NULL; 853 799 854 - if (rbio->have_ioref) { 855 - bch2_latency_acct(ca, rbio->submit_time, READ); 856 - percpu_ref_put(&ca->io_ref); 857 - } 800 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, 801 + rbio->submit_time, !bio->bi_status); 858 802 859 803 if (!rbio->split) 860 804 rbio->bio.bi_end_io = rbio->end_io; ··· 862 810 return; 863 811 } 864 812 865 - if (((rbio->flags & BCH_READ_RETRY_IF_STALE) && race_fault()) || 813 + if (((rbio->flags & BCH_READ_retry_if_stale) && race_fault()) || 866 814 (ca && dev_ptr_stale(ca, &rbio->pick.ptr))) { 867 - trace_and_count(c, read_reuse_race, &rbio->bio); 815 + trace_and_count(c, io_read_reuse_race, &rbio->bio); 868 816 869 - if (rbio->flags & BCH_READ_RETRY_IF_STALE) 870 - bch2_rbio_error(rbio, READ_RETRY, BLK_STS_AGAIN); 817 + if (rbio->flags & BCH_READ_retry_if_stale) 818 + bch2_rbio_error(rbio, -BCH_ERR_data_read_ptr_stale_retry, BLK_STS_AGAIN); 871 819 else 872 - bch2_rbio_error(rbio, READ_ERR, BLK_STS_AGAIN); 820 + bch2_rbio_error(rbio, -BCH_ERR_data_read_ptr_stale_race, BLK_STS_AGAIN); 873 821 return; 874 822 } 
875 823 ··· 935 883 struct bvec_iter iter, struct bpos read_pos, 936 884 enum btree_id data_btree, struct bkey_s_c k, 937 885 unsigned offset_into_extent, 938 - struct bch_io_failures *failed, unsigned flags) 886 + struct bch_io_failures *failed, unsigned flags, int dev) 939 887 { 940 888 struct bch_fs *c = trans->c; 941 889 struct extent_ptr_decoded pick; 942 890 struct bch_read_bio *rbio = NULL; 943 - struct promote_op *promote = NULL; 944 891 bool bounce = false, read_full = false, narrow_crcs = false; 945 892 struct bpos data_pos = bkey_start_pos(k.k); 946 - int pick_ret; 893 + struct data_update *u = rbio_data_update(orig); 894 + int ret = 0; 947 895 948 896 if (bkey_extent_is_inline_data(k.k)) { 949 897 unsigned bytes = min_t(unsigned, iter.bi_size, ··· 954 902 swap(iter.bi_size, bytes); 955 903 bio_advance_iter(&orig->bio, &iter, bytes); 956 904 zero_fill_bio_iter(&orig->bio, iter); 905 + this_cpu_add(c->counters[BCH_COUNTER_io_read_inline], 906 + bvec_iter_sectors(iter)); 957 907 goto out_read_done; 958 908 } 959 909 retry_pick: 960 - pick_ret = bch2_bkey_pick_read_device(c, k, failed, &pick); 910 + ret = bch2_bkey_pick_read_device(c, k, failed, &pick, dev); 961 911 962 912 /* hole or reservation - just zero fill: */ 963 - if (!pick_ret) 913 + if (!ret) 964 914 goto hole; 965 915 966 - if (unlikely(pick_ret < 0)) { 916 + if (unlikely(ret < 0)) { 967 917 struct printbuf buf = PRINTBUF; 968 918 bch2_read_err_msg_trans(trans, &buf, orig, read_pos); 969 - prt_printf(&buf, "no device to read from: %s\n ", bch2_err_str(pick_ret)); 919 + prt_printf(&buf, "%s\n ", bch2_err_str(ret)); 970 920 bch2_bkey_val_to_text(&buf, c, k); 971 921 972 922 bch_err_ratelimited(c, "%s", buf.buf); ··· 984 930 985 931 bch_err_ratelimited(c, "%s", buf.buf); 986 932 printbuf_exit(&buf); 933 + ret = -BCH_ERR_data_read_no_encryption_key; 987 934 goto err; 988 935 } 989 936 ··· 996 941 * retry path, don't check here, it'll be caught in bch2_read_endio() 997 942 * and we'll end up in the 
retry path: 998 943 */ 999 - if ((flags & BCH_READ_IN_RETRY) && 944 + if ((flags & BCH_READ_in_retry) && 1000 945 !pick.ptr.cached && 1001 946 ca && 1002 947 unlikely(dev_ptr_stale(ca, &pick.ptr))) { 1003 948 read_from_stale_dirty_pointer(trans, ca, k, pick.ptr); 1004 - bch2_mark_io_failure(failed, &pick); 949 + bch2_mark_io_failure(failed, &pick, false); 1005 950 percpu_ref_put(&ca->io_ref); 1006 951 goto retry_pick; 1007 952 } 1008 953 1009 - if (flags & BCH_READ_NODECODE) { 954 + if (likely(!u)) { 955 + if (!(flags & BCH_READ_last_fragment) || 956 + bio_flagged(&orig->bio, BIO_CHAIN)) 957 + flags |= BCH_READ_must_clone; 958 + 959 + narrow_crcs = !(flags & BCH_READ_in_retry) && 960 + bch2_can_narrow_extent_crcs(k, pick.crc); 961 + 962 + if (narrow_crcs && (flags & BCH_READ_user_mapped)) 963 + flags |= BCH_READ_must_bounce; 964 + 965 + EBUG_ON(offset_into_extent + bvec_iter_sectors(iter) > k.k->size); 966 + 967 + if (crc_is_compressed(pick.crc) || 968 + (pick.crc.csum_type != BCH_CSUM_none && 969 + (bvec_iter_sectors(iter) != pick.crc.uncompressed_size || 970 + (bch2_csum_type_is_encryption(pick.crc.csum_type) && 971 + (flags & BCH_READ_user_mapped)) || 972 + (flags & BCH_READ_must_bounce)))) { 973 + read_full = true; 974 + bounce = true; 975 + } 976 + } else { 1010 977 /* 1011 978 * can happen if we retry, and the extent we were going to read 1012 979 * has been merged in the meantime: 1013 980 */ 1014 - if (pick.crc.compressed_size > orig->bio.bi_vcnt * PAGE_SECTORS) { 981 + if (pick.crc.compressed_size > u->op.wbio.bio.bi_iter.bi_size) { 1015 982 if (ca) 1016 983 percpu_ref_put(&ca->io_ref); 1017 - goto hole; 984 + rbio->ret = -BCH_ERR_data_read_buffer_too_small; 985 + goto out_read_done; 1018 986 } 1019 987 1020 988 iter.bi_size = pick.crc.compressed_size << 9; 1021 - goto get_bio; 1022 - } 1023 - 1024 - if (!(flags & BCH_READ_LAST_FRAGMENT) || 1025 - bio_flagged(&orig->bio, BIO_CHAIN)) 1026 - flags |= BCH_READ_MUST_CLONE; 1027 - 1028 - narrow_crcs = !(flags & 
BCH_READ_IN_RETRY) && 1029 - bch2_can_narrow_extent_crcs(k, pick.crc); 1030 - 1031 - if (narrow_crcs && (flags & BCH_READ_USER_MAPPED)) 1032 - flags |= BCH_READ_MUST_BOUNCE; 1033 - 1034 - EBUG_ON(offset_into_extent + bvec_iter_sectors(iter) > k.k->size); 1035 - 1036 - if (crc_is_compressed(pick.crc) || 1037 - (pick.crc.csum_type != BCH_CSUM_none && 1038 - (bvec_iter_sectors(iter) != pick.crc.uncompressed_size || 1039 - (bch2_csum_type_is_encryption(pick.crc.csum_type) && 1040 - (flags & BCH_READ_USER_MAPPED)) || 1041 - (flags & BCH_READ_MUST_BOUNCE)))) { 1042 989 read_full = true; 1043 - bounce = true; 1044 990 } 1045 991 1046 992 if (orig->opts.promote_target || have_io_error(failed)) 1047 - promote = promote_alloc(trans, iter, k, &pick, orig->opts, flags, 1048 - &rbio, &bounce, &read_full, failed); 993 + rbio = promote_alloc(trans, iter, k, &pick, flags, orig, 994 + &bounce, &read_full, failed); 1049 995 1050 996 if (!read_full) { 1051 997 EBUG_ON(crc_is_compressed(pick.crc)); ··· 1065 1009 pick.crc.offset = 0; 1066 1010 pick.crc.live_size = bvec_iter_sectors(iter); 1067 1011 } 1068 - get_bio: 1012 + 1069 1013 if (rbio) { 1070 1014 /* 1071 1015 * promote already allocated bounce rbio: ··· 1080 1024 } else if (bounce) { 1081 1025 unsigned sectors = pick.crc.compressed_size; 1082 1026 1083 - rbio = rbio_init(bio_alloc_bioset(NULL, 1027 + rbio = rbio_init_fragment(bio_alloc_bioset(NULL, 1084 1028 DIV_ROUND_UP(sectors, PAGE_SECTORS), 1085 1029 0, 1086 1030 GFP_NOFS, 1087 1031 &c->bio_read_split), 1088 - orig->opts); 1032 + orig); 1089 1033 1090 1034 bch2_bio_alloc_pages_pool(c, &rbio->bio, sectors << 9); 1091 1035 rbio->bounce = true; 1092 - rbio->split = true; 1093 - } else if (flags & BCH_READ_MUST_CLONE) { 1036 + } else if (flags & BCH_READ_must_clone) { 1094 1037 /* 1095 1038 * Have to clone if there were any splits, due to error 1096 1039 * reporting issues (if a split errored, and retrying didn't ··· 1098 1043 * from the whole bio, in which case we don't want 
to retry and 1099 1044 * lose the error) 1100 1045 */ 1101 - rbio = rbio_init(bio_alloc_clone(NULL, &orig->bio, GFP_NOFS, 1046 + rbio = rbio_init_fragment(bio_alloc_clone(NULL, &orig->bio, GFP_NOFS, 1102 1047 &c->bio_read_split), 1103 - orig->opts); 1048 + orig); 1104 1049 rbio->bio.bi_iter = iter; 1105 - rbio->split = true; 1106 1050 } else { 1107 1051 rbio = orig; 1108 1052 rbio->bio.bi_iter = iter; ··· 1110 1056 1111 1057 EBUG_ON(bio_sectors(&rbio->bio) != pick.crc.compressed_size); 1112 1058 1113 - rbio->c = c; 1114 1059 rbio->submit_time = local_clock(); 1115 - if (rbio->split) 1116 - rbio->parent = orig; 1117 - else 1060 + if (!rbio->split) 1118 1061 rbio->end_io = orig->bio.bi_end_io; 1119 1062 rbio->bvec_iter = iter; 1120 1063 rbio->offset_into_extent= offset_into_extent; 1121 1064 rbio->flags = flags; 1122 1065 rbio->have_ioref = ca != NULL; 1123 1066 rbio->narrow_crcs = narrow_crcs; 1124 - rbio->hole = 0; 1125 - rbio->retry = 0; 1067 + rbio->ret = 0; 1126 1068 rbio->context = 0; 1127 - /* XXX: only initialize this if needed */ 1128 - rbio->devs_have = bch2_bkey_devs(k); 1129 1069 rbio->pick = pick; 1130 1070 rbio->subvol = orig->subvol; 1131 1071 rbio->read_pos = read_pos; 1132 1072 rbio->data_btree = data_btree; 1133 1073 rbio->data_pos = data_pos; 1134 1074 rbio->version = k.k->bversion; 1135 - rbio->promote = promote; 1136 1075 INIT_WORK(&rbio->work, NULL); 1137 - 1138 - if (flags & BCH_READ_NODECODE) 1139 - orig->pick = pick; 1140 1076 1141 1077 rbio->bio.bi_opf = orig->bio.bi_opf; 1142 1078 rbio->bio.bi_iter.bi_sector = pick.ptr.offset; 1143 1079 rbio->bio.bi_end_io = bch2_read_endio; 1144 1080 1145 1081 if (rbio->bounce) 1146 - trace_and_count(c, read_bounce, &rbio->bio); 1082 + trace_and_count(c, io_read_bounce, &rbio->bio); 1147 1083 1148 - this_cpu_add(c->counters[BCH_COUNTER_io_read], bio_sectors(&rbio->bio)); 1084 + if (!u) 1085 + this_cpu_add(c->counters[BCH_COUNTER_io_read], bio_sectors(&rbio->bio)); 1086 + else 1087 + 
this_cpu_add(c->counters[BCH_COUNTER_io_move_read], bio_sectors(&rbio->bio)); 1149 1088 bch2_increment_clock(c, bio_sectors(&rbio->bio), READ); 1150 1089 1151 1090 /* 1152 1091 * If it's being moved internally, we don't want to flag it as a cache 1153 1092 * hit: 1154 1093 */ 1155 - if (ca && pick.ptr.cached && !(flags & BCH_READ_NODECODE)) 1094 + if (ca && pick.ptr.cached && !u) 1156 1095 bch2_bucket_io_time_reset(trans, pick.ptr.dev, 1157 1096 PTR_BUCKET_NR(ca, &pick.ptr), READ); 1158 1097 1159 - if (!(flags & (BCH_READ_IN_RETRY|BCH_READ_LAST_FRAGMENT))) { 1098 + if (!(flags & (BCH_READ_in_retry|BCH_READ_last_fragment))) { 1160 1099 bio_inc_remaining(&orig->bio); 1161 - trace_and_count(c, read_split, &orig->bio); 1100 + trace_and_count(c, io_read_split, &orig->bio); 1162 1101 } 1163 1102 1164 1103 /* 1165 1104 * Unlock the iterator while the btree node's lock is still in 1166 1105 * cache, before doing the IO: 1167 1106 */ 1168 - if (!(flags & BCH_READ_IN_RETRY)) 1107 + if (!(flags & BCH_READ_in_retry)) 1169 1108 bch2_trans_unlock(trans); 1170 1109 else 1171 1110 bch2_trans_unlock_long(trans); 1172 1111 1173 - if (!rbio->pick.idx) { 1112 + if (likely(!rbio->pick.do_ec_reconstruct)) { 1174 1113 if (unlikely(!rbio->have_ioref)) { 1175 1114 struct printbuf buf = PRINTBUF; 1176 1115 bch2_read_err_msg_trans(trans, &buf, rbio, read_pos); ··· 1173 1126 bch_err_ratelimited(c, "%s", buf.buf); 1174 1127 printbuf_exit(&buf); 1175 1128 1176 - bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 1129 + bch2_rbio_error(rbio, 1130 + -BCH_ERR_data_read_retry_device_offline, 1131 + BLK_STS_IOERR); 1177 1132 goto out; 1178 1133 } 1179 1134 ··· 1184 1135 bio_set_dev(&rbio->bio, ca->disk_sb.bdev); 1185 1136 1186 1137 if (unlikely(c->opts.no_data_io)) { 1187 - if (likely(!(flags & BCH_READ_IN_RETRY))) 1138 + if (likely(!(flags & BCH_READ_in_retry))) 1188 1139 bio_endio(&rbio->bio); 1189 1140 } else { 1190 - if (likely(!(flags & BCH_READ_IN_RETRY))) 1141 + if (likely(!(flags & 
BCH_READ_in_retry))) 1191 1142 submit_bio(&rbio->bio); 1192 1143 else 1193 1144 submit_bio_wait(&rbio->bio); ··· 1201 1152 } else { 1202 1153 /* Attempting reconstruct read: */ 1203 1154 if (bch2_ec_read_extent(trans, rbio, k)) { 1204 - bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR); 1155 + bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_ec_reconstruct_err, 1156 + BLK_STS_IOERR); 1205 1157 goto out; 1206 1158 } 1207 1159 1208 - if (likely(!(flags & BCH_READ_IN_RETRY))) 1160 + if (likely(!(flags & BCH_READ_in_retry))) 1209 1161 bio_endio(&rbio->bio); 1210 1162 } 1211 1163 out: 1212 - if (likely(!(flags & BCH_READ_IN_RETRY))) { 1164 + if (likely(!(flags & BCH_READ_in_retry))) { 1213 1165 return 0; 1214 1166 } else { 1215 1167 bch2_trans_unlock(trans); ··· 1220 1170 rbio->context = RBIO_CONTEXT_UNBOUND; 1221 1171 bch2_read_endio(&rbio->bio); 1222 1172 1223 - ret = rbio->retry; 1173 + ret = rbio->ret; 1224 1174 rbio = bch2_rbio_free(rbio); 1225 1175 1226 - if (ret == READ_RETRY_AVOID) { 1227 - bch2_mark_io_failure(failed, &pick); 1228 - ret = READ_RETRY; 1229 - } 1230 - 1231 - if (!ret) 1232 - goto out_read_done; 1176 + if (bch2_err_matches(ret, BCH_ERR_data_read_retry_avoid)) 1177 + bch2_mark_io_failure(failed, &pick, 1178 + ret == -BCH_ERR_data_read_retry_csum_err); 1233 1179 1234 1180 return ret; 1235 1181 } 1236 1182 1237 1183 err: 1238 - if (flags & BCH_READ_IN_RETRY) 1239 - return READ_ERR; 1184 + if (flags & BCH_READ_in_retry) 1185 + return ret; 1240 1186 1241 - orig->bio.bi_status = BLK_STS_IOERR; 1187 + orig->bio.bi_status = BLK_STS_IOERR; 1188 + orig->ret = ret; 1242 1189 goto out_read_done; 1243 1190 1244 1191 hole: 1192 + this_cpu_add(c->counters[BCH_COUNTER_io_read_hole], 1193 + bvec_iter_sectors(iter)); 1245 1194 /* 1246 - * won't normally happen in the BCH_READ_NODECODE 1247 - * (bch2_move_extent()) path, but if we retry and the extent we wanted 1248 - * to read no longer exists we have to signal that: 1195 + * won't normally happen in the data 
update (bch2_move_extent()) path, 1196 + * but if we retry and the extent we wanted to read no longer exists we 1197 + * have to signal that: 1249 1198 */ 1250 - if (flags & BCH_READ_NODECODE) 1251 - orig->hole = true; 1199 + if (u) 1200 + orig->ret = -BCH_ERR_data_read_key_overwritten; 1252 1201 1253 1202 zero_fill_bio_iter(&orig->bio, iter); 1254 1203 out_read_done: 1255 - if (flags & BCH_READ_LAST_FRAGMENT) 1204 + if ((flags & BCH_READ_last_fragment) && 1205 + !(flags & BCH_READ_in_retry)) 1256 1206 bch2_rbio_done(orig); 1257 1207 return 0; 1258 1208 } 1259 1209 1260 - void __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio, 1261 - struct bvec_iter bvec_iter, subvol_inum inum, 1262 - struct bch_io_failures *failed, unsigned flags) 1210 + int __bch2_read(struct btree_trans *trans, struct bch_read_bio *rbio, 1211 + struct bvec_iter bvec_iter, subvol_inum inum, 1212 + struct bch_io_failures *failed, unsigned flags) 1263 1213 { 1264 - struct btree_trans *trans = bch2_trans_get(c); 1214 + struct bch_fs *c = trans->c; 1265 1215 struct btree_iter iter; 1266 1216 struct bkey_buf sk; 1267 1217 struct bkey_s_c k; 1268 1218 int ret; 1269 1219 1270 - BUG_ON(flags & BCH_READ_NODECODE); 1220 + EBUG_ON(rbio->data_update); 1271 1221 1272 1222 bch2_bkey_buf_init(&sk); 1273 1223 bch2_trans_iter_init(trans, &iter, BTREE_ID_extents, ··· 1317 1267 swap(bvec_iter.bi_size, bytes); 1318 1268 1319 1269 if (bvec_iter.bi_size == bytes) 1320 - flags |= BCH_READ_LAST_FRAGMENT; 1270 + flags |= BCH_READ_last_fragment; 1321 1271 1322 1272 ret = __bch2_read_extent(trans, rbio, bvec_iter, iter.pos, 1323 1273 data_btree, k, 1324 - offset_into_extent, failed, flags); 1274 + offset_into_extent, failed, flags, -1); 1325 1275 if (ret) 1326 1276 goto err; 1327 1277 1328 - if (flags & BCH_READ_LAST_FRAGMENT) 1278 + if (flags & BCH_READ_last_fragment) 1329 1279 break; 1330 1280 1331 1281 swap(bvec_iter.bi_size, bytes); 1332 1282 bio_advance_iter(&rbio->bio, &bvec_iter, bytes); 1333 1283 err: 1284 + 
if (ret == -BCH_ERR_data_read_retry_csum_err_maybe_userspace) 1285 + flags |= BCH_READ_must_bounce; 1286 + 1334 1287 if (ret && 1335 1288 !bch2_err_matches(ret, BCH_ERR_transaction_restart) && 1336 - ret != READ_RETRY && 1337 - ret != READ_RETRY_AVOID) 1289 + !bch2_err_matches(ret, BCH_ERR_data_read_retry)) 1338 1290 break; 1339 1291 } 1340 1292 ··· 1344 1292 1345 1293 if (ret) { 1346 1294 struct printbuf buf = PRINTBUF; 1347 - bch2_inum_offset_err_msg_trans(trans, &buf, inum, bvec_iter.bi_sector << 9); 1348 - prt_printf(&buf, "read error %i from btree lookup", ret); 1295 + lockrestart_do(trans, 1296 + bch2_inum_offset_err_msg_trans(trans, &buf, inum, 1297 + bvec_iter.bi_sector << 9)); 1298 + prt_printf(&buf, "read error: %s", bch2_err_str(ret)); 1349 1299 bch_err_ratelimited(c, "%s", buf.buf); 1350 1300 printbuf_exit(&buf); 1351 1301 1352 - rbio->bio.bi_status = BLK_STS_IOERR; 1353 - bch2_rbio_done(rbio); 1302 + rbio->bio.bi_status = BLK_STS_IOERR; 1303 + rbio->ret = ret; 1304 + 1305 + if (!(flags & BCH_READ_in_retry)) 1306 + bch2_rbio_done(rbio); 1354 1307 } 1355 1308 1356 - bch2_trans_put(trans); 1357 1309 bch2_bkey_buf_exit(&sk, c); 1310 + return ret; 1358 1311 } 1359 1312 1360 1313 void bch2_fs_io_read_exit(struct bch_fs *c)
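Note on the error handling above: the hunks replace the old READ_RETRY / READ_RETRY_AVOID / READ_ERR return values with error codes that are tested by class through bch2_err_matches(). The sketch below illustrates the general idea of class matching using a parent table. It is a simplified userspace illustration, not the bcachefs implementation: every DEMO_* name, the numeric base, and the exact parent/child layout are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical error list: x(parent, name). A parent of 0 means
 * "no parent class". The layout is an assumption for illustration.
 */
#define DEMO_ERRCODES()								\
	x(0,					data_read)			\
	x(DEMO_ERR_data_read,			data_read_retry)		\
	x(DEMO_ERR_data_read_retry,		data_read_retry_avoid)		\
	x(DEMO_ERR_data_read_retry_avoid,	data_read_retry_csum_err)	\
	x(DEMO_ERR_data_read_retry_avoid,	data_read_retry_device_offline)	\
	x(DEMO_ERR_data_read,			data_read_key_overwritten)

/* First expansion: assign consecutive codes above an arbitrary base */
enum demo_errcode {
	DEMO_ERR_START = 2048,
#define x(parent, name)	DEMO_ERR_##name,
	DEMO_ERRCODES()
#undef x
	DEMO_ERR_MAX
};

/* Second expansion: record each code's parent class */
static const int demo_err_parent[] = {
#define x(parent, name)	[DEMO_ERR_##name - DEMO_ERR_START] = parent,
	DEMO_ERRCODES()
#undef x
};

/*
 * Returns true if err (positive or negative) is cls or a descendant
 * of cls. A zero err never matches.
 */
static bool demo_err_matches(int err, int cls)
{
	err = abs(err);
	cls = abs(cls);

	assert(err < DEMO_ERR_MAX);
	assert(cls < DEMO_ERR_MAX);

	if (!err)
		return false;

	/* Walk up the parent chain until we hit cls or run out of parents */
	while (err > DEMO_ERR_START && err != cls)
		err = demo_err_parent[err - DEMO_ERR_START];

	return err == cls;
}
```

With a table like this, a caller can test for "any retryable read error" with a single class check while still comparing against one specific code (as the retry path does for the checksum error case) when it needs finer detail.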

+59 -35

fs/bcachefs/io_read.h

··· 3 3 #define _BCACHEFS_IO_READ_H 4 4 5 5 #include "bkey_buf.h" 6 + #include "btree_iter.h" 6 7 #include "reflink.h" 7 8 8 9 struct bch_read_bio { ··· 36 35 u16 flags; 37 36 union { 38 37 struct { 39 - u16 bounce:1, 38 + u16 data_update:1, 39 + promote:1, 40 + bounce:1, 40 41 split:1, 41 - kmalloc:1, 42 42 have_ioref:1, 43 43 narrow_crcs:1, 44 - hole:1, 45 - retry:2, 44 + saw_error:1, 46 45 context:2; 47 46 }; 48 47 u16 _state; 49 48 }; 50 - 51 - struct bch_devs_list devs_have; 49 + s16 ret; 52 50 53 51 struct extent_ptr_decoded pick; 54 52 ··· 64 64 enum btree_id data_btree; 65 65 struct bpos data_pos; 66 66 struct bversion version; 67 - 68 - struct promote_op *promote; 69 67 70 68 struct bch_io_opts opts; 71 69 ··· 106 108 return 0; 107 109 } 108 110 109 - enum bch_read_flags { 110 - BCH_READ_RETRY_IF_STALE = 1 << 0, 111 - BCH_READ_MAY_PROMOTE = 1 << 1, 112 - BCH_READ_USER_MAPPED = 1 << 2, 113 - BCH_READ_NODECODE = 1 << 3, 114 - BCH_READ_LAST_FRAGMENT = 1 << 4, 111 + #define BCH_READ_FLAGS() \ 112 + x(retry_if_stale) \ 113 + x(may_promote) \ 114 + x(user_mapped) \ 115 + x(last_fragment) \ 116 + x(must_bounce) \ 117 + x(must_clone) \ 118 + x(in_retry) 115 119 116 - /* internal: */ 117 - BCH_READ_MUST_BOUNCE = 1 << 5, 118 - BCH_READ_MUST_CLONE = 1 << 6, 119 - BCH_READ_IN_RETRY = 1 << 7, 120 + enum __bch_read_flags { 121 + #define x(n) __BCH_READ_##n, 122 + BCH_READ_FLAGS() 123 + #undef x 124 + }; 125 + 126 + enum bch_read_flags { 127 + #define x(n) BCH_READ_##n = BIT(__BCH_READ_##n), 128 + BCH_READ_FLAGS() 129 + #undef x 120 130 }; 121 131 122 132 int __bch2_read_extent(struct btree_trans *, struct bch_read_bio *, 123 133 struct bvec_iter, struct bpos, enum btree_id, 124 134 struct bkey_s_c, unsigned, 125 - struct bch_io_failures *, unsigned); 135 + struct bch_io_failures *, unsigned, int); 126 136 127 137 static inline void bch2_read_extent(struct btree_trans *trans, 128 138 struct bch_read_bio *rbio, struct bpos read_pos, ··· 138 132 unsigned 
offset_into_extent, unsigned flags) 139 133 { 140 134 __bch2_read_extent(trans, rbio, rbio->bio.bi_iter, read_pos, 141 - data_btree, k, offset_into_extent, NULL, flags); 135 + data_btree, k, offset_into_extent, NULL, flags, -1); 142 136 } 143 137 144 - void __bch2_read(struct bch_fs *, struct bch_read_bio *, struct bvec_iter, 145 - subvol_inum, struct bch_io_failures *, unsigned flags); 138 + int __bch2_read(struct btree_trans *, struct bch_read_bio *, struct bvec_iter, 139 + subvol_inum, struct bch_io_failures *, unsigned flags); 146 140 147 141 static inline void bch2_read(struct bch_fs *c, struct bch_read_bio *rbio, 148 142 subvol_inum inum) 149 143 { 150 - struct bch_io_failures failed = { .nr = 0 }; 151 - 152 144 BUG_ON(rbio->_state); 153 145 154 - rbio->c = c; 155 - rbio->start_time = local_clock(); 156 146 rbio->subvol = inum.subvol; 157 147 158 - __bch2_read(c, rbio, rbio->bio.bi_iter, inum, &failed, 159 - BCH_READ_RETRY_IF_STALE| 160 - BCH_READ_MAY_PROMOTE| 161 - BCH_READ_USER_MAPPED); 148 + bch2_trans_run(c, 149 + __bch2_read(trans, rbio, rbio->bio.bi_iter, inum, NULL, 150 + BCH_READ_retry_if_stale| 151 + BCH_READ_may_promote| 152 + BCH_READ_user_mapped)); 162 153 } 163 154 164 - static inline struct bch_read_bio *rbio_init(struct bio *bio, 165 - struct bch_io_opts opts) 155 + static inline struct bch_read_bio *rbio_init_fragment(struct bio *bio, 156 + struct bch_read_bio *orig) 166 157 { 167 158 struct bch_read_bio *rbio = to_rbio(bio); 168 159 169 - rbio->_state = 0; 170 - rbio->promote = NULL; 171 - rbio->opts = opts; 160 + rbio->c = orig->c; 161 + rbio->_state = 0; 162 + rbio->flags = 0; 163 + rbio->ret = 0; 164 + rbio->split = true; 165 + rbio->parent = orig; 166 + rbio->opts = orig->opts; 167 + return rbio; 168 + } 169 + 170 + static inline struct bch_read_bio *rbio_init(struct bio *bio, 171 + struct bch_fs *c, 172 + struct bch_io_opts opts, 173 + bio_end_io_t end_io) 174 + { 175 + struct bch_read_bio *rbio = to_rbio(bio); 176 + 177 + 
rbio->start_time = local_clock(); 178 + rbio->c = c; 179 + rbio->_state = 0; 180 + rbio->flags = 0; 181 + rbio->ret = 0; 182 + rbio->opts = opts; 183 + rbio->bio.bi_end_io = end_io; 172 184 return rbio; 173 185 } 174 186
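Note on the flag definitions above: the header now generates enum bch_read_flags from a single BCH_READ_FLAGS() list using the x-macro pattern, expanding the list once for bit indices and once for bit masks. The following is a minimal userspace sketch of that pattern. The DEMO_* names, the DEMO_BIT() stand-in for the kernel's BIT(), and the name table with its string helper are assumptions added for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag list with the same shape as BCH_READ_FLAGS() */
#define DEMO_READ_FLAGS()	\
	x(retry_if_stale)	\
	x(may_promote)		\
	x(user_mapped)		\
	x(last_fragment)	\
	x(must_bounce)		\
	x(must_clone)		\
	x(in_retry)

/* Userspace stand-in for the kernel's BIT() */
#define DEMO_BIT(nr)	(1U << (nr))

/* First expansion: bit indices 0, 1, 2, ... */
enum demo_read_flag_bits {
#define x(n)	DEMO_READ_BIT_##n,
	DEMO_READ_FLAGS()
#undef x
	DEMO_READ_BIT_NR
};

/* Second expansion: one mask per flag, derived from its index */
enum demo_read_flags {
#define x(n)	DEMO_READ_##n = DEMO_BIT(DEMO_READ_BIT_##n),
	DEMO_READ_FLAGS()
#undef x
};

/* Third expansion (an addition for this sketch): flag names as strings */
static const char * const demo_read_flag_names[] = {
#define x(n)	#n,
	DEMO_READ_FLAGS()
#undef x
};

/*
 * Write the names of all set flags into buf, comma separated.
 * Returns the number of characters written, not counting the NUL.
 * Stops early if the next name would not fit.
 */
static size_t demo_flags_to_str(unsigned flags, char *buf, size_t len)
{
	size_t pos = 0;

	assert(buf && len);
	buf[0] = '\0';

	for (unsigned i = 0; i < DEMO_READ_BIT_NR; i++) {
		if (!(flags & DEMO_BIT(i)))
			continue;

		int n = snprintf(buf + pos, len - pos, "%s%s",
				 pos ? "," : "", demo_read_flag_names[i]);
		if (n < 0 || (size_t) n >= len - pos) {
			buf[pos] = '\0';	/* drop the partial name */
			break;
		}
		pos += (size_t) n;
	}
	return pos;
}
```

One benefit of keeping a single list is that adding or removing a flag (as this patch does with the former NODECODE flag) renumbers the remaining bits automatically and keeps any derived tables in step.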

+213 -203

fs/bcachefs/io_write.c

··· 34 34 #include <linux/random.h> 35 35 #include <linux/sched/mm.h> 36 36 37 + #ifdef CONFIG_BCACHEFS_DEBUG 38 + static unsigned bch2_write_corrupt_ratio; 39 + module_param_named(write_corrupt_ratio, bch2_write_corrupt_ratio, uint, 0644); 40 + MODULE_PARM_DESC(write_corrupt_ratio, ""); 41 + #endif 42 + 37 43 #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT 38 44 39 45 static inline void bch2_congested_acct(struct bch_dev *ca, u64 io_latency, ··· 380 374 bch2_extent_update(trans, inum, &iter, sk.k, 381 375 &op->res, 382 376 op->new_i_size, &op->i_sectors_delta, 383 - op->flags & BCH_WRITE_CHECK_ENOSPC); 377 + op->flags & BCH_WRITE_check_enospc); 384 378 bch2_trans_iter_exit(trans, &iter); 385 379 386 380 if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) ··· 402 396 403 397 /* Writes */ 404 398 405 - static void __bch2_write_op_error(struct printbuf *out, struct bch_write_op *op, 406 - u64 offset) 399 + void bch2_write_op_error(struct bch_write_op *op, u64 offset, const char *fmt, ...) 407 400 { 408 - bch2_inum_offset_err_msg(op->c, out, 409 - (subvol_inum) { op->subvol, op->pos.inode, }, 410 - offset << 9); 411 - prt_printf(out, "write error%s: ", 412 - op->flags & BCH_WRITE_MOVE ? 
"(internal move)" : ""); 401 + struct printbuf buf = PRINTBUF; 402 + 403 + if (op->subvol) { 404 + bch2_inum_offset_err_msg(op->c, &buf, 405 + (subvol_inum) { op->subvol, op->pos.inode, }, 406 + offset << 9); 407 + } else { 408 + struct bpos pos = op->pos; 409 + pos.offset = offset; 410 + bch2_inum_snap_offset_err_msg(op->c, &buf, pos); 411 + } 412 + 413 + prt_str(&buf, "write error: "); 414 + 415 + va_list args; 416 + va_start(args, fmt); 417 + prt_vprintf(&buf, fmt, args); 418 + va_end(args); 419 + 420 + if (op->flags & BCH_WRITE_move) { 421 + struct data_update *u = container_of(op, struct data_update, op); 422 + 423 + prt_printf(&buf, "\n from internal move "); 424 + bch2_bkey_val_to_text(&buf, op->c, bkey_i_to_s_c(u->k.k)); 425 + } 426 + 427 + bch_err_ratelimited(op->c, "%s", buf.buf); 428 + printbuf_exit(&buf); 413 429 } 414 430 415 - void bch2_write_op_error(struct printbuf *out, struct bch_write_op *op) 431 + static void bch2_write_csum_err_msg(struct bch_write_op *op) 416 432 { 417 - __bch2_write_op_error(out, op, op->pos.offset); 418 - } 419 - 420 - static void bch2_write_op_error_trans(struct btree_trans *trans, struct printbuf *out, 421 - struct bch_write_op *op, u64 offset) 422 - { 423 - bch2_inum_offset_err_msg_trans(trans, out, 424 - (subvol_inum) { op->subvol, op->pos.inode, }, 425 - offset << 9); 426 - prt_printf(out, "write error%s: ", 427 - op->flags & BCH_WRITE_MOVE ? 
"(internal move)" : ""); 433 + bch2_write_op_error(op, op->pos.offset, 434 + "error verifying existing checksum while rewriting existing data (memory corruption?)"); 428 435 } 429 436 430 437 void bch2_submit_wbio_replicas(struct bch_write_bio *wbio, struct bch_fs *c, ··· 512 493 bch2_time_stats_update(&c->times[BCH_TIME_data_write], op->start_time); 513 494 bch2_disk_reservation_put(c, &op->res); 514 495 515 - if (!(op->flags & BCH_WRITE_MOVE)) 496 + if (!(op->flags & BCH_WRITE_move)) 516 497 bch2_write_ref_put(c, BCH_WRITE_REF_write); 517 498 bch2_keylist_free(&op->insert_keys, op->inline_keys); 518 499 ··· 535 516 test_bit(ptr->dev, op->failed.d)); 536 517 537 518 if (!bch2_bkey_nr_ptrs(bkey_i_to_s_c(src))) 538 - return -EIO; 519 + return -BCH_ERR_data_write_io; 539 520 } 540 521 541 522 if (dst != src) ··· 558 539 unsigned dev; 559 540 int ret = 0; 560 541 561 - if (unlikely(op->flags & BCH_WRITE_IO_ERROR)) { 542 + if (unlikely(op->flags & BCH_WRITE_io_error)) { 562 543 ret = bch2_write_drop_io_error_ptrs(op); 563 544 if (ret) 564 545 goto err; ··· 567 548 if (!bch2_keylist_empty(keys)) { 568 549 u64 sectors_start = keylist_sectors(keys); 569 550 570 - ret = !(op->flags & BCH_WRITE_MOVE) 551 + ret = !(op->flags & BCH_WRITE_move) 571 552 ? 
bch2_write_index_default(op) 572 553 : bch2_data_update_index_update(op); 573 554 ··· 579 560 if (unlikely(ret && !bch2_err_matches(ret, EROFS))) { 580 561 struct bkey_i *insert = bch2_keylist_front(&op->insert_keys); 581 562 582 - struct printbuf buf = PRINTBUF; 583 - __bch2_write_op_error(&buf, op, bkey_start_offset(&insert->k)); 584 - prt_printf(&buf, "btree update error: %s", bch2_err_str(ret)); 585 - bch_err_ratelimited(c, "%s", buf.buf); 586 - printbuf_exit(&buf); 563 + bch2_write_op_error(op, bkey_start_offset(&insert->k), 564 + "btree update error: %s", bch2_err_str(ret)); 587 565 } 588 566 589 567 if (ret) ··· 589 573 out: 590 574 /* If some a bucket wasn't written, we can't erasure code it: */ 591 575 for_each_set_bit(dev, op->failed.d, BCH_SB_MEMBERS_MAX) 592 - bch2_open_bucket_write_error(c, &op->open_buckets, dev); 576 + bch2_open_bucket_write_error(c, &op->open_buckets, dev, -BCH_ERR_data_write_io); 593 577 594 578 bch2_open_buckets_put(c, &op->open_buckets); 595 579 return; 596 580 err: 597 581 keys->top = keys->keys; 598 582 op->error = ret; 599 - op->flags |= BCH_WRITE_SUBMITTED; 583 + op->flags |= BCH_WRITE_submitted; 600 584 goto out; 601 585 } 602 586 603 587 static inline void __wp_update_state(struct write_point *wp, enum write_point_state state) 604 588 { 605 589 if (state != wp->state) { 590 + struct task_struct *p = current; 606 591 u64 now = ktime_get_ns(); 592 + u64 runtime = p->se.sum_exec_runtime + 593 + (now - p->se.exec_start); 594 + 595 + if (state == WRITE_POINT_runnable) 596 + wp->last_runtime = runtime; 597 + else if (wp->state == WRITE_POINT_runnable) 598 + wp->time[WRITE_POINT_running] += runtime - wp->last_runtime; 607 599 608 600 if (wp->last_state_change && 609 601 time_after64(now, wp->last_state_change)) ··· 625 601 { 626 602 enum write_point_state state; 627 603 628 - state = running ? WRITE_POINT_running : 604 + state = running ? WRITE_POINT_runnable: 629 605 !list_empty(&wp->writes) ? 
WRITE_POINT_waiting_io 630 606 : WRITE_POINT_stopped; 631 607 ··· 639 615 struct workqueue_struct *wq = index_update_wq(op); 640 616 unsigned long flags; 641 617 642 - if ((op->flags & BCH_WRITE_SUBMITTED) && 643 - (op->flags & BCH_WRITE_MOVE)) 618 + if ((op->flags & BCH_WRITE_submitted) && 619 + (op->flags & BCH_WRITE_move)) 644 620 bch2_bio_free_pages_pool(op->c, &op->wbio.bio); 645 621 646 622 spin_lock_irqsave(&wp->writes_lock, flags); ··· 678 654 if (!op) 679 655 break; 680 656 681 - op->flags |= BCH_WRITE_IN_WORKER; 657 + op->flags |= BCH_WRITE_in_worker; 682 658 683 659 __bch2_write_index(op); 684 660 685 - if (!(op->flags & BCH_WRITE_SUBMITTED)) 661 + if (!(op->flags & BCH_WRITE_submitted)) 686 662 __bch2_write(op); 687 663 else 688 664 bch2_write_done(&op->cl); ··· 700 676 ? bch2_dev_have_ref(c, wbio->dev) 701 677 : NULL; 702 678 703 - if (bch2_dev_inum_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write, 679 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write, 680 + wbio->submit_time, !bio->bi_status); 681 + 682 + if (bio->bi_status) { 683 + bch_err_inum_offset_ratelimited(ca, 704 684 op->pos.inode, 705 685 wbio->inode_offset << 9, 706 686 "data write error: %s", 707 - bch2_blk_status_to_str(bio->bi_status))) { 687 + bch2_blk_status_to_str(bio->bi_status)); 708 688 set_bit(wbio->dev, op->failed.d); 709 - op->flags |= BCH_WRITE_IO_ERROR; 689 + op->flags |= BCH_WRITE_io_error; 710 690 } 711 691 712 692 if (wbio->nocow) { ··· 720 692 set_bit(wbio->dev, op->devs_need_flush->d); 721 693 } 722 694 723 - if (wbio->have_ioref) { 724 - bch2_latency_acct(ca, wbio->submit_time, WRITE); 695 + if (wbio->have_ioref) 725 696 percpu_ref_put(&ca->io_ref); 726 - } 727 697 728 698 if (wbio->bounce) 729 699 bch2_bio_free_pages_pool(c, bio); ··· 755 729 bch2_extent_crc_append(&e->k_i, crc); 756 730 757 731 bch2_alloc_sectors_append_ptrs_inlined(op->c, wp, &e->k_i, crc.compressed_size, 758 - op->flags & BCH_WRITE_CACHED); 732 + op->flags & BCH_WRITE_cached); 759 733 760 
734 bch2_keylist_push(&op->insert_keys); 761 735 } ··· 815 789 { 816 790 struct bio *bio = &op->wbio.bio; 817 791 struct bch_extent_crc_unpacked new_crc; 818 - int ret; 819 792 820 793 /* bch2_rechecksum_bio() can't encrypt or decrypt data: */ 821 794 ··· 822 797 bch2_csum_type_is_encryption(new_csum_type)) 823 798 new_csum_type = op->crc.csum_type; 824 799 825 - ret = bch2_rechecksum_bio(c, bio, op->version, op->crc, 826 - NULL, &new_crc, 827 - op->crc.offset, op->crc.live_size, 828 - new_csum_type); 800 + int ret = bch2_rechecksum_bio(c, bio, op->version, op->crc, 801 + NULL, &new_crc, 802 + op->crc.offset, op->crc.live_size, 803 + new_csum_type); 829 804 if (ret) 830 805 return ret; 831 806 ··· 835 810 return 0; 836 811 } 837 812 838 - static int bch2_write_decrypt(struct bch_write_op *op) 839 - { 840 - struct bch_fs *c = op->c; 841 - struct nonce nonce = extent_nonce(op->version, op->crc); 842 - struct bch_csum csum; 843 - int ret; 844 - 845 - if (!bch2_csum_type_is_encryption(op->crc.csum_type)) 846 - return 0; 847 - 848 - /* 849 - * If we need to decrypt data in the write path, we'll no longer be able 850 - * to verify the existing checksum (poly1305 mac, in this case) after 851 - * it's decrypted - this is the last point we'll be able to reverify the 852 - * checksum: 853 - */ 854 - csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, &op->wbio.bio); 855 - if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io) 856 - return -EIO; 857 - 858 - ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, &op->wbio.bio); 859 - op->crc.csum_type = 0; 860 - op->crc.csum = (struct bch_csum) { 0, 0 }; 861 - return ret; 862 - } 863 - 864 - static enum prep_encoded_ret { 865 - PREP_ENCODED_OK, 866 - PREP_ENCODED_ERR, 867 - PREP_ENCODED_CHECKSUM_ERR, 868 - PREP_ENCODED_DO_WRITE, 869 - } bch2_write_prep_encoded_data(struct bch_write_op *op, struct write_point *wp) 813 + static noinline int bch2_write_prep_encoded_data(struct bch_write_op *op, struct write_point *wp) 870 
814 { 871 815 struct bch_fs *c = op->c; 872 816 struct bio *bio = &op->wbio.bio; 873 - 874 - if (!(op->flags & BCH_WRITE_DATA_ENCODED)) 875 - return PREP_ENCODED_OK; 817 + struct nonce nonce = extent_nonce(op->version, op->crc); 818 + int ret = 0; 876 819 877 820 BUG_ON(bio_sectors(bio) != op->crc.compressed_size); 878 821 ··· 851 858 (op->crc.compression_type == bch2_compression_opt_to_type(op->compression_opt) || 852 859 op->incompressible)) { 853 860 if (!crc_is_compressed(op->crc) && 854 - op->csum_type != op->crc.csum_type && 855 - bch2_write_rechecksum(c, op, op->csum_type) && 856 - !c->opts.no_data_io) 857 - return PREP_ENCODED_CHECKSUM_ERR; 861 + op->csum_type != op->crc.csum_type) { 862 + ret = bch2_write_rechecksum(c, op, op->csum_type); 863 + if (ret) 864 + return ret; 865 + } 858 866 859 - return PREP_ENCODED_DO_WRITE; 867 + return 1; 860 868 } 861 869 862 870 /* ··· 865 871 * is, we have to decompress it: 866 872 */ 867 873 if (crc_is_compressed(op->crc)) { 868 - struct bch_csum csum; 869 - 870 - if (bch2_write_decrypt(op)) 871 - return PREP_ENCODED_CHECKSUM_ERR; 872 - 873 874 /* Last point we can still verify checksum: */ 874 - csum = bch2_checksum_bio(c, op->crc.csum_type, 875 - extent_nonce(op->version, op->crc), 876 - bio); 875 + struct bch_csum csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, bio); 877 876 if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io) 878 - return PREP_ENCODED_CHECKSUM_ERR; 877 + goto csum_err; 879 878 880 - if (bch2_bio_uncompress_inplace(op, bio)) 881 - return PREP_ENCODED_ERR; 879 + if (bch2_csum_type_is_encryption(op->crc.csum_type)) { 880 + ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, bio); 881 + if (ret) 882 + return ret; 883 + 884 + op->crc.csum_type = 0; 885 + op->crc.csum = (struct bch_csum) { 0, 0 }; 886 + } 887 + 888 + ret = bch2_bio_uncompress_inplace(op, bio); 889 + if (ret) 890 + return ret; 882 891 } 883 892 884 893 /* ··· 893 896 * If the data is checksummed and we're only writing a 
subset, 894 897 * rechecksum and adjust bio to point to currently live data: 895 898 */ 896 - if ((op->crc.live_size != op->crc.uncompressed_size || 897 - op->crc.csum_type != op->csum_type) && 898 - bch2_write_rechecksum(c, op, op->csum_type) && 899 - !c->opts.no_data_io) 900 - return PREP_ENCODED_CHECKSUM_ERR; 899 + if (op->crc.live_size != op->crc.uncompressed_size || 900 + op->crc.csum_type != op->csum_type) { 901 + ret = bch2_write_rechecksum(c, op, op->csum_type); 902 + if (ret) 903 + return ret; 904 + } 901 905 902 906 /* 903 907 * If we want to compress the data, it has to be decrypted: 904 908 */ 905 - if ((op->compression_opt || 906 - bch2_csum_type_is_encryption(op->crc.csum_type) != 907 - bch2_csum_type_is_encryption(op->csum_type)) && 908 - bch2_write_decrypt(op)) 909 - return PREP_ENCODED_CHECKSUM_ERR; 909 + if (bch2_csum_type_is_encryption(op->crc.csum_type) && 910 + (op->compression_opt || op->crc.csum_type != op->csum_type)) { 911 + struct bch_csum csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, bio); 912 + if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io) 913 + goto csum_err; 910 914 911 - return PREP_ENCODED_OK; 915 + ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, bio); 916 + if (ret) 917 + return ret; 918 + 919 + op->crc.csum_type = 0; 920 + op->crc.csum = (struct bch_csum) { 0, 0 }; 921 + } 922 + 923 + return 0; 924 + csum_err: 925 + bch2_write_csum_err_msg(op); 926 + return -BCH_ERR_data_write_csum; 912 927 } 913 928 914 929 static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp, ··· 939 930 940 931 ec_buf = bch2_writepoint_ec_buf(c, wp); 941 932 942 - switch (bch2_write_prep_encoded_data(op, wp)) { 943 - case PREP_ENCODED_OK: 944 - break; 945 - case PREP_ENCODED_ERR: 946 - ret = -EIO; 947 - goto err; 948 - case PREP_ENCODED_CHECKSUM_ERR: 949 - goto csum_err; 950 - case PREP_ENCODED_DO_WRITE: 951 - /* XXX look for bug here */ 952 - if (ec_buf) { 953 - dst = bch2_write_bio_alloc(c, wp, src, 954 - 
&page_alloc_failed, 955 - ec_buf); 956 - bio_copy_data(dst, src); 957 - bounce = true; 933 + if (unlikely(op->flags & BCH_WRITE_data_encoded)) { 934 + ret = bch2_write_prep_encoded_data(op, wp); 935 + if (ret < 0) 936 + goto err; 937 + if (ret) { 938 + if (ec_buf) { 939 + dst = bch2_write_bio_alloc(c, wp, src, 940 + &page_alloc_failed, 941 + ec_buf); 942 + bio_copy_data(dst, src); 943 + bounce = true; 944 + } 945 + init_append_extent(op, wp, op->version, op->crc); 946 + goto do_write; 958 947 } 959 - init_append_extent(op, wp, op->version, op->crc); 960 - goto do_write; 961 948 } 962 949 963 950 if (ec_buf || 964 951 op->compression_opt || 965 952 (op->csum_type && 966 - !(op->flags & BCH_WRITE_PAGES_STABLE)) || 953 + !(op->flags & BCH_WRITE_pages_stable)) || 967 954 (bch2_csum_type_is_encryption(op->csum_type) && 968 - !(op->flags & BCH_WRITE_PAGES_OWNED))) { 955 + !(op->flags & BCH_WRITE_pages_owned))) { 969 956 dst = bch2_write_bio_alloc(c, wp, src, 970 957 &page_alloc_failed, 971 958 ec_buf); 972 959 bounce = true; 973 960 } 974 961 962 + #ifdef CONFIG_BCACHEFS_DEBUG 963 + unsigned write_corrupt_ratio = READ_ONCE(bch2_write_corrupt_ratio); 964 + if (!bounce && write_corrupt_ratio) { 965 + dst = bch2_write_bio_alloc(c, wp, src, 966 + &page_alloc_failed, 967 + ec_buf); 968 + bounce = true; 969 + } 970 + #endif 975 971 saved_iter = dst->bi_iter; 976 972 977 973 do { ··· 990 976 break; 991 977 992 978 BUG_ON(op->compression_opt && 993 - (op->flags & BCH_WRITE_DATA_ENCODED) && 979 + (op->flags & BCH_WRITE_data_encoded) && 994 980 bch2_csum_type_is_encryption(op->crc.csum_type)); 995 981 BUG_ON(op->compression_opt && !bounce); 996 982 ··· 1028 1014 } 1029 1015 } 1030 1016 1031 - if ((op->flags & BCH_WRITE_DATA_ENCODED) && 1017 + if ((op->flags & BCH_WRITE_data_encoded) && 1032 1018 !crc_is_compressed(crc) && 1033 1019 bch2_csum_type_is_encryption(op->crc.csum_type) == 1034 1020 bch2_csum_type_is_encryption(op->csum_type)) { ··· 1060 1046 crc.compression_type = 
compression_type; 1061 1047 crc.nonce = nonce; 1062 1048 } else { 1063 - if ((op->flags & BCH_WRITE_DATA_ENCODED) && 1049 + if ((op->flags & BCH_WRITE_data_encoded) && 1064 1050 bch2_rechecksum_bio(c, src, version, op->crc, 1065 1051 NULL, &op->crc, 1066 1052 src_len >> 9, ··· 1085 1071 } 1086 1072 1087 1073 init_append_extent(op, wp, version, crc); 1074 + 1075 + #ifdef CONFIG_BCACHEFS_DEBUG 1076 + if (write_corrupt_ratio) { 1077 + swap(dst->bi_iter.bi_size, dst_len); 1078 + bch2_maybe_corrupt_bio(dst, write_corrupt_ratio); 1079 + swap(dst->bi_iter.bi_size, dst_len); 1080 + } 1081 + #endif 1088 1082 1089 1083 if (dst != src) 1090 1084 bio_advance(dst, dst_len); ··· 1126 1104 *_dst = dst; 1127 1105 return more; 1128 1106 csum_err: 1129 - { 1130 - struct printbuf buf = PRINTBUF; 1131 - bch2_write_op_error(&buf, op); 1132 - prt_printf(&buf, "error verifying existing checksum while rewriting existing data (memory corruption?)"); 1133 - bch_err_ratelimited(c, "%s", buf.buf); 1134 - printbuf_exit(&buf); 1135 - } 1136 - 1137 - ret = -EIO; 1107 + bch2_write_csum_err_msg(op); 1108 + ret = -BCH_ERR_data_write_csum; 1138 1109 err: 1139 1110 if (to_wbio(dst)->bounce) 1140 1111 bch2_bio_free_pages_pool(c, dst); ··· 1205 1190 { 1206 1191 struct bch_fs *c = op->c; 1207 1192 struct btree_trans *trans = bch2_trans_get(c); 1193 + int ret = 0; 1208 1194 1209 1195 for_each_keylist_key(&op->insert_keys, orig) { 1210 - int ret = for_each_btree_key_max_commit(trans, iter, BTREE_ID_extents, 1196 + ret = for_each_btree_key_max_commit(trans, iter, BTREE_ID_extents, 1211 1197 bkey_start_pos(&orig->k), orig->k.p, 1212 1198 BTREE_ITER_intent, k, 1213 1199 NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({ 1214 1200 bch2_nocow_write_convert_one_unwritten(trans, &iter, orig, k, op->new_i_size); 1215 1201 })); 1216 - 1217 - if (ret && !bch2_err_matches(ret, EROFS)) { 1218 - struct bkey_i *insert = bch2_keylist_front(&op->insert_keys); 1219 - 1220 - struct printbuf buf = PRINTBUF; 1221 - 
bch2_write_op_error_trans(trans, &buf, op, bkey_start_offset(&insert->k)); 1222 - prt_printf(&buf, "btree update error: %s", bch2_err_str(ret)); 1223 - bch_err_ratelimited(c, "%s", buf.buf); 1224 - printbuf_exit(&buf); 1225 - } 1226 - 1227 - if (ret) { 1228 - op->error = ret; 1202 + if (ret) 1229 1203 break; 1230 - } 1231 1204 } 1232 1205 1233 1206 bch2_trans_put(trans); 1207 + 1208 + if (ret && !bch2_err_matches(ret, EROFS)) { 1209 + struct bkey_i *insert = bch2_keylist_front(&op->insert_keys); 1210 + bch2_write_op_error(op, bkey_start_offset(&insert->k), 1211 + "btree update error: %s", bch2_err_str(ret)); 1212 + } 1213 + 1214 + if (ret) 1215 + op->error = ret; 1234 1216 } 1235 1217 1236 1218 static void __bch2_nocow_write_done(struct bch_write_op *op) 1237 1219 { 1238 - if (unlikely(op->flags & BCH_WRITE_IO_ERROR)) { 1239 - op->error = -EIO; 1240 - } else if (unlikely(op->flags & BCH_WRITE_CONVERT_UNWRITTEN)) 1220 + if (unlikely(op->flags & BCH_WRITE_io_error)) { 1221 + op->error = -BCH_ERR_data_write_io; 1222 + } else if (unlikely(op->flags & BCH_WRITE_convert_unwritten)) 1241 1223 bch2_nocow_write_convert_unwritten(op); 1242 1224 } 1243 1225 ··· 1263 1251 struct bucket_to_lock *stale_at; 1264 1252 int stale, ret; 1265 1253 1266 - if (op->flags & BCH_WRITE_MOVE) 1254 + if (op->flags & BCH_WRITE_move) 1267 1255 return; 1268 1256 1269 1257 darray_init(&buckets); ··· 1321 1309 }), GFP_KERNEL|__GFP_NOFAIL); 1322 1310 1323 1311 if (ptr->unwritten) 1324 - op->flags |= BCH_WRITE_CONVERT_UNWRITTEN; 1312 + op->flags |= BCH_WRITE_convert_unwritten; 1325 1313 } 1326 1314 1327 1315 /* Unlock before taking nocow locks, doing IO: */ ··· 1329 1317 bch2_trans_unlock(trans); 1330 1318 1331 1319 bch2_cut_front(op->pos, op->insert_keys.top); 1332 - if (op->flags & BCH_WRITE_CONVERT_UNWRITTEN) 1320 + if (op->flags & BCH_WRITE_convert_unwritten) 1333 1321 bch2_cut_back(POS(op->pos.inode, op->pos.offset + bio_sectors(bio)), op->insert_keys.top); 1334 1322 1335 1323 
darray_for_each(buckets, i) { ··· 1354 1342 wbio_init(bio)->put_bio = true; 1355 1343 bio->bi_opf = op->wbio.bio.bi_opf; 1356 1344 } else { 1357 - op->flags |= BCH_WRITE_SUBMITTED; 1345 + op->flags |= BCH_WRITE_submitted; 1358 1346 } 1359 1347 1360 1348 op->pos.offset += bio_sectors(bio); ··· 1364 1352 bio->bi_private = &op->cl; 1365 1353 bio->bi_opf |= REQ_OP_WRITE; 1366 1354 closure_get(&op->cl); 1355 + 1367 1356 bch2_submit_wbio_replicas(to_wbio(bio), c, BCH_DATA_user, 1368 1357 op->insert_keys.top, true); 1369 1358 1370 1359 bch2_keylist_push(&op->insert_keys); 1371 - if (op->flags & BCH_WRITE_SUBMITTED) 1360 + if (op->flags & BCH_WRITE_submitted) 1372 1361 break; 1373 1362 bch2_btree_iter_advance(&iter); 1374 1363 } ··· 1383 1370 darray_exit(&buckets); 1384 1371 1385 1372 if (ret) { 1386 - struct printbuf buf = PRINTBUF; 1387 - bch2_write_op_error(&buf, op); 1388 - prt_printf(&buf, "%s(): btree lookup error: %s", __func__, bch2_err_str(ret)); 1389 - bch_err_ratelimited(c, "%s", buf.buf); 1390 - printbuf_exit(&buf); 1373 + bch2_write_op_error(op, op->pos.offset, 1374 + "%s(): btree lookup error: %s", __func__, bch2_err_str(ret)); 1391 1375 op->error = ret; 1392 - op->flags |= BCH_WRITE_SUBMITTED; 1376 + op->flags |= BCH_WRITE_submitted; 1393 1377 } 1394 1378 1395 1379 /* fallback to cow write path? 
*/ 1396 - if (!(op->flags & BCH_WRITE_SUBMITTED)) { 1380 + if (!(op->flags & BCH_WRITE_submitted)) { 1397 1381 closure_sync(&op->cl); 1398 1382 __bch2_nocow_write_done(op); 1399 1383 op->insert_keys.top = op->insert_keys.keys; 1400 - } else if (op->flags & BCH_WRITE_SYNC) { 1384 + } else if (op->flags & BCH_WRITE_sync) { 1401 1385 closure_sync(&op->cl); 1402 1386 bch2_nocow_write_done(&op->cl.work); 1403 1387 } else { ··· 1424 1414 "pointer to invalid bucket in nocow path on device %llu\n %s", 1425 1415 stale_at->b.inode, 1426 1416 (bch2_bkey_val_to_text(&buf, c, k), buf.buf))) { 1427 - ret = -EIO; 1417 + ret = -BCH_ERR_data_write_invalid_ptr; 1428 1418 } else { 1429 1419 /* We can retry this: */ 1430 1420 ret = -BCH_ERR_transaction_restart; ··· 1446 1436 1447 1437 if (unlikely(op->opts.nocow && c->opts.nocow_enabled)) { 1448 1438 bch2_nocow_write(op); 1449 - if (op->flags & BCH_WRITE_SUBMITTED) 1439 + if (op->flags & BCH_WRITE_submitted) 1450 1440 goto out_nofs_restore; 1451 1441 } 1452 1442 again: ··· 1476 1466 ret = bch2_trans_run(c, lockrestart_do(trans, 1477 1467 bch2_alloc_sectors_start_trans(trans, 1478 1468 op->target, 1479 - op->opts.erasure_code && !(op->flags & BCH_WRITE_CACHED), 1469 + op->opts.erasure_code && !(op->flags & BCH_WRITE_cached), 1480 1470 op->write_point, 1481 1471 &op->devs_have, 1482 1472 op->nr_replicas, ··· 1499 1489 bch2_alloc_sectors_done_inlined(c, wp); 1500 1490 err: 1501 1491 if (ret <= 0) { 1502 - op->flags |= BCH_WRITE_SUBMITTED; 1492 + op->flags |= BCH_WRITE_submitted; 1503 1493 1504 1494 if (unlikely(ret < 0)) { 1505 - if (!(op->flags & BCH_WRITE_ALLOC_NOWAIT)) { 1506 - struct printbuf buf = PRINTBUF; 1507 - bch2_write_op_error(&buf, op); 1508 - prt_printf(&buf, "%s(): %s", __func__, bch2_err_str(ret)); 1509 - bch_err_ratelimited(c, "%s", buf.buf); 1510 - printbuf_exit(&buf); 1511 - } 1495 + if (!(op->flags & BCH_WRITE_alloc_nowait)) 1496 + bch2_write_op_error(op, op->pos.offset, 1497 + "%s(): %s", __func__, 
bch2_err_str(ret)); 1512 1498 op->error = ret; 1513 1499 break; 1514 1500 } ··· 1530 1524 * synchronously here if we weren't able to submit all of the IO at 1531 1525 * once, as that signals backpressure to the caller. 1532 1526 */ 1533 - if ((op->flags & BCH_WRITE_SYNC) || 1534 - (!(op->flags & BCH_WRITE_SUBMITTED) && 1535 - !(op->flags & BCH_WRITE_IN_WORKER))) { 1527 + if ((op->flags & BCH_WRITE_sync) || 1528 + (!(op->flags & BCH_WRITE_submitted) && 1529 + !(op->flags & BCH_WRITE_in_worker))) { 1536 1530 bch2_wait_on_allocator(c, &op->cl); 1537 1531 1538 1532 __bch2_write_index(op); 1539 1533 1540 - if (!(op->flags & BCH_WRITE_SUBMITTED)) 1534 + if (!(op->flags & BCH_WRITE_submitted)) 1541 1535 goto again; 1542 1536 bch2_write_done(&op->cl); 1543 1537 } else { ··· 1558 1552 1559 1553 memset(&op->failed, 0, sizeof(op->failed)); 1560 1554 1561 - op->flags |= BCH_WRITE_WROTE_DATA_INLINE; 1562 - op->flags |= BCH_WRITE_SUBMITTED; 1555 + op->flags |= BCH_WRITE_wrote_data_inline; 1556 + op->flags |= BCH_WRITE_submitted; 1563 1557 1564 1558 bch2_check_set_feature(op->c, BCH_FEATURE_inline_data); 1565 1559 ··· 1622 1616 BUG_ON(!op->write_point.v); 1623 1617 BUG_ON(bkey_eq(op->pos, POS_MAX)); 1624 1618 1625 - if (op->flags & BCH_WRITE_ONLY_SPECIFIED_DEVS) 1626 - op->flags |= BCH_WRITE_ALLOC_NOWAIT; 1619 + if (op->flags & BCH_WRITE_only_specified_devs) 1620 + op->flags |= BCH_WRITE_alloc_nowait; 1627 1621 1628 1622 op->nr_replicas_required = min_t(unsigned, op->nr_replicas_required, op->nr_replicas); 1629 1623 op->start_time = local_clock(); ··· 1631 1625 wbio_init(bio)->put_bio = false; 1632 1626 1633 1627 if (unlikely(bio->bi_iter.bi_size & (c->opts.block_size - 1))) { 1634 - struct printbuf buf = PRINTBUF; 1635 - bch2_write_op_error(&buf, op); 1636 - prt_printf(&buf, "misaligned write"); 1637 - printbuf_exit(&buf); 1638 - op->error = -EIO; 1628 + bch2_write_op_error(op, op->pos.offset, "misaligned write"); 1629 + op->error = -BCH_ERR_data_write_misaligned; 1639 1630 goto 
err; 1640 1631 } 1641 1632 ··· 1641 1638 goto err; 1642 1639 } 1643 1640 1644 - if (!(op->flags & BCH_WRITE_MOVE) && 1641 + if (!(op->flags & BCH_WRITE_move) && 1645 1642 !bch2_write_ref_tryget(c, BCH_WRITE_REF_write)) { 1646 1643 op->error = -BCH_ERR_erofs_no_writes; 1647 1644 goto err; 1648 1645 } 1649 1646 1650 - this_cpu_add(c->counters[BCH_COUNTER_io_write], bio_sectors(bio)); 1647 + if (!(op->flags & BCH_WRITE_move)) 1648 + this_cpu_add(c->counters[BCH_COUNTER_io_write], bio_sectors(bio)); 1651 1649 bch2_increment_clock(c, bio_sectors(bio), WRITE); 1652 1650 1653 1651 data_len = min_t(u64, bio->bi_iter.bi_size, ··· 1679 1675 1680 1676 void bch2_write_op_to_text(struct printbuf *out, struct bch_write_op *op) 1681 1677 { 1682 - prt_str(out, "pos: "); 1678 + if (!out->nr_tabstops) 1679 + printbuf_tabstop_push(out, 32); 1680 + 1681 + prt_printf(out, "pos:\t"); 1683 1682 bch2_bpos_to_text(out, op->pos); 1684 1683 prt_newline(out); 1685 1684 printbuf_indent_add(out, 2); 1686 1685 1687 - prt_str(out, "started: "); 1686 + prt_printf(out, "started:\t"); 1688 1687 bch2_pr_time_units(out, local_clock() - op->start_time); 1689 1688 prt_newline(out); 1690 1689 1691 - prt_str(out, "flags: "); 1690 + prt_printf(out, "flags:\t"); 1692 1691 prt_bitflags(out, bch2_write_flags, op->flags); 1693 1692 prt_newline(out); 1694 1693 1695 - prt_printf(out, "ref: %u\n", closure_nr_remaining(&op->cl)); 1694 + prt_printf(out, "nr_replicas:\t%u\n", op->nr_replicas); 1695 + prt_printf(out, "nr_replicas_required:\t%u\n", op->nr_replicas_required); 1696 + 1697 + prt_printf(out, "ref:\t%u\n", closure_nr_remaining(&op->cl)); 1696 1698 1697 1699 printbuf_indent_sub(out, 2); 1698 1700 }

+16 -22

fs/bcachefs/io_write.h

···
 void bch2_bio_free_pages_pool(struct bch_fs *, struct bio *);
 void bch2_bio_alloc_pages_pool(struct bch_fs *, struct bio *, size_t);
 
-#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
-void bch2_latency_acct(struct bch_dev *, u64, int);
-#else
-static inline void bch2_latency_acct(struct bch_dev *ca, u64 submit_time, int rw) {}
-#endif
-
 void bch2_submit_wbio_replicas(struct bch_write_bio *, struct bch_fs *,
 			       enum bch_data_type, const struct bkey_i *, bool);
 
-void bch2_write_op_error(struct printbuf *out, struct bch_write_op *op);
+__printf(3, 4)
+void bch2_write_op_error(struct bch_write_op *op, u64, const char *, ...);
 
 #define BCH_WRITE_FLAGS()		\
-	x(ALLOC_NOWAIT)			\
-	x(CACHED)			\
-	x(DATA_ENCODED)			\
-	x(PAGES_STABLE)			\
-	x(PAGES_OWNED)			\
-	x(ONLY_SPECIFIED_DEVS)		\
-	x(WROTE_DATA_INLINE)		\
-	x(FROM_INTERNAL)		\
-	x(CHECK_ENOSPC)			\
-	x(SYNC)				\
-	x(MOVE)				\
-	x(IN_WORKER)			\
-	x(SUBMITTED)			\
-	x(IO_ERROR)			\
-	x(CONVERT_UNWRITTEN)
+	x(alloc_nowait)			\
+	x(cached)			\
+	x(data_encoded)			\
+	x(pages_stable)			\
+	x(pages_owned)			\
+	x(only_specified_devs)		\
+	x(wrote_data_inline)		\
+	x(check_enospc)			\
+	x(sync)				\
+	x(move)				\
+	x(in_worker)			\
+	x(submitted)			\
+	x(io_error)			\
+	x(convert_unwritten)
 
 enum __bch_write_flags {
 #define x(f)	__BCH_WRITE_##f,
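The BCH_WRITE_FLAGS() list in the io_write.h hunk above is an X-macro: a single list that is expanded several times with different definitions of x(). The following is a minimal standalone sketch of that technique, not the bcachefs header itself. All DEMO_ names and the three-entry list are hypothetical, and how bcachefs expands its own list beyond the enum shown in the hunk is an assumption here.

```c
#include <stddef.h>

/* Hypothetical three-entry list; the real BCH_WRITE_FLAGS() list is longer. */
#define DEMO_WRITE_FLAGS()	\
	x(alloc_nowait)		\
	x(cached)		\
	x(submitted)

/* First expansion: one bit index per flag, plus a count. */
enum __demo_write_flags {
#define x(f)	__DEMO_WRITE_##f,
	DEMO_WRITE_FLAGS()
#undef x
	__DEMO_WRITE_NR
};

/* Second expansion: a single-bit mask per flag. */
enum demo_write_flags {
#define x(f)	DEMO_WRITE_##f = 1U << __DEMO_WRITE_##f,
	DEMO_WRITE_FLAGS()
#undef x
};

/* Third expansion: printable names, in the same order as the bit indices. */
static const char * const demo_write_flag_names[] = {
#define x(f)	#f,
	DEMO_WRITE_FLAGS()
#undef x
	NULL
};

/* Map a bit index to its name; out-of-range indices get a placeholder. */
static const char *demo_write_flag_name(unsigned bit)
{
	return bit < __DEMO_WRITE_NR ? demo_write_flag_names[bit] : "(unknown)";
}
```

Because the name table is generated from the identifiers themselves, renaming the list entries to lower case (as this hunk does) also changes any names printed from such a table, with no second list to keep in sync.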

+1 -1

fs/bcachefs/io_write_types.h

···
 	struct bpos pos;
 	struct bversion version;
 
-	/* For BCH_WRITE_DATA_ENCODED: */
+	/* For BCH_WRITE_data_encoded: */
 	struct bch_extent_crc_unpacked crc;
 
 	struct write_point_specifier write_point;

+131 -62

fs/bcachefs/journal.c

··· 20 20 #include "journal_seq_blacklist.h" 21 21 #include "trace.h" 22 22 23 - static const char * const bch2_journal_errors[] = { 24 - #define x(n) #n, 25 - JOURNAL_ERRORS() 26 - #undef x 27 - NULL 28 - }; 29 - 30 23 static inline bool journal_seq_unwritten(struct journal *j, u64 seq) 31 24 { 32 25 return seq > j->seq_ondisk; ··· 49 56 prt_printf(out, "seq:\t%llu\n", seq); 50 57 printbuf_indent_add(out, 2); 51 58 52 - prt_printf(out, "refcount:\t%u\n", journal_state_count(s, i)); 59 + if (!buf->write_started) 60 + prt_printf(out, "refcount:\t%u\n", journal_state_count(s, i & JOURNAL_STATE_BUF_MASK)); 53 61 54 - prt_printf(out, "size:\t"); 55 - prt_human_readable_u64(out, vstruct_bytes(buf->data)); 56 - prt_newline(out); 62 + struct closure *cl = &buf->io; 63 + int r = atomic_read(&cl->remaining); 64 + prt_printf(out, "io:\t%pS r %i\n", cl->fn, r & CLOSURE_REMAINING_MASK); 65 + 66 + if (buf->data) { 67 + prt_printf(out, "size:\t"); 68 + prt_human_readable_u64(out, vstruct_bytes(buf->data)); 69 + prt_newline(out); 70 + } 57 71 58 72 prt_printf(out, "expires:\t"); 59 73 prt_printf(out, "%li jiffies\n", buf->expires - jiffies); ··· 87 87 88 88 static void bch2_journal_bufs_to_text(struct printbuf *out, struct journal *j) 89 89 { 90 + lockdep_assert_held(&j->lock); 91 + out->atomic++; 92 + 90 93 if (!out->nr_tabstops) 91 94 printbuf_tabstop_push(out, 24); 92 95 ··· 98 95 seq++) 99 96 bch2_journal_buf_to_text(out, j, seq); 100 97 prt_printf(out, "last buf %s\n", journal_entry_is_open(j) ? 
"open" : "closed"); 98 + 99 + --out->atomic; 101 100 } 102 101 103 102 static inline struct journal_buf * ··· 109 104 110 105 EBUG_ON(seq > journal_cur_seq(j)); 111 106 112 - if (journal_seq_unwritten(j, seq)) { 107 + if (journal_seq_unwritten(j, seq)) 113 108 buf = j->buf + (seq & JOURNAL_BUF_MASK); 114 - EBUG_ON(le64_to_cpu(buf->data->seq) != seq); 115 - } 116 109 return buf; 117 110 } 118 111 ··· 142 139 bool stuck = false; 143 140 struct printbuf buf = PRINTBUF; 144 141 145 - if (!(error == JOURNAL_ERR_journal_full || 146 - error == JOURNAL_ERR_journal_pin_full) || 142 + if (!(error == -BCH_ERR_journal_full || 143 + error == -BCH_ERR_journal_pin_full) || 147 144 nr_unwritten_journal_entries(j) || 148 145 (flags & BCH_WATERMARK_MASK) != BCH_WATERMARK_reclaim) 149 146 return stuck; ··· 170 167 spin_unlock(&j->lock); 171 168 172 169 bch_err(c, "Journal stuck! Hava a pre-reservation but journal full (error %s)", 173 - bch2_journal_errors[error]); 170 + bch2_err_str(error)); 174 171 bch2_journal_debug_to_text(&buf, j); 175 172 bch_err(c, "%s", buf.buf); 176 173 ··· 198 195 if (w->write_started) 199 196 continue; 200 197 201 - if (!journal_state_count(j->reservations, idx)) { 198 + if (!journal_state_seq_count(j, j->reservations, seq)) { 199 + j->seq_write_started = seq; 202 200 w->write_started = true; 203 201 closure_call(&w->io, bch2_journal_write, j->wq, NULL); 204 202 } ··· 310 306 311 307 bch2_journal_space_available(j); 312 308 313 - __bch2_journal_buf_put(j, old.idx, le64_to_cpu(buf->data->seq)); 309 + __bch2_journal_buf_put(j, le64_to_cpu(buf->data->seq)); 314 310 } 315 311 316 312 void bch2_journal_halt(struct journal *j) ··· 381 377 BUG_ON(BCH_SB_CLEAN(c->disk_sb.sb)); 382 378 383 379 if (j->blocked) 384 - return JOURNAL_ERR_blocked; 380 + return -BCH_ERR_journal_blocked; 385 381 386 382 if (j->cur_entry_error) 387 383 return j->cur_entry_error; 388 384 389 - if (bch2_journal_error(j)) 390 - return JOURNAL_ERR_insufficient_devices; /* -EROFS */ 385 + int 
ret = bch2_journal_error(j); 386 + if (unlikely(ret)) 387 + return ret; 391 388 392 389 if (!fifo_free(&j->pin)) 393 - return JOURNAL_ERR_journal_pin_full; 390 + return -BCH_ERR_journal_pin_full; 394 391 395 392 if (nr_unwritten_journal_entries(j) == ARRAY_SIZE(j->buf)) 396 - return JOURNAL_ERR_max_in_flight; 393 + return -BCH_ERR_journal_max_in_flight; 394 + 395 + if (atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR) 396 + return -BCH_ERR_journal_max_open; 397 397 398 398 if (journal_cur_seq(j) >= JOURNAL_SEQ_MAX) { 399 399 bch_err(c, "cannot start: journal seq overflow"); 400 400 if (bch2_fs_emergency_read_only_locked(c)) 401 401 bch_err(c, "fatal error - emergency read only"); 402 - return JOURNAL_ERR_insufficient_devices; /* -EROFS */ 402 + return -BCH_ERR_journal_shutdown; 403 403 } 404 404 405 + if (!j->free_buf && !buf->data) 406 + return -BCH_ERR_journal_buf_enomem; /* will retry after write completion frees up a buf */ 407 + 405 408 BUG_ON(!j->cur_entry_sectors); 409 + 410 + if (!buf->data) { 411 + swap(buf->data, j->free_buf); 412 + swap(buf->buf_size, j->free_buf_size); 413 + } 406 414 407 415 buf->expires = 408 416 (journal_cur_seq(j) == j->flushed_seq_ondisk ··· 431 415 u64s = clamp_t(int, u64s, 0, JOURNAL_ENTRY_CLOSED_VAL - 1); 432 416 433 417 if (u64s <= (ssize_t) j->early_journal_entries.nr) 434 - return JOURNAL_ERR_journal_full; 418 + return -BCH_ERR_journal_full; 435 419 436 420 if (fifo_empty(&j->pin) && j->reclaim_thread) 437 421 wake_up_process(j->reclaim_thread); ··· 480 464 481 465 new.idx++; 482 466 BUG_ON(journal_state_count(new, new.idx)); 483 - BUG_ON(new.idx != (journal_cur_seq(j) & JOURNAL_BUF_MASK)); 467 + BUG_ON(new.idx != (journal_cur_seq(j) & JOURNAL_STATE_BUF_MASK)); 484 468 485 469 journal_state_inc(&new); 486 470 ··· 530 514 spin_unlock(&j->lock); 531 515 } 532 516 517 + static void journal_buf_prealloc(struct journal *j) 518 + { 519 + if (j->free_buf && 520 + j->free_buf_size >= j->buf_size_want) 521 + 
return; 522 + 523 + unsigned buf_size = j->buf_size_want; 524 + 525 + spin_unlock(&j->lock); 526 + void *buf = kvmalloc(buf_size, GFP_NOFS); 527 + spin_lock(&j->lock); 528 + 529 + if (buf && 530 + (!j->free_buf || 531 + buf_size > j->free_buf_size)) { 532 + swap(buf, j->free_buf); 533 + swap(buf_size, j->free_buf_size); 534 + } 535 + 536 + if (unlikely(buf)) { 537 + spin_unlock(&j->lock); 538 + /* kvfree can sleep */ 539 + kvfree(buf); 540 + spin_lock(&j->lock); 541 + } 542 + } 543 + 533 544 static int __journal_res_get(struct journal *j, struct journal_res *res, 534 545 unsigned flags) 535 546 { ··· 568 525 if (journal_res_get_fast(j, res, flags)) 569 526 return 0; 570 527 571 - if (bch2_journal_error(j)) 572 - return -BCH_ERR_erofs_journal_err; 528 + ret = bch2_journal_error(j); 529 + if (unlikely(ret)) 530 + return ret; 573 531 574 532 if (j->blocked) 575 - return -BCH_ERR_journal_res_get_blocked; 533 + return -BCH_ERR_journal_blocked; 576 534 577 535 if ((flags & BCH_WATERMARK_MASK) < j->watermark) { 578 - ret = JOURNAL_ERR_journal_full; 536 + ret = -BCH_ERR_journal_full; 579 537 can_discard = j->can_discard; 580 538 goto out; 581 539 } 582 540 583 541 if (nr_unwritten_journal_entries(j) == ARRAY_SIZE(j->buf) && !journal_entry_is_open(j)) { 584 - ret = JOURNAL_ERR_max_in_flight; 542 + ret = -BCH_ERR_journal_max_in_flight; 585 543 goto out; 586 544 } 587 545 588 546 spin_lock(&j->lock); 547 + 548 + journal_buf_prealloc(j); 589 549 590 550 /* 591 551 * Recheck after taking the lock, so we don't race with another thread ··· 612 566 j->buf_size_want = max(j->buf_size_want, buf->buf_size << 1); 613 567 614 568 __journal_entry_close(j, JOURNAL_ENTRY_CLOSED_VAL, false); 615 - ret = journal_entry_open(j) ?: JOURNAL_ERR_retry; 569 + ret = journal_entry_open(j) ?: -BCH_ERR_journal_retry_open; 616 570 unlock: 617 571 can_discard = j->can_discard; 618 572 spin_unlock(&j->lock); 619 573 out: 620 - if (ret == JOURNAL_ERR_retry) 621 - goto retry; 622 - if (!ret) 574 + if 
(likely(!ret)) 623 575 return 0; 576 + if (ret == -BCH_ERR_journal_retry_open) 577 + goto retry; 624 578 625 579 if (journal_error_check_stuck(j, ret, flags)) 626 - ret = -BCH_ERR_journal_res_get_blocked; 580 + ret = -BCH_ERR_journal_stuck; 627 581 628 - if (ret == JOURNAL_ERR_max_in_flight && 629 - track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight], true)) { 630 - 582 + if (ret == -BCH_ERR_journal_max_in_flight && 583 + track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight], true) && 584 + trace_journal_entry_full_enabled()) { 631 585 struct printbuf buf = PRINTBUF; 586 + 587 + bch2_printbuf_make_room(&buf, 4096); 588 + 589 + spin_lock(&j->lock); 632 590 prt_printf(&buf, "seq %llu\n", journal_cur_seq(j)); 633 591 bch2_journal_bufs_to_text(&buf, j); 592 + spin_unlock(&j->lock); 593 + 594 + trace_journal_entry_full(c, buf.buf); 595 + printbuf_exit(&buf); 596 + count_event(c, journal_entry_full); 597 + } 598 + 599 + if (ret == -BCH_ERR_journal_max_open && 600 + track_event_change(&c->times[BCH_TIME_blocked_journal_max_open], true) && 601 + trace_journal_entry_full_enabled()) { 602 + struct printbuf buf = PRINTBUF; 603 + 604 + bch2_printbuf_make_room(&buf, 4096); 605 + 606 + spin_lock(&j->lock); 607 + prt_printf(&buf, "seq %llu\n", journal_cur_seq(j)); 608 + bch2_journal_bufs_to_text(&buf, j); 609 + spin_unlock(&j->lock); 610 + 634 611 trace_journal_entry_full(c, buf.buf); 635 612 printbuf_exit(&buf); 636 613 count_event(c, journal_entry_full); ··· 663 594 * Journal is full - can't rely on reclaim from work item due to 664 595 * freezing: 665 596 */ 666 - if ((ret == JOURNAL_ERR_journal_full || 667 - ret == JOURNAL_ERR_journal_pin_full) && 597 + if ((ret == -BCH_ERR_journal_full || 598 + ret == -BCH_ERR_journal_pin_full) && 668 599 !(flags & JOURNAL_RES_GET_NONBLOCK)) { 669 600 if (can_discard) { 670 601 bch2_journal_do_discards(j); ··· 677 608 } 678 609 } 679 610 680 - return ret == JOURNAL_ERR_insufficient_devices 681 - ? 
-BCH_ERR_erofs_journal_err 682 - : -BCH_ERR_journal_res_get_blocked; 611 + return ret; 683 612 } 684 613 685 614 static unsigned max_dev_latency(struct bch_fs *c) ··· 707 640 int ret; 708 641 709 642 if (closure_wait_event_timeout(&j->async_wait, 710 - (ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked || 643 + !bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) || 711 644 (flags & JOURNAL_RES_GET_NONBLOCK), 712 645 HZ)) 713 646 return ret; ··· 721 654 remaining_wait = max(0, remaining_wait - HZ); 722 655 723 656 if (closure_wait_event_timeout(&j->async_wait, 724 - (ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked || 657 + !bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) || 725 658 (flags & JOURNAL_RES_GET_NONBLOCK), 726 659 remaining_wait)) 727 660 return ret; ··· 733 666 printbuf_exit(&buf); 734 667 735 668 closure_wait_event(&j->async_wait, 736 - (ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked || 669 + !bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) || 737 670 (flags & JOURNAL_RES_GET_NONBLOCK)); 738 671 return ret; 739 672 } ··· 754 687 goto out; 755 688 756 689 j->cur_entry_u64s = max_t(int, 0, j->cur_entry_u64s - d); 757 - smp_mb(); 758 690 state = READ_ONCE(j->reservations); 759 691 760 692 if (state.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL && ··· 973 907 struct bch_fs *c = container_of(j, struct bch_fs, journal); 974 908 975 909 if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_journal)) 976 - return -EROFS; 910 + return -BCH_ERR_erofs_no_writes; 977 911 978 912 int ret = __bch2_journal_meta(j); 979 913 bch2_write_ref_put(c, BCH_WRITE_REF_journal); ··· 1017 951 new.cur_entry_offset = JOURNAL_ENTRY_BLOCKED_VAL; 1018 952 } while (!atomic64_try_cmpxchg(&j->reservations.counter, &old.v, new.v)); 1019 953 1020 - journal_cur_buf(j)->data->u64s = cpu_to_le32(old.cur_entry_offset); 
954 + if (old.cur_entry_offset < JOURNAL_ENTRY_BLOCKED_VAL) 955 + journal_cur_buf(j)->data->u64s = cpu_to_le32(old.cur_entry_offset); 1021 956 } 1022 957 } 1023 958 ··· 1059 992 *blocked = true; 1060 993 } 1061 994 1062 - ret = journal_state_count(s, idx) > open 995 + ret = journal_state_count(s, idx & JOURNAL_STATE_BUF_MASK) > open 1063 996 ? ERR_PTR(-EAGAIN) 1064 997 : buf; 1065 998 break; ··· 1416 1349 j->replay_journal_seq_end = cur_seq; 1417 1350 j->last_seq_ondisk = last_seq; 1418 1351 j->flushed_seq_ondisk = cur_seq - 1; 1352 + j->seq_write_started = cur_seq - 1; 1419 1353 j->seq_ondisk = cur_seq - 1; 1420 1354 j->pin.front = last_seq; 1421 1355 j->pin.back = cur_seq; ··· 1457 1389 set_bit(JOURNAL_running, &j->flags); 1458 1390 j->last_flush_write = jiffies; 1459 1391 1460 - j->reservations.idx = j->reservations.unwritten_idx = journal_cur_seq(j); 1461 - j->reservations.unwritten_idx++; 1392 + j->reservations.idx = journal_cur_seq(j); 1462 1393 1463 1394 c->last_bucket_seq_cleanup = journal_cur_seq(j); 1464 1395 ··· 1510 1443 unsigned nr_bvecs = DIV_ROUND_UP(JOURNAL_ENTRY_SIZE_MAX, PAGE_SIZE); 1511 1444 1512 1445 for (unsigned i = 0; i < ARRAY_SIZE(ja->bio); i++) { 1513 - ja->bio[i] = kmalloc(struct_size(ja->bio[i], bio.bi_inline_vecs, 1446 + ja->bio[i] = kzalloc(struct_size(ja->bio[i], bio.bi_inline_vecs, 1514 1447 nr_bvecs), GFP_KERNEL); 1515 1448 if (!ja->bio[i]) 1516 1449 return -BCH_ERR_ENOMEM_dev_journal_init; ··· 1549 1482 1550 1483 for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++) 1551 1484 kvfree(j->buf[i].data); 1485 + kvfree(j->free_buf); 1552 1486 free_fifo(&j->pin); 1553 1487 } 1554 1488 ··· 1576 1508 if (!(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL))) 1577 1509 return -BCH_ERR_ENOMEM_journal_pin_fifo; 1578 1510 1579 - for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++) { 1580 - j->buf[i].buf_size = JOURNAL_ENTRY_SIZE_MIN; 1581 - j->buf[i].data = kvmalloc(j->buf[i].buf_size, GFP_KERNEL); 1582 - if (!j->buf[i].data) 1583 - return 
-BCH_ERR_ENOMEM_journal_buf; 1511 + j->free_buf_size = j->buf_size_want = JOURNAL_ENTRY_SIZE_MIN; 1512 + j->free_buf = kvmalloc(j->free_buf_size, GFP_KERNEL); 1513 + if (!j->free_buf) 1514 + return -BCH_ERR_ENOMEM_journal_buf; 1515 + 1516 + for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++) 1584 1517 j->buf[i].idx = i; 1585 - } 1586 1518 1587 1519 j->pin.front = j->pin.back = 1; 1588 1520 ··· 1632 1564 prt_printf(out, "average write size:\t"); 1633 1565 prt_human_readable_u64(out, nr_writes ? div64_u64(j->entry_bytes_written, nr_writes) : 0); 1634 1566 prt_newline(out); 1567 + prt_printf(out, "free buf:\t%u\n", j->free_buf ? j->free_buf_size : 0); 1635 1568 prt_printf(out, "nr direct reclaim:\t%llu\n", j->nr_direct_reclaim); 1636 1569 prt_printf(out, "nr background reclaim:\t%llu\n", j->nr_background_reclaim); 1637 1570 prt_printf(out, "reclaim kicked:\t%u\n", j->reclaim_kicked); ··· 1640 1571 ? jiffies_to_msecs(j->next_reclaim - jiffies) : 0); 1641 1572 prt_printf(out, "blocked:\t%u\n", j->blocked); 1642 1573 prt_printf(out, "current entry sectors:\t%u\n", j->cur_entry_sectors); 1643 - prt_printf(out, "current entry error:\t%s\n", bch2_journal_errors[j->cur_entry_error]); 1574 + prt_printf(out, "current entry error:\t%s\n", bch2_err_str(j->cur_entry_error)); 1644 1575 prt_printf(out, "current entry:\t"); 1645 1576 1646 1577 switch (s.cur_entry_offset) {
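The journal_buf_prealloc() helper added in the journal.c diff above follows a common pattern: drop the lock around an allocation that may sleep, retake it, install the new buffer only if it is still an improvement, and free whichever buffer lost outside the lock. The sketch below is a userspace analogue under stated assumptions: a pthread mutex stands in for the journal spinlock, malloc()/free() stand in for kvmalloc()/kvfree(), and struct demo_journal is a hypothetical cut-down structure, not the kernel's struct journal.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in holding only the fields the pattern needs. */
struct demo_journal {
	pthread_mutex_t	lock;
	void		*free_buf;
	size_t		free_buf_size;
	size_t		buf_size_want;
};

/* Must be called with j->lock held; returns with j->lock held. */
static void demo_buf_prealloc(struct demo_journal *j)
{
	/* Nothing to do if the spare buffer is already big enough. */
	if (j->free_buf && j->free_buf_size >= j->buf_size_want)
		return;

	size_t buf_size = j->buf_size_want;

	/* Allocate with the lock dropped, since allocation may block. */
	pthread_mutex_unlock(&j->lock);
	void *buf = malloc(buf_size);
	pthread_mutex_lock(&j->lock);

	/*
	 * State may have changed while unlocked: install the new buffer
	 * only if there is no spare, or the new one is larger. After the
	 * swap, buf holds the displaced buffer (possibly NULL).
	 */
	if (buf && (!j->free_buf || buf_size > j->free_buf_size)) {
		void *old = j->free_buf;
		size_t old_size = j->free_buf_size;

		j->free_buf = buf;
		j->free_buf_size = buf_size;
		buf = old;
		buf_size = old_size;
	}

	/* Free the loser outside the lock as well. */
	if (buf) {
		pthread_mutex_unlock(&j->lock);
		free(buf);
		pthread_mutex_lock(&j->lock);
	}
}
```

Any caller of such a helper has to recheck its own conditions afterwards, since the lock was released in the middle; in the diff above the call sits before the existing "Recheck after taking the lock" comment in __journal_res_get().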

+30 -12

fs/bcachefs/journal.h

···
 	closure_wake_up(&j->async_wait);
 }
 
-static inline struct journal_buf *journal_cur_buf(struct journal *j)
-{
-	return j->buf + j->reservations.idx;
-}
-
 /* Sequence number of oldest dirty journal entry */
 
 static inline u64 journal_last_seq(struct journal *j)
···
 	return j->seq_ondisk + 1;
 }
 
+static inline struct journal_buf *journal_cur_buf(struct journal *j)
+{
+	unsigned idx = (journal_cur_seq(j) &
+			JOURNAL_BUF_MASK &
+			~JOURNAL_STATE_BUF_MASK) + j->reservations.idx;
+
+	return j->buf + idx;
+}
+
 static inline int journal_state_count(union journal_res_state s, int idx)
 {
 	switch (idx) {
···
 	case 3: return s.buf3_count;
 	}
 	BUG();
+}
+
+static inline int journal_state_seq_count(struct journal *j,
+					  union journal_res_state s, u64 seq)
+{
+	if (journal_cur_seq(j) - seq < JOURNAL_STATE_BUF_NR)
+		return journal_state_count(s, seq & JOURNAL_STATE_BUF_MASK);
+	else
+		return 0;
 }
 
 static inline void journal_state_inc(union journal_res_state *s)
···
 static inline struct jset_entry *
 journal_res_entry(struct journal *j, struct journal_res *res)
 {
-	return vstruct_idx(j->buf[res->idx].data, res->offset);
+	return vstruct_idx(j->buf[res->seq & JOURNAL_BUF_MASK].data, res->offset);
 }
 
 static inline unsigned journal_entry_init(struct jset_entry *entry, unsigned type,
···
 void bch2_journal_do_writes(struct journal *);
 void bch2_journal_buf_put_final(struct journal *, u64);
 
-static inline void __bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
+static inline void __bch2_journal_buf_put(struct journal *j, u64 seq)
 {
+	unsigned idx = seq & JOURNAL_STATE_BUF_MASK;
 	union journal_res_state s;
 
 	s = journal_state_buf_put(j, idx);
···
 		bch2_journal_buf_put_final(j, seq);
 }
 
-static inline void bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
+static inline void bch2_journal_buf_put(struct journal *j, u64 seq)
 {
+	unsigned idx = seq & JOURNAL_STATE_BUF_MASK;
 	union journal_res_state s;
 
 	s = journal_state_buf_put(j, idx);
···
 				       BCH_JSET_ENTRY_btree_keys,
 				       0, 0, 0);
 
-	bch2_journal_buf_put(j, res->idx, res->seq);
+	bch2_journal_buf_put(j, res->seq);
 
 	res->ref = 0;
 }
···
 
 		/*
 		 * Check if there is still room in the current journal
-		 * entry:
+		 * entry, smp_rmb() guarantees that reads from reservations.counter
+		 * occur before accessing cur_entry_u64s:
 		 */
+		smp_rmb();
 		if (new.cur_entry_offset + res->u64s > j->cur_entry_u64s)
 			return 0;
 
···
 				       &old.v, new.v));
 
 	res->ref = true;
-	res->idx = old.idx;
 	res->offset = old.cur_entry_offset;
-	res->seq = le64_to_cpu(j->buf[old.idx].data->seq);
+	res->seq = journal_cur_seq(j);
+	res->seq -= (res->seq - old.idx) & JOURNAL_STATE_BUF_MASK;
 	return 1;
 }
 
···
 				    (flags & JOURNAL_RES_GET_NONBLOCK) != 0,
 				    NULL, _THIS_IP_);
 		EBUG_ON(!res->ref);
+		BUG_ON(!res->seq);
 	}
 	return 0;
 }
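In the journal.h diff above, journal_res_get_fast() no longer reads the sequence number out of the buffer; it reconstructs it from journal_cur_seq() and the low-bit index held in the reservation state, while journal_state_seq_count() treats any sequence number outside the tracked window as having no references. The arithmetic can be checked in isolation with the sketch below. The DEMO_ constants are assumptions for illustration (four state buffers, inferred from the case 0..3 switch in journal_state_count()), and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed for illustration: four state buffers, hence a two-bit index. */
#define DEMO_STATE_BUF_NR	4
#define DEMO_STATE_BUF_MASK	(DEMO_STATE_BUF_NR - 1)

/*
 * Given the newest sequence number and a low-bit buffer index, recover the
 * newest sequence number <= cur_seq whose low bits equal idx. This mirrors
 * "res->seq -= (res->seq - old.idx) & JOURNAL_STATE_BUF_MASK".
 */
static uint64_t demo_seq_from_idx(uint64_t cur_seq, unsigned idx)
{
	return cur_seq - ((cur_seq - idx) & DEMO_STATE_BUF_MASK);
}

/*
 * Window check in the style of journal_state_seq_count(): only the newest
 * DEMO_STATE_BUF_NR sequence numbers map unambiguously onto an index.
 */
static int demo_seq_in_window(uint64_t cur_seq, uint64_t seq)
{
	return cur_seq - seq < DEMO_STATE_BUF_NR;
}
```

For example, with a current sequence number of 10, index 2 maps back to 10 itself, index 1 to 9, index 0 to 8, and index 3 to 7; sequence number 6 shares index 2 with 10 but falls outside the window, which is why the window check is needed before trusting a per-index count.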

+60 -39

fs/bcachefs/journal_io.c

··· 1041 1041 bio->bi_iter.bi_sector = offset; 1042 1042 bch2_bio_map(bio, buf->data, sectors_read << 9); 1043 1043 1044 + u64 submit_time = local_clock(); 1044 1045 ret = submit_bio_wait(bio); 1045 1046 kfree(bio); 1046 1047 1047 - if (bch2_dev_io_err_on(ret, ca, BCH_MEMBER_ERROR_read, 1048 - "journal read error: sector %llu", 1049 - offset) || 1050 - bch2_meta_read_fault("journal")) { 1048 + if (!ret && bch2_meta_read_fault("journal")) 1049 + ret = -BCH_ERR_EIO_fault_injected; 1050 + 1051 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, 1052 + submit_time, !ret); 1053 + 1054 + if (ret) { 1055 + bch_err_dev_ratelimited(ca, 1056 + "journal read error: sector %llu", offset); 1051 1057 /* 1052 1058 * We don't error out of the recovery process 1053 1059 * here, since the relevant journal entry may be ··· 1116 1110 struct bch_csum csum; 1117 1111 csum_good = jset_csum_good(c, j, &csum); 1118 1112 1119 - if (bch2_dev_io_err_on(!csum_good, ca, BCH_MEMBER_ERROR_checksum, 1120 - "%s", 1121 - (printbuf_reset(&err), 1122 - prt_str(&err, "journal "), 1123 - bch2_csum_err_msg(&err, csum_type, j->csum, csum), 1124 - err.buf))) 1113 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_checksum, 0, csum_good); 1114 + 1115 + if (!csum_good) { 1116 + bch_err_dev_ratelimited(ca, "%s", 1117 + (printbuf_reset(&err), 1118 + prt_str(&err, "journal "), 1119 + bch2_csum_err_msg(&err, csum_type, j->csum, csum), 1120 + err.buf)); 1125 1121 saw_bad = true; 1122 + } 1126 1123 1127 1124 ret = bch2_encrypt(c, JSET_CSUM_TYPE(j), journal_nonce(j), 1128 1125 j->encrypted_start, ··· 1524 1515 * @j: journal object 1525 1516 * @w: journal buf (entry to be written) 1526 1517 * 1527 - * Returns: 0 on success, or -EROFS on failure 1518 + * Returns: 0 on success, or -BCH_ERR_insufficient_devices on failure 1528 1519 */ 1529 1520 static int journal_write_alloc(struct journal *j, struct journal_buf *w) 1530 1521 { ··· 1609 1600 kvfree(new_buf); 1610 1601 } 1611 1602 1612 - static inline struct 
journal_buf *journal_last_unwritten_buf(struct journal *j) 1613 - { 1614 - return j->buf + (journal_last_unwritten_seq(j) & JOURNAL_BUF_MASK); 1615 - } 1616 - 1617 1603 static CLOSURE_CALLBACK(journal_write_done) 1618 1604 { 1619 1605 closure_type(w, struct journal_buf, io); 1620 1606 struct journal *j = container_of(w, struct journal, buf[w->idx]); 1621 1607 struct bch_fs *c = container_of(j, struct bch_fs, journal); 1622 1608 struct bch_replicas_padded replicas; 1623 - union journal_res_state old, new; 1624 1609 u64 seq = le64_to_cpu(w->data->seq); 1625 1610 int err = 0; 1626 1611 ··· 1624 1621 1625 1622 if (!w->devs_written.nr) { 1626 1623 bch_err(c, "unable to write journal to sufficient devices"); 1627 - err = -EIO; 1624 + err = -BCH_ERR_journal_write_err; 1628 1625 } else { 1629 1626 bch2_devlist_to_replicas(&replicas.e, BCH_DATA_journal, 1630 1627 w->devs_written); 1631 - if (bch2_mark_replicas(c, &replicas.e)) 1632 - err = -EIO; 1628 + err = bch2_mark_replicas(c, &replicas.e); 1633 1629 } 1634 1630 1635 1631 if (err) ··· 1643 1641 j->err_seq = seq; 1644 1642 w->write_done = true; 1645 1643 1644 + if (!j->free_buf || j->free_buf_size < w->buf_size) { 1645 + swap(j->free_buf, w->data); 1646 + swap(j->free_buf_size, w->buf_size); 1647 + } 1648 + 1649 + if (w->data) { 1650 + void *buf = w->data; 1651 + w->data = NULL; 1652 + w->buf_size = 0; 1653 + 1654 + spin_unlock(&j->lock); 1655 + kvfree(buf); 1656 + spin_lock(&j->lock); 1657 + } 1658 + 1646 1659 bool completed = false; 1660 + bool do_discards = false; 1647 1661 1648 1662 for (seq = journal_last_unwritten_seq(j); 1649 1663 seq <= journal_cur_seq(j); ··· 1668 1650 if (!w->write_done) 1669 1651 break; 1670 1652 1671 - if (!j->err_seq && !JSET_NO_FLUSH(w->data)) { 1653 + if (!j->err_seq && !w->noflush) { 1672 1654 j->flushed_seq_ondisk = seq; 1673 1655 j->last_seq_ondisk = w->last_seq; 1674 1656 1675 - bch2_do_discards(c); 1676 1657 closure_wake_up(&c->freelist_wait); 1677 1658 bch2_reset_alloc_cursors(c); 
1678 1659 } ··· 1688 1671 if (j->watermark != BCH_WATERMARK_stripe) 1689 1672 journal_reclaim_kick(&c->journal); 1690 1673 1691 - old.v = atomic64_read(&j->reservations.counter); 1692 - do { 1693 - new.v = old.v; 1694 - BUG_ON(journal_state_count(new, new.unwritten_idx)); 1695 - BUG_ON(new.unwritten_idx != (seq & JOURNAL_BUF_MASK)); 1696 - 1697 - new.unwritten_idx++; 1698 - } while (!atomic64_try_cmpxchg(&j->reservations.counter, 1699 - &old.v, new.v)); 1700 - 1701 1674 closure_wake_up(&w->wait); 1702 1675 completed = true; 1703 1676 } ··· 1702 1695 } 1703 1696 1704 1697 if (journal_last_unwritten_seq(j) == journal_cur_seq(j) && 1705 - new.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL) { 1698 + j->reservations.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL) { 1706 1699 struct journal_buf *buf = journal_cur_buf(j); 1707 1700 long delta = buf->expires - jiffies; 1708 1701 ··· 1722 1715 */ 1723 1716 bch2_journal_do_writes(j); 1724 1717 spin_unlock(&j->lock); 1718 + 1719 + if (do_discards) 1720 + bch2_do_discards(c); 1725 1721 } 1726 1722 1727 1723 static void journal_write_endio(struct bio *bio) ··· 1734 1724 struct journal *j = &ca->fs->journal; 1735 1725 struct journal_buf *w = j->buf + jbio->buf_idx; 1736 1726 1737 - if (bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write, 1727 + bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write, 1728 + jbio->submit_time, !bio->bi_status); 1729 + 1730 + if (bio->bi_status) { 1731 + bch_err_dev_ratelimited(ca, 1738 1732 "error writing journal entry %llu: %s", 1739 1733 le64_to_cpu(w->data->seq), 1740 - bch2_blk_status_to_str(bio->bi_status)) || 1741 - bch2_meta_write_fault("journal")) { 1742 - unsigned long flags; 1734 + bch2_blk_status_to_str(bio->bi_status)); 1743 1735 1736 + unsigned long flags; 1744 1737 spin_lock_irqsave(&j->err_lock, flags); 1745 1738 bch2_dev_list_drop_dev(&w->devs_written, ca->dev_idx); 1746 1739 spin_unlock_irqrestore(&j->err_lock, flags); ··· 1772 1759 sectors); 1773 1760 1774 1761 struct 
journal_device *ja = &ca->journal; 1775 - struct bio *bio = &ja->bio[w->idx]->bio; 1762 + struct journal_bio *jbio = ja->bio[w->idx]; 1763 + struct bio *bio = &jbio->bio; 1764 + 1765 + jbio->submit_time = local_clock(); 1766 + 1776 1767 bio_reset(bio, ca->disk_sb.bdev, REQ_OP_WRITE|REQ_SYNC|REQ_META); 1777 1768 bio->bi_iter.bi_sector = ptr->offset; 1778 1769 bio->bi_end_io = journal_write_endio; ··· 1808 1791 struct journal *j = container_of(w, struct journal, buf[w->idx]); 1809 1792 struct bch_fs *c = container_of(j, struct bch_fs, journal); 1810 1793 1794 + /* 1795 + * Wait for previous journal writes to comelete; they won't necessarily 1796 + * be flushed if they're still in flight 1797 + */ 1811 1798 if (j->seq_ondisk + 1 != le64_to_cpu(w->data->seq)) { 1812 1799 spin_lock(&j->lock); 1813 1800 if (j->seq_ondisk + 1 != le64_to_cpu(w->data->seq)) { ··· 2005 1984 * write anything at all. 2006 1985 */ 2007 1986 if (error && test_bit(JOURNAL_need_flush_write, &j->flags)) 2008 - return -EIO; 1987 + return error; 2009 1988 2010 1989 if (error || 2011 1990 w->noflush ||
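Note on the `free_buf` hunk in `journal_write_done()` above: the completed write's data buffer is either kept as the journal's single spare (when there is no spare yet, or the spare is smaller) or released, and the release happens with `j->lock` dropped. The following is a minimal userspace sketch of that keep-the-largest-spare pattern, not the kernel code. The struct and function names (`spare_slot`, `write_buf`, `recycle_buffer`) are hypothetical, and locking is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for j->free_buf/j->free_buf_size and w->data/w->buf_size. */
struct spare_slot { void *buf;  size_t size; };
struct write_buf  { void *data; size_t buf_size; };

/*
 * Keep whichever buffer is larger as the spare and detach the other one
 * from the write buffer. Returns the buffer the caller should free after
 * dropping its lock, or NULL if nothing is left over.
 */
static void *recycle_buffer(struct spare_slot *s, struct write_buf *w)
{
	assert(s && w);

	if (!s->buf || s->size < w->buf_size) {
		/* Swap: the write buffer's allocation becomes the spare. */
		void *tmp_buf = s->buf;
		size_t tmp_size = s->size;

		s->buf = w->data;
		s->size = w->buf_size;
		w->data = tmp_buf;
		w->buf_size = tmp_size;
	}

	/* Whatever is still attached to the write buffer is surplus. */
	void *to_free = w->data;
	w->data = NULL;
	w->buf_size = 0;
	return to_free;
}
```

In this sketch the caller would take its lock, call `recycle_buffer()`, drop the lock, and only then `free()` the returned pointer, mirroring the unlock/`kvfree()`/lock sequence in the hunk.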

+4 -6

fs/bcachefs/journal_reclaim.c

··· 226 226 227 227 bch_err(c, "%s", buf.buf); 228 228 printbuf_exit(&buf); 229 - ret = JOURNAL_ERR_insufficient_devices; 229 + ret = -BCH_ERR_insufficient_journal_devices; 230 230 goto out; 231 231 } 232 232 ··· 240 240 total = j->space[journal_space_total].total; 241 241 242 242 if (!j->space[journal_space_discarded].next_entry) 243 - ret = JOURNAL_ERR_journal_full; 243 + ret = -BCH_ERR_journal_full; 244 244 245 245 if ((j->space[journal_space_clean_ondisk].next_entry < 246 246 j->space[journal_space_clean_ondisk].total) && ··· 645 645 * @j: journal object 646 646 * @direct: direct or background reclaim? 647 647 * @kicked: requested to run since we last ran? 648 - * Returns: 0 on success, or -EIO if the journal has been shutdown 649 648 * 650 649 * Background journal reclaim writes out btree nodes. It should be run 651 650 * early enough so that we never completely run out of journal buckets. ··· 684 685 if (kthread && kthread_should_stop()) 685 686 break; 686 687 687 - if (bch2_journal_error(j)) { 688 - ret = -EIO; 688 + ret = bch2_journal_error(j); 689 + if (ret) 689 690 break; 690 - } 691 691 692 692 bch2_journal_do_discards(j); 693 693

+3 -4

fs/bcachefs/journal_seq_blacklist.c

··· 231 231 struct journal_seq_blacklist_table *t = c->journal_seq_blacklist_table; 232 232 BUG_ON(nr != t->nr); 233 233 234 - unsigned i; 235 - for (src = bl->start, i = t->nr == 0 ? 0 : eytzinger0_first(t->nr); 236 - src < bl->start + nr; 237 - src++, i = eytzinger0_next(i, nr)) { 234 + src = bl->start; 235 + eytzinger0_for_each(i, nr) { 238 236 BUG_ON(t->entries[i].start != le64_to_cpu(src->start)); 239 237 BUG_ON(t->entries[i].end != le64_to_cpu(src->end)); 240 238 241 239 if (t->entries[i].dirty || t->entries[i].end >= c->journal.oldest_seq_found_ondisk) 242 240 *dst++ = *src; 241 + src++; 243 242 } 244 243 245 244 unsigned new_nr = dst - bl->start;

+13 -24

fs/bcachefs/journal_types.h

··· 12 12 /* btree write buffer steals 8 bits for its own purposes: */ 13 13 #define JOURNAL_SEQ_MAX ((1ULL << 56) - 1) 14 14 15 - #define JOURNAL_BUF_BITS 2 15 + #define JOURNAL_STATE_BUF_BITS 2 16 + #define JOURNAL_STATE_BUF_NR (1U << JOURNAL_STATE_BUF_BITS) 17 + #define JOURNAL_STATE_BUF_MASK (JOURNAL_STATE_BUF_NR - 1) 18 + 19 + #define JOURNAL_BUF_BITS 4 16 20 #define JOURNAL_BUF_NR (1U << JOURNAL_BUF_BITS) 17 21 #define JOURNAL_BUF_MASK (JOURNAL_BUF_NR - 1) 18 22 ··· 86 82 87 83 struct journal_res { 88 84 bool ref; 89 - u8 idx; 90 85 u16 u64s; 91 86 u32 offset; 92 87 u64 seq; ··· 101 98 }; 102 99 103 100 struct { 104 - u64 cur_entry_offset:20, 101 + u64 cur_entry_offset:22, 105 102 idx:2, 106 - unwritten_idx:2, 107 103 buf0_count:10, 108 104 buf1_count:10, 109 105 buf2_count:10, ··· 112 110 113 111 /* bytes: */ 114 112 #define JOURNAL_ENTRY_SIZE_MIN (64U << 10) /* 64k */ 115 - #define JOURNAL_ENTRY_SIZE_MAX (4U << 20) /* 4M */ 113 + #define JOURNAL_ENTRY_SIZE_MAX (4U << 22) /* 16M */ 116 114 117 115 /* 118 116 * We stash some journal state as sentinal values in cur_entry_offset: 119 117 * note - cur_entry_offset is in units of u64s 120 118 */ 121 - #define JOURNAL_ENTRY_OFFSET_MAX ((1U << 20) - 1) 119 + #define JOURNAL_ENTRY_OFFSET_MAX ((1U << 22) - 1) 122 120 123 121 #define JOURNAL_ENTRY_BLOCKED_VAL (JOURNAL_ENTRY_OFFSET_MAX - 2) 124 122 #define JOURNAL_ENTRY_CLOSED_VAL (JOURNAL_ENTRY_OFFSET_MAX - 1) ··· 151 149 #undef x 152 150 }; 153 151 154 - /* Reasons we may fail to get a journal reservation: */ 155 - #define JOURNAL_ERRORS() \ 156 - x(ok) \ 157 - x(retry) \ 158 - x(blocked) \ 159 - x(max_in_flight) \ 160 - x(journal_full) \ 161 - x(journal_pin_full) \ 162 - x(journal_stuck) \ 163 - x(insufficient_devices) 164 - 165 - enum journal_errors { 166 - #define x(n) JOURNAL_ERR_##n, 167 - JOURNAL_ERRORS() 168 - #undef x 169 - }; 170 - 171 152 typedef DARRAY(u64) darray_u64; 172 153 173 154 struct journal_bio { 174 155 struct bch_dev *ca; 175 156 unsigned 
buf_idx; 157 + u64 submit_time; 176 158 177 159 struct bio bio; 178 160 }; ··· 185 199 * 0, or -ENOSPC if waiting on journal reclaim, or -EROFS if 186 200 * insufficient devices: 187 201 */ 188 - enum journal_errors cur_entry_error; 202 + int cur_entry_error; 189 203 unsigned cur_entry_offset_if_blocked; 190 204 191 205 unsigned buf_size_want; ··· 206 220 * other is possibly being written out. 207 221 */ 208 222 struct journal_buf buf[JOURNAL_BUF_NR]; 223 + void *free_buf; 224 + unsigned free_buf_size; 209 225 210 226 spinlock_t lock; 211 227 ··· 225 237 /* Sequence number of most recent journal entry (last entry in @pin) */ 226 238 atomic64_t seq; 227 239 240 + u64 seq_write_started; 228 241 /* seq, last_seq from the most recent journal entry successfully written */ 229 242 u64 seq_ondisk; 230 243 u64 flushed_seq_ondisk;
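Note on the size changes in this header: `cur_entry_offset` is stated to be in units of u64s, so widening it from 20 to 22 bits and raising `JOURNAL_ENTRY_SIZE_MAX` to `4U << 22` bytes can be checked with a little arithmetic. The sketch below redoes that arithmetic in plain C. The macro names are shortened copies for illustration, and `bytes_to_u64s` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the hunk above, under shortened illustrative names. */
#define ENTRY_SIZE_MAX_BYTES	(4U << 22)			/* 16 MiB */
#define ENTRY_OFFSET_BITS	22
#define ENTRY_OFFSET_MAX	((1U << ENTRY_OFFSET_BITS) - 1)
#define ENTRY_BLOCKED_VAL	(ENTRY_OFFSET_MAX - 2)		/* lowest sentinel shown in the hunk */

/* cur_entry_offset counts u64s, so convert a byte size to that unit. */
static uint32_t bytes_to_u64s(uint32_t bytes)
{
	return bytes / (uint32_t) sizeof(uint64_t);
}

/* 16 MiB is 2^21 u64s, which stays below the sentinel values near 2^22. */
static_assert((ENTRY_SIZE_MAX_BYTES / 8) < ENTRY_BLOCKED_VAL,
	      "largest entry offset must not collide with sentinel values");
```

By the same arithmetic, the previous 20-bit field could address at most 2^20 u64s (8 MiB), so the new 16 MiB maximum would not have fit in it.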

+62 -38

fs/bcachefs/lru.c

··· 6 6 #include "btree_iter.h" 7 7 #include "btree_update.h" 8 8 #include "btree_write_buffer.h" 9 + #include "ec.h" 9 10 #include "error.h" 10 11 #include "lru.h" 11 12 #include "recovery.h" ··· 60 59 return __bch2_lru_set(trans, lru_id, dev_bucket, time, KEY_TYPE_set); 61 60 } 62 61 63 - int bch2_lru_change(struct btree_trans *trans, 64 - u16 lru_id, u64 dev_bucket, 65 - u64 old_time, u64 new_time) 62 + int __bch2_lru_change(struct btree_trans *trans, 63 + u16 lru_id, u64 dev_bucket, 64 + u64 old_time, u64 new_time) 66 65 { 67 66 if (old_time == new_time) 68 67 return 0; ··· 79 78 }; 80 79 81 80 int bch2_lru_check_set(struct btree_trans *trans, 82 - u16 lru_id, u64 time, 81 + u16 lru_id, 82 + u64 dev_bucket, 83 + u64 time, 83 84 struct bkey_s_c referring_k, 84 85 struct bkey_buf *last_flushed) 85 86 { ··· 90 87 struct btree_iter lru_iter; 91 88 struct bkey_s_c lru_k = 92 89 bch2_bkey_get_iter(trans, &lru_iter, BTREE_ID_lru, 93 - lru_pos(lru_id, 94 - bucket_to_u64(referring_k.k->p), 95 - time), 0); 90 + lru_pos(lru_id, dev_bucket, time), 0); 96 91 int ret = bkey_err(lru_k); 97 92 if (ret) 98 93 return ret; ··· 105 104 " %s", 106 105 bch2_lru_types[lru_type(lru_k)], 107 106 (bch2_bkey_val_to_text(&buf, c, referring_k), buf.buf))) { 108 - ret = bch2_lru_set(trans, lru_id, bucket_to_u64(referring_k.k->p), time); 107 + ret = bch2_lru_set(trans, lru_id, dev_bucket, time); 109 108 if (ret) 110 109 goto err; 111 110 } ··· 117 116 return ret; 118 117 } 119 118 119 + static struct bbpos lru_pos_to_bp(struct bkey_s_c lru_k) 120 + { 121 + enum bch_lru_type type = lru_type(lru_k); 122 + 123 + switch (type) { 124 + case BCH_LRU_read: 125 + case BCH_LRU_fragmentation: 126 + return BBPOS(BTREE_ID_alloc, u64_to_bucket(lru_k.k->p.offset)); 127 + case BCH_LRU_stripes: 128 + return BBPOS(BTREE_ID_stripes, POS(0, lru_k.k->p.offset)); 129 + default: 130 + BUG(); 131 + } 132 + } 133 + 134 + static u64 bkey_lru_type_idx(struct bch_fs *c, 135 + enum bch_lru_type type, 136 + struct 
bkey_s_c k) 137 + { 138 + struct bch_alloc_v4 a_convert; 139 + const struct bch_alloc_v4 *a; 140 + 141 + switch (type) { 142 + case BCH_LRU_read: 143 + a = bch2_alloc_to_v4(k, &a_convert); 144 + return alloc_lru_idx_read(*a); 145 + case BCH_LRU_fragmentation: { 146 + a = bch2_alloc_to_v4(k, &a_convert); 147 + 148 + rcu_read_lock(); 149 + struct bch_dev *ca = bch2_dev_rcu_noerror(c, k.k->p.inode); 150 + u64 idx = ca 151 + ? alloc_lru_idx_fragmentation(*a, ca) 152 + : 0; 153 + rcu_read_unlock(); 154 + return idx; 155 + } 156 + case BCH_LRU_stripes: 157 + return k.k->type == KEY_TYPE_stripe 158 + ? stripe_lru_pos(bkey_s_c_to_stripe(k).v) 159 + : 0; 160 + default: 161 + BUG(); 162 + } 163 + } 164 + 120 165 static int bch2_check_lru_key(struct btree_trans *trans, 121 166 struct btree_iter *lru_iter, 122 167 struct bkey_s_c lru_k, 123 168 struct bkey_buf *last_flushed) 124 169 { 125 170 struct bch_fs *c = trans->c; 126 - struct btree_iter iter; 127 - struct bkey_s_c k; 128 - struct bch_alloc_v4 a_convert; 129 - const struct bch_alloc_v4 *a; 130 171 struct printbuf buf1 = PRINTBUF; 131 172 struct printbuf buf2 = PRINTBUF; 132 - enum bch_lru_type type = lru_type(lru_k); 133 - struct bpos alloc_pos = u64_to_bucket(lru_k.k->p.offset); 134 - u64 idx; 135 - int ret; 136 173 137 - struct bch_dev *ca = bch2_dev_bucket_tryget_noerror(c, alloc_pos); 174 + struct bbpos bp = lru_pos_to_bp(lru_k); 138 175 139 - if (fsck_err_on(!ca, 140 - trans, lru_entry_to_invalid_bucket, 141 - "lru key points to nonexistent device:bucket %llu:%llu", 142 - alloc_pos.inode, alloc_pos.offset)) 143 - return bch2_btree_bit_mod_buffered(trans, BTREE_ID_lru, lru_iter->pos, false); 144 - 145 - k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_alloc, alloc_pos, 0); 146 - ret = bkey_err(k); 176 + struct btree_iter iter; 177 + struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, bp.btree, bp.pos, 0); 178 + int ret = bkey_err(k); 147 179 if (ret) 148 180 goto err; 149 181 150 - a = bch2_alloc_to_v4(k, &a_convert); 
182 + enum bch_lru_type type = lru_type(lru_k); 183 + u64 idx = bkey_lru_type_idx(c, type, k); 151 184 152 - switch (type) { 153 - case BCH_LRU_read: 154 - idx = alloc_lru_idx_read(*a); 155 - break; 156 - case BCH_LRU_fragmentation: 157 - idx = alloc_lru_idx_fragmentation(*a, ca); 158 - break; 159 - } 160 - 161 - if (lru_k.k->type != KEY_TYPE_set || 162 - lru_pos_time(lru_k.k->p) != idx) { 185 + if (lru_pos_time(lru_k.k->p) != idx) { 163 186 ret = bch2_btree_write_buffer_maybe_flush(trans, lru_k, last_flushed); 164 187 if (ret) 165 188 goto err; ··· 201 176 err: 202 177 fsck_err: 203 178 bch2_trans_iter_exit(trans, &iter); 204 - bch2_dev_put(ca); 205 179 printbuf_exit(&buf2); 206 180 printbuf_exit(&buf1); 207 181 return ret;

+18 -4

fs/bcachefs/lru.h

··· 28 28 { 29 29 u16 lru_id = l.k->p.inode >> 48; 30 30 31 - if (lru_id == BCH_LRU_FRAGMENTATION_START) 31 + switch (lru_id) { 32 + case BCH_LRU_BUCKET_FRAGMENTATION: 32 33 return BCH_LRU_fragmentation; 33 - return BCH_LRU_read; 34 + case BCH_LRU_STRIPE_FRAGMENTATION: 35 + return BCH_LRU_stripes; 36 + default: 37 + return BCH_LRU_read; 38 + } 34 39 } 35 40 36 41 int bch2_lru_validate(struct bch_fs *, struct bkey_s_c, struct bkey_validate_context); ··· 51 46 52 47 int bch2_lru_del(struct btree_trans *, u16, u64, u64); 53 48 int bch2_lru_set(struct btree_trans *, u16, u64, u64); 54 - int bch2_lru_change(struct btree_trans *, u16, u64, u64, u64); 49 + int __bch2_lru_change(struct btree_trans *, u16, u64, u64, u64); 50 + 51 + static inline int bch2_lru_change(struct btree_trans *trans, 52 + u16 lru_id, u64 dev_bucket, 53 + u64 old_time, u64 new_time) 54 + { 55 + return old_time != new_time 56 + ? __bch2_lru_change(trans, lru_id, dev_bucket, old_time, new_time) 57 + : 0; 58 + } 55 59 56 60 struct bkey_buf; 57 - int bch2_lru_check_set(struct btree_trans *, u16, u64, struct bkey_s_c, struct bkey_buf *); 61 + int bch2_lru_check_set(struct btree_trans *, u16, u64, u64, struct bkey_s_c, struct bkey_buf *); 58 62 59 63 int bch2_check_lrus(struct bch_fs *); 60 64

+4 -2

fs/bcachefs/lru_format.h

··· 9 9 10 10 #define BCH_LRU_TYPES() \ 11 11 x(read) \ 12 - x(fragmentation) 12 + x(fragmentation) \ 13 + x(stripes) 13 14 14 15 enum bch_lru_type { 15 16 #define x(n) BCH_LRU_##n, ··· 18 17 #undef x 19 18 }; 20 19 21 - #define BCH_LRU_FRAGMENTATION_START ((1U << 16) - 1) 20 + #define BCH_LRU_BUCKET_FRAGMENTATION ((1U << 16) - 1) 21 + #define BCH_LRU_STRIPE_FRAGMENTATION ((1U << 16) - 2) 22 22 23 23 #define LRU_TIME_BITS 48 24 24 #define LRU_TIME_MAX ((1ULL << LRU_TIME_BITS) - 1)
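Note on the new LRU id: together with the `lru_type()` change in `lru.h` above, the top 16 bits of the key's inode field select the LRU, with the two highest ids reserved for bucket and stripe fragmentation and everything else treated as a read LRU. The sketch below reproduces that classification in standalone C. The packing helper `lru_inode` is an assumption added for illustration (the diff only shows the `>> 48` extraction), and the names are shortened.

```c
#include <assert.h>
#include <stdint.h>

#define LRU_TIME_BITS			48
#define LRU_TIME_MAX			((1ULL << LRU_TIME_BITS) - 1)
#define LRU_BUCKET_FRAGMENTATION	((1U << 16) - 1)
#define LRU_STRIPE_FRAGMENTATION	((1U << 16) - 2)

enum lru_type { LRU_read, LRU_fragmentation, LRU_stripes };

/* Assumed packing: lru_id in the top 16 bits, time in the low 48 bits. */
static uint64_t lru_inode(uint16_t lru_id, uint64_t time)
{
	assert(time <= LRU_TIME_MAX);
	return ((uint64_t) lru_id << LRU_TIME_BITS) | time;
}

/* Mirrors the switch in the lru.h hunk: reserved ids first, read LRU by default. */
static enum lru_type lru_type_of(uint64_t inode)
{
	uint16_t lru_id = inode >> LRU_TIME_BITS;

	switch (lru_id) {
	case LRU_BUCKET_FRAGMENTATION:
		return LRU_fragmentation;
	case LRU_STRIPE_FRAGMENTATION:
		return LRU_stripes;
	default:
		return LRU_read;
	}
}
```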

+20 -6

fs/bcachefs/migrate.c

··· 15 15 #include "keylist.h" 16 16 #include "migrate.h" 17 17 #include "move.h" 18 + #include "progress.h" 18 19 #include "replicas.h" 19 20 #include "super-io.h" 20 21 ··· 77 76 return 0; 78 77 } 79 78 80 - static int bch2_dev_usrdata_drop(struct bch_fs *c, unsigned dev_idx, int flags) 79 + static int bch2_dev_usrdata_drop(struct bch_fs *c, 80 + struct progress_indicator_state *progress, 81 + unsigned dev_idx, int flags) 81 82 { 82 83 struct btree_trans *trans = bch2_trans_get(c); 83 84 enum btree_id id; ··· 91 88 92 89 ret = for_each_btree_key_commit(trans, iter, id, POS_MIN, 93 90 BTREE_ITER_prefetch|BTREE_ITER_all_snapshots, k, 94 - NULL, NULL, BCH_TRANS_COMMIT_no_enospc, 95 - bch2_dev_usrdata_drop_key(trans, &iter, k, dev_idx, flags)); 91 + NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({ 92 + bch2_progress_update_iter(trans, progress, &iter, "dropping user data"); 93 + bch2_dev_usrdata_drop_key(trans, &iter, k, dev_idx, flags); 94 + })); 96 95 if (ret) 97 96 break; 98 97 } ··· 104 99 return ret; 105 100 } 106 101 107 - static int bch2_dev_metadata_drop(struct bch_fs *c, unsigned dev_idx, int flags) 102 + static int bch2_dev_metadata_drop(struct bch_fs *c, 103 + struct progress_indicator_state *progress, 104 + unsigned dev_idx, int flags) 108 105 { 109 106 struct btree_trans *trans; 110 107 struct btree_iter iter; ··· 132 125 while (bch2_trans_begin(trans), 133 126 (b = bch2_btree_iter_peek_node(&iter)) && 134 127 !(ret = PTR_ERR_OR_ZERO(b))) { 128 + bch2_progress_update_iter(trans, progress, &iter, "dropping metadata"); 129 + 135 130 if (!bch2_bkey_has_device_c(bkey_i_to_s_c(&b->key), dev_idx)) 136 131 goto next; 137 132 ··· 178 169 179 170 int bch2_dev_data_drop(struct bch_fs *c, unsigned dev_idx, int flags) 180 171 { 181 - return bch2_dev_usrdata_drop(c, dev_idx, flags) ?: 182 - bch2_dev_metadata_drop(c, dev_idx, flags); 172 + struct progress_indicator_state progress; 173 + bch2_progress_init(&progress, c, 174 + BIT_ULL(BTREE_ID_extents)| 175 + 
BIT_ULL(BTREE_ID_reflink)); 176 + 177 + return bch2_dev_usrdata_drop(c, &progress, dev_idx, flags) ?: 178 + bch2_dev_metadata_drop(c, &progress, dev_idx, flags); 183 179 }

+283 -193

fs/bcachefs/move.c

··· 38 38 NULL 39 39 }; 40 40 41 - static void trace_move_extent2(struct bch_fs *c, struct bkey_s_c k, 41 + static void trace_io_move2(struct bch_fs *c, struct bkey_s_c k, 42 42 struct bch_io_opts *io_opts, 43 43 struct data_update_opts *data_opts) 44 44 { 45 - if (trace_move_extent_enabled()) { 45 + if (trace_io_move_enabled()) { 46 46 struct printbuf buf = PRINTBUF; 47 47 48 48 bch2_bkey_val_to_text(&buf, c, k); 49 49 prt_newline(&buf); 50 50 bch2_data_update_opts_to_text(&buf, c, io_opts, data_opts); 51 - trace_move_extent(c, buf.buf); 51 + trace_io_move(c, buf.buf); 52 52 printbuf_exit(&buf); 53 53 } 54 54 } 55 55 56 - static void trace_move_extent_read2(struct bch_fs *c, struct bkey_s_c k) 56 + static void trace_io_move_read2(struct bch_fs *c, struct bkey_s_c k) 57 57 { 58 - if (trace_move_extent_read_enabled()) { 58 + if (trace_io_move_read_enabled()) { 59 59 struct printbuf buf = PRINTBUF; 60 60 61 61 bch2_bkey_val_to_text(&buf, c, k); 62 - trace_move_extent_read(c, buf.buf); 62 + trace_io_move_read(c, buf.buf); 63 63 printbuf_exit(&buf); 64 64 } 65 65 } ··· 74 74 unsigned read_sectors; 75 75 unsigned write_sectors; 76 76 77 - struct bch_read_bio rbio; 78 - 79 77 struct data_update write; 80 - /* Must be last since it is variable size */ 81 - struct bio_vec bi_inline_vecs[]; 82 78 }; 83 79 84 80 static void move_free(struct moving_io *io) ··· 84 88 if (io->b) 85 89 atomic_dec(&io->b->count); 86 90 87 - bch2_data_update_exit(&io->write); 88 - 89 91 mutex_lock(&ctxt->lock); 90 92 list_del(&io->io_list); 91 93 wake_up(&ctxt->wait); 92 94 mutex_unlock(&ctxt->lock); 93 95 96 + if (!io->write.data_opts.scrub) { 97 + bch2_data_update_exit(&io->write); 98 + } else { 99 + bch2_bio_free_pages_pool(io->write.op.c, &io->write.op.wbio.bio); 100 + kfree(io->write.bvecs); 101 + } 94 102 kfree(io); 95 103 } 96 104 97 105 static void move_write_done(struct bch_write_op *op) 98 106 { 99 107 struct moving_io *io = container_of(op, struct moving_io, write.op); 108 + struct 
bch_fs *c = op->c; 100 109 struct moving_context *ctxt = io->write.ctxt; 101 110 102 - if (io->write.op.error) 103 - ctxt->write_error = true; 111 + if (op->error) { 112 + if (trace_io_move_write_fail_enabled()) { 113 + struct printbuf buf = PRINTBUF; 104 114 105 - atomic_sub(io->write_sectors, &io->write.ctxt->write_sectors); 106 - atomic_dec(&io->write.ctxt->write_ios); 115 + bch2_write_op_to_text(&buf, op); 116 + prt_printf(&buf, "ret\t%s\n", bch2_err_str(op->error)); 117 + trace_io_move_write_fail(c, buf.buf); 118 + printbuf_exit(&buf); 119 + } 120 + this_cpu_inc(c->counters[BCH_COUNTER_io_move_write_fail]); 121 + 122 + ctxt->write_error = true; 123 + } 124 + 125 + atomic_sub(io->write_sectors, &ctxt->write_sectors); 126 + atomic_dec(&ctxt->write_ios); 107 127 move_free(io); 108 128 closure_put(&ctxt->cl); 109 129 } 110 130 111 131 static void move_write(struct moving_io *io) 112 132 { 113 - if (unlikely(io->rbio.bio.bi_status || io->rbio.hole)) { 133 + struct moving_context *ctxt = io->write.ctxt; 134 + 135 + if (ctxt->stats) { 136 + if (io->write.rbio.bio.bi_status) 137 + atomic64_add(io->write.rbio.bvec_iter.bi_size >> 9, 138 + &ctxt->stats->sectors_error_uncorrected); 139 + else if (io->write.rbio.saw_error) 140 + atomic64_add(io->write.rbio.bvec_iter.bi_size >> 9, 141 + &ctxt->stats->sectors_error_corrected); 142 + } 143 + 144 + if (unlikely(io->write.rbio.ret || 145 + io->write.rbio.bio.bi_status || 146 + io->write.data_opts.scrub)) { 114 147 move_free(io); 115 148 return; 116 149 } 117 150 118 - if (trace_move_extent_write_enabled()) { 151 + if (trace_io_move_write_enabled()) { 119 152 struct bch_fs *c = io->write.op.c; 120 153 struct printbuf buf = PRINTBUF; 121 154 122 155 bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(io->write.k.k)); 123 - trace_move_extent_write(c, buf.buf); 156 + trace_io_move_write(c, buf.buf); 124 157 printbuf_exit(&buf); 125 158 } 126 159 ··· 157 132 atomic_add(io->write_sectors, &io->write.ctxt->write_sectors); 158 133 
atomic_inc(&io->write.ctxt->write_ios); 159 134 160 - bch2_data_update_read_done(&io->write, io->rbio.pick.crc); 135 + bch2_data_update_read_done(&io->write); 161 136 } 162 137 163 138 struct moving_io *bch2_moving_ctxt_next_pending_write(struct moving_context *ctxt) ··· 170 145 171 146 static void move_read_endio(struct bio *bio) 172 147 { 173 - struct moving_io *io = container_of(bio, struct moving_io, rbio.bio); 148 + struct moving_io *io = container_of(bio, struct moving_io, write.rbio.bio); 174 149 struct moving_context *ctxt = io->write.ctxt; 175 150 176 151 atomic_sub(io->read_sectors, &ctxt->read_sectors); ··· 283 258 { 284 259 struct btree_trans *trans = ctxt->trans; 285 260 struct bch_fs *c = trans->c; 286 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 287 - struct moving_io *io; 288 - const union bch_extent_entry *entry; 289 - struct extent_ptr_decoded p; 290 - unsigned sectors = k.k->size, pages; 291 261 int ret = -ENOMEM; 292 262 293 - trace_move_extent2(c, k, &io_opts, &data_opts); 263 + trace_io_move2(c, k, &io_opts, &data_opts); 264 + this_cpu_add(c->counters[BCH_COUNTER_io_move], k.k->size); 294 265 295 266 if (ctxt->stats) 296 267 ctxt->stats->pos = BBPOS(iter->btree_id, iter->pos); ··· 294 273 bch2_data_update_opts_normalize(k, &data_opts); 295 274 296 275 if (!data_opts.rewrite_ptrs && 297 - !data_opts.extra_replicas) { 276 + !data_opts.extra_replicas && 277 + !data_opts.scrub) { 298 278 if (data_opts.kill_ptrs) 299 279 return bch2_extent_drop_ptrs(trans, iter, k, &io_opts, &data_opts); 300 280 return 0; ··· 307 285 */ 308 286 bch2_trans_unlock(trans); 309 287 310 - /* write path might have to decompress data: */ 311 - bkey_for_each_ptr_decode(k.k, ptrs, p, entry) 312 - sectors = max_t(unsigned, sectors, p.crc.uncompressed_size); 313 - 314 - pages = DIV_ROUND_UP(sectors, PAGE_SECTORS); 315 - io = kzalloc(sizeof(struct moving_io) + 316 - sizeof(struct bio_vec) * pages, GFP_KERNEL); 288 + struct moving_io *io = kzalloc(sizeof(struct moving_io), 
GFP_KERNEL); 317 289 if (!io) 318 290 goto err; 319 291 ··· 316 300 io->read_sectors = k.k->size; 317 301 io->write_sectors = k.k->size; 318 302 319 - bio_init(&io->write.op.wbio.bio, NULL, io->bi_inline_vecs, pages, 0); 320 - io->write.op.wbio.bio.bi_ioprio = 321 - IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0); 303 + if (!data_opts.scrub) { 304 + ret = bch2_data_update_init(trans, iter, ctxt, &io->write, ctxt->wp, 305 + &io_opts, data_opts, iter->btree_id, k); 306 + if (ret) 307 + goto err_free; 322 308 323 - if (bch2_bio_alloc_pages(&io->write.op.wbio.bio, sectors << 9, 324 - GFP_KERNEL)) 325 - goto err_free; 309 + io->write.op.end_io = move_write_done; 310 + } else { 311 + bch2_bkey_buf_init(&io->write.k); 312 + bch2_bkey_buf_reassemble(&io->write.k, c, k); 326 313 327 - io->rbio.c = c; 328 - io->rbio.opts = io_opts; 329 - bio_init(&io->rbio.bio, NULL, io->bi_inline_vecs, pages, 0); 330 - io->rbio.bio.bi_vcnt = pages; 331 - io->rbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0); 332 - io->rbio.bio.bi_iter.bi_size = sectors << 9; 314 + io->write.op.c = c; 315 + io->write.data_opts = data_opts; 333 316 334 - io->rbio.bio.bi_opf = REQ_OP_READ; 335 - io->rbio.bio.bi_iter.bi_sector = bkey_start_offset(k.k); 336 - io->rbio.bio.bi_end_io = move_read_endio; 317 + ret = bch2_data_update_bios_init(&io->write, c, &io_opts); 318 + if (ret) 319 + goto err_free; 320 + } 337 321 338 - ret = bch2_data_update_init(trans, iter, ctxt, &io->write, ctxt->wp, 339 - io_opts, data_opts, iter->btree_id, k); 340 - if (ret) 341 - goto err_free_pages; 342 - 343 - io->write.op.end_io = move_write_done; 322 + io->write.rbio.bio.bi_end_io = move_read_endio; 323 + io->write.rbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0); 344 324 345 325 if (ctxt->rate) 346 326 bch2_ratelimit_increment(ctxt->rate, k.k->size); ··· 351 339 atomic_inc(&io->b->count); 352 340 } 353 341 354 - this_cpu_add(c->counters[BCH_COUNTER_io_move], k.k->size); 355 - 
this_cpu_add(c->counters[BCH_COUNTER_move_extent_read], k.k->size); 356 - trace_move_extent_read2(c, k); 342 + trace_io_move_read2(c, k); 357 343 358 344 mutex_lock(&ctxt->lock); 359 345 atomic_add(io->read_sectors, &ctxt->read_sectors); ··· 366 356 * ctxt when doing wakeup 367 357 */ 368 358 closure_get(&ctxt->cl); 369 - bch2_read_extent(trans, &io->rbio, 370 - bkey_start_pos(k.k), 371 - iter->btree_id, k, 0, 372 - BCH_READ_NODECODE| 373 - BCH_READ_LAST_FRAGMENT); 359 + __bch2_read_extent(trans, &io->write.rbio, 360 + io->write.rbio.bio.bi_iter, 361 + bkey_start_pos(k.k), 362 + iter->btree_id, k, 0, 363 + NULL, 364 + BCH_READ_last_fragment, 365 + data_opts.scrub ? data_opts.read_dev : -1); 374 366 return 0; 375 - err_free_pages: 376 - bio_free_pages(&io->write.op.wbio.bio); 377 367 err_free: 378 368 kfree(io); 379 369 err: 380 - if (ret == -BCH_ERR_data_update_done) 370 + if (bch2_err_matches(ret, BCH_ERR_data_update_done)) 381 371 return 0; 382 372 383 373 if (bch2_err_matches(ret, EROFS) || 384 374 bch2_err_matches(ret, BCH_ERR_transaction_restart)) 385 375 return ret; 386 376 387 - count_event(c, move_extent_start_fail); 377 + count_event(c, io_move_start_fail); 388 378 389 - if (trace_move_extent_start_fail_enabled()) { 379 + if (trace_io_move_start_fail_enabled()) { 390 380 struct printbuf buf = PRINTBUF; 391 381 392 382 bch2_bkey_val_to_text(&buf, c, k); 393 383 prt_str(&buf, ": "); 394 384 prt_str(&buf, bch2_err_str(ret)); 395 - trace_move_extent_start_fail(c, buf.buf); 385 + trace_io_move_start_fail(c, buf.buf); 396 386 printbuf_exit(&buf); 397 387 } 398 388 return ret; ··· 561 551 bch2_trans_begin(trans); 562 552 bch2_trans_iter_init(trans, &iter, btree_id, start, 563 553 BTREE_ITER_prefetch| 554 + BTREE_ITER_not_extents| 564 555 BTREE_ITER_all_snapshots); 565 556 566 557 if (ctxt->rate) ··· 592 581 k.k->type == KEY_TYPE_reflink_p && 593 582 REFLINK_P_MAY_UPDATE_OPTIONS(bkey_s_c_to_reflink_p(k).v)) { 594 583 struct bkey_s_c_reflink_p p = 
bkey_s_c_to_reflink_p(k); 595 - s64 offset_into_extent = iter.pos.offset - bkey_start_offset(k.k); 584 + s64 offset_into_extent = 0; 596 585 597 586 bch2_trans_iter_exit(trans, &reflink_iter); 598 587 k = bch2_lookup_indirect_extent(trans, &reflink_iter, &offset_into_extent, p, true, 0); ··· 611 600 * pointer - need to fixup iter->k 612 601 */ 613 602 extent_iter = &reflink_iter; 603 + offset_into_extent = 0; 614 604 } 615 605 616 606 if (!bkey_extent_is_direct_data(k.k)) ··· 639 627 if (bch2_err_matches(ret2, BCH_ERR_transaction_restart)) 640 628 continue; 641 629 642 - if (ret2 == -ENOMEM) { 630 + if (bch2_err_matches(ret2, ENOMEM)) { 643 631 /* memory allocation failure, wait for some IO to finish */ 644 632 bch2_move_ctxt_wait_for_io(ctxt); 645 633 continue; ··· 701 689 bool wait_on_copygc, 702 690 move_pred_fn pred, void *arg) 703 691 { 704 - 705 692 struct moving_context ctxt; 706 - int ret; 707 693 708 694 bch2_moving_ctxt_init(&ctxt, c, rate, stats, wp, wait_on_copygc); 709 - ret = __bch2_move_data(&ctxt, start, end, pred, arg); 695 + int ret = __bch2_move_data(&ctxt, start, end, pred, arg); 710 696 bch2_moving_ctxt_exit(&ctxt); 711 697 712 698 return ret; 713 699 } 714 700 715 - int bch2_evacuate_bucket(struct moving_context *ctxt, 716 - struct move_bucket_in_flight *bucket_in_flight, 717 - struct bpos bucket, int gen, 718 - struct data_update_opts _data_opts) 701 + static int __bch2_move_data_phys(struct moving_context *ctxt, 702 + struct move_bucket_in_flight *bucket_in_flight, 703 + unsigned dev, 704 + u64 bucket_start, 705 + u64 bucket_end, 706 + unsigned data_types, 707 + move_pred_fn pred, void *arg) 719 708 { 720 709 struct btree_trans *trans = ctxt->trans; 721 710 struct bch_fs *c = trans->c; ··· 725 712 struct btree_iter iter = {}, bp_iter = {}; 726 713 struct bkey_buf sk; 727 714 struct bkey_s_c k; 728 - struct data_update_opts data_opts; 729 - unsigned sectors_moved = 0; 730 715 struct bkey_buf last_flushed; 731 716 int ret = 0; 732 717 733 - 
struct bch_dev *ca = bch2_dev_tryget(c, bucket.inode); 718 + struct bch_dev *ca = bch2_dev_tryget(c, dev); 734 719 if (!ca) 735 720 return 0; 736 721 737 - trace_bucket_evacuate(c, &bucket); 722 + bucket_end = min(bucket_end, ca->mi.nbuckets); 723 + 724 + struct bpos bp_start = bucket_pos_to_bp_start(ca, POS(dev, bucket_start)); 725 + struct bpos bp_end = bucket_pos_to_bp_end(ca, POS(dev, bucket_end)); 726 + bch2_dev_put(ca); 727 + ca = NULL; 738 728 739 729 bch2_bkey_buf_init(&last_flushed); 740 730 bkey_init(&last_flushed.k->k); ··· 748 732 */ 749 733 bch2_trans_begin(trans); 750 734 751 - bch2_trans_iter_init(trans, &bp_iter, BTREE_ID_backpointers, 752 - bucket_pos_to_bp_start(ca, bucket), 0); 735 + bch2_trans_iter_init(trans, &bp_iter, BTREE_ID_backpointers, bp_start, 0); 753 736 754 737 bch_err_msg(c, ret, "looking up alloc key"); 755 738 if (ret) ··· 772 757 if (ret) 773 758 goto err; 774 759 775 - if (!k.k || bkey_gt(k.k->p, bucket_pos_to_bp_end(ca, bucket))) 760 + if (!k.k || bkey_gt(k.k->p, bp_end)) 776 761 break; 777 762 778 763 if (k.k->type != KEY_TYPE_backpointer) ··· 780 765 781 766 struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k); 782 767 768 + if (ctxt->stats) 769 + ctxt->stats->offset = bp.k->p.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT; 770 + 771 + if (!(data_types & BIT(bp.v->data_type))) 772 + goto next; 773 + 774 + if (!bp.v->level && bp.v->btree_id == BTREE_ID_stripes) 775 + goto next; 776 + 777 + k = bch2_backpointer_get_key(trans, bp, &iter, 0, &last_flushed); 778 + ret = bkey_err(k); 779 + if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 780 + continue; 781 + if (ret) 782 + goto err; 783 + if (!k.k) 784 + goto next; 785 + 783 786 if (!bp.v->level) { 784 - k = bch2_backpointer_get_key(trans, bp, &iter, 0, &last_flushed); 785 - ret = bkey_err(k); 786 - if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 787 - continue; 788 - if (ret) 789 - goto err; 790 - if (!k.k) 791 - goto next; 792 - 793 - 
bch2_bkey_buf_reassemble(&sk, c, k); 794 - k = bkey_i_to_s_c(sk.k); 795 - 796 787 ret = bch2_move_get_io_opts_one(trans, &io_opts, &iter, k); 797 788 if (ret) { 798 789 bch2_trans_iter_exit(trans, &iter); 799 790 continue; 800 791 } 801 - 802 - data_opts = _data_opts; 803 - data_opts.target = io_opts.background_target; 804 - data_opts.rewrite_ptrs = 0; 805 - 806 - unsigned sectors = bp.v->bucket_len; /* move_extent will drop locks */ 807 - unsigned i = 0; 808 - const union bch_extent_entry *entry; 809 - struct extent_ptr_decoded p; 810 - bkey_for_each_ptr_decode(k.k, bch2_bkey_ptrs_c(k), p, entry) { 811 - if (p.ptr.dev == bucket.inode) { 812 - if (p.ptr.cached) { 813 - bch2_trans_iter_exit(trans, &iter); 814 - goto next; 815 - } 816 - data_opts.rewrite_ptrs |= 1U << i; 817 - break; 818 - } 819 - i++; 820 - } 821 - 822 - ret = bch2_move_extent(ctxt, bucket_in_flight, 823 - &iter, k, io_opts, data_opts); 824 - bch2_trans_iter_exit(trans, &iter); 825 - 826 - if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 827 - continue; 828 - if (ret == -ENOMEM) { 829 - /* memory allocation failure, wait for some IO to finish */ 830 - bch2_move_ctxt_wait_for_io(ctxt); 831 - continue; 832 - } 833 - if (ret) 834 - goto err; 835 - 836 - if (ctxt->stats) 837 - atomic64_add(sectors, &ctxt->stats->sectors_seen); 838 - sectors_moved += sectors; 839 - } else { 840 - struct btree *b; 841 - 842 - b = bch2_backpointer_get_node(trans, bp, &iter, &last_flushed); 843 - ret = PTR_ERR_OR_ZERO(b); 844 - if (ret == -BCH_ERR_backpointer_to_overwritten_btree_node) 845 - goto next; 846 - if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 847 - continue; 848 - if (ret) 849 - goto err; 850 - if (!b) 851 - goto next; 852 - 853 - unsigned sectors = btree_ptr_sectors_written(bkey_i_to_s_c(&b->key)); 854 - 855 - ret = bch2_btree_node_rewrite(trans, &iter, b, 0); 856 - bch2_trans_iter_exit(trans, &iter); 857 - 858 - if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 859 - continue; 860 - if 
(ret) 861 - goto err; 862 - 863 - if (ctxt->rate) 864 - bch2_ratelimit_increment(ctxt->rate, sectors); 865 - if (ctxt->stats) { 866 - atomic64_add(sectors, &ctxt->stats->sectors_seen); 867 - atomic64_add(sectors, &ctxt->stats->sectors_moved); 868 - } 869 - sectors_moved += btree_sectors(c); 870 792 } 793 + 794 + struct data_update_opts data_opts = {}; 795 + if (!pred(c, arg, k, &io_opts, &data_opts)) { 796 + bch2_trans_iter_exit(trans, &iter); 797 + goto next; 798 + } 799 + 800 + if (data_opts.scrub && 801 + !bch2_dev_idx_is_online(c, data_opts.read_dev)) { 802 + bch2_trans_iter_exit(trans, &iter); 803 + ret = -BCH_ERR_device_offline; 804 + break; 805 + } 806 + 807 + bch2_bkey_buf_reassemble(&sk, c, k); 808 + k = bkey_i_to_s_c(sk.k); 809 + 810 + /* move_extent will drop locks */ 811 + unsigned sectors = bp.v->bucket_len; 812 + 813 + if (!bp.v->level) 814 + ret = bch2_move_extent(ctxt, bucket_in_flight, &iter, k, io_opts, data_opts); 815 + else if (!data_opts.scrub) 816 + ret = bch2_btree_node_rewrite_pos(trans, bp.v->btree_id, bp.v->level, k.k->p, 0); 817 + else 818 + ret = bch2_btree_node_scrub(trans, bp.v->btree_id, bp.v->level, k, data_opts.read_dev); 819 + 820 + bch2_trans_iter_exit(trans, &iter); 821 + 822 + if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) 823 + continue; 824 + if (ret == -ENOMEM) { 825 + /* memory allocation failure, wait for some IO to finish */ 826 + bch2_move_ctxt_wait_for_io(ctxt); 827 + continue; 828 + } 829 + if (ret) 830 + goto err; 831 + 832 + if (ctxt->stats) 833 + atomic64_add(sectors, &ctxt->stats->sectors_seen); 871 834 next: 872 835 bch2_btree_iter_advance(&bp_iter); 873 836 } 874 - 875 - trace_evacuate_bucket(c, &bucket, sectors_moved, ca->mi.bucket_size, ret); 876 837 err: 877 838 bch2_trans_iter_exit(trans, &bp_iter); 878 - bch2_dev_put(ca); 879 839 bch2_bkey_buf_exit(&sk, c); 880 840 bch2_bkey_buf_exit(&last_flushed, c); 881 841 return ret; 842 + } 843 + 844 + static int bch2_move_data_phys(struct bch_fs *c, 845 + 
unsigned dev, 846 + u64 start, 847 + u64 end, 848 + unsigned data_types, 849 + struct bch_ratelimit *rate, 850 + struct bch_move_stats *stats, 851 + struct write_point_specifier wp, 852 + bool wait_on_copygc, 853 + move_pred_fn pred, void *arg) 854 + { 855 + struct moving_context ctxt; 856 + 857 + bch2_trans_run(c, bch2_btree_write_buffer_flush_sync(trans)); 858 + 859 + bch2_moving_ctxt_init(&ctxt, c, rate, stats, wp, wait_on_copygc); 860 + ctxt.stats->phys = true; 861 + ctxt.stats->data_type = (int) DATA_PROGRESS_DATA_TYPE_phys; 862 + 863 + int ret = __bch2_move_data_phys(&ctxt, NULL, dev, start, end, data_types, pred, arg); 864 + bch2_moving_ctxt_exit(&ctxt); 865 + 866 + return ret; 867 + } 868 + 869 + struct evacuate_bucket_arg { 870 + struct bpos bucket; 871 + int gen; 872 + struct data_update_opts data_opts; 873 + }; 874 + 875 + static bool evacuate_bucket_pred(struct bch_fs *c, void *_arg, struct bkey_s_c k, 876 + struct bch_io_opts *io_opts, 877 + struct data_update_opts *data_opts) 878 + { 879 + struct evacuate_bucket_arg *arg = _arg; 880 + 881 + *data_opts = arg->data_opts; 882 + 883 + unsigned i = 0; 884 + bkey_for_each_ptr(bch2_bkey_ptrs_c(k), ptr) { 885 + if (ptr->dev == arg->bucket.inode && 886 + (arg->gen < 0 || arg->gen == ptr->gen) && 887 + !ptr->cached) 888 + data_opts->rewrite_ptrs |= BIT(i); 889 + i++; 890 + } 891 + 892 + return data_opts->rewrite_ptrs != 0; 893 + } 894 + 895 + int bch2_evacuate_bucket(struct moving_context *ctxt, 896 + struct move_bucket_in_flight *bucket_in_flight, 897 + struct bpos bucket, int gen, 898 + struct data_update_opts data_opts) 899 + { 900 + struct evacuate_bucket_arg arg = { bucket, gen, data_opts, }; 901 + 902 + return __bch2_move_data_phys(ctxt, bucket_in_flight, 903 + bucket.inode, 904 + bucket.offset, 905 + bucket.offset + 1, 906 + ~0, 907 + evacuate_bucket_pred, &arg); 882 908 } 883 909 884 910 typedef bool (*move_btree_pred)(struct bch_fs *, void *, ··· 1063 1007 return rereplicate_pred(c, arg, 
bkey_i_to_s_c(&b->key), io_opts, data_opts); 1064 1008 } 1065 1009 1066 - static bool migrate_btree_pred(struct bch_fs *c, void *arg, 1067 - struct btree *b, 1068 - struct bch_io_opts *io_opts, 1069 - struct data_update_opts *data_opts) 1070 - { 1071 - return migrate_pred(c, arg, bkey_i_to_s_c(&b->key), io_opts, data_opts); 1072 - } 1073 - 1074 1010 /* 1075 1011 * Ancient versions of bcachefs produced packed formats which could represent 1076 1012 * keys that the in memory format cannot represent; this checks for those ··· 1152 1104 return drop_extra_replicas_pred(c, arg, bkey_i_to_s_c(&b->key), io_opts, data_opts); 1153 1105 } 1154 1106 1107 + static bool scrub_pred(struct bch_fs *c, void *_arg, 1108 + struct bkey_s_c k, 1109 + struct bch_io_opts *io_opts, 1110 + struct data_update_opts *data_opts) 1111 + { 1112 + struct bch_ioctl_data *arg = _arg; 1113 + 1114 + if (k.k->type != KEY_TYPE_btree_ptr_v2) { 1115 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 1116 + const union bch_extent_entry *entry; 1117 + struct extent_ptr_decoded p; 1118 + bkey_for_each_ptr_decode(k.k, ptrs, p, entry) 1119 + if (p.ptr.dev == arg->migrate.dev) { 1120 + if (!p.crc.csum_type) 1121 + return false; 1122 + break; 1123 + } 1124 + } 1125 + 1126 + data_opts->scrub = true; 1127 + data_opts->read_dev = arg->migrate.dev; 1128 + return true; 1129 + } 1130 + 1155 1131 int bch2_data_job(struct bch_fs *c, 1156 1132 struct bch_move_stats *stats, 1157 1133 struct bch_ioctl_data op) ··· 1190 1118 bch2_move_stats_init(stats, bch2_data_ops_strs[op.op]); 1191 1119 1192 1120 switch (op.op) { 1121 + case BCH_DATA_OP_scrub: 1122 + /* 1123 + * prevent tests from spuriously failing, make sure we see all 1124 + * btree nodes that need to be repaired 1125 + */ 1126 + bch2_btree_interior_updates_flush(c); 1127 + 1128 + ret = bch2_move_data_phys(c, op.scrub.dev, 0, U64_MAX, 1129 + op.scrub.data_types, 1130 + NULL, 1131 + stats, 1132 + writepoint_hashed((unsigned long) current), 1133 + false, 1134 + 
scrub_pred, &op) ?: ret; 1135 + break; 1136 + 1193 1137 case BCH_DATA_OP_rereplicate: 1194 1138 stats->data_type = BCH_DATA_journal; 1195 1139 ret = bch2_journal_flush_device_pins(&c->journal, -1); ··· 1225 1137 1226 1138 stats->data_type = BCH_DATA_journal; 1227 1139 ret = bch2_journal_flush_device_pins(&c->journal, op.migrate.dev); 1228 - ret = bch2_move_btree(c, start, end, 1229 - migrate_btree_pred, &op, stats) ?: ret; 1230 - ret = bch2_move_data(c, start, end, 1231 - NULL, 1232 - stats, 1233 - writepoint_hashed((unsigned long) current), 1234 - true, 1235 - migrate_pred, &op) ?: ret; 1140 + ret = bch2_move_data_phys(c, op.migrate.dev, 0, U64_MAX, 1141 + ~0, 1142 + NULL, 1143 + stats, 1144 + writepoint_hashed((unsigned long) current), 1145 + true, 1146 + migrate_pred, &op) ?: ret; 1147 + bch2_btree_interior_updates_flush(c); 1236 1148 ret = bch2_replicas_gc2(c) ?: ret; 1237 1149 break; 1238 1150 case BCH_DATA_OP_rewrite_old_nodes: ··· 1264 1176 prt_newline(out); 1265 1177 printbuf_indent_add(out, 2); 1266 1178 1267 - prt_printf(out, "keys moved: %llu\n", atomic64_read(&stats->keys_moved)); 1268 - prt_printf(out, "keys raced: %llu\n", atomic64_read(&stats->keys_raced)); 1269 - prt_printf(out, "bytes seen: "); 1179 + prt_printf(out, "keys moved:\t%llu\n", atomic64_read(&stats->keys_moved)); 1180 + prt_printf(out, "keys raced:\t%llu\n", atomic64_read(&stats->keys_raced)); 1181 + prt_printf(out, "bytes seen:\t"); 1270 1182 prt_human_readable_u64(out, atomic64_read(&stats->sectors_seen) << 9); 1271 1183 prt_newline(out); 1272 1184 1273 - prt_printf(out, "bytes moved: "); 1185 + prt_printf(out, "bytes moved:\t"); 1274 1186 prt_human_readable_u64(out, atomic64_read(&stats->sectors_moved) << 9); 1275 1187 prt_newline(out); 1276 1188 1277 - prt_printf(out, "bytes raced: "); 1189 + prt_printf(out, "bytes raced:\t"); 1278 1190 prt_human_readable_u64(out, atomic64_read(&stats->sectors_raced) << 9); 1279 1191 prt_newline(out); 1280 1192 ··· 1283 1195 1284 1196 static void 
bch2_moving_ctxt_to_text(struct printbuf *out, struct bch_fs *c, struct moving_context *ctxt) 1285 1197 { 1286 - struct moving_io *io; 1198 + if (!out->nr_tabstops) 1199 + printbuf_tabstop_push(out, 32); 1287 1200 1288 1201 bch2_move_stats_to_text(out, ctxt->stats); 1289 1202 printbuf_indent_add(out, 2); ··· 1304 1215 printbuf_indent_add(out, 2); 1305 1216 1306 1217 mutex_lock(&ctxt->lock); 1218 + struct moving_io *io; 1307 1219 list_for_each_entry(io, &ctxt->ios, io_list) 1308 - bch2_write_op_to_text(out, &io->write.op); 1220 + bch2_data_update_inflight_to_text(out, &io->write); 1309 1221 mutex_unlock(&ctxt->lock); 1310 1222 1311 1223 printbuf_indent_sub(out, 4);
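
Note on evacuate_bucket_pred() above: the predicate walks every pointer in the key and sets one bit in rewrite_ptrs for each non-cached pointer that lives on the target device and, when a generation is supplied, matches that generation. A negative gen acts as a wildcard. The following is a minimal userspace sketch of that selection logic only; `struct ptr`, `struct evacuate_arg` and `rewrite_mask()` are hypothetical stand-ins, not the bcachefs types or API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for an extent pointer. */
struct ptr {
	unsigned dev;
	unsigned gen;
	bool     cached;
};

/* Hypothetical stand-in for the predicate argument: gen < 0 means "any generation". */
struct evacuate_arg {
	unsigned dev;
	int      gen;
};

/* Return a bitmask with bit i set for every pointer i that should be rewritten. */
static unsigned rewrite_mask(const struct ptr *ptrs, unsigned nr,
			     const struct evacuate_arg *arg)
{
	unsigned mask = 0;

	for (unsigned i = 0; i < nr; i++)
		if (ptrs[i].dev == arg->dev &&
		    (arg->gen < 0 || (unsigned) arg->gen == ptrs[i].gen) &&
		    !ptrs[i].cached)
			mask |= 1U << i;

	return mask;
}
```

As in the diff, a zero mask means the key has nothing to move, and the caller can skip it.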

+17 -3

fs/bcachefs/move_types.h

··· 3 3 #define _BCACHEFS_MOVE_TYPES_H 4 4 5 5 #include "bbpos_types.h" 6 + #include "bcachefs_ioctl.h" 6 7 7 8 struct bch_move_stats { 8 - enum bch_data_type data_type; 9 - struct bbpos pos; 10 9 char name[32]; 10 + bool phys; 11 + enum bch_ioctl_data_event_ret ret; 12 + 13 + union { 14 + struct { 15 + enum bch_data_type data_type; 16 + struct bbpos pos; 17 + }; 18 + struct { 19 + unsigned dev; 20 + u64 offset; 21 + }; 22 + }; 11 23 12 24 atomic64_t keys_moved; 13 25 atomic64_t keys_raced; 14 26 atomic64_t sectors_seen; 15 27 atomic64_t sectors_moved; 16 28 atomic64_t sectors_raced; 29 + atomic64_t sectors_error_corrected; 30 + atomic64_t sectors_error_uncorrected; 17 31 }; 18 32 19 33 struct move_bucket_key { 20 34 struct bpos bucket; 21 - u8 gen; 35 + unsigned gen; 22 36 }; 23 37 24 38 struct move_bucket {

+13 -2

fs/bcachefs/movinggc.c

··· 167 167 bch2_trans_begin(trans); 168 168 169 169 ret = for_each_btree_key_max(trans, iter, BTREE_ID_lru, 170 - lru_pos(BCH_LRU_FRAGMENTATION_START, 0, 0), 171 - lru_pos(BCH_LRU_FRAGMENTATION_START, U64_MAX, LRU_TIME_MAX), 170 + lru_pos(BCH_LRU_BUCKET_FRAGMENTATION, 0, 0), 171 + lru_pos(BCH_LRU_BUCKET_FRAGMENTATION, U64_MAX, LRU_TIME_MAX), 172 172 0, k, ({ 173 173 struct move_bucket b = { .k.bucket = u64_to_bucket(k.k->p.offset) }; 174 174 int ret2 = 0; ··· 317 317 prt_printf(out, "Currently calculated wait:\t"); 318 318 prt_human_readable_u64(out, bch2_copygc_wait_amount(c)); 319 319 prt_newline(out); 320 + 321 + rcu_read_lock(); 322 + struct task_struct *t = rcu_dereference(c->copygc_thread); 323 + if (t) 324 + get_task_struct(t); 325 + rcu_read_unlock(); 326 + 327 + if (t) { 328 + bch2_prt_task_backtrace(out, t, 0, GFP_KERNEL); 329 + put_task_struct(t); 330 + } 320 331 } 321 332 322 333 static int bch2_copygc_thread(void *arg)

+56 -59

fs/bcachefs/opts.c

··· 163 163 [DT_SUBVOL] = "subvol", 164 164 }; 165 165 166 - u64 BCH2_NO_SB_OPT(const struct bch_sb *sb) 167 - { 168 - BUG(); 169 - } 170 - 171 - void SET_BCH2_NO_SB_OPT(struct bch_sb *sb, u64 v) 172 - { 173 - BUG(); 174 - } 175 - 176 166 void bch2_opts_apply(struct bch_opts *dst, struct bch_opts src) 177 167 { 178 168 #define x(_name, ...) \ ··· 213 223 } 214 224 } 215 225 226 + /* dummy option, for options that aren't stored in the superblock */ 227 + typedef u64 (*sb_opt_get_fn)(const struct bch_sb *); 228 + typedef void (*sb_opt_set_fn)(struct bch_sb *, u64); 229 + typedef u64 (*member_opt_get_fn)(const struct bch_member *); 230 + typedef void (*member_opt_set_fn)(struct bch_member *, u64); 231 + 232 + __maybe_unused static const sb_opt_get_fn BCH2_NO_SB_OPT = NULL; 233 + __maybe_unused static const sb_opt_set_fn SET_BCH2_NO_SB_OPT = NULL; 234 + __maybe_unused static const member_opt_get_fn BCH2_NO_MEMBER_OPT = NULL; 235 + __maybe_unused static const member_opt_set_fn SET_BCH2_NO_MEMBER_OPT = NULL; 236 + 237 + #define type_compatible_or_null(_p, _type) \ 238 + __builtin_choose_expr( \ 239 + __builtin_types_compatible_p(typeof(_p), typeof(_type)), _p, NULL) 240 + 216 241 const struct bch_option bch2_opt_table[] = { 217 242 #define OPT_BOOL() .type = BCH_OPT_BOOL, .min = 0, .max = 2 218 243 #define OPT_UINT(_min, _max) .type = BCH_OPT_UINT, \ ··· 244 239 245 240 #define x(_name, _bits, _flags, _type, _sb_opt, _default, _hint, _help) \ 246 241 [Opt_##_name] = { \ 247 - .attr = { \ 248 - .name = #_name, \ 249 - .mode = (_flags) & OPT_RUNTIME ? 0644 : 0444, \ 250 - }, \ 251 - .flags = _flags, \ 252 - .hint = _hint, \ 253 - .help = _help, \ 254 - .get_sb = _sb_opt, \ 255 - .set_sb = SET_##_sb_opt, \ 242 + .attr.name = #_name, \ 243 + .attr.mode = (_flags) & OPT_RUNTIME ? 
0644 : 0444, \ 244 + .flags = _flags, \ 245 + .hint = _hint, \ 246 + .help = _help, \ 247 + .get_sb = type_compatible_or_null(_sb_opt, *BCH2_NO_SB_OPT), \ 248 + .set_sb = type_compatible_or_null(SET_##_sb_opt,*SET_BCH2_NO_SB_OPT), \ 249 + .get_member = type_compatible_or_null(_sb_opt, *BCH2_NO_MEMBER_OPT), \ 250 + .set_member = type_compatible_or_null(SET_##_sb_opt,*SET_BCH2_NO_MEMBER_OPT),\ 256 251 _type \ 257 252 }, 258 253 ··· 480 475 } 481 476 } 482 477 483 - int bch2_opt_check_may_set(struct bch_fs *c, int id, u64 v) 478 + int bch2_opt_check_may_set(struct bch_fs *c, struct bch_dev *ca, int id, u64 v) 484 479 { 480 + lockdep_assert_held(&c->state_lock); 481 + 485 482 int ret = 0; 486 483 487 484 switch (id) { 485 + case Opt_state: 486 + if (ca) 487 + return __bch2_dev_set_state(c, ca, v, BCH_FORCE_IF_DEGRADED); 488 + break; 489 + 488 490 case Opt_compression: 489 491 case Opt_background_compression: 490 492 ret = bch2_check_set_has_compressed_data(c, v); ··· 507 495 508 496 int bch2_opts_check_may_set(struct bch_fs *c) 509 497 { 510 - unsigned i; 511 - int ret; 512 - 513 - for (i = 0; i < bch2_opts_nr; i++) { 514 - ret = bch2_opt_check_may_set(c, i, 515 - bch2_opt_get_by_id(&c->opts, i)); 498 + for (unsigned i = 0; i < bch2_opts_nr; i++) { 499 + int ret = bch2_opt_check_may_set(c, NULL, i, bch2_opt_get_by_id(&c->opts, i)); 516 500 if (ret) 517 501 return ret; 518 502 } ··· 627 619 return ret; 628 620 } 629 621 630 - u64 bch2_opt_from_sb(struct bch_sb *sb, enum bch_opt_id id) 622 + u64 bch2_opt_from_sb(struct bch_sb *sb, enum bch_opt_id id, int dev_idx) 631 623 { 632 624 const struct bch_option *opt = bch2_opt_table + id; 633 625 u64 v; 634 626 635 - v = opt->get_sb(sb); 627 + if (dev_idx < 0) { 628 + v = opt->get_sb(sb); 629 + } else { 630 + if (WARN(!bch2_member_exists(sb, dev_idx), 631 + "tried to set device option %s on nonexistent device %i", 632 + opt->attr.name, dev_idx)) 633 + return 0; 634 + 635 + struct bch_member m = bch2_sb_member_get(sb, dev_idx); 
636 + v = opt->get_member(&m); 637 + } 638 + 639 + if (opt->flags & OPT_SB_FIELD_ONE_BIAS) 640 + --v; 636 641 637 642 if (opt->flags & OPT_SB_FIELD_ILOG2) 638 643 v = 1ULL << v; ··· 662 641 */ 663 642 int bch2_opts_from_sb(struct bch_opts *opts, struct bch_sb *sb) 664 643 { 665 - unsigned id; 666 - 667 - for (id = 0; id < bch2_opts_nr; id++) { 644 + for (unsigned id = 0; id < bch2_opts_nr; id++) { 668 645 const struct bch_option *opt = bch2_opt_table + id; 669 646 670 - if (opt->get_sb == BCH2_NO_SB_OPT) 671 - continue; 672 - 673 - bch2_opt_set_by_id(opts, id, bch2_opt_from_sb(sb, id)); 647 + if (opt->get_sb) 648 + bch2_opt_set_by_id(opts, id, bch2_opt_from_sb(sb, id, -1)); 674 649 } 675 650 676 651 return 0; 677 652 } 678 653 679 - struct bch_dev_sb_opt_set { 680 - void (*set_sb)(struct bch_member *, u64); 681 - }; 682 - 683 - static const struct bch_dev_sb_opt_set bch2_dev_sb_opt_setters [] = { 684 - #define x(n, set) [Opt_##n] = { .set_sb = SET_##set }, 685 - BCH_DEV_OPT_SETTERS() 686 - #undef x 687 - }; 688 - 689 654 void __bch2_opt_set_sb(struct bch_sb *sb, int dev_idx, 690 655 const struct bch_option *opt, u64 v) 691 656 { 692 - enum bch_opt_id id = opt - bch2_opt_table; 693 - 694 657 if (opt->flags & OPT_SB_FIELD_SECTORS) 695 658 v >>= 9; 696 659 ··· 684 679 if (opt->flags & OPT_SB_FIELD_ONE_BIAS) 685 680 v++; 686 681 687 - if (opt->flags & OPT_FS) { 688 - if (opt->set_sb != SET_BCH2_NO_SB_OPT) 689 - opt->set_sb(sb, v); 690 - } 682 + if ((opt->flags & OPT_FS) && opt->set_sb && dev_idx < 0) 683 + opt->set_sb(sb, v); 691 684 692 - if ((opt->flags & OPT_DEVICE) && dev_idx >= 0) { 685 + if ((opt->flags & OPT_DEVICE) && opt->set_member && dev_idx >= 0) { 693 686 if (WARN(!bch2_member_exists(sb, dev_idx), 694 687 "tried to set device option %s on nonexistent device %i", 695 688 opt->attr.name, dev_idx)) 696 689 return; 697 690 698 - struct bch_member *m = bch2_members_v2_get_mut(sb, dev_idx); 699 - 700 - const struct bch_dev_sb_opt_set *set = 
bch2_dev_sb_opt_setters + id; 701 - if (set->set_sb) 702 - set->set_sb(m, v); 703 - else 704 - pr_err("option %s cannot be set via opt_set_sb()", opt->attr.name); 691 + opt->set_member(bch2_members_v2_get_mut(sb, dev_idx), v); 705 692 } 706 693 } 707 694
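
Note on type_compatible_or_null() above: the option table now fills four accessor slots from a single macro argument, and the compiler picks which slot receives the function by comparing types at compile time; the others become NULL. The sketch below reproduces the trick in userspace under stated assumptions: it needs GCC or Clang (`__builtin_choose_expr`, `__builtin_types_compatible_p`, `__typeof__`), and `struct sb`, `struct member`, `OPT()` and the getter functions are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the superblock and a per-device member. */
struct sb     { uint64_t block_size; };
struct member { uint64_t durability; };

typedef uint64_t (*sb_get_fn)(const struct sb *);
typedef uint64_t (*member_get_fn)(const struct member *);

/* Dummy "no accessor" markers; only their types matter. */
__attribute__((unused)) static const sb_get_fn     NO_SB_OPT     = NULL;
__attribute__((unused)) static const member_get_fn NO_MEMBER_OPT = NULL;

/* Yield _p if its type matches _type, otherwise NULL; resolved at compile time. */
#define type_compatible_or_null(_p, _type)				\
	__builtin_choose_expr(						\
		__builtin_types_compatible_p(__typeof__(_p),		\
					     __typeof__(_type)),	\
		_p, NULL)

static uint64_t get_block_size(const struct sb *sb)    { return sb->block_size; }
static uint64_t get_durability(const struct member *m) { return m->durability; }

struct option {
	const char	*name;
	sb_get_fn	get_sb;
	member_get_fn	get_member;
};

/* One getter argument, routed to whichever slot has a matching type. */
#define OPT(_name, _getter) {						\
	.name		= #_name,					\
	.get_sb		= type_compatible_or_null(_getter, *NO_SB_OPT),	\
	.get_member	= type_compatible_or_null(_getter, *NO_MEMBER_OPT),\
}

static const struct option opt_table[] = {
	OPT(block_size, get_block_size),	/* superblock accessor */
	OPT(durability, get_durability),	/* per-member accessor */
	OPT(verbose,    NO_SB_OPT),		/* stored nowhere */
};
```

Passing the dummy marker itself yields NULL in both slots because its type is a pointer to function, not a function. This would explain why callers in the diff can test `opt->get_sb` directly instead of comparing against `BCH2_NO_SB_OPT`.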

+37 -32

fs/bcachefs/opts.h

··· 50 50 * apply the options from that struct that are defined. 51 51 */ 52 52 53 - /* dummy option, for options that aren't stored in the superblock */ 54 - u64 BCH2_NO_SB_OPT(const struct bch_sb *); 55 - void SET_BCH2_NO_SB_OPT(struct bch_sb *, u64); 56 - 57 53 /* When can be set: */ 58 54 enum opt_flags { 59 55 OPT_FS = BIT(0), /* Filesystem option */ ··· 128 132 OPT_FS|OPT_FORMAT| \ 129 133 OPT_HUMAN_READABLE|OPT_MUST_BE_POW_2|OPT_SB_FIELD_SECTORS, \ 130 134 OPT_UINT(512, 1U << 16), \ 131 - BCH_SB_BLOCK_SIZE, 8, \ 135 + BCH_SB_BLOCK_SIZE, 4 << 10, \ 132 136 "size", NULL) \ 133 137 x(btree_node_size, u32, \ 134 138 OPT_FS|OPT_FORMAT| \ 135 139 OPT_HUMAN_READABLE|OPT_MUST_BE_POW_2|OPT_SB_FIELD_SECTORS, \ 136 140 OPT_UINT(512, 1U << 20), \ 137 - BCH_SB_BTREE_NODE_SIZE, 512, \ 141 + BCH_SB_BTREE_NODE_SIZE, 256 << 10, \ 138 142 "size", "Btree node size, default 256k") \ 139 143 x(errors, u8, \ 140 144 OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 141 145 OPT_STR(bch2_error_actions), \ 142 146 BCH_SB_ERROR_ACTION, BCH_ON_ERROR_fix_safe, \ 143 147 NULL, "Action to take on filesystem error") \ 148 + x(write_error_timeout, u16, \ 149 + OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 150 + OPT_UINT(1, 300), \ 151 + BCH_SB_WRITE_ERROR_TIMEOUT, 30, \ 152 + NULL, "Number of consecutive write errors allowed before kicking out a device")\ 144 153 x(metadata_replicas, u8, \ 145 154 OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 146 155 OPT_UINT(1, BCH_REPLICAS_MAX), \ ··· 182 181 OPT_STR(__bch2_csum_opts), \ 183 182 BCH_SB_DATA_CSUM_TYPE, BCH_CSUM_OPT_crc32c, \ 184 183 NULL, NULL) \ 184 + x(checksum_err_retry_nr, u8, \ 185 + OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 186 + OPT_UINT(0, 32), \ 187 + BCH_SB_CSUM_ERR_RETRY_NR, 3, \ 188 + NULL, NULL) \ 185 189 x(compression, u8, \ 186 190 OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 187 191 OPT_FN(bch2_opt_compression), \ ··· 203 197 BCH_SB_STR_HASH_TYPE, BCH_STR_HASH_OPT_siphash, \ 204 198 NULL, "Hash function for directory entries 
and xattrs")\ 205 199 x(metadata_target, u16, \ 206 - OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 200 + OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \ 207 201 OPT_FN(bch2_opt_target), \ 208 202 BCH_SB_METADATA_TARGET, 0, \ 209 203 "(target)", "Device or label for metadata writes") \ ··· 314 308 OPT_BOOL(), \ 315 309 BCH2_NO_SB_OPT, false, \ 316 310 NULL, "Don't kick drives out when splitbrain detected")\ 317 - x(discard, u8, \ 318 - OPT_FS|OPT_MOUNT|OPT_DEVICE, \ 319 - OPT_BOOL(), \ 320 - BCH2_NO_SB_OPT, true, \ 321 - NULL, "Enable discard/TRIM support") \ 322 311 x(verbose, u8, \ 323 312 OPT_FS|OPT_MOUNT|OPT_RUNTIME, \ 324 313 OPT_BOOL(), \ ··· 494 493 BCH2_NO_SB_OPT, false, \ 495 494 NULL, "Skip submit_bio() for data reads and writes, " \ 496 495 "for performance testing purposes") \ 497 - x(fs_size, u64, \ 498 - OPT_DEVICE, \ 496 + x(state, u64, \ 497 + OPT_DEVICE|OPT_RUNTIME, \ 498 + OPT_STR(bch2_member_states), \ 499 + BCH_MEMBER_STATE, BCH_MEMBER_STATE_rw, \ 500 + "state", "rw,ro,failed,spare") \ 501 + x(bucket_size, u32, \ 502 + OPT_DEVICE|OPT_HUMAN_READABLE|OPT_SB_FIELD_SECTORS, \ 499 503 OPT_UINT(0, S64_MAX), \ 500 - BCH2_NO_SB_OPT, 0, \ 501 - "size", "Size of filesystem on device") \ 502 - x(bucket, u32, \ 503 - OPT_DEVICE, \ 504 - OPT_UINT(0, S64_MAX), \ 505 - BCH2_NO_SB_OPT, 0, \ 504 + BCH_MEMBER_BUCKET_SIZE, 0, \ 506 505 "size", "Specifies the bucket size; must be greater than the btree node size")\ 507 506 x(durability, u8, \ 508 - OPT_DEVICE|OPT_SB_FIELD_ONE_BIAS, \ 507 + OPT_DEVICE|OPT_RUNTIME|OPT_SB_FIELD_ONE_BIAS, \ 509 508 OPT_UINT(0, BCH_REPLICAS_MAX), \ 510 - BCH2_NO_SB_OPT, 1, \ 509 + BCH_MEMBER_DURABILITY, 1, \ 511 510 "n", "Data written to this device will be considered\n"\ 512 511 "to have already been replicated n times") \ 513 512 x(data_allowed, u8, \ 514 513 OPT_DEVICE, \ 515 514 OPT_BITFIELD(__bch2_data_types), \ 516 - BCH2_NO_SB_OPT, BIT(BCH_DATA_journal)|BIT(BCH_DATA_btree)|BIT(BCH_DATA_user),\ 515 + BCH_MEMBER_DATA_ALLOWED, 
BIT(BCH_DATA_journal)|BIT(BCH_DATA_btree)|BIT(BCH_DATA_user),\ 517 516 "types", "Allowed data types for this device: journal, btree, and/or user")\ 517 + x(discard, u8, \ 518 + OPT_MOUNT|OPT_DEVICE|OPT_RUNTIME, \ 519 + OPT_BOOL(), \ 520 + BCH_MEMBER_DISCARD, true, \ 521 + NULL, "Enable discard/TRIM support") \ 518 522 x(btree_node_prefetch, u8, \ 519 523 OPT_FS|OPT_MOUNT|OPT_RUNTIME, \ 520 524 OPT_BOOL(), \ 521 525 BCH2_NO_SB_OPT, true, \ 522 526 NULL, "BTREE_ITER_prefetch casuse btree nodes to be\n"\ 523 527 " prefetched sequentially") 524 - 525 - #define BCH_DEV_OPT_SETTERS() \ 526 - x(discard, BCH_MEMBER_DISCARD) \ 527 - x(durability, BCH_MEMBER_DURABILITY) \ 528 - x(data_allowed, BCH_MEMBER_DATA_ALLOWED) 529 528 530 529 struct bch_opts { 531 530 #define x(_name, _bits, ...) unsigned _name##_defined:1; ··· 583 582 584 583 struct bch_option { 585 584 struct attribute attr; 586 - u64 (*get_sb)(const struct bch_sb *); 587 - void (*set_sb)(struct bch_sb *, u64); 588 585 enum opt_type type; 589 586 enum opt_flags flags; 590 587 u64 min, max; ··· 594 595 const char *hint; 595 596 const char *help; 596 597 598 + u64 (*get_sb)(const struct bch_sb *); 599 + void (*set_sb)(struct bch_sb *, u64); 600 + 601 + u64 (*get_member)(const struct bch_member *); 602 + void (*set_member)(struct bch_member *, u64); 603 + 597 604 }; 598 605 599 606 extern const struct bch_option bch2_opt_table[]; ··· 608 603 u64 bch2_opt_get_by_id(const struct bch_opts *, enum bch_opt_id); 609 604 void bch2_opt_set_by_id(struct bch_opts *, enum bch_opt_id, u64); 610 605 611 - u64 bch2_opt_from_sb(struct bch_sb *, enum bch_opt_id); 606 + u64 bch2_opt_from_sb(struct bch_sb *, enum bch_opt_id, int); 612 607 int bch2_opts_from_sb(struct bch_opts *, struct bch_sb *); 613 608 void __bch2_opt_set_sb(struct bch_sb *, int, const struct bch_option *, u64); 614 609 ··· 630 625 struct bch_fs *, struct bch_sb *, 631 626 unsigned, unsigned, unsigned); 632 627 633 - int bch2_opt_check_may_set(struct bch_fs *, int, 
u64); 628 + int bch2_opt_check_may_set(struct bch_fs *, struct bch_dev *, int, u64); 634 629 int bch2_opts_check_may_set(struct bch_fs *); 635 630 int bch2_parse_one_mount_opt(struct bch_fs *, struct bch_opts *, 636 631 struct printbuf *, const char *, const char *);
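
Note on the OPT_SB_FIELD_* flags used above: options such as bucket_size (OPT_SB_FIELD_SECTORS) and durability (OPT_SB_FIELD_ONE_BIAS) are stored in the superblock in an encoded form, which is also why the block_size and btree_node_size defaults are now written in bytes. The sketch below shows the encode and decode steps as mutual inverses. The visible hunks only show the sector shift and the one-bias increment on the set side, and the one-bias decrement and the power-of-two expansion on the get side; the remaining steps, the flag names `F_*` and both helper functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag names mirroring OPT_SB_FIELD_{SECTORS,ILOG2,ONE_BIAS}. */
enum {
	F_SECTORS  = 1 << 0,	/* stored in 512-byte sectors */
	F_ILOG2    = 1 << 1,	/* stored as log2 of the value */
	F_ONE_BIAS = 1 << 2,	/* stored as value + 1, so 0 can mean "unset" */
};

static unsigned ilog2_u64(uint64_t v)
{
	unsigned r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* In-memory value -> on-disk field. */
static uint64_t opt_to_sb(unsigned flags, uint64_t v)
{
	if (flags & F_SECTORS)
		v >>= 9;
	if (flags & F_ILOG2)
		v = ilog2_u64(v);
	if (flags & F_ONE_BIAS)
		v++;
	return v;
}

/* On-disk field -> in-memory value; applies the inverse steps in reverse order. */
static uint64_t opt_from_sb(unsigned flags, uint64_t v)
{
	if (flags & F_ONE_BIAS)
		--v;
	if (flags & F_ILOG2)
		v = 1ULL << v;
	if (flags & F_SECTORS)
		v <<= 9;
	return v;
}
```

With these helpers a 4 KiB block size encodes to 8 sectors, matching the old default of 8 that the diff replaces with `4 << 10`.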

+63

fs/bcachefs/progress.c

··· 1 + // SPDX-License-Identifier: GPL-2.0 2 + #include "bcachefs.h" 3 + #include "bbpos.h" 4 + #include "disk_accounting.h" 5 + #include "progress.h" 6 + 7 + void bch2_progress_init(struct progress_indicator_state *s, 8 + struct bch_fs *c, 9 + u64 btree_id_mask) 10 + { 11 + memset(s, 0, sizeof(*s)); 12 + 13 + s->next_print = jiffies + HZ * 10; 14 + 15 + for (unsigned i = 0; i < BTREE_ID_NR; i++) { 16 + if (!(btree_id_mask & BIT_ULL(i))) 17 + continue; 18 + 19 + struct disk_accounting_pos acc = { 20 + .type = BCH_DISK_ACCOUNTING_btree, 21 + .btree.id = i, 22 + }; 23 + 24 + u64 v; 25 + bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1); 26 + s->nodes_total += div64_ul(v, btree_sectors(c)); 27 + } 28 + } 29 + 30 + static inline bool progress_update_p(struct progress_indicator_state *s) 31 + { 32 + bool ret = time_after_eq(jiffies, s->next_print); 33 + 34 + if (ret) 35 + s->next_print = jiffies + HZ * 10; 36 + return ret; 37 + } 38 + 39 + void bch2_progress_update_iter(struct btree_trans *trans, 40 + struct progress_indicator_state *s, 41 + struct btree_iter *iter, 42 + const char *msg) 43 + { 44 + struct bch_fs *c = trans->c; 45 + struct btree *b = path_l(btree_iter_path(trans, iter))->b; 46 + 47 + s->nodes_seen += b != s->last_node; 48 + s->last_node = b; 49 + 50 + if (progress_update_p(s)) { 51 + struct printbuf buf = PRINTBUF; 52 + unsigned percent = s->nodes_total 53 + ? div64_u64(s->nodes_seen * 100, s->nodes_total) 54 + : 0; 55 + 56 + prt_printf(&buf, "%s: %d%%, done %llu/%llu nodes, at ", 57 + msg, percent, s->nodes_seen, s->nodes_total); 58 + bch2_bbpos_to_text(&buf, BBPOS(iter->btree_id, iter->pos)); 59 + 60 + bch_info(c, "%s", buf.buf); 61 + printbuf_exit(&buf); 62 + } 63 + }
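
Note on progress.c above: the indicator counts a node once each time the iterator moves to a different node, and prints at most once per interval by comparing the current tick against a stored deadline. A minimal userspace sketch of that bookkeeping follows. The caller passes the clock in as `now` instead of reading jiffies, and `struct progress`, `PRINT_INTERVAL` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PRINT_INTERVAL 10UL	/* ticks between progress messages (assumed unit) */

struct progress {
	unsigned long	next_print;
	uint64_t	nodes_seen;
	uint64_t	nodes_total;
	const void	*last_node;
};

/* Wraparound-safe "a is at or after b", in the spirit of time_after_eq(). */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long) (a - b) >= 0;
}

static void progress_init(struct progress *s, uint64_t total, unsigned long now)
{
	memset(s, 0, sizeof(*s));
	s->nodes_total = total;
	s->next_print  = now + PRINT_INTERVAL;
}

/* Record a visit to node; return true when a progress line is due. */
static bool progress_update(struct progress *s, const void *node, unsigned long now)
{
	s->nodes_seen += node != s->last_node;
	s->last_node   = node;

	if (!tick_after_eq(now, s->next_print))
		return false;

	s->next_print = now + PRINT_INTERVAL;
	return true;
}

/* Percent complete; guards against a zero total as the diff does. */
static unsigned progress_percent(const struct progress *s)
{
	return s->nodes_total
		? (unsigned) (s->nodes_seen * 100 / s->nodes_total)
		: 0;
}
```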

+29

fs/bcachefs/progress.h

··· 1 + /* SPDX-License-Identifier: GPL-2.0 */ 2 + #ifndef _BCACHEFS_PROGRESS_H 3 + #define _BCACHEFS_PROGRESS_H 4 + 5 + /* 6 + * Lame progress indicators 7 + * 8 + * We don't like to use these because they print to the dmesg console, which is 9 + * spammy - we much prefer to be wired up to a userspace programm (e.g. via 10 + * thread_with_file) and have it print the progress indicator. 11 + * 12 + * But some code is old and doesn't support that, or runs in a context where 13 + * that's not yet practical (mount). 14 + */ 15 + 16 + struct progress_indicator_state { 17 + unsigned long next_print; 18 + u64 nodes_seen; 19 + u64 nodes_total; 20 + struct btree *last_node; 21 + }; 22 + 23 + void bch2_progress_init(struct progress_indicator_state *, struct bch_fs *, u64); 24 + void bch2_progress_update_iter(struct btree_trans *, 25 + struct progress_indicator_state *, 26 + struct btree_iter *, 27 + const char *); 28 + 29 + #endif /* _BCACHEFS_PROGRESS_H */

+37 -9

fs/bcachefs/rebalance.c

··· 26 26 27 27 /* bch_extent_rebalance: */ 28 28 29 - static const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c k) 29 + static const struct bch_extent_rebalance *bch2_bkey_ptrs_rebalance_opts(struct bkey_ptrs_c ptrs) 30 30 { 31 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 32 31 const union bch_extent_entry *entry; 33 32 34 33 bkey_extent_entry_for_each(ptrs, entry) ··· 35 36 return &entry->rebalance; 36 37 37 38 return NULL; 39 + } 40 + 41 + static const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c k) 42 + { 43 + return bch2_bkey_ptrs_rebalance_opts(bch2_bkey_ptrs_c(k)); 38 44 } 39 45 40 46 static inline unsigned bch2_bkey_ptrs_need_compress(struct bch_fs *c, ··· 101 97 102 98 u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *c, struct bkey_s_c k) 103 99 { 104 - const struct bch_extent_rebalance *opts = bch2_bkey_rebalance_opts(k); 100 + struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 101 + 102 + const struct bch_extent_rebalance *opts = bch2_bkey_ptrs_rebalance_opts(ptrs); 105 103 if (!opts) 106 104 return 0; 107 105 108 - struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k); 109 106 const union bch_extent_entry *entry; 110 107 struct extent_ptr_decoded p; 111 108 u64 sectors = 0; ··· 346 341 memset(data_opts, 0, sizeof(*data_opts)); 347 342 data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, io_opts, k); 348 343 data_opts->target = io_opts->background_target; 349 - data_opts->write_flags |= BCH_WRITE_ONLY_SPECIFIED_DEVS; 344 + data_opts->write_flags |= BCH_WRITE_only_specified_devs; 350 345 351 346 if (!data_opts->rewrite_ptrs) { 352 347 /* ··· 454 449 { 455 450 data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, io_opts, k); 456 451 data_opts->target = io_opts->background_target; 457 - data_opts->write_flags |= BCH_WRITE_ONLY_SPECIFIED_DEVS; 452 + data_opts->write_flags |= BCH_WRITE_only_specified_devs; 458 453 return data_opts->rewrite_ptrs != 0; 459 454 } 460 455 ··· 595 590 596 591 void 
bch2_rebalance_status_to_text(struct printbuf *out, struct bch_fs *c) 597 592 { 593 + printbuf_tabstop_push(out, 32); 594 + 598 595 struct bch_fs_rebalance *r = &c->rebalance; 596 + 597 + /* print pending work */ 598 + struct disk_accounting_pos acc = { .type = BCH_DISK_ACCOUNTING_rebalance_work, }; 599 + u64 v; 600 + bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1); 601 + 602 + prt_printf(out, "pending work:\t"); 603 + prt_human_readable_u64(out, v); 604 + prt_printf(out, "\n\n"); 599 605 600 606 prt_str(out, bch2_rebalance_state_strs[r->state]); 601 607 prt_newline(out); ··· 616 600 case BCH_REBALANCE_waiting: { 617 601 u64 now = atomic64_read(&c->io_clock[WRITE].now); 618 602 619 - prt_str(out, "io wait duration: "); 603 + prt_printf(out, "io wait duration:\t"); 620 604 bch2_prt_human_readable_s64(out, (r->wait_iotime_end - r->wait_iotime_start) << 9); 621 605 prt_newline(out); 622 606 623 - prt_str(out, "io wait remaining: "); 607 + prt_printf(out, "io wait remaining:\t"); 624 608 bch2_prt_human_readable_s64(out, (r->wait_iotime_end - now) << 9); 625 609 prt_newline(out); 626 610 627 - prt_str(out, "duration waited: "); 611 + prt_printf(out, "duration waited:\t"); 628 612 bch2_pr_time_units(out, ktime_get_real_ns() - r->wait_wallclock_start); 629 613 prt_newline(out); 630 614 break; ··· 637 621 break; 638 622 } 639 623 prt_newline(out); 624 + 625 + rcu_read_lock(); 626 + struct task_struct *t = rcu_dereference(c->rebalance.thread); 627 + if (t) 628 + get_task_struct(t); 629 + rcu_read_unlock(); 630 + 631 + if (t) { 632 + bch2_prt_task_backtrace(out, t, 0, GFP_KERNEL); 633 + put_task_struct(t); 634 + } 635 + 640 636 printbuf_indent_sub(out, 2); 641 637 } 642 638

+2 -2

fs/bcachefs/recovery.c

··· 13 13 #include "disk_accounting.h" 14 14 #include "errcode.h" 15 15 #include "error.h" 16 - #include "fs-common.h" 17 16 #include "journal_io.h" 18 17 #include "journal_reclaim.h" 19 18 #include "journal_seq_blacklist.h" 20 19 #include "logged_ops.h" 21 20 #include "move.h" 21 + #include "namei.h" 22 22 #include "quota.h" 23 23 #include "rebalance.h" 24 24 #include "recovery.h" ··· 899 899 * journal sequence numbers: 900 900 */ 901 901 if (!c->sb.clean) 902 - journal_seq += 8; 902 + journal_seq += JOURNAL_BUF_NR * 4; 903 903 904 904 if (blacklist_seq != journal_seq) { 905 905 ret = bch2_journal_log_msg(c, "blacklisting entries %llu-%llu",

+1 -1

fs/bcachefs/recovery_passes_types.h

··· 24 24 x(check_topology, 4, 0) \ 25 25 x(accounting_read, 39, PASS_ALWAYS) \ 26 26 x(alloc_read, 0, PASS_ALWAYS) \ 27 - x(stripes_read, 1, PASS_ALWAYS) \ 27 + x(stripes_read, 1, 0) \ 28 28 x(initialize_subvolumes, 2, 0) \ 29 29 x(snapshots_read, 3, PASS_ALWAYS) \ 30 30 x(check_allocations, 5, PASS_FSCK) \

+16 -7

fs/bcachefs/reflink.c

··· 185 185 BUG_ON(missing_start < refd_start); 186 186 BUG_ON(missing_end > refd_end); 187 187 188 - if (fsck_err(trans, reflink_p_to_missing_reflink_v, 189 - "pointer to missing indirect extent\n" 190 - " %s\n" 191 - " missing range %llu-%llu", 192 - (bch2_bkey_val_to_text(&buf, c, p.s_c), buf.buf), 193 - missing_start, missing_end)) { 188 + struct bpos missing_pos = bkey_start_pos(p.k); 189 + missing_pos.offset += missing_start - live_start; 190 + 191 + prt_printf(&buf, "pointer to missing indirect extent in "); 192 + ret = bch2_inum_snap_offset_err_msg_trans(trans, &buf, missing_pos); 193 + if (ret) 194 + goto err; 195 + 196 + prt_printf(&buf, "-%llu\n ", (missing_pos.offset + (missing_end - missing_start)) << 9); 197 + bch2_bkey_val_to_text(&buf, c, p.s_c); 198 + 199 + prt_printf(&buf, "\n missing reflink btree range %llu-%llu", 200 + missing_start, missing_end); 201 + 202 + if (fsck_err(trans, reflink_p_to_missing_reflink_v, "%s", buf.buf)) { 194 203 struct bkey_i_reflink_p *new = bch2_bkey_make_mut_noupdate_typed(trans, p.s_c, reflink_p); 195 204 ret = PTR_ERR_OR_ZERO(new); 196 205 if (ret) ··· 606 597 u64 dst_done = 0; 607 598 u32 dst_snapshot, src_snapshot; 608 599 bool reflink_p_may_update_opts_field = 609 - bch2_request_incompat_feature(c, bcachefs_metadata_version_reflink_p_may_update_opts); 600 + !bch2_request_incompat_feature(c, bcachefs_metadata_version_reflink_p_may_update_opts); 610 601 int ret = 0, ret2 = 0; 611 602 612 603 if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_reflink))

+69 -21

fs/bcachefs/sb-counters.c

···

 /* BCH_SB_FIELD_counters */

-static const char * const bch2_counter_names[] = {
+static const u8 counters_to_stable_map[] = {
+#define x(n, id, ...) [BCH_COUNTER_##n] = BCH_COUNTER_STABLE_##n,
+	BCH_PERSISTENT_COUNTERS()
+#undef x
+};
+
+const char * const bch2_counter_names[] = {
 #define x(t, n, ...) (#t),
 	BCH_PERSISTENT_COUNTERS()
 #undef x
···
 		return 0;

 	return (__le64 *) vstruct_end(&ctrs->field) - &ctrs->d[0];
-};
+}

 static int bch2_sb_counters_validate(struct bch_sb *sb, struct bch_sb_field *f,
 				enum bch_validate_flags flags, struct printbuf *err)
 {
 	return 0;
-};
+}

 static void bch2_sb_counters_to_text(struct printbuf *out, struct bch_sb *sb,
 				struct bch_sb_field *f)
···
 	struct bch_sb_field_counters *ctrs = field_to_type(f, counters);
 	unsigned int nr = bch2_sb_counter_nr_entries(ctrs);

-	for (unsigned i = 0; i < nr; i++)
-		prt_printf(out, "%s \t%llu\n",
-			   i < BCH_COUNTER_NR ? bch2_counter_names[i] : "(unknown)",
-			   le64_to_cpu(ctrs->d[i]));
-};
+	for (unsigned i = 0; i < BCH_COUNTER_NR; i++) {
+		unsigned stable = counters_to_stable_map[i];
+		if (stable < nr)
+			prt_printf(out, "%s \t%llu\n",
+				   bch2_counter_names[i],
+				   le64_to_cpu(ctrs->d[stable]));
+	}
+}

 int bch2_sb_counters_to_cpu(struct bch_fs *c)
 {
 	struct bch_sb_field_counters *ctrs = bch2_sb_field_get(c->disk_sb.sb, counters);
-	unsigned int i;
 	unsigned int nr = bch2_sb_counter_nr_entries(ctrs);
-	u64 val = 0;

-	for (i = 0; i < BCH_COUNTER_NR; i++)
+	for (unsigned i = 0; i < BCH_COUNTER_NR; i++)
 		c->counters_on_mount[i] = 0;

-	for (i = 0; i < min_t(unsigned int, nr, BCH_COUNTER_NR); i++) {
-		val = le64_to_cpu(ctrs->d[i]);
-		percpu_u64_set(&c->counters[i], val);
-		c->counters_on_mount[i] = val;
+	for (unsigned i = 0; i < BCH_COUNTER_NR; i++) {
+		unsigned stable = counters_to_stable_map[i];
+		if (stable < nr) {
+			u64 v = le64_to_cpu(ctrs->d[stable]);
+			percpu_u64_set(&c->counters[i], v);
+			c->counters_on_mount[i] = v;
+		}
 	}
+
 	return 0;
-};
+}

 int bch2_sb_counters_from_cpu(struct bch_fs *c)
 {
 	struct bch_sb_field_counters *ctrs = bch2_sb_field_get(c->disk_sb.sb, counters);
 	struct bch_sb_field_counters *ret;
-	unsigned int i;
 	unsigned int nr = bch2_sb_counter_nr_entries(ctrs);

 	if (nr < BCH_COUNTER_NR) {
 		ret = bch2_sb_field_resize(&c->disk_sb, counters,
-				sizeof(*ctrs) / sizeof(u64) + BCH_COUNTER_NR);
-
+				sizeof(*ctrs) / sizeof(u64) + BCH_COUNTER_NR);
 		if (ret) {
 			ctrs = ret;
 			nr = bch2_sb_counter_nr_entries(ctrs);
 		}
 	}

+	for (unsigned i = 0; i < BCH_COUNTER_NR; i++) {
+		unsigned stable = counters_to_stable_map[i];
+		if (stable < nr)
+			ctrs->d[stable] = cpu_to_le64(percpu_u64_get(&c->counters[i]));
+	}

-	for (i = 0; i < min_t(unsigned int, nr, BCH_COUNTER_NR); i++)
-		ctrs->d[i] = cpu_to_le64(percpu_u64_get(&c->counters[i]));
 	return 0;
 }

···
 	.validate = bch2_sb_counters_validate,
 	.to_text = bch2_sb_counters_to_text,
 };
+
+#ifndef NO_BCACHEFS_CHARDEV
+long bch2_ioctl_query_counters(struct bch_fs *c,
+			struct bch_ioctl_query_counters __user *user_arg)
+{
+	struct bch_ioctl_query_counters arg;
+	int ret = copy_from_user_errcode(&arg, user_arg, sizeof(arg));
+	if (ret)
+		return ret;
+
+	if ((arg.flags & ~BCH_IOCTL_QUERY_COUNTERS_MOUNT) ||
+	    arg.pad)
+		return -EINVAL;
+
+	arg.nr = min(arg.nr, BCH_COUNTER_NR);
+	ret = put_user(arg.nr, &user_arg->nr);
+	if (ret)
+		return ret;
+
+	for (unsigned i = 0; i < BCH_COUNTER_NR; i++) {
+		unsigned stable = counters_to_stable_map[i];
+
+		if (stable < arg.nr) {
+			u64 v = !(arg.flags & BCH_IOCTL_QUERY_COUNTERS_MOUNT)
+				? percpu_u64_get(&c->counters[i])
+				: c->counters_on_mount[i];
+
+			ret = put_user(v, &user_arg->d[stable]);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+#endif
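Reviewer note: the sb-counters.c hunks above separate the in-memory counter index (list order) from the on-disk slot (the numeric id in the x-macro list), so counters can be reordered in the list without moving their on-disk slots. The following is a minimal userspace sketch of that idea, not the kernel code. Every `DEMO_`/`demo_` name, the four-entry counter list, and the choice to zero missing slots are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter list: (name, stable on-disk slot). Not the real bcachefs list. */
#define DEMO_COUNTERS()		\
	x(io_read,	0)	\
	x(io_read_hole,	3)	\
	x(io_write,	1)	\
	x(io_move,	2)

/* Dense in-memory indices, assigned in list order. */
enum demo_counter {
#define x(t, n) DEMO_COUNTER_##t,
	DEMO_COUNTERS()
#undef x
	DEMO_COUNTER_NR
};

/* Stable on-disk slots, fixed by the second x() argument. */
enum demo_counter_stable {
#define x(t, n) DEMO_COUNTER_STABLE_##t = n,
	DEMO_COUNTERS()
#undef x
};

/* Map from in-memory index to on-disk slot. */
static const uint8_t demo_to_stable_map[] = {
#define x(t, n) [DEMO_COUNTER_##t] = DEMO_COUNTER_STABLE_##t,
	DEMO_COUNTERS()
#undef x
};

/*
 * Load counters from an on-disk array holding nr_on_disk slots.
 * Slots the on-disk array does not have are read as 0 (an assumption of
 * this sketch; the kernel code zeroes counters_on_mount beforehand).
 */
static void demo_counters_load(uint64_t *mem, const uint64_t *disk, size_t nr_on_disk)
{
	assert(mem && disk);

	for (size_t i = 0; i < DEMO_COUNTER_NR; i++) {
		size_t stable = demo_to_stable_map[i];

		mem[i] = stable < nr_on_disk ? disk[stable] : 0;
	}
}
```

In this sketch `io_read_hole` sits second in the list but owns slot 3, so an older on-disk array with only three slots leaves it at zero while the other counters load from their fixed slots.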

+4

fs/bcachefs/sb-counters.h

···
 void bch2_fs_counters_exit(struct bch_fs *);
 int bch2_fs_counters_init(struct bch_fs *);

+extern const char * const bch2_counter_names[];
 extern const struct bch_sb_field_ops bch_sb_field_ops_counters;
+
+long bch2_ioctl_query_counters(struct bch_fs *,
+			struct bch_ioctl_query_counters __user *);

 #endif // _BCACHEFS_SB_COUNTERS_H

+21 -10

fs/bcachefs/sb-counters_format.h

···

 #define BCH_PERSISTENT_COUNTERS() \
 	x(io_read, 0, TYPE_SECTORS) \
+	x(io_read_inline, 80, TYPE_SECTORS) \
+	x(io_read_hole, 81, TYPE_SECTORS) \
+	x(io_read_promote, 30, TYPE_COUNTER) \
+	x(io_read_bounce, 31, TYPE_COUNTER) \
+	x(io_read_split, 33, TYPE_COUNTER) \
+	x(io_read_reuse_race, 34, TYPE_COUNTER) \
+	x(io_read_retry, 32, TYPE_COUNTER) \
 	x(io_write, 1, TYPE_SECTORS) \
 	x(io_move, 2, TYPE_SECTORS) \
+	x(io_move_read, 35, TYPE_SECTORS) \
+	x(io_move_write, 36, TYPE_SECTORS) \
+	x(io_move_finish, 37, TYPE_SECTORS) \
+	x(io_move_fail, 38, TYPE_COUNTER) \
+	x(io_move_write_fail, 82, TYPE_COUNTER) \
+	x(io_move_start_fail, 39, TYPE_COUNTER) \
 	x(bucket_invalidate, 3, TYPE_COUNTER) \
 	x(bucket_discard, 4, TYPE_COUNTER) \
+	x(bucket_discard_fast, 79, TYPE_COUNTER) \
 	x(bucket_alloc, 5, TYPE_COUNTER) \
 	x(bucket_alloc_fail, 6, TYPE_COUNTER) \
 	x(btree_cache_scan, 7, TYPE_COUNTER) \
···
 	x(journal_reclaim_finish, 27, TYPE_COUNTER) \
 	x(journal_reclaim_start, 28, TYPE_COUNTER) \
 	x(journal_write, 29, TYPE_COUNTER) \
-	x(read_promote, 30, TYPE_COUNTER) \
-	x(read_bounce, 31, TYPE_COUNTER) \
-	x(read_split, 33, TYPE_COUNTER) \
-	x(read_retry, 32, TYPE_COUNTER) \
-	x(read_reuse_race, 34, TYPE_COUNTER) \
-	x(move_extent_read, 35, TYPE_SECTORS) \
-	x(move_extent_write, 36, TYPE_SECTORS) \
-	x(move_extent_finish, 37, TYPE_SECTORS) \
-	x(move_extent_fail, 38, TYPE_COUNTER) \
-	x(move_extent_start_fail, 39, TYPE_COUNTER) \
 	x(copygc, 40, TYPE_COUNTER) \
 	x(copygc_wait, 41, TYPE_COUNTER) \
 	x(gc_gens_end, 42, TYPE_COUNTER) \
···
 	BCH_PERSISTENT_COUNTERS()
 #undef x
 	BCH_COUNTER_NR
+};
+
+enum bch_persistent_counters_stable {
+#define x(t, n, ...) BCH_COUNTER_STABLE_##t = n,
+	BCH_PERSISTENT_COUNTERS()
+#undef x
+	BCH_COUNTER_STABLE_NR
 };

 struct bch_sb_field_counters {

+7 -1

fs/bcachefs/sb-downgrade.c

···
 	  BIT_ULL(BCH_RECOVERY_PASS_check_allocations), \
 	  BCH_FSCK_ERR_accounting_mismatch, \
 	  BCH_FSCK_ERR_accounting_key_replicas_nr_devs_0, \
-	  BCH_FSCK_ERR_accounting_key_junk_at_end)
+	  BCH_FSCK_ERR_accounting_key_junk_at_end) \
+	x(cached_backpointers, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
+	  BCH_FSCK_ERR_ptr_to_missing_backpointer) \
+	x(stripe_backpointers, \
+	  BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
+	  BCH_FSCK_ERR_ptr_to_missing_backpointer)

 #define DOWNGRADE_TABLE() \
 	x(bucket_stripe_sectors, \

+4 -1

fs/bcachefs/sb-errors_format.h

···
 	x(ptr_crc_redundant, 160, 0) \
 	x(ptr_crc_nonce_mismatch, 162, 0) \
 	x(ptr_stripe_redundant, 163, 0) \
+	x(extent_flags_not_at_start, 306, 0) \
 	x(reservation_key_nr_replicas_invalid, 164, 0) \
 	x(reflink_v_refcount_wrong, 165, FSCK_AUTOFIX) \
 	x(reflink_v_pos_bad, 292, 0) \
···
 	x(compression_opt_not_marked_in_sb, 295, FSCK_AUTOFIX) \
 	x(compression_type_not_marked_in_sb, 296, FSCK_AUTOFIX) \
 	x(directory_size_mismatch, 303, FSCK_AUTOFIX) \
-	x(MAX, 304, 0)
+	x(dirent_cf_name_too_big, 304, 0) \
+	x(dirent_stray_data_after_cf_name, 305, 0) \
+	x(MAX, 307, 0)

 enum bch_sb_error_id {
 #define x(t, n, ...) BCH_FSCK_ERR_##t = n,

+15 -1

fs/bcachefs/sb-members.h

···
 	return !percpu_ref_is_zero(&ca->io_ref);
 }

-static inline bool bch2_dev_is_readable(struct bch_dev *ca)
+static inline struct bch_dev *bch2_dev_rcu(struct bch_fs *, unsigned);
+
+static inline bool bch2_dev_idx_is_online(struct bch_fs *c, unsigned dev)
+{
+	rcu_read_lock();
+	struct bch_dev *ca = bch2_dev_rcu(c, dev);
+	bool ret = ca && bch2_dev_is_online(ca);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool bch2_dev_is_healthy(struct bch_dev *ca)
 {
 	return bch2_dev_is_online(ca) &&
 		ca->mi.state != BCH_MEMBER_STATE_failed;
···

 static inline struct bch_dev *bch2_dev_get_ioref(struct bch_fs *c, unsigned dev, int rw)
 {
+	might_sleep();
+
 	rcu_read_lock();
 	struct bch_dev *ca = bch2_dev_rcu(c, dev);
 	if (ca && !percpu_ref_tryget(&ca->io_ref))

+1

fs/bcachefs/sb-members_format.h

···

 #define BCH_MEMBER_V1_BYTES 56

+LE16_BITMASK(BCH_MEMBER_BUCKET_SIZE, struct bch_member, bucket_size, 0, 16)
 LE64_BITMASK(BCH_MEMBER_STATE, struct bch_member, flags, 0, 4)
 /* 4-14 unused, was TIER, HAS_(META)DATA, REPLACEMENT */
 LE64_BITMASK(BCH_MEMBER_DISCARD, struct bch_member, flags, 14, 15)

+4 -3

fs/bcachefs/snapshot.c

···
 		goto out;
 	}

-	while (id && id < ancestor - IS_ANCESTOR_BITMAP)
-		id = get_ancestor_below(t, id, ancestor);
+	if (likely(ancestor >= IS_ANCESTOR_BITMAP))
+		while (id && id < ancestor - IS_ANCESTOR_BITMAP)
+			id = get_ancestor_below(t, id, ancestor);

 	ret = id && id < ancestor
 		? test_ancestor_bitmap(t, id, ancestor)
···
 	return 0;
 }

-static u32 bch2_snapshot_tree_oldest_subvol(struct bch_fs *c, u32 snapshot_root)
+u32 bch2_snapshot_tree_oldest_subvol(struct bch_fs *c, u32 snapshot_root)
 {
 	u32 id = snapshot_root;
 	u32 subvol = 0, s;
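Reviewer note: the first snapshot.c hunk guards an unsigned subtraction. With 32-bit unsigned ids, `ancestor - IS_ANCESTOR_BITMAP` wraps to a very large value whenever `ancestor` is smaller than the bitmap width, which makes the `id < ...` comparison true almost unconditionally. The sketch below isolates that arithmetic in userspace; `DEMO_BITMAP_BITS` is a hypothetical stand-in for `IS_ANCESTOR_BITMAP`, and both helper functions are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for IS_ANCESTOR_BITMAP; the real value is not shown in this diff. */
#define DEMO_BITMAP_BITS 128u

/* Unguarded form: if ancestor < DEMO_BITMAP_BITS the subtraction wraps around. */
static bool demo_below_window_unguarded(uint32_t id, uint32_t ancestor)
{
	return id && id < ancestor - DEMO_BITMAP_BITS;
}

/* Guarded form: only do the subtraction when it cannot wrap. */
static bool demo_below_window_guarded(uint32_t id, uint32_t ancestor)
{
	return ancestor >= DEMO_BITMAP_BITS &&
	       id && id < ancestor - DEMO_BITMAP_BITS;
}
```

With `ancestor = 100`, the unguarded helper computes `100 - 128`, which wraps to 4294967268, so any nonzero `id` compares as below the window; the guarded helper returns false instead.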

+1

fs/bcachefs/snapshot.h

···
 	return id;
 }

+u32 bch2_snapshot_tree_oldest_subvol(struct bch_fs *, u32);
 u32 bch2_snapshot_skiplist_get(struct bch_fs *, u32);

 static inline u32 bch2_snapshot_root(struct bch_fs *c, u32 id)

+1 -1

fs/bcachefs/str_hash.c

···
 	for (unsigned i = 0; i < 1000; i++) {
 		unsigned len = sprintf(new->v.d_name, "%.*s.fsck_renamed-%u",
 				       old_name.len, old_name.name, i);
-		unsigned u64s = BKEY_U64s + dirent_val_u64s(len);
+		unsigned u64s = BKEY_U64s + dirent_val_u64s(len, 0);

 		if (u64s > U8_MAX)
 			return -EINVAL;

+6 -6

fs/bcachefs/str_hash.h

···
 #include "super.h"

 #include <linux/crc32c.h>
-#include <crypto/hash.h>
 #include <crypto/sha2.h>

 static inline enum bch_str_hash_type
···

 struct bch_hash_info {
 	u8 type;
+	struct unicode_map *cf_encoding;
 	/*
 	 * For crc32 or crc64 string hashes the first key value of
 	 * the siphash_key (k0) is used as the key.
···
 	/* XXX ick */
 	struct bch_hash_info info = {
 		.type = INODE_STR_HASH(bi),
+#ifdef CONFIG_UNICODE
+		.cf_encoding = !!(bi->bi_flags & BCH_INODE_casefolded) ? c->cf_encoding : NULL,
+#endif
 		.siphash_key = { .k0 = bi->bi_hash_seed }
 	};

 	if (unlikely(info.type == BCH_STR_HASH_siphash_old)) {
-		SHASH_DESC_ON_STACK(desc, c->sha256);
 		u8 digest[SHA256_DIGEST_SIZE];

-		desc->tfm = c->sha256;
-
-		crypto_shash_digest(desc, (void *) &bi->bi_hash_seed,
-				    sizeof(bi->bi_hash_seed), digest);
+		sha256((const u8 *)&bi->bi_hash_seed,
+		       sizeof(bi->bi_hash_seed), digest);
 		memcpy(&info.siphash_key, digest, sizeof(info.siphash_key));
 	}


+53 -39

fs/bcachefs/super-io.c

···
 #include <linux/sort.h>
 #include <linux/string_choices.h>

-static const struct blk_holder_ops bch2_sb_handle_bdev_ops = {
-};
-
 struct bch2_metadata_version {
 	u16 version;
 	const char *name;
···
 	return v;
 }

-bool bch2_set_version_incompat(struct bch_fs *c, enum bcachefs_metadata_version version)
+int bch2_set_version_incompat(struct bch_fs *c, enum bcachefs_metadata_version version)
 {
-	bool ret = (c->sb.features & BIT_ULL(BCH_FEATURE_incompat_version_field)) &&
-		   version <= c->sb.version_incompat_allowed;
+	int ret = ((c->sb.features & BIT_ULL(BCH_FEATURE_incompat_version_field)) &&
+		   version <= c->sb.version_incompat_allowed)
+		? 0
+		: -BCH_ERR_may_not_use_incompat_feature;

-	if (ret) {
+	if (!ret) {
 		mutex_lock(&c->sb_lock);
 		SET_BCH_SB_VERSION_INCOMPAT(c->disk_sb.sb,
 					    max(BCH_SB_VERSION_INCOMPAT(c->disk_sb.sb), version));
···
 	return 0;
 }

-static int bch2_sb_validate(struct bch_sb_handle *disk_sb,
-			    enum bch_validate_flags flags, struct printbuf *out)
+int bch2_sb_validate(struct bch_sb *sb, u64 read_offset,
+		     enum bch_validate_flags flags, struct printbuf *out)
 {
-	struct bch_sb *sb = disk_sb->sb;
 	struct bch_sb_field_members_v1 *mi;
 	enum bch_opt_id opt_id;
-	u16 block_size;
 	int ret;

 	ret = bch2_sb_compatible(sb, out);
 	if (ret)
 		return ret;

-	if (sb->features[1] ||
-	    (le64_to_cpu(sb->features[0]) & (~0ULL << BCH_FEATURE_NR))) {
-		prt_printf(out, "Filesystem has incompatible features");
+	u64 incompat = le64_to_cpu(sb->features[0]) & (~0ULL << BCH_FEATURE_NR);
+	unsigned incompat_bit = 0;
+	if (incompat)
+		incompat_bit = __ffs64(incompat);
+	else if (sb->features[1])
+		incompat_bit = 64 + __ffs64(le64_to_cpu(sb->features[1]));
+
+	if (incompat_bit) {
+		prt_printf(out, "Filesystem has incompatible feature bit %u, highest supported %s (%u)",
+			   incompat_bit,
+			   bch2_sb_features[BCH_FEATURE_NR - 1],
+			   BCH_FEATURE_NR - 1);
 		return -BCH_ERR_invalid_sb_features;
 	}

 	if (BCH_VERSION_MAJOR(le16_to_cpu(sb->version)) > BCH_VERSION_MAJOR(bcachefs_metadata_version_current) ||
 	    BCH_SB_VERSION_INCOMPAT(sb) > bcachefs_metadata_version_current) {
-		prt_printf(out, "Filesystem has incompatible version");
+		prt_str(out, "Filesystem has incompatible version ");
+		bch2_version_to_text(out, le16_to_cpu(sb->version));
+		prt_str(out, ", current version ");
+		bch2_version_to_text(out, bcachefs_metadata_version_current);
 		return -BCH_ERR_invalid_sb_features;
-	}
-
-	block_size = le16_to_cpu(sb->block_size);
-
-	if (block_size > PAGE_SECTORS) {
-		prt_printf(out, "Block size too big (got %u, max %u)",
-			   block_size, PAGE_SECTORS);
-		return -BCH_ERR_invalid_sb_block_size;
 	}

 	if (bch2_is_zero(sb->user_uuid.b, sizeof(sb->user_uuid))) {
···
 	if (bch2_is_zero(sb->uuid.b, sizeof(sb->uuid))) {
 		prt_printf(out, "Bad internal UUID (got zeroes)");
 		return -BCH_ERR_invalid_sb_uuid;
+	}
+
+	if (!(flags & BCH_VALIDATE_write) &&
+	    le64_to_cpu(sb->offset) != read_offset) {
+		prt_printf(out, "Bad sb offset (got %llu, read from %llu)",
+			   le64_to_cpu(sb->offset), read_offset);
+		return -BCH_ERR_invalid_sb_offset;
 	}

 	if (!sb->nr_devices ||
···

 		if (le16_to_cpu(sb->version) <= bcachefs_metadata_version_disk_accounting_v2)
 			SET_BCH_SB_PROMOTE_WHOLE_EXTENTS(sb, true);
+
+		if (!BCH_SB_WRITE_ERROR_TIMEOUT(sb))
+			SET_BCH_SB_WRITE_ERROR_TIMEOUT(sb, 30);
+
+		if (le16_to_cpu(sb->version) <= bcachefs_metadata_version_extent_flags &&
+		    !BCH_SB_CSUM_ERR_RETRY_NR(sb))
+			SET_BCH_SB_CSUM_ERR_RETRY_NR(sb, 3);
 	}

 #ifdef __KERNEL__
···
 	for (opt_id = 0; opt_id < bch2_opts_nr; opt_id++) {
 		const struct bch_option *opt = bch2_opt_table + opt_id;

-		if (opt->get_sb != BCH2_NO_SB_OPT) {
-			u64 v = bch2_opt_from_sb(sb, opt_id);
+		if (opt->get_sb) {
+			u64 v = bch2_opt_from_sb(sb, opt_id, -1);

 			prt_printf(out, "Invalid option ");
 			ret = bch2_opt_validate(opt, v, out);
···
 	memset(sb, 0, sizeof(*sb));
 	sb->mode = BLK_OPEN_READ;
 	sb->have_bio = true;
-	sb->holder = kmalloc(1, GFP_KERNEL);
+	sb->holder = kzalloc(sizeof(*sb->holder), GFP_KERNEL);
 	if (!sb->holder)
 		return -ENOMEM;

···

 	sb->have_layout = true;

-	ret = bch2_sb_validate(sb, 0, &err);
+	ret = bch2_sb_validate(sb->sb, offset, 0, &err);
 	if (ret) {
 		bch2_print_opts(opts, KERN_ERR "bcachefs (%s): error validating superblock: %s\n",
 				path, err.buf);
···
 {
 	struct bch_dev *ca = bio->bi_private;

+	bch2_account_io_success_fail(ca, bio_data_dir(bio), !bio->bi_status);
+
 	/* XXX: return errors directly */

-	if (bch2_dev_io_err_on(bio->bi_status, ca,
-			       bio_data_dir(bio)
-			       ? BCH_MEMBER_ERROR_write
-			       : BCH_MEMBER_ERROR_read,
-			       "superblock %s error: %s",
+	if (bio->bi_status) {
+		bch_err_dev_ratelimited(ca, "superblock %s error: %s",
 			       str_write_read(bio_data_dir(bio)),
-			       bch2_blk_status_to_str(bio->bi_status)))
+			       bch2_blk_status_to_str(bio->bi_status));
 		ca->sb_write_error = 1;
+	}

 	closure_put(&ca->fs->sb_write);
 	percpu_ref_put(&ca->io_ref);
···
 	darray_for_each(online_devices, ca) {
 		printbuf_reset(&err);

-		ret = bch2_sb_validate(&(*ca)->disk_sb, BCH_VALIDATE_write, &err);
+		ret = bch2_sb_validate((*ca)->disk_sb.sb, 0, BCH_VALIDATE_write, &err);
 		if (ret) {
 			bch2_fs_inconsistent(c, "sb invalid before write: %s", err.buf);
 			goto out;
···
 				 !can_mount_with_written), c,
 		": Unable to write superblock to sufficient devices (from %ps)",
 		(void *) _RET_IP_))
-		ret = -1;
+		ret = -BCH_ERR_erofs_sb_err;
 out:
 	/* Make new options visible after they're persistent: */
 	bch2_sb_update(c);
···
 		bch2_sb_field_resize(&c->disk_sb, downgrade, 0);

 	c->disk_sb.sb->version = cpu_to_le16(new_version);
-	c->disk_sb.sb->features[0] |= cpu_to_le64(BCH_SB_FEATURES_ALL);

 	if (incompat) {
+		c->disk_sb.sb->features[0] |= cpu_to_le64(BCH_SB_FEATURES_ALL);
 		SET_BCH_SB_VERSION_INCOMPAT_ALLOWED(c->disk_sb.sb,
 			max(BCH_SB_VERSION_INCOMPAT_ALLOWED(c->disk_sb.sb), new_version));
-		c->disk_sb.sb->features[0] |= cpu_to_le64(BCH_FEATURE_incompat_version_field);
 	}
 }

···
 	for (id = 0; id < bch2_opts_nr; id++) {
 		const struct bch_option *opt = bch2_opt_table + id;

-		if (opt->get_sb != BCH2_NO_SB_OPT) {
-			u64 v = bch2_opt_from_sb(sb, id);
+		if (opt->get_sb) {
+			u64 v = bch2_opt_from_sb(sb, id, -1);

 			prt_printf(out, "%s:\t", opt->attr.name);
 			bch2_opt_to_text(out, NULL, sb, opt, v,
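Reviewer note: the reworked `bch2_sb_validate()` reports which feature bit is unknown rather than a generic "incompatible features" message. It masks off the known bits of the first feature word, takes the lowest remaining set bit, and falls back to the second word with an offset of 64. The userspace sketch below mirrors that logic under stated assumptions: `DEMO_FEATURE_NR` is a made-up count of known bits, `demo_first_incompat_bit` is a hypothetical helper, and the GCC/Clang builtin `__builtin_ctzll` stands in for the kernel's `__ffs64()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical number of feature bits this build understands. */
#define DEMO_FEATURE_NR 20u

/*
 * Return the index of the lowest feature bit we do not understand,
 * counting bits of features1 from 64 upward, or 0 if every set bit is known.
 * 0 is usable as "none" because bit 0 is always a known bit when
 * DEMO_FEATURE_NR > 0.
 */
static unsigned demo_first_incompat_bit(uint64_t features0, uint64_t features1)
{
	uint64_t unknown = features0 & (~0ULL << DEMO_FEATURE_NR);

	if (unknown)
		return (unsigned) __builtin_ctzll(unknown);
	if (features1)
		return 64 + (unsigned) __builtin_ctzll(features1);
	return 0;
}
```

A caller would treat any nonzero result as a reason to refuse the mount and print the bit index alongside the highest supported feature, which is what the new error message in the hunk does.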

+6 -4

fs/bcachefs/super-io.h

···
 void bch2_version_to_text(struct printbuf *, enum bcachefs_metadata_version);
 enum bcachefs_metadata_version bch2_latest_compatible_version(enum bcachefs_metadata_version);

-bool bch2_set_version_incompat(struct bch_fs *, enum bcachefs_metadata_version);
+int bch2_set_version_incompat(struct bch_fs *, enum bcachefs_metadata_version);

-static inline bool bch2_request_incompat_feature(struct bch_fs *c,
-						 enum bcachefs_metadata_version version)
+static inline int bch2_request_incompat_feature(struct bch_fs *c,
+						enum bcachefs_metadata_version version)
 {
 	return likely(version <= c->sb.version_incompat)
-		? true
+		? 0
 		: bch2_set_version_incompat(c, version);
 }

···

 void bch2_free_super(struct bch_sb_handle *);
 int bch2_sb_realloc(struct bch_sb_handle *, unsigned);
+
+int bch2_sb_validate(struct bch_sb *, u64, enum bch_validate_flags, struct printbuf *);

 int bch2_read_super(const char *, struct bch_opts *, struct bch_sb_handle *);
 int bch2_read_super_silent(const char *, struct bch_opts *, struct bch_sb_handle *);
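Reviewer note: the super-io.h hunk flips `bch2_request_incompat_feature()` from returning `bool` (true on success) to returning `int` (0 on success, negative error code on refusal), so callers that tested `if (ret)` for success now need `if (!ret)`. The sketch below shows the new convention in userspace. It is a simplification: `struct demo_fs`, the `DEMO_ERR_may_not_use_feature` value, and both functions are hypothetical, and the feature-bit check and locking from the real function are left out.

```c
#include <assert.h>

/* Hypothetical private error code, standing in for BCH_ERR_may_not_use_incompat_feature. */
#define DEMO_ERR_may_not_use_feature 2001

/* Hypothetical, heavily reduced filesystem state. */
struct demo_fs {
	unsigned version_incompat;		/* highest incompat version in use */
	unsigned version_incompat_allowed;	/* highest incompat version permitted */
};

/* Slow path: returns 0 and records the version, or a negative error code. */
static int demo_set_version_incompat(struct demo_fs *c, unsigned version)
{
	if (version > c->version_incompat_allowed)
		return -DEMO_ERR_may_not_use_feature;

	if (version > c->version_incompat)
		c->version_incompat = version;
	return 0;
}

/* Fast path: already in use means success (0) without touching any state. */
static int demo_request_incompat_feature(struct demo_fs *c, unsigned version)
{
	return version <= c->version_incompat
		? 0
		: demo_set_version_incompat(c, version);
}
```

Returning an error code instead of a boolean lets a caller propagate the reason for the refusal directly, at the cost of inverting the truthiness of the return value.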

+129 -12

fs/bcachefs/super.c

···
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Kent Overstreet <kent.overstreet@gmail.com>");
 MODULE_DESCRIPTION("bcachefs filesystem");
-MODULE_SOFTDEP("pre: crc32c");
-MODULE_SOFTDEP("pre: crc64");
-MODULE_SOFTDEP("pre: sha256");
 MODULE_SOFTDEP("pre: chacha20");
 MODULE_SOFTDEP("pre: poly1305");
 MODULE_SOFTDEP("pre: xxhash");
···
 	    kobject_add(&c->time_stats, &c->kobj, "time_stats") ?:
 #endif
 	    kobject_add(&c->counters_kobj, &c->kobj, "counters") ?:
-	    bch2_opts_create_sysfs_files(&c->opts_dir);
+	    bch2_opts_create_sysfs_files(&c->opts_dir, OPT_FS);
 	if (ret) {
 		bch_err(c, "error creating sysfs objects");
 		return ret;
···

 	if (ret)
 		goto err;
+
+#ifdef CONFIG_UNICODE
+	/* Default encoding until we can potentially have more as an option. */
+	c->cf_encoding = utf8_load(BCH_FS_DEFAULT_UTF8_ENCODING);
+	if (IS_ERR(c->cf_encoding)) {
+		printk(KERN_ERR "Cannot load UTF-8 encoding for filesystem. Version: %u.%u.%u",
+			unicode_major(BCH_FS_DEFAULT_UTF8_ENCODING),
+			unicode_minor(BCH_FS_DEFAULT_UTF8_ENCODING),
+			unicode_rev(BCH_FS_DEFAULT_UTF8_ENCODING));
+		ret = -EINVAL;
+		goto err;
+	}
+#else
+	if (c->sb.features & BIT_ULL(BCH_FEATURE_casefolding)) {
+		printk(KERN_ERR "Cannot mount a filesystem with casefolding on a kernel without CONFIG_UNICODE\n");
+		ret = -EINVAL;
+		goto err;
+	}
+#endif

 	pr_uuid(&name, c->sb.user_uuid.b);
 	ret = name.allocation_failure ? -BCH_ERR_ENOMEM_fs_name_alloc : 0;
···
 	}

 	set_bit(BCH_FS_started, &c->flags);
+	wake_up(&c->ro_ref_wait);

 	if (c->opts.read_only) {
 		bch2_fs_read_only(c);
···
 		return 0;

 	if (!ca->kobj.state_in_sysfs) {
-		ret = kobject_add(&ca->kobj, &c->kobj,
-				  "dev-%u", ca->dev_idx);
+		ret = kobject_add(&ca->kobj, &c->kobj, "dev-%u", ca->dev_idx) ?:
+			bch2_opts_create_sysfs_files(&ca->kobj, OPT_DEVICE);
 		if (ret)
 			return ret;
 	}
···
 	/* Commit: */
 	ca->disk_sb = *sb;
 	memset(sb, 0, sizeof(*sb));
+
+	/*
+	 * Stash pointer to the filesystem for blk_holder_ops - note that once
+	 * attached to a filesystem, we will always close the block device
+	 * before tearing down the filesystem object.
+	 */
+	ca->disk_sb.holder->c = ca->fs;

 	ca->dev = ca->disk_sb.bdev->bd_dev;

···
 	mutex_unlock(&c->sb_lock);

 	if (ca->mi.freespace_initialized) {
-		struct disk_accounting_pos acc = {
-			.type = BCH_DISK_ACCOUNTING_dev_data_type,
-			.dev_data_type.dev = ca->dev_idx,
-			.dev_data_type.data_type = BCH_DATA_free,
-		};
 		u64 v[3] = { nbuckets - old_nbuckets, 0, 0 };

 		ret = bch2_trans_commit_do(ca->fs, NULL, NULL, 0,
-				bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v), false)) ?:
+				bch2_disk_accounting_mod2(trans, false, v, dev_data_type,
+							  .dev = ca->dev_idx,
+							  .data_type = BCH_DATA_free)) ?:
 			bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets);
 		if (ret)
 			goto err;
···
 			return ca;
 	return ERR_PTR(-BCH_ERR_ENOENT_dev_not_found);
 }
+
+/* blk_holder_ops: */
+
+static struct bch_fs *bdev_get_fs(struct block_device *bdev)
+	__releases(&bdev->bd_holder_lock)
+{
+	struct bch_sb_handle_holder *holder = bdev->bd_holder;
+	struct bch_fs *c = holder->c;
+
+	if (c && !bch2_ro_ref_tryget(c))
+		c = NULL;
+
+	mutex_unlock(&bdev->bd_holder_lock);
+
+	if (c)
+		wait_event(c->ro_ref_wait, test_bit(BCH_FS_started, &c->flags));
+	return c;
+}
+
+/* returns with ref on ca->ref */
+static struct bch_dev *bdev_to_bch_dev(struct bch_fs *c, struct block_device *bdev)
+{
+	for_each_member_device(c, ca)
+		if (ca->disk_sb.bdev == bdev)
+			return ca;
+	return NULL;
+}
+
+static void bch2_fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
+{
+	struct bch_fs *c = bdev_get_fs(bdev);
+	if (!c)
+		return;
+
+	struct super_block *sb = c->vfs_sb;
+	if (sb) {
+		/*
+		 * Not necessary, c->ro_ref guards against the filesystem being
+		 * unmounted - we only take this to avoid a warning in
+		 * sync_filesystem:
+		 */
+		down_read(&sb->s_umount);
+	}
+
+	down_write(&c->state_lock);
+	struct bch_dev *ca = bdev_to_bch_dev(c, bdev);
+	if (!ca)
+		goto unlock;
+
+	if (bch2_dev_state_allowed(c, ca, BCH_MEMBER_STATE_failed, BCH_FORCE_IF_DEGRADED)) {
+		__bch2_dev_offline(c, ca);
+	} else {
+		if (sb) {
+			if (!surprise)
+				sync_filesystem(sb);
+			shrink_dcache_sb(sb);
+			evict_inodes(sb);
+		}
+
+		bch2_journal_flush(&c->journal);
+		bch2_fs_emergency_read_only(c);
+	}
+
+	bch2_dev_put(ca);
+unlock:
+	if (sb)
+		up_read(&sb->s_umount);
+	up_write(&c->state_lock);
+	bch2_ro_ref_put(c);
+}
+
+static void bch2_fs_bdev_sync(struct block_device *bdev)
+{
+	struct bch_fs *c = bdev_get_fs(bdev);
+	if (!c)
+		return;
+
+	struct super_block *sb = c->vfs_sb;
+	if (sb) {
+		/*
+		 * Not necessary, c->ro_ref guards against the filesystem being
+		 * unmounted - we only take this to avoid a warning in
+		 * sync_filesystem:
+		 */
+		down_read(&sb->s_umount);
+		sync_filesystem(sb);
+		up_read(&sb->s_umount);
+	}
+
+	bch2_ro_ref_put(c);
+}
+
+const struct blk_holder_ops bch2_sb_handle_bdev_ops = {
+	.mark_dead = bch2_fs_bdev_mark_dead,
+	.sync = bch2_fs_bdev_sync,
+};

 /* Filesystem open: */


+2

fs/bcachefs/super.h

···
 int bch2_fs_start(struct bch_fs *);
 struct bch_fs *bch2_fs_open(char * const *, unsigned, struct bch_opts);

+extern const struct blk_holder_ops bch2_sb_handle_bdev_ops;
+
 #endif /* _BCACHEFS_SUPER_H */

+7 -1

fs/bcachefs/super_types.h

···
 #ifndef _BCACHEFS_SUPER_TYPES_H
 #define _BCACHEFS_SUPER_TYPES_H

+struct bch_fs;
+
+struct bch_sb_handle_holder {
+	struct bch_fs *c;
+};
+
 struct bch_sb_handle {
 	struct bch_sb *sb;
 	struct file *s_bdev_file;
 	struct block_device *bdev;
 	char *sb_name;
 	struct bio *bio;
-	void *holder;
+	struct bch_sb_handle_holder *holder;
 	size_t buffer_size;
 	blk_mode_t mode;
 	unsigned have_layout:1;

+77 -64

fs/bcachefs/sysfs.c

··· 146 146 write_attribute(trigger_btree_cache_shrink); 147 147 write_attribute(trigger_btree_key_cache_shrink); 148 148 write_attribute(trigger_freelist_wakeup); 149 + write_attribute(trigger_btree_updates); 149 150 read_attribute(gc_gens_pos); 150 151 151 152 read_attribute(uuid); 152 153 read_attribute(minor); 153 154 read_attribute(flags); 154 - read_attribute(bucket_size); 155 155 read_attribute(first_bucket); 156 156 read_attribute(nbuckets); 157 - rw_attribute(durability); 158 157 read_attribute(io_done); 159 158 read_attribute(io_errors); 160 159 write_attribute(io_errors_reset); ··· 172 173 read_attribute(btree_cache); 173 174 read_attribute(btree_key_cache); 174 175 read_attribute(btree_reserve_cache); 175 - read_attribute(stripes_heap); 176 176 read_attribute(open_buckets); 177 177 read_attribute(open_buckets_partial); 178 - read_attribute(write_points); 179 178 read_attribute(nocow_lock_table); 180 179 181 180 #ifdef BCH_WRITE_REF_DEBUG ··· 206 209 BCH_PERSISTENT_COUNTERS() 207 210 #undef x 208 211 209 - rw_attribute(discard); 210 - read_attribute(state); 211 212 rw_attribute(label); 212 213 213 214 read_attribute(copy_gc_wait); ··· 350 355 if (attr == &sysfs_btree_reserve_cache) 351 356 bch2_btree_reserve_cache_to_text(out, c); 352 357 353 - if (attr == &sysfs_stripes_heap) 354 - bch2_stripes_heap_to_text(out, c); 355 - 356 358 if (attr == &sysfs_open_buckets) 357 359 bch2_open_buckets_to_text(out, c, NULL); 358 360 359 361 if (attr == &sysfs_open_buckets_partial) 360 362 bch2_open_buckets_partial_to_text(out, c); 361 - 362 - if (attr == &sysfs_write_points) 363 - bch2_write_points_to_text(out, c); 364 363 365 364 if (attr == &sysfs_compression_stats) 366 365 bch2_compression_stats_to_text(out, c); ··· 403 414 return -EPERM; 404 415 405 416 /* Debugging: */ 417 + 418 + if (attr == &sysfs_trigger_btree_updates) 419 + queue_work(c->btree_interior_update_worker, &c->btree_interior_update_work); 406 420 407 421 if (!bch2_write_ref_tryget(c, 
BCH_WRITE_REF_sysfs)) 408 422 return -EROFS; ··· 558 566 &sysfs_btree_key_cache, 559 567 &sysfs_btree_reserve_cache, 560 568 &sysfs_new_stripes, 561 - &sysfs_stripes_heap, 562 569 &sysfs_open_buckets, 563 570 &sysfs_open_buckets_partial, 564 - &sysfs_write_points, 565 571 #ifdef BCH_WRITE_REF_DEBUG 566 572 &sysfs_write_refs, 567 573 #endif ··· 575 585 &sysfs_trigger_btree_cache_shrink, 576 586 &sysfs_trigger_btree_key_cache_shrink, 577 587 &sysfs_trigger_freelist_wakeup, 588 + &sysfs_trigger_btree_updates, 578 589 579 590 &sysfs_gc_gens_pos, 580 591 ··· 595 604 596 605 /* options */ 597 606 598 - SHOW(bch2_fs_opts_dir) 607 + static ssize_t sysfs_opt_show(struct bch_fs *c, 608 + struct bch_dev *ca, 609 + enum bch_opt_id id, 610 + struct printbuf *out) 599 611 { 600 - struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir); 601 - const struct bch_option *opt = container_of(attr, struct bch_option, attr); 602 - int id = opt - bch2_opt_table; 603 - u64 v = bch2_opt_get_by_id(&c->opts, id); 612 + const struct bch_option *opt = bch2_opt_table + id; 613 + u64 v; 614 + 615 + if (opt->flags & OPT_FS) { 616 + v = bch2_opt_get_by_id(&c->opts, id); 617 + } else if ((opt->flags & OPT_DEVICE) && opt->get_member) { 618 + v = bch2_opt_from_sb(c->disk_sb.sb, id, ca->dev_idx); 619 + } else { 620 + return -EINVAL; 621 + } 604 622 605 623 bch2_opt_to_text(out, c, c->disk_sb.sb, opt, v, OPT_SHOW_FULL_LIST); 606 624 prt_char(out, '\n'); 607 - 608 625 return 0; 609 626 } 610 627 611 - STORE(bch2_fs_opts_dir) 628 + static ssize_t sysfs_opt_store(struct bch_fs *c, 629 + struct bch_dev *ca, 630 + enum bch_opt_id id, 631 + const char *buf, size_t size) 612 632 { 613 - struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir); 614 - const struct bch_option *opt = container_of(attr, struct bch_option, attr); 615 - int ret, id = opt - bch2_opt_table; 616 - char *tmp; 617 - u64 v; 633 + const struct bch_option *opt = bch2_opt_table + id; 634 + int ret = 0; 618 635 619 636 /* 620 637 
* We don't need to take c->writes for correctness, but it eliminates an ··· 631 632 if (unlikely(!bch2_write_ref_tryget(c, BCH_WRITE_REF_sysfs))) 632 633 return -EROFS; 633 634 634 - tmp = kstrdup(buf, GFP_KERNEL); 635 + down_write(&c->state_lock); 636 + 637 + char *tmp = kstrdup(buf, GFP_KERNEL); 635 638 if (!tmp) { 636 639 ret = -ENOMEM; 637 640 goto err; 638 641 } 639 642 640 - ret = bch2_opt_parse(c, opt, strim(tmp), &v, NULL); 643 + u64 v; 644 + ret = bch2_opt_parse(c, opt, strim(tmp), &v, NULL) ?: 645 + bch2_opt_check_may_set(c, ca, id, v); 641 646 kfree(tmp); 642 647 643 648 if (ret < 0) 644 649 goto err; 645 650 646 - ret = bch2_opt_check_may_set(c, id, v); 647 - if (ret < 0) 648 - goto err; 649 - 650 - bch2_opt_set_sb(c, NULL, opt, v); 651 + bch2_opt_set_sb(c, ca, opt, v); 651 652 bch2_opt_set_by_id(&c->opts, id, v); 652 653 653 654 if (v && 654 655 (id == Opt_background_target || 656 + (id == Opt_foreground_target && !c->opts.background_target) || 655 657 id == Opt_background_compression || 656 658 (id == Opt_compression && !c->opts.background_compression))) 657 659 bch2_set_rebalance_needs_scan(c, 0); ··· 664 664 c->copygc_thread) 665 665 wake_up_process(c->copygc_thread); 666 666 667 + if (id == Opt_discard && !ca) { 668 + mutex_lock(&c->sb_lock); 669 + for_each_member_device(c, ca) 670 + opt->set_member(bch2_members_v2_get_mut(ca->disk_sb.sb, ca->dev_idx), v); 671 + 672 + bch2_write_super(c); 673 + mutex_unlock(&c->sb_lock); 674 + } 675 + 667 676 ret = size; 668 677 err: 678 + up_write(&c->state_lock); 669 679 bch2_write_ref_put(c, BCH_WRITE_REF_sysfs); 670 680 return ret; 681 + } 682 + 683 + SHOW(bch2_fs_opts_dir) 684 + { 685 + struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir); 686 + int id = bch2_opt_lookup(attr->name); 687 + if (id < 0) 688 + return 0; 689 + 690 + return sysfs_opt_show(c, NULL, id, out); 691 + } 692 + 693 + STORE(bch2_fs_opts_dir) 694 + { 695 + struct bch_fs *c = container_of(kobj, struct bch_fs, opts_dir); 696 + int id 
= bch2_opt_lookup(attr->name);
+	if (id < 0)
+		return 0;
+
+	return sysfs_opt_store(c, NULL, id, buf, size);
 }
 SYSFS_OPS(bch2_fs_opts_dir);

 struct attribute *bch2_fs_opts_dir_files[] = { NULL };

-int bch2_opts_create_sysfs_files(struct kobject *kobj)
+int bch2_opts_create_sysfs_files(struct kobject *kobj, unsigned type)
 {
-	const struct bch_option *i;
-	int ret;
-
-	for (i = bch2_opt_table;
+	for (const struct bch_option *i = bch2_opt_table;
 	     i < bch2_opt_table + bch2_opts_nr;
 	     i++) {
-		if (!(i->flags & OPT_FS))
+		if (i->flags & OPT_HIDDEN)
+			continue;
+		if (!(i->flags & type))
 			continue;

-		ret = sysfs_create_file(kobj, &i->attr);
+		int ret = sysfs_create_file(kobj, &i->attr);
 		if (ret)
 			return ret;
 	}
···

 	sysfs_printf(uuid,		"%pU\n", ca->uuid.b);

-	sysfs_print(bucket_size,	bucket_bytes(ca));
 	sysfs_print(first_bucket,	ca->mi.first_bucket);
 	sysfs_print(nbuckets,		ca->mi.nbuckets);
-	sysfs_print(durability,		ca->mi.durability);
-	sysfs_print(discard,		ca->mi.discard);

 	if (attr == &sysfs_label) {
 		if (ca->mi.group)
···

 	if (attr == &sysfs_has_data) {
 		prt_bitflags(out, __bch2_data_types, bch2_dev_has_data(c, ca));
-		prt_char(out, '\n');
-	}
-
-	if (attr == &sysfs_state) {
-		prt_string_option(out, bch2_member_states, ca->mi.state);
 		prt_char(out, '\n');
 	}

···
 	if (attr == &sysfs_open_buckets)
 		bch2_open_buckets_to_text(out, c, ca);

+	int opt_id = bch2_opt_lookup(attr->name);
+	if (opt_id >= 0)
+		return sysfs_opt_show(c, ca, opt_id, out);
+
 	return 0;
 }

···
 {
 	struct bch_dev *ca = container_of(kobj, struct bch_dev, kobj);
 	struct bch_fs *c = ca->fs;
-
-	if (attr == &sysfs_discard) {
-		bool v = strtoul_or_return(buf);
-
-		bch2_opt_set_sb(c, ca, bch2_opt_table + Opt_discard, v);
-	}
-
-	if (attr == &sysfs_durability) {
-		u64 v = strtoul_or_return(buf);
-
-		bch2_opt_set_sb(c, ca, bch2_opt_table + Opt_durability, v);
-	}

 	if (attr == &sysfs_label) {
 		char *tmp;
···
 	if (attr == &sysfs_io_errors_reset)
 		bch2_dev_errors_reset(ca);

+	int opt_id = bch2_opt_lookup(attr->name);
+	if (opt_id >= 0)
+		return sysfs_opt_store(c, ca, opt_id, buf, size);
+
 	return size;
 }
 SYSFS_OPS(bch2_dev);

 struct attribute *bch2_dev_files[] = {
 	&sysfs_uuid,
-	&sysfs_bucket_size,
 	&sysfs_first_bucket,
 	&sysfs_nbuckets,
-	&sysfs_durability,

 	/* settings: */
-	&sysfs_discard,
-	&sysfs_state,
 	&sysfs_label,

 	&sysfs_has_data,
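
The reworked `bch2_opts_create_sysfs_files()` loop above skips any option flagged `OPT_HIDDEN` and then keeps only options whose flags intersect the caller's `type` mask. A minimal userspace sketch of that filter follows. The `demo_*` names, the `DEMO_OPT_*` bit values and the table contents are hypothetical and chosen only for illustration; the real flag definitions live in the bcachefs option headers and are not shown in this diff.

```c
#include <stddef.h>

/* Hypothetical flag bits, for illustration only. */
enum {
	DEMO_OPT_FS	= 1 << 0,
	DEMO_OPT_DEVICE	= 1 << 1,
	DEMO_OPT_HIDDEN	= 1 << 2,
};

struct demo_option {
	const char	*name;
	unsigned	flags;
};

/* Hypothetical option table standing in for bch2_opt_table. */
static const struct demo_option demo_table[] = {
	{ "fs_only",	DEMO_OPT_FS },
	{ "dev_only",	DEMO_OPT_DEVICE },
	{ "both",	DEMO_OPT_FS | DEMO_OPT_DEVICE },
	{ "hidden_dev",	DEMO_OPT_DEVICE | DEMO_OPT_HIDDEN },
};

/*
 * Count the entries that pass the same two checks as the loop in the diff:
 * hidden options are skipped first, then the type mask is applied.
 */
static size_t demo_count_visible(const struct demo_option *tbl, size_t nr,
				 unsigned type)
{
	size_t n = 0;

	for (const struct demo_option *i = tbl; i < tbl + nr; i++) {
		if (i->flags & DEMO_OPT_HIDDEN)
			continue;
		if (!(i->flags & type))
			continue;
		n++;
	}
	return n;
}
```

With this table, a filesystem mask and a device mask each select two entries, and the hidden entry is never selected regardless of the mask.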

+3 -2

fs/bcachefs/sysfs.h

···
 extern const struct sysfs_ops bch2_fs_time_stats_sysfs_ops;
 extern const struct sysfs_ops bch2_dev_sysfs_ops;

-int bch2_opts_create_sysfs_files(struct kobject *);
+int bch2_opts_create_sysfs_files(struct kobject *, unsigned);

 #else

···
 static const struct sysfs_ops bch2_fs_time_stats_sysfs_ops;
 static const struct sysfs_ops bch2_dev_sysfs_ops;

-static inline int bch2_opts_create_sysfs_files(struct kobject *kobj) { return 0; }
+static inline int bch2_opts_create_sysfs_files(struct kobject *kobj, unsigned type)
+{ return 0; }

 #endif /* NO_BCACHEFS_SYSFS */

+41 -64

fs/bcachefs/trace.h

···

 /* io.c: */

-DEFINE_EVENT(bio, read_promote,
+DEFINE_EVENT(bio, io_read_promote,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
 );

-TRACE_EVENT(read_nopromote,
+TRACE_EVENT(io_read_nopromote,
 	TP_PROTO(struct bch_fs *c, int ret),
 	TP_ARGS(c, ret),

···
 		  __entry->ret)
 );

-DEFINE_EVENT(bio, read_bounce,
+DEFINE_EVENT(bio, io_read_bounce,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
 );

-DEFINE_EVENT(bio, read_split,
+DEFINE_EVENT(bio, io_read_split,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
 );

-DEFINE_EVENT(bio, read_retry,
+DEFINE_EVENT(bio, io_read_retry,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
 );

-DEFINE_EVENT(bio, read_reuse_race,
+DEFINE_EVENT(bio, io_read_reuse_race,
 	TP_PROTO(struct bio *bio),
 	TP_ARGS(bio)
+);
+
+/* ec.c */
+
+TRACE_EVENT(stripe_create,
+	TP_PROTO(struct bch_fs *c, u64 idx, int ret),
+	TP_ARGS(c, idx, ret),
+
+	TP_STRUCT__entry(
+		__field(dev_t,		dev			)
+		__field(u64,		idx			)
+		__field(int,		ret			)
+	),
+
+	TP_fast_assign(
+		__entry->dev			= c->dev;
+		__entry->idx			= idx;
+		__entry->ret			= ret;
+	),
+
+	TP_printk("%d,%d idx %llu ret %i",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->idx,
+		  __entry->ret)
 );

 /* Journal */
···

 /* Moving IO */

-TRACE_EVENT(bucket_evacuate,
-	TP_PROTO(struct bch_fs *c, struct bpos *bucket),
-	TP_ARGS(c, bucket),
-
-	TP_STRUCT__entry(
-		__field(dev_t,		dev			)
-		__field(u32,		dev_idx			)
-		__field(u64,		bucket			)
-	),
-
-	TP_fast_assign(
-		__entry->dev		= c->dev;
-		__entry->dev_idx	= bucket->inode;
-		__entry->bucket		= bucket->offset;
-	),
-
-	TP_printk("%d:%d %u:%llu",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->dev_idx, __entry->bucket)
-);
-
-DEFINE_EVENT(fs_str, move_extent,
+DEFINE_EVENT(fs_str, io_move,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_read,
+DEFINE_EVENT(fs_str, io_move_read,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_write,
+DEFINE_EVENT(fs_str, io_move_write,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_finish,
+DEFINE_EVENT(fs_str, io_move_finish,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_fail,
+DEFINE_EVENT(fs_str, io_move_fail,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );

-DEFINE_EVENT(fs_str, move_extent_start_fail,
+DEFINE_EVENT(fs_str, io_move_write_fail,
+	TP_PROTO(struct bch_fs *c, const char *str),
+	TP_ARGS(c, str)
+);
+
+DEFINE_EVENT(fs_str, io_move_start_fail,
 	TP_PROTO(struct bch_fs *c, const char *str),
 	TP_ARGS(c, str)
 );
···
 		  __entry->sectors_seen,
 		  __entry->sectors_moved,
 		  __entry->sectors_raced)
-);
-
-TRACE_EVENT(evacuate_bucket,
-	TP_PROTO(struct bch_fs *c, struct bpos *bucket,
-		 unsigned sectors, unsigned bucket_size,
-		 int ret),
-	TP_ARGS(c, bucket, sectors, bucket_size, ret),
-
-	TP_STRUCT__entry(
-		__field(dev_t,		dev		)
-		__field(u64,		member		)
-		__field(u64,		bucket		)
-		__field(u32,		sectors		)
-		__field(u32,		bucket_size	)
-		__field(int,		ret		)
-	),
-
-	TP_fast_assign(
-		__entry->dev			= c->dev;
-		__entry->member			= bucket->inode;
-		__entry->bucket			= bucket->offset;
-		__entry->sectors		= sectors;
-		__entry->bucket_size		= bucket_size;
-		__entry->ret			= ret;
-	),
-
-	TP_printk("%d,%d %llu:%llu sectors %u/%u ret %i",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->member, __entry->bucket,
-		  __entry->sectors, __entry->bucket_size,
-		  __entry->ret)
 );

 TRACE_EVENT(copygc,

+189 -42

fs/bcachefs/util.c

···
 		u64 last_q = 0;

 		prt_printf(out, "quantiles (%s):\t", u->name);
-		eytzinger0_for_each(i, NR_QUANTILES) {
-			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
+		eytzinger0_for_each(j, NR_QUANTILES) {
+			bool is_last = eytzinger0_next(j, NR_QUANTILES) == -1;

-			u64 q = max(quantiles->entries[i].m, last_q);
+			u64 q = max(quantiles->entries[j].m, last_q);
 			prt_printf(out, "%llu ", div64_u64(q, u->nsecs));
 			if (is_last)
 				prt_newline(out);
···
 	}
 }

+#ifdef CONFIG_BCACHEFS_DEBUG
+void bch2_corrupt_bio(struct bio *bio)
+{
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	unsigned offset = get_random_u32_below(bio->bi_iter.bi_size / sizeof(u64));
+
+	bio_for_each_segment(bv, bio, iter) {
+		unsigned u64s = bv.bv_len / sizeof(u64);
+
+		if (offset < u64s) {
+			u64 *segment = bvec_kmap_local(&bv);
+			segment[offset] = get_random_u64();
+			kunmap_local(segment);
+			return;
+		}
+		offset -= u64s;
+	}
+}
+#endif
+
 #if 0
 void eytzinger1_test(void)
 {
-	unsigned inorder, eytz, size;
+	unsigned inorder, size;

-	pr_info("1 based eytzinger test:");
+	pr_info("1 based eytzinger test:\n");

 	for (size = 2;
 	     size < 65536;
···
 		unsigned extra = eytzinger1_extra(size);

 		if (!(size % 4096))
-			pr_info("tree size %u", size);
-
-		BUG_ON(eytzinger1_prev(0, size) != eytzinger1_last(size));
-		BUG_ON(eytzinger1_next(0, size) != eytzinger1_first(size));
-
-		BUG_ON(eytzinger1_prev(eytzinger1_first(size), size)	!= 0);
-		BUG_ON(eytzinger1_next(eytzinger1_last(size), size)	!= 0);
+			pr_info("tree size %u\n", size);

 		inorder = 1;
 		eytzinger1_for_each(eytz, size) {
···

 			inorder++;
 		}
+		BUG_ON(inorder - 1 != size);
 	}
 }

 void eytzinger0_test(void)
 {

-	unsigned inorder, eytz, size;
+	unsigned inorder, size;

-	pr_info("0 based eytzinger test:");
+	pr_info("0 based eytzinger test:\n");

 	for (size = 1;
 	     size < 65536;
···
 		unsigned extra = eytzinger0_extra(size);

 		if (!(size % 4096))
-			pr_info("tree size %u", size);
-
-		BUG_ON(eytzinger0_prev(-1, size) != eytzinger0_last(size));
-		BUG_ON(eytzinger0_next(-1, size) != eytzinger0_first(size));
-
-		BUG_ON(eytzinger0_prev(eytzinger0_first(size), size)	!= -1);
-		BUG_ON(eytzinger0_next(eytzinger0_last(size), size)	!= -1);
+			pr_info("tree size %u\n", size);

 		inorder = 0;
 		eytzinger0_for_each(eytz, size) {
···

 			inorder++;
 		}
+		BUG_ON(inorder != size);
+
+		inorder = size - 1;
+		eytzinger0_for_each_prev(eytz, size) {
+			BUG_ON(eytz != eytzinger0_first(size) &&
+			       eytzinger0_next(eytzinger0_prev(eytz, size), size) != eytz);
+
+			inorder--;
+		}
+		BUG_ON(inorder != -1);
 	}
 }

-static inline int cmp_u16(const void *_l, const void *_r, size_t size)
+static inline int cmp_u16(const void *_l, const void *_r)
 {
 	const u16 *l = _l, *r = _r;

-	return (*l > *r) - (*r - *l);
+	return (*l > *r) - (*r > *l);
 }

-static void eytzinger0_find_test_val(u16 *test_array, unsigned nr, u16 search)
+static void eytzinger0_find_test_le(u16 *test_array, unsigned nr, u16 search)
 {
-	int i, c1 = -1, c2 = -1;
-	ssize_t r;
+	int r, s;
+	bool bad;

 	r = eytzinger0_find_le(test_array, nr,
 			       sizeof(test_array[0]),
 			       cmp_u16, &search);
-	if (r >= 0)
-		c1 = test_array[r];
-
-	for (i = 0; i < nr; i++)
-		if (test_array[i] <= search && test_array[i] > c2)
-			c2 = test_array[i];
-
-	if (c1 != c2) {
-		eytzinger0_for_each(i, nr)
-			pr_info("[%3u] = %12u", i, test_array[i]);
-		pr_info("find_le(%2u) -> [%2zi] = %2i should be %2i",
-			i, r, c1, c2);
+	if (r >= 0) {
+		if (test_array[r] > search) {
+			bad = true;
+		} else {
+			s = eytzinger0_next(r, nr);
+			bad = s >= 0 && test_array[s] <= search;
+		}
+	} else {
+		s = eytzinger0_last(nr);
+		bad = s >= 0 && test_array[s] <= search;
 	}
+
+	if (bad) {
+		s = -1;
+		eytzinger0_for_each_prev(j, nr) {
+			if (test_array[j] <= search) {
+				s = j;
+				break;
+			}
+		}
+
+		eytzinger0_for_each(j, nr)
+			pr_info("[%3u] = %12u\n", j, test_array[j]);
+		pr_info("find_le(%12u) = %3i should be %3i\n",
+			search, r, s);
+		BUG();
+	}
+}
+
+static void eytzinger0_find_test_gt(u16 *test_array, unsigned nr, u16 search)
+{
+	int r, s;
+	bool bad;
+
+	r = eytzinger0_find_gt(test_array, nr,
+			       sizeof(test_array[0]),
+			       cmp_u16, &search);
+	if (r >= 0) {
+		if (test_array[r] <= search) {
+			bad = true;
+		} else {
+			s = eytzinger0_prev(r, nr);
+			bad = s >= 0 && test_array[s] > search;
+		}
+	} else {
+		s = eytzinger0_first(nr);
+		bad = s >= 0 && test_array[s] > search;
+	}
+
+	if (bad) {
+		s = -1;
+		eytzinger0_for_each(j, nr) {
+			if (test_array[j] > search) {
+				s = j;
+				break;
+			}
+		}
+
+		eytzinger0_for_each(j, nr)
+			pr_info("[%3u] = %12u\n", j, test_array[j]);
+		pr_info("find_gt(%12u) = %3i should be %3i\n",
+			search, r, s);
+		BUG();
+	}
+}
+
+static void eytzinger0_find_test_ge(u16 *test_array, unsigned nr, u16 search)
+{
+	int r, s;
+	bool bad;
+
+	r = eytzinger0_find_ge(test_array, nr,
+			       sizeof(test_array[0]),
+			       cmp_u16, &search);
+	if (r >= 0) {
+		if (test_array[r] < search) {
+			bad = true;
+		} else {
+			s = eytzinger0_prev(r, nr);
+			bad = s >= 0 && test_array[s] >= search;
+		}
+	} else {
+		s = eytzinger0_first(nr);
+		bad = s >= 0 && test_array[s] >= search;
+	}
+
+	if (bad) {
+		s = -1;
+		eytzinger0_for_each(j, nr) {
+			if (test_array[j] >= search) {
+				s = j;
+				break;
+			}
+		}
+
+		eytzinger0_for_each(j, nr)
+			pr_info("[%3u] = %12u\n", j, test_array[j]);
+		pr_info("find_ge(%12u) = %3i should be %3i\n",
+			search, r, s);
+		BUG();
+	}
+}
+
+static void eytzinger0_find_test_eq(u16 *test_array, unsigned nr, u16 search)
+{
+	unsigned r;
+	int s;
+	bool bad;
+
+	r = eytzinger0_find(test_array, nr,
+			    sizeof(test_array[0]),
+			    cmp_u16, &search);
+
+	if (r < nr) {
+		bad = test_array[r] != search;
+	} else {
+		s = eytzinger0_find_le(test_array, nr,
+				       sizeof(test_array[0]),
+				       cmp_u16, &search);
+		bad = s >= 0 && test_array[s] == search;
+	}
+
+	if (bad) {
+		eytzinger0_for_each(j, nr)
+			pr_info("[%3u] = %12u\n", j, test_array[j]);
+		pr_info("find(%12u) = %3i is incorrect\n",
+			search, r);
+		BUG();
+	}
+}
+
+static void eytzinger0_find_test_val(u16 *test_array, unsigned nr, u16 search)
+{
+	eytzinger0_find_test_le(test_array, nr, search);
+	eytzinger0_find_test_gt(test_array, nr, search);
+	eytzinger0_find_test_ge(test_array, nr, search);
+	eytzinger0_find_test_eq(test_array, nr, search);
 }

 void eytzinger0_find_test(void)
···
 	u16 *test_array = kmalloc_array(allocated, sizeof(test_array[0]), GFP_KERNEL);

 	for (nr = 1; nr < allocated; nr++) {
-		pr_info("testing %u elems", nr);
+		u16 prev = 0;
+
+		pr_info("testing %u elems\n", nr);

 		get_random_bytes(test_array, nr * sizeof(test_array[0]));
 		eytzinger0_sort(test_array, nr, sizeof(test_array[0]), cmp_u16, NULL);

 		/* verify array is sorted correctly: */
-		eytzinger0_for_each(i, nr)
-			BUG_ON(i != eytzinger0_last(nr) &&
-			       test_array[i] > test_array[eytzinger0_next(i, nr)]);
+		eytzinger0_for_each(j, nr) {
+			BUG_ON(test_array[j] < prev);
+			prev = test_array[j];
+		}

 		for (i = 0; i < U16_MAX; i += 1 << 12)
 			eytzinger0_find_test_val(test_array, nr, i);
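
The `cmp_u16()` change above replaces the subtraction with a second comparison, so the function now returns exactly -1, 0 or 1, and it drops the unused `size` argument. The sketch below reproduces that comparator in userspace, assuming `uint16_t` in place of the kernel's `u16` and the C library's `qsort()` in place of `eytzinger0_sort()`. The `sort_u16()` helper is hypothetical and exists only to exercise the comparator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison: -1 if *l < *r, 0 if equal, 1 if *l > *r. */
static int cmp_u16(const void *_l, const void *_r)
{
	const uint16_t *l = _l, *r = _r;

	return (*l > *r) - (*r > *l);
}

/* Hypothetical helper: sort an array ascending with the comparator above. */
static void sort_u16(uint16_t *v, size_t nr)
{
	qsort(v, nr, sizeof(v[0]), cmp_u16);
}
```

Because each comparison yields 0 or 1, the result never depends on the magnitude of the operands, which keeps the idiom safe to copy to wider integer types where a plain subtraction could overflow.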

+14 -2

fs/bcachefs/util.h

···
 void memcpy_to_bio(struct bio *, struct bvec_iter, const void *);
 void memcpy_from_bio(void *, struct bio *, struct bvec_iter);

+#ifdef CONFIG_BCACHEFS_DEBUG
+void bch2_corrupt_bio(struct bio *);
+
+static inline void bch2_maybe_corrupt_bio(struct bio *bio, unsigned ratio)
+{
+	if (ratio && !get_random_u32_below(ratio))
+		bch2_corrupt_bio(bio);
+}
+#else
+#define bch2_maybe_corrupt_bio(...)	do {} while (0)
+#endif
+
 static inline void memcpy_u64s_small(void *dst, const void *src,
 				     unsigned u64s)
 {
···
 static inline void __memcpy_u64s(void *dst, const void *src,
 				 unsigned u64s)
 {
-#ifdef CONFIG_X86_64
+#if defined(CONFIG_X86_64) && !defined(CONFIG_KMSAN)
 	long d0, d1, d2;

 	asm volatile("rep ; movsq"
···
 	u64 *dst = (u64 *) _dst + u64s - 1;
 	u64 *src = (u64 *) _src + u64s - 1;

-#ifdef CONFIG_X86_64
+#if defined(CONFIG_X86_64) && !defined(CONFIG_KMSAN)
 	long d0, d1, d2;

 	asm volatile("std ;\n"
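
The `bch2_maybe_corrupt_bio()` helper above gates fault injection on `ratio`: a ratio of 0 disables it, and otherwise the bio is corrupted when a random draw from `[0, ratio)` comes up 0, which is roughly one call in `ratio`. A minimal userspace sketch of that gate follows. The `demo_*` names are hypothetical, and `rand()` is used only as a stand-in for the kernel's `get_random_u32_below()`.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Stand-in for get_random_u32_below(): returns a value in [0, ceil).
 * rand() with a modulo is slightly biased and is for illustration only.
 */
static unsigned demo_random_below(unsigned ceil)
{
	return (unsigned) rand() % ceil;
}

/*
 * Mirrors the condition in the diff: ratio == 0 short-circuits before the
 * random draw, so there is no division by zero and injection stays off.
 */
static bool demo_should_inject(unsigned ratio)
{
	return ratio && !demo_random_below(ratio);
}
```

Under these assumptions a ratio of 1 fires on every call, because the only possible draw from `[0, 1)` is 0, and larger ratios make injection proportionally rarer.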

+1 -1

fs/bcachefs/xattr.c

···
 		if (ret < 0)
 			goto err_class_exit;

-		ret = bch2_opt_check_may_set(c, opt_id, v);
+		ret = bch2_opt_check_may_set(c, NULL, opt_id, v);
 		if (ret < 0)
 			goto err_class_exit;
