Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.18/block

Pull MD changes from Yu Kuai:

"Redundant data is used to enhance data fault tolerance, and the storage
methods for redundant data vary depending on the RAID level. It is
important to maintain the consistency of redundant data.

Bitmap is used to record which data blocks have been synchronized and
which ones need to be resynchronized or recovered. Each bit in the
bitmap represents a segment of data in the array. When a bit is set,
it indicates that the multiple redundant copies of that data segment
may not be consistent. Data synchronization can be performed based on
the bitmap after power failure or readding a disk. If there is no
bitmap, a full disk synchronization is required.

Due to known performance issues with md-bitmap and the unreasonable
implementations:

- self-managed IO submitting like filemap_write_page();
- global spin_lock

I have decided not to continue optimizing the current bitmap
implementation. This new bitmap is designed without locking in the IO fast
path and can be used with fast disks.

Key features for the new bitmap:
- IO fastpath is lockless, if user issues lots of write IO to the same
bitmap bit in a short time, only the first write has additional
overhead to update bitmap bit, no additional overhead for the
following writes;
- supports resyncing or recovering only written data, meaning that when
creating a new array or replacing a disk with a new one, there is no need
to do a full disk resync/recovery;"
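
The bitmap idea in the message above can be sketched in a few lines of userspace C. This is a minimal model, not the kernel implementation: the names (`model_bitmap`, `model_start_write`, `MODEL_CHUNK_SECTORS`) are hypothetical, it ignores concurrency entirely, and it only illustrates two points from the message: each entry covers one segment of the array, and only the first write to a clean segment pays for a bitmap update.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
enum chunk_state { CHUNK_CLEAN = 0, CHUNK_DIRTY = 1 };

#define MODEL_CHUNKS        16
#define MODEL_CHUNK_SECTORS 128	/* assumed: 64 KiB chunks of 512-byte sectors */

struct model_bitmap {
	unsigned char state[MODEL_CHUNKS];	/* one entry per chunk of the array */
	unsigned int metadata_writes;		/* bitmap updates that had to be persisted */
};

/*
 * Record write intent for the chunk that contains @sector.
 * Returns 1 if the bitmap had to be updated, 0 if the chunk was already
 * marked dirty and the write can proceed with no extra work.
 */
static int model_start_write(struct model_bitmap *bm, unsigned long sector)
{
	size_t chunk = sector / MODEL_CHUNK_SECTORS;

	assert(chunk < MODEL_CHUNKS);
	if (bm->state[chunk] == CHUNK_DIRTY)
		return 0;

	bm->state[chunk] = CHUNK_DIRTY;
	bm->metadata_writes++;
	return 1;
}

/* Number of chunks that would need resync after an unclean shutdown. */
static unsigned int model_chunks_to_resync(const struct model_bitmap *bm)
{
	unsigned int n = 0;
	size_t i;

	for (i = 0; i < MODEL_CHUNKS; i++)
		if (bm->state[i] == CHUNK_DIRTY)
			n++;
	return n;
}
```

With a zero-initialised `struct model_bitmap`, writes to sectors 0, 8 and 127 trigger a single bitmap update because they share a chunk, and a resync afterwards would only need to visit that chunk rather than the whole disk.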

* tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: (24 commits)
md/md-llbitmap: introduce new lockless bitmap
md/md-bitmap: make method bitmap_ops->daemon_work optional
md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
md/md-bitmap: add a new method blocks_synced() in bitmap_operations
md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
md/md-bitmap: delay registration of bitmap_ops until creating bitmap
md/md-bitmap: add a new sysfs api bitmap_type
md: add a new mddev field 'bitmap_id'
md/md-bitmap: support discard for bitmap ops
md: factor out a helper raid_is_456()
md: add a new parameter 'offset' to md_super_write()
md/md-bitmap: introduce CONFIG_MD_BITMAP
md: check before referencing mddev->bitmap_ops
md/dm-raid: check before referencing mddev->bitmap_ops
md/raid5: check before referencing mddev->bitmap_ops
md/raid10: check before referencing mddev->bitmap_ops
md/raid1: check before referencing mddev->bitmap_ops
md/raid1: check bitmap before behind write
md/md-bitmap: handle the case bitmap is not enabled before end_sync()
md/md-bitmap: handle the case bitmap is not enabled before start_sync()
...

+2313 -244
+61 -25
Documentation/admin-guide/md.rst
··· 347 347 active-idle 348 348 like active, but no writes have been seen for a while (safe_mode_delay). 349 349 350 + consistency_policy 351 + This indicates how the array maintains consistency in case of unexpected 352 + shutdown. It can be: 353 + 354 + none 355 + Array has no redundancy information, e.g. raid0, linear. 356 + 357 + resync 358 + Full resync is performed and all redundancy is regenerated when the 359 + array is started after unclean shutdown. 360 + 361 + bitmap 362 + Resync assisted by a write-intent bitmap. 363 + 364 + journal 365 + For raid4/5/6, journal device is used to log transactions and replay 366 + after unclean shutdown. 367 + 368 + ppl 369 + For raid5 only, Partial Parity Log is used to close the write hole and 370 + eliminate resync. 371 + 372 + The accepted values when writing to this file are ``ppl`` and ``resync``, 373 + used to enable and disable PPL. 374 + 375 + uuid 376 + This indicates the UUID of the array in the following format: 377 + xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 378 + 379 + bitmap_type 380 + [RW] When read, this file will display the current and available 381 + bitmap for this array. The currently active bitmap will be enclosed 382 + in [] brackets. Writing an bitmap name or ID to this file will switch 383 + control of this array to that new bitmap. Note that writing a new 384 + bitmap for created array is forbidden. 385 + 386 + none 387 + No bitmap 388 + bitmap 389 + The default internal bitmap 390 + llbitmap 391 + The lockless internal bitmap 392 + 393 + If bitmap_type is not none, then additional bitmap attributes bitmap/xxx or 394 + llbitmap/xxx will be created after md device KOBJ_CHANGE event. 395 + 396 + If bitmap_type is bitmap, then the md device will also contain: 397 + 350 398 bitmap/location 351 399 This indicates where the write-intent bitmap for the array is 352 400 stored. ··· 449 401 once the array becomes non-degraded, and this fact has been 450 402 recorded in the metadata. 
451 403 452 - consistency_policy 453 - This indicates how the array maintains consistency in case of unexpected 454 - shutdown. It can be: 404 + If bitmap_type is llbitmap, then the md device will also contain: 455 405 456 - none 457 - Array has no redundancy information, e.g. raid0, linear. 406 + llbitmap/bits 407 + This is read-only, show status of bitmap bits, the number of each 408 + value. 458 409 459 - resync 460 - Full resync is performed and all redundancy is regenerated when the 461 - array is started after unclean shutdown. 410 + llbitmap/metadata 411 + This is read-only, show bitmap metadata, include chunksize, chunkshift, 412 + chunks, offset and daemon_sleep. 462 413 463 - bitmap 464 - Resync assisted by a write-intent bitmap. 414 + llbitmap/daemon_sleep 415 + This is read-write, time in seconds that daemon function will be 416 + triggered to clear dirty bits. 465 417 466 - journal 467 - For raid4/5/6, journal device is used to log transactions and replay 468 - after unclean shutdown. 469 - 470 - ppl 471 - For raid5 only, Partial Parity Log is used to close the write hole and 472 - eliminate resync. 473 - 474 - The accepted values when writing to this file are ``ppl`` and ``resync``, 475 - used to enable and disable PPL. 476 - 477 - uuid 478 - This indicates the UUID of the array in the following format: 479 - xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 480 - 418 + llbitmap/barrier_idle 419 + This is read-write, time in seconds that page barrier will be idled, 420 + means dirty bits in the page will be cleared. 481 421 482 422 As component devices are added to an md array, they appear in the ``md`` 483 423 directory as new directories named::
+29
drivers/md/Kconfig
···

 	  If unsure, say N.

+config MD_BITMAP
+	bool "MD RAID bitmap support"
+	default y
+	depends on BLK_DEV_MD
+	help
+	  If you say Y here, support for the write intent bitmap will be
+	  enabled. The bitmap can be used to optimize resync speed after power
+	  failure or readding a disk, limiting it to recorded dirty sectors in
+	  bitmap.
+
+	  This feature can be added to existing MD array or MD array can be
+	  created with bitmap via mdadm(8).
+
+	  If unsure, say Y.
+
+config MD_LLBITMAP
+	bool "MD RAID lockless bitmap support"
+	depends on BLK_DEV_MD
+	help
+	  If you say Y here, support for the lockless write intent bitmap will
+	  be enabled.
+
+	  Note, this is an experimental feature.
+
+	  If unsure, say N.
+
 config MD_AUTODETECT
 	bool "Autodetect RAID arrays during kernel boot"
 	depends on BLK_DEV_MD=y
···
 config MD_BITMAP_FILE
 	bool "MD bitmap file support (deprecated)"
 	default y
+	depends on MD_BITMAP
 	help
 	  If you say Y here, support for write intent bitmaps in files on an
 	  external file system is enabled. This is an alternative to the internal
···

 config MD_CLUSTER
 	tristate "Cluster Support for MD"
+	select MD_BITMAP
 	depends on BLK_DEV_MD
 	depends on DLM
 	default n
···
 	select MD_RAID1
 	select MD_RAID10
 	select MD_RAID456
+	select MD_BITMAP
 	select BLK_DEV_MD
 	help
 	  A dm target that supports RAID1, RAID10, RAID4, RAID5 and RAID6 mappings
+3 -1
drivers/md/Makefile
···
 dm-verity-y += dm-verity-target.o
 dm-zoned-y += dm-zoned-target.o dm-zoned-metadata.o dm-zoned-reclaim.o

-md-mod-y += md.o md-bitmap.o
+md-mod-y += md.o
+md-mod-$(CONFIG_MD_BITMAP) += md-bitmap.o
+md-mod-$(CONFIG_MD_LLBITMAP) += md-llbitmap.o
 raid456-y += raid5.o raid5-cache.o raid5-ppl.o
 linear-y += md-linear.o

+11 -7
drivers/md/dm-raid.c
···
 	    !test_and_set_bit(RT_FLAG_RS_BITMAP_LOADED, &rs->runtime_flags)) {
 		struct mddev *mddev = &rs->md;

-		r = mddev->bitmap_ops->load(mddev);
-		if (r)
-			DMERR("Failed to load bitmap");
+		if (md_bitmap_enabled(mddev, false)) {
+			r = mddev->bitmap_ops->load(mddev);
+			if (r)
+				DMERR("Failed to load bitmap");
+		}
 	}

 	return r;
···
 	     mddev->bitmap_info.chunksize != to_bytes(rs->requested_bitmap_chunk_sectors)))) {
 		int chunksize = to_bytes(rs->requested_bitmap_chunk_sectors) ?: mddev->bitmap_info.chunksize;

-		r = mddev->bitmap_ops->resize(mddev, mddev->dev_sectors,
-					      chunksize, false);
-		if (r)
-			DMERR("Failed to resize bitmap");
+		if (md_bitmap_enabled(mddev, false)) {
+			r = mddev->bitmap_ops->resize(mddev, mddev->dev_sectors,
+						      chunksize);
+			if (r)
+				DMERR("Failed to resize bitmap");
+		}
 	}

 	/* Check for any resize/reshape on @rs and adjust/initiate */
+44 -45
drivers/md/md-bitmap.c
··· 34 34 #include "md-bitmap.h" 35 35 #include "md-cluster.h" 36 36 37 - #define BITMAP_MAJOR_LO 3 38 - /* version 4 insists the bitmap is in little-endian order 39 - * with version 3, it is host-endian which is non-portable 40 - * Version 5 is currently set only for clustered devices 41 - */ 42 - #define BITMAP_MAJOR_HI 4 43 - #define BITMAP_MAJOR_CLUSTERED 5 44 - #define BITMAP_MAJOR_HOSTENDIAN 3 45 - 46 37 /* 47 38 * in-memory bitmap: 48 39 * ··· 215 224 int cluster_slot; 216 225 }; 217 226 227 + static struct workqueue_struct *md_bitmap_wq; 228 + 218 229 static int __bitmap_resize(struct bitmap *bitmap, sector_t blocks, 219 230 int chunksize, bool init); 220 231 ··· 225 232 return bitmap->mddev ? mdname(bitmap->mddev) : "mdX"; 226 233 } 227 234 228 - static bool __bitmap_enabled(struct bitmap *bitmap) 235 + static bool bitmap_enabled(void *data, bool flush) 229 236 { 230 - return bitmap->storage.filemap && 231 - !test_bit(BITMAP_STALE, &bitmap->flags); 232 - } 237 + struct bitmap *bitmap = data; 233 238 234 - static bool bitmap_enabled(struct mddev *mddev) 235 - { 236 - struct bitmap *bitmap = mddev->bitmap; 239 + if (!flush) 240 + return true; 237 241 238 - if (!bitmap) 239 - return false; 240 - 241 - return __bitmap_enabled(bitmap); 242 + /* 243 + * If caller want to flush bitmap pages to underlying disks, check if 244 + * there are cached pages in filemap. 
245 + */ 246 + return !test_bit(BITMAP_STALE, &bitmap->flags) && 247 + bitmap->storage.filemap != NULL; 242 248 } 243 249 244 250 /* ··· 476 484 return -EINVAL; 477 485 } 478 486 479 - md_super_write(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), page); 487 + md_write_metadata(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), 488 + page, 0); 480 489 return 0; 481 490 } 482 491 ··· 1237 1244 int dirty, need_write; 1238 1245 int writing = 0; 1239 1246 1240 - if (!__bitmap_enabled(bitmap)) 1247 + if (!bitmap_enabled(bitmap, true)) 1241 1248 return; 1242 1249 1243 1250 /* look at each page to see if there are any set bits that need to be ··· 1781 1788 sector_t *blocks, bool degraded) 1782 1789 { 1783 1790 bitmap_counter_t *bmc; 1784 - bool rv; 1791 + bool rv = false; 1785 1792 1786 - if (bitmap == NULL) {/* FIXME or bitmap set as 'failed' */ 1787 - *blocks = 1024; 1788 - return true; /* always resync if no bitmap */ 1789 - } 1790 1793 spin_lock_irq(&bitmap->counts.lock); 1791 - 1792 - rv = false; 1793 1794 bmc = md_bitmap_get_counter(&bitmap->counts, offset, blocks, 0); 1794 1795 if (bmc) { 1795 1796 /* locked */ ··· 1832 1845 bitmap_counter_t *bmc; 1833 1846 unsigned long flags; 1834 1847 1835 - if (bitmap == NULL) { 1836 - *blocks = 1024; 1837 - return; 1838 - } 1839 1848 spin_lock_irqsave(&bitmap->counts.lock, flags); 1840 1849 bmc = md_bitmap_get_counter(&bitmap->counts, offset, blocks, 0); 1841 1850 if (bmc == NULL) ··· 2043 2060 struct bitmap *bitmap = mddev->bitmap; 2044 2061 int bw; 2045 2062 2046 - if (!bitmap) 2047 - return; 2048 - 2049 2063 atomic_inc(&bitmap->behind_writes); 2050 2064 bw = atomic_read(&bitmap->behind_writes); 2051 2065 if (bw > bitmap->behind_writes_used) ··· 2055 2075 static void bitmap_end_behind_write(struct mddev *mddev) 2056 2076 { 2057 2077 struct bitmap *bitmap = mddev->bitmap; 2058 - 2059 - if (!bitmap) 2060 - return; 2061 2078 2062 2079 if (atomic_dec_and_test(&bitmap->behind_writes)) 2063 2080 
wake_up(&bitmap->behind_wait); ··· 2570 2593 return ret; 2571 2594 } 2572 2595 2573 - static int bitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize, 2574 - bool init) 2596 + static int bitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize) 2575 2597 { 2576 2598 struct bitmap *bitmap = mddev->bitmap; 2577 2599 2578 2600 if (!bitmap) 2579 2601 return 0; 2580 2602 2581 - return __bitmap_resize(bitmap, blocks, chunksize, init); 2603 + return __bitmap_resize(bitmap, blocks, chunksize, false); 2582 2604 } 2583 2605 2584 2606 static ssize_t ··· 2966 2990 &max_backlog_used.attr, 2967 2991 NULL 2968 2992 }; 2969 - const struct attribute_group md_bitmap_group = { 2993 + 2994 + static struct attribute_group md_bitmap_group = { 2970 2995 .name = "bitmap", 2971 2996 .attrs = md_bitmap_attrs, 2972 2997 }; 2973 2998 2974 2999 static struct bitmap_operations bitmap_ops = { 3000 + .head = { 3001 + .type = MD_BITMAP, 3002 + .id = ID_BITMAP, 3003 + .name = "bitmap", 3004 + }, 3005 + 2975 3006 .enabled = bitmap_enabled, 2976 3007 .create = bitmap_create, 2977 3008 .resize = bitmap_resize, ··· 2996 3013 2997 3014 .start_write = bitmap_start_write, 2998 3015 .end_write = bitmap_end_write, 3016 + .start_discard = bitmap_start_write, 3017 + .end_discard = bitmap_end_write, 3018 + 2999 3019 .start_sync = bitmap_start_sync, 3000 3020 .end_sync = bitmap_end_sync, 3001 3021 .cond_end_sync = bitmap_cond_end_sync, ··· 3012 3026 .copy_from_slot = bitmap_copy_from_slot, 3013 3027 .set_pages = bitmap_set_pages, 3014 3028 .free = md_bitmap_free, 3029 + 3030 + .group = &md_bitmap_group, 3015 3031 }; 3016 3032 3017 - void mddev_set_bitmap_ops(struct mddev *mddev) 3033 + int md_bitmap_init(void) 3018 3034 { 3019 - mddev->bitmap_ops = &bitmap_ops; 3035 + md_bitmap_wq = alloc_workqueue("md_bitmap", WQ_MEM_RECLAIM | WQ_UNBOUND, 3036 + 0); 3037 + if (!md_bitmap_wq) 3038 + return -ENOMEM; 3039 + 3040 + return register_md_submodule(&bitmap_ops.head); 3041 + } 3042 + 3043 + void 
md_bitmap_exit(void) 3044 + { 3045 + destroy_workqueue(md_bitmap_wq); 3046 + unregister_md_submodule(&bitmap_ops.head); 3020 3047 }
+98 -9
drivers/md/md-bitmap.h
···

 #define BITMAP_MAGIC 0x6d746962

+/*
+ * version 3 is host-endian order, this is deprecated and not used for new
+ * array
+ */
+#define BITMAP_MAJOR_LO 3
+#define BITMAP_MAJOR_HOSTENDIAN 3
+/* version 4 is little-endian order, the default value */
+#define BITMAP_MAJOR_HI 4
+/* version 5 is only used for cluster */
+#define BITMAP_MAJOR_CLUSTERED 5
+/* version 6 is only used for lockless bitmap */
+#define BITMAP_MAJOR_LOCKLESS 6
+
 /* use these for bitmap->flags and bitmap->sb->state bit-fields */
 enum bitmap_state {
-	BITMAP_STALE       = 1,  /* the bitmap file is out of date or had -EIO */
+	BITMAP_STALE       = 1, /* the bitmap file is out of date or had -EIO */
 	BITMAP_WRITE_ERROR = 2, /* A write error has occurred */
+	BITMAP_FIRST_USE   = 3, /* llbitmap is just created */
+	BITMAP_CLEAN       = 4, /* llbitmap is created with assume_clean */
+	BITMAP_DAEMON_BUSY = 5, /* llbitmap daemon is not finished after daemon_sleep */
 	BITMAP_HOSTENDIAN  =15,
 };

···
 	struct file *file;
 };

+typedef void (md_bitmap_fn)(struct mddev *mddev, sector_t offset,
+			    unsigned long sectors);
+
 struct bitmap_operations {
-	bool (*enabled)(struct mddev *mddev);
+	struct md_submodule_head head;
+
+	bool (*enabled)(void *data, bool flush);
 	int (*create)(struct mddev *mddev);
-	int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize,
-		      bool init);
+	int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize);

 	int (*load)(struct mddev *mddev);
 	void (*destroy)(struct mddev *mddev);
···
 	void (*end_behind_write)(struct mddev *mddev);
 	void (*wait_behind_writes)(struct mddev *mddev);

-	void (*start_write)(struct mddev *mddev, sector_t offset,
-			    unsigned long sectors);
-	void (*end_write)(struct mddev *mddev, sector_t offset,
-			  unsigned long sectors);
+	md_bitmap_fn *start_write;
+	md_bitmap_fn *end_write;
+	md_bitmap_fn *start_discard;
+	md_bitmap_fn *end_discard;
+
+	sector_t (*skip_sync_blocks)(struct mddev *mddev, sector_t offset);
+	bool (*blocks_synced)(struct mddev *mddev, sector_t offset);
 	bool (*start_sync)(struct mddev *mddev, sector_t offset,
 			   sector_t *blocks, bool degraded);
 	void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
···
 			       sector_t *hi, bool clear_bits);
 	void (*set_pages)(void *data, unsigned long pages);
 	void (*free)(void *data);
+
+	struct attribute_group *group;
 };

 /* the bitmap API */
-void mddev_set_bitmap_ops(struct mddev *mddev);
+static inline bool md_bitmap_registered(struct mddev *mddev)
+{
+	return mddev->bitmap_ops != NULL;
+}
+
+static inline bool md_bitmap_enabled(struct mddev *mddev, bool flush)
+{
+	/* bitmap_ops must be registered before creating bitmap. */
+	if (!md_bitmap_registered(mddev))
+		return false;
+
+	if (!mddev->bitmap)
+		return false;
+
+	return mddev->bitmap_ops->enabled(mddev->bitmap, flush);
+}
+
+static inline bool md_bitmap_start_sync(struct mddev *mddev, sector_t offset,
+					sector_t *blocks, bool degraded)
+{
+	/* always resync if no bitmap */
+	if (!md_bitmap_enabled(mddev, false)) {
+		*blocks = 1024;
+		return true;
+	}
+
+	return mddev->bitmap_ops->start_sync(mddev, offset, blocks, degraded);
+}
+
+static inline void md_bitmap_end_sync(struct mddev *mddev, sector_t offset,
+				      sector_t *blocks)
+{
+	if (!md_bitmap_enabled(mddev, false)) {
+		*blocks = 1024;
+		return;
+	}
+
+	mddev->bitmap_ops->end_sync(mddev, offset, blocks);
+}
+
+#ifdef CONFIG_MD_BITMAP
+int md_bitmap_init(void);
+void md_bitmap_exit(void);
+#else
+static inline int md_bitmap_init(void)
+{
+	return 0;
+}
+static inline void md_bitmap_exit(void)
+{
+}
+#endif
+
+#ifdef CONFIG_MD_LLBITMAP
+int md_llbitmap_init(void);
+void md_llbitmap_exit(void);
+#else
+static inline int md_llbitmap_init(void)
+{
+	return 0;
+}
+static inline void md_llbitmap_exit(void)
+{
+}
+#endif

 #endif
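
The md-bitmap.h change above moves the "is there a usable bitmap?" checks out of the individual backends and into inline wrappers, so that callers fall back to "always resync" when no bitmap is registered or created. The following is a reduced userspace sketch of that guard pattern under simplifying assumptions: the `model_*` and `toy_*` names are hypothetical, `sector_t` is a stand-in typedef, and only the `enabled` and `start_sync` callbacks are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long sector_t;	/* stand-in for the kernel type */

struct model_mddev;

/* Reduced ops table: only the two callbacks needed for this sketch. */
struct model_bitmap_ops {
	bool (*enabled)(void *data, bool flush);
	bool (*start_sync)(struct model_mddev *mddev, sector_t offset,
			   sector_t *blocks, bool degraded);
};

struct model_mddev {
	const struct model_bitmap_ops *bitmap_ops;	/* NULL: nothing registered */
	void *bitmap;					/* NULL: no bitmap created */
};

/* Mirrors the shape of md_bitmap_enabled(): ops first, then the bitmap itself. */
static bool model_bitmap_enabled(const struct model_mddev *mddev, bool flush)
{
	if (!mddev->bitmap_ops)
		return false;
	if (!mddev->bitmap)
		return false;
	return mddev->bitmap_ops->enabled(mddev->bitmap, flush);
}

/* Without a usable bitmap, report 1024 blocks and always ask for a resync. */
static bool model_bitmap_start_sync(struct model_mddev *mddev, sector_t offset,
				    sector_t *blocks, bool degraded)
{
	assert(blocks != NULL);
	if (!model_bitmap_enabled(mddev, false)) {
		*blocks = 1024;
		return true;
	}
	return mddev->bitmap_ops->start_sync(mddev, offset, blocks, degraded);
}

/* Toy backend (assumption): always enabled, never needs a resync. */
static bool toy_enabled(void *data, bool flush)
{
	(void)data;
	(void)flush;
	return true;
}

static bool toy_start_sync(struct model_mddev *mddev, sector_t offset,
			   sector_t *blocks, bool degraded)
{
	(void)mddev;
	(void)offset;
	(void)degraded;
	*blocks = 128;
	return false;
}

static const struct model_bitmap_ops toy_ops = {
	.enabled = toy_enabled,
	.start_sync = toy_start_sync,
};
```

Keeping the fallback in one wrapper is what lets the diff drop the `bitmap == NULL` branches from the individual md-bitmap.c functions: a backend callback is only reached once both the ops table and the bitmap instance exist.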
+1 -1
drivers/md/md-cluster.c
···
 		if (le64_to_cpu(msg->high) != mddev->pers->size(mddev, 0, 0))
 			ret = mddev->bitmap_ops->resize(mddev,
 					    le64_to_cpu(msg->high),
-					    0, false);
+					    0);
 		break;
 	default:
 		ret = -1;
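
The new file drivers/md/md-llbitmap.c in this merge drives each bitmap entry through a table of state transitions. As a reading aid, here is a compact userspace sketch of that table. The transitions are taken from the `state_machine` array in the diff; the `model_*`, `ST_*` and `ACT_*` names are hypothetical, and treating a missing entry as "state unchanged" is an assumption of this sketch, not a statement about how the kernel handles `BitNone`. The raid456 special case for first writes is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names; the kernel uses enum llbitmap_state / llbitmap_action. */
enum model_state {
	ST_UNWRITTEN, ST_CLEAN, ST_DIRTY, ST_NEEDSYNC, ST_SYNCING, ST_COUNT
};

enum model_action {
	ACT_STARTWRITE, ACT_STARTSYNC, ACT_ENDSYNC, ACT_ABORTSYNC,
	ACT_RELOAD, ACT_DAEMON, ACT_DISCARD, ACT_STALE, ACT_COUNT
};

struct model_transition {
	enum model_state from;
	enum model_action action;
	enum model_state to;
};

/* Only the transitions that change something; everything else is "none". */
static const struct model_transition model_transitions[] = {
	{ ST_UNWRITTEN, ACT_STARTWRITE, ST_DIRTY },

	{ ST_CLEAN, ACT_STARTWRITE, ST_DIRTY },
	{ ST_CLEAN, ACT_DISCARD, ST_UNWRITTEN },
	{ ST_CLEAN, ACT_STALE, ST_NEEDSYNC },

	{ ST_DIRTY, ACT_RELOAD, ST_NEEDSYNC },
	{ ST_DIRTY, ACT_DAEMON, ST_CLEAN },
	{ ST_DIRTY, ACT_DISCARD, ST_UNWRITTEN },
	{ ST_DIRTY, ACT_STALE, ST_NEEDSYNC },

	{ ST_NEEDSYNC, ACT_STARTSYNC, ST_SYNCING },
	{ ST_NEEDSYNC, ACT_DISCARD, ST_UNWRITTEN },

	{ ST_SYNCING, ACT_STARTSYNC, ST_SYNCING },
	{ ST_SYNCING, ACT_ENDSYNC, ST_DIRTY },
	{ ST_SYNCING, ACT_ABORTSYNC, ST_NEEDSYNC },
	{ ST_SYNCING, ACT_RELOAD, ST_NEEDSYNC },
	{ ST_SYNCING, ACT_DISCARD, ST_UNWRITTEN },
	{ ST_SYNCING, ACT_STALE, ST_NEEDSYNC },
};

/* Look up the next state; with no table entry the state is left unchanged. */
static enum model_state model_next_state(enum model_state cur,
					 enum model_action action)
{
	size_t n = sizeof(model_transitions) / sizeof(model_transitions[0]);
	size_t i;

	assert(cur < ST_COUNT && action < ACT_COUNT);
	for (i = 0; i < n; i++) {
		if (model_transitions[i].from == cur &&
		    model_transitions[i].action == action)
			return model_transitions[i].to;
	}
	return cur;
}
```

Walking `ST_NEEDSYNC` through `ACT_STARTSYNC`, `ACT_ENDSYNC` and `ACT_DAEMON` reproduces the "common process" path from the file's header comment (NeedSync, Syncing, Dirty, Clean), and a second `ACT_STARTWRITE` on a dirty entry leaves it dirty, which matches the claim that repeated writes to the same bit add no overhead.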
+1626
drivers/md/md-llbitmap.c
··· 1 + // SPDX-License-Identifier: GPL-2.0-or-later 2 + 3 + #include <linux/blkdev.h> 4 + #include <linux/module.h> 5 + #include <linux/errno.h> 6 + #include <linux/slab.h> 7 + #include <linux/init.h> 8 + #include <linux/timer.h> 9 + #include <linux/sched.h> 10 + #include <linux/list.h> 11 + #include <linux/file.h> 12 + #include <linux/seq_file.h> 13 + #include <trace/events/block.h> 14 + 15 + #include "md.h" 16 + #include "md-bitmap.h" 17 + 18 + /* 19 + * #### Background 20 + * 21 + * Redundant data is used to enhance data fault tolerance, and the storage 22 + * methods for redundant data vary depending on the RAID levels. And it's 23 + * important to maintain the consistency of redundant data. 24 + * 25 + * Bitmap is used to record which data blocks have been synchronized and which 26 + * ones need to be resynchronized or recovered. Each bit in the bitmap 27 + * represents a segment of data in the array. When a bit is set, it indicates 28 + * that the multiple redundant copies of that data segment may not be 29 + * consistent. Data synchronization can be performed based on the bitmap after 30 + * power failure or readding a disk. If there is no bitmap, a full disk 31 + * synchronization is required. 32 + * 33 + * #### Key Features 34 + * 35 + * - IO fastpath is lockless, if user issues lots of write IO to the same 36 + * bitmap bit in a short time, only the first write has additional overhead 37 + * to update bitmap bit, no additional overhead for the following writes; 38 + * - support only resync or recover written data, means in the case creating 39 + * new array or replacing with a new disk, there is no need to do a full disk 40 + * resync/recovery; 41 + * 42 + * #### Key Concept 43 + * 44 + * ##### State Machine 45 + * 46 + * Each bit is one byte, contain 6 different states, see llbitmap_state. 
And 47 + * there are total 8 different actions, see llbitmap_action, can change state: 48 + * 49 + * llbitmap state machine: transitions between states 50 + * 51 + * | | Startwrite | Startsync | Endsync | Abortsync| 52 + * | --------- | ---------- | --------- | ------- | ------- | 53 + * | Unwritten | Dirty | x | x | x | 54 + * | Clean | Dirty | x | x | x | 55 + * | Dirty | x | x | x | x | 56 + * | NeedSync | x | Syncing | x | x | 57 + * | Syncing | x | Syncing | Dirty | NeedSync | 58 + * 59 + * | | Reload | Daemon | Discard | Stale | 60 + * | --------- | -------- | ------ | --------- | --------- | 61 + * | Unwritten | x | x | x | x | 62 + * | Clean | x | x | Unwritten | NeedSync | 63 + * | Dirty | NeedSync | Clean | Unwritten | NeedSync | 64 + * | NeedSync | x | x | Unwritten | x | 65 + * | Syncing | NeedSync | x | Unwritten | NeedSync | 66 + * 67 + * Typical scenarios: 68 + * 69 + * 1) Create new array 70 + * All bits will be set to Unwritten by default, if --assume-clean is set, 71 + * all bits will be set to Clean instead. 
72 + * 73 + * 2) write data, raid1/raid10 have full copy of data, while raid456 doesn't and 74 + * rely on xor data 75 + * 76 + * 2.1) write new data to raid1/raid10: 77 + * Unwritten --StartWrite--> Dirty 78 + * 79 + * 2.2) write new data to raid456: 80 + * Unwritten --StartWrite--> NeedSync 81 + * 82 + * Because the initial recover for raid456 is skipped, the xor data is not built 83 + * yet, the bit must be set to NeedSync first and after lazy initial recover is 84 + * finished, the bit will finally set to Dirty(see 5.1 and 5.4); 85 + * 86 + * 2.3) cover write 87 + * Clean --StartWrite--> Dirty 88 + * 89 + * 3) daemon, if the array is not degraded: 90 + * Dirty --Daemon--> Clean 91 + * 92 + * 4) discard 93 + * {Clean, Dirty, NeedSync, Syncing} --Discard--> Unwritten 94 + * 95 + * 5) resync and recover 96 + * 97 + * 5.1) common process 98 + * NeedSync --Startsync--> Syncing --Endsync--> Dirty --Daemon--> Clean 99 + * 100 + * 5.2) resync after power failure 101 + * Dirty --Reload--> NeedSync 102 + * 103 + * 5.3) recover while replacing with a new disk 104 + * By default, the old bitmap framework will recover all data, and llbitmap 105 + * implements this by a new helper, see llbitmap_skip_sync_blocks: 106 + * 107 + * skip recover for bits other than dirty or clean; 108 + * 109 + * 5.4) lazy initial recover for raid5: 110 + * By default, the old bitmap framework will only allow new recover when there 111 + * are spares(new disk), a new recovery flag MD_RECOVERY_LAZY_RECOVER is added 112 + * to perform raid456 lazy recover for set bits(from 2.2). 113 + * 114 + * 6. 
special handling for degraded array: 115 + * 116 + * - Dirty bits will never be cleared, daemon will just do nothing, so that if 117 + * a disk is readded, Clean bits can be skipped with recovery; 118 + * - Dirty bits will convert to Syncing from start write, to do data recovery 119 + * for new added disks; 120 + * - New write will convert bits to NeedSync directly; 121 + * 122 + * ##### Bitmap IO 123 + * 124 + * ##### Chunksize 125 + * 126 + * The default bitmap size is 128k, incluing 1k bitmap super block, and 127 + * the default size of segment of data in the array each bit(chunksize) is 64k, 128 + * and chunksize will adjust to twice the old size each time if the total number 129 + * bits is not less than 127k.(see llbitmap_init) 130 + * 131 + * ##### READ 132 + * 133 + * While creating bitmap, all pages will be allocated and read for llbitmap, 134 + * there won't be read afterwards 135 + * 136 + * ##### WRITE 137 + * 138 + * WRITE IO is divided into logical_block_size of the array, the dirty state 139 + * of each block is tracked independently, for example: 140 + * 141 + * each page is 4k, contain 8 blocks; each block is 512 bytes contain 512 bit; 142 + * 143 + * | page0 | page1 | ... | page 31 | 144 + * | | 145 + * | \-----------------------\ 146 + * | | 147 + * | block0 | block1 | ... | block 8| 148 + * | | 149 + * | \-----------------\ 150 + * | | 151 + * | bit0 | bit1 | ... | bit511 | 152 + * 153 + * From IO path, if one bit is changed to Dirty or NeedSync, the corresponding 154 + * subpage will be marked dirty, such block must write first before the IO is 155 + * issued. This behaviour will affect IO performance, to reduce the impact, if 156 + * multiple bits are changed in the same block in a short time, all bits in this 157 + * block will be changed to Dirty/NeedSync, so that there won't be any overhead 158 + * until daemon clears dirty bits. 
159 + * 160 + * ##### Dirty Bits synchronization 161 + * 162 + * IO fast path will set bits to dirty, and those dirty bits will be cleared 163 + * by daemon after IO is done. llbitmap_page_ctl is used to synchronize between 164 + * IO path and daemon; 165 + * 166 + * IO path: 167 + * 1) try to grab a reference, if succeed, set expire time after 5s and return; 168 + * 2) if failed to grab a reference, wait for daemon to finish clearing dirty 169 + * bits; 170 + * 171 + * Daemon (Daemon will be woken up every daemon_sleep seconds): 172 + * For each page: 173 + * 1) check if page expired, if not skip this page; for expired page: 174 + * 2) suspend the page and wait for inflight write IO to be done; 175 + * 3) change dirty page to clean; 176 + * 4) resume the page; 177 + */ 178 + 179 + #define BITMAP_DATA_OFFSET 1024 180 + 181 + /* 64k is the max IO size of sync IO for raid1/raid10 */ 182 + #define MIN_CHUNK_SIZE (64 * 2) 183 + 184 + /* By default, daemon will be woken up every 30s */ 185 + #define DEFAULT_DAEMON_SLEEP 30 186 + 187 + /* 188 + * Dirtied bits that have not been accessed for more than 5s will be cleared 189 + * by daemon. 
190 + */ 191 + #define DEFAULT_BARRIER_IDLE 5 192 + 193 + enum llbitmap_state { 194 + /* No valid data, init state after assemble the array */ 195 + BitUnwritten = 0, 196 + /* data is consistent */ 197 + BitClean, 198 + /* data will be consistent after IO is done, set directly for writes */ 199 + BitDirty, 200 + /* 201 + * data need to be resynchronized: 202 + * 1) set directly for writes if array is degraded, prevent full disk 203 + * synchronization after readding a disk; 204 + * 2) reassemble the array after power failure, and dirty bits are 205 + * found after reloading the bitmap; 206 + * 3) set for first write for raid5, to build initial xor data lazily 207 + */ 208 + BitNeedSync, 209 + /* data is synchronizing */ 210 + BitSyncing, 211 + BitStateCount, 212 + BitNone = 0xff, 213 + }; 214 + 215 + enum llbitmap_action { 216 + /* User write new data, this is the only action from IO fast path */ 217 + BitmapActionStartwrite = 0, 218 + /* Start recovery */ 219 + BitmapActionStartsync, 220 + /* Finish recovery */ 221 + BitmapActionEndsync, 222 + /* Failed recovery */ 223 + BitmapActionAbortsync, 224 + /* Reassemble the array */ 225 + BitmapActionReload, 226 + /* Daemon thread is trying to clear dirty bits */ 227 + BitmapActionDaemon, 228 + /* Data is deleted */ 229 + BitmapActionDiscard, 230 + /* 231 + * Bitmap is stale, mark all bits in addition to BitUnwritten to 232 + * BitNeedSync. 
233 + */ 234 + BitmapActionStale, 235 + BitmapActionCount, 236 + /* Init state is BitUnwritten */ 237 + BitmapActionInit, 238 + }; 239 + 240 + enum llbitmap_page_state { 241 + LLPageFlush = 0, 242 + LLPageDirty, 243 + }; 244 + 245 + struct llbitmap_page_ctl { 246 + char *state; 247 + struct page *page; 248 + unsigned long expire; 249 + unsigned long flags; 250 + wait_queue_head_t wait; 251 + struct percpu_ref active; 252 + /* Per block size dirty state, maximum 64k page / 1 sector = 128 */ 253 + unsigned long dirty[]; 254 + }; 255 + 256 + struct llbitmap { 257 + struct mddev *mddev; 258 + struct llbitmap_page_ctl **pctl; 259 + 260 + unsigned int nr_pages; 261 + unsigned int io_size; 262 + unsigned int blocks_per_page; 263 + 264 + /* shift of one chunk */ 265 + unsigned long chunkshift; 266 + /* size of one chunk in sector */ 267 + unsigned long chunksize; 268 + /* total number of chunks */ 269 + unsigned long chunks; 270 + unsigned long last_end_sync; 271 + /* 272 + * time in seconds that dirty bits will be cleared if the page is not 273 + * accessed. 
+	 */
+	unsigned long barrier_idle;
+	/* fires on first BitDirty state */
+	struct timer_list pending_timer;
+	struct work_struct daemon_work;
+
+	unsigned long flags;
+	__u64 events_cleared;
+
+	/* for slow disks */
+	atomic_t behind_writes;
+	wait_queue_head_t behind_wait;
+};
+
+struct llbitmap_unplug_work {
+	struct work_struct work;
+	struct llbitmap *llbitmap;
+	struct completion *done;
+};
+
+static struct workqueue_struct *md_llbitmap_io_wq;
+static struct workqueue_struct *md_llbitmap_unplug_wq;
+
+static char state_machine[BitStateCount][BitmapActionCount] = {
+	[BitUnwritten] = {
+		[BitmapActionStartwrite] = BitDirty,
+		[BitmapActionStartsync] = BitNone,
+		[BitmapActionEndsync] = BitNone,
+		[BitmapActionAbortsync] = BitNone,
+		[BitmapActionReload] = BitNone,
+		[BitmapActionDaemon] = BitNone,
+		[BitmapActionDiscard] = BitNone,
+		[BitmapActionStale] = BitNone,
+	},
+	[BitClean] = {
+		[BitmapActionStartwrite] = BitDirty,
+		[BitmapActionStartsync] = BitNone,
+		[BitmapActionEndsync] = BitNone,
+		[BitmapActionAbortsync] = BitNone,
+		[BitmapActionReload] = BitNone,
+		[BitmapActionDaemon] = BitNone,
+		[BitmapActionDiscard] = BitUnwritten,
+		[BitmapActionStale] = BitNeedSync,
+	},
+	[BitDirty] = {
+		[BitmapActionStartwrite] = BitNone,
+		[BitmapActionStartsync] = BitNone,
+		[BitmapActionEndsync] = BitNone,
+		[BitmapActionAbortsync] = BitNone,
+		[BitmapActionReload] = BitNeedSync,
+		[BitmapActionDaemon] = BitClean,
+		[BitmapActionDiscard] = BitUnwritten,
+		[BitmapActionStale] = BitNeedSync,
+	},
+	[BitNeedSync] = {
+		[BitmapActionStartwrite] = BitNone,
+		[BitmapActionStartsync] = BitSyncing,
+		[BitmapActionEndsync] = BitNone,
+		[BitmapActionAbortsync] = BitNone,
+		[BitmapActionReload] = BitNone,
+		[BitmapActionDaemon] = BitNone,
+		[BitmapActionDiscard] = BitUnwritten,
+		[BitmapActionStale] = BitNone,
+	},
+	[BitSyncing] = {
+		[BitmapActionStartwrite] = BitNone,
+		[BitmapActionStartsync] = BitSyncing,
+		[BitmapActionEndsync] = BitDirty,
+		[BitmapActionAbortsync] = BitNeedSync,
+		[BitmapActionReload] = BitNeedSync,
+		[BitmapActionDaemon] = BitNone,
+		[BitmapActionDiscard] = BitUnwritten,
+		[BitmapActionStale] = BitNeedSync,
+	},
+};
+
+static void __llbitmap_flush(struct mddev *mddev);
+
+static enum llbitmap_state llbitmap_read(struct llbitmap *llbitmap, loff_t pos)
+{
+	unsigned int idx;
+	unsigned int offset;
+
+	pos += BITMAP_DATA_OFFSET;
+	idx = pos >> PAGE_SHIFT;
+	offset = offset_in_page(pos);
+
+	return llbitmap->pctl[idx]->state[offset];
+}
+
+/* set all the bits in the subpage as dirty */
+static void llbitmap_infect_dirty_bits(struct llbitmap *llbitmap,
+				       struct llbitmap_page_ctl *pctl,
+				       unsigned int block)
+{
+	bool level_456 = raid_is_456(llbitmap->mddev);
+	unsigned int io_size = llbitmap->io_size;
+	int pos;
+
+	for (pos = block * io_size; pos < (block + 1) * io_size; pos++) {
+		switch (pctl->state[pos]) {
+		case BitUnwritten:
+			pctl->state[pos] = level_456 ? BitNeedSync : BitDirty;
+			break;
+		case BitClean:
+			pctl->state[pos] = BitDirty;
+			break;
+		};
+	}
+}
+
+static void llbitmap_set_page_dirty(struct llbitmap *llbitmap, int idx,
+				    int offset)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+	unsigned int io_size = llbitmap->io_size;
+	int block = offset / io_size;
+	int pos;
+
+	if (!test_bit(LLPageDirty, &pctl->flags))
+		set_bit(LLPageDirty, &pctl->flags);
+
+	/*
+	 * For degraded array, dirty bits will never be cleared, and we must
+	 * resync all the dirty bits, hence skip infect new dirty bits to
+	 * prevent resync unnecessary data.
+	 */
+	if (llbitmap->mddev->degraded) {
+		set_bit(block, pctl->dirty);
+		return;
+	}
+
+	/*
+	 * The subpage usually contains a total of 512 bits. If any single bit
+	 * within the subpage is marked as dirty, the entire sector will be
+	 * written. To avoid impacting write performance, when multiple bits
+	 * within the same sector are modified within llbitmap->barrier_idle,
+	 * all bits in the sector will be collectively marked as dirty at once.
+	 */
+	if (test_and_set_bit(block, pctl->dirty)) {
+		llbitmap_infect_dirty_bits(llbitmap, pctl, block);
+		return;
+	}
+
+	for (pos = block * io_size; pos < (block + 1) * io_size; pos++) {
+		if (pos == offset)
+			continue;
+		if (pctl->state[pos] == BitDirty ||
+		    pctl->state[pos] == BitNeedSync) {
+			llbitmap_infect_dirty_bits(llbitmap, pctl, block);
+			return;
+		}
+	}
+}
+
+static void llbitmap_write(struct llbitmap *llbitmap, enum llbitmap_state state,
+			   loff_t pos)
+{
+	unsigned int idx;
+	unsigned int bit;
+
+	pos += BITMAP_DATA_OFFSET;
+	idx = pos >> PAGE_SHIFT;
+	bit = offset_in_page(pos);
+
+	llbitmap->pctl[idx]->state[bit] = state;
+	if (state == BitDirty || state == BitNeedSync)
+		llbitmap_set_page_dirty(llbitmap, idx, bit);
+}
+
+static struct page *llbitmap_read_page(struct llbitmap *llbitmap, int idx)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	struct page *page = NULL;
+	struct md_rdev *rdev;
+
+	if (llbitmap->pctl && llbitmap->pctl[idx])
+		page = llbitmap->pctl[idx]->page;
+	if (page)
+		return page;
+
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return ERR_PTR(-ENOMEM);
+
+	rdev_for_each(rdev, mddev) {
+		sector_t sector;
+
+		if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+			continue;
+
+		sector = mddev->bitmap_info.offset +
+			 (idx << PAGE_SECTORS_SHIFT);
+
+		if (sync_page_io(rdev, sector, PAGE_SIZE, page, REQ_OP_READ,
+				 true))
+			return page;
+
+		md_error(mddev, rdev);
+	}
+
+	__free_page(page);
+	return ERR_PTR(-EIO);
+}
+
+static void llbitmap_write_page(struct llbitmap *llbitmap, int idx)
+{
+	struct page *page = llbitmap->pctl[idx]->page;
+	struct mddev *mddev = llbitmap->mddev;
+	struct md_rdev *rdev;
+	int block;
+
+	for (block = 0; block < llbitmap->blocks_per_page; block++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+
+		if (!test_and_clear_bit(block, pctl->dirty))
+			continue;
+
+		rdev_for_each(rdev, mddev) {
+			sector_t sector;
+			sector_t bit_sector = llbitmap->io_size >> SECTOR_SHIFT;
+
+			if (rdev->raid_disk < 0 || test_bit(Faulty, &rdev->flags))
+				continue;
+
+			sector = mddev->bitmap_info.offset + rdev->sb_start +
+				 (idx << PAGE_SECTORS_SHIFT) +
+				 block * bit_sector;
+			md_write_metadata(mddev, rdev, sector,
+					  llbitmap->io_size, page,
+					  block * llbitmap->io_size);
+		}
+	}
+}
+
+static void active_release(struct percpu_ref *ref)
+{
+	struct llbitmap_page_ctl *pctl =
+		container_of(ref, struct llbitmap_page_ctl, active);
+
+	wake_up(&pctl->wait);
+}
+
+static void llbitmap_free_pages(struct llbitmap *llbitmap)
+{
+	int i;
+
+	if (!llbitmap->pctl)
+		return;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		if (!pctl || !pctl->page)
+			break;
+
+		__free_page(pctl->page);
+		percpu_ref_exit(&pctl->active);
+	}
+
+	kfree(llbitmap->pctl[0]);
+	kfree(llbitmap->pctl);
+	llbitmap->pctl = NULL;
+}
+
+static int llbitmap_cache_pages(struct llbitmap *llbitmap)
+{
+	struct llbitmap_page_ctl *pctl;
+	unsigned int nr_pages = DIV_ROUND_UP(llbitmap->chunks +
+					     BITMAP_DATA_OFFSET, PAGE_SIZE);
+	unsigned int size = struct_size(pctl, dirty, BITS_TO_LONGS(
+						llbitmap->blocks_per_page));
+	int i;
+
+	llbitmap->pctl = kmalloc_array(nr_pages, sizeof(void *),
+				       GFP_KERNEL | __GFP_ZERO);
+	if (!llbitmap->pctl)
+		return -ENOMEM;
+
+	size = round_up(size, cache_line_size());
+	pctl = kmalloc_array(nr_pages, size, GFP_KERNEL | __GFP_ZERO);
+	if (!pctl) {
+		kfree(llbitmap->pctl);
+		return -ENOMEM;
+	}
+
+	llbitmap->nr_pages = nr_pages;
+
+	for (i = 0; i < nr_pages; i++, pctl = (void *)pctl + size) {
+		struct page *page = llbitmap_read_page(llbitmap, i);
+
+		llbitmap->pctl[i] = pctl;
+
+		if (IS_ERR(page)) {
+			llbitmap_free_pages(llbitmap);
+			return PTR_ERR(page);
+		}
+
+		if (percpu_ref_init(&pctl->active, active_release,
+				    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
+			__free_page(page);
+			llbitmap_free_pages(llbitmap);
+			return -ENOMEM;
+		}
+
+		pctl->page = page;
+		pctl->state = page_address(page);
+		init_waitqueue_head(&pctl->wait);
+	}
+
+	return 0;
+}
+
+static void llbitmap_init_state(struct llbitmap *llbitmap)
+{
+	enum llbitmap_state state = BitUnwritten;
+	unsigned long i;
+
+	if (test_and_clear_bit(BITMAP_CLEAN, &llbitmap->flags))
+		state = BitClean;
+
+	for (i = 0; i < llbitmap->chunks; i++)
+		llbitmap_write(llbitmap, state, i);
+}
+
+/* The return value is only used from resync, where @start == @end. */
+static enum llbitmap_state llbitmap_state_machine(struct llbitmap *llbitmap,
+						  unsigned long start,
+						  unsigned long end,
+						  enum llbitmap_action action)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	enum llbitmap_state state = BitNone;
+	bool level_456 = raid_is_456(llbitmap->mddev);
+	bool need_resync = false;
+	bool need_recovery = false;
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+		return BitNone;
+
+	if (action == BitmapActionInit) {
+		llbitmap_init_state(llbitmap);
+		return BitNone;
+	}
+
+	while (start <= end) {
+		enum llbitmap_state c = llbitmap_read(llbitmap, start);
+
+		if (c < 0 || c >= BitStateCount) {
+			pr_err("%s: invalid bit %lu state %d action %d, forcing resync\n",
+			       __func__, start, c, action);
+			state = BitNeedSync;
+			goto write_bitmap;
+		}
+
+		if (c == BitNeedSync)
+			need_resync = !mddev->degraded;
+
+		state = state_machine[c][action];
+
+write_bitmap:
+		if (unlikely(mddev->degraded)) {
+			/* For degraded array, mark new data as need sync. */
+			if (state == BitDirty &&
+			    action == BitmapActionStartwrite)
+				state = BitNeedSync;
+			/*
+			 * For degraded array, resync dirty data as well, noted
+			 * if array is still degraded after resync is done, all
+			 * new data will still be dirty until array is clean.
+			 */
+			else if (c == BitDirty &&
+				 action == BitmapActionStartsync)
+				state = BitSyncing;
+		} else if (c == BitUnwritten && state == BitDirty &&
+			   action == BitmapActionStartwrite && level_456) {
+			/* Delay raid456 initial recovery to first write. */
+			state = BitNeedSync;
+		}
+
+		if (state == BitNone) {
+			start++;
+			continue;
+		}
+
+		llbitmap_write(llbitmap, state, start);
+
+		if (state == BitNeedSync)
+			need_resync = !mddev->degraded;
+		else if (state == BitDirty &&
+			 !timer_pending(&llbitmap->pending_timer))
+			mod_timer(&llbitmap->pending_timer,
+				  jiffies + mddev->bitmap_info.daemon_sleep * HZ);
+
+		start++;
+	}
+
+	if (need_resync && level_456)
+		need_recovery = true;
+
+	if (need_recovery) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		set_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	} else if (need_resync) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	}
+
+	return state;
+}
+
+static void llbitmap_raise_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+retry:
+	if (likely(percpu_ref_tryget_live(&pctl->active))) {
+		WRITE_ONCE(pctl->expire, jiffies + llbitmap->barrier_idle * HZ);
+		return;
+	}
+
+	wait_event(pctl->wait, !percpu_ref_is_dying(&pctl->active));
+	goto retry;
+}
+
+static void llbitmap_release_barrier(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	percpu_ref_put(&pctl->active);
+}
+
+static int llbitmap_suspend_timeout(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	percpu_ref_kill(&pctl->active);
+
+	if (!wait_event_timeout(pctl->wait, percpu_ref_is_zero(&pctl->active),
+			llbitmap->mddev->bitmap_info.daemon_sleep * HZ))
+		return -ETIMEDOUT;
+
+	return 0;
+}
+
+static void llbitmap_resume(struct llbitmap *llbitmap, int page_idx)
+{
+	struct llbitmap_page_ctl *pctl = llbitmap->pctl[page_idx];
+
+	pctl->expire = LONG_MAX;
+	percpu_ref_resurrect(&pctl->active);
+	wake_up(&pctl->wait);
+}
+
+static int llbitmap_check_support(struct mddev *mddev)
+{
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		pr_notice("md/llbitmap: %s: array with journal cannot have bitmap\n",
+			  mdname(mddev));
+		return -EBUSY;
+	}
+
+	if (mddev->bitmap_info.space == 0) {
+		if (mddev->bitmap_info.default_space == 0) {
+			pr_notice("md/llbitmap: %s: no space for bitmap\n",
+				  mdname(mddev));
+			return -ENOSPC;
+		}
+	}
+
+	if (!mddev->persistent) {
+		pr_notice("md/llbitmap: %s: array must be persistent\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev->bitmap_info.file) {
+		pr_notice("md/llbitmap: %s: doesn't support bitmap file\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev->bitmap_info.external) {
+		pr_notice("md/llbitmap: %s: doesn't support external metadata\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	if (mddev_is_dm(mddev)) {
+		pr_notice("md/llbitmap: %s: doesn't support dm-raid\n",
+			  mdname(mddev));
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int llbitmap_init(struct llbitmap *llbitmap)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	sector_t blocks = mddev->resync_max_sectors;
+	unsigned long chunksize = MIN_CHUNK_SIZE;
+	unsigned long chunks = DIV_ROUND_UP(blocks, chunksize);
+	unsigned long space = mddev->bitmap_info.space << SECTOR_SHIFT;
+	int ret;
+
+	while (chunks > space) {
+		chunksize = chunksize << 1;
+		chunks = DIV_ROUND_UP(blocks, chunksize);
+	}
+
+	llbitmap->barrier_idle = DEFAULT_BARRIER_IDLE;
+	llbitmap->chunkshift = ffz(~chunksize);
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = chunks;
+	mddev->bitmap_info.daemon_sleep = DEFAULT_DAEMON_SLEEP;
+
+	ret = llbitmap_cache_pages(llbitmap);
+	if (ret)
+		return ret;
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1,
+			       BitmapActionInit);
+	/* flush initial llbitmap to disk */
+	__llbitmap_flush(mddev);
+
+	return 0;
+}
+
+static int llbitmap_read_sb(struct llbitmap *llbitmap)
+{
+	struct mddev *mddev = llbitmap->mddev;
+	unsigned long daemon_sleep;
+	unsigned long chunksize;
+	unsigned long events;
+	struct page *sb_page;
+	bitmap_super_t *sb;
+	int ret = -EINVAL;
+
+	if (!mddev->bitmap_info.offset) {
+		pr_err("md/llbitmap: %s: no super block found", mdname(mddev));
+		return -EINVAL;
+	}
+
+	sb_page = llbitmap_read_page(llbitmap, 0);
+	if (IS_ERR(sb_page)) {
+		pr_err("md/llbitmap: %s: read super block failed",
+		       mdname(mddev));
+		return -EIO;
+	}
+
+	sb = kmap_local_page(sb_page);
+	if (sb->magic != cpu_to_le32(BITMAP_MAGIC)) {
+		pr_err("md/llbitmap: %s: invalid super block magic number",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (sb->version != cpu_to_le32(BITMAP_MAJOR_LOCKLESS)) {
+		pr_err("md/llbitmap: %s: invalid super block version",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (memcmp(sb->uuid, mddev->uuid, 16)) {
+		pr_err("md/llbitmap: %s: bitmap superblock UUID mismatch\n",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (mddev->bitmap_info.space == 0) {
+		int room = le32_to_cpu(sb->sectors_reserved);
+
+		if (room)
+			mddev->bitmap_info.space = room;
+		else
+			mddev->bitmap_info.space = mddev->bitmap_info.default_space;
+	}
+	llbitmap->flags = le32_to_cpu(sb->state);
+	if (test_and_clear_bit(BITMAP_FIRST_USE, &llbitmap->flags)) {
+		ret = llbitmap_init(llbitmap);
+		goto out_put_page;
+	}
+
+	chunksize = le32_to_cpu(sb->chunksize);
+	if (!is_power_of_2(chunksize)) {
+		pr_err("md/llbitmap: %s: chunksize not a power of 2",
+		       mdname(mddev));
+		goto out_put_page;
+	}
+
+	if (chunksize < DIV_ROUND_UP(mddev->resync_max_sectors,
+				     mddev->bitmap_info.space << SECTOR_SHIFT)) {
+		pr_err("md/llbitmap: %s: chunksize too small %lu < %llu / %lu",
+		       mdname(mddev), chunksize, mddev->resync_max_sectors,
+		       mddev->bitmap_info.space);
+		goto out_put_page;
+	}
+
+	daemon_sleep = le32_to_cpu(sb->daemon_sleep);
+	if (daemon_sleep < 1 || daemon_sleep > MAX_SCHEDULE_TIMEOUT / HZ) {
+		pr_err("md/llbitmap: %s: daemon sleep %lu period out of range",
+		       mdname(mddev), daemon_sleep);
+		goto out_put_page;
+	}
+
+	events = le64_to_cpu(sb->events);
+	if (events < mddev->events) {
+		pr_warn("md/llbitmap :%s: bitmap file is out of date (%lu < %llu) -- forcing full recovery",
+			mdname(mddev), events, mddev->events);
+		set_bit(BITMAP_STALE, &llbitmap->flags);
+	}
+
+	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+	mddev->bitmap_info.chunksize = chunksize;
+	mddev->bitmap_info.daemon_sleep = daemon_sleep;
+
+	llbitmap->barrier_idle = DEFAULT_BARRIER_IDLE;
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = DIV_ROUND_UP(mddev->resync_max_sectors, chunksize);
+	llbitmap->chunkshift = ffz(~chunksize);
+	ret = llbitmap_cache_pages(llbitmap);
+
+out_put_page:
+	__free_page(sb_page);
+	kunmap_local(sb);
+	return ret;
+}
+
+static void llbitmap_pending_timer_fn(struct timer_list *pending_timer)
+{
+	struct llbitmap *llbitmap =
+		container_of(pending_timer, struct llbitmap, pending_timer);
+
+	if (work_busy(&llbitmap->daemon_work)) {
+		pr_warn("md/llbitmap: %s daemon_work not finished in %lu seconds\n",
+			mdname(llbitmap->mddev),
+			llbitmap->mddev->bitmap_info.daemon_sleep);
+		set_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags);
+		return;
+	}
+
+	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+}
+
+static void md_llbitmap_daemon_fn(struct work_struct *work)
+{
+	struct llbitmap *llbitmap =
+		container_of(work, struct llbitmap, daemon_work);
+	unsigned long start;
+	unsigned long end;
+	bool restart;
+	int idx;
+
+	if (llbitmap->mddev->degraded)
+		return;
+retry:
+	start = 0;
+	end = min(llbitmap->chunks, PAGE_SIZE - BITMAP_DATA_OFFSET) - 1;
+	restart = false;
+
+	for (idx = 0; idx < llbitmap->nr_pages; idx++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[idx];
+
+		if (idx > 0) {
+			start = end + 1;
+			end = min(end + PAGE_SIZE, llbitmap->chunks - 1);
+		}
+
+		if (!test_bit(LLPageFlush, &pctl->flags) &&
+		    time_before(jiffies, pctl->expire)) {
+			restart = true;
+			continue;
+		}
+
+		if (llbitmap_suspend_timeout(llbitmap, idx) < 0) {
+			pr_warn("md/llbitmap: %s: %s waiting for page %d timeout\n",
+				mdname(llbitmap->mddev), __func__, idx);
+			continue;
+		}
+
+		llbitmap_state_machine(llbitmap, start, end, BitmapActionDaemon);
+		llbitmap_resume(llbitmap, idx);
+	}
+
+	/*
+	 * If the daemon took a long time to finish, retry to prevent missing
+	 * clearing dirty bits.
+	 */
+	if (test_and_clear_bit(BITMAP_DAEMON_BUSY, &llbitmap->flags))
+		goto retry;
+
+	/* If some page is dirty but not expired, setup timer again */
+	if (restart)
+		mod_timer(&llbitmap->pending_timer,
+			  jiffies + llbitmap->mddev->bitmap_info.daemon_sleep * HZ);
+}
+
+static int llbitmap_create(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap;
+	int ret;
+
+	ret = llbitmap_check_support(mddev);
+	if (ret)
+		return ret;
+
+	llbitmap = kzalloc(sizeof(*llbitmap), GFP_KERNEL);
+	if (!llbitmap)
+		return -ENOMEM;
+
+	llbitmap->mddev = mddev;
+	llbitmap->io_size = bdev_logical_block_size(mddev->gendisk->part0);
+	llbitmap->blocks_per_page = PAGE_SIZE / llbitmap->io_size;
+
+	timer_setup(&llbitmap->pending_timer, llbitmap_pending_timer_fn, 0);
+	INIT_WORK(&llbitmap->daemon_work, md_llbitmap_daemon_fn);
+	atomic_set(&llbitmap->behind_writes, 0);
+	init_waitqueue_head(&llbitmap->behind_wait);
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	mddev->bitmap = llbitmap;
+	ret = llbitmap_read_sb(llbitmap);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+	if (ret) {
+		kfree(llbitmap);
+		mddev->bitmap = NULL;
+	}
+
+	return ret;
+}
+
+static int llbitmap_resize(struct mddev *mddev, sector_t blocks, int chunksize)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long chunks;
+
+	if (chunksize == 0)
+		chunksize = llbitmap->chunksize;
+
+	/* If there is enough space, leave the chunksize unchanged. */
+	chunks = DIV_ROUND_UP(blocks, chunksize);
+	while (chunks > mddev->bitmap_info.space << SECTOR_SHIFT) {
+		chunksize = chunksize << 1;
+		chunks = DIV_ROUND_UP(blocks, chunksize);
+	}
+
+	llbitmap->chunkshift = ffz(~chunksize);
+	llbitmap->chunksize = chunksize;
+	llbitmap->chunks = chunks;
+
+	return 0;
+}
+
+static int llbitmap_load(struct mddev *mddev)
+{
+	enum llbitmap_action action = BitmapActionReload;
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (test_and_clear_bit(BITMAP_STALE, &llbitmap->flags))
+		action = BitmapActionStale;
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1, action);
+	return 0;
+}
+
+static void llbitmap_destroy(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (!llbitmap)
+		return;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+
+	timer_delete_sync(&llbitmap->pending_timer);
+	flush_workqueue(md_llbitmap_io_wq);
+	flush_workqueue(md_llbitmap_unplug_wq);
+
+	mddev->bitmap = NULL;
+	llbitmap_free_pages(llbitmap);
+	kfree(llbitmap);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+}
+
+static void llbitmap_start_write(struct mddev *mddev, sector_t offset,
+				 unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = offset >> llbitmap->chunkshift;
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	llbitmap_state_machine(llbitmap, start, end, BitmapActionStartwrite);
+
+	while (page_start <= page_end) {
+		llbitmap_raise_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_end_write(struct mddev *mddev, sector_t offset,
+			       unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = offset >> llbitmap->chunkshift;
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	while (page_start <= page_end) {
+		llbitmap_release_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_start_discard(struct mddev *mddev, sector_t offset,
+				   unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	llbitmap_state_machine(llbitmap, start, end, BitmapActionDiscard);
+
+	while (page_start <= page_end) {
+		llbitmap_raise_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_end_discard(struct mddev *mddev, sector_t offset,
+				 unsigned long sectors)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long start = DIV_ROUND_UP(offset, llbitmap->chunksize);
+	unsigned long end = (offset + sectors - 1) >> llbitmap->chunkshift;
+	int page_start = (start + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+	int page_end = (end + BITMAP_DATA_OFFSET) >> PAGE_SHIFT;
+
+	while (page_start <= page_end) {
+		llbitmap_release_barrier(llbitmap, page_start);
+		page_start++;
+	}
+}
+
+static void llbitmap_unplug_fn(struct work_struct *work)
+{
+	struct llbitmap_unplug_work *unplug_work =
+		container_of(work, struct llbitmap_unplug_work, work);
+	struct llbitmap *llbitmap = unplug_work->llbitmap;
+	struct blk_plug plug;
+	int i;
+
+	blk_start_plug(&plug);
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		if (!test_bit(LLPageDirty, &llbitmap->pctl[i]->flags) ||
+		    !test_and_clear_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
+			continue;
+
+		llbitmap_write_page(llbitmap, i);
+	}
+
+	blk_finish_plug(&plug);
+	md_super_wait(llbitmap->mddev);
+	complete(unplug_work->done);
+}
+
+static bool llbitmap_dirty(struct llbitmap *llbitmap)
+{
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++)
+		if (test_bit(LLPageDirty, &llbitmap->pctl[i]->flags))
+			return true;
+
+	return false;
+}
+
+static void llbitmap_unplug(struct mddev *mddev, bool sync)
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct llbitmap *llbitmap = mddev->bitmap;
+	struct llbitmap_unplug_work unplug_work = {
+		.llbitmap = llbitmap,
+		.done = &done,
+	};
+
+	if (!llbitmap_dirty(llbitmap))
+		return;
+
+	/*
+	 * Issue new bitmap IO under submit_bio() context will deadlock:
+	 *  - the bio will wait for bitmap bio to be done, before it can be
+	 *  issued;
+	 *  - bitmap bio will be added to current->bio_list and wait for this
+	 *  bio to be issued;
+	 */
+	INIT_WORK_ONSTACK(&unplug_work.work, llbitmap_unplug_fn);
+	queue_work(md_llbitmap_unplug_wq, &unplug_work.work);
+	wait_for_completion(&done);
+	destroy_work_on_stack(&unplug_work.work);
+}
+
+/*
+ * Force to write all bitmap pages to disk, called when stopping the array, or
+ * every daemon_sleep seconds when sync_thread is running.
+ */
+static void __llbitmap_flush(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	struct blk_plug plug;
+	int i;
+
+	blk_start_plug(&plug);
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		/* mark all blocks as dirty */
+		set_bit(LLPageDirty, &pctl->flags);
+		bitmap_fill(pctl->dirty, llbitmap->blocks_per_page);
+		llbitmap_write_page(llbitmap, i);
+	}
+	blk_finish_plug(&plug);
+	md_super_wait(llbitmap->mddev);
+}
+
+static void llbitmap_flush(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++)
+		set_bit(LLPageFlush, &llbitmap->pctl[i]->flags);
+
+	timer_delete_sync(&llbitmap->pending_timer);
+	queue_work(md_llbitmap_io_wq, &llbitmap->daemon_work);
+	flush_work(&llbitmap->daemon_work);
+
+	__llbitmap_flush(mddev);
+}
+
+/* This is used for raid5 lazy initial recovery */
+static bool llbitmap_blocks_synced(struct mddev *mddev, sector_t offset)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+	enum llbitmap_state c = llbitmap_read(llbitmap, p);
+
+	return c == BitClean || c == BitDirty;
+}
+
+static sector_t llbitmap_skip_sync_blocks(struct mddev *mddev, sector_t offset)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+	int blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	enum llbitmap_state c = llbitmap_read(llbitmap, p);
+
+	/* always skip unwritten blocks */
+	if (c == BitUnwritten)
+		return blocks;
+
+	/* For degraded array, don't skip */
+	if (mddev->degraded)
+		return 0;
+
+	/* For resync also skip clean/dirty blocks */
+	if ((c == BitClean || c == BitDirty) &&
+	    test_bit(MD_RECOVERY_SYNC, &mddev->recovery) &&
+	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+		return blocks;
+
+	return 0;
+}
+
+static bool llbitmap_start_sync(struct mddev *mddev, sector_t offset,
+				sector_t *blocks, bool degraded)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+
+	/*
+	 * Handle one bit at a time, this is much simpler. And it doesn't matter
+	 * if md_do_sync() loop more times.
+	 */
+	*blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	return llbitmap_state_machine(llbitmap, p, p,
+				      BitmapActionStartsync) == BitSyncing;
+}
+
+/* Something is wrong, sync_thread stop at @offset */
+static void llbitmap_end_sync(struct mddev *mddev, sector_t offset,
+			      sector_t *blocks)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long p = offset >> llbitmap->chunkshift;
+
+	*blocks = llbitmap->chunksize - (offset & (llbitmap->chunksize - 1));
+	llbitmap_state_machine(llbitmap, p, llbitmap->chunks - 1,
+			       BitmapActionAbortsync);
+}
+
+/* A full sync_thread is finished */
+static void llbitmap_close_sync(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	int i;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		/* let daemon_fn clear dirty bits immediately */
+		WRITE_ONCE(pctl->expire, jiffies);
+	}
+
+	llbitmap_state_machine(llbitmap, 0, llbitmap->chunks - 1,
+			       BitmapActionEndsync);
+}
+
+/*
+ * sync_thread have reached @sector, update metadata every daemon_sleep seconds,
+ * just in case sync_thread have to restart after power failure.
+ */
+static void llbitmap_cond_end_sync(struct mddev *mddev, sector_t sector,
+				   bool force)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (sector == 0) {
+		llbitmap->last_end_sync = jiffies;
+		return;
+	}
+
+	if (time_before(jiffies, llbitmap->last_end_sync +
+				 HZ * mddev->bitmap_info.daemon_sleep))
+		return;
+
+	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
+
+	mddev->curr_resync_completed = sector;
+	set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	llbitmap_state_machine(llbitmap, 0, sector >> llbitmap->chunkshift,
+			       BitmapActionEndsync);
+	__llbitmap_flush(mddev);
+
+	llbitmap->last_end_sync = jiffies;
+	sysfs_notify_dirent_safe(mddev->sysfs_completed);
+}
+
+static bool llbitmap_enabled(void *data, bool flush)
+{
+	struct llbitmap *llbitmap = data;
+
+	return llbitmap && !test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+}
+
+static void llbitmap_dirty_bits(struct mddev *mddev, unsigned long s,
+				unsigned long e)
+{
+	llbitmap_state_machine(mddev->bitmap, s, e, BitmapActionStartwrite);
+}
+
+static void llbitmap_write_sb(struct llbitmap *llbitmap)
+{
+	int nr_blocks = DIV_ROUND_UP(BITMAP_DATA_OFFSET, llbitmap->io_size);
+
+	bitmap_fill(llbitmap->pctl[0]->dirty, nr_blocks);
+	llbitmap_write_page(llbitmap, 0);
+	md_super_wait(llbitmap->mddev);
+}
+
+static void llbitmap_update_sb(void *data)
+{
+	struct llbitmap *llbitmap = data;
+	struct mddev *mddev = llbitmap->mddev;
+	struct page *sb_page;
+	bitmap_super_t *sb;
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags))
+		return;
+
+	sb_page = llbitmap_read_page(llbitmap, 0);
+	if (IS_ERR(sb_page)) {
+		pr_err("%s: %s: read super block failed", __func__,
+		       mdname(mddev));
+		set_bit(BITMAP_WRITE_ERROR, &llbitmap->flags);
+		return;
+	}
+
+	if (mddev->events < llbitmap->events_cleared)
+		llbitmap->events_cleared = mddev->events;
+
+	sb = kmap_local_page(sb_page);
+	sb->events = cpu_to_le64(mddev->events);
+	sb->state = cpu_to_le32(llbitmap->flags);
+	sb->chunksize = cpu_to_le32(llbitmap->chunksize);
+	sb->sync_size = cpu_to_le64(mddev->resync_max_sectors);
+	sb->events_cleared = cpu_to_le64(llbitmap->events_cleared);
+	sb->sectors_reserved = cpu_to_le32(mddev->bitmap_info.space);
+	sb->daemon_sleep = cpu_to_le32(mddev->bitmap_info.daemon_sleep);
+
+	kunmap_local(sb);
+	llbitmap_write_sb(llbitmap);
+}
+
+static int llbitmap_get_stats(void *data, struct md_bitmap_stats *stats)
+{
+	struct llbitmap *llbitmap = data;
+
+	memset(stats, 0, sizeof(*stats));
+
+	stats->missing_pages = 0;
+	stats->pages = llbitmap->nr_pages;
+	stats->file_pages = llbitmap->nr_pages;
+
+	stats->behind_writes = atomic_read(&llbitmap->behind_writes);
+	stats->behind_wait = wq_has_sleeper(&llbitmap->behind_wait);
+	stats->events_cleared = llbitmap->events_cleared;
+
+	return 0;
+}
+
+/* just flag all pages as needing to be written */
+static void llbitmap_write_all(struct mddev *mddev)
+{
+	int i;
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	for (i = 0; i < llbitmap->nr_pages; i++) {
+		struct llbitmap_page_ctl *pctl = llbitmap->pctl[i];
+
+		set_bit(LLPageDirty, &pctl->flags);
+		bitmap_fill(pctl->dirty, llbitmap->blocks_per_page);
+	}
+}
+
+static void llbitmap_start_behind_write(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	atomic_inc(&llbitmap->behind_writes);
+}
+
+static void llbitmap_end_behind_write(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (atomic_dec_and_test(&llbitmap->behind_writes))
+		wake_up(&llbitmap->behind_wait);
+}
+
+static void llbitmap_wait_behind_writes(struct mddev *mddev)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	if (!llbitmap)
+		return;
+
+	wait_event(llbitmap->behind_wait,
+		   atomic_read(&llbitmap->behind_writes) == 0);
+
+}
+
+static ssize_t bits_show(struct mddev *mddev, char *page)
+{
+	struct llbitmap *llbitmap;
+	int bits[BitStateCount] = {0};
+	loff_t start = 0;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	llbitmap = mddev->bitmap;
+	if (!llbitmap || !llbitmap->pctl) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "no bitmap\n");
+	}
+
+	if (test_bit(BITMAP_WRITE_ERROR, &llbitmap->flags)) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "bitmap io error\n");
+	}
+
+	while (start < llbitmap->chunks) {
+		enum llbitmap_state c = llbitmap_read(llbitmap, start);
+
+		if (c < 0 || c >= BitStateCount)
+			pr_err("%s: invalid bit %llu state %d\n",
+			       __func__, start, c);
+		else
+			bits[c]++;
+		start++;
+	}
+
+	mutex_unlock(&mddev->bitmap_info.mutex);
+	return sprintf(page, "unwritten %d\nclean %d\ndirty %d\nneed sync %d\nsyncing %d\n",
+		       bits[BitUnwritten], bits[BitClean], bits[BitDirty],
+		       bits[BitNeedSync], bits[BitSyncing]);
+}
+
+static struct md_sysfs_entry llbitmap_bits = __ATTR_RO(bits);
+
+static ssize_t metadata_show(struct mddev *mddev, char *page)
+{
+	struct llbitmap *llbitmap;
+	ssize_t ret;
+
+	mutex_lock(&mddev->bitmap_info.mutex);
+	llbitmap = mddev->bitmap;
+	if (!llbitmap) {
+		mutex_unlock(&mddev->bitmap_info.mutex);
+		return sprintf(page, "no bitmap\n");
+	}
+
+	ret = sprintf(page, "chunksize %lu\nchunkshift %lu\nchunks %lu\noffset %llu\ndaemon_sleep %lu\n",
+		      llbitmap->chunksize, llbitmap->chunkshift,
+		      llbitmap->chunks, mddev->bitmap_info.offset,
+		      llbitmap->mddev->bitmap_info.daemon_sleep);
+	mutex_unlock(&mddev->bitmap_info.mutex);
+
+	return ret;
+}
+
+static struct md_sysfs_entry llbitmap_metadata = __ATTR_RO(metadata);
+
+static ssize_t
+daemon_sleep_show(struct mddev *mddev, char *page)
+{
+	return sprintf(page, "%lu\n", mddev->bitmap_info.daemon_sleep);
+}
+
+static ssize_t
+daemon_sleep_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	unsigned long timeout;
+	int rv = kstrtoul(buf, 10, &timeout);
+
+	if (rv)
+		return rv;
+
+	mddev->bitmap_info.daemon_sleep = timeout;
+	return len;
+}
+
+static struct md_sysfs_entry llbitmap_daemon_sleep = __ATTR_RW(daemon_sleep);
+
+static ssize_t
+barrier_idle_show(struct mddev *mddev, char *page)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+
+	return sprintf(page, "%lu\n", llbitmap->barrier_idle);
+}
+
+static ssize_t
+barrier_idle_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	struct llbitmap *llbitmap = mddev->bitmap;
+	unsigned long timeout;
+	int rv = kstrtoul(buf, 10, &timeout);
+
+	if (rv)
+		return rv;
+
+	llbitmap->barrier_idle = timeout;
+	return len;
+}
+
+static struct md_sysfs_entry llbitmap_barrier_idle = __ATTR_RW(barrier_idle);
+
+static struct attribute *md_llbitmap_attrs[] = {
+	&llbitmap_bits.attr,
+	&llbitmap_metadata.attr,
+	&llbitmap_daemon_sleep.attr,
+	&llbitmap_barrier_idle.attr,
+	NULL
+};
+
+static struct attribute_group 
md_llbitmap_group = { 1558 + .name = "llbitmap", 1559 + .attrs = md_llbitmap_attrs, 1560 + }; 1561 + 1562 + static struct bitmap_operations llbitmap_ops = { 1563 + .head = { 1564 + .type = MD_BITMAP, 1565 + .id = ID_LLBITMAP, 1566 + .name = "llbitmap", 1567 + }, 1568 + 1569 + .enabled = llbitmap_enabled, 1570 + .create = llbitmap_create, 1571 + .resize = llbitmap_resize, 1572 + .load = llbitmap_load, 1573 + .destroy = llbitmap_destroy, 1574 + 1575 + .start_write = llbitmap_start_write, 1576 + .end_write = llbitmap_end_write, 1577 + .start_discard = llbitmap_start_discard, 1578 + .end_discard = llbitmap_end_discard, 1579 + .unplug = llbitmap_unplug, 1580 + .flush = llbitmap_flush, 1581 + 1582 + .start_behind_write = llbitmap_start_behind_write, 1583 + .end_behind_write = llbitmap_end_behind_write, 1584 + .wait_behind_writes = llbitmap_wait_behind_writes, 1585 + 1586 + .blocks_synced = llbitmap_blocks_synced, 1587 + .skip_sync_blocks = llbitmap_skip_sync_blocks, 1588 + .start_sync = llbitmap_start_sync, 1589 + .end_sync = llbitmap_end_sync, 1590 + .close_sync = llbitmap_close_sync, 1591 + .cond_end_sync = llbitmap_cond_end_sync, 1592 + 1593 + .update_sb = llbitmap_update_sb, 1594 + .get_stats = llbitmap_get_stats, 1595 + .dirty_bits = llbitmap_dirty_bits, 1596 + .write_all = llbitmap_write_all, 1597 + 1598 + .group = &md_llbitmap_group, 1599 + }; 1600 + 1601 + int md_llbitmap_init(void) 1602 + { 1603 + md_llbitmap_io_wq = alloc_workqueue("md_llbitmap_io", 1604 + WQ_MEM_RECLAIM | WQ_UNBOUND, 0); 1605 + if (!md_llbitmap_io_wq) 1606 + return -ENOMEM; 1607 + 1608 + md_llbitmap_unplug_wq = alloc_workqueue("md_llbitmap_unplug", 1609 + WQ_MEM_RECLAIM | WQ_UNBOUND, 0); 1610 + if (!md_llbitmap_unplug_wq) { 1611 + destroy_workqueue(md_llbitmap_io_wq); 1612 + md_llbitmap_io_wq = NULL; 1613 + return -ENOMEM; 1614 + } 1615 + 1616 + return register_md_submodule(&llbitmap_ops.head); 1617 + } 1618 + 1619 + void md_llbitmap_exit(void) 1620 + { 1621 + 
destroy_workqueue(md_llbitmap_io_wq); 1622 + md_llbitmap_io_wq = NULL; 1623 + destroy_workqueue(md_llbitmap_unplug_wq); 1624 + md_llbitmap_unplug_wq = NULL; 1625 + unregister_md_submodule(&llbitmap_ops.head); 1626 + }
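The `bits_show()` handler in the md-llbitmap.c hunk above walks every chunk, reads its state, and tallies one counter per state, logging out-of-range values instead of counting them. Below is a minimal userspace sketch of that tally. It assumes the enum order matches the order of the fields printed by `bits_show()` (the enum definition is not part of this hunk), uses a plain array in place of `llbitmap_read()`, and `count_states()` is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed ordering; the real enum is defined outside this hunk. */
enum llbitmap_state {
	BitUnwritten = 0,
	BitClean,
	BitDirty,
	BitNeedSync,
	BitSyncing,
	BitStateCount,
};

/*
 * Hypothetical helper: tally chunk states into bits[], skipping values
 * outside [0, BitStateCount). Returns the number of skipped chunks.
 */
static size_t count_states(const int *chunks, size_t nr_chunks,
			   int bits[BitStateCount])
{
	size_t invalid = 0;
	size_t i;

	for (i = 0; i < (size_t)BitStateCount; i++)
		bits[i] = 0;

	for (i = 0; i < nr_chunks; i++) {
		int c = chunks[i];

		if (c < 0 || c >= BitStateCount) {
			/* Mirrors the pr_err() branch: report, do not count. */
			fprintf(stderr, "invalid bit %zu state %d\n", i, c);
			invalid++;
		} else {
			bits[c]++;
		}
	}

	return invalid;
}
```

The range check comes before the array index, so a corrupted state value can never write outside `bits[]`; the sysfs handler in the hunk guards its own counter array the same way.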
+308 -70
drivers/md/md.c
··· 94 94 * workqueue whith reconfig_mutex grabbed. 95 95 */ 96 96 static struct workqueue_struct *md_misc_wq; 97 - struct workqueue_struct *md_bitmap_wq; 98 97 99 98 static int remove_and_add_spares(struct mddev *mddev, 100 99 struct md_rdev *this); ··· 676 677 677 678 static void no_op(struct percpu_ref *r) {} 678 679 680 + static bool mddev_set_bitmap_ops(struct mddev *mddev) 681 + { 682 + struct bitmap_operations *old = mddev->bitmap_ops; 683 + struct md_submodule_head *head; 684 + 685 + if (mddev->bitmap_id == ID_BITMAP_NONE || 686 + (old && old->head.id == mddev->bitmap_id)) 687 + return true; 688 + 689 + xa_lock(&md_submodule); 690 + head = xa_load(&md_submodule, mddev->bitmap_id); 691 + 692 + if (!head) { 693 + pr_warn("md: can't find bitmap id %d\n", mddev->bitmap_id); 694 + goto err; 695 + } 696 + 697 + if (head->type != MD_BITMAP) { 698 + pr_warn("md: invalid bitmap id %d\n", mddev->bitmap_id); 699 + goto err; 700 + } 701 + 702 + mddev->bitmap_ops = (void *)head; 703 + xa_unlock(&md_submodule); 704 + 705 + if (!mddev_is_dm(mddev) && mddev->bitmap_ops->group) { 706 + if (sysfs_create_group(&mddev->kobj, mddev->bitmap_ops->group)) 707 + pr_warn("md: cannot register extra bitmap attributes for %s\n", 708 + mdname(mddev)); 709 + else 710 + /* 711 + * Inform user with KOBJ_CHANGE about new bitmap 712 + * attributes. 
713 + */ 714 + kobject_uevent(&mddev->kobj, KOBJ_CHANGE); 715 + } 716 + return true; 717 + 718 + err: 719 + xa_unlock(&md_submodule); 720 + return false; 721 + } 722 + 723 + static void mddev_clear_bitmap_ops(struct mddev *mddev) 724 + { 725 + if (!mddev_is_dm(mddev) && mddev->bitmap_ops && 726 + mddev->bitmap_ops->group) 727 + sysfs_remove_group(&mddev->kobj, mddev->bitmap_ops->group); 728 + 729 + mddev->bitmap_ops = NULL; 730 + } 731 + 679 732 int mddev_init(struct mddev *mddev) 680 733 { 734 + if (!IS_ENABLED(CONFIG_MD_BITMAP)) 735 + mddev->bitmap_id = ID_BITMAP_NONE; 736 + else 737 + mddev->bitmap_id = ID_BITMAP; 681 738 682 739 if (percpu_ref_init(&mddev->active_io, active_io_release, 683 740 PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) ··· 768 713 mddev->resync_min = 0; 769 714 mddev->resync_max = MaxSector; 770 715 mddev->level = LEVEL_NONE; 771 - mddev_set_bitmap_ops(mddev); 772 716 773 717 INIT_WORK(&mddev->sync_work, md_start_sync); 774 718 INIT_WORK(&mddev->del_work, mddev_delayed_delete); ··· 1074 1020 wake_up(&mddev->sb_wait); 1075 1021 } 1076 1022 1077 - void md_super_write(struct mddev *mddev, struct md_rdev *rdev, 1078 - sector_t sector, int size, struct page *page) 1023 + /** 1024 + * md_write_metadata - write metadata to underlying disk, including 1025 + * array superblock, badblocks, bitmap superblock and bitmap bits. 1026 + * @mddev: the array to write 1027 + * @rdev: the underlying disk to write 1028 + * @sector: the offset to @rdev 1029 + * @size: the length of the metadata 1030 + * @page: the metadata 1031 + * @offset: the offset to @page 1032 + * 1033 + * Write @size bytes of @page start from @offset, to @sector of @rdev, Increment 1034 + * mddev->pending_writes before returning, and decrement it on completion, 1035 + * waking up sb_wait. Caller must call md_super_wait() after issuing io to all 1036 + * rdev. If an error occurred, md_error() will be called, and the @rdev will be 1037 + * kicked out from @mddev. 
1038 + */ 1039 + void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev, 1040 + sector_t sector, int size, struct page *page, 1041 + unsigned int offset) 1079 1042 { 1080 - /* write first size bytes of page to sector of rdev 1081 - * Increment mddev->pending_writes before returning 1082 - * and decrement it on completion, waking up sb_wait 1083 - * if zero is reached. 1084 - * If an error occurred, call md_error 1085 - */ 1086 1043 struct bio *bio; 1087 1044 1088 1045 if (!page) ··· 1111 1046 atomic_inc(&rdev->nr_pending); 1112 1047 1113 1048 bio->bi_iter.bi_sector = sector; 1114 - __bio_add_page(bio, page, size, 0); 1049 + __bio_add_page(bio, page, size, offset); 1115 1050 bio->bi_private = rdev; 1116 1051 bio->bi_end_io = super_written; 1117 1052 ··· 1421 1356 struct md_bitmap_stats stats; 1422 1357 int err; 1423 1358 1359 + if (!md_bitmap_enabled(mddev, false)) 1360 + return 0; 1361 + 1424 1362 err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats); 1425 1363 if (err) 1426 1364 return 0; ··· 1721 1653 if ((u64)num_sectors >= (2ULL << 32) && rdev->mddev->level >= 1) 1722 1654 num_sectors = (sector_t)(2ULL << 32) - 2; 1723 1655 do { 1724 - md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size, 1725 - rdev->sb_page); 1656 + md_write_metadata(rdev->mddev, rdev, rdev->sb_start, 1657 + rdev->sb_size, rdev->sb_page, 0); 1726 1658 } while (md_super_wait(rdev->mddev) < 0); 1727 1659 return num_sectors; 1728 1660 } ··· 2370 2302 sb->super_offset = cpu_to_le64(rdev->sb_start); 2371 2303 sb->sb_csum = calc_sb_1_csum(sb); 2372 2304 do { 2373 - md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size, 2374 - rdev->sb_page); 2305 + md_write_metadata(rdev->mddev, rdev, rdev->sb_start, 2306 + rdev->sb_size, rdev->sb_page, 0); 2375 2307 } while (md_super_wait(rdev->mddev) < 0); 2376 2308 return num_sectors; 2377 2309 ··· 2381 2313 super_1_allow_new_offset(struct md_rdev *rdev, 2382 2314 unsigned long long new_offset) 2383 2315 { 2316 + struct mddev 
*mddev = rdev->mddev; 2317 + 2384 2318 /* All necessary checks on new >= old have been done */ 2385 2319 if (new_offset >= rdev->data_offset) 2386 2320 return 1; 2387 2321 2388 2322 /* with 1.0 metadata, there is no metadata to tread on 2389 2323 * so we can always move back */ 2390 - if (rdev->mddev->minor_version == 0) 2324 + if (mddev->minor_version == 0) 2391 2325 return 1; 2392 2326 2393 2327 /* otherwise we must be sure not to step on ··· 2401 2331 if (rdev->sb_start + (32+4)*2 > new_offset) 2402 2332 return 0; 2403 2333 2404 - if (!rdev->mddev->bitmap_info.file) { 2405 - struct mddev *mddev = rdev->mddev; 2334 + if (md_bitmap_registered(mddev) && !mddev->bitmap_info.file) { 2406 2335 struct md_bitmap_stats stats; 2407 2336 int err; 2408 2337 ··· 2873 2804 2874 2805 mddev_add_trace_msg(mddev, "md md_update_sb"); 2875 2806 rewrite: 2876 - mddev->bitmap_ops->update_sb(mddev->bitmap); 2807 + if (md_bitmap_enabled(mddev, false)) 2808 + mddev->bitmap_ops->update_sb(mddev->bitmap); 2877 2809 rdev_for_each(rdev, mddev) { 2878 2810 if (rdev->sb_loaded != 1) 2879 2811 continue; /* no noise on spare devices */ 2880 2812 2881 2813 if (!test_bit(Faulty, &rdev->flags)) { 2882 - md_super_write(mddev,rdev, 2883 - rdev->sb_start, rdev->sb_size, 2884 - rdev->sb_page); 2814 + md_write_metadata(mddev, rdev, rdev->sb_start, 2815 + rdev->sb_size, rdev->sb_page, 0); 2885 2816 pr_debug("md: (write) %pg's sb offset: %llu\n", 2886 2817 rdev->bdev, 2887 2818 (unsigned long long)rdev->sb_start); 2888 2819 rdev->sb_events = mddev->events; 2889 2820 if (rdev->badblocks.size) { 2890 - md_super_write(mddev, rdev, 2891 - rdev->badblocks.sector, 2892 - rdev->badblocks.size << 9, 2893 - rdev->bb_page); 2821 + md_write_metadata(mddev, rdev, 2822 + rdev->badblocks.sector, 2823 + rdev->badblocks.size << 9, 2824 + rdev->bb_page, 0); 2894 2825 rdev->badblocks.size = 0; 2895 2826 } 2896 2827 ··· 4219 4150 __ATTR(new_level, 0664, new_level_show, new_level_store); 4220 4151 4221 4152 static ssize_t 
4153 + bitmap_type_show(struct mddev *mddev, char *page) 4154 + { 4155 + struct md_submodule_head *head; 4156 + unsigned long i; 4157 + ssize_t len = 0; 4158 + 4159 + if (mddev->bitmap_id == ID_BITMAP_NONE) 4160 + len += sprintf(page + len, "[none] "); 4161 + else 4162 + len += sprintf(page + len, "none "); 4163 + 4164 + xa_lock(&md_submodule); 4165 + xa_for_each(&md_submodule, i, head) { 4166 + if (head->type != MD_BITMAP) 4167 + continue; 4168 + 4169 + if (mddev->bitmap_id == head->id) 4170 + len += sprintf(page + len, "[%s] ", head->name); 4171 + else 4172 + len += sprintf(page + len, "%s ", head->name); 4173 + } 4174 + xa_unlock(&md_submodule); 4175 + 4176 + len += sprintf(page + len, "\n"); 4177 + return len; 4178 + } 4179 + 4180 + static ssize_t 4181 + bitmap_type_store(struct mddev *mddev, const char *buf, size_t len) 4182 + { 4183 + struct md_submodule_head *head; 4184 + enum md_submodule_id id; 4185 + unsigned long i; 4186 + int err = 0; 4187 + 4188 + xa_lock(&md_submodule); 4189 + 4190 + if (mddev->bitmap_ops) { 4191 + err = -EBUSY; 4192 + goto out; 4193 + } 4194 + 4195 + if (cmd_match(buf, "none")) { 4196 + mddev->bitmap_id = ID_BITMAP_NONE; 4197 + goto out; 4198 + } 4199 + 4200 + xa_for_each(&md_submodule, i, head) { 4201 + if (head->type == MD_BITMAP && cmd_match(buf, head->name)) { 4202 + mddev->bitmap_id = head->id; 4203 + goto out; 4204 + } 4205 + } 4206 + 4207 + err = kstrtoint(buf, 10, &id); 4208 + if (err) 4209 + goto out; 4210 + 4211 + if (id == ID_BITMAP_NONE) { 4212 + mddev->bitmap_id = id; 4213 + goto out; 4214 + } 4215 + 4216 + head = xa_load(&md_submodule, id); 4217 + if (head && head->type == MD_BITMAP) { 4218 + mddev->bitmap_id = id; 4219 + goto out; 4220 + } 4221 + 4222 + err = -ENOENT; 4223 + 4224 + out: 4225 + xa_unlock(&md_submodule); 4226 + return err ? 
err : len; 4227 + } 4228 + 4229 + static struct md_sysfs_entry md_bitmap_type = 4230 + __ATTR(bitmap_type, 0664, bitmap_type_show, bitmap_type_store); 4231 + 4232 + static ssize_t 4222 4233 layout_show(struct mddev *mddev, char *page) 4223 4234 { 4224 4235 /* just a number, not meaningful for all levels */ ··· 4828 4679 char *end; 4829 4680 unsigned long chunk, end_chunk; 4830 4681 int err; 4682 + 4683 + if (!md_bitmap_enabled(mddev, false)) 4684 + return len; 4831 4685 4832 4686 err = mddev_lock(mddev); 4833 4687 if (err) ··· 5904 5752 static struct attribute *md_default_attrs[] = { 5905 5753 &md_level.attr, 5906 5754 &md_new_level.attr, 5755 + &md_bitmap_type.attr, 5907 5756 &md_layout.attr, 5908 5757 &md_raid_disks.attr, 5909 5758 &md_uuid.attr, ··· 5954 5801 5955 5802 static const struct attribute_group *md_attr_groups[] = { 5956 5803 &md_default_group, 5957 - &md_bitmap_group, 5958 5804 NULL, 5959 5805 }; 5960 5806 ··· 6285 6133 6286 6134 static int start_dirty_degraded; 6287 6135 6136 + static int md_bitmap_create(struct mddev *mddev) 6137 + { 6138 + if (mddev->bitmap_id == ID_BITMAP_NONE) 6139 + return -EINVAL; 6140 + 6141 + if (!mddev_set_bitmap_ops(mddev)) 6142 + return -ENOENT; 6143 + 6144 + return mddev->bitmap_ops->create(mddev); 6145 + } 6146 + 6147 + static void md_bitmap_destroy(struct mddev *mddev) 6148 + { 6149 + if (!md_bitmap_registered(mddev)) 6150 + return; 6151 + 6152 + mddev->bitmap_ops->destroy(mddev); 6153 + mddev_clear_bitmap_ops(mddev); 6154 + } 6155 + 6288 6156 int md_run(struct mddev *mddev) 6289 6157 { 6290 6158 int err; ··· 6471 6299 } 6472 6300 if (err == 0 && pers->sync_request && 6473 6301 (mddev->bitmap_info.file || mddev->bitmap_info.offset)) { 6474 - err = mddev->bitmap_ops->create(mddev); 6302 + err = md_bitmap_create(mddev); 6475 6303 if (err) 6476 6304 pr_warn("%s: failed to create bitmap (%d)\n", 6477 6305 mdname(mddev), err); ··· 6544 6372 pers->free(mddev, mddev->private); 6545 6373 mddev->private = NULL; 6546 6374 
put_pers(pers); 6547 - mddev->bitmap_ops->destroy(mddev); 6375 + md_bitmap_destroy(mddev); 6548 6376 abort: 6549 6377 bioset_exit(&mddev->io_clone_set); 6550 6378 exit_sync_set: ··· 6564 6392 if (err) 6565 6393 goto out; 6566 6394 6567 - err = mddev->bitmap_ops->load(mddev); 6568 - if (err) { 6569 - mddev->bitmap_ops->destroy(mddev); 6570 - goto out; 6395 + if (md_bitmap_registered(mddev)) { 6396 + err = mddev->bitmap_ops->load(mddev); 6397 + if (err) { 6398 + md_bitmap_destroy(mddev); 6399 + goto out; 6400 + } 6571 6401 } 6572 6402 6573 6403 if (mddev_is_clustered(mddev)) ··· 6720 6546 mddev->pers->quiesce(mddev, 0); 6721 6547 } 6722 6548 6723 - mddev->bitmap_ops->flush(mddev); 6549 + if (md_bitmap_enabled(mddev, true)) 6550 + mddev->bitmap_ops->flush(mddev); 6724 6551 6725 6552 if (md_is_rdwr(mddev) && 6726 6553 ((!mddev->in_sync && !mddev_is_clustered(mddev)) || ··· 6748 6573 6749 6574 static void mddev_detach(struct mddev *mddev) 6750 6575 { 6751 - mddev->bitmap_ops->wait_behind_writes(mddev); 6576 + if (md_bitmap_enabled(mddev, false)) 6577 + mddev->bitmap_ops->wait_behind_writes(mddev); 6752 6578 if (mddev->pers && mddev->pers->quiesce && !is_md_suspended(mddev)) { 6753 6579 mddev->pers->quiesce(mddev, 1); 6754 6580 mddev->pers->quiesce(mddev, 0); ··· 6765 6589 { 6766 6590 struct md_personality *pers = mddev->pers; 6767 6591 6768 - mddev->bitmap_ops->destroy(mddev); 6592 + md_bitmap_destroy(mddev); 6769 6593 mddev_detach(mddev); 6770 6594 spin_lock(&mddev->lock); 6771 6595 mddev->pers = NULL; ··· 7483 7307 { 7484 7308 int err = 0; 7485 7309 7310 + if (!md_bitmap_registered(mddev)) 7311 + return -EINVAL; 7312 + 7486 7313 if (mddev->pers) { 7487 7314 if (!mddev->pers->quiesce || !mddev->thread) 7488 7315 return -EBUSY; ··· 7542 7363 err = 0; 7543 7364 if (mddev->pers) { 7544 7365 if (fd >= 0) { 7545 - err = mddev->bitmap_ops->create(mddev); 7366 + err = md_bitmap_create(mddev); 7546 7367 if (!err) 7547 7368 err = mddev->bitmap_ops->load(mddev); 7548 7369 7549 
7370 if (err) { 7550 - mddev->bitmap_ops->destroy(mddev); 7371 + md_bitmap_destroy(mddev); 7551 7372 fd = -1; 7552 7373 } 7553 7374 } else if (fd < 0) { 7554 - mddev->bitmap_ops->destroy(mddev); 7375 + md_bitmap_destroy(mddev); 7555 7376 } 7556 7377 } 7557 7378 ··· 7858 7679 mddev->bitmap_info.default_offset; 7859 7680 mddev->bitmap_info.space = 7860 7681 mddev->bitmap_info.default_space; 7861 - rv = mddev->bitmap_ops->create(mddev); 7682 + rv = md_bitmap_create(mddev); 7862 7683 if (!rv) 7863 7684 rv = mddev->bitmap_ops->load(mddev); 7864 7685 7865 7686 if (rv) 7866 - mddev->bitmap_ops->destroy(mddev); 7687 + md_bitmap_destroy(mddev); 7867 7688 } else { 7868 7689 struct md_bitmap_stats stats; 7869 7690 ··· 7889 7710 put_cluster_ops(mddev); 7890 7711 mddev->safemode_delay = DEFAULT_SAFEMODE_DELAY; 7891 7712 } 7892 - mddev->bitmap_ops->destroy(mddev); 7713 + md_bitmap_destroy(mddev); 7893 7714 mddev->bitmap_info.offset = 0; 7894 7715 } 7895 7716 } ··· 8670 8491 unsigned long chunk_kb; 8671 8492 int err; 8672 8493 8494 + if (!md_bitmap_enabled(mddev, false)) 8495 + return; 8496 + 8673 8497 err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats); 8674 8498 if (err) 8675 8499 return; ··· 9055 8873 static void md_bitmap_start(struct mddev *mddev, 9056 8874 struct md_io_clone *md_io_clone) 9057 8875 { 8876 + md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ? 
8877 + mddev->bitmap_ops->start_discard : 8878 + mddev->bitmap_ops->start_write; 8879 + 9058 8880 if (mddev->pers->bitmap_sector) 9059 8881 mddev->pers->bitmap_sector(mddev, &md_io_clone->offset, 9060 8882 &md_io_clone->sectors); 9061 8883 9062 - mddev->bitmap_ops->start_write(mddev, md_io_clone->offset, 9063 - md_io_clone->sectors); 8884 + fn(mddev, md_io_clone->offset, md_io_clone->sectors); 9064 8885 } 9065 8886 9066 8887 static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone) 9067 8888 { 9068 - mddev->bitmap_ops->end_write(mddev, md_io_clone->offset, 9069 - md_io_clone->sectors); 8889 + md_bitmap_fn *fn = unlikely(md_io_clone->rw == STAT_DISCARD) ? 8890 + mddev->bitmap_ops->end_discard : 8891 + mddev->bitmap_ops->end_write; 8892 + 8893 + fn(mddev, md_io_clone->offset, md_io_clone->sectors); 9070 8894 } 9071 8895 9072 8896 static void md_end_clone_io(struct bio *bio) ··· 9081 8893 struct bio *orig_bio = md_io_clone->orig_bio; 9082 8894 struct mddev *mddev = md_io_clone->mddev; 9083 8895 9084 - if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap) 8896 + if (bio_data_dir(orig_bio) == WRITE && md_bitmap_enabled(mddev, false)) 9085 8897 md_bitmap_end(mddev, md_io_clone); 9086 8898 9087 8899 if (bio->bi_status && !orig_bio->bi_status) ··· 9108 8920 if (blk_queue_io_stat(bdev->bd_disk->queue)) 9109 8921 md_io_clone->start_time = bio_start_io_acct(*bio); 9110 8922 9111 - if (bio_data_dir(*bio) == WRITE && mddev->bitmap) { 8923 + if (bio_data_dir(*bio) == WRITE && md_bitmap_enabled(mddev, false)) { 9112 8924 md_io_clone->offset = (*bio)->bi_iter.bi_sector; 9113 8925 md_io_clone->sectors = bio_sectors(*bio); 8926 + md_io_clone->rw = op_stat_group(bio_op(*bio)); 9114 8927 md_bitmap_start(mddev, md_io_clone); 9115 8928 } 9116 8929 ··· 9133 8944 struct bio *orig_bio = md_io_clone->orig_bio; 9134 8945 struct mddev *mddev = md_io_clone->mddev; 9135 8946 9136 - if (bio_data_dir(orig_bio) == WRITE && mddev->bitmap) 8947 + if (bio_data_dir(orig_bio) == 
WRITE && md_bitmap_enabled(mddev, false)) 9137 8948 md_bitmap_end(mddev, md_io_clone); 9138 8949 9139 8950 if (bio->bi_status && !orig_bio->bi_status) ··· 9199 9010 } 9200 9011 } 9201 9012 9013 + /* 9014 + * If lazy recovery is requested and all rdevs are in sync, select the rdev with 9015 + * the higest index to perfore recovery to build initial xor data, this is the 9016 + * same as old bitmap. 9017 + */ 9018 + static bool mddev_select_lazy_recover_rdev(struct mddev *mddev) 9019 + { 9020 + struct md_rdev *recover_rdev = NULL; 9021 + struct md_rdev *rdev; 9022 + bool ret = false; 9023 + 9024 + rcu_read_lock(); 9025 + rdev_for_each_rcu(rdev, mddev) { 9026 + if (rdev->raid_disk < 0) 9027 + continue; 9028 + 9029 + if (test_bit(Faulty, &rdev->flags) || 9030 + !test_bit(In_sync, &rdev->flags)) 9031 + break; 9032 + 9033 + if (!recover_rdev || recover_rdev->raid_disk < rdev->raid_disk) 9034 + recover_rdev = rdev; 9035 + } 9036 + 9037 + if (recover_rdev) { 9038 + clear_bit(In_sync, &recover_rdev->flags); 9039 + ret = true; 9040 + } 9041 + 9042 + rcu_read_unlock(); 9043 + return ret; 9044 + } 9045 + 9202 9046 static sector_t md_sync_position(struct mddev *mddev, enum sync_action action) 9203 9047 { 9204 9048 sector_t start = 0; ··· 9263 9041 start = rdev->recovery_offset; 9264 9042 rcu_read_unlock(); 9265 9043 9044 + /* 9045 + * If there are no spares, and raid456 lazy initial recover is 9046 + * requested. 9047 + */ 9048 + if (test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery) && 9049 + start == MaxSector && mddev_select_lazy_recover_rdev(mddev)) 9050 + start = 0; 9051 + 9266 9052 /* If there is a bitmap, we need to make sure all 9267 9053 * writes that started before we added a spare 9268 9054 * complete before we start doing a recovery. 
··· 9291 9061 9292 9062 static bool sync_io_within_limit(struct mddev *mddev) 9293 9063 { 9294 - int io_sectors; 9295 - 9296 9064 /* 9297 9065 * For raid456, sync IO is stripe(4k) per IO, for other levels, it's 9298 9066 * RESYNC_PAGES(64k) per IO. 9299 9067 */ 9300 - if (mddev->level == 4 || mddev->level == 5 || mddev->level == 6) 9301 - io_sectors = 8; 9302 - else 9303 - io_sectors = 128; 9304 - 9305 9068 return atomic_read(&mddev->recovery_active) < 9306 - io_sectors * sync_io_depth(mddev); 9069 + (raid_is_456(mddev) ? 8 : 128) * sync_io_depth(mddev); 9307 9070 } 9308 9071 9309 9072 #define SYNC_MARKS 10 ··· 9501 9278 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 9502 9279 break; 9503 9280 9281 + if (mddev->bitmap_ops && mddev->bitmap_ops->skip_sync_blocks) { 9282 + sectors = mddev->bitmap_ops->skip_sync_blocks(mddev, j); 9283 + if (sectors) 9284 + goto update; 9285 + } 9286 + 9504 9287 sectors = mddev->pers->sync_request(mddev, j, max_sectors, 9505 9288 &skipped); 9506 9289 if (sectors == 0) { ··· 9522 9293 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 9523 9294 break; 9524 9295 9296 + update: 9525 9297 j += sectors; 9526 9298 if (j > max_sectors) 9527 9299 /* when skipping, extra large numbers can be returned. */ ··· 9832 9602 9833 9603 set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); 9834 9604 clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9605 + clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery); 9835 9606 return true; 9836 9607 } 9837 9608 ··· 9841 9610 remove_spares(mddev, NULL); 9842 9611 set_bit(MD_RECOVERY_SYNC, &mddev->recovery); 9843 9612 clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9613 + clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery); 9844 9614 return true; 9845 9615 } 9846 9616 ··· 9851 9619 * re-add. 
9852 9620 */ 9853 9621 *spares = remove_and_add_spares(mddev, NULL); 9854 - if (*spares) { 9622 + if (*spares || test_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery)) { 9855 9623 clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); 9856 9624 clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); 9857 9625 clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); ··· 9909 9677 * We are adding a device or devices to an array which has the bitmap 9910 9678 * stored on all devices. So make sure all bitmap pages get written. 9911 9679 */ 9912 - if (spares) 9680 + if (spares && md_bitmap_enabled(mddev, true)) 9913 9681 mddev->bitmap_ops->write_all(mddev); 9914 9682 9915 9683 name = test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ? ··· 9997 9765 */ 9998 9766 void md_check_recovery(struct mddev *mddev) 9999 9767 { 10000 - if (mddev->bitmap) 9768 + if (md_bitmap_enabled(mddev, false) && mddev->bitmap_ops->daemon_work) 10001 9769 mddev->bitmap_ops->daemon_work(mddev); 10002 9770 10003 9771 if (signal_pending(current)) { ··· 10064 9832 } 10065 9833 10066 9834 clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 9835 + clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery); 10067 9836 clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 10068 9837 clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags); 10069 9838 ··· 10175 9942 clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); 10176 9943 clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery); 10177 9944 clear_bit(MD_RECOVERY_CHECK, &mddev->recovery); 9945 + clear_bit(MD_RECOVERY_LAZY_RECOVER, &mddev->recovery); 10178 9946 /* 10179 9947 * We call mddev->cluster_ops->update_size here because sync_size could 10180 9948 * be changed by md_update_sb, and MD_RECOVERY_RESHAPE is cleared, ··· 10323 10089 10324 10090 static int __init md_init(void) 10325 10091 { 10326 - int ret = -ENOMEM; 10092 + int ret = md_bitmap_init(); 10327 10093 10094 + if (ret) 10095 + return ret; 10096 + 10097 + ret = md_llbitmap_init(); 10098 + if (ret) 10099 + goto err_bitmap; 10100 + 10101 + 
ret = -ENOMEM; 10328 10102 md_wq = alloc_workqueue("md", WQ_MEM_RECLAIM, 0); 10329 10103 if (!md_wq) 10330 10104 goto err_wq; ··· 10340 10098 md_misc_wq = alloc_workqueue("md_misc", 0, 0); 10341 10099 if (!md_misc_wq) 10342 10100 goto err_misc_wq; 10343 - 10344 - md_bitmap_wq = alloc_workqueue("md_bitmap", WQ_MEM_RECLAIM | WQ_UNBOUND, 10345 - 0); 10346 - if (!md_bitmap_wq) 10347 - goto err_bitmap_wq; 10348 10101 10349 10102 ret = __register_blkdev(MD_MAJOR, "md", md_probe); 10350 10103 if (ret < 0) ··· 10359 10122 err_mdp: 10360 10123 unregister_blkdev(MD_MAJOR, "md"); 10361 10124 err_md: 10362 - destroy_workqueue(md_bitmap_wq); 10363 - err_bitmap_wq: 10364 10125 destroy_workqueue(md_misc_wq); 10365 10126 err_misc_wq: 10366 10127 destroy_workqueue(md_wq); 10367 10128 err_wq: 10129 + md_llbitmap_exit(); 10130 + err_bitmap: 10131 + md_bitmap_exit(); 10368 10132 return ret; 10369 10133 } 10370 10134 ··· 10383 10145 ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size)); 10384 10146 if (ret) 10385 10147 pr_info("md-cluster: resize failed\n"); 10386 - else 10148 + else if (md_bitmap_enabled(mddev, false)) 10387 10149 mddev->bitmap_ops->update_sb(mddev->bitmap); 10388 10150 } 10389 10151 ··· 10671 10433 spin_unlock(&all_mddevs_lock); 10672 10434 10673 10435 destroy_workqueue(md_misc_wq); 10674 - destroy_workqueue(md_bitmap_wq); 10675 10436 destroy_workqueue(md_wq); 10437 + md_bitmap_exit(); 10676 10438 } 10677 10439 10678 10440 subsys_initcall(md_init);
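The `bitmap_type_show()` handler in the md.c hunk above prints `none` followed by the name of every registered bitmap submodule, wraps the currently selected entry in square brackets, and ends the line with a newline. Below is a userspace sketch of that output format. `struct bitmap_type`, `ID_NONE`, `append_name()` and `format_bitmap_types()` are hypothetical stand-ins for the xarray walk and `ID_BITMAP_NONE`; only the bracket convention is taken from the diff.

```c
#include <stddef.h>
#include <stdio.h>

#define ID_NONE (-1)	/* placeholder for ID_BITMAP_NONE */

/* Hypothetical stand-in for a registered bitmap submodule head. */
struct bitmap_type {
	int id;
	const char *name;
};

/* Append "name " or "[name] " at buf + len; returns the new length. */
static size_t append_name(char *buf, size_t size, size_t len,
			  const char *name, int selected)
{
	int n;

	if (len >= size)
		return len;

	n = snprintf(buf + len, size - len, "%s%s%s ",
		     selected ? "[" : "", name, selected ? "]" : "");
	if (n < 0)
		return len;
	if ((size_t)n >= size - len)
		return size - 1;	/* output was truncated */

	return len + (size_t)n;
}

static size_t format_bitmap_types(char *buf, size_t size,
				  const struct bitmap_type *types,
				  size_t nr_types, int selected_id)
{
	size_t len = 0;
	size_t i;

	if (size == 0)
		return 0;
	buf[0] = '\0';

	len = append_name(buf, size, len, "none", selected_id == ID_NONE);
	for (i = 0; i < nr_types; i++)
		len = append_name(buf, size, len, types[i].name,
				  types[i].id == selected_id);

	/* Terminate with a newline when there is room for it. */
	if (len + 1 < size) {
		buf[len++] = '\n';
		buf[len] = '\0';
	}

	return len;
}
```

With two entries named `bitmap` and `llbitmap` and the second one selected, the sketch produces `none bitmap [llbitmap] ` followed by a newline. The name `bitmap` for the legacy implementation is an assumption here; only `llbitmap` appears in this diff.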
+17 -7
drivers/md/md.h
···
 enum md_submodule_type {
 	MD_PERSONALITY = 0,
 	MD_CLUSTER,
-	MD_BITMAP, /* TODO */
+	MD_BITMAP,
 };

 enum md_submodule_id {
···
 	ID_RAID6 = 6,
 	ID_RAID10 = 10,
 	ID_CLUSTER,
-	ID_BITMAP, /* TODO */
-	ID_LLBITMAP, /* TODO */
+	ID_BITMAP,
+	ID_LLBITMAP,
+	ID_BITMAP_NONE,
 };

 struct md_submodule_head {
···
 	struct percpu_ref writes_pending;
 	int sync_checkers; /* # of threads checking writes_pending */

+	enum md_submodule_id bitmap_id;
 	void *bitmap; /* the bitmap for the device */
 	struct bitmap_operations *bitmap_ops;
 	struct {
···
 	MD_RECOVERY_RESHAPE,
 	/* remote node is running resync thread */
 	MD_RESYNCING_REMOTE,
+	/* raid456 lazy initial recover */
+	MD_RECOVERY_LAZY_RECOVER,
 };

 enum md_ro_state {
···
 	ssize_t (*show)(struct mddev *, char *);
 	ssize_t (*store)(struct mddev *, const char *, size_t);
 };
-extern const struct attribute_group md_bitmap_group;

 static inline struct kernfs_node *sysfs_get_dirent_safe(struct kernfs_node *sd, char *name)
 {
···
 	unsigned long start_time;
 	sector_t offset;
 	unsigned long sectors;
+	enum stat_group rw;
 	struct bio bio_clone;
 };

···
 void md_free_cloned_bio(struct bio *bio);

 extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio);
-extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
-			   sector_t sector, int size, struct page *page);
+void md_write_metadata(struct mddev *mddev, struct md_rdev *rdev,
+		       sector_t sector, int size, struct page *page,
+		       unsigned int offset);
 extern int md_super_wait(struct mddev *mddev);
 extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 		struct page *page, blk_opf_t opf, bool metadata_op);
···
 struct mdu_disk_info_s;

 extern int mdp_major;
-extern struct workqueue_struct *md_bitmap_wq;
 void md_autostart_arrays(int part);
 int md_set_array_info(struct mddev *mddev, struct mdu_array_info_s *info);
 int md_add_new_disk(struct mddev *mddev, struct mdu_disk_info_s *info);
···
 static inline bool mddev_is_dm(struct mddev *mddev)
 {
 	return !mddev->gendisk;
+}
+
+static inline bool raid_is_456(struct mddev *mddev)
+{
+	return mddev->level == ID_RAID4 || mddev->level == ID_RAID5 ||
+	       mddev->level == ID_RAID6;
 }

 static inline void mddev_trace_remap(struct mddev *mddev, struct bio *bio,
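The new `raid_is_456()` helper in md.h lets `sync_io_within_limit()` in md.c collapse its if/else into a single expression: 8 sectors (one 4k stripe) per sync request for RAID 4/5/6, 128 sectors (64k) for other levels, multiplied by the sync I/O depth. Below is a userspace sketch of that arithmetic. It takes the level and depth as plain integers because `struct mddev` and `sync_io_depth()` are not reproduced here; `level_is_456()`, `sync_io_limit()` and `within_limit()` are hypothetical names.

```c
#include <stdbool.h>

/* Same test as raid_is_456(), on a bare level number. */
static bool level_is_456(int level)
{
	return level == 4 || level == 5 || level == 6;
}

/*
 * Sectors of sync I/O allowed in flight: 8 sectors (4k) per request
 * for RAID 4/5/6, 128 sectors (64k) otherwise, times the depth.
 */
static int sync_io_limit(int level, int depth)
{
	return (level_is_456(level) ? 8 : 128) * depth;
}

/* True while the in-flight sector count is strictly below the limit. */
static bool within_limit(int level, int depth, int active_sectors)
{
	return active_sectors < sync_io_limit(level, depth);
}
```

For a depth of 32, the sketch allows fewer than 256 in-flight sectors on RAID 5 and fewer than 4096 on RAID 1. The depth value is only an illustration; the kernel derives it from `sync_io_depth(mddev)`.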
+1 -1
drivers/md/raid1-10.c
···
 	 * If bitmap is not enabled, it's safe to submit the io directly, and
 	 * this can get optimal performance.
 	 */
-	if (!mddev->bitmap_ops->enabled(mddev)) {
+	if (!md_bitmap_enabled(mddev, true)) {
 		raid1_submit_write(bio);
 		return true;
 	}
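The hunk above swaps a direct call through `mddev->bitmap_ops` for a guard helper. The definition of `md_bitmap_enabled()` is not part of this section, so what follows is only a user-space sketch of the general pattern: check that an optional ops table exists before dispatching through it. Every `demo_*` name is hypothetical, and the meaning given to the second argument is an assumption rather than something this diff confirms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bitmap_operations. */
struct demo_bitmap_ops {
	bool (*enabled)(void *bitmap, bool flush);
};

/* Hypothetical stand-in for the bitmap-related fields of struct mddev. */
struct demo_dev {
	void *bitmap;                      /* NULL when no bitmap is attached */
	const struct demo_bitmap_ops *ops; /* NULL when no backend is registered */
};

/* Assumed semantics: false unless a backend exists and reports "enabled". */
bool demo_bitmap_enabled(const struct demo_dev *dev, bool flush)
{
	if (!dev->ops || !dev->ops->enabled)
		return false;
	return dev->ops->enabled(dev->bitmap, flush);
}

/* Toy backend: enabled whenever a bitmap object is attached. */
bool demo_backend_enabled(void *bitmap, bool flush)
{
	(void)flush; /* the toy backend ignores the flag */
	return bitmap != NULL;
}

const struct demo_bitmap_ops demo_backend = {
	.enabled = demo_backend_enabled,
};

/* Usage: a device without a backend must never be dereferenced through ops. */
void demo_guard_usage(void)
{
	int token = 0;
	struct demo_dev none = { .bitmap = NULL, .ops = NULL };
	struct demo_dev with = { .bitmap = &token, .ops = &demo_backend };

	assert(!demo_bitmap_enabled(&none, true));
	assert(demo_bitmap_enabled(&with, false));
}
```

With a guard of this shape, callers no longer need `bitmap_ops` to be non-NULL, which is presumably what lets the no-bitmap case (`ID_BITMAP_NONE`) be handled without a stub ops table.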
+47 -32
drivers/md/raid1.c
···
 				    (unsigned long long)r1_bio->sector,
 				    mirror->rdev->bdev);

-	if (test_bit(WriteMostly, &mirror->rdev->flags)) {
+	if (test_bit(WriteMostly, &mirror->rdev->flags) &&
+	    md_bitmap_enabled(mddev, false)) {
 		/*
 		 * Reading from a write-mostly device must take care not to
 		 * over-take any writes that are 'behind'
···
 	}

 	return true;
+}
+
+static void raid1_start_write_behind(struct mddev *mddev, struct r1bio *r1_bio,
+				     struct bio *bio)
+{
+	unsigned long max_write_behind = mddev->bitmap_info.max_write_behind;
+	struct md_bitmap_stats stats;
+	int err;
+
+	/* behind write rely on bitmap, see bitmap_operations */
+	if (!md_bitmap_enabled(mddev, false))
+		return;
+
+	err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
+	if (err)
+		return;
+
+	/* Don't do behind IO if reader is waiting, or there are too many. */
+	if (!stats.behind_wait && stats.behind_writes < max_write_behind)
+		alloc_behind_master_bio(r1_bio, bio);
+
+	if (test_bit(R1BIO_BehindIO, &r1_bio->state))
+		mddev->bitmap_ops->start_behind_write(mddev);
+
 }

 static void raid1_write_request(struct mddev *mddev, struct bio *bio,
···
 			continue;

 		if (first_clone) {
-			unsigned long max_write_behind =
-				mddev->bitmap_info.max_write_behind;
-			struct md_bitmap_stats stats;
-			int err;
-
-			/* do behind I/O ?
-			 * Not if there are too many, or cannot
-			 * allocate memory, or a reader on WriteMostly
-			 * is waiting for behind writes to flush */
-			err = mddev->bitmap_ops->get_stats(mddev->bitmap, &stats);
-			if (!err && write_behind && !stats.behind_wait &&
-			    stats.behind_writes < max_write_behind)
-				alloc_behind_master_bio(r1_bio, bio);
-
-			if (test_bit(R1BIO_BehindIO, &r1_bio->state))
-				mddev->bitmap_ops->start_behind_write(mddev);
+			if (write_behind)
+				raid1_start_write_behind(mddev, r1_bio, bio);
 			first_clone = 0;
 		}

···

 	/* make sure these bits don't get cleared. */
 	do {
-		mddev->bitmap_ops->end_sync(mddev, s, &sync_blocks);
+		md_bitmap_end_sync(mddev, s, &sync_blocks);
 		s += sync_blocks;
 		sectors_to_go -= sync_blocks;
 	} while (sectors_to_go > 0);
···
 		 * We can find the current addess in mddev->curr_resync
 		 */
 		if (mddev->curr_resync < max_sector) /* aborted */
-			mddev->bitmap_ops->end_sync(mddev, mddev->curr_resync,
-						    &sync_blocks);
+			md_bitmap_end_sync(mddev, mddev->curr_resync,
+					   &sync_blocks);
 		else /* completed sync */
 			conf->fullsync = 0;

-		mddev->bitmap_ops->close_sync(mddev);
+		if (md_bitmap_enabled(mddev, false))
+			mddev->bitmap_ops->close_sync(mddev);
 		close_sync(conf);

 		if (mddev_is_clustered(mddev)) {
···
 	/* before building a request, check if we can skip these blocks..
 	 * This call the bitmap_start_sync doesn't actually record anything
 	 */
-	if (!mddev->bitmap_ops->start_sync(mddev, sector_nr, &sync_blocks, true) &&
+	if (!md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, true) &&
 	    !conf->fullsync && !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
 		/* We can skip this block, and probably several more */
 		*skipped = 1;
···
 	/* we are incrementing sector_nr below. To be safe, we check against
 	 * sector_nr + two times RESYNC_SECTORS
 	 */
-
-	mddev->bitmap_ops->cond_end_sync(mddev, sector_nr,
-		mddev_is_clustered(mddev) &&
-		(sector_nr + 2 * RESYNC_SECTORS > conf->cluster_sync_high));
+	if (md_bitmap_enabled(mddev, false))
+		mddev->bitmap_ops->cond_end_sync(mddev, sector_nr,
+			mddev_is_clustered(mddev) &&
+			(sector_nr + 2 * RESYNC_SECTORS >
+			 conf->cluster_sync_high));

 	if (raise_barrier(conf, sector_nr))
 		return 0;
···
 		if (len == 0)
 			break;
 		if (sync_blocks == 0) {
-			if (!mddev->bitmap_ops->start_sync(mddev, sector_nr,
-						&sync_blocks, still_degraded) &&
+			if (!md_bitmap_start_sync(mddev, sector_nr,
+						  &sync_blocks, still_degraded) &&
 			    !conf->fullsync &&
 			    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
 				break;
···
 	 * worth it.
 	 */
 	sector_t newsize = raid1_size(mddev, sectors, 0);
-	int ret;

 	if (mddev->external_size &&
 	    mddev->array_sectors > newsize)
 		return -EINVAL;

-	ret = mddev->bitmap_ops->resize(mddev, newsize, 0, false);
-	if (ret)
-		return ret;
+	if (md_bitmap_enabled(mddev, false)) {
+		int ret = mddev->bitmap_ops->resize(mddev, newsize, 0);
+
+		if (ret)
+			return ret;
+	}

 	md_set_array_sectors(mddev, newsize);
 	if (sectors > mddev->dev_sectors &&
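The new `raid1_start_write_behind()` in the raid1.c diff above turns one compound condition into a chain of early returns. As a reading aid, here is a user-space sketch of only the decision made before the allocation attempt. The `demo_*` names are hypothetical, the struct mirrors just the two stats fields the patch reads, and the later check of `R1BIO_BehindIO` (set only if the allocation succeeds) is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the two md_bitmap_stats fields used by the patch. */
struct demo_stats {
	bool behind_wait;            /* a reader is waiting for behind writes */
	unsigned long behind_writes; /* behind writes currently outstanding */
};

/*
 * Returns true when a behind-write allocation would be attempted:
 * the bitmap is enabled, stats were read without error, no reader is
 * waiting, and the outstanding count is below the configured limit.
 */
bool demo_may_write_behind(bool bitmap_enabled, int stats_err,
			   const struct demo_stats *stats,
			   unsigned long max_write_behind)
{
	if (!bitmap_enabled)
		return false; /* behind writes rely on the bitmap */
	if (stats_err)
		return false; /* could not read stats, stay conservative */
	return !stats->behind_wait &&
	       stats->behind_writes < max_write_behind;
}

/* Usage: the limit is exclusive, so reaching it disables behind IO. */
void demo_write_behind_usage(void)
{
	struct demo_stats idle = { .behind_wait = false, .behind_writes = 3 };
	struct demo_stats full = { .behind_wait = false, .behind_writes = 8 };

	assert(demo_may_write_behind(true, 0, &idle, 8));
	assert(!demo_may_write_behind(true, 0, &full, 8));
}
```

The caller in the patch still checks `write_behind` itself before invoking the helper, so that flag is deliberately absent from the sketch.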
+25 -24
drivers/md/raid10.c
···

 		if (mddev->curr_resync < max_sector) { /* aborted */
 			if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
-				mddev->bitmap_ops->end_sync(mddev,
-							    mddev->curr_resync,
-							    &sync_blocks);
+				md_bitmap_end_sync(mddev, mddev->curr_resync,
+						   &sync_blocks);
 			else for (i = 0; i < conf->geo.raid_disks; i++) {
 				sector_t sect =
 					raid10_find_virt(conf, mddev->curr_resync, i);

-				mddev->bitmap_ops->end_sync(mddev, sect,
-							    &sync_blocks);
+				md_bitmap_end_sync(mddev, sect, &sync_blocks);
 			}
 		} else {
 			/* completed sync */
···
 			}
 			conf->fullsync = 0;
 		}
-		mddev->bitmap_ops->close_sync(mddev);
+		if (md_bitmap_enabled(mddev, false))
+			mddev->bitmap_ops->close_sync(mddev);
 		close_sync(conf);
 		*skipped = 1;
 		return sectors_skipped;
···
 				 * we only need to recover the block if it is set in
 				 * the bitmap
 				 */
-				must_sync = mddev->bitmap_ops->start_sync(mddev, sect,
-									  &sync_blocks,
-									  true);
+				must_sync = md_bitmap_start_sync(mddev, sect,
+								 &sync_blocks, true);
 				if (sync_blocks < max_sync)
 					max_sync = sync_blocks;
 				if (!must_sync &&
···
 					}
 				}

-				must_sync = mddev->bitmap_ops->start_sync(mddev, sect,
-						&sync_blocks, still_degraded);
-
+				md_bitmap_start_sync(mddev, sect, &sync_blocks,
+						     still_degraded);
 				any_working = 0;
 				for (j=0; j<conf->copies;j++) {
 					int k;
···
 		 * safety reason, which ensures curr_resync_completed is
 		 * updated in bitmap_cond_end_sync.
 		 */
-		mddev->bitmap_ops->cond_end_sync(mddev, sector_nr,
+		if (md_bitmap_enabled(mddev, false))
+			mddev->bitmap_ops->cond_end_sync(mddev, sector_nr,
 					mddev_is_clustered(mddev) &&
 					(sector_nr + 2 * RESYNC_SECTORS > conf->cluster_sync_high));

-		if (!mddev->bitmap_ops->start_sync(mddev, sector_nr,
-						   &sync_blocks,
-						   mddev->degraded) &&
+		if (!md_bitmap_start_sync(mddev, sector_nr, &sync_blocks,
+					  mddev->degraded) &&
 		    !conf->fullsync && !test_bit(MD_RECOVERY_REQUESTED,
 						 &mddev->recovery)) {
 			/* We can skip this block */
···
 	 */
 	struct r10conf *conf = mddev->private;
 	sector_t oldsize, size;
-	int ret;

 	if (mddev->reshape_position != MaxSector)
 		return -EBUSY;
···
 	    mddev->array_sectors > size)
 		return -EINVAL;

-	ret = mddev->bitmap_ops->resize(mddev, size, 0, false);
-	if (ret)
-		return ret;
+	if (md_bitmap_enabled(mddev, false)) {
+		int ret = mddev->bitmap_ops->resize(mddev, size, 0);
+
+		if (ret)
+			return ret;
+	}

 	md_set_array_sectors(mddev, size);
 	if (sectors > mddev->dev_sectors &&
···
 		oldsize = raid10_size(mddev, 0, 0);
 		newsize = raid10_size(mddev, 0, conf->geo.raid_disks);

-		if (!mddev_is_clustered(mddev)) {
-			ret = mddev->bitmap_ops->resize(mddev, newsize, 0, false);
+		if (!mddev_is_clustered(mddev) &&
+		    md_bitmap_enabled(mddev, false)) {
+			ret = mddev->bitmap_ops->resize(mddev, newsize, 0);
 			if (ret)
 				goto abort;
 			else
···
 		    MD_FEATURE_RESHAPE_ACTIVE)) || (oldsize == newsize))
 			goto out;

-		ret = mddev->bitmap_ops->resize(mddev, newsize, 0, false);
+		/* cluster can't be setup without bitmap */
+		ret = mddev->bitmap_ops->resize(mddev, newsize, 0);
 		if (ret)
 			goto abort;

 		ret = mddev->cluster_ops->resize_bitmaps(mddev, newsize, oldsize);
 		if (ret) {
-			mddev->bitmap_ops->resize(mddev, oldsize, 0, false);
+			mddev->bitmap_ops->resize(mddev, oldsize, 0);
 			goto abort;
 		}
 	}
+42 -22
drivers/md/raid5.c
···
 				   int disks)
 {
 	int rmw = 0, rcw = 0, i;
-	sector_t resync_offset = conf->mddev->resync_offset;
+	struct mddev *mddev = conf->mddev;
+	sector_t resync_offset = mddev->resync_offset;

 	/* Check whether resync is now happening or should start.
 	 * If yes, then the array is dirty (after unclean shutdown or
···
 		pr_debug("force RCW rmw_level=%u, resync_offset=%llu sh->sector=%llu\n",
 			 conf->rmw_level, (unsigned long long)resync_offset,
 			 (unsigned long long)sh->sector);
+	} else if (mddev->bitmap_ops && mddev->bitmap_ops->blocks_synced &&
+		   !mddev->bitmap_ops->blocks_synced(mddev, sh->sector)) {
+		/* The initial recover is not done, must read everything */
+		rcw = 1; rmw = 2;
+		pr_debug("force RCW by lazy recovery, sh->sector=%llu\n",
+			 sh->sector);
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
···
 	set_bit(STRIPE_HANDLE, &sh->state);
 	if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_PREFER_RMW)) && rmw > 0) {
 		/* prefer read-modify-write, but need to get some data */
-		mddev_add_trace_msg(conf->mddev, "raid5 rmw %llu %d",
+		mddev_add_trace_msg(mddev, "raid5 rmw %llu %d",
 				sh->sector, rmw);

 		for (i = disks; i--; ) {
···
 					set_bit(STRIPE_DELAYED, &sh->state);
 			}
 		}
-		if (rcw && !mddev_is_dm(conf->mddev))
-			blk_add_trace_msg(conf->mddev->gendisk->queue,
+		if (rcw && !mddev_is_dm(mddev))
+			blk_add_trace_msg(mddev->gendisk->queue,
 				"raid5 rcw %llu %d %d %d",
 				(unsigned long long)sh->sector, rcw, qread,
 				test_bit(STRIPE_DELAYED, &sh->state));
···
 			}
 		} else if (test_bit(In_sync, &rdev->flags))
 			set_bit(R5_Insync, &dev->flags);
-		else if (sh->sector + RAID5_STRIPE_SECTORS(conf) <= rdev->recovery_offset)
-			/* in sync if before recovery_offset */
-			set_bit(R5_Insync, &dev->flags);
-		else if (test_bit(R5_UPTODATE, &dev->flags) &&
+		else if (sh->sector + RAID5_STRIPE_SECTORS(conf) <=
+			 rdev->recovery_offset) {
+			/*
+			 * in sync if:
+			 * - normal IO, or
+			 * - resync IO that is not lazy recovery
+			 *
+			 * For lazy recovery, we have to mark the rdev without
+			 * In_sync as failed, to build initial xor data.
+			 */
+			if (!test_bit(STRIPE_SYNCING, &sh->state) ||
+			    !test_bit(MD_RECOVERY_LAZY_RECOVER,
+				      &conf->mddev->recovery))
+				set_bit(R5_Insync, &dev->flags);
+		} else if (test_bit(R5_UPTODATE, &dev->flags) &&
 			 test_bit(R5_Expanded, &dev->flags))
 			/* If we've reshaped into here, we assume it is Insync.
 			 * We will shortly update recovery_offset to make
···
 		}

 		if (mddev->curr_resync < max_sector) /* aborted */
-			mddev->bitmap_ops->end_sync(mddev, mddev->curr_resync,
-						    &sync_blocks);
+			md_bitmap_end_sync(mddev, mddev->curr_resync,
+					   &sync_blocks);
 		else /* completed sync */
 			conf->fullsync = 0;
-		mddev->bitmap_ops->close_sync(mddev);
+		if (md_bitmap_enabled(mddev, false))
+			mddev->bitmap_ops->close_sync(mddev);

 		return 0;
 	}
···
 	}
 	if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
 	    !conf->fullsync &&
-	    !mddev->bitmap_ops->start_sync(mddev, sector_nr, &sync_blocks,
-					   true) &&
+	    !md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, true) &&
 	    sync_blocks >= RAID5_STRIPE_SECTORS(conf)) {
 		/* we can skip this block, and probably more */
 		do_div(sync_blocks, RAID5_STRIPE_SECTORS(conf));
···
 		return sync_blocks * RAID5_STRIPE_SECTORS(conf);
 	}

-	mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, false);
+	if (md_bitmap_enabled(mddev, false))
+		mddev->bitmap_ops->cond_end_sync(mddev, sector_nr, false);

 	sh = raid5_get_active_stripe(conf, NULL, sector_nr,
 				     R5_GAS_NOBLOCK);
···
 			still_degraded = true;
 	}

-	mddev->bitmap_ops->start_sync(mddev, sector_nr, &sync_blocks,
-				      still_degraded);
-
+	md_bitmap_start_sync(mddev, sector_nr, &sync_blocks, still_degraded);
 	set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
 	set_bit(STRIPE_HANDLE, &sh->state);

···
 			/* Now is a good time to flush some bitmap updates */
 			conf->seq_flush++;
 			spin_unlock_irq(&conf->device_lock);
-			mddev->bitmap_ops->unplug(mddev, true);
+			if (md_bitmap_enabled(mddev, true))
+				mddev->bitmap_ops->unplug(mddev, true);
 			spin_lock_irq(&conf->device_lock);
 			conf->seq_write = conf->seq_flush;
 			activate_bit_delay(conf, conf->temp_inactive_list);
···
 	 */
 	sector_t newsize;
 	struct r5conf *conf = mddev->private;
-	int ret;

 	if (raid5_has_log(conf) || raid5_has_ppl(conf))
 		return -EINVAL;
···
 	    mddev->array_sectors > newsize)
 		return -EINVAL;

-	ret = mddev->bitmap_ops->resize(mddev, sectors, 0, false);
-	if (ret)
-		return ret;
+	if (md_bitmap_enabled(mddev, false)) {
+		int ret = mddev->bitmap_ops->resize(mddev, sectors, 0);
+
+		if (ret)
+			return ret;
+	}

 	md_set_array_sectors(mddev, newsize);
 	if (sectors > mddev->dev_sectors &&
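The raid5.c diff above changes when a device that is not `In_sync` but lies before `recovery_offset` gets marked `R5_Insync`. The truth table is small enough to model in user space. The sketch below covers only that one branch; the `demo_*` names and the plain integer sector type are hypothetical, and the other branches of the real function (`In_sync`, `R5_Expanded`) are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Models the patched branch: a stripe that ends at or before
 * recovery_offset counts as in sync for normal IO, and for resync IO
 * that is not lazy recovery. During lazy recovery the device is left
 * unmarked so that the initial xor data gets built.
 */
bool demo_before_offset_insync(uint64_t stripe_start, uint64_t stripe_sectors,
			       uint64_t recovery_offset,
			       bool stripe_syncing, bool lazy_recover)
{
	if (stripe_start + stripe_sectors > recovery_offset)
		return false; /* branch not taken; other cases not modeled */
	return !stripe_syncing || !lazy_recover;
}

/* Usage: only "syncing and lazy recovery" withholds the in-sync mark. */
void demo_insync_usage(void)
{
	assert(demo_before_offset_insync(0, 8, 64, false, true));
	assert(demo_before_offset_insync(0, 8, 64, true, false));
	assert(!demo_before_offset_insync(0, 8, 64, true, true));
}
```

Before the patch this branch marked the device in sync unconditionally, so the only behavioral difference the sketch captures is the last assertion.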