Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Pull block-related fixes from Jens Axboe:

- Improvements to the buffered and direct write IO plugging from
Fengguang.

- Abstract out the mapping of a bio in a request, and use that to
provide a blk_bio_map_sg() helper. Useful for mapping just a bio
instead of a full request.

- Regression fix from Hugh, fixing up a patch that went into the
previous release cycle (and marked stable, too) attempting to prevent
a loop in __getblk_slow().

- Updates to discard requests, fixing up the sizing and how we align
them. Also a change to disallow merging of discard requests, since
that doesn't really work properly yet.

- A few drbd fixes.

- Documentation updates.

* 'for-linus' of git://git.kernel.dk/linux-block:
block: replace __getblk_slow misfix by grow_dev_page fix
drbd: Write all pages of the bitmap after an online resize
drbd: Finish requests that completed while IO was frozen
drbd: fix drbd wire compatibility for empty flushes
Documentation: update tunable options in block/cfq-iosched.txt
Documentation: update tunable options in block/cfq-iosched.txt
Documentation: update missing index files in block/00-INDEX
block: move down direct IO plugging
block: remove plugging at buffered write time
block: disable discard request merge temporarily
bio: Fix potential memory leak in bio_find_or_create_slab()
block: Don't use static to define "void *p" in show_partition_start()
block: Add blk_bio_map_sg() helper
block: Introduce __blk_segment_map_sg() helper
fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
block: split discard into aligned requests
block: reorganize rounding of max_discard_sectors

Linus Torvalds 14 years ago a7e546f1 da31ce72

+378 -123

17 changed files

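Documentation/block/00-INDEX
Documentation/block/cfq-iosched.txt
Documentation/block/queue-sysfs.txt
block/blk-lib.c
block/blk-merge.c
block/genhd.c
drivers/block/drbd/drbd_bitmap.c
drivers/block/drbd/drbd_int.h
drivers/block/drbd/drbd_main.c
drivers/block/drbd/drbd_nl.c
drivers/block/drbd/drbd_req.c
fs/bio.c
fs/block_dev.c
fs/buffer.c
fs/direct-io.c
include/linux/blkdev.h
mm/filemap.c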

+8 -2

Documentation/block/00-INDEX

···
 biodoc.txt
 	- Notes on the Generic Block Layer Rewrite in Linux 2.5
 capability.txt
-	- Generic Block Device Capability (/sys/block/<disk>/capability)
+	- Generic Block Device Capability (/sys/block/<device>/capability)
+cfq-iosched.txt
+	- CFQ IO scheduler tunables
+data-integrity.txt
+	- Block data integrity
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
+queue-sysfs.txt
+	- Queue's sysfs entries
 request.txt
 	- The members of struct request (in include/linux/blkdev.h)
 stat.txt
-	- Block layer statistics in /sys/block/<dev>/stat
+	- Block layer statistics in /sys/block/<device>/stat
 switching-sched.txt
 	- Switching I/O schedulers at runtime
 writeback_cache_control.txt

+77

Documentation/block/cfq-iosched.txt

···
+CFQ (Complete Fairness Queueing)
+===============================
+
+The main aim of CFQ scheduler is to provide a fair allocation of the disk
+I/O bandwidth for all the processes which requests an I/O operation.
+
+CFQ maintains the per process queue for the processes which request I/O
+operation(syncronous requests). In case of asynchronous requests, all the
+requests from all the processes are batched together according to their
+process's I/O priority.
+
 CFQ ioscheduler tunables
 ========================
 
···
 there are multiple spindles behind single LUN (Host based hardware RAID
 controller or for storage arrays), setting slice_idle=0 might end up in better
 throughput and acceptable latencies.
+
+back_seek_max
+-------------
+This specifies, given in Kbytes, the maximum "distance" for backward seeking.
+The distance is the amount of space from the current head location to the
+sectors that are backward in terms of distance.
+
+This parameter allows the scheduler to anticipate requests in the "backward"
+direction and consider them as being the "next" if they are within this
+distance from the current head location.
+
+back_seek_penalty
+-----------------
+This parameter is used to compute the cost of backward seeking. If the
+backward distance of request is just 1/back_seek_penalty from a "front"
+request, then the seeking cost of two requests is considered equivalent.
+
+So scheduler will not bias toward one or the other request (otherwise scheduler
+will bias toward front request). Default value of back_seek_penalty is 2.
+
+fifo_expire_async
+-----------------
+This parameter is used to set the timeout of asynchronous requests. Default
+value of this is 248ms.
+
+fifo_expire_sync
+----------------
+This parameter is used to set the timeout of synchronous requests. Default
+value of this is 124ms. In case to favor synchronous requests over asynchronous
+one, this value should be decreased relative to fifo_expire_async.
+
+slice_async
+-----------
+This parameter is same as of slice_sync but for asynchronous queue. The
+default value is 40ms.
+
+slice_async_rq
+--------------
+This parameter is used to limit the dispatching of asynchronous request to
+device request queue in queue's slice time. The maximum number of request that
+are allowed to be dispatched also depends upon the io priority. Default value
+for this is 2.
+
+slice_sync
+----------
+When a queue is selected for execution, the queues IO requests are only
+executed for a certain amount of time(time_slice) before switching to another
+queue. This parameter is used to calculate the time slice of synchronous
+queue.
+
+time_slice is computed using the below equation:-
+time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
+time_slice of synchronous queue, increase the value of slice_sync. Default
+value is 100ms.
+
+quantum
+-------
+This specifies the number of request dispatched to the device queue. In a
+queue's time slice, a request will not be dispatched if the number of request
+in the device exceeds this parameter. This parameter is used for synchronous
+request.
+
+In case of storage with several disk, this setting can limit the parallel
+processing of request. Therefore, increasing the value can imporve the
+performace although this can cause the latency of some I/O to increase due
+to more number of requests.
 
 CFQ IOPS Mode for group scheduling
 ===================================
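The time_slice equation quoted in the slice_sync text above can be checked with a few lines of C. This is a minimal sketch of the documented arithmetic only, not the scheduler's own code: the function name `cfq_time_slice_ms` is hypothetical, and it assumes `prio` is the I/O priority level in the range 0 to 7 and that the division is integer division.

```c
/*
 * Sketch of the formula from the cfq-iosched.txt text:
 *   time_slice = slice_sync + (slice_sync/5 * (4 - prio))
 * Hypothetical helper, not taken from block/cfq-iosched.c.
 * Assumes prio is an I/O priority level (0 = highest, 7 = lowest).
 */
static int cfq_time_slice_ms(int slice_sync_ms, int prio)
{
	/* Integer division first, then scale by the distance from prio 4. */
	return slice_sync_ms + (slice_sync_ms / 5 * (4 - prio));
}
```

Under these assumptions, with the documented default slice_sync of 100ms, priority 4 gets the base slice, and each step toward priority 0 adds 20ms while each step toward priority 7 removes 20ms.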

+64

Documentation/block/queue-sysfs.txt

···
 Files denoted with a RO postfix are readonly and the RW postfix means
 read-write.
 
+add_random (RW)
+----------------
+This file allows to trun off the disk entropy contribution. Default
+value of this file is '1'(on).
+
+discard_granularity (RO)
+-----------------------
+This shows the size of internal allocation of the device in bytes, if
+reported by the device. A value of '0' means device does not support
+the discard functionality.
+
+discard_max_bytes (RO)
+----------------------
+Devices that support discard functionality may have internal limits on
+the number of bytes that can be trimmed or unmapped in a single operation.
+The discard_max_bytes parameter is set by the device driver to the maximum
+number of bytes that can be discarded in a single operation. Discard
+requests issued to the device must not exceed this limit. A discard_max_bytes
+value of 0 means that the device does not support discard functionality.
+
+discard_zeroes_data (RO)
+------------------------
+When read, this file will show if the discarded block are zeroed by the
+device or not. If its value is '1' the blocks are zeroed otherwise not.
+
 hw_sector_size (RO)
 -------------------
 This is the hardware sector size of the device, in bytes.
 
+iostats (RW)
+-------------
+This file is used to control (on/off) the iostats accounting of the
+disk.
+
+logical_block_size (RO)
+-----------------------
+This is the logcal block size of the device, in bytes.
+
 max_hw_sectors_kb (RO)
 ----------------------
 This is the maximum number of kilobytes supported in a single data transfer.
+
+max_integrity_segments (RO)
+---------------------------
+When read, this file shows the max limit of integrity segments as
+set by block layer which a hardware controller can handle.
 
 max_sectors_kb (RW)
 -------------------
 This is the maximum number of kilobytes that the block layer will allow
 for a filesystem request. Must be smaller than or equal to the maximum
 size allowed by the hardware.
+
+max_segments (RO)
+-----------------
+Maximum number of segments of the device.
+
+max_segment_size (RO)
+---------------------
+Maximum segment size of the device.
+
+minimum_io_size (RO)
+--------------------
+This is the smallest preferred io size reported by the device.
 
 nomerges (RW)
 -------------
···
 each request queue may have upto N request pools, each independently
 regulated by nr_requests.
 
+optimal_io_size (RO)
+--------------------
+This is the optimal io size reported by the device.
+
+physical_block_size (RO)
+------------------------
+This is the physical block size of device, in bytes.
+
 read_ahead_kb (RW)
 ------------------
 Maximum number of kilobytes to read-ahead for filesystems on this block
 device.
+
+rotational (RW)
+---------------
+This file is used to stat if the device is of rotational type or
+non-rotational type.
 
 rq_affinity (RW)
 ----------------

+28 -13

block/blk-lib.c

···
 	struct request_queue *q = bdev_get_queue(bdev);
 	int type = REQ_WRITE | REQ_DISCARD;
 	unsigned int max_discard_sectors;
+	unsigned int granularity, alignment, mask;
 	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
···
 	if (!blk_queue_discard(q))
 		return -EOPNOTSUPP;
 
+	/* Zero-sector (unknown) and one-sector granularities are the same. */
+	granularity = max(q->limits.discard_granularity >> 9, 1U);
+	mask = granularity - 1;
+	alignment = (bdev_discard_alignment(bdev) >> 9) & mask;
+
 	/*
 	 * Ensure that max_discard_sectors is of the proper
-	 * granularity
+	 * granularity, so that requests stay aligned after a split.
 	 */
 	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	max_discard_sectors = round_down(max_discard_sectors, granularity);
 	if (unlikely(!max_discard_sectors)) {
 		/* Avoid infinite loop below. Being cautious never hurts. */
 		return -EOPNOTSUPP;
-	} else if (q->limits.discard_granularity) {
-		unsigned int disc_sects = q->limits.discard_granularity >> 9;
-
-		max_discard_sectors &= ~(disc_sects - 1);
 	}
 
 	if (flags & BLKDEV_DISCARD_SECURE) {
···
 	bb.wait = &wait;
 
 	while (nr_sects) {
+		unsigned int req_sects;
+		sector_t end_sect;
+
 		bio = bio_alloc(gfp_mask, 1);
 		if (!bio) {
 			ret = -ENOMEM;
 			break;
+		}
+
+		req_sects = min_t(sector_t, nr_sects, max_discard_sectors);
+
+		/*
+		 * If splitting a request, and the next starting sector would be
+		 * misaligned, stop the discard at the previous aligned sector.
+		 */
+		end_sect = sector + req_sects;
+		if (req_sects < nr_sects && (end_sect & mask) != alignment) {
+			end_sect =
+				round_down(end_sect - alignment, granularity)
+				+ alignment;
+			req_sects = end_sect - sector;
 		}
 
 		bio->bi_sector = sector;
···
 		bio->bi_bdev = bdev;
 		bio->bi_private = &bb;
 
-		if (nr_sects > max_discard_sectors) {
-			bio->bi_size = max_discard_sectors << 9;
-			nr_sects -= max_discard_sectors;
-			sector += max_discard_sectors;
-		} else {
-			bio->bi_size = nr_sects << 9;
-			nr_sects = 0;
-		}
+		bio->bi_size = req_sects << 9;
+		nr_sects -= req_sects;
+		sector = end_sect;
 
 		atomic_inc(&bb.done);
 		submit_bio(type, bio);
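The split arithmetic in the blk-lib.c hunk above can be modelled in user space to see how a misaligned range gets trimmed back to an aligned boundary. The following is a sketch under stated assumptions, not the kernel function: `next_discard_chunk` and `round_down_pow2` are hypothetical names, all quantities are in sectors, `granularity` is assumed to be a power of two (as the kernel's `round_down()` requires), and `max_discard_sectors` is assumed to be already rounded down to a multiple of `granularity`.

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* Round x down to a multiple of g; g must be a power of two. */
static sector_t round_down_pow2(sector_t x, sector_t g)
{
	return x & ~(g - 1);
}

/*
 * Hypothetical user-space model of one loop iteration: return how many
 * sectors the next discard request should cover. If the request has to
 * be split and the split point would be misaligned, end it at the
 * previous aligned sector instead.
 */
static sector_t next_discard_chunk(sector_t sector, sector_t nr_sects,
				   sector_t max_discard_sectors,
				   sector_t granularity, sector_t alignment)
{
	sector_t mask = granularity - 1;
	sector_t req_sects = nr_sects < max_discard_sectors ?
			     nr_sects : max_discard_sectors;
	sector_t end_sect = sector + req_sects;

	if (req_sects < nr_sects && (end_sect & mask) != alignment) {
		end_sect = round_down_pow2(end_sect - alignment, granularity)
			   + alignment;
		req_sects = end_sect - sector;
	}
	return req_sects;
}
```

For example, a 100-sector discard starting at sector 5 with a granularity of 8 and a limit of 16 sectors first issues 11 sectors (ending at sector 16), after which every further split lands on a multiple of 8. The final, unsplit chunk is issued as-is.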

+82 -35

block/blk-merge.c

···
 	return 0;
 }
 
+static void
+__blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
+		     struct scatterlist *sglist, struct bio_vec **bvprv,
+		     struct scatterlist **sg, int *nsegs, int *cluster)
+{
+
+	int nbytes = bvec->bv_len;
+
+	if (*bvprv && *cluster) {
+		if ((*sg)->length + nbytes > queue_max_segment_size(q))
+			goto new_segment;
+
+		if (!BIOVEC_PHYS_MERGEABLE(*bvprv, bvec))
+			goto new_segment;
+		if (!BIOVEC_SEG_BOUNDARY(q, *bvprv, bvec))
+			goto new_segment;
+
+		(*sg)->length += nbytes;
+	} else {
+new_segment:
+		if (!*sg)
+			*sg = sglist;
+		else {
+			/*
+			 * If the driver previously mapped a shorter
+			 * list, we could see a termination bit
+			 * prematurely unless it fully inits the sg
+			 * table on each mapping. We KNOW that there
+			 * must be more entries here or the driver
+			 * would be buggy, so force clear the
+			 * termination bit to avoid doing a full
+			 * sg_init_table() in drivers for each command.
+			 */
+			(*sg)->page_link &= ~0x02;
+			*sg = sg_next(*sg);
+		}
+
+		sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+		(*nsegs)++;
+	}
+	*bvprv = bvec;
+}
+
 /*
  * map a request to scatterlist, return number of sg entries setup. Caller
  * must make sure sg can hold rq->nr_phys_segments entries
···
 	bvprv = NULL;
 	sg = NULL;
 	rq_for_each_segment(bvec, rq, iter) {
-		int nbytes = bvec->bv_len;
-
-		if (bvprv && cluster) {
-			if (sg->length + nbytes > queue_max_segment_size(q))
-				goto new_segment;
-
-			if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
-				goto new_segment;
-			if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
-				goto new_segment;
-
-			sg->length += nbytes;
-		} else {
-new_segment:
-			if (!sg)
-				sg = sglist;
-			else {
-				/*
-				 * If the driver previously mapped a shorter
-				 * list, we could see a termination bit
-				 * prematurely unless it fully inits the sg
-				 * table on each mapping. We KNOW that there
-				 * must be more entries here or the driver
-				 * would be buggy, so force clear the
-				 * termination bit to avoid doing a full
-				 * sg_init_table() in drivers for each command.
-				 */
-				sg->page_link &= ~0x02;
-				sg = sg_next(sg);
-			}
-
-			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
-			nsegs++;
-		}
-		bvprv = bvec;
+		__blk_segment_map_sg(q, bvec, sglist, &bvprv, &sg,
+				     &nsegs, &cluster);
 	} /* segments in rq */
 
 
···
 	return nsegs;
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
+
+/**
+ * blk_bio_map_sg - map a bio to a scatterlist
+ * @q: request_queue in question
+ * @bio: bio being mapped
+ * @sglist: scatterlist being mapped
+ *
+ * Note:
+ *    Caller must make sure sg can hold bio->bi_phys_segments entries
+ *
+ * Will return the number of sg entries setup
+ */
+int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
+		   struct scatterlist *sglist)
+{
+	struct bio_vec *bvec, *bvprv;
+	struct scatterlist *sg;
+	int nsegs, cluster;
+	unsigned long i;
+
+	nsegs = 0;
+	cluster = blk_queue_cluster(q);
+
+	bvprv = NULL;
+	sg = NULL;
+	bio_for_each_segment(bvec, bio, i) {
+		__blk_segment_map_sg(q, bvec, sglist, &bvprv, &sg,
+				     &nsegs, &cluster);
+	} /* segments in bio */
+
+	if (sg)
+		sg_mark_end(sg);
+
+	BUG_ON(bio->bi_phys_segments && nsegs > bio->bi_phys_segments);
+	return nsegs;
+}
+EXPORT_SYMBOL(blk_bio_map_sg);
 
 static inline int ll_new_hw_segment(struct request_queue *q,
 				    struct request *req,
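The helper factored out above decides, for each bio_vec, whether to extend the current scatterlist entry or start a new one. A simplified user-space model of that decision is sketched below. It is an illustration only: `struct seg` and `map_segments` are hypothetical, physical contiguity stands in for `BIOVEC_PHYS_MERGEABLE`, and the segment boundary check (`BIOVEC_SEG_BOUNDARY`) and the sg termination-bit handling are left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for both a bio_vec and a scatterlist entry. */
struct seg {
	unsigned long addr;	/* start address of the segment */
	unsigned int len;	/* length in bytes */
};

/*
 * Copy vecs[] into out[], merging a vector into the previous output
 * entry when clustering is enabled, the two are contiguous, and the
 * merged length stays within max_seg_size. Returns the number of
 * output entries. out[] must have room for nvecs entries.
 */
static int map_segments(const struct seg *vecs, size_t nvecs,
			struct seg *out, unsigned int max_seg_size,
			int cluster)
{
	int nsegs = 0;
	size_t i;

	for (i = 0; i < nvecs; i++) {
		if (nsegs > 0 && cluster) {
			struct seg *last = &out[nsegs - 1];

			if (last->len + vecs[i].len <= max_seg_size &&
			    last->addr + last->len == vecs[i].addr) {
				last->len += vecs[i].len;	/* extend */
				continue;
			}
		}
		out[nsegs++] = vecs[i];				/* new entry */
	}
	return nsegs;
}
```

In this model, three 4096-byte vectors at 0x1000, 0x2000 and 0x4000 map to two entries when clustering is on (the first two merge) and to three when it is off or when the size limit is 4096.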

+1 -1

block/genhd.c

···
 
 static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
 {
-	static void *p;
+	void *p;
 
 	p = disk_seqf_start(seqf, pos);
 	if (!IS_ERR_OR_NULL(p) && !*pos)

+14 -1

drivers/block/drbd/drbd_bitmap.c

···
 	unsigned int done;
 	unsigned flags;
 #define BM_AIO_COPY_PAGES 1
+#define BM_WRITE_ALL_PAGES 2
 	int error;
 	struct kref kref;
 };
···
 		if (lazy_writeout_upper_idx && i == lazy_writeout_upper_idx)
 			break;
 		if (rw & WRITE) {
-			if (bm_test_page_unchanged(b->bm_pages[i])) {
+			if (!(flags & BM_WRITE_ALL_PAGES) &&
+			    bm_test_page_unchanged(b->bm_pages[i])) {
 				dynamic_dev_dbg(DEV, "skipped bm write for idx %u\n", i);
 				continue;
 			}
···
 int drbd_bm_write(struct drbd_conf *mdev) __must_hold(local)
 {
 	return bm_rw(mdev, WRITE, 0, 0);
+}
+
+/**
+ * drbd_bm_write_all() - Write the whole bitmap to its on disk location.
+ * @mdev: DRBD device.
+ *
+ * Will write all pages.
+ */
+int drbd_bm_write_all(struct drbd_conf *mdev) __must_hold(local)
+{
+	return bm_rw(mdev, WRITE, BM_WRITE_ALL_PAGES, 0);
 }
 
 /**

drivers/block/drbd/drbd_int.h

···
 extern int drbd_bm_write_page(struct drbd_conf *mdev, unsigned int idx) __must_hold(local);
 extern int drbd_bm_read(struct drbd_conf *mdev) __must_hold(local);
 extern int drbd_bm_write(struct drbd_conf *mdev) __must_hold(local);
+extern int drbd_bm_write_all(struct drbd_conf *mdev) __must_hold(local);
 extern int drbd_bm_write_copy_pages(struct drbd_conf *mdev) __must_hold(local);
 extern unsigned long drbd_bm_ALe_set_all(struct drbd_conf *mdev,
 		unsigned long al_enr);

+12 -16

drivers/block/drbd/drbd_main.c

···
 static void md_sync_timer_fn(unsigned long data);
 static int w_bitmap_io(struct drbd_conf *mdev, struct drbd_work *w, int unused);
 static int w_go_diskless(struct drbd_conf *mdev, struct drbd_work *w, int unused);
+static void _tl_clear(struct drbd_conf *mdev);
 
 MODULE_AUTHOR("Philipp Reisner <phil@linbit.com>, "
 	      "Lars Ellenberg <lars@linbit.com>");
···
 
 	/* Actions operating on the disk state, also want to work on
 	   requests that got barrier acked. */
-	switch (what) {
-	case fail_frozen_disk_io:
-	case restart_frozen_disk_io:
-		list_for_each_safe(le, tle, &mdev->barrier_acked_requests) {
-			req = list_entry(le, struct drbd_request, tl_requests);
-			_req_mod(req, what);
-		}
 
-	case connection_lost_while_pending:
-	case resend:
-		break;
-	default:
-		dev_err(DEV, "what = %d in _tl_restart()\n", what);
+	list_for_each_safe(le, tle, &mdev->barrier_acked_requests) {
+		req = list_entry(le, struct drbd_request, tl_requests);
+		_req_mod(req, what);
 	}
 }
 
···
  */
 void tl_clear(struct drbd_conf *mdev)
 {
+	spin_lock_irq(&mdev->req_lock);
+	_tl_clear(mdev);
+	spin_unlock_irq(&mdev->req_lock);
+}
+
+static void _tl_clear(struct drbd_conf *mdev)
+{
 	struct list_head *le, *tle;
 	struct drbd_request *r;
-
-	spin_lock_irq(&mdev->req_lock);
 
 	_tl_restart(mdev, connection_lost_while_pending);
 
···
 
 	memset(mdev->app_reads_hash, 0, APP_R_HSIZE*sizeof(void *));
 
-	spin_unlock_irq(&mdev->req_lock);
 }
 
 void tl_restart(struct drbd_conf *mdev, enum drbd_req_event what)
···
 	if (ns.susp_fen) {
 		/* case1: The outdate peer handler is successful: */
 		if (os.pdsk > D_OUTDATED && ns.pdsk <= D_OUTDATED) {
-			tl_clear(mdev);
 			if (test_bit(NEW_CUR_UUID, &mdev->flags)) {
 				drbd_uuid_new_current(mdev);
 				clear_bit(NEW_CUR_UUID, &mdev->flags);
 			}
 			spin_lock_irq(&mdev->req_lock);
+			_tl_clear(mdev);
 			_drbd_set_state(_NS(mdev, susp_fen, 0), CS_VERBOSE, NULL);
 			spin_unlock_irq(&mdev->req_lock);
 		}

+2 -2

drivers/block/drbd/drbd_nl.c

···
 		     la_size_changed && md_moved ? "size changed and md moved" :
 		     la_size_changed ? "size changed" : "md moved");
 		/* next line implicitly does drbd_suspend_io()+drbd_resume_io() */
-		err = drbd_bitmap_io(mdev, &drbd_bm_write,
-				"size changed", BM_LOCKED_MASK);
+		err = drbd_bitmap_io(mdev, md_moved ? &drbd_bm_write_all : &drbd_bm_write,
+				     "size changed", BM_LOCKED_MASK);
 		if (err) {
 			rv = dev_size_error;
 			goto out;

+32 -4

drivers/block/drbd/drbd_req.c

···
 		break;
 
 	case resend:
+		/* Simply complete (local only) READs. */
+		if (!(req->rq_state & RQ_WRITE) && !req->w.cb) {
+			_req_may_be_done(req, m);
+			break;
+		}
+
 		/* If RQ_NET_OK is already set, we got a P_WRITE_ACK or P_RECV_ACK
 		   before the connection loss (B&C only); only P_BARRIER_ACK was missing.
 		   Trowing them out of the TL here by pretending we got a BARRIER_ACK
···
 		req->private_bio = NULL;
 	}
 	if (rw == WRITE) {
-		remote = 1;
+		/* Need to replicate writes. Unless it is an empty flush,
+		 * which is better mapped to a DRBD P_BARRIER packet,
+		 * also for drbd wire protocol compatibility reasons. */
+		if (unlikely(size == 0)) {
+			/* The only size==0 bios we expect are empty flushes. */
+			D_ASSERT(bio->bi_rw & REQ_FLUSH);
+			remote = 0;
+		} else
+			remote = 1;
 	} else {
 		/* READ || READA */
 		if (local) {
···
 	 * extent. This waits for any resync activity in the corresponding
 	 * resync extent to finish, and, if necessary, pulls in the target
 	 * extent into the activity log, which involves further disk io because
-	 * of transactional on-disk meta data updates. */
-	if (rw == WRITE && local && !test_bit(AL_SUSPENDED, &mdev->flags)) {
+	 * of transactional on-disk meta data updates.
+	 * Empty flushes don't need to go into the activity log, they can only
+	 * flush data for pending writes which are already in there. */
+	if (rw == WRITE && local && size
+	&& !test_bit(AL_SUSPENDED, &mdev->flags)) {
 		req->rq_state |= RQ_IN_ACT_LOG;
 		drbd_al_begin_io(mdev, sector);
 	}
···
 	if (rw == WRITE && _req_conflicts(req))
 		goto fail_conflicting;
 
-	list_add_tail(&req->tl_requests, &mdev->newest_tle->requests);
+	/* no point in adding empty flushes to the transfer log,
+	 * they are mapped to drbd barriers already. */
+	if (likely(size!=0))
+		list_add_tail(&req->tl_requests, &mdev->newest_tle->requests);
 
 	/* NOTE remote first: to get the concurrent write detection right,
 	 * we must register the request before start of local IO. */
···
 	if (remote &&
 	    mdev->net_conf->on_congestion != OC_BLOCK && mdev->agreed_pro_version >= 96)
 		maybe_pull_ahead(mdev);
+
+	/* If this was a flush, queue a drbd barrier/start a new epoch.
+	 * Unless the current epoch was empty anyways, or we are not currently
+	 * replicating, in which case there is no point. */
+	if (unlikely(bio->bi_rw & REQ_FLUSH)
+		&& mdev->newest_tle->n_writes
+		&& drbd_should_do_remote(mdev->state))
+		queue_barrier(mdev);
 
 	spin_unlock_irq(&mdev->req_lock);
 	kfree(b); /* if someone else has beaten us to it... */

+6 -5

fs/bio.c

···
 {
 	unsigned int sz = sizeof(struct bio) + extra_size;
 	struct kmem_cache *slab = NULL;
-	struct bio_slab *bslab;
+	struct bio_slab *bslab, *new_bio_slabs;
 	unsigned int i, entry = -1;
 
 	mutex_lock(&bio_slab_lock);
···
 
 	if (bio_slab_nr == bio_slab_max && entry == -1) {
 		bio_slab_max <<= 1;
-		bio_slabs = krealloc(bio_slabs,
-				     bio_slab_max * sizeof(struct bio_slab),
-				     GFP_KERNEL);
-		if (!bio_slabs)
+		new_bio_slabs = krealloc(bio_slabs,
+					 bio_slab_max * sizeof(struct bio_slab),
+					 GFP_KERNEL);
+		if (!new_bio_slabs)
 			goto out_unlock;
+		bio_slabs = new_bio_slabs;
 	}
 	if (entry == -1)
 		entry = bio_slab_nr++;
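The fix above is the usual pattern for growing an array with a reallocating call: assign the result to a temporary, and only overwrite the original pointer once the call has succeeded. A user-space sketch of the same idea with `realloc()` follows. The names `struct slab_table` and `table_grow` are hypothetical, and unlike the kernel hunk this sketch also leaves the capacity field unchanged on failure.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable table, standing in for bio_slabs/bio_slab_max. */
struct slab_table {
	int *entries;
	size_t max;
};

/*
 * Double the table's capacity. Returns 0 on success, -1 on failure.
 * On failure t->entries still points at the old, valid allocation,
 * so nothing is leaked and the caller can keep using the table.
 */
static int table_grow(struct slab_table *t)
{
	size_t new_max = t->max ? t->max * 2 : 1;
	int *new_entries;

	if (new_max > SIZE_MAX / sizeof(*new_entries))
		return -1;	/* size computation would overflow */

	new_entries = realloc(t->entries, new_max * sizeof(*new_entries));
	if (!new_entries)
		return -1;	/* old block untouched by a failed realloc */

	t->entries = new_entries;
	t->max = new_max;
	return 0;
}
```

Writing `t->entries = realloc(t->entries, ...)` directly would replace the only pointer to the old block with NULL on failure, which is the leak the commit title refers to.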

fs/block_dev.c

···
 			 unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct blk_plug plug;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
+	blk_start_plug(&plug);
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
 		ssize_t err;
···
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
+	blk_finish_plug(&plug);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);

+30 -36

fs/buffer.c

···
 /*
  * Initialise the state of a blockdev page's buffers.
  */
-static void
+static sector_t
 init_page_buffers(struct page *page, struct block_device *bdev,
 			sector_t block, int size)
 {
···
 		block++;
 		bh = bh->b_this_page;
 	} while (bh != head);
+
+	/*
+	 * Caller needs to validate requested block against end of device.
+	 */
+	return end_block;
 }
 
 /*
  * Create the page-cache page that contains the requested block.
  *
- * This is user purely for blockdev mappings.
+ * This is used purely for blockdev mappings.
  */
-static struct page *
+static int
 grow_dev_page(struct block_device *bdev, sector_t block,
-		pgoff_t index, int size)
+		pgoff_t index, int size, int sizebits)
 {
 	struct inode *inode = bdev->bd_inode;
 	struct page *page;
 	struct buffer_head *bh;
+	sector_t end_block;
+	int ret = 0;		/* Will call free_more_memory() */
 
 	page = find_or_create_page(inode->i_mapping, index,
 		(mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS)|__GFP_MOVABLE);
 	if (!page)
-		return NULL;
+		return ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (page_has_buffers(page)) {
 		bh = page_buffers(page);
 		if (bh->b_size == size) {
-			init_page_buffers(page, bdev, block, size);
-			return page;
+			end_block = init_page_buffers(page, bdev,
+						index << sizebits, size);
+			goto done;
 		}
 		if (!try_to_free_buffers(page))
 			goto failed;
···
 	 */
 	spin_lock(&inode->i_mapping->private_lock);
 	link_dev_buffers(page, bh);
-	init_page_buffers(page, bdev, block, size);
+	end_block = init_page_buffers(page, bdev, index << sizebits, size);
 	spin_unlock(&inode->i_mapping->private_lock);
-	return page;
-
+done:
+	ret = (block < end_block) ? 1 : -ENXIO;
 failed:
 	unlock_page(page);
 	page_cache_release(page);
-	return NULL;
+	return ret;
 }
 
 /*
···
 static int
 grow_buffers(struct block_device *bdev, sector_t block, int size)
 {
-	struct page *page;
 	pgoff_t index;
 	int sizebits;
 
···
 			bdevname(bdev, b));
 		return -EIO;
 	}
-	block = index << sizebits;
+
 	/* Create a page with the proper size buffers.. */
-	page = grow_dev_page(bdev, block, index, size);
-	if (!page)
-		return 0;
-	unlock_page(page);
-	page_cache_release(page);
-	return 1;
+	return grow_dev_page(bdev, block, index, size, sizebits);
 }
 
 static struct buffer_head *
 __getblk_slow(struct block_device *bdev, sector_t block, int size)
 {
-	int ret;
-	struct buffer_head *bh;
-
 	/* Size must be multiple of hard sectorsize */
 	if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
 			(size < 512 || size > PAGE_SIZE))) {
···
 		return NULL;
 	}
 
-retry:
-	bh = __find_get_block(bdev, block, size);
-	if (bh)
-		return bh;
+	for (;;) {
+		struct buffer_head *bh;
+		int ret;
 
-	ret = grow_buffers(bdev, block, size);
-	if (ret == 0) {
-		free_more_memory();
-		goto retry;
-	} else if (ret > 0) {
 		bh = __find_get_block(bdev, block, size);
 		if (bh)
 			return bh;
+
+		ret = grow_buffers(bdev, block, size);
+		if (ret < 0)
+			return NULL;
+		if (ret == 0)
+			free_more_memory();
 	}
-	return NULL;
 }
 
 /*
···
  * __getblk will locate (and, if necessary, create) the buffer_head
  * which corresponds to the passed block_device, block and size. The
  * returned buffer has its reference count incremented.
- *
- * __getblk() cannot fail - it just keeps trying. If you pass it an
- * illegal block number, __getblk() will happily return a buffer_head
- * which represents the non-existent block. Very weird.
  *
  * __getblk() will lock up the machine if grow_dev_page's try_to_free_buffers()
  * attempt is failing. FIXME, perhaps?
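After this change the retry loop distinguishes three outcomes from grow_buffers(): a negative value (give up), zero (free memory and retry), and a positive value (the page exists, so look the block up again). The toy model below shows why the loop now terminates for a block past the end of the device. It is a sketch with hypothetical names (`struct dev_model`, `grow_buffers_model`, `getblk_slow_model`) and deliberately ignores pages, buffer heads and locking.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block device and its page cache state. */
struct dev_model {
	unsigned long long end_block;	/* first block past the device end */
	int alloc_failures;		/* allocations that fail before one succeeds */
	bool mapped;			/* true once buffers exist for the block */
};

/* Models the new return convention: 1 = ok, 0 = no memory, <0 = error. */
static int grow_buffers_model(struct dev_model *d, unsigned long long block)
{
	if (d->alloc_failures > 0) {
		d->alloc_failures--;
		return 0;
	}
	if (block >= d->end_block)
		return -ENXIO;		/* requested block is beyond the device */
	d->mapped = true;
	return 1;
}

/* Models the loop in __getblk_slow(): true if a buffer was obtained. */
static bool getblk_slow_model(struct dev_model *d, unsigned long long block)
{
	for (;;) {
		int ret;

		if (d->mapped)		/* stands in for __find_get_block() */
			return true;

		ret = grow_buffers_model(d, block);
		if (ret < 0)
			return false;	/* kernel returns NULL here */
		/* ret == 0: the kernel calls free_more_memory(), then retries */
	}
}
```

In the model, transient allocation failures are retried until the lookup succeeds, while an out-of-range block makes the loop return failure instead of spinning.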

fs/direct-io.c

···
 	unsigned long user_addr;
 	size_t bytes;
 	struct buffer_head map_bh = { 0, };
+	struct blk_plug plug;
 
 	if (rw & WRITE)
 		rw = WRITE_ODIRECT;
···
 				PAGE_SIZE - user_addr / PAGE_SIZE);
 	}
 
+	blk_start_plug(&plug);
+
 	for (seg = 0; seg < nr_segs; seg++) {
 		user_addr = (unsigned long)iov[seg].iov_base;
 		sdio.size += bytes = iov[seg].iov_len;
···
 	}
 	if (sdio.bio)
 		dio_bio_submit(dio, &sdio);
+
+	blk_finish_plug(&plug);
 
 	/*
 	 * It is possible that, we return short IO due to end of file.

+13 -1

include/linux/blkdev.h

···
  * it already be started by driver.
  */
 #define RQ_NOMERGE_FLAGS \
-	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA)
+	(REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | REQ_FUA | REQ_DISCARD)
 #define rq_mergeable(rq) \
 	(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
 	 (((rq)->cmd_flags & REQ_DISCARD) || \
···
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
+extern int blk_bio_map_sg(struct request_queue *q, struct bio *bio,
+			  struct scatterlist *sglist);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 
···
 
 	return (lim->discard_granularity + lim->discard_alignment - alignment)
 		& (lim->discard_granularity - 1);
+}
+
+static inline int bdev_discard_alignment(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (bdev != bdev->bd_contains)
+		return bdev->bd_part->discard_alignment;
+
+	return q->limits.discard_alignment;
 }
 
 static inline unsigned int queue_discard_zeroes_data(struct request_queue *q)

-7

mm/filemap.c

···
 			retval = filemap_write_and_wait_range(mapping, pos,
 					pos + iov_length(iov, nr_segs) - 1);
 			if (!retval) {
-				struct blk_plug plug;
-
-				blk_start_plug(&plug);
 				retval = mapping->a_ops->direct_IO(READ, iocb,
 							iov, pos, nr_segs);
-				blk_finish_plug(&plug);
 			}
 			if (retval > 0) {
 				*ppos = pos + retval;
···
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
-	struct blk_plug plug;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	sb_start_write(inode->i_sb);
 	mutex_lock(&inode->i_mutex);
-	blk_start_plug(&plug);
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
 
···
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
-	blk_finish_plug(&plug);
 	sb_end_write(inode->i_sb);
 	return ret;
 }