zloop: introduce the ordered_zone_append configuration parameter

The zone append operation processing for zloop devices is similar to any
other command, that is, the operation is processed as a command work
item, without any special serialization between the work items (beside
the zone mutex for mutually exclusive code sections).

This processing is fine and gives excellent performance. However, it has
a side effect: zone append operations are very often reordered and
processed in a sequence that differs significantly from the order in
which the user issued them. This effect is very visible with an XFS file
system on top of a zloop device: a simple file write leads to many file
extents because the data writes using zone append are reordered, so the
physical order differs from the logical order of the file.
E.g., executing:

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
0: [0..4095]: 2162688..2166783
1: [4096..6143]: 2168832..2170879
2: [6144..8191]: 2166784..2168831
3: [8192..10239]: 2170880..2172927
4: [10240..12287]: 2174976..2177023
5: [12288..14335]: 2172928..2174975
6: [14336..20479]: 2177024..2183167

For 10 IOs, 7 extents are created.

This is fine and actually exercises XFS zone garbage collection very
well. However, it also makes debugging and working on XFS data placement
harder, as the underlying device will reorder IOs most of the time,
resulting in many file extents.

Allow a user to mitigate this with the new ordered_zone_append
configuration parameter. For a zloop device created with this parameter
specified, the sector of a zone append command is set early, when the
command is submitted by the block layer with the zloop_queue_rq()
function, instead of in the zloop_rw() function, which is executed later
in the command work item context. This change ensures that, more often
than not, the data of zone append operations ends up being written in
the same order as the commands were submitted by the user.
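
The difference between the two assignment points can be illustrated with
a small user-space simulation. This is only a sketch and not driver
code: the sim_* names, the fixed command size and the single-threaded
model are assumptions made for illustration.

```c
#include <stddef.h>

#define SIM_CMD_SECTORS 8ULL	/* hypothetical fixed command size */

/* Hypothetical, simplified zone: only the write pointer is tracked. */
struct sim_zone {
	unsigned long long wp;
};

/* Use the current write pointer as the target sector, then advance it. */
static unsigned long long sim_take_wp(struct sim_zone *zone)
{
	unsigned long long sector = zone->wp;

	zone->wp += SIM_CMD_SECTORS;
	return sector;
}

/*
 * Default behavior: the sector is chosen when the work item runs, so it
 * follows the execution order. exec_order[i] is the submission index of
 * the i-th command to execute; sectors[] is indexed by submission index.
 */
static void sim_assign_at_execution(const size_t *exec_order, size_t nr_cmds,
				    unsigned long long *sectors)
{
	struct sim_zone zone = { .wp = 0 };

	for (size_t i = 0; i < nr_cmds; i++)
		sectors[exec_order[i]] = sim_take_wp(&zone);
}

/*
 * ordered_zone_append behavior: the sector is chosen at submission, so it
 * follows the submission order regardless of when each work item runs.
 */
static void sim_assign_at_submission(size_t nr_cmds,
				     unsigned long long *sectors)
{
	struct sim_zone zone = { .wp = 0 };

	for (size_t i = 0; i < nr_cmds; i++)
		sectors[i] = sim_take_wp(&zone);
}
```

With four commands executed in the order 2, 0, 3, 1, the first variant
places command 0 at sector 8 and command 1 at sector 24, while the
second variant always yields sectors 0, 8, 16 and 24 in submission
order.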

In the case of XFS, this leads to far fewer file data extents. E.g., for
the previous example, we get a single file data extent for the written
file.

$ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
$ xfs_bmap /mnt/test
/mnt/test:
0: [0..20479]: 2162688..2183167

Since we cannot use a mutex in the context of the zloop_queue_rq()
function to atomically set the sector of a zone append operation to the
write pointer location of the target zone and increment that write
pointer, a new per-zone spinlock is introduced to protect accesses and
modifications of a zone write pointer. To check a zone write pointer
location and set the target sector of a zone append operation to that
value, the function zloop_set_zone_append_sector() is introduced and
called from zloop_queue_rq().
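
The check-and-advance step can be sketched in user space as follows.
This is an illustration under stated assumptions, not the kernel code: a
C11 atomic_flag stands in for the kernel spinlock (the driver uses
spin_lock_irqsave()), a boolean stands in for the zone condition, and
all demo_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the per-zone state. */
struct demo_zone {
	atomic_flag wp_lock;	/* stand-in for the per-zone spinlock */
	uint64_t start;		/* first sector of the zone */
	uint64_t capacity;	/* usable sectors in the zone */
	uint64_t wp;		/* write pointer, UINT64_MAX once full */
	bool full;		/* stand-in for BLK_ZONE_COND_FULL */
};

static void demo_lock(atomic_flag *lock)
{
	/* Busy-wait until the flag is acquired. */
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;
}

static void demo_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Atomically take the write pointer as the target sector of an append
 * and advance it. Returns false if the zone is full or if the append
 * would cross the end of the zone.
 */
static bool demo_reserve_append(struct demo_zone *zone, uint64_t nr_sectors,
				uint64_t *sector)
{
	uint64_t zone_end = zone->start + zone->capacity;
	bool ok = false;

	demo_lock(&zone->wp_lock);
	/* Test "full" first: wp is UINT64_MAX then and must not be added to. */
	if (!zone->full && zone->wp + nr_sectors <= zone_end) {
		*sector = zone->wp;
		zone->wp += nr_sectors;
		if (zone->wp >= zone_end) {
			zone->full = true;
			zone->wp = UINT64_MAX;
		}
		ok = true;
	}
	demo_unlock(&zone->wp_lock);

	return ok;
}
```

Because the check and the increment happen inside one critical section,
two concurrent submitters can never receive the same sector, and the
sectors handed out follow the order in which the lock is taken.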

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Authored by Damien Le Moal and committed by Jens Axboe 7 months ago (fcc6eaa3 9236c5fd)

+96 -12

1 changed file

drivers/block/zloop.c

···
 	ZLOOP_OPT_QUEUE_DEPTH = (1 << 7),
 	ZLOOP_OPT_BUFFERED_IO = (1 << 8),
 	ZLOOP_OPT_ZONE_APPEND = (1 << 9),
+	ZLOOP_OPT_ORDERED_ZONE_APPEND = (1 << 10),
 };

 static const match_table_t zloop_opt_tokens = {
···
 	{ ZLOOP_OPT_QUEUE_DEPTH, "queue_depth=%u" },
 	{ ZLOOP_OPT_BUFFERED_IO, "buffered_io" },
 	{ ZLOOP_OPT_ZONE_APPEND, "zone_append=%u" },
+	{ ZLOOP_OPT_ORDERED_ZONE_APPEND, "ordered_zone_append" },
 	{ ZLOOP_OPT_ERR, NULL }
 };

···
 #define ZLOOP_DEF_QUEUE_DEPTH 128
 #define ZLOOP_DEF_BUFFERED_IO false
 #define ZLOOP_DEF_ZONE_APPEND true
+#define ZLOOP_DEF_ORDERED_ZONE_APPEND false

 /* Arbitrary limit on the zone size (16GB). */
 #define ZLOOP_MAX_ZONE_SIZE_MB 16384
···
 	unsigned int queue_depth;
 	bool buffered_io;
 	bool zone_append;
+	bool ordered_zone_append;
 };

 /*
···

 	unsigned long flags;
 	struct mutex lock;
+	spinlock_t wp_lock;
 	enum blk_zone_cond cond;
 	sector_t start;
 	sector_t wp;
···
 	struct workqueue_struct *workqueue;
 	bool buffered_io;
 	bool zone_append;
+	bool ordered_zone_append;

 	const char *base_dir;
 	struct file *data_dir;
···
 	struct zloop_zone *zone = &zlo->zones[zone_no];
 	struct kstat stat;
 	sector_t file_sectors;
+	unsigned long flags;
 	int ret;

 	lockdep_assert_held(&zone->lock);
···
 		return -EINVAL;
 	}

+	spin_lock_irqsave(&zone->wp_lock, flags);
 	if (!file_sectors) {
 		zone->cond = BLK_ZONE_COND_EMPTY;
 		zone->wp = zone->start;
···
 		zone->cond = BLK_ZONE_COND_CLOSED;
 		zone->wp = zone->start + file_sectors;
 	}
+	spin_unlock_irqrestore(&zone->wp_lock, flags);

 	return 0;
 }
···
 static int zloop_close_zone(struct zloop_device *zlo, unsigned int zone_no)
 {
 	struct zloop_zone *zone = &zlo->zones[zone_no];
+	unsigned long flags;
 	int ret = 0;

 	if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
···
 		break;
 	case BLK_ZONE_COND_IMP_OPEN:
 	case BLK_ZONE_COND_EXP_OPEN:
+		spin_lock_irqsave(&zone->wp_lock, flags);
 		if (zone->wp == zone->start)
 			zone->cond = BLK_ZONE_COND_EMPTY;
 		else
 			zone->cond = BLK_ZONE_COND_CLOSED;
+		spin_unlock_irqrestore(&zone->wp_lock, flags);
 		break;
 	case BLK_ZONE_COND_EMPTY:
 	case BLK_ZONE_COND_FULL:
···
 static int zloop_reset_zone(struct zloop_device *zlo, unsigned int zone_no)
 {
 	struct zloop_zone *zone = &zlo->zones[zone_no];
+	unsigned long flags;
 	int ret = 0;

 	if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
···
 		goto unlock;
 	}

+	spin_lock_irqsave(&zone->wp_lock, flags);
 	zone->cond = BLK_ZONE_COND_EMPTY;
 	zone->wp = zone->start;
 	clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+	spin_unlock_irqrestore(&zone->wp_lock, flags);

 unlock:
 	mutex_unlock(&zone->lock);
···
 static int zloop_finish_zone(struct zloop_device *zlo, unsigned int zone_no)
 {
 	struct zloop_zone *zone = &zlo->zones[zone_no];
+	unsigned long flags;
 	int ret = 0;

 	if (test_bit(ZLOOP_ZONE_CONV, &zone->flags))
···
 		goto unlock;
 	}

+	spin_lock_irqsave(&zone->wp_lock, flags);
 	zone->cond = BLK_ZONE_COND_FULL;
 	zone->wp = ULLONG_MAX;
 	clear_bit(ZLOOP_ZONE_SEQ_ERROR, &zone->flags);
+	spin_unlock_irqrestore(&zone->wp_lock, flags);

 unlock:
 	mutex_unlock(&zone->lock);
···
 	struct zloop_zone *zone;
 	struct iov_iter iter;
 	struct bio_vec tmp;
+	unsigned long flags;
 	sector_t zone_end;
 	int nr_bvec = 0;
 	int ret;
···
 	if (!test_bit(ZLOOP_ZONE_CONV, &zone->flags) && is_write) {
 		mutex_lock(&zone->lock);

+		spin_lock_irqsave(&zone->wp_lock, flags);
+
 		/*
 		 * Zone append operations always go at the current write
 		 * pointer, but regular write operations must already be
 		 * aligned to the write pointer when submitted.
 		 */
 		if (is_append) {
-			if (zone->cond == BLK_ZONE_COND_FULL) {
-				ret = -EIO;
-				goto unlock;
+			/*
+			 * If ordered zone append is in use, we already checked
+			 * and set the target sector in zloop_queue_rq().
+			 */
+			if (!zlo->ordered_zone_append) {
+				if (zone->cond == BLK_ZONE_COND_FULL) {
+					spin_unlock_irqrestore(&zone->wp_lock,
+							       flags);
+					ret = -EIO;
+					goto unlock;
+				}
+				sector = zone->wp;
 			}
-			sector = zone->wp;
 			cmd->sector = sector;
 		} else if (sector != zone->wp) {
+			spin_unlock_irqrestore(&zone->wp_lock, flags);
 			pr_err("Zone %u: unaligned write: sect %llu, wp %llu\n",
 			       zone_no, sector, zone->wp);
 			ret = -EIO;
···
 			zone->cond = BLK_ZONE_COND_IMP_OPEN;

 		/*
-		 * Advance the write pointer. If the write fails, the write
-		 * pointer position will be corrected when the next I/O starts
-		 * execution.
+		 * Advance the write pointer, unless ordered zone append is in
+		 * use. If the write fails, the write pointer position will be
+		 * corrected when the next I/O starts execution.
 		 */
-		zone->wp += nr_sectors;
-		if (zone->wp == zone_end) {
-			zone->cond = BLK_ZONE_COND_FULL;
-			zone->wp = ULLONG_MAX;
+		if (!is_append || !zlo->ordered_zone_append) {
+			zone->wp += nr_sectors;
+			if (zone->wp == zone_end) {
+				zone->cond = BLK_ZONE_COND_FULL;
+				zone->wp = ULLONG_MAX;
+			}
 		}
+
+		spin_unlock_irqrestore(&zone->wp_lock, flags);
 	}

 	rq_for_each_bvec(tmp, rq, rq_iter)
···
 	blk_mq_end_request(rq, sts);
 }

+static bool zloop_set_zone_append_sector(struct request *rq)
+{
+	struct zloop_device *zlo = rq->q->queuedata;
+	unsigned int zone_no = rq_zone_no(rq);
+	struct zloop_zone *zone = &zlo->zones[zone_no];
+	sector_t zone_end = zone->start + zlo->zone_capacity;
+	sector_t nr_sectors = blk_rq_sectors(rq);
+	unsigned long flags;
+
+	spin_lock_irqsave(&zone->wp_lock, flags);
+
+	if (zone->cond == BLK_ZONE_COND_FULL ||
+	    zone->wp + nr_sectors > zone_end) {
+		spin_unlock_irqrestore(&zone->wp_lock, flags);
+		return false;
+	}
+
+	rq->__sector = zone->wp;
+	zone->wp += blk_rq_sectors(rq);
+	if (zone->wp >= zone_end) {
+		zone->cond = BLK_ZONE_COND_FULL;
+		zone->wp = ULLONG_MAX;
+	}
+
+	spin_unlock_irqrestore(&zone->wp_lock, flags);
+
+	return true;
+}
+
 static blk_status_t zloop_queue_rq(struct blk_mq_hw_ctx *hctx,
 				   const struct blk_mq_queue_data *bd)
 {
···

 	if (zlo->state == Zlo_deleting)
 		return BLK_STS_IOERR;
+
+	/*
+	 * If we need to strongly order zone append operations, set the request
+	 * sector to the zone write pointer location now instead of when the
+	 * command work runs.
+	 */
+	if (zlo->ordered_zone_append && req_op(rq) == REQ_OP_ZONE_APPEND) {
+		if (!zloop_set_zone_append_sector(rq))
+			return BLK_STS_IOERR;
+	}

 	blk_mq_start_request(rq);

···
 	struct zloop_device *zlo = disk->private_data;
 	struct blk_zone blkz = {};
 	unsigned int first, i;
+	unsigned long flags;
 	int ret;

 	first = disk_zone_no(disk, sector);
···

 		blkz.start = zone->start;
 		blkz.len = zlo->zone_size;
+		spin_lock_irqsave(&zone->wp_lock, flags);
 		blkz.wp = zone->wp;
+		spin_unlock_irqrestore(&zone->wp_lock, flags);
 		blkz.cond = zone->cond;
 		if (test_bit(ZLOOP_ZONE_CONV, &zone->flags)) {
 			blkz.type = BLK_ZONE_TYPE_CONVENTIONAL;
···
 	int ret;

 	mutex_init(&zone->lock);
+	spin_lock_init(&zone->wp_lock);
 	zone->start = (sector_t)zone_no << zlo->zone_shift;

 	if (!restore)
···
 	zlo->nr_conv_zones = opts->nr_conv_zones;
 	zlo->buffered_io = opts->buffered_io;
 	zlo->zone_append = opts->zone_append;
+	if (zlo->zone_append)
+		zlo->ordered_zone_append = opts->ordered_zone_append;

 	zlo->workqueue = alloc_workqueue("zloop%d", WQ_UNBOUND | WQ_FREEZABLE,
 				opts->nr_queues * opts->queue_depth, zlo->id);
···
 		zlo->id, zlo->nr_zones,
 		((sector_t)zlo->zone_size << SECTOR_SHIFT) >> 20,
 		zlo->block_size);
-	pr_info("zloop%d: using %s zone append\n",
+	pr_info("zloop%d: using %s%s zone append\n",
 		zlo->id,
+		zlo->ordered_zone_append ? "ordered " : "",
 		zlo->zone_append ? "native" : "emulated");

 	return 0;
···
 	opts->queue_depth = ZLOOP_DEF_QUEUE_DEPTH;
 	opts->buffered_io = ZLOOP_DEF_BUFFERED_IO;
 	opts->zone_append = ZLOOP_DEF_ZONE_APPEND;
+	opts->ordered_zone_append = ZLOOP_DEF_ORDERED_ZONE_APPEND;

 	if (!buf)
 		return 0;
···
 				goto out;
 			}
 			opts->zone_append = token;
+			break;
+		case ZLOOP_OPT_ORDERED_ZONE_APPEND:
+			opts->ordered_zone_append = true;
 			break;
 		case ZLOOP_OPT_ERR:
 		default:
