Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux

+24 -11

Documentation/block/ublk.rst

··· 115 115 116 116 - ``UBLK_CMD_START_DEV`` 117 117 118 - After the server prepares userspace resources (such as creating per-queue 119 - pthread & io_uring for handling ublk IO), this command is sent to the 118 + After the server prepares userspace resources (such as creating I/O handler 119 + threads & io_uring for handling ublk IO), this command is sent to the 120 120 driver for allocating & exposing ``/dev/ublkb*``. Parameters set via 121 121 ``UBLK_CMD_SET_PARAMS`` are applied for creating the device. 122 122 123 123 - ``UBLK_CMD_STOP_DEV`` 124 124 125 125 Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns, 126 - ublk server will release resources (such as destroying per-queue pthread & 126 + ublk server will release resources (such as destroying I/O handler threads & 127 127 io_uring). 128 128 129 129 - ``UBLK_CMD_DEL_DEV`` ··· 208 208 modify how I/O is handled while the ublk server is dying/dead (this is called 209 209 the ``nosrv`` case in the driver code). 210 210 211 - With just ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublk server's io 212 - handler) is dying, ublk does not delete ``/dev/ublkb*`` during the whole 211 + With just ``UBLK_F_USER_RECOVERY`` set, after the ublk server exits, 212 + ublk does not delete ``/dev/ublkb*`` during the whole 213 213 recovery stage and ublk device ID is kept. It is ublk server's 214 214 responsibility to recover the device context by its own knowledge. 215 215 Requests which have not been issued to userspace are requeued. Requests 216 216 which have been issued to userspace are aborted. 
217 217 218 - With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after one ubq_daemon 219 - (ublk server's io handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``, 218 + With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after the ublk server 219 + exits, contrary to ``UBLK_F_USER_RECOVERY``, 220 220 requests which have been issued to userspace are requeued and will be 221 221 re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``. 222 222 ``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends who tolerate ··· 241 241 Data plane 242 242 ---------- 243 243 244 - ublk server needs to create per-queue IO pthread & io_uring for handling IO 245 - commands via io_uring passthrough. The per-queue IO pthread 246 - focuses on IO handling and shouldn't handle any control & management 247 - tasks. 244 + The ublk server should create dedicated threads for handling I/O. Each 245 + thread should have its own io_uring through which it is notified of new 246 + I/O, and through which it can complete I/O. These dedicated threads 247 + should focus on IO handling and shouldn't handle any control & 248 + management tasks. 248 249 249 250 The's IO is assigned by a unique tag, which is 1:1 mapping with IO 250 251 request of ``/dev/ublkb*``. ··· 265 264 Sent from the server IO pthread for fetching future incoming IO requests 266 265 destined to ``/dev/ublkb*``. This command is sent only once from the server 267 266 IO pthread for ublk driver to setup IO forward environment. 267 + 268 + Once a thread issues this command against a given (qid,tag) pair, the thread 269 + registers itself as that I/O's daemon. In the future, only that I/O's daemon 270 + is allowed to issue commands against the I/O. If any other thread attempts 271 + to issue a command against a (qid,tag) pair for which the thread is not the 272 + daemon, the command will fail. Daemons can be reset only be going through 273 + recovery. 
274 + 275 + The ability for every (qid,tag) pair to have its own independent daemon task 276 + is indicated by the ``UBLK_F_PER_IO_DAEMON`` feature. If this feature is not 277 + supported by the driver, daemons must be per-queue instead - i.e. all I/Os 278 + associated to a single qid must be handled by the same task. 268 279 269 280 - ``UBLK_IO_COMMIT_AND_FETCH_REQ`` 270 281
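
Note on the per-I/O daemon rule documented above: the ownership check can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the driver's code. The names `io_slot`, `slot_fetch`, `slot_commit` and `slot_recover` are hypothetical, thread ids are plain integers, and the `-EINVAL` return value is an assumption about how a rejected command is reported.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical user-space model; none of these names exist in the driver. */
struct io_slot {
	int owner;	/* 0 = no daemon registered yet, otherwise a thread id */
};

/* Model of FETCH_REQ: the first caller becomes the daemon of this (qid,tag). */
static int slot_fetch(struct io_slot *io, int tid)
{
	assert(tid != 0);	/* 0 is reserved for "no owner" in this model */
	if (io->owner && io->owner != tid)
		return -EINVAL;	/* assumed error code */
	io->owner = tid;
	return 0;
}

/* Model of any later command: only the registered daemon may issue it. */
static int slot_commit(const struct io_slot *io, int tid)
{
	return io->owner == tid ? 0 : -EINVAL;
}

/* Model of recovery: ownership is dropped so that a new thread can register. */
static void slot_recover(struct io_slot *io)
{
	io->owner = 0;
}
```

In this model a second thread is rejected until `slot_recover()` runs, which mirrors the statement that daemons can be reset only by going through recovery. Without `UBLK_F_PER_IO_DAEMON`, all slots of one queue would have to share the same owner.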

+5 -12

block/bio-integrity.c

··· 154 154 EXPORT_SYMBOL(bio_integrity_add_page); 155 155 156 156 static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec, 157 - int nr_vecs, unsigned int len, 158 - unsigned int direction) 157 + int nr_vecs, unsigned int len) 159 158 { 160 - bool write = direction == ITER_SOURCE; 159 + bool write = op_is_write(bio_op(bio)); 161 160 struct bio_integrity_payload *bip; 162 161 struct iov_iter iter; 163 162 void *buf; ··· 167 168 return -ENOMEM; 168 169 169 170 if (write) { 170 - iov_iter_bvec(&iter, direction, bvec, nr_vecs, len); 171 + iov_iter_bvec(&iter, ITER_SOURCE, bvec, nr_vecs, len); 171 172 if (!copy_from_iter_full(buf, len, &iter)) { 172 173 ret = -EFAULT; 173 174 goto free_buf; ··· 263 264 struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages; 264 265 struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec; 265 266 size_t offset, bytes = iter->count; 266 - unsigned int direction, nr_bvecs; 267 + unsigned int nr_bvecs; 267 268 int ret, nr_vecs; 268 269 bool copy; 269 270 ··· 271 272 return -EINVAL; 272 273 if (bytes >> SECTOR_SHIFT > queue_max_hw_sectors(q)) 273 274 return -E2BIG; 274 - 275 - if (bio_data_dir(bio) == READ) 276 - direction = ITER_DEST; 277 - else 278 - direction = ITER_SOURCE; 279 275 280 276 nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1); 281 277 if (nr_vecs > BIO_MAX_VECS) ··· 294 300 copy = true; 295 301 296 302 if (copy) 297 - ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes, 298 - direction); 303 + ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes); 299 304 else 300 305 ret = bio_integrity_init_user(bio, bvec, nr_bvecs, bytes); 301 306 if (ret)

+1 -6

block/blk-integrity.c

··· 117 117 { 118 118 int ret; 119 119 struct iov_iter iter; 120 - unsigned int direction; 121 120 122 - if (op_is_write(req_op(rq))) 123 - direction = ITER_DEST; 124 - else 125 - direction = ITER_SOURCE; 126 - iov_iter_ubuf(&iter, direction, ubuf, bytes); 121 + iov_iter_ubuf(&iter, rq_data_dir(rq), ubuf, bytes); 127 122 ret = bio_integrity_map_user(rq->bio, &iter); 128 123 if (ret) 129 124 return ret;
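
Note on the hunk above: the removed lines picked `ITER_DEST` for writes and `ITER_SOURCE` for reads, which is the reverse of what the replacement passes. A minimal sketch of the difference, assuming that the kernel's `READ`/`WRITE` values coincide with `ITER_DEST`/`ITER_SOURCE` (0 and 1 respectively). The `MODEL_*` constants and both function names are stand-ins invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel constants; the numeric values are assumptions. */
enum { MODEL_READ = 0, MODEL_WRITE = 1 };
enum { MODEL_ITER_DEST = 0, MODEL_ITER_SOURCE = 1 };

/* Logic of the removed lines: the direction is inverted. */
static int old_direction(bool is_write)
{
	return is_write ? MODEL_ITER_DEST : MODEL_ITER_SOURCE;
}

/* Logic of the added line: the data direction is passed through unchanged. */
static int new_direction(bool is_write)
{
	int dir = is_write ? MODEL_WRITE : MODEL_READ;

	assert(dir == MODEL_ITER_DEST || dir == MODEL_ITER_SOURCE);
	return dir;
}
```

Under these assumptions a write maps to the source direction (user memory is read from) and a read maps to the destination direction (user memory is written to).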

+6 -2

drivers/block/loop.c

··· 308 308 static void lo_rw_aio_do_completion(struct loop_cmd *cmd) 309 309 { 310 310 struct request *rq = blk_mq_rq_from_pdu(cmd); 311 + struct loop_device *lo = rq->q->queuedata; 311 312 312 313 if (!atomic_dec_and_test(&cmd->ref)) 313 314 return; 314 315 kfree(cmd->bvec); 315 316 cmd->bvec = NULL; 317 + if (req_op(rq) == REQ_OP_WRITE) 318 + file_end_write(lo->lo_backing_file); 316 319 if (likely(!blk_should_fake_timeout(rq->q))) 317 320 blk_mq_complete_request(rq); 318 321 } ··· 390 387 cmd->iocb.ki_flags = 0; 391 388 } 392 389 393 - if (rw == ITER_SOURCE) 390 + if (rw == ITER_SOURCE) { 391 + file_start_write(lo->lo_backing_file); 394 392 ret = file->f_op->write_iter(&cmd->iocb, &iter); 395 - else 393 + } else 396 394 ret = file->f_op->read_iter(&cmd->iocb, &iter); 397 395 398 396 lo_rw_aio_do_completion(cmd);
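
Note on the hunk above: the start and end calls are placed so that each write is bracketed exactly once, with the end call deferred until the last reference to the command is dropped. A rough user-space model of that pairing follows. All names are hypothetical, the `writers` counter merely stands in for `file_start_write()` and `file_end_write()`, and the initial reference count of 2 used in the usage note is an assumption about the asynchronous path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; they do not mirror the real loop structures. */
struct model_file {
	int writers;		/* writes currently bracketed by start/end */
};

struct model_cmd {
	int ref;		/* outstanding references to the command */
	bool is_write;
	struct model_file *file;
};

/* Submission: writes announce themselves before the I/O is issued. */
static void model_submit(struct model_cmd *cmd)
{
	if (cmd->is_write)
		cmd->file->writers++;	/* stands in for file_start_write() */
}

/* Completion: only the final reference drop ends the write. */
static void model_complete(struct model_cmd *cmd)
{
	assert(cmd->ref > 0);
	if (--cmd->ref)
		return;
	if (cmd->is_write)
		cmd->file->writers--;	/* stands in for file_end_write() */
}
```

With `ref` starting at 2, the first `model_complete()` call leaves the write bracket open and the second one closes it, so the counter returns to zero once per write. Reads never touch the counter.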

+56 -55

drivers/block/ublk_drv.c

··· 69 69 | UBLK_F_USER_RECOVERY_FAIL_IO \ 70 70 | UBLK_F_UPDATE_SIZE \ 71 71 | UBLK_F_AUTO_BUF_REG \ 72 - | UBLK_F_QUIESCE) 72 + | UBLK_F_QUIESCE \ 73 + | UBLK_F_PER_IO_DAEMON) 73 74 74 75 #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \ 75 76 | UBLK_F_USER_RECOVERY_REISSUE \ ··· 167 166 /* valid if UBLK_IO_FLAG_OWNED_BY_SRV is set */ 168 167 struct request *req; 169 168 }; 169 + 170 + struct task_struct *task; 170 171 }; 171 172 172 173 struct ublk_queue { ··· 176 173 int q_depth; 177 174 178 175 unsigned long flags; 179 - struct task_struct *ubq_daemon; 180 176 struct ublksrv_io_desc *io_cmd_buf; 181 177 182 178 bool force_abort; 183 - bool timeout; 184 179 bool canceling; 185 180 bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */ 186 181 unsigned short nr_io_ready; /* how many ios setup */ ··· 1100 1099 return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu); 1101 1100 } 1102 1101 1103 - static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq) 1104 - { 1105 - return !ubq->ubq_daemon || ubq->ubq_daemon->flags & PF_EXITING; 1106 - } 1107 - 1108 1102 /* todo: handle partial completion */ 1109 1103 static inline void __ublk_complete_rq(struct request *req) 1110 1104 { ··· 1271 1275 /* 1272 1276 * Task is exiting if either: 1273 1277 * 1274 - * (1) current != ubq_daemon. 1278 + * (1) current != io->task. 1275 1279 * io_uring_cmd_complete_in_task() tries to run task_work 1276 - * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING. 1280 + * in a workqueue if cmd's task is PF_EXITING. 1277 1281 * 1278 1282 * (2) current->flags & PF_EXITING. 
1279 1283 */ 1280 - if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { 1284 + if (unlikely(current != io->task || current->flags & PF_EXITING)) { 1281 1285 __ublk_abort_rq(ubq, req); 1282 1286 return; 1283 1287 } ··· 1326 1330 { 1327 1331 struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); 1328 1332 struct request *rq = pdu->req_list; 1329 - struct ublk_queue *ubq = pdu->ubq; 1330 1333 struct request *next; 1331 1334 1332 1335 do { 1333 1336 next = rq->rq_next; 1334 1337 rq->rq_next = NULL; 1335 - ublk_dispatch_req(ubq, rq, issue_flags); 1338 + ublk_dispatch_req(rq->mq_hctx->driver_data, rq, issue_flags); 1336 1339 rq = next; 1337 1340 } while (rq); 1338 1341 } 1339 1342 1340 - static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l) 1343 + static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l) 1341 1344 { 1342 - struct request *rq = rq_list_peek(l); 1343 - struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd; 1345 + struct io_uring_cmd *cmd = io->cmd; 1344 1346 struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); 1345 1347 1346 - pdu->req_list = rq; 1348 + pdu->req_list = rq_list_peek(l); 1347 1349 rq_list_init(l); 1348 1350 io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb); 1349 1351 } ··· 1349 1355 static enum blk_eh_timer_return ublk_timeout(struct request *rq) 1350 1356 { 1351 1357 struct ublk_queue *ubq = rq->mq_hctx->driver_data; 1358 + struct ublk_io *io = &ubq->ios[rq->tag]; 1352 1359 1353 1360 if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { 1354 - if (!ubq->timeout) { 1355 - send_sig(SIGKILL, ubq->ubq_daemon, 0); 1356 - ubq->timeout = true; 1357 - } 1358 - 1361 + send_sig(SIGKILL, io->task, 0); 1359 1362 return BLK_EH_DONE; 1360 1363 } 1361 1364 ··· 1420 1429 { 1421 1430 struct rq_list requeue_list = { }; 1422 1431 struct rq_list submit_list = { }; 1423 - struct ublk_queue *ubq = NULL; 1432 + struct ublk_io *io = NULL; 1424 1433 struct request *req; 1425 1434 1426 1435 while ((req 
= rq_list_pop(rqlist))) { 1427 1436 struct ublk_queue *this_q = req->mq_hctx->driver_data; 1437 + struct ublk_io *this_io = &this_q->ios[req->tag]; 1428 1438 1429 - if (ubq && ubq != this_q && !rq_list_empty(&submit_list)) 1430 - ublk_queue_cmd_list(ubq, &submit_list); 1431 - ubq = this_q; 1439 + if (io && io->task != this_io->task && !rq_list_empty(&submit_list)) 1440 + ublk_queue_cmd_list(io, &submit_list); 1441 + io = this_io; 1432 1442 1433 - if (ublk_prep_req(ubq, req, true) == BLK_STS_OK) 1443 + if (ublk_prep_req(this_q, req, true) == BLK_STS_OK) 1434 1444 rq_list_add_tail(&submit_list, req); 1435 1445 else 1436 1446 rq_list_add_tail(&requeue_list, req); 1437 1447 } 1438 1448 1439 - if (ubq && !rq_list_empty(&submit_list)) 1440 - ublk_queue_cmd_list(ubq, &submit_list); 1449 + if (!rq_list_empty(&submit_list)) 1450 + ublk_queue_cmd_list(io, &submit_list); 1441 1451 *rqlist = requeue_list; 1442 1452 } 1443 1453 ··· 1466 1474 /* All old ioucmds have to be completed */ 1467 1475 ubq->nr_io_ready = 0; 1468 1476 1469 - /* 1470 - * old daemon is PF_EXITING, put it now 1471 - * 1472 - * It could be NULL in case of closing one quisced device. 1473 - */ 1474 - if (ubq->ubq_daemon) 1475 - put_task_struct(ubq->ubq_daemon); 1476 - /* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */ 1477 - ubq->ubq_daemon = NULL; 1478 - ubq->timeout = false; 1479 - 1480 1477 for (i = 0; i < ubq->q_depth; i++) { 1481 1478 struct ublk_io *io = &ubq->ios[i]; 1482 1479 ··· 1476 1495 io->flags &= UBLK_IO_FLAG_CANCELED; 1477 1496 io->cmd = NULL; 1478 1497 io->addr = 0; 1498 + 1499 + /* 1500 + * old task is PF_EXITING, put it now 1501 + * 1502 + * It could be NULL in case of closing one quiesced 1503 + * device. 
1504 + */ 1505 + if (io->task) { 1506 + put_task_struct(io->task); 1507 + io->task = NULL; 1508 + } 1479 1509 } 1480 1510 } 1481 1511 ··· 1508 1516 for (i = 0; i < ub->dev_info.nr_hw_queues; i++) 1509 1517 ublk_queue_reinit(ub, ublk_get_queue(ub, i)); 1510 1518 1511 - /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */ 1519 + /* set to NULL, otherwise new tasks cannot mmap io_cmd_buf */ 1512 1520 ub->mm = NULL; 1513 1521 ub->nr_queues_ready = 0; 1514 1522 ub->nr_privileged_daemon = 0; ··· 1775 1783 struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); 1776 1784 struct ublk_queue *ubq = pdu->ubq; 1777 1785 struct task_struct *task; 1786 + struct ublk_io *io; 1778 1787 1779 1788 if (WARN_ON_ONCE(!ubq)) 1780 1789 return; ··· 1784 1791 return; 1785 1792 1786 1793 task = io_uring_cmd_get_task(cmd); 1787 - if (WARN_ON_ONCE(task && task != ubq->ubq_daemon)) 1794 + io = &ubq->ios[pdu->tag]; 1795 + if (WARN_ON_ONCE(task && task != io->task)) 1788 1796 return; 1789 1797 1790 1798 if (!ubq->canceling) 1791 1799 ublk_start_cancel(ubq); 1792 1800 1793 - WARN_ON_ONCE(ubq->ios[pdu->tag].cmd != cmd); 1801 + WARN_ON_ONCE(io->cmd != cmd); 1794 1802 ublk_cancel_cmd(ubq, pdu->tag, issue_flags); 1795 1803 } 1796 1804 ··· 1924 1930 { 1925 1931 ubq->nr_io_ready++; 1926 1932 if (ublk_queue_ready(ubq)) { 1927 - ubq->ubq_daemon = current; 1928 - get_task_struct(ubq->ubq_daemon); 1929 1933 ub->nr_queues_ready++; 1930 1934 1931 1935 if (capable(CAP_SYS_ADMIN)) ··· 2076 2084 } 2077 2085 2078 2086 ublk_fill_io_cmd(io, cmd, buf_addr); 2087 + WRITE_ONCE(io->task, get_task_struct(current)); 2079 2088 ublk_mark_io_ready(ub, ubq); 2080 2089 out: 2081 2090 mutex_unlock(&ub->mutex); ··· 2172 2179 const struct ublksrv_io_cmd *ub_cmd) 2173 2180 { 2174 2181 struct ublk_device *ub = cmd->file->private_data; 2182 + struct task_struct *task; 2175 2183 struct ublk_queue *ubq; 2176 2184 struct ublk_io *io; 2177 2185 u32 cmd_op = cmd->cmd_op; ··· 2187 2193 goto out; 2188 2194 2189 
2195 ubq = ublk_get_queue(ub, ub_cmd->q_id); 2190 - if (ubq->ubq_daemon && ubq->ubq_daemon != current) 2191 - goto out; 2192 2196 2193 2197 if (tag >= ubq->q_depth) 2194 2198 goto out; 2195 2199 2196 2200 io = &ubq->ios[tag]; 2201 + task = READ_ONCE(io->task); 2202 + if (task && task != current) 2203 + goto out; 2197 2204 2198 2205 /* there is pending io cmd, something must be wrong */ 2199 2206 if (io->flags & UBLK_IO_FLAG_ACTIVE) { ··· 2444 2449 { 2445 2450 int size = ublk_queue_cmd_buf_size(ub, q_id); 2446 2451 struct ublk_queue *ubq = ublk_get_queue(ub, q_id); 2452 + int i; 2447 2453 2448 - if (ubq->ubq_daemon) 2449 - put_task_struct(ubq->ubq_daemon); 2454 + for (i = 0; i < ubq->q_depth; i++) { 2455 + struct ublk_io *io = &ubq->ios[i]; 2456 + if (io->task) 2457 + put_task_struct(io->task); 2458 + } 2459 + 2450 2460 if (ubq->io_cmd_buf) 2451 2461 free_pages((unsigned long)ubq->io_cmd_buf, get_order(size)); 2452 2462 } ··· 2923 2923 ub->dev_info.flags &= UBLK_F_ALL; 2924 2924 2925 2925 ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE | 2926 - UBLK_F_URING_CMD_COMP_IN_TASK; 2926 + UBLK_F_URING_CMD_COMP_IN_TASK | 2927 + UBLK_F_PER_IO_DAEMON; 2927 2928 2928 2929 /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */ 2929 2930 if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY | ··· 3189 3188 int ublksrv_pid = (int)header->data[0]; 3190 3189 int ret = -EINVAL; 3191 3190 3192 - pr_devel("%s: Waiting for new ubq_daemons(nr: %d) are ready, dev id %d...\n", 3193 - __func__, ub->dev_info.nr_hw_queues, header->dev_id); 3194 - /* wait until new ubq_daemon sending all FETCH_REQ */ 3191 + pr_devel("%s: Waiting for all FETCH_REQs, dev id %d...\n", __func__, 3192 + header->dev_id); 3193 + 3195 3194 if (wait_for_completion_interruptible(&ub->completion)) 3196 3195 return -EINTR; 3197 3196 3198 - pr_devel("%s: All new ubq_daemons(nr: %d) are ready, dev id %d\n", 3199 - __func__, ub->dev_info.nr_hw_queues, header->dev_id); 3197 + pr_devel("%s: All 
FETCH_REQs received, dev id %d\n", __func__, 3198 + header->dev_id); 3200 3199 3201 3200 mutex_lock(&ub->mutex); 3202 3201 if (ublk_nosrv_should_stop_dev(ub))

-2

drivers/md/bcache/btree.c

··· 89 89 * Test module load/unload 90 90 */ 91 91 92 - #define MAX_NEED_GC 64 93 - #define MAX_SAVE_PRIO 72 94 92 #define MAX_GC_TIMES 100 95 93 #define MIN_GC_NODES 100 96 94 #define GC_SLEEP_MS 100

+46 -9

drivers/md/bcache/super.c

··· 1733 1733 mutex_unlock(&b->write_lock); 1734 1734 } 1735 1735 1736 - if (ca->alloc_thread) 1736 + /* 1737 + * If the register_cache_set() call to bch_cache_set_alloc() failed, 1738 + * ca has not been assigned a value and return error. 1739 + * So we need check ca is not NULL during bch_cache_set_unregister(). 1740 + */ 1741 + if (ca && ca->alloc_thread) 1737 1742 kthread_stop(ca->alloc_thread); 1738 1743 1739 1744 if (c->journal.cur) { ··· 2238 2233 bio_init(&ca->journal.bio, NULL, ca->journal.bio.bi_inline_vecs, 8, 0); 2239 2234 2240 2235 /* 2241 - * when ca->sb.njournal_buckets is not zero, journal exists, 2242 - * and in bch_journal_replay(), tree node may split, 2243 - * so bucket of RESERVE_BTREE type is needed, 2244 - * the worst situation is all journal buckets are valid journal, 2245 - * and all the keys need to replay, 2246 - * so the number of RESERVE_BTREE type buckets should be as much 2247 - * as journal buckets 2236 + * When the cache disk is first registered, ca->sb.njournal_buckets 2237 + * is zero, and it is assigned in run_cache_set(). 2238 + * 2239 + * When ca->sb.njournal_buckets is not zero, journal exists, 2240 + * and in bch_journal_replay(), tree node may split. 2241 + * The worst situation is all journal buckets are valid journal, 2242 + * and all the keys need to replay, so the number of RESERVE_BTREE 2243 + * type buckets should be as much as journal buckets. 2244 + * 2245 + * If the number of RESERVE_BTREE type buckets is too few, the 2246 + * bch_allocator_thread() may hang up and unable to allocate 2247 + * bucket. The situation is roughly as follows: 2248 + * 2249 + * 1. In bch_data_insert_keys(), if the operation is not op->replace, 2250 + * it will call the bch_journal(), which increments the journal_ref 2251 + * counter. This counter is only decremented after bch_btree_insert 2252 + * completes. 2253 + * 2254 + * 2. 
When calling bch_btree_insert, if the btree needs to split, 2255 + * it will call btree_split() and btree_check_reserve() to check 2256 + * whether there are enough reserved buckets in the RESERVE_BTREE 2257 + * slot. If not enough, bcache_btree_root() will repeatedly retry. 2258 + * 2259 + * 3. Normally, the bch_allocator_thread is responsible for filling 2260 + * the reservation slots from the free_inc bucket list. When the 2261 + * free_inc bucket list is exhausted, the bch_allocator_thread 2262 + * will call invalidate_buckets() until free_inc is refilled. 2263 + * Then bch_allocator_thread calls bch_prio_write() once. and 2264 + * bch_prio_write() will call bch_journal_meta() and waits for 2265 + * the journal write to complete. 2266 + * 2267 + * 4. During journal_write, journal_write_unlocked() is be called. 2268 + * If journal full occurs, journal_reclaim() and btree_flush_write() 2269 + * will be called sequentially, then retry journal_write. 2270 + * 2271 + * 5. When 2 and 4 occur together, IO will hung up and cannot recover. 2272 + * 2273 + * Therefore, reserve more RESERVE_BTREE type buckets. 2248 2274 */ 2249 - btree_buckets = ca->sb.njournal_buckets ?: 8; 2275 + btree_buckets = clamp_t(size_t, ca->sb.nbuckets >> 7, 2276 + 32, SB_JOURNAL_BUCKETS); 2250 2277 free = roundup_pow_of_two(ca->sb.nbuckets) >> 10; 2251 2278 if (!free) { 2252 2279 ret = -EPERM;
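
Note on the new `btree_buckets` expression above: it scales the reserve with the cache size (one bucket per 128) and bounds it to the range from 32 to `SB_JOURNAL_BUCKETS`. A small user-space sketch of the arithmetic, assuming `SB_JOURNAL_BUCKETS` is 256 (check the bcache headers before relying on this). `clamp_size` and `model_btree_buckets` are hypothetical helpers, with `clamp_size` standing in for `clamp_t(size_t, ...)`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value of SB_JOURNAL_BUCKETS; verify against the bcache headers. */
#define MODEL_SB_JOURNAL_BUCKETS 256u

/* User-space stand-in for clamp_t(size_t, val, lo, hi). */
static size_t clamp_size(size_t val, size_t lo, size_t hi)
{
	assert(lo <= hi);
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/* Mirrors the new expression: nbuckets / 128, kept within [32, 256]. */
static size_t model_btree_buckets(size_t nbuckets)
{
	return clamp_size(nbuckets >> 7, 32, MODEL_SB_JOURNAL_BUCKETS);
}
```

Under that assumption a cache with 1024 buckets reserves 32, one with 16384 buckets reserves 128, and anything from 32768 buckets upward reserves 256. The removed expression fell back to 8 whenever `njournal_buckets` was still zero.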

+1 -5

drivers/md/dm-raid.c

··· 1356 1356 return -EINVAL; 1357 1357 } 1358 1358 1359 - /* 1360 - * In device-mapper, we specify things in sectors, but 1361 - * MD records this value in kB 1362 - */ 1363 - if (value < 0 || value / 2 > COUNTER_MAX) { 1359 + if (value < 0) { 1364 1360 rs->ti->error = "Max write-behind limit out of range"; 1365 1361 return -EINVAL; 1366 1362 }

+22 -13

drivers/md/md-bitmap.c

··· 105 105 * 106 106 */ 107 107 108 + typedef __u16 bitmap_counter_t; 109 + 108 110 #define PAGE_BITS (PAGE_SIZE << 3) 109 111 #define PAGE_BIT_SHIFT (PAGE_SHIFT + 3) 112 + 113 + #define COUNTER_BITS 16 114 + #define COUNTER_BIT_SHIFT 4 115 + #define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3) 116 + 117 + #define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1))) 118 + #define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2))) 119 + #define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1) 110 120 111 121 #define NEEDED(x) (((bitmap_counter_t) x) & NEEDED_MASK) 112 122 #define RESYNC(x) (((bitmap_counter_t) x) & RESYNC_MASK) ··· 799 789 * is a good choice? We choose COUNTER_MAX / 2 arbitrarily. 800 790 */ 801 791 write_behind = bitmap->mddev->bitmap_info.max_write_behind; 802 - if (write_behind > COUNTER_MAX) 792 + if (write_behind > COUNTER_MAX / 2) 803 793 write_behind = COUNTER_MAX / 2; 804 794 sb->write_behind = cpu_to_le32(write_behind); 805 795 bitmap->mddev->bitmap_info.max_write_behind = write_behind; ··· 1682 1672 &(bitmap->bp[page].map[pageoff]); 1683 1673 } 1684 1674 1685 - static int bitmap_startwrite(struct mddev *mddev, sector_t offset, 1686 - unsigned long sectors) 1675 + static void bitmap_start_write(struct mddev *mddev, sector_t offset, 1676 + unsigned long sectors) 1687 1677 { 1688 1678 struct bitmap *bitmap = mddev->bitmap; 1689 1679 1690 1680 if (!bitmap) 1691 - return 0; 1681 + return; 1692 1682 1693 1683 while (sectors) { 1694 1684 sector_t blocks; ··· 1698 1688 bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 1); 1699 1689 if (!bmc) { 1700 1690 spin_unlock_irq(&bitmap->counts.lock); 1701 - return 0; 1691 + return; 1702 1692 } 1703 1693 1704 1694 if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) { ··· 1734 1724 else 1735 1725 sectors = 0; 1736 1726 } 1737 - return 0; 1738 1727 } 1739 1728 1740 - static void bitmap_endwrite(struct mddev *mddev, sector_t offset, 1741 - unsigned long sectors) 1729 + static void 
bitmap_end_write(struct mddev *mddev, sector_t offset, 1730 + unsigned long sectors) 1742 1731 { 1743 1732 struct bitmap *bitmap = mddev->bitmap; 1744 1733 ··· 2214 2205 return ERR_PTR(err); 2215 2206 } 2216 2207 2217 - static int bitmap_create(struct mddev *mddev, int slot) 2208 + static int bitmap_create(struct mddev *mddev) 2218 2209 { 2219 - struct bitmap *bitmap = __bitmap_create(mddev, slot); 2210 + struct bitmap *bitmap = __bitmap_create(mddev, -1); 2220 2211 2221 2212 if (IS_ERR(bitmap)) 2222 2213 return PTR_ERR(bitmap); ··· 2679 2670 } 2680 2671 2681 2672 mddev->bitmap_info.offset = offset; 2682 - rv = bitmap_create(mddev, -1); 2673 + rv = bitmap_create(mddev); 2683 2674 if (rv) 2684 2675 goto out; 2685 2676 ··· 3012 3003 .end_behind_write = bitmap_end_behind_write, 3013 3004 .wait_behind_writes = bitmap_wait_behind_writes, 3014 3005 3015 - .startwrite = bitmap_startwrite, 3016 - .endwrite = bitmap_endwrite, 3006 + .start_write = bitmap_start_write, 3007 + .end_write = bitmap_end_write, 3017 3008 .start_sync = bitmap_start_sync, 3018 3009 .end_sync = bitmap_end_sync, 3019 3010 .cond_end_sync = bitmap_cond_end_sync,
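
Note on the `write_behind` check changed above: with 16-bit counters, `COUNTER_MAX` evaluates to 16383 and `COUNTER_MAX / 2` to 8191, so the old comparison let values between 8192 and 16383 through unclamped. The sketch below copies the macros from the hunk and compares both checks in user space. `uint16_t` replaces `__u16`, and `clamp_old` and `clamp_new` are hypothetical names for the two variants.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t bitmap_counter_t;	/* stands in for __u16 */

/* Macros as they appear in the hunk above. */
#define COUNTER_BITS 16
#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)

/* Old check: only values above COUNTER_MAX were clamped. */
static unsigned long clamp_old(unsigned long write_behind)
{
	if (write_behind > COUNTER_MAX)
		write_behind = COUNTER_MAX / 2;
	return write_behind;
}

/* New check: anything above COUNTER_MAX / 2 is clamped. */
static unsigned long clamp_new(unsigned long write_behind)
{
	if (write_behind > COUNTER_MAX / 2)
		write_behind = COUNTER_MAX / 2;
	assert(write_behind <= COUNTER_MAX / 2);
	return write_behind;
}
```

For an input of 10000 the old variant returns 10000 while the new one returns 8191. Both agree for small values and for values above `COUNTER_MAX`.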

+4 -13

drivers/md/md-bitmap.h

··· 9 9 10 10 #define BITMAP_MAGIC 0x6d746962 11 11 12 - typedef __u16 bitmap_counter_t; 13 - #define COUNTER_BITS 16 14 - #define COUNTER_BIT_SHIFT 4 15 - #define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3) 16 - 17 - #define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1))) 18 - #define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2))) 19 - #define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1) 20 - 21 12 /* use these for bitmap->flags and bitmap->sb->state bit-fields */ 22 13 enum bitmap_state { 23 14 BITMAP_STALE = 1, /* the bitmap file is out of date or had -EIO */ ··· 63 72 64 73 struct bitmap_operations { 65 74 bool (*enabled)(struct mddev *mddev); 66 - int (*create)(struct mddev *mddev, int slot); 75 + int (*create)(struct mddev *mddev); 67 76 int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize, 68 77 bool init); 69 78 ··· 80 89 void (*end_behind_write)(struct mddev *mddev); 81 90 void (*wait_behind_writes)(struct mddev *mddev); 82 91 83 - int (*startwrite)(struct mddev *mddev, sector_t offset, 92 + void (*start_write)(struct mddev *mddev, sector_t offset, 93 + unsigned long sectors); 94 + void (*end_write)(struct mddev *mddev, sector_t offset, 84 95 unsigned long sectors); 85 - void (*endwrite)(struct mddev *mddev, sector_t offset, 86 - unsigned long sectors); 87 96 bool (*start_sync)(struct mddev *mddev, sector_t offset, 88 97 sector_t *blocks, bool degraded); 89 98 void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);

+7 -7

drivers/md/md.c

··· 6225 6225 } 6226 6226 if (err == 0 && pers->sync_request && 6227 6227 (mddev->bitmap_info.file || mddev->bitmap_info.offset)) { 6228 - err = mddev->bitmap_ops->create(mddev, -1); 6228 + err = mddev->bitmap_ops->create(mddev); 6229 6229 if (err) 6230 6230 pr_warn("%s: failed to create bitmap (%d)\n", 6231 6231 mdname(mddev), err); ··· 7285 7285 err = 0; 7286 7286 if (mddev->pers) { 7287 7287 if (fd >= 0) { 7288 - err = mddev->bitmap_ops->create(mddev, -1); 7288 + err = mddev->bitmap_ops->create(mddev); 7289 7289 if (!err) 7290 7290 err = mddev->bitmap_ops->load(mddev); 7291 7291 ··· 7601 7601 mddev->bitmap_info.default_offset; 7602 7602 mddev->bitmap_info.space = 7603 7603 mddev->bitmap_info.default_space; 7604 - rv = mddev->bitmap_ops->create(mddev, -1); 7604 + rv = mddev->bitmap_ops->create(mddev); 7605 7605 if (!rv) 7606 7606 rv = mddev->bitmap_ops->load(mddev); 7607 7607 ··· 8799 8799 mddev->pers->bitmap_sector(mddev, &md_io_clone->offset, 8800 8800 &md_io_clone->sectors); 8801 8801 8802 - mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset, 8803 - md_io_clone->sectors); 8802 + mddev->bitmap_ops->start_write(mddev, md_io_clone->offset, 8803 + md_io_clone->sectors); 8804 8804 } 8805 8805 8806 8806 static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone) 8807 8807 { 8808 - mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset, 8809 - md_io_clone->sectors); 8808 + mddev->bitmap_ops->end_write(mddev, md_io_clone->offset, 8809 + md_io_clone->sectors); 8810 8810 } 8811 8811 8812 8812 static void md_end_clone_io(struct bio *bio)

+10

drivers/md/raid1-10.c

··· 293 293 294 294 return false; 295 295 } 296 + 297 + /* 298 + * bio with REQ_RAHEAD or REQ_NOWAIT can fail at anytime, before such IO is 299 + * submitted to the underlying disks, hence don't record badblocks or retry 300 + * in this case. 301 + */ 302 + static inline bool raid1_should_handle_error(struct bio *bio) 303 + { 304 + return !(bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT)); 305 + }
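
Note on the helper added above: the raid1 and raid10 completion paths combine it with the existing discard check to decide whether a failed bio should count against the device. A minimal model of that decision, assuming made-up flag bit values (the real `REQ_*` bits are defined by the block layer) and a plain integer status where 0 means success. All `MODEL_*` and `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real REQ_* values are defined by the block layer. */
#define MODEL_REQ_RAHEAD (1u << 0)
#define MODEL_REQ_NOWAIT (1u << 1)
#define MODEL_REQ_SYNC   (1u << 2)

/* Same predicate as the new helper, applied to a bare flags word. */
static bool model_should_handle_error(unsigned int opf)
{
	return !(opf & (MODEL_REQ_RAHEAD | MODEL_REQ_NOWAIT));
}

/* Model of the write-completion decision: record the error or ignore it. */
static bool model_record_write_error(unsigned int opf, int status,
				     bool is_discard)
{
	bool ignore_error;

	assert(status >= 0);	/* model: 0 is success, > 0 is an error */
	ignore_error = !model_should_handle_error(opf) ||
		       (status && is_discard);
	return status && !ignore_error;
}
```

In this model a failed read-ahead or no-wait bio never marks the device, a failed discard is ignored as before, and any other failed write is recorded.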

+10 -9

drivers/md/raid1.c

··· 373 373 */ 374 374 update_head_pos(r1_bio->read_disk, r1_bio); 375 375 376 - if (uptodate) 376 + if (uptodate) { 377 377 set_bit(R1BIO_Uptodate, &r1_bio->state); 378 - else if (test_bit(FailFast, &rdev->flags) && 379 - test_bit(R1BIO_FailFast, &r1_bio->state)) 378 + } else if (test_bit(FailFast, &rdev->flags) && 379 + test_bit(R1BIO_FailFast, &r1_bio->state)) { 380 380 /* This was a fail-fast read so we definitely 381 381 * want to retry */ 382 382 ; 383 - else { 383 + } else if (!raid1_should_handle_error(bio)) { 384 + uptodate = 1; 385 + } else { 384 386 /* If all other devices have failed, we want to return 385 387 * the error upwards rather than fail the last device. 386 388 * Here we redefine "uptodate" to mean "Don't want to retry" ··· 453 451 struct bio *to_put = NULL; 454 452 int mirror = find_bio_disk(r1_bio, bio); 455 453 struct md_rdev *rdev = conf->mirrors[mirror].rdev; 456 - bool discard_error; 457 454 sector_t lo = r1_bio->sector; 458 455 sector_t hi = r1_bio->sector + r1_bio->sectors; 459 - 460 - discard_error = bio->bi_status && bio_op(bio) == REQ_OP_DISCARD; 456 + bool ignore_error = !raid1_should_handle_error(bio) || 457 + (bio->bi_status && bio_op(bio) == REQ_OP_DISCARD); 461 458 462 459 /* 463 460 * 'one mirror IO has finished' event handler: 464 461 */ 465 - if (bio->bi_status && !discard_error) { 462 + if (bio->bi_status && !ignore_error) { 466 463 set_bit(WriteErrorSeen, &rdev->flags); 467 464 if (!test_and_set_bit(WantReplacement, &rdev->flags)) 468 465 set_bit(MD_RECOVERY_NEEDED, & ··· 512 511 513 512 /* Maybe we can clear some bad blocks. */ 514 513 if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) && 515 - !discard_error) { 514 + !ignore_error) { 516 515 r1_bio->bios[mirror] = IO_MADE_GOOD; 517 516 set_bit(R1BIO_MadeGood, &r1_bio->state); 518 517 }

+6 -5

drivers/md/raid10.c

··· 399 399 * wait for the 'master' bio. 400 400 */ 401 401 set_bit(R10BIO_Uptodate, &r10_bio->state); 402 + } else if (!raid1_should_handle_error(bio)) { 403 + uptodate = 1; 402 404 } else { 403 405 /* If all other devices that store this block have 404 406 * failed, we want to return the error upwards rather ··· 458 456 int slot, repl; 459 457 struct md_rdev *rdev = NULL; 460 458 struct bio *to_put = NULL; 461 - bool discard_error; 462 - 463 - discard_error = bio->bi_status && bio_op(bio) == REQ_OP_DISCARD; 459 + bool ignore_error = !raid1_should_handle_error(bio) || 460 + (bio->bi_status && bio_op(bio) == REQ_OP_DISCARD); 464 461 465 462 dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl); 466 463 ··· 473 472 /* 474 473 * this branch is our 'one mirror IO has finished' event handler: 475 474 */ 476 - if (bio->bi_status && !discard_error) { 475 + if (bio->bi_status && !ignore_error) { 477 476 if (repl) 478 477 /* Never record new bad blocks to replacement, 479 478 * just fail it. ··· 528 527 /* Maybe we can clear some bad blocks. */ 529 528 if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr, 530 529 r10_bio->sectors) && 531 - !discard_error) { 530 + !ignore_error) { 532 531 bio_put(bio); 533 532 if (repl) 534 533 r10_bio->devs[slot].repl_bio = IO_MADE_GOOD;

+3 -3

drivers/nvme/common/auth.c

··· 471 471 * @c1: Value of challenge C1 472 472 * @c2: Value of challenge C2 473 473 * @hash_len: Hash length of the hash algorithm 474 - * @ret_psk: Pointer too the resulting generated PSK 474 + * @ret_psk: Pointer to the resulting generated PSK 475 475 * @ret_len: length of @ret_psk 476 476 * 477 477 * Generate a PSK for TLS as specified in NVMe base specification, section ··· 759 759 goto out_free_prk; 760 760 761 761 /* 762 - * 2 addtional bytes for the length field from HDKF-Expand-Label, 763 - * 2 addtional bytes for the HMAC ID, and one byte for the space 762 + * 2 additional bytes for the length field from HDKF-Expand-Label, 763 + * 2 additional bytes for the HMAC ID, and one byte for the space 764 764 * separator. 765 765 */ 766 766 info_len = strlen(psk_digest) + strlen(psk_prefix) + 5;

+1 -1

drivers/nvme/host/Kconfig

··· 106 106 help 107 107 Enables TLS encryption for NVMe TCP using the netlink handshake API. 108 108 109 - The TLS handshake daemon is availble at 109 + The TLS handshake daemon is available at 110 110 https://github.com/oracle/ktls-utils. 111 111 112 112 If unsure, say N.

+1 -1

drivers/nvme/host/constants.c

··· 145 145 [NVME_SC_BAD_ATTRIBUTES] = "Conflicting Attributes", 146 146 [NVME_SC_INVALID_PI] = "Invalid Protection Information", 147 147 [NVME_SC_READ_ONLY] = "Attempted Write to Read Only Range", 148 - [NVME_SC_ONCS_NOT_SUPPORTED] = "ONCS Not Supported", 148 + [NVME_SC_CMD_SIZE_LIM_EXCEEDED ] = "Command Size Limits Exceeded", 149 149 [NVME_SC_ZONE_BOUNDARY_ERROR] = "Zoned Boundary Error", 150 150 [NVME_SC_ZONE_FULL] = "Zone Is Full", 151 151 [NVME_SC_ZONE_READ_ONLY] = "Zone Is Read Only",

+1 -2

drivers/nvme/host/core.c

··· 290 290 case NVME_SC_NS_NOT_READY: 291 291 return BLK_STS_TARGET; 292 292 case NVME_SC_BAD_ATTRIBUTES: 293 - case NVME_SC_ONCS_NOT_SUPPORTED: 294 293 case NVME_SC_INVALID_OPCODE: 295 294 case NVME_SC_INVALID_FIELD: 296 295 case NVME_SC_INVALID_NS: ··· 1026 1027 1027 1028 if (ns->head->ms) { 1028 1029 /* 1029 - * If formated with metadata, the block layer always provides a 1030 + * If formatted with metadata, the block layer always provides a 1030 1031 * metadata buffer if CONFIG_BLK_DEV_INTEGRITY is enabled. Else 1031 1032 * we enable the PRACT bit for protection information or set the 1032 1033 * namespace capacity to zero to prevent any I/O.

+1 -1

drivers/nvme/host/fabrics.c

··· 582 582 * Do not retry when: 583 583 * 584 584 * - the DNR bit is set and the specification states no further connect 585 - * attempts with the same set of paramenters should be attempted. 585 + * attempts with the same set of parameters should be attempted. 586 586 * 587 587 * - when the authentication attempt fails, because the key was invalid. 588 588 * This error code is set on the host side.

+3 -3

drivers/nvme/host/fabrics.h

··· 80 80 * @transport: Holds the fabric transport "technology name" (for a lack of 81 81 * better description) that will be used by an NVMe controller 82 82 * being added. 83 - * @subsysnqn: Hold the fully qualified NQN subystem name (format defined 83 + * @subsysnqn: Hold the fully qualified NQN subsystem name (format defined 84 84 * in the NVMe specification, "NVMe Qualified Names"). 85 85 * @traddr: The transport-specific TRADDR field for a port on the 86 86 * subsystem which is adding a controller. ··· 156 156 * @create_ctrl(): function pointer that points to a non-NVMe 157 157 * implementation-specific fabric technology 158 158 * that would go into starting up that fabric 159 - * for the purpose of conneciton to an NVMe controller 159 + * for the purpose of connection to an NVMe controller 160 160 * using that fabric technology. 161 161 * 162 162 * Notes: ··· 165 165 * 2. create_ctrl() must be defined (even if it does nothing) 166 166 * 3. struct nvmf_transport_ops must be statically allocated in the 167 167 * modules .bss section so that a pure module_get on @module 168 - * prevents the memory from beeing freed. 168 + * prevents the memory from being freed. 169 169 */ 170 170 struct nvmf_transport_ops { 171 171 struct list_head entry;

+2 -2

drivers/nvme/host/fc.c

··· 1955 1955 } 1956 1956 1957 1957 /* 1958 - * For the linux implementation, if we have an unsuccesful 1958 + * For the linux implementation, if we have an unsucceesful 1959 1959 * status, they blk-mq layer can typically be called with the 1960 1960 * non-zero status and the content of the cqe isn't important. 1961 1961 */ ··· 2479 2479 * writing the registers for shutdown and polling (call 2480 2480 * nvme_disable_ctrl()). Given a bunch of i/o was potentially 2481 2481 * just aborted and we will wait on those contexts, and given 2482 - * there was no indication of how live the controlelr is on the 2482 + * there was no indication of how live the controller is on the 2483 2483 * link, don't send more io to create more contexts for the 2484 2484 * shutdown. Let the controller fail via keepalive failure if 2485 2485 * its still present.

+10 -8

drivers/nvme/host/ioctl.c

··· 493 493 d.timeout_ms = READ_ONCE(cmd->timeout_ms); 494 494 495 495 if (d.data_len && (ioucmd->flags & IORING_URING_CMD_FIXED)) { 496 - /* fixedbufs is only for non-vectored io */ 497 - if (vec) 498 - return -EINVAL; 496 + int ddir = nvme_is_write(&c) ? WRITE : READ; 499 497 500 - ret = io_uring_cmd_import_fixed(d.addr, d.data_len, 501 - nvme_is_write(&c) ? WRITE : READ, &iter, ioucmd, 502 - issue_flags); 498 + if (vec) 499 + ret = io_uring_cmd_import_fixed_vec(ioucmd, 500 + u64_to_user_ptr(d.addr), d.data_len, 501 + ddir, &iter, issue_flags); 502 + else 503 + ret = io_uring_cmd_import_fixed(d.addr, d.data_len, 504 + ddir, &iter, ioucmd, issue_flags); 503 505 if (ret < 0) 504 506 return ret; 505 507 ··· 523 521 if (d.data_len) { 524 522 ret = nvme_map_user_request(req, d.addr, d.data_len, 525 523 nvme_to_user_ptr(d.metadata), d.metadata_len, 526 - map_iter, vec); 524 + map_iter, vec ? NVME_IOCTL_VEC : 0); 527 525 if (ret) 528 526 goto out_free_req; 529 527 } ··· 729 727 730 728 /* 731 729 * Handle ioctls that apply to the controller instead of the namespace 732 - * seperately and drop the ns SRCU reference early. This avoids a 730 + * separately and drop the ns SRCU reference early. This avoids a 733 731 * deadlock when deleting namespaces using the passthrough interface. 734 732 */ 735 733 if (is_ctrl_ioctl(cmd))

+1 -1

drivers/nvme/host/multipath.c

··· 760 760 * controller's scan_work context. If a path error occurs here, the IO 761 761 * will wait until a path becomes available or all paths are torn down, 762 762 * but that action also occurs within scan_work, so it would deadlock. 763 - * Defer the partion scan to a different context that does not block 763 + * Defer the partition scan to a different context that does not block 764 764 * scan_work. 765 765 */ 766 766 set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);

+1 -1

drivers/nvme/host/nvme.h

··· 523 523 enum nvme_ns_features { 524 524 NVME_NS_EXT_LBAS = 1 << 0, /* support extended LBA format */ 525 525 NVME_NS_METADATA_SUPPORTED = 1 << 1, /* support getting generated md */ 526 - NVME_NS_DEAC = 1 << 2, /* DEAC bit in Write Zeores supported */ 526 + NVME_NS_DEAC = 1 << 2, /* DEAC bit in Write Zeroes supported */ 527 527 }; 528 528 529 529 struct nvme_ns {

+2 -2

drivers/nvme/host/pci.c

··· 3015 3015 goto out; 3016 3016 3017 3017 /* 3018 - * Freeze and update the number of I/O queues as thos might have 3018 + * Freeze and update the number of I/O queues as those might have 3019 3019 * changed. If there are no I/O queues left after this reset, keep the 3020 3020 * controller around but remove all namespaces. 3021 3021 */ ··· 3186 3186 /* 3187 3187 * Exclude some Kingston NV1 and A2000 devices from 3188 3188 * NVME_QUIRK_SIMPLE_SUSPEND. Do a full suspend to save a 3189 - * lot fo energy with s2idle sleep on some TUXEDO platforms. 3189 + * lot of energy with s2idle sleep on some TUXEDO platforms. 3190 3190 */ 3191 3191 if (dmi_match(DMI_BOARD_NAME, "NS5X_NS7XAU") || 3192 3192 dmi_match(DMI_BOARD_NAME, "NS5x_7xAU") ||

-2

drivers/nvme/host/pr.c

··· 82 82 return PR_STS_SUCCESS; 83 83 case NVME_SC_RESERVATION_CONFLICT: 84 84 return PR_STS_RESERVATION_CONFLICT; 85 - case NVME_SC_ONCS_NOT_SUPPORTED: 86 - return -EOPNOTSUPP; 87 85 case NVME_SC_BAD_ATTRIBUTES: 88 86 case NVME_SC_INVALID_OPCODE: 89 87 case NVME_SC_INVALID_FIELD:

+2 -2

drivers/nvme/host/rdma.c

··· 221 221 222 222 /* 223 223 * Bind the CQEs (post recv buffers) DMA mapping to the RDMA queue 224 - * lifetime. It's safe, since any chage in the underlying RDMA device 224 + * lifetime. It's safe, since any change in the underlying RDMA device 225 225 * will issue error recovery and queue re-creation. 226 226 */ 227 227 for (i = 0; i < ib_queue_size; i++) { ··· 800 800 801 801 /* 802 802 * Bind the async event SQE DMA mapping to the admin queue lifetime. 803 - * It's safe, since any chage in the underlying RDMA device will issue 803 + * It's safe, since any change in the underlying RDMA device will issue 804 804 * error recovery and queue re-creation. 805 805 */ 806 806 error = nvme_rdma_alloc_qe(ctrl->device->dev, &ctrl->async_event_sqe,

+21 -3

drivers/nvme/host/tcp.c

··· 452 452 return NULL; 453 453 } 454 454 455 - list_del(&req->entry); 455 + list_del_init(&req->entry); 456 + init_llist_node(&req->lentry); 456 457 return req; 457 458 } 458 459 ··· 566 565 req->queue = queue; 567 566 nvme_req(rq)->ctrl = &ctrl->ctrl; 568 567 nvme_req(rq)->cmd = &pdu->cmd; 568 + init_llist_node(&req->lentry); 569 + INIT_LIST_HEAD(&req->entry); 569 570 570 571 return 0; 571 572 } ··· 769 766 dev_err(queue->ctrl->ctrl.device, 770 767 "req %d unexpected r2t offset %u (expected %zu)\n", 771 768 rq->tag, r2t_offset, req->data_sent); 769 + return -EPROTO; 770 + } 771 + 772 + if (llist_on_list(&req->lentry) || 773 + !list_empty(&req->entry)) { 774 + dev_err(queue->ctrl->ctrl.device, 775 + "req %d unexpected r2t while processing request\n", 776 + rq->tag); 772 777 return -EPROTO; 773 778 } 774 779 ··· 1366 1355 queue->nr_cqe = 0; 1367 1356 consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb); 1368 1357 release_sock(sk); 1369 - return consumed; 1358 + return consumed == -EAGAIN ? 0 : consumed; 1370 1359 } 1371 1360 1372 1361 static void nvme_tcp_io_work(struct work_struct *w) ··· 1393 1382 pending = true; 1394 1383 else if (unlikely(result < 0)) 1395 1384 return; 1385 + 1386 + /* did we get some space after spending time in recv? 
*/ 1387 + if (nvme_tcp_queue_has_pending(queue) && 1388 + sk_stream_is_writeable(queue->sock->sk)) 1389 + pending = true; 1396 1390 1397 1391 if (!pending || !queue->rd_enabled) 1398 1392 return; ··· 2366 2350 nvme_tcp_teardown_admin_queue(ctrl, false); 2367 2351 ret = nvme_tcp_configure_admin_queue(ctrl, false); 2368 2352 if (ret) 2369 - return ret; 2353 + goto destroy_admin; 2370 2354 } 2371 2355 2372 2356 if (ctrl->icdoff) { ··· 2610 2594 ctrl->async_req.offset = 0; 2611 2595 ctrl->async_req.curr_bio = NULL; 2612 2596 ctrl->async_req.data_len = 0; 2597 + init_llist_node(&ctrl->async_req.lentry); 2598 + INIT_LIST_HEAD(&ctrl->async_req.entry); 2613 2599 2614 2600 nvme_tcp_queue_request(&ctrl->async_req, true); 2615 2601 }
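The receive-path change above maps `-EAGAIN` to `0` before the caller tests for a negative result, so a socket that simply has nothing to read is no longer treated as a failure. A minimal userspace sketch of that pattern, assuming hypothetical helper names (`normalize_recv_result`, `recv_is_fatal`) that do not exist in the driver:

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of a receive step: the input is either a byte
 * count (>= 0) or a negative errno value.
 */
static inline int normalize_recv_result(int consumed)
{
	/* -EAGAIN only means "no data right now", so report 0 bytes. */
	return consumed == -EAGAIN ? 0 : consumed;
}

/* Nonzero only for errors that should stop the worker loop. */
static inline int recv_is_fatal(int consumed)
{
	return normalize_recv_result(consumed) < 0;
}
```

With this shape, a caller that bails out on `result < 0` keeps running after an empty read and can still go on to check whether the send side has become writable.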

+1 -1

drivers/nvme/target/admin-cmd.c

··· 1165 1165 * A "minimum viable" abort implementation: the command is mandatory in the 1166 1166 * spec, but we are not required to do any useful work. We couldn't really 1167 1167 * do a useful abort, so don't bother even with waiting for the command 1168 - * to be exectuted and return immediately telling the command to abort 1168 + * to be executed and return immediately telling the command to abort 1169 1169 * wasn't found. 1170 1170 */ 1171 1171 static void nvmet_execute_abort(struct nvmet_req *req)

+2 -9

drivers/nvme/target/core.c

··· 62 62 return NVME_SC_LBA_RANGE | NVME_STATUS_DNR; 63 63 case -EOPNOTSUPP: 64 64 req->error_loc = offsetof(struct nvme_common_command, opcode); 65 - switch (req->cmd->common.opcode) { 66 - case nvme_cmd_dsm: 67 - case nvme_cmd_write_zeroes: 68 - return NVME_SC_ONCS_NOT_SUPPORTED | NVME_STATUS_DNR; 69 - default: 70 - return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 71 - } 72 - break; 65 + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 73 66 case -ENODATA: 74 67 req->error_loc = offsetof(struct nvme_rw_command, nsid); 75 68 return NVME_SC_ACCESS_DENIED; ··· 644 651 * Now that we removed the namespaces from the lookup list, we 645 652 * can kill the per_cpu ref and wait for any remaining references 646 653 * to be dropped, as well as a RCU grace period for anyone only 647 - * using the namepace under rcu_read_lock(). Note that we can't 654 + * using the namespace under rcu_read_lock(). Note that we can't 648 655 * use call_rcu here as we need to ensure the namespaces have 649 656 * been fully destroyed before unloading the module. 650 657 */

+1 -1

drivers/nvme/target/fc.c

··· 1339 1339 /** 1340 1340 * nvmet_fc_register_targetport - transport entry point called by an 1341 1341 * LLDD to register the existence of a local 1342 - * NVME subystem FC port. 1342 + * NVME subsystem FC port. 1343 1343 * @pinfo: pointer to information about the port to be registered 1344 1344 * @template: LLDD entrypoints and operational parameters for the port 1345 1345 * @dev: physical hardware device node port corresponds to. Will be

+2 -9

drivers/nvme/target/io-cmd-bdev.c

··· 133 133 * Right now there exists M : 1 mapping between block layer error 134 134 * to the NVMe status code (see nvme_error_status()). For consistency, 135 135 * when we reverse map we use most appropriate NVMe Status code from 136 - * the group of the NVMe staus codes used in the nvme_error_status(). 136 + * the group of the NVMe status codes used in the nvme_error_status(). 137 137 */ 138 138 switch (blk_sts) { 139 139 case BLK_STS_NOSPC: ··· 145 145 req->error_loc = offsetof(struct nvme_rw_command, slba); 146 146 break; 147 147 case BLK_STS_NOTSUPP: 148 + status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 148 149 req->error_loc = offsetof(struct nvme_common_command, opcode); 149 - switch (req->cmd->common.opcode) { 150 - case nvme_cmd_dsm: 151 - case nvme_cmd_write_zeroes: 152 - status = NVME_SC_ONCS_NOT_SUPPORTED | NVME_STATUS_DNR; 153 - break; 154 - default: 155 - status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; 156 - } 157 150 break; 158 151 case BLK_STS_MEDIUM: 159 152 status = NVME_SC_ACCESS_DENIED;

+1 -1

drivers/nvme/target/passthru.c

··· 99 99 100 100 /* 101 101 * The passthru NVMe driver may have a limit on the number of segments 102 - * which depends on the host's memory fragementation. To solve this, 102 + * which depends on the host's memory fragmentation. To solve this, 103 103 * ensure mdts is limited to the pages equal to the number of segments. 104 104 */ 105 105 max_hw_sectors = min_not_zero(pctrl->max_segments << PAGE_SECTORS_SHIFT,

+1 -1

include/linux/nvme.h

··· 2171 2171 NVME_SC_BAD_ATTRIBUTES = 0x180, 2172 2172 NVME_SC_INVALID_PI = 0x181, 2173 2173 NVME_SC_READ_ONLY = 0x182, 2174 - NVME_SC_ONCS_NOT_SUPPORTED = 0x183, 2174 + NVME_SC_CMD_SIZE_LIM_EXCEEDED = 0x183, 2175 2175 2176 2176 /* 2177 2177 * I/O Command Set Specific - Fabrics commands:

+9

include/uapi/linux/ublk_cmd.h

··· 272 272 */ 273 273 #define UBLK_F_QUIESCE (1ULL << 12) 274 274 275 + /* 276 + * If this feature is set, ublk_drv supports each (qid,tag) pair having 277 + * its own independent daemon task that is responsible for handling it. 278 + * If it is not set, daemons are per-queue instead, so for two pairs 279 + * (qid1,tag1) and (qid2,tag2), if qid1 == qid2, then the same task must 280 + * be responsible for handling (qid1,tag1) and (qid2,tag2). 281 + */ 282 + #define UBLK_F_PER_IO_DAEMON (1ULL << 13) 283 + 275 284 /* device state */ 276 285 #define UBLK_S_DEV_DEAD 0 277 286 #define UBLK_S_DEV_LIVE 1
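The comment above says each (qid,tag) pair may have its own daemon task, and the kublk selftest in this merge deals the pairs out to threads round-robin. A minimal sketch of that assignment, assuming hypothetical helper names (`flat_index`, `daemon_for`, `max_ios_per_thread`) that are not part of the UAPI header or the selftest:

```c
#include <assert.h>

/*
 * Flatten a (q_id, tag) pair so that q_id is the most significant
 * component, i.e. (1,0) sorts after (0,1).
 */
static inline unsigned flat_index(unsigned q_id, unsigned tag,
				  unsigned queue_depth)
{
	assert(tag < queue_depth);
	return q_id * queue_depth + tag;
}

/* Thread index that would own the pair under round-robin assignment. */
static inline unsigned daemon_for(unsigned q_id, unsigned tag,
				  unsigned queue_depth, unsigned nthreads)
{
	assert(nthreads > 0);
	return flat_index(q_id, tag, queue_depth) % nthreads;
}

/* Upper bound on pairs owned by any one thread: ceil(nr_ios / nthreads). */
static inline unsigned max_ios_per_thread(unsigned nr_hw_queues,
					   unsigned queue_depth,
					   unsigned nthreads)
{
	unsigned nr_ios = nr_hw_queues * queue_depth;

	assert(nthreads > 0);
	return nr_ios / nthreads + !!(nr_ios % nthreads);
}
```

With 2 queues of depth 4 and 3 threads, thread 0 would own (0,0), (0,3) and (1,2), so a single busy queue is still spread over every thread. The per-thread upper bound is the same quantity the selftest uses to size its sparse buffer table.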

+1

tools/testing/selftests/ublk/Makefile

··· 19 19 TEST_PROGS += test_generic_09.sh 20 20 TEST_PROGS += test_generic_10.sh 21 21 TEST_PROGS += test_generic_11.sh 22 + TEST_PROGS += test_generic_12.sh 22 23 23 24 TEST_PROGS += test_null_01.sh 24 25 TEST_PROGS += test_null_02.sh

+2 -2

tools/testing/selftests/ublk/fault_inject.c

··· 46 46 .tv_nsec = (long long)q->dev->private_data, 47 47 }; 48 48 49 - ublk_queue_alloc_sqes(q, &sqe, 1); 49 + ublk_io_alloc_sqes(ublk_get_io(q, tag), &sqe, 1); 50 50 io_uring_prep_timeout(sqe, &ts, 1, 0); 51 - sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, 1); 51 + sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, q->q_id, 1); 52 52 53 53 ublk_queued_tgt_io(q, tag, 1); 54 54

+10 -10

tools/testing/selftests/ublk/file_backed.c

··· 18 18 unsigned ublk_op = ublksrv_get_op(iod); 19 19 struct io_uring_sqe *sqe[1]; 20 20 21 - ublk_queue_alloc_sqes(q, sqe, 1); 21 + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1); 22 22 io_uring_prep_fsync(sqe[0], 1 /*fds[1]*/, IORING_FSYNC_DATASYNC); 23 23 io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE); 24 24 /* bit63 marks us as tgt io */ 25 - sqe[0]->user_data = build_user_data(tag, ublk_op, 0, 1); 25 + sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1); 26 26 return 1; 27 27 } 28 28 ··· 36 36 void *addr = (zc | auto_zc) ? NULL : (void *)iod->addr; 37 37 38 38 if (!zc || auto_zc) { 39 - ublk_queue_alloc_sqes(q, sqe, 1); 39 + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1); 40 40 if (!sqe[0]) 41 41 return -ENOMEM; 42 42 ··· 48 48 sqe[0]->buf_index = tag; 49 49 io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE); 50 50 /* bit63 marks us as tgt io */ 51 - sqe[0]->user_data = build_user_data(tag, ublk_op, 0, 1); 51 + sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1); 52 52 return 1; 53 53 } 54 54 55 - ublk_queue_alloc_sqes(q, sqe, 3); 55 + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 3); 56 56 57 - io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag); 57 + io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index); 58 58 sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK; 59 59 sqe[0]->user_data = build_user_data(tag, 60 - ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1); 60 + ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1); 61 61 62 62 io_uring_prep_rw(op, sqe[1], 1 /*fds[1]*/, 0, 63 63 iod->nr_sectors << 9, 64 64 iod->start_sector << 9); 65 65 sqe[1]->buf_index = tag; 66 66 sqe[1]->flags |= IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK; 67 - sqe[1]->user_data = build_user_data(tag, ublk_op, 0, 1); 67 + sqe[1]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1); 68 68 69 - io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, tag); 70 - sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, 
1); 69 + io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index); 70 + sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1); 71 71 72 72 return 2; 73 73 }
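The hunks above add `q->q_id` to every `build_user_data()` call because a thread's ring is no longer tied to one queue, so the completion handler has to recover the queue from `user_data`. The real encoding lives in a header not shown in this diff; the sketch below assumes a hypothetical bit layout and hypothetical function names, keeping only the one detail the diff states, that bit 63 marks target I/O:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout, for illustration only:
 *   bits  0-15  tag
 *   bits 16-23  op
 *   bits 24-39  tgt_data
 *   bits 56-62  q_id
 *   bit  63     is_target_io
 */
static inline uint64_t pack_user_data(unsigned tag, unsigned op,
				      unsigned tgt_data, unsigned q_id,
				      unsigned is_target_io)
{
	/* Reject values that would spill into a neighbouring field. */
	assert(tag < (1u << 16) && op < (1u << 8));
	assert(tgt_data < (1u << 16) && q_id < (1u << 7));
	assert(is_target_io <= 1);

	return (uint64_t)tag |
	       (uint64_t)op << 16 |
	       (uint64_t)tgt_data << 24 |
	       (uint64_t)q_id << 56 |
	       (uint64_t)is_target_io << 63;
}

static inline unsigned unpack_tag(uint64_t ud)
{
	return (unsigned)(ud & 0xffff);
}

static inline unsigned unpack_op(uint64_t ud)
{
	return (unsigned)((ud >> 16) & 0xff);
}

static inline unsigned unpack_q_id(uint64_t ud)
{
	return (unsigned)((ud >> 56) & 0x7f);
}

static inline int unpack_is_target_io(uint64_t ud)
{
	return (int)((ud >> 63) & 1);
}
```

Widening each field to 64 bits before shifting avoids overflowing a 32-bit `unsigned` when a field sits above bit 31.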

+264 -140

tools/testing/selftests/ublk/kublk.c

··· 348 348 349 349 for (i = 0; i < info->nr_hw_queues; i++) { 350 350 ublk_print_cpu_set(&affinity[i], buf, sizeof(buf)); 351 - printf("\tqueue %u: tid %d affinity(%s)\n", 352 - i, dev->q[i].tid, buf); 351 + printf("\tqueue %u: affinity(%s)\n", 352 + i, buf); 353 353 } 354 354 free(affinity); 355 355 } ··· 412 412 int i; 413 413 int nr_ios = q->q_depth; 414 414 415 - io_uring_unregister_buffers(&q->ring); 416 - 417 - io_uring_unregister_ring_fd(&q->ring); 418 - 419 - if (q->ring.ring_fd > 0) { 420 - io_uring_unregister_files(&q->ring); 421 - close(q->ring.ring_fd); 422 - q->ring.ring_fd = -1; 423 - } 424 - 425 415 if (q->io_cmd_buf) 426 416 munmap(q->io_cmd_buf, ublk_queue_cmd_buf_sz(q)); 427 417 ··· 419 429 free(q->ios[i].buf_addr); 420 430 } 421 431 432 + static void ublk_thread_deinit(struct ublk_thread *t) 433 + { 434 + io_uring_unregister_buffers(&t->ring); 435 + 436 + io_uring_unregister_ring_fd(&t->ring); 437 + 438 + if (t->ring.ring_fd > 0) { 439 + io_uring_unregister_files(&t->ring); 440 + close(t->ring.ring_fd); 441 + t->ring.ring_fd = -1; 442 + } 443 + } 444 + 422 445 static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags) 423 446 { 424 447 struct ublk_dev *dev = q->dev; 425 448 int depth = dev->dev_info.queue_depth; 426 - int i, ret = -1; 449 + int i; 427 450 int cmd_buf_size, io_buf_size; 428 451 unsigned long off; 429 - int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth; 430 452 431 453 q->tgt_ops = dev->tgt.ops; 432 454 q->state = 0; 433 455 q->q_depth = depth; 434 - q->cmd_inflight = 0; 435 - q->tid = gettid(); 436 456 437 457 if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) { 438 458 q->state |= UBLKSRV_NO_BUF; ··· 467 467 for (i = 0; i < q->q_depth; i++) { 468 468 q->ios[i].buf_addr = NULL; 469 469 q->ios[i].flags = UBLKSRV_NEED_FETCH_RQ | UBLKSRV_IO_FREE; 470 + q->ios[i].tag = i; 470 471 471 472 if (q->state & UBLKSRV_NO_BUF) 472 473 continue; ··· 480 479 } 481 480 } 482 481 483 - ret = 
ublk_setup_ring(&q->ring, ring_depth, cq_depth, 484 - IORING_SETUP_COOP_TASKRUN | 485 - IORING_SETUP_SINGLE_ISSUER | 486 - IORING_SETUP_DEFER_TASKRUN); 487 - if (ret < 0) { 488 - ublk_err("ublk dev %d queue %d setup io_uring failed %d\n", 489 - q->dev->dev_info.dev_id, q->q_id, ret); 490 - goto fail; 491 - } 492 - 493 - if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) { 494 - ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth); 495 - if (ret) { 496 - ublk_err("ublk dev %d queue %d register spare buffers failed %d", 497 - dev->dev_info.dev_id, q->q_id, ret); 498 - goto fail; 499 - } 500 - } 501 - 502 - io_uring_register_ring_fd(&q->ring); 503 - 504 - ret = io_uring_register_files(&q->ring, dev->fds, dev->nr_fds); 505 - if (ret) { 506 - ublk_err("ublk dev %d queue %d register files failed %d\n", 507 - q->dev->dev_info.dev_id, q->q_id, ret); 508 - goto fail; 509 - } 510 - 511 482 return 0; 512 483 fail: 513 484 ublk_queue_deinit(q); 514 485 ublk_err("ublk dev %d queue %d failed\n", 515 486 dev->dev_info.dev_id, q->q_id); 487 + return -ENOMEM; 488 + } 489 + 490 + static int ublk_thread_init(struct ublk_thread *t) 491 + { 492 + struct ublk_dev *dev = t->dev; 493 + int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth; 494 + int ret; 495 + 496 + ret = ublk_setup_ring(&t->ring, ring_depth, cq_depth, 497 + IORING_SETUP_COOP_TASKRUN | 498 + IORING_SETUP_SINGLE_ISSUER | 499 + IORING_SETUP_DEFER_TASKRUN); 500 + if (ret < 0) { 501 + ublk_err("ublk dev %d thread %d setup io_uring failed %d\n", 502 + dev->dev_info.dev_id, t->idx, ret); 503 + goto fail; 504 + } 505 + 506 + if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) { 507 + unsigned nr_ios = dev->dev_info.queue_depth * dev->dev_info.nr_hw_queues; 508 + unsigned max_nr_ios_per_thread = nr_ios / dev->nthreads; 509 + max_nr_ios_per_thread += !!(nr_ios % dev->nthreads); 510 + ret = io_uring_register_buffers_sparse( 511 + &t->ring, max_nr_ios_per_thread); 
512 + if (ret) { 513 + ublk_err("ublk dev %d thread %d register spare buffers failed %d", 514 + dev->dev_info.dev_id, t->idx, ret); 515 + goto fail; 516 + } 517 + } 518 + 519 + io_uring_register_ring_fd(&t->ring); 520 + 521 + ret = io_uring_register_files(&t->ring, dev->fds, dev->nr_fds); 522 + if (ret) { 523 + ublk_err("ublk dev %d thread %d register files failed %d\n", 524 + t->dev->dev_info.dev_id, t->idx, ret); 525 + goto fail; 526 + } 527 + 528 + return 0; 529 + fail: 530 + ublk_thread_deinit(t); 531 + ublk_err("ublk dev %d thread %d init failed\n", 532 + dev->dev_info.dev_id, t->idx); 516 533 return -ENOMEM; 517 534 } 518 535 ··· 581 562 if (q->tgt_ops->buf_index) 582 563 buf.index = q->tgt_ops->buf_index(q, tag); 583 564 else 584 - buf.index = tag; 565 + buf.index = q->ios[tag].buf_index; 585 566 586 567 if (q->state & UBLKSRV_AUTO_BUF_REG_FALLBACK) 587 568 buf.flags = UBLK_AUTO_BUF_REG_FALLBACK; ··· 589 570 sqe->addr = ublk_auto_buf_reg_to_sqe_addr(&buf); 590 571 } 591 572 592 - int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag) 573 + int ublk_queue_io_cmd(struct ublk_io *io) 593 574 { 575 + struct ublk_thread *t = io->t; 576 + struct ublk_queue *q = ublk_io_to_queue(io); 594 577 struct ublksrv_io_cmd *cmd; 595 578 struct io_uring_sqe *sqe[1]; 596 579 unsigned int cmd_op = 0; ··· 617 596 else if (io->flags & UBLKSRV_NEED_FETCH_RQ) 618 597 cmd_op = UBLK_U_IO_FETCH_REQ; 619 598 620 - if (io_uring_sq_space_left(&q->ring) < 1) 621 - io_uring_submit(&q->ring); 599 + if (io_uring_sq_space_left(&t->ring) < 1) 600 + io_uring_submit(&t->ring); 622 601 623 - ublk_queue_alloc_sqes(q, sqe, 1); 602 + ublk_io_alloc_sqes(io, sqe, 1); 624 603 if (!sqe[0]) { 625 - ublk_err("%s: run out of sqe %d, tag %d\n", 626 - __func__, q->q_id, tag); 604 + ublk_err("%s: run out of sqe. 
thread %u, tag %d\n", 605 + __func__, t->idx, io->tag); 627 606 return -1; 628 607 } 629 608 ··· 638 617 sqe[0]->opcode = IORING_OP_URING_CMD; 639 618 sqe[0]->flags = IOSQE_FIXED_FILE; 640 619 sqe[0]->rw_flags = 0; 641 - cmd->tag = tag; 620 + cmd->tag = io->tag; 642 621 cmd->q_id = q->q_id; 643 622 if (!(q->state & UBLKSRV_NO_BUF)) 644 623 cmd->addr = (__u64) (uintptr_t) io->buf_addr; ··· 646 625 cmd->addr = 0; 647 626 648 627 if (q->state & UBLKSRV_AUTO_BUF_REG) 649 - ublk_set_auto_buf_reg(q, sqe[0], tag); 628 + ublk_set_auto_buf_reg(q, sqe[0], io->tag); 650 629 651 - user_data = build_user_data(tag, _IOC_NR(cmd_op), 0, 0); 630 + user_data = build_user_data(io->tag, _IOC_NR(cmd_op), 0, q->q_id, 0); 652 631 io_uring_sqe_set_data64(sqe[0], user_data); 653 632 654 633 io->flags = 0; 655 634 656 - q->cmd_inflight += 1; 635 + t->cmd_inflight += 1; 657 636 658 - ublk_dbg(UBLK_DBG_IO_CMD, "%s: (qid %d tag %u cmd_op %u) iof %x stopping %d\n", 659 - __func__, q->q_id, tag, cmd_op, 660 - io->flags, !!(q->state & UBLKSRV_QUEUE_STOPPING)); 637 + ublk_dbg(UBLK_DBG_IO_CMD, "%s: (thread %u qid %d tag %u cmd_op %u) iof %x stopping %d\n", 638 + __func__, t->idx, q->q_id, io->tag, cmd_op, 639 + io->flags, !!(t->state & UBLKSRV_THREAD_STOPPING)); 661 640 return 1; 662 641 } 663 642 664 - static void ublk_submit_fetch_commands(struct ublk_queue *q) 643 + static void ublk_submit_fetch_commands(struct ublk_thread *t) 665 644 { 666 - int i = 0; 645 + struct ublk_queue *q; 646 + struct ublk_io *io; 647 + int i = 0, j = 0; 667 648 668 - for (i = 0; i < q->q_depth; i++) 669 - ublk_queue_io_cmd(q, &q->ios[i], i); 649 + if (t->dev->per_io_tasks) { 650 + /* 651 + * Lexicographically order all the (qid,tag) pairs, with 652 + * qid taking priority (so (1,0) > (0,1)). Then make 653 + * this thread the daemon for every Nth entry in this 654 + * list (N is the number of threads), starting at this 655 + * thread's index. 
This ensures that each queue is 656 + * handled by as many ublk server threads as possible, 657 + * so that load that is concentrated on one or a few 658 + * queues can make use of all ublk server threads. 659 + */ 660 + const struct ublksrv_ctrl_dev_info *dinfo = &t->dev->dev_info; 661 + int nr_ios = dinfo->nr_hw_queues * dinfo->queue_depth; 662 + for (i = t->idx; i < nr_ios; i += t->dev->nthreads) { 663 + int q_id = i / dinfo->queue_depth; 664 + int tag = i % dinfo->queue_depth; 665 + q = &t->dev->q[q_id]; 666 + io = &q->ios[tag]; 667 + io->t = t; 668 + io->buf_index = j++; 669 + ublk_queue_io_cmd(io); 670 + } 671 + } else { 672 + /* 673 + * Service exclusively the queue whose q_id matches our 674 + * thread index. 675 + */ 676 + struct ublk_queue *q = &t->dev->q[t->idx]; 677 + for (i = 0; i < q->q_depth; i++) { 678 + io = &q->ios[i]; 679 + io->t = t; 680 + io->buf_index = i; 681 + ublk_queue_io_cmd(io); 682 + } 683 + } 670 684 } 671 685 672 - static int ublk_queue_is_idle(struct ublk_queue *q) 686 + static int ublk_thread_is_idle(struct ublk_thread *t) 673 687 { 674 - return !io_uring_sq_ready(&q->ring) && !q->io_inflight; 688 + return !io_uring_sq_ready(&t->ring) && !t->io_inflight; 675 689 } 676 690 677 - static int ublk_queue_is_done(struct ublk_queue *q) 691 + static int ublk_thread_is_done(struct ublk_thread *t) 678 692 { 679 - return (q->state & UBLKSRV_QUEUE_STOPPING) && ublk_queue_is_idle(q); 693 + return (t->state & UBLKSRV_THREAD_STOPPING) && ublk_thread_is_idle(t); 680 694 } 681 695 682 696 static inline void ublksrv_handle_tgt_cqe(struct ublk_queue *q, ··· 729 673 q->tgt_ops->tgt_io_done(q, tag, cqe); 730 674 } 731 675 732 - static void ublk_handle_cqe(struct io_uring *r, 676 + static void ublk_handle_cqe(struct ublk_thread *t, 733 677 struct io_uring_cqe *cqe, void *data) 734 678 { 735 - struct ublk_queue *q = container_of(r, struct ublk_queue, ring); 679 + struct ublk_dev *dev = t->dev; 680 + unsigned q_id = user_data_to_q_id(cqe->user_data); 681 + 
struct ublk_queue *q = &dev->q[q_id]; 736 682 unsigned tag = user_data_to_tag(cqe->user_data); 737 683 unsigned cmd_op = user_data_to_op(cqe->user_data); 738 684 int fetch = (cqe->res != UBLK_IO_RES_ABORT) && 739 - !(q->state & UBLKSRV_QUEUE_STOPPING); 685 + !(t->state & UBLKSRV_THREAD_STOPPING); 740 686 struct ublk_io *io; 741 687 742 688 if (cqe->res < 0 && cqe->res != -ENODEV) ··· 749 691 __func__, cqe->res, q->q_id, tag, cmd_op, 750 692 is_target_io(cqe->user_data), 751 693 user_data_to_tgt_data(cqe->user_data), 752 - (q->state & UBLKSRV_QUEUE_STOPPING)); 694 + (t->state & UBLKSRV_THREAD_STOPPING)); 753 695 754 696 /* Don't retrieve io in case of target io */ 755 697 if (is_target_io(cqe->user_data)) { ··· 758 700 } 759 701 760 702 io = &q->ios[tag]; 761 - q->cmd_inflight--; 703 + t->cmd_inflight--; 762 704 763 705 if (!fetch) { 764 - q->state |= UBLKSRV_QUEUE_STOPPING; 706 + t->state |= UBLKSRV_THREAD_STOPPING; 765 707 io->flags &= ~UBLKSRV_NEED_FETCH_RQ; 766 708 } 767 709 ··· 771 713 q->tgt_ops->queue_io(q, tag); 772 714 } else if (cqe->res == UBLK_IO_RES_NEED_GET_DATA) { 773 715 io->flags |= UBLKSRV_NEED_GET_DATA | UBLKSRV_IO_FREE; 774 - ublk_queue_io_cmd(q, io, tag); 716 + ublk_queue_io_cmd(io); 775 717 } else { 776 718 /* 777 719 * COMMIT_REQ will be completed immediately since no fetching ··· 785 727 } 786 728 } 787 729 788 - static int ublk_reap_events_uring(struct io_uring *r) 730 + static int ublk_reap_events_uring(struct ublk_thread *t) 789 731 { 790 732 struct io_uring_cqe *cqe; 791 733 unsigned head; 792 734 int count = 0; 793 735 794 - io_uring_for_each_cqe(r, head, cqe) { 795 - ublk_handle_cqe(r, cqe, NULL); 736 + io_uring_for_each_cqe(&t->ring, head, cqe) { 737 + ublk_handle_cqe(t, cqe, NULL); 796 738 count += 1; 797 739 } 798 - io_uring_cq_advance(r, count); 740 + io_uring_cq_advance(&t->ring, count); 799 741 800 742 return count; 801 743 } 802 744 803 - static int ublk_process_io(struct ublk_queue *q) 745 + static int ublk_process_io(struct 
ublk_thread *t)
 {
 	int ret, reapped;
 
-	ublk_dbg(UBLK_DBG_QUEUE, "dev%d-q%d: to_submit %d inflight cmd %u stopping %d\n",
-			q->dev->dev_info.dev_id,
-			q->q_id, io_uring_sq_ready(&q->ring),
-			q->cmd_inflight,
-			(q->state & UBLKSRV_QUEUE_STOPPING));
+	ublk_dbg(UBLK_DBG_THREAD, "dev%d-t%u: to_submit %d inflight cmd %u stopping %d\n",
+			t->dev->dev_info.dev_id,
+			t->idx, io_uring_sq_ready(&t->ring),
+			t->cmd_inflight,
+			(t->state & UBLKSRV_THREAD_STOPPING));
 
-	if (ublk_queue_is_done(q))
+	if (ublk_thread_is_done(t))
 		return -ENODEV;
 
-	ret = io_uring_submit_and_wait(&q->ring, 1);
-	reapped = ublk_reap_events_uring(&q->ring);
+	ret = io_uring_submit_and_wait(&t->ring, 1);
+	reapped = ublk_reap_events_uring(t);
 
-	ublk_dbg(UBLK_DBG_QUEUE, "submit result %d, reapped %d stop %d idle %d\n",
-			ret, reapped, (q->state & UBLKSRV_QUEUE_STOPPING),
-			(q->state & UBLKSRV_QUEUE_IDLE));
+	ublk_dbg(UBLK_DBG_THREAD, "submit result %d, reapped %d stop %d idle %d\n",
+			ret, reapped, (t->state & UBLKSRV_THREAD_STOPPING),
+			(t->state & UBLKSRV_THREAD_IDLE));
 
 	return reapped;
 }
 
-static void ublk_queue_set_sched_affinity(const struct ublk_queue *q,
+static void ublk_thread_set_sched_affinity(const struct ublk_thread *t,
 		cpu_set_t *cpuset)
 {
 	if (sched_setaffinity(0, sizeof(*cpuset), cpuset) < 0)
-		ublk_err("ublk dev %u queue %u set affinity failed",
-				q->dev->dev_info.dev_id, q->q_id);
+		ublk_err("ublk dev %u thread %u set affinity failed",
+				t->dev->dev_info.dev_id, t->idx);
 }
 
-struct ublk_queue_info {
-	struct ublk_queue *q;
-	sem_t *queue_sem;
+struct ublk_thread_info {
+	struct ublk_dev *dev;
+	unsigned idx;
+	sem_t *ready;
 	cpu_set_t *affinity;
-	unsigned char auto_zc_fallback;
 };
 
 static void *ublk_io_handler_fn(void *data)
 {
-	struct ublk_queue_info *info = data;
-	struct ublk_queue *q = info->q;
-	int dev_id = q->dev->dev_info.dev_id;
-	unsigned extra_flags = 0;
+	struct ublk_thread_info *info = data;
+	struct ublk_thread *t = &info->dev->threads[info->idx];
+	int dev_id = info->dev->dev_info.dev_id;
 	int ret;
 
-	if (info->auto_zc_fallback)
-		extra_flags = UBLKSRV_AUTO_BUF_REG_FALLBACK;
+	t->dev = info->dev;
+	t->idx = info->idx;
 
-	ret = ublk_queue_init(q, extra_flags);
+	ret = ublk_thread_init(t);
 	if (ret) {
-		ublk_err("ublk dev %d queue %d init queue failed\n",
-				dev_id, q->q_id);
+		ublk_err("ublk dev %d thread %u init failed\n",
+				dev_id, t->idx);
 		return NULL;
 	}
 	/* IO perf is sensitive with queue pthread affinity on NUMA machine*/
-	ublk_queue_set_sched_affinity(q, info->affinity);
-	sem_post(info->queue_sem);
+	if (info->affinity)
+		ublk_thread_set_sched_affinity(t, info->affinity);
+	sem_post(info->ready);
 
-	ublk_dbg(UBLK_DBG_QUEUE, "tid %d: ublk dev %d queue %d started\n",
-			q->tid, dev_id, q->q_id);
+	ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %u started\n",
+			gettid(), dev_id, t->idx);
 
 	/* submit all io commands to ublk driver */
-	ublk_submit_fetch_commands(q);
+	ublk_submit_fetch_commands(t);
 	do {
-		if (ublk_process_io(q) < 0)
+		if (ublk_process_io(t) < 0)
 			break;
 	} while (1);
 
-	ublk_dbg(UBLK_DBG_QUEUE, "ublk dev %d queue %d exited\n", dev_id, q->q_id);
-	ublk_queue_deinit(q);
+	ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %d exiting\n",
+			gettid(), dev_id, t->idx);
+	ublk_thread_deinit(t);
 	return NULL;
 }
 
···
 static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
 {
 	const struct ublksrv_ctrl_dev_info *dinfo = &dev->dev_info;
-	struct ublk_queue_info *qinfo;
+	struct ublk_thread_info *tinfo;
+	unsigned extra_flags = 0;
 	cpu_set_t *affinity_buf;
 	void *thread_ret;
-	sem_t queue_sem;
+	sem_t ready;
 	int ret, i;
 
 	ublk_dbg(UBLK_DBG_DEV, "%s enter\n", __func__);
 
-	qinfo = (struct ublk_queue_info *)calloc(sizeof(struct ublk_queue_info),
-			dinfo->nr_hw_queues);
-	if (!qinfo)
+	tinfo = calloc(sizeof(struct ublk_thread_info), dev->nthreads);
+	if (!tinfo)
 		return -ENOMEM;
 
-	sem_init(&queue_sem, 0, 0);
+	sem_init(&ready, 0, 0);
 	ret = ublk_dev_prep(ctx, dev);
 	if (ret)
 		return ret;
···
 	if (ret)
 		return ret;
 
+	if (ctx->auto_zc_fallback)
+		extra_flags = UBLKSRV_AUTO_BUF_REG_FALLBACK;
+
 	for (i = 0; i < dinfo->nr_hw_queues; i++) {
 		dev->q[i].dev = dev;
 		dev->q[i].q_id = i;
 
-		qinfo[i].q = &dev->q[i];
-		qinfo[i].queue_sem = &queue_sem;
-		qinfo[i].affinity = &affinity_buf[i];
-		qinfo[i].auto_zc_fallback = ctx->auto_zc_fallback;
-		pthread_create(&dev->q[i].thread, NULL,
-				ublk_io_handler_fn,
-				&qinfo[i]);
+		ret = ublk_queue_init(&dev->q[i], extra_flags);
+		if (ret) {
+			ublk_err("ublk dev %d queue %d init queue failed\n",
+					dinfo->dev_id, i);
+			goto fail;
+		}
 	}
 
-	for (i = 0; i < dinfo->nr_hw_queues; i++)
-		sem_wait(&queue_sem);
-	free(qinfo);
+	for (i = 0; i < dev->nthreads; i++) {
+		tinfo[i].dev = dev;
+		tinfo[i].idx = i;
+		tinfo[i].ready = &ready;
+
+		/*
+		 * If threads are not tied 1:1 to queues, setting thread
+		 * affinity based on queue affinity makes little sense.
+		 * However, thread CPU affinity has significant impact
+		 * on performance, so to compare fairly, we'll still set
+		 * thread CPU affinity based on queue affinity where
+		 * possible.
+		 */
+		if (dev->nthreads == dinfo->nr_hw_queues)
+			tinfo[i].affinity = &affinity_buf[i];
+		pthread_create(&dev->threads[i].thread, NULL,
+				ublk_io_handler_fn,
+				&tinfo[i]);
+	}
+
+	for (i = 0; i < dev->nthreads; i++)
+		sem_wait(&ready);
+	free(tinfo);
 	free(affinity_buf);
 
 	/* everything is fine now, start us */
···
 	ublk_send_dev_event(ctx, dev, dev->dev_info.dev_id);
 
 	/* wait until we are terminated */
-	for (i = 0; i < dinfo->nr_hw_queues; i++)
-		pthread_join(dev->q[i].thread, &thread_ret);
+	for (i = 0; i < dev->nthreads; i++)
+		pthread_join(dev->threads[i].thread, &thread_ret);
  fail:
+	for (i = 0; i < dinfo->nr_hw_queues; i++)
+		ublk_queue_deinit(&dev->q[i]);
 	ublk_dev_unprep(dev);
 	ublk_dbg(UBLK_DBG_DEV, "%s exit\n", __func__);
 
···
 
 static int __cmd_dev_add(const struct dev_ctx *ctx)
 {
+	unsigned nthreads = ctx->nthreads;
 	unsigned nr_queues = ctx->nr_hw_queues;
 	const char *tgt_type = ctx->tgt_type;
 	unsigned depth = ctx->queue_depth;
 	__u64 features;
 	const struct ublk_tgt_ops *ops;
 	struct ublksrv_ctrl_dev_info *info;
-	struct ublk_dev *dev;
+	struct ublk_dev *dev = NULL;
 	int dev_id = ctx->dev_id;
 	int ret, i;
 
···
 	if (!ops) {
 		ublk_err("%s: no such tgt type, type %s\n",
 				__func__, tgt_type);
-		return -ENODEV;
+		ret = -ENODEV;
+		goto fail;
 	}
 
 	if (nr_queues > UBLK_MAX_QUEUES || depth > UBLK_QUEUE_DEPTH) {
 		ublk_err("%s: invalid nr_queues or depth queues %u depth %u\n",
 				__func__, nr_queues, depth);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto fail;
+	}
+
+	/* default to 1:1 threads:queues if nthreads is unspecified */
+	if (!nthreads)
+		nthreads = nr_queues;
+
+	if (nthreads > UBLK_MAX_THREADS) {
+		ublk_err("%s: %u is too many threads (max %u)\n",
+				__func__, nthreads, UBLK_MAX_THREADS);
+		ret = -EINVAL;
+		goto fail;
+	}
+
+	if (nthreads != nr_queues && !ctx->per_io_tasks) {
+		ublk_err("%s: threads %u must be same as queues %u if "
+			"not using per_io_tasks\n",
+			__func__, nthreads, nr_queues);
+		ret = -EINVAL;
+		goto fail;
 	}
 
 	dev = ublk_ctrl_init();
 	if (!dev) {
 		ublk_err("%s: can't alloc dev id %d, type %s\n",
 				__func__, dev_id, tgt_type);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto fail;
 	}
 
 	/* kernel doesn't support get_features */
 	ret = ublk_ctrl_get_features(dev, &features);
-	if (ret < 0)
-		return -EINVAL;
+	if (ret < 0) {
+		ret = -EINVAL;
+		goto fail;
+	}
 
-	if (!(features & UBLK_F_CMD_IOCTL_ENCODE))
-		return -ENOTSUP;
+	if (!(features & UBLK_F_CMD_IOCTL_ENCODE)) {
+		ret = -ENOTSUP;
+		goto fail;
+	}
 
 	info = &dev->dev_info;
 	info->dev_id = ctx->dev_id;
···
 	if ((features & UBLK_F_QUIESCE) &&
 			(info->flags & UBLK_F_USER_RECOVERY))
 		info->flags |= UBLK_F_QUIESCE;
+	dev->nthreads = nthreads;
+	dev->per_io_tasks = ctx->per_io_tasks;
 	dev->tgt.ops = ops;
 	dev->tgt.sq_depth = depth;
 	dev->tgt.cq_depth = depth;
···
 fail:
 	if (ret < 0)
 		ublk_send_dev_event(ctx, dev, -1);
-	ublk_ctrl_deinit(dev);
+	if (dev)
+		ublk_ctrl_deinit(dev);
 	return ret;
 }
 
···
 		shmctl(ctx->_shmid, IPC_RMID, NULL);
 		/* wait for child and detach from it */
 		wait(NULL);
+		if (exit_code == EXIT_FAILURE)
+			ublk_err("%s: command failed\n", __func__);
 		exit(exit_code);
 	} else {
 		exit(EXIT_FAILURE);
···
 		[const_ilog2(UBLK_F_UPDATE_SIZE)] = "UPDATE_SIZE",
 		[const_ilog2(UBLK_F_AUTO_BUF_REG)] = "AUTO_BUF_REG",
 		[const_ilog2(UBLK_F_QUIESCE)] = "QUIESCE",
+		[const_ilog2(UBLK_F_PER_IO_DAEMON)] = "PER_IO_DAEMON",
 	};
 	struct ublk_dev *dev;
 	__u64 features = 0;
···
 			exe, recovery ? "recover" : "add");
 	printf("\t[--foreground] [--quiet] [-z] [--auto_zc] [--auto_zc_fallback] [--debug_mask mask] [-r 0|1 ] [-g]\n");
 	printf("\t[-e 0|1 ] [-i 0|1]\n");
+	printf("\t[--nthreads threads] [--per_io_tasks]\n");
 	printf("\t[target options] [backfile1] [backfile2] ...\n");
 	printf("\tdefault: nr_queues=2(max 32), depth=128(max 1024), dev_id=-1(auto allocation)\n");
+	printf("\tdefault: nthreads=nr_queues");
 
 	for (i = 0; i < sizeof(tgt_ops_list) / sizeof(tgt_ops_list[0]); i++) {
 		const struct ublk_tgt_ops *ops = tgt_ops_list[i];
···
 		{ "auto_zc", 0, NULL, 0 },
 		{ "auto_zc_fallback", 0, NULL, 0 },
 		{ "size", 1, NULL, 's'},
+		{ "nthreads", 1, NULL, 0 },
+		{ "per_io_tasks", 0, NULL, 0 },
 		{ 0, 0, 0, 0 }
 	};
 	const struct ublk_tgt_ops *ops = NULL;
···
 				ctx.flags |= UBLK_F_AUTO_BUF_REG;
 			if (!strcmp(longopts[option_idx].name, "auto_zc_fallback"))
 				ctx.auto_zc_fallback = 1;
+			if (!strcmp(longopts[option_idx].name, "nthreads"))
+				ctx.nthreads = strtol(optarg, NULL, 10);
+			if (!strcmp(longopts[option_idx].name, "per_io_tasks"))
+				ctx.per_io_tasks = 1;
 			break;
 		case '?':
 			/*
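
The thread-count checks added to `__cmd_dev_add()` in the kublk.c diff above can be read in isolation. The sketch below restates them as a standalone function; it is an illustration under stated assumptions, not the selftest's code. `resolve_nthreads` and `MAX_THREADS` are hypothetical names, and the limit of 32 mirrors the `UBLK_MAX_THREADS` value (1 << 5) that this change defines in kublk.h.

```c
#include <errno.h>

/* Assumed to mirror UBLK_MAX_THREADS (1 << 5) from kublk.h in this change. */
#define MAX_THREADS 32

/*
 * Hypothetical helper: returns the effective I/O thread count,
 * or -EINVAL when the requested combination is rejected.
 */
int resolve_nthreads(unsigned nthreads, unsigned nr_queues, int per_io_tasks)
{
	/* default to 1:1 threads:queues if nthreads is unspecified */
	if (!nthreads)
		nthreads = nr_queues;

	/* the per-device thread array has a fixed size */
	if (nthreads > MAX_THREADS)
		return -EINVAL;

	/* a thread count that differs from the queue count needs per_io_tasks */
	if (nthreads != nr_queues && !per_io_tasks)
		return -EINVAL;

	return (int)nthreads;
}
```

With these rules, `--nthreads 8 -q 4` is only accepted together with `--per_io_tasks`, which matches how the stress tests in this change invoke the tool.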

+54 -19

tools/testing/selftests/ublk/kublk.h

···
 #define UBLKSRV_IO_IDLE_SECS 20
 
 #define UBLK_IO_MAX_BYTES (1 << 20)
-#define UBLK_MAX_QUEUES 32
+#define UBLK_MAX_QUEUES_SHIFT 5
+#define UBLK_MAX_QUEUES (1 << UBLK_MAX_QUEUES_SHIFT)
+#define UBLK_MAX_THREADS_SHIFT 5
+#define UBLK_MAX_THREADS (1 << UBLK_MAX_THREADS_SHIFT)
 #define UBLK_QUEUE_DEPTH 1024
 
 #define UBLK_DBG_DEV (1U << 0)
-#define UBLK_DBG_QUEUE (1U << 1)
+#define UBLK_DBG_THREAD (1U << 1)
 #define UBLK_DBG_IO_CMD (1U << 2)
 #define UBLK_DBG_IO (1U << 3)
 #define UBLK_DBG_CTRL_CMD (1U << 4)
···
 
 struct ublk_dev;
 struct ublk_queue;
+struct ublk_thread;
 
 struct stripe_ctx {
 	/* stripe */
···
 	char tgt_type[16];
 	unsigned long flags;
 	unsigned nr_hw_queues;
+	unsigned short nthreads;
 	unsigned queue_depth;
 	int dev_id;
 	int nr_files;
···
 	unsigned int fg:1;
 	unsigned int recovery:1;
 	unsigned int auto_zc_fallback:1;
+	unsigned int per_io_tasks:1;
 
 	int _evtfd;
 	int _shmid;
···
 	unsigned short flags;
 	unsigned short refs; /* used by target code only */
 
+	int tag;
+
 	int result;
 
+	unsigned short buf_index;
 	unsigned short tgt_ios;
 	void *private_data;
+	struct ublk_thread *t;
 };
 
 struct ublk_tgt_ops {
···
 struct ublk_queue {
 	int q_id;
 	int q_depth;
-	unsigned int cmd_inflight;
-	unsigned int io_inflight;
 	struct ublk_dev *dev;
 	const struct ublk_tgt_ops *tgt_ops;
 	struct ublksrv_io_desc *io_cmd_buf;
-	struct io_uring ring;
+
 	struct ublk_io ios[UBLK_QUEUE_DEPTH];
-#define UBLKSRV_QUEUE_STOPPING (1U << 0)
-#define UBLKSRV_QUEUE_IDLE (1U << 1)
 #define UBLKSRV_NO_BUF (1U << 2)
 #define UBLKSRV_ZC (1U << 3)
 #define UBLKSRV_AUTO_BUF_REG (1U << 4)
 #define UBLKSRV_AUTO_BUF_REG_FALLBACK (1U << 5)
 	unsigned state;
-	pid_t tid;
+};
+
+struct ublk_thread {
+	struct ublk_dev *dev;
+	struct io_uring ring;
+	unsigned int cmd_inflight;
+	unsigned int io_inflight;
+
 	pthread_t thread;
+	unsigned idx;
+
+#define UBLKSRV_THREAD_STOPPING (1U << 0)
+#define UBLKSRV_THREAD_IDLE (1U << 1)
+	unsigned state;
 };
 
 struct ublk_dev {
 	struct ublk_tgt tgt;
 	struct ublksrv_ctrl_dev_info dev_info;
 	struct ublk_queue q[UBLK_MAX_QUEUES];
+	struct ublk_thread threads[UBLK_MAX_THREADS];
+	unsigned nthreads;
+	unsigned per_io_tasks;
 
 	int fds[MAX_BACK_FILES + 1]; /* fds[0] points to /dev/ublkcN */
 	int nr_fds;
···
 
 
 extern unsigned int ublk_dbg_mask;
-extern int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag);
+extern int ublk_queue_io_cmd(struct ublk_io *io);
 
 
 static inline int ublk_io_auto_zc_fallback(const struct ublksrv_io_desc *iod)
···
 }
 
 static inline __u64 build_user_data(unsigned tag, unsigned op,
-		unsigned tgt_data, unsigned is_target_io)
+		unsigned tgt_data, unsigned q_id, unsigned is_target_io)
 {
-	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16));
+	/* we only have 7 bits to encode q_id */
+	_Static_assert(UBLK_MAX_QUEUES_SHIFT <= 7);
+	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7));
 
-	return tag | (op << 16) | (tgt_data << 24) | (__u64)is_target_io << 63;
+	return tag | (op << 16) | (tgt_data << 24) |
+		(__u64)q_id << 56 | (__u64)is_target_io << 63;
 }
 
 static inline unsigned int user_data_to_tag(__u64 user_data)
···
 static inline unsigned int user_data_to_tgt_data(__u64 user_data)
 {
 	return (user_data >> 24) & 0xffff;
+}
+
+static inline unsigned int user_data_to_q_id(__u64 user_data)
+{
+	return (user_data >> 56) & 0x7f;
 }
 
 static inline unsigned short ublk_cmd_op_nr(unsigned int op)
···
 	}
 }
 
-static inline int ublk_queue_alloc_sqes(struct ublk_queue *q,
+static inline struct ublk_queue *ublk_io_to_queue(const struct ublk_io *io)
+{
+	return container_of(io, struct ublk_queue, ios[io->tag]);
+}
+
+static inline int ublk_io_alloc_sqes(struct ublk_io *io,
 		struct io_uring_sqe *sqes[], int nr_sqes)
 {
-	unsigned left = io_uring_sq_space_left(&q->ring);
+	struct io_uring *ring = &io->t->ring;
+	unsigned left = io_uring_sq_space_left(ring);
 	int i;
 
 	if (left < nr_sqes)
-		io_uring_submit(&q->ring);
+		io_uring_submit(ring);
 
 	for (i = 0; i < nr_sqes; i++) {
-		sqes[i] = io_uring_get_sqe(&q->ring);
+		sqes[i] = io_uring_get_sqe(ring);
 		if (!sqes[i])
 			return i;
 	}
···
 
 	ublk_mark_io_done(io, res);
 
-	return ublk_queue_io_cmd(q, io, tag);
+	return ublk_queue_io_cmd(io);
 }
 
 static inline void ublk_queued_tgt_io(struct ublk_queue *q, unsigned tag, int queued)
···
 	else {
 		struct ublk_io *io = ublk_get_io(q, tag);
 
-		q->io_inflight += queued;
+		io->t->io_inflight += queued;
 		io->tgt_ios = queued;
 		io->result = 0;
 	}
···
 {
 	struct ublk_io *io = ublk_get_io(q, tag);
 
-	q->io_inflight--;
+	io->t->io_inflight--;
 
 	return --io->tgt_ios == 0;
 }
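
Because a thread's ring may now carry completions for I/Os from several queues, the kublk.h change packs the queue ID into `user_data` (bits 56 to 62) alongside the tag, op, target data and the target-I/O flag. The round trip can be sketched as below. This is a self-contained illustration, not the header itself: it uses `uint64_t` in place of `__u64`, widens every field to 64 bits before shifting (a deliberate difference from the diff), and the bodies of `user_data_to_tag` and `user_data_to_op` are inferred from the bit layout rather than shown in this diff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Layout taken from build_user_data() in this change:
 *   bits  0-15  tag
 *   bits 16-23  op
 *   bits 24-39  tgt_data
 *   bits 56-62  q_id
 *   bit  63     is_target_io
 */
static inline uint64_t build_user_data(unsigned tag, unsigned op,
		unsigned tgt_data, unsigned q_id, unsigned is_target_io)
{
	/* each field must fit in its slot; q_id has only 7 bits */
	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7));

	return (uint64_t)tag | ((uint64_t)op << 16) |
		((uint64_t)tgt_data << 24) | ((uint64_t)q_id << 56) |
		((uint64_t)is_target_io << 63);
}

/* Decoder bodies for tag and op are assumptions derived from the layout. */
static inline unsigned user_data_to_tag(uint64_t user_data)
{
	return user_data & 0xffff;
}

static inline unsigned user_data_to_op(uint64_t user_data)
{
	return (user_data >> 16) & 0xff;
}

static inline unsigned user_data_to_tgt_data(uint64_t user_data)
{
	return (user_data >> 24) & 0xffff;
}

static inline unsigned user_data_to_q_id(uint64_t user_data)
{
	return (user_data >> 56) & 0x7f;
}
```

A completion handler can then recover both the queue and the tag from a CQE's `user_data` without knowing which queue the submitting thread was serving.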

+11 -11

tools/testing/selftests/ublk/null.c

···
 }
 
 static void __setup_nop_io(int tag, const struct ublksrv_io_desc *iod,
-		struct io_uring_sqe *sqe)
+		struct io_uring_sqe *sqe, int q_id)
 {
 	unsigned ublk_op = ublksrv_get_op(iod);
 
···
 	sqe->flags |= IOSQE_FIXED_FILE;
 	sqe->rw_flags = IORING_NOP_FIXED_BUFFER | IORING_NOP_INJECT_RESULT;
 	sqe->len = iod->nr_sectors << 9; /* injected result */
-	sqe->user_data = build_user_data(tag, ublk_op, 0, 1);
+	sqe->user_data = build_user_data(tag, ublk_op, 0, q_id, 1);
 }
 
 static int null_queue_zc_io(struct ublk_queue *q, int tag)
···
 	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
 	struct io_uring_sqe *sqe[3];
 
-	ublk_queue_alloc_sqes(q, sqe, 3);
+	ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 3);
 
-	io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag);
+	io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
 	sqe[0]->user_data = build_user_data(tag,
-			ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1);
+			ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
 	sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
 
-	__setup_nop_io(tag, iod, sqe[1]);
+	__setup_nop_io(tag, iod, sqe[1], q->q_id);
 	sqe[1]->flags |= IOSQE_IO_HARDLINK;
 
-	io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, tag);
-	sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, 1);
+	io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+	sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);
 
 	// buf register is marked as IOSQE_CQE_SKIP_SUCCESS
 	return 2;
···
 	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
 	struct io_uring_sqe *sqe[1];
 
-	ublk_queue_alloc_sqes(q, sqe, 1);
-	__setup_nop_io(tag, iod, sqe[0]);
+	ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1);
+	__setup_nop_io(tag, iod, sqe[0], q->q_id);
 	return 1;
 }
 
···
 {
 	if (q->state & UBLKSRV_AUTO_BUF_REG_FALLBACK)
 		return (unsigned short)-1;
-	return tag;
+	return q->ios[tag].buf_index;
 }
 
 const struct ublk_tgt_ops null_tgt_ops = {

+9 -8

tools/testing/selftests/ublk/stripe.c

···
 	io->private_data = s;
 	calculate_stripe_array(conf, iod, s, base);
 
-	ublk_queue_alloc_sqes(q, sqe, s->nr + extra);
+	ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, s->nr + extra);
 
 	if (zc) {
-		io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag);
+		io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, io->buf_index);
 		sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
 		sqe[0]->user_data = build_user_data(tag,
-			ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1);
+			ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
 	}
 
 	for (i = zc; i < s->nr + extra - zc; i++) {
···
 			sqe[i]->flags |= IOSQE_IO_HARDLINK;
 		}
 		/* bit63 marks us as tgt io */
-		sqe[i]->user_data = build_user_data(tag, ublksrv_get_op(iod), i - zc, 1);
+		sqe[i]->user_data = build_user_data(tag, ublksrv_get_op(iod), i - zc, q->q_id, 1);
 	}
 	if (zc) {
 		struct io_uring_sqe *unreg = sqe[s->nr + 1];
 
-		io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, tag);
-		unreg->user_data = build_user_data(tag, ublk_cmd_op_nr(unreg->cmd_op), 0, 1);
+		io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, io->buf_index);
+		unreg->user_data = build_user_data(
+			tag, ublk_cmd_op_nr(unreg->cmd_op), 0, q->q_id, 1);
 	}
 
 	/* register buffer is skip_success */
···
 	struct io_uring_sqe *sqe[NR_STRIPE];
 	int i;
 
-	ublk_queue_alloc_sqes(q, sqe, conf->nr_files);
+	ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, conf->nr_files);
 	for (i = 0; i < conf->nr_files; i++) {
 		io_uring_prep_fsync(sqe[i], i + 1, IORING_FSYNC_DATASYNC);
 		io_uring_sqe_set_flags(sqe[i], IOSQE_FIXED_FILE);
-		sqe[i]->user_data = build_user_data(tag, UBLK_IO_OP_FLUSH, 0, 1);
+		sqe[i]->user_data = build_user_data(tag, UBLK_IO_OP_FLUSH, 0, q->q_id, 1);
 	}
 	return conf->nr_files;
 }

+5

tools/testing/selftests/ublk/test_common.sh

···
 	fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio \
 		--rw=randrw --norandommap --iodepth=256 --size="${size}" --numjobs="$(nproc)" \
 		--runtime=20 --time_based > /dev/null 2>&1 &
+	fio --name=batchjob --filename=/dev/ublkb"${dev_id}" --ioengine=io_uring \
+		--rw=randrw --norandommap --iodepth=256 --size="${size}" \
+		--numjobs="$(nproc)" --runtime=20 --time_based \
+		--iodepth_batch_submit=32 --iodepth_batch_complete_min=32 \
+		--force_async=7 > /dev/null 2>&1 &
 	sleep 2
 	if [ "${kill_server}" = "yes" ]; then
 		local state

+55

tools/testing/selftests/ublk/test_generic_12.sh

···
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_12"
+ERR_CODE=0
+
+if ! _have_program bpftrace; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "null" "do imbalanced load, it should be balanced over I/O threads"
+
+NTHREADS=6
+dev_id=$(_add_ublk_dev -t null -q 4 -d 16 --nthreads $NTHREADS --per_io_tasks)
+_check_add_dev $TID $?
+
+dev_t=$(_get_disk_dev_t "$dev_id")
+bpftrace trace/count_ios_per_tid.bt "$dev_t" > "$UBLK_TMP" 2>&1 &
+btrace_pid=$!
+sleep 2
+
+if ! kill -0 "$btrace_pid" > /dev/null 2>&1; then
+	_cleanup_test "null"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# do imbalanced I/O on the ublk device
+# pin to cpu 0 to prevent migration/only target one queue
+fio --name=write_seq \
+	--filename=/dev/ublkb"${dev_id}" \
+	--ioengine=libaio --iodepth=16 \
+	--rw=write \
+	--size=512M \
+	--direct=1 \
+	--bs=4k \
+	--cpus_allowed=0 > /dev/null 2>&1
+ERR_CODE=$?
+kill "$btrace_pid"
+wait
+
+# check that every task handles some I/O, even though all I/O was issued
+# from a single CPU. when ublk gets support for round-robin tag
+# allocation, this check can be strengthened to assert that every thread
+# handles the same number of I/Os
+NR_THREADS_THAT_HANDLED_IO=$(grep -c '@' ${UBLK_TMP})
+if [[ $NR_THREADS_THAT_HANDLED_IO -ne $NTHREADS ]]; then
+	echo "only $NR_THREADS_THAT_HANDLED_IO handled I/O! expected $NTHREADS"
+	cat "$UBLK_TMP"
+	ERR_CODE=255
+fi
+
+_cleanup_test "null"
+_show_result $TID $ERR_CODE

+8

tools/testing/selftests/ublk/test_stress_03.sh

···
 fi
 wait
 
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_remove 8G -t null -q 4 --auto_zc --nthreads 8 --per_io_tasks &
+	ublk_io_and_remove 256M -t loop -q 4 --auto_zc --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_remove 256M -t stripe -q 4 --auto_zc --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+	ublk_io_and_remove 8G -t null -q 4 -z --auto_zc --auto_zc_fallback --nthreads 8 --per_io_tasks &
+fi
+wait
+
 _cleanup_test "stress"
 _show_result $TID $ERR_CODE

+7

tools/testing/selftests/ublk/test_stress_04.sh

···
 	ublk_io_and_kill_daemon 256M -t stripe -q 4 --auto_zc "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
 	ublk_io_and_kill_daemon 8G -t null -q 4 -z --auto_zc --auto_zc_fallback &
 fi
+
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_kill_daemon 8G -t null -q 4 --nthreads 8 --per_io_tasks &
+	ublk_io_and_kill_daemon 256M -t loop -q 4 --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_kill_daemon 256M -t stripe -q 4 --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+	ublk_io_and_kill_daemon 8G -t null -q 4 --nthreads 8 --per_io_tasks &
+fi
 wait
 
 _cleanup_test "stress"

+7

tools/testing/selftests/ublk/test_stress_05.sh

···
 	done
 fi
 
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_remove 8G -t null -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" &
+	ublk_io_and_remove 256M -t loop -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_remove 8G -t null -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" &
+fi
+wait
+
 _cleanup_test "stress"
 _show_result $TID $ERR_CODE

+11

tools/testing/selftests/ublk/trace/count_ios_per_tid.bt

···
+/*
+ * Tabulates and prints I/O completions per thread for the given device
+ *
+ * $1: dev_t
+ */
+tracepoint:block:block_rq_complete
+{
+	if (args.dev == $1) {
+		@[tid] = count();
+	}
+}
