Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'for-linus-20190502' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"This is mostly io_uring fixes/tweaks. Most of these were actually done
in time for the last -rc, but I wanted to ensure that everything
tested out great before including them. The code delta looks larger
than it really is, as it's mostly just comment additions/changes.

Outside of the comment additions/changes, this is mostly removal of
unnecessary barriers. In all, this pull request contains:

- Tweak to how we handle errors at submission time. We now post a
completion event if the error occurs on behalf of an sqe, instead
of returning it through the system call. If the error happens
outside of a specific sqe, we return the error through the system
call. This makes it nicer to use and makes the "normal" use case
behave the same as the offload cases. (me)

- Fix for a missing req reference drop from async context (me)

- If an sqe is submitted with RWF_NOWAIT, don't punt it to async
context. Return -EAGAIN directly, instead of using it as a hint to
do async punt. (Stefan)

- Fix notes on barriers (Stefan)

- Remove unnecessary barriers (Stefan)

- Fix potential double free of memory in setup error (Mark)

- Further improve sq poll CPU validation (Mark)

- Fix page allocation warning and leak on buffer registration error
(Mark)

- Fix iov_iter_type() for new no-ref flag (Ming)

- Fix a case where dio doesn't honor bio no-page-ref (Ming)"
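
Seen from the application, the first and third items above mean that a per-request failure now arrives as a completion event with a negative result, and that an RWF_NOWAIT request which would block completes with -EAGAIN. A minimal sketch of how a consumer might classify such events follows. The `demo_cqe` and `demo_stats` types and the `demo_reap` helper are hypothetical stand-ins for illustration, not the kernel or liburing API.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct io_uring_cqe; only the fields used here. */
struct demo_cqe {
	uint64_t user_data;	/* copied from the sqe that produced this event */
	int32_t res;		/* >= 0 on success, negative errno on failure */
};

/* Hypothetical per-batch counters. */
struct demo_stats {
	unsigned ok;		/* completed successfully */
	unsigned retry;		/* -EAGAIN, e.g. a NOWAIT request that would block */
	unsigned failed;	/* any other negative result */
};

/*
 * Classify a batch of completion events. With this pull, an error that
 * belongs to a specific sqe shows up here as a negative res instead of
 * as the return value of the submitting system call.
 */
static struct demo_stats demo_reap(const struct demo_cqe *cqes, size_t n)
{
	struct demo_stats st = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (cqes[i].res >= 0)
			st.ok++;
		else if (cqes[i].res == -EAGAIN)
			st.retry++;
		else
			st.failed++;
	}
	return st;
}
```

The point of the sketch is that error handling lives in one place, the completion path, whether the request was submitted inline or through an offload path.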

* tag 'for-linus-20190502' of git://git.kernel.dk/linux-block:
io_uring: avoid page allocation warnings
iov_iter: fix iov_iter_type
block: fix handling for BIO_NO_PAGE_REF
io_uring: drop req submit reference always in async punt
io_uring: free allocated io_memory once
io_uring: fix SQPOLL cpu validation
io_uring: have submission side sqe errors post a cqe
io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP
io_uring: remove unnecessary barrier after incrementing dropped counter
io_uring: remove unnecessary barrier before reading SQ tail
io_uring: remove unnecessary barrier after updating SQ head
io_uring: remove unnecessary barrier before reading cq head
io_uring: remove unnecessary barrier before wq_has_sleeper
io_uring: fix notes on barriers
io_uring: fix handling SQEs requesting NOWAIT

+161 -93
+2 -1
fs/block_dev.c
···
 	bio_for_each_segment_all(bvec, &bio, i, iter_all) {
 		if (should_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		if (!bio_flagged(&bio, BIO_NO_PAGE_REF))
+			put_page(bvec->bv_page);
 	}
 
 	if (unlikely(bio.bi_status))
+158 -91
fs/io_uring.c
···
  * supporting fast/efficient IO.
  *
  * A note on the read/write ordering memory barriers that are matched between
- * the application and kernel side. When the application reads the CQ ring
- * tail, it must use an appropriate smp_rmb() to order with the smp_wmb()
- * the kernel uses after writing the tail. Failure to do so could cause a
- * delay in when the application notices that completion events available.
- * This isn't a fatal condition. Likewise, the application must use an
- * appropriate smp_wmb() both before writing the SQ tail, and after writing
- * the SQ tail. The first one orders the sqe writes with the tail write, and
- * the latter is paired with the smp_rmb() the kernel will issue before
- * reading the SQ tail on submission.
+ * the application and kernel side.
+ *
+ * After the application reads the CQ ring tail, it must use an
+ * appropriate smp_rmb() to pair with the smp_wmb() the kernel uses
+ * before writing the tail (using smp_load_acquire to read the tail will
+ * do). It also needs a smp_mb() before updating CQ head (ordering the
+ * entry load(s) with the head store), pairing with an implicit barrier
+ * through a control-dependency in io_get_cqring (smp_store_release to
+ * store head will do). Failure to do so could lead to reading invalid
+ * CQ entries.
+ *
+ * Likewise, the application must use an appropriate smp_wmb() before
+ * writing the SQ tail (ordering SQ entry stores with the tail store),
+ * which pairs with smp_load_acquire in io_get_sqring (smp_store_release
+ * to store the tail will do). And it needs a barrier ordering the SQ
+ * head load before writing new SQ entries (smp_load_acquire to read
+ * head will do).
+ *
+ * When using the SQ poll thread (IORING_SETUP_SQPOLL), the application
+ * needs to check the SQ flags for IORING_SQ_NEED_WAKEUP *after*
+ * updating the SQ tail; a full memory barrier smp_mb() is needed
+ * between.
  *
  * Also see the examples in the liburing library:
  *
···
 	u32 tail ____cacheline_aligned_in_smp;
 };
 
+/*
+ * This data is shared with the application through the mmap at offset
+ * IORING_OFF_SQ_RING.
+ *
+ * The offsets to the member fields are published through struct
+ * io_sqring_offsets when calling io_uring_setup.
+ */
 struct io_sq_ring {
+	/*
+	 * Head and tail offsets into the ring; the offsets need to be
+	 * masked to get valid indices.
+	 *
+	 * The kernel controls head and the application controls tail.
+	 */
 	struct io_uring r;
+	/*
+	 * Bitmask to apply to head and tail offsets (constant, equals
+	 * ring_entries - 1)
+	 */
 	u32 ring_mask;
+	/* Ring size (constant, power of 2) */
 	u32 ring_entries;
+	/*
+	 * Number of invalid entries dropped by the kernel due to
+	 * invalid index stored in array
+	 *
+	 * Written by the kernel, shouldn't be modified by the
+	 * application (i.e. get number of "new events" by comparing to
+	 * cached value).
+	 *
+	 * After a new SQ head value was read by the application this
+	 * counter includes all submissions that were dropped reaching
+	 * the new SQ head (and possibly more).
+	 */
 	u32 dropped;
+	/*
+	 * Runtime flags
+	 *
+	 * Written by the kernel, shouldn't be modified by the
+	 * application.
+	 *
+	 * The application needs a full memory barrier before checking
+	 * for IORING_SQ_NEED_WAKEUP after updating the sq tail.
+	 */
 	u32 flags;
+	/*
+	 * Ring buffer of indices into array of io_uring_sqe, which is
+	 * mmapped by the application using the IORING_OFF_SQES offset.
+	 *
+	 * This indirection could e.g. be used to assign fixed
+	 * io_uring_sqe entries to operations and only submit them to
+	 * the queue when needed.
+	 *
+	 * The kernel modifies neither the indices array nor the entries
+	 * array.
+	 */
 	u32 array[];
 };
 
+/*
+ * This data is shared with the application through the mmap at offset
+ * IORING_OFF_CQ_RING.
+ *
+ * The offsets to the member fields are published through struct
+ * io_cqring_offsets when calling io_uring_setup.
+ */
 struct io_cq_ring {
+	/*
+	 * Head and tail offsets into the ring; the offsets need to be
+	 * masked to get valid indices.
+	 *
+	 * The application controls head and the kernel tail.
+	 */
 	struct io_uring r;
+	/*
+	 * Bitmask to apply to head and tail offsets (constant, equals
+	 * ring_entries - 1)
+	 */
 	u32 ring_mask;
+	/* Ring size (constant, power of 2) */
 	u32 ring_entries;
+	/*
+	 * Number of completion events lost because the queue was full;
+	 * this should be avoided by the application by making sure
+	 * there are not more requests pending thatn there is space in
+	 * the completion queue.
+	 *
+	 * Written by the kernel, shouldn't be modified by the
+	 * application (i.e. get number of "new events" by comparing to
+	 * cached value).
+	 *
+	 * As completion events come in out of order this counter is not
+	 * ordered with any other data.
+	 */
 	u32 overflow;
+	/*
+	 * Ring buffer of completion events.
+	 *
+	 * The kernel writes completion events fresh every time they are
+	 * produced, so the application is allowed to modify pending
+	 * entries.
+	 */
 	struct io_uring_cqe cqes[];
 };
 
···
 	struct list_head list;
 	unsigned int flags;
 	refcount_t refs;
-#define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */
+#define REQ_F_NOWAIT 1 /* must not punt to workers */
 #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */
 #define REQ_F_FIXED_FILE 4 /* ctx owns file */
 #define REQ_F_SEQ_PREV 8 /* sequential with previous */
···
 		/* order cqe stores with ring update */
 		smp_store_release(&ring->r.tail, ctx->cached_cq_tail);
 
-		/*
-		 * Write sider barrier of tail update, app has read side. See
-		 * comment at the top of this file.
-		 */
-		smp_wmb();
-
 		if (wq_has_sleeper(&ctx->cq_wait)) {
 			wake_up_interruptible(&ctx->cq_wait);
 			kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
···
 	unsigned tail;
 
 	tail = ctx->cached_cq_tail;
-	/* See comment at the top of the file */
-	smp_rmb();
+	/*
+	 * writes to the cq entry need to come after reading head; the
+	 * control dependency is enough as we're using WRITE_ONCE to
+	 * fill the cq entry
+	 */
 	if (tail - READ_ONCE(ring->r.head) == ring->ring_entries)
 		return NULL;
 
···
 	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
 	if (unlikely(ret))
 		return ret;
-	if (force_nonblock) {
+
+	/* don't allow async punt if RWF_NOWAIT was requested */
+	if (kiocb->ki_flags & IOCB_NOWAIT)
+		req->flags |= REQ_F_NOWAIT;
+
+	if (force_nonblock)
 		kiocb->ki_flags |= IOCB_NOWAIT;
-		req->flags |= REQ_F_FORCE_NONBLOCK;
-	}
+
 	if (ctx->flags & IORING_SETUP_IOPOLL) {
 		if (!(kiocb->ki_flags & IOCB_DIRECT) ||
 		    !kiocb->ki_filp->f_op->iopoll)
···
 		struct sqe_submit *s = &req->submit;
 		const struct io_uring_sqe *sqe = s->sqe;
 
-		/* Ensure we clear previously set forced non-block flag */
-		req->flags &= ~REQ_F_FORCE_NONBLOCK;
+		/* Ensure we clear previously set non-block flag */
 		req->rw.ki_flags &= ~IOCB_NOWAIT;
 
 		ret = 0;
···
 					break;
 				cond_resched();
 			} while (1);
-
-			/* drop submission reference */
-			io_put_req(req);
 		}
+
+		/* drop submission reference */
+		io_put_req(req);
+
 		if (ret) {
 			io_cqring_add_event(ctx, sqe->user_data, ret, 0);
 			io_put_req(req);
···
 		goto out;
 
 	ret = __io_submit_sqe(ctx, req, s, true);
-	if (ret == -EAGAIN) {
+	if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) {
 		struct io_uring_sqe *sqe_copy;
 
 		sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
···
 		 * write new data to them.
 		 */
 		smp_store_release(&ring->r.head, ctx->cached_sq_head);
-
-		/*
-		 * write side barrier of head update, app has read side. See
-		 * comment at the top of this file
-		 */
-		smp_wmb();
 	}
-}
-
-/*
- * Undo last io_get_sqring()
- */
-static void io_drop_sqring(struct io_ring_ctx *ctx)
-{
-	ctx->cached_sq_head--;
 }
 
 /*
···
 	 * though the application is the one updating it.
 	 */
 	head = ctx->cached_sq_head;
-	/* See comment at the top of this file */
-	smp_rmb();
 	/* make sure SQ entry isn't read before tail */
 	if (head == smp_load_acquire(&ring->r.tail))
 		return false;
···
 	/* drop invalid entries */
 	ctx->cached_sq_head++;
 	ring->dropped++;
-	/* See comment at the top of this file */
-	smp_wmb();
 	return false;
 }
 
···
 				finish_wait(&ctx->sqo_wait, &wait);
 
 				ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
-				smp_wmb();
 				continue;
 			}
 			finish_wait(&ctx->sqo_wait, &wait);
 
 			ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
-			smp_wmb();
 		}
 
 		i = 0;
···
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
 	struct io_submit_state state, *statep = NULL;
-	int i, ret = 0, submit = 0;
+	int i, submit = 0;
 
 	if (to_submit > IO_PLUG_THRESHOLD) {
 		io_submit_state_start(&state, ctx, to_submit);
···
 
 	for (i = 0; i < to_submit; i++) {
 		struct sqe_submit s;
+		int ret;
 
 		if (!io_get_sqring(ctx, &s))
 			break;
···
 		s.has_user = true;
 		s.needs_lock = false;
 		s.needs_fixed_file = false;
+		submit++;
 
 		ret = io_submit_sqe(ctx, &s, statep);
-		if (ret) {
-			io_drop_sqring(ctx);
-			break;
-		}
-
-		submit++;
+		if (ret)
+			io_cqring_add_event(ctx, s.sqe->user_data, ret, 0);
 	}
 	io_commit_sqring(ctx);
 
 	if (statep)
 		io_submit_state_end(statep);
 
-	return submit ? submit : ret;
+	return submit;
 }
 
 static unsigned io_cqring_events(struct io_cq_ring *ring)
···
 	mmgrab(current->mm);
 	ctx->sqo_mm = current->mm;
 
-	ret = -EINVAL;
-	if (!cpu_possible(p->sq_thread_cpu))
-		goto err;
-
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
 		ret = -EPERM;
 		if (!capable(CAP_SYS_ADMIN))
···
 			ctx->sq_thread_idle = HZ;
 
 		if (p->flags & IORING_SETUP_SQ_AFF) {
-			int cpu;
+			int cpu = array_index_nospec(p->sq_thread_cpu,
+							nr_cpu_ids);
 
-			cpu = array_index_nospec(p->sq_thread_cpu, NR_CPUS);
 			ret = -EINVAL;
-			if (!cpu_possible(p->sq_thread_cpu))
+			if (!cpu_possible(cpu))
 				goto err;
 
 			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
···
 
 static void io_mem_free(void *ptr)
 {
-	struct page *page = virt_to_head_page(ptr);
+	struct page *page;
 
+	if (!ptr)
+		return;
+
+	page = virt_to_head_page(ptr);
 	if (put_page_testzero(page))
 		free_compound_page(page);
 }
···
 
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
-		kfree(imu->bvec);
+		kvfree(imu->bvec);
 		imu->nr_bvecs = 0;
 	}
 
···
 		if (!pages || nr_pages > got_pages) {
 			kfree(vmas);
 			kfree(pages);
-			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+			pages = kvmalloc_array(nr_pages, sizeof(struct page *),
 						GFP_KERNEL);
-			vmas = kmalloc_array(nr_pages,
+			vmas = kvmalloc_array(nr_pages,
 					sizeof(struct vm_area_struct *),
 					GFP_KERNEL);
 			if (!pages || !vmas) {
···
 			got_pages = nr_pages;
 		}
 
-		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+		imu->bvec = kvmalloc_array(nr_pages, sizeof(struct bio_vec),
 						GFP_KERNEL);
 		ret = -ENOMEM;
 		if (!imu->bvec) {
···
 			}
 			if (ctx->account_mem)
 				io_unaccount_mem(ctx->user, nr_pages);
+			kvfree(imu->bvec);
 			goto err;
 		}
 
···
 
 		ctx->nr_user_bufs++;
 	}
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	return 0;
 err:
-	kfree(pages);
-	kfree(vmas);
+	kvfree(pages);
+	kvfree(vmas);
 	io_sqe_buffer_unregister(ctx);
 	return ret;
 }
···
 	__poll_t mask = 0;
 
 	poll_wait(file, &ctx->cq_wait, wait);
-	/* See comment at the top of this file */
+	/*
+	 * synchronizes with barrier from wq_has_sleeper call in
+	 * io_commit_cqring
+	 */
 	smp_rmb();
 	if (READ_ONCE(ctx->sq_ring->r.tail) - ctx->cached_sq_head !=
 	    ctx->sq_ring->ring_entries)
···
 		mutex_lock(&ctx->uring_lock);
 		submitted = io_ring_submit(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
-
-		if (submitted < 0)
-			goto out_ctx;
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
 		unsigned nr_events = 0;
 
 		min_complete = min(min_complete, ctx->cq_entries);
-
-		/*
-		 * The application could have included the 'to_submit' count
-		 * in how many events it wanted to wait for. If we failed to
-		 * submit the desired count, we may need to adjust the number
-		 * of events to poll/wait for.
-		 */
-		if (submitted < to_submit)
-			min_complete = min_t(unsigned, submitted, min_complete);
 
 		if (ctx->flags & IORING_SETUP_IOPOLL) {
 			mutex_lock(&ctx->uring_lock);
···
 		return -EOVERFLOW;
 
 	ctx->sq_sqes = io_mem_alloc(size);
-	if (!ctx->sq_sqes) {
-		io_mem_free(ctx->sq_ring);
+	if (!ctx->sq_sqes)
 		return -ENOMEM;
-	}
 
 	cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries));
-	if (!cq_ring) {
-		io_mem_free(ctx->sq_ring);
-		io_mem_free(ctx->sq_sqes);
+	if (!cq_ring)
 		return -ENOMEM;
-	}
 
 	ctx->cq_ring = cq_ring;
 	cq_ring->ring_mask = p->cq_entries - 1;
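
The revised header comment in fs/io_uring.c describes the ordering the application side has to provide. A userspace sketch of that pairing with C11 atomics follows. It assumes that acquire loads and release stores are an acceptable analogue of smp_load_acquire and smp_store_release, and that a sequentially consistent fence stands in for smp_mb(). The `demo_ring` layout and the `demo_*` helpers are hypothetical and do not mirror the real mmap'ed rings.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ENTRIES 8u			/* ring size, power of 2 */
#define DEMO_MASK (DEMO_ENTRIES - 1u)	/* applied to head/tail to index */
#define DEMO_NEED_WAKEUP 1u		/* stands in for IORING_SQ_NEED_WAKEUP */

/* Hypothetical single-producer/single-consumer ring, not the kernel layout. */
struct demo_ring {
	_Atomic uint32_t head;		/* advanced by the consumer */
	_Atomic uint32_t tail;		/* advanced by the producer */
	_Atomic uint32_t flags;
	uint32_t slots[DEMO_ENTRIES];
};

/* Producer side, like the application filling the SQ ring. */
static bool demo_push(struct demo_ring *r, uint32_t value)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	/* acquire: read head before reusing a slot the consumer may still read */
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (tail - head == DEMO_ENTRIES)
		return false;		/* ring is full */

	r->slots[tail & DEMO_MASK] = value;
	/* release: the slot store becomes visible before the new tail */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}

/* Consumer side, like the application reaping the CQ ring. */
static bool demo_pop(struct demo_ring *r, uint32_t *value)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	/* acquire: read tail before reading the entry it publishes */
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head == tail)
		return false;		/* ring is empty */

	*value = r->slots[head & DEMO_MASK];
	/* release: the entry load completes before the slot is handed back */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* SQPOLL-style check: full barrier between the tail store and the flags load. */
static bool demo_needs_wakeup(struct demo_ring *r)
{
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&r->flags, memory_order_relaxed) &
		DEMO_NEED_WAKEUP) != 0;
}
```

Head and tail are free-running counters here, so `tail - head` gives the fill level even after wraparound, which matches the masked-offset scheme the struct comments describe.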
+1 -1
include/linux/uio.h
···
 
 static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 {
-	return i->type & ~(READ | WRITE);
+	return i->type & ~(READ | WRITE | ITER_BVEC_FLAG_NO_REF);
 }
 
 static inline bool iter_is_iovec(const struct iov_iter *i)
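
The iov_iter_type() change is a plain masking fix: a modifier bit stored in the same word as the type has to be cleared before the type is compared. The sketch below shows the effect with made-up bit values. The `DEMO_*` constants and both helpers are hypothetical, and the numeric values are assumptions chosen for illustration rather than the ones in include/linux/uio.h.

```c
/* Illustrative bit values only; assumed, not taken from the kernel headers. */
enum {
	DEMO_READ = 0,		/* direction, like READ */
	DEMO_WRITE = 1,		/* direction, like WRITE */
	DEMO_FLAG_NO_REF = 2,	/* modifier, like ITER_BVEC_FLAG_NO_REF */
	DEMO_ITER_BVEC = 16	/* iterator type, like ITER_BVEC */
};

/* Old behaviour: only the direction bits are stripped. */
static unsigned demo_type_old(unsigned type)
{
	return type & ~(unsigned)(DEMO_READ | DEMO_WRITE);
}

/* New behaviour: the no-ref modifier is stripped as well. */
static unsigned demo_type_new(unsigned type)
{
	return type & ~(unsigned)(DEMO_READ | DEMO_WRITE | DEMO_FLAG_NO_REF);
}
```

With the modifier set, the old helper returns a value that equals no type constant, so an equality test against the bvec type fails even though the iterator is a bvec iterator. The new helper returns the bare type.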