Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

aio: keep poll requests on waitqueue until completed

Currently, aio_poll_wake() will always remove the poll request from the
waitqueue. Then, if aio_poll_complete_work() sees that none of the
polled events are ready and the request isn't cancelled, it re-adds the
request to the waitqueue. (This can easily happen when polling a file
that doesn't pass an event mask when waking up its waitqueue.)

This is fundamentally broken for two reasons:

1. If a wakeup occurs between vfs_poll() and the request being
re-added to the waitqueue, it will be missed because the request
wasn't on the waitqueue at the time. Therefore, IOCB_CMD_POLL
might never complete even if the polled file is ready.

2. When the request isn't on the waitqueue, there is no way to be
notified that the waitqueue is being freed (which happens when its
lifetime is shorter than the struct file's). This is supposed to
happen via the waitqueue entries being woken up with POLLFREE.

Therefore, leave the requests on the waitqueue until they are actually
completed (or cancelled). To keep track of when aio_poll_complete_work
needs to be scheduled, use new fields in struct poll_iocb. Remove the
'done' field which is now redundant.

Note that this is consistent with how sys_poll() and eventpoll work;
their wakeup functions do *not* remove the waitqueue entries.

Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>

+63 -20
+63 -20
fs/aio.c
··· 181 181 struct file *file; 182 182 struct wait_queue_head *head; 183 183 __poll_t events; 184 - bool done; 185 184 bool cancelled; 185 + bool work_scheduled; 186 + bool work_need_resched; 186 187 struct wait_queue_entry wait; 187 188 struct work_struct work; 188 189 }; ··· 1639 1638 * avoid further branches in the fast path. 1640 1639 */ 1641 1640 spin_lock_irq(&ctx->ctx_lock); 1641 + spin_lock(&req->head->lock); 1642 1642 if (!mask && !READ_ONCE(req->cancelled)) { 1643 - add_wait_queue(req->head, &req->wait); 1643 + /* 1644 + * The request isn't actually ready to be completed yet. 1645 + * Reschedule completion if another wakeup came in. 1646 + */ 1647 + if (req->work_need_resched) { 1648 + schedule_work(&req->work); 1649 + req->work_need_resched = false; 1650 + } else { 1651 + req->work_scheduled = false; 1652 + } 1653 + spin_unlock(&req->head->lock); 1644 1654 spin_unlock_irq(&ctx->ctx_lock); 1645 1655 return; 1646 1656 } 1657 + list_del_init(&req->wait.entry); 1658 + spin_unlock(&req->head->lock); 1647 1659 list_del_init(&iocb->ki_list); 1648 1660 iocb->ki_res.res = mangle_poll(mask); 1649 - req->done = true; 1650 1661 spin_unlock_irq(&ctx->ctx_lock); 1651 1662 1652 1663 iocb_put(iocb); ··· 1672 1659 1673 1660 spin_lock(&req->head->lock); 1674 1661 WRITE_ONCE(req->cancelled, true); 1675 - if (!list_empty(&req->wait.entry)) { 1676 - list_del_init(&req->wait.entry); 1662 + if (!req->work_scheduled) { 1677 1663 schedule_work(&aiocb->poll.work); 1664 + req->work_scheduled = true; 1678 1665 } 1679 1666 spin_unlock(&req->head->lock); 1680 1667 ··· 1693 1680 if (mask && !(mask & req->events)) 1694 1681 return 0; 1695 1682 1696 - list_del_init(&req->wait.entry); 1697 - 1698 - if (mask && spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) { 1683 + /* 1684 + * Complete the request inline if possible. This requires that three 1685 + * conditions be met: 1686 + * 1. An event mask must have been passed. If a plain wakeup was done 1687 + * instead, then mask == 0 and we have to call vfs_poll() to get 1688 + * the events, so inline completion isn't possible. 1689 + * 2. The completion work must not have already been scheduled. 1690 + * 3. ctx_lock must not be busy. We have to use trylock because we 1691 + * already hold the waitqueue lock, so this inverts the normal 1692 + * locking order. Use irqsave/irqrestore because not all 1693 + * filesystems (e.g. fuse) call this function with IRQs disabled, 1694 + * yet IRQs have to be disabled before ctx_lock is obtained. 1695 + */ 1696 + if (mask && !req->work_scheduled && 1697 + spin_trylock_irqsave(&iocb->ki_ctx->ctx_lock, flags)) { 1699 1698 struct kioctx *ctx = iocb->ki_ctx; 1700 1699 1701 - /* 1702 - * Try to complete the iocb inline if we can. Use 1703 - * irqsave/irqrestore because not all filesystems (e.g. fuse) 1704 - * call this function with IRQs disabled and because IRQs 1705 - * have to be disabled before ctx_lock is obtained. 1706 - */ 1700 + list_del_init(&req->wait.entry); 1707 1701 list_del(&iocb->ki_list); 1708 1702 iocb->ki_res.res = mangle_poll(mask); 1709 - req->done = true; 1710 1703 if (iocb->ki_eventfd && eventfd_signal_allowed()) { 1711 1704 iocb = NULL; 1712 1705 INIT_WORK(&req->work, aio_poll_put_work); ··· 1722 1703 if (iocb) 1723 1704 iocb_put(iocb); 1724 1705 } else { 1725 - schedule_work(&req->work); 1706 + /* 1707 + * Schedule the completion work if needed. If it was already 1708 + * scheduled, record that another wakeup came in. 1709 + * 1710 + * Don't remove the request from the waitqueue here, as it might 1711 + * not actually be complete yet (we won't know until vfs_poll() 1712 + * is called), and we must not miss any wakeups. 1713 + */ 1714 + if (req->work_scheduled) { 1715 + req->work_need_resched = true; 1716 + } else { 1717 + schedule_work(&req->work); 1718 + req->work_scheduled = true; 1719 + } 1726 1720 } 1727 1721 return 1; 1728 1722 } ··· 1782 1750 req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP; 1783 1751 1784 1752 req->head = NULL; 1785 - req->done = false; 1786 1753 req->cancelled = false; 1754 + req->work_scheduled = false; 1755 + req->work_need_resched = false; 1787 1756 1788 1757 apt.pt._qproc = aio_poll_queue_proc; 1789 1758 apt.pt._key = req->events; ··· 1799 1766 spin_lock_irq(&ctx->ctx_lock); 1800 1767 if (likely(req->head)) { 1801 1768 spin_lock(&req->head->lock); 1802 - if (unlikely(list_empty(&req->wait.entry))) { 1803 - if (apt.error) 1769 + if (list_empty(&req->wait.entry) || req->work_scheduled) { 1770 + /* 1771 + * aio_poll_wake() already either scheduled the async 1772 + * completion work, or completed the request inline. 1773 + */ 1774 + if (apt.error) /* unsupported case: multiple queues */ 1804 1775 cancel = true; 1805 1776 apt.error = 0; 1806 1777 mask = 0; 1807 1778 } 1808 1779 if (mask || apt.error) { 1780 + /* Steal to complete synchronously. */ 1809 1781 list_del_init(&req->wait.entry); 1810 1782 } else if (cancel) { 1783 + /* Cancel if possible (may be too late though). */ 1811 1784 WRITE_ONCE(req->cancelled, true); 1812 - } else if (!req->done) { /* actually waiting for an event */ 1785 + } else if (!list_empty(&req->wait.entry)) { 1786 + /* 1787 + * Actually waiting for an event, so add the request to 1788 + * active_reqs so that it can be cancelled if needed. 1789 + */ 1813 1790 list_add_tail(&aiocb->ki_list, &ctx->active_reqs); 1814 1791 aiocb->ki_cancel = aio_poll_cancel; 1815 1792 }