Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

fs/epoll: reduce the scope of wq lock in epoll_wait()

This patch aims at reducing ep wq.lock hold times in epoll_wait(2). For
the blocking case, there is no need to constantly take and drop the
spinlock, which is only needed to manipulate the waitqueue.

The call to ep_events_available() is now lockless, and only exposed to
benign races. On a false positive (it reports available events but does
not see another thread deleting an epi from the list), we call into
send_events, where the list's state is seen correctly. Otoh, on a false
negative (we don't see a list_add_tail(), for example, from the irq
callback), the condition is rechecked before blocking, which will see
the correct state.
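
The check-then-recheck reasoning above can be sketched in userspace. This
is only an analogy, assuming a pthread mutex and condition variable in
place of ep->wq.lock and the kernel waitqueue; every name here
(waiter_demo, demo_post, demo_wait, ...) is hypothetical and not part of
fs/eventpoll.c.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an epoll instance. */
struct waiter_demo {
	atomic_int ready;       /* number of pending events */
	pthread_mutex_t lock;   /* plays the role of ep->wq.lock */
	pthread_cond_t cond;    /* plays the role of the waitqueue */
};

static void demo_init(struct waiter_demo *d)
{
	atomic_init(&d->ready, 0);
	pthread_mutex_init(&d->lock, NULL);
	pthread_cond_init(&d->cond, NULL);
}

/* Lockless hint: may race with a producer, so callers must recheck. */
static bool demo_events_available(struct waiter_demo *d)
{
	return atomic_load_explicit(&d->ready, memory_order_acquire) > 0;
}

/* Producer: publish the event first, then wake a waiter under the lock. */
static void demo_post(struct waiter_demo *d)
{
	atomic_fetch_add_explicit(&d->ready, 1, memory_order_release);
	pthread_mutex_lock(&d->lock);
	pthread_cond_signal(&d->cond);
	pthread_mutex_unlock(&d->lock);
}

/* Consumer: lockless fast path, recheck before every block. */
static int demo_wait(struct waiter_demo *d)
{
	if (!demo_events_available(d)) {
		pthread_mutex_lock(&d->lock);
		/* A false negative above is caught by this recheck. */
		while (!demo_events_available(d))
			pthread_cond_wait(&d->cond, &d->lock);
		pthread_mutex_unlock(&d->lock);
	}
	/* Consume everything that was published so far. */
	return atomic_exchange(&d->ready, 0);
}

static void *demo_producer(void *arg)
{
	demo_post(arg);
	return NULL;
}

/* Runs one producer thread against one waiter; returns events consumed. */
int demo_run_with_producer(void)
{
	struct waiter_demo d;
	pthread_t t;
	int got;

	demo_init(&d);
	if (pthread_create(&t, NULL, demo_producer, &d) != 0)
		return -1;
	got = demo_wait(&d);
	pthread_join(t, NULL);
	pthread_cond_destroy(&d.cond);
	pthread_mutex_destroy(&d.lock);
	return got;
}
```

The lock is taken only around the blocking step; the availability check
itself never needs it, because a stale answer is corrected either by the
recheck or by the consumer finding nothing to consume.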

To observe a concurrent list_del_init() more accurately, use the
list_empty_careful() variant -- of course, this won't be safe against
insertions from wakeup.
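
To illustrate the difference between the two emptiness checks, here is a
minimal userspace sketch of a circular doubly linked list. The *_demo
names are hypothetical re-implementations, not the kernel's list.h, and
the "half deleted" state is simulated by hand rather than produced by a
real concurrent writer; the kernel's actual store order may differ.

```c
#include <assert.h>
#include <stdbool.h>

struct list_node {
	struct list_node *next, *prev;
};

static void list_init_demo(struct list_node *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail_demo(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del_init_demo(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init_demo(n);
}

/* Plain check: looks at head->next only. */
static bool list_empty_demo(const struct list_node *h)
{
	return h->next == h;
}

/* Careful check: both head pointers must agree that the list is empty. */
static bool list_empty_careful_demo(const struct list_node *h)
{
	const struct list_node *next = h->next;

	return next == h && next == h->prev;
}

/*
 * Hypothetical snapshot a lockless reader could observe if a deleter has
 * updated head->next but not yet head->prev.
 */
static void list_simulate_half_deleted(struct list_node *h)
{
	h->next = h;
}
```

In the simulated intermediate state the plain check reports "empty" while
the careful check still reports "not empty", which is the extra accuracy
the message refers to. Neither check helps against a concurrent insertion.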

For the overflow list we obviously need to prevent load/store tearing as
we don't want to see partial values while the ready list is disabled.
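
A rough userspace sketch of the overflow-list handling follows. The
READ_ONCE_DEMO/WRITE_ONCE_DEMO macros are simplified stand-ins built on
volatile accesses and the GCC/Clang __typeof__ extension; they are not the
kernel's definitions. The struct names and DEMO_INACTIVE sentinel are
likewise assumptions made for illustration.

```c
#include <stddef.h>

/* Simplified stand-ins: force a single, untorn access by the compiler. */
#define READ_ONCE_DEMO(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_DEMO(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct item {
	struct item *next;   /* DEMO_INACTIVE when not on the overflow list */
	int id;
};

struct pollset {
	struct item *ovflist; /* DEMO_INACTIVE while the ready list is in use */
};

/* Sentinel meaning "overflow list disabled" (hypothetical name). */
#define DEMO_INACTIVE ((struct item *)-1L)

static void pollset_init(struct pollset *p)
{
	WRITE_ONCE_DEMO(p->ovflist, DEMO_INACTIVE);
}

static int ovf_active(struct pollset *p)
{
	return READ_ONCE_DEMO(p->ovflist) != DEMO_INACTIVE;
}

/* Switch to the overflow list while the ready list is being scanned. */
static void ovf_enable(struct pollset *p)
{
	WRITE_ONCE_DEMO(p->ovflist, NULL);
}

/* Returns 1 if the overflow list took the item, 0 if it is disabled. */
static int ovf_push(struct pollset *p, struct item *it)
{
	if (!ovf_active(p))
		return 0;
	if (it->next == DEMO_INACTIVE) {   /* not queued yet */
		it->next = READ_ONCE_DEMO(p->ovflist);
		WRITE_ONCE_DEMO(p->ovflist, it);
	}
	return 1;
}

/* Walk and unlink all queued items, then disable the overflow list. */
static int ovf_drain(struct pollset *p)
{
	struct item *it, *next;
	int n = 0;

	if (!ovf_active(p))
		return 0;
	for (next = READ_ONCE_DEMO(p->ovflist); (it = next) != NULL;
	     next = it->next, it->next = DEMO_INACTIVE)
		n++;
	WRITE_ONCE_DEMO(p->ovflist, DEMO_INACTIVE);
	return n;
}
```

A lockless reader such as ovf_active() only compares the pointer against
the sentinel, so all it needs is a guarantee that the load is not split
into partial reads; that is what the *_ONCE accessors are for.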

[dave@stgolabs.net: forgotten fixlets]
Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5
Link: http://lkml.kernel.org/r/20181108051006.18751-6-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Suggested-by: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Davidlohr Bueso, committed by Linus Torvalds
c5a282e9 21877e1a

+68 -62
fs/eventpoll.c
···
  */
 static inline int ep_events_available(struct eventpoll *ep)
 {
-	return !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
+	return !list_empty_careful(&ep->rdllist) ||
+		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
 }

 #ifdef CONFIG_NET_RX_BUSY_POLL
···
 	 */
 	spin_lock_irq(&ep->wq.lock);
 	list_splice_init(&ep->rdllist, &txlist);
-	ep->ovflist = NULL;
+	WRITE_ONCE(ep->ovflist, NULL);
 	spin_unlock_irq(&ep->wq.lock);

 	/*
···
 	 * other events might have been queued by the poll callback.
 	 * We re-insert them inside the main ready-list here.
 	 */
-	for (nepi = ep->ovflist; (epi = nepi) != NULL;
+	for (nepi = READ_ONCE(ep->ovflist); (epi = nepi) != NULL;
 	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
 		/*
 		 * We need to check if the item is already in the list.
···
 	 * releasing the lock, events will be queued in the normal way inside
 	 * ep->rdllist.
 	 */
-	ep->ovflist = EP_UNACTIVE_PTR;
+	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);

 	/*
 	 * Quickly re-inject items left on "txlist".
···
 	 * semantics). All the events that happen during that period of time are
 	 * chained in ep->ovflist and requeued later on.
 	 */
-	if (ep->ovflist != EP_UNACTIVE_PTR) {
+	if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
 		if (epi->next == EP_UNACTIVE_PTR) {
-			epi->next = ep->ovflist;
-			ep->ovflist = epi;
+			epi->next = READ_ONCE(ep->ovflist);
+			WRITE_ONCE(ep->ovflist, epi);
 			if (epi->ws) {
 				/*
 				 * Activate ep->ws since epi->ws may get
···
 	} else if (timeout == 0) {
 		/*
 		 * Avoid the unnecessary trip to the wait queue loop, if the
-		 * caller specified a non blocking operation.
+		 * caller specified a non blocking operation. We still need
+		 * lock because we could race and not see an epi being added
+		 * to the ready list while in irq callback. Thus incorrectly
+		 * returning 0 back to userspace.
 		 */
 		timed_out = 1;
+
 		spin_lock_irq(&ep->wq.lock);
+		eavail = ep_events_available(ep);
+		spin_unlock_irq(&ep->wq.lock);
+
 		goto check_events;
 	}

···
 	if (!ep_events_available(ep))
 		ep_busy_loop(ep, timed_out);

-	spin_lock_irq(&ep->wq.lock);
-
-	if (!ep_events_available(ep)) {
-		/*
-		 * Busy poll timed out. Drop NAPI ID for now, we can add
-		 * it back in when we have moved a socket with a valid NAPI
-		 * ID onto the ready list.
-		 */
-		ep_reset_busy_poll_napi_id(ep);
-
-		/*
-		 * We don't have any available event to return to the caller.
-		 * We need to sleep here, and we will be wake up by
-		 * ep_poll_callback() when events will become available.
-		 */
-		init_waitqueue_entry(&wait, current);
-		__add_wait_queue_exclusive(&ep->wq, &wait);
-
-		for (;;) {
-			/*
-			 * We don't want to sleep if the ep_poll_callback() sends us
-			 * a wakeup in between. That's why we set the task state
-			 * to TASK_INTERRUPTIBLE before doing the checks.
-			 */
-			set_current_state(TASK_INTERRUPTIBLE);
-			/*
-			 * Always short-circuit for fatal signals to allow
-			 * threads to make a timely exit without the chance of
-			 * finding more events available and fetching
-			 * repeatedly.
-			 */
-			if (fatal_signal_pending(current)) {
-				res = -EINTR;
-				break;
-			}
-			if (ep_events_available(ep) || timed_out)
-				break;
-			if (signal_pending(current)) {
-				res = -EINTR;
-				break;
-			}
-
-			spin_unlock_irq(&ep->wq.lock);
-			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
-				timed_out = 1;
-
-			spin_lock_irq(&ep->wq.lock);
-		}
-
-		__remove_wait_queue(&ep->wq, &wait);
-		__set_current_state(TASK_RUNNING);
-	}
-check_events:
-	/* Is it worth to try to dig for events ? */
 	eavail = ep_events_available(ep);
+	if (eavail)
+		goto check_events;

+	/*
+	 * Busy poll timed out. Drop NAPI ID for now, we can add
+	 * it back in when we have moved a socket with a valid NAPI
+	 * ID onto the ready list.
+	 */
+	ep_reset_busy_poll_napi_id(ep);
+
+	/*
+	 * We don't have any available event to return to the caller.
+	 * We need to sleep here, and we will be wake up by
+	 * ep_poll_callback() when events will become available.
+	 */
+	init_waitqueue_entry(&wait, current);
+	spin_lock_irq(&ep->wq.lock);
+	__add_wait_queue_exclusive(&ep->wq, &wait);
 	spin_unlock_irq(&ep->wq.lock);

+	for (;;) {
+		/*
+		 * We don't want to sleep if the ep_poll_callback() sends us
+		 * a wakeup in between. That's why we set the task state
+		 * to TASK_INTERRUPTIBLE before doing the checks.
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		/*
+		 * Always short-circuit for fatal signals to allow
+		 * threads to make a timely exit without the chance of
+		 * finding more events available and fetching
+		 * repeatedly.
+		 */
+		if (fatal_signal_pending(current)) {
+			res = -EINTR;
+			break;
+		}
+		if (ep_events_available(ep) || timed_out)
+			break;
+		if (signal_pending(current)) {
+			res = -EINTR;
+			break;
+		}
+
+		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+			timed_out = 1;
+	}
+
+	__set_current_state(TASK_RUNNING);
+
+	spin_lock_irq(&ep->wq.lock);
+	__remove_wait_queue(&ep->wq, &wait);
+	spin_unlock_irq(&ep->wq.lock);
+
+check_events:
 	/*
 	 * Try to transfer events to user space. In case we get 0 events and
 	 * there's still timeout left over, we go trying again in search of