Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
epoll: use refcount to reduce ep_mutex contention

We are observing huge contention on the epmutex during an http
connection/rate test:

83.17% 0.25% nginx [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
[...]
|--66.96%--__fput
|--60.04%--eventpoll_release_file
|--58.41%--__mutex_lock.isra.6
|--56.56%--osq_lock

The application is multi-threaded, creates a new epoll entry for
each incoming connection, and does not delete it before the
connection shutdown - that is, before the connection's fd close().

Many different threads compete frequently for the epmutex lock,
affecting the overall performance.

To reduce the contention, this patch introduces explicit reference counting
for the eventpoll struct. Each registered event acquires a reference,
and references are released at ep_remove() time.

The eventpoll struct is released by whichever of the EP file close() and
the monitored file close() drops its last reference.

Additionally, this introduces a new 'dying' flag to prevent races between
the EP file close() and the monitored file close().
eventpoll_release_file() marks, under the f_lock spinlock, each epitem as
dying before removing it, while the EP file close() does not touch dying
epitems.

The above is needed as both close operations could run concurrently and
drop the EP reference acquired via the epitem entry. Without the above
flag, the monitored file close() could reach the EP struct via the epitem
list while the epitem is still listed and then try to put it after its
disposal.

An alternative would be to avoid touching the references acquired via the
epitems at EP file close() time, but that could leave the EP struct alive
for a potentially unbounded time after the EP file close(), with nasty
side effects.

With all the above in place, we can drop the epmutex usage at disposal time.

Overall this produces a significant performance improvement in the
mentioned connection/rate scenario: the mutex operations disappear from
the topmost offenders in the perf report, and the measured connection rate
grows by ~60%.

To make the change more readable, this additionally renames ep_free() to
ep_clear_and_put() and moves the actual memory cleanup into a separate
ep_free() helper.

Link: https://lkml.kernel.org/r/4a57788dcaf28f5eb4f8dfddcc3a8b172a7357bb.1679504153.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Xiumei Mu <xmu@redhiat.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Paolo Abeni, committed by Andrew Morton
58c9b016 f4708a82

+124 -73
fs/eventpoll.c
···
  * we need a lock that will allow us to sleep. This lock is a
  * mutex (ep->mtx). It is acquired during the event transfer loop,
  * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
- * Then we also need a global mutex to serialize eventpoll_release_file()
- * and ep_free().
- * This mutex is acquired by ep_free() during the epoll file
- * cleanup path and it is also acquired by eventpoll_release_file()
- * if a file has been pushed inside an epoll set and it is then
- * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
- * It is also acquired when inserting an epoll fd onto another epoll
+ * The epmutex is acquired when inserting an epoll fd onto another epoll
  * fd. We do this so that we walk the epoll tree and ensure that this
  * insertion does not create a cycle of epoll file descriptors, which
  * could lead to deadlock. We need a global mutex to prevent two
···
 	/* The file descriptor information this item refers to */
 	struct epoll_filefd ffd;
 
+	/*
+	 * Protected by file->f_lock, true for to-be-released epitem already
+	 * removed from the "struct file" items list; together with
+	 * eventpoll->refcount orchestrates "struct eventpoll" disposal
+	 */
+	bool dying;
+
 	/* List containing poll wait queues */
 	struct eppoll_entry *pwqlist;
···
 	u64 gen;
 	struct hlist_head refs;
 
+	/*
+	 * usage count, used together with epitem->dying to
+	 * orchestrate the disposal of this struct
+	 */
+	refcount_t refcount;
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	/* used to track busy poll napi_id */
 	unsigned int napi_id;
···
 	/* Maximum number of epoll watched descriptors, per user */
 	static long max_user_watches __read_mostly;
 
-/*
- * This mutex is used to serialize ep_free() and eventpoll_release_file().
- */
+/* Used for cycles detection */
 static DEFINE_MUTEX(epmutex);
 
 static u64 loop_check_gen = 0;
···
 /*
  * This function unregisters poll callbacks from the associated file
- * descriptor. Must be called with "mtx" held (or "epmutex" if called from
- * ep_free).
+ * descriptor. Must be called with "mtx" held.
  */
 static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 {
···
 	kmem_cache_free(epi_cache, epi);
 }
 
+static void ep_get(struct eventpoll *ep)
+{
+	refcount_inc(&ep->refcount);
+}
+
+/*
+ * Returns true if the event poll can be disposed
+ */
+static bool ep_refcount_dec_and_test(struct eventpoll *ep)
+{
+	if (!refcount_dec_and_test(&ep->refcount))
+		return false;
+
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
+	return true;
+}
+
+static void ep_free(struct eventpoll *ep)
+{
+	mutex_destroy(&ep->mtx);
+	free_uid(ep->user);
+	wakeup_source_unregister(ep->ws);
+	kfree(ep);
+}
+
 /*
  * Removes a "struct epitem" from the eventpoll RB tree and deallocates
  * all the associated resources. Must be called with "mtx" held.
+ * If the dying flag is set, do the removal only if force is true.
+ * This prevents ep_clear_and_put() from dropping all the ep references
+ * while running concurrently with eventpoll_release_file().
+ * Returns true if the eventpoll can be disposed.
  */
-static int ep_remove(struct eventpoll *ep, struct epitem *epi)
+static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
 {
 	struct file *file = epi->ffd.file;
 	struct epitems_head *to_free;
···
 	/* Remove the current item from the list of epoll hooks */
 	spin_lock(&file->f_lock);
+	if (epi->dying && !force) {
+		spin_unlock(&file->f_lock);
+		return false;
+	}
+
 	to_free = NULL;
 	head = file->f_ep;
 	if (head->first == &epi->fllink && !epi->fllink.next) {
···
 	call_rcu(&epi->rcu, epi_rcu_free);
 
 	percpu_counter_dec(&ep->user->epoll_watches);
-
-	return 0;
+	return ep_refcount_dec_and_test(ep);
 }
 
-static void ep_free(struct eventpoll *ep)
+/*
+ * ep_remove variant for callers owing an additional reference to the ep
+ */
+static void ep_remove_safe(struct eventpoll *ep, struct epitem *epi)
 {
-	struct rb_node *rbp;
+	WARN_ON_ONCE(__ep_remove(ep, epi, false));
+}
+
+static void ep_clear_and_put(struct eventpoll *ep)
+{
+	struct rb_node *rbp, *next;
 	struct epitem *epi;
+	bool dispose;
 
 	/* We need to release all tasks waiting for these file */
 	if (waitqueue_active(&ep->poll_wait))
 		ep_poll_safewake(ep, NULL, 0);
 
-	/*
-	 * We need to lock this because we could be hit by
-	 * eventpoll_release_file() while we're freeing the "struct eventpoll".
-	 * We do not need to hold "ep->mtx" here because the epoll file
-	 * is on the way to be removed and no one has references to it
-	 * anymore. The only hit might come from eventpoll_release_file() but
-	 * holding "epmutex" is sufficient here.
-	 */
-	mutex_lock(&epmutex);
+	mutex_lock(&ep->mtx);
 
 	/*
 	 * Walks through the whole tree by unregistering poll callbacks.
···
 	}
 
 	/*
-	 * Walks through the whole tree by freeing each "struct epitem". At this
-	 * point we are sure no poll callbacks will be lingering around, and also by
-	 * holding "epmutex" we can be sure that no file cleanup code will hit
-	 * us during this operation. So we can avoid the lock on "ep->lock".
-	 * We do not need to lock ep->mtx, either, we only do it to prevent
-	 * a lockdep warning.
+	 * Walks through the whole tree and try to free each "struct epitem".
+	 * Note that ep_remove_safe() will not remove the epitem in case of a
+	 * racing eventpoll_release_file(); the latter will do the removal.
+	 * At this point we are sure no poll callbacks will be lingering around.
+	 * Since we still own a reference to the eventpoll struct, the loop can't
+	 * dispose it.
 	 */
-	mutex_lock(&ep->mtx);
-	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
+	for (rbp = rb_first_cached(&ep->rbr); rbp; rbp = next) {
+		next = rb_next(rbp);
 		epi = rb_entry(rbp, struct epitem, rbn);
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		cond_resched();
 	}
+
+	dispose = ep_refcount_dec_and_test(ep);
 	mutex_unlock(&ep->mtx);
 
-	mutex_unlock(&epmutex);
-	mutex_destroy(&ep->mtx);
-	free_uid(ep->user);
-	wakeup_source_unregister(ep->ws);
-	kfree(ep);
+	if (dispose)
+		ep_free(ep);
 }
 
 static int ep_eventpoll_release(struct inode *inode, struct file *file)
···
 	struct eventpoll *ep = file->private_data;
 
 	if (ep)
-		ep_free(ep);
+		ep_clear_and_put(ep);
 
 	return 0;
 }
···
 {
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct hlist_node *next;
+	bool dispose;
 
 	/*
-	 * We don't want to get "file->f_lock" because it is not
-	 * necessary. It is not necessary because we're in the "struct file"
-	 * cleanup path, and this means that no one is using this file anymore.
-	 * So, for example, epoll_ctl() cannot hit here since if we reach this
-	 * point, the file counter already went to zero and fget() would fail.
-	 * The only hit might come from ep_free() but by holding the mutex
-	 * will correctly serialize the operation. We do need to acquire
-	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
-	 * from anywhere but ep_free().
-	 *
-	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
+	 * Use the 'dying' flag to prevent a concurrent ep_clear_and_put() from
+	 * touching the epitems list before eventpoll_release_file() can access
+	 * the ep->mtx.
 	 */
-	mutex_lock(&epmutex);
-	if (unlikely(!file->f_ep)) {
-		mutex_unlock(&epmutex);
-		return;
-	}
-	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
+again:
+	spin_lock(&file->f_lock);
+	if (file->f_ep && file->f_ep->first) {
+		epi = hlist_entry(file->f_ep->first, struct epitem, fllink);
+		epi->dying = true;
+		spin_unlock(&file->f_lock);
+
+		/*
+		 * ep access is safe as we still own a reference to the ep
+		 * struct
+		 */
 		ep = epi->ep;
-		mutex_lock_nested(&ep->mtx, 0);
-		ep_remove(ep, epi);
+		mutex_lock(&ep->mtx);
+		dispose = __ep_remove(ep, epi, true);
 		mutex_unlock(&ep->mtx);
+
+		if (dispose)
+			ep_free(ep);
+		goto again;
 	}
-	mutex_unlock(&epmutex);
+	spin_unlock(&file->f_lock);
 }
 
 static int ep_alloc(struct eventpoll **pep)
···
 	ep->rbr = RB_ROOT_CACHED;
 	ep->ovflist = EP_UNACTIVE_PTR;
 	ep->user = user;
+	refcount_set(&ep->refcount, 1);
 
 	*pep = ep;
···
 	 */
 	list_del_init(&wait->entry);
 	/*
-	 * ->whead != NULL protects us from the race with ep_free()
-	 * or ep_remove(), ep_remove_wait_queue() takes whead->lock
-	 * held by the caller. Once we nullify it, nothing protects
-	 * ep/epi or even wait.
+	 * ->whead != NULL protects us from the race with
+	 * ep_clear_and_put() or ep_remove(), ep_remove_wait_queue()
+	 * takes whead->lock held by the caller. Once we nullify it,
+	 * nothing protects ep/epi or even wait.
 	 */
 	smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
 }
···
 	if (tep)
 		mutex_unlock(&tep->mtx);
 
+	/*
+	 * ep_remove_safe() calls in the later error paths can't lead to
+	 * ep_free() as the ep file itself still holds an ep reference.
+	 */
+	ep_get(ep);
+
 	/* now check if we've created too many backpaths */
 	if (unlikely(full_check && reverse_path_check())) {
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		return -EINVAL;
 	}
 
 	if (epi->event.events & EPOLLWAKEUP) {
 		error = ep_create_wakeup_source(epi);
 		if (error) {
-			ep_remove(ep, epi);
+			ep_remove_safe(ep, epi);
 			return error;
 		}
 	}
···
 	 * high memory pressure.
 	 */
 	if (unlikely(!epq.epi)) {
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		return -ENOMEM;
 	}
···
 out_free_fd:
 	put_unused_fd(fd);
 out_free_ep:
-	ep_free(ep);
+	ep_clear_and_put(ep);
 	return error;
 }
···
 		error = -EEXIST;
 		break;
 	case EPOLL_CTL_DEL:
-		if (epi)
-			error = ep_remove(ep, epi);
-		else
+		if (epi) {
+			/*
+			 * The eventpoll itself is still alive: the refcount
+			 * can't go to zero here.
+			 */
+			ep_remove_safe(ep, epi);
+			error = 0;
+		} else {
 			error = -ENOENT;
+		}
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {