Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

Merge patch "eventpoll: Fix priority inversion problem"

Nam Cao <namcao@linutronix.de> says:

Hi,

This v4 is the follow-up to v3 at:
https://lore.kernel.org/linux-fsdevel/20250527090836.1290532-1-namcao@linutronix.de/
which resolves a priority inversion problem.

The v3 patch was merged, but then got reverted due to regression.

The direction of v3 was wrong in the first place. It changed the
eventpoll's event list to be lockless, making the code harder to read. I
stared at the patch again, but still couldn't figure out what the bug is.

The performance numbers were indeed impressive with lockless, but the
numbers are from a benchmark, which is unclear whether it really reflects
real workload.

This v4 takes a completely different approach: it converts the rwlock to
spinlock. Unfortunately, unlike rwlock, spinlock does not allow concurrent
readers. This patch therefore reduces the performance numbers.

I have some optimization tricks to reduce spinlock contention and bring the
numbers back. But Linus appeared and declared that epoll's performance
shouldn't be the priority. So I decided not to post those optimization
patches.

* patches from https://lore.kernel.org/cover.1752581388.git.namcao@linutronix.de:
eventpoll: Replace rwlock with spinlock

Link: https://lore.kernel.org/cover.1752581388.git.namcao@linutronix.de
Signed-off-by: Christian Brauner <brauner@kernel.org>

+26 -113
+26 -113
fs/eventpoll.c
··· 46 46 * 47 47 * 1) epnested_mutex (mutex) 48 48 * 2) ep->mtx (mutex) 49 - * 3) ep->lock (rwlock) 49 + * 3) ep->lock (spinlock) 50 50 * 51 51 * The acquire order is the one listed above, from 1 to 3. 52 - * We need a rwlock (ep->lock) because we manipulate objects 52 + * We need a spinlock (ep->lock) because we manipulate objects 53 53 * from inside the poll callback, that might be triggered from 54 54 * a wake_up() that in turn might be called from IRQ context. 55 55 * So we can't sleep inside the poll callback and hence we need ··· 195 195 struct list_head rdllist; 196 196 197 197 /* Lock which protects rdllist and ovflist */ 198 - rwlock_t lock; 198 + spinlock_t lock; 199 199 200 200 /* RB tree root used to store monitored fd structs */ 201 201 struct rb_root_cached rbr; ··· 741 741 * in a lockless way. 742 742 */ 743 743 lockdep_assert_irqs_enabled(); 744 - write_lock_irq(&ep->lock); 744 + spin_lock_irq(&ep->lock); 745 745 list_splice_init(&ep->rdllist, txlist); 746 746 WRITE_ONCE(ep->ovflist, NULL); 747 - write_unlock_irq(&ep->lock); 747 + spin_unlock_irq(&ep->lock); 748 748 } 749 749 750 750 static void ep_done_scan(struct eventpoll *ep, ··· 752 752 { 753 753 struct epitem *epi, *nepi; 754 754 755 - write_lock_irq(&ep->lock); 755 + spin_lock_irq(&ep->lock); 756 756 /* 757 757 * During the time we spent inside the "sproc" callback, some 758 758 * other events might have been queued by the poll callback. ··· 793 793 wake_up(&ep->wq); 794 794 } 795 795 796 - write_unlock_irq(&ep->lock); 796 + spin_unlock_irq(&ep->lock); 797 797 } 798 798 799 799 static void ep_get(struct eventpoll *ep) ··· 868 868 869 869 rb_erase_cached(&epi->rbn, &ep->rbr); 870 870 871 - write_lock_irq(&ep->lock); 871 + spin_lock_irq(&ep->lock); 872 872 if (ep_is_linked(epi)) 873 873 list_del_init(&epi->rdllink); 874 - write_unlock_irq(&ep->lock); 874 + spin_unlock_irq(&ep->lock); 875 875 876 876 wakeup_source_unregister(ep_wakeup_source(epi)); 877 877 /* ··· 1152 1152 return -ENOMEM; 1153 1153 1154 1154 mutex_init(&ep->mtx); 1155 - rwlock_init(&ep->lock); 1155 + spin_lock_init(&ep->lock); 1156 1156 init_waitqueue_head(&ep->wq); 1157 1157 init_waitqueue_head(&ep->poll_wait); 1158 1158 INIT_LIST_HEAD(&ep->rdllist); ··· 1240 1240 #endif /* CONFIG_KCMP */ 1241 1241 1242 1242 /* 1243 - * Adds a new entry to the tail of the list in a lockless way, i.e. 1244 - * multiple CPUs are allowed to call this function concurrently. 1245 - * 1246 - * Beware: it is necessary to prevent any other modifications of the 1247 - * existing list until all changes are completed, in other words 1248 - * concurrent list_add_tail_lockless() calls should be protected 1249 - * with a read lock, where write lock acts as a barrier which 1250 - * makes sure all list_add_tail_lockless() calls are fully 1251 - * completed. 1252 - * 1253 - * Also an element can be locklessly added to the list only in one 1254 - * direction i.e. either to the tail or to the head, otherwise 1255 - * concurrent access will corrupt the list. 1256 - * 1257 - * Return: %false if element has been already added to the list, %true 1258 - * otherwise. 1259 - */ 1260 - static inline bool list_add_tail_lockless(struct list_head *new, 1261 - struct list_head *head) 1262 - { 1263 - struct list_head *prev; 1264 - 1265 - /* 1266 - * This is simple 'new->next = head' operation, but cmpxchg() 1267 - * is used in order to detect that same element has been just 1268 - * added to the list from another CPU: the winner observes 1269 - * new->next == new. 1270 - */ 1271 - if (!try_cmpxchg(&new->next, &new, head)) 1272 - return false; 1273 - 1274 - /* 1275 - * Initially ->next of a new element must be updated with the head 1276 - * (we are inserting to the tail) and only then pointers are atomically 1277 - * exchanged. XCHG guarantees memory ordering, thus ->next should be 1278 - * updated before pointers are actually swapped and pointers are 1279 - * swapped before prev->next is updated. 1280 - */ 1281 - 1282 - prev = xchg(&head->prev, new); 1283 - 1284 - /* 1285 - * It is safe to modify prev->next and new->prev, because a new element 1286 - * is added only to the tail and new->next is updated before XCHG. 1287 - */ 1288 - 1289 - prev->next = new; 1290 - new->prev = prev; 1291 - 1292 - return true; 1293 - } 1294 - 1295 - /* 1296 - * Chains a new epi entry to the tail of the ep->ovflist in a lockless way, 1297 - * i.e. multiple CPUs are allowed to call this function concurrently. 1298 - * 1299 - * Return: %false if epi element has been already chained, %true otherwise. 1300 - */ 1301 - static inline bool chain_epi_lockless(struct epitem *epi) 1302 - { 1303 - struct eventpoll *ep = epi->ep; 1304 - 1305 - /* Fast preliminary check */ 1306 - if (epi->next != EP_UNACTIVE_PTR) 1307 - return false; 1308 - 1309 - /* Check that the same epi has not been just chained from another CPU */ 1310 - if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR) 1311 - return false; 1312 - 1313 - /* Atomically exchange tail */ 1314 - epi->next = xchg(&ep->ovflist, epi); 1315 - 1316 - return true; 1317 - } 1318 - 1319 - /* 1320 1243 * This is the callback that is passed to the wait queue wakeup 1321 1244 * mechanism. It is called by the stored file descriptors when they 1322 1245 * have events to report. 1323 - * 1324 - * This callback takes a read lock in order not to contend with concurrent 1325 - * events from another file descriptor, thus all modifications to ->rdllist 1326 - * or ->ovflist are lockless. Read lock is paired with the write lock from 1327 - * ep_start/done_scan(), which stops all list modifications and guarantees 1328 - * that lists state is seen correctly. 1329 - * 1330 - * Another thing worth to mention is that ep_poll_callback() can be called 1331 - * concurrently for the same @epi from different CPUs if poll table was inited 1332 - * with several wait queues entries. Plural wakeup from different CPUs of a 1333 - * single wait queue is serialized by wq.lock, but the case when multiple wait 1334 - * queues are used should be detected accordingly. This is detected using 1335 - * cmpxchg() operation. 1336 1246 */ 1337 1247 static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) 1338 1248 { ··· 1253 1343 unsigned long flags; 1254 1344 int ewake = 0; 1255 1345 1256 - read_lock_irqsave(&ep->lock, flags); 1346 + spin_lock_irqsave(&ep->lock, flags); 1257 1347 1258 1348 ep_set_busy_poll_napi_id(epi); 1259 1349 ··· 1282 1372 * chained in ep->ovflist and requeued later on. 1283 1373 */ 1284 1374 if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) { 1285 - if (chain_epi_lockless(epi)) 1375 + if (epi->next == EP_UNACTIVE_PTR) { 1376 + epi->next = READ_ONCE(ep->ovflist); 1377 + WRITE_ONCE(ep->ovflist, epi); 1286 1378 ep_pm_stay_awake_rcu(epi); 1379 + } 1287 1380 } else if (!ep_is_linked(epi)) { 1288 1381 /* In the usual case, add event to ready list. */ 1289 - if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) 1290 - ep_pm_stay_awake_rcu(epi); 1382 + list_add_tail(&epi->rdllink, &ep->rdllist); 1383 + ep_pm_stay_awake_rcu(epi); 1291 1384 } 1292 1385 1293 1386 /* ··· 1323 1410 pwake++; 1324 1411 1325 1412 out_unlock: 1326 - read_unlock_irqrestore(&ep->lock, flags); 1413 + spin_unlock_irqrestore(&ep->lock, flags); 1327 1414 1328 1415 /* We have to call this outside the lock */ 1329 1416 if (pwake) ··· 1658 1745 } 1659 1746 1660 1747 /* We have to drop the new item inside our item list to keep track of it */ 1661 - write_lock_irq(&ep->lock); 1748 + spin_lock_irq(&ep->lock); 1662 1749 1663 1750 /* record NAPI ID of new item if present */ 1664 1751 ep_set_busy_poll_napi_id(epi); ··· 1675 1762 pwake++; 1676 1763 } 1677 1764 1678 - write_unlock_irq(&ep->lock); 1765 + spin_unlock_irq(&ep->lock); 1679 1766 1680 1767 /* We have to call this outside the lock */ 1681 1768 if (pwake) ··· 1739 1826 * list, push it inside. 1740 1827 */ 1741 1828 if (ep_item_poll(epi, &pt, 1)) { 1742 - write_lock_irq(&ep->lock); 1829 + spin_lock_irq(&ep->lock); 1743 1830 if (!ep_is_linked(epi)) { 1744 1831 list_add_tail(&epi->rdllink, &ep->rdllist); 1745 1832 ep_pm_stay_awake(epi); ··· 1750 1837 if (waitqueue_active(&ep->poll_wait)) 1751 1838 pwake++; 1752 1839 } 1753 - write_unlock_irq(&ep->lock); 1840 + spin_unlock_irq(&ep->lock); 1754 1841 } 1755 1842 1756 1843 /* We have to call this outside the lock */ ··· 2002 2089 init_wait(&wait); 2003 2090 wait.func = ep_autoremove_wake_function; 2004 2091 2005 - write_lock_irq(&ep->lock); 2092 + spin_lock_irq(&ep->lock); 2006 2093 /* 2007 2094 * Barrierless variant, waitqueue_active() is called under 2008 2095 * the same lock on wakeup ep_poll_callback() side, so it ··· 2021 2108 if (!eavail) 2022 2109 __add_wait_queue_exclusive(&ep->wq, &wait); 2023 2110 2024 - write_unlock_irq(&ep->lock); 2111 + spin_unlock_irq(&ep->lock); 2025 2112 2026 2113 if (!eavail) 2027 2114 timed_out = !ep_schedule_timeout(to) || ··· 2037 2124 eavail = 1; 2038 2125 2039 2126 if (!list_empty_careful(&wait.entry)) { 2040 - write_lock_irq(&ep->lock); 2127 + spin_lock_irq(&ep->lock); 2041 2128 /* 2042 2129 * If the thread timed out and is not on the wait queue, 2043 2130 * it means that the thread was woken up after its ··· 2048 2135 if (timed_out) 2049 2136 eavail = list_empty(&wait.entry); 2050 2137 __remove_wait_queue(&ep->wq, &wait); 2051 - write_unlock_irq(&ep->lock); 2138 + spin_unlock_irq(&ep->lock); 2052 2139 } 2053 2140 } 2054 2141 }