Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

Minor page waitqueue cleanups

Tim Chen and Kan Liang have been battling a customer load that shows
extremely long page wakeup lists. The cause seems to be constant NUMA
migration of a hot page that is shared across a lot of threads, but the
actual root cause for the exact behavior has not been found.

Tim has a patch that batches the wait list traversal at wakeup time, so
that we at least don't get long uninterruptible cases where we traverse
and wake up thousands of processes and get nasty latency spikes. That
is likely 4.14 material, but we're still discussing the page waitqueue
specific parts of it.

In the meantime, I've tried to look at making the page wait queues less
expensive, and failing miserably. If you have thousands of threads
waiting for the same page, it will be painful. We'll need to try to
figure out the NUMA balancing issue some day, in addition to avoiding
the excessive spinlock hold times.

That said, having tried to rewrite the page wait queues, I can at least
fix up some of the braindamage in the current situation. In particular:

(a) we don't want to continue walking the page wait list if the bit
we're waiting for already got set again (which seems to be one of
the patterns of the bad load). That makes no progress and just
causes pointless cache pollution chasing the pointers.

(b) we don't want to put the non-locking waiters always on the front of
the queue, and the locking waiters always on the back. Not only is
that unfair, it means that we wake up thousands of reading threads
that will just end up being blocked by the writer later anyway.

Also add a comment about the layout of 'struct wait_page_key' - there is
an external user of it in the cachefiles code that means that it has to
match the layout of 'struct wait_bit_key' in the two first members. It
so happens to match, because 'struct page *' and 'unsigned long *' end
up having the same values simply because the page flags are the first
member in struct page.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christopher Lameter <cl@linux.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

+10 -8
+4 -3
kernel/sched/wait.c
··· 70 70 71 71 list_for_each_entry_safe(curr, next, &wq_head->head, entry) { 72 72 unsigned flags = curr->flags; 73 - 74 - if (curr->func(curr, mode, wake_flags, key) && 75 - (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) 73 + int ret = curr->func(curr, mode, wake_flags, key); 74 + if (ret < 0) 75 + break; 76 + if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) 76 77 break; 77 78 } 78 79 }
+6 -5
mm/filemap.c
··· 885 885 page_writeback_init(); 886 886 } 887 887 888 + /* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */ 888 889 struct wait_page_key { 889 890 struct page *page; 890 891 int bit_nr; ··· 910 909 911 910 if (wait_page->bit_nr != key->bit_nr) 912 911 return 0; 912 + 913 + /* Stop walking if it's locked */ 913 914 if (test_bit(key->bit_nr, &key->page->flags)) 914 - return 0; 915 + return -1; 915 916 916 917 return autoremove_wake_function(wait, mode, sync, key); 917 918 } ··· 967 964 int ret = 0; 968 965 969 966 init_wait(wait); 967 + wait->flags = lock ? WQ_FLAG_EXCLUSIVE : 0; 970 968 wait->func = wake_page_function; 971 969 wait_page.page = page; 972 970 wait_page.bit_nr = bit_nr; ··· 976 972 spin_lock_irq(&q->lock); 977 973 978 974 if (likely(list_empty(&wait->entry))) { 979 - if (lock) 980 - __add_wait_queue_entry_tail_exclusive(q, wait); 981 - else 982 - __add_wait_queue(q, wait); 975 + __add_wait_queue_entry_tail(q, wait); 983 976 SetPageWaiters(page); 984 977 } 985 978