Hello,
I encountered a bug in the shmem subsystem. Details below.
[Summary]
shmem_get_folio_gfp() can livelock when multiple threads fault on the
same shmem page concurrently. The -EEXIST retry loop (goto repeat) has
no cond_resched(), causing busy-looping threads to starve the thread
that holds the swapcache slot, resulting in an indefinite RCU stall and
system hang.
[Environment]
1. Kernel: 6.18.21 (ARM64, PREEMPT, CONFIG_LRU_GEN=y)
2. Triggered by: multi-threaded app with threads constrained to 2 CPUs
via cpuset
[Root Cause]
When multiple threads in the same process fault on the same shmem
swap entry:
1. Thread A enters shmem_swap_alloc_folio(), succeeds at
swapcache_prepare() (sets SWAP_HAS_CACHE), then enters
workingset_refault() → lru_gen_refault() → rcu_read_lock().
While inside the RCU read-side critical section, it is
preempted via preempt_schedule_irq() (the IRQ exit path detects
TIF_NEED_RESCHED).
2. Threads B & C enter shmem_swap_alloc_folio(), fail at
swapcache_prepare() (slot already taken by A), and return -EEXIST.
3. In shmem_get_folio_gfp():
    error = shmem_swapin_folio(...);
    if (error == -EEXIST)
        goto repeat; // no cond_resched(), tight loop
4. Threads B & C spin at 100% CPU in the retry loop. All three
threads share the same cpuset (CPU0-1, cpus_allowed=0x3). Thread A
is perpetually preempted and starved: it cannot complete the
few instructions needed to call rcu_read_unlock().
5. The held RCU read lock blocks the grace period indefinitely,
causing all synchronize_rcu() callers (cgroup operations,
fd allocation, etc.) to hang, eventually blocking init.
[Scheduling Details]
Key observations:
1. Thread A was RCU-boosted to prio 98 but accumulated only
99ms of execution over the entire stall period (~1200s).
It was effectively starved despite the priority boost.
2. Threads B & C have vruntime=0 and prio 91, indicating
they run in an RT-equivalent scheduling class (SCHED_FIFO/RT
policy). Each accumulated ~1134 seconds of execution with
only ~1600 context switches, meaning they ran uninterrupted
for ~700ms per scheduling quantum on average.
3. Thread A cannot preempt Threads B & C: although RCU boost
raised Thread A to prio 98, Threads B & C at prio 91 still
outrank it (a lower numeric value means higher priority in
the RT class). The busy-looping threads never voluntarily
yield (no cond_resched(), no blocking calls in the loop),
so Thread A never gets scheduled.
4. CPU contention: CPU0 had nr_running=28 and CPU1 had
nr_running=24, with 3-4 RT tasks per CPU. Thread A competed
with Thread B on CPU0 but could not win scheduling.
[Observed Impact]
1. RCU stall lasting 910+ seconds (19 consecutive stall
warnings, grace period g=4398761 never advanced)
2. synchronize_rcu_expedited() callers blocked 742+ seconds
3. init process hung > 720 seconds → system unresponsive
[Call Traces]
Thread A (RCU stall source, sampled 19 times identically):
__switch_to+0x1a4/0x360 (T)
__schedule+0x96c/0xf3c
preempt_schedule_irq+0xec/0x198
raw_irqentry_exit_cond_resched+0x2c/0x44
irqentry_exit+0x38/0x64
exit_to_kernel_mode+0x28/0x38
el1_interrupt+0x5c/0xa8
el1h_64_irq_handler+0x18/0x24
el1h_64_irq+0x84/0x88
workingset_refault+0x16c/0x79c (P)
shmem_swapin_folio+0x8e4/0xd44
shmem_get_folio_gfp+0xb8/0x710
shmem_fault+0xa0/0x174
__do_fault
do_pte_missing
handle_mm_fault
do_page_fault
el0_ia
Thread B (busy-loop on CPU0, sum_exec_runtime=1134s):
xas_load+0x78/0xe4 (P)
shmem_swapin_folio+0x950/0xd44
shmem_get_folio_gfp+0xb8/0x710
shmem_fault → ... → el0_ia
Thread C (busy-loop on CPU1, sum_exec_runtime=1134s):
xas_load+0x50/0xe4 (P)
shmem_swapin_folio+0xd8/0xd44
shmem_get_folio_gfp+0xb8/0x710
shmem_fault → ... → el0_ia
[Question]
What is the recommended approach to fix this livelock?
We are considering adding a cond_resched() before the
goto repeat in shmem_get_folio_gfp() to break the tight
loop and allow the swapcache-holding thread to make
progress. Would this be an acceptable fix, or is there
a better strategy (e.g., bounded retry with fallback,
or yielding to the specific waiter)?
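To make the options concrete, here is a minimal userspace sketch of
the retry shape we have in mind. It is a model, not a kernel patch:
struct slot_model, swapin_model(), get_folio_model() and
MAX_RETRIES_MODEL are hypothetical names, sched_yield() merely stands
in for cond_resched(), and the -EAGAIN fallback is just one possible
choice. It only illustrates a yield before each retry combined with a
bounded retry count; it does not show that yielding alone is enough
when the spinning threads have higher priority than the slot holder.

```c
#include <errno.h>
#include <sched.h>

/* Hypothetical stand-in for the contended swapcache slot (not a kernel type). */
struct slot_model {
	int busy_rounds;	/* number of attempts that still see the slot held */
};

/* Models shmem_swapin_folio(): returns -EEXIST while the slot is held. */
static int swapin_model(struct slot_model *s)
{
	if (s->busy_rounds > 0) {
		s->busy_rounds--;
		return -EEXIST;
	}
	return 0;
}

#define MAX_RETRIES_MODEL 1000	/* arbitrary bound, chosen for illustration */

/*
 * Models the -EEXIST retry loop in shmem_get_folio_gfp().
 * Yields before every retry and gives up with -EAGAIN once the
 * retry bound is exceeded. *yields reports how often the loop yielded.
 */
static int get_folio_model(struct slot_model *s, int *yields)
{
	int error;
	int retries = 0;

	*yields = 0;
repeat:
	error = swapin_model(s);
	if (error == -EEXIST) {
		if (++retries > MAX_RETRIES_MODEL)
			return -EAGAIN;	/* assumed fallback: let the caller decide */
		sched_yield();		/* userspace stand-in for cond_resched() */
		(*yields)++;
		goto repeat;
	}
	return error;
}
```

In this model, a slot that stays busy for three attempts makes
get_folio_model() return 0 after three yields, and a slot that never
frees up makes it return -EAGAIN after MAX_RETRIES_MODEL yields
instead of spinning forever.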
Thanks,
Chao Ma