Hello,

I encountered a bug in the shmem subsystem. Details below.

[Summary]

shmem_get_folio_gfp() can livelock when multiple threads fault on the
same shmem page concurrently. The -EEXIST retry loop (goto repeat) has
no cond_resched(), so busy-looping threads starve the thread that holds
the swapcache slot, resulting in an indefinite RCU stall and a system
hang.

[Environment]

1. Kernel: 6.18.21 (ARM64, PREEMPT, CONFIG_LRU_GEN=y)
2. Triggered by: multi-threaded app with threads constrained to 2 CPUs
   via cpuset

[Root Cause]

When multiple threads in the same process fault on the same shmem
swap entry:

1. Thread A enters shmem_swap_alloc_folio(), succeeds at
   swapcache_prepare() (sets SWAP_HAS_CACHE), then enters
   workingset_refault() -> lru_gen_refault() -> rcu_read_lock().
   While inside the RCU read-side critical section, it is
   preempted via preempt_schedule_irq() (the IRQ exit path detects
   TIF_NEED_RESCHED).

2. Threads B & C enter shmem_swap_alloc_folio(), fail at
   swapcache_prepare() (slot already taken by A), and return -EEXIST.

3. In shmem_get_folio_gfp():

    error = shmem_swapin_folio(...);
    if (error == -EEXIST)
        goto repeat;  // no cond_resched(), tight loop

4. Threads B & C spin at 100% CPU in the retry loop. All three
   threads share the same cpuset (CPU0-1, cpus_allowed=0x3). Thread A
   is perpetually preempted and starved; it cannot complete the
   few instructions needed to call rcu_read_unlock().

5. The held RCU read lock blocks the grace period indefinitely,
   causing all synchronize_rcu() callers (cgroup operations,
   fd allocation, etc.) to hang, eventually blocking init.

[Scheduling Details]

Key observations:

1. Thread A was RCU-boosted to prio 98 but accumulated only
   99ms of execution over the entire stall period (~1200s).
   It was effectively starved despite the priority boost.

2. Threads B & C have vruntime=0 and prio 91, indicating
   they run in an RT-equivalent scheduling class (SCHED_FIFO/RT
   policy). Each accumulated ~1134 seconds of execution with
   only ~1600 context switches, meaning they ran uninterrupted
   for ~700ms between context switches on average.

3. Thread A cannot preempt Threads B & C: although RCU boost
   raised Thread A to prio 98, Threads B & C at prio 91 (lower
   numeric value = higher priority in the RT class) have higher
   effective priority. The busy-looping threads never
   voluntarily yield (no cond_resched(), no blocking calls in
   the loop), so Thread A never gets scheduled.

4. CPU contention: CPU0 had nr_running=28 and CPU1 had
   nr_running=24, with 3-4 RT tasks per CPU. Thread A competed
   with Thread B on CPU0 but was never selected to run.
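
The priority comparison in item 3 and the run-length estimate in item 2
can be double-checked with a small userspace C model. This is only a
sketch: it assumes the reported prio values are the kernel-internal
task prio, where RT tasks map as prio = MAX_RT_PRIO - 1 - rt_priority
with MAX_RT_PRIO = 100. The model_* names are made up for illustration
and are not kernel APIs.

```c
#include <stdbool.h>

/* Assumption: RT tasks use prio = MAX_RT_PRIO - 1 - rt_priority,
 * with MAX_RT_PRIO = 100, as in the kernel's RT priority mapping. */
#define MODEL_MAX_RT_PRIO 100

/* Convert a kernel-internal prio (0..99 for RT tasks) back to the
 * user-visible rt_priority (1..99, higher = more important). */
static int model_rt_priority_from_prio(int prio)
{
    return MODEL_MAX_RT_PRIO - 1 - prio;
}

/* A task can preempt the running one only if its kernel prio value
 * is numerically lower (that is, more important). */
static bool model_can_preempt(int candidate_prio, int running_prio)
{
    return candidate_prio < running_prio;
}

/* Average uninterrupted run time per context switch, in milliseconds. */
static double model_avg_run_ms(double exec_seconds, double switches)
{
    return exec_seconds * 1000.0 / switches;
}
```

Under these assumptions, prio 98 corresponds to rt_priority 1 and
prio 91 to rt_priority 8, so model_can_preempt(98, 91) is false, and
model_avg_run_ms(1134, 1600) comes out at roughly 709 ms.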

[Observed Impact]

1. RCU stall lasting 910+ seconds (19 consecutive stall
   warnings; grace period g=4398761 never advanced)
2. synchronize_rcu_expedited() callers blocked for 742+ seconds
3. init process hung for > 720 seconds; system unresponsive

[Call Traces]

Thread A (RCU stall source, sampled 19 times identically):
  __switch_to+0x1a4/0x360 (T)
  __schedule+0x96c/0xf3c
  preempt_schedule_irq+0xec/0x198
  raw_irqentry_exit_cond_resched+0x2c/0x44
  irqentry_exit+0x38/0x64
  exit_to_kernel_mode+0x28/0x38
  el1_interrupt+0x5c/0xa8
  el1h_64_irq_handler+0x18/0x24
  el1h_64_irq+0x84/0x88
  workingset_refault+0x16c/0x79c (P)
  shmem_swapin_folio+0x8e4/0xd44
  shmem_get_folio_gfp+0xb8/0x710
  shmem_fault+0xa0/0x174
  __do_fault
  do_pte_missing
  handle_mm_fault
  do_page_fault
  el0_ia

Thread B (busy-loop on CPU0, sum_exec_runtime=1134s):
  xas_load+0x78/0xe4 (P)
  shmem_swapin_folio+0x950/0xd44
  shmem_get_folio_gfp+0xb8/0x710
  shmem_fault ... el0_ia

Thread C (busy-loop on CPU1, sum_exec_runtime=1134s):
  xas_load+0x50/0xe4 (P)
  shmem_swapin_folio+0xd8/0xd44
  shmem_get_folio_gfp+0xb8/0x710
  shmem_fault ... el0_ia

[Question]

What is the recommended approach to fix this livelock?
We are considering adding a cond_resched() before the
goto repeat in shmem_get_folio_gfp() to break the tight
loop and allow the swapcache-holding thread to make
progress. Would this be an acceptable fix, or is there
a better strategy (e.g., bounded retry with fallback,
or yielding to the specific waiter)?
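
To make the "bounded retry with fallback" option concrete, here is a
rough userspace C model of one possible shape of that control flow. It
is not a kernel patch: model_swapin() is a hypothetical stand-in for
shmem_swapin_folio(), MODEL_MAX_RETRIES and the -EAGAIN fallback are
arbitrary choices for illustration, and sched_yield() merely plays the
role that cond_resched() or a sleep would play in the kernel.

```c
#include <errno.h>
#include <sched.h>

/* Hypothetical retry bound; not taken from the report or the kernel. */
#define MODEL_MAX_RETRIES 8

/* Hypothetical model of a swapcache slot owned by another thread:
 * it stays busy for a fixed number of polls, then becomes available. */
struct model_slot {
    int busy_polls_left;
};

/* Stand-in for shmem_swapin_folio(): returns -EEXIST while the slot
 * is still owned by someone else, 0 once it is free. */
static int model_swapin(struct model_slot *slot)
{
    if (slot->busy_polls_left > 0) {
        slot->busy_polls_left--;
        return -EEXIST;
    }
    return 0;
}

/* Retry on -EEXIST, yielding the CPU between attempts, and fall back
 * to -EAGAIN once more than MODEL_MAX_RETRIES retries were needed.
 * The number of retries performed is stored in *retries_out. */
static int model_get_folio(struct model_slot *slot, int *retries_out)
{
    int retries = 0;
    int error;

    for (;;) {
        error = model_swapin(slot);
        if (error != -EEXIST)
            break;
        if (++retries > MODEL_MAX_RETRIES) {
            error = -EAGAIN;  /* hypothetical fallback path */
            break;
        }
        sched_yield();  /* userspace analogue only, not cond_resched() */
    }

    *retries_out = retries;
    return error;
}
```

With a slot that stays busy for 3 polls, model_get_folio() returns 0
after 3 retries. With a slot that never frees up, it stops once
MODEL_MAX_RETRIES is exceeded and returns -EAGAIN instead of spinning.
One caveat: for SCHED_FIFO spinners at a higher priority than the
boosted holder, a plain yield only hands the CPU to tasks of equal
priority, so an in-kernel version may need to actually block rather
than yield.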

 

Thanks,

Chao Ma
