Re: [BUG] shmem: shmem_get_folio_gfp livelock


From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: 马超 <machao26@xiaomi.com>, "Andrew Morton" <akpm@linux-foundation.org>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	田孝斌 <tianxiaobin@xiaomi.com>, 俞东斌 <yudongbin@xiaomi.com>,
	李鹏程 <xiaoyaoli@xiaomi.com>, "hughd@google.com" <hughd@google.com>,
	"Kairui Song" <kasong@tencent.com>
Subject: Re: [BUG] shmem: shmem_get_folio_gfp livelock
Date: Wed, 1 Jul 2026 18:03:33 +0800	[thread overview]
Message-ID: <623db71c-daab-451c-909f-a8efa56b998b@linux.alibaba.com> (raw)
In-Reply-To: <700a2cbf90a2484f979aac858f08f5d4@xiaomi.com>

CC Hugh and Kairui.

On 6/30/26 9:15 PM, 马超 wrote:
> Hello,
> I encountered a bug in the shmem subsystem. Details below.
> 
> [Summary]
> shmem_get_folio_gfp() can livelock when multiple threads fault on the
> same shmem page concurrently. The -EEXIST retry loop (goto repeat) has
> no cond_resched(), causing busy-looping threads to starve the thread
> that holds the swapcache slot, resulting in an indefinite RCU stall and
> system hang.
> [Environment]
> 1. Kernel: 6.18.21 (ARM64, PREEMPT, CONFIG_LRU_GEN=y)
> 2. Triggered by: multi-threaded app with threads constrained to 2 CPUs
>    via cpuset
> [Root Cause]
> When multiple threads in the same process fault on the same shmem
> swap entry:
> 1. Thread A enters shmem_swap_alloc_folio(), succeeds at
>    swapcache_prepare() (sets SWAP_HAS_CACHE), then enters
>    workingset_refault() → lru_gen_refault() → rcu_read_lock().
>    While inside the RCU read-side critical section, it is
>    preempted via preempt_schedule_irq() (the IRQ exit path detects
>    TIF_NEED_RESCHED).
> 2. Threads B & C enter shmem_swap_alloc_folio(), fail at
>    swapcache_prepare() (slot already taken by A), and return -EEXIST.
> 3. In shmem_get_folio_gfp():
>        error = shmem_swapin_folio(...);
>        if (error == -EEXIST)
>            goto repeat; // no cond_resched(), tight loop
> 4. Threads B & C spin at 100% CPU in the retry loop. All three
>    threads share the same cpuset (CPU0-1, cpus_allowed=0x3). Thread A
>    is perpetually preempted and starved: it cannot complete the
>    few instructions needed to call rcu_read_unlock().
> 5. The held RCU read lock blocks the grace period indefinitely,
>    causing all synchronize_rcu() callers (cgroup operations,
>    fd allocation, etc.) to hang, eventually blocking init.
> [Scheduling Details]
> Key observations:
> 1. Thread A was RCU-boosted to prio 98 but accumulated only
>    99ms of execution over the entire stall period (~1200s).
>    It was effectively starved despite the priority boost.
> 2. Threads B & C have vruntime=0 and prio 91, indicating
>    they run in an RT-equivalent scheduling class (SCHED_FIFO/RT
>    policy). Each accumulated ~1134 seconds of execution with
>    only ~1600 context switches, meaning they ran uninterrupted
>    for ~700ms per scheduling quantum on average.
> 3. Thread A cannot preempt Threads B & C: although RCU boost
>    raised Thread A to prio 98, Threads B & C at prio 91 (lower
>    numeric value = higher priority in RT class) have equal or
>    higher effective priority. The busy-looping threads never
>    voluntarily yield (no cond_resched(), no blocking calls in
>    the loop), so Thread A never gets scheduled.
> 4. CPU contention: CPU0 had nr_running=28 and CPU1 had
>    nr_running=24, with 3-4 RT tasks per CPU. Thread A competed
>    with Thread B on CPU0 but could not win scheduling.
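A side note on the prio values quoted above: for RT tasks the kernel-internal prio is MAX_RT_PRIO - 1 - rt_priority, with MAX_RT_PRIO equal to 100, so a lower prio value means a higher RT priority. The following is only a minimal userspace sketch of that mapping; the macro and function names are hypothetical and are not kernel APIs.

```c
/* Userspace sketch only; the names below are illustrative, not kernel APIs. */

/* Assumption: mirrors the kernel's MAX_RT_PRIO, which is 100. */
#define SKETCH_MAX_RT_PRIO 100

/* Map a kernel-internal RT prio (0..98) to the rt_priority (99..1) it encodes. */
static int sketch_rt_priority_from_prio(int prio)
{
	return SKETCH_MAX_RT_PRIO - 1 - prio;
}

/* Nonzero if a task at prio_a outranks a task at prio_b (lower value wins). */
static int sketch_outranks(int prio_a, int prio_b)
{
	return prio_a < prio_b;
}
```

Under these assumptions, prio 98 corresponds to rt_priority 1 and prio 91 to rt_priority 8, so the spinning threads B and C outrank the boosted thread A.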
> [Observed Impact]
> 1. RCU stall lasting 910+ seconds (19 consecutive stall
>    warnings, grace period g=4398761 never advanced)
> 2. synchronize_rcu_expedited() callers blocked for 742+ seconds
> 3. init process hung for > 720 seconds → system unresponsive
> [Call Traces]
> Thread A (RCU stall source, sampled 19 times identically):
> __switch_to+0x1a4/0x360 (T)
> __schedule+0x96c/0xf3c
> preempt_schedule_irq+0xec/0x198
> raw_irqentry_exit_cond_resched+0x2c/0x44
> irqentry_exit+0x38/0x64
> exit_to_kernel_mode+0x28/0x38
> el1_interrupt+0x5c/0xa8
> el1h_64_irq_handler+0x18/0x24
> el1h_64_irq+0x84/0x88
> workingset_refault+0x16c/0x79c (P)
> shmem_swapin_folio+0x8e4/0xd44
> shmem_get_folio_gfp+0xb8/0x710
> shmem_fault+0xa0/0x174
> __do_fault
> do_pte_missing
> handle_mm_fault
> do_page_fault
> el0_ia
> 
> Thread B (busy-loop on CPU0, sum_exec_runtime=1134s):
> xas_load+0x78/0xe4 (P)
>    shmem_swapin_folio+0x950/0xd44
>      shmem_get_folio_gfp+0xb8/0x710
>        shmem_fault → ... → el0_ia
> 
> Thread C (busy-loop on CPU1, sum_exec_runtime=1134s):
> xas_load+0x50/0xe4 (P)
>    shmem_swapin_folio+0xd8/0xd44
>      shmem_get_folio_gfp+0xb8/0x710
>        shmem_fault → ... → el0_ia
> [Question]
> What is the recommended approach to fix this livelock?
> We are considering adding a cond_resched() before the
> goto repeat in shmem_get_folio_gfp() to break the tight
> loop and allow the swapcache-holding thread to make
> progress. Would this be an acceptable fix, or is there
> a better strategy (e.g., bounded retry with fallback,
> or yielding to the specific waiter)?

IIRC, the scheduler maintainers are not fans of continuing to sprinkle
random cond_resched() calls throughout the kernel. Scheduling
decisions should be left to the scheduler itself.

Regarding your issue, could you try the latest kernel? IIUC, this 
problem has already been fixed there (likely from Kairui's swap 
refactoring work [1]).

Now the shmem swapin call trace should be:

shmem_swapin_folio()
   -> shmem_swap_alloc_folio() (I think you use the SYNC swap device)
     -> swapin_sync()

In swapin_sync(), it first checks whether a folio is already present in 
the swapcache. If so, it returns immediately. In your case, threads B/C 
would get the folio that has already been added to the swapcache and 
continue onward, instead of retrying in a loop.

struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orders,
                          struct vm_fault *vmf, struct mempolicy *mpol,
                          pgoff_t ilx)
{
        struct folio *folio;

        do {
                folio = swap_cache_get_folio(entry);
                if (folio)
                        return folio;
                folio = swap_cache_alloc_folio(entry, gfp, orders, vmf,
                                               mpol, ilx);
        } while (PTR_ERR(folio) == -EEXIST);

        if (IS_ERR(folio))
                return folio;

        swap_read_folio(folio, NULL);
        return folio;
}
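
To illustrate why a lookup-first loop of this shape does not keep spinning on a slot that another thread has already filled, here is a minimal userspace model. It assumes that a folio becomes visible to lookups in the same atomic step that makes a competing insert fail with -EEXIST. The single-slot cache and all model_* names are hypothetical; this is a sketch of the pattern, not the kernel implementation.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a folio; not a kernel structure. */
struct model_folio {
	int data;
};

/* Hypothetical one-entry "swap cache": NULL means the slot is empty. */
static _Atomic(struct model_folio *) model_slot;

/* Lookup: returns the cached folio, or NULL if the slot is empty. */
static struct model_folio *model_cache_get(void)
{
	return atomic_load(&model_slot);
}

/*
 * Allocate a folio and try to install it in the slot.
 * Returns 0 and sets *out on success, -EEXIST if another folio is
 * already installed (*out is left untouched), -ENOMEM on failure.
 */
static int model_cache_insert(struct model_folio **out)
{
	struct model_folio *expected = NULL;
	struct model_folio *folio = malloc(sizeof(*folio));

	if (!folio)
		return -ENOMEM;
	folio->data = 0;

	if (!atomic_compare_exchange_strong(&model_slot, &expected, folio)) {
		free(folio);
		return -EEXIST;
	}
	*out = folio;
	return 0;
}

/*
 * Lookup first, insert second. After an -EEXIST the next lookup finds
 * the winner's folio, so the loser returns instead of retrying forever.
 * *reads counts how often this caller performed the simulated read.
 */
static struct model_folio *model_swapin(int *reads)
{
	struct model_folio *folio;
	int err;

	do {
		folio = model_cache_get();
		if (folio)
			return folio;
		err = model_cache_insert(&folio);
	} while (err == -EEXIST);

	if (err)
		return NULL;

	folio->data = 42;	/* stands in for reading the data from swap */
	(*reads)++;
	return folio;
}
```

In this model the first caller installs the folio and performs the read; any later caller gets the same folio back from the lookup and never reaches the insert path.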


[1] https://lore.kernel.org/all/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.c


