All of lore.kernel.org
 help / color / mirror / Atom feed
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Muchun Song <muchun.song@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Dave Chinner <david@fromorbit.com>,
	 Roman Gushchin <roman.gushchin@linux.dev>,
	Qi Zheng <qi.zheng@linux.dev>, Kairui Song <kasong@tencent.com>,
	 Meta kernel team <kernel-team@meta.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	 linux-kernel@vger.kernel.org, Chris Mason <clm@fb.com>
Subject: Re: [PATCH] mm/list_lru: drain before clearing xarray entry on reparent
Date: Mon, 1 Jun 2026 08:38:28 -0700	[thread overview]
Message-ID: <ah2VXGfGZOOdjhs3@linux.dev> (raw)
In-Reply-To: <79CD986A-2130-4FB8-804F-A543AF22342B@linux.dev>

Hi Muchun, thanks for taking a look.

On Mon, Jun 01, 2026 at 05:54:01PM +0800, Muchun Song wrote:
> 
> 
> > On Jun 1, 2026, at 14:34, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > 
> > memcg_reparent_list_lrus() clears the dying memcg's xarray entry with
> > xas_store(&xas, NULL) before reparenting its per-node lists into the
> > parent. This opens a window where a concurrent list_lru_del() arriving
> > for the dying memcg sees xa_load() == NULL, walks to the parent in
> > lock_list_lru_of_memcg(), takes the parent's per-node lock, and calls
> > list_del_init() on an item still physically linked on the dying
> > memcg's list.
> > 
> > If another in-flight thread holds the dying memcg's per-node lock at
> > the same moment (another list_lru_del, or a list_lru_walk_one running
> > an isolate callback), both threads modify ->next/->prev pointers on the
> > same physical list under different locks. Adjacent items can corrupt
> > each other's links.
> > 
> > Fix it by reversing the order: reparent each per-node list and mark the
> > child's list lru dead and then clear the xarray entry. Any concurrent
> > list_lru op that finds the still-set xarray entry either takes the dying
> > memcg's per-node lock (synchronizing with the drain) or sees LONG_MIN
> > and walks to the parent, where the items now live.
> > 
> > Fixes: fb56fdf8b9a2 ("mm/list_lru: split the lock to per-cgroup scope")
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Reported-by: Chris Mason <clm@fb.com>
> > ---
> > mm/list_lru.c | 20 +++++++++-----------
> > 1 file changed, 9 insertions(+), 11 deletions(-)
> > 
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index dd29bcf8eb5f..ae55a52307db 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -473,26 +473,24 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
> > 	mutex_lock(&list_lrus_mutex);
> > 	list_for_each_entry(lru, &memcg_list_lrus, list) {
> > 		struct list_lru_memcg *mlru;
> > - 		XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
> > 
> > - 		/*
> > -		 * Lock the Xarray to ensure no on going list_lru_memcg
> > -		 * allocation and further allocation will see css_is_dying().
> > -		 */
> > - 		xas_lock_irq(&xas);
> > - 		mlru = xas_store(&xas, NULL);
> > - 		xas_unlock_irq(&xas);
> > + 		mlru = xa_load(&lru->xa, memcg->kmemcg_id);
> > 		if (!mlru)
> > 			continue;
> 
> Is it possible that concurrent threads running memcg_list_lru_alloc() could
> allocate a new mlru after this check passes? This could happen because the
> threads haven't noticed css_is_dying() yet. We would consequently miss the
> reparent operation for this list. So xas_lock_irq is necessary to serialize
> CSS_DYING setting here. Right?

Good question and it seems like Sashiko [1] raised a similar concern. However
please note that memcg_list_lru_alloc() uses CSS_DYING when it allocate a new
mlru but memcg_reparent_list_lrus() is called from offlice_css() callback and
the given css should already have CSS_DYING before calling offline_css(). There
is a rcu grace period between setting CSS_DYING and calling offline_css().

[1] https://sashiko.dev/#/patchset/20260601063408.2879011-1-shakeel.butt%40linux.dev

> 
> Thanks.
> Muchun
> 
> > 
> > 		/*
> > -		 * With Xarray value set to NULL, holding the lru lock below
> > -		 * prevents list_lru_{add,del,isolate} from touching the lru,
> > -		 * safe to reparent.
> > +		 * Reparent each per-node list and mark the child dead
> > +		 * (LONG_MIN) before clearing xarray entry otherwisw a
> > +		 * concurrent list_lru_del() may corrupt the list if it arrives
> > +		 * after xarray clear but before reparenting as
> > +		 * lock_list_lru_of_memcg will acquire parent's lock while the
> > +		 * item is still on child's list.
> > 		 */
> > 		for_each_node(i)
> > 			memcg_reparent_list_lru_one(lru, i, &mlru->node[i], parent);
> > 
> > + 		xa_erase(&lru->xa, memcg->kmemcg_id);

This one is more tricky. Sashiko said:

" Is it safe to use xa_erase() here instead of xa_erase_irq()?

The list_lru xarray is initialized with XA_FLAGS_LOCK_IRQ, and elements are
added holding the lock via xas_lock_irqsave(), which establishes an IRQ-safe
lock class.

Since xa_erase() internally calls spin_lock() without disabling local
interrupts, an interrupt firing while the lock is held could attempt to
re-acquire the same lock in __memcg_list_lru_alloc(), leading to a deadlock.

This could also trigger a lockdep warning for an inconsistent lock state. "

Initially I though this is a false positive as I couldn't find irq callers for
kmem_cache_alloc_lru() but then claude came up with more concrete scenario which
is below:

"""
For the shadow_nodes lru this lock is also acquired nested under the page
cache i_pages lock, which is irq-safe.  Adding a folio holds i_pages and
then allocates an xarray node through the shadow_nodes lru:

__filemap_add_folio()
  mapping_set_update(&xas, mapping)     // xas->xa_lru = &shadow_nodes
  xas_lock_irq(&xas)                    // holds mapping->i_pages
  xas_store() -> xas_alloc()
    kmem_cache_alloc_lru(radix_tree_node_cachep, xas->xa_lru, gfp)
      memcg_list_lru_alloc(memcg, &shadow_nodes, gfp)
        xas_lock_irqsave(&shadow_nodes->xa)   // shadow_nodes->xa under i_pages

and i_pages is taken from writeback completion in irq context:

__folio_end_writeback()
        xa_lock_irqsave(&mapping->i_pages, flags);

So with xa_erase() taking shadow_nodes->xa with irqs enabled:

CPU0 memcg_reparent_list_lrus()    CPU1 __filemap_add_folio()
  xa_erase(&shadow_nodes->xa)
    xa_lock(&shadow_nodes->xa)
                                   xas_lock_irq(&i_pages)  // holds i_pages
                                   ... memcg_list_lru_alloc()
                                     xas_lock_irqsave(&shadow_nodes->xa) // waits
  <io completion irq on CPU0>
  __folio_end_writeback()
    xa_lock_irqsave(&i_pages)  // waits

Can this deadlock, and should this be xa_erase_irq() to keep the irq-safe
acquisition that the removed xas_lock_irq() had?
"""

This seems more plausible and I think simply using xa_erase_irq() is more safe.

I will send a v2 with this change.

      reply	other threads:[~2026-06-01 15:38 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-01  6:34 [PATCH] mm/list_lru: drain before clearing xarray entry on reparent Shakeel Butt
2026-06-01  9:54 ` Muchun Song
2026-06-01 15:38   ` Shakeel Butt [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ah2VXGfGZOOdjhs3@linux.dev \
    --to=shakeel.butt@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=clm@fb.com \
    --cc=david@fromorbit.com \
    --cc=hannes@cmpxchg.org \
    --cc=kasong@tencent.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=muchun.song@linux.dev \
    --cc=qi.zheng@linux.dev \
    --cc=roman.gushchin@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.