Linux cgroups development
* [PATCH v2] mm/list_lru: drain before clearing xarray entry on reparent
@ 2026-06-01 16:15 Shakeel Butt
  2026-06-01 17:50 ` Kairui Song
  2026-06-02  2:37 ` Muchun Song
  0 siblings, 2 replies; 3+ messages in thread
From: Shakeel Butt @ 2026-06-01 16:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Dave Chinner, Roman Gushchin, Muchun Song,
	Qi Zheng, Kairui Song, Meta kernel team, linux-mm, cgroups,
	linux-kernel, Chris Mason, stable

memcg_reparent_list_lrus() clears the dying memcg's xarray entry with
xas_store(&xas, NULL) before reparenting its per-node lists into the
parent. This opens a window where a concurrent list_lru_del() arriving
for the dying memcg sees xa_load() == NULL, walks to the parent in
lock_list_lru_of_memcg(), takes the parent's per-node lock, and calls
list_del_init() on an item still physically linked on the dying
memcg's list.

If another in-flight thread holds the dying memcg's per-node lock at
the same moment (another list_lru_del(), or a list_lru_walk_one() running
an isolate callback), both threads modify ->next/->prev pointers on the
same physical list under different locks, so updates to adjacent items
can corrupt each other's links.

Fix it by reversing the order: reparent each per-node list, mark the
child's list_lru dead, and only then clear the xarray entry. A concurrent
list_lru op that still finds the xarray entry set either takes the dying
memcg's per-node lock (synchronizing with the drain) or sees the dead
marker (nr_items == LONG_MIN) and walks to the parent, where the items
now live.

Fixes: fb56fdf8b9a2 ("mm/list_lru: split the lock to per-cgroup scope")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org
---
Changes since v1:
- Use xa_erase_irq() instead of xa_erase() (Sashiko & Claude).
- Add a comment explaining that the CSS_DYING check in
  memcg_list_lru_alloc() prevents a new mlru allocation.
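
Not part of the patch: a minimal user-space sketch of the ordering
argument above, for readers who want to replay it. It is single-threaded
and models only which list a concurrent list_lru op would lock at each
step, not the locks themselves. All model_* names and both structs are
hypothetical; only the LONG_MIN dead marker and the fallback to the
parent on a missing or dead list are taken from the description above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; they only mimic the fields discussed above. */
struct model_list {
	long nr_items;			/* LONG_MIN marks a dead (reparented) list */
};

struct model_memcg {
	struct model_list *slot;	/* stands in for the lru->xa entry */
	struct model_memcg *parent;
};

/*
 * Which list's lock would a list_lru op take? Models the fallback described
 * for lock_list_lru_of_memcg(): a missing or dead list sends us to the parent.
 */
static struct model_list *model_lock_target(struct model_memcg *memcg)
{
	for (; memcg; memcg = memcg->parent) {
		struct model_list *l = memcg->slot;

		if (l && l->nr_items != LONG_MIN)
			return l;
	}
	return NULL;
}

/* Move the child's items to the parent, mark the child dead, return new home. */
static struct model_list *model_reparent(struct model_list *child,
					 struct model_list *parent)
{
	assert(child->nr_items != LONG_MIN);	/* must not reparent twice */
	parent->nr_items += child->nr_items;
	child->nr_items = LONG_MIN;
	return parent;
}

/*
 * Replay one teardown ordering and check, at every intermediate step, that a
 * concurrent deleter would lock the list the item is physically linked on.
 * erase_first == true models the old order, false the patched order.
 */
static bool model_teardown_is_safe(bool erase_first)
{
	struct model_list parent_list = { .nr_items = 0 };
	struct model_list child_list = { .nr_items = 1 };
	struct model_memcg parent = { .slot = &parent_list, .parent = NULL };
	struct model_memcg child = { .slot = &child_list, .parent = &parent };
	struct model_list *home = &child_list;	/* where the item is linked */
	bool safe = true;

	if (erase_first) {
		child.slot = NULL;			/* old: clear entry first */
		safe &= model_lock_target(&child) == home;	/* the window */
		home = model_reparent(&child_list, &parent_list);
	} else {
		safe &= model_lock_target(&child) == home;	/* before drain */
		home = model_reparent(&child_list, &parent_list);
		safe &= model_lock_target(&child) == home;	/* dead, entry set */
		child.slot = NULL;			/* new: clear entry last */
	}
	safe &= model_lock_target(&child) == home;	/* after teardown */
	return safe;
}
```

Under these assumptions, model_teardown_is_safe(true) (old order) finds a
step where the deleter would lock the parent's list while the item is
still linked on the child's, and model_teardown_is_safe(false) (patched
order) finds no such step.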

 mm/list_lru.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index dd29bcf8eb5f..d454bce9a78e 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -473,26 +473,29 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
 	mutex_lock(&list_lrus_mutex);
 	list_for_each_entry(lru, &memcg_list_lrus, list) {
 		struct list_lru_memcg *mlru;
-		XA_STATE(xas, &lru->xa, memcg->kmemcg_id);
 
 		/*
-		 * Lock the Xarray to ensure no on going list_lru_memcg
-		 * allocation and further allocation will see css_is_dying().
+		 * The css_is_dying() check in memcg_list_lru_alloc() prevents
+		 * allocation of a new mlru: CSS_DYING was already set for this
+		 * memcg an RCU grace period ago.
 		 */
-		xas_lock_irq(&xas);
-		mlru = xas_store(&xas, NULL);
-		xas_unlock_irq(&xas);
+		mlru = xa_load(&lru->xa, memcg->kmemcg_id);
 		if (!mlru)
 			continue;
 
 		/*
-		 * With Xarray value set to NULL, holding the lru lock below
-		 * prevents list_lru_{add,del,isolate} from touching the lru,
-		 * safe to reparent.
+		 * Reparent each per-node list and mark the child dead
+		 * (LONG_MIN) before clearing the xarray entry. Otherwise, a
+		 * concurrent list_lru_del() that arrives after the xarray is
+		 * cleared but before reparenting may corrupt the list, because
+		 * lock_list_lru_of_memcg() acquires the parent's lock while
+		 * the item is still on the child's list.
 		 */
 		for_each_node(i)
 			memcg_reparent_list_lru_one(lru, i, &mlru->node[i], parent);
 
+		xa_erase_irq(&lru->xa, memcg->kmemcg_id);
+
 		/*
 		 * Here all list_lrus corresponding to the cgroup are guaranteed
 		 * to remain empty, we can safely free this lru, any further
-- 
2.53.0-Meta



* Re: [PATCH v2] mm/list_lru: drain before clearing xarray entry on reparent
From: Kairui Song @ 2026-06-01 17:50 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Dave Chinner, Roman Gushchin,
	Muchun Song, Qi Zheng, Meta kernel team, linux-mm, cgroups,
	linux-kernel, Chris Mason, stable

Nice catch! Thanks a lot!

Reviewed-by: Kairui Song <kasong@tencent.com>


* Re: [PATCH v2] mm/list_lru: drain before clearing xarray entry on reparent
From: Muchun Song @ 2026-06-02  2:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Dave Chinner, Roman Gushchin,
	Qi Zheng, Kairui Song, Meta kernel team, linux-mm, cgroups,
	linux-kernel, Chris Mason, stable



> On Jun 2, 2026, at 00:15, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 		/*
> -		 * Lock the Xarray to ensure no on going list_lru_memcg
> -		 * allocation and further allocation will see css_is_dying().
> +		 * The css_is_dying() check in memcg_list_lru_alloc() prevents
> +		 * allocation of a new mlru: CSS_DYING was already set for this
> +		 * memcg an RCU grace period ago.

I see. xas_lock_irqsave() in memcg_list_lru_alloc() functions as an RCU read lock.

Acked-by: Muchun Song <muchun.song@linux.dev>

Thanks.

