Date: Wed, 18 Mar 2026 18:48:54 -0400
From: Johannes Weiner
To: "David Hildenbrand (Arm)"
Cc: Andrew Morton, Shakeel Butt, Yosry Ahmed, Zi Yan, "Liam R. Howlett",
	Usama Arif, Kiryl Shutsemau, Dave Chinner, Roman Gushchin,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 7/7] mm: switch deferred split shrinker to list_lru
References: <20260312205321.638053-1-hannes@cmpxchg.org>
	<20260312205321.638053-8-hannes@cmpxchg.org>
	<61d86249-cd89-4e99-99d8-ab7c72e95f34@kernel.org>
In-Reply-To: <61d86249-cd89-4e99-99d8-ab7c72e95f34@kernel.org>

On Wed, Mar 18, 2026 at 09:25:17PM +0100, David Hildenbrand (Arm) wrote:
> On 3/12/26 21:51, Johannes Weiner wrote:
> > The deferred split queue handles cgroups in a suboptimal fashion. The
> > queue is per-NUMA node or per-cgroup, not the intersection. That means
> > on a cgrouped system, a node-restricted allocation entering reclaim
> > can end up splitting large pages on other nodes:
> > 
> >   alloc/unmap
> >     deferred_split_folio()
> >       list_add_tail(memcg->split_queue)
> >       set_shrinker_bit(memcg, node, deferred_shrinker_id)
> > 
> >   for_each_zone_zonelist_nodemask(restricted_nodes)
> >     mem_cgroup_iter()
> >       shrink_slab(node, memcg)
> >         shrink_slab_memcg(node, memcg)
> >           if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> >             deferred_split_scan()
> >               walks memcg->split_queue
> > 
> > The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> > has a single large page on the node of interest, all large pages owned
> > by that memcg, including those on other nodes, will be split.
> > 
> > list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> > streamlines a lot of the list operations and reclaim walks. It's used
> > widely by other major shrinkers already. Convert the deferred split
> > queue as well.
> > 
> > The list_lru per-memcg heads are instantiated on demand when the first
> > object of interest is allocated for a cgroup, by calling
> > memcg_list_lru_alloc_folio(). Add calls to where splittable pages are
> > created: anon faults, swapin faults, khugepaged collapse.
> > 
> > These calls create all possible node heads for the cgroup at once, so
> > the migration code (between nodes) doesn't need any special care.
> > 
> [...]
> 
> > -
> >  static inline bool is_transparent_hugepage(const struct folio *folio)
> >  {
> >  	if (!folio_test_large(folio))
> > @@ -1293,6 +1189,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
> >  		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> >  		return NULL;
> >  	}
> > +
> > +	if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
> > +		folio_put(folio);
> > +		count_vm_event(THP_FAULT_FALLBACK);
> > +		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> > +		return NULL;
> > +	}
> 
> So, in all anon alloc paths, we essentially have
> 
> 1) vma_alloc_folio / __folio_alloc (khugepaged being odd)
> 2) mem_cgroup_charge / mem_cgroup_swapin_charge_folio
> 3) memcg_list_lru_alloc_folio
> 
> I wonder if we could do better in most cases and have something like a
> 
> vma_alloc_anon_folio()
> 
> That wraps the vma_alloc_folio() + memcg_list_lru_alloc_folio(), but
> still leaves the charging to the caller?

Hm, but it's the charging that figures out the memcg and sets
folio_memcg() :(

> That would at least combine 1) and 3) in a single API. (except for the
> odd cases without a VMA).
> 
> I guess we would want to skip the memcg_list_lru_alloc_folio() for
> order-0 folios, correct?

Yeah, we don't use the queue for order <= 1. In deferred_split_folio():

	/*
	 * Order 1 folios have no space for a deferred list, but we also
	 * won't waste much memory by not adding them to the deferred list.
	 */
	if (folio_order(folio) <= 1)
		return;

> > @@ -3802,33 +3706,28 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> >  	struct folio *new_folio, *next;
> >  	int old_order = folio_order(folio);
> >  	int ret = 0;
> > -	struct deferred_split *ds_queue;
> > +	struct list_lru_one *l;
> > 
> >  	VM_WARN_ON_ONCE(!mapping && end);
> >  	/* Prevent deferred_split_scan() touching ->_refcount */
> > -	ds_queue = folio_split_queue_lock(folio);
> > +	rcu_read_lock();
> 
> The RCU lock is for the folio_memcg(), right?
> 
> I recall I raised in the past that some get/put-like logic (that wraps
> the rcu_read_lock() + folio_memcg()) might make this a lot easier to get.
> 
> > memcg = folio_memcg_lookup(folio)
> > ... do stuff
> > folio_memcg_putback(folio, memcg);
> 
> Or sth like that.
> 
> Alternatively, you could have some helpers that do the
> list_lru_lock+unlock etc.
> 
> 	folio_memcg_list_lru_lock()
> 	...
> 	folio_memcg_list_lru_unlock(l);
> 
> Just some thoughts as inspiration :)

I remember you raising this in the objcg + reparenting patches.

There are a few more instances of

	rcu_read_lock()
	foo = folio_memcg()
	...
	rcu_read_unlock()

in other parts of the code not touched by these patches here, so the
first pattern is a more universal encapsulation.

Let me look into this. Would you be okay with a follow-up that covers
the others as well?
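
Roughly, the first variant could be as simple as something like this -
a completely untested sketch, with the helper names lifted straight
from your example above, nothing final:

	#include <linux/memcontrol.h>	/* folio_memcg(), struct mem_cgroup */
	#include <linux/rcupdate.h>	/* rcu_read_lock(), rcu_read_unlock() */

	/*
	 * Sketch only: pair the RCU protection with the folio_memcg()
	 * lookup so callers don't open-code rcu_read_lock().
	 */
	static inline struct mem_cgroup *folio_memcg_lookup(struct folio *folio)
	{
		/* Keep the folio->memcg binding stable until putback */
		rcu_read_lock();
		return folio_memcg(folio);
	}

	static inline void folio_memcg_putback(struct folio *folio,
					       struct mem_cgroup *memcg)
	{
		/* folio and memcg are unused here; kept to match the
		 * lookup side of the proposed API */
		rcu_read_unlock();
	}

The __folio_freeze_and_split_unmapped() hunk above would then bracket
the list_lru operations with folio_memcg_lookup()/folio_memcg_putback()
instead of the bare rcu_read_lock().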