From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <050ce5bd-4725-468e-acaf-7fca72b84d06@linux.dev>
Date: Wed, 11 Mar 2026 20:00:29 +0300
Subject: Re: [PATCH] mm: switch deferred split shrinker to list_lru
From: Usama Arif
To: Johannes Weiner , Andrew Morton
Cc: David Hildenbrand , Zi Yan , "Liam R. Howlett" , Kiryl Shutsemau ,
 Dave Chinner , Roman Gushchin , linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20260311154358.150977-1-hannes@cmpxchg.org>
In-Reply-To: <20260311154358.150977-1-hannes@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 11/03/2026 18:43, Johannes Weiner wrote:
> The deferred split queue handles cgroups in a suboptimal fashion. The
> queue is per-NUMA node or per-cgroup, not the intersection. That means
> on a cgrouped system, a node-restricted allocation entering reclaim
> can end up splitting large pages on other nodes:
>
>   alloc/unmap
>     deferred_split_folio()
>       list_add_tail(memcg->split_queue)
>       set_shrinker_bit(memcg, node, deferred_shrinker_id)
>
>   for_each_zone_zonelist_nodemask(restricted_nodes)
>     mem_cgroup_iter()
>       shrink_slab(node, memcg)
>         shrink_slab_memcg(node, memcg)
>           if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>             deferred_split_scan()
>               walks memcg->split_queue
>
> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> has a single large page on the node of interest, all large pages owned
> by that memcg, including those on other nodes, will be split.
>
> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> streamlines a lot of the list operations and reclaim walks. It's used
> widely by other major shrinkers already. Convert the deferred split
> queue as well.
>
> The list_lru per-memcg heads are instantiated on demand when the first
> object of interest is allocated for a cgroup, by calling
> memcg_list_lru_alloc(). Add calls to where splittable pages are
> created: anon faults, swapin faults, khugepaged collapse.
>
> These calls create all possible node heads for the cgroup at once, so
> the migration code (between nodes) doesn't need any special care.
>
> The folio_test_partially_mapped() state is currently protected and
> serialized wrt LRU state by the deferred split queue lock. To
> facilitate the transition, add helpers to the list_lru API to allow
> caller-side locking.
>
> Signed-off-by: Johannes Weiner
> ---
>  include/linux/huge_mm.h    |   6 +-
>  include/linux/list_lru.h   |  48 ++++++
>  include/linux/memcontrol.h |   4 -
>  include/linux/mmzone.h     |  12 --
>  mm/huge_memory.c           | 326 +++++++++++-------------------------
>  mm/internal.h              |   2 +-
>  mm/khugepaged.c            |   7 +
>  mm/list_lru.c              | 197 ++++++++++++++--------
>  mm/memcontrol.c            |  12 +-
>  mm/memory.c                |  52 +++---
>  mm/mm_init.c               |  14 --
>  11 files changed, 310 insertions(+), 370 deletions(-)
>

[..]

> @@ -3802,33 +3706,25 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>  	struct folio *new_folio, *next;
>  	int old_order = folio_order(folio);
>  	int ret = 0;
> -	struct deferred_split *ds_queue;
> +	struct list_lru_one *l;
>
>  	VM_WARN_ON_ONCE(!mapping && end);
>  	/* Prevent deferred_split_scan() touching ->_refcount */
> -	ds_queue = folio_split_queue_lock(folio);
> +	l = list_lru_lock(&deferred_split_lru, folio_nid(folio), folio_memcg(folio));

Hello Johannes! I think we need folio_memcg() to be under
rcu_read_lock()? folio_memcg() calls obj_cgroup_memcg() which has
lockdep_assert_once(rcu_read_lock_held()).
folio_split_queue_lock() wrapped split_queue_lock() under
rcu_read_lock(), so this wasn't an issue before.

>  	if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
>  		struct swap_cluster_info *ci = NULL;
>  		struct lruvec *lruvec;
>
>  		if (old_order > 1) {
> -			if (!list_empty(&folio->_deferred_list)) {
> -				ds_queue->split_queue_len--;
> -				/*
> -				 * Reinitialize page_deferred_list after removing the
> -				 * page from the split_queue, otherwise a subsequent
> -				 * split will see list corruption when checking the
> -				 * page_deferred_list.
> -				 */
> -				list_del_init(&folio->_deferred_list);
> -			}
> +			__list_lru_del(&deferred_split_lru, l,
> +				       &folio->_deferred_list, folio_nid(folio));
>  			if (folio_test_partially_mapped(folio)) {
>  				folio_clear_partially_mapped(folio);
>  				mod_mthp_stat(old_order,
>  					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
>  			}
>  		}
> -		split_queue_unlock(ds_queue);
> +		list_lru_unlock(l);
>  		if (mapping) {
>  			int nr = folio_nr_pages(folio);
>
> @@ -3929,7 +3825,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>  		if (ci)
>  			swap_cluster_unlock(ci);
>  	} else {
> -		split_queue_unlock(ds_queue);
> +		list_lru_unlock(l);
>  		return -EAGAIN;
>  	}
>

[..]
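
For the rcu_read_lock() point above, something along these lines might
keep lockdep happy (completely untested sketch, using the names from
your patch; dropping the rcu lock once the list_lru lock is held would
mirror what folio_split_queue_lock() used to do):

	/* Prevent deferred_split_scan() touching ->_refcount */
	rcu_read_lock();
	l = list_lru_lock(&deferred_split_lru, folio_nid(folio),
			  folio_memcg(folio));
	rcu_read_unlock();

Alternatively, list_lru_lock() itself could take rcu_read_lock()
internally around the memcg dereference, which would keep all callers
simple.

Thanks,
Usama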