Message-ID: <34bafd06-250a-4019-8b34-5ddeedea1cb3@kernel.org>
Date: Mon, 24 Nov 2025 20:22:56 +0100
Subject: Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with direct refcount calculation
From: "David Hildenbrand (Red Hat)"
To: Zi Yan
Cc: Lorenzo Stoakes, Andrew Morton, Baolin Wang, "Liam R. Howlett", Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Miaohe Lin, Naoya Horiguchi, Wei Yang, Balbir Singh, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20251122025529.1562592-1-ziy@nvidia.com> <20251122025529.1562592-3-ziy@nvidia.com> <33A929D1-7438-43C1-AA4A-398183976F8F@nvidia.com>
In-Reply-To: <33A929D1-7438-43C1-AA4A-398183976F8F@nvidia.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/24/25 18:05, Zi Yan wrote:
> On 24 Nov 2025, at 5:41, David Hildenbrand (Red Hat) wrote:
>
>> On 11/22/25 03:55, Zi Yan wrote:
>>> can_split_folio() is just a refcount comparison, making sure only the
>>> split caller holds an extra pin. Open code it with
>>> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
>>> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat)
>>> Signed-off-by: Zi Yan
>>> ---
>>>   include/linux/huge_mm.h |  1 -
>>>   mm/huge_memory.c        | 43 ++++++++++++++++-------------------------
>>>   mm/vmscan.c             |  3 ++-
>>>   3 files changed, 19 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 97686fb46e30..1ecaeccf39c9 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -369,7 +369,6 @@ enum split_type {
>>>   	SPLIT_TYPE_NON_UNIFORM,
>>>   };
>>>
>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>   int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>   		unsigned int new_order);
>>>   int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index c1f1055165dd..6c821c1c0ac3 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>>>   	}
>>>   }
>>>
>>> -/* Racy check whether the huge page can be split */
>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>> -{
>>> -	int extra_pins;
>>> -
>>> -	/* Additional pins from page cache */
>>> -	if (folio_test_anon(folio))
>>> -		extra_pins = folio_test_swapcache(folio) ?
>>> -				folio_nr_pages(folio) : 0;
>>> -	else
>>> -		extra_pins = folio_nr_pages(folio);
>>> -	if (pextra_pins)
>>> -		*pextra_pins = extra_pins;
>>> -	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
>>> -			caller_pins;
>>> -}
>>> -
>>>   static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>>   {
>>>   	for (; nr_pages; page++, nr_pages--)
>>> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>>   	return 0;
>>>   }
>>>
>>> +/* Number of folio references from the pagecache or the swapcache. */
>>> +static unsigned int folio_cache_references(const struct folio *folio)
>>> +{
>>> +	if (folio_test_anon(folio) && !folio_test_swapcache(folio))
>>> +		return 0;
>>> +	return folio_nr_pages(folio);
>>> +}
>>> +
>>>   static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
>>>   		struct page *split_at, struct xa_state *xas,
>>>   		struct address_space *mapping, bool do_lru,
>>>   		struct list_head *list, enum split_type split_type,
>>> -		pgoff_t end, int *nr_shmem_dropped, int extra_pins)
>>> +		pgoff_t end, int *nr_shmem_dropped)
>>>   {
>>>   	struct folio *end_folio = folio_next(folio);
>>>   	struct folio *new_folio, *next;
>>>   	int old_order = folio_order(folio);
>>>   	int ret = 0;
>>>   	struct deferred_split *ds_queue;
>>> +	int extra_pins = folio_cache_references(folio);
>>
>> Can we just inline the call to folio_cache_references() and get rid of
>> extra_pins (which is a bad name either way)?
>>
>> if (folio_ref_freeze(folio, folio_cache_references(folio) + 1)) {
>>
>> BTW, now that we have this helper, I wonder if we should then also do the
>> following for clarification on the unfreeze path:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 0acdc2f26ee0c..7cbcf61b7971d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3824,8 +3824,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>   		zone_device_private_split_cb(folio, new_folio);
>>
>> -		expected_refs = folio_expected_ref_count(new_folio) + 1;
>> -		folio_ref_unfreeze(new_folio, expected_refs);
>> +		folio_ref_unfreeze(new_folio, folio_cache_references(new_folio) + 1);
>>
>>   		if (do_lru)
>>   			lru_add_split_folio(folio, new_folio, lruvec, list);
>> @@ -3868,8 +3867,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>   	 * Otherwise, a parallel folio_try_get() can grab @folio
>>   	 * and its caller can see stale page cache entries.
>>   	 */
>> -	expected_refs = folio_expected_ref_count(folio) + 1;
>> -	folio_ref_unfreeze(folio, expected_refs);
>> +	folio_ref_unfreeze(folio, folio_cache_references(folio) + 1);
>>
>>   	if (do_lru)
>>   		unlock_page_lruvec(lruvec);
>>
>
> Both make sense to me. Will make the change.
>
> By comparing folio_cache_references() with folio_expected_ref_count(),
> one difference is that folio_expected_ref_count() does not give the right
> refcount for shmem in the swapcache.

Good point. Likely nobody runs into that right now, because nobody can
really do anything with these folios before they are re-added to the
pagecache or mapped into page tables.

> This is the folio_expected_ref_count() code:
>
> 	if (folio_test_anon(folio)) {
> 		/* One reference per page from the swapcache. */
> 		ref_count += folio_test_swapcache(folio) << order;
> 	} else {
> 		/* One reference per page from the pagecache. */
> 		ref_count += !!folio->mapping << order;
> 		/* One reference from PG_private. */
> 		ref_count += folio_test_private(folio);
> 	}
>
> shmem in swapcache means !folio_test_anon(folio) && folio_test_swapcache(folio).

See below; it's actually folio_test_anon(folio) && folio_test_swapbacked(folio)
&& folio_test_swapcache(folio), I think ...

> The above code gives 0, but folio_cache_references() gives folio_nr_pages(folio).
> It should not cause any issue, since IIUC shmem in swapcache happens
> when the folio has an additional ref, so
> folio_expected_ref_count() != folio_ref_count() anyway. For split, it is
> not supported yet,

Right.

> so folio_expected_ref_count() in split code does not
> affect shmem in swapcache. But folio_expected_ref_count() should be
> fixed, right?

We should better handle it, agreed.

Staring at the history of folio_expected_ref_count() once again: back when
we had folio_expected_refs() in the migration code, we didn't seem to
handle it either, I think.
-static int folio_expected_refs(struct address_space *mapping,
-		struct folio *folio)
-{
-	int refs = 1;
-
-	if (!mapping)
-		return refs;
-
-	refs += folio_nr_pages(folio);
-	if (folio_test_private(folio))
-		refs++;
-
-	return refs;
-}

gup.c doesn't care, because the pages are still mapped. khugepaged.c
similarly. memfd.c doesn't care, because the pages are still in the
pagecache.

So I suspect nothing is broken, but the migration case needs a second look.

> Like:
>
> 	if (folio_test_anon(folio)) {
> 		/* One reference per page from the swapcache. */
> 		ref_count += folio_test_swapcache(folio) << order;
> 	} else {
> 		/* One reference per page from shmem in the swapcache. */
> 		ref_count += folio_test_swapcache(folio) << order;
> 		/* One reference per page from the pagecache. */
> 		ref_count += !!folio->mapping << order;
> 		/* One reference from PG_private. */
> 		ref_count += folio_test_private(folio);
> 	}
>
> or simplified into
>
> 	if (!folio_test_anon(folio)) {
> 		/* One reference per page from the pagecache. */
> 		ref_count += !!folio->mapping << order;
> 		/* One reference from PG_private. */
> 		ref_count += folio_test_private(folio);
> 	}
> 	/* One reference per page from the swapcache (anon or shmem). */
> 	ref_count += folio_test_swapcache(folio) << order;
>
> ?

That is incorrect, I think, due to swapcache being able to give false
positives (PG_swapcache aliases PG_owner_priv_1).

-- 
Cheers

David