Message-ID: 
Date: Fri, 19 Dec 2025 09:55:56 +0100
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: 
List-Subscribe: 
List-Unsubscribe: 
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
To: Vernon Yang
Cc: akpm@linux-foundation.org, lorenzo.stoakes@oracle.com, ziy@nvidia.com,
 baohua@kernel.org, lance.yang@linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Vernon Yang
References: <20251215090419.174418-1-yanglincheng@kylinos.cn>
 <20251215090419.174418-3-yanglincheng@kylinos.cn>
 <26e65878-f214-4890-8bcb-24a45122bfd6@kernel.org>
From: "David Hildenbrand (Red Hat)"
Content-Language: en-US
In-Reply-To: 
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

On 12/19/25 09:35, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> The following data was traced by bpftrace on a desktop system. After
>>> the system had been left idle for 10 minutes after booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_PMD_NONE results were observed during a full
>>> scan by khugepaged.
>>>
>>> @scan_pmd_status[1]: 1   ## SCAN_SUCCEED
>>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>>> total progress size: 701 MB
>>> Total time: 440 seconds  ## includes khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list holds every task that supports collapsing
>>> into hugepages; as long as the task is not destroyed, khugepaged will
>>> not remove it from the khugepaged_scan list.
>>> This leads to a situation where a task has already collapsed all of
>>> its memory regions into hugepages, but khugepaged continues to scan
>>> it, wasting CPU time to no effect; and because of
>>> khugepaged_scan_sleep_millisecs (default 10s), scanning a large
>>> number of such unproductive tasks causes a long wait, so tasks that
>>> could genuinely be collapsed are scanned later.
>>>
>>> After applying this patch, when all memory is either SCAN_PMD_MAPPED
>>> or SCAN_PMD_NONE, the mm is automatically removed from khugepaged's
>>> scan list. If a page fault occurs or MADV_HUGEPAGE is issued again,
>>> it is added back to khugepaged.
>>
>> I don't like that, as it assumes that memory within such a process
>> would be rather static, which is easily not the case (e.g., allocators
>> just doing MADV_DONTNEED to free memory).
>>
>> If most stuff is collapsed to PMDs already, can't we just skip over
>> these regions a bit faster?
>
> I had a flash of inspiration and came up with a good idea.
>
> If these regions have already been collapsed into hugepages, rechecking
> them is very fast. Since khugepaged_pages_to_scan can also represent
> the number of VMAs to skip, we can extend its semantics as follows:
>
> /*
>  * By default, scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or
>  * vmas every 10 seconds.
>  */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
>
> switch (*result) {
> case SCAN_NO_PTE_TABLE:
> case SCAN_PMD_MAPPED:
> case SCAN_PTE_MAPPED_HUGEPAGE:
> 	progress++; // here
> 	break;
> case SCAN_SUCCEED:
> 	++khugepaged_pages_collapsed;
> 	fallthrough;
> default:
> 	progress += HPAGE_PMD_NR;
> }
>
> This way we can achieve our goal. David, do you like it?

I'd have to see the full patch, but IMHO we should rather focus on "how
many pte/pmd entries did we check" and not "how many PMD areas did we
check".

Maybe there is a history to this, but conceptually I think we wanted to
limit the work we do in one operation to something reasonable. Reading a
single PMD is obviously faster than reading 512 PTEs.
-- 
Cheers

David