Message-ID: 
Date: Fri, 19 Dec 2025 09:55:56 +0100
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: 
List-Subscribe: 
List-Unsubscribe: 
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
To: Vernon Yang
Cc: akpm@linux-foundation.org, lorenzo.stoakes@oracle.com, ziy@nvidia.com,
 baohua@kernel.org, lance.yang@linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Vernon Yang
References: <20251215090419.174418-1-yanglincheng@kylinos.cn>
 <20251215090419.174418-3-yanglincheng@kylinos.cn>
 <26e65878-f214-4890-8bcb-24a45122bfd6@kernel.org>
From: "David Hildenbrand (Red Hat)"
Content-Language: en-US
In-Reply-To: 
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

On 12/19/25 09:35, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> The following data was traced by bpftrace on a desktop system. After
>>> the system had been left idle for 10 minutes after booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_PMD_NONE results were observed during a full
>>> scan by khugepaged.
>>>
>>> @scan_pmd_status[1]: 1   ## SCAN_SUCCEED
>>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>>> total progress size: 701 MB
>>> Total time: 440 seconds  ## includes khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list holds every task that supports collapsing
>>> into hugepages; as long as the task is not destroyed, khugepaged will
>>> not remove it from the khugepaged_scan list.
>>> This leads to a situation where a task has already collapsed all of
>>> its memory regions into hugepages, but khugepaged continues to scan
>>> it, wasting CPU time to no effect; and because of
>>> khugepaged_scan_sleep_millisecs (default 10s), scanning a large
>>> number of such unproductive tasks causes a long wait, so tasks that
>>> could genuinely be collapsed are scanned later.
>>>
>>> After applying this patch, when all memory is either SCAN_PMD_MAPPED
>>> or SCAN_PMD_NONE, the mm is automatically removed from khugepaged's
>>> scan list. If a page fault occurs or MADV_HUGEPAGE is issued again,
>>> it is added back to khugepaged.
>>
>> I don't like that, as it assumes that memory within such a process
>> would be rather static, which is easily not the case (e.g., allocators
>> just doing MADV_DONTNEED to free memory).
>>
>> If most stuff is collapsed to PMDs already, can't we just skip over
>> these regions a bit faster?
>
> I had a flash of inspiration and came up with a good idea.
>
> If these regions have already been collapsed into hugepages, rechecking
> them is very fast. Since khugepaged_pages_to_scan can also represent
> the number of VMAs to skip, we can extend its semantics as follows:
>
> /*
>  * By default, scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or
>  * vmas every 10 seconds.
>  */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
>
> switch (*result) {
> case SCAN_NO_PTE_TABLE:
> case SCAN_PMD_MAPPED:
> case SCAN_PTE_MAPPED_HUGEPAGE:
> 	progress++; // here
> 	break;
> case SCAN_SUCCEED:
> 	++khugepaged_pages_collapsed;
> 	fallthrough;
> default:
> 	progress += HPAGE_PMD_NR;
> }
>
> This way we can achieve our goal. David, do you like it?

I'd have to see the full patch, but IMHO we should rather focus on "how
many pte/pmd entries did we check" and not "how many PMD areas did we
check".

Maybe there is a history to this, but conceptually I think we wanted to
limit the work we do in one operation to something reasonable. Reading a
single PMD is obviously faster than reading 512 PTEs.
-- 
Cheers

David