linux-kernel.vger.kernel.org archive mirror
From: Ryan Roberts <ryan.roberts@arm.com>
To: Dev Jain <dev.jain@arm.com>,
	akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
	kirill.shutemov@linux.intel.com
Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com,
	cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
	apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
	baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
	haowenchao22@gmail.com, hughd@google.com,
	aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
	peterx@redhat.com, ioworker0@gmail.com,
	wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
	surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
	zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
	21cnbao@gmail.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped
Date: Wed, 18 Dec 2024 07:36:08 +0000	[thread overview]
Message-ID: <7cc1840b-6f6c-4f82-86b8-41bb6fbc1b81@arm.com> (raw)
In-Reply-To: <20241216165105.56185-11-dev.jain@arm.com>

On 16/12/2024 16:51, Dev Jain wrote:
> We may hit a situation wherein we have a larger folio mapped. It is incorrect
> to go ahead with the collapse since some pages will be unmapped, leading to
> the entire folio getting unmapped. Therefore, skip the corresponding range.
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> In the future, if at all it is required that at some point we want all the folios
> in the system to be of a specific order, we may split these larger folios.
> 
>  mm/khugepaged.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 8040b130e677..47e7c476b893 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -33,6 +33,7 @@ enum scan_result {
>  	SCAN_PMD_NULL,
>  	SCAN_PMD_NONE,
>  	SCAN_PMD_MAPPED,
> +	SCAN_PTE_MAPPED,
>  	SCAN_EXCEED_NONE_PTE,
>  	SCAN_EXCEED_SWAP_PTE,
>  	SCAN_EXCEED_SHARED_PTE,
> @@ -609,6 +610,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		folio = page_folio(page);
>  		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>  
> +		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
> +			result = SCAN_PTE_MAPPED;
> +			goto out;
> +		}
> +
>  		/* See hpage_collapse_scan_ptes(). */
>  		if (folio_likely_mapped_shared(folio)) {
>  			++shared;
> @@ -1369,6 +1375,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>  	unsigned long orders;
>  	pte_t *pte, *_pte;
>  	spinlock_t *ptl;
> +	int found_order;
>  	pmd_t *pmd;
>  	int order;
>  
> @@ -1467,6 +1474,24 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>  			goto out_unmap;
>  		}
>  
> +		found_order = folio_order(folio);
> +
> +		/*
> +		 * No point in scanning further. Two options: if this folio
> +		 * was hit somewhere in the middle of the scan, drop down the
> +		 * order. Or, skip completely to the end of this folio. The
> +		 * latter gives us a higher order to start with, with at most
> +		 * 1 << order PTEs not collapsed; the former may force us
> +		 * to end up going below order 2 and exiting.
> +		 */
> +		if (order != HPAGE_PMD_ORDER && found_order >= order) {
> +			result = SCAN_PTE_MAPPED;
> +			_address += (PAGE_SIZE << found_order);
> +			_pte += (1UL << found_order);
> +			pte_unmap_unlock(pte, ptl);
> +			goto decide_order;
> +		}

It would be good if you could spell out the desired policy for when khugepaged
hits partially unmapped large folios and unaligned large folios. I think the
simple approach is to always collapse them to fully mapped, aligned folios,
even if the resulting order is smaller than the original. But I'm not sure
that will always be the best thing.

Regardless, I'm struggling to understand the logic in this patch. Taking the
order of a folio based on having hit one of its pages says nothing about
whether the whole of that folio is mapped, nor about its alignment. And it's
not clear to me how we would get into a situation where we are scanning for a
lower order and find a (fully mapped, aligned) folio of higher order in the
first place.

Let's assume the desired policy is that khugepaged should always collapse to
naturally aligned large folios. If there happens to be an existing aligned
order-4 folio that is fully mapped, we will identify that for collapse as part
of the scan for order-4. At that point, we should just notice that it is already
an aligned order-4 folio and bypass collapse. Of course we may have already
chosen to collapse it into a higher order, but we should definitely not get to a
lower order before we notice it.

Hmm... I guess if the sysfs thp settings have been changed then things could get
spicy... if order-8 was previously enabled and we have an order-8 folio, then it
gets disabled and khugepaged is scanning for order-4 (which is still enabled)
and hits the order-8; what's the expected policy? Rework it into 2 order-4
folios, or leave it as a single order-8?


> +
>  		/*
>  		 * We treat a single page as shared if any part of the THP
>  		 * is shared. "False negatives" from
> @@ -1550,6 +1575,10 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>  		if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>  			goto out;
>  	}
> +	/* A larger folio was mapped; it will be skipped in the next iteration */
> +	if (result == SCAN_PTE_MAPPED)
> +		goto decide_order;
> +
>  	if (result != SCAN_SUCCEED) {
>  
>  		/* Go to the next order. */
> @@ -1558,6 +1587,8 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>  			goto out;
>  		goto maybe_mmap_lock;
>  	} else {
> +
> +decide_order:
>  		address = _address;
>  		pte = _pte;
>  

