Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dev Jain <dev.jain@arm.com>
To: Vernon Yang <vernon2gm@gmail.com>,
	akpm@linux-foundation.org, david@kernel.org
Cc: lorenzo.stoakes@oracle.com, ziy@nvidia.com, baohua@kernel.org,
	lance.yang@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Vernon Yang <yanglincheng@kylinos.cn>
Subject: Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
Date: Wed, 28 Jan 2026 13:59:33 +0530	[thread overview]
Message-ID: <72dfd11f-dca4-447e-89c5-7a87cb675bda@arm.com> (raw)
In-Reply-To: <20260123082232.16413-3-vernon2gm@gmail.com>


On 23/01/26 1:52 pm, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> even if only scanning a single PTE/PMD entry.
>
> - When only scanning a sigle PTE entry, let me provide a detailed
>   example:
>
> static int hpage_collapse_scan_pmd()
> {
> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> 	     _pte++, addr += PAGE_SIZE) {
> 		pte_t pteval = ptep_get(_pte);
> 		...
> 		if (pte_uffd_wp(pteval)) { <-- first scan hit
> 			result = SCAN_PTE_UFFD_WP;
> 			goto out_unmap;
> 		}
> 	}
> }
>
> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> directly. In practice, only one PTE is scanned before termination.
> Here, "progress += 1" reflects the actual number of PTEs scanned, but
> previously "progress += HPAGE_PMD_NR" always.
>
> - When the memory has been collapsed to PMD, let me provide a detailed
>   example:
>
> The following data is traced by bpftrace on a desktop system. After
> the system has been left idle for 10 minutes upon booting, a lot of
> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> by khugepaged.
>
> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE

Could you elaborate what is [1], [6] etc and 1,2,142, etc?

> total progress size: 674 MB
> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
>
> The khugepaged_scan list save all task that support collapse into hugepage,
> as long as the task is not destroyed, khugepaged will not remove it from
> the khugepaged_scan list. This exist a phenomenon where task has already
> collapsed all memory regions into hugepage, but khugepaged continues to
> scan it, which wastes CPU time and invalid, and due to
> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> scanning a large number of invalid task, so scanning really valid task
> is later.
>
> After applying this patch, when the memory is either SCAN_PMD_MAPPED or
> SCAN_NO_PTE_TABLE, just skip it, as follow:
>
> @scan_pmd_status[6]: 2
> @scan_pmd_status[3]: 147
> @scan_pmd_status[2]: 173
> total progress size: 45 MB
> Total time         : 20 seconds
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>  include/linux/xarray.h |  9 ++++++++
>  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
>  2 files changed, 47 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index be850174e802..f77d97d7b957 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
>  	xas->xa_node = XAS_RESTART;
>  }
>  
> +/**
> + * xas_get_index() - Get XArray operation state for a different index.
> + * @xas: XArray operation state.
> + */
> +static inline unsigned long xas_get_index(struct xa_state *xas)
> +{
> +	return xas->xa_index;
> +}
> +
>  /**
>   * xas_advance() - Skip over sibling entries.
>   * @xas: XArray operation state.
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6f0f05148765..de95029e3763 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -68,7 +68,10 @@ enum scan_result {
>  static struct task_struct *khugepaged_thread __read_mostly;
>  static DEFINE_MUTEX(khugepaged_mutex);
>  
> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> +/*
> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> + * every 10 second.
> + */
>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>  static unsigned int khugepaged_pages_collapsed;
>  static unsigned int khugepaged_full_scans;
> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  }
>  
>  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> +		struct vm_area_struct *vma, unsigned long start_addr,
> +		bool *mmap_locked, unsigned int *cur_progress,
>  		struct collapse_control *cc)
>  {
>  	pmd_t *pmd;
> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>  
>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>  
> +	if (cur_progress)
> +		*cur_progress += 1;

Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
idea as to why on function entry we do a +1 right away. Please do add some comments too.

> +
>  	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out;
> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>  		result = SCAN_SUCCEED;
>  	}
>  out_unmap:
> +	if (cur_progress) {
> +		if (_pte >= pte + HPAGE_PMD_NR)
> +			*cur_progress += HPAGE_PMD_NR - 1;
> +		else
> +			*cur_progress += _pte - pte;
> +	}
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, start_addr, referenced,
> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  	return result;
>  }
>  
> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> -		struct file *file, pgoff_t start, struct collapse_control *cc)
> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> +		unsigned long addr, struct file *file, pgoff_t start,
> +		unsigned int *cur_progress, struct collapse_control *cc)
>  {
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>  			cond_resched_rcu();
>  		}
>  	}
> +	if (cur_progress) {
> +		unsigned long idx = xas_get_index(&xas) - start;
> +
> +		if (folio == NULL)
> +			*cur_progress += HPAGE_PMD_NR;

I think this whole block needs some comments. Can you explain, why you
do a particular increment in each case?

> +		else if (xa_is_value(folio))
> +			*cur_progress += idx + (1 << xas_get_order(&xas));
> +		else if (folio_order(folio) == HPAGE_PMD_ORDER)
> +			*cur_progress += idx + 1;
> +		else
> +			*cur_progress += idx + folio_nr_pages(folio);
> +	}
>  	rcu_read_unlock();
>  
>  	if (result == SCAN_SUCCEED) {
> @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  
>  		while (khugepaged_scan.address < hend) {
>  			bool mmap_locked = true;
> +			unsigned int cur_progress = 0;
>  
>  			cond_resched();
>  			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  				mmap_read_unlock(mm);
>  				mmap_locked = false;
>  				*result = hpage_collapse_scan_file(mm,
> -					khugepaged_scan.address, file, pgoff, cc);
> +					khugepaged_scan.address, file, pgoff,
> +					&cur_progress, cc);
>  				fput(file);
>  				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>  					mmap_read_lock(mm);
> @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  				}
>  			} else {
>  				*result = hpage_collapse_scan_pmd(mm, vma,
> -					khugepaged_scan.address, &mmap_locked, cc);
> +					khugepaged_scan.address, &mmap_locked,
> +					&cur_progress, cc);
>  			}
>  
>  			if (*result == SCAN_SUCCEED)
> @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> -			progress += HPAGE_PMD_NR;
> +			progress += cur_progress;
>  			if (!mmap_locked)
>  				/*
>  				 * We released mmap_lock so break loop.  Note
> @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  			mmap_locked = false;
>  			*lock_dropped = true;
>  			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> -							  cc);
> +							  NULL, cc);
>  
>  			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>  			    mapping_can_writeback(file->f_mapping)) {
> @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  			fput(file);
>  		} else {
>  			result = hpage_collapse_scan_pmd(mm, vma, addr,
> -							 &mmap_locked, cc);
> +							 &mmap_locked, NULL, cc);
>  		}
>  		if (!mmap_locked)
>  			*lock_dropped = true;

next prev parent reply	other threads:[~2026-01-28  8:29 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-23  8:22 [PATCH mm-new v5 0/5] Improve khugepaged scan logic Vernon Yang
2026-01-23  8:22 ` [PATCH mm-new v5 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2026-01-23 10:25   ` Dev Jain
2026-01-23  8:22 ` [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number Vernon Yang
2026-01-23 10:46   ` Dev Jain
2026-01-23 15:25     ` Vernon Yang
2026-01-23 15:19   ` Matthew Wilcox
2026-01-23 15:29     ` Vernon Yang
2026-01-28  8:29   ` Dev Jain [this message]
2026-01-28 14:34     ` Vernon Yang
2026-01-29  5:35       ` Dev Jain
2026-01-29  7:59         ` Vernon Yang
2026-01-29  8:32           ` Dev Jain
2026-01-29 12:24             ` Vernon Yang
2026-01-29 12:46               ` Dev Jain
2026-01-29  9:18         ` Lance Yang
2026-01-29 12:28           ` Vernon Yang
2026-01-23  8:22 ` [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper Vernon Yang
2026-01-23 10:54   ` Dev Jain
2026-01-26  1:52   ` Barry Song
2026-01-23  8:22 ` [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
2026-01-23  9:09   ` Lance Yang
2026-01-23 15:08     ` Vernon Yang
2026-01-23 16:32       ` Lance Yang
2026-01-24  3:22         ` Vernon Yang
2026-01-24  6:48           ` Dev Jain
2026-01-26  2:06             ` Barry Song
2026-01-23  8:22 ` [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
2026-01-23 12:40   ` Dev Jain
2026-01-23 15:32     ` Vernon Yang
2026-01-26  2:18     ` Barry Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=72dfd11f-dca4-447e-89c5-7a87cb675bda@arm.com \
    --to=dev.jain@arm.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=david@kernel.org \
    --cc=lance.yang@linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=vernon2gm@gmail.com \
    --cc=yanglincheng@kylinos.cn \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.