Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Zi Yan" <ziy@nvidia.com>
To: "Kefeng Wang" <wangkefeng.wang@huawei.com>,
	"Andrew Morton" <akpm@linux-foundation.org>
Cc: "David Hildenbrand" <david@kernel.org>, "Zi Yan" <ziy@nvidia.com>,
	"Liam R. Howlett" <liam@infradead.org>,
	"Lorenzo Stoakes" <ljs@kernel.org>,
	"Vlastimil Babka" <vbabka@kernel.org>,
	"Suren Baghdasaryan" <surenb@google.com>, <linux-mm@kvack.org>
Subject: Re: [PATCH 1/4] mm: mincore: try per-VMA lock firstly and use walk_page_range_vma()
Date: Wed, 17 Jun 2026 10:54:05 -0400	[thread overview]
Message-ID: <DJBESGC086ZC.3JSQ5GD3ALCY6@nvidia.com> (raw)
In-Reply-To: <20260617082622.3397584-2-wangkefeng.wang@huawei.com>

On Wed Jun 17, 2026 at 4:26 AM EDT, Kefeng Wang wrote:
> The mincore syscall currently takes mmap lock for the entire
> duration of the VMA lookup and page table walk. This creates
> a global contention point with page faults and other mmap_lock
> holders in multi-threaded applications.
>
> The mincore is a read-only operation that only queries page
> residency from a single VMA, making it an ideal candidate for
> per-VMA locking, so try per-vma lock firstly and use the
> walk_page_range_vma() in do_mincore() to eliminates an unnecessary
> find_vma() lookup.
>
> Unlike walk_page_range(), walk_page_range_vma() does not call
> walk_page_test(), which handles VM_PFNMAP by invoking ->pte_hole()
> to skip the page table walk. Without this check, PFNMAP PTEs
> would be treated as present by mincore_pte_range(), changing
> the returned residency status. Handle VM_PFNMAP explicitly in
> do_mincore() to preserve the original behavior.
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  mm/mincore.c | 71 +++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 53 insertions(+), 18 deletions(-)
>
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 296f2e3922b5..a786a073feab 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -12,6 +12,7 @@
>  #include <linux/gfp.h>
>  #include <linux/pagewalk.h>
>  #include <linux/mman.h>
> +#include <linux/mmap_lock.h>
>  #include <linux/syscalls.h>
>  #include <linux/swap.h>
>  #include <linux/leafops.h>
> @@ -232,34 +233,47 @@ static inline bool can_do_mincore(struct vm_area_struct *vma)
>  	       file_permission(vma->vm_file, MAY_WRITE) == 0;
>  }
>  
> -static const struct mm_walk_ops mincore_walk_ops = {
> -	.pmd_entry		= mincore_pte_range,
> -	.pte_hole		= mincore_unmapped_range,
> -	.hugetlb_entry		= mincore_hugetlb,
> -	.walk_lock		= PGWALK_RDLOCK,
> -};
> -
>  /*
>   * Do a chunk of "sys_mincore()". We've already checked
> - * all the arguments, we hold the mmap semaphore: we should
> - * just return the amount of info we're asked for.
> + * all the arguments, we should just return the amount of
> + * info we're asked for. The vma is already looked up and
> + * locked; vma_locked indicates whether the per-VMA lock
> + * or mmap_read_lock is held.
>   */
> -static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
> +static long do_mincore(struct vm_area_struct *vma, unsigned long addr,
> +		unsigned long pages, unsigned char *vec, bool vma_locked)

vma_locked is confusing me, since I thought vma_locked == false means
vma is not locked, but it actually means mmap_lock is taken instead. But
I am not sure an enum vma_lock_state {VMA_LOCKED, MM_LOCKED} is needed
here.

>  {
> -	struct vm_area_struct *vma;
>  	unsigned long end;
>  	int err;
> +	struct mm_walk_ops mincore_walk_ops = {
> +		.pmd_entry		= mincore_pte_range,
> +		.pte_hole		= mincore_unmapped_range,
> +		.hugetlb_entry		= mincore_hugetlb,
> +		.walk_lock		= vma_locked ?
> +				  PGWALK_VMA_RDLOCK_VERIFY : PGWALK_RDLOCK,

An unrelated comment about PGWALK_RDLOCK. Maybe PGWALK_MM_RDLOCK_VERIFY
is a better name since the code just verifies mmap_lock, unlike
PGWALK_WRLOCK, which requires vma_start_write(). PGWALK_WRLOCK_VERIFY
might be better named as PGWALK_VMA_WRLOCK_VERIFY.

Otherwise, LGTM.

Acked-by: Zi Yan <ziy@nvidia.com>


-- 
Best Regards,
Yan, Zi



  reply	other threads:[~2026-06-17 14:54 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-17  8:26 [PATCH 0/4] mm: convert to walk_page_range_vma() to eliminate find_vma() Kefeng Wang
2026-06-17  8:26 ` [PATCH 1/4] mm: mincore: try per-VMA lock firstly and use walk_page_range_vma() Kefeng Wang
2026-06-17 14:54   ` Zi Yan [this message]
2026-06-17 15:01   ` David Hildenbrand (Arm)
2026-06-17  8:26 ` [PATCH 2/4] mm: mprotect: use walk_page_range_vma() in mprotect_fixup() Kefeng Wang
2026-06-17 13:25   ` David Hildenbrand (Arm)
2026-06-17 14:28   ` Zi Yan
2026-06-17  8:26 ` [PATCH 3/4] mm: mlock: use walk_page_range_vma() in mlock_vma_pages_range() Kefeng Wang
2026-06-17 13:26   ` David Hildenbrand (Arm)
2026-06-17 14:35   ` Zi Yan
2026-06-17  8:26 ` [PATCH 4/4] mm: migrate_device: use walk_page_range_vma() in migrate_vma_collect() Kefeng Wang
2026-06-17 13:30   ` David Hildenbrand (Arm)
2026-06-17 14:37   ` Zi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DJBESGC086ZC.3JSQ5GD3ALCY6@nvidia.com \
    --to=ziy@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=liam@infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=wangkefeng.wang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox