From: "Zi Yan" <ziy@nvidia.com>
To: "Kefeng Wang" <wangkefeng.wang@huawei.com>,
	"Andrew Morton" <akpm@linux-foundation.org>
Cc: "David Hildenbrand" <david@kernel.org>, "Zi Yan" <ziy@nvidia.com>,
	"Liam R. Howlett" <liam@infradead.org>,
	"Lorenzo Stoakes" <ljs@kernel.org>,
	"Vlastimil Babka" <vbabka@kernel.org>,
	"Suren Baghdasaryan" <surenb@google.com>, <linux-mm@kvack.org>
Subject: Re: [PATCH 1/4] mm: mincore: try per-VMA lock firstly and use walk_page_range_vma()
Date: Wed, 17 Jun 2026 10:54:05 -0400
Message-ID: <DJBESGC086ZC.3JSQ5GD3ALCY6@nvidia.com>
In-Reply-To: <20260617082622.3397584-2-wangkefeng.wang@huawei.com>

On Wed Jun 17, 2026 at 4:26 AM EDT, Kefeng Wang wrote:
> The mincore syscall currently takes mmap lock for the entire
> duration of the VMA lookup and page table walk. This creates
> a global contention point with page faults and other mmap_lock
> holders in multi-threaded applications.
>
> mincore is a read-only operation that only queries page
> residency from a single VMA, making it an ideal candidate for
> per-VMA locking. Try the per-VMA lock first and use
> walk_page_range_vma() in do_mincore() to eliminate an unnecessary
> find_vma() lookup.
>
> Unlike walk_page_range(), walk_page_range_vma() does not call
> walk_page_test(), which handles VM_PFNMAP by invoking ->pte_hole()
> to skip the page table walk. Without this check, PFNMAP PTEs
> would be treated as present by mincore_pte_range(), changing
> the returned residency status. Handle VM_PFNMAP explicitly in
> do_mincore() to preserve the original behavior.
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  mm/mincore.c | 71 +++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 53 insertions(+), 18 deletions(-)
>
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 296f2e3922b5..a786a073feab 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -12,6 +12,7 @@
>  #include <linux/gfp.h>
>  #include <linux/pagewalk.h>
>  #include <linux/mman.h>
> +#include <linux/mmap_lock.h>
>  #include <linux/syscalls.h>
>  #include <linux/swap.h>
>  #include <linux/leafops.h>
> @@ -232,34 +233,47 @@ static inline bool can_do_mincore(struct vm_area_struct *vma)
>  	       file_permission(vma->vm_file, MAY_WRITE) == 0;
>  }
>  
> -static const struct mm_walk_ops mincore_walk_ops = {
> -	.pmd_entry		= mincore_pte_range,
> -	.pte_hole		= mincore_unmapped_range,
> -	.hugetlb_entry		= mincore_hugetlb,
> -	.walk_lock		= PGWALK_RDLOCK,
> -};
> -
>  /*
>   * Do a chunk of "sys_mincore()". We've already checked
> - * all the arguments, we hold the mmap semaphore: we should
> - * just return the amount of info we're asked for.
> + * all the arguments, we should just return the amount of
> + * info we're asked for. The vma is already looked up and
> + * locked; vma_locked indicates whether the per-VMA lock
> + * or mmap_read_lock is held.
>   */
> -static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
> +static long do_mincore(struct vm_area_struct *vma, unsigned long addr,
> +		unsigned long pages, unsigned char *vec, bool vma_locked)

vma_locked confuses me: I read vma_locked == false as "the VMA is not
locked", but it actually means mmap_lock is held instead. That said, I
am not sure an enum vma_lock_state {VMA_LOCKED, MM_LOCKED} is needed
here.
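
For illustration only, here is a rough user-space sketch of what such an
enum could look like. The names vma_lock_state, VMA_LOCKED and MM_LOCKED
come from the comment above; the SKETCH_* constants and walk_lock_for()
are hypothetical stand-ins, not the kernel's definitions, and nothing
here is meant as the required form of the change.

```c
/* Hypothetical: names taken from the review comment, not from the tree. */
enum vma_lock_state {
	VMA_LOCKED,	/* per-VMA read lock is held */
	MM_LOCKED,	/* mmap_lock is held for read instead */
};

/* Stand-ins for the two walk_lock values the patch chooses between. */
enum sketch_walk_lock {
	SKETCH_PGWALK_RDLOCK,
	SKETCH_PGWALK_VMA_RDLOCK_VERIFY,
};

/* Pick the walk_lock value that matches the lock the caller holds. */
static enum sketch_walk_lock walk_lock_for(enum vma_lock_state state)
{
	return state == VMA_LOCKED ? SKETCH_PGWALK_VMA_RDLOCK_VERIFY
				   : SKETCH_PGWALK_RDLOCK;
}
```

With this shape a call site would read walk_lock_for(MM_LOCKED) rather
than passing a bare false, which is the readability point being raised.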

>  {
> -	struct vm_area_struct *vma;
>  	unsigned long end;
>  	int err;
> +	struct mm_walk_ops mincore_walk_ops = {
> +		.pmd_entry		= mincore_pte_range,
> +		.pte_hole		= mincore_unmapped_range,
> +		.hugetlb_entry		= mincore_hugetlb,
> +		.walk_lock		= vma_locked ?
> +				  PGWALK_VMA_RDLOCK_VERIFY : PGWALK_RDLOCK,

An unrelated comment about PGWALK_RDLOCK: PGWALK_MM_RDLOCK_VERIFY might
be a better name, since the code only verifies that mmap_lock is held,
unlike PGWALK_WRLOCK, which requires vma_start_write(). Likewise,
PGWALK_WRLOCK_VERIFY might be better named PGWALK_VMA_WRLOCK_VERIFY.
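
To make the naming suggestion concrete, a small user-space sketch of the
enum with the proposed names follows. The type name
page_walk_lock_proposed and the helper walk_takes_lock() are
hypothetical and exist only for this illustration; the comments record
which current name each proposed name would replace.

```c
#include <stdbool.h>

/* Sketch only: proposed names, with today's names noted alongside. */
enum page_walk_lock_proposed {
	PGWALK_MM_RDLOCK_VERIFY,	/* today: PGWALK_RDLOCK */
	PGWALK_WRLOCK,			/* unchanged */
	PGWALK_VMA_WRLOCK_VERIFY,	/* today: PGWALK_WRLOCK_VERIFY */
	PGWALK_VMA_RDLOCK_VERIFY,	/* unchanged */
};

/*
 * Hypothetical helper: true when the walker itself takes a lock
 * (PGWALK_WRLOCK write-locks the VMA), false when it only checks that
 * the caller already holds the expected lock.
 */
static bool walk_takes_lock(enum page_walk_lock_proposed lock)
{
	return lock == PGWALK_WRLOCK;
}
```

Under this naming, every value that only asserts an existing lock ends
in _VERIFY and states whether the mm or the VMA lock is meant.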

Otherwise, LGTM.

Acked-by: Zi Yan <ziy@nvidia.com>


-- 
Best Regards,
Yan, Zi



