From: "Zi Yan" <ziy@nvidia.com>
To: "Kefeng Wang" <wangkefeng.wang@huawei.com>,
"Andrew Morton" <akpm@linux-foundation.org>
Cc: "David Hildenbrand" <david@kernel.org>, "Zi Yan" <ziy@nvidia.com>,
"Liam R. Howlett" <liam@infradead.org>,
"Lorenzo Stoakes" <ljs@kernel.org>,
"Vlastimil Babka" <vbabka@kernel.org>,
"Suren Baghdasaryan" <surenb@google.com>, <linux-mm@kvack.org>
Subject: Re: [PATCH 1/4] mm: mincore: try per-VMA lock firstly and use walk_page_range_vma()
Date: Wed, 17 Jun 2026 10:54:05 -0400
Message-ID: <DJBESGC086ZC.3JSQ5GD3ALCY6@nvidia.com>
In-Reply-To: <20260617082622.3397584-2-wangkefeng.wang@huawei.com>

On Wed Jun 17, 2026 at 4:26 AM EDT, Kefeng Wang wrote:
> The mincore syscall currently takes mmap lock for the entire
> duration of the VMA lookup and page table walk. This creates
> a global contention point with page faults and other mmap_lock
> holders in multi-threaded applications.
>
> mincore is a read-only operation that only queries page
> residency from a single VMA, making it an ideal candidate for
> per-VMA locking. Try the per-VMA lock first and use
> walk_page_range_vma() in do_mincore() to eliminate an unnecessary
> find_vma() lookup.
>
> Unlike walk_page_range(), walk_page_range_vma() does not call
> walk_page_test(), which handles VM_PFNMAP by invoking ->pte_hole()
> to skip the page table walk. Without this check, PFNMAP PTEs
> would be treated as present by mincore_pte_range(), changing
> the returned residency status. Handle VM_PFNMAP explicitly in
> do_mincore() to preserve the original behavior.
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> mm/mincore.c | 71 +++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 53 insertions(+), 18 deletions(-)
>
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 296f2e3922b5..a786a073feab 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -12,6 +12,7 @@
> #include <linux/gfp.h>
> #include <linux/pagewalk.h>
> #include <linux/mman.h>
> +#include <linux/mmap_lock.h>
> #include <linux/syscalls.h>
> #include <linux/swap.h>
> #include <linux/leafops.h>
> @@ -232,34 +233,47 @@ static inline bool can_do_mincore(struct vm_area_struct *vma)
> file_permission(vma->vm_file, MAY_WRITE) == 0;
> }
>
> -static const struct mm_walk_ops mincore_walk_ops = {
> - .pmd_entry = mincore_pte_range,
> - .pte_hole = mincore_unmapped_range,
> - .hugetlb_entry = mincore_hugetlb,
> - .walk_lock = PGWALK_RDLOCK,
> -};
> -
> /*
> * Do a chunk of "sys_mincore()". We've already checked
> - * all the arguments, we hold the mmap semaphore: we should
> - * just return the amount of info we're asked for.
> + * all the arguments, we should just return the amount of
> + * info we're asked for. The vma is already looked up and
> + * locked; vma_locked indicates whether the per-VMA lock
> + * or mmap_read_lock is held.
> */
> -static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
> +static long do_mincore(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pages, unsigned char *vec, bool vma_locked)

vma_locked confuses me: I read vma_locked == false as "the VMA is not
locked", but it actually means mmap_lock is held instead. That said, I am
not sure an enum vma_lock_state {VMA_LOCKED, MM_LOCKED} is needed here.

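For illustration only, the enum alternative mentioned above could be sketched
as follows. This is a userspace stand-in, not kernel code: enum page_walk_lock
is redeclared locally with just the two values the patch uses, and
mincore_walk_lock() is a hypothetical helper that does not exist in
mm/mincore.c.

```c
#include <assert.h>

/* Local stand-in for the two kernel walk_lock values used by the patch. */
enum page_walk_lock {
	PGWALK_RDLOCK,
	PGWALK_VMA_RDLOCK_VERIFY,
};

/* Hypothetical: names the lock that the caller of do_mincore() holds. */
enum vma_lock_state {
	VMA_LOCKED,	/* per-VMA read lock is held */
	MM_LOCKED,	/* mmap_lock is held for read */
};

/* Hypothetical helper: pick the walk_lock mode from the lock state. */
static enum page_walk_lock mincore_walk_lock(enum vma_lock_state state)
{
	switch (state) {
	case VMA_LOCKED:
		return PGWALK_VMA_RDLOCK_VERIFY;
	case MM_LOCKED:
		return PGWALK_RDLOCK;
	}
	/* Not reached for valid enum values. */
	assert(!"invalid vma_lock_state");
	return PGWALK_RDLOCK;
}
```

With this shape, a call site names the lock it holds (MM_LOCKED) instead of
passing vma_locked == false, at the cost of one more type.
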
> {
> - struct vm_area_struct *vma;
> unsigned long end;
> int err;
> + struct mm_walk_ops mincore_walk_ops = {
> + .pmd_entry = mincore_pte_range,
> + .pte_hole = mincore_unmapped_range,
> + .hugetlb_entry = mincore_hugetlb,
> + .walk_lock = vma_locked ?
> + PGWALK_VMA_RDLOCK_VERIFY : PGWALK_RDLOCK,

An unrelated comment about PGWALK_RDLOCK: maybe PGWALK_MM_RDLOCK_VERIFY
would be a better name, since the code just verifies mmap_lock, unlike
PGWALK_WRLOCK, which requires vma_start_write(). Likewise,
PGWALK_WRLOCK_VERIFY might be better named PGWALK_VMA_WRLOCK_VERIFY.

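To make the naming idea concrete, here is a rough sketch of how the modes
would read after such a rename. It is a userspace stand-in under stated
assumptions: enum page_walk_lock_proposed and walk_lock_only_verifies() are
hypothetical, the two renamed values are only the suggestions above, and the
per-mode comments restate this thread rather than the kernel header.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical renamed modes; not the definition in linux/pagewalk.h. */
enum page_walk_lock_proposed {
	PGWALK_MM_RDLOCK_VERIFY,	/* today PGWALK_RDLOCK: mmap_lock is only verified */
	PGWALK_WRLOCK,			/* unchanged: requires vma_start_write() */
	PGWALK_VMA_WRLOCK_VERIFY,	/* today PGWALK_WRLOCK_VERIFY */
	PGWALK_VMA_RDLOCK_VERIFY,	/* unchanged: used by the patch with the per-VMA lock */
};

/* Hypothetical helper: true when the mode only checks an already-held lock. */
static bool walk_lock_only_verifies(enum page_walk_lock_proposed mode)
{
	switch (mode) {
	case PGWALK_MM_RDLOCK_VERIFY:
	case PGWALK_VMA_WRLOCK_VERIFY:
	case PGWALK_VMA_RDLOCK_VERIFY:
		return true;
	case PGWALK_WRLOCK:
		return false;
	}
	/* Not reached for valid enum values. */
	assert(!"invalid walk lock mode");
	return false;
}
```

The point of the rename is that every mode ending in _VERIFY only checks a
lock, and the MM/VMA infix says which one.
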

Otherwise, LGTM.

Acked-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi