From: "Zi Yan" <ziy@nvidia.com>
To: "Kefeng Wang" <wangkefeng.wang@huawei.com>,
"Zi Yan" <ziy@nvidia.com>,
"Andrew Morton" <akpm@linux-foundation.org>
Cc: "David Hildenbrand" <david@kernel.org>,
"Liam R. Howlett" <liam@infradead.org>,
"Lorenzo Stoakes" <ljs@kernel.org>,
"Vlastimil Babka" <vbabka@kernel.org>,
"Suren Baghdasaryan" <surenb@google.com>, <linux-mm@kvack.org>
Subject: Re: [PATCH 1/4] mm: mincore: try per-VMA lock firstly and use walk_page_range_vma()
Date: Wed, 17 Jun 2026 21:20:39 -0400 [thread overview]
Message-ID: <DJBS471022MX.L7093MILIORX@nvidia.com> (raw)
In-Reply-To: <065b1900-1a4e-4d92-b588-002b844f1896@huawei.com>
On Wed Jun 17, 2026 at 9:11 PM EDT, Kefeng Wang wrote:
>
>
> On 6/17/2026 10:54 PM, Zi Yan wrote:
>> On Wed Jun 17, 2026 at 4:26 AM EDT, Kefeng Wang wrote:
>>> The mincore syscall currently takes mmap lock for the entire
>>> duration of the VMA lookup and page table walk. This creates
>>> a global contention point with page faults and other mmap_lock
>>> holders in multi-threaded applications.
>>>
>>> The mincore is a read-only operation that only queries page
>>> residency from a single VMA, making it an ideal candidate for
>>> per-VMA locking, so try per-vma lock firstly and use the
>>> walk_page_range_vma() in do_mincore() to eliminates an unnecessary
>>> find_vma() lookup.
>>>
>>> Unlike walk_page_range(), walk_page_range_vma() does not call
>>> walk_page_test(), which handles VM_PFNMAP by invoking ->pte_hole()
>>> to skip the page table walk. Without this check, PFNMAP PTEs
>>> would be treated as present by mincore_pte_range(), changing
>>> the returned residency status. Handle VM_PFNMAP explicitly in
>>> do_mincore() to preserve the original behavior.
>>>
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>> mm/mincore.c | 71 +++++++++++++++++++++++++++++++++++++++-------------
>>> 1 file changed, 53 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>> index 296f2e3922b5..a786a073feab 100644
>>> --- a/mm/mincore.c
>>> +++ b/mm/mincore.c
>>> @@ -12,6 +12,7 @@
>>> #include <linux/gfp.h>
>>> #include <linux/pagewalk.h>
>>> #include <linux/mman.h>
>>> +#include <linux/mmap_lock.h>
>>> #include <linux/syscalls.h>
>>> #include <linux/swap.h>
>>> #include <linux/leafops.h>
>>> @@ -232,34 +233,47 @@ static inline bool can_do_mincore(struct vm_area_struct *vma)
>>> file_permission(vma->vm_file, MAY_WRITE) == 0;
>>> }
>>>
>>> -static const struct mm_walk_ops mincore_walk_ops = {
>>> - .pmd_entry = mincore_pte_range,
>>> - .pte_hole = mincore_unmapped_range,
>>> - .hugetlb_entry = mincore_hugetlb,
>>> - .walk_lock = PGWALK_RDLOCK,
>>> -};
>>> -
>>> /*
>>> * Do a chunk of "sys_mincore()". We've already checked
>>> - * all the arguments, we hold the mmap semaphore: we should
>>> - * just return the amount of info we're asked for.
>>> + * all the arguments, we should just return the amount of
>>> + * info we're asked for. The vma is already looked up and
>>> + * locked; vma_locked indicates whether the per-VMA lock
>>> + * or mmap_read_lock is held.
>>> */
>>> -static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
>>> +static long do_mincore(struct vm_area_struct *vma, unsigned long addr,
>>> + unsigned long pages, unsigned char *vec, bool vma_locked)
>>
>> vma_locked is confusing me, since I thought vma_locked == false means
>> vma is not locked, but it actually means mmap_lock is taken instead. But
>> I am not sure an enum vma_lock_state {VMA_LOCKED, MM_LOCKED} is needed
>> here.
>>
>
> Maybe I could directly pass the enum page_walk_lock here?
Sounds good to me.
>
>>> {
>>> - struct vm_area_struct *vma;
>>> unsigned long end;
>>> int err;
>>> + struct mm_walk_ops mincore_walk_ops = {
>>> + .pmd_entry = mincore_pte_range,
>>> + .pte_hole = mincore_unmapped_range,
>>> + .hugetlb_entry = mincore_hugetlb,
>>> + .walk_lock = vma_locked ?
>>> + PGWALK_VMA_RDLOCK_VERIFY : PGWALK_RDLOCK,
>>
>> An unrelated comment about PGWALK_RDLOCK. Maybe PGWALK_MM_RDLOCK_VERIFY
>> is a better name since the code just verifies mmap_lock, unlike
>> PGWALK_WRLOCK, which requires vma_start_write(). PGWALK_WRLOCK_VERIFY
>> might be better named as PGWALK_VMA_WRLOCK_VERIFY.
>
> Agree, I've thought about it too, but we can do it later.
Sure.
>
>>
>> Otherwise, LGTM.
>>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>>
>>
> Thanks.
--
Best Regards,
Yan, Zi
next prev parent reply other threads:[~2026-06-18 1:20 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-17 8:26 [PATCH 0/4] mm: convert to walk_page_range_vma() to eliminate find_vma() Kefeng Wang
2026-06-17 8:26 ` [PATCH 1/4] mm: mincore: try per-VMA lock firstly and use walk_page_range_vma() Kefeng Wang
2026-06-17 14:54 ` Zi Yan
2026-06-18 1:11 ` Kefeng Wang
2026-06-18 1:20 ` Zi Yan [this message]
2026-06-17 15:01 ` David Hildenbrand (Arm)
2026-06-18 1:41 ` Kefeng Wang
2026-06-18 8:15 ` David Hildenbrand (Arm)
2026-06-17 8:26 ` [PATCH 2/4] mm: mprotect: use walk_page_range_vma() in mprotect_fixup() Kefeng Wang
2026-06-17 13:25 ` David Hildenbrand (Arm)
2026-06-18 1:42 ` Kefeng Wang
2026-06-17 14:28 ` Zi Yan
2026-06-18 1:43 ` Kefeng Wang
2026-06-18 8:17 ` David Hildenbrand (Arm)
2026-06-17 8:26 ` [PATCH 3/4] mm: mlock: use walk_page_range_vma() in mlock_vma_pages_range() Kefeng Wang
2026-06-17 13:26 ` David Hildenbrand (Arm)
2026-06-17 14:35 ` Zi Yan
2026-06-18 8:16 ` David Hildenbrand (Arm)
2026-06-18 8:27 ` Kefeng Wang
2026-06-17 8:26 ` [PATCH 4/4] mm: migrate_device: use walk_page_range_vma() in migrate_vma_collect() Kefeng Wang
2026-06-17 13:30 ` David Hildenbrand (Arm)
2026-06-18 1:46 ` Kefeng Wang
2026-06-17 14:37 ` Zi Yan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DJBS471022MX.L7093MILIORX@nvidia.com \
--to=ziy@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=david@kernel.org \
--cc=liam@infradead.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
--cc=wangkefeng.wang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox