Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Zi Yan" <ziy@nvidia.com>
To: "Kefeng Wang" <wangkefeng.wang@huawei.com>,
	"Zi Yan" <ziy@nvidia.com>,
	"Andrew Morton" <akpm@linux-foundation.org>
Cc: "David Hildenbrand" <david@kernel.org>,
	"Liam R. Howlett" <liam@infradead.org>,
	"Lorenzo Stoakes" <ljs@kernel.org>,
	"Vlastimil Babka" <vbabka@kernel.org>,
	"Suren Baghdasaryan" <surenb@google.com>, <linux-mm@kvack.org>
Subject: Re: [PATCH 1/4] mm: mincore: try per-VMA lock firstly and use walk_page_range_vma()
Date: Wed, 17 Jun 2026 21:20:39 -0400	[thread overview]
Message-ID: <DJBS471022MX.L7093MILIORX@nvidia.com> (raw)
In-Reply-To: <065b1900-1a4e-4d92-b588-002b844f1896@huawei.com>

On Wed Jun 17, 2026 at 9:11 PM EDT, Kefeng Wang wrote:
>
>
> On 6/17/2026 10:54 PM, Zi Yan wrote:
>> On Wed Jun 17, 2026 at 4:26 AM EDT, Kefeng Wang wrote:
>>> The mincore syscall currently takes mmap lock for the entire
>>> duration of the VMA lookup and page table walk. This creates
>>> a global contention point with page faults and other mmap_lock
>>> holders in multi-threaded applications.
>>>
>>> The mincore is a read-only operation that only queries page
>>> residency from a single VMA, making it an ideal candidate for
>>> per-VMA locking, so try per-vma lock firstly and use the
>>> walk_page_range_vma() in do_mincore() to eliminates an unnecessary
>>> find_vma() lookup.
>>>
>>> Unlike walk_page_range(), walk_page_range_vma() does not call
>>> walk_page_test(), which handles VM_PFNMAP by invoking ->pte_hole()
>>> to skip the page table walk. Without this check, PFNMAP PTEs
>>> would be treated as present by mincore_pte_range(), changing
>>> the returned residency status. Handle VM_PFNMAP explicitly in
>>> do_mincore() to preserve the original behavior.
>>>
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>>   mm/mincore.c | 71 +++++++++++++++++++++++++++++++++++++++-------------
>>>   1 file changed, 53 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/mm/mincore.c b/mm/mincore.c
>>> index 296f2e3922b5..a786a073feab 100644
>>> --- a/mm/mincore.c
>>> +++ b/mm/mincore.c
>>> @@ -12,6 +12,7 @@
>>>   #include <linux/gfp.h>
>>>   #include <linux/pagewalk.h>
>>>   #include <linux/mman.h>
>>> +#include <linux/mmap_lock.h>
>>>   #include <linux/syscalls.h>
>>>   #include <linux/swap.h>
>>>   #include <linux/leafops.h>
>>> @@ -232,34 +233,47 @@ static inline bool can_do_mincore(struct vm_area_struct *vma)
>>>   	       file_permission(vma->vm_file, MAY_WRITE) == 0;
>>>   }
>>>   
>>> -static const struct mm_walk_ops mincore_walk_ops = {
>>> -	.pmd_entry		= mincore_pte_range,
>>> -	.pte_hole		= mincore_unmapped_range,
>>> -	.hugetlb_entry		= mincore_hugetlb,
>>> -	.walk_lock		= PGWALK_RDLOCK,
>>> -};
>>> -
>>>   /*
>>>    * Do a chunk of "sys_mincore()". We've already checked
>>> - * all the arguments, we hold the mmap semaphore: we should
>>> - * just return the amount of info we're asked for.
>>> + * all the arguments, we should just return the amount of
>>> + * info we're asked for. The vma is already looked up and
>>> + * locked; vma_locked indicates whether the per-VMA lock
>>> + * or mmap_read_lock is held.
>>>    */
>>> -static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *vec)
>>> +static long do_mincore(struct vm_area_struct *vma, unsigned long addr,
>>> +		unsigned long pages, unsigned char *vec, bool vma_locked)
>> 
>> vma_locked is confusing me, since I thought vma_locked == false means
>> vma is not locked, but it actually means mmap_lock is taken instead. But
>> I am not sure an enum vma_lock_state {VMA_LOCKED, MM_LOCKED} is needed
>> here.
>> 
>
> Maybe I could directly pass the enum page_walk_lock here?

Sounds good to me.
>
>>>   {
>>> -	struct vm_area_struct *vma;
>>>   	unsigned long end;
>>>   	int err;
>>> +	struct mm_walk_ops mincore_walk_ops = {
>>> +		.pmd_entry		= mincore_pte_range,
>>> +		.pte_hole		= mincore_unmapped_range,
>>> +		.hugetlb_entry		= mincore_hugetlb,
>>> +		.walk_lock		= vma_locked ?
>>> +				  PGWALK_VMA_RDLOCK_VERIFY : PGWALK_RDLOCK,
>> 
>> An unrelated comment about PGWALK_RDLOCK. Maybe PGWALK_MM_RDLOCK_VERIFY
>> is a better name since the code just verifies mmap_lock, unlike
>> PGWALK_WRLOCK, which requires vma_start_write(). PGWALK_WRLOCK_VERIFY
>> might be better named as PGWALK_VMA_WRLOCK_VERIFY.
>
> Agree, I've thought about it too, but we can do it later.

Sure.

>
>> 
>> Otherwise, LGTM.
>> 
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> 
>> 
> Thanks.

-- 
Best Regards,
Yan, Zi



  reply	other threads:[~2026-06-18  1:20 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-17  8:26 [PATCH 0/4] mm: convert to walk_page_range_vma() to eliminate find_vma() Kefeng Wang
2026-06-17  8:26 ` [PATCH 1/4] mm: mincore: try per-VMA lock firstly and use walk_page_range_vma() Kefeng Wang
2026-06-17 14:54   ` Zi Yan
2026-06-18  1:11     ` Kefeng Wang
2026-06-18  1:20       ` Zi Yan [this message]
2026-06-17 15:01   ` David Hildenbrand (Arm)
2026-06-18  1:41     ` Kefeng Wang
2026-06-18  8:15       ` David Hildenbrand (Arm)
2026-06-17  8:26 ` [PATCH 2/4] mm: mprotect: use walk_page_range_vma() in mprotect_fixup() Kefeng Wang
2026-06-17 13:25   ` David Hildenbrand (Arm)
2026-06-18  1:42     ` Kefeng Wang
2026-06-17 14:28   ` Zi Yan
2026-06-18  1:43     ` Kefeng Wang
2026-06-18  8:17   ` David Hildenbrand (Arm)
2026-06-17  8:26 ` [PATCH 3/4] mm: mlock: use walk_page_range_vma() in mlock_vma_pages_range() Kefeng Wang
2026-06-17 13:26   ` David Hildenbrand (Arm)
2026-06-17 14:35   ` Zi Yan
2026-06-18  8:16   ` David Hildenbrand (Arm)
2026-06-18  8:27     ` Kefeng Wang
2026-06-17  8:26 ` [PATCH 4/4] mm: migrate_device: use walk_page_range_vma() in migrate_vma_collect() Kefeng Wang
2026-06-17 13:30   ` David Hildenbrand (Arm)
2026-06-18  1:46     ` Kefeng Wang
2026-06-17 14:37   ` Zi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DJBS471022MX.L7093MILIORX@nvidia.com \
    --to=ziy@nvidia.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=liam@infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=wangkefeng.wang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox