From: Dev Jain <dev.jain@arm.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
catalin.marinas@arm.com, will@kernel.org,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, suzuki.poulose@arm.com,
steven.price@arm.com, gshan@redhat.com,
linux-arm-kernel@lists.infradead.org,
yang@os.amperecomputing.com, ryan.roberts@arm.com,
anshuman.khandual@arm.com
Subject: Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
Date: Wed, 11 Jun 2025 09:13:29 +0530 [thread overview]
Message-ID: <b05c9ac5-ae9f-47bb-9f38-c4abb5dce5c2@arm.com> (raw)
In-Reply-To: <4cab8fa7-2679-432f-95bd-4cb0bc636b56@lucifer.local>
On 10/06/25 6:27 pm, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 06:10:03PM +0530, Dev Jain wrote:
>> On 10/06/25 5:37 pm, Lorenzo Stoakes wrote:
>>> OK so I think the best solution here is to just update check_ops_valid(), which
>>> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
>>> enforce the install pte thing).
>>>
>>> Let's do something like:
>>>
>>> #define OPS_MAY_INSTALL_PTE (1<<0)
>>> #define OPS_MAY_AVOID_LOCK (1<<1)
>>>
>>> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>>>
>>> Then check based on this e.g.:
>>>
>>> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
>>> return false;
>>>
>>> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
>>> return false;
>>>
>>> return true;
>>>
>>> Then update callers, most of whom can have '0' passed for this field, with
>>> walk_page_range_mm() setting OPS_MAY_INSTALL_PTE and
>>> walk_kernel_page_table_range() setting OPS_MAY_AVOID_LOCK.
>>>
>>> That way we check it all in one place, it's consistent, we no longer 'implicitly'
>>> skip the check for walk_page_range_mm(), and all is neater.
>>>
>>> We do end up calling this predicate twice for walk_page_range(), so we could
>>> (should?) also make walk_page_range_mm() into a static __walk_page_range_mm()
>>> and have a separate walk_page_range_mm() that does this check.
>>>
>>> I think this will solve the interface issues I've raised below.
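
For concreteness, here is a quick user-space mock of the scheme you describe (the
flag names and the two checks come from your sketch; the struct layout, the
signatures and the -EINVAL plumbing are my assumptions, not the real mm/pagewalk.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Capability flags: callers declare what their ops may legitimately do. */
#define OPS_MAY_INSTALL_PTE (1 << 0)
#define OPS_MAY_AVOID_LOCK  (1 << 1)

enum page_walk_lock {
	PGWALK_RDLOCK = 0,
	PGWALK_WRLOCK = 1,
	PGWALK_WRLOCK_VERIFY = 2,
	PGWALK_NOLOCK = 3,
};

struct mm_struct;			/* opaque stand-in */

struct mm_walk_ops {
	int (*install_pte)(void);	/* stand-in for the real callback */
	enum page_walk_lock walk_lock;
};

/* The single place where ops validity is checked, for every entry point. */
static bool check_ops_valid(const struct mm_walk_ops *ops, int capabilities)
{
	/* Only callers that declare OPS_MAY_INSTALL_PTE may install PTEs. */
	if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
		return false;

	/* Only callers that declare OPS_MAY_AVOID_LOCK may go lockless. */
	if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
		return false;

	return true;
}

/* The walk itself; callers are expected to have validated ops already. */
static int __walk_page_range_mm(struct mm_struct *mm,
				const struct mm_walk_ops *ops)
{
	(void)mm; (void)ops;
	/* ... the actual page-table walk ... */
	return 0;
}

/*
 * Public wrapper: validate once, then delegate.  walk_page_range() can
 * check with its own capabilities and call __walk_page_range_mm()
 * directly, so the predicate never runs twice on one walk.
 */
static int walk_page_range_mm(struct mm_struct *mm,
			      const struct mm_walk_ops *ops)
{
	if (!check_ops_valid(ops, OPS_MAY_INSTALL_PTE))
		return -22;		/* -EINVAL */
	return __walk_page_range_mm(mm, ops);
}
```

With this shape, most callers would pass 0 for capabilities, walk_page_range_mm()
passes OPS_MAY_INSTALL_PTE, and walk_kernel_page_table_range() passes
OPS_MAY_AVOID_LOCK, exactly as you outline.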
>> Makes sense, thank you for your suggestions.
> Thanks :)
>
> Sorry to be a pain but I think this will fit more nicely.
>
>>> Thanks!
>>>
>>> On Tue, Jun 10, 2025 at 05:14:00PM +0530, Dev Jain wrote:
>>>> arm64 currently changes permissions on vmalloc objects locklessly, via
>>>> apply_to_page_range(). Patch 2 moves away from this to use the pagewalk API,
>>>> since a limitation of the former is that it refuses to change permissions
>>>> for block mappings. However, the API currently enforces that init_mm.mmap_lock
>>>> be held. To avoid the unnecessary bottleneck of the mmap_lock for our use case,
>>>> this patch extends the generic API to be usable locklessly, so as to retain
>>>> the existing behaviour for changing permissions. Apart from this reason, it is
>>>> noted at [1] that KFENCE can manipulate kernel pgtable entries during softirqs,
>>>> by calling set_memory_valid() -> __change_memory_common(). Since that is a
>>>> non-sleepable context, we cannot take the init_mm mmap lock there.
>>>>
>>>> Since such an extension can be dangerous for other callers consuming the
>>>> pagewalk API, explicitly disallow lockless traversal for userspace
>>>> pagetables by returning -EINVAL. Add comments to highlight the conditions
>>>> under which the API can be used locklessly - no underlying VMA, and the
>>>> user having exclusive control over the range, thus guaranteeing no
>>>> concurrent access.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> include/linux/pagewalk.h | 7 +++++++
>>>> mm/pagewalk.c | 23 ++++++++++++++++++-----
>>>> 2 files changed, 25 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>>>> index 8ac2f6d6d2a3..5efd6541239b 100644
>>>> --- a/include/linux/pagewalk.h
>>>> +++ b/include/linux/pagewalk.h
>>>> @@ -14,6 +14,13 @@ enum page_walk_lock {
>>>> PGWALK_WRLOCK = 1,
>>>> /* vma is expected to be already write-locked during the walk */
>>>> PGWALK_WRLOCK_VERIFY = 2,
>>>> + /*
>>>> + * Walk without any lock. Use of this is only meant for the
>>>> + * case where there is no underlying VMA, and the user has
>>>> + * exclusive control over the range, guaranteeing no concurrent
>>>> + * access. For example, changing permissions of vmalloc objects.
>>>> + */
>>>> + PGWALK_NOLOCK = 3,
>>> Thanks for the comment! This is good.
>>>
>>>> };
>>>>
>>>> /**
>>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>>> index ff5299eca687..d55d933f84ec 100644
>>>> --- a/mm/pagewalk.c
>>>> +++ b/mm/pagewalk.c
>>>> @@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>>>> return err;
>>>> }
>>>>
>>>> -static inline void process_mm_walk_lock(struct mm_struct *mm,
>>>> +static inline bool process_mm_walk_lock(struct mm_struct *mm,
>>>> enum page_walk_lock walk_lock)
>>> I don't like this signature at all: you don't describe what it does, and now it
>>> returns... whether it was *not* locked? I think this might lead to confusion :)
>>>
>>>
>>>> {
>>>> + if (walk_lock == PGWALK_NOLOCK)
>>>> + return 1;
>>> It's 2025, return true please :)
>>>
>>>> +
>>>> if (walk_lock == PGWALK_RDLOCK)
>>>> mmap_assert_locked(mm);
>>>> else
>>>> mmap_assert_write_locked(mm);
>>>> + return 0;
>>> It's 2025, return false please :)
>>>
>>>> }
>>>>
>>>> static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>>> @@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>>> case PGWALK_RDLOCK:
>>>> /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>>> break;
>>>> + case PGWALK_NOLOCK:
>>>> + break;
>>> Under what circumstances would we be fine with this function being invoked with
>>> no lock being specified?
>>>
>>> Isn't it the case that the one situation in which we can specify PGWALK_NOLOCK
>>> won't ever invoke this? Or did I miss a call of this function?
>>>
>>> If not, we should make this a VM_WARN_ON_ONCE(1);
>> I was made aware that there is a quest to remove BUG_ON()'s in the kernel, see [1].
>> Is there a similar problem with VM_WARN_ON_ONCE()?
> No, in fact the proposal is that we replace BUG_ON()'s with [VM_]WARN_ON_ONCE()'s.
>
> So this is all good, BUG_ON() is basically never needed unless you are _certain_
> there will be system instability that MUST BE STOPPED NOW.
>
> Which is almost never going to be the case.
>
> See
> https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/
> where I managed to somehow provoke Linus into giving some (very interesting!) input ;)
>
> But if you see the thread you can see further context on all this.
Thanks for the info.
>
>> [1]: https://lore.kernel.org/all/053ae9ec-1113-4ed8-9625-adf382070bc5@redhat.com/
>>
>>>> }
>>>> #endif
>>>> }
>>>> @@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>>>> if (!walk.mm)
>>>> return -EINVAL;
>>>>
>>>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>>>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>>>> + return -EINVAL;
>>> This is just weird, you're treating the return as if it were an error value (no,
>>> it's a boolean), and the name of the function doesn't explain the 'verb' of what's
>>> happening, so it's just confusing.
>>>
>>> Obviously I'm belabouring the point a bit, see suggestion at top :)
>>>
>>>> vma = find_vma(walk.mm, start);
>>>> do {
>>>> @@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
>>>> * to prevent the intermediate kernel pages tables belonging to the
>>>> * specified address range from being freed. The caller should take
>>>> * other actions to prevent this race.
>>>> + *
>>>> + * If the caller can guarantee that it has exclusive access to the
>>>> + * specified address range, only then can it use PGWALK_NOLOCK.
>>>> */
>>>> - mmap_assert_locked(mm);
>>>> + if (ops->walk_lock != PGWALK_NOLOCK)
>>>> + mmap_assert_locked(mm);
>>>>
>>>> return walk_pgd_range(start, end, &walk);
>>>> }
>>>> @@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>>>> if (!check_ops_valid(ops))
>>>> return -EINVAL;
>>>>
>>>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>>>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>>>> + return -EINVAL;
>>>> process_vma_walk_lock(vma, ops->walk_lock);
>>>> return __walk_page_range(start, end, &walk);
>>>> }
>>>> @@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>>>> if (!check_ops_valid(ops))
>>>> return -EINVAL;
>>>>
>>>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>>>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>>>> + return -EINVAL;
>>>> process_vma_walk_lock(vma, ops->walk_lock);
>>>> return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
>>>> }
>>>> --
>>>> 2.30.2
>>>>
>>> Thanks!
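
For what it's worth, the lock-assertion change in walk_kernel_page_table_range()
above can be modelled in user space like this (the struct layout and the plain
assert() are stand-ins for illustration only, not the real kernel code):

```c
#include <assert.h>
#include <stdbool.h>

enum page_walk_lock {
	PGWALK_RDLOCK,
	PGWALK_WRLOCK,
	PGWALK_WRLOCK_VERIFY,
	PGWALK_NOLOCK,
};

struct mm_struct { bool mmap_locked; };		/* stand-in for the real mm */
struct mm_walk_ops { enum page_walk_lock walk_lock; };

/* Mock of mmap_assert_locked(): fires if the lock is not held. */
static void mmap_assert_locked(const struct mm_struct *mm)
{
	assert(mm->mmap_locked);
}

static int walk_kernel_page_table_range(struct mm_struct *init_mm,
					const struct mm_walk_ops *ops)
{
	/*
	 * Lockless walks are only permitted when the caller has exclusive
	 * control over the range (e.g. its own vmalloc object), so there
	 * is nothing to assert in the PGWALK_NOLOCK case.
	 */
	if (ops->walk_lock != PGWALK_NOLOCK)
		mmap_assert_locked(init_mm);

	/* ... walk_pgd_range(start, end, &walk) ... */
	return 0;
}
```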