From: Heiko Carstens <hca@linux.ibm.com>
To: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>,
David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Gerald Schaefer <gerald.schaefer@linux.ibm.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Vasily Gorbik <gor@linux.ibm.com>,
Claudio Imbrenda <imbrenda@linux.ibm.com>,
linux-s390@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 5/6] s390/mm: Batch PTE updates in lazy MMU mode
Date: Thu, 23 Apr 2026 14:28:24 +0200
Message-ID: <20260423122824.10371E07-hca@linux.ibm.com>
In-Reply-To: <924c91e711e8b16470eb2af7e31d0b6ffcf2940c.1776264097.git.agordeev@linux.ibm.com>
On Wed, Apr 15, 2026 at 05:01:23PM +0200, Alexander Gordeev wrote:
> Make use of the IPTE instruction's "Additional Entries" field to
> invalidate multiple PTEs in one go while in lazy MMU mode. This
> is the mode in which many memory-management system calls (like
> mremap(), mprotect(), etc.) update memory attributes.
>
> To achieve that, the set_pte() and ptep_get() primitives use a
> per-CPU cache to store and retrieve PTE values and apply the
> cached values to the real page table once lazy MMU mode is left.
>
> The same is done for memory-management platform callbacks that
> would otherwise cause intense per-PTE IPTE traffic, reducing the
> number of IPTE instructions from up to PTRS_PER_PTE to a single
> instruction in the best case. The average reduction is of course
> smaller.
>
> Since all existing page table iterators called in lazy MMU mode
> handle one table at a time, the per-CPU cache does not need to be
> larger than PTRS_PER_PTE entries. That also naturally aligns with
> the IPTE instruction, which must not cross a page table boundary.
>
> Before this change, the system calls did:
>
> lazy_mmu_mode_enable_pte()
> ...
> <update PTEs> // up to PTRS_PER_PTE single-IPTEs
> ...
> lazy_mmu_mode_disable()
>
> With this change, the system calls do:
>
> lazy_mmu_mode_enable_pte()
> ...
> <store new PTE values in the per-CPU cache>
> ...
> lazy_mmu_mode_disable() // apply cache with one multi-IPTE
I think what is not necessarily immediately obvious: this approach must
ensure that within such a lazy MMU section there is not a single piece of
code that dereferences page table entry pointers without going through the
modified primitives mentioned above.

Directly dereferencing such pointers would bypass the cache and lead to
incorrect results. Therefore we need some mechanism which makes sure this
cannot happen. Preferably that would happen at compile time with static
code analysis. Alternatively your KASAN implementation would be helpful to
find something like that after the fact.

In any case, however, we need something to address this problem.
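Just to spell out the hazard (hypothetical iterator code, purely for
illustration):

	lazy_mmu_mode_enable_pte();
	...
	pte = *ptep;          /* WRONG: reads the stale table entry and
				 bypasses the per-CPU cache */
	pte = ptep_get(ptep); /* correct: goes through the cache */
	...
	lazy_mmu_mode_disable();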
> +#if !defined(CONFIG_IPTE_BATCH) || defined(__DECOMPRESSOR)
> +static inline
> +bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + int *res)
> +{
> + return false;
> +}
...
> +bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + int *res);
Could you change this to something like:
bool __ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
int *res);
static inline
bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
int *res)
{
if (__is_defined(__DECOMPRESSOR))
return false;
return __ipte_batch_ptep_test_and_clear_young(vma, addr, ptep, res);
}
? This avoids the ifdef and makes the code easier to read and change.
> +static inline void set_pte(pte_t *ptep, pte_t pte)
> +{
> + if (!ipte_batch_set_pte(ptep, pte))
> + __set_pte(ptep, pte);
> +}
Not sure if you analyzed it, but it looks like this might be the reason why
you see the fork() slowdown: every page table operation now comes with a
function call, even if it is not needed.

I guess it would be helpful to add an early exit to the ipte_batch() inlines.
E.g. going back to the example above:
static inline
bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
int *res)
{
if (__is_defined(__DECOMPRESSOR))
return false;
---> if (unlikely(!ipte_batch_active()))
---> return false;
return __ipte_batch_ptep_test_and_clear_young(vma, addr, ptep, res);
}
Where ipte_batch_active() would be an inline function which simply tests a bit
in lowcore.
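Something along these lines (untested; CIF_IPTE_BATCH would be a new CPU
flag that gets set and cleared when entering and leaving lazy MMU mode):

	static inline bool ipte_batch_active(void)
	{
		/* tests the corresponding cpu_flags bit in lowcore */
		return test_cpu_flag(CIF_IPTE_BATCH);
	}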
> diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
> index 193899c39ca7..0f6c6de447d4 100644
> --- a/arch/s390/mm/Makefile
> +++ b/arch/s390/mm/Makefile
> @@ -11,5 +11,6 @@ obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
> obj-$(CONFIG_PTDUMP) += dump_pagetables.o
> obj-$(CONFIG_PFAULT) += pfault.o
> +obj-$(CONFIG_IPTE_BATCH) += ipte_batch.o
As stated before: let's get rid of the Kconfig option. This will be enabled
unconditionally anyway.
> +#define PTE_POISON 0
Is there a reason why the PTE_POISON value is zero? If it can be a different
value, I would like to propose a PTE value which has bit 11 set (the one
which must be zero). If that value were ever transferred to a page table and
used, we would see it immediately :)
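If I got the bit numbering right, that would be something like (untested):

	/*
	 * z/Architecture PTE bit 52 (bit 11 in LSB-0 numbering) must be
	 * zero; a used entry with this bit set causes a translation-
	 * specification exception, so a leaked poison value is noticed
	 * immediately.
	 */
	#define PTE_POISON	0x800UL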
> +struct ipte_batch {
> + struct mm_struct *mm;
> + unsigned long base_addr;
> + unsigned long base_end;
> + pte_t *base_pte;
> + pte_t *start_pte;
> + pte_t *end_pte;
> + pte_t cache[PTRS_PER_PTE];
> +};
> +
> +static DEFINE_PER_CPU(struct ipte_batch, ipte_range);
This is ~1MB with 512 possible CPUs. I think it would make sense to allocate
and free this data structure with CPU hotplug to avoid wasting too much
memory.
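Something along these lines (completely untested), using the dynamic CPU
hotplug prepare states so the buffer exists before a CPU comes online; all
names are just placeholders:

	static DEFINE_PER_CPU(struct ipte_batch *, ipte_range);

	static int ipte_batch_cpu_prepare(unsigned int cpu)
	{
		struct ipte_batch *ib = kzalloc(sizeof(*ib), GFP_KERNEL);

		if (!ib)
			return -ENOMEM;
		per_cpu(ipte_range, cpu) = ib;
		return 0;
	}

	static int ipte_batch_cpu_dead(unsigned int cpu)
	{
		kfree(per_cpu(ipte_range, cpu));
		per_cpu(ipte_range, cpu) = NULL;
		return 0;
	}

	/* e.g. from an early_initcall(): */
	cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "s390/ipte_batch:prepare",
			  ipte_batch_cpu_prepare, ipte_batch_cpu_dead);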
> +static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
> +{
> + pte_t pte = __ptep_get(start);
> + pte_t *ptep;
> +
> + *valid = !(pte_val(pte) & _PAGE_INVALID);
> +
> + for (ptep = start + 1; ptep < end; ptep++) {
> + pte = __ptep_get(ptep);
> + if (*valid) {
> + if (pte_val(pte) & _PAGE_INVALID)
> + break;
> + } else {
> + if (!(pte_val(pte) & _PAGE_INVALID))
> + break;
> + }
> + }
> +
> + return ptep - start;
> +}
Wouldn't it be possible to shorten this a bit? E.g.:
static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
{
unsigned long inv;
pte_t *ptep;
inv = pte_val(__ptep_get(start)) & _PAGE_INVALID;
*valid = !inv;
for (ptep = start + 1; ptep < end; ptep++) {
		if ((pte_val(__ptep_get(ptep)) & _PAGE_INVALID) != inv)
break;
}
return ptep - start;
}
> +static void __invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
> + int nr_ptes, pte_t *ptep)
> +{
> + atomic_inc(&mm->context.flush_count);
> + if (cpu_has_tlb_lc() &&
> + cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
One line please. Yes, I know, this is more or less copy-pasted, but... :)
Just a general comment about the naming convention: imho ipte_batch is not a
nice choice, since the name of the facility is "ipte range". However, I
would abstract even further, since nobody knows if there will be a different
instruction or facility that achieves all of this in a better way.

Anyway... maybe rename the file simply to mmu.c or tlb.c and change the
function prefixes accordingly, so we end up with shorter function names?