From: Yeoreum Yun <yeoreum.yun@arm.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-rt-devel@lists.linux.dev,
catalin.marinas@arm.com, akpm@linux-oundation.org,
david@kernel.org, kevin.brodsky@arm.com,
quic_zhenhuah@quicinc.com, dev.jain@arm.com,
yang@os.amperecomputing.com, chaitanyas.prakash@arm.com,
bigeasy@linutronix.de, clrkwllms@kernel.org, rostedt@goodmis.org,
lorenzo.stoakes@oracle.com, ardb@kernel.org, jackmanb@google.com,
vbabka@suse.cz, mhocko@suse.com
Subject: Re: [PATCH v5 2/3] arm64: mmu: avoid allocating pages while splitting the linear mapping
Date: Tue, 20 Jan 2026 17:49:37 +0000
Message-ID: <aW/AMfg34flDNvcS@e129823.arm.com>
In-Reply-To: <11a01f4e-9ae5-4001-9f9c-74a746f898cd@arm.com>

On Tue, Jan 20, 2026 at 05:35:50PM +0000, Ryan Roberts wrote:
> On 20/01/2026 16:31, Yeoreum Yun wrote:
> > Hi Ryan,
> >
> >> On 20/01/2026 15:53, Will Deacon wrote:
> >>> On Tue, Jan 20, 2026 at 10:40:30AM +0000, Ryan Roberts wrote:
> >>>> On 20/01/2026 09:29, Yeoreum Yun wrote:
> >>>>> Hi Ryan
> >>>>>> On 19/01/2026 21:24, Yeoreum Yun wrote:
> >>>>>>> Hi Will,
> >>>>>>>
> >>>>>>>> On Mon, Jan 05, 2026 at 08:23:27PM +0000, Yeoreum Yun wrote:
> >>>>>>>>> +static int __init linear_map_prealloc_split_pgtables(void)
> >>>>>>>>> +{
> >>>>>>>>> + int ret, i;
> >>>>>>>>> + unsigned long lstart = _PAGE_OFFSET(vabits_actual);
> >>>>>>>>> + unsigned long lend = PAGE_END;
> >>>>>>>>> + unsigned long kstart = (unsigned long)lm_alias(_stext);
> >>>>>>>>> + unsigned long kend = (unsigned long)lm_alias(__init_begin);
> >>>>>>>>> +
> >>>>>>>>> + const struct mm_walk_ops collect_to_split_ops = {
> >>>>>>>>> + .pud_entry = collect_to_split_pud_entry,
> >>>>>>>>> + .pmd_entry = collect_to_split_pmd_entry
> >>>>>>>>> + };
> >>>>>>>>
> >>>>>>>> Why do we need to rewalk the page-table here instead of collating the
> >>>>>>>> number of block mappings we put down when creating the linear map in
> >>>>>>>> the first place?
> >>>>>>
> >>>>>> That's a good point; perhaps we can reuse the counters that this series introduces?
> >>>>>>
> >>>>>> https://lore.kernel.org/all/20260107002944.2940963-1-yang@os.amperecomputing.com/
> >>>>>>
> >>>>>>>
> >>>>>>> First, linear alias of the [_text, __init_begin) is not a target for
> >>>>>>> the split and it also seems strange to me to add code inside alloc_init_XXX()
> >>>>>>> that both checks an address range and counts to get the number of block mappings.
> >>>>>>>
> >>>>>>> Second, for a future feature, I hope to add code that splits a
> >>>>>>> "specific" area, e.g. to set a specific pkey for that area.
> >>>>>>
> >>>>>> Could you give more detail on this? My working assumption is that either the
> >>>>>> system supports BBML2 or it doesn't. If it doesn't, we need to split the whole
> >>>>>> linear map. If it does, we already have logic to split parts of the linear map
> >>>>>> when needed.
> >>>>>
> >>>>> This is not for the linear mapping case, but for the "kernel text area".
> >>>>> As a draft, I want to mark some kernel code as executable by
> >>>>> both the kernel and eBPF programs.
> >>>>> (I'm trying to use the POE feature so that eBPF programs cannot
> >>>>> execute kernel code directly.)
> >>>>> For this area that is executable by both the kernel and eBPF programs
> >>>>> (a typical example is exception entry), we need to split that specific
> >>>>> range and mark it with a special POE index.
> >>>>
> >>>> Ahh yes, I recall you mentioning this a while back (although I confess all the
> >>>> details have fallen out of my head). You'd need to make sure you're definitely
> >>>> not splitting an area of text that the secondary CPUs are executing while they
> >>>> are being held in the pen, since at least one of those CPUs doesn't support BBML2.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> In this case, it's useful to rewalk the page-table with the specific
> >>>>>>> range to get the number of block mapping.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> + split_pgtables_idx = 0;
> >>>>>>>>> + split_pgtables_count = 0;
> >>>>>>>>> +
> >>>>>>>>> + ret = walk_kernel_page_table_range_lockless(lstart, kstart,
> >>>>>>>>> + &collect_to_split_ops,
> >>>>>>>>> + NULL, NULL);
> >>>>>>>>> + if (!ret)
> >>>>>>>>> + ret = walk_kernel_page_table_range_lockless(kend, lend,
> >>>>>>>>> + &collect_to_split_ops,
> >>>>>>>>> + NULL, NULL);
> >>>>>>>>> + if (ret || !split_pgtables_count)
> >>>>>>>>> + goto error;
> >>>
> >>> Just noticed this, but why do we check '!split_pgtables_count' here?
> >>> if the page-table is already somehow mapped at page granularity, that
> >>> doesn't necessarily sound like a fatal error to me.
> >>>
> >>>>>>>>> +
> >>>>>>>>> + ret = -ENOMEM;
> >>>>>>>>> +
> >>>>>>>>> + split_pgtables = kvmalloc(split_pgtables_count * sizeof(struct ptdesc *),
> >>>>>>>>> + GFP_KERNEL | __GFP_ZERO);
> >>>>>>>>> + if (!split_pgtables)
> >>>>>>>>> + goto error;
> >>>>>>>>> +
> >>>>>>>>> + for (i = 0; i < split_pgtables_count; i++) {
> >>>>>>>>> + /* The page table will be filled during splitting, so zeroing it is unnecessary. */
> >>>>>>>>> + split_pgtables[i] = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_ZERO, 0);
> >>>>>>>>> + if (!split_pgtables[i])
> >>>>>>>>> + goto error;
> >>>>>>>>
> >>>>>>>> This looks potentially expensive on the boot path and only gets worse as
> >>>>>>>> the amount of memory grows. Maybe we should predicate this preallocation
> >>>>>>>> on preempt-rt?
> >>>>>>>
> >>>>>>> Agree. then I'll apply pre-allocation with PREEMPT_RT only.
> >>>>>>
> >>>>>> I guess I'm missing something obvious but I don't understand the problem here...
> >>>>>> We are only deferring the allocation of all these pgtables, so the cost is
> >>>>>> neutral surely? Had we correctly guessed that the system doesn't support BBML2
> >>>>>> earlier, we would have had to allocate all these pgtables earlier.
> >>>>>>
> >>>>>> Another way to look at it is that we are still allocating the same number of
> >>>>>> pgtables in the existing fallback path, it's just that we are doing it inside
> >>>>>> the stop_machine().
> >>>>>>
> >>>>>> My vote would be _not_ to have a separate path for PREEMPT_RT, which will end up
> >>>>>> with significantly less testing...
> >>>>>
> >>>>> IIUC, Will is referring to the additional memory allocation for
> >>>>> "split_pgtables", which stores the preallocated page tables.
> >>>>> As the amount of memory grows, this array grows too, which adds to the cost.
> >>>>
> >>>> Err, so you're referring to the extra kvmalloc()? I don't think that's a big
> >>>> deal is it? you get 512 pointers per page. So the amortized cost is 1/512= 0.2%?
> >>>
> >>> Right, it was the page-table pages I was worried about not the array of
> >>> pointers.
> >>>
> >>>> I suspect we have both misunderstood Will's point...
> >>>
> >>> I probably just got confused by linear_map_free_split_pgtables() as it
> >>> has logic to free unused page-table pages between 'split_pgtables_idx'
> >>> and 'split_pgtables_count', implying that we can over-allocate.
> >>>
> >>> If that is only needed for the error path in
> >>> linear_map_prealloc_split_pgtables(), then perhaps that part should be
> >>> inlined to deal with the case where we fail to allocate part way through.
> >>
> >> I was originally concerned [1] that there could be a race where another CPU
> >> caused the normal splitting machinery to kick in after this cpu determined the
> >> number of required page tables, so there could be some left over in that case.
> >>
> >> On reflection, I guess (hope) that's not possible because we've determined that
> >> some CPUs don't support BBML2. I'm guessing the secondaries haven't been
> >> released to do general work yet?
> >
> > I don't think so: linear_map_maybe_split_to_ptes() is called from
> > smp_cpus_done(), and at that point the secondary CPUs are already
> > online and appear to be schedulable.
> >
> > That's why, although it is unlikely, another CPU could still *split*
> > a mapping that had already been counted, after the number of splits
> > was collected. At that time I agreed with your comments because of
> > this *low possibility*.
> >
> >>
> >> In which case, I agree, this could be simplified and we could just assert that
> >> all pre-allocated pages get used up if there is no error?
> >>
> >> [1] https://lore.kernel.org/all/73ced1db-a2e2-49ea-927e-9fc4a30e771e@arm.com/
> >
> > So for the above reason, I still think we need to keep the code that
> > frees unused page tables.
> >
> > Am I missing something?
>
> My concern is that if a secondary CPU can race and cause a split, that is
> unsound because we have determined that although the primary CPU supports BBML2,
> at least one of the secondary CPUs does not. So splitting a live mapping is unsafe.
>
> I just had a brief chat with Rutland, and he agrees that this _could_ be a
> problem. Basically there is a window between onlining the secondary cpus and
> entering the stop_machine() where one of those cpus _could_ end up doing
> something that causes us to split the linear map.

Regardless of BBML2, does this mean it would be a problem to call the
set_memory_xxx() API during this window?

For example,

  CPU0 (boot)                            CPU1 (secondary)
  linear_map_maybe_split_to_ptes()
    collect the number of page tables
                                         set_memory_xxx()
                                           split the specific linear region
    preallocate() and splitting()

TBH, I'm not sure why this scenario would be a problem.

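
To make the accounting side of this interleaving concrete, here is a minimal sketch in plain userspace C. It is a toy model under stated assumptions, not kernel code: toy_linear_map, toy_count_blocks(), toy_split_one(), toy_split_all() and toy_race_leftover() are hypothetical names, and the "linear map" is just an array of flags. It only shows why a page table counted on CPU0 can end up unused if CPU1 splits a block in between; it says nothing about whether that split is safe on a CPU without BBML2, which is the concern Ryan raises.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (assumption): one flag per block mapping in the linear map. */
#define NR_BLOCKS 8

struct toy_linear_map {
	bool is_block[NR_BLOCKS];	/* true while still a block mapping */
};

/* CPU0, step 1: count the block mappings that will each need one table. */
static size_t toy_count_blocks(const struct toy_linear_map *map)
{
	size_t n = 0;

	for (size_t i = 0; i < NR_BLOCKS; i++)
		if (map->is_block[i])
			n++;
	return n;
}

/* CPU1: a set_memory_xxx()-style split of a single block. */
static void toy_split_one(struct toy_linear_map *map, size_t idx)
{
	map->is_block[idx] = false;
}

/*
 * CPU0, step 2: split every remaining block, consuming one preallocated
 * table per block. Returns how many preallocated tables were left unused.
 */
static size_t toy_split_all(struct toy_linear_map *map, size_t prealloc)
{
	size_t used = 0;

	for (size_t i = 0; i < NR_BLOCKS; i++) {
		if (!map->is_block[i])
			continue;
		assert(used < prealloc);	/* must never under-allocate */
		map->is_block[i] = false;
		used++;
	}
	return prealloc - used;
}

/* Replay the interleaving; returns the number of leftover tables. */
static size_t toy_race_leftover(bool cpu1_races)
{
	struct toy_linear_map map;
	size_t count;

	for (size_t i = 0; i < NR_BLOCKS; i++)
		map.is_block[i] = true;

	count = toy_count_blocks(&map);		/* CPU0 collects the count */
	if (cpu1_races)
		toy_split_one(&map, 3);		/* CPU1 splits in the window */
	return toy_split_all(&map, count);	/* CPU0 preallocates and splits */
}
```

In this model toy_race_leftover(false) leaves nothing over, while toy_race_leftover(true) leaves exactly one table unused, which is the case the "free unused page tables" logic is meant to cover.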
> > --
> > Sincerely,
> > Yeoreum Yun
>
--
Sincerely,
Yeoreum Yun