* Re: [PATCH] samples/ftrace: reject zero ftrace-ops call count
From: Steven Rostedt @ 2026-06-11 0:03 UTC (permalink / raw)
To: Samuel Moelius
Cc: Masami Hiramatsu, Mark Rutland, open list:FUNCTION HOOKS (FTRACE),
open list:FUNCTION HOOKS (FTRACE)
In-Reply-To: <CAE+C+DZXcQfyQt-UV2PRt0vVFJqCciW6QkyEd9FaJ8++B3M4Ow@mail.gmail.com>
On Tue, 9 Jun 2026 07:26:27 -0400
Samuel Moelius <sam.moelius@trailofbits.com> wrote:
> Is it okay to keep the same subject line or should I change it?
Yeah, and also note that the tracing subsystem uses capital letters:
samples/ftrace: Reject zero ftrace-ops call count
But you can change it to:
samples/ftrace: Prevent division by zero when nr_function_calls is zero
-- Steve
^ permalink raw reply
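For reference, a minimal userland sketch of the guard the suggested subject line describes. The sample module's code is not quoted in this message, so the helper name and signature here are hypothetical. The only assumption carried over from the thread is that a call count such as nr_function_calls ends up as a divisor.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the sample module's code: compute the average
 * time per call, refusing a zero call count instead of dividing by it.
 */
static int avg_ns_per_call(uint64_t total_ns, unsigned int nr_calls,
			   uint64_t *avg_ns)
{
	if (nr_calls == 0)
		return -EINVAL;	/* reject before the division happens */

	*avg_ns = total_ns / nr_calls;
	return 0;
}
```

Returning an error up front, rather than clamping the count to 1, keeps the reported average honest: a caller that asked for zero calls gets told so instead of receiving a made-up number.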
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-10 23:53 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <aik_ddHymus2DJ6D@gourry-fedora-PF4VCD3F>
On Wed, Jun 10, 2026 at 06:41:57AM -0400, Gregory Price wrote:
> On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> > >
> > > __GFP_THISNODE cannot be overloaded to do anything useful here.
> >
> > Let me clarify, I meant to say, let's use a nodemask for allocation
> > and __GFP_THISNODE gets us to the node we desire, if that is the only
> > node. My earlier comment might not have been clear.
> >
>
> I've been testing a stripped-back patch set where I drop all FALLBACK
> entries for private nodes (including for itself) and only keep the
> NOFALLBACK entry for private nodes.
>
> This effectively isolates the nodes for any allocation without
> __GFP_THISNODE.
>
> This also precludes these nodes from ever using non-mbind mempolicies,
> which I think is a completely reasonable compromise and something I was
> already expecting we would do.
>
> Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> which causes spillage into private nodes because slub allows private
> nodes in its mask. I think this is fixable.
>
Agreed.
> I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> code, etc), but it seems like fully dropping the FALLBACK entries and
> requiring __GFP_THISNODE might be sufficient.
>
> ~Gregory
That's good progress, thanks for the update!
Balbir
^ permalink raw reply
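The isolation scheme quoted above (private nodes dropped from every FALLBACK list, reachable only through their own NOFALLBACK list) can be sketched as a toy userland model. Everything here is an assumption made for illustration: the node layout, the TOY_ names, and the simplification that a fallback list contains every non-private node. It is not the patch set's code.

```c
#include <stdbool.h>

#define TOY_NR_NODES	 3
#define TOY_GFP_THISNODE 0x1u	/* stands in for __GFP_THISNODE */

/* Assumed layout: nodes 0 and 1 are normal memory, node 2 is private. */
static const bool toy_node_private[TOY_NR_NODES] = { false, false, true };

/*
 * Returns true if an allocation requested on node 'nid' with flags 'gfp'
 * may be satisfied from node 'target'.
 */
static bool toy_can_alloc_from(int nid, unsigned int gfp, int target)
{
	/* NOFALLBACK list: contains only the requested node itself. */
	if (gfp & TOY_GFP_THISNODE)
		return target == nid;

	/* FALLBACK list: private nodes are absent, even from their own. */
	return !toy_node_private[target];
}
```

In this model a request on node 2 without the THISNODE flag lands on normal memory, while any caller that adds the flag with nid 2 reaches the private node. That second case is the shape of the slub spillage described above.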
* [PATCH v4] mm/lruvec: trace LRU add drains and drain-all requests
From: JP Kobryn @ 2026-06-10 23:48 UTC (permalink / raw)
To: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park
Cc: linux-kernel, linux-trace-kernel
LRU add batches can be drained before they reach capacity. This can be a
source of LRU lock contention, but it is not currently possible to
attribute these drains to callers with existing tracepoints.
Add mm_lru_add_drain to report the CPU and lru_add batch count when an
lru_add batch is drained. This allows tracing to distinguish full drains
from partial drains and attribute them to the calling stack.
Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
whether they set the force flag for all CPUs. The tracepoint resembles
the signature of the enclosing function, but is needed because of
potential inlining.
Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
---
v4:
- renamed nr_folio_add to nr_folios in lru_add_drain()
- renamed nr to nr_folios in tracepoint for consistency
v3: https://lore.kernel.org/linux-mm/20260610195220.12403-1-jp.kobryn@linux.dev/
- restored and renamed tracepoint in __lru_add_drain_all
v2: https://lore.kernel.org/linux-mm/20260609041156.31127-1-jp.kobryn@linux.dev/
- removed mm_lru_drain_all tracepoint
v1: https://lore.kernel.org/linux-mm/20260609041156.31127-1-jp.kobryn@linux.dev/
include/trace/events/pagemap.h | 37 ++++++++++++++++++++++++++++++++++
mm/swap.c | 7 ++++++-
2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 171524d3526d..df6ac4d13dcf 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -77,6 +77,43 @@ TRACE_EVENT(mm_lru_activate,
TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
);
+TRACE_EVENT(mm_lru_add_drain,
+
+ TP_PROTO(int cpu, unsigned int nr_folios),
+
+ TP_ARGS(cpu, nr_folios),
+
+ TP_STRUCT__entry(
+ __field(int, cpu )
+ __field(unsigned int, nr_folios )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->nr_folios = nr_folios;
+ ),
+
+ TP_printk("cpu=%d nr_folios=%u", __entry->cpu, __entry->nr_folios)
+);
+
+TRACE_EVENT(mm_lru_add_drain_all,
+
+ TP_PROTO(bool force_all_cpus),
+
+ TP_ARGS(force_all_cpus),
+
+ TP_STRUCT__entry(
+ __field(bool, force_all_cpus )
+ ),
+
+ TP_fast_assign(
+ __entry->force_all_cpus = force_all_cpus;
+ ),
+
+ TP_printk("force_all_cpus=%s",
+ __entry->force_all_cpus ? "true" : "false")
+);
+
#endif /* _TRACE_PAGEMAP_H */
/* This part must be outside protection */
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..b506fa912a93 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
{
struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
struct folio_batch *fbatch = &fbatches->lru_add;
+ unsigned int nr_folios = folio_batch_count(fbatch);
- if (folio_batch_count(fbatch))
+ if (nr_folios) {
folio_batch_move_lru(fbatch, lru_add);
+ trace_mm_lru_add_drain(cpu, nr_folios);
+ }
fbatch = &fbatches->lru_move_tail;
/* Disabling interrupts below acts as a compiler barrier. */
@@ -869,6 +872,8 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
if (WARN_ON(!mm_percpu_wq))
return;
+ trace_mm_lru_add_drain_all(force_all_cpus);
+
/*
* Guarantee folio_batch counter stores visible by this CPU
* are visible to other CPUs before loading the current drain
--
2.54.0
^ permalink raw reply related
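A small userland sketch of how a consumer of the new mm_lru_add_drain event might tell partial drains from full ones. The format string is the TP_printk() from the patch above. The capacity constant is an assumption (the real folio batch size depends on the kernel version), and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed batch capacity; check the target kernel before relying on it. */
#define ASSUMED_BATCH_CAPACITY 31u

struct lru_add_drain_event {
	int cpu;
	unsigned int nr_folios;
};

/* Parse the payload printed by TP_printk("cpu=%d nr_folios=%u", ...). */
static bool parse_lru_add_drain(const char *payload,
				struct lru_add_drain_event *ev)
{
	return sscanf(payload, "cpu=%d nr_folios=%u",
		      &ev->cpu, &ev->nr_folios) == 2;
}

/* A drain is partial if the batch held fewer folios than its capacity. */
static bool is_partial_drain(const struct lru_add_drain_event *ev)
{
	return ev->nr_folios < ASSUMED_BATCH_CAPACITY;
}
```

Feeding it the payload "cpu=3 nr_folios=7" yields a partial drain on CPU 3. Pairing that with a stack trace on the event is what would attribute the early drain to a caller.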
* Re: [PATCH 0/2] arm64: ftrace: support DIRECT_CALLS without CALL_OPS
From: Nathan Chancellor @ 2026-06-10 23:36 UTC (permalink / raw)
To: Jose Fernandez (Anthropic)
Cc: Steven Rostedt, Masami Hiramatsu, Mark Rutland, Catalin Marinas,
Will Deacon, Nick Desaulniers, Bill Wendling, Justin Stitt,
linux-kernel, linux-trace-kernel, linux-arm-kernel, llvm, bpf,
Florent Revest, Puranjay Mohan, Xu Kuohai
In-Reply-To: <20260609-arm64-ftrace-direct-calls-v1-0-4a46f266697f@linux.dev>
Hi Jose,
On Tue, Jun 09, 2026 at 05:19:25AM +0000, Jose Fernandez (Anthropic) wrote:
> Jose Fernandez (Anthropic) (2):
> arm64: ftrace: prepare ftrace_modify_call() for use without CALL_OPS
> arm64: ftrace: allow DIRECT_CALLS without CALL_OPS
Thanks, I applied these two changes on -next and it looks like it
resolves the issue I originally noticed with systemd's restrict-fs
program not working on both of my arm64 machines.
Tested-by: Nathan Chancellor <nathan@kernel.org>
--
Cheers,
Nathan
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-10 23:09 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <aiFtJFqkpbZ9qFvM@gourry-fedora-PF4VCD3F>
On Thu, Jun 04, 2026 at 01:18:44PM +0100, Gregory Price wrote:
> On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
> >
> > My concern is that __GFP_PRIVATE is too wide. I wonder if we'll need to
> > support N_MEMORY_PRIVATE nodes that are not all homogeneous memory nodes,
> > very similar to how not all ZONE_DEVICE memory is homogeneous.
> >
>
> Can you be more precise about your definition of homogeneous here?
>
> Are you saying not all memory on a private node will be homogeneous?
> While possible, I would argue that you should not do this and
> should instead prefer to use multiple nodes - 1 per memory class.
>
> Are you saying not all private nodes will be homogeneous?
> I don't see the issue with this.
Yes, I meant that nodes might belong to different devices. These might not
want fallback allocations, for example a __GFP_PRIVATE allocation falling
back to unwanted nodes.
>
> > >
> > > Agreed, but also one which can be deferred and played with since it's
> > > all kernel-internal. None of this should have UAPI implications, and we
> > > need to accept that we're going to get it wrong on the first try.
> > >
> >
> > Agreed that we might get the design wrong, until we fix it up. I feel
> > that __GFP_PRIVATE should be an evolution of the design to that point.
> >
>
> Possibly. If we can't guarantee isolation without __GFP_PRIVATE, then
> we probably can't merge the baseline without it.
>
I'll rethink this, but I am concerned that __GFP_PRIVATE is too
broad; in fact, it breaks isolation by allocating from any private
device. Again, this is a function of how fallback lists are organized.
> > > Because pagecache pages are associated with potentially many VMAs.
> > >
> > > The fault can be a soft fault or a hard fault. On soft fault - the page
> > > was already present, and will simply fault into VMA without being
> > > migrated.
> > >
> >
> > Let's split this into two:
> >
> > 1. unmapped page cache is never impacted by mempolicy and should not
> > end up on private memory nodes
> > 2. For shared pages, mempolicy would be hard, but it would need to
> > be on a set of nodes backed by private memory, depending on mbind()
> > policy
> >
> ... snip ...
> >
> > I'd need to think more about this. For now, my basic requirement would
> > be that unmapped page cache should not come from/to private nodes.
> >
>
> This does not fully describe the problem.
>
> A file can be opened and cached as unmapped page cache, and then mapped
> at a later time - at which point the mapped copy would share the filemap
> page cache page.
>
> Worse, because it's file-backed, you can have the memory faulted onto
> your remote node - reclaimed - and then faulted back in via the process
> accessing the file via unmapped operations (read/write), at which point
> you've had a silent migration occur.
>
> Basically consider
>
> Process A:
> fd = open("myfile", ..., RO);
> read(fd, ...); /* mm/filemap.c fills page cache */
>
> Process B:
> fd = open("myfile", ...);
> mem = mmap(fd, ...);
> mbind(mem, ..., private_node);
> for page in mem:
> int tmp = mem[page]; /* fault into vma */
>
> The result of Process A running first is Process B thinks it has faulted
> the memory onto private_node, but in reality it's taking soft faults and
> just getting the filemap folio mapped in.
>
> If you wanted mbind() support from the start, we would have to limit
> applicability to anon memory only.
>
> Shared anon memory is different, as there is a radix tree that deals
> with a shared mempolicy state.
Ack, need to think through this.
>
> >
> > I am open to this, I was coming from the blueprint approach of:
> > - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
> > what features to change or make specific to the implementation
> >
>
> N_MEMORY essentially states:
> "This is normal memory touch it however you like"
>
> N_MEMORY_PRIVATE (_MANAGED, w/e) says
> "This is NOT normal memory, there are special rules here"
>
> So, no, let's not mimic N_MEMORY. This is a "closed by default" design,
> while N_MEMORY is an "open by default" design. This design choice is
> explicit to make reasoning about these nodes feasible.
>
> > > This is informed by a single use case / device.
> > >
> > > There are users / devices that don't want any UAPI for their memory,
> > > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > > reclaim, etc).
> > >
> >
> > But then, why do they need NUMA nodes? Do we have a list of use cases?
> >
>
> So far I have collected:
>
> - Network accelerators carrying their own memory for message buffers
> - GPUs with semi-general-purpose working memory across coherent links
> - Exceptionally slow distributed memory that you do not want fallback
> allocations to (so you want to deliberately tier what lands there)
> - Compressed memory (just another form of accelerator really) which
> has *special access rules* (i.e. writes need to be controlled)
>
> In most if not all of these cases, the right abstraction to reason about
> where memory *should come from* IS a NUMA node.
>
> - the network stack can be taught to check if the target device has a
> node with memory and prefer that node over local memory
>
> - accelerators can be given private nodes to manage memory using
> core mm/ components, without worrying that general kernel operation
> will put unrelated memory on those nodes or do things like migrate
> your pages out from under you (unless your driver/service requested
> that).
>
> the tiering application should be somewhat obvious / trivial.
>
> > >
> > > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > > operations reach private nodes removed from fallback lists via
> > > something like the possible / online nodemask.
> > >
> > > I remember, maybe a year ago, there were per-node allocations happening
> > > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > > I'm trying to re-collect that data now.
> > >
> >
> > Thanks, I look forward to the next set of patches. Let me know if I
> > can help test what's on the list or if you want me to wait for the next
> > round
> >
>
> Really I want to get the minimized set out the door so we can start
> breaking this up by feature (reclaim, mempolicy, etc), because trying to
> reason about it as a whole is infeasible - and I cannot be the single
> arbiter of every use case (I simply do not have sufficient context).
>
> I'm reworking it all as we speak.
>
Look forward to it
Balbir
^ permalink raw reply
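The Process A / Process B scenario quoted above can be reduced to a toy userland model of one shared file page. This is only a sketch: the struct, the function names, and the "one page, one node" simplification are all hypothetical, and real faults go through mm/filemap.c rather than anything like this.

```c
#include <stdbool.h>

#define NO_NODE (-1)

/* Toy model: one file page, cached at most once, shared by all mappers. */
struct toy_file_page {
	bool cached;
	int node;	/* node the cached folio lives on */
};

/* read(2)-style access: fills the page cache on the reader's local node. */
static void toy_read(struct toy_file_page *pg, int local_node)
{
	if (!pg->cached) {
		pg->cached = true;
		pg->node = local_node;
	}
}

/*
 * Fault through an mmap() with an mbind()-style target node.
 * Returns the node the mapped folio actually lives on.
 */
static int toy_fault(struct toy_file_page *pg, int policy_node)
{
	if (pg->cached)
		return pg->node;	/* soft fault: policy is never consulted */

	pg->cached = true;		/* hard fault: allocate per policy */
	pg->node = policy_node;
	return pg->node;
}
```

If toy_read() runs first on node 0, a later toy_fault() asking for node 1 still returns 0, which is the silent mismatch described in the quoted text. On a page nobody has read yet, the same fault returns 1.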
* Re: [PATCH v7 06/42] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
From: Sean Christopherson @ 2026-06-10 22:23 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-6-2f0fae496530@google.com>
On Fri, May 22, 2026, Ackerley Tng wrote:
> Update the guest_memfd populate() flow to pull memory attributes from the
> gmem instance instead of the VM when KVM is not configured to track
> shared/private status in the VM.
>
> Rename the per-VM API to make it clear that it retrieves per-VM
> attributes, i.e. is not suitable for use outside of flows that are
> specific to generic per-VM attributes.
>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
We should squash this in with the previous patch, i.e. wire up PRIVATE to gmem
in a single patch (sans the ioctl support). I had a hell of a time figuring out how
the range-based lookup was supposed to work when revisiting the "wire up" patch,
until I realized populate() was handled in the next patch.
^ permalink raw reply
* Re: [PATCH v7 04/42] KVM: Stub in ability to disable per-VM memory attribute tracking
From: Sean Christopherson @ 2026-06-10 22:19 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-4-2f0fae496530@google.com>
On Fri, May 22, 2026, Ackerley Tng wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Introduce the basic infrastructure to allow per-VM memory attribute
> tracking to be disabled. This will be built upon in a later patch, where a
> module param can disable per-VM memory attribute tracking.
>
> Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
> existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
> plumbing, while the latter enables the full per-VM tracking via an xarray
> and the associated ioctls.
>
> kvm_get_memory_attributes() now performs a static call that either looks up
> kvm->mem_attr_array when CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
> just returns 0 otherwise. The static call can be patched depending on
> whether per-VM tracking is enabled by the CONFIG.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
...
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index abb9cfa3eb04d..ee26f1d9b5fda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -101,6 +101,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
> static bool __ro_after_init allow_unsafe_mappings;
> module_param(allow_unsafe_mappings, bool, 0444);
>
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static bool vm_memory_attributes = true;
> +#else
> +#define vm_memory_attributes false
> +#endif
> +DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> +#endif
Fudge. This morning's PUCK discussion about VBS made me realize that we really
don't want to kill off _all_ per-VM attributes like this, we really just want to
kill off PRIVATE. And even if RWX protections never arrive, conceptually shoving
all attributes into guest_memfd doesn't make any sense, because it really is only
the private vs. shared state that is tied to the physical memory; things like RWX
protections aren't so tightly coupled to the data.
It'll require a bit of minor surgery to these patches, but the silver lining is
that I think the end code will be slightly easier to follow.
I'll sync with you off-list to splice in the changes to your current series (I
have them sketched out).
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 22:18 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <d01fb1ed-2418-42ee-aea2-37f9a5c5729c@kernel.org>
On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
>
> At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> context might be better. I think there was also talk about how the memalloc_*
> interface might be a better way forward. Maybe we would start giving the
> allocator more context ("we are allocating a folio").
>
> The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>
Ok, this was easier to test than I expected, and hugetlb is indeed a
stickler. We can't get there 100% with just MEMALLOC_FOLIO, we still
need a MEMALLOC_PRIVATE - specifically because of users like hugetlb.
hugetlb uses __GFP_THISNODE to do its allocations, and all hugetlb
allocations are folio allocations - so the code you shared by itself
does not gate hugetlb from spilling into private nodes.
That means we still need something like this in hugetlb:
if (node_is_private(nid))
/* fail allocation */
HOWEVER... if you have MEMALLOC_PRIVATE - you make the allocation
failure a *page allocator* problem, and it serves exactly the same
purpose that __GFP_PRIVATE did.
the resulting code is two lines in my anondax driver:
unsigned int priv_flags = memalloc_private_save();
ret = do_anonymous_page_node(vmf, dev_dax->target_node);
memalloc_private_restore(priv_flags);
No special hugetlb, slab, arch code handling - they all just fail
to allocate / fall back. If they fail - it means that code is using
a bad nodemask and we need to go fix it (exactly what we want!)
I think additionally, we might be able to repurpose MEMALLOC_PRIVATE
flag for Brendan's needs as well [1].
Their goal (IIRC) was to have a pile of unmapped blocks that could
be opportunistically converted to normal memory, but otherwise left
unmapped and sitting in the buddy.
Same thing - different filter point (blocks vs nodes).
If you set MEMALLOC_PRIVATE - it makes private node allocations
possible, and "private block" access (without conversion) possible.
Otherwise private nodes are unreachable, and private blocks would be
treated like CMA (last-resort stealing, lazy-direct-mapping).
And they stack (private blocks on private nodes :V).
I haven't spent enough time looking at his proposal, but it seems like we
can kill two birds with one stone on this.
[1] https://lore.kernel.org/linux-mm/agYJcRgOHho8upVv@gourry-fedora-PF4VCD3F/
~Gregory
^ permalink raw reply
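The scoped save/restore pattern described above can be modeled in userland as follows. This is a sketch under stated assumptions: the flag bit, the toy_ names, and the plain static variable standing in for per-task flags are all hypothetical, and the save/restore shape is modeled on the existing memalloc_*_save() helpers rather than taken from the proposed MEMALLOC_PRIVATE code, which is not shown in this thread.

```c
#include <stdbool.h>

#define TOY_PF_MEMALLOC_PRIVATE 0x1u	/* hypothetical flag bit */

/* Stands in for current->flags; per-task in the kernel, global here. */
static unsigned int toy_task_flags;

/* Enter a scope; returns only the bits this call newly set. */
static unsigned int toy_memalloc_private_save(void)
{
	unsigned int old = ~toy_task_flags & TOY_PF_MEMALLOC_PRIVATE;

	toy_task_flags |= TOY_PF_MEMALLOC_PRIVATE;
	return old;
}

/* Leave a scope; clears only the bits the matching save set. */
static void toy_memalloc_private_restore(unsigned int old)
{
	toy_task_flags &= ~old;
}

/* Allocator-side gate: private nodes are reachable only inside a scope. */
static bool toy_alloc_allowed(bool node_is_private)
{
	return !node_is_private || (toy_task_flags & TOY_PF_MEMALLOC_PRIVATE);
}
```

Because save returns only the bits it newly set, nested scopes compose: an inner restore leaves the outer scope's permission intact. Callers such as hugetlb or slab that never enter a scope fail at the gate without any code of their own.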
* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: Shakeel Butt @ 2026-06-10 21:13 UTC (permalink / raw)
To: JP Kobryn
Cc: linux-mm, willy, usama.arif, akpm, vbabka, mhocko, rostedt,
mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <20260610195220.12403-1-jp.kobryn@linux.dev>
On Wed, Jun 10, 2026 at 12:52:20PM -0700, JP Kobryn wrote:
> LRU add batches can be drained before they reach capacity. This can be a
> source of LRU lock contention, but it is not currently possible to
> attribute these drains to callers with existing tracepoints.
>
> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
> lru_add batch is drained. This allows tracing to distinguish full drains
> from partial drains and attribute them to the calling stack.
>
> Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
> whether they set the force flag for all CPUs. The tracepoint resembles
> the signature of the enclosing function, but is needed because of
> potential inlining.
>
> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
^ permalink raw reply
* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: Barry Song @ 2026-06-10 21:03 UTC (permalink / raw)
To: JP Kobryn
Cc: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <20260610195220.12403-1-jp.kobryn@linux.dev>
On Thu, Jun 11, 2026 at 3:53 AM JP Kobryn <jp.kobryn@linux.dev> wrote:
>
> LRU add batches can be drained before they reach capacity. This can be a
> source of LRU lock contention, but it is not currently possible to
> attribute these drains to callers with existing tracepoints.
>
> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
> lru_add batch is drained. This allows tracing to distinguish full drains
> from partial drains and attribute them to the calling stack.
>
> Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
> whether they set the force flag for all CPUs. The tracepoint resembles
> the signature of the enclosing function, but is needed because of
> potential inlining.
>
> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
Reviewed-by: Barry Song <baohua@kernel.org>
Some minor nits:
[...]
> + unsigned int nr_folios_add = folio_batch_count(fbatch);
>
> - if (folio_batch_count(fbatch))
> + if (nr_folios_add) {
> folio_batch_move_lru(fbatch, lru_add);
> + trace_mm_lru_add_drain(cpu, nr_folios_add);
> + }
Would "nr_folios" work here, given the surrounding lru_add context?
Alternatively, nr_folios_added might make the meaning a little clearer.
Best Regards
Barry
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 20:12 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <d01fb1ed-2418-42ee-aea2-37f9a5c5729c@kernel.org>
On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> On 6/10/26 18:37, Gregory Price wrote:
> > On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/10/26 12:41, Gregory Price wrote:
> >
> > So, I remember this being asked, and I didn't fully grok the request.
> >
> > I'm still not sure I fully understand the question, so apologies if I'm
> > answering the wrong things here.
> >
> > I understand this question in two ways:
> >
> > 1) Can we disallow PAGE allocation and limit this to FOLIO allocation
>
> Yes. Can we only allow folios to be allocated from private memory nodes? So let
> me reply to that one below.
>
... snip ...
>
> At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> context might be better. I think there was also talk about how the memalloc_*
> interface might be a better way forward. Maybe we would start giving the
> allocator more context ("we are allocating a folio").
>
> The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
>
Ok, the mental gap I have is not knowing the full context behind
memalloc. I'll take this and do some reading / prototyping, but
this looks entirely reasonable.
I will still probably send the next RFC version tomorrow or Friday,
as I want to get some eyes on the __GFP_PRIVATE-less pattern.
Also, I made a new `anondax` driver which enables userland testing
of this functionality without any specialty hardware.
tl;dr:
fd = open("/dev/anondax0.0", ....);
buf = mmap(fd, ...);
buf[0] = 0xDEADBEEF; /* fault to anondax driver */
static vm_fault_t anon_dax_fault(struct vm_fault *vmf)
{
struct dev_dax *dev_dax = vmf->vma->vm_file->private_data;
vm_fault_t ret;
int id;
id = dax_read_lock();
if (!dax_alive(dev_dax->dax_dev))
ret = VM_FAULT_SIGBUS;
else
ret = do_anonymous_page_node(vmf, dev_dax->target_node);
dax_read_unlock(id);
if (ret & VM_FAULT_OOM)
return VM_FAULT_SIGBUS;
return ret ? ret : VM_FAULT_NOPAGE;
}
With:
qemu-system-x86_64 -m 5G \
-object memory-backend-ram,id=m0,size=4G -numa node,nodeid=0,memdev=m0 \
-object memory-backend-ram,id=m1,size=1G -numa node,nodeid=1,memdev=m1 \
-append "... memmap=0x40000000!0x140000000"
Voila - buddy-managed private anonymous memory (1G region)
No need to reinvent page_alloc.c or fault handling :]
This can be used to hammer on reclaim/compaction/whatever support
without needing any particular hardware setup, and in fact it gives
some memory devices a path to support in userland while standards
get worked out.
do_anonymous_page_node is a bit of a bodge right now but I just haven't
fleshed it out yet. The idea is - don't reinvent the fault path, just
provide the appropriate context to memory.c to do the right thing.
If this is acceptable, I imagine whatever interface gets implemented
will carry an in-tree driver export only, similar to hotplug/kmem.
> From 64aaff5f40497201ecc089c3339df6576184c433 Mon Sep 17 00:00:00 2001
> From: "David Hildenbrand (Arm)" <david@kernel.org>
> Date: Wed, 10 Jun 2026 20:55:49 +0200
> Subject: [PATCH] tmp
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
> include/linux/sched.h | 2 +-
> include/linux/sched/mm.h | 11 +++++++++++
> mm/mempolicy.c | 14 ++++++++++++--
> mm/page_alloc.c | 7 ++++++-
> 4 files changed, 30 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ee06cba5c6f5..9c850b7be6bf 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1778,7 +1778,7 @@ extern struct pid *cad_pid;
> * I am cleaning dirty pages from some other bdi. */
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
> -#define PF__HOLE__00800000 0x00800000
> +#define PF_MEMALLOC_FOLIO 0x00800000 /* Allocating a folio that can end up on private memory nodes */
> #define PF__HOLE__01000000 0x01000000
> #define PF__HOLE__02000000 0x02000000
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 95d0040df584..2101a447c084 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -471,6 +471,17 @@ static inline void memalloc_pin_restore(unsigned int flags)
> memalloc_flags_restore(flags);
> }
>
> +static inline unsigned int memalloc_folio_save(void)
> +{
> + return memalloc_flags_save(PF_MEMALLOC_FOLIO);
> +}
> +
> +static inline void memalloc_folio_restore(unsigned int flags)
> +{
> + memalloc_flags_restore(flags);
> +}
> +
> +
> #ifdef CONFIG_MEMCG
> DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> /**
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..a78b0e5a1fce 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2506,8 +2506,13 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
> struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
> struct mempolicy *pol, pgoff_t ilx, int nid)
> {
> - struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
> + struct page *page;
> + int flags;
> +
> + flags = memalloc_folio_save();
> + page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
> ilx, nid);
> + memalloc_folio_restore(flags);
> if (!page)
> return NULL;
>
> @@ -2588,7 +2593,12 @@ EXPORT_SYMBOL(alloc_pages_noprof);
>
> struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
> {
> - return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
> + struct folio *folio;
> + int flags;
> +
> + flags = memalloc_folio_save();
> + folio = page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
> + memalloc_folio_restore(flags);
> + return folio;
> }
> EXPORT_SYMBOL(folio_alloc_noprof);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ee902a468c2f..37434b37f7af 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5345,8 +5345,13 @@ EXPORT_SYMBOL(__alloc_pages_noprof);
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> nodemask_t *nodemask)
> {
> - struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> + struct page *page;
> + int flags;
> +
> + flags = memalloc_folio_save();
> + page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> preferred_nid, nodemask);
> + memalloc_folio_restore(flags);
> return page_rmappable_folio(page);
> }
> EXPORT_SYMBOL(__folio_alloc_noprof);
> --
> 2.43.0
>
>
> --
> Cheers,
>
> David
^ permalink raw reply
* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Mathieu Desnoyers @ 2026-06-10 20:05 UTC (permalink / raw)
To: Steven Rostedt, David Laight
Cc: Masami Hiramatsu (Google),
Markus Schneider-Pargmann (The Capable Hub), Heiko Carstens,
linux-kernel, linux-trace-kernel
In-Reply-To: <20260610155139.01b6def4@gandalf.local.home>
On 2026-06-10 15:51, Steven Rostedt wrote:
> On Wed, 10 Jun 2026 12:06:59 +0100
> David Laight <david.laight.linux@gmail.com> wrote:
>
>> So you only want __packed on structures that might be misaligned and those
>> that contain misaligned members.
>>
>> If the structure is only guaranteed to be 32bit aligned then use __packed
>> __aligned(4) so that two 32bit accesses get used instead of 8 8bit ones.
>>
>> -- David
>>
>>>
>>> Thank you,
>>>
>>>> Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
>>>> ---
>>>> kernel/trace/fprobe.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
>>>> index cc49ebd2a773..21751dcdb7b9 100644
>>>> --- a/kernel/trace/fprobe.c
>>>> +++ b/kernel/trace/fprobe.c
>>>> @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
>>>> struct __fprobe_header {
>>>> struct fprobe *fp;
>>>> unsigned long size_words;
>>>> -} __packed;
>>>> +};
>>>>
>
> Does "__packed" really do anything between a pointer and a long?
If that structure is allocated at a non-void-ptr-aligned address, the
packed attribute ensures that the compiler doesn't emit instructions
that require aligned loads/stores when accessing those fields.
It does not change the layout of the structure per se in this specific
case, but it informs the compiler about the lack of guarantees about
alignment for the entire structure.
x86 (32- and 64-bit) could not care less about this, but it's relevant
on other architectures.
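To make the point above concrete, here is a minimal userspace sketch (not
kernel code). It assumes GCC or Clang on a Linux ABI where a pointer and an
unsigned long have the same size. The struct names hdr_plain, hdr_packed and
hdr_packed_a4 and the helper roundtrip_misaligned() are made up for
illustration; only the two members mirror __fprobe_header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct fprobe;	/* opaque here; only the pointer size matters */

struct hdr_plain {
	struct fprobe *fp;
	unsigned long size_words;
};

struct hdr_packed {
	struct fprobe *fp;
	unsigned long size_words;
} __attribute__((packed));

/* Packed, but the object is promised to be at least 4-byte aligned. */
struct hdr_packed_a4 {
	struct fprobe *fp;
	unsigned long size_words;
} __attribute__((packed, aligned(4)));

/* The layout is identical: there is no padding for packed to remove. */
static_assert(sizeof(struct hdr_packed) == sizeof(struct hdr_plain),
	      "packed does not change the size here");
static_assert(offsetof(struct hdr_packed, size_words) ==
	      offsetof(struct hdr_plain, size_words),
	      "packed does not move size_words here");

/* What does change is the alignment the compiler may assume. */
static_assert(_Alignof(struct hdr_packed) == 1,
	      "packed drops the assumed alignment to 1");
static_assert(_Alignof(struct hdr_packed_a4) == 4,
	      "aligned(4) raises the assumed alignment back to 4");

/* Store and reload a header at a deliberately odd address. */
static unsigned long roundtrip_misaligned(unsigned long val)
{
	unsigned char *buf = malloc(sizeof(struct hdr_packed) + 1);
	struct hdr_packed *h;
	unsigned long out;

	if (!buf)
		return 0;

	h = (struct hdr_packed *)(buf + 1);	/* odd, hence misaligned */
	h->fp = NULL;
	h->size_words = val;
	out = h->size_words;

	free(buf);
	return out;
}
```

Doing the same through a struct hdr_plain pointer would let the compiler
assume natural alignment, which is where strict-alignment architectures
can fault.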
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply
* [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: JP Kobryn @ 2026-06-10 19:52 UTC (permalink / raw)
To: linux-mm, willy, shakeel.butt, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng, baohua,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park
Cc: linux-kernel, linux-trace-kernel
LRU add batches can be drained before they reach capacity. This can be a
source of LRU lock contention, but it is not currently possible to
attribute these drains to callers with existing tracepoints.
Add mm_lru_add_drain to report the CPU and lru_add batch count when an
lru_add batch is drained. This allows tracing to distinguish full drains
from partial drains and attribute them to the calling stack.
Add mm_lru_add_drain_all to capture callers of __lru_add_drain_all and
whether they set the force flag for all CPUs. The tracepoint resembles
the signature of the enclosing function, but is needed because of
potential inlining.
Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
---
include/trace/events/pagemap.h | 37 ++++++++++++++++++++++++++++++++++
mm/swap.c | 7 ++++++-
2 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
index 171524d3526d..ff3da07ccb40 100644
--- a/include/trace/events/pagemap.h
+++ b/include/trace/events/pagemap.h
@@ -77,6 +77,43 @@ TRACE_EVENT(mm_lru_activate,
TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
);
+TRACE_EVENT(mm_lru_add_drain,
+
+ TP_PROTO(int cpu, unsigned int nr),
+
+ TP_ARGS(cpu, nr),
+
+ TP_STRUCT__entry(
+ __field(int, cpu )
+ __field(unsigned int, nr )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->nr = nr;
+ ),
+
+ TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
+);
+
+TRACE_EVENT(mm_lru_add_drain_all,
+
+ TP_PROTO(bool force_all_cpus),
+
+ TP_ARGS(force_all_cpus),
+
+ TP_STRUCT__entry(
+ __field(bool, force_all_cpus )
+ ),
+
+ TP_fast_assign(
+ __entry->force_all_cpus = force_all_cpus;
+ ),
+
+ TP_printk("force_all_cpus=%s",
+ __entry->force_all_cpus ? "true" : "false")
+);
+
#endif /* _TRACE_PAGEMAP_H */
/* This part must be outside protection */
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..e14b7612f896 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
{
struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
struct folio_batch *fbatch = &fbatches->lru_add;
+ unsigned int nr_folios_add = folio_batch_count(fbatch);
- if (folio_batch_count(fbatch))
+ if (nr_folios_add) {
folio_batch_move_lru(fbatch, lru_add);
+ trace_mm_lru_add_drain(cpu, nr_folios_add);
+ }
fbatch = &fbatches->lru_move_tail;
/* Disabling interrupts below acts as a compiler barrier. */
@@ -869,6 +872,8 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
if (WARN_ON(!mm_percpu_wq))
return;
+ trace_mm_lru_add_drain_all(force_all_cpus);
+
/*
* Guarantee folio_batch counter stores visible by this CPU
* are visible to other CPUs before loading the current drain
--
2.54.0
^ permalink raw reply related
* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Steven Rostedt @ 2026-06-10 19:51 UTC (permalink / raw)
To: David Laight
Cc: Masami Hiramatsu (Google),
Markus Schneider-Pargmann (The Capable Hub), Mathieu Desnoyers,
Heiko Carstens, linux-kernel, linux-trace-kernel
In-Reply-To: <20260610120659.7c61cfa6@pumpkin>
On Wed, 10 Jun 2026 12:06:59 +0100
David Laight <david.laight.linux@gmail.com> wrote:
> So you only want __packed on structures that might be misaligned and those
> that contain misaligned members.
>
> If the structure is only guaranteed to be 32bit aligned then use __packed
> __aligned(4) so that two 32bit accesses get used instead of 8 8bit ones.
>
> -- David
>
> >
> > Thank you,
> >
> > > Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
> > > ---
> > > kernel/trace/fprobe.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> > > index cc49ebd2a773..21751dcdb7b9 100644
> > > --- a/kernel/trace/fprobe.c
> > > +++ b/kernel/trace/fprobe.c
> > > @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
> > > struct __fprobe_header {
> > > struct fprobe *fp;
> > > unsigned long size_words;
> > > -} __packed;
> > > +};
> > >
Does "__packed" really do anything between a pointer and a long?
-- Steve
^ permalink raw reply
* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Crystal Wood @ 2026-06-10 19:51 UTC (permalink / raw)
To: Valentin Schneider, linux-kernel, linux-trace-kernel
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Tomas Glozar,
Costa Shulyupin, Ivan Pravdin
In-Reply-To: <20260610130457.1304245-2-vschneid@redhat.com>
On Wed, 2026-06-10 at 15:04 +0200, Valentin Schneider wrote:
> Osnoise already implicitly accounts for IPIs via its IRQ tracking,
Does it? It seems that IPIs bypass the kernel/irq subsystem on some
arches (including x86, but not ARM).
It would be nice to solve this properly by adding generic ipi
entry/exit tracing (similar to what ARM already has).
> however it
> can be interesting to distinguish between the two: undesired IPIs usually
> imply a software configuration issue (e.g. wrong/incomplete CPU isolation)
> whereas undesired (non-IPI) IRQs usually imply a hardware configuration
> issue.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> ---
> Note that this is modifying the osnoise:osnoise_entry Ftrace entry; I know
> trace events are sort of supposed to be stable, but I'm not sure about
> ftrace entries.
I think old rtla will be OK with this since it looks up fields by name
rather than assuming a fixed layout.
> Alternatively I can have this be purely supported in userspace osnoise by
> hooking into the IPI events and counting IPIs separately from the osnoise
> events.
One benefit I could see of doing this in kernel osnoise would be if you
could atomically correlate the count with the particular noise
interval, but this patch doesn't do that.
> +static void ipi_emission(struct osnoise_variables *osn_var, unsigned int dst_cpu)
> +{
> + if (!osn_var->sampling)
> + return;
> +
> + osn_var->ipi.count++;
> +}
> +
> +static void trace_ipi_send_cpu_callback(void *data, unsigned int cpu,
> + unsigned long callsite, void *callback)
> +{
> + struct osnoise_variables *osn_var;
> +
> + osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
> + ipi_emission(osn_var, cpu);
> +}
> +
> +static void trace_ipi_send_cpumask_callback(void *data, const struct cpumask *cpumask,
> + unsigned long callsite, void *callback)
> +{
> + struct osnoise_variables *osn_var;
> + int cpu;
> +
> + for_each_cpu_and(cpu, cpumask, &osnoise_cpumask) {
> + osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
> + ipi_emission(osn_var, cpu);
> + }
> +}
Isn't this racy to do from a different CPU? Both in terms of the
counter, and the timing of the increment relative to when the IPI is
actually received. Not necessarily a huge deal if you only care about
zero versus bignum, but still. At least worth a comment, if we go with
this approach.
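As a userspace analogy for the counter half of that concern (not osnoise
code), the sketch below has several sender threads bump a counter that
belongs to someone else. All names (ipi_stats, sender, count_remote_sends,
NR_SENDERS) are hypothetical. With a plain `count++` this would be a data
race and updates could be lost; a relaxed atomic add avoids that. It says
nothing about the timing half, i.e. when the IPI is actually received.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_SENDERS 4

struct ipi_stats {
	atomic_ulong count;	/* hypothetical stand-in for a per-CPU IPI count */
};

struct sender_arg {
	struct ipi_stats *target;	/* counter owned by another "CPU" */
	unsigned long nr_sends;
};

/* Each sender increments the remote counter once per simulated IPI. */
static void *sender(void *p)
{
	struct sender_arg *arg = p;
	unsigned long i;

	for (i = 0; i < arg->nr_sends; i++)
		atomic_fetch_add_explicit(&arg->target->count, 1,
					  memory_order_relaxed);
	return NULL;
}

/* Run NR_SENDERS concurrent senders and return the final count. */
static unsigned long count_remote_sends(unsigned long nr_sends)
{
	struct ipi_stats stats;
	struct sender_arg arg = { .target = &stats, .nr_sends = nr_sends };
	pthread_t tid[NR_SENDERS];
	int i, started = 0;

	atomic_init(&stats.count, 0);

	for (i = 0; i < NR_SENDERS; i++) {
		if (pthread_create(&tid[i], NULL, sender, &arg))
			break;
		started++;
	}
	for (i = 0; i < started; i++)
		pthread_join(tid[i], NULL);

	return atomic_load(&stats.count);
}
```

Whether an atomic is worth its cost on the IPI send path is a separate
question; if only "zero versus bignum" matters, a comment documenting the
race may be enough.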
-Crystal
^ permalink raw reply
* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: Shakeel Butt @ 2026-06-10 19:38 UTC (permalink / raw)
To: JP Kobryn
Cc: Barry Song, linux-mm, willy, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <ffdf9b07-487b-4668-a91b-27b0ab29c70d@linux.dev>
On Wed, Jun 10, 2026 at 12:20:19PM -0700, JP Kobryn wrote:
> On 6/10/26 11:54 AM, JP Kobryn wrote:
> > On 6/9/26 6:21 PM, Shakeel Butt wrote:
> >> On Tue, Jun 09, 2026 at 05:16:15PM -0700, JP Kobryn wrote:
> >>> On 6/9/26 5:07 PM, JP Kobryn wrote:
> >>>> On 6/9/26 12:44 AM, Barry Song wrote:
> >>>>> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
> >>>>>>
[...]
> >>>>> Do you need tracing on each CPU individually, or is tracing the
> >>>>> entire __lru_add_drain_all() invocation sufficient?
> >>>>
> >>>> I think the latter would be fine. The remote work will invoke the
> >>>> mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
> >>>> the event already has the CPU, we could see where queued drains actually
> >>>> ran.
> >>>
> >>> Actually if it's just a single invocation and the only event data is the
> >>> force flag, a tracepoint may not even be needed. Other probes can be
> >>> installed on function invocation and read the single argument. I can
> >>> drop this from v2 and keep the single mm_lru_add_drain tracepoint.
> >>
> >> No we do want to trace the callers requesting to drain from all the CPUs. If you
> >> trace just lru_add_drain_cpu() then you will only see that the drain is
> >> requested for a given CPU but no information on the requester.
> >>
> >> Also as Barry said, I think single trace for whole __lru_add_drain_all() is good
> >> enough.
> >
> > Right, but couldn't that already be done with fentry or kprobe? If we
> > only need the calling stack and the argument value of force_all_cpus I
> > don't see a strong need for a dedicated tracepoint.
>
> Never mind that. I see it's declared inline, so I'll add a tracepoint and
> send v3.
Thanks. BTW, even without the inline keyword, the compiler can still decide
to inline a function, so kprobe/fentry is not always reliable.
>
^ permalink raw reply
* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: JP Kobryn @ 2026-06-10 19:20 UTC (permalink / raw)
To: Shakeel Butt
Cc: Barry Song, linux-mm, willy, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <6e7739cc-327e-4ca9-86f4-17729b624632@linux.dev>
On 6/10/26 11:54 AM, JP Kobryn wrote:
> On 6/9/26 6:21 PM, Shakeel Butt wrote:
>> On Tue, Jun 09, 2026 at 05:16:15PM -0700, JP Kobryn wrote:
>>> On 6/9/26 5:07 PM, JP Kobryn wrote:
>>>> On 6/9/26 12:44 AM, Barry Song wrote:
>>>>> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
>>>>>>
>>>>>> LRU add batches can be drained before they reach capacity. This can be a
>>>>>> source of LRU lock contention, but it is not currently possible to
>>>>>> attribute these drains to callers with existing tracepoints.
>>>>>>
>>>>>> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
>>>>>> lru_add batch is drained. This allows tracing to distinguish full drains
>>>>>> from partial drains and attribute them to the calling stack.
>>>>>>
>>>>>> Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
>>>>>> per-CPU drain work. This captures the requester stack and target CPU for
>>>>>> remote drain work. The event is named as a drain-all queue event because
>>>>>> the queued work can be needed for batches other than lru_add.
>>>>>>
>>>>>> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
>>>>>> ---
>>>>>> include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
>>>>>> mm/swap.c | 6 ++++-
>>>>>> 2 files changed, 45 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
>>>>>> index 171524d3526d..ea8fc46bedb0 100644
>>>>>> --- a/include/trace/events/pagemap.h
>>>>>> +++ b/include/trace/events/pagemap.h
>>>>>> @@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
>>>>>> TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>>>>>> );
>>>>>>
>>>>>> +TRACE_EVENT(mm_lru_add_drain,
>>>>>> +
>>>>>> + TP_PROTO(int cpu, unsigned int nr),
>>>>>> +
>>>>>> + TP_ARGS(cpu, nr),
>>>>>> +
>>>>>> + TP_STRUCT__entry(
>>>>>> + __field(int, cpu )
>>>>>> + __field(unsigned int, nr )
>>>>>> + ),
>>>>>> +
>>>>>> + TP_fast_assign(
>>>>>> + __entry->cpu = cpu;
>>>>>> + __entry->nr = nr;
>>>>>> + ),
>>>>>> +
>>>>>> + TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
>>>>>> +);
>>>>>> +
>>>>>> +TRACE_EVENT(mm_lru_drain_all_queue,
>>>>>> +
>>>>>> + TP_PROTO(int target_cpu, bool force_all_cpus),
>>>>>> +
>>>>>> + TP_ARGS(target_cpu, force_all_cpus),
>>>>>> +
>>>>>> + TP_STRUCT__entry(
>>>>>> + __field(int, target_cpu )
>>>>>> + __field(bool, force_all_cpus )
>>>>>> + ),
>>>>>> +
>>>>>> + TP_fast_assign(
>>>>>> + __entry->target_cpu = target_cpu;
>>>>>> + __entry->force_all_cpus = force_all_cpus;
>>>>>> + ),
>>>>>> +
>>>>>> + TP_printk("target_cpu=%d force_all_cpus=%s",
>>>>>> + __entry->target_cpu,
>>>>>> + __entry->force_all_cpus ? "true" : "false")
>>>>>> +);
>>>>>> +
>>>>>> #endif /* _TRACE_PAGEMAP_H */
>>>>>>
>>>>>> /* This part must be outside protection */
>>>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>>>> index 588f50d8f1a8..c385b93582eb 100644
>>>>>> --- a/mm/swap.c
>>>>>> +++ b/mm/swap.c
>>>>>> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>>>>>> {
>>>>>> struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>>>>>> struct folio_batch *fbatch = &fbatches->lru_add;
>>>>>> + unsigned int nr_folios_add = folio_batch_count(fbatch);
>>>>>>
>>>>>> - if (folio_batch_count(fbatch))
>>>>>> + if (nr_folios_add) {
>>>>>> folio_batch_move_lru(fbatch, lru_add);
>>>>>> + trace_mm_lru_add_drain(cpu, nr_folios_add);
>>>>>> + }
>>>>>>
>>>>>> fbatch = &fbatches->lru_move_tail;
>>>>>> /* Disabling interrupts below acts as a compiler barrier. */
>>>>>> @@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>>>>>> if (cpu_needs_drain(cpu)) {
>>>>>> INIT_WORK(work, lru_add_drain_per_cpu);
>>>>>> queue_work_on(cpu, mm_percpu_wq, work);
>>>>>> + trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
>>>>>
>>>>> Do you need tracing on each CPU individually, or is tracing the
>>>>> entire __lru_add_drain_all() invocation sufficient?
>>>>
>>>> I think the latter would be fine. The remote work will invoke the
>>>> mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
>>>> the event already has the CPU, we could see where queued drains actually
>>>> ran.
>>>
>>> Actually if it's just a single invocation and the only event data is the
>>> force flag, a tracepoint may not even be needed. Other probes can be
>>> installed on function invocation and read the single argument. I can
>>> drop this from v2 and keep the single mm_lru_add_drain tracepoint.
>>
>> No we do want to trace the callers requesting to drain from all the CPUs. If you
>> trace just lru_add_drain_cpu() then you will only see that the drain is
>> requested for a given CPU but no information on the requester.
>>
>> Also as Barry said, I think single trace for whole __lru_add_drain_all() is good
>> enough.
>
> Right, but couldn't that already be done with fentry or kprobe? If we
> only need the calling stack and the argument value of force_all_cpus I
> don't see a strong need for a dedicated tracepoint.
Never mind that. I see it's declared inline, so I'll add a tracepoint and
send v3.
^ permalink raw reply
* Re: [PATCH] tracing: reject invalid preemptirq_delay_test CPU affinity
From: Samuel Moelius @ 2026-06-10 19:00 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, open list:TRACING,
open list:TRACING
In-Reply-To: <20260609181617.185f1e02@fedora>
On Tue, Jun 9, 2026 at 6:16 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 5 Jun 2026 00:40:06 +0000
> Samuel Moelius <sam.moelius@trailofbits.com> wrote:
>
> > preemptirq_delay_test accepts cpu_affinity as a module parameter and,
> > when it is non-negative, writes that CPU directly into a temporary
> > cpumask from the worker thread. Values outside nr_cpu_ids can set a
> > bit outside the allocated cpumask before the test reports a normal
> > affinity error.
> >
> > Validate the requested CPU before starting the worker thread, and
> > return -EINVAL for invalid affinity requests.
> >
> > Assisted-by: Codex:gpt-5.5-cyber-preview
> > Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
> > ---
> > kernel/trace/preemptirq_delay_test.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c
> > index acb0c971a408..0f017799754a 100644
> > --- a/kernel/trace/preemptirq_delay_test.c
> > +++ b/kernel/trace/preemptirq_delay_test.c
> > @@ -14,6 +14,7 @@
> > #include <linux/kthread.h>
> > #include <linux/module.h>
> > #include <linux/printk.h>
> > +#include <linux/cpumask.h>
> > #include <linux/string.h>
> > #include <linux/sysfs.h>
> > #include <linux/completion.h>
> > @@ -152,6 +153,15 @@ static int preemptirq_run_test(void)
> > struct task_struct *task;
> > char task_name[50];
> >
> > + if (cpu_affinity > -1) {
> > + unsigned int cpu = cpu_affinity;
> > +
> > + if (cpu >= nr_cpu_ids || !cpu_possible(cpu)) {
> > + pr_err("cpu_affinity:%d, invalid CPU\n", cpu_affinity);
> > + return -EINVAL;
>
> Just add the check to the preemptirq_delay_run() function where it
> tests affinity. Who cares if it created the thread or not. It's just a
> test.
I am getting ready to travel and I will address this when I return in
about two weeks. Thank you for understanding.
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm) @ 2026-06-10 18:59 UTC (permalink / raw)
To: Gregory Price
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <aimSzvoJDrpeQsmM@gourry-fedora-PF4VCD3F>
On 6/10/26 18:37, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/10/26 12:41, Gregory Price wrote:
>>>
>>> Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
>>> which causes spillage into private nodes because slub allows private
>>> nodes in its mask. I think this is fixable.
>>>
>>> I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
>>> code, etc), but it seems like fully dropping the FALLBACK entries and
>>> requiring __GFP_THISNODE might be sufficient.
>>
>> Sorry, I haven't been able to follow up so far, and not sure if that's what you
>> are discussing here ...
>>
>> After the LSF/MM session, I was wondering, whether if we focus on allowing only
>> folios allocations to end up on private memory nodes for now: could the
>> __GFP_THISNODE approach work there?
>>
>> Essentially, disallow any allocations on non-folio paths, and allow folio
>> allocation only with __GFP_THISNODE set.
>>
>> I have to find time to read the other mails in this thread, on my todo list.
>>
>> So sorry if that is precisely what is being discussed here.
>>
>
> So, I remember this being asked, and I didn't fully grok the request.
>
> I'm still not sure I fully understand the question, so apologies if I'm
> answering the wrong thing here.
>
> I understand this question in two ways:
>
> 1) Can we disallow PAGE allocation and limit this to FOLIO allocation
Yes: can we allow only folios to be allocated from private memory nodes? Let
me reply to that one below.
> 2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.
>
>
> 1) Can we disallow page allocation and limit this to folios?
>
> No, I don't think so.
>
> Folio allocations are written in terms of page allocations, we would
> have to rewrite folio allocation interfaces and introduce a bunch of
> boilerplate for the sake of this.
>
> struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> int preferred_nid, nodemask_t *nodemask)
> {
> struct page *page;
>
> page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
> if (page)
> set_page_refcounted(page);
> return page;
> }
>
> struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
> nodemask_t *nodemask)
> {
> struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
> preferred_nid, nodemask);
> return page_rmappable_folio(page);
> }
At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
context might be better. I think there was also talk about how the memalloc_*
interface might be a better way forward. Maybe we would start giving the
allocator more context ("we are allocating a folio").
The following is incomplete (especially the hugetlb parts, I assume); it is
just meant to sketch the idea:
From 64aaff5f40497201ecc089c3339df6576184c433 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Wed, 10 Jun 2026 20:55:49 +0200
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
include/linux/sched.h | 2 +-
include/linux/sched/mm.h | 11 +++++++++++
mm/mempolicy.c | 14 ++++++++++++--
mm/page_alloc.c | 7 ++++++-
4 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..9c850b7be6bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1778,7 +1778,7 @@ extern struct pid *cad_pid;
* I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
-#define PF__HOLE__00800000 0x00800000
+#define PF_MEMALLOC_FOLIO 0x00800000 /* Allocating a folio that can end up on private memory nodes */
#define PF__HOLE__01000000 0x01000000
#define PF__HOLE__02000000 0x02000000
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 95d0040df584..2101a447c084 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -471,6 +471,17 @@ static inline void memalloc_pin_restore(unsigned int flags)
memalloc_flags_restore(flags);
}
+static inline unsigned int memalloc_folio_save(void)
+{
+ return memalloc_flags_save(PF_MEMALLOC_FOLIO);
+}
+
+static inline void memalloc_folio_restore(unsigned int flags)
+{
+ memalloc_flags_restore(flags);
+}
+
+
#ifdef CONFIG_MEMCG
DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
/**
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36699fabd3c2..a78b0e5a1fce 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2506,8 +2506,13 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *pol, pgoff_t ilx, int nid)
{
- struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
+ struct page *page;
+ int flags;
+
+ flags = memalloc_folio_save();
+ page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
ilx, nid);
+ memalloc_folio_restore(flags);
if (!page)
return NULL;
@@ -2588,7 +2593,12 @@ EXPORT_SYMBOL(alloc_pages_noprof);
struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
{
- return page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
+ struct folio *folio;
+ int flags;
+
+ flags = memalloc_folio_save();
+ folio = page_rmappable_folio(alloc_pages_noprof(gfp | __GFP_COMP, order));
+ memalloc_folio_restore(flags);
+ return folio;
}
EXPORT_SYMBOL(folio_alloc_noprof);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee902a468c2f..37434b37f7af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5345,8 +5345,13 @@ EXPORT_SYMBOL(__alloc_pages_noprof);
struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask)
{
- struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
+ struct page *page;
+ int flags;
+
+ flags = memalloc_folio_save();
+ page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
preferred_nid, nodemask);
+ memalloc_folio_restore(flags);
return page_rmappable_folio(page);
}
EXPORT_SYMBOL(__folio_alloc_noprof);
--
2.43.0
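For readers unfamiliar with the scoped save/restore pattern used in the patch
above, here is a small userspace model of it. This is a sketch under
assumptions, not the kernel implementation: task_flags stands in for
current->flags, and flags_save(), flags_restore(), folio_context() and
nested_scopes_ok() are hypothetical names. The semantics are modeled on my
understanding of memalloc_flags_save()/memalloc_flags_restore(): save
returns only the bits it newly set, and restore clears exactly those.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_MEMALLOC_FOLIO 0x00800000u	/* flag value taken from the patch */

static unsigned int task_flags;	/* hypothetical stand-in for current->flags */

/* Set @flags and return only the bits that were not set before. */
static unsigned int flags_save(unsigned int flags)
{
	unsigned int newly_set = ~task_flags & flags;

	task_flags |= flags;
	return newly_set;
}

/* Clear exactly the bits that the matching flags_save() set. */
static void flags_restore(unsigned int saved)
{
	task_flags &= ~saved;
}

/* What an allocator-side check might look like (an assumption). */
static bool folio_context(void)
{
	return (task_flags & PF_MEMALLOC_FOLIO) != 0;
}

/* Nested scopes: the inner restore must not end the outer scope. */
static bool nested_scopes_ok(void)
{
	unsigned int outer, inner;
	bool ok = !folio_context();

	outer = flags_save(PF_MEMALLOC_FOLIO);
	ok = ok && folio_context();

	inner = flags_save(PF_MEMALLOC_FOLIO);	/* already set: saves 0 */
	flags_restore(inner);
	ok = ok && folio_context();		/* still inside outer scope */

	flags_restore(outer);
	return ok && !folio_context();
}
```

The nesting case matters here because folio_alloc_noprof() can end up in
__folio_alloc_noprof(), so both wrappers in the patch may be active at once.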
--
Cheers,
David
^ permalink raw reply related
* Re: [PATCH] mm/lruvec: trace LRU add drains and drain-all queuing
From: JP Kobryn @ 2026-06-10 18:54 UTC (permalink / raw)
To: Shakeel Butt
Cc: Barry Song, linux-mm, willy, usama.arif, akpm, vbabka, mhocko,
rostedt, mhiramat, mathieu.desnoyers, kasong, qi.zheng,
axelrasmussen, yuanchu, weixugc, chrisl, shikemeng, nphamcs,
baoquan.he, youngjun.park, linux-kernel, linux-trace-kernel
In-Reply-To: <aii7WWFmAyoXn9rk@linux.dev>
On 6/9/26 6:21 PM, Shakeel Butt wrote:
> On Tue, Jun 09, 2026 at 05:16:15PM -0700, JP Kobryn wrote:
>> On 6/9/26 5:07 PM, JP Kobryn wrote:
>>> On 6/9/26 12:44 AM, Barry Song wrote:
>>>> On Tue, Jun 9, 2026 at 12:12 PM JP Kobryn <jp.kobryn@linux.dev> wrote:
>>>>>
>>>>> LRU add batches can be drained before they reach capacity. This can be a
>>>>> source of LRU lock contention, but it is not currently possible to
>>>>> attribute these drains to callers with existing tracepoints.
>>>>>
>>>>> Add mm_lru_add_drain to report the CPU and lru_add batch count when an
>>>>> lru_add batch is drained. This allows tracing to distinguish full drains
>>>>> from partial drains and attribute them to the calling stack.
>>>>>
>>>>> Add mm_lru_drain_all_queue to report when lru_add_drain_all() queues
>>>>> per-CPU drain work. This captures the requester stack and target CPU for
>>>>> remote drain work. The event is named as a drain-all queue event because
>>>>> the queued work can be needed for batches other than lru_add.
>>>>>
>>>>> Signed-off-by: JP Kobryn <jp.kobryn@linux.dev>
>>>>> ---
>>>>> include/trace/events/pagemap.h | 40 ++++++++++++++++++++++++++++++++++
>>>>> mm/swap.c | 6 ++++-
>>>>> 2 files changed, 45 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h
>>>>> index 171524d3526d..ea8fc46bedb0 100644
>>>>> --- a/include/trace/events/pagemap.h
>>>>> +++ b/include/trace/events/pagemap.h
>>>>> @@ -77,6 +77,46 @@ TRACE_EVENT(mm_lru_activate,
>>>>> TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn)
>>>>> );
>>>>>
>>>>> +TRACE_EVENT(mm_lru_add_drain,
>>>>> +
>>>>> + TP_PROTO(int cpu, unsigned int nr),
>>>>> +
>>>>> + TP_ARGS(cpu, nr),
>>>>> +
>>>>> + TP_STRUCT__entry(
>>>>> + __field(int, cpu )
>>>>> + __field(unsigned int, nr )
>>>>> + ),
>>>>> +
>>>>> + TP_fast_assign(
>>>>> + __entry->cpu = cpu;
>>>>> + __entry->nr = nr;
>>>>> + ),
>>>>> +
>>>>> + TP_printk("cpu=%d nr=%u", __entry->cpu, __entry->nr)
>>>>> +);
>>>>> +
>>>>> +TRACE_EVENT(mm_lru_drain_all_queue,
>>>>> +
>>>>> + TP_PROTO(int target_cpu, bool force_all_cpus),
>>>>> +
>>>>> + TP_ARGS(target_cpu, force_all_cpus),
>>>>> +
>>>>> + TP_STRUCT__entry(
>>>>> + __field(int, target_cpu )
>>>>> + __field(bool, force_all_cpus )
>>>>> + ),
>>>>> +
>>>>> + TP_fast_assign(
>>>>> + __entry->target_cpu = target_cpu;
>>>>> + __entry->force_all_cpus = force_all_cpus;
>>>>> + ),
>>>>> +
>>>>> + TP_printk("target_cpu=%d force_all_cpus=%s",
>>>>> + __entry->target_cpu,
>>>>> + __entry->force_all_cpus ? "true" : "false")
>>>>> +);
>>>>> +
>>>>> #endif /* _TRACE_PAGEMAP_H */
>>>>>
>>>>> /* This part must be outside protection */
>>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>>> index 588f50d8f1a8..c385b93582eb 100644
>>>>> --- a/mm/swap.c
>>>>> +++ b/mm/swap.c
>>>>> @@ -694,9 +694,12 @@ void lru_add_drain_cpu(int cpu)
>>>>> {
>>>>> struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>>>>> struct folio_batch *fbatch = &fbatches->lru_add;
>>>>> + unsigned int nr_folios_add = folio_batch_count(fbatch);
>>>>>
>>>>> - if (folio_batch_count(fbatch))
>>>>> + if (nr_folios_add) {
>>>>> folio_batch_move_lru(fbatch, lru_add);
>>>>> + trace_mm_lru_add_drain(cpu, nr_folios_add);
>>>>> + }
>>>>>
>>>>> fbatch = &fbatches->lru_move_tail;
>>>>> /* Disabling interrupts below acts as a compiler barrier. */
>>>>> @@ -928,6 +931,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
>>>>> if (cpu_needs_drain(cpu)) {
>>>>> INIT_WORK(work, lru_add_drain_per_cpu);
>>>>> queue_work_on(cpu, mm_percpu_wq, work);
>>>>> + trace_mm_lru_drain_all_queue(cpu, force_all_cpus);
>>>>
>>>> Do you need tracing on each CPU individually, or is tracing the
>>>> entire __lru_add_drain_all() invocation sufficient?
>>>
>>> I think the latter would be fine. The remote work will invoke the
>>> mm_lru_add_drain tracepoint, which will show up as kworker stacks. Since
>>> the event already has the CPU, we could see where queued drains actually
>>> ran.
>>
>> Actually if it's just a single invocation and the only event data is the
>> force flag, a tracepoint may not even be needed. Other probes can be
>> installed on function invocation and read the single argument. I can
>> drop this from v2 and keep the single mm_lru_add_drain tracepoint.
>
> No we do want to trace the callers requesting to drain from all the CPUs. If you
> trace just lru_add_drain_cpu() then you will only see that the drain is
> requested for a given CPU but no information on the requester.
>
> Also as Barry said, I think single trace for whole __lru_add_drain_all() is good
> enough.
Right, but couldn't that already be done with fentry or kprobe? If we
only need the calling stack and the argument value of force_all_cpus I
don't see a strong need for a dedicated tracepoint.
* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Andrii Nakryiko @ 2026-06-10 18:02 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <aikd6-5HYhCnc0Ze@krava>
On Wed, Jun 10, 2026 at 1:18 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Jun 09, 2026 at 09:43:15AM -0700, Andrii Nakryiko wrote:
> > On Tue, Jun 9, 2026 at 4:44 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Mon, Jun 08, 2026 at 01:46:39PM -0700, Andrii Nakryiko wrote:
> > > > On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <jolsa@kernel.org> wrote:
> > > > >
> > > > > Andrii reported an issue with optimized uprobes [1] that can clobber
> > > > > redzone area with call instruction storing return address on stack
> > > > > where user code may keep temporary data without adjusting rsp.
> > > > >
> > > > > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > > > > instruction, so we can squeeze another instruction to escape the
> > > > > redzone area before doing the call, like:
> > > > >
> > > > > lea -0x80(%rsp), %rsp
> > > > > call tramp
> > > > >
> > > > > Note the lea instruction is used to adjust the rsp register without
> > > > > changing the flags.
> > > > >
> > > > > We use nop10 and following transformation to optimized instructions
> > > > > above and back as suggested by Peterz [2].
> > > > >
> > > > > Optimize path (int3_update_optimize):
> > > > >
> > > > > 1) Initial state after set_swbp() installed the uprobe:
> > > > > cc 2e 0f 1f 84 00 00 00 00 00
> > > > >
> > > > > From offset 0 this is INT3 followed by the tail of the original
> > > > > 10-byte NOP.
> > > > >
> > > > > After a previous unoptimization bytes 5..9 may still contain the
> > > > > old call instruction, which remains valid for threads already there.
> > > > >
> > > > > 2) Rewrite the LEA tail and call displacement:
> > > > > cc [8d 64 24 80 e8 d0 d1 d2 d3]
> > > > >
> > > > > From offset 0 this traps on the uprobe INT3. Bytes 1..9 are not
> > > > > executable entry points while byte 0 is trapped.
> > > > >
> > > > > 3) Publish the first LEA byte:
> > > > > [48] 8d 64 24 80 e8 d0 d1 d2 d3
> > > > >
> > > > > From offset 0 this is:
> > > > > lea -0x80(%rsp), %rsp
> > > > > call <uprobe-trampoline>
> > > > >
> > > > > Unoptimize path (int3_update_unoptimize):
> > > > >
> > > > > 1) Initial optimized state:
> > > > > 48 8d 64 24 80 e8 d0 d1 d2 d3
> > > > > Same as 3) above.
> > > > >
> > > > > 2) Trap new entries before restoring the NOP bytes:
> > > > > [cc] 8d 64 24 80 e8 d0 d1 d2 d3
> > > > >
> > > > > From offset 0 this traps. A thread that had already executed the
> > > > > LEA can still reach the intact CALL at offset 5.
> > > > >
> > > > > 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > > and byte 5 as CALL.
> > > > > cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > > >
> > > > > From offset 0 this still traps. Offset 5 is still the CALL for any
> > > > > thread that was already past the first LEA byte.
> > > > >
> > > > > 4) Publish the first byte of the original NOP:
> > > > > [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
> > > > >
> > > > > From offset 0 this is the restored 10-byte NOP; the CALL opcode and
> > > > > displacement are now only NOP operands. Offset 5 still decodes as
> > > > > CALL for a thread that was already there.
> > > > >
> > > > > > There is only a single target uprobe-trampoline for the given nop10
> > > > > instruction address, so the CALL instruction will not be changed across
> > > > > unoptimization/optimization cycles.
> > > > > Therefore, any task that is preempted at the CALL instruction is guaranteed
> > > > > to observe that CALL and not anything else.
> > > > >
> > > > > Note as explained in [2] we need to use following nop10:
> > > > > PF1 PF2 ESC NOPL MOD SIB DISP32
> > > > > NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
> > > > >
> > > > > which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> > > > > attribute in is_prefix_bad function.
> > > > >
> > > > > Also changing the uprobe syscall error when called out of uprobe
> > > > > trampoline to -EPROTO, so we are able to detect the fixed kernel.
> > > > >
> > > > > The optimized uprobe performance stays the same:
> > > > >
> > > > > uprobe-nop : 3.129 ± 0.013M/s
> > > > > uprobe-push : 3.045 ± 0.006M/s
> > > > > uprobe-ret : 1.095 ± 0.004M/s
> > > > > --> uprobe-nop10 : 7.170 ± 0.020M/s
> > > > > uretprobe-nop : 2.143 ± 0.021M/s
> > > > > uretprobe-push : 2.090 ± 0.000M/s
> > > > > uretprobe-ret : 0.942 ± 0.000M/s
> > > > > --> uretprobe-nop10: 3.381 ± 0.003M/s
> > > > > usdt-nop : 3.245 ± 0.004M/s
> > > > > --> usdt-nop10 : 7.256 ± 0.023M/s
> > > > >
> > > > > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > > > [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> > > > > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > > > > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > > > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > > > > Assisted-by: Codex:GPT-5.5
> > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > > ---
> > > > > arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
> > > > > 1 file changed, 190 insertions(+), 65 deletions(-)
> > > > >
> > > >
> > > > [...]
> > > >
> > > > > @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > > > > smp_text_poke_sync_each_cpu();
> > > > >
> > > > > /*
> > > > > - * Write first byte.
> > > > > + * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> > > > > + * and byte 5 as CALL:
> > > > > + * cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> > > > > + */
> > > > > + ctx.expect = EXPECT_SWBP_OPTIMIZED;
> > > > > + err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> > > > > + LEA_INSN_SIZE - 1, verify_insn,
> > > > > + true /* is_register */, false /* do_update_ref_ctr */,
> > > >
> > > > tbh, it's quite subtle and non-obvious why is_register should be set
> > > > to true first two times (and especially that is_register and
> > > > do_update_ref_ctr are implicitly connected), not sure how to make it
> > > > cleaner, but maybe leave a short comment explaining this twice
> > > > register, once unregister sequence?
> > >
> > > ok, I came up with comment below
> > >
> > > thanks,
> > > jirka
> > >
> > >
> > > ---
> > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > > index de544516ea70..92449f34c005 100644
> > > --- a/arch/x86/kernel/uprobes.c
> > > +++ b/arch/x86/kernel/uprobes.c
> > > @@ -1011,6 +1011,12 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
> > > int err;
> > >
> > > /*
> > > + * Note the first two uprobe_write calls use is_register=true, because they
> > > + * are intermediate patching states while the probe is still active.
> >
> > this doesn't really explain why is_register=true is the right one. It
> > actually doesn't matter as long as do_update_ref_ctr=true, isn't that
> > right? So maybe just to avoid a bit of confusion let's pass
> > is_register=false and do_update_ref_ctr=false, and in the comment
> > explain as you said that it's intermediate update and we don't want to
> > update refctr just yet until the very last step?
>
> apart from refctr update there's also different way the concerned
> page is managed, IIUC:
>
> with is_register=true we force to get exclusive anonymous page for
> the update (or pin the existing one)
>
> with is_register=false we try to zap the private anonymous page and
> return the mapping to the original page
>
> there are several comments on this in uprobe_write/__uprobe_write
>
> how about the update below
>
> jirka
>
>
> ---
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index de544516ea70..09f5ff71227c 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -1011,6 +1011,16 @@ static int int3_update_unoptimize(struct arch_uprobe *auprobe, struct vm_area_st
> int err;
>
> /*
> + * Note the first two uprobe_write calls use is_register=true, because they
> + * are intermediate patching states while the probe is still active, so
> + * we force the exclusive anonymous page for the update.
> + * Also we use do_update_ref_ctr=false because refctr was already updated by
> + * the initial int3 install.
> + *
> + * The last uprobe_write to nop10 instruction is called with is_register=false
> + * and do_update_ref_ctr=true to trigger the refctr update and to instruct
> + * uprobe_write to zap the anonymous page if it now matches the file page.
> + *
lgtm!
> * 1) Initial optimized state:
> * 48 8d 64 24 80 e8 d0 d1 d2 d3
> *
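For anyone following the byte states, the unoptimize sequence from the quoted commit message can be modelled in plain userspace C. This is only a sketch under assumptions: the helper names (build_optimized, unopt_trap_entry, unopt_restore_tail, unopt_publish_nop, call_intact) are invented for illustration and are not the functions in arch/x86/kernel/uprobes.c, and d0 d1 d2 d3 is the placeholder displacement used in the commit message. It only demonstrates that bytes 5..9 (the CALL) are never touched while byte 0 moves from LEA to INT3 to NOP.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NOP10_SIZE 10
#define LEA_SIZE    5	/* 48 8d 64 24 80: lea -0x80(%rsp), %rsp */

/* The 10-byte NOP named in the commit message. */
static const uint8_t nop10[NOP10_SIZE] = {
	0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Hypothetical helper: build the optimized "lea; call rel32" state. */
static void build_optimized(uint8_t *buf, const uint8_t disp[4])
{
	static const uint8_t lea[LEA_SIZE] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

	memcpy(buf, lea, LEA_SIZE);
	buf[5] = 0xe8;			/* CALL opcode */
	memcpy(buf + 6, disp, 4);	/* placeholder displacement */
}

/* Unoptimize step 2: trap new entries at offset 0. */
static void unopt_trap_entry(uint8_t *buf)
{
	buf[0] = 0xcc;
}

/* Unoptimize step 3: restore bytes 1..4 of the NOP, leave byte 0 and the CALL. */
static void unopt_restore_tail(uint8_t *buf)
{
	memcpy(buf + 1, nop10 + 1, LEA_SIZE - 1);
}

/* Unoptimize step 4: publish the first byte of the original NOP. */
static void unopt_publish_nop(uint8_t *buf)
{
	buf[0] = nop10[0];
}

/* Returns 1 if bytes 5..9 still encode "call rel32" with the given displacement. */
static int call_intact(const uint8_t *buf, const uint8_t disp[4])
{
	return buf[5] == 0xe8 && memcmp(buf + 6, disp, 4) == 0;
}
```

Calling the three unopt_* helpers in order on a buffer filled by build_optimized() walks through states 2), 3) and 4) of the unoptimize path, and call_intact() can be checked after each step.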
* Re: [PATCH v7 00/42] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-06-10 17:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <aiMVLtblIKu1DQWJ@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Thu, Jun 04, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> >> + KVM: selftests: Test conversion with elevated page refcount
>> >> + Askar pointed out that soon vmsplice may not pin pages. Should I
>> >> pin pages through CONFIG_GUP_TEST like in [2]? I prefer not to
>> >> take a dependency on CONFIG_GUP_TEST.
>> >
>> > I'm not exactly excited about taking a dependency on CONFIG_GUP_TEST either, but
>> > it probably is the least awful choice. E.g. KVM also pins pages in certain flows,
>> > but we're _also_ actively working to remove the need to pin.
>> >
>> > Hmm, maybe IORING_REGISTER_PBUF_RING? AFAICT, it's almost literally a "pin user
>> > memory" syscall.
>> >
>>
>> Hmm that takes a dependency on io_uring, which isn't always compiled
>> in. Between CONFIG_IO_URING and CONFIG_GUP_TEST, I'd rather
>> CONFIG_GUP_TEST.
>
> Or try both? If it's not a ridiculous amount of work.
CONFIG_GUP_TEST was tried in [1]
[1] https://lore.kernel.org/all/baa8838f623102931e755cf34c86314b305af49c.1747264138.git.ackerleytng@google.com/
It looks like this:

static void pin_pages(void *vaddr, uint64_t size)
{
	const struct pin_longterm_test args = {
		.addr = (uint64_t)vaddr,
		.size = size,
		.flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
	};

	gup_test_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
	TEST_REQUIRE(gup_test_fd > 0);
	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
}

static void unpin_pages(void)
{
	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
}
So in the test I'll call pin_pages(), then try to convert, see that it
fails with EAGAIN and reports the expected error_offset, then I call
unpin_pages(), then I convert again and expect success.
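The flow described here can be sketched as a tiny userspace model. Everything in it is hypothetical: struct mock_gmem, mock_pin_page(), mock_unpin_page() and mock_convert() are stand-ins made up for illustration, not guest_memfd or selftest APIs, and the choice to report the offset of the first page with an extra reference as error_offset is an assumption. It only shows the expected ordering: pin, conversion fails with EAGAIN, unpin, conversion succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NR_PAGES  4
#define MOCK_PAGE_SIZE 4096

/* Hypothetical stand-in for a guest_memfd range under test. */
struct mock_gmem {
	int extra_refs[MOCK_NR_PAGES];	/* elevated refcounts, e.g. from a pin */
	int converted;			/* 1 once the conversion went through */
};

static void mock_pin_page(struct mock_gmem *g, size_t idx)
{
	g->extra_refs[idx]++;
}

static void mock_unpin_page(struct mock_gmem *g, size_t idx)
{
	g->extra_refs[idx]--;
}

/*
 * Returns 0 on success. Returns -EAGAIN and sets *error_offset to the byte
 * offset of the first page that still has an extra reference (assumption).
 */
static int mock_convert(struct mock_gmem *g, uint64_t *error_offset)
{
	for (size_t i = 0; i < MOCK_NR_PAGES; i++) {
		if (g->extra_refs[i] > 0) {
			*error_offset = (uint64_t)i * MOCK_PAGE_SIZE;
			return -EAGAIN;
		}
	}
	g->converted = 1;
	return 0;
}
```

In the real test, pin_pages() and unpin_pages() above would take the place of the mock pin helpers, and the conversion call would be whatever the series exposes.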
Are you uncomfortable with the CONFIG_GUP_TEST interface? What would you
like me to try with CONFIG_IO_URING? I'm thinking that the main
difference between the two is just down to which non-default CONFIG
option we want to take for guest_memfd tests.
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-10 16:37 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <c1b66e7a-bb95-4295-8193-55ceadaaa578@kernel.org>
On Wed, Jun 10, 2026 at 05:00:33PM +0200, David Hildenbrand (Arm) wrote:
> On 6/10/26 12:41, Gregory Price wrote:
> > On Wed, Jun 03, 2026 at 03:00:01PM +1000, Balbir Singh wrote:
> >
> > Notably: slub.c injects __GFP_THISNODE internally on behalf of kmalloc,
> > which causes spillage into private nodes because slub allows private
> > nodes in its mask. I think this is fixable.
> >
> > I have to inspect some other __GFP_THISNODE users (hugetlb, some arch
> > code, etc), but it seems like fully dropping the FALLBACK entries and
> > requiring __GFP_THISNODE might be sufficient.
>
> Sorry, I haven't been able to follow up so far, and not sure if that's what you
> are discussing here ...
>
> After the LSF/MM session, I was wondering, whether if we focus on allowing only
> folios allocations to end up on private memory nodes for now: could the
> __GFP_THISNODE approach work there?
>
> Essentially, disallow any allocations on non-folio paths, and allow folio
> allocation only with __GFP_THISNODE set.
>
> I have to find time to read the other mails in this thread, on my todo list.
>
> So sorry if that is precisely what is being discussed here.
>
So, I remember this being asked, and I didn't fully grok the request.
I'm still not sure I fully understand the question, so apologies if I'm
answering the wrong thing here.
I understand this question in two ways:
1) Can we disallow PAGE allocation and limit this to FOLIO allocation
2) Can we disallow [Feature] (i.e. slab) allocation targeting the node.
1) Can we disallow page allocation and limit this to folios?
No, I don't think so.
Folio allocations are written in terms of page allocations, we would
have to rewrite folio allocation interfaces and introduce a bunch of
boilerplate for the sake of this.
struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
		int preferred_nid, nodemask_t *nodemask)
{
	struct page *page;

	page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
	if (page)
		set_page_refcounted(page);
	return page;
}

struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
		nodemask_t *nodemask)
{
	struct page *page = __alloc_pages_noprof(gfp | __GFP_COMP, order,
					preferred_nid, nodemask);
	return page_rmappable_folio(page);
}
At the end of the day, this all reduces to `get_page_from_freelist`,
and at that level we don't really care about folio vs page.
__GFP_COMP is insufficient to differentiate between a non-folio compound
page and a folio, and __GFP_COMP is passed into __alloc_pages_*
interfaces all over the kernel.
Trying to detach these paths seems like a horrible rat's nest /
not feasible / will create a lot of boilerplate for little value.
(I did not fully understand this request when it was asked, and I do
not fully understand it now; please let me know if I
have misunderstood what you were asking).
2) Can we disallow SLAB allocation.
Yeah, but I think a better question is whether there's a difference
between alloc_pages_node() and kmalloc_node() when it all just sinks
to the same fundamental code in mm/page_alloc.c
Maybe there's an argument for something like NP_OPT_KMALLOC (allow slab
allocations on the private node w/ __GFP_THISNODE)
On my current set, I don't implement any explicit filtering at all in
mm/page_alloc.c - the filtering is a function of the nodes not being
present in the FALLBACK list and only having a NOFALLBACK list.
What __GFP_THISNODE actually does under the hood is just switch
which zone list (FALLBACK vs NOFALLBACK) is used for the target node.
For isolation w/o __GFP_PRIVATE, we're removing N_MEMORY_PRIVATE nodes
from *their own FALLBACK* list and only adding them to their NOFALLBACK
list. That means to reach a private node you MUST use __GFP_THISNODE.
I realize this is confusing, but essentially we don't have to modify
mm/page_alloc.c to get the __GFP_THISNODE filtering, we get this from
the fallback/nofallback list construction.
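To make the list construction concrete, here is a toy userspace model. It is a sketch under assumptions, not the code in mm/page_alloc.c: struct model_node, MODEL_GFP_THISNODE, model_build_fallback() and model_alloc_page() are invented names, and it assumes a private node's FALLBACK list still contains the ordinary nodes (only the private node itself is left out). The point is just that there is no private-node check in the allocation path; the filtering falls out of which list is walked.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_NODES     3
#define MODEL_GFP_THISNODE 0x1u	/* stand-in for __GFP_THISNODE */

struct model_node {
	bool is_private;	/* models an N_MEMORY_PRIVATE node */
	int free_pages;
};

/*
 * FALLBACK list for @nid: the node itself first (unless it is private),
 * then every other non-private node. Private nodes never appear here.
 */
static int model_build_fallback(const struct model_node *nodes, int nid, int *list)
{
	int n = 0;

	if (!nodes[nid].is_private)
		list[n++] = nid;
	for (int i = 0; i < MODEL_NR_NODES; i++)
		if (i != nid && !nodes[i].is_private)
			list[n++] = i;
	return n;
}

/* Returns the node id that satisfied the request, or -1 on failure. */
static int model_alloc_page(struct model_node *nodes, int nid, unsigned int gfp)
{
	int list[MODEL_NR_NODES];
	int n;

	if (gfp & MODEL_GFP_THISNODE) {
		/* NOFALLBACK list: only the target node, private or not. */
		list[0] = nid;
		n = 1;
	} else {
		n = model_build_fallback(nodes, nid, list);
	}

	for (int i = 0; i < n; i++) {
		if (nodes[list[i]].free_pages > 0) {
			nodes[list[i]].free_pages--;
			return list[i];
		}
	}
	return -1;
}
```

With two ordinary nodes and one private node, a request without MODEL_GFP_THISNODE can never land on the private node, even when the private node is the preferred one and the ordinary nodes are empty.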
Ok, so how does this play out in practice - and why do I call this
filtering mechanism fragile?
Consider kmalloc_node() and ___slab_alloc():
kmalloc_node(...)
  └─ ___slab_alloc()                  mm/slub.c:4406  pc.flags |= __GFP_THISNODE
      └─ new_slab(s, pc.flags, node)
          └─ allocate_slab(s, flags, node)
              └─ alloc_slab_page(flags, node, oo, …)
                  └─ __alloc_frozen_pages(flags, order, node, NULL);
Slab silently upgrades the page allocator flags here to include
__GFP_THISNODE - even if the user didn't request that behavior.
This is exactly the kind of "spillage" I said was hard to police at LSF.
Without __GFP_PRIVATE, we have to keep an eye on what around the kernel
is using __GFP_THISNODE and how.
For mm/slub.c we can choose to do one of two things:
1) 100% refuse slab allocations on private nodes, i.e.:
   kmalloc_node(..., private_nid, __GFP_THISNODE)
   will fail (return NULL).
or
2) Do not upgrade private-node slab requests w/ __GFP_THISNODE.
   This allows kmalloc_node() to work the same as the folio_alloc()
   or alloc_pages() interfaces (__GFP_THISNODE is the key).
We can opt these nodes into slab/kmalloc with a NP_OPT_SLAB
if the owner wants kmalloc_node(), with the understanding that any
caller using __GFP_THISNODE may get access.
That's the kind of fragility I was trying to avoid.
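The two slab options can be written down as a small decision function. This is a sketch with made-up names (enum model_slab_policy, model_slab_gfp(), MODEL_GFP_THISNODE, MODEL_REFUSED), not a proposal for the actual mm/slub.c change, and it models the ordinary-node case simply as "always upgrade", which is a simplification of the call chain above.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_GFP_THISNODE 0x1u	/* stand-in for __GFP_THISNODE */
#define MODEL_REFUSED      (-1L)	/* models kmalloc_node() returning NULL */

enum model_slab_policy {
	MODEL_SLAB_REFUSE_PRIVATE,	/* option 1: no slab on private nodes */
	MODEL_SLAB_NO_UPGRADE_PRIVATE,	/* option 2: do not add THISNODE */
};

/*
 * Returns the gfp flags the modelled slab layer hands to the page
 * allocator, or MODEL_REFUSED if the request is rejected up front.
 */
static long model_slab_gfp(enum model_slab_policy policy, bool node_is_private,
			   unsigned int caller_gfp)
{
	/* Ordinary node: the silent upgrade described above. */
	if (!node_is_private)
		return (long)(caller_gfp | MODEL_GFP_THISNODE);

	/* Option 1: refuse regardless of what the caller passed. */
	if (policy == MODEL_SLAB_REFUSE_PRIVATE)
		return MODEL_REFUSED;

	/* Option 2: pass the caller's flags through untouched. */
	return (long)caller_gfp;
}
```

Under option 2 the private node is reached only when the caller itself passed the THISNODE flag, which is exactly the "any caller using __GFP_THISNODE may get access" trade-off.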
That said, in practice, I have found that basic kernel operations don't
generally use kmalloc_node() w/ __GFP_THISNODE - there's just
nothing to prevent anyone from doing so.
So this seems promising...
And then there's arch/powerpc/platforms/powernv/memtrace.c:
static u64 memtrace_alloc_node(u32 nid, u64 size)
{
	... snip ...
	page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
				  __GFP_NOWARN | __GFP_ZERO, nid, NULL);
	... snip ...
}

static int memtrace_init_regions_runtime(u64 size)
{
	... snip ...
	for_each_online_node(nid) {
		m = memtrace_alloc_node(nid, size);
	... snip ...
}

static int memtrace_enable_set(void *data, u64 val)
{
	... snip ...
	if (memtrace_init_regions_runtime(val))
		goto out_unlock;
	... snip ...
}
This is the *exact* pattern I said would be hard to police - and it
doesn't look like a bug, just not informed that private nodes exist.
This is why I'm concerned with trying to depend on __GFP_THISNODE as the
filtering function.
That said, the number of __GFP_THISNODE users is very limited
kernel-wide, so maybe that's an acceptable maintenance burden?
~Gregory
* Re: [PATCH 1/2] ring-buffer: Fix event length with forced 8-byte alignment
From: Steven Rostedt @ 2026-06-10 16:17 UTC (permalink / raw)
To: Hui Wang
Cc: Masami Hiramatsu (Google), mathieu.desnoyers, pjw,
linux-trace-kernel, shuah, wangfushuai, linux-kselftest
In-Reply-To: <ea9d00cb-54c6-4635-aa13-e5a688375132@canonical.com>
On Tue, 9 Jun 2026 12:22:47 +0800
Hui Wang <hui.wang@canonical.com> wrote:
> Thanks for the pointer. I reverted my two patches and applied the patch
> you referenced, but unfortunately it doesn't resolve the problem — the
> testcase still fails in my environment (riscv64 kernel with
> CONFIG_HAVE_64BIT_ALIGNED_ACCESS enabled).
>
> From what I can tell, that fix addresses a different problem than the
> one I'm hitting: it targets a 64K page-size issue, whereas my failure is
> caused by the 64-bit alignment requirement
> (CONFIG_HAVE_64BIT_ALIGNED_ACCESS). So I don't think they're the same
> root cause.
>
> So can you please take a look at them again.
OK, taking a deeper look at it, and yes, you are correct. Sorry for
jumping to the conclusion that this was the same issue as what
was brought up before.
I'll take these.
Thanks,
-- Steve
* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Alexei Starovoitov @ 2026-06-10 15:42 UTC (permalink / raw)
To: Steven Rostedt
Cc: Kumar Kartikeya Dwivedi, Jiri Olsa, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, bpf, linux-trace-kernel,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
Menglong Dong
In-Reply-To: <20260610113536.77172ad1@robin>
On Wed, Jun 10, 2026 at 8:35 AM Steven Rostedt <rostedt@kernel.org> wrote:
>
> On Tue, 09 Jun 2026 16:43:19 +0200
> "Kumar Kartikeya Dwivedi" <memxor@gmail.com> wrote:
>
> > Hi Steven,
> > Version 8 of this set was already applied to bpf-next.
> >
> > https://lore.kernel.org/bpf/178085644764.273544.8250000589480262551.git-patchwork-notify@kernel.org
>
> It should have waited for my review of the first three patches though.
> I like to run them through my tests before giving the OK. As they are
> generic changes to my code.
>
> They are trivial changes, but regardless, someone should have asked.
If my memory doesn't fail me, you said it's fine during the v1, v2 iterations.
For v3 - v8 you were silent, so we assumed you're still fine.
While at it, please review Mykyta's set:
https://patchwork.kernel.org/user/todo/netdevbpf/?series=1096695
It's also been pending for almost a month now.