Linux Trace Kernel
* Re: [PATCH 04/15] net: Use trace_invoke_##name() at guarded tracepoint call sites
From: Vineeth Remanan Pillai @ 2026-03-18 14:13 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Steven Rostedt, Peter Zijlstra, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Eelco Chaudron, Ilya Maximets,
	Marcelo Ricardo Leitner, Xin Long, Jon Maloy, Kuniyuki Iwashima,
	Samiullah Khawaja, Hangbin Liu, netdev, linux-kernel, bpf, dev,
	linux-sctp, tipc-discussion, linux-trace-kernel
In-Reply-To: <CAO7JXPhfpUb1VM_=mwSUqHPQrLvBW=wurz_apWQkMXssPAQPJA@mail.gmail.com>

On Wed, Mar 18, 2026 at 9:40 AM Vineeth Remanan Pillai
<vineeth@bitbyteword.org> wrote:
>
> On Thu, Mar 12, 2026 at 11:31 AM Aaron Conole <aconole@redhat.com> wrote:
> >
> > "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> writes:
> >
> > > Replace trace_foo() with the new trace_invoke_foo() at sites already
> > > guarded by trace_foo_enabled(), avoiding a redundant
> > > static_branch_unlikely() re-evaluation inside the tracepoint.
> > > trace_invoke_foo() calls the tracepoint callbacks directly without
> > > utilizing the static branch again.
> > >
> > > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> > > Assisted-by: Claude:claude-sonnet-4-6
> > > ---
> > >  net/core/dev.c             | 2 +-
> > >  net/core/xdp.c             | 2 +-
> > >  net/openvswitch/actions.c  | 2 +-
> > >  net/openvswitch/datapath.c | 2 +-
> > >  net/sctp/outqueue.c        | 2 +-
> > >  net/tipc/node.c            | 2 +-
> > >  6 files changed, 6 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 14a83f2035b93..a48fae2bbf57e 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6444,7 +6444,7 @@ void netif_receive_skb_list(struct list_head *head)
> > >               return;
> > >       if (trace_netif_receive_skb_list_entry_enabled()) {
> > >               list_for_each_entry(skb, head, list)
> > > -                     trace_netif_receive_skb_list_entry(skb);
> > > +                     trace_invoke_netif_receive_skb_list_entry(skb);
> > >       }
> > >       netif_receive_skb_list_internal(head);
> > >       trace_netif_receive_skb_list_exit(0);
> > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > index 9890a30584ba7..53acc887c3434 100644
> > > --- a/net/core/xdp.c
> > > +++ b/net/core/xdp.c
> > > @@ -362,7 +362,7 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> > >               xsk_pool_set_rxq_info(allocator, xdp_rxq);
> > >
> > >       if (trace_mem_connect_enabled() && xdp_alloc)
> > > -             trace_mem_connect(xdp_alloc, xdp_rxq);
> > > +             trace_invoke_mem_connect(xdp_alloc, xdp_rxq);
> > >       return 0;
> > >  }
> > >
> > > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> > > index 792ca44a461da..420eb19322e85 100644
> > > --- a/net/openvswitch/actions.c
> > > +++ b/net/openvswitch/actions.c
> > > @@ -1259,7 +1259,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> > >               int err = 0;
> > >
> > >               if (trace_ovs_do_execute_action_enabled())
> > > -                     trace_ovs_do_execute_action(dp, skb, key, a, rem);
> > > +                     trace_invoke_ovs_do_execute_action(dp, skb, key, a, rem);
> >
> > Maybe we should just remove the guard here instead of calling the
> > invoke.  That seems better to me.  It wouldn't need to belong to this
> > series.
> >
> > >               /* Actions that rightfully have to consume the skb should do it
> > >                * and return directly.
> > > diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> > > index e209099218b41..02451629e888e 100644
> > > --- a/net/openvswitch/datapath.c
> > > +++ b/net/openvswitch/datapath.c
> > > @@ -335,7 +335,7 @@ int ovs_dp_upcall(struct datapath *dp, struct sk_buff *skb,
> > >       int err;
> > >
> > >       if (trace_ovs_dp_upcall_enabled())
> > > -             trace_ovs_dp_upcall(dp, skb, key, upcall_info);
> > > +             trace_invoke_ovs_dp_upcall(dp, skb, key, upcall_info);
> >
> > Same as above.  Seems OVS tracepoints are the only ones that include
> > the guard without any real reason.
> >
>
> Makes sense. It's simple enough that I think I will include it as a
> separate patch in v2 and remove these changes from this patch. Thanks
> for pointing it out.
>
On a second look, I'm not sure whether the guard here was added for
performance reasons. The discussion on the io_uring patch in this series
points out that the check there was deliberate: it avoids 6 mov
instructions in the hot path. I just wanted to double check whether that
was also the case here before I remove the check.

Thanks,
Vineeth
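
For readers following the thread, the difference between the two call
styles can be sketched in userspace C. This is only an illustration under
stated assumptions: the foo tracepoint, foo_probe() and the counters are
hypothetical, and a plain bool stands in for the kernel's static branch, so
the real cost model (a patched jump rather than a load and compare) is not
reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the tracepoint's static key. */
static bool tp_enabled;

/* Counters so the number of key checks and callbacks can be observed. */
static int enabled_checks;
static int callbacks_run;

/* Stand-in for a registered probe. */
static void foo_probe(int arg)
{
	(void)arg;
	callbacks_run++;
}

/* Models trace_foo_enabled(): one evaluation of the key. */
static bool trace_foo_enabled(void)
{
	enabled_checks++;
	return tp_enabled;
}

/* Models trace_invoke_foo(): run the callbacks without checking the key. */
static void trace_invoke_foo(int arg)
{
	foo_probe(arg);
}

/* Models trace_foo(): check the key, then invoke. */
static void trace_foo(int arg)
{
	if (trace_foo_enabled())
		trace_invoke_foo(arg);
}

/* Guarded call site using trace_foo(): the key is re-checked per item. */
static void guarded_with_trace(const int *items, size_t n)
{
	assert(items != NULL);
	if (trace_foo_enabled()) {
		for (size_t i = 0; i < n; i++)
			trace_foo(items[i]);
	}
}

/* Guarded call site using trace_invoke_foo(): the key is checked once. */
static void guarded_with_invoke(const int *items, size_t n)
{
	assert(items != NULL);
	if (trace_foo_enabled()) {
		for (size_t i = 0; i < n; i++)
			trace_invoke_foo(items[i]);
	}
}
```

With the key enabled and three items, guarded_with_trace() evaluates the
key four times while guarded_with_invoke() evaluates it once. Dropping the
guard entirely, as suggested for the OVS sites, corresponds to calling
trace_foo() directly.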


* Re: [PATCH v4] lib/bootconfig: guard xbc_node_compose_key_after() buffer size
From: Steven Rostedt @ 2026-03-18 13:45 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Josh Law, Andrew Morton, linux-kernel, linux-trace-kernel
In-Reply-To: <20260318090243.7c437f2c5e07a1ce00375102@kernel.org>

On Wed, 18 Mar 2026 09:02:43 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > This was somewhat my idea. Why do you think it's over engineering?
> > 
> > This is your code, so you have final say. I'm not going to push it. I'm
> > just curious to your thoughts.  
> 
> I sent a mail explaining why I think this is over-engineering. I think
> this comes from the vsnprintf() interface design. If every user of it
> needs to do this, that is not fair. It should be checked in vsnprintf(),
> and the caller should just check the returned error.

I wouldn't call this over-engineering. The reason you gave is more about
the checks simply being in the wrong place.

Over-engineering would be if the patch had created 5 different macros to
see if the value passed to snprintf() was size_t and could be greater than
INT_MAX, and had used the TRACE_EVENT() trick to generate the code for
those checks. Now THAT would be over-engineering! ;-)

-- Steve
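
To make the two positions concrete, here is a minimal userspace sketch of
a caller that rejects an oversized buffer size up front and also checks
the formatter's return value. It relies only on standard C vsnprintf()
semantics; format_checked() is a hypothetical helper, not anything from
lib/bootconfig.c, and the kernel's own vsnprintf() may behave differently
in detail.

```c
#include <assert.h>
#include <limits.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format into buf and treat truncation as an error.
 * Returns the number of characters written (excluding the NUL), or -1 if
 * the size cannot be represented as an int, formatting failed, or the
 * output did not fit.
 */
static int format_checked(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int ret;

	assert(buf != NULL && fmt != NULL);

	/* vsnprintf() reports lengths as int, so reject sizes above INT_MAX. */
	if (size == 0 || size > (size_t)INT_MAX)
		return -1;

	va_start(ap, fmt);
	ret = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	/* A return value >= size means the output was truncated. */
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	return ret;
}
```

If the formatter itself refused oversized sizes, the first check could be
dropped and every caller would be left with only the return-value check,
which is the design argued for in the quoted mail.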


* Re: [PATCH 04/15] net: Use trace_invoke_##name() at guarded tracepoint call sites
From: Vineeth Remanan Pillai @ 2026-03-18 13:40 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Steven Rostedt, Peter Zijlstra, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Eelco Chaudron, Ilya Maximets,
	Marcelo Ricardo Leitner, Xin Long, Jon Maloy, Kuniyuki Iwashima,
	Samiullah Khawaja, Hangbin Liu, netdev, linux-kernel, bpf, dev,
	linux-sctp, tipc-discussion, linux-trace-kernel
In-Reply-To: <f7to6ktnjxi.fsf@redhat.com>

On Thu, Mar 12, 2026 at 11:31 AM Aaron Conole <aconole@redhat.com> wrote:
>
> "Vineeth Pillai (Google)" <vineeth@bitbyteword.org> writes:
>
> > Replace trace_foo() with the new trace_invoke_foo() at sites already
> > guarded by trace_foo_enabled(), avoiding a redundant
> > static_branch_unlikely() re-evaluation inside the tracepoint.
> > trace_invoke_foo() calls the tracepoint callbacks directly without
> > utilizing the static branch again.
> >
> > Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> > Suggested-by: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> > Assisted-by: Claude:claude-sonnet-4-6
> > ---
> >  net/core/dev.c             | 2 +-
> >  net/core/xdp.c             | 2 +-
> >  net/openvswitch/actions.c  | 2 +-
> >  net/openvswitch/datapath.c | 2 +-
> >  net/sctp/outqueue.c        | 2 +-
> >  net/tipc/node.c            | 2 +-
> >  6 files changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 14a83f2035b93..a48fae2bbf57e 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6444,7 +6444,7 @@ void netif_receive_skb_list(struct list_head *head)
> >               return;
> >       if (trace_netif_receive_skb_list_entry_enabled()) {
> >               list_for_each_entry(skb, head, list)
> > -                     trace_netif_receive_skb_list_entry(skb);
> > +                     trace_invoke_netif_receive_skb_list_entry(skb);
> >       }
> >       netif_receive_skb_list_internal(head);
> >       trace_netif_receive_skb_list_exit(0);
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 9890a30584ba7..53acc887c3434 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -362,7 +362,7 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> >               xsk_pool_set_rxq_info(allocator, xdp_rxq);
> >
> >       if (trace_mem_connect_enabled() && xdp_alloc)
> > -             trace_mem_connect(xdp_alloc, xdp_rxq);
> > +             trace_invoke_mem_connect(xdp_alloc, xdp_rxq);
> >       return 0;
> >  }
> >
> > diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> > index 792ca44a461da..420eb19322e85 100644
> > --- a/net/openvswitch/actions.c
> > +++ b/net/openvswitch/actions.c
> > @@ -1259,7 +1259,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> >               int err = 0;
> >
> >               if (trace_ovs_do_execute_action_enabled())
> > -                     trace_ovs_do_execute_action(dp, skb, key, a, rem);
> > +                     trace_invoke_ovs_do_execute_action(dp, skb, key, a, rem);
>
> Maybe we should just remove the guard here instead of calling the
> invoke.  That seems better to me.  It wouldn't need to belong to this
> series.
>
> >               /* Actions that rightfully have to consume the skb should do it
> >                * and return directly.
> > diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> > index e209099218b41..02451629e888e 100644
> > --- a/net/openvswitch/datapath.c
> > +++ b/net/openvswitch/datapath.c
> > @@ -335,7 +335,7 @@ int ovs_dp_upcall(struct datapath *dp, struct sk_buff *skb,
> >       int err;
> >
> >       if (trace_ovs_dp_upcall_enabled())
> > -             trace_ovs_dp_upcall(dp, skb, key, upcall_info);
> > +             trace_invoke_ovs_dp_upcall(dp, skb, key, upcall_info);
>
> Same as above.  Seems OVS tracepoints are the only ones that include
> the guard without any real reason.
>

Makes sense. It's simple enough that I think I will include it as a
separate patch in v2 and remove these changes from this patch. Thanks
for pointing it out.

Thanks,
Vineeth


* Re: [PATCH] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-03-18 13:21 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: axboe, rostedt, mhiramat, mathieu.desnoyers, johannes.thumshirn,
	kch, bvanassche, ritesh.list, neelx, sean, mproche, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <e7db667e-2662-4ad9-89cd-309c457ce17e@kernel.org>


On Wed, Mar 18, 2026 at 08:38:20AM +0900, Damien Le Moal wrote:
> Looks OK to me, but I have some suggestions below.

Hi Damien,

Thank you for your feedback.

> > +/**
> > + * block_rq_tag_wait - triggered when an I/O request is starved of a tag
> 
> when an I/O request -> when a request

Acknowledged.

> 
> > + * @q: queue containing the request
> 
> request queue of the target device
> 
> ("containing" is odd here)

Acknowledged.

> > + * @hctx: hardware context (queue) experiencing starvation
> 
> hardware context of the request

Acknowledged.

> > + *
> > + * Called immediately before the submitting thread is forced to block due
> 
> the submitting thread -> the submitting context

Acknowledged.

> 
> > + * to the exhaustion of available hardware tags. This tracepoint indicates
> 
> s/tracepoint/trace point

Acknowledged.

> 
> > + * that the thread will be placed into an uninterruptible state via
> 
> s/thread/context

Acknowledged.

> 
> > + * io_schedule() until an active block I/O operation completes and
> > + * relinquishes its assigned tag.
> 
> until an active request completes
> 

Acknowledged.

> > + */
> > +TRACE_EVENT(block_rq_tag_wait,
> > +
> > +	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx),
> > +
> > +	TP_ARGS(q, hctx),
> > +
> > +	TP_STRUCT__entry(
> > +		__field( dev_t,		dev			)
> > +		__field( u32,		hctx_id			)
> > +		__field( u32,		nr_tags			)
> > +		__field( u32,		active_requests		)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->dev		  = q->disk ? disk_devt(q->disk) : 0;
> 
> I do not think that q->disk can ever be NULL when there is a request being
> submitted.

Yes, I agree. In theory, a race with disk_release() cannot occur since the
gendisk reference counter would still be elevated here.

> 
> > +		__entry->hctx_id	  = hctx ? hctx->queue_num : 0;
> > +		__entry->nr_tags	  = hctx && hctx->tags ? hctx->tags->nr_tags : 0;
> > +		__entry->active_requests  = hctx ? atomic_read(&hctx->nr_active) : 0;
> > +	),
> > +
> > +	TP_printk("%d,%d hctx=%u starved (active=%u/%u)",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->hctx_id, __entry->active_requests, __entry->nr_tags)
> > +);
> > +
> >  /**
> >   * block_rq_insert - insert block operation request into queue
> >   * @rq: block IO operation request


Kind regards,
-- 
Aaron Tomlin
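
As an illustration of what the proposed TP_printk() line would look like,
the sketch below formats one event in userspace. The MAJOR()/MINOR()
macros assume the kernel-internal dev_t layout with 20 minor bits; the
struct and function names are hypothetical, and %u is used here instead of
the patch's %d because the fields are unsigned.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed kernel-internal dev_t split: 12 major bits, 20 minor bits. */
#define MINORBITS	20
#define MINORMASK	((1U << MINORBITS) - 1)
#define MAJOR(dev)	((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev)	((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi)	(((unsigned int)(ma) << MINORBITS) | (unsigned int)(mi))

/* Hypothetical mirror of the fields recorded by TP_STRUCT__entry(). */
struct tag_wait_entry {
	unsigned int dev;
	unsigned int hctx_id;
	unsigned int nr_tags;
	unsigned int active_requests;
};

/* Format one event the way the proposed TP_printk() would. */
static int format_tag_wait(char *buf, size_t size,
			   const struct tag_wait_entry *e)
{
	assert(buf != NULL && e != NULL);
	return snprintf(buf, size, "%u,%u hctx=%u starved (active=%u/%u)",
			MAJOR(e->dev), MINOR(e->dev),
			e->hctx_id, e->active_requests, e->nr_tags);
}
```

For a device 8,16 with hardware context 2 and all 64 tags in use, this
produces "8,16 hctx=2 starved (active=64/64)".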



* Re: [PATCH] blk-mq: add tracepoint block_rq_tag_wait
From: Laurence Oberman @ 2026-03-18 13:10 UTC (permalink / raw)
  To: Damien Le Moal, Aaron Tomlin, axboe, rostedt, mhiramat,
	mathieu.desnoyers
  Cc: johannes.thumshirn, kch, bvanassche, ritesh.list, neelx, sean,
	mproche, linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <e7db667e-2662-4ad9-89cd-309c457ce17e@kernel.org>

On Wed, 2026-03-18 at 08:38 +0900, Damien Le Moal wrote:
> On 2026/03/18 3:28, Aaron Tomlin wrote:
> > In high-performance storage environments, particularly when utilising
> > RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> > latency spikes can occur when fast devices (SSDs) are starved of hardware
> > tags when sharing the same blk_mq_tag_set.
> > 
> > Currently, diagnosing this specific hardware queue contention is
> > difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
> > forces the current thread to block uninterruptible via io_schedule().
> > While this can be inferred via sched:sched_switch or dynamically
> > traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> > dedicated, out-of-the-box observability for this event.
> > 
> > This patch introduces the block_rq_tag_wait static tracepoint in
> > the tag allocation slow-path. It triggers immediately before the
> > thread yields the CPU, exposing the exact hardware context (hctx)
> > that is starved, the total pool size, and the current active request
> > count.
> > 
> > This provides storage engineers and performance monitoring agents
> > with a zero-configuration, low-overhead mechanism to definitively
> > identify shared-tag bottlenecks and tune I/O schedulers or cgroup
> > throttling accordingly.
> > 
> > Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> 
> Looks OK to me, but I have some suggestions below.
> 
> > ---
> >  block/blk-mq-tag.c           |  3 +++
> >  include/trace/events/block.h | 36 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 39 insertions(+)
> > 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 33946cdb5716..f50993e86ca5 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -13,6 +13,7 @@
> >  #include <linux/kmemleak.h>
> >  
> >  #include <linux/delay.h>
> > +#include <trace/events/block.h>
> >  #include "blk.h"
> >  #include "blk-mq.h"
> >  #include "blk-mq-sched.h"
> > @@ -187,6 +188,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >  		if (tag != BLK_MQ_NO_TAG)
> >  			break;
> >  
> > +		trace_block_rq_tag_wait(data->q, data->hctx);
> > +
> >  		bt_prev = bt;
> >  		io_schedule();
> >  
> > diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> > index 6aa79e2d799c..48e2ba433c87 100644
> > --- a/include/trace/events/block.h
> > +++ b/include/trace/events/block.h
> > @@ -226,6 +226,42 @@ DECLARE_EVENT_CLASS(block_rq,
> >  		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
> >  );
> >  
> > +/**
> > + * block_rq_tag_wait - triggered when an I/O request is starved of a tag
> 
> when an I/O request -> when a request
> 
> > + * @q: queue containing the request
> 
> request queue of the target device
> 
> ("containing" is odd here)
> 
> > + * @hctx: hardware context (queue) experiencing starvation
> 
> hardware context of the request
> 
> > + *
> > + * Called immediately before the submitting thread is forced to block due
> 
> the submitting thread -> the submitting context
> 
> > + * to the exhaustion of available hardware tags. This tracepoint indicates
> 
> s/tracepoint/trace point
> 
> > + * that the thread will be placed into an uninterruptible state via
> 
> s/thread/context
> 
> > + * io_schedule() until an active block I/O operation completes and
> > + * relinquishes its assigned tag.
> 
> until an active request completes
> 
> (BIOs do not have tags).
> 
> > + */
> > +TRACE_EVENT(block_rq_tag_wait,
> > +
> > +	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx),
> > +
> > +	TP_ARGS(q, hctx),
> > +
> > +	TP_STRUCT__entry(
> > +		__field( dev_t,		dev			)
> > +		__field( u32,		hctx_id			)
> > +		__field( u32,		nr_tags			)
> > +		__field( u32,		active_requests		)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->dev		  = q->disk ? disk_devt(q->disk) : 0;
> 
> I do not think that q->disk can ever be NULL when there is a request
> being submitted.
> 
> > +		__entry->hctx_id	  = hctx ? hctx->queue_num : 0;
> > +		__entry->nr_tags	  = hctx && hctx->tags ? hctx->tags->nr_tags : 0;
> > +		__entry->active_requests  = hctx ? atomic_read(&hctx->nr_active) : 0;
> > +	),
> > +
> > +	TP_printk("%d,%d hctx=%u starved (active=%u/%u)",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->hctx_id, __entry->active_requests, __entry->nr_tags)
> > +);
> > +
> >  /**
> >   * block_rq_insert - insert block operation request into queue
> >   * @rq: block IO operation request
> 

This visibility will be very useful. I plan to test it fully.
Updates to follow.
Thanks,
Laurence Oberman



* Re: [PATCH v7 14/15] rv: Add deadline monitors
From: gmonaco @ 2026-03-18 12:00 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Jonathan Corbet, Masami Hiramatsu, linux-trace-kernel, linux-doc,
	Peter Zijlstra, Tomas Glozar, Clark Williams, John Kacur
In-Reply-To: <abLBnlqIHVPEcisP@jlelli-thinkpadt14gen4.remote.csb>

Hello,

On Thu, 2026-03-12 at 14:37 +0100, Juri Lelli wrote:
> > +/* Used by other monitors */
> > +struct sched_class *rv_ext_sched_class;
> > +
> > +static int __init register_deadline(void)
> > +{
> > +	if (IS_ENABLED(CONFIG_SCHED_CLASS_EXT))
> > +		rv_ext_sched_class = (void *)kallsyms_lookup_name("ext_sched_class");
> 
> Looks like the above lookup can fail. I don't actually see how/why
> it would fail if things build correctly and EXT tasks are around.
> But, theoretically, we could end up with rv_ext_sched_class = NULL?
> 
> > +static inline bool task_is_scx_enabled(struct task_struct *tsk)
> > +{
> > +	return IS_ENABLED(CONFIG_SCHED_CLASS_EXT) &&
> > +	       tsk->sched_class == rv_ext_sched_class;
> > +}
> > +
> > +/* Expand id and target as arguments for da functions */
> > +#define EXPAND_ID(dl_se, cpu, type) get_entity_id(dl_se, cpu, type), dl_se
> > +#define EXPAND_ID_TASK(tsk) get_entity_id(&tsk->dl, task_cpu(tsk), DL_TASK), &tsk->dl
> > +
> > +static inline uint8_t get_server_type(struct task_struct *tsk)
> > +{
> > +	if (tsk->policy == SCHED_NORMAL || tsk->policy == SCHED_EXT ||
> > +	    tsk->policy == SCHED_BATCH || tsk->policy == SCHED_IDLE)
> > +		return task_is_scx_enabled(tsk) ? DL_SERVER_EXT : DL_SERVER_FAIR;
> > +	return DL_OTHER;
> > +}
> 
> Considering that, if that happens, get_server_type() will return
> DL_SERVER_FAIR for scx tasks as well (possibly confusing monitors?),
> shall we add a warn or something just in case? A 'no we don't need that
> because it can't happen' works for me, just thought I should still
> mention this. :)

I forgot to answer this.

Well, technically yes, this can all fail.
I figured a silent degradation in this remote case would be alright,
but printing a warning during initialisation probably wouldn't hurt.

We cannot do much if it really happened and, yes, monitors would likely
fail if both SCX and fair servers coexist.
I'd assume it /should/ never happen, but it costs nothing to add:

  pr_warn("Error detecting the ext class, monitors may report wrong results.\n");

Thanks,
Gabriele
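
The failure mode discussed above can be sketched in userspace. Everything
here is a hypothetical stand-in: lookup_class() plays the role of
kallsyms_lookup_name(), the stub structs replace the scheduler classes,
and fprintf() replaces pr_warn(). The point is only that a failed lookup
leaves the pointer NULL, so every task is then classified as fair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct sched_class. */
struct sched_class_stub {
	const char *name;
};

static const struct sched_class_stub fair_class = { "fair" };
static const struct sched_class_stub ext_class = { "ext" };

/* Mirrors rv_ext_sched_class: stays NULL if the lookup fails. */
static const struct sched_class_stub *rv_ext_class;
static int warnings;

/* Stand-in for the symbol lookup: NULL when the symbol is unknown. */
static const struct sched_class_stub *lookup_class(const char *symbol)
{
	assert(symbol != NULL);
	if (strcmp(symbol, "ext_sched_class") == 0)
		return &ext_class;
	return NULL;
}

/* Registration continues either way; a failed lookup is only reported. */
static bool register_monitor(const char *symbol)
{
	rv_ext_class = lookup_class(symbol);
	if (!rv_ext_class) {
		warnings++;
		fprintf(stderr, "Error detecting the ext class, "
				"monitors may report wrong results.\n");
		return false;
	}
	return true;
}

enum server_type { SERVER_FAIR, SERVER_EXT };

/* With rv_ext_class == NULL no real class matches, so all map to fair. */
static enum server_type get_server_type(const struct sched_class_stub *cls)
{
	return cls == rv_ext_class ? SERVER_EXT : SERVER_FAIR;
}
```

After a failed lookup, get_server_type(&ext_class) returns SERVER_FAIR,
which is the misclassification the warning is meant to flag.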



* [PATCH 8/8] memblock: warn when freeing reserved memory before memory map is initialized
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, freeing of reserved
memory before the memory map is fully initialized in deferred_init_memmap()
would cause access to uninitialized struct pages and may crash when
accessing spurious list pointers, as was recently discovered during
discussion about memory leaks in x86 EFI code [1].

The trace below is from an attempt to call free_reserved_page() before
page_alloc_init_late():

[    0.076840] BUG: unable to handle page fault for address: ffffce1a005a0788
[    0.078226] #PF: supervisor read access in kernel mode
[    0.078226] #PF: error_code(0x0000) - not-present page
[    0.078226] PGD 0 P4D 0
[    0.078226] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[    0.078226] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.68-92.123.amzn2023.x86_64 #1
[    0.078226] Hardware name: Amazon EC2 t3a.nano/, BIOS 1.0 10/16/2017
[    0.078226] RIP: 0010:__list_del_entry_valid_or_report+0x32/0xb0
...
[    0.078226]  __free_one_page+0x170/0x520
[    0.078226]  free_pcppages_bulk+0x151/0x1e0
[    0.078226]  free_unref_page_commit+0x263/0x320
[    0.078226]  free_unref_page+0x2c8/0x5b0
[    0.078226]  ? srso_return_thunk+0x5/0x5f
[    0.078226]  free_reserved_page+0x1c/0x30
[    0.078226]  memblock_free_late+0x6c/0xc0

Currently there are not many callers of free_reserved_area() and they all
appear to be called at the right time.

Still, in order to protect against problematic code moves or additions of
new callers, add a warning stating that reserved pages cannot be
freed until the memory map is fully initialized.

[1] https://lore.kernel.org/all/e5d5a1105d90ee1e7fe7eafaed2ed03bbad0c46b.camel@kernel.crashing.org/

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/internal.h   | 10 ++++++++++
 mm/memblock.c   |  5 +++++
 mm/page_alloc.c | 10 ----------
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index cb0af847d7d9..f60c1edb2e02 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1233,7 +1233,17 @@ static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 DECLARE_STATIC_KEY_TRUE(deferred_pages);
 
+static inline bool deferred_pages_enabled(void)
+{
+	return static_branch_unlikely(&deferred_pages);
+}
+
 bool __init deferred_grow_zone(struct zone *zone, unsigned int order);
+#else
+static inline bool deferred_pages_enabled(void)
+{
+	return false;
+}
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 void init_deferred_page(unsigned long pfn, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index bd5758ff07f2..780e70d4971a 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -896,6 +896,11 @@ static unsigned long __free_reserved_area(phys_addr_t start, phys_addr_t end,
 {
 	unsigned long pages = 0, pfn;
 
+	if (deferred_pages_enabled()) {
+		WARN(1, "Cannot free reserved memory because of deferred initialization of the memory map");
+		return 0;
+	}
+
 	for_each_valid_pfn(pfn, PFN_UP(start), PFN_DOWN(end)) {
 		struct page *page = pfn_to_page(pfn);
 		void *direct_map_addr;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df3d61253001..9ac47bab2ea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,11 +331,6 @@ int page_group_by_mobility_disabled __read_mostly;
  */
 DEFINE_STATIC_KEY_TRUE(deferred_pages);
 
-static inline bool deferred_pages_enabled(void)
-{
-	return static_branch_unlikely(&deferred_pages);
-}
-
 /*
  * deferred_grow_zone() is __init, but it is called from
  * get_page_from_freelist() during early boot until deferred_pages permanently
@@ -348,11 +343,6 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
 	return deferred_grow_zone(zone, order);
 }
 #else
-static inline bool deferred_pages_enabled(void)
-{
-	return false;
-}
-
 static inline bool _deferred_grow_zone(struct zone *zone, unsigned int order)
 {
 	return false;
-- 
2.51.0
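
The shape of the guard added above can be illustrated with a small
userspace sketch. It is not the kernel code: a plain bool stands in for
the deferred_pages static key, a counter and fprintf() stand in for
WARN(), and free_reserved_area_sketch() is a hypothetical name that only
counts page frames instead of freeing them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the static key: true until the memory map is initialized. */
static bool deferred_pages = true;
static int warn_count;

static bool deferred_pages_enabled(void)
{
	return deferred_pages;
}

/*
 * Hypothetical model of the guarded free path: refuse to touch any page
 * while deferred initialization is pending, otherwise count the page
 * frames in [start_pfn, end_pfn).
 */
static unsigned long free_reserved_area_sketch(unsigned long start_pfn,
					       unsigned long end_pfn)
{
	unsigned long pages = 0, pfn;

	assert(start_pfn <= end_pfn);

	if (deferred_pages_enabled()) {
		warn_count++;
		fprintf(stderr, "Cannot free reserved memory because of "
				"deferred initialization of the memory map\n");
		return 0;
	}

	for (pfn = start_pfn; pfn < end_pfn; pfn++)
		pages++;	/* the real code would free the page here */

	return pages;
}
```

A caller that runs too early gets 0 pages and a warning, and the same call
succeeds once the flag has been cleared.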



* [PATCH 7/8] memblock, treewide: make memblock_free() handle late freeing
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

It shouldn't be the responsibility of memblock users to detect whether they
are freeing memory allocated from memblock late and should therefore use
memblock_free_late().

Make memblock_free() and memblock_phys_free() take care of late memory
freeing and drop memblock_free_late().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/sparc/kernel/mdesc.c               |  4 +--
 arch/x86/kernel/setup.c                 |  2 +-
 arch/x86/platform/efi/memmap.c          |  5 +---
 arch/x86/platform/efi/quirks.c          |  2 +-
 drivers/firmware/efi/apple-properties.c |  2 +-
 drivers/of/kexec.c                      |  2 +-
 include/linux/memblock.h                |  2 --
 kernel/dma/swiotlb.c                    |  6 ++--
 lib/bootconfig.c                        |  2 +-
 mm/kfence/core.c                        |  4 +--
 mm/memblock.c                           | 37 +++++++------------------
 11 files changed, 22 insertions(+), 46 deletions(-)

diff --git a/arch/sparc/kernel/mdesc.c b/arch/sparc/kernel/mdesc.c
index 30f171b7b00c..ecd6c8ae49c7 100644
--- a/arch/sparc/kernel/mdesc.c
+++ b/arch/sparc/kernel/mdesc.c
@@ -183,14 +183,12 @@ static struct mdesc_handle * __init mdesc_memblock_alloc(unsigned int mdesc_size
 static void __init mdesc_memblock_free(struct mdesc_handle *hp)
 {
 	unsigned int alloc_size;
-	unsigned long start;
 
 	BUG_ON(refcount_read(&hp->refcnt) != 0);
 	BUG_ON(!list_empty(&hp->list));
 
 	alloc_size = PAGE_ALIGN(hp->handle_size);
-	start = __pa(hp);
-	memblock_free_late(start, alloc_size);
+	memblock_free(hp, alloc_size);
 }
 
 static struct mdesc_mem_ops memblock_mdesc_ops = {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index eebcc9db1a1b..46882ce79c3a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -426,7 +426,7 @@ int __init ima_free_kexec_buffer(void)
 	if (!ima_kexec_buffer_size)
 		return -ENOENT;
 
-	memblock_free_late(ima_kexec_buffer_phys,
+	memblock_phys_free(ima_kexec_buffer_phys,
 			   ima_kexec_buffer_size);
 
 	ima_kexec_buffer_phys = 0;
diff --git a/arch/x86/platform/efi/memmap.c b/arch/x86/platform/efi/memmap.c
index 023697c88910..697a9a26a005 100644
--- a/arch/x86/platform/efi/memmap.c
+++ b/arch/x86/platform/efi/memmap.c
@@ -34,10 +34,7 @@ static
 void __init __efi_memmap_free(u64 phys, unsigned long size, unsigned long flags)
 {
 	if (flags & EFI_MEMMAP_MEMBLOCK) {
-		if (slab_is_available())
-			memblock_free_late(phys, size);
-		else
-			memblock_phys_free(phys, size);
+		memblock_phys_free(phys, size);
 	} else if (flags & EFI_MEMMAP_SLAB) {
 		struct page *p = pfn_to_page(PHYS_PFN(phys));
 		unsigned int order = get_order(size);
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 35caa5746115..a560bbcaa006 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -372,7 +372,7 @@ void __init efi_reserve_boot_services(void)
 		 * doesn't make sense as far as the firmware is
 		 * concerned, but it does provide us with a way to tag
 		 * those regions that must not be paired with
-		 * memblock_free_late().
+		 * memblock_phys_free().
 		 */
 		md->attribute |= EFI_MEMORY_RUNTIME;
 	}
diff --git a/drivers/firmware/efi/apple-properties.c b/drivers/firmware/efi/apple-properties.c
index 13ac28754c03..2e525e17fba7 100644
--- a/drivers/firmware/efi/apple-properties.c
+++ b/drivers/firmware/efi/apple-properties.c
@@ -226,7 +226,7 @@ static int __init map_properties(void)
 		 */
 		data->len = 0;
 		memunmap(data);
-		memblock_free_late(pa_data + sizeof(*data), data_len);
+		memblock_phys_free(pa_data + sizeof(*data), data_len);
 
 		return ret;
 	}
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index c4cf3552c018..512d9be9d513 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -175,7 +175,7 @@ int __init ima_free_kexec_buffer(void)
 	if (ret)
 		return ret;
 
-	memblock_free_late(addr, size);
+	memblock_phys_free(addr, size);
 	return 0;
 }
 #endif
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6ec5e9ac0699..6f6c5b5c4a4b 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -172,8 +172,6 @@ void __next_mem_range_rev(u64 *idx, int nid, enum memblock_flags flags,
 			  struct memblock_type *type_b, phys_addr_t *out_start,
 			  phys_addr_t *out_end, int *out_nid);
 
-void memblock_free_late(phys_addr_t base, phys_addr_t size);
-
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
 					phys_addr_t *out_start,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d8e6f1d889d5..e44e039e00d3 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -546,10 +546,10 @@ void __init swiotlb_exit(void)
 		free_pages(tbl_vaddr, get_order(tbl_size));
 		free_pages((unsigned long)mem->slots, get_order(slots_size));
 	} else {
-		memblock_free_late(__pa(mem->areas),
+		memblock_free(mem->areas,
 			array_size(sizeof(*mem->areas), mem->nareas));
-		memblock_free_late(mem->start, tbl_size);
-		memblock_free_late(__pa(mem->slots), slots_size);
+		memblock_phys_free(mem->start, tbl_size);
+		memblock_free(mem->slots, slots_size);
 	}
 
 	memset(mem, 0, sizeof(*mem));
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 449369a60846..86a75bf636bc 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -64,7 +64,7 @@ static inline void __init xbc_free_mem(void *addr, size_t size, bool early)
 	if (early)
 		memblock_free(addr, size);
 	else if (addr)
-		memblock_free_late(__pa(addr), size);
+		memblock_free(addr, size);
 }
 
 #else /* !__KERNEL__ */
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 7393957f9a20..5c8268af533e 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -731,10 +731,10 @@ static bool __init kfence_init_pool_early(void)
 	 * fails for the first page, and therefore expect addr==__kfence_pool in
 	 * most failure cases.
 	 */
-	memblock_free_late(__pa(addr), KFENCE_POOL_SIZE - (addr - (unsigned long)__kfence_pool));
+	memblock_free((void *)addr, KFENCE_POOL_SIZE - (addr - (unsigned long)__kfence_pool));
 	__kfence_pool = NULL;
 
-	memblock_free_late(__pa(kfence_metadata_init), KFENCE_METADATA_SIZE);
+	memblock_free(kfence_metadata_init, KFENCE_METADATA_SIZE);
 	kfence_metadata_init = NULL;
 
 	return false;
diff --git a/mm/memblock.c b/mm/memblock.c
index 9f372a8e82f7..bd5758ff07f2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -384,26 +384,24 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
  */
 void __init memblock_discard(void)
 {
-	phys_addr_t addr, size;
+	phys_addr_t size;
 
 	if (memblock.reserved.regions != memblock_reserved_init_regions) {
-		addr = __pa(memblock.reserved.regions);
 		size = PAGE_ALIGN(sizeof(struct memblock_region) *
 				  memblock.reserved.max);
 		if (memblock_reserved_in_slab)
 			kfree(memblock.reserved.regions);
 		else
-			memblock_free_late(addr, size);
+			memblock_free(memblock.reserved.regions, size);
 	}
 
 	if (memblock.memory.regions != memblock_memory_init_regions) {
-		addr = __pa(memblock.memory.regions);
 		size = PAGE_ALIGN(sizeof(struct memblock_region) *
 				  memblock.memory.max);
 		if (memblock_memory_in_slab)
 			kfree(memblock.memory.regions);
 		else
-			memblock_free_late(addr, size);
+			memblock_free(memblock.memory.regions, size);
 	}
 
 	memblock_memory = NULL;
@@ -961,7 +959,8 @@ unsigned long free_reserved_area(void *start, void *end, int poison, const char
  * @size: size of the boot memory block in bytes
  *
  * Free boot memory block previously allocated by memblock_alloc_xx() API.
- * The freeing memory will not be released to the buddy allocator.
+ * If called after the buddy allocator is available, the memory is released to
+ * the buddy allocator.
  */
 void __init_memblock memblock_free(void *ptr, size_t size)
 {
@@ -975,7 +974,8 @@ void __init_memblock memblock_free(void *ptr, size_t size)
  * @size: size of the boot memory block in bytes
  *
  * Free boot memory block previously allocated by memblock_phys_alloc_xx() API.
- * The freeing memory will not be released to the buddy allocator.
+ * If called after the buddy allocator is available, the memory is released to
+ * the buddy allocator.
  */
 int __init_memblock memblock_phys_free(phys_addr_t base, phys_addr_t size)
 {
@@ -985,6 +985,9 @@ int __init_memblock memblock_phys_free(phys_addr_t base, phys_addr_t size)
 		     &base, &end, (void *)_RET_IP_);
 
 	kmemleak_free_part_phys(base, size);
+	if (slab_is_available())
+		__free_reserved_area(base, base + size, -1);
+
 	return memblock_remove_range(&memblock.reserved, base, size);
 }
 
@@ -1813,26 +1816,6 @@ void *__init __memblock_alloc_or_panic(phys_addr_t size, phys_addr_t align,
 	return addr;
 }
 
-/**
- * memblock_free_late - free pages directly to buddy allocator
- * @base: phys starting address of the  boot memory block
- * @size: size of the boot memory block in bytes
- *
- * This is only useful when the memblock allocator has already been torn
- * down, but we are still initializing the system.  Pages are released directly
- * to the buddy allocator.
- */
-void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
-{
-	phys_addr_t end = base + size - 1;
-
-	memblock_dbg("%s: [%pa-%pa] %pS\n",
-		     __func__, &base, &end, (void *)_RET_IP_);
-
-	kmemleak_free_part_phys(base, size);
-	__free_reserved_area(base, base + size, -1);
-}
-
 /*
  * Remaining API functions
  */
-- 
2.51.0



* [PATCH 6/8] memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

On architectures that keep memblock after boot, freeing of reserved memory
with free_reserved_area() is paired with an update of memblock arrays,
usually by a call to memblock_free().

Make free_reserved_area() directly update memblock.reserved when
ARCH_KEEP_MEMBLOCK is enabled.

Remove the now-redundant explicit memblock_free() call from
arm64::free_initmem() and the #ifdef CONFIG_ARCH_KEEP_MEMBLOCK block
from the generic free_initrd_mem().
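
As a rough userspace illustration of what "update memblock.reserved" means
here, the sketch below removes a physical range from a small array of
reserved regions and splits a region when the freed range falls in its
middle. The names struct region, struct region_list, MAX_REGIONS and
remove_range() are hypothetical and exist only for this sketch; it is not
how memblock_remove_range() is implemented in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 8	/* hypothetical fixed capacity for the sketch */

struct region {
	uint64_t base;
	uint64_t size;
};

struct region_list {
	struct region r[MAX_REGIONS];
	size_t cnt;
};

/*
 * Remove [base, base + size) from the list. Regions that only partly
 * overlap the range are trimmed, and a region that fully contains the
 * range is split in two. Returns 0 on success, -1 if the result would
 * not fit in MAX_REGIONS entries (the list is left unchanged then).
 */
static int remove_range(struct region_list *l, uint64_t base, uint64_t size)
{
	uint64_t end = base + size;
	struct region out[MAX_REGIONS];
	size_t n = 0, i;

	for (i = 0; i < l->cnt; i++) {
		uint64_t rb = l->r[i].base;
		uint64_t re = rb + l->r[i].size;

		if (re <= base || rb >= end) {
			/* No overlap: keep the region as is. */
			if (n == MAX_REGIONS)
				return -1;
			out[n++] = l->r[i];
			continue;
		}
		if (rb < base) {
			/* Keep the part below the removed range. */
			if (n == MAX_REGIONS)
				return -1;
			out[n++] = (struct region){ rb, base - rb };
		}
		if (re > end) {
			/* Keep the part above the removed range. */
			if (n == MAX_REGIONS)
				return -1;
			out[n++] = (struct region){ end, re - end };
		}
	}

	for (i = 0; i < n; i++)
		l->r[i] = out[i];
	l->cnt = n;
	return 0;
}
```

With a single reserved region covering 0x1000-0x5000, removing 0x2000-0x3000
leaves two regions, 0x1000-0x2000 and 0x3000-0x5000, which is the kind of
bookkeeping callers previously had to trigger themselves with memblock_free().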

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/arm64/mm/init.c | 3 ---
 init/initramfs.c     | 7 -------
 mm/memblock.c        | 6 ++++++
 3 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 96711b8578fd..07b17c708702 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -385,9 +385,6 @@ void free_initmem(void)
 	WARN_ON(!IS_ALIGNED((unsigned long)lm_init_begin, PAGE_SIZE));
 	WARN_ON(!IS_ALIGNED((unsigned long)lm_init_end, PAGE_SIZE));
 
-	/* Delete __init region from memblock.reserved. */
-	memblock_free(lm_init_begin, lm_init_end - lm_init_begin);
-
 	free_reserved_area(lm_init_begin, lm_init_end,
 			   POISON_FREE_INITMEM, "unused kernel");
 	/*
diff --git a/init/initramfs.c b/init/initramfs.c
index 139baed06589..bca0922b2850 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -652,13 +652,6 @@ void __init reserve_initrd_mem(void)
 
 void __weak __init free_initrd_mem(unsigned long start, unsigned long end)
 {
-#ifdef CONFIG_ARCH_KEEP_MEMBLOCK
-	unsigned long aligned_start = ALIGN_DOWN(start, PAGE_SIZE);
-	unsigned long aligned_end = ALIGN(end, PAGE_SIZE);
-
-	memblock_free((void *)aligned_start, aligned_end - aligned_start);
-#endif
-
 	free_reserved_area((void *)start, (void *)end, POISON_FREE_INITMEM,
 			"initrd");
 }
diff --git a/mm/memblock.c b/mm/memblock.c
index 87bd200a8cc9..9f372a8e82f7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -942,6 +942,12 @@ unsigned long free_reserved_area(void *start, void *end, int poison, const char
 		end_pa = __pa(end - 1) + 1;
 	}
 
+	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
+		if (start_pa < end_pa)
+			memblock_remove_range(&memblock.reserved,
+					      start_pa, end_pa - start_pa);
+	}
+
 	pages = __free_reserved_area(start_pa, end_pa, poison);
 	if (pages && s)
 		pr_info("Freeing %s memory: %ldK\n", s, K(pages));
-- 
2.51.0



* [PATCH 5/8] memblock: extract page freeing from free_reserved_area() into a helper
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

There are two functions that release pages to the buddy allocator late in
the boot: free_reserved_area() and memblock_free_late().

Currently they use different underlying functionality:
free_reserved_area() runs each page being freed via free_reserved_page()
and memblock_free_late() uses memblock_free_pages() -> __free_pages_core(),
but in the end they both boil down to a loop that frees a range page by
page.

Extract the loop that frees pages from free_reserved_area() into a helper
and use that helper in memblock_free_late().

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/memblock.c | 55 +++++++++++++++++++++++++++------------------------
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 27d4c9889b59..87bd200a8cc9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -893,26 +893,12 @@ int __init_memblock memblock_remove(phys_addr_t base, phys_addr_t size)
 	return memblock_remove_range(&memblock.memory, base, size);
 }
 
-unsigned long free_reserved_area(void *start, void *end, int poison, const char *s)
+static unsigned long __free_reserved_area(phys_addr_t start, phys_addr_t end,
+					  int poison)
 {
-	phys_addr_t start_pa, end_pa;
 	unsigned long pages = 0, pfn;
 
-	/*
-	 * end is the first address past the region and it may be beyond what
-	 * __pa() or __pa_symbol() can handle.
-	 * Use the address included in the range for the cnversion and add back
-	 * 1 afterwards.
-	 */
-	if (__is_kernel((unsigned long)start)) {
-		start_pa = __pa_symbol(start);
-		end_pa = __pa_symbol(end - 1) + 1;
-	} else {
-		start_pa = __pa(start);
-		end_pa = __pa(end - 1) + 1;
-	}
-
-	for_each_valid_pfn(pfn, PFN_UP(start_pa), PFN_DOWN(end_pa)) {
+	for_each_valid_pfn(pfn, PFN_UP(start), PFN_DOWN(end)) {
 		struct page *page = pfn_to_page(pfn);
 		void *direct_map_addr;
 
@@ -934,7 +920,29 @@ unsigned long free_reserved_area(void *start, void *end, int poison, const char
 		free_reserved_page(page);
 		pages++;
 	}
+	return pages;
+}
+
+unsigned long free_reserved_area(void *start, void *end, int poison, const char *s)
+{
+	phys_addr_t start_pa, end_pa;
+	unsigned long pages;
+
+	/*
+	 * end is the first address past the region and it may be beyond what
+	 * __pa() or __pa_symbol() can handle.
+	 * Use the address included in the range for the cnversion and add back
+	 * 1 afterwards.
+	 */
+	if (__is_kernel((unsigned long)start)) {
+		start_pa = __pa_symbol(start);
+		end_pa = __pa_symbol(end - 1) + 1;
+	} else {
+		start_pa = __pa(start);
+		end_pa = __pa(end - 1) + 1;
+	}
 
+	pages = __free_reserved_area(start_pa, end_pa, poison);
 	if (pages && s)
 		pr_info("Freeing %s memory: %ldK\n", s, K(pages));
 
@@ -1810,20 +1818,15 @@ void *__init __memblock_alloc_or_panic(phys_addr_t size, phys_addr_t align,
  */
 void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
 {
-	phys_addr_t cursor, end;
+	phys_addr_t end = base + size - 1;
 
-	end = base + size - 1;
 	memblock_dbg("%s: [%pa-%pa] %pS\n",
 		     __func__, &base, &end, (void *)_RET_IP_);
-	kmemleak_free_part_phys(base, size);
-	cursor = PFN_UP(base);
-	end = PFN_DOWN(base + size);
 
-	for (; cursor < end; cursor++) {
-		memblock_free_pages(cursor, 0);
-		totalram_pages_inc();
-	}
+	kmemleak_free_part_phys(base, size);
+	__free_reserved_area(base, base + size, -1);
 }
+
 /*
  * Remaining API functions
  */
-- 
2.51.0



* [PATCH 4/8] memblock: make free_reserved_area() more robust
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

There are two potential problems in free_reserved_area():
* it may free a page with a non-existent buddy page
* it may be passed a virtual address from an alias mapping that won't
  be properly translated by virt_to_page(), for example a symbol on arm64

While the first issue is quite theoretical and the second one does not manifest
itself because all the callers do the right thing, it is easy to make
free_reserved_area() robust enough to avoid these potential issues.

Replace the loop by virtual address with a loop by pfn that uses
for_each_valid_pfn() and use __pa() or __pa_symbol() depending on the
virtual mapping alias to correctly determine the loop boundaries.
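
The boundary handling can be modelled in userspace. The sketch below is only
an illustration under stated assumptions: 4 KiB pages, a made-up linear
mapping (MAP_BASE, MAP_SIZE, model_pa()) standing in for __pa(), and
lower-case pfn_up()/pfn_down() standing in for the kernel's PFN_UP() and
PFN_DOWN(). It shows why the end address is translated as "end - 1" plus 1,
and that only pages fully inside the range are visited.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12			/* assumption: 4 KiB pages */
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)

/* Hypothetical linear mapping, used only by this sketch. */
#define MAP_BASE	0x80000000ULL
#define MAP_SIZE	0x10000ULL

/* Stand-in for __pa(): valid only for addresses inside the mapping. */
static uint64_t model_pa(uint64_t va)
{
	assert(va >= MAP_BASE && va < MAP_BASE + MAP_SIZE);
	return va - MAP_BASE;
}

/* First page frame that starts at or after addr (round up). */
static uint64_t pfn_up(uint64_t addr)
{
	return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* Page frame that contains addr (round down). */
static uint64_t pfn_down(uint64_t addr)
{
	return addr >> PAGE_SHIFT;
}

/* Count the pages that lie entirely inside [start_pa, end_pa). */
static uint64_t count_whole_pages(uint64_t start_pa, uint64_t end_pa)
{
	uint64_t pfn, pages = 0;

	assert(start_pa <= end_pa);
	for (pfn = pfn_up(start_pa); pfn < pfn_down(end_pa); pfn++)
		pages++;
	return pages;
}

/*
 * end_va is the first address past the region, so it may lie outside
 * the mapping. Translate the last address inside the region and add 1.
 */
static uint64_t va_range_to_pages(uint64_t start_va, uint64_t end_va)
{
	uint64_t start_pa = model_pa(start_va);
	uint64_t end_pa = model_pa(end_va - 1) + 1;

	return count_whole_pages(start_pa, end_pa);
}
```

In this model, calling model_pa() directly on MAP_BASE + MAP_SIZE would trip
the range check, while va_range_to_pages(MAP_BASE, MAP_BASE + MAP_SIZE)
translates cleanly. A range that does not cover any whole page, such as
0x1800-0x1900, yields an empty loop rather than freeing a partial page.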

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/memblock.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 8f3010dddc58..27d4c9889b59 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -895,21 +895,32 @@ int __init_memblock memblock_remove(phys_addr_t base, phys_addr_t size)
 
 unsigned long free_reserved_area(void *start, void *end, int poison, const char *s)
 {
-	void *pos;
-	unsigned long pages = 0;
+	phys_addr_t start_pa, end_pa;
+	unsigned long pages = 0, pfn;
 
-	start = (void *)PAGE_ALIGN((unsigned long)start);
-	end = (void *)((unsigned long)end & PAGE_MASK);
-	for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
-		struct page *page = virt_to_page(pos);
+	/*
+	 * end is the first address past the region and it may be beyond what
+	 * __pa() or __pa_symbol() can handle.
+	 * Use the address included in the range for the cnversion and add back
+	 * 1 afterwards.
+	 */
+	if (__is_kernel((unsigned long)start)) {
+		start_pa = __pa_symbol(start);
+		end_pa = __pa_symbol(end - 1) + 1;
+	} else {
+		start_pa = __pa(start);
+		end_pa = __pa(end - 1) + 1;
+	}
+
+	for_each_valid_pfn(pfn, PFN_UP(start_pa), PFN_DOWN(end_pa)) {
+		struct page *page = pfn_to_page(pfn);
 		void *direct_map_addr;
 
 		/*
-		 * 'direct_map_addr' might be different from 'pos'
-		 * because some architectures' virt_to_page()
-		 * work with aliases.  Getting the direct map
-		 * address ensures that we get a _writeable_
-		 * alias for the memset().
+		 * 'direct_map_addr' might be different from the kernel virtual
+		 * address because some architectures use aliases.
+		 * Going via physical address, pfn_to_page() and page_address()
+		 * ensures that we get a _writeable_ alias for the memset().
 		 */
 		direct_map_addr = page_address(page);
 		/*
@@ -921,6 +932,7 @@ unsigned long free_reserved_area(void *start, void *end, int poison, const char
 			memset(direct_map_addr, poison, PAGE_SIZE);
 
 		free_reserved_page(page);
+		pages++;
 	}
 
 	if (pages && s)
-- 
2.51.0



* [PATCH 3/8] mm: move free_reserved_area() to mm/memblock.c
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

free_reserved_area() is related to memblock as it frees reserved memory
back to the buddy allocator, similar to what memblock_free_late() does.

Move free_reserved_area() to mm/memblock.c to prepare for further
consolidation of the functions that free reserved memory.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/memblock.c   | 37 ++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c | 36 ------------------------------------
 2 files changed, 36 insertions(+), 37 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index b3ddfdec7a80..8f3010dddc58 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -893,6 +893,42 @@ int __init_memblock memblock_remove(phys_addr_t base, phys_addr_t size)
 	return memblock_remove_range(&memblock.memory, base, size);
 }
 
+unsigned long free_reserved_area(void *start, void *end, int poison, const char *s)
+{
+	void *pos;
+	unsigned long pages = 0;
+
+	start = (void *)PAGE_ALIGN((unsigned long)start);
+	end = (void *)((unsigned long)end & PAGE_MASK);
+	for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
+		struct page *page = virt_to_page(pos);
+		void *direct_map_addr;
+
+		/*
+		 * 'direct_map_addr' might be different from 'pos'
+		 * because some architectures' virt_to_page()
+		 * work with aliases.  Getting the direct map
+		 * address ensures that we get a _writeable_
+		 * alias for the memset().
+		 */
+		direct_map_addr = page_address(page);
+		/*
+		 * Perform a kasan-unchecked memset() since this memory
+		 * has not been initialized.
+		 */
+		direct_map_addr = kasan_reset_tag(direct_map_addr);
+		if ((unsigned int)poison <= 0xFF)
+			memset(direct_map_addr, poison, PAGE_SIZE);
+
+		free_reserved_page(page);
+	}
+
+	if (pages && s)
+		pr_info("Freeing %s memory: %ldK\n", s, K(pages));
+
+	return pages;
+}
+
 /**
  * memblock_free - free boot memory allocation
  * @ptr: starting address of the  boot memory allocation
@@ -1776,7 +1812,6 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
 		totalram_pages_inc();
 	}
 }
-
 /*
  * Remaining API functions
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d4b6f1a554e..df3d61253001 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6234,42 +6234,6 @@ void adjust_managed_page_count(struct page *page, long count)
 }
 EXPORT_SYMBOL(adjust_managed_page_count);
 
-unsigned long free_reserved_area(void *start, void *end, int poison, const char *s)
-{
-	void *pos;
-	unsigned long pages = 0;
-
-	start = (void *)PAGE_ALIGN((unsigned long)start);
-	end = (void *)((unsigned long)end & PAGE_MASK);
-	for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
-		struct page *page = virt_to_page(pos);
-		void *direct_map_addr;
-
-		/*
-		 * 'direct_map_addr' might be different from 'pos'
-		 * because some architectures' virt_to_page()
-		 * work with aliases.  Getting the direct map
-		 * address ensures that we get a _writeable_
-		 * alias for the memset().
-		 */
-		direct_map_addr = page_address(page);
-		/*
-		 * Perform a kasan-unchecked memset() since this memory
-		 * has not been initialized.
-		 */
-		direct_map_addr = kasan_reset_tag(direct_map_addr);
-		if ((unsigned int)poison <= 0xFF)
-			memset(direct_map_addr, poison, PAGE_SIZE);
-
-		free_reserved_page(page);
-	}
-
-	if (pages && s)
-		pr_info("Freeing %s memory: %ldK\n", s, K(pages));
-
-	return pages;
-}
-
 void free_reserved_page(struct page *page)
 {
 	clear_page_tag_ref(page);
-- 
2.51.0



* [PATCH 2/8] powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact()
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

opal-core allocates buffers with alloc_pages_exact(), but then
marks them as reserved and frees using free_reserved_area().

This is completely unnecessary and the pages allocated with
alloc_pages_exact() can be naturally freed with free_pages_exact().

Replace freeing of memory in opalcore_cleanup() with
free_pages_exact() and simplify allocation code so that it won't mark
allocated pages as reserved.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/powerpc/platforms/powernv/opal-core.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-core.c b/arch/powerpc/platforms/powernv/opal-core.c
index e76e462f55f6..abd99ddbf21f 100644
--- a/arch/powerpc/platforms/powernv/opal-core.c
+++ b/arch/powerpc/platforms/powernv/opal-core.c
@@ -303,7 +303,6 @@ static int __init create_opalcore(void)
 	struct device_node *dn;
 	struct opalcore *new;
 	loff_t opalcore_off;
-	struct page *page;
 	Elf64_Phdr *phdr;
 	Elf64_Ehdr *elf;
 	int i, ret;
@@ -329,9 +328,6 @@ static int __init create_opalcore(void)
 		return -ENOMEM;
 	}
 	count = oc_conf->opalcorebuf_sz / PAGE_SIZE;
-	page = virt_to_page(oc_conf->opalcorebuf);
-	for (i = 0; i < count; i++)
-		mark_page_reserved(page + i);
 
 	pr_debug("opalcorebuf = 0x%llx\n", (u64)oc_conf->opalcorebuf);
 
@@ -437,10 +433,7 @@ static void opalcore_cleanup(void)
 
 	/* free the buffer used for setting up OPAL core */
 	if (oc_conf->opalcorebuf) {
-		void *end = (void *)((u64)oc_conf->opalcorebuf +
-				     oc_conf->opalcorebuf_sz);
-
-		free_reserved_area(oc_conf->opalcorebuf, end, -1, NULL);
+		free_pages_exact(oc_conf->opalcorebuf, oc_conf->opalcorebuf_sz);
 		oc_conf->opalcorebuf = NULL;
 		oc_conf->opalcorebuf_sz = 0;
 	}
-- 
2.51.0



* Re: [PATCH 00/15] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Vineeth Remanan Pillai @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Andrii Nakryiko, Peter Zijlstra, Dmitry Ilvokhin,
	Masami Hiramatsu, Ingo Molnar, Jens Axboe, io-uring,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Marcelo Ricardo Leitner,
	Xin Long, Jon Maloy, Aaron Conole, Eelco Chaudron, Ilya Maximets,
	netdev, bpf, linux-sctp, tipc-discussion, dev, Oded Gabbay,
	Koby Elbaz, dri-devel, Rafael J. Wysocki, Viresh Kumar,
	Gautham R. Shenoy, Huang Rui, Mario Limonciello, Len Brown,
	Srinivas Pandruvada, linux-pm, MyungJoo Ham, Kyungmin Park,
	Chanwoo Choi, Christian König, Sumit Semwal, linaro-mm-sig,
	Eddie James, Andrew Jeffery, Joel Stanley, linux-fsi,
	David Airlie, Simona Vetter, Alex Deucher, Danilo Krummrich,
	Matthew Brost, Philipp Stanner, Harry Wentland, Leo Li, amd-gfx,
	Jiri Kosina, Benjamin Tissoires, linux-input, Wolfram Sang,
	linux-i2c, Mark Brown, Michael Hennerich, Nuno Sá, linux-spi,
	James E.J. Bottomley, Martin K. Petersen, linux-scsi, Chris Mason,
	David Sterba, linux-btrfs, linux-trace-kernel, linux-kernel
In-Reply-To: <6ca9f884-9566-4a82-9995-4c802a0bf8a0@efficios.com>

On Tue, Mar 17, 2026 at 12:02 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2026-03-17 12:00, Steven Rostedt wrote:
> > On Fri, 13 Mar 2026 10:02:32 -0400
> > Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
> >
> >>>
> >>> Perhaps: call_trace_foo() ?
> >>>
> >> call_trace_foo has one collision with the tracepoint
> >> sched_update_nr_running and a function
> >> call_trace_sched_update_nr_running. I had considered this and later
> >> moved to trace_invoke_foo() because of the collision. But I can rename
> >> call_trace_sched_update_nr_running to something else if call_trace_foo
> >> is the general consensus.
> >
> > OK, then lets go with: trace_call__foo()
> >
> > The double underscore should prevent any name collisions.
> >
> > Does anyone have an objections?
> I'm OK with it.
>
Great thanks! I shall send a v2 with s/trace_invoke_foo/trace_call__foo/ soon.

Thanks,
Vineeth
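
For readers following the naming discussion, here is a small userspace model
of the pattern the series targets. It is a sketch only: a plain bool stands
in for the static key, the counters are hypothetical instrumentation, and
the functions are hand-written rather than generated by the tracepoint
macros. It uses the trace_call__foo() name agreed on above.

```c
#include <stdbool.h>

static bool foo_enabled;	/* stands in for the tracepoint's static key */
static int enabled_checks;	/* how many times the key was evaluated */
static int callbacks_run;	/* how many times the callbacks were invoked */

/* Model of trace_foo_enabled(): one evaluation of the key. */
static bool trace_foo_enabled(void)
{
	enabled_checks++;
	return foo_enabled;
}

/* Model of trace_call__foo(): run the callbacks without checking the key. */
static void trace_call__foo(int arg)
{
	(void)arg;
	callbacks_run++;
}

/* Model of trace_foo(): check the key, then run the callbacks. */
static void trace_foo(int arg)
{
	if (trace_foo_enabled())
		trace_call__foo(arg);
}

/* Call site before the series: the key is evaluated twice. */
static void guarded_site_old(int arg)
{
	if (trace_foo_enabled()) {
		/* ... work needed only when tracing is on ... */
		trace_foo(arg);
	}
}

/* Call site after the series: the key is evaluated once. */
static void guarded_site_new(int arg)
{
	if (trace_foo_enabled()) {
		/* ... work needed only when tracing is on ... */
		trace_call__foo(arg);
	}
}
```

With foo_enabled set, guarded_site_old() bumps enabled_checks by two and
guarded_site_new() by one, while both run the callbacks exactly once.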


* [PATCH 1/8] powerpc: fadump: pair alloc_pages_exact() with free_pages_exact()
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86
In-Reply-To: <20260318105827.1358927-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

fadump allocates buffers with alloc_pages_exact(), but then marks them
as reserved and frees using free_reserved_area().

This is completely unnecessary and the pages allocated with
alloc_pages_exact() can be naturally freed with free_pages_exact().

Replace freeing of memory in fadump_free_buffer() with
free_pages_exact() and simplify allocation code so that it won't mark
allocated pages as reserved.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/powerpc/kernel/fadump.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 4ebc333dd786..501d43bf18f3 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -775,24 +775,12 @@ void __init fadump_update_elfcore_header(char *bufp)
 
 static void *__init fadump_alloc_buffer(unsigned long size)
 {
-	unsigned long count, i;
-	struct page *page;
-	void *vaddr;
-
-	vaddr = alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO);
-	if (!vaddr)
-		return NULL;
-
-	count = PAGE_ALIGN(size) / PAGE_SIZE;
-	page = virt_to_page(vaddr);
-	for (i = 0; i < count; i++)
-		mark_page_reserved(page + i);
-	return vaddr;
+	return alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO);
 }
 
 static void fadump_free_buffer(unsigned long vaddr, unsigned long size)
 {
-	free_reserved_area((void *)vaddr, (void *)(vaddr + size), -1, NULL);
+	free_pages_exact((void *)vaddr, size);
 }
 
 s32 __init fadump_setup_cpu_notes_buf(u32 num_cpus)
-- 
2.51.0



* [PATCH 0/8] memblock: improve late freeing of reserved memory
From: Mike Rapoport @ 2026-03-18 10:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Potapenko, Alexander Viro, Andreas Larsson,
	Ard Biesheuvel, Borislav Petkov, Brendan Jackman,
	Christophe Leroy (CS GROUP), Catalin Marinas, Christian Brauner,
	David S. Miller, Dave Hansen, David Hildenbrand, Dmitry Vyukov,
	Ilias Apalodimas, Ingo Molnar, Jan Kara, Johannes Weiner,
	Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
	Marco Elver, Marek Szyprowski, Masami Hiramatsu, Michael Ellerman,
	Michal Hocko, Mike Rapoport, Nicholas Piggin, H. Peter Anvin,
	Rob Herring, Robin Murphy, Saravana Kannan, Suren Baghdasaryan,
	Thomas Gleixner, Vlastimil Babka, Will Deacon, Zi Yan, devicetree,
	iommu, kasan-dev, linux-arm-kernel, linux-efi, linux-fsdevel,
	linux-kernel, linux-mm, linux-trace-kernel, linuxppc-dev,
	sparclinux, x86

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Hi,

Following a recent discussion about leaks in x86 EFI [1], I audited usage of
memblock_free_late() and free_reserved_area() and made some improvements to
how we handle late freeing of memory allocated with memblock.

[1] https://lore.kernel.org/all/ec2aaef14783869b3be6e3c253b2dcbf67dbc12a.camel@kernel.crashing.org/

Mike Rapoport (Microsoft) (8):
  powerpc: fadump: pair alloc_pages_exact() with free_pages_exact()
  powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact()
  mm: move free_reserved_area() to mm/memblock.c
  memblock: make free_reserved_area() more robust
  memblock: extract page freeing from free_reserved_area() into a helper
  memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y
  memblock, treewide: make memblock_free() handle late freeing
  memblock: warn when freeing reserved memory before memory map is
    initialized

 arch/arm64/mm/init.c                       |   3 -
 arch/powerpc/kernel/fadump.c               |  16 +--
 arch/powerpc/platforms/powernv/opal-core.c |   9 +-
 arch/sparc/kernel/mdesc.c                  |   4 +-
 arch/x86/kernel/setup.c                    |   2 +-
 arch/x86/platform/efi/memmap.c             |   5 +-
 arch/x86/platform/efi/quirks.c             |   2 +-
 drivers/firmware/efi/apple-properties.c    |   2 +-
 drivers/of/kexec.c                         |   2 +-
 include/linux/memblock.h                   |   2 -
 init/initramfs.c                           |   7 --
 kernel/dma/swiotlb.c                       |   6 +-
 lib/bootconfig.c                           |   2 +-
 mm/internal.h                              |  10 ++
 mm/kfence/core.c                           |   4 +-
 mm/memblock.c                              | 110 ++++++++++++++-------
 mm/page_alloc.c                            |  46 ---------
 17 files changed, 102 insertions(+), 130 deletions(-)


base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
--
2.51.0


* Re: [PATCH v3 0/8] RDMA: Enable operation with DMA debug enabled
From: Leon Romanovsky @ 2026-03-18  8:18 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Robin Murphy, Michael S. Tsirkin, Petr Tesarik, Jonathan Corbet,
	Shuah Khan, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Jason Gunthorpe, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Joerg Roedel, Will Deacon, Andrew Morton,
	iommu, linux-kernel, linux-doc, virtualization, linux-rdma,
	linux-trace-kernel, linux-mm
In-Reply-To: <de23ccf6-75ef-48af-8c69-2f416c564f2d@samsung.com>

On Wed, Mar 18, 2026 at 09:03:00AM +0100, Marek Szyprowski wrote:
> Hi Leon,
> 
> On 17.03.2026 20:05, Leon Romanovsky wrote:
> > On Mon, Mar 16, 2026 at 09:06:44PM +0200, Leon Romanovsky wrote:
> >> Add a new DMA_ATTR_REQUIRE_COHERENT attribute to the DMA API to mark
> >> mappings that must run on a DMA‑coherent system. Such buffers cannot
> >> use the SWIOTLB path, may overlap with CPU caches, and do not depend on
> >> explicit cache flushing.
> >>
> >> Mappings using this attribute are rejected on systems where cache
> >> side‑effects could lead to data corruption, and therefore do not need
> >> the cache‑overlap debugging logic. This series also includes fixes for
> >> DMA_ATTR_CPU_CACHE_CLEAN handling.
> >> Thanks.
> > <...>
> >
> >> ---
> >> Leon Romanovsky (8):
> >>        dma-debug: Allow multiple invocations of overlapping entries
> >>        dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
> >>        dma-mapping: Clarify valid conditions for CPU cache line overlap
> >>        dma-mapping: Introduce DMA require coherency attribute
> >>        dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
> >>        iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
> >>        RDMA/umem: Tell DMA mapping that UMEM requires coherency
> >>        mm/hmm: Indicate that HMM requires DMA coherency
> >>
> >>   Documentation/core-api/dma-attributes.rst | 38 ++++++++++++++++++++++++-------
> >>   drivers/infiniband/core/umem.c            |  5 ++--
> >>   drivers/iommu/dma-iommu.c                 | 21 +++++++++++++----
> >>   drivers/virtio/virtio_ring.c              | 10 ++++----
> >>   include/linux/dma-mapping.h               | 15 ++++++++----
> >>   include/trace/events/dma.h                |  4 +++-
> >>   kernel/dma/debug.c                        |  9 ++++----
> >>   kernel/dma/direct.h                       |  7 +++---
> >>   kernel/dma/mapping.c                      |  6 +++++
> >>   mm/hmm.c                                  |  4 ++--
> >>   10 files changed, 86 insertions(+), 33 deletions(-)
> > Marek,
> >
> > Despite the "RDMA ..." tag in the subject, the diffstat clearly shows that
> > you are the appropriate person to take this patch.
> 
> I plan to take the first 2 patches to the dma-mapping-fixes branch
> (v7.0-rc) and the next to dma-mapping-for-next. Should I also take the
> RDMA and HMM patches, or do you want a stable branch for merging them
> via the respective subsystem trees?

I suggest taking all patches into the -fixes branch, as the "RDMA/..." patch
also resolves the dmesg splat. With -fixes, there is no need to worry about
a shared branch since we do not expect merge conflicts in that area.

If you still prefer to split the series between -fixes and -next, it would be
better to use a shared branch in that case. There are patches on the RDMA
list targeted for -next that touch ib_umem_get().

Thanks

> 
> Best regards
> -- 
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland
> 
> 


* Re: [PATCH v3 0/8] RDMA: Enable operation with DMA debug enabled
From: Marek Szyprowski @ 2026-03-18  8:03 UTC (permalink / raw)
  To: Leon Romanovsky, Robin Murphy, Michael S. Tsirkin, Petr Tesarik,
	Jonathan Corbet, Shuah Khan, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Jason Gunthorpe, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Joerg Roedel, Will Deacon,
	Andrew Morton
  Cc: iommu, linux-kernel, linux-doc, virtualization, linux-rdma,
	linux-trace-kernel, linux-mm
In-Reply-To: <20260317190538.GD61385@unreal>

Hi Leon,

On 17.03.2026 20:05, Leon Romanovsky wrote:
> On Mon, Mar 16, 2026 at 09:06:44PM +0200, Leon Romanovsky wrote:
>> Add a new DMA_ATTR_REQUIRE_COHERENT attribute to the DMA API to mark
>> mappings that must run on a DMA‑coherent system. Such buffers cannot
>> use the SWIOTLB path, may overlap with CPU caches, and do not depend on
>> explicit cache flushing.
>>
>> Mappings using this attribute are rejected on systems where cache
>> side‑effects could lead to data corruption, and therefore do not need
>> the cache‑overlap debugging logic. This series also includes fixes for
>> DMA_ATTR_CPU_CACHE_CLEAN handling.
>> Thanks.
> <...>
>
>> ---
>> Leon Romanovsky (8):
>>        dma-debug: Allow multiple invocations of overlapping entries
>>        dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
>>        dma-mapping: Clarify valid conditions for CPU cache line overlap
>>        dma-mapping: Introduce DMA require coherency attribute
>>        dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
>>        iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
>>        RDMA/umem: Tell DMA mapping that UMEM requires coherency
>>        mm/hmm: Indicate that HMM requires DMA coherency
>>
>>   Documentation/core-api/dma-attributes.rst | 38 ++++++++++++++++++++++++-------
>>   drivers/infiniband/core/umem.c            |  5 ++--
>>   drivers/iommu/dma-iommu.c                 | 21 +++++++++++++----
>>   drivers/virtio/virtio_ring.c              | 10 ++++----
>>   include/linux/dma-mapping.h               | 15 ++++++++----
>>   include/trace/events/dma.h                |  4 +++-
>>   kernel/dma/debug.c                        |  9 ++++----
>>   kernel/dma/direct.h                       |  7 +++---
>>   kernel/dma/mapping.c                      |  6 +++++
>>   mm/hmm.c                                  |  4 ++--
>>   10 files changed, 86 insertions(+), 33 deletions(-)
> Marek,
>
> Despite the "RDMA ..." tag in the subject, the diffstat clearly shows that
> you are the appropriate person to take this patch.

I plan to take the first 2 patches to the dma-mapping-fixes branch
(v7.0-rc) and the next to dma-mapping-for-next. Should I also take the
RDMA and HMM patches, or do you want a stable branch for merging them
via the respective subsystem trees?

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



* Re: [RFC] Coding style consequences for multi-line statements?
From: Markus Elfring @ 2026-03-18  7:30 UTC (permalink / raw)
  To: Steven Rostedt, kernel-janitors, linux-doc, linux-trace-kernel
  Cc: Josh Law, Andrew Morton, Masami Hiramatsu, LKML
In-Reply-To: <20260317111026.62345f9e@gandalf.local.home>

> The brackets *are* appropriate. The rule of omitting the brackets is for
> *single line* statements. The above return statement is long and there's a
> line break, which means, curly brackets *are* required for visibility reasons.

Would any contributors like to clarify and adjust development documentation accordingly?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v7.0-rc4#n197

Regards,
Markus


* Re: [PATCH v4] lib/bootconfig: guard xbc_node_compose_key_after() buffer size
From: Masami Hiramatsu @ 2026-03-18  3:07 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Josh Law, Andrew Morton, linux-kernel, linux-trace-kernel
In-Reply-To: <20260317204327.3c61d0ea@robin>

On Tue, 17 Mar 2026 20:43:27 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 18 Mar 2026 09:02:43 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > Yes, that is checked in vsnprintf(), not its caller.
> > I think the Linux kernel should ensure the return value is smaller
> > than INT_MAX, and return -EOVERFLOW if not.
> 
> Well, there's very few places that could have a buffer size of > 2G.
> 
> What's the max bootconfig limit? Could you create a bootconfig that is
> greater than 2G?

It's just 32KB. So we don't need it.
Anyway, I sent a patch about that. 

https://lore.kernel.org/all/177379678638.535490.18200744206158553364.stgit@devnote2/

Thank you,

> 
> If not, then yeah, we shouldn't really care about overflows (and that
> includes not worrying about typecasting the size variable to int).
> 
> -- Steve
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v4] lib/bootconfig: guard xbc_node_compose_key_after() buffer size
From: Steven Rostedt @ 2026-03-18  0:43 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Josh Law, Andrew Morton, linux-kernel, linux-trace-kernel
In-Reply-To: <20260318090243.7c437f2c5e07a1ce00375102@kernel.org>

On Wed, 18 Mar 2026 09:02:43 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Yes, that is checked in vsnprintf(), not its caller.
> I think the Linux kernel should ensure the return value is smaller
> than INT_MAX, and return -EOVERFLOW if not.

Well, there's very few places that could have a buffer size of > 2G.

What's the max bootconfig limit? Could you create a bootconfig that is
greater than 2G?

If not, then yeah, we shouldn't really care about overflows (and that
includes not worrying about typecasting the size variable to int).

-- Steve


* Re: [PATCH v4] lib/bootconfig: guard xbc_node_compose_key_after() buffer size
From: Masami Hiramatsu @ 2026-03-18  0:02 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Josh Law, Andrew Morton, linux-kernel, linux-trace-kernel
In-Reply-To: <20260317191626.5b6172a9@robin>

On Tue, 17 Mar 2026 19:16:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 18 Mar 2026 08:03:51 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > On Tue, 17 Mar 2026 20:44:03 +0000
> > Josh Law <objecting@objecting.org> wrote:
> > 
> > > xbc_node_compose_key_after() passes a size_t buffer length to
> > > snprintf(), but snprintf() returns int. Guard against size values above
> > > INT_MAX before the loop so the existing truncation check can continue to
> > > compare ret against (int)size safely.
> > > 
> > > Add a small WARN_ON_ONCE shim for the tools/bootconfig userspace build
> > > so the same source continues to build there.  
> > 
> > NACK.
> > 
> > Don't do such over engineering effort.
> 
> Hi Masami,
> 
> This was somewhat my idea. Why do you think it's over engineering?
> 
> This is your code, so you have final say. I'm not going to push it. I'm
> just curious to your thoughts.

I sent a mail explaining why I think this is over-engineering. I think it
comes from the vsnprintf() interface design. It is not fair if every user
of it needs to do this. It should be checked in vsnprintf(),
and the caller should just check the returned error.

> 
> It is interesting that snprintf() takes a size_t size, and the iterator
> inside is also size_t, but then it returns the value as an int.

Yes, that is checked in vsnprintf(), not its caller.
I think the Linux kernel should ensure the return value is smaller
than INT_MAX, and return -EOVERFLOW if not.

Thank you,

> 
> That itself just looks wrong (and has nothing to do with your code).
> 
> -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
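
A minimal userspace sketch of the behaviour proposed above, under stated
assumptions: checked_snprintf() is a hypothetical wrapper name, not an
existing kernel or libc function, and the proposal in this thread is to do
the check inside vsnprintf() itself rather than in a wrapper. It only
illustrates rejecting a size that the int return value cannot represent,
so that callers just test the returned error:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper for illustration only. Returns -EOVERFLOW when the
 * buffer size is larger than what the int return type can describe;
 * otherwise behaves like snprintf().
 */
static int checked_snprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int ret;

	assert(fmt != NULL);

	/* Reject sizes above INT_MAX before any formatting happens. */
	if (size > (size_t)INT_MAX)
		return -EOVERFLOW;

	va_start(args, fmt);
	ret = vsnprintf(buf, size, fmt, args);
	va_end(args);

	return ret;
}
```

With a check like this in one place, a caller such as
xbc_node_compose_key_after() would only need its existing "ret < 0" test.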


* Re: [PATCH] blk-mq: add tracepoint block_rq_tag_wait
From: Damien Le Moal @ 2026-03-17 23:38 UTC (permalink / raw)
  To: Aaron Tomlin, axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: johannes.thumshirn, kch, bvanassche, ritesh.list, neelx, sean,
	mproche, linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260317182835.258183-1-atomlin@atomlin.com>

On 2026/03/18 3:28, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices (SSDs) are starved of hardware
> tags when sharing the same blk_mq_tag_set.
> 
> Currently, diagnosing this specific hardware queue contention is
> difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
> forces the current thread to block uninterruptible via io_schedule().
> While this can be inferred via sched:sched_switch or dynamically
> traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> dedicated, out-of-the-box observability for this event.
> 
> This patch introduces the block_rq_tag_wait static tracepoint in
> the tag allocation slow-path. It triggers immediately before the
> thread yields the CPU, exposing the exact hardware context (hctx)
> that is starved, the total pool size, and the current active request
> count.
> 
> This provides storage engineers and performance monitoring agents
> with a zero-configuration, low-overhead mechanism to definitively
> identify shared-tag bottlenecks and tune I/O schedulers or cgroup
> throttling accordingly.
> 
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>

Looks OK to me, but I have some suggestions below.

> ---
>  block/blk-mq-tag.c           |  3 +++
>  include/trace/events/block.h | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 33946cdb5716..f50993e86ca5 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -13,6 +13,7 @@
>  #include <linux/kmemleak.h>
>  
>  #include <linux/delay.h>
> +#include <trace/events/block.h>
>  #include "blk.h"
>  #include "blk-mq.h"
>  #include "blk-mq-sched.h"
> @@ -187,6 +188,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  		if (tag != BLK_MQ_NO_TAG)
>  			break;
>  
> +		trace_block_rq_tag_wait(data->q, data->hctx);
> +
>  		bt_prev = bt;
>  		io_schedule();
>  
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 6aa79e2d799c..48e2ba433c87 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -226,6 +226,42 @@ DECLARE_EVENT_CLASS(block_rq,
>  		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
>  );
>  
> +/**
> + * block_rq_tag_wait - triggered when an I/O request is starved of a tag

when an I/O request -> when a request

> + * @q: queue containing the request

request queue of the target device

("containing" is odd here)

> + * @hctx: hardware context (queue) experiencing starvation

hardware context of the request

> + *
> + * Called immediately before the submitting thread is forced to block due

the submitting thread -> the submitting context

> + * to the exhaustion of available hardware tags. This tracepoint indicates

s/tracepoint/trace point

> + * that the thread will be placed into an uninterruptible state via

s/thread/context

> + * io_schedule() until an active block I/O operation completes and
> + * relinquishes its assigned tag.

until an active request completes

(BIOs do not have tags).

> + */
> +TRACE_EVENT(block_rq_tag_wait,
> +
> +	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx),
> +
> +	TP_ARGS(q, hctx),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( u32,		hctx_id			)
> +		__field( u32,		nr_tags			)
> +		__field( u32,		active_requests		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		  = q->disk ? disk_devt(q->disk) : 0;

I do not think that q->disk can ever be NULL when there is a request being
submitted.

> +		__entry->hctx_id	  = hctx ? hctx->queue_num : 0;
> +		__entry->nr_tags	  = hctx && hctx->tags ? hctx->tags->nr_tags : 0;
> +		__entry->active_requests  = hctx ? atomic_read(&hctx->nr_active) : 0;
> +	),
> +
> +	TP_printk("%d,%d hctx=%u starved (active=%u/%u)",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->hctx_id, __entry->active_requests, __entry->nr_tags)
> +);
> +
>  /**
>   * block_rq_insert - insert block operation request into queue
>   * @rq: block IO operation request


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v6 16/17] lib/bootconfig: fix sign-compare in xbc_node_compose_key_after()
From: Josh Law @ 2026-03-17 23:18 UTC (permalink / raw)
  To: Masami Hiramatsu, Steven Rostedt
  Cc: Andrew Morton, linux-kernel, linux-trace-kernel
In-Reply-To: <20260318081540.44c164f2c67d80acf14eaf2e@kernel.org>



On 17 March 2026 23:15:40 GMT, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>On Tue, 17 Mar 2026 12:15:07 -0400
>Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> On Tue, 17 Mar 2026 16:55:49 +0900
>> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>> 
>> > > --- a/lib/bootconfig.c
>> > > +++ b/lib/bootconfig.c
>> > > @@ -319,10 +319,10 @@ int __init xbc_node_compose_key_after(struct xbc_node *root,
>> > >  			       depth ? "." : "");
>> > >  		if (ret < 0)
>> > >  			return ret;
>> > > -		if (ret >= size) {
>> > > +		if (ret >= (int)size) {  
>> > 
>> > nit:
>> > 
>> > 	if ((size_t)ret >= size) {
>> > 
>> > because sizeof(size_t) > sizeof(int).
>> 
>> I don't think we need to worry about this. But this does bring up an issue.
>> ret comes from:
>> 
>> 		ret = snprintf(buf, size, "%s%s", xbc_node_get_data(node),
>> 			       depth ? "." : "");
>> 
>> Where size is of type size_t
>> 
>> snprintf() takes size_t but returns int.
>> 
>> snprintf() calls vsnprintf() which has:
>> 
>> 	size_t len, pos;
>> 
>> Where pos is incremented based on fmt, and vsnprintf() returns:
>> 
>> 	return pos;
>> 
>> Which can overflow.
>
>I think that is vsnprintf() (maybe POSIX) design issue.
>I believe we're simply using the size_t to represent size of memory
>out of convention.
>
>> 
>> Now, honestly, we should never have a 2Gig string as that would likely
>> cause other horrible things. Does size really need to be size_t?
>
>Even if so, it should be done in vsnprintf() instead of this.
>This function just trusts that the caller gives the correct size
>and a large enough buffer. Otherwise, we need to check "INT_MAX > size"
>everywhere.
>
>> 
>> Perhaps we should have:
>> 
>> 	if (WARN_ON_ONCE(size > INT_MAX))
>> 		return -EINVAL;
>
>I think this is an over engineering effort especially in
>caller side. This overflow should be checked in vsnprintf() and
>should return -EINVAL. (and the caller checks the return value.)
>
>Thank you,
>
>> 
>> ?
>> 
>> -- Steve
>> 
>
>


I submitted v7, which drops all of those patches anyway; v7 should be perfect now.


V/R


Josh Law


* Re: [PATCH v4] lib/bootconfig: guard xbc_node_compose_key_after() buffer size
From: Steven Rostedt @ 2026-03-17 23:16 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Josh Law, Andrew Morton, linux-kernel, linux-trace-kernel
In-Reply-To: <20260318080351.dae637f4b5909bd9f81b27d2@kernel.org>

On Wed, 18 Mar 2026 08:03:51 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Tue, 17 Mar 2026 20:44:03 +0000
> Josh Law <objecting@objecting.org> wrote:
> 
> > xbc_node_compose_key_after() passes a size_t buffer length to
> > snprintf(), but snprintf() returns int. Guard against size values above
> > INT_MAX before the loop so the existing truncation check can continue to
> > compare ret against (int)size safely.
> > 
> > Add a small WARN_ON_ONCE shim for the tools/bootconfig userspace build
> > so the same source continues to build there.  
> 
> NACK.
> 
> Don't do such over engineering effort.

Hi Masami,

This was somewhat my idea. Why do you think it's over engineering?

This is your code, so you have final say. I'm not going to push it. I'm
just curious about your thoughts.

It is interesting that snprintf() takes a size_t size, and the iterator
inside is also size_t, but then it returns the value as an int.

That itself just looks wrong (and has nothing to do with your code).

-- Steve
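
For readers following the size_t versus int point, here is a small
userspace sketch of the truncation check under discussion. It is not the
kernel function: compose_key() and its parameters are hypothetical names,
and the loop body is only modelled on the snprintf() call quoted in this
thread. Because "ret < 0" is handled first, casting ret to size_t cannot
wrap, so the comparison is done in the wider type:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: join n words with '.' into buf and return the
 * length the full key needs, like snprintf() does, even if truncated.
 */
static int compose_key(char *buf, size_t size, const char *const *words, int n)
{
	int total = 0;

	assert(words != NULL);

	for (int i = 0; i < n; i++) {
		int ret = snprintf(buf, size, "%s%s", words[i],
				   i + 1 < n ? "." : "");
		if (ret < 0)
			return ret;
		/* ret is non-negative here, so (size_t)ret is safe. */
		if ((size_t)ret >= size) {
			size = 0;	/* truncated: keep counting only */
		} else {
			size -= (size_t)ret;
			buf += ret;
		}
		total += ret;
	}
	return total;
}
```

A caller compares the returned total with its buffer size to detect
truncation, the same way it would with a single snprintf() call.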


