* Re: [PATCH] kernel/trace/ftrace: introduce ftrace module notifier
From: Song Chen @ 2026-04-12 14:10 UTC (permalink / raw)
To: Petr Mladek
Cc: Steven Rostedt, Miroslav Benes, mcgrof, petr.pavlu, da.gomez,
samitolvanen, atomlin, mhiramat, mark.rutland, mathieu.desnoyers,
linux-modules, linux-kernel, linux-trace-kernel, live-patching
In-Reply-To: <aaqk-GrpCTqO36xj@pathway.suse.cz>
Hi,
在 2026/3/6 17:57, Petr Mladek 写道:
> On Fri 2026-02-27 09:34:59, Song Chen wrote:
>> Hi,
>>
>> 在 2026/2/27 01:30, Steven Rostedt 写道:
>>> On Thu, 26 Feb 2026 11:51:53 +0100 (CET)
>>> Miroslav Benes <mbenes@suse.cz> wrote:
>>>
>>>>> Let me see if there is any way to use notifier and remain below calling
>>>>> sequence:
>>>>>
>>>>> ftrace_module_enable
>>>>> klp_module_coming
>>>>> blocking_notifier_call_chain_robust(MODULE_STATE_COMING)
>>>>>
>>>>> blocking_notifier_call_chain(MODULE_STATE_GOING)
>>>>> klp_module_going
>>>>> ftrace_release_mod
>>>>
>>>> Both klp and ftrace used module notifiers in the past. We abandoned that
>>>> and opted for direct calls due to issues with ordering at the time. I do
>>>> not have the list of problems at hand but I remember it was very fragile.
>>>>
>>>> See commits 7dcd182bec27 ("ftrace/module: remove ftrace module
>>>> notifier"), 7e545d6eca20 ("livepatch/module: remove livepatch module
>>>> notifier") and their surroundings.
>>>>
>>>> So unless there is a reason for the change (which should be then carefully
>>>> reviewed and properly tested), I would prefer to keep it as is. What is
>>>> the motivation? I am failing to find it in the commit log.
>>
>> There is no special motivation, i just read btf initialization in module
>> loading and found direct calls of ftrace and klp, i thought they were just
>> forgotten to use notifier and i even didn't search git log to verify, sorry
>> about that.
>>
>>>
>>> Honestly, I do think just decoupling ftrace and live kernel patching from
>>> modules is rationale enough, as it makes the code a bit cleaner. But to do
>>> so, we really need to make sure there is absolutely no regressions.
>>>
>>> Thus, to allow such a change, I would ask those that are proposing it, show
>>> a full work flow of how ftrace, live kernel patching, and modules work with
>>> each other and why those functions are currently injected in the module code.
>>>
>>> As Miroslav stated, we tried to do it via notifiers in the past and it
>>> failed. I don't want to find out why they failed by just adding them back
>>> to notifiers again. Instead, the reasons must be fully understood and
>>> updates made to make sure they will not fail in the future.
>>
>> Yes, you are right, i read commit msg of 7dcd182bec27, this patch just
>> reverses it simply and will introduce order issue back. I will try to find
>> out the problem in the past at first.
>
> AFAIK, the root of the problem is that livepatch uses the ftrace
> framework. It means that:
>
> + ftrace must be initialized before livepatch gets enabled
> + livepatch must be disabled before ftrace support gets removed
>
> My understanding is that this can't be achieved by notifiers easily
> because they are always proceed in the same order.
>
> An elegant solution would be to introduce notifier_reverse_call_chain()
> which would process the callbacks in the reverse order. But it might
> be non-trivial:
>
> + We would need to make sure that it does not break some
> existing "hidden" dependencies.
>
Thanks so much, this is the solution i'm working on. I replaced next
with a list_head in notifier_block and implemented
anotifier_call_chain_reverse to address the order issues, like your
suggestion. And a new robust revision for rolling back.
+static int notifier_call_chain_reverse(struct list_head *nl,
+ struct notifier_block *start,
+ unsigned long val, void *v,
+ int nr_to_call, int *nr_calls)
+{
+ int ret = NOTIFY_DONE;
+ struct notifier_block *nb;
+ bool do_call = (start == NULL);
+
+ if (!nr_to_call)
+ return ret;
+
+ list_for_each_entry_reverse(nb, nl, entry) {
+ if (!do_call) {
+ if (nb == start)
+ do_call = true;
+ continue;
+ }
+#ifdef CONFIG_DEBUG_NOTIFIERS
+ if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
+ WARN(1, "Invalid notifier called!");
+ continue;
+ }
+#endif
+ trace_notifier_run((void *)nb->notifier_call);
+ ret = nb->notifier_call(nb, val, v);
+
+ if (nr_calls)
+ (*nr_calls)++;
+
+ if (ret & NOTIFY_STOP_MASK)
+ break;
+
+ if (nr_to_call-- == 0)
+ break;
+ }
+ return ret;
+}
+NOKPROBE_SYMBOL(notifier_call_chain_reverse);
I'll send the patches for review soon.
Best regards
Song
> + notifier_call_chain() uses RCU to process the list of registered
> callbacks. I am not sure how complicated would be to make it safe
> in both directions.
>
> Best Regards,
> Petr
>
>
^ permalink raw reply
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Yafang Shao @ 2026-04-12 13:50 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt,
mathieu.desnoyers, kpsingh, mattbobrowski, song, jolsa, ast,
daniel, andrii, martin.lau, eddyz87, memxor, yonghong.song,
live-patching, linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <20260410133844.56ab7964da7628d1c3482acb@kernel.org>
On Fri, Apr 10, 2026 at 12:38 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> Hi Yafang,
>
> On Thu, 2 Apr 2026 17:26:03 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > Livepatching allows for rapid experimentation with new kernel features
> > without interrupting production workloads. However, static livepatches lack
> > the flexibility required to tune features based on task-specific attributes,
> > such as cgroup membership, which is critical in multi-tenant k8s
> > environments. Furthermore, hardcoding logic into a livepatch prevents
> > dynamic adjustments based on the runtime environment.
> >
> > To address this, we propose a hybrid approach using BPF. Our production use
> > case involves:
> >
> > 1. Deploying a Livepatch function to serve as a stable BPF hook.
> >
> > 2. Utilizing bpf_override_return() to dynamically modify the return value
> > of that hook based on the current task's context.
>
> First of all, I don't like this approach to test a new feature in the
> kernel, because it sounds like allowing multiple different generations
> of implementations to coexist simultaneously. The standard kernel code
> is not designed to withstand such implementations.
However, this approach is invaluable for rapidly deploying new kernel
features to production servers without downtime. Upgrading kernels
across a large fleet remains a significant challenge.
>
> For example, if you implement a well-designed framework in a specific
> subsystem, like Schedext, which allows multiple implementations extended
> with BPF to coexist, there's no problem (at least it's debatable).
>
> But if it is for any function, it is dangerous feature. Bugs that occur
> in kernels that use this functionality cannot be addressed here. They
> need to be treated the same way as out-of-tree drivers or forked kernels.
> I mean, add a tainted flag for this feature. And we don't care of it.
Agreed. This should be handled as an OOT module rather than part of
the core kernel.
>
> >
> > A significant challenge arises when atomic-replace is enabled. In this
> > mode, deploying a new livepatch changes the target function's address,
> > forcing a re-attachment of the BPF program. This re-attachment latency is
> > unacceptable in critical paths, such as those handling networking policies.
> >
> > To solve this, we introduce a hybrid livepatch mode that allows specific
> > patches to remain non-replaceable, ensuring the function address remains
> > stable and the BPF program stays attached.
>
> Can you share your actual problem to be solved?
Here is an example we recently deployed on our production servers:
https://lore.kernel.org/bpf/CALOAHbDnNba_w_nWH3-S9GAXw0+VKuLTh1gy5hy9Yqgeo4C0iA@mail.gmail.com/
In one of our specific clusters, we needed to send BGP traffic out
through specific NICs based on the destination IP. To achieve this
without interrupting service, we live-patched
bond_xmit_3ad_xor_slave_get(), added a new hook called
bond_get_slave_hook(), and then ran a BPF program attached to that
hook to select the outgoing NIC from the SKB. This allowed us to
rapidly deploy the feature with zero downtime.
[...]
--
Regards
Yafang
^ permalink raw reply
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Yafang Shao @ 2026-04-12 13:30 UTC (permalink / raw)
To: Miroslav Benes
Cc: Song Liu, jpoimboe, jikos, pmladek, joe.lawrence, rostedt,
mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski, jolsa, ast,
daniel, andrii, martin.lau, eddyz87, memxor, yonghong.song,
live-patching, linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <alpine.LSU.2.21.2604091205250.31526@pobox.suse.cz>
On Thu, Apr 9, 2026 at 6:08 PM Miroslav Benes <mbenes@suse.cz> wrote:
>
> > Can we add something like ALLOW_LIVEPATCH_ERROR_INJECTION() to allow
> > error injection on functions defined inside a livepatch?
>
> No.
>
> I am sorry but you always seem to find band aids to your set up and how
> you deal with live patches internally. While I can see that something like
> a hybrid mode might be useful to people if done right (and we are not
> there yet), the combination of it with bpf overrides or anything like that
> is not something I would like to see in upstream.
The upstream kernel already allows for combining BPF and livepatch to
override functions. Song’s patch offers a great reference for
implementing this without changing the kernel:
https://lore.kernel.org/bpf/20260408175217.1011024-1-song@kernel.org/
--
Regards
Yafang
^ permalink raw reply
* Re: [RFC PATCH 2/4] trace: Allow kprobes to override livepatched functions
From: Yafang Shao @ 2026-04-12 13:08 UTC (permalink / raw)
To: Miroslav Benes
Cc: jpoimboe, jikos, pmladek, joe.lawrence, rostedt, mhiramat,
mathieu.desnoyers, kpsingh, mattbobrowski, song, jolsa, ast,
daniel, andrii, martin.lau, eddyz87, memxor, yonghong.song,
live-patching, linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <alpine.LSU.2.21.2604091145340.31526@pobox.suse.cz>
On Thu, Apr 9, 2026 at 5:47 PM Miroslav Benes <mbenes@suse.cz> wrote:
>
> Hi,
>
> On Thu, 2 Apr 2026, Yafang Shao wrote:
>
> > Introduce the ability for kprobes to override the return values of
> > functions that have been livepatched. This functionality is guarded by the
> > CONFIG_KPROBE_OVERRIDE_KLP_FUNC configuration option.
>
> this is imprecise if I read the code correctly. You want to override live
> patch functions, not the original ones which are live patched.
Correct. The BPF program will override the livepatched functions
rather than the original ones.
>
> I also think that if nothing else, it needs to be more specific then just
> checking mod->klp. It should check if a function itself in klp module is
> overridable to keep it as limited as possible.
That is exactly what I am currently implementing in
trace_kprobe_klp_func_overridable().
> Even with that, the
> concerns expressed by the others still apply.
--
Regards
Yafang
^ permalink raw reply
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Yafang Shao @ 2026-04-12 12:18 UTC (permalink / raw)
To: Petr Mladek
Cc: Song Liu, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <addW_-whBavyHY-Z@pathway.suse.cz>
On Thu, Apr 9, 2026 at 3:36 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Wed 2026-04-08 11:19:50, Song Liu wrote:
> > On Wed, Apr 8, 2026 at 4:43 AM Petr Mladek <pmladek@suse.com> wrote:
> > [...]
> > > > >
> > > > > This is weird semantic. Which livepatch tag would be allowed to
> > > > > supersede it, please?
> > > > >
> > > > > Do we still need this category?
> > > >
> > > > It can be superseded by any livepatch that has a non-zero tag set.
> > >
> > > And this exactly the weird thing.
> > >
> > > A patch with the .replace flag set is supposed to obsolete all already
> > > installed livepatches. It means that it should provide all existing
> > > fixes and features.
> > >
> > > Now, we want to introduce a replace flag/set which would allow to
> > > replace/obsolete only the livepatch with the same tag/set number.
> > > And we want to prevent conflicts by making sure that livepatches with
> > > different tag/set number will never livepatch the same function.
> > >
> > > Obviously, livepatches with different tag/set number could not
> > > obsolete the same no-replace livepatch. They would need to livepatch
> > > the same functions touched by the no-replace livepatch and would
> > > conflict.
> > >
> > > So, I suggest to remove the no-replace mode completely. It should
> > > not be needed. A livepatch which should be installed in parallel
> > > will simply use another unique tag/set number.
> >
> > I think I see your point now. Existing code works as:
> > - replace=false doesn't replace anything
> > - replace=true replaces everything
> >
> > If we assume false=0 and true=1, it is technically possible to define:
> > - replace_set=0 doesn't replace anything
> > - replace_set=1 replaces everything
> > - replace_set=2+ only replace the same replace_set
>
> Yes. This well describes my point.
>
> > This is probably a little too complicated.
> >
> > > > This ensures backward compatibility: while a non-atomic-replace
> > > > livepatch can be superseded by an atomic-replace one, the reverse is
> > > > not permitted—an atomic-replace livepatch cannot be superseded by a
> > > > non-atomic one.
> > >
> > > IMHO, the backward compatibility would just create complexity and mess
> > > in this case.
> >
> > Given that livepatch is for expert users, I think we can make this work
> > without backward compatibility. But breaking compatibility is always not
> > preferred.
>
> I believe that it is acceptable because:
>
> 1. It was always hard to combine no-replace and replace livepatches.
> I wonder if anyone combines them at all.
Because 'replace' patches can supersede 'no-replace' ones, users have
to maintain a strict loading order. I doubt anyone actually combines
them in production.
>
> 2. I believe that nobody tries to load the same livepatch module on
> different kernel versions. Instead, everyone prepares a custom
> livepatch module for each livepatched kernel version/release.
Correct. We always build and apply distinct livepatches for each
specific kernel version.
>
> And the tooling for creating livepatches will need to be updated
> to use "number" instead of "true/false" anyway.
>
> That said, it is easier to always use "0" for non-replace patches
> instead of assigning an unique "number" to avoid replacing. But
> I do not think that this would justify the complexity of having
> different semantic for 0, 1, and 2+ replace_set numbers.
Fair enough. Let's drop backward compatibility to keep the
implementation simple.
--
Regards
Yafang
^ permalink raw reply
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Yafang Shao @ 2026-04-12 12:09 UTC (permalink / raw)
To: Petr Mladek
Cc: Song Liu, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <adY_WgA54CDtWBq6@pathway.suse.cz>
On Wed, Apr 8, 2026 at 7:43 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Wed 2026-04-08 10:40:10, Yafang Shao wrote:
> > On Tue, Apr 7, 2026 at 11:08 PM Petr Mladek <pmladek@suse.com> wrote:
> > >
> > > On Tue 2026-04-07 17:45:31, Yafang Shao wrote:
> > > > On Tue, Apr 7, 2026 at 11:16 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Tue, Apr 7, 2026 at 10:54 AM Song Liu <song@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
> > > > > > [...]
> > > > > > > > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > > > > > > > are replaceable.
> > > > > > > > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > > > > > > > and are not replaceable.
> > > > > > > > > >
> > > > > > > > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > > > > > > > still release a cumulative livepatch to replace the previous cumulative
> > > > > > > > > > livepatch. Is this the expected use case?
> > > > > > > > >
> > > > > > > > > That matches our expected use case.
> > > > > > > >
> > > > > > > > If we really want to serve use cases like this, I think we can introduce
> > > > > > > > some replace tag concept: Each livepatch will have a tag, u32 number.
> > > > > > > > Newly loaded livepatch will only replace existing livepatch with the
> > > > > > > > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > > > > > > > and make it u32: replace=0 means no replace; replace > 0 are the
> > > > > > > > replace tag.
> > > > > > > >
> > > > > > > > For current users of cumulative patches, all the livepatch will have the
> > > > > > > > same tag, say 1. For your use case, you can assign each user a
> > > > > > > > unique tag. Then all these users can do atomic upgrades of their
> > > > > > > > own livepatches.
> > > > > > > >
> > > > > > > > We may also need to check whether two livepatches of different tags
> > > > > > > > touch the same kernel function. When that happens, the later
> > > > > > > > livepatch should fail to load.
> > >
> > > I still think how to make the hybrid mode more secure:
> > >
> > > + The isolated sets of livepatched functions look like a good rule.
> > > + What about isolating the shadow variables/states as well?
> >
> > We might consider extending the klp_shadow_* API to support the new
> > livepatch tag.
>
> It would be nice to associate shadow variables with states so that
> we could check which shadow variables are used by each livepatch.
>
> It is partially implemented in my earlier RFC, see
> https://lore.kernel.org/all/20250115082431.5550-3-pmladek@suse.com/
This patch is still pending acceptance, but it offers a nice
improvement. With your modifications, the remaining task would be to
integrate a new replace_set into klp_state and update the API
accordingly
[...]
--
Regards
Yafang
^ permalink raw reply
* Re: [PATCH v3 09/11] dt-bindings: input: Document hid-over-spi DT schema
From: Conor Dooley @ 2026-04-10 17:35 UTC (permalink / raw)
To: Dmitry Torokhov
Cc: Jingyuan Liang, Jiri Kosina, Benjamin Tissoires, Jonathan Corbet,
Mark Brown, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Rob Herring, Krzysztof Kozlowski, Conor Dooley, linux-input,
linux-doc, linux-kernel, linux-spi, linux-trace-kernel,
devicetree, hbarnor, tfiga, Dmitry Antipov, Jarrett Schultz
In-Reply-To: <adfdkwq_bF9dirAq@google.com>
[-- Attachment #1: Type: text/plain, Size: 6021 bytes --]
On Thu, Apr 09, 2026 at 10:16:46AM -0700, Dmitry Torokhov wrote:
> On Thu, Apr 09, 2026 at 05:02:11PM +0100, Conor Dooley wrote:
> > On Thu, Apr 02, 2026 at 01:59:46AM +0000, Jingyuan Liang wrote:
> > > Documentation describes the required and optional properties for
> > > implementing Device Tree for a Microsoft G6 Touch Digitizer that
> > > supports HID over SPI Protocol 1.0 specification.
> > >
> > > The properties are common to HID over SPI.
> > >
> > > Signed-off-by: Dmitry Antipov <dmanti@microsoft.com>
> > > Signed-off-by: Jarrett Schultz <jaschultz@microsoft.com>
> > > Signed-off-by: Jingyuan Liang <jingyliang@chromium.org>
> > > ---
> > > .../devicetree/bindings/input/hid-over-spi.yaml | 126 +++++++++++++++++++++
> > > 1 file changed, 126 insertions(+)
> > >
> > > diff --git a/Documentation/devicetree/bindings/input/hid-over-spi.yaml b/Documentation/devicetree/bindings/input/hid-over-spi.yaml
> > > new file mode 100644
> > > index 000000000000..d1b0a2e26c32
> > > --- /dev/null
> > > +++ b/Documentation/devicetree/bindings/input/hid-over-spi.yaml
> > > @@ -0,0 +1,126 @@
> > > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > > +%YAML 1.2
> > > +---
> > > +$id: http://devicetree.org/schemas/input/hid-over-spi.yaml#
> > > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > > +
> > > +title: HID over SPI Devices
> > > +
> > > +maintainers:
> > > + - Benjamin Tissoires <benjamin.tissoires@redhat.com>
> > > + - Jiri Kosina <jkosina@suse.cz>
> >
> > Why them and not you, the developers of the series?
> >
> > > +
> > > +description: |+
> > > + HID over SPI provides support for various Human Interface Devices over the
> > > + SPI bus. These devices can be for example touchpads, keyboards, touch screens
> > > + or sensors.
> > > +
> > > + The specification has been written by Microsoft and is currently available
> > > + here: https://www.microsoft.com/en-us/download/details.aspx?id=103325
> > > +
> > > + If this binding is used, the kernel module spi-hid will handle the
> > > + communication with the device and the generic hid core layer will handle the
> > > + protocol.
> >
> > This is not relevant to the binding, please remove it.
> >
> > > +
> > > +allOf:
> > > + - $ref: /schemas/input/touchscreen/touchscreen.yaml#
> > > +
> > > +properties:
> > > + compatible:
> > > + oneOf:
> > > + - items:
> > > + - enum:
> > > + - microsoft,g6-touch-digitizer
> > > + - const: hid-over-spi
> > > + - description: Just "hid-over-spi" alone is allowed, but not recommended.
> > > + const: hid-over-spi
> >
> > Why is it allowed but not recommended? Seems to me like we should
> > require device-specific compatibles.
>
> Why would we want to change the driver code to add a new compatible each
> time a vendor decides to create a chip that is fully hid-spi-protocol
> compliant? Or is the plan to still allow "hid-over-spi" fallback but
> require device-specific compatible that will be ignored unless there is
> device-specific quirk needed?
This has nothing to do with the driver, just the oddity of having a
comment saying that not having a device specific compatible was
permitted by not recommended in a binding. Requiring device-specific
compatibles is the norm after all and a comment like this makes draws
more attention to the fact that this is abnormal. Regardless of what the
driver does, device-specific compatibles should be required.
> > > +
> > > + reg:
> > > + maxItems: 1
> > > +
> > > + interrupts:
> > > + maxItems: 1
> > > +
> > > + reset-gpios:
> > > + maxItems: 1
> > > + description:
> > > + GPIO specifier for the digitizer's reset pin (active low). The line must
> > > + be flagged with GPIO_ACTIVE_LOW.
> > > +
> > > + vdd-supply:
> > > + description:
> > > + Regulator for the VDD supply voltage.
> > > +
> > > + input-report-header-address:
> > > + $ref: /schemas/types.yaml#/definitions/uint32
> > > + minimum: 0
> > > + maximum: 0xffffff
> > > + description:
> > > + A value to be included in the Read Approval packet, listing an address of
> > > + the input report header to be put on the SPI bus. This address has 24
> > > + bits.
> > > +
> > > + input-report-body-address:
> > > + $ref: /schemas/types.yaml#/definitions/uint32
> > > + minimum: 0
> > > + maximum: 0xffffff
> > > + description:
> > > + A value to be included in the Read Approval packet, listing an address of
> > > + the input report body to be put on the SPI bus. This address has 24 bits.
> > > +
> > > + output-report-address:
> > > + $ref: /schemas/types.yaml#/definitions/uint32
> > > + minimum: 0
> > > + maximum: 0xffffff
> > > + description:
> > > + A value to be included in the Output Report sent by the host, listing an
> > > + address where the output report on the SPI bus is to be written to. This
> > > + address has 24 bits.
> > > +
> > > + read-opcode:
> > > + $ref: /schemas/types.yaml#/definitions/uint8
> > > + description:
> > > + Value to be used in Read Approval packets. 1 byte.
> > > +
> > > + write-opcode:
> > > + $ref: /schemas/types.yaml#/definitions/uint8
> > > + description:
> > > + Value to be used in Write Approval packets. 1 byte.
> >
> > Why can none of these things be determined from the device's compatible?
> > On the surface, they like the kinds of things that could/should be.
>
> Why would we want to keep tables of these values in the kernel and again
> have to update the driver for each new chip?
That's pretty normal though innit? It's what match data does.
If someone wants to have properties that communicate data that
can be determined from the compatible, they need to provide
justification why it is being done.
> It also probably firmware-dependent.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* [PATCH v4 3/3] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu (Google) @ 2026-04-10 17:11 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177584108931.388483.11311214679686745474.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.
However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If different fprobes are registered at the same address, the same address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering. (the one removed first will not
be deleted from the ops).
This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
=======
cd /sys/kernel/tracing
# 'Add entry and exit events on the same place'
echo 'f:event1 vfs_read' >> dynamic_events
echo 'f:event2 vfs_read%return' >> dynamic_events
# 'Enable both of them'
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210
# 'Disable and remove exit event'
echo 0 > events/fprobes/event2/enable
echo -:event2 >> dynamic_events
# 'Disable and remove all events'
echo 0 > events/fprobes/enable
echo > dynamic_events
# 'Add another event'
echo 'f:event3 vfs_open%return' > dynamic_events
cat dynamic_events
f:fprobes/event3 vfs_open%return
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
=======
As you can see, an entry for the vfs_read remains.
To fix this issue, when unregistering, the kernel should also check if
there is the same type of fprobes still exist at the same address, and
if not, delete its entry from either fgraph_ops or ftrace_ops.
Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/fprobe.c | 82 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 65 insertions(+), 17 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 799332f865f8..2cac2252f78f 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -92,11 +92,8 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
return ret;
}
-/* Return true if there are synonims */
-static bool delete_fprobe_node(struct fprobe_hlist_node *node)
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
{
- bool ret;
-
lockdep_assert_held(&fprobe_mutex);
/* Avoid double deleting and non-inserted nodes */
@@ -105,13 +102,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
rhltable_remove(&fprobe_ip_table, &node->hlist,
fprobe_rht_params);
}
-
- rcu_read_lock();
- ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
- fprobe_rht_params);
- rcu_read_unlock();
-
- return ret;
}
/* Check existence of the fprobe */
@@ -345,6 +335,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return !fp->exit_handler;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We have to check the same type on the list. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp)) {
+ if ((!ftrace && fp->exit_handler) ||
+ (ftrace && !fp->exit_handler))
+ return true;
+ }
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
@@ -367,6 +383,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return false;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We only need to check fp is there. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp))
+ return true;
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
@@ -555,15 +594,22 @@ struct fprobe_addr_list {
static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
struct fprobe_addr_list *alist)
{
+ lockdep_assert_in_rcu_read_lock();
+
if (!within_module(node->addr, mod))
return 0;
- if (delete_fprobe_node(node))
- return 0;
+ delete_fprobe_node(node);
+ /*
+ * Don't care the type here, because all fprobes on the same
+ * address must be removed eventually.
+ */
+ if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params)) {
+ alist->addrs[alist->index++] = node->addr;
+ if (alist->index == alist->size)
+ return -ENOSPC;
+ }
- alist->addrs[alist->index++] = node->addr;
- if (alist->index == alist->size)
- return -ENOSPC;
return 0;
}
@@ -915,7 +961,9 @@ static int unregister_fprobe_nolock(struct fprobe *fp)
/* Remove non-synonim ips from table and hash */
count = 0;
for (i = 0; i < hlist_array->size; i++) {
- if (!delete_fprobe_node(&hlist_array->array[i]))
+ delete_fprobe_node(&hlist_array->array[i]);
+ if (!fprobe_exists_on_hash(hlist_array->array[i].addr,
+ fprobe_is_ftrace(fp)))
addrs[count++] = hlist_array->array[i].addr;
}
del_fprobe_hash(fp);
^ permalink raw reply related
* [PATCH v4 2/3] tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
From: Masami Hiramatsu (Google) @ 2026-04-10 17:11 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177584108931.388483.11311214679686745474.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
fprobe_remove_node_in_module() is called under RCU read locked, but
this invokes kcalloc() if there are more than 8 fprobes installed
on the module. Sashiko warns it because kcalloc() can sleep [1].
[1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com
To fix this issue, expand the batch size to 128 and do not expand
the fprobe_addr_list, but just cancel walking on fprobe_ip_table,
update fgraph/ftrace_ops and retry the loop again.
Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v4:
- fix a build error typo in case of CONFIG_DYNAMIC_FTRACE=n.
Changes in v3:
- Retry inside rhltable_walk_enter/exit().
- Rename fprobe_set_ips() to fprobe_remove_ips().
- Rename 'retry' label to 'again'.
---
kernel/trace/fprobe.c | 75 ++++++++++++++++++++-----------------------------
1 file changed, 30 insertions(+), 45 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index a7c0d5f9016b..799332f865f8 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -346,11 +346,10 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
}
#ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
- int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
- ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, remove, reset);
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+ ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
}
#endif
#else
@@ -369,10 +368,9 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
}
#ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
- int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
}
#endif
#endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -546,7 +544,7 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
#ifdef CONFIG_MODULES
-#define FPROBE_IPS_BATCH_INIT 8
+#define FPROBE_IPS_BATCH_INIT 128
/* instruction pointer address list */
struct fprobe_addr_list {
int index;
@@ -554,45 +552,21 @@ struct fprobe_addr_list {
unsigned long *addrs;
};
-static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long addr)
+static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
+ struct fprobe_addr_list *alist)
{
- unsigned long *addrs;
-
- /* Previously we failed to expand the list. */
- if (alist->index == alist->size)
- return -ENOSPC;
-
- alist->addrs[alist->index++] = addr;
- if (alist->index < alist->size)
+ if (!within_module(node->addr, mod))
return 0;
- /* Expand the address list */
- addrs = kcalloc(alist->size * 2, sizeof(*addrs), GFP_KERNEL);
- if (!addrs)
- return -ENOMEM;
-
- memcpy(addrs, alist->addrs, alist->size * sizeof(*addrs));
- alist->size *= 2;
- kfree(alist->addrs);
- alist->addrs = addrs;
+ if (delete_fprobe_node(node))
+ return 0;
+ alist->addrs[alist->index++] = node->addr;
+ if (alist->index == alist->size)
+ return -ENOSPC;
return 0;
}
-static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
- struct fprobe_addr_list *alist)
-{
- if (!within_module(node->addr, mod))
- return;
- if (delete_fprobe_node(node))
- return;
- /*
- * If failed to update alist, just continue to update hlist.
- * Therefore, at list user handler will not hit anymore.
- */
- fprobe_addr_list_add(alist, node->addr);
-}
-
/* Handle module unloading to manage fprobe_ip_table. */
static int fprobe_module_callback(struct notifier_block *nb,
unsigned long val, void *data)
@@ -601,6 +575,7 @@ static int fprobe_module_callback(struct notifier_block *nb,
struct fprobe_hlist_node *node;
struct rhashtable_iter iter;
struct module *mod = data;
+ bool retry;
if (val != MODULE_STATE_GOING)
return NOTIFY_DONE;
@@ -612,18 +587,28 @@ static int fprobe_module_callback(struct notifier_block *nb,
mutex_lock(&fprobe_mutex);
rhltable_walk_enter(&fprobe_ip_table, &iter);
+again:
+ retry = false;
+ alist.index = 0;
do {
rhashtable_walk_start(&iter);
while ((node = rhashtable_walk_next(&iter)) && !IS_ERR(node))
- fprobe_remove_node_in_module(mod, node, &alist);
+ if (fprobe_remove_node_in_module(mod, node, &alist) < 0) {
+ retry = true;
+ break;
+ }
rhashtable_walk_stop(&iter);
- } while (node == ERR_PTR(-EAGAIN));
- rhashtable_walk_exit(&iter);
+ } while (node == ERR_PTR(-EAGAIN) && !retry);
+ /* Remove any ips from hash table(s) */
+ if (alist.index > 0) {
+ fprobe_remove_ips(alist.addrs, alist.index);
+ if (retry)
+ goto again;
+ }
- if (alist.index > 0)
- fprobe_set_ips(alist.addrs, alist.index, 1, 0);
+ rhashtable_walk_exit(&iter);
mutex_unlock(&fprobe_mutex);
kfree(alist.addrs);
^ permalink raw reply related
* [PATCH v4 1/3] tracing/fprobe: Remove fprobe from hash in failure path
From: Masami Hiramatsu (Google) @ 2026-04-10 17:11 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177584108931.388483.11311214679686745474.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When register_fprobe_ips() fails, it tries to remove a list of
fprobe_hash_node from fprobe_ip_table, but it missed to remove
fprobe itself from fprobe_table. Moreover, when removing
the fprobe_hash_node which is added to rhltable once, it must
use kfree_rcu() after removing from rhltable.
To fix these issues, this reuses unregister_fprobe() internal
code to rollback the half-way registered fprobe.
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v4:
- Remove short-cut case because we always need to upadte ftrace_ops.
- Use guard(mutex) in register_fprobe_ips() to unlock it correctly.
- Remove redundant !ret check in register_fprobe_ips().
- Do not set hlist_array->size in failure case, instead,
hlist_array->array[i].fp is set only when insertion is succeeded.
Changes in v3:
- Newly added.
---
kernel/trace/fprobe.c | 95 +++++++++++++++++++++++++------------------------
1 file changed, 48 insertions(+), 47 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index dcadf1d23b8a..a7c0d5f9016b 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -4,6 +4,7 @@
*/
#define pr_fmt(fmt) "fprobe: " fmt
+#include <linux/cleanup.h>
#include <linux/err.h>
#include <linux/fprobe.h>
#include <linux/kallsyms.h>
@@ -78,20 +79,27 @@ static const struct rhashtable_params fprobe_rht_params = {
};
/* Node insertion and deletion requires the fprobe_mutex */
-static int insert_fprobe_node(struct fprobe_hlist_node *node)
+static int insert_fprobe_node(struct fprobe_hlist_node *node, struct fprobe *fp)
{
+ int ret;
+
lockdep_assert_held(&fprobe_mutex);
- return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+ ret = rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
+ /* Set the fprobe pointer if insertion was successful. */
+ if (!ret)
+ WRITE_ONCE(node->fp, fp);
+ return ret;
}
/* Return true if there are synonims */
static bool delete_fprobe_node(struct fprobe_hlist_node *node)
{
- lockdep_assert_held(&fprobe_mutex);
bool ret;
- /* Avoid double deleting */
+ lockdep_assert_held(&fprobe_mutex);
+
+ /* Avoid double deleting and non-inserted nodes */
if (READ_ONCE(node->fp) != NULL) {
WRITE_ONCE(node->fp, NULL);
rhltable_remove(&fprobe_ip_table, &node->hlist,
@@ -759,7 +767,6 @@ static int fprobe_init(struct fprobe *fp, unsigned long *addrs, int num)
fp->hlist_array = hlist_array;
hlist_array->fp = fp;
for (i = 0; i < num; i++) {
- hlist_array->array[i].fp = fp;
addr = ftrace_location(addrs[i]);
if (!addr) {
fprobe_fail_cleanup(fp);
@@ -823,6 +830,8 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
}
EXPORT_SYMBOL_GPL(register_fprobe);
+static int unregister_fprobe_nolock(struct fprobe *fp);
+
/**
* register_fprobe_ips() - Register fprobe to ftrace by address.
* @fp: A fprobe data structure to be registered.
@@ -845,31 +854,26 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
if (ret)
return ret;
- mutex_lock(&fprobe_mutex);
+ guard(mutex)(&fprobe_mutex);
- hlist_array = fp->hlist_array;
if (fprobe_is_ftrace(fp))
ret = fprobe_ftrace_add_ips(addrs, num);
else
ret = fprobe_graph_add_ips(addrs, num);
+ if (ret) {
+ fprobe_fail_cleanup(fp);
+ return ret;
+ }
- if (!ret) {
- add_fprobe_hash(fp);
- for (i = 0; i < hlist_array->size; i++) {
- ret = insert_fprobe_node(&hlist_array->array[i]);
- if (ret)
- break;
- }
- /* fallback on insert error */
+ hlist_array = fp->hlist_array;
+ add_fprobe_hash(fp);
+ for (i = 0; i < hlist_array->size; i++) {
+ ret = insert_fprobe_node(&hlist_array->array[i], fp);
if (ret) {
- for (i--; i >= 0; i--)
- delete_fprobe_node(&hlist_array->array[i]);
+ unregister_fprobe_nolock(fp);
+ break;
}
}
- mutex_unlock(&fprobe_mutex);
-
- if (ret)
- fprobe_fail_cleanup(fp);
return ret;
}
@@ -913,32 +917,15 @@ bool fprobe_is_registered(struct fprobe *fp)
return true;
}
-/**
- * unregister_fprobe() - Unregister fprobe.
- * @fp: A fprobe data structure to be unregistered.
- *
- * Unregister fprobe (and remove ftrace hooks from the function entries).
- *
- * Return 0 if @fp is unregistered successfully, -errno if not.
- */
-int unregister_fprobe(struct fprobe *fp)
+static int unregister_fprobe_nolock(struct fprobe *fp)
{
- struct fprobe_hlist *hlist_array;
+ struct fprobe_hlist *hlist_array = fp->hlist_array;
unsigned long *addrs = NULL;
- int ret = 0, i, count;
+ int i, count;
- mutex_lock(&fprobe_mutex);
- if (!fp || !is_fprobe_still_exist(fp)) {
- ret = -EINVAL;
- goto out;
- }
-
- hlist_array = fp->hlist_array;
addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
- if (!addrs) {
- ret = -ENOMEM; /* TODO: Fallback to one-by-one loop */
- goto out;
- }
+ if (!addrs)
+ return -ENOMEM; /* TODO: Fallback to one-by-one loop */
/* Remove non-synonim ips from table and hash */
count = 0;
@@ -955,12 +942,26 @@ int unregister_fprobe(struct fprobe *fp)
kfree_rcu(hlist_array, rcu);
fp->hlist_array = NULL;
+ kfree(addrs);
-out:
- mutex_unlock(&fprobe_mutex);
+ return 0;
+}
- kfree(addrs);
- return ret;
+/**
+ * unregister_fprobe() - Unregister fprobe.
+ * @fp: A fprobe data structure to be unregistered.
+ *
+ * Unregister fprobe (and remove ftrace hooks from the function entries).
+ *
+ * Return 0 if @fp is unregistered successfully, -errno if not.
+ */
+int unregister_fprobe(struct fprobe *fp)
+{
+ guard(mutex)(&fprobe_mutex);
+ if (!fp || !is_fprobe_still_exist(fp))
+ return -EINVAL;
+
+ return unregister_fprobe_nolock(fp);
}
EXPORT_SYMBOL_GPL(unregister_fprobe);
^ permalink raw reply related
* [PATCH v4 0/3] tracing/fprobe: Fix fprobe_ip_table related bugs
From: Masami Hiramatsu (Google) @ 2026-04-10 17:11 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
Here are patches to fix bugs in fprobe.
The previous version is here.
https://lore.kernel.org/all/177581370903.617881.3002655215679528157.stgit@mhiramat.tok.corp.google.com/
In this version, I fixed some issues on the previous version.
Patch 1/3 updates:
- Remove short-cut case because we always need to upadte ftrace_ops.
- Use guard(mutex) in register_fprobe_ips() to unlock it correctly.
- Remove redundant !ret check in register_fprobe_ips().
- Do not set hlist_array->size in failure case, instead,
hlist_array->array[i].fp is set only when insertion is succeeded.
Patch 2/3 updates:
- fix a build error typo in case of CONFIG_DYNAMIC_FTRACE=n.
Thank you!
---
Masami Hiramatsu (Google) (3):
tracing/fprobe: Remove fprobe from hash in failure path
tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
tracing/fprobe: Check the same type fprobe on table as the unregistered one
kernel/trace/fprobe.c | 236 ++++++++++++++++++++++++++++---------------------
1 file changed, 135 insertions(+), 101 deletions(-)
--
Signature
^ permalink raw reply
* Re: [PATCH 04/61] ext4: Prefer IS_ERR_OR_NULL over manual NULL check
From: Theodore Ts'o @ 2026-04-10 15:18 UTC (permalink / raw)
To: amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel, dri-devel,
gfs2, intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
linux-input, linux-kernel, linux-leds, linux-media, linux-mips,
linux-mm, linux-modules, linux-mtd, linux-nfs, linux-omap,
linux-phy, linux-pm, linux-rockchip, linux-s390, linux-scsi,
linux-sctp, linux-security-module, linux-sh, linux-sound,
linux-stm32, linux-trace-kernel, linux-usb, linux-wireless,
netdev, ntfs3, samba-technical, sched-ext, target-devel,
tipc-discussion, v9fs, Philipp Hahn
Cc: Theodore Ts'o, Andreas Dilger
In-Reply-To: <20260310-b4-is_err_or_null-v1-4-bd63b656022d@avm.de>
On Tue, 10 Mar 2026 12:48:30 +0100, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
>
> Change generated with coccinelle.
Applied, thanks!
[04/61] ext4: Prefer IS_ERR_OR_NULL over manual NULL check
commit: 1d749e110277ce4103f27bd60d6181e52c0cc1e3
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
^ permalink raw reply
* [PATCH v2] perf: enable unprivileged syscall tracing with perf trace
From: Anubhav Shelat @ 2026-04-10 13:35 UTC (permalink / raw)
To: mpetlan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
Ian Rogers, Adrian Hunter, James Clark, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, linux-perf-users,
linux-kernel, linux-trace-kernel
Cc: Anubhav Shelat
Allow unprivileged users to trace their own processes' syscalls using
perf trace, similar to strace without the intrusive overhead of ptrace().
Currently, perf trace requires CAP_PERFMON or paranoid level ≤ 1 even
though the kernel has existing infrastructure (TRACE_EVENT_FL_CAP_ANY)
specifically designed to mark syscall tracepoints as safe for
unprivileged access. To fix this:
1. Loosen the condition in perf_event_open() which requires priviliges
for all events with exclude_kernel=0. This allows perf_event_open() to
bypass the paranoid check for task-attached tracepoint events. Ensure
that sample types which can expose kernel addresses to unprivileged
users are blocked.
2. Make the format and id tracefs files world-readable only for tracepoints
with TRACE_EVENT_FL_CAP_ANY, allowing unprivileged users to see syscall
tracepoint ids without exposing sensitive information.
Also add a check to perf_trace_event_perm() to ensure only TRACE_EVENT_FL_CAP_ANY
events can be traced.
Example usage after this change:
$ perf trace ls # works as unprivileged user
$ perf trace # system-wide, still requires privileges
$ perf trace -p 1234 # requires ptrace permission on pid 1234
Assisted-by: Claude:claude-sonnet-4.5
Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
---
Changes in v2:
- Add check to block sample types that bypass KASLR, suggested by
sashiko.
- Link to v1: https://lore.kernel.org/linux-perf-users/20260408123947.23779-2-ashelat@redhat.com/
---
kernel/events/core.c | 22 +++++++++++++++++++---
kernel/trace/trace_event_perf.c | 12 +++++++++++-
kernel/trace/trace_events.c | 8 ++++++--
3 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89b40e439717..db8c674704b2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13834,9 +13834,25 @@ SYSCALL_DEFINE5(perf_event_open,
return err;
if (!attr.exclude_kernel) {
- err = perf_allow_kernel();
- if (err)
- return err;
+ bool tp_bypass = false;
+
+ if (attr.type == PERF_TYPE_TRACEPOINT && pid != -1) {
+ /*
+ * Block sample types that expose kernel addresses to
+ * prevent KASLR bypass
+ */
+ u64 kaddr_leak = PERF_SAMPLE_CALLCHAIN |
+ PERF_SAMPLE_BRANCH_STACK |
+ PERF_SAMPLE_ADDR;
+
+ tp_bypass = !(attr.sample_type & kaddr_leak);
+ }
+
+ if (!tp_bypass) {
+ err = perf_allow_kernel();
+ if (err)
+ return err;
+ }
}
if (attr.namespaces) {
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a6bb7577e8c5..e8347df7ede5 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -73,8 +73,18 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
}
/* No tracing, just counting, so no obvious leak */
- if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW))
+ if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW)) {
+ /*
+ * Only allow CAP_ANY tracepoints for unprivileged
+ * task-attached events in case kernel context is exposed.
+ */
+ if (!p_event->attr.exclude_kernel && !perfmon_capable()) {
+ if (!(p_event->attach_state == PERF_ATTACH_TASK &&
+ (tp_event->flags & TRACE_EVENT_FL_CAP_ANY)))
+ return -EACCES;
+ }
return 0;
+ }
/* Some events are ok to be traced by non-root users... */
if (p_event->attach_state == PERF_ATTACH_TASK) {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 249d1cba72c0..6250b2529376 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -3051,7 +3051,9 @@ static int event_callback(const char *name, umode_t *mode, void **data,
struct trace_event_call *call = file->event_call;
if (strcmp(name, "format") == 0) {
- *mode = TRACE_MODE_READ;
+ *mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
+ (TRACE_MODE_READ | 0004) :
+ TRACE_MODE_READ;
*fops = &ftrace_event_format_fops;
return 1;
}
@@ -3087,7 +3089,9 @@ static int event_callback(const char *name, umode_t *mode, void **data,
#ifdef CONFIG_PERF_EVENTS
if (call->event.type && call->class->reg &&
strcmp(name, "id") == 0) {
- *mode = TRACE_MODE_READ;
+ *mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
+ (TRACE_MODE_READ | 0004) :
+ TRACE_MODE_READ;
*data = (void *)(long)call->event.type;
*fops = &ftrace_event_id_fops;
return 1;
--
2.53.0
^ permalink raw reply related
* Re: [bug report] ring-buffer: Introduce ring-buffer remotes
From: Vincent Donnefort @ 2026-04-10 12:46 UTC (permalink / raw)
To: Dan Carpenter; +Cc: linux-trace-kernel
In-Reply-To: <adjN-1ydA_kPPhmG@stanley.mountain>
On Fri, Apr 10, 2026 at 01:16:27PM +0300, Dan Carpenter wrote:
> Hello Vincent Donnefort,
>
> Commit 2e67fabd8b77 ("ring-buffer: Introduce ring-buffer remotes")
> from Mar 9, 2026 (linux-next), leads to the following Smatch static
> checker warning:
>
> kernel/trace/ring_buffer.c:2243 ring_buffer_desc_page()
> warn: array off by one? 'desc->page_va[page_id]'
>
> kernel/trace/ring_buffer.c
> 2241 static void *ring_buffer_desc_page(struct ring_buffer_desc *desc, int page_id)
> 2242 {
> --> 2243 return page_id > desc->nr_page_va ? NULL : (void *)desc->page_va[page_id];
> ^
> Based on the len = struct_size(desc, page_va, desc->nr_page_va), in
> ring_buffer_desc(), I'm pretty sure this should be >= instead of >.
You are right. Thanks for the report, I have sent a fix.
>
> 2244 }
>
> This email is a free service from the Smatch-CI project [smatch.sf.net].
>
> regards,
> dan carpenter
^ permalink raw reply
* [PATCH] ring-buffer: Prevent off-by-one array access in ring_buffer_desc_page()
From: Vincent Donnefort @ 2026-04-10 12:45 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
Cc: kernel-team, linux-kernel, Vincent Donnefort, Dan Carpenter
As pointed out by Smatch, the ring-buffer descriptor array page_va is
counted by nr_page_va, but the accessor ring_buffer_desc_page() allows
access off by one.
Currently, this does not cause problems, as the page ID always comes
from a trusted source. Nonetheless, ensure robustness and fix the
accessor. While at it, make the page_id unsigned.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
---
Currently based on kvmarm/next.
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 96e0d80d492b..2c77020e7aac 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2238,9 +2238,9 @@ static struct ring_buffer_desc *ring_buffer_desc(struct trace_buffer_desc *trace
return NULL;
}
-static void *ring_buffer_desc_page(struct ring_buffer_desc *desc, int page_id)
+static void *ring_buffer_desc_page(struct ring_buffer_desc *desc, unsigned int page_id)
{
- return page_id > desc->nr_page_va ? NULL : (void *)desc->page_va[page_id];
+ return page_id >= desc->nr_page_va ? NULL : (void *)desc->page_va[page_id];
}
static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
base-commit: 94b4ae79ebb42a8a6f2124b4d4b033b15a98e4f9
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply related
* [bug report] ring-buffer: Introduce ring-buffer remotes
From: Dan Carpenter @ 2026-04-10 10:16 UTC (permalink / raw)
To: Vincent Donnefort; +Cc: linux-trace-kernel
Hello Vincent Donnefort,
Commit 2e67fabd8b77 ("ring-buffer: Introduce ring-buffer remotes")
from Mar 9, 2026 (linux-next), leads to the following Smatch static
checker warning:
kernel/trace/ring_buffer.c:2243 ring_buffer_desc_page()
warn: array off by one? 'desc->page_va[page_id]'
kernel/trace/ring_buffer.c
2241 static void *ring_buffer_desc_page(struct ring_buffer_desc *desc, int page_id)
2242 {
--> 2243 return page_id > desc->nr_page_va ? NULL : (void *)desc->page_va[page_id];
^
Based on the len = struct_size(desc, page_va, desc->nr_page_va), in
ring_buffer_desc(), I'm pretty sure this should be >= instead of >.
2244 }
This email is a free service from the Smatch-CI project [smatch.sf.net].
regards,
dan carpenter
^ permalink raw reply
* [PATCH v3 3/3] tracing/fprobe: Check the same type fprobe on table as the unregistered one
From: Masami Hiramatsu (Google) @ 2026-04-10 9:35 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177581370903.617881.3002655215679528157.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.
However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If different fprobes are registered at the same address, the same address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering. (the one removed first will not
be deleted from the ops).
This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
=======
cd /sys/kernel/tracing
# 'Add entry and exit events on the same place'
echo 'f:event1 vfs_read' >> dynamic_events
echo 'f:event2 vfs_read%return' >> dynamic_events
# 'Enable both of them'
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210
# 'Disable and remove exit event'
echo 0 > events/fprobes/event2/enable
echo -:event2 >> dynamic_events
# 'Disable and remove all events'
echo 0 > events/fprobes/enable
echo > dynamic_events
# 'Add another event'
echo 'f:event3 vfs_open%return' > dynamic_events
cat dynamic_events
f:fprobes/event3 vfs_open%return
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
=======
As you can see, an entry for the vfs_read remains.
To fix this issue, when unregistering, the kernel should also check if
there is the same type of fprobes still exist at the same address, and
if not, delete its entry from either fgraph_ops or ftrace_ops.
Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/fprobe.c | 81 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 65 insertions(+), 16 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 46357808cbde..7f7daf2c49dd 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -86,11 +86,9 @@ static int insert_fprobe_node(struct fprobe_hlist_node *node)
return rhltable_insert(&fprobe_ip_table, &node->hlist, fprobe_rht_params);
}
-/* Return true if there are synonims */
-static bool delete_fprobe_node(struct fprobe_hlist_node *node)
+static void delete_fprobe_node(struct fprobe_hlist_node *node)
{
lockdep_assert_held(&fprobe_mutex);
- bool ret;
/* Avoid double deleting */
if (READ_ONCE(node->fp) != NULL) {
@@ -98,13 +96,6 @@ static bool delete_fprobe_node(struct fprobe_hlist_node *node)
rhltable_remove(&fprobe_ip_table, &node->hlist,
fprobe_rht_params);
}
-
- rcu_read_lock();
- ret = !!rhltable_lookup(&fprobe_ip_table, &node->addr,
- fprobe_rht_params);
- rcu_read_unlock();
-
- return ret;
}
/* Check existence of the fprobe */
@@ -338,6 +329,32 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return !fp->exit_handler;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We have to check the same type on the list. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp)) {
+ if ((!ftrace && fp->exit_handler) ||
+ (ftrace && !fp->exit_handler))
+ return true;
+ }
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
@@ -360,6 +377,29 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
return false;
}
+static bool fprobe_exists_on_hash(unsigned long ip, bool ftrace __maybe_unused)
+{
+ struct rhlist_head *head, *pos;
+ struct fprobe_hlist_node *node;
+ struct fprobe *fp;
+
+ guard(rcu)();
+ head = rhltable_lookup(&fprobe_ip_table, &ip,
+ fprobe_rht_params);
+ if (!head)
+ return false;
+ /* We only need to check fp is there. */
+ rhl_for_each_entry_rcu(node, pos, head, hlist) {
+ if (node->addr != ip)
+ break;
+ fp = READ_ONCE(node->fp);
+ if (likely(fp))
+ return true;
+ }
+
+ return false;
+}
+
#ifdef CONFIG_MODULES
static void fprobe_set_ips(unsigned long *ips, unsigned int cnt)
{
@@ -546,15 +586,22 @@ struct fprobe_addr_list {
static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
struct fprobe_addr_list *alist)
{
+ lockdep_assert_in_rcu_read_lock();
+
if (!within_module(node->addr, mod))
return 0;
- if (delete_fprobe_node(node))
- return 0;
+ delete_fprobe_node(node);
+ /*
+ * Don't care the type here, because all fprobes on the same
+ * address must be removed eventually.
+ */
+ if (!rhltable_lookup(&fprobe_ip_table, &node->addr, fprobe_rht_params)) {
+ alist->addrs[alist->index++] = node->addr;
+ if (alist->index == alist->size)
+ return -ENOSPC;
+ }
- alist->addrs[alist->index++] = node->addr;
- if (alist->index == alist->size)
- return -ENOSPC;
return 0;
}
@@ -917,7 +964,9 @@ static int unregister_fprobe_nolock(struct fprobe *fp)
/* Remove non-synonim ips from table and hash */
count = 0;
for (i = 0; i < hlist_array->size; i++) {
- if (!delete_fprobe_node(&hlist_array->array[i]))
+ delete_fprobe_node(&hlist_array->array[i]);
+ if (!fprobe_exists_on_hash(hlist_array->array[i].addr,
+ fprobe_is_ftrace(fp)))
addrs[count++] = hlist_array->array[i].addr;
}
del_fprobe_hash(fp);
^ permalink raw reply related
* [PATCH v3 2/3] tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
From: Masami Hiramatsu (Google) @ 2026-04-10 9:35 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177581370903.617881.3002655215679528157.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
fprobe_remove_node_in_module() is called under RCU read locked, but
this invokes kcalloc() if there are more than 8 fprobes installed
on the module. Sashiko warns it because kcalloc() can sleep [1].
[1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com
To fix this issue, expand the batch size to 128 and do not expand
the fprobe_addr_list, but just cancel walking on fprobe_ip_table,
update fgraph/ftrace_ops and retry the loop again.
Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v2:
- Retry inside rhltable_walk_enter/exit().
- Rename fprobe_set_ips() to fprobe_remove_ips().
- Rename 'retry' label to 'again'.
---
kernel/trace/fprobe.c | 75 ++++++++++++++++++++-----------------------------
1 file changed, 30 insertions(+), 45 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 1b5c47685186..46357808cbde 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -339,11 +339,10 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
}
#ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
- int reset)
+static void fprobe_remove_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
- ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, remove, reset);
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
+ ftrace_set_filter_ips(&fprobe_ftrace_ops, ips, cnt, 1, 0);
}
#endif
#else
@@ -362,10 +361,9 @@ static bool fprobe_is_ftrace(struct fprobe *fp)
}
#ifdef CONFIG_MODULES
-static void fprobe_set_ips(unsigned long *ips, unsigned int cnt, int remove,
- int reset)
+static void fprobe_set_ips(unsigned long *ips, unsigned int cnt)
{
- ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, remove, reset);
+ ftrace_set_filter_ips(&fprobe_graph_ops.ops, ips, cnt, 1, 0);
}
#endif
#endif /* !CONFIG_DYNAMIC_FTRACE_WITH_ARGS && !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
@@ -537,7 +535,7 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
#ifdef CONFIG_MODULES
-#define FPROBE_IPS_BATCH_INIT 8
+#define FPROBE_IPS_BATCH_INIT 128
/* instruction pointer address list */
struct fprobe_addr_list {
int index;
@@ -545,45 +543,21 @@ struct fprobe_addr_list {
unsigned long *addrs;
};
-static int fprobe_addr_list_add(struct fprobe_addr_list *alist, unsigned long addr)
+static int fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
+ struct fprobe_addr_list *alist)
{
- unsigned long *addrs;
-
- /* Previously we failed to expand the list. */
- if (alist->index == alist->size)
- return -ENOSPC;
-
- alist->addrs[alist->index++] = addr;
- if (alist->index < alist->size)
+ if (!within_module(node->addr, mod))
return 0;
- /* Expand the address list */
- addrs = kcalloc(alist->size * 2, sizeof(*addrs), GFP_KERNEL);
- if (!addrs)
- return -ENOMEM;
-
- memcpy(addrs, alist->addrs, alist->size * sizeof(*addrs));
- alist->size *= 2;
- kfree(alist->addrs);
- alist->addrs = addrs;
+ if (delete_fprobe_node(node))
+ return 0;
+ alist->addrs[alist->index++] = node->addr;
+ if (alist->index == alist->size)
+ return -ENOSPC;
return 0;
}
-static void fprobe_remove_node_in_module(struct module *mod, struct fprobe_hlist_node *node,
- struct fprobe_addr_list *alist)
-{
- if (!within_module(node->addr, mod))
- return;
- if (delete_fprobe_node(node))
- return;
- /*
- * If failed to update alist, just continue to update hlist.
- * Therefore, at list user handler will not hit anymore.
- */
- fprobe_addr_list_add(alist, node->addr);
-}
-
/* Handle module unloading to manage fprobe_ip_table. */
static int fprobe_module_callback(struct notifier_block *nb,
unsigned long val, void *data)
@@ -592,6 +566,7 @@ static int fprobe_module_callback(struct notifier_block *nb,
struct fprobe_hlist_node *node;
struct rhashtable_iter iter;
struct module *mod = data;
+ bool retry;
if (val != MODULE_STATE_GOING)
return NOTIFY_DONE;
@@ -603,18 +578,28 @@ static int fprobe_module_callback(struct notifier_block *nb,
mutex_lock(&fprobe_mutex);
rhltable_walk_enter(&fprobe_ip_table, &iter);
+again:
+ retry = false;
+ alist.index = 0;
do {
rhashtable_walk_start(&iter);
while ((node = rhashtable_walk_next(&iter)) && !IS_ERR(node))
- fprobe_remove_node_in_module(mod, node, &alist);
+ if (fprobe_remove_node_in_module(mod, node, &alist) < 0) {
+ retry = true;
+ break;
+ }
rhashtable_walk_stop(&iter);
- } while (node == ERR_PTR(-EAGAIN));
- rhashtable_walk_exit(&iter);
+ } while (node == ERR_PTR(-EAGAIN) && !retry);
+ /* Remove any ips from hash table(s) */
+ if (alist.index > 0) {
+ fprobe_remove_ips(alist.addrs, alist.index);
+ if (retry)
+ goto again;
+ }
- if (alist.index > 0)
- fprobe_set_ips(alist.addrs, alist.index, 1, 0);
+ rhashtable_walk_exit(&iter);
mutex_unlock(&fprobe_mutex);
kfree(alist.addrs);
^ permalink raw reply related
* [PATCH v3 1/3] tracing/fprobe: Remove fprobe from hash in failure path
From: Masami Hiramatsu (Google) @ 2026-04-10 9:35 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
In-Reply-To: <177581370903.617881.3002655215679528157.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When register_fprobe_ips() fails, it tries to remove a list of
fprobe_hash_node from fprobe_ip_table, but it missed to remove
fprobe itself from fprobe_table. Moreover, when removing
the fprobe_hash_node which is added to rhltable once, it must
use kfree_rcu() after removing from rhltable.
To fix these issues, this reuses unregister_fprobe() internal
code to rollback the half-way registered fprobe.
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v3:
- Newly added.
---
kernel/trace/fprobe.c | 73 ++++++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 34 deletions(-)
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index 56d145017902..1b5c47685186 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -4,6 +4,7 @@
*/
#define pr_fmt(fmt) "fprobe: " fmt
+#include <linux/cleanup.h>
#include <linux/err.h>
#include <linux/fprobe.h>
#include <linux/kallsyms.h>
@@ -821,6 +822,8 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
}
EXPORT_SYMBOL_GPL(register_fprobe);
+static int unregister_fprobe_nolock(struct fprobe *fp);
+
/**
* register_fprobe_ips() - Register fprobe to ftrace by address.
* @fp: A fprobe data structure to be registered.
@@ -845,30 +848,29 @@ int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num)
mutex_lock(&fprobe_mutex);
- hlist_array = fp->hlist_array;
if (fprobe_is_ftrace(fp))
ret = fprobe_ftrace_add_ips(addrs, num);
else
ret = fprobe_graph_add_ips(addrs, num);
+ if (ret) {
+ fprobe_fail_cleanup(fp);
+ return ret;
+ }
if (!ret) {
+ hlist_array = fp->hlist_array;
add_fprobe_hash(fp);
for (i = 0; i < hlist_array->size; i++) {
ret = insert_fprobe_node(&hlist_array->array[i]);
- if (ret)
+ if (ret) {
+ hlist_array->size = i;
+ unregister_fprobe_nolock(fp);
break;
- }
- /* fallback on insert error */
- if (ret) {
- for (i--; i >= 0; i--)
- delete_fprobe_node(&hlist_array->array[i]);
+ }
}
}
mutex_unlock(&fprobe_mutex);
- if (ret)
- fprobe_fail_cleanup(fp);
-
return ret;
}
EXPORT_SYMBOL_GPL(register_fprobe_ips);
@@ -911,32 +913,21 @@ bool fprobe_is_registered(struct fprobe *fp)
return true;
}
-/**
- * unregister_fprobe() - Unregister fprobe.
- * @fp: A fprobe data structure to be unregistered.
- *
- * Unregister fprobe (and remove ftrace hooks from the function entries).
- *
- * Return 0 if @fp is unregistered successfully, -errno if not.
- */
-int unregister_fprobe(struct fprobe *fp)
+static int unregister_fprobe_nolock(struct fprobe *fp)
{
- struct fprobe_hlist *hlist_array;
+ struct fprobe_hlist *hlist_array = fp->hlist_array;
unsigned long *addrs = NULL;
- int ret = 0, i, count;
+ int i, count;
- mutex_lock(&fprobe_mutex);
- if (!fp || !is_fprobe_still_exist(fp)) {
- ret = -EINVAL;
- goto out;
+ if (!hlist_array || !hlist_array->size) {
+ del_fprobe_hash(fp);
+ fprobe_fail_cleanup(fp);
+ return 0;
}
- hlist_array = fp->hlist_array;
addrs = kcalloc(hlist_array->size, sizeof(unsigned long), GFP_KERNEL);
- if (!addrs) {
- ret = -ENOMEM; /* TODO: Fallback to one-by-one loop */
- goto out;
- }
+ if (!addrs)
+ return -ENOMEM; /* TODO: Fallback to one-by-one loop */
/* Remove non-synonim ips from table and hash */
count = 0;
@@ -953,12 +944,26 @@ int unregister_fprobe(struct fprobe *fp)
kfree_rcu(hlist_array, rcu);
fp->hlist_array = NULL;
+ kfree(addrs);
-out:
- mutex_unlock(&fprobe_mutex);
+ return 0;
+}
- kfree(addrs);
- return ret;
+/**
+ * unregister_fprobe() - Unregister fprobe.
+ * @fp: A fprobe data structure to be unregistered.
+ *
+ * Unregister fprobe (and remove ftrace hooks from the function entries).
+ *
+ * Return 0 if @fp is unregistered successfully, -errno if not.
+ */
+int unregister_fprobe(struct fprobe *fp)
+{
+ guard(mutex)(&fprobe_mutex);
+ if (!fp || !is_fprobe_still_exist(fp))
+ return -EINVAL;
+
+ return unregister_fprobe_nolock(fp);
}
EXPORT_SYMBOL_GPL(unregister_fprobe);
^ permalink raw reply related
* [PATCH v3 0/3] tracing/fprobe: Fix fprobe_ip_table related bugs
From: Masami Hiramatsu (Google) @ 2026-04-10 9:35 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu
Cc: Menglong Dong, Mathieu Desnoyers, jiang.biao, linux-kernel,
linux-trace-kernel
Hi,
Here are patches to fix bugs in fprobe.
The previous version is here.
https://lore.kernel.org/all/177573094819.3666478.11900825120958856397.stgit@mhiramat.tok.corp.google.com/
In this version, I added another bugfix for registration
failure of fprobe [1/3] (since this is the patch for oldest bug),
and move retry-loop inside rhltable_walk_enter()/exit()[2/3].
Both are reported by Sashiko[1].
[1] https://sashiko.dev/#/patchset/177573094819.3666478.11900825120958856397.stgit%40mhiramat.tok.corp.google.com
Thank you!
---
Masami Hiramatsu (Google) (3):
tracing/fprobe: Remove fprobe from hash in failure path
tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
tracing/fprobe: Check the same type fprobe on table as the unregistered one
kernel/trace/fprobe.c | 217 +++++++++++++++++++++++++++++--------------------
1 file changed, 128 insertions(+), 89 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v2] tracing: preserve module tracepoint strings
From: Cao Ruichuang @ 2026-04-10 5:18 UTC (permalink / raw)
To: rostedt
Cc: mhiramat, mathieu.desnoyers, petr.pavlu, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260406170944.51047-1-create0818@163.com>
tracepoint_string() is documented as exporting constant strings
through printk_formats, including when it is used from modules.
That currently does not work.
A small test module that calls
tracepoint_string("tracepoint_string_test_module_string") loads
successfully and gets a pointer back, but the string never appears
in /sys/kernel/tracing/printk_formats. The loader only collects
__trace_printk_fmt from modules and ignores __tracepoint_str.
Collect module __tracepoint_str entries too, copy them to stable
tracing-managed storage like module trace_printk formats, and let
trace_is_tracepoint_string() recognize those copied strings. This
makes module tracepoint strings visible through printk_formats and
keeps them accepted by the trace string safety checks.
Update the tracepoint_string() documentation to describe this
module behavior explicitly, so the comment matches the preserved
module-string mappings exported by tracing.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217196
Signed-off-by: Cao Ruichuang <create0818@163.com>
---
v2:
- update tracepoint_string() documentation to describe the preserved
module-string mapping explicitly
- address Petr Pavlu's review about the comment not matching the
implemented module behavior
include/linux/module.h | 2 ++
include/linux/tracepoint.h | 14 ++++++---
kernel/module/main.c | 4 +++
kernel/trace/trace_printk.c | 63 ++++++++++++++++++++++++++++---------
4 files changed, 63 insertions(+), 20 deletions(-)
diff --git a/include/linux/module.h b/include/linux/module.h
index 14f391b186c..e475466a785 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -515,6 +515,8 @@ struct module {
#ifdef CONFIG_TRACING
unsigned int num_trace_bprintk_fmt;
const char **trace_bprintk_fmt_start;
+ unsigned int num_tracepoint_strings;
+ const char **tracepoint_strings_start;
#endif
#ifdef CONFIG_EVENT_TRACING
struct trace_event_call **trace_events;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 1d7f29f5e90..f14da542402 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -475,11 +475,15 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* the ASCII strings they represent.
*
* The @str used must be a constant string and persistent as it would not
- * make sense to show a string that no longer exists. But it is still fine
- * to be used with modules, because when modules are unloaded, if they
- * had tracepoints, the ring buffers are cleared too. As long as the string
- * does not change during the life of the module, it is fine to use
- * tracepoint_string() within a module.
+ * make sense to show a string that no longer exists.
+ *
+ * For built-in code, the tracing system uses the original string address.
+ * For modules, the tracing code saves tracepoint strings into
+ * tracing-managed storage when the module loads, so their mappings remain
+ * available through printk_formats and trace string checks even after the
+ * module's own memory goes away. As long as the string does not change
+ * during the life of the module, it is fine to use tracepoint_string()
+ * within a module.
*/
#define tracepoint_string(str) \
({ \
diff --git a/kernel/module/main.c b/kernel/module/main.c
index c3ce106c70a..d7d890138ac 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2672,6 +2672,10 @@ static int find_module_sections(struct module *mod, struct load_info *info)
mod->trace_bprintk_fmt_start = section_objs(info, "__trace_printk_fmt",
sizeof(*mod->trace_bprintk_fmt_start),
&mod->num_trace_bprintk_fmt);
+ mod->tracepoint_strings_start =
+ section_objs(info, "__tracepoint_str",
+ sizeof(*mod->tracepoint_strings_start),
+ &mod->num_tracepoint_strings);
#endif
#ifdef CONFIG_DYNAMIC_FTRACE
/* sechdrs[0].sh_size is always zero */
diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index 5ea5e0d76f0..9f67ce42ef6 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -22,8 +22,9 @@
#ifdef CONFIG_MODULES
/*
- * modules trace_printk()'s formats are autosaved in struct trace_bprintk_fmt
- * which are queued on trace_bprintk_fmt_list.
+ * modules trace_printk() formats and tracepoint_string() strings are
+ * autosaved in struct trace_bprintk_fmt, which are queued on
+ * trace_bprintk_fmt_list.
*/
static LIST_HEAD(trace_bprintk_fmt_list);
@@ -33,8 +34,12 @@ static DEFINE_MUTEX(btrace_mutex);
struct trace_bprintk_fmt {
struct list_head list;
const char *fmt;
+ unsigned int type;
};
+#define TRACE_BPRINTK_TYPE BIT(0)
+#define TRACE_TRACEPOINT_TYPE BIT(1)
+
static inline struct trace_bprintk_fmt *lookup_format(const char *fmt)
{
struct trace_bprintk_fmt *pos;
@@ -49,22 +54,24 @@ static inline struct trace_bprintk_fmt *lookup_format(const char *fmt)
return NULL;
}
-static
-void hold_module_trace_bprintk_format(const char **start, const char **end)
+static void hold_module_trace_format(const char **start, const char **end,
+ unsigned int type)
{
const char **iter;
char *fmt;
/* allocate the trace_printk per cpu buffers */
- if (start != end)
+ if ((type & TRACE_BPRINTK_TYPE) && start != end)
trace_printk_init_buffers();
mutex_lock(&btrace_mutex);
for (iter = start; iter < end; iter++) {
struct trace_bprintk_fmt *tb_fmt = lookup_format(*iter);
if (tb_fmt) {
- if (!IS_ERR(tb_fmt))
+ if (!IS_ERR(tb_fmt)) {
+ tb_fmt->type |= type;
*iter = tb_fmt->fmt;
+ }
continue;
}
@@ -76,6 +83,7 @@ void hold_module_trace_bprintk_format(const char **start, const char **end)
list_add_tail(&tb_fmt->list, &trace_bprintk_fmt_list);
strcpy(fmt, *iter);
tb_fmt->fmt = fmt;
+ tb_fmt->type = type;
} else
kfree(tb_fmt);
}
@@ -85,17 +93,28 @@ void hold_module_trace_bprintk_format(const char **start, const char **end)
mutex_unlock(&btrace_mutex);
}
-static int module_trace_bprintk_format_notify(struct notifier_block *self,
- unsigned long val, void *data)
+static int module_trace_format_notify(struct notifier_block *self,
+ unsigned long val, void *data)
{
struct module *mod = data;
+
+ if (val != MODULE_STATE_COMING)
+ return NOTIFY_OK;
+
if (mod->num_trace_bprintk_fmt) {
const char **start = mod->trace_bprintk_fmt_start;
const char **end = start + mod->num_trace_bprintk_fmt;
- if (val == MODULE_STATE_COMING)
- hold_module_trace_bprintk_format(start, end);
+ hold_module_trace_format(start, end, TRACE_BPRINTK_TYPE);
+ }
+
+ if (mod->num_tracepoint_strings) {
+ const char **start = mod->tracepoint_strings_start;
+ const char **end = start + mod->num_tracepoint_strings;
+
+ hold_module_trace_format(start, end, TRACE_TRACEPOINT_TYPE);
}
+
return NOTIFY_OK;
}
@@ -171,8 +190,8 @@ static void format_mod_stop(void)
#else /* !CONFIG_MODULES */
__init static int
-module_trace_bprintk_format_notify(struct notifier_block *self,
- unsigned long val, void *data)
+module_trace_format_notify(struct notifier_block *self,
+ unsigned long val, void *data)
{
return NOTIFY_OK;
}
@@ -193,8 +212,8 @@ void trace_printk_control(bool enabled)
}
__initdata_or_module static
-struct notifier_block module_trace_bprintk_format_nb = {
- .notifier_call = module_trace_bprintk_format_notify,
+struct notifier_block module_trace_format_nb = {
+ .notifier_call = module_trace_format_notify,
};
int __trace_bprintk(unsigned long ip, const char *fmt, ...)
@@ -254,11 +273,25 @@ EXPORT_SYMBOL_GPL(__ftrace_vprintk);
bool trace_is_tracepoint_string(const char *str)
{
const char **ptr = __start___tracepoint_str;
+#ifdef CONFIG_MODULES
+ struct trace_bprintk_fmt *tb_fmt;
+#endif
for (ptr = __start___tracepoint_str; ptr < __stop___tracepoint_str; ptr++) {
if (str == *ptr)
return true;
}
+
+#ifdef CONFIG_MODULES
+ mutex_lock(&btrace_mutex);
+ list_for_each_entry(tb_fmt, &trace_bprintk_fmt_list, list) {
+ if ((tb_fmt->type & TRACE_TRACEPOINT_TYPE) && str == tb_fmt->fmt) {
+ mutex_unlock(&btrace_mutex);
+ return true;
+ }
+ }
+ mutex_unlock(&btrace_mutex);
+#endif
return false;
}
@@ -824,7 +857,7 @@ fs_initcall(init_trace_printk_function_export);
static __init int init_trace_printk(void)
{
- return register_module_notifier(&module_trace_bprintk_format_nb);
+ return register_module_notifier(&module_trace_format_nb);
}
early_initcall(init_trace_printk);
--
2.39.5 (Apple Git-154)
^ permalink raw reply related
* Re: [RFC PATCH 0/4] trace, livepatch: Allow kprobe return overriding for livepatched functions
From: Masami Hiramatsu @ 2026-04-10 4:38 UTC (permalink / raw)
To: Yafang Shao
Cc: jpoimboe, jikos, mbenes, pmladek, joe.lawrence, rostedt,
mathieu.desnoyers, kpsingh, mattbobrowski, song, jolsa, ast,
daniel, andrii, martin.lau, eddyz87, memxor, yonghong.song,
live-patching, linux-kernel, linux-trace-kernel, bpf
In-Reply-To: <20260402092607.96430-1-laoar.shao@gmail.com>
Hi Yafang,
On Thu, 2 Apr 2026 17:26:03 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:
> Livepatching allows for rapid experimentation with new kernel features
> without interrupting production workloads. However, static livepatches lack
> the flexibility required to tune features based on task-specific attributes,
> such as cgroup membership, which is critical in multi-tenant k8s
> environments. Furthermore, hardcoding logic into a livepatch prevents
> dynamic adjustments based on the runtime environment.
>
> To address this, we propose a hybrid approach using BPF. Our production use
> case involves:
>
> 1. Deploying a Livepatch function to serve as a stable BPF hook.
>
> 2. Utilizing bpf_override_return() to dynamically modify the return value
> of that hook based on the current task's context.
First of all, I don't like this approach to test a new feature in the
kernel, because it sounds like allowing multiple different generations
of implementations to coexist simultaneously. The standard kernel code
is not designed to withstand such implementations.
For example, if you implement a well-designed framework in a specific
subsystem, like Schedext, which allows multiple implementations extended
with BPF to coexist, there's no problem (at least it's debatable).
But if it is for any function, it is dangerous feature. Bugs that occur
in kernels that use this functionality cannot be addressed here. They
need to be treated the same way as out-of-tree drivers or forked kernels.
I mean, add a tainted flag for this feature. And we don't care of it.
>
> A significant challenge arises when atomic-replace is enabled. In this
> mode, deploying a new livepatch changes the target function's address,
> forcing a re-attachment of the BPF program. This re-attachment latency is
> unacceptable in critical paths, such as those handling networking policies.
>
> To solve this, we introduce a hybrid livepatch mode that allows specific
> patches to remain non-replaceable, ensuring the function address remains
> stable and the BPF program stays attached.
Can you share your actual problem to be solved?
If the specific problem and the specific subsystem are clear,
I think there is room to discuss it with the subsystem maintainer.
>
> Furthermore, this mechanism provides a lower-maintenance alternative to
> out-of-tree BPF hooks. Given the complexities of upstreaming custom BPF
> hooks (e.g., [0], [1]), this hybrid mode allows for the maintenance of
> stable, minimal hook points via livepatching with significantly reduced
> maintenance burden.
Maintenance cost is the same. We need to add out-of-tree BPF hooks on
source code. Maybe deploying cost will be reduced.
Thank you,
>
> Link: https://lwn.net/Articles/1054030/ [0]
> Link: https://lwn.net/Articles/1043548/ [1]
>
> Yafang Shao (4):
> trace: Simplify kprobe overridable function check
> trace: Allow kprobes to override livepatched functions
> livepatch: Add "replaceable" attribute to klp_patch
> livepatch: Implement livepatch hybrid mode
>
> include/linux/livepatch.h | 2 ++
> kernel/livepatch/core.c | 50 +++++++++++++++++++++++++++++++
> kernel/trace/Kconfig | 14 +++++++++
> kernel/trace/bpf_trace.c | 14 ++++++---
> kernel/trace/trace_kprobe.c | 49 ++++++++++++------------------
> kernel/trace/trace_probe.h | 59 +++++++++++++++++++++++++++----------
> 6 files changed, 139 insertions(+), 49 deletions(-)
>
> --
> 2.47.3
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] selftests/ftrace: Account for fprobe attachment at creation
From: Cao Ruichuang @ 2026-04-10 4:32 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Cao Ruichuang, rostedt, mathieu.desnoyers, shuah, linux-kernel,
linux-trace-kernel, linux-kselftest
In-Reply-To: <20260409172955.98e2d20a7f7736a1a9834816@kernel.org>
Hi Masami,
I reran this in clean QEMU on two kernels and got different results.
1. Ubuntu distro kernel:
Linux 6.8.0-100-generic #100-Ubuntu SMP PREEMPT_DYNAMIC
Tue Jan 13 16:40:06 UTC 2026
baseline count=2
after_create123 count=4
after_enable1/2/3 count=4
baseline enabled_functions:
__hid_bpf_tail_call ...
after_create123 enabled_functions:
kernel_clone (2) R ->arch_ftrace_ops_list_func+0x0/0x280
kmem_cache_free (1) R tramp: ... ->fprobe_handler+0x0/0x40
__hid_bpf_tail_call ...
2. Current source-tree kernel built from the clean snapshot of my patch
branch:
Linux 7.0.0-rc6 #2 SMP PREEMPT_DYNAMIC Fri Apr 10 12:19:39 CST 2026
baseline count=0
after_create123 count=0
after_enable1 count=1
after_enable2 count=1
after_enable3 count=2
after_create123 enabled_functions:
<empty>
after_enable3 enabled_functions:
kernel_clone (2) ->arch_ftrace_ops_list_func+0x0/0x200
kmem_cache_free (1) tramp: ... ->fprobe_ftrace_entry+0x0/0x220
So the behavior I reported earlier reproduces on that Ubuntu 6.8 kernel,
but not on the current source-tree kernel. I think my earlier conclusion
was too broad.
I will stop pushing this testcase change for now unless I can narrow down
which kernel change caused the difference.
Thanks,
Cao Ruichuang
^ permalink raw reply
* Re: [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log
From: Theodore Tso @ 2026-04-10 1:18 UTC (permalink / raw)
To: Li Chen
Cc: Zhang Yi, Andreas Dilger, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <20260120112538.132774-1-me@linux.beauty>
On Tue, Jan 20, 2026 at 07:25:29PM +0800, Li Chen wrote:
> Hi,
>
> (This RFC v4 series is based on linux-next tag next-20260106, plus the
> prerequisite patch "ext4: fast commit: make s_fc_lock reclaim-safe" posted at:
> https://lore.kernel.org/all/20260106120621.440126-1-me@linux.beauty/)
Can you take a look at the Sashiko reviews here:
https://sashiko.dev/#/patchset/20260408112020.716706-1-me%40linux.beauty
There seems to be at least one legitimate concern, which is the
potential cur_lblk overflow. There are a couple of others which I
think is real; could you please look at their review comments?
Thanks,
- Ted
^ permalink raw reply
* Re: [PATCH v2 1/2] tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
From: Masami Hiramatsu @ 2026-04-10 0:19 UTC (permalink / raw)
To: Menglong Dong
Cc: Steven Rostedt, Menglong Dong, Mathieu Desnoyers, jiang.biao,
linux-kernel, linux-trace-kernel
In-Reply-To: <2258680.irdbgypaU6@7940hx>
On Thu, 09 Apr 2026 20:05:13 +0800
Menglong Dong <menglong.dong@linux.dev> wrote:
> On 2026/4/9 18:35 Masami Hiramatsu (Google) <mhiramat@kernel.org> write:
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> >
> > fprobe_remove_node_in_module() is called under RCU read locked, but
> > this invokes kcalloc() if there are more than 8 fprobes installed
> > on the module. Sashiko warns it because kcalloc() can sleep [1].
> >
> > [1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com
> >
> > To fix this issue, expand the batch size to 128 and do not expand
> > the fprobe_addr_list, but just cancel walking on fprobe_ip_table,
> > update fgraph/ftrace_ops and retry the loop again.
> >
> > Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> > kernel/trace/fprobe.c | 53 ++++++++++++++++++-------------------------------
> > 1 file changed, 19 insertions(+), 34 deletions(-)
> >
> > diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> > index 56d145017902..058cf6ef7ebb 100644
> > --- a/kernel/trace/fprobe.c
> > +++ b/kernel/trace/fprobe.c
> > @@ -536,7 +536,7 @@ static void fprobe_graph_remove_ips(unsigned long *addrs, int num)
>
> Hi, Masami. Thanks for the fixes. Overall, the whole series
> LGTM.
>
> Some nits below.
>
> >
> > #ifdef CONFIG_MODULES
> >
> [...]
> > unsigned long val, void *data)
> > @@ -591,6 +567,7 @@ static int fprobe_module_callback(struct notifier_block *nb,
> > struct fprobe_hlist_node *node;
> > struct rhashtable_iter iter;
> > struct module *mod = data;
> > + bool retry;
> >
> > if (val != MODULE_STATE_GOING)
> > return NOTIFY_DONE;
> > @@ -600,13 +577,19 @@ static int fprobe_module_callback(struct notifier_block *nb,
> > if (!alist.addrs)
> > return NOTIFY_DONE;
>
> The "retry" confuse me a little. How about we use "again" and "more"
> here:
>
> +again:
> + more = false;
OK. And Sashiko pointed out that we can retry right after calling
rhltable_walk_enter(), and it seems true according to
https://lwn.net/Articles/751374/
We can seep inside rhltable_walk_enter()/exit() but not inside
rhashtable_walk_start()/end().
So let me update it.
>
> >
> > +retry:
> > + retry = false;
> > + alist.index = 0;
> > mutex_lock(&fprobe_mutex);
> > rhltable_walk_enter(&fprobe_ip_table, &iter);
> > do {
> > rhashtable_walk_start(&iter);
> >
> > while ((node = rhashtable_walk_next(&iter)) && !IS_ERR(node))
> > - fprobe_remove_node_in_module(mod, node, &alist);
> > + if (fprobe_remove_node_in_module(mod, node, &alist) < 0) {
> > + retry = true;
> > + break;
> > + }
>
> Wrap the code within the "while" with {}?
OK.
Thank you!
>
> Thanks!
> Menglong Dong
>
> >
> > rhashtable_walk_stop(&iter);
> > } while (node == ERR_PTR(-EAGAIN));
> > @@ -615,6 +598,8 @@ static int fprobe_module_callback(struct notifier_block *nb,
> > if (alist.index > 0)
> > fprobe_set_ips(alist.addrs, alist.index, 1, 0);
> > mutex_unlock(&fprobe_mutex);
> > + if (retry)
> > + goto retry;
> >
> > kfree(alist.addrs);
> >
> >
> >
> >
>
>
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox