* Re: [PATCH v3] cpu/hotplug: Fix NULL kobject warning in cpuhp_smt_enable()
From: Jinjie Ruan @ 2026-06-10 2:44 UTC
To: Catalin Marinas
Cc: Will Deacon, corbet, skhan, punit.agrawal, jic23,
osama.abdelkader, chenl311, fengchengwen, suzuki.poulose, maz,
lpieralisi, timothy.hayes, sascha.bischoff, arnd,
mrigendra.chaubey, pierre.gondois, dietmar.eggemann, yangyicong,
sudeep.holla, linux-arm-kernel, linux-doc, linux-kernel
In-Reply-To: <aihUPsuEytsM6Dly@arm.com>
On 6/10/2026 1:58 AM, Catalin Marinas wrote:
> Hi Jinjie,
>
> On Wed, Jun 03, 2026 at 02:38:11PM +0800, Jinjie Ruan wrote:
>> On 6/2/2026 7:09 PM, Will Deacon wrote:
>>> On Wed, May 20, 2026 at 10:20:23AM +0800, Jinjie Ruan wrote:
>>>> When booting with ACPI, arm64 smp_prepare_cpus() currently sets all
>>>> enumerated CPUs as "present" regardless of their status in the MADT. This
>>>> causes issues with SMT hotplug control. For instance, with QEMU's
>>>> "-smp 4,maxcpus=8" configuration, the MADT GICC entries are populated as
>>>> follows: the first four CPUs are marked Enabled while the remaining four
>>>> are marked Online Capable to support potential hot-plugging.
>>>>
>>>> Fix this by:
>>>>
>>>> 1. When booting with ACPI, checking the ACPI_MADT_ENABLED flag in the GICC
>>>> entry before calling set_cpu_present() during SMP initialization.
>>>>
>>>> 2. Properly managing the present mask in acpi_map_cpu() and
>>>> acpi_unmap_cpu() to support actual CPU hotplug events. This aligns with
>>>> other architectures like x86 and LoongArch.
>>>>
>>>> 3. Update the arm64 CPU hotplug documentation to no longer state that all
>>>> online-capable vCPUs are marked as present by the kernel at boot time.
>>>>
>>>> This ensures that only physically available or explicitly enabled CPUs
>>>> are in the present mask, keeping the SMT control logic consistent with
>>>> the actual hardware state.
>>>
>>> Please can you check the Sashiko review comment?
>>>
>>> https://sashiko.dev/#/patchset/20260520022023.126670-1-ruanjinjie@huawei.com
>>
>> I think commit eba4675008a6 ("arm64: arch_register_cpu() variant to
>> check if an ACPI handle is now available.") introduced this bug.
>>
>> It introduced an architectural safety block inside
>> arch_unregister_cpu(). If a hot-unplug operation is determined to be a
>> physical hardware removal (where _STA evaluates to
>> !ACPI_STA_DEVICE_PRESENT), it aborts the unregistration transaction
>> early to protect unreadied arm64 infrastructure, thereby skipping
>> unregister_cpu().
>>
>> However, the generic ACPI processor driver path in
>> acpi_processor_post_eject() currently treats arch_unregister_cpu() as
>> an unconditional void operation. When arch_unregister_cpu() bails out
>> early, the subsequent cleanup flow blindly proceeds to call
>> acpi_unmap_cpu(), clears global per-cpu processor arrays, and
>> unconditionally frees the 'struct acpi_processor' object.
>>
>> I think we can fix this by:
>>
>> 1. Refactoring arch_unregister_cpu() to return an integer
>> transaction status. It returns -EOPNOTSUPP when aborting due to physical
>> hot-remove blocking, -EINVAL/-EIO on firmware failures, and 0 only upon
>> successful unregistration.
>>
>> 2. Guarding the downstream execution flow in
>> acpi_processor_post_eject(). If arch_unregister_cpu() returns an error
>> code, the hot-unplug transaction is considered aborted.
>
> I wonder whether we need all this guarding. In the worst case, we could
> rewrite the function, something like below, to always unregister and
> only warn:
>
> void arch_unregister_cpu(int cpu)
> {
> acpi_handle acpi_handle = acpi_get_processor_handle(cpu);
> struct cpu *c = &per_cpu(cpu_devices, cpu);
> acpi_status status;
> unsigned long long sta;
>
> if (!acpi_handle) {
> pr_err_once("Removing a CPU without associated ACPI handle\n");
> } else {
> status = acpi_evaluate_integer(acpi_handle, "_STA", NULL, &sta);
> if (!ACPI_FAILURE(status) &&
> cpu_present(cpu) && !(sta & ACPI_STA_DEVICE_PRESENT))
> pr_err_once("Changing CPU present bit is not supported\n");
> }
>
> unregister_cpu(c);
> }
>
> However, on the first condition, can we actually trigger !acpi_handle?
> If not, we could just drop it. I tried to look up the paths and I don't
> think we'd ever end up in this function with !acpi_handle. So this
> leaves us with the next checks.
You are absolutely right:

1. Source binding: During the CPU hot-add phase, acpi_add_single_object()
   directly binds a valid firmware handle to device->handle, which is then
   stored into per_cpu(processors, cpu) via acpi_processor_add().

2. Identical lifecycle: When the hot-unplug path later invokes
   acpi_get_processor_handle(), it retrieves the exact same active
   pr->handle managed by the ACPI device framework, guaranteeing that the
   returned handle is never NULL as long as the device exists.
static struct acpi_scan_handler processor_handler = {
	.ids = processor_device_ids,
	.attach = acpi_processor_add,
#ifdef CONFIG_ACPI_HOTPLUG_CPU
	.post_eject = acpi_processor_post_eject,
#endif
	.hotplug = {
		.enabled = true,
	},
};
acpi_bus_scan()
-> acpi_bus_check_add()
-> acpi_add_single_object(&device, handle, type, !first_pass)
-> acpi_init_device_object()
-> device->handle = handle
acpi_processor_hotadd_init()
-> acpi_processor_set_per_cpu(pr, device)
-> per_cpu(processors, pr->id) = pr
acpi_processor_add()
-> pr->handle = device->handle
acpi_get_processor_handle()
-> pr = per_cpu(processors, cpu)
-> return pr->handle
>
> On the second/third conditions, it's more about preventing physical CPU
> hotplug as we haven't properly defined it for arm yet but we could just
> add a WARN_ONCE() to make it more visible and still proceed with the
> unregistering. I think with your proposal, we don't fully unroll the
Agreed. Unregistering the CPU is absolutely necessary at this stage
since we cannot fully roll back the state anyway, and adding a
WARN_ONCE() is more than sufficient to flag unsupported physical CPU
hot-unplug on ARM64 for now.
> state anyway just by returning an error in arch_unregister_cpu(), so I'd
> rather continue here.
Exactly. A perfect rollback at this stage is extremely difficult, and
clean-up is rarely complete. It is simpler and more robust to force the
unregistration and carry on. This is also consistent with other
architectures, which essentially unregister the CPU unconditionally.
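
For illustration only, a minimal userspace sketch of the "warn once,
then unregister anyway" flow discussed here. It is not kernel code:
struct model_cpu, model_unregister_cpu() and warn_count are hypothetical
names, and only the bit value 0x01 mirrors ACPI_STA_DEVICE_PRESENT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define STA_DEVICE_PRESENT 0x01ULL /* same bit as ACPI_STA_DEVICE_PRESENT */

/* Hypothetical model of a CPU's registration state; not a kernel struct. */
struct model_cpu {
	bool registered;
	bool present;
};

static int warn_count; /* number of warnings actually emitted */

/*
 * Complain once if _STA says the CPU is no longer present, but always
 * complete the unregistration instead of bailing out early.
 */
static void model_unregister_cpu(struct model_cpu *c, unsigned long long sta)
{
	static bool warned;

	assert(c != NULL);

	if (c->present && !(sta & STA_DEVICE_PRESENT) && !warned) {
		warned = true;
		warn_count++;
		fprintf(stderr, "Changing CPU present bit is not supported\n");
	}

	c->registered = false; /* unconditional, as in the proposal above */
}
```

The point of the sketch is that the caller never sees a failure, so the
later cleanup steps cannot run against a half-unregistered CPU.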
>
> What does firmware do for virtual CPU hotplug w.r.t. _STA? I noticed a
> slight change in wording in the cpu-hotplug.rst doc with your patch from
>
> On virtual systems the _STA method must always report the CPU as
> ``present``
>
> to
>
> On virtual systems the _STA method must report the CPU as ``present``
> when it is activated by the firmware
>
> Was your intention that _STA.PRESENT can become 0 when hot-unplugging
> virtual CPUs?
Sorry, that was not the intention but a mistake. On ARM64 virtual
systems, _STA.PRESENT will always remain 1 even when a vCPU is
hot-unplugged. Due to ARM64 architectural constraints (such as the GICv3
Redistributor and KVM vGIC configuration which must be statically sized
at boot), virtual CPU hotplug is emulated by keeping all possible vCPUs
present in the system, while toggling their availability via the
_STA.ENABLED bit.
The following ACPI status is exposed to the guest kernel:
a. Always _STA.Present=1 (all possible vCPUs)
b. _STA.Enabled=1 (plugged vCPUs)
c. _STA.Enabled=0 (unplugged vCPUs)
Link: https://lists.gnu.org/archive/html/qemu-devel/2025-05/msg05076.html
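
As a rough illustration of the three cases above, here is a small
userspace sketch that classifies a _STA value. The enum and function
names are hypothetical; the bit positions (bit 0 = present, bit 1 =
enabled) follow the ACPI _STA definition.

```c
#include <assert.h>

#define STA_PRESENT 0x01u /* bit 0, as in ACPI_STA_DEVICE_PRESENT */
#define STA_ENABLED 0x02u /* bit 1, as in ACPI_STA_DEVICE_ENABLED */

/* Hypothetical classification of a vCPU as seen by the guest. */
enum vcpu_state {
	VCPU_UNEXPECTED, /* Present=0: not expected on arm64 virtual systems */
	VCPU_UNPLUGGED,  /* Present=1, Enabled=0 */
	VCPU_PLUGGED     /* Present=1, Enabled=1 */
};

static enum vcpu_state classify_vcpu(unsigned int sta)
{
	if (!(sta & STA_PRESENT))
		return VCPU_UNEXPECTED;

	return (sta & STA_ENABLED) ? VCPU_PLUGGED : VCPU_UNPLUGGED;
}

/* Usage: the cases a, b and c listed above, plus the unexpected one. */
static void classify_vcpu_examples(void)
{
	assert(classify_vcpu(STA_PRESENT | STA_ENABLED) == VCPU_PLUGGED);
	assert(classify_vcpu(STA_PRESENT) == VCPU_UNPLUGGED);
	assert(classify_vcpu(0) == VCPU_UNEXPECTED);
}
```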
>
* Re: [PATCH net-next v08 1/5] hinic3: Add ethtool queue ops
From: Jakub Kicinski @ 2026-06-10 2:24 UTC
To: Fan Gong
Cc: Wu Di, Teng Peisen, netdev, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Andrew Lunn, Ioana Ciornei,
Mohsin Bashir, linux-kernel, linux-doc, luosifu, Xin Guo,
Zhou Shuai, Wu Like, Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <4ad179dd9082df5e738219e05d90ddb2dcdad8f0.1780907605.git.wudi234@huawei.com>
On Mon, 8 Jun 2026 20:36:30 +0800 Fan Gong wrote:
> + netdev_info(netdev, "Change Tx/Rx ring depth from %u/%u to %u/%u\n",
> + nic_dev->q_params.sq_depth, nic_dev->q_params.rq_depth,
> + new_sq_depth, new_rq_depth);
Please don't print messages like this; ethtool generates netlink
notifications when the config changes. If someone cares, they can subscribe.
> + if (!netif_running(netdev)) {
> + hinic3_update_qp_depth(netdev, new_sq_depth, new_rq_depth);
> + } else {
> + q_params = nic_dev->q_params;
> + q_params.sq_depth = new_sq_depth;
> + q_params.rq_depth = new_rq_depth;
> +
> + err = hinic3_change_channel_settings(netdev, &q_params);
> + if (err) {
> + NL_SET_ERR_MSG_MOD(extack,
> + "Failed to change channel settings");
This message is useless. If you don't have a specific error to report,
don't report one. Also see:
https://lore.kernel.org/r/20260609190919.1139517-1-kuba@kernel.org/
* Re: [PATCH] ALSA: docs: remove references to removed CONFIG_SND_HDA_POWER_SAVE
From: Randy Dunlap @ 2026-06-10 2:11 UTC
To: Ethan Nelson-Moore, linux-sound, linux-doc
Cc: Jaroslav Kysela, Takashi Iwai, Jonathan Corbet, Shuah Khan,
Rhys Tumelty
In-Reply-To: <20260610015614.41530-1-enelsonmoore@gmail.com>
On 6/9/26 6:56 PM, Ethan Nelson-Moore wrote:
> The CONFIG_SND_HDA_POWER_SAVE option was removed in commit 83012a7ccbb9
> ("ALSA: hda - Clean up CONFIG_SND_HDA_POWER_SAVE"), but references to
> it remained in documentation. Remove them.
>
> Discovered while searching for CONFIG_* symbols referenced in code but
> not defined in any Kconfig file.
>
> Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
LGTM. Thanks.
Acked-by: Randy Dunlap <rdunlap@infradead.org>
> ---
> Documentation/sound/designs/powersave.rst | 5 +++--
> Documentation/sound/hd-audio/notes.rst | 3 ---
> 2 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/sound/designs/powersave.rst b/Documentation/sound/designs/powersave.rst
> index ca7d1e838b4d..4b9d6d0b0d98 100644
> --- a/Documentation/sound/designs/powersave.rst
> +++ b/Documentation/sound/designs/powersave.rst
> @@ -3,8 +3,9 @@ Notes on Power-Saving Mode
> ==========================
>
> AC97 and HD-audio drivers have the automatic power-saving mode.
> -This feature is enabled via Kconfig ``CONFIG_SND_AC97_POWER_SAVE``
> -and ``CONFIG_SND_HDA_POWER_SAVE`` options, respectively.
> +For HD-audio devices, this feature is enabled if ``CONFIG_PM`` is
> +enabled. For AC97 devices, it is enabled via the Kconfig
> +``CONFIG_SND_AC97_POWER_SAVE`` option.
>
> With the automatic power-saving, the driver turns off the codec power
> appropriately when no operation is required. When no applications use
> diff --git a/Documentation/sound/hd-audio/notes.rst b/Documentation/sound/hd-audio/notes.rst
> index 6993bfa159b4..1412a8eabfa8 100644
> --- a/Documentation/sound/hd-audio/notes.rst
> +++ b/Documentation/sound/hd-audio/notes.rst
> @@ -341,9 +341,6 @@ hwdep option above. When enabled, you'll have some sysfs files under
> the corresponding hwdep directory. See "HD-audio reconfiguration"
> section below.
>
> -``CONFIG_SND_HDA_POWER_SAVE`` option enables the power-saving feature.
> -See "Power-saving" section below.
> -
>
> Codec Proc-File
> ---------------
--
~Randy
* Re: [PATCH net-next v08 4/5] hinic3: Add ethtool rss ops
From: Jakub Kicinski @ 2026-06-10 2:03 UTC
To: Fan Gong
Cc: Wu Di, Teng Peisen, netdev, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, Andrew Lunn, Ioana Ciornei,
Mohsin Bashir, linux-kernel, linux-doc, luosifu, Xin Guo,
Zhou Shuai, Wu Like, Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <c9945323626546592031f3a2c65c798cfa66fdc9.1780907605.git.wudi234@huawei.com>
On Mon, 8 Jun 2026 20:36:33 +0800 Fan Gong wrote:
> + }
> +
> + indir_tbl = (__le16 *)pair.out->buf;
> + for (i = 0; i < L2NIC_RSS_INDIR_SIZE; i++)
> + indir_table[i] = le16_to_cpu(*(indir_tbl + i));
This cast needs a __force
drivers/net/ethernet/huawei/hinic3/hinic3_rss.c:771:9: warning: cast from restricted __le16
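
For readers unfamiliar with the warning: sparse tracks endianness through
annotated types such as __le16, and __force marks a cast across those
annotations as intentional. The userspace sketch below does not show
that annotation; it only illustrates the conversion the loop performs,
decoding little-endian 16-bit entries from a raw byte buffer.
le16_decode() and copy_indir_table() are hypothetical names, and the
byte-buffer layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one little-endian 16-bit value, independent of host endianness. */
static uint16_t le16_decode(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Copy n little-endian entries from a raw buffer into host-order values. */
static void copy_indir_table(uint16_t *dst, const uint8_t *buf, size_t n)
{
	size_t i;

	assert(dst != NULL && buf != NULL);

	for (i = 0; i < n; i++)
		dst[i] = le16_decode(buf + 2 * i);
}
```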
* [PATCH] ALSA: docs: remove references to removed CONFIG_SND_HDA_POWER_SAVE
From: Ethan Nelson-Moore @ 2026-06-10 1:56 UTC
To: linux-sound, linux-doc
Cc: Ethan Nelson-Moore, Jaroslav Kysela, Takashi Iwai,
Jonathan Corbet, Shuah Khan, Rhys Tumelty, Randy Dunlap
The CONFIG_SND_HDA_POWER_SAVE option was removed in commit 83012a7ccbb9
("ALSA: hda - Clean up CONFIG_SND_HDA_POWER_SAVE"), but references to
it remained in documentation. Remove them.
Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
---
Documentation/sound/designs/powersave.rst | 5 +++--
Documentation/sound/hd-audio/notes.rst | 3 ---
2 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/Documentation/sound/designs/powersave.rst b/Documentation/sound/designs/powersave.rst
index ca7d1e838b4d..4b9d6d0b0d98 100644
--- a/Documentation/sound/designs/powersave.rst
+++ b/Documentation/sound/designs/powersave.rst
@@ -3,8 +3,9 @@ Notes on Power-Saving Mode
==========================
AC97 and HD-audio drivers have the automatic power-saving mode.
-This feature is enabled via Kconfig ``CONFIG_SND_AC97_POWER_SAVE``
-and ``CONFIG_SND_HDA_POWER_SAVE`` options, respectively.
+For HD-audio devices, this feature is enabled if ``CONFIG_PM`` is
+enabled. For AC97 devices, it is enabled via the Kconfig
+``CONFIG_SND_AC97_POWER_SAVE`` option.
With the automatic power-saving, the driver turns off the codec power
appropriately when no operation is required. When no applications use
diff --git a/Documentation/sound/hd-audio/notes.rst b/Documentation/sound/hd-audio/notes.rst
index 6993bfa159b4..1412a8eabfa8 100644
--- a/Documentation/sound/hd-audio/notes.rst
+++ b/Documentation/sound/hd-audio/notes.rst
@@ -341,9 +341,6 @@ hwdep option above. When enabled, you'll have some sysfs files under
the corresponding hwdep directory. See "HD-audio reconfiguration"
section below.
-``CONFIG_SND_HDA_POWER_SAVE`` option enables the power-saving feature.
-See "Power-saving" section below.
-
Codec Proc-File
---------------
--
2.43.0
* Re: [PATCH v4 4/6] alloc_tag: add accuracy based filtering to ioctl
From: Hao Ge @ 2026-06-10 1:52 UTC
To: Abhishek Bapat, Suren Baghdasaryan, Andrew Morton,
Kent Overstreet
Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
Sourav Panda
In-Reply-To: <7f3a4ddb3f132464f17716eaae657a6367d6dd05.1781042698.git.abhishekbapat@google.com>
On 2026/6/10 08:12, Abhishek Bapat wrote:
> Extend the allocinfo filtering mechanism to allow users to filter tags
> based on their accuracy.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
Acked-by: Hao Ge <hao.ge@linux.dev>
> ---
> include/uapi/linux/alloc_tag.h | 4 ++++
> lib/alloc_tag.c | 8 ++++++++
> 2 files changed, 12 insertions(+)
>
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> index 7f5acbb44c14..6ea39c4869fe 100644
> --- a/include/uapi/linux/alloc_tag.h
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -26,6 +26,8 @@ struct allocinfo_tag {
> char function[ALLOCINFO_STR_SIZE];
> char filename[ALLOCINFO_STR_SIZE];
> __u64 lineno;
> + /* filter criteria only; see allocinfo_counter.accurate for actual accuracy */
> + __u64 inaccurate;
> };
>
> /* The alignment ensures 32-bit compatible interfaces are not broken */
> @@ -45,6 +47,7 @@ enum {
> ALLOCINFO_FILTER_FUNCTION,
> ALLOCINFO_FILTER_FILENAME,
> ALLOCINFO_FILTER_LINENO,
> + ALLOCINFO_FILTER_INACCURATE,
> ALLOCINFO_FILTER_MIN_SIZE,
> ALLOCINFO_FILTER_MAX_SIZE,
> __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_MAX_SIZE
> @@ -54,6 +57,7 @@ enum {
> #define ALLOCINFO_FILTER_MASK_FUNCTION (1 << ALLOCINFO_FILTER_FUNCTION)
> #define ALLOCINFO_FILTER_MASK_FILENAME (1 << ALLOCINFO_FILTER_FILENAME)
> #define ALLOCINFO_FILTER_MASK_LINENO (1 << ALLOCINFO_FILTER_LINENO)
> +#define ALLOCINFO_FILTER_MASK_INACCURATE (1 << ALLOCINFO_FILTER_INACCURATE)
> #define ALLOCINFO_FILTER_MASK_MIN_SIZE (1 << ALLOCINFO_FILTER_MIN_SIZE)
> #define ALLOCINFO_FILTER_MASK_MAX_SIZE (1 << ALLOCINFO_FILTER_MAX_SIZE)
>
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index a936cf18611a..73fb3d0ab821 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -249,6 +249,8 @@ static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter,
> struct alloc_tag_counters *counters,
> bool *fetched_counters)
> {
> + bool inaccurate;
> +
> if (!filter || !filter->mask)
> return true;
>
> @@ -274,6 +276,12 @@ static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter,
> ct->lineno != filter->fields.lineno)
> return false;
>
> + if (filter->mask & ALLOCINFO_FILTER_MASK_INACCURATE) {
> + inaccurate = !!(ct->flags & CODETAG_FLAG_INACCURATE);
> + if (inaccurate != !!(filter->fields.inaccurate))
> + return false;
> + }
> +
> if (filter->mask & (ALLOCINFO_FILTER_MASK_MIN_SIZE | ALLOCINFO_FILTER_MASK_MAX_SIZE)) {
> if (!*fetched_counters) {
> *counters = allocinfo_prefetch_counters(ct);
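
A minimal userspace sketch of the accuracy check in the hunk above, for
illustration only. The MODEL_* constants and struct model_filter are
hypothetical stand-ins for the kernel definitions. It shows how the
double negation normalizes both sides, so any non-zero filter value
selects inaccurate tags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for CODETAG_FLAG_INACCURATE and the filter mask. */
#define MODEL_FLAG_INACCURATE (1u << 0)
#define MODEL_MASK_INACCURATE (1u << 4)

struct model_filter {
	uint64_t mask;
	uint64_t inaccurate; /* any non-zero value means "inaccurate only" */
};

static bool matches_accuracy(unsigned int tag_flags,
			     const struct model_filter *f)
{
	bool inaccurate;

	assert(f != NULL);

	if (!(f->mask & MODEL_MASK_INACCURATE))
		return true; /* accuracy filtering was not requested */

	/* Normalize both sides to 0/1 before comparing. */
	inaccurate = !!(tag_flags & MODEL_FLAG_INACCURATE);
	return inaccurate == !!f->inaccurate;
}
```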
* Re: [PATCH v4 3/6] alloc_tag: add size-based filtering to ioctl
From: Hao Ge @ 2026-06-10 1:49 UTC
To: Abhishek Bapat, Suren Baghdasaryan, Andrew Morton,
Kent Overstreet
Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
Sourav Panda
In-Reply-To: <4e2a75c69fe350358e1fef3e4e25435f6a3d4e77.1781042698.git.abhishekbapat@google.com>
On 2026/6/10 08:12, Abhishek Bapat wrote:
> Extend the allocinfo filtering mechanism to allow users to filter tags
> based on the total number of bytes allocated [min_size, max_size]. The
> size range is inclusive.
>
> Filtering by size involves retrieving allocinfo per-CPU counters, which
> is an expensive operation. Hence, the performance of size-based
> filtering will be worse than other filters.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
Acked-by: Hao Ge <hao.ge@linux.dev>
> ---
> include/uapi/linux/alloc_tag.h | 8 ++++-
> lib/alloc_tag.c | 63 ++++++++++++++++++++++++++++------
> 2 files changed, 59 insertions(+), 12 deletions(-)
>
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> index 3b11877955b9..7f5acbb44c14 100644
> --- a/include/uapi/linux/alloc_tag.h
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -45,13 +45,17 @@ enum {
> ALLOCINFO_FILTER_FUNCTION,
> ALLOCINFO_FILTER_FILENAME,
> ALLOCINFO_FILTER_LINENO,
> - __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_LINENO
> + ALLOCINFO_FILTER_MIN_SIZE,
> + ALLOCINFO_FILTER_MAX_SIZE,
> + __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_MAX_SIZE
> };
>
> #define ALLOCINFO_FILTER_MASK_MODNAME (1 << ALLOCINFO_FILTER_MODNAME)
> #define ALLOCINFO_FILTER_MASK_FUNCTION (1 << ALLOCINFO_FILTER_FUNCTION)
> #define ALLOCINFO_FILTER_MASK_FILENAME (1 << ALLOCINFO_FILTER_FILENAME)
> #define ALLOCINFO_FILTER_MASK_LINENO (1 << ALLOCINFO_FILTER_LINENO)
> +#define ALLOCINFO_FILTER_MASK_MIN_SIZE (1 << ALLOCINFO_FILTER_MIN_SIZE)
> +#define ALLOCINFO_FILTER_MASK_MAX_SIZE (1 << ALLOCINFO_FILTER_MAX_SIZE)
>
> #define ALLOCINFO_FILTER_MASKS \
> ((1 << (__ALLOCINFO_FILTER_LAST + 1)) - 1)
> @@ -59,6 +63,8 @@ enum {
> struct allocinfo_filter {
> __u64 mask; /* bitmask of the filter fields used */
> struct allocinfo_tag fields;
> + __u64 min_size;
> + __u64 max_size;
> };
>
> struct allocinfo_get_at {
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 378fcd63b6c9..a936cf18611a 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -191,15 +191,26 @@ static int allocinfo_cmp_str(const char *str, const char *template)
> return strncmp(allocinfo_str(str), template, ALLOCINFO_STR_SIZE);
> }
>
> +/* Fetch the per-CPU counters */
> +static inline struct alloc_tag_counters allocinfo_prefetch_counters(struct codetag *ct)
> +{
> + return alloc_tag_read(ct_to_alloc_tag(ct));
> +}
> +
> /*
> * Populates the UAPI allocinfo_tag_data structure with active runtime
> * profiling counters extracted from the given kernel codetag.
> */
> static void allocinfo_to_params(struct codetag *ct,
> - struct allocinfo_tag_data *data)
> + struct allocinfo_tag_data *data,
> + struct alloc_tag_counters *counters)
> {
> - struct alloc_tag *tag = ct_to_alloc_tag(ct);
> - struct alloc_tag_counters counter = alloc_tag_read(tag);
> + struct alloc_tag_counters local_counters;
> +
> + if (!counters) {
> + local_counters = allocinfo_prefetch_counters(ct);
> + counters = &local_counters;
> + }
>
> if (ct->modname)
> allocinfo_copy_str(data->tag.modname, ct->modname);
> @@ -208,9 +219,9 @@ static void allocinfo_to_params(struct codetag *ct,
> allocinfo_copy_str(data->tag.function, ct->function);
> allocinfo_copy_str(data->tag.filename, ct->filename);
> data->tag.lineno = ct->lineno;
> - data->counter.bytes = counter.bytes;
> - data->counter.calls = counter.calls;
> - data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> + data->counter.bytes = counters->bytes;
> + data->counter.calls = counters->calls;
> + data->counter.accurate = !alloc_tag_is_inaccurate(ct_to_alloc_tag(ct));
> }
>
> /*
> @@ -234,7 +245,9 @@ static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> * Verifies whether a given codetag satisfies the active filtering criteria by
> * matching its characteristics against the specified filter.
> */
> -static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
> +static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter,
> + struct alloc_tag_counters *counters,
> + bool *fetched_counters)
> {
> if (!filter || !filter->mask)
> return true;
> @@ -261,6 +274,19 @@ static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
> ct->lineno != filter->fields.lineno)
> return false;
>
> + if (filter->mask & (ALLOCINFO_FILTER_MASK_MIN_SIZE | ALLOCINFO_FILTER_MASK_MAX_SIZE)) {
> + if (!*fetched_counters) {
> + *counters = allocinfo_prefetch_counters(ct);
> + *fetched_counters = true;
> + }
> + if ((filter->mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) &&
> + counters->bytes < filter->min_size)
> + return false;
> + if ((filter->mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) &&
> + counters->bytes > filter->max_size)
> + return false;
> + }
> +
> return true;
> }
>
> @@ -274,6 +300,8 @@ static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> struct codetag *ct;
> struct allocinfo_get_at params = {0};
> __u64 skip_count;
> + struct alloc_tag_counters counters;
> + bool fetched_counters;
>
> if (copy_from_user(&params, arg, sizeof(params)))
> return -EFAULT;
> @@ -281,6 +309,11 @@ static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> if (params.filter.mask & ~ALLOCINFO_FILTER_MASKS)
> return -EINVAL;
>
> + if ((params.filter.mask & ALLOCINFO_FILTER_MASK_MIN_SIZE) &&
> + (params.filter.mask & ALLOCINFO_FILTER_MASK_MAX_SIZE) &&
> + params.filter.min_size > params.filter.max_size)
> + return -EINVAL;
> +
> priv = m->private;
>
> mutex_lock(&priv->ioctl_lock);
> @@ -304,7 +337,8 @@ static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> ct = codetag_next_ct(&priv->ioctl_iter);
>
> while (ct) {
> - if (matches_filter(ct, &priv->filter)) {
> + fetched_counters = false;
> + if (matches_filter(ct, &priv->filter, &counters, &fetched_counters)) {
> if (skip_count == 0)
> break;
> skip_count--;
> @@ -313,7 +347,7 @@ static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> }
>
> if (ct) {
> - allocinfo_to_params(ct, &params.data);
> + allocinfo_to_params(ct, &params.data, fetched_counters ? &counters : NULL);
> priv->positioned = true;
> }
>
> @@ -339,6 +373,8 @@ static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> struct codetag *ct;
> struct allocinfo_tag_data params;
> int ret = 0;
> + struct alloc_tag_counters counters;
> + bool fetched_counters;
>
> memset(&params, 0, sizeof(params));
> priv = m->private;
> @@ -352,10 +388,15 @@ static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> }
>
> ct = codetag_next_ct(&priv->ioctl_iter);
> - while (ct && !matches_filter(ct, &priv->filter))
> + while (ct) {
> + fetched_counters = false;
> + if (matches_filter(ct, &priv->filter, &counters, &fetched_counters))
> + break;
> ct = codetag_next_ct(&priv->ioctl_iter);
> + }
> +
> if (ct)
> - allocinfo_to_params(ct, &params);
> + allocinfo_to_params(ct, &params, fetched_counters ? &counters : NULL);
>
> if (!ct) {
> priv->positioned = false;
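
To illustrate the lazy counter fetch used by the size filter, here is a
small userspace sketch. All names (model_tag, model_counters,
model_read_counters(), matches_size(), fetch_calls) are hypothetical;
the real code reads per-CPU counters via alloc_tag_read(). The sketch
shows that the expensive read happens at most once per tag, and not at
all when no size filter is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MASK_MIN_SIZE (1u << 0) /* hypothetical mask bits */
#define MODEL_MASK_MAX_SIZE (1u << 1)

struct model_counters { uint64_t bytes; uint64_t calls; };
struct model_tag { uint64_t bytes; uint64_t calls; };

static int fetch_calls; /* how often the "expensive" read ran */

static struct model_counters model_read_counters(const struct model_tag *t)
{
	struct model_counters c = { t->bytes, t->calls };

	fetch_calls++;
	return c;
}

/* Inclusive [min_size, max_size] check; counters are fetched on demand. */
static bool matches_size(const struct model_tag *t, uint64_t mask,
			 uint64_t min_size, uint64_t max_size,
			 struct model_counters *counters, bool *fetched)
{
	assert(t != NULL && counters != NULL && fetched != NULL);

	if (!(mask & (MODEL_MASK_MIN_SIZE | MODEL_MASK_MAX_SIZE)))
		return true;

	if (!*fetched) {
		*counters = model_read_counters(t);
		*fetched = true;
	}

	if ((mask & MODEL_MASK_MIN_SIZE) && counters->bytes < min_size)
		return false;
	if ((mask & MODEL_MASK_MAX_SIZE) && counters->bytes > max_size)
		return false;

	return true;
}
```

The caller can then hand the already-fetched counters to the code that
fills in the reply, which is what the fetched_counters flag in the patch
is for.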
* Re: [PATCH v4 2/6] alloc_tag: add ioctl filters to /proc/allocinfo
From: Hao Ge @ 2026-06-10 1:45 UTC
To: Abhishek Bapat, Suren Baghdasaryan, Andrew Morton,
Kent Overstreet
Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
Sourav Panda
In-Reply-To: <8cd864b3bdbf89973e9a1fbd6e8ed1e9c08989b9.1781042698.git.abhishekbapat@google.com>
On 2026/6/10 08:12, Abhishek Bapat wrote:
> Extend the capability of the IOCTL mechanism to filter allocations based
> on tag's module name, function name, file name and line number.
>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
Acked-by: Hao Ge <hao.ge@linux.dev>
> ---
> include/uapi/linux/alloc_tag.h | 26 ++++++++++++-
> lib/alloc_tag.c | 68 ++++++++++++++++++++++++++++++++--
> 2 files changed, 89 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> index 0928e1a48d49..3b11877955b9 100644
> --- a/include/uapi/linux/alloc_tag.h
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -40,8 +40,32 @@ struct allocinfo_tag_data {
> struct allocinfo_counter counter;
> };
>
> +enum {
> + ALLOCINFO_FILTER_MODNAME,
> + ALLOCINFO_FILTER_FUNCTION,
> + ALLOCINFO_FILTER_FILENAME,
> + ALLOCINFO_FILTER_LINENO,
> + __ALLOCINFO_FILTER_LAST = ALLOCINFO_FILTER_LINENO
> +};
> +
> +#define ALLOCINFO_FILTER_MASK_MODNAME (1 << ALLOCINFO_FILTER_MODNAME)
> +#define ALLOCINFO_FILTER_MASK_FUNCTION (1 << ALLOCINFO_FILTER_FUNCTION)
> +#define ALLOCINFO_FILTER_MASK_FILENAME (1 << ALLOCINFO_FILTER_FILENAME)
> +#define ALLOCINFO_FILTER_MASK_LINENO (1 << ALLOCINFO_FILTER_LINENO)
> +
> +#define ALLOCINFO_FILTER_MASKS \
> + ((1 << (__ALLOCINFO_FILTER_LAST + 1)) - 1)
> +
> +struct allocinfo_filter {
> + __u64 mask; /* bitmask of the filter fields used */
> + struct allocinfo_tag fields;
> +};
> +
> struct allocinfo_get_at {
> - __u64 pos; /* input */
> + /* inputs */
> + __u64 pos;
> + struct allocinfo_filter filter;
> + /* output */
> struct allocinfo_tag_data data;
> };
>
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index a0577215eb3d..378fcd63b6c9 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -49,6 +49,7 @@ struct allocinfo_private {
> struct codetag_iterator iter;
> struct codetag_iterator reported_iter;
> bool print_header;
> + struct allocinfo_filter filter;
> /* ioctl uses a separate iterator not to interfere with reads */
> struct codetag_iterator ioctl_iter;
> bool positioned; /* seq_open_private() sets to 0 */
> @@ -184,6 +185,12 @@ static void allocinfo_copy_str(char *dest, const char *src)
> strscpy_pad(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> }
>
> +/* Compare two strings and only consider the trimmed suffix if s1 is too long */
> +static int allocinfo_cmp_str(const char *str, const char *template)
> +{
> + return strncmp(allocinfo_str(str), template, ALLOCINFO_STR_SIZE);
> +}
> +
> /*
> * Populates the UAPI allocinfo_tag_data structure with active runtime
> * profiling counters extracted from the given kernel codetag.
> @@ -223,6 +230,40 @@ static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> return 0;
> }
>
> +/*
> + * Verifies whether a given codetag satisfies the active filtering criteria by
> + * matching its characteristics against the specified filter.
> + */
> +static bool matches_filter(struct codetag *ct, struct allocinfo_filter *filter)
> +{
> + if (!filter || !filter->mask)
> + return true;
> +
> + if (filter->mask & ALLOCINFO_FILTER_MASK_MODNAME) {
> + /* user wants to filter by modname but ct->modname is NULL */
> + if (!ct->modname) {
> + /* validate if user was attempting to filter for built-in allocations */
> + if (filter->fields.modname[0] != '\0')
> + return false;
> + } else if (allocinfo_cmp_str(ct->modname, filter->fields.modname))
> + return false;
> + }
> +
> + if ((filter->mask & ALLOCINFO_FILTER_MASK_FUNCTION) &&
> + ct->function && allocinfo_cmp_str(ct->function, filter->fields.function))
> + return false;
> +
> + if ((filter->mask & ALLOCINFO_FILTER_MASK_FILENAME) &&
> + ct->filename && allocinfo_cmp_str(ct->filename, filter->fields.filename))
> + return false;
> +
> + if ((filter->mask & ALLOCINFO_FILTER_MASK_LINENO) &&
> + ct->lineno != filter->fields.lineno)
> + return false;
> +
> + return true;
> +}
> +
> /*
> * Seeks the ioctl iterator to the specified 0-indexed tag position, reads its
> * profiling data and returns it to userspace.
> @@ -231,29 +272,46 @@ static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> {
> struct allocinfo_private *priv;
> struct codetag *ct;
> - __u64 pos;
> struct allocinfo_get_at params = {0};
> + __u64 skip_count;
>
> if (copy_from_user(&params, arg, sizeof(params)))
> return -EFAULT;
>
> + if (params.filter.mask & ~ALLOCINFO_FILTER_MASKS)
> + return -EINVAL;
> +
> priv = m->private;
> - pos = params.pos;
>
> mutex_lock(&priv->ioctl_lock);
> codetag_lock_module_list(alloc_tag_cttype);
>
> - if (pos >= codetag_get_count(alloc_tag_cttype)) {
> + if (params.pos >= codetag_get_count(alloc_tag_cttype)) {
> codetag_unlock_module_list(alloc_tag_cttype);
> mutex_unlock(&priv->ioctl_lock);
> return -ENOENT;
> }
>
> + skip_count = params.pos;
> +
> + if (params.filter.mask)
> + priv->filter = params.filter;
> + else
> + priv->filter.mask = 0;
> +
> /* Find the codetag */
> priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> ct = codetag_next_ct(&priv->ioctl_iter);
> - while (ct && pos--)
> +
> + while (ct) {
> + if (matches_filter(ct, &priv->filter)) {
> + if (skip_count == 0)
> + break;
> + skip_count--;
> + }
> ct = codetag_next_ct(&priv->ioctl_iter);
> + }
> +
> if (ct) {
> allocinfo_to_params(ct, &params.data);
> priv->positioned = true;
> @@ -294,6 +352,8 @@ static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> }
>
> ct = codetag_next_ct(&priv->ioctl_iter);
> + while (ct && !matches_filter(ct, &priv->filter))
> + ct = codetag_next_ct(&priv->ioctl_iter);
> if (ct)
> allocinfo_to_params(ct, &params);
>
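
As an illustration of the mask validation and skip-count walk in the
patch above, a minimal userspace sketch follows. The MODEL_* names,
struct model_tag and find_nth_match() are hypothetical, and only a
line-number filter is modelled. With a filter active, pos counts
matching tags rather than raw positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	MODEL_FILTER_MODNAME,
	MODEL_FILTER_FUNCTION,
	MODEL_FILTER_FILENAME,
	MODEL_FILTER_LINENO,
	MODEL_FILTER_LAST = MODEL_FILTER_LINENO
};

#define MODEL_MASK_LINENO (1u << MODEL_FILTER_LINENO)
#define MODEL_MASKS       ((1u << (MODEL_FILTER_LAST + 1)) - 1)

struct model_tag { uint64_t lineno; };

/*
 * Return the index of the pos-th tag that matches the filter,
 * -1 if there is no such tag, or -2 if the mask has unknown bits.
 */
static long find_nth_match(const struct model_tag *tags, size_t n,
			   uint64_t mask, uint64_t lineno, uint64_t pos)
{
	uint64_t skip = pos;
	size_t i;

	assert(tags != NULL || n == 0);

	if (mask & ~(uint64_t)MODEL_MASKS)
		return -2; /* the patch returns -EINVAL here */

	for (i = 0; i < n; i++) {
		if ((mask & MODEL_MASK_LINENO) && tags[i].lineno != lineno)
			continue; /* non-matching tags do not consume skip */
		if (skip == 0)
			return (long)i;
		skip--;
	}

	return -1;
}
```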
* [RFC PATCH 3/3] acpi/numa: add CONFIG_ACPI_NUMA_ADD_CFMWS_NODES
From: Gregory Price @ 2026-06-10 1:45 UTC
To: linux-mm
Cc: x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rppt, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
gourry, dave.jiang, jic23, xueshuai, kai.huang
In-Reply-To: <20260610014517.253609-1-gourry@gourry.net>
CXL is intended to be a programmable topology, and a single CXL Fixed
Memory Window (CFMWS) may back memory that a driver wants to split across
multiple NUMA nodes for tiering or isolation.
Those nodes must exist at __init time to be usable later.
Add CONFIG_ACPI_NUMA_ADD_CFMWS_NODES, the number of additional standby
NUMA nodes to reserve per CEDT CFMWS entry.
acpi_parse_cfmws() records the per-window count, which is folded into
the standby request on successful acpi_numa_init().
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/acpi/numa/Kconfig | 20 ++++++++++++++++++++
drivers/acpi/numa/srat.c | 13 +++++++++++--
2 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/numa/Kconfig b/drivers/acpi/numa/Kconfig
index ecf27bf45e5b..65d7eb9a4022 100644
--- a/drivers/acpi/numa/Kconfig
+++ b/drivers/acpi/numa/Kconfig
@@ -14,6 +14,26 @@ config ACPI_HMAT
performance attributes through the node's sysfs device if
provided.
+config ACPI_NUMA_ADD_CFMWS_NODES
+ int "Additional standby NUMA nodes per CEDT CFMWS entry"
+ depends on ACPI_NUMA
+ range 0 4
+ default 0
+ help
+ Number of additional standby NUMA nodes to reserve per CEDT
+ CXL Fixed Memory Window Structure (CFMWS) entry.
+
+ By default ACPI reserves 1 NUMA node per unique PXM entry in
+ the SRAT, or 1 node for a CFMWS without SRAT mappings.
+
+ Setting this > 0 reserves additional standby nodes per CFMWS
+ that drivers can claim at runtime via
+ numa_request_exclusive_node(). This is useful for CXL drivers
+ that want to place memory on distinct NUMA nodes within the
+ same CXL Fixed Memory Window.
+
+ Set to 0 (default) to disable.
+
config ACPI_NUMA_STANDBY_NODES
int "Additional standby NUMA nodes for runtime claiming"
depends on ACPI_NUMA
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index d7b0e4ece610..6c54d5f0cf0a 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -354,6 +354,7 @@ static int __init acpi_parse_slit(struct acpi_table_header *table)
}
static int parsed_numa_memblks __initdata;
+static int cfmws_standby_count __initdata;
static int __init
acpi_parse_memory_affinity(union acpi_subtable_headers *header,
@@ -454,7 +455,7 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
* window.
*/
if (!numa_fill_memblks(start, end))
- return 0;
+ goto standby_nodes;
/* No SRAT description. Create a new node. */
node = acpi_map_pxm_to_node(*fake_pxm);
@@ -473,6 +474,11 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
/* Set the next available fake_pxm value */
(*fake_pxm)++;
+
+standby_nodes:
+ /* Request any standby nodes (created after numa_emulation runs) */
+ cfmws_standby_count += CONFIG_ACPI_NUMA_ADD_CFMWS_NODES;
+
return 0;
}
@@ -607,6 +613,8 @@ int __init acpi_numa_init(void)
if (acpi_disabled)
return -EINVAL;
+ cfmws_standby_count = 0;
+
/*
* Should not limit number with cpu num that is from NR_CPUS or nr_cpus=
* SRAT cpu entries could have different order with that in MADT.
@@ -666,7 +674,8 @@ int __init acpi_numa_init(void)
return -ENOENT;
/* Request any standby nodes (created after numa emulation) */
- numa_request_standby_count(CONFIG_ACPI_NUMA_STANDBY_NODES);
+ numa_request_standby_count(CONFIG_ACPI_NUMA_STANDBY_NODES +
+ cfmws_standby_count);
return 0;
}
--
2.54.0
* [RFC PATCH 2/3] acpi/numa: add CONFIG_ACPI_NUMA_STANDBY_NODES
From: Gregory Price @ 2026-06-10 1:45 UTC (permalink / raw)
To: linux-mm
Cc: x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rppt, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
gourry, dave.jiang, jic23, xueshuai, kai.huang
In-Reply-To: <20260610014517.253609-1-gourry@gourry.net>
Some platforms want to reserve empty NUMA nodes at boot so that drivers
can later place hotplugged memory on distinct nodes for memory tiering
or isolation, without those nodes being described by BIOS.
Add CONFIG_ACPI_NUMA_STANDBY_NODES, a platform-independent count of empty
nodes to reserve.
Deferring standby node creation until after NUMA emulation runs keeps
old numbering behaviors consistent for NUMA emulation users.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
drivers/acpi/numa/Kconfig | 15 +++++++++++++++
drivers/acpi/numa/srat.c | 3 +++
2 files changed, 18 insertions(+)
diff --git a/drivers/acpi/numa/Kconfig b/drivers/acpi/numa/Kconfig
index f33194d1e43f..ecf27bf45e5b 100644
--- a/drivers/acpi/numa/Kconfig
+++ b/drivers/acpi/numa/Kconfig
@@ -13,3 +13,18 @@ config ACPI_HMAT
register memory initiators with their targets, and export
performance attributes through the node's sysfs device if
provided.
+
+config ACPI_NUMA_STANDBY_NODES
+ int "Additional standby NUMA nodes for runtime claiming"
+ depends on ACPI_NUMA
+ range 0 16
+ default 0
+ help
+ Number of additional empty NUMA nodes to reserve at boot for
+ runtime claiming via numa_request_exclusive_node().
+
+ These nodes have no memory and no SRAT PXM association.
+ Drivers can claim them to place hotplugged memory on distinct
+ NUMA nodes for memory tiering or isolation purposes.
+
+ Set to 0 (default) to disable.
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 62d4a8df0b8c..d7b0e4ece610 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -664,6 +664,9 @@ int __init acpi_numa_init(void)
return cnt;
else if (!parsed_numa_memblks)
return -ENOENT;
+
+ /* Request any standby nodes (created after numa emulation) */
+ numa_request_standby_count(CONFIG_ACPI_NUMA_STANDBY_NODES);
return 0;
}
--
2.54.0
* [RFC PATCH 1/3] mm/numa: add exclusive node pool and numa=standby boot parameter
From: Gregory Price @ 2026-06-10 1:45 UTC (permalink / raw)
To: linux-mm
Cc: x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rppt, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
gourry, dave.jiang, jic23, xueshuai, kai.huang
In-Reply-To: <20260610014517.253609-1-gourry@gourry.net>
It is sometimes preferable to split hotplug memory capacity across more
nodes than BIOS describes at boot time.
However, nodes that are not described at __init time cannot be added
later on.
Add the core infrastructure for reserving empty "standby" NUMA nodes at
boot that drivers can claim at runtime.
Introduce an exclusive node pool with a runtime claim/release interface:
numa_request_exclusive_node() - claim a node from the pool
numa_release_exclusive_node() - return a node to the pool
This allows drivers to place hotplugged memory on distinct NUMA nodes
without requiring BIOS-assigned proximity domains.
Standby nodes are created after numa_emulation() has finalized the node
numbering. The count comes from the numa=standby=<N> boot parameter
plus any requests by init code via numa_request_standby_count().
Creating them post-emulation avoids perturbing the emulated node
numbering and keeps standby node ids from aliasing emulated nodes.
This also pushes off assigning node numbers until after all PXM mappings
have been created to keep the system view more consistent overall.
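The id selection described above can be sketched in userspace C roughly as follows. This is an illustrative model, not the kernel implementation: model_reserve_standby(), MODEL_MAX_NODES, and the 64-bit mask standing in for numa_nodes_parsed are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_NODES 64	/* hypothetical stand-in for MAX_NUMNODES */

/*
 * Reserve up to @total standby ids from the ids not yet set in @parsed,
 * lowest id first. Each chosen id is set in @parsed and stored in @out.
 * Returns how many ids were actually reserved, which may be fewer than
 * @total if the id space runs out.
 */
static int model_reserve_standby(uint64_t *parsed, int total, int *out)
{
	int registered = 0;

	for (int node = 0; node < MODEL_MAX_NODES && registered < total; node++) {
		uint64_t bit = UINT64_C(1) << node;

		if (*parsed & bit)
			continue;	/* id already used by a real or emulated node */
		*parsed |= bit;
		out[registered++] = node;
	}
	return registered;
}
```

Because the mask passed in is the one left behind after emulation, the ids picked here can only come from the unused part of the id space, which is the property the paragraph above relies on.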
numa_init_standby_nodes() rebuilds the NUMA distance table (using the
same method as numa_emulation) so that standby nodes have distance
entries. These entries may be programmed later via numa_set_distance().
As a result, it is also possible for drivers to use these standby nodes
to change memory tier membership and fallback ordering instead of being
tied down to what is described by BIOS / Firmware.
Additional Notes/Concerns:
1) Can we do dynamic addition of nodes?
Not trivially.
Some services utilize num_possible_nodes() as a static value to
calculate the amount of resources to use at runtime (bpf, md/raid5).
Example: futex_init uses num_possible_nodes() as part of its
hashsize calculation during __init.
2) Does this create phys_to_target_node() ambiguity?
No.
Every present user of phys_to_target_node() either uses it during
__init to set up associations, or after __init to associate a static
memory region and a node.
In neither case do these additional nodes create ambiguity.
We do at least add a comment to phys_to_target_node() to note that
this should only be used to determine the affiliation of a memory
region statically configured by BIOS (i.e. not hotplug memory).
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../admin-guide/kernel-parameters.txt | 8 ++
arch/x86/mm/numa.c | 2 +
drivers/base/arch_numa.c | 2 +
include/linux/numa.h | 14 +++
include/linux/numa_memblks.h | 3 +
mm/numa.c | 90 +++++++++++++
mm/numa_memblks.c | 118 +++++++++++++++++-
7 files changed, 236 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 23be2f64439c..5410498c97af 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4765,6 +4765,14 @@ Kernel parameters
numa=nohmat [X86] Don't parse the HMAT table for NUMA setup, or
soft-reserved memory partitioning.
+ numa=standby=<N>
+ [KNL, ARM64, RISCV, X86, EARLY]
+ Reserve N additional empty NUMA nodes at boot for
+ runtime claiming via numa_request_exclusive_node().
+ These nodes have no memory or CPU affinity. Drivers
+ can claim them to place hotplugged memory on distinct
+ NUMA nodes for memory tiering or isolation.
+
numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
NUMA balancing.
Allowed values are enable and disable
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 99d0a9332c14..e4798c43276b 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -33,6 +33,8 @@ static __init int numa_setup(char *opt)
numa_off = 1;
if (!strncmp(opt, "fake=", 5))
return numa_emu_cmdline(opt + 5);
+ if (!strncmp(opt, "standby=", 8))
+ return numa_standby_cmdline(opt + 8);
if (!strncmp(opt, "noacpi", 6))
disable_srat();
if (!strncmp(opt, "nohmat", 6))
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index c99f2ab105e5..8526be1da69a 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -28,6 +28,8 @@ static __init int numa_parse_early_param(char *opt)
numa_off = true;
if (!strncmp(opt, "fake=", 5))
return numa_emu_cmdline(opt + 5);
+ if (!strncmp(opt, "standby=", 8))
+ return numa_standby_cmdline(opt + 8);
return 0;
}
diff --git a/include/linux/numa.h b/include/linux/numa.h
index e6baaf6051bc..4621af407ec6 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -43,6 +43,11 @@ int phys_to_target_node(u64 start);
int numa_fill_memblks(u64 start, u64 end);
+void __init numa_add_standby_node(int node);
+void __init numa_commit_standby_nodes(void);
+int numa_request_exclusive_node(void);
+void numa_release_exclusive_node(int node);
+
#else /* !CONFIG_NUMA */
static inline int numa_nearest_node(int node, unsigned int state)
{
@@ -64,6 +69,15 @@ static inline int phys_to_target_node(u64 start)
}
static inline void alloc_offline_node_data(int nid) {}
+
+static inline void numa_add_standby_node(int node) { }
+static inline void numa_commit_standby_nodes(void) { }
+static inline int numa_request_exclusive_node(void)
+{
+ return NUMA_NO_NODE;
+}
+static inline void numa_release_exclusive_node(int node) { }
+
#endif
#define numa_map_to_online_node(node) numa_nearest_node(node, N_ONLINE)
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 991076cba7c5..7d7b8307e267 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -32,6 +32,9 @@ int __init numa_memblks_init(int (*init_func)(void),
extern int numa_distance_cnt;
+int __init numa_standby_cmdline(char *str);
+void __init numa_request_standby_count(int n);
+
#ifdef CONFIG_NUMA_EMU
extern int emu_nid_to_phys[MAX_NUMNODES];
int numa_emu_cmdline(char *str);
diff --git a/mm/numa.c b/mm/numa.c
index 7d5e06fe5bd4..9806cdf2f998 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -4,6 +4,8 @@
#include <linux/printk.h>
#include <linux/numa.h>
#include <linux/numa_memblks.h>
+#include <linux/spinlock.h>
+#include <linux/export.h>
struct pglist_data *node_data[MAX_NUMNODES];
EXPORT_SYMBOL(node_data);
@@ -59,3 +61,91 @@ int phys_to_target_node(u64 start)
}
EXPORT_SYMBOL_GPL(phys_to_target_node);
#endif
+
+/*
+ * Pool of exclusive NUMA nodes available for runtime claiming.
+ *
+ * Published by numa_commit_standby_nodes() from standby nodes staged
+ * during __init. Protected by exclusive_node_lock at runtime.
+ */
+static nodemask_t exclusive_nodes = NODE_MASK_NONE;
+static DEFINE_SPINLOCK(exclusive_node_lock);
+
+/*
+ * Standby node candidates staged during NUMA init. Committed to the exclusive
+ * pool by numa_commit_standby_nodes() once node_possible_map is finalized.
+ */
+static nodemask_t standby_candidates __initdata;
+
+/**
+ * numa_add_standby_node() - Stage a node as a standby pool candidate
+ * @node: Node ID created as an empty standby node during NUMA init
+ *
+ * Records @node as a candidate for the exclusive pool.
+ * Callers must also add @node to numa_nodes_parsed to mark it possible.
+ */
+void __init numa_add_standby_node(int node)
+{
+ node_set(node, standby_candidates);
+}
+
+/**
+ * numa_commit_standby_nodes() - Publish staged standby nodes to the pool
+ *
+ * Registers the staged candidates that are present in node_possible_map
+ * into the exclusive pool. Restricting to possible nodes keeps the pool a
+ * strict subset of node_possible_map, so a later claim can never return a
+ * node that was dropped (e.g. by a fallback init or NUMA emulation).
+ * Called once node_possible_map is final.
+ */
+void __init numa_commit_standby_nodes(void)
+{
+ nodes_and(exclusive_nodes, standby_candidates, node_possible_map);
+}
+
+/**
+ * numa_request_exclusive_node() - Claim an available exclusive NUMA node
+ *
+ * Exclusive nodes are empty NUMA nodes registered at boot via the standby
+ * node interfaces or standby= boot parameter.
+ *
+ * The caller takes exclusive ownership of the returned node and must
+ * release it with numa_release_exclusive_node() when no longer needed.
+ *
+ * Return: a NUMA node ID on success, %NUMA_NO_NODE if none available.
+ */
+int numa_request_exclusive_node(void)
+{
+ int node;
+
+ spin_lock(&exclusive_node_lock);
+ node = first_node(exclusive_nodes);
+ if (node < MAX_NUMNODES)
+ node_clear(node, exclusive_nodes);
+ else
+ node = NUMA_NO_NODE;
+ spin_unlock(&exclusive_node_lock);
+
+ return node;
+}
+EXPORT_SYMBOL_GPL(numa_request_exclusive_node);
+
+/**
+ * numa_release_exclusive_node() - Release a previously claimed exclusive node
+ * @node: Node ID previously returned by numa_request_exclusive_node()
+ *
+ * Returns the node to the exclusive pool.
+ */
+void numa_release_exclusive_node(int node)
+{
+ if (node == NUMA_NO_NODE)
+ return;
+
+ if (WARN_ON(node >= MAX_NUMNODES))
+ return;
+
+ spin_lock(&exclusive_node_lock);
+ node_set(node, exclusive_nodes);
+ spin_unlock(&exclusive_node_lock);
+}
+EXPORT_SYMBOL_GPL(numa_release_exclusive_node);
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index 3c3c4eac3514..9ba243fd360e 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -6,6 +6,7 @@
#include <linux/memblock.h>
#include <linux/numa.h>
#include <linux/numa_memblks.h>
+#include <linux/topology.h>
#include <asm/numa.h>
@@ -442,6 +443,104 @@ static int __init numa_register_meminfo(struct numa_meminfo *mi)
return 0;
}
+static int numa_standby_nodes __initdata;
+static int numa_acpi_standby_nodes __initdata;
+
+int __init numa_standby_cmdline(char *str)
+{
+ int ret = kstrtoint(str, 0, &numa_standby_nodes);
+
+ if (ret || numa_standby_nodes < 0)
+ return -EINVAL;
+ numa_standby_nodes = min(numa_standby_nodes, 16);
+ return 0;
+}
+
+/**
+ * numa_request_standby_count() - Request standby nodes from NUMA init code
+ * @n: number of standby nodes to reserve
+ *
+ * Accumulated during NUMA init and added to the numa=standby=<N> request.
+ * The nodes are created later, once numa_emulation() has finalized the node
+ * numbering. Init code must add the count here instead of adding the nodes.
+ */
+void __init numa_request_standby_count(int n)
+{
+ numa_acpi_standby_nodes += n;
+}
+
+/**
+ * numa_init_standby_nodes() - Create standby nodes and rebuild distance table
+ *
+ * Called after numa_emulation() has finalized the node numbering.
+ * Creates requested empty standby nodes and rebuilds the NUMA distance
+ * table if it needs to grow to cover nodes added after SLIT parsing.
+ */
+static void __init numa_init_standby_nodes(void)
+{
+ int total = numa_standby_nodes + numa_acpi_standby_nodes;
+ nodemask_t available;
+ int i, j, max_node, old_cnt;
+ u8 *saved_dist = NULL;
+ size_t saved_size;
+ int registered = 0;
+
+ /* Create the requested standby nodes in numa_nodes_parsed */
+ if (total) {
+ nodes_complement(available, numa_nodes_parsed);
+ for (i = 0; i < total; i++) {
+ int node = first_node(available);
+
+ if (node >= MAX_NUMNODES)
+ break;
+ node_clear(node, available);
+ node_set(node, numa_nodes_parsed);
+ numa_add_standby_node(node);
+ pr_info("NUMA: standby node %d reserved\n", node);
+ registered++;
+ }
+ }
+ if (registered != total)
+ pr_warn("NUMA: error registering standby nodes\n");
+
+ /*
+ * If nodes were added after the distance table was allocated,
+ * rebuild the table so all nodes have distance entries.
+ * Standby nodes get REMOTE_DISTANCE by default.
+ */
+ old_cnt = numa_distance_cnt;
+ if (!old_cnt)
+ return;
+
+ max_node = 0;
+ for_each_node_mask(i, numa_nodes_parsed)
+ max_node = i;
+
+ if (max_node < old_cnt)
+ return;
+
+ saved_size = old_cnt * old_cnt * sizeof(u8);
+ saved_dist = memblock_alloc(saved_size, PAGE_SIZE);
+ if (!saved_dist) {
+ pr_warn("NUMA: standby nodes will use default distances\n");
+ return;
+ }
+
+ for (i = 0; i < old_cnt; i++)
+ for (j = 0; j < old_cnt; j++)
+ saved_dist[i * old_cnt + j] = node_distance(i, j);
+
+ /* Reset triggers reallocation on next numa_set_distance() */
+ numa_reset_distance();
+
+ /* Restore - first call reallocates sized for new numa_nodes_parsed */
+ for (i = 0; i < old_cnt; i++)
+ for (j = 0; j < old_cnt; j++)
+ numa_set_distance(i, j, saved_dist[i * old_cnt + j]);
+
+ memblock_free(saved_dist, saved_size);
+}
+
int __init numa_memblks_init(int (*init_func)(void),
bool memblock_force_top_down)
{
@@ -451,6 +550,7 @@ int __init numa_memblks_init(int (*init_func)(void),
nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
nodes_clear(node_online_map);
+ numa_acpi_standby_nodes = 0;
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, max_addr, &memblock.memory, NUMA_NO_NODE));
WARN_ON(memblock_set_node(0, max_addr, &memblock.reserved,
@@ -479,8 +579,15 @@ int __init numa_memblks_init(int (*init_func)(void),
return ret;
numa_emulation(&numa_meminfo, numa_distance_cnt);
+ numa_init_standby_nodes();
+
+ ret = numa_register_meminfo(&numa_meminfo);
+ if (ret < 0)
+ return ret;
- return numa_register_meminfo(&numa_meminfo);
+ /* node_possible_map is final; publish standby nodes to the pool. */
+ numa_commit_standby_nodes();
+ return 0;
}
static int __init cmp_memblk(const void *a, const void *b)
@@ -567,6 +674,15 @@ static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
return NUMA_NO_NODE;
}
+/*
+ * These interfaces should only be used to acquire information about statically
+ * configured memory associations made at __init time.
+ *
+ * This interface should not be used to determine the node a struct page/folio
+ * lives in, as it is possible for memory hotplug to place those pages in
+ * different nodes than reported by this function.
+ */
+
int phys_to_target_node(u64 start)
{
int nid = meminfo_to_nid(&numa_meminfo, start);
--
2.54.0
* [RFC PATCH 0/3] mm/numa: reserve standby NUMA nodes for runtime claiming
From: Gregory Price @ 2026-06-10 1:45 UTC (permalink / raw)
To: linux-mm
Cc: x86, linux-doc, linux-kernel, linux-acpi, driver-core,
kernel-team, corbet, skhan, dave.hansen, luto, peterz, tglx,
mingo, bp, hpa, rafael, lenb, gregkh, dakr, akpm, rppt, rdunlap,
feng.tang, dapeng1.mi, elver, kuba, ebiggers, lirongqing, paulmck,
gourry, dave.jiang, jic23, xueshuai, kai.huang
A NUMA node must be "possible" at __init time to be usable later; a node
that is not described at boot cannot be brought online afterwards.
For memory tiering or isolation it is sometimes desirable to spread
hotplug memory (CXL, GPU, virtio-mem, ...) across more nodes than
firmware describes. Additionally, some memory devices may provide
more than a single class of memory and need flexibility to redefine
the effective topology at runtime instead of depending on BIOS.
This series adds a way to reserve empty "standby" NUMA nodes at boot
so drivers can place hotplugged memory on distinct nodes later, at
runtime, without those nodes being described by BIOS.
Using the feature
=================
A standby node is an empty, offline-but-possible NUMA node: at boot it
has no memory and no CPUs. A driver claims one at runtime, brings
memory online on it, and releases it when done.
This series adds 3 ways to reserve standby nodes.
- numa=standby=N
Boot parameter. Reserve N extra empty nodes. Platform
independent; works with or without ACPI.
- CONFIG_ACPI_NUMA_STANDBY_NODES=N
Reserve N extra empty nodes on ACPI systems (honoured only when
firmware produces a usable NUMA configuration).
- CONFIG_ACPI_NUMA_ADD_CFMWS_NODES=K
Reserve K extra empty nodes per CXL Fixed Memory Window (CEDT
CFMWS), for CXL topologies that want several nodes behind one
window.
All three default to off (0 / unset).
Reserved nodes show up in /sys/devices/system/node/possible but not
.../online until a driver claims one and onlines memory on it.
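The claim/release round trip can be modelled in userspace C roughly as below. This is only a sketch: the model_* names, MODEL_MAX_NODES, MODEL_NO_NODE, and the 64-bit mask are assumptions made for the example and are not part of the series, and the spinlock that protects the real pool is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_NODES 64	/* hypothetical stand-in for MAX_NUMNODES */
#define MODEL_NO_NODE   (-1)	/* stands in for NUMA_NO_NODE */

/* One bit per standby node that is still available for claiming. */
static uint64_t model_pool;

/* Publish only the staged nodes that are also possible. */
static void model_commit(uint64_t staged, uint64_t possible)
{
	model_pool = staged & possible;
}

/* Claim the lowest-numbered free node, or MODEL_NO_NODE if none is left. */
static int model_request_node(void)
{
	for (int node = 0; node < MODEL_MAX_NODES; node++) {
		uint64_t bit = UINT64_C(1) << node;

		if (model_pool & bit) {
			model_pool &= ~bit;
			return node;
		}
	}
	return MODEL_NO_NODE;
}

/* Return a previously claimed node to the pool; bad ids are ignored. */
static void model_release_node(int node)
{
	if (node < 0 || node >= MODEL_MAX_NODES)
		return;
	model_pool |= UINT64_C(1) << node;
}
```

A caller would check the result against MODEL_NO_NODE before using it, in the same way a driver is expected to check for NUMA_NO_NODE when the pool is empty.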
Testing
=======
Built and booted under QEMU (virtme-ng) across a matrix of boot
parameters and topologies:
- Each reservation source, individually and combined: reserved nodes
appear as possible-but-offline with no memory, claim/release
round-trips correctly, and node distances are sane.
The CFMWS path was exercised with an emulated CXL Type-3 device
presenting a CEDT/CFMWS.
- Fallback: when ACPI NUMA init does not produce a usable config,
no standby nodes are reserved.
- NUMA emulation (numa=fake): renumbers the node space.
Standby nodes are created only after the (possibly emulated)
topology is final, so their ids can never alias emulated nodes.
numa=fake boots cleanly with the feature enabled and behaves
identically to a baseline kernel without this series.
Tested with CONFIG_NUMA_EMU both enabled and disabled, and with
and without numa=fake on the command line.
- Default-off builds behave identically to a baseline kernel.
Gregory Price (3):
mm/numa: add exclusive node pool and numa=standby boot parameter
acpi/numa: add CONFIG_ACPI_NUMA_STANDBY_NODES
acpi/numa: add CONFIG_ACPI_NUMA_ADD_CFMWS_NODES
.../admin-guide/kernel-parameters.txt | 8 ++
arch/x86/mm/numa.c | 2 +
drivers/acpi/numa/Kconfig | 35 ++++++
drivers/acpi/numa/srat.c | 14 ++-
drivers/base/arch_numa.c | 2 +
include/linux/numa.h | 14 +++
include/linux/numa_memblks.h | 3 +
mm/numa.c | 90 +++++++++++++
mm/numa_memblks.c | 118 +++++++++++++++++-
9 files changed, 284 insertions(+), 2 deletions(-)
--
2.54.0
* Re: [PATCH v4 1/6] alloc_tag: add ioctl to /proc/allocinfo
From: Hao Ge @ 2026-06-10 1:28 UTC (permalink / raw)
To: Abhishek Bapat, Suren Baghdasaryan, Andrew Morton,
Kent Overstreet
Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
Sourav Panda
In-Reply-To: <b58a0d01c8bb6d7c1d2350599c1b0170be161489.1781042698.git.abhishekbapat@google.com>
On 2026/6/10 08:12, Abhishek Bapat wrote:
> From: Suren Baghdasaryan <surenb@google.com>
>
> Add the following ioctl commands for /proc/allocinfo file:
>
> ALLOCINFO_IOC_CONTENT_ID - gets content identifier which can be used
> to check whether the file content has changed specifically due to module
> load/unload. Every time a module is loaded / unloaded, the returned
> value will be different. By comparing the identifier value at the
> beginning and at the end of the content retrieval operation, users can
> validate retrieved information for consistency.
>
> ALLOCINFO_IOC_GET_AT - gets the record at the specified position. This
> is the position of a record in /proc/allocinfo.
>
> ALLOCINFO_IOC_GET_NEXT - gets the record next to the last retrieved
> one. If no records were previously retrieved, returns the first
> record.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
Acked-by: Hao Ge <hao.ge@linux.dev>
> ---
> Documentation/mm/allocation-profiling.rst | 5 +
> .../userspace-api/ioctl/ioctl-number.rst | 2 +
> MAINTAINERS | 1 +
> include/linux/codetag.h | 2 +
> include/uapi/linux/alloc_tag.h | 60 +++++
> lib/alloc_tag.c | 232 +++++++++++++++++-
> lib/codetag.c | 18 ++
> 7 files changed, 318 insertions(+), 2 deletions(-)
> create mode 100644 include/uapi/linux/alloc_tag.h
>
> diff --git a/Documentation/mm/allocation-profiling.rst b/Documentation/mm/allocation-profiling.rst
> index 5389d241176a..c3a28467955f 100644
> --- a/Documentation/mm/allocation-profiling.rst
> +++ b/Documentation/mm/allocation-profiling.rst
> @@ -46,6 +46,11 @@ sysctl:
> Runtime info:
> /proc/allocinfo
>
> + Profiling data can be retrieved either by reading `/proc/allocinfo` directly as
> + text or programmatically via `ioctl()` calls defined in `<uapi/linux/alloc_tag.h>`.
> + The ioctl interface supports structured binary data extraction as well as filtering
> + by module name, function, file, line number, accuracy, or allocation size limits.
> +
> Example output::
>
> root@moria-kvm:~# sort -g /proc/allocinfo|tail|numfmt --to=iec
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 331223761fff..84f6808a8578 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -349,6 +349,8 @@ Code Seq# Include File Comments
> <mailto:luzmaximilian@gmail.com>
> 0xA5 20-2F linux/surface_aggregator/dtx.h Microsoft Surface DTX driver
> <mailto:luzmaximilian@gmail.com>
> +0xA6 00-0F uapi/linux/alloc_tag.h Memory allocation profiling
> + <mailto:surenb@google.com>
> 0xAA 00-3F linux/uapi/linux/userfaultfd.h
> 0xAB 00-1F linux/nbd.h
> 0xAC 00-1F linux/raw.h
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 65bd4328fe05..019cc4c285a3 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16713,6 +16713,7 @@ S: Maintained
> F: Documentation/mm/allocation-profiling.rst
> F: include/linux/alloc_tag.h
> F: include/linux/pgalloc_tag.h
> +F: include/uapi/linux/alloc_tag.h
> F: lib/alloc_tag.c
>
> MEMORY CONTROLLER DRIVERS
> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> index ddae7484ca45..a25a085c2df1 100644
> --- a/include/linux/codetag.h
> +++ b/include/linux/codetag.h
> @@ -77,6 +77,8 @@ struct codetag_iterator {
> void codetag_lock_module_list(struct codetag_type *cttype);
> bool codetag_trylock_module_list(struct codetag_type *cttype);
> void codetag_unlock_module_list(struct codetag_type *cttype);
> +unsigned long codetag_get_content_id(struct codetag_type *cttype);
> +unsigned int codetag_get_count(struct codetag_type *cttype);
> struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
> struct codetag *codetag_next_ct(struct codetag_iterator *iter);
>
> diff --git a/include/uapi/linux/alloc_tag.h b/include/uapi/linux/alloc_tag.h
> new file mode 100644
> index 000000000000..0928e1a48d49
> --- /dev/null
> +++ b/include/uapi/linux/alloc_tag.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * alloc_tag IOCTL API definition
> + *
> + * Copyright (C) 2026 Google, LLC. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef _UAPI_ALLOC_TAG_H
> +#define _UAPI_ALLOC_TAG_H
> +
> +#include <linux/types.h>
> +
> +#define ALLOCINFO_STR_SIZE 64
> +
> +struct allocinfo_content_id {
> + __u64 id;
> +};
> +
> +struct allocinfo_tag {
> + /* Longer names are trimmed */
> + char modname[ALLOCINFO_STR_SIZE];
> + char function[ALLOCINFO_STR_SIZE];
> + char filename[ALLOCINFO_STR_SIZE];
> + __u64 lineno;
> +};
> +
> +/* The alignment ensures 32-bit compatible interfaces are not broken */
> +struct allocinfo_counter {
> + __u64 bytes;
> + __u64 calls;
> + __u8 accurate;
> +} __attribute__((aligned(8)));
> +
> +struct allocinfo_tag_data {
> + struct allocinfo_tag tag;
> + struct allocinfo_counter counter;
> +};
> +
> +struct allocinfo_get_at {
> + __u64 pos; /* input */
> + struct allocinfo_tag_data data;
> +};
> +
> +#define _ALLOCINFO_IOC_CONTENT_ID 0
> +#define _ALLOCINFO_IOC_GET_AT 1
> +#define _ALLOCINFO_IOC_GET_NEXT 2
> +
> +#define ALLOCINFO_IOC_BASE 0xA6
> +#define ALLOCINFO_IOC_CONTENT_ID _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_CONTENT_ID, \
> + struct allocinfo_content_id)
> +#define ALLOCINFO_IOC_GET_AT _IOWR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_AT, \
> + struct allocinfo_get_at)
> +#define ALLOCINFO_IOC_GET_NEXT _IOR(ALLOCINFO_IOC_BASE, _ALLOCINFO_IOC_GET_NEXT, \
> + struct allocinfo_tag_data)
> +
> +#endif /* _UAPI_ALLOC_TAG_H */
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index d9be1cf5187d..a0577215eb3d 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -5,6 +5,7 @@
> #include <linux/gfp.h>
> #include <linux/kallsyms.h>
> #include <linux/module.h>
> +#include <linux/mutex.h>
> #include <linux/page_ext.h>
> #include <linux/pgalloc_tag.h>
> #include <linux/proc_fs.h>
> @@ -14,6 +15,7 @@
> #include <linux/string_choices.h>
> #include <linux/vmalloc.h>
> #include <linux/kmemleak.h>
> +#include <uapi/linux/alloc_tag.h>
>
> #define ALLOCINFO_FILE_NAME "allocinfo"
> #define MODULE_ALLOC_TAG_VMAP_SIZE (100000UL * sizeof(struct alloc_tag))
> @@ -47,6 +49,10 @@ struct allocinfo_private {
> struct codetag_iterator iter;
> struct codetag_iterator reported_iter;
> bool print_header;
> + /* ioctl uses a separate iterator not to interfere with reads */
> + struct codetag_iterator ioctl_iter;
> + bool positioned; /* seq_open_private() sets to 0 */
> + struct mutex ioctl_lock;
> };
>
> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> @@ -130,6 +136,229 @@ static const struct seq_operations allocinfo_seq_op = {
> .show = allocinfo_show,
> };
>
> +/*
> + * Initializes seq_file operations and allocates private state when opening
> + * the /proc/allocinfo procfs entry.
> + */
> +static int allocinfo_open(struct inode *inode, struct file *file)
> +{
> + int ret;
> +
> + ret = seq_open_private(file, &allocinfo_seq_op,
> + sizeof(struct allocinfo_private));
> + if (!ret) {
> + struct seq_file *m = file->private_data;
> + struct allocinfo_private *priv = m->private;
> +
> + mutex_init(&priv->ioctl_lock);
> + }
> + return ret;
> +}
> +
> +/*
> + * Cleans up the seq_file state and frees up the private state allocated in
> + * allocinfo_open() when closing the /proc/allocinfo file descriptor.
> + */
> +static int allocinfo_release(struct inode *inode, struct file *file)
> +{
> + return seq_release_private(inode, file);
> +}
> +
> +/*
> + * Returns a pointer to the suffix of a string so that its length fits within
> + * ALLOCINFO_STR_SIZE, preserving the trailing characters.
> + */
> +static const char *allocinfo_str(const char *str)
> +{
> + size_t len = strlen(str);
> +
> + /* Keep an extra space for the trailing NULL. */
> + if (len >= ALLOCINFO_STR_SIZE)
> + str += (len - ALLOCINFO_STR_SIZE) + 1;
> + return str;
> +}
> +
> +/* Copy a string and trim from the beginning if it's too long */
> +static void allocinfo_copy_str(char *dest, const char *src)
> +{
> + strscpy_pad(dest, allocinfo_str(src), ALLOCINFO_STR_SIZE);
> +}
> +
> +/*
> + * Populates the UAPI allocinfo_tag_data structure with active runtime
> + * profiling counters extracted from the given kernel codetag.
> + */
> +static void allocinfo_to_params(struct codetag *ct,
> + struct allocinfo_tag_data *data)
> +{
> + struct alloc_tag *tag = ct_to_alloc_tag(ct);
> + struct alloc_tag_counters counter = alloc_tag_read(tag);
> +
> + if (ct->modname)
> + allocinfo_copy_str(data->tag.modname, ct->modname);
> + else
> + data->tag.modname[0] = '\0';
> + allocinfo_copy_str(data->tag.function, ct->function);
> + allocinfo_copy_str(data->tag.filename, ct->filename);
> + data->tag.lineno = ct->lineno;
> + data->counter.bytes = counter.bytes;
> + data->counter.calls = counter.calls;
> + data->counter.accurate = !alloc_tag_is_inaccurate(tag);
> +}
> +
> +/*
> + * Retrieves the unique content ID representing the current allocation tag module
> + * layout, allowing userspace to detect if modules were loaded / unloaded.
> + */
> +static int allocinfo_ioctl_get_content_id(struct seq_file *m, void __user *arg)
> +{
> + struct allocinfo_content_id params;
> +
> + codetag_lock_module_list(alloc_tag_cttype);
> + params.id = codetag_get_content_id(alloc_tag_cttype);
> + codetag_unlock_module_list(alloc_tag_cttype);
> + if (copy_to_user(arg, &params, sizeof(params)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +/*
> + * Seeks the ioctl iterator to the specified 0-indexed tag position, reads its
> + * profiling data and returns it to userspace.
> + */
> +static int allocinfo_ioctl_get_at(struct seq_file *m, void __user *arg)
> +{
> + struct allocinfo_private *priv;
> + struct codetag *ct;
> + __u64 pos;
> + struct allocinfo_get_at params = {0};
> +
> + if (copy_from_user(&params, arg, sizeof(params)))
> + return -EFAULT;
> +
> + priv = m->private;
> + pos = params.pos;
> +
> + mutex_lock(&priv->ioctl_lock);
> + codetag_lock_module_list(alloc_tag_cttype);
> +
> + if (pos >= codetag_get_count(alloc_tag_cttype)) {
> + codetag_unlock_module_list(alloc_tag_cttype);
> + mutex_unlock(&priv->ioctl_lock);
> + return -ENOENT;
> + }
> +
> + /* Find the codetag */
> + priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> + ct = codetag_next_ct(&priv->ioctl_iter);
> + while (ct && pos--)
> + ct = codetag_next_ct(&priv->ioctl_iter);
> + if (ct) {
> + allocinfo_to_params(ct, &params.data);
> + priv->positioned = true;
> + }
> +
> + codetag_unlock_module_list(alloc_tag_cttype);
> + mutex_unlock(&priv->ioctl_lock);
> +
> + if (!ct)
> + return -ENOENT;
> +
> + if (copy_to_user(arg, &params, sizeof(params)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +/*
> + * Advances the ioctl iterator to the next allocation tag in the sequence and
> + * returns its profiling data to userspace.
> + */
> +static int allocinfo_ioctl_get_next(struct seq_file *m, void __user *arg)
> +{
> + struct allocinfo_private *priv;
> + struct codetag *ct;
> + struct allocinfo_tag_data params;
> + int ret = 0;
> +
> + memset(&params, 0, sizeof(params));
> + priv = m->private;
> +
> + mutex_lock(&priv->ioctl_lock);
> + codetag_lock_module_list(alloc_tag_cttype);
> +
> + if (!priv->positioned) {
> + priv->ioctl_iter = codetag_get_ct_iter(alloc_tag_cttype);
> + priv->positioned = true;
> + }
> +
> + ct = codetag_next_ct(&priv->ioctl_iter);
> + if (ct)
> + allocinfo_to_params(ct, &params);
> +
> + if (!ct) {
> + priv->positioned = false;
> + ret = -ENOENT;
> + }
> + codetag_unlock_module_list(alloc_tag_cttype);
> + mutex_unlock(&priv->ioctl_lock);
> +
> + if (ret == 0) {
> + if (copy_to_user(arg, &params, sizeof(params)))
> + return -EFAULT;
> + }
> + return ret;
> +}
> +
> +/*
> + * Entry point ioctl function for /proc/allocinfo routing requests to fetch the
> + * layout content ID, seek to a specific tag, or read sequential tags.
> + */
> +static long allocinfo_ioctl(struct file *file, unsigned int cmd,
> + unsigned long __arg)
> +{
> + void __user *arg = (void __user *)__arg;
> + int ret;
> +
> + switch (cmd) {
> + case ALLOCINFO_IOC_CONTENT_ID:
> + ret = allocinfo_ioctl_get_content_id(file->private_data, arg);
> + break;
> + case ALLOCINFO_IOC_GET_AT:
> + ret = allocinfo_ioctl_get_at(file->private_data, arg);
> + break;
> + case ALLOCINFO_IOC_GET_NEXT:
> + ret = allocinfo_ioctl_get_next(file->private_data, arg);
> + break;
> + default:
> + ret = -ENOIOCTLCMD;
> + break;
> + }
> +
> + return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long allocinfo_compat_ioctl(struct file *file, unsigned int cmd,
> + unsigned long arg)
> +{
> + return allocinfo_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct proc_ops allocinfo_proc_ops = {
> + .proc_open = allocinfo_open,
> + .proc_read_iter = seq_read_iter,
> + .proc_lseek = seq_lseek,
> + .proc_release = allocinfo_release,
> + .proc_ioctl = allocinfo_ioctl,
> +#ifdef CONFIG_COMPAT
> + .proc_compat_ioctl = allocinfo_compat_ioctl,
> +#endif
> +
> +};
> +
> size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sleep)
> {
> struct codetag_iterator iter;
> @@ -993,8 +1222,7 @@ static int __init alloc_tag_init(void)
> return 0;
> }
>
> - if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> - sizeof(struct allocinfo_private), NULL)) {
> + if (!proc_create(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_proc_ops)) {
> pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> shutdown_mem_profiling(false);
> return -ENOMEM;
> diff --git a/lib/codetag.c b/lib/codetag.c
> index 4001a7ea6675..a9cda4c962a3 100644
> --- a/lib/codetag.c
> +++ b/lib/codetag.c
> @@ -19,6 +19,8 @@ struct codetag_type {
> struct codetag_type_desc desc;
> /* generates unique sequence number for module load */
> unsigned long next_mod_seq;
> + /* bumped on every module load and unload */
> + unsigned long content_id;
> };
>
> struct codetag_range {
> @@ -50,6 +52,20 @@ void codetag_unlock_module_list(struct codetag_type *cttype)
> up_read(&cttype->mod_lock);
> }
>
> +unsigned long codetag_get_content_id(struct codetag_type *cttype)
> +{
> + lockdep_assert_held(&cttype->mod_lock);
> +
> + return cttype->content_id;
> +}
> +
> +unsigned int codetag_get_count(struct codetag_type *cttype)
> +{
> + lockdep_assert_held(&cttype->mod_lock);
> +
> + return cttype->count;
> +}
> +
> struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> {
> struct codetag_iterator iter = {
> @@ -204,6 +220,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
>
> down_write(&cttype->mod_lock);
> cmod->mod_seq = ++cttype->next_mod_seq;
> + ++cttype->content_id;
> mod_id = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> if (mod_id >= 0) {
> if (cttype->desc.module_load) {
> @@ -368,6 +385,7 @@ void codetag_unload_module(struct module *mod)
> cttype->count -= range_size(cttype, &cmod->range);
> idr_remove(&cttype->mod_idr, mod_id);
> kfree(cmod);
> + ++cttype->content_id;
> }
> up_write(&cttype->mod_lock);
> if (found && cttype->desc.free_section_mem)
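For readers following the string handling in allocinfo_str() and
allocinfo_copy_str() above, here is a minimal userspace sketch of the same
"keep the tail of the string" idea. It is an illustration only, not the
patch's code: SKETCH_STR_SIZE and the sketch_* names are made up for this
example, and memset()/memcpy() stand in for the kernel's strscpy_pad().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer size; the patch uses ALLOCINFO_STR_SIZE instead. */
#define SKETCH_STR_SIZE 8

/* Return the suffix of str that fits in SKETCH_STR_SIZE bytes, NUL included. */
static const char *sketch_suffix(const char *str)
{
	size_t len;

	assert(str != NULL);
	len = strlen(str);
	/* Skip leading characters, keeping one byte for the terminating NUL. */
	if (len >= SKETCH_STR_SIZE)
		str += (len - SKETCH_STR_SIZE) + 1;
	return str;
}

/* Copy the fitting suffix into dest and zero-fill the rest of the buffer. */
static void sketch_copy_str(char dest[SKETCH_STR_SIZE], const char *src)
{
	const char *tail = sketch_suffix(src);

	memset(dest, 0, SKETCH_STR_SIZE);
	memcpy(dest, tail, strlen(tail));
}
```

With an 8-byte buffer, a 9-character path such as "mm/slub.c" loses its
first two characters and is stored as "/slub.c", while a short name is
copied unchanged. Trimming from the front keeps the most specific part of
a file or function name.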
^ permalink raw reply
* Re: [PATCH v4 0/2] docs/mm/damon: fix docs and update zh_CN
From: Dongliang Mu @ 2026-06-10 1:15 UTC (permalink / raw)
To: SeongJae Park, Doehyun Baek
Cc: Jonathan Corbet, Shuah Khan, Alex Shi, Yanteng Si, Hu Haowen,
linux-doc, linux-kernel, damon
In-Reply-To: <20260609235556.73472-1-sj@kernel.org>
On 6/10/26 7:55 AM, SeongJae Park wrote:
> Hello Doehyun,
>
> On Tue, 9 Jun 2026 14:34:24 +0000 Doehyun Baek <doehyunbaek@gmail.com> wrote:
>
>> First of all, thank you very much, Dongliang, for your time and
>> dedication in reviewing the previous versions.
>>
>> This v4 sends the original English DAMON documentation fixes as the
>> first patch, and the Simplified Chinese translation update as the
>> second patch.
Hi Doehyun,
I think my earlier message may have been unclear, so I’d like to clarify
the points below.
First, as suggested by SeongJae Park, please submit a standalone patch
to the mm-new or linux-doc tree to correct these typos.
Once this typo-fix patch is merged, you may proceed to submit the
Chinese translation patches.
Additionally, please split patch 2/2 into two separate patches, with
each patch covering changes to only one file.
Dongliang Mu
>>
>> For zh_CN, I translated the current DAMON usage.rst paragraph by
>> paragraph, and added missing pieces such as stat.rst and the related
>> index/design references. The zh_TW changes from earlier versions are
>> dropped from this series.
> Thank you for sharing this patch series! However, to my understanding, the
> path to the mainline for English documents and Chinese documents are different.
> Sending patches for English document and Chinese document as one series is
> therefore making it complicated, in my opinion. Could you please rebase
> English document part to mm-new [1] and send as a separate patch?
Totally agree. Please follow this suggestion.
>
> [1] https://origin.kernel.org/doc/html/latest/mm/damon/maintainer-profile.html#scm-trees
>
>
> Thanks,
> SJ
>
> [...]
^ permalink raw reply
* [PATCH] hwmon: (pmbus/max34440): add support adpm12250
From: Alexis Czezar Torreno @ 2026-06-10 1:12 UTC (permalink / raw)
To: Guenter Roeck, Jonathan Corbet, Shuah Khan
Cc: linux-hwmon, linux-doc, linux-kernel, Alexis Czezar Torreno
ADPM12250 is a quarter brick DC/DC Power Module. It is a high-power,
non-isolated converter capable of delivering a regulated 12V at a
continuous power level of 2500W. It uses PMBus.
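The driver table in this patch describes every sensor class in the PMBus
"direct" format with m, b and R coefficients. As a rough sketch of how such
coefficients map a raw register value back to a real-world value, assuming
the standard PMBus relation Y = (m * X + b) * 10^R, something like the
following could be used. The struct and function names are hypothetical,
and this is not how the pmbus core implements the conversion.

```c
#include <assert.h>

/* Coefficients of the PMBus "direct" format: Y = (m * X + b) * 10^R. */
struct direct_coeff {
	int m;
	int b;
	int R;
};

/* Convert a raw register value Y into the real-world value X. */
static double direct_to_real(long raw, struct direct_coeff c)
{
	double y = (double)raw;
	int r;

	assert(c.m != 0);
	/* Undo the 10^R scaling: multiply by 10 for R < 0, divide for R > 0. */
	for (r = c.R; r < 0; r++)
		y *= 10.0;
	for (r = c.R; r > 0; r--)
		y /= 10.0;
	return (y - c.b) / c.m;
}
```

Under that assumption, the voltage coefficients (m = 125, b = 0, R = 0) turn
a raw reading of 1500 into 12, and the current coefficients (m = 250, b = 0,
R = -1) turn a raw reading of 250 into 10. The raw readings are made-up
sample values, not measurements from the device.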
Signed-off-by: Alexis Czezar Torreno <alexisczezar.torreno@analog.com>
---
Documentation/hwmon/max34440.rst | 27 ++++++++++++++++--------
drivers/hwmon/pmbus/max34440.c | 45 +++++++++++++++++++++++++++++++++++++---
2 files changed, 60 insertions(+), 12 deletions(-)
diff --git a/Documentation/hwmon/max34440.rst b/Documentation/hwmon/max34440.rst
index d6d4fbc863d96c1008a1971d3e3245d9ce1ef688..e7421f4dbf38fc1436bbaeba71d4461a00f8cefb 100644
--- a/Documentation/hwmon/max34440.rst
+++ b/Documentation/hwmon/max34440.rst
@@ -19,6 +19,14 @@ Supported chips:
Datasheet: -
+ * ADI ADPM12250
+
+ Prefixes: 'adpm12250'
+
+ Addresses scanned: -
+
+ Datasheet: -
+
* Maxim MAX34440
Prefixes: 'max34440'
@@ -87,11 +95,11 @@ This driver supports multiple devices: hardware monitoring for Maxim MAX34440
PMBus 6-Channel Power-Supply Manager, MAX34441 PMBus 5-Channel Power-Supply
Manager and Intelligent Fan Controller, and MAX34446 PMBus Power-Supply Data
Logger; PMBus Voltage Monitor and Sequencers for MAX34451, MAX34460, and
-MAX34461; PMBus DC/DC Power Module ADPM12160, and ADPM12200. The MAX34451
-supports monitoring voltage or current of 12 channels based on GIN pins. The
-MAX34460 supports 12 voltage channels, and the MAX34461 supports 16 voltage
-channels. The ADPM12160, and ADPM12200 also monitors both input and output
-of voltage and current.
+MAX34461; PMBus DC/DC Power Module ADPM12160, ADPM12200, and ADPM12250. The
+MAX34451 supports monitoring voltage or current of 12 channels based on GIN
+pins. The MAX34460 supports 12 voltage channels, and the MAX34461 supports 16
+voltage channels. The ADPM12160, ADPM12200, and ADPM12250 also monitors both
+input and output of voltage and current.
The driver is a client driver to the core PMBus driver. Please see
Documentation/hwmon/pmbus.rst for details on PMBus client drivers.
@@ -149,7 +157,7 @@ in[1-6]_reset_history Write any value to reset history.
.. note::
- MAX34446 only supports in[1-4].
- - ADPM12160, and ADPM12200 only supports in[1-2]. Label is "vin1"
+ - ADPM12160, ADPM12200, and ADPM12250 only supports in[1-2]. Label is "vin1"
and "vout1" respectively.
Curr
@@ -172,8 +180,9 @@ curr[1-6]_reset_history Write any value to reset history.
- in6 and curr6 attributes only exist for MAX34440.
- MAX34446 only supports curr[1-4].
- - For ADPM12160, and ADPM12200, curr[1] is "iin1" and curr[2-6]
- are "iout[1-5]".
+ - For ADPM12160, ADPM12200, and ADPM12250, curr[1] is "iin1"
+ - For ADPM12160, and ADPM12200 curr[2-6] are "iout[1-5]".
+ - For ADPM12250, curr[2-4] are "iout[1-3]".
Power
~~~~~
@@ -209,7 +218,7 @@ temp[1-8]_reset_history Write any value to reset history.
.. note::
- temp7 and temp8 attributes only exist for MAX34440.
- MAX34446 only supports temp[1-3].
- - ADPM12160, and ADPM12200 only supports temp[1].
+ - ADPM12160, ADPM12200, and ADPM12250 only supports temp[1].
.. note::
diff --git a/drivers/hwmon/pmbus/max34440.c b/drivers/hwmon/pmbus/max34440.c
index 4525b9fc56267479534251a1444aa09181615ac6..74876d2207fbe4014b8b54a9fd9682370fc3bbed 100644
--- a/drivers/hwmon/pmbus/max34440.c
+++ b/drivers/hwmon/pmbus/max34440.c
@@ -18,6 +18,7 @@
enum chips {
adpm12160,
adpm12200,
+ adpm12250,
max34440,
max34441,
max34446,
@@ -97,7 +98,8 @@ static int max34440_read_word_data(struct i2c_client *client, int page,
break;
case PMBUS_VIRT_READ_IOUT_AVG:
if (data->id != max34446 && data->id != max34451 &&
- data->id != adpm12160 && data->id != adpm12200)
+ data->id != adpm12160 && data->id != adpm12200 &&
+ data->id != adpm12250)
return -ENXIO;
ret = pmbus_read_word_data(client, page, phase,
MAX34446_MFR_IOUT_AVG);
@@ -182,7 +184,8 @@ static int max34440_write_word_data(struct i2c_client *client, int page,
ret = pmbus_write_word_data(client, page,
MAX34440_MFR_IOUT_PEAK, 0);
if (!ret && (data->id == max34446 || data->id == max34451 ||
- data->id == adpm12160 || data->id == adpm12200))
+ data->id == adpm12160 || data->id == adpm12200 ||
+ data->id == adpm12250))
ret = pmbus_write_word_data(client, page,
MAX34446_MFR_IOUT_AVG, 0);
@@ -399,6 +402,40 @@ static struct pmbus_driver_info max34440_info[] = {
.read_word_data = max34440_read_word_data,
.write_word_data = max34440_write_word_data,
},
+ [adpm12250] = {
+ .pages = 19,
+ .format[PSC_VOLTAGE_IN] = direct,
+ .format[PSC_VOLTAGE_OUT] = direct,
+ .format[PSC_CURRENT_IN] = direct,
+ .format[PSC_CURRENT_OUT] = direct,
+ .format[PSC_TEMPERATURE] = direct,
+ .m[PSC_VOLTAGE_IN] = 125,
+ .b[PSC_VOLTAGE_IN] = 0,
+ .R[PSC_VOLTAGE_IN] = 0,
+ .m[PSC_VOLTAGE_OUT] = 125,
+ .b[PSC_VOLTAGE_OUT] = 0,
+ .R[PSC_VOLTAGE_OUT] = 0,
+ .m[PSC_CURRENT_IN] = 250,
+ .b[PSC_CURRENT_IN] = 0,
+ .R[PSC_CURRENT_IN] = -1,
+ .m[PSC_CURRENT_OUT] = 250,
+ .b[PSC_CURRENT_OUT] = 0,
+ .R[PSC_CURRENT_OUT] = -1,
+ .m[PSC_TEMPERATURE] = 1,
+ .b[PSC_TEMPERATURE] = 0,
+ .R[PSC_TEMPERATURE] = 2,
+ /* absent func below [18] are not for monitoring */
+ .func[2] = PMBUS_HAVE_VOUT | PMBUS_HAVE_STATUS_VOUT,
+ .func[4] = PMBUS_HAVE_STATUS_IOUT,
+ .func[5] = PMBUS_HAVE_IOUT | PMBUS_HAVE_STATUS_IOUT,
+ .func[6] = PMBUS_HAVE_IOUT | PMBUS_HAVE_STATUS_IOUT,
+ .func[9] = PMBUS_HAVE_VIN | PMBUS_HAVE_STATUS_INPUT,
+ .func[10] = PMBUS_HAVE_IIN | PMBUS_HAVE_STATUS_INPUT,
+ .func[14] = PMBUS_HAVE_IOUT,
+ .func[18] = PMBUS_HAVE_TEMP | PMBUS_HAVE_STATUS_TEMP,
+ .read_word_data = max34440_read_word_data,
+ .write_word_data = max34440_write_word_data,
+ },
[max34440] = {
.pages = 14,
.format[PSC_VOLTAGE_IN] = direct,
@@ -635,7 +672,8 @@ static int max34440_probe(struct i2c_client *client)
rv = max34451_set_supported_funcs(client, data);
if (rv)
return rv;
- } else if (data->id == adpm12160 || data->id == adpm12200) {
+ } else if (data->id == adpm12160 || data->id == adpm12200 ||
+ data->id == adpm12250) {
data->iout_oc_fault_limit = PMBUS_IOUT_OC_FAULT_LIMIT;
data->iout_oc_warn_limit = PMBUS_IOUT_OC_WARN_LIMIT;
}
@@ -646,6 +684,7 @@ static int max34440_probe(struct i2c_client *client)
static const struct i2c_device_id max34440_id[] = {
{ .name = "adpm12160", .driver_data = adpm12160 },
{ .name = "adpm12200", .driver_data = adpm12200 },
+ { .name = "adpm12250", .driver_data = adpm12250 },
{ .name = "max34440", .driver_data = max34440 },
{ .name = "max34441", .driver_data = max34441 },
{ .name = "max34446", .driver_data = max34446 },
---
base-commit: 1723bc01ecc7ca2f30272685121314379ba5eb18
change-id: 20260610-dev-adpm12250-4ce6fc8c82ac
Best regards,
--
Alexis Czezar Torreno <alexisczezar.torreno@analog.com>
^ permalink raw reply related
* Re: [PATCH v3 2/3] Documentation: security-bugs: explain what is and is not a security bug
From: Askar Safin @ 2026-06-10 1:03 UTC (permalink / raw)
To: Greg KH
Cc: w, corbet, leon, linux-doc, linux-kernel, security, skhan,
workflows
In-Reply-To: <2026060955-zesty-cucumber-1a49@gregkh>
Thank you for the answer!
On Tue, Jun 9, 2026 at 11:44 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > - If unprivileged user prevents privileged user from suspending
> > system, is this security bug?
>
> Physical access of suspending a machine feels like an odd threat model
> to be worried about :)
I think I was not clear here. I meant the following situation:
an unprivileged user without physical access is somehow able
to prevent a privileged user with physical access from suspending
or hibernating the system.
--
Askar Safin
^ permalink raw reply
* [RFC PATCH v2 7/7] tracing/probes: Add a new testcase for BTF typecasts
From: Masami Hiramatsu (Google) @ 2026-06-10 0:52 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
With the introduction of container_of-style BTF typecasting and
per-CPU variable access support in trace probes, we need a way to
verify their functionality and prevent regressions.
Add a new ftrace kselftest and update the trace event sample module
to test and validate these features.
Specifically, update the trace-events-sample module to set up a
periodic timer whose callback accesses a per-CPU counter. Introduce
a new sample trace event, foo_timer_fn, to trace this callback
and log the current counter value.
Then, add a new test case, btf_probe_event.tc, which defines a
dynamic probe on the timer callback. The probe uses BTF typecasting
to recover the parent structure from the timer argument and
this_cpu_read() to fetch the per-CPU counter. The test verifies
the integrity of the implementation by ensuring the values
recorded by the dynamic probe match those from the static tracepoint.
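For reference, the container_of-style cast that the probe performs on the
timer argument can be modeled in plain userspace C as below. This is only a
sketch under simplifying assumptions: the sketch_* names and the cut-down
struct layout are hypothetical, the counter is a plain int rather than a
per-CPU pointer, and the macro omits the type checking done by the kernel's
container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(), without type checks. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-in for struct timer_list. */
struct sketch_timer {
	unsigned long expires;
};

/* Parent structure that embeds the timer, loosely modeled on foo_timer_data. */
struct sketch_timer_data {
	const char *name;
	struct sketch_timer timer;
	int counter;
};

/* Recover the parent structure from a pointer to its embedded timer. */
static struct sketch_timer_data *sketch_from_timer(struct sketch_timer *t)
{
	assert(t != NULL);
	return sketch_container_of(t, struct sketch_timer_data, timer);
}
```

Given only a pointer to the embedded timer, sketch_from_timer() subtracts
the member offset to get back to the enclosing structure, which is what the
(foo_timer_data,timer)t cast in the test case asks the probe to do.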
Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v2:
- Use timer_shutdown_sync() instead of timer_delete_sync() for teardown.
---
samples/trace_events/trace-events-sample.c | 40 +++++++++++++++-
samples/trace_events/trace-events-sample.h | 34 ++++++++++++-
.../ftrace/test.d/dynevent/btf_probe_event.tc | 51 ++++++++++++++++++++
3 files changed, 120 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index b61766864b54..2a1f73533a38 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -93,6 +93,20 @@ static int simple_thread_fn(void *arg)
static DEFINE_MUTEX(thread_mutex);
+static struct foo_timer_data *foo_timer_data;
+
+static void sample_timer_cb(struct timer_list *t)
+{
+ struct foo_timer_data *data = container_of(t, struct foo_timer_data, timer);
+
+ get_cpu();
+ trace_foo_timer_fn(data);
+ (*this_cpu_ptr(data->counter))++;
+ put_cpu();
+
+ mod_timer(t, jiffies + HZ);
+}
+
int foo_bar_reg(void)
{
mutex_lock(&thread_mutex);
@@ -124,9 +138,27 @@ void foo_bar_unreg(void)
static int __init trace_event_init(void)
{
+ foo_timer_data = kzalloc_obj(*foo_timer_data, GFP_KERNEL);
+ if (!foo_timer_data)
+ return -ENOMEM;
+
+ foo_timer_data->name = "sample_timer_counter";
+ foo_timer_data->counter = alloc_percpu(int);
+ if (!foo_timer_data->counter) {
+ kfree(foo_timer_data);
+ return -ENOMEM;
+ }
+
+ timer_setup(&foo_timer_data->timer, sample_timer_cb, 0);
+ mod_timer(&foo_timer_data->timer, jiffies + HZ);
+
simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
- if (IS_ERR(simple_tsk))
- return -1;
+ if (IS_ERR(simple_tsk)) {
+ timer_shutdown_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
+ return PTR_ERR(simple_tsk);
+ }
return 0;
}
@@ -139,6 +171,10 @@ static void __exit trace_event_exit(void)
kthread_stop(simple_tsk_fn);
simple_tsk_fn = NULL;
mutex_unlock(&thread_mutex);
+
+ timer_shutdown_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
}
module_init(trace_event_init);
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a05fc153353..816848a456a2 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -247,12 +247,14 @@
*/
/*
- * It is OK to have helper functions in the file, but they need to be protected
- * from being defined more than once. Remember, this file gets included more
- * than once.
+ * It is OK to have helper functions and data structures in the file, but they
+ * need to be protected from being defined more than once. Remember, this file
+ * gets included more than once.
*/
#ifndef __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
#define __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
+#include <linux/timer.h>
+
static inline int __length_of(const int *list)
{
int i;
@@ -270,6 +272,13 @@ enum {
TRACE_SAMPLE_BAR = 4,
TRACE_SAMPLE_ZOO = 8,
};
+
+struct foo_timer_data {
+ const char *name;
+ struct timer_list timer;
+ int __percpu *counter;
+};
+
#endif
/*
@@ -595,6 +604,25 @@ TRACE_EVENT(foo_rel_loc,
__get_rel_bitmask(bitmask),
__get_rel_cpumask(cpumask))
);
+
+TRACE_EVENT(foo_timer_fn,
+
+ TP_PROTO(struct foo_timer_data *data),
+
+ TP_ARGS(data),
+
+ TP_STRUCT__entry(
+ __string( name, data->name )
+ __field( int, count )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->count = *this_cpu_ptr(data->counter);
+ ),
+
+ TP_printk("name=%s count=%d", __get_str(name), __entry->count)
+);
#endif
/***** NOTICE! The #if protection ends here. *****/
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
new file mode 100644
index 000000000000..96791e120b7d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF event with typecast and percpu access
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+# Check if the sample module is loaded
+if ! lsmod | grep -q trace_events_sample; then
+ modprobe trace-events-sample || exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# The sample_timer_cb(struct timer_list *t) is called.
+# We want to check (STRUCT,FIELD)VAR typecast and this_cpu_read() access.
+# (foo_timer_data,timer)t converts t to struct foo_timer_data * using container_of.
+# data->counter is a per-cpu pointer to int.
+# this_cpu_read(data->counter) should give the value of the counter.
+
+echo 'f:mysample/myevent sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+
+echo 1 > events/mysample/myevent/enable
+echo 1 > events/sample-trace/foo_timer_fn/enable
+
+sleep 2
+
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+
+# Compare the values.
+MATCH=0
+while read line; do
+ if echo $line | grep -q "foo_timer_fn:"; then
+ NAME=`echo $line | sed 's/.*name=\([^ ]*\) .*/\1/'`
+ COUNT=`echo $line | sed 's/.*count=\([^ ]*\).*/\1/'`
+ if grep -q "myevent:.*name=\"${NAME}\" count=$COUNT" trace; then
+ MATCH=$((MATCH+1))
+ fi
+ fi
+done < trace
+
+if [ $MATCH -eq 0 ]; then
+ echo "No matching events found"
+ exit_fail
+fi
+
+# Clean up
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+echo > dynamic_events
+clear_trace
^ permalink raw reply related
* [RFC PATCH v2 6/7] tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
From: Masami Hiramatsu (Google) @ 2026-06-10 0:52 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When tracing kernel local variables, we sometimes need to read CPU-local
(per-CPU) variables as well. The current simple dereference is not enough
to access them.
Thus, introduce a special this_cpu_read() dereference to read a per-CPU
variable on the current CPU (accessing another CPU's variable may race
with updates on other CPUs). Also add this_cpu_ptr() to get the per-CPU
pointer for the current CPU.
Both work the same way as the kernel percpu macros of the same names.
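To illustrate the difference between the two accessors, here is a small
userspace model. It is a sketch under loud assumptions: a per-CPU variable
is modeled as a plain array with one slot per CPU, and the "current CPU" is
an ordinary global, neither of which matches the real percpu implementation.
All sketch_* names are hypothetical.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4

/* Hypothetical stand-in for the CPU the code is currently running on. */
static int sketch_current_cpu;

/* One slot per CPU; the "per-CPU variable" is the whole array. */
struct sketch_percpu_int {
	int slot[SKETCH_NR_CPUS];
};

/* Analogue of this_cpu_ptr(): address of the current CPU's slot. */
static int *sketch_this_cpu_ptr(struct sketch_percpu_int *v)
{
	assert(sketch_current_cpu >= 0 && sketch_current_cpu < SKETCH_NR_CPUS);
	return &v->slot[sketch_current_cpu];
}

/* Analogue of this_cpu_read(): value stored in the current CPU's slot. */
static int sketch_this_cpu_read(struct sketch_percpu_int *v)
{
	return *sketch_this_cpu_ptr(v);
}
```

In this model this_cpu_ptr() only computes an address, which can then be
dereferenced further or cast, while this_cpu_read() also loads the value
from that address.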
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v2:
- Drop +CPU/+PCPU and introduce this_cpu_read() and this_cpu_ptr().
- Support these methods with BTF typecast.
- Just check that the base address is not NULL instead of using is_kernel_percpu_address().
---
Documentation/trace/eprobetrace.rst | 2 +
Documentation/trace/fprobetrace.rst | 2 +
Documentation/trace/kprobetrace.rst | 2 +
kernel/trace/trace.c | 1
kernel/trace/trace_probe.c | 135 ++++++++++++++++++++++++-----------
kernel/trace/trace_probe.h | 2 +
kernel/trace/trace_probe_tmpl.h | 30 ++++++--
7 files changed, 125 insertions(+), 49 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index dcf92d5b4175..6ba70327c1de 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -40,6 +40,8 @@ Synopsis of eprobe_events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 3392cab016b3..3439bc9bd351 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -52,6 +52,8 @@ Synopsis of fprobe-events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 81e4fe38791d..9ae330eb0a52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -55,6 +55,8 @@ Synopsis of kprobe_events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e185a006cb08..1d5d6e46dc4d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4332,6 +4332,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
+ "\t this_cpu_read(<fetcharg>), this_cpu_ptr(<fetcharg>)\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, symbol,\n"
"\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 4bdccd9bd7d1..37ada81b7d46 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -349,6 +349,77 @@ static int parse_trace_event(char *arg, struct fetch_insn *code,
return -EINVAL;
}
+/* this_cpu_* parser */
+#define THIS_CPU_PTR_PREFIX "this_cpu_ptr("
+#define THIS_CPU_READ_PREFIX "this_cpu_read("
+#define THIS_CPU_PTR_LEN (sizeof(THIS_CPU_PTR_PREFIX) - 1)
+#define THIS_CPU_READ_LEN (sizeof(THIS_CPU_READ_PREFIX) - 1)
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+ struct fetch_insn **pcode, struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx);
+
+/* handle dereference nested call */
+static inline int handle_dereference(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end, struct traceprobe_parse_context *ctx,
+ int deref, long offset)
+{
+ const struct fetch_type *type = find_fetch_type(NULL, ctx->flags);
+ struct fetch_insn *code = *pcode;
+ int cur_offs = ctx->offset;
+ char *tmp;
+ int ret;
+
+ tmp = strrchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+
+ *tmp = '\0';
+ ret = parse_probe_arg(arg, type, &code, end, ctx);
+ if (ret)
+ return ret;
+ ctx->offset = cur_offs;
+ if (code->op == FETCH_OP_COMM || code->op == FETCH_OP_DATA) {
+ trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
+ return -EINVAL;
+ }
+ code++;
+ if (code == end) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+ return -EINVAL;
+ }
+ *pcode = code;
+
+ code->op = deref;
+ code->offset = offset;
+ /* Reset the last type if used */
+ ctx->last_type = NULL;
+ return 0;
+}
+
+static int parse_this_cpu(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ int deref;
+
+ if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX)) {
+ arg += THIS_CPU_PTR_LEN;
+ ctx->offset += THIS_CPU_PTR_LEN;
+ deref = FETCH_OP_CPU_PTR;
+ } else if (str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+ arg += THIS_CPU_READ_LEN;
+ ctx->offset += THIS_CPU_READ_LEN;
+ deref = FETCH_OP_DEREF_CPU;
+ } else
+ return -EINVAL;
+ return handle_dereference(arg, pcode, end, ctx, deref, 0);
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -928,11 +999,6 @@ static char *find_matched_close_paren(char *s)
return NULL;
}
-static int
-parse_probe_arg(char *arg, const struct fetch_type *type,
- struct fetch_insn **pcode, struct fetch_insn *end,
- struct traceprobe_parse_context *ctx);
-
static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
@@ -958,7 +1024,8 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
*tmp++ = '\0';
/* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
- if (*tmp == '(') {
+ if (*tmp == '(' || str_has_prefix(tmp, THIS_CPU_PTR_PREFIX) ||
+ str_has_prefix(tmp, THIS_CPU_READ_PREFIX)) {
char *close = find_matched_close_paren(tmp);
ctx->offset += tmp - arg;
@@ -978,12 +1045,18 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
return -E2BIG;
}
- *close = '\0';
- ctx->offset += 1; /* for the '(' */
- /* We need to parse the nested one */
- ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
- pcode, end, ctx);
+ if (*tmp == '(') {
+ /* Extract the inner argument */
+ *close = '\0';
+ ctx->offset += 1;/* for the '(' */
+ /* Parse the nested one */
+ ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+ pcode, end, ctx);
+ } else {
+ /* this_cpu_* will be parsed in parse_this_cpu() */
+ ret = parse_this_cpu(tmp, pcode, end, ctx);
+ }
if (ret < 0)
return ret;
ctx->nested_level--;
@@ -1448,36 +1521,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
}
ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
arg = tmp + 1;
- tmp = strrchr(arg, ')');
- if (!tmp) {
- trace_probe_log_err(ctx->offset + strlen(arg),
- DEREF_OPEN_BRACE);
- return -EINVAL;
- } else {
- const struct fetch_type *t2 = find_fetch_type(NULL, ctx->flags);
- int cur_offs = ctx->offset;
-
- *tmp = '\0';
- ret = parse_probe_arg(arg, t2, &code, end, ctx);
- if (ret)
- break;
- ctx->offset = cur_offs;
- if (code->op == FETCH_OP_COMM ||
- code->op == FETCH_OP_DATA) {
- trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
- return -EINVAL;
- }
- if (++code == end) {
- trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
- return -EINVAL;
- }
- *pcode = code;
-
- code->op = deref;
- code->offset = offset;
- /* Reset the last type if used */
- ctx->last_type = NULL;
- }
+ ret = handle_dereference(arg, pcode, end, ctx, deref, offset);
+ if (ret < 0)
+ return ret;
break;
case '\\': /* Immediate value */
if (arg[1] == '"') { /* Immediate string */
@@ -1498,15 +1544,18 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
ret = handle_typecast(arg, pcode, end, ctx);
break;
default:
- if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
+ if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX) ||
+ str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+ ret = parse_this_cpu(arg, pcode, end, ctx);
+ } else if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
if (!tparg_is_function_entry(ctx->flags) &&
!tparg_is_function_return(ctx->flags)) {
trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
return -EINVAL;
}
ret = parse_btf_arg(arg, pcode, end, ctx);
- break;
}
+ break;
}
if (!ret && code->op == FETCH_OP_NOP) {
/* Parsed, but do not find fetch method */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 62645e847bd1..33cec2b19041 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -100,6 +100,8 @@ enum fetch_op {
// Stage 2 (dereference) op
FETCH_OP_DEREF, /* Dereference: .offset */
FETCH_OP_UDEREF, /* User-space Dereference: .offset */
+ FETCH_OP_DEREF_CPU, /* Per-CPU Dereference for this CPU */
+ FETCH_OP_CPU_PTR, /* Per-CPU pointer for this CPU */
// Stage 3 (store) ops
FETCH_OP_ST_RAW, /* Raw: .size */
FETCH_OP_ST_MEM, /* Mem: .offset, .size */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f630930288d2..581aa38c66af 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -129,25 +129,43 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
struct fetch_insn *s3 = NULL;
int total = 0, ret = 0, i = 0;
u32 loc = 0;
- unsigned long lval = val;
+ unsigned long lval, llval = val;
stage2:
/* 2nd stage: dereference memory if needed */
do {
- if (code->op == FETCH_OP_DEREF) {
- lval = val;
+ lval = val;
+ switch (code->op) {
+ case FETCH_OP_DEREF:
ret = probe_mem_read(&val, (void *)val + code->offset,
sizeof(val));
- } else if (code->op == FETCH_OP_UDEREF) {
- lval = val;
+ break;
+ case FETCH_OP_UDEREF:
ret = probe_mem_read_user(&val,
(void *)val + code->offset, sizeof(val));
- } else
break;
+ case FETCH_OP_DEREF_CPU:
+ case FETCH_OP_CPU_PTR:
+ if (unlikely(!val)) {
+ ret = -EFAULT;
+ break;
+ }
+ val = (unsigned long)this_cpu_ptr((void __percpu *)val);
+ if (code->op == FETCH_OP_DEREF_CPU)
+ ret = probe_mem_read(&val, (void *)val, sizeof(val));
+ else
+ ret = 0;
+ break;
+ default:
+ lval = llval;
+ goto out;
+ }
if (ret)
return ret;
+ llval = lval;
code++;
} while (1);
+out:
s3 = code;
stage3:
* [RFC PATCH v2 5/7] tracing/probes: Add $current variable support
From: Masami Hiramatsu (Google) @ 2026-06-10 0:52 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since we can use BTF to cast a value to a structure pointer type,
it is useful to add a "$current" special variable to fetcharg.
Users can define a fetcharg that accesses properties of the current
task_struct using BTF info, e.g.
$current->cpus_ptr
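As a rough illustration of the parsing rule this patch adds to parse_probe_vars(), the sketch below classifies the text that follows the '$'. It is a userspace approximation, not the kernel code: classify_current() and the enum names are hypothetical, and it assumes CONFIG_PROBE_EVENTS_BTF_ARGS is enabled so that "current->..." is handed to the BTF parser.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result codes, used only in this sketch. */
enum current_kind { CUR_INVALID, CUR_PLAIN, CUR_BTF_FIELD };

/* 'arg' is the variable name with the leading '$' already stripped. */
static enum current_kind classify_current(const char *arg)
{
	static const char name[] = "current";
	size_t len = sizeof(name) - 1;

	assert(arg != NULL);
	if (strncmp(arg, name, len) != 0)
		return CUR_INVALID;	/* not $current at all */
	arg += len;
	if (*arg == '-')
		return CUR_BTF_FIELD;	/* "$current->field": needs BTF parsing */
	if (*arg != '\0')
		return CUR_INVALID;	/* e.g. "$currentx" is rejected */
	return CUR_PLAIN;		/* bare "$current": task_struct address */
}
```

With this rule, "current" yields the plain address, "current->cpus_ptr" goes down the BTF field path, and any other suffix is an error.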
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v2:
- Support to parse $current in parse_btf_arg().
- If there is no typecast on $current, it is automatically cast to task_struct.
- Return an error if $current is followed by anything other than "-".
---
Documentation/trace/eprobetrace.rst | 1 +
Documentation/trace/fprobetrace.rst | 1 +
Documentation/trace/kprobetrace.rst | 1 +
kernel/trace/trace.c | 2 +-
kernel/trace/trace_probe.c | 29 ++++++++++++++++++++++++++++-
kernel/trace/trace_probe.h | 1 +
kernel/trace/trace_probe_tmpl.h | 3 +++
7 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 680e0af43d5d..dcf92d5b4175 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -38,6 +38,7 @@ Synopsis of eprobe_events
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 290a9e6f7491..3392cab016b3 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -50,6 +50,7 @@ Synopsis of fprobe-events
$argN : Fetch the Nth function argument. (N >= 1) (\*2)
$retval : Fetch return value.(\*3)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index a62707e6a9f2..81e4fe38791d 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -53,6 +53,7 @@ Synopsis of kprobe_events
$argN : Fetch the Nth function argument. (N >= 1) (\*1)
$retval : Fetch return value.(\*2)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0e36af853199..e185a006cb08 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4329,7 +4329,7 @@ static const char readme_msg[] =
"\t [(structname[,field])](fetcharg)->field[->field|.field...],\n"
#endif
#else
- "\t $stack<index>, $stack, $retval, $comm,\n"
+ "\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 726be9782775..4bdccd9bd7d1 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -718,6 +718,20 @@ static int parse_btf_arg(char *varname,
return -EOPNOTSUPP;
}
+ if (strcmp(varname, "$current") == 0) {
+ code->op = FETCH_OP_CURRENT;
+ /* If no typecast is specified for $current, use task_struct by default */
+ if (!ctx->struct_btf) {
+ tid = bpf_find_btf_id("task_struct", BTF_KIND_STRUCT, &ctx->struct_btf);
+ if (tid < 0) {
+ trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);
+ return -ENOENT;
+ }
+ ctx->last_struct = btf_type_skip_modifiers(ctx->struct_btf, tid, &tid);
+ }
+ goto found;
+ }
+
if (ctx->flags & TPARG_FL_TEVENT) {
ret = parse_trace_event(varname, code, ctx);
if (ret < 0) {
@@ -756,8 +770,8 @@ static int parse_btf_arg(char *varname,
return -ENOENT;
}
}
- params = ctx->params;
+ params = ctx->params;
for (i = 0; i < ctx->nr_params; i++) {
const char *name = btf_name_by_offset(ctx->btf, params[i].name_off);
@@ -1246,6 +1260,19 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
return 0;
}
+ /* $current returns the address of the current task_struct. */
+ if (str_has_prefix(arg, "current")) {
+ arg += strlen("current");
+ if (*arg == '-' && IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS))
+ return parse_btf_arg(orig_arg, pcode, end, ctx);
+
+ if (*arg != '\0')
+ goto inval;
+
+ code->op = FETCH_OP_CURRENT;
+ return 0;
+ }
+
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
len = str_has_prefix(arg, "arg");
if (len) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 44f113faae61..62645e847bd1 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -96,6 +96,7 @@ enum fetch_op {
FETCH_OP_FOFFS, /* File offset: .immediate */
FETCH_OP_DATA, /* Allocated data: .data */
FETCH_OP_EDATA, /* Entry data: .offset */
+ FETCH_OP_CURRENT, /* Current task_struct address */
// Stage 2 (dereference) op
FETCH_OP_DEREF, /* Dereference: .offset */
FETCH_OP_UDEREF, /* User-space Dereference: .offset */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f39b37fcdb3b..f630930288d2 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -112,6 +112,9 @@ process_common_fetch_insn(struct fetch_insn *code, unsigned long *val)
case FETCH_OP_DATA:
*val = (unsigned long)code->data;
break;
+ case FETCH_OP_CURRENT:
+ *val = (unsigned long)current;
+ break;
default:
return -EILSEQ;
}
* [RFC PATCH v2 4/7] tracing/probes: Support field specifier option for typecast
From: Masami Hiramatsu (Google) @ 2026-06-10 0:52 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a field specifier option for the typecast. This works like the
container_of() macro.
(STRUCT[,FIELD[.FIELD2...]])VAR
This is equivalent to:
container_of(VAR, struct STRUCT, FIELD[.FIELD2...])
For example:
echo "f tick_nohz_handler next_tick=(tick_sched,sched_timer)timer->next_tick" >> dynamic_events
This will trace tick_nohz_handler() with its tick_sched::next_tick, which
is derived from @timer by container_of(timer, struct tick_sched, sched_timer).
So, if you enable both the fprobes:tick_nohz_handler__entry and
timer:hrtimer_expire_entry events, you will see something like:
<idle>-0 [002] d.h1. 3778.087272: hrtimer_expire_entry: hrtimer=00000000d63db328 function=tick_nohz_handler now=3777450051040
<idle>-0 [002] d.h1. 3778.087281: tick_nohz_handler__entry: (tick_nohz_handler+0x4/0x140) next_tick=3777450000000
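The container_of() equivalence above can be sketched in plain userspace C. The struct names below (hrtimer_like, tick_sched_like) are hypothetical stand-ins whose layout only mimics an embedded timer member, and container_of_sketch() is a simplified version of the kernel macro. deref_offset() shows the single byte offset that results from subtracting the field option's offset from the member's offset, which is what the patch does with prefix_byteoffs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the layout matters. */
struct hrtimer_like {
	long expires;
};

struct tick_sched_like {
	int flags;
	struct hrtimer_like sched_timer;	/* embedded member */
	long next_tick;
};

/* Simplified container_of(): step back by the member's byte offset. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Read next_tick of the enclosing struct from a pointer to its timer. */
static long next_tick_from_timer(struct hrtimer_like *timer)
{
	struct tick_sched_like *ts;

	assert(timer != NULL);
	ts = container_of_sketch(timer, struct tick_sched_like, sched_timer);
	return ts->next_tick;
}

/* Offset applied to the timer pointer: offsetof(MEMBER) - offsetof(FIELD). */
static long deref_offset(void)
{
	return (long)offsetof(struct tick_sched_like, next_tick) -
	       (long)offsetof(struct tick_sched_like, sched_timer);
}
```

Both routes reach the same memory: reading at (char *)timer + deref_offset() gives the same value as going through the recovered container pointer.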
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v2:
- Use byteoffset for typecast field offset instead of bitoffset. This fixes negative modulo calculation.
- Check whether a field is specified after typecast.
- Reject if typecast field option has arrow operator.
---
Documentation/trace/eprobetrace.rst | 5 +
Documentation/trace/fprobetrace.rst | 8 +-
Documentation/trace/kprobetrace.rst | 8 +-
kernel/trace/trace.c | 4 -
kernel/trace/trace_probe.c | 178 ++++++++++++++++++++++++-----------
kernel/trace/trace_probe.h | 5 +
6 files changed, 141 insertions(+), 67 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index cd0b4aa7f896..680e0af43d5d 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -49,7 +49,10 @@ Synopsis of eprobe_events
(STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that when this is used, the FIELD name does not
- need to be prefixed with a '$'.
+ need to be prefixed with a '$'. ASGN can be specified optionally.
+ If ASGN is specified, FIELD will be cast to the same offset
+ position as the ASGN member, rather than to the beginning of
+ the STRUCT.
(STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 6b8bb27bb62d..290a9e6f7491 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,10 +57,12 @@ Synopsis of fprobe-events
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.
- (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
- ->MEMBER.
- (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ ->MEMBER. ASGN can be specified optionally. If ASGN is specified,
+ FIELD will be cast to the same offset position as the ASGN member,
+ rather than to the beginning of the STRUCT.
+ (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
(\*1) This is available only when BTF is enabled.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index c4382765d5b2..a62707e6a9f2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,11 +61,13 @@ Synopsis of kprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and bitfield are
supported.
- (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that this is available only when the probe is
- on function entry.
- (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ on function entry. ASGN can be specified optionally. If ASGN
+ is specified, FIELD will be cast to the same offset position
+ as the ASGN member, rather than to the beginning of the STRUCT.
+ (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4f70318918c2..0e36af853199 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,8 +4325,8 @@ static const char readme_msg[] =
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
- "\t [(structname)]<argname>[->field[->field|.field...]],\n"
- "\t [(structname)](fetcharg)->field[->field|.field...],\n"
+ "\t [(structname[,field])]<argname>[->field[->field|.field...]],\n"
+ "\t [(structname[,field])](fetcharg)->field[->field|.field...],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index dba73aaa8ade..726be9782775 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -574,6 +574,65 @@ static int split_next_field(char *varname, char **next_field,
return ret;
}
+/* Inner loop for solving dot operator ('.'). Return bit-offset of the given field */
+static int get_bitoffset_of_field(char **pfieldname, const struct btf_type **ptype,
+ struct traceprobe_parse_context *ctx)
+{
+ const struct btf_type *type = *ptype;
+ const struct btf_member *field;
+ struct btf *btf = ctx_btf(ctx);
+ char *fieldname = *pfieldname;
+ int bitoffs = 0;
+ u32 anon_offs;
+ char *next;
+ int is_ptr;
+ s32 tid;
+
+ do {
+ next = NULL;
+ is_ptr = split_next_field(fieldname, &next, ctx);
+ if (is_ptr < 0)
+ return is_ptr;
+
+ anon_offs = 0;
+ field = btf_find_struct_member(btf, type, fieldname,
+ &anon_offs);
+ if (IS_ERR(field)) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return PTR_ERR(field);
+ }
+ if (!field) {
+ trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
+ return -ENOENT;
+ }
+ /* Add anonymous structure/union offset */
+ bitoffs += anon_offs;
+
+ /* Accumulate the bit-offsets of the dot-connected fields */
+ if (btf_type_kflag(type)) {
+ bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
+ ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
+ } else {
+ bitoffs += field->offset;
+ ctx->last_bitsize = 0;
+ }
+
+ type = btf_type_skip_modifiers(btf, field->type, &tid);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
+
+ if (next)
+ ctx->offset += next - fieldname;
+ fieldname = next;
+ } while (!is_ptr && fieldname);
+
+ *pfieldname = fieldname;
+ *ptype = type;
+
+ return bitoffs;
+}
/*
* Parse the field of data structure. The @type must be a pointer type
* pointing the target data structure type.
@@ -583,16 +642,14 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct traceprobe_parse_context *ctx)
{
struct fetch_insn *code = *pcode;
- const struct btf_member *field;
- u32 bitoffs, anon_offs;
- bool is_struct = ctx->struct_btf != NULL;
struct btf *btf = ctx_btf(ctx);
- char *next;
- int is_ptr;
+ bool is_first_field = true;
+ int bitoffs;
s32 tid;
do {
- if (!is_struct) {
+ /* For the first field of typecast, @type will be the target structure type. */
+ if (!(is_first_field && ctx->struct_btf)) {
/* Outer loop for solving arrow operator ('->') */
if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
@@ -606,60 +663,25 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return -EINVAL;
}
}
- /* Only the first type can skip being a pointer */
- is_struct = false;
-
- bitoffs = 0;
- do {
- /* Inner loop for solving dot operator ('.') */
- next = NULL;
- is_ptr = split_next_field(fieldname, &next, ctx);
- if (is_ptr < 0)
- return is_ptr;
-
- anon_offs = 0;
- field = btf_find_struct_member(btf, type, fieldname,
- &anon_offs);
- if (IS_ERR(field)) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return PTR_ERR(field);
- }
- if (!field) {
- trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
- return -ENOENT;
- }
- /* Add anonymous structure/union offset */
- bitoffs += anon_offs;
-
- /* Accumulate the bit-offsets of the dot-connected fields */
- if (btf_type_kflag(type)) {
- bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
- ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
- } else {
- bitoffs += field->offset;
- ctx->last_bitsize = 0;
- }
-
- type = btf_type_skip_modifiers(btf, field->type, &tid);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
- }
-
- ctx->offset += next - fieldname;
- fieldname = next;
- } while (!is_ptr && fieldname);
+ bitoffs = get_bitoffset_of_field(&fieldname, &type, ctx);
+ if (bitoffs < 0)
+ return bitoffs;
if (++code == end) {
trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
return -EINVAL;
}
code->op = FETCH_OP_DEREF; /* TODO: user deref support */
code->offset = bitoffs / 8;
+ if (is_first_field && ctx->struct_btf) {
+ /* The first field can be typecasted with field option. */
+ code->offset -= ctx->prefix_byteoffs;
+ }
*pcode = code;
ctx->last_bitoffs = bitoffs % 8;
ctx->last_type = type;
+ is_first_field = false;
} while (fieldname);
return 0;
@@ -690,6 +712,11 @@ static int parse_btf_arg(char *varname,
NOSUP_DAT_ARG);
return -EOPNOTSUPP;
}
+ if (!field && ctx->struct_btf) {
+ /* Typecast without field option is not supported */
+ trace_probe_log_err(ctx->offset, TYPECAST_REQ_FIELD);
+ return -EOPNOTSUPP;
+ }
if (ctx->flags & TPARG_FL_TEVENT) {
ret = parse_trace_event(varname, code, ctx);
@@ -700,8 +727,7 @@ static int parse_btf_arg(char *varname,
/* TEVENT is only here via a typecast */
if (WARN_ON_ONCE(ctx->struct_btf == NULL))
return -EINVAL;
- type = ctx->last_struct;
- goto found_type;
+ goto found;
}
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
@@ -763,7 +789,6 @@ static int parse_btf_arg(char *varname,
type = ctx->last_struct;
else
type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
-found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -832,6 +857,45 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
return 0;
}
+static int parse_btf_casttype(char *casttype, struct traceprobe_parse_context *ctx)
+{
+ char *field;
+ int ret;
+
+ /* Field option - evaluated later. */
+ field = strchr(casttype, ',');
+ if (field)
+ *field++ = '\0';
+
+ ret = query_btf_struct(casttype, ctx);
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ if (field) {
+ struct btf_type *type = (struct btf_type *)ctx->last_struct;
+
+ ctx->offset += field - casttype;
+ ret = get_bitoffset_of_field(&field, &ctx->last_struct, ctx);
+ if (ret < 0)
+ return ret;
+ if (ret % 8) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_ALIGNED);
+ return -EINVAL;
+ }
+ if (field != NULL) {
+ trace_probe_log_err(ctx->offset + field - casttype, TYPECAST_BAD_ARROW);
+ return -EINVAL;
+ }
+ ctx->prefix_byteoffs = ret / 8;
+ /* Restore the original struct type (overwritten by get_bitoffset_of_field) */
+ ctx->last_struct = type;
+ }
+
+ return ret;
+}
+
/* Find the matching closing parenthesis for a given opening parenthesis. */
static char *find_matched_close_paren(char *s)
{
@@ -915,11 +979,10 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
nested = true;
}
- ret = query_btf_struct(arg + 1, ctx);
- if (ret < 0) {
- trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
- return -EINVAL;
- }
+ ctx->offset = orig_offset + 1; /* for the '(' */
+ ret = parse_btf_casttype(arg + 1, ctx);
+ if (ret < 0)
+ return ret;
ctx->offset = orig_offset + tmp - arg;
/* If it is nested, tmp points to the field name. */
@@ -927,6 +990,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
else
ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->prefix_byteoffs = 0;
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 982d32a5df8b..44f113faae61 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -436,6 +436,7 @@ struct traceprobe_parse_context {
unsigned int flags;
int offset;
int nested_level;
+ int prefix_byteoffs; /* The byte offset of the prefix field of typecast */
};
#define TRACEPROBE_MAX_NESTED_LEVEL 3
@@ -576,7 +577,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
- C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"),
+ C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"), \
+ C(TYPECAST_NOT_ALIGNED, "Typecast field option is not byte-aligned"), \
+ C(TYPECAST_BAD_ARROW, "Typecast field option does not support -> operator"),
#undef C
#define C(a, b) TP_ERR_##a
* [RFC PATCH v2 3/7] tracing/probes: Support nested typecast
From: Masami Hiramatsu (Google) @ 2026-06-10 0:51 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When we hit an open parenthesis right after the closing parenthesis of a
typecast, it means we have a nested typecast. This allows us to
typecast a generic data member in a structure to a pointer to
another structure.
For example, the following casts DATA_MEMBER of the VAR structure to a
STRUCT pointer and fetches the MEMBER value:
(STRUCT)(VAR->DATA_MEMBER)->MEMBER
Typecasts can also be nested:
(STRUCT1)((STRUCT2)$ARG->FIELD2)->FIELD1
Currently the max nest level is limited to 3.
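The parenthesis handling behind the nested form can be tried out in userspace. The sketch below repeats the counting scan used by find_matched_close_paren() in this patch; field_after_nested() is a hypothetical helper that only mirrors the "a field access must follow the closing parenthesis" check and is not a function in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Same counting scan as find_matched_close_paren(): s starts at a '('. */
static const char *matched_close_paren(const char *s)
{
	int count = 0;

	assert(s != NULL);
	for (; *s; s++) {
		if (*s == '(')
			count++;
		else if (*s == ')' && --count == 0)
			return s;
	}
	return NULL;	/* unbalanced: no matching ')' */
}

/*
 * Hypothetical helper: return the field name that follows "(...)->",
 * or NULL if the parentheses are unbalanced or "->" is missing.
 */
static const char *field_after_nested(const char *s)
{
	const char *close = matched_close_paren(s);

	if (!close || close[1] != '-' || close[2] != '>')
		return NULL;
	return close + 3;	/* skip ")->" */
}
```

For the input "((STRUCT2)$ARG->FIELD2)->FIELD1", which is what remains after the outer "(STRUCT1)" is split off, the scan stops at the outer closing parenthesis and the helper points at "FIELD1".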
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v2:
- Fix to skip "->" after closing parenthesis.
---
Documentation/trace/eprobetrace.rst | 2 +
Documentation/trace/fprobetrace.rst | 2 +
Documentation/trace/kprobetrace.rst | 2 +
kernel/trace/trace.c | 1
kernel/trace/trace_probe.c | 76 ++++++++++++++++++++++++++++++++---
kernel/trace/trace_probe.h | 7 +++
6 files changed, 82 insertions(+), 8 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index fe3602540569..cd0b4aa7f896 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -50,6 +50,8 @@ Synopsis of eprobe_events
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that when this is used, the FIELD name does not
need to be prefixed with a '$'.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
Types
-----
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 7435ded2d66d..6b8bb27bb62d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -60,6 +60,8 @@ Synopsis of fprobe-events
(STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index f73614997d52..c4382765d5b2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -65,6 +65,8 @@ Synopsis of kprobe_events
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that this is available only when the probe is
on function entry.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index aa93e7b01146..4f70318918c2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4326,6 +4326,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
"\t [(structname)]<argname>[->field[->field|.field...]],\n"
+ "\t [(structname)](fetcharg)->field[->field|.field...],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 9158f1f22a62..dba73aaa8ade 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -832,10 +832,35 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
return 0;
}
+/* Find the matching closing parenthesis for a given opening parenthesis. */
+static char *find_matched_close_paren(char *s)
+{
+ char *p = s;
+ int count = 0;
+
+ while (*p) {
+ if (*p == '(')
+ count++;
+ else if (*p == ')') {
+ if (--count == 0)
+ return p;
+ }
+ p++;
+ }
+ return NULL;
+}
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+ struct fetch_insn **pcode, struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx);
+
static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
{
+ int orig_offset = ctx->offset;
+ bool nested = false;
char *tmp;
int ret;
@@ -852,19 +877,56 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
DEREF_OPEN_BRACE);
return -EINVAL;
}
- *tmp = '\0';
- ret = query_btf_struct(arg + 1, ctx);
- *tmp = ')';
+ *tmp++ = '\0';
+
+ /* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+ if (*tmp == '(') {
+ char *close = find_matched_close_paren(tmp);
+
+ ctx->offset += tmp - arg;
+ if (!close) {
+ trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ /* We expect a field access for typecast */
+ if (close[1] != '-' || close[2] != '>') {
+ trace_probe_log_err(ctx->offset + close - tmp + 1,
+ TYPECAST_REQ_FIELD);
+ return -EINVAL;
+ }
+ ctx->nested_level++;
+ if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+ return -E2BIG;
+ }
+ *close = '\0';
+
+ ctx->offset += 1; /* for the '(' */
+ /* We need to parse the nested one */
+ ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+ pcode, end, ctx);
+ if (ret < 0)
+ return ret;
+ ctx->nested_level--;
+ clear_struct_btf(ctx);
+
+ tmp = close + 3;/* Skip "->" after closing parenthesis */
+ nested = true;
+ }
+
+ ret = query_btf_struct(arg + 1, ctx);
if (ret < 0) {
trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
return -EINVAL;
}
- tmp++;
-
- ctx->offset += tmp - arg;
- ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->offset = orig_offset + tmp - arg;
+ /* If it is nested, tmp points to the field name. */
+ if (nested)
+ ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+ else
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 883938a74aee..982d32a5df8b 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -435,8 +435,11 @@ struct traceprobe_parse_context {
struct trace_probe *tp;
unsigned int flags;
int offset;
+ int nested_level;
};
+#define TRACEPROBE_MAX_NESTED_LEVEL 3
+
extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
const char *argv,
struct traceprobe_parse_context *ctx);
@@ -571,7 +574,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
- C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
+ C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
+ C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"),
#undef C
#define C(a, b) TP_ERR_##a
* [RFC PATCH v2 2/7] tracing/probes: Support typecast for various probe events
From: Masami Hiramatsu (Google) @ 2026-06-10 0:51 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Support the BTF typecast feature on other probe events (but only if the
probe is on kernel function entry or return).
To support other probe events, we just need to use the last_struct type
when we find a function parameter in parse_btf_arg().
This also updates the <tracefs>/README file to show the struct typecast.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v2:
- Fix to re-enable typecast on eprobe.
---
Documentation/trace/fprobetrace.rst | 3 +++
Documentation/trace/kprobetrace.rst | 4 ++++
kernel/trace/trace.c | 2 +-
kernel/trace/trace_probe.c | 14 +++++++++-----
kernel/trace/trace_probe.h | 5 +++++
5 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index b4c2ca3d02c1..7435ded2d66d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,6 +57,9 @@ Synopsis of fprobe-events
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER.
(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..f73614997d52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,6 +61,10 @@ Synopsis of kprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and bitfield are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then dereference the pointer defined by
+ ->MEMBER. Note that this is available only when the probe is
+ on function entry.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..aa93e7b01146 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,7 +4325,7 @@ static const char readme_msg[] =
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
- "\t <argname>[->field[->field|.field...]],\n"
+ "\t [(structname)]<argname>[->field[->field|.field...]],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index fd1caa1f9723..9158f1f22a62 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -759,7 +759,10 @@ static int parse_btf_arg(char *varname,
return -ENOENT;
found:
- type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+ if (ctx->struct_btf)
+ type = ctx->last_struct;
+ else
+ type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -836,10 +839,11 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
char *tmp;
int ret;
- /* Currently this only works for eprobes */
- if (!(ctx->flags & TPARG_FL_TEVENT)) {
- trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
- return -EINVAL;
+ if (!(tparg_is_event_probe(ctx->flags) ||
+ tparg_is_function_entry(ctx->flags) ||
+ tparg_is_function_return(ctx->flags))) {
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
}
tmp = strchr(arg, ')');
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 15758cc11fc6..883938a74aee 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -414,6 +414,11 @@ static inline bool tparg_is_function_return(unsigned int flags)
return (flags & TPARG_FL_LOC_MASK) == (TPARG_FL_KERNEL | TPARG_FL_RETURN);
}
+static inline bool tparg_is_event_probe(unsigned int flags)
+{
+ return !!(flags & TPARG_FL_TEVENT);
+}
+
struct traceprobe_parse_context {
struct trace_event_call *event;
/* BTF related parameters */
^ permalink raw reply related
* [RFC PATCH v2 1/7] tracing/events: Fix to check the simple_tsk_fn creation
From: Masami Hiramatsu (Google) @ 2026-06-10 0:51 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178105268094.21760.13668249930524377840.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Sashiko pointed out that this sample code does not correctly handle
thread creation failure, because kthread_run() can return an error
pointer (-errno). Remove the counter-based thread creation/stop logic and
instead just check whether simple_tsk_fn was correctly initialized (created).
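
For readers unfamiliar with the error-pointer convention behind this fix, here is a minimal userspace sketch. It assumes simplified imitations of the kernel's ERR_PTR()/IS_ERR()/IS_ERR_OR_NULL() helpers; fake_thread_run() and needs_stop() are hypothetical names used only for illustration and are not part of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace imitation of the kernel's error-pointer helpers (assumption:
 * simplified from <linux/err.h>; the top 4095 addresses encode -errno). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* Hypothetical stand-in for kthread_run(): a valid pointer on success,
 * ERR_PTR(-ENOMEM) on failure. It never returns NULL on failure. */
static int dummy_task;

static void *fake_thread_run(int fail)
{
	return fail ? ERR_PTR(-ENOMEM) : (void *)&dummy_task;
}

/* Returns 1 only when the pointer refers to a thread that may be stopped. */
static int needs_stop(const void *tsk)
{
	return !IS_ERR_OR_NULL(tsk);
}
```

The point of the sketch is that a failed creation yields a non-NULL value, so a plain `if (tsk)` test would treat the error pointer as a live thread, while the IS_ERR_OR_NULL() form rejects both the NULL and the error case.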
Link: https://sashiko.dev/#/patchset/178092865666.163648.10457567771536160909.stgit%40devnote2
Fixes: 9cfe06f8cd5c ("tracing/events: add trace-events-sample")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
samples/trace_events/trace-events-sample.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index ecc7db237f2e..b61766864b54 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -92,12 +92,11 @@ static int simple_thread_fn(void *arg)
}
static DEFINE_MUTEX(thread_mutex);
-static int simple_thread_cnt;
int foo_bar_reg(void)
{
mutex_lock(&thread_mutex);
- if (simple_thread_cnt++)
+ if (!IS_ERR_OR_NULL(simple_tsk_fn))
goto out;
pr_info("Starting thread for foo_bar_fn\n");
@@ -115,14 +114,11 @@ int foo_bar_reg(void)
void foo_bar_unreg(void)
{
mutex_lock(&thread_mutex);
- if (--simple_thread_cnt)
- goto out;
-
- pr_info("Killing thread for foo_bar_fn\n");
- if (simple_tsk_fn)
+ if (!IS_ERR_OR_NULL(simple_tsk_fn)) {
+ pr_info("Killing thread for foo_bar_fn\n");
kthread_stop(simple_tsk_fn);
- simple_tsk_fn = NULL;
- out:
+ simple_tsk_fn = NULL;
+ }
mutex_unlock(&thread_mutex);
}
@@ -139,7 +135,7 @@ static void __exit trace_event_exit(void)
{
kthread_stop(simple_tsk);
mutex_lock(&thread_mutex);
- if (simple_tsk_fn)
+ if (!IS_ERR_OR_NULL(simple_tsk_fn))
kthread_stop(simple_tsk_fn);
simple_tsk_fn = NULL;
mutex_unlock(&thread_mutex);
^ permalink raw reply related
* [RFC PATCH v2 0/7] tracing/probes: Add more typecast features
From: Masami Hiramatsu (Google) @ 2026-06-10 0:51 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
Hi,
Here is the 2nd version of the series that introduces more typecast
features to probe events. The previous version is here:
https://lore.kernel.org/all/178092865666.163648.10457567771536160909.stgit@devnote2/
In this version, I fixed various problems found by Sashiko's review and
added a fix for the sample code. I also dropped +CPU/PCPU() and introduced
this_cpu_read().
Steve introduced the BTF typecast feature for eprobe[1].
This series extends it and adds more options:
1. Expanding BTF typecast to kprobe and fprobe.
(currently only function entry/exit)
2. Introduce a container_of-like typecast. This adds an "assigned
member" option to the typecast.
(STRUCT,MEMBER)VAR->ANOTHER_MEMBER
This casts VAR to the STRUCT type, treating VAR as the address
of STRUCT.MEMBER. In C, it is:
container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER
3. Support nested typecast, e.g.
(STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER
The nesting level must be smaller than 3.
4. Add a $current variable that points to the "current" task_struct.
This is useful with typecast, e.g.
(task_struct)$current->pid
5. per-cpu dereference support.
Introduce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
access per-cpu data on the current CPU (accessing another CPU's
data is not stable, because it can change).
You can access a member of a per-cpu data structure using a
typecast like:
(STRUCT)this_cpu_ptr(VAR)->MEMBER
A test script that covers part of these features is also added.
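
To make the container_of-like typecast in item 2 concrete, here is a minimal userspace sketch of the C expression it corresponds to. The macro is a simplified stand-in for the kernel's container_of() (no type checking), and struct list_node, struct sample_task and pid_from_node() are hypothetical names for illustration only.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of():
 * step back from a member's address to the enclosing structure. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical types, not taken from the series. */
struct list_node {
	struct list_node *next;
};

struct sample_task {
	int pid;
	struct list_node node;	/* embedded member that VAR points at */
};

/* C equivalent of a probe argument written as
 * (sample_task,node)VAR->pid, following the syntax in item 2. */
static int pid_from_node(struct list_node *var)
{
	return container_of(var, struct sample_task, node)->pid;
}
```

Here VAR holds the address of the embedded `node` member rather than the address of the structure itself, which is why a plain (STRUCT)VAR cast would read the wrong offsets.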
[1] https://lore.kernel.org/all/20260601130746.2139d926@gandalf.local.home/
---
Masami Hiramatsu (Google) (7):
tracing/events: Fix to check the simple_tsk_fn creation
tracing/probes: Support typecast for various probe events
tracing/probes: Support nested typecast
tracing/probes: Support field specifier option for typecast
tracing/probes: Add $current variable support
tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
tracing/probes: Add a new testcase for BTF typecasts
Documentation/trace/eprobetrace.rst | 10
Documentation/trace/fprobetrace.rst | 10
Documentation/trace/kprobetrace.rst | 11 +
kernel/trace/trace.c | 6
kernel/trace/trace_probe.c | 404 +++++++++++++++-----
kernel/trace/trace_probe.h | 18 +
kernel/trace/trace_probe_tmpl.h | 33 +-
samples/trace_events/trace-events-sample.c | 56 ++-
samples/trace_events/trace-events-sample.h | 34 ++
.../ftrace/test.d/dynevent/btf_probe_event.tc | 51 +++
10 files changed, 509 insertions(+), 124 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v1] pnp: Documentation improvements
From: Randy Dunlap @ 2026-06-10 0:36 UTC (permalink / raw)
To: Uwe Kleine-König (The Capable Hub), Rafael J. Wysocki
Cc: Jonathan Corbet, Shuah Khan, linux-doc, linux-kernel
In-Reply-To: <20260609145117.1355753-2-u.kleine-koenig@baylibre.com>
On 6/9/26 7:51 AM, Uwe Kleine-König (The Capable Hub) wrote:
> - Consistently use named initializers and simplify sentinel
> - Skip assignment to .driver_data if all are 0
> - Use consistent spacing to match Linux coding style
> - Fix prototype of probe function
> - s/pnp_id/pnp_device_id/
> - Drop non-existing .card_id_table
>
> Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
LGTM. Thanks.
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
The only issue I have with this file following this patch is
the use of "ex" for "Example" or "E.g.".
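
As an aside on the named-initializer and empty-sentinel style used in the quoted patch, here is a minimal userspace sketch of how such a table is typically walked. struct sample_device_id is a hypothetical mirror of an ID-table entry (not the kernel's definition), and sample_table_len() and sample_lookup() are illustrative helpers, not PnP core APIs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of an ID-table entry. */
struct sample_device_id {
	char id[8];
	unsigned long driver_data;
};

static const struct sample_device_id sample_dev_table[] = {
	/* Standard LPT Printer Port; driver_data is implicitly 0 */
	{ .id = "PNP0400" },
	/* ECP Printer Port */
	{ .id = "PNP0401" },
	/* Sentinel: all fields zero (empty braces are GNU C / C23) */
	{ }
};

/* Count entries up to, but not including, the zeroed sentinel. */
static size_t sample_table_len(const struct sample_device_id *tbl)
{
	size_t n = 0;

	while (tbl[n].id[0] != '\0')
		n++;
	return n;
}

/* Exact-match lookup; returns NULL when the ID is not in the table. */
static const struct sample_device_id *
sample_lookup(const struct sample_device_id *tbl, const char *id)
{
	for (; tbl->id[0] != '\0'; tbl++)
		if (strcmp(tbl->id, id) == 0)
			return tbl;
	return NULL;
}
```

Because unnamed fields are zero-initialized, dropping `.driver_data = 0` and writing the sentinel as `{ }` leaves the table contents unchanged. The sketch does exact matching only and does not implement the 'X' wildcard mentioned in the documentation.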
> ---
> Documentation/admin-guide/pnp.rst | 22 ++++++++++------------
> 1 file changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/admin-guide/pnp.rst b/Documentation/admin-guide/pnp.rst
> index 24d80e3eb309..14a0bf400d2d 100644
> --- a/Documentation/admin-guide/pnp.rst
> +++ b/Documentation/admin-guide/pnp.rst
> @@ -203,12 +203,12 @@ The New Way
>
> ex::
>
> - static const struct pnp_id pnp_dev_table[] = {
> + static const struct pnp_device_id pnp_dev_table[] = {
> /* Standard LPT Printer Port */
> - {.id = "PNP0400", .driver_data = 0},
> + { .id = "PNP0400" },
> /* ECP Printer Port */
> - {.id = "PNP0401", .driver_data = 0},
> - {.id = ""}
> + { .id = "PNP0401" },
> + { }
> };
>
> Please note that the character 'X' can be used as a wild card in the function
> @@ -217,14 +217,14 @@ The New Way
> ex::
>
> /* Unknown PnP modems */
> - { "PNPCXXX", UNKNOWN_DEV },
> + { .id = "PNPCXXX", .driver_data = UNKNOWN_DEV },
>
> Supported PnP card IDs can optionally be defined.
> ex::
>
> - static const struct pnp_id pnp_card_table[] = {
> - { "ANYDEVS", 0 },
> - { "", 0 }
> + static const struct pnp_device_id pnp_card_table[] = {
> + { .id = "ANYDEVS" },
> + { }
> };
>
> 2. Optionally define probe and remove functions. It may make sense not to
> @@ -234,14 +234,13 @@ The New Way
> ex::
>
> static int
> - serial_pnp_probe(struct pnp_dev * dev, const struct pnp_id *card_id, const
> - struct pnp_id *dev_id)
> + serial_pnp_probe(struct pnp_dev *dev, const struct pnp_device_id *dev_id)
> {
> . . .
>
> ex::
>
> - static void serial_pnp_remove(struct pnp_dev * dev)
> + static void serial_pnp_remove(struct pnp_dev *dev)
> {
> . . .
>
> @@ -253,7 +252,6 @@ The New Way
>
> static struct pnp_driver serial_pnp_driver = {
> .name = "serial",
> - .card_id_table = pnp_card_table,
> .id_table = pnp_dev_table,
> .probe = serial_pnp_probe,
> .remove = serial_pnp_remove,
>
> base-commit: a87737435cfa134f9cdcc696ba3080759d04cf72
--
~Randy
^ permalink raw reply