Linux Perf Users

Linux Perf Users
 help / color / mirror / Atom feed

* Re: [PATCH v4 1/3] perf: marvell: Add MPAM partid filtering to CN10K TAD PMU
From: Ben Horgan @ 2026-06-25 13:53 UTC (permalink / raw)
  To: Geetha sowjanya, linux-perf-users, linux-kernel, linux-arm-kernel,
	devicetree
  Cc: mark.rutland, will, krzk+dt, james.morse
In-Reply-To: <20260618153610.13649-2-gakula@marvell.com>

Hi Geetha,

+CC James

On 6/18/26 16:36, Geetha sowjanya wrote:
> From: Tanmay Jagdale <tanmay@marvell.com>
> 
> The TAD PMU exposes counters that can be filtered by MPAM partition id
> for a subset of allocation and hit events.
> 
> Add a 9-bit partid format attribute (config1) and route counter programming
> through variant-specific ops so CN10K keeps MPAM-capable programming while
> Odyssey keeps the reduced event set without advertising partid in sysfs.
> 
> Probe no longer mutates the platform_device MMIO resource (walk a local
> map_start), rejects tad-cnt / page sizes of zero, validates the memory
> window against tad-cnt, and registers the perf PMU before hotplug with
> correct unwind.
> 
> Example:
>   perf stat -e tad/tad_alloc_any,partid=0x12,partid_en=1/ -- <program>

Where is the user expected to get the PARTID from? The MPAM driver
considers the PARTID as an internal only value.

resctrl does support a 'debug' mount option which will show the CLOSID
associated with a control group. Whilst the CLOSID is often the PARTID,
it is really a set of PARTIDs. When the cdp mount option is used, CLOSID
maps to 2 PARTIDs and if we use PARTID narrowing to give us more
monitors, as in proposed in [1], then the set of PARTIDs may be bigger.
Furthermore, if the PARTID narrowing scheme is made dynamic the size of
the PARTID set may change when control or monitoring groups are created
or deleted.

It seems that a way to map from a resctrl control group to the set of
PARTIDs is required and a mechanism to tie this to lifetime of the
resctrl mount.

Perhaps some helpers along the lines of:

int resctrl_mount_generation(void)
int mpam_rdtgrp_to_partid_is_static(int mount_gen)
int resctrl_rdtgrp_generation(char *name)
int mpam_rdtgrp_to_partid_count(char *name, int rdt_gen)
int mpam_rdtgrp_to_partid_array(char *name, int rdt_gen, int* partids)

The rdtgrp generation is to an attempt to avoid having to use a debug
interface in anger and cope with renaming of control groups in resctrl.
This does seem a bit unwieldly so hopefully there is better way to do this.

Sorry to throw a spanner in the works.

Thanks,

Ben

> 
> Signed-off-by: Tanmay Jagdale <tanmay@marvell.com>
> Signed-off-by: Geetha sowjanya <gakula@marvell.com>
> ---
> 
> Changelog (since v3)
> --------------------
> - Restore cpuhp_state_add_instance_nocalls before perf_pmu_register in probe
>   so users cannot attach events before the hotplug instance exists; unwind
>   removes the hotplug instance if perf registration fails.
> - Add perf_ready: tad_pmu_offline_cpu skips perf_pmu_migrate_context until after
>   successful perf_pmu_register, so a CPU offline between hotplug add and perf
>   register does not touch perf core state for an unregistered PMU.
> 
> Changelog (since v2)
> --------------------
> - Validate the eventId using an appropriate mask to ensure
>   it is restricted to 8 bits.
> 
> Changelog (since v1)
> --------------------
> - Fix config1 filter enable to use bit 9 consistently with the PMU format
>   string (partid_en) and reject reserved bits with GENMASK(9, 0).
> - Register perf_pmu_register before cpuhp_state_add_instance_nocalls and
>   unregister on hotplug failure.
> 
>  drivers/perf/marvell_cn10k_tad_pmu.c | 220 +++++++++++++++++++++------
>  1 file changed, 171 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/perf/marvell_cn10k_tad_pmu.c b/drivers/perf/marvell_cn10k_tad_pmu.c
> index 51ccb0befa05..340be3776fe7 100644
> --- a/drivers/perf/marvell_cn10k_tad_pmu.c
> +++ b/drivers/perf/marvell_cn10k_tad_pmu.c
> @@ -7,6 +7,8 @@
>  #define pr_fmt(fmt) "tad_pmu: " fmt
>  
>  #include <linux/io.h>
> +#include <linux/bits.h>
> +#include <linux/compiler.h>
>  #include <linux/module.h>
>  #include <linux/of.h>
>  #include <linux/cpuhotplug.h>
> @@ -14,12 +16,20 @@
>  #include <linux/platform_device.h>
>  #include <linux/acpi.h>
>  
> -#define TAD_PFC_OFFSET		0x800
> -#define TAD_PFC(counter)	(TAD_PFC_OFFSET | (counter << 3))
>  #define TAD_PRF_OFFSET		0x900
> -#define TAD_PRF(counter)	(TAD_PRF_OFFSET | (counter << 3))
> +#define TAD_PFC_OFFSET		0x800
> +#define TAD_PFC(base, counter)	((base) | ((u64)(counter) << 3))
> +#define TAD_PRF(base, counter)	((base) | ((u64)(counter) << 3))
>  #define TAD_PRF_CNTSEL_MASK	0xFF
> +#define TAD_PRF_MATCH_PARTID	BIT(8)
> +#define TAD_PRF_PARTID_NS	BIT(10)
> +/*
> + * config1: bits 0..8 MPAM partition id (including 0); bit 9 requests
> + * filtering for MPAM-capable events. All-zero config1 means no filter.
> + */
> +#define TAD_PARTID_FILTER_EN	BIT(9)
>  #define TAD_MAX_COUNTERS	8
> +#define TAD_EVENT_SEL_MASK	GENMASK(7, 0)
>  
>  #define to_tad_pmu(p) (container_of(p, struct tad_pmu, pmu))
>  
> @@ -27,30 +37,94 @@ struct tad_region {
>  	void __iomem	*base;
>  };
>  
> +enum mrvl_tad_pmu_version {
> +	TAD_PMU_V1 = 1,
> +	TAD_PMU_V2,
> +};
> +
> +struct tad_pmu_data {
> +	int id;
> +	u64 tad_prf_offset;
> +	u64 tad_pfc_offset;
> +};
> +
>  struct tad_pmu {
>  	struct pmu pmu;
>  	struct tad_region *regions;
>  	u32 region_cnt;
>  	unsigned int cpu;
> +	/* Set after successful perf_pmu_register(); gates offline migration. */
> +	bool perf_ready;
> +	const struct tad_pmu_ops *ops;
> +	const struct tad_pmu_data *pdata;
>  	struct hlist_node node;
>  	struct perf_event *events[TAD_MAX_COUNTERS];
>  	DECLARE_BITMAP(counters_map, TAD_MAX_COUNTERS);
>  };
>  
> -enum mrvl_tad_pmu_version {
> -	TAD_PMU_V1 = 1,
> -	TAD_PMU_V2,
> -};
> -
> -struct tad_pmu_data {
> -	int id;
> +struct tad_pmu_ops {
> +	void (*start_counter)(struct tad_pmu *pmu, struct perf_event *event);
>  };
>  
>  static int tad_pmu_cpuhp_state;
>  
> +static void tad_pmu_start_counter(struct tad_pmu *pmu,
> +				  struct perf_event *event)
> +{
> +	const struct tad_pmu_data *pdata = pmu->pdata;
> +	struct hw_perf_event *hwc = &event->hw;
> +	u32 event_idx = (u32)(event->attr.config & TAD_EVENT_SEL_MASK);
> +	u32 counter_idx = hwc->idx;
> +	u64 partid_filter = 0;
> +	u64 reg_val;
> +	u64 cfg1 = event->attr.config1;
> +	bool use_mpam = cfg1 & TAD_PARTID_FILTER_EN;
> +	u32 partid = (u32)(cfg1 & GENMASK(8, 0));
> +	int i;
> +
> +	for (i = 0; i < pmu->region_cnt; i++)
> +		writeq_relaxed(0, pmu->regions[i].base +
> +			       TAD_PFC(pdata->tad_pfc_offset, counter_idx));
> +
> +	if (use_mpam && event_idx > 0x19 && event_idx < 0x21) {
> +		partid_filter = TAD_PRF_MATCH_PARTID | TAD_PRF_PARTID_NS |
> +				((u64)partid << 11);
> +	}
> +
> +
> +	for (i = 0; i < pmu->region_cnt; i++) {
> +		reg_val = event_idx & 0xFF;
> +		reg_val |= partid_filter;
> +		writeq_relaxed(reg_val, pmu->regions[i].base +
> +			       TAD_PRF(pdata->tad_prf_offset, counter_idx));
> +	}
> +}
> +
> +static void tad_pmu_v2_start_counter(struct tad_pmu *pmu,
> +				     struct perf_event *event)
> +{
> +	const struct tad_pmu_data *pdata = pmu->pdata;
> +	struct hw_perf_event *hwc = &event->hw;
> +	u32 event_idx = (u32)(event->attr.config & TAD_EVENT_SEL_MASK);
> +	u32 counter_idx = hwc->idx;
> +	u64 reg_val;
> +	int i;
> +
> +	for (i = 0; i < pmu->region_cnt; i++)
> +		writeq_relaxed(0, pmu->regions[i].base +
> +			       TAD_PFC(pdata->tad_pfc_offset, counter_idx));
> +
> +	for (i = 0; i < pmu->region_cnt; i++) {
> +		reg_val = event_idx & 0xFF;
> +		writeq_relaxed(reg_val, pmu->regions[i].base +
> +			       TAD_PRF(pdata->tad_prf_offset, counter_idx));
> +	}
> +}
> +
>  static void tad_pmu_event_counter_read(struct perf_event *event)
>  {
>  	struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
> +	const struct tad_pmu_data *pdata = tad_pmu->pdata;
>  	struct hw_perf_event *hwc = &event->hw;
>  	u32 counter_idx = hwc->idx;
>  	u64 prev, new;
> @@ -60,7 +134,7 @@ static void tad_pmu_event_counter_read(struct perf_event *event)
>  		prev = local64_read(&hwc->prev_count);
>  		for (i = 0, new = 0; i < tad_pmu->region_cnt; i++)
>  			new += readq(tad_pmu->regions[i].base +
> -				     TAD_PFC(counter_idx));
> +				     TAD_PFC(pdata->tad_pfc_offset, counter_idx));
>  	} while (local64_cmpxchg(&hwc->prev_count, prev, new) != prev);
>  
>  	local64_add(new - prev, &event->count);
> @@ -69,16 +143,14 @@ static void tad_pmu_event_counter_read(struct perf_event *event)
>  static void tad_pmu_event_counter_stop(struct perf_event *event, int flags)
>  {
>  	struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
> +	const struct tad_pmu_data *pdata = tad_pmu->pdata;
>  	struct hw_perf_event *hwc = &event->hw;
>  	u32 counter_idx = hwc->idx;
>  	int i;
>  
> -	/* TAD()_PFC() stop counting on the write
> -	 * which sets TAD()_PRF()[CNTSEL] == 0
> -	 */
>  	for (i = 0; i < tad_pmu->region_cnt; i++) {
>  		writeq_relaxed(0, tad_pmu->regions[i].base +
> -			       TAD_PRF(counter_idx));
> +			       TAD_PRF(pdata->tad_prf_offset, counter_idx));
>  	}
>  
>  	tad_pmu_event_counter_read(event);
> @@ -89,26 +161,10 @@ static void tad_pmu_event_counter_start(struct perf_event *event, int flags)
>  {
>  	struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
>  	struct hw_perf_event *hwc = &event->hw;
> -	u32 event_idx = event->attr.config;
> -	u32 counter_idx = hwc->idx;
> -	u64 reg_val;
> -	int i;
>  
>  	hwc->state = 0;
>  
> -	/* Typically TAD_PFC() are zeroed to start counting */
> -	for (i = 0; i < tad_pmu->region_cnt; i++)
> -		writeq_relaxed(0, tad_pmu->regions[i].base +
> -			       TAD_PFC(counter_idx));
> -
> -	/* TAD()_PFC() start counting on the write
> -	 * which sets TAD()_PRF()[CNTSEL] != 0
> -	 */
> -	for (i = 0; i < tad_pmu->region_cnt; i++) {
> -		reg_val = event_idx & 0xFF;
> -		writeq_relaxed(reg_val,	tad_pmu->regions[i].base +
> -			       TAD_PRF(counter_idx));
> -	}
> +	tad_pmu->ops->start_counter(tad_pmu, event);
>  }
>  
>  static void tad_pmu_event_counter_del(struct perf_event *event, int flags)
> @@ -128,7 +184,6 @@ static int tad_pmu_event_counter_add(struct perf_event *event, int flags)
>  	struct hw_perf_event *hwc = &event->hw;
>  	int idx;
>  
> -	/* Get a free counter for this event */
>  	idx = find_first_zero_bit(tad_pmu->counters_map, TAD_MAX_COUNTERS);
>  	if (idx == TAD_MAX_COUNTERS)
>  		return -EAGAIN;
> @@ -148,6 +203,9 @@ static int tad_pmu_event_counter_add(struct perf_event *event, int flags)
>  static int tad_pmu_event_init(struct perf_event *event)
>  {
>  	struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
> +	const struct tad_pmu_data *pdata = tad_pmu->pdata;
> +	u32 event_idx = (u32)(event->attr.config & TAD_EVENT_SEL_MASK);
> +	u64 cfg1 = event->attr.config1;
>  
>  	if (event->attr.type != event->pmu->type)
>  		return -ENOENT;
> @@ -158,6 +216,23 @@ static int tad_pmu_event_init(struct perf_event *event)
>  	if (event->state != PERF_EVENT_STATE_OFF)
>  		return -EINVAL;
>  
> +	if (event->attr.config & ~TAD_EVENT_SEL_MASK)
> +		return -EINVAL;
> +
> +	if (pdata->id == TAD_PMU_V2) {
> +		if (cfg1)
> +			return -EINVAL;
> +	} else {
> +		if ((cfg1 & GENMASK(8, 0)) && !(cfg1 & TAD_PARTID_FILTER_EN))
> +			return -EINVAL;
> +		if (cfg1 & TAD_PARTID_FILTER_EN) {
> +			if (event_idx <= 0x19 || event_idx >= 0x21)
> +				return -EINVAL;
> +		}
> +		if (cfg1 & ~GENMASK(9, 0))
> +			return -EINVAL;
> +	}
> +
>  	event->cpu = tad_pmu->cpu;
>  	event->hw.idx = -1;
>  	event->hw.config_base = event->attr.config;
> @@ -232,7 +307,7 @@ static struct attribute *ody_tad_pmu_event_attrs[] = {
>  	TAD_PMU_EVENT_ATTR(tad_hit_ltg, 0x1e),
>  	TAD_PMU_EVENT_ATTR(tad_hit_any, 0x1f),
>  	TAD_PMU_EVENT_ATTR(tad_tag_rd, 0x20),
> -	TAD_PMU_EVENT_ATTR(tad_tot_cycle, 0xFF),
> +	TAD_PMU_EVENT_ATTR(tad_tot_cycle, 0xff),
>  	NULL
>  };
>  
> @@ -242,9 +317,13 @@ static const struct attribute_group ody_tad_pmu_events_attr_group = {
>  };
>  
>  PMU_FORMAT_ATTR(event, "config:0-7");
> +PMU_FORMAT_ATTR(partid, "config1:0-8");
> +PMU_FORMAT_ATTR(partid_en, "config1:9-9");
>  
>  static struct attribute *tad_pmu_format_attrs[] = {
>  	&format_attr_event.attr,
> +	&format_attr_partid.attr,
> +	&format_attr_partid_en.attr,
>  	NULL
>  };
>  
> @@ -253,6 +332,16 @@ static struct attribute_group tad_pmu_format_attr_group = {
>  	.attrs = tad_pmu_format_attrs,
>  };
>  
> +static struct attribute *ody_tad_pmu_format_attrs[] = {
> +	&format_attr_event.attr,
> +	NULL
> +};
> +
> +static struct attribute_group ody_tad_pmu_format_attr_group = {
> +	.name = "format",
> +	.attrs = ody_tad_pmu_format_attrs,
> +};
> +
>  static ssize_t tad_pmu_cpumask_show(struct device *dev,
>  				struct device_attribute *attr, char *buf)
>  {
> @@ -281,16 +370,25 @@ static const struct attribute_group *tad_pmu_attr_groups[] = {
>  
>  static const struct attribute_group *ody_tad_pmu_attr_groups[] = {
>  	&ody_tad_pmu_events_attr_group,
> -	&tad_pmu_format_attr_group,
> +	&ody_tad_pmu_format_attr_group,
>  	&tad_pmu_cpumask_attr_group,
>  	NULL
>  };
>  
> +static const struct tad_pmu_ops tad_pmu_ops = {
> +	.start_counter = tad_pmu_start_counter,
> +};
> +
> +static const struct tad_pmu_ops tad_pmu_v2_ops = {
> +	.start_counter = tad_pmu_v2_start_counter,
> +};
> +
>  static int tad_pmu_probe(struct platform_device *pdev)
>  {
>  	const struct tad_pmu_data *dev_data;
>  	struct device *dev = &pdev->dev;
>  	struct tad_region *regions;
> +	resource_size_t map_start;
>  	struct tad_pmu *tad_pmu;
>  	struct resource *res;
>  	u32 tad_pmu_page_size;
> @@ -298,7 +396,6 @@ static int tad_pmu_probe(struct platform_device *pdev)
>  	u32 tad_cnt;
>  	int version;
>  	int i, ret;
> -	char *name;
>  
>  	tad_pmu = devm_kzalloc(&pdev->dev, sizeof(*tad_pmu), GFP_KERNEL);
>  	if (!tad_pmu)
> @@ -312,6 +409,7 @@ static int tad_pmu_probe(struct platform_device *pdev)
>  		return -ENODEV;
>  	}
>  	version = dev_data->id;
> +	tad_pmu->pdata = dev_data;
>  
>  	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
>  	if (!res) {
> @@ -338,22 +436,31 @@ static int tad_pmu_probe(struct platform_device *pdev)
>  		dev_err(&pdev->dev, "Can't find tad-cnt property\n");
>  		return ret;
>  	}
> +	if (!tad_cnt || !tad_page_size || !tad_pmu_page_size) {
> +		dev_err(&pdev->dev, "Invalid tad-cnt or page size\n");
> +		return -EINVAL;
> +	}
>  
>  	regions = devm_kcalloc(&pdev->dev, tad_cnt,
>  			       sizeof(*regions), GFP_KERNEL);
>  	if (!regions)
>  		return -ENOMEM;
>  
> -	/* ioremap the distributed TAD pmu regions */
> -	for (i = 0; i < tad_cnt && res->start < res->end; i++) {
> -		regions[i].base = devm_ioremap(&pdev->dev,
> -					       res->start,
> +	map_start = res->start;
> +	for (i = 0; i < tad_cnt; i++) {
> +		if (map_start > res->end ||
> +		    tad_pmu_page_size > (resource_size_t)(res->end - map_start + 1)) {
> +			dev_err(&pdev->dev, "TAD PMU mem window too small for tad-cnt=%u\n",
> +				tad_cnt);
> +			return -EINVAL;
> +		}
> +		regions[i].base = devm_ioremap(&pdev->dev, map_start,
>  					       tad_pmu_page_size);
>  		if (!regions[i].base) {
>  			dev_err(&pdev->dev, "TAD%d ioremap fail\n", i);
>  			return -ENOMEM;
>  		}
> -		res->start += tad_page_size;
> +		map_start += tad_page_size;
>  	}
>  
>  	tad_pmu->regions = regions;
> @@ -374,14 +481,16 @@ static int tad_pmu_probe(struct platform_device *pdev)
>  		.read		= tad_pmu_event_counter_read,
>  	};
>  
> -	if (version == TAD_PMU_V1)
> +	if (version == TAD_PMU_V1) {
>  		tad_pmu->pmu.attr_groups = tad_pmu_attr_groups;
> -	else
> +		tad_pmu->ops		 = &tad_pmu_ops;
> +	} else {
>  		tad_pmu->pmu.attr_groups = ody_tad_pmu_attr_groups;
> +		tad_pmu->ops		 = &tad_pmu_v2_ops;
> +	}
>  
>  	tad_pmu->cpu = raw_smp_processor_id();
>  
> -	/* Register pmu instance for cpu hotplug */
>  	ret = cpuhp_state_add_instance_nocalls(tad_pmu_cpuhp_state,
>  					       &tad_pmu->node);
>  	if (ret) {
> @@ -389,19 +498,24 @@ static int tad_pmu_probe(struct platform_device *pdev)
>  		return ret;
>  	}
>  
> -	name = "tad";
> -	ret = perf_pmu_register(&tad_pmu->pmu, name, -1);
> -	if (ret)
> +	ret = perf_pmu_register(&tad_pmu->pmu, "tad", -1);
> +	if (ret) {
> +		dev_err(&pdev->dev, "Error %d registering perf PMU\n", ret);
>  		cpuhp_state_remove_instance_nocalls(tad_pmu_cpuhp_state,
>  						    &tad_pmu->node);
> +		return ret;
> +	}
>  
> -	return ret;
> +	WRITE_ONCE(tad_pmu->perf_ready, true);
> +
> +	return 0;
>  }
>  
>  static void tad_pmu_remove(struct platform_device *pdev)
>  {
>  	struct tad_pmu *pmu = platform_get_drvdata(pdev);
>  
> +	WRITE_ONCE(pmu->perf_ready, false);
>  	cpuhp_state_remove_instance_nocalls(tad_pmu_cpuhp_state,
>  						&pmu->node);
>  	perf_pmu_unregister(&pmu->pmu);
> @@ -410,12 +524,17 @@ static void tad_pmu_remove(struct platform_device *pdev)
>  #if defined(CONFIG_OF) || defined(CONFIG_ACPI)
>  static const struct tad_pmu_data tad_pmu_data = {
>  	.id   = TAD_PMU_V1,
> +	.tad_prf_offset = TAD_PRF_OFFSET,
> +	.tad_pfc_offset = TAD_PFC_OFFSET,
>  };
> +
>  #endif
>  
>  #ifdef CONFIG_ACPI
>  static const struct tad_pmu_data tad_pmu_v2_data = {
>  	.id   = TAD_PMU_V2,
> +	.tad_prf_offset = TAD_PRF_OFFSET,
> +	.tad_pfc_offset = TAD_PFC_OFFSET,
>  };
>  #endif
>  
> @@ -451,6 +570,9 @@ static int tad_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
>  	struct tad_pmu *pmu = hlist_entry_safe(node, struct tad_pmu, node);
>  	unsigned int target;
>  
> +	if (!READ_ONCE(pmu->perf_ready))
> +		return 0;
> +
>  	if (cpu != pmu->cpu)
>  		return 0;
>  
> @@ -491,6 +613,6 @@ static void __exit tad_pmu_exit(void)
>  module_init(tad_pmu_init);
>  module_exit(tad_pmu_exit);
>  
> -MODULE_DESCRIPTION("Marvell CN10K LLC-TAD Perf driver");
> +MODULE_DESCRIPTION("Marvell CN10K LLC-TAD perf driver");
>  MODULE_AUTHOR("Bhaskara Budiredla <bbudiredla@marvell.com>");
>  MODULE_LICENSE("GPL v2");


^ permalink raw reply

* [PATCH 6.1.y 2/2] perf block-range: Move debug code behind ifndef NDEBUG
From: Simon Liebold @ 2026-06-25 13:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Simon Liebold, Ian Rogers, linux-perf-users, linux-kernel, stable
  Cc: Adrian Hunter, Paolo Bonzini, Sean Christopherson,
	Arnaldo Carvalho de Melo
In-Reply-To: <20260625133222.3412820-1-simonlie@amazon.de>

From: Ian Rogers <irogers@google.com>

[ Upstream commit 984a785f25e5b5db5fa673130b60dca6ca794406 ]

Make good on a comment and avoid a unused-but-set-variable warning.

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230330183827.1412303-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Stable-dep-of: 616b14b47a86 ("perf build: Conditionally define NDEBUG")
Signed-off-by: Simon Liebold <simonlie@amazon.de>
---
 tools/perf/util/block-range.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/tools/perf/util/block-range.c b/tools/perf/util/block-range.c
index 1be4326575013..680e92774d0cd 100644
--- a/tools/perf/util/block-range.c
+++ b/tools/perf/util/block-range.c
@@ -11,11 +11,7 @@ struct {
 
 static void block_range__debug(void)
 {
-	/*
-	 * XXX still paranoid for now; see if we can make this depend on
-	 * DEBUG=1 builds.
-	 */
-#if 1
+#ifndef NDEBUG
 	struct rb_node *rb;
 	u64 old = 0; /* NULL isn't executable */
 
-- 
2.50.1




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597


^ permalink raw reply related

* [PATCH 6.1.y 1/2] perf bench: Avoid NDEBUG warning
From: Simon Liebold @ 2026-06-25 13:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Simon Liebold, Ian Rogers, linux-perf-users, linux-kernel, stable
  Cc: Adrian Hunter, Paolo Bonzini, Sean Christopherson,
	Arnaldo Carvalho de Melo
In-Reply-To: <20260625133222.3412820-1-simonlie@amazon.de>

From: Ian Rogers <irogers@google.com>

[ Upstream commit d1babea9c38282b58a6f822ab95027cba3165a42 ]

With NDEBUG set the asserts are compiled out. This yields
"unused-but-set-variable" variables. Move these variables behind
NDEBUG to avoid the warning.

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230330183827.1412303-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Stable-dep-of: 616b14b47a86 ("perf build: Conditionally define NDEBUG")
Signed-off-by: Simon Liebold <simonlie@amazon.de>
---
 tools/perf/bench/find-bit-bench.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/perf/bench/find-bit-bench.c b/tools/perf/bench/find-bit-bench.c
index 22b5cfe970237..80f051f9c20fd 100644
--- a/tools/perf/bench/find-bit-bench.c
+++ b/tools/perf/bench/find-bit-bench.c
@@ -61,7 +61,6 @@ static int do_for_each_set_bit(unsigned int num_bits)
 	double time_average, time_stddev;
 	unsigned int bit, i, j;
 	unsigned int set_bits, skip;
-	unsigned int old;
 
 	init_stats(&fb_time_stats);
 	init_stats(&tb_time_stats);
@@ -73,7 +72,10 @@ static int do_for_each_set_bit(unsigned int num_bits)
 			set_bit(i, to_test);
 
 		for (i = 0; i < outer_iterations; i++) {
-			old = accumulator;
+#ifndef NDEBUG
+			unsigned int old = accumulator;
+#endif
+
 			gettimeofday(&start, NULL);
 			for (j = 0; j < inner_iterations; j++) {
 				for_each_set_bit(bit, to_test, num_bits)
@@ -85,7 +87,9 @@ static int do_for_each_set_bit(unsigned int num_bits)
 			runtime_us = diff.tv_sec * USEC_PER_SEC + diff.tv_usec;
 			update_stats(&fb_time_stats, runtime_us);
 
+#ifndef NDEBUG
 			old = accumulator;
+#endif
 			gettimeofday(&start, NULL);
 			for (j = 0; j < inner_iterations; j++) {
 				for (bit = 0; bit < num_bits; bit++) {
-- 
2.50.1




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597


^ permalink raw reply related

* [PATCH 6.1.y 0/2] Backport dependency commits for 616b14b47a86 ("perf build: Conditionally define NDEBUG")
From: Simon Liebold @ 2026-06-25 13:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Simon Liebold, Ian Rogers, linux-perf-users, linux-kernel, stable

Hi, please backport the following two patches to 6.1.y:

- d1babea9c382 ("perf bench: Avoid NDEBUG warning")
- 984a785f25e5 ("perf block-range: Move debug code behind ifndef NDEBUG")

They are stable dependencies for commit 616b14b47a86 ("perf build: Conditionally
define NDEBUG") which was backported to v6.1.176 as 7bf35a0237d04.

That commit adds -DNDEBUG=1 to perf CFLAGS, which compiles out assert() calls,
leaving variables consumed only by asserts as unused-but-set. Combined with
-Werror this breaks the build:

    bench/find-bit-bench.c:64:22: error: variable 'old' set but not used
    util/block-range.c:20:13: error: variable 'old' set but not used

We need these two dependency patches, because both guard assert-only variables
with #ifndef NDEBUG so they are compiled out alongside the asserts they
validate.

Tested using our regression test suite including kselftest and LTP on various
EC2 instances.

Thanks.

- Simon

Ian Rogers (2):
  perf bench: Avoid NDEBUG warning
  perf block-range: Move debug code behind ifndef NDEBUG

 tools/perf/bench/find-bit-bench.c | 8 ++++++--
 tools/perf/util/block-range.c     | 6 +-----
 2 files changed, 7 insertions(+), 7 deletions(-)

base-commit: fdb6fcb41cc741ad5eaa7995f278dfcb94fdf795
-- 
2.50.1

Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christof Hellmis, Andreas Stieger
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

^ permalink raw reply

* Re: [PATCH] perf trace: Refactor augmented_raw_syscalls using bpf_loop
From: Viktor Malik @ 2026-06-25 12:05 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Alexei Starovoitov, linux-perf-users, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Howard Chu,
	linux-kernel, bpf, Michael Petlan, stable
In-Reply-To: <ajwu7xR6V6MAQOFw@google.com>

On 6/24/26 21:24, Namhyung Kim wrote:
> On Wed, Jun 24, 2026 at 08:47:38AM +0200, Viktor Malik wrote:
>> On 6/23/26 19:10, Namhyung Kim wrote:
>>> Hello,
>>>
>>> On Tue, Jun 23, 2026 at 08:27:39AM -0700, Alexei Starovoitov wrote:
>>>> On Tue Jun 23, 2026 at 4:25 AM PDT, Viktor Malik wrote:
> [SNIP]
>>>>> +	struct args_loop_ctx loop_ctx = {
>>>>> +		.args = args,
>>>>> +		.beauty_map = beauty_map,
>>>>> +		.payload_offset = payload_offset,
>>>>> +		.value_size = value_size,
>>>>> +		.output = &output,
>>>>> +		.do_output = &do_output
>>>>> +	};
>>>>> +	iters = bpf_loop(6, process_arg_cb, &loop_ctx, 0);
>>>>
>>>> bpf_loop() is old and generally not recommended.
>>>> Please use bpf_for() then the diff will be one line change and
>>>> can scale to any number of args. Not just 6.
>>
>> Thanks Alexei, I didn't know about this preference.
>>
>>> One thing we should take care is to support old kernels.  The oldest
>>> LTS kernel in the kernel.org is 5.10 and bpf_loop() was introduced in
>>> 5.17 and bpf_for (bpf_iter_num) was 6.4.
>>
>> The problematic loop was introduced in 6.12 by a68fd6a6cdd3 ("perf
>> trace: Collect augmented data using BPF") so we should be good using
>> bpf_for. Or is perf from 7.2 supposed to work on 5.10 LTS kernels?
> 
> Yep, we'd like to support old kernels.

How much strict are you on this requirement? IMHO, the very least we
need to fix the verifier issue is bpf_loop, so that would still not work
on 5.10 and 5.15 LTS kernels.

We could probably keep the open-coded loop in case bpf_loop is not
available but `perf trace` would still fail on kernels without bpf_loop
for new perf built with Clang>=22. Also, the code would be a bit ugly
and I'm not sure how well the feature check for helpers (bpf_loop) works
on old kernels.

Viktor


^ permalink raw reply

* Re: [PATCH] perf trace: Refactor augmented_raw_syscalls using bpf_loop
From: Viktor Malik @ 2026-06-25 11:58 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Namhyung Kim, Alexei Starovoitov, linux-perf-users,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, James Clark, Howard Chu, linux-kernel, bpf,
	Michael Petlan, stable
In-Reply-To: <CAEf4BzYWdsBDQ3D41=+n_oCO68bVOtKuqQCqZOEVo=j7nK9Ozg@mail.gmail.com>

On 6/24/26 19:19, Andrii Nakryiko wrote:
> On Wed, Jun 24, 2026 at 3:27 AM Viktor Malik <vmalik@redhat.com> wrote:
>>
>> On 6/24/26 08:47, Viktor Malik wrote:
>>> On 6/23/26 19:10, Namhyung Kim wrote:
>>>> Hello,
>>>>
>>>> On Tue, Jun 23, 2026 at 08:27:39AM -0700, Alexei Starovoitov wrote:
>>>>> On Tue Jun 23, 2026 at 4:25 AM PDT, Viktor Malik wrote:
>>>>>> The loop for processing syscall args in augment_raw_syscalls has a
>>>>>> history of breaking with Clang updates, see e.g. commit 013eb043f37b
>>>>>> ("perf trace: Fix BPF loading failure (-E2BIG)") from Clang 15 to 16.
>>>>>>
>>>>>> Now, a similar thing happened between Clang 21 and 22. While the issue
>>>>>> is mitigated on the main line by a recent verifier update, it remains
>>>>>> broken on the 6.12 and 6.18 stable branches:
>>>>>>
>>>>>>     [linux-6.18.y]# sudo perf trace true
>>>>>>     libbpf: prog 'sys_enter': BPF program load failed: -E2BIG
>>>>>>     libbpf: prog 'sys_enter': -- BEGIN PROG LOAD LOG --
>>>>>>     [...]
>>>>>>     BPF program is too large. Processed 1000001 insn
>>>>>>     processed 1000001 insns (limit 1000000) max_states_per_insn 40 total_states 37941 peak_states 232 mark_read 0
>>>>>>     -- END PROG LOAD LOG --
>>>>>>     libbpf: prog 'sys_enter': failed to load: -E2BIG
>>>>>>     libbpf: failed to load object 'augmented_raw_syscalls_bpf'
>>>>>>     libbpf: failed to load BPF skeleton 'augmented_raw_syscalls_bpf': -E2BIG
>>>>>>     Error: failed to get syscall or beauty map fd
>>>>>>     [...]
>>>>>>
>>>>>> The reason is that the loop is quite complex and the BPF verifier often
>>>>>> struggles to prove that it terminates.
>>>>>>
>>>>>> Fix the issue by refactoring the loop body into a callback function and
>>>>>> calling the bpf_loop helper. This should prevent future breakages of
>>>>>> this kind since the callback function has no loops. It also allows to
>>>>>> drop a few artificial checks to help the verifier, including the changes
>>>>>> introduced by 013eb043f37b.
>>>>
>>>> Thanks for working on this.  I encountered this issue before and never
>>>> found time to take a deeper look yet.
>>>>
>>>>>>
>>>>>> Signed-off-by: Viktor Malik <vmalik@redhat.com>
>>>>>> Fixes: a68fd6a6cdd3 ("perf trace: Collect augmented data using BPF")
>>>>>> Fixes: 013eb043f37b ("perf trace: Fix BPF loading failure (-E2BIG)")
>>>>>> Cc: stable@vger.kernel.org
>>>>>> ---
>>>>>>  .../bpf_skel/augmented_raw_syscalls.bpf.c     | 157 +++++++++++-------
>>>>>>  1 file changed, 96 insertions(+), 61 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
>>>>>> index 2a6e61864ee0..6d553ed3ac23 100644
>>>>>> --- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
>>>>>> +++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
>>>>>> @@ -429,15 +429,96 @@ static bool pid_filter__has(struct pids_filtered *pids, pid_t pid)
>>>>>>    return bpf_map_lookup_elem(pids, &pid) != NULL;
>>>>>>  }
>>>>>>
>>>>>> +struct args_loop_ctx {
>>>>>> +  struct syscall_enter_args *args;
>>>>>> +  unsigned int *beauty_map;
>>>>>> +  void *payload_offset;
>>>>>> +  int value_size;
>>>>>> +  u64 *output;
>>>>>> +  bool *do_output;
>>>>>> +};
>>>>>> +
>>>>>> +static long process_arg_cb(u64 i, void *ctx)
>>>>>> +{
>>>>>> +  /*
>>>>>> +   * Determine what type of argument and how many bytes to read from user space, using the
>>>>>> +   * value in the beauty_map. This is the relation of parameter type and its corresponding
>>>>>> +   * value in the beauty map, and how many bytes we read eventually:
>>>>>> +   *
>>>>>> +   * string: 1                          -> size of string
>>>>>> +   * struct: size of struct             -> size of struct
>>>>>> +   * buffer: -1 * (index of paired len) -> value of paired len (maximum: TRACE_AUG_MAX_BUF)
>>>>>> +   */
>>>>>> +  struct augmented_arg *augmented_arg;
>>>>>> +  struct args_loop_ctx *loop_ctx;
>>>>>> +  int aug_size, size, index;
>>>>>> +  bool augmented;
>>>>>> +  void *arg;
>>>>>> +
>>>>>> +  /* Bounds check for the below map access to help the verifier */
>>>>>> +  if (i < 0 || i >= 6)
>>>>>> +          return 1;
>>>>>> +
>>>>>> +  loop_ctx = (struct args_loop_ctx *)ctx;
>>>>>> +  arg = (void *)loop_ctx->args->args[i];
>>>>>> +  augmented = false;
>>>>>> +  size = loop_ctx->beauty_map[i];
>>>>>> +  aug_size = size; /* size of the augmented data read from user space */
>>>>>> +  augmented_arg = (struct augmented_arg *)loop_ctx->payload_offset;
>>>>>> +
>>>>>> +  if (size == 0 || arg == NULL)
>>>>>> +          return 0; /* continue */
>>>>>> +
>>>>>> +  if (size == 1) { /* string */
>>>>>> +          aug_size = bpf_probe_read_user_str(augmented_arg->value, loop_ctx->value_size, arg);
>>>>>> +          augmented = true;
>>>>>> +  } else if (size > 0 && size <= loop_ctx->value_size) { /* struct */
>>>>>> +          if (!bpf_probe_read_user(augmented_arg->value, size, arg))
>>>>>> +                  augmented = true;
>>>>>> +  } else if (size < 0 && size >= -6) { /* buffer */
>>>>>> +          index = -(size + 1);
>>>>>> +          barrier_var(index); // Prevent clang (noticed with v18) from removing the &= 7 trick.
>>>>>> +          index &= 7;         // Satisfy the bounds checking with the verifier in some kernels.
>>>>>> +          aug_size = loop_ctx->args->args[index];
>>>>>> +
>>>>>> +          if (aug_size > TRACE_AUG_MAX_BUF)
>>>>>> +                  aug_size = TRACE_AUG_MAX_BUF;
>>>>>> +
>>>>>> +          if (aug_size > 0) {
>>>>>> +                  if (!bpf_probe_read_user(augmented_arg->value, aug_size, arg))
>>>>>> +                          augmented = true;
>>>>>> +          }
>>>>>> +  }
>>>>>> +
>>>>>> +  /* Augmented data size is limited to sizeof(augmented_arg->unnamed union with value field) */
>>>>>> +  if (aug_size > loop_ctx->value_size)
>>>>>> +          aug_size = loop_ctx->value_size;
>>>>>> +
>>>>>> +  /* write data to payload */
>>>>>> +  if (augmented) {
>>>>>> +          int written = offsetof(struct augmented_arg, value) + aug_size;
>>>>>> +
>>>>>> +          if (written < 0 || written > sizeof(struct augmented_arg))
>>>>>> +                  return 1; /* break */
>>>>>> +
>>>>>> +          augmented_arg->size = aug_size;
>>>>>> +          *loop_ctx->output += written;
>>>>>> +          loop_ctx->payload_offset += written;
>>>>>> +          *loop_ctx->do_output = true;
>>>>>> +  }
>>>>>> +
>>>>>> +  return 0;
>>>>>> +}
>>>>>> +
>>>>>>  static int augment_sys_enter(void *ctx, struct syscall_enter_args *args)
>>>>>>  {
>>>>>> -  bool augmented, do_output = false;
>>>>>> -  int zero = 0, index, value_size = sizeof(struct augmented_arg) - offsetof(struct augmented_arg, value);
>>>>>> +  bool do_output = false;
>>>>>> +  int zero = 0, value_size = sizeof(struct augmented_arg) - offsetof(struct augmented_arg, value);
>>>>>>    u64 output = 0; /* has to be u64, otherwise it won't pass the verifier */
>>>>>> -  s64 aug_size, size;
>>>>>>    unsigned int nr, *beauty_map;
>>>>>>    struct beauty_payload_enter *payload;
>>>>>> -  void *arg, *payload_offset;
>>>>>> +  void *payload_offset;
>>>>>> +  long iters;
>>>>>>
>>>>>>    /* fall back to do predefined tail call */
>>>>>>    if (args == NULL)
>>>>>> @@ -457,63 +538,17 @@ static int augment_sys_enter(void *ctx, struct syscall_enter_args *args)
>>>>>>    /* copy the sys_enter header, which has the syscall_nr */
>>>>>>    __builtin_memcpy(&payload->args, args, sizeof(struct syscall_enter_args));
>>>>>>
>>>>>> -  /*
>>>>>> -   * Determine what type of argument and how many bytes to read from user space, using the
>>>>>> -   * value in the beauty_map. This is the relation of parameter type and its corresponding
>>>>>> -   * value in the beauty map, and how many bytes we read eventually:
>>>>>> -   *
>>>>>> -   * string: 1                          -> size of string
>>>>>> -   * struct: size of struct             -> size of struct
>>>>>> -   * buffer: -1 * (index of paired len) -> value of paired len (maximum: TRACE_AUG_MAX_BUF)
>>>>>> -   */
>>>>>> -  for (int i = 0; i < 6; i++) {
>>>>>> -          arg = (void *)args->args[i];
>>>>>> -          augmented = false;
>>>>>> -          size = beauty_map[i];
>>>>>> -          aug_size = size; /* size of the augmented data read from user space */
>>>>>> -
>>>>>> -          if (size == 0 || arg == NULL)
>>>>>> -                  continue;
>>>>>> -
>>>>>> -          if (size == 1) { /* string */
>>>>>> -                  aug_size = bpf_probe_read_user_str(((struct augmented_arg *)payload_offset)->value, value_size, arg);
>>>>>> -                  /* minimum of 0 to pass the verifier */
>>>>>> -                  if (aug_size < 0)
>>>>>> -                          aug_size = 0;
>>>>>> -
>>>>>> -                  augmented = true;
>>>>>> -          } else if (size > 0 && size <= value_size) { /* struct */
>>>>>> -                  if (!bpf_probe_read_user(((struct augmented_arg *)payload_offset)->value, size, arg))
>>>>>> -                          augmented = true;
>>>>>> -          } else if ((int)size < 0 && size >= -6) { /* buffer */
>>>>>> -                  index = -(size + 1);
>>>>>> -                  barrier_var(index); // Prevent clang (noticed with v18) from removing the &= 7 trick.
>>>>>> -                  index &= 7;         // Satisfy the bounds checking with the verifier in some kernels.
>>>>>> -                  aug_size = args->args[index] > TRACE_AUG_MAX_BUF ? TRACE_AUG_MAX_BUF : args->args[index];
>>>>>> -
>>>>>> -                  if (aug_size > 0) {
>>>>>> -                          if (!bpf_probe_read_user(((struct augmented_arg *)payload_offset)->value, aug_size, arg))
>>>>>> -                                  augmented = true;
>>>>>> -                  }
>>>>>> -          }
>>>>>> -
>>>>>> -          /* Augmented data size is limited to sizeof(augmented_arg->unnamed union with value field) */
>>>>>> -          if (aug_size > value_size)
>>>>>> -                  aug_size = value_size;
>>>>>> -
>>>>>> -          /* write data to payload */
>>>>>> -          if (augmented) {
>>>>>> -                  int written = offsetof(struct augmented_arg, value) + aug_size;
>>>>>> -
>>>>>> -                  if (written < 0 || written > sizeof(struct augmented_arg))
>>>>>> -                          return 1;
>>>>>> -
>>>>>> -                  ((struct augmented_arg *)payload_offset)->size = aug_size;
>>>>>> -                  output += written;
>>>>>> -                  payload_offset += written;
>>>>>> -                  do_output = true;
>>>>>> -          }
>>>>>> -  }
>>>>>> +  struct args_loop_ctx loop_ctx = {
>>>>>> +          .args = args,
>>>>>> +          .beauty_map = beauty_map,
>>>>>> +          .payload_offset = payload_offset,
>>>>>> +          .value_size = value_size,
>>>>>> +          .output = &output,
>>>>>> +          .do_output = &do_output
>>>>>> +  };
>>>>>> +  iters = bpf_loop(6, process_arg_cb, &loop_ctx, 0);
>>>>>
>>>>> bpf_loop() is old and generally not recommended.
>>>>> Please use bpf_for() then the diff will be one line change and
>>>>> can scale to any number of args. Not just 6.
>>>
>>> Thanks Alexei, I didn't know about this preference.
>>>
>>>> One thing we should take care is to support old kernels.  The oldest
>>>> LTS kernel in the kernel.org is 5.10 and bpf_loop() was introduced in
>>>> 5.17 and bpf_for (bpf_iter_num) was 6.4.
>>>
>>> The problematic loop was introduced in 6.12 by a68fd6a6cdd3 ("perf
>>> trace: Collect augmented data using BPF") so we should be good using
>>> bpf_for. Or is perf from 7.2 supposed to work on 5.10 LTS kernels?
>>>
>>> I'll refactor with bpf_for and will send v2.
>>
>> Or I won't. It turns out that just swapping the for loop for bpf_for
>> leads to -E2BIG from the verifier again. Looking at the verifier log, it
>> fails to find equivalence between states at the loop head:
>>
>>     [...]
>>     78: (85) call bpf_iter_num_next#84922 [...]
>> fp-56=map_value(map=beauty_payload_,ks=4,vs=24688,imm=112)
>>     [...]
>>     78: (85) call bpf_iter_num_next#84922 [...]
>> fp-56=map_value(map=beauty_payload_,ks=4,vs=24688,imm=120)
>>     [...]
>>
>> IMHO, the reason is that payload_offset, which points to the
>> beauty_payload_enter_map entry, gets updated in every iteration.
>>
>> This could be probably fixed on the perf side by reworking how augmented
>> args are stored but at this point, bpf_loop sounds like an easier and
>> more reliable approach.
>>
>> Let me know if anyone has objections, otherwise I'll send v2 of the
>> bpf_loop approach, with suggestions from Sashiko incorporated.
>>
> 
> I'd still try to adapt bpf_for(), it's a much better code structure.
> You probably need to add a bounding checking/confirming `if ()`
> condition validating that offset at which you access map_value is
> always correct. And/or you might need barrier_var() before using i,
> because bpf_for() macro does bounds checking (check the macro itself),
> but compiler often will reorder instructions leading to verifier
> complaints.

I gave it a try but wasn't successful so far. I think that the problem
is that while it would be possible to add an upper bound condition for
`payload_offset`, the verifier tracks the value of `payload_offset` too
precisely (as map_value(..., imm=X) with a concrete offset) and never
merges states with different offsets. And since there are multiple
branches inside the loop, each incrementing `payload_offset` by a
different value, the verifier seems to fork its state on each branch,
effectively leading to the amount of states growing exponentially and
hitting the jump limit.

To me, bpf_loop sounds like a more reliable choice in this situation.
It's also older, which is good in this case, since compatibility with
older kernels seems to be important for perf (see other messages in the
thread).

I'm also wondering if the verifier could be improved to handle these
cases but that's a different discussion.

> 
>> Thanks,
>> Viktor
>>
>>> It should be then
>>> backported to stable kernels down to 6.12 LTS.
>>>
>>> Viktor
>>>
>>>>
>>>> Maybe we can factor out the loop body and call it from different
>>>> mechanisms like open-coded loop, bpf_loop or bpf_for depending on the
>>>> kernel version.  But not sure it'd fix the verifier issue though.
>>>>
>>>> Thanks,
>>>> Namhyung
>>>>
>>
>>
> 


^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Jani Nikula @ 2026-06-25 11:00 UTC (permalink / raw)
  To: Kaitao Cheng, David Laight, Christian König,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <0ed6b5c3-e955-46e2-9fc6-075a0dfd1c4f@linux.dev>

On Thu, 25 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> 在 2026/6/24 22:23, David Laight 写道:
>> On Wed, 24 Jun 2026 15:23:47 +0200
>> Christian König <christian.koenig@amd.com> wrote:
>>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>>> 在 2026/6/22 16:42, David Laight 写道:  
>>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>>  
>>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>>
>>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>>> every call site, even though most users only need it for the iterator
>>>>>> implementation and never reference it in the loop body.
>>>>>>
>>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>>> a unique internal cursor.  
>>>>>
>>>>> I'm not really sure 'mutable' means anything either.
>>>>> It is possible to make it valid for the loop body (or even other threads)
>>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>>
>>>>> It might be worth doing something that doesn't need the extra variable,
>>>>> but there is little point doing all the churn just to rename things.
>>>>>  
>>>>>>
>>>>>> This makes call sites that only mutate the list through the current entry
>>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>>> compatibility.
>>>>>>
>>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>> ---
>>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>>> --- a/include/linux/list.h
>>>>>> +++ b/include/linux/list.h
>>>>>> @@ -7,6 +7,7 @@
>>>>>>  #include <linux/stddef.h>
>>>>>>  #include <linux/poison.h>
>>>>>>  #include <linux/const.h>
>>>>>> +#include <linux/args.h>
>>>>>>  
>>>>>>  #include <asm/barrier.h>
>>>>>>  
>>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>>  #define list_for_each_prev(pos, head) \
>>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>>  
>>>>>> -/**
>>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>> - * @head:	the head for your list.
>>>>>> +/*
>>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>>   */
>>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>>  	     !list_is_head(pos, (head)); \
>>>>>>  	     pos = n, n = pos->next)
>>>>>>  
>>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>>
>>>>> Use auto
>>>>>  
>>>>>> +	     !list_is_head(pos, (head));				\
>>>>>> +	     pos = tmp, tmp = pos->next)
>>>>>> +
>>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>>> +
>>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>>> +	list_for_each_safe(pos, next, head)
>>>>>> +
>>>>>>  /**
>>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>>> - * @head:	the head for your list.
>>>>>> + * @...:	either (head) or (next, head)
>>>>>> + *
>>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>>> + *		the caller.
>>>>>> + * head:	the head for your list.
>>>>>> + */
>>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>>> +		(pos, __VA_ARGS__)  
>>>>>
>>>>> The variable argument count logic really just slows down compilation.
>>>>> Maybe there aren't enough copies of this code to make that significant.
>>>>> But just because you can do it doesn't mean it is a gooD idea.
>>>>> I'm also not sure it really adds anything to the readability.
>>>>>
>>>>> And, it you are going to make the middle argument optional there is
>>>>> no need to change the macro name.  
>>>>
>>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>>> implementation approach. If we abandon that method, it means we will
>>>> inevitably need to add some new macros. If mutable is not a good name,
>>>> suggestions for better alternatives would be welcome; coming up with a
>>>> suitable name is indeed rather tricky.  
>>>
>>> I don't think you need to add a new macro for the specific use case that people want to modify the next element of the iteration.
>>>
>>> If I remember your numbers correctly that is a really corner case and keeping using the existing *_safe() macros for that sounds perfectly fine to me.
>> 
>> IIRC currently you have a choice of either:
>> 	define               Item that can't be deleted
>> 	list_for_each()	     The current item.
>> 	list_for_each_safe() The next item.
>> There is also likely to be code that updates the variables to allow
>> for other scenarios.
>> 
>> Note that if increase a reference count and release a lock then list_for_each()
>> is likely safer than list_for_each_safe() :-)
>> 
>> list.h has 9 variants of the 'safe' loop.
>> The bloat of another 9 is getting excessive.
>> 
>> It has to be said that this is one of my least favourite type of list...
>
> Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
> Andy Shevchenko, Alexei Starovoitov
>
> For ease of discussion, I need to summarize the currently possible
> approaches and briefly describe their respective pros and cons,
> using the list_for_each_entry* interfaces as examples.
>
> 1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
> and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
> would be used specifically for safe deletion scenarios that do not
> need to expose the temporary cursor externally. The code can refer to
> the v1 version.
>
> Pros: Does not depend on immediate per-subsystem adaptation and can be
>       merged directly.
> Cons: Requires adding a whole set of mutable interfaces, which makes the
>       code somewhat redundant.

Seems fine, and the original _safe naming is ambiguous anyway.

> 2. Directly optimize away the temporary cursor in list_for_each_entry_safe
> and define it inside the loop instead, changing the interface from four
> arguments to three.
>
> Pros: Does not add redundant interfaces.
> Cons: (1) Users need to manually update special cases that use the
>       traversal variable of list_for_each_entry_safe, the new
>       list_for_each_entry_safe would no longer apply there and would
>       need to be open-coded.
>       (2) Because the macro arguments changes, all list_for_each_entry_safe
>       callers would need to be modified and merged together, making it
>       difficult to merge such a large amount of code at once.

This won't fly because there are literally thousands of
list_for_each_entry_safe() users.

> 3. Use a variadic macro approach to optimize list_for_each_entry_safe,
> so that it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) Does not depend on immediate per-subsystem adaptation and can
>       be merged directly.
> Cons: (1) Increases compile time.
>       (2) Makes the interface harder for users to use.

Basically I'm against any variadic macro tricks where the optional
argument is not the last argument. That's just way too surprising, and
goes against common practice in just about all other languages.

> 4. Optimize list_for_each_entry by defining the temporary cursor internally,
> making it compatible with the functionality of list_for_each_entry_safe.
> The code can refer to the v2 version.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) The number of externally visible arguments of list_for_each_entry
>       remains unchanged, still three.
> Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
>       into one, and list_for_each_entry_safe would gradually be deprecated.
>       (2) Users need to manually update special cases that use the traversal
>       variable of list_for_each_entry, the new list_for_each_entry would no
>       longer apply there and would need to be open-coded. There are 15 such
>       cases in total.

This sounds good to me, though I take it there's some code size increase
and/or performance penalty?

Maybe the 15 cases are questionable anyway?

> 5. Use a variadic macro approach to optimize list_for_each_entry, so that
> it supports both three and four arguments.
>
> Pros: (1) Does not add redundant interfaces.
>       (2) Does not depend on immediate per-subsystem adaptation and can be
>       merged directly.
> Cons: (1) Increases compile time.
>       (2) list_for_each_entry and list_for_each_entry_safe would be merged
>       into one, and list_for_each_entry_safe would gradually be deprecated.

Please don't do the macro tricks.

> 6. Make no changes, keep the current logic unchanged, and close the current
> email discussion.

I like hiding the temporary stuff when possible.


BR,
Jani.

-- 
Jani Nikula, Intel

^ permalink raw reply

* Re: [PATCH] perf symbols: ignore livepatch symbols
From: Petr Mladek @ 2026-06-25  9:10 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-perf-users, linux-kernel, live-patching
In-Reply-To: <20260624201254.472576-1-joe.lawrence@redhat.com>

On Wed 2026-06-24 16:12:54, Joe Lawrence wrote:
> Livepatch modules contain special symbols (prefixed by ".klp.sym.") that
> act as relocation placeholders.  Once resolved, they point to the same
> addresses as the original kernel symbols they reference. [1]
> 
> These special symbols confuse the 'vmlinux symtab matches kallsyms' perf
> test as kallsyms may report multiple symbols sharing a single kernel
> address.  For example:
> 
>   kallsyms (without livepatch)
>   ----------------------------
>   ffffffff81a41110 T __pfx_arch_release_task_struct
> > ffffffff81a41120 T arch_release_task_struct
>   ffffffff81a41140 T __pfx_exit_thread
>   ffffffff81a41150 T exit_thread
> 
>   kallsyms (with livepatch loaded)
>   ---------------------------------
>   ffffffff81a41110 T __pfx_arch_release_task_struct
> > ffffffff81a41120 T arch_release_task_struct
>   ffffffff81a41140 T __pfx_exit_thread
>   ffffffff81a41150 T exit_thread
> > ffffffff81a41120 w .klp.sym.vmlinux.arch_release_task_struct,0  [kpatch_5_14_0_570_94_1_1_3]
> 
> When perf loads kallsyms, both symbols are inserted into the
> symbol table at the same address, corrupting symbol end-address
> calculations and causing test failures.
> 
> Filter out symbols prefixed with ".klp.sym." when loading kallsyms, as
> they alias existing kernel symbols.
> 
> Link: https://docs.kernel.org/livepatch/module-elf-format.html#livepatch-symbols [1]
> Reported-and-tested-by: Ben Procknow <bprockno@redhat.com>
> Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
> ---
>  tools/perf/util/symbol.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> >From Ben's tests:
> 
>   Without patch:
>   ./tools/perf/perf test 1
>    1: vmlinux symtab matches kallsyms                                 : FAILED!
> 
>   With patch:
>   ./tools/perf/perf test 1
>    1: vmlinux symtab matches kallsyms                                 : Ok
> 
> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index fcaeeddbbb6b..eeaed3f5cf50 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -774,6 +774,10 @@ static int map__process_kallsym_symbol(void *arg, const char *name,
>  	if (name[0] == '$')
>  		return 0;

It seems that the above checks has been moved into is_ignored_kernel_symbol(),
in the current Linus' tree, see
https://lore.kernel.org/lkml/20260522082604.89447-2-qirui.001@bytedance.com/

> +	/* Ignore livepatch symbols */
> +	if (strstarts(name, ".klp.sym."))
> +		return 0;

I guess that this check should be moved there as well.

Nit: It might be possible to include <linux/livepatch_external.h>
     and use KLP_SYM_PREFIX instead of ".klp.sym."

> +
>  	/*
>  	 * module symbols are not sorted so we add all
>  	 * symbols, setting length to 0, and rely on

Best Regards,
Petr

^ permalink raw reply

* [PATCH] perf/x86/intel/uncore: Fix reference leak in discover_upi_topology()
From: Wentao Liang @ 2026-06-25  7:53 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, tglx, bp, dave.hansen, x86
  Cc: mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	james.clark, hpa, linux-perf-users, linux-kernel, Wentao Liang,
	stable

In discover_upi_topology(), pci_get_domain_bus_and_slot() returns a PCI
device with its reference count incremented. The caller must call
pci_dev_put() after use.

However, the inner loop overwrites dev without releasing the previous
reference, causing leaks:
  - Between inner loop iterations within the same outer loop iteration.
  - Between outer loop iterations (dev from a previous ubox's inner
    loop is overwritten at the start of the next inner loop).
  - On the normal exit path from the while loop (the last dev is never
    put before falling through to err:).

Fix by calling pci_dev_put(dev) and clearing dev after upi_fill_topology()
succeeds, so each reference is released immediately after use. The error
path (goto err) already calls pci_dev_put(dev) and remains correct since
dev is set to NULL after release, making the subsequent put a no-op.

The similar sad_cfg_iio_topology() function does not have this problem
because it uses a single pci_get_device() loop and releases the last
reference correctly in all exit paths.

Cc: stable@vger.kernel.org
Fixes: fdd041028f22 ("perf/x86/intel/uncore: Factor out topology_gidnid_map()")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
---
 arch/x86/events/intel/uncore_snbep.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/events/intel/uncore_snbep.c b/arch/x86/events/intel/uncore_snbep.c
index 215d33e260ed..1561bda43835 100644
--- a/arch/x86/events/intel/uncore_snbep.c
+++ b/arch/x86/events/intel/uncore_snbep.c
@@ -5499,6 +5499,8 @@ static int discover_upi_topology(struct intel_uncore_type *type, int ubox_did, i
 							  devfn);
 			if (dev) {
 				ret = upi_fill_topology(dev, upi, idx);
+				pci_dev_put(dev);
+				dev = NULL;
 				if (ret)
 					goto err;
 			}
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-25  3:01 UTC (permalink / raw)
  To: David Laight, Christian König, Jani Nikula,
	David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, David Howells, Simona Vetter,
	Randy Dunlap, Luca Ceresoli, Philipp Stanner, linux-block,
	linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel, io-uring,
	audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng, Muchun Song
In-Reply-To: <20260624152324.3def88ce@pumpkin>

在 2026/6/24 22:23, David Laight 写道:
> On Wed, 24 Jun 2026 15:23:47 +0200
> Christian König <christian.koenig@amd.com> wrote:
>> On 6/24/26 15:14, Kaitao Cheng wrote:
>>> 在 2026/6/22 16:42, David Laight 写道:  
>>>> On Mon, 22 Jun 2026 12:05:31 +0800
>>>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>>  
>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>
>>>>> The list_for_each*_safe() helpers are used when the loop body may
>>>>> remove the current entry.  Their API exposes the temporary cursor at
>>>>> every call site, even though most users only need it for the iterator
>>>>> implementation and never reference it in the loop body.
>>>>>
>>>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>>>> support both forms: callers may keep passing an explicit temporary cursor
>>>>> when they need to inspect or reset it, or omit it and let the helper use
>>>>> a unique internal cursor.  
>>>>
>>>> I'm not really sure 'mutable' means anything either.
>>>> It is possible to make it valid for the loop body (or even other threads)
>>>> to delete arbitrary list items - but that needs significant extra overheads.
>>>>
>>>> It might be worth doing something that doesn't need the extra variable,
>>>> but there is little point doing all the churn just to rename things.
>>>>  
>>>>>
>>>>> This makes call sites that only mutate the list through the current entry
>>>>> less noisy, while keeping the existing *_safe() helpers available for
>>>>> compatibility.
>>>>>
>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>> ---
>>>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>>>> index 09d979976b3b..1081def7cea9 100644
>>>>> --- a/include/linux/list.h
>>>>> +++ b/include/linux/list.h
>>>>> @@ -7,6 +7,7 @@
>>>>>  #include <linux/stddef.h>
>>>>>  #include <linux/poison.h>
>>>>>  #include <linux/const.h>
>>>>> +#include <linux/args.h>
>>>>>  
>>>>>  #include <asm/barrier.h>
>>>>>  
>>>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>>>  #define list_for_each_prev(pos, head) \
>>>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>>>  
>>>>> -/**
>>>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> +/*
>>>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>>>   */
>>>>>  #define list_for_each_safe(pos, n, head) \
>>>>>  	for (pos = (head)->next, n = pos->next; \
>>>>>  	     !list_is_head(pos, (head)); \
>>>>>  	     pos = n, n = pos->next)
>>>>>  
>>>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
>>>>
>>>> Use auto
>>>>  
>>>>> +	     !list_is_head(pos, (head));				\
>>>>> +	     pos = tmp, tmp = pos->next)
>>>>> +
>>>>> +#define __list_for_each_mutable1(pos, head)				\
>>>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>>>> +
>>>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>>>> +	list_for_each_safe(pos, next, head)
>>>>> +
>>>>>  /**
>>>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>>>> - * @n:		another &struct list_head to use as temporary storage
>>>>> - * @head:	the head for your list.
>>>>> + * @...:	either (head) or (next, head)
>>>>> + *
>>>>> + * next:	another &struct list_head to use as optional temporary storage.
>>>>> + *		The temporary cursor is internal unless explicitly supplied by
>>>>> + *		the caller.
>>>>> + * head:	the head for your list.
>>>>> + */
>>>>> +#define list_for_each_mutable(pos, ...)					\
>>>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>>>> +		(pos, __VA_ARGS__)  
>>>>
>>>> The variable argument count logic really just slows down compilation.
>>>> Maybe there aren't enough copies of this code to make that significant.
>>>> But just because you can do it doesn't mean it is a gooD idea.
>>>> I'm also not sure it really adds anything to the readability.
>>>>
>>>> And, it you are going to make the middle argument optional there is
>>>> no need to change the macro name.  
>>>
>>> Christian König and Jani Nikula also disagree with the variadic-argument
>>> implementation approach. If we abandon that method, it means we will
>>> inevitably need to add some new macros. If mutable is not a good name,
>>> suggestions for better alternatives would be welcome; coming up with a
>>> suitable name is indeed rather tricky.  
>>
>> I don't think you need to add a new macro for the specific use case that people want to modify the next element of the iteration.
>>
>> If I remember your numbers correctly that is a really corner case and keeping using the existing *_safe() macros for that sounds perfectly fine to me.
> 
> IIRC currently you have a choice of either:
> 	define               Item that can't be deleted
> 	list_for_each()	     The current item.
> 	list_for_each_safe() The next item.
> There is also likely to be code that updates the variables to allow
> for other scenarios.
> 
> Note that if increase a reference count and release a lock then list_for_each()
> is likely safer than list_for_each_safe() :-)
> 
> list.h has 9 variants of the 'safe' loop.
> The bloat of another 9 is getting excessive.
> 
> It has to be said that this is one of my least favourite type of list...

Hi Christian König, David Laight, Jani Nikula, David Hildenbrand,
Andy Shevchenko, Alexei Starovoitov

For ease of discussion, I need to summarize the currently possible
approaches and briefly describe their respective pros and cons,
using the list_for_each_entry* interfaces as examples.

1. Add list_for_each_entry_mutable, while keeping list_for_each_entry
and list_for_each_entry_safe unchanged. list_for_each_entry_mutable
would be used specifically for safe deletion scenarios that do not
need to expose the temporary cursor externally. The code can refer to
the v1 version.

Pros: Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: Requires adding a whole set of mutable interfaces, which makes the
      code somewhat redundant.

2. Directly optimize away the temporary cursor in list_for_each_entry_safe
and define it inside the loop instead, changing the interface from four
arguments to three.

Pros: Does not add redundant interfaces.
Cons: (1) Users need to manually update special cases that use the
      traversal variable of list_for_each_entry_safe, the new
      list_for_each_entry_safe would no longer apply there and would
      need to be open-coded.
      (2) Because the macro arguments changes, all list_for_each_entry_safe
      callers would need to be modified and merged together, making it
      difficult to merge such a large amount of code at once.

3. Use a variadic macro approach to optimize list_for_each_entry_safe,
so that it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can
      be merged directly.
Cons: (1) Increases compile time.
      (2) Makes the interface harder for users to use.

4. Optimize list_for_each_entry by defining the temporary cursor internally,
making it compatible with the functionality of list_for_each_entry_safe.
The code can refer to the v2 version.

Pros: (1) Does not add redundant interfaces.
      (2) The number of externally visible arguments of list_for_each_entry
      remains unchanged, still three.
Cons: (1) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.
      (2) Users need to manually update special cases that use the traversal
      variable of list_for_each_entry, the new list_for_each_entry would no
      longer apply there and would need to be open-coded. There are 15 such
      cases in total.

5. Use a variadic macro approach to optimize list_for_each_entry, so that
it supports both three and four arguments.

Pros: (1) Does not add redundant interfaces.
      (2) Does not depend on immediate per-subsystem adaptation and can be
      merged directly.
Cons: (1) Increases compile time.
      (2) list_for_each_entry and list_for_each_entry_safe would be merged
      into one, and list_for_each_entry_safe would gradually be deprecated.

6. Make no changes, keep the current logic unchanged, and close the current
email discussion.


Which of the six solutions above do people prefer?

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* Re: [PATCH v3 6/8] perf timechart: Fix memory leaks during record
From: Namhyung Kim @ 2026-06-24 23:49 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-perf-users
In-Reply-To: <20260624210644.F1FEE1F000E9@smtp.kernel.org>

On Wed, Jun 24, 2026 at 09:06:44PM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] Missing explicit POSIX header inclusion for `free()` (musl compatibility guide violation)
> --
> 
> commit e9f4fbcd6ed720db9d6b0daaf8a565c9bbbd415e
> Author: Namhyung Kim <namhyung@kernel.org>
> 
> perf timechart: Fix memory leaks during record
> 
> The command line options can be passed without copy as their lifetime is
> limited within the same function.
> 
> > diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
> > index 27d17268395ed..a73f0ab22fd64 100644
> > --- a/tools/perf/builtin-timechart.c
> > +++ b/tools/perf/builtin-timechart.c
> [ ... ]
> > @@ -1891,7 +1892,11 @@ static int timechart__io_record(int argc, const char **argv, const char *output_
> >  	for (i = 0; i < (unsigned int)argc; i++)
> >  		*p++ = argv[i];
> >  
> > -	return cmd_record(rec_argc, rec_argv);
> > +	ret = cmd_record(rec_argc, rec_argv);
> > +
> > +	free(rec_argv);
> > +	free(filter);
> 
> [Severity: Medium]
> Does this code need an explicit inclusion of <stdlib.h> at the top of the
> file?
> 
> The perf subsystem musl compatibility guidelines mention that files using
> POSIX libc functions must explicitly include the relevant headers, as musl
> strictly separates declarations unlike glibc. Could these new free() calls
> cause a build failure on systems using musl libc if the header is omitted?

While it may be a valid concern, the free() function is already used in
many places of the file.  And I didn't get any reports about musl on
this yet.  So I think it's good to go.

Thanks,
Namhyung


^ permalink raw reply

* Re: [PATCH v3 6/8] perf timechart: Fix memory leaks during record
From: sashiko-bot @ 2026-06-24 21:06 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: linux-perf-users
In-Reply-To: <20260624205852.3864874-7-namhyung@kernel.org>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Missing explicit POSIX header inclusion for `free()` (musl compatibility guide violation)
--

commit e9f4fbcd6ed720db9d6b0daaf8a565c9bbbd415e
Author: Namhyung Kim <namhyung@kernel.org>

perf timechart: Fix memory leaks during record

The command line options can be passed without copy as their lifetime is
limited within the same function.

> diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
> index 27d17268395ed..a73f0ab22fd64 100644
> --- a/tools/perf/builtin-timechart.c
> +++ b/tools/perf/builtin-timechart.c
[ ... ]
> @@ -1891,7 +1892,11 @@ static int timechart__io_record(int argc, const char **argv, const char *output_
>  	for (i = 0; i < (unsigned int)argc; i++)
>  		*p++ = argv[i];
>  
> -	return cmd_record(rec_argc, rec_argv);
> +	ret = cmd_record(rec_argc, rec_argv);
> +
> +	free(rec_argv);
> +	free(filter);

[Severity: Medium]
Does this code need an explicit inclusion of <stdlib.h> at the top of the
file?

The perf subsystem musl compatibility guidelines mention that files using
POSIX libc functions must explicitly include the relevant headers, as musl
strictly separates declarations unlike glibc. Could these new free() calls
cause a build failure on systems using musl libc if the header is omitted?

> +	return ret;
>  }
>  
>  
[ ... ]
> @@ -1966,24 +1972,27 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
[ ... ]
>  	for (j = 0; j < (unsigned int)argc; j++)
>  		*p++ = argv[j];
>  
> -	return cmd_record(rec_argc, rec_argv);
> +	ret = cmd_record(rec_argc, rec_argv);
> +
> +	free(rec_argv);

[Severity: Medium]
Is it possible this free() call also requires an explicit <stdlib.h>
inclusion to ensure it builds correctly against musl libc?

> +	return ret;
>  }
>  
>  static int

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624205852.3864874-1-namhyung@kernel.org?part=6

^ permalink raw reply

* [PATCH v3 8/8] perf test: Update perf timechart test
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users
In-Reply-To: <20260624205852.3864874-1-namhyung@kernel.org>

To include IO-only and backtrace modes to test different code paths.

  $ sudo perf test -vv timechart
  135: perf timechart tests                        : Running
  135: perf timechart tests:
  ---- start ----
  test child forked, pid 2413665
  perf timechart Basic test
  perf timechart Basic test [Success]
  perf timechart IO-only test
  perf timechart IO-only test [Success]
  perf timechart Backtrace test
  perf timechart Backtrace test [Success]
  ---- end(0) ----
  135: perf timechart tests                        : Ok

  === Test Summary ===
  Passed main tests : 1
  Passed subtests   : 0
  Skipped tests     : 0
  Failed tests      : 0

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/tests/shell/timechart.sh | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/tools/perf/tests/shell/timechart.sh b/tools/perf/tests/shell/timechart.sh
index b14b3472c2846b32..40072b32171a9a1c 100755
--- a/tools/perf/tests/shell/timechart.sh
+++ b/tools/perf/tests/shell/timechart.sh
@@ -9,7 +9,7 @@ perfdata=$(mktemp /tmp/__perf_timechart_test.perf.data.XXXXX)
 output=$(mktemp /tmp/__perf_timechart_test.output.XXXXX.svg)
 
 cleanup() {
-	rm -f "${perfdata}"
+	rm -f "${perfdata}"*
 	rm -f "${output}"
 	trap - EXIT TERM INT
 }
@@ -22,38 +22,41 @@ trap_cleanup() {
 trap trap_cleanup EXIT TERM INT
 
 test_timechart() {
-	echo "Basic perf timechart test"
+	NAME=$1
+	OPTION=$2
+
+	echo "perf timechart ${NAME} test"
 
 	# Try to record timechart data.
 	# perf timechart record uses system-wide recording and specific tracepoints.
 	# If it fails (e.g. permissions, missing tracepoints), skip the test.
-	if ! perf timechart record -o "${perfdata}" true > /dev/null 2>&1; then
-		echo "Basic perf timechart test [Skipped: perf timechart record failed (permissions/events?)]"
+	if ! perf timechart record -o "${perfdata}" ${OPTION} true > /dev/null 2>&1; then
+		echo "perf timechart ${NAME} test [Skipped: perf timechart record failed (permissions/events?)]"
 		return
 	fi
 
 	# Generate the timechart
 	if ! perf timechart -i "${perfdata}" -o "${output}" > /dev/null 2>&1; then
-		echo "Basic perf timechart test [Failed: perf timechart command failed]"
+		echo "perf timechart ${NAME} test [Failed: perf timechart command failed]"
 		err=1
 		return
 	fi
 
 	# Check if output file exists and is not empty
 	if [ ! -s "${output}" ]; then
-		echo "Basic perf timechart test [Failed: output file is empty or missing]"
+		echo "perf timechart ${NAME} test [Failed: output file is empty or missing]"
 		err=1
 		return
 	fi
 
 	# Check if it looks like an SVG
 	if ! grep -q "svg" "${output}"; then
-		echo "Basic perf timechart test [Failed: output doesn't look like SVG]"
+		echo "perf timechart ${NAME} test [Failed: output doesn't look like SVG]"
 		err=1
 		return
 	fi
 
-	echo "Basic perf timechart test [Success]"
+	echo "perf timechart ${NAME} test [Success]"
 }
 
 if ! perf check feature -q libtraceevent ; then
@@ -62,6 +65,9 @@ if ! perf check feature -q libtraceevent ; then
 	exit 2
 fi
 
-test_timechart
+test_timechart "Basic" ""
+test_timechart "IO-only" "-I"
+test_timechart "Backtrace" "-g"
+
 cleanup
 exit $err
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v3 6/8] perf timechart: Fix memory leaks during record
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users, Arnaldo Carvalho de Melo
In-Reply-To: <20260624205852.3864874-1-namhyung@kernel.org>

The command line options can be passed without copy as their lifetime is
limited within the same function.

Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-timechart.c | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 27d17268395ed760..a73f0ab22fd64241 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -1834,6 +1834,7 @@ static int timechart__io_record(int argc, const char **argv, const char *output_
 		"syscalls:sys_exit_select",
 	};
 	unsigned int poll_events_nr = ARRAY_SIZE(poll_events);
+	int ret;
 
 	rec_argc = common_args_nr +
 		disk_events_nr * 4 +
@@ -1852,7 +1853,7 @@ static int timechart__io_record(int argc, const char **argv, const char *output_
 
 	p = rec_argv;
 	for (i = 0; i < common_args_nr; i++)
-		*p++ = strdup(common_args[i]);
+		*p++ = common_args[i];
 
 	for (i = 0; i < disk_events_nr; i++) {
 		if (!is_valid_tracepoint(disk_events[i])) {
@@ -1861,7 +1862,7 @@ static int timechart__io_record(int argc, const char **argv, const char *output_
 		}
 
 		*p++ = "-e";
-		*p++ = strdup(disk_events[i]);
+		*p++ = disk_events[i];
 		*p++ = "--filter";
 		*p++ = filter;
 	}
@@ -1872,7 +1873,7 @@ static int timechart__io_record(int argc, const char **argv, const char *output_
 		}
 
 		*p++ = "-e";
-		*p++ = strdup(net_events[i]);
+		*p++ = net_events[i];
 		*p++ = "--filter";
 		*p++ = filter;
 	}
@@ -1883,7 +1884,7 @@ static int timechart__io_record(int argc, const char **argv, const char *output_
 		}
 
 		*p++ = "-e";
-		*p++ = strdup(poll_events[i]);
+		*p++ = poll_events[i];
 		*p++ = "--filter";
 		*p++ = filter;
 	}
@@ -1891,7 +1892,11 @@ static int timechart__io_record(int argc, const char **argv, const char *output_
 	for (i = 0; i < (unsigned int)argc; i++)
 		*p++ = argv[i];
 
-	return cmd_record(rec_argc, rec_argv);
+	ret = cmd_record(rec_argc, rec_argv);
+
+	free(rec_argv);
+	free(filter);
+	return ret;
 }
 
 
@@ -1902,6 +1907,7 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
 	const char **rec_argv;
 	const char **p;
 	unsigned int record_elems;
+	int ret;
 
 	const char * const common_args[] = {
 		"record", "-a", "-R", "-c", "1", "-o", output_data,
@@ -1966,24 +1972,27 @@ static int timechart__record(struct timechart *tchart, int argc, const char **ar
 
 	p = rec_argv;
 	for (i = 0; i < common_args_nr; i++)
-		*p++ = strdup(common_args[i]);
+		*p++ = common_args[i];
 
 	for (i = 0; i < backtrace_args_no; i++)
-		*p++ = strdup(backtrace_args[i]);
+		*p++ = backtrace_args[i];
 
 	for (i = 0; i < tasks_args_nr; i++)
-		*p++ = strdup(tasks_args[i]);
+		*p++ = tasks_args[i];
 
 	for (i = 0; i < power_args_nr; i++)
-		*p++ = strdup(power_args[i]);
+		*p++ = power_args[i];
 
 	for (i = 0; i < old_power_args_nr; i++)
-		*p++ = strdup(old_power_args[i]);
+		*p++ = old_power_args[i];
 
 	for (j = 0; j < (unsigned int)argc; j++)
 		*p++ = argv[j];
 
-	return cmd_record(rec_argc, rec_argv);
+	ret = cmd_record(rec_argc, rec_argv);
+
+	free(rec_argv);
+	return ret;
 }
 
 static int
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v3 7/8] perf timechart: Fix memory leaks in draw_wakeups()
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users
In-Reply-To: <20260624205852.3864874-1-namhyung@kernel.org>

The second loop for per_pidcommd is meaningful only when it doesn't have
from and to tasks.  Also make sure c->Y is set before copying the comm
string otherwise it will be overwritten by next one.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-timechart.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index a73f0ab22fd64241..3f9153d5ecfb79b0 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -1110,12 +1110,12 @@ static void draw_wakeups(struct timechart *tchart)
 					c = c->next;
 				}
 				c = p->all;
-				while (c) {
-					if (p->pid == we->waker && !from) {
+				while (c && (!from || !to)) {
+					if (c->Y && p->pid == we->waker && !from) {
 						from = c->Y;
 						task_from = strdup(c->comm);
 					}
-					if (p->pid == we->wakee && !to) {
+					if (c->Y && p->pid == we->wakee && !to) {
 						to = c->Y;
 						task_to = strdup(c->comm);
 					}
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v3 5/8] perf timechart: Release event samples at the end
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users
In-Reply-To: <20260624205852.3864874-1-namhyung@kernel.org>

Add timechart__release() to free all data structures added during the
sample processing.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-timechart.c | 68 ++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 04dbb944a42720b4..27d17268395ed760 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -1548,6 +1548,73 @@ static void write_svg_file(struct timechart *tchart, const char *filename)
 	svg_close();
 }
 
+static void timechart__release(struct timechart *tchart)
+{
+	struct per_pid *p = tchart->all_data;
+	struct power_event *pwr = tchart->power_events;
+	struct wake_event *we = tchart->wake_events;
+
+	while (p) {
+		struct per_pid *next_pid = p->next;
+		struct per_pidcomm *c = p->all;
+
+		while (c) {
+			struct per_pidcomm *next_comm = c->next;
+			struct cpu_sample *cs = c->samples;
+			struct io_sample *ios = c->io_samples;
+
+			while (cs) {
+				struct cpu_sample *next = cs->next;
+
+				zfree(&cs->backtrace);
+				cs->next = NULL;
+				free(cs);
+
+				cs = next;
+			}
+
+			while (ios) {
+				struct io_sample *next = ios->next;
+
+				ios->next = NULL;
+				free(ios);
+
+				ios = next;
+			}
+
+			zfree(&c->comm);
+			c->next = NULL;
+			free(c);
+
+			c = next_comm;
+		}
+
+		p->next = NULL;
+		free(p);
+
+		p = next_pid;
+	}
+
+	while (pwr) {
+		struct power_event *next = pwr->next;
+
+		pwr->next = NULL;
+		free(pwr);
+
+		pwr = next;
+	}
+
+	while (we) {
+		struct wake_event *next = we->next;
+
+		zfree(&we->backtrace);
+		we->next = NULL;
+		free(we);
+
+		we = next;
+	}
+}
+
 static int process_header(struct perf_file_section *section __maybe_unused,
 			  struct perf_header *ph,
 			  int feat,
@@ -2079,6 +2146,7 @@ int cmd_timechart(int argc, const char **argv)
 
 	ret = __cmd_timechart(&tchart, output_name);
 out:
+	timechart__release(&tchart);
 	zfree(&cpus_cstate_start_times);
 	zfree(&cpus_cstate_state);
 	zfree(&cpus_pstate_start_times);
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v3 4/8] perf timechart: Remove unnecessary copy of backtrace
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users
In-Reply-To: <20260624205852.3864874-1-namhyung@kernel.org>

The pattern of strdup() and free() is found, and I think it just can
use the original backtrace directly.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-timechart.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 4efac73a714c5e5f..04dbb944a42720b4 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -300,7 +300,7 @@ static void pid_put_sample(struct timechart *tchart, int pid, int type,
 	sample->type = type;
 	sample->next = c->samples;
 	sample->cpu = cpu;
-	sample->backtrace = backtrace ? strdup(backtrace) : NULL;
+	sample->backtrace = backtrace;
 	c->samples = sample;
 
 	if (sample->type == TYPE_RUNNING && end > start && start > 0) {
@@ -429,12 +429,14 @@ static void sched_wakeup(struct timechart *tchart, int cpu, u64 timestamp,
 	struct per_pid *p;
 	struct wake_event *we = zalloc(sizeof(*we));
 
-	if (!we)
+	if (!we) {
+		free((char *)backtrace);
 		return;
+	}
 
 	we->time = timestamp;
 	we->waker = waker;
-	we->backtrace = backtrace ? strdup(backtrace) : NULL;
+	we->backtrace = backtrace;
 
 	if ((flags & TRACE_FLAG_HARDIRQ) || (flags & TRACE_FLAG_SOFTIRQ))
 		we->waker = -1;
@@ -461,20 +463,28 @@ static void sched_switch(struct timechart *tchart, int cpu, u64 timestamp,
 			 const char *backtrace)
 {
 	struct per_pid *p = NULL, *prev_p;
+	bool backtrace_used = false;
 
 	prev_p = find_create_pid(tchart, prev_pid);
 
 	p = find_create_pid(tchart, next_pid);
 
-	if (prev_p->current && prev_p->current->state != TYPE_NONE)
+	if (prev_p->current && prev_p->current->state != TYPE_NONE) {
 		pid_put_sample(tchart, prev_pid, TYPE_RUNNING, cpu,
 			       prev_p->current->state_since, timestamp,
 			       backtrace);
+		backtrace_used = true;
+	}
 	if (p && p->current) {
-		if (p->current->state != TYPE_NONE)
+		if (p->current->state != TYPE_NONE) {
+			if (backtrace && backtrace_used)
+				backtrace = strdup(backtrace);
+
 			pid_put_sample(tchart, next_pid, p->current->state, cpu,
 				       p->current->state_since, timestamp,
 				       backtrace);
+			backtrace_used = true;
+		}
 
 		p->current->state_since = timestamp;
 		p->current->state = TYPE_RUNNING;
@@ -488,6 +498,9 @@ static void sched_switch(struct timechart *tchart, int cpu, u64 timestamp,
 		if (prev_state == 0)
 			prev_p->current->state = TYPE_WAITING;
 	}
+
+	if (!backtrace_used)
+		free((char *)backtrace);
 }
 
 /*
@@ -655,7 +668,6 @@ process_sample_sched_wakeup(struct timechart *tchart,
 
 	backtrace = cat_backtrace(sample, &tchart->session->machines.host);
 	sched_wakeup(tchart, sample->cpu, sample->time, waker, wakee, flags, backtrace);
-	free((char *)backtrace);
 	return 0;
 }
 
@@ -678,7 +690,6 @@ process_sample_sched_switch(struct timechart *tchart,
 	backtrace = cat_backtrace(sample, &tchart->session->machines.host);
 	sched_switch(tchart, sample->cpu, sample->time, prev_pid, next_pid,
 		     prev_state, backtrace);
-	free((char *)backtrace);
 	return 0;
 }
 
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v3 3/8] perf timechart: Remove unused backtrace in trace_handler
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users
In-Reply-To: <20260624205852.3864874-1-namhyung@kernel.org>

Now the backtrace argument is not used in any handler.  Let's get rid of
it.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-timechart.c | 64 ++++++++++++----------------------
 1 file changed, 23 insertions(+), 41 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 337ae547d008ea0b..4efac73a714c5e5f 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -573,8 +573,7 @@ static char *cat_backtrace(struct perf_sample *sample,
 }
 
 typedef int (*tracepoint_handler)(struct timechart *tchart,
-				  struct perf_sample *sample,
-				  const char *backtrace);
+				  struct perf_sample *sample);
 
 static int process_sample_event(const struct perf_tool *tool,
 				union perf_event *event __maybe_unused,
@@ -595,7 +594,7 @@ static int process_sample_event(const struct perf_tool *tool,
 	if (evsel->handler != NULL) {
 		tracepoint_handler f = evsel->handler;
 
-		ret = f(tchart, sample, NULL);
+		ret = f(tchart, sample);
 	}
 
 	return ret;
@@ -603,8 +602,7 @@ static int process_sample_event(const struct perf_tool *tool,
 
 static int
 process_sample_cpu_idle(struct timechart *tchart __maybe_unused,
-			struct perf_sample *sample,
-			const char *backtrace __maybe_unused)
+			struct perf_sample *sample)
 {
 	u32 state  = perf_sample__intval(sample, "state");
 	u32 cpu_id = perf_sample__intval(sample, "cpu_id");
@@ -624,8 +622,7 @@ process_sample_cpu_idle(struct timechart *tchart __maybe_unused,
 
 static int
 process_sample_cpu_frequency(struct timechart *tchart,
-			     struct perf_sample *sample,
-			     const char *backtrace __maybe_unused)
+			     struct perf_sample *sample)
 {
 	u32 state  = perf_sample__intval(sample, "state");
 	u32 cpu_id = perf_sample__intval(sample, "cpu_id");
@@ -642,12 +639,12 @@ process_sample_cpu_frequency(struct timechart *tchart,
 
 static int
 process_sample_sched_wakeup(struct timechart *tchart,
-			    struct perf_sample *sample,
-			    const char *backtrace)
+			    struct perf_sample *sample)
 {
 	u8 flags  = perf_sample__intval(sample, "common_flags");
 	int waker = perf_sample__intval(sample, "common_pid");
 	int wakee = perf_sample__intval(sample, "pid");
+	char *backtrace;
 
 	/* perf.data is untrusted input — CPU may be absent or corrupted */
 	if (sample->cpu >= MAX_CPUS) {
@@ -664,12 +661,12 @@ process_sample_sched_wakeup(struct timechart *tchart,
 
 static int
 process_sample_sched_switch(struct timechart *tchart,
-			    struct perf_sample *sample,
-			    const char *backtrace)
+			    struct perf_sample *sample)
 {
 	int prev_pid   = perf_sample__intval(sample, "prev_pid");
 	int next_pid   = perf_sample__intval(sample, "next_pid");
 	u64 prev_state = perf_sample__intval(sample, "prev_state");
+	char *backtrace;
 
 	/* perf.data is untrusted input — CPU may be absent or corrupted */
 	if (sample->cpu >= MAX_CPUS) {
@@ -688,8 +685,7 @@ process_sample_sched_switch(struct timechart *tchart,
 #ifdef SUPPORT_OLD_POWER_EVENTS
 static int
 process_sample_power_start(struct timechart *tchart __maybe_unused,
-			   struct perf_sample *sample,
-			   const char *backtrace __maybe_unused)
+			   struct perf_sample *sample)
 {
 	u64 cpu_id = perf_sample__intval(sample, "cpu_id");
 	u64 value  = perf_sample__intval(sample, "value");
@@ -706,8 +702,7 @@ process_sample_power_start(struct timechart *tchart __maybe_unused,
 
 static int
 process_sample_power_end(struct timechart *tchart,
-			 struct perf_sample *sample,
-			 const char *backtrace __maybe_unused)
+			 struct perf_sample *sample)
 {
 	/* perf.data is untrusted input — CPU may be absent or corrupted */
 	if (sample->cpu >= MAX_CPUS) {
@@ -721,8 +716,7 @@ process_sample_power_end(struct timechart *tchart,
 
 static int
 process_sample_power_frequency(struct timechart *tchart,
-			       struct perf_sample *sample,
-			       const char *backtrace __maybe_unused)
+			       struct perf_sample *sample)
 {
 	u64 cpu_id = perf_sample__intval(sample, "cpu_id");
 	u64 value  = perf_sample__intval(sample, "value");
@@ -895,8 +889,7 @@ static int pid_end_io_sample(struct timechart *tchart, int pid, int type,
 
 static int
 process_enter_read(struct timechart *tchart,
-		   struct perf_sample *sample,
-		   const char *backtrace __maybe_unused)
+		   struct perf_sample *sample)
 {
 	long fd = perf_sample__intval(sample, "fd");
 	return pid_begin_io_sample(tchart, sample->tid, IOTYPE_READ,
@@ -905,8 +898,7 @@ process_enter_read(struct timechart *tchart,
 
 static int
 process_exit_read(struct timechart *tchart,
-		  struct perf_sample *sample,
-		  const char *backtrace __maybe_unused)
+		  struct perf_sample *sample)
 {
 	long ret = perf_sample__intval(sample, "ret");
 	return pid_end_io_sample(tchart, sample->tid, IOTYPE_READ,
@@ -915,8 +907,7 @@ process_exit_read(struct timechart *tchart,
 
 static int
 process_enter_write(struct timechart *tchart,
-		    struct perf_sample *sample,
-		    const char *backtrace __maybe_unused)
+		    struct perf_sample *sample)
 {
 	long fd = perf_sample__intval(sample, "fd");
 	return pid_begin_io_sample(tchart, sample->tid, IOTYPE_WRITE,
@@ -925,8 +916,7 @@ process_enter_write(struct timechart *tchart,
 
 static int
 process_exit_write(struct timechart *tchart,
-		   struct perf_sample *sample,
-		   const char *backtrace __maybe_unused)
+		   struct perf_sample *sample)
 {
 	long ret = perf_sample__intval(sample, "ret");
 	return pid_end_io_sample(tchart, sample->tid, IOTYPE_WRITE,
@@ -935,8 +925,7 @@ process_exit_write(struct timechart *tchart,
 
 static int
 process_enter_sync(struct timechart *tchart,
-		   struct perf_sample *sample,
-		   const char *backtrace __maybe_unused)
+		   struct perf_sample *sample)
 {
 	long fd = perf_sample__intval(sample, "fd");
 	return pid_begin_io_sample(tchart, sample->tid, IOTYPE_SYNC,
@@ -945,8 +934,7 @@ process_enter_sync(struct timechart *tchart,
 
 static int
 process_exit_sync(struct timechart *tchart,
-		  struct perf_sample *sample,
-		  const char *backtrace __maybe_unused)
+		  struct perf_sample *sample)
 {
 	long ret = perf_sample__intval(sample, "ret");
 	return pid_end_io_sample(tchart, sample->tid, IOTYPE_SYNC,
@@ -955,8 +943,7 @@ process_exit_sync(struct timechart *tchart,
 
 static int
 process_enter_tx(struct timechart *tchart,
-		 struct perf_sample *sample,
-		 const char *backtrace __maybe_unused)
+		 struct perf_sample *sample)
 {
 	long fd = perf_sample__intval(sample, "fd");
 	return pid_begin_io_sample(tchart, sample->tid, IOTYPE_TX,
@@ -965,8 +952,7 @@ process_enter_tx(struct timechart *tchart,
 
 static int
 process_exit_tx(struct timechart *tchart,
-		struct perf_sample *sample,
-		const char *backtrace __maybe_unused)
+		struct perf_sample *sample)
 {
 	long ret = perf_sample__intval(sample, "ret");
 	return pid_end_io_sample(tchart, sample->tid, IOTYPE_TX,
@@ -975,8 +961,7 @@ process_exit_tx(struct timechart *tchart,
 
 static int
 process_enter_rx(struct timechart *tchart,
-		 struct perf_sample *sample,
-		 const char *backtrace __maybe_unused)
+		 struct perf_sample *sample)
 {
 	long fd = perf_sample__intval(sample, "fd");
 	return pid_begin_io_sample(tchart, sample->tid, IOTYPE_RX,
@@ -985,8 +970,7 @@ process_enter_rx(struct timechart *tchart,
 
 static int
 process_exit_rx(struct timechart *tchart,
-		struct perf_sample *sample,
-		const char *backtrace __maybe_unused)
+		struct perf_sample *sample)
 {
 	long ret = perf_sample__intval(sample, "ret");
 	return pid_end_io_sample(tchart, sample->tid, IOTYPE_RX,
@@ -995,8 +979,7 @@ process_exit_rx(struct timechart *tchart,
 
 static int
 process_enter_poll(struct timechart *tchart,
-		   struct perf_sample *sample,
-		   const char *backtrace __maybe_unused)
+		   struct perf_sample *sample)
 {
 	long fd = perf_sample__intval(sample, "fd");
 	return pid_begin_io_sample(tchart, sample->tid, IOTYPE_POLL,
@@ -1005,8 +988,7 @@ process_enter_poll(struct timechart *tchart,
 
 static int
 process_exit_poll(struct timechart *tchart,
-		  struct perf_sample *sample,
-		  const char *backtrace __maybe_unused)
+		  struct perf_sample *sample)
 {
 	long ret = perf_sample__intval(sample, "ret");
 	return pid_end_io_sample(tchart, sample->tid, IOTYPE_POLL,
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v3 2/8] perf timechart: Generate backtrace only if needed
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users
In-Reply-To: <20260624205852.3864874-1-namhyung@kernel.org>

The backtrace was used by sched-switch and sched-wakeup only.  No need
to call cat_backtrace() in the process_sample_event().  Let's pass NULL
backtrace and generate it from the sched events.

As it needs a pointer to the 'machine', let's save the session in the
timechart struct and use the host machine of the session instead of
passing the pointer to all handlers.  It should be fine to assume the
host machine as timechart command doesn't deal with guest machines and
there's no way to get tracepoints from the guest events.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-timechart.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index dafd361ecf9d8cd7..337ae547d008ea0b 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -56,6 +56,7 @@ struct timechart {
 	struct per_pid		*all_data;
 	struct power_event	*power_events;
 	struct wake_event	*wake_events;
+	struct perf_session	*session;
 	int			proc_num;
 	unsigned int		numcpus;
 	u64			min_freq,	/* Lowest CPU frequency seen */
@@ -578,7 +579,7 @@ typedef int (*tracepoint_handler)(struct timechart *tchart,
 static int process_sample_event(const struct perf_tool *tool,
 				union perf_event *event __maybe_unused,
 				struct perf_sample *sample,
-				struct machine *machine)
+				struct machine *machine __maybe_unused)
 {
 	struct timechart *tchart = container_of(tool, struct timechart, tool);
 	struct evsel *evsel = sample->evsel;
@@ -593,10 +594,8 @@ static int process_sample_event(const struct perf_tool *tool,
 
 	if (evsel->handler != NULL) {
 		tracepoint_handler f = evsel->handler;
-		char *backtrace = cat_backtrace(sample, machine);
 
-		ret = f(tchart, sample, backtrace);
-		free(backtrace);
+		ret = f(tchart, sample, NULL);
 	}
 
 	return ret;
@@ -656,7 +655,10 @@ process_sample_sched_wakeup(struct timechart *tchart,
 			 sample->file_offset, sample->cpu);
 		return -1;
 	}
+
+	backtrace = cat_backtrace(sample, &tchart->session->machines.host);
 	sched_wakeup(tchart, sample->cpu, sample->time, waker, wakee, flags, backtrace);
+	free((char *)backtrace);
 	return 0;
 }
 
@@ -675,8 +677,11 @@ process_sample_sched_switch(struct timechart *tchart,
 			 sample->file_offset, sample->cpu);
 		return -1;
 	}
+
+	backtrace = cat_backtrace(sample, &tchart->session->machines.host);
 	sched_switch(tchart, sample->cpu, sample->time, prev_pid, next_pid,
 		     prev_state, backtrace);
+	free((char *)backtrace);
 	return 0;
 }
 
@@ -1661,6 +1666,7 @@ static int __cmd_timechart(struct timechart *tchart, const char *output_name)
 	if (IS_ERR(session))
 		return PTR_ERR(session);
 
+	tchart->session = session;
 	symbol__init(perf_session__env(session));
 
 	(void)perf_header__process_sections(&session->header,
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v3 1/8] perf timechart: Don't pass @event to cat_backtrace()
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users
In-Reply-To: <20260624205852.3864874-1-namhyung@kernel.org>

The cat_backtrace() is only called from process_sample_event() which
means the event type is always PERF_RECORD_SAMPLE.  We don't need to
pass the event just because to print already known info.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-timechart.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 85a9ad0455aecccd..dafd361ecf9d8cd7 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -493,8 +493,7 @@ static void sched_switch(struct timechart *tchart, int cpu, u64 timestamp,
  * Returns a malloc'd backtrace string built via open_memstream, or NULL
  * on error.  Caller must free() the returned pointer.
  */
-static char *cat_backtrace(union perf_event *event,
-			   struct perf_sample *sample,
+static char *cat_backtrace(struct perf_sample *sample,
 			   struct machine *machine)
 {
 	struct addr_location al;
@@ -516,9 +515,8 @@ static char *cat_backtrace(union perf_event *event,
 		goto exit;
 
 	if (machine__resolve(machine, &al, sample) < 0) {
-		pr_err("problem processing %s (%u) event at offset %#" PRIx64 ", skipping it.\n",
-		       perf_event__name(event->header.type), event->header.type,
-		       sample->file_offset);
+		pr_err("problem processing SAMPLE (%u) event at offset %#" PRIx64 ", skipping it.\n",
+		       PERF_RECORD_SAMPLE, sample->file_offset);
 		goto exit;
 	}
 
@@ -578,7 +576,7 @@ typedef int (*tracepoint_handler)(struct timechart *tchart,
 				  const char *backtrace);
 
 static int process_sample_event(const struct perf_tool *tool,
-				union perf_event *event,
+				union perf_event *event __maybe_unused,
 				struct perf_sample *sample,
 				struct machine *machine)
 {
@@ -595,7 +593,7 @@ static int process_sample_event(const struct perf_tool *tool,
 
 	if (evsel->handler != NULL) {
 		tracepoint_handler f = evsel->handler;
-		char *backtrace = cat_backtrace(event, sample, machine);
+		char *backtrace = cat_backtrace(sample, machine);
 
 		ret = f(tchart, sample, backtrace);
 		free(backtrace);
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related

* [PATCH v3 0/8] perf timechart: Fix memory leaks
From: Namhyung Kim @ 2026-06-24 20:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, Jiri Olsa, Adrian Hunter, James Clark, Peter Zijlstra,
	Ingo Molnar, LKML, linux-perf-users

Hello,

This patchset fixes memory leaks in the perf timechart code.
I've extended and verified the timechart test with ASAN.

I'm sending it with my previous work on backtrace optimization.

v3)
 * combine small optimization series

v2)
 * use zfree() and explicit clear for potential misuse  (Arnaldo)
 * fix a wrong condition in the draw_wakeups()  (Sashiko)
 * move misplaced hunk in the second commit to the first

Thanks,
Namhyung


Namhyung Kim (8):
  perf timechart: Don't pass @event to cat_backtrace()
  perf timechart: Generate backtrace only if needed
  perf timechart: Remove unused backtrace in trace_handler
  perf timechart: Remove unnecessary copy of backtrace
  perf timechart: Release event samples at the end
  perf timechart: Fix memory leaks during record
  perf timechart: Fix memory leaks in draw_wakeups()
  perf test: Update perf timechart test

 tools/perf/builtin-timechart.c      | 212 +++++++++++++++++++---------
 tools/perf/tests/shell/timechart.sh |  24 ++--
 2 files changed, 158 insertions(+), 78 deletions(-)

-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply

* Re: [PATCH] perf symbols: ignore livepatch symbols
From: sashiko-bot @ 2026-06-24 20:26 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-perf-users
In-Reply-To: <20260624201254.472576-1-joe.lawrence@redhat.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Out-of-bounds write of a null terminator in kallsyms__parse() stack array.
- [High] Livepatch symbols (.klp.sym.*) corrupt module boundaries in kcore_copy, leading to silently omitted module memory and bogus mappings.
--

commit 2ebd0b4f62f44ca82022b5aeb4aaa75ac016e938
Author: Joe Lawrence <joe.lawrence@redhat.com>

    perf symbols: ignore livepatch symbols

    Livepatch modules contain special symbols (prefixed by ".klp.sym.") that
    act as relocation placeholders. Once resolved, they point to the same
    addresses as the original kernel symbols they reference. These special
    symbols confuse the 'vmlinux symtab matches kallsyms' perf test as
    kallsyms may report multiple symbols sharing a single kernel address.

    Link: https://docs.kernel.org/livepatch/module-elf-format.html#livepatch-symbols [1]

> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index fcaeeddbbb6b9..eeaed3f5cf50d 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -774,6 +774,10 @@ static int map__process_kallsym_symbol(void *arg, const char *name,
>  	if (name[0] == '$')
>  		return 0;
>  
> +	/* Ignore livepatch symbols */
> +	if (strstarts(name, ".klp.sym."))
> +		return 0;
> +

[Severity: High]
This isn't a bug introduced by this patch, but does kcore_copy__process_kallsyms()
need a similar filter for livepatch symbols?

Livepatch symbols contain a module name bracket (e.g., "[kpatch...]") but
their address points to the core kernel space. In symbol-elf.c, this appears
to trick the module boundary tracking:

tools/perf/util/symbol-elf.c:kcore_copy__process_kallsyms() {
    ...
	if (strchr(name, '[')) {
		if (!kci->first_module_symbol || start < kci->first_module_symbol)
			kci->first_module_symbol = start;
		if (start > kci->last_module_symbol)
			kci->last_module_symbol = start;
		return 0;
	}
    ...
}

Because the livepatch symbol address is in the core kernel range, does this
incorrectly lower kci->first_module_symbol to the core kernel address space
instead of the module vmalloc space, omitting module memory and creating a
bogus mapping?

[Severity: Critical]
This is a pre-existing issue, but while reviewing the kallsyms parsing I
noticed a potential out-of-bounds write when reading symbol names:

tools/lib/symbol/kallsyms.c:kallsyms__parse() {
    ...
		char symbol_name[KSYM_NAME_LEN + 1];

		[ ... ]

		for (i = 0; i < sizeof(symbol_name); i++) {
			ch = io__get_char(&io);
			if (ch < 0 || ch == '\n')
				break;
			symbol_name[i]  = ch;
		}
		symbol_name[i]  = '\0';
    ...
}

If a symbol name in kallsyms is longer than KSYM_NAME_LEN characters and
doesn't have a newline or EOF, the loop increments i up to sizeof(symbol_name)
(which is KSYM_NAME_LEN + 1). When the loop terminates without breaking, it
writes '\0' at symbol_name[sizeof(symbol_name)], which is one byte past the
end of the array.

Can this corrupt the stack if an untrusted user provides a crafted --kallsyms
file to perf?

>  	/*
>  	 * module symbols are not sorted so we add all
>  	 * symbols, setting length to 0, and rely on

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624201254.472576-1-joe.lawrence@redhat.com?part=1

^ permalink raw reply

* Re: perf script intel_pt decoding reading instructions from wrong shared lib
From: Namhyung Kim @ 2026-06-24 20:23 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Todd Lipcon, Stephane Eranian, Mi, Dapeng1, Ian Rogers,
	Arnaldo Carvalho de Melo, linux-perf-users
In-Reply-To: <c45fea55-b03d-40ac-a89f-8882b057fe5b@intel.com>

Hello,

On Tue, Jun 02, 2026 at 10:32:02PM +0300, Adrian Hunter wrote:
> On 15/04/2025 00:59, Todd Lipcon wrote:
> > Hi folks,
> > 
> > It seems there may be a bug in perf-script when processing intel_pt.
> > Specifically, in order to reconstruct the instructions, it maps
> > instruction pointers back to DSO-relative addresses, and then reads
> > from the associated ELF files to find the actual instructions that
> > were traced.
> > 
> > In the case that the ELF file has an associated separate-debug ELF
> > file linked via `.gnu-debuglink` (common in some Linux distros), it
> > appears to be reading the debuginfo file instead of the shared library
> > that actually contains the code.
> > 
> > I've posted a simple repro here:
> > https://github.com/toddlipcon/perf-debuglink-bug/blob/main/repro.sh
> > 
> > I haven't dug into the code yet enough to see where it's going wrong,
> > but figured I would report to the users list first.
> 
> Seems to be caused by commit 5363c306787c8 - see further below.
> Reverting that makes the issue go away.

Sorry for the trouble.

> 
> With that commit, dso__load() sets the binary_type to the first symbol
> source file type, which is DSO_BINARY_TYPE__DEBUGLINK in this case.
> That causes dso__get_filename() to get the file name of the debug file,
> not the file actually executed.
> 
> Leaving the binary_type as DSO_BINARY_TYPE__NOT_FOUND will result in
> try_to_open_dso() using either DSO_BINARY_TYPE__BUILD_ID_CACHE or
> DSO_BINARY_TYPE__SYSTEM_PATH_DSO.

Ok, I feel like we should split the dso access for the binary file (for
annotate or intel-pt) and debug symbols (symbol table or DWARF).
Probably we can use the binary_type and symtab_type respectively.

I'll take a look.

Thanks,
Namhyung

> 
> A quick workaround hack:
> 
> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index dd30f17cb1ca..09fd79ca1d48 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -1825,7 +1825,10 @@ int dso__load(struct dso *dso, struct map *map)
>  		if (next_slot) {
>  			ss_pos++;
>  
> -			if (dso__binary_type(dso) == DSO_BINARY_TYPE__NOT_FOUND)
> +			if (dso__binary_type(dso) == DSO_BINARY_TYPE__NOT_FOUND ||
> +			    symtab_type == DSO_BINARY_TYPE__BUILD_ID_CACHE ||
> +			    (symtab_type == DSO_BINARY_TYPE__SYSTEM_PATH_DSO &&
> +			     dso__binary_type(dso) != DSO_BINARY_TYPE__BUILD_ID_CACHE))
>  				dso__set_binary_type(dso, symtab_type);
>  
>  			if (syms_ss && runtime_ss)

^ permalink raw reply

* [PATCH] perf symbols: ignore livepatch symbols
From: Joe Lawrence @ 2026-06-24 20:12 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, live-patching

Livepatch modules contain special symbols (prefixed by ".klp.sym.") that
act as relocation placeholders.  Once resolved, they point to the same
addresses as the original kernel symbols they reference. [1]

These special symbols confuse the 'vmlinux symtab matches kallsyms' perf
test as kallsyms may report multiple symbols sharing a single kernel
address.  For example:

  kallsyms (without livepatch)
  ----------------------------
  ffffffff81a41110 T __pfx_arch_release_task_struct
> ffffffff81a41120 T arch_release_task_struct
  ffffffff81a41140 T __pfx_exit_thread
  ffffffff81a41150 T exit_thread

  kallsyms (with livepatch loaded)
  ---------------------------------
  ffffffff81a41110 T __pfx_arch_release_task_struct
> ffffffff81a41120 T arch_release_task_struct
  ffffffff81a41140 T __pfx_exit_thread
  ffffffff81a41150 T exit_thread
> ffffffff81a41120 w .klp.sym.vmlinux.arch_release_task_struct,0  [kpatch_5_14_0_570_94_1_1_3]

When perf loads kallsyms, both symbols are inserted into the
symbol table at the same address, corrupting symbol end-address
calculations and causing test failures.

Filter out symbols prefixed with ".klp.sym." when loading kallsyms, as
they alias existing kernel symbols.

Link: https://docs.kernel.org/livepatch/module-elf-format.html#livepatch-symbols [1]
Reported-and-tested-by: Ben Procknow <bprockno@redhat.com>
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
---
 tools/perf/util/symbol.c | 4 ++++
 1 file changed, 4 insertions(+)

From Ben's tests:

  Without patch:
  ./tools/perf/perf test 1
   1: vmlinux symtab matches kallsyms                                 : FAILED!

  With patch:
  ./tools/perf/perf test 1
   1: vmlinux symtab matches kallsyms                                 : Ok

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index fcaeeddbbb6b..eeaed3f5cf50 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -774,6 +774,10 @@ static int map__process_kallsym_symbol(void *arg, const char *name,
 	if (name[0] == '$')
 		return 0;

+	/* Ignore livepatch symbols */
+	if (strstarts(name, ".klp.sym."))
+		return 0;
+
 	/*
 	 * module symbols are not sorted so we add all
 	 * symbols, setting length to 0, and rely on
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v1 0/3] perf pmu: Add tool-provided NVMe PMU
From: Namhyung Kim @ 2026-06-24 19:58 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Adrian Hunter, James Clark, Tomas Glozar, Michael Jeanson,
	Dmitrii Dolgov, Alexandre Chartre, Yuzhuo Jing, Leo Yan,
	German Gomez, Anubhav Shelat, linux-kernel, linux-perf-users,
	linux-nvme
In-Reply-To: <20260609070348.541964-1-irogers@google.com>

On Tue, Jun 09, 2026 at 12:03:45AM -0700, Ian Rogers wrote:
> This patch series introduces a tool-provided PMU for NVMe devices to expose
> various hardware logs as perf events.
> 
> Using libnvme, the new nvme_pmu extracts metrics from standard NVMe log pages
> including:
> - SMART / Health Information (e.g. data units read/written, power cycles,
>   temperature)
> - Endurance Group Information
> - FDP (Flexible Data Placement) Statistics
> - Error Information
> - Zoned Namespaces (ZNS) Changed Zones
> 
> Key features:
> - Dynamic probing of supported log pages on /dev/nvmeX. Unsupported events
>   are marked as deprecated and hidden from `perf list` by default.
> - Generic configuration encoding where the log type, size, and offset are
>   encoded directly into the event configuration.
> - Correct interval delta tracking using baseline offset snapshots. This mimics
>   the behavior of the hwmon PMU, allowing gauges like Temperature to be
>   reported instantaneously in `perf stat -I`, while standard counters
>   correctly accumulate deltas.
> - Robust handling of cross-architecture endianness when reading 128-bit and
>   64-bit fields from the NVMe specification.

Can you please show actual output so that we can understand what you're
doing?

Thanks,
Namhyung

> 
> The series is broken down into:
> 1. Adding the libnvme feature detection via pkg-config in tools/build.
> 2. The core implementation of the NVMe tool PMU.
> 3. A unit test suite verifying the parsing and encoding of the NVMe PMU
>    events using a mocked device.
> 
> Ian Rogers (3):
>   perf build: Add libnvme feature detection
>   perf pmu: Implement tool-provided NVMe PMU
>   perf tests: Add NVMe PMU event parsing test
> 
>  tools/build/Makefile.feature       |   2 +
>  tools/build/feature/Makefile       |   3 +
>  tools/build/feature/test-libnvme.c |  11 +
>  tools/perf/Makefile.config         |  12 +
>  tools/perf/tests/Build             |   1 +
>  tools/perf/tests/builtin-test.c    |   1 +
>  tools/perf/tests/nvme_pmu.c        | 176 ++++++++++
>  tools/perf/tests/tests.h           |   1 +
>  tools/perf/util/Build              |   1 +
>  tools/perf/util/evsel.c            |  11 +-
>  tools/perf/util/nvme_pmu.c         | 535 +++++++++++++++++++++++++++++
>  tools/perf/util/nvme_pmu.h         | 143 ++++++++
>  tools/perf/util/pmu.c              |  15 +
>  tools/perf/util/pmu.h              |   8 +-
>  tools/perf/util/pmus.c             |  31 +-
>  tools/perf/util/pmus.h             |   1 +
>  16 files changed, 945 insertions(+), 7 deletions(-)
>  create mode 100644 tools/build/feature/test-libnvme.c
>  create mode 100644 tools/perf/tests/nvme_pmu.c
>  create mode 100644 tools/perf/util/nvme_pmu.c
>  create mode 100644 tools/perf/util/nvme_pmu.h
> 
> -- 
> 2.54.0.1064.gd145956f57-goog
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox