* [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation
@ 2026-04-20 18:46 Suchit Karunakaran
2026-04-21 0:45 ` sashiko-bot
2026-04-22 21:28 ` Namhyung Kim
0 siblings, 2 replies; 5+ messages in thread
From: Suchit Karunakaran @ 2026-04-20 18:46 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung
Cc: mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
james.clark, tycho, suchitkarunakaran, linux-perf-users,
linux-kernel, bpf
update_lock_stat() handles lock contentions that start but never reach a
contention_end event (e.g., locks still held when profiling stops), but
previously treated LOCK_AGGR_CGROUP as a no-op due to missing cgroup
context in userspace; fix this by adding a cgroup_id field to
struct tstamp_data, recording it at contention_begin using
get_current_cgroup_id() when aggr_mode == LOCK_AGGR_CGROUP, and using
ts_data->cgroup_id to build the aggregation key in update_lock_stat(),
matching the contention_end behavior in BPF and ensuring correct
attribution of incomplete events.
Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
---
tools/perf/util/bpf_lock_contention.c | 4 ++--
tools/perf/util/bpf_skel/lock_contention.bpf.c | 2 ++
tools/perf/util/bpf_skel/lock_data.h | 1 +
3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index cbd7435579fe..1a5bd2ff8ee4 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -463,8 +463,8 @@ static void update_lock_stat(int map_fd, int pid, u64 end_ts,
stat_key.lock_addr_or_cgroup = ts_data->lock;
break;
case LOCK_AGGR_CGROUP:
- /* TODO */
- return;
+ stat_key.lock_addr_or_cgroup = ts_data->cgroup_id;
+ break;
default:
return;
}
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 96e7d853b9ed..d0e2cad02fa3 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -536,6 +536,8 @@ int contention_begin(u64 *ctx)
pelem->timestamp = bpf_ktime_get_ns();
pelem->lock = (__u64)ctx[0];
pelem->flags = (__u32)ctx[1];
+ if (aggr_mode == LOCK_AGGR_CGROUP)
+ pelem->cgroup_id = get_current_cgroup_id();
if (needs_callstack) {
u32 i = 0;
diff --git a/tools/perf/util/bpf_skel/lock_data.h b/tools/perf/util/bpf_skel/lock_data.h
index 28c5e5aced7f..652e114e6b87 100644
--- a/tools/perf/util/bpf_skel/lock_data.h
+++ b/tools/perf/util/bpf_skel/lock_data.h
@@ -13,6 +13,7 @@ struct owner_tracing_data {
struct tstamp_data {
u64 timestamp;
u64 lock;
+ u64 cgroup_id;
u32 flags;
s32 stack_id;
};
--
2.53.0
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation
2026-04-20 18:46 [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation Suchit Karunakaran
@ 2026-04-21 0:45 ` sashiko-bot
2026-04-21 8:29 ` Suchit Karunakaran
2026-04-22 21:28 ` Namhyung Kim
1 sibling, 1 reply; 5+ messages in thread
From: sashiko-bot @ 2026-04-21 0:45 UTC (permalink / raw)
To: Suchit Karunakaran; +Cc: bpf, linux-perf-users
Sashiko AI review found 2 potential issue(s):
- [High] Incomplete spinlock and rwlock contentions are silently dropped for LOCK_AGGR_CGROUP because account_end_timestamp() skips the per-CPU tstamp map.
- [Medium] Inconsistent cgroup attribution between complete and incomplete lock contentions when a task migrates cgroups.
--
commit 7ba25dc647534f97eb1ae9f34c6c8d38e7fef074
Author: Suchit Karunakaran <suchitkarunakaran@gmail.com>
perf/lock: enable end-timestamp accounting for cgroup aggregation
This commit adds a cgroup_id field to struct tstamp_data to properly account
for incomplete lock contention events when aggregating by cgroup. It records
the cgroup ID at contention_begin and uses it to build the aggregation key
in userspace.
> diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
> index cbd7435579fea..1a5bd2ff8ee4c 100644
> --- a/tools/perf/util/bpf_lock_contention.c
> +++ b/tools/perf/util/bpf_lock_contention.c
> @@ -463,8 +463,8 @@ static void update_lock_stat(int map_fd, int pid, u64 end_ts,
> stat_key.lock_addr_or_cgroup = ts_data->lock;
> break;
> case LOCK_AGGR_CGROUP:
> - /* TODO */
> - return;
> + stat_key.lock_addr_or_cgroup = ts_data->cgroup_id;
> + break;
Does this fully enable end-timestamp accounting for cgroup aggregation
on per-CPU locks like spinlocks and rwlocks?
Looking at account_end_timestamp() in tools/perf/util/bpf_lock_contention.c,
there is an early return that skips the per-CPU tstamp map:
static void account_end_timestamp(struct lock_contention *con)
{
...
/* Now it'll check per-cpu tstamp map which doesn't have TID. */
if (aggr_mode == LOCK_AGGR_TASK || aggr_mode == LOCK_AGGR_CGROUP)
return;
total_cpus = cpu__max_cpu().cpu;
ts_fd = bpf_map__fd(skel->maps.tstamp_cpu);
...
}
Since account_end_timestamp() skips iterating over skel->maps.tstamp_cpu
when aggr_mode is LOCK_AGGR_CGROUP, are all incomplete contentions for
spinlocks and rwlocks still silently dropped?
[ ... ]
> diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> index 96e7d853b9edf..d0e2cad02fa36 100644
> --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> @@ -536,6 +536,8 @@ int contention_begin(u64 *ctx)
> pelem->timestamp = bpf_ktime_get_ns();
> pelem->lock = (__u64)ctx[0];
> pelem->flags = (__u32)ctx[1];
> + if (aggr_mode == LOCK_AGGR_CGROUP)
> + pelem->cgroup_id = get_current_cgroup_id();
If a task migrates to a different cgroup while blocked on a lock, could this
lead to inconsistent attribution between complete and incomplete contentions?
The cgroup ID is statically recorded here at contention_begin() and used by
userspace for incomplete events.
However, contention_end() continues to dynamically resolve the cgroup ID:
SEC("tp_btf/contention_end")
int contention_end(u64 *ctx)
{
...
case LOCK_AGGR_CGROUP:
key.lock_addr_or_cgroup = get_current_cgroup_id();
break;
...
}
Would it be more consistent for contention_end() to also use the statically
recorded pelem->cgroup_id, ensuring that both completed and incomplete events
attribute the contention time to the exact same cgroup?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260420184656.331306-1-suchitkarunakaran@gmail.com?part=1
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation
2026-04-21 0:45 ` sashiko-bot
@ 2026-04-21 8:29 ` Suchit Karunakaran
2026-04-23 1:11 ` Namhyung Kim
0 siblings, 1 reply; 5+ messages in thread
From: Suchit Karunakaran @ 2026-04-21 8:29 UTC (permalink / raw)
To: sashiko; +Cc: bpf, linux-perf-users
On Tue, 21 Apr 2026 at 06:15, <sashiko-bot@kernel.org> wrote:
>
> Sashiko AI review found 2 potential issue(s):
> - [High] Incomplete spinlock and rwlock contentions are silently dropped for LOCK_AGGR_CGROUP because account_end_timestamp() skips the per-CPU tstamp map.
> - [Medium] Inconsistent cgroup attribution between complete and incomplete lock contentions when a task migrates cgroups.
> --
>
> commit 7ba25dc647534f97eb1ae9f34c6c8d38e7fef074
> Author: Suchit Karunakaran <suchitkarunakaran@gmail.com>
>
> perf/lock: enable end-timestamp accounting for cgroup aggregation
>
> This commit adds a cgroup_id field to struct tstamp_data to properly account
> for incomplete lock contention events when aggregating by cgroup. It records
> the cgroup ID at contention_begin and uses it to build the aggregation key
> in userspace.
>
> > diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
> > index cbd7435579fea..1a5bd2ff8ee4c 100644
> > --- a/tools/perf/util/bpf_lock_contention.c
> > +++ b/tools/perf/util/bpf_lock_contention.c
> > @@ -463,8 +463,8 @@ static void update_lock_stat(int map_fd, int pid, u64 end_ts,
> > stat_key.lock_addr_or_cgroup = ts_data->lock;
> > break;
> > case LOCK_AGGR_CGROUP:
> > - /* TODO */
> > - return;
> > + stat_key.lock_addr_or_cgroup = ts_data->cgroup_id;
> > + break;
>
> Does this fully enable end-timestamp accounting for cgroup aggregation
> on per-CPU locks like spinlocks and rwlocks?
>
> Looking at account_end_timestamp() in tools/perf/util/bpf_lock_contention.c,
> there is an early return that skips the per-CPU tstamp map:
>
> static void account_end_timestamp(struct lock_contention *con)
> {
> ...
> /* Now it'll check per-cpu tstamp map which doesn't have TID. */
> if (aggr_mode == LOCK_AGGR_TASK || aggr_mode == LOCK_AGGR_CGROUP)
> return;
>
> total_cpus = cpu__max_cpu().cpu;
> ts_fd = bpf_map__fd(skel->maps.tstamp_cpu);
> ...
> }
>
> Since account_end_timestamp() skips iterating over skel->maps.tstamp_cpu
> when aggr_mode is LOCK_AGGR_CGROUP, are all incomplete contentions for
> spinlocks and rwlocks still silently dropped?
>
As per my understanding the tstamp_cpu map stores per-cpu entries
without a TID, so there is no current task context available to derive
a cgroup ID. Unlike the per-task tstamp map where cgroup_id is
captured at contention_begin time, per-cpu entries represent
spinlock/rwlock contentions that are inherently CPU-bound and not
attributed to a single task at contention_end time either. The
original LOCK_AGGR_TASK mode has the same early return for the same
reason, cgroup attribution without a task context is not meaningful.
This patch addresses only the per-task straggler case.
> [ ... ]
>
> > diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > index 96e7d853b9edf..d0e2cad02fa36 100644
> > --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > @@ -536,6 +536,8 @@ int contention_begin(u64 *ctx)
> > pelem->timestamp = bpf_ktime_get_ns();
> > pelem->lock = (__u64)ctx[0];
> > pelem->flags = (__u32)ctx[1];
> > + if (aggr_mode == LOCK_AGGR_CGROUP)
> > + pelem->cgroup_id = get_current_cgroup_id();
>
> If a task migrates to a different cgroup while blocked on a lock, could this
> lead to inconsistent attribution between complete and incomplete contentions?
>
> The cgroup ID is statically recorded here at contention_begin() and used by
> userspace for incomplete events.
>
> However, contention_end() continues to dynamically resolve the cgroup ID:
>
> SEC("tp_btf/contention_end")
> int contention_end(u64 *ctx)
> {
> ...
> case LOCK_AGGR_CGROUP:
> key.lock_addr_or_cgroup = get_current_cgroup_id();
> break;
> ...
> }
>
> Would it be more consistent for contention_end() to also use the statically
> recorded pelem->cgroup_id, ensuring that both completed and incomplete events
> attribute the contention time to the exact same cgroup?
>
I think the fix should go in the opposite direction. Instead of
calling get_current_cgroup_id() at contention_end, I think we should
be using pelem->cgroup_id. Semantically, the cost of lock contention
is incurred by the task that had to wait for the lock. So it makes
more sense to attribute that cost based on the cgroup at the time the
wait started (contention_begin), rather than whatever happens to be
current at the end.
I’m not entirely certain about this, so I’d really appreciate any
input from maintainers and reviewers on whether this attribution makes
sense.
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260420184656.331306-1-suchitkarunakaran@gmail.com?part=1
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation
2026-04-21 8:29 ` Suchit Karunakaran
@ 2026-04-23 1:11 ` Namhyung Kim
0 siblings, 0 replies; 5+ messages in thread
From: Namhyung Kim @ 2026-04-23 1:11 UTC (permalink / raw)
To: Suchit Karunakaran; +Cc: sashiko, bpf, linux-perf-users
On Tue, Apr 21, 2026 at 01:59:00PM +0530, Suchit Karunakaran wrote:
> On Tue, 21 Apr 2026 at 06:15, <sashiko-bot@kernel.org> wrote:
> >
> > Sashiko AI review found 2 potential issue(s):
> > - [High] Incomplete spinlock and rwlock contentions are silently dropped for LOCK_AGGR_CGROUP because account_end_timestamp() skips the per-CPU tstamp map.
> > - [Medium] Inconsistent cgroup attribution between complete and incomplete lock contentions when a task migrates cgroups.
> > --
> >
> > commit 7ba25dc647534f97eb1ae9f34c6c8d38e7fef074
> > Author: Suchit Karunakaran <suchitkarunakaran@gmail.com>
> >
> > perf/lock: enable end-timestamp accounting for cgroup aggregation
> >
> > This commit adds a cgroup_id field to struct tstamp_data to properly account
> > for incomplete lock contention events when aggregating by cgroup. It records
> > the cgroup ID at contention_begin and uses it to build the aggregation key
> > in userspace.
> >
> > > diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
> > > index cbd7435579fea..1a5bd2ff8ee4c 100644
> > > --- a/tools/perf/util/bpf_lock_contention.c
> > > +++ b/tools/perf/util/bpf_lock_contention.c
> > > @@ -463,8 +463,8 @@ static void update_lock_stat(int map_fd, int pid, u64 end_ts,
> > > stat_key.lock_addr_or_cgroup = ts_data->lock;
> > > break;
> > > case LOCK_AGGR_CGROUP:
> > > - /* TODO */
> > > - return;
> > > + stat_key.lock_addr_or_cgroup = ts_data->cgroup_id;
> > > + break;
> >
> > Does this fully enable end-timestamp accounting for cgroup aggregation
> > on per-CPU locks like spinlocks and rwlocks?
> >
> > Looking at account_end_timestamp() in tools/perf/util/bpf_lock_contention.c,
> > there is an early return that skips the per-CPU tstamp map:
> >
> > static void account_end_timestamp(struct lock_contention *con)
> > {
> > ...
> > /* Now it'll check per-cpu tstamp map which doesn't have TID. */
> > if (aggr_mode == LOCK_AGGR_TASK || aggr_mode == LOCK_AGGR_CGROUP)
> > return;
> >
> > total_cpus = cpu__max_cpu().cpu;
> > ts_fd = bpf_map__fd(skel->maps.tstamp_cpu);
> > ...
> > }
> >
> > Since account_end_timestamp() skips iterating over skel->maps.tstamp_cpu
> > when aggr_mode is LOCK_AGGR_CGROUP, are all incomplete contentions for
> > spinlocks and rwlocks still silently dropped?
> >
>
> As per my understanding the tstamp_cpu map stores per-cpu entries
> without a TID, so there is no current task context available to derive
> a cgroup ID. Unlike the per-task tstamp map where cgroup_id is
> captured at contention_begin time, per-cpu entries represent
> spinlock/rwlock contentions that are inherently CPU-bound and not
> attributed to a single task at contention_end time either. The
> original LOCK_AGGR_TASK mode has the same early return for the same
> reason, cgroup attribution without a task context is not meaningful.
> This patch addresses only the per-task straggler case.
Right, it'll miss spinlocks and rwlocks but I think it's ok.
>
> > [ ... ]
> >
> > > diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > index 96e7d853b9edf..d0e2cad02fa36 100644
> > > --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > @@ -536,6 +536,8 @@ int contention_begin(u64 *ctx)
> > > pelem->timestamp = bpf_ktime_get_ns();
> > > pelem->lock = (__u64)ctx[0];
> > > pelem->flags = (__u32)ctx[1];
> > > + if (aggr_mode == LOCK_AGGR_CGROUP)
> > > + pelem->cgroup_id = get_current_cgroup_id();
> >
> > If a task migrates to a different cgroup while blocked on a lock, could this
> > lead to inconsistent attribution between complete and incomplete contentions?
> >
> > The cgroup ID is statically recorded here at contention_begin() and used by
> > userspace for incomplete events.
> >
> > However, contention_end() continues to dynamically resolve the cgroup ID:
> >
> > SEC("tp_btf/contention_end")
> > int contention_end(u64 *ctx)
> > {
> > ...
> > case LOCK_AGGR_CGROUP:
> > key.lock_addr_or_cgroup = get_current_cgroup_id();
> > break;
> > ...
> > }
> >
> > Would it be more consistent for contention_end() to also use the statically
> > recorded pelem->cgroup_id, ensuring that both completed and incomplete events
> > attribute the contention time to the exact same cgroup?
> >
>
> I think the fix should go in the opposite direction. Instead of
> calling get_current_cgroup_id() at contention_end, I think we should
> be using pelem->cgroup_id. Semantically, the cost of lock contention
> is incurred by the task that had to wait for the lock. So it makes
> more sense to attribute that cost based on the cgroup at the time the
> wait started (contention_begin), rather than whatever happens to be
> current at the end.
> I’m not entirely certain about this, so I’d really appreciate any
> input from maintainers and reviewers on whether this attribution makes
> sense.
Agreed, as I replied earlier we should use pelem->cgroup_id.
The tstamp_data is saved per-thread so it doesn't matter which thread is
calling this. I think the concern is when - IIRC I put it at _end()
because it may reduce the overhead when we have some filters.
But it seems it's already after checking the filters. Then it should be
ok to do in _begin(). Maybe it's a better choice doing it in _begin()
since the thread is about to go to wait for the lock and doing nothing
critical. While at the _end(), it'll start the critical section soon.
Thanks,
Namhyung
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation
2026-04-20 18:46 [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation Suchit Karunakaran
2026-04-21 0:45 ` sashiko-bot
@ 2026-04-22 21:28 ` Namhyung Kim
1 sibling, 0 replies; 5+ messages in thread
From: Namhyung Kim @ 2026-04-22 21:28 UTC (permalink / raw)
To: Suchit Karunakaran
Cc: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
irogers, adrian.hunter, james.clark, tycho, linux-perf-users,
linux-kernel, bpf
On Tue, Apr 21, 2026 at 12:16:56AM +0530, Suchit Karunakaran wrote:
> update_lock_stat() handles lock contentions that start but never reach a
> contention_end event (e.g., locks still held when profiling stops), but
> previously treated LOCK_AGGR_CGROUP as a no-op due to missing cgroup
> context in userspace; fix this by adding a cgroup_id field to
> struct tstamp_data, recording it at contention_begin using
> get_current_cgroup_id() when aggr_mode == LOCK_AGGR_CGROUP, and using
> ts_data->cgroup_id to build the aggregation key in update_lock_stat(),
> matching the contention_end behavior in BPF and ensuring correct
> attribution of incomplete events.
>
> Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
> ---
> tools/perf/util/bpf_lock_contention.c | 4 ++--
> tools/perf/util/bpf_skel/lock_contention.bpf.c | 2 ++
> tools/perf/util/bpf_skel/lock_data.h | 1 +
> 3 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
> index cbd7435579fe..1a5bd2ff8ee4 100644
> --- a/tools/perf/util/bpf_lock_contention.c
> +++ b/tools/perf/util/bpf_lock_contention.c
> @@ -463,8 +463,8 @@ static void update_lock_stat(int map_fd, int pid, u64 end_ts,
> stat_key.lock_addr_or_cgroup = ts_data->lock;
> break;
> case LOCK_AGGR_CGROUP:
> - /* TODO */
> - return;
> + stat_key.lock_addr_or_cgroup = ts_data->cgroup_id;
> + break;
> default:
> return;
> }
> diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> index 96e7d853b9ed..d0e2cad02fa3 100644
> --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> @@ -536,6 +536,8 @@ int contention_begin(u64 *ctx)
> pelem->timestamp = bpf_ktime_get_ns();
> pelem->lock = (__u64)ctx[0];
> pelem->flags = (__u32)ctx[1];
> + if (aggr_mode == LOCK_AGGR_CGROUP)
> + pelem->cgroup_id = get_current_cgroup_id();
If we do this, we can use it in contention_end() too.
Thanks,
Namhyung
>
> if (needs_callstack) {
> u32 i = 0;
> diff --git a/tools/perf/util/bpf_skel/lock_data.h b/tools/perf/util/bpf_skel/lock_data.h
> index 28c5e5aced7f..652e114e6b87 100644
> --- a/tools/perf/util/bpf_skel/lock_data.h
> +++ b/tools/perf/util/bpf_skel/lock_data.h
> @@ -13,6 +13,7 @@ struct owner_tracing_data {
> struct tstamp_data {
> u64 timestamp;
> u64 lock;
> + u64 cgroup_id;
> u32 flags;
> s32 stack_id;
> };
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-04-23 1:12 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-20 18:46 [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation Suchit Karunakaran
2026-04-21 0:45 ` sashiko-bot
2026-04-21 8:29 ` Suchit Karunakaran
2026-04-23 1:11 ` Namhyung Kim
2026-04-22 21:28 ` Namhyung Kim
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox