public inbox for bpf@vger.kernel.org
From: Namhyung Kim <namhyung@kernel.org>
To: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Cc: sashiko@lists.linux.dev, bpf@vger.kernel.org,
	linux-perf-users@vger.kernel.org
Subject: Re: [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation
Date: Wed, 22 Apr 2026 18:11:59 -0700	[thread overview]
Message-ID: <aelx3_VdT_PfsNjd@google.com> (raw)
In-Reply-To: <CAO9wTFiFRSRjMRSAA+vV-AF1rJQGdoCK_p6TXvQ+V1ToaqMs0w@mail.gmail.com>

On Tue, Apr 21, 2026 at 01:59:00PM +0530, Suchit Karunakaran wrote:
> On Tue, 21 Apr 2026 at 06:15, <sashiko-bot@kernel.org> wrote:
> >
> > Sashiko AI review found 2 potential issue(s):
> > - [High] Incomplete spinlock and rwlock contentions are silently dropped for LOCK_AGGR_CGROUP because account_end_timestamp() skips the per-CPU tstamp map.
> > - [Medium] Inconsistent cgroup attribution between complete and incomplete lock contentions when a task migrates cgroups.
> > --
> >
> > commit 7ba25dc647534f97eb1ae9f34c6c8d38e7fef074
> > Author: Suchit Karunakaran <suchitkarunakaran@gmail.com>
> >
> > perf/lock: enable end-timestamp accounting for cgroup aggregation
> >
> > This commit adds a cgroup_id field to struct tstamp_data to properly account
> > for incomplete lock contention events when aggregating by cgroup. It records
> > the cgroup ID at contention_begin and uses it to build the aggregation key
> > in userspace.
> >
> > > diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
> > > index cbd7435579fea..1a5bd2ff8ee4c 100644
> > > --- a/tools/perf/util/bpf_lock_contention.c
> > > +++ b/tools/perf/util/bpf_lock_contention.c
> > > @@ -463,8 +463,8 @@ static void update_lock_stat(int map_fd, int pid, u64 end_ts,
> > >               stat_key.lock_addr_or_cgroup = ts_data->lock;
> > >               break;
> > >       case LOCK_AGGR_CGROUP:
> > > -             /* TODO */
> > > -             return;
> > > +             stat_key.lock_addr_or_cgroup = ts_data->cgroup_id;
> > > +             break;
> >
> > Does this fully enable end-timestamp accounting for cgroup aggregation
> > on per-CPU locks like spinlocks and rwlocks?
> >
> > Looking at account_end_timestamp() in tools/perf/util/bpf_lock_contention.c,
> > there is an early return that skips the per-CPU tstamp map:
> >
> > static void account_end_timestamp(struct lock_contention *con)
> > {
> >         ...
> >         /* Now it'll check per-cpu tstamp map which doesn't have TID. */
> >         if (aggr_mode == LOCK_AGGR_TASK || aggr_mode == LOCK_AGGR_CGROUP)
> >                 return;
> >
> >         total_cpus = cpu__max_cpu().cpu;
> >         ts_fd = bpf_map__fd(skel->maps.tstamp_cpu);
> >         ...
> > }
> >
> > Since account_end_timestamp() skips iterating over skel->maps.tstamp_cpu
> > when aggr_mode is LOCK_AGGR_CGROUP, are all incomplete contentions for
> > spinlocks and rwlocks still silently dropped?
> >
> 
> As per my understanding, the tstamp_cpu map stores per-CPU entries
> without a TID, so there is no task context available to derive a
> cgroup ID. Unlike the per-task tstamp map, where cgroup_id is
> captured at contention_begin time, per-CPU entries represent
> spinlock/rwlock contentions that are inherently CPU-bound and not
> attributed to a single task at contention_end time either. The
> original LOCK_AGGR_TASK mode has the same early return for the same
> reason: cgroup attribution without a task context is not meaningful.
> This patch addresses only the per-task straggler case.

Right, it'll miss spinlocks and rwlocks, but I think that's ok.

> 
> > [ ... ]
> >
> > > diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > index 96e7d853b9edf..d0e2cad02fa36 100644
> > > --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > @@ -536,6 +536,8 @@ int contention_begin(u64 *ctx)
> > >       pelem->timestamp = bpf_ktime_get_ns();
> > >       pelem->lock = (__u64)ctx[0];
> > >       pelem->flags = (__u32)ctx[1];
> > > +     if (aggr_mode == LOCK_AGGR_CGROUP)
> > > +             pelem->cgroup_id = get_current_cgroup_id();
> >
> > If a task migrates to a different cgroup while blocked on a lock, could this
> > lead to inconsistent attribution between complete and incomplete contentions?
> >
> > The cgroup ID is statically recorded here at contention_begin() and used by
> > userspace for incomplete events.
> >
> > However, contention_end() continues to dynamically resolve the cgroup ID:
> >
> > SEC("tp_btf/contention_end")
> > int contention_end(u64 *ctx)
> > {
> >         ...
> >         case LOCK_AGGR_CGROUP:
> >                 key.lock_addr_or_cgroup = get_current_cgroup_id();
> >                 break;
> >         ...
> > }
> >
> > Would it be more consistent for contention_end() to also use the statically
> > recorded pelem->cgroup_id, ensuring that both completed and incomplete events
> > attribute the contention time to the exact same cgroup?
> >
> 
> I think the fix should go in the opposite direction: instead of
> calling get_current_cgroup_id() at contention_end, we should use
> pelem->cgroup_id. Semantically, the cost of lock contention is
> incurred by the task that had to wait for the lock, so it makes more
> sense to attribute that cost to the cgroup at the time the wait
> started (contention_begin), rather than to whatever cgroup happens to
> be current at the end.
> I'm not entirely certain about this, so I'd appreciate input from
> maintainers and reviewers on whether this attribution makes sense.

Agreed, as I replied earlier we should use pelem->cgroup_id.

The tstamp_data is saved per-thread, so it doesn't matter which thread
calls this.  I think the concern is about when to record it - IIRC I put
it in _end() because that may reduce the overhead when some filters are
set.

But it seems this code already runs after the filter checks, so it
should be fine to do it in _begin().  In fact _begin() may be the better
choice: at that point the thread is about to wait for the lock and is
doing nothing critical, while at _end() it's about to enter the critical
section.

Thanks,
Namhyung



Thread overview: 5+ messages
2026-04-20 18:46 [PATCH] perf/lock: enable end-timestamp accounting for cgroup aggregation Suchit Karunakaran
2026-04-21  0:45 ` sashiko-bot
2026-04-21  8:29   ` Suchit Karunakaran
2026-04-23  1:11     ` Namhyung Kim [this message]
2026-04-22 21:28 ` Namhyung Kim
